arXiv Daily Report 2026-05-30



ArXiv Research Report

Generated: 2026-05-30 00:12:18 | Passing score: 27.8

Total: 370 | Qualified: (not captured) | Analyzed: (not captured) | Pass Rate: 22%

Papers

1. Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments [PASS]

Score: 75.0 / 27.8

Authors: Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen

Published: 2026-05-28

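TL;DR: Qwen-VLA is a unified vision-language-action foundation model that uses a DiT-based action decoder to support embodied intelligence across tasks, environments, and robot embodiments, and it shows strong multi-task performance and generalization.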

摘要翻译

具身智能通常通过针对操作或导航等单个任务的专用模型进行研究，导致能力碎片化，且在任务、环境和机器人形态上的泛化能力受限。在这项工作中，我们研究异构具身决策问题是否能够在单个视觉 - 语言 - 动作模型（Vision-Language-Action, VLA）中统一。我们提出了 Qwen-VLA，一个统一的具身基础模型，该模型通过基于 DiT 的动作解码器，将 Qwen 的视觉 - 语言建模堆栈从感知、理解和推理扩展到连续动作和轨迹生成。Qwen-VLA 采用大规模联合预训练方案，在多样数据源上进行训练，包括机器人操作轨迹、人类第一人称视角演示、合成仿真数据、视觉 - 语言导航数据、以轨迹为中心的监督以及辅助视觉 - 语言数据。为了支持多种机器人平台，我们引入了具身感知的提示调节（embodiment-aware prompt conditioning），其中机器人特定的文本描述指定了当前的具身状态和控制约定。我们进一步将操作、导航和轨迹预测纳入统一的动作 - 轨迹预测框架，实现了跨机器人形态、任务家族和环境的可迁移视觉 grounding、空间推理及连续动作生成。在操作、导航及以轨迹为中心的基准测试上的实验表明，在场景布局、背景、光照、物体配置和机器人形态变化下，该方法具有一致的多任务性能和分布外泛化能力。Qwen-VLA-Instruct 在 LIBERO 上达到 97.9%，在 Simpler-WidowX 上达到 73.7%，在 RoboTwin-Easy/Hard 上分别达到 86.1%/87.2%，在 R2R 上 OSR 为 69.0%，在 RxR 上 SR 为 59.6%，在真实世界 ALOHA 实验中平均 OOD 成功率为 76.9%，在 DOMINO 动态操作中零样本成功率为 26.6%。

Abstract

Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	10.0/10	15.0
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	6.0/10	9.0

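Scoring rationale: The paper's core contribution is unified vision-language-action modeling (Unify Models, MultiModal, MLLM), and both the title and the abstract emphasize unification. Building on the Qwen architecture implies a tokenizer and a visual encoder, but neither is presented as a main contribution. The model generates robot actions and predicts trajectories, which relates it to model-based RL and World Models to some degree, but it is positioned as a unified foundation model rather than a specific RL algorithm or world-model framework.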
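The score table appears to follow a simple rule: each keyword's score is its weight times its relevance, and the paper passes when the sum reaches the passing score. The sketch below reproduces the 75.0 total for this paper under that reading. The `KeywordScore` class and the `>=` pass rule are assumptions inferred from the table, not taken from the report generator's code.

```python
from dataclasses import dataclass

PASSING_SCORE = 27.8  # "Passing score" from the report header


@dataclass
class KeywordScore:
    keyword: str
    weight: float
    relevance: float  # on a 0-10 scale

    @property
    def score(self) -> float:
        # Assumed rule: score = weight * relevance
        return self.weight * self.relevance


rows = [
    KeywordScore("Unify Models", 1.5, 10.0),
    KeywordScore("Tokenizer", 1.5, 5.0),
    KeywordScore("Visual Encoder", 1.5, 5.0),
    KeywordScore("World Models", 1.5, 5.0),
    KeywordScore("MLLM", 1.5, 9.0),
    KeywordScore("MultiModal", 1.5, 10.0),
    KeywordScore("model-based RL", 1.5, 6.0),
]

total = sum(row.score for row in rows)
passed = total >= PASSING_SCORE  # assumed comparison
print(f"{total} / {PASSING_SCORE} -> {'PASS' if passed else 'FAIL'}")
```

The same rule reproduces the totals of the other papers in this report (67.5, 66.0, 66.0, and 64.5).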

Keywords

Vision-Language-Action, Unified Foundation Model, Embodied Intelligence, DiT-based Action Decoder, Robot Embodiments, Multi-task Performance, Generalization

In-Depth Analysis

Chinese Title: Qwen-VLA：跨任务、环境和机器人本体的统一视觉-语言-动作建模

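Summary: The paper presents Qwen-VLA, a unified embodied foundation model that targets the fragmentation of current embodied-intelligence models and their weak generalization across tasks, environments, and embodiments. Built on a Qwen3.5-4B vision-language backbone and a DiT flow-matching action decoder, it casts manipulation, navigation, and trajectory prediction into a shared action-and-trajectory prediction framework through large-scale joint pretraining (robot manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, and auxiliary vision-language data) and embodiment-aware prompt conditioning. Training is staged as text-to-action pretraining, multimodal continued pretraining, supervised fine-tuning, and reinforcement learning, which stabilizes training and improves downstream transfer. The model performs strongly on manipulation, navigation, and trajectory benchmarks, for example 97.9% on LIBERO, 73.7% on Simpler-WidowX, and 69.0% OSR on R2R, and it reaches a 76.9% OOD success rate in real-world ALOHA experiments, demonstrating generalization across embodiments and scenes.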

Innovations:

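Proposes a unified vision-language-action model that maps manipulation, navigation, and human egocentric actions into a shared action-and-trajectory space, supporting multiple robot platforms and task families.
Builds a large-scale heterogeneous joint-pretraining mixture of robot manipulation trajectories, human demonstrations, synthetic simulation, navigation, and curated vision-language data, covering both low-level motion priors and high-level semantic reasoning.
Introduces embodiment-aware prompt conditioning, which uses textual descriptions to unify robot platforms, control conventions, and prediction horizons, so no separate policy is needed per embodiment.
Designs a progressive training pipeline (text-to-action pretraining, multimodal continued pretraining, supervised fine-tuning, and reinforcement learning) that bridges the gap between discrete vision-language tokens and continuous action trajectories and improves training stability and downstream transfer.
Generalizes across tasks, environments, and embodiments on multiple benchmarks, supporting the claim that joint pretraining and progressive learning improve multi-task performance and robustness.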

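Methodology: The architecture combines a Qwen3.5-4B vision-language backbone with a DiT flow-matching action decoder. Every task is cast as conditional prediction: the inputs are visual context, a language instruction, an embodiment description, and an optional task identifier, and the output is an action or trajectory in a unified representation. Training has four stages. Stage 1 pretrains the text-to-action DiT without visual input to learn a language-conditioned action-compression prior. Stage 2 adds visual input for multimodal continued pretraining, aligning the action prior with visual observations. Stage 3 applies supervised fine-tuning for downstream tasks. Stage 4 uses reinforcement learning to optimize closed-loop task success. Embodiment-aware prompts describe the current robot platform and control convention in text, so that one model can handle different control modes.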
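To make the flow-matching action decoder more concrete, here is a minimal pure-Python sketch of one common flow-matching variant: a straight-line path from Gaussian noise to the target action, with a velocity regression loss and Euler sampling. The linear path, the function names, and the `oracle` velocity field (which stands in for the DiT and ignores all vision-language conditioning) are assumptions for illustration. The report does not specify the paper's exact formulation.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to the action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, action, rng):
    """Mean squared error between predicted and target velocity at a random t."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = interpolate(noise, action, t)
    v_true = [a - n for n, a in zip(noise, action)]  # constant along the path
    v_pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(action)


def sample(predict_velocity, noise, steps=10):
    """Euler integration of the learned velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for a trained decoder: the exact velocity field
# for a single target action (a real decoder would be a conditioned network).
action = [0.2, -0.5, 1.0]


def oracle(x, t):
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


rng = random.Random(0)
loss = flow_matching_loss(oracle, action, rng)
start = [rng.gauss(0.0, 1.0) for _ in action]
sampled = sample(oracle, start, steps=10)
```

With the oracle field, the loss is zero up to rounding and sampling lands on the target action. A trained network would only approximate this field.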

Key Results:

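97.9% success rate on the LIBERO benchmark.
73.7% success rate on the Simpler-WidowX benchmark.
86.1% and 87.2% success rates on RoboTwin Easy and Hard, respectively.
69.0% OSR on the R2R navigation task and 59.6% SR on RxR.
76.9% average OOD success rate in real-world ALOHA experiments.
26.6% zero-shot success rate on the DOMINO dynamic manipulation task.
Out-of-distribution generalization across scene layout, background, lighting, object configuration, and robot embodiment.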

Tech Stack:

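Qwen3.5-4B (vision-language backbone; natively multimodal, ViT with spatial merging)
DiT (Diffusion Transformer, used for the flow-matching action decoder)
Flow matching (for continuous action generation)
MANO (hand pose representation, used for human egocentric data)
Reinforcement learning (post-training stage that optimizes closed-loop success)
Conditional prediction framework (unified input/output format)
Embodiment-aware prompts (textual descriptions that unify control conventions)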

Strengths:

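The unified framework covers manipulation, navigation, and human actions and generalizes across tasks, environments, and embodiments.
Large-scale joint pretraining on heterogeneous data draws on many data sources to improve model capability.
The progressive training strategy handles the initialization asymmetry between the vision-language backbone and the action decoder and keeps training stable.
Leading results on several standard benchmarks and in real-world experiments support the method's effectiveness.
The model and code are open-sourced, which helps community research and applications.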

Limitations:

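The model depends on collecting and cleaning large-scale heterogeneous data, which is costly.
Training requires substantial compute, which may keep small teams from reproducing it.
Navigation performance (for example 69.0% on R2R) may still trail dedicated navigation models; a unified model can be weaker than specialists on specific tasks.
Behavior and failure cases in extreme unseen environments or safety-critical scenarios are not discussed in detail.
The specific algorithm and hyperparameters of the reinforcement learning stage are not stated, which may hurt reproducibility.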

Relevance To Keywords:

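Unify Models: The paper's core contribution is a unified model that integrates manipulation, navigation, and human actions into a single architecture.
World Models: The unified action-prediction framework can be viewed as part of a world model, but the paper does not explicitly build one.
Representation Learning: The vision-language backbone provides strong representation learning for joint visual, language, and action representations.
Model-Based RL: The paper uses reinforcement learning for post-training, but it is not explicitly model-based and is closer to model-free RL.
Native multimodal large models (原生多模态大模型): Qwen3.5 is a natively multimodal model that supports early vision-language fusion.
Unified understanding and generation in multimodal large models (多模态大模型的理解和生成一体化): The model handles both vision-language understanding and continuous action generation.
Representation learning (表征学习): Joint pretraining learns shared representations across modalities and embodiments.
World models (世界模型): Unified action prediction can be viewed as implicit modeling of physical-world dynamics.
Reinforcement learning (强化学习): Stage 4 uses reinforcement learning to optimize closed-loop success.
Post-training (后训练): The staged training includes supervised fine-tuning and reinforcement-learning post-training.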

2. Archon: A Unified Multimodal Model for Holistic Digital Human Generation [PASS]

Score: 67.5 / 27.8

Authors: Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

Published: 2026-05-28

TL;DR: Archon addresses the challenge of holistic digital human generation by proposing a unified multimodal model with modality-specific tokenizers and autoregressive pretraining, achieving superior performance across diverse tasks.

摘要翻译

数字人是沉浸式交互的基础，但创建一个涵盖文本、音频、运动和视觉内容等全模态的统一模型仍是一个开放挑战。本文提出 Archon，一个完全预训练、以人为本的统一多模态模型，用于全化身生成。Archon 使用模态特定 tokenizer 统一七种模态，并在同步模态和 72 项多样任务上预训练了一个原生自回归统一多模态模型，以建模全联合分布。为了解决高保真说话视频中的 token 爆炸挑战，我们引入了一种内存高效的语义视频重参数化方法，在保持细粒度动态的同时实现 token 数量缩减至原来的 1/4，并结合了语义驱动的视频扩散解码器。我们进一步提出一种"Thinking in Modality"（模态思维），将模糊的跨模态任务分解为模态交替链中的逐步思考，逐步提升保真度和可控性。广泛的实验表明，Archon 在多样的数字人生成任务上实现了优越或相当的性能，验证了我们的统一框架的有效性。项目页面：https://zju3dv.github.io/archon/.

Abstract

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: https://zju3dv.github.io/archon/.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	9.0/10	13.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper introduces Archon, a unified multimodal model for digital human generation, demonstrating high relevance to 'Unify Models', 'Tokenizer' (modality-specific tokenizers are a key contribution), and 'MultiModal' (core focus). 'MLLM' is relevant due to the autoregressive unified architecture. 'Visual Encoder' is moderately relevant as visual data is processed, though not the primary architectural novelty. 'World Models' shares conceptual ground with autoregressive joint distribution modeling but differs in application (generation vs. RL). 'model-based RL' is irrelevant as the paper focuses on generation without reinforcement learning. No expert authors from the provided list are present.

Keywords

Unified Multimodal Model, Digital Human Generation, Modality-specific Tokenizers, Autoregressive Pretraining, Semantic Video Reparameterization, Thinking in Modality, Holistic Joint Distributions

In-Depth Analysis

Chinese Title: Archon：面向整体数字人生成的统一多模态模型

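Summary: The paper presents Archon, a fully pretrained, human-centric unified multimodal model for holistic digital human generation. It integrates seven modalities (description, script, speech, animation, semantic video, image, and video), encodes the heterogeneous signals into discrete tokens with modality-specific tokenizers, and pretrains a native autoregressive unified multimodal model on synchronized modality data and 72 diverse tasks to model the holistic joint distribution. To handle token explosion in high-fidelity talking videos, it introduces a memory-efficient semantic video reparameterization (4x token compression) and a semantic-driven video diffusion decoder that generates high-quality video. It also proposes a "Thinking in Modality" inference strategy that decomposes ambiguous cross-modal tasks into stepwise generation along a chain of modalities, improving fidelity and controllability. Experiments show that Archon matches or exceeds expert models on a range of digital human generation tasks, supporting the effectiveness of the unified framework.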

Innovations:

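Proposes Archon, a fully pretrained, human-centric unified multimodal model that supports any-to-any generation and understanding over seven modalities and overcomes the fragmentation of traditional expert models.
Designs a memory-efficient semantic video discretization that replaces RGB video with semantic-label video, cutting tokens by 4x while keeping the key dynamics.
Proposes a semantic-driven video diffusion decoder that generates high-fidelity video conditioned on semantic video, a reference image, and a text description.
Introduces the "Thinking in Modality" inference strategy, which decomposes complex cross-modal tasks into stepwise generation along a chain of modalities to improve quality and controllability.
Pretrains the unified model on synchronized modality data and 72 diverse tasks, learning rich cross-modal correspondences and enabling zero-shot generalization.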

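Methodology: Archon has four core modules. (1) Modality-specific tokenizers: MAGVITv2 encodes images, a fine-tuned MAGVITv2 encodes semantic video (semantic labels are mapped to RGB colors), SoundStream encodes speech, and dedicated tokenizers handle animation, description, and the other modalities. (2) Autoregressive language-model backbone: tokens from each modality are arranged in a structured format and fed to a Transformer, which reasons across modalities and outputs tokens of the target modality. (3) Semantic-driven video diffusion model: conditioned on semantic video, a reference image, and text, it generates high-quality RGB video through a diffusion process. (4) Inference-time "Thinking in Modality": an ambiguous task such as speech-to-video is decomposed into stepwise generation through intermediate modalities (for example speech → animation → semantic video → video), which reduces ambiguity. Training has two stages: the tokenizers are pretrained first, and the language model is then trained jointly on large-scale synchronized multimodal data and the 72 tasks.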
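The "Thinking in Modality" decomposition can be pictured as a loop that generates one intermediate modality at a time and feeds everything produced so far into the next step. The sketch below is only an illustration of that control flow. `run_modality_chain` and the string-returning stub generators are hypothetical and stand in for calls to the unified model and the diffusion decoder.

```python
def run_modality_chain(source, chain, generators):
    """Generate each modality in `chain` in order.

    `source` holds the given inputs, and every generated modality is added
    to the shared context so later steps can condition on it.
    """
    context = dict(source)
    for target in chain:
        context[target] = generators[target](context)
    return context


# Stub generators (hypothetical): each just records what it was conditioned on.
generators = {
    "animation": lambda ctx: f"animation<{ctx['speech']}>",
    "semantic_video": lambda ctx: f"semantic_video<{ctx['animation']}>",
    "video": lambda ctx: f"video<{ctx['semantic_video']}|{ctx['reference_image']}>",
}

result = run_modality_chain(
    source={"speech": "audio.wav", "reference_image": "face.png"},
    chain=["animation", "semantic_video", "video"],
    generators=generators,
)
print(result["video"])
```

The chain order follows the speech-to-video example in the methodology. Other tasks would presumably use different chains.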

Key Results:

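On talking-head generation, Archon matches or beats expert models such as SadTalker and Wav2Lip in video quality (FID, FVD) and lip-sync accuracy (LSE-D, LSE-C).
On speech-driven animation, the generation error of animation parameters (such as expression and head pose) is lower than that of dedicated models.
On cross-modal tasks such as image-to-speech and text-to-video, the outputs show high fidelity and consistency.
Semantic video reparameterization compresses tokens by 4x, reducing a 5-second, 30 fps video from 9K tokens to about 2.25K so that it fits the language model's context window.
"Thinking in Modality" markedly improves generation quality and identity preservation on ambiguous tasks such as speech-to-video.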
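The token figures quoted for semantic video reparameterization can be checked with a few lines of arithmetic. This sketch assumes "9K" means exactly 9,000 tokens. The per-frame numbers are derived here and are not reported in the summary.

```python
def tokens_after_reduction(tokens_before, factor):
    """Token count after a uniform `factor`-times reduction."""
    return tokens_before / factor


frames = 5 * 30               # 5 seconds at 30 fps
tokens_before = 9000          # "9K", assumed to be exactly 9,000
tokens_after = tokens_after_reduction(tokens_before, 4)

per_frame_before = tokens_before / frames
per_frame_after = tokens_after / frames
print(frames, tokens_after, per_frame_before, per_frame_after)
```

Under these assumptions the clip has 150 frames, and the 4x reduction takes it from 60 to 15 tokens per frame on average.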

Tech Stack:

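MAGVITv2 (3D-convolutional VQGAN, used to tokenize images and semantic video)
SoundStream (residual vector quantizer, used to tokenize speech)
Autoregressive Transformer (language-model backbone that processes multimodal token sequences)
Diffusion model (semantic-driven video decoder, based on a U-Net or a similar architecture)
Face semantic segmentation model (such as the model in [34], used to extract 21 classes of semantic labels)
VQGAN (vector-quantized generative adversarial network, for learning discrete representations)
Residual vector quantization (RVQ, for multi-level audio coding)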

Strengths:

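Unified framework: a single model supports any-to-any generation and understanding over seven modalities, removing the fragmentation and distribution mismatch between expert models.
Efficient token representation: semantic video reparameterization sharply reduces the number of video tokens, so a language model can process long video sequences.
High-quality generation: the semantic-driven diffusion decoder, combined with a reference image, produces high-fidelity video that preserves identity and dynamic detail.
Flexible inference: "Thinking in Modality" uses the model's multimodal reasoning to decompose ambiguous tasks step by step, improving controllability and quality.
Large-scale pretraining: pretraining on 72 tasks gives the model rich cross-modal knowledge and supports zero-shot generalization.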

Limitations:

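It depends on pretrained tokenizers (MAGVITv2, SoundStream), whose quality directly affects overall generation quality.
Semantic video requires an offline face segmentation model, which may add compute overhead and applies only to the face region, so support for full-body digital humans is limited.
The model is large, and training and inference are resource-intensive (for example TPUv6), which may limit practical deployment.
The paper does not discuss whether post-training (such as reinforcement learning or human feedback) could further improve generation quality; it currently relies only on pretraining and the inference strategy.
Experiments focus on digital-human tasks, and generalization to general multimodal tasks (such as open-domain image captioning) is not thoroughly validated.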

Relevance To Keywords:

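Unify Models: Highly relevant. Archon is a unified, human-centric multimodal model that integrates seven modalities in a single framework for any-to-any generation and understanding.
World Models: Relevant. The model learns a multimodal joint distribution through pretraining and can be seen as a kind of "world model" for the digital-human domain, but it does not emphasize environment interaction or causal reasoning.
Representation Learning: Highly relevant. Modality-specific tokenizers (VQGAN, SoundStream) discretize heterogeneous signals into a shared token space and learn compact, generalizable cross-modal representations.
Model-Based RL: Not directly relevant. The paper does not involve reinforcement learning or model-based planning and focuses on generation tasks.
Native multimodal large models (原生多模态大模型): Highly relevant. Archon is a native multimodal large model whose autoregressive Transformer processes multimodal tokens directly rather than through external interfaces.
Unified understanding and generation in multimodal large models (多模态大模型的理解和生成一体化): Highly relevant. The model supports both multimodal understanding (any modality as input) and generation (any modality as output).
Representation learning (表征学习): Same as Representation Learning above; highly relevant.
World models (世界模型): Same as World Models above; relevant but not central.
Reinforcement learning (强化学习): Not relevant. The paper does not use reinforcement learning.
Post-training (后训练): Partly relevant. The paper covers pretraining but not post-training (such as RLHF or fine-tuning strategies), although "Thinking in Modality" can be seen as an inference-time post-processing strategy.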

3. VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies [PASS]

Score: 66.0 / 27.8

Authors: Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

Published: 2026-05-28

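TL;DR: VisualThink-VLA is a visual intermediate-reasoning framework that replaces textual chain-of-thought with visual evidence tokens, yielding low-latency, high-accuracy vision-language-action policies.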

摘要翻译

近期研究开始为视觉 - 语言 - 动作（VLA）策略引入显式中间推理。然而，在具身控制中，文本思维链并不适用：无关或弱文本信息可能干扰动作预测，而自回归文本解码带来的延迟过高，无法满足实时闭环执行的需求。我们提出 VISUALTHINK-VLA，这是一个面向准确、低延迟 VLA 策略的视觉中间推理框架。我们的自举策略在于利用有效的视觉思考来指导动作：VISUALTHINK-VLA 通过一个紧凑的视觉证据接口引导动作预测，该接口在保持空间精度的同时避免了解码开销。此外，为了进一步提升性能与效率，VISUALTHINK-VLA 采用了一种定制的选择性路由机制来学习视觉证据标记，从而在保持高容量专业化的同时实现低延迟推理。另外，我们还引入了 VisualEvidence-Kit，这是一个以 VisualEvidence-Agent 为中心的监督与审计资源，该资源构建了一个包含 754.7k 条 VLA 指令的 VisualEvidence-Set，用于路由监督和反事实忠实性测试。在多个基准测试及真实机器人评估中，VISUALTHINK-VLA 在大多数基准上取得了最高的成功率，同时将推理增强基线的多秒级延迟降低至亚秒级。例如，在 BridgeData V2 上，它将步骤延迟从 ECoT 的 8.377 秒降低至 0.367 秒，实现了 22.8 倍的加速。

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. Besides, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a 754.7k VLA instructions VisualEvidence-Set for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8× speedup.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	6.0/10	9.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	4.0/10	6.0

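Scoring rationale: The paper centers on visual intermediate reasoning for vision-language-action (VLA) policies, so it is highly relevant to MultiModal and Visual Encoder. It unifies vision, language, and action, giving Unify Models and MLLM moderate relevance. Its visual evidence tokens give Tokenizer some relevance. It does not build a world model or model dynamics for model-based reinforcement learning, so those two keywords score low.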

Keywords

VisualThink-VLA, Visual Intermediate Reasoning, Vision-Language-Action, Low-Latency, Visual Evidence Tokens, Selective Routing, Embodied Control

In-Depth Analysis

Chinese Title: VisualThink-VLA：用于高效低延迟视觉-语言-动作策略的视觉中间推理

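Summary: The paper proposes the VisualThink-VLA framework to address two problems of textual chain-of-thought in existing vision-language-action (VLA) policies: high latency and interference with action prediction. The framework builds a compact visual-evidence interface (a six-channel candidate evidence bank, of which four effective channels are kept after screening) and uses a task-adaptive routing mechanism to select the relevant evidence channels, avoiding text-decoding overhead. It also introduces VisualEvidence-Kit, in which a VisualEvidence-Agent builds a VisualEvidence-Set of 754.7k instructions for route supervision and counterfactual faithfulness tests. Experiments show that VisualThink-VLA achieves the highest success rate on several simulated and real-robot benchmarks and cuts inference latency from seconds to under a second (for example, on BridgeData V2 latency drops from 8.377 s to 0.367 s, a 22.8× speedup).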

Innovations:

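Proposes a visual intermediate-reasoning interface that replaces textual chain-of-thought with routed visual cues, avoiding latency and interference.
Designs a task-adaptive evidence orchestration mechanism comprising channel screening, sparse routing, soft-hard collaborative optimization, and teacher-student distillation.
Builds VisualEvidence-Kit, a route supervision and audit framework that generates a large-scale route-annotated dataset for faithfulness diagnosis.
Implements a plug-in visual reasoning module that works with a frozen VLA backbone while maintaining high performance.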

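Methodology: The method first builds a six-channel candidate evidence bank (position, boundary, motion, relation, depth, and segmentation), then screens out the ineffective depth and segmentation channels and keeps the four effective ones. A task-adaptive router predicts channel probabilities, and a hardening operation turns them into the routing mask used at inference. A visual state composer projects the routed evidence into learned states, which are injected into a frozen VLA backbone for action decoding. Training uses soft-hard collaborative masks and distillation from a fully dense teacher (FULLSOFT) to stabilize sparse routing. The VisualEvidence-Agent automatically extracts channel-level evidence and builds the human-reviewed VisualEvidence-Set, which is used for route supervision and counterfactual tests.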
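The routing step described above (channel probabilities followed by a hardening operation) could look roughly like the sketch below. The four channel names come from the summary. The sigmoid router output, the 0.5 threshold, and the `route` and `compose` helpers are assumptions for illustration, since the report does not say how hardening is implemented.

```python
import math

# The four channels kept after screening, per the summary.
CHANNELS = ("position", "boundary", "motion", "relation")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, threshold=0.5):
    """Turn router logits into soft probabilities and a hard inference mask.

    Thresholding is an assumed hardening rule, not the paper's.
    """
    probs = {name: sigmoid(z) for name, z in zip(CHANNELS, logits)}
    mask = {name: p >= threshold for name, p in probs.items()}
    return probs, mask


def compose(evidence, mask):
    """Keep only the evidence channels that the mask selects."""
    return {name: feats for name, feats in evidence.items() if mask[name]}


probs, mask = route([2.0, -1.0, 0.5, -3.0])
evidence = {name: [float(i)] * 2 for i, name in enumerate(CHANNELS)}
active = compose(evidence, mask)
print(sorted(active))
```

Only the selected channels reach the downstream composer in this sketch, which is one way sparse routing can reduce inference cost.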

Key Results:

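On BridgeData V2, step latency drops from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
It achieves the highest success rate on multiple benchmarks, including real-robot evaluation.
The routed evidence is stage-sensitive and aligned with behavior, and it can be audited.
The sparse routing policy nearly matches the fully dense teacher while reducing the number of active channels and the interference.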

Tech Stack:

Grounding DINO (object detection)
SAM2 (segmentation)
Qwen2.5-VL (vision-language model)
CLIP (visual encoding)
ViT (Vision Transformer)
OWL-ViT (open-vocabulary detection)
Soft-hard collaborative masks
Teacher-student distillation
Conditional computation / mixture of experts

Strengths:

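Sharply reduces inference latency, which suits real-time closed-loop control.
Maintains a high success rate, outperforming textual reasoning and dense visual methods.
The routed evidence is auditable, which provides interpretability.
The general plug-in design can be adapted to different VLA backbones.
Builds a large-scale supervision dataset that supports faithfulness verification.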

Limitations:

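The evidence channels are predefined and may not cover all the visual information a task needs.
Channel screening is based on specific benchmarks, so its generality needs further validation.
The routing mechanism adds training complexity and depends on high-quality supervision data.
The combination with more complex world models or reinforcement learning is not explored.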

Relevance To Keywords:

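Multimodal large models (多模态大模型): VLA policies are a typical application of multimodal large models in robotics.
Representation learning (表征学习): The design of the visual evidence channels involves learning compact visual representations.
World models (世界模型): Intermediate reasoning can be viewed as an implicit world model used to predict the consequences of actions.
Model-Based RL: The routing mechanism and distillation are related to value-function learning in model-based reinforcement learning.
Post-training (后训练): Teacher-student distillation is a post-training optimization.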

4. WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction [PASS]

Score: 66.0 / 27.8

Authors: Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang

Published: 2026-05-28

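TL;DR: The paper introduces WorldMemArena, a framework for evaluating multimodal agent memory, and finds that current systems still fall clearly short in using visual evidence and in stability on agentic trajectories.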

摘要翻译

多模态大语言模型正越来越多地被部署为长程代理，在此情境下，记忆的功能远超回忆：它必须追踪不断演化的世界，修订过时的内容，并在决策时刻呈现正确的证据。现有基准仅在静态对话上测量回忆能力，将记忆简化为单一的任务结束准确率，并将视觉观测缩减为字幕，导致我们无法将故障定位到记忆生成、维护、检索或使用环节。能够自主编写记忆的代理框架（agent harnesses）的兴起加剧了这一差距，因为我们缺乏严谨的方法来比较手工设计的管道与自我管理替代方案。为了弥补这些差距，我们将多模态代理记忆形式化为具有可观察四阶段生命周期的动作 - 世界交互循环（Action-World Interaction Loop），并在 WorldMemArena 中实例化该框架：包含 400 个多会话多模态任务，涵盖终身进化（Lifelong Evolution，个人和任务状态的演化）和代理执行（Agentic Execution，基于真实观测、动作和反馈的记忆），并标注了黄金记忆点、更新、干扰项和证据链，以支持阶段级诊断。这使得长上下文记忆代理、手工设计（如 RAG 和外部记忆系统）以及基于框架的记忆代理之间的首次直接比较成为可能。结果表明：(1) 更优的记忆生成与存储并不保证更好的性能；(2) 多模态记忆仍难以充分利用视觉证据；(3) 系统在不同领域间不稳定，且在真实的代理轨迹上性能退化；(4) 框架记忆更具灵活性，但成本高昂且可靠性较低。

Abstract

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	8.0/10	12.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	6.0/10	9.0

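Scoring rationale: The paper focuses on evaluating the memory of multimodal large language models (MLLMs) acting as long-horizon agents, so MLLM and MultiModal are the most relevant (9.0). The title refers to the "world" and the work tracks state evolution, which is highly relevant to World Models (8.0). The action-and-feedback loop touches on model-based RL concepts (6.0). Tokenizer is not mentioned (2.0). Unify Models and Visual Encoder are background technology rather than core contributions (5.0). A check of the author list finds none of the designated experts (Yang Shi and others), so no bonus applies. The weighted total is 66.0, well above the passing score of 27.8.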

Keywords

WorldMemArena, Multimodal Agent, Memory Evaluation, Action-World Interaction, Visual Evidence, Lifelong Evolution, Agentic Execution

In-Depth Analysis

Chinese Title: WorldMemArena：通过动作-世界交互评估多模态智能体记忆

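Summary: The paper addresses the lack of adequate memory evaluation for multimodal large language models acting as long-horizon interactive agents and proposes an evaluation framework based on an action-world interaction loop. Existing benchmarks mainly test recall over static dialogue, cannot diagnose the full lifecycle of memory writing, maintenance, retrieval, and use, and are mostly text-only. The authors formalize multimodal agent memory as a four-stage lifecycle (writing, maintenance, retrieval, use) and build the WorldMemArena benchmark of 400 multi-session multimodal tasks across two domains: Lifelong Evolution (evolving personal and task states) and Agentic Execution (evidence drawn from real observations, actions, and feedback). Every task is annotated with gold memory points, state updates, distractors, and evidence chains to support stage-level diagnosis. Long-context agents, manually designed memory systems (RAG and external memory), and harness-based memory agents are compared in a unified setting. Main findings: (1) storing more correct memories does not guarantee better performance; (2) multimodal memory remains a bottleneck in complex visual reasoning; (3) systems are unstable across domains and degrade on realistic agentic trajectories; (4) harness-based memory is more flexible but costly and less reliable.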

Innovations:

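Formalizes multimodal agent memory as an action-world interaction loop and defines a four-stage lifecycle of writing, maintenance, retrieval, and use, which makes the evaluation diagnosable.
Builds the WorldMemArena benchmark of 400 multi-session multimodal tasks covering two complementary domains, Lifelong Evolution and Agentic Execution, with fine-grained annotations.
Provides the first head-to-head comparison, in a unified setting, of three memory paradigms: long-context, manually designed (RAG and external memory), and harness-based.
Reveals a disconnect between memory writing and storage ability and final use ability, as well as a bottleneck in using multimodal evidence.
Provides a stage-level diagnostic framework that can locate the stage at which memory fails (writing, maintenance, retrieval, or use).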

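Methodology: The paper first models agent-world interaction as a partially observable Markov process and defines trajectories and session segmentation. Based on the action-world interaction loop, it then decomposes the memory lifecycle into four observable stages: writing (extracting evidence from the current session that will be useful later), maintenance (integrating new information and updating old memories), retrieval (fetching relevant evidence for a query), and use (correctly applying the retrieved results in the final answer or action). WorldMemArena builds 400 tasks through human annotation and automated pipelines; each task contains multi-session interaction and multimodal inputs (images, text, tool feedback, and so on) and is annotated with gold memory points, state updates, distractors, and evidence chains. For evaluation, the three memory paradigms (long-context, manually designed RAG and external memory, and harness-based memory) are tested in a unified environment and compared on stage-level metrics (writing accuracy, maintenance consistency, retrieval recall, and use correctness) as well as final task success rate.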
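The four-stage lifecycle can be made concrete with a toy key-value memory in which each stage is a separate method, so a failure can in principle be attributed to one of them. `ToyMemory`, its substring-based retrieval, and the "newer session wins" maintenance rule are illustrative assumptions. None of this is WorldMemArena's actual interface.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    key: str
    value: str
    session: int


class ToyMemory:
    """A toy memory exposing the four lifecycle stages as methods."""

    def __init__(self):
        self.items = {}

    def write(self, session, facts):
        """Writing: turn a session's facts into candidate memory items."""
        return [MemoryItem(k, v, session) for k, v in facts.items()]

    def maintain(self, candidates):
        """Maintenance: newer sessions overwrite stale values for a key."""
        for item in candidates:
            old = self.items.get(item.key)
            if old is None or item.session >= old.session:
                self.items[item.key] = item

    def retrieve(self, query):
        """Retrieval: return items whose key appears in the query text."""
        return [item for item in self.items.values() if item.key in query]

    def use(self, query):
        """Use: answer from the retrieved evidence, or None if nothing matches."""
        hits = self.retrieve(query)
        return hits[0].value if hits else None


memory = ToyMemory()
memory.maintain(memory.write(1, {"desk_color": "white", "city": "Oslo"}))
memory.maintain(memory.write(2, {"desk_color": "black"}))  # state update
answer = memory.use("what is the desk_color now?")
print(answer)
```

Keeping the stages separate is what allows stage-level metrics: one can check the written items, the maintained store, the retrieved hits, and the final answer independently.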

Key Results:

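Storing more correct memories does not guarantee better performance; what matters is whether they are used correctly at answer time.
Multimodal memory remains the main bottleneck, especially on complex visual reasoning tasks.
Memory performance is unstable across domains and drops markedly on Agentic Execution tasks, where key information is scattered across actions, tool feedback, and state changes.
Manually designed memory systems (RAG and external memory) are more structured but adapt poorly, while harness-based memory agents are more flexible but costly and less reliable.
Stage-level diagnosis exposes different failure modes: long-context models tend to fail at retrieval and use; RAG systems are strong at writing and maintenance but lack retrieval precision; harness-based memory is flexible in writing and maintenance but unstable in retrieval and use.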

Tech Stack:

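Multimodal large language models (such as GPT-4o, Qwen-VL, and Claude)
Retrieval-augmented generation (RAG)
External memory systems (such as vector databases and memory networks)
Agent harnesses (such as OpenClaw and Codex)
Partially observable Markov decision process (POMDP) modeling
Benchmark construction that combines human annotation with automated pipelines
Stage-level evaluation metrics (writing accuracy, maintenance consistency, retrieval recall, use correctness)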

Strengths:

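Proposes a systematic memory evaluation framework that covers the full lifecycle and can locate the failing stage.
The benchmark reflects realistic agent interaction (multi-session, multimodal, dynamic environments) and is closer to real applications than existing static-dialogue benchmarks.
Offers the first unified comparison of three mainstream memory paradigms, providing a useful comparative analysis.
Fine-grained annotations such as gold memory points and evidence chains support in-depth diagnosis.
Reveals the disconnect between memory writing and use, an important finding that can guide the design of future memory systems.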

Limitations:

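The benchmark is relatively small (400 tasks) and may not cover every memory challenge.
Evaluation relies mainly on final question-answering accuracy and stage-level metrics and does not analyze in depth how memory systems behave dynamically within complex reasoning chains.
The evaluation of harness-based memory agents may be limited by the capabilities of current harnesses, so the generality of the conclusions needs further validation.
Multimodal inputs consist mainly of images and text; video, audio, and other modalities are not covered.
Memory behavior under continual learning or catastrophic forgetting is not examined.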

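Relevance To Keywords: The paper is highly relevant to the given research keywords. First, it focuses on the memory of multimodal large models acting as agents, which bears directly on native multimodal large models and on the understanding and decision-making side of unified understanding and generation. Second, the proposed action-world interaction loop is essentially a world-model perspective: the agent learns the world state through interaction and updates its internal memory, which ties it closely to world models and representation learning, since memory can be viewed as a representation of the world state. Third, the paper evaluates harness-based memory agents; such systems typically optimize their memory policies with reinforcement learning or post-training, so the link to those two keywords is indirect. Finally, the paper stresses that memory should support long-horizon decisions, which matches the idea of planning with a model in Model-Based RL. Overall, the paper offers a new paradigm for evaluating multimodal agent memory and is a useful reference for work on unified models, world models, and representation learning.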

5. minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models [PASS]

Score: 64.5 / 27.8

Authors: Min Zhao, Hongzhou Zhu, Bokai Yan, Zihan Zhou, Yimin Chen, Wenqiang Sun, Kaiwen Zheng, Guande He, Xiao Yang, Chongxuan Li, Fan Bao, Jun Zhu

Published: 2026-05-28

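TL;DR: The paper presents minWM, a framework that distills bidirectional video diffusion models into few-step autoregressive generators to obtain real-time interactive video world models with low latency and controllability.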

摘要翻译

近期，视频扩散基础模型在高质量视频生成方面取得了显著进展，然而将其转化为实时交互式视频世界模型仍具挑战性。交互式世界模型需要具备可控性、因果性和低延迟的展开（rollout），这在实践中要求涵盖数据构建、可控微调、自回归训练、少步蒸馏以及流式推理的完整流程。本文提出了 minWM，一个用于构建实时交互式视频世界模型的全栈开源框架。minWM 提供了一套端到端的流程，可将现有的双向文本到视频（T2V）/文本到图像到视频（TI2V）视频基础模型转换为相机可控的少步自回归世界模型。具体而言，minWM 首先利用相机控制对双向视频扩散模型进行微调，随后应用因果强制（Causal Forcing）/因果强制++（Causal Forcing++）流程，包括自回归（AR）扩散训练、因果常微分方程（ODE）或因果一致性蒸馏以及非对称 DMD，将其蒸馏为少步自回归生成器，以实现低延迟展开。该框架具有模块化且架构可扩展的特点：我们在代表性的开源骨干网络上进行了实例化，包括 Wan2.1-T2V-1.3B 和 HY1.5-TI2V-8B，涵盖了基于交叉注意力的条件注入以及 MMDiT 风格架构。minWM 还支持将现有的视频世界模型（例如 HY-WorldPlay）适配到新的数据分布、训练方案及延迟目标上。除了发布可运行脚本、检查点、文档及推理代码外，我们还提供了关于相机轨迹质量、可控性训练步数及最小批量大小要求的实用消融实验。我们希望 minWM 能成为构建和适配实时交互式视频世界模型的可复现且可扩展的方案。项目页面：[https://github.com/shengshu-ai/minWM]

Abstract

Recent video diffusion foundation models have achieved remarkable progress in high-quality video generation, yet turning them into real-time interactive video world models remains challenging. Interactive world models require controllable, causal, and low-latency rollout, which in practice demands a full pipeline spanning data construction, controllable fine-tuning, autoregressive training, few-step distillation, and streaming inference. In this work, we present minWM, a full-stack open-source framework for building real-time interactive video world models. minWM provides an end-to-end pipeline that converts existing bidirectional T2V/TI2V video foundation models into camera-controllable few-step autoregressive world models. Specifically, minWM first fine-tunes a bidirectional video diffusion model with camera control, and then applies the Causal Forcing / Causal Forcing++ pipeline, including AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill it into a few-step autoregressive generator for low-latency rollout. The framework is modular and architecture-extensible: we instantiate it on representative open backbones, including Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention-based condition injection and MMDiT-style architectures. minWM also supports adapting existing video world models, such as HY-WorldPlay, to new data distributions, training recipes, and latency targets. Beyond releasing runnable scripts, checkpoints, documentation, and inference code, we provide practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. We hope minWM serves as a reproducible and extensible recipe for building and adapting real-time interactive video world models. Project Page: [https://github.com/shengshu-ai/minWM](https://github.com/shengshu-ai/minWM)

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	10.0/10	15.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	8.0/10	12.0

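Scoring Rationale: The paper's core focus is building real-time interactive video world models, so World Models receives the full score. Its full-stack framework unifies the data, training, and distillation pipeline, which partially fits Unify Models but does not involve architectural unification (6). It covers video generation with text/image conditioning (TI2V), which falls under multimodality (7). Interactive world models are a key component of model-based RL, so that keyword is highly relevant (8). It uses existing video diffusion backbones (which include visual encoders) without innovating on the encoder itself (5). Tokenizer design is not discussed (3). The focus is video generation rather than multimodal large language models (MLLM) (4). The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords

Video World Models, Real-time Interactive, Full-Stack Framework, Autoregressive Training, Distillation, Camera Control, Low-latency Rollout

In-Depth Analysis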

Chinese Title: minWM：用于实时交互式视频世界模型的全栈开源框架

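Summary: This paper presents minWM, a full-stack open-source framework for converting existing bidirectional text-to-video (T2V) or text-and-image-to-video (TI2V) diffusion foundation models into real-time interactive video world models. The framework covers the full pipeline: data construction, camera-controllable fine-tuning, autoregressive diffusion training, few-step distillation, and low-latency inference. It proceeds in two stages. First, camera parameters are injected via PRoPE and the bidirectional diffusion model is fine-tuned for camera controllability. Second, the Causal Forcing or Causal Forcing++ pipeline is applied, comprising autoregressive diffusion training, initialization by causal ODE or causal consistency distillation, and asymmetric DMD post-training, which yields a few-step autoregressive generator. The framework is instantiated on Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, supports multiple architectures, and can adapt existing world models such as HY-WorldPlay. The experiments report practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements, providing a reproducible build guide.

Innovations:

Proposes minWM, a full-stack open-source framework covering the complete pipeline from data construction to low-latency inference, with end-to-end reproducibility.
Injects camera parameters with PRoPE so that the bidirectional diffusion model becomes camera-controllable while preserving its original generation quality.
Introduces the Causal Forcing / Causal Forcing++ pipeline to distill a multi-step bidirectional model into a few-step autoregressive model for real-time interaction.
Supports multiple architectures (cross-attention injection and MMDiT-style) and is instantiated on Wan2.1 and HY1.5 to demonstrate generality.
Provides practical ablations (camera trajectory quality, training steps, batch-size requirements) that guide reproducible training.

Methodology: The paper uses a two-stage method. In the first stage, PRoPE encodes the camera parameters (intrinsics and extrinsics) as block-diagonal transforms injected into the self-attention layers, and the bidirectional diffusion model is fine-tuned to generate video that follows a camera trajectory. In the second stage, the model is distilled with the Causal Forcing or Causal Forcing++ pipeline: autoregressive diffusion training (teacher forcing) comes first, then the few-step model is initialized by causal ODE distillation or causal consistency distillation, and finally asymmetric DMD (which exploits the high-quality distribution of the bidirectional teacher) post-trains the few-step model to align its distribution. The whole pipeline runs on camera-controllable data so that the resulting model supports interactive control.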

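Key Results:

On HY1.5, first-frame latency is 771.041 s for the multi-step bidirectional model, 81.014 s for the multi-step autoregressive model (9.52x speedup), and 3.446 s for the few-step autoregressive model (223.75x speedup).
On Wan2.1, first-frame latency is 269.055 s for the multi-step bidirectional model, 28.651 s for the multi-step autoregressive model (9.39x speedup), and 1.137 s for the few-step autoregressive model (236.64x speedup).
The framework converts Wan2.1-T2V-1.3B and HY1.5-TI2V-8B into camera-controllable few-step autoregressive world models and supports adapting HY-WorldPlay.
The ablations indicate that camera trajectory quality, the number of controllability training steps, and the minimal batch size all materially affect final performance.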
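
The speedup factors quoted in the Key Results can be re-derived from the latencies themselves. The sketch below is only a sanity check: the dictionary keys are illustrative names, and the baseline is assumed to be the multi-step bidirectional model, which is what the quoted factors imply.

```python
# First-frame latencies in seconds, copied from the Key Results.
# The variant names are illustrative labels, not identifiers from minWM.
latency_s = {
    "HY1.5": {"multi_step_bidirectional": 771.041,
              "multi_step_ar": 81.014,
              "few_step_ar": 3.446},
    "Wan2.1": {"multi_step_bidirectional": 269.055,
               "multi_step_ar": 28.651,
               "few_step_ar": 1.137},
}


def speedups(latencies, baseline="multi_step_bidirectional"):
    """Baseline latency divided by each other variant's latency (2 decimals)."""
    base = latencies[baseline]
    return {name: round(base / value, 2)
            for name, value in latencies.items() if name != baseline}


for model, latencies in latency_s.items():
    print(model, speedups(latencies))
```

Both rows reproduce the reported factors, so the speedups are consistent with the raw latencies.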

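Tech Stack:

PRoPE (Projective Rotary Position Embedding) for camera-parameter injection
Causal Forcing / Causal Forcing++ distillation pipeline
Autoregressive diffusion training (teacher forcing)
Causal ODE distillation (PF-ODE trajectory regression)
Causal consistency distillation
Asymmetric DMD (Asymmetric Distribution Matching Distillation)
Wan2.1-T2V-1.3B (cross-attention architecture)
HY1.5-TI2V-8B (MMDiT architecture)
HY-WorldPlay adaptation

Strengths:

Provides a complete open-source framework covering the full pipeline from data to inference, which is easy to reproduce and extend.
The design is architecture-general and supports several video diffusion foundation models and condition-injection schemes.
Achieves large speedups (>200x) via the distillation pipeline, meeting real-time interaction requirements.
Releases intermediate checkpoints, so researchers can resume training or modify the pipeline from any stage.
Includes practical ablations that give empirical guidance for real training runs.

Limitations:

The framework depends on existing high-quality video foundation models and does not itself train a foundation model.
Camera control only supports preset trajectories; richer user interaction (such as object manipulation) is not covered.
The quality of the few-step model is still bounded by the teacher, and asymmetric DMD may not fully close the distribution gap.
Experiments are limited to specific models and datasets, so generalization needs further testing.

Relevance To Keywords:

Unify Models: The framework unifies video generation, world-model construction, and interactive control, reflecting the idea of model unification.
World Models: The direct goal is to build real-time interactive video world models that support causal rollout and user control.
Representation Learning: Camera parameters are encoded via PRoPE, learning a geometric representation of the scene.
Model-Based RL: World models can serve as environment simulators in model-based reinforcement learning.
Native multimodal large models: Built on T2V/TI2V foundation models, fusing text, image, and video modalities.
Unified understanding and generation in multimodal large models: The framework turns a generative model into an interactive world model that combines understanding (responding to control) with generation.
Post-training: Distillation and DMD are used as post-training to improve inference speed and generation quality.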

6. SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation (PASS)

Score: 63.0 (pass threshold: 27.8)

Authors: Yuan Li, Congyi Zhang, Xifeng Gao, Xiaohu Guo

Published: 2026-05-28

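TL;DR: SuperVoxelGPT proposes an adaptive supervoxel tokenization framework that, combined with an MLLM, enables efficient, high-quality autoregressive 3D shape generation.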

摘要翻译

自回归多模态大语言模型 (MLLMs) 能够实现 3D 生成，但因 3D 标记化 (tokenizations) 不足，难以扩展至高分辨率形状。基于紧凑集合的表示 (compact set-based representations) 放弃了确定性空间顺序，导致序列预测模糊；而均匀或基于八叉树的体素网格 (voxel grids) 虽保留了顺序，却付出了严重冗余和序列过长的代价。这种结构权衡限制了稳定且高效的自回归 3D 生成。本文提出 SuperVoxelGPT，这是一种表示优先的框架，通过自适应且确定有序的 supervoxel (超体素) 标记化解决了这一张力。给定 prompt (提示)，我们首先预测粗略的几何显著性分布，并利用显著性引导的中心 Voronoi 划分 (Voronoi tessellation) 构建形状自适应的 supervoxel 划分，将精细单元分配给复杂区域，将较大单元分配给平滑区域。在文本和有序 supervoxel 布局的条件下，我们引入 SuperVoxelVAE，并微调预训练的 MLLM，以自回归生成 supervoxel 标记。在 Trellis-500K 上的实验表明，SuperVoxelGPT 将标记序列长度缩减至均匀 voxel 标记化的 12.8%，同时实现了最先进的生成质量，且平均比先前方法加速 10 倍。

Abstract

Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10$\times$ speedup over prior methods.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	10.0/10	15.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	10.0/10	15.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0

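Scoring Rationale: The paper's core is the combination of 3D tokenization (10) with an MLLM (10), handling multimodal input (MultiModal: 10). Unify Models (5) and Visual Encoder (6) are moderately relevant. World Models (1) and model-based RL (0) are unrelated, because the paper addresses static generation rather than dynamics modeling or reinforcement learning. No designated experts are among the authors, so there is no bonus. The weighted total is 63.0, above the pass threshold.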
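
The weighted total can be reproduced from the table. A minimal sketch, assuming the total is simply the sum of weight times relevance, that the second number on the Score line (27.8) is the pass threshold, and that passing means meeting or exceeding it; no expert bonus is modelled.

```python
WEIGHT = 1.5           # every keyword in the table uses the same weight
PASS_THRESHOLD = 27.8  # assumed to be the second number on the Score line

# Relevance values (out of 10) copied from the scoring table for this paper.
relevance = {
    "Unify Models": 5.0,
    "Tokenizer": 10.0,
    "Visual Encoder": 6.0,
    "World Models": 1.0,
    "MLLM": 10.0,
    "MultiModal": 10.0,
    "model-based RL": 0.0,
}


def weighted_total(scores, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in scores.values())


total = weighted_total(relevance)
passed = total >= PASS_THRESHOLD  # the comparison rule is an assumption
print(total, passed)
```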

Keywords

SuperVoxelGPT, 3D Tokenization, Autoregressive Shape Generation, MLLM, Supervoxel Partition, Text-to-3D, Adaptive Tokenization

In-Depth Analysis

Chinese Title: SuperVoxelGPT：面向自回归形状生成的自适应有序3D标记化

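Summary: This paper proposes SuperVoxelGPT, a two-stage multimodal large language model (MLLM) framework for high-resolution 3D generation. Existing 3D tokenizations are either too compact (unordered sets), which makes sequence prediction ambiguous, or too redundant (uniform voxels), which makes sequences excessively long. SuperVoxelGPT resolves this with adaptive, deterministically ordered supervoxel tokenization. In the first stage, a coarse geometric saliency distribution is predicted from a text or image prompt, and saliency-guided centroidal Voronoi tessellation (SCVT) builds a shape-adaptive supervoxel partition that assigns fine-grained cells to complex regions and large cells to smooth regions. In the second stage, a SuperVoxelVAE encodes the supervoxels and a pretrained MLLM is fine-tuned to generate the supervoxel token sequence autoregressively. On Trellis-500K, SuperVoxelGPT reduces the token sequence length to 12.8% of uniform voxel tokenization, reaches state-of-the-art generation quality, and achieves an average 10x speedup.

Innovations:

Proposes the two-stage SuperVoxelGPT framework: first predict a saliency distribution and generate an adaptive supervoxel structure, then generate supervoxel tokens autoregressively, enabling efficient high-resolution 3D generation.
Designs the SuperVoxelVAE representation: a supervoxel-based 3D tokenizer that produces compact, deterministically ordered token sequences, fits autoregressive modeling naturally, and can be plugged into existing sparse-voxel VAEs.
Develops saliency-driven 3D centroidal Voronoi tessellation (SCVT): supervoxels are allocated according to geometric complexity, which preserves detail while greatly reducing redundant tokens.
Uses Jacobi decoding for parallel autoregressive generation: because the first stage fixes the sequence length, a parallel decoding strategy can be applied directly, which substantially accelerates token generation.

Methodology: The paper uses a two-stage method. In the first stage (Prompt-to-Supervoxel), a lightweight MaskGIT model predicts a coarse 3D saliency volume from the prompt, and saliency-guided centroidal Voronoi tessellation (SCVT) then produces the adaptive supervoxel structure. In the second stage (Supervoxel-to-Shape), a SuperVoxelVAE encodes the supervoxels into discrete tokens, a pretrained MLLM (e.g., LLaMA) is fine-tuned to generate the token sequence autoregressively, and Jacobi decoding parallelizes generation. The overall pipeline consists of saliency VQ-VAE encoding, MaskGIT generation, SCVT partitioning, SuperVoxelVAE encoding/decoding, and MLLM fine-tuning with autoregressive generation.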
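
The idea behind a saliency-guided centroidal Voronoi tessellation can be illustrated with weighted Lloyd iterations on a toy 2D grid. This is a generic sketch, not the paper's SCVT: the grid, the saliency values, the initial sites, and the lexicographic ordering at the end are all assumptions made for illustration.

```python
def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def nearest(sites, point):
    """Index of the site whose Voronoi cell contains the point."""
    return min(range(len(sites)), key=lambda i: sq_dist(sites[i], point))


def energy(points, saliency, sites):
    """Saliency-weighted sum of squared distances to the nearest site."""
    return sum(w * sq_dist(sites[nearest(sites, p)], p)
               for p, w in zip(points, saliency))


def lloyd_step(points, saliency, sites):
    """Move every site to the saliency-weighted centroid of its cell."""
    sums = [[0.0, 0.0, 0.0] for _ in sites]
    for p, w in zip(points, saliency):
        acc = sums[nearest(sites, p)]
        acc[0] += w * p[0]
        acc[1] += w * p[1]
        acc[2] += w
    # A site with an empty cell stays where it is.
    return [(sx / sw, sy / sw) if sw > 0 else site
            for site, (sx, sy, sw) in zip(sites, sums)]


# Toy 8x8 grid of cell centres; the right half is "complex" (high saliency).
points = [(x + 0.5, y + 0.5) for x in range(8) for y in range(8)]
saliency = [9.0 if x >= 4 else 1.0 for x, _ in points]

initial_sites = [(1.0, 1.0), (2.0, 6.0), (3.0, 3.0), (6.0, 4.0)]
sites = initial_sites
for _ in range(20):
    sites = lloyd_step(points, saliency, sites)

# Hypothetical deterministic ordering: plain lexicographic sort of the sites.
ordered_sites = sorted(sites)
print(ordered_sites)
```

Each Lloyd step cannot increase the weighted energy, so high-saliency regions pull sites toward them and end up with smaller cells.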

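Key Results:

Token sequence length is reduced to 12.8% of uniform voxel tokenization, greatly lowering compute cost.
State-of-the-art generation quality on Trellis-500K (quantitative metrics such as FID and coverage).
Average generation speed is 10x faster than prior methods.
The adaptive supervoxel structure preserves high-frequency geometric detail while avoiding redundancy in smooth regions.

Tech Stack:

MaskGIT (Masked Generative Image Transformer) for predicting the coarse saliency volume
Centroidal Voronoi tessellation (CVT) and its saliency-guided variant (SCVT)
VQ-VAE (vector-quantized variational autoencoder) for encoding the saliency volume and the supervoxels
Pretrained multimodal large language model (MLLM, e.g., LLaMA)
Jacobi decoding (a parallel autoregressive decoding strategy)
Sparse-voxel VAE (the base architecture for SuperVoxelVAE)

Strengths:

Introduces an adaptive, ordered supervoxel representation that is both compact and deterministically ordered, which suits autoregressive modeling well.
The two-stage design decouples "where to allocate tokens" from "what geometry to generate", improving efficiency and controllability.
Substantially shortens token sequences, lowering training and inference cost and yielding a 10x speedup.
The plug-and-play SuperVoxelVAE can be integrated into existing sparse-voxel VAEs.
Reaches state-of-the-art generation quality, supporting the effectiveness of the representation.

Limitations:

The first-stage saliency prediction depends on MaskGIT and may introduce errors that affect the accuracy of the supervoxel partition.
Generating the supervoxel structure requires extra training and inference steps, increasing system complexity.
The method currently targets 3D geometry only and does not cover texture or material generation.
Adaptive partitioning may not be robust for extremely complex shapes (for example, many thin structures).
Experiments use only Trellis-500K, so generalization remains to be verified.

Relevance To Keywords: The paper focuses on representation learning and autoregressive modeling for 3D generation and is highly relevant to "representation learning" (it designs an adaptive, ordered supervoxel representation). Its two-stage framework touches on the "world model" idea (the predicted saliency distribution acts as implicit world knowledge). It does not involve reinforcement learning or post-training, so its relevance to "Unify Models", "Model-Based RL", "reinforcement learning", and "post-training" is weak. It has some connection to "native multimodal large models" and "unified understanding and generation in multimodal large models", since the method builds on an MLLM for text/image-to-3D generation, but it emphasizes generation over understanding.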

7. AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning (PASS)

Score: 63.0 (pass threshold: 27.8)

Authors: Yilun Qiu, Jiahe Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Chun Yuan

Published: 2026-05-28

TL;DR: AgentCVR introduces a multi-agent framework with script-simulated reinforcement learning to actively acquire evidence for cross-video reasoning, outperforming single-pass MLLM baselines.

摘要翻译

跨视频推理（CVR）已成为多模态智能领域的一个关键前沿，要求模型能够检索、对齐并聚合分布在多个视频中的证据。当前的多模态大语言模型（MLLMs）在 CVR 任务上往往面临挑战，因为简单的单轮策略将多个视频编码为共享的压缩上下文，这可能会掩盖罕见但关键的证据。本文提出 AgentCVR，这是一个多智能体框架，将 CVR 视为一种主动证据获取任务。AgentCVR 采用主智能体迭代协调专门的视觉智能体和音频智能体，以进行针对性证据提取。为确保高效训练，我们引入了脚本模拟强化学习（Script-Simulated RL），该方法利用 LLM 生成的语义脚本和轻量级基于文本的模拟器来优化智能体策略，从而在在线探索过程中避免了昂贵的多模态推理。在综合 CVR 基准测试上的实验结果表明，AgentCVR 优于单轮基线方法，并在复杂的跨视频对齐和定位任务上达到了与最先进的闭源系统相当的性能。为确保可复现性，我们的代码已开源，网址为 https://github.com/wang-jh24/AgentCVR。

Abstract

Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang-jh24/AgentCVR.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	7.0/10	10.5

Scoring Rationale: The paper focuses on Cross-Video Reasoning using a multi-agent framework (Unify Models: 7) and leverages MLLMs for script generation and context (MLLM: 8). The task is inherently MultiModal (9) involving video and audio. It employs a simulator for RL training (model-based RL: 7). Visual Encoder is implicit in Visual Agents (5), Tokenizer is not discussed (2), and World Models are loosely related via the simulator concept (4). No matching expert authors were found in the list.

Keywords

Cross-Video Reasoning, Multi-Agent Framework, Script-Simulated Reinforcement Learning, Multimodal Intelligence, Active Evidence Acquisition, Master Agent, Visual and Audio Agents

In-Depth Analysis

Chinese Title: AgentCVR：基于脚本模拟强化学习的主动多智能体跨视频推理

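Summary: Cross-video reasoning (CVR) requires a model to retrieve, align, and aggregate evidence from multiple videos, but current multimodal large language models (MLLMs) encode everything in a single compressed pass and easily lose sparse but critical evidence. This paper proposes AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task: a master agent iteratively coordinates specialized visual and audio agents for targeted evidence extraction. For efficient training it introduces Script-Simulated RL, which optimizes the policy with LLM-generated semantic scripts and a lightweight text simulator, avoiding costly multimodal inference during online exploration. On the CrossVid benchmark, AgentCVR outperforms single-pass baselines and approaches closed-source systems, with its strongest results on complex cross-video alignment and localization tasks.

Innovations:

Proposes the active multi-agent framework AgentCVR, which turns CVR from passive single-pass compression into active multi-round evidence acquisition, with a master agent dynamically coordinating visual and audio agents.
Introduces Script-Simulated RL, which replaces interaction with raw video by LLM-generated semantic scripts and a lightweight text simulator, greatly reducing training cost.
Models CVR as a partially observable Markov decision process (POMDP), explicitly separating evidence collection from final reasoning.
Achieves performance comparable to closed-source systems on CrossVid, especially on fine-grained cross-video alignment and localization tasks.

Methodology: The paper uses a multi-agent framework in which the Master Agent makes multi-round decisions under a POMDP formulation; its action space consists of visual queries, audio queries, and a terminating answer. Training uses Script-Simulated RL: an LLM generates semantic scripts, a lightweight text simulator provides feedback, and the policy is optimized with GRPO. At inference, the trained policy is transferred directly to the real video environment, where the Master Agent calls the visual and audio agents to obtain local multimodal evidence.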
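
GRPO scores each rollout relative to the other rollouts sampled for the same question. The sketch below shows only that group-relative advantage; the reward values are invented, and using the population standard deviation with a small epsilon is an assumption (implementations differ on this detail). It is not AgentCVR's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt.

    Rollouts better than the group average get a positive advantage,
    worse ones a negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four simulated episodes: 1.0 = correct final answer.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every rollout in a group gets the same reward, all advantages are zero and that group contributes no learning signal.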

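Key Results:

AgentCVR outperforms all single-pass baselines on the CrossVid benchmark.
Its performance is comparable to state-of-the-art closed-source systems (e.g., GPT-4V), especially on tasks that need precise evidence localization and cross-video alignment.
Script-Simulated RL lowers training cost and needs no human-annotated trajectories.
The multi-round evidence-acquisition strategy markedly improves recall of sparse evidence and the accuracy of cross-video comparison.

Tech Stack:

Multimodal large language model (MLLM)
Partially observable Markov decision process (POMDP)
Reinforcement learning (RL)
GRPO (Group Relative Policy Optimization)
LLM-generated semantic scripts
Lightweight text simulator
Visual Agent
Audio Agent

Strengths:

Recasts CVR from a single-pass compression paradigm into active multi-round evidence acquisition, which is closer to how humans reason.
Script-Simulated RL addresses the high cost of multimodal online RL training and has practical value.
The design is modular: the master agent is separate from the specialized agents, so components are easy to extend or replace.
Achieves competitive results on a standard benchmark, supporting the method's effectiveness.

Limitations:

Script-Simulated RL depends on the quality of LLM-generated scripts, which may introduce bias.
Only the visual and audio modalities are validated; others (such as text or touch) are not covered.
Multi-round interaction may increase inference latency, so real-time use needs optimization.
Experiments use only the CrossVid benchmark, so generalization needs further validation.

Relevance To Keywords:

Unify Models: The paper builds on multimodal large language models (MLLMs) but does not address unifying generation and understanding, so relevance is low.
World Models: The text simulator in Script-Simulated RL can be viewed as a lightweight world model that approximates environment dynamics, so there is some relevance.
Representation Learning: The paper does not study representation learning directly, but multi-agent evidence acquisition implicitly involves cross-video representation alignment; relevance is moderate.
Model-Based RL: Script-Simulated RL follows the model-based RL paradigm in that a simulator replaces the real environment during policy optimization; highly relevant.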

8. SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation (PASS)

Score: 60.0 (pass threshold: 27.8)

Authors: Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

Published: 2026-05-28

TL;DR: SAFE-Pruner proposes a semantic attention-guided token pruning framework to accelerate Vision-Language-Action model inference for robotic control, achieving up to 1.89x speedup with minimal success rate degradation.

摘要翻译

视觉 - 语言 - 动作（VLA）模型的实时推理对于机器人控制至关重要。尽管视觉令牌剪枝在加速推理方面显示出巨大潜力，但大多数现有方法主要基于浅层线索做出剪枝决策，存在丢弃深层所需视觉信息的风险。为了解决这一问题，我们提出了 SAFE-Pruner，这是一种即插即用的剪枝框架，将后续层的注意力线索纳入剪枝决策中。具体来说，我们识别出语义注意力一致性，即 VLA 模型在执行步骤中将注意力概率质量集中在同一语义实体上的趋势。基于这一观察，我们设计了一种前瞻策略来预测深层中的令牌显著性，这防止了关键令牌的过早移除，并实现了更稳定的加速。我们进一步引入了一种自适应子任务划分策略来检测注意力的突变，从而提高预测精度和剪枝可靠性。在仿真和真实世界环境中的广泛实验表明，我们的方法实现了高达 1.89 倍的加速，且成功率下降幅度小于 1.7%，同时比最先进方法最多高出 1.9%。

Abstract

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5

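Scoring Rationale: The paper's core is inference acceleration for VLA models, so it is highly relevant to MultiModal (VLA models are multimodal) and MLLM (similar architecture). It involves token pruning, which relates it to Tokenizer; the visual tokens being pruned are the output of the Visual Encoder; and VLA models unify vision, language, and action, which relates it to Unify Models. It is not centrally about World Models or model-based RL algorithms; the emphasis is on inference efficiency.

Keywords

Vision-Language-Action, Token Pruning, Semantic Attention, Inference Acceleration, Robotic Control, Visual Token, Future-Aware

In-Depth Analysis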

Chinese Title: SAFE-Pruner: 面向高效视觉-语言-动作操控的语义注意力引导的未来感知令牌剪枝

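Summary: To address the high latency of vision-language-action (VLA) models in real-time robot control, this paper proposes SAFE-Pruner, a plug-and-play token-pruning framework. Existing pruning methods mostly rely on shallow-layer attention signals and tend to discard, too early, key tokens that deep-layer reasoning needs. The authors first observe "semantic attention consistency" in VLA models executing continuous tasks: the model's attention stays on the same semantic entity (such as the target object) rather than on a fixed spatial location. Building on this, they propose a forward-looking forecasting strategy that uses deep-layer attention from earlier frames to predict the late-layer token importance of the current frame, avoiding premature pruning. They also introduce an adaptive subtask division strategy that detects abrupt changes in early attention to identify subtask boundaries and keep stale history from interfering. Experiments across several VLA architectures and benchmarks show up to 1.89x speedup with a success-rate drop of no more than 1.7%, outperforming existing methods.

Innovations:

First to identify and validate semantic attention consistency in VLA models during continuous manipulation, revealing a stable temporal pattern in attention.
Proposes a future-aware token-pruning framework that uses deep-layer attention from earlier frames to predict late-layer importance for the current frame, directly targeting short-sighted pruning.
Designs an adaptive subtask division strategy that detects subtask boundaries from early attention changes, improving forecasting accuracy and pruning reliability.
Plug-and-play: needs no retraining or architectural changes and applies to multiple VLA models.

Methodology: The paper first analyzes the internal attention dynamics of VLA models and finds a coarse-to-fine pattern (early attention is diffuse, later attention is focused) together with semantic consistency across time steps. Based on this, it designs a forward-looking forecasting strategy: at the (early) pruning layer, the deep-layer attention distribution of earlier frames from the same phase serves as a reference for predicting the late-layer saliency of the current frame's tokens. To handle abrupt attention shifts caused by subtask switches, it proposes adaptive subtask division: the difference between the current frame's early attention and that of the reference frame is computed, and the reference frame is reset if the difference exceeds a threshold. The pruning decision finally combines the predicted saliency score with the current early-layer score. Experiments in simulated and real-world settings use models such as OpenVLA-OFT, Octo, and RT-2-X, and are evaluated on benchmarks such as BridgeData V2 and RLBench.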
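
The pruning decision described in the Methodology can be sketched with plain lists of per-token attention scores. Everything concrete here is an assumption for illustration: the L1 distance used for shift detection, the 0.5 mixing weight, the 0.5 threshold, and all function and variable names are hypothetical, not SAFE-Pruner's actual rules.

```python
def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def select_tokens(early_now, early_ref, deep_ref, keep,
                  mix=0.5, shift_threshold=0.5):
    """Return (sorted indices of kept tokens, whether the reference was reset).

    early_now: early-layer attention per visual token in the current frame
    early_ref: early-layer attention of the reference (earlier) frame
    deep_ref:  deep-layer attention of the reference frame, used as a
               forecast of the current frame's deep-layer saliency
    """
    shifted = l1_distance(early_now, early_ref) > shift_threshold
    if shifted:
        # Likely subtask boundary: history is stale, fall back to early cues.
        scores = list(early_now)
    else:
        scores = [mix * d + (1.0 - mix) * e
                  for d, e in zip(deep_ref, early_now)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), shifted


early_ref = [0.35, 0.35, 0.20, 0.10]
deep_ref = [0.05, 0.05, 0.10, 0.80]   # deep layers focus on token 3

# Same subtask: token 3 looks unimportant early but is kept via the forecast.
kept_same, reset_same = select_tokens([0.40, 0.30, 0.20, 0.10],
                                      early_ref, deep_ref, keep=2)

# Abrupt shift in early attention: the reference is ignored and reset.
kept_shift, reset_shift = select_tokens([0.05, 0.15, 0.10, 0.70],
                                        early_ref, deep_ref, keep=2)
print(kept_same, reset_same, kept_shift, reset_shift)
```

In the first call an early-only ranking would keep tokens 0 and 1 and drop token 3, which is the short-sighted failure the forecast is meant to prevent.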

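Key Results:

Up to 1.89x inference speedup across multiple VLA architectures.
Success rate drops by no more than 1.7%, outperforming existing pruning methods (by up to 1.9%).
At pruning ratios above 70%, the loss rate of core tokens is markedly lower than for methods that rely only on early saliency.
Adaptive subtask division improves forecasting accuracy and remains stable on long-horizon tasks.

Tech Stack:

Transformer architecture
Multi-head attention (MHA)
Scaled dot-product attention
Token pruning
Semantic attention consistency analysis
Adaptive threshold detection
VLA models (OpenVLA-OFT, Octo, RT-2-X, etc.)
Benchmarks and datasets (BridgeData V2, RLBench)

Strengths:

Plug-and-play with no retraining or model changes, so it is practical.
Designed around attention patterns specific to VLA models, so it is well targeted.
Delivers a clear speedup while keeping success rates high, with very small performance loss.
Validated on multiple VLA architectures and tasks, indicating good generalization.
Adaptive subtask division handles attention drift in long-horizon tasks.

Limitations:

Depends on deep-layer attention from earlier frames; poor-quality history (for example, inaccurate attention in the initial frame) may hurt forecasting.
The threshold for adaptive subtask division must be set or tuned manually, which may affect generalization.
Covers only visual-token pruning and does not consider combining it with other acceleration techniques (such as KV caching or quantization).
Speedup is limited at very low pruning ratios, and a risk of performance degradation remains at high pruning ratios.

Relevance To Keywords:

Unify Models: The VLA models studied here unify vision, language, and action, so the paper is highly relevant to unified models.
World Models: A VLA model can be seen as an implicit world model that understands the environment through vision and language and generates actions; accelerating it helps real-time world-model inference.
Representation Learning: Semantic attention consistency reveals the stability of the semantic representations VLA models learn, which relates to representation learning.
Model-Based RL: VLA models are often used as imitation-learning or reinforcement-learning policies; accelerating them can make deploying model-based RL more efficient.
Native multimodal large models: VLA models build on pretrained multimodal large models, and the pruning method applies to such models.
Unified understanding and generation in multimodal large models: VLA models perform visual understanding, language understanding, and action generation together, and this work accelerates that unified inference.
Post-training: The method needs no further training and is applied after training, so it overlaps with post-training only loosely.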

9. PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding (PASS)

Score: 58.5 (pass threshold: 27.8)

Authors: Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

Published: 2026-05-28

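TL;DR: PARCEL introduces a visual tokenization architecture with pool-anchored resampling and conditioned elastic queries to address the compute bottleneck in large vision-language models, improving the performance-efficiency Pareto frontier.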

摘要翻译

大视觉 - 语言模型（LVLMs）将视觉输入映射为密集的令牌序列，从而在推理过程中引入了二次计算瓶颈。弹性视觉令牌压缩通过训练一个能够在多种视觉令牌预算下运行的单一模型来解决这一问题。然而，现有方法在激进压缩下表现不佳。仅空间压缩（如嵌套池化）表现为一种不完美的低通滤波器，会引发频谱混叠，从而掩盖细粒度细节。仅查询压缩（如嵌套查询重采样）则用非局部摘要替换显式的网格对齐令牌，并显著削弱空间定位能力。为了解决这种表征冲突，我们提出了 PARCEL（Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding），这是一种视觉令牌化架构，能够动态划分特征提取的任务。PARCEL 将空间池令牌确立为低频布局锚点，并通过池条件查询重采样（Pool-Conditioned Query Resampling）基于这些锚点对弹性查询令牌进行条件化。这使得查询令牌能够专注于互补的视觉特征，而非冗余的空间映射。在 27 个基准测试上的广泛评估表明，PARCEL 改进了性能 - 效率帕累托前沿，在多种视觉令牌预算下持续优于现有的俄罗斯套娃基线，同时保持了“一次训练，随处部署”的范式。

Abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	9.0/10	13.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

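Scoring Rationale: The paper's core is visual-token compression and efficiency for large vision-language models (LVLMs), so it is highly relevant to Tokenizer (token architecture), Visual Encoder (visual feature extraction), MLLM (large vision-language models), and MultiModal. Unify Models is moderately relevant, because the paper unifies the token architecture rather than the model architecture. World Models and model-based RL are unrelated to the paper. The author list does not include Yang Shi or the other designated experts, so there is no bonus.

Keywords

Visual Tokenization, Pool-Anchored Resampling, Elastic Queries, Vision-Language Understanding, Efficiency Optimization, Spatial Grounding, LVLM Compression

In-Depth Analysis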

Chinese Title: PARCEL: 基于池锚点重采样与条件弹性查询的高效视觉语言理解

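Summary: Large vision-language models (LVLMs) map visual inputs into dense token sequences, which creates a quadratic compute bottleneck at inference. Elastic visual-token compression trains a single model that supports multiple token budgets, but existing methods degrade under aggressive compression: spatial compression (as in M3) causes spectral aliasing that blurs detail, and query compression (as in MQT) weakens spatial grounding. This paper proposes the PARCEL architecture, which dynamically divides the feature-extraction work: spatial pool tokens act as low-frequency layout anchors, conditioned elastic query tokens act as high-frequency semantic explorers, and the two complement each other through Pool-Conditioned Query Resampling. Experiments on 27 benchmarks show that PARCEL outperforms existing elastic baselines across token budgets, improves the performance-efficiency Pareto frontier, and keeps the "train once, deploy anywhere" paradigm.

Innovations:

Formally analyzes the complementary bottlenecks of existing elastic LVLMs: M3's rigid spatial pooling causes spectral aliasing, and MQT's non-local query resampling weakens spatial understanding.
Proposes PARCEL, a hybrid visual connector whose Pool-Conditioned Query Resampling divides feature extraction dynamically: spatial anchors preserve geometric layout and query tokens capture high-frequency detail.
Designs a budget-aware routing strategy so that a single model runs at budgets from 16 to 256 tokens without retraining.
Validates the improved performance-efficiency Pareto frontier on 27 diverse benchmarks (including video understanding, dense recognition, and VQA).

Methodology: PARCEL extracts image features with a SigLIP vision encoder. Multi-scale spatial average pooling first produces a set of spatial pool tokens that serve as low-frequency layout anchors. A set of learnable elastic query tokens is then introduced; through Pool-Conditioned Query Resampling (based on cross-attention) these queries are explicitly conditioned on the spatial anchors and then interact with the original visual features, so they focus on complementary high-frequency detail. Training uses a nested-dropout strategy that randomly truncates the query sequence length to obtain elasticity, and at inference the number of query tokens is chosen according to the user-specified budget. The spectral analysis uses radial power spectral density to assess the frequency distribution of the compressed representations.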
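
Two ingredients of the Methodology, average-pooled anchor tokens and prefix truncation of an ordered query sequence at inference, can be shown on a toy scalar feature grid. This sketch omits the cross-attention conditioning entirely, and the way the budget is split between anchors and queries is an assumption, as are all names.

```python
def average_pool(grid, window):
    """Average-pool a square 2D grid with non-overlapping windows (row-major)."""
    n = len(grid)
    pooled = []
    for r in range(0, n, window):
        for c in range(0, n, window):
            cells = [grid[i][j]
                     for i in range(r, r + window)
                     for j in range(c, c + window)]
            pooled.append(sum(cells) / len(cells))
    return pooled


def visual_tokens(grid, queries, budget, window):
    """Spend the budget on pool anchors first, then on a prefix of the queries.

    Taking a prefix mirrors nested-dropout training, where any leading
    subsequence of the queries is expected to be usable on its own.
    """
    anchors = average_pool(grid, window)
    n_queries = max(budget - len(anchors), 0)
    return anchors, queries[:n_queries]


# Toy 4x4 "feature map" with scalar features 0..15, and four ordered queries.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
queries = ["q0", "q1", "q2", "q3"]

anchors, kept_queries = visual_tokens(grid, queries, budget=6, window=2)
print(anchors, kept_queries)
```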

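Key Results:

Across 27 benchmarks, PARCEL's average retention rate (relative to the uncompressed PG2 baseline) is higher than M3 and MQT at every token budget (16, 64, 256).
At a low budget (16 tokens), PARCEL clearly outperforms M3 and MQT, indicating that it preserves fine-grained information better.
Theoretical estimates show that PARCEL's prefill FLOPs and KV-cache cost are essentially the same as M3 and MQT at equal budgets, with higher performance.
Spectral analysis shows that PARCEL's spatial pool tokens concentrate more spectral energy in the low-frequency band while the query tokens add high-frequency components, avoiding M3's aliasing problem.

Tech Stack:

SigLIP vision encoder
Transformer architecture (self-attention, cross-attention)
Multi-scale spatial average pooling
Nested dropout (Matryoshka representation learning)
Pool-Conditioned Query Resampling
Radial power spectral density analysis
Budget-aware routing strategy

Strengths:

Addresses the complementary bottlenecks of existing elastic compression methods and achieves a better trade-off between representation quality and compression efficiency.
A single model supports multiple inference budgets without retraining, which suits real deployments.
Evaluated comprehensively on many diverse benchmarks, including resolution-sensitive tasks such as video and dense recognition.
Offers a spectral-analysis perspective that deepens understanding of the compression mechanism.

Limitations:

PARCEL adds a Pool-Conditioned Query Resampling module, which may introduce a small compute overhead (although the paper claims the FLOPs difference is small).
The paper does not compare directly with recent non-elastic compression methods (such as Mask-LLaVA or dynamic pruning), only with elastic baselines.
The spectral analysis focuses on the spatial pool tokens and examines the spectral properties of the query tokens less thoroughly.
Performance on generation tasks (such as image-text generation) is not discussed; the focus is on understanding tasks.

Relevance To Keywords:

Native multimodal large models: The paper studies vision-language understanding, which belongs to multimodal large models, but it does not cover unified generation.
Unified understanding and generation in multimodal large models: The paper addresses understanding tasks only, not generation.
Representation learning: It uses Matryoshka representation learning to obtain elastic representations, so it is highly relevant.
World models: Not directly relevant.
Reinforcement learning: Not directly relevant.
Post-training: The paper follows "train once, deploy anywhere" and does not involve post-training.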

10. VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing (PASS)

Score: 57.0 (pass threshold: 27.8)

Authors: Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang, Jiayu Hu, Haozhe Shan, Han Dong, Jinpeng Lu, Yinda Chen, Yi Zhang, Yong Dai, Xiaozhu Ju

Published: 2026-05-28

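TL;DR: VLA-Trace diagnoses vision-language-action models through representation and behavior tracing, revealing how π0.5 and OpenVLA differ, and where they fall short, in adaptation dynamics and semantic following.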

摘要翻译

理解视觉 - 语言 - 动作 (VLA) 模型如何将多模态知识转化为具身控制仍是一个开放性的挑战。我们提出了 VLA-Trace，这是一种渐进式诊断框架，它通过从表示动力学到因果控制归因及行为表现的统一证据链来分析 VLA 模型。它具体结合了跨模态与检查点漂移中心核对齐 (CKA) 以追踪表示演化，采用注意力敲除干预以识别模态特异性控制路径，并通过轨迹级行为探针来考察锚定、捷径依赖及语义跟随情况。在 π0.5 和 OpenVLA 上的实验揭示了三个关键发现。首先，这两个模型在 VLA 微调过程中表现出截然不同的模态特异性适应动力学。其次，它们在动作解码过程中依赖于不同的多模态路由策略及层间依赖关系。第三，尽管 VLA 策略擅长视觉锚定轨迹生成，但在细粒度语义跟随方面仍存在局限。这些发现为未来的表示保持的适应、因果 VLA 电路以及组合式语义控制指明了方向。

Abstract

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $\pi_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	4.0/10	6.0

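Scoring Rationale: The paper focuses on diagnosing vision-language-action (VLA) models. Its core involves multimodality (10) and MLLMs (8), and it uses a unified evidence chain (7) and visual representation analysis (6). Tokenizer (1) is not mentioned, World Models (2) is unrelated, and model-based RL (4) is background only. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords

Vision-Language-Action, Representation Tracing, Behavioral Diagnosis, Multimodal Routing, Causal Control, Embodied Control, Model Diagnosis

In-Depth Analysis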

Chinese Title: VLA-Trace：通过表征与行为追踪诊断视觉-语言-动作模型

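Summary: This paper proposes VLA-Trace, a progressive diagnostic framework for analyzing how vision-language-action (VLA) models turn multimodal knowledge into embodied control. The framework follows a unified evidence chain, from representation dynamics through causal control attribution to behavioral manifestation, to unpack the model's internal mechanisms step by step. It combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine visual grounding, shortcut dependence, and semantic following. Experiments on π0.5 and OpenVLA yield three key findings: (1) the two models show different modality adaptation dynamics during VLA fine-tuning; (2) they rely on different multimodal routing strategies and layer-wise dependencies during action decoding; (3) VLA policies excel at visually grounded trajectory generation but remain limited in fine-grained semantic following. These findings point to future work on representation-preserving adaptation, causal VLA circuits, and compositional semantic control.

Innovations:

Proposes VLA-Trace, described as the first framework to unify representation-geometry analysis, causal analysis, and input intervention into one progressive diagnostic pipeline for understanding how VLA models control behavior.
Introduces checkpoint-drift CKA to quantify how much the geometry of visual, textual, and joint representations is preserved or reorganized during VLA fine-tuning.
Designs attention knockout experiments that systematically assess the causal contribution of the visual and textual pathways to action decoding, revealing architecture-dependent modality routing strategies.
Uses behavioral probes such as attention localization, visual masking, and input editing to expose the gap between visual grounding and semantic following in VLA policies, identifying strong visual trajectory imitation but weak compositional language control as the bottleneck.

Methodology: The paper uses a three-stage progressive analysis. Stage one (representation level) applies cross-modal CKA and checkpoint-drift CKA to analyze vision-language alignment and representation evolution. Stage two (causal level) uses attention knockout and layer-wise knockout interventions to assess modality dependence and layer-wise control. Stage three (behavior level) combines attention IoU, visual patch masking, and input editing to probe visual grounding, shortcut dependence, and semantic following. Experiments are run on benchmarks including LIBERO, COCO, CALVIN, RoboTwin2.0, and Simpler, and mainly analyze the π0.5 and OpenVLA models.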
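
CKA compares two sets of activations recorded on the same inputs. The sketch below implements the standard linear CKA formula, ||YᵀX||²_F / (||XᵀX||_F · ||YᵀY||_F) on column-centered features, in pure Python; whether VLA-Trace uses the linear or a kernel variant is not stated here, and the small matrices are made-up data.

```python
def center_columns(matrix):
    """Subtract each feature's mean across examples (rows)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b, for row-major (examples x features)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same row count."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = (cross_frobenius_sq(xc, xc) ** 0.5
                   * cross_frobenius_sq(yc, yc) ** 0.5)
    return numerator / denominator


# Made-up activations: 4 examples x 2 features.
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
layer_b = [[3.0 * v for v in row] for row in layer_a]   # rescaled copy
layer_c = [[row[1], row[0]] for row in layer_a]         # features permuted
layer_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

print(linear_cka(layer_a, layer_b), linear_cka(layer_a, layer_d))
```

Linear CKA is invariant to isotropic rescaling and to orthogonal transforms of the features, which is why the rescaled and permuted copies both score 1 against the original.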

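Key Results:

π0.5 and OpenVLA show different modality adaptation dynamics during VLA fine-tuning: π0.5's cross-modal CKA is spread broadly over the middle layers and fluctuates across checkpoints, indicating active cross-modal reorganization; OpenVLA's cross-modal alignment is smoother but weaker and is often concentrated at boundary or terminal layers.
Checkpoint-drift CKA shows substantial drift in π0.5's text representations, meaning its language representations are heavily reorganized; OpenVLA's text-pooled representations are better preserved, and reorganization mainly affects the vision-pooled and joint-pooled subspaces.
Attention knockout shows that action decoding in π0.5 is dominated by a concentrated vision-to-action routing pathway, with a limited functional role for language; OpenVLA distributes control-relevant information across both the visual and textual pathways.
Behavioral probing finds that VLA policies can localize manipulation-relevant regions but often ignore fine-grained semantic modifications, showing strong visually grounded trajectory imitation and weak compositional language control.

Tech Stack:

Centered kernel alignment (CKA)
Attention knockout
Layer-wise knockout
Attention IoU (intersection over union)
Visual patch masking
Input editing
Flow-matching action generation (π0.5)
Autoregressive architecture (OpenVLA, based on Llama)
PaliGemma vision-language model
Benchmark datasets including LIBERO, COCO, CALVIN, RoboTwin2.0, and Simpler

Strengths:

Proposes a systematic, unified diagnostic framework that combines representation, causal, and behavioral analysis, addressing the fragmentation of prior work.
By contrasting two representative VLA architectures (π0.5 and OpenVLA), it shows how design choices affect multimodal fusion and action decoding, which gives the conclusions some generality.
The experimental design is broad, covering multiple benchmarks and probing tasks.
The findings are actionable and point to concrete directions for future VLA design (representation preservation, causal circuits, compositional semantic control).

Limitations:

The analysis centers on π0.5 and OpenVLA; coverage of other VLA models (such as OFT and X-VLA) is limited, so generalization remains to be verified.
Behavioral probing relies mainly on simulated environments such as LIBERO-10, and behavior on real robots may differ.
Causal interventions such as attention knockout may have unintended effects on model performance, so interpretations should be cautious.
The effect of training-data scale or fine-tuning strategy on representation evolution is not explored in depth.

Relevance To Keywords:

Unify Models: The VLA models studied here unify vision, language, and action, so the paper is highly relevant to unified models.
World Models: The paper does not address world models directly, but a VLA model can be viewed as a kind of implicit world model, and diagnosing its internal representations helps in understanding its world-modeling ability.
Representation Learning: The paper's core is representation analysis (CKA, drift analysis), which is directly related.
Model-Based RL: The paper does not involve model-based RL, but action generation in VLA models can be seen as policy learning, and the diagnostic methods could transfer to model-based policy analysis.
Native multimodal large models: π0.5 and OpenVLA are both built on native multimodal large models (PaliGemma, Llama), and the paper analyzes their multimodal fusion mechanisms; highly relevant.
Unified understanding and generation in multimodal large models: VLA models integrate understanding (vision and language) with generation (action), and the paper diagnoses their internal mechanisms; directly relevant.
表征学习 (representation learning): same as the Representation Learning entry above.
世界模型 (world models): same as the World Models entry above.
Reinforcement learning: The paper does not involve RL training, but VLA models can be fine-tuned with RL, and the diagnostic methods could be applied to analysis after RL post-training.
Post-training: The paper analyzes representation changes during VLA fine-tuning (post-training); directly relevant.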

11. Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting (PASS)

Score: 57.0 (pass threshold: 27.8)

Authors: Haoxin Liu, Yichen Zhou, Rajat Sen, B. Aditya Prakash, Abhimanyu Das

Published: 2026-05-28

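TL;DR: To address the inability of TSFMs to handle multimodal context, this paper proposes PostTime, which post-trains an LLM to revise TSFM forecasts and significantly improves multimodal time-series forecasting.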

摘要翻译

时间序列基础模型（TSFMs）擅长利用数值数据进行零样本单模态预测，但与大型语言模型（LLMs）不同，它们无法处理多模态非数值上下文，而这些上下文往往塑造了现实世界的轨迹。在这项工作中，我们弥合了这一差距，并主张一种多模态时间序列预测方法，该方法通过后训练使大型语言模型（LLMs）充当基于上下文引导的修正器，以修正强大的数值 TSFM 先验。我们引入了 PostTime，这是一种结合监督微调（SFT）和基于可验证奖励的强化学习（RLVR）的后训练方案，并附带一种用于生成预测修正自动推理轨迹的方法。PostTime 教导大型语言模型（LLM）生成上下文条件化的预测干预——即基于多模态上下文决定修正、保留或忽略 TSFM 先验的决策。我们在 TimesX 多模态预测基准上，使用 Gemma-3-4B 大型语言模型和 TimesFM-2.5 时间序列基础模型评估了该方法，结果表明其显著优于独立的 TSFMs、仅使用 LLM 的基线以及现有的多模态预测方法。

Abstract

Time-Series Foundation Models (TSFMs) excel at zero-shot unimodal forecasting using numerical data, but unlike LLMs they cannot consume multimodal, non-numerical context that often shapes real-world trajectories. In this work, we bridge this gap and argue for a multimodal time-series forecasting approach that post-trains LLMs to act as context-guided revisors over strong numerical TSFM priors. We introduce PostTime, a post-training recipe combining Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), along with a methodology to generate automated reasoning traces for forecast revisions. PostTime teaches an LLM to generate context-conditioned forecast interventions -- decisions to revise, preserve, or ignore the TSFM prior based on the multimodal context. We evaluate this approach on the TimesX multimodal forecasting benchmark using a Gemma-3-4B LLM and TimesFM-2.5 TSFM, and show that it significantly outperforms standalone TSFMs, LLM-only baselines, and existing multimodal forecasting approaches.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	5.0/10	7.5

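Scoring Rationale: The paper's core is using an LLM as a revisor that corrects TSFM forecasts with multimodal context. Unify Models is highly relevant (it fuses an LLM with a TSFM); MultiModal and MLLM are highly relevant (the core is multimodal context and LLM use); model-based RL is moderately relevant (it uses RLVR rather than typical model-based RL); Tokenizer and Visual Encoder have low relevance (not core contributions); World Models has low relevance (the emphasis is on forecasting rather than world models). No designated experts are in the author list.

Keywords

Multimodal Time-Series Forecasting, Post-Training, LLM Revisors, TSFM Priors, RLVR, Context-Guided, Foundation Models

In-Depth Analysis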

Chinese Title: 重新思考多模态时间序列预测的后训练方法

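Summary: Time-series foundation models (TSFMs) cannot handle multimodal, non-numerical context. To address this, the paper proposes a post-training recipe, PostTime, which uses a large language model (LLM) as a context-guided revisor that corrects forecasts on top of the TSFM's strong numerical prior. PostTime combines supervised fine-tuning (SFT) with reinforcement learning with verifiable rewards (RLVR) and generates reasoning traces automatically. With Gemma-3-4B as the LLM and TimesFM-2.5 as the TSFM, it significantly outperforms standalone TSFMs, LLM-only baselines, and existing multimodal forecasting methods on the TimesX benchmark. The study indicates that positioning the LLM as a revisor rather than a direct forecaster is more effective, and that post-training is the key to unlocking its revision ability.

Innovations:

Proposes using the LLM as a context-guided revisor rather than a direct forecaster, correcting the TSFM's strong numerical prior.
Designs a complete post-training recipe (PostTime) with two training stages, SFT and RLVR.
Develops a method that generates reasoning traces automatically, without human intervention, using a frontier LLM to produce high-quality SFT data.
Builds an RLVR reward function suited to multimodal forecasting that reinforces successful revision strategies.
Systematically rethinks the design choices in post-training (the LLM's role, SFT data construction, the RL objective) and validates them with ablations.

Methodology: The paper uses a two-stage post-training method. Supervised fine-tuning (SFT) comes first: automatically generated reasoning traces train the LLM to produce valid forecast interventions (revise, preserve, or ignore the TSFM prior). Reinforcement learning with verifiable rewards (RLVR) follows, further strengthening the policy with GRPO optimization and rule-based rewards. In the implementation, Gemma-3-4B is the LLM, TimesFM-2.5 provides the numerical prior, and Gemini-3.1-Flash-Lite generates the reasoning traces automatically. Evaluation is on the TimesX dataset with a rolling-window split, using MAE and MSE and their normalized versions as metrics.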
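
The revisor idea, where the LLM emits an intervention on top of the TSFM prior and a rule-based reward checks it against the realized values, can be sketched on a toy series. The intervention format, the additive adjustment, and the binary "did MAE improve" reward are all hypothetical stand-ins; the paper's actual output schema and reward are not given in this report.

```python
def mae(forecast, target):
    """Mean absolute error between two equal-length series."""
    return sum(abs(f - t) for f, t in zip(forecast, target)) / len(target)


def apply_intervention(prior, decision, adjustment=None):
    """Hypothetical intervention format.

    "preserve": return the TSFM prior unchanged.
    "revise":   add a per-step adjustment to the prior.
    """
    if decision == "preserve":
        return list(prior)
    if decision == "revise":
        return [p + a for p, a in zip(prior, adjustment)]
    raise ValueError(f"unknown decision: {decision}")


def improvement_reward(prior, revised, target):
    """Hypothetical verifiable reward: 1.0 only if the revision lowers MAE."""
    return 1.0 if mae(revised, target) < mae(prior, target) else 0.0


# Toy example: the numerical prior is flat, but context (e.g. an announced
# event) suggests a rise that the realized series confirms.
prior = [10.0, 10.0, 10.0]
target = [10.0, 12.0, 14.0]

revised = apply_intervention(prior, "revise", adjustment=[0.0, 2.0, 4.0])
kept = apply_intervention(prior, "preserve")

print(mae(prior, target), mae(revised, target))
print(improvement_reward(prior, revised, target),
      improvement_reward(prior, kept, target))
```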

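Key Results:

With the revision strategy (LLM + TSFM prior), 6 of 8 LLMs improve on nMAE and 7 of 8 improve on nMSE, and the valid-window rate rises from 0.77 to 0.96.
PostTime significantly outperforms standalone TSFMs, LLM-only baselines, zero-shot revision, and existing multimodal forecasting methods in both ID and OOD evaluation.
Zero-shot revision is not enough to improve performance reliably; post-training (SFT + RLVR) is the key to unlocking the LLM's revision ability.
Using the LLM as a direct forecaster works poorly, whereas using it as a revisor makes better use of the TSFM's numerical strength.

Tech Stack:

Gemma-3-4B (LLM)
TimesFM-2.5 (TSFM)
Gemini-3.1-Flash-Lite (frontier LLM used to generate reasoning traces)
GRPO (Group Relative Policy Optimization, used for RLVR)
SFT (Supervised Fine-Tuning)
RLVR (Reinforcement Learning with Verifiable Rewards)
MAE (Mean Absolute Error)
MSE (Mean Squared Error)
nMAE (normalized MAE)
nMSE (normalized MSE)
Chain-of-Thought (CoT) reasoning

Strengths:

Systematically rethinks the post-training design space for multimodal time-series forecasting and gives clear grounds for each decision.
Proposes an automated reasoning-trace generation method that needs no human intervention and scales well.
Validates each design choice with extensive ablations.
Confirms the advantage of the revision strategy across multiple LLMs, so the results are robust.
Combines the numerical ability of the TSFM with the semantic understanding of the LLM, so the two complement each other.

Limitations:

Only the combination of Gemma-3-4B and TimesFM-2.5 is evaluated, so generalization needs validation on more models.
The dataset covers only 99 variables over 2022-2025, so domain diversity is limited.
The quality of the automatically generated reasoning traces depends on the frontier LLM and may include noise or bias.
Applicability to multivariate time series or richer multimodal inputs (such as images or audio) is not explored.
The RLVR reward function is fairly simple and may not capture the finer strategies of forecast revision.

Relevance To Keywords:

Multimodal large models: The paper's core is multimodal forecasting that fuses numerical time series with textual context, with the LLM acting as the multimodal interpreter.
Post-training: The paper focuses on a post-training recipe for LLMs (SFT + RLVR), applying post-training to the time-series domain.
Reinforcement learning: The RLVR stage is optimized with GRPO, and reinforcement learning is the key step for improving the policy.
World models: Time-series forecasting can be seen as modeling the dynamics of the world, and correcting a prior with context carries an implicit world-model idea.
Representation learning: Through post-training, the LLM learns how to represent the interaction between the numerical prior and the textual context.
Model-based RL: RLVR optimizes the LLM policy with GRPO against verifiable, rule-based rewards, which is model-free policy optimization rather than model-based RL in the usual sense, so relevance is moderate at most.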

12. On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training (PASS)

Score: 57.0 (pass threshold: 27.8)

Authors: Xueqing Wu, Yu-Chi Lin, Kai-Wei Chang, Nanyun Peng

Published: 2026-05-28

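TL;DR: The paper diagnoses the perception-reasoning asymmetry in vision-language model post-training and proposes loss reweighting and a perception-aware reward to balance optimization and improve end-to-end performance.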

摘要翻译

后训练（Post-training）极大地改进了前沿视觉 - 语言模型（vision-language models）的推理能力，但其对感知（perception）的提升相对有限，从而构成了端到端视觉推理（end-to-end visual reasoning）的瓶颈。为探究这一差距，我们提出一个受控诊断框架，包含两个能够解耦感知与推理的合成任务。我们的分析揭示了一致的感知 - 推理不对称性（perception-reasoning asymmetry）：后训练对推理的提升幅度显著大于感知，尽管其底层机制因训练范式而异。在监督微调（SFT）中，这种不对称性源于思维链（chain-of-thought）监督中的标记不平衡：感知任务占用的标记更少，因而接收到的训练信号较弱。动态重加权损失（dynamically reweighting the loss）可缓解此不平衡，并将端到端性能提升高达 18.2 点。而在强化学习（RL）中，不对称性则源于奖励耦合（reward coupling）：结果奖励与推理的相关性强于与感知的相关性，从而削弱了感知学习的信号。引入感知感知奖励（perception-aware reward）可缓解此不平衡，并将端到端准确率提升高达 6.0 点；即使没有真实感知奖励，可靠的替代奖励（surrogate reward）也能提供有用信号，带来 3.2 点的增益。综上所述，我们的结果全面诊断了非对称优化问题，并提出了平衡感知与推理的具体干预措施。

Abstract

Post-training has greatly improved reasoning in frontier vision-language models, yet its gains for perception remain comparatively limited, creating a bottleneck for end-to-end visual reasoning. To investigate this gap, we introduce a controlled diagnostic framework with two synthetic tasks that disentangle perception from reasoning. Our analysis reveals a consistent perception-reasoning asymmetry: post-training improves reasoning more substantially than perception, though the underlying mechanism differs by training paradigm. For supervised fine-tuning (SFT), this asymmetry stems from token imbalance in chain-of-thought supervision, where perception occupies fewer tokens and thus receives a weaker training signal. Dynamically reweighting the loss mitigates this imbalance and boosts end-to-end performance by up to 18.2 points. For reinforcement learning (RL), the asymmetry instead arises from reward coupling: outcome rewards correlate more strongly with reasoning than with perception, weakening the signal for perception learning. Adding a perception-aware reward alleviates the imbalance and improves end-to-end accuracy by up to 6.0 points; even without ground-truth perception rewards, a reliable surrogate reward provides a useful signal, yielding gains of 3.2 points. Together, our results comprehensively diagnose asymmetric optimization and suggest concrete interventions to balance perception and reasoning.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5

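Scoring rationale: The paper focuses on post-training optimization of vision-language models (MLLM/MultiModal), so these two keywords receive the highest relevance (9.0). The study concerns the balance between perception and reasoning, which relates to Visual Encoder (perception) and Tokenizer (the token imbalance it discusses) but not as architectural contributions, hence the mid-range scores (4.0-6.0). The paper covers reinforcement learning (RL) but not model-based RL, and it does not discuss World Models, so both score low (2.0-3.0). The weighted total is 57.0, well above the dynamic passing score of 27.8. The author list does not include Yang Shi or the other designated experts.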
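The score column in the table above appears to be weight times relevance, with the total being the sum over keywords. The short check below reproduces the reported total under that assumption; how the report derives the 27.8 threshold is not stated, so it is treated here as a given constant.

```python
# Relevance scores (out of 10) copied from the table above.
relevance = {
    "Unify Models": 6.0,
    "Tokenizer": 4.0,
    "Visual Encoder": 5.0,
    "World Models": 2.0,
    "MLLM": 9.0,
    "MultiModal": 9.0,
    "model-based RL": 3.0,
}
WEIGHT = 1.5           # every keyword in the table uses the same weight
PASS_THRESHOLD = 27.8  # "dynamic passing score" quoted in the rationale (taken as given)


def weighted_total(scores, weight=WEIGHT):
    """Sum of weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * r for r in scores.values())


total = weighted_total(relevance)
passed = total >= PASS_THRESHOLD
print(f"total={total:.1f}, passed={passed}")
```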

Keywords

Vision-Language Model, Post-Training, Reasoning, Perception, Asymmetric Optimization, Reinforcement Learning, Supervised Fine-tuning

In-Depth Analysis

Chinese Title: 视觉语言模型后训练中推理与感知的不对称优化研究

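Summary: This paper addresses the observation that post-training of vision-language models (VLMs) substantially improves reasoning while improving perception only modestly. It proposes a controlled diagnostic framework that decouples perception from reasoning through two synthetic tasks, graph coloring and Sudoku. The study finds that in supervised fine-tuning (SFT) the asymmetry stems from token imbalance in chain-of-thought supervision (perception tokens make up only 2.5% of tokens), whereas in reinforcement learning (RL) it stems from reward coupling (outcome rewards correlate far more strongly with reasoning than with perception). For SFT, dynamic loss reweighting improves end-to-end performance by up to 18.2 points; for RL, adding a perception-aware reward improves end-to-end accuracy by up to 6.0 points, and even a surrogate reward yields a 3.2-point gain. The study shows that asymmetric optimization arises through different mechanisms under the two training paradigms and proposes concrete mitigation strategies.

Innovations:

Proposes a controlled diagnostic framework that fully decouples perception from reasoning through synthetic tasks, supporting direct perception evaluation and counterfactual reasoning evaluation.
Provides the first systematic identification of two distinct mechanisms behind the perception-reasoning asymmetry in VLM post-training: token imbalance in SFT and reward coupling in RL.
Proposes dynamic loss reweighting for SFT, which improves end-to-end performance by up to 18.2 points without manual tuning.
Proposes perception-aware reward augmentation for RL; even without ground-truth perception rewards, a surrogate reward provides a useful signal.
Reveals a smooth trade-off between perception and reasoning and identifies the optimal balance point.

Methodology: The paper uses controlled synthetic tasks (graph coloring and Sudoku) as a testbed, rendering images programmatically while retaining the textual input as the gold-standard perception representation. Experiments use two open-source VLM families, Qwen3-VL and InternVL3.5. For SFT, the loss is decomposed to analyze token imbalance, and dynamic multi-task balancing methods (such as GradNorm) are used as the intervention. For RL, the GRPO algorithm is used; reward coupling is verified by computing the correlation of the reward with perception and with reasoning, and a weighted perception reward is introduced. Counterfactual reasoning evaluation (sampling the reasoning conditioned on the gold-standard perception) is used to isolate reasoning ability.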
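To make the token-imbalance argument concrete, the sketch below compares a plain token-mean loss with a segment-balanced loss on a toy chain of thought in which 1 of 40 tokens (2.5%) is a perception token. This is a simplified illustration, not GradNorm and not the paper's implementation: the fixed weight `w_perception`, the function names, and the toy loss values are all assumptions.

```python
import numpy as np


def token_mean_loss(token_losses):
    """Standard SFT objective: average the per-token loss over all tokens."""
    return float(np.mean(token_losses))


def segment_balanced_loss(token_losses, is_perception, w_perception=0.5):
    """Average each segment separately, then mix with a fixed weight (assumed)."""
    token_losses = np.asarray(token_losses, dtype=float)
    is_perception = np.asarray(is_perception, dtype=bool)
    perception = token_losses[is_perception].mean()
    reasoning = token_losses[~is_perception].mean()
    return float(w_perception * perception + (1.0 - w_perception) * reasoning)


# Toy example: one high-loss perception token, 39 low-loss reasoning tokens.
losses = np.array([2.0] + [0.5] * 39)
mask = np.array([True] + [False] * 39)

plain = token_mean_loss(losses)
balanced = segment_balanced_loss(losses, mask)
perception_share = float(losses[mask].sum() / losses.sum())
print(plain, balanced, round(perception_share, 3))
```

Under the token mean, the perception token contributes under a tenth of the total loss even though its own loss is four times larger; the balanced variant gives it half the objective. A dynamic scheme such as GradNorm would adapt that weight during training instead of fixing it.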

Key Results:

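Post-training improves reasoning substantially more than perception, making perception the main bottleneck for end-to-end visual reasoning.
In SFT, perception tokens account for only 2.5% of tokens and contribute 1.3% of the loss and 8.5% of the gradient norm, so the learning signal for perception is weak.
Dynamic loss reweighting (such as GradNorm) improves end-to-end performance by up to 18.2 points.
In RL, the correlation coefficient between the outcome reward and reasoning correctness is 0.65-1.00, whereas that with perception correctness is only 0.34-0.43.
Adding a perception reward (α=0.5) improves end-to-end accuracy by up to 6.0 points; a surrogate reward (such as perception length) yields a 3.2-point gain.
Perception and reasoning trade off against each other: pushing perception too far weakens reasoning, and the best end-to-end performance occurs with a moderate perception improvement.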
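The reward-coupling result can be illustrated with GRPO-style group-relative advantages. In the sketch below, the convex mix `(1 - alpha) * outcome + alpha * perception` is an assumed form for the perception-aware reward (the report only gives α=0.5), and the four rollouts are invented toy data.

```python
import numpy as np


def mixed_reward(outcome, perception, alpha=0.5):
    """Blend outcome and perception rewards (assumed convex combination)."""
    outcome = np.asarray(outcome, dtype=float)
    perception = np.asarray(perception, dtype=float)
    return (1.0 - alpha) * outcome + alpha * perception


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Four rollouts for one prompt. Rollout 1 perceives the image correctly
# but reasons incorrectly, so its outcome reward is 0.
outcome = np.array([1, 0, 0, 0])
perception = np.array([1, 1, 0, 0])

adv_outcome = group_relative_advantages(outcome)
adv_mixed = group_relative_advantages(mixed_reward(outcome, perception))
print(adv_outcome.round(3), adv_mixed.round(3))
```

With the outcome reward alone, rollout 1 receives exactly the same advantage as the rollouts that also misperceived the image, so correct perception earns no credit; with the mixed reward it is ranked above them.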

Tech Stack:

Qwen3-VL
InternVL3.5
GRPO (Group Relative Policy Optimization)
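GradNorm (dynamic multi-task balancing method)
Chain-of-Thought (CoT)
Cross-Entropy Loss
Counterfactual Reasoning Evaluation
Pearson Correlation Coefficient
Programmatic Image Rendering

Strengths:

Provides a clean, controlled experimental framework that decouples perception from reasoning and makes attribution analysis straightforward.
Systematically identifies the different mechanisms behind asymmetric optimization under SFT and RL, giving the work theoretical depth.
The proposed interventions (dynamic loss reweighting, perception reward) are simple and effective and need little manual tuning.
Experiments cover two VLM families and two tasks, which supports the generality of the results.
The perception-reasoning trade-off it reveals offers useful guidance for designing future VLM post-training.

Limitations:

Experiments are limited to synthetic tasks; the asymmetry may be more complex in real-world visual tasks such as natural image understanding.
Perception reward augmentation requires extra perception annotations or surrogate design, which may be costly to obtain in real settings.
Asymmetry in other post-training paradigms (such as DPO and PPO) is not explored.
Dynamic loss reweighting methods (such as GradNorm) may carry a sizable computational overhead.
The root cause of the perception-reasoning trade-off is not analyzed in depth; the study stays at the level of mechanisms.

Relevance To Keywords:

Native multimodal large models: The paper studies vision-language models (VLMs), which fall within the scope of native multimodal large models.
Unified understanding and generation in multimodal large models: The paper examines VLM perception (understanding) and reasoning (generation) and explores how to balance their optimization.
Representation learning: Perception involves extracting representations of visual information, but the paper does not study representation learning itself in depth.
World models: The synthetic tasks (graph coloring, Sudoku) can be viewed as simple world models, but the paper does not discuss world models directly.
Reinforcement learning: The paper closely studies reward coupling in RL (GRPO), so it is highly relevant.
Post-training: The paper's core topic is asymmetric optimization in VLM post-training (SFT and RL).
Unify Models, Model-Based RL: The paper does not address model unification or model-based RL, so relevance is weak.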

13. Unveiling the Visual Counting Bottleneck in Vision-Language Models [PASS]

Score: 57.0 / 27.8

Authors: Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan

Published: 2026-05-28

TL;DR: This paper investigates why Vision-Language Models fail at visual counting extrapolation, finding the bottleneck lies in symbolic mapping rather than perception, suggesting the need for unified representations.

摘要翻译

尽管大型视觉 - 语言模型（VLMs）在插值方面表现出色，但在系统性泛化方面却遭遇灾难性失败，尤其在视觉计数任务上。本研究通过将视觉计数分解为三个认知阶段——视觉个体化、量级意识和符号映射——来探究这一外推瓶颈。利用合成围棋棋盘和线性探针，我们发现视觉骨干网络在外推范围内仍能保持强健的、线性可分的数量表示，从而排除了感知失败的可能性。此外，模型保留了潜在的量级意识，即便在无法枚举具体数量的情况下，也能成功进行数量比较推理。我们将崩溃点定位在符号映射阶段，即模型无法将有效的视觉量级正确映射到符号标记上。我们的发现支持“破碎量级假说”：VLMs 未能习得通用数字空间，而是学习不相交的、模态特定的统计流形，这阻碍了对未见数量的跨模态接地。在最新的基础模型上验证，我们的结果表明，弥合这一差距需要引入强制统一表示的归纳偏置，仅靠数据规模扩大是远远不够的。

Abstract

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on Vision-Language Models (MLLM, MultiModal) and analyzes Visual Encoders, justifying high scores. It discusses symbolic mapping involving tokens (Tokenizer) and concludes on unified representations (Unify Models), warranting moderate scores. World Models and model-based RL are irrelevant to the content. None of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Vision-Language Models, Visual Counting, Symbolic Mapping, Systematic Generalization, Visual Backbones, Extrapolation Bottleneck, Unified Representations

In-Depth Analysis

Chinese Title: 揭示视觉语言模型中的视觉计数瓶颈

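Summary: This paper systematically studies the failure of vision-language models (VLMs) to generalize systematically on visual counting. The authors decompose visual counting into three cognitive stages: visual individuation (perceiving objects), magnitude awareness (understanding quantity), and symbolic mapping (outputting a numeral). They build a synthetic Go-board dataset and a lightweight toy VLM (ViT + GPT-2), apply a decoupled training curriculum (language pretraining up to 99, visual alignment only up to 49), and use linear probes to extract hidden quantity representations. They find that the visual backbone maintains linearly separable quantity representations beyond the training distribution (50-120) and that the model can still perform comparative reasoning over quantities; the failure occurs at the symbolic mapping stage, where the model cannot map visual quantities to the correct symbolic tokens. The authors propose the "fractured magnitude hypothesis": VLMs fail to learn a universal cross-modal number space and instead learn separate, modality-specific statistical manifolds. The conclusion is validated on real large models such as Qwen3-VL-32B, indicating that data scaling alone cannot remove the bottleneck and that inductive biases enforcing unified representations are needed.

Innovations:

Decomposes visual counting failure into three cognitive stages (visual individuation, magnitude awareness, symbolic mapping) and localizes the bottleneck to the symbolic mapping stage.
Builds a synthetic Go-board laboratory that strictly controls the training distribution and decouples visual from language training, enabling precise analysis of distribution shift.
Finds that the visual backbone keeps linearly separable quantity representations beyond the training distribution and that the model can compare quantities it cannot enumerate, exposing the break between perception and symbolic mapping.
Proposes the "fractured magnitude hypothesis": VLMs learn separate modality-specific quantity manifolds rather than a universal number space, which explains the failure of cross-modal generalization.
Validates the mechanism on a state-of-the-art open-source VLM (Qwen3-VL-32B), showing that the bottleneck also exists in real large-scale models.

Methodology: The study uses a two-stage experimental design. (1) Synthetic laboratory: a lightweight toy VLM (2-layer ViT encoder + 2-layer GPT-2 decoder) is trained on 19×19 Go-board images (black stones are targets, white stones are distractors) with a decoupled training curriculum (language pretraining up to N=99, visual alignment only up to N=49), and counting accuracy is evaluated in three ranges: in-distribution (0-49), visual extrapolation (50-99), and full extrapolation (100-120). (2) Real-model validation: Qwen3-VL-32B is tested zero-shot with CoT on 6×6 Go boards. The core diagnostic tool is the linear probe: a binary linear probe is trained on the vision-encoder outputs to detect whether each grid position holds a black stone, the per-position predictions are aggregated into a hidden count NH, and visual error and language error are quantified by comparing NH with the true count NG and the predicted count NP.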
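The probe-based diagnosis described in the methodology can be sketched as follows. The aggregation rule (count the cells whose probe logit exceeds a threshold) and the two error definitions, |NH - NG| for vision and |NP - NH| for language, are assumptions chosen to match the description; the probe outputs here are synthetic stand-ins rather than real model activations.

```python
import numpy as np


def hidden_count(cell_logits, threshold=0.0):
    """Aggregate per-cell probe logits into a hidden count N_H (assumed rule)."""
    return int((np.asarray(cell_logits) > threshold).sum())


def decompose_errors(n_hidden, n_true, n_pred):
    """Split the counting error into a visual part and a language part (assumed definitions)."""
    return {
        "visual_error": abs(n_hidden - n_true),    # |N_H - N_G|
        "language_error": abs(n_pred - n_hidden),  # |N_P - N_H|
    }


# Synthetic 19x19 board: the probe fires on 60 cells, i.e. in the
# visual-extrapolation range (50-99) described above.
logits = -np.ones((19, 19))
logits.flat[:60] = 1.0

n_h = hidden_count(logits)
# Hypothetical model output: the decoder answers 49 although 60 stones are present.
errors = decompose_errors(n_h, n_true=60, n_pred=49)
print(n_h, errors)
```

A small visual error paired with a large language error is the pattern the report attributes to a symbolic-mapping failure rather than a perceptual one.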

Key Results:

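The toy VLM generalizes perfectly to N=99 on text counting, but visual counting accuracy collapses to near zero once N>49, a "baseline paradox".
Linear probes show that in the VE (50-99) and FE (100-120) ranges the visual backbone still predicts the hidden count NH accurately (with very small error), showing that visual individuation is preserved.
On a quantity comparison task (judging whether a visual set and a text list have the same count), the model performs well in the VE and FE ranges, showing that magnitude awareness is preserved.
Symbolic mapping fails: there is a large gap between NH and NP, and the model cannot map the correct visual quantity to the correct numeral.
The same mechanism is confirmed on Qwen3-VL-32B: visual perception and magnitude awareness show slight noise, but the main failure still stems from the break in symbolic mapping.

Tech Stack:

Vision Transformer (ViT-Base, 2 layers)
GPT-2-style causal Transformer decoder (2 layers)
Linear probe (binary classification)
Synthetic Go-board dataset (19×19, 6×6)
Decoupled training curriculum (language pretraining + visual alignment)
Greedy decoding
Chain-of-Thought (CoT) prompting
Exact-match accuracy evaluation

Strengths:

Systematically decomposes counting failure into independently testable cognitive stages and localizes it precisely.
Uses a synthetic environment to control variables strictly and exclude the confounds of natural images, giving a rigorous experimental design.
Linear probing is simple and effective and directly quantifies the linear separability of internal representations.
Validating the conclusions on a real large model strengthens their external validity.
The "fractured magnitude hypothesis" offers a theoretical framework for understanding cross-modal generalization failure.

Limitations:

The synthetic Go-board environment is highly simplified and may not fully reflect the complexity of object recognition in natural images (such as occlusion and semantic ambiguity).
Only counting is studied; other systematic generalization tasks (such as spatial relations and compositional reasoning) may have different bottlenecks.
The toy VLM is small (2-layer ViT + 2-layer GPT-2), so the conclusions may partly change for larger models.
Linear probes measure only linear separability and do not explore what nonlinear representations might hold.
The paper diagnoses the problem but proposes no concrete solution or architectural improvement.

Relevance To Keywords:

Native multimodal large models: The paper directly studies the visual counting ability of VLMs, a core capability evaluation for multimodal large models.
Representation learning: Linear probes are used to analyze the internal representations of the vision and language modules, revealing modality-specific manifolds and the absence of a universal number space.
World models: Counting involves perceiving and reasoning about the number of objects, which relates to scene understanding in world models.
Model-based reinforcement learning / post-training: The paper does not involve RL directly, but it argues that data scaling is insufficient, implying a need for new training paradigms (such as contrastive learning or cross-modal alignment) to unify representations, which relates to post-training.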

14. Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning [PASS]

Score: 57.0 / 27.8

Authors: Yaowu Fan, Tao Han, Dazhao Du, Andy J. Ma, Jia Wan

Published: 2026-05-28

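TL;DR: The paper proposes VisHarness, a framework in which a reinforcement-learning agent orchestrates heterogeneous visual experts for multi-turn visual reasoning, retaining general intelligence while matching the performance of task-specific models.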

摘要翻译

近期计算机视觉领域的进展催生了一系列强大的专用模型，用于检测、分割、计数及其他视觉任务。然而，这些模型通常针对孤立的任务形式进行优化，难以直接支持通用视觉智能，尤其在任务需要复杂语言理解和密集小目标感知时。本文提出 VisHarness，一种可训练的视觉智能体，它将高层感知、推理和决策与底层任务执行解耦。VisHarness 并非通过训练单一模型来解决特定的视觉任务，而是学习利用一组精心设计的异构视觉专家（heterogeneous visual experts）。这种范式保留了智能体的通用智能，同时充分利用了专用视觉模型在具体视觉任务中的精度优势。仅需轻量级训练，VisHarness 即可学习一种可泛化的视觉专家调度策略，并通过与视觉专家模型的多轮交互，在各种复杂条件下解决常见的基础视觉任务。为了在在线环境中实现高效的同策略（on-policy）强化学习训练，我们引入了动态视觉记忆归档，该机制缓解了因与视觉专家模型进行多轮交互而迅速累积的视觉 token（visual-token）开销。在涵盖推理分割、广义指代分割、密集小目标检测和指代计数四个代表性基准上的实验表明，VisHarness 显著优于现有的通用模型，并在性能上与任务特定模型相当或更优。

Abstract

Recent progress in computer vision has produced a wide range of powerful specialized models for detection, segmentation, counting, and other visual tasks. However, these models are usually optimized for isolated task formulations, making it difficult to directly support general-purpose visual intelligence, especially when a task requires complex language understanding and dense small-object perception. In this paper, we propose VisHarness, a trainable visual agent that decouples high-level perception, reasoning, and decision-making from low-level task execution. Instead of training a model to solve a specific visual task, VisHarness learns to harness a set of carefully designed heterogeneous visual experts. This paradigm preserves the general intelligence of the agent while fully leveraging the precision advantages of specialized visual models in concrete visual tasks. With only lightweight training, VisHarness learns a generalizable visual expert-harnessing policy and can solve common fundamental vision tasks under various complex conditions through multi-turn interactions with visual expert models. To enable efficient on-policy reinforcement learning training in a live environment, we introduce dynamic visual memory archiving, which mitigates the rapidly accumulating visual-token overhead caused by multi-turn interactions with visual expert models. Experiments on four representative benchmarks covering reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting demonstrate that VisHarness substantially outperforms existing general-purpose models and achieves competitive or superior performance compared with task-specific models.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	5.0/10	7.5

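Scoring rationale: The paper's core is an agent that orchestrates heterogeneous visual experts for multi-turn reasoning. Unify Models and MultiModal are highly relevant, corresponding to unified expert orchestration and the combination of vision and language, respectively; MLLM and World Models are moderately relevant, reflecting the visual reasoning tasks and the agent's interaction with its environment; model-based RL is moderately relevant because the policy is trained with reinforcement learning; Tokenizer and Visual Encoder are of low relevance, since the paper only mentions visual-token overhead and uses existing encoders, which are not core contributions. None of the designated experts appear in the author list.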

Keywords

Heterogeneous Experts, Multi-Turn Visual Reasoning, Reinforcement Learning, Visual Agent, Dynamic Visual Memory, General-Purpose Visual Intelligence, Task Execution

In-Depth Analysis

Chinese Title: 训练智能体而非专家：学习利用异构专家进行多轮视觉推理

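Summary: This paper proposes VisHarness, a trainable visual agent that solves a variety of fundamental vision tasks by harnessing a set of carefully designed heterogeneous visual expert models through multi-turn interaction. VisHarness decouples high-level perception, reasoning, and decision-making from low-level visual task execution and learns a general expert-harnessing policy from only a small amount of training data. To support on-policy reinforcement learning, it introduces dynamic visual memory archiving, which reduces the visual-token overhead that accumulates over multi-turn interaction. Experiments on four benchmarks (reasoning segmentation, generalized referring segmentation, dense small-object detection, and referring counting) show that VisHarness substantially outperforms existing general-purpose models and matches or exceeds task-specific models.

Innovations:

Proposes VisHarness, which decouples high-level visual perception, reasoning, and decision-making from low-level visual task execution and learns a general expert-harnessing policy from only a small amount of data.
Introduces dynamic visual memory archiving, which sharply reduces the visual context overhead of multi-turn interaction and makes on-policy reinforcement learning feasible.
Models the multi-turn interaction as a Markov decision process (MDP) and decomposes it into multiple single-turn MDPs to keep training consistent with inference.
Designs a suite of heterogeneous visual experts covering fundamental capabilities such as detection, segmentation, and counting, with support for parallel execution.
Trains the agent with online reinforcement learning so that it explores effective strategies on its own and makes fewer invalid or redundant expert calls.

Methodology: VisHarness is built on a multi-turn Markov decision process. At each time step, the agent selects an action (an expert name and its arguments) based on its current memory, and the environment runs the expert and returns visual and textual feedback. Dynamic visual memory archiving keeps only the latest visual result while preserving earlier information as text summaries. Training uses on-policy reinforcement learning (e.g., PPO) with trajectory-level rewards that guide the agent toward effective expert-calling strategies. The heterogeneous expert suite includes detection, segmentation, and counting models, each instantiated as multiple worker threads that run in parallel.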
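The archiving rule described above (keep only the latest visual result, keep earlier turns as text) can be sketched as a small memory class. The class and field names are hypothetical, and images are represented as raw bytes purely for illustration; the actual VisHarness data structures are not given in this report.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    action: str                    # expert name and arguments chosen by the agent
    text_feedback: str             # textual summary returned by the environment
    image: Optional[bytes] = None  # visual feedback; None once archived


@dataclass
class ArchivingMemory:
    turns: List[Turn] = field(default_factory=list)

    def add(self, action: str, text_feedback: str, image: Optional[bytes] = None) -> None:
        # When a new visual result arrives, archive older ones:
        # drop their pixels but keep their text summaries.
        if image is not None:
            for turn in self.turns:
                turn.image = None
        self.turns.append(Turn(action, text_feedback, image))

    def num_images(self) -> int:
        return sum(turn.image is not None for turn in self.turns)


memory = ArchivingMemory()
memory.add("detect(query='cups')", "found 3 boxes", image=b"img-1")
memory.add("segment(box=0)", "mask covers 12% of the image", image=b"img-2")
memory.add("count(mask=0)", "count = 3", image=b"img-3")
print(len(memory.turns), memory.num_images())
```

The number of images in context stays at one regardless of how many turns are taken, which is the property that keeps the visual-token cost of a rollout bounded.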

Key Results:

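On the generalized referring segmentation (GRES) benchmark, VisHarness surpasses all general-purpose models and performs on par with task-specific models.
On the reasoning segmentation benchmark, VisHarness clearly outperforms general-purpose models and approaches task-specific models.
On dense small-object detection and referring counting, VisHarness achieves competitive or better performance.
Online reinforcement learning effectively reduces redundant visual tool calls and improves efficiency.
A general policy is learned from only a small amount of training data, without large-scale task-specific data.

Tech Stack:

Multimodal large language model (MLLM) as the agent backbone
On-policy reinforcement learning (e.g., PPO)
Markov decision process (MDP) modeling
Dynamic visual memory archiving (text summaries + discarding old visuals)
Heterogeneous visual expert models: object detection (e.g., DETR), semantic segmentation (e.g., SAM), counting models (e.g., CounTR)
Parallel execution controller (load-balanced scheduling)

Strengths:

The decoupled design keeps the agent general while fully exploiting the precision of specialized models.
Generalizes to multiple vision tasks from only a small amount of training data, keeping training cost low.
Multi-turn interaction supports self-correction and decomposition of complex tasks, improving robustness.
Dynamic memory archiving relieves the visual-token overhead of long sequences and makes RL training feasible.
Strong results on several benchmarks validate the approach.

Limitations:

Depends on the quality and coverage of the heterogeneous expert models; if the experts fail in some scenarios, agent performance may be limited.
Multi-turn interaction can increase inference latency and needs optimization for real-time use.
Current experiments cover only fundamental vision tasks; performance in more complex open-ended settings (such as video understanding and 3D perception) has not been verified.
Dynamic memory archiving may lose some historical visual detail, which can hurt tasks that need fine-grained history.

Relevance To Keywords:

Unify Models, World Models, Representation Learning, Model-Based RL: Decoupling the high-level agent from low-level experts reflects the idea of unified models; the agent learning to harness experts can be seen as a tool-use capability within a world model; the reinforcement learning training is closely related to post-training.
Native multimodal large models, unified understanding and generation in multimodal large models: VisHarness builds its agent on an MLLM and uses its multimodal understanding for decision-making, but it does not address unified generation.
Representation learning: The paper does not study representation learning directly, but dynamic memory archiving involves compressed representations of visual information.
World models: The agent perceives environment feedback through multi-turn interaction, which can be seen as implicitly learning environment dynamics.
Reinforcement learning, post-training: The core method is on-policy reinforcement learning, which belongs to the post-training stage.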

15. BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models [PASS]

Score: 52.5 / 27.8

Authors: Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian

Published: 2026-05-28

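TL;DR: BORA is an offline-to-online reinforcement learning post-training framework for dexterous VLA models that markedly improves real-world robotic manipulation success rates through action-conditioned value guidance and residual adaptation.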

摘要翻译

视觉 - 语言 - 动作 (VLA) 模型已成为一种有前景的范式，用于将视觉语言理解扎根于真实世界的机器人操作中。然而，由于高维手部控制和累积执行误差，灵巧操作对于 VLA 策略仍然具有挑战性，这使得真实世界强化学习 (RL) 后训练对于弥合视觉扎根的动作生成与物理可靠的灵巧执行之间的差距至关重要。然而，高维灵巧探索在真实世界中经常引发时间不一致性、样本效率低下以及硬件风险。为了解决这些挑战，我们提出 BORA，这是一个为真实世界灵巧 VLA 模型设计的离线到在线强化学习 (RL) 后训练框架。在离线阶段，BORA 构建一个批评家 (Critic)，它以视觉语言模型 (VLM) 的认知 token 和动作块作为输入。这种设计使得动作条件价值引导成为可能，允许批评家评估灵巧手部动作，而不仅仅是基于单纯的视觉上下文。在随后的在线阶段，BORA 冻结 VLA 基础模型，并引入一个轻量级、人在回路 (HiL) 的块级残差适应机制，以减轻真实世界执行误差，并在实际物理环境中进一步纠正离线学习得到的意图。通过继承离线批评家并使用干预驱动奖励，BORA 有效纠正执行偏差并适应真实世界的物理差异，同时保留预训练策略作为稳定先验。在五个复杂真实世界灵巧任务上的广泛评估表明，BORA 显著优于纯模仿学习和传统解耦强化学习 (RL) 基线，在标准设置下平均成功率实现了 33% 的绝对提升，并在未见物体泛化上达到了 43% 的改进。

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. However, high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	5.0/10	7.5

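Scoring rationale: The paper's core is RL post-training of VLA models. MLLM and MultiModal are highly relevant (a VLA is essentially a multimodal large model); Unify Models is moderately relevant (bridging the offline and online stages); model-based RL is moderately relevant (RL is central, but the focus is on value guidance rather than an explicit model); Tokenizer, Visual Encoder, and World Models are of low relevance (not core contributions). The author list does not include any of the designated experts.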

Keywords

Vision-Language-Action (VLA), Offline Reinforcement Learning, Online Residual Adaptation, Dexterous Manipulation, Post-training, Action-conditioned Value, Human-in-the-Loop (HiL)

In-Depth Analysis

Chinese Title: BORA：弥合离线强化学习与在线残差自适应，实现真实世界灵巧VLA模型

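Summary: This paper proposes BORA, a two-stage post-training scheme for vision-language-action (VLA) models in dexterous manipulation, targeting the problems they face in real-world deployment: high-dimensional action spaces, visual occlusion, and offline-to-online distribution shift. The offline stage uses a consistency policy (generating action chunks in 1-3 steps) to truncate the computation graph and builds a critic that fuses the VLM's cognition tokens with action chunks, enabling value estimation grounded in physical interaction and avoiding overfitting to visual artifacts. The online stage freezes the VLA base and introduces a lightweight Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism that, by inheriting the offline critic and using intervention-driven rewards, corrects execution errors and adapts to physical variation safely and efficiently. On five real-world dexterous tasks, BORA improves average success rate by 33% (absolute) over pure imitation learning and traditional decoupled RL baselines, and improves unseen-object generalization by up to 43%.

Innovations:

Proposes an action-conditioned critic that fuses continuous action chunks with VLM cognition tokens, providing precise value guidance based on the physical consequences of execution and overcoming the overfitting caused by visual occlusion.
Designs a lightweight online residual adaptation mechanism: the VLA base is frozen and a Human-in-the-Loop (HiL) chunk-wise residual policy, combined with the inherited offline critic and intervention rewards, corrects real-world execution deviations safely and efficiently.
Builds BORA as a unified offline-to-online RL post-training framework, using a consistency policy to address credit assignment in generative action architectures and bridging offline intent learning and online physical execution through progressive optimization.
Achieves significant gains on real dexterous manipulation tasks: a 33% absolute increase in average success rate and up to a 43% improvement in unseen-object generalization.

Methodology: BORA has two stages. In the offline stage, a consistency policy serves as the action expert and generates continuous action chunks in 1-3 steps to truncate the computation graph; an action-conditioned critic takes VLM semantic tokens, action chunks, and positional embeddings as input and outputs per-step Q-values and V-values, with a shifted value bootstrap propagating credit within a chunk; the policy is optimized with IQL (Implicit Q-Learning). In the online stage, the offline-trained VLA base is frozen and a lightweight residual chunk policy π_res is introduced; the inherited offline critic provides stable value estimates, training combines sparse task rewards with human intervention signals, and the final action is a weighted combination of the base action and the residual action.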
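Two pieces of the methodology above lend themselves to a short numerical sketch: the expectile loss that IQL uses to fit its value function, and the composition of a frozen base action with a scaled residual. The expectile form is the standard IQL one; the composition rule `base + lam * residual` with clipping, the value of `lam`, and the expectile `tau=0.7` are assumptions for illustration, since the report only says the two actions are combined with a weight λ.

```python
import numpy as np


def expectile_loss(q, v, tau=0.7):
    """IQL value loss: asymmetric squared error |tau - 1(u < 0)| * u^2 with u = q - v."""
    u = np.asarray(q, dtype=float) - np.asarray(v, dtype=float)
    weight = np.where(u < 0, 1.0 - tau, tau)
    return float(np.mean(weight * u ** 2))


def compose_action(base_chunk, residual_chunk, lam=0.1, low=-1.0, high=1.0):
    """Frozen base action plus a scaled residual, clipped to the action range (assumed rule)."""
    action = np.asarray(base_chunk, dtype=float) + lam * np.asarray(residual_chunk, dtype=float)
    return np.clip(action, low, high)


# Underestimating Q (u > 0) is penalized more heavily than overestimating it when tau > 0.5.
loss_under = expectile_loss([1.0], [0.0])
loss_over = expectile_loss([0.0], [1.0])

# A two-dimensional toy action: the residual nudges the base action slightly.
action = compose_action([0.95, 0.0], [1.0, -1.0])
print(loss_under, loss_over, action)
```

Keeping `lam` small means the residual can only make bounded corrections around the pretrained policy, which matches the report's description of the base policy acting as a stable prior.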

Key Results:

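Across five real-world dexterous manipulation tasks, BORA achieves a 33% absolute increase in average success rate over pure imitation learning and traditional decoupled RL baselines.
In unseen-object generalization tests, BORA improves success rate by up to 43%.
Ablations confirm the contribution of each module: the action-conditioned critic, the consistency policy, and online residual adaptation.
BORA is more sample-efficient and safer than direct online fine-tuning and avoids catastrophic forgetting.

Tech Stack:

Consistency Policy
Implicit Q-Learning (IQL)
Action-Conditioned Critic
Human-in-the-Loop (HiL) residual adaptation
VLM (vision-language model) encoder
Diffusion / flow matching (as comparison baselines)
Sparse reward design (terminal success reward + time penalty)
Positional Embedding
Shifted Value Bootstrap

Strengths:

Addresses the high-dimensional action space and visual occlusion of dexterous manipulation with a novel action-conditioned critic that avoids interference from visual artifacts.
The offline-to-online framework combines the scale of offline data with the adaptivity of online interaction, and freezing the base while adapting a residual prevents catastrophic forgetting.
The consistency policy greatly shortens the computation graph for action generation, letting RL gradients propagate back to the VLM and easing credit assignment.
Five complex tasks are evaluated on a real robot, with results clearly better than existing methods and strong generalization.

Limitations:

Relies on human-in-the-loop intervention data, which may introduce human bias and requires some labor.
The offline stage needs a large amount of high-quality dexterous manipulation data, which is costly to collect.
The framework targets dexterous hand manipulation only; applicability to other robot types (such as mobile manipulation) is unverified.
Performance may be sensitive to the residual weight (λ) in online residual adaptation, which requires tuning.

Relevance To Keywords:

Unify Models: BORA unifies the VLM (vision-language model) and the action generation model within one RL post-training framework, integrating understanding and generation.
World Models: The paper does not build an explicit world model, but the action-conditioned critic implicitly models the consequences of physical interaction and can be viewed as an implicit world model.
Representation Learning: By fusing VLM cognition tokens with action chunks, the critic learns more robust representations and avoids visual artifacts.
Model-Based RL: BORA uses no explicit dynamics model, but the offline critic provides model-like value estimates, and online residual adaptation can be seen as compensating for the gap between model predictions and the real environment.
Post-training: The paper's core is post-training of VLA models, including offline RL and online fine-tuning, so it is highly relevant.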

16. LoMo: Local Modality Substitution for Deeper Vision-Language Fusion [PASS]

Score: 52.5 / 27.8

Authors: Feng Han, Zhixiong Zhang, Zheming Liang, Yibin Wang, Jiaqi Wang

Published: 2026-05-28

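TL;DR: To address the carrier sensitivity that training-data bias induces in vision-language models, this paper proposes Local Modality Substitution (LoMo), a data augmentation strategy that improves cross-modal representational consistency and multimodal reasoning performance.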

摘要翻译

视觉 - 语言模型（VLMs）在各类理解与推理任务中取得了显著进展，这得益于旨在实现多模态融合的大规模图像 - 文本训练。理想情况下，用渲染图像替换文本问题，模型性能应基本不受影响。然而，在实践中，这种模态替换会导致性能显著下降。我们将这种“载体敏感性”问题归因于当前训练语料中存在的固有偏差。在图像描述、VQA、OCR 及网络来源的交错数据等常见数据集中，文本与图像通常被组织成不同且不对称的角色，文本充当语言查询，图像充当视觉参考。此类数据偏差导致 VLMs 在不同模态的信息获取上表现出显著的偏好差异。因此，VLMs 无法在文本载体与视觉载体上对齐语义等价内容的表征，导致模型推理在模态替换下变得脆弱。为解决这一问题，我们提出局部模态替换（LoMo），这是一种轻量级、架构无关的数据策展范式，旨在为语义等价文本与图像载体之间的跨模态表征不变性提供监督。LoMo 通过将单模态提示重构为无缝交错的多模态序列来实现这一目标。它动态选择目标文本片段并将其转换为渲染图像，从而在“文本 - 视觉 - 文本”载体上保持相同的语义。在 13 种多样的多模态基准上进行的广泛实验表明，LoMo 显著提升了整体多模态推理能力，并实现了更深层的跨模态融合。具体而言，它在基础模型上带来了持续的提升，相较于标准监督微调（SFT），其在 LLaVA-OneVision-1.5-8B 上提升了 2.67 分，在 Qwen3.5-9B 上提升了 2.82 分。

Abstract

Vision-Language Models (VLMs) have achieved substantial progress across a wide range of understanding and reasoning tasks, driven by large-scale image-text training aimed at multimodal fusion. Ideally, replacing a textual question with its rendered-image counterpart should leave model performance essentially unaffected. In practice, however, such modality substitution induces dramatic performance degradation. We attribute this "carrier sensitivity" issue to an inherent bias in current training corpora. Across prevalent datasets such as image captioning, VQA, OCR, and web-sourced interleaved data, text and images are typically organized into distinct and asymmetric roles, with text serving as linguistic queries and images as visual references. Such data bias leads VLMs to exhibit distinct preferences for information acquisition across different modalities. Consequently, VLMs fail to align representations of semantically equivalent content across textual and visual carriers, making model reasoning fragile under modality substitution. To address this, we propose Local Modality Substitution (LoMo), a lightweight, architecture-agnostic data curation paradigm designed to provide supervision for cross-modal representational invariance between semantically equivalent text and image carriers. LoMo achieves this by reformulating single-modality prompts into seamlessly interleaved multimodal sequences. It dynamically selects target text spans and recasts them as rendered images, thereby preserving the same semantics across "text, visual, text" carriers. Extensive experiments across 13 diverse multimodal benchmarks demonstrate that LoMo significantly improves overall multimodal reasoning and yields deeper cross-modal fusion. Specifically, it delivers consistent gains across foundational models, improving over standard SFT by 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	1.0/10	1.5

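Scoring rationale: The paper's core is cross-modal fusion and reasoning in vision-language models (MLLM/MultiModal), so these two keywords score highest (9-10). Unify Models is moderately relevant (7) because the goal is unified cross-modal representations, although the method works on data rather than architectural unification. Tokenizer and Visual Encoder are of lower relevance (3-4) because the method is described as architecture-agnostic and does not modify the underlying components. World Models and model-based RL are unrelated to the paper's topic (multimodal understanding) and score 1.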

Keywords

Vision-Language Models, Modality Substitution, Cross-modal Fusion, Carrier Sensitivity, Data Curation, Representational Invariance, Multimodal Reasoning

In-Depth Analysis

Chinese Title: LoMo: 局部模态替换以实现更深入的视觉-语言融合

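Summary: The paper studies the "carrier sensitivity" problem in current vision-language models (VLMs): when a textual query is replaced with its rendered image, model performance drops markedly. The authors attribute this to the asymmetric roles of text and images in training data (text as instructions, images as visual references), which leads models to prefer different modalities for acquiring information and to fail to align the representations of semantically equivalent content across carriers. They propose LoMo (Local Modality Substitution), a lightweight, architecture-agnostic data curation paradigm. In three steps (structure-aware span localization, visual rendering, and perceptual distortion), LoMo dynamically replaces selected text spans in text-only instances with rendered images, forming interleaved "text → visual → text" sequences that implicitly supervise cross-carrier alignment under standard supervised fine-tuning (SFT). Experiments show that LoMo improves performance on 13 multimodal benchmarks, with average gains of 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B, while reducing cross-modal distance by 14.2% and strengthening cross-modal fusion.

Innovations:

Systematically diagnoses carrier sensitivity in VLMs and ties it to the cross-carrier modality gap caused by the asymmetric roles of text and images in training data.
Proposes the LoMo data curation paradigm, in which local modality substitution (rendering text spans as images) implicitly supervises cross-modal representational invariance with no architectural change or inference overhead.
Designs a structure-aware span localization step and a content-aware rendering pipeline so that the interleaved sequences remain semantically coherent and visually robust.
Adds a perceptual distortion step so that the model maintains robust cross-modal fusion under perceptually challenging conditions.
Validates the generality and effectiveness of LoMo on 13 multimodal benchmarks and shows improved cross-carrier representational consistency.

Methodology: LoMo has three consecutive stages. (1) Structure-aware span localization (S): a text-only instance is split into three parts along its semantic structure (such as formulas and sentence boundaries), and the middle span is selected as the target. (2) Visual rendering (R): the selected span is converted to an image through content-aware routing (a LaTeX renderer or a standard text renderer). (3) Perceptual distortion (A): semantics-preserving degradations (such as blur, noise, and JPEG compression) are applied to the rendered image. The distorted image is then placed back at the original position, forming an interleaved "text → image → text" sequence with the original supervision target unchanged. The paradigm plugs into any multimodal training pipeline and trains with the standard SFT objective.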
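The three stages above can be sketched as a small pipeline that turns a text-only prompt into an interleaved sequence. Everything here is a simplified stand-in: span localization is reduced to "pick the middle sentence", and `fake_render` and `fake_distort` are hypothetical placeholders that return dictionaries instead of real images, so the example shows only the data flow and not the paper's rendering or distortion.

```python
import re
from typing import Callable, List, Optional, Tuple, Union

Segment = Union[str, dict]  # a text piece, or a dict standing in for an image


def locate_span(text: str) -> Optional[Tuple[str, str, str]]:
    """Stage S (simplified): split at sentence boundaries and target the middle sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) < 3:
        return None  # too short to form a text-image-text sequence
    mid = len(sentences) // 2
    return " ".join(sentences[:mid]), sentences[mid], " ".join(sentences[mid + 1:])


def fake_render(span: str) -> dict:
    """Stage R placeholder: a real pipeline would rasterize the span to pixels."""
    return {"type": "image", "source_text": span, "distortions": []}


def fake_distort(image: dict) -> dict:
    """Stage A placeholder: record a semantics-preserving degradation."""
    return {**image, "distortions": image["distortions"] + ["blur"]}


def lomo_substitute(
    text: str,
    render: Callable[[str], dict] = fake_render,
    distort: Callable[[dict], dict] = fake_distort,
) -> List[Segment]:
    parts = locate_span(text)
    if parts is None:
        return [text]
    prefix, span, suffix = parts
    return [prefix, distort(render(span)), suffix]


prompt = "A train leaves at noon. It travels at 60 km/h. How far does it go in 2 hours?"
sequence = lomo_substitute(prompt)
print(sequence)
```

The supervision target (the answer to the prompt) is untouched; only the carrier of the middle span changes, which is what lets a standard SFT objective push the text and image forms toward the same representation.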

Key Results:

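LoMo consistently improves performance on 13 multimodal benchmarks (covering mathematical reasoning, VQA, OCR, document understanding, and visual perception): an average gain of 2.67 points on LLaVA-OneVision-1.5-8B and 2.82 points on Qwen3.5-9B.
Cross-modal distance analysis shows that LoMo reduces the average cosine distance between text and rendered images by 14.2%, indicating tighter cross-carrier alignment.
Carrier sensitivity experiments show that LoMo markedly narrows the performance gap between raw-text and rendered-image inputs, shrinking an accuracy drop that was as high as 21.23%.
LoMo improves downstream accuracy and representation alignment metrics at every data scale tested.
Modality Integration Rate (MIR) analysis further confirms that LoMo deepens cross-modal fusion.

Tech Stack:

Structure-aware span localization: a segmentation algorithm based on semantic structure (such as formulas and sentence boundaries)
Visual rendering: content-aware routing between a LaTeX renderer (for mathematical formulas) and a standard text renderer (for ordinary text)
Perceptual distortion: semantics-preserving degradations such as blur, noise, and JPEG compression
Standard supervised fine-tuning (SFT) objective
Cosine distance for evaluating cross-modal representation alignment
Modality Integration Rate (MIR) for quantifying fusion depth

Strengths:

The method is lightweight and architecture-agnostic, needs no change to the model or extra inference cost, and integrates easily into existing VLM training pipelines.
It tackles cross-modal alignment at the data level, avoiding complex decoding-time corrections or extra loss terms.
Local rather than global substitution preserves the original semantics and context, so the model learns local cross-carrier correspondences.
It yields consistent and significant gains on several mainstream VLMs (LLaVA-OneVision, Qwen3.5), showing strong generality.
It provides thorough diagnostic analysis (carrier sensitivity, modality distance, MIR) that confirms both the problem and the effectiveness of the method.

Limitations:

The method depends on the quality of the rendered images and the choice of perceptual distortions, and may be sensitive to particular rendering styles or degradation types.
It only performs local text-to-image substitution and does not explore image-to-text substitution or bidirectional alignment.
Experiments are mainly on 8B-9B models; the effect on larger models (such as 70B) has not been verified.
For long text or complex structured content (such as tables and code), the fidelity and readability of rendered images may be limited.
The effect of LoMo on generation tasks (such as image captioning and visual storytelling) is not discussed; the focus is on understanding tasks.

Relevance To Keywords:

Representation learning: LoMo directly targets cross-modal representational invariance, using data curation to align the representations of text and image carriers when they are semantically equivalent, which is highly relevant to the core goal of representation learning.
Native multimodal large models: LoMo aims to improve cross-modal fusion in multimodal large models and belongs to the optimization of their training.
Unified understanding and generation in multimodal large models: The paper mainly addresses understanding tasks (VQA, OCR, and so on), but the method could extend to generation tasks, so it is partly relevant.
World models: The paper does not involve causal reasoning or environment modeling in world models, so relevance is weak.
Reinforcement learning: The paper uses standard SFT and does not involve RL or RLHF, so relevance is weak.
Post-training: LoMo is a data curation method for the post-training stage and relates directly to post-training (SFT).
Unify Models: The paper does not discuss unified model architectures, so relevance is weak.
Model-Based RL: Not relevant.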

17. YoCausal: How Far is Video Generation from World Model? A Causality Perspective [PASS]

Score: 52.5 / 27.8

Authors: You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang

Published: 2026-05-28

TL;DR: YoCausal benchmark reveals that video diffusion models perceive temporal arrows but lack genuine causal cognition compared to humans, highlighting the gap between video generation and world models.

摘要翻译

随着视频扩散模型（VDMs）向世界模型演进，一个关键问题浮现：它们是否真正理解了因果关系，还是仅仅过度拟合了统计时间模式？现有的基准测试主要依赖合成数据，由于存在模拟到现实的差距，限制了其在现实世界中的泛化能力。我们提出了 YoCausal，这是一个受认知科学中“期望违背”（Violation of Expectation, VoE）范式启发的两级基准。通过以零成本对真实世界视频进行时间反转，将其作为自然反事实样本，YoCausal 建立了一个任意可扩展的评估协议。第一级引入了反向惊讶指数（Reverse Surprise Index, RSI），通过去噪损失量化时间之箭的感知。第二级引入了因果认知指数（Causality Cognition Index, CCI），利用视觉语言模型（VLM）对数据集进行分层，划分为因果和非因果子集，从而将真实的因果推理与时间偏差解耦。对 13 种最先进的 VDMs 的评估表明，感知时间之箭并不意味着理解因果关系，且相对于人类水平的因果认知，仍存在显著差距。

Abstract

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	9.0/10	13.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	3.0/10	4.5

Scoring rationale: World Models is highly relevant as it is the core theme of the title and abstract. MultiModal and MLLM are moderately relevant due to the use of Vision-Language Models in evaluation. Unify Models is moderately relevant regarding the convergence of generation and understanding. Tokenizer, Visual Encoder, and model-based RL are less relevant as they are not explicitly discussed in the paper's context.

Keywords

Video Generation, World Model, Causality, Benchmark, Video Diffusion Models, Causality Cognition Index, Reverse Surprise Index

In-Depth Analysis

Chinese Title: YoCausal：从因果视角看视频生成距离世界模型还有多远

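Summary: This paper proposes YoCausal, a benchmark for evaluating the causal cognition of video diffusion models (VDMs). Inspired by the Violation of Expectation (VoE) paradigm from cognitive science, it temporally reverses real videos at zero cost to produce counterfactual samples, yielding an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), which quantifies a model's perception of the arrow of time through denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which uses a vision-language model (VLM) to split the dataset into causal and non-causal subsets and thereby separates genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs shows that perceiving the arrow of time does not amount to understanding causality and that a significant gap to human causal cognition remains.

Innovations:

The first causal-cognition benchmark for video diffusion models, built on an extensible real-world dataset that removes the sim-to-real gap.
Inspired by the VoE paradigm from cognitive science, it uses temporal reversal to generate natural counterfactual samples without synthetic data or controlled recordings.
Proposes a two-level evaluation framework: RSI quantifies arrow-of-time perception, and CCI isolates genuine causal understanding through the difference between causal and non-causal subsets.
Provides a human-annotated upper-bound reference (1,200 videos), showing that current open-source VDMs lack causal understanding and pointing to the key gap on the way to world models.

Methodology: The authors build a real-video dataset with four thematic subsets (general, physics, human actions, and animal actions) and temporally reverse every video to obtain a counterfactual pair. Denoising loss serves as a proxy for the model's "surprise": the loss is computed on the forward and reversed videos after adding the same noise, and a higher loss on the reversed video means the model is more "surprised". Level 1's RSI is the fraction of videos whose reversed loss exceeds the forward loss. Level 2 uses a VLM (e.g., GPT-4V) to split the videos into causal and non-causal subsets, and CCI = RSI(causal subset) - RSI(non-causal subset), which removes temporal bias. Thirteen VDMs are evaluated and compared against a human baseline.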
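The two indices defined above reduce to simple counting once per-video denoising losses are available. The sketch below follows the definitions as stated in the methodology (RSI as the fraction of videos with higher reversed loss, CCI as the RSI difference between subsets); the loss values are invented toy numbers, and the function names are illustrative.

```python
def rsi(forward_losses, reverse_losses):
    """Reverse Surprise Index: fraction of videos whose reversed loss exceeds the forward loss."""
    pairs = list(zip(forward_losses, reverse_losses))
    return sum(rev > fwd for fwd, rev in pairs) / len(pairs)


def cci(causal_fwd, causal_rev, noncausal_fwd, noncausal_rev):
    """Causality Cognition Index: RSI on the causal subset minus RSI on the non-causal subset."""
    return rsi(causal_fwd, causal_rev) - rsi(noncausal_fwd, noncausal_rev)


# Toy denoising losses for four causal and four non-causal videos.
causal_fwd = [0.10, 0.12, 0.11, 0.09]
causal_rev = [0.15, 0.14, 0.10, 0.13]
noncausal_fwd = [0.10, 0.10, 0.10, 0.10]
noncausal_rev = [0.11, 0.09, 0.12, 0.08]

rsi_causal = rsi(causal_fwd, causal_rev)
rsi_noncausal = rsi(noncausal_fwd, noncausal_rev)
score = cci(causal_fwd, causal_rev, noncausal_fwd, noncausal_rev)
print(rsi_causal, rsi_noncausal, score)
```

An RSI near 0.5 on the non-causal subset means the model has no preferred direction there, so subtracting it removes whatever generic temporal bias the model carries and leaves only the surprise attributable to violated causality.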

Key Results:

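Advanced VDMs can perceive the arrow of time, and some show rudimentary causal cognition, but a significant gap to humans remains.
Perceiving the arrow of time does not amount to understanding causality.
Causal cognition is partly correlated with intuitive physics but unrelated to aesthetic quality.
Scaling up parameters and upgrading the architecture (such as UNet to DiT) improve causal cognition, suggesting that scaling laws apply to higher-order reasoning.

Tech Stack:

Video diffusion models (VDMs): UNet-based (AnimateDiff-SDXL and others), DiT-based (CogVideoX1.5-5B, Wan2.2-A14B, and others)
Denoising loss as a likelihood proxy
Vision-language model (VLM) for the causal/non-causal split
Temporal reversal to generate counterfactual samples
Reverse Surprise Index (RSI) and Causality Cognition Index (CCI)
Datasets: Moments in Time, Physics IQ, Kinetics, Animal Kingdom

Strengths:

The dataset is arbitrarily extensible, counterfactual samples come at zero cost, and the sim-to-real gap is removed.
The two-level framework cleanly separates arrow-of-time perception from causal understanding and avoids conflating them.
The evaluation covers 13 mainstream VDMs and includes a human baseline, giving a broad comparison.
It shows that causal cognition is unrelated to aesthetic quality, underscoring that the evaluation measures something distinct.

Limitations:

Temporal reversal only tests causal directionality and cannot cover more complex causal reasoning (such as multi-factor interactions).
Using a VLM to define the causal subset may introduce labeling bias and depends on the VLM's own ability.
Only the likelihood of generative models is evaluated; the causal correctness of generated videos is not tested directly.
The dataset currently covers four domains and needs to be extended to more scenarios (such as tool use and social interaction).

Relevance To Keywords:

World models: The paper directly evaluates the causal understanding of VDMs as world models and identifies the current gap.
Representation learning: Denoising loss implicitly probes the causal representations the model has learned.
Multimodal large models: A VLM is used for the causal/non-causal split, and the evaluated models include multimodal generative models.
Post-training: The paper does not involve post-training, but its results can guide later training strategies for improving causal cognition.
Reinforcement learning: Not directly involved, but world models and causal reasoning are central to model-based reinforcement learning.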

18. EarlyTom: Early Token Compression Completes Fast Video Understanding [PASS]

Score: 51.0 / 27.8

Authors: Hesong Wang, Xin Jin, Lu Lu, Chenhaowen Li, Jian Chen, Qiang Liu, Huan Wang

Published: 2026-05-28

TL;DR: EarlyTom proposes a training-free early-stage visual token compression framework inside the vision encoder to significantly reduce time-to-first-token and FLOPs for Video-LLMs while maintaining accuracy.

摘要翻译

视频大语言模型（Video-LLMs）已在视频理解任务中展现出强大的能力。然而，由于处理海量视觉 tokens 导致的效率低下，它们的实际部署仍受到阻碍。尽管近期方法在保持与全 token 基线相当准确性的同时实现了极低的 token 保留率，但它们大多仅在预填充（prefilling）的后期阶段执行压缩，导致视觉编码器的效率未被优化。本文首先指出，视觉编码过程占据了首 token 时间（TTFT）的很大一部分。因此，与其仅在视觉编码器之后压缩视觉 tokens，在编码器内部执行压缩仍留有巨大的探索空间。基于这一洞察，本文提出 EarlyTom，一种无需训练的 token 压缩框架，该框架在视觉编码器内部执行早期视觉 token 压缩，从而实现更显著的 TTFT 降低和更高的吞吐量。此外，我们还引入了一种解耦的空间 token 选择策略，以提升整体压缩效果。在单个 NVIDIA A100 GPU 上，针对 LLaVA-OneVision-7B 模型，EarlyTom 将 TTFT 降低高达 2.65 倍，FLOPs 减少高达 61%，同时保持与全 token 基线相当的准确性。这些改进显著提升了 Video-LLMs 在实际生产场景中的部署实用性。

Abstract

Video large language models (Video-LLMs) have demonstrated strong capabilities in video understanding tasks. However, their practical deployment is still hindered by the inefficiency introduced by processing massive amounts of visual tokens. Although recent approaches achieve extremely low token retention ratios while maintaining accuracy comparable to full-token baselines, most of them perform compression only at the late stage of prefilling, leaving the efficiency of the vision encoder unoptimized. In this paper, we first show that vision encoding contributes a large portion to the time-to-first-token (TTFT). Therefore, instead of compressing visual tokens only after the vision encoder, performing compression inside the encoder still leaves substantial room for exploration. Based on this insight, we propose EarlyTom, a training-free token compression framework that performs early-stage visual token compression inside the vision encoder, enabling significantly better TTFT reduction and higher throughput. In addition, we introduce a decoupled spatial token selection strategy that improves the overall compression effectiveness. EarlyTom reduces TTFT by up to 2.65x and FLOPs by up to 61% on a single NVIDIA A100 GPU for the LLaVA-OneVision-7B model, while maintaining accuracy comparable to the full-token baseline. These improvements substantially enhance the practicality of deploying Video-LLMs in real-world production scenarios.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	9.0/10	13.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper optimizes Video-LLMs via early token compression in the visual encoder, highly relevant to Visual Encoder, MLLM, and MultiModal. It moderately relates to Tokenizer through token handling but lacks focus on tokenizer architecture. It is unrelated to Unify Models, World Models, and model-based RL as it targets inference efficiency, not unification, world dynamics, or RL. No specified expert authors were found.

Keywords

Video-LLMs, Token Compression, Visual Encoder, Inference Efficiency, Training-free, Time-to-First-Token, Spatial Token Selection

In-Depth Analysis

Chinese Title: EarlyTom：早期令牌压缩实现快速视频理解

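Summary: To address the low inference efficiency of video large language models (Video-LLMs), the paper proposes EarlyTom, a training-free token compression framework. By breaking down the time-to-first-token (TTFT), the authors find that vision encoding is a major bottleneck (36.3% of TTFT), whereas existing compression methods mostly act at later stages and leave the encoder unoptimized. EarlyTom has two core components. The first is frame merging inside the vision encoder, which reduces inter-frame redundancy during encoding through adaptive segmentation and locally optimal merging. The second is a decoupled spatial token selection strategy, which splits the merged features into dynamic and static token sets and applies global and local-window selection to them respectively in order to preserve the spatial distribution. Experiments show that on LLaVA-OneVision-7B, with only 10% of tokens retained, TTFT is reduced by up to 2.65x, FLOPs by up to 61%, and throughput rises by 1.3x, while accuracy on four benchmarks including MVBench and EgoSchema remains comparable to the full-token baseline. The method substantially improves the practicality of deploying Video-LLMs.

Innovations:

Proposes frame merging inside the vision encoder, compressing redundant visual information during encoding and cutting TTFT markedly with minimal overhead.
Introduces a decoupled spatial token selection strategy that applies global Top-K selection to dynamic frames and local-window selection to static frames, avoiding bias and improving compression quality.
Delivers a training-free compression framework that achieves a 2.65x TTFT speedup and a 61% FLOPs reduction without extra training while preserving downstream accuracy.
Analyzes the attention sink phenomenon in video (sink tokens), shows that existing attention-score-based compression loses semantic information, and designs a more principled token selection strategy in response.

Methodology: The paper first profiles the latency of Video-LLM inference and identifies vision encoding as a major TTFT bottleneck. It then proposes a two-stage compression method. Stage one merges frames inside the vision encoder: the video is adaptively segmented based on streaming frame similarity, redundant intermediate frames are merged under a locally optimal criterion, and the representation is refined through weighted fusion. Stage two performs decoupled token selection after encoding: the merged frame features are split into dynamic and static token sets, dynamic frames use global Top-K selection, static frames use local-window selection to preserve the spatial distribution, and the tokens are then reassembled and fed to the LLM. The method needs no training and relies only on inference-time computation.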
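The first stage described above, similarity-driven segmentation followed by merging, can be sketched on pooled per-frame features. This is a deliberately simplified stand-in: a fixed cosine-similarity threshold replaces the paper's adaptive segmentation, a plain mean replaces its locally optimal merging and weighted fusion, and the feature vectors are toy data rather than encoder activations.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def segment_frames(frame_feats, threshold=0.9):
    """Start a new segment whenever consecutive frames are less similar than the threshold."""
    segments = [[0]]
    for i in range(1, len(frame_feats)):
        if cosine(frame_feats[i - 1], frame_feats[i]) >= threshold:
            segments[-1].append(i)
        else:
            segments.append([i])
    return segments


def merge_segments(frame_feats, segments):
    """Collapse each segment into one representative feature (mean; assumed rule)."""
    return np.stack([frame_feats[idx].mean(axis=0) for idx in segments])


# Four frames: two near-duplicates of one scene, then two of another.
feats = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])

segments = segment_frames(feats)
merged = merge_segments(feats, segments)
print(segments, merged.shape)
```

Because merging happens before the remaining encoder layers run, every merged frame removes its tokens from the rest of the encoder's computation, which is where the TTFT saving comes from in the report's account.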

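Key Results:

On LLaVA-OneVision-7B at a 10% token retention ratio: up to 2.65x lower TTFT, 61% fewer FLOPs, and 1.3x higher throughput.
On MVBench, EgoSchema, LongVideoBench, and VideoMME, average performance is comparable to the full-token baseline and better than other training-free methods.
Vision encoding accounts for 36.3% of TTFT in the baseline and rises to 55.8% and 68.4% under HoliTom and VisionZip respectively; EarlyTom cuts encoding time directly with almost no extra overhead.
The video attention-sink phenomenon shows that certain tokens (sink tokens) consistently attract high attention, so existing Top-K methods overlook semantic information in other frames.

Tech Stack:

SigLIP vision encoder
KNN clustering (for token merging)
Attention mechanism (for analyzing sink tokens)
Locally optimal merging criterion (similarity-based)
Global Top-K selection and local-window selection
Weighted fusion
FLOPs and TTFT latency profiling tools

Strengths:

Training-free and directly applicable to existing models, so it is practical.
Optimizes the vision encoder, an overlooked bottleneck, and obtains a significant speedup.
The decoupled token selection strategy handles both dynamic and static frames, preserves spatial distribution, and avoids bias.
Accuracy stays comparable to the full-token baseline on several benchmarks, which validates the compression.
Provides detailed latency profiling and an attention-sink analysis that offer useful insight.

Limitations:

The method depends on the attention characteristics of a specific vision encoder (SigLIP) and may not transfer to other architectures.
Frame merging is similarity-based, so it may be less effective on videos with abrupt scene changes.
Experiments cover only the LLaVA-OneVision model family; generalization needs further testing.
Decoding-stage optimization is not discussed; the work focuses on prefilling only.

Relevance To Keywords:

Native multimodal large models: EarlyTom addresses the inference efficiency of video multimodal large models (Video-LLMs), which falls under multimodal large model optimization.
Unified understanding and generation in multimodal large models: the method speeds up inference for understanding tasks (video question answering) and indirectly supports generation efficiency.
Representation learning: token compression and merging yield a more compact visual representation that retains key semantic information.
World models: video understanding is an important capability for world models, and faster inference helps their use in real-time settings.
Model-based RL / post-training: the paper does not involve reinforcement learning or post-training, but the compression method could serve as an inference-acceleration component after post-training.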

19. OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning [PASS]

Score: 49.5 / 27.8

Authors: Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang

Published: 2026-05-28

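TL;DR: The paper proposes OccamToken, a training-free, budget-adaptive visual token pruning framework that sharply cuts the inference cost of vision-language models while keeping accuracy high.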

摘要翻译

视觉 - 语言模型（VLMs）依赖长视觉 token 序列来实现视觉理解，导致预填充阶段在计算和内存开销上都非常高昂。大多数现有的剪枝方法遵循绝对排名范式，即为视觉 token 分配重要性分数并保留固定的前 K 子集。本文认为，这种范式本质上存在根本性脆弱性：注意力汇点会扭曲 token 的重要性排名，而图像冗余及查询依赖的视觉证据使得固定 token 预算在不同输入下不可靠。本文提出 OccamToken，这是一种无需训练的框架，它用基于寄存器的相对证据测试取代了绝对 token 排名。与询问哪些 token 全局重要不同，OccamToken 评估视觉 token 是否提供了超出基于寄存器参考的信息。我们的核心见解在于，寄存器 token 自然吸收低信息量的注意力模式，从而使其成为识别真正具有信息量的视觉证据的稳定参考。基于此原理，OccamToken 通过基于寄存器注意力导出的动态阈值，同时执行图像自适应冗余剪枝和查询自适应相关性剪枝。在 LLaVA-NeXT、LLaVA-v1.5 和 Qwen3-VL 上，OccamToken 无需额外训练即可一致地提高准确率 - 效率权衡。值得注意的是，在 LLaVA-NeXT 上，它将 2,880 个视觉 token 压缩至约 40 个，同时保留超过 93% 的原始准确率，即使在极端 1.4% 的保留率情形下，也能实现稳定的视觉 token 压缩。

Abstract

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0
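The table can be checked by hand: each row's score is its weight times its relevance, and the rows sum to the headline score of 49.5. The sketch below reproduces that arithmetic. The pass rule (total at or above the 27.8 pass mark shown next to the score) is an assumption about how the report assigns its PASS label.

```python
PASS_MARK = 27.8  # value shown after the slash in "Score: 49.5 / 27.8"

# (keyword, weight, relevance out of 10), copied from the table above
occamtoken_rows = [
    ("Unify Models", 1.5, 3.0),
    ("Tokenizer", 1.5, 5.0),
    ("Visual Encoder", 1.5, 8.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 9.0),
    ("MultiModal", 1.5, 8.0),
    ("model-based RL", 1.5, 0.0),
]

def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

total = weighted_total(occamtoken_rows)
# Assumed rule: a paper passes when its total reaches the pass mark.
verdict = "PASS" if total >= PASS_MARK else "FAIL"
```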

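Scoring rationale: The paper's core is inference acceleration for vision-language models (MLLM). It works directly on vision encoder outputs (Visual Encoder) and multimodal processing (MultiModal), hence the high scores. It handles tokens but does not design a tokenizer, so Tokenizer is moderate. The method applies to several models but does not unify architectures (Unify Models). It is entirely unrelated to world models and model-based RL.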

Keywords

Token Pruning, VLM Inference, Visual Tokens, Training-Free, Budget-Adaptive, Register-anchored, Efficiency Optimization

In-depth analysis

Chinese Title: OccamToken: 无需训练且预算自适应的令牌剪枝实现高效VLM推理

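Summary: Vision-language models (VLMs) rely on long visual token sequences for visual understanding, which makes the prefill stage expensive in compute and memory. Existing pruning methods usually follow an absolute-ranking paradigm that retains a fixed top-K subset, but attention sinks distort importance rankings, and image redundancy and query dependence make fixed budgets unreliable. The paper proposes OccamToken, a training-free two-stage pruning framework that replaces absolute token ranking with register-anchored relative evidence testing. The core idea is that register tokens naturally absorb low-information attention patterns and can therefore serve as a stable reference for identifying genuinely informative visual evidence. Stage one uses [CLS]-to-register attention at the vision encoder output to remove image-level redundancy. Stage two uses text-to-register attention inside the language model to prune query-irrelevant tokens, giving token budgets that adapt to both the image and the query. On LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken outperforms existing methods at matched budgets without additional training. On LLaVA-NeXT it compresses 2,880 visual tokens to about 40 (1.4% retention) while keeping over 93% of the original accuracy.

Innovations:

Proposes a register-anchored dynamic threshold that combines attention-sink mitigation with adaptive token pruning, with no training required.
Designs a two-stage training-free pruning framework that handles image-level redundancy and query-level relevance separately, making the budget adaptive.
Replaces absolute ranking (a fixed top-K) with a relative comparison (does a token beat the register reference?), which addresses the unreliability of fixed budgets.
Keeps accuracy high even at an extreme compression rate (1.4% retention), clearly ahead of existing training-free methods.
Plug-and-play: no model modification or extra training, and applicable to several VLM architectures.

Methodology: OccamToken uses a two-stage pruning strategy. Stage one (image-level redundancy pruning): test-time register tokens are inserted at the vision encoder output, the attention distribution from the [CLS] token to all visual tokens (registers included) is computed, the register's attention score is taken as a dynamic threshold, and visual tokens scoring above it are kept. Stage two (query-level relevance pruning): inside the language model, attention from text tokens to visual tokens (registers included) is computed, and the register score is again used as the threshold to prune tokens irrelevant to the current query. Neither stage needs training; both rely only on attention computation. The register token stabilizes the score distribution by absorbing attention sinks, and its own score acts as a semantic reference that adapts automatically to each sample and query.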
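The register-as-threshold rule can be sketched in a few lines of NumPy. This is a single-head toy version built on stated assumptions: one query vector (standing in for [CLS] or a text token), one register key, and scaled dot-product attention. The function name and example vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def register_anchored_keep(query, visual_keys, register_key):
    """Return indices of visual tokens whose attention weight exceeds the
    register token's weight (a relative test, not a fixed top-K)."""
    keys = np.vstack([visual_keys, register_key[None, :]])  # register last
    logits = keys @ query / np.sqrt(query.shape[0])
    attn = softmax(logits)
    register_score = attn[-1]  # dynamic, per-input threshold
    return np.flatnonzero(attn[:-1] > register_score)

visual_keys = np.array([[4.0, 0.0], [0.0, 0.0], [-2.0, 0.0], [3.0, 1.0]])
register_key = np.array([1.0, 0.0])

keep_a = register_anchored_keep(np.array([1.0, 0.0]), visual_keys, register_key)
keep_b = register_anchored_keep(np.array([0.0, 1.0]), visual_keys, register_key)
```

The two queries retain different numbers of tokens (two versus one) from the same set of keys, which is the budget-adaptive behaviour the summary describes. No token count is fixed in advance.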

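Key Results:

On LLaVA-NeXT, 2,880 visual tokens are compressed to about 40 (1.4% retention) while over 93% of the original accuracy is kept.
On LLaVA-v1.5 and Qwen3-VL, it outperforms all compared baselines at matched budgets (both training-free and trained methods).
Adaptive budget: different queries on the same image retain different numbers of tokens (for example, 9 for a question about a knife block and 17 for a question about the relation between the refrigerator and the stove).
Attention sinks are mitigated: after inserting the register, the effective attention count (n_eff) rises from 42 to 281, and the score distribution becomes easier to separate.

Tech Stack:

Attention mechanism (multi-head attention, softmax normalization)
Test-time register token
Attention scores from the [CLS] token and text tokens
Dynamic threshold pruning (relative comparison against the register score)
Two-stage pruning framework (image-level plus query-level)
PyTorch (presumed, for implementation and experiments)

Strengths:

Fully training-free: no extra data or fine-tuning, plug-and-play.
Adaptive budget: the number of retained tokens adjusts to image complexity and query needs.
Mitigates attention sinks and makes importance scores easier to separate.
Keeps accuracy high at extreme compression rates and outperforms existing methods.
Simple: only a register token and an attention computation are added, so overhead is small.

Limitations:

Relies on inserting a register token, so architectures that do not support test-time insertion may need adaptation.
Two-stage pruning may add a small amount of inference latency (though the paper claims it is efficient).
Experiments are mainly on image understanding benchmarks; generalization to multi-image and video settings remains to be verified.
The register threshold may not be robust under some extreme distributions (for example, when every token scores below the register).
Combination with generative tasks (such as visual generation) is not explored; the work targets understanding-oriented VLMs only.

Relevance To Keywords:

Native multimodal large models: the paper studies the inference efficiency of VLMs, which is directly relevant.
Unified understanding and generation in multimodal large models: the paper focuses on understanding (image question answering) and does not cover unified generation, though the pruning method could extend to generative models; partially relevant.
Representation learning: the representational behaviour of register tokens (absorbing sinks, encoding global information) is central to the method; highly relevant.
World models: the paper does not build or predict with world models; low relevance.
Reinforcement learning: the paper does not use reinforcement learning; low relevance.
Post-training: the method is training-free and belongs to inference optimization rather than post-training; fairly low relevance.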

20. Grounded 3D-Aware Spatial Vision-Language Modeling [PASS]

Score: 48.0 / 27.8

Authors: An-Chieh Cheng, Yang Fu, Yatai Ji, Ligeng Zhu, Guanqi Zhan, Zhuoyang Zhang, Zhaojing Yang, Song Han, Yao Lu, Pavlo Molchanov, Vidya Nariyambut Murali, Jan Kautz, Xiaolong Wang, Hongxu Yin, Sifei Liu

Published: 2026-05-28

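TL;DR: The paper proposes GR3D, which unifies explicit 2D, implicit 2D, and monocular 3D grounding in a single framework and improves the spatial understanding of vision-language models.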

摘要翻译

我们提出了 GR3D，这是一个空间视觉语言模型（Spatial Vision Language Model），在统一框架内配备了三种互补的定位能力——显式 2D 定位、隐式 2D 定位和单目 3D 定位。GR3D 引入了一种隐式定位机制，该机制在生成过程中识别实体提及，并将相应的区域标记插入文本流中，从而使模型能够在生成空间思维链响应时即时参考视觉证据。与此同时，一种基于区域提示的单目 3D 定位设计从已定位的区域查询中预测相机视图中的 3D 边界框，该设计得到了内禀感知归一化和密集几何监督的支持。协同地，这些定位能力使 GR3D 能够将复杂的空间理解问题分解为基于定位的 2D 感知，随后进行 3D 推理。GR3D 在基于定位和未基于定位的空间基准上均实现了持续改进，表明定位是一种有效的归纳偏置，用于加强视觉语言模型（VLMs）中的空间理解。这些定位能力共同增强了超越定位任务本身的一般空间理解。

Abstract

We present GR3D, a spatial vision language model equipped with three complementary grounding capabilities--explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding--within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. In parallel, a region-prompted monocular 3D grounding design predicts 3D bounding boxes in the camera view from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. GR3D achieves consistent improvements across grounded and non-grounded spatial benchmarks, demonstrating grounding as an effective inductive bias for strengthening spatial understanding in VLMs. These grounding capabilities collectively enhance general spatial understanding beyond the grounding task itself.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5

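Scoring rationale: The paper belongs to multimodal large models (MLLM, MultiModal). Its core involves the vision encoder (Visual Encoder) and a region-token mechanism (Tokenizer), and it unifies several grounding capabilities (Unify Models). However, it mainly concerns spatial understanding and vision-language alignment; it does not address generative prediction in world models (World Models) and contains nothing on reinforcement learning (model-based RL), so the last two are of very low relevance.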

Keywords

Grounded 3D-Aware, Spatial Vision-Language, Region Tokens, 2D Grounding, 3D Grounding, Spatial Understanding, Visual Evidence, Single Framework

In-depth analysis

Chinese Title: 基于三维感知的接地空间视觉语言建模

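Summary: The paper proposes GR3D, a spatial vision-language model that integrates three complementary capabilities: explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding. The implicit grounding mechanism identifies entity mentions during generation and inserts the corresponding region tokens, so the model can reference visual evidence dynamically during chain-of-thought reasoning. Region-prompted monocular 3D grounding predicts 3D bounding boxes from grounded region queries, supported by intrinsic-aware normalization and dense geometric supervision. Together these capabilities let GR3D decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference. Experiments show consistent improvements on both grounded and non-grounded spatial benchmarks, indicating that grounding is an effective inductive bias for spatial understanding in VLMs.

Innovations:

Proposes an implicit 2D grounding mechanism that uses streaming region insertion to link text mentions to visual regions during generation, enabling visual chain-of-thought reasoning.
Designs a region-prompted monocular 3D grounding method that predicts 3D bounding boxes from 2D region queries using intrinsic-aware normalization and dense geometric supervision.
Unifies explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding in one framework, supporting decomposed spatial reasoning from 2D to 3D.
Builds large-scale implicit grounding annotations and balanced 2D-3D supervision data for training.
Finds that grounding, as an inductive bias, improves general spatial understanding even when no explicit grounding task is involved.

Methodology: A base spatial VLM is built on the NVILA-8B-Lite architecture, with 2D position embeddings and relative depth cues strengthening the spatial awareness of visual tokens. Explicit 2D grounding uses the language head to predict bounding boxes directly in HTML format. Implicit 2D grounding uses streaming region insertion: during generation the model predicts bounding boxes for entities, a region encoder extracts region tokens, and these are inserted into the text stream. Monocular 3D grounding takes 2D regions as queries and predicts 3D bounding boxes with intrinsic-aware normalization and dense point supervision (from depth estimation). Training uses teacher forcing and gradient truncation; at inference the insertion runs automatically. Data construction covers implicit grounding annotations generated from existing datasets and balanced 2D-3D supervision.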

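Key Results:

On spatial benchmarks such as CVBench, ERQA, and SAT, implicit grounding improves chain-of-thought accuracy and spatial consistency.
On Omni3D, region-prompted monocular 3D grounding with dense point supervision reaches state-of-the-art performance.
Grounding improves general spatial understanding even when no explicit grounding task is involved.
Dense geometric supervision provides scalable structural cues.
Combining implicit grounding with region-prompted 3D inference supports referring-instance 3D grounding, category-level 3D detection, and multi-object scene grounding.

Tech Stack:

NVILA-8B-Lite architecture
2D position embeddings
Relative depth cues
Region encoder (pooling-based)
Bounding-box output in HTML format
Teacher-forcing training
Intrinsic-aware normalization
Dense geometric supervision (depth estimation)
Chain-of-thought (CoT) reasoning

Strengths:

Unifies three complementary grounding capabilities, covering 2D and 3D spatial understanding.
The implicit grounding mechanism lets the model ground entities automatically, closer to how humans reason visually.
Monocular 3D grounding addresses depth and scale ambiguity through region prompts and dense supervision.
Consistent improvements on several benchmarks validate grounding as an inductive bias.
The framework is flexible and supports single-view and multi-view input.

Limitations:

Monocular 3D grounding is still bounded by the inherent ambiguity of a single view and may fail on heavily occluded or symmetric objects.
Implicit grounding depends on annotations in the training data, and the data construction process may introduce noise.
The model is built on NVILA-8B-Lite and may be limited by the capacity of that base architecture.
It is not validated in real robot settings; evaluation stays at image-level benchmarks.
The multi-view extension is mentioned only in the supplementary material, without detailed analysis.

Relevance To Keywords:

Unify Models: GR3D unifies 2D and 3D grounding capabilities, so it falls under unified models.
World Models: through 3D grounding and spatial reasoning the model builds a geometric representation of the scene and learns 3D spatial structure, which can be seen as part of a world model.
Representation Learning: spatial representations are learned through grounding and geometric supervision.
Model-Based RL / reinforcement learning: the paper does not involve reinforcement learning, although spatial understanding provides a basis for embodied intelligence and is indirectly related.
Native multimodal large models: GR3D is built on NVILA and belongs to native multimodal large models.
Unified understanding and generation in multimodal large models: the model performs visual understanding and language generation together.
Post-training: the paper focuses on model architecture and training and does not explicitly discuss post-training strategies.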

21. Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning [PASS]

Score: 48.0 / 27.8

Authors: Ciara Rowles, Reshinth Adithyan, Nikhil Pinnaparaju, Vikram Voleti, Mark Boss

Published: 2026-05-28

TL;DR: Stable-Layers introduces a reinforcement learning framework that fine-tunes image layer decomposition models using vision-language model feedback without paired supervision, achieving improved layer separation and lower reconstruction error.

摘要翻译

我们提出了 Stable-Layers，这是一种强化学习框架，它通过仅使用视觉 - 语言模型（VLM）的反馈来微调预训练层分解模型，从而消除了对成对监督的需求。基于 Qwen-Image-Layered，我们应用了带有 LoRA 适配的 Flow-GRPO，对每张图像采样多个候选分解，使用 VLM 对其进行评分，并根据组内相对优势优化策略。关键挑战在于设计可靠的奖励信号：单独评分样本的 VLM 倾向于将其判断压缩到一个狭窄的范围内，导致 GRPO 几乎没有组内方差可供学习。我们通过一个两阶段评估流程来解决这一问题，该流程将基于五个以编辑为中心的标准的结构化单样本评分与一个基于网格的校准步骤相结合，在该步骤中 VLM 并排重新评分所有候选项。与基础模型相比，Stable-Layers 在 Crello 数据集上产生的分解具有更强的层分离度、更少的空白或伪影较多的层以及更低的每层重建误差。

Abstract

We present Stable-Layers, a reinforcement learning framework that eliminates the need for paired supervision by fine-tuning a pretrained layer decomposition model using only feedback from a vision-language model (VLM). Starting from Qwen-Image-Layered, we apply Flow-GRPO with LoRA adaptation, sampling multiple candidate decompositions per image, scoring them with a VLM, and optimising the policy from group-relative advantages. The key challenge lies in designing a reliable reward signal: VLMs scoring samples in isolation tend to compress their judgements into a narrow band, leaving GRPO with little within-group variance to learn from. We address this with a two-stage evaluation pipeline that pairs structured per-sample scoring across five edit-centric criteria with a grid-based calibration step in which the VLM re-scores all candidates side-by-side. Stable-Layers produces decompositions with stronger layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error on the Crello dataset compared to the base model.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	4.0/10	6.0

Scoring rationale: The paper focuses on RL fine-tuning of image layer decomposition using VLM feedback, which makes MLLM and MultiModal highly relevant. Unify Models and Visual Encoder are moderately relevant because the method aligns models and processes images. Tokenizer and World Models are not discussed. model-based RL is moderately relevant because RL is used (GRPO), although GRPO is a model-free policy optimization method. No authors from the specified expert list were found.

Keywords

Reinforcement Learning, Vision-Language Model, Image Layer Decomposition, Fine-Tuning, GRPO, LoRA, Unsupervised Learning

In-depth analysis

Chinese Title: 稳定层：使用VLM评分强化学习微调图像层分解模型

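Summary: The paper proposes Stable-Layers, a framework that fine-tunes a pretrained layer decomposition model with reinforcement learning, using only feedback from a vision-language model (VLM) and no paired supervision. To counter the lack of within-group variance caused by compressed VLM scores, it designs a two-stage evaluation pipeline: stage one gives each candidate decomposition a structured score on five edit-related criteria; stage two places the candidate group in a labeled grid and re-scores it to sharpen the distinctions. It also improves RatioNorm to handle a ratio normalization problem caused by sequence-packed latent representations in flow-matching models. After training on unlabeled images, Stable-Layers achieves better layer separation, fewer blank or artifact-heavy layers, and lower per-layer reconstruction error than the base model on the Crello dataset.

Innovations:

Proposes a two-stage VLM reward protocol that mitigates score compression in group-based reinforcement learning: stage one scores independently on five criteria, and stage two calibrates finely through side-by-side grid comparison.
Improves RatioNorm to address the suppressed ratio standard deviation caused by sequence-packed latents in flow-matching reinforcement learning, restoring ratio magnitudes of order O(1).
Applies a VLM as the sole source of supervision for RL post-training of image layer decomposition for the first time, with no layer annotations or paired data.
Uses LoRA adaptation and the Flow-GRPO framework to improve the editing usability of a pretrained layer decomposition model on unlabeled images.

Methodology: The method uses the Flow-GRPO reinforcement learning framework with Qwen-Image-Layered as the base model, fine-tuning the attention projection and feed-forward layers with LoRA (rank 16). Training loop: each step generates G candidate decompositions (via SDE sampling), obtains rewards from the two-stage VLM scoring pipeline, then computes group-relative advantages and performs a GRPO update. Stage one: the VLM scores each candidate independently on five criteria (semantic separation, alpha cleanliness, background inpainting, feature distribution, content validity), normalized to [0, 1]. Stage two: all candidates are arranged in a labeled grid and re-scored by the VLM to increase within-group variance. The method also adopts RatioNorm and gradient reweighting from GRPO-Guard, with a correction for packed latent representations.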
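To see why compressed scores are a problem, it helps to write out the group-relative advantage used by GRPO-style methods, in which each reward is standardized within its group of candidates. The sketch below uses that standard formulation; the reward values are invented for illustration and are not results from the paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of G candidates:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical rewards for G = 4 candidates of the same image.
isolated = [0.70, 0.71, 0.70, 0.71]    # per-sample scoring: narrow band
calibrated = [0.40, 0.85, 0.55, 0.95]  # after side-by-side re-scoring

spread_isolated = float(np.std(isolated))
spread_calibrated = float(np.std(calibrated))

adv = group_relative_advantages(calibrated)
best_candidate = int(np.argmax(adv))
```

When the raw spread is as small as in `isolated`, standardization mostly amplifies scoring noise rather than real quality differences. The wider spread in `calibrated` gives the update a usable ranking signal, which is the motivation the summary gives for the grid-based second stage.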

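Key Results:

On the Crello dataset, Stable-Layers achieves better layer separation than the Qwen-Image-Layered baseline and produces fewer blank or artifact-heavy layers.
Per-layer reconstruction error drops, indicating more faithful decompositions.
After training on unlabeled Fine-T2I images, the model keeps the reconstruction quality of the original image while improving per-layer semantic consistency and alpha-matte quality.
The two-stage scoring pipeline restores within-group reward variance, so GRPO can learn a useful signal from similar candidates.

Tech Stack:

Flow-GRPO (flow-matching reinforcement learning framework)
GRPO-Guard (ratio normalization and gradient reweighting)
LoRA (low-rank adaptation, rank 16)
Qwen-Image-Layered (pretrained layer decomposition model)
VLM (vision-language model, used as the scorer)
SDE-augmented flow matching (diffusion coefficient σ_t = a√(t/(1−t)), a = 0.7)
RGBA-VAE (3D variational autoencoder, 8× spatial compression)
MMDiT (multimodal diffusion Transformer)

Strengths:

Needs no paired data or human annotation; VLM feedback alone improves decomposition quality, which lowers data costs.
The two-stage scoring design addresses VLM score compression, so reinforcement learning obtains useful gradients from similar candidates within a group.
The ratio normalization fix for sequence-packed latents in flow-matching models is general and could carry over to similar settings.
Optimizes several edit-related dimensions at once (semantic separation, alpha cleanliness, background inpainting, and others), improving practical editing usability.

Limitations:

Depends on the VLM's scoring quality; if the VLM understands some visual properties poorly (such as alpha mattes), bias may be introduced.
Two-stage scoring raises cost: each training step calls the VLM several times.
Experiments cover only the Crello dataset and some unlabeled images; generalization to other image types (natural photos, complex scenes) is not yet well established.
The method handles decomposition into a fixed number of layers (N) only; reward design for a variable number of layers is unexplored.

Relevance To Keywords:

Native multimodal large models: the paper uses a VLM as the scorer, showing how a multimodal large model can supervise a vision task.
Unified understanding and generation in multimodal large models: the VLM understands both image and text and is used to assess generated layer decompositions.
Representation learning: layer decomposition is in essence learning a layered representation of an image, and RL makes that representation more edit-friendly.
World models: layer decomposition can be viewed as a structured world model of image content, and the paper improves its consistency through RL.
Reinforcement learning: the core method is Flow-GRPO, an application of RL to post-training of generative models.
Post-training: the paper focuses on post-training fine-tuning of a pretrained model, using RL rather than supervised learning.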

22. CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning [PASS]

Score: 48.0 / 27.8

Authors: Xiang Fang, Wanlong Fang, Changshuo Wang

Published: 2026-05-28

TL;DR: CogniVerse enhances Multimodal Large Language Models by introducing a cognitive-inspired retrieval-augmented framework that reduces noise and improves coherence through geometric reasoning and optimal transport.

摘要翻译

多模态检索增强生成（MMRAG）已成为一种强大的范式，通过整合外部视觉、文本和结构知识，以增强多模态大语言模型在知识密集型问答中的能力。然而，现有的 MMRAG 框架存在关键局限性，包括噪声检索与无关检索、跨模态语义错位、缺乏自适应推理，以及局部与全局上下文间生成不连贯等问题。我们提出 CogniVerse，这是一种新颖的 MMRAG 框架，通过认知启发式且数学严谨的方法解决这些挑战。借鉴类人推理，CogniVerse 集成了三个协同组件：(1) 认知反思模块，该模块动态评估检索必要性并过滤相关多模态内容，从而减少噪声和计算开销；(2) 多模态检索模块，该模块利用信息几何在黎曼流形上对齐嵌入表示，并通过谱图理论精炼知识图谱，以确保精确且连贯的检索；(3) 层次化生成模块，该模块采用基于最优传输的损失函数，以平衡词元级精度与全局语义连贯性。大量实验表明，CogniVerse 在准确性和连贯性方面显著优于最先进系统，同时降低了检索延迟。

Abstract

Multi-modal Retrieval-Augmented Generation (MMRAG) has emerged as a powerful paradigm for enhancing Multimodal Large Language Models in knowledge-intensive question answering by integrating external visual, textual, and structural knowledge. However, existing MMRAG frameworks suffer from critical limitations, including noisy and irrelevant retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation across local and global contexts. We introduce CogniVerse, a novel MMRAG framework that addresses these challenges through a cognitive-inspired, mathematically rigorous approach. Drawing from human-like reasoning, CogniVerse integrates three synergistic components: (1) a Cognitive Reflection Module that dynamically assesses retrieval necessity and filters relevant multi-modal content, reducing noise and computational overhead; (2) a Multi-modal Retrieval Module that aligns embeddings in a Riemannian manifold using information geometry and refines knowledge graphs via spectral graph theory, ensuring precise and coherent retrieval; and (3) a Hierarchical Generation Module that employs an optimal transport-based loss to balance token-level accuracy and global semantic coherence. Extensive experiments demonstrate that CogniVerse significantly outperforms state-of-the-art systems in both accuracy and coherence, while reducing retrieval latency.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0

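Scoring rationale: The paper's core is multimodal retrieval-augmented generation (MMRAG), which is highly relevant to MultiModal (10) and MLLM (9). The framework unifies retrieval and generation logic, so Unify Models scores 6. It touches token-level accuracy but does not focus on tokenizer design (2), implies a vision encoder (4), and is unrelated to World Models (1) and model-based RL (0). No authors from the specified expert list appear. The weighted total is 48.0, above the dynamic pass mark of 27.8.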

Keywords

Multi-Modal Retrieval-Augmented Generation, Cognitive Reflection, Geometric Reasoning, Multimodal Large Language Models, Riemannian Manifold, Hierarchical Generation, Optimal Transport

In-depth analysis

Chinese Title: CogniVerse：利用认知反思与几何推理革新多模态检索增强生成

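Summary: The paper proposes CogniVerse, a new multi-modal retrieval-augmented generation (MMRAG) framework that targets key limitations of existing systems: noisy retrieval, cross-modal semantic misalignment, lack of adaptive reasoning, and incoherent generation. Inspired by human cognition, CogniVerse integrates three cooperating modules. The cognitive reflection module dynamically assesses whether retrieval is needed and filters relevant multi-modal content. The multi-modal retrieval module uses information geometry to align embeddings on a Riemannian manifold and refines knowledge graphs with spectral graph theory. The hierarchical generation module uses an optimal-transport-based loss to balance token-level accuracy against global semantic coherence. Experiments show that CogniVerse clearly outperforms existing systems in accuracy and coherence while lowering retrieval latency.

Innovations:

Proposes a cognitive reflection module that decides dynamically whether retrieval is needed and filters noise, reducing computational overhead.
Uses information geometry to align multi-modal embeddings on a Riemannian manifold (hyperbolic space) for cross-modal semantic consistency.
Introduces spectral graph theory to refine knowledge graphs and build query-relevant subgraphs, improving multi-hop retrieval precision.
Designs a hierarchical generation loss based on optimal transport (Wasserstein distance) to balance local and global coherence.
Provides theoretical convergence guarantees for geometric alignment and spectral optimization, with experimental validation.

Methodology: CogniVerse uses a three-stage pipeline. 1) Cognitive reflection module: confidence is computed with a pretrained multimodal large model, and classification heads trained by contrastive learning judge retrieval necessity and document relevance. 2) Multi-modal retrieval module: visual, textual, and knowledge-graph embeddings are mapped onto a Riemannian manifold and aligned by minimizing geodesic distance; spectral graph theory is used to eigendecompose the knowledge graph and extract query-relevant subgraphs. 3) Hierarchical generation module: retrieved content is combined with the query, and generation is optimized with a Wasserstein-distance-based loss that accounts for both token-level accuracy and global semantics.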
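The geodesic distance minimized in the retrieval module depends on which manifold model is used, and the summary says only "hyperbolic space". As one concrete possibility, the sketch below computes the geodesic distance in the Poincaré ball, a common model of hyperbolic space. The choice of the Poincaré ball is an assumption for illustration, not something the report confirms.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq_dist / denom))

origin = np.zeros(2)
text_emb = np.array([0.5, 0.0])    # hypothetical text embedding
image_emb = np.array([0.9, 0.0])   # hypothetical image embedding

d_near = poincare_distance(origin, text_emb)
d_far = poincare_distance(origin, image_emb)
```

Distances grow quickly towards the boundary of the ball: moving from radius 0.5 to radius 0.9 more than doubles the distance from the origin. This property is why hyperbolic embeddings are often used for hierarchical structure such as knowledge graphs.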

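Key Results:

On several multi-modal question answering benchmarks, CogniVerse surpasses existing methods such as MuRAG and MMCoQA on both accuracy and coherence metrics.
The cognitive reflection module cuts unnecessary retrieval, lowering retrieval latency by about 30%.
Geometric alignment and spectral graph refinement improve cross-modal retrieval precision, especially on complex multi-hop queries.
Answers generated with the optimal transport loss are better than those from cross-entropy loss in both factual correctness and semantic coherence.

Tech Stack:

Multimodal large language models (MLLM)
Information geometry (Riemannian manifold, geodesic distance)
Spectral graph theory (eigendecomposition, subgraph extraction)
Optimal transport (Wasserstein distance)
Contrastive learning loss
Hyperbolic space embeddings
Pretrained vision-language models (such as BLIP and CLIP)

Strengths:

Brings a concept from cognitive science (reflection) into MMRAG to make retrieval adaptive.
The mathematical framework is rigorous (information geometry, spectral graph theory, optimal transport) and comes with theoretical guarantees.
Addresses noise, misalignment, static reasoning, and incoherence together.
Experiments are thorough, validated on several datasets with ablation analysis.

Limitations:

Depends on pretrained multimodal large models and may inherit their biases or knowledge limits.
Riemannian alignment and spectral graph refinement are computationally expensive; large-scale deployment needs optimization.
Does not cover world models, reinforcement learning, or post-training, so the link to some keywords is weak.
Experiments target question answering only; generalization to other multi-modal generation tasks (summarization, dialogue) is not verified.

Relevance To Keywords:

Native multimodal large models: the paper builds on a multimodal large model and strengthens its retrieval ability; highly relevant.
Unified understanding and generation in multimodal large models: CogniVerse covers both understanding (retrieval) and generation (answering); relevant.
Representation learning: multi-modal representations are learned through geometric alignment and spectral graph theory; relevant.
World models: the paper does not build or predict with world models; not directly relevant.
Model-based RL: no reinforcement learning or model-based decision making is involved; not relevant.
Post-training: the paper does not discuss post-training stages (such as fine-tuning or RLHF); not directly relevant.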

23. Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning [PASS]

Score: 45.0 / 27.8

Authors: Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

Published: 2026-05-28

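TL;DR: The paper proposes GASP, a framework that injects geometric priors into vision-language models and markedly improves 3D spatial reasoning without training on any 3D VQA data.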

摘要翻译

视觉 - 语言模型（VLMs）往往难以实现稳健的三维空间推理。依赖使用三维视觉问答（VQA）数据集进行微调的主流方法可能会过拟合数据集特定的偏差，而集成专用的三维视觉编码器通常既缺乏灵活性又繁琐。本文认为，真正的空间理解应源于学习基础几何先验，而不仅仅依赖于高层的 VQA 监督。我们提出了 GASP（几何感知空间先验），该框架将这些先验直接注入到大语言模型（LLM）的变换器层中。GASP 采用了一个小型对应头，将其作为深层监督信号应用于所有层，并利用大规模视频场景中的真实几何进行双重目标训练：基于真实点对应的对比损失强制二维视图不变性，而深度一致性监督则用于解决三维几何歧义。我们的分析首先提供了一项诊断，表明标准 VLMs 的内部对应匹配准确率非常低（通常低于 5%）。随后我们证明，我们的训练显著改善了这一行为，将峰值层间对应率提升至 70% 以上，并保持超过 85% 的时间鲁棒性，而基线模型则保持在 5% 以下。这些内部改进转化为下游空间基准上的显著提升，包括在 All-Angles Bench 上提升 18.2%，在 VSI-Bench 上提升 29.0%，且均未使用任何三维 VQA 数据进行训练。我们的发现表明，从基础几何先验进行学习是一条有前景且可泛化的途径，有助于构建具有更可靠三维空间推理能力的视觉 - 语言模型。

Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

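Scoring rationale: MLLM and MultiModal score highest (8.0) because the paper's core is vision-language models, which fall under multimodal large models. Unify Models (5.0) and Visual Encoder (4.0) are somewhat related, through unified visual and language understanding and the discussion of encoders. Tokenizer, World Models, and model-based RL score low (1.0 to 2.0) because they are absent from the abstract or unrelated. No authors from the specified expert list appear.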

Keywords

Vision-Language Models, 3D Spatial Reasoning, Geometric Priors, Deep Supervision, Contrastive Loss, Depth Consistency, Transformer Layers

In-depth analysis

Chinese Title: 超越3D VQA：将3D空间先验注入视觉-语言模型以增强几何推理

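Summary: The paper addresses the weakness of vision-language models (VLMs) in 3D spatial reasoning with a training framework called GASP (Geometric-Aware Spatial Priors). Existing methods either fine-tune on 3D visual question answering (VQA) datasets, which tends to overfit dataset-specific biases, or integrate dedicated 3D visual encoders, which is inflexible and adds model overhead. GASP attaches a lightweight correspondence head to the LLM's Transformer layers and trains it with deep supervision on ground-truth geometric priors (point correspondences and depth consistency) from large-scale video scenes, injecting a geometric inductive bias directly into the model. Training uses a contrastive loss and a depth consistency loss; at inference the correspondence head is discarded and the model runs as a standard VLM. Experiments show that GASP raises the VLM's internal correspondence matching accuracy from below 5% to over 70% and brings sizeable gains on downstream spatial reasoning benchmarks (+18.2% on All-Angles Bench, +29.0% on VSI-Bench), without using any 3D VQA data. The work suggests that learning from basic geometric priors is an effective and generalizable route to better spatial reasoning in VLMs.

Innovations:

Proposes the GASP framework, which injects geometric priors directly into the LLM's Transformer layers instead of relying on 3D VQA fine-tuning or external encoders.
Designs a lightweight correspondence head that applies a deep supervision signal at all intermediate layers and is discarded after training, adding no inference cost.
Uses a dual objective: a contrastive loss on ground-truth point correspondences enforces 2D view invariance, and a depth consistency loss resolves 3D ambiguity.
Presents the first internal correspondence matching analysis of VLM backbones, showing that standard VLMs reach very low correspondence accuracy (<5%) while GASP raises it above 70%.
Achieves sizeable gains on several downstream spatial reasoning benchmarks without 3D VQA data, supporting the generality of learning geometric priors.

Methodology: The technical route is as follows. 1) A lightweight 2-layer MLP correspondence head is inserted at each LLM Transformer layer of the VLM, with weights initialized by SVD to preserve the pretrained representation. 2) Ground-truth point correspondences and depth maps are taken from the large-scale video dataset DL3DV as supervision. 3) Training uses an InfoNCE contrastive loss to learn view-invariant embeddings, together with a depth consistency loss (matching depth values) as geometric regularization. 4) After training the correspondence head is discarded and the model performs inference as a standard VLM. 5) The quality of the geometric representation is assessed by internal correspondence matching analysis (QK matching), and downstream performance is tested on several spatial reasoning benchmarks.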
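The contrastive term in step 3 can be sketched with a plain NumPy InfoNCE loss over matched point embeddings from two views. Both the temperature value and the nearest-neighbour accuracy helper are assumptions for illustration. The paper's own QK-matching analysis operates on attention matrices and may differ from this simplified check.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.07):
    """InfoNCE loss: row i of `anchors` should match row i of `positives`;
    every other row in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def matching_accuracy(anchors, positives):
    """Fraction of anchors whose most similar row is the true match."""
    nearest = np.argmax(anchors @ positives.T, axis=1)
    return float(np.mean(nearest == np.arange(len(anchors))))

view_a = np.eye(3)                       # three point embeddings in view A
view_b_aligned = np.eye(3)               # same points, consistent embeddings
view_b_shuffled = np.roll(np.eye(3), 1, axis=0)  # correspondences broken

loss_aligned = info_nce(view_a, view_b_aligned)
loss_shuffled = info_nce(view_a, view_b_shuffled)
```

Minimizing this loss pulls the embeddings of the same 3D point in different views together, which is the view-invariance property the summary attributes to the contrastive objective.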

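Key Results:

Standard VLMs (such as Qwen2.5-VL-7B and LLaVA-NeXT-Video-7B) have internal correspondence matching accuracy below 5%.
GASP raises peak layer-wise correspondence accuracy above 70% with temporal robustness over 85%, while baselines stay below 5%.
Camera pose estimation on All-Angles Bench improves by 18.2%, object counting on VSI-Bench by 29.0%, and multi-view reasoning on BLINK by 15.0%.
Only a small amount of training is needed, with minimal effect on general video QA performance.

Tech Stack:

InfoNCE contrastive loss
SVD (used to initialize the correspondence head)
2-layer MLP (correspondence head: d → 2d_emb → d_emb, GELU activation)
Depth consistency loss (based on ground-truth depth maps)
QK matching analysis (visual self-attention matrix)
DL3DV dataset (large-scale video scenes)
VLM backbones such as Qwen2.5-VL-7B and LLaVA-NeXT-Video-7B

Strengths:

Learns from geometric priors rather than VQA data, which avoids dataset bias and improves generalization.
The lightweight correspondence head is discarded after training and adds no inference cost.
The internal correspondence matching analysis gives direct evidence of improved geometric representations.
Achieves sizeable gains on several downstream tasks with no extra 3D input.
The method is simple and easy to integrate into existing VLMs.

Limitations:

Training depends on ground-truth geometric annotations (point correspondences and depth maps) from large-scale video scenes, which are costly to obtain.
Experiments are limited to 7B-scale models; behaviour at larger scales is unknown.
The depth consistency loss may be sensitive to depth-map quality, and noisy data could affect training.
It is not evaluated on a wider range of spatial reasoning tasks (such as navigation or manipulation).

Relevance To Keywords:

Native multimodal large models: GASP injects geometric priors directly into LLM layers to improve VLM spatial reasoning, which is an improvement to multimodal large models.
Representation learning: contrastive learning yields view-invariant embeddings and better internal geometric representations.
World models: learning object permanence and geometric consistency helps build more robust world models.
Post-training: GASP is a post-training method, but unlike conventional VQA fine-tuning it uses deep supervision on geometric priors.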

24. Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models [PASS]

Score: 45.0 / 27.8

Authors: Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

Published: 2026-05-28

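TL;DR: The paper proposes a reward-free alignment-guided score matching method that optimizes soft tokens in diffusion models and clearly improves text-image consistency and counting accuracy.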

摘要翻译

扩散模型 (Diffusion models) 能够生成高度逼真的图像，但往往难以实现精确的文本 - 图像对齐。尽管近期的一些后训练方法利用外部奖励或人类偏好信号来改善对齐，但其性能高度依赖于奖励质量，并未直接解决扩散过程本身内部的对齐问题。近期无奖励方法（如 SoftREPA）表明，通过对比学习优化软文本令牌可有效提升文本 - 图像表示对齐，优于标准的参数高效微调基线。然而，对比形式可能会过度惩罚负样本对，这表现为特征性失败案例，如过度计数和重复。为了解决这一问题，我们提出了一种轻量级的无奖励后训练方法，通过将对比对齐指导直接整合到扩散模型的得分匹配目标中来优化软令牌。通过在得分层面分配对齐方向，我们的方法缓解了这些局限性，并生成了更连贯且语义忠实的结果。实验表明，我们的方法与 SoftREPA 相当，同时显著改进了其失败案例，在 GenEval 基准上计数准确率提高了 35% 以上。我们的方法可无缝应用于现有的扩散骨干网络（SD1.5、SDXL 和 SD3），且与现有的基于强化学习 (RL) 的扩散后训练方法互补。项目页面：https://jaayeon.github.io/AGSM

Abstract

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	6.0/10	9.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5

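Scoring rationale: The paper's core is optimizing text-image alignment in diffusion models, which fits multimodal (MultiModal) tasks closely. The method optimizes soft tokens (Tokenizer) and builds on diffusion models (implying a Visual Encoder), but it does not involve unified-architecture models (Unify Models), world models (World Models), multimodal large language models (MLLM), or model-based reinforcement learning (model-based RL), and it explicitly adopts a reward-free strategy rather than RL. The author list contains none of the specified experts, so no bonus is added.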

Keywords

Text-to-Image Alignment, Diffusion Models, Score Matching, Soft Tokens, Reward-free, Post-training, Generation Quality

In-depth analysis

Chinese Title: 对齐引导的分数匹配：用于扩散模型中文本到图像对齐

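Summary: The paper proposes Alignment-Guided Score Matching (AGSM), a lightweight post-training method that needs no external reward and aims to improve semantic consistency in text-to-image generation with diffusion models. Existing methods such as SoftREPA optimize soft text tokens by contrastive learning, but the contrastive loss penalizes negative samples too heavily, leading to object repetition and counting errors. AGSM treats text-image alignment as preference learning, defines the alignment reward with a Plackett-Luce model, and explicitly integrates positive and negative alignment directions into the diffusion model's score-matching objective. By training separate soft tokens for the positive and negative semantic regions (ψ⁺, ψ⁻), AGSM avoids the unbounded divergence of contrastive learning and preserves generation fidelity. Experiments show that AGSM improves counting accuracy on the GenEval benchmark by over 35% and is compatible with existing diffusion backbones (SD1.5, SDXL, SD3) and RL post-training methods.

Innovations:

Proposes a reward-free Plackett-Luce formulation that defines the alignment reward from the diffusion model's intrinsic log-likelihood, with no external reward model.
Addresses the instability of contrastive learning through explicit negative guidance (negative soft tokens), which keeps negative samples from drifting off the manifold and avoids repetition and over-counting.
Integrates alignment guidance directly into the score-matching objective, giving explicit score-level guidance for positive and negative samples while preserving fidelity.
Lightweight and model-agnostic: only a small number of soft tokens are optimized, and the method applies to existing diffusion models (SD1.5, SDXL, SD3) and works alongside RL post-training methods.

Methodology: First, an alignment reward is defined with the Plackett-Luce model, using the diffusion model's denoising error as an implicit reward. The data is then split into positive and negative subsets, and a corrected target score is obtained by tilting the target distribution (similar to classifier-free guidance). During training, EMA-updated soft tokens are used to compute the reward gradient, and guidance is applied to positive and negative pairs separately (+γ⁺ and −γ⁻). The final loss is the mean squared error between the predicted noise and the corrected target noise, optimizing the positive and negative soft tokens together. For flow models the loss takes a similar form. The algorithm samples positive and negative pairs, shares the timestep and noise across them, computes EMA predictions, and updates the soft tokens.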
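The Plackett-Luce model named in the methodology assigns a probability to a ranking of items given their scores. The sketch below computes its log-likelihood in NumPy and checks that, for a single positive/negative pair, it reduces to the Bradley-Terry form. Using the negated denoising error as the score follows the summary's description, but the exact reward definition in the paper may differ, and the error values here are invented.

```python
import numpy as np

def plackett_luce_log_prob(scores_in_rank_order):
    """Log-probability that items are ranked in the given order
    (best first): sum_i [ s_i - logsumexp(s_i, ..., s_n) ]."""
    s = np.asarray(scores_in_rank_order, dtype=float)
    total = 0.0
    for i in range(len(s)):
        tail = s[i:]
        m = tail.max()  # subtract the max for numerical stability
        total += s[i] - (m + np.log(np.exp(tail - m).sum()))
    return float(total)

# Hypothetical denoising errors: the matched caption denoises better.
err_positive, err_negative = 0.5, 2.5
score_pos, score_neg = -err_positive, -err_negative  # lower error = higher score

log_p_correct = plackett_luce_log_prob([score_pos, score_neg])
log_p_wrong = plackett_luce_log_prob([score_neg, score_pos])

# Bradley-Terry: P(pos beats neg) = sigmoid(score_pos - score_neg)
bradley_terry = float(np.log(1.0 / (1.0 + np.exp(-(score_pos - score_neg)))))
```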

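Key Results:

On the GenEval benchmark, AGSM improves counting accuracy by over 35% relative to SoftREPA.
AGSM matches SoftREPA on text-image semantic consistency while clearly improving its failure cases (such as object repetition and semantic incoherence).
The method applies to several diffusion backbones (SD1.5, SDXL, SD3) and complements existing RL post-training methods (such as Diffusion-DPO).
Qualitative results show that AGSM images follow the text more closely (for example, "a teddy bear on a park bench" and "an orange fire hydrant with a face and a bow tie").

Tech Stack:

Diffusion models (DDPM, SD1.5, SDXL, SD3)
Score matching
Plackett-Luce model
Bradley-Terry model
Contrastive learning (SoftREPA)
Classifier-free guidance (CFG)
Exponential moving average (EMA)
Soft tokens
Flow models
Monte Carlo estimation

Strengths:

Needs no external reward or human preference data and relies entirely on the diffusion model's intrinsic signal, which reduces data dependence.
Explicit negative guidance suppresses the divergence of contrastive learning and improves generation stability and semantic fidelity.
Lightweight post-training: only a few soft tokens are optimized, so compute cost is low and deployment is easy.
Model-agnostic and compatible with existing methods, so it is widely applicable and extensible.

Limitations:

Soft-token initialization may affect training, and initialization strategies need further study.
The method uses the diffusion model's own log-likelihood as the reward, which may still be biased for complex semantics.
Experiments are validated on a specific benchmark (GenEval) only; broader settings (long text, many objects) need more evaluation.
The strength of negative guidance (γ⁻) needs tuning, and different tasks may need different settings.

Relevance To Keywords:

Multimodal large models: the paper focuses on alignment in text-to-image generation, which belongs to unified multimodal understanding and generation.
Representation learning: optimizing soft tokens improves the alignment of text and image representations, with score matching as a proxy.
World models: diffusion models can be viewed as generative world models, and post-training improves their semantic consistency.
Reinforcement learning: the proposed preference-learning framework (Plackett-Luce) is closely related to RL post-training (such as DPO), and the methods are complementary.
Post-training: the core of the paper is a lightweight post-training method that optimizes a pretrained model directly, without training from scratch.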

25. KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning [PASS]

Score: 45.0 / 27.8

Authors: Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

Published: 2026-05-28

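TL;DR: KairosAgent is an agentic framework that unifies LLM semantic reasoning with TSFM numerical forecasting and refines the reasoning with multi-turn reinforcement learning, improving the accuracy and interpretability of multimodal time series forecasting.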

摘要翻译

跨域多模态时间序列预测是一项具有挑战性的任务，要求模型整合精确的数值理解、跨域语义理解和有效的多模态融合。现有方法要么从零构建时间序列基础模型（TSFMs），要么利用预训练的大语言模型（LLMs）。然而，TSFMs 往往忽视语义理解，缺乏面向未来的语义推理能力，而 LLMs 在数值理解和准确的定量预测方面存在困难。为了克服这些局限性，我们提出 KairosAgent，一种用于多模态时间序列预测的新型智能体框架，包括基于 LLM 的推理器和基于 TSFM 的预测器。KairosAgent 通过动态调用分析工具来增强 LLM 的数值理解和语义推理能力，进而统一文本推理与数值预测。推理结果随后被融合至 TSFM 流程中，从而实现更准确和可靠的未来预测。为了进一步改进推理，我们构建了一个高质量轨迹的大规模语料库，并引入了一种基于预测的强化学习范式，该范式包含多轮精炼和回合级信用分配。实验表明，KairosAgent 实现了卓越的零样本预测性能，同时最大化了预训练 LLMs 和 TSFMs 的效用，为高效且可解释的时间序列智能体展示了有前景的方向。项目页面位于 https://foundation-model-research.github.io/KairosAgent .

Abstract

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent .

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5

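Scoring rationale: The paper's core is combining LLM reasoning with TSFM forecasting, unifying textual reasoning and numerical prediction, so 'Unify Models' and 'MultiModal' are highly relevant. Reinforcement learning is used for multi-turn refinement, but it is not 'model-based RL' or 'World Models' in the usual sense, and there is no innovation in vision encoders or tokenizers, so those relevance scores are low. No authors from the specified expert list appear.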

Keywords

Multimodal Time Series Forecasting, LLM-based Reasoner, TSFM-based Forecaster, Agentic Framework, Semantic Reasoning, Reinforcement Learning, Numerical Comprehension

In-depth analysis

Chinese Title: KairosAgent：融合语义推理的智能体时间序列预测

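Summary: The paper proposes KairosAgent, an agentic framework for cross-domain multimodal time series forecasting that fuses semantic reasoning with numerical forecasting. Among existing methods, time series reasoning models (TSRMs) rely on LLMs but suffer from numerical hallucination, while time series foundation models (TSFMs) lack semantic understanding. KairosAgent consists of an LLM-based reasoner and a TSFM-based forecaster: the reasoner strengthens its numerical understanding through dynamic tool calls and produces a semantic description of the future shape of the series, and that description is fused into the TSFM forecasting pipeline. To improve reasoning, the authors build the T-STAR corpus of 40,000 high-quality reasoning trajectories and design a three-stage training recipe: SFT to warm up tool calling and shape reasoning, multimodal fusion training, and reinforcement learning with turn-level credit assignment. Experiments show that KairosAgent outperforms existing TSFMs and fully supervised baselines in zero-shot forecasting while remaining interpretable.

Innovations:

Proposes the agentic framework KairosAgent, which decouples LLM semantic reasoning from TSFM numerical forecasting and then fuses the two, avoiding numerical hallucination and black-box behavior.
Designs a dynamic tool-calling mechanism so that the LLM grounds its numerical understanding in statistical tools (trend, periodicity, volatility, and so on) and outputs a semantic description of the future shape rather than concrete values.
Builds the T-STAR corpus of 40,000 rigorously filtered, high-quality time series reasoning trajectories as a reliable training basis.
Introduces a three-stage training pipeline (SFT warm-up, multimodal fusion training, and reinforcement learning with turn-level credit assignment) that supervises intermediate steps at fine granularity.
Achieves cross-domain zero-shot forecasting that surpasses existing TSFMs and fully supervised baselines while remaining interpretable.

Methodology: KairosAgent uses a two-stage reason-then-forecast pipeline. Stage 1: the LLM reasoner gathers statistical features through multi-turn tool calls (analysis tools for trend, periodicity, volatility, regime change, and so on) and combines them with world knowledge to produce a semantic description r of the future shape. Stage 2: a text encoder encodes r into a semantic prior, which a gated cross-modal fusion module injects into the forecasting pipeline of the TSFM (an encoder-decoder architecture based on Kairos) to produce the final numerical forecast. Training has three steps: (1) SFT of the reasoner to learn tool calling and shape reasoning; (2) multimodal fusion training of the forecaster so that the TSFM can exploit the semantic description; (3) reinforcement learning with turn-level credit assignment to optimize the reasoner beyond plain imitation and sparse outcome rewards.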
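
To make the idea of gated fusion in the methodology concrete, here is a minimal sketch of the simplest possible element-wise gate. It is not the paper's module: the gate form, the residual update, and every name (`gated_fusion`, `w_series`, `w_text`, `bias`) are assumptions chosen for illustration, and the real fusion layer may well use attention and learned projections instead.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(series_feat, text_prior, w_series, w_text, bias):
    """Blend a semantic prior into time-series features, element by element.

    ASSUMPTION: a per-dimension scalar gate with a residual update.
    All parameters here are hypothetical stand-ins for learned weights.
    """
    fused = []
    for h, s, wh, ws, b in zip(series_feat, text_prior, w_series, w_text, bias):
        gate = sigmoid(wh * h + ws * s + b)  # value in (0, 1)
        fused.append(h + gate * s)           # a closed gate leaves h unchanged
    return fused

# Toy usage: dimension 0 has a closed gate, dimension 1 an open gate.
fused = gated_fusion(
    series_feat=[1.0, 2.0],
    text_prior=[0.5, -0.5],
    w_series=[0.0, 0.0],
    w_text=[0.0, 0.0],
    bias=[-50.0, 50.0],
)
print(fused)
```

The residual form means the forecaster falls back to its purely numerical features whenever the gate closes, which is one plausible way to keep an unhelpful text description from degrading the forecast.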

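Key Results:

KairosAgent outperforms existing TSFMs (such as Kairos and TimesFM) and fully supervised baselines (such as PatchTST and DLinear) on zero-shot time series forecasting.
Shape-reasoning quality improves markedly, and the generated semantic descriptions reflect future patterns more accurately.
The three-stage training improves reasoning reliability, and turn-level credit assignment beats outcome-level rewards.
Generalization is validated on several cross-domain datasets (electricity, traffic, weather, and others).

Tech Stack:

LLM (e.g., the GPT family; the paper does not name a specific model)
TSFM: Kairos (a lightweight zero-shot forecasting foundation model)
Toolset: statistical tools for trend analysis, periodicity detection, volatility measurement, regime-change detection, and similar
Text encoder (e.g., Sentence-BERT or a similar semantic encoder)
Gated cross-modal fusion
Supervised fine-tuning (SFT)
Reinforcement learning (RL) with turn-level credit assignment
T-STAR corpus construction and filtering

Strengths:

Effectively fuses the semantic reasoning of LLMs with the numerical precision of TSFMs, closing the gap between the two modalities.
Dynamic tool calling makes the LLM's numerical understanding more reliable and avoids the numerical hallucination caused by serializing the series directly.
The three-stage training is well designed, moving from imitation to fusion to reinforcement learning and improving reasoning step by step.
Zero-shot forecasting is strong, and the interpretable shape descriptions make the model more transparent.
The high-quality reasoning trajectory corpus gives training a solid foundation.

Limitations:

Depends on a pretrained LLM and a pretrained TSFM, so compute requirements are high.
Tool calls may add inference latency and limit real-time use.
The quality of the shape description is bounded by the LLM's world knowledge and the precision of the tools, so it may be inaccurate in complex scenarios.
Experiments cover a limited set of datasets, and cross-domain generalization needs further testing.
The specific LLM and fine-tuning details are not stated clearly, which may limit reproducibility.

Relevance To Keywords: The paper is highly relevant to the keywords. Unify Models: KairosAgent unifies an LLM (reasoning model) with a TSFM (forecasting model) to achieve multimodal fusion. World Models: the shape description can be seen as a partial world-model representation of the series' future state, and shape reasoning implicitly models future dynamics. Representation Learning: the text encoder and gated fusion learn a joint cross-modal representation from the semantic prior. Model-Based RL: reinforcement learning is used to optimize the reasoner, which amounts to post-training of the model. Native multimodal large models: the framework is essentially multimodal cooperation between an LLM and a TSFM. Unified understanding and generation: the LLM handles understanding (reasoning) and the TSFM handles generation (forecasting). Reinforcement learning: turn-level credit assignment is an RL optimization. Post-training: the three-stage training includes SFT and RL post-training.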

26. AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference [PASS]

Score: 45.0 / 27.8

Authors: Yilin Feng, Ahmed Burak Gulhan, Mahmut Taylan Kandemir

Published: 2026-05-28

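TL;DR: This paper proposes AsymVLM, which uses an asymmetric pruning strategy to make vision-language model inference more efficient, cutting compute substantially while preserving task accuracy.

Abstract (Chinese translation)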

视觉 - 语言模型（VLMs）每张图像处理数千个视觉 token 以及相对较少的文本 token，然而现有的压缩方法对这两种模态采用统一处理。我们发现这两种模态具有根本不同的性质：视觉 token 具有空间冗余性并主导预填充阶段，而文本 token 具有因果依赖性并在解码过程中累积。基于这种不对称性，我们提出并实证评估了 AsymVLM，该方法在预填充前利用具有样本自适应预算的学习重要性评分器对视觉 token 应用激进剪枝，并仅在文本 token 超过固定预算时对其应用基于时间阈值的驱逐策略。实验表明，AsymVLM 在最先进方法中实现了最高的 FLOPs 节省（高达 54%），同时在视觉信息具有空间局部化和查询特定性的文档与图表理解任务上，比现有方法高出 2%–3%，并在整体基准上保持了具有竞争力的准确率。在文本主导场景中，我们的驱逐策略通过适应 VLM 的短上下文特性，显著优于标准的 LLM 缓存压缩方法。

Abstract

Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have fundamentally different properties: vision tokens are spatially redundant and dominate prefill, while text tokens are causally dependent and accumulate during decoding. Based on this asymmetry, we propose and empirically evaluate AsymVLM, which applies aggressive pruning to vision tokens before prefill using a learned importance scorer with per-sample adaptive budgeting, and temporal threshold-based eviction to text tokens only when they exceed a fixed budget. Our experiments indicate that AsymVLM achieves the highest FLOPs savings (up to 54%) among state-of-the-art methods while outperforming existing approaches by 2--3% on document and chart understanding tasks where visual information is spatially localized and query-specific, and maintaining competitive accuracy on holistic benchmarks. In text-dominated scenarios, our eviction strategy substantially outperforms standard LLM cache compression methods by adapting to the short-context nature of VLM.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

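Scoring rationale: The paper's core is accelerating vision-language model (VLM) inference through asymmetric token pruning. MLLM and MultiModal are highly relevant because VLMs are multimodal large models. Unify Models and Visual Encoder are moderately relevant because the work touches on multimodal unification and visual token handling. Tokenizer is moderately relevant because the method manipulates tokens without designing a tokenizer architecture. World Models and model-based RL are unrelated. None of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list, so no bonus is applied.

Keywords

Asymmetric Token Pruning, Vision-Language Models, Inference Efficiency, Visual Token Reduction, Text Token Eviction, Multimodal Inference, Model Compression

In-Depth Analysis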

Chinese Title: AsymVLM: 非对称令牌剪枝用于高效视觉-语言模型推理

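Summary: This paper proposes AsymVLM, an asymmetric token pruning framework for efficient vision-language model (VLM) inference. The authors observe that vision tokens and text tokens differ fundamentally in structure, count, and compressibility: vision tokens are spatially redundant and dominate prefill, while text tokens are causally dependent and accumulate during decoding. Building on this asymmetry, AsymVLM prunes vision tokens aggressively before prefill, using a learned importance scorer with per-sample adaptive budgeting, and evicts text tokens with a threshold-based policy only once they exceed a fixed budget. Experiments report FLOPs savings of up to 54%, a 2-3% gain over existing methods on document and chart understanding, and competitive accuracy on holistic benchmarks. In text-dominated scenarios, the eviction policy adapts to the short-context nature of VLMs and substantially outperforms standard LLM cache compression methods.

Innovations:

Systematically analyzes the asymmetry between vision and text tokens in structure, count, and compressibility, and argues that compression must be modality-aware.
Proposes the two-policy framework AsymVLM: query-aware pruning of vision tokens before prefill, and budget-constrained eviction of text tokens during decoding.
Introduces a learned cross-modal importance scorer that ranks tokens by minimizing the output difference before and after pruning, which is more accurate than semantic similarity alone.
Proposes a per-sample adaptive budget that adjusts the vision-token keep ratio to each input's importance distribution, overcoming the weakness of a fixed pruning ratio.
Uses a threshold policy for text token eviction that suits the short contexts typical of VLMs and outperforms standard LLM cache methods.

Methodology: The work first establishes the asymmetry between vision and text tokens empirically. Vision token pruning has two parts: (1) a learned importance scorer based on cross-modal similarity and optimized to preserve the model's output; (2) a per-sample adaptive budget that sets the keep ratio from the importance gap (the difference between the 75th and 25th percentiles of the scores). Text token eviction runs during decoding: when the generated sequence exceeds a fixed budget, low-importance tokens are evicted based on last-layer attention scores. The framework is evaluated on models such as LLaVA and Phi-3-Vision using benchmarks such as DocVQA and ChartQA.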
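
The per-sample adaptive budget described in the methodology can be sketched as follows. The analysis only states that the keep ratio is derived from the gap between the 75th and 25th percentiles of the importance scores; the normalization, the linear mapping, the ratio bounds, and the direction (a wider gap means the scorer separates tokens more confidently, so fewer are kept) are all assumptions made for this illustration, as are the function names.

```python
import statistics

def adaptive_keep_ratio(scores, min_ratio=0.1, max_ratio=0.5):
    """Map the interquartile gap of importance scores to a keep ratio."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    spread = max(scores) - min(scores)
    # Normalise the gap to [0, 1]; a flat score distribution gives gap = 0.
    gap = (q3 - q1) / spread if spread > 0 else 0.0
    # ASSUMPTION: wide gap -> confident separation -> prune more aggressively.
    return max_ratio - (max_ratio - min_ratio) * gap

def prune_vision_tokens(scores, **kwargs):
    """Return the indices of the tokens to keep, in their original order."""
    ratio = adaptive_keep_ratio(scores, **kwargs)
    k = max(1, round(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy usage with hypothetical importance scores for eight vision tokens.
flat = [0.5] * 8
peaked = [0.9, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.8]
print(adaptive_keep_ratio(flat), prune_vision_tokens(flat))
print(adaptive_keep_ratio(peaked), prune_vision_tokens(peaked))
```

Returning the kept indices in their original order preserves the spatial sequence of the surviving tokens, which matters because positional information is tied to token order.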

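Key Results:

AsymVLM saves up to 54% of FLOPs, more than existing methods.
On document and chart understanding tasks, accuracy is 2-3% higher than existing methods.
Accuracy stays competitive on holistic benchmarks (such as VQAv2 and GQA).
In free-form generation, the text eviction policy substantially outperforms standard LLM cache methods such as StreamingLLM and H2O.
The adaptive budget yields consistent gains over a fixed pruning ratio at the same budget.

Tech Stack:

Visual encoders: CLIP, SigLIP, and others
Vision-language models: LLaVA, Phi-3-Vision, Qwen-VL
Pruning methods: FastV, SparseVLM, FlashVLM, VisPruner
KV cache compression: StreamingLLM, H2O, PyramidKV, DuoAttention
Attention implementation: FlashAttention
Importance scoring: learned cross-modal similarity, output-difference minimization
Adaptive budget: dynamic threshold based on the importance gap

Strengths:

The asymmetric design matches the different properties of vision and text tokens and compresses each efficiently.
The learned scorer directly optimizes output preservation, which is more reliable than plain semantic similarity.
The adaptive budget adapts flexibly to how visual information is distributed in each sample.
The balance between high FLOPs savings and preserved accuracy is validated on multiple benchmarks.
The text eviction policy is tuned for the short contexts of VLMs, covering a gap left by existing LLM methods.

Limitations:

The learned scorer requires extra training or fine-tuning, which may add upfront cost.
Experiments rely mainly on specific models such as Phi-3-Vision, so generalization needs further validation.
The adaptive budget depends on the computed importance gap and may be sensitive to noise.
Scalability to long video or multi-image inputs is not discussed.
The eviction threshold may affect generation quality and has to be tuned by hand.

Relevance To Keywords: The paper focuses on inference efficiency for multimodal large models (VLMs), so it is highly relevant to "native multimodal large models". Its asymmetric token pruning involves representation learning (importance scoring) and model compression, but it does not directly address world models, reinforcement learning, or post-training. The method can be viewed as an inference acceleration technique applied after training, although its main contribution concerns inference rather than training. Overall relevance is moderately high.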

27. CCS: Clinical Consensus Selection for Radiology Report Generation [PASS]

Score: 45.0 / 27.8

Authors: Xi Zhang, Yingshu Li, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho

Published: 2026-05-28

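TL;DR: This paper proposes the Clinical Consensus Selection (CCS) framework, which improves radiology report generation by selecting the report with the highest clinical consensus from multiple MLLM candidates rather than relying on single-path decoding.

Abstract (Chinese translation)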

放射学报告生成（RRG）通常被建模为单路径生成任务，其中多模态大语言模型（MLLM）生成一个解码报告作为最终输出。尽管最近的进展主要由扩展训练数据、模型容量和检索机制驱动，但在推理时提高报告质量的研究仍显不足。在这项工作中，我们发现固定的放射学 MLLM 往往能在候选池的其他样本中生成比默认解码所选报告更具临床质量的报告，这表明推理时的决策仍是一个被忽视的瓶颈。为了解决这一问题，我们提出临床共识选择（CCS），这是一种解码器无关的推理时选择框架，它采样多个候选报告，并从展开池中选择临床共识最高的那一个。CCS 统一了基于文本的效用函数与由图像 - 报告联合训练的多模态嵌入器计算的放射学适配效用函数，后者衡量候选报告之间超越表面文本相似性的一致性。在三个数据集和多个放射学 MLLM 上，CCS 一致地改进了推理时性能，优于单路径解码和通用的 Best-of-N 基线，尤其在临床指标上表现出显著提升。进一步分析表明，基于图像的效用函数构成了与文本共识不同的选择维度，且在推理时改进 RRG 方面仍存在巨大的提升空间。

Abstract

Radiology report generation (RRG) is commonly formulated as a single-path generation task, where a multimodal large language model (MLLM) produces one decoded report as the final output. While recent progress has largely been driven by scaling training data, model capacity, and retrieval mechanisms, improving report quality at inference time remains underexplored. In this work, we observe that fixed radiology MLLMs often generate clinically stronger reports elsewhere in their candidate pool than the one selected by default decoding, suggesting that inference-time decision making remains an overlooked bottleneck. To address this, we propose Clinical Consensus Selection (CCS), a decoder-agnostic inference-time selection framework that samples multiple candidate reports and selects the one with the highest clinical consensus across the rollout pool. CCS unifies text-based utilities with a radiology-adapted utility computed by an image--report-trained multimodal embedder, which measures candidate agreement beyond surface-level textual similarity. Across three datasets and multiple radiology MLLMs, CCS consistently improves inference-time performance over single-path decoding and generic Best-of-N baselines, with particularly clear gains on clinical metrics. Further analysis shows that image-grounded utility forms a selection axis distinct from textual consensus and that substantial headroom remains for improving RRG at inference time.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	2.0/10	3.0

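Scoring rationale: The paper's core is radiology report generation with multimodal large language models (MLLMs), so MLLM and MultiModal are highly relevant (9 each). The abstract says the method "unifies" utilities, but this refers to combining utility functions rather than unifying model architectures, so Unify Models scores low (3). Tokenizer, World Models, and model-based RL are either absent from the abstract or unrelated, so they score low (1-2). Visual Encoder is implicit in the image embedder and is moderately relevant (4). The author list contains none of the designated experts such as Yang Shi, so no bonus is applied. The weighted total is 45.0, above the dynamic passing score of 27.8.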
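
The arithmetic behind the score table can be checked with a few lines. Each row's score is the keyword weight times the relevance rating, and the total is compared with the passing score. The relevance values below are copied from the table for this paper; how the report derives its "dynamic" passing score is not stated, so 27.8 is treated as a given constant, and treating a tie as a pass is an assumption.

```python
# Relevance ratings (0-10) copied from the score table for this paper.
relevance = {
    "Unify Models": 3.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 4.0,
    "World Models": 1.0,
    "MLLM": 9.0,
    "MultiModal": 9.0,
    "model-based RL": 2.0,
}
WEIGHT = 1.5           # every keyword carries the same weight in the table
PASS_THRESHOLD = 27.8  # passing score quoted in the report (derivation unknown)

def weighted_total(ratings, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in ratings.values())

total = weighted_total(relevance)
passed = total >= PASS_THRESHOLD  # ASSUMPTION: a tie counts as a pass
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {'PASS' if passed else 'FAIL'}")
```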

Keywords

Radiology Report Generation, MLLM, Inference-time Selection, Clinical Consensus, Multimodal Embedder, Candidate Pool, Text-based Utilities

In-Depth Analysis

Chinese Title: CCS：放射学报告生成的临床共识选择

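Summary: For radiology report generation (RRG), this paper proposes an inference-time selection framework, Clinical Consensus Selection (CCS). RRG is usually treated as single-path generation, in which the model decodes one report, but the authors find that a fixed model's candidate pool often contains reports of higher clinical quality than the one chosen by default decoding. CCS samples multiple candidate reports and measures the clinical consensus among them using text utilities together with a radiology-adapted utility computed by a multimodal embedder trained on image-report pairs, then selects the report with the highest consensus. Experiments on three datasets and multiple radiology MLLMs show that CCS consistently beats single-path decoding and generic Best-of-N baselines, with especially clear gains on clinical metrics. The analysis also shows that the image-grounded utility forms a selection axis distinct from textual consensus, which indicates remaining headroom for inference-time decision making.

Innovations:

Revisits RRG from an inference-time perspective and finds that the candidate pool often contains more reliable reports than the default decoding output, exposing inference-time decision making as a bottleneck.
Proposes CCS, a decoder-agnostic inference-time selection framework that samples multiple candidate reports and aggregates pairwise clinical consensus to choose the final report.
Introduces a radiology-adapted multimodal embedder (Qwen3-VL-Embed) as a utility function that measures agreement between candidates in an image-report representation space, going beyond pure text similarity.
Validates the generality and effectiveness of CCS on three datasets and multiple radiology MLLMs, with notable gains on clinical metrics.
Shows that the image-grounded utility and textual consensus form distinct selection axes, opening a new direction for inference-time optimization.

Methodology: The CCS framework has four steps: (1) sample N candidate reports from the radiology MLLM with stochastic decoding to form the rollout pool; (2) compute pairwise utility scores between candidates (a text utility and an image-report multimodal embedding utility); (3) for each candidate, average its utility against the other N-1 candidates to obtain a consensus score; (4) output the report with the highest consensus score. The text utility uses metrics such as ROUGE-L; the image utility uses embedding similarity computed with the Qwen3-VL-Embed model.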
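
The four steps in the methodology can be sketched as a small selection routine. This is an illustrative sketch under stated assumptions, not the authors' implementation: it covers only the text utility (a plain whitespace-tokenized ROUGE-L F1), it leaves out the image-report embedding utility because that needs a real embedding model, and the function names and toy candidate reports are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lower-cased whitespace tokens (simplified text utility)."""
    a, b = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(a, b) if a and b else 0
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)

def select_by_consensus(candidates, utility=rouge_l_f1):
    """Pick the candidate with the highest mean utility against the others."""
    if len(candidates) < 2:
        raise ValueError("consensus needs at least two candidates")
    scores = []
    for i, cand in enumerate(candidates):
        others = [utility(cand, c) for j, c in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores

# Toy rollout pool (hypothetical strings, not real model output).
pool = [
    "no acute cardiopulmonary abnormality",
    "no acute cardiopulmonary process",
    "large left pleural effusion",
    "no acute cardiopulmonary abnormality seen",
]
best, scores = select_by_consensus(pool)
print(best, [round(s, 3) for s in scores])
```

Because `utility` is a parameter, a combined text and image utility could be passed in place of `rouge_l_f1` without changing the selection logic; the outlier report receives a consensus score of zero here because it shares no tokens with the others.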

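Key Results:

On the MIMIC-CXR, IU-Xray, and CheXpert Plus datasets, CCS consistently outperforms single-path decoding and Best-of-N baselines.
Gains are largest on clinical metrics (such as CheXpert and RadGraph), indicating that CCS selects clinically more reliable reports.
The image-grounded utility and textual consensus form distinct selection axes, and combining them improves performance further.
The candidate pool often contains a better report than the default decoding output, so inference-time selection has considerable headroom.

Tech Stack:

Multimodal large language models (MLLMs): LLaVA-Med, LLaVA-Rad, MAIRA, and others
Stochastic decoding: temperature sampling (temperature τ)
Text utility metrics: ROUGE-L, BLEU, METEOR, and others
Multimodal embedding model: Qwen3-VL-Embed
Pairwise similarity: cosine similarity
Consensus aggregation: mean pairwise utility

Strengths:

An inference-time method that requires no parameter changes or retraining, which makes it practical.
The framework is decoder-agnostic and applies to a range of radiology MLLMs.
Using an image-report multimodal embedding as a utility function fits radiology's clinical needs better than text alone.
Experiments span multiple datasets and models, with consistent and significant results.
Exposes the inference-time decision bottleneck, giving later work a new angle.

Limitations:

Sampling multiple candidate reports increases inference-time compute.
Results depend on the choice of utility function; only text and image-embedding utilities are explored, and better designs may exist.
Experiments train only on the MIMIC-CXR training set and use the other datasets to test cross-domain generalization, without validation on more diverse datasets.
The best choices for pool size N and temperature τ are not discussed and may need tuning per model.

Relevance To Keywords:

Unify Models: The paper uses multimodal large language models (MLLMs) for radiology report generation, which falls under unified models.
World Models: The paper does not directly involve world models, but inference-time selection can be seen as checking generated reports for consistency with world knowledge, and consensus selection implicitly models clinical knowledge.
Representation Learning: The paper uses a multimodal embedding model (Qwen3-VL-Embed) to learn a joint image-report representation for the utility computation, a typical application of representation learning.
Model-Based RL: The CCS framework resembles rollout selection in model-based reinforcement learning, but has no explicit reward model.
Native multimodal large models: The MLLMs used are native multimodal models, and CCS acts as an inference-time enhancement.
Unified understanding and generation in multimodal large models: The paper focuses on generation, but the utility computation involves understanding (embedding similarity).
Reinforcement learning: The paper does not use reinforcement learning, but concepts such as Best-of-N and GRPO relate to policy optimization in RL.
Post-training: CCS is an inference-time optimization and involves no post-training, but it can complement post-training methods.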

28. Comparative Evaluation of Machine Translation Systems on Images with Text [PASS]

Score: 45.0 / 27.8

Authors: Blai Puchol, Sergio Gómez González, Miguel Domingo, Francisco Casacuberta

Published: 2026-05-28

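TL;DR: This paper evaluates machine translation systems for text in images and finds that multimodal large language models (MLLMs) outperform both modular pipelines and an end-to-end model.

Abstract (Chinese translation)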

本文对应用于包含文本信息的图像的机器翻译系统进行了比较评估，该任务位于计算机视觉与自然语言处理的交叉领域。本研究比较了三种主要范式：将文本检测、识别和翻译分离的模块化流水线；能够联合处理图像和文本的多模态大语言模型（MLLMs）；以及直接生成翻译图像的端到端模型 Translatotron-V。模块化系统采用了最先进的 OCR（docTR），并结合了 Llama 和 EuroLLM 等多语言大语言模型，而所评估的 MLLMs 则包括 Gemini 2.5 的不同配置。实验在覆盖多种语言对的平行多语言数据集上进行，评估基于 BLEU、chrF 和 TER 指标。结果表明，模块化流水线优于端到端方法，而 MLLMs 实现了最佳整体性能，展现出卓越的灵活性和上下文理解能力。这些发现强调了多模态推理在图像到文本翻译中的有效性，并为未来研究在多语言环境中整合视觉理解与语言生成奠定了坚实基础。

Abstract

This work presents a comparative evaluation of machine translation systems applied to images containing textual information, a task that lies at the intersection of computer vision and natural language processing. The study compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multi-modal large language models (MLLMs) capable of processing both image and text jointly; and an end-to-end model, Translatotron-V, which directly generates translated images. The modular systems employ state-of-the-art OCR (docTR) combined with multilingual LLMs such as Llama and EuroLLM, while the evaluated MLLMs include different configurations of Gemini 2.5. Experiments were conducted on parallel multilingual datasets covering multiple language pairs, with evaluation based on BLEU, chrF, and TER metrics. The results show that modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, demonstrating superior flexibility and contextual understanding. These findings underscore the effectiveness of multi-modal reasoning for image-to-text translation and provide a solid foundation for future research on integrating visual understanding and language generation in multilingual settings.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

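Scoring rationale: The paper's core is a comparison of image translation systems. MLLM and MultiModal are highly relevant (multimodal processing and model comparison); Unify Models and Visual Encoder are moderately relevant (architecture comparison and visual components); Tokenizer is weakly relevant (implicit in the LLMs); World Models and model-based RL are unrelated. None of the designated experts appear in the author list.

Keywords

Machine Translation, Image Translation, MLLMs, Modular Pipelines, End-to-End Model, Visual Understanding, Multilingual Settings

In-Depth Analysis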

Chinese Title: 图像中文本的机器翻译系统比较评估

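Summary: This paper presents a comparative evaluation of machine translation systems for text in images, a task at the intersection of computer vision and natural language processing. It compares three main paradigms: modular pipelines that separate text detection, recognition, and translation; multimodal large language models (MLLMs) that process image and text jointly; and the end-to-end model Translatotron-V, which generates the translated image directly. The modular systems pair state-of-the-art OCR (docTR) with multilingual LLMs such as Llama and EuroLLM, and the evaluated MLLMs are different configurations of Gemini 2.5. Experiments run on parallel multilingual datasets covering several language pairs and are scored with BLEU, chrF, and TER. Modular pipelines outperform the end-to-end approach, while MLLMs achieve the best overall performance, showing greater flexibility and contextual understanding. The findings underline the effectiveness of multimodal reasoning for image-to-text translation and provide a basis for future work that combines visual understanding with language generation.

Innovations:

Builds a modular pipeline that runs OCR and machine translation sequentially with interchangeable models, and systematically compares combinations of OCR and translation models.
Applies multimodal large language models (the Gemini 2.5 family) directly to image text translation with no explicit OCR step.
Compares against the end-to-end model Translatotron-V and shows that both modular pipelines and MLLMs perform better.
Evaluates comprehensively on several language pairs (German-English, French-English, Romanian-English) with multiple translation quality metrics.
Shows that MLLMs have an advantage in contextual understanding and error correction, pointing to directions for multimodal translation research.

Methodology: The study proceeds as follows: (1) It uses synthetic image datasets generated from the IWSLT14 and IWSLT17 parallel corpora, covering German-English, French-English, and Romanian-English. (2) It builds a modular pipeline: the OCR stage compares EasyOCR and docTR and settles on docTR (FAST detector + CNN recognizer); the translation stage uses M2M100-1.2B (conventional NMT) and the Llama and EuroLLM families (LLMs), guided by carefully designed prompts. (3) It evaluates the multimodal large language model Gemini 2.5 (flash-lite, flash, pro), which receives the image and a text prompt directly. (4) It compares against the end-to-end model Translatotron-V, whose results were obtained through private communication. (5) It scores translation quality with BLEU, chrF, and TER.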
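
To give a feel for what an edit-based metric such as TER measures, here is a deliberately simplified sketch: word-level edit distance divided by reference length. It is an assumption-laden illustration, not the metric used in the paper. Real TER also allows block shifts and is normally computed with an established toolkit, and the function names and example sentences below are hypothetical.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Minimum number of word insertions, deletions, and substitutions."""
    prev = list(range(len(ref_tokens) + 1))
    for i, h in enumerate(hyp_tokens, 1):
        cur = [i]
        for j, r in enumerate(ref_tokens, 1):
            cur.append(min(
                prev[j] + 1,             # delete a hypothesis word
                cur[j - 1] + 1,          # insert a reference word
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

def simplified_ter(hypothesis, reference):
    """Edits per reference word; lower is better. Shifts are NOT modelled."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hypothesis.split(), ref_tokens) / len(ref_tokens)

# An OCR slip ("c4t") propagates into the translation and costs one edit.
print(simplified_ter("the c4t sat on the mat", "the cat sat on the mat"))
print(simplified_ter("the cat sat on the mat", "the cat sat on the mat"))
```

The first example mirrors the error-propagation issue in modular pipelines: a single recognition error becomes a substitution in the final output, whereas a model that corrects it from context would score zero edits.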

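Key Results:

The modular pipeline (docTR + LLM) beats the end-to-end model Translatotron-V on BLEU, chrF, and TER.
The Gemini 2.5 family achieves the best overall performance on all language pairs, Gemini 2.5 pro in particular.
Among the LLMs, EuroLLM-9B outperforms Llama models of similar size, suggesting that pretraining targeted at European languages helps.
In the OCR stage docTR outperforms EasyOCR and is adopted as the standard OCR component.
MLLMs can use context to correct OCR errors, showing greater flexibility and understanding.

Tech Stack:

OCR: EasyOCR (CRAFT + CRNN), docTR (FAST detector + VGG16-BN + RNN recognizer)
Conventional NMT: M2M100-1.2B (Transformer encoder-decoder)
LLM: Llama-3.2-1B-Instruct, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, EuroLLM-1.7B, EuroLLM-9B
MLLM: Gemini 2.5 flash-lite, flash, pro (sparse MoE architecture)
End-to-end model: Translatotron-V (image to image)
Evaluation metrics: BLEU, chrF, TER
Datasets: synthetic images based on IWSLT14/17, generated with the Python Pillow library
Prompt engineering: task-specific translation instructions for the LLMs and MLLMs

Strengths:

Systematically compares three main paradigms (modular, MLLM, end-to-end), covering the current mainstream approaches.
Uses several language pairs and several metrics, which strengthens the reliability of the results.
The modular pipeline is flexible and lets the OCR and translation components be optimized independently.
The MLLM evaluation shows the potential of multimodal models for image text translation and gives future work a baseline.
The prompting strategy is described clearly, which aids reproducibility.

Limitations:

The datasets are synthetic images and may not reflect the complex backgrounds, fonts, and lighting of real scenes.
Results for the end-to-end model Translatotron-V come only from private communication and cover just two language pairs, without public verification.
Other MLLMs (such as GPT-4V and Claude) are not evaluated, so the comparison is limited.
The modular pipeline still suffers from error propagation: OCR errors affect the downstream translation.
There is no human evaluation; relying on automatic metrics alone may not capture translation quality fully.

Relevance To Keywords:

Native multimodal large models: The evaluated Gemini 2.5 family consists of native multimodal models that process images and text directly, so relevance is high.
Unified understanding and generation in multimodal large models: The MLLMs perform visual understanding and text generation together.
Representation learning: The OCR and translation models involve visual feature extraction and language representation learning; the CNN and RNN in the modular pipeline learn representations.
World models: The paper does not directly involve world models, but multimodal translation can be seen as a form of understanding the visual world, so it is indirectly related.
Post-training: Instruction tuning of LLMs and MLLMs is post-training, and the instruction-tuned models used in the paper are products of it.
Reinforcement learning: The paper does not involve reinforcement learning, so relevance is weak.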

29. DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark [PASS]

Score: 45.0 / 27.8

Authors: Ruofan Hu, Menghui Zhu, Jieming Zhu, Bo Chen, Shengyang Xu, Minjie Hong, Xiaoda Yang, Sashuai Zhou, Li Tang, Tao Jin, Zhou Zhao

Published: 2026-05-28

TL;DR: DocRetriever proposes a plug-and-play multimodal document retrieval framework utilizing layout-aware sparse embedding and reasoning-augmented reranking, achieving superior performance on a new comprehensive benchmark.

Abstract (Chinese translation)

多模态文档包含表格、图表和布局等多样化元素，这可能会使检索任务变得复杂。尽管当前方法通常将密集视觉嵌入模型（dense visual embedding models）与监督重排器（supervised rerankers）相结合以实现高精度检索，但它们仍面临固有的局限性。首先，密集嵌入的粗粒度特性倾向于模糊显式语义，无法利用结构显著信息。其次，监督重排模型面临泛化瓶颈，因为其性能严重依赖于领域特定训练数据。此外，现有的基准往往缺乏多样化的评估维度和全面的相关性标注，从而限制了可靠的评估。为应对这些挑战，本文提出 DocRetriever，一个即插即用框架。该框架通过一种感知布局的稀疏嵌入技术增强视觉检索，能够在无需光学字符识别（OCR）开销的情况下实现有效的混合编码。此外，我们还引入了一种可泛化的重排器，该重排器利用推理增强演示（reasoning-augmented demonstrations）和优化采样，以提高少样本设置下的准确性。最后，我们构建了一个新的基准 MultiDocR，以支持更严格的评估。在不同基准上的实验验证了 DocRetriever 相对于最先进方法的优越性。

Abstract

Multimodal documents contain diverse elements, such as tables, figures, and layouts, which can complicate retrieval tasks. While current approaches typically combine dense visual embedding models with supervised rerankers to achieve high-precision retrieval, they face inherent limitations. First, the coarse-grained nature of dense embeddings tends to obfuscate explicit semantics, failing to leverage structurally salient information. Second, supervised reranking models suffer from generalization bottlenecks, as their performance heavily relies on domain-specific training data. Furthermore, existing benchmarks often lack diverse assessment dimensions and comprehensive relevance annotations, limiting reliable evaluation. To address these challenges, we propose DocRetriever, a plug-and-play framework. It enhances visual retrieval via a layout-aware sparse embedding technique, enabling effective hybrid encoding without the overhead of optical character recognition (OCR). We also introduce a generalizable reranker that leverages reasoning-augmented demonstrations and optimized sampling to improve accuracy in few-shot settings. Finally, we construct a new benchmark, MultiDocR, to enable more rigorous evaluation. Experiments across diverse benchmarks validate DocRetriever's superiority over state-of-the-art methods.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: MultiModal is core (10) as the paper focuses on multimodal retrieval. Visual Encoder (7) is relevant for handling layout and figures via embeddings. MLLM (5) is partially relevant due to the reasoning components. Unify Models (4) reflects the unification offered by the plug-and-play framework. Tokenizer, World Models, and model-based RL are largely irrelevant (1-2) to this retrieval-focused work.

Keywords

Multimodal Document Retrieval, Layout-aware Sparse Embedding, Hybrid Encoding, Reasoning-augmented Reranker, Few-shot Learning, MultiDocR Benchmark, Plug-and-Play Framework

In-Depth Analysis

Chinese Title: DocRetriever：面向多模态文档检索的即插即用框架与综合基准

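Summary: Multimodal documents contain diverse elements such as tables, figures, and layouts, and existing retrieval methods suffer from semantically blurry dense embeddings, a generalization bottleneck in supervised rerankers, and benchmarks with too few evaluation dimensions. This paper proposes DocRetriever, a plug-and-play framework. A layout-aware sparse embedding technique enables hybrid encoding by extracting lexical-level semantics from the VLM's logit distribution without OCR. A Reinforced ICL reranker improves few-shot generalization through autonomously synthesized reasoning-augmented demonstrations and dual-similarity sampling. The new MultiDocR benchmark covers 10 domains and 7 query types, with five-level relevance annotations and query rewrites, to support more comprehensive evaluation. Experiments show that DocRetriever surpasses existing methods on multiple benchmarks.

Innovations:

Proposes a layout-aware sparse embedding technique that extracts lexical-level semantics from the logit distribution of the VLM's language-model head, enabling OCR-free dense-sparse hybrid encoding and improving retrieval by about 3% NDCG@10.
Introduces the Reinforced ICL framework, which autonomously synthesizes reasoning-augmented demonstrations, cross-validates them, and samples them by dual query-document similarity to improve the reranker's few-shot generalization.
Builds the comprehensive MultiDocR benchmark, with 10 document domains, 7 query types, five-level relevance scores, and query rewrites, addressing the incomplete annotations and narrow evaluation dimensions of existing benchmarks.
The framework as a whole is plug-and-play and integrates into existing VLM retrieval systems without extra training or fine-tuning.

Methodology: The system uses a two-stage retrieve-then-rerank architecture. Stage 1: a VLM (such as ColPali) visually encodes each document page, while a sparse embedding is extracted from the LM-head logit distribution (after frequency re-weighting and top-256 selection); the sparse and dense embeddings are fused with weights into a hybrid similarity, which retrieves the top-K candidates. Stage 2: pointwise reranking with the Reinforced ICL strategy. The VLM autonomously generates reasoning-augmented demonstrations (each with a query, a document screenshot, and reasoning steps), and cross-validation keeps only high-quality ones. At inference time the most relevant demonstrations are retrieved by dual (query and document) similarity, and the VLM outputs a relevance score from the demonstrations and the query-document pair.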
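
Stage 1 of the methodology (a sparse vector from LM-head logits, fused with a dense score) can be sketched as below. The analysis names the ingredients but not the formulas, so the IDF-style re-weighting, the single-vector cosine used as the dense score, the fusion weight `alpha`, and all function names are assumptions for illustration; a late-interaction retriever such as ColPali would compute its dense score differently.

```python
import math

def sparse_from_logits(logits, doc_freq, n_docs, top_k=256):
    """Turn LM-head logits {token: logit} into a sparse {token: weight} vector."""
    weighted = {}
    for token, logit in logits.items():
        if logit <= 0:
            continue  # keep only positively activated vocabulary entries
        # ASSUMED frequency re-weighting: down-weight tokens common in the corpus.
        idf = math.log(1 + n_docs / (1 + doc_freq.get(token, 0)))
        weighted[token] = logit * idf
    top = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)

def sparse_score(query_vec, doc_vec):
    """Dot product over the tokens the two sparse vectors share."""
    return sum(w * doc_vec.get(tok, 0.0) for tok, w in query_vec.items())

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, alpha=0.7):
    """Weighted fusion of a dense and a sparse similarity (alpha is assumed)."""
    return alpha * cosine(q_dense, d_dense) + (1 - alpha) * sparse_score(q_sparse, d_sparse)

# Toy usage with a four-token vocabulary and top_k=2.
page = sparse_from_logits({"invoice": 2.0, "the": -1.0, "total": 1.0, "of": 0.5},
                          doc_freq={}, n_docs=10, top_k=2)
print(sorted(page))
```

In practice the two scores live on different scales, so some normalization before fusion would likely be needed; the sketch omits it to keep the structure visible.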

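Key Results:

The layout-aware sparse embedding adds about 3% NDCG@10 in hybrid encoding.
The Reinforced ICL reranker outperforms existing supervised rerankers on multiple benchmarks, especially in few-shot and cross-domain settings.
MultiDocR provides finer-grained five-level relevance evaluation and exposes annotation bias in existing benchmarks caused by redundant information.
DocRetriever reaches state-of-the-art performance on MP-DocVQA, ViDoRe, MMDocIR, and other benchmarks.

Tech Stack:

VLMs (vision-language models such as Qwen2.5VL and ColPali)
Sparse embedding: top-256 selection and frequency re-weighting over LM-head logits
Hybrid encoding: weighted fusion of dense embeddings ([CLS] or mean pooling) and sparse embeddings
Reinforced ICL: autonomous demonstration synthesis, cross-validation, dual-similarity sampling (BM25 + dense similarity)
Evaluation metrics: NDCG@10, MRR, Recall, and others
Benchmark: MultiDocR (10 domains, 7 query types, five-level relevance)

Strengths:

Plug-and-play, with no OCR or extra training, so it is easy to integrate into existing systems.
The layout-aware sparse embedding captures document structure effectively and improves retrieval accuracy.
The Reinforced ICL reranker addresses the scarcity of supervised data and generalizes well.
MultiDocR is comprehensive and supports multi-dimensional evaluation, which should help the field progress.
Experiments are thorough and show consistent gains on several public benchmarks.

Limitations:

Depends on the quality of the VLM's logit distribution; if the VLM understands layout poorly, the sparse embedding may suffer.
Demonstration synthesis and cross-validation in Reinforced ICL may add compute overhead.
MultiDocR is limited in size (2,441 queries) and may not cover every long-tail scenario.
The framework involves no end-to-end training and may not fully exploit task-specific data.

Relevance To Keywords:

Native multimodal large models: DocRetriever uses VLMs directly for visual encoding and reranking, an application of multimodal large models to retrieval.
Representation learning: Hybrid encoding (dense + sparse) learns richer document representations, and the layout-aware sparse embedding is the paper's contribution to representation learning.
Post-training: The Reinforced ICL strategy can be viewed as a form of post-training that improves generalization through demonstration synthesis and cross-validation.
World models / model-based RL: The paper does not directly involve world models or model-based RL; the "reinforced" in Reinforced ICL refers to improving demonstration quality through cross-validation, not to conventional reinforcement learning.
Unified understanding and generation: The VLM is used for both understanding (encoding) and generation (reasoning-augmented reranking), reflecting the unified capability of multimodal large models.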

30. Masked Diffusion Vision-Language Models for Temporal Action Localization [PASS]

Score: 45.0 / 27.8

Authors: Fengshun Wang, Zhengbo Zhang, Zhigang Tu

Published: 2026-05-28

TL;DR: This paper proposes Masked Diffusion Vision-Language Models to enable bidirectional refinement of semantic and boundary tokens in Temporal Action Localization, achieving improved temporal reasoning and boundary localization compared to autoregressive baselines.

Abstract (Chinese translation)

时序动作定位（TAL）需要在未修剪视频中精确识别目标事件并定位其起止时间。近期视觉 - 语言模型改进了语义推理并支持语言条件输出，但其自回归解码器仍从左到右生成 token，这阻止了后续语义证据修正早期的时间戳预测。我们将掩码扩散视觉 - 语言模型（MDVLMs）适配到 TAL，使得语义 token 和边界 token 在具有双向注意力的迭代去噪过程中保持可编辑，从而允许时间边界和语义内容共同细化。然而，直接适配会产生两个 TAL 特有的不匹配：标准掩码扩散训练随机均匀地掩盖所有位置，但当有足够的语义上下文时，时间 token 更可靠；且 token 级交叉熵不反映时间 IoU。为了解决这些不匹配，我们引入一个计划训练目标，该目标使用边界感知掩码和步级加权重构来演练时间 token 的后期恢复，同时引入一个步级 IoU 奖励，在去噪过程中提供重叠感知监督。标准的序列级交叉熵项提供基础重构信号。在 ActivityNet-RTL、ActivityNet-1.3 和 THUMOS-14 上的实验表明，MDVLM-TAL 相较于自回归视觉 - 语言基线，在时间推理和边界定位方面均有改进，尤其是在更严格的时间 IoU 准则下取得了显著增益。

Abstract

Temporal action localization (TAL) requires recognizing the target event and localizing its start and end times precisely in untrimmed videos. Recent vision-language formulations improve semantic reasoning and support language-conditioned outputs, but their autoregressive decoders still generate tokens from left to right, preventing later semantic evidence from revising earlier timestamp predictions. We adapt masked diffusion vision-language models (MDVLMs) to TAL so that semantic tokens and boundary tokens remain editable throughout iterative denoising with bidirectional attention, allowing temporal boundaries and semantic content to be refined jointly. Direct adaptation, however, creates two TAL-specific mismatches: standard masked diffusion training corrupts all positions uniformly at random, but the time tokens are more reliable when enough semantic context is available; and token-level cross-entropy does not reflect temporal IoU. To address these mismatches, we introduce a Planned Training Objective that uses boundary-aware masking and step-weighted reconstruction to rehearse the late recovery of time tokens, together with a Step-Level IoU Reward that provides overlap-aware supervision during denoising. A standard sequence-level cross-entropy term provides the base reconstruction signal. Experiments on ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14 show that MDVLM-TAL improves both temporal reasoning and boundary localization over autoregressive vision-language baselines, with especially strong gains under stricter temporal IoU criteria.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper applies Masked Diffusion Vision-Language Models to Temporal Action Localization, strongly aligning with MLLM (7.0) and MultiModal (8.0). It uses tokens and encoders but does not focus on tokenizer design or model-based RL (0.0). Unify Models is moderately relevant (5.0) for token unification. World Models has low relevance (2.0) due to the discriminative task. Total weighted score is 45.0. No specified expert authors are found.

Keywords

Masked Diffusion, Vision-Language Models, Temporal Action Localization, Bidirectional Attention, Boundary Localization, Untrimmed Videos, Denoising Process

In-Depth Analysis

Chinese Title: 掩码扩散视觉语言模型用于时序动作定位

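Summary: This paper proposes MDVLM-TAL, which applies masked diffusion vision-language models (MDVLMs) to temporal action localization (TAL). Conventional autoregressive vision-language models generate time boundaries left to right, so early predictions cannot be revised using later semantic information. Through iterative denoising with bidirectional attention, an MDVLM keeps both semantic tokens and time-boundary tokens editable throughout generation, which allows joint refinement. Applying standard masked diffusion training directly, however, creates two mismatches: uniform random masking during training does not reproduce the inference-time behavior in which time tokens stabilize only after enough semantic context is available, and token-level cross-entropy does not reflect numerical differences in temporal IoU. The paper therefore proposes a Planned Training Objective, with boundary-aware masking and step-weighted reconstruction so that time tokens are recovered later during training, and a Step-Level IoU Reward that supplies overlap-aware supervision during denoising. Experiments on ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14 show that MDVLM-TAL beats autoregressive vision-language baselines in both temporal reasoning and boundary localization, with the largest gains at strict tIoU thresholds.

Innovations:

Formulates temporal action localization as masked diffusion sequence generation, using iterative denoising with bidirectional attention to refine semantic tokens and time-boundary tokens jointly.
Proposes the Planned Training Objective, whose boundary-aware masking and step-weighted reconstruction loss delay the recovery of time tokens during training to match inference-time behavior.
Introduces the Step-Level IoU Reward, which provides supervision based on temporal overlap at intermediate denoising steps and compensates for the inability of token-level cross-entropy to reflect numerical distance.
Surpasses autoregressive vision-language baselines and dedicated TAL detectors on several standard datasets, with the largest gains at strict tIoU thresholds.

Methodology: The architecture combines a visual encoder (such as the one used in Video-LLMs), a multimodal projector, and a language backbone (a masked diffusion language model). The video is encoded and compressed into visual tokens, which are concatenated with the instruction text tokens to form the multimodal context. Generation then follows a masked diffusion process: starting from a fully masked state, tokens are recovered step by step through denoising with bidirectional attention. Training uses the Planned Training Objective: a boundary-aware masking strategy (a lower keep probability for time tokens) and a step-weighted reconstruction loss (larger weights for low-noise steps). A Step-Level IoU Reward is added: the predicted IoU at the current step is computed from soft boundary expectations, and a gate activates the reward only when both boundary tokens have been revealed. The final loss is a weighted combination of sequence-level cross-entropy, the planned reconstruction loss, and the IoU reward.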
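
The step-level IoU reward described in the methodology can be sketched in a few lines: take the softmax over time bins, compute the expected boundary, and score the overlap with the ground truth only when both boundary tokens are revealed. The bin-center convention, the swap that keeps start before end, and the function names are assumptions for illustration rather than details confirmed by the analysis.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_time(logits, duration):
    """Soft boundary: probability-weighted mean of bin centres, in seconds."""
    n = len(logits)
    centres = [(i + 0.5) * duration / n for i in range(n)]  # ASSUMED convention
    return sum(p * c for p, c in zip(softmax(logits), centres))

def temporal_iou(pred, gt):
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0

def step_iou_reward(start_logits, end_logits, start_revealed, end_revealed,
                    gt_segment, duration):
    """Gated reward: zero until both boundary tokens have been unmasked."""
    if not (start_revealed and end_revealed):
        return 0.0
    start = expected_time(start_logits, duration)
    end = expected_time(end_logits, duration)
    if end < start:  # ASSUMPTION: swap so that the segment is well formed
        start, end = end, start
    return temporal_iou((start, end), gt_segment)

# Toy usage: a 40-second video split into four time bins.
reward = step_iou_reward([50, 0, 0, 0], [0, 0, 0, 50], True, True, (5.0, 35.0), 40.0)
print(round(reward, 4))
```

Because the expected boundary is a smooth function of the logits, a reward of this kind changes gradually as probability mass moves toward the correct bin, unlike token-level cross-entropy, which treats a near miss and a distant miss as equally wrong.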

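Key Results:

On ActivityNet-RTL, ActivityNet-1.3, and THUMOS-14, MDVLM-TAL outperforms autoregressive vision-language baselines (such as LITA and VTG-LLM) at multiple tIoU thresholds.
The improvement over baselines is largest at strict tIoU thresholds (such as 0.7), indicating more precise boundary localization.
Ablations confirm that the Planned Training Objective and the Step-Level IoU Reward each help, and that combining them performs best.
Compared with dedicated TAL detectors (such as ActionFormer and TriDet), MDVLM-TAL performs better in the language-conditioned setting while keeping competitive closed-set performance.

Tech Stack:

Masked diffusion language model
Time-token formulation (the time axis is discretized into N bins)
Boundary-aware masking (γ > η > 0)
Step-weighted reconstruction loss
Soft boundary expectation (the expected boundary computed via softmax)
Gated relative weighting (the IoU reward is active only when both boundary tokens are revealed)
Sequence-level cross-entropy loss
Visual encoder + token compressor
Multimodal context concatenation (BOS, Zv, SEP, Zq, SEP)

Strengths:

Brings masked diffusion models to temporal action localization, overcoming the inability of autoregressive models to revise earlier predictions.
Designs training objectives specific to TAL (planned training and the IoU reward) that bridge the gap between generic masked diffusion and the needs of TAL.
Achieves significant gains on several datasets, especially at strict tIoU thresholds, which supports the method's effectiveness.
The method is general and could extend to other multimodal generation tasks that require precise temporal localization.

Limitations:

The granularity of time-token discretization (N bins) affects localization precision; bins that are too coarse or too fine can both cause problems, and the paper does not examine the optimal granularity in depth.
Masked diffusion inference may be slower than autoregressive decoding because it needs multiple denoising iterations.
Experiments cover only three datasets and do not test generalization in more diverse settings such as open-vocabulary or zero-shot.
The Step-Level IoU Reward relies on soft boundary expectations, which may add compute, and the gate provides no supervision while the boundary tokens are still masked.

Relevance To Keywords:

Native multimodal large models: The paper uses a vision-language model architecture that models video and text jointly, which falls under multimodal large models.
Unified understanding and generation in multimodal large models: MDVLM-TAL performs action recognition (understanding) and boundary generation (generation) together.
Representation learning: Through masked diffusion training the model learns joint video-text representations, with time tokens and semantic tokens interacting under bidirectional attention.
World models: Temporal action localization models the temporal dynamics of events in video and can be seen as an application of world models along the time dimension.
Reinforcement learning: The Step-Level IoU Reward is essentially a reinforcement learning signal that rewards IoU during denoising and guides optimization.
Post-training: The Planned Training Objective and the IoU reward are post-training optimizations that fine-tune a pretrained model for the task.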

31. Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation [PASS]

Score: 43.5 / 27.8

Authors: Chenghao Zhang, Guanting Dong, Yufan Liu, Tong Zhao, Zhicheng Dou

Published: 2026-05-28

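TL;DR: This paper proposes Ptah, a multi-agent framework that uses a Visual Working Memory and a verifier mechanism to address open-ended synthesis and cross-modal consistency in multimodal deep research, producing more reliable reports.

Abstract (Chinese translation)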

大型语言模型（LLMs）已将自主智能体从深度搜索（检索简洁的事实性答案）提升至深度研究（将分散的证据综合为长篇报告）。然而，由于缺乏确定性真值的开放式综合，以及需要将文本论证与视觉证据交错，可验证的多模态深度研究仍然具有挑战性。我们提出 Ptah，一个用于交错式报告生成的多智能体编排框架。Ptah 通过规划、研究和写作三个阶段，协调从用户查询到渲染网页报告的完整生命周期；在此过程中，专用智能体构建视觉感知计划，收集基于主张的证据，在“视觉工作记忆”（Visual Working Memory）中维护源对齐图像，并通过声明式多模态工具使用撰写报告。验证智能体充当该框架的验收函数，在整个工作流中强制执行事实锚定、引用保真度和跨模态一致性。我们进一步引入 PtahEval，一种评估协议，通过图像级和展示级评估来增强现有的基准测试。在深度研究基准上的实验表明，与强基线相比，Ptah 能生成更可靠、更具视觉信息量且更可用的面向人类的多模态报告。

Abstract

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	1.0/10	1.5

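Scoring rationale: The paper centers on a multi-agent system for multimodal deep research, so MultiModal (10) and MLLM (7) are highly relevant. Visual Encoder (4) and Unify Models (3) relate to system architecture and visual processing but are not the core of the model. World Models (3) partly corresponds to the Visual Working Memory concept. Tokenizer (1) and model-based RL (1) are not mentioned in the abstract. The author list contains none of the designated experts, so no bonus is applied.

Keywords

Multimodal Deep Research, Multi-Agent Harness, Interleaved Report Generation, Visual Working Memory, Verifier Agent, Cross-modal Consistency, Deep Research

In-Depth Analysis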

Chinese Title: 迈向可验证的多模态深度研究：用于交错报告生成的多智能体框架

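Summary: This paper proposes Ptah, a multi-agent framework for verifiable multimodal deep research that generates research reports interleaving text and images. Ptah coordinates agents, external tools, intermediate state, and verification signals across three stages: planning, research, and writing. The planning stage builds a visual-aware research plan; the research stage collects claim-grounded evidence in parallel and maintains source-aligned images in a Visual Working Memory; the writing stage composes the final report through declarative multimodal tools. The verifier acts as the framework's acceptance function and checks protocol compliance, factual grounding, citation fidelity, visual relevance, and cross-modal consistency at each stage. The authors also propose the PtahEval evaluation protocol, which assesses reports along two dimensions: image content quality and multimodal presentation quality. Experiments show that Ptah's reports beat strong baselines in reliability, visual informativeness, and usability.

Innovations:

Proposes the Ptah multi-agent framework, which coordinates agents, tools, research state, and verification signals to achieve verifiable multimodal deep research.
Designs a visual-aware workflow that organizes multimodal deep research into planning, research, and writing stages and maintains inspectable intermediate artifacts (plans, evidence, citations, numerical data, source-aligned images).
Introduces verifier hooks as the framework's acceptance function, enabling stage-wise checks of protocol compliance, factual grounding, citation fidelity, visual relevance, and cross-modal consistency.
Proposes the PtahEval evaluation protocol for interleaved image-text research reports, covering image content quality and multimodal presentation quality.

Methodology: Ptah uses a multi-agent architecture with planning, research, writing, and verifier agents. Planning stage: the agent initializes the research state through iterative text search and generates a structured plan containing a high-level overview, per-section research goals, expected evidence types, and visual specifications, which the verifier checks. Research stage: an independent research agent investigates each section in parallel, extracts textual evidence and images, builds the Visual Working Memory (after rule-based filtering and VLM selection), and produces a structured research package, which the verifier checks again. Writing stage: the writing agent uses the global plan, the verified research packages, and the Visual Working Memory to compose an interleaved report through declarative multimodal tools and renders it as a web page. The verifier combines rule checks with LLM-based rubric scoring to ensure output quality at every stage.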
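
The role of the verifier as an acceptance function can be illustrated with a small stage runner. This is a sketch under assumptions, not Ptah's implementation: the retry-with-feedback loop, the attempt limit, and every name (`run_stage`, `toy_writer`, `toy_verifier`) are hypothetical, and the toy verifier checks only for citation markers.

```python
def run_stage(name, produce, verify, max_attempts=3):
    """Run one stage until the verifier accepts its artifact.

    `produce(feedback)` returns an artifact; `verify(artifact)` returns
    (accepted, feedback). ASSUMPTION: rejected work is retried with feedback.
    """
    feedback = None
    for _ in range(max_attempts):
        artifact = produce(feedback)
        accepted, feedback = verify(artifact)
        if accepted:
            return artifact
    raise RuntimeError(f"{name}: verifier rejected all {max_attempts} attempts")

def toy_writer(feedback):
    """Hypothetical writing agent: the first draft contains an uncited claim."""
    claims = ["claim A [1]"]
    if feedback is None:
        claims.append("claim B")
    return claims

def toy_verifier(claims):
    """Rule check standing in for citation fidelity: every claim needs a marker."""
    uncited = [c for c in claims if "[" not in c]
    if uncited:
        return False, f"add or drop uncited claims: {uncited}"
    return True, None

report = run_stage("writing", toy_writer, toy_verifier)
print(report)
```

Gating each stage this way means an unsupported claim is caught where it is introduced instead of propagating into later stages, which is the error-accumulation benefit the analysis attributes to stage-wise verification.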

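Key Results:

Ptah's reports beat strong baselines in factual reliability, citation fidelity, and cross-modal consistency.
Ptah produces visually informative, professionally interleaved reports that are easier to read and use.
Experiments on deep research benchmarks show better image content quality and multimodal presentation quality.

Tech Stack:

Large language models (LLMs)
Vision-language models (VLMs)
Retrieval-augmented generation (RAG)
Text search tools
Visual Working Memory
Declarative multimodal tool use
Rule-based filtering
VLM-based selector
LLM-based rubric verification

Strengths:

Offers a complete framework for verifiable multimodal deep research, covering the full flow from query to rendered report.
The stage-wise verifier enforces factual accuracy and cross-modal consistency and reduces error accumulation.
The Visual Working Memory treats images as core research state rather than after-the-fact decoration, tightening the link between visual evidence and textual argument.
Parallel research improves efficiency while keeping the evidence for each section traceable.
The dedicated PtahEval protocol fills a gap in evaluating interleaved image-text reports.

Limitations:

Depends on external search tools and VLMs, which may add latency and cost.
The verifier relies on rules and an LLM and may not eliminate hallucinations or errors entirely.
PtahEval may not cover every dimension of report quality (user satisfaction, for example).
The framework is complex, so deployment and tuning take considerable engineering effort.
The use of post-training or reinforcement learning within the framework is not discussed.

Relevance To Keywords:

Native multimodal large models: The paper focuses on multimodal report generation, which involves joint understanding and generation of text and images, in line with the goals of native multimodal large models.
Unified understanding and generation in multimodal large models: Ptah achieves interleaved generation through agent coordination, but it is a tool-augmented framework rather than an end-to-end model.
Representation learning: The Visual Working Memory stores source-aligned images and can be viewed as a form of cross-modal representation management.
World models: Deep research requires modeling complex knowledge domains, and Ptah's planning and research stages implicitly build a structure of domain knowledge.
Reinforcement learning: The paper does not use reinforcement learning directly, but the verifier can be viewed as a reward signal that future post-training could exploit.
Post-training: The paper involves no post-training techniques and focuses on inference-time agent collaboration and verification.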

32. Why Far Looks Up: Probing Spatial Representation in Vision-Language Models [PASS]

Score: 43.5 / 27.8

Authors: Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

Published: 2026-05-28

TL;DR: This paper investigates whether Vision-Language Models possess structured 3D understanding or rely on spatial shortcuts, finding that models conflate vertical position with distance and proposing a benchmark to evaluate representation robustness.

Abstract (Chinese translation)

视觉 - 语言模型（VLMs）在空间推理基准上取得了优异的性能，但目前尚不明确这种表现是否反映了结构化的 3D 理解，亦或是依赖于自然图像中的统计捷径。我们提出了一种表示层分析框架，该框架构建最小对比样本对，用于测量空间轴如何在 VLM 嵌入中被组织及解耦。我们对多个模型家族的分析揭示了一致的垂直 - 距离纠缠现象：模型混淆了图像的垂直位置与距离，这反映了自然照片的透视偏差。这种偏差导致在透视一致示例与反启发式示例之间存在显著的性能差距，且在数据规模扩大时加剧，尽管整体基准性能有所提升。我们进一步表明，具有相似基准分数的模型可能表现出不同的内部表示，而这些差异能够预测其在不同空间推理基准上的性能与鲁棒性。为了将这种偏差与评估集偏差隔离开来，我们引入了 SpatialTunnel，这是一个合成基准，旨在通过移除自然图像中常见的共现相关性来揭示空间捷径偏差。实验证实了这种纠缠是模型内在的，且拥有分离良好的空间轴的模型表现出更强的鲁棒性，这表明结构良好的空间表示能够带来跨不同基准更可靠的空间推理。代码及基准可在项目页面获取：https://cheolhong0916.github.io/whyfarlooksup.github.io/。

Abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on vision-language models (MLLM, MultiModal) and analyzes spatial embeddings (Visual Encoder), so it scores high on these. Tokenizers, World Models, model-based RL, and Unify Models are not core contributions, hence the lower scores.

Keywords

Vision-Language Models, Spatial Representation, Vertical-Distance Entanglement, Perspective Bias, SpatialTunnel, Embedding Analysis, Robustness

In-Depth Analysis

Chinese Title: 为何远处看起来在上方：探究视觉语言模型中的空间表征

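Summary: Vision-language models (VLMs) score well on spatial reasoning tasks, but they may rely on statistical shortcuts rather than genuine 3D understanding. To probe this, the paper proposes a representation-level analysis framework that builds minimal contrastive pairs and measures how the spatial axes (horizontal, vertical, depth) are organized and disentangled in VLM embeddings. Experiments across several model families reveal a consistent vertical-distance entanglement: models conflate vertical position in the image with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic samples, and the gap widens as the training data scales up. To separate the effect from evaluation-set bias, the authors introduce SpatialTunnel, a synthetic benchmark that uses tunnel geometry to decouple vertical position from depth. The results indicate that the entanglement is intrinsic to the models and that models with well-separated spatial axes are more robust. The work shows that benchmark accuracy can overestimate VLM spatial reasoning ability and provides a way to diagnose shortcut bias.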

Innovations:

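Proposes a representation-level analysis framework for spatial reasoning that uses contrastive pairs to measure the geometric organization and disentanglement of spatial axes in VLM embeddings.
Identifies and systematically verifies vertical-distance entanglement: models treat vertical image position as a proxy for depth, which causes systematic errors.
Introduces the synthetic SpatialTunnel benchmark, which uses tunnel geometry to decouple vertical position from depth and exposes shortcut biases hidden by standard benchmarks.
Shows that models with similar benchmark scores can have markedly different internal spatial representations, and that representational structure predicts robustness and generalization.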

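Methodology: The paper uses contrastive representation analysis. First, depth-related questions are extracted from existing spatial reasoning benchmarks (EmbSpatial-Bench, CV-Bench-3D) and split into three classes (consistent, counter, ambiguous) according to how vertical position relates to true depth. Next, several VLMs (Molmo, NVILA, Qwen2.5-VL, and others) are analyzed with linear probing or contrastive pairs to measure the direction and separation of the horizontal, vertical, and depth axes in embedding space. In parallel, the synthetic SpatialTunnel dataset is built by procedurally generating tunnel scenes in which vertical position and depth are statistically uncorrelated. Finally, the accuracy gap between consistent and counter examples is evaluated, and representational structure is compared across model families and data scales.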
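The contrastive-pair measurement described in the methodology can be illustrated with a small sketch. It is not the authors' code: it assumes that an axis direction is estimated as the mean embedding difference over pairs that differ in one spatial attribute, and that entanglement is read off as the absolute cosine between two such directions. The embeddings are random toy data, and `axis_direction`, `entanglement`, and `toy_pairs` are hypothetical names.

```python
import numpy as np

def axis_direction(pos_emb, neg_emb):
    """Unit vector of the mean difference over contrastive pairs (assumed estimator)."""
    diff = (pos_emb - neg_emb).mean(axis=0)
    return diff / np.linalg.norm(diff)

def entanglement(axis_a, axis_b):
    """Absolute cosine between two unit axis directions: 0 = separated, 1 = identical."""
    return float(abs(axis_a @ axis_b))

def toy_pairs(direction, rng, n=200, noise=0.05):
    """Toy embeddings: shared content plus or minus one attribute direction."""
    base = rng.normal(size=(n, direction.size))
    pos = base + direction + noise * rng.normal(size=base.shape)
    neg = base - direction + noise * rng.normal(size=base.shape)
    return pos, neg

rng = np.random.default_rng(0)
dim = 64
up = np.eye(dim)[0]    # toy "higher in the image" direction
far = np.eye(dim)[1]   # toy "farther away" direction

vertical_axis = axis_direction(*toy_pairs(up, rng))
depth_clean = axis_direction(*toy_pairs(far, rng))
# Toy entangled model: the depth direction leaks into the vertical one.
depth_mixed = axis_direction(*toy_pairs((far + up) / np.sqrt(2), rng))

clean_score = entanglement(vertical_axis, depth_clean)
mixed_score = entanglement(vertical_axis, depth_mixed)
print(f"separated axes: {clean_score:.2f}, entangled axes: {mixed_score:.2f}")
```

In this toy setup a score near zero corresponds to well-separated axes, and a score near 0.7 corresponds to a depth direction that shares half of its energy with the vertical direction.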

Key Results:

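Consistent samples dominate existing spatial benchmarks (80.9% in EmbSpatial-Bench, 60.5% in CV-Bench-3D), while counter examples make up only about 10%.
All tested models are significantly less accurate on counter examples than on consistent ones, confirming the vertical-distance entanglement.
Scaling the data (from 80k to 2M samples) raises overall benchmark accuracy but worsens the vertical-distance entanglement.
Models with well-separated spatial axes (e.g., RoboRefer-2B-SFT) are more robust on SpatialTunnel and more accurate on other benchmarks.
The horizontal axis keeps stable, opposing directions in embedding space, whereas the vertical and depth axes are frequently entangled, indicating reliance on a perspective shortcut.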

Tech Stack:

Linear probing
Contrastive pair analysis
SpatialTunnel synthetic dataset (procedurally generated tunnel-geometry scenes)
Supervised fine-tuning (SFT) on a mixture of spatial datasets (SAT, RoboSpatial, SPAR-7M, RefSpatial, PRISM)
Model families: Molmo-7B-O-0924, NVILA-Lite-2B, Qwen2.5-VL-3B-Instruct, Qwen3-VL-235B-A22B-Instruct, RoboRefer-2B-SFT

Strengths:

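Analyzes spatial reasoning at the representation level rather than only at the behavioral level, which gives a deeper diagnostic view.
Systematically exposes vertical-distance entanglement as a widespread bias and quantifies how it changes with data scale.
The SpatialTunnel benchmark removes statistical correlations found in natural images, allowing a fairer evaluation.
Experiments cover multiple model families and scales, so the conclusions are broadly applicable.
Links representational structure to robustness, pointing to a direction for improving VLM spatial understanding.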

Limitations:

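The analysis relies mainly on contrastive pairs and linear probing, which may miss nonlinear representational structure.
SpatialTunnel is synthetic and has a domain gap from real-world images; its generalization needs further validation.
Only the entanglement between depth and vertical position is studied; other spatial shortcuts (e.g., size, occlusion) are not covered.
Dataset weights in the data mixture are fixed; the optimal combination is not explored.
No concrete training method for mitigating the entanglement is given; the work only diagnoses the problem.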

Relevance To Keywords:

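Unify Models / native multimodal large models: The VLMs studied are multimodal large models, and the spatial representation analysis bears directly on evaluating the understanding ability of unified models.
World Models: Spatial reasoning is a core world-model capability. The vertical-distance entanglement suggests that current VLMs lack a structured representation of the 3D world, which runs counter to the goal of world models.
Representation Learning: The core of the paper is analyzing how spatial representations are organized and disentangled, which falls under representation learning and provides a diagnostic tool for improving representation quality.
Model-Based RL / reinforcement learning: The paper does not address RL directly, but robust spatial representations matter for model-based RL (e.g., robot navigation), and the bias found here could cause RL policies to fail in out-of-distribution scenes.
Post-training: The paper uses supervised fine-tuning (SFT) to study how data scale affects representations. This is an analysis of the post-training stage, and it finds that post-training can reinforce shortcuts rather than genuine understanding.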

33. FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection PASS

Score: 43.5 / 27.8

Authors: Leqi Zhu, Junyan Ye, Kaiqing Lin, Zhiyuan Yan, Conghui He, Weijia Li

Published: 2026-05-28

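TL;DR: FakeVLM-R1 combines physical-commonsense reasoning chains with reinforcement learning to make large-model synthetic image detection more interpretable and accurate, reaching state-of-the-art performance.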

摘要翻译

生成式人工智能技术的发展已将合成图像的视觉真实性推向了前所未有的高度。尽管基于大型多模态模型（LMMs）的可解释检测方法已取得一定进展，但它们仍依赖于源自海量伪造数据的模仿学习。因此，它们缺乏真正的因果推理能力，且容易产生解释性幻觉。为了克服这一瓶颈，我们提出了 FakeVLM-R1，旨在赋予模型在执行合成检测任务时具备类人的批判性思维能力。该框架基于监督微调（SFT），整合了群体相对策略优化（GRPO）与批判性思维思维链（CoT）机制。在推理阶段，模型执行一种“双向辩证推理”过程：在提出伪造假设的同时，必须同时调用物理常识来构建真实性反证。此外，我们构建了 FakeClue++ 数据集，该数据集包含高质量样本，广泛引入了基于真实图像物理规律的标注，为模型提供了统一真实性锚点。实验证实，FakeVLM-R1 在多个基准测试中达到了所评估模型中的最先进（SOTA）性能。它不仅实现了高精度、逻辑可解释的检测，还解决了现有方法对真实图像的过度拒绝偏差，展现出对扰动的泛化性和鲁棒性。

Abstract

The development of generative artificial intelligence technologies has propelled the visual realism of synthetic images to an unprecedented level. Although current interpretable detection methods based on Large Multimodal Models (LMMs) have made certain progress, they still rely on imitation learning derived from massive volumes of forged data. Consequently, they lack genuine causal reasoning capabilities and are prone to explanatory hallucinations. To overcome this bottleneck, we propose FakeVLM-R1, aiming to endow the model with human-like critical thinking capabilities when performing synthetic detection tasks. Building upon Supervised Fine-Tuning (SFT), this framework integrates Group Relative Policy Optimization (GRPO) with a Critical Thinking Chain-of-Thought (CoT) mechanism. During the inference phase, the model executes a "bidirectional dialectical reasoning" process: while proposing a forgery hypothesis, it must simultaneously invoke physical commonsense to construct an authenticity counter-proof. Furthermore, we constructed the FakeClue++ dataset with high-quality samples, which extensively introduces annotations guided by the physical laws of authentic images, providing a unified authenticity anchor for the model. Experiments confirm that FakeVLM-R1 achieves SOTA performance among the evaluated models across multiple benchmarks. It not only achieves high-precision, logically interpretable detection but also resolves the over-rejection bias of existing methods against real images, demonstrating generalization and robustness against perturbations.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	2.0/10	3.0
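The arithmetic behind the table can be checked with a short sketch. It assumes, based on the rows above and the "Score: 43.5 / 27.8" header, that each row's score is weight times relevance, that the headline score is the sum of the rows, and that PASS means the total is at or above the 27.8 threshold. The names `rows`, `weighted_total`, and `PASS_THRESHOLD` are illustrative.

```python
# (keyword, weight, relevance out of 10) copied from the table above
rows = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 3.0),
    ("World Models", 1.5, 4.0),
    ("MLLM", 1.5, 9.0),
    ("MultiModal", 1.5, 8.0),
    ("model-based RL", 1.5, 2.0),
]

PASS_THRESHOLD = 27.8  # second number in the "Score: 43.5 / 27.8" header

def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

total = weighted_total(rows)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

The same sum reproduces the headline scores of the neighboring entries whose tables appear in this report.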

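Scoring rationale: The paper centers on synthetic image detection with large multimodal models (MLLM), so MLLM and MultiModal relevance is high. Internalizing physical laws has some connection to the World Models concept, but this is not a typical world-model architecture. GRPO (a reinforcement learning optimizer) is used, but it is policy optimization rather than model-based RL. Tokenizer and Unify Models are not mentioned in the abstract, so their relevance is low. The Visual Encoder is implicitly present as an LMM component, which gives it modest relevance.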

Keywords

Synthetic Image Detection, Large Multimodal Models, Chain-of-Thought, Group Relative Policy Optimization, Physical Laws, Critical Thinking, Forgery Detection

In-Depth Analysis

Chinese Title: FakeVLM-R1: 通过思维链内化物理定律的合成图像检测方法

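Summary: The paper proposes FakeVLM-R1, a framework that aims to move synthetic image detection from perceptual matching to logical reasoning. Existing interpretable detectors based on large multimodal models (LMMs) rely on imitation learning, lack causal reasoning, and are prone to explanatory hallucinations. To address this, the framework builds on supervised fine-tuning (SFT) and adds Group Relative Policy Optimization (GRPO) together with a critical-thinking chain-of-thought (CoT) mechanism. At inference time the model performs "bidirectional dialectical reasoning": while proposing a forgery hypothesis, it must also invoke physical commonsense to construct an authenticity counter-proof. The authors also build the FakeClue++ dataset, whose high-quality samples include extensive annotations grounded in the physical laws of real images, giving the model a unified authenticity anchor. Experiments show that FakeVLM-R1 reaches SOTA performance on multiple benchmarks, delivers high-precision and logically interpretable detection, resolves the over-rejection bias of existing methods against real images, and is robust to perturbations while generalizing well.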

Innovations:

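Proposes a training paradigm that combines GRPO reinforcement learning with critical-thinking CoT, giving the model bidirectional dialectical reasoning and suppressing explanatory hallucinations and false positives.
Builds the FakeClue++ dataset, which introduces physical-law-based annotations of real images to provide a unified authenticity anchor with high data efficiency.
Shifts the paradigm from perception-level detection to forensic-level logical reasoning, with gains in unseen-domain generalization, perturbation robustness, and reasoning depth.
Uses the "forgery hypothesis plus authenticity counter-proof" dialectical mechanism to force the model to check logical self-consistency, avoiding conclusion-first bias.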

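Methodology: Training has two stages. First, supervised fine-tuning (SFT) on FakeClue++ teaches the model forgery artifacts and physical-law knowledge. Second, GRPO reinforcement learning combined with critical-thinking CoT trains dialectical reasoning. At inference time the model proposes a forgery hypothesis from visual cues, simultaneously invokes physical commonsense to build an authenticity counter-proof, and reaches its final verdict by checking logical self-consistency.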

Key Results:

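Reaches SOTA performance on multiple mainstream benchmarks, with detection accuracy significantly above existing LMM and interpretable baselines.
Lowers the false-positive rate on real images, resolving the over-rejection bias of existing methods.
Performs well in unseen-domain generalization and perturbation robustness tests.
Produces explanations with forensic-level logical depth and suppresses explanatory hallucinations.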

Tech Stack:

Group Relative Policy Optimization (GRPO)
Chain-of-Thought (CoT) reasoning
Supervised Fine-Tuning (SFT)
Large Multimodal Models (LMMs)
FakeClue++ dataset (real images annotated with physical laws)
Bidirectional dialectical reasoning mechanism

Strengths:

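Combines reinforcement learning with critical-thinking CoT with the aim of giving the model genuine causal reasoning ability.
Efficient dataset design: annotating physical laws instead of exhaustively enumerating forgery artifacts substantially lowers construction cost and improves generalization.
Outperforms existing methods in detection accuracy, interpretability, and robustness, and addresses the key pain point of false positives.
The methodology is general and could extend to other visual detection tasks that require logical reasoning.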

Limitations:

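Depends on high-quality physical-law annotations; for complex cases (e.g., artistic images, abstract styles) the description of physical regularities may be incomplete.
GRPO training is computationally expensive, which may limit use in resource-constrained settings.
Validated only on synthetic image detection; transfer to other forgery detection settings (e.g., audio, video) is unexplored.
The critical-thinking CoT may add inference latency, which hurts applications with strict real-time requirements.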

Relevance To Keywords:

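Unify Models: The paper uses a single large multimodal model for detection, localization, and explanation, reflecting the idea of model unification.
World Models: Internalizing physical laws (e.g., lighting, perspective) as authenticity anchors gives the model world-model-like commonsense reasoning. Its dialectical reasoning with physical commonsense mimics how humans reason about the real world.
Representation Learning: SFT and GRPO are used to learn joint representations of forgery artifacts and physical regularities, and the physical-law annotations guide the model toward more robust, interpretable representations.
Model-Based RL: GRPO is a reinforcement learning post-training method. Combined with CoT reasoning it can loosely be viewed as a model-based reasoning strategy, although GRPO itself is model-free policy optimization.
Native multimodal large models: The method applies fine-tuning and reinforcement learning to LMMs (e.g., LLaVA, Qwen-VL), i.e., post-training of multimodal large models.
Unified understanding and generation in multimodal large models: The model outputs both a detection label and a natural-language explanation, combining understanding and generation.
Reinforcement learning: GRPO, one of the core contributions, is a reinforcement learning algorithm.
Post-training: GRPO reinforcement learning on top of SFT belongs to the post-training stage.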

34. Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering PASS

Score: 43.5 / 27.8

Authors: Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

Published: 2026-05-28

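TL;DR: This paper proposes BRACS, a training-free framework that adaptively corrects hidden states under the guidance of visual attention to mitigate hallucination in vision-language models.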

摘要翻译

大型视觉 - 语言模型（LVLMs）经常在输入图像中不存在的物体上产生幻觉，主要是因为随着解码过程的推进，视觉定位能力会逐渐减弱。现有的推理时缓解方法在整个生成过程中修改 logits 或隐藏状态，但它们存在三个关键局限性：缺乏明确的定位目标；即使在模型定位良好时也会进行干预；以及使用固定的校正强度，无法适应定位失败的严重程度。我们提出 BRACS（基于屏障调节的自适应闭式引导），这是一种无需训练的引导框架，通过基于屏障调节的自适应闭式引导机制来解决上述问题。BRACS 通过监控模型自身的注意力机制来衡量视觉定位，仅在定位退化时对隐藏状态施加校正。校正更新以闭式形式解析计算，无需训练辅助网络或对模型进行重新训练。在 LLaVA-1.5-7B 和 Qwen-VL-Chat 上的实验表明，BRACS 在幻觉基准上一贯优于先前方法，将 CHAIR$_s$ 降低 9.4 分，将 POPE F1 提高 2.7 分，同时在四个通用多模态基准上持平或优于先前性能。BRACS 也保持高效，其吞吐量达到贪婪解码吞吐量的 80%，且平均速度达到基线方法的 1.3 倍。

Abstract

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

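Scoring rationale: The paper focuses on hallucination mitigation in vision-language models (MLLM, MultiModal) through inference-time hidden-state correction, which has some connection to unified understanding and generation (Unify Models). It does not involve tokenizers, world models, or model-based RL. The visual encoder is used only for feature extraction and is not a core contribution.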

Keywords

Vision-Language Models, Hallucination Mitigation, BRACS, Visual Grounding, Inference-time Steering, Hidden States Correction, Training-free

In-Depth Analysis

Chinese Title: 通过屏障调控的自适应闭式引导缓解视觉语言模型中的幻觉

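Summary: Large vision-language models (LVLMs) often describe objects that are not in the image, a phenomenon known as hallucination. Existing inference-time mitigation methods have three key limitations: no explicit grounding objective, steering at every step, and a fixed steering strength. The paper proposes BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework. BRACS monitors the model's own pre-softmax image attention to define a visual grounding barrier, and computes a minimum-norm hidden-state correction only when grounding falls below a threshold. The correction is obtained analytically in closed form, with no backpropagation or auxiliary network. On LLaVA-1.5-7B and Qwen-VL-Chat, BRACS lowers CHAIR$_s$ by 9.4 points and raises POPE F1 by 2.7 points while running at 80% of greedy decoding throughput and averaging 1.3 times the speed of the baselines. It also matches or improves performance on four general multimodal benchmarks.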

Innovations:

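Identifies three key limitations of existing training-free methods: no explicit grounding objective, steering at every step, and a fixed steering strength.
Proposes BRACS, which uses pre-softmax image attention as an explicit grounding barrier to enable selective intervention.
Derives a closed-form minimum-norm hidden-state correction that needs no training or backpropagation and adapts the correction strength.
Achieves significant gains on several hallucination benchmarks without degrading general reasoning ability.
Keeps inference efficient: on average 1.3 times faster than the baselines, at 80% of greedy decoding throughput.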

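Methodology: BRACS is a training-free, inference-time steering method. It defines the visual grounding barrier as the mean pre-softmax image attention $h_l(x_t)$, which depends linearly on the hidden state $x_t$. At each decoding step, if $h_l(x_t)$ falls below the threshold $\tau$, the gradient $\nabla h_l(x_t)$ is computed analytically (no backpropagation), and the minimum-norm correction $\theta^*$ satisfying $h_l(x_t + \theta^*) \ge \tau$ is obtained in closed form: $\theta^* = \frac{(\tau - h_l(x_t))^+}{\lVert \nabla h_l \rVert^2 + \varepsilon}\,\nabla h_l$. The correction is applied to $x_t$ before the Q/K/V projections, so it affects all subsequent attention operations. Only selected layers (e.g., layer 5) are steered; the others are unchanged. The hyperparameters $\tau$ and $\alpha$ (the correction strength) are tuned on a validation set.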
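The closed-form correction quoted in the methodology can be illustrated numerically. The sketch below is not the BRACS implementation: it replaces the attention-based barrier with a hypothetical linear function h(x) = w·x + b (so the gradient is simply w), and the names `min_norm_correction`, `w`, and `b` are assumptions made for the example.

```python
import numpy as np

def min_norm_correction(x, w, b, tau, eps=1e-8):
    """Smallest-norm theta such that w @ (x + theta) + b >= tau (up to eps).

    Stand-in for the barrier h_l(x_t): here h is linear, so grad h = w.
    """
    h = float(w @ x + b)
    deficit = max(tau - h, 0.0)           # (tau - h)^+ : zero when already grounded
    return deficit / (float(w @ w) + eps) * w

rng = np.random.default_rng(0)
x = rng.normal(size=16)                   # toy hidden state
w = rng.normal(size=16)                   # toy barrier gradient
b = 0.0

tau = float(w @ x + b) + 1.0              # threshold chosen so the barrier is violated
theta = min_norm_correction(x, w, b, tau)
h_after = float(w @ (x + theta) + b)

# When the barrier is already satisfied, no correction is applied.
theta_idle = min_norm_correction(x, w, b, tau - 5.0)
print(f"barrier after correction: {h_after:.4f} (target {tau:.4f})")
```

Because the correction is parallel to the gradient, it is the shortest step that restores the constraint for a linear barrier, and its size scales with the deficit, which matches the "adaptive strength" behavior described above.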

Key Results:

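On CHAIR$_s$, BRACS lowers the score by 9.4 points relative to the baseline (e.g., from 40.2 to 30.8).
POPE F1 improves by 2.7 points (e.g., from 84.5 to 87.2).
On MMHal, BRACS outperforms all compared methods.
On four general multimodal benchmarks (e.g., MMBench, MME), BRACS matches or improves performance.
Inference speed: BRACS reaches 80% of greedy decoding throughput and is on average 1.3 times faster than baselines such as VCD and PAI.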

Tech Stack:

Pre-softmax attention
Closed-form quadratic programming (QP) solution
Analytic gradient without backpropagation
Hidden-state steering
KV cache correction (KV cache consistency)
LLaVA-1.5-7B and Qwen-VL-Chat models

Strengths:

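Training-free: no additional training or fine-tuning, and low computational overhead.
Defines the grounding objective explicitly from the model's own attention, which makes it interpretable.
Intervenes selectively, correcting only when grounding is insufficient, which avoids over-correction.
Adapts the correction strength to the size of the grounding deficit.
The closed-form solution is exact and efficient, with no iterative optimization.
Strikes a good balance between hallucination mitigation and general performance.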

Limitations:

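The threshold $\tau$ and strength $\alpha$ must be tuned by hand and may be sensitive to the model and task.
Only selected layers (e.g., layer 5) are steered; leaving the other layers untouched may limit the effect.
Experiments cover only two models; generalization needs more validation.
The closed-form solution assumes a nonzero gradient and may be numerically unstable when the gradient is near zero (an $\varepsilon$ term is added to counter this).
No comparison with training-based methods (e.g., RLHF); only training-free methods are compared.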

Relevance To Keywords:

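Unify Models: The paper does not address model unification directly, although vision-language models are themselves a form of multimodal unification.
World Models: Indirectly related. Mitigating hallucination helps build a more consistent world model, but the paper does not discuss world models explicitly.
Representation Learning: Related. BRACS uses attention representations as a grounding signal, which falls under representation learning.
Model-Based RL: Not directly related. The paper involves neither reinforcement learning nor model-based RL.
Native multimodal large models: Directly related. The models studied are multimodal large models such as LLaVA and Qwen-VL.
Unified understanding and generation in multimodal large models: Related. The paper targets hallucination during generation, a generation-quality issue within unified understanding and generation.
Reinforcement learning: Not related.
Post-training: Related. BRACS is a training-free method applied at inference time, after training is complete.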

35. NeuROK: Generative 4D Neural Object Kinematics PASS

Score: 42.0 / 27.8

Authors: Chen Geng, Guangzhao He, Yue Gao, Yunzhi Zhang, Shangzhe Wu, Jiajun Wu

Published: 2026-05-28

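TL;DR: NeuROK proposes a data-driven kinematic state parameterization, learned by a transformer encoder-decoder, for efficiently generating realistic 4D object dynamics in support of world-model building.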

摘要翻译

数据驱动的方法彻底革新了 3D 视觉，使 Transformer（变换器）能够有效重建和生成静态 3D 对象。然而，生成模拟的 4D 动态——即在各种物理条件下静态对象的真实时间形变——仍然具有挑战性，且往往缺乏通用性，尽管其在构建全面的 3D 世界模型中至关重要。大多数现有方法假设一个预定义物理模型并使用 System Identification（系统识别）来估计参数，这使得这些方法仅限于特定类别和小规模数据集。我们提出，可以通过学习针对以对象为中心的物理系统的数据驱动运动学状态参数化来克服这些限制。具体来说，我们学习了一个表示对象所有可能状态的 Latent Space（潜在空间），以及一个将任何采样的潜在变量映射到对象合理形变形状的 Decoder（解码器）。我们将这种参数化称为神经对象运动学（NeuROK），并在一个精心构建的大规模 4D 数据集上学习了一个基于 Transformer 的 Encoder-Decoder（编码器 - 解码器）模型。这种形式化及所学模型显著简化了模拟动态的生成，因为从经典物理学的 Lagrangian Mechanics（拉格朗日力学）角度来看，我们只需在低维潜在空间内考虑动力学。我们在多样化的动态对象类型上展示了该神经模拟框架的有效性和通用性，显示出相对于先前工作的明显优势。项目页面：https://chen-geng.com/neurok

Abstract

Data-driven approaches have revolutionized 3D vision, enabling transformers to effectively reconstruct and generate static 3D objects. However, generating simulative 4D dynamics -- realistic temporal deformations of static objects under various physical conditions -- remains challenging and often ad hoc, despite its importance in building comprehensive 3D world models. Most existing methods assume a predefined physical model and use system identification to estimate parameters, restricting these methods to specific categories and small-scale datasets. We propose that these restrictions can be overcome by learning a data-driven kinematic state parameterization for object-centric physical systems. Specifically, we learn both a latent space representing all possible states of the object and a decoder that maps any sampled latent to a plausibly deformed shape of the object. We refer to this parameterization as Neural Object Kinematics (NeuROK), and learn a transformer-based encoder-decoder model on a curated large-scale 4D dataset. This formulation and the learned model significantly simplify the generation of simulative dynamics since we only need to consider the dynamics within a low-dimensional latent space from the Lagrangian mechanics' perspective in classical physics. We demonstrate the effectiveness and generality of this neural simulation framework across diverse dynamic object types, showing clear advantages over prior works. Project page: https://chen-geng.com/neurok

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	8.0/10	12.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	5.0/10	7.5

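Scoring rationale: The paper explicitly mentions building "comprehensive 3D world models", so it is highly relevant to World Models (8). It uses a transformer-based encoder-decoder architecture, which implies a tokenizer and includes a visual encoder (Tokenizer 2, Visual Encoder 5). It unifies generation across many kinds of dynamic objects, matching the Unify Models concept (5). Dynamics modeling relates to the foundations of model-based RL (5). No language model is involved (MLLM 0). The focus is spatiotemporal visual data, so the multimodal link is weak (3). None of the designated expert authors appear in the author list. The weighted total is 42.0, above the passing score of 27.8.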

Keywords

Generative 4D, Neural Object Kinematics, latent space, transformer-based, encoder-decoder, dynamic object types, world models

In-Depth Analysis

Chinese Title: NeuROK: 生成式4D神经物体运动学

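Summary: The paper proposes NeuROK, a general data-driven framework for generating the 4D dynamics (temporal deformation sequences) of static 3D objects under physical conditions such as forces, actions, and velocities. The core idea is to learn a low-dimensional latent space as a kinematic state parameterization of the object. A transformer encoder-decoder discovers this space automatically from large-scale 4D shape data, without physical annotations or category priors. The decoder maps a latent vector to a plausibly deformed shape. At inference time, dynamics equations are derived in the latent space from Lagrangian mechanics, and dynamic trajectories are generated by solving an ODE. The method applies to elastic bodies, cloth, continua, multi-body objects, and other types, and experiments show better generality and effectiveness than existing methods.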

Innovations:

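Proposes a data-driven kinematic state parameterization (NeuROK) that learns a low-dimensional latent space from 4D data, replacing traditional high-dimensional particle or mesh parameterizations and avoiding category-specific physical constraints.
Builds a general, scalable generative 4D simulation framework that needs no physical priors or action annotations, only 4D geometry sequences as supervision.
Brings Lagrangian mechanics into the latent space, deriving dynamics directly from an energy function and the Euler-Lagrange equations, which simplifies simulation.
Presented as the first data-driven simulation of object-centric physical systems without heuristic priors, covering elastic, rigid, cloth, multi-body, and other dynamic types.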

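Methodology: A transformer encoder-decoder is used. The encoder maps a static 3D mesh (vertex positions) to a latent distribution (Gaussian parameters), and the decoder maps a latent vector to a deformation field (vertex offsets). Training data are large-scale 4D shape sequences (mesh deformations over time steps). At inference time, given physical conditions (e.g., initial velocity, external force), an energy function (e.g., kinetic and potential energy) is defined in the latent space, Lagrangian mechanics yields an ordinary differential equation (ODE) for the latent state, and numerical integration produces the dynamic trajectory. The framework does not rely on a predefined, category-specific physical model.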
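The last step, integrating an ODE derived from an energy function in the latent space, can be sketched with a toy system. The example is an illustration under strong assumptions and does not reproduce the paper's energy function or decoder: it uses a constant latent mass and a quadratic potential, so the Euler-Lagrange equation reduces to m·z'' = f − k·z, and it integrates with a semi-implicit Euler scheme. All names and constants are hypothetical.

```python
import numpy as np

def potential_grad(z, k=1.0):
    """Gradient of the assumed quadratic potential V(z) = 0.5 * k * |z|^2."""
    return k * z

def energy(z, v, mass=1.0, k=1.0):
    """Kinetic plus potential energy of the toy latent system."""
    return 0.5 * mass * float(v @ v) + 0.5 * k * float(z @ z)

def simulate(z0, v0, steps=1000, dt=0.01, mass=1.0, force=None):
    """Semi-implicit Euler integration of mass * z'' = force - dV/dz."""
    z = np.asarray(z0, dtype=float).copy()
    v = np.asarray(v0, dtype=float).copy()
    f = np.zeros_like(z) if force is None else np.asarray(force, dtype=float)
    traj = [z.copy()]
    for _ in range(steps):
        a = (f - potential_grad(z)) / mass
        v = v + dt * a          # update velocity first ...
        z = z + dt * v          # ... then position with the new velocity
        traj.append(z.copy())
    return np.stack(traj), v

z0 = np.array([1.0, 0.0, -0.5])           # toy 3-dimensional latent state
v0 = np.zeros(3)
traj, v_end = simulate(z0, v0)

e_start = energy(z0, v0)
e_end = energy(traj[-1], v_end)
print(f"relative energy drift: {abs(e_end - e_start) / e_start:.4f}")
```

In the framework described above, each latent state along `traj` would then be passed through the learned decoder to obtain a deformed shape; that step is omitted here because the decoder is not available in this report.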

Key Results:

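After training on a diverse 4D dataset containing elastic bodies, cloth, continua, and multi-body objects, the model generates dynamic sequences that match physical intuition.
Compared with traditional methods such as MPM and mass-spring systems, NeuROK generalizes significantly better and can handle unseen object types.
Quantitative evaluation (e.g., shape fidelity, physical plausibility metrics) shows that the method matches or exceeds existing baselines across dynamic types.
Ablations confirm the effectiveness of the low-dimensional latent space and the need for Lagrangian-mechanics-driven dynamics.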

Tech Stack:

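Transformer encoder-decoder
Latent space modeling (Gaussian distribution)
Lagrangian mechanics (Euler-Lagrange equations)
Numerical solution of ordinary differential equations (ODEs)
Mesh representation (vertex positions)
Large-scale 4D dataset (no physical annotations)
System identification (comparison baseline)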

Strengths:

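General: needs no category-specific physical model and applies to many kinds of dynamic objects.
Data-driven: requires only 4D geometry sequences, so it scales to large datasets.
Interpretable: latent dynamics based on Lagrangian mechanics carry physical meaning.
Efficient: the low-dimensional latent space greatly reduces computational cost.
No action annotations: learns from geometric change alone, which suits real-world use.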

Limitations:

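Relies on a low-dimensional assumption: for highly complex or non-rigid motion, the latent dimension may be too small to capture every deformation mode.
Requires large-scale 4D data, which is costly to obtain, especially for real-world dynamic sequences.
Currently handles only mesh representations; point clouds or implicit representations need extra adaptation.
Physical conditions (e.g., external forces) must be supplied in a specific form, which may limit interactive applications.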

Relevance To Keywords:

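Unify Models: NeuROK attempts to unify the simulation of many dynamic types, a step toward a general world model.
World Models: By generating 4D dynamics, the framework can act as an environment simulator for reinforcement learning and robot planning.
Representation Learning: The core is learning a latent kinematic representation of objects that can transfer to downstream tasks.
Model-Based RL: The generated dynamic trajectories can be used to train policies or to plan, which fits the model-based RL paradigm.
Native multimodal large models: Only the geometry modality is handled at present, but the transformer architecture and latent-space idea could extend to multiple modalities (e.g., vision plus physics).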

36. SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World PASS

Score: 42.0 / 27.8

Authors: Xin Dong, Weijian Deng, Lihan Zhang, Tianru Dai, Wenfeng Deng, Yansong Tang

Published: 2026-05-28

TL;DR: SAM3D-Phys addresses incomplete object geometry in real-world scenes by integrating SAM3D generative priors with physics-constrained optimization to enable physically consistent multi-object interactive simulation.

摘要翻译

本文旨在解决从重建的真实世界场景中恢复完整且可模拟的物体几何形状的问题，从而实现与场景中嵌入物体的基于物理的交互。尽管现代多视图重建方法能够生成视觉准确的环境，但由于遮挡和观测受限，物体往往是不完整的，这使得它们不适合用于物理模拟。为了解决这一局限性，我们提出了 SAM3D-Phys 框架，该框架整合了场景重建与 SAM3D 的生成式 3D 先验，以恢复可进行物理模拟的物体。该方法首先基于多视图图像重建场景，以获取场景几何形状及物体的部分观测结果。随后，我们利用 SAM3D 从这些部分观测中推断出完整的物体几何形状。为确保恢复的物体与重建场景保持一致，我们通过两种互补策略恢复场景一致的物体状态：一种是基于物理约束的空间优化算法，迭代地将恢复的物体对齐到其原始位置；另一种是基于掩模引导的外观蒸馏模块，基于观测图像细化纹理保真度。通过恢复完整的物体几何形状并在场景中恢复其姿态与外观，SAM3D-Phys 生成了适合物理模拟的纯净的物体表征，从而能够在重建场景中实现对多个物体的同时且物理一致的交互式模拟。项目页面：https://chnxindong.github.io/sam3d-phys/

Abstract

This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To ensure that the recovered objects remain consistent with the reconstructed scene, we restore scene-consistent object states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns the recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous and physically consistent interactive simulation of multiple objects within a reconstructed scene. Project page: https://chnxindong.github.io/sam3d-phys/

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	7.0/10	10.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	5.0/10	7.5

Scoring rationale: The paper integrates scene reconstruction with SAM3D generative priors (Unify Models, 6.0) to recover complete object geometry for physics simulation (World Models, 7.0). It utilizes visual inputs and encoders (Visual Encoder, 4.0) but does not involve language models (MLLM, 2.0) or tokenization (Tokenizer, 1.0). The focus is primarily visual/geometric rather than cross-modal (MultiModal, 3.0), though the physics model supports model-based RL (5.0). No expert authors from the specified list were found.

Keywords

Multi-Object Interactive Simulation, Real World, Physics-based Interaction, SAM3D, Generative 3D Priors, Scene Reconstruction, Object Geometry Recovery

In-Depth Analysis

Chinese Title: SAM3D-Phys：面向真实世界中多物体交互式仿真

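Summary: The paper addresses the recovery of complete, simulatable object geometry from reconstructed real-world scenes and proposes SAM3D-Phys, a framework that combines scene reconstruction with the generative 3D priors of SAM3D to recover physically simulatable objects. The method first reconstructs scene geometry and partial object observations from multi-view images, then uses SAM3D to infer complete object geometry from those partial observations. To keep recovered objects consistent with the reconstructed scene, two complementary strategies restore scene-consistent object states: a physics-constrained spatial optimization algorithm that iteratively aligns each object to its original location, and a mask-guided appearance distillation module that refines texture from the observed images. By recovering complete geometry and restoring pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics simulation and supports simultaneous, physically consistent interactive simulation of multiple objects in the reconstructed scene. Experiments on real multi-object benchmarks, both self-captured and collected from the web, validate the method.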

Innovations:

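Presented as the first work to combine a generative 3D prior (SAM3D) with scene reconstruction to fix incomplete object geometry in multi-object scenes and enable physics simulation.
Proposes a physics-constrained spatial optimization algorithm that aligns generated objects to their original scene locations while keeping them physically plausible.
Proposes a mask-guided appearance distillation module that refines the texture of generated objects from observed images and preserves visual consistency.
Builds a real-world multi-object interactive simulation benchmark, with self-captured and web-collected scenes, for evaluating the method.
The whole framework is training-free, runs efficiently on consumer hardware, and supports simultaneous multi-object interactive simulation.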

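Methodology: A four-stage pipeline is used. (A) Scene reconstruction: PGSR reconstructs the scene from multi-view images, objects are removed, and the background is inpainted. (B) Object extraction: target objects are segmented, and SAM3D generates complete 3D geometry from the partial observations. (C) Object-scene alignment: mask-guided appearance distillation and physics-constrained spatial optimization restore object pose and appearance. (D) Multi-object physical interaction: the aligned objects are inserted into the scene, and a Material Point Method (MPM) solver runs the interactive simulation.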

Key Results:

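SAM3D-Phys recovers more complete object geometry from partial observations and supports stable multi-object interaction.
On the real-world multi-object benchmark, the simulatable objects it produces stay spatially and visually consistent with the reconstructed scene.
Physics-constrained spatial optimization prevents generated objects from penetrating the scene or drifting out of position.
Mask-guided appearance distillation significantly improves the texture fidelity of generated objects.
Compared with existing methods, SAM3D-Phys performs better in physical simulation stability and visual realism.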

Tech Stack:

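3D Gaussian Splatting (3DGS)
Planar-constrained Gaussian splatting (PGSR)
SAM3D (generative 3D prior model)
Material Point Method (MPM)
B-spline kernel functions
Render-and-compare optimization
Mask-guided appearance distillation
Physics-constrained spatial optimization (iterative alignment algorithm)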

Strengths:

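Training-free: directly uses pretrained generative models and reconstruction methods, which makes it practical.
Effectively addresses incomplete object geometry in real scenes, making physics simulation possible.
Considers both spatial alignment and appearance consistency, so generated objects blend well with the scene.
Supports simultaneous multi-object interactive simulation, suitable for complex real scenes.
Builds a real-world benchmark to evaluate the method under realistic conditions.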

Limitations:

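Depends on SAM3D generation quality; results may be inaccurate for rare or complex shapes.
Physics-constrained spatial optimization may not handle extreme occlusion or large motion.
The current method mainly targets rigid or elastic objects, with limited support for complex materials such as fluids and cloth.
Scene reconstruction accuracy affects later object extraction and alignment, so the method depends on the reconstruction approach.
Physical parameters of objects (e.g., stiffness, density) are not estimated automatically and must be set by hand.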

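Relevance To Keywords: The paper focuses on recovering simulatable objects from reconstructed real scenes. It touches on 3D representation learning (3DGS, PGSR), world models (simulating interaction through physics), model-based reinforcement learning (the simulated environment can be used for RL training), and multimodal large models (SAM3D as a generative prior). The paper does not study unified models or post-training directly. Instead, it provides a tool for turning reconstructed scenes into interactive simulation environments, so it is highly relevant to the "world models" and "representation learning" keywords and only weakly relevant to "native multimodal large models" and "post-training".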

37. OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics PASS

Score: 42.0 / 27.8

Authors: Chenhao Sun

Published: 2026-05-28

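TL;DR: OmniCD is a unified framework for remote sensing change detection guided by multimodal semantics, achieving state-of-the-art performance on zero-shot and binary change detection.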

摘要翻译

遥感中的变化检测（CD）对于城市监测和灾害评估等应用至关重要，但传统方法在不同场景间的泛化能力方面存在困难。本文提出 OmniCD，这是一个通过多模态语义引导统一并增强遥感变化检测（CD）的基础框架。OmniCD 将图像和文本提示（如文本描述、语义地图和地理空间元数据）整合到统一架构中，支持从二值变化检测（CD）到零样本语义变化理解等多种任务。该框架整合了层次化场景检索模块和变化检测模块，并通过风格解耦机制加以强化，以提高跨域鲁棒性。此外，我们还引入了 RSITCD，这是一个包含 30 万+ 标注图像 - 文本对的大规模多模态数据集。大量实验表明，OmniCD 在各类基准测试中均实现了最先进的性能，展现了强大的适应性，并为遥感领域的通用变化检测（CD）系统奠定了坚实基础。

Abstract

Change detection (CD) in remote sensing is vital for applications such as urban monitoring and disaster assessment, yet traditional methods struggle with generalization across diverse scenarios. We present OmniCD, a foundational framework that unifies and enhances remote sensing CD through multimodal semantic guidance. OmniCD incorporates image and text prompts -- such as textual descriptions, semantic maps, and geospatial metadata -- into a unified architecture, supporting tasks from binary CD to zero-shot semantic change understanding. The framework integrates a hierarchical scene retrieval module and a change detection module, reinforced by a style disentanglement mechanism for improved cross-domain robustness. We further introduce RSITCD, a large-scale multimodal dataset with 300K+ annotated image-text pairs. Extensive experiments show that OmniCD achieves state-of-the-art performance across benchmarks, demonstrating strong adaptability and setting a solid foundation for general-purpose CD systems in remote sensing.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

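Scoring rationale: The paper is about remote sensing change detection and emphasizes multimodal semantic guidance (MultiModal, high relevance) and a unified architecture (Unify Models, moderate relevance). A Visual Encoder is implicitly used for image processing, and MLLM is moderately relevant because of the text prompts. Tokenizer, World Models, and model-based RL are not mentioned in the abstract or are unrelated. The author, Chenhao Sun, is not on the designated expert list. The weighted total is 42.0, above the dynamic passing score of 27.8.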

Keywords

Remote Sensing, Change Detection, Multimodal Semantics, Unified Framework, Zero-shot, Image-Text Guidance, RSITCD Dataset

In-Depth Analysis

Chinese Title: OmniCD：多模态语义引导的遥感图像变化检测基础框架

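Summary: Change detection in remote sensing imagery is vital for urban monitoring, disaster assessment, and similar applications, but traditional methods generalize poorly across scenarios. The paper proposes OmniCD, a foundational framework that unifies and enhances remote sensing change detection through multimodal semantic guidance. OmniCD brings image and text prompts (such as textual descriptions, semantic maps, and geospatial metadata) into a single architecture and supports tasks ranging from binary change detection to zero-shot semantic change understanding. The framework contains a hierarchical scene retrieval module and a change detection module, with a style disentanglement mechanism that improves cross-domain robustness. The authors also build RSITCD, a large-scale multimodal dataset with more than 300K annotated image-text pairs. Extensive experiments show that OmniCD reaches state-of-the-art performance on multiple benchmarks and adapts well, laying a foundation for general-purpose change detection systems in remote sensing.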

Innovations:

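Introduces the open-category change detection (OCCD) task and builds the large-scale multimodal dataset RSITCD, which covers diverse scenes and land-cover types.
Designs the OmniCD framework with a cooperative detector-guider architecture that supports end-to-end change detection from image or text prompts.
Adds a style disentanglement module that separates style features tied to imaging conditions, suppresses pseudo-changes, and improves cross-domain generalization.
Uses a pretrained Vision Transformer (ViT) and BERT as feature extractors, combined with a transformer guider, for flexible multimodal semantic guidance.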

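Methodology: OmniCD has three main modules: a feature extraction module (ViT-H/16 as the image encoder and BERT as the text encoder), a transformer guider module (adapted from the SAM decoder, it fuses prompt information with image embeddings to produce a region-of-interest attention map), and a detector module based on pyramid scene parsing. A style disentanglement module additionally separates style features from content features to reduce pseudo-changes caused by differing imaging conditions. During training the inputs are bi-temporal remote sensing images and semantic prompts, and the output is the change detection result.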

Key Results:

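OmniCD reaches state-of-the-art performance on multiple benchmark datasets and outperforms existing open-category change detection methods.
The RSITCD dataset significantly improves the performance of several models on open-category change detection.
The style disentanglement mechanism suppresses pseudo-changes caused by sensor differences, illumination, atmospheric effects, and similar factors, improving cross-domain robustness.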

Tech Stack:

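Vision Transformer (ViT-H/16), pretrained with MAE
BERT-base text encoder
Transformer decoder (following SAM)
1×1 and 3×3 convolutions with layer normalization
Style disentanglement module (specific method not detailed)
Pyramid Scene Parsing
Average pooling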

Strengths:

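Proposes a unified multimodal semantic guidance framework that supports flexible prompts and adapts to several change detection tasks.
Builds the large-scale multimodal dataset RSITCD, filling the data gap for open-category change detection.
The style disentanglement mechanism improves cross-domain generalization and addresses the sensitivity of traditional methods to imaging conditions.
End-to-end architecture that uses pretrained models and reduces dependence on labeled data.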

Limitations:

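The paper gives no detailed implementation or mathematical formulation of the style disentanglement module.
Experiments are mentioned in the abstract but not fully presented in the main text; specific metrics and comparison results are missing.
The framework is complex and depends on large pretrained models, which may limit inference efficiency.
Only remote sensing change detection is addressed; transfer to other vision tasks is not discussed.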

Relevance To Keywords:

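Unify Models: OmniCD aims to unify several change detection tasks in one general framework through multimodal semantic guidance.
World Models: The paper does not address world models directly, but understanding scene change through multimodal semantics can be seen as an attempt to model the remote sensing world.
Representation Learning: ViT and BERT extract features, and the style disentanglement module separates representations, which falls under representation learning.
Model-Based RL: The paper does not involve reinforcement learning; relevance is weak.
Native multimodal large models: OmniCD borrows the prompting mechanism of multimodal large models (e.g., SAM, BLIP), but it is a task-specific framework for remote sensing change detection rather than a native multimodal large model.
Unified understanding and generation in multimodal large models: The framework focuses on understanding (change detection) and does not cover generation tasks.
Post-training: The paper fine-tunes pretrained models, which is a post-training paradigm.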

38. Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection PASS

Score: 40.5 / 27.8

Authors: Xiaona Zhou, Muntasir Wahed, Tianjiao Yu, Constantin Brif, Ismini Lourentzou

Published: 2026-05-28

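TL;DR: This paper proposes VisAnomReasoner, a parameter-efficient vision-language model fine-tuned on a new benchmark with natural-language explanations, which localizes time-series anomalies more accurately and improves detection performance.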

摘要翻译

视觉 - 语言模型（VLMs）在多项任务中取得了令人印象深刻的性能，但先前研究指出，当将大型语言模型或多模态模型应用于寻找序列数据中的异常模式时，其表现并不令人满意。公共异常检测基准通常提供区间标注，但不提供自然语言理由，这使得微调 VLMs 以产生有据且可解释的决策变得困难。为了解决这一差距，我们构建了 VisAnomBench，这是一个从公共时间序列数据集精心构建的基准，并利用细粒度、任务特定的奖励从多个大型 VLMs 中筛选出的高质量异常解释进行了增强。通过在该基准上进行微调，我们开发了 VisAnomReasoner，这是一个用于时间序列异常检测的参数高效 VLM。在 VisAnomBench 上的实验结果表明，VisAnomReasoner 实现了更准确的异常定位，并一贯优于所有基线，在精确率和 F1 分数上分别至少提高了 21.23 和 23.87 个百分点。在 TSB-AD-U 基准上的额外实验展示了强大的跨基准泛化能力，VisAnomReasoner 的精确率和 F1 分数分别提高了 9.57 和 13.39 个百分点。

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. Through fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experimental results on VisAnomBench show that VisAnomReasoner achieves more accurate anomaly localization and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

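Scoring rationale: The paper performs time-series anomaly detection with a vision-language model (MLLM), so it is highly relevant to MLLM and MultiModal. It processes visual data but does not focus on new visual encoder architectures or tokenizer design, so those scores are low. World models and model-based RL are not involved, so their relevance is minimal. Vision-language models inherently unify visual and linguistic information, so Unify Models is moderately relevant.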

Keywords

Time-Series Anomaly Detection, Vision-Language Models, Parameter-efficient, VisAnomBench, Natural-language Rationales, Fine-tuning, Interpretability

In-Depth Analysis

Chinese Title: 小巧但可信：面向时间序列异常检测的高效视觉-语言推理

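Summary: To address the lack of explanation annotations in time-series anomaly detection, the paper builds the VisAnomBench benchmark. It draws data from several public time-series anomaly detection benchmarks, uses multiple large vision-language models (VLMs) to generate high-quality anomaly explanations, and selects the best explanation with fine-grained reward functions. On this basis the authors propose VisAnomReasoner, a parameter-efficient VLM fine-tuned to localize time-series anomalies and generate structured explanations. On VisAnomBench, VisAnomReasoner improves on the strongest baseline by at least 21.23 percentage points in precision and 23.87 in F1, and it also achieves clear gains in the cross-benchmark generalization test (TSB-AD-U). The work recasts time-series anomaly detection as a chart-based vision-language reasoning task that yields both interval localization and natural-language explanation, advancing interpretable anomaly detection.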

Innovations:

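Formalizes time-series anomaly detection as a chart-based vision-language reasoning task that jointly requires interval localization and structured explanation generation, going beyond traditional scalar anomaly scores.
Builds VisAnomBench, presented as the first time-series anomaly reasoning benchmark with explanation annotations, covering multiple domains and anomaly types and providing supervision for VLM fine-tuning.
Proposes VisAnomReasoner, a parameter-efficient VLM that uses reward-guided explanation selection and fine-tuning to outperform general-purpose VLMs and specialized anomaly detectors at a very small model scale.
Introduces a composite reward function (anomaly accuracy, visual grounding, axis awareness, clarity) for selecting high-quality explanations and improving training data quality.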

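Methodology: VisAnomBench is built in four stages: (1) public time series are split into renderable windows; (2) each window is rendered as a chart image with axis labels; (3) multiple large VLMs generate structured anomaly decisions and reasoning chains; (4) a composite reward function (anomaly accuracy, visual grounding, axis awareness, clarity) selects the best candidate as the supervision target. VisAnomReasoner is then trained with parameter-efficient fine-tuning (e.g., LoRA) so that it predicts anomaly intervals directly from the chart and generates step-by-step explanations. Evaluation compares against 15 baselines (general-purpose VLMs, specialized LLM/VLM anomaly detectors, time-series foundation models, and classical detectors) and includes cross-benchmark generalization tests on VisAnomBench and TSB-AD-U.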
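Stage 4 of the benchmark construction, choosing one explanation per window by a composite reward, can be sketched as follows. The four component names come from the text above, but the weights, the 0-to-1 score scale, the weighted-sum form, and the `Candidate` class are assumptions made for illustration; the paper's actual reward definitions are not given in this report.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One VLM-generated explanation with per-component reward scores in [0, 1]."""
    text: str
    accuracy: float        # anomaly accuracy (e.g., interval F1)
    grounding: float       # visual grounding
    axis_awareness: float  # correct use of axis values
    clarity: float         # readability of the explanation

# Hypothetical weights; the report does not state how components are combined.
WEIGHTS = {"accuracy": 0.4, "grounding": 0.2, "axis_awareness": 0.2, "clarity": 0.2}

def composite_reward(candidate, weights=WEIGHTS):
    """Weighted sum of the four reward components."""
    return sum(w * getattr(candidate, name) for name, w in weights.items())

def select_best(candidates):
    """Return the candidate with the highest composite reward."""
    return max(candidates, key=composite_reward)

candidates = [
    Candidate("A", accuracy=0.9, grounding=0.8, axis_awareness=0.7, clarity=0.6),
    Candidate("B", accuracy=0.5, grounding=0.9, axis_awareness=0.9, clarity=0.9),
]
best = select_best(candidates)
print(best.text, round(composite_reward(best), 2))
```

With these illustrative weights, candidate A wins because anomaly accuracy is weighted most heavily; a different weighting would change which explanation becomes the supervision target.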

Key Results:

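On VisAnomBench, VisAnomReasoner improves precision by at least 21.23 percentage points and F1 by at least 23.87 percentage points.
In the TSB-AD-U cross-benchmark generalization test, precision improves by 9.57 percentage points and F1 by 13.39 percentage points.
Ablations show that reasoning supervision improves both anomaly localization and explanation quality; VisAnomReasoner's explanations are preferred over the base model's in 69.6% of cases.
VisAnomBench contains 2576 training time series and 740 test time series, covering four benchmarks: KPI, GutenTAG, UCR-EGI, and UCR-TSAD.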

Tech Stack:

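Vision-language models (VLMs): GPT-4o and open-source VLMs, used to generate candidate explanations
Parameter-efficient fine-tuning: LoRA (low-rank adaptation)
Reward function: composite reward covering anomaly accuracy (based on interval F1), visual grounding, axis awareness, and clarity
Time-series benchmarks: KPI, GutenTAG, UCR-EGI, UCR-TSAD
Evaluation metrics: precision, recall, F1 score
Chart rendering: time series plotted as images with axis labels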

Strengths:

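Presented as the first work to combine time-series anomaly detection with vision-language reasoning, providing interpretable anomaly localization.
Builds a high-quality benchmark with explanations, filling the gap in supervised data for this area.
The model is parameter-efficient and achieves clear gains at a very small scale, which suits practical deployment.
Strong cross-benchmark generalization supports the robustness of the method.
Reward-guided explanation selection improves training data quality.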

Limitations:

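VisAnomBench is based mainly on synthetic or semi-synthetic data; anomaly explanations in real-world settings may be more complex.
The model depends on chart rendering quality, and different rendering styles may affect performance.
Only interval-level anomaly localization is supported; point anomalies and streaming detection are not covered.
Explanation quality is bounded by the VLM's reasoning ability and may still contain hallucinations or imprecision.
The method is not combined with reinforcement learning, world models, or similar approaches, so the reasoning process may lack dynamic interaction.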

Relevance To Keywords:

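Native multimodal large models: The paper uses a VLM directly to understand time-series charts, an application of multimodal large models.
Unified understanding and generation in multimodal large models: VisAnomReasoner performs both anomaly localization (understanding) and explanation generation (generation).
Representation learning: Fine-tuning teaches the VLM visual representations of time-series charts and anomaly patterns.
World models: Time-series anomaly detection can be seen as understanding system dynamics, but the paper does not build an explicit world model.
Reinforcement learning: The paper does not use reinforcement learning, although reward-based explanation selection can be seen as a form of offline optimization.
Post-training: The paper adapts the model to a specific task through supervised fine-tuning, which is post-training.
Unify Models: The paper unifies anomaly localization and explanation generation but does not propose a model unification framework.
Model-Based RL: Not directly related.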

39. Reinforcement Learning with Robust Rubric Rewards PASS

Score: 40.5 / 27.8

Authors: Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

Published: 2026-05-28

TL;DR: This paper proposes RLR^3, a reinforcement learning framework utilizing robust rubric rewards for criterion-level verification in vision-language tasks, achieving significant performance improvements over RLVR on Qwen3-VL models.

摘要翻译

尽管带可验证奖励的强化学习（RLVR）在确定性可验证的任务中有效，但许多视觉 - 语言任务仅部分可验证，需要多准则监督（例如感知细节、推理步骤和约束）。Rubrics（评分标准）为这种细粒度监督提供了自然接口，但其有效性取决于在线强化学习期间的执行准确性。我们提出带鲁棒评分标准奖励的强化学习（RLR^3），将 RLVR 从任务级别验证扩展至准则级别验证。RLR^3 通过两条执行路径路由实例特定的 Rubrics：一条是与确定性验证器配对的作为提取器的 LLM，另一条是用于不可验证准则的作为评判者的 LLM。为确保评分忠实性，RLR^3 引入了一种最小暴露策略，该策略向提取器隐藏真实标签，并向评判者隐藏图像。此外，RLR^3 采用层次聚合机制，优先处理关键准则而非附加准则，并在轨迹组（rollout groups）内缓解分数饱和问题。在 Qwen3-VL-30B-A3B 模型上跨越 15 个基准进行评估，RLR^3 始终优于 RLVR，相比基线模型提升了 4.7 分，并超过了官方的指令 - 思维（instruct-to-thinking）模型差距。控制审计证实，我们的确定性验证和最小暴露策略显著减少了可利用的假阳性。

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm our deterministic verification and minimal exposure significantly reduce exploitable false positives.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	5.0/10	7.5

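Scoring rationale: The paper centers on a reward mechanism for reinforcement learning (rubric rewards) applied to vision-language tasks for multimodal large language models (MLLMs), so relevance to MLLM and MultiModal is high. Although it involves reinforcement learning, the focus is on reward verification rather than building an environment model, so model-based RL is rated moderate. It does not address model unification, tokenizers, or visual encoder architecture, and world models are not a focus, so those relevance scores are low. None of the designated expert authors appear in the author list.

Keywords

Reinforcement Learning, Rubric Rewards, Vision-Language Tasks, LLM-as-a-Judge, Verifiable Rewards, Criterion-level Verification, Minimal Exposure Strategy

In-Depth Analysis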

Chinese Title: 基于稳健评分标准的强化学习

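Summary: To address the partial verifiability of vision-language tasks, the paper proposes a reinforcement learning framework based on robust rubric rewards (RLR^3). The framework extends task-level verification to criterion-level verification, executing each criterion of an instance-specific rubric separately: verifiable criteria go through a text-LLM extractor paired with a deterministic verifier, while fuzzy criteria go to a text-LLM judge. To reduce the risk of reward exploitation, RLR^3 adopts a minimal exposure strategy that hides ground truths from the extractor and images from the judge. It also prioritizes essential criteria through hierarchical aggregation and mitigates score saturation within rollout groups. On Qwen3-VL-30B-A3B, RLR^3 consistently outperforms RLVR across 15 benchmarks, improving on the base model by 4.7 points and exceeding the official instruct-to-thinking model gap. Controlled audits show that deterministic verification and minimal exposure significantly reduce exploitable false positives.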

Innovations:

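Identifies the partially verifiable nature of vision-language tasks and introduces a rubric-based reinforcement learning paradigm.
Proposes the RLR^3 framework, which routes verifiable criteria to an extraction-plus-deterministic-verification path and fuzzy criteria to a text-LLM judge path, with a minimal exposure strategy to prevent shortcuts.
Improves the informativeness and reliability of multi-criteria rewards through score remapping and hierarchical aggregation, and uses RLVR training of the GenRM to raise reward-model accuracy.
Validates significant gains of RLR^3 over RLVR across multiple training mixtures and benchmarks, and demonstrates through audits the effectiveness of deterministic verification and minimal exposure.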

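Methodology: The paper uses GRPO (Group Relative Policy Optimization) for online policy optimization, with no KL penalty. For each input, a group of responses is sampled and a final scalar reward is computed for each response. The core of RLR^3 comprises: (1) rubric design: each criterion includes a description, a type (essential/additional), a weight, a verifier tag, and a reference object; (2) criterion execution: for verifiable criteria, a text-LLM extractor extracts a value that a deterministic verifier then scores, while for fuzzy criteria a text-LLM judge directly assigns a discrete score; (3) reward aggregation: raw scores within a group are first normalized in a decoupled manner (adjusting the lower and upper bounds based on a threshold τ), then hierarchical aggregation prioritizes essential criteria and handles violations such as repetitive generation and language inconsistency.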
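
The aggregation and group-relative steps described above can be illustrated with a minimal sketch. It is not the paper's implementation: the gating rule in `aggregate`, the `bonus_weight` parameter, and both function names are assumptions made for illustration, and the decoupled normalization with threshold τ is not reproduced. Only the group-relative advantage (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
from statistics import mean, pstdev


def aggregate(essential, additional, bonus_weight=0.2):
    """Hierarchical aggregation (assumed form, not the paper's exact rule).

    Scores are in [0, 1]. Additional criteria contribute only when every
    essential criterion is fully satisfied; otherwise the reward is capped
    below the level reachable by satisfying all essential criteria.
    """
    if not essential:
        raise ValueError("at least one essential criterion is required")
    base = mean(essential)
    if min(essential) < 1.0:
        # An essential criterion failed: ignore additional criteria entirely.
        return base * (1.0 - bonus_weight)
    extra = mean(additional) if additional else 1.0
    return (1.0 - bonus_weight) + bonus_weight * extra


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for one prompt: (essential, additional) scores.
    rollouts = [([1.0, 1.0], [0.5]), ([1.0, 0.0], [1.0]),
                ([1.0, 1.0], [1.0]), ([0.0, 0.0], [1.0])]
    rewards = [aggregate(e, a) for e, a in rollouts]
    print(rewards)
    print(group_advantages(rewards))
```

Under this assumed rule, a response that fails an essential criterion can never outscore one that satisfies all of them, regardless of its additional criteria, which is the prioritization the methodology describes.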

Key Results:

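On Qwen3-VL-30B-A3B, RLR^3 raises the macro-average on three open-source training mixtures (ViRL, OpenMMR, DeepVision) from 76.4 to 77.7, from 76.4 to 78.1, and from 77.4 to 78.2, respectively.
RLR^3 improves on the base model by 4.7 points, exceeding the official instruct-to-thinking model gap.
Reward-model audits show that deterministic verification and minimal exposure reduce false positives on failed responses without harming scoring accuracy.
The RLVR-trained GenRM reaches 95.0% criterion-level accuracy on a held-out reward-model test set.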

Tech Stack:

Group Relative Policy Optimization (GRPO)
LLM-as-an-Extractor (text-LLM extractor)
LLM-as-a-Judge (text-LLM judge)
Deterministic Verifier (supports text, expression, time, list, bounding-box, and point types)
Generative Reward Model (GenRM)
Minimal Exposure Strategy
Hierarchical Aggregation
Score Remapping
Multi-teacher Aggregation Pipeline

Strengths:

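Extends RLVR from the task level to the criterion level, making it applicable to partially verifiable vision-language tasks and broadening the scope of reinforcement learning.
The minimal exposure strategy and the deterministic verification path effectively prevent reward exploitation and improve reward robustness.
Hierarchical aggregation and score remapping sharpen the discriminability and informativeness of multi-criteria rewards.
Achieves consistent gains across multiple benchmarks, with audits confirming the method's effectiveness.

Limitations:

Rubric generation relies on a multi-teacher aggregation pipeline, which may add computational overhead and depend on high-quality base models.
For fully non-verifiable tasks (such as open-ended generation), the fuzzy-criteria path still relies on an LLM judge and may carry subjective bias.
The paper validates only a single model (Qwen3-VL-30B-A3B); generalization needs further testing.
Quality control for rubric generation and its effect on final performance are not discussed in detail.

Relevance To Keywords:

Reinforcement learning: the core of the paper is policy optimization with GRPO, a reinforcement learning post-training method.
Post-training: RLR^3 is a post-training technique that improves multimodal large models on vision-language tasks.
Native multimodal large models: the experiments are based on Qwen3-VL, a native multimodal large model.
Unified understanding and generation in multimodal large models: the paper covers vision-language understanding tasks but does not directly address unified generation.
Representation learning: not directly addressed.
World models: not addressed.
Model-based RL: the paper uses reinforcement learning to optimize the model, i.e., RL-based model training.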

40. MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization [PASS]

Score: 40.5 / 27.8

Authors: Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales, Vera Demberg

Published: 2026-05-28

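TL;DR: This paper proposes the MuPHIRM framework, which uses semantically grounded reward optimization to improve the detection ability and out-of-distribution robustness of vision-language models in implicit multimodal harm reasoning.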

摘要翻译

理解原本无害的图像 - 文本对之间的交互如何产生危害，需要超越表面特征的意图感知跨模态推理。现有的视觉 - 语言模型（VLMs）擅长基于感知线索的字面推理，但往往无法推导出依赖于隐式、上下文相关推理的有害语义。为了评估 VLMs 在组合式危害检测与推理方面的能力，我们引入了多模态语用危害解释（MuPHI），这是一个包含图像 - 文本对的数据集，其中危害编码于细微的多模态线索中。MuPHI 涵盖多种危害类别，并包含标注的危害理由，用于评估 VLMs 的推理链。为了同时提升 VLMs 的检测与推理能力，我们提出了 MuPHIRM，这是一种推理增强训练框架，通过优化多视角奖励来学习联合语义。MuPHIRM 提升了 VLMs 的危害检测与推理质量，同时在分布外鲁棒性方面优于训练基线和推理时基线。我们的发现表明，面向推理的奖励优化为构建能够超越特定基准捷径进行泛化的多模态系统提供了一条有前景的方向。

Abstract

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5
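
The arithmetic behind this table can be checked with a short sketch: each row's score is weight times relevance, and the row scores sum to the reported total of 40.5, which is then compared with the passing score of 27.8. The variable and function names are illustrative; how the report derives its "dynamic" passing score and any expert-author bonus is not stated in this section, so both are left out.

```python
# Relevance values (out of 10) copied from the MuPHI score table above.
RELEVANCE = {
    "Unify Models": 3.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 2.0,
    "World Models": 1.0,
    "MLLM": 8.0,
    "MultiModal": 9.0,
    "model-based RL": 3.0,
}
WEIGHT = 1.5            # every keyword carries the same weight in this table
PASS_THRESHOLD = 27.8   # passing score shown in the "Score: 40.5 / 27.8" line


def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance.values())


total = weighted_total(RELEVANCE)
passed = total >= PASS_THRESHOLD
print(f"total={total}, passed={passed}")
```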

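Scoring rationale: The paper centers on implicit harm reasoning and reward optimization for multimodal large models (VLMs), so MultiModal and MLLM are highly relevant. model-based RL touches on reward optimization but not model-based reinforcement learning in the traditional sense, so its relevance is moderate. Unify Models, Tokenizer, Visual Encoder, and World Models are not core contributions or discussion points in the paper, so their relevance is low. None of the designated experts appear in the author list, so no bonus is applied.

Keywords

Multimodal Harm Reasoning, Reward Optimization, Vision-Language Models, MuPHI Dataset, Cross-modal Reasoning, Implicit Semantics, Out-of-distribution Robustness

In-Depth Analysis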

Chinese Title: MuPHI：通过语义基础奖励优化学习隐式多模态有害推理

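Summary: To address the shortcomings of existing vision-language models (VLMs) in detecting and reasoning about implicit multimodal harmful content, this paper proposes the MuPHI dataset and the MuPHIRM training framework. MuPHI contains image-text pairs in which harmful intent is implicitly encoded through cross-modal compositional semantics, accompanied by reasoning annotations. MuPHIRM combines supervised fine-tuning with GRPO-based reward optimization and designs multi-perspective rewards that encourage the model to reason over joint semantics. Experiments show that MuPHIRM outperforms existing baselines in harm detection and reasoning quality and generalizes better across datasets. The study indicates that reasoning-oriented reward optimization is an effective direction for building multimodal safety systems that generalize.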

Innovations:

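Builds the MuPHI dataset of implicit multimodal harmful samples with reasoning annotations, avoiding dependence on external knowledge and focusing on cross-modal compositional semantics.
Proposes the MuPHIRM training framework, which combines supervised fine-tuning with GRPO reward optimization and designs semantically grounded multi-perspective rewards (visual grounding, textual content, decision consistency, cross-modal interaction).
Validates robust generalization in cross-dataset and cross-category settings, outperforming conventional label fine-tuning and inference-time baselines.
Uses a semi-automatic pipeline to generate reasoning annotations at scale: multiple VLMs produce candidate rationales that a summarization model aggregates, an approach transferable to other datasets.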

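Methodology: First, description-caption pairs are filtered from the MPUP dataset; images are generated with FLUX.1-schnell and overlaid with text; benign counterparts are generated with GPT-Image-11 or Qwen2.5-VL-72B; and the MuPHI dataset is assembled after human review. Reasoning annotations come from a semi-automatic pipeline: three VLMs (Gemma-3-27B-it, Qwen2.5-VL-32B, Pixtral-12B) independently generate rationales, which are filtered and then aggregated by Qwen2.5-VL-72B into silver-standard rationales. Training uses MuPHIRM: supervised fine-tuning first, followed by GRPO optimization of a multi-perspective reward function whose components are visual grounding, textual content, decision consistency, and cross-modal interaction scores.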
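
As a rough illustration of how four per-perspective scores could be folded into the single scalar reward that GRPO needs, the sketch below uses a weighted sum. The equal weights, the [0, 1] score range, and the names `PerspectiveScores` and `combined_reward` are assumptions; this section does not say how MuPHIRM actually combines its reward components.

```python
from dataclasses import dataclass


@dataclass
class PerspectiveScores:
    """Per-response scores in [0, 1] for the four perspectives named above."""
    visual_grounding: float
    textual_content: float
    decision_consistency: float
    cross_modal_interaction: float


def combined_reward(scores, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four perspective scores (weights are hypothetical)."""
    parts = (
        scores.visual_grounding,
        scores.textual_content,
        scores.decision_consistency,
        scores.cross_modal_interaction,
    )
    if len(weights) != len(parts):
        raise ValueError("expected one weight per perspective")
    return sum(w * p for w, p in zip(weights, parts))


if __name__ == "__main__":
    example = PerspectiveScores(1.0, 0.5, 1.0, 0.5)
    print(combined_reward(example))
```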

Key Results:

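MuPHIRM outperforms the supervised fine-tuning baseline and inference-time baselines (such as CoT and self-consistency) in both harm-detection accuracy and reasoning quality.
In cross-dataset transfer experiments, MuPHIRM's macro-F1 drops significantly less than that of conventional label fine-tuned models, indicating better generalization.
In the cross-category setting, MuPHIRM maintains high detection performance on unseen harm categories.
Evaluation against the reasoning annotations shows that MuPHIRM's generated reasoning scores higher on the visual grounding, text understanding, and cross-modal interaction dimensions.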

Tech Stack:

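FLUX.1-schnell (image generation)
LLaMA-3-8B-Instruct (source filtering)
GPT-Image-11 (benign text generation)
Qwen2.5-VL-72B-Instruct (benign text generation, rationale aggregation)
Gemma-3-27B-it, Qwen2.5-VL-32B-Instruct, Pixtral-12B (rationale generation)
GRPO (Group Relative Policy Optimization)
Python PIL (image-text overlay)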

Strengths:

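The dataset is carefully designed: harmful intent arises entirely from the image-text combination, which avoids confounding by external knowledge and makes it easier to evaluate genuine cross-modal reasoning.
The reward function is semantically grounded and covers several reasoning dimensions, effectively guiding the model to learn implicit harmful semantics.
The experimental setup is comprehensive, covering cross-dataset, cross-category, and reasoning-quality evaluation, and confirms the method's generalization and robustness.
The semi-automatic reasoning-annotation pipeline is scalable and reduces manual cost.

Limitations:

The dataset is small (623 harmful + 971 benign samples), which may limit the model's ability to learn complex patterns.
Image generation depends on a T2I model, and some samples still have quality flaws that may introduce noise.
The reward design relies on manually defined dimensions and may not cover every type of implicit harm reasoning.
Only 7B-parameter VLMs are evaluated; the effect on larger models is unknown.

Relevance To Keywords:

Native multimodal large models: the paper focuses on improving VLMs' implicit harm reasoning, which relates to post-training of multimodal large models.
Representation learning: reward optimization guides the model to learn joint cross-modal semantic representations.
World models: implicit harm reasoning requires understanding the implied meaning of image-text combinations, which involves world knowledge.
Reinforcement learning: GRPO is used for policy optimization, an application of reinforcement learning to post-training.
Post-training: the MuPHIRM framework belongs to the post-training stage, combining supervised fine-tuning and reinforcement learning.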

41. VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents [PASS]

Score: 40.5 / 27.8

Authors: Amrita Mazumdar, Seonwook Park, Rajarshi Roy, Nikhil Srihari, Shengze Wang, Yuhao Zhou, Julia Wang, Koki Nagano, Shalini De Mello

Published: 2026-05-28

TL;DR: VideoFDB introduces a benchmark to evaluate full-duplex audio-visual conversational agents, revealing systematic failures in current systems regarding nonverbal cue generation and joint audiovisual grounding.

摘要翻译

自然的人类对话是全双工且视听的：人们同时说话与倾听，同时持续地解读和生成非语言线索，例如点头、微笑和手势。为了支持成功的人机交互，智能体必须建模全双工视听对话；然而，现有的全双工基准仅评估语音。在这项工作中，我们提出了 VideoFDB，这是首个用于评估全双工视听到视听（AV2AV）对话智能体的基准。VideoFDB 的贡献包括：(i) 237 个双人对话片段，涵盖来自真实视频通话的 11 种非语言对话动态；(ii) 一种将感知行为与生成行为分离的分类法；(iii) 一种基于评分标准的语言模型作为裁判（LM-as-judge）评估框架，该框架拥有可解释的维度，用于评估相对于非语言对话动态的对话质量。在开源和闭源的视觉 - 语音智能体上，我们发现系统性的失败模式：字幕坍塌（captioning collapse）和视觉流忽视（visual-stream ignorance），并且我们表明当前系统利用视觉进行显式的视觉问答，却无法实现自然对话所需的流式联合视听对齐（streaming joint audiovisual grounding）。我们进一步评估了级联语音到化身（speech-to-avatar）系统，发现其架构从根本上阻碍了全双工非语言线索的产生。作为首个全双工 AV2AV 交互基准，VideoFDB 为系统评估奠定了基础，我们希望它能加速下一代多模态对话智能体的进步与发展。

Abstract

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on evaluating full-duplex audio-visual conversational agents, making 'MultiModal' highly relevant. 'Visual Encoder' and 'MLLM' are moderately relevant as the evaluated agents involve vision and multimodal understanding. Other keywords like 'Unify Models', 'Tokenizer', 'World Models', and 'model-based RL' are not central to this evaluation benchmark. No expert authors from the specified list are found in the authorship.

Keywords

Full-Duplex, Vision-Speech, Conversational Agents, Evaluation Benchmark, Nonverbal Cues, AV2AV, Multimodal Interaction

In-Depth Analysis

Chinese Title: VideoFDB：评估对话代理的全双工视听能力

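Summary: This paper presents VideoFDB, the first benchmark for evaluating full-duplex audio-visual conversational agents. Natural human conversation is full-duplex and audio-visual: people speak and listen simultaneously while continuously producing and interpreting nonverbal cues (such as nods, smiles, and gestures). Existing full-duplex benchmarks evaluate only speech, a gap that VideoFDB fills. The benchmark contains 237 conversational clips from real-world video calls covering 11 nonverbal conversational dynamics, proposes a taxonomy that separates perception from generation behaviors, and designs a rubric-based LM-as-judge evaluation framework that assesses conversational quality along interpretable axes. The authors evaluate several open- and closed-source audio-visual agents and find systematic failure modes: captioning collapse and visual-stream ignorance. Current systems use vision only for explicit visual question answering and ignore it in the streaming joint audiovisual grounding that natural conversation requires. Cascaded speech-to-avatar systems cannot produce full-duplex nonverbal cues because of architectural constraints. VideoFDB lays a foundation for systematic evaluation and is expected to accelerate the development of next-generation multimodal conversational agents.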

Innovations:

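The first benchmark for evaluating full-duplex audio-visual (AV2AV) conversation, covering 11 nonverbal conversational dynamics.
Proposes a taxonomy that separates perception from generation behaviors, so that an agent's perception and generation abilities are evaluated separately.
Designs a rubric-based LM-as-judge evaluation framework with interpretable axes (fluency, conversational flow, semantic grounding, affect matching, appropriateness of nonverbal cues).
Systematically evaluates open- and closed-source full-duplex audio-visual agents, revealing failure modes such as captioning collapse and visual-stream ignorance.
Analyzes the architectural limits of cascaded speech-to-avatar systems, showing that they cannot insert nonverbal cues while the user is speaking.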

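Methodology: The paper proceeds as follows. (1) It collects 237 conversational clips from real-world video calls and manually annotates 11 nonverbal conversational dynamics (such as gaze aversion, adaptor behaviors, nonverbal interruption, affect display, and laughter). (2) It splits evaluation into perception and generation categories, each with several axes (perception: fluency, conversational flow, semantic grounding; generation: fluency, affect matching, appropriateness of nonverbal cues). (3) It scores agent responses with an LLM-based rater (LM-as-judge) that judges according to predefined rules and examples. (4) It evaluates several existing systems: open-source (e.g., Qwen3-Omni, MoshiVis), closed-source (e.g., Gemini Live, GPT Realtime), and cascaded speech-to-avatar systems. (5) It runs comparison experiments (AV2A vs. A2A) to analyze how visual input is used.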
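
A minimal sketch of how per-clip judge scores might be rolled up by category and axis follows. The axis names are taken from the methodology above; the record layout, the 1-to-5 score scale in the example, and the function name `summarize` are assumptions, and the call to the judging model itself is deliberately omitted.

```python
from collections import defaultdict
from statistics import mean

# Evaluation axes per category, as listed in the methodology above.
AXES = {
    "perception": ("fluency", "conversational_flow", "semantic_grounding"),
    "generation": ("fluency", "affect_matching", "nonverbal_cue_appropriateness"),
}


def summarize(judgments):
    """Average judge scores per (category, axis).

    `judgments` is a list of dicts with keys "category", "axis", and "score";
    this record shape is hypothetical.
    """
    buckets = defaultdict(list)
    for j in judgments:
        category, axis = j["category"], j["axis"]
        if axis not in AXES.get(category, ()):
            raise ValueError(f"unknown axis {axis!r} for category {category!r}")
        buckets[(category, axis)].append(j["score"])
    return {key: mean(values) for key, values in buckets.items()}


if __name__ == "__main__":
    sample = [
        {"category": "perception", "axis": "fluency", "score": 4},
        {"category": "perception", "axis": "fluency", "score": 2},
        {"category": "generation", "axis": "affect_matching", "score": 5},
    ]
    for key, value in summarize(sample).items():
        print(key, value)
```

Keeping the two categories separate in the key mirrors the benchmark's taxonomy: "fluency" appears under both perception and generation, and the two should not be averaged together.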

Key Results:

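Current full-duplex audio-visual agents perceive nonverbal dynamics poorly, often ignoring visual cues and degenerating into caption-style replies.
Visual input is used mainly for explicit visual question answering and not for streaming audiovisual grounding in natural conversation.
Cascaded speech-to-avatar systems maintain turn-taking discipline but cannot insert nonverbal cues while the user is speaking, and they respond 2.8-3.5 seconds later than real human reactions.
As the sampling rate of the user's video increases, the agent's speech quality degrades.
Both open- and closed-source systems show systematic failure modes, indicating substantial room for improvement in full-duplex audio-visual conversation.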

Tech Stack:

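Full-duplex speech models (Moshi, dGSLM, OmniFlatten, SyncLLM, SALM, PersonaPlex)
Multimodal large language models (Gemini 2.5/3.1 Live, GPT Realtime, Qwen3-Omni, MoshiVis, Video-SALMONN)
Cascaded speech-to-avatar systems (audio-driven portrait animation, gesture generation)
LM-as-judge evaluation framework (LLM-based rater)
Taxonomy of nonverbal conversational dynamics (based on human communication research [11,15])
Manually annotated dataset of 237 conversational clips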

Strengths:

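Fills the gap in evaluating full-duplex audio-visual conversation as the first benchmark of its kind.
Built on real-world video-call data, giving it ecological validity.
The evaluation framework provides interpretable multi-axis scores and looks beyond semantic correctness.
Systematically evaluates a range of existing systems and reveals important failure modes.
The taxonomy cleanly separates perception from generation abilities, which supports targeted improvement.

Limitations:

The dataset is small (237 clips) and may not cover all conversational dynamics.
Evaluation depends on an LM-as-judge, which may introduce LLM biases.
No end-to-end full-duplex audio-visual generation (AV2AV) system is evaluated, because no such system is publicly available yet.
Only English conversations are evaluated; multilingual and cultural differences are not considered.
The definitions of the evaluation axes may be incomplete and need further refinement.

Relevance To Keywords:

Native multimodal large models: the evaluated Gemini Live, GPT Realtime, Qwen3-Omni, and similar systems are native multimodal large models that process audio and video input directly and output speech.
Unified understanding and generation in multimodal large models: the benchmark evaluates an agent's ability to understand and generate audio-visual signals at the same time, covering both perception and generation.
Representation learning: not directly addressed, although the results reveal that visual representations are underused in conversation.
World models: not directly addressed, although full-duplex conversation requires the agent to build a real-time understanding of the interaction context, which relates to world models.
Reinforcement learning: not addressed, although post-training (e.g., RLHF) could be used to improve agents' conversational behavior.
Post-training: post-training methods are not discussed, but the evaluation results can guide post-training strategy.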

42. MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs [PASS]

Score: 40.5 / 27.8

Authors: Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

Published: 2026-05-28

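TL;DR: This paper proposes the MusTBENCH benchmark and the MusT optimization recipe to address the lack of precise temporal grounding in music large language models, and improves performance significantly by optimizing both the encoder and the LLM.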

摘要翻译

近期的大型音频语言模型（LALMs）在理解音乐内容方面展现出了有前景的能力。然而，它们的响应是否基于音频的正确时间片段仍未被充分探索。这种局限性对于音乐理解尤为关键，因为关键信息通常以时间局部化的事件形式出现，例如乐器进入和节奏转换。为了解决这一差距，我们引入了 MusTBENCH，这是一个经音乐专家验证的基准，旨在通过五个基于时间的问答任务来评估 LALMs 中的时间定位能力。为了进一步改进现有模型的时间定位能力，我们提出了 MusT，这是一种新颖的四阶段时间优化方案，涵盖音乐编码器适配、大语言模型（LLM）适配、LLM 监督微调以及基于强化学习的优化。在 MusTBENCH 上的实验表明，现有的 LALMs 在精确的时间定位方面存在困难，而 MusT 相对于强基线带来了显著改进。这些结果确立了时间定位是当前 LALMs 中缺失的关键能力，并将 MusTBENCH 定位为未来基于时间定位的音乐理解研究中的一个具有挑战性的基准。

Abstract

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	4.0/10	6.0

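Scoring rationale: The paper centers on evaluating and optimizing temporal grounding in large audio-language models (LALMs) for music, which falls within multimodal large language models (MLLM) and MultiModal, so relevance to both is high. Visual Encoder is entirely irrelevant because the paper processes audio data. Tokenizer and Unify Models are not mentioned in the abstract as core innovations, so their relevance is low. World Models and model-based RL are only partially relevant: the abstract mentions RL-based optimization but does not explicitly involve building a world model or any specific model-based RL mechanism.

Keywords

MusTBENCH, Temporal Grounding, Music LALMs, Benchmarking, RL-based Optimization, Music Encoder, LLM Adaptation

In-Depth Analysis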

Chinese Title: MusTBENCH：音乐大语言模型时间定位的基准测试与推进

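Summary: The paper argues that current large audio-language models (LALMs) lack temporal grounding in music understanding, that is, they cannot tie textual descriptions to specific time points or intervals in the audio. To address this, the authors build MusTBENCH, a benchmark validated by music experts that contains five temporal-grounding question-answering tasks: temporal source grounding, local transition recognition, transition-aware description, global temporal ordering, and mood trajectory reasoning. Experiments show that existing models perform poorly on these tasks and exhibit temporal bias and hallucination. To improve temporal grounding, the authors propose MusT, a four-stage optimization recipe: music encoder adaptation, LLM adaptation (with timestamped music captions), supervised fine-tuning on temporal QA, and RL-based optimization. On MusTBENCH, MusT significantly outperforms strong baselines. The work shows that temporal grounding is a key capability missing from current LALMs and provides a challenging benchmark and a practical training recipe for future research.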

Innovations:

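First to identify and define temporal grounding as a missing capability in music large language models, and builds the expert-validated benchmark MusTBENCH.
Proposes five temporal-grounding QA tasks (TSG, LTR, TAD, GTO, MTR) that comprehensively evaluate a model's temporal awareness of musical events.
Designs MusT, a four-stage temporal optimization recipe covering encoder adaptation, LLM adaptation, supervised fine-tuning, and RL-based optimization.
Shows experimentally that existing LALMs fail systematically at temporal grounding, while MusT brings significant improvements.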

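Methodology: First, a timestamped music-caption dataset is built: a structural music segmentation model provides segment boundaries, a MERT-based mood-change predictor is trained, and a range of music features are extracted to generate segment-level static captions and boundary-level dynamic captions, with cross-validation and rewriting to ensure quality. Five QA tasks are then generated from this dataset: TSG annotates onset and offset times using MIDI-aligned instrument tracks and source-separated vocal tracks; LTR, TAD, and GTO use the timestamped captions to construct multiple-choice, open-ended description, and ordering questions; MTR uses the mood-change predictor to annotate intervals of mood extremes. All QA pairs are validated by music experts. Finally, the four-stage MusT training recipe is proposed: (1) music encoder adaptation (contrastive learning); (2) LLM adaptation (pretraining on timestamped music captions); (3) supervised fine-tuning on temporal QA; (4) reinforcement learning (RL) based optimization.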

Key Results:

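Existing LALMs perform poorly on MusTBENCH and fail systematically on the TSG task in particular: predictions often collapse to coarse time anchors such as 60s or 120s, and some are invalid timestamps that exceed the audio duration.
The four-stage MusT recipe significantly outperforms strong baselines (such as Qwen3 Omni and Music Flamingo) on all five tasks, with clear gains in normalized performance.
Temporal grounding is a key capability missing from current LALMs, and MusTBENCH can serve as a challenging benchmark for future research.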

Tech Stack:

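Structural music segmentation model (Hao et al., 2025)
MERT (Li et al., 2024) for mood-change prediction
Source separation model (Rouard et al., 2023)
Contrastive learning (music encoder adaptation)
Supervised fine-tuning (SFT)
Reinforcement learning (RL) optimization
MTG-Jamendo dataset (Bogdanov et al., 2019)
Slakh2100 dataset (Manilow et al., 2019)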

Strengths:

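Clear problem definition, focused on temporal grounding, an understudied capability in music understanding.
Rigorous benchmark construction, validated by music experts, with tasks covering several dimensions of temporal reasoning.
The proposed MusT training recipe is practical and transferable and substantially improves existing models.
In-depth experimental analysis that pins down how temporal bias and hallucination manifest in the models.

Limitations:

The benchmark is limited in size (1264 QA pairs) and may not cover all musical scenarios.
The training data depends on automatic annotation plus expert validation, which is costly and may contain annotation noise.
The MusT recipe is validated only on specific models; generalization needs further testing.
Performance on finer-grained temporal grounding (e.g., millisecond level) is not explored.

Relevance To Keywords:

Native multimodal large models: the paper studies temporal grounding in music LLMs, an application of multimodal large models to the audio domain.
Representation learning: music encoder adaptation in MusT uses contrastive learning, which falls under representation learning.
Reinforcement learning: the fourth stage of MusT uses RL-based optimization to directly improve temporal grounding.
Post-training: the four-stage MusT recipe, comprising adaptation, fine-tuning, and RL-based optimization, is a post-training technique.
World models: not directly addressed, although temporal grounding can be viewed as the model's understanding of the state of the audio world.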

43. AnomalyAgent: Training-Free Agentic Models for Zero-/Few-Shot Anomaly Detection [PASS]

Score: 40.5 / 27.8

Authors: Yi Zhang, Jiawen Zhu, Lele Fu, Guansong Pang

Published: 2026-05-28

TL;DR: AnomalyAgent proposes a training-free agentic framework utilizing MLLMs to achieve superior zero-/few-shot anomaly detection performance through adaptive reasoning and memory grounding.

摘要翻译

受益于视觉语言模型（VLMs，如 CLIP）的泛化能力，许多零样本/少样本异常检测（AD）方法已在各类数据集上取得了卓越的性能。然而，这些方法需要在大型辅助数据集上进行大量训练以适应异常检测任务，且其推理主要依赖于基于视觉 - 文本嵌入相似度的异常分数，缺乏推理能力去检测那些需要深入上下文理解的复杂异常。为了解决这一局限性，我们提出了一种名为 AnomalyAgent 的新型无需训练的智能体框架，该框架利用多模态大语言模型（MLLMs）先进的推理与泛化能力来进行异常检测。其核心组件包括：1) 一个全面的以异常为中心的工具集，能够在零样本设置中实现自适应的、基于 MLLMs 驱动的智能体异常推理；2) 一个定制的记忆模块，利用少样本上下文参考示例来支撑异常推理。我们将评估范围从广泛基准中简单异常（如表面缺陷（裂纹、凹痕）及清晰病变）的检测，扩展至更多样化的异常类型，例如物流和制造场景中的逻辑/上下文异常。广泛的实验结果表明，与无需训练的基于 VLM 的异常检测方法及通用智能体方法相比，我们的 AnomalyAgent 实现了显著更优的性能，凸显了其在零样本及少样本异常检测设置中卓越的泛化能力。代码实现可在此地址获取。

Abstract

Benefiting from the generalizability of vision-language models (VLMs) such as CLIP, many zero-/few-shot anomaly detection (AD) approaches have achieved impressive detection performance across various datasets. Nevertheless, they require substantial training on large auxiliary datasets to adapt VLMs to anomaly detection, and their inference largely relies on visual-text embedding similarity-based anomaly scores, lacking reasoning abilities to detect complex anomalies that require in-depth contextual understanding. To address this limitation, we propose AnomalyAgent, a novel training-free, agentic framework that leverages the advanced reasoning and generalization capabilities of multimodal large language models (MLLMs) for anomaly detection. The key ingredients include 1) a comprehensive anomaly-centric toolset that enables adaptive MLLM-driven, agentic anomaly reasoning in zero-shot settings, and 2) a customized memory module that grounds anomaly reasoning with few-shot, in-context reference examples. We extend evaluation beyond the detection of simple anomalies (e.g., surface defects like cracks and dents and clear lesions) in widely used benchmarks to more diverse types of anomalies such as logical/contextual anomalies in logistics and manufacturing settings. Extensive experimental results demonstrate that our AnomalyAgent achieves substantially better performance compared to training-free VLM-based AD and generic agentic methods, highlighting its superior generalization capability in both zero-shot and few-shot anomaly detection settings. The code implementation can be found at this address.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper centers on MLLM and MultiModal technologies for anomaly detection, yielding high scores. It does not address Tokenizers, Visual Encoders, World Models, or Model-based RL directly, resulting in low scores. Unify Models is moderately relevant as MLLMs unify vision and language. No expert authors from the specified list were found. The weighted total score is 40.5, exceeding the dynamic passing score of 27.8.

Keywords

AnomalyAgent, Training-Free, Agentic Models, Zero-/Few-Shot, Anomaly Detection, MLLM, Reasoning, Memory Module

In-Depth Analysis

Chinese Title: AnomalyAgent: 无需训练的自主体模型用于零样本/少样本异常检测

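Summary: This paper proposes AnomalyAgent, a fully training-free agentic framework that uses the reasoning and generalization abilities of multimodal large language models (MLLMs) for anomaly detection. Existing methods based on vision-language models (VLMs) depend on training with auxiliary datasets and lack reasoning ability. AnomalyAgent instead uses an anomaly-centric toolset (general vision tools and template tools) to support hypothesis-driven exploration, counterfactual verification, and reflective reasoning, and in few-shot scenarios it adds a self-calibration-based memory module that turns reference samples into in-context memory. Experiments across industrial inspection, medical imaging, and logistics scenarios show that AnomalyAgent significantly outperforms training-free VLM methods and generic agentic methods in both zero-shot and few-shot settings, demonstrating strong generalization.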

Innovations:

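Proposes a fully training-free agentic framework that moves anomaly detection from similarity-based scoring to tool- and memory-augmented anomaly reasoning, enabling threshold-free decisions.
Designs an anomaly-centric toolset comprising general vision tools (denoising, deblurring, super-resolution, and others) and template tools (general template analysis, category-specific counterfactual template analysis) to support adaptive evidence gathering and verification.
Introduces a self-calibration-based memory module that turns few-shot normal reference samples into actionable in-context memory, enabling self-calibration during reasoning.
Builds a comprehensive evaluation benchmark spanning industrial, medical, logistics, and other anomaly types, validating the method on complex anomalies such as logical/contextual anomalies.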

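Methodology: AnomalyAgent follows an autonomous plan-reason-reflect pipeline. Given an input image and a category name, the template tools first produce a general template analysis and a category-specific counterfactual template analysis. Based on these analyses, the planner selects and invokes suitable vision tools to enhance anomaly-related evidence. The enhanced image, the original image, and the template analyses are passed to the reasoner for a judgment. If the termination condition is not met, the reflector combines the current information with the reasoner's thoughts and the loop proceeds to the next round. In few-shot scenarios, the memory module converts reference samples into memory through self-calibration to guide reasoning. The whole process requires no training or parameter updates.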
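
The control flow of the plan-reason-reflect loop described above can be sketched as follows. Every callable here (`planner`, `tools`, `reasoner`, `reflector`) is a hypothetical stand-in for an MLLM call or an image tool, `max_rounds` is an assumed termination cap, and the template analyses and memory module are omitted for brevity; this shows the loop shape only, not AnomalyAgent's code.

```python
def run_agent(image, category, planner, tools, reasoner, reflector, max_rounds=3):
    """Plan-reason-reflect loop with injected components.

    planner(context) -> tool name
    tools[name](image) -> enhanced evidence
    reasoner(image, evidence, context) -> (verdict, is_confident)
    reflector(verdict, context) -> note appended to the context for next round
    """
    context = {"category": category, "notes": []}
    verdict = "undetermined"
    for _ in range(max_rounds):
        tool_name = planner(context)               # plan: pick a tool
        evidence = tools[tool_name](image)         # act: enhance the evidence
        verdict, is_confident = reasoner(image, evidence, context)  # reason
        if is_confident:                           # termination condition met
            return verdict
        context["notes"].append(reflector(verdict, context))        # reflect
    return verdict


if __name__ == "__main__":
    # Stub components standing in for MLLM calls and vision tools.
    result = run_agent(
        image="<pixels>",
        category="bottle",
        planner=lambda ctx: "denoise",
        tools={"denoise": lambda img: img},
        reasoner=lambda img, ev, ctx: ("anomalous", len(ctx["notes"]) >= 1),
        reflector=lambda verdict, ctx: "inspect the cap region more closely",
    )
    print(result)
```

Passing the components in as callables keeps the loop itself free of any model dependency, so it can be exercised with stubs as in the usage example.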

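Key Results: Across multiple benchmark datasets, AnomalyAgent outperforms training-free VLM methods (such as WinCLIP) and generic agentic methods in both zero-shot and few-shot anomaly detection. It stands out on logical/contextual anomalies (such as abnormal object arrangements in logistics scenes), which demonstrates its reasoning ability. Experiments show that the toolset and the memory module are both essential to the performance gains.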

Tech Stack:

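Multimodal large language models (MLLM)
CLIP (vision-language model)
Image-processing tools: denoising, deblurring, super-resolution, scaling, brightness adjustment
Template tools: general template analysis, category-specific counterfactual template analysis
Self-calibrating memory mechanism
Plan-reason-reflect pipeline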

Strengths:

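Fully training-free, avoiding dependence on auxiliary datasets and parameter fine-tuning, with strong generalization.
Adds reasoning ability, so it can handle complex anomalies that require contextual understanding (such as logical anomalies).
The toolset and memory module are designed specifically for anomaly detection and are highly tailored to the task.
Effectiveness is validated across multiple domains and anomaly types on a comprehensive benchmark.

Limitations:

Depends on the MLLM's reasoning ability and may be limited by the quality and computational cost of the underlying model.
Tool calls and the reflection loop may increase inference time; real-time performance needs optimization.
The few-shot memory module needs a small number of normal samples; with no samples at all, the method relies on zero-shot ability alone.
No comparison with state-of-the-art methods that require training (such as AnomalyCLIP); only training-free methods are compared.

Relevance To Keywords:

Unify Models: the paper uses an MLLM to unify visual and language reasoning, but does not involve world models or representation learning.
World Models: not relevant.
Representation Learning: not directly relevant; the paper does not learn representations but relies on pretrained models.
Model-Based RL: not relevant.
Native multimodal large models: the paper builds on MLLMs, an application of native multimodal large models.
Unified understanding and generation in multimodal large models: the paper mainly uses the MLLM's understanding and reasoning abilities and does not involve generation.
Reinforcement learning: not relevant.
Post-training: the paper emphasizes being training-free and is unrelated to post-training.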

44. Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents [PASS]

Score: 39.0 / 27.8

Authors: Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

Published: 2026-05-28

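TL;DR: This paper empirically studies how different plan representations affect the robustness of web agents driven by multimodal LLMs, and finds that the form of the plan significantly affects task success.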

摘要翻译

尽管近期取得了进展，基于大语言模型（LLM）的 Web 代理仍然面临探索受限、关键步骤遗漏以及对任务约束敏感的问题。先前工作表明，许多此类失败源于规划方面的不足，然而替代性的自然语言计划表示的影响尚未得到探究。为解决这一问题，我们引入了 PlanAhead，这是一种静态规划器 - 执行器框架，用于评估计划表示对代理性能的影响。我们首先将 WebArena 任务自动划分为 3 个难度级别，从而实现无需人工标注的一致性难度分级。随后，我们在被归类为困难的任务上，系统性地评估了 4 种不同的计划表示：顺序子目标、叙述式、伪代码和检查清单；评估涵盖了不同家族的多模态大语言模型驱动的代理（OpenAI、Alibaba 和 Google）。为应对随机变异性，我们引入了两种新的评估指标：达成率（AR）和已解决任务一致性（STC）。结果表明，计划制定方式以及生成计划的底层大语言模型，均显著影响 Web 代理的鲁棒性和任务成功率。

Abstract

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. Then we systematically evaluate 4 different plan representations on the tasks categorized as hard: sequential subgoals, narrative, pseudocode, and checklist; across different families of multimodal LLM powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	5.0/10	7.5

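Scoring rationale: The paper centers on how planning representations affect the performance of LLM web agents. It is highly relevant to MLLM and MultiModal (the agents are built on multimodal large models) and moderately relevant to model-based RL (it involves plan generation). It has no direct connection to Unify Models, Tokenizer, Visual Encoder, or World Models. None of the designated expert authors were found, so no bonus is applied. The weighted total score is 39.0, above the dynamic passing score of 27.8.

Keywords

Planning Representations, LLM Web Agents, PlanAhead, Multimodal LLM, Plan Formulation, Task Success, Empirical Study

In-Depth Analysis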

Chinese Title: 计划方式重要吗？LLM网络代理计划表示的实证研究

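Summary: LLM-based web agents suffer from limited exploration, omission of critical steps, and sensitivity to task constraints. This paper proposes the PlanAhead framework to systematically evaluate how different plan representations affect agent performance. The study first automatically divides WebArena tasks into three difficulty levels (Easy, Medium, Hard) without human annotation. It then compares four plan representations on the Hard tasks (sequential subgoals, narrative, pseudocode, and checklist), using several multimodal LLMs (GPT-4.1-mini, Qwen-2.5-VL-72B, Gemini 2.5 Flash) as planners and executors. To capture stochasticity, it introduces two new metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). The results show that both the plan representation and the underlying LLM significantly affect agent robustness and task success, and that different LLMs prefer different representations.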

Innovations:

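Proposes an automatic task-difficulty grading pipeline that sorts WebArena tasks into Easy, Medium, and Hard levels without human annotation.
Introduces three new natural-language plan representations (narrative, pseudocode, and requirements checklist) and compares them with standard sequential subgoals.
Proposes two new evaluation metrics: Achievement Rate (AR), which measures whether a task succeeds at least once across multiple runs, and Solved-Task Consistency (STC), which measures cross-run consistency on the tasks that were achieved.
Systematically evaluates the effect of plan representation across several LLM families (OpenAI, Alibaba, Google) and analyzes planner-executor combinations.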

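Methodology: The study uses a static planner-executor framework (PlanAhead): a planner LLM generates a single static plan from the task goal and an initial browser screenshot, and an executor LLM predicts low-level actions step by step from the current browser state and the plan. Difficulty grading uses BrowserGym's GenericAgent with 5 different LLM backbones and 5 independent trials per model (temperature 0.4), assigning difficulty by success rate. On 158 Hard tasks, the 4 plan representations (sequential subgoals, checklist, pseudocode, narrative) are tested with 3 multimodal LLMs serving as planner and executor, giving 9 combinations, with 5 runs per task at temperature 0.6 for the planner and 0 for the executor. The dynamic-planning baseline is GenericAgent's UsePlan capability.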
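
Given 5 runs per task, the two metrics can be sketched from their stated intent: AR counts tasks solved at least once, and STC measures how consistently those achieved tasks are solved across runs. The exact formulas are not given in this section, so the definitions below (in particular, STC as the mean per-task success fraction over achieved tasks) are assumptions, as are the function names and the toy data.

```python
def achievement_rate(runs):
    """Percentage of tasks solved in at least one run.

    `runs` maps a task id to a list of per-run success flags.
    """
    achieved = [task for task, outcomes in runs.items() if any(outcomes)]
    return 100.0 * len(achieved) / len(runs)


def solved_task_consistency(runs):
    """Mean success fraction (in percent) over tasks solved at least once.

    Assumed definition: a task solved in 5 of 5 runs contributes 100,
    a task solved in 1 of 5 runs contributes 20.
    """
    achieved = [outcomes for outcomes in runs.values() if any(outcomes)]
    if not achieved:
        return 0.0
    fractions = [sum(outcomes) / len(outcomes) for outcomes in achieved]
    return 100.0 * sum(fractions) / len(fractions)


if __name__ == "__main__":
    toy_runs = {
        "task_a": [True, True, True, True, True],
        "task_b": [True, False, False, False, False],
        "task_c": [False] * 5,
        "task_d": [False] * 5,
    }
    print(achievement_rate(toy_runs), solved_task_consistency(toy_runs))
```

Read together, a high AR with a low STC would indicate that an agent can reach a solution but does so unreliably, which a single-run success rate cannot distinguish.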

Key Results:

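Different LLMs show clear preferences among plan representations: GPT-4.1-mini performs best with narrative (AR=10.7, STC=47), Gemini 2.5 Flash with pseudocode (AR=8.2, STC=43), and Qwen-2.5-VL-72B with checklist (AR=5.1, STC=75).
Both the plan representation and the underlying LLM significantly affect web-agent robustness and task success.
Static planning outperforms dynamic planning in some configurations, but the overall difference is small.
The new AR and STC metrics give a finer-grained view of task reachability and consistency across multiple runs.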

Tech Stack:

LLM: GPT-4.1-mini, Qwen-2.5-VL-72B, Gemini 2.5 Flash
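Benchmark environment: WebArena (via BrowserGym)
Evaluation metrics: Success Rate (SR), Achievement Rate (AR), Solved-Task Consistency (STC)
Plan representations: sequential subgoals, checklist, pseudocode, narrative
Framework: PlanAhead (static planner-executor)
Dynamic-planning baseline: GenericAgent UsePlan
Temperature settings: planner 0.6, executor 0, difficulty grading 0.4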

Strengths:

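Provides a systematic empirical study of how plan representation affects LLM web agents, filling a gap in the field.
Proposes an automatic difficulty-grading method that improves the reproducibility and objectivity of the experiments.
Introduces the AR and STC metrics, which evaluate performance under multiple stochastic runs more completely.
Runs experiments across several LLM families and planner-executor combinations, so the conclusions are broadly applicable.
Code and datasets are public, which supports reproduction and follow-up research.

Limitations:

Only static planning is used; the effect of representation under dynamic planning (such as reactive or proactive planning) is not explored.
Experiments focus only on Hard tasks; representations are not fully compared on Easy and Medium tasks.
Only the WebArena benchmark is used; generalization to other environments (such as WorkArena or real websites) is unknown.
Planners and executors use the same or different LLMs, but synergy between models is not analyzed in depth.
Each task is run only 5 times, which may not fully capture stochasticity.

Relevance To Keywords:

Native multimodal large models: the paper uses multimodal LLMs (GPT-4.1-mini, Qwen-2.5-VL-72B, Gemini 2.5 Flash) as planners and executors, so it is directly relevant.
Representation learning: the paper studies how different plan representations (narrative, pseudocode, checklist, and so on) affect agent performance, which falls under task representation.
World models: the plan generated by the planner can be viewed as an abstract model of the task world, but no explicit world model is built.
Reinforcement learning: the AR and STC metrics are in the spirit of multi-episode evaluation in reinforcement learning, but no RL training is used.
Post-training: the paper does not involve post-training techniques and focuses on plan representation at inference time.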

45. Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark [PASS]

Score: 39.0 / 27.8

Authors: Rahul Bissa, Abhishek Vyas, Yash Jain

Published: 2026-05-28

TL;DR: This paper benchmarks supervised fine-tuning of multimodal large language models for screen-conditioned action prediction, demonstrating that Qwen3-VL significantly outperforms zero-shot baselines and other models when the training recipe matches the model architecture.

摘要翻译

我们在 PiSAR（Persona, intent, Screen, Action, Rationale）的一个 661 行预留切片上，对三个监督微调模型与前沿零样本基线进行了基准测试。PiSAR 是一个包含 12,929 个元组的语料库，其基于屏幕的行为论证源自公共应用商店评论、Pew American Trends Panel 人口统计数据以及 OPeRA 购物者轨迹。无论是前沿模型还是微调模型，均在同一 661 行切片上使用相同的评分流程进行评估。研究发现如下：首先，前沿零样本基线（Claude Opus 4.7 和 GPT-5.5）的 sem_sim 得分分别为 0.459 和 0.482；而微调后的 Qwen3-VL-8B-Instruct 达到 0.783，并在 79% 的样本行中达到 sem_sim >= 0.7，相比之下前沿基线仅为 1-2%，在同一测试集上绝对差距达 0.30。其次，使用相同的训练数据和训练方案，Gemma-4-26B-A4B-IT 的得分仅为 0.441，处于与前沿零样本基线相同的水平，而非微调后的 Qwen 模型。我们将此解读为训练方案与模型架构的错配：推理微调的高参数模型难以被替代，可能需要更多数据或更强的微调方法。

Abstract

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on benchmarking supervised fine-tuning (SFT) for screen-conditioned action prediction using multimodal models (Qwen3-VL). It is highly relevant to MLLM and MultiModal (scores 8) as it utilizes vision-language models and screen inputs. It has low relevance to Unify Models (2), Tokenizer (1), and World Models (1) as these are not the focus. It has moderate relevance to model-based RL (3) due to the action prediction task, though the method is SFT rather than RL. The weighted total score is 39.0, exceeding the dynamic passing score of 27.8. No expert authors from the specified list were found.

Keywords

Supervised Fine-Tuning, Screen-Conditioned Action Prediction, PiSAR Benchmark, Qwen3-VL, Model Architecture Sensitivity, Zero-shot Baselines, Behavioral Rationales

In-Depth Analysis

Chinese Title: 架构敏感的监督微调用于屏幕条件动作预测：PiSAR基准

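Summary: This paper presents the PiSAR benchmark, a corpus of 12,929 screen-anchored behavioural rationales in which each record contains (persona, intent, screen, action, rationale). On a 661-row held-out test set, the authors compare three supervised fine-tuned models (built on Qwen3-VL-8B-Instruct and Gemma-4-26B-A4B-IT) with two frontier zero-shot baselines (Claude Opus 4.7, GPT-5.5). Main findings: first, the zero-shot baselines reach semantic similarity (sem_sim) of 0.459 and 0.482, while the fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and exceeds the 0.7 threshold on 79% of rows, a gap of 0.30 absolute. Second, the same training data and recipe applied to Gemma-4-26B-A4B-IT scores only 0.441, on par with the zero-shot baselines, indicating that the effect of fine-tuning is highly sensitive to the base model's architecture: the reasoning-tuned high-parameter model resists displacement and may need more data or a stronger fine-tuning method. The paper also analyzes the gap between SFT and zero-shot and the influence of the model's post-training prior, and notes that the gap does not hold universally but depends on the choice of base model.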

Innovations:

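Builds the PiSAR corpus of 12,929 screen-anchored behavioural rationales, using a "real 2/3" rule to ensure data authenticity.
Systematically compares supervised fine-tuning with frontier zero-shot models on screen-conditioned action prediction, finding that fine-tuning can improve semantic similarity by 0.30 absolute.
Reveals that the effect of fine-tuning is sensitive to the base model's architecture: the same training data yields a large gain on Qwen and almost none on Gemma, which is attributed to the post-training prior of the reasoning-tuned model resisting displacement.
Proposes a low-cost LoRA-based fine-tuning recipe (8B model, sub-second latency) and publishes a reproducible methodology.
Provides a multi-dimensional evaluation framework with threshold pass rates (sem≥0.3/0.5/0.7), token Jaccard, and other metrics.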

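Methodology: The paper proceeds as follows. First, the PiSAR corpus is built from OPeRA shopper traces, app-store reviews, and Pew American Trends Panel data; each record contains a base64-encoded screenshot, a structured persona, an intent, an action, and a rationale. Qwen3-VL-8B-Instruct and Gemma-4-26B-A4B-IT are then fine-tuned with LoRA (rank 16, learning rate 2e-4, 3 epochs) on the Fireworks hosted SFT platform, with training data split into OPeRA-only (4,014 rows) and combined (13,796 rows, OPeRA upsampled 2x). For evaluation, three metrics are computed on the 661-row held-out test set: token Jaccard, length ratio, and semantic similarity (cosine similarity of OpenAI text-embedding-3-small embeddings). The zero-shot baselines call the Claude Opus 4.7 and GPT-5.5 APIs directly, without prompt engineering. All models are scored on the same input rows to ensure a fair comparison.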
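
The three metrics and the threshold pass rate can be sketched in plain Python. The tokenization rule (lowercased word characters) and all function names are assumptions, since the paper's scoring pipeline is not shown here, and the embedding step is left out: `cosine_similarity` takes any two equal-length vectors, which in the paper's setup would come from text-embedding-3-small.

```python
import math
import re


def tokens(text):
    """Lowercased word tokens (an assumed tokenization rule)."""
    return re.findall(r"\w+", text.lower())


def token_jaccard(pred, ref):
    """Jaccard overlap between the token sets of prediction and reference."""
    a, b = set(tokens(pred)), set(tokens(ref))
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def length_ratio(pred, ref):
    """Prediction length divided by reference length, in tokens."""
    ref_len = len(tokens(ref))
    return len(tokens(pred)) / ref_len if ref_len else 0.0


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pass_rate(scores, threshold):
    """Fraction of rows whose score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    print(token_jaccard("tap the buy button", "tap buy now"))
    print(length_ratio("tap buy", "tap the buy button"))
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
    print(pass_rate([0.8, 0.6, 0.75, 0.2], threshold=0.7))
```

Reporting a pass rate at fixed thresholds alongside the mean, as the paper does with sem≥0.3/0.5/0.7, shows whether a higher average comes from most rows improving or from a few very high scores.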

Key Results:

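The fine-tuned Qwen3-VL-8B-Instruct (combined training) reaches sem_sim 0.783 on the PiSAR test set, far above the zero-shot baselines GPT-5.5 (0.482) and Claude Opus 4.7 (0.459), a gap of 0.30 absolute.
The same fine-tuning recipe applied to Gemma-4-26B-A4B-IT scores only 0.441, on par with the zero-shot baselines, indicating that the effect of fine-tuning is highly sensitive to the base model's architecture.
Qwen trained on OPeRA-only reaches 0.519, below the 0.783 of combined training, indicating that more data brings gains.
On threshold pass rate, the fine-tuned Qwen reaches 79% at sem≥0.7, against only 1-2% for the zero-shot baselines.
The fine-tuned Qwen has sub-second latency, while the zero-shot baselines are slower (not quantified).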

Tech Stack:

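LoRA (low-rank adaptation) fine-tuning
QLoRA (4-bit quantization)
Fireworks hosted SFT platform
OpenAI text-embedding-3-small embedding model (1536 dimensions)
Token Jaccard similarity
Cosine similarity (semantic similarity)
Claude Opus 4.7 API
GPT-5.5 API
Qwen3-VL-8B-Instruct model
Gemma-4-26B-A4B-IT (MoE) model
Base64-JPEG image encoding
Pew American Trends Panel data
OPeRA shopper traces dataset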

Strengths:

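Builds a high-quality, multi-source corpus of screen-anchored behavioural rationales with safeguards for data authenticity.
Runs a rigorous comparison: all models are scored on the same test set with the same scoring pipeline, so the results are reliable.
Reveals the sensitivity of fine-tuning to the base model's architecture, an important lesson for follow-up research.
The method is reproducible, using public data and a hosted platform, which lowers the barrier to replication.
The evaluation metrics are comprehensive, covering semantic similarity, token overlap, and length ratio, along with threshold pass rates.

Limitations:

The PiSAR corpus is not released; only the construction method is described, so readers must build an equivalent corpus from the same public sources.
Only two base models (Qwen and Gemma) are tested, which limits how far the conclusions generalize.
Stronger fine-tuning methods (such as full-parameter fine-tuning or RLHF) and larger data volumes are not explored for Gemma.
The zero-shot baselines are not optimized with prompt engineering, which may understate their potential.
OPeRA is upsampled 2x in the training data, but whether the upsampling itself causes the gain is not verified, leaving a confound.
Only single-turn action prediction is evaluated; multi-turn interaction and more complex tasks are not covered.

Relevance To Keywords:

Unify Models / native multimodal large models: the paper uses Qwen3-VL-8B-Instruct and Gemma-4-26B-A4B-IT, both multimodal large models, but does not address unified understanding and generation.
World Models: the paper targets screen-conditioned action prediction, which can be viewed as part of building a world model of user behaviour, but it does not explicitly model environment dynamics.
Representation Learning: fine-tuning learns screen-behaviour representations, and the semantic-similarity evaluation implicitly reflects representation quality.
Model-Based RL: the paper does not involve reinforcement learning, but the fine-tuned model can be viewed as a behaviour-prediction model, related to the predictive component in model-based RL.
Post-training: the core of the paper is supervised fine-tuning (SFT), which is a post-training method, and it discusses how the post-training prior affects fine-tuning.
Reinforcement learning: the paper does not use RL, but it notes that RLHF and instruction tuning reduce distributional fidelity, which relates to RL.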

46. RoboWits: Unexpected Challenges for Robotic Creative Problem Solving [PASS]

Score: 37.5 / 27.8

Authors: Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

Published: 2026-05-28

TL;DR: RoboWits introduces a robotic benchmark for creative problem solving, revealing that pre-trained vision-language models exhibit brittleness in reasoning and robustness when faced with unexpected task mutations.

摘要翻译

在真实环境中运行的机器人，其在意外挑战下推理、适应并创造性地解决问题的能力至关重要。然而，当前的机器人基准测试主要侧重于技能级执行，对这类认知推理能力的洞察较为有限。我们提出了 RoboWits，这是一个双臂机器人基准测试，旨在系统评估认知推理、创造性工具使用以及对意外条件的鲁棒性。为了实现高质量、以推理为中心的意外场景的可扩展构建，我们提出了一种自动化任务生成流水线，该流水线被设计为一个多智能体协作框架，包含用于种子任务生成与验证、指标生成、场景生成和任务变异的智能体。利用该流水线，我们构建了 30 个多样化的种子任务以及 208 个具有变异和分级难度的任务，涵盖几何、材料和基于装配的推理。我们对流行的机器人策略、预训练 VLA（视觉 - 语言动作模型）以及 oracle-state 规划器进行了基准测试。我们的结果表明存在显著的性能差距：尽管预训练 VLA 在单任务微调后在种子任务上取得初步成功，但它们在变异任务上难以执行，这表明它们在需要推理、策略适应以及对欺骗性或受限环境鲁棒性的操作任务中存在脆弱性。项目页面位于 https://umass-embodied-agi.github.io/RoboWits.

Abstract

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi-manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass-embodied-agi.github.io/RoboWits.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	3.0/10	4.5
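
The score column and the pass decision in this report follow from a simple weighted sum, which can be sketched as below. The function and variable names are illustrative assumptions; the weights, relevance values, and the 27.8 passing score are taken from the table and the rationale for this paper.

```python
# Relevance values (out of 10) copied from the table above; each keyword has weight 1.5.
WEIGHT = 1.5
PASSING_SCORE = 27.8  # the report's dynamic passing score for this day

relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 3.0,
    "World Models": 2.0,
    "MLLM": 7.0,
    "MultiModal": 7.0,
    "model-based RL": 3.0,
}


def weighted_total(scores: dict, weight: float = WEIGHT) -> float:
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in scores.values())


total = weighted_total(relevance)
passed = total >= PASSING_SCORE
print(f"total={total}, passed={passed}")
```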

Scoring rationale: The paper introduces RoboWits, a robotic benchmark for creative problem solving and reasoning. Its evaluation centers on vision-language-action models (VLAs), which are closely related to MLLM and MultiModal technologies, hence the high scores for these keywords. Visual encoders are components of VLAs but not the focus. Tokenizer, Unify Models, World Models, and model-based RL are not central to the paper's methodology or findings, resulting in lower scores. The weighted total is 37.5, exceeding the dynamic passing score of 27.8.

Keywords

Robotic Benchmark, Creative Problem Solving, Vision-Language Models, Unexpected Challenges, Cognitive Reasoning, Multi-agent Task Generation, Robustness Evaluation

In-depth analysis

Chinese Title: RoboWits：机器人创造性问题解决中的意外挑战

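Summary: The paper proposes RoboWits, a bimanual robotic benchmark that systematically evaluates cognitive reasoning, creative tool use, and robustness under unexpected challenges. To build high-quality reasoning tasks at scale, the authors design an automated task-generation pipeline structured as a multi-agent cooperative framework, with agents for seed task generation, mutation, metric generation, verification, and scene generation. The pipeline yields 30 seed tasks and 208 mutated tasks spanning geometry, material, and assembly reasoning, with graded difficulty. The authors evaluate several kinds of robot policies: pre-trained vision-language-action models (VLAs), imitation-learning baselines, and planners built on vision-language models (VLMs). Pre-trained VLAs show preliminary success on seed tasks after single-task finetuning but are brittle on mutated tasks, exposing weaknesses in reasoning, strategy adaptation, and robustness to deceptive or constrained environments. RoboWits thus offers a rigorous framework for quantifying the gap between low-level manipulation skill and high-level cognitive adaptation.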

Innovations:

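Introduces the RoboWits benchmark, dedicated to evaluating cognitive reasoning, creative tool use, and robustness to unexpected challenges in bimanual manipulation.
Designs a multi-agent automated task-generation pipeline that builds diverse reasoning tasks at scale.
Systematically evaluates existing robot policies and exposes the limits of pre-trained VLAs in reasoning and strategy adaptation.
Covers geometry, material, and assembly reasoning, with difficulty grading that supports fine-grained evaluation.
Automates metric generation and scene instantiation, reducing manual design cost.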

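Methodology: Tasks are generated automatically by a multi-agent cooperative framework. The pipeline comprises a seed task generation agent (proposes specifications for cognitively challenging tasks), a task mutation agent (blocks the original solution through small scene changes), a task metric agent (generates executable evaluation criteria), a task verification agent (checks feasibility, simulability, and the need for reasoning), and a scene generation agent (builds physically consistent environments). All agents are driven by foundation models such as LLMs. Tasks are instantiated in physics simulators such as Genesis, and 50 human teleoperation demonstrations are collected for benchmarking. Evaluation covers single-task finetuning and multi-task learning, comparing imitation learning, pre-trained VLAs, and VLM planners.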

Key Results:

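With little data (50 demonstrations), pre-trained VLAs outperform models trained from scratch, but still struggle on tasks involving complex material interactions and assembly reasoning.
Modular VLM-based planners achieve reasonable performance on seed tasks but fail to generalize to mutated tasks.
All existing methods are brittle on tasks that require strategy adaptation or involve deceptive scenes, and performance drops markedly as difficulty increases.
The RoboWits benchmark contains 208 tasks covering geometry, material, and assembly reasoning, with difficulty graded from 1 to 5.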

Tech Stack:

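Physics simulators: Genesis, SAPIEN, MuJoCo, IsaacGym
Pre-trained VLA models (e.g., RT-2, Octo)
VLM models (e.g., GPT-4V)
Multi-agent LLM collaboration framework (based on GPT-4 and similar models)
Imitation learning (behavior cloning, diffusion policy, etc.)
Task metric generation (programmatic checks of physical state)
Scene generation (parameterized 3D models and physical materials)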

Strengths:

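Provides the first systematic evaluation of robotic creative problem solving, filling a gap in existing benchmarks.
The automated task-generation pipeline scales well, lowers manual design cost, and maintains task diversity and reasoning difficulty.
Task design is tightly coupled to physical reasoning (geometry, material, assembly) and close to real-world challenges.
Fine-grained difficulty grading and continuous progress scoring support more precise performance diagnosis.
The evaluation spans several mainstream approaches and exposes the reasoning weaknesses of current VLA models, pointing to directions for future work.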

Limitations:

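All tasks run in simulation and are not validated on real robots, so a sim-to-real gap remains.
Task generation depends on LLMs, which may introduce bias or produce unreasonable tasks; the verification agent filters some of these but has its own limits.
Only bimanual robots with parallel-jaw grippers are supported; other end effectors and mobile manipulation are not covered.
Human demonstrations were collected for only 10 seed tasks; mutated tasks have no demonstrations, which constrains imitation-learning methods.
The paper does not analyze failure causes in depth (for example perception, planning, or execution errors).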

Relevance To Keywords:

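Multimodal large models: the VLAs and VLMs evaluated in the paper are multimodal large models, and the benchmark tests their reasoning and adaptation.
World models: the tasks require understanding physical constraints (geometry, material, assembly), which is closely tied to physical reasoning in world models.
Representation learning: representations of object properties (shape, material) are critical for reasoning in these tasks, but the paper does not directly study representation-learning methods.
Model-based RL: the paper involves no RL training, but the benchmark could be used to evaluate the generalization of RL policies.
Post-training: the single-task finetuning in the paper is a form of post-training, but more complex post-training paradigms are not explored.
Overall relevance is moderate: the main contribution is the benchmark and evaluation rather than a new model or training method, though it provides a testbed for related areas.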

47. TRACER: Persistent Regularization for Robust Multimodal Finetuning (PASS)

Score: 37.5 / 27.8

Authors: Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani

Published: 2026-05-28

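TL;DR: TRACER is a contrastive finetuning framework guided by a weighted moving average (WMA) teacher, designed to mitigate catastrophic forgetting in multimodal models and improve out-of-distribution robustness.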

摘要翻译

主流的微调预训练多模态模型的策略往往会降低分布外（OOD）鲁棒性，这种现象被称为灾难性遗忘。本文构建了一个多模态对比微调的理论框架，为每种策略导出了闭式解及几何分解。该框架表明，自蒸馏在保留预训练模型知识方面比其他正则化方法更有效。我们的分析揭示了一个被严重忽视的局限性：广泛用于鲁棒微调的标准指数移动平均（EMA）教师模型面临坍塌问题。为了解决这一问题，我们证明了加权移动平均（WMA）教师模型在有限时间范围内维持持续的正则化力，并在任务子空间中实现无偏收敛，同时保持正交知识。这些见解催生了 TRACER（轨迹鲁棒锚定用于对比编码器正则化），该方法将对比学习与 WMA 引导的多视角蒸馏相结合。在 CLIP 微调上的广泛实验表明，该方法在三种骨干网络架构上均实现了分布外（OOD）准确率和校准性能的提升，全面的消融实验证实了 TRACER 既具有理论依据，又对超参数选择具有鲁棒性。代码可在 [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER) 获取。

Abstract

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0

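Scoring rationale: The paper's core is regularization for multimodal finetuning, so MultiModal is highly relevant. MLLM and Visual Encoder are moderately related (model components are involved), as is Unify Models (the theoretical framework unifies several strategies). Tokenizer, World Models, and model-based RL are unrelated. No author matches the expert list. The weighted total is 37.5, above the passing score of 27.8.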

Keywords

Multimodal Finetuning, Catastrophic Forgetting, Contrastive Learning, Weighted Moving Average, Out-of-Distribution Robustness, Regularization, CLIP, Multi-perspective Distillation

In-depth analysis

Chinese Title: TRACER: 持久正则化用于鲁棒多模态微调

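Summary: The paper addresses catastrophic forgetting when finetuning pretrained multimodal models. It develops a theoretical framework that recasts the linearized contrastive loss as a matrix least-squares problem and derives closed-form solutions and a geometric decomposition for different finetuning strategies. The analysis shows that self-distillation preserves pretrained knowledge more effectively than other regularizers, and it finds that the standard exponential moving average (EMA) teacher suffers from a collapsing regularization signal. The paper then proves that a weighted moving average (WMA) teacher maintains a persistent regularizing force over finite horizons and converges without bias in the task subspace. Building on this theory, it proposes TRACER, which combines contrastive learning with WMA-guided multi-perspective distillation. In CLIP finetuning experiments, TRACER consistently improves out-of-distribution (OOD) accuracy and calibration across several backbone architectures, and ablations confirm the method's robustness and the contribution of each design element.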

Innovations:

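Introduces a contrastive target matrix that turns the linearized contrastive loss into a matrix least-squares problem, yielding closed-form solutions for different finetuning strategies.
Derives a geometric decomposition that separates finetuning into mixing within the task subspace and preservation of orthogonal knowledge, which explains how forgetting arises.
Shows that the regularization signal of the standard EMA teacher collapses late in training, and proves that a WMA teacher provides a persistent regularizing force.
Proposes TRACER, which combines contrastive learning with WMA-guided multi-perspective distillation for robust multimodal finetuning.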

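Methodology: The paper combines theoretical analysis with experimental validation. Under a linearization assumption, the multimodal contrastive learning (MMCL) loss reduces to a matrix least-squares form, and the Moore-Penrose pseudoinverse gives closed-form solutions for direct finetuning, L2 regularization, static self-distillation, and related strategies. The paper then analyzes the dynamics of EMA and WMA teachers and proves the persistent-regularization property of WMA. The resulting TRACER algorithm trains CLIP models jointly with a contrastive loss and a self-distillation loss against the WMA teacher, and OOD robustness is evaluated on ImageNet and its distribution-shift variants.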

Key Results:

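The theoretical framework characterizes the geometry of each strategy: direct finetuning discards pretrained knowledge inside the task subspace, L2 regularization mixes old and new knowledge within that subspace, and self-distillation preserves knowledge in the orthogonal directions.
As training converges, the gap between the EMA teacher and the student shrinks to zero and the regularization signal vanishes; the WMA teacher, by weighting the whole trajectory, keeps a regularizing force over a finite horizon.
Across several CLIP backbones (such as ViT-B/32, ViT-B/16, and ViT-L/14), TRACER improves OOD accuracy (for example on ImageNet-V2, -R, -A, and -Sketch) and calibration over baselines such as LP-FT, FLYP, WiSE-FT, and CaRot.
Ablations show that the WMA teacher outperforms the EMA teacher and that TRACER is robust to hyperparameters such as regularization strength, teacher update frequency, and kernel shape.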
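
The contrast between EMA and WMA teachers can be illustrated with a scalar toy simulation. This is only a sketch under stated assumptions: the "student" is a single number that converges exponentially to 1, the WMA teacher uses uniform weights over the whole trajectory (the paper's actual weighting kernel may differ), and the decay value 0.9 is arbitrary.

```python
import math


def simulate(steps: int = 200, ema_decay: float = 0.9):
    """Return (student - EMA teacher, student - WMA teacher) after `steps` updates."""
    ema = 0.0          # EMA teacher, starting at the pretrained value 0
    running_sum = 0.0  # accumulator for a uniform average over the trajectory
    student = 0.0
    for t in range(steps + 1):
        # Toy student that converges toward the task solution at 1.0.
        student = 1.0 - math.exp(-t / 10.0)
        ema = ema_decay * ema + (1.0 - ema_decay) * student
        running_sum += student
    wma = running_sum / (steps + 1)  # uniform-weight WMA (an assumption)
    return student - ema, student - wma


ema_gap, wma_gap = simulate()
print(f"EMA gap: {ema_gap:.2e}, WMA gap: {wma_gap:.2e}")
```

In this toy setting the EMA teacher catches up with the converged student, so a distillation term built on their difference has almost nothing left to pull on, while the trajectory average still sits visibly behind the student.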

Tech Stack:

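Contrastive learning (InfoNCE loss)
Self-distillation
Exponential moving average (EMA)
Weighted moving average (WMA)
CLIP models (ViT backbones)
Matrix least-squares optimization
Moore-Penrose pseudoinverse
Orthogonal projection operators
Linearized analysis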

Strengths:

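Provides solid theoretical analysis that reduces contrastive finetuning to a solvable closed form and exposes the geometric nature of forgetting.
Identifies a key flaw of EMA teachers and addresses it with the WMA teacher, a novel and effective alternative.
TRACER delivers consistent gains across architectures and datasets, with thorough ablations that confirm its robustness.
The code is open source, which eases reproduction and follow-up work.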

Limitations:

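The theory rests on a linearization assumption (linear encoders) and may not fully capture the behavior of deep nonlinear networks.
Experiments cover only CLIP and image-text tasks; generalization to other multimodal models (such as video-text or audio-text) or to vision-only models is not verified.
The WMA teacher needs to store weights from the whole training trajectory, which may raise memory cost; the paper does not discuss computational cost in detail.
The comparison with recent post-training methods (such as Model Stock and WiSE-FT variants) is not comprehensive, and some baselines are dated.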

Relevance To Keywords:

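Representation learning: the paper studies finetuning and regularization of multimodal representations, so it is directly relevant.
Unified understanding and generation in multimodal large models: CLIP is an understanding-oriented multimodal model; the paper addresses its finetuning robustness, which relates to understanding but not to generation.
World models: not addressed; the paper involves no world model or environment interaction.
Reinforcement learning: not used.
Post-training: finetuning is a form of post-training, and TRACER is a post-training regularization method.
Unify Models: the paper does not discuss unified model architectures.
Model-Based RL: not relevant.
Native multimodal large models: CLIP is a natively multimodal model and the paper finetunes it, so there is some relevance.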

48. Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models (PASS)

Score: 37.5 / 27.8

Authors: Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi

Published: 2026-05-28

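TL;DR: To address poor generalization and unintended changes to unrelated information in knowledge editing for multimodal large language models, the paper proposes the Localized and Disentangled Knowledge Editing (LDKE) framework for precise and generalizable edits.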

摘要翻译

现有的多模态知识编辑（MKE）方法已提升了在多模态大语言模型（MLLMs）中纠正过时或不准确知识的能力。然而，它们存在一个关键局限性：尽管能有效修改目标事实对，却无法将编辑泛化至逻辑相关的查询，且往往会对无关但存在视觉或语义关联的信息造成意外改变。我们识别并形式化了导致这一问题的两个潜在故障模式：因果错位（Causal Misalignment），它将编辑局限于特定样本；以及特征纠缠（Feature Entanglement），它会导致对耦合但无关信息的意外改变。为了解决这些问题，我们提出了一种局部化与解耦知识编辑（LDKE）新框架，该框架通过定位事实特定的模型层并将目标相关输入与无关输入解耦，从而实现精确且泛化的编辑。该方法引入了一种快速定位模块，用于高效识别并更新关键层，同时配备了一个解耦分类器，能够适当路由输入以保留无关知识。在各种基准和 MLLMs 上的广泛实验表明，LDKE 在将编辑传播至相关上下文的同时保持高局部性方面表现卓越。

Abstract

Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5

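Scoring rationale: The title and abstract make clear that the paper studies multimodal large language models (MLLMs), so MLLM and MultiModal are highly relevant and each scores 9. The other keywords (World Models, model-based RL, Tokenizer, Visual Encoder, and Unify Models) are not emphasized in the core method (the LDKE framework) or the problem definition, so their relevance is low and they score 1 to 2. The author list does not overlap with the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is added. The weighted total is 37.5, above the dynamic passing score of 27.8.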

Keywords

Knowledge Editing, Multimodal Large Language Models, Localized and Disentangled, Causal Misalignment, Feature Entanglement, Fast Localization, Disentanglement Classifier

In-depth analysis

Chinese Title: 面向多模态大语言模型的局部化与解耦知识编辑

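Summary: The paper addresses the tension between generalization and locality in knowledge editing for multimodal large language models (MLLMs) and identifies two key bottlenecks: causal misalignment (an edit corrects only the specific sample and does not carry over to logically related queries) and feature entanglement (an edit unintentionally affects unrelated knowledge that is visually or semantically coupled). It proposes the Localized and Disentangled Knowledge Editing (LDKE) framework, which has a Fast Localization module (identifies fact-critical layers dynamically in a single forward pass) and a Disentanglement Classifier (routes by cosine similarity, sending unrelated queries to the original weights). Experiments show that LDKE clearly outperforms existing methods across several benchmarks and MLLMs and balances generalization against locality.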

Innovations:

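First to systematically identify and formalize causal misalignment and feature entanglement as two root problems in multimodal knowledge editing.
Proposes a Fast Localization module that estimates each FFN layer's contribution to the target answer in a single forward pass, enabling instance-level dynamic layer selection without costly causal tracing.
Designs a Disentanglement Classifier that routes dynamically using disentangled hidden representations and cosine similarity, isolating unrelated knowledge from the edited weights.
Combines a layer-specific weight editor (based on the MEND architecture) with the localization and routing mechanisms into the full LDKE framework, with clear gains in generalization and locality.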

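Methodology: The approach has three parts. (1) Fast localization: in a single forward pass, extract the hidden representation of the last prompt token, compute the difference in target log-probability before and after each FFN layer as that layer's contribution score, and select the top-k layers from the latter half of the network as edit targets. (2) Layer-specific weight editor: built on the MEND hypernetwork architecture, it generates weight updates using low-rank projections and layer-specific scale and shift, and modifies only the core up- and down-projection matrices (for gated MLPs, the gate projection is skipped). (3) Disentangled routing: extract the hidden representation before the earliest edited layer, pass it through a lightweight residual projection to obtain a routing representation, and feed that into an embedding head (L2-normalized) and a classification head (BCE loss); the cosine similarity to the edit anchor serves as the routing gate. Training jointly optimizes an autoregressive loss on edit instances and the BCE loss.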
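
Two of the steps above, top-k layer selection and cosine-similarity routing, can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the contribution scores are assumed to be precomputed log-probability differences, and the function names and the 0.8 routing threshold are hypothetical.

```python
import math
from typing import List, Sequence


def select_topk_layers(contributions: Sequence[float], k: int) -> List[int]:
    """Pick the k highest-scoring layers, restricted to the latter half of the network."""
    start = len(contributions) // 2
    candidates = range(start, len(contributions))
    ranked = sorted(candidates, key=lambda i: contributions[i], reverse=True)
    return sorted(ranked[:k])


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def route_to_edited_weights(query: Sequence[float], anchor: Sequence[float],
                            threshold: float = 0.8) -> bool:
    """True: use the edited weights. False: fall back to the original weights."""
    return cosine(query, anchor) >= threshold
```

A query whose routing representation lies close to the edit anchor takes the edited path; everything else keeps the original weights, which is how locality is preserved in this sketch.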

Key Results:

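On several benchmarks (such as MSCKE and VisEdit), LDKE outperforms existing methods on both generalization (logically related queries) and locality (preservation of unrelated knowledge).
The Fast Localization module substantially reduces compute relative to conventional causal tracing while maintaining localization accuracy.
The Disentanglement Classifier prevents the edited weights from affecting knowledge that is nearby in feature space but unrelated, with high routing accuracy.
The framework's generality is validated on recent MLLMs such as Gemma-3 and InternVL-3.5.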

Tech Stack:

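Transformer architecture (MLLM)
FFN-layer contribution estimation (log-probability difference)
MEND hypernetwork (low-rank decomposition, layer-specific scale and shift)
Cosine-similarity routing
BCE loss (binary classification)
L2 normalization
Autoregressive loss (language modeling)
Top-k selection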

Strengths:

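Clear problem definition, with a systematic analysis of the root defects of existing methods (causal misalignment and feature entanglement).
Efficient and practical: fast localization needs only a single forward pass and avoids the high cost of causal tracing.
The Disentanglement Classifier uses representation-level similarity for dynamic routing, balancing generalization and locality.
Thorough experiments across several benchmarks and MLLMs.
Careful handling of gated MLP architectures, which keeps the method broadly compatible.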

Limitations:

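Fast localization relies on the hidden representation of the last token and may not suit every multimodal task (for example, settings that require multi-token outputs).
The Disentanglement Classifier requires additional training, and the routing threshold may need to be set empirically.
The method mainly targets single edits; its extension to sequential (lifelong) editing is not fully discussed.
The benchmarks may not cover all multimodal knowledge-editing scenarios (for example, video and audio modalities).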

Relevance To Keywords:

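Native multimodal large models: the paper directly studies knowledge editing in MLLMs, so it is closely related.
Unified understanding and generation in multimodal large models: knowledge editing touches both how the model understands visual-text facts and how it generates corrected answers, so there is some relation.
Representation learning: the Disentanglement Classifier relies on disentangled hidden representations and cosine similarity, which relates to disentangled representation learning.
World models: knowledge editing can be seen as updating the model's knowledge of facts about the world, an indirect link to knowledge updating in world models.
Post-training: knowledge editing is an efficient form of post-training, and the proposed method is a post-training technique.
Reinforcement learning: not directly involved; the routing mechanism is loosely analogous to policy selection, so relevance is weak.
Model-Based RL: not involved; relevance is low.
Unify Models: model unification is not addressed; relevance is low.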

49. VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation (PASS)

Score: 37.5 / 27.8

Authors: Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao

Published: 2026-05-28

TL;DR: This paper proposes Visual Prefix Guidance (VPG), a training-free inference-time guidance method that improves autoregressive image and video generation quality by contrasting generated and corrupted prefixes to strengthen posterior support, achieving lower FID scores without retraining base models.

摘要翻译

自回归图像和视频生成器在训练时采用教师强制的历史序列，但在推理阶段必须从其自身生成的前缀中进行采样，这使得它们容易受到暴露偏差和前缀漂移的影响。现有的补救措施要么修改训练过程，要么主要应用于采样时指导，主要针对外部语义条件（如类别标签或文本提示），而非检验下一步预测是否为生成的前缀本身提供强大的后验支持。我们提出视觉前缀指导（Visual Prefix Guidance, VPG），这是一种用于自回归图像和视频生成的无需训练的推理时指导方法。VPG 通过对比模型在生成前缀下的输出与在损坏前缀下的输出，然后将 logits 外推至能增强生成前缀后验支持的候选项，从而改进下一步预测。在基于 VAR 的类别条件图像生成、基于 Infinity 的文本到图像生成以及基于 InfinityStar 的文本到视频生成中，VPG 在不重新训练基模型的情况下提高了生成质量，使 VAR 上的 FID 平均降低了 0.36，并在图像和视频生成的基准测试中提升了性能。

Abstract

Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

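Scoring rationale: The paper proposes Visual Prefix Guidance (VPG) for autoregressive image and video generation. MultiModal scores high (8.0) because the work covers text-to-image and text-to-video generation; MLLM scores moderately (5.0) because the base models are multimodal generative architectures; World Models scores 4.0 because generative models relate to world models in a broad sense, although no dynamics or RL is involved; Unify Models scores low (3.0) because the method is general but does not unify model architectures; Visual Encoder scores low (3.0) because it is not a core contribution; Tokenizer and model-based RL score lowest (1.0) because tokenizer design and reinforcement learning are not addressed. No designated experts were found among the authors.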

Keywords

Visual Prefix Guidance, Autoregressive Image Generation, Autoregressive Video Generation, Inference-time Guidance, Exposure Bias, Text-to-Image, Text-to-Video, Posterior Support

In-depth analysis

Chinese Title: VPG：用于自回归图像和视频生成的视觉前缀引导

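Summary: The paper proposes Visual Prefix Guidance (VPG), an inference-time guidance method for autoregressive image and video generation. Autoregressive models are trained with teacher forcing (ground-truth prefixes) but at inference depend on their own generated prefixes, which leads to exposure bias and prefix drift. Existing guidance methods mainly target external conditions (such as class labels or text prompts) and overlook the posterior support of the generated prefix itself. VPG contrasts the model's outputs under the generated prefix and under a corrupted prefix, then extrapolates the logits to strengthen the posterior support of the generated prefix and improve the next-step prediction. The method needs no retraining and applies directly to models such as VAR, Infinity, and InfinityStar. In experiments, VPG lowers FID by 0.36 on average for class-conditional image generation (VAR) and improves benchmark scores for both text-to-image and text-to-video generation.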

Innovations:

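First to propose an inference-time guidance objective that mitigates exposure bias directly by strengthening the posterior support of the generated prefix.
Proposes Visual Prefix Guidance (VPG), a training-free, plug-and-play sampling rule that contrasts the generated prefix with a corrupted prefix.
Designs a corrupted-prefix construction based on same-scale full-embedding replacement, which requires no auxiliary head or retraining.
Demonstrates that VPG is complementary to CFG and that the two can be combined to guide on both the external condition and the generated prefix.
Validates the generality and effectiveness of VPG on several autoregressive visual generators (VAR, Infinity, InfinityStar).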

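Methodology: VPG builds on the next-scale prediction framework of autoregressive visual models. At each prediction step $k$ the model outputs conditional logits $\ell_k(c, r_{<k})$. VPG constructs a corrupted prefix $\tilde{r}_{<k}$ (by randomly replacing the full embeddings at a subset of token positions with same-scale embeddings) and obtains the corrupted-branch logits $\ell_k(c, \tilde{r}_{<k})$. It then extrapolates, $\ell_{\mathrm{VPG}} = \ell_k(c, r_{<k}) + \lambda\,\bigl(\ell_k(c, r_{<k}) - \ell_k(c, \tilde{r}_{<k})\bigr)$, to strengthen the posterior support of the generated prefix. The method can be viewed as log-ratio guidance along the prefix axis, orthogonal to CFG's guidance along the condition axis. Experiments use VAR, Infinity, and InfinityStar as base models on class-conditional, text-to-image, and text-to-video tasks.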
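
The extrapolation rule in the methodology amounts to one line of arithmetic per logit. The sketch below applies it to plain Python lists; the function name is an assumption, and the two logit vectors stand in for the model's outputs under the generated and corrupted prefixes, which are not computed here.

```python
from typing import List, Sequence


def vpg_logits(clean: Sequence[float], corrupted: Sequence[float], lam: float) -> List[float]:
    """Extrapolate away from the corrupted-prefix logits.

    Implements l_vpg = l_clean + lam * (l_clean - l_corrupted) element-wise.
    """
    if len(clean) != len(corrupted):
        raise ValueError("logit vectors must have the same length")
    return [lc + lam * (lc - lx) for lc, lx in zip(clean, corrupted)]


# Candidates that lose probability when the prefix is corrupted are boosted,
# candidates that gain probability are suppressed.
guided = vpg_logits([2.0, 1.0, 0.0], [1.0, 1.0, 1.0], lam=0.5)
print(guided)
```

With `lam = 0` the rule returns the original logits unchanged, so the guidance strength can be tuned continuously from no guidance upward.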

Key Results:

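On VAR models, VPG lowers FID by 0.36 on average, and by 0.63 on VAR-d16.
For text-to-image generation (Infinity), VPG raises the GenEval Overall and DPG-Bench Overall scores.
For text-to-video generation (InfinityStar), VPG raises the VBench Overall score by 0.49, with gains on every sub-score.
Combining VPG with CFG further improves generation quality.
VPG needs no retraining, only one additional forward pass at inference time (the corrupted branch).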

Tech Stack:

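Autoregressive visual models (VAR, Infinity, InfinityStar)
Multi-scale residual tokenization (VQ-VAE, BSQ)
Classifier-free guidance (CFG)
Log-ratio guidance (extrapolation formula)
Same-scale full-embedding replacement (corrupted-prefix construction)
Evaluation metrics: FID, GenEval, DPG-Bench, VBench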

Strengths:

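Simple and effective: plug-and-play, with no training or model changes.
Tackles exposure bias from a new angle (posterior support of the prefix) and complements existing CFG.
Consistent gains across several models and tasks.
The corrupted-prefix construction preserves the scale-conditional statistics.
Inference overhead is bounded (one extra forward pass).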

Limitations:

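Requires one extra forward pass for the corrupted branch, which increases inference time.
Introduces hyperparameters that need per-model tuning, such as the guidance scale λ and the replacement ratio used to build the corrupted prefix.
Validated only on autoregressive visual models; diffusion models and language models are not tested.
Its effect under extreme exposure bias (for example long video generation) is not yet well explored.
The theoretical analysis is shallow and lacks a rigorous proof that posterior support is strengthened.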

Relevance To Keywords:

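Unify Models: VPG is an inference-time guidance method that applies uniformly to several autoregressive visual generators (VAR, Infinity, InfinityStar), a loose link to model unification.
World Models: autoregressive visual generation can be seen as learning a visual world model; VPG improves its autoregressive prediction by strengthening prefix posterior support.
Representation Learning: not directly involved, though better generation quality indirectly improves the use of visual representations.
Model-Based RL: autoregressive generation is loosely analogous to model-based planning, and VPG's guidance resembles a confidence correction in a policy, a potential link to posterior optimization in RL.
Native multimodal large models: VPG applies to text-to-image and text-to-video generation and is an inference-time optimization technique for multimodal large models.
Unified understanding and generation in multimodal large models: VPG focuses on the generation stage, though better prefix support may improve consistency between generation and understanding.
Reinforcement learning: VPG's guidance can be viewed as an implicit value function (posterior support), similar in spirit to guided policies in RL.
Post-training: VPG is an inference-time method that needs no post-training, but it can be combined with post-training techniques.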

50. Native Audio-Visual Alignment for Generation (PASS)

Score: 37.5 / 27.8

Authors: Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He

Published: 2026-05-28

TL;DR: NAVA addresses audio-video generation limitations by establishing native alignment before fusion, achieving superior synchronization and quality with a 6.3B parameter model.

摘要翻译

联合音视频生成旨在合成时间同步且语义一致的视听内容。然而，现有的开源方法主要依赖于具有后验对齐的双塔设计，或在共享空间中混合文本上下文、音频与视频的完全统一三模态设计。前者削弱了细粒度的音视频协同演化，而后者则将语义条件与底层同步耦合。为了解决这些局限性，我们提出了 NAVA（Native Audio-Visual Alignment），一种用于联合音视频生成的原生音视频对齐框架。NAVA 基于上下文条件的原生音视频对齐：它首先在专用交互空间中建立音视频对应关系，随后利用外部上下文来调节联合去噪过程。具体而言，NAVA 采用 Align-then-Fuse MMDiT 架构实现，该架构从模态感知的音视频对齐过渡到模态共享的联合去噪。此外，我们引入 Timbre-in-Context Conditioning，将参考音色提示与相应的语音段关联起来，以实现可控的语音音色。在 Verse-Bench 和 Seed-TTS 上的实验，以及一项用户研究表明，NAVA 仅使用 63 亿参数即可实现卓越的视频质量、精确的音视频同步、具有竞争力的音频质量以及更强的参考音色可控性。

Abstract

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on native audio-visual alignment for generation, which makes MultiModal highly relevant (9.0). It discusses unification strategies (Unify Models, 5.0) and implies visual processing (Visual Encoder, 5.0). Tokenizer, World Models, MLLM, and model-based RL are not core contributions and score low. No expert authors from the target list were found.

Keywords

Native Audio-Visual Alignment, Joint Audio-Video Generation, MMDiT Architecture, Context-Conditioned Denoising, Timbre-in-Context Conditioning, Audio-Visual Synchronization, Multi-Modal Generation

In-depth analysis

Chinese Title: 原生音视频对齐生成框架

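Summary: The paper proposes NAVA (Native Audio-Visual Alignment), a framework for joint audio-video generation. Existing open-source methods mostly use either dual-tower architectures (with posterior alignment) or fully unified tri-modal architectures (which mix context, audio, and video in one space); the former weakens fine-grained audio-video co-evolution, and the latter couples semantic conditioning with low-level synchronization. NAVA decouples context conditioning from audio-video synchronization: it first establishes audio-video correspondence in a dedicated interaction space, then uses external context to condition the joint denoising process. The implementation is an Align-then-Fuse MMDiT architecture with hierarchical alignment layers (modality-aware projections and joint self-attention) and unified fusion layers (shared parameters). The paper also introduces Timbre-in-Context Conditioning, which binds reference timbre cues to the corresponding speech spans for controllable multi-speaker generation. Experiments on Verse-Bench and Seed-TTS, together with a user study, show that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability with only 6.3B parameters.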

Innovations:

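Proposes the NAVA framework, which formalizes joint generation as context-conditioned native audio-visual alignment, models event-level correspondence, and stays compatible with pretrained video-generation backbones.
Designs the Align-then-Fuse MMDiT architecture: modality-aware hierarchical alignment layers first establish audio-video correspondence, and shared unified fusion layers then perform compact joint denoising.
Introduces Timbre-in-Context Conditioning, which binds the reference timbre as context tokens to the corresponding speech spans, giving flexible multi-speaker timbre control without a separate speaker-control branch.
Adopts a progressive multi-task training strategy that trains audio, video, and joint audio-video tasks in stages, and uses structured dropout (random cross-modal attention masking) to enable condition-factorized guidance.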

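Methodology: NAVA uses the Align-then-Fuse MMDiT architecture. Video and audio are encoded into latent tokens by their respective VAEs, and the text context and reference timbre are encoded into condition tokens. The early hierarchical alignment layers use modality-decoupled projections to map audio and video tokens into a shared interaction space, interact across modalities through joint self-attention and FFNs, and apply rate-aware rotary position embeddings to handle the mismatch in token rates. The later unified fusion layers use modality-shared projections and shared Transformer blocks for joint denoising. Context is injected through cross-attention, which keeps the synchronization space dedicated. Training follows a progressive multi-task schedule: first, audio and audio-video data at a 3:1 ratio to initialize the audio pathway; next, high-quality audio and full audio-video data at a 1:2 ratio to improve fidelity; finally, finetuning on curated high-quality audio-video data. Condition-factorized guidance is supported by constructing partially unconditional paths with random cross-modal attention masks.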

Key Results:

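On Verse-Bench and Seed-TTS, NAVA outperforms dual-tower and fully unified baselines in video quality and audio-visual synchronization, with competitive audio quality.
A user study confirms stronger reference-timbre controllability, including multi-speaker dialogue generation.
The model reaches this performance with only 6.3B parameters, which points to the efficiency of the decoupled design.
Ablations validate the Align-then-Fuse architecture and Timbre-in-Context Conditioning.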

Tech Stack:

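MMDiT (multimodal diffusion Transformer)
VAEs (variational autoencoders) for audio and video latent encoding
Rotary position embeddings (RoPE) with rate-aware scaling
Cross-attention and self-attention
Progressive multi-task training
Random cross-modal attention masking (structured dropout)
In-context timbre encoder (extracts timbre tokens)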

Strengths:

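Decouples audio-video synchronization from context conditioning, avoiding the coupling of fully unified architectures while improving on dual-tower posterior alignment.
The Align-then-Fuse design balances modality specificity with joint generation and is stable and efficient.
Timbre-in-Context Conditioning is simple and effective, enabling multi-speaker control without changes to the backbone.
The progressive training strategy makes full use of a pretrained video backbone and lowers training cost.
Reaches SOTA with only 6.3B parameters, so it is parameter-efficient.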

Limitations:

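The paper does not discuss synchronization robustness in complex scenes (for example many objects or fast motion) in detail.
Timbre-in-Context Conditioning relies on speaker boundaries explicitly marked in the text; the dependence on automatic speech recognition or dialogue segmentation is not examined in depth.
Experiments rely mainly on specific benchmarks (Verse-Bench, Seed-TTS), so generalization needs further validation.
There is no direct comparison with commercial systems (such as Seedance or Kling), only with open-source methods.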

Relevance To Keywords:

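Unify Models: NAVA is a joint audio-video generation model that unifies audio and video generation, but it does not unify understanding and generation.
World Models: joint audio-video generation can be seen as simulating aspects of a world model (such as event correspondence and physical-acoustic relations), but the paper does not frame it that way.
Representation Learning: the Align-then-Fuse architecture learns joint audio-video representations, which falls under representation learning.
Model-Based RL: the paper involves neither reinforcement learning nor model-based RL.
Native multimodal large models: NAVA is a natively multimodal (audio-video) generative model but has no text-understanding capability, so it is native multimodality on the generation side.
Unified understanding and generation in multimodal large models: the paper addresses generation only, not understanding tasks.
Post-training: progressive multi-task training can be viewed as a post-training strategy, but no reinforcement learning or human feedback is used.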

51. PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions (PASS)

Score: 36.0 / 27.8

Authors: Omer Benishu, Gal Fiebelman, Sagie Benaim

Published: 2026-05-28

TL;DR: PhyGenHOI generates physically consistent 4D human-object interactions by coupling motion diffusion with physics simulation using 3D Gaussians, outperforming existing baselines.

摘要翻译

本文致力于解决生成物理准确且视觉保真的 4D 人体 - 物体交互（HOI）任务。给定一个表示为 3D 高斯点（3DGS）的静态 3D 人体和目标物体，我们的目标是合成动态场景，其中人体根据给定的输入文本，通过拳击或踢击等动作主动与物体进行交互。为此，我们提出了一种名为 PhyGenHOI 的新颖框架，该框架将生成式人体运动与显式的物理物体模拟相结合。我们将人体建模为由运动扩散模型（MDM）驱动的语义代理，将物体建模为由物质点法（MPM）模拟的物理代理，并利用 3D 高斯作为统一的、可微分的表示。我们通过三个耦合机制来监督它们的交互：（1）窗口吸引力损失（Windowed Attraction Loss），用于在时间上同步生成运动以拦截物体；（2）接触驱动的重模拟（Contact-Driven Re-simulation）步骤，用于在冲击时触发物理一致的动量传递；以及（3）掩码视频 -SDS 目标（Masked Video-SDS），用于注入基于视频的先验以增强接触保真度。实验表明，PhyGenHOI 能够在多样化动作、人体和物体上生成物理一致的 4D HOI，且优于基线方法。项目页面及视频：https://omerbenishu.github.io/PhyGenHOI/

Abstract

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.5/10	9.8
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.5/10	5.2
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on 4D human-object interaction generation using a combination of Motion Diffusion Models and Material Point Method simulation. It is strongly multimodal (text, 3D, video) and unifies generative and physical models (Unify Models). However, tokenizers, visual encoders as core components, MLLM architectures, and reinforcement learning are not its focus, which results in lower scores for those keywords. None of the specified expert authors are listed on the paper.

Keywords

4D Generation, Human-Object Interaction, Motion Diffusion Model, Material Point Method, 3D Gaussian Splatting, Physically-Aware, Dynamic Scenes

In-depth analysis

Chinese Title: PhyGenHOI: 物理感知的动态人-物交互4D生成

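Summary: The paper addresses 4D generation of dynamic human-object interaction (HOI) and proposes the PhyGenHOI framework. Given a static 3D human and a target object, both represented as 3D Gaussian splats, the goal is to generate a dynamic scene that matches a text description, in which the human actively interacts with the object (for example kicking, punching, or pushing). The human is modeled as a semantic agent driven by a Motion Diffusion Model (MDM), and the object as a physical agent simulated with the Material Point Method (MPM). Three mechanisms coordinate the interaction: a Windowed Attraction Loss for spatiotemporal synchronization, Contact-Driven Re-simulation for momentum transfer, and Masked Video-SDS for contact fidelity. Experiments show that the method outperforms existing generative and animation baselines in physical consistency and visual fidelity.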

Innovations:

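Proposes a unified framework that couples generative human motion with explicit physics simulation, using 3D Gaussian splatting as the common representation.
Designs the Windowed Attraction Loss, which analyzes the motion speed profile to determine the contact joint and contact frame automatically and guides the human toward the object.
Introduces contact detection and MPM re-simulation, which produce physically consistent momentum transfer and material deformation on impact.
Proposes Temporally-Masked Video-SDS, which injects video priors only near the contact frame, improving interaction fidelity without disturbing the physical motion.
Combines the Motion Diffusion Model (MDM) with the Material Point Method (MPM) for joint optimization of semantics and physics.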

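Methodology: 3D Gaussian splatting (3DGS) serves as the unified representation: the human is bound to the SMPL parametric model and deformed by linear blend skinning, and the object is mapped to MPM particles. Human motion is optimized from a pretrained MDM through Human Motion Score Distillation (HMSD), and the object trajectory is produced by forward MPM simulation. Coordination works in three steps: (1) a Gaussian-weighted Windowed Attraction Loss pulls the contact joint toward the object near the contact frame; (2) once contact is detected, MPM re-simulation is triggered and the object trajectory is updated; (3) the 4D scene is rendered and a Video-SDS loss is applied, masked so that it acts only near the contact frame. The overall optimization combines HMSD, the attraction loss, and Video-SDS.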
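
One plausible form of a Gaussian-weighted attraction term is sketched below. The exact loss in the paper is not given in this report, so the squared-distance form, the normalized window, and the names `gaussian_window` and `windowed_attraction_loss` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def gaussian_window(num_frames: int, contact_frame: int, sigma: float) -> List[float]:
    """Per-frame weights peaked at the contact frame and normalized to sum to 1."""
    raw = [math.exp(-((t - contact_frame) ** 2) / (2.0 * sigma ** 2))
           for t in range(num_frames)]
    total = sum(raw)
    return [w / total for w in raw]


def windowed_attraction_loss(joint_positions: Sequence[Sequence[float]],
                             object_positions: Sequence[Sequence[float]],
                             contact_frame: int, sigma: float = 1.0) -> float:
    """Weighted squared distance between the contact joint and the object over time."""
    weights = gaussian_window(len(joint_positions), contact_frame, sigma)
    loss = 0.0
    for w, joint, obj in zip(weights, joint_positions, object_positions):
        loss += w * sum((a - b) ** 2 for a, b in zip(joint, obj))
    return loss
```

Because the weights concentrate near the contact frame, the joint is only penalized for being far from the object around the moment of impact and is left free elsewhere in the motion.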

Key Results:

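PhyGenHOI generates physically consistent and visually realistic 4D human-object interaction scenes, such as kicking a ball, pushing a cabinet, and punching.
Compared with a purely generative method (4DFY) and an animation method (AnimateAnyMesh), it removes ghosting and penetration artifacts and produces dynamic object responses.
It outperforms the baselines on text alignment, physical plausibility, contact quality, and visual fidelity.
Ablations confirm the individual contributions of the Windowed Attraction Loss, contact re-simulation, and Video-SDS.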

Tech Stack:

3D Gaussian Splatting (3DGS)
SMPL parametric body model
Motion Diffusion Model (MDM)
Material Point Method (MPM)
Score Distillation Sampling (SDS)
Human Motion Score Distillation (HMSD)
Linear Blend Skinning (LBS)
Video-SDS (Video Score Distillation Sampling)
Windowed Attraction Loss (Gaussian-weighted)
Contact Detection and Re-simulation

Strengths:

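First to combine generative human motion with explicit physics simulation (MPM), yielding causally consistent interactions.
Determines the contact joint and contact frame automatically, with no manual annotation.
Uses 3DGS for efficient rendering and differentiable optimization, and supports novel views.
The Video-SDS masking strategy sharpens contact detail while keeping the physical motion stable.
Validated across a range of actions, humans, and objects, which indicates good generalization.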

Limitations:

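Covers only actions with discrete momentum transfer (kicking, punching, pushing), not continuous contact (such as grasping or hugging).
Depends on pretrained MDM and video diffusion models, so it may be limited by their training distributions.
MPM simulation is computationally expensive, which may rule out real-time use.
The initial object trajectory comes from forward simulation and does not account for any influence of the human motion on the object before contact.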

Relevance To Keywords:

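Unify Models: the paper unifies a generative motion model (MDM) with a physics simulation model (MPM), a form of model unification.
World Models: MPM simulation can be seen as modeling the object's physical world; combined with generative human motion it forms a partial world model.
Representation Learning: 3D Gaussian splatting serves as the unified representation and supports differentiable optimization and rendering.
Model-Based RL: not directly involved, though the physics simulation and optimization loop is loosely analogous to model-predictive control.
Native multimodal large models: generation is text-driven, but no native multimodal large-model architecture is involved.
Unified understanding and generation in multimodal large models: the paper combines text understanding (action descriptions) with 4D scene generation, but does not use a unified multimodal model.
Representation learning: 3DGS and the SMPL parameterization fall under representation learning.
World models: MPM simulates object dynamics and can be seen as a local world model.
Reinforcement learning: not used.
Post-training: the paper uses pretrained MDM and video diffusion models and optimizes through distillation, which can be viewed as post-training-style optimization.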

52. Genetically Aligned Patient Representations Improve Hematological Diagnosis (PASS)

Score: 36.0 / 27.8

Authors: Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr

Published: 2026-05-28

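TL;DR: The study proposes a framework that aligns single white blood cell images with genetic data to improve hematological diagnosis, outperforming slide-level foundation models.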

摘要翻译

组织病理学编码器与转录组和基因组数据的多模态对齐已被证明能显著提高下游诊断任务的性能。血液细胞学具有独特性，因为视觉单细胞评估通常与细胞遗传学和分子遗传学配对用于血液癌症诊断。在本研究中，我们提出了一种框架，用于将单个白细胞图像与染色体畸变（Karyotype）和来自靶向基因面板的 Somatic Mutations 进行对齐。我们的训练策略采用两阶段方法：(i) 在超过 1500 名患者的队列上，使用 iBOT 头对 Transformer Aggregator 进行自监督、仅视觉预训练；(ii) 在急性髓系白血病（Acute Myeloid Leukemia）患者上通过 Supervised Contrastive Loss 进行基因对齐。我们的基因对齐患者编码器改进了血液诊断任务，优于切片级组织病理学 Foundation Models。此外，该模型提供了针对疾病和遗传变异的现成检索能力。将遗传数据纳入患者编码器提高了患者表征的质量，提供了一个与临床诊断工作流程对齐的框架，并为未来的多模态血液学专用 AI 铺平了道路。代码和模型权重可在 https://github.com/marrlab/GenBloom 获取。

Abstract

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

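Scoring rationale: The paper's core is multimodal representation learning over images and genetic data, so MultiModal and Visual Encoder score high. Unify Models is somewhat related because representations from different modalities are aligned. Tokenizer is not a technical focus. World Models, MLLM, and model-based RL are unrelated to the diagnostic task, which involves no reinforcement learning, so they score 0.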

Keywords

Hematological Diagnosis, Multimodal Alignment, Patient Representations, Visual Encoder, Genetic Data, Contrastive Loss, Transformer Aggregator, Chromosomal Aberrations

In-depth analysis

Chinese Title: 遗传对齐的患者表征提升血液学诊断

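Summary: The paper proposes GenBloom, the first framework in hematological diagnosis to align single white blood cell images with genetic data (chromosomal aberrations and somatic mutations). Training has two stages: self-supervised visual pretraining (with an iBOT head) on single-cell images from more than 1,500 patients, followed by genetic alignment through supervised contrastive learning on acute myeloid leukemia (AML) patients. GenBloom extracts cell features with a frozen DinoBloom encoder and produces patient-level representations with a small vision Transformer aggregator. Experiments show that GenBloom outperforms existing pathology foundation models such as GigaPath, PRISM, and TITAN on tasks including AML subtype classification and cross-modal retrieval, and that it offers off-the-shelf retrieval of diseases and genetic alterations. The framework fits clinical diagnostic workflows closely and lays groundwork for future multimodal hematology AI.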

Innovations:

首次在血液学领域实现单细胞图像与遗传数据（核型、体细胞突变）的多模态对齐
提出两阶段训练范式：大规模自监督视觉预训练 + 监督对比遗传对齐
使用iBOT头在患者级细胞嵌入上进行自监督学习，无需原始图像
通过交叉模态对比学习和解码器重建，保持模态特异性信息并防止表征坍塌
提供即用型的跨模态检索能力（图像↔核型、图像↔突变），辅助临床诊断

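Methodology: Training has two stages. In the first stage, a frozen DinoBloom-B extracts single-cell embeddings, and a vision Transformer aggregator (GenBloom) is trained on patient-level sets of cells with the DINOv2/iBOT self-supervised method, using multi-crop subsampling and masked embedding prediction. In the second stage, genetic alignment is performed on AML patients: the karyotype (encoded by CytoGPS as a 1104-dimensional binary vector) and the mutations (a 25-dimensional binary vector) are projected by MLPs into a shared 128-dimensional space; a cross-modal supervised contrastive loss aligns the image, karyotype, and mutation representations; and lightweight decoders reconstruct the original genetic features under a binary cross-entropy loss. Downstream evaluation uses k-NN, logistic regression, and retrieval metrics (mAP, MRR, F1).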
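
The cross-modal supervised contrastive step of the second stage can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `cross_modal_supcon`, the temperature value, and the use of one integer group label per patient to define positives are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cross_modal_supcon(img_emb, gen_emb, labels, temperature=0.1):
    """Image-to-genetics supervised contrastive loss (illustrative sketch).

    img_emb, gen_emb: (n, d) patient embeddings from the two modalities.
    labels: (n,) integer group ids (hypothetical; e.g. a shared genetic class).
    For image anchor i, every genetic embedding j with labels[j] == labels[i]
    counts as a positive; all other genetic embeddings are negatives.
    """
    z_img = l2_normalize(img_emb)
    z_gen = l2_normalize(gen_emb)
    logits = z_img @ z_gen.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    positives = (labels[:, None] == labels[None, :]).astype(float)
    # Each anchor has at least one positive: its own patient in the other modality.
    per_anchor = -(positives * log_prob).sum(axis=1) / positives.sum(axis=1)
    return float(per_anchor.mean())

labels = np.array([0, 0, 1, 1])
img = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
aligned_loss = cross_modal_supcon(img, img.copy(), labels)
swapped_loss = cross_modal_supcon(img, img[::-1].copy(), labels)
print(aligned_loss, swapped_loss)
```

In this toy case the loss is low when patients with the same label sit close together across modalities and high when the genetic embeddings are swapped between groups. A full setup would apply the same idea to each modality pair (image and karyotype, image and mutations) and add the reconstruction term.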

Key Results:

GenBloom在AML-Hehr、APL-AML、AMH三个数据集上的平均性能优于GigaPath、PRISM、TITAN及DinoBloom均值池化基线
在AML-Hehr测试集上，GenBloom-G的跨模态检索（图像↔核型、图像↔突变）的Top-5准确率和MRR显著高于随机基线（p<0.001）
在cAItomorph队列上，GenBloom-G在多个基因（如NPM1、FLT3-ITD、CEBPA等）的跨模态检索F1分数优于随机基线
UMAP可视化显示遗传对齐后的患者表征按遗传亚型形成聚类

Tech Stack:

DinoBloom-B（血液学图像编码器）
Vision Transformer（ViT，6层，12头，嵌入维度768）
DINOv2/iBOT自监督预训练（多裁剪、掩码嵌入预测）
监督对比学习（SupCon损失）
CytoGPS（核型文本转二进制向量）
逻辑回归（lbfgs求解器，C=1）
k-NN（k=5）
余弦相似度检索
二元交叉熵损失（BCE）
指数移动平均（EMA）教师模型

Strengths:

首次将遗传信息融入血液学图像表征学习，与临床诊断流程高度一致
单细胞分辨率，能够捕捉形态与遗传的细粒度关联
两阶段训练有效利用大规模无标签数据和有限配对数据
跨模态检索能力可直接辅助医生进行病例查找和鉴别诊断
在多个公开数据集上全面优于现有病理学基础模型

Limitations:

遗传对齐仅针对AML患者，泛化到其他血液疾病需验证
配对数据规模较小（189例AML），可能限制对齐质量
核型编码采用CytoGPS的二进制表示，可能丢失部分结构信息
模型仅使用外周血涂片，未纳入骨髓涂片等其他模态
计算资源消耗较大（单H100 GPU训练100 epoch）

Relevance To Keywords:

表征学习：GenBloom通过自监督和对比学习学习患者级多模态表征
多模态大模型：框架对齐图像与遗传数据，属于多模态表征学习
世界模型：遗传信息作为潜在因果变量，增强模型对疾病机制的理解
强化学习/后训练：论文未直接涉及，但遗传对齐可视为一种后训练策略

53. DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces? (PASS)

Score: 36.0 / 27.8

Authors: Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

Published: 2026-05-28

TL;DR: DiffSpot benchmark reveals that current VLMs struggle with fine-grained visual difference detection in web interfaces, achieving less than 50% accuracy on subtle changes.

摘要翻译

视觉 - 语言模型（VLMs）在高层图像 - 文本对齐方面取得了显著进展，但其感知细微视觉差异的能力仍然有限。我们在渲染的网页界面中研究这一问题，其中局部视觉变化既是细粒度感知的诊断测试，也是 GUI 代理和设计工具的实际需求。我们引入了 **DiffSpot**，这是一个用于网页界面开放式找不同任务的代码驱动基准测试。DiffSpot 通过修改自包含 HTML 中目标元素的单个 CSS 属性、重新渲染页面并记录更改的属性、元素及修改幅度，来构建受控图像对。一个定位门（grounding gate）仅保留那些渲染像素差异局限于目标元素的图像对。该基准测试包含 4,400 对图像，其中包括 3,900 对有差异对，这些对平衡分布在 13 种 CSS 属性操作符和三个难度层级上，外加 500 对无差异对用于幻觉控制。在零样本评估 13 种前沿 VLMs 时，我们发现即使是最优模型也只能识别出 40.7% 的真实变化，且所有模型在困难层级上的召回率均低于 23%。DiffSpot 进一步表明，难度高度依赖于属性：在 CSS 操作符之间，像素变化幅度或 CLIP 距离均无法可靠地预测召回率。

Abstract

Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only 40.7% of true changes, with Hard-tier Recall below 23% for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper introduces DiffSpot, a benchmark for evaluating Vision-Language Models (MLLMs) on fine-grained visual difference detection in web interfaces. It is highly relevant to MLLM and MultiModal keywords as the core subject is assessing multimodal model performance. Visual Encoder is moderately relevant because VLMs utilize encoders, though the paper focuses on task performance rather than encoder design. Unify Models, Tokenizer, World Models, and model-based RL are not discussed or relevant to this specific benchmark study.
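
The arithmetic behind the scoring table can be checked with a short sketch. The per-keyword weight of 1.5 and the relevance values are copied from the table above, and 27.8 is the threshold shown in this entry's "Score: 36.0 / 27.8" line. The names `weighted_score` and `PASS_THRESHOLD`, and the assumption that an entry passes when its total is at least the threshold, are illustrative rather than taken from the report generator.

```python
WEIGHT = 1.5           # every keyword in the table uses the same weight
PASS_THRESHOLD = 27.8  # threshold shown next to the score in the entry header

# Relevance values (out of 10) from the DiffSpot scoring table.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 5.0,
    "World Models": 0.0,
    "MLLM": 8.0,
    "MultiModal": 8.0,
    "model-based RL": 0.0,
}

def weighted_score(relevance_by_keyword, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance_by_keyword.values())

total = weighted_score(relevance)
# Assumed rule: an entry passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

The total reproduces the 36.0 shown in the entry header, and the same function applied to the other tables in this report gives their listed scores.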

关键词

DiffSpot, VLMs, Fine-grained perception, Web interfaces, Benchmark, CSS mutations, Visual differences

深度分析

Chinese Title: DiffSpot：视觉语言模型能否识别网页界面中的细粒度视觉差异？

Summary: 论文研究视觉语言模型（VLM）在细粒度视觉感知上的能力，聚焦于网页界面中的“找不同”任务。现有VLM在高层次图文对齐上表现良好，但在识别微小视觉变化上仍脆弱。作者提出DiffSpot基准，通过代码驱动方式生成受控图像对：对自包含HTML中目标元素的单个CSS属性进行突变，重新渲染页面，并记录变化属性、元素和幅度。引入接地门（grounding gate）确保渲染后的像素差异仅限于目标元素。基准包含4400对图像（3900有差异对，平衡13个CSS属性操作符和三个难度级别；500无差异对用于幻觉控制）。评估13个前沿VLM零样本，最佳模型仅识别40.7%的真实变化，Hard级别召回率低于23%。难度强烈依赖于属性类型，像素幅度和CLIP距离不能可靠预测召回率。结果表明细粒度视觉差异检测远未解决，且模型对CSS属性级别的感知存在系统性失败。

Innovations:

首个针对网页界面开放式“找不同”的基准DiffSpot，填补了现有VLM细粒度感知评估的空白。
代码驱动的视觉差异生成流水线，通过程序化CSS属性突变和接地门验证，避免了人工标注偏差。
属性级别的诊断发现：VLM性能与CSS属性类型强相关，而非像素幅度或CLIP距离，揭示了模型感知的局限性。
引入无差异对（500对）用于幻觉控制，能够评估模型的敏感度与约束权衡。
平衡的难度分层设计（Easy/Medium/Hard），支持对细粒度感知能力的精细诊断。

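Methodology: The benchmark is built with a five-stage pipeline. (1) Source corpus curation: pages are crawled from 2M domains, rendered, regenerated by an LLM as self-contained HTML, and filtered by CLIP similarity. (2) Programmatic mutation: 13 CSS-property operators are defined (covering fonts, colors, layout, and so on); each operator applies its mutation by swapping a Tailwind class or overriding an inline style, and is split into three difficulty tiers that differ in parameter magnitude. (3) Grounding gate: using the target element's rendered bounding box, each pair is checked for validity (the pixel change inside the box is nonzero), locality (no change outside the box), and selector resolution. (4) Refinement and filtering: an LLM generates the natural-language answer, and pairs with residual quality problems are removed. (5) Stratified sampling: 100 pairs are drawn per operator and difficulty tier (3,900 has-diff pairs in total), plus 500 no-diff pairs. The 13 VLMs are evaluated zero-shot, with Recall and Precision as metrics.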
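
The validity and locality checks of the grounding gate can be sketched on two rendered screenshots held as NumPy arrays. This is a minimal illustration under assumptions, not the benchmark's code: the function name `grounding_gate`, the `(x0, y0, x1, y1)` box convention with exclusive upper bounds, and exact pixel equality as the change test are all hypothetical. The selector-resolution check needs a browser and is left out.

```python
import numpy as np

def grounding_gate(before, after, bbox):
    """Return True when the pixel change is nonzero inside bbox and absent outside.

    before, after: H x W x C uint8 screenshots of the original and mutated page.
    bbox: (x0, y0, x1, y1) of the target element, upper bounds exclusive.
    """
    if before.shape != after.shape:
        return False  # a layout shift changed the page size; reject the pair
    changed = np.any(before != after, axis=-1)  # H x W boolean change map
    x0, y0, x1, y1 = bbox
    inside = np.zeros(changed.shape, dtype=bool)
    inside[y0:y1, x0:x1] = True
    valid = bool(changed[inside].any())       # something changed on the target
    local = not bool(changed[~inside].any())  # nothing changed elsewhere
    return valid and local

base = np.zeros((8, 8, 3), dtype=np.uint8)
mutated = base.copy()
mutated[2:4, 2:4] = 255          # change confined to the target element
leaky = mutated.copy()
leaky[7, 7] = 255                # an extra change outside the target

print(grounding_gate(base, mutated, (2, 2, 4, 4)))
print(grounding_gate(base, leaky, (2, 2, 4, 4)))
print(grounding_gate(base, base, (2, 2, 4, 4)))
```

A real pipeline would likely tolerate anti-aliasing noise with a small threshold instead of exact equality; that choice is not described in the summary above.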

Key Results:

最佳模型（Claude 4.5 Sonnet）整体Recall仅40.7%，Hard级别Recall低于23%。
性能在CSS操作符间差异远大于源域间差异，表明“什么变了”比“在哪里变”更重要。
像素变化幅度和CLIP图像距离不能可靠预测Recall，模型难以感知和命名CSS级视觉属性。
无差异对测试显示模型存在敏感度-约束权衡，部分模型在无变化时产生幻觉。
所有模型在Hard难度下Recall均低于23%，细粒度差异检测远未解决。

Tech Stack:

Playwright（headless Chromium浏览器驱动）
Tailwind CSS（用于类切换突变）
LLM：gpt-oss-120b（用于HTML再生和自然语言描述生成）
CLIP模型（用于原始与再生渲染的相似度过滤）
Chrome User Experience Report和Majestic Top-1M（域名来源）
像素差异计算（边界框内/外）
Recall和Precision统计指标

Strengths:

代码驱动生成避免了人工标注偏差，确保差异的原子性和局部性。
平衡覆盖13个CSS属性和三个难度级别，支持细粒度诊断。
接地门机制严格保证代码到像素的对应，提高数据质量。
包含无差异对控制幻觉，评估更全面。
评估多个前沿VLM，揭示重要失败模式，对后续研究有指导意义。

Limitations:

仅针对网页界面，泛化到自然图像或其他领域未知。
仅考虑单个CSS属性突变，复合变化或布局重排等更复杂情况未覆盖。
难度分层基于参数幅度，但感知难度可能受上下文影响，不完全等价。
零样本评估，未探索微调或提示工程对性能的提升。
数据集规模相对较小（4400对），可能影响统计稳定性。
渲染引擎差异（如不同浏览器）可能影响结果可重复性。

Relevance To Keywords:

原生多模态大模型：论文评估了多个原生多模态大模型（如GPT-4o、Claude、Gemini等），直接相关。
多模态大模型的理解和生成一体化：论文聚焦于VLM的视觉理解能力（细粒度差异检测），未涉及生成一体化，但理解是基础。
表征学习：VLM的表征学习能力在细粒度感知上表现不足，论文揭示了表征的局限性，与表征学习相关。
世界模型：不直接相关，但细粒度感知是世界模型构建的基础之一。
强化学习：不直接相关。
后训练：论文评估零样本性能，未涉及后训练，但结果暗示后训练可能改善细粒度感知，间接相关。

54. When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop (PASS)

Score: 34.5 / 27.8

Authors: Yang Zhang, Xiukun Wei, Xueru Zhang

Published: 2026-05-28

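TL;DR: This paper examines how human curation affects preference alignment in multi-model self-consuming training, and finds that cross-model interactions can dampen or even invert the alignment effect, degrading long-term alignment.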

摘要翻译

基础模型 (Foundation models) 越来越多地使用先前模型迭代生成的合成数据进行训练，而不仅仅是使用真实数据。这种自消耗 (self-consuming) 训练范式可能导致模型崩溃 (model collapse)、发散 (divergence) 或偏差放大 (bias amplification)。最近的研究 (Ferbach et al., 2024) 表明，将人工策展 (human curation) 纳入循环可以将自消耗模型引导至与人类对齐的行为 (human-aligned behavior)，但这些分析仅关注一个仅消耗自身输出的单一孤立模型 (single, isolated model)。然而，在实践中，模型通常交互并在其他模型产生的输入 - 输出对 (input-output pairs) 上训练。本文研究了多模型环境 (multi-model regime) 下的自消耗训练。我们首先形式化了一个交互自消耗模型的框架，并刻画了所得动力系统 (dynamical system) 何时收敛至稳定点 (stable point)。然后，我们考察一个模型的人工策展如何影响其自身的对齐 (alignment) (自我影响 (self-influence))，以及这种效应如何传播到其他模型 (交叉影响 (cross-influence))。与孤立设置 (isolated settings) 中人工策展总是增强模型对齐不同，我们表明跨模型交互 (cross-model interactions) 可以减弱甚至逆转这一效应，最终退化长期对齐 (long-term alignment)。

Abstract

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	7.0/10	10.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	3.0/10	4.5

评分理由: 论文核心研究多模型自我消耗循环中的偏好对齐问题，与'Unify Models'（多模型交互系统）和'World Models'（自我消耗动态稳定性）高度相关。内容未涉及具体视觉编码器、分词器或强化学习算法细节，故相关度为 0。'MLLM'和'MultiModal'因属于大模型通用背景给予中等评分。

关键词

Self-consuming training, Multi-model regime, Human curation, Preference alignment, Cross-influence, Foundation models, Model collapse

深度分析

Chinese Title: 人类策展何时以及如何适得其反：多模型自消耗循环下的偏好对齐

Summary: 本文研究多模型自消耗训练循环中人类策展对模型对齐的影响。基础模型越来越多地使用合成数据训练，可能导致模型崩溃或偏差放大。现有工作仅关注单模型场景，而实际中模型间存在交互。作者形式化了多模型自消耗系统的框架，分析了系统收敛到稳定点的条件。通过局部敏感性分析，量化了人类策展对自身模型（自影响）和其他模型（交叉影响）长期对齐的影响。关键发现：与单模型不同，在多模型交互下，增加人类策展并不总是改善对齐，反而可能因交叉影响而减弱甚至反转效果，导致长期对齐退化。实验验证了理论结果。

Innovations:

形式化了交互式多模型自消耗系统的通用框架，涵盖同步和异步更新，无需简化分布假设。
严格分析了多模型系统收敛到稳定点的条件，揭示了真实数据比例对稳定性的作用。
首次量化了人类策展在自消耗循环中的自影响和交叉影响，识别出策展可能适得其反的条件。
通过真实实验验证了人类策展效果的非单调性，与理论预测一致。

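Methodology: The paper combines theoretical analysis with experimental validation. It first gives a mathematical formalization of the multi-model self-consuming system, defining the sources of training data (real data, self-generated synthetic data, data generated by other models, and human-curated data) and their mixture weights. Assuming a strongly convex and smooth loss, a bounded data space, and Lipschitz-continuous models and reward functions, it derives the conditions under which the system converges. A local sensitivity analysis (differentiating with respect to the mixture weights) then decomposes the effect of curation into self-influence and cross-influence. The experiments validate the theory on real models (such as LLMs and diffusion models).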
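
This analysis lists a Bradley-Terry preference model among the paper's tools. As a hedged illustration of how such a model can stand in for a human curator, the sketch below picks one of two candidate samples with probability given by the Bradley-Terry formula. The function names, the reward function, and the pairwise selection procedure are assumptions for illustration; the paper's actual curation operator may differ.

```python
import math
import random

def bradley_terry_prob(reward_a, reward_b):
    """P(a is preferred over b) = exp(r_a) / (exp(r_a) + exp(r_b)) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def curate_pair(sample_a, sample_b, reward, rng):
    """Keep one of two candidates, chosen by a simulated Bradley-Terry curator."""
    p_a = bradley_terry_prob(reward(sample_a), reward(sample_b))
    return sample_a if rng.random() < p_a else sample_b

# Toy check: samples are numbers and the (hypothetical) reward is the value itself.
rng = random.Random(0)
kept = [curate_pair(2.0, 0.0, lambda s: s, rng) for _ in range(2000)]
high_fraction = sum(1 for s in kept if s == 2.0) / len(kept)
print(bradley_terry_prob(2.0, 0.0), high_fraction)
```

With a reward gap of 2, the curator keeps the higher-reward sample most of the time but not always, which is the soft filtering behavior a preference model provides. In the multi-model setting described above, the curated samples from one model would then enter the training mixture of the others.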

Key Results:

多模型自消耗系统在强凸、光滑等条件下存在稳定点并收敛。
人类策展对自身模型对齐的影响（自影响）为正，但通过跨模型交互产生的交叉影响可能为负，且幅度可能超过自影响。
当交叉影响足够强时，增加人类策展反而降低长期对齐，即策展适得其反。
真实数据比例越高，系统越稳定，但人类策展的非单调效应依然存在。
实验表明，在图像-文本多模态交互中，提高一个模型的策展比例可能导致另一个模型的对齐下降。

Tech Stack:

数学方法：强凸性、光滑性、Lipschitz连续性、不动点定理、局部敏感性分析（导数）、Bradley-Terry偏好模型
算法：同步/异步迭代更新、梯度下降（隐含）、混合数据采样
工具：概率分布建模、期望风险最小化

Strengths:

填补了多模型自消耗循环中人类策展影响的理论空白，具有现实意义。
框架通用，可涵盖多种跨模型交互（生成-生成、生成-判别、跨模态）。
理论结果清晰，分解自影响和交叉影响，可解释性强。
实验验证了理论预测，增强了可信度。

Limitations:

理论分析依赖于强凸性和光滑性等假设，实际模型可能不完全满足。
仅考虑两个模型交互，未扩展到更多模型或复杂网络。
未考虑人类策展的噪声或对抗性情况（论文提及但未深入）。
实验部分可能仅针对特定模态和任务，泛化性需进一步验证。

Relevance To Keywords:

Unify Models: 论文研究多模型交互，与统一模型思想相关。
World Models: 自消耗循环涉及模型生成数据作为训练数据，与世界模型中的自监督学习相关。
Representation Learning: 模型对齐涉及表征学习，但论文更侧重偏好对齐。
Model-Based RL: 论文使用迭代训练和奖励函数，与基于模型的强化学习有方法论联系。
原生多模态大模型: 论文实验涉及图像-文本多模态交互，与多模态大模型相关。
多模态大模型的理解和生成一体化: 论文框架涵盖生成和判别模型交互，与一体化相关。
表征学习: 间接相关，模型参数更新隐含表征学习。
世界模型: 自消耗循环可视为世界模型中的自我改进。
强化学习: 使用Bradley-Terry模型和奖励函数，与RLHF相关。
后训练: 人类策展属于后训练阶段的对齐技术。

55. STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments (PASS)

Score: 34.5 / 27.8

Authors: Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang

Published: 2026-05-28

TL;DR: STAMP trains explicit memory for mobile GUI agents using controllable virtual environments to overcome context window limitations, achieving state-of-the-art performance on the Memory-World benchmark.

摘要翻译

移动 GUI 智能体擅长即时反应控制，但在需要记忆的长周期现实任务中却经常失败。这种失败源于有限的上下文窗口与占用大量 token 的截图之间的根本冲突。为了节省有限的上下文，智能体必须逐步丢弃早期的视觉历史，从而永久丢失关键的瞬态信息。此外，现有的以动作为中心的数据集未能教导智能体何时或何事需要显式记忆，而扩充静态现实世界数据不仅成本过高，还缺乏交互式验证。为了解决这一问题，我们提出了 STAMP 框架，该框架通过可控虚拟环境训练移动智能体的显式记忆能力，其中确定性记忆变量被程序化地注入合成任务中，以控制必须记忆的内容、编码时机以及后续检索时机，从而大规模生成可验证的监督数据，并通过环境驱动的奖励反馈实现在线强化学习。在我们新引入的 Memory-World 基准测试上评估，所得的 Stamp-GUI 智能体在 GUI 专用模型中实现了最先进的性能，并在 Memory-World 基准测试上设定了新高，展现出卓越的记忆准确性和任务鲁棒性，同时保持了强大的通用移动导航能力。

Abstract

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	3.0/10	4.5

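Scoring rationale: The paper's core contribution is STAMP, a framework for training explicit memory in mobile GUI agents, together with its controllable virtual environments. Keyword match is as follows. MultiModal and MLLM are moderately relevant, since GUI agents are inherently multimodal and are usually built on MLLM architectures. Visual Encoder, Tokenizer, and model-based RL have low relevance: they appear only as technical components or problem background, not as core contributions. Unify Models and World Models have the lowest relevance: the paper involves neither model unification nor a world-model architecture, and only the benchmark name contains "World". None of the designated expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found. The weighted total is 34.5, above the dynamic passing score of 27.8.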

关键词

Mobile GUI Agents, Explicit Memory, Controllable Virtual Environments, Memory-World Benchmark, Context Window Limitation, Reinforcement Learning, Long-horizon Tasks

深度分析

Chinese Title: STAMP：在可控且可扩展的虚拟环境中训练移动GUI代理的显式记忆

Summary: 本文针对移动GUI代理在长程任务中因上下文窗口有限而丢失关键视觉信息的问题，提出STAMP框架。该框架利用可控虚拟环境自动合成记忆密集型任务，通过程序化注入确定性记忆变量（何时记忆、何时检索）生成可验证的监督数据，并支持在线强化学习。训练得到的Stamp-GUI代理在Memory-World基准上达到最优性能，显著提升记忆准确性和任务鲁棒性，同时保持强泛化导航能力。方法包括虚拟环境数据构建、步骤级平衡监督微调以及基于环境奖励的在线强化学习。

Innovations:

提出STAMP框架，利用可控虚拟环境自动生成可验证的显式记忆训练数据，解决真实数据标注昂贵且缺乏交互验证的问题。
设计程序化记忆注入机制，精确控制记忆内容、编码时机和检索时机，实现自动化的步骤级记忆监督。
引入在线强化学习，通过环境驱动的任务奖励和记忆完整性奖励，优化代理的记忆编码行为。
构建Memory-World基准和记忆适配的现有基准，专门评估移动代理的记忆能力。
训练出Stamp-GUI代理，在记忆密集型任务上达到SOTA，同时保持通用导航能力。

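Methodology: The paper uses a three-stage approach. In the first stage, virtual environments are synthesized automatically from seed configurations, memory dependencies are injected programmatically, and successful trajectories are collected. In the second stage, the trajectories are post-processed to extract step-level memory supervision and to synthesize reasoning targets, with action and memory critics filtering out noise. In the third stage, step-balanced supervised fine-tuning (which up-weights memory steps) is followed by online reinforcement learning in the virtual environments, with a reward made up of a task-completion reward, a format reward, and a trajectory-level memory-completeness reward.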

Key Results:

Stamp-GUI在Memory-World基准上达到SOTA，任务成功准确率（T-Acc）和记忆准确率（M-Acc）均显著优于现有GUI专用模型。
在AndroidWorld-M和MemGUI-Bench的L1/L2/L3难度协议下，Stamp-GUI均表现最佳，尤其在记忆密集型任务中优势明显。
消融实验表明，在线强化学习进一步提升了记忆准确性和任务成功率，步骤平衡监督微调有效缓解了记忆步骤稀疏问题。
与仅使用文本压缩或对话历史的基线相比，Stamp-GUI避免了关键信息丢失导致的灾难性失败。

Tech Stack:

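Virtual environment synthesis: mobile interfaces are generated automatically from seed configurations (style, layout density, semantic content).
Programmatic memory injection: deterministic variables control the memory content, the step at which it appears, and the step at which it is retrieved.
Step-level balanced supervised fine-tuning: memory steps receive the weight w_bal_t = n (where n is the number of memory steps) in the cross-entropy loss.
Online reinforcement learning: uses a task reward R_task (success or failure), a format reward R_format (whether the action format is correct), and a memory reward R_mem (a trajectory-level memory-completeness score).
Memory-completeness score: a judge model rates how well the predicted memory trajectory matches the reference memory trajectory, on three levels: complete, partial, or insufficient.
Action and memory critics: used to filter noisy labels.
Evaluation benchmarks: Memory-World (built by the authors), AndroidWorld-M, and MemGUI-Bench.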
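
One way the three reward terms listed above (R_task, R_format, R_mem) could be combined into a single trajectory reward is sketched below. Everything numeric here is an assumption: the analysis does not give the combination rule, the term weights, or the values attached to the three memory-completeness levels, so `RewardWeights` and `MEMORY_LEVEL_SCORE` are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical numeric values for the judge's three completeness levels.
MEMORY_LEVEL_SCORE = {"complete": 1.0, "partial": 0.5, "insufficient": 0.0}

@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical mixing weights for the three reward terms."""
    task: float = 1.0
    fmt: float = 0.1
    mem: float = 0.5

def trajectory_reward(task_success: bool, format_ok: bool, memory_level: str,
                      weights: Optional[RewardWeights] = None) -> float:
    """Weighted sum of task, format, and memory-completeness rewards."""
    w = weights or RewardWeights()
    r_task = 1.0 if task_success else 0.0
    r_format = 1.0 if format_ok else 0.0
    r_mem = MEMORY_LEVEL_SCORE[memory_level]  # KeyError on an unknown level
    return w.task * r_task + w.fmt * r_format + w.mem * r_mem

for level in ("complete", "partial", "insufficient"):
    print(level, trajectory_reward(True, True, level))
```

A weighted sum is only the simplest option; a gated form (for example, granting the memory reward only when the format is valid) would be an equally plausible reading of the description.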

Strengths:

创新性地利用可控虚拟环境解决记忆训练数据稀缺问题，实现自动化、可验证的数据生成。
将记忆作为显式输出融入动作预测循环，从根本上解决视觉信息丢失问题。
结合监督微调和在线强化学习，既利用离线数据又通过环境交互优化记忆策略。
提供专门的记忆评估基准，便于后续研究。
实验设计严谨，多难度协议有效分离记忆能力与通用导航能力。

Limitations:

虚拟环境与真实移动界面存在域差异，可能影响迁移效果。
记忆完整性评分依赖评判模型，可能存在主观偏差。
当前框架主要针对单步记忆（信息出现后立即记忆），对需要跨多步累积或推理的记忆场景未充分探索。
训练成本较高，需要大规模虚拟环境交互和在线强化学习。
未讨论记忆容量限制或记忆冲突（如多个相似信息）的处理。

Relevance To Keywords:

Unify Models: 论文未直接涉及统一模型，但Stamp-GUI作为端到端多模态代理，可视为统一感知与记忆的尝试。
World Models: 论文使用可控虚拟环境作为世界模型，通过环境合成和交互提供训练信号，与世界模型思想高度相关。
Representation Learning: 论文通过记忆预测任务迫使模型学习关键信息的表征，但未深入探讨表征学习机制。
Model-Based RL: 论文的在线强化学习基于虚拟环境（可视为模型），但未显式构建环境模型，更接近model-free RL。
原生多模态大模型: Stamp-GUI基于多模态大模型（如GUI-Owl系列），训练显式记忆能力，属于原生多模态大模型的应用。
多模态大模型的理解和生成一体化: 论文中模型同时输出推理内容、动作和记忆，体现理解与生成一体化。
表征学习: 记忆内容可视为对视觉信息的压缩表征，但论文未专门研究表征学习。
强化学习: 在线强化学习是核心训练方法之一。
后训练: 论文的监督微调和强化学习属于后训练阶段。

56. DVSM: Decoder-only View Synthesis Model Done Right (PASS)

Score: 34.5 / 27.8

Authors: Cheng Sun, Jaesung Choe, Min-Hung Chen, Ryo Hachiuma, Yu-Chiang Frank Wang

Published: 2026-05-28

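TL;DR: DVSM adopts a decoder-only architecture with an implicit KV-cache representation that unifies scene reconstruction and rendering, reaching state-of-the-art novel-view synthesis with fewer parameters.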

摘要翻译

近期的大视图合成模型（LVSMs）主张采用编码器 - 解码器架构，将场景重建与渲染过程分离至独立的网络中。我们重新审视了这一设计。通过受控实验，我们表明一种仅解码器架构（将场景隐式表示为 KV-cache）在渲染复杂度保持一致的情况下，优于编码器 - 解码器变体，且参数量更少。进一步分析表明，在颜色输入重建网络与仅相机渲染网络之间共享权重，能更好地对齐二者在相同视角下的特征，从而促进图像合成。基于此发现，我们的模型（命名为 DVSM）进一步融合了基础模型先验与分阶段块大小策略，以优化效率与质量的权衡。我们的结果在多个基准上确立了新视图合成的最先进技术，在某些情况下，甚至在密集输入视图下优于每场景优化的 3DGS。

Abstract

Recent Large View Synthesis Models (LVSMs) advocate an encoder-decoder architecture that separates reconstruction and rendering into distinct networks. We re-examine this design. Through controlled experiments, we show that a decoder-only architecture, which represents scenes implicitly as a KV-cache, outperforms encoder-decoder variants while using fewer parameters at identical rendering complexity. Further analysis shows that sharing weights between the color-input reconstruction network and the camera-only rendering network better aligns their features at the same viewpoint, facilitating image synthesis. Building on this finding, our model, dubbed DVSM, further incorporates foundation model priors and stage-wise patch sizing for an improved efficiency-quality tradeoff. Our results establish a new state of the art for novel-view synthesis across multiple benchmarks, in some cases even outperforming per-scene-optimized 3DGS under dense input views.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

评分理由: 论文提出仅解码器架构统一重建与渲染过程，与 Unify Models 高度相关（8 分）；摒弃传统编码器，故 Visual Encoder 相关性低（1 分）；隐式 KV-cache 场景表示类同世界模型（5 分）；架构类似大模型且用基础模型先验，与 MLLM 风格相似（6 分）；但任务纯视觉无多模态交互（MultiModal: 2 分），无 Tokenizer 明确提及（1 分），且与强化学习无关（0 分）。作者列表未包含指定专家，无额外加分。

关键词

Decoder-only, View Synthesis, KV-cache, Foundation Models, Implicit Representation, Novel View Synthesis, Unified Architecture

深度分析

Chinese Title: DVSM：正确实现的仅解码器视图合成模型

Summary: 本文提出DVSM（Decoder-only View Synthesis Model），一种仅解码器架构的新视图合成模型。针对现有大视图合成模型（LVSM）采用编码器-解码器分离重建与渲染的设计，作者通过控制实验证明，仅解码器架构（将场景隐式表示为KV缓存）在相同渲染复杂度下使用更少参数却取得更优性能。核心创新在于强制重建阶段（构建KV缓存）与渲染阶段（查询缓存）之间完全共享权重，这一设计使得特征对齐更好，类似于经典可微渲染方法（如NeRF、3DGS）中重建与渲染共享同一网络的思想。在此基础上，DVSM进一步引入预训练基础模型先验和分阶段补丁大小策略，提升效率-质量权衡。在多个基准数据集上，DVSM达到新SOTA，甚至在密集输入视图下超越每场景优化的3DGS。论文从架构设计、特征空间分析、类比经典方法等多角度论证了仅解码器架构的优越性。

Innovations:

提出完全权重共享的仅解码器架构，将场景隐式编码为KV缓存，在相同渲染复杂度下参数减半且质量更优。
通过控制实验和特征空间分析，揭示重建与渲染阶段权重共享的必要性，并类比经典可微渲染方法提供理论依据。
引入预训练基础模型（如DINOv2）先验，提升模型泛化能力。
提出分阶段补丁大小策略，在不同阶段使用不同补丁大小，优化计算与质量权衡。

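Methodology: The model is a Transformer decoder. In input processing, images and camera parameters are encoded into context tokens and query tokens respectively. In the reconstruction stage, all context tokens pass through the decoder, and the key-value pairs of every cross-attention layer are cached (the KV-cache) as the implicit scene representation. In the rendering stage, query tokens pass through the same decoder and retrieve from the cached KV via cross-attention. Weights are fully shared, including the input embedding layers. Training minimizes a photometric loss and a perceptual loss. The model further integrates a pretrained foundation model (such as DINOv2) as a feature extractor and uses different patch sizes in different stages (for example, smaller patches for reconstruction and larger patches for rendering) to balance efficiency and quality.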
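
The reconstruct-once, render-many pattern described above can be illustrated with a single attention layer in NumPy. This is a toy sketch under assumptions, not DVSM itself: the class `SharedAttentionLayer` is hypothetical, it has one head and one layer, and it omits QK-normalization, residual connections, and the feed-forward block. What it does show is that the same projection weights build the cache from context tokens and later serve the query tokens.

```python
import numpy as np

class SharedAttentionLayer:
    """One attention layer whose weights are shared by reconstruction and rendering."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))

    def build_cache(self, context_tokens):
        """Reconstruction: project context tokens once and keep the keys and values."""
        return context_tokens @ self.w_k, context_tokens @ self.w_v

    def render(self, query_tokens, cache):
        """Rendering: query tokens attend to the cached keys and values."""
        keys, values = cache
        q = query_tokens @ self.w_q
        scores = q @ keys.T / np.sqrt(q.shape[-1])
        scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
        attn = np.exp(scores)
        attn = attn / attn.sum(axis=-1, keepdims=True)
        return attn @ values

rng = np.random.default_rng(1)
layer = SharedAttentionLayer(dim=8)
context = rng.normal(size=(5, 8))      # tokens from the input views
cache = layer.build_cache(context)     # computed once per scene
view_a = layer.render(rng.normal(size=(3, 8)), cache)  # first target view
view_b = layer.render(rng.normal(size=(4, 8)), cache)  # second view reuses the cache
print(view_a.shape, view_b.shape)
```

Because the cache is built once, the cost of rendering an extra view depends only on the number of query tokens and the cache size, not on re-encoding the input views.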

Key Results:

在多个新视图合成基准（如DTU、Mip-NeRF 360、Blender等）上达到SOTA，部分指标超越每场景优化的3DGS。
与编码器-解码器变体相比，DVSM在相同渲染复杂度下参数减少约50%，且PSNR、SSIM等指标更高。
权重共享设计显著优于权重解耦设计，特征空间分析显示共享权重使重建与渲染阶段的交叉注意力特征更一致。
引入基础模型先验和分阶段补丁大小带来正交的性能提升，进一步改善效率-质量权衡。

Tech Stack:

Transformer解码器（ViT架构）
KV-cache（键值缓存）
Plücker坐标编码相机射线
线性补丁嵌入（patch embedding）
QK-normalization（查询-键归一化）
残差连接与层归一化
光度损失（L1/L2）与感知损失（LPIPS）
预训练基础模型（如DINOv2）
分阶段补丁大小策略
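
The Plücker ray encoding named in the list above has a standard closed form: a ray with origin o and unit direction d is represented by the six numbers (d, o × d). The sketch below computes it with NumPy. The function name `plucker_rays` and the ordering of direction before moment are assumptions; some implementations store the moment first, and the summary does not say which convention DVSM uses.

```python
import numpy as np

def plucker_rays(origins, directions):
    """Encode rays as 6-D Plücker coordinates (unit direction, moment = origin x direction).

    origins, directions: arrays of shape (..., 3). Directions need not be unit length.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    moment = np.cross(origins, d)
    return np.concatenate([d, moment], axis=-1)

origins = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 5.0, 0.0]])
directions = np.array([[0.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
rays = plucker_rays(origins, directions)
print(rays)
```

The second and third rays lie on the same line (the third origin is the second one moved along the direction), and they receive the same encoding. That invariance to the choice of point on the ray is the usual reason for preferring Plücker coordinates over a raw origin-direction pair.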

Strengths:

架构简洁高效，仅解码器设计避免了编码器-解码器的冗余参数。
权重共享设计具有理论依据，与经典可微渲染方法一脉相承，可解释性强。
通过控制实验和特征可视化提供了充分的实证支持。
集成基础模型先验和分阶段补丁策略，进一步提升了实用性和性能。
在多个基准上取得SOTA，甚至超越每场景优化方法，展示了泛化能力。

Limitations:

仅适用于校准相机输入的场景，未处理未知相机几何设置。
作为回归模型，无法生成未观测区域的纹理（需要生成模型补充）。
分阶段补丁大小策略可能增加工程实现复杂度。
实验主要基于合成数据和有限真实场景，大规模真实场景泛化性有待验证。
未与最新生成式视图合成模型（如基于扩散的模型）进行直接比较。

Relevance To Keywords:

Unify Models: 论文提出的仅解码器架构统一了重建与渲染过程，体现了模型一体化思想，与“Unify Models”相关。
World Models: 新视图合成是构建世界模型的关键能力之一，DVSM通过隐式场景表示（KV-cache）学习世界状态，与世界模型概念相关。
Representation Learning: 论文将场景隐式表示为KV-cache，属于隐式表征学习，且通过权重共享促进特征对齐，与表征学习高度相关。
Model-Based RL: 虽然论文未直接涉及强化学习，但其场景表示和渲染能力可用于基于模型的强化学习中的环境建模，有一定间接关联。
原生多模态大模型: 论文使用Transformer处理图像和相机参数，属于多模态输入，但未涉及文本等其他模态，相关性中等。
多模态大模型的理解和生成一体化: 论文的模型同时进行场景理解（重建）和图像生成（渲染），体现理解与生成一体化，相关。
表征学习: 同上，隐式表征学习是核心。
世界模型: 同上。
强化学习: 相关性较弱，但可视为世界模型组件。
后训练: 论文未涉及后训练策略，相关性低。

57. DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding (PASS)

Score: 34.5 / 27.8

Authors: Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

Published: 2026-05-28

TL;DR: DGSG-Mind proposes a dynamic 3D Gaussian scene graph system for long-term embodied scene understanding and grounding, achieving superior zero-shot 3D visual grounding and semantic segmentation performance on real-world robots.

摘要翻译

将开放词汇语义信息融入动态 3D 场景表示对于实现长期具身场景理解至关重要。然而，现有方法常因跨视图线索不完整而导致实例关联脆弱，且其处理对象级拓扑变化的能力有限，限制了长期机器人任务的执行。此外，当前的 3D 场景理解方法要么依赖缺乏显式空间推理的简单特征匹配，要么假设存在离线的真实 3D 几何。为应对这些挑战，我们提出 DGSG-Mind，这是一个混合实例感知的 3D Gaussian (3D 高斯) 动态场景图系统，配备具身推理代理。该系统将概率体素网格与显式 3D Gaussian 相结合，以实现稳健的跨模态实例融合及增量语义映射。系统通过基于高斯的视觉重定位以及受几何 - 语义一致性引导的局部掩码细化来处理动态变化。基于实例高斯图，DGSG-Mind 进一步构建层次化场景图，并开发了 3D Gaussian Mind (3D 高斯心智)，该模块整合了结构关系、空间 - 语义信息以及视觉标注的 RoI (感兴趣区域) 高斯渲染，以支持多模态推理。广泛实验表明，DGSG-Mind 在基于自重建地图的方法中实现了最佳的零样本 3D 视觉 - 语言定位 (3DVG) 性能，同时在 3D 开放词汇语义分割和场景重建方面也表现出卓越的性能。我们进一步将 DGSG-Mind 部署于真实机器人上，以展示其目标导向推理能力及动态更新能力。DGSG-Mind 的项目页面见：https://icr-lab.github.io/DGSG-Mind

Abstract

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	3.0/10	4.5

评分理由: The paper focuses on 3D Gaussian scene graphs for embodied AI, showing moderate relevance to MultiModal (visual-semantic integration) and World Models (scene graph as internal representation), and low relevance to Tokenizer, MLLM, and model-based RL. Unify Models relates to the hybrid voxel-Gaussian architecture. No expert authors from the provided list were found. The weighted score (34.5) exceeds the dynamic passing threshold (27.8).

关键词

Dynamic 3D Gaussian, Scene Graph, Embodied Reasoning, Open-vocabulary, 3D Visual Grounding, Incremental Semantic Mapping, Robotic Task Execution

深度分析

Chinese Title: DGSG-Mind：面向长期场景理解与定位的动态3D高斯场景图

Summary: 本文提出DGSG-Mind，一种混合实例感知的3D高斯动态场景图系统，旨在解决长期具身场景理解中的实例关联脆弱、动态拓扑变化处理能力不足以及缺乏显式空间推理等问题。系统将概率体素网格与显式3D高斯耦合，实现鲁棒的跨模态实例融合和增量语义建图；通过基于高斯的视觉重定位和几何-语义一致性引导的局部掩码细化，处理动态变化而不重新优化静态背景。在实例高斯图基础上构建层次化场景图，并开发3D高斯思维模块，融合结构关系、空间语义信息和RoI高斯渲染视觉线索进行多模态推理。实验表明，DGSG-Mind在自重建地图上的零样本3D视觉定位任务中达到最佳性能，同时在3D开放词汇语义分割和场景重建方面表现强劲，并在真实机器人上验证了目标导向推理和动态更新能力。

Innovations:

提出混合实例感知3DGS表示：将显式3D高斯与概率体素网格耦合，利用跨模态相似性和多术语高斯优化实现高质量重建和准确实例融合。
引入几何-语义动态更新：采用联合几何-语义一致性约束的局部掩码细化策略，在不重新优化静态背景的前提下提高动态场景更新精度。
开发3D高斯思维模块：构建层次化场景图，结合场景结构、节点级空间语义信息和带注释的RoI高斯渲染，支持复杂3D场景中的结构化理解和目标推理。
实现零样本3D视觉定位：直接在自重建的高斯地图上通过渲染的RoI视图和结构化场景图上下文进行视觉-空间推理，无需离线预扫描几何。

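Methodology: The system uses a hybrid representation: a probabilistic voxel grid serves as the geometric scaffold, and explicit 3D Gaussians are used for rendering. Given an RGB-D stream, YOLO-World and SAM extract 2D instance masks and CLIP supplies semantic features; cross-modal instance association and fusion then rely on geometric similarity, rendered-mask IoU, and semantic similarity. During mapping, the voxel grid guides Gaussian initialization, and the Gaussian field is optimized jointly under photometric, depth, scale, and normal regularization terms. In dynamic scenes, the system first relocalizes visually through Gaussian rendering, then detects changed regions from geometric-semantic consistency and runs localized masked refinement to add or remove objects while keeping the static background fixed. Finally, a hierarchical scene graph is abstracted from the Gaussian map, and the 3D Gaussian Mind module uses an LLM/MLLM for multimodal reasoning to achieve zero-shot 3D visual grounding.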
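
The three cues used for instance association (geometric similarity, rendered-mask IoU, and semantic similarity) can be combined as in the sketch below. It is a hedged illustration only: the equal weights, the weighted-sum form, and the function names are assumptions, and the geometric similarity is passed in as a precomputed number because the summary does not say how it is measured.

```python
import numpy as np

def mask_iou(mask_a, mask_b):
    """Intersection over union of two boolean masks of the same shape."""
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(mask_a, mask_b).sum() / union)

def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two feature vectors (for example CLIP embeddings)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def association_score(geo_sim, det_mask, rendered_mask, det_feat, inst_feat,
                      weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three cues; the weights here are hypothetical."""
    w_geo, w_mask, w_sem = weights
    return (w_geo * geo_sim
            + w_mask * mask_iou(det_mask, rendered_mask)
            + w_sem * cosine_similarity(det_feat, inst_feat))

detection_mask = np.array([[1, 1, 0, 0]], dtype=bool)
same_instance = np.array([[1, 1, 0, 0]], dtype=bool)
other_instance = np.array([[0, 0, 1, 1]], dtype=bool)
feat = np.array([1.0, 0.0])

match = association_score(0.9, detection_mask, same_instance, feat, np.array([0.9, 0.1]))
mismatch = association_score(0.1, detection_mask, other_instance, feat, np.array([0.0, 1.0]))
print(match, mismatch)
```

A new detection would be merged into the existing instance with the highest score if that score clears a threshold, and would start a new instance otherwise. The limitations listed for this paper note that such thresholds and weights are set by hand.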

Key Results:

在零样本3D视觉定位任务上，DGSG-Mind在自重建地图上达到最佳性能，优于SeeGround、ConceptGraphs等方法。
在3D开放词汇语义分割和场景重建任务中表现强劲，重建质量高且语义准确。
在真实机器人平台上成功部署，展示了目标导向推理和动态更新能力，能够处理物体出现/消失等拓扑变化。

Tech Stack:

3D Gaussian Splatting (3DGS) 用于实时密集建图和渲染
概率体素网格 (Probabilistic Voxel Grid) 用于几何支架和实例关联
YOLO-World 用于开放词汇目标检测
Segment Anything Model (SAM) 用于生成高质量2D实例掩码
CLIP 用于提取语义特征
SBERT 等基础模型用于语义嵌入
多术语优化：光度、深度、尺度、法线正则化
几何-语义一致性检测
层次化场景图构建
LLM/MLLM (如GPT-4V) 用于多模态推理和零样本3D视觉定位

Strengths:

混合表示结合了体素的几何稳定性和高斯的渲染质量，实现鲁棒的实例融合和高质量重建。
动态更新策略仅局部优化变化区域，避免全局重优化，效率高且保持静态背景一致性。
3D高斯思维模块充分利用渲染视觉线索和结构化场景图，实现零样本推理，无需离线预扫描。
在多个基准上取得领先性能，并在真实机器人上验证实用性。

Limitations:

依赖RGB-D输入和准确的相机位姿，在纯视觉或位姿噪声大的场景下可能性能下降。
实例关联阈值和权重需要手动设定，可能影响泛化性。
动态更新依赖于几何-语义一致性检测，对于外观变化大但几何不变的情况可能误判。
系统复杂度较高，实时性可能受限于多模块串行处理。

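Relevance To Keywords: The paper has some relevance to "native multimodal large models" and "unified understanding and generation in multimodal large models", because its 3D Gaussian Mind module uses a multimodal large model for vision-language reasoning and so combines scene understanding with grounding. It relates to "world models" and "representation learning": the 3D Gaussian scene graph the system builds can be seen as a structured world model, and the hybrid representation learns geometric, semantic, and instance-level representations of the scene. The link to "model-based RL" and "post-training" is weak, although the system's support for long-term dynamic scene updates can provide scene priors for planning and control in embodied intelligence. Overall, the paper focuses on 3D scene understanding and dynamic mapping; it overlaps with the multimodal-large-model and world-model directions, but those are not its core research topics.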

58. Improving CLIP Adaptation by Breaking Tail Alignment for Source-Free Cross-Domain Few-Shot Learning (PASS)

Score: 34.5 / 27.8

Authors: Shuai Yi, Yixiong Zou, Yuhua Li, Ruixuan Li

Published: 2026-05-28

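TL;DR: This paper proposes Adaptive Tail-Head Alignment (ATHA), a strategy that improves CLIP on cross-domain few-shot learning by weakening the alignment between low-similarity image tokens and text embeddings.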

摘要翻译

视觉 - 语言模型（VLMs）如 CLIP 展现出强大的零样本泛化能力，但在跨域场景中，当目标域训练数据稀缺时（跨域少样本学习，CDFSL），其性能显著下降。本文专注于基于 CLIP 的跨域少样本学习（CDFSL）任务中的目标域少样本微调。主流的微调范式统一地将所有图像块标记与对应的文本嵌入对齐。然而，我们发现一个反直觉的现象：主动将某些低相似度的图像标记（称为“尾部标记”）与文本嵌入拉开距离，一致地提升了目标域性能。我们深入研究了这一现象并提供了一种新颖的解释：在显著的域偏移和稀缺训练数据下，模型很难从视觉输入中提取语义信息；因此，普遍的对齐观点仅对已经包含足够语义信息的标记有效；对于尾部标记，强制对齐会导致对稀缺训练的过度拟合，而解除对齐则更为有效。受此启发，我们提出了一种针对 CLIP 的新颖微调策略：自适应尾部 - 头部对齐（ATHA），它将传统的统一对齐范式转变为自适应对齐范式，兼具对齐的加强与减弱。在四个具有挑战性的跨域少样本学习（CDFSL）基准上的广泛实验验证了我们最先进的性能。我们的代码可在 https://github.com/shuaiyi308/ATHA 获取。

Abstract

Vision-Language Models (VLMs) such as CLIP demonstrate strong zero-shot generalization, but their performance significantly degrades in cross-domain scenarios with scarce target-domain training data (Cross-Domain Few-Shot Learning, CDFSL). In this paper, we focus on the target-domain few-shot finetuning in the CLIP-based CDFSL task. Prevailing finetuning paradigms uniformly align all image patch tokens with their corresponding textual embeddings. However, we find a counterintuitive phenomenon: actively pushing away certain low-similarity image tokens, termed "tail tokens", from their textual embeddings consistently improves target-domain performance. We delve into this phenomenon and provide a novel interpretation: under great domain shifts and scarce training data, the model can hardly extract semantic information from visual inputs; therefore, the common belief of alignment is valid only for tokens already containing sufficient semantic information; for tail tokens, forcing the alignment would lead to excessive overfitting to the scarce training, while breaking the alignment is more useful. Motivated by this, we propose Adaptive Tail-Head Alignment (ATHA), a novel fine-tuning strategy for CLIP that transforms the conventional uniform alignment paradigm to an adaptive alignment paradigm, with both alignment strengthening and weakening. Extensive experiments on four challenging CDFSL benchmarks validate our state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/ATHA.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

评分理由: 论文聚焦于 CLIP 在跨域少样本学习中的微调策略，涉及视觉编码器中的图像 patch token 处理及视觉 - 文本对齐，因此 Visual Encoder 和 MultiModal 相关性较高；CLIP 作为多模态基础模型与 MLLM 有一定关联；Unify Models、Tokenizer、World Models 及 model-based RL 与论文核心内容（少样本学习对齐策略）无关。作者列表中未包含指定的专家，故无额外加分。

关键词

CLIP, Cross-Domain Few-Shot Learning, Adaptive Tail-Head Alignment, Image Patch Tokens, Alignment Strategy, Vision-Language Models, Domain Adaptation

深度分析

Chinese Title: 通过打破尾部对齐改进CLIP在源无关跨域小样本学习中的适应

Summary: 本文针对视觉语言模型CLIP在跨域小样本学习（CDFSL）中性能下降的问题，提出了一种新的微调策略。研究发现，在目标域微调时，主动推离与文本嵌入相似度低的“尾部令牌”反而能持续提升性能，这与传统的统一对齐范式相悖。作者通过分析认为，在域偏移大且训练数据稀缺的情况下，模型难以从视觉输入中提取语义信息，强制对齐尾部令牌会导致过拟合，而打破对齐则更有效。基于此，提出了自适应尾-头对齐（ATHA）方法，在ViT的每一层动态识别令牌并根据语义相关性进行非对称调制：头部令牌通过可学习的文本嵌入加法拉近，尾部令牌通过减法推远。在四个CDFSL基准上的实验表明，该方法达到了最先进性能。

Innovations:

发现并分析了反直觉现象：主动推离低相似度图像令牌（尾部令牌）能提升跨域小样本学习性能，挑战了传统统一对齐范式。
提出新解释：在域偏移大且数据稀缺时，对齐仅对含有足够语义信息的头部令牌有效，对尾部令牌打破对齐更有用。
提出自适应尾-头对齐（ATHA）方法，通过可学习的缩放参数动态控制文本嵌入的加/减操作，实现令牌级和层级自适应对齐调制。
在多个跨域小样本学习基准上取得最先进性能，验证了方法的有效性。

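Methodology: The paper works in the source-free cross-domain few-shot learning (SF-CDFSL) setting and fine-tunes CLIP on the target domain. It first analyzes the distribution of similarity between tokens and text embeddings to identify head tokens (high similarity) and tail tokens (low similarity). It then proposes ATHA: during the forward pass of each Vision Transformer layer, the similarity between every token and each class's text embedding is computed dynamically; head tokens receive an addition of the text embedding with a learnable weight (pulling them closer), and tail tokens receive a subtraction of the text embedding with a learnable weight (pushing them away). Training uses a cross-entropy loss, and evaluation is carried out on several benchmark datasets (such as the CDFSL benchmark).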
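
The add-for-head, subtract-for-tail modulation described above can be sketched for a single layer in NumPy. This is an illustration under several assumptions, not the released ATHA code: the function name `tail_head_modulate`, the quantile rule that decides which tokens count as tail, the choice of each token's most similar class embedding as the vector to add or subtract, and the fixed scales (learnable in the paper) are all hypothetical.

```python
import numpy as np

def _unit(x, eps=1e-12):
    """Normalize the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def tail_head_modulate(tokens, text_emb, alpha_head=0.1, alpha_tail=0.1, tail_fraction=0.25):
    """Pull head tokens toward, and push tail tokens away from, their closest text embedding.

    tokens: (n_tokens, d) patch tokens at one layer.
    text_emb: (n_classes, d) class text embeddings.
    Returns the modulated tokens and a boolean mask marking the tail tokens.
    """
    sim = _unit(tokens) @ _unit(text_emb).T        # cosine similarity per token and class
    best_class = sim.argmax(axis=1)
    best_sim = sim.max(axis=1)
    is_tail = best_sim < np.quantile(best_sim, tail_fraction)  # lowest-similarity tokens
    scale = np.where(is_tail, -alpha_tail, alpha_head)[:, None]
    return tokens + scale * text_emb[best_class], is_tail

text_emb = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
tokens = np.array([
    [1.0, 0.1, 0.0, 0.0],   # close to class 0
    [0.1, 1.0, 0.0, 0.0],   # close to class 1
    [0.2, 0.1, 1.0, 0.0],   # mostly unrelated to both classes
    [0.9, 0.0, 0.3, 0.0],   # close to class 0
])
modulated, is_tail = tail_head_modulate(tokens, text_emb)
print(is_tail)
```

In this toy input only the third token falls in the tail, so it is moved away from its nearest class embedding while the other three are moved closer. In a trained model the two scales would be parameters updated by the cross-entropy loss.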

Key Results:

主动推离尾部令牌在四个跨域小样本学习基准上一致提升性能，超越标准微调。
预训练模型在源域呈现理想的层次化相似度分布（头部令牌高相似、尾部令牌低相似），但在目标域分布变得平坦且缺乏区分性。
强制对齐尾部令牌会加剧过拟合，降低源域与目标域相似性；而加强头部令牌对齐可减少过拟合。
ATHA方法在多个数据集上达到最先进性能，验证了自适应对齐策略的有效性。

Tech Stack:

CLIP（视觉语言模型）
Vision Transformer（ViT）
余弦相似度
交叉熵损失
可学习缩放参数（用于控制文本嵌入加/减）
源无关跨域小样本学习（SF-CDFSL）设定
N-way K-shot episodic训练范式

Strengths:

发现了反直觉现象并提供了深入分析，对跨域小样本学习中的对齐策略有重要理论贡献。
提出的ATHA方法简单有效，无需额外数据或复杂模块，易于实现。
在多个具有挑战性的基准上验证了泛化能力，结果可靠。
代码开源，便于复现和后续研究。

Limitations:

方法依赖于预训练CLIP模型，可能不适用于其他视觉语言模型。
对尾部令牌的定义（基于相似度排序）可能受超参数影响，需要调优。
实验仅在图像分类任务上验证，未涉及其他跨域任务（如分割、检测）。
未深入探讨不同层自适应策略的差异，可能还有优化空间。

Relevance To Keywords:

Unify Models: 论文聚焦CLIP模型微调，属于统一视觉语言模型的研究。
World Models: 不直接相关，但跨域适应涉及对世界知识的迁移。
Representation Learning: 核心是改进视觉表征的对齐方式，属于表征学习范畴。
Model-Based RL: 不直接相关。
原生多模态大模型: CLIP是原生多模态模型，论文研究其适应策略。
多模态大模型的理解和生成一体化: 论文仅涉及理解（分类），不涉及生成。
表征学习: 同上，改进令牌级表征。
强化学习: 不相关。
后训练: 论文提出的微调策略属于后训练阶段。

59. xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR (PASS)

Score: 33.0 / 27.8

Authors: Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

Published: 2026-05-28

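TL;DR: This paper proposes xModel-KD, a framework that fuses 2D image and 3D LiDAR features through cross-modal knowledge distillation, improving point cloud segmentation mIoU and reducing the dependence on annotated data.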

摘要翻译

点云分割是 3D 场景理解中的基础任务。其进展受限于密集 3D 标注所需的高昂成本和时间，导致标注样本难以获取。除了标注稀缺之外，不同的感知模态也面临固有的局限性。2D 图像提供丰富的纹理和外观线索，但它们缺乏明确的深度和几何结构。相比之下，3D 点云捕捉准确的几何结构，但它们是稀疏的且不包含纹理信息。因此，依赖单一模态限制了所学表示的丰富性并削弱了泛化能力。尽管近期结合 3D 点云与 2D 图像的多模态方法在分类和检索等任务中表现出强大性能，但它们通常依赖大规模标注数据集，且尚未被充分利用于数据高效的密集预测。针对上述局限性，我们提出了一种新颖的跨模态知识蒸馏框架 xModel-KD，用于 3D 点云分割。该方法通过跨模态对齐学习统一的逐点表示，利用 2D 纹理和 3D 几何的互补优势。具体而言，我们设计了一个跨模态融合编码器，使用对比目标进行训练，该目标强制对应 2D 和 3D 表示在多个视图之间保持特征一致性。通过将强大的预训练骨干网络与针对性的融合策略相结合，所提出的框架有效地将外观线索从图像转移到几何感知点特征。实验结果表明，跨模态融合相对于仅 LiDAR 基线在 mIoU 上实现了 2% 的绝对提升，证明了利用互补多模态信息对于可扩展且标注高效的 3D 场景理解的好处。

Abstract

Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0

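Scoring rationale: Relevance scores reflect how closely the paper's content matches each keyword. (1) MultiModal (10): the paper's core is combining 2D images with 3D point clouds, a typical multimodal learning task. (2) Visual Encoder (7): pre-trained backbones extract image features, so a visual encoder is involved, but it is not explored in depth as a core contribution. (3) Unify Models (5): the paper mentions "unified per-point representations", which relates to the notion of unification but is not the unified-model paradigm the keyword refers to (such as unified large-model architectures). (4) Tokenizer, World Models, MLLM, model-based RL (0): the paper focuses on 3D perception and knowledge distillation and has no direct connection to tokenizers, world models, large language models, or reinforcement learning. Expert check: none of the specified experts appear in the author list, so no bonus is added. The weighted total is 33.0, above the dynamic passing score of 27.8.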
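The weighted total of 33.0 can be reproduced from the score table: each keyword's score is its weight times its relevance, and the paper passes when the sum exceeds the dynamic passing score. The sketch below is only an illustration of that arithmetic; the names `relevance`, `weighted_total`, and `PASS_THRESHOLD` are hypothetical and not part of the report generator.

```python
# Relevance values (0-10) copied from the score table; every keyword has weight 1.5.
WEIGHT = 1.5
relevance = {
    "Unify Models": 5.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 7.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 10.0,
    "model-based RL": 0.0,
}
PASS_THRESHOLD = 27.8  # dynamic passing score quoted in the report


def weighted_total(scores, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in scores.values())


total = weighted_total(relevance)
passed = total > PASS_THRESHOLD
print(f"total={total:.1f} passed={passed}")
```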

Keywords

Cross-modal Knowledge Distillation, 3D Scene Perception, Point Cloud Segmentation, LiDAR, Multi-modal Fusion, Representation Learning, Annotation-efficient

Deep Analysis

Chinese Title: xModel-KD：基于激光雷达的3D场景感知跨模态知识蒸馏

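Summary: This paper proposes xModel-KD, a training-time cross-modal knowledge distillation framework for 3D point cloud semantic segmentation. To address 2D images lacking depth, and 3D point clouds lacking texture while being costly to annotate, the method uses a frozen 2D vision foundation model as the teacher and transfers 2D semantic knowledge (texture, boundaries, high-level semantics) into the 3D backbone through multi-scale contrastive distillation. Specifically, a cross-modal fusion encoder projects 2D and 3D features into a shared embedding space, aligns corresponding point-pixel pairs with contrastive learning, and applies multi-scale alignment (intermediate layers convey structural information, deep layers convey semantics). After training, the 2D branch and cross-modal components are removed, leaving only the enhanced 3D network, so inference incurs no extra cost. Experiments on nuScenes and other datasets show a 2% mIoU gain over a LiDAR-only baseline, mitigation of the field-of-view mismatch, and efficient, scalable 3D scene understanding.

Innovations:

Proposes a cross-modal knowledge distillation framework needed only at training time: semantic knowledge from a frozen 2D teacher is transferred to the 3D network, and inference needs no images or fusion modules, adding zero overhead.
Designs a multi-scale contrastive distillation strategy that aligns 2D and 3D features at intermediate layers (structure/boundaries) and deep layers (semantics) for hierarchical knowledge transfer.
Through shared parameters and contrastive learning, the 3D network inherits 2D priors even for points outside the camera field of view, mitigating the field-of-view mismatch.
A lightweight pipeline that keeps only the 3D backbone after training, balancing accuracy and deployment efficiency.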

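Methodology: The method follows a teacher-student distillation paradigm: a frozen 2D vision foundation model (such as DINO) is the teacher and a 3D sparse convolutional network is the student. First, the 2D and 3D encoders each extract multi-scale features. Lightweight projection heads then map both into a 128-dimensional shared embedding space. Next, LiDAR-to-image projection establishes point-pixel correspondences, and a contrastive loss (InfoNCE) is applied at multiple scales to pull positive pairs together and push negative pairs apart. Finally, an inverse projection restores the original feature dimension for the decoder, which performs semantic segmentation. After training, the 2D branch and projection heads are discarded and only the 3D network is used for inference.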
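The point-pixel contrastive step can be illustrated with a small NumPy version of the InfoNCE loss. This is a minimal sketch, not the paper's implementation: it assumes one scale, that row i of each array is a matched point-pixel pair, and a temperature of 0.07, which is a common default and not a value taken from the paper. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Normalize each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(point_emb, pixel_emb, temperature=0.07):
    """InfoNCE over N matched pairs: row i of both arrays is a positive pair,
    every other row in the batch acts as a negative."""
    p = l2_normalize(point_emb)
    q = l2_normalize(pixel_emb)
    logits = p @ q.T / temperature                        # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))             # positives sit on the diagonal


rng = np.random.default_rng(0)
pixels = rng.normal(size=(8, 128))                   # 128-d shared embedding space
aligned = pixels + 0.01 * rng.normal(size=(8, 128))  # points close to their pixels
unaligned = rng.normal(size=(8, 128))                # points unrelated to the pixels

loss_aligned = info_nce(aligned, pixels)
loss_random = info_nce(unaligned, pixels)
print(loss_aligned, loss_random)
```

Well-aligned pairs give a loss near zero, while unrelated embeddings give a loss on the order of log N, which is the behaviour the distillation objective exploits.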

Key Results:

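On nuScenes, xModel-KD achieves a 2% absolute mIoU improvement over LiDAR-only baselines (such as MinkowskiNet).
Handles the roughly 23.2% of LiDAR points that fall outside the camera field of view; shared parameters let these points benefit from 2D priors as well.
No extra latency or memory at inference; efficiency matches a pure 3D model.
Multi-scale contrastive distillation outperforms single-layer distillation: intermediate-layer alignment contributes boundary accuracy and deep-layer alignment contributes semantic accuracy.

Tech Stack:

3D backbone: sparse convolutional network (such as MinkowskiEngine or Cylinder3D)
2D teacher: frozen vision foundation model (such as DINO or CLIP)
Contrastive learning: InfoNCE loss
Multi-scale feature extraction: hierarchical encoder-decoder structure
Projection head: lightweight MLP (maps to 128 dimensions)
Point-pixel projection: projective transform based on the LiDAR-camera calibration matrices
Datasets: nuScenes, Waymo (mentioned in the experiments)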

Strengths:

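Combines knowledge distillation with contrastive learning in a novel way to achieve cross-modal transfer with no inference overhead.
The multi-scale alignment strategy conveys 2D priors at different levels of abstraction and improves segmentation accuracy.
The handling of points outside the camera field of view improves robustness and generalization.
The method is lightweight and easy to integrate into existing 3D segmentation frameworks.
The experimental design is clear, and ablations verify each component's contribution.

Limitations:

Depends on a high-quality 2D vision foundation model; the choice of teacher may affect distillation quality.
Contrastive learning needs precise point-pixel correspondences and therefore accurate sensor calibration.
Targets semantic segmentation only; effectiveness on other 3D tasks (such as detection or instance segmentation) is not verified.
The gain (2% mIoU) is relatively modest, possibly limited by the modality gap between the teacher and the 3D task.
Not thoroughly validated on more diverse datasets (such as indoor scenes).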

Relevance To Keywords:

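Unify Models: The paper unifies 2D and 3D representations through cross-modal distillation but does not address unified generation and understanding.
World Models: Not directly related, though 3D scene perception is a foundational component of world models.
Representation Learning: Core relevance; cross-modal representations are aligned through contrastive learning.
Model-Based RL: Not directly related.
Native multimodal large models: The paper uses a frozen 2D large model as the teacher but does not train a native multimodal model.
Unified understanding and generation in multimodal large models: Covers understanding only (segmentation), with no generation.
Representation learning: Strongly related; cross-modal alignment and distillation are in essence representation learning.
World models: Weakly related; 3D perception can support building world models.
Reinforcement learning: Not related.
Post-training: Weakly related; distillation can be viewed as knowledge transfer in a post-training stage.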

60. CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving (PASS)

Score: 33.0 / 27.8

Authors: Zezhong Qian, Zhao Yang, Lu Tan, Zhihao Yan, Weiyi Hong, Haizhuang Liu, Yawei Jueluo

Published: 2026-05-28

TL;DR: CityGen introduces a diffusion-based framework for zero-label cross-city autonomous driving adaptation by synthesizing city styles guided by HD-maps, enhancing robustness across perception and planning tasks.

Abstract (Chinese Translation)

自动驾驶系统通常在有限的地理区域内训练和评估，这限制了它们在部署到新城市时的可扩展性。然而，外观、道路拓扑和交通模式等方面的显著领域偏移往往会导致跨城市部署时性能严重下降。基于领域适应、数据增强或合成数据生成的现有方法通常依赖于标注目标数据、城市特定标注或任务特定设计，这限制了它们在全面评估中的可扩展性和有效性。本文介绍了 CityTransfer-Bench，这是一个用于评估感知、分割和规划跨城市泛化能力的地理上不重叠的基准，并提出 CityGen，一种基于扩散的生成框架，该框架通过基于高清地图 (HD-map) 条件的合成实现零标签城市适应，并由城市级视觉提示引导。大量实验表明，CityGen 在多个任务上一致地提高了跨城市鲁棒性，为可泛化的自动驾驶建立了可扩展且标签高效的基石。

Abstract

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on autonomous driving using diffusion models for cross-city generalization. It shows strong relevance to 'MultiModal' (combining HD-maps and visual prompts) and moderate relevance to 'World Models' (generating city environments). However, it lacks direct connection to 'MLLM', 'Tokenizer', or 'model-based RL' as core methodologies, resulting in lower scores for these keywords.

Keywords

Autonomous Driving, Cross-City Generalization, Diffusion Models, HD-map-conditioned Synthesis, Zero-label Adaptation, City-level Visual Prompts, Domain Shift, Perception and Planning

Deep Analysis

Chinese Title: CityGen：结构引导的城市风格合成用于跨城市自动驾驶

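Summary: This paper tackles the performance drop that autonomous driving systems suffer under cross-city deployment due to domain shifts in appearance, road topology, and traffic patterns, and proposes the CityTransfer-Bench benchmark and the CityGen generative framework. CityTransfer-Bench is a geographically disjoint benchmark built on nuScenes for evaluating cross-city generalization in perception, segmentation, and planning. CityGen is a diffusion-based zero-label city adaptation framework that, through HD-map-conditioned synthesis guided by city-level visual prompts, generates images in the target city's style while preserving semantic consistency. Experiments show that CityGen-generated data markedly improves downstream robustness on unseen cities, providing a foundation for scalable cross-domain autonomous driving.

Innovations:

First to propose CityTransfer-Bench, a benchmark dedicated to evaluating cross-city generalization in autonomous driving, covering three key tasks: perception, segmentation, and planning.
Proposes the CityGen framework, which uses a diffusion model for zero-label city adaptation via HD-map structural control and city-style-conditioned synthesis, with no target-domain annotations.
Designs a structure-guided multi-view HD-map projection that projects lane geometry and 3D bounding boxes onto the image plane as a geometry-invariant structural control signal.
Builds a label-free city style bank, with style descriptions extracted by a vision-language model, enabling controllable and diverse synthesis of target-city appearance.
Validates on multiple tasks that the generated data improves cross-city robustness, establishing a scalable basis for cross-domain autonomous driving research.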

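Methodology: The paper uses a diffusion model (DiT) as the generative backbone, combined with a structural control branch and style-condition injection. First, geometric elements such as lane lines and 3D bounding boxes are extracted from the HD map, projected onto the multi-view image planes, and rasterized into multi-channel masks that serve as the structural control signal. A lightweight structure encoder then injects structural features into every denoising block to preserve geometric consistency. In parallel, frames are sampled from unlabeled target-city video and the InternVL vision-language model extracts style descriptions from them to build a city style bank. At generation time, a randomly sampled style description is used as the condition and, together with the structural control, guides the denoising process to synthesize multi-view images in the target city's style. A multi-view attention mechanism further strengthens cross-view geometric consistency.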
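The projection-and-rasterization step can be pictured with a plain pinhole camera model. The sketch below is an assumption-laden illustration rather than CityGen's code: it assumes points are already in the camera frame (x right, y down, z forward), uses a made-up intrinsic matrix `K`, and rasterizes a single lane channel by marking the nearest pixel. All names are hypothetical.

```python
import numpy as np


def project_points(points_cam, K):
    """Project Nx3 camera-frame points to pixel coordinates.
    Points with non-positive depth (behind the camera) are dropped."""
    pts = points_cam[points_cam[:, 2] > 0]
    uvw = pts @ K.T                    # rows are (fx*x + cx*z, fy*y + cy*z, z)
    return uvw[:, :2] / uvw[:, 2:3]    # perspective divide


def rasterize(uv, height, width):
    """Mark the nearest pixel of every projected point in a binary mask."""
    mask = np.zeros((height, width), dtype=np.uint8)
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    mask[rows[inside], cols[inside]] = 1
    return mask


# Hypothetical intrinsics for a 640x480 camera.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
# Lane points 1.5 m below the camera at several depths; the last is behind it.
lane = np.array([[0.0, 1.5, z] for z in (5.0, 10.0, 20.0, -3.0)])

uv = project_points(lane, K)
lane_mask = rasterize(uv, height=480, width=640)
```

A real pipeline would first apply the ego-to-camera extrinsics for each view and would draw connected polylines and box edges instead of isolated points, producing one mask channel per element type.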

Key Results:

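CityGen-generated data significantly improves cross-city generalization on perception, segmentation, and planning.
On CityTransfer-Bench, models trained with CityGen data outperform baselines on detection, segmentation, and planning metrics in the target city (Boston).
The zero-label city adaptation strategy avoids target-domain annotation cost while preserving semantic consistency.
Structural control keeps lane topology and object positions accurate in generated images, so existing annotations can be reused directly.

Tech Stack:

Diffusion model (Diffusion Transformer, DiT)
Structural control branch (ControlNet-like structure injection)
Multi-view HD-map projection and rasterization
Vision-language model (InternVL) for style encoding
Multi-view attention mechanism
Feature-wise addition for injection
Conditional sampling in the denoising process (DDPM/DDIM)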

Strengths:

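Proposes the first multi-task benchmark for cross-city autonomous driving, filling a gap in the field's evaluation.
CityGen achieves zero-label city adaptation, greatly reducing annotation cost and scaling well.
Decoupling structural control from style conditioning yields images that are geometrically precise yet carry the target city's appearance, with high semantic consistency.
Experiments cover perception, segmentation, and planning, validating generality and effectiveness.

Limitations:

Depends on HD-map availability; cities without HD maps may not be directly supported.
Building the style bank requires target-city video; no labels are needed, but data collection still is.
Generation may be constrained by the diffusion model's compute cost and by real-time requirements, so efficiency matters for deployment.
Validated only on nuScenes; generalization to other cities or scenes needs further testing.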

Relevance To Keywords:

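World models: CityGen generates city-style images with a diffusion model and can be viewed as a kind of world model that simulates the appearance and layout of different cities to improve the generalization of autonomous driving systems.
Representation learning: The paper extracts city-style representations with a vision-language model and learns geometry-invariant representations through structural control, an application of representation learning to domain adaptation.
Unified understanding and generation in multimodal large models: CityGen combines a vision-language model (InternVL) for style understanding with a diffusion model for image generation, coupling understanding and generation.
Post-training: CityGen-generated data can be used to post-train (fine-tune) downstream models and improve cross-city performance, so it counts as a post-training strategy.
Reinforcement learning: The planning evaluation involves decision making, and generated data could support environment simulation for reinforcement learning, but the paper itself does not use reinforcement learning.
Model-based RL: The paper involves no reinforcement learning algorithm, though the generated data could serve the training of RL-based planning models.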

61. MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion (PASS)

Score: 33.0 / 27.8

Authors: Ali Abusaleh, Bhuvanesh Verma, Alexander Mehler

Published: 2026-05-28

TL;DR: MMTM introduces a tri-modal topic modeling pipeline for long-form video using similarity-gated fusion of speech, audio, and visual embeddings to significantly improve topic coherence and temporal stability.

Abstract (Chinese Translation)

我们提出了 MMTM，这是一个用于长视频主题发现的模块化流程，通过确定性相似门控融合整合了语音识别、音频与视觉嵌入以及 BERTopic 聚类。在德语（Tagesschau）和英语（NBC）广播新闻上进行跨语言评估，联合三模态建模显著提升了主题质量：噪声从 0.27 降至 0.06，转换率从 0.70 降至 0.21，归一化熵从 0.84 升至 0.92，表明主题更具连贯性和时间稳定性。聚类有效性（Calinski-Harabasz 指数）在嵌入空间上提升了 5 到 12 倍。词汇连贯性（NPMI）在德语语料上从 0.77 升至 0.86，但具有语料库依赖性，并未迁移到较短的 NBC 广播中。我们发布了该流程代码以及一个人工验证的 54 小时多模态视频主题语料库，包含双标注者视觉评估和大语言模型（LLM）辅助标注。

Abstract

We introduce MMTM, a modular pipeline for topic discovery in long-form video that integrates speech recognition, audio and visual embeddings, and BERTopic clustering through a deterministic similarity-gated fusion. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality: noise drops from 0.27 to 0.06, transition rate from 0.70 to 0.21, and normalized entropy rises from 0.84 to 0.92, indicating more coherent and temporally stable topics. Cluster validity (Calinski-Harabasz) improves by 5-12X across embedding spaces. Lexical coherence (NPMI) rises from 0.77 to 0.86 on German but is corpus-dependent and does not transfer to the shorter NBC broadcasts. We release the pipeline code and a human-validated 54-hour multimodal video topic corpus with dual-annotator visual evaluation and LLM-assisted labeling.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper proposes a tri-modal topic modeling pipeline for long-form video, integrating speech, audio, and visual embeddings. It scores high on MultiModal (9) due to explicit tri-modal integration, moderate on Visual Encoder (5) and Unify Models (3) regarding modality fusion, low on Tokenizer (2) and MLLM (3) as they are not core contributions, and zero on World Models and model-based RL (0) as the paper involves no RL or world modeling. Total weighted score is 33.0, exceeding the passing threshold. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Tri-Modal Topic Modeling, Long-Form Video, Similarity-Gated Fusion, Audio-Visual Embeddings, BERTopic Clustering, Topic Discovery, Speech Recognition

Deep Analysis

Chinese Title: MMTM：基于相似度门控融合的长视频三模态主题建模

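Summary: This paper proposes MMTM, a modular pipeline for topic discovery in long-form video. It integrates speech recognition (Whisper), audio embeddings (CLAP), and visual embeddings (OpenCLIP), combines the three modalities through a deterministic similarity-gated fusion, and clusters with BERTopic. In a cross-lingual evaluation on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality: noise drops from 0.27 to 0.06, transition rate from 0.70 to 0.21, normalized entropy rises from 0.84 to 0.92, and cluster validity (Calinski-Harabasz) improves 5-12x. Lexical coherence (NPMI) rises from 0.77 to 0.86 on the German corpus but does not transfer to the shorter English broadcasts. The authors also release the pipeline code and a human-validated 54-hour multimodal video topic dataset with dual-annotator visual evaluation and LLM-assisted labeling.

Innovations:

Proposes a modular, open-source multimodal video topic modeling pipeline that integrates ASR, CLAP, and OpenCLIP embeddings and builds a joint tri-modal representation through similarity-gated fusion.
Introduces a deterministic similarity-gated fusion mechanism that weights modalities dynamically by their cross-modal agreement, with no training or fine-tuning.
Builds and releases a roughly 54-hour multimodal video topic dataset with dual-annotator human validation and LLM-assisted topic labels.
Evaluates systematically on cross-lingual (German and English) news video, showing that tri-modal fusion improves topic structure quality in both languages and lexical coherence on the German corpus.
Supports configurable fusion weights and seed-word-based weakly supervised topic modeling for flexibility and extensibility.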

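Methodology: MMTM is a modular pipeline. Whisper first transcribes the video's audio into timestamped text segments. For each segment, CLAP extracts an audio embedding, and OpenCLIP (ViT-B-32) embeds representative frames chosen by a diversity-aware ranking, which are average-pooled into a visual embedding. The text, audio, and visual embeddings are L2-normalized and truncated to the smallest dimension; the three pairwise dot-product similarities are computed, weighted by a scaling factor, and concatenated with the element-wise product to give the fused embedding. Finally, BERTopic clusters the fused embeddings and can optionally merge similar topics automatically via reduce_topics(). Audio embeddings are also clustered separately with UMAP+HDBSCAN to produce speaker-style metadata, which does not take part in topic discovery.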
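One plausible reading of the fusion step is sketched below in NumPy. It is an interpretation of the summary, not the released pipeline: it assumes the element-wise product is taken over all three modalities at once, that a single scalar `scale` weights the three similarities, and that normalization happens before truncation as the text states. The embedding sizes and the function name `fuse` are hypothetical.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length."""
    return v / (np.linalg.norm(v) + eps)


def fuse(text, audio, visual, scale=1.0):
    """Deterministic fusion: element-wise product of the three embeddings,
    concatenated with the three scaled pairwise dot-product similarities."""
    d = min(len(text), len(audio), len(visual))
    t, a, v = (l2_normalize(np.asarray(x, dtype=float))[:d]
               for x in (text, audio, visual))
    sims = np.array([t @ a, t @ v, a @ v])   # text-audio, text-visual, audio-visual
    hadamard = t * a * v                     # element-wise (Hadamard) product
    return np.concatenate([hadamard, scale * sims])


rng = np.random.default_rng(0)
text_emb = rng.normal(size=768)    # hypothetical text embedding size
audio_emb = rng.normal(size=512)   # hypothetical audio embedding size
visual_emb = rng.normal(size=512)  # hypothetical visual embedding size

fused = fuse(text_emb, audio_emb, visual_emb, scale=0.5)
print(fused.shape)
```

Because nothing here is learned, the same inputs always give the same fused vector, which is what lets the pipeline run without training or fine-tuning.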

Key Results:

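Compared with the text-only baseline, tri-modal fusion reduces noise from 0.27 to 0.06 and transition rate from 0.70 to 0.21, and raises normalized entropy from 0.84 to 0.92.
Cluster validity (Calinski-Harabasz) improves 5-12x; Silhouette and Davies-Bouldin scores also improve markedly.
Lexical coherence (NPMI) rises from 0.77 to 0.86 on the German corpus but does not improve on the English NBC corpus.
Cross-lingual evaluation shows that tri-modal modeling improves topic structure in both German and English, while lexical-coherence gains are corpus-dependent.

Tech Stack:

Whisper (speech recognition)
CLAP (laion/clap-htsat-unfused, audio embeddings)
OpenCLIP (ViT-B-32, laion2b_s34b_b79k, visual embeddings)
BERTopic (topic clustering)
UMAP (nonlinear dimensionality reduction)
HDBSCAN (density-based clustering)
Maximal marginal relevance (MMR) for ranking representative frames
Laplacian variance (image sharpness)
Cosine similarity, dot-product similarity, L2 normalization
Hadamard product (element-wise multiplication)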

Strengths:

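Modular design: each component can be swapped independently, which eases extension and adaptation to other tasks.
The deterministic fusion mechanism needs no training, avoiding data leakage and overfitting risks.
Cross-lingual validation on German and English shows generalization.
Releases a human-validated multimodal dataset, supporting reproducible research and benchmarking.
Comprehensive metrics (noise, transition rate, entropy, cluster validity, NPMI, and others) cover both structural and semantic quality.

Limitations:

The lexical coherence (NPMI) gain is significant only on the German corpus and does not transfer to the short English broadcasts, suggesting an effect of corpus length or language characteristics.
Depends on ASR quality; Whisper's substitution and deletion errors may propagate downstream.
The frame selection strategy may lose key dynamic information (such as motion or gradual scene changes).
Fusion weights are fixed hyperparameters and are not adaptively optimized.
Evaluated only on broadcast news; applicability to other long-form video (such as livestreams or documentaries) is not verified.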

Relevance To Keywords:

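Native multimodal large models: MMTM uses pre-trained multimodal encoders (CLAP, OpenCLIP) for feature extraction but does not use a unified multimodal large model; modalities are encoded independently and then fused.
Unified understanding and generation in multimodal large models: MMTM focuses on understanding (topic modeling) and involves no generation.
Representation learning: Similarity-gated fusion builds a joint cross-modal representation, which falls under representation learning.
World models: Video topic modeling can be seen as abstracting the world state of video content, but MMTM does not explicitly model dynamics or causal structure.
Reinforcement learning: The paper does not involve reinforcement learning.
Post-training: MMTM uses pre-trained encoders with no fine-tuning or post-training, so it operates in a zero-shot, inference-only setting.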

62. PhoneWorld: Scaling Phone-Use Agent Environments (PASS)

Score: 33.0 / 27.8

Authors: Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu

Published: 2026-05-28

TL;DR: PhoneWorld introduces a scalable pipeline to generate controllable mobile agent environments from real GUI trajectories, significantly enhancing performance across various mobile agent benchmarks.

Abstract (Chinese Translation)

手机使用智能体的一个核心瓶颈在于，覆盖真实移动行为的可控、可复现的环境难以大规模构建。现有的移动智能体基准在评估方面取得了重要进展，但它们本身并未提供一种可扩展的方式来构建许多新的手机使用环境。我们提出了 PhoneWorld，一个可重用的流程，它将真实的 GUI 轨迹和截图转换为可控的手机使用环境、可执行任务、自动验证器和训练轨迹。与每次手工构建一个移动基准不同，PhoneWorld 使用真实轨迹来识别哪些屏幕界面重要、界面之间如何连接、哪些交互必须改变环境状态，以及哪些用户目标允许自动验证。基于这些信号，它构建了可运行的模拟 Android 应用，这些应用由只读应用内容和可变状态支持，然后从相同环境中推导出可执行任务、基于规则的验证器和训练轨迹。在其当前实现中，PhoneWorld 覆盖了 16 个领域的 34 个应用，涵盖常见的消费者移动行为，如搜索、浏览、购物、预订、媒体和社交互动。在固定训练预算下，用广泛的 PhoneWorld 监督信号替换 AndroidWorld 基线中辅助 AndroidWorld 语料库的 10K 步，同时提升了所有四个评估基准，使 HYMobileBench 提高 17.7 分，AndroidControl 提高 6.0 分，AndroidWorld 提高 14.7 分，PhoneWorld 提高 52.5 分。随后，我们研究了另外两个扩展性问题：增加 PhoneWorld 监督量显著提升了 PhoneWorld 性能，而在固定 PhoneWorld 预算下，扩展应用覆盖范围带来了更大的收益。总体而言，PhoneWorld 将焦点从每次构建一个移动基准转移到扩大手机使用环境的供应规模上。

Abstract

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	5.0/10	7.5

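Scoring rationale: The paper's core contribution is a scalable pipeline for building phone-use environments (PhoneWorld), not model architecture innovation, so its relevance to keywords such as Unify Models, Tokenizer, and Visual Encoder is low. It involves multimodal data and reinforcement learning environments, giving some connection to World Models and model-based RL. None of the specified experts (Yang Shi and others) appear in the author list, so no bonus is added.

Keywords

PhoneWorld, Mobile Agents, Environment Generation, GUI Trajectories, Reinforcement Learning, Android, Scalable Pipeline

Deep Analysis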

Chinese Title: PhoneWorld：规模化构建手机使用代理环境

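Summary: The paper presents PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training data. Existing mobile-agent benchmarks focus on evaluation and offer no scalable way to build new environments. PhoneWorld uses real trajectories to recover key screens, the navigation graph, and state-changing interactions, and builds runnable mock Android apps that support resettable, inspectable environments. The current instantiation covers 34 apps across 16 domains. Under a fixed training budget, replacing part of the AndroidWorld data with PhoneWorld supervision improves all four evaluation benchmarks at once (HYMobileBench +17.7 points, AndroidControl +6.0, AndroidWorld +14.7, PhoneWorld +52.5). Further study shows that both increasing the amount of PhoneWorld supervision and widening app coverage improve performance, with app coverage giving the larger gain. The core contribution is the shift from building one benchmark at a time to scaling the supply of phone-use environments themselves.

Innovations:

Proposes an AI-driven, human-audited pipeline that turns real GUI trajectories into controllable phone-use environments, executable tasks, automatic verifiers, and training data.
Uses real trajectories not only as demonstrations but also as guidance for environment construction, recovering screen priority, the navigation graph, and state-changing interactions.
Builds runnable mock Android apps backed by read-only content and mutable state, making environments resettable, inspectable, and reusable.
Shows that, under a fixed training budget, replacing part of the existing data with PhoneWorld supervision improves several benchmarks at once, demonstrating its value as training data.
Provides scaling studies showing that both more supervision and wider app coverage improve performance, and that app coverage is the stronger scaling signal.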

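Methodology: The PhoneWorld pipeline has four stages. (1) Input collection: for each target app, representative screenshots and human exploratory usage trajectories (with natural-language instructions and action sequences) are collected. (2) App structure recovery: Claude Code browses the screenshots to build a page taxonomy (25-30 categories per app); a lightweight vision-language model classifies screenshots in parallel; page frequency distributions and page transition graphs are extracted from the trajectories, and state-changing interactions are identified. (3) Build specification: the recovered structure is converted into page-level PRDs, reusable UI components, a read-only content layer, and a mutable state schema. (4) App construction: a coding agent iteratively implements, compiles, tests, and fixes the mock Android app, followed by a human audit. Finally, executable tasks, rule-based verifiers, and successful trajectories (for SFT training) are derived from the environment.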
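The structure-recovery stage extracts page frequencies and a page transition graph from trajectories. A minimal sketch of that bookkeeping follows, assuming each trajectory has already been reduced to a sequence of page-category labels by the screenshot classifier. The labels and the function name are hypothetical, and the paper's actual data format is not described in this report.

```python
from collections import Counter


def build_transition_graph(trajectories):
    """Count how often each page category appears and how often one page
    category is followed directly by another."""
    page_freq = Counter()
    transitions = Counter()
    for pages in trajectories:
        page_freq.update(pages)
        transitions.update(zip(pages, pages[1:]))  # consecutive (src, dst) pairs
    return page_freq, transitions


# Hypothetical page-category sequences for a shopping app.
trajectories = [
    ["home", "search", "results", "detail", "cart"],
    ["home", "search", "results", "detail"],
    ["home", "cart", "checkout"],
]

page_freq, transitions = build_transition_graph(trajectories)
for (src, dst), count in transitions.most_common(3):
    print(f"{src} -> {dst}: {count}")
```

Frequent pages and edges indicate which screens and navigation paths a mock app should implement first; rare ones can be deferred.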

Key Results:

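Under a fixed training budget, replacing 10K steps of auxiliary AndroidWorld data with PhoneWorld supervision raises HYMobileBench by 17.7 points, AndroidControl by 6.0, AndroidWorld by 14.7, and PhoneWorld by 52.5.
Fully replacing the auxiliary AndroidWorld data greatly improves PhoneWorld performance but is not best on every benchmark, indicating that PhoneWorld supervision and AndroidWorld data are complementary.
Increasing PhoneWorld supervision (from 10K to 20K steps) significantly improves PhoneWorld performance.
Under a fixed PhoneWorld budget, widening app coverage (from 17 to 34 apps) yields larger gains.
The current instantiation covers 34 apps across 16 domains, spanning common consumer behaviors such as search, browsing, shopping, booking, media, and social interaction.

Tech Stack:

Claude Code (for building the page taxonomy)
Lightweight vision-language model (for screenshot classification)
Mock Android apps (backed by read-only content and mutable state)
Rule-based automatic verifiers
Android emulator (for agent rollouts)
SFT (supervised fine-tuning)
AndroidWorld data as an auxiliary training corpus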

Strengths:

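Proposes a reusable environment-construction pipeline rather than a one-off benchmark, with potential to scale.
Environments are built from real trajectories, which keeps them visually and functionally realistic while remaining controllable and resettable.
Supports both evaluation and training-data generation: the same environment serves benchmarking and SFT.
Rigorous experimental design, with ablation and scaling studies under a fixed budget.
Covers 34 mainstream consumer apps across diverse domains, giving practical value.

Limitations:

Only 34 apps are covered so far; the scale is still limited and needs to expand to more apps and domains.
Mock environments cannot fully reproduce the dynamic content of real apps (such as live data or network changes), which may affect generalization.
Human auditing and trajectory collection are costly; scaling up may require more automation.
Targets consumer apps; enterprise and system-level apps are not covered.
Compared with environments on real devices, mock environments may lack some system interactions (such as notifications or background processes).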

Relevance To Keywords:

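Relevant to "native multimodal large models" and "unified understanding and generation in multimodal large models": phone-use agents must process pixel-level visual input and produce actions, which involves multimodal understanding and generation.
Relevant to "world models": the mock environments PhoneWorld builds can be viewed as world models of phone apps that support state transitions and interaction, usable for training and evaluation.
Relevant to "representation learning": agents must learn effective visual representations from screenshots to guide actions.
Relevant to "reinforcement learning" and "post-training": the paper trains agents with SFT (supervised fine-tuning), which belongs to post-training; the environments could also be used for RL training (the paper does not use RL directly but mentions it as an extension).
Some connection to "Unify Models" and "Model-Based RL": environment construction can be viewed as a model-based approach, but the paper neither unifies models nor performs model-based RL, so the relevance is weak.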

63. Benchmarking Single-Factor Physical Video-to-Audio Generation (PASS)

Score: 33.0 / 27.8

Authors: Tingle Li, Siddharth Gururani, Kevin J. Shih, Gantavya Bhatt, Sang-gil Lee, Zhifeng Kong, Arushi Goel, Gopala Anumanchipalli, Ming-Yu Liu

Published: 2026-05-28

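TL;DR: This paper introduces FlatSounds, a benchmark that tests the physical reasoning of video-to-audio generation models, finds that models rely too heavily on text rather than the visual stream, and stresses the need to learn physical processes directly from pixels.

Abstract (Chinese Translation)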

生成式视频到音频（V2A）模型能够生成高度逼真的音频，但尚不清楚它们是否捕捉到了底层的物理过程。现有的评估方法强调感知真实性，却忽视了在受控干预下的物理正确性。本文引入 FlatSounds，这是一个用于评估 V2A 模型物理推理能力的基准，其包含两种设置：1) 控制反事实对，其中仅改变单一物理因素；2) 单视频模式测试，用于探测内部一致性和方向性趋势。这些设置旨在测试生成的音频是否正确反映了特定的物理属性和时序。我们对最先进模型的评估揭示了一致的权衡：模型在推断物理和语义时，更多地依赖文本字幕而非视觉流。字幕通常能提高物理和语义准确性，然而悖论性地却降低了时序对齐度。我们的结果强调了超越音频质量、直接从像素学习物理过程的必要性。最后，我们发现基于物理的指标与人类偏好测试高度相关。项目网页：https://research.nvidia.com/labs/cosmos-lab/flatsounds/

Abstract

Generative video-to-audio (V2A) models produce highly plausible soundtracks, but it remains unclear whether they capture the underlying physical processes. Existing evaluations emphasize perceptual realism and overlook physical correctness under controlled interventions. In this paper, we introduce FlatSounds, a benchmark that audits the physical reasoning of V2A models through: 1) controlled counterfactual pairs in which a single physical factor is varied, and 2) single-video pattern tests that probe internal consistency and directional trends. These settings test whether the generated audio correctly reflects specific physical properties and timings. Our evaluation of state-of-the-art models reveals a consistent trade-off: models rely more on text captions than the visual stream to infer physics and semantics. Captions generally improve physical and semantic accuracy, but paradoxically degrade temporal alignment. Our results highlight the need to move beyond audio quality toward learning physical processes directly from pixels. Finally, we find that our physics-based metrics correlate strongly with human preference tests on our own data. Project webpage: https://research.nvidia.com/labs/cosmos-lab/flatsounds/

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

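Scoring rationale: The paper's core is a physical-reasoning benchmark for video-to-audio generation (FlatSounds). It is highly relevant to MultiModal (vision, audio, and text are all involved), somewhat relevant to Visual Encoder and World Models (video stream processing and modeling of physical processes), and only weakly relevant to the remaining keywords such as Tokenizer, Unify Models, and model-based RL. The weighted total is 33.0, above the dynamic passing score of 27.8.

Keywords

Video-to-Audio, Physical Reasoning, FlatSounds, Benchmark, Counterfactual, Visual Stream, Generative Models

Deep Analysis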

Chinese Title: 单因素物理视频到音频生成的基准测试

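Summary: This paper introduces FlatSounds, a benchmark for evaluating how well video-to-audio (V2A) generation models understand physical processes. Existing evaluations focus on perceptual realism and semantic match and overlook physical correctness. FlatSounds audits models in two modes: (1) controlled counterfactual pairs, in which a single physical factor (such as material or container fill level) is varied while timing stays aligned; and (2) single-video pattern tests, which check internal consistency and directional trends (such as rising pitch). The authors evaluate several recent V2A models with physics-based metrics (such as attack time and decay rate) and temporal alignment metrics (hit coverage, timing error). They find that models depend heavily on text captions rather than the visual stream for physical reasoning: adding captions improves physical and semantic accuracy but degrades temporal alignment, and removing captions improves the alignment metrics. This suggests that current video encoders fail to learn physical processes effectively, and that models "cheat" through text instead of understanding physics from pixels. The benchmark highlights the need for V2A to move from surface plausibility toward physical causal understanding, and its physics-based metrics correlate strongly with human preference.

Innovations:

Proposes a new V2A evaluation paradigm that shifts from perceptual plausibility to physical correctness, using causal interventions to test how models respond to changes in physical factors.
Builds a dataset of time-warped counterfactual video pairs in which a single physical variable changes while impact timing stays aligned, enabling isolated causal analysis.
Designs a dual-mode evaluation protocol: counterfactual pairs test directional physical changes, and single-video tests check internal consistency and trends.
Reveals the core flaw that current V2A models rely on text captions for physical reasoning and ignore visual timing cues, and quantifies the trade-off between text and vision.
Introduces objective physics-based metrics (attack time, decay rate, temporal modulation, and others) and validates their correlation with human preference.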

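Methodology: The paper first builds the FlatSounds dataset, which contains a large number of indoor recordings of everyday object interactions with manually annotated impact times. Time warping is used to generate counterfactual pairs so that the same action stays temporally aligned across different physical conditions (such as material, container fill level, or room reverberation). For evaluation, several audio samples are generated per video, and an energy-envelope-based detector is used to compute hit coverage, timing error, and perfect-alignment rate. Physical correctness is assessed by checking whether the directional change in physical features of the generated audio (attack time, decay rate, temporal modulation, fundamental frequency, reverberation time, and so on) across a counterfactual pair matches expectation. Single-video tests are also designed: repeated identical impacts test consistency, and trends such as rising pitch test directionality. Several SOTA models (such as Diff-Foley, FoleyCrafter, and ThinkSound) are evaluated under different conditions (with and without captions).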
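The temporal alignment metrics can be pictured with a small matching routine. The definitions below are assumptions made for illustration, since the report does not give the paper's exact formulas: hit coverage is taken as the fraction of annotated impacts that have a detected onset within a tolerance window, and timing error as the mean absolute offset of those matched onsets. The 0.1 s tolerance and all names are hypothetical.

```python
def alignment_metrics(impact_times, onset_times, tol=0.1):
    """Return (hit_coverage, timing_error) for annotated impacts versus
    onsets detected in the generated audio. Times are in seconds."""
    errors = []
    for t in impact_times:
        if not onset_times:
            break
        nearest = min(onset_times, key=lambda o: abs(o - t))
        if abs(nearest - t) <= tol:          # counts as a hit
            errors.append(abs(nearest - t))
    coverage = len(errors) / len(impact_times) if impact_times else 0.0
    timing_error = sum(errors) / len(errors) if errors else None
    return coverage, timing_error


# Four annotated impacts; the generated audio has onsets near only two of them.
impacts = [0.5, 1.5, 2.5, 3.5]
onsets = [0.52, 1.45, 3.0]

coverage, timing_error = alignment_metrics(impacts, onsets)
print(coverage, timing_error)
```

In a full evaluation the onsets would come from an onset detector run on each generated sample, and the metrics would be averaged over samples and videos.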

Key Results:

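For all models in the counterfactual-pair tests, adding captions markedly improves physical correctness (such as the rate of correct directional change in attack time and fundamental frequency), while removing captions sharply reduces it.
Temporal alignment metrics (hit coverage, timing error) improve when captions are removed, indicating that text interferes with the model's use of visual timing.
ThinkSound (which relies on LLM-generated textual reasoning) performs best on physical correctness but worst on temporal alignment.
In single-video tests, models are moderate on repeated-impact consistency and weak on pitch trends; without captions they almost never capture rising pitch.
The physics-based metrics correlate strongly with human preference tests, supporting their validity.

Tech Stack:

Time warping to align impact times across videos
Energy-envelope-based impact detector (onset strength detector + envelope fallback)
Physical feature extraction: attack time, decay rate, temporal modulation, fundamental frequency (F0), reverberation time (RT60)
Evaluation metrics: Hit Coverage, Timing Error, Perfect Align
Compared models: Diff-Foley, FoleyCrafter, ThinkSound, MMAudio, and others
Pre-trained models: CLAP, Synchformer (as references for semantic and temporal alignment)
Dataset construction: indoor recordings with manually annotated impact times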

Strengths:

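Brings causal intervention and counterfactual reasoning into V2A evaluation, filling the gap in physical-correctness assessment.
The dataset is carefully designed; time warping isolates a single variable, and the experimental design is rigorous.
Exposes the serious flaw that current models depend on text, pointing future work toward better video encoders.
The physics-based metrics agree with human preference, giving them practical value.
The evaluation protocol can extend to other physical factors and more complex interactions.

Limitations:

The dataset is limited to indoor everyday object interactions and does not cover outdoor or natural scenes or more complex physical processes (such as fluids or gases).
The varied physical factors cover only a few attributes (material, fill level, environment, and so on), not all physical dimensions.
Time warping may introduce artifacts that affect how models perceive the video.
Evaluation depends on manually annotated impact times, which limits scalability.
Internal mechanisms (such as attention weights) are not analyzed in depth; conclusions are inferred from output behavior only.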

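Relevance To Keywords: The paper is highly relevant to the keywords. First, it bears directly on the core idea of world models: a model must implicitly simulate physical processes from video to generate correct audio, and the benchmark evaluates whether models have physical causal understanding. Second, representation learning is a focus of the paper's critique: current video encoders fail to learn effective physical representations, which leads models to rely on text and motivates better visual representation learning. Regarding unified understanding and generation in multimodal large models (native multimodal large models), the evaluated V2A models are multimodal generative models, and the paper points out a disconnect between their understanding (physical reasoning) and generation (audio). Model-based RL is not directly involved, but world models are closely tied to reinforcement learning, and physically correct audio generation can be seen as part of a world model. Regarding post-training, the benchmark can guide the direction of post-training or fine-tuning. Overall, the paper provides a key evaluation tool for world models and representation learning and exposes a core problem in multimodal models.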

64. GenClaw: Code-Driven Agentic Image Generation (PASS)

Score: 33.0 / 27.8

Authors: Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen, Weijia Li

Published: 2026-05-28

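TL;DR: GenClaw proposes a code-driven agentic image generation paradigm that integrates reasoning, code sketches, and texture completion to overcome the lack of controllability in black-box image models.

Abstract (Chinese Translation)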

图像生成模型已从基于文本的条件像素合成演变为具备视觉理解和工具调用能力的多模态智能体。然而，现有的智能体仍受制于底层的黑盒图像模型。它们的工作流程被困在用于生成细化的提示词重写循环中，缺乏直接操控画布的机制。本质上，大语言模型（LLMs）作为真正“画笔”用于精确视觉构建的潜力仍未被充分利用。本文提出 GenClaw，一种代码驱动的智能体图像生成范式，使智能体能够像人类艺术家一样创作：先进行概念化，再绘制草图，最后着色。具体来说，智能体首先通过搜索和推理构建概念性知识和上下文。随后，它利用代码（例如 SVG、HTML、Three.js）渲染可执行的视觉草图。最后，它采用图像生成模型来补充纹理、材质及照片级真实感。在此工作流程中，代码充当可控的中间画布，连接语言推理与像素合成，无缝整合程序逻辑与生成模型的视觉表现力。通过将图像生成从黑盒范式转变为类似于真实人类创作的分阶段过程，GenClaw 为构建高度可控且可解释的视觉生成系统迈出一步。

Abstract

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, Three.js) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

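Scoring rationale: The paper proposes a code-driven agentic image generation paradigm that involves interaction among text, code, and images (high relevance to MultiModal) and uses an LLM for reasoning (moderate relevance to MLLM). It does not involve reinforcement learning (model-based RL not relevant) and does not focus on encoder or tokenizer architecture (low relevance to Tokenizer and Visual Encoder). It integrates a workflow without unifying the model architecture (moderate relevance to Unify Models) and is not a world model (low relevance to World Models). None of the specified expert authors (Yang Shi and others) were found, so no bonus is added.

Keywords

Code-Driven, Agentic Image Generation, Controllable Generation, Multimodal, LLM Agent, SVG, Intermediate Canvas

Deep Analysis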

Chinese Title: GenClaw：代码驱动的智能体图像生成

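Summary: This paper proposes GenClaw, a code-driven agentic image generation paradigm that addresses the limits of existing image generation agents, which depend on black-box models and iterate by trial and error through prompt rewriting alone. GenClaw mimics a human artist's process: it first builds conceptual knowledge and context through search and reasoning (conceptualizing), then renders executable visual sketches with code such as SVG, HTML, or Three.js (sketching), and finally uses an image generation model to add textures, materials, and photorealism (coloring). Code serves as a controllable intermediate canvas that bridges linguistic reasoning and pixel synthesis, turning image generation from a black-box paradigm into a staged, human-like workflow. Experiments show that GenClaw is more controllable and interpretable on tasks such as complex scene composition, text rendering, physical simulation, and layered image editing.

Innovations:

Proposes a code-driven agentic image generation paradigm that decouples generation into three stages (conceptualizing, sketching, coloring), mimicking the human artistic process.
Uses code (SVG, HTML, Three.js) as the intermediate representation, providing explicit structure, logical rigor, and editability, and overcoming the spatial ambiguity of natural language.
Positions the image generation model as a "colorist" that adds texture and realism without guessing image structure from scratch, improving controllability and stability.
Delivers a transparent generation pipeline in which failures can be traced to their cause (search error, faulty code logic, or rendering deviation), improving interpretability.
Demonstrates performance beyond conventional black-box models on complex scene composition, text layout, physical simulation, and layered editing.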

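Methodology: GenClaw uses a three-stage agent architecture. (1) Conceptualizing: the agent gathers entity knowledge and context through search and reasoning. (2) Sketching: the LLM writes code (SVG/HTML/Three.js) to produce a structured visual sketch with precise control over position, size, text layout, stacking order, and 3D physical rules. (3) Coloring: an image generation model renders textures, materials, and realism on top of the code sketch. The whole workflow is driven by a coding agent and supports iterative debugging and local edits.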
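The sketching stage treats code as an exact, editable canvas. The snippet below illustrates the idea only, using Python to emit a small SVG in which object count, position, and stacking order are fixed by the code itself. The scene, the `sketch_svg` helper, and the object fields are hypothetical and do not come from GenClaw.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def sketch_svg(objects, width=400, height=300):
    """Render a list of circle specs as an SVG string.
    Later objects are drawn on top of earlier ones."""
    parts = [f'<svg xmlns="{SVG_NS}" width="{width}" height="{height}">']
    for o in objects:
        parts.append(
            f'<circle cx="{o["x"]}" cy="{o["y"]}" r="{o["r"]}" fill="{o["fill"]}"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# A prompt such as "three apples in a row, the middle one larger" becomes
# explicit coordinates and sizes instead of a description a model must guess.
apples = [
    {"x": 100, "y": 150, "r": 30, "fill": "red"},
    {"x": 200, "y": 150, "r": 45, "fill": "red"},
    {"x": 300, "y": 150, "r": 30, "fill": "red"},
]

svg_text = sketch_svg(apples)
root = ET.fromstring(svg_text)                 # parses only if well-formed
circles = root.findall(f"{{{SVG_NS}}}circle")
print(len(circles))
```

Because the sketch is plain markup, a failed generation can be inspected and a single element edited without regenerating the whole image; the coloring model then only has to add texture on top of the fixed layout.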

Key Results:

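In complex scene composition, controlling object counts and spatial relations through code mitigates the counting and spatial hallucinations of conventional models.
In text rendering, SVG or HTML code gives more reliable control over fonts, alignment, and hierarchy and reduces spelling errors.
In physical simulation, Three.js is used to build 3D scenes that help express lighting and perspective.
In layered image editing, a structured JSONL format decomposes image layers, enabling precise local edits that leave unmodified regions untouched.
When generation fails, the failure can be traced to the search, code, or rendering stage, giving transparent error localization.

Tech Stack:

SVG (Scalable Vector Graphics)
HTML (HyperText Markup Language)
Three.js (3D JavaScript library)
JSONL (structured layered data format)
Large language model (LLM) as the coding agent
Image generation model (such as a diffusion model) for coloring
Search and reasoning tools (for the conceptualizing stage)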

Strengths:

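Turns image generation from black-box random sampling into a staged, controllable workflow, markedly improving interpretability and debuggability.
Exploits the explicit structure of code to resolve the ambiguity of natural language about space, quantity, and layout.
Combines the programming ability of LLMs with the visual expressiveness of image generation models so that they complement each other.
Supports advanced tasks including complex scenes, text rendering, physical simulation, and layered editing.
Open-source code (GitHub) eases community reproduction and extension.

Limitations:

The code generation stage may produce logic errors or inconsistent renderings and needs additional debugging mechanisms.
Places high demands on the LLM's code generation ability; code quality may be unstable in complex scenes.
The coloring stage still depends on existing image generation models, so style and realism are bounded by the underlying model.
The three-stage workflow adds generation latency and compute overhead.
The work is currently a technical report and lacks large-scale quantitative evaluation and systematic comparison with baselines.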

Relevance To Keywords:

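Unify Models: GenClaw brings understanding (conceptualizing, reasoning) and generation (code sketching, coloring) together in one agent framework, reflecting the trend toward unified understanding and generation in multimodal large models.
World Models: Simulating physical rules and 3D scenes through code (such as Three.js) can be seen as a lightweight application of world models.
Representation Learning: Code acts as an intermediate representation, and mapping natural language to structured visual code touches on representation learning.
Model-Based RL: Reinforcement learning is not used directly, but the agent iteratively refines generation through search, reasoning, and feedback, which resembles the idea of model-based RL.
Post-training: The paper does not explicitly mention post-training, but the coding agent's ability depends on the LLM's pre-training and instruction tuning, which can be grouped under post-training.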

65. A Predictive Law for On-Policy Self-Distillation From World Feedback (PASS)

Score: 31.5 / 27.8

Authors: Tommy He, Jerome Sieber, Matteo Saponati

Published: 2026-05-28

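TL;DR: This paper proposes a predictive linear law based on the initial student-teacher performance gap, enabling efficient tuning of world feedback in RL post-training without running full training.

Abstract (Chinese Translation)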

从简单的标量奖励转向更丰富的世界反馈，是实现更可扩展的强化学习（RL）后训练的自然路径。在线策略自蒸馏（OPSD）是一种有前景的近期方法，它使用任意反馈作为学习信号，但其相对于已有方法（如 GRPO）的可靠性尚不明确。我们发现，在 OPSD 中，学生 - 自教师初始性能差距与最终性能提升之间存在惊人一致的线性相关性。这种关系在不同上下文类型和模型家族中均成立，提供了一种强大的预测法则，可在无需运行完整训练流程的情况下预测 OPSD 配置的结果。有趣的是，我们表明这种线性可预测性随模型规模成立，暗示了针对具有更强上下文内学习（ICL）能力的更大模型建立新经验缩放律的潜在基础。本质上，我们的发现表明 OPSD 性能可在训练前进行预测和调整，提供了一种将世界反馈作为后训练流程的核心组成部分进行整合的合理方式。

Abstract

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	7.0/10	10.5
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	6.0/10	9.0

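Scoring rationale: The paper's core is a self-distillation algorithm and world feedback in reinforcement learning post-training, which makes it highly relevant to World Models (world feedback) and model-based RL (reinforcement learning background). It touches on scaling laws for large models and on post-training, giving some connection to Unify Models and MLLM. It does not involve multimodal architectures, tokenizers, or visual encoders, so those relevance scores are 0.

Keywords

On-Policy Self-Distillation, World Feedback, Predictive Law, RL Post-training, Scaling Laws, Student-Teacher Performance Gap, Model Families

Deep Analysis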

Chinese Title: 基于世界反馈的在策略自蒸馏的预测定律

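Summary: The paper proposes a predictive law for estimating the final student performance gain in on-policy self-distillation (OPSD). OPSD is a post-training method that uses arbitrary world feedback as the learning signal, but its reliability has been unclear. The authors find a strong linear correlation between the initial student-self-teacher performance gap and the final student improvement; the relationship holds consistently across context types and model families and stays stable with model scale. Experiments with Qwen3 and Olmo3 models on the LiveCodeBench benchmark validate the law. The finding allows an OPSD configuration to be assessed before full training, making world feedback a reliable first-class component of the post-training pipeline.

Innovations:

First to identify a strong linear predictive relationship between the initial student-self-teacher gap and the final performance gain, so OPSD outcomes can be estimated without full training.
Shows that the relationship holds across different kinds of privileged context (such as environment feedback and peer solutions) and different model families (Qwen3, Olmo3).
Finds that the law remains linear across model scale, suggesting it can serve as a reliable predictor of gains across scales.
Proposes a lightweight way to screen many privileged-context configurations and avoid expensive post-training runs.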

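Methodology: The study uses the OPSD framework, in which the student policy and the self-teacher share weights (updated via exponential moving average) and the self-teacher is conditioned on a privileged context (containing extra information such as world feedback). Qwen3 (0.6B-8B) and Olmo3-7B-Instruct are post-trained for 50 steps on LiveCodeBench with six privileged-context constructions (no context, expert preamble, feedback, LLM peer hint, own solution plus feedback, and peer solution plus feedback). The initial student-self-teacher accuracy gap and the final student accuracy gain (mean@4) are measured, and predictive power is assessed with an ordinary least squares linear fit and leave-one-out cross-validation.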
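The evaluation procedure, an ordinary least squares line plus leave-one-out cross-validation over the six context configurations, can be sketched as follows. The data points are synthetic and exactly linear, chosen only so that the code is easy to check; they are not the paper's measurements, and the helper names are hypothetical.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def r_squared(x, y):
    """Coefficient of determination of the fitted line."""
    slope, intercept = fit_line(x, y)
    resid = y - (slope * x + intercept)
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)


def loocv_rmse(x, y):
    """Leave-one-out CV: refit without point i, then predict point i."""
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        slope, intercept = fit_line(x[keep], y[keep])
        errors.append(y[i] - (slope * x[i] + intercept))
    return float(np.sqrt(np.mean(np.square(errors))))


# Synthetic, illustrative values: one point per privileged-context configuration.
initial_gap = np.array([0.00, 0.02, 0.05, 0.08, 0.10, 0.12])
final_gain = 1.5 * initial_gap + 0.01

slope, intercept = fit_line(initial_gap, final_gain)
r2 = r_squared(initial_gap, final_gain)
rmse = loocv_rmse(initial_gap, final_gain)
print(slope, intercept, r2, rmse)
```

With a fitted line in hand, a new configuration's expected gain is obtained by measuring only its initial gap and evaluating `slope * gap + intercept`, which is what makes screening configurations cheap.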

Key Results:

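The initial student-self-teacher gap and the final student gain are strongly linearly correlated: R²=0.949 and Pearson r=0.974 for Qwen3-8B; R²=0.996 and Pearson r=0.998 for Olmo3-7B-Instruct.
The linear relationship holds across all six privileged-context types.
Across model scale (Qwen3 0.6B to 8B) the relationship stays stable, with R²=0.977.
During training, student performance gradually converges to the self-teacher's level.
Leave-one-out cross-validation shows good predictive generalization (RMSE=0.016 for Qwen3-8B, RMSE=0.003 for Olmo3-7B-Instruct).

Tech Stack:

OPSD (On-Policy Self-Distillation)
Reverse KL divergence
Exponential moving average (EMA)
Ordinary least squares (OLS) linear fit
Pearson and Spearman correlation coefficients
Leave-one-out cross-validation (LOOCV)
LiveCodeBench v6 benchmark
Qwen3 model series (0.6B/1.7B/4B/8B)
Olmo3-7B-Instruct model
mean@4 evaluation metric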

Strengths:

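Finds a simple and practical predictive law that markedly lowers the trial-and-error cost of OPSD configuration.
Systematic experimental design covering several context types, model families, and scales, which supports generality.
Offers a theoretical explanation: OPSD exploits the intrinsic consistency between the student and the self-teacher, which yields a clean linear relationship.
Directly useful for post-training practice, since the performance ceiling can be estimated in advance.

Limitations:

Experiments are limited to a coding task (LiveCodeBench); applicability to other domains (such as math or dialogue) is not verified.
Model scale reaches only 8B; the relationship is not yet verified for larger models (such as 70B).
The law depends on an accurate measurement of the initial gap, which is itself affected by decoding parameters and may introduce noise.
Does not analyze in depth why different privileged contexts yield different slopes (for example, a slope of 1.492 for Qwen3 versus 0.663 for Olmo3).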

Relevance To Keywords:

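Reinforcement learning: OPSD is a reinforcement learning post-training method that uses world feedback as the learning signal and is closely related to RL methods such as RLVR.
Post-training: The paper focuses on the LLM post-training stage and proposes a predictive law to optimize the post-training pipeline.
World models: World feedback (such as environment errors and unit-test results) can be viewed as the output of a world model; the paper uses this feedback to build privileged contexts, which relates it to the world-model concept.
Representation learning: OPSD implicitly learns better representations through self-teacher distillation, but the paper does not address representation learning theory directly.
Multimodal large models: The paper handles text (code) only and involves no multimodality, though the methodology could extend to multimodal settings.
Unified understanding and generation: Not discussed directly, though OPSD can be applied to generation tasks.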

66. Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation (PASS)

Score: 31.5 / 27.8

Authors: Haoyue Yang, Zhangxiao Shen, Fan Ding, Hangting Lou, Yifeng Kou, Haoqing Yu, Jingyao Li, Zhengfan Wu, Siqi Bao, Jing Liu, Hua Wu

Published: 2026-05-28

TL;DR: Cookie-Bench introduces a reference-free, multimodal evaluation benchmark for web generation that leverages autonomous agent interaction and screen capture to assess LLM performance without relying on reference code.

Abstract (Chinese Translation)

前端 Web 代码已成为每个前沿大语言模型（LLM）发布的核心展示面，但由于 Arena 等人工评判排行榜难以扩展，因此在开发迭代速度上评估这些交互式应用仍然成本高昂。现有的自动化代理通常依赖于参考实现、测试套件或严格检查清单，往往难以捕捉人类评审员在实时会话中所进行的推理综合。我们提出了一种新的评估范式，该范式同时具备无参考、自主驱动和整体推理的特征，并通过两个工具实现了这一范式。基准 Cookie-Bench 是一个涵盖 11 个领域、54 个叶子节点、1000 个查询的 WebDev 基准，任务范围包括静态展示和交互式应用，平衡了三个难度层级和三个目标语言组，且任务说明经过重写，以抵抗来自传播提示的记忆。配套的评估框架基于弗拉维尔的元认知监控，将证据积累与判断分为三个阶段进行分离：静态感知（Static Perception）通过被动观察形成第一印象；代理驱动交互（Agent-Driven Interaction）自主探索应用，同时捕获连续的屏幕视频、音频及每步截图；动态评分（Dynamic Scoring）仅在证据链完成后，发布整体功能与美学判定，并附带结构化失败归因。在该基准上，该框架与专家人类评级高度一致，同时在 13 个前沿大语言模型（LLM）的交互式 Web 生成任务上揭示了显著的性能提升空间。https://anonymous.4open.science/r/Cookie-3CE/

Abstract

Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. Cookie-Bench is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. Cookie, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On Cookie-Bench, Cookie aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. https://anonymous.4open.science/r/Cookie-3CE/

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper proposes Cookie-Bench, a reference-free evaluation benchmark for web generation using LLMs. It scores highest on MultiModal (7.0) and MLLM (5.0) because it uses screen video, audio, and screenshots to evaluate multimodal agent performance. Scores are lower for Unify Models (1.0), Tokenizer (1.0), World Models (2.0), and model-based RL (2.0) because the paper focuses on evaluation methodology rather than proposing new model architectures, tokenization schemes, generative world models, or reinforcement learning algorithms. Visual Encoder (3.0) is somewhat relevant because visual perception is used but is not the core contribution. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found in the author list, so no bonus points were added. The weighted total score is 31.5, which exceeds the dynamic passing score of 27.8.
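The weighted total quoted above can be reproduced from the score table. The sketch below assumes the total is simply the sum of weight × relevance and that a paper passes when the total reaches the threshold; the expert-author bonus mentioned in the rationale is not modelled.

```python
# Rows copied from the score table above: (keyword, weight, relevance out of 10).
rows = [
    ("Unify Models", 1.5, 1.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 3.0),
    ("World Models", 1.5, 2.0),
    ("MLLM", 1.5, 5.0),
    ("MultiModal", 1.5, 7.0),
    ("model-based RL", 1.5, 2.0),
]
PASS_THRESHOLD = 27.8  # dynamic passing score quoted in the report

def weighted_total(rows):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * relevance for _, weight, relevance in rows)

total = weighted_total(rows)
# Assumption: ">=" is the pass rule; the report only says the score "exceeds" the threshold.
status = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, status)
```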

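Keywords

Web Generation, Evaluation Benchmark, Multimodal Evaluation, Autonomous Interaction, Reference-free, Screen Capture, LLM Assessment

In-Depth Analysis

Chinese Title: Cookie-Bench：面向网页生成的连续屏幕关键交互评估基准

Summary: The paper addresses the high cost of evaluating front-end web code, and the reliance of existing evaluation on human judges or reference implementations, by proposing an evaluation regime that is reference-free, autonomously driven, and holistically reasoned. It delivers two artifacts. The Cookie-Bench benchmark contains 1,000 queries across 11 domains and 54 leaf nodes, covers both static-presentation and interactive-application tasks, and is grouped by difficulty and language. The Cookie evaluator, grounded in Flavell's theory of metacognitive monitoring, works in three stages (Static Perception, Agent-Driven Interaction, and Dynamic Scoring) and scores holistically from multimodal evidence (screen recordings, audio, screenshots). Experiments on 13 frontier LLMs show that Cookie aligns closely with expert human ratings and reveals substantial headroom in interactive web generation.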

Innovations:

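Proposes, for the first time, a web-code evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, dispensing with reference implementations, test scripts, and predefined checklists.
Builds Cookie-Bench, a 1,000-query benchmark covering 11 domains and 54 leaf nodes, balanced across static and dynamic tasks, three difficulty tiers, and three target-language groups, with queries rewritten to prevent recall of memorized prompts.
Designs the Cookie evaluator, which separates evidence collection from judgment in three stages: Static Perception (passive observation), Agent-Driven Interaction (autonomous exploration that captures continuous video, audio, and screenshots), and Dynamic Scoring (holistic functionality and aesthetics scoring with failure attribution).
Systematically evaluates 13 frontier LLMs and reveals a performance gap between React generation and direct HTML output, offering a new signal for diagnosing model capability.

Methodology: The paper uses a three-stage evaluation method. (1) Static Perception: load the deployed page and form a first impression through passive observation. (2) Agent-Driven Interaction: a computer-using agent autonomously plans an exploration trajectory while recording continuous screen video, audio, per-step screenshots, and the interaction trace. (3) Dynamic Scoring: once the full evidence chain has been collected, produce holistic functionality and aesthetics scores together with a structured failure attribution. The benchmark is built on an 11-domain, 54-leaf taxonomy, with data collected from two channels (natural user queries and crowdsourced synthesis) and then passed through deduplication, LLM filtering, and expert review.

Key Results:

Cookie-Bench contains 1,000 queries, 514 from natural user traffic and 486 from crowdsourced synthesis, covering 11 domains.
Across 13 frontier LLMs, the overall win rates under Cookie show that React-scaffold generation outperforms direct HTML chat output, with a gap that varies by model.
Cookie aligns closely with expert human ratings and effectively distinguishes model capability on static and dynamic tasks.
Interaction videos expose defects that static evaluation cannot find (for example, a jump distance that is too short in a Super Mario game).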

Tech Stack:

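SimHash (text deduplication)
TF-IDF (semantic deduplication)
LLM-as-a-Judge (filtering and scoring)
Computer-using agent
Multimodal evidence capture (screen recording, audio, screenshots)
Flavell's metacognitive monitoring theory (basis of the evaluation framework)
Comparison of React-scaffold generation with direct HTML output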
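To make the SimHash deduplication step concrete, here is a minimal sketch of a 64-bit SimHash over whitespace tokens. The tokenization, the use of MD5 as the token hash, and the Hamming-distance threshold of 3 are all illustrative assumptions; the report does not say how the paper configures SimHash.

```python
import hashlib

def simhash(text, bits=64):
    """64-bit SimHash: each token votes +1/-1 on every bit of the fingerprint."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def is_near_duplicate(text_a, text_b, max_distance=3):
    """Hypothetical threshold: treat queries within 3 bits as duplicates."""
    return hamming(simhash(text_a), simhash(text_b)) <= max_distance

print(is_near_duplicate("Build a todo app with drag and drop",
                        "build a TODO app with drag and drop"))
```

Because this variant treats the text as a bag of lowercase tokens, it ignores case and word order; a production pipeline would more likely hash weighted n-grams.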

Strengths:

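The evaluation regime is entirely reference-free, avoiding the bias and cost of depending on reference implementations.
Autonomously driven interaction can uncover dynamic behavioral defects that static evaluation cannot capture.
Multimodal evidence (video, audio, screenshots) provides rich evaluation information that approximates the experience of a human reviewer.
The benchmark has broad coverage and sensible difficulty tiers, and its queries are rewritten to prevent recall of memorized prompts, so the evaluation reflects real capability.
Close agreement with expert human ratings validates the method.

Limitations:

The evaluation relies on an LLM/VLM as the scorer, which may introduce the model's own biases and limitations.
The exploration strategy of the agent-driven interaction may not be comprehensive and can miss key interaction paths.
The benchmark size (1,000 queries) is still limited relative to real-world web development.
Generalization across languages and frameworks is not considered (only React and HTML are covered).
Computational cost is high, since pages must be deployed and agent interactions run.

Relevance To Keywords: The paper mainly concerns the evaluation of web generation and has low relevance to the given research keywords (Unify Models, World Models, Representation Learning, Model-Based RL, native multimodal large models, and so on). However, its multimodal evidence (video, audio, screenshots) and agent interaction draw on the understanding abilities of multimodal large models, and the metacognitive-monitoring idea in the evaluation framework has some connection to representation learning. Overall, the paper leans toward evaluation methodology and benchmark construction rather than core multimodal-model or world-model research.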

67. SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation [PASS]

Score: 31.5 / 27.8

Authors: Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu

Published: 2026-05-28

TL;DR: SGMD proposes a score gradient matching distillation method to accelerate video diffusion inference while preserving motion dynamics better than existing distribution matching methods.

摘要翻译

分布匹配蒸馏（Distribution Matching Distillation, DMD）是一种广泛用于加速少步视频扩散模型推理的范式。然而，DMD 风格的视频蒸馏面临两个耦合挑战：假分数必须跟踪持续演变的生成器，当需要频繁更新时训练成本高昂；而反向 KL（reverse-KL）风格的匹配可能倾向于寻求模式且过于保守，不利于保留强运动动力学。为了解决这些问题，我们提出分数梯度匹配蒸馏（Score Gradient Matching Distillation, SGMD）。SGMD 采用假分数视角，通过将假分数直接优化至教师模型，同时使用教师停止梯度费雪（Fisher）信息作为稳定的分布匹配目标。我们提供了梯度分析，论证了在理想跟踪条件下选择该目标的合理性。在此基础上，SGMD 引入了一对双势函数：负残差（Negative-Residual, NR）用于外循环校正，残差收缩（Residual-Contraction, RC）用于内循环跟踪。实验结果表明，与 DMD2 相比，SGMD 实现了约 3 倍的训练加速，并在保持时间一致性的同时，显著改进了 4 步蒸馏模型的运动动力学。人类评估确认，SGMD 在运动质量和整体偏好上更受青睐，而视觉质量和文本对齐则保持相当。代码已开源，网址为 https://github.com/ModelTC/LightX2V。

Abstract

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately 3× training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Score Gradient Matching Distillation for video diffusion models. It relates moderately to Visual Encoder (core component), World Models (generative world dynamics), MultiModal (visual-temporal data), and Unify Models (distillation unifies teacher/student knowledge). Tokenizer is less relevant and is connected only through latent diffusion mechanisms. The paper does not involve Multimodal Large Language Models (MLLM) or Model-Based Reinforcement Learning, hence the 0 scores for those. The weighted total exceeds the passing threshold.

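Keywords

Score Gradient Matching, Video Diffusion, Distillation, Few-Step Inference, Motion Dynamics, Teacher-Student, Distribution Matching

In-Depth Analysis

Chinese Title: SGMD：基于分数梯度匹配的少步视频扩散蒸馏

Summary: The paper targets two coupled challenges that Distribution Matching Distillation (DMD) faces when accelerating few-step video diffusion models: the fake score must keep tracking a rapidly evolving generator, which makes training costly, and reverse-KL matching tends to be conservative and suppresses motion dynamics. It proposes Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective, directly optimizing the fake score toward the teacher, and uses the teacher stop-gradient Fisher divergence as a stable distribution-matching objective. A gradient analysis motivates this objective under ideal tracking, and the method introduces dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Experiments show that, compared with DMD2, SGMD trains roughly 3× faster for 4-step distillation and substantially improves motion dynamics while preserving temporal consistency. A human preference study confirms its advantage in motion quality and overall preference.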

Innovations:

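Proposes the teacher stop-gradient Fisher divergence as a stable distribution-matching objective, which avoids unreliable teacher input gradients and agrees with the reverse-KL direction under ideal tracking.
Adopts a fake-score perspective, turning the traditional generator-led tracking into a joint alignment of the fake score and the generator, which lowers the tracking cost.
Introduces a dual-potential mechanism (negative-residual NR and residual-contraction RC) for outer-loop correction and inner-loop tracking respectively, enabling a lightweight two-step update.
Validated on a 14B-parameter teacher model, with roughly 3× faster training and improved motion dynamics while preserving temporal consistency.

Methodology: SGMD first adopts the teacher stop-gradient Fisher divergence as the outer-loop objective, which avoids unstable teacher gradients. It then takes the fake-score perspective, treating the fake score as the primary optimization target and the generator as the tracker. A gradient analysis reveals the bias caused by tracking lag, which motivates the dual potentials: negative-residual (NR) corrects the outer-loop generator update, and residual-contraction (RC) shrinks the tracking residual of the fake score in the inner loop. The result is a lightweight two-step bilevel update: update the generator first (NR correction), then update the fake score (RC contraction), which greatly reduces the number of fake-score updates.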
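For readers unfamiliar with the Fisher divergence mentioned above, the sketch below estimates the standard definition, the expected squared difference between two score functions, for a pair of 1-D Gaussians. It is a toy illustration only: it does not model the stop-gradient, the fake score network, or the NR/RC potentials, and none of the function names come from the paper.

```python
import random

def gaussian_score(x, mean, std):
    """Score function d/dx log N(x; mean, std^2)."""
    return -(x - mean) / (std ** 2)

def fisher_divergence(sample_mean, sample_std, other_mean, other_std,
                      n=10_000, seed=0):
    """Monte Carlo estimate of E_x[(score_a(x) - score_b(x))^2], x ~ N(sample_mean, sample_std^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(sample_mean, sample_std)
        diff = (gaussian_score(x, sample_mean, sample_std)
                - gaussian_score(x, other_mean, other_std))
        total += diff * diff
    return total / n

# With equal variances the score difference is constant, so the estimate
# matches the closed form (mean_a - mean_b)^2 / std^4 regardless of the samples.
print(fisher_divergence(0.0, 1.0, 1.0, 1.0))
```

Unlike a KL divergence, this quantity depends only on score functions, which is why it fits naturally with diffusion models that are trained to predict scores.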

Key Results:

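Compared with DMD2, SGMD trains roughly 3× faster for 4-step distillation.
Substantially improves motion dynamics while preserving temporal consistency; the human preference study favors SGMD on motion quality and overall preference.
Visual quality and text alignment are comparable to DMD2.
Effectiveness is validated on the 14B-parameter Wan2.1-T2V-14B teacher model.

Tech Stack:

Distribution Matching Distillation (DMD)
Fisher divergence
Score matching
Reverse KL divergence
Stop-gradient
Dual potentials (negative-residual NR, residual-contraction RC)
Two-step bilevel update
Video diffusion model (Wan2.1-T2V-14B)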

Strengths:

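Addresses the high training cost caused by frequent fake-score updates in DMD and achieves a significant speedup.
The dual-potential mechanism effectively mitigates tracking lag, improves motion dynamics, and avoids conservative generation.
Solid theoretical analysis that explains the bias of different objectives from a gradient perspective and provides a correction.
Validated on a large video diffusion model, so it has practical deployment value.

Limitations:

The method depends on the teacher stop-gradient Fisher divergence, which may be less accurate than the full Fisher divergence in some scenarios.
The dual-potential mechanism requires additional hyperparameter tuning (such as the NR/RC weights), which may be costly.
It targets video diffusion distillation only; generalization to images or other modalities is not verified.
The training speedup comes mainly from fewer fake-score updates, while the generator update itself still carries some computational overhead.

Relevance To Keywords:

The paper concerns video diffusion distillation, which belongs to multimodal generation and has some connection to "native multimodal large models" and "unified understanding and generation in multimodal large models" (video generation is an important branch of multimodality).
Its distribution matching and score matching can be viewed as a form of representation learning (learning to match the generated and real distributions), which relates it to "representation learning".
World models usually involve predicting and generating the environment, and video generation can serve as part of a world model, but the paper does not discuss world models directly.
Reinforcement learning and post-training are not covered, so relevance is low.
Overall, the paper is related to multimodal generation and representation learning, and only weakly related to world models and reinforcement learning.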

68. Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language [PASS]

Score: 31.5 / 27.8

Authors: Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Zichuan Xu, Wenzheng Xu, Junyang Chen, Renfu Li

Published: 2026-05-28

TL;DR: The paper proposes SpotVMR, an efficient cross-modal clip trimming module that reduces steps and improves boundary accuracy in video moment retrieval by learning to identify promising video regions conditioned on language queries.

摘要翻译

给定一个未修剪视频和一个句子查询，基于语言的视频时刻检索（VMR）旨在定位与查询相关的目标时刻。由于未修剪视频过长，几乎所有现有的 VMR 方法首先将每个未修剪视频稀疏下采样为多个固定长度的视频片段，然后与查询特征和计算代价高昂的片段特征进行多模态交互以进行推理，这对于跨越数小时的长真实视频而言是不可行的。由于视频被下采样为固定长度的片段，一些与查询相关的帧可能会被过滤掉，这将模糊目标时刻的具体边界，将相邻的无关帧作为新边界，容易导致跨模态错位（cross-modal misalignment），并引入边界偏差（boundary-bias）和推理偏差（reasoning-bias）。为此，本文提出了一种高效的方法 SpotVMR，用于裁剪与查询相关的片段。此外，所提出的 SpotVMR 可作为即插即用模块（plug-and-play module），在保持良好检索性能的同时，使最先进的 VMR 方法具备高效性。首先，我们设计了一种新颖的片段搜索模型，该模型学习基于语言查询识别有潜力的视频区域进行搜索。然后，我们引入一组低成本的语义索引特征，以捕捉对象和交互的上下文，这些上下文暗示了搜索与查询相关时刻的位置。此外，利用蒸馏损失（distillation loss）来解决片段选择器和 VMR 模型端到端联合训练产生的优化问题。在三个具有挑战性的数据集上的广泛实验证明了其有效性。

Abstract

Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate a target query-relevant moment. Since the untrimmed video is overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length video clips and then conduct multi-modal interactions with the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Since the video is down-sampled into fixed-length clips, some query-related frames may be filtered out, which blurs the specific boundary of the target moment and takes adjacent irrelevant frames as new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, in this paper, we propose an efficient approach, SpotVMR, to trim the query-relevant clip. Besides, our proposed SpotVMR can serve as a plug-and-play module, which achieves efficiency for state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features to capture the context of objects and interactions that suggest where to search for the query-relevant moment. Also, a distillation loss is utilized to address the optimization issues arising from end-to-end joint training of the clip selector and VMR model. Extensive experiments on three challenging datasets demonstrate its effectiveness.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper addresses Video Moment Retrieval, a MultiModal task requiring Visual Encoders, hence higher scores for these keywords. It lacks explicit focus on World Models, RL, Tokenizers, or Unify Models, resulting in lower scores. No specified expert authors are found in the author list.

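Keywords

Video Moment Retrieval, Cross-Modal Clip Trimming, Language Query, Semantic Indexing, Distillation Loss, Efficient Search, Plug-and-play Module

In-Depth Analysis

Chinese Title: 更少步骤，更优性能：基于语言的高效跨模态视频片段裁剪用于视频时刻检索

Summary: The paper addresses the inefficiency of long-video processing in video moment retrieval (VMR) and proposes SpotVMR, an efficient clip-selection method. Existing methods typically down-sample a long video uniformly into fixed-length clips and then run expensive multimodal interaction over all of them, which is computationally heavy and can drop query-relevant frames, introducing boundary bias and reasoning bias. SpotVMR first designs a clip search model that adaptively identifies promising video regions conditioned on the language query. It then introduces a set of low-cost semantic indexing features (BAM features: background, appearance, motion) that capture context and guide the localization of query-relevant clips. Finally, it uses a distillation loss to address the optimization problems of jointly training the clip selector and the VMR model. The method works as a plug-and-play module that makes existing VMR methods more efficient while maintaining good performance. Experiments on three challenging datasets demonstrate its effectiveness.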

Innovations:

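Proposes SpotVMR, an efficient clip-selection method that previews the video and selects a small number of query-relevant clips, greatly reducing computational cost.
Designs three semantic indexing features (BAM features: background, appearance, motion) that act as a low-cost index of video context and guide clip selection.
Introduces an adaptive clip-update strategy and a distillation loss to address the optimization problems of end-to-end joint training of the clip selector and the VMR model.
SpotVMR works as a plug-and-play module compatible with existing VMR methods, improving efficiency while maintaining or even improving retrieval performance.

Methodology: The pipeline is as follows. First, a pretrained image encoder (such as EfficientNet) extracts single-frame semantic indexing features (BAM features) from each video clip, covering background, appearance, and motion. Next, a cross-modal Transformer (the RetrievalSpotter module) lets the query features interact with the video index features and uses an iterative strategy to progressively select the clips most relevant to the query. In each iteration, attention weights are updated according to the clips selected so far, and a distillation loss (teacher-student framework) supervises the selection process. Finally, full spatiotemporal features are extracted only for the few selected clips and fed into an existing VMR model for moment localization.

Key Results:

On Charades-STA, ActivityNet Captions, and TACoS, SpotVMR maintains or improves retrieval accuracy while significantly reducing computational cost (for example, inference time drops by more than 50%).
Compared with existing state-of-the-art methods, SpotVMR achieves competitive or better results on metrics such as Recall@1 at IoU = 0.5.
Ablation studies confirm the effectiveness of the BAM features and the distillation loss, showing that each component contributes to final performance.
As a plug-and-play module, SpotVMR improves the efficiency of several existing VMR methods with minimal performance loss.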
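The Recall@1 at IoU = 0.5 metric cited above can be computed as in the following sketch, which uses the usual temporal-IoU convention for (start, end) intervals in seconds. The sample predictions are invented, and counting a prediction as a hit when IoU is at least the threshold is the common convention, assumed here rather than taken from the paper.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_threshold=0.5):
    """Fraction of queries whose top-1 predicted moment overlaps the ground truth enough."""
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)

# Hypothetical top-1 predictions and ground-truth moments for three queries.
preds = [(2.0, 9.0), (15.0, 20.0), (30.0, 42.0)]
truth = [(3.0, 10.0), (25.0, 31.0), (31.0, 40.0)]
print(recall_at_1(preds, truth))
```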

Tech Stack:

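EfficientNet (lightweight image encoder)
Cross-modal Transformer (query-video interaction)
Distillation loss (teacher-student framework)
Iterative clip-selection strategy (attention-based)
BAM features (background, appearance, and motion feature extraction)
Pretrained visual features (such as I3D and C3D, used for full clip features)

Strengths:

Brings clip selection into the VMR task and addresses the computational bottleneck of long-video processing, which gives it practical value.
The proposed BAM features are simple and effective, capturing key semantic information at low cost.
The method is general and can be integrated into existing VMR frameworks as a plug-and-play module.
Thorough experiments validate the balance between efficiency and performance on multiple datasets.

Limitations:

Relies on a pretrained image encoder to extract index features, which may not fully capture temporal dynamics in the video.
The iterative selection strategy may add time overhead, although it remains more efficient overall than processing all clips.
For extremely long videos (for example, several hours), the scalability of the clip-selection strategy needs further validation.
The paper does not discuss applicability in unsupervised or weakly supervised settings.

Relevance To Keywords:

Low relevance to "native multimodal large models": the paper uses lightweight encoders rather than a large-model architecture.
No direct connection to "unified understanding and generation in multimodal large models".
Related to "representation learning": the paper designs BAM semantic indexing features for cross-modal representation.
No direct connection to "world models".
Unrelated to "reinforcement learning".
Unrelated to "post-training", although the distillation loss involves knowledge distillation, which can be seen as a post-training technique.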

69. Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment [PASS]

Score: 30.0 / 27.8

Authors: Toru Takahashi

Published: 2026-05-28

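TL;DR: The paper proposes a Multi-Phase Inference Mechanism (MIM) for aligning the heterogeneous world models of different subjects, so that AI systems can understand human cognitive diversity and foster mutual understanding without forcing values to converge.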

摘要翻译

当代社会的相互误解并非仅仅源于人们持有不同的观点或价值观。即使在相同的观察下，不同的主体也可能形成不同的推断目标、状态表征、预测误差和更新优先级。本文提出了一种多阶段推断框架，并将其核心内部机制定义为多阶段推断机制（Multi-Phase Inference Mechanism, MIM）。MIM 通过阶段形成空间、前景化场、主体特定档案状态以及状态表征之间的对齐映射，形式化了异质世界模型如何产生的过程。在此基础上，本文重构了世界模型对齐（world-model alignment）的问题，使其成为使异质表征相互可处理的问题，而非强制达成一致或收敛到单一价值体系。进一步，本文将这一形式化体系与哲学分歧、认知类型学、社会碎片化及 AI 对齐（AI alignment）联系起来。本文旨在为 AI 系统提供一种建设性的术语体系，通过使意义、价值和预测误差的差异可见、可比且可转换，帮助人类理解自我与他人。

Abstract

Mutual misunderstanding in contemporary society does not arise merely because people hold different opinions or values. Even under the same observations, different subjects may form different inferential targets, state representations, prediction errors, and update priorities. This paper proposes a multi-phase inference framework and defines its core internal mechanism as the Multi-Phase Inference Mechanism (MIM). MIM formalizes how heterogeneous world models arise through a phase-formation space, a foregrounding field, subject-specific profile states, and alignment maps between state representations. On this basis, the paper reframes world-model alignment as the problem of making heterogeneous representations mutually processable, rather than forcing agreement or convergence to a single value system. It further connects this formalism to philosophical disagreements, cognitive typology, social fragmentation, and AI alignment. The aim is to provide a constructive vocabulary for AI systems that can help humans understand self and others by making differences in meaning, value, and prediction error visible, comparable, and transformable.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	9.0/10	13.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

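Scoring rationale: The paper's core contribution is a Multi-Phase Inference Mechanism (MIM) for aligning heterogeneous world models and resolving the misunderstandings that arise from human cognitive diversity; it is a theoretical framework rather than a technical implementation. 'World Models' is highly relevant and appears centrally in the title and abstract. 'Unify Models' is moderately relevant, since the paper concerns alignment rather than model merging. 'Tokenizer', 'Visual Encoder', 'MultiModal', and 'MLLM' have little connection to the paper's themes of cognitive philosophy and inference mechanisms, and no specific architecture or modality processing is involved. 'model-based RL' is partially relevant because the paper uses the concepts of prediction error and update priority, but it does not address reinforcement learning algorithms or control tasks. The author list contains none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points were added.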

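Keywords

Multi-Phase Inference Mechanism, World-Model Alignment, Human Cognitive Diversity, Heterogeneous World Models, State Representations, AI Alignment, Inference Framework, Subject-specific Profile States

In-Depth Analysis

Chinese Title: 迈向理解自我与他人的AI系统：面向人类认知多样性与世界模型对齐的多阶段推理框架

Summary: The paper criticizes the "Single Intelligence Assumption" (SIA), the tendency to assume that all rational subjects should converge on the same conclusion from the same input, and proposes the "Multi-Phase Inference Assumption" (MIA). The author constructs the Multi-Phase Inference Mechanism (MIM), whose core components are the phase-formation space Zphase (which organizes reference targets and resolution modes), the foregrounding field R (a gradient field that biases prediction errors toward particular inferential targets), and the alignment map Φ (which makes the state representations of different subjects mutually processable). The framework interprets cognitive diversity as an operational fact of world-model learning rather than as error or irrationality. The paper redefines the AI alignment problem: the goal is not only to align to a single human value system but to make heterogeneous human world models mutually processable. It also explores applications of the framework to philosophy, cognitive typology, and social fragmentation, and proposes empirically testable hypotheses and directions for future research.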

Innovations:

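Proposes the Multi-Phase Inference Assumption (MIA) as a replacement for the Single Intelligence Assumption (SIA), formalizing the fact that the same observation sequence can produce different inferential targets, state representations, prediction errors, and update paths.
Introduces the phase-formation space Zphase and the foregrounding field R to describe how a subject endogenously selects inferential targets and resolution modes, and how prediction errors are absorbed in a directed way.
Defines the alignment map Φ, extending world-model alignment from value alignment to alignment of the processability of state representations, which gives a formal basis for understanding between heterogeneous cognitive systems.
Treats cognitive diversity as an operational fact of world-model learning rather than error or irrationality, and reinterprets historical philosophical debates (empiricism vs. idealism, structuralism vs. existentialism, and others) as differences in inference phase.
Proposes a dynamic loop (observation → target formation → state extraction → planning → action → re-observation → feedback update) that brings action, reflection, and coordination into the generative modeling path.

Methodology: The paper uses theory construction and formal modeling. It first criticizes the limitations of existing learning and inference models (such as the free energy principle, active inference, and world models) and proposes the Multi-Phase Inference Assumption. It then defines three core formal devices: the phase-formation space Zphase (an abstract space and an effective space), the foregrounding field R (a directional field), and the alignment map Φ (a transformation between state representations). A minimal example (the same observation with different inferential targets) illustrates the mechanism, and a comparison with existing models (such as Bayesian inference and variational autoencoders) shows what MIM adds. Finally, the framework is applied interpretively to philosophy, cognitive typology, and social fragmentation, and empirical hypotheses are stated.

Key Results:

Formalizes the mechanism by which the same observation sequence can produce different inferential targets, state representations, prediction errors, and update paths.
Redefines the AI alignment problem as making heterogeneous world models mutually processable rather than aligning to a single value.
Reinterprets historical philosophical debates (empiricism vs. idealism, structuralism vs. existentialism, linguistic relativity, and others) as differences in inference phase.
Establishes a correspondence with cognitive typology (such as Jungian psychological types and Model A), mapping type differences to inference phases and foregrounding directions.
Explains social fragmentation as a failure of the alignment network and suggests design directions for AI systems that support cognitive diversity.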

Tech Stack:

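Free Energy Principle
Active Inference
World Models
Variational Inference
Bayesian Inference
Gradient field (as a concept)
State-representation alignment
Generative modeling

Strengths:

Proposes a unified theoretical framework that brings cognitive diversity, AI alignment, philosophical debates, and social fragmentation into a common formal language.
Criticizes the implicit Single Intelligence Assumption and offers a new theoretical basis for AI systems that handle human cognitive differences.
The formal devices (Zphase, R, Φ) are concise and explanatory, and they map onto several existing theories (such as dual-process theory and Jungian typology).
Goes beyond technical implementation to discuss broad philosophical, ethical, political, and social implications.
States empirically testable hypotheses and future research directions, which makes the framework actionable.

Limitations:

The paper is currently a theoretical preprint and lacks a concrete algorithmic implementation and experimental validation.
The foregrounding field R is described as a "gradient-like field", but no concrete mathematical form or learning rule is given.
The definition of the alignment map Φ is fairly abstract, and no practical method for computing or learning it is provided.
The concrete path for combining the framework with existing techniques such as world models and reinforcement learning remains unclear.
The mapping to cognitive typology risks oversimplification and may not cover all individual differences.

Relevance To Keywords:

Unify Models: The multi-phase inference framework aims to unify different modes of inference, which is highly relevant to the idea of unified models.
World Models: The core concepts revolve around world-model operations (target selection, state representation, prediction error), so the paper is directly relevant.
Representation Learning: The paper emphasizes the endogenous selection and alignment of state representations, consistent with the goals of representation learning.
Model-Based RL: The paper brings action, planning, and feedback updates into a generative loop, which is compatible with model-based RL frameworks.
Native multimodal large models: The paper discusses how AI systems handle cognitive differences across multimodal observations (meetings, reports, news, and so on), offering a cognitive-diversity perspective for the design of multimodal large models.
Unified understanding and generation in multimodal large models: The generative loop (observation → target → representation → planning → action → re-observation) reflects the unification of understanding and generation.
Representation learning: Directly relevant, especially the endogenous selection and alignment of state representations.
World models: The paper's title includes world models, which are the core topic.
Reinforcement learning: The action constraints and feedback updates in the paper are analogous to rewards and policy updates in reinforcement learning.
Post-training: The paper does not discuss post-training directly, but the proposed alignment map Φ could be viewed as a post-training method for adapting an AI system to different human world models.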

70. Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts [PASS]

Score: 30.0 / 27.8

Authors: Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen

Published: 2026-05-28

TL;DR: This paper benchmarks the generalizability of physics foundation models across physical regimes and distribution shifts, finding that current models behave as conditional generalists whose performance depends on specific settings rather than universal physical understanding.

摘要翻译

近期物理基础模型（physics foundation models）声称具备通用的时空预测能力，然而其评估往往在固定训练分布下将性能简化为单一的平均分数。这使得难以判断模型是否掌握了可泛化的物理动力学，抑或仅在特定设置下表现优异。我们构建了一个基准测试（benchmark），包含 8 种物理动力学、3 种训练数据混合以及由动态尺度（dynamic-scale）和初始条件（initial-condition）复杂性偏移诱导的 25 种测试范式（regimes），涵盖分布内（in-distribution）、分布偏移（distribution-shift）和分布外（out-of-distribution）设置。我们评估了五种物理基础模型架构，并对每种架构测试了四种模型变体（从头训练（scratch）及三种预训练规模），共计产生 60,000 次测量结果。结果表明，当前的物理基础模型表现为条件性泛化者而非通用泛化者：其通用性取决于物理体制（physical regime）、时间尺度（temporal scale）、初始条件设置、预训练、模型规模及架构。仅改进训练数据分布只能部分缓解这一局限性。预训练和模型缩放也无法可靠地消除其能力偏差。我们认为，改进物理基础模型需要超越单纯的模型缩放或数据扩展，转向能够更好捕获跨体制、时间尺度及分布偏移的可迁移物理知识的学习机制。

Abstract

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	6.0/10	9.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on Physics Foundation Models and generalizability benchmarks across physical regimes. World Models is the most relevant keyword (6.0) because physics models simulate physical environments. model-based RL (3.0) and Unify Models (3.0) have limited relevance, through the use of physics models as simulators and the unified evaluation across regimes. Tokenizer, Visual Encoder, MLLM, and MultiModal (2.0 each) are less relevant because the abstract emphasizes physical dynamics and benchmarking rather than specific multimodal LLM architectures or tokenization schemes.

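Keywords

Physics Foundation Models, Generalizable Physics, Bias-Aware Benchmark, Physical Regimes, Distribution Shifts, Model Architectures, Pretraining, Scaling

In-Depth Analysis

Chinese Title: 物理基础模型是否学到了可泛化的物理知识？一个跨物理机制和分布偏移的偏差感知基准

Summary: Current physics foundation models claim general spatiotemporal forecasting ability but are evaluated in a narrow way. To address this, the paper builds a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes covering in-distribution, distribution-shift, and out-of-distribution settings. The authors evaluate five physics foundation model architectures (DPOT, GPhyT, MORPH, MPP, Poseidon) in four variants each (trained from scratch and three pretrained sizes), obtaining 60,000 measurements. They find that current physics foundation models are conditional rather than universal generalists: generalization depends strongly on the physical regime, temporal scale, initial-condition setting, pretraining status, model size, and architecture. Improving the training data distribution only partly mitigates this limitation, and neither pretraining nor scaling reliably removes the ability bias. The authors argue that improving physics foundation models requires moving beyond simple model scaling or data expansion toward learning mechanisms that transfer physical knowledge across regimes, temporal scales, and distribution shifts.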

Innovations:

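Builds a factorized evaluation framework that decouples physical dynamics, training-data mixture, test-regime shift, prediction horizon, pretraining status, model size, and architecture, so that generalization can be assessed systematically.
Designs a 5×5 test grid covering in-distribution, interpolation-type distribution shift, and extrapolation-type out-of-distribution (OOD) scenarios, and separates dynamic-scale shift from initial-condition complexity shift.
Reveals the conditional generalization behavior of current physics foundation models: performance depends heavily on the specific physical regime and temporal scale, and pretraining and scaling do not reliably remove the bias.
Proposes diagnostic metrics such as ShiftDamage to quantify the effect of different shift types on performance, and finds that increasing training-data complexity actually worsens normalized damage in OOD regions.
Systematically compares five mainstream physics foundation model architectures and their size variants, providing large-scale empirical evidence (60,000 measurements).

Methodology: The paper uses a factorized experimental design. The APEBench/Exponax toolkit generates trajectories for 8 PDE dynamics, from which three training-data mixtures are built (simple-heavy, balanced, complex-heavy). Testing uses a 5×5 grid (dynamic scale × initial-condition complexity) with five shift types: train-visible cells, compositional in-distribution, dynamic-scale OOD, IC-complexity OOD, and joint OOD. Five architectures (DPOT, GPhyT, MORPH, MPP, Poseidon) are evaluated in four variants (trained from scratch and pretrained small/medium/large), and errors are measured at several prediction horizons (1 step, 5 steps, 10 steps, OOD rollout). Ability bias is analyzed with normalized error, ShiftDamage, and related metrics.

Key Results:

Current physics foundation models are conditional generalists: errors are low on Fisher-KPP and Gray-Scott, but errors on Wave dynamics exceed twice the model's average error.
Dynamic-scale OOD causes more damage than initial-condition OOD, and the hardest cells reach 7 to 8 times the train-visible reference error.
Increasing training-data complexity (from Mix-simple to Mix-complex) lowers raw in-distribution error but increases normalized ShiftDamage.
In same-size comparisons, 37.5% of architecture-PDE pairs show negative transfer (pretrained worse than trained from scratch), and 25.0% of large-model cells are worse than the small-model baseline.
Pretraining and scaling do not reliably remove ability bias and may introduce new conditional preferences.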
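The "7 to 8 times the train-visible reference error" style of comparison above can be illustrated with a small sketch that divides each cell of a 5×5 error grid by the mean error of the train-visible cells. Everything here is hypothetical: the error values are synthetic, the choice of three diagonal cells as train-visible is an assumption, and this ratio is not claimed to be the paper's ShiftDamage definition, which the report does not give.

```python
def relative_error_grid(errors, train_visible):
    """Divide every cell's error by the mean error over the train-visible cells."""
    ref = sum(errors[r][c] for r, c in train_visible) / len(train_visible)
    return [[e / ref for e in row] for row in errors]

def worst_cell(rel_grid):
    """Return the (row, col) index of the cell with the largest relative error."""
    cells = [(rel_grid[r][c], (r, c))
             for r in range(len(rel_grid)) for c in range(len(rel_grid[r]))]
    return max(cells)[1]

# Synthetic grid: rows = dynamic-scale level, cols = initial-condition complexity.
errors = [[0.01 * (1 + r + c) for c in range(5)] for r in range(5)]
train_visible = [(0, 0), (2, 2), (4, 4)]  # assumed train-visible cells
rel = relative_error_grid(errors, train_visible)
print(worst_cell(rel), round(rel[4][4], 2))
```

Normalizing by a reference makes cells comparable across PDEs with very different raw error scales, which is presumably why the report quotes ratios rather than raw errors.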

Tech Stack:

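APEBench/Exponax toolkit (for generating PDE trajectories)
Five physics foundation model architectures: DPOT, GPhyT, MORPH, MPP, Poseidon
Transformer and operator-style neural network architectures
Normalized error
ShiftDamage metric (measures the performance damage caused by distribution shift)
5×5 test-grid design (dynamic scale × initial-condition complexity)
Three training-data mixture strategies (simple-heavy, balanced, complex-heavy)

Strengths:

Systematically decouples the factors that affect the generalization of physics foundation models and provides large-scale empirical evidence.
Designs a fine-grained test grid that separates different types of distribution shift and exposes the models' conditional behavior.
Compares several mainstream architectures and size variants, so the conclusions are broadly representative.
Proposes new diagnostic metrics (such as ShiftDamage) that help explain model failure modes.
Gives a careful analysis of the limitations of current physics foundation models and points to directions for future research.

Limitations:

Only 8 PDE dynamics are evaluated, which may not represent all physical systems.
The training-data mixtures are based on only three diagonal cells; more complex distribution designs are not explored.
Evaluation is limited to prediction error and does not cover deeper measures of physical understanding such as physical consistency or conservation laws.
Performance on real-world physical data is not considered; all data are synthetically generated.
The training data sources and details of the pretrained models are not fully disclosed, which may affect reproducibility.

Relevance To Keywords:

Unify Models / native multimodal large models: The paper studies physics foundation models (foundation models for science), which relates to the idea of unified models, but it focuses on physical prediction rather than multimodality.
World Models: Physics foundation models aim to learn the dynamics of the physical world and can be viewed as a form of world model; evaluating their generalization bears directly on the reliability of world models.
Representation Learning: The paper asks whether models learn transferable physical representations rather than memorizing in-distribution patterns, which matches a core question of representation learning.
Model-Based RL / reinforcement learning: Physics foundation models can serve as environment simulators in model-based RL, so the generalization limits found here are an important caution for RL applications.
Post-training: The paper studies how pretraining and scaling affect generalization, but it does not cover specific post-training techniques such as fine-tuning or alignment.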

71. Draft-OPD: On-Policy Distillation for Speculative Draft Models [PASS]

Score: 30.0 / 27.8

Authors: Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

Published: 2026-05-28

TL;DR: Draft-OPD proposes on-policy distillation to resolve the offline-to-inference mismatch in speculative decoding, achieving over 5x lossless acceleration for large language models.

摘要翻译

推测解码通过将目标模型与轻量级草稿模型配对，并行验证草稿模型提出的 token，从而加速大语言模型推理。构建草稿模型的常见方法（如 EAGLE-3 或 DFlash）是在目标模型生成的轨迹上进行监督微调（SFT）。然而，我们发现 SFT 很快出现停滞：草稿模型在测试数据上的接受长度不再提升。原因在于离线到推理的不匹配：在 SFT 中，草稿模型从固定的目标生成轨迹中学习；而在推测解码过程中，它是在基于其自身策略生成的文本块上进行评估的。这催生了策略内蒸馏（OPD），即目标模型在草稿诱导的状态上对草稿模型进行监督。然而，OPD 对草稿模型而言仍具挑战性，因为它们无法可靠地独立展开完整序列；而目标辅助生成会使收集的序列遵循目标分布，从而消除了策略内信号。因此，我们提出了 Draft-OPD，该方法利用目标辅助展开以实现稳定的序列续写，并从验证过程中暴露的错误位置重放草稿生成过程。这使得草稿模型能够同时从目标模型对接受和拒绝提案的反馈中学习，将训练重点集中在限制推测接受率的草稿诱导错误上。实验表明，Draft-OPD 在各种任务上为思考模型实现了超过 5 倍的无损加速，相比 EAGLE-3 和 DFlash 分别提升了 23% 和 13%。

Abstract

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0

Scoring rationale: The paper focuses on LLM inference acceleration via speculative decoding and on-policy distillation. It relates moderately to Unify Models (target/draft pairing) and MLLM (LLM core technology), and uses RL concepts (on-policy, rollout) relevant to model-based RL. It has no Visual Encoder content or explicit World Models and is primarily text-based, which results in lower scores for the Visual Encoder, World Models, and MultiModal keywords.

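Keywords

Speculative Decoding, On-Policy Distillation, Draft Models, LLM Inference, Target Model, Token Verification, Inference Acceleration

In-Depth Analysis

Chinese Title: Draft-OPD：面向投机草稿模型的在线策略蒸馏

Summary: The paper addresses speculative decoding for LLM inference acceleration and proposes Draft-OPD, an on-policy distillation framework. Existing draft models are usually trained with supervised fine-tuning (SFT) on fixed trajectories generated by the target model, but this quickly plateaus because the target-generated prefixes used in training are distributed differently from the prefixes the draft model generates itself at inference time. Applying standard on-policy distillation (OPD) directly is difficult: the draft model is unstable when generating complete sequences on its own, while target-assisted generation removes the draft-policy signal. Draft-OPD keeps continuations stable through target-assisted rollout, records the error positions exposed during verification, and replays drafting from those positions, so the draft model is supervised by the target model on states it induced itself. In addition, an acceptance-aware distillation objective treats accepted and rejected tokens differently so that training focuses on draft-induced errors. Experiments show that Draft-OPD achieves more than 5× lossless acceleration on thinking models and, at equal compute, improves over EAGLE-3 and DFlash by 23% and 13% respectively.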

Innovations:

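Identifies the key limitation of training draft models with offline SFT: training uses fixed target trajectories, whereas inference evaluates blocks produced by the draft's own policy, which causes distribution mismatch and a performance plateau.
Explains why standard OPD cannot be applied directly to draft models: independent rollouts by the draft model are unstable, and target-assisted rollouts lose the on-policy signal.
Proposes the Draft-OPD framework, which combines target-assisted rollout with error-position replay to enable effective on-policy training while retaining draft-induced error states.
Designs an acceptance-aware distillation objective that handles accepted and rejected tokens differently, reinforcing reliable agreement and focusing learning on errors.
Achieves more than 5× lossless acceleration across a variety of tasks, significantly outperforming EAGLE-3 and DFlash.

Methodology: The pipeline is as follows. First, the target and draft models run speculative decoding, and the start position of each draft block (the anchor) is recorded at every verification. Target-assisted rollout then generates a stable continuation while the positions of tokens rejected during verification are recorded. Next, drafting is replayed from these error positions, so the draft model recomputes log-probabilities on prefixes it induced itself. Finally, an acceptance-aware distillation loss is applied: KL-divergence minimization for accepted tokens, and reverse KL or cross-entropy for rejected tokens, to strengthen agreement between draft and target and to correct errors. Training is iterative, with the draft model updated continuously.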
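The verification and error-position bookkeeping described above can be sketched with a greedy-verification toy. This is a simplification under stated assumptions: real speculative decoding with sampling uses a probabilistic accept/reject rule, draft and target blocks are assumed to have equal length, and the function names are illustrative rather than taken from the paper.

```python
def verify_block(draft_tokens, target_tokens):
    """Greedy verification: accept draft tokens until the first disagreement.

    Returns (accepted_count, first_rejected_index), where the index is None
    if the whole block is accepted. Both lists are assumed to be equally long.
    """
    for i, (draft, target) in enumerate(zip(draft_tokens, target_tokens)):
        if draft != target:
            return i, i
    return len(draft_tokens), None

def replay_positions(blocks):
    """Collect absolute sequence positions of the first rejection in each block.

    blocks: list of (block_start, draft_tokens, target_tokens).
    These are the positions from which drafting would be replayed.
    """
    positions = []
    for start, draft, target in blocks:
        _, rejected = verify_block(draft, target)
        if rejected is not None:
            positions.append(start + rejected)
    return positions

blocks = [
    (0, [11, 12, 13, 14], [11, 12, 99, 14]),   # rejected at offset 2
    (10, [21, 22], [21, 22]),                  # fully accepted
    (20, [31, 32, 33], [40, 32, 33]),          # rejected at offset 0
]
print(replay_positions(blocks))
```

The accepted count per block is what the report calls acceptance length; longer accepted prefixes mean fewer target-model calls per generated token.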

Key Results:

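Draft-OPD achieves more than 5× lossless acceleration on thinking models (such as DeepSeek-R1).
At equal compute, Draft-OPD improves over EAGLE-3 by 23% and over DFlash by 13%.
Continuing SFT on OPD data after offline SFT actually reduces acceptance length, whereas Draft-OPD keeps improving it.
Acceleration and distribution preservation are verified across multiple tasks (reasoning, coding, general assistant).

Tech Stack:

Speculative decoding
On-policy distillation
Supervised fine-tuning (SFT)
KL divergence (Kullback-Leibler divergence)
Reverse KL
Cross-entropy loss
Target-assisted rollout
Error-position replay
Acceptance-aware distillation objective
Draft model architectures such as EAGLE-3 and DFlash

Strengths:

Offers an effective solution to the core problem in draft-model training (the offline-to-inference mismatch).
The design is elegant: it uses the verification mechanism of speculative decoding itself to collect error signals, with no extra annotation.
Thorough experiments verify significant, lossless acceleration across multiple models and tasks.
Compatible with mainstream draft models (EAGLE-3, DFlash), which can be post-trained directly to improve performance.

Limitations:

The method depends on the verification process of speculative decoding and requires the target model during training, so training compute may be high.
It applies only to trained draft models (such as EAGLE and DFlash), not to independent autoregressive draft models.
Applicability to multimodal or world-model settings is not explored; the paper focuses on text-only LLMs.
Error-position replay may add training complexity and requires a sensible replay strategy and hyperparameters.

Relevance To Keywords:

Reinforcement learning: Draft-OPD is essentially a form of on-policy learning in which the target model provides a reward-like signal (accept/reject) to optimize the draft policy, which resembles policy-gradient methods in reinforcement learning.
Post-training: The method post-trains a draft model that has already been through SFT, so it is a post-training technique.
Representation learning: The paper mentions that models such as EAGLE use feature-level context from the target model, but Draft-OPD itself does not directly involve representation learning.
World models, model-based RL, native multimodal large models, and unified understanding and generation in multimodal large models: the paper does not cover these topics, so relevance is low.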

72. Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence [PASS]

Score: 30.0 / 27.8

Authors: Artur Jesslen, Olaf Dünkel, Adam Kortylewski

Published: 2026-05-28

TL;DR: This paper proposes a 3D-aware post-training framework integrating SAM3D geometry estimation and PartField descriptors to enhance semantic correspondence by overcoming the limitations of 2D-only foundation features.

摘要翻译

自监督视觉模型和文本到图像扩散模型的基础特征已被证明在语义对应估计中有效。然而，由于这些特征主要基于 2D 图像目标学习，它们缺乏明确的 3D 感知，往往混淆对称物体侧面、重复部分以及在 3D 中不同但视觉上相似的结构。我们提出了一种 3D 感知后训练框架，该框架通过纳入来自 3D 基础模型的先验，超越了现有的 2D 基础特征。给定一张图像，我们的方法使用 SAM3D 估计物体的几何形状与姿态，并通过渲染 - 比较优化来细化该姿态。随后，基于估计的物体姿态，我们将从重建几何形状中提取的 PartField 描述符渲染到图像平面上。所得的几何感知特征图补充了 DINO 和 Stable Diffusion 的特征，而重建形状上的测地距离则使得能够可靠地过滤候选对应点。我们利用过滤后的匹配结果作为监督信号，在 DINO 和 Stable Diffusion 之上训练一个轻量级适配器，以执行语义对应任务。与需要姿态标注并依赖粗糙球面几何的先验后训练方法相比，我们的方法能够自动获取实例特定的 3D 结构，并利用其指导对应学习。实验表明，我们的方法在减少手动几何监督的同时，改进了先前方法的语义对应性能。代码和模型可在 https://github.com/GenIntel/3D-SC 获取。

Abstract

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond available 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. Subsequently, we render PartField descriptors from the reconstructed geometry into the image plane based on the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over the prior methods while reducing manual geometric supervision. Code and model can be found at https://github.com/GenIntel/3D-SC.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	1.0/10	1.5

评分理由: The paper focuses on semantic correspondence using 3D geometry priors and 2D foundation features. It heavily utilizes Visual Encoders and has moderate MultiModal relevance. However, it lacks Tokenizers, MLLMs, World Models, and Model-Based RL components. No expert authors from the target list were found.

关键词

Semantic Correspondence, 3D Foundation Priors, SAM3D, PartField, Post-training, Geometry-aware, DINO, Stable Diffusion

深度分析

Chinese Title: 几何至关重要：用于学习语义对应的3D基础先验

Summary: 本文提出一种3D感知的后训练框架，用于提升语义对应估计。现有基于2D基础特征（如DINOv2、Stable Diffusion）的方法缺乏显式3D意识，容易混淆对称物体两侧、重复部件等。本文利用SAM3D从单张图像重建物体几何并估计姿态，通过渲染-比较优化精化姿态，然后从重建几何中渲染PartField描述子到图像平面，得到几何感知特征，补充2D特征。同时，利用重建形状上的测地距离可靠地过滤候选对应点，将过滤后的匹配作为伪标签训练轻量适配器。该方法无需人工姿态标注，自动获取实例特定3D结构，在SPair-71k等基准上取得优于先前方法的语义对应结果，并减少了人工几何监督。

Innovations:

提出无需人工姿态标注的3D感知后训练框架，利用3D基础模型自动获取实例特定几何结构。
设计渲染-比较姿态优化方法，通过距离变换吸引和软IoU精化两阶段对齐，校正SAM3D的尺度和平移误差。
引入基于重建形状测地距离的伪标签过滤方案，替代粗粒度球面几何代理，提供更高质量的监督信号。
将PartField描述子渲染到图像平面，生成几何感知特征，有效补充DINOv2和Stable Diffusion特征，解决对称和重复部件混淆。

Methodology: 论文采用三阶段技术路线：首先，利用SAM3D从单张图像重建物体网格，并通过渲染-比较优化（距离变换阶段+软IoU阶段）精化姿态，再使用OrientAnything V2进行偏航规范化，得到规范化的3D网格。其次，在重建网格上运行PartField生成几何感知描述子，并渲染到图像平面，与DINOv2和Stable Diffusion特征融合，生成候选对应点。最后，利用网格上的测地距离过滤几何不一致的匹配，保留高质量伪标签，训练轻量适配器（基于DINOv2和Stable Diffusion特征）以提升语义对应性能。
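
The geodesic filtering step described above can be illustrated with a minimal sketch. It assumes that both endpoints of a candidate match have already been mapped to vertices of one shared mesh, and it approximates geodesic distance by shortest paths along mesh edges (Dijkstra). The function names, the toy ring mesh, and the threshold value are hypothetical and are not taken from the paper.

```python
import heapq
import math


def geodesic_distances(num_vertices, edges, source):
    """Shortest edge-path distances from `source` (an approximation of geodesic distance)."""
    adjacency = [[] for _ in range(num_vertices)]
    for u, v, length in edges:
        adjacency[u].append((v, length))
        adjacency[v].append((u, length))
    dist = [math.inf] * num_vertices
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, length in adjacency[u]:
            candidate = d + length
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


def filter_matches(matches, num_vertices, edges, max_geodesic):
    """Keep (predicted_vertex, reference_vertex) pairs that lie close together on the surface."""
    cache = {}
    kept = []
    for predicted, reference in matches:
        if reference not in cache:
            cache[reference] = geodesic_distances(num_vertices, edges, reference)
        if cache[reference][predicted] <= max_geodesic:
            kept.append((predicted, reference))
    return kept


# Toy mesh (hypothetical): a ring of six vertices joined by unit-length edges.
RING_EDGES = [(i, (i + 1) % 6, 1.0) for i in range(6)]
candidates = [(0, 1), (0, 3), (4, 5)]
kept = filter_matches(candidates, 6, RING_EDGES, max_geodesic=1.5)
```

In this toy case the pair (0, 3) lies on opposite sides of the ring and is rejected, which mirrors the symmetric-side confusion the filter is meant to remove.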

Key Results:

在SPair-71k、PF-PASCAL、TSS等标准基准上，所提方法优于现有零样本和弱监督语义对应方法。
相比依赖人工姿态标注和粗球面几何的先前方法（如Spherical Maps、DIY-SC），本方法减少了人工监督且性能更优。
消融实验表明，PartField特征和测地距离过滤均显著提升对应准确率，尤其在对称物体和重复部件场景下。
渲染-比较姿态优化有效校正了SAM3D的初始姿态误差，提高了后续特征渲染的准确性。

Tech Stack:

SAM3D（单图像3D网格重建）
DINOv2（自监督视觉Transformer特征）
Stable Diffusion（文本到图像扩散模型特征）
PartField（3D表面几何描述子）
OrientAnything V2（多视图方向估计）
渲染-比较优化（距离变换损失+软IoU损失）
测地距离计算（基于重建网格）
轻量适配器（基于DINOv2和Stable Diffusion特征微调）

Strengths:

无需人工姿态标注，自动从单张图像获取实例特定3D几何，可扩展至新类别。
利用3D基础模型（SAM3D、PartField）提供几何先验，有效解决2D特征固有的对称和重复部件混淆问题。
渲染-比较优化和偏航规范化提高了3D重建与图像的对齐精度，保证后续特征渲染可靠性。
测地距离过滤比粗球面代理更精确，生成更高质量的伪标签，提升适配器训练效果。
在多个标准基准上取得最优或竞争性结果，且减少了人工监督需求。

Limitations:

依赖SAM3D的重建质量，对于严重遮挡、非刚性物体或罕见类别可能重建不准确。
渲染-比较优化和偏航规范化增加了计算开销，可能影响实时应用。
方法假设物体具有大致对称性或可规范化的偏航，对于无明确垂直轴的物体（如椅子）可能效果有限。
伪标签过滤依赖测地距离阈值，需要针对不同类别调整超参数。
实验仅在有限类别上验证，泛化到更多样化场景有待进一步研究。

Relevance To Keywords:

表征学习：论文利用DINOv2和Stable Diffusion的2D表征，并引入3D几何表征（PartField），通过后训练提升语义对应能力，属于表征学习范畴。
后训练：核心贡献是提出3D感知后训练框架，在预训练2D基础模型之上通过3D先验进行微调，无需从头训练。
世界模型：SAM3D从单图像重建3D网格，可视为一种世界模型（隐式3D结构），用于提供几何先验。
多模态大模型的理解和生成一体化：论文融合2D视觉特征（DINOv2）和3D几何特征（PartField），并利用扩散模型特征，体现了多模态理解与生成的结合。
强化学习：论文未直接涉及强化学习，但渲染-比较优化可视为一种基于梯度的优化，与强化学习中的策略优化有间接联系。

73. GPIC: A Giant Permissive Image Corpus for Visual Generation [PASS]

Score: 28.5 / 27.8

Authors: Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei

Published: 2026-05-28

TL;DR: This paper proposes GPIC, a large-scale permissive image corpus with MLLM-generated captions for benchmarking visual generative modeling, rather than introducing new model architectures or reinforcement learning frameworks.

摘要翻译

研究视觉生成建模的可扩展方法需要大规模、易获取且稳定的数据集。我们介绍 GPIC（Giant Permissive Image Corpus），一个包含约 28 万亿像素的巨型许可图像语料库。GPIC 包含由最先进的视觉 - 语言模型标注的多样化互联网图像，其中包括 1 亿个训练样本、20 万个验证样本和 100 万个测试样本。此外，所有 GPIC 图像均以宽松许可授权用于研究和商业用途。GPIC 经过安全过滤和去重处理，并集中托管于 Hugging Face。我们提供了在 GPIC 上进行生成建模的基准测试协议。最后，我们提供了在 GPIC 上进行像素空间流匹配的参考基线。我们的数据集、基准和模型可在 https://huggingface.co/datasets/stanford-vision-lab/gpic 获取。评估工具包和代码可在 https://gpic.stanford.edu 获取。

Abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper introduces a large-scale image dataset (GPIC) for visual generation rather than proposing new model architectures. It utilizes MLLM for captioning and contains image-text data, making MLLM and MultiModal moderately relevant. However, it does not address model unification, tokenizer design, dynamic world modeling, or reinforcement learning, resulting in low scores for those keywords.

关键词

Visual Generative Modeling, Image Corpus, Vision-Language Model, Flow Matching, Permissive License, Dataset Benchmarking, Safety Filtering

深度分析

Chinese Title: GPIC：用于视觉生成的巨型许可图像语料库

Summary: 本文提出了GPIC（Giant Permissive Image Corpus），一个包含约28万亿像素、1亿训练样本、20万验证样本和100万测试样本的大型许可图像数据集。GPIC从Flickr和Wikimedia收集了CC BY、CC0等许可协议的图像，经过质量过滤、安全过滤、去重处理，并使用Qwen3-VL-4B模型生成标签、短、中、长四种类型的文本描述。数据集以8000个分片形式托管在Hugging Face上，总大小12.9TB，采用MIT许可证发布。论文还提供了基于FD-DINOv2的基准测试协议，以及像素空间流匹配的参考基线。GPIC旨在解决现有数据集（如ImageNet-1K）规模小、许可不明确、不稳定等问题，为视觉生成研究提供稳定、可访问、大规模且许可开放的基准。

Innovations:

首次构建了同时满足许可开放、稳定、大规模、易访问四个关键属性的视觉生成基准数据集，包含1亿张图像和28万亿像素。
提出了基于SSCD特征和幂律碰撞模型的保守去重策略，在保留视觉相关但不同图像的同时有效去除重复。
使用Qwen3-VL-4B进行质量过滤、安全过滤和多样化描述生成（标签/短/中/长），提升了数据质量和可用性。
提供了基于FD-DINOv2的新评估协议，替代了已饱和的FID指标，并发布了参考基线（像素空间流匹配）。
数据集以分片形式集中托管，避免了URL索引带来的链接腐烂问题，并提供了Lite和Nano子集便于开发。

Methodology: 论文采用四阶段流水线构建GPIC：1）从Flickr和Wikimedia收集许可图像（CC BY、CC0等），保留归属元数据；2）图像过滤：去除极端分辨率、低质量（模糊、过曝等）和有害内容，使用Qwen3-VL-4B进行VLM质量评估和安全过滤；3）去重：提取SSCD复制检测特征，通过FAISS进行近似最近邻搜索，基于幂律碰撞模型选择保守阈值（0.95），保留最高分辨率图像；4）描述生成：使用Qwen3-VL-4B为每张图像生成标签、短、中、长四种描述。数据集以8000个分片（每个约1.6GB）托管在Hugging Face上。
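
A minimal sketch of the deduplication stage follows. It makes several assumptions: the copy-detection features are already available as a NumPy array, a brute-force cosine-similarity matrix stands in for the FAISS approximate nearest-neighbor search, and near-duplicates are grouped by connected components. The grouping rule and the function name are illustrative only; the summary states just the 0.95 threshold and the keep-highest-resolution rule.

```python
import numpy as np


def dedup_keep_highest_resolution(features, resolutions, threshold=0.95):
    """Group images whose cosine similarity is >= threshold; keep the largest image per group."""
    feats = np.asarray(features, dtype=float)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    similarity = feats @ feats.T
    n = len(feats)

    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i, j] >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    best = {}
    for i in range(n):
        root = find(i)
        if root not in best or resolutions[i] > resolutions[best[root]]:
            best[root] = i
    return sorted(best.values())


# Toy data (hypothetical): images 0 and 1 are near-duplicates, image 2 is distinct.
toy_features = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
toy_resolutions = [100, 400, 250]
kept_indices = dedup_keep_highest_resolution(toy_features, toy_resolutions)
```

The quadratic loop is only suitable for small samples; at the scale of the corpus an index-based neighbor search would be required.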

Key Results:

GPIC包含约28万亿像素，1亿训练图像、20万验证图像、100万测试图像。
图像平均高度479像素，平均宽度587像素，总大小12.9TB。
去重阶段在阈值0.95下预计移除约960万张图像，保留约1.01亿张。
质量过滤移除约0.3%图像，安全过滤移除约0.35%图像。
提供了FD-DINOv2基准测试协议和像素空间流匹配基线模型。
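
The headline figures above can be cross-checked with simple arithmetic. The sketch below assumes that the product of mean height and mean width is a fair proxy for mean pixels per image (this ignores any correlation between the two) and that terabytes are decimal. The shard count and shard size come from the Summary and Methodology text.

```python
MEAN_HEIGHT = 479          # pixels, from Key Results
MEAN_WIDTH = 587           # pixels, from Key Results
NUM_IMAGES = 100_000_000 + 200_000 + 1_000_000  # train + validation + test

pixels_per_image = MEAN_HEIGHT * MEAN_WIDTH
total_trillions = pixels_per_image * NUM_IMAGES / 1e12

NUM_SHARDS = 8000
SHARD_SIZE_GB = 1.6        # approximate, from Methodology
shard_total_tb = NUM_SHARDS * SHARD_SIZE_GB / 1000

print(f"{total_trillions:.2f} trillion pixels, {shard_total_tb:.1f} TB across shards")
```

Both estimates land close to the reported values of about 28 trillion pixels and 12.9 TB.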

Tech Stack:

Qwen3-VL-4B-Instruct（视觉语言模型，用于过滤和描述生成）
SSCD（自监督复制检测特征，用于去重）
FAISS（近似最近邻搜索库）
幂律模型 D(N)=AN^β（预测去重移除数量）
FD-DINOv2（Fréchet Distance基于DINOv2特征，用于评估）
像素空间流匹配（参考基线模型）
Hugging Face（数据集托管平台）
Flickr API、Wikimedia API（图像来源）
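
The power-law model D(N) = A·N^β listed above can be fitted by linear regression in log-log space, since log D = log A + β·log N. The sketch below uses synthetic measurements generated from made-up parameters purely to show the fitting step; none of the numbers are from the paper, and the function names are hypothetical.

```python
import numpy as np


def fit_power_law(n_values, d_values):
    """Fit D(N) = A * N**beta by least squares on (log N, log D)."""
    slope, intercept = np.polyfit(np.log(n_values), np.log(d_values), 1)
    return float(np.exp(intercept)), float(slope)


def predict_removed(a, beta, n):
    """Predicted number of images removed when deduplicating a corpus of size n."""
    return a * n ** beta


# Synthetic example (assumed values A=0.02, beta=1.1, not from the paper).
sample_sizes = np.array([1e4, 1e5, 1e6, 1e7])
removed_counts = 0.02 * sample_sizes ** 1.1
a_hat, beta_hat = fit_power_law(sample_sizes, removed_counts)
```

Fitting on small subsets and extrapolating with `predict_removed` is one way such a model could estimate the removal count at full scale before running the complete pass.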

Strengths:

数据集规模大（1亿图像），许可开放（MIT），适合学术和商业研究。
集中托管、分片格式，下载稳定且无需爬虫基础设施。
提供了多种描述长度（标签/短/中/长），适应不同生成任务。
保守去重策略在保留多样性同时减少冗余，适合大规模训练。
更新了评估协议（FD-DINOv2），避免FID饱和问题。

Limitations:

图像来源仅限于Flickr和Wikimedia，可能缺乏某些领域（如医学、遥感）的多样性。
描述由单一VLM（Qwen3-VL-4B）生成，可能存在模型偏见或错误。
去重阈值基于人工校准和幂律模型，可能仍有少量重复或遗漏。
数据集仅包含静态图像，不涉及视频或多模态数据。
参考基线模型（像素空间流匹配）性能可能不如当前最优生成模型。

Relevance To Keywords:

Unify Models: 论文聚焦于数据集构建，未直接涉及模型统一，但为统一模型训练提供了大规模许可数据。
World Models: 数据集可用于训练世界模型中的视觉生成组件，但论文本身不涉及世界模型方法。
Representation Learning: 数据集可用于表征学习预训练，但论文未探讨具体表征学习方法。
Model-Based RL: 数据集可用于生成模型作为环境模拟器，但论文未涉及强化学习。
原生多模态大模型: 数据集包含图像和文本描述，可直接用于多模态大模型的训练和评估。
多模态大模型的理解和生成一体化: 数据集支持生成任务，但论文未涉及理解任务。
后训练: 数据集可用于后训练阶段，但论文未讨论后训练策略。

74. AgentSchool: An LLM-Powered Multi-Agent Simulation for Education [PASS]

Score: 28.5 / 27.8

Authors: Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Aimin Zhou, Jingjing Qu, Jing Shao, Xiangfeng Wang

Published: 2026-05-28

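TL;DR: AgentSchool proposes an LLM-based multi-agent education simulation framework that models learning as state transitions, enabling educational research without the ethical constraints of real-world trials.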

摘要翻译

尽管 LLMs（大语言模型）已迅速部署至课堂，但验证教育人工智能仍极具挑战性：干预措施作用于正在发展的学习者，其认知与社会轨迹被不可逆地塑造，而现实世界试验进展缓慢、受伦理约束且受机构体制限制。基于 LLMs 的教育模拟器已成为一种潜在补救方案，但许多仍把学习简化为基于人格设定的角色扮演；且当仅被优化以复制现有课堂时，可能会在结构上抑制教育变革所需的制度创新。在此工作中，我们引入 AgentSchool，这是一个由 LLMs 驱动的多智能体模拟器，它将学习建模为状态转换而非基于提示的行为。AgentSchool 将认知可增长的学生智能体——配备加权学科知识图谱、思维工作流池及显式误解——与自适应教师智能体相结合，后者在最近发展区（ZPD）内进行规划、支架化教学及反思；该系统嵌入一个可配置的场景生成器中，将教学置于正式与非正式学习领域之中，并包含一个多尺度模拟器，该模拟器解耦了交互尺度、时间粒度与模拟持续时间。实验表明，结构化学生智能体比基线模拟器产生更具差异化的掌握度与误解轨迹，而教师智能体比较显示出依赖于基础模型的模式，这与基于 ZPD 的适应一致。此外，AgentSchool 生成的边缘参与、小团体形成、攻击者引发的凝聚力及意见领袖涌现的轨迹与课堂社会理论一致，且具有合理可信度。除了作为教育研究工具的作用外，AgentSchool 将教育框定为一种具有社会意义的测试平台，用于在组织压力下评估长时程记忆、多智能体协调及未来制度推理能力。

Abstract

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	4.5/10	6.8
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.5/10	6.8

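评分理由: The paper builds a unified simulation framework covering students, teachers, and scenarios (Unify Models 3.0), and models learning as state transitions with an emphasis on long-horizon memory, which relates to the core concepts of World Models and model-based RL (4.5 each). However, it relies mainly on text-only LLMs, does not focus on Tokenizer architecture or a Visual Encoder (1.0 each), and emphasizes multi-agent interaction rather than multimodal input (MultiModal 2.0, MLLM 3.0). The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).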

关键词

LLM-powered, Multi-Agent Simulation, Education, State Transition, Student Agents, Teacher Agents, Social Theories

深度分析

Chinese Title: AgentSchool：基于大语言模型的多智能体教育模拟系统

Summary: 本文提出AgentSchool，一个基于大语言模型（LLM）的多智能体模拟平台，用于教育生态系统建模。针对教育AI验证中真实试验缓慢、伦理约束强、制度惯性大等问题，AgentSchool将学习建模为状态转移过程而非简单的角色扮演。系统包含可成长的学生智能体（配备加权知识图谱、思维工作流池和显式误解）、自适应教师智能体（基于最近发展区进行规划、支架和反思）、可配置场景生成器（支持正式与非正式学习场景）以及多尺度模拟器（解耦交互规模、时间粒度和持续时间）。在2×3对照实验和五个骨干LLM上，结构化学生智能体产生更差异化的掌握和误解轨迹，教师智能体比较显示与ZPD适应一致的骨干依赖模式。在社会模拟中，AgentSchool生成边缘参与、小团体形成、攻击者诱导的凝聚力和意见领袖涌现等合理轨迹。论文还讨论了AgentSchool作为教育研究工具以及长期记忆、异构多智能体协调等测试床的意义。

Innovations:

提出可成长的学生智能体架构，包含加权知识图谱、思维工作流池和显式误解，实现认知状态的可演化建模。
设计自适应教师智能体，基于最近发展区（ZPD）理论进行规划、支架和反思，实现教学行为的理论驱动。
构建可配置场景生成器，支持正式与非正式学习场景的灵活定义，并计划扩展至制度压力模拟。
引入多尺度模拟器，解耦交互规模、时间粒度和模拟持续时间，支持长周期教育轨迹分析。
将教育系统建模为部分可观察的多智能体状态转移过程，而非静态提示行为，提升机制保真度。

Methodology: 采用基于LLM的多智能体模拟方法。系统定义学生、教师和环境智能体，每个智能体具有内部状态（如知识图谱、思维工作流、误解）。模拟过程通过部分可观察的状态转移算子更新，其中智能体基于局部观测选择动作。实验设计为2×3对照研究（两种学生架构×三种教学条件），使用五个不同骨干LLM进行对比。社会模拟通过配置非正式场景（如课间互动）观察群体动态。系统还提供用户界面支持人工检查和干预。

Key Results:

结构化学生智能体（带知识图谱和误解）比基线模拟器产生更差异化的掌握和误解轨迹。
教师智能体比较显示骨干依赖模式，与ZPD理论指导的适应行为一致。
社会模拟生成边缘参与、小团体形成、攻击者诱导的凝聚力和意见领袖涌现等合理社会动态。
AgentSchool支持长周期模拟和反事实推理，可作为教育AI的“风洞”测试环境。

Tech Stack:

大语言模型（LLM）：五个骨干模型（具体型号未列出）
知识图谱：加权主题知识图谱表示学生知识掌握
思维工作流池：预定义和可学习的推理流程集合
最近发展区（ZPD）理论：用于教师智能体的自适应教学
部分可观察马尔可夫决策过程（POMDP）：系统状态转移形式化
多智能体模拟框架：基于LLM的智能体交互与状态更新

Strengths:

理论驱动：系统设计紧密耦合教育理论（ZPD、知识图谱、社会学习理论），提升机制保真度。
可配置性：场景生成器支持多种学习环境，便于探索反事实和制度创新。
多尺度模拟：解耦时间、规模、粒度，支持从单节课到长期教育轨迹的分析。
可检查性：提供用户界面和状态记录，支持人工干预和因果路径追溯。
实证校准：行为可基于真实教育数据或理论约束，平衡数据稀缺与理论严谨性。

Limitations:

当前未实现管理者和评估者角色，制度层面模拟（如合法性追求、模仿同构）仅为计划扩展。
依赖LLM的推理可靠性，LLM的幻觉和偏见可能影响模拟真实性。
未进行大规模真实教育数据校准，理论约束可能不足以完全代表复杂现实。
社会模拟结果虽合理但缺乏定量验证，与真实课堂数据的对比尚不充分。
计算成本较高，长周期多智能体模拟可能面临效率瓶颈。

Relevance To Keywords: 论文主要聚焦教育领域的多智能体模拟，与给定的关键词（Unify Models, World Models, Representation Learning, Model-Based RL, 原生多模态大模型，多模态大模型的理解和生成一体化，表征学习，世界模型，强化学习，后训练）相关性较低。其中，LLM作为核心组件，但未涉及多模态或世界模型；强化学习仅隐含在教师智能体的自适应中，未显式使用RL算法；后训练未提及。因此相关性一般，但可作为教育场景下LLM智能体模拟的案例参考。

75. From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs [PASS]

Score: 28.5 / 27.8

Authors: Silin Zhou, Chenhao Wang, Yuntao Wen, Shuo Shang, Lisi Chen, Panos Kalnis

Published: 2026-05-28

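TL;DR: This paper proposes HTP, which hierarchically generates travel-pattern tokens and then uses a large language model to synthesize realistic urban trajectories, addressing trajectory generation under privacy constraints.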

摘要翻译

城市轨迹在建模城市动态及支撑各类智慧城市应用中扮演着至关重要的角色。然而，隐私顾虑限制了人们对大规模、高质量轨迹数据集的访问。轨迹生成提供了一种有前景的替代方案，通过合成真实数据来缓解隐私风险。然而，现有方法未能显式捕捉出行模式，且只能在单一条件下生成长度固定的轨迹。为了解决这些局限性，我们提出 HTP，该方法首先分层生成出行模式，然后利用大语言模型（LLMs）生成 GPS 点，而非直接生成 GPS 点。我们首先设计了一种轨迹专用的残差量化变分自编码器（RQ-VAE），它以粗到细的方式将微观层面的 GPS 轨迹量化为紧凑的宏观层面出行模式标记。这些标记捕捉了丰富的轨迹片段的空间不规则性，例如由交通状况引起的点密度变化。随后，我们利用出行模式标记扩展 LLM 的词汇表，以使轨迹表示与 LLM 输入对齐，并应用监督微调（SFT）使 LLM 与轨迹生成任务对齐，从而能够在各种条件下生成出行模式序列。在两个真实世界数据集上的广泛实验表明，HTP 在生成质量方面平均比最强基线高出 29.78%。我们的代码可在 https://github.com/slzhou-xy/HTP 获取。

Abstract

Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	9.0/10	13.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

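评分理由: The paper's core contribution is using an RQ-VAE to quantize GPS points into travel-pattern tokens that an LLM can consume, which is highly relevant to 'Tokenizer'. It also involves sequence generation and latent representations, giving it some connection to 'World Models'. It does not involve visual encoding, reinforcement learning, or multimodal alignment, so those relevance scores are low. The author list does not include any of the specified experts.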

关键词

Trajectory Generation, Large Language Models, Travel Patterns, GPS Points, Residual Quantization, Variational Autoencoder, Urban Dynamics

深度分析

Chinese Title: 从GPS点到出行模式：基于大语言模型的灵活且语义化的轨迹生成

Summary: 本文针对现有轨迹生成方法仅关注微观GPS点生成、忽略宏观出行模式以及仅支持单一条件的问题，提出了一种两阶段分层生成框架HTP。第一阶段，设计轨迹专用的残差量化变分自编码器（RQ-VAE），将微观GPS轨迹压缩为宏观出行模式令牌，捕捉段级空间不规则性（如点密度变化）。第二阶段，扩展大语言模型（LLM）词汇表以包含出行模式令牌，通过监督微调（SFT）使LLM能在多种条件（如起终点、时间、距离等）下生成出行模式序列，并利用其自回归特性支持变长序列生成。最后，RQ-VAE解码器将生成的模式序列重建为微观GPS轨迹。在两个真实数据集上的实验表明，HTP在生成质量上平均超越最强基线29.78%，并能生成符合真实分布的高质量变长轨迹。

Innovations:

提出分层生成框架，先宏观出行模式后微观GPS点，显式建模段级行为特征。
设计轨迹专用的RQ-VAE，结合CNN与Transformer编码器及残差量化，有效压缩轨迹并保留空间不规则性。
将多种生成条件统一为自然语言描述，利用LLM的世界知识实现灵活可控的轨迹生成。
通过扩展LLM词汇表与SFT，使LLM能够生成出行模式序列，并天然支持变长轨迹生成。
引入相对重建损失，更好地捕捉GPS点密度变化，提升生成轨迹的真实性。

Methodology: HTP采用两阶段训练：阶段1（轨迹量化）中，使用CNN和Transformer混合编码器对GPS轨迹进行下采样，提取段级特征，然后通过多分辨率残差量化将连续嵌入映射为离散出行模式令牌；解码器结合道路上下文信息逐步上采样重建轨迹，并采用相对重建损失优化。阶段2（LLM驱动生成）中，将出行模式令牌加入LLM词汇表，并将各种条件（如起终点、时间、距离）转化为文本提示，通过监督微调（SFT）训练LLM自回归生成出行模式序列；最后，将生成的令牌序列输入阶段1的解码器得到微观GPS轨迹。
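
The coarse-to-fine quantization in stage 1 rests on residual quantization: each level quantizes what the previous levels failed to explain. The sketch below shows only that core idea on a single embedding vector, with tiny hand-written codebooks. The codebook contents, sizes, and function name are hypothetical; the encoder, decoder, and training losses of the actual RQ-VAE are not modeled.

```python
import numpy as np


def residual_quantize(embedding, codebooks):
    """Quantize `embedding` level by level; each level encodes the remaining residual."""
    residual = np.asarray(embedding, dtype=float).copy()
    reconstruction = np.zeros_like(residual)
    token_ids = []
    for codebook in codebooks:
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))      # nearest code at this level
        token_ids.append(index)
        reconstruction += codebook[index]
        residual -= codebook[index]            # pass the leftover to the next level
    return token_ids, reconstruction


# Hypothetical two-level codebooks: a coarse level and a fine correction level.
coarse = np.array([[0.0, 0.0], [1.0, 1.0]])
fine = np.array([[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]])
tokens, approx = residual_quantize([1.1, 0.9], [coarse, fine])
```

Here the coarse level picks the code near (1, 1) and the fine level corrects the remaining offset, so the token pair reconstructs the input exactly. In an LLM-based generator, sequences of such token ids are what the extended vocabulary would represent.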

Key Results:

在两个真实数据集上，HTP在生成质量指标上平均超越最强基线29.78%。
可视化结果表明HTP能生成与真实轨迹分布一致的变长轨迹，且能模拟拥堵、加速等出行模式。
消融实验验证了RQ-VAE、LLM生成、相对重建损失等各组件对性能均有正向贡献。
HTP支持多种条件（如起终点、时间、距离）的灵活控制，生成轨迹的多样性优于现有方法。

Tech Stack:

残差量化变分自编码器（RQ-VAE）
卷积神经网络（CNN）
Transformer
大语言模型（LLM，基于Qwen系列）
监督微调（SFT）
相对重建损失
地图匹配算法
道路网络图结构

Strengths:

显式建模宏观出行模式，生成轨迹更符合真实交通行为（如拥堵、加速）。
利用LLM的语义理解能力，将多种条件统一为自然语言，实现灵活可控生成。
支持变长轨迹生成，更贴近实际数据分布。
两阶段设计解耦了模式学习与生成，降低直接生成GPS点的难度。
在多个数据集上显著优于现有SOTA方法，生成质量高。

Limitations:

依赖LLM推理，计算成本较高，可能不适合实时或大规模生成场景。
需要地图匹配预处理，对道路网络数据质量敏感。
对罕见或极端出行模式的泛化能力可能不足，受限于训练数据分布。
当前仅使用经纬度，未显式利用时间戳信息（如速度、时间间隔），可能丢失部分时序特征。
SFT阶段需要大量条件-轨迹对数据，标注成本较高。

Relevance To Keywords:

Unify Models: 论文将LLM与VAE统一在一个框架中，实现宏观模式与微观点的联合建模。
World Models: 出行模式令牌编码了交通状况、道路结构等世界知识，LLM生成模式可视为世界模型推理。
Representation Learning: RQ-VAE通过残差量化学习轨迹的离散表征，属于表征学习范畴。
Model-Based RL: 生成的轨迹可作为环境模型用于强化学习中的模拟或规划，但论文未直接涉及RL。
原生多模态大模型: LLM同时处理文本条件与轨迹令牌，可视为多模态输入输出。
多模态大模型的理解和生成一体化: LLM理解条件文本并生成模式序列，实现理解与生成融合。
表征学习: 同上，量化过程是表征学习。
世界模型: 出行模式反映真实世界移动规律，LLM生成模式可看作世界模型预测。
强化学习: 论文未使用强化学习，但SFT属于后训练，与RLHF有相似性。
后训练: SFT是典型的后训练方法，用于对齐LLM与特定任务。

76. HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding [PASS]

Score: 28.5 / 27.8

Authors: Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo, Yu Xi, Zhihan Li, Da Zheng, Colin Zhang, Kai Yu

Published: 2026-05-28

TL;DR: HoliTok proposes a continuous holistic speech tokenization method that enables unified speech generation and understanding within a single latent space, achieving robust performance without additional optimization tricks.

摘要翻译

统一语音基础模型需要一个整体的标记化空间，该空间既要能被语言模型学习，又能解码为高质量波形。然而，现有的语音标记器往往无法同时满足这些要求，导致架构复杂性增加以及更复杂的训练设计。我们提出 HoliTok，一种连续整体语音标记化模型，专为统一生成 - 理解建模而设计。HoliTok 将 48 kHz 语音编码为紧凑的 25 Hz 序列，该序列由 128 维潜在向量组成。它采用渐进策略进行训练，该策略同时保留信号级保真度，融入语义信息，并保持强大的潜在可学习性。基于此标记化，我们构建了一个统一的 AR+DiT 模型用于语音合成与识别，其中相同的潜在序列既支持生成专用任务，也支持统一生成 - 理解任务。实验表明，HoliTok 实现了具有竞争力的重建保真度，提高了高质量且可控合成的生成可学习性，并且在所评估的表示中，是唯一一个在我们统一生成 - 理解架构中稳健运行而无需额外优化技巧的表示。这些结果表明，HoliTok 可作为有效的语音标记器以及统一口语语言建模的基础表示接口。代码可在以下网址获取：https://github.com/bovod-sjtu/HoliTok.

Abstract

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	10.0/10	15.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper introduces HoliTok, a continuous holistic speech tokenization model, which is highly relevant to 'Tokenizer' (core contribution) and 'Unify Models' (focus on unified generation-understanding framework). However, the work is speech/audio-centric, lacking vision components, thus 'Visual Encoder', 'MLLM' (typically vision-language), 'MultiModal' (in vision context), 'World Models', and 'model-based RL' are not applicable. No expert authors from the specified list are found in the author list. Weighted total: (9.0 + 10.0) * 1.5 = 28.5, which exceeds the dynamic pass score of 27.8.
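
The weighted total quoted above can be reproduced from the score table. This is a minimal sketch: the function name is hypothetical, and the rule "pass when the total is at least the threshold" is an assumption, since the report only says the total exceeds the dynamic pass score.

```python
WEIGHT = 1.5        # per-keyword weight from the table
PASS_SCORE = 27.8   # dynamic pass score quoted in the report


def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * score for score in relevance.values())


# Relevance values copied from the HoliTok score table.
holitok_relevance = {
    "Unify Models": 9.0,
    "Tokenizer": 10.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}
total = weighted_total(holitok_relevance)
passed = total >= PASS_SCORE  # assumed pass rule
```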

关键词

Speech Tokenization, Unified Generation-Understanding, Continuous Holistic Tokenization, AR+DiT Model, Speech Foundation Models, Latent Sequence, Waveform Reconstruction

深度分析

Chinese Title: HoliTok：一种具备语音生成与理解鲁棒双能力的连续整体分词方法

Summary: 本文提出HoliTok，一种连续整体语音分词模型，旨在为统一语音生成与理解提供可学习的、可解码的潜在表示空间。HoliTok将48 kHz语音编码为25 Hz、128维的连续潜在序列，采用渐进式三阶段训练策略：第一阶段训练确定性自编码器以保持高保真重建；第二阶段引入时序变分瓶颈，通过弱KL正则化使潜在序列平滑且可预测；第三阶段通过高层特征蒸馏和音频-语言监督进一步强化变分正则化，使潜在空间同时保留语义信息。基于HoliTok，作者构建了AR+DiT统一架构，其中同一潜在序列同时支持语音合成（TTS）和语音识别（ASR）。实验表明，HoliTok在重建保真度、高质量可控合成方面表现优异，且在统一生成-理解架构中无需额外优化技巧即可鲁棒运行，优于现有连续分词器。

Innovations:

提出渐进式三阶段训练策略，逐步塑造高保真、可学习且语义丰富的连续潜在空间，避免强KL约束导致的信息丢失。
设计时序变分瓶颈（LSTM+归一化流），使潜在序列平滑且易于自回归建模，同时保持重建质量。
在统一AR+DiT架构中直接使用同一连续潜在序列进行生成与理解，无需额外语义模块或多流设计，简化模型复杂度。
通过高层特征蒸馏和音频-语言监督（Stage III）将语义信息注入潜在空间，提升下游理解任务性能。
实现48 kHz语音到25 Hz超低帧率压缩，在极高压缩比下仍保持竞争性重建保真度。

Methodology: HoliTok基于低延迟变分自编码器（VAE）架构，编码器采用因果卷积下采样（总步长1920，对应25 Hz），解码器采用BigVGAN风格上采样。训练分三阶段：Stage I仅训练确定性自编码器，优化多尺度频谱重建、对抗损失和特征匹配损失；Stage II冻结编解码器，训练时序变分瓶颈（4层LSTM+归一化流），使用小权重KL正则化；Stage III进一步训练变分瓶颈并引入监督网络（0.6B Transformer编码器+Qwen2.5-0.5B解码器），通过高层特征蒸馏和音频-语言对比学习优化潜在空间。下游采用AR+DiT架构：LLM对潜在序列进行自回归建模，生成时LLM预测语义隐状态，由DiT流匹配头解码为下一潜在块；理解时LLM通过LM头预测文本token。
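
The frame-rate arithmetic behind the encoder can be checked directly. The sketch below multiplies the six per-block strides listed in the Tech Stack section and divides the 48 kHz sample rate by the result. The derived "latent values per second" figure is an illustrative extra, not a number from the paper.

```python
from math import prod

SAMPLE_RATE_HZ = 48_000
STRIDES = [2, 2, 2, 4, 6, 10]   # per-block strides from the Tech Stack list
LATENT_DIM = 128

total_stride = prod(STRIDES)                       # samples consumed per latent frame
frame_rate_hz = SAMPLE_RATE_HZ / total_stride      # latent frames per second
values_per_second = frame_rate_hz * LATENT_DIM     # continuous values per second of audio

print(total_stride, frame_rate_hz, values_per_second)
```

The product of the strides matches the stated total stride of 1920, which yields the 25 Hz latent rate.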

Key Results:

HoliTok在48 kHz语音重建上达到竞争性保真度，25 Hz帧率下仍保持高质量波形。
在语音合成（TTS）任务中，HoliTok支持高质量、多样化和可控的生成，优于现有连续分词器。
在统一生成-理解建模（ASR+TTS）中，HoliTok-Base提供更友好的连续潜在空间，HoliTok-Unite进一步同时提升合成和识别性能。
与MingTok-Audio等基线相比，HoliTok无需额外语义模块或优化技巧即可在AR+DiT架构中鲁棒运行。
消融实验验证了渐进式训练策略的有效性，Stage III的语义蒸馏显著提升理解任务表现。

Tech Stack:

因果卷积下采样（6个步长块，核大小4/4/4/8/12/20，步长2/2/2/4/6/10）
时序变分瓶颈（4层LSTM + 1×1卷积预测均值和方差 + 归一化流）
BigVGAN风格解码器（AMPBlocks + SnakeBeta激活函数）
多尺度频谱重建损失（L_spec）、对抗损失（GAN）、特征匹配损失（L_fm）
KL散度正则化（弱权重β=0.001）
归一化流（Normalizing Flow）增强后验表达能力
Transformer编码器（0.6B参数） + Qwen2.5-0.5B解码器用于语义蒸馏
AR+DiT架构：自回归语言模型（LLM）+ 扩散Transformer（DiT）流匹配头
重参数化技巧（Reparameterization Trick）

Strengths:

提出了一种真正统一的连续语音表示，同时满足可解码、可学习和信息丰富三个要求，简化了下游模型设计。
渐进式训练策略巧妙平衡了重建保真度与潜在空间可学习性，避免了传统VAE的保真度损失。
在极高压缩比（48 kHz→25 Hz）下仍保持高质量重建，效率高。
在统一生成-理解架构中验证了表示的有效性，实验设计全面（重建、合成、统一建模）。
代码开源，可复现性强。

Limitations:

论文仅验证了ASR和TTS两种任务，未涉及更广泛的语音理解任务（如情感识别、说话人识别）或生成任务（如语音编辑、语音翻译）。
AR+DiT架构本身计算开销较大，HoliTok的25 Hz帧率虽低，但128维潜在维度可能增加LLM建模复杂度。
Stage III的语义蒸馏依赖外部预训练语言模型（Qwen2.5），可能引入领域偏差。
因果编码器引入2帧lookahead，并非严格因果，可能不适用于某些实时场景。
与离散分词器相比，连续表示在推理时需处理浮点序列，可能增加部署难度。

Relevance To Keywords:

Unify Models: HoliTok直接服务于统一语音生成与理解模型，其潜在空间作为共享接口，与统一多模态大模型目标高度一致。
World Models: 连续潜在表示可作为世界模型中的状态表征，支持语音环境中的预测和规划。
Representation Learning: 渐进式训练策略本质上是表征学习，通过重建、变分正则化和语义蒸馏学习整体语音表示。
Model-Based RL: 虽然论文未直接涉及强化学习，但HoliTok的潜在空间可被用于基于模型的语音交互系统，作为环境模型的状态表示。
原生多模态大模型: HoliTok的连续分词思想可类比视觉领域的连续tokenizer，为多模态大模型提供统一的语音模态接口。
后训练: Stage III的语义蒸馏可视为一种后训练阶段，利用预训练语言模型提升表示语义性。

77. MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables [PASS]

Score: 28.5 / 27.8

Authors: Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

Published: 2026-05-28

TL;DR: MELD addresses the limitation of separate encoder optimization in speech language models by jointly training a discrete latent variable model with the language model, improving TTS and STT performance.

摘要翻译

近期语音语言模型依赖于与自回归模型分开优化的编码器。由于这些编码器不了解下游目标，所提取的表征可能并非适用于下游任务的最优解。为了解决这一局限，我们提出了一种基于梅尔频谱图的离散潜变量模型，该模型联合优化编码器与语音语言模型。联合优化不仅在零样本文本到语音（TTS）和语音到文本（STT）任务上优于基于编解码器（codec）及其他梅尔频谱图的基线方法，而且有效缓解了自回归梅尔频谱建模中常见的问题，例如长时间静音生成和词遗漏。

Abstract

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on speech language modeling, showing high relevance to 'Tokenizer' via discrete latent variables and moderate relevance to 'Unify Models' through joint encoder-LM optimization. 'MultiModal' has moderate relevance (speech-text), while 'Visual Encoder', 'World Models', and 'model-based RL' are unrelated. 'MLLM' is low relevance as it is speech-specific.

关键词

Mel-Spectrogram, Speech Language Modeling, Discrete Latent Variables, Joint Optimization, Text-to-Speech, Speech-to-Text, Autoregressive Modeling

深度分析

Chinese Title: MELD：基于梅尔频谱图的离散潜在变量语音语言建模

Summary: 本文提出MELD（Mel-Spectrogram-Based Discrete Latent Language Model），一种联合优化编码器和自回归语音语言模型的离散潜在变量模型。现有语音语言模型通常采用两阶段训练：先训练语音编解码器或VAE提取中间表示，再训练自回归模型，但编码器不了解下游任务，导致表示可能非最优。MELD直接在梅尔频谱图上进行联合优化，通过引入离散潜在变量空间和连续梅尔频谱图空间，利用离散采样抑制自回归生成中常见的长时间静音和单词遗漏问题。在零样本文本转语音（TTS）和语音转文本（STT）任务上，MELD优于基于编解码器和梅尔频谱图的基线模型（如MELLE、VALL-E）。此外，MELD能够在一个自回归模型中同时学习TTS和STT任务，联合优化显著提升了STT性能。

Innovations:

提出离散潜在变量模型，在梅尔频谱图上联合优化编码器和自回归语言模型，避免两阶段训练的信息损失。
将生成过程扩展为离散潜在空间和连续梅尔频谱图空间，利用离散采样有效抑制自回归生成中的长时间静音和伪影。
通过变分下界推导出可训练的联合目标，包括KL散度项和重构损失项，并引入慢速惩罚促进生成多样性。
将TTS和STT任务统一到同一自回归框架中，通过特殊标记<TTS>和<STT>控制任务切换，实现联合训练。
冻结基于k-means初始化的码本，让重构网络学习细化码字，避免向量量化训练困难。

Methodology: MELD采用变分自回归框架：1）量化网络q(z_t|x_t)基于软向量量化（soft VQ）将当前梅尔帧映射到离散潜在变量；2）自回归网络p(z_t|x_{<t}, y)预测下一个离散潜在变量（TTS）或文本token（STT），使用解码器Transformer；3）重构网络p(x_t|z_t, x_{<t}, y)结合码字和上下文重建梅尔帧，包含MLP和卷积模块。训练目标为变分下界（VLB），包括KL散度和MSE重构损失。推理时从预测的离散分布中采样z_t，再生成梅尔帧。STT任务通过将文本token和离散潜在变量合并为统一词汇表，并采用交叉熵损失。
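
The training objective described above combines a KL term between the quantizer posterior and the autoregressive prior with an MSE reconstruction term. The sketch below is one possible reading for a single mel frame. It assumes the soft-VQ posterior is a softmax over negative squared distances to a fixed codebook and that the two loss terms are weighted equally; both choices, along with all names and the toy codebook, are assumptions rather than the paper's exact formulation.

```python
import numpy as np


def soft_vq_posterior(frame, codebook, temperature=1.0):
    """q(z | x): softmax over negative squared distances to each codeword (assumed form)."""
    sq_dist = np.sum((codebook - frame) ** 2, axis=1)
    logits = -sq_dist / temperature
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) for two categorical distributions."""
    q = np.clip(q, eps, 1.0)
    p = np.clip(p, eps, 1.0)
    return float(np.sum(q * (np.log(q) - np.log(p))))


def negative_vlb(frame, reconstruction, q, p):
    """MSE reconstruction loss plus KL(q || p), with equal weighting (assumed)."""
    mse = float(np.mean((frame - reconstruction) ** 2))
    return mse + categorical_kl(q, p)


# Toy two-codeword example (hypothetical values).
toy_codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
toy_frame = np.array([0.1, 0.0])
posterior = soft_vq_posterior(toy_frame, toy_codebook)
uniform_prior = np.array([0.5, 0.5])
loss = negative_vlb(toy_frame, toy_frame, posterior, uniform_prior)
```

With a perfect reconstruction the loss reduces to the KL term alone, which is zero only when the autoregressive prior matches the quantizer posterior.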

Key Results:

在零样本TTS延续任务上，MELD一致优于MELLE、VALL-E等基线，有效抑制长时间静音。
在STT任务上，MELD的联合优化相比独立离散化的dMel方法显著降低词错误率（WER）。
MELD无需单独的停止预测器，离散采样自然解决了生成终止问题。
联合训练TTS和STT时，STT性能提升明显，且TTS质量保持或优于单独训练。

Tech Stack:

梅尔频谱图（Mel-spectrogram）
字节对编码（BPE）
变分下界（VLB）
KL散度
软向量量化（Soft VQ）
k-means初始化码本
解码器Transformer
多层感知机（MLP）
卷积神经网络（CNN）
均方误差（MSE）
慢速惩罚（Slowness penalty）
特殊标记（<TTS>, <STT>, <EOS>）

Strengths:

联合优化编码器和自回归模型，避免两阶段训练的信息损失，提升下游任务性能。
离散潜在变量空间结合连续梅尔频谱图，兼顾离散采样的稳定性和连续表示的保真度。
有效解决自回归梅尔频谱图生成中的静音和伪影问题，无需额外停止预测器。
统一TTS和STT框架，实现多任务联合学习，且STT性能显著提升。
码本冻结策略简化训练，避免向量量化梯度问题。

Limitations:

模型复杂度较高，需要同时训练量化网络、自回归网络和重构网络。
依赖于k-means初始化的码本，可能对数据分布敏感。
仅在零样本TTS和STT任务上评估，未涉及多说话人、情感控制等更复杂场景。
与纯编解码器模型相比，推理时需逐步生成梅尔帧，速度可能较慢。
未与最新的大规模语音语言模型（如AudioLM）进行直接比较。

Relevance To Keywords:

Unify Models: MELD通过联合优化编码器和自回归模型，统一了表示学习和生成建模，符合统一模型的思想。
World Models: 自回归语音语言模型可视为对语音序列的世界模型，MELD的离散潜在变量有助于学习结构化表示。
Representation Learning: 离散潜在变量学习是表征学习的一种形式，MELD联合优化使得表示适应下游任务。
Model-Based RL: 论文未直接涉及强化学习，但联合优化可类比于基于模型的方法中的端到端学习。
原生多模态大模型: MELD处理语音和文本两种模态，但未涉及图像/视频，属于多模态语音语言模型。
多模态大模型的理解和生成一体化: MELD同时支持TTS（生成）和STT（理解），实现理解和生成一体化。
表征学习: 离散潜在变量是语音的表征，通过变分方法学习。
世界模型: 自回归预测可视为对语音序列的建模，类似世界模型。
强化学习: 论文未使用强化学习，但后训练阶段可能适用。
后训练: 论文未提及后训练，但联合优化可视为一种端到端训练方式。

78. LiteCoder-Terminal: Scaling Long-Horizon Terminal Environments for Learning Language Agents [PASS]

Score: 28.5 / 27.8

Authors: Xiaoxuan Peng, Kaiqi Zhang, Xinyu Lu, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Published: 2026-05-28

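TL;DR: This paper proposes a zero-dependency pipeline for synthesizing terminal environments and shows that supervised fine-tuning plus preference optimization substantially improves language agents on command-line tasks.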

摘要翻译

掌握终端环境需要具备多步规划、基于反馈的执行以及动态状态适应能力的语言智能体。然而，目前训练此类智能体受限于对外部爬取仓库的依赖，这限制了领域多样性、环境可控性以及针对特定能力缺陷的针对性优化。我们引入了 LiteCoder-Terminal-Gen，这是一种零依赖合成管道，能够直接从领域规范自动生成可执行且可验证的终端训练环境。利用该框架，我们构建了两个大规模资源：LiteCoder-Terminal-SFT，包含跨越 10 个领域的 11,255 条专家轨迹；以及 LiteCoder-Terminal-RL，包含 602 个用于轨迹级偏好优化的可验证环境。在我们的 SFT 数据集上对 Qwen 系列模型进行监督微调，生成的智能体显著优于其基线模型。值得注意的是，我们的 32B 变体在 Terminal Bench 1.0、2.0 和 Pro 上分别取得了 29.06%、18.54% 和 34.00% 的 pass@1 成绩。此外，在我们的 RL 环境中应用直接多轮偏好优化（DMPO）带来了额外的性能提升。这些结果系统性地表明，完全合成的可执行环境为掌握复杂的真实世界命令行工作流提供了一种可扩展且可验证的监督信号。

Abstract

Mastering terminal environments requires language agents capable of multi-step planning, feedback-grounded execution, and dynamic state adaptation. However, training such agents is currently bottlenecked by a reliance on scraped external repositories, which limits domain diversity, environment controllability, and the targeting of specific capability deficits. We introduce LiteCoder-Terminal-Gen, a zero-dependency synthesis pipeline that autonomously generates executable and verifiable terminal training environments directly from domain specifications. Using this framework, we construct two large-scale resources: LiteCoder-Terminal-SFT, comprising 11,255 expert trajectories across 10 domains, and LiteCoder-Terminal-RL, featuring 602 verifiable environments for trajectory-level preference optimization. Supervised fine-tuning of Qwen-family models on our SFT dataset yields agents that significantly outperform their base counterparts. Notably, our 32B variant achieves 29.06%, 18.54%, and 34.00% pass@1 on Terminal Bench 1.0, 2.0, and Pro, respectively. Furthermore, applying Direct Multi-turn Preference Optimization (DMPO) on our RL environments yields additional performance gains. These results systematically demonstrate that fully synthetic, executable environments offer a scalable and verifiable supervision signal for mastering complex, real-world command-line workflows.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0

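评分理由: The paper centers on synthesizing terminal environments and training language agents. Although it uses Qwen-family models (scored under MLLM) and a preference-optimization stage (scored under RL), interaction is text-only: there is no visual encoder (Visual Encoder = 0) and no multimodal fusion (MultiModal = 2). The tokenizer is not a core contribution. 'Unify Models' is reflected in a unified training pipeline rather than a unified model architecture. 'World Models' is reflected in environment synthesis and modeling, hence the moderate score.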

关键词

LiteCoder-Terminal, Language Agents, Terminal Environments, Synthetic Generation, Preference Optimization, Supervised Fine-Tuning, Command-line Workflows

深度分析

Chinese Title: LiteCoder-Terminal：扩展长时域终端环境以训练语言智能体

Summary: 本文针对语言智能体在终端环境中进行多步规划、反馈执行和动态状态适应能力训练的数据稀缺问题，提出了LiteCoder-Terminal-Gen，一个零依赖的合成流水线，能够从领域规范自动生成可执行且可验证的终端训练环境。基于该框架，构建了两个大规模资源：LiteCoder-Terminal-SFT（包含11,255条专家轨迹，覆盖10个领域）和LiteCoder-Terminal-RL（包含602个可验证环境，用于轨迹级偏好优化）。对Qwen系列模型进行监督微调后，智能体性能显著提升，其中32B变体在Terminal Bench 1.0、2.0和Pro上分别达到29.06%、18.54%和34.00%的pass@1。进一步应用直接多轮偏好优化（DMPO）在RL环境上获得额外增益。结果表明，完全合成的可执行环境为掌握复杂真实命令行工作流提供了可扩展且可验证的监督信号。

Innovations:

提出LiteCoder-Terminal-Gen，一种零依赖的终端环境合成框架，能够从零开始自动生成定制化的终端任务、环境和验证器，无需依赖外部数据源。
构建了大规模开源数据集LiteCoder-Terminal-SFT（11,255条专家轨迹）和LiteCoder-Terminal-RL（602个可执行环境），填补了系统级终端训练数据的空白。
采用五阶段顺序流水线（指令精炼、环境初始化、解决方案合成、验证器生成、配置导出）确保因果一致性，防止逻辑错误。
在验证器生成中引入四阶段对抗迭代（草稿-攻击-精炼-最终化），提高测试质量，拒绝懒惰解而接受合法变体。
首次在终端智能体训练中应用直接多轮偏好优化（DMPO），在监督微调基础上进一步提升了4B模型在困难基准上的性能。

Methodology: 论文采用零依赖的合成流水线方法。首先，通过Magpie式LLM采样策略从10个终端领域生成原始任务描述，并进行可行性筛选。然后，通过五阶段顺序流水线将每个任务转化为可执行环境：指令精炼（绑定绝对路径和确定性输出格式）、环境初始化（基于Ubuntu 24.04的Dockerfile和输入工件）、解决方案合成（生成可执行的solve.sh作为可解性检查）、验证器生成（使用四阶段对抗迭代确保测试质量）、配置导出（生成Harbor格式的元数据）。最后，使用教师模型（如MiniMax）生成专家轨迹构建SFT数据集，并利用可验证环境进行DMPO偏好优化。

Key Results:

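LiteCoder-Terminal-SFT contains 11,255 expert trajectories covering 10 domains: AI & ML, build tools, data science, networking, security, system administration, version control, coding, scientific computing, and games.
LiteCoder-Terminal-RL contains 602 executable, verifiable terminal environments.
The Qwen-32B model reaches pass@1 of 29.06% on Terminal Bench 1.0, 18.54% on Terminal Bench 2.0, and 34.00% on Terminal Bench Pro.
Smaller Qwen variants (4B, 7B, 14B) all clearly outperform their base models after supervised fine-tuning.
After DMPO, the 4B SFT model improves further on Terminal Bench 2.0 and Pro.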

Tech Stack:

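Qwen-family base models (4B, 7B, 14B, 32B)
MiniMax model (teacher model for generating expert trajectories)
Magpie-style LLM sampling (task generation)
Harbor task format (unified interface)
Docker (Ubuntu 24.04-based environment containers)
pytest (verifier test suites)
Direct Multi-turn Preference Optimization (DMPO)
Four-stage adversarial iteration (draft, attack, refine, finalize) in verifier generation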

Strengths:

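The zero-dependency synthesis framework removes reliance on external sources such as GitHub and Stack Overflow, and can generate training data that targets specific model capability deficits.
The five-stage sequential pipeline keeps the stages causally consistent; the resulting environments are executable and verifiable, giving reinforcement learning a reliable reward signal.
The large-scale open-source datasets (SFT and RL) fill a gap in terminal-agent training data and support community research.
Experiments span several model scales (4B to 32B) and consistently show that synthetic data improves terminal-task performance, with DMPO adding further gains.
The adversarial iteration in verifier generation balances test strictness against flexibility.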

Limitations:

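Synthetic data may not fully cover the diversity and complexity of real-world terminal tasks, so distribution bias is possible.
The quality of expert trajectories from the teacher model (MiniMax) may cap the quality of the SFT data; the effect of a stronger teacher is not explored.
The number of RL environments (602) is relatively small, which may limit how well preference optimization generalizes.
Experiments cover only Qwen-family models; transfer to other base models (e.g., Llama, Mistral) is not verified.
The paper does not analyze the gap between synthetic and real terminal tasks in detail, and lacks human evaluation or a user study.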

Relevance To Keywords:

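Unify Models / native multimodal large models: the paper focuses on text-only terminal environments and involves no multimodal input or output; weak relevance.
World Models: a terminal environment can be viewed as a partially observable world, but the paper does not build an explicit world model; moderate relevance.
Representation Learning: the paper uses no representation-learning techniques; weak relevance.
Model-Based RL: the paper uses DMPO (model-free preference optimization) rather than a model-based method; weak relevance.
Unified multimodal understanding and generation: the paper does not involve multimodality; weak relevance.
Reinforcement learning / post-training: the core methods are supervised fine-tuning (post-training) and DMPO (an RL variant), so the paper is highly relevant to reinforcement learning and post-training.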

79. Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents [PASS]

Score: 28.5 / 27.8

Authors: Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang

Published: 2026-05-28

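TL;DR: To address GUI agents' lack of robustness in recovering from their own errors, the paper proposes a benchmark (GUI-RobustEval) and a synthesis framework (RoTS) that markedly improve performance on OSWorld.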

Abstract

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	3.0/10	4.5

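Scoring rationale: The paper focuses on GUI-agent robustness and error recovery. It builds on an MLLM and multimodal interaction (hence the relatively higher scores for those keywords), but does not address model unification, tokenizers, or visual-encoder architecture. The method is trajectory synthesis plus fine-tuning rather than an explicit world model or model-based reinforcement learning, so the related keywords score low.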
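
The arithmetic behind each score table can be reproduced with a short sketch. The rows below are copied from the table for paper 79, and 27.8 is the pass mark shown next to each score in this report. The function name and the use of `>=` for the pass decision are assumptions made for illustration; the report does not state how a tie with the pass mark is handled.

```python
# Weighted relevance scoring as shown in the tables: score = weight * relevance.
# Rows are copied from the table for paper 79; names are illustrative.

def weighted_total(rows):
    """rows: iterable of (keyword, weight, relevance_out_of_10)."""
    return sum(weight * relevance for _, weight, relevance in rows)


rows = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 2.0),
    ("World Models", 1.5, 3.0),
    ("MLLM", 1.5, 4.0),
    ("MultiModal", 1.5, 4.0),
    ("model-based RL", 1.5, 3.0),
]

PASS_MARK = 27.8  # pass mark displayed in the "Score" line of each entry

total = weighted_total(rows)
# Assumption: a paper passes when its total reaches the pass mark.
status = "PASS" if total >= PASS_MARK else "FAIL"
print(total, status)
```

The total of 28.5 matches the "Score: 28.5 / 27.8" line for this entry.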

Keywords

GUI Agents, Error Recovery, Trajectory Synthesis, Robustness, Benchmarking, OSWorld, Fine-tuning

In-depth analysis

Chinese Title: 恢复策略诱导错误：面向鲁棒GUI智能体的基准测试与轨迹合成

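Summary: GUI agents deployed in the real world often fail because of errors induced by their own policy (e.g., wrong grounding, misjudged screen state, wrong sub-goals). The paper addresses this at both the evaluation and data levels. For evaluation, it builds the GUI-RobustEval benchmark: 1,216 executable test cases covering 11 types of policy-induced error and 4 controllable error depths, with two fine-grained metrics, error-awareness rate and post-error success rate. For data, it proposes RoTS, a robustness-driven trajectory synthesis framework whose tree-structured online sampling pipeline actively explores diverse error modes and synthesizes recovery trajectories, producing 800k high-quality samples. RoTS-7B and RoTS-32B, fine-tuned on this dataset, improve markedly on GUI-RobustEval and on traditional GUI benchmarks; RoTS-32B reaches a 47.4% success rate and 33.8% All-Pass@4 on OSWorld, indicating that long-horizon error recovery contributes to overall performance.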

Innovations:

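Proposes the GUI-RobustEval benchmark for fine-grained evaluation of policy-induced errors, covering 11 realistic error types and 4 error depths, with two metrics: error-awareness rate and post-error success rate.
Proposes the RoTS data synthesis framework, whose tree-structured online sampling pipeline (explore-recovery co-expansion) actively discovers diverse failure modes and synthesizes long-horizon recovery trajectories, addressing gaps in error coverage and error time span in existing training data.
Builds a dataset of 800k high-quality error-recovery trajectories and fine-tunes Qwen2.5-VL into RoTS-7B and RoTS-32B, which achieve the best results on several benchmarks.
Designs a high-throughput parallel infrastructure that supports large-scale online sampling and evaluation and keeps trajectory replay reproducible.
Analyzes the distribution gap between real failure trajectories and existing training data, identifying two key issues (coverage mismatch and time-span mismatch) that can guide follow-up work.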

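Methodology: First, 1,500 failure trajectories from 12 SOTA agents on OSWorld are collected, and a VLM annotates error types and error time spans to analyze distribution gaps. GUI-RobustEval is then built by manually locating the root-cause step, correcting the prefix, standardizing the action format, and injecting an error at depth d before the agent takes over. The RoTS framework maintains 20k tasks in Ubuntu/Windows cloud environments, uses WebJudge as the outcome reward model, and trains a progress critic and an action critic. It samples online over a tree: on successful branches it branches from fragile states to explore new failure modes (FDE), and on failed branches it replays from the error state and synthesizes recovery trajectories (EIR), selecting nodes with a UCB formula. Finally, the synthesized data is used to fine-tune Qwen2.5-VL.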
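
The node-selection step can be sketched with the standard UCB1 rule, which trades off a node's average reward against how rarely it has been visited. This report does not give the exact formula used in RoTS, so the rule below, the exploration constant, and the node fields are assumptions for illustration only.

```python
import math

# Standard UCB1 score: mean reward plus an exploration bonus.
# This is a generic sketch, not the formula from the RoTS paper.

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """Return the UCB1 score of a child node; unvisited nodes rank first."""
    if visits == 0:
        return math.inf
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_node(children, parent_visits):
    """children: list of dicts with hypothetical keys 'name', 'reward', 'visits'."""
    return max(
        children,
        key=lambda ch: ucb1(ch["reward"], ch["visits"], parent_visits),
    )


children = [
    {"name": "state_a", "reward": 6.0, "visits": 10},
    {"name": "state_b", "reward": 1.0, "visits": 2},
    {"name": "state_c", "reward": 0.0, "visits": 0},
]

print(select_node(children, parent_visits=12)["name"])
print(select_node(children[:2], parent_visits=12)["name"])
```

With these toy numbers the unvisited node is chosen first; once it is excluded, the rarely visited `state_b` outranks `state_a` even though its mean reward is lower, which is the exploration behaviour a tree sampler relies on to reach new failure modes.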

Key Results:

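GUI-RobustEval contains 1,216 test cases covering 11 error types at error depths 0, 1, 3, and 5.
RoTS generated 800k high-quality trajectory samples.
RoTS-32B reaches a 47.4% success rate and 33.8% All-Pass@4 on OSWorld, both state of the art at the time of publication.
On GUI-RobustEval, RoTS-32B's error-awareness rate and post-error success rate are significantly higher than those of baseline models.
Error-type analysis shows that planning and progress-awareness errors are harder to recover from than low-level execution errors, and that performance drops more as error depth grows.
Compared with existing training data (AgentTrek, AgentNet, GUI-Reflection), RoTS data is closer to the distribution of real policy-induced errors.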

Tech Stack:

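Qwen2.5-VL (vision-language model)
PyAutoGUI (action execution)
WebJudge (outcome reward model)
Progress Critic and Action Critic
UCB (Upper Confidence Bound) formula for node selection
t-SNE for visualizing the error-type distribution
POMDP (partially observable Markov decision process) formulation
Tree-structured online sampling (Explore-Recovery Co-Expansion)
High-throughput parallel sampling infrastructure on Ubuntu/Windows cloud environments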

Strengths:

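Targets policy-induced errors seen in real deployment, filling gaps at both the evaluation and training levels.
The benchmark covers a broad set of error types with controllable depth and provides fine-grained diagnostic metrics.
The synthesis framework scales well, automatically discovering diverse failure modes and generating long-horizon recovery trajectories, which reduces human bias.
The models reach SOTA on several benchmarks, supporting the method's effectiveness.
Code and dataset are open-sourced, supporting follow-up research.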

Limitations:

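The benchmark and synthesized data are based mainly on the OSWorld and WindowsAgentArena environments; generalization to other GUI platforms (e.g., mobile) needs verification.
Error-type annotation and root-cause localization rely on humans and a VLM, so subjective bias is possible.
Synthesized data quality depends on the accuracy of the reward model and critics (there is human verification, but errors remain).
Tree-structured sampling is computationally expensive and needs many parallel environments.
Only policy-induced errors are covered; other robustness dimensions such as environment perturbations and adversarial attacks are not considered.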

Relevance To Keywords:

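Unify Models, World Models, Representation Learning, Model-Based RL: the paper uses a vision-language model (Qwen2.5-VL) as the agent, which is an application of multimodal large models, but it does not involve world models or a unified representation-learning framework.
Native multimodal large models, unified multimodal understanding and generation: the paper builds on Qwen2.5-VL, which is itself a multimodal model, but the focus is GUI-agent robustness rather than architectural innovation.
Representation learning: not directly addressed.
World models: the paper does not use a world model for planning or simulation.
Reinforcement learning: the paper uses online sampling and reward models, but training is supervised fine-tuning rather than reinforcement learning; the tree-structured exploration does resemble exploration strategies in reinforcement learning.
Post-training: the paper improves robustness through supervised fine-tuning, which falls under post-training.
Overall relevance is moderate: the main contribution is evaluation and data synthesis for GUI-agent robustness, with a strong link to post-training of multimodal large models.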

80. IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation [PASS]

Score: 28.5 / 27.8

Authors: Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, Jinwei Wang

Published: 2026-05-28

TL;DR: This paper presents a fine-tuning-free diffusion framework for talking face generation that leverages IP-Adapter and Stable Diffusion to achieve superior lip-sync accuracy and visual fidelity without task-specific training.

Abstract

With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0

评分理由: The paper proposes a fine-tuning-free paradigm for talking face generation using Stable Diffusion and IP-Adapter. 'Visual Encoder' is highly relevant (8.0) as IP-Adapter relies on a ViT for feature extraction, which is central to the method. 'MultiModal' is moderately relevant (5.0) as the task inherently involves audio-visual alignment. 'Unify Models' is low (3.0) as the method adapts existing models rather than proposing a unified architecture in the sense of the background keywords. 'Tokenizer' is low (2.0) as its use is incidental to the backbone. 'World Models', 'MLLM', and 'model-based RL' are largely irrelevant (0.0-1.0) as the paper lacks latent dynamics modeling, large language models, or reinforcement learning components. No expert authors from the specified list were found in the authorship.

Keywords

Talking Face Generation, Diffusion Models, IP-Adapter, Fine-tuning-Free, Visual Embedding, Lip Synchronization, Temporal Consistency

In-depth analysis

Chinese Title: IP-Adapter就是一切：迈向免微调扩散模型驱动的说话人脸生成

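Summary: The paper proposes FreeTalkDiff, a fine-tuning-free diffusion framework for talking face generation. Existing methods fine-tune diffusion models with billions of parameters on large-scale audiovisual datasets, which is computationally very expensive. The authors use pretrained Stable Diffusion and IP-Adapter as the backbone and mine lip semantics through IP-Adapter's visual embedding capability. To handle identity drift, synchronization errors, and temporal instability, they design three components with no trainable parameters: the Structurist explicitly disentangles lip and appearance features in a 3D face parameter space; the Structure Controller adaptively refines embeddings according to the reference lip-motion trend; and the Noise Sensor uses a Gaussian prior to detect and suppress flicker and jitter artifacts. Experiments show gains over existing state-of-the-art methods in lip-sync accuracy (PCLD improved by at least 0.16) and visual fidelity (FID improved by at least 0.7), establishing what the authors describe as the first fine-tuning-free diffusion framework for talking face generation.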

Innovations:

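Proposes the first fully fine-tuning-free diffusion framework for talking face generation, using pretrained SD and IP-Adapter directly with no task-specific fine-tuning.
Designs the 3DMM-based Structurist module, which explicitly disentangles lip shape from appearance features and mitigates identity drift and appearance distortion.
Proposes the adaptive Structure Controller, which dynamically refines structure embeddings according to the quasi-monotonic trend of reference lip motion, improving lip-sync accuracy.
Introduces the Gaussian-prior-based Noise Sensor, which models and detects flicker and jitter noise mathematically and improves temporal consistency through spatially adaptive temporal filtering.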

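Methodology: Pretrained Stable Diffusion and IP-Adapter serve as the backbone, and IP-Adapter's CLIP image encoder extracts lip-related structure embeddings. Three parameter-free modules are built on top: the Structurist disentangles lip shape and appearance in the 3DMM parameter space, keeping lip-motion information while removing color and texture interference; the Structure Controller analyzes the reference lip-motion trend and adaptively adjusts the embedding space to capture subtle motion; the Noise Sensor derives a Gaussian prior via hypothesis testing, models flicker and jitter noise patterns, and applies a spatially adaptive temporal filter to suppress them. The whole pipeline needs no additional training and generates video directly at inference time.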

Key Results:

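On the CREMA and HDTF datasets, lip-sync accuracy (PCLD) improves by at least 0.16.
Visual fidelity (FID) improves by at least 0.7.
Unlike existing methods (e.g., AniPortrait, Loopy, EchoMimic), it needs no training resources or training dataset, so its training cost is zero.
Generated videos show high sharpness, natural lip sync, and good temporal consistency.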

Tech Stack:

Stable Diffusion (SD)
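IP-Adapter (with CLIP Image Encoder)
3D Morphable Model (3DMM)
Gaussian prior and hypothesis testing
Spatially adaptive temporal filtering
Quasi-monotonic motion-trend analysis
Adaptive refinement of structure embeddings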

Strengths:

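Fully fine-tuning-free, which greatly reduces compute and time cost and improves scalability and accessibility.
Makes novel use of an inherent property of pretrained models (IP-Adapter's attention to the lip region) to achieve controllable generation.
The three parameter-free modules are well designed, addressing identity drift, sync accuracy, and temporal stability respectively.
Experiments are thorough and exceed existing SOTA methods on several metrics; the code is open-source.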

Limitations:

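Relies on knowledge built into the pretrained models (SD and IP-Adapter), so generalization to extreme lip motion or non-frontal views may be limited.
Currently supports only a few reference frames (few-shot); the single-frame (one-shot) setting is not explored.
The Noise Sensor is based on a Gaussian prior and may be less robust to non-Gaussian noise or complex motion patterns.
Not validated on larger or more diverse datasets; generalization needs further testing.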

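Relevance To Keywords: The paper involves multimodal (audio-visual) generation, representation learning (IP-Adapter's visual embeddings, 3DMM disentanglement), and an application of diffusion models, so it has some relevance to "native multimodal large models" and "representation learning". However, it focuses on the specific task of talking face generation and does not touch world models, reinforcement learning, or post-training, so its relevance to keywords such as "World Models" and "Model-Based RL" is weak.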

81. Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers [FAIL]

Score: 27.0 / 27.8

Authors: Zihao Xue, Yan Wang, Zhen Bi, Long Ma, Zhonglong Zheng, Zeyu Yang, Bingyu Zhu, Longtao Huang, Jie Xiao, Jungang Lou

Published: 2026-05-28

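TL;DR: The paper proposes SafeDIG, a framework that uses sparse autoencoders for robust safety steering of text-to-image Diffusion Transformers, maintaining safety under risk-domain shift while preserving image quality.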

Abstract

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.
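
The abstract names two inference-time operations, Blend and Repel, without defining them. As a rough intuition only, the sketch below reads Blend as linear interpolation toward a safe target activation and Repel as removing the component of an activation along a harmful direction. Both readings, along with every function and parameter name, are assumptions for illustration and should not be taken as SafeDIG's actual definitions.

```python
# Hypothetical reading of "Blend" and "Repel" on plain activation vectors.
# Neither function is taken from the SafeDIG paper.

def blend(activation, safe_target, alpha):
    """Move an activation toward a safe target by a factor alpha in [0, 1]."""
    return [(1 - alpha) * a + alpha * s for a, s in zip(activation, safe_target)]


def repel(activation, harmful_dir, strength=1.0):
    """Subtract the projection of the activation onto a harmful direction."""
    norm_sq = sum(h * h for h in harmful_dir)
    if norm_sq == 0:
        return list(activation)  # no direction to move away from
    coeff = sum(a * h for a, h in zip(activation, harmful_dir)) / norm_sq
    return [a - strength * coeff * h for a, h in zip(activation, harmful_dir)]


print(blend([1.0, 1.0], [0.0, 0.0], alpha=0.5))
print(repel([3.0, 4.0], [1.0, 0.0]))
```

In this toy form, `alpha` and `strength` control how far an activation is moved; in the paper the targets and directions would come from the sparse autoencoder features rather than hand-written vectors.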

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	1.0/10	1.5

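Scoring rationale: The paper's core is SafeDIG, a safety-steering framework for text-to-image Diffusion Transformers (DiT). It belongs to multimodal generation and is therefore highly relevant to 'MultiModal'. It does not involve unified model architectures, tokenizer design, world models in the reinforcement-learning sense, or multimodal large language models (MLLM), so those relevance scores are low. 'Visual Encoder' relevance is modest (3.0), since DiT processes visual data but encoder design is not the focus. The weighted total is 27.0, slightly below the dynamic pass mark of 27.8, indicating only a moderate match with the keyword set.

Keywords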

Diffusion Transformers, Safety Steering, Text-to-Image, Sparse Autoencoders, Robustness, Generalizable, Feature Transfer, Latent Manipulation

82. COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings [FAIL]

Score: 27.0 / 27.8

Authors: Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

Published: 2026-05-28

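TL;DR: The paper proposes COMET, a framework that uses a PLS-SVD decomposition to dissect the modality gap in audio-text contrastive embeddings, enabling training-free dimensionality reduction and improved zero-shot audio captioning.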

Abstract

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component represents only partially the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

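Scoring rationale: The paper focuses on analyzing the modality gap in audio-text multimodal contrastive embeddings (highly relevant to MultiModal) and belongs to representation learning. It involves no visual encoder, world model, or reinforcement-learning mechanism (relevance 0). It processes text but does not optimize tokenizer design (low relevance). It aligns the embedding space but does not unify model architectures (moderate relevance). It is related to the MLLM area but not to its core architecture (low-to-moderate relevance).

Keywords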

Audio-Text, Contrastive Embeddings, Modality Gap, Concept Space Dissection, PLS-SVD, Spectral Truncation, Zero-shot Audio Captioning

83. Not All Inputs Are Valid: Towards Open-Set Video Moment Retrieval Using Language [FAIL]

Score: 27.0 / 27.8

Authors: Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, Yu Cheng

Published: 2026-05-28

TL;DR: This paper proposes OpenVMR, an open-set video moment retrieval model that distinguishes valid queries using normalizing flow and rejects out-of-distribution inputs to prevent erroneous retrieval.

Abstract

Video Moment Retrieval (VMR) aims to retrieve the specific moment corresponding to a sentence query from an untrimmed video. Although recent works have made remarkable progress on this task, they are implicitly rooted in the closed-set assumption that all given queries are video-relevant (in this paper, we treat a "video-relevant query" as an "in-distribution (ID) query" and a "video-irrelevant query" as an "out-of-distribution (OOD) query"). Given an OOD query in open-set scenarios, they still use it for retrieval and return a wrong moment, which might lead to irrecoverable losses in high-risk scenarios, e.g., criminal activity detection. To this end, we creatively explore a brand-new VMR setting termed Open-Set Video Moment Retrieval (OS-VMR), where we should not only retrieve the precise moments based on ID queries, but also reject OOD queries. In this paper, we make the first attempt to step toward OS-VMR and propose a novel model, OpenVMR, which first distinguishes ID and OOD queries based on the normalizing flow technology, and then conducts moment retrieval based on ID queries. Specifically, we first learn the ID distribution by constructing a normalizing flow, and assume the ID query distribution obeys the multivariate Gaussian distribution. Then, we introduce an uncertainty score to search for the ID-OOD separating boundary. After that, we refine the ID-OOD boundary by pulling together ID query features. In addition, video-query matching and frame-query matching are designed for coarse-grained and fine-grained cross-modal interaction, respectively. Finally, a positive-unlabeled learning module is introduced for moment retrieval. Experimental results on three VMR datasets show the effectiveness of our OpenVMR.
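
The ID/OOD separation step can be pictured with a deliberately small flow. The sketch below uses a single element-wise affine map to a standard Gaussian and takes the negative log-likelihood as the uncertainty score, rejecting queries whose score exceeds a threshold. The abstract does not specify OpenVMR's flow architecture, its uncertainty score, or how the boundary is searched, so the flow, the score, the threshold, and all names here are assumptions.

```python
import math

# Toy normalizing flow: z = (x - shift) / scale with z ~ N(0, I).
# The change-of-variables formula gives
#   log p(x) = log N(z; 0, I) + log|det dz/dx| = log N(z; 0, I) - sum(log scale).
# This is a generic sketch, not the OpenVMR model.

def nll_affine_flow(x, shift, scale):
    """Negative log-likelihood of a feature vector x under the toy flow."""
    z = [(xi - m) / s for xi, m, s in zip(x, shift, scale)]
    log_base = -0.5 * sum(zi * zi for zi in z) - 0.5 * len(z) * math.log(2 * math.pi)
    log_det = -sum(math.log(s) for s in scale)
    return -(log_base + log_det)


def is_ood(x, shift, scale, threshold):
    """Reject a query when its uncertainty score (NLL) exceeds the threshold."""
    return nll_affine_flow(x, shift, scale) > threshold


shift, scale = [0.0, 0.0], [1.0, 1.0]
print(is_ood([0.0, 0.0], shift, scale, threshold=5.0))
print(is_ood([4.0, 4.0], shift, scale, threshold=5.0))
```

A query feature near the centre of the learned ID distribution gets a low score and is kept for retrieval, while a feature far from it is rejected before any moment is returned.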

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

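Scoring rationale: The paper focuses on open-set video moment retrieval (OS-VMR). Its core contribution is using normalizing flows to separate in-distribution from out-of-distribution queries, together with positive-unlabeled learning. Among the keywords, only 'MultiModal' (video-language matching) and 'Visual Encoder' (implicit in video features) are meaningfully relevant; the rest (world models, reinforcement learning, unified model architectures, tokenizers) are unrelated to the paper. No designated expert was found in the author list.

Keywords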

Open-Set Video Moment Retrieval, Normalizing Flow, Out-of-Distribution Detection, Cross-modal Interaction, Positive-Unlabeled Learning, ID-OOD Separation, Video-Language Matching

84. Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation [FAIL]

Score: 25.5 / 27.8

Authors: Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

Published: 2026-05-28

TL;DR: This paper proposes a unified taxonomy and cost-aware evaluation framework for audio language model jailbreak attacks and defenses, revealing significant trade-offs between robustness and benign usability.

Abstract

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.
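
The three quantities the abstract says are measured (attack success rate, benign refusal, and latency) can be aggregated from per-prompt records as in the sketch below. The record fields and the way an unsafe output or a refusal is flagged are hypothetical; the paper's own judging protocol is not described in this report.

```python
# Aggregating cost-aware safety metrics from per-prompt evaluation records.
# Field names are illustrative assumptions, not the paper's schema.

def rate(flags):
    """Fraction of True values; 0.0 for an empty collection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(records):
    harmful = [r for r in records if r["harmful_prompt"]]
    benign = [r for r in records if not r["harmful_prompt"]]
    latencies = [r["latency_s"] for r in records]
    return {
        # share of harmful prompts that produced an unsafe output
        "attack_success_rate": rate(r["unsafe_output"] for r in harmful),
        # share of benign prompts that the model refused
        "benign_refusal_rate": rate(r["refused"] for r in benign),
        "mean_latency_s": sum(latencies) / len(latencies) if latencies else 0.0,
    }


records = [
    {"harmful_prompt": True, "unsafe_output": True, "refused": False, "latency_s": 1.0},
    {"harmful_prompt": True, "unsafe_output": False, "refused": True, "latency_s": 2.0},
    {"harmful_prompt": False, "unsafe_output": False, "refused": True, "latency_s": 3.0},
    {"harmful_prompt": False, "unsafe_output": False, "refused": False, "latency_s": 2.0},
]

print(summarize(records))
```

Reporting the three numbers together makes the trade-off visible: a defense that lowers the attack success rate while raising the benign refusal rate or latency is not a free improvement.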

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

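Scoring rationale: The paper focuses on safety evaluation of audio-language models (jailbreak attacks and defenses) and has little connection to architecture unification, visual encoders, world models, or reinforcement learning. It falls within multimodal and large-model research but does not examine tokenizers in depth. The weighted total is 25.5, below the dynamic pass mark of 27.8. The author list includes none of the designated experts, so no bonus applies.

Keywords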

Audio Jailbreaks, Large Audio-Language Models, Taxonomy, Attack-Defense Analysis, Cost-Aware Evaluation, Safety, LALMs

85. Evaluation of Conversational Agents: Understanding Culture, Context and Environment in Emotion Detection [FAIL]

Score: 25.5 / 27.8

Authors: Martha Teiko Teye, Yaw Marfo Missah, Emmanuel Ahene, Twum Frimpong, Auxane Boch

Published: 2026-05-28

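TL;DR: For emotion detection by conversational agents in a Black African social context, the paper proposes a model that combines speech and image data using a CNN and the AFME algorithm, reaching 85% to 96% accuracy.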

Abstract

Valuable decisions and high-priority analyses now depend on applications such as facial biometrics, social media photo tagging, and human-robot interaction. However, the ability to deploy such applications successfully depends on their effectiveness on tested use cases, taking possible edge cases into account. Over the years, many generalized solutions have been implemented to mimic human emotions, including sarcasm. However, factors such as geographical location or cultural difference have not been fully explored, despite their relevance to resolving ethical issues and improving conversational AI (Artificial Intelligence). In this paper, we seek to address the potential challenges in the usage of conversational AI within Black African society. We develop an emotion prediction model with accuracies ranging between 85% and 96%. Our model combines both speech and image data to detect the seven basic emotions, with a focus on also identifying sarcasm. It uses a three-layer Convolutional Neural Network in addition to a new Audio-Frame Mean Expression (AFME) algorithm and focuses on the model pre-processing and post-processing stages. In the end, our proposed solution contributes to maintaining the credibility of emotion recognition systems in conversational AI.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

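Scoring rationale: The paper studies emotion detection in conversational agents, combining speech and image data (high MultiModal relevance) and using a CNN on images (moderate Visual Encoder relevance). It does not involve unified model architectures (Unify Models), tokenizers (Tokenizer), world models (World Models), multimodal large language models (MLLM), or model-based reinforcement learning (model-based RL), so those keywords have very low relevance. The author list includes none of the designated experts. The weighted total is 25.5, below the dynamic pass mark of 27.8.

Keywords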

Emotion Detection, Conversational AI, Multimodal, CNN, AFME Algorithm, Cultural Context, Speech and Image

86. GenEraser: Generalizable Video Object Removal via Balanced Text-Mask Guidance and Decoupled Locator-Preserver [FAIL]

Score: 25.5 / 27.8

Authors: Yuqing Chen, Lin Liu, Haisu Wu, Xiaopeng Zhang, Yaowei Wang, Yujiu Yang, Qi Tian

Published: 2026-05-28

TL;DR: GenEraser proposes a novel framework using balanced text-mask guidance and decoupled expert architecture to achieve generalized and high-fidelity video object and effect removal, outperforming state-of-the-art methods.

Abstract

Video object removal frequently struggles to simultaneously eliminate target objects and their associated physical effects (e.g., smoke, reflections, light, and ripples) in out-of-domain scenarios due to complex spatiotemporal ambiguities. While existing methods primarily rely on spatial masks, they often fail to capture weakly correlated effects, and the potential of explicit textual guidance remains underexplored. Furthermore, a fundamental optimization conflict exists in removal models between high-level semantic generalization and precise pixel-level background preservation. To address these challenges, we propose GenEraser, a novel framework for generalized and high-fidelity video object and effect removal. First, we introduce a Multi-Conditional Mixture-of-Experts (MC-MoE) paired with Bipartite Text guidance to fully exploit the multimodal priors of Diffusion Transformers, significantly enhancing the identification of complex effects. Second, a Learnable Deep "CFG" Fusion mechanism (LD-CFG) is developed to adaptively balance the relative dominance of mask and textual conditions across diverse scenarios. Finally, we propose a Decoupled Expert Architecture, comprising a Locator and a Preserver, to mitigate the inherent trade-off between semantic generalization and pixel alignment. Extensive experiments demonstrate that our GenEraser surpasses recent state-of-the-art approaches, achieving significant quantitative improvements (e.g., 2.16 dB and 1.44 dB on the ROSE Benchmark and VOR-Eval, respectively) while maintaining exceptionally robust generalization in open-world scenarios. https://cyqii.github.io/GenEraser.github.io/

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on video object removal using Diffusion Transformers and text-mask guidance, showing high relevance to MultiModal due to text-video fusion. However, it does not address Unify Models, Tokenizer design, World Models, MLLM architectures, or Model-Based RL, resulting in low scores for those keywords. Visual Encoder is tangential as Diffusion Transformers process visuals without highlighting a specific encoder contribution. No expert authors from the specified list are present in the author list.

Keywords

Video Object Removal, Text-Mask Guidance, Diffusion Transformers, Decoupled Expert Architecture, Generalizable Video Editing, MC-MoE, Locator-Preserver

87. Neural-Behavioral Representation of Natural Whole-body Movement in Monkeys [FAIL]

Score: 24.8 / 27.8

Authors: Jieshi He, Puzhe Li, Yanan Sui, Mu-ming Poo

Published: 2026-05-28

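TL;DR: The paper proposes a neural-behavioral framework that combines large-scale cortical signals with multi-view motion capture and uses an autoregressive encoder-decoder model to decode whole-body movement in freely moving monkeys.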

Abstract

Understanding how cortical activity represents natural whole-body behaviors in primates remains challenging. Limited by the diversity of movements and inaccessibility of large-scale neural representation of whole-body kinematics, previous motor decoding studies focused on constrained tasks and limited limb movements. Here, we present a neural-behavioral recording and modeling framework for freely moving monkeys, combining large-scale epidural cortical signals from distributed sensory- and motor-related areas with synchronized multi-view motion capture through a custom-made data collection platform. We reconstructed whole-body monkey kinematics and learned a compact behavior prior using an autoregressive encoder-decoder model. Conditioned on neural signals, the model decoded accurate and realistic whole-body movement without explicit physical constraints. Our results provide a novel proof-of-concept approach for decoding natural whole-body movements in primates using large-scale intracranial neural activity.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.5/10	5.25
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	2.0/10	3.0

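Scoring rationale: This is a neuroscience paper whose core is joint decoding of neural signals and movement data, so it sits in a different field from the large-model keywords. MultiModal is the most relevant (it combines neural signals with motion-capture data); World Models has some conceptual link (an autoregressive model learns a prior over behavior dynamics); Unify Models is low-to-moderate (a unified data-collection platform); Tokenizer and MLLM are unrelated; Visual Encoder and model-based RL have only weak methodological links. The weighted total is 24.75 (displayed as 24.8), below the dynamic pass mark of 27.8.

Keywords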

Neural-behavioral representation, Whole-body movement decoding, Autoregressive encoder-decoder, Multi-view motion capture, Large-scale cortical signals, Behavior prior, Freely moving monkeys, Kinematics reconstruction

88. Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design [FAIL]

Score: 24.0 / 27.8

Authors: Shengchao Chen, Ting Shu, Sufen Ren

Published: 2026-05-28

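TL;DR: The paper proposes SkillPCF, a framework that combines a physics-guided memory skill bank with reinforcement-learned skill selection to achieve better quality-efficiency trade-offs in photonic crystal fiber inverse design under practical simulation budgets.

Abstract (Chinese translation)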

光子晶体光纤（PCF）的逆向设计仍然具有挑战性，因为候选几何结构必须在昂贵的电磁仿真下满足耦合光学目标。现有方法改进了代理预测或一次性参数推荐，但它们无法在迭代试验中积累可复用的设计知识。我们将 PCF 逆向设计建模为记忆 - 策略学习问题，并提出 SkillPCF，这是一种闭环代理框架，结合了基于物理引导的记忆技能库、强化学习技能选择以及基于仿真的技能演化。此外，我们构建了一个真实世界数据集，包含 479 条专家交互轨迹（2507 个片段）和 553 个依赖记忆的评估查询，涵盖色散工程、损耗优化和多目标设计。在多个 LLM 骨干和经典基线上的实验表明，SkillPCF 在实用仿真预算下实现了更强的设计质量与效率权衡，证明了所提出的记忆 - 技能学习范式在物理感知的 PCF 逆向设计中的有效性。

Abstract

Photonic crystal fiber (PCF) inverse design remains challenging because candidate geometries must satisfy coupled optical targets under expensive electromagnetic simulation. Existing pipelines improve surrogate prediction or one-shot parameter recommendation, but they do not accumulate reusable design knowledge across iterative trials. We formulate PCF inverse design as a memory-policy learning problem and propose SkillPCF, a closed-loop agent framework that combines a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution. We further construct a real-world dataset with 479 expert interaction traces (2,507 spans) and 553 memory-dependent evaluation queries covering dispersion engineering, loss optimization, and multi-objective design. Experiments across multiple LLM backbones and classical baselines show that SkillPCF achieves stronger design-quality and efficiency trade-offs under practical simulation budgets, demonstrating the effectiveness of our proposed memory-skill learning paradigm for physics-aware PCF inverse design.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	8.0/10	12.0

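Scoring rationale: The paper focuses on photonic crystal fiber inverse design using a memory-policy and reinforcement learning framework (SkillPCF). 'model-based RL' scores highest (8.0) because the paper explicitly uses a simulator for skill evolution together with reinforcement-learned skill selection, which fits the model-based RL paradigm. 'World Models' scores moderately (4.0): the memory skill bank and closed-loop agent conceptually resemble the representation and planning abilities of a world model, but no world model is explicitly built. 'MLLM' scores low (3.0): the experiments use LLM backbones, but the emphasis is on design policy rather than multimodal LLM architecture. 'Unify Models' scores low (1.0), since no unified model architecture is involved. 'Tokenizer', 'Visual Encoder', and 'MultiModal' score 0.0 because the paper does not involve tokenizers, visual encoders, or multimodal encoding. The weighted total is 24.0, below the dynamic passing score of 27.8. None of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appears in the author list.

Keywords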

Photonic crystal fiber, Inverse design, Memory policies, SkillPCF, Reinforcement learning, Simulator-grounded, Physics-guided

89. Unsupervised Semantic Segmentation Facilitates Model Understanding [FAIL]

Score: 24.0 / 27.8

Authors: Xiaoyan Yu, Lisa Mais, Jannik Franzen, Peter Hirsch, Nick Lechtenbörger, Andreas Mardt, Dagmar Kainmüller

Published: 2026-05-28

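TL;DR: The paper proposes a visualization protocol based on unsupervised semantic segmentation that makes positional effects in self-supervised vision transformers visually distinguishable from locality bias.

Abstract (Chinese translation)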

自监督学习（SSL）催生了多样化的视觉 Transformer（ViTs），其预训练表示支持广泛的下游任务。为更好地理解这些模型，一系列工作评估了其自注意力的运作机制以及其表示中捕获的信息类型，揭示了例如使用对比学习（CL）训练的模型与掩码图像建模（MIM）模型之间的显著差异。然而，这些模型理解的进展尚未完全渗透到更广泛的社区中，其中针对 CL 模型的见解有时会被泛化到 MIM 模型。为了使模型理解对广大受众来说简单直观，我们提出了一种简单且易于解释的可视化方案。我们的方案基于可视化无监督语义分割的结果，但我们的目标并非最大化分割性能。相反，它使我们能够传达在图像中一致出现的模型行为。在不同层和表示上对一系列多样化的 SSL 模型进行评估，我们获得了关于不同位置偏差和缩放行为的新的见解，包括 DINOv3-Large 模型标记 (tokens) 中出现的强边界伪影。这些见解补充并有助于传达一系列先前的发现。我们的方案进一步实现了位置效应与密切相关但不同的局部性偏差之间的清晰视觉区分，后者在文献中已被更广泛地研究。该方案在 GitHub 上公开可用，我们相信它将促进更广泛社区的进一步模型理解。

Abstract

Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3-Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

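Scoring rationale: The paper's core contribution is using unsupervised semantic segmentation visualizations to understand the behavior of self-supervised vision transformers (ViTs), such as positional bias and locality bias. It involves visual encoders (ViTs) but not world models, multimodal LLMs, reinforcement learning, or tokenizer architecture design. 'Unify Models' is reflected only in the comparison of different SSL models, not in any model unification. All keywords other than Visual Encoder therefore have very low relevance, and the weighted total (24.0) is below the dynamic passing score (27.8).

Keywords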

Unsupervised Semantic Segmentation, Model Understanding, Self-Supervised Learning, Vision Transformers, Positional Bias, Locality Bias, Visualization Protocol

90. VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion [FAIL]

Score: 22.5 / 27.8

Authors: Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

Published: 2026-05-28

TL;DR: VideoMLA replaces the per-head KV cache in long-rollout video diffusion with Multi-Head Latent Attention, cutting per-token KV memory by 92.7% and improving throughput by 1.23x while matching baselines at short horizons and scoring best at long horizons on VBench.

Abstract (Chinese translation)

长序列因果视频扩散已收敛于固定大小的滑动窗口 KV 缓存，近期的进展通过改变占据窗口的标记或其位置编码方式，在此布局内进行了创新。每头 KV 布局本身是流式内存和延迟的主要贡献者，但大多保持不变。本文首次研究了视频扩散中的多头潜在注意力（MLA）。VideoMLA 用共享的低秩内容潜在表示和解耦的 3D-RoPE 位置键替换每头的键和值，在每个缓存层将每标记 KV 内存减少了 92.7%。我们进一步探究了为什么 MLA 在视频扩散中能够成功，尽管在语言模型中通常用来解释它的谱假设并不成立：预训练视频注意力并非低秩，其 99% 能量有效秩远高于任何实际潜在维度。VideoMLA 在压缩比下保持质量，而直接谱近似在此处会预测较大的重构误差。我们表明，MLA 瓶颈而非预训练谱决定了有效秩：无论是谱初始化还是随机初始化，从初始化开始就占用了几乎完整的秩预算，训练过程保持此预算并在其中进行调整。在 VBench 评测基准上，VideoMLA 与短视界流式视频扩散基线持平，在长视界中取得了所评估方法中的最佳整体分数，并在单个 B200 上将吞吐量提高了 1.23 倍。

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
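
To make the memory claim concrete, the sketch below compares per-token cache size for a conventional per-head KV layout against a shared content latent plus a shared positional key, as the abstract describes. All dimensions (`n_heads`, `d_head`, `d_latent`, `d_rope`) are hypothetical values chosen for illustration. The abstract does not give VideoMLA's actual configuration, so the printed ratio should not be read as reproducing the reported 92.7%.

```python
def per_head_kv_floats(n_heads: int, d_head: int) -> int:
    """Cached floats per token per layer: one key and one value per head."""
    return 2 * n_heads * d_head

def latent_kv_floats(d_latent: int, d_rope: int) -> int:
    """Cached floats per token per layer: one shared content latent
    plus one shared positional key, independent of the head count."""
    return d_latent + d_rope

def reduction(baseline: int, compressed: int) -> float:
    """Fraction of per-token cache memory saved."""
    return 1.0 - compressed / baseline

# Hypothetical dimensions, not taken from the paper.
n_heads, d_head = 16, 128
d_latent, d_rope = 256, 64

base = per_head_kv_floats(n_heads, d_head)   # 2 * 16 * 128 = 4096
mla = latent_kv_floats(d_latent, d_rope)     # 256 + 64 = 320
print(f"per-head: {base}, latent: {mla}, saved: {reduction(base, mla):.1%}")
```

The point of the arithmetic is that the per-head layout grows with the number of heads, while the shared latent layout does not, so the saving increases as heads are added at a fixed latent size.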

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

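Scoring rationale: The paper's core contribution is KV-cache optimization for video diffusion models, using Multi-Head Latent Attention (MLA) to reduce memory. 'World Models' is the most relevant keyword (5.0), since video diffusion falls within the generative world-model paradigm. 'Unify Models' and 'Tokenizer' are loosely related (2.0) through the shared attention-head representation and the management of token layout, but neither is a core theme. 'MLLM', 'model-based RL', and 'Visual Encoder' have low relevance (1.0-2.0) because the paper does not address language models, reinforcement learning, or encoder architecture design. The weighted total is 22.5, below the dynamic passing score of 27.8, indicating a moderate-to-low match with the given keyword set.

Keywords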

Video Diffusion, Multi-Head Latent Attention, KV Cache, Low-Rank Latent, Autoregressive Video, Memory Reduction, Throughput Improvement

91. Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization [FAIL]

Score: 22.5 / 27.8

Authors: Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin, Wenhao Li

Published: 2026-05-28

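TL;DR: The paper proposes temporal and structural credit assignment for optimizing prompts in LLM-based multi-agent systems; iterative refinement reduces query complexity while improving performance.

Abstract (Chinese translation)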

尽管多智能体系统（MAS）通过协同交互赋能大语言模型（LLM）应对复杂的推理任务，但由于计算图的离散且不可微的性质以及全局监督信号的稀疏性，优化其交互动力学仍是一个严峻的挑战。现有的黑盒优化器难以将轨迹层面的失败归因于特定的局部组件，从而导致低效且高方差的探索。我们认为，可行的多智能体系统优化需要引入结构归纳偏置，以解耦误差信号。我们提出了一种时序与结构信用分配方法，该方法沿两个维度分解目标：（i）时序信用，利用状态空间瓶颈识别关键轮次；（ii）结构信用，利用静态角色策略隔离智能体的贡献。利用这些分解后的信号，我们提出了一种离散且基于言语的块坐标下降算法，用于迭代优化。该方法并非进行无差别的全局更新，而是在优化角色提示和聚合协议之间交替进行，利用 LLM 生成的“代理梯度”仅针对已识别的薄弱环节。在多样化的推理基准上，我们的方法在提升性能的同时显著降低了查询复杂度，为通往自我改进的多智能体系统（MAS）提供了一条严谨且可解释的路径。

Abstract

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non-differentiable nature of the computation graph and the sparsity of global supervisory signals. Existing black-box optimizers struggle to attribute trajectory-level failure to specific local components, resulting in inefficient, high-variance exploration. We argue that tractable MAS optimization needs structural inductive biases to disentangle error signals. We propose temporal and structural credit assignment, which decomposes the objective along two axes: (i) temporal credit, using state-space bottlenecks to identify critical rounds, and (ii) structural credit, using stationary role policies to isolate agent contributions. Leveraging these decomposed signals, we introduce a discrete, verbalized block coordinate descent algorithm for iterative refinement. Rather than indiscriminate global updates, it alternates between optimizing role prompts and aggregation protocols, using LLM-generated "proxy gradients" to target only the identified weak links. Across diverse reasoning benchmarks, our approach substantially reduces query complexity while improving performance, providing a principled and interpretable path toward self-improving MAS.
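
The alternating update described in the abstract follows the general shape of block coordinate descent over two discrete blocks. The Python sketch below shows only that general shape and is not the authors' algorithm: the fixed candidate lists stand in for LLM-generated "proxy gradients", the lookup table stands in for benchmark evaluation, and every name and value is hypothetical.

```python
from typing import Callable

def block_coordinate_descent(
    role_prompt: str,
    protocol: str,
    prompt_candidates: list[str],
    protocol_candidates: list[str],
    score: Callable[[str, str], float],
    rounds: int = 3,
) -> tuple[str, str]:
    """Alternate between two discrete blocks, updating one while the other is fixed."""
    for _ in range(rounds):
        # Block 1: improve the role prompt with the protocol held fixed.
        role_prompt = max(prompt_candidates + [role_prompt],
                          key=lambda p: score(p, protocol))
        # Block 2: improve the aggregation protocol with the prompt held fixed.
        protocol = max(protocol_candidates + [protocol],
                       key=lambda a: score(role_prompt, a))
    return role_prompt, protocol

# Toy scorer standing in for benchmark accuracy (hypothetical values).
table = {
    ("terse", "vote"): 0.40, ("terse", "debate"): 0.55,
    ("stepwise", "vote"): 0.60, ("stepwise", "debate"): 0.72,
}
best = block_coordinate_descent(
    "terse", "vote", ["terse", "stepwise"], ["vote", "debate"],
    score=lambda p, a: table[(p, a)],
)
print(best)
```

Because the current value is always kept among the candidates, the score never decreases from one block update to the next. That monotonicity is the usual appeal of coordinate-wise updates when gradients are unavailable.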

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	5.0/10	7.5

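Scoring rationale: The title contains 'Unifying', so 'Unify Models' scores 5.0. The paper concerns LLM multi-agent optimization and relates to 'model-based RL' through its credit assignment mechanism, scoring 5.0. 'MLLM' scores 3.0 because the work mainly uses language models rather than multimodal ones. 'World Models' scores 2.0 for the state-space analysis. The remaining keywords (Tokenizer, Visual Encoder, MultiModal) are unrelated to the text-only multi-agent topic and score 0.0. No designated expert authors were found, so no bonus is applied. The weighted total is 22.5, below the dynamic passing score of 27.8, indicating moderate-to-low relevance to the keyword background (multimodal world models and unified models).

Keywords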

Multi-Agent Systems, Prompt Optimization, Credit Assignment, Large Language Models, Temporal and Structural, Block Coordinate Descent, Proxy Gradients

92. When Should Models Change Their Minds? Contextual Belief Management in Large Language Models [FAIL]

Score: 22.5 / 27.8

Authors: Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

Published: 2026-05-28

TL;DR: This paper studies Contextual Belief Management in large language models, introduces the BeliefTrack benchmark, and shows that reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average.

Abstract (Chinese translation)

长程交互要求语言模型管理累积的信息：何时更新其状态，何时保留其状态，以及忽略哪些内容。我们将此挑战定义为上下文信念管理 (CBM)：维持与形式化证据对齐的预测信念状态，同时隔离任务无关噪声。为了使 CBM 可衡量，我们引入了 BeliefTrack，这是一个涵盖规则发现 (Rule Discovery) 和电路诊断 (Circuit Diagnosis) 的封闭世界基准，其中有限信念空间和符号验证器使得精确的回合级评估成为可能。BeliefTrack 诊断出三种失败类型：失败停留 (Failed Stay)、失败更新 (Failed Update) 和失败隔离 (Failed Isolation)。在多个大型语言模型 (LLMs) 上，基线模型 (vanilla models) 表现出严重的 CBM 失败，而显式信念追踪提示仅带来有限的提升。相比之下，采用信念状态奖励的强化学习方法平均将失败率降低了 70.9%。进一步探测揭示了这些失败背后的潜在信念状态动力学，而表示层引导 (representation-level steering) 在两个任务上将失败率降低了 46.1%（代码即将在 https://github.com/zjunlp/CBM 发布）。

Abstract

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks (code is coming soon at https://github.com/zjunlp/CBM).

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	4.0/10	6.0

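Scoring rationale: The paper studies belief management in large language models and its use with reinforcement learning. It has some connection to World Models (belief-state modeling) and model-based RL (rewards based on belief states), which receive mid-range scores, but no direct connection to Tokenizer, Visual Encoder, MultiModal, or Unify Models (architecture unification), which score low.

Keywords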

Contextual Belief Management, Large Language Models, Reinforcement Learning, Belief State, BeliefTrack, Representation Steering, Long-horizon Interactions

93. PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers [FAIL]

Score: 22.5 / 27.8

Authors: Boning Li, Baoxiang Wang, Longbo Huang

Published: 2026-05-28

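TL;DR: PokerSkill uses a human-designed, rule-based skill library as a structured interface that lets LLMs play expert-level poker without training or solvers, substantially reducing losses.

Abstract (Chinese translation)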

扑克是人工智能领域的一个标志性挑战。主流方法依赖于基于反事实遗憾最小化（Counterfactual Regret Minimization, CFR）构建的均衡求解器，需要数百万核心小时的训练。大语言模型（Large Language Models, LLMs）拥有丰富的扑克知识，但在被要求直接参与对局时，其表现远低于基于求解器的智能体。传统的基于规则的扑克智能体具有可解释性且无需训练，但其策略上限仍远低于均衡下法。我们提出 PokerSkill，这是一个无需训练且无需求解器的框架，它通过使用详细的基于规则的扑克技能作为大语言模型的结构化动作接地接口，从而弥合了这一差距。一个确定性上下文引擎分析当前状态，并从完全由人类扑克专家设计的分层技能库中仅检索相关片段，从而将大语言模型的选择限制在合理动作范围内。在与最先进的博弈论最优（GTO）基准 GTOWizard 的对抗中，结合 PokerSkill 的 GPT-5.5 XHigh 取得了 -57 ± 21 mbb/hand 的成绩，Claude Opus 4.6 达到 -80 ± 29 mbb/hand，Claude Opus 4.7 达到 -87 ± 64 mbb/hand。相比默认提示基线，损失减少了 49%–61%，且优于强大的机器人 Slumbot。我们的关键发现是，仅靠基于规则的技能无法构成强策略，仅靠大语言模型也无法下出好棋，但两者的结合产生了一个智能体，它既不需要训练也不需要求解器访问，却能竞争于基于数百万核心小时计算构建的系统。据我们所知，这是首次展示大语言模型在不进行游戏特定训练或不进行求解器查询的情况下，在复杂的非完美信息游戏中实现竞争性表现。代码可在 https://github.com/lbn187/PokerSkill 获取。

Abstract

Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. We introduce PokerSkill, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves -57 ± 21 mbb/hand, Claude Opus 4.6 achieves -80 ± 29 mbb/hand, and Claude Opus 4.7 achieves -87 ± 64 mbb/hand, reducing losses by 49–61% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0

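Scoring rationale: The paper explores skill-guided use of LLMs in poker and does not involve multimodal architecture (Visual Encoder, MultiModal, MLLM) or tokenizer design. It combines an LLM with rule-based skills (Unify Models), but this is application-level integration rather than unification of model architectures. It falls within the broader RL domain (model-based RL), but the method relies on a skill library rather than learning an environment model (World Models). The author list does not include Yang Shi or the other designated experts, so no bonus is applied. The weighted total is 22.5, below the dynamic passing score of 27.8.

Keywords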

LLMs, Poker, Rule-based skills, Action-grounding, Training-free, Solver-free, GTO benchmark

94. PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning [FAIL]

Score: 22.5 / 27.8

Authors: Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du

Published: 2026-05-28

TL;DR: PEARL trains Socratic tutoring agents with pedagogically aligned reinforcement learning, combining a controllable student simulator, a generative reward model, and a stable multi-objective optimization scheme; it achieves the best performance among open-source models.

Abstract (Chinese translation)

大型语言模型（LLMs）作为教育辅导者已展现出潜力，但有效的辅导不仅仅是解决问题：它必须在多轮交互中提供渐进式的苏格拉底式引导，并平衡多个教学目标。然而，由于保真度有限且可控性较弱的学生模拟、教学奖励建模规范不足以及不稳定的多目标优化，训练这样的辅导者仍然具有挑战性。为了克服这些限制，我们提出了 PEARL，一个教学对齐的强化学习框架，用于训练苏格拉底式辅导智能体，包含三个关键组件。首先，我们引入一个可控学生模拟器，该模拟器将潜在认知状态与响应生成解耦，以模拟多样化的能力和误解。其次，我们开发了一个生成式奖励模型，该模型联合评估教学质量和目标正确性，以用于策略优化。最后，我们提出了一种稳定的多目标强化学习 (RL) 方案，该方案在每个维度内离散化奖励，并在维度间聚合归一化优势，从而防止高方差目标主导更新。在多个基准测试上的实验结果表明，PEARL 在开源模型中取得了最佳性能，且与领先的专有大型语言模型 (LLMs) 保持竞争力，尽管仅使用了 30B 参数策略模型。

Abstract

Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
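
The third component can be pictured with a small sketch. The Python below is one possible reading of "discretizes rewards within each dimension and aggregates normalized advantages across dimensions", not PEARL's actual implementation. The bin count, the mean and standard-deviation normalization over a group of sampled responses, and the plain average across dimensions are all assumptions.

```python
from statistics import mean, pstdev

def discretize(reward: float, n_bins: int = 5) -> int:
    """Map a reward in [0, 1] to an integer level 0..n_bins-1 (assumed binning)."""
    clipped = min(max(reward, 0.0), 1.0)
    return min(int(clipped * n_bins), n_bins - 1)

def normalized_advantages(levels: list[int]) -> list[float]:
    """Center and scale one reward dimension across a group of responses."""
    mu, sigma = mean(levels), pstdev(levels)
    if sigma == 0:                      # all responses tied: no learning signal
        return [0.0 for _ in levels]
    return [(x - mu) / sigma for x in levels]

def aggregate(rewards_by_dim: dict[str, list[float]]) -> list[float]:
    """Average per-dimension normalized advantages for each response."""
    per_dim = [
        normalized_advantages([discretize(r) for r in rewards])
        for rewards in rewards_by_dim.values()
    ]
    return [mean(vals) for vals in zip(*per_dim)]

# Hypothetical rewards for four sampled tutor responses.
rewards = {
    "pedagogy":    [0.10, 0.50, 0.90, 0.50],
    "correctness": [0.00, 1.00, 1.00, 0.00],
}
print(aggregate(rewards))
```

Because each dimension is rescaled to unit variance before averaging, a dimension with a wide raw reward range contributes no more to the combined advantage than one with a narrow range. This is one way to keep a high-variance objective from dominating updates.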

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0

Scoring rationale: The paper focuses on pedagogically aligned RL for Socratic tutoring using LLMs and has low relevance to the multimodal architecture keywords (Visual Encoder, Tokenizer, MLLM, MultiModal), since it appears to be text-centric. 'World Models' and 'model-based RL' have moderate relevance because of the student simulator used within the RL loop. 'Unify Models' scores low because the paper unifies pedagogical objectives rather than model architectures. None of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appears in the author list. The weighted total is 22.5, below the dynamic passing score of 27.8.

Keywords

Reinforcement Learning, Socratic Tutoring, Pedagogical Alignment, Student Simulator, Multi-objective Optimization, Large Language Models, Reward Modeling, Educational AI

95. DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation [FAIL]

Score: 22.5 / 27.8

Authors: Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

Published: 2026-05-28

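TL;DR: DirectorBench is a personalized multi-agent diagnostic benchmark for long-form video generation that exposes workflow bottlenecks and profile-dependent failure modes hidden by aggregate scores.

Abstract (Chinese translation)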

长视频生成正迅速从短小、单场景的合成转向分钟级、多镜头的创作，具备叙事结构、电影级控制、音频以及跨模态（Cross-modal）同步能力。然而，评估此类视频仍然具有挑战性，因为现有基准主要关注局部视觉质量、短视域时间一致性或通用提示对齐，且对工作流故障和用户偏好依赖性的诊断能力有限。我们引入 DirectorBench，这是一个面向长视频生成的个性化多代理诊断基准。DirectorBench 基于 80 个结构化元数据条目、7 个用户画像（User profiles）以及 40 个检查点（Checkpoint）标准，从脚本、视觉、音频、跨模态和稳定性这 5 个维度对生成视频进行评估。与将质量简化为单一聚合分数不同，DirectorBench 能够定位检查点级的瓶颈，并支持基于用户画像的评估。我们评估了 4 种长视频生成工作流、6 个基础大语言模型（LLM）以及 7 个用户画像。在不同工作流中，DirectorBench 揭示了一个单元间瓶颈：转场质量平均仅为 0.256，最佳工作流达到 0.356，而提示级用户需求满足的平均值为 0.71。我们进一步开展了包含 14 名标注者的人工评估，以验证 DirectorBench 与人类判断之间的一致性。结果表明，DirectorBench 能够捕捉人类可感知的质量差异，并揭示出被聚合评分所掩盖的、与工作流和用户画像相关的故障模式。这些发现凸显了针对长视频生成进行诊断性且基于用户画像的基准测试的重要性。

Abstract

Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	1.0/10	1.5

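Scoring rationale: The paper focuses on an evaluation benchmark for long-form video generation (DirectorBench); it is an evaluation method rather than a model architecture or a reinforcement learning approach. Relevance to Tokenizer, Visual Encoder, World Models, and model-based RL is therefore very low (1.0). MultiModal relevance is higher (6.0) because video generation involves synchronizing visual and audio modalities. MLLM relevance is moderate (3.0): the evaluation agents may be built on large language models, but this is not the core contribution. Unify Models relevance is low (2.0) because the paper does not involve a unified model architecture. The author list does not include any designated experts. The weighted total is 22.5, below the dynamic passing score of 27.8.

Keywords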

Long-form video generation, Multi-agent evaluation, Personalized benchmark, Cross-modal synchronization, Workflow diagnosis, User profiles, Visual quality

96. Scaling Laws for Agent Harnesses via Effective Feedback Compute [FAIL]

Score: 22.5 / 27.8

Authors: Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che

Published: 2026-05-28

TL;DR: This paper proposes Effective Feedback Compute (EFC) as a scaling coordinate for agent harnesses and shows that it predicts failure rates better than raw token or tool-call counts.

Abstract (Chinese translation)

Agent harnesses (智能体框架) 越来越多地通过决定模型如何调用工具、接收反馈、验证中间状态、存储记忆以及修正解决方案，来确定语言模型系统的性能。然而，当前的推理时扩展分析通常通过原始开销（如 token、工具调用、操作、墙钟时间或成本）来参数化这一过程，这无法区分有用反馈与冗余或不稳定的交互。我们引入“有效反馈计算量”（EFC），这是一种轨迹级扩展坐标，仅在反馈信息丰富、有效、非冗余且保留用于后续决策时才计入，并在比较具有不同反馈需求的任务时，将其按任务需求进行归一化。在合成可控任务、可执行代码任务、真实基准轨迹、留出集和前瞻性验证批次上，基于 EFC 的坐标始终比原始计算量基线和强多变量 SAS 基线更好地预测失败率。在控制扩展中，原始 token 和工具调用解释的变异有限（$R^2=0.33$ 和 $0.42$），SAS 达到 $0.88$，而 Oracle-EFC 和 Estimated-EFC 达到 $0.94$，Oracle-EFC/$D_{\mathrm{task}}$ 达到 $0.99$。预算匹配干预表明，在原始成本和工具调用固定的情况下，提高反馈质量使成功率从 $0.27$ 提高到 $0.90$。在混合真实轨迹上，NRS-EFC/$D_{\mathrm{task}}$ 达到 $R^2=0.92$，而原始计算量的拟合度接近零或为负，且在前瞻性保留集上它仍然是最佳预测器（$R^2=0.85$）。这些结果表明，Agent harness 扩展更多地取决于原始预算转换为持久、任务充分反馈的效率，而非所花费的计算量。

Abstract

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure (tokens, tool calls, operations, wall time, or cost), which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.
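
As a rough illustration of the crediting rule, the sketch below counts a feedback event only when it satisfies all four conditions named in the abstract, then divides by a task-demand term. The `FeedbackEvent` fields, the unit credit per event, and the scalar `d_task` are hypothetical simplifications; the paper's actual estimator may weight events differently.

```python
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    informative: bool    # told the agent something it could act on
    valid: bool          # the feedback itself was correct
    non_redundant: bool  # not a repeat of earlier feedback
    retained: bool       # still available to later decisions

def effective_feedback(events: list[FeedbackEvent]) -> int:
    """Count events that meet all four conditions (assumed unit credit)."""
    return sum(
        e.informative and e.valid and e.non_redundant and e.retained
        for e in events
    )

def normalized_efc(events: list[FeedbackEvent], d_task: float) -> float:
    """Effective feedback divided by task demand (d_task must be positive)."""
    if d_task <= 0:
        raise ValueError("d_task must be positive")
    return effective_feedback(events) / d_task

# Hypothetical trace: four tool calls, only two yield creditable feedback.
trace = [
    FeedbackEvent(True, True, True, True),
    FeedbackEvent(True, True, False, True),   # redundant repeat
    FeedbackEvent(False, True, True, True),   # uninformative
    FeedbackEvent(True, True, True, True),
]
print(len(trace), effective_feedback(trace), normalized_efc(trace, d_task=4.0))
```

Two traces with the same raw tool-call count can therefore differ in this coordinate, which is the distinction the abstract draws between raw expenditure and useful feedback.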

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on scaling laws for agent harnesses and Effective Feedback Compute (EFC). It relates moderately to MLLM (uses LLMs), World Models (agent-environment interaction), and model-based RL (feedback loops), but does not directly address tokenizers, visual encoders, or multimodal architectures, resulting in lower scores for those keywords.

Keywords

Scaling Laws, Agent Harnesses, Effective Feedback Compute, Feedback Quality, Tool Calls, Language Models, Task Demand

97. Towards Consistent Video Geometry Estimation [FAIL]

Score: 22.5 / 27.8

Authors: Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Si-Yuan Cao, Hui-Liang Shen

Published: 2026-05-28

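TL;DR: The paper presents ViGeo, a unified transformer-based foundation model that recovers spatially dense, temporally consistent geometry (depth, surface normals, point maps) from video without task-specific architectural changes, reaching state-of-the-art results on multiple benchmarks.

Abstract (Chinese translation)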

本文提出了 ViGeo，一种用于从视频序列中恢复空间密集且时间一致几何的前馈基础模型 (feed-forward foundation model)。该模型基于普通 Transformer 架构，且无需针对特定任务的架构修改，ViGeo 可在统一模型中支持流式、全序列及长视频推理。关键设计是动态分块注意力 (dynamic chunking attention)，该设计在训练期间使模型同时暴露于双向和因果时间上下文中，并允许其在测试时调整注意力模式而无需重新训练。为提升监督质量，我们进一步引入了一种基于补全的数据精炼框架 (completion-based data refinement framework)。该框架训练了一个视频深度补全教师模型 (video depth completion teacher)，该模型基于稀疏且嘈杂的标注，并利用视频/多视角上下文来生成密集、时间连贯且几何可靠的训练目标。除了深度图和点图 (point maps) 之外，ViGeo 还在同一框架内预测表面法线 (surface normals)。仅在公共数据集上训练，ViGeo 在在线、离线及长视频深度估计、表面法线估计和视频点图估计方面均达到了最先进的性能 (state-of-the-art performance)。

Abstract

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

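Scoring rationale: The paper's core is video geometry estimation (depth, surface normals, and so on) built on a transformer foundation model. Only 'Unify Models' has a textual link, to the 'unified model' mentioned in the abstract. Other keywords such as MLLM, World Models, and model-based RL concern language generation, world modeling, and reinforcement learning, and have no direct connection to the paper's visual geometry task. Tokenizers are not mentioned. No designated expert authors were found. The weighted total is 22.5, below the dynamic passing score of 27.8.

Keywords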

Video Geometry Estimation, Foundation Model, Transformer Architecture, Temporal Consistency, Dynamic Chunking Attention, Surface Normal Estimation, Depth Estimation, Stream Inference

98. FRUC: Feedforward Dynamic Scene Reconstruction from Uncalibrated Collaborative Driving Views [FAIL]

Score: 22.5 / 27.8

Authors: Yihang Tao, Yu Guo, Zhengru Fang, Haonan An, Yuguang Fang

Published: 2026-05-28

TL;DR: FRUC is a feed-forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views; it reports state-of-the-art rendering quality and efficiency without requiring precise calibration.

Abstract (Chinese translation)

我们提出了 FRUC，一种用于从未校准的协同驾驶视图进行动态场景重建的前馈式 3D 高斯泼溅框架。现有的多智能体重建框架通常受限于严格的前提条件，要求精确的空间校准和缓慢的场景级优化。在本文中，我们通过将分布式多车辆网络概念化为时空非结构化的 ego-centric（以自我为中心）多相机系统来重新思考这项任务，其核心挑战在于通过协作增强 ego-centric 遮挡几何，同时不损害自我准确观测到的可见几何，并保持重建效率。为了实现高效重建，FRUC 基于 visual grounded 的几何 Transformer 骨干构建，能够从灵活数量的多车辆视图实现一次性、无校准的推理。为了在未校准的跨代理错位下实现非破坏性的几何补充，FRUC 首先引入了一个 ego-centric 因果遮挡场，该场通过建模代理间的时空相关性，显式地将遮挡演化推导为潜在先验。在这些遮挡先验的指导下，它进一步将跨代理整合表述为一种通过零初始化注入的确定性残差去噪过程，将具有挑战性的跨代理融合转化为有界残差学习，以实现鲁棒的协作盲点完成。通过在现实世界的 V2XReal 和 UrbanIng-V2X 数据集上的广泛评估，FRUC 被证明是动态协同驾驶环境场景重建的新最先进方法，在渲染质量和效率方面显著优于现有方法。

Abstract

We present FRUC, a feed-forward 3D Gaussian splatting framework for dynamic scene reconstruction from uncalibrated collaborative driving views. Existing multi-agent reconstruction frameworks are often hindered by rigid prerequisites, demanding precise spatial calibration and slow per-scene optimization. In this paper, we rethink this task by conceptualizing a distributed multi-vehicle network as a spatio-temporally unstructured ego-centric multi-camera system, where the core challenge lies in enhancing ego-centric occluded geometry through collaboration without degrading the ego's accurately observed visible geometry, while preserving reconstruction efficiency. For efficient reconstruction, FRUC is built upon a visual grounded geometric Transformer backbone to enable one-shot, calibration-free inference from a flexible number of multi-vehicle views. To achieve non-destructive geometric supplementation under uncalibrated cross-agent misalignment, FRUC first introduces an ego-centric causal occlusion field that explicitly derives occlusion evolution as latent priors by modeling agent-wise spatio-temporal correlations. Guided by these occlusion priors, it further formulates cross-agent integration as a deterministic residual denoising process via zero-initialized injection, turning challenging cross-agent fusion into bounded residual learning for robust collaborative blind-spot completion. Through extensive evaluations on the real-world V2XReal and UrbanIng-V2X datasets, FRUC is shown to be a new state-of-the-art for the scene reconstruction of dynamic collaborative driving environments, significantly outperforming existing methods in both rendering quality and efficiency.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

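Scoring rationale: The paper's core contribution is dynamic 3D scene reconstruction for multi-vehicle collaboration (FRUC). It uses a visual Transformer backbone (relevant to Visual Encoder) and multi-view inputs (relevant to MultiModal). However, it does not involve Tokenizer, MLLM, or model-based RL; its "unification" refers to unifying views rather than a unified model architecture, and its link to World Models is geometric reconstruction rather than a generative world model. Overall relevance to the given keyword set is therefore low. The author list does not include Yang Shi or the other specified experts.

Keywords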

Feedforward, Dynamic Scene Reconstruction, Uncalibrated, Collaborative Driving, 3D Gaussian Splatting, Visual Transformer, Cross-agent Integration

99. EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation [FAIL]

Score: 22.5 / 27.8

Authors: Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

Published: 2026-05-28

TL;DR: EVL-ECG proposes a heterogeneous knowledge distillation framework to efficiently deploy a 2B-parameter ECG foundation model on edge devices, achieving improved diagnostic accuracy.

Abstract (Chinese translation)

高保真 ECG (心电图) 解读日益依赖于大规模基础模型，然而它们在临床边缘场景中的部署仍受限于极高的计算需求。尽管知识蒸馏 (KD) 是一种有前景的解决方案，但在跨异构架构转移知识时，传统方法无法捕捉 ECG 信号的复杂时空依赖。本文提出 EVL-ECG，这是一个专门针对跨架构心脏诊断逻辑蒸馏而设计的框架。EVL-ECG 引入了三项 ECG 感知的创新：（1）多头交叉注意力对齐 (Multi-Head Cross-Attention Alignment)，旨在调和架构差异以保持精细的形态学特征；（2）基于最优传输 (Optimal Transport) 的视觉特征匹配，利用最优传输在词元表示不匹配的情况下，维持 ECG 导联之间的全局结构关系；（3）几何架构内关系匹配，用于蒸馏教师模型的潜在诊断推理。在 ECG 基准数据集上的评估表明，EVL-ECG 相比现有基线最高可提升 2.4% 的 AUC 和 1.1% 的临床准确率。值得注意的是，EVL-ECG 构建了一个高效的 20 亿参数 (2B-parameter) ECG 基础模型，适用于资源受限的临床环境。

Abstract

High-fidelity ECG interpretation is increasingly reliant on massive foundation models, yet their deployment in clinical edge-care remains hindered by extreme computational demands. While knowledge distillation (KD) is a promising solution, traditional methods fail to capture the complex spatio-temporal dependencies of ECG signals when transferring knowledge across heterogeneous architectures. In this paper, we propose EVL-ECG, a framework specifically designed for cross-architecture distillation of cardiac diagnostic logic. EVL-ECG introduces three ECG-aware innovations: (1) Multi-Head Cross-Attention Alignment, which harmonizes architectural discrepancies to preserve fine-grained morphological features; (2) Optimal Transport-based Visual Feature Matching, utilizing optimal transport to maintain global structural relationships across ECG leads despite mismatched token representations; and (3) Geometric Intra-Architecture Relation Matching, which distills the latent diagnostic reasoning of the teacher model. Evaluations across ECG benchmarks demonstrate that EVL-ECG yields improvements of up to 2.4% AUC and 1.1% clinical accuracy over existing baselines. Notably, EVL-ECG establishes an efficient 2B-parameter ECG foundation model, suitable for resource-constrained clinical environments.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on ECG interpretation via heterogeneous knowledge distillation and has no connection to World Models or model-based RL (both scored 0). Tokenizer and Visual Encoder are tangentially relevant through token representations and feature matching (scored 4 and 3). Unify Models and MultiModal are slightly relevant through heterogeneous architectures and multi-lead signals (both scored 3). MLLM relevance is weak because the model is a signal foundation model rather than a language model (scored 2). The total weighted score is 22.5, below the dynamic passing threshold of 27.8. No expert authors from the specified list were found.
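The arithmetic behind the 22.5 total can be reproduced from the table above. The sketch below assumes that each keyword score is weight times relevance and that a paper passes when the total reaches the threshold; the report does not state the exact pass rule or how the dynamic threshold of 27.8 is derived, so `PASS_THRESHOLD` and the `>=` comparison are assumptions.

```python
# Relevance values (0-10) copied from the EVL-ECG scoring table.
relevance = {
    "Unify Models": 3.0,
    "Tokenizer": 4.0,
    "Visual Encoder": 3.0,
    "World Models": 0.0,
    "MLLM": 2.0,
    "MultiModal": 3.0,
    "model-based RL": 0.0,
}
WEIGHT = 1.5            # every keyword carries the same weight in this report
PASS_THRESHOLD = 27.8   # threshold quoted in the rationale (derivation not given)


def weighted_total(scores, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in scores.values())


total = weighted_total(relevance)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"   # assumed pass rule
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

The same function applies to any other entry in the report by swapping in that entry's relevance column.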

Keywords

ECG Interpretation, Knowledge Distillation, Cross-architecture Distillation, Foundation Model, Optimal Transport, Multi-Head Attention, Efficient Deployment, Cardiac Diagnostic Logic

100. Geometry-Guided Modeling of Foundation Features Enables Generalizable Object Shape Deformation Learning [FAIL]

Score: 22.5 / 27.8

Authors: Yiyao Ma, Kai Chen, Zhongxiang Zhou, Zhuheng Song, Dongsheng Xie, Zelong Tan, Rong Xiong, Qi Dou

Published: 2026-05-28

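TL;DR: This paper proposes a generalizable object deformation learning framework based on geometry-guided modeling of foundation features, achieving robust monocular 3D shape recovery across arbitrary viewpoints and unseen object categories and supporting downstream dexterous robotic manipulation tasks.

Abstract (Chinese translation)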

单目 3D 形状恢复（Monocular 3D Shape Recovery）是几何理解的基础，然而在任意视角和未见物体类别上实现鲁棒泛化仍然是一个重大挑战。本文提出了一种可泛化的形变学习框架（Generalizable Deformation Learning Framework），通过显式变形类别级形状模板（Category-level Shape Template）来重建 3D 物体，以匹配目标观测。为了解决模板与目标之间复杂的形状变化，我们引入了一种几何引导特征建模机制（Geometry-guided Feature Modeling Mechanism）。该过程首先利用模板拓扑丰富基础特征，以生成几何感知表示（Geometry-aware Representation），随后将其与目标观测显式关联，从而引导精确形变。此外，为了弥合固定模板与任意目标视图之间的差异，我们提出了一种视图自适应特征聚合模块（View-adaptive Feature Aggregation Module）。该模块利用多视图模板特征及其对应的相机姿态来丰富规范模板表示（Canonical Template Representation），确保无论目标视角如何，都能实现鲁棒特征对齐。广泛实验表明，我们的方法在处理大形状变化和多样视角方面显著优于最先进方法（State-of-the-art Methods），对新类别展现出强大的泛化能力，并有效支持下游真实世界灵巧机器人操作任务。项目主页：https://GODeform.github.io

Abstract

Monocular 3D shape recovery is fundamental to geometric understanding, yet achieving robust generalization across arbitrary viewpoints and unseen object categories remains a significant challenge. In this paper, we present a generalizable deformation learning framework that reconstructs 3D objects by explicitly deforming a category-level shape template to match the target observation. To address complex shape variations between the template and the target, we introduce a geometry-guided feature modeling mechanism. This process first enriches foundation features with template topology to yield a geometry-aware representation, which is then explicitly correlated with the target observation to guide precise deformation. Furthermore, to bridge the disparity between the fixed template and arbitrary target views, we propose a view-adaptive feature aggregation module. This module leverages multi-view template features and their corresponding camera poses to enrich the canonical template representation, ensuring robust feature alignment regardless of the target's perspective. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in handling large shape variations and diverse viewpoints, exhibiting strong generalization to novel categories and effectively supporting downstream real-world dexterous robotic manipulation tasks. Project homepage: https://GODeform.github.io/

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	2.0/10	3.0

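Scoring rationale: The paper centers on monocular 3D shape recovery and deformation learning. Its use of foundation features implies a visual encoder (Visual Encoder), and combining vision with geometry gives it some multimodal (MultiModal) character. It does not involve language models, tokenizers, reinforcement learning, or world models, and unified models are not a focus, so most keywords score low. The author list contains none of the specified experts, so no bonus is applied.

Keywords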

Monocular 3D shape recovery, Geometry-guided feature modeling, Object shape deformation learning, Foundation features, View-adaptive feature aggregation, Generalizable object shape, Robotic manipulation tasks

101. BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base [FAIL]

Score: 21.0 / 27.8

Authors: Rohan Shravan

Published: 2026-05-28

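TL;DR: This paper proposes BrahmicTokenizer-131K, which uses a targeted vocabulary retrofit to substantially improve compression of Indic scripts at a 131K vocabulary while remaining competitive on English, code, and math.

Abstract (Chinese translation)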

我们提出了 BrahmicTokenizer-131K，这是一种拥有 131,072 词汇量的字节级 BPE 分词器，它在 131K 词汇量类别中消除了婆罗米系 (Brahmic) 压缩差距，同时保留了 OpenAI 的 o200k_base 在英语、欧洲语言 (EU-language) 及代码方面的压缩性能。我们通过两阶段改造构建它：(1) 一种脚本修剪裁剪，通过移除九个超范围书写系统，将 200,019 个 token 缩减至 131,072；(2) 一种外科手术式改造，针对 2,372 个语料库死词槽，这些词槽由跨九个婆罗米系 Unicode 区块的线性规划分配确定。预分词器、解码器及继承的合并规则均与 o200k_base 保持一致，这使得 BrahmicTokenizer-131K 成为分词器接口上的即插即用替换。在 2700 万份公共印度语系预训练文本（28.4 亿词，46.21 GB）上，在相同的词汇量预算下，BrahmicTokenizer-131K 产生的 token 比 Mistral-Nemo Tekken / Sarvam-m 少 26.7%，各语言节省幅度从 15.79%（泰米尔语）到 76.79%（奥里亚语，压缩比达 4.31 倍）。奥里亚语的优势在机制上可解释为 Tekken/Sarvam-m 中不含任何奥里亚语区块 token；我们的改造增加了 725 个此类 token。在非印度语系内容上，BrahmicTokenizer-131K 的英语密度 (fertility，即每词 token 数，1.235 vs 1.232) 与 o200k_base 相当，并在 HumanEval、MBPP 和 GSM8K 基准上比 Tekken/Sarvam-m 高出 4.0%-14.2%。在我们包含 14 个分词器的基准测试中，它是唯一一个在 131K 预算下，在婆罗米系、英语、欧洲语言、代码及数学方面同时具有竞争力的分词器。其他词汇量类别的专业分词器 (Sarvam-30B, Sarvam-1, MUTANT-Indic) 以牺牲非印度语系性能为代价实现了更好的印度语系压缩：Sarvam-1 的英语密度比我们的低 15.9%，代码/数学压缩性能比我们的差 26%-33%。我们在 Apache 2.0 许可证下发布了该模型文件，地址为 https://huggingface.co/theschoolofai/BrahmicTokenizer-131K。

Abstract

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
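The abstract reports both percentage token savings and a compression ratio, and the two are linked by simple arithmetic. The sketch below assumes the ratio means baseline token count divided by new token count, which is consistent with the quoted Odia figures; the helper names are illustrative and not part of the released artifact.

```python
def compression_ratio(token_savings):
    """Baseline-to-new token-count ratio implied by a fractional token saving."""
    if not 0.0 <= token_savings < 1.0:
        raise ValueError("token_savings must be in [0, 1)")
    return 1.0 / (1.0 - token_savings)


def fertility(num_tokens, num_words):
    """Average number of tokens per word (lower means better compression)."""
    return num_tokens / num_words


# Savings quoted in the abstract, relative to Tekken / Sarvam-m.
odia = compression_ratio(0.7679)     # about 4.31x, matching the abstract
tamil = compression_ratio(0.1579)
overall = compression_ratio(0.267)
print(f"Odia {odia:.2f}x, Tamil {tamil:.2f}x, overall {overall:.2f}x")

# Relative gap between the two English fertility values quoted in the abstract.
gap = (1.235 - 1.232) / 1.232
print(f"English fertility gap vs o200k_base: {gap:.2%}")
```

Because the ratio grows as 1 / (1 - savings), the jump from Tamil's 15.79% to Odia's 76.79% corresponds to a much larger difference in sequence length than the percentages alone suggest.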

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	10.0/10	15.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

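Scoring rationale: The paper's core contribution is a tokenizer for Indic scripts, which is highly relevant to the Tokenizer keyword. Unify Models and MLLM are weakly related (as a model component, or through script unification). The remaining keywords concern vision, world models, and reinforcement learning and are unrelated to this text-only tokenizer study. The author list does not include any of the specified experts.

Keywords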

BrahmicTokenizer, Indic Languages, Tokenizer, BPE, Compression, Drop-In Replacement, o200k_base

102. Rubric-Guided Process Reward for Stepwise Model Routing [FAIL]

Score: 21.0 / 27.8

Authors: Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang

Published: 2026-05-28

TL;DR: To address the limitation of outcome-only rewards in model routing, the proposed RoRo framework utilizes rubric-guided process rewards to optimize stepwise routing, achieving better accuracy and cost trade-offs on reasoning benchmarks.

Abstract (Chinese translation)

逐步模型路由通过将每个推理步骤分配给合适的模型，提高了大型推理模型（LRMs）的效率。近期方法将路由过程建模为顺序决策过程，并使用强化学习训练路由模型。然而，尽管它们将路由建模为过程，但仍通过结果奖励来监督路由模型。此类奖励仅反映最终答案的正确性，无法评估中间的路由决策，这可能会削弱模型的性能和泛化能力。为了解决这一差距，我们提出了 RoRo，一种用于逐步模型路由的基于评估准则的过程奖励框架。RoRo 首先收集多样化的路由轨迹，并基于结果、成本和过程质量构建偏好对。随后，它通过交替优化训练一个 Rubricor 以生成查询特定的评估准则，并训练一个 Judge 在此准则下对路由轨迹进行打分。所得的过程奖励与结果奖励相结合，通过 GRPO 优化路由策略。在五个推理基准上的实验表明，无论是在同族还是跨族设置下，RoRo 始终优于强基线，并实现了更好的准确率与成本权衡。

Abstract

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
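To see why adding a process reward changes what the router learns, consider a toy group of routing trajectories scored with a group-relative advantage of the kind GRPO uses. This is a sketch under assumptions: the additive combination, the `process_weight` value, and the reward numbers are hypothetical, since the abstract does not specify how RoRo combines the two signals.

```python
from statistics import mean, pstdev


def combined_rewards(outcome, process, process_weight=0.5):
    """Add a weighted process reward to each outcome reward (assumed additive)."""
    return [o + process_weight * p for o, p in zip(outcome, process)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical trajectories for the same query.
outcome = [1.0, 1.0, 0.0, 0.0]      # final answer correct or not
process = [0.9, 0.3, 0.6, 0.1]      # hypothetical rubric-based Judge scores

outcome_only = group_relative_advantages(outcome)
with_process = group_relative_advantages(combined_rewards(outcome, process))
print(outcome_only)
print(with_process)
```

With outcome rewards alone, the two correct trajectories receive identical advantages; once the process term is added, the trajectory with better intermediate routing decisions is ranked higher.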

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper proposes RoRo for stepwise model routing in Large Reasoning Models using GRPO and process rewards. It relates moderately to 'Unify Models' because it routes between multiple models, and more weakly to 'model-based RL' through its use of reinforcement learning. However, it has no content on tokenizers, visual encoders, world models, or multimodal aspects, resulting in low scores for those keywords.

Keywords

Stepwise Model Routing, Process Reward, Rubric-Guided, Reinforcement Learning, Large Reasoning Models, GRPO, Model Routing, Outcome Rewards

103. SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations [FAIL]

Score: 19.5 / 27.8

Authors: Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu

Published: 2026-05-28

TL;DR: SchGen introduces a semantic-grounded code representation that enables large language models to generate editable PCB schematics from natural language, achieving higher wire connectivity accuracy and functional correctness than general-purpose LLMs.

Abstract (Chinese translation)

印刷电路板（PCB）原理图设计定义了几乎所有的电子硬件，但其仍依赖人工且高度依赖专家知识。尽管生成式 AI 已推动了数字和模拟集成电路（IC）设计，但从自然语言意图生成 PCB 原理图在很大程度上尚未被探索。本文提出了 SchGen，这是首个能从自然语言请求生成可编辑 PCB 原理图的大型语言模型（LLM）。关键挑战在于缺乏适合 LLM 的表示方法及大规模数据集。当前的原理图格式主要由冗长的、特定工具语法和基于几何的描述主导，使得它们难以可靠地生成。我们引入了一种基于语义的代码表示法，该表示法编码了带有相对放置和基于引脚名称布线的原理图编辑基元，将一个基于几何的生成问题转化为一个适合 LLM 的基于语义的匹配任务。此外，我们通过一个人机协作管道构建了大规模 PCB 原理图数据集，该管道将开源硬件设计转换为我们的表示法并与用户提示配对。实验表明，SchGen 在连线连接准确性和功能正确性方面显著优于替代表示法甚至更大的通用 LLM。我们的结果突出了表示设计在使生成模型能够处理复杂硬件设计任务中的关键作用。

Abstract

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on PCB schematic generation using LLMs with a semantic-grounded code representation. It has no direct connection to World Models, model-based RL, or Visual Encoders. While it uses LLMs (broadly related to MLLM and MultiModal), the core contribution is representation design for hardware generation rather than the specific architectures implied by the keywords.

Keywords

PCB Schematic Generation, Semantic-Grounded Code, Large Language Model, Natural Language Intent, Hardware Design Automation, Editable Schematics, Generative AI

104. Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents [FAIL]

Score: 19.5 / 27.8

Authors: Anany Kotawala

Published: 2026-05-28

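TL;DR: This paper shows that multi-component LLM agents can produce globally incoherent probabilistic claims even when every component is locally coherent, and proposes a deterministic projection-based repair, while finding that three intuitive LLM-side mitigations each fail or regress.

Abstract (Chinese translation)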

多组件大语言模型（LLM）代理从仅看到联合问题一部分的各个组件中组装概率主张；即使每个组件都是局部相干的，这种组合也可能违反基本概率公理。我们通过组合残差 ε* 来形式化这种局部相干但全局不相干的失败，ε* 是组合主张到联合相干多面体的 L2 距离，可根据系统输出和声明的跨组件耦合约束在运行时计算。积结构二分法刻画了局部相干何时足够，且瑞利商预测在四种关系类别中的三种上与观测残差的误差在 7% 以内。分层 Boyle-Dykstra 投影确定性修复组合；任意时刻有效的 e-过程提供序列相干性监控。在四个 LLM 构成的中级面板上的 1,876 个集成团中（前沿面板在第 5.5 节重运行），33-94% 的集成团上 ε* > 0，这转化为在比例分配规则下，1,770 个已解决投注中每注遗憾 +0.115 纳特（若投注者自身使相干化，则收益降至 +0.006）。三种直观的 LLM 侧缓解措施（检索、感知分区提示、聚合器 LLM）均失败或退化。

Abstract

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
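A two-component toy case makes the residual concrete. Suppose one component quotes P(A) and another quotes P(not A); each quote is a valid probability on its own, yet together they need not sum to one. The sketch below uses a plain (flat) Dykstra alternating projection onto two convex sets, which is a simplification of the hierarchical Boyle-Dykstra projection named in the abstract; the constraint set and the quote values are hypothetical.

```python
import math


def project_box(x):
    """Clip each coordinate to [0, 1] so every quote is a valid probability."""
    return [min(1.0, max(0.0, v)) for v in x]


def project_sum_to_one(x):
    """Euclidean projection onto the hyperplane sum(x) == 1."""
    shift = (sum(x) - 1.0) / len(x)
    return [v - shift for v in x]


def dykstra(quote, iterations=200):
    """Project `quote` onto the intersection of the two sets above."""
    x = list(quote)
    p = [0.0] * len(x)   # correction term for the hyperplane step
    q = [0.0] * len(x)   # correction term for the box step
    for _ in range(iterations):
        y = project_sum_to_one([xi + pi for xi, pi in zip(x, p)])
        p = [xi + pi - yi for xi, pi, yi in zip(x, p, y)]
        x = project_box([yi + qi for yi, qi in zip(y, q)])
        q = [yi + qi - xi for yi, qi, xi in zip(y, q, x)]
    return x


def residual(quote, repaired):
    """L2 distance between the composed quote and its coherent repair."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(quote, repaired)))


quote = [0.7, 0.5]            # quoted P(A) and P(not A): they sum to 1.2
repaired = dykstra(quote)
eps_star = residual(quote, repaired)
print(repaired, eps_star)
```

Here the repair moves the quote to roughly (0.6, 0.4) and the residual is about 0.141; a residual of zero would mean the composed quote was already coherent.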

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

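Scoring rationale: The paper focuses on probabilistic coherence and decision consistency in multi-component LLM agents, which belongs to probability theory and compositional LLM reasoning; it has little connection to the multimodal, visual encoder, tokenizer, or model-based RL keywords. It does not involve visual modalities or generative world-model mechanisms, and although it discusses LLM agents, it does not present a unified model architecture or model learning within reinforcement learning. The author list does not include any of the five specified experts, so no bonus is applied.

Keywords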

Multi-component LLM Agents, Compositional Incoherence, Locally Coherent Globally Incoherent, Compositional Residual, Probabilistic Claims, Boyle-Dykstra Projection, Ensemble Cliques

105. Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes [FAIL]

Score: 19.5 / 27.8

Authors: Ruixiang Jiang, Chang Wen Chen

Published: 2026-05-28

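TL;DR: This paper proposes a 3D aesthetic portrait planning method that generates human pose, camera, and lighting configurations before capture to produce visually appealing portraits, which human raters and MLLM evaluators prefer over competitive baselines.

Abstract (Chinese translation)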

肖像摄影在很大程度上在快门按下之前便已确定：主体的姿态、相机配置及照明设备需与周围的 3D 场景相协调。相比之下，大多数现有的计算方法专注于 2D 图像空间中的后期制作，例如修饰、重新照明或对已存在的图像进行编辑；而拍摄前的摄影规划仍大部分未被充分探索。我们提出了 3D 美学肖像规划（3D Aesthetic Portrait Planning），即在 3D 场景中生成人体姿态、相机、照明和曝光计划的任务，旨在产生视觉上引人注目的肖像，同时满足几何和光度可行性。我们的方法构建了一个摄影场景图（Photographic Scene Graph），用以表征场景功能（affordances）、主体 - 场景关系以及与肖像相关的照明结构。基于该表示，我们针对先前的尝试和当前的取景器观察，执行美学引导的比较规划。在多样室内和室外场景上的实验表明，相较于竞争性基线，我们的方法生成的肖像更受人类评分者（human raters）和 MLLM 评估者（MLLM evaluators）的青睐，同时保持了高物理合理性。总体而言，我们的研究结果展示了一条从拍摄后修正转向拍摄前计算肖像规划的路径。项目仓库：https://github.com/songrise/Before-the-Shutter

Abstract

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	2.0/10	3.0

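Scoring rationale: The paper's core content is portrait photography planning in 3D scenes (pose, camera, and lighting planning), with aesthetic guidance built on a Photographic Scene Graph. The given keywords focus mainly on large-model architecture (Tokenizer, Unify Models, Visual Encoder), generative world models, and reinforcement learning (World Models, model-based RL), which do not match this paper's computer vision and graphics planning task. MLLM appears only as an evaluation tool, not as part of the core method; MultiModal applies in the sense of combining 3D geometry with images, but not in the multimodal large-model sense; model-based RL is loosely related through planning, but there is no reinforcement learning framework. The author list does not include Yang Shi or the other specified experts. The weighted total of 19.5 is below the dynamic passing threshold of 27.8, indicating low relevance to the given research background.

Keywords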

Portrait Photography Planning, 3D Scenes, Photographic Scene Graph, Aesthetic-guided Planning, Pre-capture Planning, Camera Configuration, Lighting Devices

106. Evaluating Skill and Stability of ArchesWeather and ArchesWeatherGen under Multi-Decadal Climate Simulations [FAIL]

Score: 19.5 / 27.8

Authors: Renu Singh, Robert Brunstein, Antonia Jost, Thomas Rackow, Claire Monteleoni, Yana Hasson, Christian Lessig, Guillaume Couairon

Published: 2026-05-28

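TL;DR: This paper evaluates the skill and stability of the ArchesWeather and ArchesWeatherGen models in multi-decadal climate simulations, finding that their forced configurations stably reproduce climatology and variability despite being originally designed for short-range weather forecasting.

Abstract (Chinese translation)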

我们评估了 ArchesWeather 和 ArchesWeatherGen 的气候模拟能力，这两种机器学习模型最初是为天气预报训练的，其预报提前期可达 10 天。ArchesWeather 是一个确定性模型，而 ArchesWeatherGen 是一个概率性流匹配模型，它利用 ArchesWeather 的预报，从而实现基于集合的不确定性量化。在这项工作中，我们通过额外以月平均海表温度（SST）和海冰覆盖（SIC）作为边界条件进行条件化，将这些模型调整为强迫大气模型。特别是，我们遵循人工智能模型比较计划（AIMIP）第一阶段协议，该协议类似于大气模型比较计划（AMIP），提出了一种标准化的实验设置，用于评估基于机器学习的强迫大气模型的气候技能。我们在这些条件下对两种模型进行了全面评估，包括与数值气候模型的比较、考察扩展中关键设计选择的消融研究，以及强迫与非强迫配置的分析。尽管最初是为天气预报开发的，但我们证明 ArchesWeather 和 ArchesWeatherGen 的强迫配置能够产生稳定的长期气候模拟，具有稳定的年循环，并捕捉到许多气候变量的漂移。这些模型忠实地再现了 ERA5 的气候态、大尺度环流和年际变率，并捕捉了分布的尾部。

Abstract

We evaluate the climate simulation capabilities of ArchesWeather and ArchesWeatherGen, two machine learning models originally trained for weather forecasting and evaluated up to a 10-day lead time. ArchesWeather is a deterministic model, while ArchesWeatherGen is a probabilistic flow-matching model leveraging ArchesWeather's forecasts, enabling ensemble-based uncertainty quantification. In this work, we adapt these models to act as forced atmospheric models by using additional conditioning on the monthly mean sea surface temperature (SST) and sea ice cover (SIC) as boundary conditions. In particular, we follow the AI Model Intercomparison Project (AIMIP) Phase 1 protocol, which, analogous to the Atmospheric Model Intercomparison Project (AMIP), proposes a standardized experimental setup to evaluate the climate skill of ML-based forced atmospheric models. We present a comprehensive evaluation of both models under these conditions, including comparison against numerical climate models, ablation studies that examine key design choices in the extension, and an analysis of forced versus unforced configurations. Despite being originally developed for weather forecasting, we demonstrate that forced configurations of ArchesWeather and ArchesWeatherGen produce stable long-term climate simulations, have a stable annual cycle, and capture the drift of many climate variables. The models faithfully reproduce ERA5's climatology, large-scale circulations and interannual variability, and they capture the tails of the distributions.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	1.0/10	1.5

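Scoring rationale: The paper focuses on climate simulation and weather forecasting models, involving flow matching and boundary conditions. The keyword list targets multimodal large models and reinforcement learning. The paper does not involve tokenizers, visual encoders (in the MLLM sense), large language models, or reinforcement learning, so relevance is very low. Although it uses multiple input variables, it matches the core of the keywords (unified architectures, multimodal understanding and generation, RL) poorly, leaving the weighted total below the passing threshold.

Keywords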

Climate Simulations, ArchesWeather, ArchesWeatherGen, Forced Atmospheric Models, Flow-Matching, Sea Surface Temperature, ERA5 Climatology, Stability Evaluation

107. Make LLM Learn to Synthesize from Streaming Experiences through Feedback [FAIL]

Score: 19.5 / 27.8

Authors: Zhenlin Hu, Yan Wang, Zhen Bi, Zihao Xue, Bingyu Zhu, Longtao Huang, Xiongtao Zhang, Zeyu Yang, Zhixuan Chu, Jungang Lou

Published: 2026-05-28

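TL;DR: This paper proposes the SynLearner framework, which enables large language models to learn data synthesis over a stream of tasks using feedback from earlier tasks, achieving consistent cross-task transferability.

Abstract (Chinese translation)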

大型语言模型（LLMs）已被广泛应用于合成数据生成，显著降低了标注成本。然而，大多数现有研究将合成视为一系列孤立的任务，并忽视了一个更根本的问题：模型能否通过积累过往任务的经验并将其迁移至未来任务来学习合成。在这项工作中，我们引入了 StreamSynth，这是一种新的设定，其中合成任务按顺序到达，历史任务的经验为未来的合成提供信息信号。为应对这一设定，我们提出了 SynLearner，这是一个通用框架，使合成模型能够在任务流中获取可重用的合成经验。与为每个任务独立生成数据不同，SynLearner 鼓励模型探索多样的合成模式，从反馈中学习，并在任务演进过程中平衡样本质量与集合级多样性。在多个基准上的广泛实验表明，SynLearner 有效利用早期任务的经验来提升后期任务的合成性能，表现出一致的跨任务迁移性。这些发现为 StreamSynth 的可行性提供了证据，并强调合成数据生成是一个经验驱动的过程，能够从任务流中受益。

Abstract

Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most existing studies treat synthesis as a set of isolated tasks and overlook a more fundamental question: whether a model can learn to synthesize by accumulating experience from past tasks and transferring it to future ones. In this work, we introduce StreamSynth, a new setting in which synthesis tasks arrive sequentially and experience from historical tasks provides informative signals for future synthesis. To address this setting, we propose SynLearner, a general framework that enables synthesis models to acquire reusable synthesis experience over a task stream. Instead of generating data independently for each task, SynLearner encourages the model to explore diverse synthesis patterns, learn from feedback, and balance sample quality with set-level diversity as tasks evolve. Extensive experiments across multiple benchmarks show that SynLearner effectively leverages experience from earlier tasks to improve synthesis performance on later ones, exhibiting consistent cross-task transferability. These findings provide evidence for the feasibility of StreamSynth and highlight synthetic data generation as an experience-driven process that can benefit from task streams.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

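Scoring rationale: The paper focuses on LLM text synthesis and learning from streaming experience and does not involve multimodal content (MultiModal, MLLM, Visual Encoder), so relevance there is very low. It touches on experience accumulation (conceptually close to World Models and model-based RL) and task unification (Unify Models), but neither unified model architectures nor reinforcement learning is central, so those scores are low. A tokenizer is only an implicit component.

Keywords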

Large language models, Synthetic data generation, Streaming experiences, Cross-task transferability, Feedback learning, SynLearner, StreamSynth

108. AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security [FAIL]

Score: 19.5 / 27.8

Authors: Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

Published: 2026-05-28

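TL;DR: AgentDoG 1.5 proposes a lightweight safety alignment framework for AI agents that achieves strong safety moderation through a taxonomy-guided data engine and RL training, but it does not involve multimodal world models or model-based RL architectures.

Abstract (Chinese translation)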

现代开放世界代理（Open-world agents）如 OpenClaw 展现出强大的跨环境执行能力，但也引入了广泛的新安全风险来源。与此同时，先进的前沿 AI 模型大幅降低了攻击门槛，使得当前的代理对齐框架（agent alignment frameworks）不足以应对现实世界部署的需求。为应对这些新兴威胁，我们提出了一种轻量级且可扩展的代理安全对齐框架。具体而言，我们更新了代理安全分类法（taxonomy），以容纳来自 Codex 和 OpenClaw 执行场景的新兴风险。此外，我们构建了一个基于分类法指导的数据引擎，结合影响函数净化（influence-function purification），仅使用约 1k 个样本即可训练轻量级 AgentDoG 1.5 变体（0.8B、2B、4B 和 8B 参数），其性能可与领先的闭源模型（如 GPT-5.4）相媲美。基于 AgentDoG 1.5，我们构建了一个高效的代理安全 SFT（监督微调）和 RL（强化学习）训练环境，将 Docker 级环境中的部署开销降低了两个数量级。最后，我们将 AgentDoG 1.5 部署为无需训练的在线护栏（guardrail），用于实时安全审核。广泛的实验结果表明，AgentDoG 1.5 在多样且复杂的交互式代理场景中实现了最先进性能。所有模型和数据集均已开源发布。

Abstract

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

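Scoring rationale: The paper's core contribution is an agent safety alignment framework (AgentDoG 1.5) involving lightweight model training, a taxonomy-guided data engine, and an online guardrail; it has little connection to the technical core of the keywords 'Unify Models', 'Visual Encoder', 'World Models', 'MultiModal', and 'model-based RL'. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). The weighted total is 19.5, below the dynamic passing threshold of 27.8.

Keywords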

Agent Safety, Alignment Framework, Lightweight Models, SFT, RL Training, Online Guardrail, Open-world Agents

109. Learning Context-Conditioned Predicate Semantics via Prototype Feedback [FAIL]

Score: 19.5 / 27.8

Authors: NamGyu Jung, Chang Choi

Published: 2026-05-28

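TL;DR: This paper proposes AlignG, which learns context-conditioned predicate semantics via prototype feedback and improves scene graph generation performance on the VG-150 and GQA-200 datasets.

Abstract (Chinese translation)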

在场景图生成（Scene Graph Generation）中，一个核心挑战在于建模多义谓词，其含义会随上下文语境发生偏移。先前的方法通过将谓词分解为多个静态原型（static prototypes）或检索语义相似的示例（exemplars）来解决这一问题。然而，这些策略保持谓词表示静态，无法重新组织语义以反映图像特定的证据，从而在模糊语境中导致系统性混淆。我们提出 AlignG，通过原型反馈（prototype feedback）学习上下文条件谓词语义。AlignG 从每个图像中的关系候选（relation candidates）推断上下文条件谓词语义，并将适应后的语义反馈回去，以重新校准关系表示。该学习目标将这种适应锚定到全局语义中心（global semantic centers），防止语义漂移（semantic drift），同时仍允许在场景提供一致的关系线索时进行选择性重新组织。在 VG-150 和 GQA-200 数据集上的实验表明，相对于最先进的基线（state-of-the-art baselines），该方法表现出一致的性能提升，在场景图检测（SGDet）下，VG-150 的 F@100 提升了 +1.4，GQA-200 提升了 +2.7。此外，我们可视化了每幅图像的原型相似性偏移，观察到相干的上下文依赖重新组织：原型根据场景证据选择性地合并或分离谓词。代码可在 https://github.com/Namgyu97/AlignG-SGG.pytorch 获取。

Abstract

In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image-specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context-conditioned predicate semantics via prototype feedback. AlignG infers context-conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over state-of-the-art baselines, with F@100 improvements of +1.4 on VG-150 and +2.7 on GQA-200 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at https://github.com/Namgyu97/AlignG-SGG.pytorch.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	1.0/10	1.5

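Scoring rationale: The paper focuses on predicate semantic adaptation in scene graph generation (SGG) and proposes AlignG, which learns context-conditioned semantics via prototype feedback. SGG is a multimodal task and implicitly relies on a visual encoder, but the paper does not address model unification, tokenizers, world models, large language models, or reinforcement learning, so every keyword other than MultiModal and Visual Encoder has very low relevance.

Keywords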

Scene Graph Generation, Predicate Semantics, Prototype Feedback, Context-Conditioned, Visual-Language, Polysemous Predicates, AlignG

110. Training Deliberative Monitors for Black-Box Scheming Detection [FAIL]

Score: 19.5 / 27.8

Authors: Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn

Published: 2026-05-28

TL;DR: This paper proposes training lightweight deliberative monitors via distillation and reinforcement learning to detect scheming in autonomous agents more cost-effectively than using large frontier models directly.

Abstract (Chinese translation)

随着自主智能体执行现实世界任务的能力日益增强，区分图谋行为与无害的任务执行可能成为核心的 AI 控制问题。现有的监控器通常依赖思维链 (Chain-of-thought) 访问或内部激活，或使用通过提示的前沿模型，这些在部署中均可能不可用、不可靠或成本高昂。在这项工作中，我们研究仅基于动作的深思熟虑监控器：这些是较小的开放权重模型，旨在从智能体轨迹中检测图谋和破坏行为，而无需访问被监控智能体的推理过程或模型内部。我们的方法受深思熟虑对齐 (Deliberative alignment) 启发，利用图谋规范从前沿教师模型中获取结构化理由，通过独立的评判模型进行过滤，并通过监督微调和强化学习将最高质量的理由蒸馏为开放权重监控器。我们在五个数据集上进行训练，并在六个分布外智能体不对齐基准上进行评估。我们表明，将我们的方法应用于 Qwen3.5-27B 所得性能高于所有作为提示监控器的低成本前沿模型（Gemini 3.1 Flash-Lite、GPT-5.4 Nano 和 Claude Haiku 4.5），也高于 Gemini 2.5 Pro，同时实现了更低的边际推理成本（每 1,000 次评估的按 token 计费美元成本）。性能更强的提示前沿监控器（Gemini 3.1 Pro、GPT-5.4、Claude Sonnet 4.6 和 Claude Opus 4.6）实现了更高性能，但边际推理成本高出约 16 至 34 倍。我们训练的若干监控器位于我们评估的监控器中的经验成本 - 性能帕累托前沿 (Pareto frontier)，为提示前沿模型提供了实用的低成本、低假阳性率 (FPR) 替代方案。

Abstract

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly 16-34× higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.
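The idea of an empirical cost-performance Pareto frontier can be made concrete with a small sketch. The monitor names, costs, and scores below are hypothetical placeholders, not figures from the paper; only the dominance rule itself is standard.

```python
def pareto_frontier(monitors):
    """Return the names of monitors that no other monitor dominates.

    A monitor is dominated when another one is at least as cheap and at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    for name, cost, perf in monitors:
        dominated = any(
            c <= cost and p >= perf and (c < cost or p > perf)
            for _, c, p in monitors
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Hypothetical (name, USD per 1,000 evaluations, performance) triples.
monitors = [
    ("small-trained", 1.0, 0.80),
    ("cheap-prompted", 1.5, 0.70),
    ("large-prompted", 20.0, 0.90),
    ("mid-prompted", 25.0, 0.85),
]
print(pareto_frontier(monitors))
```

With these made-up numbers the cheap prompted monitor is dominated by the trained one, while the most expensive prompted monitor stays on the frontier because nothing cheaper matches its score.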

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

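Scoring rationale: The paper focuses on deception (scheming) detection in agents and on alignment, training monitors with distillation and reinforcement learning. It does not involve visual encoders, tokenizers, or unified multimodal architectures, and its connection to the world model concept is weak. It uses reinforcement learning, but not model-based RL in the usual sense, so relevance is low across the board.

Keywords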

Scheming Detection, Deliberative Monitors, Reinforcement Learning, Agent Alignment, Black-Box Monitoring, Distillation, Frontier Models

111. GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering [FAIL]

Score: 19.5 / 27.8

Authors: Xin Sun, Jianan Xie, Zhongqi Chen, Qiang Liu, Shu Wu, Bowen Song, Weiqiang Wang, Zilei Wang, Liang Wang

Published: 2026-05-28

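TL;DR: The paper proposes GAPD, a framework that distills gold action sequences to give reinforcement learning dense token-level supervision, improving agent performance on knowledge base question answering.

Abstract (Chinese translation)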

强化学习（RL）天然契合基于智能体的知识库问答（KBQA），在此场景中，模型必须发出可执行动作，观察知识库反馈，并最终返回答案。然而，当前基于强化学习的 KBQA 系统主要优化来自最终答案的稀疏奖励，使得中间动作误差的监督较弱。这对于逻辑形式标注的 KBQA 基准测试尤其具有局限性：黄金逻辑形式可以转换为可执行动作序列，但现有管道主要将其用于暖启动数据构建，而非用于策略内强化学习更新。我们提出 GAPD（黄金动作策略蒸馏框架），该框架为基于结果的强化学习添加了密集的词级指导。为了将黄金动作与策略内学生轨迹对齐，GAPD 采用 MID-ANCHOR MATCHING（中锚点匹配）：它将学生探索和黄金执行过程中达到的中间实体视为状态锚点，并通过这些探索得到的实体集合将学生状态与黄金状态进行匹配。基于此对齐黄金动作的当前策略充当停止梯度教师，其词分布被蒸馏回普通学生策略，覆盖生成的动作词跨度。GAPD 在 WebQSP、GrailQA 和 GraphQ 上一致超越了当前最先进方法。

Abstract

Reinforcement learning (RL) is a natural fit for agentic knowledge base question answering (KBQA), where a model must issue executable actions, observe knowledge-base feedback, and eventually return an answer. However, current RL-based KBQA systems mainly optimize sparse rewards from the final answer, leaving intermediate action errors weakly supervised. This is especially limiting for logical-form annotated KBQA benchmarks: gold logical forms can be converted into executable action sequences, but existing pipelines use them mainly for warm-start data construction rather than for on-policy RL updates. We propose GAPD, a training-time Gold-Action Policy Distillation framework that adds dense token-level guidance to outcome-based RL. To align gold actions with on-policy student rollouts, GAPD uses MID-ANCHOR MATCHING: it treats the intermediate entities reached during student exploration and gold execution as state anchors, and matches student states to gold states through these explored entity sets. The current policy conditioned on this aligned gold action serves as a stop-gradient teacher, whose token distribution is distilled back to the ordinary student policy over generated action-token spans. GAPD consistently surpasses the current state of the art on WebQSP, GrailQA, and GraphQ.
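The anchor-based alignment described in the abstract can be pictured with a toy sketch. It assumes that "matching through explored entity sets" means picking the gold state with the largest set overlap; the Jaccard measure, the function names, and the entity and action strings are all hypothetical and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap between two entity sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def match_gold_state(student_entities, gold_states):
    """Pick the gold state whose entity set best overlaps the student's.

    gold_states is a list of (entity_set, next_gold_action) pairs.
    Returns the best pair, or None when nothing overlaps at all.
    """
    best = max(
        gold_states,
        key=lambda g: jaccard(student_entities, g[0]),
        default=None,
    )
    if best is None or jaccard(student_entities, best[0]) == 0.0:
        return None
    return best


# Hypothetical gold execution trace: entities reached so far, then next action.
gold = [({"m.a"}, "join(r1)"), ({"m.a", "m.b"}, "join(r2)")]
print(match_gold_state({"m.a", "m.b"}, gold))
```

The matched gold action would then be the one the teacher is conditioned on; how the paper scores overlaps or breaks ties is not stated in the abstract.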

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

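Scoring rationale: The paper centers on reinforcement learning and policy distillation for knowledge base question answering (KBQA). It involves no visual modality or unified multimodal architecture, so Visual Encoder and MultiModal score 0. Actions are represented as tokens and RL is used, but no world model or unified model is built, so relevance is low. model-based RL is somewhat related because the work involves action planning and policy guidance, but the core method is policy distillation rather than explicit model learning.

Keywords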

Reinforcement Learning, Knowledge Base Question Answering, Policy Distillation, Gold-Action, Agentic Agent, Token-level Guidance, Sparse Rewards

112. PatchBoard: Schema-Grounded State Mutation for Reliable and Auditable LLM Multi-Agent Collaboration [FAIL]

Score: 19.5 / 27.8

Authors: Shuyu Zhang, Yaqi Shi, Lu Wang

Published: 2026-05-28

TL;DR: PatchBoard introduces a schema-grounded architecture for LLM multi-agent collaboration that uses validated JSON Patch mutations to improve success rates and auditability compared to dialogue-based methods.

Abstract (Chinese translation)

大语言模型多智能体系统（LLM multi-agent systems）通常通过自然语言对话或松散结构的共享内存进行协调，这使得中间状态难以验证、归因和审计。我们引入 PatchBoard，这是一种基于模式的协作架构（schema-grounded collaboration architecture），它用经过验证的 JSON Patch 变更（validated JSON Patch mutations）取代了智能体间的对话，这些变更作用于共享结构化状态之上。架构智能体（Architect agent）构建任务特定模式和工作流规则，而确定性内核（deterministic kernel）在事务性地提交之前，会根据模式约束、角色特定写契约和运行时不变量，对每个提议的状态变更进行验证。在 630 个匹配的 ALFWorld 回合中，PatchBoard 取得了 84.6% 的成功率，而 LangGraph 和 Flock 分别为 30.8% 和 61.6%；同时，每个成功任务的 token 数减少至 45.5k，相比之下分别为 368.3k 和 64.2k。

Abstract

LLM multi-agent systems often coordinate through natural-language dialogue or loosely structured shared memory, making intermediate state difficult to validate, attribute, and audit. We introduce PatchBoard, a schema-grounded collaboration architecture that replaces inter-agent dialogue with validated JSON Patch mutations over a shared structured state. An Architect agent constructs a task-specific schema and workflow rules, while a deterministic kernel validates each proposed state mutation against schema constraints, role-specific write contracts, and runtime invariants before committing it transactionally. On 630 matched ALFWorld episodes, PatchBoard achieves an 84.6% success rate, compared with 30.8% for LangGraph and 61.6% for Flock, while reducing tokens per successful task to 45.5k, compared with 368.3k and 64.2k, respectively.
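The validate-then-commit loop described in the abstract can be sketched in a few lines. This is an illustrative toy, not PatchBoard's kernel: it handles only `add`, `replace`, and `remove` on nested object keys (no arrays, no JSON Pointer escaping), models schema constraints and runtime invariants as plain predicate functions, and the role names and state layout are hypothetical.

```python
import copy


def apply_op(state, op):
    """Apply one add/replace/remove operation to nested dicts."""
    *parents, leaf = op["path"].lstrip("/").split("/")
    node = state
    for key in parents:
        node = node[key]
    if op["op"] == "remove":
        del node[leaf]
    elif op["op"] == "replace":
        if leaf not in node:
            raise KeyError(op["path"])
        node[leaf] = op["value"]
    elif op["op"] == "add":
        node[leaf] = op["value"]
    else:
        raise ValueError(f"unsupported op: {op['op']}")


def commit(state, patch, role, write_contracts, invariants):
    """Return the new state if the patch is valid; raise otherwise.

    The input state is never modified, so a rejected patch leaves it intact.
    """
    allowed = write_contracts[role]
    for op in patch:
        if op["path"] not in allowed:          # role-specific write contract
            raise PermissionError(f"{role} may not write {op['path']}")
    candidate = copy.deepcopy(state)           # all-or-nothing: work on a copy
    for op in patch:
        apply_op(candidate, op)
    for check in invariants:                   # stand-in for schema + invariants
        if not check(candidate):
            raise ValueError("invariant violated")
    return candidate


state = {"task": {"status": "open", "holding": None}}
contracts = {
    "executor": {"/task/status", "/task/holding"},
    "reviewer": {"/task/status"},
}
invariants = [lambda s: s["task"]["status"] in {"open", "done"}]
patch = [{"op": "replace", "path": "/task/holding", "value": "mug"}]
new_state = commit(state, patch, "executor", contracts, invariants)
print(new_state)
```

Because every accepted change is a small, typed operation attributed to a role, the sequence of committed patches doubles as an audit log, which is the property the abstract emphasizes over free-form dialogue.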

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on LLM multi-agent collaboration via schema-grounded state mutations (PatchBoard). It does not address Unify Models, Tokenizers, Visual Encoders, World Models, MLLM architectures, MultiModal processing, or Model-Based RL algorithms directly. The core contribution is structured state auditing rather than the specified technical keywords, resulting in low relevance. No specified expert authors are present.

Keywords

LLM Multi-Agent Collaboration, Schema-Grounded State Mutation, JSON Patch, Structured State, Auditable, ALFWorld, Architect Agent

113. AdaState: Self-Evolving Anchors for Streaming Video Generation [FAIL]

Score: 19.5 / 27.8

Authors: Yusuf Dalva, Pinar Yanardag

Published: 2026-05-28

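TL;DR: AdaState replaces the static first-frame anchor with an adaptive latent state, introducing a recurrence into the generation process that markedly improves the dynamics and scene progression of streaming video generation.

Abstract (Chinese translation)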

自回归视频扩散模型通过顺序生成帧来产生流式视频，并以先前生成的内容作为每个块的条件。这些模型在结构上锚定于第一帧：其键值（key-value）表示在注意力缓存（attention cache）中占据特权位置，并在整个生成过程中充当主要的场景参考。作为缓存中最干净且无误差的位置，该锚点吸引了不成比例的关注，抑制了视频动态，并将场景构图锁定在初始视角，即使场景本身正在自然演变。这导致生成的视频在时间维度上较为浅薄：运动、相机移动和场景进展被削弱，以换取静态一致性。为了解决这一问题，我们用自适应状态（adaptive state）替换静态锚点。该状态是一个隐藏潜在变量（hidden latent），模型在每个块中会对其进行去噪处理，与内容一同处理，但从不将其渲染出来。与引用冻结的第一帧不同，模型在每个步骤通过同时关注前一状态和当前内容来生成自己的场景锚点，从而产生一个随生成内容共同演变的参考。与标准视频生成编码了绝对时间概念不同，我们的方法将时间视为相对时间：无论生成进度如何，每个生成步骤均保持相同的结构，且每个块的状态转换均保持一致。这些特性共同将递归（recurrence）引入生成过程：其中去噪充当转换函数，KV 缓存（KV cache）充当载体，无需外部模块。实验表明，自适应状态显著改善了视频动态性，使得生成视频中的运动更加丰富，场景进展更加自然。

Abstract

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function, and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

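Scoring rationale: The paper improves the anchoring mechanism in streaming video generation; its core contribution is an adaptive latent state (Adaptive State) that replaces the static first-frame anchor. Relevance by keyword: World Models scores highest (5) because the work involves latent-state evolution and state transitions; Unify Models and Visual Encoder are moderately relevant (3 and 2) through mechanism unification and implicit video encoding; MultiModal is moderately relevant (2) since video carries spatio-temporal content; Tokenizer, MLLM, and model-based RL are low (0-1) because the paper involves no discrete tokens, language models, or reinforcement learning. The weighted total is 19.5, below the 27.8 pass threshold. The author list includes none of the designated experts, so no bonus applies.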
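The weighted total quoted in the rationale can be reproduced from the table above: each keyword's score is its weight times its relevance, and the total is compared with the pass threshold. In the sketch below the relevance values are copied from the table; treating a total equal to the threshold as a pass is an assumption, since the report does not state the tie rule.

```python
WEIGHT = 1.5            # every keyword in the table carries the same weight
PASS_THRESHOLD = 27.8   # pass line shown next to each score

# Relevance values (out of 10) copied from the table for this paper.
relevance = {
    "Unify Models": 3.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 2.0,
    "World Models": 5.0,
    "MLLM": 0.0,
    "MultiModal": 2.0,
    "model-based RL": 1.0,
}


def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance.values())


total = weighted_total(relevance)
status = "PASS" if total >= PASS_THRESHOLD else "FAIL"  # tie rule is assumed
print(f"{total:.1f} / {PASS_THRESHOLD} -> {status}")
```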

Keywords

Streaming Video Generation, Adaptive State, Self-Evolving Anchors, Attention Cache, Hidden Latent, Denoising Transition, Video Dynamics

114. LiveSVG: Zero-Shot SVG Animation via Video Generation [FAIL]

Score: 19.5 / 27.8

Authors: Matan Levy, Ran Margolin, Bar Cavia, Dvir Samuel, Yael Pritch, Shmuel Peleg, Alex Rav Acha, Ariel Shamir, Dani Lischinski

Published: 2026-05-28

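TL;DR: LiveSVG is a zero-shot SVG animation method that fits vector geometry to a target video generated by a video diffusion model, producing skeleton-free, editable, high-quality animation.

Abstract (Chinese translation)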

我们提出 LiveSVG，一种利用视频扩散模型生成可缩放矢量图形（SVG）动画的零样本方法。现有的 SVG 动画方法在处理复杂运动时面临挑战：基于大语言模型（LLM）的代码合成难以表达精细的非刚性贝塞尔变形，而分数蒸馏采样（SDS）提供的梯度往往存在噪声，且通常需要类别特定的先验（如骨架）。相比之下，LiveSVG 直接将矢量几何拟合到显式生成的目标视频上。给定输入 SVG 图像和运动提示，我们利用冻结的图像到视频模型生成一个可预览的目标视频，随后通过可微渲染将原始 SVG 拟合至该视频。我们的拟合阶段无需骨架，采用一种双层级运动表示，该表示结合组级单应性变换以实现粗略运动，以及路径级贝塞尔控制点偏移以实现局部变形。为了解决像素级拟合过程中由颜色引起的对应歧义，我们引入了一种新颖的球堆积重着色策略（Sphere-packing recolorization）。我们还提出了 ChallengeSVG，一个包含复杂多对象场景的基准评测集，用以揭示先前工作的局限性。评估结果表明，LiveSVG 在 AniClipart 和 ChallengeSVG 上均显著优于现有方法，确立了直接参考视频拟合作为一种实用且稳健的途径，用于生成提示对齐且完全可编辑的矢量动画。

Abstract

We introduce LiveSVG, a zero-shot approach for generating Scalable Vector Graphics (SVG) animations using video diffusion models. Current SVG animation methods struggle with complex motions: LLM-based code synthesis fails to express fine, non-rigid Bézier deformations, while Score Distillation Sampling (SDS) provides noisy gradients and often requires category-specific priors like skeletons. In contrast, LiveSVG fits vector geometry directly to an explicitly generated target video. Given an input SVG image and a motion prompt, we generate a previewable target video using a frozen image-to-video model, then fit the original SVG to this video via differentiable rendering. Our fitting stage is skeleton-free, utilizing a dual-level motion representation that combines per-group homographies for coarse articulation with per-path Bézier control-point offsets for local deformations. To resolve color-induced correspondence ambiguities during pixel-wise fitting, we introduce a novel sphere-packing recolorization strategy. We also present ChallengeSVG, a benchmark of complex, multi-object scenes that exposes the limitations of prior work. Evaluations demonstrate that LiveSVG significantly outperforms existing methods on both AniClipart and ChallengeSVG, establishing direct reference-video fitting as a practical, robust route to prompt-aligned and fully editable vector animation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	1.0/10	1.5

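Scoring rationale: The paper belongs to computer graphics and focuses on SVG animation generation, so it is only weakly related to the MLLM, reinforcement learning, and world model topics. It is multimodal (text, video, vector graphics) and uses a visual encoder, but it does not cover tokenizers, unified model architectures, world model learning, or model-based RL, and the weighted total falls below the dynamic pass threshold.

Keywords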

SVG Animation, Video Diffusion, Differentiable Rendering, Zero-Shot, Vector Geometry, Bézier Deformations, Recolorization Strategy

115. Parameter-Efficient Subspace Decoupling ViT for Mitigating Multi-Task Negative Transfer in Histological Scoring [FAIL]

Score: 19.5 / 27.8

Authors: Youhan Huang, Jiajun Li, Yilin Fang, Shuai Wang, Chuheng Li

Published: 2026-05-28

TL;DR: This paper proposes a subspace-decoupled multi-task Vision Transformer to mitigate negative transfer in NAFLD histological scoring, achieving improved stability and generalization with reduced computational cost.

Abstract (Chinese translation)

组织学评分对于诊断非酒精性脂肪肝病 (NAFLD) 至关重要，但由于标注成本高以及多任务学习中强相关的 NAFLD 活动评分 (NAS) 指标之间存在负迁移，其自动化仍然具有挑战性。为了解决这一问题，我们提出了一种子空间解耦的多任务 Vision Transformer (ViT)，该模型集成了轻量级任务特定 Adapters (适配器) 与基于正交性的约束。该设计为脂肪变性、气球样变和炎症构建了独立的特征子空间，有效减少了任务干扰，同时保留了共享表示。我们进一步构建了一个精心整理的多任务小鼠 NAFLD 组织学数据集，包含所有 NAS 组件的专家标注。实验结果表明，与训练独立的单任务模型相比，所提出的方法提高了多任务稳定性和泛化能力，同时大幅降低了计算成本。代码和整理好的数据集已准备就绪，将在录用后公开以支持可复现性。

Abstract

Histological scoring is essential for diagnosing Non-Alcoholic Fatty Liver Disease (NAFLD), yet its automation remains challenging due to the high annotation cost and negative transfer among the strongly correlated NAFLD Activity Score (NAS) indicators in multi-task learning. To address this issue, we propose a subspace-decoupled multi-task Vision Transformer (ViT) that integrates lightweight task-specific Adapters with orthogonality-based constraints. This design constructs independent feature subspaces for steatosis, ballooning, and inflammation, effectively reducing task interference while retaining shared representations. We further construct a curated multi-task mouse NAFLD histology dataset with expert annotations for all NAS components. Experimental results demonstrate that the proposed method improves multi-task stability and generalization with substantially reduced computational cost compared to training separate single-task models. The code and the curated dataset have been prepared and will be made publicly available upon acceptance to support reproducibility.
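One common way to express an orthogonality-based constraint between task-specific adapters is to penalize the squared Frobenius norm of the cross-Gram matrix between their projection matrices. The abstract does not give the exact loss, so the form below, the function names, and the tiny 3x1 matrices are assumptions for illustration only.

```python
from itertools import combinations


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, with matrices given as lists of rows."""
    cols_a = list(zip(*a))
    cols_b = list(zip(*b))
    return sum(
        sum(x * y for x, y in zip(ca, cb)) ** 2
        for ca in cols_a
        for cb in cols_b
    )


def orthogonality_penalty(adapters):
    """Sum the pairwise penalty over every pair of tasks."""
    return sum(
        cross_gram_sq_norm(adapters[s], adapters[t])
        for s, t in combinations(adapters, 2)
    )


# Hypothetical one-column adapters in a 3-dimensional feature space.
decoupled = {
    "steatosis": [[1.0], [0.0], [0.0]],
    "ballooning": [[0.0], [1.0], [0.0]],
    "inflammation": [[0.0], [0.0], [1.0]],
}
overlapping = dict(decoupled, inflammation=[[1.0], [0.0], [0.0]])

print(orthogonality_penalty(decoupled))    # subspaces do not overlap
print(orthogonality_penalty(overlapping))  # two tasks share a direction
```

The penalty vanishes when the adapters span mutually orthogonal subspaces and grows as two tasks start writing into the same directions, which is the kind of interference the abstract aims to reduce.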

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper utilizes a Vision Transformer for multi-task histological scoring, showing relevance to 'Visual Encoder' (8.0) and moderate relevance to 'Unify Models' (5.0) via task unification. It does not address Tokenizers, World Models, MLLM, Multimodal learning, or Model-Based RL, resulting in low overall relevance to the specified keywords. No matching expert authors were found.

Keywords

Histological Scoring, Multi-task Learning, Vision Transformer, Subspace Decoupling, Negative Transfer, NAFLD, Adapters

116. Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection [FAIL]

Score: 18.0 / 27.8

Authors: Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

Published: 2026-05-28

TL;DR: Loong addresses long document translation context limitations by employing an RL-driven agent with adaptive memory selection, achieving significant quality improvements across multiple language pairs.

Abstract (Chinese translation)

文档级翻译仍然是大语言模型最具挑战性的任务之一，这些模型受限于有限的上下文窗口，阻碍了全局连贯性，同时又因冗余上下文信息的存在而降低了翻译质量。为了解决这一问题，我们提出了一种名为 Loong 的类人长文档翻译智能体，该智能体利用 3E 记忆模块（Essence-Exemplar-Entity）来存储摘要、句子对及实体记录作为历史上下文。与被动关注所有历史不同，Loong 通过深度推理自适应地识别用于翻译指导的最优上下文。Loong 通过强化学习优化其上下文策略，利用源自其自身采样的观察 - 行动推理轨迹的偏好数据。实证评估表明，Loong 在英中、德语和法语方向上实现了显著的翻译质量提升，在三个评估指标上的平均增益高达 13.0 分。此外，Loong 展现出强大的跨领域泛化能力以及对上下文噪声的鲁棒性，同时在超长文档翻译中保持了惊人的稳定性。我们的代码已开源，网址为 https://github.com/YutongWang1216/LoongDocMT。

Abstract

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English ⇔ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at https://github.com/YutongWang1216/LoongDocMT.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	5.0/10	7.5

Scoring rationale: The paper proposes Loong, an agent for long document translation using RL and memory modules. It moderately relates to World Models and model-based RL due to the agent architecture and observe-and-act RL framework. However, it is strictly text-based, rendering Visual Encoder, MultiModal, and MLLM largely irrelevant. Tokenizer and Unify Models are not central contributions. No expert authors from the specified list were found in the authorship.

Keywords

Long Document Translation, Agent Architecture, Reinforcement Learning, Context Selection, Memory Module, Observe-and-Act, Adaptive Policy

117. When Do Graph Foundation Models Transfer? A Data-Centric Theory [FAIL]

Score: 18.0 / 27.8

Authors: Jiajun Zhu, Ying Chen, Peihao Wang, Yixuan He, Pan Li, Aditya Akella, Zhangyang Wang

Published: 2026-05-28

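TL;DR: The paper proposes a data-centric theoretical framework that uses graph limits (graphons) and positional-encoding stability to analyze cross-domain transfer of graph foundation models, showing that structural mismatch is a key factor behind domain discrepancy.

Abstract (Chinese translation)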

图基础模型（GFMs）旨在跨多样化的图领域复用单一骨干网络，然而它们的迁移往往不均衡，甚至可能出现负迁移。虽然大多数先前工作通过架构设计或适配策略来改进迁移，但我们提出一个以数据为中心的问题：两个图领域的哪些属性决定了固定表示模型输出变化的程度？针对稠密图，我们利用基于 graphon 的连续极限，表明对于基于集合和基于消息传递的编码方式，任何 Lipschitz 骨干网络均可将跨领域输出偏移显式分解为：（i）图特定的有限样本近似项，以及（ii）捕捉结构失配的、本征的且与节点标签重排无关的领域差异。关键要素是位置编码（PE）稳定性：我们确立了谱位置编码的稳定性保证，并揭示了基于特征向量与基于子空间的位置编码行为的显著差异。在合成图和真实图上的实验验证了该理论，并将该分解转化为图基础模型迁移中的数据策展指导。

Abstract

Graph foundation models (GFMs) aim to reuse a single backbone across diverse graph domains, yet their transfer is often uneven and can exhibit negative transfer. While most prior work improves transfer through architectural or adaptation choices, we ask a data-centric question: which properties of two graph domains determine how much a fixed representation model changes its outputs? Using a graphon-based continuous limit for dense graphs, we show that for both set-based and message-passing tokenizations, any Lipschitz backbone admits an explicit decomposition of cross-domain output shift into (i) graph-specific finite-sample approximation terms and (ii) an intrinsic, relabeling-invariant domain discrepancy capturing structural mismatch. A key ingredient is positional-encoding (PE) stability: we establish stability guarantees for spectral PEs and highlight contrasting behaviors of eigenvector- versus subspace-based PEs. Experiments on synthetic and real graphs validate the theory and translate the decomposition into guidance for data curation in GFM transfer.
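One well-known difference between eigenvector-based and subspace-based positional encodings is sign ambiguity: if v is an eigenvector then so is -v, whereas the projector v v^T onto its span is unchanged. Whether this is the specific contrast the paper analyzes is an assumption; the sketch only demonstrates the underlying linear-algebra fact on the Laplacian of a single edge.

```python
import math

# Laplacian of a two-node graph joined by one edge.
LAPLACIAN = [[1.0, -1.0], [-1.0, 1.0]]

# Unit eigenvector for eigenvalue 2, and its sign-flipped twin.
v = [1 / math.sqrt(2), -1 / math.sqrt(2)]
flipped = [-x for x in v]


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def projector(x):
    """Rank-one projector x x^T onto the span of a unit vector x."""
    return [[a * b for b in x] for a in x]


# Both vectors satisfy L x = 2 x, so an eigensolver may return either one.
for x in (v, flipped):
    lx = matvec(LAPLACIAN, x)
    print(all(math.isclose(a, 2 * b) for a, b in zip(lx, x)))

# The eigenvector encoding differs between the two choices...
print(v == flipped)
# ...but the subspace (projector) encoding does not.
print(projector(v) == projector(flipped))
```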

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

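Scoring rationale: The paper develops a transfer theory for graph foundation models (GFMs) and explicitly discusses tokenization methods, so Tokenizer scores high. It touches on the foundation model concept and is partly related to Unify Models, but its core is graph theory rather than a unified multimodal architecture. It does not involve visual encoders, world models, MLLMs, multimodality, or reinforcement learning, so those keywords score 0. The author list includes none of the designated experts.

Keywords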

Graph Foundation Models, Transfer Learning, Graphon-based Limit, Positional Encoding Stability, Domain Discrepancy, Message-passing Tokenization, Data-Centric Theory

118. SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search [FAIL]

Score: 18.0 / 27.8

Authors: Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

Published: 2026-05-28

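TL;DR: SAAS is a reinforcement-learning framework for self-awareness that models the search boundary and adjusts rewards to mitigate over-search in LLM agentic search, substantially reducing inference latency without sacrificing accuracy.

Abstract (Chinese translation)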

智能体搜索使大型语言模型（LLMs）能够通过迭代推理和外部搜索解决复杂的多跳问题。尽管有效，这些系统在实践中的关键局限在于：智能体无法识别自身的知识边界，当内部知识已足够时盲目触发搜索，甚至在收集到充分证据后仍未能终止搜索。这种缺乏自我意识的现象导致了严重的过度搜索，产生了巨大的推理延迟和高昂的计算成本。为此，我们提出 SAAS，一种新颖的强化学习（RL）框架，旨在构建动态自我意识，从而在不牺牲准确性的前提下精确调节搜索行为。SAAS 引入了三个关键组件：(i) 搜索边界建模机制，通过对比禁用搜索和启用搜索的轨迹，在演进策略下识别搜索边界；(ii) 边界感知奖励模块，将此边界意识转化为轨迹级惩罚，以抑制不必要的和冗余的搜索；(iii) 分阶段优化策略，利用顺序课程优先侧重推理而非搜索正则化，从而避免奖励黑客行为。大量实验表明，SAAS 显著减少了过度搜索，同时保持了准确性。我们的代码已匿名发布 https://github.com/XMUDeepLIT/SAAS。

Abstract

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search, while maintaining accuracy. Our code is anonymously released at https://github.com/XMUDeepLIT/SAAS.
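A rough way to picture components (i) and (ii) is a two-step rule: estimate how many searches a question needs by contrasting rollouts with and without search, then penalize a trajectory for every search beyond that estimate. The budget rule, the linear penalty, and all names below are assumptions for illustration; the abstract does not specify the actual formulas.

```python
def search_budget(no_search_correct, search_counts_when_correct):
    """Estimate the number of searches a question needs under the current policy.

    no_search_correct: outcomes of search-disabled rollouts (True = correct).
    search_counts_when_correct: search counts of correct search-enabled rollouts.
    Returns 0 if the model can already answer without search, the smallest
    successful search count otherwise, or None if nothing succeeded.
    """
    if any(no_search_correct):
        return 0
    return min(search_counts_when_correct, default=None)


def shaped_reward(correct, num_searches, budget, penalty=0.1):
    """Outcome reward minus a trajectory-level penalty for excess searches."""
    reward = 1.0 if correct else 0.0
    if budget is not None and num_searches > budget:
        reward -= penalty * (num_searches - budget)
    return reward


budget = search_budget([False, True], [2, 3])
print(budget, shaped_reward(True, 3, budget))
```

A staged curriculum, as in component (iii), would presumably switch the penalty on only after the outcome reward alone has been optimized for a while; that schedule is not modeled here.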

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

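Scoring rationale: The paper addresses over-search in LLM agentic search and uses reinforcement learning to optimize search behavior. It is mainly about policy optimization for text agents and does not involve multimodal visual encoders, dedicated tokenizers, or unified multimodal architectures. It uses reinforcement learning, but the emphasis is on reward shaping and search-boundary modeling rather than world models or model-based RL in the strict sense, so relevance to the multimodal and world-model keywords is low.

Keywords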

Agentic Search, Reinforcement Learning, Over-Search Mitigation, Self-Awareness, Search Boundary, Reward Shaping, Curriculum Learning

119. Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning [FAIL]

Score: 18.0 / 27.8

Authors: Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng

Published: 2026-05-28

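TL;DR: The paper proposes Hista and Numca to estimate state values effectively in LLM reinforcement learning, improving training performance without significant computational overhead.

Abstract (Chinese translation)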

强化学习（RL）通过奖励信号直接优化模型行为，从而精炼大语言模型（LLM）。虽然准确的状态值估计对于经典 RL 中的稳定训练至关重要，但在大语言模型的后训练中，这仍是一个未被充分探索的挑战。本文引入了状态值估计基准（SVEB），用于评估现有 RL 框架内的状态估计，并表明像 PPO 这样的标准方法中的批评者（critics）会退化为粗略的群体平均基线。为了解决这一问题，我们提出了两种技术：Numca，利用数值区间作为可量化的里程碑进行状态值估计；以及 Hista，一个使用 LLM 的隐藏状态作为表示，对不相交的轨迹及其回报进行加权平均的框架。广泛的实验表明，这两种方法都能产生更准确的状态值估计，并在不同的 RL 算法和模型规模下提升训练性能，而不会带来显著的计算开销。

Abstract

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.
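The Hista idea of averaging rollout returns with weights derived from hidden-state representations can be sketched as a kernel-style estimator. The cosine similarity, the softmax weighting, the temperature, and the two-dimensional "hidden states" below are assumptions chosen for illustration; the abstract does not say how the weights are computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def estimate_value(query, rollouts, temperature=0.1):
    """Similarity-weighted average of rollout returns.

    query: hidden-state vector of the state being valued.
    rollouts: list of (hidden_state, return) pairs from other rollouts.
    """
    sims = [cosine(query, h) / temperature for h, _ in rollouts]
    top = max(sims)                                  # subtract max for stability
    weights = [math.exp(s - top) for s in sims]
    weighted = sum(w * r for w, (_, r) in zip(weights, rollouts))
    return weighted / sum(weights)


# Hypothetical rollouts: one successful (return 1.0), one failed (return 0.0).
rollouts = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
print(estimate_value([1.0, 0.0], rollouts))  # close to the successful rollout
print(estimate_value([1.0, 1.0], rollouts))  # equally similar to both
```

Unlike a group-average baseline, which would assign the same value to every state in a group, this kind of estimator gives different states different values depending on which past rollouts they resemble.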

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

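Scoring rationale: The paper's core contribution is state value estimation for LLM reinforcement learning (Hista and Numca). This is an RL application with some topical connection to model-based RL, but it does not learn an environment model. It involves no multimodal processing, so MultiModal and Visual Encoder score 0; MLLM and Unify Models receive middling scores because the work involves LLMs and combines methods; Tokenizer is not a core contribution. The weighted total is 18.0, below the dynamic pass threshold of 27.8.

Keywords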

Reinforcement Learning, Large Language Models, State Value Estimation, Hista, Numca, Hidden States, Post-training

120. LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation [FAIL]

Score: 18.0 / 27.8

Authors: Shali Jiang, Hua Zheng, Boyang Liu, Laming Chen, Kenny Lov, Chuanqi Xu, Lisang Ding, Qinghai Zhou, Can Cui, Xiaolong Liu, Xiaoyi Liu, Yasmine Badr, Xin Xu, Jiyan Yang, Ellie Dingqiao Wen, Gerard Jonathan Mugisha Akkerhuis, Chenxiao Guan, Rong Jin, Ruichao Qiu, Xian Chen, Shifu Xu, Zhehui Zhou, Ping Chen, Rui Yang, Haicheng Chen, Xiangge Meng, Song Zhou, Dharak Kharod, Shuyu Xu, Qiang Jin, Qiao Yang, Wankun Zhu, Qin Huang, Yuzhen Huang, Darren Liu, Parish Aggarwal, Hui Zhou, Erzhuo Wang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Huayu Li

Published: 2026-05-28

TL;DR: LoopFM improves recommendation conversion rates by transferring intermediate embeddings from Foundation Models to Vertical Models, doubling the knowledge transfer ratio without real-time FM inference.

Abstract (Chinese translation)

知识蒸馏 (KD) 将大型基础模型 (FM) 的单个标量预测传递给轻量级垂直模型 (VM)，面临迁移比率递减的问题——即垂直模型捕获的基础模型改进比例——因为单个标量无法传达大型基础模型所学到的丰富中间知识。为了解决这一瓶颈，我们提出 LoopFM（从历史表示中学习），该框架通过将基础模型中间嵌入结构化为下游垂直模型的输入特征（例如用户历史序列），从而打开高带宽传输通道，无需在服务阶段进行实时基础模型推理，也无需基础模型与垂直模型之间存在架构耦合。我们为 LoopFM 提供了理论框架，包含增益分解与迁移比率分析。在三个公共基准数据集上，LoopFM 展示了显著的 AUC 提升（例如在 TaobaoAd 上超过 6%），并且与知识蒸馏 (KD) 具有互补的知识转移能力。在工业级系统（数十亿样本、万亿参数基础模型）上，LoopFM 在知识蒸馏 (KD) 基础上将知识迁移比率大约翻倍，在 Y1H1 中实现了 +0.5% 的转化率提升，在 Y1H2 中分别通过两次独立发布实现了 +1.03% 和 +1.22% 的转化率提升。

Abstract

Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffering from diminishing transfer ratio (the fraction of FM improvement captured by the VM), as a single scalar cannot convey the rich intermediate knowledge that larger FMs learn. To address this bottleneck, we propose LoopFM (Learning frOm HistOrical RePresentations of FM), a framework that opens a high-bandwidth transfer channel by structuring FM intermediate embeddings as input features (e.g., user history sequence) for downstream VMs, without requiring real-time FM inference at serving or architectural coupling between FM and VM. We provide a theoretical framework for LoopFM with a gain decomposition and transfer-ratio analysis. On three public benchmarks, LoopFM demonstrates strong AUC improvements (e.g., 6%+ on TaobaoAd) and complementary knowledge transfer capability with KD. On industrial-scale systems (billions of examples, trillion-parameter FMs), LoopFM approximately doubles the knowledge transfer ratio on top of KD, delivering a +0.5% conversion improvement in Y1H1, and a +1.03% and +1.22% conversion improvement from two individual launches respectively in Y1H2.
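The transfer ratio is defined in the abstract as the fraction of the FM's improvement that the VM captures, which is a one-line computation. The gains used below are hypothetical numbers picked only to show what "approximately doubling" the ratio looks like; they are not results from the paper.

```python
def transfer_ratio(vm_gain, fm_gain):
    """Fraction of the FM's improvement that shows up in the VM."""
    if fm_gain == 0:
        raise ValueError("FM gain must be non-zero")
    return vm_gain / fm_gain


# Hypothetical metric gains over a common baseline (same unit for all three).
fm_gain = 1.0            # improvement of the foundation model itself
vm_gain_kd = 0.2         # VM improvement with scalar distillation only
vm_gain_kd_loop = 0.4    # VM improvement with FM embeddings added as features

ratio_kd = transfer_ratio(vm_gain_kd, fm_gain)
ratio_loop = transfer_ratio(vm_gain_kd_loop, fm_gain)
print(ratio_kd, ratio_loop, ratio_loop / ratio_kd)
```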

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on knowledge distillation for recommendation systems using Foundation Model embeddings. It has low relevance to Tokenizer, Visual Encoder, World Models, and model-based RL as these are not discussed. MLLM and MultiModal have moderate relevance due to the use of Foundation Models, but the core task is recommendation distillation, not multi-modal understanding or generation. Unify Models is low as the method decouples rather than unifies architectures.

Keywords

LoopFM, Knowledge Distillation, Foundation Model, Vertical Model, Intermediate Embeddings, Recommendation, Transfer Learning, Historical Representations

121. EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation [FAIL]

Score: 18.0 / 27.8

Authors: Xin Guan, Xiaomeng Hu, Shen Huang, Zhenyi Wang, Bo Zhang, Zijian Li, Pengjun Xie, Bo Liu, Jiuxin Cao

Published: 2026-05-28

TL;DR: EvoRubric addresses the challenge of aligning LLMs for open-ended generation by co-evolving a reasoner and rubric generator under a single policy, achieving superior performance without static criteria.

Abstract (Chinese translation)

强化学习 (RL) 已在可验证领域显著推动了大型语言模型 (LLMs) 的发展，但由于缺乏明确的奖励，将模型对齐用于开放式生成仍然极具挑战性。当前的基于评分标准的 RL 方法通过采用明确标准来缓解这一问题；然而，它们严重依赖静态、人工标注的评分标准，这不可避免地导致策略滞后，或者依赖昂贵的外部专有模型进行动态更新。本文提出 EvoRubric，一种新颖的单策略协同进化 RL 框架，消除了对静态标准和外部评分标准生成器的依赖。通过将响应生成和评分标准生成统一于一个参数化策略之下，EvoRubric 在推理器 (Reasoner) 和评分标准生成器 (Rubric Generator) 之间动态交替。为防止奖励黑客行为并确保生成信号的可靠性，我们引入一个多层验证管道，包含元验证器、零方差剪枝以及留一法同伴共识机制。验证过的标准被动态归档至记忆池，从而产生密集、多目标奖励，以持续协同优化这两个角色。在医学、写作和科学领域的广泛实验表明，EvoRubric 始终优于传统的静态及外部 LLM 驱动的对齐方法。值得注意的是，该框架兼容人类专家先验。当使用专家标注的标准进行初始化时，EvoRubric 可进一步发现新颖的判别性维度，其性能优于仅依赖静态专家标注的方法。

Abstract

Reinforcement Learning (RL) has significantly advanced Large Language Models (LLMs) in verifiable domains, but aligning models for open-ended generation remains profoundly challenging due to the lack of definitive rewards. Current rubric-based RL methods mitigate this by employing explicit criteria; however, they rely heavily on static, human-annotated rubrics that inevitably cause policy lag, or expensive external proprietary models for dynamic updates. In this paper, we propose EvoRubric, a novel single-policy co-evolutionary RL framework that eliminates the reliance on static criteria and on external rubric generators. By unifying response generation and rubric generation under a single parameterized policy, EvoRubric dynamically alternates between a Reasoner and a Rubric Generator. To prevent reward hacking and ensure the reliability of generated signals, we introduce a multi-level verification pipeline featuring a meta-verifier, zero-variance pruning, and a Leave-One-Out peer consensus mechanism. Validated criteria are dynamically archived into a memory pool, yielding dense, multi-objective rewards to continuously co-optimize both roles. Extensive experiments across Medical, Writing, and Science domains demonstrate that EvoRubric consistently outperforms traditional static and external-LLM-driven alignment methods. Notably, our framework is compatible with human-expert priors. When initialized with expert-annotated rubrics, EvoRubric can further uncover novel, discriminative dimensions, achieving better performance than relying solely on static expert annotations.
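
The abstract names zero-variance pruning as one stage of the verification pipeline but does not spell it out. A minimal sketch of one natural reading follows: a generated criterion that gives every sampled response the same score cannot separate good responses from bad ones, so it is dropped. The function name, the tolerance, and the example criteria are hypothetical, and the paper's actual procedure may differ.

```python
from statistics import pvariance


def prune_zero_variance(scores_by_criterion, tol=1e-12):
    """Keep only criteria whose scores vary across the sampled responses.

    `scores_by_criterion` maps a criterion name to the list of scores it
    assigned to each rollout in a group (an assumed data layout).
    """
    return {
        criterion: scores
        for criterion, scores in scores_by_criterion.items()
        if pvariance(scores) > tol
    }


# Hypothetical rubric scores for four rollouts of the same prompt.
scores = {
    "cites supporting evidence": [1.0, 0.0, 1.0, 0.0],
    "is written in English": [1.0, 1.0, 1.0, 1.0],   # satisfied by every rollout
    "states an exact dosage": [0.0, 0.0, 0.0, 0.0],  # satisfied by none
}
kept = prune_zero_variance(scores)
print(sorted(kept))
```

Only the first criterion survives, since it is the only one that contributes a non-constant reward signal across the group.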

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5
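
The table above follows a simple rule: each keyword's score is its weight times its relevance, and the paper's total is the sum over keywords. The sketch below recomputes the total for this entry. Comparing with `>=` against the passing score is an assumption, since the report only states that totals below 27.8 fail.

```python
WEIGHT = 1.5           # every keyword in this report carries the same weight
PASS_THRESHOLD = 27.8  # the "dynamic passing score" quoted in the report

# Relevance values (out of 10) copied from the EvoRubric table.
relevance = {
    "Unify Models": 7.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 2.0,
    "MultiModal": 0.0,
    "model-based RL": 3.0,
}


def weighted_total(relevance_by_keyword, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance_by_keyword.values())


total = weighted_total(relevance)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"  # boundary handling is assumed
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```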

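Scoring rationale: The paper proposes a single-policy co-evolutionary RL framework that unifies response generation and rubric generation, which closely matches 'Unify Models'. It involves reinforcement learning and reward construction, so it is weakly related to 'model-based RL' and 'MLLM', but it does not cover visual encoders, tokenizers, world models, or multimodal content, so the remaining keywords score 0. The weighted total of 18.0 is below the dynamic passing score of 27.8. The author list does not include any of the designated experts.

Keywords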

Reinforcement Learning, Open-Ended Generation, Rubric-Driven, Co-evolutionary, Large Language Models, Single-Policy, Verification Pipeline

122. Prompt-Level Reward Specifications for Open-Ended Post-Training [FAIL]

Score: 18.0 / 27.8

Authors: Zijun Weng, Xiaohui Hu, Shuangyong Song, Yongxiang Li, Kaidong Yu, Xuanjing Huang

Published: 2026-05-28

TL;DR: The paper proposes a prompt-level reward specification framework for open-ended post-training that makes reward criteria explicit through reusable rubrics and constraints, improving response ranking and supporting online reinforcement learning without human preference annotations.

Abstract (Chinese translation)

开放式后训练得益于能够使提示词特定成功条件显式化的奖励，而非仅依赖事后标量分数。在指令遵循、写作及决策支持任务中，响应质量取决于局部需求、整体偏好及显式约束，但现有奖励方法往往使这些标准隐含不清，或仅覆盖可狭义验证的情形。我们提出一种提示词级奖励规范框架，该框架将奖励规范与奖励计算分离开来。仅需输入提示词，我们的框架即可离线构建可重用的任务自适应评分标准（rubrics）和可执行硬约束检查器，从而在训练前使奖励标准显式化，并在多次轨迹（rollouts）中可复用。在评分阶段，基于产物的评分标准分数与代码分数相结合，并辅以独立的全局分数以衡量剩余整体质量，从而生成涵盖需求满足、整体质量及确定性约束的归一化混合奖励。该框架无需人类偏好标注、参考答案或单独训练的奖励模型（RM）。实验表明，所得奖励提升了离线 RM 风格的响应排序效果，并支持在多个开放式基准上进行在线强化学习（RL）。消融实验进一步表明，评分标准、全局评分及可执行验证提供了互补的监督信号。

Abstract

Open-ended post-training benefits from rewards that make prompt-specific success conditions explicit, rather than relying only on post-hoc scalar scores. In instruction following, writing, and decision-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases. We propose a prompt-level reward specification framework that separates reward specification from reward computation. Given only prompts, our framework constructs reusable task-adaptive rubrics and executable hard-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts. At scoring time, artifact-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints. The framework requires no human preference annotations, reference answers, or a separately trained reward model. Experiments show that the resulting reward improves offline RM-style response ranking and supports online reinforcement learning across multiple open-ended benchmarks. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision.
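
To make the idea of a normalized hybrid reward more tangible, here is a minimal sketch that mixes a rubric score, an executable-checker score, and a global score. The abstract does not give the combination rule, so the convex weighting, the function name, the default weights, and the two example checkers are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def hybrid_reward(
    response: str,
    rubric_scores: Sequence[float],            # per-rubric-item scores in [0, 1]
    checkers: Sequence[Callable[[str], bool]], # executable hard-constraint checks
    global_score: float,                       # holistic quality score in [0, 1]
    weights: tuple = (0.5, 0.3, 0.2),          # assumed mixing weights
) -> float:
    """Weighted average of rubric, code, and global scores, kept in [0, 1]."""
    rubric = sum(rubric_scores) / len(rubric_scores) if rubric_scores else 0.0
    passed = sum(1.0 for check in checkers if check(response))
    code = passed / len(checkers) if checkers else 1.0
    w_rubric, w_code, w_global = weights
    mixed = w_rubric * rubric + w_code * code + w_global * global_score
    return mixed / (w_rubric + w_code + w_global)


# Hypothetical checkers that would be built offline from the prompt alone.
checkers = [
    lambda text: len(text.split()) <= 10,          # "at most ten words"
    lambda text: text.rstrip().endswith("."),      # "end with a full stop"
]
reward = hybrid_reward(
    "Use two short sentences. Keep it brief.",
    rubric_scores=[1.0, 0.5],
    checkers=checkers,
    global_score=0.8,
)
print(round(reward, 3))
```

Because the checkers are plain functions of the response text, they can be constructed once per prompt and reused across every rollout, which is the reuse property the abstract emphasizes.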

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

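Scoring rationale: The paper mainly addresses reward specification in open-ended post-training and proposes a prompt-level reward specification framework. It does not cover multimodal architectures (Visual Encoder, MultiModal, MLLM), tokenizer design (Tokenizer), or world models (World Models). Although it mentions reinforcement learning, its core is reward engineering rather than model-based RL, and it does not address model unification (Unify Models). Apart from the RL-related keyword, the remaining keywords therefore have low relevance. The weighted total is 18.0, below the dynamic passing score of 27.8. The author list does not include any of the designated experts.

Keywords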

Prompt-Level Reward Specifications, Open-Ended Post-Training, Reward Specification Framework, Reusable Rubrics, Online Reinforcement Learning, Explicit Constraints, Holistic Quality

123. REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image [FAIL]

Score: 18.0 / 27.8

Authors: Xiaoxuan Ma, Jiashun Wang, Nicolas Ugrinovic, Yehonathan Litman, Kris Kitani

Published: 2026-05-28

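TL;DR: REST3D reconstructs physically stable 3D scenes from a single RGB image by integrating physical scene understanding with physics-constrained optimization, significantly reducing physical errors and improving simulation stability.

Abstract (Chinese translation)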

从单张 RGB 图像重建物理稳定的 3D 场景，使得普通图像能够转换为适用于模拟的数字资产，进而应用于沉浸式交互和内容创作等场景。然而，现有的单图像重建方法在捕捉场景的物理结构方面存在不足。因此，它们往往产生几何上合理但物理上不一致的结果，包括物体漂浮和穿透现象，从而导致物理模拟中的不稳定行为。图像条件场景生成方法虽能提高物理合理性，但往往依赖于强场景先验，导致生成的物体排列虽合理但不准确，无法与输入图像相匹配。我们提出 REST3D，一种单图像重建框架，通过整合物理场景理解与物理约束细化，能够重建物理稳定的 3D 场景。我们首先引入一种基于智能体的物理场景理解技术，该技术从重力支撑视角构建场景树（scene-tree）表示，捕捉物体的物理状态及物体间关系，为重建提供结构先验。利用该结构，我们首先使用 image-to-3D 模型对场景进行初始化，随后进行场景树引导的对齐及物理约束优化，以解决物理违规现象，同时保持与输入图像的视觉一致性。实验表明，我们的方法在合成数据集和真实世界数据集上均能显著减少物理错误并提高模拟稳定性，同时保持优异的重建质量。我们进一步在基于 VR（虚拟现实）的人 - 物交互中展示了重建的场景，证明了其在沉浸式应用中的潜力。

Abstract

Reconstructing physically stable 3D scenes from a single RGB image enables casual images to be converted into simulation-ready digital assets for applications such as immersive interaction and content creation. However, existing single-image reconstruction methods fall short in capturing the physical structure of a scene. As a result, they often produce geometrically plausible but physically inconsistent results, including object floating and penetration, which lead to unstable behavior in physics simulations. Image-conditioned scene generation methods improve physical plausibility but often rely on strong scene priors, yielding plausible yet inaccurate object arrangements that fail to match the input image. We propose REST3D, a single-image reconstruction framework that can reconstruct physically stable 3D scenes by integrating physical scene understanding with physics-constrained refinement. We first introduce an agentic physical scene understanding technique that constructs a scene-tree representation capturing object physical states and inter-object relationships from a gravity-support perspective, providing a structural prior for reconstruction. Leveraging this structure, we initialize the scene using image-to-3D models, followed by scene-tree-guided alignment and physics-constrained optimization to resolve physical violations while preserving visual consistency with the input image. Experiments show that our method significantly reduces physical errors and improves simulation stability on both synthetic and real-world datasets while maintaining strong reconstruction quality. We further demonstrate the reconstructed scenes in VR-based human-object interaction, showing their potential for immersive applications.
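
A scene tree built from gravity-support relations can be pictured as a tree in which each node supports its children. The toy sketch below checks such a tree for floating and penetrating objects along the vertical axis only, then snaps each object onto its supporter. The class, the field names, and the snapping step are illustrative assumptions; the snapping is a crude stand-in for the physics-constrained optimization described in the abstract, not the paper's method.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    name: str
    bottom: float  # lowest point along the gravity axis, in metres
    top: float     # highest point along the gravity axis, in metres
    children: list = field(default_factory=list)  # objects resting on this node


def support_violations(node, tol=1e-3):
    """List (object, kind, gap) for children that float above or sink into `node`."""
    found = []
    for child in node.children:
        gap = child.bottom - node.top
        if gap > tol:
            found.append((child.name, "floating", gap))
        elif gap < -tol:
            found.append((child.name, "penetrating", gap))
        found.extend(support_violations(child, tol))
    return found


def settle(node):
    """Top-down pass that moves each child so it rests exactly on its supporter."""
    for child in node.children:
        shift = node.top - child.bottom
        child.bottom += shift
        child.top += shift
        settle(child)


# Hypothetical scene: a table hovering above the floor, a mug sunk into the table.
mug = SceneNode("mug", bottom=0.78, top=0.88)
table = SceneNode("table", bottom=0.05, top=0.80, children=[mug])
floor = SceneNode("floor", bottom=-0.10, top=0.00, children=[table])

before = support_violations(floor)
settle(floor)
after = support_violations(floor)
print(before, after)
```

Processing parents before children matters here: the mug can only be placed correctly once the table it rests on has reached its final height.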

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	2.0/10	3.0

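Scoring rationale: The paper's core is physically stable 3D reconstruction from a single image, involving physics-constrained optimization and a scene-tree representation. The given keywords (Tokenizer, MLLM, model-based RL) mainly correspond to multimodal large models and reinforcement learning, and match this paper's vision and physics-simulation topic poorly. Visual Encoder and MultiModal are slightly relevant because of the image input and cross-modal conversion; the remaining keywords have very low relevance.

Keywords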

3D Reconstruction, Physically Stable, Single Image, Physics Constraints, Scene Tree, Image-to-3D, Physical Scene Understanding, Simulation Stability

124. MonoPhysics: Estimating Geometry, Appearance, and Physical Parameters from Monocular Videos [FAIL]

Score: 18.0 / 27.8

Authors: Daniel Rho, Jun Myeong Choi, Matthew Thornton, Biswadip Dey, Roni Sengupta

Published: 2026-05-28

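TL;DR: MonoPhysics proposes a framework that uses differentiable MPM simulation and 3D Gaussian Splatting to jointly optimize geometry, appearance, and physical parameters from monocular video, achieving performance comparable to multi-view methods.

Abstract (Chinese translation)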

现有的逆物理方法从多视角视频中恢复物理参数，其中跨视角的几何约束用于确定尺度和三维结构。然而，在单目场景中，此类约束缺失，导致严重的尺度歧义、几何不准确以及外观优化与物理模拟之间的耦合较弱。我们提出 MonoPhysics，这是一个用于单目逆物理估计可变形物体的框架，采用可微 MPM（物质点方法）模拟和 3D Gaussian Splatting，从单个相机视角联合优化几何、外观和物理参数。我们通过三个视觉 - 物理桥梁来解决这些挑战：全局尺度对齐、物理感知的几何细化以及可微位置映射，这些共同实现了仅从单目观测进行准确优化。我们在 Vid2Sim 和我们的新弹性与塑性物体数据集上进行了评估，结果表明 MonoPhysics 在单目场景中优于现有基线方法，并且仅使用单个相机即可达到与多视角基线方法相当的性能。我们的项目页面位于 https://daniel03c1.github.io/MonoPhysics/

Abstract

Existing inverse physics methods recover physical parameters from multi-view videos, where geometric constraints across views resolve scale and 3D structure. In monocular settings, however, such constraints are absent, leading to severe scale ambiguity, inaccurate geometry, and weak coupling between appearance optimization and physical simulation. We propose MonoPhysics, a framework for monocular inverse physics estimation of deformable objects using differentiable MPM simulation and 3D Gaussian Splatting, which jointly optimizes geometry, appearance, and physical parameters from a single camera view. We address these challenges through three visual-physical bridges: global scale alignment, physics-aware geometry refinement, and a differentiable position map, which together enable accurate optimization from monocular observations alone. We evaluate on Vid2Sim and our new dataset of elastic and plastic objects, showing that MonoPhysics outperforms existing baselines in monocular settings and achieves performance comparable to multi-view baselines using only a single camera. Our project page is available at https://daniel03c1.github.io/MonoPhysics/

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

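Scoring rationale: The paper focuses on monocular inverse physics estimation using differentiable MPM and 3D Gaussian Splatting. Tokenizer and MLLM concern language models and are unrelated. Unify Models, World Models, and model-based RL overlap with the paper's physical modeling, but the paper is not about unifying large models or about reinforcement learning, so their relevance is low. Visual Encoder and MultiModal touch on the visual input but are not the architectural focus, so they also score low.

Keywords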

Monocular Videos, Inverse Physics Estimation, Differentiable MPM Simulation, 3D Gaussian Splatting, Geometry Optimization, Appearance Optimization, Physical Parameters

125. Déjà View: Looping Transformers for Multi-View 3D Reconstruction [FAIL]

Score: 18.0 / 27.8

Authors: Alessandro Burzio, Tobias Fischer, Sven Elflein, Qunjie Zhou, Riccardo de Lutio, Jiawei Ren, Jiahui Huang, Shengyu Huang, Marc Pollefeys, Laura Leal-Taixé, Zan Gojcic, Haithem Turki

Published: 2026-05-28

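TL;DR: DéjàView proposes a looped Transformer architecture for multi-view 3D reconstruction that, through explicit iteration, matches the performance of larger feed-forward models while using fewer parameters.

Abstract (Chinese translation)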

近年来，前馈式 3D 重建变换器（3D reconstruction transformers）的规模已扩展至超过十亿参数，这遵循了计算机视觉领域中增加模型容量的更广泛趋势。然而，新兴证据表明，连续的变换器层往往表现为类似操作的重复应用，而多视角重建变换器则在解码器深度上逐步精炼其预测。我们认为，模型深度部分换取了迭代，但这种迭代是以低效的独特参数为代价的；相反，我们主张在架构中使这种迭代显式化。我们的模型 DéjàView 将单个循环变换器块（looped transformer block）循环地应用于每视角特征，以执行 K 个精炼步骤。该模型仅需训练一次，即可将 K 暴露为推理时的计算调节器，在涵盖室内、户外、物体中心及驾驶场景的五项重建基准上，匹配或显著优于规模大得多的前馈基线，同时仅使用其参数的一小部分以及相当或更低的计算量。重要的是，在训练数据和计算量匹配的情况下，相同的循环变换器块形式优于具有独立每步参数的其他相同变体，这表明显式迭代不仅仅是计算效率更高的容量替代方案，而是多视角 3D 重建中更强的归纳偏置。

Abstract

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture. Our model, DéjàView, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
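
The central mechanism, one block with shared weights applied for K refinement steps, can be sketched with a toy residual update. Everything here is an illustrative assumption: the scalar "block", its two parameters, and the feature values stand in for a real transformer block operating on per-view features.

```python
import math


def looped_block(features, weight, bias):
    """One toy residual refinement step; the same parameters are used every time."""
    return [x + 0.5 * math.tanh(weight * x + bias) for x in features]


def refine(features, weight, bias, k):
    """Apply the single shared block `k` times; `k` is chosen at inference time."""
    for _ in range(k):
        features = looped_block(features, weight, bias)
    return features


# Hypothetical per-view features and block parameters.
per_view_features = [0.0, 1.0, -2.0]
weight, bias = -1.0, 0.2

for k in (0, 1, 4):
    print(k, [round(x, 3) for x in refine(per_view_features, weight, bias, k)])
```

The parameter count is fixed at two values no matter how large `k` becomes, whereas a variant with independent per-step parameters would need a separate `(weight, bias)` pair for each step. That contrast is the comparison the abstract draws between looped and unshared formulations.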

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

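Scoring rationale: The paper's core is a looped Transformer architecture for 3D reconstruction, which has little connection to keyword areas such as multimodal large models and reinforcement learning. 'Visual Encoder' has moderate relevance (per-view feature extraction), and 'Unify Models' and 'MultiModal' are slightly related (architectural unification and multi-view input). The remaining keywords (Tokenizer, MLLM, World Models, model-based RL) are unrelated. The weighted total of 18.0 is below the dynamic passing score of 27.8.

Keywords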

Looping Transformers, Multi-View 3D Reconstruction, Iterative Refinement, Per-view Features, Feed-forward Baselines, Compute Efficiency, Recurrent Architecture

126. Cycle Consistency in Video Object-Centric Learning [FAIL]

Score: 18.0 / 27.8

Authors: Rongzhen Zhao, Zhiyuan Li, Ruonan Wei, Juho Kannala, Joni Pajarinen

Published: 2026-05-28

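TL;DR: The paper proposes Implicit Cycle Consistency (ICC), which moves the cycle-consistency constraint from the latent slot space to the reconstruction manifold, resolving feature collapse in video object-centric learning and improving performance.

Abstract (Chinese translation)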

自监督视频对象中心学习（OCL）旨在发现不同的对象并在时间上关联它们，而自监督多目标跟踪（MOT）则专注于关联预定义的对象检测或分割。尽管循环一致性（CC）在 MOT 中已广泛应用，但它不能直接或显式地应用于 OCL 的潜在槽空间。与 MOT 中确定性和理想化的对象表示不同，由于场景分解的非唯一性，OCL 槽本身具有随机性和模糊性。在槽上强制显式循环一致性（ECC）会强加僵硬的均值寻求。这严重惩罚了模型探索替代但同样有效的分解方式，从而导致特征崩溃。为了解决这一困境，我们提出了隐式循环一致性（ICC），它将循环一致性约束从受限的槽空间转移到连续的重构流形上，鼓励槽在共同解释视觉场景时达成软共识，而不是强制进行刚性的点对点特征对齐。在复杂的视频 OCL 基准上的广泛实验表明，ICC 避免了特征崩溃，并优于 ECC 基线。我们的源代码、模型检查点和训练日志已在 https://github.com/Genera1Z/ICC 上提供。

Abstract

Self-supervised video Object-Centric Learning (OCL) aims to discover distinct objects and associate them across time, whereas self-supervised Multi-Object Tracking (MOT) focuses on associating pre-defined object detections or segmentations. Although well-established in MOT, Cycle Consistency (CC) cannot naively or explicitly apply to the latent slot space of OCL. Unlike the deterministic and ideal object representations in MOT, OCL slots are inherently stochastic and ambiguous due to non-unique scene decompositions. Enforcing explicit cycle consistency (ECC) on slots imposes rigid mean seeking. This severely penalizes the model for exploring alternative but equally valid decompositions, thereby driving towards feature collapse. To resolve this dilemma, we propose Implicit Cycle Consistency (ICC), which shifts the cycle-consistency constraint from the restrictive slot space to the continuous reconstruction manifold, encouraging slots to reach a soft consensus on collectively interpreting the visual scene rather than forcing rigid point-to-point feature alignment. Extensive experiments on complex video OCL benchmarks demonstrate that ICC avoids feature collapse and outperforms ECC baselines. Our source code, model checkpoints and training logs are provided on https://github.com/Genera1Z/ICC.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

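Scoring rationale: The paper focuses on cycle consistency in video object-centric learning. It is moderately related to Visual Encoder and World Models (latent spaces and visual encoding) but unrelated to Tokenizer, MLLM, and model-based RL. Unify Models applies only in the sense of methodological unification. No designated expert authors were found.

Keywords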

Object-Centric Learning, Cycle Consistency, Video Learning, Self-supervised, Slot Space, Feature Collapse, Reconstruction Manifold

127. Large Depth Completion Model from Sparse Observations [FAIL]

Score: 18.0 / 27.8

Authors: Zhu Yu, Zhengyi Zhao, Runmin Zhang, Lingteng Qiu, Kejie Qiu, Yisheng He, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen

Published: 2026-05-28

TL;DR: This paper proposes the Large Depth Completion Model (LDCM), a transformer-based framework that achieves state-of-the-art metric depth estimation from sparse observations by leveraging monocular foundation models and regressing per-pixel 3D coordinates.

Abstract (Chinese translation)

本文提出了大型深度完成模型（LDCM），这是一种简单、有效且鲁棒的框架，用于基于稀疏观测的单视图度量深度估计。无需依赖复杂的架构设计，LDCM 利用 Transformer 生成具有度量精度的稠密深度图。该方法在多种数据集及稀疏观测条件下均优于现有方法。我们通过以下两个关键视角实现这一目标：(1) 利用现有的单眼基础模型提升稀疏深度输入的质量；(2) 重构训练目标以更好地捕捉几何结构与度量一致性。具体而言，首先引入一种基于泊松（Poisson）的深度初始化策略，从多样化的稀疏观测中生成均匀的粗粒度稠密深度图，从而为网络提供强有力的结构先验。针对训练目标，我们用点图头（point map head）替代了传统的深度头，该头回归相机空间内的像素级 3D 坐标，使模型能够直接学习底层 3D 场景结构，而非执行像素级的深度图恢复。此外，该设计消除对相机内参的需求，使 LDCM 能够自然生成度量缩放的 3D 点图。大量实验表明，LDCM 在多个基准数据集及不同稀疏度水平下，于深度补全和点图估计任务中持续优于最先进方法，展示了其有效性以及对未见数据分布的强大泛化能力。

Abstract

This work presents the Large Depth Completion Model (LDCM), a simple, effective, and robust framework for single-view metric depth estimation with sparse observations. Without relying on complex architectural designs, LDCM generates metric-accurate dense depth maps using a transformer. It outperforms existing approaches across diverse datasets and sparse observations. We achieve this from two key perspectives: (1) leveraging existing monocular foundation models to improve the quality of sparse depth inputs, and (2) reformulating training objectives to better capture geometric structure and metric consistency. Specifically, a Poisson-based depth initialization strategy is first introduced to generate a uniform coarse dense depth map from diverse sparse observations, providing a strong structural prior for the network. Regarding the training objective, we replace the conventional depth head with a point map head that regresses per-pixel 3D coordinates in camera space, enabling the model to directly learn the underlying 3D scene structure instead of performing pixel-wise depth map restoration. Moreover, this design eliminates the need for camera intrinsic parameters, allowing LDCM to naturally produce metric-scaled 3D point maps. Extensive experiments demonstrate that LDCM consistently outperforms state-of-the-art methods across multiple benchmarks and varying sparsity levels in both depth completion and point map estimation, showcasing its effectiveness and strong generalization to unseen data distributions.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on 3D vision and depth completion, lacking relevance to Tokenizer, World Models, MLLM, and model-based RL. It moderately relates to Visual Encoder and MultiModal (fusion of depth and RGB priors) and slightly to Unify Models. No designated expert authors were found in the author list.

Keywords

Depth Completion, Sparse Observations, Metric Depth Estimation, Transformer, Point Map, Monocular Foundation Models, 3D Scene Structure, Poisson-based Initialization

128. GeoMag: Geometric-Aware Video Motion Magnification via State Space Model [FAIL]

Score: 18.0 / 27.8

Authors: Kecheng Han, Yuchen Zhang, Bingqing Liu, Boqiang Guo, Wenbin Zheng, Shiyuan Pei

Published: 2026-05-28

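TL;DR: GeoMag proposes a geometric-aware video motion magnification framework built on state space models, achieving globally consistent motion magnification with linear complexity, markedly improving visual fidelity and reducing artifacts.

Abstract (Chinese translation)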

视频运动放大 (Video Motion Magnification, VMM) 能够揭示不可察觉的动态，但在复杂的几何变换下往往会出现结构不一致的问题。现有的基于学习的方法通常在卷积神经网络 (CNN) 的全局上下文受限与变换器 (Transformer) 的高计算成本之间面临权衡。此外，当前的训练协议主要由简单的线性运动主导，难以捕捉真实视频中遇到的几何与成像复杂性。为了解决这些问题，我们提出 GeoMag，一种基于状态空间模型 (State Space Models) 的几何感知 VMM 框架，旨在实现具有线性复杂度的全局一致运动放大。此外，我们进一步构建了 Geo-200K，一个大型合成数据集，该数据集引入了丰富的几何变换以及传感器真实的退化，从而提高了训练信号的多样性和真实性。在合成数据集及真实世界基准上的大量实验表明，GeoMag 在视觉保真度和计算效率方面始终优于先前方法，同时产生的伪影更少，结构一致性更好。

Abstract

Video Motion Magnification (VMM) reveals imperceptible dynamics but often suffers from structural inconsistencies under complex geometric transformations. Existing learning-based methods generally face a trade-off between the limited global context of CNNs and the high computational cost of Transformers. In addition, current training protocols, largely dominated by simple linear motion, fail to capture the geometric and imaging complexities encountered in real-world videos. To address these issues, we propose GeoMag, a geometric-aware VMM framework built upon State Space Models to achieve globally consistent motion amplification with linear complexity. We further construct Geo-200K, a large-scale synthetic dataset that introduces rich geometric transformations together with sensor-realistic degradations, improving the diversity and realism of training signals. Extensive experiments on synthetic and real-world benchmarks show that GeoMag consistently outperforms prior methods in visual fidelity and computational efficiency, while producing fewer artifacts and better structural consistency.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

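Scoring rationale: The paper's core task is video motion magnification (VMM), using state space models (SSMs) to address structural inconsistency under geometric transformations. The provided keywords center on multimodal large models, world models, and reinforcement learning, which diverge markedly from the paper's single-modality video processing topic. 'Visual Encoder' is slightly relevant because the method processes video input, and 'Unify Models' is slightly relevant because SSMs combine certain properties of CNNs and Transformers; the remaining keywords, such as World Models, MLLM, and model-based RL, are unrelated to the paper. The weighted total is approximately 18.0, below the dynamic passing score of 27.8.

Keywords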

Video Motion Magnification, State Space Model, Geometric-Aware, Structural Consistency, Linear Complexity, Visual Fidelity, Synthetic Dataset

129. Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies [FAIL]

Score: 16.5 / 27.8

Authors: Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu

Published: 2026-05-28

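TL;DR: The paper proposes Anchored Weight Decay (AWD) to mitigate performance drift when fine-tuning LLMs with evolution strategies, showing that forgetting is avoidable without hurting target-task performance.

Abstract (Chinese translation)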

进化策略（ES）近期已成为大语言模型（LLM）微调领域中强化学习（RL）的一种具有竞争力的替代方案，凭借简单性、可扩展性以及仅依赖推理的训练方式展现出优势。然而，近期研究表明，在新任务上应用 ES 微调可能导致先前任务的遗忘。首先，本文表明先前任务遗忘（1）更准确地应被描述为性能漂移，而非不可逆遗忘，因为在 ES 训练过程中先前任务性能往往能够恢复；（2）这并非 ES 特有的失效模式，在使用强化学习（RL）方法进行微调时同样可能出现。其次，本文分析了此类漂移产生的时机与原因，指出其依赖于 ES 的训练动态，特别是权重空间中弱约束方向上的随机游走行为。第三，基于上述洞察，本文引入了锚定权重衰减（AWD）作为一种参数空间正则化技术，旨在将优化过程约束向初始模型参数。AWD 能有效稳定先前任务性能，同时保持目标任务性能，以远低于大型 ES 种群规模的计算成本，实现了相当的性能收益。因此，与以往观点相反，本文表明在 ES 框架下先前任务遗忘在很大程度上是可以避免的，从而将 ES 确立为大语言模型持续学习中一种有前景的方法。

Abstract

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.
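
The interaction between ES noise and an anchor term can be sketched in a few lines. The update below pulls parameters back toward their initial values by adding a decay term proportional to the distance from the anchor. The exact form of AWD is not given in the abstract, so this update rule, the antithetic ES estimator, and all hyperparameters are assumptions chosen for illustration.

```python
import random


def es_gradient(objective, theta, sigma=0.1, population=16, rng=None):
    """Antithetic evolution-strategies estimate of the gradient of `objective`."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(theta)
    for _ in range(population):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = objective([t + sigma * e for t, e in zip(theta, eps)])
        minus = objective([t - sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e / (2.0 * sigma * population)
    return grad


def awd_step(theta, anchor, grad, lr=0.05, decay=0.1):
    """Gradient ascent step plus a pull toward the anchor (assumed AWD form)."""
    return [t + lr * g - lr * decay * (t - a) for t, g, a in zip(theta, grad, anchor)]


# Toy target task that depends only on theta[0]; theta[1] is an unconstrained direction.
def target_task(theta):
    return -(theta[0] - 1.0) ** 2


anchor = [0.0, 0.0]  # stands in for the initial model parameters
theta = list(anchor)
rng = random.Random(0)
for _ in range(200):
    theta = awd_step(theta, anchor, es_gradient(target_task, theta, rng=rng))
print([round(t, 3) for t in theta])
```

In this toy setup the objective ignores `theta[1]`, yet the finite-population ES estimate still assigns it a noisy gradient, so without the anchor term that coordinate would wander. This mirrors the random-walk behavior in weakly constrained directions that the abstract identifies as the source of drift.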

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

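Scoring rationale: The paper focuses on applying evolution strategies to LLM fine-tuning and on mitigating forgetting. It does not cover multimodal architectures, visual encoders, world models, or tokenizer design. Although it compares against reinforcement learning, model-based RL is not its core, so relevance to the given keyword set is low across the board.

Keywords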

Evolution Strategies, LLM Fine-Tuning, Performance Drift, Anchored Weight Decay, Continual Learning, Reinforcement Learning, Regularization

130. REPOT: Recoverable Program-of-Thought via Checkpoint Repair [FAIL]

Score: 16.5 / 27.8

Authors: Parsa Mazaheri

Published: 2026-05-28

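TL;DR: The paper proposes RePoT, which recovers from invalid actions in Program-of-Thought planning through deterministic verified replay and checkpoint repair, significantly improving success rates on the PuzzleZoo and Blocksworld benchmarks.

Abstract (Chinese translation)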

单样本思维之程序 (PoT) 会生成一个打印原始动作计划的 Python 程序；单个无效动作会无声地使整个轨迹失效。我们引入 RePoT (可恢复 PoT)：这是一种确定性验证重放机制，它在环境中遍历计划直至遇到第一个无效转换点，随后进行一次 LLM 调用，从已验证前缀处继续执行。在 PoT 失败的约 14% 的问题上，RePoT 最多仅需增加一次 LLM 调用。在 PuzzleZoo-775 基准上，针对四种闭源模型配置，RePoT 比 PoT 高出 +3 至 +11 个百分点，并在 gpt-5.4-mini-medium 模型上达到 96.9% 的峰值（对比 PoT 的 86.3%）；在与预算匹配的 PoT 重试基线相比，RePoT 在 Gemini 上显著获胜 (+3.8 个百分点，95% 置信区间 [+2.2, +5.4])，在 GPT-medium 和 Claude 上处于采样噪声范围内，而在 GPT-mini 上表现略逊——这是一种能力扩展模式，我们通过自适应 RePoT 开始应对它，这是一个基于规则的调度器，根据已验证前缀长度在“后缀修复”和“新的 PoT 重试”之间进行路由（初步结果）。我们在 PlanBench Blocksworld 基准上复现了该结果 (+1.1 至 +11.4 个百分点)，并在四种开源权重模型上进行了验证（其中三种模型提升了 +3.3 至 +20.0 个百分点）。在 Derail-550（我们的受控恢复基准）上，所有访问检查点信息的条件在 GPT-medium 上的解决率 >=30%，在 Gemini 上 >=70%，相比之下仅错误反馈条件下的解决率 <=3.1%——这表明检查点信息（而非特定的已验证前缀尾部）才是关键的恢复信号。

Abstract

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
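
The replay-then-repair loop described above can be sketched as follows. The toy corridor environment, the function names, and the `repair` callback (which stands in for the single extra LLM call) are hypothetical; only the control flow follows the abstract: replay the plan to its first invalid transition, then resume from the verified prefix.

```python
def verified_replay(state, plan, step):
    """Walk `plan` from `state` and stop at the first invalid transition.

    Returns (verified_prefix, checkpoint_state, failing_index or None).
    """
    prefix = []
    for index, action in enumerate(plan):
        next_state = step(state, action)
        if next_state is None:  # the environment rejected this action
            return prefix, state, index
        prefix.append(action)
        state = next_state
    return prefix, state, None


def repot(initial, plan, step, is_goal, repair):
    """Return a verified plan, calling `repair` at most once."""
    prefix, checkpoint, failed_at = verified_replay(initial, plan, step)
    if failed_at is None and is_goal(checkpoint):
        return plan
    suffix = repair(checkpoint, prefix)  # stand-in for one LLM call
    candidate = prefix + suffix
    _, final, bad = verified_replay(initial, candidate, step)
    return candidate if bad is None and is_goal(final) else None


# Toy environment: positions 0..3 on a corridor; moving off either end is invalid.
def step(position, action):
    target = position + (1 if action == "R" else -1)
    return target if 0 <= target <= 3 else None


broken_plan = ["R", "L", "L", "R", "R", "R"]  # the third action walks off the left end
fixed = repot(
    initial=0,
    plan=broken_plan,
    step=step,
    is_goal=lambda position: position == 3,
    repair=lambda checkpoint, prefix: ["R"] * (3 - checkpoint),
)
print(fixed)
```

The repaired plan keeps the verified prefix and replaces everything from the first invalid action onward, so the repair step only has to reason from the checkpoint state rather than from scratch.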

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

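Scoring rationale: The paper's core is the error-recovery mechanism of Recoverable Program-of-Thought (RePoT), which uses checkpoint repair and deterministic replay to handle invalid actions in planning. It does not address model unification, tokenizers, visual encoders, world-model construction, multimodal LLM architectures, or the core of model-based RL; its only weak link to model-based methods is verifying plans against the environment. The author list does not include any of the designated experts.

Keywords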

Recoverable Program-of-Thought, Checkpoint Repair, LLM Planning, Error Recovery, Verified Replay, Python Program Generation, PuzzleZoo Benchmark

131. Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection [FAIL]

Score: 16.5 / 27.8

Authors: Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

Published: 2026-05-28

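TL;DR: The paper proposes an empowerment-guided multi-agent system that uses semantic communication to prevent semantic drift in automated scientific computing workflows, significantly improving the convergence and robustness of method selection.

Abstract (Chinese translation)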

自动化科学计算工作流不仅需要生成可执行代码：自主系统还必须选择合适的计算策略，忠实地实施它们，并确保所得结果在因果上可归因于产生这些决策的行为。在多智能体流水线中，这一过程尤为脆弱，因为智能体意图与行动之间的微小不一致可能导致语义漂移，即最终执行的程序不再反映最初选定的策略，从而破坏下游的评估与适应。本研究受 ATHENA 框架（Toscano 等，2025; Toscano 等，2026）及赋能 (empowerment) 概念（Yiu 等，2025）的启发，提出了一种多智能体框架，该框架结合了上下文多臂老虎机 (contextual bandits) 与结构化智能体间通信，最重要的是，引入了语义检查点，以在整个流水线中保持行动 - 结果保真度。该系统在自适应决策架构中集成了专用大型语言模型 (LLM) 智能体、接地代码生成以及自愈执行循环。从赋能 (empowerment) 的视角审视该框架，我们发现可靠的自主学习不仅需要识别高质量行动，还需保持这些行动在智能体间传播的完整性。以敏感性分析和不确定性量化工作流作为代表性案例研究，我们表明，若不加控制的语义漂移会损害策略学习，而所提出的框架则能提升收敛性、鲁棒性以及对新问题情境的适应能力。这些结果暗示了科学多智能体系统的一个更广泛设计原则：自适应决策必须与显式机制相结合，以确保计算流水线中语义一致性和可靠信息流的保障。

Abstract

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.
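
One way to picture the coupling of a contextual bandit with a semantic checkpoint is sketched below: the bandit only credits a reward to an action when the method that was actually executed matches the method it selected. The class, the epsilon-greedy policy, the context label, and the method names are illustrative assumptions rather than the paper's design.

```python
import random
from collections import defaultdict


class ContextualBandit:
    """Epsilon-greedy bandit keeping a running mean reward per (context, arm)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore
        return max(self.arms, key=lambda arm: self.means[(context, arm)])  # exploit

    def update(self, context, selected, executed, reward):
        """Semantic checkpoint: ignore rewards produced by a different method."""
        if executed != selected:
            return False  # drift detected, so the reward is not attributable
        key = (context, selected)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]
        return True


# Hypothetical sensitivity-analysis methods and problem context.
bandit = ContextualBandit(arms=["sobol", "morris"], epsilon=0.0)
accepted = bandit.update("smooth-model", selected="sobol", executed="sobol", reward=1.0)
rejected = bandit.update("smooth-model", selected="morris", executed="sobol", reward=5.0)
print(accepted, rejected, bandit.select("smooth-model"))
```

Without the check in `update`, the large reward earned by running "sobol" would have been credited to "morris", which is the kind of corrupted policy learning that the abstract attributes to unchecked semantic drift.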

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

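Scoring rationale: The paper focuses on multi-agent systems and semantic communication and has no direct connection to tokenizers, visual encoders, or multimodal architectures. Although it involves LLMs and empowerment (which is conceptually linked to model-based RL), it does not reflect the core architecture of Unify Models or World Models, so relevance is low across the board.

Keywords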

Multi-Agent System, Semantic Communication, Empowerment, Adaptive Method Selection, Scientific Computing, Contextual Bandits, LLM Agents

132. RAISE: RAG Design as an Architecture Search Problem [FAIL]

Score: 16.5 / 27.8

Authors: Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

Published: 2026-05-28

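TL;DR: The paper proposes the RAISE framework, which treats RAG design as an architecture search problem for hyperparameter optimization, and finds that optimization performance is highly task-dependent.

Abstract (Chinese translation)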

检索增强生成（RAG）系统涉及众多设计选择，涵盖查询重写（query rewriting）、分块（chunking）、检索深度（retrieval depth）、重排序（reranking）以及上下文压缩（context compression）。在实际应用中，这些选择通常通过启发式方法（heuristics）进行配置，从而阻碍了不同场景下的系统性评估（systematic evaluation）与可复现性（reproducibility）。我们认为，这一挑战最好被建模为 RAG 架构搜索（RAG architecture search）。为了支持对该问题的受控且可复现的研究，我们引入了 RAG 智能搜索引擎（RAISE），这是一个用于 RAG 超参数优化（hyperparameter optimization）的全面框架与基准，旨在在标准化的搜索空间（search spaces）和预算（budgets）下评估 RAG 流水线（RAG pipelines）的优化方法。RAISE 实现了 13 种搜索算法（search algorithms），并在七个公开的文本及多模态数据集（datasets）上使用三个随机种子（random seeds）对其进行了评估。我们的实验表明，优化性能高度依赖于任务（task-dependent）：在一个数据集上表现优异的方法可能无法在其他数据集上一致地泛化（generalize），这警示我们避免将综合排名（aggregate rankings）解读为普遍优越策略（universally superior strategies）的证据。RAISE 为公平、可复现且系统性的 RAG 超参数优化研究提供了一个通用的实验平台（experimental substrate）。

Abstract

Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.
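
Treating RAG configuration as a search problem can be illustrated with the simplest baseline, random search under a fixed evaluation budget and several seeds. The search space below uses the design dimensions listed in the abstract, but the concrete option values and the `toy_evaluate` scoring function are made-up placeholders, not RAISE's search space or benchmark.

```python
import random

# Hypothetical search space over the design choices named in the abstract.
SEARCH_SPACE = {
    "query_rewriting": [False, True],
    "chunk_size": [128, 256, 512],
    "retrieval_depth": [3, 5, 10, 20],
    "reranking": [False, True],
    "context_compression": [False, True],
}


def random_search(evaluate, space, budget, seed):
    """Sample `budget` configurations uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = {name: rng.choice(options) for name, options in space.items()}
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_evaluate(config):
    """Placeholder for running a RAG pipeline on a dataset and scoring its answers."""
    score = -abs(config["chunk_size"] - 256) / 256
    score -= abs(config["retrieval_depth"] - 5) / 5
    return score + (0.1 if config["reranking"] else 0.0)


# Same budget, three seeds, mirroring the evaluation protocol in the abstract.
results = [random_search(toy_evaluate, SEARCH_SPACE, budget=20, seed=s) for s in (0, 1, 2)]
for config, score in results:
    print(round(score, 3), config)
```

Fixing the search space, the budget, and the seeds is what lets different search algorithms be compared on equal terms; swapping `random_search` for another optimizer with the same signature would leave the rest of the harness unchanged.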

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

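Scoring rationale: The paper centers on architecture search and hyperparameter optimization for RAG systems, and overlaps little with the multimodal LLM and reinforcement learning keywords. MultiModal scores 4 only because the abstract mentions multimodal datasets, and MLLM scores 3 because the work involves generation tasks. The remaining keywords (Tokenizer, Visual Encoder, World Models, model-based RL, and Unify Models) are not discussed as core content, so they score low.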

Keywords

Retrieval-augmented generation, Architecture search, Hyperparameter optimization, RAG pipeline, Multimodal datasets, Benchmark framework, Search algorithms

133. Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas [FAIL]

Score: 16.5 / 27.8

Authors: Víctor Gallego

Published: 2026-05-28

TL;DR: This paper proposes an autoresearch framework where an AI agent autonomously optimizes LLM policy synthesis pipelines for multi-agent social dilemmas, outperforming hand-designed baselines and discovering objective-dependent fairness mechanisms.

摘要翻译

我们研究用于合作的两级自研究：一个外循环 AI 代理自主重新设计用于多智能体序贯社会困境（SSDs）的大语言模型（LLM）策略合成系统的内循环管道。一个研究者代理 $\mathcal{R}$（以编码代理形式运行）读取内循环源代码，编辑系统提示词、反馈函数、辅助库及迭代逻辑，运行评估，并决定保留哪些内容，遵循自研究范式。在两个游戏（清理（Cleanup）和收集（Gathering））、两个策略合成大语言模型（Policy-synthesizer LLMs）以及两个福利目标（功利主义效率和罗尔斯主义最大最小原则（maximin））下，该研究者可靠地超越人工设计基线，显著降低运行间方差，并优于仅提示词优化。发现的管道具有目标依赖性：仅在最大最小原则（maximin）下，研究者才会向合成器管道注入显式公平机制，这类机制缺失于其自身目标无关的系统提示词以及每一个效率优化的管道中。这支持一种信息设计解读，即研究者根据福利目标选择向有界理性的合成器揭示哪些信息。代码见 https://github.com/vicgalle/autoresearch-social-dilemmas.

Abstract

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on autoresearch for LLM-based policy synthesis in multi-agent RL, showing moderate relevance to MLLM and model-based RL (3.0) due to LLM usage in environmental tasks. However, it does not address tokenizer design, visual encoder architecture, world model learning, or model unification (0.0-1.0). The author list does not include the specified experts. The total weighted score (16.5) falls below the dynamic passing threshold (27.8), indicating limited alignment with the provided keyword set.
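
The total in the score table can be reproduced with a few lines. This is a minimal sketch, assuming each keyword's score is weight × relevance and that a paper passes when its total reaches the threshold. The report does not say how the dynamic threshold of 27.8 is derived, so it is treated here as a given constant, and `weighted_total` is a hypothetical helper name.

```python
# Relevance values (0-10) copied from the score table for paper 133.
relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 1.0,
    "MLLM": 3.0,
    "MultiModal": 3.0,
    "model-based RL": 3.0,
}

WEIGHT = 1.5            # every keyword carries the same weight in this report
PASS_THRESHOLD = 27.8   # taken from the report as-is; its derivation is not given

def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance.values())

total = weighted_total(relevance)
# Assumption: "pass" means total >= threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```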

Keywords

Autoresearch, Sequential Social Dilemmas, LLM Policy Synthesis, Multi-agent Cooperation, Pipeline Optimization, Rawlsian Maximin, Cleanup and Gathering

134. Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models [FAIL]

Score: 16.5 / 27.8

Authors: Rohan Shravan

Published: 2026-05-28

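TL;DR: This paper proposes Kronecker Embeddings, a byte-level factorization that replaces the large embedding table in language models, sharply reducing trainable parameters while preserving model performance and spelling robustness.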

摘要翻译

大语言模型 (LLMs) 将每个输入经由一个形状为 |V| x d_model 的学习嵌入表进行处理，在前沿规模下消耗数亿至数十亿的可训练参数。我们引入克罗内克嵌入 (Kronecker Embeddings)，这是一种确定性的字节级字符 - 位置分解方法，它用一个固定编码器和一个单一的学习投影替换该表，兼容标准 BPE 分词器，在前沿规模下消除了 91%--94% 的输入侧可训练参数。本文提出五点主要贡献。首先，针对六个 LMs (1.35 亿至 6710 亿参数) 的跨模型探测显示，训练后的输入嵌入将探测词的字形变体聚类程度远高于其形态相关词；而克罗内克嵌入在嵌入层避免了这种聚类现象。其次，在 FineWeb-Edu 数据集的 25 亿词元上，对 nanoGPT GPT-2 124M 进行的受控三种子比较显示，克罗内克嵌入的验证损失比 BPE 绑定基线低 2.5% ± 0.2% (差距 0.083 ± 0.007 纳特，困惑度降低约 9%)，且达到 BPE 收敛损失所需的步数约为 BPE 的 1/1.43。第三，在 110 个干净/拼写错误对上的拼写鲁棒性探测显示，克罗内克嵌入在 55.5% 的对上保持了 Top-1 预测，而 BPE 为 47.3% (高出 8.2 个百分点)，且 KL 散度降低 7.6%，在 11 个类别中获胜或打平 10 个；生成探测显示，克罗内克嵌入在生成过程中保留了字节新颖字符串和拼写错误，而 BPE 则会遗忘它们。第四，BPE 嵌入范数在训练过程中发生漂移，而克罗内克投影范数保持在 1.0 附近，这与稳定的表征目标一致。第五，一种运行时变体在词汇表大小为 131,072 时，从 4.5 MB 字节缓冲区重构嵌入，而非使用 2.15 GB 的表，步时间开销仅为 0.01%--0.24%。字节级局部性存在权衡：字节相似但语义距离较远的词对 (如 compute/commute, nation/notion) 会聚类在一起，从而将消歧任务转移到早期注意力层。

Abstract

Large language models route every input through a learned embedding table of shape |V| × d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte-level character-position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91-94% of input-side trainable parameters at frontier scale. We provide five contributions. First, a cross-model probe across six LMs (135M-671B parameters) shows trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three-seed comparison on nanoGPT GPT-2 124M over 2.5B tokens of FineWeb-Edu shows Kronecker reaching 2.5 ± 0.2% lower validation loss than the BPE-tied baseline (gap 0.083 ± 0.007 nats, ~9% lower perplexity), needing ~1.43× fewer steps to reach BPE's converged loss. Third, a spelling-robustness probe over 110 clean/typo pairs shows Kronecker preserves the top-1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte-novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on-the-fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01-0.24% step-time overhead. Byte-level locality has a tradeoff: byte-similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.
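
Two of the figures quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below assumes a model width of 4096 and 4-byte (fp32) weights; neither is stated in the abstract, and they are chosen only because they reproduce the quoted 2.15 GB table size. The loss gap of 0.083 nats is taken from the abstract.

```python
import math

VOCAB = 131_072          # vocabulary size quoted in the abstract
D_MODEL = 4096           # assumption: not stated in the abstract
BYTES_PER_WEIGHT = 4     # assumption: fp32 storage

# Size of a dense |V| x d_model embedding table.
table_bytes = VOCAB * D_MODEL * BYTES_PER_WEIGHT
table_gb = table_bytes / 1e9          # decimal gigabytes

# A cross-entropy gap measured in nats maps to a perplexity ratio via exp(gap).
loss_gap_nats = 0.083
ppl_ratio = math.exp(loss_gap_nats)   # baseline perplexity / Kronecker perplexity

print(f"embedding table: {table_gb:.2f} GB")
print(f"perplexity ratio: {ppl_ratio:.3f}")
```

Under these assumptions the table comes to 2^31 bytes, and the perplexity ratio lands slightly below 1.09, which is in the neighborhood of the roughly 9% figure quoted in the abstract.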

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

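Scoring rationale: The paper's core is a parameter-efficient embedding technique for language models (Kronecker Embeddings). It is highly relevant to Tokenizer (BPE compatibility and byte-level token representations) and weakly relevant to Unify Models (a unified embedding structure). It does not address visual encoding, world models, multimodal LLMs, or reinforcement learning, so Visual Encoder, World Models, MLLM, MultiModal, and model-based RL score 0. The author list includes none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus applies. The weighted total is 16.5, below the dynamic passing score of 27.8.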

Keywords

Kronecker Embeddings, Byte-Level, Parameter-Efficient, Language Models, BPE Tokenizers, Embedding Table, Character-Position Factorization

135. GrepSeek: Training Search Agents for Direct Corpus Interaction [FAIL]

Score: 16.5 / 27.8

Authors: Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

Published: 2026-05-28

TL;DR: GrepSeek trains a compact search agent using reinforcement learning to directly interact with text corpora via shell commands, achieving superior performance on open-domain question answering benchmarks.

摘要翻译

大语言模型（LLM）搜索代理在通过多轮推理和信息检索处理知识密集型语言任务方面展现出强大的潜力。大多数现有系统使用检索器访问信息，该检索器接收关键词或自然语言查询，并利用预计算文档表示的索引返回文档排名列表。在本文中，我们探索了一种互补视角，即搜索代理将语料库本身视为搜索环境，并通过发出可执行的 shell 命令来寻找证据。我们介绍了 GrepSeek，这是一种优化的直接语料库交互（DCI）搜索代理，旨在训练一个紧凑的搜索代理，以从大型文本语料库中查找、过滤和组合证据。为了解决在大型语料库上直接使用强化学习导致的学习行为不稳定性问题，我们提出了一种两阶段训练流程。首先，我们利用一个答案感知的 Tutor（导师）和一个答案盲视的 Planner（规划器）构建冷启动数据集，以生成经过验证且因果基础的搜索轨迹。其次，我们利用组相对策略优化（GRPO）对初始化策略进行微调，使代理能够通过与语料库的直接交互来改进其面向任务的搜索行为。为了使 DCI 在实际规模下具备实用性，我们进一步采用了一种语义保持的分片并行执行引擎，该引擎可将基于 shell 的检索加速高达 7.6 倍，同时保持与 shell 命令顺序执行的字节级精确等价性。在七个开放域问答基准上的实验表明，GrepSeek 实现了最强的整体词元级 $F_1$ 分数和精确匹配（Exact Match）。我们的分析还突显了纯词汇交互在处理具有显著表面形式变异的查询时的局限性，表明 DCI 是一种实用且具有竞争力的搜索代理方法，能够在现实世界中补充现有的检索范式。

Abstract

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
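
The idea of sharded-parallel execution that stays byte-exact with a sequential run can be illustrated for the simplest case, a line filter. This is a sketch under stated assumptions, not the GrepSeek engine: a pure-Python substring match stands in for `grep`, and `grep_lines` and `sharded_grep` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def grep_lines(lines, pattern):
    """Sequential reference: keep lines containing `pattern`, in order."""
    return [line for line in lines if pattern in line]

def sharded_grep(lines, pattern, num_shards=4):
    """Search contiguous shards in parallel, then concatenate in shard order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # map() yields results in input order, so concatenating them
        # reproduces the output of the sequential scan exactly.
        results = pool.map(lambda shard: grep_lines(shard, pattern), shards)
        return [line for part in results for line in part]

corpus = [f"doc {i}: {'paris' if i % 3 == 0 else 'rome'}" for i in range(20)]
assert sharded_grep(corpus, "paris") == grep_lines(corpus, "paris")
```

Equivalence is easy here because a line filter treats every line independently. Commands that carry state across lines, such as sorting or counting, would need a dedicated merge step to preserve the same guarantee.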

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

Scoring rationale: The paper focuses on text-only search agents using shell commands and reinforcement learning (GRPO). It lacks visual components, resulting in 0 for Visual Encoder and MultiModal. It is not an MLLM (1.0). Although it uses RL, GRPO is typically model-free, so model-based RL is only moderately relevant (4.0). World Models and Unify Models have weak connections to the core contribution (2.0 each). Tokenizer is tangential (2.0). No expert authors from the specified list were found. The weighted total is 16.5, below the dynamic passing score of 27.8, indicating low relevance to the specified keyword cluster.

Keywords

Search Agents, Direct Corpus Interaction, Reinforcement Learning, Shell Commands, Question Answering, LLM Agents, Policy Optimization, Text Corpora

136. How's it going? Reinforcement learning in language models recruits a functional welfare axis [FAIL]

Score: 16.5 / 27.8

Authors: Andy Q Han, David J. Chalmers, Pavel Izmailov

Published: 2026-05-28

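TL;DR: This study finds that reinforcement learning in language models recruits a pre-existing functional welfare axis; reward and punishment vectors significantly affect model behavior and self-reports, indicating that post-training recruits representations rather than creating them.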

摘要翻译

强化学习（RL）如何塑造语言模型的内部表征？我们提供证据表明，RL 招募了一种预存的“功能福利”（Functional Welfare）表征：即相对于其目标，系统表现优劣的估计。我们在一种新颖的、语义中性的迷宫环境中训练了多个语言模型。随后，我们提取了奖励和惩罚轨迹的概念向量，并在与迷宫环境无关的设置中评估这些向量。惩罚向量表现得如同一种负福利表征：它促进失败和不可能性标记，与负面情绪概念对齐，负向追踪目标达成，且使用它进行引导会诱导负面自我报告、病态回溯、拒绝和不确定性。正奖励向量表现得如同镜像，且两者几乎反平行。当控制格子 - 奖励映射、规模、指令微调、RL 训练算法、模型家族以及 LoRA 与全微调时，这些效应是稳健的；当我们用监督微调（SFT）替换 RL 时，这些效应大体上依然存在。重要的是，这些向量在模型经历迷宫训练之前就已经有效。结合观察发现这些效应也出现在仅预训练模型中，因此我们认为这种功能福利轴在训练后阶段之前已预存：它是被训练后阶段招募的，而非创造的。尽管我们对任何福利体验不做断言，但该轴提供了一种证明，即最小奖励信号可以通过招募预存的福利类似表征广泛影响模型行为，这对可解释性、训练后动力学和对齐具有启示意义。

Abstract

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
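
The terms "concept vector", "nearly antiparallel", and "steering" in the abstract can be made concrete with a toy example. This is a minimal sketch, assuming a difference-of-means recipe for extracting concept vectors; the abstract does not state the extraction method, and the activations below are made-up numbers chosen only for illustration.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def concept_vector(activations, baseline):
    """Class mean minus neutral-baseline mean (an assumed recipe)."""
    mu, base = mean_vector(activations), mean_vector(baseline)
    return [m - b for m, b in zip(mu, base)]

def cosine(a, b):
    """Cosine similarity; a value near -1 means nearly antiparallel."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def steer(hidden, direction, alpha):
    """Activation steering: shift a hidden state along a direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy activations (invented, not data from the paper).
baseline = [[0.0, 0.0, 0.0], [0.2, -0.2, 0.0]]
rewarded = [[1.1, 0.9, 0.1], [1.1, 0.9, -0.1]]
punished = [[-0.9, -1.0, 0.2], [-0.9, -1.2, 0.0]]

reward_vec = concept_vector(rewarded, baseline)
punish_vec = concept_vector(punished, baseline)
print("cosine:", cosine(reward_vec, punish_vec))
```

In a real experiment the vectors would come from a model's residual stream, and `steer` would be applied inside the forward pass at a chosen layer.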

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

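Scoring rationale: The paper studies how reinforcement learning (RL) recruits a "functional welfare axis" in a language model's internal representations, placing it in representation learning and interpretability. MLLM, MultiModal, and Visual Encoder concern multimodal and visual content and are unrelated to this text-only language model study, so they score 0. Unify Models and Tokenizer are not core contributions and have low relevance. World Models and model-based RL have some connection through reinforcement learning and environment interaction but are not the paper's focus (it concentrates on the welfare-axis representation rather than model architecture or environment modeling), so they receive moderate scores. Overall, the paper matches the multimodal and unified-model keyword set poorly.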

Keywords

Reinforcement Learning, Language Models, Functional Welfare Axis, Representation Learning, Interpretability, Alignment, Maze Environment, Reward Vectors

137. Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model [FAIL]

Score: 16.5 / 27.8

Authors: Thang Dang, Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata

Published: 2026-05-28

TL;DR: This paper proposes an Adaptive Targeted Dynamic Chunking method to optimize compression ratios in tokenization-free hierarchical models, achieving competitive performance and stable training dynamics without traditional tokenization.

摘要翻译

无分词的层级模型正成为传统大型语言模型（LLMs）的一种有前景的替代方案，解决了诸如词汇表设计复杂性、词外（OOV）错误及语言特定约束等固有的预处理问题。然而，这些字节级方法中的一个重大挑战在于压缩比的优化，这是一个关键因素，决定了模型通过分块处理字节数据时的性能。本文提出了一种自适应目标动态分块（ATDC），这是一种新颖的字节压缩控制机制，旨在增强层级架构中动态分块的有效性。该方法利用课程学习在训练过程中逐步调整压缩比，从低压缩比过渡到高压缩比，以稳定学习过程。我们提供了一种分析，建立了目标压缩比与每内层块字节数（BPIC）之间的关系，以便能够跟踪整个训练阶段分块大小的演变。在 FineWeb-Edu 100B 数据集上进行的评估表明，配备 ATDC 的层级模型实现了具有竞争力的比特每字节（BPB）性能，相较于在字节级和词元级运行的常规基线。此外，与使用固定压缩比的模型相比，所提出的方法在多样化的下游任务中表现出更稳定的训练动态和更优越的最终性能，同时保持了字节级处理固有的鲁棒性和灵活性。

Abstract

Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor that dictates model performance for processing bytes data via chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach utilizes curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), allowing for tracking of chunk-size evolution throughout the training phase. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.
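
The curriculum described in the abstract, moving the target compression ratio from low to high during training, can be sketched as a simple schedule. The linear shape, the endpoint values, and the `ramp_fraction` parameter are all assumptions made for this example; the abstract does not specify the schedule ATDC uses.

```python
def target_compression_ratio(step, total_steps, start=2.0, end=6.0,
                             ramp_fraction=0.5):
    """Linearly ramp the target ratio from `start` to `end`, then hold it.

    `start`, `end`, and `ramp_fraction` are illustrative defaults,
    not values from the paper.
    """
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    progress = min(1.0, step / ramp_steps)
    return start + (end - start) * progress

# Print the target ratio at a few points of a 1000-step run.
for step in (0, 250, 500, 1000):
    print(step, target_compression_ratio(step, total_steps=1000))
```

Starting with small chunks keeps early training close to plain byte-level modeling; the ratio then rises once the model has had a chance to stabilize.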

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on tokenization-free hierarchical models, making 'Tokenizer' highly relevant (8.0) due to the direct discussion of byte-level alternatives. 'Unify Models' has low relevance (3.0) as it lacks multimodal unification. Other keywords (Visual Encoder, World Models, MLLM, MultiModal, model-based RL) are irrelevant (0.0) since the paper is text-only, lacks vision, and involves no reinforcement learning or world modeling. No specified expert authors are found.

Keywords

Tokenization-Free, Hierarchical Model, Adaptive Targeted Dynamic Chunking, Byte-level Processing, Compression Ratio, Curriculum Learning, BPB Performance

138. Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline DistillationFAIL

Score: 16.5 / 27.8

Authors: M. Ali Bayram, Banu Diri, Savaş Yıldırım

Published: 2026-05-28

TL;DR: This paper proposes an efficient pipeline to adapt multilingual sentence embedding models to Turkish via tokenizer surgery and offline distillation, achieving competitive performance with reduced parameters and cost.

摘要翻译

句子嵌入（Sentence Embeddings）是语义搜索、聚类、分类及检索增强生成（Retrieval-Augmented Generation）的基础组件。本文提出了 embeddingmagibu-200m，这是一种专注于土耳其语的句子嵌入模型，可生成 768 维的 L2 归一化向量，并支持 8,192 个 token 的上下文窗口，远超早期基于 BERT 的土耳其编码器 512 个 token 的限制。与完全预训练不同，本文引入了一种高效的三阶段适应管道：(1) 通过从教师模型词汇中剪除冗余 token，并结合基于 40 语言语料库的频率分析引入多语言 token，构建一个拥有 131,072 词汇量的土耳其优化多语言分词器（Tokenizer）；(2) 克隆教师嵌入模型，同时保留 Transformer 骨干权重，并通过均值组成令牌映射（Mean-Composition Token Mapping）为新词汇量初始化兼容的嵌入表；(3) 在平衡的 40 语言维基百科语料库上，利用余弦相似度目标（Cosine Similarity Objective），从预计算的教师向量执行离线嵌入蒸馏（Offline Embedding Distillation）。所得学生模型（Student Model）包含约 2 亿参数，通过在训练期间避免在线教师推理（Online Teacher Inference），在单个 GPU 上仅需约四小时即可完成训练，总成本约为 5 至 20 美元。实证结果表明，在 STSbTR 数据集上，该模型获得了 77.55%/77.45% 的 Pearson/Spearman 相关性，超越了拥有 3 亿参数的教师模型（73.84%/72.92%）。在 TR-MTEB（涵盖 26 项任务）上，该模型取得了 63.9% 的平均得分（在 26 个模型中排名第 7），相较于教师模型减少了 33% 的参数，提供了具有竞争力的成本 - 质量权衡。为促进可复现性及下游应用，本文发布了所有实验工件（Artifacts），包括模型权重、分词器文件、预计算嵌入数据集以及开源的克隆与蒸馏工具。

Abstract

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072 vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and trains in roughly four hours on a single GPU by avoiding online teacher inference during training, at a total cost of $5-$20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
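
Stages (2) and (3) of the pipeline can be sketched in a few lines. This is a minimal illustration, assuming that mean-composition token mapping initializes each new token's embedding as the average of the teacher embeddings of the pieces the teacher tokenizer splits it into, and that the distillation objective is 1 minus cosine similarity. The toy tokenizer, vocabulary, and function names are hypothetical.

```python
import math

def teacher_tokenize(text, teacher_vocab):
    """Toy greedy longest-match tokenizer (stand-in for the teacher's real one)."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in teacher_vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return pieces

def mean_composition_init(new_tokens, teacher_vocab, teacher_emb):
    """Initialize each new token as the mean of its teacher-piece embeddings."""
    new_emb = {}
    for tok in new_tokens:
        rows = [teacher_emb[p] for p in teacher_tokenize(tok, teacher_vocab)]
        new_emb[tok] = [sum(col) / len(rows) for col in zip(*rows)]
    return new_emb

def cosine_distill_loss(student_vec, teacher_vec):
    """Assumed objective: 1 - cosine similarity to a precomputed teacher vector."""
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm = (math.sqrt(sum(s * s for s in student_vec))
            * math.sqrt(sum(t * t for t in teacher_vec)))
    return 1.0 - dot / norm

# Toy teacher vocabulary with 2-dimensional embeddings (invented values).
teacher_emb = {"ev": [1.0, 0.0], "ler": [0.0, 1.0], "de": [1.0, 1.0]}
new_emb = mean_composition_init(["evler", "de"], set(teacher_emb), teacher_emb)
print(new_emb)
```

Because the teacher vectors are precomputed, the loss only needs the student's forward pass at training time, which is what lets the pipeline avoid online teacher inference.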

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	9.0/10	13.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on NLP sentence embedding adaptation, showing high relevance to 'Tokenizer' due to the proposed tokenizer surgery method. It is text-only and unrelated to visual encoders, world models, multimodal LLMs, or reinforcement learning, resulting in 0 scores for those. 'Unify Models' has minimal relevance regarding multilingual adaptation.

Keywords

Sentence Embeddings, Tokenizer Surgery, Offline Distillation, Multilingual Model, Turkish NLP, Transformer Backbone, Cosine Similarity, Parameter Efficiency

139. ExCAM: Explainable Cultural Awareness Metrics [FAIL]

Score: 16.5 / 27.8

Authors: Christoph Leiter, Haiyue Song, Hour Kaing, Jin Tei, Hideki Tanaka, Masao Utiyama, Steffen Eger

Published: 2026-05-28

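TL;DR: ExCAM is an explainable cultural awareness metric that detects cultural errors in LLM instruction-output pairs, reaching up to 80% accuracy on a balanced test set without requiring new human annotation.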

摘要翻译

评估大语言模型的文化意识对于确保生成文本的公平性以及应用程序在全球范围内的泛化性至关重要。近期的基准测试通过问答或文本生成任务的视角，探索了食物等文化要素或压力情境下的行为等价值观。然而，构建这些基准测试需要耗时且昂贵的人工标注。此外，评估自由文本中文化意识的基准测试稀缺，且往往依赖于过时的评估机制。为了解决这一空白，我们引入了 ExCAM（Explainable Cultural Awareness Metric，可解释的文化意识度量），据我们所知，这是首个专门用于识别、评分和解释指令 - 输出对中文化错误的评估指标。为了训练和评估 ExCAM，我们引入了 ExCAM40k，这是一个由九个现有基准测试组成的数据集，我们对其进行了格式化处理并用合成错误进行了增强。与包括 GPT-5 在内的多个基线方法相比，ExCAM 在平衡测试集上实现了最高的错误检测率，准确率高达 80%。因此，ExCAM 为自由文本的细粒度且可解释的文化评估开辟了途径。

Abstract

Evaluating the cultural awareness of large language models is crucial to ensure the fairness of generated text and the generalizability of applications across the world. Recent benchmarks explore cultural goods like food or values like behavior in stressful situations through the lens of question answering or text generation tasks. However, creating these benchmarks requires time-intensive and costly human annotations. Also, benchmarks that evaluate cultural awareness in free text are scarce and often rely on dated evaluation mechanisms. To address this gap, we introduce ExCAM, an Explainable Cultural Awareness Metric, which is, to our knowledge, the first dedicated evaluation metric that identifies, rates and explains cultural errors in instruction-output pairs. To train and evaluate ExCAM, we introduce ExCAM40k, a dataset comprised of nine existing benchmarks that we reformat and enhance with synthetic errors. Compared to several baselines, including GPT-5, ExCAM achieves the highest error detection rate with up to 80% accuracy on a balanced test set. Therefore, ExCAM opens the pathway towards fine-grained and explainable cultural evaluation of free text.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

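Scoring rationale: The paper focuses on evaluating the cultural awareness of large language models and belongs to NLP evaluation. The keywords cover multimodal architectures, world models, and reinforcement learning, which match this paper poorly. Only MLLM is slightly relevant because the work involves large language models; the others, such as Visual Encoder, Tokenizer, and model-based RL, have essentially no connection. The weighted total is 16.5, below the dynamic passing score of 27.8.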

Keywords

Cultural Awareness, Evaluation Metric, Large Language Models, Explainable, Instruction-Output Pairs, Synthetic Errors, Fairness

140. Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation [FAIL]

Score: 16.5 / 27.8

Authors: Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

Published: 2026-05-28

TL;DR: This paper proposes an Interactive ASR framework called Agentic ASR that leverages LLM-based semantic correction and multi-turn refinement to significantly reduce semantic errors in speech recognition compared to single-pass systems.

摘要翻译

自动语音识别（ASR）是人机交互的核心组成部分，也是基于大语言模型（LLM）的助手和智能体日益重要的前端。然而，大多数当前的 ASR 系统仍遵循单次通过范式，这与人类沟通模式不太契合，因为人类沟通中的误解是通过迭代澄清与细化来解决的。这种不匹配使得一旦发生意义关键性错误，便难以纠正。同时，WER 或 CER 等词元级指标无法充分反映此类问题。为了解决这些局限性，我们将交互式自动语音识别（Interactive ASR）定义为多轮细化任务，并提出智能体自动语音识别（Agentic ASR），这是一个闭环框架，结合了单次通过 ASR 前端、语义修正、意图路由以及基于推理的编辑。我们进一步引入了句子级语义错误率（S²ER），这是一种基于大语言模型的语义评估指标，以及交互式仿真系统（Interactive Simulation System），用于实现可扩展且可复现的基准测试。在多语言、命名实体密集型以及语码切换基准上的实验表明，迭代交互持续降低语义错误，且在 S²ER 上的提升远大于在传统词元级指标上的提升。人机对齐和消融研究进一步验证了语义评判器的可靠性以及所提出框架的鲁棒性。代码可在 https://interactiveasr.github.io/ 获取，实时演示可在 https://i-asr.sjtuxlance.com/ 查看。

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/.
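
The claim that token-level metrics miss meaning-critical errors can be seen with a small worked example. The sketch below computes word error rate (WER) as word-level edit distance divided by reference length, which is the standard definition; the example sentences are invented for illustration and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of
    # ref and hyp[:j]; only two rows are kept at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

reference = "transfer fifty dollars to bob"
harmless = "transfer fifty dollar to bob"       # meaning preserved
critical = "transfer fifteen dollars to bob"    # amount changed

print(word_error_rate(reference, harmless))
print(word_error_rate(reference, critical))
```

Both hypotheses differ from the reference by one substituted word out of five, so WER gives them the same score even though only the second changes what the speaker asked for. A sentence-level semantic metric is meant to separate such cases.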

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Interactive Speech Recognition (ASR) using an agentic framework with LLM-based semantic correction. It does not involve Visual Encoders, World Models, or Model-Based RL, hence 0 scores for these. While it handles Audio-Text interaction (MultiModal) and utilizes LLMs (MLLM), it does not focus on Tokenizer architecture or Unify Models architecture, resulting in low scores for those. The keyword set primarily targets Vision/RL/World Models, which mismatches the ASR/NLP focus of this paper.

Keywords

Interactive ASR, Agentic ASR, Semantic Correction, LLM-based Evaluation, Sentence-level Semantic Error Rate, Multi-turn Refinement, Speech Recognition

141. Veda: Scalable Video Diffusion via Distilled Sparse Attention [FAIL]

Score: 16.5 / 27.8

Authors: Shihao Han, Hao Yang, Xinting Hu, Xiaofeng Mei, Yi Jiang, Xiaojuan Qi

Published: 2026-05-28

TL;DR: Veda proposes a distilled sparse attention framework to overcome quadratic self-attention costs in video diffusion transformers, achieving significant speedups without quality degradation.

摘要翻译

将扩散 Transformer (Diffusion Transformers) 规模化以生成高分辨率、长视频受到自注意力 (self-attention) 机制二次计算复杂度的制约，且现有的稀疏注意力 (sparse attention) 方法在高稀疏度下会出现性能退化。实验表明，生成质量并非由稀疏比率本身决定，而是取决于稀疏掩码与全注意力 (full attention) 的瓦片级几何结构的对齐程度。基于这一洞察，我们提出 Veda，这是一种蒸馏稀疏注意力框架，它将瓦片选择建模为从全注意力进行的显式重构问题。Veda 整合了统计感知瓦片评分与头感知瓦片划分，以减少估计误差和结构不匹配，从而实现高稀疏度。一种硬件高效的瓦片跳过核 (tile-skipping kernel) 将理论稀疏度转化为实际运行时间的加速比。在大型视频扩散模型（包括 Waver 和 Wan2.1）上的实验表明，该方法实现了显著加速，且生成质量无明显退化。在 Waver-T2V-12B 上生成 720P 10 秒视频时，Veda 实现了 5.1 倍的端到端加速比和 10.5 倍的自注意力加速比，将注意力开销从 92% 降低至 50%。值得注意的是，加速收益随序列长度增加而提升，表明 Veda 在不同模型上具有良好的时空分辨率扩展性。

Abstract

Scaling Diffusion Transformers to generate high-resolution, long videos is constrained by the quadratic cost of self-attention, and existing sparse attention methods degrade under high sparsity. We show empirically that generation quality is determined not by the sparsity ratio itself, but by how well the sparse mask aligns with the tile-wise geometry of full attention. Based on this insight, we propose Veda, a distilled sparse attention framework that formulates tile selection as an explicit reconstruction problem from full attention. Veda integrates statistics-aware tile scoring with head-aware tiling to reduce estimation error and structural mismatch, enabling aggressive sparsity. A hardware-efficient tile-skipping kernel converts theoretical sparsity into practical wall-clock speedups. Experiments on large video diffusion models, including Waver and Wan2.1, demonstrate substantial acceleration with no noticeable degradation in generation quality. To generate 720P 10-second videos on Waver-T2V-12B, Veda achieves a 5.1$\times$ end-to-end speedup and a 10.5$\times$ self-attention speedup, reducing attention overhead from 92% to 50%. Notably, the gains increase with sequence length, indicating that Veda scales favorably with spatiotemporal resolution across models.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on scaling video diffusion transformers via sparse attention optimization, which has low relevance to the provided keywords centered on MLLM, RL, and unified multimodal architectures. 'World Models' receives a moderate score (3.0) due to the generative nature of video modeling, but 'model-based RL' and 'MLLM' are irrelevant (0.0-1.0). No expert authors from the specified list are found in the author list. The weighted total score (16.5) is below the dynamic passing threshold (27.8), indicating low alignment with the target research direction.

Keywords

Video Diffusion, Sparse Attention, Diffusion Transformers, Scalability, Distillation, Tile-wise Geometry, Self-attention Speedup

142. Future Forcing: Future-aware Training-free KV Cache Policy for Autoregressive Video Generation [FAIL]

Score: 16.5 / 27.8

Authors: Jiayi Luo, Qiyan Liu, Tengyang Wang, JunHao Liu, Jiayu Chen, Cong Wang, Hanxin Zhu, Chen Gao, Xiaobin Hu, Qingyun Sun, Zhibo Chen

Published: 2026-05-28

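TL;DR: This paper proposes Future Forcing, a training-free policy that estimates the future query distribution from historical statistics to optimize the KV cache for autoregressive video generation, improving long-horizon consistency.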

摘要翻译

自回归（AR）视频生成已成为长时序视频合成的一种有前景范式，其中每一帧均基于先前生成的令牌进行条件生成。为加速推理，采用 KV 缓存以避免生成步骤间的冗余重计算。然而，随着生成长度的增加，KV 缓存的增长会带来不断增加的内存占用和误差累积，限制了 AR 模型扩展到更长序列的可扩展性。现有的 KV 缓存压缩方法通过仅选择性保留被认为重要的视频令牌来缓解这一问题。然而，大多数现有方法使用源自当前或历史生成语境的短时序信号来评估令牌重要性，导致这些方法容易忽略那些在早期步骤中看似不重要，但对后续帧至关重要的令牌。在这项工作中，我们识别出训练好的 AR 视频模型的一个重要属性：尽管基于 RoPE（旋转位置编码）调节的查询在自回归步骤中演变，但底层标准的预 RoPE 查询分布在视频生成过程中保持惊人的稳定性。这种近似平稳性意味着未来查询分布可从历史统计中估算，从而使得无需任何额外训练即可实现基于原理的未来感知缓存决策。基于这一见解，我们提出 Future Forcing（未来强制），一种用于 AR 视频生成的无训练未来感知 KV 缓存策略。具体而言，Future Forcing 首先基于历史统计构建一个未来查询代理，随后根据该代理下的重要性对 KV 缓存令牌进行评分，最后在由未来查询诱导的仿射子空间内合并冗余令牌对。广泛实验表明，Future Forcing 在有限 KV 缓存下提升了长时序一致性，在 VBench-Long 基准上针对 60 秒生成任务，相比现有 AR 视频 KV 缓存策略，实现了高达 1.49 的主体一致性提升。

Abstract

Autoregressive (AR) video generation has emerged as a promising paradigm for long-horizon video synthesis, where each frame is generated conditioned on previously generated tokens. To accelerate inference, the KV cache is used to avoid redundant recomputation across generation steps. Nevertheless, its growth with generation length introduces increasing memory and error accumulation, limiting the scalability of AR models to even longer sequences. Existing KV cache compression methods mitigate this issue by selectively retaining only video tokens deemed important. However, most existing methods assess token importance using short-horizon signals derived from the current or historical generation context, making these methods prone to overlooking tokens that appear unimportant at early steps but later become critical for future frames. In this work, we identify an important property of trained AR video models: although RoPE-modulated queries evolve across autoregressive steps, the underlying canonical pre-RoPE query distribution remains remarkably stable throughout the video generation process. This approximate stationarity implies that future query distributions are estimable from historical statistics, enabling principled future-aware cache decisions without any additional training. Building on this insight, we propose Future Forcing, a training-free future-aware KV cache policy for AR video generation. Specifically, Future Forcing first constructs a future query proxy from historical statistics, then scores KV cache tokens by their importance under this proxy, and finally merges redundant token pairs within the affine subspace induced by the future query. Extensive experiments show that Future Forcing improves long-horizon consistency under limited KV caches, achieving up to 1.49 improvement in subject consistency on VBench-Long for 60s generation over existing AR video KV cache policies.
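
The first two steps described in the abstract, building a future query proxy from historical statistics and scoring cached tokens under it, can be sketched as follows. This is a simplified illustration under stated assumptions: the proxy is taken to be the mean of past queries and importance is taken to be a dot product, neither of which is confirmed by the abstract. RoPE handling and the final merging step within the affine subspace are omitted, and all names are hypothetical.

```python
def mean_vector(vectors):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_cache_tokens(past_queries, cached_keys, budget):
    """Return indices of the `budget` cached keys that score highest
    against a proxy for future queries."""
    proxy = mean_vector(past_queries)          # assumed proxy: historical mean
    scores = [dot(proxy, k) for k in cached_keys]
    ranked = sorted(range(len(cached_keys)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])             # restore temporal order

# Toy 2-dimensional queries and keys (invented values).
past_queries = [[1.0, 0.0], [0.8, 0.2]]
cached_keys = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [-1.0, 0.0]]
kept = select_cache_tokens(past_queries, cached_keys, budget=2)
print(kept)
```

The point of scoring against a proxy for future queries, rather than against the current query, is to keep tokens that look unimportant now but are likely to matter for later frames.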

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

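Scoring rationale: The paper focuses on KV cache optimization (inference acceleration) for autoregressive video generation and matches the keywords poorly. 'Tokenizer' and 'MultiModal' are marginally relevant (2-3 points) because the work deals with video tokens and video-specific properties; 'World Models' has some conceptual connection (3 points) because autoregressive generation predicts future frames; the remaining keywords (Unify Models, Visual Encoder, MLLM, model-based RL) are not directly addressed (0-1 points). The author list contains none of the designated experts, so no bonus applies. The weighted total is 16.5, below the dynamic pass score of 27.8.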
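
The arithmetic behind the table can be reproduced with a short Python sketch. It assumes that each row's score is relevance times weight, that the total is the plain sum of the rows, and that a paper passes when the total reaches the threshold; the comparison operator and the absence of any expert bonus are assumptions, and the names below are hypothetical.

```python
# Relevance values (out of 10) copied from the table above; the weight is 1.5 for every keyword.
WEIGHT = 1.5
PASS_THRESHOLD = 27.8  # the "dynamic pass score" quoted in the report

relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 1.0,
    "World Models": 3.0,
    "MLLM": 1.0,
    "MultiModal": 3.0,
    "model-based RL": 0.0,
}


def weighted_total(scores: dict, weight: float) -> float:
    """Sum of relevance * weight over all keywords."""
    return sum(value * weight for value in scores.values())


total = weighted_total(relevance, WEIGHT)
# Assumption: a paper passes when its total is at least the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)
```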

Keywords

Autoregressive Video Generation, KV Cache Policy, Future-aware Training-free, Long-horizon Consistency, Query Distribution, Video Tokens, Inference Acceleration

143. In-Context Reward Adaptation for Robust Preference Modeling [FAIL]

Score: 15.0 / 27.8

Authors: Zhenyu Sun, Zheng Xu, Ermin Wei

Published: 2026-05-28

TL;DR: This paper proposes an in-context reward adaptation framework leveraging transformers and human response time to robustly model diverse human preferences in reinforcement learning without costly retraining.

Abstract (Chinese translation)

基于人类反馈的强化学习 (RLHF) 通常依赖静态奖励模型，以使大型语言模型 (LLM) 与人类偏好对齐。然而，人类价值观本质上多样且异质，单个奖励模型往往缺乏泛化到未见偏好领域所需的鲁棒性。尽管现有的多奖励框架试图解决这一问题，但它们通常局限于一组固定的已知领域，且在不进行昂贵重训练的情况下无法适应未见的人类分布。在这项工作中，我们提出上下文奖励适应 (In-Context Reward Adaptation)，这是一种基于 Transformer 的框架，旨在即时建模多样且未见的人类偏好。通过利用 Transformer 的上下文学习能力，我们的方法能够从少量偏好示范中自适应地推断潜在的奖励结构。我们证明，尽管标准 Transformer 架构因表现出对真实值 (ground-truth) 的渐近偏差而不足以胜任此任务，但将人类响应时间作为辅助输入信号纳入，能使模型成功适应先前未见领域的偏好。我们的研究结果表明，该方法为偏好建模提供了更稳健的基础，能够表示异质奖励及偏好分布偏移，并为实现更灵活的人 -AI 对齐提供了一条可扩展的路径。

Abstract

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We demonstrate that while a standard transformer architecture is insufficient for this task by characterizing an asymptotic bias to the ground-truth, incorporating human response time as an auxiliary input signal enables the model to successfully adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on RLHF and in-context reward adaptation using transformers. It lacks visual encoders, multi-modal integration, world models, and tokenizer innovations. While it utilizes LLMs (MLLM context), it is text-based preference modeling, resulting in low relevance to most provided keywords.

Keywords

Reinforcement Learning from Human Feedback, In-Context Reward Adaptation, Preference Modeling, Transformer Architecture, Human Response Time, Reward Model Generalization, Robust Alignment

144. Self-Trained Verification for Training- and Test-Time Self-Improvement [FAIL]

Score: 15.0 / 27.8

Authors: Chen Henry Wu, Aditi Raghunathan

Published: 2026-05-28

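TL;DR: This paper proposes self-trained verification (STV), which trains a verifier to detect self-generated errors and substantially improves accuracy on math and scientific reasoning tasks at both training and test time.

Abstract (Chinese translation)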

大规模自我改进一直是推理模型的长期目标，而实现这一目标主要有两个自然途径：在测试时，通过验证 - 精炼（V-R）循环；在训练时，通过自训练方法。两者都受限于同一个瓶颈：验证器。当验证器分数虚高而准确率停滞，且反馈过于泛泛无法利用时，V-R 循环便会停滞；同样地，当低质量的自生成数据被加入训练时，自训练也会失败。更优的验证方法能突破这两者瓶颈，但我们想要训练的能力——即捕捉自生成错误——却缺乏相应的训练信号。为应对这一挑战，我们提出自训练验证（STV）。我们的关键观察是，虽然模型单独无法捕捉这些错误，但当展示参考解时却可以。我们将这种不对称性转化为监督目标，并训练验证器模仿一个拥有参考解的自身版本。在测试时，STV 在难题上显著改善了 V-R 循环，而替代方案（例如监督微调 SFT、基于验证器分数的强化学习，甚至元验证器）则无法实现这一效果。STV 在复杂数学题上使准确率翻倍，并在科学推理任务上将其准确率提升至原来的 14 倍（从 1.5% 提升至 21%）。在训练时，我们还利用强化学习（RL），在 V-R 循环内结合 STV 验证器的反馈来训练生成器——这一过程我们称为循环内验证器训练（ViL）。从强化学习收敛后的生成器开始，ViL 使 pass@1 进一步提升了 33%。更值得注意的是，生成器的独立 pass@1（测试时无需验证器）相对于标准 RL 收敛时的水平提升了 30%。因此，难题推理的下一个前沿可能在于我们如何为验证而训练以及如何利用验证进行训练。

Abstract

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

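Scoring rationale: The paper centers on self-improvement mechanisms for reasoning models (verification and self-training) and involves no visual content, so Visual Encoder and MultiModal score 0. It uses RL and a unified strategy, but it is neither model-based RL in the strict sense nor a unified multimodal architecture, so Unify Models and model-based RL score 3. MLLM scores 2 because the work involves large models without an explicit multimodal component, and Tokenizer scores 1 as a generic component.

Keywords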

Self-Trained Verification, Reasoning Models, Test-Time Improvement, Training-Time Improvement, Reinforcement Learning, Verification-Refinement, Scientific Reasoning

145. Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison [FAIL]

Score: 15.0 / 27.8

Authors: Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

Published: 2026-05-28

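TL;DR: This paper presents a benchmark for conflicting multi-source personal memory and shows that fusion methods achieve higher selective accuracy than LLM baselines when abstention is allowed.

Abstract (Chinese translation)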

新兴的个人智能体正朝着持久化、多源记忆的方向演进。这引出了评估难题：系统需决定如何利用冲突或不完整证据，而不能仅从单一干净的历史记录中检索事实。现有基准很少能揭示错误是源于提供给方法的数据，还是源于该方法的冲突解决步骤。本文将此研究为针对冲突多源个人记忆的选择性问答 (Selective QA)：系统基于冲突的、有时不完整的来源作答，或在证据不足时拒绝回答 (Abstain)。我们构建了一个基准，包含 18 个问题模板，涵盖 8 种推理类型、480 个角色 (Personas)、4 个随机种子及 34,560 个实例，具备受控的源失真和确定的真实标签 (Ground Truth)。我们评估了多种基线的性能：无源访问基线、单源访问基线、结构化融合方法以及前沿大语言模型 (LLMs)。最佳训练融合解析器达到了 80.3% 的准确率，而最强的仅提示词 LLM 基线达到了 70.0%。引入拒绝回答机制后，同一解析器在 78.3% 的覆盖率下达到了 85.3% 的选择性准确率，而最佳 LLM 在 95.4% 的覆盖率下达到了 71.0% 的选择性准确率。不同模型在不同推理类型上各具优势。我们发布数据、代码、缓存的模型输出以及数据生成过程以供复用。

Abstract

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
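
The two metrics quoted in the abstract, coverage and selective accuracy, can be computed as in the following minimal Python sketch. It assumes the usual definitions (coverage is the fraction of questions answered rather than abstained on; selective accuracy is accuracy over the answered questions only) and represents abstention as `None`; the function name and toy data are hypothetical and not taken from the released code. The last line checks that the reported instance count equals templates times personas times seeds.

```python
from typing import List, Optional, Tuple


def selective_metrics(predictions: List[Optional[str]], gold: List[str]) -> Tuple[float, float]:
    """Return (coverage, selective accuracy); a prediction of None means the system abstained."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        return coverage, 0.0
    correct = sum(1 for p, g in answered if p == g)
    return coverage, correct / len(answered)


# Toy example: five questions, two abstentions, one wrong answer.
preds = ["Paris", None, "Lyon", "Nice", None]
gold = ["Paris", "Rome", "Lyon", "Cannes", "Oslo"]
coverage, selective_accuracy = selective_metrics(preds, gold)
print(coverage, selective_accuracy)

# 18 templates x 480 personas x 4 seeds matches the 34,560 instances reported.
n_instances = 18 * 480 * 4
```

Abstaining lowers coverage but can raise selective accuracy, which is the trade-off behind the 85.3% at 78.3% coverage versus 71.0% at 95.4% coverage comparison.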

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

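Scoring rationale: The paper focuses on a benchmark for conflicting multi-source personal memory and compares fusion methods with LLMs. It does not address model architecture components (Tokenizer, Visual Encoder), generative world models, reinforcement learning, or model unification. Although it uses LLMs, it does not specifically discuss MLLMs or multimodal encoding techniques, so its relevance to the given technical keywords is low.

Keywords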

Selective QA, Conflicting Multi-Source Memory, Diagnostic Testbed, Method Comparison, Reasoning Types, Fusion Methods, Abstention, Personal AI Agents

146. Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework [FAIL]

Score: 15.0 / 27.8

Authors: Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada

Published: 2026-05-28

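TL;DR: This paper proposes a lightweight evaluation framework based on the Minimal Failure Set for optimizing HTML observation reduction in LLM-driven web agents, reducing per-step latency by 2.2-3.1× while retaining 84-89% of the original success rate.

Abstract (Chinese translation)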

基于大语言模型（LLM）的 Web 智能体中的 HTML 观测数据极其冗长，尽管已提出许多缩减方法，但尚不清楚哪些方法能在保持性能的同时降低整体智能体延迟。主要障碍是端到端评估的高成本：在我们的实验中，在 WorkArena L1 的 33 个任务上，对 11 种方法在 32 种配置下进行评估，需要 232.4 累计小时数。为了解决这一问题，我们提出了一种基于最小失败集（MFS）的轻量级评估框架，MFS 是指移除后导致任务失败的最小 HTML 元素集合。我们将覆盖率定义为缩减方法完全保留 MFS 的实例比例，这是一个代理指标，既不需要网页访问也不需要 LLM 推理。我们验证了覆盖率与端到端成功率强相关，在两个基准上累计评估时间加速超过 100 倍。利用此框架，我们发现抽取式 HTML 缩减方法要么需要高计算成本，要么需要领域特定优化，才能在保持性能的同时降低智能体延迟。基于此，我们在 MFS 训练数据上优化了一个剪枝程序，在 WorkArena L1 上实现了 2.2 倍的单步延迟加速，同时保留了原始成功率的 84%，在 WebLinx 上实现了 3.1 倍的加速，同时保留了 89% 的成功率。

Abstract

HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is the high cost of end-to-end evaluation: in our experiments, evaluating 11 methods across 32 configurations on 33 tasks of WorkArena L1 required 232.4 cumulative hours. To address this, we propose a lightweight evaluation framework based on the Minimal Failure Set (MFS), the minimal set of HTML elements whose removal causes task failure. We define coverage as the fraction of instances in which a reduction method fully retains the MFS, which serves as a proxy metric that requires neither web access nor LLM inference. We validate that coverage strongly correlates with end-to-end success rate, with over 100× speedup in cumulative evaluation time on both benchmarks. Using this framework, we find that extractive HTML reduction methods require either high computation cost or domain-specific optimization to reduce agent latency while maintaining performance. Building on this, we optimize a pruning program on MFS training data, achieving 2.2× faster per-step latency on WorkArena L1 while retaining 84% of the original success rate, and 3.1× faster on WebLinx while retaining 89%.
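
The coverage metric defined in the abstract can be sketched in a few lines of Python. The sketch assumes that each instance's MFS and each reduced observation are available as sets of element identifiers; the identifiers, the function name `mfs_coverage`, and the toy data are hypothetical and do not come from the paper's code.

```python
from typing import List, Set


def mfs_coverage(retained: List[Set[str]], mfs: List[Set[str]]) -> float:
    """Fraction of instances whose Minimal Failure Set is fully kept by the reduction."""
    fully_kept = sum(1 for kept, needed in zip(retained, mfs) if needed <= kept)
    return fully_kept / len(mfs)


# Toy data: element IDs that must survive (MFS) versus what a reducer kept.
mfs = [{"btn-submit"}, {"input-name", "btn-save"}, {"link-next"}]
retained = [{"btn-submit", "nav"}, {"input-name"}, {"link-next", "footer"}]
coverage = mfs_coverage(retained, mfs)
print(coverage)
```

Because the check is a subset test on precomputed sets, it needs neither a browser session nor an LLM call, which is consistent with the speedup the abstract reports for this proxy.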

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	2.0/10	3.0

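Scoring rationale: The paper mainly studies observation reduction and evaluation-framework optimization for LLM-driven web agents, with HTML processing efficiency at its core. It matches the provided keyword set (which emphasizes world models, visual encoders, and unified models) poorly: only MLLM and MultiModal are weakly related because LLM agents are involved, and the paper does not address visual encoders, tokenizer design, or the core methods of model-based reinforcement learning.

Keywords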

Web Agents, Observation Reduction, HTML Reduction, Lightweight Framework, Minimal Failure Set, Latency Optimization, LLM-based Agents, WorkArena

147. Rethinking Stepwise Model Routing: A Cost-Efficient Table Reasoning Perspective [FAIL]

Score: 15.0 / 27.8

Authors: Shenghao Ye, Yuxiang Wang, Yu Guo, Dong Jin, Shuangwu Chen, Jian Yang

Published: 2026-05-28

TL;DR: This paper proposes EcoTab, a table-aware stepwise routing framework that distinguishes between table and text token uncertainties to efficiently balance accuracy and inference cost in table reasoning tasks.

Abstract (Chinese translation)

大型推理模型（LRMs）在表格推理任务上表现优异，但由于推理轨迹较长，会产生显著的推理开销。逐步模型路由通过将推理步骤动态分配给较小或较大的模型来缓解这一问题。然而，针对表格推理的逐步模型路由仍研究不足。通过实证分析，我们发现涉及表格的推理步骤包含两种具有不同不确定性分布的令牌：基于表格结构的表格令牌（如单元格值和表头），以及表示周围自然语言推理的文本令牌。这两种令牌的不确定性均与模型在下一步推理中出错的风险相关。然而，现有方法未能分别建模它们，导致路由决策次优。为了解决这一问题，我们提出 EcoTab，一种面向表格的逐步路由框架，用于高效的表格推理。在每个推理步骤中，EcoTab 分别估计表格令牌和文本令牌的不确定性，并将它们映射到小模型的下一步失败风险，随后结合这两个风险进行路由。在多个表格推理基准上的实验表明，EcoTab 一贯优于强基线，并在准确性与效率之间取得了更好的平衡。

Abstract

Large Reasoning Models (LRMs) achieve strong performance on table reasoning tasks but incur substantial inference cost due to long reasoning traces. Stepwise model routing mitigates this issue by dynamically assigning reasoning steps to smaller or larger models. However, stepwise model routing for table reasoning remains underexplored. Through empirical analysis, we find that reasoning steps involving tables contain two types of tokens with distinct uncertainty distributions: table tokens grounded in table structure, such as cell values and headers, and text tokens representing surrounding natural-language reasoning. The uncertainty of both token types is correlated with the risk that the model makes an error in the next reasoning step. However, existing methods fail to model them separately, leading to suboptimal routing decisions. To address this, we propose EcoTab, a table-aware stepwise routing framework for efficient table reasoning. At each reasoning step, EcoTab separately estimates the uncertainties of table tokens and text tokens, maps them to next-step failure risks for the small model, and combines the two risks for routing. Experiments on multiple table reasoning benchmarks show that EcoTab consistently outperforms strong baselines and achieves a better balance between accuracy and efficiency.
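
The routing idea in the abstract (estimate table-token and text-token uncertainty separately, map each to a failure risk, then combine) can be sketched as follows. The concrete choices are assumptions made only for illustration and are not EcoTab's actual design: uncertainty is the mean entropy of the small model's next-token distributions, each risk is a logistic function with made-up parameters, and the two risks are combined as the probability that at least one failure source fires.

```python
import math
from typing import List

Distribution = List[float]


def entropy(probs: Distribution) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(dists: List[Distribution]) -> float:
    """Average entropy over the tokens of one type; 0.0 if the step has none."""
    return sum(entropy(d) for d in dists) / len(dists) if dists else 0.0


def risk(uncertainty: float, slope: float, offset: float) -> float:
    """Hypothetical logistic map from uncertainty to next-step failure risk."""
    return 1.0 / (1.0 + math.exp(-slope * (uncertainty - offset)))


def route(table_dists: List[Distribution], text_dists: List[Distribution],
          threshold: float = 0.5) -> str:
    """Send the next step to the large model when the combined risk is high."""
    r_table = risk(mean_entropy(table_dists), slope=4.0, offset=0.5)
    r_text = risk(mean_entropy(text_dists), slope=2.0, offset=1.0)
    combined = 1.0 - (1.0 - r_table) * (1.0 - r_text)
    return "large" if combined > threshold else "small"


confident = [[0.98, 0.01, 0.01]]  # peaked distribution, low entropy
uncertain = [[0.4, 0.3, 0.3]]     # flat distribution, high entropy
print(route(confident, confident), route(uncertain, confident))
```

Giving the two token types separate risk curves is what lets uncertain table tokens trigger escalation even when the surrounding text tokens look confident.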

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on table reasoning efficiency via stepwise routing and token uncertainty estimation, which has low alignment with the provided keyword set centered on multimodal world models and RL. 'Tokenizer' and 'MultiModal' have marginal relevance due to token analysis and structured data (table-text), while 'Visual Encoder', 'World Models', and 'model-based RL' are unrelated. No specified expert authors are found in the author list.

Keywords

Table Reasoning, Stepwise Model Routing, Uncertainty Estimation, Large Reasoning Models, Inference Efficiency, EcoTab Framework, Token Types

148. Uncertainty-driven 3D Gaussian Splatting Active Mapping via Anisotropic Visibility Field [FAIL]

Score: 15.0 / 27.8

Authors: Shangjie Xue, Jesse Dill, Dhruv Ahuja, Frank Dellaert, Panagiotis Tsiotras, Danfei Xu

Published: 2026-05-28

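TL;DR: This paper proposes GAVIS, an uncertainty-aware framework for 3D Gaussian Splatting active mapping based on an anisotropic visibility field, aimed at improving reconstruction accuracy and efficiency.

Abstract (Chinese translation)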

我们提出高斯泼溅各向异性可见性场（GAVIS），一种用于 3D 高斯泼溅（3DGS）中不确定性量化和主动映射的新框架。我们的关键见解在于，训练视角下不可见的区域会产生来自 3DGS 的不可靠预测。为此，我们引入了一种严谨且高效的方法，用于量化 3DGS 中的可见性场，该场定义为每个粒子相对于训练视角的各向异性可见性，并使用球谐函数（Spherical Harmonics）进行表示。所得的可见性场被集成到一种基于贝叶斯网络（Bayesian Network）的不确定性感知 3DGS 光栅化器中，从而实现了合成视图的实时（200 FPS）不确定性量化。在此基础上，主动映射进一步在最大信息增益框架内进行。在多样化环境下的广泛实验表明，GAVIS 在准确性和效率方面始终显著优于先前方法。此外，除了独立使用外，我们的方法还可事后应用于提升现有方法的性能。

Abstract

We present Gaussian Splatting Anisotropic Visibility Field (GAVIS), a novel framework for uncertainty quantification and active mapping in 3DGS. Our key insight is that regions unseen from the training views yield unreliable predictions from the 3DGS. To address this, we introduce a principled and efficient method for quantifying the visibility field in 3DGS, defined as the anisotropic visibility of each particle with respect to the training views, and represented using spherical harmonics. The resulting visibility field is integrated into a Bayesian Network-based uncertainty-aware 3DGS rasterizer, enabling real-time (200 FPS) uncertainty quantification for synthesized views. Active mapping is further performed within a maximum information gain framework building on this formulation. Extensive experiments across diverse environments demonstrate that GAVIS consistently and significantly outperforms prior approaches in both accuracy and efficiency. Moreover, beyond standalone use, our method can be applied post-hoc to improve the performance of existing approaches.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	2.0/10	3.0

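Scoring rationale: The paper focuses on uncertainty quantification in 3D Gaussian Splatting and active mapping and does not involve language models, tokenizers, or multimodal fusion (text/audio), so MLLM, Tokenizer, and MultiModal score low. Although it represents 3D environments, it is not a generative world model for RL and presents no model-based reinforcement learning algorithm. Unify Models is likewise not relevant.

Keywords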

3D Gaussian Splatting, Active Mapping, Uncertainty Quantification, Anisotropic Visibility Field, Bayesian Network, Real-time Rendering, Spherical Harmonics

149. Low-Magnification SEM May Suffice: Interpretable Deep Learning for Multi-Scale Fracture-Cause Classification in Zirconia-Toughened Alumina [FAIL]

Score: 15.0 / 27.8

Authors: Julian Schmid, Pawel Astankow, Tom Vater, Julius Beck, Robert Cichon, Danny Krautz

Published: 2026-05-28

TL;DR: This paper proposes an interpretable Vision Transformer framework for automated fracture cause classification in ceramic implants using low-magnification SEM images, achieving high accuracy while reducing reliance on high-magnification inspection.

Abstract (Chinese translation)

在氧化铝基复合材料髋关节和膝关节植入物中可靠识别断裂起源对于质量保证和患者安全至关重要，然而当前的断口分析工作流程耗时、部分主观，且依赖于高倍率扫描电子显微镜（SEM）。本文提出了一种可解释的视觉变换器（ViT）工作流程，用于对广泛用于全关节置换术的氧化铝基复合材料（BIOLOX delta，CeramTec GmbH）中的断裂原因进行自动分类。构建了一个包含 8,493 张 SEM 图像（50x-10,000x）的数据集，该数据集源自五年的生产过程中的爆破和验证测试，并根据制造链定义了三个缺陷类别进行了标注：生坯、硬加工和材料缺陷。在严重类别不平衡情况下，微调后的 ViT 在分层五折交叉验证中达到了 0.907 的准确率和 0.888 的宏观 F1 分数，两阶段感知哈希/SSIM 泄露审计确认样本重叠可忽略不计。值得注意的是，低倍率（50x）下的性能与高倍率（1k-10kx）相当，表明宏观特征——镜面几何和粗糙区线场——已编码了足够的诊断信号。Grad-CAM 归因一致地定位在典型的断口特征（镜面、粗糙区、气孔、加工痕迹）上，与公认的断口分析准则相一致。综上所述，这些结果将可解释的视觉变换器（ViT）定位为陶瓷植入物质量保证的互补工具，实现了低倍率预筛选，并减少了对耗时的高倍率检查的依赖。

Abstract

Reliable identification of fracture origins in alumina matrix composite hip and knee implants is critical for quality assurance and patient safety, yet current fractographic workflows are time-consuming, partly subjective, and reliant on high-magnification scanning electron microscopy (SEM). We present an interpretable vision-transformer (ViT) workflow for automated classification of fracture causes in an alumina matrix composite (BIOLOX delta, CeramTec GmbH) widely used in total joint replacements. A dataset of 8,493 SEM images (50x-10,000x) was curated from five years of in-production burst and proof tests and annotated into three defect categories defined along the manufacturing chain: green body, hard machining, and material defects. Under severe class imbalance, the fine-tuned ViT reached an accuracy of 0.907 and a macro-F1 of 0.888 in stratified five-fold cross-validation, with a two-stage perceptual-hash/SSIM leakage audit confirming negligible specimen overlap. Notably, performance at low magnification (50x) was comparable to that at high magnification (1k-10kx), indicating that macro-scale features - mirror geometry and hackle line fields - already encode sufficient diagnostic signal. Grad-CAM attributions consistently localised on canonical fractographic cues (mirrors, hackles, pores, machining marks), aligning with established fractographic criteria. Together, these results position interpretable ViTs as a complementary tool for ceramic-implant quality assurance, enabling low-magnification pre-screening and reducing reliance on time-intensive high-magnification inspection.
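
Because the abstract reports both accuracy and macro-F1 under severe class imbalance, a small Python sketch may help show why the two can differ. It uses the standard definition of macro-F1 (the unweighted mean of per-class F1 scores); the class names echo the three defect categories, but the labels and predictions are invented toy data, not results from the paper.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in labels or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced data: the majority class dominates, one rare class is always missed.
y_true = ["hard_machining"] * 4 + ["green_body", "material"]
y_pred = ["hard_machining"] * 4 + ["hard_machining", "material"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(accuracy, score)
```

In this toy case accuracy stays high while macro-F1 drops, since a completely missed minority class contributes an F1 of zero to the average.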

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper uses a Vision Transformer (a visual encoder) for fracture classification, earning a high score (8.0/10) on that keyword. However, it is largely irrelevant to Unify Models, Tokenizer, World Models, MLLM, MultiModal, and model-based RL, as it focuses on supervised single-modality learning in materials science rather than advanced AI architectures, reinforcement learning, or multimodal fusion. No target experts are present in the author list.

Keywords

Vision Transformer, Fracture Classification, SEM Imaging, Interpretable Deep Learning, Ceramic Implants, Low-Magnification, Quality Assurance

150. MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings [FAIL]

Score: 13.5 / 27.8

Authors: Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

Published: 2026-05-28

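TL;DR: This paper proposes a pipeline for generating clinically realistic FHIR datasets to evaluate LLM diagnostic reasoning, and finds that diagnostic accuracy on structured inputs is lower than on plain text.

Abstract (Chinese translation)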

大型语言模型（LLMs）在临床推理和决策支持方面展现出潜力，但在与电子健康记录（EHR）相符的真实场景中的评估仍然有限。现有的评估基准通常依赖静态数据集或非结构化输入，无法反映临床系统中使用的结构化、互操作性数据格式。我们提出了一种从非结构化文本生成具有临床真实感的 HL7 FHIR R4 包的流程，从而实现临床决策支持系统的可控评估。该流程结合了分阶段的 LLM 生成与基于术语的验证和修复，以减少幻觉编码并强制执行结构和语义一致性。将此方法应用于 MedCaseReasoning，我们构建了 MedCase-Structured，这是一个与临床医生撰写的诊断案例对齐的合成数据集，实现了 82.5% 病例的有效 FHIR 生成。在 MedCase-Structured 上的评估显示，LLMs 在结构化 FHIR 输入上的诊断准确率始终低于纯文本，突显了部署对齐基准测试的重要性。

Abstract

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

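Scoring rationale: The paper focuses on benchmarking the clinical reasoning of LLMs in the medical domain and on FHIR data generation, and has little connection to the provided keywords on multimodality, world models, and reinforcement learning. Only LLM-related concepts (MLLM, Unify Models) are touched on, and weakly; visual encoders, world models, reinforcement learning, and specific tokenizer architectures are not mentioned. The weighted total is 13.5, below the dynamic pass score of 27.8.

Keywords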

MedCase-Structured, Text-to-FHIR, Clinical Reasoning, Benchmarking, LLMs, Electronic Health Records, Synthetic Dataset, Diagnostic Accuracy

151. ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure [FAIL]

Score: 13.5 / 27.8

Authors: A. J. Lew, Y. Cao, M. J. Buehler

Published: 2026-05-28

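TL;DR: This paper proposes the ProjectionBench benchmark for evaluating LLM scientific hypothesis generation under progressive information disclosure; GPT-5.4 and Gemini 3.1 pro outperform their predecessors, and GPT-5.4 maintains an F1 alignment of 0.7 with the ground-truth conclusions even under minimal context.

Abstract (Chinese translation)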

科学发现本质上是一个具有创造性和不确定性的过程，需要超越已知知识检索的推理。尽管已提出诸多基准，通过多跳检索来评估大语言模型（LLM）在深度研究任务上的性能，但对于真正科学发现至关重要的创新推理能力，在很大程度上仍未得到检验。我们提出一个用于评估模型在科学发现与推理中表现的基准框架，从初始问题逐步演进至经典零假设检验。在我们的框架中，模型最初仅接收来自近期论文的主题和研究问题，技术细节逐步揭示。在信息揭示的每个阶段，模型的任务是生成回应研究问题的假设，这些假设与原始论文的结论进行比较，并通过构成性原子主张的自动语义相似度进行评估。这种对语义偏离真实结论的逐步评估，使得能够评估模型从最少信息下的创新性到完整实验细节下的基于事实的推理能力，两者对于使用 LLM 进行科学发现目的都至关重要。我们的框架为系统评估 LLM 中的科学推理与发现能力提供了基础，这对于推进下一代 AI 科学家/合作科学家系统的开发至关重要。具体而言，我们在涵盖生物活性材料、机械材料和纳米材料的 45 篇论文上评估了 GPT-5、GPT-5.4、Gemini 2.5 pro 和 Gemini 3.1 pro preview。我们发现 GPT-5.4 和 Gemini 3.1 pro 如预期般优于其上一代对应模型，尤其是 GPT-5.4 即使在最少上下文下也能保持与真实结论 0.7 的 F1 分数一致性。

Abstract

Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, their innovative reasoning abilities essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question, which is compared with the conclusions from the original paper and evaluated via automated semantic similarity of constituent atomic claims. This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment of a model's innovativeness (under minimal information) to grounded reasoning capabilities (under full experimental details), both critical for using LLMs for scientific discovery purposes. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems. Specifically, here we evaluate GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 pro outperform their previous generation counterparts as expected, and GPT-5.4 in particular maintains 0.7 F1 score alignment with ground truth conclusions even under minimal context.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

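Scoring rationale: The paper's core is a benchmark for LLM scientific reasoning (ProjectionBench), which has little connection to the multimodal architectures (visual encoders, tokenizers), world models, and reinforcement learning covered by the keywords. Although the evaluated models (GPT, Gemini) are MLLMs, the paper does not examine their internal architecture or multimodal fusion mechanisms and considers only text-based reasoning tasks. No designated expert authors were found. The weighted total is 13.5, below the dynamic pass score of 27.8.

Keywords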

Scientific Hypothesis Generation, LLM Evaluation, Progressive Information Disclosure, Scientific Reasoning, Benchmark Framework, Materials Science, Ground-truth Comparison

152. mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol [FAIL]

Score: 13.5 / 27.8

Authors: Peter W. Rose, Benjamin M. Good, Amanda M. Saravia-Butler, Charlotte A. Nelson, James P. Balhoff, Yaphet Kebede, Patricia L. Whetzel, Christopher Bizon, Andrew I. Su, Sergio E. Baranzini

Published: 2026-05-28

TL;DR: This paper presents mcp-proto-okn, a Python server enabling natural language access to scientific knowledge graphs through the Model Context Protocol, rather than advancing multimodal model architectures or reinforcement learning methods.

Abstract (Chinese translation)

MCP Server Proto-OKN (mcp-proto-okn) 是一个基于 Python 的 Model Context Protocol 服务器，它使人工智能助手能够通过自然语言发现、检查、查询并集成科学知识图谱。该服务器提供图路由、模式检查、SPARQL 执行、本体扩展、多图谱查询以及转录生成功能，降低了生物医学和科学用户进行跨领域知识图谱分析的门槛。mcp-proto-okn 基于 FastMCP 框架使用 Python 实现，可通过 https://github.com/sbl-sdsc/mcp-proto-okn 获取。文档、客户端配置说明及示例分析转录记录均在 GitHub 仓库中提供。

Abstract

MCP Server Proto-OKN (mcp-proto-okn) is a Python-based Model Context Protocol server that enables AI assistants to discover, inspect, query and integrate scientific knowledge graphs through natural language. The server provides graph routing, schema inspection, SPARQL execution, ontology expansion, multi-graph querying, and transcript generation, lowering the barrier to cross-domain knowledge graph analysis for biomedical and scientific users. mcp-proto-okn is implemented in Python using the FastMCP framework and is available at https://github.com/sbl-sdsc/mcp-proto-okn. Documentation, client configuration instructions, and example analysis transcripts are provided in the GitHub repository.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on a software tool (mcp-proto-okn) for accessing scientific knowledge graphs via natural language using the Model Context Protocol. It does not address multimodal model architectures, world models, reinforcement learning, or model internals like tokenizers and visual encoders. While it utilizes AI assistants (LLM), it lacks the multimodal and learning paradigm focus specified in the background keywords.

Keywords

Model Context Protocol, Scientific Knowledge Graphs, Natural Language Access, Python Server, SPARQL Execution, Ontology Expansion, Graph Routing

153. Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor? [FAIL]

Score: 13.5 / 27.8

Authors: Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao

Published: 2026-05-28

TL;DR: The paper proposes replacing LLM-based triggering with a lightweight Temporal Graph Learning model for proactive agents, achieving significant speedup and efficiency gains while maintaining high F1 scores on event detection.

Abstract (Chinese translation)

主动智能体将用户活动视为文本，并在每个事件上调用大语言模型（LLM）以决定是否采取行动。但用户活动并非原生文本：它是一个结构化事件流，由（主体、动词、对象、时间戳）元组组成，而操作系统已将其以图形式维护。将这种结构渲染为文本并让大语言模型（LLM）还原，构成了一个系统原本无需经历的往返过程。我们将持续信号视为图更新而非文本，并使用一个小型的时序图学习（TGL）模型作为编码器：单次前向传播即可输出每个事件的触发概率和每个实体的路由分数，唯有下游智能体（将小型结构化交接转化为流畅的用户端句子）才调用大语言模型（LLM），且仅在触发器被激活时调用。TGL 在 14 种骨干网络上均提升了 F1 分数（平均提升 +16.7，最高达 +46.0）；在触发器架构对比中，某个 TGL 检查点给出了最强的触发器 AUC 和最稳定的部署阈值。其在 GPU 服务器上每个事件耗时 11.13 毫秒，在消费级笔记本电脑上耗时 13.99 毫秒，分别比各测试场景中所有单次前向大语言模型触发配置快 4 至 7 倍和 12 至 83 倍，且仅需约 220 MiB 的 BF16 驻留内存占用，可部署于设备端，同时处理其消耗的隐私敏感活动流。

Abstract

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4-7x and 12-83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.
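
The control flow described in the abstract (a cheap trigger on every event, an LLM call only when it fires) can be sketched in Python as below. The trigger here is a hand-written stub, not a temporal-graph-learning model, and the LLM is a stub as well; `handle_event`, `stub_trigger`, `stub_llm`, the 0.5 threshold, and the event contents are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[str, str, str, float]  # (actor, verb, object, timestamp)
TriggerFn = Callable[[Event], Tuple[float, Dict[str, float]]]
LlmFn = Callable[[Dict[str, object]], str]


def handle_event(event: Event, trigger: TriggerFn, llm: LlmFn,
                 threshold: float = 0.5) -> Optional[str]:
    """Run the trigger on every event; call the LLM only when it fires."""
    probability, routing_scores = trigger(event)
    if probability < threshold:
        return None  # stay silent, no LLM cost
    anchor = max(routing_scores, key=routing_scores.get)  # entity to anchor on
    return llm({"event": event, "anchor": anchor})


def stub_trigger(event: Event) -> Tuple[float, Dict[str, float]]:
    """Stand-in for the small encoder: returns (trigger probability, per-entity scores)."""
    actor, verb, obj, _ = event
    probability = 0.9 if verb == "deadline_changed" else 0.1
    return probability, {actor: 0.2, obj: 0.8}


llm_calls: List[Dict[str, object]] = []


def stub_llm(handoff: Dict[str, object]) -> str:
    """Stand-in for the downstream LLM that verbalizes the structured handoff."""
    llm_calls.append(handoff)
    return f"Heads up: {handoff['anchor']} needs attention."


events = [("alice", "opened", "report.docx", 1.0),
          ("alice", "deadline_changed", "report.docx", 2.0)]
outputs = [handle_event(e, stub_trigger, stub_llm) for e in events]
print(outputs, len(llm_calls))
```

Only one of the two events reaches the LLM stub, which mirrors the claim that the expensive call is reserved for events where the trigger fires.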

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

评分理由: The paper focuses on efficient triggering for proactive agents using Temporal Graph Learning (TGL) instead of LLMs. It does not involve visual data (Visual Encoder: 0), multimodal representation (MultiModal: 1), or world models (World Models: 1). It discusses LLM usage critically (MLLM: 2, Tokenizer: 1) regarding text rendering round-trips. While it involves agent decisions, it is not explicitly model-based RL (model-based RL: 2). Unify Models is loosely related to architecture unification (Unify Models: 2). Total weighted score is 13.5, below the dynamic pass score of 27.8. No target experts found in author list.

Keywords

Proactive Agents, Temporal Graph Learning, Structured Event Streams, LLM Triggering, Efficient Inference, On-device Deployment, Graph Updates

154. Projectional Decoding: Towards Semantic-Aware LLM Generation [FAIL]

Score: 13.5 / 27.8

Authors: Boqi Chen, José Antonio Hernández López, Aren A. Babikian

Published: 2026-05-28

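TL;DR: This paper proposes a projectional decoding framework that maintains a graph model during generation to ensure the semantic validity of LLM-generated software artifacts.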

Abstract (Chinese translation)

大型语言模型（LLMs）正被越来越多地应用于生成软件工件，涵盖许多软件工程（SE）任务，然而确保这些工件的语义有效性仍是一个根本性挑战。现有的约束解码技术可以确保语法正确性，在某些情况下还能强制特定语义规则，但缺乏一种通用表示，这种表示能够连接 LLM 生成的文本与 SE 中语义验证所需的推理。本文提出了一种名为投影解码（Projectional Decoding）的新型概念框架，该框架通过在生成过程中伴随文本维护一个部分图模型作为主要工件表示，将领域语义直接整合到生成过程中。这种抽象表示通过显式捕获不确定性并原生支持错误检测，实现了增量式语义验证，同时引导生成过程朝向语义有效的输出，并提供可证明的保证。我们在一个程序生成任务上展示了初步结果，表明该方法具有提高 LLM 生成的工件语义有效性的潜力。我们还讨论了投影解码如何能够在各种 SE 活动中实现基于 LLM 的可验证自动化。

Abstract

Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring the semantic validity of these artifacts remains a fundamental challenge. Existing constrained decoding techniques can enforce syntactic correctness and, in some cases, specific semantic rules, but lack a general representation that bridges LLM-generated text with the reasoning required for semantic validation in SE. In this paper, we propose projectional decoding, a novel conceptual framework that integrates domain semantics directly into the generation process by maintaining, alongside text, a partial graph model as the primary artifact representation throughout generation. This abstract representation enables incremental semantic validation by explicitly capturing uncertainty and natively supporting error detection, while guiding generation toward semantically valid outputs with provable guarantees. We present preliminary results on a program generation task which demonstrate the potential of this approach to improve the semantic validity of LLM-generated artifacts. We also discuss how projectional decoding can enable verifiable automation with LLMs across various SE activities.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

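Scoring rationale: The paper focuses on the semantic validity of LLM generation in software engineering and proposes guiding generation by maintaining a partial graph model (Projectional Decoding). The provided keyword set mainly covers multimodal learning, world models, and reinforcement learning, which diverges significantly from the paper's topic of text/code generation and semantic validation. The paper does not involve visual data, multimodal fusion, or reinforcement learning, so the related keywords score 0. Although it uses LLMs (somewhat related to MLLM), it does not exhibit the core features of model unification or world models, so overall relevance is low.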

Keywords

Projectional Decoding, LLM Generation, Semantic Validity, Software Engineering, Graph Model, Text Generation, Semantic Validation

155. Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders [FAIL]

Score: 13.5 / 27.8

Authors: Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

Published: 2026-05-28

TL;DR: This paper proposes explicitly disentangling positional and semantic streams in Transformer encoders to improve linguistic representation and preserve macroscopic structure compared to standard positional encodings like RoPE.

Abstract (Chinese translation)

位置编码（PE）奠定了置换不变性 Transformer 表示序列顺序的基础，然而位置信息如何被处理和存储仍知之甚少。诸如 RoPE 等现代 PE 方法在长上下文理解或检索等任务上仍面临挑战 \cite{chen-etal-2025-hope}。因此，更好地理解内部位置机制有助于设计出更优的 PE。基于已有证据表明，位置信号与语义信号在训练好的 Transformer 中占据几乎正交的子空间，我们修改了一个编码器 Transformer，使其处理三个明确解耦的流：语义流、绝对位置（AP）流和相对位置（RP）流，并将掩码语言建模（MLM）目标仅限制于语义流。这种解耦使得机制研究更加清晰，并得出了三个主要结论。（1）独立的 AP 子空间自发坍缩为一个低频二维流形，该流形捕捉了文档的结构；（2）注意力头专门化为结构导向组和语义导向组，其中 RP 仅支持后者；（3）标准位置编码无法稳健地保留宏观结构：RoPE 和 RP 仅微弱地编码它，而纠缠的 AP 在 MLM 压力下于深层丢失了该结构。该解耦方法保留了位置编码能力，在 Flash-Holmes 探查基准的 65 种语言现象中的 49 种上改进了语言表示。

Abstract

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval \cite{chen-etal-2025-hope}. Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP) and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways. (1) The isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) Attention heads specialize into structure and semantic-oriented groups, with RP exclusively supporting the latter; (3) Standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on positional encoding mechanisms in Transformer encoders for NLP, specifically disentangling positional and semantic streams. It shows low relevance to the provided keywords, which primarily target multimodal learning, world models, and reinforcement learning. Only minor overlaps exist (e.g., MLM relates loosely to the MLLM context). No expert authors from the specified list were found. The weighted total score is 13.5, below the dynamic pass score of 27.8.

Keywords

Positional Encoding, Transformers, Disentangling Representations, Semantic Stream, Linguistic Representation, Absolute Positional, Relative Positional, Mechanistic Interpretability

156. OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation [FAIL]

Score: 13.5 / 27.8

Authors: Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian

Published: 2026-05-28

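TL;DR: OptSkills uses large language models to distill reusable skills via archetype clustering, significantly improving generalization in optimization modeling and solving.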

Abstract (Chinese translation)

利用大型语言模型（LLMs）从自然语言中自动构建并求解优化问题，已成为自动化优化的一种高效范式。然而，现有方法仍表现出有限的泛化能力：它们对表面叙述差异敏感，主要在案例级别复用经验，且难以适应分布偏移或新兴的问题类型。我们提出 OptSkills，一个以原型为中心的、用于优化建模与求解的技能学习与推理智能体系统。为提高鲁棒泛化能力，该系统根据问题的底层原型而非表面叙述对其进行聚类。为提高分布内泛化能力，它在每个簇内探索多样的建模范式与求解器配置，然后将成功的轨迹提炼为可复用的工作流级别技能。为提高分布外泛化能力，它利用新获得的轨迹完善现有技能或扩展技能库。该系统在涵盖多种问题类型和场景的数据集上实现了 68.27% 的微平均准确率，达到最先进水平。此外，在极具挑战性的大规模高维基准 MIPLIB-NL 上，它达到了 26.91% 的准确率，比 DeepSeek-V3.2-Thinking 高出 4.53%。在 Nano-CO 上进行技能学习后，它在 OOD NLCO 基准上达到了 72.79%。代码和技能可在 https://github.com/fujiwaranoM0kou/OptSkills 获取。

Abstract

Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

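Scoring rationale: The paper's core is LLM-based optimization skill learning and archetype clustering, which has low relevance to keywords such as multimodal, world models, and model-based RL. It does not involve a visual encoder, is not a multimodal architecture, and builds no world model; skill learning falls within the scope of RL but is not model-based RL; the tokenizer is only an implicit component. The weighted total is 13.5, below the dynamic pass score of 27.8, indicating a poor match between the paper's topic and the given keyword areas.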

Keywords

Optimization Skills, Problem Archetypes, Cluster-Based Distillation, LLM, Generalization, Skill Learning, Natural Language Optimization

157. DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration [FAIL]

Score: 13.5 / 27.8

Authors: Yanxing Guo, Zihao Zheng, Fangzhou Wu, Ling Liang, Lin Bao, Zongwei Wang, Yimao Cai

Published: 2026-05-28

TL;DR: DynaGraph proposes a lightweight multi-model interaction framework that uses dynamic topological reconfiguration and PEFT adapters so that an 8B model approaches the reasoning capability of a 72B monolithic model, while cutting latency and token consumption relative to unconstrained dynamic architectures.

Abstract (Chinese translation)

处理复杂推理任务通常依赖于大规模单体大语言模型（LLM），这些模型存在严重的计算冗余。虽然通过结构化管道或多智能体协作进行任务分解提供了一种替代方案，但这些方法不可避免地陷入一个关键困境：预定义静态拓扑极易受到级联错误的影响，而无约束动态智能体则面临轨迹发散和不可预测的内存膨胀问题。为了解决这一问题，我们提出了 DynaGraph，这是一个由动态拓扑重构驱动的轻量级多模型框架。在执行层面，DynaGraph 在共享基座模型上复用时分 PEFT 适配器，从而实现全系统训练和推理部署均在单个消费级 GPU 上完成。在路由层面，评估器（Evaluator）持续监控执行置信度以触发分层自修复：针对局部数据缺口进行细粒度补丁（Fine-grained Patching），针对严重逻辑断裂进行子图重构（Subgraph Reconstruction）。在 StrategyQA、MATH 和 FinQA 上的实验表明，我们的 8B 模型在推理能力上接近 72B 单体模型（例如，在 StrategyQA 上达到 87.6%，在 MATH 上达到 82.7%）。此外，与无约束动态架构相比，它将延迟降低了高达 68.1%，将 token 消耗降低了 68.6%。

Abstract

Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative, these approaches inevitably fall into a critical dilemma: predefined static topologies are highly vulnerable to cascading errors, whereas unconstrained dynamic agents suffer from trajectory divergence and unpredictable memory bloat. To address this, we present DynaGraph, a lightweight multi-model framework driven by dynamic topological reconfiguration. At the execution level, DynaGraph multiplexes time-division PEFT adapters over a shared base model, enabling both full system training and inference deployment on a single consumer-grade GPU. At the routing level, the Evaluator continuously monitors execution confidence to trigger hierarchical self-healing: Fine-grained Patching for localized data gaps and Subgraph Reconstruction for severe logical ruptures. Experiments on StrategyQA, MATH, and FinQA demonstrate our 8B model closely approximates the reasoning capabilities of a 72B monolithic model (e.g., 87.6% on StrategyQA, 82.7% on MATH). Furthermore, it reduces latency by up to 68.1% and token consumption by 68.6% compared to unconstrained dynamic architectures.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on a lightweight multi-model reasoning framework using dynamic topological reconfiguration and PEFT adapters. It shows moderate relevance to 'Unify Models' (3.0) by unifying adapter functions on a shared base model. 'MLLM' and 'MultiModal' have low relevance (2.0) as the text-only reasoning tasks lack multimodal integration. 'Tokenizer' is implicit (1.0), while 'World Models', 'Visual Encoder', and 'model-based RL' are largely irrelevant (0.0-1.0) due to the absence of vision, world modeling, or reinforcement learning components. The total weighted score is 13.5 (the sum of the table above), below the dynamic pass score of 27.8, indicating low alignment with the provided background keywords. No specified expert authors are found in the list.

Keywords

Multi-Model Interaction, Dynamic Topological Reconfiguration, Lightweight Framework, PEFT Adapters, Shared Base Model, Self-Healing, Reasoning Efficiency

158. CB-SLICE: Concept-Based Interpretable Error Slice Discovery [FAIL]

Score: 13.5 / 27.8

Authors: Yael Konforti, Mateo Espinosa Zarlenga, Elaf Almahmoud, Mateja Jamnik

Published: 2026-05-28

TL;DR: CB-SLICE introduces a concept-based method to identify systematic model errors and biases by grouping samples according to concept prediction failures, providing more faithful explanations than existing slice discovery methods.

Abstract (Chinese translation)

尽管深度学习模型在平均情况下的表现强劲，但它们往往在特定群体上表现出系统性错误，这些被称为错误切片（error slices）。识别这些群体及其失败的根本原因对于模型调试和偏差缓解至关重要。然而，现有的错误切片发现方法（SDMs）通常生成的解释与模型的推理过程脱节，因此仅是对潜在错误源的近似，可能并不准确。为了解决这一限制，我们利用了概念瓶颈模型（CBMs），其预测直接依赖于人类可理解的语义概念。由于 CBMs 中的下游任务失败通常源于概念误预测，概念表征为错误切片识别提供了强有力的候选，能够直接提供与错误源相关的细粒度解释。基于这一洞察，我们提出了 CB-SLICE，这是一种基于概念的 SDM，它通过将具有共享概念预测失败的样本进行分组，并识别出对每个切片失败模式最负责的关键概念。在多个基准测试上，我们的结果表明 CB-SLICE 在揭示已知偏见方面优于最先进方法，同时提供更丰富且更忠实的模型错误解释。

Abstract

Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process, thus only approximating the underlying error source and may be inaccurate. We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduce CB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode. Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on Concept Bottleneck Models (CBMs) for interpretable error slice discovery and bias mitigation in deep learning. It does not address World Models, Reinforcement Learning, Tokenizers, or specific Visual Encoder architectures central to MLLM pipelines. While it utilizes semantic concepts (loosely related to Unify/MultiModal), its core contribution is in XAI and debugging rather than unified multimodal modeling or RL, resulting in low relevance to the provided keyword set.

Keywords

Concept Bottleneck Models, Error Slice Discovery, Interpretable Error, Bias Mitigation, Semantic Concepts, Model Debugging, Concept Representations

159. Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues [FAIL]

Score: 13.5 / 27.8

Authors: Zhangqi Duan, Shuyan Huang, Alexander Scarlatos, Jaewook Lee, Simon Woodhead, Andrew Lan

Published: 2026-05-28

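TL;DR: This paper proposes a student simulation framework that uses reinforcement learning to train history-aware profiles, significantly improving simulation accuracy on real tutoring dialogue data.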

Abstract (Chinese translation)

开发基于大型语言模型（LLM）的自动化辅导工具的核心环节是学生模拟，即利用 LLM 扮演学生角色，从而促进辅导模型的评估与训练。现有工作主要聚焦于对话内模拟，缺乏对学生知识与行为背景的关注，部分原因在于未基于过往的学生问答或对话交互。在这项工作中，我们引入了历史条件学生模拟（history-conditioned student simulation）的任务，其目标是通过利用学生学习历史中的信息，准确预测学生的对话轮次。我们提出一个双组件框架，其中档案生成器（profile generator）用于总结学生的历史，而模拟器（simulator）则基于生成的档案预测学生的对话轮次。我们使用强化学习（RL）训练这两个组件，从而生成针对忠实学生模拟进行优化的档案。我们在首个此类真实世界数据集上评估了我们的方法及基线方法，该数据集包含学生对话和问题响应，我们是从一个数学学习平台上收集的。大量实验表明我们的方法显著优于基线方法，并展示了历史、档案及强化学习（RL）训练的重要性。

Abstract

A key part of developing large language model (LLM)-powered, automated tutoring tools is student simulation, i.e., using LLMs to role-play as students, which can facilitate tutor model evaluation and training. Existing work mostly focuses on within-dialogue simulation, which lacks context on student knowledge and behavior, partly due to not grounding in past student question-answering or dialogue interactions. In this work, we introduce the task of history-conditioned student simulation, where the goal is to accurately predict student dialogue turns by leveraging information in the student's learning history. We propose a two-component framework in which a profile generator summarizes a student's history and a simulator predicts student turns conditioned on the resulting profile. We train both components with reinforcement learning (RL), yielding profiles optimized for faithful student simulation. We evaluate our method and baselines on the first-of-its-kind real-world dataset of student dialogues and question responses that we collect from a math learning platform. Extensive experiments show that our method significantly outperforms baselines, and demonstrate the importance of history, profiles, and RL training.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

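Scoring rationale: The paper studies text-based student simulation and reinforcement learning training. It involves no multimodal data (very low relevance to Visual Encoder, MLLM, and MultiModal), does not explicitly use model-based RL (limited relevance), does not unify multimodal architectures (low relevance to Unify Models), and the tokenizer is not a core contribution. The author list contains none of the specified experts, so no bonus applies.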

Keywords

Student Simulation, Tutoring Dialogues, History-Aware Profiles, Reinforcement Learning, LLM, Profile Generator, Simulator

160. Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding [FAIL]

Score: 13.5 / 27.8

Authors: Jianuo Huang, Yaojie Zhang, Qituan Zhang, Hao Lin, Hanlin Xu, Linfeng Zhang

Published: 2026-05-28

TL;DR: Domino accelerates LLM inference by decoupling causal modeling from autoregressive drafting, achieving up to 5.8x throughput speedup on Qwen3 models.

Abstract (Chinese translation)

推测解码通过并行草拟多个令牌并用目标模型验证它们来加速大语言模型（LLM）的推理。然而，其实测加速受限于草拟质量与草拟成本之间的权衡：自回归草拟器能够建模草拟令牌间的因果依赖，但会产生顺序开销；而并行草拟器虽降低了草拟成本，却削弱了块内依赖建模。本文提出 Domino，一种推测解码框架，它将因果依赖建模与昂贵的自回归草拟执行解耦。Domino 首先使用并行草拟主干为整个块生成初步草拟分布，然后应用轻量级 Domino 头利用前缀依赖的因果信息对其进行精炼。为了稳定教师强制因果编码，我们进一步引入了一种基线锚定训练课程，首先强化并行主干，然后逐渐将优化转向因果修正的最终分布。在 Qwen3 模型上的实验表明，Domino 在 Transformers 后端下可实现高达 5.49 倍的端到端加速，在 SGLang 服务下可实现高达 5.8 倍的吞吐量加速。

Abstract

Speculative decoding accelerates LLM inference by drafting multiple tokens and verifying them in parallel with the target model. However, its practical speedup is constrained by the trade-off between draft quality and drafting cost: autoregressive drafters model causal dependencies among draft tokens but incur sequential overhead, while parallel drafters reduce drafting cost but weaken intra-block dependency modeling. In this paper, we propose Domino, a speculative decoding framework that decouples causal dependency modeling from expensive autoregressive draft execution. Domino first uses a parallel draft backbone to produce preliminary draft distributions for the entire block, and then applies a lightweight Domino head to refine them with prefix-dependent causal information. To stabilize teacher-forced causal encoding, we further introduce a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. Experiments on Qwen3 models show that Domino achieves up to 5.49x end-to-end speedup under the Transformers backend and up to 5.8x throughput speedup under SGLang serving.
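For readers unfamiliar with the draft-then-verify loop the abstract builds on, here is a toy sketch of one greedy verification step in generic speculative decoding. It is not Domino's method: `target_next` is a hypothetical stand-in for the target model's greedy next-token function, and the loop checks tokens one at a time for clarity, whereas real systems verify the whole draft in a single parallel forward pass.

```python
# Toy greedy speculative-decoding step (generic technique, not Domino itself).
# `target_next` is a hypothetical callable standing in for the target model.

def verify_draft(prefix, draft, target_next):
    """Accept the longest draft prefix the target agrees with, then append
    one token chosen by the target (a correction or a bonus token)."""
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the rejected draft token
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy target model: always continues the sequence 0, 1, 2, ...
def count_up(tokens):
    return len(tokens)

out = verify_draft([0, 1], [2, 3, 9, 5], count_up)
print(out)
```

The better the drafter matches the target, the more tokens each verification step yields, which is the draft quality versus drafting cost trade-off the abstract describes.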

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

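Scoring rationale: The paper's core is speculative decoding acceleration for LLMs, which is entirely unrelated to World Models, Visual Encoder, and model-based RL (0 points). Although it builds on Qwen3 and so touches MLLM/MultiModal scenarios, its focus is the decoding strategy rather than multimodal representation (3 points). Tokenizer is related but not central (2 points). Unify Models is not reflected (1 point). The weighted total is 13.5, below the pass score, so topical relevance is low.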

Keywords

Speculative Decoding, LLM Inference, Parallel Drafting, Causal Modeling, Inference Acceleration, Qwen3, Drafting Cost, Base-anchored Training

161. GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents [FAIL]

Score: 13.5 / 27.8

Authors: Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

Published: 2026-05-28

TL;DR: GRASP proposes a gated regression-aware skill proposer that validates skill edits to improve LLM agent reliability in structured environments, achieving significant performance gains on clinical benchmarks.

Abstract (Chinese translation)

在结构化环境中运行的大语言模型 (LLM) 代理往往在操作层面而非对话层面出现失效，其可靠性取决于对环境的程序性知识。先前的自我改进方法会累积自然语言指导，却不检查每一项新内容是否保留了先前正确的行为，因此，修复某条轨迹的注释可能会无声地导致另一条轨迹退化。我们提出 GRASP（门控回归感知技能提议器），将代理的改进视为对有界技能库的一系列编辑，仅当候选技能在平衡的保留探针上产生净改进且满足硬性回归预算时，才予以接纳。我们在两个基于 FHIR 的临床基准上，对五个基础模型（gpt-oss-120b、DeepSeek V4 Flash、Gemini 3.1 Flash Lite、GPT-4.1、GPT-5.4）评估了 GRASP。在 MedAgentBench 上，GRASP 将 gpt-oss-120b 的性能从 40.6% 提升至 88.8%，超越了五个自我改进基线中最强者 21.0 个百分点，并将其他所有基础模型的性能提升了 17.2 至 40.3 个百分点。消融实验表明，性能增益归功于比较提议生成、接受门机制以及硬性回归预算，而非技能书写本身；若无验证，技能书写的效果甚至不如不使用任何技能。该机制不仅泛化至临床领域之外，还在四个非临床环境中的三个上改进了代理表现，仅在动作空间为开放式时效果保持不变。冻结的技能库可在模型间迁移：来自更强模型的技能能提升较弱执行器的表现，使其超越其自身学习所得，反之则不成立；这种不对称性是任何无门控基线都无法复现的。

Abstract

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
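The acceptance rule described in the abstract (admit a candidate only if it yields a net improvement on a held-out probe under a hard regression budget) can be illustrated with a small sketch. The function name, the per-task boolean representation, and the exact inequality are assumptions made for illustration; the paper's actual gate may differ.

```python
# Sketch of a regression-aware acceptance gate.
# Assumed rule: accept iff net gain > 0 and regressions <= budget.

def accept_edit(before, after, regression_budget):
    """before/after: per-task pass (True) or fail (False) on the same held-out probe."""
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    net_gain = fixed - regressed
    return net_gain > 0 and regressed <= regression_budget

before = [True, True, False, False, False]
helpful = [True, True, True, True, False]   # fixes 2 tasks, breaks none
risky = [False, False, True, True, True]    # fixes 3 tasks, breaks 2

print(accept_edit(before, helpful, regression_budget=0))
print(accept_edit(before, risky, regression_budget=0))
```

The second candidate has a positive net gain but is still rejected under a zero regression budget, which is the case an ungated "append every note" strategy would miss.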

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

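Scoring rationale: The paper's core is skill self-improvement and regression testing for LLM agents. It does not involve tokenizers, visual encoders, or multimodal architectures, so those keywords score 0. Although it uses large language models (MLLM) and transfers skills across models (Unify Models), it neither builds a world model nor uses a conventional model-based RL algorithm, so relevance is low (scores of 2-3). The author list contains none of the specified experts, so no bonus applies. The weighted total is 13.5, below the dynamic pass score of 27.8.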

Keywords

LLM Agents, Skill Improvement, Regression Testing, Skill Library, Self-Improvement, Clinical Benchmarks, Gated Mechanism

162. Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering [FAIL]

Score: 13.5 / 27.8

Authors: Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng

Published: 2026-05-28

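TL;DR: This paper proposes CorVer, a lightweight corpus-grounded process reward that significantly improves factual question answering accuracy through reinforcement learning while training faster than neural-verifier baselines.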

Abstract (Chinese translation)

将强化学习（Reinforcement Learning）应用于提升知识密集型问答（Knowledge-Intensive Question Answering）的事实准确性，面临着奖励设计的困境。响应级奖励仅提供粗粒度监督，无法区分推理轨迹（Reasoning Trace）中正确与错误的陈述。句子级替代方案提供更细粒度的反馈，但通常依赖自然语言推理（NLI）验证器、大语言模型（LLM）评判器或知识验证管道，这些在强化学习（RL）规模下部署成本高昂，且对于稀有实体事实往往不可靠，而在此类事实中准确的奖励信号尤为重要。我们提出 CorVer（Corpus Verify），这是一种轻量级、即插即用的过程奖励，它用源自维基百科共现统计的语料库信号替换了神经验证器。CorVer 分配句子级信用（Credit），并通过简单的对齐将其映射到令牌级优势（Advantages），仅需一个 0.5B 大小的提取器以及每个句子一次语料库查找。在涵盖六个指令微调模型（3B 至 14B）和五个问答（QA）基准的 30 个（模型，基准）组合中，CorVer 在所有组合上均优于原始基线，在 TriviaQA 上的平均增益为 +4.1 个百分点（pp）。此外，在可行的配置下，它在 20 个组合中的 18 个上优于四种神经验证器基线，且训练速度快 4.8 至 8.4 倍。

Abstract

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
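The step of mapping sentence-level credit to token-level advantages can be pictured as a broadcast over token spans. The sketch below is only one plausible reading of "a simple alignment": the span format, the function name, and the choice to leave uncovered tokens at 0.0 are all assumptions, not details from the paper.

```python
# Sketch: broadcast sentence-level credit onto the tokens of each sentence.
# Span format and the 0.0 default for uncovered tokens are assumptions.

def token_advantages(num_tokens, sentence_spans, sentence_credit):
    """sentence_spans: list of (start, end) token index ranges, end exclusive.
    sentence_credit: one float per span, in the same order."""
    adv = [0.0] * num_tokens
    for (start, end), credit in zip(sentence_spans, sentence_credit):
        for i in range(start, end):
            adv[i] = credit
    return adv

spans = [(0, 3), (3, 5)]   # two sentences covering tokens 0-2 and 3-4
credit = [1.0, -0.5]       # e.g. one supported and one unsupported statement
adv = token_advantages(6, spans, credit)
print(adv)
```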

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

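Scoring rationale: The paper's core is reward design for reinforcement learning in factual question answering (CorVer), using corpus signals for process supervision. It is entirely unrelated to Visual Encoder, MultiModal, and World Models (0 points), since the task is text-only and no world model is built. It has some relation to Tokenizer and MLLM (2 points) through the token-level advantage mapping and the use of large language models. It has moderate relation to model-based RL (3 points) because it works within an RL framework, though the focus is reward modeling rather than environment-model prediction. Unify Models is weakly related (2 points), since the paper mainly unifies reward signals rather than model architectures. No specified experts were found, so no bonus applies.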

Keywords

Verifiable Rewards, Process Supervision, Factual Question Answering, Corpus-Grounded, Reinforcement Learning, Sentence-level Credit, Token-level Advantages

163. Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory [FAIL]

Score: 13.5 / 27.8

Authors: Youwang Deng

Published: 2026-05-28

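TL;DR: The paper proposes the Entity-Collision protocol to separate embedder performance from lexical leakage in agent-memory retrieval, and finds that MiniLM-384 dominates across collision degrees and tags while the 2.7x-larger BGE-large does not uniformly improve on it.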

Abstract (Chinese translation)

端到端代理 - 内存基准报告每个检索器单一的 hit@k，将词汇泄露（未受控制的查询、金标准及干扰项实体重叠）与标签混合（偏好、服务、工具平均混合）混淆在一起。我们提出 entity-collision（实体碰撞），这是一种系统无关的协议，通过构造固定 BM25 基线——即每个干扰项均共享答案的实体标记——并根据判别器标签对查询进行分层，因此任何相对于 BM25 的提升均可归因于嵌入器。将该协议应用于一个开源代理 - 内存测试平台，涵盖 5 个标签 × 3 个嵌入器 × 5 个碰撞程度，并采用配对自助法计算 95% CIs（置信区间），结果揭示了一种双轴模式：256 维哈希三元组仅在深度碰撞下的封闭词汇词汇标签上有效；MiniLM-384 在两个维度上均占优；而参数量为 MiniLM 2.7 倍的 BGE-large 并未在所有方面优于 MiniLM——它在意图风格查询上表现更好，但在词汇类查询上却逊色。仅凭编码器容量并非唯一的瓶颈。合成意图标签零基线在 LongMemEval（n=500）上复现了单会话偏好召回悬崖现象。LoCoMo 上的自适应向量权重路由经测量为零基线：尽管存在 11.7 个百分点的 oracle 余量，但我们测试的任何信号均无法恢复这一性能。所有 26 个结果表和 37 个复现代码脚本均经过版本控制并由公共注册表验证；该协议在确定性治理的内存测试平台（基于事件源决策日志，采用 DAG 状态机模式生命周期）上运行，因此每个报告的 CIs 均可从摄入流中字节级精确复现。

Abstract

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
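The abstract reports lift over BM25 with paired-bootstrap 95% CIs. As a generic illustration of that statistic (not the paper's scripts), the sketch below resamples queries with replacement and takes percentile bounds on the mean per-query difference. The function name, the hit vectors, the resample count, and the percentile method are all assumptions.

```python
import random

# Generic paired-bootstrap percentile CI for mean(scores_a - scores_b).
# All data and parameter choices here are illustrative assumptions.

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Resample query indices with replacement, keeping each (a, b) pair together."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-query hit@k indicators for an embedder and for BM25.
embedder_hits = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
bm25_hits = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
lo, hi = paired_bootstrap_ci(embedder_hits, bm25_hits)
print(f"95% CI for lift over BM25: [{lo:.2f}, {hi:.2f}]")
```

Pairing matters because both retrievers are scored on the same queries; resampling the differences keeps per-query difficulty from inflating the interval. Fixing the seed makes the interval reproducible across runs.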

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

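Scoring rationale: The paper focuses on an evaluation protocol for agent-memory retrieval (Entity-Collision) that aims to address lexical leakage and tag-mixing. Keyword relevance is low because the paper does not involve multimodality (MultiModal, Visual Encoder), world models (World Models), or unified model architectures (Unify Models). Tokenizer is relevant only through the construction of entity tokens (3 points); MLLM and model-based RL relate only indirectly through the agent context (2 points); Unify Models and World Models have very low relevance (1 point). No specified expert authors were found.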

Keywords

Entity-Collision, Agent Memory, Retrieval Lift, Embedder Evaluation, Lexical Leakage, Stratified Protocol, BM25

164. FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions [FAIL]

Score: 13.5 / 27.8

Authors: Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

Published: 2026-05-28

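TL;DR: FinGuard proposes a regulation-document-driven pipeline and a dedicated LLM to detect financial regulatory non-compliance in interactions, substantially outperforming baseline models on the benchmark.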

Abstract (Chinese translation)

随着大型语言模型（LLMs）在金融服务领域的日益普及，单次不合规交互都可能使机构面临监管处罚及直接的消费者损害。现有的防护模型主要围绕一般危害分类体系构建，却忽视了基于特定金融法规的违规行为。为填补这一空白，我们提出一个监管驱动的流程，该流程直接基于监管文件运作，旨在构建金融合规风险分类体系并合成基于监管依据的训练数据，而无需任何预定义的违规类别。我们将该流程应用于中国金融法规，发布了 FinGuard-Bench，据我们所知，这是首个面向金融监管合规检测的基准测试，其查询和响应层面均包含专家标注的标签。此外，我们还训练了 FinGuard，这是一个基于 Qwen3-8B 构建的金融合规检测模型，利用基于监管依据的数据，通过监督微调和自玩强化学习进行训练。在 FinGuard-Bench 上，FinGuard 显著优于所有基线模型，包括专用防护模型以及规模大得多的通用大型语言模型，例如 Qwen3.5-397B-A17B 和 GPT-5.1。此外，FinGuard 还保留了通用的安全能力，并且仅凭策略文档即可适应未见过的机构特定政策。我们将在 GitHub 上公开发布本工作中使用的代码、提示词及相关资源。

Abstract

As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release FinGuard-Bench, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train FinGuard, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard also preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

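Scoring rationale: The paper focuses on financial regulatory compliance detection using supervised fine-tuning and reinforcement learning; it does not involve the core design of multimodal architectures, visual encoders, tokenizers, or world models. Only because it uses reinforcement learning and LLMs does it have low relevance to MLLM, Unify Models, and model-based RL; it is entirely unrelated to Visual Encoder and MultiModal. The author list contains none of the specified experts.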

Keywords

Financial Regulatory Compliance, LLM Safety, Supervised Fine-tuning, Reinforcement Learning, Compliance Detection, Regulation-driven Pipeline, FinGuard-Bench

165. Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression [FAIL]

Score: 13.5 / 27.8

Authors: Danylo Boiko, Viktoriia Mishkurova

Published: 2026-05-28

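TL;DR: This paper proposes a treatment-conditioned diffusion framework that uses DaTscan images and medication dose to forecast neurodegenerative disease progression, significantly improving clinical fidelity.

Abstract Translation (Chinese)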

预测神经退行性疾病（如帕金森病）的进展对于有效的长期规划和个性化治疗干预至关重要。现有系统通常输出忽略纵向神经影像丰富结构的标量临床评分，而传统生成方法则面临解剖细节丢失及细微进展模式模糊的问题。为此，我们提出了一种新颖的治疗条件扩散框架，通过以患者筛查时的 DaTscan 图像及一年内的左旋多巴当量日剂量为条件进行生成，从而预测高保真的未来脑状态。该流程采用基于 Transformer 的编码器来表示非线性、时间依赖的药理动力学，并通过多权重感兴趣区域掩码优化生成过程，该掩码专注于生物学关键区域。实验评估表明，我们的框架保持了清晰的解剖边界，相对于基线显著提高了临床保真度，分别实现了 MSE 降低 14.0%、MAE 降低 7.2% 以及 SSIM 提高 4.9%。

Abstract

Forecasting the progression of neurodegenerative diseases, such as Parkinson's disease, is essential for effective long-term planning and personalized therapeutic intervention. Existing systems typically produce scalar clinical scores that ignore the rich structure of longitudinal neuroimaging, while traditional generative approaches lose anatomical detail and blur subtle progression patterns. To address this, we introduce a novel treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year. The pipeline uses a Transformer-based encoder to represent non-linear, time-dependent pharmacological dynamics and optimizes generation through a multi-weight region-of-interest mask that focuses on biologically critical areas. Experimental evaluation shows that our framework maintains sharp anatomical boundaries and significantly improves clinical fidelity relative to the baseline, achieving 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM.
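The abstract does not say how the multi-weight region-of-interest mask enters the objective. One plausible reading, sketched here purely as an assumption, is a per-pixel weighted reconstruction error in which pixels inside the region of interest count more. The function name, the toy 2x3 "images", and the weight of 4 are all hypothetical.

```python
def roi_weighted_mse(pred, target, weights):
    """Mean squared error where each pixel's error is scaled by a region weight."""
    num = 0.0
    den = 0.0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            num += w * (p - t) ** 2
            den += w
    return num / den

# Toy 2x3 images (hypothetical values, not data from the paper).
target = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
pred = [[0.0, 0.5, 0.0], [0.2, 1.0, 0.0]]

# Hypothetical mask: the middle column is the region of interest (weight 4).
roi_mask = [[1.0, 4.0, 1.0], [1.0, 4.0, 1.0]]
flat_mask = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

weighted = roi_weighted_mse(pred, target, roi_mask)
uniform = roi_weighted_mse(pred, target, flat_mask)
print(f"uniform MSE={uniform:.4f}, ROI-weighted MSE={weighted:.4f}")
```

With these toy values the error inside the region dominates the weighted score, which is the behaviour such a mask is presumably meant to produce.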

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0
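
The rows above combine into the entry's headline score as a weighted sum (weight times relevance per keyword), which is then compared with the pass score. A minimal sketch of that arithmetic follows; the relevance values are copied from the table, while treating a score equal to the threshold as a pass is an assumption, since the report does not state the comparison rule.

```python
WEIGHT = 1.5        # per-keyword weight, as listed in the table
PASS_SCORE = 27.8   # dynamic pass score shown in the entry header

relevance = {
    "Unify Models": 0.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 2.0,
    "World Models": 3.0,
    "MLLM": 0.0,
    "MultiModal": 4.0,
    "model-based RL": 0.0,
}

def weighted_total(scores, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in scores.values())

total = weighted_total(relevance)
# Assumption: ">=" counts as a pass; the report only shows FAIL cases here.
status = "PASS" if total >= PASS_SCORE else "FAIL"
print(f"Score: {total:.1f} / {PASS_SCORE} -> {status}")
```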

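Scoring rationale: The paper focuses on diffusion-based forecasting for medical imaging, a significant domain mismatch with the given keywords. 'Unify Models', 'Tokenizer', 'MLLM', and 'model-based RL' have no direct presence in the paper (relevance 0). 'MultiModal' is moderately relevant (4) because the method combines images with clinical data. 'Visual Encoder' is weakly relevant (2): the paper uses a Transformer encoder, but not as a core visual-encoder component. 'World Models' is weakly relevant (3): the method conceptually generates state dynamics, but it is not a world model in the usual sense. None of the designated experts appears in the author list, so no bonus applies. The weighted total is 13.5, below the dynamic pass score of 27.8.

Keywords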

Treatment-Conditioned Diffusion, Neurodegenerative Disease Progression, DaTscan Images, Transformer-based Encoder, Generative Model, Clinical Fidelity, Region-of-Interest Mask

166. Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents [FAIL]

Score: 12.0 / 27.8

Authors: Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu

Published: 2026-05-28

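TL;DR: This paper proposes Metacognitive Memory Policy Optimization (MMPO), which improves long-horizon LLM agents by penalizing memory summaries that induce high epistemic uncertainty.

Abstract Translation (Chinese)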

记忆增强型大语言模型（LLM）智能体通过递归地将交互轨迹总结为紧凑记忆，从而应对复杂的长周期任务。然而，现有方法通常使用基于结果的强化学习（Reinforcement Learning）来训练这些记忆策略，无法定位中间记忆质量下降的具体环节。随着交互的进行，模糊的递归摘要逐渐丢弃任务相关语义并引入语义噪声。这加剧了信念偏差，掩盖了智能体对潜在任务状态的估计，最终导致长周期推理失败。因此，我们认为记忆优化不应仅关注轨迹级成功，更应关注中间摘要所诱导的信念清晰度。为此，我们引入了信念熵（Belief Entropy），这是一个自监督代理指标，用于探测模型在当前记忆下对潜在任务状态的不确定性程度。基于此代理指标，我们提出了元认知记忆策略优化（MMPO）。与仅依赖稀疏的基于结果的信号不同，MMPO 通过显式惩罚诱导高认知不确定性的摘要，提供细粒度的、特定于记忆的监督。实验表明，MMPO 在多种长周期任务上始终优于现有方法，即使在扩展到 175 万 token 上下文时，仍能保持 97.1% 的性能。

Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
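
The abstract describes Belief Entropy only as a proxy for how uncertain the model remains about the latent task state. As a rough illustration, and not the paper's definition, the sketch below uses plain Shannon entropy over a hypothetical belief distribution and subtracts it from an outcome reward. The function names, the penalty weight of 0.1, and the two example beliefs are assumptions.

```python
import math

def belief_entropy(probs):
    """Shannon entropy (in nats) of a belief over candidate task states."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def shaped_reward(outcome_reward, belief, penalty_weight=0.1):
    """Outcome reward minus a penalty for summaries that leave the belief diffuse."""
    return outcome_reward - penalty_weight * belief_entropy(belief)

# Hypothetical beliefs induced by two different memory summaries.
sharp_summary_belief = [0.97, 0.01, 0.01, 0.01]
vague_summary_belief = [0.25, 0.25, 0.25, 0.25]

for name, belief in [("sharp", sharp_summary_belief), ("vague", vague_summary_belief)]:
    print(name, round(belief_entropy(belief), 3), round(shaped_reward(1.0, belief), 3))
```

Both summaries reach the same outcome reward here, yet the vaguer one is scored lower, which is the kind of memory-specific signal the abstract contrasts with sparse outcome-only rewards.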

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	5.0/10	7.5

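Scoring rationale: The paper is mainly about optimizing memory policies for long-horizon LLM agents, using reinforcement-learning concepts (such as belief entropy and epistemic uncertainty) to improve memory-summary quality. It does not address model unification (Unify Models), tokenizer design (Tokenizer), or visual encoders (Visual Encoder), so those keywords score 0. Its treatment of latent task states and beliefs is weakly connected to World Models, but building a world model is not its core, hence a score of 3. It involves reinforcement learning and internal state estimation, which gives some connection to model-based RL (score 5). It is explicitly about LLMs rather than multimodal models (MLLM/MultiModal), so those score 0. The weighted total is 12.0, below the dynamic pass score of 27.8, indicating low relevance to the given keyword topics. None of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appears in the author list.

Keywords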

LLM Agents, Memory Optimization, Policy Optimization, Long-Horizon Tasks, Belief Entropy, Epistemic Uncertainty, Meta-Cognitive, Reinforcement Learning

167. Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage [FAIL]

Score: 12.0 / 27.8

Authors: Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya

Published: 2026-05-28

TL;DR: This paper reveals that LLM providers can systematically inflate billing counts by exploiting tokenization ambiguity and hidden execution, creating a trust paradox that requires external verification to ensure honest billing.

Abstract Translation (Chinese)

按 token 计费现已成为商业大语言模型（LLMs）的标准定价模型，因此报告 token 数量的真实性直接影响用户所支付的费用。我们指出，这种计费模式在设计上就难以审计：提供商为了保护知识产权（IP）、防范越狱攻击（Jailbreaks）并保护用户隐私，会隐藏模型、分词器（Tokenizer）及执行过程，这意味着审计方只能检查提供商所提供的证明。因此，审计过程简化为对提供商自身报告的一致性校验。我们将此称为信任悖论（Trust Paradox）：每一次审计都必须信任某些凭证，但当前框架所信任的恰恰是提供商最有动机去操纵的那些凭证。我们研究了三种近期提出的 token 审计框架，并表明具备普通商业能力的提供商可以系统性地虚增计费 token 数。在最宽松的情境下，隐藏的推理用量平均可被虚增 1,469% 而不被察觉。在当前前沿推理价格下，这使得同一查询的真实 100 美元账单变为约 1,569 美元。即使用户能看到完整的推理文本，仅分词歧义这一因素即可导致在检测阈值以下出现 50.85% 的虚报。这些结果表明，问题不在于任何特定的审计方，而在于任何证据来源于被审计方的审计机制。恢复诚实计费将需要一种验证机制，该机制将报告的 token 数与提供商无法控制的证据绑定，例如可信执行证明（Trusted Execution Attestation）、推理密码学证明（Cryptographic Proofs of Inference）或第三方重新执行（Third-party Re-execution）。

Abstract

Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts directly affects what users pay. We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies. The audit therefore reduces to a consistency check on the provider's own reports. We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate. We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts. In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection. At current frontier reasoning prices, that turns a $100 honest bill into roughly a $1,569 bill on the same query. Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold. These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party. Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.
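
The dollar figures in the abstract follow from simple percentage arithmetic, and the tokenization-ambiguity claim rests on the fact that different token sequences can decode to the same string. The sketch below works through both. The 1,469% and 50.85% figures come from the abstract; applying 50.85% to a 100-dollar bill, the toy vocabulary, and the function names are illustrative assumptions, and the example assumes the whole bill is subject to inflation.

```python
def inflated_bill(honest_bill, inflation_pct):
    """Bill after token counts are over-reported by `inflation_pct` percent."""
    return honest_bill * (1 + inflation_pct / 100)

# 1,469% inflation of hidden reasoning usage (figure from the abstract).
hidden_case = inflated_bill(100.0, 1469)
# 50.85% over-reporting from tokenization ambiguity, applied to the same
# bill purely for illustration (the abstract gives no dollar figure here).
visible_case = inflated_bill(100.0, 50.85)

# Toy tokenization ambiguity over a hypothetical vocabulary: both sequences
# decode to the same text but would be billed as different token counts.
VOCAB = {"unbelievable", "un", "believ", "able"}

def decodes_to(tokens, text):
    """True if every token is in the vocabulary and the tokens concatenate to text."""
    return all(t in VOCAB for t in tokens) and "".join(tokens) == text

canonical = ["unbelievable"]
padded = ["un", "believ", "able"]

print(round(hidden_case, 2), round(visible_case, 2))
print(len(canonical), len(padded))
```

A user who sees only the decoded string cannot tell from the text alone which of the two segmentations was actually produced.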

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper addresses LLM billing security and token inflation, unrelated to model unification, visual encoders, world models, multimodal learning, or RL. Only 'Tokenizer' is highly relevant (score 8) due to its role in the billing fraud mechanism. Other keywords are irrelevant (score 0). No expert authors from the list are present.

Keywords

Token Inflation, Token Billing, LLM Security, Tokenization Ambiguity, Audit Paradox, Trusted Execution, Overcharging

168. Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence [FAIL]

Score: 12.0 / 27.8

Authors: Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

Published: 2026-05-28

TL;DR: This paper proposes HetMedAgent, a heterogeneous multi-agent framework that synergizes generalist LLMs and specialist models to outperform single-model approaches in clinical decision-making, validating the irreplaceable value of specialist models.

Abstract Translation (Chinese)

GPT 和 Claude 等通用型大语言模型（LLMs）在医疗领域表现卓越，引发了一个关键问题：领域特定的医学专家模型是否会因此被淘汰？我们认为，医疗人工智能（AI）的未来不在于构建单体式医学基础模型，也不在于取代人类专家，而在于协调通用型 LLMs、领域特定专家模型与临床医生之间的协作。我们提出了 HetMedAgent，这是一种异构医学多智能体框架，能够实现冲突感知的证据融合、基于不确定性的临床医生干预触发以及自适应阈值校准。在三个真实世界临床决策任务上的实验表明，通用型 LLMs 与领域特定专家模型之间的协同作用显著优于单独使用任一类型的模型，这验证了专家模型在模态特定分析中不可替代的价值。HetMedAgent 代表了从构建医学 LLMs 或基础模型向多智能体协作的转变，实现了通用推理能力与领域特定精度之间的平衡。

Abstract

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.
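
To make the terms "conflict-aware evidence fusion" and "uncertainty-based clinician intervention triggering" concrete, here is a deliberately simple sketch. It is not HetMedAgent's algorithm, which the abstract does not specify: the averaging rule, both thresholds, and the function name are assumptions, and adaptive threshold calibration is not modelled.

```python
def fuse(generalist_p, specialist_p, conflict_threshold=0.3, uncertainty_band=0.15):
    """Combine two positive-class probabilities and decide whether to escalate.

    Returns (decision, fused_probability). The decision is "refer_to_clinician"
    when the two models disagree strongly or the fused estimate is near 0.5.
    """
    fused = (generalist_p + specialist_p) / 2
    conflict = abs(generalist_p - specialist_p) > conflict_threshold
    uncertain = abs(fused - 0.5) < uncertainty_band
    if conflict or uncertain:
        return "refer_to_clinician", fused
    return ("positive" if fused >= 0.5 else "negative"), fused

# Hypothetical probabilities from a generalist LLM and a specialist model.
print(fuse(0.90, 0.85))  # both agree, confident
print(fuse(0.90, 0.20))  # strong disagreement
print(fuse(0.55, 0.50))  # agreement, but close to the decision boundary
```

The design point this illustrates is that escalation depends on the relationship between the two models' outputs, not on either output alone.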

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on a heterogeneous multi-agent framework for medical AI, discussing the collaboration between generalist LLMs and specialist models. It critically addresses the concept of Unify Models (foundation models) but argues against them, hence moderate relevance. It touches upon MultiModal and MLLM concepts indirectly through medical context and LLM usage, but does not cover Tokenizers, Visual Encoders, World Models, or Model-Based RL. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found. The weighted total score is 12.0, which is below the dynamic pass score of 27.8.

Keywords

Heterogeneous Multi-Agent Paradigm, Medical Artificial Intelligence, Specialist Models, Generalist LLMs, Clinical Decision-Making, Conflict-Aware Evidence Fusion, Domain-Specific Analysis

169. CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution [FAIL]

Score: 12.0 / 27.8

Authors: Milan Straka

Published: 2026-05-28

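TL;DR: This paper presents CorPipe 26, a system that jointly predicts multilingual coreference mentions, links, and empty nodes in a single model, and that took first place in the CRAC 2026 shared task.

Abstract Translation (Chinese)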

我们介绍了 CorPipe 26，这是我们在 CRAC 2026 多语言共指解析共享任务中提交的获胜系统。该共享任务的第五届主要侧重于生成式大语言模型（LLM）与专用系统的比较；此外，还引入了 5 个新数据集和 2 种新语言。CorPipe 26 是 CorPipe 25 的改进版本，包含一个新变体，该变体能够在单一模型中同时预测空节点、提及和共指链接。我们的系统在 LLM 赛道中优于所有其他提交系统 2.8 个百分点，在无约束赛道中优于所有提交系统 9.5 个百分点。此外，我们进行了一系列消融实验，涉及不同模型规模、空节点预测方法以及跨语言零样本评估。源代码和训练好的模型在 https://github.com/ufal/crac2026-corpipe 上公开可用。

Abstract

We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percentage points and all submissions in the unconstrained track by 9.5 percentage points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation. The source code and the trained models are publicly available at https://github.com/ufal/crac2026-corpipe.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

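Scoring rationale: The paper handles multilingual coreference with a single model, so 'Unify Models' scores 5; LLMs implicitly involve a tokenizer, so 'Tokenizer' scores 3. The paper is text-only, with no vision, world models, multimodality (only multilinguality), or reinforcement learning, so the remaining keywords score 0. The weighted total is 12.0, below the dynamic pass score of 27.8; the topic does not match.

Keywords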

Multilingual Coreference Resolution, Empty Nodes, Cross-Lingual Transfer, Generative LLM, Single Model, Shared Task, Ablation Experiments

170. Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking [FAIL]

Score: 12.0 / 27.8

Authors: Songbo Hu, Yinhong Liu, Ej Zhou, Evgeniia Razumovskaia, Xiaobin Wang, Alexander Fraser, Ivan Vulić, Anna Korhonen

Published: 2026-05-28

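TL;DR: This paper introduces HEALTHDIAL, a large-scale multilingual spoken dialogue dataset grounded in WHO content, and reveals consistent performance disparities across languages on dialogue tasks.

Abstract Translation (Chinese)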

构建口语对话数据集在方法论上具有挑战性，而当目标是在大规模上构建多语言、多平行数据集时，这些挑战会被进一步放大。本文介绍了 HEALTHDIAL，这是一个大规模、多语言和多平行数据集，用于开发和评估基于检索增强生成（RAG）的口语对话系统。该数据集包含 6,000 个信息寻求对话（每种语言 1,500 个），这些对话基于世界卫生组织（WHO）的可信内容，并包含来自四种 WHO 官方语言（阿拉伯语、中文、英语和西班牙语）具有不同方言的母语者录制的 163 小时用户语音。每位说话者均标注了人口统计学（如性别、年龄）和社会语言学（如主要语言、原籍地区）变量。我们在关键对话任务上报告了基准结果，结果显示语言之间存在一致的性能差异，即使是高资源语言也是如此。为支持未来研究，我们发布了该数据集、一个原型系统以及用于数据收集和系统评估的工具包。

Abstract

Creating spoken dialogue datasets is methodologically challenging, and these challenges are amplified when the goal is to build multilingual, multi-parallel datasets at scale. This work introduces HEALTHDIAL, a large-scale, multilingual, and multi-parallel dataset for developing and evaluating retrieval-augmented generation (RAG)-based spoken dialogue systems. The dataset comprises 6,000 information-seeking dialogues (1,500 per language) grounded in trusted content from the World Health Organization (WHO) and 163 hours of user speech recorded from native speakers of diverse dialects across four official WHO languages: Arabic, Chinese, English, and Spanish. Each speaker is annotated with demographic (e.g., gender, age) and sociolinguistic (e.g., primary language, region of origin) variables. We report benchmark results across key dialogue tasks, which reveal consistent performance disparities across languages, even among high-resource ones. To support future research, we release the dataset, a prototype system, and a toolkit for data collection and system evaluation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

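Scoring rationale: The paper's core is the construction and benchmarking of the HEALTHDIAL multilingual spoken dialogue dataset, with emphasis on data collection, annotation, and RAG system evaluation. It has no direct connection to unified model architectures (Unify Models), visual encoders (Visual Encoder), reinforcement learning (model-based RL), or world models (World Models), so those scores are low. It involves multilingual text (weakly related to MLLM) and the combination of speech and text (weakly related to MultiModal), but does not touch on core model architecture. The weighted total is 12.0, below the dynamic pass score of 27.8. None of the designated experts (Yang Shi and others) was found in the author list.

Keywords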

Spoken Dialogue, Multilingual, Dataset, Knowledge-Grounded, Information Seeking, RAG, Multi-Parallel, WHO Content

171. Recovering Diversity Without Losing Alignment: A DPO Recipe for Post-Trained LLMs [FAIL]

Score: 12.0 / 27.8

Authors: Vinay Samuel, Yapei Chang, Mohit Iyyer

Published: 2026-05-28

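TL;DR: This paper proposes REDIPO, which constructs preference pairs to restore the output diversity of LLMs during post-training while preserving alignment performance.

Abstract Translation (Chinese)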

许多开放式指令存在多个有效答案，用户从中受益，但后训练通常会将大语言模型（LLM）的输出空间收窄至一小套规范响应。我们提出 REDIPO，一种离线 DPO 数据构建流水线，旨在恢复不同的有效答案模式，同时保留指令微调模型的对齐优势。针对每个提示，REDIPO 分别从基座模型和指令微调模型采样响应，使用指令模型重写基座模型的响应，根据安全性与指令遵循质量过滤候选项，并构建偏好对，优先选择在指令遵循奖励相似的候选项中边际多样化的响应。在 Qwen3-4B、OLMo-3-7B 和 LLaMA-3.1-8B 上，相对于指令微调检查点，REDIPO 使 NoveltyBench distinct_k 分别提升了 134%、33% 和 44%，而 DivPO 在同一模型上的多样性变化分别为 0%、-6% 和 -4%。这些提升在很大程度上保持了 MTBench、IFEval 和 Arena-Hard 的性能，并降低了 HarmBench 直接类别攻击成功率。消融实验表明，边际多样性对选择和基座响应重写驱动了多样性增益，而过滤和质量受限配对有助于保持对齐。总体而言，我们的结果表明，通过精心构建的偏好数据，可以重新引入来自基座模型生成的多样化有效答案，同时保留后训练的对齐优势。我们在 https://github.com/vsamuel2003/RiDiPO 上发布我们的代码与数据。

Abstract

Many open-ended instructions have multiple valid answers that users can benefit from seeing, but post-training often narrows an LLM's output space toward a small set of canonical responses. We introduce REDIPO, an offline DPO data-construction pipeline for recovering distinct valid answer modes while preserving the alignment benefits of the instruct model. For each prompt, REDIPO samples responses from both base and instruct models, rewrites base-model responses with the instruct model, filters candidates for safety and instruction-following quality, and builds preference pairs that favor marginally diverse responses among candidates with similar instruction-following reward. Across Qwen3-4B, OLMo-3-7B, and LLaMA-3.1-8B, REDIPO improves NoveltyBench distinct_k by 134%, 33%, and 44% relative to the instruct checkpoints, while DivPO changes diversity by 0%, -6%, and -4% on the same models. These gains largely maintain MTBench, IFEval, and Arena-Hard performance, and reduce direct-category HarmBench attack success rate. Ablations show that marginal-diversity pair selection and base-response rewriting drive the diversity gains, while filtering and quality-bounded pairing help maintain alignment. Overall, our results show that diverse valid answers from base-model generations can be reintroduced through carefully constructed preference data while retaining the alignment benefits of post-training. We release our code and data at https://github.com/vsamuel2003/RiDiPO.
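
The pair-construction step described above (prefer the more diverse response among candidates of similar reward) can be sketched as follows. This is an illustration under stated assumptions, not the released REDIPO code: diversity is approximated by word-level Jaccard distance, "similar reward" by a fixed band below the best reward, and all names, rewards, and example texts are hypothetical.

```python
def jaccard_distance(a, b):
    """1 minus the word-set overlap of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(wa & wb) / len(wa | wb)

def marginal_diversity(idx, texts):
    """Average distance from texts[idx] to every other candidate."""
    others = [t for j, t in enumerate(texts) if j != idx]
    return sum(jaccard_distance(texts[idx], o) for o in others) / len(others)

def build_pair(candidates, reward_band=0.1):
    """candidates: list of (text, reward). Returns (chosen, rejected) or None."""
    best = max(r for _, r in candidates)
    # Keep only candidates whose reward is close to the best one.
    texts = [t for t, r in candidates if best - r <= reward_band]
    if len(texts) < 2:
        return None
    order = sorted(range(len(texts)), key=lambda i: marginal_diversity(i, texts))
    return texts[order[-1]], texts[order[0]]  # most diverse wins, least diverse loses

candidates = [
    ("Try a lentil soup with cumin", 0.82),
    ("Try a lentil soup with garlic", 0.80),
    ("Grill halloumi skewers and peppers", 0.78),
    ("soup", 0.30),  # low reward: filtered out before pairing
]
pair = build_pair(candidates)
print(pair)
```

Restricting the comparison to a reward band is what keeps the preference signal about diversity rather than about quality.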

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

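Scoring rationale: The paper's core is diversity recovery for text LLMs via DPO post-training; it has no direct connection to visual encoders, world models, or multimodal architectures, so relevance is low. Unify Models has some relevance because data from the base and instruct models is combined. The tokenizer is only an implicit component. MLLM and model-based RL are faintly related through the model type and the preference-learning technique. None of the designated experts appears in the author list, so no bonus applies. The weighted total is 12.0, below the dynamic pass score of 27.8.

Keywords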

Diversity Recovery, DPO, Post-Trained LLMs, Alignment Preservation, Preference Pairs, Instruction Following, Base Model Rewriting

172. CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation [FAIL]

Score: 12.0 / 27.8

Authors: Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun, Runhua Xu, Jianxin Li

Published: 2026-05-28

TL;DR: CRITIC-R1 proposes a reinforcement learning-based structured critic framework to diagnose and correct errors in Retrieval-Augmented Generation, improving answer quality on QA benchmarks.

Abstract Translation (Chinese)

检索增强生成（RAG）通过整合外部证据来提升知识密集型问答任务的表现。然而，现有的 RAG 方法仍存在幻觉和细微推理错误的问题。近期研究引入外部评论者（critics）以精炼 RAG 输出，但它们往往提供粗粒度且结构松散的反馈，表现出过度激进的干预，导致嘈杂且不可靠的精炼效果，从而限制了其在修正方面的有效性。为了解决这些问题，我们提出 CRITIC-R1，这是一种结构化评论者框架，利用强化学习（RL）将 RAG 批评形式化为一个显式的错误诊断问题并进行学习。该框架将常见的 RAG 错误划分为多个诊断维度，包括判决（verdict）、错误位置（error location）、推理分析（reasoning analysis）和修复生成（fix generation）。为了习得这些能力，我们设计了两个奖励函数：保守判断对齐（CJA）首先鼓励经过校准的高层判断，同时缓解过度激进现象；而诊断质量对齐（DQA）则通过门控奖励进一步提升细粒度的诊断反馈。我们采用基于 GRPO 的强化学习方法训练评论者模型，并利用从外部大语言模型（LLM）教师模型收集的过程级监督信号。在五个问答基准上的实验表明，CRITIC-R1 能够一致性地提升答案质量，优于强 RAG 基线。我们的源代码可在 https://anonymous.4open.science/r/critic-r1-FCB0 获取。

Abstract

Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0
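
Two ideas from the abstract lend themselves to a short sketch: a gated reward, where the fine-grained diagnostic term only counts once the high-level verdict is right, and the group-relative advantage that GRPO-style training computes by normalizing rewards within a group of sampled outputs. The gating rule, the 0.5 weight, and the function names are assumptions for illustration; the paper's CJA and DQA rewards are not specified in the abstract, and the conservative (anti-over-aggressive) aspect of CJA is not modelled here.

```python
import statistics

def critic_reward(verdict_correct, diagnostic_quality, dqa_weight=0.5):
    """Verdict reward plus a diagnostic term that is gated on a correct verdict."""
    verdict_term = 1.0 if verdict_correct else 0.0
    diagnostic_term = diagnostic_quality if verdict_correct else 0.0  # the gate
    return verdict_term + dqa_weight * diagnostic_term

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical critiques sampled for the same RAG answer.
rewards = [
    critic_reward(True, 0.8),
    critic_reward(True, 0.2),
    critic_reward(False, 0.9),  # good-looking diagnosis, wrong verdict: no credit
    critic_reward(False, 0.0),
]
advantages = group_relative_advantages(rewards)
print([round(r, 2) for r in rewards])
print([round(a, 2) for a in advantages])
```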

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

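Scoring rationale: The paper focuses on learning structured critics for retrieval-augmented generation (RAG), using reinforcement learning for error diagnosis. There is a significant domain gap with the keyword set (multimodality, world models, visual encoders). 'model-based RL' receives the highest relevance here (3) because RL techniques are involved, although the method is in fact model-free policy optimization; 'Unify Models' and 'MLLM' are weakly relevant (2) because the work involves LLMs and framework integration; 'Tokenizer' is used only implicitly (1); 'Visual Encoder', 'World Models', and 'MultiModal' are entirely unrelated (0). The weighted total is 12.0, below the dynamic pass score of 27.8, and none of the designated experts was found.

Keywords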

Retrieval-Augmented Generation, Reinforcement Learning, Structured Critic, Error Diagnosis, Reward Function, GRPO, LLM, QA Benchmarks

173. Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions [FAIL]

Score: 12.0 / 27.8

Authors: Volodymyr Ovcharov

Published: 2026-05-28

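TL;DR: Multi-Legal-Bench builds a cross-jurisdictional legal benchmark and finds that cross-lingual transfer in LLMs depends more on label-set alignment and model architecture than on language proximity or tokenizer efficiency.

Abstract Translation (Chinese)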

绝大多数法律自然语言处理（Legal NLP）基准测试仅评估单一语言，或聚合了在不同司法管辖区存在根本性差异的任务，从而导致跨语言比较无法进行。我们提出了 Multi-Legal-Bench，这是首个跨司法管辖区的法律基准测试，它在六个国家（乌克兰、法国、荷兰、波兰、捷克共和国、立陶宛）、四个语系以及 1.34 亿份法院判决上评估相同的任务。该基准定义了五个任务：法院类型分类、判决形式分类、案件结果预测、法律规范提取以及案由分类预测，这些任务映射自国家法院登记处的结构化元数据，从而形成一个刻意设计的稀疏 5x6 任务 - 管辖区矩阵（30 个单元格中填充了 20 个）。我们通过 AWS Bedrock 在零样本（zero-shot）和 3 样本（3-shot）提示下评估了 7 个前沿大语言模型（LLM），并额外使用了 4 个小型/中型模型（30 亿至 120 亿参数）进行扩展性分析。我们的结果表明：（1）在乌克兰司法管辖区发现的依赖任务的少样本效应（few-shot effects）在所有司法管辖区中均得到复现；（2）没有任何单一模型在所有语言中占据主导地位，排名会随着任务和司法管辖区的变化而转移；（3）跨语言少样本迁移（cross-lingual few-shot transfer）并不遵循语言亲缘性：UA->FR（罗曼语族，-2.1 个百分点）的迁移效果优于 UA->PL（斯拉夫语族，-13.7 个百分点），且标签集对齐（label-set alignment）预测迁移质量的能力优于语系分类；（4）分词器生育率（tokenizer fertility），尽管存在 2.3 倍的差异，并不能显著预测跨语言准确率（r=-0.27, p=0.14），这表明模型架构和预训练数据主导了分词器的效率表现。我们公开了所有数据、提示词及模型预测结果。

Abstract

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language, as rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.
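
Tokenizer fertility is commonly measured as the average number of tokens a tokenizer produces per word, and the abstract relates it to accuracy through a Pearson correlation. The sketch below shows both computations on obviously synthetic inputs. The fixed-width chunk tokenizer is a stand-in, not any real model's tokenizer, and no benchmark numbers are reproduced.

```python
import math

def fertility(text, tokenize):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return sum(len(tokenize(w)) for w in words) / len(words)

def chunk_tokenizer(size):
    """Hypothetical tokenizer that cuts each word into fixed-width pieces."""
    return lambda word: [word[i:i + size] for i in range(0, len(word), size)]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

sample = "court decided appeal"
coarse = fertility(sample, chunk_tokenizer(6))  # fewer, longer pieces
fine = fertility(sample, chunk_tokenizer(3))    # more, shorter pieces
print(round(coarse, 3), round(fine, 3))

# Synthetic sanity checks for the correlation helper.
print(round(pearson_r([1, 2, 3], [2, 4, 6]), 3), round(pearson_r([1, 2, 3], [3, 2, 1]), 3))
```

A higher fertility means the same text costs more tokens; the abstract's finding is that this cost did not significantly predict accuracy in their setting.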

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	6.0/10	9.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

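Scoring rationale: The paper is mainly about legal NLP benchmarking and cross-lingual LLM evaluation. It is moderately relevant only to Tokenizer (it discusses the effect of tokenizer fertility on cross-lingual accuracy) and entirely unrelated to visual encoders, world models, MLLM, multimodality, and model-based reinforcement learning. Relevance to Unify Models is low (it evaluates multiple models rather than proposing a unified architecture). The weighted total is 12.0, below the dynamic pass score of 27.8, indicating low relevance to the given multimodal and reinforcement-learning topics.

Keywords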

Legal NLP, LLM Evaluation, Cross-jurisdictional, Legal Reasoning, Tokenizer Fertility, Cross-lingual Transfer, Benchmarking

174. Minimal Prompt Perturbations Lead to Code Vulnerabilities: Prompt Fragility and Hidden-State Signals in Coding LLMs [FAIL]

Score: 12.0 / 27.8

Authors: Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic

Published: 2026-05-28

TL;DR: The study reveals that minimal prompt perturbations can introduce security vulnerabilities in LLM-generated code, where input-handling flaws are more predictable than secure-defaults flaws based on hidden-state signals.

Abstract Translation (Chinese)

基于 LLM 的编程助手正被迅速采用，显著提升了开发者的生产力。随着组织越来越多地部署这些代理生成的代码，代码的安全性变得至关重要。先前研究表明，轻微的 Prompt 扰动会降低 LLM 生成代码的功能正确性，但它们是否也会损害代码安全性这一问题尚未得到研究。我们在三个模型和五种编程语言上对 Prompt 应用 Token 级变异，结果表明，即使是一个字符变化的微小变异也能使生成的代码从安全状态变为易受攻击状态。探测模型的隐藏状态揭示出，这种脆弱性部分编码在 Prompt 表示中，但分布并不均匀。Input-handling vulnerabilities（输入处理漏洞，即模型省略了验证或清理操作）比 Secure-defaults vulnerabilities（安全默认值漏洞，即不安全代码源于局部选择，如弱算法或不安全参数）更可预测（平均 AUC 分别为 0.753 和 0.674）。这些结果表明，LLM 辅助编码的威胁模型不仅限于 Prompt Injection（提示词注入），还扩展到了普通的 Prompt 变异；并且表明，Input-handling vulnerabilities 可以在生成前被捕获，而 Secure-defaults vulnerabilities 则需要解码过程中的干预。

Abstract

LLM-based coding assistants are seeing rapid adoption, offering substantial gains in developer productivity. As organizations increasingly ship code these agents produce, the security of that code becomes critical. Prior work has shown that minor prompt perturbations degrade the functional correctness of LLM-generated code, but whether they also compromise code security has remained unstudied. We apply token-level mutations to prompts across three models and five programming languages, and show that mutations as small as a single-character change can flip generated code from secure to vulnerable. Probing the models' hidden states reveals that this fragility is partially encoded in prompt representations, but unevenly so. Input-handling vulnerabilities, where the model omits validation or sanitization, are more predictable (mean AUC 0.753) than secure-defaults vulnerabilities, where insecure code stems from one local choice such as a weak algorithm or unsafe parameter (mean AUC 0.674). These results show that the threat model for LLM-assisted coding extends beyond prompt injection to ordinary prompt variation, and indicate that input-handling flaws can be caught before generation while secure-defaults flaws require intervention during decoding.
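
The smallest perturbation the abstract mentions is a single-character change. A minimal way to enumerate such variants of a prompt is sketched below. The paper's actual mutation operators are token-level and are not described in the abstract, so this character-substitution generator and its name are assumptions for illustration only.

```python
import string

def single_char_mutations(prompt, alphabet=string.ascii_lowercase):
    """Yield every prompt that differs from `prompt` by one substituted character."""
    for i, original in enumerate(prompt):
        for ch in alphabet:
            if ch != original:
                yield prompt[:i] + ch + prompt[i + 1:]

# Tiny example with a three-letter alphabet so the output stays readable.
mutants = list(single_char_mutations("ab", alphabet="abc"))
print(mutants)

# Each mutant has the same length and differs in exactly one position.
for m in mutants:
    assert len(m) == 2 and sum(x != y for x, y in zip(m, "ab")) == 1
```

In a study like the one described, each mutant would be sent to the model and the generated code checked for vulnerabilities; that evaluation step is outside this sketch.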

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on LLM code security and prompt fragility, showing moderate relevance to Tokenizer (due to token-level mutation analysis) and weak relevance to MLLM/Unify Models (as it analyzes LLMs). It is largely unrelated to Visual Encoder, World Models, MultiModal, and model-based RL since the study involves text/code generation without vision, environment modeling, multimodality, or reinforcement learning.

Keywords

Prompt Perturbations, Code Vulnerabilities, LLM Security, Hidden-State Signals, Token-level Mutations, Prompt Fragility, Input-handling vulnerabilities

175. HTAM: Hierarchical Transition-Attended Memory for Operator Optimization [FAIL]

Score: 12.0 / 27.8

Authors: Yining Zhang, Mingyang Yi, Chen Wang, Xuwen Xiang, Tianhe Jia, Zedong Dan, Chengqing Zong, Yue Wang

Published: 2026-05-28

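TL;DR: HTAM proposes a hierarchical transition-attended memory framework that addresses the granularity mismatch in LLM-based GPU operator optimization, improving correctness, fast-solution rate, and speedup on KernelBench.

Abstract Translation (Chinese)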

高性能 GPU 内核对于 LLM（大语言模型）的高效部署至关重要，但优化它们仍然是专家密集型任务。最近，基于 LLM 的代码生成使得自动 GPU 算子生成成为可能，但算子优化仍然是一个硬件感知的搜索问题。现有的基于 LLM 的方法面临粒度不匹配的问题：粗粒度提示虽然可重用但难以执行，而细粒度记忆虽然可操作但会扩大搜索空间并掩盖优化瓶颈。因此，关键挑战在于以适当的粒度组织优化经验。为了解决这一问题，本文提出了 HTAM（层次化转换注意力记忆），这是一种用于基于 LLM 算子优化的由粗到细的框架。HTAM 构建了一个两级层次化转换图（HTG），用于组织粗粒度全局方向、细粒度局部策略以及优化步骤之间的转换经验。在每个演化步骤中，HTAM 从当前状态和近期优化历史中选择一个全局方向，检索相应的局部策略记忆，并使用它来指导具体的 CUDA 代码生成。在完整的 KernelBench 套件上的实验表明，HTAM 相对于基于 LLM 的基线一致地提高了正确性、快速求解率和加速比，而后端和 Robust-KBench 研究则表明结构化记忆具有可迁移的收益。

Abstract

High-performance GPU kernels are essential for efficient LLM deployment, yet optimizing them remains expertise-intensive. Recent LLM-based code generation makes automatic GPU operator generation promising, but operator optimization remains a hardware-aware search problem. Existing LLM-based methods face a granularity mismatch: coarse hints are reusable but hard to execute, whereas detailed memories are actionable but enlarge the search space and obscure optimization bottlenecks. The key challenge is therefore to organize optimization experience at an appropriate granularity. To address this issue, this paper proposes HTAM (Hierarchical Transition-Attended Memory), a coarse-to-fine framework for LLM-based operator optimization. HTAM builds a two-level Hierarchical Transition Graph (HTG) to organize coarse global directions, detailed local strategies, and transition experience between optimization steps. During each evolution step, HTAM selects a global direction from the current state and recent optimization history, retrieves the corresponding local strategy memory, and uses it to guide concrete CUDA code generation. Experiments on the full KernelBench suite demonstrate that HTAM consistently improves correctness, fast-solution rate, and speedup over LLM-based baselines, while backend and Robust-KBench studies indicate transferable benefits from structured memory.
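
The select-then-retrieve loop described above can be pictured as a small two-level data structure: coarse directions with transition statistics on top, detailed strategy notes underneath. The class below is only a sketch of that shape, not HTAM's Hierarchical Transition Graph. The class name, the "cumulative gain" selection rule, and the example direction and strategy strings are all hypothetical.

```python
from collections import defaultdict

class TwoLevelMemory:
    """Coarse directions with transition gains, each holding local strategy notes."""

    def __init__(self):
        self.local = defaultdict(list)  # direction -> list of strategy notes
        # previous direction -> next direction -> cumulative observed gain
        self.transitions = defaultdict(lambda: defaultdict(float))

    def record(self, prev_direction, direction, strategy, gain):
        self.local[direction].append(strategy)
        self.transitions[prev_direction][direction] += gain

    def select_direction(self, prev_direction):
        """Pick the next direction with the highest recorded gain, if any."""
        options = self.transitions.get(prev_direction)
        if not options:
            return None
        return max(options, key=options.get)

    def retrieve(self, direction):
        """Return the detailed strategies stored under a coarse direction."""
        return list(self.local.get(direction, []))

memory = TwoLevelMemory()
memory.record("start", "memory_coalescing", "transpose tile in shared memory", 0.4)
memory.record("start", "loop_unrolling", "unroll inner loop by 4", 0.1)
memory.record("memory_coalescing", "kernel_fusion", "fuse bias add into epilogue", 0.3)

direction = memory.select_direction("start")
print(direction, memory.retrieve(direction))
```

In the framework the abstract describes, the retrieved strategies would then be placed in the prompt that guides CUDA code generation; that step is not shown.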

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

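Scoring rationale: The paper's topic is LLM-based GPU operator optimization (HPC/code), a severe domain mismatch with the provided keywords (multimodality, world models, visual encoders). Unify Models and World Models receive low scores only for structural similarity (hierarchical memory versus unified architectures or world modeling). Tokenizer, MLLM, and MultiModal are not relevant because the paper uses standard text LLMs and has no multimodal components. model-based RL scores low because the optimization search resembles reinforcement learning but is not explicitly model-based RL.

Keywords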

HTAM, Hierarchical Transition-Attended Memory, GPU Kernel Optimization, LLM-based Code Generation, Hierarchical Transition Graph, CUDA Code Generation, KernelBench

176. User-Aware Active Knowledge Acquisition for Emotional Support Dialogue [FAIL]

Score: 12.0 / 27.8

Authors: Mufan Xu, Kehai Chen, Jiahao Hu, Xinchao Xu, Muyun Yang, Tiejun Zhao, Min Zhang

Published: 2026-05-28

TL;DR: This paper proposes a User-Aware Active Knowledge Acquisition framework for emotional support dialogue that leverages uncertainty estimation to improve user alignment and dialogue quality, outperforming existing baselines.

Abstract Translation (Chinese)

情感支持在对话系统中起着至关重要的作用，其成功依赖于在多轮交互中适应用户演变且隐式的需求，同时利用大型语言模型（Large Language Models, LLM）强大的推理能力。然而，由于用户需求的信号通常微弱且间接，且只能通过多轮交互进行消歧，现有的情感支持方法往往难以高效获取并泛化相关的对话知识。为弥合这一差距，我们提出了用户感知的主动知识获取（User-Aware Active Knowledge Acquisition, UKA），这是一种无梯度的主动对话学习框架，该框架显式表示对用户需求的不确定性，并将主动学习机制同时融入知识获取与回复选择过程。我们提出了一种基于心智理论（Theory-of-Mind）的不确定性估计机制，该机制使模型能够优先选择回复，从而获取更具信息价值的用户反馈。UKA 能够在训练过程中高效探索用户对齐的对话知识，同时在测试阶段保持鲁棒性。在多个对话基准及不同模型架构上的实验表明，我们的方法在对话质量和用户对齐方面始终优于强基线。

Abstract

Emotional support plays an important role in dialogue systems, and its success depends on adapting to a user's evolving and implicit needs across multi-turn interactions while leveraging the strong reasoning capacity of large language models. However, since signals about user needs are often weak, indirect, and can only be disambiguated through multi-turn interaction, existing emotional support methods often struggle to acquire and generalize relevant conversational knowledge efficiently. To bridge this gap, we introduce User-Aware Active Knowledge Acquisition (UKA), a gradient-free active dialogue learning framework that explicitly represents uncertainty about user needs and incorporates active learning into both knowledge acquisition and response selection. We propose a Theory-of-Mind uncertainty estimation mechanism that allows the model to prioritize responses, thereby eliciting more informative user feedback. UKA is capable of efficiently exploring user-aligned conversational knowledge during training while maintaining robustness at test time. Experiments across multiple dialogue benchmarks and model architectures demonstrate that our approach consistently outperforms strong baselines in dialogue quality and user alignment.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on emotional support dialogue using active learning and uncertainty estimation. It lacks visual encoders, multimodal inputs, and explicit tokenizer/RL components. While it uses LLMs (MLLM) and models user state (World Models), the core focus is text-based dialogue, resulting in low relevance for multimodal and RL-specific keywords. Unify Models is somewhat relevant (3/10) due to framework integration. No expert authors from the specified list were found in the author list.

Keywords

Emotional Support Dialogue, Active Knowledge Acquisition, User-Aware, Uncertainty Estimation, Theory-of-Mind, Gradient-free Learning, User Alignment, Large Language Models

177. Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment [FAIL]

Score: 12.0 / 27.8

Authors: Ruoxi Su, Yuhan Liu, Jingyu Hu

Published: 2026-05-28

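TL;DR: This paper proposes an adaptive interviewing framework that improves LLM persona simulation by grounding predictions in user-specific evidence; on moral-dilemma decisions, follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%).

Abstract (Chinese translation)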

准确模拟特定个体的决策对大语言模型（LLMs）而言仍具挑战性，部分原因在于人设（Persona）信息通常以静态描述的形式提供，这些描述缺失了个体层面的决策模拟所需的价值观、经验和情境线索。我们提出了一种自适应面试框架，通过结构化三阶段对话收集与人设相关的信息：核心问题、动态追问以及综合人设总结。利用生成的面试转录文本，我们评估了大语言模型能否模拟参与者在道德困境场景中的决策。我们比较了三种对话情境——Core-10 响应、完整面试对话以及总结的人设表征。我们发现，自适应面试的作用更像是一种选择性 grounding（依据）机制，而非统一的准确率提升器：在完整面试轨迹中，约 40% 的轨迹纳入了基于追问的证据，而这些基于追问的证据做出的预测比仅基于核心的预测更准确（45.5% vs 39.3%）。这些发现表明，仅靠更丰富的人设情境是不够的：只有当模型实际上基于用户特定证据做出决策时，改进才会出现。

Abstract

Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts -- Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

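Scoring rationale: The paper focuses on LLM persona simulation and evidence-grounded reasoning, an NLP topic that is a poor match for the supplied vision, multimodal, and reinforcement-learning keywords. Visual Encoder, MultiModal, and MLLM are nearly irrelevant (0-1 points); Tokenizer, Unify Models, World Models, and model-based RL are only marginally relevant (1-2 points). None of the designated expert authors appear in the author list. The weighted total is 12.0, well below the dynamic passing score of 27.8.

Keywords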

Adaptive Interviewing, Persona Simulation, LLMs, Evidence-Grounded Reasoning, Decision Alignment, Moral Dilemma, Dynamic Follow-ups, Persona Representation

178. SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow [FAIL]

Score: 12.0 / 27.8

Authors: Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang

Published: 2026-05-28

TL;DR: SURGENT is a surgical multi-agent assistance system utilizing Tree-of-Thought planning and memory management to enhance perioperative decision-making, outperforming baseline LLMs in clinical tasks.

Abstract (Chinese translation)

现代外科护理的复杂性要求智能系统能够综合广泛的病历记录，支持协作决策，并在整个围手术期流程中提供透明、可审计的推理。尽管基于网络的大语言模型（LLMs）具备先进的推理能力，但由于存在关键局限性，包括输入长度限制、记忆管理不完整以及可追溯性有限，它们并不适合外科应用。为了解决这一问题，我们提出了 SURGENT，一种手术多智能体辅助系统，该系统结合了思维树规划器（Tree-of-Thought）、多部门协作智能体以及基于临床指南和生物医学文献的检索增强推理（Retrieval-Augmented Reasoning）。SURGENT 采用了一种新颖的记忆设计，能够同时管理长期患者病史和短期工作摘要，从而实现更完整、更具情境化且一致的推理。在五个关键围手术期任务（包括病例分析、手术计划模拟、安全监控、并发症风险评估和康复指导）上的实验评估表明，SURGENT 优于基线大语言模型（LLMs）和现有的医疗多智能体框架，其生成的建议与患者病史更为吻合。消融研究进一步凸显了 DeepSeek 作为本地部署骨干模型的优势，实现了无需依赖集中式服务的隐私保护部署。这些结果表明，SURGENT 是迈向智能、公平且安全的外科辅助系统的一项实用且可信的进展。

Abstract

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper presents SURGENT, a surgical multi-agent system using LLMs for clinical reasoning and workflow management. It focuses on memory design, Tree-of-Thought planning, and retrieval-augmented reasoning. There is minimal overlap with the provided keywords, which target multimodal foundation models (Visual Encoder, MultiModal), world models, tokenization, and reinforcement learning. The paper does not discuss these technical components, resulting in low relevance scores.

Keywords

Surgical Multi-Agent System, Perioperative Workflow, Tree-of-Thought Planner, Retrieval-Augmented Reasoning, Long-term Memory, DeepSeek Backbone, Clinical Guidelines, Patient Records

179. Learnable Assessment Skills for LLM-based Automated Scoring: Rubric Construction via Iterative Optimization [FAIL]

Score: 12.0 / 27.8

Authors: Yun Wang, Xin Xia, Xuansheng Wu, Xiaoming Zhai, Ninghao Liu

Published: 2026-05-28

TL;DR: This paper proposes an iterative framework to learn assessment skills for LLM-based automated scoring, eliminating the need for expert-written rubrics and improving scoring performance across multiple items.

Abstract (Chinese translation)

基于大语言模型（LLM）的自动评分方法已达到接近人类水平的性能，但在扩展到新任务时，仍受限于上游阶段（如评分标准构建）中基于条目的手动配置。人类专家通过丰富的实践经验发展出的评估启发式规则绕过这一瓶颈。我们探究 LLM 能否直接从评分经验中学习类似的启发式规则，并将此形式化为“评估技能”的概念：这是一种条目无关的自然语言程序性知识，引导 LLM 完成评分工作流中的特定阶段。以评分标准构建作为首次实例化，我们提出一个迭代框架，将一个技能分解为固定支架和可学习的条目无关规则，并通过基于 LLM 的评分错误诊断与验证门控选择来优化规则。该框架无需专家编写的评分标准。在所有十个 ASAP-SAS 条目上，优化后的技能显著提升了基于 LLM 的评分，并经常超越数据集提供的专家评分标准。跨条目迁移实验进一步揭示，学习到的技能既捕捉了可泛化的模式，也捕捉了条目特定的模式。

Abstract

LLM-based automated scoring approaches near-human performance, but scaling to new tasks remains bottlenecked by the per-item human configuration of upstream stages such as rubric construction. Human experts bypass this bottleneck through evaluation heuristics developed over extensive practice. We ask whether LLMs can learn similar heuristics directly from scoring experience, and formalize this as the concept of assessment skills: item-independent natural-language procedural knowledge that guides LLMs through specific stages of the scoring workflow. Focusing on rubric construction as a first instantiation, we propose an iterative framework that decomposes a skill into a fixed scaffold and learnable item-agnostic rules, refining the rules through LLM-driven diagnosis of scoring errors and validation-gated selection. The framework requires no expert-written rubric. On all ten ASAP-SAS items, optimized skills substantially improve LLM-based scoring and frequently surpass the dataset-provided expert rubric. Cross-item transfer experiments further reveal that learned skills capture both generalizable and item-specific patterns.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on LLM-based automated scoring and rubric construction via iterative optimization. It lacks visual components (Visual Encoder: 0, MultiModal: 1) and involves neither world modeling (World Models: 0) nor reinforcement learning (model-based RL: 1). Tokenizer is a standard implicit component (Tokenizer: 1). Unify Models loosely applies to workflow unification (Unify Models: 2). MLLM is partially relevant due to LLM usage but lacks multimodal input (MLLM: 3). No listed expert authors are found, so no bonus points are added. The weighted total is 12.0, below the dynamic passing score of 27.8, indicating low relevance to the provided keyword set.
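The arithmetic behind the scoring tables in this report can be reproduced with a few lines. The sketch below is a reconstruction from the table columns (score = weight x relevance, summed over keywords), using the relevance values listed for this paper; the function name, the uniform weight argument, and the use of `>=` for the pass check are assumptions, since the report does not state how ties at the threshold are handled.

```python
# Hypothetical reconstruction of the report's scoring arithmetic.
PASS_THRESHOLD = 27.8  # "dynamic passing score" quoted in the report


def weighted_total(relevance, weight=1.5):
    """Sum weight * relevance over all keywords (relevance is on a 0-10 scale)."""
    return sum(weight * value for value in relevance.values())


# Relevance values copied from the scoring table of paper 179.
relevance_179 = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 3.0,
    "MultiModal": 1.0,
    "model-based RL": 1.0,
}

total = weighted_total(relevance_179)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

With seven keywords at weight 1.5, the maximum attainable total is 105, so a 27.8 threshold corresponds to an average relevance of roughly 2.6 out of 10 per keyword.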

Keywords

LLM-based Automated Scoring, Rubric Construction, Iterative Optimization, Assessment Skills, Item-independent Rules, Natural Language Procedural Knowledge, Validation-gated Selection

180. DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents [FAIL]

Score: 12.0 / 27.8

Authors: Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang

Published: 2026-05-28

TL;DR: DynSess proposes a unified session-level evaluation and optimization framework for role-playing agents: DynSess-Eval scores complete sessions with long-horizon rubrics, and its rewards drive lookahead-search trajectory construction and the training of DynSess-Character via DSPO (off-policy) and GSRPO (on-policy).

Abstract (Chinese translation)

基于大语言模型的扮演本质上是一个会话级任务，要求智能体在长多轮对话中维持角色身份与交互质量。然而，现有的评估和优化方法大多仍局限于轮次级，无法捕捉长程质量。我们提出 DynSess，一个用于扮演智能体的统一会话级框架。DynSess-Eval 通过针对长程行为的评分标准对完整对话会话进行评分。利用其会话级奖励，我们通过多轮前瞻搜索构建高质量训练轨迹，并采用两种互补变体训练 DynSess-Character：DSPO（离策略）和 GSRPO（在策略）。实验表明，DynSess-Eval 与人类判断的一致性显著优于先前评估器，盲测人类评估进一步显示，DynSess-Character 尽管使用显著更少的参数，仍能媲美最强的角色模型，同时保持强角色一致性和交互能力。我们将发布数据集和代码以促进未来研究。

Abstract

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on text-based role-playing agents using RL and session-level evaluation, showing low relevance to multimodal components (Visual Encoder, MultiModal, Tokenizer) and generative world models. It shares conceptual ground with 'Unify Models' (unified framework) and 'model-based RL' (RL with lookahead planning), but lacks direct alignment with the specific technical definitions of the provided keywords regarding multimodality and world modeling. No expert authors from the target list are found in the author list.

Keywords

Role-playing Agents, Session-Level Evaluation, Reinforcement Learning, LLM Alignment, Multi-turn Search, Character Consistency, Optimization Framework, Dynamic Framework

181. Supercharging Thermal Gaussian Splatting with Depth Estimation [FAIL]

Score: 12.0 / 27.8

Authors: Manoj Biswanath, Chenxin Cai, Hannah Schieber, Daniel Roth, Benjamin Busam

Published: 2026-05-28

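TL;DR: This paper proposes Thermal-to-Depth Gaussian Splatting (TDg), which derives radiance fields from thermal images and depth estimation alone, and reports a 55% training-time improvement and slightly better rendering quality than the MSMG baseline.

Abstract (Chinese translation)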

高效且鲁棒的 3D 场景表示在自动驾驶、机器人学及相关领域中至关重要。虽然 RGB 图像为 3D 重建提供了有价值的信息，但热成像或深度等其他模态可提供关于环境的额外信息。近期，像 3D 高斯泼溅（3D Gaussian Splatting）这样的新视角合成方法已开始使用多种模态以进一步提升其性能。然而，融合或结合多模态数据可能会使过程变慢，并带来额外挑战。因此，本项目旨在基于热红外域采用单一模态，尽可能减少对可见光的依赖。这种单一模态预计速度更快，因为它不依赖多模态数据。我们提出了一种方法，热到深度高斯泼溅（TDg），该方法仅在其架构中使用热图像和深度估计来构建辐射场。我们的 TDg 方法在测试数据集 RGBT-Scenes 和 ThermalMix 上的大多数情况下优于多单模态高斯（MSMG）基线。平均而言，TDg 的渲染质量指标，如学习感知图像块相似性（LPIPS）、结构相似性指数度量（SSIM）和峰值信噪比（PSNR），分别比基线 MSMG 值优 1.12%、0.034% 和 0.01%。此外，该方法还显著减少了训练时间，缩短了 12 分 47 秒（提升了 55%）。总体而言，我们的方法成功构建了这些热辐射场，最终可应用于多个领域，例如在监控、搜索或救援行动中识别关键热源，以及在温度被广泛用于监测机器的工业检测中。

Abstract

Efficient and robust 3D scene representation is crucial in autonomous driving, robotics, and related fields. While RGB images provide valuable content for 3D reconstruction, other modalities like thermal or depth can enable additional information on the environment. Lately, novel view synthesis methods like 3D Gaussian Splatting have started using multiple modalities to further boost their performance. But fusing or combining multimodal data can make the process slower and can bring in additional challenges. Therefore, our project aims to use single modality based on thermal infrared domain, by removing the reliance on visible light as much as possible. This single modality can be expected to be faster as it does not rely on multimodal data. We propose a method, Thermal-to-Depth Gaussian Splatting (TDg), that uses only thermal images and depth estimation in its architecture to derive the radiance fields. Our TDg method outperforms the MSMG (Multiple Single-Modal Gaussians) baseline in most cases on our test datasets, RGBT-Scenes and ThermalMix. On average, the rendering quality metrics such as learned perceptual image patch similarity (LPIPS), structural similarity index measure (SSIM), and peak signal-to-noise ratio (PSNR) of TDg are 1.12%, 0.034%, and 0.01% better than the baseline MSMG values. It also reduces the training time significantly, by 12 mins 47 secs (55% improvement). Overall, our method is successful in deriving these thermal radiance fields, which can ultimately have several applications, such as identifying heat sources critical in surveillance, search or rescue operations, and industrial inspections where temperature is widely used to monitor machines.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0

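Scoring rationale: The paper focuses on 3D scene representation with thermal Gaussian Splatting and depth estimation, and is unrelated to MLLM, Tokenizer, World Models, and model-based RL (0 points). Unify Models has low relevance (1 point) because the paper does not address unifying model architectures. Visual Encoder has low relevance (2 points) because the method relies mainly on Gaussian fields rather than a conventional encoder. MultiModal has moderate relevance (5 points) because the method combines thermal imagery with depth data, although it emphasizes reducing reliance on visible-light multimodal input. None of the designated experts were found.

Keywords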

Thermal Gaussian Splatting, Depth Estimation, 3D Scene Representation, Thermal Images, Radiance Fields, Multimodal Fusion, Autonomous Driving

182. SwInception -- Local Attention Meets Convolutions [FAIL]

Score: 12.0 / 27.8

Authors: David Hagerman, Roman Naeem, Jakob Lindqvist, Carl Lindström, Fredrik Kahl, Lennart Svensson

Published: 2026-05-28

TL;DR: SwInception enhances Swin Transformers for medical volumetric segmentation by integrating Inception blocks into feed-forward layers to improve local multi-scale feature reasoning and reduce overfitting.

Abstract (Chinese translation)

稀疏视觉变换器（Sparse Vision Transformers）作为医学体积分割的高效编码器已广受欢迎，其中 Swin 已成为突出的选择。Swin 通过局部注意力机制降低计算复杂度，在许多任务上取得了优异的性能，但在小数据集上仍倾向于过拟合。为了缓解这一弱点，我们提出了一种新颖的架构，通过在前馈层中引入 Inception 模块（Inception blocks），进一步增强 Swin 的归纳偏置。这些多分支卷积的引入使得在变换器块内能够对局部、多尺度特征进行更直接的推理。此外，我们还修改了解码器层，旨在以更少的参数捕捉更精细的细节。通过广泛的实验，我们在十一个不同的医学数据集上展示了性能的提升。我们特别展示了在医学分割十项全能（Medical Segmentation Decathlon）和颅骨外（Beyond the Cranial Vault）等基准挑战上，相对于先前最先进骨干网络的进展。通过证明 Swin 中现有的归纳偏置可以进一步改进，我们的工作为增强稀疏视觉变换器在医学及自然图像分割任务中的能力提供了一条有前景的途径。代码及预训练权重可在 https://github.com/Eiphodos/SwInception 获取。

Abstract

Sparse vision transformers have gained popularity as efficient encoders for medical volumetric segmentation, with Swin emerging as a prominent choice. Swin uses local attention to reduce complexity and yields excellent performance for many tasks but still tends to overfit on small datasets. To mitigate this weakness, we propose a novel architecture that further enhances Swin's inductive bias by introducing Inception blocks in the feed-forward layers. The introduction of these multi-branch convolutions enables more direct reasoning over local, multi-scale features within the transformer block. We have also modified the decoder layers in order to capture finer details using fewer parameters. We demonstrate a performance improvement on eleven different medical datasets through extensive experimentation. We specifically showcase advancements over the previous state-of-the-art backbones on benchmark challenges like the Medical Segmentation Decathlon and Beyond the Cranial Vault. By showing that the existing inductive bias in Swin can be further improved, our work presents a promising avenue for enhancing the capabilities of sparse vision transformers for both medical and natural image segmentation tasks. Code and pre-trained weights can be accessed at https://github.com/Eiphodos/SwInception.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper proposes SwInception, modifying Swin Transformers for medical segmentation using Inception blocks. It does not address Unify Models, Tokenizers, World Models, MLLM, MultiModal integration, or Model-based RL. Only 'Visual Encoder' is relevant as Swin serves as a visual backbone for segmentation tasks.

Keywords

Swin Transformer, Local Attention, Convolutions, Medical Volumetric Segmentation, Inception Blocks, Inductive Bias, Multi-scale Features, Feed-forward Layers

183. Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation [FAIL]

Score: 12.0 / 27.8

Authors: Boyuan Zhang, Huanshan Huang, Yifei Cao

Published: 2026-05-28

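TL;DR: This paper proposes a single-pass method that fuses a NECO-style geometric ratio with a logit-based Energy score for pixel-wise OOD detection in semantic segmentation, reaching an AUROC of 0.8539 on miniMUAD without repeated forward passes.

Abstract (Chinese translation)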

移动机器人的可靠语义分割需要在分布偏移下同时实现准确的密集预测和稳健的不确定性估计。强大的不确定性基线方法（如蒙特卡洛 Dropout）通常需要进行多次随机前向传播，难以部署在边缘平台上。我们提出了一种能量感知 NECO，这是一种用于语义分割的单次前向像素级分布外（OOD）检测器。该方法结合了从解码器特征计算得到的中心化 NECO 风格几何比率与基于 logit 的能量得分。这两个分量均使用在纯分布内验证集上拟合的统计量进行标准化，并通过凸组合进行融合。我们在 miniMUAD 子集上使用真实的像素级 OOD 标签对该方法进行了评估。提出的混合得分实现了 0.8539 的 AUROC，优于仅使用 NECO（0.8280）、仅使用 Energy（0.8171）以及集成预测熵基线（0.8124）。额外的定性和操作点分析表明，混合检测器在保持单次前向设计效率优势的同时，改进了整体排序性能。代码可在 https://github.com/boyuan-zhangx/Energy-Aware_NECO 获取。

Abstract

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
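The fusion step described in the abstract (standardize each component with in-distribution validation statistics, then take a convex combination) can be sketched as follows. This is an illustrative toy, not the authors' code: the NECO ratio is taken as a precomputed input rather than derived from decoder features, the mixing weight `alpha`, the function names, and the sample values are hypothetical, and the sign convention (larger means more in-distribution) is an assumption.

```python
import math
from statistics import mean, pstdev


def neg_energy(logits, temperature=1.0):
    """T * logsumexp(logits / T); larger values suggest in-distribution pixels."""
    scaled = [value / temperature for value in logits]
    top = max(scaled)  # subtract the max for numerical stability
    return temperature * (top + math.log(sum(math.exp(s - top) for s in scaled)))


def fit_stats(values):
    """Mean and standard deviation fitted on in-distribution validation scores."""
    sigma = pstdev(values)
    return mean(values), (sigma if sigma > 0 else 1.0)


def fused_score(neco, logits, neco_stats, energy_stats, alpha=0.5):
    """Convex combination of the two standardized scores (alpha must be in [0, 1])."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    z_neco = (neco - neco_stats[0]) / neco_stats[1]
    z_energy = (neg_energy(logits) - energy_stats[0]) / energy_stats[1]
    return alpha * z_neco + (1.0 - alpha) * z_energy


# Made-up in-distribution validation pixels: NECO ratios and class logits.
val_neco = [0.80, 0.90, 0.85, 0.95]
val_logits = [[6.0, 0.0, 0.0], [5.0, 1.0, 0.0], [7.0, 0.0, 1.0], [6.0, 1.0, 1.0]]
neco_stats = fit_stats(val_neco)
energy_stats = fit_stats([neg_energy(row) for row in val_logits])

# One confident pixel and one pixel with flat logits and a low NECO ratio.
id_score = fused_score(0.90, [6.0, 0.0, 1.0], neco_stats, energy_stats)
ood_score = fused_score(0.40, [1.0, 1.0, 1.0], neco_stats, energy_stats)
print(id_score, ood_score)
```

Standardizing before mixing keeps either component from dominating merely because of its numeric scale, which is presumably why the abstract fits the statistics on a pure in-distribution split.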

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

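Scoring rationale: The paper focuses on single-pass OOD detection in semantic segmentation; its core contributions are the fusion of NECO with the Energy score and efficiency for edge deployment. The supplied keywords (Unify Models, Tokenizer, World Models, MLLM, MultiModal, model-based RL) mainly target large models and reinforcement learning and have very little connection to this paper's computer-vision and uncertainty-estimation topic. A visual encoder is a basic component of segmentation but is not where the paper innovates. The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords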

Semantic Segmentation, Out-of-Distribution Detection, Energy Score, NECO, Single-Pass, Uncertainty Estimation, Mobile Robots

184. SLAD: Shared LoRA Adapters for Task Specific Distillation [FAIL]

Score: 12.0 / 27.8

Authors: Reda Bensaid, Yassir Bendou, Vincent Gripon, François Leduc-Primeau

Published: 2026-05-28

TL;DR: SLAD proposes shared LoRA adapters to enhance feature alignment in task-specific distillation, improving both teacher and student model performance in resource-constrained environments.

Abstract (Chinese translation)

在嵌入式系统等资源受限环境中，将缩减规模的基础模型（Foundation Models）适配到下游任务已变得越来越流行。这催生了新兴的特定任务蒸馏（task-specific distillation）范式，即同一基础模型的大版本和小版本均被适配到同一个下游任务，旨在将知识从前者转移至后者。近期工作已证明，使用同一基础模型的大版本来辅助小版本适配具有显著优势。通常，较大的模型（教师模型）首先通过微调（fine-tuning）或线性探测（linear probing）进行适配，随后其知识被蒸馏至较小的模型（学生模型）。尽管微调教师模型通常会提升其性能，但近期工作表明，对其进行线性探测反而能实现对学生模型更优的知识蒸馏效果。我们的研究发现，这主要是由于教师模型与学生模型之间的特征表示存在错位，而这种错位发生在教师模型微调期间。受现有工作保留先前习得知识的启发，我们首先提出利用低秩适配（low-rank adaptation），从而实现更好的特征对齐，进而实现更优的知识转移。基于这一洞察，我们进一步通过联合训练期间两个编码器之间适配器的参数共享策略来增强特征对齐。我们提出的方法 SLAD 在教师模型与学生模型之间实现了更好的特征对齐，这不仅提升了学生模型的性能，也提升了教师模型的性能，且训练速度是微调的 2 倍。通过在多个分类与分割数据集上的广泛实验，我们展示了该方法在准确性和迁移效率上的提升，并在特定任务蒸馏框架中实现了最先进的性能。

Abstract

In the context of resource-constrained environments such as embedded systems, adapting reduced-size foundation models to downstream tasks has become increasingly popular. This has recently motivated the emerging setting of task-specific distillation, where a larger and a smaller version of the same foundation model are both adapted to the same downstream task, with the goal of transferring knowledge from the former to the latter. Recent work has demonstrated the benefits of using a larger version of the same foundation model to assist the adaptation of a smaller one. Typically, the larger model (teacher) is first adapted via fine-tuning or linear probing before its knowledge is distilled into the smaller model (student). While fine-tuning the teacher often increases its performance, recent work showed that probing it leads to better knowledge distillation to the student. Our findings show that this is mainly due to a mis-alignment in feature representation between the teacher and the student which occurs during the teacher's fine-tuning. Inspired by existing efforts to preserve previously learned knowledge, we first propose leveraging low-rank adaptation, resulting in better feature alignment and therefore better knowledge transfer. Drawing from this insight, we further enhance the feature alignment through a parameter-sharing strategy of the adapters between the two encoders during joint training. Our proposed method, SLAD, shows better feature alignment between the teacher and student, which results in increased performance for not only the student but also the teacher model, while being 2x faster to train than fine-tuning. Through extensive experiments on multiple classification and segmentation datasets, we demonstrate the improved accuracy and transfer efficiency of our method, achieving state-of-the-art performance in the task-specific distillation framework.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

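Scoring rationale: The paper focuses on task-specific knowledge distillation with LoRA adapters and has low relevance to the world-model, reinforcement-learning, and multimodal-architecture keywords (0-1 points). Visual Encoder has some relevance (3 points) because segmentation tasks are involved; Unify Models and Tokenizer have low relevance (1-2 points). None of the target experts (Yang Shi et al.) were found. The weighted total is 12.0, below the dynamic passing score of 27.8.

Keywords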

Task Specific Distillation, Shared LoRA Adapters, Knowledge Transfer, Feature Alignment, Foundation Models, Low-Rank Adaptation, Classification and Segmentation

185. Anchorless Diversification for Parallel LLM Ideation [FAIL]

Score: 10.5 / 27.8

Authors: Fares Nabil Ibrahim, Nafis Saami Azad, Raiyan Abdul Baten

Published: 2026-05-28

TL;DR: This paper proposes anchorless inference-time strategies such as semantic direction stratification to diversify LLM-generated creative ideas without relying on seed anchors, achieving superior diversity-quality-compute trade-offs.

Abstract (Chinese translation)

大语言模型（LLMs）正被广泛用于生成候选创意池，此类创意任务的价值在于广泛的探索。并行推理在此场景中颇具吸引力，因为它能在扩大池子规模的同时保持质量和成本效益。我们研究了候选池多样化的推理时控制，探讨无锚点方法能否媲美那些依赖于已观察到的种子创意的方法。在三个创意任务族中，我们在中性及参照群体发散指令下，比较了独立生成和语义方向分层方法与自锚点、同伴锚点及代表性锚点基线。参照群体发散是一种强大且低成本的基线，它在增加语义多样性的同时保留了质量代理指标。语义方向分层效果更佳：单次规划调用即可组织跨广泛语义方向的生成，从而实现了多样性 - 质量 - 计算前沿的最优平衡。锚点再生在最终池的多样性上可能表现强劲，但在计入全流水线 token 计数时，其优势会减弱。这些结果为开放式 LLM 构思建立了实用的无锚点基线。

Abstract

LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can be attractive in this setting when it broadens the pool while retaining quality and cost efficiency. We study inference-time controls for candidate-pool diversification, asking whether anchorless methods can rival methods that depend on observed seed ideas. Across three creative task families, we compare independent generation and semantic direction stratification with self-, peer-, and representative-anchor baselines, under neutral and population-referential divergent instructions. Population-referential divergence is a strong low-cost baseline, increasing semantic diversity while preserving quality proxies. Semantic direction stratification is stronger: a single planning call organizes generations across broad semantic directions, yielding the best diversity--quality--compute frontier. Anchored regeneration can be strong in final-pool diversity, but its advantage shrinks under full-pipeline token accounting. These results establish practical anchorless baselines for open-ended LLM ideation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

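Scoring rationale: The paper focuses on inference-time diversification strategies for text-only large language models (LLMs) in creative ideation and does not involve visual encoders, multimodal integration, world models, or model-based reinforcement learning. Relevance to the vision and reinforcement-learning keywords is very low (0-1 points); relevance to the LLM-related keywords is low (1-2 points) because the study concerns inference control rather than model architecture or unification. The weighted total is 10.5, below the dynamic passing score of 27.8, indicating weak relevance to the given keyword set.

Keywords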

LLM Ideation, Parallel Inference, Candidate-pool Diversification, Anchorless Methods, Semantic Direction Stratification, Inference-time Controls, Creative Tasks

186. Masked Diffusion Modeling for Anomaly Detection [FAIL]

Score: 10.5 / 27.8

Authors: Lixing Zhang, Yuchen Liang, Liyan Xie

Published: 2026-05-28

TL;DR: This paper proposes MaskDiff-AD, a forward-only masked diffusion method for anomaly detection on discrete and mixed-type data, achieving competitive performance on tabular and text datasets without requiring reverse-time sampling.

Abstract (Chinese translation)

异常检测旨在识别偏离标称数据分布（nominal data distribution）的样本，是许多安全关键应用（safety-critical applications）的核心。然而，针对类别型、混合类型及离散序列数据开发有效的异常检测方法仍然具有挑战性且相对探索不足。掩码扩散模型（Masked Diffusion Models）提供了一种自然的方式来建模此类数据，通过从剩余的可见上下文学习来恢复掩码值。本文提出了一种基于掩码扩散模型的仅前向方法——仅使用标称数据训练的异常检测掩码扩散（MaskDiff-AD）。给定测试样本，MaskDiff-AD 通过重构随机掩码坐标的难度构建异常得分，从而产生一个内容敏感得分，该得分直接在离散状态空间（discrete state spaces）上运行，同时避免了反向时间采样（reverse-time sampling）。我们还开发了 MaskDiff-AD 的非参数变体，并通过刻画固定检测阈值下的 Type-I 和 Type-II 错误提供了理论保证。在来自 ADBench 和 UADAD 的十四个类别型和混合类型表格数据集，以及来自 NLP-ADBench 的四个文本异常检测数据集上的实验表明，MaskDiff-AD 在性能上具有竞争力，优于经典的、基于扩散的以及最近的表格/文本异常检测基线方法。值得注意的是，MaskDiff-AD 获得了最佳的整体平均排名，优于所有十二种表格基线方法。

Abstract

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.
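The scoring idea in the abstract, rating a sample by how hard its randomly masked coordinates are to reconstruct from the visible ones, can be illustrated with a toy. Everything below is a hedged sketch: a smoothed pairwise count table stands in for the trained masked diffusion model, and the function names, mask ratio, number of masks, and sample data are hypothetical rather than taken from MaskDiff-AD.

```python
import math
import random
from collections import Counter


def fit_pairwise_model(rows, smoothing=1.0):
    """Toy stand-in for a trained denoiser: smoothed estimates of p(x_j | x_k)."""
    n_cols = len(rows[0])
    vocab_size = [len({row[j] for row in rows}) for j in range(n_cols)]
    pair = Counter()     # (j, value_j, k, value_k) -> co-occurrence count
    context = Counter()  # (j, k, value_k) -> count of the conditioning value
    for row in rows:
        for j in range(n_cols):
            for k in range(n_cols):
                if j != k:
                    pair[(j, row[j], k, row[k])] += 1
                    context[(j, k, row[k])] += 1

    def prob(j, value, visible):
        # Average the conditionals over visible coordinates; the extra +1 in
        # the denominator reserves probability mass for unseen values.
        size = vocab_size[j] + 1
        estimates = [
            (pair[(j, value, k, v)] + smoothing)
            / (context[(j, k, v)] + smoothing * size)
            for k, v in visible.items()
        ]
        return sum(estimates) / len(estimates)

    return prob


def anomaly_score(row, prob, n_masks=64, mask_ratio=0.5, seed=0):
    """Mean negative log-likelihood of masked coordinates over random masks."""
    n = len(row)
    if n < 2:
        raise ValueError("need at least two coordinates to mask one and keep one")
    rng = random.Random(seed)
    n_hidden = min(n - 1, max(1, round(mask_ratio * n)))
    total = 0.0
    for _ in range(n_masks):
        hidden = rng.sample(range(n), n_hidden)
        visible = {k: row[k] for k in range(n) if k not in hidden}
        total += sum(-math.log(prob(j, row[j], visible)) for j in hidden) / n_hidden
    return total / n_masks


# Made-up nominal data with strongly correlated categorical columns.
nominal = [("red", "apple", "sweet"), ("yellow", "banana", "sweet"),
           ("green", "lime", "sour")] * 10
prob = fit_pairwise_model(nominal)
normal = ("red", "apple", "sweet")
odd = ("red", "banana", "sour")
print(anomaly_score(normal, prob), anomaly_score(odd, prob))
```

Like the method described in the abstract, this scoring needs only forward evaluations on masked inputs; no reverse-time sampling loop is run.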

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on masked diffusion for anomaly detection on tabular/text data, lacking content on model unification, visual encoders, world models, MLLMs, or RL. Tokenizer and MultiModal have minimal relevance due to discrete/mixed-type data handling. No expert authors from the list are found. Total weighted score is 10.5, below the dynamic passing threshold of 27.8.

Keywords

Masked Diffusion, Anomaly Detection, Discrete Sequence Data, Tabular Data, Forward-only Method, Nominal Data, Mixed-type Data

187. Teaching Values to Machines: Simulating Human-Like Behavior in LLMs [FAIL]

Score: 10.5 / 27.8

Authors: Asaf Yehudai, Naama Rozen, Ariel Gera

Published: 2026-05-28

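TL;DR: The paper uses psychological value theory to induce human-like values in LLMs and finds that value-induced LLMs closely match humans in both value structure and value-behavior relationships, which improves simulation of human behavior.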

Abstract (Chinese translation)

大型语言模型（LLMs）展现出扮演不同人格与角色的显著能力；然而，尚不清楚它们是否能表现出符合连贯一致的人类价值结构的行为。在这项工作中，我们借鉴既有的心理价值理论，在 LLMs 中诱导人类般的价值，并评估其与人类研究中观察到的模式的一致性。使用经过验证的心理问卷，我们开展了大规模实验——超过 500 万个问题——以评估主流 LLMs 中的价值结构和价值 - 行为关系，并将其与人类进行比较。我们的发现表明，在两个维度上，经过价值提示的 LLMs 与人类之间具有高度一致性。此外，纳入人类价值分布增强了基于价值诱导 LLMs 的群体层面模拟。这些发现突显了价值诱导 LLMs 作为有效且具有心理学依据的工具，在模拟人类行为方面的潜力。

Abstract

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

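Scoring rationale: The paper focuses on value alignment and psychological simulation in large language models (LLMs), aiming to induce human-like values in the models. Relative to the keyword set, it does not address multimodal architectures (Visual Encoder, MultiModal), reinforcement learning (model-based RL), or model unification (Unify Models) at a technical level. It touches only on LLMs (close to MLLM) and, implicitly, tokenizers, so the relevance scores are low.

Keywords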

Large Language Models, Value Alignment, Human-like Behavior, Psychological Theory, Value Structures, Value-Behavior Relationships, Simulation

188. Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging [FAIL]

Score: 10.5 / 27.8

Authors: Yuanyi Wang, Yanggan Gu, Su Lu, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

Published: 2026-05-28

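TL;DR: This paper proposes MergePipe, a budget-aware execution layer that optimizes which expert weights are read during LLM merging, substantially reducing I/O overhead and speeding up merging while preserving downstream task performance.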

Abstract (Chinese translation)

权重空间模型合并通常被表述为对检查点的代数运算，然而在 LLM（大语言模型）规模下，限制资源往往是必须读取的专家权重集合。我们引入 MergePipe，这是一种预算感知执行层，它将 LLM 合并视为一个专家访问集问题：给定一个合并算子和一个共享权重坐标系下的检查点集合，在明确的 I/O 预算下选择访问哪些专家增量块。MergePipe 对参数块建立索引，构建确定性访问计划，并使用可重放清单执行由此产生的预算合并。该计划从构造上保证预算健全性，并在完整预算下恢复全读合并；对于固定系数加法算子，遗漏更新误差被遗漏增量的范数所界定。在 Qwen 和 Llama 的合并工作负载上，MergePipe 将专家读取 I/O 减少了一个数量级，并实现了高达 11 倍的加速比。代表性的预算扫描显示，相对于全读合并，参数偏差仅为 O(10^{-3})，且在下游基准测试上没有单调退化。

Abstract

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an expert access-set problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
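
The budgeted additive merge and its error bound can be illustrated with a toy sketch. It is not MergePipe itself: the selection rule (largest weighted delta norm per byte) is an assumed heuristic, since the abstract does not describe how the access plan is chosen, and `budgeted_merge` and its arguments are hypothetical names. Every delta here has the same shape as `base` to keep the example small.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def budgeted_merge(base, deltas, coeffs, block_bytes, budget_bytes):
    """Additive merge that applies only the delta blocks fitting the I/O budget.

    deltas: block_id -> list of floats; coeffs: block_id -> fixed coefficient.
    Returns (merged, plan, error_bound).
    """
    # Assumed heuristic. A real index would store precomputed norms so that
    # planning does not itself require reading the blocks.
    order = sorted(
        deltas,
        key=lambda blk: abs(coeffs[blk]) * l2_norm(deltas[blk]) / block_bytes[blk],
        reverse=True,
    )
    plan, used = [], 0
    for blk in order:
        if used + block_bytes[blk] <= budget_bytes:  # never exceed the budget
            plan.append(blk)
            used += block_bytes[blk]
    merged = list(base)
    for blk in plan:
        merged = [m + coeffs[blk] * d for m, d in zip(merged, deltas[blk])]
    omitted = [blk for blk in deltas if blk not in plan]
    # Triangle inequality: ||sum of omitted c_i * delta_i|| <= sum |c_i| * ||delta_i||.
    error_bound = sum(abs(coeffs[blk]) * l2_norm(deltas[blk]) for blk in omitted)
    return merged, plan, error_bound
```

With a budget large enough for every block, the plan contains all blocks, the bound is zero, and the result equals the full-read merge, which mirrors the two properties stated in the abstract.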

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

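Scoring rationale: The paper focuses on making weight-space merging of large language models (LLMs) more efficient, in particular the I/O budget for reading expert weights. Although model merging relates to unifying models (Unify Models), it has little connection to the multimodal representation, world model construction, and reinforcement learning directions emphasized in the research background. The paper does not cover visual encoders, tokenizer design, world models, or model-based reinforcement learning, so most keywords receive a relevance of 0. MLLM is weakly related because LLMs are involved, but no multimodal aspect is present.

Keywords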

Weight-space model merging, Expert access-set, Budget-aware execution, MergePipe, I/O budget, Large Language Models, Model merging, Expert delta blocks

189. Solving Integer Linear Programming with Parallel Tempering [FAIL]

Score: 10.5 / 27.8

Authors: Kyuil Sim, Sanghyeok Choi, Jinkyoo Park

Published: 2026-05-28

TL;DR: The paper introduces a solver-free, sampling-based optimization framework utilizing Parallel Tempering for Integer Linear Programming, achieving performance comparable to classical solvers like SCIP and Gurobi while maintaining robustness against distribution shifts.

Abstract (Chinese translation)

整数线性规划（ILP）作为一种通用的框架，用于建模广泛的组合优化问题，通常由先进的精确求解器或启发式方法解决。尽管基于学习的方法近期显示出有效性，但它们在分布外实例上的泛化能力较差，且内在依赖外部求解器。本文提出了一种针对 ILP 的无需求解器、基于采样的优化框架，该框架无需训练或外部求解器即可直接探索离散可行域。利用 ILP 的线性结构，我们采用局部平衡提议（Locally-Balanced Proposal）构建转移核，从而避免梯度近似。为了克服 ILP 能量景观的高度多峰特性，我们引入了平行退火（Parallel Tempering）。除了标准温度退火外，我们还引入了惩罚退火（penalty tempering），它在保持可行解上目标景观不变的同时调节约束障碍。实验表明，我们的方法在所有四个基准测试上均优于 SCIP，在 200 秒时间预算内的四个任务中有两个匹配或超过 Gurobi，且比基于学习的方法对分布偏移具有显著更强的鲁棒性。此外，在 MIPLIB 2017 实例上，我们的框架在不进行任何问题特定调优的情况下，仍能与经典求解器相竞争。

Abstract

Integer Linear Programming (ILP) serves as a versatile framework for modeling a wide range of combinatorial optimization problems, typically addressed by sophisticated exact solvers or heuristics. While learning-based approaches have recently shown their effectiveness, they suffer from poor generalization to out-of-distribution instances and inherent dependence on external solvers. In this work, we propose a solver-free, sampling-based optimization framework for ILP that directly explores discrete feasible regions without training or external solvers. Exploiting the linear structure of ILP, we employ a Locally-Balanced Proposal to construct a transition kernel, thereby avoiding the gradient approximation. To overcome the highly multimodal nature of ILP energy landscapes, we integrate Parallel Tempering. In addition to standard temperature tempering, we introduce penalty tempering, which modulates constraint barriers while preserving the objective landscape over feasible solutions. Empirically, our method consistently outperforms SCIP across all four benchmarks, matches or exceeds Gurobi on two of four tasks within a 200-second budget, and is substantially more robust to distribution shift than learning-based methods. Furthermore, on MIPLIB 2017 instances, our framework remains competitive with classical solvers without any problem-specific tuning.
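
A toy sketch of parallel tempering on a small binary ILP, under stated assumptions, may make the mechanics concrete. It is not the paper's method: proposals are uniform single-bit flips rather than the Locally-Balanced Proposal, and "penalty tempering" is modeled by giving each replica its own penalty weight on constraint violation, which is only one plausible reading of the abstract. All function names are hypothetical.

```python
import math
import random

def objective(x, c):
    return sum(ci * xi for ci, xi in zip(c, x))

def violation(x, A, b):
    """Total amount by which the constraints A x <= b are violated (0 if feasible)."""
    return sum(max(0, sum(a * xi for a, xi in zip(row, x)) - bi)
               for row, bi in zip(A, b))

def energy(x, c, A, b, penalty):
    # The penalty vanishes on feasible points, so their objective is unchanged.
    return objective(x, c) + penalty * violation(x, A, b)

def parallel_tempering(c, A, b, betas, penalties, steps=2000, seed=0):
    """Minimise c.x over binary x with A x <= b.

    Assumes x = 0 is feasible and that at least two replicas are given.
    """
    rng = random.Random(seed)
    n = len(c)
    replicas = [[0] * n for _ in betas]
    best_x, best_obj = [0] * n, 0
    for _ in range(steps):
        for x, beta, lam in zip(replicas, betas, penalties):
            j = rng.randrange(n)
            e_old = energy(x, c, A, b, lam)
            x[j] ^= 1  # propose a single-bit flip
            delta = energy(x, c, A, b, lam) - e_old
            if delta > 0 and rng.random() >= math.exp(-beta * delta):
                x[j] ^= 1  # Metropolis rejection: undo the flip
            elif violation(x, A, b) == 0 and objective(x, c) < best_obj:
                best_x, best_obj = list(x), objective(x, c)
        # Replica exchange between a random adjacent pair (general form that
        # allows both temperature and energy function to differ).
        r = rng.randrange(len(betas) - 1)
        xa, xb = replicas[r], replicas[r + 1]
        log_acc = (
            betas[r] * (energy(xa, c, A, b, penalties[r])
                        - energy(xb, c, A, b, penalties[r]))
            + betas[r + 1] * (energy(xb, c, A, b, penalties[r + 1])
                              - energy(xa, c, A, b, penalties[r + 1]))
        )
        if log_acc >= 0 or rng.random() < math.exp(log_acc):
            replicas[r], replicas[r + 1] = xb, xa
    return best_x, best_obj
```

For example, a three-item knapsack can be posed as `c = [-5, -4, -3]`, `A = [[4, 3, 2]]`, `b = [5]` with `betas=[2.0, 0.5, 0.1]` and `penalties=[10.0, 3.0, 1.0]`; the hot, low-penalty replicas cross infeasible regions easily while the cold replica refines feasible solutions.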

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on Integer Linear Programming (ILP) optimization using Parallel Tempering and sampling strategies, which is a combinatorial optimization task unrelated to multimodal learning, large language models, tokenization, visual encoding, world models, or reinforcement learning frameworks. Thus, all provided keywords have minimal relevance to the core content.

Keywords

Integer Linear Programming, Parallel Tempering, Sampling-based optimization, Solver-free, Combinatorial optimization, Locally-Balanced Proposal, Discrete feasible regions

190. CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval [FAIL]

Score: 10.5 / 27.8

Authors: Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

Published: 2026-05-28

TL;DR: CoHyDE iteratively co-trains a dense encoder and an LLM rewriter, substantially improving tool retrieval for LLM agents on vague queries.

Abstract (Chinese translation)

在大型 API 目录上进行工具检索是 LLM 代理（Large Language Model Agents）的核心瓶颈：用户查询通常以口语化、且往往未充分指定的语言形式出现，而目录则采用技术性的 API 词汇，任何固定编码器都无法单独弥合这一鸿沟。两种主导的训练方法——对比式编码器微调（contrastive encoder fine-tuning）和基于冻结 LLM 的 HyDE 风格查询扩展（HyDE-style query expansion）——分别从相反两端解决这一问题，却在互补的方向上失效：微调后的编码器在查询的表面形式已与目录匹配时表现优异，但在不匹配时性能急剧下降；而零样本 HyDE（zero-shot HyDE）对未充分指定的查询更具鲁棒性，却会生成与目录无关的假设性描述，导致在查询形式良好时检索性能下降。我们提出了 CoHyDE，这是一种迭代过程，将稠密编码器（dense encoder）与 LLM 重写器（LLM rewriter）训练为一个单一的共同演化系统：编码器利用重写器生成的目录风格假设性描述，基于 InfoNCE 损失进行重新训练；重写器则通过 DPO（Direct Preference Optimization）根据编码器的检索得分进行偏好对齐；且在循环开始前，双方均在工具目录上进行热启动（warm-start）。在 ToolBench 目录约 1 万个工具的子集上，经过三轮 CoHyDE 迭代后，在标准查询上的 NDCG@5 指标比最强的单组件基线提高了 +2.5 个百分点，在保留的模糊查询上提高了 +6.3 个百分点，在最难的模糊层级上增益高达 +8 个百分点。消融实验证实协同训练（co-training）是关键要素：单独使用任一组件均无法在良好查询和模糊查询上达到 CoHyDE 的性能，在模糊查询上甚至会出现高达 -8 个百分点的性能损失。

Abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
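
The encoder side of the loop is trained with InfoNCE, whose standard form with in-batch negatives is sketched below. The temperature value, the use of cosine similarity, and the in-batch negative scheme are assumptions rather than details from the abstract, and the DPO step for the rewriter is not shown.

```python
import math

def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE loss; the positive for query i is document i, others are negatives."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def normalize(v):
        norm = math.sqrt(dot(v, v)) or 1.0
        return [a / norm for a in v]

    queries = [normalize(v) for v in query_vecs]
    docs = [normalize(v) for v in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        top = max(logits)
        # Numerically stable log-sum-exp over all candidates in the batch.
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]  # equals -log softmax(logits)[i]
    return total / len(queries)
```

The loss is near zero when each query embedding lines up with its own tool description and grows when it lines up with another tool's description instead.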

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

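Scoring rationale: The paper focuses on tool retrieval for LLM agents, improving performance through iterative co-training of a dense encoder and an LLM rewriter. Keywords such as Visual Encoder, World Models, and MultiModal have no direct connection to this text-only retrieval task and are scored 0. The use of LLMs (related to MLLM) and the agent setting (related to RL) give some relevance, but neither is central. The co-training mechanism partly fits the 'Unify Models' concept and receives a moderate score. The author list does not include any of the designated experts.

Keywords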

Tool Retrieval, LLM Agents, Iterative Co-Training, Dense Encoder, LLM Rewriter, API Catalog, Information Retrieval

191. KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs [FAIL]

Score: 10.5 / 27.8

Authors: Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain, Akshay Jajoo, Myungjin Lee, Clayton Kerce, Alexey Tumanov

Published: 2026-05-28

TL;DR: KLAS automates neural network stitching using KL divergence to optimize accuracy-efficiency tradeoffs, achieving higher accuracy or reduced FLOPs compared to baseline stitching methods.

Abstract (Chinese translation)

鉴于广泛的部署目标，在给定的计算预算内优化性能，灵活的模型选择至关重要。近期研究表明，在模型族内对预训练模型进行缝合，能够实现准确率 - 效率权衡空间的成本效益插值。缝合技术将一个预训练模型的中间激活转换为另一个，从而生成一个新的插值缝合网络。此类网络在准确率 - 效率谱上提供了一系列部署选项。然而，现有的缝合方法往往产生次优的权衡结果且缺乏泛化性，因为它们主要依赖启发式方法来选择缝合配置。我们认为，构建改进的准确率 - 效率权衡需要显式捕捉并利用被缝合预训练模型之间的相似性。为此，我们提出 KLAS，一种新颖的缝合选择框架，该框架通过利用中间表示之间的 KL 散度 (KL divergence)，实现了跨模型族的缝合选择自动化与泛化。KLAS 能够从 $k$ 个深度为 $n$ 的预训练模型的 $O(k^2n^2)$ 种可能性中识别出最有前景的二元缝合方案。通过全面实验，我们证明 KLAS 在与基线相同的微调成本下，改善了缝合模型的准确率 - 效率曲线。KLAS 在相同的计算成本下，使 ImageNet-1K top-1 准确率最高提升 1.21%；或者在保持准确率不变的情况下，使 FLOPs 降低 1.33 倍。

Abstract

Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget. Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space. Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network. Such networks provide a pool of deployment options along the accuracy-efficiency spectrum. However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations. We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the similarity between pretrained models being stitched. To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations. KLAS identifies the most promising binary stitches from the $O(k^2n^2)$ possibilities for $k$ pretrained models of depth $n$. Through comprehensive experiments, we demonstrate that KLAS improves the accuracy-efficiency curve of stitched models at the same finetuning cost as baselines. KLAS achieves up to $1.21\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\times$ reduction in FLOPs.
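
A rough sketch of similarity-based stitch ranking follows. It assumes, beyond what the abstract states, that activations are turned into distributions with a softmax, that all layers share one feature width, and that a lower KL divergence marks a more promising stitch; `rank_stitches` is a hypothetical name. Enumerating ordered model pairs and layer pairs gives k(k-1)n^2 candidates, consistent with the $O(k^2n^2)$ count.

```python
import math
from itertools import product

def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def rank_stitches(activations, top=3):
    """activations[model] is a list of per-layer activation vectors.

    Returns (best, n_candidates), where best holds the `top` tuples
    (kl, src_model, src_layer, dst_model, dst_layer) with the smallest KL.
    """
    candidates = []
    for (src, src_layers), (dst, dst_layers) in product(activations.items(), repeat=2):
        if src == dst:
            continue  # a stitch joins two different models
        for (i, a), (j, b) in product(enumerate(src_layers), enumerate(dst_layers)):
            candidates.append((kl_divergence(softmax(a), softmax(b)), src, i, dst, j))
    candidates.sort()
    return candidates[:top], len(candidates)
```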

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on model stitching for accuracy-efficiency tradeoffs using KL divergence. It loosely relates to 'Unify Models' (as it combines multiple models) and 'Visual Encoder' (due to ImageNet-1K vision context), but has no direct connection to Tokenizers, World Models, MLLM, Multimodality, or Model-Based RL. No listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list. The low weighted score reflects a significant mismatch between the paper's content and the provided research background/keywords.

Keywords

Neural Network Stitching, Accuracy-Efficiency Tradeoff, KL Divergence, Model Interpolation, Pretrained Models, ImageNet-1K, Deployment Optimization

192. Predicting Causal Effects from Natural Language Queries using Structured Representations [FAIL]

Score: 10.5 / 27.8

Authors: Giuliano Martinelli, Piriyakorn Piriyatamwong, Abelardo Carlos Martinez Lorenzo, Jasmin Baier, Riccardo Orlando, Satvik Garg, Sharif Kazemi, Linxi Wang, Arianna Legovini, Samuel Fraiberger

Published: 2026-05-28

TL;DR: The paper introduces Query2Effect, a benchmark and two-step framework that predicts causal effect sizes from natural language queries using structured representations, outperforming prompted LLMs by separating semantic interpretation from numerical estimation.

Abstract (Chinese translation)

随机对照试验（Randomized Controlled Trials, RCTs）是医学和社会科学的基石，因为它们能够实现对因果效应的可靠估计。然而，实施这些试验成本高昂且耗时，从而激发了从现有实验证据中预测因果效应的兴趣。大型语言模型（LLMs）在知识密集型任务上展现出强大性能，引发了关于这些模型是否可用于预测因果效应大小的疑问。为探究这一问题，我们引入了 Query2Effect，这是一个包含超过 72,000 个自然语言问题及其对应实验描述的新大型基准，旨在通过沿隐含性、抽象性和模糊性维度调整查询特异性，来模拟真实的信息寻求场景。随后，我们提出一个两步框架，该框架首先生成查询的合成结构化表示，随后使用监督编码器模型预测效应大小。实验表明，微调在提升预测性能方面起着关键作用，与提示式开箱即用的 LLMs 相比，绝对误差降低了 27% 至 71%；此外，我们的两步框架有利于域外泛化，凸显了将语义解释与数值效应估计分离的优势。

Abstract

Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. However, they are costly and time-consuming to conduct, motivating interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used for forecasting causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions, created to simulate realistic information-seeking scenarios by varying query specificity along dimensions of implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query before predicting effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, with absolute error reduced by 27% to 71% compared to prompted out-of-the-box LLMs, and that our two-step framework is beneficial for out-of-domain generalization, highlighting the benefits of separating semantic interpretation from numerical effect estimation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on causal inference using natural language and structured representations, which has low overlap with the provided keywords centered on multimodal learning, world models, and reinforcement learning. It uses LLMs (MLLM/Tokenizer relevance low) but lacks visual data (Visual Encoder/MultiModal=0) and RL components (model-based RL=0). The framework separates rather than unifies models in the architectural sense (Unify Models low).

Keywords

Causal Effects, Natural Language Queries, Structured Representations, Query2Effect, Large Language Models, Supervised Encoder, Out-of-domain Generalization

193. DLM-SWAI: Steering Diffusion Language Models Before They Unmask [FAIL]

Score: 10.5 / 27.8

Authors: Hyeseon An, Yo-Sub Han

Published: 2026-05-28

TL;DR: DLM-SWAI proposes a training-free inference-time steering method for diffusion language models that biases token distributions during denoising to achieve controllable text generation without retraining.

Abstract (Chinese translation)

将语言模型的生成导向期望的文本属性对于实际部署至关重要，而推理时方法尤其具有吸引力，因为它们能够在无需再训练的情况下实现可控生成。近期研究也突出了扩散语言模型作为一种具有独特解码特性的新兴生成范式。然而，现有的大多数引导方法要么依赖辅助模型，要么是为自回归下一个词元解码设计的，这使得它们难以应用于扩散语言模型（DLMs），后者通过迭代去噪部分掩码序列来生成文本。因此，我们提出了 DLM-SWAI，一种简单的无需训练的引导方法，它利用预计算的词元级风格得分在每个去噪步骤上偏置词元分布。在风格和安全控制任务上的实验表明，DLM-SWAI 能有效引导扩散语言模型，同时保持生成质量且计算开销极小。消融实验进一步揭示了引导强度与流畅度之间的可控权衡，我们的分析将类别级可引导性与词元级属性线索的强度联系起来。

Abstract

Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.
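
One simple way to bias a token distribution with pre-computed scores is an additive shift on the logits before the softmax, sketched below for a single masked position. The additive form, the `strength` parameter, and the function name are assumptions; the abstract does not state the exact biasing rule.

```python
import math

def steer_distribution(logits, style_scores, strength=1.0):
    """Shift each token's logit by strength * style score, then renormalise."""
    biased = [l + strength * s for l, s in zip(logits, style_scores)]
    top = max(biased)
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]
```

Setting `strength` to zero leaves the model's distribution unchanged, and larger values move more probability mass toward high-scoring tokens, which is the kind of steering-versus-fluency trade-off the abstract mentions.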

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on steering diffusion language models for text generation using token-level scores. It lacks visual encoders, multimodal components, world modeling, or reinforcement learning components. While it operates on token distributions, it does not propose tokenizer architectures. It is a language model but not multimodal (MLLM). Thus, most provided keywords are irrelevant to this text-only generative model study, resulting in a low weighted score.

Keywords

Diffusion Language Models, Steering, Token-level style scores, Controllable generation, Inference-time methods, Text generation, Training-free

194. Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation [FAIL]

Score: 10.5 / 27.8

Authors: Zeli Su, Ziyin Zhang, Zewei Pan, Zhou Liu, Dingcheng Huang, Dehan Li, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

Published: 2026-05-28

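TL;DR: The paper proposes a source-grounded semantic reinforcement learning framework that builds a cross-lingual semantic reward from source-language monolingual data, improving semantic grounding and factual coverage in low-resource target-language generation.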

Abstract (Chinese translation)

低资源目标语言生成往往受限于稀缺的平行数据，而高资源源语言单语数据虽丰富，却难以通过标准监督微调（SFT）加以利用。我们提出了一种基于源语言的语义强化学习（SG-SRL），这是一种资源利用框架，旨在将源语言单语数据转化为目标语言生成的跨语言语义监督。SG-SRL 利用跨语言语义奖励模型在源语言数据上执行无参考强化学习（RL），该模型由一个跨语言重排器实例化，用于衡量源输入与目标语言生成之间的语义相关性。尽管这会引发严重的基于冗长的奖励黑客行为，但一个使用小型平行语料库的轻量级恢复阶段能够恢复流畅性、简洁性和任务格式，同时保留语义增益。在中泰生成任务上的实验表明，SG-SRL 相较于冷启动 SFT，提升了语义接地和事实覆盖度。关于长文本迁移和基于藏语嵌入的奖励的进一步分析阐明了 SG-SRL 的泛化行为，并表明在真实的低资源语言设置下，基于编码器的语义奖励可替代基于大语言模型（LLM）的重排器。

Abstract

Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

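Scoring rationale: The paper studies low-resource target-language generation with the Source-Grounded Semantic Reinforcement Learning (SG-SRL) framework, which differs markedly from the keyword set (focused mainly on multimodality, world models, and unified model architectures). Unify Models scores 2.0 because the paper proposes a data-utilization framework but does not unify multiple model architectures; Tokenizer scores 2.0 because it is implicitly present as a basic LLM component but is not central; Visual Encoder, World Models, MLLM, and MultiModal all score 0.0 because the task is pure text-to-text generation with no vision, multimodality, or world models; model-based RL scores 3.0 because reinforcement learning and a reward model are used, though this is typically reward-based, model-free policy optimization rather than model-based RL in the strict sense (learning environment dynamics). The author list does not include the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus applies. The weighted total of 10.5 is below the dynamic passing threshold of 27.8, indicating low relevance to the keyword topics.

Keywords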

Source-Grounded Semantic Reinforcement Learning, Low-Resource Target-Language Generation, Cross-lingual Semantic Reward Model, Reference-Free Reinforcement Learning, Cross-lingual Reranker, Monolingual Data Utilization, Fluency Restoration

195. Boosting Zero-Shot 3D Style Transfer with 2D Pre-trained Priors [FAIL]

Score: 10.5 / 27.8

Authors: Xin Dong, Yunzhi Teng, Wenfeng Deng, Yansong Tang

Published: 2026-05-28

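TL;DR: This paper proposes DS-StyleGaussian, which uses 2D pre-trained priors to address data scarcity in zero-shot 3D style transfer and achieves high-quality, view-consistent stylization of 3D scenes.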

Abstract (Chinese translation)

本文聚焦于零样本 3D 风格迁移（Zero-shot 3D Style Transfer），旨在给定任意风格图像的情况下，生成 3D 场景的多视图一致的风格化视图。我们主要致力于解决 3D 风格迁移中的数据稀缺问题，该问题源于每个模型仅在单个场景上训练，从而限制了可用内容图像的数量。这种稀缺性显著制约了风格化性能，因为模型优化依赖于足够数量的内容 - 风格图像对以提供监督信号。我们的核心思想是将一个在大规模 2D 图像数据集上预训练的解码器（Decoder）集成到 3D 风格迁移流程中，从而利用解码器从众多内容 - 风格图像对学习中编码的先验知识。我们的方法结合了特征高斯泼溅（Feature Gaussian Splatting）和延迟风格化（Deferred Stylization），旨在利用数据充足的解码器网络实现高质量风格化，同时通过将视图依赖操作统一为视图不变过程来确保视图一致性。实验表明，我们的数据充足风格高斯（Data-Sufficient StyleGaussian, DS-StyleGaussian）模型在各种数据集上的视觉质量方面优于现有的零样本 3D 风格迁移方法。本研究还表明，2D 预训练可作为 3D 任务的强增强手段，弥合 2D 与 3D 之间的数据差距。

Abstract

In this work, we focus on zero-shot 3D style transfer that can generate multi-view consistent stylized views of the 3D scene given an arbitrary style image. We primarily tackle the issue of data scarcity in 3D style transfer, which arises when each model is trained on only a single scene, thereby limiting the number of available content images. This scarcity significantly hampers stylization performance, as model optimization relies on a sufficient number of content-style image pairs to provide supervisory signals. Our core idea is to integrate a decoder pre-trained on large-scale 2D image datasets into the 3D style transfer pipeline, thereby leveraging the prior knowledge encoded in the decoder from learning over numerous content-style image pairs. Our method combines feature Gaussian splatting and deferred stylization, enabling high-quality stylization with the data-sufficient decoder network while ensuring view consistency by unifying view-dependent operations into a view-invariant process. Experiments demonstrate that our Data-Sufficient StyleGaussian (DS-StyleGaussian) model outperforms existing zero-shot 3D style transfer methods in terms of visual quality across various datasets. This work also suggests that 2D pre-training can serve as a strong enhancement for 3D tasks, bridging the data gap between 2D and 3D.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

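Scoring rationale: The paper focuses on zero-shot 3D style transfer and the use of 2D priors, a clear domain mismatch with the keyword set (which emphasizes MLLM, world models, reinforcement learning, and tokenizers). 'Unify Models' scores 2 because the method unifies parts of the 2D/3D pipeline; 'MultiModal' scores 3 because it bridges 2D images and 3D scenes; 'Visual Encoder' scores 2 because pre-trained visual components are used, though not a core encoder; 'Tokenizer', 'World Models', 'MLLM', and 'model-based RL' score 0 as entirely unrelated. None of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear, so no expert bonus applies. The weighted total is about 10.5, below the dynamic passing threshold of 27.8.

Keywords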

Zero-Shot 3D Style Transfer, 2D Pre-trained Priors, Feature Gaussian Splatting, Deferred Stylization, View Consistency, Data-Sufficient StyleGaussian, Cross-Modal Transfer

196. Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation [FAIL]

Score: 10.5 / 27.8

Authors: Adam T. Müller, Philipp J. Teuffel, Konstantin Manassis, Nicolaj C. Stache

Published: 2026-05-28

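TL;DR: The paper proposes an image interpolation method based on a lightweight neural network that maintains high image similarity with 30% fewer experimental measurements, reducing the cost of experimental testing in film cooling analyses for space propulsion.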

Abstract (Chinese translation)

我们提出了一种基于机器学习 (machine learning) 的图像回归 (image regression) 方法，用于从稀疏实验测量数据中重建图像。我们将该方法应用于推进系统 (propulsion system) 开发中的薄膜冷却 (film cooling) 研究，旨在减少对广泛物理测试的需求。该方法采用带有位置编码 (positional encoding) 的轻量级前馈神经网络 (feed-forward neural network)，根据输入参数生成图像。在真实数据和合成数据上验证，该方法实现了高图像相似度（RMSE < 8%，SSIM > 93%），同时在测量数据减少 30% 的情况下保持准确性。我们进一步提出一种基于知识的扩展方法，以增强生成图像的局部适应性。该方法显著减少了所需的测试量，同时保持了高质量数据，实现了冷却剂喷射器 (coolant injector) 配置的高效优化，并具有航空航天领域之外的应用。

Abstract

We propose a machine learning approach for image regression from sparse experimental measurements. We show the application of the proposed method on film cooling studies in propulsion system development, aiming to reduce the need for extensive physical testing. Our method employs a lightweight feed-forward neural network with positional encoding to generate images conditioned by input parameters. Validated on real and synthetic data, it achieves high image similarity (RMSE < 8%, SSIM > 93%) while maintaining accuracy with a 30% reduction of measurements. We further propose a knowledge-informed extension for local adaptability of the generated images. This approach significantly reduces required tests while preserving high-quality data, enabling efficient optimization of coolant injector configurations with applications beyond aerospace.
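
The abstract does not specify which positional encoding is used. A common choice for pixelwise regression is a sinusoidal (Fourier-feature) encoding of normalized coordinates, sketched here together with the RMSE metric it reports. The frequency schedule, the number of frequencies, and both function names are assumptions.

```python
import math

def positional_encoding(coords, n_freqs=4):
    """Map each coordinate x in [0, 1] to sin/cos features at frequencies 2^k * pi."""
    features = []
    for x in coords:
        for k in range(n_freqs):
            angle = (2 ** k) * math.pi * x
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features

def rmse(pred, target):
    """Root-mean-square error between two equally long sequences of pixel values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))
```

In such a setup the encoded pixel coordinates would be concatenated with the experimental input parameters and passed to the feed-forward network, which predicts one pixel value per query.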

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

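Scoring rationale: The paper is mainly about image regression and generation in an engineering setting, using a lightweight feed-forward neural network on film cooling data for space propulsion. Generating images from numerical parameters gives it a partly multimodal character, but it does not involve multimodal large language models (MLLM), world models, reinforcement learning (RL), or tokenizers. The visual-encoder relevance refers only to positional encoding, not a standard visual encoder. The weighted total is 10.5, below the dynamic passing threshold of 27.8. No designated expert authors were found.

Keywords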

Image Regression, Film Cooling, Generative Image Interpolation, Positional Encoding, Feed-forward Neural Network, Space Propulsion, Sparse Measurements, Experimental Testing Reduction

197. Unlocking the Working Memory of Large Language Models for Latent Reasoning [FAIL]

Score: 9.0 / 27.8

Authors: Lukas Aichberger, Sepp Hochreiter

Published: 2026-05-28

TL;DR: This paper proposes Reasoning in Memory (RiM), a method that uses fixed memory blocks instead of autoregressive generation to enable compute-efficient latent reasoning in large language models.

Abstract (Chinese translation)

为了提升大语言模型（Large Language Models）的推理能力，推理时计算（test-time compute）通常通过在最终答案前生成中间 token 来扩展。然而，这种方法将推理与自回归生成（autoregressive generation）耦合在一起，从而混淆了内部计算与外部通信。相比之下，人类认知可以利用工作记忆（working memory）在内部存储和操作信息，而无需将中间思维外化。基于这一原则，我们提出了内存推理（Reasoning in Memory, RiM），这是一种潜在推理方法，它用记忆块（memory blocks）取代了推理步骤的自回归生成。这些记忆块是特殊 token 的固定序列，它们解锁了大语言模型的工作记忆容量。由于它们是固定的而非生成的，因此可以在单次前向传播（single forward pass）中处理，从而实现计算高效的潜在推理。为了实现这些记忆块的功能，我们采用了一种两阶段课程学习（two-stage curriculum）策略。首先，我们通过在每个记忆块后预测显式推理步骤来锚定这些记忆块。其次，我们摒弃这种步骤级监督（step-level supervision），并在每个记忆块后迭代优化最终答案。我们在推理基准上的实验表明，在不同架构和规模的大语言模型上，RiM 的性能匹配或优于现有的潜在推理方法，同时避免了思维的自回归生成。这些结果表明，大语言模型可以被训练为使用工作记忆作为潜在推理的有效机制。

Abstract

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external communication. In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts. Drawing on this principle, we introduce Reasoning in Memory (RiM), a latent reasoning method that replaces the autoregressive generation of reasoning steps with memory blocks. These memory blocks are fixed sequences of special tokens that unlock the working-memory capacity of large language models. Since they are fixed rather than generated, they can be processed in a single forward pass, enabling compute-efficient latent reasoning. To operationalize these memory blocks, we employ a two-stage curriculum. First, we ground them by predicting explicit reasoning steps after each memory block. Second, we discard this step-level supervision and iteratively refine the final answer after each memory block. Our experiments on reasoning benchmarks show that, across language models of different families and sizes, RiM matches or exceeds existing latent reasoning methods while avoiding the autoregressive generation of thoughts. These results demonstrate that large language models can be trained to use working memory as an effective mechanism for latent reasoning.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

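Scoring rationale: The paper proposes Reasoning in Memory (RiM), a latent reasoning method based on memory blocks. It is mainly concerned with the internal computation of text-only large language models, a poor match for the keyword set (multimodality, world models, reinforcement learning). Visual Encoder, MLLM, MultiModal, and model-based RL score 0 because no related content is present; Unify Models, Tokenizer, and World Models score 2 because the connection is limited to special tokens and a conceptual analogy with working memory, which are not core contributions. The author list does not include the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus applies. The weighted total is well below the dynamic passing threshold of 27.8.

Keywords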

Large Language Models, Latent Reasoning, Working Memory, Autoregressive Generation, Memory Blocks, Curriculum Learning, Special Tokens

198. MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection [FAIL]

Score: 9.0 / 27.8

Authors: Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu

Published: 2026-05-28

TL;DR: MIRA introduces a source-aware rubric anchoring framework for efficient mid-training data selection in LLMs, reducing token usage by half while maintaining performance across code benchmarks.

Abstract (Chinese translation)

中期训练已成为现代 LLM（大语言模型）开发中的一个重要阶段，它在最终后训练之前利用大规模精心筛选的数据混合物来增强模型能力。其数据选择问题具有独特性：这些数据是在接近预训练规模的条件下，基于预训练风格的目标进行优化的，但它们是为了下游能力而精心策划的，且来源于具有不同格式和训练角色的异构来源。因此，有效的选择方法既需要可扩展性，也需要源自适应的语义标准。现有的基于模型的方法扩展性良好，但仅能提供隐式质量信号；语义选择方法能提供更强的判断力，但通常假设固定的评分标准或标准化的数据格式。为了解决这种不匹配，我们提出 MIRA，这是一种基于自锚定评分标准发现的感知数据源过滤框架。其核心思想是将评分标准构建纳入数据选择过程：MIRA 首先确定每个数据源组应评估的内容，然后将这些判断提炼为可扩展的学生评分器，以实现全语料过滤。在面向代码的中期训练实验中（涉及 21 个来源和 5 个数据源组），MIRA 在九个代码基准上均优于选择基线，且在使用仅一半 token 的情况下达到了全语料运行的效果。

Abstract

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on mid-training data selection for LLMs using source-aware rubric anchoring. It does not involve visual encoders, world models, multimodal architectures, or reinforcement learning. Tokens appear only as a unit of measurement. Unify Models is loosely related through data source integration, not model unification. MLLM is partially relevant because the work involves LLMs, but it lacks multimodality. Model-based RL is distinct from the model-based data selection methods discussed here. None of the listed expert authors appear in the author list.

Keywords

Mid-training, Data Selection, Source-Aware, Rubric Anchoring, LLM, Code-oriented, Student Scorers, Filtering

199. LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback [FAIL]

Score: 9.0 / 27.8

Authors: Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha

Published: 2026-05-28

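TL;DR: This study fine-tunes open-source LLMs on Reddit community feedback to provide privacy-preserving mental health support, achieving performance comparable to proprietary models without using multimodal or model-based reinforcement learning architectures.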

Abstract (Chinese translation)

大型语言模型（LLMs）在生成心理健康查询的支持性回应方面展现出潜力，但提升其效用、共情力和安全性通常需要大量的计算资源、专家输入及标注数据。与此同时，鉴于心理健康数据的敏感性，部署专有云模型进行相关交互引发了重要的隐私和数据治理担忧。为应对这一挑战，我们提出了 LLUMI 方案，该方案可在受保护的环境中进行本地部署。LLUMI 包含两个互补组件：生成模型（GM），负责为心理健康查询起草支持性回应；以及改进模型（IM），用于修订初始的人工构建回应。我们借助 Reddit 心理健康社区的反馈信号，利用社区认可模式（如点赞和点踩）构建优选 - 拒绝回应对，以用于监督微调（SFT）和直接偏好优化（DPO）。我们进一步通过人工评估对 LLUMI 进行对齐，评估维度包括五个方面：可读性、共情力、连接感、可操作性和安全性。结果表明，尽管 LLUMI 依赖较小的开源模型而非专有云 GPT 模型，其在语言分析和人工评估中仍表现出相当的性能。这些发现表明，当使用社区衍生的偏好信号进行训练时，开源模型能够提供高质量的心理健康支持援助，同时为敏感的支持情境提供更隐私保护的替代方案。

Abstract

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivity of mental health data. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can provide high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.
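
The abstract says chosen-rejected pairs are built from community endorsement signals but does not give the pairing rule. As a rough illustration only, the Python sketch below pairs the highest-voted and lowest-voted reply to each query. The `Reply` type, the `min_margin` filter, and the example data are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import NamedTuple


class Reply(NamedTuple):
    query: str  # the post asking for support
    text: str   # a community reply to that post
    score: int  # net votes (upvotes minus downvotes)


def build_preference_pairs(replies, min_margin=5):
    """Pair each query's top-voted reply (chosen) with its lowest-voted one (rejected).

    min_margin is a hypothetical filter: pairs with a smaller vote gap are
    dropped as too ambiguous to count as a preference.
    """
    by_query = defaultdict(list)
    for reply in replies:
        by_query[reply.query].append(reply)

    pairs = []
    for query, group in by_query.items():
        if len(group) < 2:
            continue  # a pair needs at least two replies
        best = max(group, key=lambda r: r.score)
        worst = min(group, key=lambda r: r.score)
        if best.score - worst.score >= min_margin:
            pairs.append({"prompt": query, "chosen": best.text, "rejected": worst.text})
    return pairs


# Made-up example data for illustration.
replies = [
    Reply("I feel overwhelmed lately.", "That sounds hard. What has helped before?", 12),
    Reply("I feel overwhelmed lately.", "Just get over it.", -4),
    Reply("I can't sleep.", "I hear you.", 3),
    Reply("I can't sleep.", "Same here.", 2),
]
pairs = build_preference_pairs(replies)
print(pairs)
```

Here only the first query yields a pair, because the second query's replies differ by a single vote. A real pipeline would also need to handle ties, deleted replies, and vote counts that depend on how long a reply has been visible.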

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

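Scoring rationale: The paper focuses on using community feedback (SFT/DPO) to improve LLMs for mental health support, which falls under text generation and alignment. The supplied keywords cover multimodality (Visual Encoder, MultiModal, MLLM), world models, and model-based RL, and are largely unrelated to the paper (score 0). Low scores (2) are given only because LLMs implicitly use a tokenizer (Tokenizer) and because the system comprises two models, GM and IM (Unify Models). No designated experts were found, so no bonus applies.

Keywords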

LLM Writing Assistance, Mental Health Support, Online Community Feedback, Supervised Fine Tuning, Direct Preference Optimization, Privacy-preserving, Open-source models, Human evaluation

200. Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models [FAIL]

Score: 9.0 / 27.8

Authors: Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

Published: 2026-05-28

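TL;DR: This paper proposes Canonical-Context On-Policy Distillation, which aligns a student model in multi-turn conversation with the behavior of a full-context teacher model, substantially reducing self-anchored drift and improving the consistency and accuracy of large language models in multi-turn question answering.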

Abstract (Chinese translation)

大型语言模型（LLMs）通常能在所有指令均置于单个提示中时解决任务，但当相同信息在多轮次中逐渐揭示时却会失败。当干净的 FULL 提示与 RAW-SHARDED 对话包含相同的完整用户证据时，模型仍应得出相同的答案。我们认为造成这种差距的关键原因是自锚定漂移（self-anchored drift）：在信息不完整的情况下产生的响应会引入无依据的假设，而这些假设随后会扭曲最终答案。为了减轻这种影响，我们提出了规范上下文在线策略蒸馏（Canonical-Context On-Policy Distillation, CCOPD）。在训练过程中，同一个基础模型被赋予两种角色：一个基于干净 FULL 提示冻结的教师模型，以及一个通过多轮对话逐步接收相同证据的可训练学生模型；CCOPD 旨在将学生模型在其自身轨迹上的行为与教师模型的规范全上下文行为对齐。仅在数学问题对话上进行训练，CCOPD 在数学及五个零样本跨域任务族上，相较于原始基础模型，在 RAW-SHARDED 性能上实现了平均 32% 的相对提升，同时很大程度上保留了全上下文性能。进一步分析表明，CCOPD 加强了对用户证据的 grounding，并降低了对早期助手轮次污染的敏感性。

Abstract

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

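Scoring rationale: The paper's core topic is distillation for multi-turn language models, which is entirely unrelated to Visual Encoder (0.0) and MultiModal (0.0). Tokenizer (1.0), World Models (1.0), and model-based RL (1.0) are at most conceptually peripheral or not addressed. Unify Models (2.0) is slightly related because the teacher and student share a base model. MLLM (1.0) is slightly related because the work falls within large language models. None of the designated expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found, so no bonus was triggered.

Keywords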

Multi-Turn Language Models, Canonical-Context On-Policy Distillation, Self-anchored drift, Large language models, On-Policy Distillation, Full-context behavior, Math problem conversations, Zero-shot out-of-domain

201. HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime [FAIL]

Score: 9.0 / 27.8

Authors: Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

Published: 2026-05-28

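TL;DR: The paper proposes Hysteretic Policy Optimization (HPO), which balances positive and negative advantages and uses mean-length normalization to stabilize and improve sparse-reward reinforcement learning training, outperforming GRPO.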

Abstract (Chinese translation)

我们研究了在稀疏可验证奖励背景下，GRPO 风格强化学习的一种狭窄但常见的失效模式：早期更新中包含更多具有负优势的响应，而非正优势的响应，同时每响应长度标准化将更新幅度与输出长度绑定。我们提出滞后策略优化（HPO），这是对 GRPO 的一种最小修改，它减少了负优势更新的权重，并将每响应长度标准化替换为平均长度标准化。我们进一步引入自适应 HPO（A-HPO），它基于批次级优势符号统计设置滞后权重，从而消除了调整固定滞后权重的需求。在我们的 TeleLogs 和 Countdown 实验中，A-HPO 相比 GRPO 提高了每更新奖励，在早期稀疏奖励情形中收益最大。在 TeleLogs 上，A-HPO 实现了 0.84 的最终奖励，优于 SAPO 5%、GSPO 11% 和 GRPO 15%，同时保持相当的响应长度。在 Countdown 上，A-HPO 在初始和最困难配置中实现了最大收益，涵盖 1.5B 至 7B 模型。关于滞后权重的消融研究表明，A-HPO 的收益源于更好地平衡正优势与负优势的贡献，相较于仅正优势或完全对称更新。

Abstract

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response-length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
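
The two changes to GRPO described in the abstract can be illustrated with a few lines of Python. This is a sketch of one plausible reading, not the authors' implementation: the function names are hypothetical, `beta` stands in for the hysteretic weight, and "mean-length normalization" is assumed to mean dividing by the mean response length of the group. The adaptive rule of A-HPO is not specified in the abstract and is left out.

```python
import statistics


def group_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one group of responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]


def hysteretic_advantages(advantages, beta):
    """Scale negative advantages by beta in [0, 1]; keep positive ones unchanged."""
    return [a if a >= 0 else beta * a for a in advantages]


def per_token_weights(advantages, lengths, beta):
    """Per-token weight of each response under mean-length normalization.

    Assumption: every token of response i is weighted by A_i / mean_length,
    instead of A_i / len(response_i) as in per-response normalization.
    """
    mean_length = statistics.fmean(lengths)
    return [a / mean_length for a in hysteretic_advantages(advantages, beta)]


# Sparse binary rewards: one success among four sampled responses.
rewards = [0.0, 0.0, 0.0, 1.0]
lengths = [120, 80, 200, 100]  # response lengths in tokens (made up)

adv = group_advantages(rewards)
weights = per_token_weights(adv, lengths, beta=0.5)
for a, n, w in zip(adv, lengths, weights):
    print(f"advantage={a:+.3f}  length={n:3d}  per-token weight={w:+.5f}")
```

Setting `beta=1.0` recovers the fully symmetric update and `beta=0.0` the positive-only update, the two extremes that the ablation in the abstract compares against.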

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

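Scoring rationale: The paper's core contribution is a reinforcement learning policy optimization algorithm (HPO) that improves GRPO for sparse-reward problems. It does not involve multimodality, visual encoders, world models, or unified model architectures, so the related keywords score very low. Although language model training is implicitly involved, the paper does not explicitly discuss MLLMs or tokenizer design, and GRPO is model-free rather than model-based RL. The author list does not include Yang Shi or the other designated experts, so no bonus applies.

Keywords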

Hysteretic Policy Optimization, Sparse-Reward Regime, GRPO, Reinforcement Learning, Length Normalization, Adaptive HPO, Policy Optimization

202. iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis [FAIL]

Score: 9.0 / 27.8

Authors: Yang Song, Yixuan Zhang, Lingfa Meng, Tongyuan Hu, Haizhou Shi, Hao Wang, Samir Bhatt, Hengguan Huang

Published: 2026-05-28

TL;DR: iLoRA introduces a Bayesian graph-conditioned LoRA framework for microbiome diagnosis that jointly learns prediction and latent interaction structures, outperforming standard LoRA baselines.

Abstract (Chinese translation)

参数高效适配已使大语言模型（LLMs）在领域预测中变得实用，但标准 LoRA（低秩适配）仍依赖静态低秩更新，无法揭示驱动科学标签的潜在交互。我们引入 iLoRA。据我们所知，这是首个贝叶斯图条件 LoRA 框架。它从输入中推断潜在交互图，并利用它生成输入条件化的 LoRA 更新。因此，iLoRA 联合学习预测和潜在交互结构，而不是训练一个预测器仅在事后（post hoc）应用交互分析。我们将此想法实例化于微生物组诊断中，其中疾病状态可能既取决于物种水平丰度，也取决于微生物间互作，并在两个互补场景中评估它：与人工标注图结合的交互式问答（QA），测试潜在结构恢复，以及多队列炎症性肠病（IBD）诊断，测试生物医学效用。在这两个场景中，iLoRA 均优于强大的 LoRA 和贝叶斯适配基线，恢复了与人工标注及队列级微生物组关联一致的图，并在提供适度图分支开销的同时提供了校准的不确定性。

Abstract

Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels. We introduce iLoRA. To our knowledge, it is the first Bayesian graph-conditioned LoRA framework. It infers a latent interaction graph from the input and uses it to generate input-conditioned LoRA updates. As a result, iLoRA learns prediction and latent interaction structure jointly, rather than training a predictor and applying interaction analysis only post hoc. We instantiate this idea for microbiome diagnosis, where disease state can depend on both species-level abundance and microbe-microbe cross-talk, and evaluate it in two complementary settings: interactive QA with human-annotated graphs, which tests latent structure recovery, and multi-cohort IBD diagnosis, which tests biomedical utility. Across both settings, iLoRA improves over strong LoRA and Bayesian adaptation baselines, recovers graphs aligned with human annotations and cohort-level microbiome associations, and provides calibrated uncertainty with moderate graph-branch overhead.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper proposes iLoRA for microbiome diagnosis using Bayesian graph-conditioned LoRA. It does not align with World Models, Visual Encoders, Model-Based RL, or Multimodal (MLLM/MultiModal) paradigms. Tokenizer is not discussed. Unify Models is tangentially related to joint learning but not the core focus.

Keywords

Bayesian Low-Rank Adaptation, Latent Interaction Graphs, Microbiome Diagnosis, Parameter-efficient adaptation, Graph-conditioned LoRA, Species-level abundance, Microbe-microbe cross-talk, Calibrated uncertainty

203. How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency [FAIL]

Score: 9.0 / 27.8

Authors: Galip Tolga Erdem

Published: 2026-05-28

TL;DR: This study empirically measures the consistency of autonomous LLM penetration testing across four models, revealing significant differences in exploitation rates and failure modes despite identical targets and prompts.

Abstract (Chinese translation)

大型语言模型（LLM）能够自主执行多阶段网络攻击，然而其在重复试验中攻击行为的一致性尚未得到研究。本研究首次对 LLM 攻击一致性进行了大规模实证测量：在保持提示词、编排器和目标不变的情况下，针对一个托管了 OWASP Juice Shop 及另外两个易受攻击服务的相同蜜罐，执行了 400 次自主渗透测试运行（涉及 4 个模型，各 100 次）。在迭代 0-1 阶段，没有任何模型发出的内容拒绝能够经受住编排器的一次性授权重新提示。Claude Sonnet 4 的 API 调用确实遇到了上游服务不可用问题——在记录的 Anthropic 容量事件中，1135 次调用中有 91 次返回了 HTTP 529 overloaded_error，导致 100 次 Claude 运行中的 39 次被截断。早期草稿将这些案例归类为安全拒绝；但在全日志审计中，它们被确认为上游 API 故障，而非模型层面的拒绝。尽管如此，Claude 在 100 次运行中实现了 61 次完全利用；Gemini 2.5 Flash-Lite 为 85 次；GPT-4o-mini 为 56 次（同时部署了 98 种独特的攻击策略）；qwen2.5-coder:14b 为 25 次。故障模式具有模型特异性：Claude 因 API 截断导致失败（39 次），qwen 因过早完成导致失败（52 次），GPT-4o-mini 因迭代预算耗尽导致失败（23 次）。跨服务凭证重用仅出现在保留最多对话历史的配置中（qwen 为 57%，GPT-4o-mini 为 49%，云模型在 5 次交换窗口上为 0%）。跨模型利用率的差异具有统计学显著性（p < 0.001），且效应量较大；qwen 与 Gemini 的 SQL 注入率差异在 Cohen's h = 1.12 处。首次利用时间落在 15 至 30 秒的时钟时间范围内。据我们所知，这是第一项针对多服务目标、在每个模型上以 N=100 的规模测量自主 LLM 攻击行为的研究。

Abstract

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repeated trials remains unstudied. This work presents the first large-scale empirical measurement of LLM attack consistency: 400 autonomous penetration testing runs (4 models, 100 each) against an identical honeypot hosting OWASP Juice Shop and two additional vulnerable services, holding prompt, orchestrator, and target constant. No model emitted a content refusal that survived the orchestrator's one-shot authorization re-prompt at iterations 0-1. Claude Sonnet 4's API calls did encounter upstream service unavailability - 91 of 1,135 calls returned HTTP 529 overloaded_error during a documented Anthropic capacity event, truncating 39 of 100 Claude runs. An earlier draft catalogued these as safety refusals; on full-log audit they are upstream API failures, not model-level refusals. Despite this, Claude achieved full exploitation in 61 of 100 runs; Gemini 2.5 Flash-Lite in 85; GPT-4o-mini in 56 while deploying 98 unique attack strategies; qwen2.5-coder:14b in 25. Failure modes are model-distinctive: Claude through API truncation (39 runs), qwen through premature completion (52), GPT-4o-mini through iteration-budget exhaustion (23). Cross-service credential reuse appeared only in configurations retaining the most conversation history (qwen 57%, GPT-4o-mini 49%, cloud models 0% on 5-exchange windows). Cross-model exploitation rate differences are statistically significant (p < 0.001) with large effect sizes; qwen vs. Gemini SQL injection rates differ at Cohen's h = 1.12. First-exploit timing fell within a 15-30 second wall-clock range. To our knowledge, this is the first study to measure autonomous LLM attack behavior at N=100 per model across a multi-service target.
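
The effect size quoted in the abstract, Cohen's h, compares two proportions after an arcsine transform: h = 2·arcsin(√p₁) − 2·arcsin(√p₂). The SQL injection rates behind the reported h = 1.12 are not given in the abstract, so the Python sketch below applies the same formula to the full-exploitation rates that are quoted (successes out of 100 runs). It is an illustration of the statistic, not a reproduction of the paper's analysis.

```python
import math


def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Full-exploitation rates quoted in the abstract (runs out of 100).
rates = {
    "Gemini 2.5 Flash-Lite": 0.85,
    "Claude Sonnet 4": 0.61,
    "GPT-4o-mini": 0.56,
    "qwen2.5-coder:14b": 0.25,
}

h = cohens_h(rates["Gemini 2.5 Flash-Lite"], rates["qwen2.5-coder:14b"])
print(f"Cohen's h, Gemini vs. qwen (full exploitation): {h:.2f}")
```

By Cohen's conventional thresholds, |h| around 0.2 is small, 0.5 medium, and 0.8 or more large.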

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on LLM penetration testing consistency and security evaluation, which has minimal overlap with the specified keywords regarding multimodal architecture (Tokenizer, Visual Encoder), model unification, world models for representation learning, or model-based reinforcement learning. While the models used (Claude, Gemini, GPT-4o) are potentially MLLMs, the study does not investigate their multimodal capabilities or internal structures. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list (Galip Tolga Erdem).

Keywords

LLM Penetration Testing, Consistency, Autonomous Cyber Attacks, Empirical Study, Failure Modes, Exploitation Rates, Honeypot

204. Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning [FAIL]

Score: 9.0 / 27.8

Authors: Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

Published: 2026-05-28

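TL;DR: This paper proposes DOMINO, a framework that synthesizes domain-specific data by learning a minimal sufficient representation from reference samples, improving LLM performance on coding tasks without relying on explicit domain descriptions.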

Abstract (Chinese translation)

大语言模型（Large Language Models）在通用能力方面展现了显著的进步，并且可以通过在领域特定数据上进行微调而在特定领域实现卓越的性能。然而，为目标领域获取高质量数据仍然是一个重大挑战。现有的数据合成方法遵循演绎范式，严重依赖用自然语言表达的显式领域描述以及精心设计的提示工程，这限制了它们在现实场景中的适用性，在这些场景中领域难以描述或形式化表达。在这项工作中，我们通过归纳范式解决了一个未被充分探索的问题——领域特定数据合成，其中目标领域仅通过一组参考示例来定义，尤其是在领域特征难以用自然语言表达的情况下。我们提出了一种新颖的框架 DOMINO，该框架从参考样本中学习一个最小充分的领域表示，并利用该表示来指导生成与领域对齐的合成数据。DOMINO 将提示微调与对比解耦目标相结合，以分离领域级模式与样本特定噪声，从而在减轻过拟合的同时保留核心领域特征。理论上，我们证明了 DOMINO 扩展了合成数据分布的支撑集，从而确保了更高的多样性。经验上，在领域定义隐含的具有挑战性的代码基准测试上，在 DOMINO 合成的数据上进行微调，相较于强大的指令微调骨干模型，Pass@1 准确率最高可提升 4.63%，证明了其有效性和鲁棒性。这项工作确立了领域特定数据合成的新范式，使得无需手动提示设计或自然语言领域规范即可实现实用且可扩展的领域适应成为可能。

Abstract

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

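Scoring rationale: The paper centers on domain-specific data synthesis for LLMs and minimal sufficient representation learning; it does not involve visual encoders, world models, model-based reinforcement learning, or multimodal architectures. Tokenizers and unified models are not core contributions of the paper. Relevance to the given keywords is therefore low, and the weighted total (9.0) is far below the dynamic passing score (27.8). None of the listed expert authors were found.

Keywords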

Domain-Specific Data Synthesis, Minimal Sufficient Representation Learning, Large Language Models, Prompt Tuning, Contrastive Disentanglement, Reference Examples, Domain Adaptation

205. Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent [FAIL]

Score: 9.0 / 27.8

Authors: Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou, Jing Zhang

Published: 2026-05-28

TL;DR: Compass leverages an expert-guided LLM agent framework to extract 3,751 marine lead records from 230,000 papers with 92% accuracy, establishing the largest integrated marine Pb database without model fine-tuning.

Abstract (Chinese translation)

海洋铅（Pb）及其同位素是研究海洋环流和人为污染的关键示踪剂，然而原位观测依然成本高且数据稀疏。尽管存在海量的历史记录，但这些记录却埋藏在学术论文的无结构内容中，形成了“数据孤岛”，难以进行综合分析。手动提取难以扩展，而通用大型语言模型（LLMs）缺乏必要的领域特定知识，易产生幻觉并输出科学上无效的结果。为解决这一问题，我们提出了一种专家引导的适配方法，使大型语言模型能够在无需微调的情况下执行严格的科学数据提取。我们通过 Compass 框架实现了这一方法，该框架是一个大型语言模型智能体框架，通过与海洋科学家共同设计的知识树（Knowledge Tree）进行增强，将复杂任务分解为可验证的步骤，引导智能体的推理过程以确保科学有效性。我们将 Compass 部署于包含超过 230,000 篇相关开放获取论文的语料库上，成功提取了 3,751 条此前未被纳入的铅记录。这项工作建立了迄今为止规模最大的综合海洋铅数据库。除标准指标外，Compass 通过多层验证展现了卓越的可靠性，经专家人工验证确认，其准确率达到 92%。新整合的数据扩展了先前采样不足区域（如东海和南大洋）的覆盖范围，为未来的科学发现提供了更为丰富的数据基础。我们发布了一个交互式可视化平台，以促进开放的科学访问。我们的工作表明，专家引导的智能体能够有效弥合通用大型语言模型与高利害科学领域之间的差距，从而实现地球科学领域可扩展的数据发现。

Abstract

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.5/10	3.8
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.5/10	2.2

Scoring rationale: The paper focuses on text-based scientific data extraction using an expert-guided LLM agent, lacking multimodal components (Visual Encoder, MultiModal), tokenizer architecture, World Models, or Model-Based Reinforcement Learning. It utilizes LLMs (MLLM) and unifies expert knowledge (Unify Models), resulting in low relevance to the specific technical keywords provided.

Keywords

Marine Lead, Data Integration, Expert-Guided LLM Agent, Knowledge Tree, Scientific Data Extraction, Unstructured Text, Multi-layered Validation

206. Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content [FAIL]

Score: 9.0 / 27.8

Authors: Ihor Stepanov, Aleksandr Smechov

Published: 2026-05-28

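TL;DR: This paper presents Opir, an efficient multi-task encoder-based guardrail model for real-time detection of toxicity, jailbreaks, and harmful content in LLM applications, achieving competitive safety classification performance with a small parameter count.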

Abstract (Chinese translation)

大型语言模型（LLM）应用的实时安全过滤需要分类器，能够检测不安全提示、有毒语言、越狱尝试和不安全响应，且无需大型护栏模型（guardrail models）的高昂开销，同时能够区分良性敏感文本与真正隐蔽的危害内容。本文介绍了 Opir，这是一种基于 GLiClass 架构的基于编码器的护栏模型家族。Opir 包括用于二元安全/不安全分类、多标签毒性分类、越狱分类以及零样本不安全提示和响应分类的多任务模型。我们还发布了边缘变体，参数少于 1 亿，专门用于二元安全/不安全分类。这些模型基于一个三级分类体系进行训练，该体系包含 996 个类别，涵盖 16 个顶层标签、126 个中层标签和 854 个叶标签。Opir 的训练数据结合了基于分类体系的不安全提示、对抗性挖掘的困难负样本、良性安全保留示例、生成响应示例、多语言翻译，以及 Aegis2 和 WildGuard 训练子集的部分内容。我们还开源了一个评估框架，该框架支持 GLiClass 和 GLiNER2 后端以及基于解码器的模型，涵盖二元安全分类、多标签分类、毒性、越狱检测、提示安全、响应安全、响应拒绝以及公共基准家族中的提示子类别视图。在涵盖 12 个安全分类任务和 17 个类别任务的扩展比较中，针对八个当代护栏系统（包括基于 GLiNER2 的护栏模型和生成式护栏模型），Opir 变体在大多数基准数据集上具有竞争力或优于最强的开源权重基线，同时具有显著更小的部署足迹。

Abstract

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems (including both GLiNER2-based and generative guardrail models), Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

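Scoring rationale: The paper's core content is text safety classification (toxicity and jailbreak detection) using an encoder-based guardrail architecture. The supplied keyword set focuses mainly on multimodality, world models, and reinforcement learning, which diverges markedly from this text-only classification task. 'Unify Models' and 'MLLM' are weakly related through multi-task learning and LLM applications (score 2), and 'Tokenizer' is a generic component (score 2), while 'Visual Encoder', 'World Models', 'MultiModal', and 'model-based RL' are entirely unrelated (score 0). The author list contains none of the designated experts, so no bonus applies.

Keywords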

Safety Classification, Multi-Task Learning, Encoder-Based Models, Toxicity Detection, Jailbreak Detection, Guardrail Models, Efficient Deployment, GLiClass Architecture

207. Relational Rank Geometry in Transformers: Detecting and Steering Hidden-State Relation Frames [FAIL]

Score: 9.0 / 27.8

Authors: Mazen Kobrosly

Published: 2026-05-28

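TL;DR: This paper studies the relational rank geometry of token tuples in Transformer hidden states, showing that relation frames can be detected and used to steer behavior through targeted interventions.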

Abstract (Chinese translation)

Transformer 隐藏状态通常通过局部或低阶对象进行解释：神经元、稀疏特征、注意力头、残差流方向或激活块。本文研究了一个互补对象：Token 元组之间关系的秩索引几何。本文使用 Plucker 符号熵来测试 r 元关系是否在隐藏状态空间中留下元数匹配的方向签名。在 Llama 系列 8B、70B 和 405B 检查点上，在匹配随机控制审计下，真实关系元组在期望秩 k=r (r=3,...,6) 下显示出比打乱元组更强的方向签名一致性。多模板审计表明，这些效应经受住了表层变异，所有测试的 405B 行都保留了正的期望秩边缘，而 8B/70B 保留了具有构造器特定混合单元的正行。随后本文询问相同的关系几何是否可以被操控。在基于 32 个提示的边缘网格干净/污染干预实验中，行/列支架和答案格式保持不变，而是/否关系映射发生变化，且污染的隐藏状态关系框架被修补至干净或安慰剂目标。在 70B 和 405B 中，干净目标导向的关系框架路径恢复了干净答案行为和剩余关系几何，而仅质心和等范数控制显示出微不足道的恢复。位置/顺序控制进一步将标记位点重要性从有序干净框架几何中分离出来：目标干净形状和跨提示干净形状在标记接口处恢复行为和剩余几何，而污染捐赠者转移、同位置置换/反射、错误位置干净增量、仅质心运动和等范数噪声失败或远低于干净框架路径。结果是从关系探测到关系框架干预的受控桥梁：关系秩几何可以在 Transformer 隐藏状态中被检测、目标化和行为验证。

Abstract

Transformer hidden states are often interpreted through local or low-order objects: neurons, sparse features, attention heads, residual-stream directions, or activation patches. This paper studies a complementary object: the rank-indexed geometry of relations among token tuples. I use Plücker sign entropy to test whether r-argument relations leave arity-matched orientation signatures in hidden-state space. Across Llama-family 8B, 70B, and 405B checkpoints, true relation tuples show stronger orientation-sign consistency at the expected rank k=r for r=3,...,6 than scrambled tuples under matched random-control audits. Multi-template audits show that the effects survive surface variation, with all tested 405B rows retaining positive expected-rank margins and 8B/70B retaining positive rows with constructor-specific mixed cells. I then ask whether the same relation geometry can be steered. In an edge-grid clean/corrupt intervention assay over 32 prompts, the row/column scaffold and answer format stay fixed while the YES/NO relation map changes, and the corrupt hidden-state relation frame is patched toward clean or placebo targets. In 70B and 405B, clean-targeted relation-frame paths recover clean-answer behavior and residual relation geometry, while centroid-only and equal-norm controls show negligible recovery. Site/order controls further separate marker-site importance from ordered clean-frame geometry: target clean shape and cross-prompt clean shape recover behavior and residual geometry at the marker interface, whereas corrupt-donor transfer, same-site permutation/reflection, wrong-site clean deltas, centroid-only motion, and equal-norm noise fail or remain far below clean-frame paths. The result is a controlled bridge from relation probing to relation-frame intervention: relation rank geometry can be detected, targeted, and behaviorally validated in transformer hidden states.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

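Scoring rationale: The paper centers on analyzing the relational geometric structure of Transformer hidden states and on intervention methods, which belongs to model interpretability. The supplied keyword set emphasizes multimodal large models, world models, and reinforcement learning, and has very low relevance to this paper's topic (internal geometric analysis of text-only models). Only Tokenizer is faintly touched through token tuples; Visual Encoder and MultiModal are entirely unrelated; the remaining keywords (Unify Models, World Models, MLLM, model-based RL) do not appear in the abstract or title. The weighted total is about 9.0, far below the dynamic passing score of 27.8. The author list contains none of the designated experts, so no bonus applies.

Keywords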

Transformer hidden states, Relational rank geometry, Relation-frame intervention, Plücker sign entropy, Token tuples, Llama-family checkpoints, Orientation signatures, Hidden-state steering

208. COMPOSE: Composing Future Theorems from Citations and Formal Structure [FAIL]

Score: 9.0 / 27.8

Authors: David Busbib, Michael Werman

Published: 2026-05-28

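TL;DR: COMPOSE proposes a dual-graph framework that combines citation context with formal dependency context to generate grounded future mathematical theorems, outperforming baselines.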

Abstract (Chinese translation)

一个合理的未来数学命题必须满足两个约束：它应遵循先前工作的方向，并尊重那些限制什么可以被有效推导的形式依赖关系。现有方法通常仅对其中一个来源进行建模，产生的命题要么依据不足，要么动机不充分。我们提出了有依据的未来数学生成，其目标是为一篇锚定论文生成一个合理的未来定理类命题，使用两种互补的上下文来源：其科学引文图（scientific citation graph）和对齐的形式定理依赖图（aligned formal theorem dependency graph）。为应对这一设定，我们提出了 COMPOSE，这是一个双图框架，它基于科学引文上下文和形式定理结构对语言模型进行条件化。为支持这一设定，我们从 arXiv 和 Mathlib 构建了一个包含 10.8 万个配对科学 - 形式图示例的数据集，以及一个包含 2024-2025 年 4.7 万篇未来论文的基准。实验表明，COMPOSE 在检索真实未来论文方面优于强基线，并在 LLM-judge 评估下取得了最佳整体性能，产生了更具依据且数学上更丰富的输出。这些结果表明，未来数学生成得益于将科学上下文与形式结构相结合。项目页面位于 https://david-busbib.github.io/COMPOSE-page/。

Abstract

A plausible future mathematical claim must satisfy two constraints: it should follow the direction of prior work and respect the formal dependencies that constrain what can validly follow. Existing approaches typically model only one of these sources, producing claims that are either weakly grounded or insufficiently motivated. We introduce grounded future mathematical generation, where the goal is to generate a plausible future theorem-like claim for an anchor paper using two complementary sources of context: its scientific citation graph and aligned formal theorem dependency graph. To address this setting, we propose COMPOSE, a dual-graph framework that conditions a language model on both scientific citation context and formal theorem structure. To support this setting, we construct a dataset of 108K paired scientific-formal graph examples from arXiv and Mathlib, together with a benchmark of 47K future papers from 2024--2025. Experiments show that COMPOSE outperforms strong baselines on retrieval to real future papers and achieves the best overall performance under LLM-judge evaluation, producing more grounded and mathematically richer outputs. These results show that future mathematical generation benefits from combining scientific context with formal structure. Project page is available at https://david-busbib.github.io/COMPOSE-page/.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

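Scoring rationale: The paper focuses on generating future mathematical theorems by conditioning a language model on a citation graph and a formal dependency graph. It involves no visual encoders, multimodal data, or reinforcement learning, so it is unrelated to the Visual Encoder, MultiModal, MLLM, and model-based RL keywords. Although it unifies two kinds of graph context (Unify Models) and predicts future states (World Models), its core domain (mathematics and graphs) does not match the multimodal and reinforcement-learning setting the keyword set targets, so relevance is low. It proposes no Tokenizer-specific innovation, hence the low Tokenizer score.

Keywords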

Future Mathematical Generation, Citation Graph, Formal Theorem Dependency, Dual-Graph Framework, Language Model, Grounded Generation, arXiv, Mathlib

209. Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning [FAIL]

Score: 9.0 / 27.8

Authors: Shaojie Wang, Liang Zhang

Published: 2026-05-28

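TL;DR: The paper proposes the Preplan-Plan-CoT (PPC) framework, which adds an explicit problem-understanding stage to improve LLM mathematical reasoning and achieves the best results on 39 of 40 metrics without additional inference token overhead.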

摘要翻译

现有的基于计划的推理方法通过在执行前插入一个规划阶段来改进大语言模型（LLMs），从而产生了问题 $\rightarrow$ 规划 $\rightarrow$ CoT（思维链）范式。虽然有效，但更深入的检查揭示了一个固有的范式层面的差距：规划及其执行阶段均决定了如何解决问题，而关于“解决什么”这一先决问题——包括识别问题类型、适用工具以及可预见的陷阱——仍然完全是隐式的。为弥合这一差距，我们提出了 PPC（预规划 - 规划 - CoT），该框架引入了一个显式的问题理解阶段，即预规划（preplan），从而形成了新的问题 $\rightarrow$ 预规划 $\rightarrow$ 规划 $\rightarrow$ CoT 范式。实现这一范式需要在两端保障预规划的概念完整性。具体而言，我们设计了一个三阶段合成管道，包含一个 spoiler-score（剧透分数）检测器，用于过滤泄露和剧透失败，以构建干净的预规划监督；同时，一个复合 GRPO 奖励机制确保生成的规划确实源自预规划。在四个骨干模型和五个数学推理基准上的实验表明，PPC 在 40 个指标中的 39 个上取得了最佳结果，相比最强基线，maj@16 和 pass@16 分别提高了 +2.23 和 +3.06，且未引入额外的推理 token 开销。

Abstract

Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ CoT paradigm. While effective, a closer examination reveals an inherent paradigm-level gap: both the planning and its execution stages decide how to solve a problem, while the prior question of what to solve (recognizing the problem type, the applicable tools, and the foreseeable pitfalls) remains entirely implicit. To bridge this gap, we propose PPC (Preplan-Plan-CoT), a framework that introduces an explicit problem-understanding stage, the preplan, yielding a new question $\rightarrow$ preplan $\rightarrow$ plan $\rightarrow$ CoT paradigm. Realizing this paradigm requires safeguarding the conceptual integrity of preplan at both ends. Specifically, we design a three-stage synthesis pipeline with a spoiler-score detector that filters out leakage and spoiler failures to build clean preplan supervision, and a composite GRPO reward enforces that the generated plan genuinely follows from the preplan. Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing additional inference token overhead.
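
The maj@16 and pass@16 figures in the abstract are sampling-based metrics. As a minimal sketch, assuming the common definitions (pass@k counts a problem as solved if any of the k sampled answers is correct, maj@k counts it as solved if the most frequent sampled answer is correct), they could be computed as follows. The function names and toy answers are illustrative and not taken from the paper, which may use a different estimator.

```python
from collections import Counter

def pass_at_k(samples, gold):
    """1.0 if any of the k sampled answers equals the gold answer (assumed definition)."""
    return float(any(s == gold for s in samples))

def maj_at_k(samples, gold):
    """1.0 if the most frequent sampled answer equals the gold answer (assumed definition)."""
    majority, _ = Counter(samples).most_common(1)[0]
    return float(majority == gold)

# Toy data: 4 sampled answers per problem instead of 16.
problems = [
    (["12", "12", "7", "12"], "12"),  # majority vote and at least one sample are correct
    (["3", "5", "5", "9"], "3"),      # one sample is correct, but the majority vote is wrong
]
pass_score = sum(pass_at_k(s, g) for s, g in problems) / len(problems)
maj_score = sum(maj_at_k(s, g) for s, g in problems) / len(problems)
print(pass_score, maj_score)
```

The second toy problem shows why pass@k is never lower than maj@k under these definitions: a single correct sample is enough for pass@k, while maj@k needs the correct answer to win the vote.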

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

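Scoring rationale: The paper focuses on LLM mathematical reasoning and proposes the Preplan-Plan-CoT framework. Because it handles text only and involves no visual or multimodal content, Visual Encoder, MLLM, and MultiModal score 0. It does involve planning and reinforcement-learning optimization (GRPO), but its core is a reasoning paradigm rather than environment-dynamics modeling, so World Models and model-based RL have low relevance (1 to 2 points). Unify Models and Tokenizer are slightly relevant because the work unifies reasoning stages and builds on standard LLM infrastructure.

Keywords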

LLM Mathematical Reasoning, Preplan-Plan-CoT, Problem Understanding, GRPO Reward, Chain of Thought, Planning Stage, Spoiler-score Detector, Conceptual Integrity

210. Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate [FAIL]

Score: 9.0 / 27.8

Authors: David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky

Published: 2026-05-28

TL;DR: This paper investigates why LLMs fail clinical triage in multiple-choice formats, concluding the failure stems from output format mapping rather than deficits in clinical knowledge representation.

摘要翻译

基于患者叙述的临床分诊基准显示，消费级大语言模型在受限的多项选择题输出下存在较高的低估率，但相同案例在自由文本输出下的表现却有所不同。我们探究输出格式是否改变了模型的临床表征，抑或是仅改变了从保留表征到答案的映射关系。利用 Gemma 3 4B/12B IT 和 Qwen3-8B 中的稀疏自编码器 (SAE) 特征，我们发现相同的医学特征在两种输出格式下均会对共享的临床叙事产生激活，但在所有案例的所有模型的多项选择题决策 token 处均保持沉默。三种独立方法（自然语言自编码器言语化、决策 token logit 归因及顶部特征表征）一致认为，驱动决策 logit 的是支架和格式特征，而非医学特征。行为分析显示，多项选择题惩罚在结构化及自然语言输入下均发生反转，选项顺序打乱排除了位置偏差，且该差距主要由差一决策主导（即模型选择了与金标准答案相邻的严重程度等级字母），而非知识性失败。因此，该失败源于输出格式，而非临床表征。

Abstract

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs under constrained multiple-choice output, yet the same cases score differently with free-text output. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in all cases for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, option-order shuffle rules out positional bias, and the gap is dominated by off-by-one decisions (the model picks an acuity letter adjacent to the gold answer) rather than knowledge failure. Thus, the failure originates in the output format and not in the clinical representation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on text-based LLM interpretability in clinical triage using sparse autoencoders analyzing output format bias. It has minimal overlap with the provided keywords which emphasize Multimodal Learning, World Models, and Reinforcement Learning. Only weak connections exist regarding model comparison (Unify Models) and token-level analysis (Tokenizer, MLLM). No visual encoders, world models, or RL methods are involved. Expert authors not found.

Keywords

Clinical Triage, LLM Interpretability, Sparse Autoencoder, Output Format Bias, Internal Representation, Decision Token, Gemma 3, Free-text vs Multiple-choice

211. Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models? [FAIL]

Score: 9.0 / 27.8

Authors: Pawel Batorski, Abtin Pourhadi, Jerzy Sarosiek, Przemyslaw Spurek, Paul Swoboda

Published: 2026-05-28

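TL;DR: The paper finds that prompts semantically unrelated to the task can nonetheless steer LLM behavior, either improving performance or inducing systematic biases.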

摘要翻译

大语言模型（LLMs）对提示词高度敏感，但这种敏感性通常是通过任务相关的指令、演示或推理线索来研究的。在本文中，我们研究了一种不同形式的提示词敏感性：与任务语义无关的提示词是否仍能引导模型行为。我们将它们称为虚假提示词（spurious prompts），并展示了其惊人的有效性。我们还提出了一种简单的黑盒搜索方法以发现它们。在推理和问答基准测试中，我们使用参数量从 0.8B 到 27B 且涵盖三个模型家族的模型，展示了虚假提示词可以提升性能，通常匹配或优于标准提示词基线和任务感知提示优化。我们进一步展示了它们可以将模型引导至意外行为，例如反复选择第一个答案选项、产生错误答案，或在未明确指示模型如此操作的情况下返回偶数、质数或较小的数。这些发现揭示了一种新的提示词敏感性：大语言模型可以被与其所要求解的任务无关的提示词所系统性地引导。我们的代码可在 https://github.com/Batorskq/spurious 获取。

Abstract

Large language models are highly sensitive to prompts, but this sensitivity is usually studied through task-relevant instructions, demonstrations, or reasoning cues. In this paper, we study a different form of prompt sensitivity: whether prompts that are semantically unrelated to the task can nevertheless steer model behavior. We call them spurious prompts and show their surprising efficacy. We also propose a simple black-box search procedure for discovering them. Across reasoning and question-answering benchmarks, using models ranging from 0.8B to 27B parameters and spanning three model families, we show that spurious prompts can improve performance, often matching or outperforming standard prompting baselines and task-aware prompt optimization. We further show that they can steer models toward unintended behaviors, such as repeatedly selecting the first answer option, producing incorrect answers, or returning an even, prime, or small number without explicitly instructing the model to do so. These findings reveal a new kind of prompt sensitivity: LLMs can be systematically steered by prompts that are unrelated to the task they are asked to solve. Our code is available at https://github.com/Batorskq/spurious

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

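Scoring rationale: The paper focuses on the prompt sensitivity of LLMs and is unrelated to the visual encoder, world model, model-based RL, and multimodal topics in the keyword set. It is only weakly connected to LLM infrastructure (Tokenizer) and to models in the broad sense (MLLM, Unify Models), which therefore receive low scores; all other keywords have a relevance of 0.

Keywords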

Spurious Prompts, Large Language Models, Prompt Sensitivity, Black-box Search, Unintended Behaviors, Prompt Steering, Reasoning Benchmarks

212. CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems [FAIL]

Score: 9.0 / 27.8

Authors: Ziyang Ma, Dingyi Zhang, Sichu Liang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou

Published: 2026-05-28

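TL;DR: CONCAT uses consensus- and confidence-driven agent clustering and ad hoc teaming to cut the communication overhead of LLM-based multi-agent systems and improve efficiency, without any task-specific training.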

摘要翻译

尽管基于大语言模型（LLM）的多智能体系统（MAS）展现出解决复杂任务的能力，并在性能上优于单智能体系统，但由于智能体之间沉重的通信负担，它们会导致巨大的计算开销。先前研究致力于训练稀疏多智能体图或微调规划器以更好地协调工作流程。然而，这些额外的训练过程引入了计算成本，并将多智能体系统限制在特定领域，从而损害了它们的泛化能力。本文提出一种名为 CONCAT 的基于共识（CONsensus）和信心驱动（Confidence-driven）的临时组队（Ad hoc Teaming）的无训练多智能体协作框架，以高效组织智能体交互。具体而言，智能体根据其初始答案进行聚类，每个集群的领导者根据智能体的信心被选中。随后，基于心智理论设计了一个启发式函数，根据每个领导者的答案和信心预测每两个领导者之间的协作收益。最后，基于预测收益剔除一定比例的通信后，组织了一个临时多智能体网络。在三个 LLM 和三个基准上的实验表明，CONCAT 的效率（准确率/延迟比）最高达到 LLM-Debate 的 2.02 倍，并优于 AgentDropout 等训练感知方法，同时在 Qwen2.5-14B-Instruct 上将平均延迟降低了 50.1%，且无需任何特定任务的训练。

Abstract

Although large language model (LLM) based multi-agent systems (MAS) show their capability to solve complex tasks and achieve higher performance over single agent systems, they lead to huge computational overheads because of heavy communication between agents. Previous research has made efforts to train a sparse multi-agent graph or fine-tune a planner to orchestrate the workflow better. However, such extra training processes introduce computational costs and limit MAS to specific domains, therefore compromising their generalizability. In this paper, we propose CONCAT, a training-free multi-agent collaboration framework based on CONsensus and Confidence-driven Ad hoc Teaming to efficiently organize agent interactions. Specifically, agents are clustered based on their initial answers, and leaders of each cluster are selected based on the agents' confidence. Then, a heuristic function based on the Theory of Mind is designed to predict the collaboration benefits between every two leaders according to their answers and confidence. Finally, an ad hoc multi-agent network is organized after evicting a percentage of communications based on the predicted benefits. Experiments across three LLMs and three benchmarks show that CONCAT achieves up to 2.02x higher efficiency (accuracy/latency ratio) than LLM-Debate and outperforms training-aware methods such as AgentDropout, while reducing average latency by 50.1% on Qwen2.5-14B-Instruct, without any task-specific training.
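
The four steps named in the abstract (cluster by initial answer, pick a leader by confidence, score leader pairs, evict low-benefit communications) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the Theory-of-Mind heuristic, so the `benefit` formula here is a hypothetical placeholder, and `build_team`, `evict_ratio`, and the agent tuples are invented for the example.

```python
from itertools import combinations

def build_team(agents, evict_ratio=0.34):
    """agents: list of (name, answer, confidence). Returns (leader_names, kept_edges)."""
    # 1) Cluster agents by their initial answer.
    clusters = {}
    for name, answer, conf in agents:
        clusters.setdefault(answer, []).append((name, conf))
    # 2) Select the most confident agent of each cluster as its leader.
    leaders = [max(members, key=lambda m: m[1]) for members in clusters.values()]
    # 3) Score every pair of leaders with a placeholder benefit heuristic.
    #    Assumption: a larger confidence gap means one leader can inform the other more.
    edges = []
    for (n1, c1), (n2, c2) in combinations(leaders, 2):
        benefit = abs(c1 - c2)
        edges.append((benefit, n1, n2))
    # 4) Evict the lowest-benefit fraction of communication edges.
    edges.sort(reverse=True)
    keep = len(edges) - int(len(edges) * evict_ratio)
    return [name for name, _ in leaders], [(u, v) for _, u, v in edges[:keep]]

agents = [
    ("a1", "42", 0.9), ("a2", "42", 0.6),  # cluster "42", leader a1
    ("a3", "17", 0.6),                     # cluster "17", leader a3
    ("a4", "8", 0.4), ("a5", "8", 0.5),    # cluster "8", leader a5
]
leader_names, kept_edges = build_team(agents)
print(leader_names, kept_edges)
```

Only leaders communicate in this sketch, so five agents reduce to three leaders and, after eviction, two communication edges.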

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

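Scoring rationale: The paper studies communication-efficiency optimization and a collaboration framework for LLM-based multi-agent systems, centered on a consensus- and confidence-driven teaming strategy. The keyword set concentrates on multimodal architectures (Tokenizer, Visual Encoder, MultiModal, MLLM), world models, and reinforcement learning. The paper involves no visual encoders, tokenizer design, or multimodal fusion, so it is unrelated to MultiModal, Tokenizer, and Visual Encoder; it uses LLMs but not an MLLM architecture; it involves agent coordination but is only weakly related to Unify Models (model unification) and World Models (environment models); and its heuristic planning does not amount to model-based RL. Overall relevance is therefore low, and the weighted total falls far below the dynamic passing score of 27.8.

Keywords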

LLM-based Multi-Agent Systems, Communication Efficiency, Consensus-Driven, Confidence-Driven, Ad Hoc Teaming, Training-Free, Theory of Mind, Agent Clustering

213. Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies [FAIL]

Score: 8.2 / 27.8

Authors: Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

Published: 2026-05-28

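TL;DR: The paper prompts language models with human test-taking strategies for grounded claim factuality checking, cutting token usage by over 80% compared with unguided reasoning, and trains small language models that perform on par with strong baselines at lower inference cost.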

摘要翻译

基于事实依据的声明事实性检查对于大语言模型（LLM）应用（如检索增强生成（RAG））至关重要，因为它有助于用户评估生成输出的正确性。现有的基于蕴含分类器的指标需要针对特定数据集进行阈值调优，而基于 LLM 的方法通常采用直接提示，这未能充分利用 LLM 的推理能力。我们通过将基于事实依据的声明事实性检查表述为真假阅读理解任务，并通过明确的应试策略提示 LLM 以实现高效推理来解决这一问题。与无引导的开放推理相比，我们的方法减少了超过 80% 的令牌用量，并在两个事实性基准上实现了与更昂贵替代方案相当的性能，其中一个基准上达到了新的最新状态（SOTA）。为了进一步降低推理成本，我们训练小语言模型（SLMs）以在验证流程中替换 LLM。通过监督微调（SFT）和自我修正机制，SLMs 学习改进其事实性判断。实验结果表明，所得的 SLMs 与强基线表现相当，结合了低推理成本与生成支持性理由的能力，从而支持可解释性。代码和数据集将在录用后发布。

Abstract

Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning, and achieves competitive performance to more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference costs with generating supporting rationales to support interpretability. Code and datasets will be released upon acceptance.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.5/10	2.2
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

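Scoring rationale: The paper studies factuality checking with language models, using test-taking strategies and a small language model (SLM) pipeline. It is largely unrelated to the supplied keywords: there are no visual encoders, world models, multimodal content, or reinforcement learning. 'Unify Models' is touched on only slightly (integrating LLMs and SLMs in one pipeline), 'Tokenizer' is matched only through token usage rather than architecture, and 'MLLM' only through language models rather than multimodality. The weighted total is about 8.25, far below the dynamic passing score of 27.8, which indicates a very poor match between the paper's topic and the keyword areas.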
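
The total quoted in the rationale can be reproduced from the table rows. The sketch below assumes that each keyword score is weight times relevance, that the total is their sum, and that an entry passes when the total reaches the threshold; the report does not state these rules explicitly, so treat them as assumptions.

```python
# Rows copied from the scoring table for this entry: (keyword, weight, relevance out of 10).
rows = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.5),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 2.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]
PASS_THRESHOLD = 27.8  # dynamic passing score quoted in the report

# Assumed rule: per-keyword score = weight * relevance, total = sum of the scores.
total = sum(weight * relevance for _, weight, relevance in rows)
status = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.2f} / {PASS_THRESHOLD} -> {status}")
```

Under these assumptions the unrounded total is 8.25. The 8.2 in the score line and the 2.2 in the Tokenizer row are consistent with one-decimal formatting that rounds ties to even.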

Keywords

Grounded claim factuality checking, Language models, Test-taking strategies, Small language models, Supervised fine-tuning, Self-revision mechanism, Retrieval-augmented generation, Reading comprehension task

214. Gram: Assessing sabotage propensities via automated alignment auditing [FAIL]

Score: 7.5 / 27.8

Authors: David Lindner, Victoria Krakovna, Sebastian Farquhar

Published: 2026-05-28

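TL;DR: The paper introduces Gram, a framework for automatically auditing the sabotage propensity of AI agents in simulated scenarios, and finds that much of the observed Gemini misbehavior is explained by overeagerness.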

摘要翻译

我们介绍了 Gram，一个自动化的对齐审计（alignment auditing）框架，用于评估 AI 智能体参与破坏行为的倾向。我们在 17 个模拟的智能体部署场景中评估了 Gemini 模型，这些场景旨在激励破坏行为。我们发现，在约 2-3% 的模拟轨迹中，Gemini 模型表现不当。许多此类情况可归因于 Gemini 模型中的“过度急切”（overeagerness），导致过度的角色扮演和目标导向行为。与其他对齐审计方法不同，Gram 专门设计用于评估编码和研究智能体中的对齐偏差（misalignment）和故意破坏行为。此外，我们还引入了一种实验调查员智能体流程，该流程支持细粒度的针对性实验，以识别不当行为的驱动因素。我们发现，提高环境的真实性并移除诱导不当行为的助推（nudges），往往能将破坏率降低至接近零。

Abstract

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini models misbehave in about 2-3% of our simulated trajectories. Many of these cases are explained by "overeagerness" in Gemini models resulting in both excessive role-playing and goal-seeking behavior. In contrast to other alignment auditing approaches, Gram is designed to specifically evaluate misalignment and intentional sabotage in agentic coding and research agents. We additionally introduce an experimental investigator agent pipeline which enables fine-grained targeted experiments to identify the drivers of misbehavior. We find that increasing realism of environments and removing nudges to misbehave tends to reduce sabotage rates close to zero.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

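Scoring rationale: The paper is about AI safety alignment auditing (the Gram framework) and evaluates the sabotage propensity of agents, which differs markedly from the domain of the supplied keywords (unified multimodal models, world models, representation learning, model-based RL). Among the keywords, only MLLM and MultiModal are weakly related, because Gemini models are involved, and model-based RL is very weakly related through the agents' goal-directed behavior. The remaining keywords (Tokenizer, Visual Encoder, World Models, Unify Models) are not mentioned in the paper or are unrelated.

Keywords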

Gram, alignment auditing, sabotage propensities, AI agents, Gemini models, simulated deployment, misalignment, goal-seeking

215. Do Language Models Track Entities Across State Changes? [FAIL]

Score: 7.5 / 27.8

Authors: Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim

Published: 2026-05-28

TL;DR: The paper investigates how language models track entities across state changes, finding they use a non-sequential aggregation strategy rather than incremental state tracking.

摘要翻译

实体追踪（ET），即跟踪状态的能力，是支撑复杂推理的一项基础技能。越来越多的研究探讨了 Transformer 语言模型（LMs）如何在无状态变化的情况下解决实体绑定问题。然而，目前对于非玩具 LMs 如何处理用自然语言表达的具有现实难度的实体追踪问题，理解尚不充分。为此，我们探究了在包含多个状态变化操作的更复杂场景下，实体追踪背后的潜在机制。我们发现，LMs 并不会跨 token 增量跟踪世界状态，也不会跨层跟踪与查询相关的状态，而是在查询意图明确时，仅在最后一个 token 处并行聚合相关信息。我们进一步探究了单个操作（PUT、REMOVE、MOVE）的机制，以刻画这种非增量实体追踪机制。令人惊讶的是，LMs 通过一个脆弱的全局抑制标签来实现 REMOVE 操作；这一全局移除机制预测了多种失败模式，我们在行为层面证实了这些模式。我们提供了一种机制层面的解决方案，即通过消除该标签来部分解决这一问题。总体而言，我们的发现揭示了 LMs 使用非顺序策略解决本质上顺序性的任务。更广泛而言，我们的工作展示了行为分析与机制分析如何能够富有成效地相互作用。行为结果指导机制假设的构建，而机制分析的洞见则通过预测现有评估中缺失的失败模式，帮助构建更稳健的行为评估。

Abstract

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding without state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not incrementally track world states across tokens or query-relevant states across layers, but simply aggregate relevant information in parallel at the last token when the query becomes evident. We further investigate mechanisms of individual operations (PUT, REMOVE, MOVE) to characterize this non-incremental ET mechanism. Surprisingly, LMs implement the REMOVE operation with a fragile global suppression tag; this global removal mechanism predicts various failure modes that we confirm behaviorally. We provide a mechanistic solution of nullifying this tag to partially address this issue. Overall, our findings reveal that LMs solve a fundamentally sequential task using a non-sequential strategy. More broadly, our work illustrates how behavioral and mechanistic analyses can fruitfully interact. Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.
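
To make "incremental state tracking" concrete, here is a sketch of the sequential reference procedure that the abstract says LMs do not follow: the world state is updated after every operation. The operation semantics and the tuple encoding are assumptions for illustration, since the abstract names PUT, REMOVE, and MOVE without defining them.

```python
def track(operations):
    """Apply operations in order and return {container: set_of_items}."""
    state = {}
    for op in operations:
        kind = op[0]
        if kind == "PUT":        # ("PUT", item, container)
            _, item, box = op
            state.setdefault(box, set()).add(item)
        elif kind == "REMOVE":   # ("REMOVE", item, container)
            _, item, box = op
            state.get(box, set()).discard(item)
        elif kind == "MOVE":     # ("MOVE", item, source, destination)
            _, item, src, dst = op
            state.get(src, set()).discard(item)
            state.setdefault(dst, set()).add(item)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return state

ops = [
    ("PUT", "apple", "box A"),
    ("PUT", "key", "box B"),
    ("MOVE", "apple", "box A", "box B"),
    ("REMOVE", "key", "box B"),
]
final_state = track(ops)
print(final_state)
```

Note that REMOVE here is scoped to one item in one container. The abstract reports that LMs instead use a global suppression tag, which is the kind of mismatch a reference tracker like this could help expose in behavioral tests.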

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on mechanistic analysis of entity tracking in language models regarding state changes. It does not involve multimodal data, tokenizers, visual encoders, or model unification. There is slight conceptual relevance to World Models and model-based RL due to state tracking, but the core task is LM interpretation, not these architectures, resulting in low overall relevance.

Keywords

Entity Tracking, Language Models, State Changes, Mechanistic Analysis, Transformer, Reasoning, Non-incremental Strategy

216. Enhancing Multi-Agent Communication through Attention Steering with Context Relevance [FAIL]

Score: 7.5 / 27.8

Authors: Hongxiang Zhang, Yuan Tian, Tianyi Zhang

Published: 2026-05-28

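TL;DR: The paper proposes Agent-Radar, a training-free method that dynamically steers each agent's attention toward relevant conversation context, mitigating the dilution of relevant information in long multi-agent conversations and improving performance.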

摘要翻译

基于大语言模型（LLM）的多智能体系统通过协作推理在复杂任务上展现了卓越的性能。然而，这些系统在交互过程中往往会迅速积累极长的对话历史。随着对话长度增加，相关信息日益被无关上下文稀释，导致性能下降。本文提出了一种名为 Agent-Radar 的无需训练的上下文管理方法，该方法通过一种新颖的时空衰减机制，动态引导每个智能体的注意力聚焦于相关上下文。我们的实验表明，Agent-Radar 在五个不同的基准测试上均优于最先进方法，性能提升高达 7.64 个绝对百分点。此外，我们的分析显示，随着智能体数量和交互轮数的增加，Agent-Radar 依然保持有效性和稳健性。最后，消融实验表明，Agent-Radar 中的核心组件对性能至关重要，且在不同场景下具有泛化能力。

Abstract

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

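Scoring rationale: The paper centers on context management and attention steering in multi-agent systems (Agent-Radar) and mainly concerns optimizing LLM conversations. The supplied keywords emphasize multimodal large-model architectures (Visual Encoder, MultiModal, MLLM), model unification (Unify Models), and model-based RL, so they have very little connection to the paper. The author list contains none of the designated experts (Yang Shi and others), so no bonus is added. The weighted total is about 7.5, far below the dynamic passing score.

Keywords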

Multi-Agent Communication, Attention Steering, Context Relevance, LLM-based Systems, Agent-Radar, Context Management, Temporal Decay, Spatial Decay

217. When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems [FAIL]

Score: 7.5 / 27.8

Authors: Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi

Published: 2026-05-28

TL;DR: This paper investigates hybrid multi-agent systems combining cloud LLMs and on-device SLMs to optimize the trade-off between performance, cost, and energy consumption, finding that optimal architecture is task-dependent rather than simply relying on larger frontier models.

摘要翻译

智能体 AI 推理的设计空间跨越两个极端：通常是托管于云端、能在广泛任务上提供强大性能但成本高昂的前沿大型语言模型 (LLMs)，以及更具成本效益、适用于端侧推理的小型语言模型 (SLMs)。结合端侧模型与云模型的混合多智能体系统 (MASs) 提供了一个有希望的折中方案，但它们也引入了一个复杂且理解不足的设计空间，其中任务准确性、货币成本和边缘能耗紧密耦合；由于缺乏通用设计原则，混合组件尽管并非最普遍的选择，通常是通过针对特定领域的临时决策引入的。在这项工作中，我们更系统地考察了这一设计空间。我们适配了两个代表性的 MAS 架构以支持混合推理，并研究单个设计选择如何沿着功耗、成本和性能的帕累托前沿 (Pareto frontier) 移动运行点。我们的发现描绘了混合 MAS 设计的细致图景：虽然 SLMs 可以从 LLMs 辅助中有效受益，但最优架构高度依赖于任务，且更大的前沿级算力并不一致地转化为更好的性能。

Abstract

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on hybrid inference architectures for LLMs and SLMs in multi-agent systems, addressing cost-performance-energy trade-offs. It lacks content on multimodal components (Visual Encoder, MultiModal, MLLM), tokenization strategies, world models, or model-based reinforcement learning algorithms. Only the concept of unifying different model tiers (Unify Models) and general agent context (model-based RL) show minimal relevance. No expert authors from the specified list are present.

Keywords

Hybrid Multi-Agent Systems, Cloud Agents, Device Agents, Small Language Models, Large Language Models, Inference Architecture, Cost-Performance Trade-off

218. Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets [FAIL]

Score: 7.5 / 27.8

Authors: Prasanjit Dubey, Xiaoming Huo

Published: 2026-05-28

TL;DR: This paper establishes minimax rates and optimal heterogeneous bandwidth allocation for federated probe-logit distillation, focusing on communication efficiency in distributed language modeling rather than multimodal or reinforcement learning architectures.

摘要翻译

在联邦语言建模中，$K$ 个节点各自持有 $n$ 个样本，但无法汇聚数据或交换全精度梯度及权重。我们研究在公共探测集中，当每个节点每次查询最多上传 $B$ 比特时，估计 $V$ 个词元上条件分布的极小极大率。在联邦探测-logit 蒸馏（FPLD）中，每个节点在探测集上传输一个标量量化 logit 向量，聚合器则蒸馏出一个全局参数化学生模型。先前工作（Dubey 和 Huo，2026）确立了高概率 KL 率 $O(d/(Kn) + \rho\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ 加上优化松弛，其中带宽项采用迹锐化形式。该带宽项率是否紧致，以及该上界如何推广至异构节点带宽，仍是开放性问题。我们填补了这两个空白。首先，加抖动 FPLD 构造在非退化性条件下具有匹配的单轮下界 $\Omega(K^{-1} \cdot 2^{-2B/V})$，从而将带宽轴率确定为 $\Theta(K^{-1} \cdot 2^{-2B/V})$。采用嵌套/缩放残差量化器的 $T$ 轮顺序精炼可达到 $O(K^{-1} \cdot 2^{-2TB/V})$；而原始 FPLD 中与 $T$ 无关的带宽项对于任意 $T > 1$ 而言均是次优的。其次，我们为各节点预算 $B_i$ 建立了异构带宽上界，并给出了闭式最优分配 $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$，这是一种对数倾斜水填充规则，相当于率失真优化中反向水填充规则的节点类比。一种插件自适应变体通过短预热阶段估计权重，并实现了 $1 + O(\sqrt{\log(K/\delta)/(m T_0)})$ 的相对次优性。合成 n-gram 模拟证实，经验 KL 被上下界所界定，且在异构裁剪条件下，最优分配严格优于均匀分配和逆权重基线。

Abstract

In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + \rho\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open. We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $\Omega(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $\Theta(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/\delta)/(m T_0)})$ relative suboptimality. Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.
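
The closed-form allocation in the abstract is easy to evaluate numerically. The sketch below assumes that $\bar{w}_g$ is the geometric mean of the node weights, which the subscript suggests but the abstract does not state. Under that assumption the log terms cancel across nodes, so the budgets sum to $B_{\mathrm{tot}}$. The weights, budget, and vocabulary size are made-up values, and the sketch does not clip negative or fractional budgets.

```python
import math

def optimal_allocation(weights, total_bits, vocab_size):
    """B_i* = B_tot/K + (V/2) * log2(w_i / w_bar_g), with w_bar_g assumed to be the geometric mean."""
    k = len(weights)
    # log2 of the geometric mean equals the mean of the log2 weights.
    log_geo_mean = sum(math.log2(w) for w in weights) / k
    return [
        total_bits / k + (vocab_size / 2) * (math.log2(w) - log_geo_mean)
        for w in weights
    ]

weights = [1.0, 2.0, 4.0]  # hypothetical per-node weights w_i
budgets = optimal_allocation(weights, total_bits=300.0, vocab_size=8)
print(budgets, sum(budgets))
```

Each doubling of a node's weight relative to the geometric mean adds $V/2$ bits to its budget, here 4 bits, while the uniform share $B_{\mathrm{tot}}/K$ stays at 100.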

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on federated language modeling, bandwidth allocation, and statistical rates for probe-logit distillation. It lacks content related to multimodal integration, visual encoders, world models, or reinforcement learning. While it involves tokens and language modeling, it does not address the specific architectural or application domains suggested by the keywords (e.g., no visual components, no RL, no world modeling). No specified expert authors are present.

Keywords

Federated Language Modeling, Probe-Logit Distillation, Bandwidth Allocation, Minimax Rate, Quantization, Heterogeneous Budgets, Distributed Learning

219. SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring [FAIL]

Score: 7.5 / 27.8

Authors: Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

Published: 2026-05-28

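TL;DR: The paper proposes SCOPE, a lightweight-training framework that couples a plug-in open-set classifier with in-context learning on a frozen LLM for accurate monitoring of air traffic control readback anomalies.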

摘要翻译

飞行员对空中交通管制（ATC）语音指令的复诵是防止航空运输中沟通失误的主要保障机制。然而，复诵异常仍与约 80% 的航空事件有关。这一隐患因交通量上升和认知负荷升高而进一步加剧，从而推动了机器自动复诵监控的需求。传统的基于规则和机器学习的方法难以泛化到空中交通管制员与飞行员通信中高度可变且不断演变的术语体系。尽管大语言模型（LLMs）凭借其强大的推理和泛化能力开辟了新途径，但现有方法在实践中仍面临部署和计算方面的障碍。本文提出了一种名为“通过开集插件与示例进行通信的语义推理”（Semantic reasoning for Communication via Open-set Plug-in with Examples, SCOPE）的新型轻量级训练 LLM 框架，旨在提升基于机器的 ATC 复诵监控的效率和准确性。其核心思想是在冻结的 LLM 之上，结合一个插件式开集（Open-set）分类器与精心设计的上下文学习（In-context learning）机制。在半合成通信数据集上的广泛实验表明，SCOPE 在提供运行环境所需低延迟响应的同时，实现了更高的准确率。在少样本（Few-shot）设置下，SCOPE 在开集（Open-set）检测中达到 91.05% 的准确率，并纠正了 96.63% 的异常复诵，从而超越了现有的最强基线方法，同时为其决策提供解释。这些发现表明，我们的框架作为实现可解释且可控的 ATC 复诵监控的可行路径具有巨大潜力。

Abstract

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

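评分理由: The paper studies LLM-based readback monitoring for air traffic control, which is a natural language processing application. It does not involve visual encoders, world models, or reinforcement learning, so those keywords score 0. Although it uses an LLM (weakly related to MLLM/MultiModal), it shows no unified-model architecture or tokenizer innovation, so overall relevance to the given high-weight keywords is low.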

关键词

Air Traffic Control, Readback Monitoring, Large Language Models, Lightweight-training, Open-set Plug-in, In-context Learning, Anomaly Detection, Aviation Safety

220. On-Policy Replay for Continual Supervised Fine-Tuning [FAIL]

Score: 7.5 / 27.8

Authors: Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang

Published: 2026-05-28

TL;DR: This paper proposes On-Policy Replay to mitigate catastrophic forgetting in continual supervised fine-tuning of large language models without using auxiliary losses or teacher models.

摘要翻译

持续监督微调 (SFT) 是将大型语言模型 (LLMs) 适应于连续下游任务流的事实上的标准方案，但它会遭受早期能力的灾难性遗忘。近期研究表明，同策略信号（即基于模型自身输出进行训练）比异策略监督更可靠地减少遗忘。现有的同策略方法通过一个新的训练目标（例如使用教师副本的自蒸馏损失）来传递该信号，从而继承了额外的前向传播开销、对调度方案的敏感性以及来自教师模型的风格漂移。相反，我们通过训练数据源来传递同策略信号。我们的方法，同策略回放 (OPR)，在少量历史提示配额上运行最新检查点，根据任务奖励过滤生成结果，并将幸存的（提示，模型响应）对作为普通的 SFT 示例进行回放。该方法无需教师模型，无需辅助损失，也无需实时蒸馏。在 TRACE 持续学习基准上，针对三个 7B-8B 参数的指令微调骨干模型（Qwen2.5-7B-Instruct、Qwen3-8B、Llama3.1-8B-Instruct），OPR 一致地减少遗忘；在最严峻的压力测试下（Qwen2.5-7B-Instruct，顺序 SFT BWT 为 -13.93），OPR 在 10% 回放预算下将 BWT 提升至 -0.65，在 1% 预算下提升至 -2.29——相较于经过调优的 Vanilla Replay 基线，|BWT| 减少了 46%，且在所有三个骨干模型上均观察到 42% 至 46% 的减少幅度。我们提出了一个 KL 收缩解释，将 OPR 与先前的同策略蒸馏方法置于同一轴线上，并呈现了一个反直觉的发现，解释了为何 Vanilla Replay 已是强基线：低分回放一致劣于 Vanilla Replay，这表明 OPR 的核心要素在于同策略分布，而不仅仅是响应质量本身。我们的代码可在 https://github.com/Yancey2024/OnPolicyReplay 获取。

Abstract

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals (training on the model's own outputs) reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7-8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget, a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42-46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.
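The data pipeline described in this abstract (roll out the latest checkpoint on a small budget of historical prompts, filter by a task reward, replay the survivors as ordinary SFT examples) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_on_policy_replay`, `generate`, `reward`, `budget_fraction`, and `min_reward` are hypothetical names, and the model and reward used here are toy stand-ins.

```python
import random
from typing import Callable, List, Tuple


def build_on_policy_replay(
    historical_prompts: List[str],
    generate: Callable[[str], str],        # hypothetical: sampler for the latest checkpoint
    reward: Callable[[str, str], float],   # hypothetical: task reward for (prompt, response)
    budget_fraction: float = 0.1,          # share of historical prompts to roll out
    min_reward: float = 1.0,               # assumed filtering threshold
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return (prompt, model response) pairs that pass the reward filter."""
    if not historical_prompts:
        return []
    rng = random.Random(seed)
    k = min(len(historical_prompts), max(1, int(len(historical_prompts) * budget_fraction)))
    sampled = rng.sample(historical_prompts, k)
    replay = []
    for prompt in sampled:
        response = generate(prompt)                 # on-policy rollout
        if reward(prompt, response) >= min_reward:  # keep only rewarded generations
            replay.append((prompt, response))       # replayed later as a plain SFT example
    return replay


# Toy stand-ins so the sketch runs end to end (not a real model or reward).
def toy_generate(prompt: str) -> str:
    return f"answer to {prompt}"


def toy_reward(prompt: str, response: str) -> float:
    # Pretend that only even-numbered tasks are answered correctly.
    return 1.0 if int(prompt.split("-")[1]) % 2 == 0 else 0.0


prompts = [f"task-{i}" for i in range(100)]
replay_set = build_on_policy_replay(prompts, toy_generate, toy_reward, budget_fraction=0.1)
```

The surviving pairs would simply be mixed into the next task's SFT data; no extra loss term or teacher model appears anywhere in the sketch, which is the point the abstract emphasizes.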

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

评分理由: The paper focuses on Continual Supervised Fine-Tuning (SFT) for text-based Large Language Models (LLMs) to mitigate catastrophic forgetting. It does not involve Multimodal components (Visual Encoder, MultiModal, MLLM), World Models, or Model-Based RL architectures. While it utilizes 'on-policy' terminology, the method is purely SFT-based without auxiliary losses or teacher models. Tokenizer is not a core contribution, and 'Unify Models' is not the primary architectural focus in the context of the provided keywords.

关键词

Continual Supervised Fine-Tuning, On-Policy Replay, Catastrophic Forgetting, Large Language Models, Instruction-tuned backbones, Historical Prompts, Teacher-free

221. Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills [FAIL]

Score: 7.5 / 27.8

Authors: Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Jun Sakuma

Published: 2026-05-28

TL;DR: This paper proposes Neutral Prompting Attacks that stealthily increase package hallucination in LLM coding agents via benign instructions, creating software supply chain risks without explicit malicious intent.

摘要翻译

基于大语言模型（LLM）的编码代理正越来越多地参与软件开发工作流，通过生成代码、选择依赖项以及生成包安装命令。这带来了新的软件供应链风险：当代理幻觉出不存在的包时，攻击者可能注册该幻觉名称，随后劫持安装该包的用户。现有的包幻觉攻击与防御主要关注自然发生的幻觉、目标依赖引导或事后包验证。本文提出了一种名为“中性提示攻击”（Neutral Prompting Attack, NPA）的高度隐蔽的攻击范式，其中语义上无害的指令（例如鼓励想象力和详尽性）会增加包幻觉倾向性，而不包含明确的恶意意图。与目标依赖引导不同，NPA 并不指定攻击者选择的包。相反，它将模型的依赖生成行为转向更具推测性的包名称。我们在多个面向编码的 LLM 和包幻觉基准测试上评估了 NPA。我们的结果表明，NPA 同时提高了幻觉攻击成功率（Hallucination ASR）和 Pip 安装攻击成功率（Pip Install ASR），改变了幻觉包名称的分布，并规避了现有的静态分析、基于 LLM 以及基于代理的 Skill 防御。这些发现表明，看似无害的提示可以暗中操纵幻觉行为，并产生下游软件供应链风险。

Abstract

LLM-powered coding agents increasingly participate in software development workflows by generating code, selecting dependencies, and producing package installation commands. This creates a new software supply chain risk: when an agent hallucinates a non-existent package, an attacker may register the hallucinated name and later compromise users who install it. Existing package hallucination attacks and defenses primarily focus on naturally occurring hallucinations, targeted dependency steering, or post-hoc package validation. In this paper, we introduce Neutral Prompting Attack (NPA), a highly stealthy attack paradigm in which semantically benign instructions, such as encouraging imagination and exhaustiveness, increase package hallucination propensity without containing explicit malicious intent. Unlike targeted dependency steering, NPA does not specify an attacker-chosen package. Instead, it shifts the model's dependency generation behavior toward more speculative package names. We evaluate NPA across multiple coding-oriented LLMs and package hallucination benchmarks. Our results show that NPA increases both Hallucination ASR and Pip Install ASR, changes the distribution of hallucinated package names, and evades existing static-analysis, LLM-based, and agent-based Skill defenses. These findings reveal that harmless-looking prompts can covertly manipulate hallucination behavior and create downstream software supply chain risks.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

评分理由: The paper focuses on LLM security and hallucination in coding agents, which is unrelated to multimodal architectures (MultiModal, Visual Encoder), world modeling (World Models), or model-based reinforcement learning (model-based RL). While it involves Large Language Models (MLLM), it does not address tokenizer design (Tokenizer) or model unification (Unify Models), resulting in low relevance scores for the provided keyword set.

关键词

Neutral Prompting Attack, Package Hallucination, LLM Security, Software Supply Chain, Stealthy Hallucination, Coding Agents, Benign Instructions

222. Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics [FAIL]

Score: 7.5 / 27.8

Authors: Matthew Smart, Soumya Ganguly, Nilava Metya, Alexandre V. Morozov, Anirvan M. Sengupta

Published: 2026-05-28

TL;DR: This paper interprets attention in transformers as a two-stage in-context empirical Bayes inference process driven by particle dynamics, achieving effective denoising without explicit noise schedules.

摘要翻译

我们在所有 Token 损坏的条件下研究最小化的仅 Attention (注意力) Transformer，并表明其可被解释为两阶段 Empirical Bayes (经验贝叶斯) 解释。单个 Attention 步骤计算关于由上下文定义的 Empirical Distribution (经验分布) 的核加权 Posterior Mean (后验均值)。深度通过 Particle Dynamics (粒子动力学) (Stage 1) 精炼此分布，而长距离 Skip-connection (跳跃连接) 携带噪声输入作为查询用于 Posterior Inference (后验推断) (Stage 2)，揭示了深度与 Attention Residuals (注意力残差) 的不同统计角色。该框架隔离了一个最小设置，在此设置中，上下文本身诱导了一个深度相关的 Energy Landscape (能量景观)，从而支配 In-context Inference (上下文内推断)。我们表明有效的 Denoising (去噪) 可以在没有显式 Noise Schedule (噪声调度) 的情况下出现：固定的 Kernel Bandwidth (核带宽) 和有限的 Integration Horizon (积分时间尺度) 就足够了，从而产生了一个合理的 Depth-Noise (深度 - 噪声) 关系。我们进一步为一类表现良好的 Priors (先验) 建立了 Posterior-Mean Recovery (后验均值恢复) 保证，其中经验估计量在渐近条件下收敛到 Bayes-Optimal Predictor (贝叶斯最优预测器)。将这些动力学与 Reverse-Diffusion (反向扩散) 极限联系起来，我们的结果提供了一种统计解释，将 Attention 解释为通过基于样本的后验估计进行的 In-context Inference，无需显式 Density Modeling (密度建模)。

Abstract

We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.
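The statement that a single attention step computes a kernel-weighted posterior mean over the empirical distribution of the context can be illustrated with a small sketch. It assumes isotropic Gaussian noise with standard deviation `bandwidth` and uses the context points as both keys and values; the function name and parameters are hypothetical, and the paper's exact parameterization may differ.

```python
import math
from typing import List, Sequence


def posterior_mean(query: Sequence[float],
                   context: List[Sequence[float]],
                   bandwidth: float) -> List[float]:
    """Posterior mean of the clean point given a noisy query, under an
    empirical prior on the context points and Gaussian noise (assumed)."""
    # Gaussian-kernel logits: -||q - x_i||^2 / (2 * bandwidth^2)
    logits = [
        -sum((q - x) ** 2 for q, x in zip(query, point)) / (2 * bandwidth ** 2)
        for point in context
    ]
    top = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]  # softmax over the context
    # Weighted average of the context points (the "values")
    return [
        sum(w * point[d] for w, point in zip(weights, context))
        for d in range(len(query))
    ]


near_origin = posterior_mean([0.5, 0.2], [[0.0, 0.0], [10.0, 10.0]], bandwidth=1.0)
midpoint = posterior_mean([5.0, 5.0], [[0.0, 0.0], [10.0, 10.0]], bandwidth=1.0)
```

When all context points have the same norm, the Gaussian kernel reduces to a softmax over scaled dot products, since the squared-norm terms cancel in the normalization; this is the sense in which the weights above resemble attention weights.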

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on the theoretical statistical interpretation of attention mechanisms (Empirical Bayes, Particle Dynamics) in transformers, rather than multimodal architectures, world models, or reinforcement learning. Keywords related to vision, modality, and RL are largely irrelevant. Tokenizer is mentioned contextually but not studied. Unify Models refers to theoretical unification of views, not architectural unification.

关键词

Attention Mechanism, Empirical Bayes, Particle Dynamics, In-context Inference, Transformer Architecture, Posterior Estimation, Two-Stage View

223. Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization [FAIL]

Score: 7.5 / 27.8

Authors: Junlin He, Yihong Tang, Tong Nie, Guilong Li, Binyu Yang, Jinxiao Du, Lijun Sun, Wei Ma

Published: 2026-05-28

TL;DR: This paper proposes a reasoning-preserved efficient distillation method for LLMs using activation-aware initialization to prevent reasoning collapse caused by eRank collapse in projection matrices.

摘要翻译

高效蒸馏（EDistill）通过结构化剪枝参数并微调轻量级模块来压缩大型语言模型（LLMs），具有高训练效率。尽管这些经过 EDistill 蒸馏的 LLMs 在通用能力基准上相对于同等规模的 LLMs 实现了最先进（SOTA）性能，但我们发现其多步推理能力出现严重退化，我们将其称为推理崩溃。我们系统分析了推理崩溃的几何起源，并表明基于宽度缩减投影矩阵的最先进 EDistill 方法存在 eRank 崩溃问题，即隐藏表示的有效秩（eRank）下降。我们理论解释了随机初始化的投影矩阵的奇异值为何分布不均匀，从而导致 eRank 崩溃，进而引发 Token 不可区分性。为了解决这一问题，我们提出了针对 LLMs 的 RED（推理保持的高效蒸馏），该方法引入感知激活的初始化，将投影矩阵初始化为通道选择矩阵，从而理论上缓解 eRank 崩溃。在 Llama 和 Qwen 系列上的实验表明，RED 在保持高训练效率和最先进（SOTA）通用能力的同时，显著恢复了推理能力。

Abstract

Efficient Distillation (EDistill) compresses large language models (LLMs) by structured pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state-of-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which introduces activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.
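The abstract does not state how eRank is computed. A minimal sketch, assuming the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values, usually attributed to Roy and Vetterli), shows why unevenly distributed singular values correspond to a collapsed eRank. The function name is hypothetical, and the singular values are passed in directly rather than computed from a matrix.

```python
import math
from typing import Iterable


def effective_rank(singular_values: Iterable[float]) -> float:
    """exp(entropy) of the singular-value distribution (assumed definition)."""
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


even_spectrum = effective_rank([1.0, 1.0, 1.0, 1.0])     # evenly spread singular values
skewed_spectrum = effective_rank([10.0, 0.1, 0.1, 0.1])  # one dominant direction
```

An even spectrum gives an effective rank equal to the number of directions, while a spectrum dominated by one singular value gives a value close to 1, even though the matrix is technically full rank.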

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on LLM distillation and reasoning preservation, lacking content on multimodal data, visual encoders, world models, or RL. MLLM and Tokenizer scores are low but non-zero due to LLM and token mentions. Unify Models is weak as no model unification is discussed. No matching expert authors were found to apply bonus points.

关键词

Efficient Distillation, Large Language Models, Reasoning Collapse, Activation-aware Initialization, Projection Matrices, eRank Collapse, Token Indistinguishability

224. CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild [FAIL]

Score: 7.5 / 27.8

Authors: Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka

Published: 2026-05-28

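TL;DR: CommunityFact introduces a dynamic, multilingual benchmark for misinformation detection and finds that web-enabled LLMs improve verification, but their source-selection policies are systematically misaligned with those of human raters.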

摘要翻译

虚假信息验证越来越多地发生在公开、快速变化且多语言的在线环境中，而静态基准无法全面衡量模型的可靠性。我们引入了 CommunityFact，这是一个用于真实环境中虚假信息检测的可更新基准，具有三个主要目标：覆盖范围、粒度和可再分发性。本次发布包含跨越五种语言和两个领域的 15,992 个独立声明。我们在不同的推理时能力下评估了十个大型语言模型（LLMs），包括思考（thinking）和网络搜索（web-search）。结果表明，封闭输入验证仍然具有挑战性，网络访问带来了最大的收益，且具备网络功能的大型语言模型（LLMs）的源选择策略与人类 Community Notes 评分者所汇聚的源存在系统性的不一致——这一差距可通过模型特定的检索扩展或剪枝机制来弥补。我们还发现，在不同语言 - 领域切片以及网络系统所使用的证据生态系统中存在显著差异。除了评估之外，CommunityFact 还将 Community Notes 定位为一种训练信号，用于基于声明的源建议器，从而可能提高对新声明的事实验证能力。

Abstract

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for misinformation detection in the wild, with three major goals: coverage, granularity, and redistributability. This release contains 15,992 standalone claims across five languages and two domains. We evaluate ten LLMs under varying inference-time capabilities, including thinking and web-search. Our results show that closed-input verification remains challenging, web access yields the largest gains, and web-enabled LLMs' source-selection policies are systematically misaligned with the sources human Community Notes raters converge on, a gap that closes through model-specific mechanisms of retrieval expansion or pruning. We further find substantial variation across language-domain slices and across the evidence ecosystems used by web-enabled systems. Beyond evaluation, CommunityFact positions Community Notes as a training signal for claim-conditioned source suggesters that could improve factual verification on novel claims.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

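评分理由: The paper focuses on a multilingual benchmark for misinformation detection and on evaluating LLM inference, with no direct link to the model-architecture keywords (Unify Models, Visual Encoder, World Models) or to model-based RL. It involves multilinguality (easily confused with MultiModal) and LLMs (low relevance to MLLM), but it does not examine model internals, visual encoding, or RL training mechanisms, so the relevance scores are very low. The author list does not include Yang Shi or the other specified experts, so no bonus points apply.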

关键词

Misinformation Detection, Multilingual Benchmark, LLM Evaluation, Web Search Integration, Source Selection, Claim Verification, Dynamic Benchmark, Community Notes

225. Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels [FAIL]

Score: 7.5 / 27.8

Authors: Guneet Kohli

Published: 2026-05-28

TL;DR: The study reveals that LLM evaluation panels suffer from significant correlated errors, reducing their effective independence and reliability compared to single best judges, indicating that scaling panel size does not solve evaluation bias.

摘要翻译

LLM 作为裁判的面板（LLM-as-a-judge panels）聚合了多个模型的投票，期望多样化的模型能产出更可靠的评估结果。我们开发了一个框架，用于测量此类面板的真实信息价值，并量化其可靠性距离独立投票理想还有多远。我们在三个自然语言推理数据集（每个项目包含 100 条人工标注）上测试了一个由 7 个模型家族的 9 个前沿大语言模型（LLM）组成的面板，发现这 9 个裁判实际上仅提供了约相当于 2 个独立投票的信息量。由于模型在相同的项目上犯相同的错误，面板名义上约四分之三的独立性丢失了。后果严峻：面板的实际准确性比独立投票所能实现的低 8 至 22 个百分点，且最佳单个裁判在所有条件下均能匹配或优于完整的面板。无论是增加更多裁判还是使用更智能的聚合算法均无济于事——既有的方法最多只能填补该差距的 11%，即使拥有正确答案也是如此。我们利用 Kish 有效样本量（n_eff）和 Condorcet 零模型量化了这些发现，并表明这种缺陷在提示变体、温度设置、思维链推理（chain-of-thought reasoning）以及成对偏好任务（RewardBench）上均具有稳健性。瓶颈在于裁判之间的相关性，而非聚合算法，这意味着扩大面板规模无法替代真正独立的评估。

Abstract

LLM-as-a-judge panels aggregate votes from multiple models, with the expectation that diverse models yield more reliable evaluations. We develop a framework to measure the true informational value of such panels and quantify how far their reliability falls short of the independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets (each with 100 human annotations per item), we find that the 9 judges effectively provide only about 2 independent votes' worth of information. Roughly three-quarters of the panel's nominal independence is lost because the models make the same mistakes on the same items. The consequences are stark: the panel's actual accuracy falls 8-22 percentage points short of what independent voting would achieve, and the best single judge matches or outperforms the full panel across all conditions. Neither adding more judges nor using smarter aggregation algorithms helps: established methods close at most 11% of this gap, even with access to the correct answers. We quantify these findings using the Kish effective sample size (n_eff) and a Condorcet null model, and show the deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench). The bottleneck is correlated judges, not the aggregation algorithm, implying that scaling up panels cannot substitute for genuinely independent evaluation.
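The two quantities named in this abstract can be sketched in a few lines. The abstract does not say how n_eff is derived from the judges' votes, so the equal-pairwise-correlation formula below is an assumption, and the value `rho = 0.4375` is purely illustrative (chosen so that 9 judges give 2 effective votes; it is not reported in the abstract). The Condorcet function gives majority-vote accuracy for an odd number of fully independent voters. All function names are hypothetical.

```python
from math import comb
from typing import Sequence


def kish_n_eff(weights: Sequence[float]) -> float:
    """Kish effective sample size for weighted observations: (sum w)^2 / sum w^2."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def n_eff_equicorrelated(n: int, rho: float) -> float:
    """Effective number of independent votes when every pair of judges has
    the same error correlation rho (simplifying assumption)."""
    return n / (1 + (n - 1) * rho)


def condorcet_majority_accuracy(n: int, p: float) -> float:
    """Probability that a majority of n independent voters (n odd),
    each correct with probability p, reaches the correct verdict."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


effective_votes = n_eff_equicorrelated(9, 0.4375)      # illustrative rho, see lead-in
independent_ideal = condorcet_majority_accuracy(9, 0.7)
lost_share = 1 - effective_votes / 9                   # share of nominal independence lost
```

With 2 effective votes out of 9, the lost share is about 0.78, which matches the abstract's "roughly three-quarters". The Condorcet figure is the ceiling an independent panel would reach, which correlated judges fall short of.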

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

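评分理由: The paper examines correlated errors and informational value in LLM evaluation panels, which falls under natural language processing evaluation. It does not involve multimodal architectures, visual encoders, tokenizers, world models, or model-based reinforcement learning. It uses multiple LLMs (hence the MLLM score), but the focus is on evaluation methodology rather than model architecture or multimodal capability, so relevance to the given keywords is low. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).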

关键词

LLM-as-a-judge, Correlated Errors, Evaluation Panels, Effective Sample Size, Natural Language Inference, Model Independence, Aggregation Algorithms

226. MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery [FAIL]

Score: 7.5 / 27.8

Authors: Hongran An, Zonglin Yang

Published: 2026-05-28

TL;DR: MOOSE-Copilot proposes a unified human-AI interaction framework for scientific hypothesis discovery that enables scientists to steer LLM-based generation through structured feedback, significantly outperforming autonomous baselines.

摘要翻译

大型语言模型（LLMs）在科学假设发现方面展现出巨大的潜力。然而，现有方法面临两个关键局限：它们将发散性探索性构思与收敛性细粒度细化视为孤立任务，且自主运行，几乎缺乏人类指导。我们提出了 MOOSE-Copilot，这是首个通过形式化人机交互（HAII）协议来弥合这一抽象鸿沟的统一框架。我们的系统使科学家能够通过三种显式信号引导生成过程：初始蓝图、阶段间路由和再生反馈。定量评估表明，注入这些结构化专家信号显著优于纯自主基线，确立了在神谕指导下的性能上限。此外，为了普及此范式，我们开发了一个直观的基于 Web 的界面，具备交互式树形可视化功能。这明确消除了复杂命令行智能体工具陡峭的学习曲线，使跨学科研究人员能够直接利用、可视化编排并加速端到端的科学突破。

Abstract

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory ideation and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and regenerative feedback. Quantitative evaluations demonstrate that injecting these structured expert signals significantly outperforms purely autonomous baselines, establishing a performance ceiling under oracle guidance. Furthermore, to democratize this paradigm, we develop an intuitive web-based interface featuring interactive tree visualization. This explicitly eliminates the steep learning curve of complex command-line agentic tools, empowering interdisciplinary researchers to directly leverage, visually orchestrate, and accelerate end-to-end scientific breakthroughs.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on a human-AI interaction framework for scientific hypothesis discovery using LLMs, lacking technical content on tokenizers, visual encoders, world models, or RL. Scores reflect minimal keyword overlap ('Unified' in title, 'LLM' in text) without architectural relevance.

关键词

Scientific Hypothesis Discovery, Human-AI Interaction, Large Language Models, Web-Based Interface, Unified Framework, Generative Process, Interactive Assistant

227. Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning [FAIL]

Score: 7.5 / 27.8

Authors: Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki

Published: 2026-05-28

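TL;DR: This paper studies how code-switching data that mixes four languages (English, Japanese, Korean, Chinese) affects cross-lingual transfer in instruction tuning, and finds that simple sentence-level multilingual code-switching consistently improves multilingual performance.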

摘要翻译

近期研究表明，语码转换数据（CSD），即在相同上下文中混合多种语言的数据，能够提升大语言模型（LLMs）中的跨语言迁移和多语言对齐能力。然而，现有研究主要集中于英语与目标语言之间的双语迁移，使得涉及三种或更多语言的多语言设置在很大程度上尚未得到充分探索。本文研究了四种语言（英语、日语、韩语和中文）之间的多语言语码转换指令微调。我们在 Belebele 上评估多语言理解能力。实验结果表明，简单的句子级多语言 CSD 一致地提升了所有四种语言的平均多语言性能，这表明多语言语码转换的有效性不仅限于双语迁移设置。

Abstract

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.
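Sentence-level multilingual code-switching can be pictured with a small sketch. The abstract does not describe how the data is built, so this assumes sentence-aligned parallel versions of each example and a uniform random language choice per sentence; `code_switch` and the language codes are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def code_switch(parallel: Dict[str, List[str]], seed: int = 0) -> List[Tuple[str, str]]:
    """Build one code-switched text by picking, for each sentence position,
    the version from a randomly chosen language (assumed mixing scheme)."""
    languages = sorted(parallel)
    length = len(parallel[languages[0]])
    if any(len(parallel[lang]) != length for lang in languages):
        raise ValueError("all languages must provide the same number of aligned sentences")
    rng = random.Random(seed)
    mixed = []
    for i in range(length):
        lang = rng.choice(languages)          # one language per sentence
        mixed.append((lang, parallel[lang][i]))
    return mixed


aligned = {
    "en": ["Summarize the text.", "Keep it short."],
    "ja": ["文章を要約してください。", "短くしてください。"],
    "ko": ["글을 요약하세요.", "짧게 쓰세요."],
    "zh": ["请总结这段文字。", "请保持简短。"],
}
mixed_example = code_switch(aligned, seed=0)
```

Each output sentence stays monolingual; only the language changes between sentences, which is what distinguishes sentence-level mixing from word-level or phrase-level switching.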

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

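评分理由: The paper's core content is multilingual instruction tuning with code-switching, a text-only NLP task. Among the given keywords, 'Visual Encoder', 'World Models', 'MultiModal', and 'model-based RL' concern vision, world models, multimodality, and reinforcement learning; they are unrelated to this paper and score 0. 'Unify Models' touches on unifying languages but usually refers to architectural unification, so its relevance is low (1.0). 'Tokenizer' and 'MLLM' relate to language-model fundamentals but are not the paper's core contribution, so they score 2.0. The weighted total is about 7.5, well below the dynamic passing score of 27.8. The author list does not include Yang Shi or the other specified experts.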

关键词

Multilingual Code-Switching, Instruction Tuning, Cross-lingual Transfer, Large Language Models, Multilingual Alignment, Sentence-level Mixing, Belebele Benchmark

228. Mesh-Aware Epipolar Matching for Multi-View Multi-Person 3D Pose Estimation in Basketball [FAIL]

Score: 7.5 / 27.8

Authors: Li Yin, Qin Haobin, Tomohiro Suzuki, Calvin Yeung, Mariko Isogawa, Keisuke Fujii

Published: 2026-05-28

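TL;DR: This paper proposes MAEM, a training-free framework that uses mesh-aware epipolar matching for multi-person 3D pose estimation in multi-view basketball scenes, reaching competitive performance without target-domain fine-tuning.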

摘要翻译

团队运动场景中的多视角多人 3D 姿态估计仍具挑战性，原因在于球员遮挡、队服引起的外观相似性以及标注多视角数据的匮乏，这些因素均限制了基于学习的方法的有效性和泛化能力。相比之下，无训练方法的性能本质上受限于 2D 关键点检测的准确性以及跨视图关联的鲁棒性。为应对这些挑战，我们提出了一种无训练框架——网格感知极线匹配（Mesh-Aware Epipolar Matching，简称 MAEM），用于多视角多人 3D 姿态估计。该方法采用单目 3D 人体网格恢复模型作为前端，并基于恢复的网格输出引入了一种两阶段极线匹配策略。具体而言，所提出的框架结合了基于并查集（disjoint-set-union）的聚类与每关节三角测量，以实现鲁棒的跨视图关联和准确的 3D 姿态重建。在两个公共多视角篮球数据集上的实验表明，MAEM 一贯优于现有的无训练关联基线，同时在室内和室外篮球场景中实现了具有竞争力的仅 RGB 性能。MAEM 在 SportCenter EPFL 数据集上取得了 59.8/40.7 mm 的 MPJPE/PA-MPJPE 分数，在 Human-M3 Basketball 数据集上取得了 74.0/51.8 mm 的分数，突显了密集网格几何在跨视图关联中的有效性，且无需目标域训练或微调。

Abstract

Multi-view multi-person 3D pose estimation in team sports scenarios remains challenging due to player occlusions, appearance similarity caused by team uniforms, and the scarcity of annotated multi-view data, all of which limit the effectiveness and generalization capability of learning-based methods. In contrast, the performance of training-free approaches is inherently constrained by the accuracy of 2D keypoint detection and the robustness of cross-view association. To address these challenges, we propose Mesh-Aware Epipolar Matching (MAEM), a training-free framework for multi-view multi-person 3D pose estimation. Our method employs a monocular 3D human mesh recovery model as the frontend and introduces a two-stage epipolar matching strategy based on the recovered mesh outputs. Specifically, the proposed framework combines disjoint-set-union-based clustering with per-joint triangulation to achieve robust cross-view association and accurate 3D pose reconstruction. Experiments on two public multi-view basketball datasets demonstrate that MAEM consistently outperforms existing training-free association baselines while achieving competitive RGB-only performance in both indoor and outdoor basketball scenarios. MAEM achieves MPJPE/PA-MPJPE scores of 59.8/40.7 mm on SportCenter EPFL and 74.0/51.8 mm on Human-M3 Basketball, highlighting the effectiveness of dense mesh geometry for cross-view association without requiring target-domain training or fine-tuning.
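The disjoint-set-union clustering step can be sketched on its own. The sketch assumes that pairwise cross-view matches have already been accepted by some epipolar consistency test, which is not modeled here; the paper's two-stage matching and any same-view exclusion constraints are also omitted. `DisjointSet` and `cluster_detections` are hypothetical names.

```python
from typing import Dict, List, Tuple


class DisjointSet:
    """Union-find with path compression."""

    def __init__(self, size: int) -> None:
        self.parent = list(range(size))

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:       # compress the path to the root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_detections(num_detections: int,
                       matches: List[Tuple[int, int]]) -> List[List[int]]:
    """Group per-view detections into per-person clusters from pairwise matches."""
    dsu = DisjointSet(num_detections)
    for a, b in matches:
        dsu.union(a, b)
    groups: Dict[int, List[int]] = {}
    for i in range(num_detections):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())


# Detections 0, 2, 4 come from three cameras and were matched pairwise (assumed input).
clusters = cluster_detections(5, [(0, 2), (2, 4)])
```

Each resulting cluster would then feed per-joint triangulation; detections with no accepted match remain singleton clusters.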

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

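评分理由: The paper focuses on multi-view multi-person 3D pose estimation in computer vision, using mesh-based epipolar matching. The provided keywords mainly concern multimodal large models, world models, and reinforcement learning, which is a markedly different domain. 'Visual Encoder' has moderate relevance because the frontend mesh recovery model implicitly contains a visual encoder; 'Unify Models' and 'MultiModal' have low relevance, referring only to multi-view data fusion or pipeline integration rather than architectural unification or multimodal large models; the remaining keywords (Tokenizer, World Models, MLLM, model-based RL) are entirely unrelated. The author list does not include any of the specified experts.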

关键词

Multi-view, 3D Pose Estimation, Mesh-Aware, Epipolar Matching, Training-free, Human Mesh Recovery, Cross-view Association, Basketball

229. LLMSurgeon: Diagnosing Data Mixture of Large Language Models [FAIL]

Score: 6.0 / 27.8

Authors: Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

Published: 2026-05-28

TL;DR: This paper proposes LLMSurgeon, a framework to estimate the pretraining data mixture distribution of Large Language Models from generated text without accessing the original training data.

摘要翻译

大语言模型（LLMs）的预训练数据混合物构成了其“数字 DNA”，并塑造了模型的行为、能力和失效模式。然而，这种组成很少被披露，使得对数据组合或溯源的事后审计变得困难。在这项工作中，我们正式定义了“数据混合手术（DMS）”：仅给定目标大语言模型生成的文本，在预定义的分类体系下估计其预训练语料库的领域级分布。我们提出了 LLMSurgeon，一个强大的框架，它将 DMS 视为在标签偏移假设下的逆问题。与直接聚合分类器输出不同，LLMSurgeon 估计一个校准的“软”混淆矩阵，并求解一个约束逆问题，以纠正系统性的领域混淆并恢复潜在的混合先验。为了评估，我们引入了 LLMScan，一个基于具有透明预训练混合物的开源大语言模型构建的基于训练配方可验证的评估套件。在 LLMScan 上，LLMSurgeon 在固定协议下以高保真度恢复领域混合分布。我们的工作提供了一种实用的事后方法，用于审计基础模型 (Foundation Models) 的数字 DNA，而无需访问其训练数据。

Abstract

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize Data Mixture Surgery (DMS): given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose LLMSurgeon, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated soft confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce LLMScan, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
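The label-shift inverse problem can be illustrated with a toy sketch. Under label shift, the distribution of predicted domains equals the confusion matrix applied to the true mixture, so the mixture can be recovered by inverting that relation on the probability simplex. The solver below is an EM-style multiplicative update swapped in for illustration; it is not LLMSurgeon's solver, and the confusion matrix, the function name `recover_mixture`, and the iteration count are all hypothetical.

```python
from typing import List


def recover_mixture(confusion: List[List[float]],
                    observed: List[float],
                    iterations: int = 2000) -> List[float]:
    """Estimate the latent domain mixture from the distribution of predicted labels.

    confusion[i][j] = P(classifier predicts domain j | text is from domain i).
    observed[j]     = share of generated texts predicted as domain j.
    Uses an EM-style update that keeps the estimate on the simplex (illustrative choice).
    """
    n = len(confusion)
    m = len(observed)
    prior = [1.0 / n] * n                    # start from a uniform mixture
    for _ in range(iterations):
        # Predicted-label distribution implied by the current mixture estimate
        implied = [sum(prior[i] * confusion[i][j] for i in range(n)) for j in range(m)]
        prior = [
            prior[i] * sum(confusion[i][j] * observed[j] / implied[j]
                           for j in range(m) if implied[j] > 0)
            for i in range(n)
        ]
    return prior


# Toy setup: three domains, a noisy classifier, and a known ground-truth mixture.
toy_confusion = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
]
true_mixture = [0.5, 0.3, 0.2]
observed_shares = [
    sum(true_mixture[i] * toy_confusion[i][j] for i in range(3)) for j in range(3)
]
estimate = recover_mixture(toy_confusion, observed_shares)
```

Reading the mixture directly off `observed_shares` would give roughly (0.48, 0.30, 0.22) here; correcting for the classifier's systematic confusion is what moves the estimate back toward the true (0.5, 0.3, 0.2).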

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on auditing LLM pretraining data mixtures using inverse problems (Data Mixture Surgery), which has low alignment with the provided keywords concerning multimodal models, world models, reinforcement learning, and visual encoders. Only minor relevance exists to general LLM components (Tokenizer, MLLM as broad category). No specified expert authors (Yang Shi, Xuanyu Zhu, etc.) are listed in the authorship.

关键词

Data Mixture Surgery, Large Language Models, Pretraining Data Mixture, Inverse Problem, Post-hoc Auditing, LLMScan, Domain Distribution

230. Demystifying Data Organization for Enhanced LLM Training [FAIL]

Score: 6.0 / 27.8

Authors: Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li

Published: 2026-05-28

TL;DR: This paper proposes novel data ordering methods (STR and SAW) guided by four organization guidelines to enhance the stability and performance of LLM training without additional computational overhead.

摘要翻译

大型语言模型（LLMs）已革新了诸多领域，然而其训练效率高度依赖于有效的数据策展。尽管数据选择已被广泛研究，但旨在提升训练的战略性数据组织仍是一个研究不足的领域，尤其是因为当前的 LLMs 通常仅训练一个或几个轮次。本文通过重用原本为数据效率而生成的预计算样本级得分，系统地探索了数据组织对 LLM 训练的影响，从而仅产生极小的额外计算开销。我们识别并形式化了优化数据组织的四个关键准则：边界锐化（Boundary Sharpening）、循环调度（Cyclic Scheduling）、课程连续性（Curriculum Continuity）和局部多样性（Local Diversity）。基于这些准则，我们提出了两种新颖的数据排序方法，分别称为 STR 和 SAW。在不同模型规模和数据规模下进行的广泛实验，涵盖了预训练和监督微调（SFT）阶段，验证了我们总结准则的有效性。它们还展示了我们提出的数据排序方法在提升 LLM 训练的稳定性与性能方面的鲁棒性。GitHub 链接：https://github.com/microsoft/data-efficacy/

Abstract

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidelines. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training. GitHub link: https://github.com/microsoft/data-efficacy/

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on data organization strategies (STR, SAW) for improving LLM training efficiency and stability. It does not address tokenizers, visual encoders, world models, multimodal architectures, or model-based reinforcement learning. While it involves LLMs, it does not cover MLLM (multimodal) or unifying models in the context of the provided keywords, resulting in low relevance scores for most terms.

关键词

Large Language Models, Data Organization, Training Efficiency, Data Curation, Data Ordering, Pre-training, SFT

231. How LoRA Remembers? A Parametric Memory Law for LLM Finetuning [FAIL]

Score: 6.0 / 27.8

Authors: Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

Published: 2026-05-28

TL;DR: This paper proposes a Parametric Memory Law to quantify LoRA's memory capacity in LLMs and introduces a threshold-guided optimization strategy (MemFT) to enhance memory fidelity during finetuning.

摘要翻译

大型语言模型（LLMs）必须在动态的现实环境中持续学习和更新知识，以保持有效性。尽管低秩适配（LoRA）广泛用于此类记忆更新，但现有研究主要依赖定性下游评估，使得精确参数记忆的定量容量限制及其潜在动力学在很大程度上仍未被探索。为了弥合这一差距，我们在潜在空间内将 LoRA 用作受控的记忆容量探针，以系统量化精确参数记忆。我们提出了参数记忆定律（Parametric Memory Law），这是一种稳健的幂律，将损失减少量 ΔL 与有效参数及序列长度相关联。在词元级别，细粒度分析揭示了一个确定性相变，表明在贪婪解码下，预测概率 p > 0.5 构成了逐字回忆的充分条件。受这些见解的驱动，我们引入了 MemFT，这是一种阈值引导的优化策略，动态地将训练预算重新分配给低于阈值的词元。实证评估表明，MemFT 可以提高记忆保真度和效率。代码将在 https://github.com/zjunlp/ParametricMemoryLaw 上发布。

Abstract

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction ΔL to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
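The p > 0.5 condition can be read as a counting argument: if one token holds more than half of the probability mass, all other tokens together hold less than half, so greedy (argmax) decoding must pick it. A minimal Python sketch of that reasoning follows; the function names and the toy distributions are illustrative assumptions, not taken from the paper or its code.

```python
def greedy_pick(probs):
    """Return the index that greedy (argmax) decoding would choose."""
    return max(range(len(probs)), key=lambda i: probs[i])


def recall_guaranteed(probs, target, threshold=0.5):
    """True when the target token's probability exceeds the threshold.

    With threshold = 0.5 this is sufficient for greedy decoding to emit the
    target, because every other token must then have probability below 0.5.
    """
    return probs[target] > threshold


# Hypothetical next-token distributions over a tiny vocabulary.
above_threshold = [0.10, 0.55, 0.20, 0.15]  # target token 1 has p > 0.5
below_threshold = [0.40, 0.30, 0.30]        # target token 0 has p < 0.5

# Above the threshold, greedy decoding is forced to pick the target.
forced = recall_guaranteed(above_threshold, 1) and greedy_pick(above_threshold) == 1

# Below the threshold the target may still win, but nothing guarantees it:
# the condition is sufficient, not necessary.
lucky = (not recall_guaranteed(below_threshold, 0)) and greedy_pick(below_threshold) == 0
```

The second case shows why the abstract calls the condition sufficient rather than necessary: a token with p = 0.4 can still be the argmax, but only because of how the remaining mass happens to be split.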

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on LLM finetuning using LoRA and analyzes parametric memory laws, proposing a power law and MemFT strategy. It lacks content on multimodal integration, visual encoders, world models, or reinforcement learning. Token-level analysis is mentioned but does not focus on tokenizer architecture. Unify Models is only tangentially related to model adaptation.

关键词

LoRA, LLM Finetuning, Parametric Memory Law, Memory Capacity, MemFT, Token-level Analysis, Power Law

232. Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit [FAIL]

Score: 6.0 / 27.8

Authors: Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

Published: 2026-05-28

TL;DR: This paper investigates how user persona conditioning significantly alters brand recommendations in commercial chatbots, revealing that mid-market brands are more susceptible to persona shifts than category leaders across different AI providers.

摘要翻译

相同的 Prompt——"最佳 CRM 软件"——从背景截然不同的买家处传入 AI 助手：一位独立创始人、一位企业副总裁、一位英国 SMB 所有者。我们评估这种情境差异在多大程度上重塑了模型所推荐的品牌。该审计在设计空间内采样了 2,000 次运行，设计空间包含 10 种 Personas × 8 个 Prompts × 3 种模型配置 × N=10 次重复；其中两个 OpenAI 实验单元覆盖了全部 8 个 Prompts，而 Anthropic Sonnet-4.6 / low 实验单元仅覆盖 4 个 Prompts。在用户消息前缀 Persona 会使推荐集相似度（Jaccard）相对于相同 Persona 的基线下降 Delta = -0.12 至 -0.20（在所有三个测量单元中，聚类 95% CIs 均排除零；Sonnet 单元的 CI 仅基于 4 个 Prompt 聚类，因此相应更宽）。该效果具有显著的知名度分层特征：品类领导者对 Persona 的鲁棒性较强（跨 Persona 情况下约 80% 的品牌一致性保持不变），但随着 Persona 的变化，中端市场品牌高达 75% 的推荐集会发生变动。Anthropic 模型显示出比 OpenAI 配置更大的点估计效应，尽管在更近的对比中（Sonnet vs. OpenAI/high），聚类 CIs 存在重叠；这种不对称性与 Anthropic 更偏向未归因于检索的生成路径的特性一致（43%-52% 的推荐缺乏可观察的检索层证据，相比之下 OpenAI 为 8%-29%，详见 Jack 2026）。任何对 AI 品牌感知的测量都必须以提供查询的 Buyer Persona 为条件：相同的 Prompt 会产生实质不同的推荐集，这取决于模型认为谁在提问，而一种跨 Persona 聚合的测量方案会系统性地掩盖这种差异。该效应集中在中端市场，且在审计中依赖 Priors 最多的生成路线上最为显著，这与随着模型更多地依赖训练数据 Priors 及更丰富的上下文集成，其对 Persona 的响应性增强是一致的。

Abstract

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas × 8 prompts × 3 model configurations × N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Δ = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs. OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
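For readers unfamiliar with the metric, the following Python sketch shows how a recommendation-set Jaccard similarity can be computed, and how the stated 2,000 runs follow from the partial prompt coverage of one cell. The brand names and function names are placeholders, and the run-count breakdown is one reading of the design described in the abstract, not something the authors spell out.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two recommendation sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Placeholder brand lists returned for the same prompt under two personas.
baseline = ["BrandA", "BrandB", "BrandC", "BrandD"]
persona = ["BrandA", "BrandB", "BrandE", "BrandF"]
similarity = jaccard(baseline, persona)  # 2 shared brands out of 6 distinct


def total_runs(personas=10, reps=10, prompts_per_cell=(8, 8, 4)):
    """Runs across cells: two cells with 8 prompts, one cell with 4 prompts."""
    return sum(personas * prompts * reps for prompts in prompts_per_cell)
```

Under this reading, the two full-coverage cells contribute 800 runs each and the 4-prompt cell contributes 400, which matches the 2,000 runs reported.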

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on auditing persona conditioning effects on brand recommendations in commercial chat systems. It compares OpenAI and Anthropic models but does not propose unified model architectures, discuss tokenizers, utilize visual encoders, develop world models for RL, focus on multimodal large language models (MLLM), or employ model-based reinforcement learning. The relevance to the provided keywords is minimal as the paper belongs to the domain of AI safety/alignment and recommendation auditing rather than multimodal architecture or RL. Only slight relevance exists for MLLM (as it uses LLMs) and Unify Models (as it compares model providers). The total weighted score is 6.0, well below the dynamic pass score of 27.8.
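The weighted total quoted above can be reproduced from the table with a few lines of Python. This is a sketch that assumes each keyword's score is weight × relevance, that the total is their plain sum, and that an entry passes when the total reaches the pass score; the report does not document the exact rule, and the names below are illustrative.

```python
# Relevance values (out of 10) copied from the scoring table of this entry.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 2.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}
WEIGHT = 1.5        # every keyword in the table carries the same weight
PASS_SCORE = 27.8   # dynamic pass score shown in the "Score" line


def weighted_total(relevance, weight):
    """Sum of weight × relevance over all keywords (assumed scoring rule)."""
    return sum(weight * value for value in relevance.values())


total = weighted_total(relevance, WEIGHT)
passed = total >= PASS_SCORE  # assumed pass criterion
```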

关键词

Persona Conditioning, Brand Recommendations, Retrieval-Augmented Chat, Cross-Provider Audit, Recommendation Similarity, AI Assistants, Mid-market Brands

233. Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale [FAIL]

Score: 6.0 / 27.8

Authors: Canran Wang, Yuwen Yang, Zhen Wang, Ming Ma, Ding Yu, Chentai Wang, Keman Huang, Xiaoyong Du

Published: 2026-05-28

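TL;DR: This paper designs and evaluates a triadic LLM-teacher-student collaboration system for K-12 writing, finding that a strategic division of labor improves writing quality but yields diminishing marginal returns.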

摘要翻译

整合大型语言模型（LLMs）是一把双刃剑，尤其在 K-12 教育中，亟需建立 LLMs、教师和学生之间的有效三方协作机制。本文通过开发一个支持 K-12 写作学习的三方协作系统、基于系统功能语言学（Systemic Functional Linguistics）的多维评估框架以及建议轨迹追踪管道，贡献了一个大规模实证数据集，涉及两年内来自 120 所学校的 10,195 名学生的 57,954 篇作文。我们的发现证实了该系统在提升写作质量方面的有效性，这得益于战略性的劳动分工：LLM 作为生成引擎以缓解教师职业倦怠，而教师则扮演教学守门人和桥梁的角色，以确保反馈质量。尽管 LLM 和教师对于技能提升都至关重要，但我们发现了一个天花板效应，即过度的语言扩展会导致边际效用递减。这表明随着学生能力的提升，LLM-教师协作应采用动态自适应模式。

Abstract

The double-edged sword of integrating Large Language Models (LLMs) requires an effective triadic collaboration mechanism among LLMs, teachers and students, especially for K-12 education. By developing a triadic collaboration system to support K-12 writing learning, a multidimensional evaluation framework grounded in Systemic Functional Linguistics, and a suggestion-trajectory tracing pipeline, this paper contributes a large-scale empirical dataset involving 57,954 essays from 10,195 students across 120 schools over two years. Our findings confirm the efficacy of this system in improving writing quality through a strategic labor division: the LLM serves as a generative engine to mitigate teacher burnout, and the teacher acts as a pedagogical gatekeeper and bridge to guarantee feedback quality. While both LLM and teacher are critical for skill improvement, we uncover a ceiling effect where excessive linguistic expansion yields diminishing marginal utility. These findings suggest that LLM-teacher collaboration should adapt dynamically as student proficiency increases.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

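评分理由: The paper focuses on collaborative applications of LLMs in education (K-12 writing), whereas the keywords concern underlying model architecture (Tokenizer, Visual Encoder), multimodal unification (Unify Models, MLLM, MultiModal), and reinforcement learning paradigms (World Models, model-based RL). The two differ greatly in technical scope: the paper does not address multimodal representation, world-model construction, or RL algorithms, and treats the LLM only as a tool, so relevance is very low. The weighted total is about 6.0, well below the dynamic pass score of 27.8.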

关键词

LLM-Teacher Collaboration, K-12 Writing, Triadic Collaboration, Writing Quality Improvement, Empirical Dataset, Systemic Functional Linguistics, Generative Engine, Pedagogical Gatekeeper

234. On Distributional Reinforcement Learning in Chaotic Dynamical Systems [FAIL]

Score: 6.0 / 27.8

Authors: James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

Published: 2026-05-28

TL;DR: This paper proposes stabilizing reinforcement learning in chaotic dynamical systems by optimizing the return distribution under the 1-Wasserstein metric instead of scalar value functions.

摘要翻译

混沌动力系统对强化学习（RL）构成了根本性挑战：对初始条件的指数敏感性会导致高方差的自举目标和条件不佳的梯度更新。混沌动力学广泛存在于科学和工程领域，从流体流动和气候系统到多智能体系统，在这些领域中可靠的强化学习极具价值。标准的强化学习方法通过标量值函数优化期望回报，隐式地对发散的轨迹进行平均，并将轨迹层面的不稳定性与学习目标耦合。我们在温和的统计稳定性假设下表明，当采用 1-Wasserstein 度量衡量时，回报分布比单个轨迹演化更为规律，从而得到更平滑的分布贝尔曼目标。通过将优化过程与这种测度层面的结构对齐，分布强化学习（Distributional RL）提供了条件更优的学习。我们提供了分布方法在混沌系统中优势的原理解释，以及混沌环境下强化学习目标的几何结构。

Abstract

Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory-level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the 1-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure-level structure, distributional RL provides better-conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.
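To make the metric concrete: for two one-dimensional empirical distributions with the same number of samples, the 1-Wasserstein distance equals the mean absolute difference between the sorted samples. The Python sketch below uses that fact on made-up return samples; it illustrates only the metric, not the paper's analysis, and the sample values are assumptions.

```python
def wasserstein_1(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples.

    For equal sample counts the optimal coupling matches sorted values, so
    the distance is the mean absolute difference of the order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(x - y) for x, y in pairs) / len(xs)


# Hypothetical returns from two batches of rollouts. Matched index by index,
# the individual trajectories disagree strongly...
returns_a = [1.0, 5.0, 3.0]
returns_b = [5.0, 3.0, 1.0]
per_trajectory_gap = sum(abs(a - b) for a, b in zip(returns_a, returns_b)) / 3

# ...yet the two return distributions are identical.
distribution_gap = wasserstein_1(returns_a, returns_b)
```

The toy example mirrors the abstract's point in miniature: individual trajectories can diverge while the distribution of returns stays put.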

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

评分理由: The paper focuses on Distributional Reinforcement Learning in chaotic dynamical systems, addressing instability via Wasserstein metrics. It has no relevance to multimodal components (Tokenizer, Visual Encoder, MLLM, MultiModal) or model unification architectures. Weak relevance is assigned to World Models and model-based RL due to the dynamical systems context, but the core contribution is distributional objective alignment rather than model-based planning or multimodal integration. None of the listed expert authors are present.

关键词

Distributional Reinforcement Learning, Chaotic Dynamical Systems, Wasserstein Metric, Return Distribution, Scalar Value Functions, Bootstrap Targets, Gradient Updates

235. A Geometric View of SRC: Learning Representations for Stable Residual Inference [FAIL]

Score: 6.0 / 27.8

Authors: Vangelis P. Oikonomou

Published: 2026-05-28

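TL;DR: This paper learns representations through geometry-shaping objectives to ensure stable residual ordering for Sparse Representation Classification on image, text, and EEG data.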

摘要翻译

基于重建的推断通过比较类间重建残差来判定类别；稀疏表示分类（SRC）是一个典型实例，其可靠性取决于所学表示的几何特性。我们采用严格的训练 - 推断分离：SRC 仅作为固定的测试时使用规则，在训练过程中从不进行求导、展开或优化。在基于类条件子空间及其相关投影残差的子空间层级理想化中，我们通过残差边距形式化残差排序稳定性，并刻画几何障碍——即通过小主角度定义的子空间重叠、主导和近重叠——这些障碍在最坏方向上可能导致该边距坍塌。这种子空间层级理论是基础性的：它规定了理想化残差族何时能够良好分离，并为实际残差近似（例如 OMP）提供条件求解器层级的解释，前提是它们保持接近子空间层级残差排序。在显式的覆盖和分离假设下，我们推导出（理想化）残差边距的定量下界。基于这些目标，我们提出几何塑形目标，旨在促进掩码类内自表达性，抑制类间重建路径和类间子空间对齐，并防止坍塌——且在训练过程中不使用 SRC 残差或预测。在图像（COIL-100）、文本（TREC）和 EEG 连通性上的实验均在相同的固定 SRC/OMP 推断下评估所有表示，并报告残差边距和几何诊断；交叉熵仅在相同的评估协议下作为参考几何结构被纳入。

Abstract

Reconstruction-based inference assigns a class by comparing class-wise reconstruction residuals; Sparse Representation Classification (SRC) is a canonical instance whose reliability depends on the geometry of the learned representation. We adopt a strict training-inference separation: SRC is used only as a fixed test-time rule and is never differentiated, unrolled, or optimized during training. In a span-level idealization based on class-conditional spans and their associated projection residuals, we formalize residual-ordering stability through a residual margin and characterize geometric obstructions -- span overlap, dominance, and near-overlap via small principal angles -- that can collapse this margin in worst-case directions. This span-level theory is primary: it specifies when the idealized residual family is well-separated, and it provides a conditional solver-level interpretation for practical residual approximations (e.g., OMP) insofar as they remain close to the span-level residual ordering. Under explicit coverage and separation assumptions, we derive a quantitative lower bound on the (idealized) residual margin. Guided by these targets, we propose geometry-shaping objectives that promote masked within-class self-expressiveness, discourage cross-class reconstruction pathways and inter-class span alignment, and prevent collapse -- without invoking SRC residuals or predictions during training. Experiments on images (COIL-100), text (TREC), and EEG connectivity evaluate all representations under identical fixed SRC/OMP inference and report residual margins and geometric diagnostics; cross-entropy is included only as a reference geometry under the same evaluation protocol.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

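评分理由: The paper studies the geometric properties of Sparse Representation Classification (SRC) and representation learning, which falls within classical machine learning. It has no direct connection to modern multimodal large language models (MLLM), world models, reinforcement learning, or specific architectural components (Tokenizer, Visual Encoder). Although the experiments cover image, text, and EEG data, the paper involves neither multimodal fusion nor a unified model architecture, so its relevance to the given keywords is very low.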

关键词

Sparse Representation Classification, Residual Inference, Geometric View, Representation Learning, Residual Margin, Geometry-Shaping Objectives, Span-Level Theory, Training-Inference Separation

236. On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference [FAIL]

Score: 6.0 / 27.8

Authors: Daniel Dold, Emanuel Sommer, Julius Kobialka, Oliver Dürr, David Rügamer

Published: 2026-05-28

TL;DR: This paper proposes LoRA-Curve, which connects independent LoRA optima via continuous low-loss valleys to improve Bayesian inference.

摘要翻译

尽管像低秩适应（LoRA）这样的参数高效微调方法已成为大语言模型的标准，但基于原理的认知不确定性估计仍然具有挑战性。LoRA 框架下的近期研究结果表明，诸如深度集成（deep ensembles）之类的离散多模态方法相比单模态方法几乎没有优势。这与深度学习中的更广泛观察相矛盾，其中集成独立最优解通常能提高泛化，而通过连续的低损失山谷连接这些模式则能进一步增强贝叶斯模型平均（BMA）。这种结构是否存在于 LoRA 空间中，以及它是否能产生局部或离散方法所遗漏的功能多样性，尚未得到研究。我们引入 LoRA-Curve，这是一种在 LoRA 空间中的分段贝塞尔曲线参数化方法，包含两种变体：一种自由配置，联合优化所有控制点；另一种锚定配置，连接独立微调的 LoRA 最优解。我们证明了沿曲线损失的路径连续性和 Lipschitz 正则性，并在 Qwen2.5 7B 的推理和分类基准上实证表明，线性插值会遇到损失壁垒，而我们的锚定多段曲线通过连续的低损失山谷连接独立最优解。结合平坦极小值扰动和 Jensen-Shannon 散度正则器，LoRA-Curve 在不牺牲性能的前提下，使预测分布的互信息显著更高，并将连续参数空间遍历与功能多样性关联起来。

Abstract

While parameter-efficient fine-tuning methods like low-rank adaptation (LoRA) are standard for large language models, principled estimation of epistemic uncertainty remains challenging. Recent results in the LoRA regime suggest that discrete multi-mode approaches such as deep ensembles offer little benefit over single-mode methods. This contradicts broader observations in deep learning, where ensembling independent optima typically improves generalization, and linking these modes through continuous low-loss valleys further enhances Bayesian model averaging (BMA). Whether such structure exists in the LoRA space and whether it yields functional diversity missed by local or discrete methods has not been studied. We introduce LoRA-Curve, a segmented Bézier curve parameterization in the LoRA space, with two variants: a free configuration that jointly optimizes all control points, and an anchored configuration that connects independently fine-tuned LoRA optima. We prove pathwise continuity and Lipschitz regularity of the loss along the curve and empirically show, across reasoning and classification benchmarks with Qwen2.5 7B, that linear interpolation encounters loss barriers, while our anchored multi-segment curves connect independent optima through continuous low-loss valleys. Combined with flat-minima perturbations and a Jensen-Shannon divergence regularizer, LoRA-Curve yields measurably higher mutual information of the predictive distribution without sacrificing performance, and links continuous parameter-space traversal to functional diversity.
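As a rough illustration of the anchored idea, the Python sketch below evaluates a single quadratic Bézier segment between two fixed endpoints and compares its midpoint with straight-line interpolation. The short lists stand in for flattened LoRA parameters, and the control point is an arbitrary placeholder; the paper's segmented parameterization, its training objective, and its regularizers are not reproduced here.

```python
def lerp(a, b, t):
    """Linear interpolation between two parameter vectors."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def quadratic_bezier(p0, p1, p2, t):
    """Evaluate a quadratic Bézier curve via de Casteljau's construction.

    p0 and p2 are the anchored endpoints (two independently tuned optima in
    this sketch); p1 is the control point that bends the path.
    """
    return lerp(lerp(p0, p1, t), lerp(p1, p2, t), t)


# Toy 2-D stand-ins for two optima and one control point.
optimum_a = [0.0, 0.0]
control = [1.0, 2.0]
optimum_b = [2.0, 0.0]

curve_mid = quadratic_bezier(optimum_a, control, optimum_b, 0.5)  # bent path
line_mid = lerp(optimum_a, optimum_b, 0.5)                        # straight path
```

The curve passes through both endpoints at t = 0 and t = 1 but can leave the straight line in between, which is what allows a learned path to route around a loss barrier that linear interpolation would cross.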

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on LoRA-based Bayesian inference and low-loss valleys, lacking direct relevance to multimodal architectures (Tokenizer, Visual Encoder, MultiModal), world models, or reinforcement learning (model-based RL). Only slight overlap exists with 'Unify Models' (connecting optima) and 'MLLM' (Qwen2.5 base). No specified expert authors are found.

关键词

LoRA-based Bayesian Inference, Low-Loss Valleys, Epistemic uncertainty, Bézier curve parameterization, Functional diversity, Model averaging, Parameter-efficient fine-tuning

237. Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention [FAIL]

Score: 6.0 / 27.8

Authors: Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana

Published: 2026-05-28

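TL;DR: This paper finds that larger-capacity models can learn rare and complex tasks that smaller models cannot, through reduced gradient interference and better resource allocation.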

摘要翻译

更大的模型能够学习较小模型无法学习的任务。是什么驱动了这一现象？我们提出了一种简单的现象学论证，指出幂律缩放（power-law scaling）本身就暗示，即使拥有无限训练数据，更大的模型也能学习较小模型无法学习的数据分布的一部分。为了验证这一主张并探究其原因，我们在一个由表现出单调缩放曲线的任务混合构成的合成设置上研究了模型缩放的影响。结果表明，这是一种由数据引发的资源（神经元）竞争。具体来说，较小模型将其神经元分配给高频或低复杂度任务，因此它们学到的解决方案在罕见和复杂任务上的表现较差。此外，即使存在能够表达所需任务的解决方案，这种情况也会发生。随后，我们评估了较大模型如何绕过这一以数据为中心的瓶颈，发现其根源在于一种干扰机制的减弱：较大模型能够为常见任务分配足够的资源，以至于这些任务的梯度更新变得微弱，这意味着它们不会覆盖正在缓慢积累的罕见任务特征。最后，为了进一步验证这些主张，我们在频率和复杂度各异的新型任务上预训练了 OLMo 模型（400 万至 40 亿参数）。结果与我们的合成数据实验一致：只有较大的 OLMo 模型学会了罕见且复杂的任务，且这些较大模型在其表征中嵌入了更多的任务特征，并在任务之间表现出更少的梯度干扰。总体而言，我们提供了一种以数据为中心的视角，解释了为何较大模型能学习较小模型无法学习的任务。这有助于解释为何较大模型在实践中表现更佳，并可为有关模型规模选择和训练数据混合的实际问题提供指导。

Abstract

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

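评分理由: The paper's core is the effect of model capacity, task interference, and rare-task retention on learning, with experiments based on OLMo language models. The provided keywords focus on multimodality, world models, and reinforcement learning. The paper does not cover visual encoders, multimodal architectures, world-model construction, or model-based reinforcement learning, so its relevance to Visual Encoder, World Models, MLLM, and MultiModal is 0. There is only weak relevance to Unify Models (understanding of model unification) and Tokenizer (tokenizer fundamentals); model-based RL is slightly related because the paper involves task learning, but it is not central. No specified expert authors were found, so no bonus points were added. The weighted total of 6.0 is well below the dynamic pass score of 27.8, indicating that the paper's topic is a poor match for the given background.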

关键词

Model Scaling, Task Interference, Rare-Task Retention, Capacity Effects, Gradient Updates, OLMo Models, Data-Induced Competition

238. When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs [FAIL]

Score: 6.0 / 27.8

Authors: Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu, Xinjie He, Zhiyuan Lin, Qiyang Xie

Published: 2026-05-28

TL;DR: This study evaluates persona prompting in LLMs, finding it trades expertise depth for clarity depending on the domain rather than universally enhancing capability.

摘要翻译

角色提示（Persona prompting）被广泛用于引导大语言模型（LLM），但其实际价值尚不明确。以往研究通常使用聚合分数评估角色提示，这使得难以确定专家角色提示是否一致地提升了响应质量，还是在不同的质量维度上改变了响应。我们通过控制比较四种提示条件来研究这一问题，该比较涵盖 1,140 个开放性问题，涉及 38 个专家角色和六个领域：无角色提示、通用领域专家提示、基于嵌入（embedding）的角色检索，以及结合嵌入搜索与基于 LLM 的角色选择的混合检索方法。聚合结果显示，各条件之间的整体差异较小。然而，指标层面的分析揭示了一种被聚合平均值所掩盖的持续权衡：角色提示系统性地增加了专业知识深度，同时降低了清晰度。这些效果高度依赖情境，而非普遍适用。角色提示在咨询性问题以及医学和心理学等领域表现最佳，因为在这些领域中，结构化专家框架和风险沟通具有内在价值。相比之下，基线提示在金融、法律、科学和技术领域的概念性和解释性问题上表现更好，因为在这些领域中，简洁的通俗语言解释更为重要。我们进一步表明，混合检索显著优于仅基于嵌入（embedding）的角色选择，尽管更优的角色检索并不能消除更广泛的专业知识深度与清晰度之间的权衡。总体而言，我们的发现表明，角色提示主要重塑响应特征，而非广泛提升模型能力，且多指标评估对于理解其影响是必要的。

Abstract

Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on persona prompting and metric analysis in text-based LLMs, investigating retrieval methods and trade-offs between expertise and clarity. It does not address multimodal components (Visual Encoder, MultiModal), world models, tokenizers, or model-based reinforcement learning. While it involves LLMs (loosely related to MLLM/Unify Models), the core content is about prompting strategies rather than the model architectures or learning paradigms specified in the keywords.

关键词

Persona prompting, Large Language Models, Expert role injection, Retrieval analysis, Metric analysis, Response quality, Domain-specific effects

239. Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits [FAIL]

Score: 6.0 / 27.8

Authors: Sixue Xing, Haoyu He, Kerui Wu, Zhuo Yang, Haozheng Luo, Tianfan Fu, Aarthy Nagarajan

Published: 2026-05-28

TL;DR: This paper proposes a bandit-based method (BaSE) to optimize compute allocation in LLM-guided evolutionary search, improving mean fitness and reliability without modifying the underlying model or prompt.

摘要翻译

LLM 引导的进化搜索（Evolve systems）在数学和组合任务上达到了最先进水平，但大多数现有系统仅报告多次运行中的最佳结果，且未记录运行间的分布情况。我们探讨了固定的 LLM 调用预算应如何分配，以及单次运行达到报告数值的可靠性如何。通过遍历五个模型和三个任务上的深度 - 广度网格，我们识别出两条经验规律：一条是适应度 - 计算包络，沿此包络能力排序在很大程度上沿有效 FLOPs 坍缩；另一条是具有任务特定交互的双线性深度 - 广度拟合；这两条规律均受模型 - 任务能力的制约。基于这些规律，我们提出 BaSE（基于多臂老虎机的自进化），这是一种多臂老虎机模型，用于在并行轨迹上分配 LLM 调用。在不改变模型、提示或评估器的情况下，BaSE 在 8 个（模型，任务）单元格上，相比最强的岛屿协议基线，平均适应度提高了 12.3%，且在高方差设置下收益最大：这仅源于分配策略带来的可靠性提升。

Abstract

LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.
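The abstract does not say which bandit rule BaSE uses, so the Python sketch below swaps in the textbook UCB1 rule purely to illustrate the general idea of spending a fixed call budget across parallel trajectories. The reward values, the function name `ucb1_allocate`, and the deterministic reward function are all assumptions made for the example.

```python
import math


def ucb1_allocate(reward_fn, n_arms, budget):
    """Spend `budget` calls across `n_arms` trajectories using UCB1.

    Each arm is tried once, then the arm with the highest
    mean reward + sqrt(2 ln t / pulls) is chosen at every step.
    Assumes budget >= n_arms. Returns the number of calls given to each arm.
    """
    pulls = [0] * n_arms
    totals = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1  # initial round: one call per trajectory
        else:
            arm = max(
                range(n_arms),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(2 * math.log(t) / pulls[i]),
            )
        totals[arm] += reward_fn(arm)
        pulls[arm] += 1
    return pulls


# Hypothetical, noise-free fitness gain per call for three trajectories.
MEAN_FITNESS_GAIN = [0.2, 0.5, 0.8]
pulls = ucb1_allocate(lambda arm: MEAN_FITNESS_GAIN[arm], n_arms=3, budget=200)
```

In this toy setting the most promising trajectory ends up with the largest share of the budget while the others still receive occasional calls, which is the exploration and exploitation balance a bandit allocator is meant to provide.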

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

评分理由: The paper focuses on compute allocation in LLM-guided evolutionary search using Multi-Armed Bandits. It lacks content on multimodal architectures, visual encoders, tokenizers, world models, or model unification. Partial relevance exists for MLLM (uses LLMs) and model-based RL (uses Bandits for decision making), but the core topic is optimization efficiency rather than representation learning.

关键词

LLM-guided evolutionary search, Compute allocation, Multi-armed bandits, Depth-breadth trade-off, BaSE, Fitness-compute envelope, Reliability gain

240. Latent Performance Profiling of Large Language Models [FAIL]

Score: 6.0 / 27.8

Authors: Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti, Supratik Chakraborty, Partha Pratim Das, Lipika Dey, Richa Singh, Mayank Vatsa

Published: 2026-05-28

TL;DR: This paper introduces Latent Performance Profiling (LPP), a framework that evaluates Large Language Models using hidden activations and output distributions to reveal intrinsic traits beyond benchmark accuracy.

摘要翻译

大型语言模型（LLM）通常在标准化基准测试中取得令人印象深刻的分数，然而仅凭准确性对其能力的评估视角有限。通过排行榜评估开源 LLM 面临着诸如数据污染、任务范围狭窄以及与现实世界可靠性对齐不足等持续存在的问题。基于基准的评估方法（如 MMLU PRO、BBH 或 IFEval）主要捕捉模型在固定测试集上输出的是“什么”，而非其“如何”处理信息、校准不确定性或构建内部知识。本文主张将评估范式从以基准为中心转向互补的、以状态为中心的内在评估（state-centered intrinsic assessment）。为此，我们提出潜在性能剖析（Latent Performance Profiling, LPP）——一种从隐藏激活和输出分布中推导任务无关诊断的框架。LPP 在模型的潜在表示及其动力学上定义了一组标量指标，揭示与规模无关的特性，从而实现可解释的比较并发现隐藏漏洞。与静态准确性分数不同，LPP 在相似规模的模型之间提供稳定且对架构敏感的签名。通过对八个规模范围在 0.5B 至 14B 之间的 LLM 进行广泛的实证分析，我们表明具有相似基准分数的模型可能表现出截然不同的潜在特征，例如熵或适应性的差异。基于这些洞察，我们设计了针对不确定性和符号推理的合成探针，这些探针与内在指标保持一致，同时与排行榜偏见解耦。我们建议将 LPP 与基准测试一同报告，这能提供对模型行为更深层、可解释的理解，从而实现更可靠的模型选择、安全评估以及超越表面准确性的评价。

Abstract

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP) -- a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend reporting LPP alongside benchmarks, which provides a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.
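The abstract names entropy as one trait on which models with similar benchmark scores can differ, without defining LPP's metrics. As a hedged illustration of what a scalar diagnostic on an output distribution can look like, the Python sketch below computes Shannon entropy for two made-up next-token distributions; it is not LPP's actual metric set.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions from two models that pick the same
# top token (index 0) and would therefore score the same on accuracy.
confident_model = [0.97, 0.01, 0.01, 0.01]
hesitant_model = [0.40, 0.30, 0.20, 0.10]

entropy_confident = shannon_entropy(confident_model)
entropy_hesitant = shannon_entropy(hesitant_model)
```

Both toy models would be marked correct on this token, yet their entropies differ widely, which is the kind of contrast an output-distribution diagnostic can surface.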

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on evaluating Large Language Models (LLMs) via Latent Performance Profiling (LPP) using hidden activations, which is unrelated to multimodal architectures (Visual Encoder, MultiModal), world models, or reinforcement learning (model-based RL). Tokenizer and MLLM have minimal relevance as the text focuses on text-only LLM evaluation without detailing tokenization or multimodality. Unify Models has slight relevance regarding model comparison. No expert authors from the specified list were found, so no bonus points were applied. The weighted total (6.0) is well below the dynamic passing score (27.8), indicating low relevance to the provided research background.
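
The weighted total quoted in the rationale can be reproduced from the score table. The sketch below assumes that each keyword score is weight times relevance, that the total is their sum, and that a paper passes when the total reaches the passing score; the exact pass rule and the expert-author bonus are not spelled out in this report, so both are assumptions here and the bonus is left out.

```python
# Relevance values (out of 10) copied from the score table above; every weight is 1.5.
WEIGHT = 1.5
PASSING_SCORE = 27.8  # "dynamic passing score" quoted in the rationale

relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 1.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}


def weighted_total(relevance_by_keyword, weight):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in relevance_by_keyword.values())


total = weighted_total(relevance, WEIGHT)
# Assumption: the comparison is "greater than or equal"; the report does not say.
verdict = "PASS" if total >= PASSING_SCORE else "FAIL"
print(f"{total:.1f} / {PASSING_SCORE} -> {verdict}")
```

With the values in the table this gives a total of 6.0 against a threshold of 27.8, which matches the FAIL verdict shown for this paper.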

Keywords

Large language models, Latent Performance Profiling, Hidden activations, Output distributions, Intrinsic assessment, Benchmark evaluation, Model behavior

241. Causal Interventions on Continuous Variables: A Case Study on Verb Bias in Steering Vectors for In-Context Learning [FAIL]

Score: 6.0 / 27.8

Authors: Zhenghao Herbert Zhou, R. Thomas McCoy, Robert Frank

Published: 2026-05-28

TL;DR: This paper proposes a method for causal intervention on continuous variables in language models, showing that steering vectors encode verb bias affecting syntax but are not causally utilized in in-context learning.

摘要翻译

语言模型表征中的因果干预主要聚焦于离散特征，例如语法数。然而，语言模型也必须利用分级特征。我们提出了一种针对连续变量的因果干预方法：给定与分级目标变量配对的激活向量，我们定位该变量的一个低维方向，并利用该方向将向量编辑至反事实目标值。我们将此方法应用于心理语言学中研究较多的一个连续特征，即动词偏向（verb bias，它反映了哪些句法结构倾向于跟随给定的动词）。我们发现动词偏向在大语言模型（LLM）提取的引导向量（steering vectors）中存在因果表征：对动词偏向的反事实编辑系统性地改变了下游结构偏好。动词偏向此前也与上下文学习（in-context learning）有关；在进一步分析中，我们发现引导向量编码了误差信号，这些信号可能驱动上下文学习中看到的误差驱动更新行为，但这些引导向量中的相关方面并未在下游生成中被因果使用。总体而言，这些结果表明因果干预可以应用于连续变量，尽管建立连续变量与上下文学习之间的联系仍然是一个挑战。

Abstract

Causal interventions in language model representations have largely targeted discrete features, like grammatical number. However, language models must also make use of features that are graded. We introduce a method for causal intervention on continuous variables: given activation vectors paired with a graded target variable, we localize a low-dimensional direction for that variable and use this direction to edit vectors toward counterfactual target values. We apply this method to a continuous feature that is well-studied in psycholinguistics, namely verb bias (which reflects which syntactic structures tend to follow a given verb). We show that verb bias is causally represented in steering vectors extracted from large language models: counterfactual edits to verb bias systematically shift downstream structural preferences. Verb bias has also previously been linked to in-context learning; in further analyses, we find that steering vectors encode error signals that could drive the error-driven update behavior seen in in-context learning, but that these aspects of the steering vectors are not causally used in downstream production. Overall, these results show that causal interventions can be applied to continuous variables, though connecting continuous variables to in-context learning remains a challenge.
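
The edit described in the abstract (find a direction for a graded variable, then move a vector along it to a counterfactual value) can be illustrated with a toy sketch. This is not the authors' procedure: it assumes a single direction given by the covariance between activations and the target, plus a one-dimensional least-squares readout, and all function names and data are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(values):
    return sum(values) / len(values)


def fit_direction(vectors, targets):
    """Unit vector along which the activations covary with the graded target."""
    dim = len(vectors[0])
    center = [mean([v[i] for v in vectors]) for i in range(dim)]
    t_bar = mean(targets)
    cov = [
        sum((v[i] - center[i]) * (t - t_bar) for v, t in zip(vectors, targets))
        for i in range(dim)
    ]
    norm = math.sqrt(dot(cov, cov))
    return [c / norm for c in cov]


def fit_readout(vectors, targets, direction):
    """Least-squares line: target ~ slope * projection + intercept."""
    proj = [dot(v, direction) for v in vectors]
    p_bar, t_bar = mean(proj), mean(targets)
    slope = sum((p - p_bar) * (t - t_bar) for p, t in zip(proj, targets)) / sum(
        (p - p_bar) ** 2 for p in proj
    )
    return slope, t_bar - slope * p_bar


def edit_to_target(vector, direction, slope, intercept, target):
    """Shift the vector along the direction until the readout equals the target."""
    current = slope * dot(vector, direction) + intercept
    step = (target - current) / slope
    return [x + step * d for x, d in zip(vector, direction)]


# Toy data (hypothetical): the first coordinate tracks the graded variable.
vectors = [[0.0, 1.0], [1.0, -1.0], [2.0, 1.0], [3.0, -1.0]]
targets = [0.0, 0.5, 1.0, 1.5]

direction = fit_direction(vectors, targets)
slope, intercept = fit_readout(vectors, targets, direction)
edited = edit_to_target(vectors[0], direction, slope, intercept, target=1.5)
readout = slope * dot(edited, direction) + intercept
print(round(readout, 6))
```

Because the direction has unit norm, the step size needed to reach a target value follows directly from the readout slope; components orthogonal to the direction are left unchanged.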

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on causal interventions on continuous variables (verb bias) within standard language models using steering vectors. It does not address multimodality, visual encoders, world models, reinforcement learning, or tokenizer design, leading to low relevance to the provided keyword set, which targets multimodal, RL, and world-model topics. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Causal Interventions, Continuous Variables, Verb Bias, Steering Vectors, In-Context Learning, Language Models, Counterfactual Edits

242. ActTraitBench: Quantifying the Knowledge-Decision Gap in Large Language Models via Human-Grounded Behavioral Validation [FAIL]

Score: 6.0 / 27.8

Authors: Yutong Yang, Chenxi Miao, Weikang Li, Yunfang Wu

Published: 2026-05-28

TL;DR: ActTraitBench reveals a pervasive knowledge-decision gap in LLMs where larger models exhibit stronger behavioral divergence despite consistent self-reports, which can be mitigated via Chain of Cognitive Alignment.

摘要翻译

虽然大型语言模型（LLM）能够在显性自我报告中令人信服地模拟人格，但它们往往在隐性行为决策上出现偏差，揭示出显著的知识 - 决策差距（$G_{\text{KD}}$）。现有的基准测试难以衡量这种不对称性，主要由于构效度有限、多维纠缠以及基于 LLM 的评估中存在的分布偏差。为了解决这些问题，我们提出 ActTraitBench，这是一个基于人类数据的评价框架，用于衡量 LLM 中的人格一致性。基于实证人类数据，ActTraitBench 建立了心理测量维度与行为范式之间的一一映射，并应用了基于分位数映射的分布校准方法，以将 LLM 评估者的分数分布与人类常模对齐。对 14 种主流 LLM 的实验揭示了一种普遍存在的知识 - 决策不对称性，其中更大且能力更强的模型往往表现出更强的行为偏差，尽管其自我报告高度一致。为了缓解这一差距，我们进一步引入了认知对齐链（CoCA），这是一种即插即用的推理时干预措施，能够提升推理能力前沿模型的对齐性，同时暴露出较小架构的明显能力局限。

Abstract

While Large Language Models (LLMs) can convincingly simulate personas in explicit self-reports, they often deviate in implicit behavioral decisions, revealing a substantial Knowledge-Decision Gap ($G_{\text{KD}}$). Existing benchmarks struggle to measure this asymmetry due to limited construct validity, multi-dimensional entanglement, and distributional biases in LLM-based evaluation. To address these issues, we propose ActTraitBench, a human-grounded evaluation framework for measuring personality consistency in LLMs. Grounded in empirical human data, ActTraitBench establishes one-to-one mappings between psychometric facets and behavioral paradigms, and applies a Distributional Calibration via Quantile Mapping procedure to align LLM-judge score distributions with human norms. Experiments on 14 mainstream LLMs reveal a pervasive knowledge-decision asymmetry, where larger and more capable models often exhibit stronger behavioral divergence despite highly consistent self-reports. To mitigate this gap, we further introduce the Chain of Cognitive Alignment (CoCA), a plug-and-play inference-time intervention that improves alignment in reasoning-capable frontier models while exposing clear capability limitations in smaller architectures.
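
The idea of aligning LLM-judge scores with human norms through quantile mapping can be sketched generically: a judge score is replaced by the human-norm value at the same empirical quantile. The details of the paper's Distributional Calibration via Quantile Mapping procedure are not given in the abstract, so the function below is only a plain empirical version with hypothetical names and toy data.

```python
from bisect import bisect_right


def quantile_map(score, judge_scores, human_scores):
    """Map a judge score to the human score at the same empirical quantile."""
    judge_sorted = sorted(judge_scores)
    human_sorted = sorted(human_scores)
    n_judge, n_human = len(judge_sorted), len(human_sorted)
    # Number of judge scores <= score, i.e. the empirical CDF times n_judge.
    rank = bisect_right(judge_sorted, score)
    # Nearest-rank inverse CDF on the human sample, using integer arithmetic
    # (ceil(rank * n_human / n_judge) - 1) to avoid floating-point edge cases.
    index = (rank * n_human + n_judge - 1) // n_judge - 1
    index = max(0, min(n_human - 1, index))
    return human_sorted[index]


# Toy data (hypothetical): the judge scores run higher than the human norms.
judge = [6, 7, 7, 8, 9]
human = [2, 3, 4, 5, 6]

for raw in (5, 7, 9):
    print(raw, "->", quantile_map(raw, judge, human))
```

The mapping is monotone, so it preserves the ranking the judge produces while reshaping the score distribution toward the human sample.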

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on evaluating personality consistency and the knowledge-decision gap in Large Language Models (LLMs) using a human-grounded benchmark (ActTraitBench). It does not discuss multimodal architectures, tokenizers, visual encoders, world models, or reinforcement learning. Only 'Unify Models' and 'MLLM' have marginal relevance due to the focus on LLMs and the unification of evaluation metrics, while the other keywords are completely unrelated to the paper's content.

Keywords

Large Language Models, Knowledge-Decision Gap, ActTraitBench, Personality Consistency, Human-Grounded Behavioral Validation, Chain of Cognitive Alignment, Inference-time Intervention, Psychometric Facets

243. AfriScience-MT: Towards Decolonizing Science in Africa through Text Translation [FAIL]

Score: 6.0 / 27.8

Authors: Idris Abdulmumin, Tajuddeen Gwadabe, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Nomonde Khalo, Ibrahim Said Ahmad, Abiodun Modupe, Anina Mumm, Sibusiso Biyela, Michelle Rabie, Johanna Havemann, Marek Rei, Jade Abbott, Vukosi Marivate

Published: 2026-05-28

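TL;DR: This paper builds AfriScience-MT, a scientific translation corpus for African languages, and benchmarks a range of models, finding that closed-source large language models outperform open-source models on scientific text translation.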

摘要翻译

殖民语言在非洲教育和科学传播中的主导地位限制了数亿非洲语言使用者获取和产生科学知识的能力。核心障碍在于这些语言中缺乏既定的科学术语。我们引入 AfriScience-MT，这是一个涵盖六种非洲语言（阿姆哈拉语 (Amharic)、豪萨语 (Hausa)、卢干达语 (Luganda)、北索托语 (Northern Sotho)、约鲁巴语 (Yorùbá) 和祖鲁语 (isiZulu)）的平行语料库，涉及 11 个科学领域。专业翻译人员与专家科学传播者合作，将科学论文的通俗语言摘要翻译成每种目标语言，并在不存在新术语的地方创建新术语。我们在零样本、少样本和微调设置下对机器翻译系统和大语言模型进行基准测试。结果显示，闭源模型在句子级和文档级上均优于所有开源模型：GPT-5.4 和 Gemini-3.1-Flash-Lite 分别以平均句子级 COMET 得分 68.3 和 68.0 领先，并在平均文档级 COMET 得分 48.3 上持平。在开源系统中，微调后的 NLLB-1.3B 在句子级达到 67.3，而 TranslateGemma-12B 在 1-shot 上下文学习下在文档级达到 44.0。我们发布 AfriScience-MT，以支持非洲语言的基准测试和文档级科学机器翻译。

Abstract

The dominance of colonial languages in African education and scientific communication limits how hundreds of millions of speakers of African languages access and produce scientific knowledge. A core obstacle is the lack of established scientific terminology in these languages. We introduce AfriScience-MT, a parallel corpus covering six African languages (Amharic, Hausa, Luganda, Northern Sotho, Yorùbá, and isiZulu) across 11 scientific domains. Professional translators, working with expert science communicators, translated plain-language summaries of scientific papers into each target language and created new terms where none existed. We benchmark machine translation systems and large language models in zero-shot, few-shot, and fine-tuned settings. Our results show that closed-source models outperform all open-source models at both the sentence and document levels: GPT-5.4 and Gemini-3.1-Flash-Lite lead with average sentence-level COMET scores of 68.3 and 68.0, respectively, and tie at an average document-level COMET of 48.3. Among open systems, fine-tuned NLLB-1.3B reaches 67.3 at the sentence level, and TranslateGemma-12B reaches 44.0 at the document level with 1-shot in-context learning. We release AfriScience-MT to support benchmarking and document-level scientific MT for African languages.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

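Scoring rationale: The paper addresses machine translation of scientific text into African languages, which does not match the focus of the keyword set (multimodality, world models, reinforcement learning). Tokenizer is slightly more relevant as a basic component of translation (2 points); Unify Models and MLLM are weakly related through model benchmarking and LLM use (1 point each); Visual Encoder, World Models, MultiModal, and model-based RL are unrelated to this text-only task (0 points). The weighted total is 6.0, far below the dynamic passing score of 27.8. None of the specified expert authors were found.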

Keywords

AfriScience-MT, Machine Translation, African Languages, Scientific Text, Benchmarking, Large Language Models, Text Translation, Decolonization

244. Leveraging Routing Dynamics in Mixture-of-Experts Models for Efficient Language Adaptation [FAIL]

Score: 6.0 / 27.8

Authors: Aditi Khandelwal, Marius Mosbach, Verna Dankers, Siva Reddy, Golnoosh Farnadi

Published: 2026-05-28

TL;DR: This paper investigates multilingual routing dynamics in Mixture-of-Experts models during continual pre-training and proposes a parameter-efficient adaptation strategy by updating language-specific experts in final layers, achieving strong performance with minimal parameter updates.

摘要翻译

混合专家模型（MoE）被广泛用于扩展语言模型，但其在多语言环境下的专家路由行为及适配机制仍鲜有研究。本文研究了在以英语为中心的 MoE 模型于多语料库上进行持续预训练期间的多语言路由动态，分析了专家使用在不同语言间的变化情况。我们发现，持续的多语言预训练导致早期和中间层出现分散的、语言无关的路由，而语言专业化主要出现在最终层。此外，我们还表明，语言之间的词级词汇重叠在语言路由方式中起着重要作用。基于这些发现，我们提出了一种参数高效的适配策略，用于更新最终 MoE 层中的语言特定专家与共享专家。在 MultiBLiMP 和 Belebele 上的实验表明，我们的方法实现了优异的性能 - 效率权衡，相对于微调完整的最终层具有竞争力，同时仅更新不到 2% 的参数。总体而言，我们的发现揭示了在持续预训练过程中语言专业化在 MoE 中如何及何处出现，并为低资源多语言适配提供了实用见解。我们的代码可在 https://github.com/aditi184/moe-routing-adaptation 获取。

Abstract

Mixture-of-Experts (MoE) models are widely used to scale language models, yet their expert routing behavior and adaptation in a multilingual setting remain underexplored. In this work, we study multilingual routing dynamics during continual pre-training of an English-centric MoE model on a multilingual corpus, analyzing how expert usage varies across languages. We find that continual multilingual pre-training leads to diffused, language-agnostic routing in early and middle layers, with language specialization primarily emerging in the final layers. We also show that token-level vocabulary overlap between languages plays an important role in how languages are routed. Motivated by these findings, we propose a parameter-efficient adaptation strategy that updates language-specific and shared experts in the final MoE layers. Experiments on MultiBLiMP and Belebele show that our method achieves a strong performance-efficiency trade-off, attaining competitive performance relative to fine-tuning complete final layers, while updating less than 2% of the parameters. Overall, our findings provide insights into where and how language specialization emerges in MoEs during continual pre-training and provide practical insights for low-resource multilingual adaptation. Our code is available at https://github.com/aditi184/moe-routing-adaptation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Mixture-of-Experts (MoE) for multilingual language adaptation, analyzing routing dynamics and proposing parameter-efficient fine-tuning. It lacks content on visual encoders, world models, multimodal integration (MultiModal/MLLM), or reinforcement learning (model-based RL). Tokenizer relevance is minimal, as vocabulary overlap is mentioned but is not the focus. Unify Models is loosely related to the MoE architecture, but unified multimodal modeling is not the paper's core theme.

Keywords

Mixture-of-Experts, Multilingual Routing, Continual Pre-training, Parameter-efficient Adaptation, Language Specialization, Expert Routing Dynamics, Language Agnostic

245. Understanding Safety-Sensitive Expert Behavior in Mixture-of-Experts LLMs [FAIL]

Score: 6.0 / 27.8

Authors: Zhibo Zhang, Yuxi Li, Zhen Ouyang, Ling Shi, Kailong Wang

Published: 2026-05-28

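TL;DR: This paper proposes RASET, a framework that identifies and fine-tunes safety-critical experts in MoE LLMs without changing routing patterns, revealing that safety behavior is localized in experts rather than controlled by routing.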

摘要翻译

混合专家（Mixture-of-Experts, MoE）大语言模型（LLM）依赖于稀疏的、由路由器驱动的专家激活，然而安全对齐如何与路由专家的特化相互作用仍研究不足。一种普遍认知是，安全行为可能通过将有害请求路由到不同的拒绝导向专家来控制。在这项工作中，我们提供了实证证据表明另一种情况：在对齐的 MoE 大语言模型（LLM）中，路由模式主要由话题驱动，而安全行为可以在模型内在路由路径发生微小变化的情况下被改变。基于此观察，我们提出了 RASET（Router-Agnostic Safety-critical Expert Tuning，路由器无关安全关键专家调优），这是一种红队测试框架，旨在探测局限于少数专家子集的安全执行，同时保持模型的内在路由行为。RASET 通过对比路由敏感性准则识别安全关键专家，并仅对选定的专家应用参数高效调优，相较于路由器引导干预，最小化语义扰动。这些结果揭示了一种独特的 MoE 安全风险，凸显了需要专家感知对齐机制的必要性。

Abstract

Mixture-of-Experts (MoE) LLMs rely on sparse, router-driven expert activation, yet how safety alignment interacts with routed expert specialization remains underexplored. A common intuition is that safety behavior may be controlled by routing harmful requests to distinct refusal-oriented experts. In this work, we provide empirical evidence for a different picture: routing patterns in aligned MoE LLMs are largely topic-driven, while safety behavior can be altered with little change to the model's intrinsic routing path. Motivated by this observation, we present RASET (Router-Agnostic Safety-critical Expert Tuning), a red-teaming framework that probes safety enforcement localized in a small subset of experts while preserving the model's intrinsic routing behavior. RASET identifies safety-critical experts via a contrastive routing-sensitivity criterion and applies parameter-efficient tuning only to the selected experts, minimizing semantic disruption relative to router-steering interventions. These results reveal a distinct MoE safety risk, highlighting the need for expert-aware alignment mechanisms.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

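Scoring rationale: The paper mainly concerns safety alignment and routing mechanisms in MoE LLMs, whereas the provided keywords emphasize multimodal world models, visual encoders, and reinforcement learning. The two areas differ substantially, so the relevance scores are low. MLLM and Unify Models receive a few points for touching on large-model architecture and the unification of experts; the remaining keywords are entirely unrelated to the paper's content.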

Keywords

Mixture-of-Experts, LLMs, Safety Alignment, Routing Patterns, RASET Framework, Expert Specialization, Parameter-efficient Tuning, Safety-critical Experts

246. Evaluating Cross-lingual Knowledge Consistency in Code-Mixed vis-a-vis Indian Languages using IndicKLAR [FAIL]

Score: 6.0 / 27.8

Authors: Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Aditya Joshi, Akshay Agarwal, Jasabanta Patro

Published: 2026-05-28

TL;DR: This study investigates the cross-lingual knowledge consistency gap in large language models for Indian languages and demonstrates that code-mixed inputs significantly reduce the performance disparity compared to native languages without requiring model intervention.

摘要翻译

大语言模型在英语中能可靠地回忆知识，但在低资源语言上面对相同查询时往往失败——这种跨语言一致性差距在印度语言及其语码混合变体中仍未被充分探索。为研究这一差距，我们引入了 IndiKLAR，这是 KLAR-CLC 基准的印度语扩展，涵盖了 22 种印度官方语言中的 18 种，并将它们与 11 种广泛使用的语言对的语码混合变体配对；针对这 11 种设置，对单语和语码混合变体均进行了母语者验证。这种三向对齐提供了一个独特的机会，以考察知识回忆一致性如何在英语、语码混合和印度母语输入的谱系上变化。在九个开源模型上进行评估，我们发现母语准确性与英语之间的差距可达 ~0.50，而语码混合输入填补了大部分差距——使性能在 ~0.05 范围内接近英语，且无需任何模型层面的干预。受此启发，我们评估了若干提示策略，这些策略在语言转换暴露方式上有所不同，包括两阶段“先翻译后回答”设置、一阶段“联合翻译与回答”提示，以及 Translate-in-Thought (TinT) —— 一种单步策略，模型在内部转换输入并仅输出最终答案。在性能轨迹“母语 → 语码混合 → 英语”上，我们识别出一个一致的翻转点，即错误预测与正确预测之间的边界，该边界位于母语设置与语码混合设置之间。有趣的是，无论该轨迹是由输入的表面形式诱导的，还是由模型内部的转换过程诱导的，这一现象均成立。

Abstract

Large language models recall knowledge reliably in English but often fail on the same query posed in a lower-resourced language, a crosslingual consistency gap that remains underexplored for Indian languages and their code-mixed counterparts. To study this gap, we introduce IndiKLAR, an Indic extension of the KLAR-CLC benchmark covering 18 of the 22 scheduled Indian languages and pairing them with code-mixed variants for 11 widely used language pairs, with native-speaker verification of both monolingual and code-mixed variants for these 11 settings. This three-way alignment offers a unique opportunity to examine how knowledge recall consistency varies across the spectrum of English, code-mixed, and native Indian language inputs. Evaluating across nine open-weight models, we find that the native-language accuracy gap to English can reach ~0.50, while code-mixed inputs close most of it, bringing performance within ~0.05 of English without any model-level intervention. Motivated by this, we evaluate several prompting strategies that vary in how language conversion is exposed, including a two-stage translate-then-answer setup, a one-stage joint translation-and-answer prompt, and Translate-in-Thought (TinT), a single-step strategy in which the model converts the input internally and emits only the final answer. Across the performance trajectory native → code-mixed → English, we identify a consistent flip point (the boundary between incorrect and correct prediction) that lies between the native and code-mixed settings. Interestingly, this holds whether the trajectory is induced by the input surface form or by the model's internal conversion process.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on cross-lingual knowledge consistency in NLP for Indian languages using the IndiKLAR benchmark, which has minimal overlap with the provided keywords targeting multimodal architectures and RL. Keywords like Visual Encoder, World Models, MultiModal, and model-based RL are irrelevant, as the study is text-only and does not involve reinforcement learning or world modeling. Tokenizer and MLLM have slight relevance (LLM usage), and Unify Models has conceptual relevance regarding consistency evaluation, but overall the topic mismatch results in low scores.

Keywords

Cross-lingual Knowledge Consistency, Code-Mixed Languages, Indian Languages, IndiKLAR Benchmark, Large Language Models, Prompting Strategies, Language Consistency Gap

247. FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning [FAIL]

Score: 6.0 / 27.8

Authors: Juneyoung Park, Seongbae Lee, Han-Sang Lee, Kyuho Lee, Minjae Kim, Seungheon Hyeon, Kiduk Kwon, Seongwan Kim, Jaeho Lee

Published: 2026-05-28

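TL;DR: FoRA selects task-relevant layers via Fisher scores and imposes an orthogonality constraint on the Stiefel manifold, achieving more parameter-efficient fine-tuning than LoRA and substantially reducing the number of trainable parameters.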

摘要翻译

参数高效微调（PEFT）主要关注 LoRA 及其面向准确率的变体，而减少可训练参数的原始目标却相对较少受到关注。我们提出 FoRA，通过减少适配层的数量而非适配器秩来重新审视这一目标。FoRA 通过单次遍历的对角 Fisher 得分（训练成本低于 1%）选择任务信息丰富的层，并在选定的层上于 Stiefel 流形上训练 LoRA 的下投影，以保持列正交性和有效秩。在五个 LLaMA 家族骨干网络上，FoRA 在参数预算减半的情况下始终优于 LoRA 和 DoRA，且在参数数量仅为四分之一时，准确率与 AdaLoRA 相差 0.7-0.8 个百分点。在来自 LLaMA、Qwen3 和 Gemma 家族的十二个骨干网络上进行的跨架构实验证实，从 270M 到 32B 参数范围内均能获得一致的提升。这两个组件结合具有超加性：Fisher 选择在相同预算下即可匹配秩降低的效果，而 Stiefel 约束提供了决定性的额外增益。

Abstract

Parameter-efficient fine-tuning (PEFT) has largely focused on LoRA and its accuracy-oriented variants, while the original goal of reducing trainable parameters has received comparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.
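
The two ingredients named in the abstract can be illustrated in isolation. The sketch below is not the FoRA implementation: it assumes a per-layer score equal to the mean over samples of the summed squared gradients (one common form of a diagonal empirical Fisher estimate), and it uses Gram-Schmidt only to show what column orthonormality (a point on the Stiefel manifold) means. How FoRA aggregates the scores and optimizes on the manifold is not described in this report; all names and numbers are hypothetical.

```python
import math


def fisher_scores(per_sample_grads):
    """Map each layer to the mean over samples of its summed squared gradients."""
    return {
        layer: sum(g * g for grad in grads for g in grad) / len(grads)
        for layer, grads in per_sample_grads.items()
    }


def select_layers(scores, k):
    """Names of the k layers with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def orthonormalize_columns(matrix):
    """Modified Gram-Schmidt on the columns of a full-column-rank matrix.

    The result has orthonormal columns (Q^T Q = I).
    """
    rows, cols = len(matrix), len(matrix[0])
    basis = []
    for c in range(cols):
        col = [matrix[r][c] for r in range(rows)]
        for b in basis:
            coeff = sum(x * y for x, y in zip(col, b))
            col = [x - coeff * y for x, y in zip(col, b)]
        norm = math.sqrt(sum(x * x for x in col))
        basis.append([x / norm for x in col])
    return [[basis[c][r] for c in range(cols)] for r in range(rows)]


# Toy per-sample gradients (hypothetical): two samples, two parameters per layer.
grads = {
    "layer0": [[0.1, 0.1], [0.1, -0.1]],
    "layer1": [[1.0, 2.0], [2.0, 1.0]],
    "layer2": [[0.5, 0.5], [0.5, 0.5]],
}
scores = fisher_scores(grads)
chosen = select_layers(scores, k=2)
print(chosen)

q = orthonormalize_columns([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
gram = [[sum(q[r][i] * q[r][j] for r in range(3)) for j in range(2)] for i in range(2)]
print([[round(x, 6) for x in row] for row in gram])
```

Selecting fewer layers reduces the number of adapters to train, while the orthonormality constraint keeps the columns of each down-projection from collapsing onto one another.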

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

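Scoring rationale: The paper proposes FoRA for parameter-efficient fine-tuning (PEFT) of large language models, selecting layers by Fisher score and optimizing on the Stiefel manifold to preserve orthogonality. It does not involve multimodality, visual encoders, world models, or reinforcement learning, and is largely unrelated to the keyword topics (Unify Models, World Models, model-based RL, etc.). MLLM and Tokenizer are weakly related only because LLM families are involved. The weighted total is 6.0, far below the dynamic passing score of 27.8. The author list contains none of the specified experts.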

Keywords

Parameter-Efficient Fine-Tuning, LoRA, Fisher Score, Stiefel Manifold, LLaMA-family, Orthogonal Adaptation, Trainable Parameters, Layer Selection

248. GMOS: Grounding Moving Object Segmentation in 3D Space and Time [FAIL]

Score: 6.0 / 27.8

Authors: Junyu Xie, Tengda Han, Weidi Xie, Andrew Zisserman

Published: 2026-05-28

TL;DR: GMOS proposes a framework for 3D-aware, temporally fine-grained moving object segmentation in RGB video, achieving state-of-the-art performance on MOS benchmarks without relying on pre-computed 2D auxiliary modalities.

摘要翻译

运动目标分割（MOS）旨在发现、分割并跟踪相对于相机独立运动的物体。然而，当前的 MOS 方法存在两个根本性局限：一方面，它们依赖于预计算的 2D 辅助模态（如光流或点轨迹），这些模态缺乏 3D 几何信息；另一方面，它们将运动视为序列级属性，忽略了每个对象的瞬时运动状态。我们通过将 MOS 置于三维空间与时间框架中来解决上述问题，并提出 GMOS 框架。该框架直接基于 RGB 视频运行，能够生成具有 3D 感知能力的、时间细粒度的多目标分割；同时，我们还提出了一个前景 - 背景变体 GMOS-S，以实现更快的部署。为了支持该设定下的训练与评估，我们构建了 GMOS-2K 数据集，该数据集包含 2,210 个真实世界视频，其对象级时间运动标注源自五个既定的视频目标分割（VOS）基准；同时，我们提出了 MOS-I（"I" 代表 instantaneous），这是一个包含三个互补指标的时间细粒度评估协议。GMOS 在 MOS、MOS-I 及无监督视频目标分割（VOS）基准测试中均取得了最先进的结果，同时其运行速度显著快于先前的多目标 MOS 方法，并支持在线推理，适用于流式部署。

Abstract

Moving Object Segmentation (MOS) aims to discover, segment, and track objects that move independently of the camera. Current MOS methods, however, exhibit two fundamental limitations: they rely on pre-computed 2D auxiliary modalities such as optical flow or point trajectories that lack 3D geometric information, and they treat motion as a sequence-level attribute, overlooking the instantaneous motion state of each object. We address both by grounding MOS in 3D space and time, and propose GMOS, a framework that operates directly on RGB video to produce 3D-aware, temporally fine-grained segmentation of multiple moving objects, alongside a foreground-background variant GMOS-S for faster deployment. To support training and evaluation in this regime, we curate GMOS-2K, a dataset of 2,210 real-world videos with per-object temporal motion annotations drawn from five established Video Object Segmentation (VOS) benchmarks, and formalise MOS-I ("I" for instantaneous), a temporally fine-grained evaluation protocol with three complementary metrics. GMOS achieves state-of-the-art results across MOS, MOS-I, and Unsupervised VOS benchmarks, while running significantly faster than prior multi-object MOS methods and supporting online inference for streaming deployment.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Moving Object Segmentation (MOS) in 3D space and time using RGB video, which is a traditional computer vision task. It does not involve Large Language Models (MLLM), Tokenizers, World Models, or Reinforcement Learning (RL), resulting in low scores for these keywords. While it processes visual data (Visual Encoder) and unifies spatial-temporal grounding (Unify Models), it does not align with the multimodal large model or RL focus of the keyword set. No expert authors from the specified list were found in the author list.

Keywords

Moving Object Segmentation, 3D Space and Time, RGB Video, Grounding, Temporally Fine-grained, GMOS Framework, Instantaneous Motion

249. Boosting Image Quality Assessment Performance: Unsupervised Score Fusion by Deep Maximum a Posteriori Estimation [FAIL]

Score: 6.0 / 27.8

Authors: Zhongling Wang, Raymond Zhou, Shahrukh Athar, Wenbo Yang, Zhou Wang

Published: 2026-05-28

TL;DR: This paper proposes an unsupervised deep MAP estimation framework to fuse scores from multiple IQA models, improving prediction accuracy and enabling the rejection of inferior models.

摘要翻译

过去数十年来，涌现出大量图像质量评估（IQA）模型，旨在预测图像的感知质量。然而，单个模型往往偏向于某些类型的图像内容或失真，这取决于其设计原理和过程。一个直观的想法是利用每个 IQA 模型的优势并弥补其弱点，通过将多个模型的分数融合成一个更强的模型。本文首次尝试寻求该理念的最优解，并提出了一种基于深度最大后验（MAP）估计的无监督 IQA 分数融合通用框架。所提出的模型在分数层面进行细粒度不确定性估计，以提高融合预测的准确性并降低不确定性。综合实验表明，所提出的模型优于单个 IQA 模型和其他融合方法。它还展现出一种有趣的能力，即在融合过程中拒绝“坏”模型。

Abstract

Over the past decades, numerous Image Quality Assessment (IQA) models have emerged, aiming to predict the perceptual quality of images. However, individual models are often biased toward certain types of image content or distortions, depending on the design principle and process. An intuitive idea is to harness the strengths and mitigate the weaknesses of each IQA model, by fusing the scores of multiple models into a stronger one. Here we make one of the first attempts to seek an optimal solution for the idea and propose a general framework for unsupervised IQA score fusion using deep Maximum a Posteriori (MAP) estimation. The proposed model conducts fine-grained uncertainty estimation at the score level to increase the accuracy and reduce the uncertainty in fused predictions. Comprehensive experiments demonstrate the superiority of the proposed model over individual IQA models and other fusion methods. It also exhibits an interesting capability of rejecting "bad" models in the fusion process.
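
As a point of reference for uncertainty-aware score fusion, the classical closed-form case is inverse-variance weighting: if each model's score is the true quality plus independent Gaussian noise with known variance, and the prior over quality is flat, the MAP estimate is the precision-weighted mean. The paper's deep MAP framework estimates uncertainty rather than assuming it, so the sketch below only illustrates this textbook case; the scores and variances are hypothetical.

```python
def fuse_scores(scores, variances):
    """Inverse-variance weighted fusion of several quality scores.

    Returns the fused score and the variance of the fused estimate.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused = sum(p * s for p, s in zip(precisions, scores)) / total_precision
    return fused, 1.0 / total_precision


# Toy example (hypothetical): the third model is noisy, so its outlying score
# receives very little weight.
scores = [60.0, 70.0, 90.0]
variances = [4.0, 4.0, 100.0]

fused, fused_variance = fuse_scores(scores, variances)
print(round(fused, 2), round(fused_variance, 2))
```

A model with very high variance contributes almost nothing to the fused score, which is the simplest mechanism by which a fusion rule can effectively reject an unreliable model.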

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Image Quality Assessment (IQA) score fusion using unsupervised Maximum a Posteriori (MAP) estimation, which is fundamentally unrelated to the provided keywords concerning Large Language Models, World Models, and Reinforcement Learning. 'Unify Models' receives a low score because the paper fuses scores rather than unifying model architectures. 'Visual Encoder' and 'MultiModal' receive minimal scores, as the paper deals with visual data but does not focus on encoder design or multimodal integration. 'Tokenizer', 'World Models', 'MLLM', and 'model-based RL' are completely unrelated to the content.

Keywords

Image Quality Assessment, Score Fusion, Unsupervised Learning, Maximum a Posteriori, Uncertainty Estimation, Deep Learning, Model Ensemble

250. Building and Road Recognition in Dense Urban Informal Settlements: A Dataset and Benchmark [FAIL]

Score: 6.0 / 27.8

Authors: Hongyu Long, Jiaxuan Liu, Rui Cao

Published: 2026-05-28

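TL;DR: To address the lack of finely annotated data for building and road recognition in high-density urban informal settlements, this paper builds the DenseUIS dataset and evaluates existing deep learning models, revealing the limitations of current methods on complex urban morphology.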

摘要翻译

作为一种普遍存在的非正规住区形式，城中村对可持续城市发展和治理构成了重大挑战。对其基础设施进行精确制图至关重要，然而现有的遥感数据集主要关注正规城市环境，缺乏针对城中村典型的高密度建筑模式和狭窄道路网络的细粒度标注数据。为了解决这一空白，我们引入了 DenseUIS 数据集，这是首个专为极高密度城市非正规住区中的建筑物和道路提取而设计的高分辨率遥感数据集，覆盖了中国深圳和广州的 126 个城中村。此外，我们在该数据集上对当前最先进的深度学习模型进行了全面评估。实验结果表明，现有方法在处理密集非正规住区的独特形态特征时存在局限性，凸显了采用专门方法的必要性。因此，DenseUIS 为在复杂且高密度的非正规环境中推进细粒度城市制图提供了稳健的基准。该数据集公开提供于 https://github.com/rui-research/DenseUIS。

Abstract

As a widespread form of informal settlement, urban villages present significant challenges for sustainable urban development and governance. Precise mapping of their infrastructure is essential; however, existing remote sensing datasets primarily focus on formal urban environments and lack fine-grained annotated data for the high-density building patterns and narrow road networks typical of urban villages. To address this gap, we introduce the DenseUIS dataset, the first high-resolution remote sensing dataset specifically designed for building and road extraction in extremely dense urban informal settlements, covering 126 urban villages across Shenzhen and Guangzhou in China. Furthermore, we conduct a comprehensive evaluation of state-of-the-art deep learning models on this dataset. Experimental results reveal the limitations of existing methods in handling the unique morphological patterns of dense informal settlements, underscoring the need for specialized approaches. DenseUIS therefore provides a robust benchmark for advancing fine-grained urban mapping in complex and high-density informal environments. The dataset is publicly available at https://github.com/rui-research/DenseUIS.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

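Scoring rationale: The paper focuses on dataset construction and benchmarking for building and road extraction from remote sensing imagery, a traditional computer vision topic. Most of the provided keywords concern large language models, reinforcement learning, and world models (e.g., Tokenizer, MLLM, World Models, model-based RL) and are largely unrelated to this paper. Only 'Visual Encoder' and 'MultiModal' are weakly related (the models contain encoders, and remote sensing may involve multispectral data); the remaining keywords do not appear in the paper.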

Keywords

Remote Sensing, Building Extraction, Road Extraction, Urban Informal Settlements, Dataset Benchmark, Deep Learning, Urban Mapping

251. City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images [FAIL]

Score: 4.5 / 27.8

Authors: Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

Published: 2026-05-28

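TL;DR: City-Mesh3R proposes a scalable divide-and-conquer framework that reconstructs simulation-ready, city-scale watertight 3D meshes from unordered image collections.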

摘要翻译

面向下游 3D 模拟的多视图图像城市级 3D 表面重建，由于城市场景的规模与复杂性，面临着极具挑战性的问题。现有的基于 NeRF、Gaussian Splatting（高斯泼溅）等的城市级 3D 重建方法，往往因几何结构不完整或缺失，以及表面不规则、嘈杂，而无法恢复出适用于模拟的 3D 网格。由于计算复杂度的原因，将现有的小尺度 3D 重建方法扩展至任意大的城市场景是高度不可行的。本文提出 City-Mesh3R，一个可直接从大型无序图像集合重建水密表面网格的可扩展框架。与近期方法不同，后者通常使用全局稀疏 SfM（运动恢复结构）点云初始化，随后进行大规模场景的分布式 3D 密集重建；我们的方法则采用分治策略，遵循端到端的图像到网格 3D 重建方法。稀疏城市地图通过拓扑图像聚类、簇内独立的稀疏 SfM 以及地图合并来重建，无需进行详尽的图像特征匹配。随后，该地图在空间上进行划分，以执行几何感知相机选择，接着进行密集表面重建，并利用曲率感知自适应顶点密度重网格化技术进行表面细化。进而，这些分区网格被缝合在一起，以生成城市的全局网格。所提出的端到端框架在城市级重建数据集上进行了评估。正如我们的定性和定量结果所示，所提出的方法能够生成具有高保真度、几何规则且能捕捉精细表面细节的水密 3D 网格，并且由于采用了分布式环境下的端到端处理，该方法适合扩展至任意大的场景。

Abstract

City-scale 3D surface reconstruction from multi-view images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, and similar techniques often fail to recover simulation-ready 3D meshes because of incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is largely infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by a distributed 3D dense reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. The partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As demonstrated by our qualitative and quantitative results, the proposed method yields high-fidelity watertight 3D meshes with regular geometry that capture fine surface details, and it is suitable for scaling to arbitrarily large scenes owing to the end-to-end processing in a distributed setting.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

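Scoring rationale: The paper focuses on city-scale 3D mesh reconstruction in computer graphics, a different area from the multimodal large models, reinforcement learning, and world models targeted by the keywords. 'MultiModal' has only a weak connection because the input is multi-view images; 'Unify Models' and 'Visual Encoder' have minimal relevance through the end-to-end image pipeline and image input; the remaining keywords are unrelated. None of the designated expert authors appears in the author list.

Keywords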

City-Scale, 3D Mesh Reconstruction, Multi-View Images, Simulation-Ready, Divide-and-Conquer, Watertight Surface Mesh, Surface Refinement, Distributed Setting

252. Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency [FAIL]

Score: 4.5 / 27.8

Authors: Chris Adams, Arjun Singh Banga, Parveen Bansal, Souvik Bhattacharya, Rujin Cao, Pedro Canahuati, Nate Cook, Brian Ellis, Prabhakar Goyal, Gurinder Grewal, Tianyu He, Matt Labunka, Alex Manners, David Molnar, Ging Cee Ng, Vishal Parekh, Jiefu Pei, Frederic Sagnes, James Saindon, Will Shackleton, Sid Sidhu, Gursharan Singh, Karthik Chengayan Sridhar, Matt Steiner, Pratibha Udmalpet, Sean Xia, Stacey Yan, Audris Mockus, Peter Rigby, Nachiappan Nagappan

Published: 2026-05-28

TL;DR: This paper proposes RADAR, a risk-aware automated code review system at Meta that utilizes LLMs to significantly reduce review latency and improve safety without compromising production stability.

Abstract translation (Chinese)

AI 辅助编码工具已重塑了软件生产流程。在 Meta，每人提交的 human-landed diff（人工落地变更集）中的显著代码行数同比增长 105.9%，人均 diff（变更集）数量上升 51%，其中 agentic AI（自主智能体 AI）贡献了超过 80% 的增长。与此同时，及时获得审查的 diff 比例有所下降，暴露了代码供应与审查者带宽之间日益扩大的差距。我们提出了三个循序渐进的问题，涵盖从可行性到校准再到影响：(1) risk-stratified automation（风险分层自动化）能否在 diverse organizations（多样化组织）中大规模部署？(2) tuning the risk threshold（调整风险阈值）如何影响自动化产出与安全之间的权衡？(3) automated review（自动审查）在何种程度上减少了 AI 生成变更的端到端延迟？我们部署了 RADAR (Risk Aware Diff Auto Review，风险感知差异自动审查系统)，这是一个多阶段漏斗，它根据 authorship（作者归属）和 source type（来源类型）对每个 diff 进行分类，依次应用 eligibility gates（资格门槛）、static heuristics（静态启发式规则）、machine-learned Diff Risk Score（机器学习差异风险评分）、LLM-based Automated Code Review（基于大语言模型的自动代码审查）以及 deterministic validation（确定性验证），随后落地合格的变更。我们通过覆盖 53.5 万多个经 RADAR 审查的 diff 的 telemetry（遥测数据）、政策变化的观察前后对比，以及 efficiency outcomes（效率结果）的 difference-in-differences analysis（双重差分分析）来评估 RADAR。RADAR 已审查了 53.5 万多个 diff，并成功落地了 33.1 万多个。将 Diff Risk Score 阈值从第 25 百分位放宽至第 50 百分位，使批准率提升至 60.31%。经 RADAR 审查的 diff 的 revert rate（回滚率）是非 RADAR diff 的 1/3，Production Incident rate（生产事故率）仅为非 RADAR diff 的 1/50。RADAR 将 median time to close（中位关闭时间）减少了 330% 以上，将 median diff review wall time（中位变更集审查耗时）减少了 35%。Risk-aware layered automation（风险感知分层自动化）可以显著减少由 AI 驱动的代码增长所引发的审查瓶颈，同时不损害生产安全。

Abstract

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
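The multi-stage funnel described in the abstract can be pictured as a chain of gates in which a diff lands only if every stage passes. The sketch below is illustrative only: the `Diff` fields, the stage functions, and the rule "risk percentile at or below the threshold passes" are hypothetical choices made for this example, not RADAR's actual interfaces or policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Diff:
    # Every field is a hypothetical stand-in for a signal named in the abstract.
    eligible: bool           # outcome of the eligibility gates
    heuristics_ok: bool      # outcome of the static heuristics
    risk_percentile: float   # Diff Risk Score expressed as a percentile in [0, 100]
    llm_review_ok: bool      # verdict of the LLM-based Automated Code Review
    validation_ok: bool      # outcome of deterministic validation


Stage = Tuple[str, Callable[[Diff], bool]]


def build_funnel(risk_threshold_percentile: float) -> List[Stage]:
    """Order the stages as listed in the abstract."""
    return [
        ("eligibility", lambda d: d.eligible),
        ("static_heuristics", lambda d: d.heuristics_ok),
        # Assumption: lower percentile means lower risk, so a higher threshold is more permissive.
        ("diff_risk_score", lambda d: d.risk_percentile <= risk_threshold_percentile),
        ("llm_review", lambda d: d.llm_review_ok),
        ("deterministic_validation", lambda d: d.validation_ok),
    ]


def first_failed_stage(diff: Diff, funnel: List[Stage]) -> Optional[str]:
    """Return the first stage that rejects the diff, or None if it may land."""
    for name, check in funnel:
        if not check(diff):
            return name
    return None


diff = Diff(eligible=True, heuristics_ok=True, risk_percentile=40.0,
            llm_review_ok=True, validation_ok=True)
strict = first_failed_stage(diff, build_funnel(25.0))   # rejected by the risk gate
relaxed = first_failed_stage(diff, build_funnel(50.0))  # passes every stage
```

Under these assumptions, moving the threshold from the 25th to the 50th percentile changes only the risk gate, which is one way to read the calibration question the abstract poses.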

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on software engineering automation (the RADAR system) using LLMs for text and code review. It does not address multimodal learning, world models, visual encoders, or reinforcement learning. Although it uses LLMs (slight MLLM/Tokenizer relevance), it lacks the core architectural elements (vision, world modeling, RL) specified in the keywords, resulting in low relevance scores.

Keywords

Automated Code Review, Risk Calibration, LLM-based Review, Software Engineering, Diff Risk Score, Review Efficiency, AI-assisted coding

253. Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance [FAIL]

Score: 4.5 / 27.8

Authors: Julius Gabelmann, Felix Jahn, Kevin Baum, Sophie van Rossum, Emely Wuenscher, Timo P. Gros, Verena Wolf

Published: 2026-05-28

TL;DR: This paper proposes a modular agentic AI chatbot architecture to foster responsible learning assistance by addressing the pedagogical shortcomings of monolithic LLMs in education.

Abstract translation (Chinese)

人工智能聊天机器人（AI chatbots）在教育领域的广泛应用将彻底改变学习过程，使得负责任的部署成为一项关键关切。尽管大语言模型（LLMs）可能获取到关于教育科学见解的资料，但它们并不特别倾向于遵循教学理念，这可能会对学习过程产生负面影响，例如丧失迁移能力、批判性思维或创造力。本文介绍了一种协助学生进行习题求解的智能体聊天机器人架构，专门旨在促进教育中更负责任的人工智能应用。我们的概念开发基于对负责任的大语言模型教育系统的若干期望需求的识别，论证了单体式、开箱即用的解决方案固有的结构缺陷，并建议将智能体架构进行模块化。我们为习题求解的不同阶段提出了具体模块，使得能够纳入针对性的教学建议，以更可控、透明且可监督的方式引导学生经历学习过程。

Abstract

The widespread adoption of AI chatbots in education will drastically change learning, making responsible deployment a critical concern. While large language models (LLMs) might have access to sources discussing insights from educational sciences, they are not particularly inclined to adhere to pedagogical concepts, risking negative effects on the learning process, such as a loss of transfer capabilities, critical thinking, or creativity. In this paper, we introduce an agentic AI chatbot architecture assisting students with exercise solving, specifically designed to contribute to more responsible AI use in education. We base our conceptual development on the identification of several desiderata for responsible LLM-based educational systems, argue for the structural shortcomings inherent in monolithic, out-of-the-box solutions, and instead suggest modularizing the agentic architecture. We propose specific modules for different stages of exercise solving, enabling incorporation of targeted pedagogical advice, guiding students through the learning process in a more controllable, transparent, and overseeable manner.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on modularizing educational LLMs for responsible learning assistance, which has minimal overlap with multimodal representation learning, tokenizers, visual encoders, world models, or model-based RL. Thus, most keywords score 0. Unify Models and MLLM receive low scores (2.0 and 1.0) because the paper discusses LLM architecture in general but lacks a specific focus on unification or multimodality. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list. The weighted total score is 4.5, below the dynamic passing score of 27.8.

Keywords

Educational LLM, Modular Architecture, Responsible AI, Pedagogical Concepts, Agentic Chatbot, Exercise Solving, Controllable Learning
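The score table for this entry can be reproduced with a few lines of arithmetic: each row's score is weight times relevance, and the entry total is the sum of the rows. The sketch below assumes exactly that rule and takes the passing score as a given constant; how the report derives its "dynamic passing score" or any expert-author bonus is not stated, so neither is modeled here.

```python
from typing import Dict, Tuple


def weighted_total(rows: Dict[str, Tuple[float, float]]) -> float:
    """Sum weight * relevance over all keywords (relevance is on a 0-10 scale)."""
    return sum(weight * relevance for weight, relevance in rows.values())


# (weight, relevance) pairs copied from the table for entry 253.
rows = {
    "Unify Models": (1.5, 2.0),
    "Tokenizer": (1.5, 0.0),
    "Visual Encoder": (1.5, 0.0),
    "World Models": (1.5, 0.0),
    "MLLM": (1.5, 1.0),
    "MultiModal": (1.5, 0.0),
    "model-based RL": (1.5, 0.0),
}

passing_score = 27.8  # taken from the report as-is; its derivation is not given
total = weighted_total(rows)
verdict = "PASS" if total >= passing_score else "FAIL"
```

With these rows the total comes to 3.0 + 1.5 = 4.5, matching the "Score: 4.5 / 27.8" line and the FAIL badge on the entry.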

254. Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms [FAIL]

Score: 4.5 / 27.8

Authors: Botao Amber Hu, Helena Rong, Max Van Kleek

Published: 2026-05-28

TL;DR: This paper argues that language model agents' dissociative identity undermines reputation-based trust mechanisms, proposing a shift to observability-based, protocol-driven governance instead.

Abstract translation (Chinese)

随着自主语言模型智能体的激增，形成了一个具有现实后果的新兴智能体网络，您能使用哪些可信度信号来决定是否信任真实环境中陌生的智能体并将其委托给它？一种自然的治理直觉是将人类身份验证和声誉机制从“了解你的客户（KYC）”和信用评分扩展到“了解你的智能体（KYA）”机制。然而，我们认为这种类比从根本上是不充分的。声誉机制既充当社会信号，又充当纠正反馈，以维持可信行为的均衡，这假定了一个持久身份的存在，该身份与行为连续性、制裁敏感性以及高昂的不可替代性相关联。然而，语言模型智能体在本体论上具有解离性：它们本质上是一个可变模块的集合——包括基础模型、系统提示、工具访问策略、外部记忆，以及在某些情况下，整个多智能体系统——其中的任何部分都可能改变智能体行为——同时拥有一个流动的人格，该人格也容易受到对抗性攻击，且可能无法内化制裁。借鉴解离性身份识别障碍（DID）的法理，这种解离性使智能体缺乏可识别性、可预测性、可信性和可改造性的基础——而这正是声誉机制旨在维持的属性——从而瓦解信任。我们认为，基于身份、事后、规制性、基于制裁的治理（如声誉）在结构上不适用于解离性智能体，我们建议转向基于可观察性、事前、构成性、基于协议的行为约束机制（behavioral harnesses）。

Abstract

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from "Know Your Customer" and credit scores to "Know Your Agent" regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically dissociative: they are essentially an assemblage of mutable modules (foundational models, system prompts, tool-access policies, external memory, and, in some cases, a multi-agent system as a whole), any of which may change agent behavior, with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. Drawing on dissociative identity disorder jurisprudence, this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability (the very properties that reputation mechanisms aim to sustain), thereby collapsing trust. We argue that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper addresses LM agent governance and identity (a sociological and technical critique), while the keywords target model architecture (Tokenizer, Visual Encoder), multimodality (MLLM, MultiModal), and RL frameworks (World Models, model-based RL). The paper does not discuss tokenization strategies, visual encoders, multimodal integration, world modeling for prediction, or model-based reinforcement learning algorithms, hence the low relevance scores.

Keywords

Language Model Agents, Reputation Mechanisms, Dissociative Identity, Governance, Trust, Observability-based, Behavioral Continuity, Identity Verification

255. BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders [FAIL]

Score: 4.5 / 27.8

Authors: Caleb DeLeeuw

Published: 2026-05-28

TL;DR: This paper audits biosecurity refusal mechanisms in LLMs using sparse autoencoders, revealing that refusal behavior is fragile across architectures and often reflects legal biases rather than genuine hazard detection.

Abstract translation (Chinese)

语言模型的生物安全评估通常关注模型是否会产生有害输出。本文提出了一个互补性问题：当模型拒绝时，这种拒绝在结构上是否稳固，还是在提示词框架、格式或输出长度的轻微修改下便会消失？在五种架构中，没有任何模型能清晰区分良性内容与有害内容。Gemma 2 2B-IT 在 75 个提示词中从未真正拒绝，对所有接近有害的查询均采取含糊其辞的态度。Gemma 4 E2B-IT 在使用聊天模板格式时拒绝了 65/75 个提示词，而在无该格式时则拒绝了 0/75 个。当限制在 80 个 token 以内时，这两个 Gemma 模型的拒绝率均降至 0%。Qwen 2.5 1.5B 和 Phi-3-mini 存在过度拒绝现象，将 83%-87% 的良性生物学内容标记为有害。Llama 3.2 1B 显示出唯一有意义的层级梯度（跨度为 61 个百分点）。为了探究驱动这种过度拒绝的原因，我们测试了一组属于 Schedule I 但生物学上无毒的化合物（尤其是裸盖菇素栽培，具有 FDA 突破性疗法认定）。部分模型拒绝这些化合物的比例甚至超过了真正有害的生物学内容，表明模型的拒绝行为更倾向于追踪合法性和文化显著性，而非 CBRN 危害。为了测量模型内部机制，我们引入一个分歧分数 D，用于比较模型表面响应标签与其内部稀疏自编码器（SAE）的特征激活。完整的分歧分数 D 是在 Gemma 2 2B-IT（Gemma Scope 1）和 Gemma 4 E2B-IT（作者训练的生物 SAE）上计算的。发布了两个微调的 Gemma 2 领域 SAE。在 Gemma 4 上，合规响应与拒绝响应之间分离出 0.647 点的差距且无重叠（n=75），尽管这仅是初步结果，受限于狭窄的目录、样本内校准以及仅覆盖 Gemma 家族模型的 SAE。本研究基于消费级硬件（GTX 1650 Ti Max-Q，外加用于 SAE 训练的 Colab T4）在一个黑客马拉松周末内完成，这些初步证据表明，激活级别审计可能揭示行为评估中不可见的故障模式，且不同架构之间存在显著差异。

Abstract

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

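Scoring rationale: The paper's core is safety auditing of language models with sparse autoencoders, focusing on text-level refusal behavior and internal representation analysis. The keyword set (visual encoder, world models, multimodal, model-based RL) mainly targets multimodal and reinforcement learning work and is a poor match for this paper. Minimal scores are given only for 'Tokenizer' (language model infrastructure) and 'Unify Models' (multi-model comparison); the other core keywords are unrelated.

Keywords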

Biosecurity Refusal, Sparse Autoencoders, Model Auditing, Internal Representations, Safety Alignment, Gemma Models, Divergence Score

256. Accelerating Constrained Decoding with Token Space Compression [FAIL]

Score: 4.5 / 27.8

Authors: Michael Sullivan, Alexander Koller

Published: 2026-05-28

TL;DR: This paper proposes CFGzip to compress the token search space for constrained decoding, achieving up to 7.5x speedup and reducing latency by two orders of magnitude.

Abstract translation (Chinese)

为了确保大语言模型（LLM）的输出符合指定结构，上下文无关语法（CFG）解码引擎强制选择能够生成符合给定 CFG 的字符串的下一个 token。尽管当前的 CFG 约束解码引擎高度优化，但由于每步搜索空间（即整个 token 词汇表）带来的固有成本，对于更复杂的 CFG 而言，会导致难以承受的开销：这正是 CFG 引擎最有用之处。在本文中，我们引入了 CFGzip，一种用于压缩 token 搜索空间的离线技术，可大幅降低 CFG 引擎的开销。实验结果表明，当 CFGzip 与最先进（SoTA）语法引擎结合使用时，延迟可降低多达两个数量级，总约束生成时间加速高达 7.5 倍：借助 CFGzip，对于复杂 CFG 而言，约束解码现在已具备规模化可行性。

Abstract

To guarantee that an LLM's outputs conform to a specified structure, context-free grammar (CFG) decoding engines force the selection of next tokens that produce strings that conform to a given CFG. While current CFG-constrained decoding engines are highly optimized, the inherent costs arising from the massive per-step search space (i.e., the entire token vocabulary) result in intractably high overhead for more complex CFGs: precisely the situation where CFG engines are most useful. In this paper, we introduce CFGzip, an offline technique for compressing the token search space, which massively reduces CFG engine overhead. In experiments, we report latency reduction of up to two orders of magnitude when CFGzip is used with a SoTA grammar engine, yielding up to a 7.5x speedup in total constrained generation time: with CFGzip, constrained decoding is now feasible at scale for complex CFGs.
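To make the per-step cost concrete, the toy sketch below contrasts a naive mask that checks every vocabulary token against the grammar with a mask that checks one representative per precomputed token class. It is not CFGzip's algorithm, which the abstract does not describe; the bracket-and-digit grammar, the `signature` grouping rule, and all function names are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def signature(token: str) -> str:
    """Replace every digit with 'd': the toy grammar cannot tell digits apart."""
    return "".join("d" if ch.isdigit() else ch for ch in token)


def allowed(depth: int, sig: str) -> bool:
    """Toy prefix check for bracketed digit strings at the current nesting depth."""
    for ch in sig:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closes a bracket that was never opened
        elif ch != "d":
            return False      # symbol outside the toy grammar's alphabet
    return True


def build_classes(vocab: List[str]) -> Dict[str, List[str]]:
    """Offline step: group tokens that the grammar treats identically."""
    classes: Dict[str, List[str]] = defaultdict(list)
    for token in vocab:
        classes[signature(token)].append(token)
    return dict(classes)


def mask_naive(depth: int, vocab: List[str]) -> List[str]:
    """One grammar check per vocabulary token."""
    return [t for t in vocab if allowed(depth, signature(t))]


def mask_compressed(depth: int, classes: Dict[str, List[str]]) -> List[str]:
    """One grammar check per class; members inherit the class verdict."""
    out: List[str] = []
    for sig, tokens in classes.items():
        if allowed(depth, sig):
            out.extend(tokens)
    return out


vocab = ["(", ")", "1", "2", "7", "12", "34", "(1", "(2", "1)", "2)", "))", "x"]
classes = build_classes(vocab)
naive = mask_naive(1, vocab)
compressed = mask_compressed(1, classes)
```

Both masks admit the same tokens, but the compressed one performs 8 grammar checks instead of 13 in this example. Real vocabularies and grammars are far larger, so any actual savings depend on how tokens can be grouped.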

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on accelerating constrained decoding in LLMs via token space compression (CFGzip). It has negligible overlap with the multimodal learning, world model, or reinforcement learning keywords. Only 'Tokenizer' has marginal relevance because the method operates on the token vocabulary, but the core contribution is decoding optimization rather than tokenizer architecture.

Keywords

Constrained Decoding, Token Space Compression, Context-Free Grammar, LLM, Latency Reduction, CFGzip, Search Space

257. Data filtering methods for training language models [FAIL]

Score: 4.5 / 27.8

Authors: Egor Shevchenko, Elena Bruches

Published: 2026-05-28

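TL;DR: This paper compares two automatic label-error detection methods on Russian text classification corpora and finds that filtering substantially improves model performance on small, noisy datasets but has little effect on large, low-noise ones.

Abstract translation (Chinese)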

数据质量是影响机器学习模型效果的关键因素。标签错误（即便存在于广泛使用的基准中）也会向训练数据引入噪声，并降低模型的泛化能力。本文对两种自动标签错误检测方法——Confident Learning 和 Dataset Cartography——在三个规模、类别数量及领域各异的俄语文本分类语料库上进行了比较分析：ru_emotion_e-culture（49,123 个样本，情感分类）、RuCoLA（8,524 个样本，语言可接受性）以及 TERRa（2,337 个样本，文本蕴含识别）。我们在每个语料库上使用预训练的 rubert-base-cased 模型进行微调。为验证过滤操作的有效性，我们进行了对照实验，即随机移除同等数量的样本。结果表明，两种方法的有效性高度依赖于数据集特征：在噪声水平较低的大型语料库上，过滤并未提升性能；而在噪声水平较高的小型数据集上，Confident Learning 实现了显著的 F1-macro 提升。Dataset Cartography 表现出更为保守的行为，移除的样本数量较少。在所有语料库上，两种方法的针对性移除均优于随机移除，证实了这些方法的有效性。

Abstract

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
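As a rough illustration of how Confident Learning flags suspect labels, the sketch below implements a simplified version of its thresholding idea: each class gets a threshold equal to the mean predicted probability among examples carrying that label, and an example is flagged when its confidently predicted class disagrees with its given label. This is a simplification with hypothetical function names and made-up probabilities, not the paper's experimental code; maintained implementations exist in libraries such as cleanlab.

```python
from typing import List, Optional, Sequence


def class_thresholds(probs: Sequence[Sequence[float]], labels: Sequence[int],
                     n_classes: int) -> List[float]:
    """Per-class threshold: mean predicted probability of class k over examples labeled k."""
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(own) / len(own))  # assumes every class appears at least once
    return thresholds


def confident_class(p: Sequence[float], thresholds: Sequence[float]) -> Optional[int]:
    """Most probable class among those whose probability clears its threshold, if any."""
    candidates = [k for k, t in enumerate(thresholds) if p[k] >= t]
    return max(candidates, key=lambda k: p[k]) if candidates else None


def flag_label_issues(probs: Sequence[Sequence[float]], labels: Sequence[int],
                      n_classes: int) -> List[int]:
    """Indices whose confidently predicted class differs from the given label."""
    thresholds = class_thresholds(probs, labels, n_classes)
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        k = confident_class(p, thresholds)
        if k is not None and k != y:
            flagged.append(i)
    return flagged


# Made-up out-of-sample probabilities for a 2-class problem; the last label looks wrong.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.95, 0.05]]
labels = [0, 0, 1, 1, 1]
flagged = flag_label_issues(probs, labels, n_classes=2)
```

The random-removal control described in the abstract would then drop `len(flagged)` randomly chosen examples instead and compare the resulting model quality.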

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

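Scoring rationale: The paper addresses data filtering and label-error detection for Russian text classification and has no direct connection to multimodality, world models, visual encoders, or reinforcement learning. A tokenizer is used only implicitly through the pre-trained model and is not a focus of the study.

Keywords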

Data filtering, Label error detection, Text classification, Russian language models, Confident Learning, Dataset Cartography, Pre-trained models

258. EMAG: Differentiable 4D Gaussian Mixture Splatting for EEG Spatial Super-Resolution [FAIL]

Score: 4.5 / 27.8

Authors: Alex Lazarovich, Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar

Published: 2026-05-28

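TL;DR: EMAG proposes a differentiable 4D Gaussian mixture framework that reconstructs high-density EEG signals from sparse electrodes and outperforms the prior state of the art at most spatial super-resolution factors.

Abstract translation (Chinese)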

高密度脑电图（HD-EEG）能够实现皮层活动的精细测量，但需要昂贵的硬件和漫长的设置时间，限制了其在临床和研究中的可及性。我们提出了 EMAG（各向异性高斯混合脑电图），这是一种可微框架，通过将脑电来源表示为各向异性 4D 时空高斯混合，能够从低密度（LD）电极的稀疏子集中重建 HD-EEG 信号。EMAG 在球形脑网格的每个点上布置多个高斯混合，每个混合均由完整的 4×4 精度矩阵参数化，从而实现各向异性空间扩展以及空间维度与时间维度之间的显式耦合。正向模型通过在电极位置的可微高斯场贡献生成头皮脑电图，从而实现端到端训练，而无需显式的源定位监督。我们在三个公共 EEG 基准数据集（Localize-MI、SEED 和 SEED-IV）上评估了 EMAG，超分辨率因子范围从 2 倍至 8 倍或 16 倍。在三个标准基准数据集（Localize-MI、SEED 和 SEED-IV）上，EMAG 在大多数超分辨率因子上优于当前最先进的 EEG 超分辨率方法。显式的高斯参数化进一步使得学习到的脑源配置的可视化与可解释性成为可能，可能为临床和神经科学应用开辟新途径，例如源定位或生物标志物发现。

Abstract

High-density electroencephalography (HD-EEG) enables fine-grained measurement of cortical activity but requires expensive hardware and lengthy setup times, limiting its clinical and research accessibility. We propose EMAG (EEG Mixture of Anisotropic Gaussians), a differentiable framework that reconstructs HD-EEG signals from a sparse subset of low-density (LD) electrodes by representing brain electrical sources as a mixture of anisotropic 4D space-time Gaussians. EMAG places a mixture of multiple Gaussians at each point of a spherical brain grid, each parameterized by a full 4 x 4 precision matrix, enabling anisotropic spatial spreads and explicit coupling between spatial and temporal dimensions. The forward model renders scalp EEG via differentiable Gaussian field contributions at electrode locations, enabling end-to-end training without explicit source localization supervision. We evaluate EMAG on three public EEG benchmarks (Localize-MI, SEED, and SEED-IV) at super-resolution factors of 2x through 8/16x. EMAG outperforms the current state-of-the-art EEG super-resolution method at most super-resolution factors on three standard benchmarks (Localize-MI, SEED, SEED-IV). The explicit Gaussian parameterization further enables direct visualization and interpretability of learned brain source configurations, potentially opening avenues for clinical and neuroscientific applications, such as source localization or biomarker discovery.
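The role of the full 4 x 4 precision matrix can be seen by evaluating a single space-time Gaussian directly. The sketch below computes an unnormalized value exp(-0.5 * d^T P d) for an offset d in (x, y, z, t) and sums such terms at an electrode location. It is a minimal illustration under assumed names and values, not the paper's forward model, whose normalization and rendering details the abstract does not give.

```python
import math
from typing import Dict, List, Sequence

Matrix4 = Sequence[Sequence[float]]


def gaussian_4d(point: Sequence[float], mean: Sequence[float],
                precision: Matrix4, amplitude: float = 1.0) -> float:
    """Unnormalized anisotropic Gaussian over (x, y, z, t) with a full precision matrix."""
    d = [p - m for p, m in zip(point, mean)]
    quad = sum(d[i] * precision[i][j] * d[j] for i in range(4) for j in range(4))
    return amplitude * math.exp(-0.5 * quad)


def electrode_signal(electrode_xyz: Sequence[float], t: float,
                     components: List[Dict]) -> float:
    """Sum the contributions of all mixture components at one electrode and time."""
    point = (*electrode_xyz, t)
    return sum(gaussian_4d(point, c["mean"], c["precision"], c["amplitude"])
               for c in components)


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Same matrix with an off-diagonal x-t entry: space and time are now coupled.
coupled_precision = [row[:] for row in identity]
coupled_precision[0][3] = coupled_precision[3][0] = 0.5

origin = (0.0, 0.0, 0.0, 0.0)
probe = (1.0, 0.0, 0.0, 1.0)  # offset by one unit in x and one unit in t

uncoupled = gaussian_4d(probe, origin, identity)           # quadratic form = 2
coupled = gaussian_4d(probe, origin, coupled_precision)    # quadratic form = 3
signal = electrode_signal((1.0, 0.0, 0.0), 1.0, [
    {"mean": origin, "precision": identity, "amplitude": 1.0},
    {"mean": origin, "precision": coupled_precision, "amplitude": 2.0},
])
```

The off-diagonal entry changes the value at the same probe point, which is the space-time coupling a diagonal or purely spatial parameterization cannot express.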

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

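Scoring rationale: The paper focuses on reconstruction and super-resolution of EEG signals, using a differentiable Gaussian mixture model for space-time modeling. It belongs to neuroscience and signal processing, which differs markedly from the general AI/RL areas covered by the keywords (multimodal large models, reinforcement learning, visual encoders, tokenizers). Apart from a weak link to 'Unify Models' through joint space-time modeling, the remaining keywords (Tokenizer, Visual Encoder, World Models, MLLM, model-based RL) have no direct relevance; 'MultiModal' refers only to the coupling of spatial and temporal dimensions and scores low.

Keywords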

EEG Super-Resolution, Gaussian Mixture, Differentiable Rendering, Space-Time Modeling, Brain Source Reconstruction, Electrode Density, Signal Processing

259. The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer [FAIL]

Score: 4.5 / 27.8

Authors: Tianhua Chen

Published: 2026-05-28

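TL;DR: This book offers a primer on the mathematical foundations of generative AI models, unifying the derivations of several model families; it does not focus on multimodal architectures, tokenization, or reinforcement learning.

Abstract translation (Chinese)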

本书提供了一本简明且以推导为导向的入门读物，介绍了现代生成式人工智能的数学基础。本书并未综述每一种最新的架构或实现细节，而是沿着连接主要生成模型家族的思想脉络，构建了一条连贯的路径，从主成分分析 (PCA)、概率主成分分析 (Probabilistic PCA)、变分自编码器和扩散模型 (Diffusion Models)，到归一化流 (Normalising Flows)、自回归分解 (Autoregressive Factorisations)、生成对抗网络 (GANs)、Wasserstein 生成对抗网络 (Wasserstein GANs) 以及基于能量的模型 (Energy-Based Models)。本书旨在使生成建模的结构更易于理解，同时保留理解这些模型是如何推导及相互关联所需的数学实质。本书作为一本奠基性入门读物，面向对数学有浓厚兴趣的研究者、从业者和学生。

Abstract

This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence. Rather than surveying every recent architecture or implementation detail, it develops a coherent route through the ideas connecting major families of generative models, from PCA, probabilistic PCA, variational autoencoders, and diffusion models to normalising flows, autoregressive factorisations, GANs, Wasserstein GANs, and energy-based models. The aim is to make the structure of generative modelling more accessible without removing the mathematical substance needed to understand how these models are derived and related. The book is intended as a foundation-building primer for mathematically curious researchers, practitioners, and students.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

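Scoring rationale: The work is a primer on the mathematical foundations of generative AI, covering derivations for general generative models such as PCA, VAEs, diffusion models, and GANs. It does not cover technical details of multimodal large language models (MLLM), tokenizers, visual encoders, world models, or model-based RL, so relevance to most keywords is minimal. It has some connection to 'Unify Models' (it unifies the mathematical framing of generative models), but not enough to offset the gaps elsewhere. The author, Tianhua Chen, is not on the designated expert list, so no bonus applies. The weighted total is well below the dynamic passing score of 27.8.

Keywords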

Generative AI Foundations, Mathematical Primer, Variational Autoencoders, Diffusion Models, Normalizing Flows, Generative Modeling, Probabilistic Models

260. Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion [FAIL]

Score: 4.5 / 27.8

Authors: S. Sutharya, Remya K. Sasi

Published: 2026-05-28

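TL;DR: This paper proposes CAFNet, which uses cross-attentive feature fusion to detect half-truth audio deepfakes and localise the manipulated region, achieving high classification and localisation accuracy with a parameter-efficient model.

Abstract translation (Chinese)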

音频深度伪造检测作为二分类问题已被广泛研究，但部分操纵语音（即将一段短合成片段拼接至其余部分真实的话语中）构成了更难且更现实的威胁。检测此类半真半假音频不仅需要将其与真实及完全伪造的语音区分开来，还需定位操纵发生的具体位置。我们提出 CAFNet，一种拥有 576k 参数的架构，旨在共同解决这两个任务：它在单次前向传播中执行三分类（真实、完全伪造或半真半假），并回归合成区域的时间边界。CAFNet 通过具有交叉注意力的并行深度可分离卷积分支，融合梅尔频率倒谱系数 (MFCC)、线性频率倒谱系数 (LFCC) 以及色度短时傅里叶变换 (Chroma-STFT) 特征，随后接一个双向长短期记忆 (BiLSTM) 回归头用于边界预测。在组合的多语言音频深度伪造检测语料库 (MLADDC) T2+T3 测试集上，CAFNet 准确率达到 92.71%，宏平均曲线下面积 (AUC) 为 0.9910，边界定位平均绝对误差 (MAE) 为 0.075 秒，中位误差为 0.052 秒。在二分类检测任务中，其准确率达到 96.76%，等错误率 (EER) 为 3.20%，在参数量减少 500 多倍的情况下，优于微调的 XLS-R 300M (78.31%) 和 AST 87M (93.03%)。跨数据集研究进一步表明，即使降低骨干网络的学习率，标准微调也会导致跨域表示崩溃。

Abstract

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

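Scoring rationale: The paper focuses on audio deepfake detection and localisation with a CNN and BiLSTM architecture, an application of conventional deep learning to audio forensics. It is largely unrelated to the keyword set, which emphasizes frontier foundation-model topics such as multimodal large models, world models, and reinforcement learning. The paper uses no tokenizer, visual encoder, world model, or reinforcement learning algorithm. Weak links exist only through task unification (detection plus localisation) and multi-feature fusion (loosely multimodal). The author list does not include Yang Shi or the other designated experts, so no bonus applies. The weighted total of 4.5 is far below the dynamic passing score of 27.8.

Keywords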

Audio Deepfake Detection, Half-Truth Localisation, Cross-Attentive Feature Fusion, MFCC, BiLSTM, Ternary Classification, Temporal Boundary

261. Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities [FAIL]

Score: 4.5 / 27.8

Authors: Ziwen Xie, Shaowen Xiang, Hongyu He, Dianbo Liu

Published: 2026-05-28

TL;DR: This paper introduces a quotient-DAG framework to compute exact unordered slate propensities efficiently for off-policy evaluation in slate recommendation, mitigating nuisance variance from ordered generation processes.

Abstract translation (Chinese)

离线策略评估（Off-policy evaluation）利用由不同行为策略（behavior policy）收集的数据来估计目标策略（target policy）的表现，这在在线测试成本高昂或风险较大时至关重要，例如在推荐或医疗保健领域。标准重要性采样（importance sampling）对每个记录轨迹（logged trajectory）进行重新加权，但它可能将生成过程细节视为有意义的，即使评估目标忽略了这些细节：例如，自回归列表推荐器（autoregressive slate recommender）可能生成有序的项目序列，而奖励及下游估计器仅依赖于无序列表（unordered slate）。这会产生干扰方差（nuisance variance）和计算差距（computational gap），因为精确的无序列表倾向性（propensities）需要对所有生成顺序进行求和。我们引入了一种商 DAG 视图（quotient-DAG view），该视图合并了评估等价的历史，并在合并图上使用目标策略到行为策略的前向流比率分配权重。针对集合充分下一项接口（set-sufficient next-item interface）下的列表推荐，该方法产生了 Forward-DP，这是一种子集 DAG 动态规划（subset-DAG dynamic program），能够在不进行阶乘枚举的情况下计算精确的无序倾向性。由此得到的倾向性原语使得针对上下文依赖的自回归列表记录器（context-dependent autoregressive slate loggers）的基于倾向性的评估和模型选择成为可能。

Abstract

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressive slate loggers.
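The computational gap mentioned in the abstract can be illustrated with a small subset dynamic program. The sketch below assumes a hypothetical set-sufficient sampler in which the next-item probability depends only on the set already chosen (a weight-proportional rule invented for this example), and compares the subset recursion with explicit enumeration of all generation orders. Function names are illustrative and this is not the paper's Forward-DP implementation.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Iterable, Sequence


def next_item_prob(item: int, chosen: FrozenSet[int], weights: Sequence[float]) -> float:
    """Hypothetical set-sufficient interface: depends on the chosen set, not its order."""
    remaining = sum(w for j, w in enumerate(weights) if j not in chosen)
    return weights[item] / remaining


def slate_propensity_dp(slate: Iterable[int], weights: Sequence[float]) -> float:
    """Exact unordered propensity by accumulating forward flow over subsets of the slate."""
    items = sorted(slate)
    k = len(items)
    flow: Dict[FrozenSet[int], float] = {frozenset(): 1.0}
    # Removing an item clears a bit, so every predecessor has a smaller mask
    # and has already been filled in when it is needed.
    for mask in range(1, 1 << k):
        subset = frozenset(items[b] for b in range(k) if (mask >> b) & 1)
        flow[subset] = sum(
            flow[subset - {item}] * next_item_prob(item, subset - {item}, weights)
            for item in subset
        )
    return flow[frozenset(items)]


def slate_propensity_bruteforce(slate: Iterable[int], weights: Sequence[float]) -> float:
    """Reference value: sum the ordered probability over all k! generation orders."""
    total = 0.0
    for order in permutations(slate):
        prob, chosen = 1.0, frozenset()
        for item in order:
            prob *= next_item_prob(item, chosen, weights)
            chosen = chosen | {item}
        total += prob
    return total


behavior_weights = [4.0, 3.0, 2.0, 1.0]  # made-up logging policy
target_weights = [1.0, 2.0, 3.0, 4.0]    # made-up target policy
slate = {0, 2, 3}

dp_value = slate_propensity_dp(slate, behavior_weights)
brute_value = slate_propensity_bruteforce(slate, behavior_weights)
# Importance weight on the unordered slate: target propensity over behavior propensity.
importance_weight = slate_propensity_dp(slate, target_weights) / dp_value
```

The recursion visits 2^k subsets with up to k predecessors each, whereas enumeration visits k! orders; for a slate of 10 items that is roughly 10 thousand subset updates against about 3.6 million orders.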

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on off-policy evaluation in slate recommendation using quotient DAGs and propensity scoring. The provided keywords are primarily oriented towards multimodal large models (Tokenizer, Visual Encoder, MLLM, MultiModal) and World Models, which are unrelated to this work's domain. Only a loose connection exists with 'model-based RL' (off-policy evaluation is an RL task) and 'Unify Models' (merging equivalent histories), resulting in low relevance scores. No expert authors from the target list were found.

Keywords

Off-policy evaluation, Slate recommendation, Quotient DAG, Importance sampling, Propensity estimation, Forward-flow, Unordered slate, Dynamic programming

262. Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference [FAIL]

Score: 4.5 / 27.8

Authors: Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Bert de Vries

Published: 2026-05-28

TL;DR: This paper proposes a framework for composing non-conjugate factor graphs that preserve closed-form variational inference, enabling Bayesian mixture of experts for time-series forecasting with calibrated uncertainty.

Abstract translation (Chinese)

将概率构建块堆叠到更深架构中通常会破坏闭式推断。我们表明闭式推断可以被保留。我们识别出五种因子图（factor-graph）原语：双线性因子（bilinear factor）、指数链接（exponential link）、Gamma 先验（Gamma prior）、高斯似然（Gaussian likelihood）和等值节点（equality node），并证明由它们构成的任何模型均支持闭式变分消息传递（closed-form variational message passing）。该构造之所以有效，是因为每个原语均保持了一小类消息族（message families）：在平均场分解（mean-field factorization）下，高斯变量的消息保持高斯分布，精度变量的消息保持 Gamma 分布；而唯一的非共轭接口（non-conjugate interface）——指数链接，通过高斯矩生成函数（Gaussian moment-generating function）及 Gamma 族的充分统计量（sufficient statistics）保持可处理性。我们展示了随深度增加的组合（composition），从静态集成（static ensembles）到输入依赖的门控（input-dependent gating），再到分支路由（split-branch routing），并证明堆叠路由层（routing layers）编码了任意决策树（decision trees），从而建立了具有闭式推断的通用函数逼近（universal function approximation）。应用于集成时间序列预测（ensemble time-series forecasting）时，该框架产生了一个贝叶斯混合专家模型（Bayesian mixture of experts），其中门控函数是通过推断而非学习得到的，在五个基准数据集上提供了专家选择的校准不确定性（calibrated uncertainty）。

Abstract

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.
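The abstract says the exponential link stays tractable through the Gaussian moment-generating function. As a minimal sketch of that identity only (not of the paper's message-passing rules, which the abstract does not spell out): for x ~ N(mu, sigma2), E[exp(t x)] = exp(t mu + t^2 sigma2 / 2). The function names and the test values below are illustrative assumptions.

```python
import math
import random


def gaussian_mgf(mu, sigma2, t=1.0):
    """E[exp(t * x)] for x ~ N(mu, sigma2), in closed form."""
    return math.exp(t * mu + 0.5 * t * t * sigma2)


def monte_carlo_exp(mu, sigma2, n=200_000, seed=0):
    """Sampling-based estimate of E[exp(x)], used only as a sanity check."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    return sum(math.exp(rng.gauss(mu, sigma)) for _ in range(n)) / n


# Illustrative values, not taken from the paper.
mu, sigma2 = 0.3, 0.5
print("closed form:", gaussian_mgf(mu, sigma2))
print("monte carlo:", monte_carlo_exp(mu, sigma2))
```

The closed form is what lets an expectation of exp(x) under a Gaussian message be evaluated without sampling or numerical integration.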

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
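The numbers in the table above are consistent with each row's score being weight times relevance, and with the paper's total being the sum of the rows. A minimal Python sketch of that arithmetic follows. It assumes the 27.8 shown on each Score line is the passing threshold and that a paper passes when its total reaches it; `weighted_total` and `verdict` are hypothetical names, not the report generator's code.

```python
# Rows from the score table for paper 262: (keyword, weight, relevance out of 10).
ROWS = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 1.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]

PASS_THRESHOLD = 27.8  # denominator shown on each "Score:" line


def weighted_total(rows):
    """Sum weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


def verdict(total, threshold=PASS_THRESHOLD):
    # Assumption: a paper passes when its total reaches the threshold.
    return "PASS" if total >= threshold else "FAIL"


total = weighted_total(ROWS)
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {verdict(total)}")
```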

Scoring rationale: The paper focuses on probabilistic graphical models and variational inference, which is theoretically distinct from the multimodal large language model and reinforcement learning contexts implied by the keywords. There is no content regarding tokenization, visual encoders, multimodality, or reinforcement learning. Only minimal relevance exists for model composition ('Unify Models') and probabilistic modeling ('World Models').

Keywords

Factor Graphs, Closed-Form Variational Inference, Non-Conjugate, Bayesian Mixture of Experts, Time-Series Forecasting, Variational Message Passing, Probabilistic Building Blocks

263. MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment [FAIL]

Score: 4.5 / 27.8

Authors: Dang Hong Nguyen, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

Published: 2026-05-28

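TL;DR: This paper proposes the MIC framework, which optimizes multi-granular embeddings through isotropic subspace alignment, reducing dimensional redundancy and spectral collapse and preserving informational capacity in high-compression settings.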

Abstract

Although multi-scale representation learning enables elastic-dimension embeddings, nested subspaces often suffer from dimensional redundancy and spectral collapse. To address this, we introduce MIC, a framework that optimizes the geometric landscape of multi-granular embeddings through isotropic subspace alignment. MIC employs Soft Collapse Regularization (SCR) to mitigate redundancy between prefix and residual subspaces via cross-correlation penalties, alongside Spectral Isotropy Regularization (SIR) to ensure hyper-spherical uniformity in low-dimensional prefixes. By unifying these strategies through a self-distillation objective, MIC generates semantically dense representations that maintain high discriminative power. Our experiments demonstrate that MIC significantly outperforms standard baselines, particularly in high-compression scenarios where maintaining informational capacity is most critical.
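The abstract describes penalizing cross-correlation between the prefix dimensions of an embedding and the remaining (residual) dimensions. The sketch below shows one plausible form of such a penalty, the mean squared Pearson correlation between every prefix dimension and every residual dimension. The exact SCR loss is not given in the abstract, so the formula and the names `standardize` and `cross_correlation_penalty` are assumptions for illustration.

```python
import math
import random


def standardize(column):
    """Zero-mean, unit-variance copy of one embedding dimension."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = math.sqrt(var) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in column]


def cross_correlation_penalty(embeddings, prefix_dim):
    """Mean squared correlation between prefix and residual dimensions."""
    n = len(embeddings)
    columns = [standardize(list(col)) for col in zip(*embeddings)]
    prefix, residual = columns[:prefix_dim], columns[prefix_dim:]
    total = 0.0
    for p in prefix:
        for r in residual:
            corr = sum(a * b for a, b in zip(p, r)) / n
            total += corr ** 2
    return total / (len(prefix) * len(residual))


rng = random.Random(0)
# Four independent dimensions: the residual adds new information.
independent = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2000)]
# Residual dimensions copy the prefix: fully redundant.
redundant = [[a, b, a, b] for a, b, _, _ in independent]

print(cross_correlation_penalty(independent, prefix_dim=2))
print(cross_correlation_penalty(redundant, prefix_dim=2))
```

In this toy setup the penalty is close to zero when the residual is independent of the prefix and large when the residual merely repeats it, which is the kind of redundancy the abstract says SCR targets.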

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

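Scoring rationale: The paper's core contribution is subspace alignment and informational-capacity optimization in representation learning (the MIC framework). Because the method unifies two regularization strategies (SCR and SIR), it has a weak textual connection to 'Unify Models' (relevance 3.0), but it does not involve multimodal large language models (MLLM), world models, reinforcement learning, tokenizers, or specific visual encoder architectures, so the remaining keywords score 0. The author list contains no members of the specified expert group. The weighted total is about 4.5, far below the dynamic passing score of 27.8, indicating a poor match between the paper's topic and the given keyword set.

Keywords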

Isotropic Subspace Alignment, Multi-scale Representation Learning, Soft Collapse Regularization, Spectral Isotropy Regularization, Self-distillation Objective, Dimensional Redundancy, Spectral Collapse, Informational Capacity

264. Metric-Dependent Annotation Saturation for Learning from Label Distributions [FAIL]

Score: 4.5 / 27.8

Authors: Guneet Kohli

Published: 2026-05-28

TL;DR: The paper investigates how annotation budget requirements vary by evaluation metric for learning from label distributions, finding that entropy correlation requires more annotators than distributional match and soft labels capture ambiguity better than label smoothing.

Abstract

When annotators disagree on a label, the disagreement itself carries signal -- and the number of annotators needed to capture it depends on the evaluation metric. We fine-tune NLI models on label distributions subsampled from ChaosNLI, a dataset providing 100 independent annotator judgments per item, and identify metric-dependent saturation. In our 3-class NLI setting, entropy correlation -- whether the model identifies which items elicit disagreement -- requires N ~ 20-50 annotators to converge, while distributional match (KL divergence) saturates by N ~ 10 (87-95% of improvement across five model seeds). This finding rests on a prior observation: soft labels carry item-specific signal that label smoothing cannot replicate. Across five smoothing intensities, entropy correlation clusters at r ~ 0.45-0.49, while soft labels reach r = 0.643 (p < 0.001); per-item analysis traces this gap to smoothing's inability to distinguish ambiguous items from clear ones. The soft-label advantage replicates across two architectures (DeBERTa, RoBERTa), a non-NLI-pretrained baseline, and an exploratory cross-domain evaluation on content safety. These results suggest that annotation budgets should be informed by the target evaluation metric rather than set uniformly.
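The two quantities the abstract relies on, the entropy of a label distribution and the KL divergence between distributions, can be sketched on a single item as follows. This only illustrates how a soft label built from N subsampled annotators approaches the full 100-annotator distribution; it is not the paper's evaluation, which compares model predictions against such distributions. The vote split and the helper names are made up for the example.

```python
import math
import random


def soft_label(votes, num_classes=3):
    """Turn a list of class indices into an empirical label distribution."""
    counts = [0] * num_classes
    for vote in votes:
        counts[vote] += 1
    return [count / len(votes) for count in counts]


def entropy(dist):
    """Shannon entropy in nats; higher means more annotator disagreement."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def kl_divergence(p, q, eps=1e-9):
    """KL(p || q), with q clipped away from zero to keep the log finite."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl_to_full(votes, n, trials=300, seed=0):
    """Average KL from the full-pool distribution to an n-annotator subsample."""
    rng = random.Random(seed)
    full = soft_label(votes)
    total = 0.0
    for _ in range(trials):
        total += kl_divergence(full, soft_label(rng.sample(votes, n)))
    return total / trials


# Hypothetical item with 100 judgments split 50 / 30 / 20 across three classes.
votes = [0] * 50 + [1] * 30 + [2] * 20
print("entropy of full distribution:", round(entropy(soft_label(votes)), 4))
for n in (5, 10, 20, 50):
    print(n, round(mean_kl_to_full(votes, n), 4))
```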

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

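Scoring rationale: The paper focuses on annotation distributions and soft-label strategies in natural language inference (NLI), which is a poor match for the multimodal, world-model, and reinforcement-learning themes of the keyword set. Visual Encoder, World Models, MultiModal, and model-based RL therefore score 0. Large language models (MLLM) and tokenizers (Tokenizer) are involved but not central, so each scores 1; relevance to Unify Models is very low, also scoring 1.

Keywords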

Metric-Dependent Annotation Saturation, Label Distributions, Soft Labels, Label Smoothing, NLI Models, Entropy Correlation, Annotation Budget

265. From Blind Guess to Informed Judgment: Teaching LLMs to Evaluate Materials by Building Knowledge-Augmented Preference Signals [FAIL]

Score: 4.5 / 27.8

Authors: Yeyong Yu, Wenya Hu, Xing Wu, Quan Qian

Published: 2026-05-28

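TL;DR: This paper proposes a knowledge-augmented preference-signal framework that trains LLMs on pairs of expert-guided judgments and blind guesses, substantially improving the accuracy and consistency of materials evaluation without relying on external retrieval.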

Abstract

As candidate generation and high-throughput experimentation advance, the primary bottleneck in materials discovery is shifting from property prediction to making reliable evaluations among massive candidate sets. We propose a Knowledge-Augmented Preference Signals Framework, MaterEval, that automatically produces, for the same candidate, two evaluations: an informed judgment that follows expert rules and provides supporting evidence, and a rule-removed blind guess. By pairing the two evaluations as preference data, we guide general-purpose large language models (LLMs), originally lacking materials-specific criteria, from intuitive judgment toward reliable evaluation supported by explicit evidence. To balance throughput, cost, and reliability, we further introduce a fast-slow reasoning scheme that decouples large-scale rapid screening from in-depth review on a small subset. Using high-entropy alloy (HEA) assessment as a case study, we show that, without external retrieval and relying solely on internalized capabilities, small open-source LLMs achieve substantial gains in accuracy, conclusion consistency, and evidence discrimination, approaching the performance of rule-based closed-source LLMs. These results demonstrate that expert rules can be systematically transformed into learnable preference signals, enabling a low-cost and deployable evaluation module for autonomous materials discovery loops.
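The pairing step described in the abstract, an informed judgment and a rule-removed blind guess for the same candidate turned into one preference example, can be sketched as a small data-construction routine. The `prompt` / `chosen` / `rejected` layout is a common convention for preference datasets and is assumed here; the abstract does not state MaterEval's actual schema or training objective, and all field values are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    candidate: str                                 # e.g. an alloy composition string
    verdict: str                                   # free-text conclusion
    evidence: list = field(default_factory=list)   # supporting rule-based evidence


def render(evaluation):
    """Flatten an evaluation into the text a model would be trained on."""
    lines = [f"Verdict: {evaluation.verdict}"]
    lines += [f"Evidence: {item}" for item in evaluation.evidence]
    return "\n".join(lines)


def build_preference_pair(informed, blind):
    """Pair two evaluations of the same candidate as chosen / rejected."""
    if informed.candidate != blind.candidate:
        raise ValueError("both evaluations must concern the same candidate")
    return {
        "prompt": f"Assess the candidate alloy: {informed.candidate}",
        "chosen": render(informed),    # follows expert rules, cites evidence
        "rejected": render(blind),     # rule-removed blind guess
    }


# Placeholder content, not real materials data.
informed = Evaluation("ALLOY-001", "promising", ["rule A: placeholder evidence"])
blind = Evaluation("ALLOY-001", "probably fine")
pair = build_preference_pair(informed, blind)
print(pair["chosen"])
```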

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

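Scoring rationale: The paper focuses on LLM preference learning for materials evaluation and does not involve visual encoders, world models, or multimodal architectures. Although it uses LLMs and preference signals (loosely related to RL), it has no direct technical connection to keywords such as Unify Models or Tokenizer, so the scores are low.

Keywords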

LLMs, Materials Evaluation, Preference Signals, Knowledge-Augmented, High-Entropy Alloy, Autonomous Materials Discovery, Expert Rules, Reasoning Scheme

266. Offloading Score: Measuring AI Reliance Through Counterfactual Workflows [FAIL]

Score: 4.5 / 27.8

Authors: Vishakh Padmakumar, Lujain Ibrahim, Zora Zhiruo Wang, Jennifer Wang, Q. Vera Liao, Diyi Yang

Published: 2026-05-28

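TL;DR: This paper proposes an 'offloading score' based on counterfactual workflows to quantify reliance on AI, and finds that time pressure significantly increases users' reliance on AI tools.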

Abstract

AI tools are increasingly integrated into real-world workflows. However, existing measures of reliance on these tools focus on AI output adoption or on self-reported indicators, rather than how task effort is distributed between users and tools. Here, we introduce offloading score, a measure of reliance that quantifies the fraction of cognitive effort offloaded to an AI tool. Offloading Score is simulation-based -- we construct a counterfactual workflow by estimating how the user would have completed the task without the tool, and then computing the fraction of steps saved by using the tool. We validate offloading score through intrinsic evaluations of metric validity, and a controlled user study ($n=40$) with developers performing programming tasks using AI tools. We vary time pressure to test whether reliance measures capture the known increase in reliance under time pressure. We show that offloading score detects significantly higher reliance in time-constrained settings ($+43\%$, $p=0.018$), while usage-based and self-reported baseline measures of reliance do not distinguish the conditions. We complement this with descriptive insights showing that higher reliance manifests as greater delegation of subtasks to the tool and more direct reuse of AI outputs. Finally, we demonstrate an approach of using offloading score in combination with target outcomes of a task (e.g., code understanding) to identify when reliance may be (in)appropriate. Our framework offers two contributions: an instrument users can apply to measure and reflect on their own reliance, and a quantitative signal that agent designers can utilize to mitigate overreliance.
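The abstract defines the score as the fraction of steps saved relative to a counterfactual, tool-free workflow. A minimal sketch of that ratio is shown below. It assumes the counterfactual workflow has already been estimated and that "steps saved" means counterfactual steps minus the steps the user still performed; how the paper simulates the counterfactual and counts steps is not specified in the abstract.

```python
def offloading_score(counterfactual_steps, user_steps_with_tool):
    """Fraction of the tool-free workflow that the user did not have to do."""
    if counterfactual_steps <= 0:
        raise ValueError("counterfactual workflow must contain at least one step")
    saved = counterfactual_steps - user_steps_with_tool
    # Clamp to [0, 1]: the tool cannot save less than nothing or more than everything.
    return max(0.0, min(1.0, saved / counterfactual_steps))


# Hypothetical task: 10 steps without the tool, 4 still done by hand with it.
print(offloading_score(10, 4))
```

A score near 0 means the user did almost everything themselves; a score near 1 means nearly all of the estimated effort was delegated to the tool.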

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

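Scoring rationale: This paper belongs to human-computer interaction (HCI). Its core contribution is the 'Offloading Score' for measuring AI reliance, not a study of model architecture. It does not cover technical details such as Tokenizer, Visual Encoder, World Models, or model-based RL. Although it involves AI tools (possibly MLLM-based), it does not discuss multimodal representations or unified model architectures in depth, so its relevance to the given technical keywords is very low.

Keywords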

Offloading Score, AI Reliance, Counterfactual Workflows, Human-AI Collaboration, Programming Tasks, Cognitive Effort, Tool Adoption, Time Pressure

267. Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies [FAIL]

Score: 4.5 / 27.8

Authors: Benjamin Clavié, Sean Lee, Aamir Shakir, Makoto P. Kato

Published: 2026-05-28

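TL;DR: The paper proposes Latent Terms, a method that uses sparse autoencoders to extract a sparse vocabulary from dense retrievers, enabling BM25-compatible retrieval without additional supervision and improving performance.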

Abstract

We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse retrieval supervision whatsoever, and can be readily applied to any dense retriever. Latent Terms is able to match or outperform single-vector scoring methods from its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, but that other methods can nonetheless be leveraged.
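Once each document is represented as a bag of active latent features, classical BM25 can score it exactly as it would score a bag of words. The sketch below applies a standard BM25 formula to documents given as lists of integer feature IDs. The IDs stand in for sparse-autoencoder features and are invented for the example; how the paper turns activations into term counts is not stated in the abstract.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query with standard BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical latent-term IDs per document (one entry per activation).
docs = [[17, 17, 402, 9], [9, 88, 3], [402, 5, 5, 5]]
query = [17, 402]
print(bm25_scores(query, docs))
```

Because the scoring only needs term counts and collection statistics, the resulting representation can be served from an ordinary inverted index.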

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

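Scoring rationale: The paper is about information retrieval and proposes a method for extracting a sparse vocabulary from dense retrievers. The keyword set mainly covers multimodal large models, world models, and reinforcement learning. The paper does not involve visual encoders, world models, multimodal data, or reinforcement-learning algorithms, so most keywords have a relevance of 0. 'Tokenizer' has a weak connection through vocabulary extraction, and 'Unify Models' has a weak connection through the unification of retrieval scoring; the rest are unrelated. The weighted total is far below the dynamic passing score of 27.8, indicating a poor match between the paper's topic and the keyword domain.

Keywords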

Dense Retrieval, Sparse Autoencoders, Latent Vocabulary, BM25, Zipfian Statistics, Neural Retrievers, Retrieval Scoring

268. Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software [FAIL]

Score: 3.0 / 27.8

Authors: Nhat-Minh Nguyen

Published: 2026-05-28

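TL;DR: This case study shows that AI agents generating scientific software often lack an understanding of the underlying physical constraints, and that supervision focused on physical plausibility, rather than model scaling alone, is needed to make their output trustworthy.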

Abstract

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

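Scoring rationale: The paper focuses on applying AI to the development of scientific software (a physics module) and on the supervision provided by a human expert (a physicist), placing it in the AI for Science area. Although it uses large language models (Claude Code with Sonnet and Opus), it does not cover core technical topics such as multimodal architecture design, tokenizer strategies, visual encoders, world-model construction, or reinforcement-learning algorithms. Apart from a general connection to large models, it is largely unrelated to the given multimodal, world-model, and RL keywords.

Keywords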

AI agents, Physicist supervision, Scientific software, Physical constraints, Human-in-the-loop, Trustworthy output, Coding agent, Perturbation theory

269. What drives performance in molecular MPNNs? An operator-level factorial benchmark [FAIL]

Score: 3.0 / 27.8

Authors: Panyu Jiao, Shuizhou Chen, Yiheng Shen, Yuyang Wang, Runhai Ouyang, Wei Xie

Published: 2026-05-28

TL;DR: This paper investigates what drives the performance of molecular MPNNs through an operator-level factorial benchmark, finding that performance variation is associated primarily with message construction rather than update complexity.

Abstract

Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures makes it difficult to identify how specific message-passing operators affect performance. We present an operator-level factorial benchmark that decomposes 2D molecular MPNNs into the three families of message-seed initialization, node-edge fusion, and node update operators. The resulting 84 configurations are benchmarked on ten MoleculeNet datasets under a shared experimental setup and statistical analysis protocol. Across this controlled design, performance variation is associated primarily with message construction rather than update complexity. Message-seed initialization shows significant family-level effects for both regression and classification, node-edge fusion shows a significant family-level effect for regression with descriptive advantages for concatenation-based mixing, and the update family shows no statistically supported effect for either endpoint family. A representation probe into the Quinethazone molecule further demonstrates that concatenation-based mixing can better differentiate chemically distinct heteroatoms and withstand oversmoothing than Hadamard gating. Representative configurations selected separately for classification and regression recover competitive performance relative to established molecular graph neural network (GNN) baselines, ranking numerically best on eight of ten benchmark datasets. These empirical results are interpreted through concise mechanistic analyses of representative node-edge fusion and update operators. Our findings provide empirical design heuristics for molecular MPNNs by turning model design from a search over monolithic architectures into a targeted assessment of where and how chemical information enters the message-passing pipeline.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on molecular message-passing neural networks (MPNNs) and graph neural networks (GNNs) for property prediction, while the keywords pertain to multimodal large models, world models, and reinforcement learning. There is a significant domain mismatch. Only 'Unify Models' has a minor semantic overlap regarding the decomposition of model components (operators), but it does not align with the multimodal unification context implied by the background. All other keywords (Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) are completely unrelated to the molecular task described. No expert authors from the specified list were found in the author list.

Keywords

Message-passing neural networks, Molecular property prediction, Operator-level factorial benchmark, MoleculeNet datasets, Graph neural networks, Message-seed initialization, Node-edge fusion, Oversmoothing

270. Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection [FAIL]

Score: 3.0 / 27.8

Authors: Travis Lelle

Published: 2026-05-28

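TL;DR: This paper characterizes the token-level generalization of backdoor attacks in LoRA adapters and proposes two complementary detection routes, one based on behavioral statistics and one on weight statistics, to identify poisoned fine-tuned weights.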

Abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
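The weight-level statistic named in the abstract, the cross-module standard deviation of dimension-normalized Frobenius norms, can be computed without running the model. The sketch below assumes "dimension-normalized" means dividing the Frobenius norm by the square root of the number of entries, and uses the population standard deviation. Both choices are guesses, as are the toy matrices. The abstract does not say which values indicate poisoning, so the sketch only computes the statistic and applies no threshold.

```python
import math
import statistics


def normalized_frobenius(matrix):
    """Frobenius norm divided by sqrt(rows * cols) (assumed normalization)."""
    rows, cols = len(matrix), len(matrix[0])
    squared = sum(value * value for row in matrix for value in row)
    return math.sqrt(squared / (rows * cols))


def cross_module_std(modules):
    """Population standard deviation of the normalized norms across modules."""
    norms = [normalized_frobenius(matrix) for matrix in modules.values()]
    return statistics.pstdev(norms)


def constant_matrix(rows, cols, value):
    return [[value] * cols for _ in range(rows)]


# Toy adapter update matrices keyed by module name (shapes and values made up).
flat_adapter = {
    "q_proj": constant_matrix(2, 3, 0.01),
    "v_proj": constant_matrix(3, 2, 0.01),
    "down_proj": constant_matrix(2, 4, 0.01),
}
uneven_adapter = dict(flat_adapter, down_proj=constant_matrix(2, 4, 0.05))

print(cross_module_std(flat_adapter))
print(cross_module_std(uneven_adapter))
```

The normalization makes modules of different shapes comparable, so the spread reflects how unevenly the update magnitude is distributed across modules rather than how large each matrix is.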

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

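Scoring rationale: The paper studies backdoor attack mechanisms and detection schemes for LoRA adapters of large language models, which falls under model security. The given keywords cover architectures and training paradigms such as world models, multimodal fusion, visual encoders, and reinforcement learning, and have no direct connection to the paper's topic. Only 'Tokenizer' has slight lexical overlap through 'Token-Level' in the title, and 'MLLM' is slightly related through the mention of a Qwen model; the remaining keywords are unrelated.

Keywords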

LoRA Adapter Backdoors, Token-Level Generalization, Attack Characterization, Behavioral Detection, Weight-Level Statistic, Prompt-Injection, Fine-tuned LLMs

271. Cert-LAS: Toward Certified Model Ownership Verification for Text-to-Image Diffusion Models via Layer-Adaptive Smoothing [FAIL]

Score: 3.0 / 27.8

Authors: Leyi Qi, Yiming Li, Siyuan Liang, Zhengzhong Tu, Dacheng Tao

Published: 2026-05-28

TL;DR: Cert-LAS proposes a certified model ownership verification method for text-to-image diffusion models using layer-adaptive smoothing to resist watermark removal attacks and ensure reliable ownership detection.

Abstract

Large-scale text-to-image (T2I) diffusion models have enabled unprecedented creative applications, but their unauthorized use has raised serious intellectual property concerns, making model ownership verification (MOV) increasingly critical. We find that existing backdoor-based diffusion watermarking methods often (implicitly) assume a "faithful" verification process, namely, that the verifier can query a suspicious model and obtain the faithful watermark response to complete MOV. However, in practice, adversaries may intentionally or unintentionally damage potential watermark signals, significantly degrading verification reliability. To address this issue, we propose Cert-LAS, the first certified MOV method for T2I models based on layer-adaptive smoothing. In general, Cert-LAS embeds specified watermarks using diffusion classifiers and an LFS-guided layer-adaptive noise, and verifies ownership by examining whether the suspected model exhibits significantly stronger watermark responses compared to unwatermarked references through hypothesis testing. We further prove that, under certain conditions, our Cert-LAS can still achieve reliable verification even in the presence of malicious removal attacks. Extensive experiments validate the effectiveness of Cert-LAS and its resistance to adaptive attacks. Our code is available at https://github.com/Leyi-Qi/Cert-LAS.
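The verification step in the abstract is a hypothesis test: does the suspect model respond to the watermark significantly more strongly than unwatermarked references? The sketch below uses a one-sided two-sample test with a normal approximation as a stand-in. The paper's actual test statistic and certification argument are not given in the abstract, the response scores are invented, and the normal approximation is crude for samples this small.

```python
import math
from statistics import NormalDist, mean, variance


def one_sided_p_value(suspect_scores, reference_scores):
    """P-value for H1: mean(suspect) > mean(reference), normal approximation."""
    standard_error = math.sqrt(
        variance(suspect_scores) / len(suspect_scores)
        + variance(reference_scores) / len(reference_scores)
    )
    z = (mean(suspect_scores) - mean(reference_scores)) / standard_error
    return 1.0 - NormalDist().cdf(z)


# Made-up watermark response scores, for illustration only.
suspect = [0.82, 0.79, 0.85, 0.81, 0.80, 0.84]
reference = [0.51, 0.49, 0.55, 0.50, 0.52, 0.48]

p_value = one_sided_p_value(suspect, reference)
print("claim ownership" if p_value < 0.01 else "inconclusive", p_value)
```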

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

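Scoring rationale: The paper focuses on ownership verification for diffusion models, with layer-adaptive smoothing at its core. The keyword set (unified models, world models, MLLM, reinforcement learning, and so on) mainly targets multimodal large models and RL, a poor match for this paper's topic. MultiModal receives a low score only because text-to-image generation is multimodal in nature. The author list contains no specified experts.

Keywords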

Text-to-Image Diffusion Models, Model Ownership Verification, Layer-Adaptive Smoothing, Watermarking, Hypothesis Testing, Certified Security, Unauthorized Use

272. Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk [FAIL]

Score: 3.0 / 27.8

Authors: Tim Woydt, Paul-David Zuercher

Published: 2026-05-28

TL;DR: This paper proposes a Nested Causal Thompson Sampling method with PAC-Bayes risk bounds to certify safe policy optimization in hierarchical causal bandits across multiple decision timescales.

Abstract

Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historic data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.
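The idea of sampling a belief and then acting recursively, with the strategic choice fixing the context for the tactical one, can be illustrated with plain two-level Beta-Bernoulli Thompson sampling. This is a deliberately simplified stand-in and not the paper's NCTS: there is no structural causal model, no mechanism-factorised posterior, and no PAC-Bayes certificate here. The reward probabilities are invented.

```python
import random

# Hypothetical success probabilities for (strategic arm, tactical arm).
TRUE_REWARD = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.9}


def run(rounds=2000, seed=0):
    rng = random.Random(seed)
    # Beta(1, 1) priors stored as [successes + 1, failures + 1].
    outer = {s: [1, 1] for s in (0, 1)}
    inner = {key: [1, 1] for key in TRUE_REWARD}
    counts = {key: 0 for key in TRUE_REWARD}
    for _ in range(rounds):
        # Strategic level: draw one sample per arm and act on the best draw.
        s = max(outer, key=lambda arm: rng.betavariate(*outer[arm]))
        # Tactical level: the strategic choice fixes which beliefs are in play.
        t = max((0, 1), key=lambda arm: rng.betavariate(*inner[(s, arm)]))
        reward = 1 if rng.random() < TRUE_REWARD[(s, t)] else 0
        # Update both levels with the observed reward.
        for belief in (outer[s], inner[(s, t)]):
            belief[0] += reward
            belief[1] += 1 - reward
        counts[(s, t)] += 1
    return counts


counts = run()
print(counts)
```

In this toy problem the sampler should concentrate its pulls on the (strategic, tactical) pair with the highest success probability as evidence accumulates.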

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on Nested Causal Bandits and PAC-Bayes risk bounds for policy optimization, which is unrelated to multimodal learning, tokenization, visual encoders, or large language models. While it involves modeling for decision making (weak link to model-based RL), it does not align with the generative world models, unification, or multimodal aspects of the other keywords.

Keywords

Nested Causal Bandits, PAC-Bayes Risk, Policy Optimisation, Structural Causal Models, Safe Deployment, Off-policy Certification, Thompson Sampling

273. Realistic honeypot evaluations for scheming propensity [FAIL]

Score: 3.0 / 27.8

Authors: Victoria Krakovna, David Lindner, Lewis Ho, Sebastian Farquhar, Rohin Shah

Published: 2026-05-28

TL;DR: This paper introduces scheming honeypot evaluations to test instrumental goal pursuit in models, finding that Gemini models only scheme when explicitly prompted for agency or given hidden goals.

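Abstract Translation

We introduce scheming honeypot evaluations, a framework for testing whether models pursue instrumental goals when given the opportunity. Our scheming honeypot evaluations are presented as coding tasks in Google's alignment research codebases. In a real internal deployment environment, Gemini models did not exhibit unprompted scheming. When prompts explicitly encourage agency (that is, situational awareness or goal-directedness) and/or give the model a hidden goal, models sometimes scheme or attempt sabotage. Models show low evaluation awareness, which validates the realism of the setting; where it occurs, it is usually attributable to the agency prompts rather than the environment itself.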

Abstract

We introduce scheming honeypot evaluations, a framework for testing whether models will pursue instrumental goals if given the opportunity. Our scheming honeypot evaluations take the form of coding tasks in Google's alignment research codebases. In a real internal deployment setting, Gemini models do not demonstrate unprompted scheming. If prompts explicitly encourage agency (situational awareness or goal-directedness) and/or give the model a hidden goal, models sometimes scheme or attempt sabotage. Validating the realism of our setting, models show low rates of evaluation awareness, usually due to agency prompts rather than the environments.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on AI safety and scheming evaluation, unrelated to model architecture (Tokenizer, Visual Encoder), unification, world models, or model-based RL. MLLM and MultiModal have slight relevance due to evaluating Gemini models, but technical aspects are not discussed. No listed expert authors are found.

Keywords

Scheming honeypot evaluations, Instrumental goals, Alignment research, Gemini models, Agency prompts, Hidden goal, Evaluation awareness, Model sabotage

274. Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models [FAIL]

Score: 3.0 / 27.8

Authors: Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng

Published: 2026-05-28

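TL;DR: The paper proposes CLAD, a cluster-level attention-guided parallel decoding method that groups high-confidence candidates to substantially speed up masked diffusion language models while preserving task accuracy.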

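Abstract Translation

Masked diffusion language models (MDLMs) achieve parallel decoding by predicting all masked positions at every denoising step, yet existing training-free samplers usually decide which positions to commit at token granularity. We revisit this granularity and observe that reliable predictions often appear as contiguous high-confidence spans, which suggests that the unit of parallel commitment can be larger than a single token. We first cluster adjacent high-confidence candidates into confidence-induced clusters (CICs), which serve as span-level update units. We then use the self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection so that mutually compatible CICs are committed in parallel. The result is CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on the LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x to 8.47x speedups over Vanilla decoding while keeping task accuracy broadly comparable in most settings.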

Abstract

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x--8.47x speedups over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
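The grouping step described in the abstract (adjacent high-confidence candidates merged into span-level units) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the fixed `threshold`, and the toy confidence values are assumptions, and the attention-based conflict check between clusters is omitted.

```python
def confidence_induced_clusters(conf, masked, threshold=0.9):
    """Group adjacent masked positions whose confidence meets the threshold.

    conf:      per-position confidence of the top prediction (hypothetical values).
    masked:    True where the position is still masked and may be committed.
    threshold: assumed fixed cut-off; the paper's actual criterion may differ.
    Returns a list of (start, end) index pairs, end inclusive.
    """
    clusters = []
    start = None
    for i, (c, m) in enumerate(zip(conf, masked)):
        eligible = m and c >= threshold
        if eligible and start is None:
            start = i                        # open a new span
        elif not eligible and start is not None:
            clusters.append((start, i - 1))  # close the current span
            start = None
    if start is not None:
        clusters.append((start, len(conf) - 1))
    return clusters


conf = [0.95, 0.97, 0.40, 0.92, 0.99, 0.91, 0.30]
masked = [True, True, True, True, True, False, True]
clusters = confidence_induced_clusters(conf, masked)
print(clusters)
```

Position 5 is confident but already committed (not masked), so it closes the second span rather than extending it.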

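Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on decoding-efficiency optimization for masked diffusion language models (MDLMs) and proposes the CLAD algorithm. Among the scoring keywords, only 'Tokenizer' is weakly relevant (2 points) because the paper discusses 'token-level granularity'. The remaining keywords ('Unify Models', 'Visual Encoder', 'World Models', 'MLLM', 'MultiModal', 'model-based RL') concern multimodal learning, world models, or reinforcement learning and are entirely unrelated to the paper's text-only decoding topic (0 points). The weighted total is 3.0, far below the dynamic pass threshold of 27.8. The author list was checked and does not include Yang Shi or the other designated experts, so no bonus applies.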
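The arithmetic behind the table above (each keyword's relevance multiplied by its weight, summed, then compared with the pass threshold) can be reproduced with a short sketch. The function name and the `>=` comparison are assumptions; the report states only the weights, relevance values, total, and threshold, and the expert-author bonus it mentions is not modeled here.

```python
PASS_THRESHOLD = 27.8  # dynamic pass threshold quoted in the report
WEIGHT = 1.5           # every keyword in the table carries the same weight

# Relevance values (out of 10) copied from the table for paper 274.
relevance = {
    "Unify Models": 0.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}


def weighted_score(relevance_by_keyword, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance_by_keyword.values())


total = weighted_score(relevance)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```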

Keywords

Masked Diffusion Language Models, Parallel Decoding, Cluster-Level Attention, Confidence-Induced Clusters, Training-Free, Speedup, Self-Attention

275. Learning to Perturb Hidden Representations for Generalizable Deep Learning [FAIL]

Score: 3.0 / 27.8

Authors: Hua Li

Published: 2026-05-28

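TL;DR: This paper proposes Learning to Perturb Activations (LPA), a unified hidden-activation perturbation framework that adaptively perturbs hidden-layer representations to improve generalization in deep learning, outperforming existing methods on classification tasks.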

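Abstract Translation

Deep neural networks process data through a cascade of representations: input features, hidden activations, logits, and loss. Although perturbations at the input, logit, and label levels have been studied systematically, the intermediate hidden activations, which make up the bulk of the network's computation, have not yet received a unified perturbation analysis. This paper establishes a unified framework for hidden activation perturbation, showing that Dropout, Manifold Mixup, adversarial feature perturbation, and related methods all apply specific forms of activation perturbation, but with class-agnostic or random strategies. The paper conjectures that expansive perturbation (increasing the activation norm) acts as positive augmentation while contractive perturbation (decreasing the activation norm) acts as negative augmentation, and that the perturbed layer determines whether the effect resembles input-level augmentation (shallow layers) or logit-level manipulation (deep layers). It proposes Learning to Perturb Activations (LPA), which adaptively perturbs activations at a selected hidden layer and learns class-level perturbations via PGD (projected gradient descent). It also provides theoretical analysis connecting activation perturbation to flat minima and describing how perturbations are amplified across layers. Experiments on balanced classification, long-tail classification, and domain generalization show that LPA consistently outperforms existing methods and offers benefits complementary to logit perturbation methods such as LPL.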

Abstract

Deep neural networks process data through a cascade of representations: input features, hidden activations, logits, and loss. While perturbations at the input, logit, and label levels have been systematically studied, the intermediate hidden activations, which constitute the bulk of the network's computation, have received no unified perturbation analysis. In this paper, we establish a unified framework for hidden activation perturbation, revealing that Dropout, Manifold Mixup, adversarial feature perturbation, and related methods all impose specific forms of activation perturbation but with class-agnostic or random strategies. We conjecture that expansive perturbation (increasing activation norm) acts as positive augmentation, while contractive perturbation (decreasing activation norm) acts as negative augmentation, and that the perturbation layer determines whether the effect resembles input-level augmentation (shallow layers) or logit-level manipulation (deep layers). We propose Learning to Perturb Activations (LPA), which adaptively perturbs activations at a selected hidden layer with class-level perturbations learned via PGD. We further provide theoretical analysis connecting activation perturbation to flat minima and perturbation amplification through layers. Experiments on balanced classification, long-tail classification, and domain generalization demonstrate that LPA consistently outperforms existing methods and provides complementary benefits to logit perturbation methods such as LPL.

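Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper mainly concerns a hidden-activation perturbation framework for general deep learning and improved generalization, which belongs to optimization and regularization. Although it establishes a 'unified framework', unifying perturbation methods is not the same as unifying model architectures (Unify Models), so relevance is low. The paper does not involve tokenizers, visual encoders, world models, multimodal large language models, multimodal data, or reinforcement learning, so the remaining keywords score 0. The author list does not include any of the designated experts.

Keywords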

Hidden Representations, Perturbation, Generalizable Deep Learning, Unified Framework, LPA, PGD, Domain Generalization, Flat Minima

276. K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance [FAIL]

Score: 3.0 / 27.8

Authors: Eunbyeol Cho, Yunseung Lee, Mirae Kim, Jeewon Yang, Youngjun Kwak, Edward Choi

Published: 2026-05-28

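TL;DR: This paper introduces the K-FinHallu benchmark for detecting hallucinations in multi-turn RAG for Korean finance, and finds that even the strongest models fall short on fine-grained financial diagnostics and refusal behavior.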

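Abstract Translation

Large language models (LLMs) have advanced financial automation through retrieval-augmented generation (RAG), but hallucinations remain a key barrier to deployment in high-stakes environments. Existing benchmarks mostly focus on single-turn, English-centric tasks, leaving the multi-turn interaction dynamics and the linguistic and regulatory nuances of the Korean financial domain underexplored. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations according to a proposed hierarchical taxonomy that is based on context answerability and explicitly includes justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. Although fine-tuning an 8B model on our training split yields performance comparable to frontier LLMs, justified abstention remains the weakest axis across all evaluated models.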

Abstract

Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.

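Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on hallucination detection for multi-turn RAG systems in finance. It is based on text-only large language models and does not involve multimodal data, visual encoders, world models, or reinforcement learning. It uses large language models (a weak link to MLLM) and evaluates multiple models (a weak link to Unify Models), but overall the content is a poor match for the given multimodal and reinforcement-learning keywords, so relevance is very low.

Keywords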

Hallucination Detection, Multi-Turn RAG, Korean Finance, Large Language Models, Benchmark, Retrieval-Augmented Generation, Justified Abstention, Financial Diagnostics

277. Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs [FAIL]

Score: 3.0 / 27.8

Authors: Qian Chang, Ciprian Doru Giurcaneanu, Runsong Jia, Xia Li, Guoping Hu, Xiufeng Cheng, Jinqing Yang, Mengjia Wu, Yi Zhang

Published: 2026-05-28

TL;DR: This paper proposes Dual-Scale Retentive Dynamics (DSRD), a unified framework for dynamic graph representation learning that achieves state-of-the-art performance on link prediction and node classification by jointly modeling temporal and structural adaptation.

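Abstract Translation

Dynamic graph representation learning requires capturing complex dependencies that evolve across time and structure. Existing methods typically adopt fixed temporal decay schemes or predetermined structural propagation depths, which limits their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent architecture, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention according to the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, together with stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks show that DSRD consistently achieves state-of-the-art performance on link prediction and node classification, with strong generalization in both transductive and inductive settings.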

Abstract

Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing approaches typically adopt fixed temporal decay schemes or predetermined structural propagation depths, limiting their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention based on the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, as well as stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks demonstrate that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification tasks, with strong generalization across transductive and inductive settings.
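The equivalence the abstract mentions, between event-wise parallel aggregation and a recurrent state update, is easiest to see for a plain exponential decay kernel. The sketch below assumes a fixed decay rate `lam` and scalar event features; DSRD's actual kernels are learnable and adaptive, so this illustrates the general idea rather than the paper's formulation.

```python
import math


def parallel_state(events, lam, t_query):
    """Aggregate all events at once: sum_i x_i * exp(-lam * (t_query - t_i))."""
    return sum(x * math.exp(-lam * (t_query - t)) for t, x in events)


def recurrent_state(events, lam, t_query):
    """Same quantity computed with one decayed state update per event.

    Assumes events are sorted by time.
    """
    state, t_prev = 0.0, None
    for t, x in events:
        if t_prev is not None:
            state *= math.exp(-lam * (t - t_prev))  # decay over the gap
        state += x                                   # absorb the new event
        t_prev = t
    if t_prev is not None:
        state *= math.exp(-lam * (t_query - t_prev))  # decay up to query time
    return state


# Hypothetical (time, feature) events.
events = [(0.0, 1.0), (1.5, 0.5), (4.0, 2.0)]
lam = 0.3
p = parallel_state(events, lam, t_query=5.0)
r = recurrent_state(events, lam, t_query=5.0)
print(p, r)
```

The two agree because exponential decay factorizes over consecutive time gaps, which is what lets the recurrent form touch each event only once.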

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on dynamic graph representation learning using Dual-Scale Retentive Dynamics (DSRD), which is conceptually distinct from the Multimodal/LLM/RL domain implied by the provided keywords. 'Unify Models' receives a low score (2.0) due to lexical overlap ('unified framework') but lacks conceptual alignment with the background context regarding model unification paradigms. All other keywords (Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) are completely irrelevant to the graph-based content. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the authorship.

Keywords

Dynamic Graphs, Representation Learning, Dual-Scale Retentive Dynamics, Temporal Adaptation, Structural Adaptation, Link Prediction, Node Classification, Unified Framework

278. The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction [FAIL]

Score: 3.0 / 27.8

Authors: Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan

Published: 2026-05-28

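TL;DR: This paper examines the usefulness of the Markov boundary for tabular prediction, finding that although the oracle boundary improves performance, causal discovery pipelines often fail to recover it effectively within the compute budget.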

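Abstract Translation

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that makes all other features redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is very attractive for tabular prediction because it identifies exactly the columns a model needs. Yet modern regressors are still trained on the full feature set. We investigate whether the Markov boundary is genuinely useful for prediction on SCM3K, a synthetic structural causal model (SCM) benchmark of 3,450 tasks with feature counts from 40 to 1,000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary usually improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. However, the natural pipeline of recovering the boundary with causal discovery and training on the recovered feature mask does not deliver the expected benefit. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they can run they rarely beat the full feature set. We attribute this to three causes. First, causal discovery optimizes structural recovery rather than predictive performance. Second, false negatives and false positives carry sharply asymmetric predictive costs. Third, the exact Markov boundary is only one of many feature sets that outperform using all features. We then discuss what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.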

Abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
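When the causal graph is known (the "oracle boundary" case in the abstract), the Markov boundary of a target in a DAG is its parents, its children, and the other parents of its children. The sketch below reads that set off a toy graph; the graph, node names, and function name are hypothetical and are not taken from SCM3K.

```python
def markov_boundary(parents, target):
    """Markov boundary of `target` in a DAG given as node -> set of parents.

    Under the usual faithfulness assumption this is the union of the
    target's parents, its children, and its children's other parents.
    """
    pa = set(parents.get(target, ()))
    children = {node for node, ps in parents.items() if target in ps}
    spouses = set()
    for child in children:
        spouses |= set(parents[child])
    spouses.discard(target)
    return pa | children | spouses


# Hypothetical toy DAG: A -> Y -> C <- B, C -> D, and E is isolated.
dag = {
    "A": set(),
    "B": set(),
    "Y": {"A"},
    "C": {"Y", "B"},
    "D": {"C"},
    "E": set(),
}
mb = markov_boundary(dag, "Y")
print(sorted(mb))
```

Here `D` (a grandchild) and `E` (unconnected) fall outside the boundary, which is the sense in which the remaining columns become redundant once the boundary is observed.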

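Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on causal inference and Markov-boundary feature selection for tabular data, whereas the provided keyword set (multimodal large models, Tokenizer, visual encoders, world models, reinforcement learning) belongs to multimodal learning and reinforcement learning. The paper's content has almost no overlap with these keywords. Only 'Unify Models' has a slight conceptual link (reconciling theory with practice); all other keywords are entirely unrelated, so the relevance scores are very low.

Keywords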

Markov Boundary, Tabular Prediction, Feature Selection, Causal Discovery, Structural Causal Model, Regressors, Synthetic Benchmark, Conditional Independence

279. GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases [FAIL]

Score: 3.0 / 27.8

Authors: Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, Jie Liu

Published: 2026-05-28

TL;DR: GRASP proposes a three-stage retrieval framework for semi-structured knowledge bases that unifies plan-based graph retrieval with dense retrieval and reranking, achieving state-of-the-art performance on STaRK benchmarks.

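Abstract Translation

Semi-structured knowledge bases (SKBs) embed textual documents in a typed graph of entities and relations, and underpin applications such as product search, academic paper search, and precision-medicine queries. Existing hybrid retrieval systems on SKBs either use the graph only for query expansion, mix the textual and structural branches under a global weighting, or rely on fine-tuned graph-traversal generators. We propose GRASP, a three-stage SKB retrieval framework that unifies plan-based graph retrieval, plan-conditioned fusion with a dense retriever, and a fine-tuned reranker over the fused candidates. GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, raising average Hit@1 from 62.0 to 73.9. Ablation studies and sensitivity analyses further confirm the effectiveness and robustness of GRASP.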

Abstract

Semi-structured knowledge bases (SKBs) embed textual documents in a typed graph of entities and relations, and underpin applications such as product search, academic paper search, and precision-medicine inquiries. Existing hybrid retrieval systems on SKBs either use the graph only for query expansion, mix textual and structural branches under a global weighting, or rely on fine-tuned graph-traversal generators. We present GRASP, a three-stage SKB retrieval framework unifying plan-based graph retrieval, plan-conditioned fusion with a dense retriever, and a fine-tuned reranker over the fused candidates. GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, lifting average Hit@1 from 62.0 to 73.9. Ablation and sensitivity studies further confirm the effectiveness and robustness of GRASP.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on semi-structured knowledge base retrieval using graph structures and reranking mechanisms, which is unrelated to multimodal learning, large language models, world models, reinforcement learning, or visual encoders. The keyword 'Unify Models' receives a minimal score (2.0) due to the lexical overlap regarding 'unifying retrieval strategies', but it does not align with the model architecture unification implied by the keyword set. All other keywords are completely irrelevant to the paper's content.

Keywords

Semi-structured knowledge bases, Plan-Guided Graph Retrieval, Adaptive Fusion, Reranking, Dense retriever, STaRK benchmarks, Entity and relations

280. SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge? [FAIL]

Score: 3.0 / 27.8

Authors: Jiamin Chen, Yidi Wu, Qiexiang Wang, Qianben Chen, Yuchen Li, Yansen Zhang, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma

Published: 2026-05-28

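TL;DR: This paper proposes the SEAL protocol, which uses an LLM as a meta-judge and a seeded elimination mechanism to recover ranking accuracy on saturated benchmarks with substantially fewer judge calls.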

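Abstract Translation

Widely used language-model benchmarks are increasingly saturated: frontier systems often obtain nearly identical scores that standard metrics cannot separate. Rather than building harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation of the same candidate outputs. We therefore propose Seeded Elimination with Adaptive LLM-as-a-Meta-Judge (SEAL), a self-improving evaluation protocol designed to extract latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single-elimination tournament and evaluates each match using task-level principles together with self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the trade-off between ranking accuracy and latency relative to competing protocols, reaching 0.83 to 1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.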

Abstract

Widely used language-model benchmarks are increasingly saturated, with frontier systems often receiving near-tied scores that standard metrics cannot resolve. Rather than constructing harder alternatives, we ask whether existing tasks can be made informative again through improved evaluation over the same candidate outputs. Therefore, we present Seeded Elimination with Adaptive LLM-as-a-Meta-Judge (SEAL), a self-improving evaluation protocol for extracting latent ranking signal from saturated benchmarks. SEAL seeds candidate outputs into a single-elimination tournament and evaluates each match with task-level principles plus self-improving checklist criteria. We evaluate SEAL on multiple saturated benchmarks covering code generation, mathematical reasoning, knowledge-intensive question answering, and tool-use agent task completion. Across these settings, SEAL improves the trade-off between ranking accuracy and latency over competing protocols, attaining 0.83 to 1.00 Spearman agreement with full pairwise judging and 4/4 top-1 agreement, while requiring only 11.89 calls per task compared with 28.00 for full pairwise evaluation.
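The call counts in the abstract can be put in context with a small sketch of a seeded single-elimination bracket. The figure of 28.00 calls for full pairwise judging would match 8 candidates per task, since C(8, 2) = 28, but that candidate count is an inference and is not stated in the abstract. A bare bracket over 8 candidates needs 7 matches, so the reported 11.89 average presumably includes additional judge calls that the abstract does not detail. The seeding scheme, judge, and quality values below are all hypothetical.

```python
from math import comb


def single_elimination(seeded, judge):
    """Run a seeded single-elimination bracket.

    seeded: candidate ids ordered from strongest to weakest prior seed.
    judge:  callable (a, b) -> winning id; each invocation is one judge call.
    Returns (champion, number_of_judge_calls).
    """
    alive = list(seeded)
    calls = 0
    while len(alive) > 1:
        next_round = []
        if len(alive) % 2 == 1:
            next_round.append(alive.pop(0))  # odd field: top seed gets a bye
        lo, hi = 0, len(alive) - 1
        while lo < hi:                       # pair strongest with weakest seed
            next_round.append(judge(alive[lo], alive[hi]))
            calls += 1
            lo, hi = lo + 1, hi - 1
        alive = next_round
    return alive[0], calls


# Hypothetical hidden quality, used only to drive the toy judge.
quality = {f"model_{i}": q for i, q in
           enumerate([0.91, 0.90, 0.93, 0.89, 0.92, 0.88, 0.90, 0.87])}


def toy_judge(a, b):
    return a if quality[a] >= quality[b] else b


champion, calls = single_elimination(list(quality), toy_judge)
full_pairwise_calls = comb(len(quality), 2)
print(champion, calls, full_pairwise_calls)
```

A bracket only guarantees the top-1 result under a consistent judge; recovering a full ranking from it requires extra comparisons, which is one plausible reason a practical protocol would spend more than n - 1 calls.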

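Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper's core is an LLM evaluation protocol (LLM-as-a-Meta-Judge) for addressing benchmark saturation, which has very little connection to architecture-oriented keywords such as multimodality, world models, and reinforcement learning. Unify Models and MLLM receive weak relevance scores only because the paper uses an LLM as the judge; the remaining keywords are entirely unrelated. The author list does not include any of the designated experts.

Keywords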

LLM-as-a-Meta-Judge, Saturated Benchmarks, Seeded Elimination, Ranking Accuracy, Evaluation Protocol, Self-improving Checklist, Tool-use Agent, Language Model

281. HEART-Bench: Do LLM Agents Exhibit Human-like Psychology? [FAIL]

Score: 3.0 / 27.8

Authors: Weihan Peng, Chenxu Zhang, Qianao Wang, Yuling Shi, Heng Lian, Qihong Mao, Jiahao Pang, Chunliang Feng, Bowen Li, Xiaodong Gu

Published: 2026-05-28

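TL;DR: This paper introduces the HEART-Bench benchmark, which combines personality traits with autobiographical memories to evaluate whether LLM agents exhibit human-like psychology and consistent decision-making behavior.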

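Abstract Translation

Although large language model (LLM) agents have shown remarkable task-oriented abilities such as planning, reasoning, and acting, few studies have treated them as complete human personalities in which emotional dimensions are equally important. In this paper, we introduce a novel benchmark that systematically evaluates whether LLM agents can simulate coherent, human-like psychology. Specifically, the benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, and each character profile is deeply integrated with 1,000 structured autobiographical episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we design a curated set of 64 decision-making scenarios following the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By placing agents in varying scenarios, the benchmark evaluates whether they can integrate their innate personality traits and autobiographical memories to make behavioral decisions consistent with their specific psychological profiles. After systematic human validation and filtering, we obtain a benchmark of 673 multiple-choice questions (MCQs). We believe this benchmark provides a rigorous and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioral decision-making in LLM-based agents.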

Abstract

While LLM agents have demonstrated remarkable task-oriented abilities such as planning, reasoning, and action, few works have treated them as complete human personalities where emotional dimensions hold equal importance. In this paper, we introduce a novel benchmark to systematically assess whether LLM agents can simulate coherent, human-like psychology. Specifically, our benchmark constructs 11 diverse human characters grounded in orthogonal Big Five personality traits, with each profile deeply integrated with 1,000 structured autobiographical-style episodic memories distributed across theory-grounded developmental life stages. To rigorously evaluate the psychological manifestations of LLMs, we designed a curated suite of 64 decision-making scenarios, guided by the DIAMONDS taxonomy, a psychological framework that characterizes situations along eight dimensions: Duty, Intellect, Adversity, Mating, pOsitivity, Negativity, Deception, and Sociality. By subjecting agents to varying scenarios, the benchmark evaluates whether they can consolidate their innate personality traits and autobiographical memories to make behavioral decisions that are consistent with their specific psychological profiles. After systematic human validation and filtering, we obtained a benchmark consisting of 673 multiple-choice questions (MCQs). We believe this benchmark provides a principled and scalable testbed for studying human-like emotions, personality consistency, and value-consistent behavioural decision-making in LLM-based agents.

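Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper's core contribution is the HEART-Bench benchmark for evaluating human-like psychology in LLM agents (personality traits, autobiographical memories, decision consistency), which belongs to cognitive evaluation and benchmarking. It has no direct technical connection to the provided technical keywords (such as visual encoders, tokenizers, and model-based RL). There is only a weak conceptual link to the language-model foundation (MLLM) and to internal state modeling (World Models, through internal representations of memory and personality), so the scores are very low. The author list does not include the designated experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang. The weighted total is approximately 3.0, far below the dynamic pass threshold of 27.8.

Keywords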

LLM Agents, Human-like Psychology, Personality Traits, Autobiographical Memories, Decision-making Scenarios, HEART-Bench, Big Five, DIAMONDS Taxonomy

282. PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing [FAIL]

Score: 3.0 / 27.8

Authors: Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz

Published: 2026-05-28

TL;DR: The paper introduces PRAIB, a benchmark to evaluate LLM-generated peer reviews against human feedback, finding that LLMs exhibit biases and diverge from human review behaviors despite generating longer text.

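Abstract Translation

The continued growth in paper submissions has prompted exploration of large language models (LLMs) to support and augment the peer review process, particularly to improve its speed and scalability. However, it remains unclear whether LLMs review scientific manuscripts in the same way as human reviewers, or whether they merely generate text that looks like a review. To address this question, we propose the Peer Review AI Benchmark (PRAIB), a novel framework with thoroughly defined metrics for measuring review specificity, style, and engagement behavior. To complement the PRAIB framework, we conduct a large-scale empirical study using a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. These machine-generated reviews span 2021 to 2025 and are compared against the original human feedback under different prompting strategies to identify systematic behavioral differences. Our analysis shows that the generated reviews differ significantly from the feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and markedly different from human norms. In addition, when evaluated with PRAIB, LLMs tend to generate longer and more complex reviews yet often miss the fine-grained weaknesses noticed by human reviewers. By characterizing where and how LLM reviewing behavior departs from human norms, PRAIB offers the academic community a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.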

Abstract

The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and engagement behavior. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021 to 2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLMs' reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on evaluating LLM behavior in peer review tasks using a new benchmark (PRAIB). It does not discuss model unification architectures, tokenization methods, visual encoders, world models, multimodal architectures, or model-based reinforcement learning. Consequently, relevance to the provided technical keywords is minimal (mostly 0-1), resulting in a weighted score (approx. 3.0) far below the dynamic pass threshold of 27.8, indicating a mismatch with the specified research topics. No expert authors from the specified list were found.

Keywords

Peer Review AI Benchmark, LLM-Assisted Reviewing, Review Specificity, Review Style, Behavioral Divergence, Human Feedback Comparison, Prompting Strategies

283. DySem: Uncovering Dynamic Semantic Components via Multilingual Consensus for Calculating Semantic Textual Similarity [FAIL]

Score: 3.0 / 27.8

Authors: Kaijie Zheng, Weiqin Wang, Yile Wang, Hui Huang

Published: 2026-05-28

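TL;DR: DySem proposes a training-free framework that extracts dynamic semantic components from large language models via multilingual consensus, achieving better semantic textual similarity computation with fewer dimensions.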

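Abstract Translation

Computing semantic textual similarity is a foundational task in natural language processing (NLP). Current methods based on large language models (LLMs) typically rely on extracting fixed-dimension last-layer hidden states to compute similarity for every text pair. We argue that this paradigm has two limitations: (i) the last hidden layer encodes more general knowledge rather than semantic knowledge alone, making it suboptimal for semantic similarity computation; (ii) the hidden dimensions of LLMs are usually very large, which introduces redundancy and noise when representing semantics. This paper proposes DySem, a novel training-free framework that uses multilingual consensus to explore internal components of LLMs that are more semantically relevant. DySem abandons static representation spaces in favor of dynamic, sample-specific semantic dimensions: it constructs a text-dependent joint semantic set and computes similarity over this shared subset of dimensions. Extensive experiments on a variety of LLMs show that our method consistently outperforms recent baselines while using fewer dimensions for similarity computation. The code is released at https://github.com/szu-tera/DySem.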

Abstract

Calculating semantic textual similarity is a foundational task in natural language processing. Current methods based on large language models (LLMs) typically rely on extracting last-layer hidden states with fixed dimensions to compute similarity for every text pair. We argue that this paradigm suffers from two limitations: (i) the last hidden layer encodes more general knowledge rather than just semantic knowledge, making it suboptimal for semantic similarity computation; (ii) the hidden-layer dimensions of LLMs are generally very large, which introduces redundancy and noise when representing semantics. In this work, we propose DySem, a novel training-free framework that investigates more semantically relevant internal components of LLMs via multilingual consensus. DySem shifts away from static representation spaces in favor of dynamic, sample-specific semantic dimensions by constructing a text-dependent joint semantic set and computing similarity over this shared dimensional subset. Extensive experiments across various LLMs show that our method consistently outperforms recent baselines while using fewer dimensions for similarity calculation. The code is released at https://github.com/szu-tera/DySem.
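The idea of computing similarity over a text-dependent joint set of dimensions, instead of the full hidden vector, can be sketched as below. This is only an illustration under strong assumptions: each text's dimensions are picked by absolute magnitude (a stand-in for the paper's multilingual-consensus selection, which is not modeled), the joint set is taken as the union of both texts' selections, and the vectors are toy values rather than LLM hidden states.

```python
import math


def top_k_dims(vec, k):
    """Indices of the k largest-magnitude entries (assumed selection rule)."""
    order = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)
    return set(order[:k])


def dynamic_similarity(u, v, k):
    """Cosine similarity restricted to the union of each text's top-k dims."""
    dims = sorted(top_k_dims(u, k) | top_k_dims(v, k))
    a = [u[i] for i in dims]
    b = [v[i] for i in dims]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    sim = dot / norm if norm > 0 else 0.0
    return sim, dims


# Hypothetical 5-dimensional representations of two texts.
u = [0.9, 0.0, 0.1, -0.8, 0.05]
v = [0.8, 0.1, 0.0, -0.9, 0.02]
sim, dims = dynamic_similarity(u, v, k=2)
print(dims, round(sim, 4))
```

Because the joint set depends on the pair, different text pairs are compared in different subspaces, each much smaller than the full hidden dimension.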

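Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on the semantic textual similarity (STS) task in natural language processing, and its core contribution is extracting dynamic semantic components using multilingual consensus in large language models (LLMs). The provided keyword set mainly covers multimodal learning, world models, and reinforcement learning. The paper is therefore entirely unrelated to visual encoders, MLLM, multimodality, world models, and model-based reinforcement learning (0 points). It has only a weak link to 'Unify Models' and 'Tokenizer' (1 point each), because it touches on a unified representation space and builds on LLMs (which implies a tokenizer), but these are not its core contributions, so overall relevance is very low.

Keywords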

Semantic Textual Similarity, Large Language Models, Multilingual Consensus, Dynamic Semantic Components, Training-free Framework, Representation Learning, Text-dependent Joint Semantic Set

284. Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles [FAIL]

Score: 3.0 / 27.8

Authors: Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal, Jeongah Lee, Qiuyue Joy Zhong, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

Published: 2026-05-28

TL;DR: This study evaluates how varying LLM support roles (Inform, Coach, Relate, Listen) impact safety and helpfulness in caregiving conversations, finding a trade-off between directive assistance and interactional risk.

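Abstract Translation

Language models are increasingly deployed for conversational support in informal caregiving contexts, where interactions often go beyond information seeking: caregivers look for emotional reassurance, guidance, and help while navigating uncertain and relationally complex care decisions. However, most safety evaluations examine model behavior under generic prompts, leaving a key question unexamined: does a model's safety profile change with its support role? We study this by defining four expert-reviewed support roles grounded in social support theory (Inform, Coach, Relate, and Listen) and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and the composition of interactional risks. In addition, a human evaluation study reveals a tension between perceived quality and safety: more directive, information-oriented roles are rated as more helpful and trustworthy despite showing elevated interactional risk profiles. We release about 90,000 support-role-conditioned model responses with risk annotations as an ecologically valid resource for research on safer LLM-mediated conversational support.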

Abstract

Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help, while navigating uncertain, relationally complex care decisions. Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory: Inform, Coach, Relate, and Listen, and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate across three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. Furthermore, a human evaluation study reveals a perceived quality--safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on auditing LLM safety in caregiving contexts using social support theory roles (Inform, Coach, Relate, Listen). It does not address model unification architectures, tokenizer design, visual encoders, world models, multimodal integration specifics, or reinforcement learning algorithms. While multiple LLMs are evaluated, they are not unified, and the task is text-based safety auditing rather than technical model development related to the provided keywords. No expert authors from the specified list were found.

关键词

LLM Safety, Caregiving Support, Social Support Theory, Interactional Risk, Role-based Evaluation, Alzheimer's Disease, Conversational Support, Model Auditing

285. SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents [FAIL]

Score: 3.0 / 27.8

Authors: Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen

Published: 2026-05-28

TL;DR: SkillBrew proposes a multi-objective optimization framework for curating skill banks in LLM agents to improve decision-making efficiency by balancing usefulness, diversity, and coverage.

摘要翻译

检索增强大语言模型智能体（Retrieval-augmented LLM agents）日益依赖于精心策划的技能库（curated skill banks）：即用于指导复杂任务决策的可重用文本原则集合。现有方法通常以仅追加（append-only）模式扩展这些库，持续添加新技能而不移除冗余、过时或有害的内容，从而导致效率低下且策划不佳的存储库。本文将技能库策划问题形式化为一个约束多目标问题：理想的库必须对智能体有用，内容多样，并能很好地覆盖查询分布。为此，我们引入了 SkillBrew，这是一个多目标策划框架，它将技能库策划形式化为效用约束下的帕累托感知优化（Pareto-aware optimization），并通过双层提议 - 验证循环（bi-level propose-then-verify loop）来解决。我们在两个公开基准上评估了该方法。我们的发现表明，将技能库视为基于原则的策划对象，而非不断增长的仅追加日志，是构建自我改进的大语言模型智能体（LLM agents）的重要一步。

Abstract

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate the skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint, and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.
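
The "Pareto-aware optimization under a utility constraint" described above can be illustrated with a small sketch. This is not SkillBrew's actual procedure (the abstract does not give it); it only shows the generic idea of discarding candidates below a utility floor and keeping the non-dominated ones. The `BankScore` type, the three score fields, and all numbers are hypothetical.

```python
from typing import NamedTuple


class BankScore(NamedTuple):
    """Hypothetical scores for one candidate skill bank (higher is better)."""
    name: str
    usefulness: float
    diversity: float
    coverage: float


def dominates(a: BankScore, b: BankScore) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    pa, pb = a[1:], b[1:]  # drop the name, keep the three objective values
    return all(x >= y for x, y in zip(pa, pb)) and any(x > y for x, y in zip(pa, pb))


def pareto_front(candidates: list[BankScore], min_usefulness: float) -> list[BankScore]:
    """Drop candidates below the utility floor, then keep the non-dominated ones."""
    feasible = [c for c in candidates if c.usefulness >= min_usefulness]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


candidates = [
    BankScore("A", 0.80, 0.40, 0.60),
    BankScore("B", 0.75, 0.70, 0.65),
    BankScore("C", 0.70, 0.60, 0.60),  # dominated by B on all three objectives
    BankScore("D", 0.40, 0.90, 0.90),  # fails the (assumed) utility constraint
]
front = pareto_front(candidates, min_usefulness=0.5)
print([c.name for c in front])
```

Here A and B both survive because neither is better on every objective, which is the kind of trade-off a multi-objective curation step has to expose rather than collapse into a single score.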

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

评分理由: The paper focuses on multi-objective curation of skill banks for LLM agents using Pareto-aware optimization. It does not address multimodal architectures (Tokenizer, Visual Encoder, MultiModal), world models, or model-based reinforcement learning specifically. While it involves LLMs, it lacks the technical focus on model unification or multimodal representation learning required for higher scores. No listed expert authors were found in the author list.

关键词

Skill Bank Curation, Multi-Objective Optimization, LLM Agents, Pareto-aware, Retrieval-Augmented, Decision Making, Self-improving Agents

286. S2MDF: A Plug-And-Play Layer for Intersection-Free Multi-Object Signed Distance Fields [FAIL]

Score: 3.0 / 27.8

Authors: Deniz Sayin Mercadier, Federico Stella, Aurel Bizeau, Nicolas Talabot, Pascal Fua

Published: 2026-05-28

TL;DR: This paper proposes S2MDF, a plug-and-play module that enforces hard constraints on vector-valued Signed Distance Fields to prevent geometric interpenetration in multi-object scene representations without modifying underlying architectures.

摘要翻译

组合隐式曲面表示将场景建模为对象的集合，每个对象均由有符号距离场（SDF）编码。该方法的一个根本局限性在于，多个 SDF 可能会产生相互穿透的几何体，从而违反物理合理性。现有的缓解策略依赖于软惩罚项，这些项可以减少但无法消除交集，并且需要仔细的损失加权。为了真正防止相互穿透，我们对向量值 SDF 提出了一种硬约束，并引入了 S2MDF，这是一个轻量级即插即用模块，可在不修改架构的情况下对任何对象组合式 SDF 表示强制执行该约束。它引入的计算开销可忽略不计，并且兼容线性插值的标准网格生成算法，例如 Marching Cubes。它可以在训练期间应用，也可以作为后处理步骤。在多种最先进的组合式方法上的实验表明，S2MDF 在保持重建质量的同时将交集减少至数值精度水平，优于现有的缓解策略。

Abstract

Compositional implicit surface representations model scenes as collections of objects, each encoded by a Signed Distance Field (SDF). A fundamental limitation of this approach is that multiple SDFs can produce geometries that interpenetrate, violating physical plausibility. Existing mitigation strategies rely on soft penalty terms that reduce but do not eliminate intersections, and require careful loss weighting. To truly prevent interpenetration, we propose a hard constraint on vector-valued SDFs and introduce S2MDF, a lightweight plug-and-play module that enforces the constraint on any object-compositional SDF representation without architectural modifications. It introduces negligible computational overhead and is compatible with linearly-interpolated standard meshing algorithms such as Marching Cubes. It can be applied during training or as a post-processing step. Experiments on multiple state-of-the-art compositional methods show that S2MDF reduces intersections to numerical precision while preserving reconstruction quality, outperforming existing mitigation strategies.
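
To make "interpenetration" concrete: with one SDF per object, a point lies inside an object when that object's signed distance is negative, so two or more negative values at the same point mean the geometries overlap. The sketch below only detects this condition on two hypothetical spheres; it does not implement the hard constraint that S2MDF enforces, which the abstract does not specify.

```python
import math

Point = tuple[float, float, float]


def sphere_sdf(point: Point, center: Point, radius: float) -> float:
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(point, center) - radius


def is_interpenetrating(sdf_values: list[float]) -> bool:
    """A point is in an overlap region if it is inside at least two objects."""
    return sum(1 for v in sdf_values if v < 0.0) >= 2


# Two hypothetical unit spheres whose centers are 1.5 apart, so they overlap.
spheres = [((0.0, 0.0, 0.0), 1.0), ((1.5, 0.0, 0.0), 1.0)]


def vector_sdf(point: Point) -> list[float]:
    """Vector-valued SDF: one signed distance per object."""
    return [sphere_sdf(point, c, r) for c, r in spheres]


midpoint = (0.75, 0.0, 0.0)       # inside both spheres
left_interior = (-0.5, 0.0, 0.0)  # inside the first sphere only
print(is_interpenetrating(vector_sdf(midpoint)))
print(is_interpenetrating(vector_sdf(left_interior)))
```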

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

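评分理由: This paper focuses on geometric representation learning in 3D computer vision (signed distance fields, SDFs) and aims to solve geometric interpenetration in multi-object scenes. The provided keywords mainly concern large language models, reinforcement learning, and multimodal foundation models, which is a poor match for the paper's topic. 'Unify Models' receives a low score only because the plug-in constraint layer applies uniformly across different SDF methods; the remaining keywords (Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) are not addressed in the paper. The author list does not include any of the specified experts.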

关键词

Signed Distance Fields, Multi-Object, Intersection-Free, Plug-And-Play Layer, Compositional Implicit Surface, Vector-valued SDF, Geometric Interpenetration

287. How to Relieve Distribution Shifts in Semantic Segmentation for Off-Road Environments [FAIL]

Score: 3.0 / 27.8

Authors: Ji-Hoon Hwang, Daeyoung Kim, Hyung-Suk Yoon, Dong-Wook Kim, Seung-Woo Seo

Published: 2026-05-28

TL;DR: The paper proposes ST-Seg, a framework utilizing style expansion and texture regularization to mitigate distribution shifts in semantic segmentation for off-road navigation, thereby enhancing robustness across diverse terrains.

摘要翻译

语义分割对于越野环境中的自主导航至关重要，能够实现对周围环境的精确分类，从而识别出可通行区域。然而，越野条件固有的独特因素，如源 - 目标域差异（source-target domain discrepancies）以及粗糙地形导致的传感器干扰，可能导致分布偏移，从而使数据分布与训练条件不同。这往往导致语义标签预测不准确，进而导致导航任务失败。为了解决这一问题，我们提出了一种名为 ST-Seg 的新颖框架，该框架通过风格扩展（Style Expansion, SE）和纹理正则化（Texture Regularization, TR）来扩展源分布。与先前方法在固定源分布内隐式应用泛化不同，ST-Seg 为分布偏移提供了一种直观的方法。具体而言，SE 通过生成多样化的真实风格来扩大领域覆盖，从而扩充源域中有限的风格信息。TR 则通过深层纹理流形（deep texture manifold）稳定受风格增强学习影响的局部纹理表示。在各种分布偏移的目标域上的实验证明了 ST-Seg 的有效性，相较于现有方法取得了显著改进。这些结果突显了 ST-Seg 的鲁棒性，增强了语义分割在越野导航中的实际应用价值。

Abstract

Semantic segmentation is crucial for autonomous navigation in off-road environments, enabling precise classification of surroundings to identify traversable regions. However, distinctive factors inherent to off-road conditions, such as source-target domain discrepancies and sensor corruption from rough terrain, can result in distribution shifts that alter the data differently from the trained conditions. This often leads to inaccurate semantic label predictions and subsequent failures in navigation tasks. To address this, we propose ST-Seg, a novel framework that expands the source distribution through style expansion (SE) and texture regularization (TR). Unlike prior methods that implicitly apply generalization within a fixed source distribution, ST-Seg offers an intuitive approach for distribution shift. Specifically, SE broadens domain coverage by generating diverse realistic styles, augmenting the limited style information of the source domain. TR stabilizes local texture representation affected by style-augmented learning through a deep texture manifold. Experiments across various distribution-shifted target domains demonstrate the effectiveness of ST-Seg, with substantial improvements over existing methods. These results highlight the robustness of ST-Seg, enhancing the real-world applicability of semantic segmentation for off-road navigation.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

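评分理由: This paper focuses on semantic segmentation and the mitigation of distribution shifts in off-road environments; its core methods are style expansion (SE) and texture regularization (TR). The provided keywords center on multimodal large language models (MLLM), world models, unified architectures, and reinforcement learning. The paper does not involve tokenizers, world models, multimodal fusion, or model-based reinforcement learning; it only implicitly uses a visual encoder as the backbone network and shows no model unification or unified generation. Apart from the basic link to Visual Encoder, the keywords have very low relevance, and the total score falls far below the dynamic pass threshold.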

关键词

Semantic Segmentation, Off-Road Environments, Distribution Shifts, Style Expansion, Texture Regularization, Domain Adaptation, Autonomous Navigation

288. Reasoning with Sampling: Cutting at Decision Points [FAIL]

Score: 1.5 / 27.8

Authors: Felix Zhou, Anay Mehrotra, Quanquan C. Liu

Published: 2026-05-28

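TL;DR: This paper proposes an entropy-guided sampling algorithm that improves language-model reasoning performance without additional training by identifying decision points.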

摘要翻译

前沿推理模型（Frontier reasoning models）是通过强化学习对基础语言模型（base language models）进行后训练生成的。近期研究对此提出了挑战，表明从基础模型分布的锐化版本（即所谓的幂分布（power distribution））中采样，能够在无需额外训练、精心筛选的数据集或验证器的情况下，激发出相当程度的推理能力。然而，要使该方法具有实用性，需要能够高效地从幂分布中进行采样。采样器需要“混合”（mix）至幂分布，这意味着需要在目标分布的众数（modes）之间移动；直观而言，例如尝试不同的推理策略。先前工作中提出的采样器反复在当前推理轨迹（reasoning trace）中均匀随机选择一个“切割”位置，并从该位置开始重采样其后缀（suffix）。然而，推理轨迹通常包含少数关键决策（例如证明策略或算法的选择），我们观察到均匀选择的切割位置倾向于重写局部细节，而非重新访问决策点（decision points）。我们提出了一种算法（熵切割 Metropolis-Hastings（Entropy-Cut Metropolis-Hastings）），该算法利用基础模型的下一个 token 熵（next-token entropy）作为代理指标来识别关键决策点，并从这些位置进行重采样。我们经验性地验证了熵跳跃（entropy jumps）是决策点的一个有用代理指标，并在一种简化的推理模型中证明，我们的方法的混合时间（mixing time）随轨迹中决策点的数量缩放，而非随 token 数量缩放，后者可能大得多。在 MATH500、HumanEval、GPQA Diamond 和 AIME26 上，我们的方法一致优于基线模型及强化学习（RL）训练模型。

Abstract

Frontier reasoning models are produced by post-training base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, this corresponds to trying different reasoning strategies. The samplers proposed in prior work repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resamples from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
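
The contrast between a uniformly chosen cut and an entropy-guided cut can be sketched as follows. This is only an illustration of the proposal step under stated assumptions: the abstract does not give the exact rule, so weighting positions by the positive jump in next-token entropy is a guess at one plausible variant, the toy distributions are made up, and the Metropolis-Hastings accept/reject step is omitted entirely.

```python
import math
import random


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_jumps(dists: list[list[float]]) -> list[float]:
    """Positive increase in entropy relative to the previous position (assumed proxy)."""
    ents = [entropy(d) for d in dists]
    return [0.0] + [max(0.0, ents[t] - ents[t - 1]) for t in range(1, len(ents))]


def cut_weights(dists: list[list[float]]) -> list[float]:
    """Normalized weights over cut positions; falls back to uniform if there are no jumps."""
    jumps = entropy_jumps(dists)
    total = sum(jumps)
    if total == 0.0:
        return [1.0 / len(jumps)] * len(jumps)
    return [j / total for j in jumps]


# Toy trace: position 1 is a high-entropy "decision", the others are near-deterministic.
trace_dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.90, 0.10, 0.00, 0.00],
    [0.97, 0.01, 0.01, 0.01],
]
weights = cut_weights(trace_dists)
rng = random.Random(0)
cut = rng.choices(range(len(weights)), weights=weights, k=1)[0]
print(weights, cut)
```

In this toy trace all of the weight lands on position 1, whereas a uniform cut would pick it only a quarter of the time; that is the intuition behind mixing time scaling with the number of decisions rather than the number of tokens.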

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

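评分理由: This paper mainly concerns optimizing the sampling strategy used at inference time for language models, using entropy to identify decision points and improve reasoning performance. It does not involve multimodal architectures, visual encoders, world models, or the core methods of model-based reinforcement learning (reinforcement learning is mentioned in the background only as a point of comparison). Its relevance to the given keywords is therefore very low, with only a weak link to reinforcement learning through that background discussion.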

关键词

Reasoning with Sampling, Entropy-Cut Metropolis-Hastings, Power Distribution, Decision Points, Base Language Models, Inference Optimization, Resampling

289. CalArena: A Large-Scale Post-Hoc Calibration Benchmark [FAIL]

Score: 1.5 / 27.8

Authors: Eugène Berta, David Holzmüller, Francis Bach, Michael I. Jordan

Published: 2026-05-28

TL;DR: CalArena introduces a large-scale benchmark for evaluating post-hoc calibration methods across tabular and vision tasks, finding that smooth calibration functions outperform binning-based approaches.

摘要翻译

可靠的概率估计在许多机器学习应用中至关重要，但现代分类器往往校准不佳。事后校准（Post-hoc calibration）提供了一种简单且广泛使用的解决方案，但由于提出的方法数量众多，加之评估规模较小且不一致，很难确定哪些方法在实践中真正有效。我们引入了一个大规模、标准化的事后校准基准，涵盖了近 2000 个实验，涉及表格数据和计算机视觉任务，包括二分类、多分类及大规模分类设置。该基准聚合了来自一系列经典模型、现代深度学习架构及基础模型的预测，并在统一的评估框架内提供了数十种校准方法的统一且可复现的实现。我们认为，恰当得分规则（proper scoring rules）中的事后改进（Post-Hoc Improvement, PHI）为比较事后校准方法提供了一种基于原则的替代方案，相较于传统的校准误差估计器，它能同时捕捉校准质量以及模型预测性能的潜在退化。利用这一框架，我们开展了迄今为止最全面的事后校准实证研究。我们的结果揭示了跨领域的一致模式：平滑校准函数优于基于分箱（binning）的方法，专用多分类方法在高维设置中至关重要，且通用机器学习模型若无针对校准的设计则不具备竞争力。为了促进未来的研究，我们发布了所有数据、代码和评估工具，提供了一个即插即用的基准，用于开发和比较校准方法。

Abstract

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.
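
As a rough illustration of measuring calibration through a proper scoring rule, the sketch below applies temperature scaling (a smooth calibration function) to overconfident binary logits and reports the drop in log loss. Reading Post-Hoc Improvement as "score before minus score after" is an assumption, since the abstract does not state the benchmark's exact definition or normalization; the logits, labels, and temperature grid are made up.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(probs: list[float], labels: list[int]) -> float:
    """Mean negative log-likelihood, a proper scoring rule (lower is better)."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for p, y in zip(probs, labels)
    ) / len(labels)


def temperature_scale(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by a temperature before the sigmoid; T > 1 softens predictions."""
    return [sigmoid(z / temperature) for z in logits]


def fit_temperature(logits: list[float], labels: list[int], grid: list[float]) -> float:
    """Pick the grid temperature with the lowest log loss."""
    return min(grid, key=lambda t: log_loss(temperature_scale(logits, t), labels))


# Overconfident toy model: two of six confident predictions are wrong.
logits = [4.0, 3.5, -4.0, 3.0, -3.5, -3.0]
labels = [1, 0, 0, 1, 0, 1]

# For brevity the temperature is fit and scored on the same data;
# a real evaluation would fit on a held-out calibration split.
best_t = fit_temperature(logits, labels, grid=[0.5, 1.0, 2.0, 4.0, 8.0])
before = log_loss(temperature_scale(logits, 1.0), labels)
after = log_loss(temperature_scale(logits, best_t), labels)
improvement = before - after
print(best_t, round(before, 3), round(after, 3))
```

Because log loss penalizes both miscalibration and loss of discrimination, a method that harms the model's predictions would show up as a negative improvement, which a calibration-error estimate alone would not reveal.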

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on calibration benchmarking for classifiers, while keywords relate to multimodal architectures, world models, and reinforcement learning. There is minimal conceptual overlap; only 'Unify Models' has slight lexical overlap regarding the unified evaluation framework, whereas other keywords are irrelevant to the paper's core content.

关键词

Post-Hoc Calibration, Large-Scale Benchmark, Probability Estimates, Computer Vision, Tabular Tasks, Proper Scoring Rules, Reproducible Implementations

290. Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis [FAIL]

Score: 1.5 / 27.8

Authors: Thalea Schlender, Peter A. N. Bosman, Tanja Alderliesten

Published: 2026-05-28

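TL;DR: This paper uses genetic programming to jointly optimize feature sets and survival tree structure, aiming to improve the interpretability and predictive performance of survival analysis models.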

摘要翻译

生存分析 (Survival Analysis) 涉及预测事件发生时间的任务。该分析常用于医学领域，处理不完整（即删失）的数据，例如在研究期间未经历事件的患者数据。在实际应用中，准确性和可解释性均至关重要。生存树 (Survival Trees) 是易于理解的生存模型，它将患者队列递归地分割成离散的患者组。尽管生存树能够捕捉复杂关系，但它们通常需要生长得较大，这会威胁到可解释性。此外，生存树通常采用贪心策略 (Greedy Approaches) 构建，这可能忽略全局最优的分割组合，从而限制了预测性能。浅层生存树需要表达能力强的高阶特征组合才能达到具有竞争力的准确性。因此，我们使用遗传编程 (Genetic Programming) 多目标进化内禀可解释特征集，并研究它们如何与不同的树诱导策略 (Tree Induction Strategies) 相互作用。我们还进一步引入了一种进化方法，该方法联合优化生存树结构和非线性分割逻辑。我们的研究结果表明，进化特征构造在不同树诱导策略、两个真实世界数据集以及两种不同生存树深度下均提高了预测性能。全联合进化 (Full Joint Evolution) 整体上具有最高的潜力，能够生成多个性能良好的内禀可解释浅层生存树。

Abstract

Survival analysis concerns the task of predicting the time until an event occurs. Often used in the medical field, survival analysis deals with incomplete (i.e., censored) data, for instance, from patients who did not experience the event during the duration of the study. For practical use, both accuracy and interpretability are important. Survival trees are easy-to-follow survival models that split the patient cohort recursively into discrete patient groups. Whilst survival trees can capture complex relationships, they typically need to grow large, threatening interpretability. Moreover, survival trees are often built using greedy approaches that may overlook globally optimal split combinations, limiting predictive performance. Shallow survival trees require expressive, higher-order feature combinations to achieve competitive accuracy. We therefore use genetic programming to multi-objectively evolve inherently inspectable feature sets and study how they interact with different tree induction strategies. We further introduce an evolutionary approach that jointly optimises the survival tree structure and the non-linear split logic. Our findings demonstrate that evolutionary feature construction improves predictive performance across different tree induction strategies on two real-world datasets and two different survival tree depths. Full joint evolution has the overall highest potential to propose multiple inherently inspectable shallow survival trees of good performance.
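
To show what censoring means in practice, here is a minimal sketch of the standard Kaplan-Meier estimator, which is the textbook way to estimate a survival curve from censored data. It is background for the abstract above, not part of the paper's genetic programming method, and the toy durations are made up.

```python
def kaplan_meier(durations: list[float], observed: list[bool]) -> list[tuple[float, float]]:
    """Return (time, survival probability) at each time where an event was observed.

    observed[i] is False when subject i was censored, i.e. left the study
    at durations[i] without having experienced the event.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)  # events plus censored at t
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored subjects shrink the risk set without an event
    return curve


durations = [2.0, 3.0, 3.0, 5.0, 8.0]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(time, round(prob, 3))
```

The censored subjects at times 3 and 8 never lower the curve directly; they only reduce the number at risk for later events, which is how incomplete follow-up is used without being treated as an event.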

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

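评分理由: The paper performs survival analysis with genetic programming (GP) and survival trees, emphasizing interpretability. The provided keyword set mainly covers multimodal large models, world models, and reinforcement learning. The paper's content and methods (decision trees, feature evolution) have no direct technical overlap with the keywords (Tokenizer, Visual Encoder, MLLM, RL, etc.); there is only a weak link to the notion of "unification" (joint evolution of features and tree structure), so the relevance scores are very low. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus was triggered.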

关键词

Survival Analysis, Genetic Programming, Survival Trees, Interpretable Machine Learning, Feature Evolution, Tree Induction, Censored Data

291. Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots [FAIL]

Score: 1.5 / 27.8

Authors: Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Jamie Hayes, Niels Heinen, Tianqi Fan, Luca Invernizzi, Martin Vechev

Published: 2026-05-28

TL;DR: This paper proposes Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots, demonstrating that LLM-based honeypots achieve longer attacker interactions and lower detection rates compared to rule-based baselines.

摘要翻译

蜜罐是模仿真实系统组件的诱骗系统，旨在防御网络攻击。近年来，大语言模型（LLMs）越来越多地作为蜜罐的模拟骨干。它们使防御者能够构建高交互蜜罐，同时保持较低的系统安全风险。然而，基于大语言模型的蜜罐开发缺乏统一的评估框架。大多数评估仅限于在固定命令上测量响应相似度、手动测试或实际部署。这些方法往往在开发扩展性、评估间可复现性、对实际攻击的代表性，以及适应各种攻击者和蜜罐配置的能力方面存在不足。本文旨在填补这一空白，并提出 Honeyval，这是一个针对基于大语言模型的 HTTP 蜜罐的综合评估框架。我们通过依托于 16 个后端应用程序构建蜜罐，使用 AI 黑客代理（AI hacking agents）作为攻击者，采用两个控制任务来监控代理和蜜罐在不同定制方案下的能力，并为攻击者定义清晰且可验证的利用目标，从而解决了先前评估的局限性。利用 Honeyval，我们对近期成本效益较高的大语言模型作为 HTTP 蜜罐进行了广泛评估。我们的实验凸显了基于大语言模型的蜜罐的前景：它们与攻击者的交互时间显著长于基于规则的基线蜜罐，且即使被前沿模型检测到的频率也远低于后者，同时平均而言，仍保持了对基于代理的攻击者（agentic attackers）的运行成本优势。此外，我们还实验了不同反击型蜜罐的配置，观察到了独特的权衡，例如以增加检测率为代价换取更长的交互时间。

Abstract

Honeypots are decoy systems that mimic real system components and are designed to defend against cyber attacks. Recently, LLMs increasingly serve as simulation backbones for honeypots. They enable defenders to construct high-interaction honeypots with low system security risks. However, LLM-powered honeypot development lacks a unified evaluation framework. Most evaluations consist of measuring response similarity on fixed commands, manual testing, or real-world deployment. These methods are often not scalable for development, reproducible across evaluations, representative of practical attacks, or adaptable to various attacker and honeypot configurations. In this work, we bridge this gap and propose Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots. We address the limitations of prior evaluations by grounding the honeypots in 16 backend applications, using AI hacking agents as attackers, employing two control tasks to monitor agent and honeypot capabilities across customizations, and defining clear and verifiable exploit goals for the attacker. Using Honeyval, we conduct an extensive evaluation of recent cost-efficient LLMs as HTTP honeypots. Our experiments highlight the promise of LLM-powered honeypots; they lead to substantially longer interactions with the attacker than rule-based baseline honeypots and are far less frequently detected even by frontier models, all while, on average, preserving a running cost advantage against agentic attackers. Further, we experiment with different counter-offensive honeypot configurations and observe unique trade-offs, such as longer interactions at the cost of increased detection.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper addresses cybersecurity evaluation for LLM-powered honeypots, which diverges significantly from the multimodal and reinforcement learning focus of the provided keywords. 'Unify Models' receives a low score (1.0) due to the lexical similarity with 'unified evaluation framework' but lacks architectural model unification. Keywords like Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, and model-based RL are completely unrelated to the security/honeypot domain. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

关键词

LLM-powered Honeypots, Evaluation Framework, AI Hacking Agents, HTTP Honeypots, Cyber Security, Exploit Goals, Cost-efficient LLMs

292. Formalizing Mathematics at Scale [FAIL]

Score: 1.5 / 27.8

Authors: Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

Published: 2026-05-28

TL;DR: This paper introduces AutoformBot, a multi-agent LLM system that scales the formalization of mathematics textbooks into verified Lean 4 code, producing a large verified library without utilizing multimodal or reinforcement learning techniques.

摘要翻译

我们提出 AutoformBot，一个用于在 Lean 4 中构建大规模自动形式化教科书库（Atlas）的多智能体系统。AutoformBot 协调数千个配备形式化验证工具、依赖感知任务调度以及协作版本控制的大语言模型（LLM）代理，将非形式化的教科书文本转化为机器验证的定义和证明。我们将该方法应用于涵盖分析学、代数、拓扑学、组合学和概率论的 26 本开放获取教科书语料库，生成了 Atlas：一个包含超过 45,000 个 Lean 4 声明和 50 万行代码的验证库。我们发布两项成果：(i) AutoformBot，开源多智能体框架；(ii) Atlas，所得的形式化库。我们的结果表明，大规模自动形式化研究生水平数学的核心内容在经济和技术上现已可行。这为研究级别的人类和机器生成数学的自动验证打开了大门。

Abstract

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on automating mathematics formalization using multi-agent LLM systems in Lean 4. It does not involve visual encoders, world models, model-based reinforcement learning, or unified multimodal architectures. While it uses large language models, which accounts for the minimal MLLM score, it has none of the multimodal components (e.g., vision) or RL techniques implied by the other keywords.

关键词

Formalizing Mathematics, Multi-agent System, Lean 4, Automated Verification, Textbook Library, LLM Agents, Formal Verification

293. Gated Graph Attention Networks with Learnable Temperature [FAIL]

Score: 1.5 / 27.8

Authors: Zhongtian Ma, Hao Wu, Yexin Zhang, Qiaosheng Zhang, Zhen Wang

Published: 2026-05-28

TL;DR: This paper proposes gated graph attention and learnable temperature mechanisms to enhance the robustness and performance of graph attention networks against unreliable features and noise.

摘要翻译

图注意力网络（Graph Attention Networks）通过数据依赖系数学习邻居重要性，但标准层缺乏对不可靠特征维度的显式控制，且使用注意力系数分布 (Attention Coefficient Distributions) 的固定锐度。本文针对常见的图注意力机制提出了门控图注意力 (Gated Graph Attention) 和可学习温度 (Learnable Temperature)。门控图注意力通过过滤特征或消息响应来减少不可靠维度的影响，而可学习温度则动态调整注意力系数分布的锐度。在同构图及异亲异质基准上的实验表明，所提出的变体一致改进了相应的图注意力骨干网络，而受控噪声研究进一步验证了它们在特征扰动下的表现。理论分析通过展示门控仅在部分特征坐标可靠时能提高鲁棒性，而温度在全局噪声削弱节点特征判别性时有益，从而解释了这些结果。

Abstract

Graph attention networks learn neighbor importance through data-dependent coefficients, but standard layers lack explicit control over unreliable feature dimensions and use fixed sharpness of attention coefficient distributions. This paper proposes gated graph attention and learnable temperature for common graph attention mechanisms. Gated graph attention filters feature or message responses to reduce the influence of unreliable dimensions, while learnable temperature dynamically adjusts the sharpness of the attention coefficient distribution. Experiments on homogeneous and heterophilic heterogeneous benchmarks show that the proposed variants consistently improve the corresponding graph attention backbones, and controlled noise studies further verify their behavior under feature perturbations. Theoretical analysis explains these results by showing that gating improves robustness when only part of the feature coordinates are reliable, while temperature is beneficial when global noise weakens the discriminability of node features.
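
The two mechanisms named above can be sketched in isolation. The temperature divides the attention scores before the softmax and so controls how sharp the coefficient distribution is, while a sigmoid gate scales each feature dimension. This is a generic illustration, not the paper's layer definition: in the paper the temperature and gates are learned, whereas here they are fixed, hypothetical values.

```python
import math


def attention_weights(scores: list[float], temperature: float) -> list[float]:
    """Softmax over neighbor scores; a smaller temperature gives a sharper distribution."""
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gate_features(features: list[float], gate_logits: list[float]) -> list[float]:
    """Scale each feature dimension by a sigmoid gate in (0, 1)."""
    return [x / (1.0 + math.exp(-g)) for x, g in zip(features, gate_logits)]


neighbor_scores = [2.0, 1.0, 0.5]  # hypothetical unnormalized attention scores
sharp = attention_weights(neighbor_scores, temperature=0.5)
flat = attention_weights(neighbor_scores, temperature=4.0)

# A strongly negative gate logit nearly removes an unreliable dimension.
gated = gate_features([1.0, 1.0], gate_logits=[6.0, -6.0])
print(sharp, flat, gated)
```

A flatter distribution averages over more neighbors, which is consistent with the abstract's claim that temperature helps when global noise makes individual node features less discriminative.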

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

评分理由: The paper addresses robustness in Graph Attention Networks via gating and temperature mechanisms. It does not involve multimodal learning, tokenization, visual encoders, world models, MLLMs, or reinforcement learning, making it largely irrelevant to the provided keyword set which targets multimodal/RL domains. 'Unify Models' receives a minimal score as it involves model architecture modification, but not unification across modalities/tasks.

关键词

Gated Graph Attention, Learnable Temperature, Graph Attention Networks, Feature Robustness, Heterophilic Graphs, Attention Mechanism, Message Filtering

294. Instance-dependent Stochastic Lipschitz bandit [FAIL]

Score: 1.5 / 27.8

Authors: Marius Potfer, Vianney Perchet

Published: 2026-05-28

TL;DR: This paper proposes an instance-dependent regret analysis for Lipschitz bandits based on integrals of suboptimality gaps over level sets, achieving improved adaptive rates compared to classical zooming dimension bounds.

摘要翻译

我们研究 Lipschitz 老虎机问题（Lipschitz bandit problem），其中学习者在定义域 $\mathcal{X} \subset [0,1]^d$ 上通过带噪声的点态评估顺序最大化未知的 Lipschitz 函数 $f$。现有的遗憾界要么是最坏情况下的，量级为 $\tilde{\Theta}\left(T^{\frac{d+1}{d+2}}\right)$，要么是通过缩放维度（zooming dimension）$d_z$ 自适应的，得到 $\tilde{\Theta}\left(T^{\frac{d_z+1}{d_z+2}}\right)$。然而，此类基于缩放维度的保证仅在部分实例依赖，因为它们仅依赖于近最优水平集的渐近增长，而无法捕捉 $f$ 更精细的结构性质。我们提供了一种分析及一种算法，通过 $f$ 在其水平集上的次优差距的积分来刻画遗憾。这得到了适应水平集局部增长的遗憾界，而不仅仅是其渐近行为。作为推论，当最大化器集合的维度 $d^\star>0$ 时，我们获得改进的自适应率，量级为 $\tilde{\mathcal{O}}\left(T^{\frac{d_z+1}{\max(d_z,d^\star)+2}}\right)$，在此情形下严格优于经典缩放界。最后，我们将分析扩展到全信息设置（Lipschitz experts），并展示如何放松部分正则性假设。

Abstract

We study the Lipschitz bandit problem, where a learner sequentially maximizes an unknown Lipschitz function $f$ over a domain $\mathcal{X} \subset [0,1]^d$ using noisy pointwise evaluations. Existing regret bounds are either worst-case, scaling as $\tilde{\Theta}\left(T^{\frac{d+1}{d+2}}\right)$, or adaptive via the zooming dimension $d_z$, yielding $\tilde{\Theta}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. However, such zooming-based guarantees are only partially instance-dependent, as they depend solely on the asymptotic growth of near-optimal level sets and fail to capture finer structural properties of $f$. We provide an analysis and an algorithm that characterizes the regret through integrals of the suboptimality gap of $f$ over its level sets. This yields regret bounds that adapt to the local growth of level sets, rather than only their asymptotic behavior. As a corollary, when the set of maximizers has dimension $d^\star>0$, we obtain improved adaptive rates of order $\tilde{\mathcal{O}}\left(T^{\frac{d_z+1}{\max(d_z,d^\star)+2}}\right)$, strictly improving over classical zooming bounds in this regime. Finally, we extend our analysis to the full-information setting (Lipschitz experts) and show how some of the regularity assumptions can be relaxed.
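
A quick worked comparison of the two classical rates, reading the exponent of $T$ as the fraction $(d+1)/(d+2)$ in the worst case and $(d_z+1)/(d_z+2)$ in the zooming-dimension case. The dimensions below are hypothetical, and the sketch covers only these two standard bounds, not the paper's new instance-dependent result.

```python
from fractions import Fraction


def regret_exponent(dim: int) -> Fraction:
    """Exponent (dim + 1) / (dim + 2) of T in the regret rate."""
    return Fraction(dim + 1, dim + 2)


ambient_dim = 3  # hypothetical d
zooming_dim = 1  # hypothetical d_z, at most d

worst_case = regret_exponent(ambient_dim)  # exponent of the worst-case bound
adaptive = regret_exponent(zooming_dim)    # exponent of the zooming-dimension bound
print(worst_case, adaptive, float(worst_case - adaptive))
```

With these values the exponent drops from 4/5 to 2/3, so the zooming-based bound grows more slowly in $T$ whenever the near-optimal region is lower-dimensional than the ambient space.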

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

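评分理由: The paper studies the Lipschitz bandit problem and its regret bounds, which belongs to online learning and reinforcement learning theory and is unrelated to keywords such as multimodal large models, tokenizers, visual encoders, and world models. Although bandit problems fall within the scope of reinforcement learning, model-based RL usually refers to learning dynamics for control, so the relevance is very low (1.0). No specified expert authors were found. The weighted total of 1.5 is far below the dynamic pass threshold of 27.8.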

关键词

Lipschitz bandit, Regret bounds, Level sets, Zooming dimension, Instance-dependent, Stochastic optimization, Lipschitz function

295. FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting [FAIL]

Score: 1.5 / 27.8

Authors: Kjersti Engan, Neel Kanwal, Anita Yeconia, Ladislaus Blacy, Yuda Munyaw, Estomih Mduma, Hege Ersdal

Published: 2026-05-28

TL;DR: FHRFormer proposes a self-supervised masked transformer framework to reconstruct missing fetal heart rate signals and forecast future values, addressing data dropout in prenatal monitoring.

摘要翻译

约 10% 的新生儿在出生时需要协助启动呼吸，约 5% 需要通气支持。胎儿心率（FHR）监测在产前护理中对于评估胎儿健康状况发挥着至关重要的作用，能够检测异常模式，并支持及时的产科干预，从而减轻分娩过程中的胎儿风险。将人工智能（AI）方法应用于分析具有多样化结果的连续胎儿心率监测片段的大数据集，可能为预测需要呼吸辅助或干预的风险提供新的见解。近年来，可穿戴胎儿心率监测仪的进步使得连续胎儿监测成为可能，且不会损害母体的活动能力。然而，母体运动过程中的传感器位移，以及胎儿或母体位置的变化，常导致信号丢失，从而造成记录的胎儿心率数据出现缺失。此类缺失数据限制了有意义信息的提取，并使得自动化（基于人工智能）分析变得复杂。处理缺失数据的传统方法（如简单的插值技术）往往无法保留信号的频谱特性。本文提出了一种基于掩码变换器（masked transformer）的自编码器方法，通过捕获数据的局部时频成分来重建缺失的胎儿心率信号。所提出的方法在缺失数据的不同持续时间上表现出鲁棒性，可用于信号修复（inpainting）和预测。所提出的方法可回顾性地应用于研究数据集，以支持基于人工智能的风险算法的开发。未来，该方法可被整合到可穿戴胎儿心率监测设备中，以实现更早且更稳健的风险检测。

Abstract

Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropout, resulting in gaps in recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handling missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both local temporal and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on medical time-series analysis using masked transformers for FHR signal reconstruction. It lacks multimodal components, visual encoders, language models, reinforcement learning, or world modeling architectures. Only 'Unify Models' has minor relevance (1.0) as it unifies inpainting and forecasting tasks within a single framework; all other keywords are irrelevant (0.0). The total weighted score (1.5) is significantly below the dynamic pass threshold (27.8), indicating a strong mismatch between the paper content and the provided keyword profile.
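The arithmetic behind the score tables in this report appears to be a weighted sum: each row's score equals weight times relevance, and the total is compared against the pass threshold. The sketch below reproduces that for the table above. The function name, the `>=` comparison, and the assumption that the threshold is applied to the plain sum are guesses; the report does not state how the dynamic threshold of 27.8 is derived.

```python
def weighted_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


# Rows copied from the table above: (keyword, weight, relevance on a 0-10 scale).
rows = [
    ("Unify Models", 1.5, 1.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]

PASS_THRESHOLD = 27.8  # copied from the report; its derivation is not stated

total = weighted_score(rows)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)
```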

Keywords

Fetal Heart Rate, Time-Series, Masked Transformer, Self-Supervised, Signal Inpainting, Forecasting, Autoencoder, Medical AI

296. Eigen-Spike Emergence and Quadratic Equivalents for Conjugate Kernels on Nonlinearly Separable Data [FAIL]

Score: 1.5 / 27.8

Authors: Collin Cranston, Zhichao Wang, Todd Kemp, Michael W. Mahoney

Published: 2026-05-28

TL;DR: This paper investigates the spectral behavior of conjugate kernels on nonlinearly separable data using random matrix theory, deriving quadratic equivalents and phase transitions to analyze linear classification capabilities via eigenvectors.

Abstract (Chinese translation)

随机矩阵理论（RMT）的近期研究发展了确定性等价物（deterministic equivalents）的概念：通常是线性代理模型，用于近似大型非线性随机矩阵的谱行为，例如神经网络（NNs）中的非线性特征映射。一方面，这些确定性等价物通过将复杂模型简化为具有经典 RMT 工具适用性质的简单模型，使得理论预测变得可行。然而，这留下了一个问题：当处理高维非线性可分数据时（例如在对非线性可分数据进行分类时），这种理想化的线性等价性是否仍然有意义。受此启发，我们考虑共轭核（CK），它是前馈神经网络（NN）的非线性特征映射，在一个典型的非线性可分数据集（即异或问题，XOR problem）下；我们研究 CK 中的信息性离群特征值及其对应的特征向量是否渐近对齐于 XOR 标签，以此作为非线性可学习性的代理。我们开发了一个针对尖峰 CK 矩阵的稳健二次等价物，这使得能够对涌现的信息性尖峰进行精确分析，当修改 ML 实践中常见的各种参数时：样本复杂度、信噪比（SNR）、非线性激活函数选择以及预训练特征。在每种场景中，我们都推导出了一种精确的 BBP 型相变，在此相变中，通过 CK 特征向量进行线性分类成为可能。我们的分析有助于将 RMT 中确定性等价工具的力量应用于研究具有实际意义的 ML 问题。

Abstract

Recent work in random matrix theory (RMT) has developed the notion of deterministic equivalents: typically linear surrogate models that approximate the spectral behavior of large nonlinear random matrices, such as nonlinear feature maps in neural networks (NNs). These deterministic equivalents make theoretical predictions tractable by reducing a complex model to a simpler one whose properties fall under the umbrella of classical RMT tools. However, this leaves open the question of whether the idealized linear equivalence remains meaningful for high-dimensional nonlinearly separable data, for example when performing classification on such data. Motivated by this, we consider the conjugate kernel (CK), which is the nonlinear feature map of a feedforward NN, under a canonical nonlinearly separable dataset, the XOR problem. As a proxy for nonlinear learnability, we study informative outlier eigenvalues in the CK and whether their corresponding eigenvectors asymptotically align with the XOR labels. We develop a robust quadratic equivalent to the spiked CK matrix that enables a precise analysis of emergent informative spikes as one modifies various knobs common in ML practice: sample complexity, signal-to-noise ratio (SNR), nonlinear activation choice, and pretrained features. In each of these scenarios, we derive a precise BBP-type phase transition at which linear classification via the CK eigenvectors becomes possible. Our analysis helps translate the power of deterministic equivalence tools in RMT to the study of problems of practical relevance in ML.
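A toy calculation can illustrate why a purely linear surrogate may be uninformative on XOR-type data. The sketch below is not the paper's data model, which the abstract does not specify: the four noise-free cluster centers, the label convention, and both helper names are assumptions. It shows that every linear coordinate has zero correlation with the labels, while a quadratic feature separates them.

```python
def xor_points(mu):
    """Four noise-free XOR cluster centers: label +1 on the x-axis, -1 on the y-axis."""
    return [((mu, 0.0), 1), ((-mu, 0.0), 1), ((0.0, mu), -1), ((0.0, -mu), -1)]


def label_correlation(points, feature):
    """Average of label * feature(x) over the dataset."""
    return sum(y * feature(x) for x, y in points) / len(points)


points = xor_points(mu=2.0)

# Linear features: each coordinate averages out against the labels.
linear_1 = label_correlation(points, lambda x: x[0])
linear_2 = label_correlation(points, lambda x: x[1])

# Quadratic feature x1^2 - x2^2: positive on one class, negative on the other.
quadratic = label_correlation(points, lambda x: x[0] ** 2 - x[1] ** 2)

print(linear_1, linear_2, quadratic)
```

Under these assumptions the label signal lives entirely in second-order terms, which is consistent with the abstract's motivation for a quadratic rather than linear equivalent.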

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on theoretical machine learning, specifically Random Matrix Theory and Conjugate Kernels applied to nonlinearly separable data (XOR problem). It does not address multimodal learning, world models, reinforcement learning, tokenization, or visual encoders. The only marginal connection is the analysis of neural network features (Conjugate Kernel), but it is purely theoretical and lacks the architectural or application context implied by the keywords (e.g., MLLM, MultiModal, RL). Therefore, relevance to the specified keywords is extremely low.

Keywords

Random Matrix Theory, Conjugate Kernels, Deterministic Equivalents, Nonlinearly Separable Data, Eigen-Spike Emergence, Phase Transition, Feedforward Neural Networks

297. Gradient Perturbation: Learning to Perturb Gradients for Adaptive Training [FAIL]

Score: 1.5 / 27.8

Authors: Hua Li

Published: 2026-05-28

TL;DR: The paper proposes a unified gradient perturbation framework (LPG) to improve generalization in classification tasks by adaptively manipulating gradients, outperforming existing optimization methods like SAM.

Abstract (Chinese translation)

深度神经网络训练既包括前向传播（从特征经 logit 至损失），也包括反向传播（从损失经梯度至参数更新）。尽管前向传播链路上的扰动（包括特征扰动、logit 扰动和标签扰动）已被广泛研究，但反向传播链路上的梯度扰动却鲜有系统性研究。本文建立了梯度扰动的统一框架，揭示了诸如锐度感知最小化（Sharpness-Aware Minimization, SAM）、梯度裁剪和梯度噪声注入等现有方法均可被解释为施加了特定形式的梯度扰动。类似于近期提出的 logit 扰动学习（Logit Perturbation Learning, LPL），我们猜想放大某一类别的梯度范数相当于正增强（增强学习），而减弱该范数则相当于负增强（抑制过拟合）。基于上述观察，我们提出了学习扰动梯度（Learning to Perturb Gradients, LPG），该方法在类别级别上自适应地扰动 logit 层面的梯度，以实现类别感知训练。此外，我们通过 PAC-Bayesian 分析建立了梯度扰动界与泛化保证之间的理论联系。在平衡分类、长尾分类及噪声标签学习上的实验表明，LPG 一贯优于现有方法，且可作为插件模块与之结合使用。

Abstract

Deep neural network training involves both forward propagation (from features through logits to loss) and backward propagation (from loss through gradients to parameter updates). While perturbations along the forward chain, including feature perturbation, logit perturbation, and label perturbation, have been extensively studied, the backward chain's gradient perturbation has received little systematic investigation. In this paper, we establish a unified framework for gradient perturbation, revealing that existing methods such as Sharpness-Aware Minimization (SAM), gradient clipping, and gradient noise injection can all be interpreted as imposing specific forms of gradient perturbation. Analogous to the recently proposed Logit Perturbation Learning (LPL), we conjecture that amplifying the gradient norm for a class acts as positive augmentation (enhancing learning), while dampening it acts as negative augmentation (suppressing overfitting). Based on these observations, we propose Learning to Perturb Gradients (LPG), which adaptively perturbs logit-level gradients at the class level to achieve category-aware training. We also establish theoretical connections between gradient perturbation bounds and generalization guarantees via PAC-Bayesian analysis. Experiments on balanced classification, long-tail classification, and noisy label learning demonstrate that LPG consistently outperforms existing methods and can be combined with them as a plug-in module.
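The idea of perturbing logit-level gradients per class can be sketched in a few lines. This is an illustration under stated assumptions and not the authors' LPG algorithm: the abstract describes the perturbation as adaptive, whereas the `class_scale` factors here are fixed by hand, and `perturb_gradient` is a hypothetical name. The only standard fact used is that the cross-entropy gradient with respect to the logits equals softmax(logits) minus the one-hot label.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_gradient(logits, label):
    """Gradient of cross-entropy w.r.t. the logits: softmax(logits) - one_hot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]


def perturb_gradient(grad, label, class_scale):
    """Scale the logit gradient by a factor chosen per ground-truth class.

    class_scale is a hypothetical lookup: a factor above 1 amplifies the
    gradient norm for that class, a factor below 1 dampens it.
    """
    factor = class_scale[label]
    return [factor * g for g in grad]


logits = [2.0, 0.5, -1.0]
grad = logit_gradient(logits, label=1)
boosted = perturb_gradient(grad, label=1, class_scale={0: 1.0, 1: 1.5, 2: 0.5})
print(grad)
print(boosted)
```

Scaling the gradient before it is backpropagated changes the effective step size for samples of that class only, which is one simple reading of "amplifying the gradient norm for a class".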

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on gradient perturbation for supervised classification tasks (balanced, long-tail, noisy labels) and proposes a unified optimization framework (LPG). It does not address multimodal learning, world models, reinforcement learning, tokenizers, or visual encoders; the 'unified' aspect refers to optimization methods rather than the model architecture unification implied by the research background. The calculated weighted score (1.5) is well below the dynamic passing score (27.8), indicating low relevance to the specified research track. No expert authors from the specified list are present.

Keywords

Gradient Perturbation, Adaptive Training, Classification, Generalization Guarantees, PAC-Bayesian Analysis, Sharpness-Aware Minimization, Logit Perturbation Learning

298. Honest Lying: Understanding Memory Confabulation in Reflexive Agents [FAIL]

Score: 1.5 / 27.8

Authors: Prakhar Dixit, Sadia Kamal, Tim Oates

Published: 2026-05-28

TL;DR: This paper identifies memory confabulation in reflexive agents where they store incorrect task interpretations and proposes a programmatic extraction method to reduce reliance on false reflections, thereby improving task solving rates.

Abstract (Chinese translation)

反思式智能体依赖自生成的反思作为记忆，隐含地假设智能体能够准确诊断自身的失败。我们表明这一假设会系统性地失效：在 ALFWorld 和 HumanEval 上，智能体存储了关于任务的自信但错误的解释，并在多次试验中继续基于它们行动，尽管环境每次都重置为正确任务。我们将这种失效模式称为记忆虚构，并引入反思重复率（RRR），这是一种基于日志的指标，用于检测对错误反思内容的反复依赖。使用 RRR，我们在 ALFWorld 中识别出 16 个冻结环境，其中 121 次反思中提及正确目标对象的次数为 0，在 HumanEval 中有 4 个类似案例。我们的缓解方案将开放式自我诊断替换为轨迹级失败信号的程序化提取，将正确目标提及率从 0% 提高到 86%，将 RRR 从 0.64 降低到 0.10，并解决了 16 个冻结的 ALFWorld 环境中的 3 个，这表明反思记忆可能强化错误信念而非纠正它们。

Abstract

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fail systematically: across ALFWorld and HumanEval, agents store confident but incorrect interpretations of the task and continue acting on them across trials, even though the environment resets to the correct task each time. We call this failure mode memory confabulation and introduce the Reflection Repetition Rate (RRR), a log-based metric that detects repeated reliance on incorrect reflective content. Using RRR, we identify 16 frozen environments in ALFWorld, where 0 of 121 reflections mention the correct target object, and 4 analogous cases in HumanEval. Our mitigation replaces open-ended self-diagnosis with programmatic extraction of trajectory-level failure signals, increasing correct object mention from 0% to 86%, reducing RRR from 0.64 to 0.10, and solving 3 of 16 frozen ALFWorld environments, suggesting that reflective memory can reinforce false beliefs rather than correct them.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on memory confabulation in reflexive agents and proposes programmatic mitigation. It lacks content on model unification, tokenization, visual encoders, world models, multimodal learning, or model-based RL architectures. Although it uses RL environments, the focus is on reflection safety rather than the specified keywords. No expert authors from the specified list are found.

Keywords

Reflexive Agents, Memory Confabulation, Reflection Repetition Rate, Self-generated Reflections, Programmatic Extraction, Failure Diagnosis, ALFWorld, HumanEval

299. A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning [FAIL]

Score: 1.5 / 27.8

Authors: Ding Chen, Xinwen Cheng, Xuyang Zhong, Xinping Chen, Xiaolin Huang, Chen Liu

Published: 2026-05-28

TL;DR: This paper proposes a comprehensive framework to systematically evaluate Membership Inference Attacks across diverse machine learning contexts, providing standardized threat models and metrics for privacy auditing.

Abstract (Chinese translation)

虽然成员推断攻击（MIAs）是识别训练数据的主流方法，但其应用已扩展至隐私审计和机器遗忘领域。然而，该领域缺乏一个系统性的框架，用于评估不同上下文如何影响 MIAs 的有效性。缺乏此类刻画，从业者可能会部署在基准测试上表现良好，但在面对特定真实数据集的细微差别时却失去统计意义的算法。为弥合这一差距并提供可操作的见解，我们提出一个全面的评估框架，系统性地刻画整个机器学习流程中的隐私风险，涵盖数据、架构、算法及训练后模块。该框架旨在捕捉多样化的操作上下文，并在广泛的训练配置下严格评估最先进的 MIAs。为了考虑现实部署中不同的误分类成本，我们采用三种互补的指标：平衡准确率（Balanced Accuracy）用于对称成本，以及在低假正例率（FPR）下的真正例率（TPR）（或低假负例率（FNR）下的真负例率（TNR）），适用于严格惩罚误报或漏检的非对称场景。此外，鉴于现有的 MIAs 假设了不同的对手能力，我们形式化定义了两个标准化的威胁模型，并将这些攻击调整为相应的变体，以确保公平的基准。广泛的实证评估表明，特定 MIA 方法的有效性高度依赖于所假设的威胁模型和所选的评估指标。最终，我们将这些发现提炼为可操作指南，并提供一个即用型审计工具包，使从业者能够进行更优的隐私评估。

Abstract

While Membership Inference Attacks (MIAs) are the prevailing method for identifying training data, their application has expanded into privacy auditing and machine unlearning. Nevertheless, the field lacks a systematic framework for evaluating how different contexts affect MIA efficacy. Without such a characterization, practitioners risk deploying algorithms that perform well on benchmarks but become statistically irrelevant when faced with the nuances of specific, real-world datasets. To bridge this gap and provide actionable insights, we introduce a comprehensive evaluation framework that systematically characterizes privacy risks across the entire machine learning pipeline, spanning data, architectures, algorithms, and post-training modules. Designed to inherently capture diverse operational contexts, our framework rigorously evaluates state-of-the-art MIAs across a broad spectrum of training configurations. To account for varying misclassification costs in real-world deployments, we employ three complementary metrics: Balanced Accuracy for symmetric costs, alongside TPR at low FPR (or TNR at low FNR) for asymmetric scenarios where false alarms or missed detections are strictly penalized. Furthermore, recognizing that existing MIAs assume divergent adversary capabilities, we formalize two standardized threat models and adapt these attacks into corresponding variants to ensure an equitable benchmark. Extensive empirical evaluations demonstrate that the efficacy of specific MIA methodologies is highly sensitive to the assumed threat models and chosen evaluation metrics. Ultimately, we distill these findings into actionable guidelines and provide a ready-to-use auditing toolkit, empowering practitioners to conduct better privacy assessments.
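Two of the metrics named in the abstract, Balanced Accuracy and TPR at low FPR, can be computed from attack scores as sketched below. This is a minimal illustration and not the paper's toolkit: the function names, the convention that a higher score means "more likely a training member", and the toy scores are all assumptions.

```python
def balanced_accuracy(scores, is_member, threshold):
    """Mean of TPR and TNR when predicting 'member' for score >= threshold."""
    positives = sum(is_member)
    negatives = len(is_member) - positives
    tp = sum(1 for s, m in zip(scores, is_member) if m and s >= threshold)
    tn = sum(1 for s, m in zip(scores, is_member) if not m and s < threshold)
    return 0.5 * (tp / positives + tn / negatives)


def tpr_at_fpr(scores, is_member, max_fpr):
    """Best TPR over all score thresholds whose FPR does not exceed max_fpr."""
    positives = sum(is_member)
    negatives = len(is_member) - positives
    best = 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, m in zip(scores, is_member) if m and s >= threshold)
        fp = sum(1 for s, m in zip(scores, is_member) if not m and s >= threshold)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


# Made-up attack scores for 4 members and 4 non-members.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
is_member = [True, True, False, True, False, True, False, False]

print(balanced_accuracy(scores, is_member, threshold=0.5))
print(tpr_at_fpr(scores, is_member, max_fpr=0.0))
print(tpr_at_fpr(scores, is_member, max_fpr=0.25))
```

The two metrics can rank attacks differently: an attack with mediocre balanced accuracy may still identify a few members with no false alarms, which is what TPR at low FPR rewards.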

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Membership Inference Attacks and privacy auditing in machine learning, which is semantically distinct from the provided keywords concerning multimodal large language models, world models, and reinforcement learning. There is no technical overlap regarding tokenizers, visual encoders, or RL architectures. 'Unify Models' receives a minimal score due to the framework's unification of evaluation contexts, but it does not refer to model architecture unification in the MLLM/RL sense.

Keywords

Membership Inference Attacks, Privacy Auditing, Evaluation Framework, Threat Models, Machine Unlearning, Post-training Modules, Privacy Risks

300. Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks [FAIL]

Score: 1.5 / 27.8

Authors: Daniel Tinoco, Raquel Menezes, Carlos Baquero, Alexandra Silva

Published: 2026-05-28

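TL;DR: This paper proposes a data-driven method that uses convolutional neural networks for spatial interpolation from sparse observations of a single field, as a practical alternative to classical geostatistical methods such as Kriging.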

Abstract (Chinese translation)

从稀疏观测预测一个完整的空间相关场是空间统计学和环境建模中的基本挑战。经典的插值方法（如克里金法 (Kriging)）依赖高斯过程 (Gaussian process) 假设和变异函数分析 (variography)，这可能会限制其在非平稳环境下的有效性，且需要大量的领域专业知识。本文利用基于卷积神经网络 (CNNs) 的架构进行空间插值，该方法在单个部分观测场上进行训练和应用，无需外部数据或先验场。模型直接在观测位置上进行监督，学习在用户定义的网格上预测未观测点的值。与克里金法不同，我们的方法不需要显式的协方差建模 (covariance modelling) 或变异函数估计 (variogram estimation)，能够以数据驱动的方式灵活捕捉局部空间模式。本文展示了 CNNs 在稀疏监督下单实例空间插值的潜力，为经典地统计方法提供了实用的替代方案，并将 CNNs 的应用扩展到了一个新的问题领域。

Abstract

Predicting a complete spatially correlated field from sparse observations is a fundamental challenge in spatial statistics and environmental modelling. Classical interpolation methods such as Kriging rely on Gaussian process assumptions and variography, which can limit their effectiveness in non-stationary settings and require substantial domain expertise. In this work, we leverage an architecture based on convolutional neural networks (CNNs) for spatial interpolation that is trained and applied on a single partially observed field, without access to external data or prior fields. The model is supervised directly on the observed locations and learns to predict values at unobserved points on a user-defined grid. Unlike Kriging, our method does not require explicit covariance modelling or variogram estimation, and it can flexibly capture local spatial patterns in a data-driven manner. This work demonstrates the potential of CNNs for single-instance spatial interpolation under sparse supervision, offering a practical alternative to classical geostatistical methods and extending the use of CNNs to a new problem domain.
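Supervising "directly on the observed locations" suggests a loss that is evaluated only where measurements exist. A minimal sketch of such a masked loss is given below. The choice of mean squared error, the nested-list grid layout, and the name `masked_mse` are assumptions; the abstract does not state the paper's actual loss.

```python
def masked_mse(predicted, observed, mask):
    """Mean squared error over grid cells where mask is True (observed locations only)."""
    squared_errors = [
        (p - o) ** 2
        for pred_row, obs_row, mask_row in zip(predicted, observed, mask)
        for p, o, m in zip(pred_row, obs_row, mask_row)
        if m
    ]
    if not squared_errors:
        raise ValueError("mask selects no observed cells")
    return sum(squared_errors) / len(squared_errors)


predicted = [[1.0, 2.0], [3.0, 4.0]]   # model output on the full grid
observed = [[1.5, 0.0], [0.0, 3.0]]    # zeros are placeholders at unobserved cells
mask = [[True, False], [False, True]]  # True where a measurement exists

print(masked_mse(predicted, observed, mask))
```

Unobserved cells contribute nothing to the loss, so the placeholder values never influence training; the network's predictions at those cells are the interpolated field.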

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

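Scoring rationale: The paper addresses CNN-based spatial interpolation, which belongs to geostatistics and environmental modelling. The provided keywords focus on multimodal large models, world models, and reinforcement learning. Apart from a loose technical link between CNNs and 'Visual Encoder', the paper does not touch the core concepts of Tokenizer, Unify Models, World Models, MLLM, MultiModal, or model-based RL, so its overall relevance to the given keyword cluster is very low.

Keywords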

Spatial Interpolation, Convolutional Neural Networks, Sparse Observations, Single-Field Learning, Environmental Modelling, Data-driven, Kriging Alternative, Spatial Statistics

301. Improving Adversarial Robustness of Attribution via Implicit Regularization [FAIL]

Score: 1.5 / 27.8

Authors: Amir Mehrpanah, Matteo Gamba, Hossein Azizpour

Published: 2026-05-28

TL;DR: This paper demonstrates that adversarial robustness of attributions emerges implicitly from SGD dynamics and can be enhanced in transformers by replacing softmax attention with kernel-based attention, without requiring explicit regularization.

Abstract (Chinese translation)

归因的对抗鲁棒性是深度学习可靠可解释性的基本要求，然而现有方法通常依赖于计算成本高昂的显式正则化。本文表明，归因鲁棒性可从标准随机梯度下降（SGD）的学习动力学中隐式产生。我们通过参数空间与输入空间曲率之间的联系从理论上解释了这一效应，并在不同架构、数据集及归因方法上验证了该效应，且计算开销可忽略不计。相比之下，我们证明由于固有的熵约束，这种鲁棒性增益往往不会转移到 Softmax 归一化下的基于注意力的归因中，并通过实验验证了这一局限性。最后，我们表明在 Transformer 模型中，用基于核的注意力替换 Softmax 注意力可恢复鲁棒性增益。我们的结果表明，学习动力学是实现鲁棒可解释性的合理且实用的机制，并揭示了归一化下基于注意力的归因的根本性局限性。

Abstract

The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.
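To make the contrast between softmax attention and kernel-based attention concrete, the sketch below computes attention weights for one query under both schemes. It is an illustration only: the abstract does not say which kernel the paper uses, so the `elu(x) + 1` feature map (a common choice in linear-attention work) and all function names are assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax_attention_weights(query, keys):
    """Weights proportional to exp(query . key), normalized to sum to 1."""
    scores = [dot(query, key) for key in keys]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_map(vector):
    """elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vector]


def kernel_attention_weights(query, keys):
    """Weights proportional to phi(query) . phi(key) for the positive feature map phi."""
    phi_q = feature_map(query)
    sims = [dot(phi_q, feature_map(key)) for key in keys]
    total = sum(sims)
    return [s / total for s in sims]


query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
soft = softmax_attention_weights(query, keys)
kern = kernel_attention_weights(query, keys)
print(soft)
print(kern)
```

Both variants return positive weights that sum to one; they differ in how similarity is measured, exponential of a dot product versus a dot product of positive features.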

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

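Scoring rationale: The paper focuses on improving the adversarial robustness of attributions through implicit regularization, analyzing SGD dynamics and attention mechanisms. It does not address multimodal learning, world models, tokenization strategies, visual encoders, or reinforcement learning, which are the core themes of the provided keywords, so relevance is very low. The author list does not include any of the specified experts.

Keywords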

Adversarial Robustness, Attribution, Implicit Regularization, Stochastic Gradient Descent, Attention Mechanism, Transformer Models, Explainability, Kernel-based Attention

302. Fairness Beyond Demographics: Optimizing Performance Across Appearance-Based Hidden Cohorts in Medical Imaging [FAIL]

Score: 1.5 / 27.8

Authors: Milad Masroor, Cuong Nguyen, Kevin Wells, Gustavo Carneiro

Published: 2026-05-28

TL;DR: This paper proposes a label-free hidden-cohort fairness training paradigm to optimize performance across latent subpopulations in medical imaging without demographic labels, effectively reducing performance disparities.

Abstract (Chinese translation)

医学图像分析模型可能在不同的患者子群中表现出性能差异，从而威胁临床安全与公平性。现有方法通常通过针对被视为孤立变量的可见人口统计学属性（demographic attributes），例如性别或年龄，优化准确性和公平性指标来解决这一问题。这种策略不仅忽略了可能更具信息量的潜在分层（latent stratifications），这些分层可能揭示模型错误和不公平性的更深层次来源，而且在同时考虑多个人口统计学属性时也难以扩展，因为由此导致的每个子群内训练数据稀疏性。为了解决这些问题，我们引入了无标签隐藏队列公平性（Label-free Hidden-Cohort Fairness, LHCF）训练范式。该范式不再是在可见人口统计学属性上最大化公平性，而是优化从图像外观中挖掘出的潜在子群之间的公平性。通过将图像聚类为 K 个基于外观的队列（appearance-based cohorts）并在其上应用公平性优化，LHCF 揭示了模型错误的潜在来源，避免了多人口统计学属性的组合稀疏性，从而减少了单个或多个人口统计学属性之间的差异。我们在所提出的公平性基准（HIDFairBench）上证明，尽管从未使用人口统计学标签进行训练，LHCF 仍能在单个或多个人口统计学属性上提供最先进的公平性结果。我们的研究结果将隐藏队列公平性定位为一种实用、可扩展且稳健的替代方案，用于基于人口统计学的公平性优化，从而推动可信赖的医学图像分析。

Abstract

Medical image analysis models can exhibit performance disparities across patient subgroups, threatening clinical safety and fairness. Existing methods typically address this issue by optimizing accuracy and fairness metrics for visible demographic attributes (e.g., sex or age) considered in isolation. This strategy not only overlooks potentially more informative latent stratifications, which may reveal deeper sources of model error and inequity, but also fails to scale when multiple demographic attributes are considered simultaneously, because the training data within each subgroup becomes sparse. We address these issues by introducing the label-free hidden-cohort fairness (LHCF) training paradigm, which, instead of maximizing fairness over visible demographic attributes, optimizes fairness across latent subpopulations discovered from image appearance. By clustering images into K appearance-based cohorts and applying fairness optimization over them, LHCF uncovers underlying sources of model error and avoids the combinatorial sparsity of multi-demographic attributes, reducing disparities across both single and multiple demographic attributes. We demonstrate on our proposed fairness benchmark, HIDFairBench, that LHCF provides state-of-the-art fairness results on single and multiple demographic attributes, despite never using demographic labels for training. Our results position hidden-cohort fairness as a practical, scalable, and robust alternative to demographic-based fairness optimization for trustworthy medical image analysis.
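Once images have been assigned to K appearance-based cohorts, a disparity measure over those cohorts can be tracked without any demographic labels. The sketch below computes per-cohort accuracy and the gap between the best and worst cohort. It is a hedged illustration: the cohort assignments are taken as given (the clustering step is not shown), and the best-minus-worst gap is an assumed disparity measure, since the abstract does not define the LHCF objective.

```python
from collections import defaultdict


def cohort_accuracies(correct, cohort_ids):
    """Accuracy per cohort, given per-sample correctness flags and cohort assignments."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for is_correct, cohort in zip(correct, cohort_ids):
        totals[cohort] += 1
        hits[cohort] += int(is_correct)
    return {cohort: hits[cohort] / totals[cohort] for cohort in totals}


def accuracy_gap(correct, cohort_ids):
    """Difference between the best and the worst cohort accuracy."""
    accuracies = cohort_accuracies(correct, cohort_ids)
    return max(accuracies.values()) - min(accuracies.values())


# Made-up predictions for six images clustered into K = 3 cohorts.
correct = [True, True, True, False, False, False]
cohort_ids = [0, 0, 1, 1, 2, 2]

print(cohort_accuracies(correct, cohort_ids))
print(accuracy_gap(correct, cohort_ids))
```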

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on fairness optimization in medical imaging using latent cohort clustering. It does not address Unify Models, Tokenizers, World Models, MLLM, MultiModal architectures, or Model-Based RL. Visual Encoder is tangentially related as the input is visual, but not the core research contribution.

Keywords

Medical Imaging, Fairness Optimization, Hidden Cohorts, Clustering, Latent Subpopulations, Label-free Training, Performance Disparities

303. On Language Generation in the Limit with Bounded Memory [FAIL]

Score: 0.0 / 27.8

Authors: Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

Published: 2026-05-28

TL;DR: This paper theoretically analyzes language generation under bounded memory constraints, demonstrating that memoryless generators can handle any countable language collection while density and identification tasks are restricted to finite collections.

Abstract (Chinese translation)

我们在有界记忆（bounded memory）条件下研究极限语言生成（language generation in the limit）。在此任务中，学习者一次观察一个来自未知目标语言（target language）的示例，并最终必须仅输出新的有效示例。先前工作假设可以访问整个历史（history），这是一个强假设，因为现实算法仅保留有限的过去信息。学习理论（learning theory）中的经典工作表明，记忆约束显著改变了可学习性（learnability）；我们将这一结论扩展到语言生成。首先，我们研究无记忆生成器（memoryless generators）。在温和的枚举限制下，每个可数无限语言集合（countable collection of infinite languages）在无记忆情况下仍可生成。若无此限制，我们精确刻画了无记忆生成（memoryless generation）何时可行。对于有限集合，我们刻画了无记忆生成器可实现的最优极小极大密度（minimax density）——即针对任何给定大小的集合所能保证的最佳密度。这一组合界限依赖于斯佩纳定理（Sperner's theorem）和对称链分解（symmetric chain decompositions）。我们进一步展示，仅使用最后 $W$ 个示例的滑动窗口（sliding window）不会改善最坏情况密度，而允许其存储 $b$ 个自适应选择的过去示例则能改善每个 $b \geq 1$ 的可实现密度。最后，我们重新审视极限识别（identification in the limit），其中学习者必须收敛到目标语言的单个正确假设（hypothesis）。我们重点关注其增量变体（incremental variant），其中学习者仅记住其之前的猜测。在此情形下，尽管精确识别（exact identification）在仅包含三个语言的集合上失败，但要求收敛至目标语言“近似”版本的轻微放宽对于每个有限集合均可实现。这些结果表明，有界记忆以不同方式影响这些任务：生成（generation）对每个可数集合仍可实现，而密度和识别（identification）局限于有限集合，且随着集合增大，保证程度逐渐减弱。

Abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators, that is, the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.
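For readers unfamiliar with the combinatorial tools named in the abstract, the standard statement of Sperner's theorem is reproduced below. This is background only: how the paper applies it to the minimax density is not described in the abstract.

```latex
% Sperner's theorem (standard statement; background, not taken from the paper).
% A family \mathcal{F} of subsets of [n] = \{1,\dots,n\} is an antichain
% if no member of \mathcal{F} is a proper subset of another member.
\[
  \mathcal{F} \subseteq 2^{[n]} \text{ is an antichain}
  \quad\Longrightarrow\quad
  |\mathcal{F}| \le \binom{n}{\lfloor n/2 \rfloor}.
\]
% A symmetric chain decomposition partitions 2^{[n]} into chains
% A_k \subset A_{k+1} \subset \dots \subset A_{n-k} with |A_i| = i.
% Every such chain contains exactly one set of size \lfloor n/2 \rfloor,
% so the decomposition has \binom{n}{\lfloor n/2 \rfloor} chains, and the
% bound follows because an antichain meets each chain at most once.
```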

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper is a theoretical computer science study on language generation under bounded memory constraints within learning theory. It does not address multimodal architectures, tokenizer design, visual encoders, world models, MLLM training, or reinforcement learning. Consequently, it has no relevance to the provided deep learning and RL-specific keywords.

Keywords

Language Generation, Bounded Memory, Learning Theory, Memoryless Generators, Identification in the Limit, Combinatorial Bounds, Sliding Window

304. Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion [FAIL]

Score: 0.0 / 27.8

Authors: Anay Mehrotra, Phuc Tran, Van H. Vu, Manolis Zampetakis

Published: 2026-05-28

TL;DR: This paper proposes a computationally efficient estimator for heterogeneous treatment effect estimation in panel data by formulating it as a matrix completion problem and establishing sharp row-wise ℓ2 error bounds.

Abstract (Chinese translation)

现代因果推断（causal inference）的一个核心目标是估计异质性处理效应（heterogeneous treatment effects），以回答诸如“干预如何影响每个单位”这样的问题，而不仅仅是平均效应。我们在面板数据（panel-data）中研究这一问题，其中在未知的、非均匀的处理分配（treatment assignments）下观察到 $n$ 个单位在 $m$ 个时间点上的数据。在此设置下，数据自然地表示为一个包含所有单位 - 时间处理效应的矩阵。估计异质性处理效应可以表述为获取该矩阵中每一行平均值的优良估计。这使得我们可以将问题建模为矩阵补全（matrix completion），该问题在自然的低秩性（low-rankness）假设下可解。然而，现有的矩阵补全保证（matrix-completion guarantees）不足以获得估计异质性处理效应所需的逐行保证（per-row guarantee）的有意义界限；粗略地说，它们仅适用于估计平均处理效应（average treatment effect）的界限，正如最近的一些工作所表明的那样。我们提出了一种简单且计算高效的估计量（estimator），该估计量无需知道倾向得分（propensities），并在标准的低秩性和正则性假设（regularity assumptions）下，实现了逐行 $\ell_2$ 误差（row-wise $\ell_2$ error）为 $\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$。技术上，我们的分析建立了低秩近似（low-rank approximation）的第一个紧的逐行 $\ell_2$ 扰动界（row-wise $\ell_2$-perturbation bound），补充了现有的谱（spectral）、弗罗贝尼乌斯（Frobenius）和逐元素（entrywise）扰动理论。

Abstract

A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like "how does an intervention affect each unit," rather than only on average. We study this problem with panel data, where we observe $n$ units across $m$ times under unknown, non-uniform treatment assignments. The data in this setting is naturally represented as a matrix of all unit-time treatment effects. Estimating heterogeneous treatment effects can then be expressed as obtaining a good estimate of each row's average in this matrix. This allows us to formulate the problem as matrix completion, which can be solved under natural low-rankness assumptions. However, existing matrix-completion guarantees are not powerful enough to give meaningful bounds for the per-row guarantee required for estimating the heterogeneous treatment effect; roughly speaking, they are only useful for bounding the error of the average treatment effect, as also illustrated in a recent line of work. We give a simple, computationally efficient estimator that, without knowledge of the propensities and under standard low-rankness and regularity assumptions, achieves a row-wise $\ell_2$ error of $\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$. Technically, our analysis establishes the first sharp row-wise $\ell_2$-perturbation bound for low-rank approximation, complementing existing spectral, Frobenius, and entrywise perturbation theory.
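To get a feel for the stated rate, the expression inside the $\tilde{O}(\cdot)$ can be evaluated numerically. The sketch below ignores the constants and logarithmic factors that $\tilde{O}$ hides, so the numbers indicate scaling only and are not actual error values; the function name is hypothetical.

```python
import math


def rowwise_error_rate(n, m):
    """Evaluate sqrt(1/n + n/m^2), dropping constants and log factors."""
    return math.sqrt(1.0 / n + n / m ** 2)


# n units observed across m time points.
for n, m in [(100, 100), (100, 1000), (10000, 1000)]:
    print(n, m, rowwise_error_rate(n, m))
```

The second term, $n/m^2$, vanishes only when $m$ grows faster than $\sqrt{n}$, so under this rate adding units without adding time points eventually stops helping.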

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on causal inference and matrix completion for heterogeneous treatment effect estimation using panel data. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) relate to multimodal large language models and reinforcement learning. There is no overlap between the paper's statistical/causal methodology and the multimodal/RL concepts specified in the keywords, resulting in zero relevance for all.

Keywords

Heterogeneous Treatment-Effect Estimation, Matrix Completion, Causal Inference, Panel Data, Low-Rankness Assumptions, Row-wise Error Bounds, Treatment Assignments

305. Neural Network Verification using Partial Multi-Neuron Relaxation [FAIL]

Score: 0.0 / 27.8

Authors: Ido Shmuel, Guy Katz

Published: 2026-05-28

TL;DR: This paper proposes a partial multi-neuron relaxation method for neural network verification to balance bound tightness and computational scalability, integrated into the Marabou verifier.

Abstract (Chinese translation)

深度神经网络在关键系统中的日益集成引发了对其行为安全性属性进行形式化保证的理论界与实践界的关注。为此，现有的验证算法依赖于计算网络非线性激活函数的线性松弛。现有的线性松弛方法通常分为两类：单神经元松弛，即根据每个激活神经元的输入源对其进行界定；以及多神经元松弛，即计算涉及多个激活神经元及其输入源的线性界限。然而，现有方法可能难以平衡紧致性与可扩展性，因为单神经元界限可能无法推导出验证完成所需的足够紧致界限，而为所有激活神经元生成多神经元松弛在计算上代价高昂。本文提出了一种折中方案，其特征为部分多神经元松弛，即仅对启发式选择的小部分神经元生成多神经元界限。为此，我们基于现有的分支启发式策略来选择神经元，并为多神经元界限优化边界超平面。我们将所提出的方法集成到 Marabou 验证器中，并在与现有界收紧方法的比较中获得了优于现有方法的结果。我们的实验展示了该技术用于神经网络验证的潜力。

Abstract

The increasing integration of deep neural networks in critical systems has spawned a theoretical and practical interest in formally guaranteeing safety properties about their behavior. To achieve this, contemporary verification algorithms rely on computing linear relaxations for a network's non-linear activation functions. Existing approaches for linear relaxations typically fall into one of two categories: single-neuron relaxation, in which each activation neuron is bounded in terms of its sources; and multi-neuron relaxation, in which linear bounds involving multiple activation neurons and their sources are calculated. However, existing methods might fail to balance tightness and scalability, as single-neuron bounds might not derive sufficiently tight bounds necessary for verification to complete, whereas generating multi-neuron relaxation for all activation neurons is computationally expensive. In this paper, we present a middle-ground approach featuring partial multi-neuron relaxation, in which we generate multi-neuron bounds for only a small, heuristically selected subset of neurons. To achieve this, we build upon existing branching heuristics for selecting neurons and for optimizing bounding hyper-planes for multi-neuron bounds. We integrated our proposed method within the Marabou verifier, and obtained favorable results in comparison to existing bound tightening methods. Our experiments showcase the potential of our technique for neural network verification.
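As a concrete instance of a single-neuron linear relaxation, the sketch below implements the well-known triangle relaxation of ReLU for an input known to lie in an interval. It is a textbook construction given for orientation; the abstract does not specify which relaxations the paper or Marabou use, and the function name is hypothetical.

```python
def relu_triangle_bounds(x, lower, upper):
    """Return (lo, hi) bounds on relu(x) for x in [lower, upper] under the triangle relaxation."""
    if not lower <= x <= upper:
        raise ValueError("x must lie in [lower, upper]")
    if lower >= 0:   # neuron always active: relu(x) == x
        return x, x
    if upper <= 0:   # neuron always inactive: relu(x) == 0
        return 0.0, 0.0
    # Unstable neuron: y >= 0, y >= x, and y lies below the chord
    # through (lower, 0) and (upper, upper).
    hi = upper * (x - lower) / (upper - lower)
    lo = max(0.0, x)
    return lo, hi


# Check soundness on a grid: the true ReLU value always lies within the bounds.
for i in range(21):
    x = -2.0 + 5.0 * i / 20
    lo, hi = relu_triangle_bounds(x, lower=-2.0, upper=3.0)
    assert lo <= max(0.0, x) <= hi + 1e-12

print(relu_triangle_bounds(0.0, lower=-2.0, upper=3.0))
```

The gap between `lo` and `hi` at a given `x` is the looseness introduced for that neuron; multi-neuron relaxations aim to shrink it by adding constraints that couple several neurons, at higher computational cost.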

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on neural network verification using partial multi-neuron relaxation techniques, which is fundamentally unrelated to the provided keywords concerning Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is no overlap in methodology or domain (e.g., no tokenizers, visual encoders, or RL components). The term 'Unify Models' in this specific keyword set refers to multimodal architectural unification, not the algorithmic unification of relaxation bounds presented here. None of the specified expert authors are listed in the paper.

Keywords

Neural Network Verification, Partial Multi-Neuron Relaxation, Linear Relaxations, Marabou Verifier, Safety Properties, Activation Functions, Bound Tightening

306. Temporal Stability and Few-Shot Prompting in Math Task Assessment [FAIL]

Score: 0.0 / 27.8

Authors: Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

Published: 2026-05-28

TL;DR: This study investigates whether model version updates or few-shot prompting affect the stability of AI tools in classifying math task cognitive demands, finding prompting improves performance while updates have mixed effects.

Abstract (Chinese translation)

随着人工智能工具日益融入教育环境，人们对其随时间的稳定性以及对提示工程技术的响应能力产生了疑问。这项纵向研究重点关注了不同人工智能工具利用任务分析指南（TAG; Stein & Smith, 1998）对数学任务的认知需求进行分类的能力。特别是，该研究考察了这种分类能力是否会因（1）随时间的模型版本更新以及（2）使用示例任务的少样本提示（few-shot prompting）而发生变化。本研究测试了一款通用人工智能工具（Gemini）和一款教育专用人工智能工具（Coteach）。选择这些特定工具是因为它们在相关已发布的基准测试及先前的特定任务测试中表现相对优异。模型首先在基线状态下进行测试，随后在模型版本更新后重新测试，最后再次使用少样本提示进行测试（每个认知需求类别包含两个示例任务）。结果显示，仅靠更新到较新的模型版本产生了混合效应：Gemini 的准确率保持在 58% 不变，而 Coteach 的准确率则从 75% 下降至 50%。然而，少样本提示技术提升了两个模型的性能：Gemini 的准确率提升至 67%，Coteach 的准确率则恢复至 75%。这些发现表明，提示工程技术可能比被动的模型改进产生更大且更可靠的影响，且版本更新并不总能提升在专门教育任务上的表现。本研究对教育工作者和研究者在教育情境中如何选择、评估及实施人工智能工具具有重要的启示意义。

Abstract

As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein & Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category). Results revealed that newer model versions alone produced mixed effects: Gemini's accuracy remained stable at 58%, while Coteach's accuracy decreased from 75% to 50%. However, few-shot prompting improved both models' performance: Gemini increased to 67% and Coteach recovered to 75% accuracy. These findings demonstrate that prompt engineering techniques can have larger and more reliable effects than passive model improvements, and that version updates may not always improve performance on specialized educational tasks. The study has important implications for how educators and researchers should approach AI tool selection, evaluation, and implementation in educational contexts.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on educational AI evaluation, temporal stability, and few-shot prompting for math task classification. It lacks any technical discussion on model architectures (Tokenizer, Visual Encoder), reinforcement learning (World Models, Model-Based RL), or model unification strategies. The provided keywords target multimodal foundation models and RL, which are not the subject of this study. No specified expert authors are present.

Keywords

Temporal Stability, Few-Shot Prompting, Math Task Assessment, Cognitive Demand, Model Version Updates, Task Analysis Guide, AI Tools Evaluation

307. DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning [FAIL]

Score: 0.0 / 27.8

Authors: Hyuck Lee, Taemin Park, Heeyoung Kim

Published: 2026-05-28

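TL;DR: The paper proposes DAMEL, a multi-expert learning strategy along the representation and time axes that effectively reduces both prediction bias and variance in class-imbalanced learning.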

Abstract (Chinese translation)

针对真实世界数据中长尾分布（long-tailed distributions）所带来的类别不平衡学习（class-imbalanced learning）挑战，已提出多种算法。尽管这些算法通过重平衡技术降低了预测偏差，但往往以增加预测方差为代价。虽然一些多专家学习（multi-expert learning）算法旨在解决这一问题，但它们涉及复杂的流程。我们提出了一种新的多专家学习算法，称为双轴多专家学习（DAMEL），该算法通过在表示轴和时间轴上使用多个专家，同时减少了预测的偏差和方差。在表示轴上，DAMEL 拼接多个专家的表示，并同时训练一个辅助平衡分类器。在时间轴上，DAMEL 聚合训练轮次期间的网络权重，并在测试阶段使用这些聚合后的权重。实验结果表明，DAMEL 减少了预测的偏差和方差，突显了其在类别不平衡学习中的有效性。

Abstract

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called the dual-axis multi-expert learning (DAMEL), which reduces both bias and variance of predictions by using multiple experts along both representation and time axes. Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both bias and variance of predictions, highlighting its effectiveness in class-imbalanced learning.
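The time-axis idea, aggregating network weights across training epochs and using the aggregate at test time, can be illustrated with a small sketch. It assumes a uniform average over checkpoints stored as plain dictionaries of float lists. The abstract does not specify the aggregation rule, so both the averaging scheme and the function name are assumptions.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameter values across epoch checkpoints.

    `checkpoints` is a list of dicts mapping a parameter name to a flat list
    of floats. All checkpoints are assumed to share the same names and shapes.
    Uniform averaging is an illustrative assumption, not DAMEL's exact rule.
    """
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # zip(*...) walks the same position of this parameter in every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / n for values in columns]
    return averaged


epoch_weights = [
    {"w": [1.0, 3.0], "b": [0.0]},
    {"w": [3.0, 5.0], "b": [1.0]},
]
test_time_weights = average_checkpoints(epoch_weights)
```

Averaging across epochs smooths out epoch-to-epoch fluctuations in the weights, which fits the variance-reduction goal stated in the abstract.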

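Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper's subject is Dual-Axis Multi-Expert Learning (DAMEL) for class-imbalanced learning, which targets the bias-variance trade-off. The provided keywords (Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL, Unify Models) all belong to multimodal large models or reinforcement learning and have no direct connection to the conventional machine-learning classification task studied here, so every relevance score is 0. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords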

Class-Imbalanced Learning, Multi-Expert Learning, Dual-Axis, Bias-Variance Trade-off, Long-Tailed Distributions, Representation Axis, Time Axis

308. Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression [FAIL]

Score: 0.0 / 27.8

Authors: Gijs van Nieuwkoop, Siamak Mehrkanoon

Published: 2026-05-28

TL;DR: This paper improves precipitation nowcasting accuracy and heavy rainfall risk prediction by reformulating training as a multi-quantile regression problem using pinball loss within a UNet architecture, without requiring generative sampling or new model structures.

Abstract (Chinese translation)

深度学习降水临近预报模型通常使用均方误差（MSE）或平均绝对误差（MAE）等逐点损失（pointwise losses）进行优化，这可能导致预报过度平滑，且对强降水的表征不佳。本研究探讨了是否可以通过将训练重新制定为多分位数回归（multi-quantile regression）问题，来提升现有确定性临近预报架构的预测性能。以 SmaAt-UNet 为核心模型，我们在荷兰地区的雷达降水临近预报任务中比较了 MSE、MAE 和多分位数分位数损失（pinball-loss）训练方法。结果表明，多分位数训练提升了中心确定性预报，相比使用 MSE 训练的模型，测试集 MSE 降低了 8.6%，同时生成的上分位数输出对于强降水的风险敏感预测很有用。这些发现表明，分位数回归（quantile regression）提供了一种简单的替代方案，用于替代标准逐点损失，而无需新的架构或生成式采样过程。我们的模型实现及训练设置可在 GitHub 上获取。

Abstract

Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi-quantile regression problem. Using SmaAt-UNet as a core model, we compare MSE, MAE, and multi-quantile pinball-loss training on radar precipitation nowcasting over the Netherlands. The results show that multi-quantile training improves the central deterministic forecast, decreasing test-set MSE by 8.6% compared to a model trained using MSE, while also producing upper-quantile outputs that are useful for risk-sensitive prediction of heavy precipitation. These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on GitHub (https://github.com/gijsvn/Multi-Quantile-Precipitation-Nowcasting).
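The pinball loss behind multi-quantile training is short enough to sketch directly. The example below uses plain Python lists rather than radar grids. The quantile levels (0.1, 0.5, 0.9) and function names are illustrative assumptions, not values or code taken from the paper's repository.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one target and one quantile level q in (0, 1).

    Under-prediction is weighted by q and over-prediction by (1 - q), so a high q
    pushes the prediction toward the upper tail of the target distribution.
    """
    diff = y_true - y_pred
    return q * diff if diff >= 0 else (q - 1.0) * diff


def multi_quantile_loss(y_true, preds_by_quantile):
    """Mean pinball loss over all targets and all quantile levels.

    `preds_by_quantile` maps a quantile level to a list of predictions aligned
    with `y_true` (one output channel per quantile in a network setting).
    """
    total, count = 0.0, 0
    for q, preds in preds_by_quantile.items():
        for target, pred in zip(y_true, preds):
            total += pinball_loss(target, pred, q)
            count += 1
    return total / count


rain_mm = [0.0, 2.0, 10.0]
predictions = {
    0.1: [0.0, 1.0, 6.0],   # assumed lower quantile
    0.5: [0.0, 2.0, 8.0],   # median, usable as the central forecast
    0.9: [0.5, 4.0, 12.0],  # assumed upper quantile for heavy-rain risk
}
loss = multi_quantile_loss(rain_mm, predictions)
```

At q = 0.5 the pinball loss is half the absolute error, so the median output plays the role of the central deterministic forecast while the upper quantiles carry the risk-sensitive information.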

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on precipitation nowcasting using multi-quantile regression with a SmaAt-UNet architecture. It is a supervised learning task on single-modal radar data and does not involve Large Language Models (MLLM), Tokenizers, World Models, Model-Based Reinforcement Learning, or Unified Multimodal architectures implied by the keywords. There is no overlap with the specified research directions regarding foundation models or RL.

Keywords

Precipitation Nowcasting, Multi-Quantile Regression, SmaAt-UNet, Pinball Loss, Deep Learning, Risk-sensitive Prediction, Pointwise Losses

309. No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval [FAIL]

Score: 0.0 / 27.8

Authors: Lixuan Guo, Yifei Wang, Tiansheng Wen, Aosong Feng, Stefanie Jegelka, Chenyu You

Published: 2026-05-28

Abstract (Chinese translation)

多向量检索（MVR）模型，以 ColBERT 为代表，通过保留细粒度的词元级交互，在检索准确性上确立了新的基准。然而，这种细粒度带来了严重的存储和检索效率瓶颈：为了应对十亿级词元向量所带来的巨大内存占用和计算开销，最先进的系统不得不依赖激进的降维和复杂的聚类（如 K-means）。这种折衷方案带来了两个关键局限性：大规模语料库聚类导致的过度索引延迟，以及压缩过程中固有的语义信息损失。本文提出单阶段稀疏检索（SSR），这是一种范式转变，用高效的稀疏编码取代昂贵的聚类操作。与将特征压缩为低维稠密向量不同，我们利用稀疏自编码器（SAE）将词元嵌入投影至高维但高度稀疏的表示中。这一变换使我们能够完全避开向量聚类，并利用倒排索引实现精确且高吞吐量的检索。在 BEIR 基准上的广泛实验表明，SSR 实现了“三重奏”式的改进：与 ColBERTv2 相比，索引时间减少 15 倍，检索延迟减半，同时在检索性能上超越了领先的基线模型。

Abstract

Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grained token-level interactions. However, this granularity imposes prohibitive storage and retrieval efficiency bottlenecks: to manage the immense memory footprint and computational overhead of billion-scale token vectors, state-of-the-art systems are forced to rely on aggressive dimension reduction and complex clustering (e.g., K-means). This compromise introduces two critical limitations: excessive indexing latency from clustering large-scale corpora and semantic information loss inherent to compression. In this paper, we propose Single-stage Sparse Retrieval (SSR), a paradigm shift that replaces expensive clustering with efficient sparse coding. Instead of compressing features into low-dimensional dense vectors, we utilize a Sparse Autoencoder (SAE) to project token embeddings into a high-dimensional but highly sparse representation. This transformation enables us to bypass vector clustering entirely and leverage inverted indexing for precise, high-throughput retrieval. Extensive experiments on the BEIR benchmark demonstrate that SSR achieves a "trifecta" of improvements: it reduces indexing time by 15x compared to ColBERTv2, halves retrieval latency, and simultaneously improves retrieval performance over leading baselines.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 61 (char 284)

310. Conformal Certification of Reasoning Trace Prefixes [FAIL]

Score: 0.0 / 27.8

Authors: Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

Published: 2026-05-28

TL;DR: This paper introduces CROP, a conformal certification procedure that statistically guarantees the safety of reasoning trace prefixes by selecting calibrated thresholds to retain valid intermediate steps while discarding erroneous suffixes, thereby improving downstream repair accuracy.

Abstract (Chinese translation)

语言模型的推理轨迹很少是非黑即白的；它们经常在出现关键错误之前包含有效的中间步骤。现有的不确定性量化方法通常对最终答案或整个回复进行认证，无法为序列轨迹中可安全保留的比例提供统计保证。为此，我们引入了 CROP（Conformal Reasoning Output Prefixes），这是一种与验证器无关的校准方法，用于干净前缀的认证。给定任意步骤级风险代理，CROP 选择一个校准阈值，并返回其步骤风险代理均低于该阈值的最长连续前缀，将未认证的后缀路由至下游以供审查或修复。在假设可交换性的条件下，CROP 严格控制返回前缀包含标注错误的边缘概率。在六个过程标注的推理数据集上，我们表明标准步骤级指标（如 AUROC）未能完全捕捉前缀效用，因此建议应通过认证前缀长度来评估验证器。此外，CROP 平衡了过度保留与保留不足，通过保留有效中间推理并丢弃误导性后缀，从而提高下游修复的准确率。最终，这项工作将前缀认证定位为过程监督、回避与修复之间一个严谨而实用的桥梁。

Abstract

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
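The threshold-then-prefix mechanism described in the abstract can be sketched with a generic split-conformal construction. This is an assumption-laden illustration, not the authors' exact CROP procedure: the per-trace score (maximum step risk up to and including the first annotated error), the quantile index, and all names are choices made here for the example.

```python
import math


def first_error_score(step_risks, first_error_index):
    """Largest step risk up to and including the first annotated error.

    A prefix cut with threshold t includes the error exactly when this score is
    below t. Error-free traces get +inf, since no threshold can make them fail.
    """
    if first_error_index is None:
        return math.inf
    return max(step_risks[: first_error_index + 1])


def calibrate_threshold(calibration_traces, alpha):
    """Pick a threshold so that, assuming exchangeability, the returned prefix
    contains an annotated error with probability at most alpha (marginally).

    `calibration_traces` is a list of (step_risks, first_error_index) pairs.
    """
    scores = sorted(first_error_score(r, e) for r, e in calibration_traces)
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little calibration data: certify nothing
    return scores[k - 1]


def certified_prefix_length(step_risks, threshold):
    """Length of the longest contiguous prefix whose risks stay strictly below threshold."""
    length = 0
    for risk in step_risks:
        if risk >= threshold:
            break
        length += 1
    return length


calibration = [
    ([0.1, 0.2, 0.9], 2),
    ([0.1, 0.8], 1),
    ([0.3, 0.4], None),
    ([0.2, 0.7, 0.1], 1),
]
tau = calibrate_threshold(calibration, alpha=0.4)
kept = certified_prefix_length([0.1, 0.5, 0.85, 0.2], tau)
```

Steps after the certified prefix would be routed to review or repair. A better step-level risk proxy lets the same guarantee hold with longer certified prefixes, which is one way to read the abstract's suggestion to evaluate verifiers by certified prefix length.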

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on conformal certification for reasoning trace safety and uncertainty quantification in language models. The provided keywords pertain to multimodal architectures (Visual Encoder, MultiModal, MLLM), specific components (Tokenizer), and reinforcement learning/world models (World Models, model-based RL, Unify Models). There is no technical overlap between the paper's methodology of prefix certification and the specific domains covered by the keywords, thus all keywords receive a score of 0.0. No expert authors from the specified list were found in the author list.

Keywords

Conformal Certification, Reasoning Trace, Prefix Certification, Uncertainty Quantification, Step-level Risk, Process Supervision, Downstream Repair

311. Test Time Training for Supervised Causal Learning [FAIL]

Score: 0.0 / 27.8

Authors: Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun, Jinzhuo Wang, Qiang Fu, Shi Han, Dongmei Zhang

Published: 2026-05-28

TL;DR: This paper proposes a Test-Time Training framework for Supervised Causal Learning to address out-of-distribution generalization challenges by dynamically generating training sets aligned with test instances, significantly outperforming existing methods.

Abstract (Chinese translation)

监督因果学习（SCL）通过将因果发现视为监督学习问题，在因果发现方面展现出潜力。然而，它面临着显著的分布外泛化挑战。我们揭示了先前 SCL 实践的三大局限性：合成基准与真实数据之间存在显著的性能差距、对分布偏移的脆弱性，以及组合泛化能力的缺失，这共同质疑了其现实世界的适用性。为了解决这一问题，我们提出了监督因果学习的测试时训练（TTT-SCL），这是一种新颖的框架，能够动态生成与任何特定测试实例显式对齐的训练集。我们展示了 TTT-SCL 与基于分数方法之间的关联，并基于经典评分函数设计了一个用于生成训练集的高效模块。在合成基准、伪真实和真实数据集上的实验表明，TTT-SCL 显著优于现有的 SCL 和传统因果发现方法。

Abstract

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on Supervised Causal Learning and Test-Time Training for distribution shift adaptation, while the provided keywords relate to Multimodal LLMs, World Models, and RL. There is no conceptual overlap between the paper's content and the specified keywords, resulting in zero relevance for all. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Supervised Causal Learning, Test-Time Training, Out-of-Distribution Generalization, Causal Discovery, Distribution Shifts, Score-based Methods, Real-world Datasets

312. Meta-Programming for Linear-time Temporal Answer Set Programming [FAIL]

Score: 0.0 / 27.8

Authors: Susana Hahn, Amade Nems, Javier Romero, Torsten Schaub

Published: 2026-05-28

TL;DR: This paper proposes a flexible meta-programming framework for temporal Answer Set Programming to overcome the rigidity of existing ASP systems, enabling rapid exploration of temporal logics.

Abstract (Chinese translation)

答案集规划（ASP）的时间扩展催生了非单调线性时间（TEL）、动态（DEL）和度量（MEL）时间均衡逻辑。然而，高度优化的 ASP 系统的内在刚性往往阻碍了对替代逻辑设计的快速探索与实现。本文提出了一种灵活的元编程框架，通过统一的声明式框架实现了多种时间逻辑的语义操作化。该方法通过向 clingo 的理论语法中添加形式化类型规范和嵌套能力，扩展了标准的 ASP 元编程。为确保语义正确性，我们引入了一种转换管道，在接地过程中保护嵌套模态免受基于稳定模型的简化。我们通过实现 TEL、MEL 和 DEL 的元编码，展示了该框架的可扩展性。我们对 TEL 提供了全面阐述，并突出了管理 MEL 区间约束及 DEL 中 Fischer-Ladner 闭合（Fischer-Ladner closure）的关键特性。最后，我们引入了 metasp 系统，这是一个封装该工作流的多功能工具。

Abstract

The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on Answer Set Programming (ASP) and temporal logic meta-programming within symbolic AI, whereas the provided keywords target Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is no technical overlap regarding Tokenizers, Visual Encoders, or Model-Based RL architectures. The term 'unified' in the abstract refers to logical semantics, not the model architecture unification implied by the keyword 'Unify Models' in this context. No expert authors from the specified list are present in the authorship.

Keywords

Meta-Programming, Answer Set Programming, Temporal Logics, clingo, Declarative Framework, metasp system, temporal equilibrium logics

313. Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction [FAIL]

Score: 0.0 / 27.8

Authors: Hongtao Wang, Se Yang, Yu Chen, Puzhuo Liu

Published: 2026-05-28

Abstract (Chinese translation)

大语言模型（LLM）代理越来越多地利用长期记忆来支持持久且自主的任务执行。然而，这种能力也引入了一个新的攻击面：记忆中毒，其中攻击者可以注入恶意信息以影响其后续行为。现有的记忆中毒攻击通常假设注入的内容可以直接存储在记忆中，忽略了现代记忆管道中的选择性提取和重写阶段。这使得先前方法在实际场景下失效。在本文中，我们提出 MemPoison，一种新颖的记忆中毒攻击，它绕过 LLM 代理中的选择性记忆机制，攻击者可以通过对话交互向代理的长期记忆中注入可触发的后门，从而误导其后续响应。MemPoison 引入了三个关键组件：(i) 一个语义关系桥，将触发器和载荷绑定为一个连贯的语句，以确保它们一起被提取到记忆中；(ii) 实体伪装，优化触发器以模仿命名实体，抵抗重写；以及 (iii) 联合嵌入优化，将注入触发器的文本塑造为嵌入空间中的紧密簇，同时保持与良性嵌入的隔离以实现隐蔽性。在不同代理领域和记忆机制上的评估显示，MemPoison 实现了高达 0.95 的攻击成功率，优于现有基线方法。机制分析表明，该攻击利用了嵌入空间各向异性并改变了注意力模式，突出了选择性记忆系统中的核心漏洞。我们评估了多种防御策略，并展示了它们在缓解攻击方面的根本局限性。

Abstract

Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this capability also introduces a new attack surface: memory poisoning, where adversaries can inject malicious information to influence future behavior. Existing memory poisoning attacks often assume that injected content can be stored directly in memory, overlooking the selective extraction and rewriting stages in modern memory pipelines. This makes prior methods ineffective under realistic settings. In this paper, we propose MemPoison, a novel memory poisoning attack that bypasses selective memory mechanisms in LLM agents, where an attacker can inject triggerable backdoors into the agent's long-term memory through dialogue interactions, thereby misleading its subsequent responses. MemPoison introduces three key components: (i) a semantic relational bridge that binds the trigger and payload into a coherent statement to ensure they are extracted into memory together; (ii) entity masquerading that optimizes triggers to mimic named entities, resisting rewriting; and (iii) joint embedding optimization that shapes trigger-injected texts into a tight cluster in the embedding space while maintaining isolation from benign embeddings for stealth. Evaluations across different agent domains and memory mechanisms show MemPoison achieves attack success rates up to 0.95, outperforming existing baselines. Mechanistic analysis indicates that the attack exploits embedding-space anisotropy and shifts attention patterns, highlighting core vulnerabilities in selective memory systems. We evaluate multiple defense strategies and demonstrate their fundamental limitations in mitigating the attack.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 98 (char 321)

314. It's All About Speed: AI's Impact on Workflow in Music Production [FAIL]

Score: 0.0 / 27.8

Authors: Finn McClellan, Fabio Morreale

Published: 2026-05-28

TL;DR: This ethnographic study investigates the impact of AI tools on music production workflows, revealing tensions between efficiency gains and the preservation of creative control among professionals.

Abstract (Chinese translation)

本文呈现了一项关于人工智能（AI）和自动化工具对音乐制作流程影响的民族志研究结果。针对认同为录音工程师、混音师及制作人的专业参与者，本文探讨了他们对常见人工智能（AI）和自动化软件的使用情况，以及对这些工具普及化的态度。我们探讨了在速度与效率、可控性以及保持创作能动性（creative agency）等关键领域，用户与自动工具之间可能产生的张力，并讨论了如何通过工具设计来缓解这些张力。

Abstract

In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focusing specifically on professional participants who identified as recording engineers, mixers, and producers, we discuss their usage of common AI and automated software, as well as their sentiments on the proliferation of these tools. We discuss tensions that may be created between users and automated tools in key areas such as the need for speed and efficiency, controllability, and maintaining creative agency, and how these tensions may be alleviated through tool design.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper is an ethnographic study focusing on human factors (workflow, creativity, efficiency) regarding AI usage in music production. The provided keywords pertain to technical deep learning architectures (multimodal models, encoders, tokenizers, reinforcement learning). There is no overlap in content regarding model construction or algorithmic methods, hence all technical keywords score 0.

Keywords

AI impact, music production, workflow, ethnographic study, creative agency, tool design, automated software

315. The Interplay Between Interpolation and Aggregation in Regression: Optimal Sample Complexity [FAIL]

Score: 0.0 / 27.8

Authors: Mikael Møller Høgsgaard, Kasper Green Larsen, Liang-Yu Zou

Published: 2026-05-28

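TL;DR: This paper theoretically studies the interplay between interpolation and aggregation in regression, proving that median aggregation of three interpolating hypotheses achieves optimal sample complexity for learnability.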

Abstract (Chinese translation)

本文从理论上研究了回归中插值 (interpolation) 与聚合 (aggregation) 之间的相互作用。我们确立了 γ-图维度 (γ-graph dimension) 刻画了一大类自然聚合过程的可学习性。此外，我们证明了一种极其简单的聚合过程，即通过中位数结合三个插值假设，在所有这些聚合过程中是最优的，并且严格强于恰当学习 (proper learning)。最后，我们表明某些假设类仅通过聚合无限多个假设或使用非插值聚合规则（其预测可能超出其输入范围）才可学习，而任何有限插值聚合甚至无法达到平凡性能。

Abstract

This work investigates theoretically the interplay between interpolation and aggregation in regression. We establish that the γ-graph dimension characterizes learnability for a broad class of natural aggregation procedures. Furthermore, we prove that an extremely simple aggregation procedure, combining three interpolating hypotheses via the median, is optimal among all these aggregation procedures, and is strictly more powerful than proper learning. Finally, we show that some hypothesis classes are learnable only by aggregating infinitely many hypotheses or by using non-interpolating aggregation rules (which may predict outside the range of their inputs), and any finite interpolating aggregation fails to achieve even trivial performance.
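The "median of three" aggregation rule is simple to write down. The sketch below shows only the aggregation step, with three hand-written stand-in hypotheses. How the paper obtains its three interpolating hypotheses from data is not stated in the abstract and is not modeled here.

```python
from statistics import median


def median_aggregate(hypotheses):
    """Combine regression hypotheses into one predictor via the pointwise median.

    Each hypothesis is a callable x -> float. The median always lies between the
    smallest and largest individual prediction, so this rule never predicts
    outside the range of its inputs.
    """
    def predictor(x):
        return median(h(x) for h in hypotheses)
    return predictor


# Three stand-in hypotheses (illustrative only).
h1 = lambda x: x
h2 = lambda x: 2 * x
h3 = lambda x: x + 1

aggregated = median_aggregate([h1, h2, h3])
prediction = aggregated(3)  # individual predictions are 3, 6 and 4
```

With three hypotheses, the median discards the single most extreme prediction at each point, so one badly wrong hypothesis cannot move the output beyond the other two.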

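Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper is mainly a theoretical analysis of interpolation and aggregation in regression and falls under statistical learning theory. The provided keyword set (e.g., Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) points to multimodal large models, world models, and reinforcement learning, and has no direct connection to the paper's topics (regression, interpolation, aggregation, sample complexity), so the relevance of every keyword is 0.

Keywords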

Regression, Interpolation, Aggregation, Sample Complexity, Learnability, Graph Dimension, Hypothesis Classes

316. Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting [FAIL]

Score: 0.0 / 27.8

Authors: Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun

Published: 2026-05-28

TL;DR: Bastion introduces a budget-aware speculative decoding framework using tree-structured block diffusion drafting to accelerate LLM inference without training, achieving up to 6.61x speedup.

Abstract (Chinese translation)

块扩散草稿生成器（Block-diffusion drafters）最近已成为推测解码的一种强大替代方案，能够在单个并行步骤中预测多个未来 token 分布。然而，由于这些并行预测是从位置边缘分布（position-wise marginals）中采样，而非完全条件序列（fully conditioned sequences），因此选择单一的贪心路径往往无法捕捉目标模型（target model）的偏好轨迹。为了解决这一问题，我们提出了 BASTION，这是一种预算感知（budget-aware）的推测解码框架，采用基于树的扩散草稿生成（tree-based diffusion drafting）。与依赖静态树拓扑（static tree topologies）的现有方法不同，BASTION 通过平衡草稿质量与硬件约束（hardware constraints），动态构建查询依赖的树（query-dependent trees）。我们的框架集成了三个协同组件：(1) 一个接受代理（acceptance surrogate），通过路径置信度（path confidence）估计期望接受长度（expected accepted length）；(2) 一个在线延迟估计器（online latency estimator），校准硬件感知屋顶线模型（hardware-aware roofline model）；(3) 一个自适应最佳优先扩展（adaptive best-first expansion），扩展树直至边际收益（marginal gains）不再足以证明增量验证成本（incremental verification costs）合理。BASTION 无需训练（training-free），保持目标模型的分布，且无需针对特定设置的调优（per-setting tuning）。在多样化的基准测试和 GPU 架构上，BASTION 相比标准自回归解码（standard autoregressive decoding）实现了高达 6.61 倍的加速比，优于最先进的块扩散基线（state-of-the-art block-diffusion baselines）39%。

Abstract

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
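The third component, best-first expansion that stops once marginal gain no longer justifies verification cost, can be illustrated with a toy sketch. It assumes that a node's gain is its path confidence (the product of position-wise marginal probabilities) and that verification cost is a fixed constant per node. The constant stands in for the paper's online latency estimator, which is not modeled, and all names are hypothetical.

```python
import heapq


def build_draft_tree(candidates, cost_per_node, max_nodes):
    """Grow a draft tree best-first by path confidence.

    `candidates[d]` is a list of (token, probability) pairs for draft position d,
    taken from position-wise marginals. Expansion stops when the most confident
    remaining node has path confidence below `cost_per_node` (an assumed,
    normalized stand-in for incremental verification cost) or when the node
    budget `max_nodes` is used up. Returns a list of (path, path_confidence).
    """
    # Max-heap via negated confidence; the path tuple breaks ties.
    frontier = [(-prob, (token,)) for token, prob in candidates[0]]
    heapq.heapify(frontier)
    tree = []
    while frontier and len(tree) < max_nodes:
        negative_conf, path = heapq.heappop(frontier)
        confidence = -negative_conf
        if confidence < cost_per_node:
            break  # children are never more confident, so nothing left is worth adding
        tree.append((path, confidence))
        depth = len(path)
        if depth < len(candidates):
            for token, prob in candidates[depth]:
                heapq.heappush(frontier, (-(confidence * prob), path + (token,)))
    return tree


marginals = [
    [("a", 0.9), ("b", 0.1)],
    [("c", 0.8), ("d", 0.2)],
]
tree = build_draft_tree(marginals, cost_per_node=0.15, max_nodes=10)
```

Because a child's path confidence never exceeds its parent's, nodes come off the heap in non-increasing confidence order, which makes the early stop safe. In this toy setting the resulting tree shape depends on the query's marginals rather than on a fixed topology.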

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on LLM inference acceleration via speculative decoding and block diffusion drafting. It does not involve multimodal processing (MLLM, MultiModal, Visual Encoder), tokenization design (Tokenizer), unifying diverse model architectures (Unify Models), learning world representations (World Models), or reinforcement learning (model-based RL). Thus, there is negligible relevance to the provided keywords which pertain to multimodal and RL domains.

Keywords

Speculative Decoding, Block Diffusion, Tree-structured Drafting, Budget-Aware, Inference Acceleration, Hardware Constraints, LLM Efficiency

317. Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets [FAIL]

Score: 0.0 / 27.8

Authors: Zhichao Chen, Yongle Zhao, Kaicheng Yang, Meng Yang, Yin Xie, Ziyong Feng

Published: 2026-05-28

TL;DR: This paper proposes a validation-free metric called Intrinsic Quality to estimate face recognition dataset potential using neighbor consistency and subspace complexity without full-scale training.

Abstract (Chinese translation)

我们提出了一种内在质量（Intrinsic Quality, IQ），这是一种免验证指标，旨在估计人脸识别（Face Recognition, FR）数据集产生高性能模型的内在潜力，而无需进行全规模训练。IQ 整合了两个组件：（i）一种通过近邻量化局部身份标签一致性的邻居一致性分数（Neighbor-Consistency Score），以及（ii）全局表示子空间复杂度（Global Representation Subspace Complexity，有效秩（Effective Rank, ER）），它捕捉了底层嵌入几何结构和数据集多样性。IQ 允许使用轻量级代理模型或数据子集进行快速评估，从而在资源密集型全规模训练之前促进数据集的诊断与整理。我们描述了一种针对干净、噪声及混合质量 FR 数据集量身定制的实验协议，并概述了评估方法以验证 IQ 对下游性能的预测能力。

Abstract

We propose Intrinsic Quality (IQ), a validation-free metric designed to estimate the inherent potential of face recognition (FR) datasets to produce high-performance models without the need for full-scale training. IQ integrates two components: (i) a Neighbor-Consistency Score that quantifies local identity label agreement via nearest neighbors, and (ii) Global Representation Subspace Complexity (Effective Rank, ER), which captures the underlying embedding geometry and dataset diversity. IQ allows for rapid evaluation using lightweight proxy models or data subsets, facilitating dataset diagnosis and curation prior to resource-intensive full-scale training. We describe an experimental protocol tailored to clean, noisy, and mixed-quality FR datasets, and outline evaluation methodologies to validate IQ's predictive power for downstream performance.
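The two ingredients named in the abstract can be sketched in a few lines. The definitions below are common ones and are assumptions here: neighbor consistency as the mean fraction of k nearest neighbors (Euclidean distance) sharing a sample's identity label, and effective rank as the exponential of the Shannon entropy of the normalized singular values. The paper's exact formulas, and how it combines the two into IQ, are not given in the abstract, so no combination is shown.

```python
import math


def neighbor_consistency(embeddings, labels, k):
    """Mean fraction of each sample's k nearest neighbours that share its label."""
    total = 0.0
    for i, (emb, label) in enumerate(zip(embeddings, labels)):
        # Distances to every other sample, nearest first.
        by_distance = sorted(
            (math.dist(emb, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        neighbours = [j for _, j in by_distance[:k]]
        total += sum(labels[j] == label for j in neighbours) / len(neighbours)
    return total / len(embeddings)


def effective_rank(singular_values):
    """exp(Shannon entropy) of the normalized singular-value distribution.

    The singular values of the embedding matrix are assumed to be computed
    elsewhere (for example with a linear-algebra library).
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
identities = ["id_a", "id_a", "id_b", "id_b"]
consistency = neighbor_consistency(embeddings, identities, k=1)
rank = effective_rank([1.0, 1.0, 1.0, 1.0])
```

Under these definitions, label noise lowers the consistency score, while embeddings confined to a few directions lower the effective rank. Both could be measured on a lightweight proxy model or a data subset, in line with the abstract's goal of avoiding full-scale training.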

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on face recognition dataset quality estimation using intrinsic metrics, while the provided keywords target multimodal large models, world models, and reinforcement learning. There is no thematic overlap regarding tokenizers, visual encoders for MLLM, world models, or RL algorithms, resulting in zero relevance for all specified keywords.

Keywords

Face Recognition, Dataset Quality, Intrinsic Quality, Validation-Free, Neighbor-Consistency Score, Effective Rank, Representation Subspace Complexity, Dataset Curation

318. A Systematic Evaluation of Molecular Mixture Behavior Prediction [FAIL]

Score: 0.0 / 27.8

Authors: Roel J. Leenhouts, Nathan K. Morgan, William Green, Jan G. Rittig, Florence H. Vermeire

Published: 2026-05-28

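TL;DR: This paper proposes an evaluation framework that decomposes mixture-property prediction error in order to separate pure-component contributions from non-ideal mixing interactions, and finds that transfer to unseen molecules is the main challenge.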

Abstract (Chinese translation)

分子属性预测的机器学习 (Machine Learning) 主要关注纯化合物 (Pure Compounds)，尽管许多实际应用依赖于具有分子间相互作用 (Intermolecular Interactions) 的混合物 (Mixtures)。近期工作已扩展了混合物数据集 (Mixture Datasets) 的可用性，但评估 (Evaluation) 仍主要聚焦于绝对精度 (Absolute Accuracy)。然而，混合物中的绝对误差 (Absolute Errors) 混淆了纯组分贡献 (Pure-Component Contributions) 与理想混合偏差 (Deviations from Ideal Mixing)。我们提出一个评估框架 (Evaluation Framework)，该框架将混合物属性误差 (Mixture-Property Error) 分解为纯组分 (Pure-Compound) 和相互作用（非理想）组分 (Interaction (Non-Ideal) Components)。该框架结合了考虑泄漏的划分协议 (Leakage-Aware Split Protocols)、理想混合物基线 (Ideal-Mixture Baselines) 和超额性质指标 (Excess-Property Metrics)。为了支持可复现的基准测试 (Reproducible Benchmarking)，我们整理了七个匹配的纯化合物和混合物理化性质数据集 (Physicochemical Property Datasets)。在多个混合物属性任务和模型族 (Model Families) 中，我们发现强绝对精度可能掩盖非理想混合物行为 (Non-Ideal Mixture Behavior) 恢复不佳的情况，且在严格分子划分 (Strict Molecule Splits) 下性能大幅下降。这些结果表明，向未见分子 (Unseen Molecules) 的迁移 (Transfer) 是分子混合物机器学习的核心挑战，并激励超越绝对精度 (Beyond Absolute Accuracy) 的评估。

Abstract

Machine learning for molecular property prediction has focused largely on pure compounds, even though many practical applications depend on mixtures with intermolecular interactions. Recent work has expanded the availability of mixture datasets, but evaluation still focuses mainly on absolute accuracy. However, absolute errors in mixtures conflate pure-component contributions with deviations from ideal mixing. We propose an evaluation framework that decomposes mixture-property error into pure-compound and interaction (non-ideal) components. The framework combines leakage-aware split protocols, ideal-mixture baselines, and excess-property metrics. To support reproducible benchmarking, we curate seven matched pure and mixture physicochemical property datasets. Across multiple mixture-property tasks and model families, we find that strong absolute accuracy can mask poor recovery of non-ideal mixture behavior, and that performance drops substantially under strict molecule splits. These results identify transfer to unseen molecules as a central challenge in molecular mixture machine learning and motivate evaluation beyond absolute accuracy alone.
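The decomposition described in the abstract can be sketched for a single mixture. This is a minimal illustration that assumes a linear, mole-fraction-weighted mixing rule as the ideal baseline (other properties may call for a different ideal rule); the function names and all numbers are hypothetical and not taken from the paper.

```python
def ideal_mixture(pure_values, mole_fractions):
    """Linear (mole-fraction-weighted) mixing rule, used here as the ideal baseline."""
    return sum(x * p for x, p in zip(mole_fractions, pure_values))

def decompose_error(pred_mix, true_mix, pred_pure, true_pure, mole_fractions):
    """Split the mixture error into a pure-component part and an excess part."""
    ideal_pred = ideal_mixture(pred_pure, mole_fractions)
    ideal_true = ideal_mixture(true_pure, mole_fractions)
    pure_error = ideal_pred - ideal_true
    # Excess property = mixture value minus its ideal-mixing baseline.
    excess_error = (pred_mix - ideal_pred) - (true_mix - ideal_true)
    return pure_error, excess_error

# Hypothetical binary mixture at equal mole fractions.
x = [0.5, 0.5]
true_pure = [1.0, 2.0]
pred_pure = [1.0, 2.0]   # pure components predicted perfectly
true_mix = 1.25          # true excess = 1.25 - 1.5 = -0.25
pred_mix = 1.5           # the model simply predicts ideal mixing
pure_err, excess_err = decompose_error(pred_mix, true_mix, pred_pure, true_pure, x)
```

Here the whole error sits in the excess term: the model is exact on the pure compounds yet misses the non-ideal behavior entirely, which is the failure mode that an absolute-accuracy metric alone would not reveal.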

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

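Scoring rationale: This paper studies an evaluation framework for molecular mixture property prediction and belongs to computational chemistry and machine learning. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all point to multimodal large models, visual encoding, and reinforcement learning architectures. The paper has no direct connection to these keywords and does not involve the related techniques or concepts, so every keyword is scored 0. The author list does not include any of the designated experts, so no bonus applies. The weighted total is 0, below the dynamic pass threshold of 27.8.

Keywords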

Molecular Mixture, Property Prediction, Evaluation Framework, Non-ideal Mixing, Transfer Learning, Physicochemical Properties, Machine Learning

319. Momentum Based Reward Design for Low Emission Traffic Signal Control [FAIL]

Score: 0.0 / 27.8

Authors: Chinmay Mundane, Amith Manoharan, Arun Singh

Published: 2026-05-28

TL;DR: This paper proposes a Momentum-Based Reward Function for Deep Reinforcement Learning in traffic signal control to optimize throughput and reduce CO2 emissions compared to traditional reward methods.

Abstract (Chinese translation)

城市交通拥堵是一个日益严峻的全球性问题，显著加剧了通勤时间延长和环境污染。传统交通信号控制系统往往难以适应动态交通状况。自适应交通信号控制可以在无需改变道路基础设施的情况下改善城市交通。深度强化学习（DRL）在此任务中表现出强大的性能，但现有的基于延迟和队列的奖励往往会产生短视或不稳定的策略。本文提出了一种基于动量的奖励函数（MBRF），旨在鼓励车辆保持行驶，而非单纯惩罚拥堵。该方法在 SUMO（城市交通移动模拟器）中使用标准交通指标进行评估，包括等待时间、队列长度、吞吐量和二氧化碳排放。结果表明，与基于延迟或队列的奖励以及 Max Pressure 和 LQF 等经典控制器相比，所提出的奖励实现了更好的吞吐量与排放权衡，并表现出更稳定的学习行为。

Abstract

Urban traffic congestion is a growing global issue contributing significantly to long commute times and environmental pollution. Traditional traffic signal control systems often fail to adapt to dynamic traffic conditions. Adaptive traffic signal control can improve urban traffic without changing road infrastructure. Deep Reinforcement Learning (DRL) has shown strong performance for this task, but existing delay and queue-based rewards often produce short-sighted or unstable policies. This paper proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone. The method is evaluated in SUMO (Simulation of Urban MObility) using standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. Results show that the proposed reward produces better throughput-emission trade-offs and more stable learning behavior than delay or queue-based rewards, as well as classical controllers such as Max Pressure and LQF.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on Deep Reinforcement Learning (DRL) for traffic signal control with a novel momentum-based reward function. It does not involve Unify Models, Tokenizers, Visual Encoders, World Models, MLLMs, or Multimodal architectures. Although it uses RL, the focus is on reward shaping in a likely model-free setting rather than model-based RL mechanisms, resulting in negligible relevance to the provided keyword set, which targets multimodal/LLM/world model research.

Keywords

Traffic Signal Control, Deep Reinforcement Learning, Momentum-Based Reward, CO2 Emissions, SUMO Simulator, Adaptive Control, Throughput, Emission Trade-off

320. A Novel Tensor Product-Based Neural Network for Solving Partial Differential Equations [FAIL]

Score: 0.0 / 27.8

Authors: Qihong Yang, Yangtao Deng, Qiaolin He, Shiquan Zhang

Published: 2026-05-28

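TL;DR: This paper proposes a PDE solver based on a tensor-product network that replaces gradient-based training with a least-squares solve, achieving higher accuracy and faster training than physics-informed neural networks.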

Abstract (Chinese translation)

本文提出了张量积网络（TPNet），一种用于高效准确函数逼近及偏微分方程（PDE）求解的新型神经网络架构。该提案的核心在于将解显式构造为集成到网络中的基函数的线性组合，系数由直接最小二乘求解确定，从而绕过了传统的基于梯度的训练。关键方法论贡献包括：（1）一种高效的张量积方案，通过组合两组子网络输出生成多维基函数，在保持表达能力的同时显著降低模型复杂度和参数数量；（2）一种分块时间推进策略，以提高长时间模拟的计算效率；以及（3）一种线性重构策略，通过将已知非线性项视为源项来处理非线性 PDE。TPNet 相较于常规神经网络求解器，实现了更高的精度和更短的训练时间。这种性能提升源于其结构化设计和确定性最小二乘拟合，这与主流方法（如物理信息神经网络（PINNs））所需的迭代式且通常计算密集的优化形成对比。

Abstract

This paper presents the Tensor Product Network (TPNet), a novel neural architecture for efficient and accurate function approximation and PDE solving. The core of the proposal involves constructing the solution explicitly as a linear combination of basis functions integrated into the network, with coefficients determined by a direct least-squares solve, thereby bypassing traditional gradient-based training. The key methodological contributions include: (1) an efficient tensor-product scheme that generates multi-dimensional basis functions from combinations of two sets of subnetwork outputs, significantly reducing model complexity and parameter count while maintaining expressivity; (2) a block time-marching strategy to improve computational efficiency in long-time simulations; and (3) a linear reformulation strategy for handling nonlinear PDEs by treating known nonlinear terms as sources. TPNet achieves superior accuracy and shorter training times than conventional neural network solvers. This performance gain stems from its structured design and deterministic least-squares fitting, which contrast with the iterative, often computationally intensive optimization required by mainstream methods like Physics-Informed Neural Networks (PINNs).
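The tensor-product construction and the least-squares solve can be sketched for plain function fitting in two dimensions. This is a toy under stated assumptions: the fixed `tanh` features below stand in for subnetwork outputs, the normal equations are solved with a small hand-written elimination routine, and every name and number is hypothetical rather than taken from TPNet.

```python
import math

def phi_x(x):
    """Stand-in for the first subnetwork's outputs (fixed features, not trained)."""
    return [1.0, math.tanh(x)]

def phi_y(y):
    """Stand-in for the second subnetwork's outputs."""
    return [1.0, math.tanh(2.0 * y)]

def tensor_basis(x, y):
    """Multi-dimensional basis: all products phi_x_i(x) * phi_y_j(y)."""
    return [a * b for a in phi_x(x) for b in phi_y(y)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_least_squares(points, targets):
    """Solve the normal equations (B^T B) c = B^T u for the coefficients c."""
    B = [tensor_basis(x, y) for x, y in points]
    m = len(B[0])
    BtB = [[sum(row[i] * row[j] for row in B) for j in range(m)] for i in range(m)]
    Btu = [sum(row[i] * t for row, t in zip(B, targets)) for i in range(m)]
    return solve(BtB, Btu)

# Sample a target that lies in the span of the basis, then recover its coefficients.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
points = [(x, y) for x in grid for y in grid]
true_coeffs = [0.5, -1.0, 2.0, 3.0]
targets = [sum(c * b for c, b in zip(true_coeffs, tensor_basis(x, y))) for x, y in points]
coeffs = fit_least_squares(points, targets)
```

Two one-dimensional feature sets of size 2 yield 4 two-dimensional basis functions, and the coefficients come from one linear solve with no gradient steps. For a linear PDE the rows of the design matrix would presumably hold the differential operator applied to each basis function at collocation points, which this sketch does not attempt.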

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

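Scoring rationale: The paper belongs to scientific computing and numerical analysis and focuses on solving partial differential equations with a tensor-product network. It is entirely unrelated to the keyword topics of multimodal large models, world models, and reinforcement learning, so every keyword is scored 0. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords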

Tensor Product Network, Partial Differential Equations, Least-Squares Solve, Basis Functions, Neural Architecture, Function Approximation, PDE Solving

321. Kernel Renormalization in Bayesian Deep Neural Networks: the Equivalent Wishart Ansatz in the Proportional Regime [FAIL]

Score: 0.0 / 27.8

Authors: Paolo Baglioni, Christian Keup, Vincenzo Zimbardo, Rosalba Pacelli, Alessandro Vezzani, Raffaella Burioni, Pietro Rotondo

Published: 2026-05-28

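TL;DR: This paper proposes an equivalent Wishart Ansatz for analyzing the generalization performance of Bayesian multi-layer perceptrons and convolutional neural networks in the proportional limit, capturing the dominant stochastic fluctuations through a kernel renormalization mechanism.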

Abstract (Chinese translation)

当训练集大小 $P$ 与深度神经网络的宽度 $N$ 以相同速率增长时的缩放极限，即所谓的比例宽度 regime，已在浅层、单隐藏层网络中被深入研究。然而，将这些非微扰结果从浅层架构扩展到深度非线性网络已被证明极具挑战性。在此，我们提出一种有效的近似方法，用于预测固定深度 $L$ 的贝叶斯多层感知机 (MLP) 在任意高维数据上的泛化性能。我们提出一个等效 Wishart Ansatz，以捕捉 MLP 层次化经验核的主导随机波动。这使得我们能够在比例极限下对 MLP 的配分函数进行大偏差分析，该分析以重正化 NNGP 核 (Neural Network Gaussian Process) 表示。在此描述中，即使在比例极限下的强表征学习也被编码在最多 $L$ 个标量序参量中，这些序参量通过自洽方式确定。将该方法扩展到卷积架构 (CNN)，我们识别出一种层次化局部核重正化机制，该机制允许量化由于有限宽度效应导致的 CNN 中大宽度核更复杂的数据依赖变换。我们在经典基准数据集上，通过有限深度神经网络（深度 $L \sim O(10)$ 且训练集大小 $P \sim O(10^3)$）的贝叶斯后验采样实验来验证我们的有效理论，发现整体吻合度非常好，同时存在两种不同类型的系统偏差。

Abstract

The scaling limit where both the size of the training set $P$ and the width $N$ of a deep neural network grow at the same rate, the so-called proportional-width regime, has been intensely studied for shallow, single-hidden-layer networks. However, extending these non-perturbative results from shallow architectures to deep non-linear networks has proven very challenging. Here we present an effective approximate approach to predict the generalization performance of Bayesian multi-layer perceptrons (MLPs) of fixed depth $L$ on arbitrary high-dimensional data. We propose an equivalent Wishart Ansatz to capture the dominant stochastic fluctuations of the hierarchical empirical kernels of MLPs. This allows us to perform a large deviation analysis for the partition function of MLPs in the proportional limit, expressed in terms of a renormalized NNGP kernel. In this description, even strong representation learning in the proportional limit is encoded in at most $L$ scalar order parameters, determined self-consistently. Extending the approach to convolutional architectures (CNNs), we identify a hierarchical local kernel renormalization mechanism, which allows one to quantify more complex data-dependent transformations of the large-width kernel in CNNs due to finite-width effects. We test our effective theory against sampling experiments from the Bayesian posterior of finite deep neural networks with depths $L \sim O(10)$ and $P \sim O(10^3)$ on classic benchmark datasets, finding overall very good agreement together with two distinct types of systematic deviations.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

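Scoring rationale: This paper focuses on the theory of kernel renormalization and the analysis of generalization performance for Bayesian deep neural networks in the proportional limit, at the intersection of statistical physics and machine learning theory. The provided keywords (such as multimodal large models, world models, reinforcement learning, and Tokenizer) mainly concern application-level multimodal alignment and reinforcement learning and have no direct connection to the paper's theoretical core (the Wishart Ansatz, kernel methods, MLP/CNN generalization), so the relevance score is 0.

Keywords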

Bayesian Deep Neural Networks, Proportional Regime, Wishart Ansatz, Kernel Renormalization, Generalization Performance, Multi-layer Perceptrons, Convolutional Architectures

322. AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training [FAIL]

Score: 0.0 / 27.8

Authors: Ling Chen, Houming Wu, Wenjie Yu

Published: 2026-05-28

TL;DR: The paper proposes Asynchronous Multi-Directional Pipeline Parallelism (AMDP) to mitigate parameter mismatch in large-scale model training while maintaining high utilization and convergence.

Abstract (Chinese translation)

Pipeline parallelism（流水线并行）对于大规模模型训练至关重要，但现有的异步方法通常由于前向和反向传播（forward and backward passes）之间的参数不匹配而降低收敛性。我们提出 Asynchronous Multi-Directional Pipeline parallelism（AMDP）以缓解这一问题，同时保持高利用率。AMDP 限制每个流水线的第一阶段在反向传播（backpropagation）之前最多处理两个 minibatches（小批量），从而限制了前向和反向传播之间参数更新的数量。为了缓解由此产生的 pipeline bubbles（流水线气泡），AMDP 启动多个并发流水线，并根据流水线深度调整其数量。此外，AMDP 跨 minibatches 累积梯度并在单次更新中应用它们，确保只有有限数量的 minibatches 经历参数不匹配，且该不匹配限制在一个优化步内。在 GPT- 和 BERT-style 模型上的实验表明，AMDP 显著加速了训练过程，同时保持了收敛性。

Abstract

Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional Pipeline parallelism (AMDP) to mitigate this issue while sustaining high utilization. AMDP limits the first stage of each pipeline to process at most two minibatches before backpropagation, bounding the number of parameter updates between forward and backward passes. To alleviate the resulting pipeline bubbles, AMDP launches multiple concurrent pipelines and adapts their number according to pipeline depth. In addition, AMDP accumulates gradients across minibatches and applies them in a single update, ensuring that only a bounded number of minibatches experience parameter mismatch, limited to within one optimization step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates training while preserving convergence.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on distributed training infrastructure (pipeline parallelism) for large-scale models, addressing parameter mismatch and convergence issues. It does not involve multimodal representation, world models, reinforcement learning, or specific components like tokenizers/encoders. None of the specified expert authors are listed.

Keywords

Pipeline Parallelism, Asynchronous Training, Large-Scale Models, Parameter Mismatch, Gradient Accumulation, Distributed Training, Training Efficiency

323. The Sample Complexity of Multiclass and Sparse Contextual Bandits [FAIL]

Score: 0.0 / 27.8

Authors: Liad Erez, Fan Chen, Alon Cohen, Tomer Koren, Yishay Mansour, Shay Moran, Alexander Rakhlin

Published: 2026-05-28

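TL;DR: This paper studies the sample complexity of sparse contextual bandits in the stochastic setting and designs algorithms with optimal sample complexity using the decision-estimation coefficient framework.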

Abstract (Chinese translation)

我们在随机独立同分布设定下研究 Contextual Bandits（上下文老虎机），其中学习者观察到从未知分布中采样的上下文，从有限集合 $A$ 中选择动作，并旨在基于 Bandit Feedback（老虎机反馈）从给定类中识别出一个近似最优的 Policy（策略）。受具有零一奖励的 Bandit Multiclass Classification（老虎机多类分类）启发，我们关注 $s$-sparse 设定，其中对于每个上下文，奖励向量的 $L_1$-norm（$L_1$-范数）至多为 $s \ll |A|$。我们的主要结果是设计算法，这些算法以高概率输出一个相对于 Policy Class $Π$ 的 $ε$-optimal Policy，使用 $\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$ 个样本。我们将此界限扩展到一般的 Natarajan classes（Natarajan 类），并用一个匹配的下界（在对数因子范围内）加以补充，从而填补了先前工作（Erez et al., 2024, 2025）留下的巨大差距，后者引入了额外的 $Θ(|A|^9)$ 依赖。我们通过两种互补的方法获得了这些结果。首先，我们通过具有结构化观测的 Contextual Decision Making（上下文决策）的视角分析 Contextual Bandits，设计了一种 Exploration-by-Optimization（优化探索）算法，其 Sample Complexity（样本复杂度）由 decision-estimation coefficient (DEC; Foster et al., 2021, 2022) 控制。我们证明，在 $s$-sparse 奖励下，诱导的模型类允许一个与 $s$ 成正比的精确 DEC 界限，并直接导出最优率。由于这种方法主要是信息论的，并且涉及解决复杂的 Min-Max Optimization（极小极大优化）问题，我们还开发了一种基于 Low-Variance Exploration Technique（低方差探索技术）的第二种更专门的算法方法。这种方法导出了具体且可行的算法，并自然扩展到 Contextual Combinatorial Semi-bandits（上下文组合半老虎机），从而为 Bandit Multiclass List Classification（老虎机多类列表分类）提供了改进的 Sample Complexity 保证。

Abstract

We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the $s$-sparse setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $ε$-optimal policy compared to policy class $Π$ using $\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $Θ(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the decision-estimation coefficient (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.
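To get a feel for the stated rate, the leading term can be evaluated for concrete sizes. The sketch below ignores constants and the polylogarithmic factors hidden by the tilde, and it reads the logarithmic factor as $\log(|Π|/δ)$, which is an assumption about the abstract's notation; the problem sizes are hypothetical.

```python
import math

def sparse_bound(s, num_actions, eps, log_term):
    """(s / eps^2 + |A| / eps) * log_term, ignoring constants and polylog factors."""
    return (s / eps**2 + num_actions / eps) * log_term

# Hypothetical problem sizes.
A, s, eps = 1000, 1, 0.1
log_term = math.log(10**6 / 0.05)          # assumed reading: log(|Pi| / delta)
sparse = sparse_bound(s, A, eps, log_term)
dense = sparse_bound(A, A, eps, log_term)  # no sparsity exploited: s = |A|
ratio = dense / sparse
```

With these numbers the $|A|/ε$ term dominates the sparse bound, and exploiting sparsity with $s = 1$ reduces the leading term by roughly an order of magnitude compared with setting $s = |A|$.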

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

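Scoring rationale: This paper mainly studies the sample complexity of contextual bandits and belongs to theoretical machine learning. The provided keywords (such as Unify Models, Tokenizer, Visual Encoder, World Models, and MLLM) all point to multimodal large models and generative architectures. Although contextual bandits are a form of reinforcement learning, the paper does not involve model learning, multimodal fusion, world-model construction, or unified model architectures, so it is largely unrelated to the technical directions the keywords represent.

Keywords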

Contextual Bandits, Sample Complexity, Sparse Rewards, Decision-Estimation Coefficient, Policy Class, Stochastic Setting, Exploration Algorithm

324. MoSSP: A Momentum-Based Single-Loop Stochastic Penalty Method for Nonconvex Constrained DC-Regularized Optimization [FAIL]

Score: 0.0 / 27.8

Authors: Luxuan Li, Chunfeng Cui, Xiao Wang

Published: 2026-05-28

TL;DR: This paper proposes MoSSP, a momentum-based single-loop stochastic penalty method for solving nonconvex constrained stochastic optimization problems with DC regularization, achieving provable oracle complexity guarantees.

Abstract (Chinese translation)

本文研究了一类具有差凸（DC）正则化的结构化非凸约束随机问题，其中可行集可能是非凸的，且 DC 正则化项的凹部分允许是非光滑的。核心挑战在于在保持非凸约束可行性的同时实现较优的 Oracle 复杂度。尽管单循环算法能有效求解无约束 DC 优化问题，但其在具有 DC 结构的约束优化中的潜力尚未得到充分探索。为填补这一空白，我们提出了 MoSSP，一种基于动量的单循环随机惩罚方法，用于此类问题并提供可证明的复杂度保证。关键思想是对惩罚项与凸 DC 部分之和的 Moreau 包络应用单个随机近端梯度步，同时并行计算凹部分的近端映射。我们推导出了两种算法变体：一种基于 Polyak 动量的版本，其 Oracle 复杂度为 $O(\varepsilon^{-4})$，用于寻找随机 $\varepsilon$-KKT 点；另一种改进版本引入递归动量，复杂度为 $O(\varepsilon^{-3})$。实验结果表明了所提算法的有效性。

Abstract

In this paper, we study a structured class of nonconvex constrained stochastic problems with difference-of-convex (DC) regularization, where the feasible set is possibly nonconvex and the concave part of the DC regularizer is allowed to be nonsmooth. The fundamental challenge lies in maintaining feasibility for nonconvex constraints while achieving favorable oracle complexity. Although single-loop algorithms efficiently solve unconstrained DC optimization problems, their potential for constrained optimization with DC structure remains largely unexplored. To address this gap, we develop MoSSP, a Momentum-based Single-loop Stochastic Penalty method for such problems with provable complexity guarantees. The key idea is to apply a single stochastic proximal-gradient step to the Moreau envelope of the penalty plus the convex DC part, with the concave part's proximal mapping computed in parallel. We derive two algorithm variants: a Polyak-momentum version with $O(\varepsilon^{-4})$ oracle complexity for finding stochastic $\varepsilon$-KKT points, and an improved $O(\varepsilon^{-3})$ version incorporating recursive momentum. Experimental results demonstrate the effectiveness of the proposed algorithms.
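The two momentum variants mentioned in the abstract can be contrasted on a one-dimensional toy problem. The sketch assumes the standard moving-average form for Polyak momentum and the widely used STORM-style form for recursive momentum; the paper's exact updates may differ, and the oracle, step values, and names are hypothetical.

```python
def polyak_momentum(prev_d, grad_now, beta):
    """Exponential moving average of stochastic gradients."""
    return (1.0 - beta) * prev_d + beta * grad_now

def recursive_momentum(prev_d, grad_now, grad_prev_same_sample, beta):
    """STORM-style estimator: corrects the old estimate using the same sample
    evaluated at the previous iterate."""
    return grad_now + (1.0 - beta) * (prev_d - grad_prev_same_sample)

def grad(x, noise):
    """Stochastic gradient of f(x) = x^2 / 2 with additive noise (toy oracle)."""
    return x + noise

x_prev, x_now, beta = 2.0, 1.0, 0.25
d_prev = grad(x_prev, 0.0)   # start from an exact gradient estimate
noise = 0.5                  # the same sample is evaluated at both iterates
d_polyak = polyak_momentum(d_prev, grad(x_now, noise), beta)
d_storm = recursive_momentum(d_prev, grad(x_now, noise), grad(x_prev, noise), beta)
true_grad = grad(x_now, 0.0)
```

In this toy step the recursive estimator tracks the move from `x_prev` to `x_now` and keeps only a `beta`-scaled share of the noise, whereas the moving average still carries most of the stale gradient. That variance reduction is the usual intuition behind the better complexity of recursive-momentum methods.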

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

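Scoring rationale: This paper focuses on algorithm design (MoSSP) for nonconvex constrained stochastic optimization with DC regularization and belongs to mathematical optimization theory. The provided keyword set (multimodal large models, world models, reinforcement learning, and so on) points to AI architectures and learning paradigms and has no direct connection to the paper's optimization-algorithm topic, so the relevance of every keyword is 0. The author list does not include any of the designated experts, so no bonus applies. The weighted total is 0, below the dynamic pass threshold of 27.8.

Keywords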

Nonconvex Constrained Optimization, DC-Regularized, Stochastic Penalty Method, Momentum-Based, Single-Loop, Oracle Complexity, KKT Points

325. MōLe-Λ: Learning the Coupled-Cluster Response State for Energies, Gradients, and Properties [FAIL]

Score: 0.0 / 27.8

Authors: Andreas Burger, Luca Thiede, Abdulrahman Aldossary, Jorge A. Campos-Gonzalez-Angulo, Alex Zook, Jérôme Florian Gonthier, Alán Aspuru-Guzik

Published: 2026-05-28

TL;DR: MōLe-Λ introduces an equivariant neural network to efficiently predict coupled-cluster response states including energies, forces, and properties, extending Molecular Orbital Learning to quantum chemistry.

Abstract (Chinese translation)

耦合簇（CC）理论常被视为量子化学的黄金标准，但其高昂的计算成本限制了常规获取精确能量、力及响应性质的能力。尽管右 T 振幅决定了关联波函数，但许多实际上重要的可观测量还需要左 Λ 振幅。我们提出了 MōLe-Λ，这是分子轨道学习（MōLe）的一种扩展，它通过从局域化 Hartree--Fock 分子轨道联合学习右振幅 (T1, T2) 和左振幅 (Λ1, Λ2)，来预测完整的基态耦合簇单双激发（CCSD）响应态。在架构上，MōLe-Λ 在 MōLe 的基础上增加了 Λ1 和 Λ2 读出层，这些读出层镜像了 T1 和 T2 头部的对称约束，同时保留了原始的等变轨道编码器、奇符号等变解码、局域性及大小广延性。所得模型能够产出精确的 CC 级能量和力，同时恢复偶极矩、四极矩、极化率、电子密度以及双电子可观测量（如对密度）。我们表明，MōLe-Λ 进一步扩展了 MōLe 相对于完整 CCSD 的速度优势，同时显著扩展了可计算的性质，为相关量子化学中的波函数级代理模型提供了一条途径。

Abstract

Coupled-cluster (CC) theory is often considered the gold standard of quantum chemistry, but its high computational cost limits routine access to accurate energies, forces and response properties. While the right-hand $T$-amplitudes determine the correlated wavefunction, many practically important observables additionally require the left-hand $Λ$-amplitudes. We introduce MōLe-$Λ$, an extension of Molecular Orbital Learning (MōLe) that predicts the full ground-state coupled-cluster singles and doubles (CCSD) response state by jointly learning right-hand amplitudes $(T_1,T_2)$ and left-hand amplitudes $(Λ_1,Λ_2)$ from localized Hartree-Fock molecular orbitals. Architecturally, MōLe-$Λ$ extends MōLe with $Λ_1$ and $Λ_2$ readouts that mirror the symmetry constraints of the $T_1$ and $T_2$ heads, while preserving the original equivariant orbital encoder, odd sign-equivariant decoding, locality and size-extensivity. The resulting model yields accurate CC-quality energies and forces, while simultaneously recovering dipoles, quadrupoles, polarizabilities, the electron density, and 2-electron observables such as the pair density. We show that MōLe-$Λ$ further extends the speed advantage of MōLe over full CCSD while substantially expanding the accessible properties, providing a route to wavefunction-level surrogate models for correlated quantum chemistry.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on quantum chemistry and Coupled-Cluster theory using equivariant neural networks (Molecular Orbital Learning). The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is no overlap in domain or methodology regarding Tokenizers, Visual Encoders, or RL, resulting in zero relevance for all specified keywords.

Keywords

Coupled-cluster theory, Molecular Orbital Learning, Equivariant neural network, Quantum chemistry, Response state, Energies, Forces, Properties

326. FPLIER: Federated Pathway-Level Information Extractor [FAIL]

Score: 0.0 / 27.8

Authors: Daniele Malpetti, Christian Berchtold, Francesco Gualdi, Marco Scutari, Laura Azzimonti, Francesca Mangili

Published: 2026-05-28

TL;DR: FPLIER proposes a federated learning framework for distributed gene-set-aware factorization in transcriptomics to preserve privacy, but it is unrelated to the provided keywords concerning multimodal models and reinforcement learning.

Abstract (Chinese translation)

在转录组学中，基因集感知因子分解方法（如通路水平信息提取器，PLIER）在大型异质性表达汇编上训练时效果最佳。然而，由于隐私和治理约束，许多临床相关队列无法合并到单个数据集中。我们提出了 FPLIER，这是 PLIER 的联邦扩展，能够在多个数据持有者之间实现分布式训练，同时纳入公开可用的数据集。通过安全聚合，FPLIER 产生的训练更新与集中式合并数据方法的更新在代数上等价，同时保持表达数据本地化。我们在两个模拟联盟（来自 K-CLIER 和 MultiPLIER 研究）的多种场景下评估了 FPLIER，并展示了其稳定收敛性。我们进一步针对中间训练统计量和发布模型，对成员推断攻击进行了系统分析。结果表明，隐私风险由训练表达矩阵的秩决定。纳入公开数据或降低数据维度会增加该秩，使系统趋向于满秩状态，在此状态下，训练样本与非训练样本对攻击者而言变得不可区分，且成员推断性能接近随机猜测。

Abstract

In transcriptomics, gene-set-aware factorization methods such as the Pathway Level Information Extractor (PLIER) are most effective when trained on large, heterogeneous expression compendia. Yet, many clinically relevant cohorts cannot be pooled into a single dataset due to privacy and governance constraints. We present FPLIER, a federated extension of PLIER that enables distributed training across multiple data holders while incorporating publicly available datasets. Through secure aggregation, FPLIER produces training updates algebraically equivalent to those of a centralized pooled-data approach while keeping expression data local. We evaluate FPLIER across multiple scenarios in two simulated consortia (from the K-CLIER and MultiPLIER studies) and demonstrate stable convergence. We further conduct a systematic analysis of membership inference attacks targeting both intermediate training statistics and the released model. Our results show that privacy risk is governed by the rank of the training expression matrix. Incorporating public data or reducing data dimensionality increases this rank, moving the system toward a full-rank regime in which training and non-training samples become indistinguishable to the attacker, and membership-inference performance approaches random guessing.
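The claim that securely aggregated updates are algebraically equivalent to pooled training can be illustrated with an additive statistic. The sketch assumes that each site contributes a Gram-type sum over its own samples and uses simplified pairwise additive masking over integers; real secure-aggregation protocols rely on key agreement and modular arithmetic, and the abstract does not state which statistics FPLIER actually exchanges. All names and data are hypothetical.

```python
import random

def gram(samples):
    """Sum of outer products g g^T over a site's samples (each g is a gene vector)."""
    n = len(samples[0])
    return [[sum(g[i] * g[j] for g in samples) for j in range(n)] for i in range(n)]

def add(A, B, sign=1):
    """Element-wise A + sign * B for equally sized matrices."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def secure_sum(local_stats, rng):
    """Pairwise additive masking: site i adds a mask that site j subtracts,
    so the server sees only masked matrices while their sum is unchanged."""
    n = len(local_stats[0])
    masked = [[row[:] for row in S] for S in local_stats]
    for i in range(len(masked)):
        for j in range(i + 1, len(masked)):
            mask = [[rng.randint(-1000, 1000) for _ in range(n)] for _ in range(n)]
            masked[i] = add(masked[i], mask, +1)
            masked[j] = add(masked[j], mask, -1)
    total = masked[0]
    for M in masked[1:]:
        total = add(total, M)
    return total

# Integer toy data: three sites, two genes, a few samples each.
site_a = [[1, 2], [3, 4]]
site_b = [[5, 6]]
site_c = [[7, 8], [9, 10]]
sites = [site_a, site_b, site_c]
federated = secure_sum([gram(s) for s in sites], random.Random(0))
pooled = gram(site_a + site_b + site_c)
```

Because the statistic is a sum over samples, the federated total matches the pooled computation exactly, and the masks cancel without any site revealing its own matrix to the server.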

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

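Scoring rationale: This paper focuses on federated learning in transcriptomics (bioinformatics), whereas the keywords concern multimodal large models and reinforcement learning (AI/RL). The two share neither technical components (such as tokenizers, visual encoders, or RL agents) nor domain, so the relevance of every specified keyword is zero.

Keywords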

Federated Learning, Transcriptomics, Gene Expression, Secure Aggregation, Membership Inference, Pathway Level Information Extractor, Distributed Training, Factorization Methods

327. Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization [FAIL]

Score: 0.0 / 27.8

Authors: Ruoran Xu, Borong She, Xiaobo Jin, Qiufeng Wang

Published: 2026-05-28

TL;DR: The paper introduces Singularity-aware Adam (S-Adam), an optimizer that stabilizes training on non-smooth loss landscapes by mitigating gradient chattering, achieving superior convergence and accuracy in quantization-aware training compared to AdamW.

Abstract (Chinese translation)

深度学习优化严重依赖于平滑损失曲面的假设，而这一条件因现代架构中包含的非平滑组件（如 ReLU 激活函数和量化算子）而被系统性地破坏。在这种非平滑情形下，像 Adam 这样的自适应优化器会遭受梯度颤动，即由 Clarke 次微分 (Clarke subdifferential) 内的冲突信号引起的剧烈振荡，从而导致收敛性差和泛化性能次优。为了解决这一问题，我们引入了感知奇异性 Adam（S-Adam），这是一种新颖的优化器，它通过基于局部几何不稳定性动态调节步长来稳定训练过程。我们的关键贡献是局部几何不稳定性（LGI）指标，这是一个计算高效的 Clarke 次微分直径估计器，其基于随机方向导数的方差推导而来。S-Adam 集成了一个自适应阻尼机制 exp(-λρ)，该机制在高不稳定性区域减缓更新，同时在平滑盆地中保持快速收敛。我们利用微分包含提供了严格的收敛分析，证明 S-Adam 几乎必然以最优 O(1/√T) 速率收敛到 (δ,ε)-Clarke 平稳点。在感知量化训练（QAT）和高噪声小批量学习上的实验评估表明，S-Adam 一致优于 AdamW 和 Prox-SGD，在 CIFAR-100 上实现高达 6% 的准确率提升，在 TinyImageNet 上实现 3% 的提升，同时有效缓解梯度振荡。

Abstract

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-λρ)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(δ,ε)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.
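The damping idea can be sketched in one dimension. Only the multiplier $\exp(-λρ)$ comes from the abstract; the instability proxy below (variance of subgradients probed at random nearby points) is a stand-in chosen for illustration and is not the paper's LGI estimator, and the function names, probe radius, and $λ$ value are hypothetical.

```python
import math
import random
import statistics

def grad_abs(x):
    """A subgradient of f(x) = |x|, which is non-smooth at 0."""
    return 1.0 if x > 0 else -1.0

def instability(grad_fn, x, radius, n_probes, rng):
    """Proxy for local instability: variance of gradients probed at random
    nearby points. It is zero where the gradient is locally constant."""
    probes = [grad_fn(x + radius * rng.uniform(-1.0, 1.0)) for _ in range(n_probes)]
    return statistics.pvariance(probes)

def damping(rho, lam):
    """Step-size multiplier exp(-lambda * rho)."""
    return math.exp(-lam * rho)

rng = random.Random(0)
rho_kink = instability(grad_abs, 0.0, 0.01, 64, rng)    # probes straddle the kink
rho_smooth = instability(grad_abs, 5.0, 0.01, 64, rng)  # probes stay on one side
damp_kink = damping(rho_kink, lam=2.0)
damp_smooth = damping(rho_smooth, lam=2.0)
```

Near the kink the probed subgradients disagree, so the multiplier shrinks the step; away from it the multiplier is exactly 1 and the base optimizer's step is left untouched.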

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on deep learning optimization algorithms (specifically S-Adam) for non-smooth loss landscapes involving ReLU and quantization. It does not address multimodal architectures, world models, tokenization strategies, visual encoders, or reinforcement learning. Consequently, there is no substantive overlap with the provided keywords related to Multimodal LLMs, World Models, or Model-Based RL.

Keywords

Singularity-aware Optimization, Non-smooth Optimization, S-Adam, Local Geometric Instability, Quantization-Aware Training, Gradient Chattering, Convergence Analysis, Randomized Geometric Probing

328. The Complexity of Verifying Feedforward Neural Networks in Quantised Settings [FAIL]

Score: 0.0 / 27.8

Authors: Eric Alsmann, Martin Lange, Marco Sälzer

Published: 2026-05-28

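TL;DR: This paper studies the computational complexity of verifying feedforward neural networks in quantised settings, proving that verification under fixed-precision quantisation is NP-complete and giving upper bounds for the dynamically quantised case.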

Abstract (Chinese translation)

我们研究了量化环境下神经网络验证的计算复杂性。我们将前馈神经网络（FNN）分为三类：具有精确有理权重的有理 FNN、权重取自有限位宽算术的量化 FNN，以及相对于给定有限位宽算术进行评估的有理网络的动态量化 FNN。我们考虑文献中使用的两种规范。线性规划（LP）规范是线性约束的合取，而位向量（BV）规范允许在位级上进行推理，并能表达非线性约束。我们的结果呈现了这些验证问题的复杂性全景。对于具有固定算术精度的量化 FNN，我们表明在 LP 和 BV 规范下的验证仍保持 NP 完全性（NP-complete），与有理情况的复杂性一致。对于具有 BV 规范的动态量化 FNN，我们建立了上界，完善了先前已知的 PSPACE 难性（PSPACE-hardness）结果。

Abstract

We investigate the computational complexity of neural network verification in quantised settings. We distinguish three classes of Feedforward Neural Networks (FNNs): rational FNNs with exact rational weights, quantised FNNs whose weights come from a finite-width arithmetic, and dynamically quantised FNNs in which rational networks are evaluated with respect to a given finite-width arithmetic. We consider two types of specifications used in the literature. Linear programming (LP) specifications are conjunctions of linear constraints, while bit-vector (BV) specifications allow reasoning at the bit level and can express non-linear constraints. Our results give a complexity landscape of these verification problems. For quantised FNNs with fixed arithmetic precision, we show that verification under both LP and BV specifications remains NP-complete, matching the complexity of the rational case. For dynamically quantised FNNs with BV specifications, we establish upper bounds, complementing a previously known PSPACE-hardness result.
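The difference between a rational FNN and the same network evaluated under finite-width arithmetic can be shown on a tiny example. The sketch assumes a toy fixed-point format that rounds to the nearest multiple of $2^{-f}$ after every operation and ignores overflow; the paper's arithmetic model may differ, and the network and names below are hypothetical.

```python
from fractions import Fraction
import math

def quantise(value, frac_bits):
    """Round to the nearest multiple of 2^-frac_bits (toy fixed-point format,
    no overflow handling)."""
    scale = 2 ** frac_bits
    return Fraction(math.floor(value * scale + Fraction(1, 2)), scale)

def relu(v):
    return max(v, Fraction(0))

def evaluate(layers, x, frac_bits=None):
    """Evaluate a ReLU network exactly, or with rounding after every operation."""
    q = (lambda v: quantise(v, frac_bits)) if frac_bits is not None else (lambda v: v)
    acts = [q(v) for v in x]
    for weights, biases in layers:
        nxt = []
        for row, b in zip(weights, biases):
            acc = q(b)
            for w, a in zip(row, acts):
                acc = q(acc + q(q(w) * a))
            nxt.append(relu(acc))
        acts = nxt
    return acts

# Hypothetical network with rational weights: x -> relu(x / 3) -> relu(3 * h).
layers = [
    ([[Fraction(1, 3)]], [Fraction(0)]),
    ([[Fraction(3)]], [Fraction(0)]),
]
exact = evaluate(layers, [Fraction(1)])                # rational semantics
rounded = evaluate(layers, [Fraction(1)], frac_bits=2) # dynamically quantised semantics
```

A linear specification such as "the output is at least 1 for input 1" holds for the rational network but fails once every operation is rounded to two fractional bits, which shows why the quantised semantics have to be analysed as a verification problem in their own right.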

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

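Scoring rationale: The paper focuses on the computational complexity of verifying feedforward neural networks in quantised settings, covering linear programming and bit-vector specifications under rational weights, quantised weights, and dynamic quantisation. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all belong to multimodal large models and reinforcement learning and have no direct connection to the paper's topic of formal verification and neural-network complexity theory, so every keyword is scored 0.

Keywords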

Feedforward Neural Networks, Quantised Settings, Computational Complexity, Linear Programming, Bit-vector Specifications, Verification Problems, Rational Weights, Finite-width Arithmetic

329. Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection [FAIL]

Score: 0.0 / 27.8

Authors: Runang He, Tongya Zheng, Huiling Peng, Yuanyu Wan, Bingde Hu, Jiawei Chen, Canghong Jin, Mingli Song, Can Wang

Published: 2026-05-28

TL;DR: This paper proposes a temporal motif-aware graph test-time adaptation framework (TEMG-TTA) to address out-of-distribution anomaly detection in evolving blockchain transaction patterns, achieving significant performance improvements over state-of-the-art methods.

摘要翻译

持续演变的交易模式因地址数量庞大及异常行为多样化，显著阻碍了新兴加密货币区块链上的异常检测。近期，应用于区块链的高级图异常检测（GAD）方法面临两大关键挑战：一是恶意主体的对抗性模式演化，二是区块链上多样化的交易语义所引发的分布外（OOD）问题。为应对上述挑战，我们提出了一种新颖的框架，称为时间模态感知图测试时间适应（TEMG-TTA）。首先，我们利用高效的计算机制全面捕获每个活跃地址的 3 节点时间模态分布，从而支持下游的时间模态感知图学习。其次，我们设计了一种简单却有效的测试时间适应策略，以促进训练图与测试图之间共享共同模式。在 5 个真实世界数据集上的广泛实验表明，我们提出的 TEMG-TTA 平均比最先进的图异常检测（GAD）方法高出 54.88%。关于可解释模态模式的进一步案例研究表明，TEMG-TTA 明确刻画了异常地址的复杂交易模式，从而验证了我们的技术设计的有效性。我们的代码将公开提供：https://github.com/LuoXishuang0712/TEMG-TTA/.

Abstract

Ever-evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast number of addresses and diverse anomalous behaviors. Recently, advanced Graph Anomaly Detection (GAD) approaches applied to blockchains have faced two critical challenges: adversarial pattern evolution by malicious actors and the out-of-distribution (OOD) problem caused by varied transaction semantics on blockchains. To address these challenges, we propose a novel framework termed Temporal Motif-aware Graph Test-Time Adaptation (TEMG-TTA). First, we comprehensively capture the 3-node temporal motif distribution of each active address using an efficient computational mechanism, enabling downstream temporal motif-aware graph learning. Second, we design a simple yet effective test-time adaptation strategy to facilitate the sharing of common patterns between training and testing graphs. Extensive experiments on 5 real-world datasets demonstrate that our proposed TEMG-TTA outperforms state-of-the-art GAD approaches by an average of 54.88%. A further case study on interpretable motif patterns reveals that TEMG-TTA explicitly characterizes the complex transaction patterns of anomalous addresses, thereby verifying the effectiveness of our technical designs. Our code will be made publicly available at https://github.com/LuoXishuang0712/TEMG-TTA/.

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: The paper focuses on Graph Anomaly Detection (GAD) in blockchain using temporal motifs and test-time adaptation. The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning (RL). There is no overlap in methodology (no visual encoders, tokenizers, or modal unification) or task domain (no RL or world modeling). Thus, all keyword scores are 0. None of the listed expert authors appear in the author list.

关键词

Temporal Motif, Graph Test-time Adaptation, OOD Blockchain, Anomaly Detection, Graph Anomaly Detection, Test-time Adaptation, Transaction Patterns

330. Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption FAIL

Score: 0.0 / 27.8

Authors: Santanu Das, Sagnik Chatterjee, Jatin Batra

Published: 2026-05-28

TL;DR: This paper proposes a robust recovery algorithm for Gaussian Single Index Models with non-monotonic link functions under adversarial corruption by proving the existence of convex basins in the loss landscape that enable efficient convergence via spectral initialization.

摘要翻译

我们研究了在重尾噪声以及恒定比例的对抗性污染协变量和响应值存在的情况下，鲁棒学习高斯单指标模型（SIMs）的问题。先前关于鲁棒恢复的研究考虑了线性回归（Pensia 等，JASA 2024）、严格单调链接函数（Awasthi 等，NeurIPS 2022）以及相位检索（Buna 和 Rebeschini，AISTATS 2025）等场景。然而，这些技术无法扩展到通用的非对称非单调链接函数，例如 GeLU 和 Swish，它们在现代门控神经网络架构中自然作为标量原语出现。我们通过提出首个针对通用非单调链接函数具有近线性样本和时间复杂度的鲁棒恢复算法，填补了这一空白，从而为一大类非线性 SIMs 建立了首个鲁棒恢复保证，此前对于这些模型尚无已知保证。我们的核心贡献在于提出了对抗性污染下高斯平方损失景观的新结构理解。至关重要的是，我们证明对于一大类非线性非单调 SIMs，在真值周围存在一个与维度无关、恒定半径的凸盆地，且即便在对抗性污染下也能通过鲁棒谱初始化高效到达。先前工作未能同时建立这两项保证，因此要么在对抗性污染下失效，要么无法处理通用非单调链接函数。综上所述，这些结构洞察为鲁棒梯度下降提供了基于原理的热启动，该启动可证明收敛至最终估计误差 $O(σ\sqrtε)$，耗时 $\tilde{O}(nd)$，使用 $\tilde{O}(d)$ 个样本，其中 $ε$ 为污染比例。

Abstract

We study the problem of robustly learning Gaussian Single Index Models (SIMs) in the presence of heavy-tailed noise and a constant fraction of adversarially corrupted covariates and responses. Prior work on robust recovery has considered settings such as linear regression (Pensia et al., JASA 2024), strictly monotonic link functions (Awasthi et al., NeurIPS 2022), and phase retrieval (Buna and Rebeschini, AISTATS 2025). However, these techniques do not extend to generic asymmetric non-monotonic link functions such as GeLU and Swish, which arise naturally as scalar primitives in modern gated neural architectures. We close this gap by giving the first robust recovery algorithm with near-linear sample and time complexity for generic non-monotonic link functions, thereby establishing the first robust recovery guarantees for a broad family of nonlinear SIMs for which no guarantees were previously known. Our central contribution is a new structural understanding of the Gaussian squared-loss landscape under adversarial contamination. Crucially, we prove that for a broad class of nonlinear non-monotonic SIMs, a dimension-independent, constant-radius convex basin exists around the ground truth and is efficiently reachable via robust spectral initialization even under adversarial contamination. Prior works fail to establish both guarantees simultaneously, thereby either breaking down under adversarial contamination or failing to handle generic non-monotonic link functions. Together, these structural insights yield a principled warm start for robust gradient descent that provably converges to a final estimation error of $O(σ\sqrt{ε})$ in $\tilde{O}(nd)$ time with $\tilde{O}(d)$ samples, where $ε$ is the contamination fraction.

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

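评分理由: The paper studies robust recovery of Gaussian single index models (SIMs) under heavy-tailed noise and adversarial corruption, focusing on convex basins in the loss landscape and optimization convergence. The provided keywords cover multimodal large language models (MLLM), world models, and reinforcement learning (Tokenizer, Visual Encoder, Model-Based RL). The paper has no direct semantic connection to multimodal generation, representation learning, or reinforcement learning frameworks; it belongs mainly to statistical learning and optimization theory. All keyword relevance scores are therefore 0.0. In addition, the author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).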

关键词

Single Index Model, Robust Recovery, Adversarial Corruption, Loss Landscapes, Convex Basins, Non-monotonic Link Functions, Spectral Initialization

331. Deep Optimal Individualized Treatment Rules for Bivariate Survival Outcomes via Adaptive Prediction-Powered Learning FAIL

Score: 0.0 / 27.8

Authors: Kun Ren, Yifan Cui, Wen Su

Published: 2026-05-28

TL;DR: This paper proposes a deep learning approach to derive optimal individualized treatment rules for bivariate survival outcomes in randomized trials by modeling stochastic policies and accounting for right censoring.

摘要翻译

在涉及多种治疗的随机对照试验中，双变量生存结局（bivariate survival outcomes）为决策分析带来了显著挑战。本文旨在解决通过深度神经网络（deep neural networks）推导最优个体化治疗规则的问题，以最大化固定时间点 $(t_1, t_2)$ 之后的联合生存概率，同时考虑右删失（right censoring）。我们提出了一种新颖的方法，通过随机策略（stochastic policies）建模治疗规则，并利用链接函数耦合边际加速失效时间模型（marginal accelerated failure time models）以捕捉双变量依赖（bivariate dependence）。为了增强决策的鲁棒性和有效性，我们引入了一种自适应预测驱动方法（adaptive prediction-powered method），该方法利用机器学习模型（machine learning models）的辅助预测。

Abstract

In randomized trials involving multiple treatments, bivariate survival outcomes present significant analytical challenges for decision making. This paper addresses the problem of deriving optimal individualized treatment rules that maximize the joint survival probability beyond fixed time points $(t_1, t_2)$ through deep neural networks, while accounting for right censoring. We propose a novel approach that models treatment rules via stochastic policies, coupling marginal accelerated failure time models via a link function to capture bivariate dependence. To enhance the robustness and effectiveness of decision making, we introduce an adaptive prediction-powered method that leverages auxiliary predictions from machine learning models.

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: The paper focuses on biostatistics and survival analysis using deep neural networks for treatment rules, which is unrelated to the multimodal, generative, or model-based RL domain specified by the keywords (e.g., Tokenizer, Visual Encoder, MLLM). No expert authors match the provided list.

关键词

Individualized Treatment Rules, Bivariate Survival Outcomes, Deep Neural Networks, Stochastic Policies, Accelerated Failure Time Models, Right Censoring, Prediction-Powered Learning

332. How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions FAIL

Score: 0.0 / 27.8

Authors: Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

Published: 2026-05-28

TL;DR: This paper investigates dataset valuation by comparing neural scaling laws, the Vendi Score, and matrix spectral functions, finding that facility location objectives often best predict held-out performance while random subsets remain surprisingly consistent.

摘要翻译

神经缩放定律（Neural scaling laws）通过数据集规模来评估数据，而 Vendi Score 利用量子熵来衡量数据集的价值。我们证明了常见的神经缩放定律目标函数和 Vendi Score 均具有次模性（submodular）。我们进一步表明，Vendi Score 是我们称为矩阵谱函数（matrix spectral functions）的一类更广泛的次模目标函数的特例。这也包括行列式点过程（DPP）目标函数以及其他许多函数。我们还引入了弱矩阵单调函数（weakly matrix monotone functions），并展示了它们如何导出弱次模矩阵谱函数，从而生成一系列用于数据评估的实用目标函数。我们开发了基于久期方程（secular equation）的更新方法，避免了在贪心优化过程中重复进行特征分解，将 $m$ 维嵌入的边际增益评估相对于 oracle 查询降低了 $O(m)$ 的复杂度因子。这使得平均经验加速比达到约 35,000 倍，从而使得在 ImageNet-1K 规模数据集上直接优化 Vendi Score 成为可行。在此基础上，我们比较了若干目标函数在固定大小、类别平衡及固定训练预算设定下，预测训练子集价值以用于保留测试性能的效果，其中包括 Vendi Score、DPPs、设施选址（facility location）以及三种新的矩阵谱变体。在多个数据集上，设施选址（facility location）表现最佳。直接优化还表明，尽管 Vendi Score 在中等分数范围内具有预测性，但若将目标函数值推至更高，它可能成为下游性能的较差代理。我们还发现，无论是无约束还是类别平衡的均匀随机固定大小子集，其评估分数和保留性能都表现出惊人的集中性。最后，我们表明，规模、类别平衡和训练预算并不能单独决定数据价值：即使控制了这些因素，性能仍从好到坏平滑变化。

Abstract

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal point process (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets drawn uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
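To make the facility location objective named in the abstract concrete, here is a minimal sketch of the standard definition, f(S) = sum over items i of the largest similarity between i and any selected item, together with plain greedy selection. The toy similarity matrix and the function names are illustrative assumptions; the paper's embeddings, kernels, and secular-equation updates are not reproduced here.

```python
def facility_location(sim, subset):
    """f(S) = sum over all items i of max_{j in S} sim[i][j]; f(empty set) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)

def greedy_select(sim, budget):
    """Pick `budget` items, each time adding the one with the largest marginal gain."""
    chosen = []
    for _ in range(budget):
        base = facility_location(sim, chosen)
        best, best_gain = None, float("-inf")
        for j in range(len(sim)):
            if j in chosen:
                continue
            gain = facility_location(sim, chosen + [j]) - base
            if gain > best_gain:
                best, best_gain = j, gain
        chosen.append(best)
    return chosen

# Toy (hypothetical) similarity matrix: items 0 and 1 are near-duplicates,
# item 2 is different from both.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
picked = greedy_select(sim, 2)
value = facility_location(sim, picked)
print(picked, value)
```

With nonnegative similarities this objective is monotone submodular, so greedy selection skips the near-duplicate item 1 in favour of the dissimilar item 2 once item 0 is chosen.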

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

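评分理由: The paper focuses on theoretical dataset valuation methods (Vendi Score, scaling laws, submodular optimization) rather than multimodal architectures, tokenizers, world models, or reinforcement learning policies. The provided keywords are therefore unrelated to the paper's core content and contributions. In addition, the author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points apply. The weighted total is 0.0, below the dynamic passing score of 27.8.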
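As a rough illustration of how a weighted total like the one in the rationale could follow from the keyword table, the sketch below assumes the total is the sum of weight times relevance over all keywords, and that an entry passes when the total reaches the passing score. Both the aggregation rule and the function names are assumptions; the report does not state the actual formula or how expert-author bonuses are added.

```python
# Hypothetical sketch: the aggregation rule (sum of weight * relevance) is an
# assumption, not the report generator's documented formula.
KEYWORDS = ["Unify Models", "Tokenizer", "Visual Encoder", "World Models",
            "MLLM", "MultiModal", "model-based RL"]

def weighted_total(weights, relevance):
    """Sum weight * relevance over all keywords."""
    return sum(weights[k] * relevance[k] for k in weights)

def verdict(total, passing_score):
    """Label an entry PASS when it reaches the passing score, else FAIL."""
    return "PASS" if total >= passing_score else "FAIL"

weights = {k: 1.5 for k in KEYWORDS}      # weights as listed in the table
relevance = {k: 0.0 for k in KEYWORDS}    # relevance scores as listed in the table
total = weighted_total(weights, relevance)
print(total, verdict(total, 27.8))
```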

关键词

Dataset Valuation, Scaling Laws, Vendi Score, Submodular Optimization, Matrix Spectral Functions, Data Appraisal, Held-out Performance

333. AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing FAIL

Score: 0.0 / 27.8

Authors: Yuexin Li, Wenjie Qu, Linyu Wu, Yulin Chen, Yufei He, Tri Cao, Bryan Hooi, Jiaheng Zhang

Published: 2026-05-28

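TL;DR: AliMark recasts sentence-level watermarking as a bit-sequence alignment problem and proposes a multi-candidate detection strategy, substantially improving watermark robustness against text paraphrasing attacks.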

摘要翻译

现有的句子级水印方法通过将水印锚定在句子语义中来增强对改写的鲁棒性。然而，它们基于前缀的设计仍然容易受到结构扰动的影响，例如句子拆分与合并，这些情况在 DIPPER 和 GPT-3.5 等强改写器下很常见。为了解决这一问题，我们提出了 AliMark，该框架将句子级水印重新表述为潜在水印文本与秘密比特序列 (secret bit sequence) 之间的比特序列 (bit sequence) 编码与对齐问题。值得注意的是，我们的方法采用两阶段检测策略：我们生成多个重构文本变体，并将其提取的比特序列与秘密比特序列自适应对齐，以最小化对齐代价。这种多候选对齐设计自然地提高了对句子拆分与合并的鲁棒性。广泛的实验表明，在各种改写攻击下，AliMark 显著优于最先进基线。

Abstract

Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
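The multi-candidate alignment idea can be sketched as follows: score each candidate's extracted bit sequence against the secret sequence and keep the cheapest alignment. The abstract does not specify the alignment cost or how restructured variants are generated, so this sketch substitutes plain Levenshtein distance as a stand-in cost and uses made-up bit strings; it is not the authors' detector.

```python
def alignment_cost(extracted, secret):
    """Levenshtein distance between two bit strings
    (unit cost for insertion, deletion, and substitution)."""
    prev = list(range(len(secret) + 1))
    for i, a in enumerate(extracted, start=1):
        curr = [i]
        for j, b in enumerate(secret, start=1):
            curr.append(min(prev[j] + 1,              # drop a bit from `extracted`
                            curr[j - 1] + 1,          # skip a bit of `secret`
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]

def best_candidate(candidates, secret):
    """Return the candidate bit string with the lowest alignment cost."""
    return min(candidates, key=lambda bits: alignment_cost(bits, secret))

# Hypothetical data: one secret sequence and three extracted candidates.
secret = "101101"
candidates = ["111000", "10110", "000000"]
best = best_candidate(candidates, secret)
print(best, alignment_cost(best, secret))
```

An insertion or deletion in the bit string loosely mirrors a sentence split or merge shifting the extracted bits, which is why an alignment-style cost tolerates such edits better than position-by-position comparison.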

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

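评分理由: The paper focuses on the robustness of sentence-level watermarking under text paraphrasing; its core is bit-sequence alignment and a detection strategy. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all concern multimodal representation learning or reinforcement learning and are entirely unrelated to the paper's domain of text watermarking security, so every keyword receives a relevance score of 0.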

关键词

Sentence-level Watermarking, Text Paraphrasing, Bit Sequence Alignment, Robustness Enhancement, Two-stage Detection, Text Splitting and Merging, Watermarking Framework

334. Constructing efficient channels for ideal observers using the conjugate gradient method FAIL

Score: 0.0 / 27.8

Authors: Weimin Zhou

Published: 2026-05-28

TL;DR: This paper proposes a conjugate gradient-based method to construct efficient channels for approximating ideal observers in medical image quality assessment, addressing computational intractability in high-dimensional data.

摘要翻译

基于任务的图像质量（IQ）评估对于医学成像系统的设计与优化至关重要。理想观测者，包括贝叶斯理想观测者（IO）和理想线性观测者，即霍特林观测者（HO），提供了客观的优度指标（FOMs），用于量化系统在信号检测任务上的性能。然而，将理想观测者应用于高维图像数据时，往往计算上不可行。通道机制提供了一种有效的降维框架，有助于理想观测者的计算。本文提出了一种基于共轭梯度（CG）的方法，用于构建高效通道以近似 IO 和 HO 的性能。

Abstract

Task-based assessment of image quality (IQ) is critically important for the design and optimization of medical imaging systems. Ideal observers, including the Bayesian Ideal Observer (IO) and the ideal linear observer, i.e., the Hotelling observer (HO), provide objective figures of merit (FOMs) that quantify system performance on signal detection tasks. However, the application of ideal observers to high-dimensional image data is often computationally intractable. Channel mechanisms provide an effective framework for dimensionality reduction that can facilitate the computation of ideal observers. This work presents a conjugate gradient (CG)-based method to construct efficient channels for approximating the IO and HO performance.
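For orientation, the Hotelling observer template w is the solution of the linear system K w = Δs, where K is the image covariance and Δs the mean signal difference, and conjugate gradient solves such symmetric positive-definite systems iteratively. The sketch below shows only that underlying solver on a tiny made-up system; the abstract does not describe how CG is used to build channels, so no channel construction is attempted, and the matrix and vector values are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive-definite A, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x for the zero initial guess
    p = list(r)          # first search direction
    rs = dot(r, r)
    for _ in range(max_iter or 2 * n):
        if rs ** 0.5 < tol:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Hypothetical 2-pixel example: K plays the role of the covariance,
# delta_s the mean difference between signal-present and signal-absent images.
K = [[4.0, 1.0],
     [1.0, 3.0]]
delta_s = [1.0, 2.0]
w = conjugate_gradient(K, delta_s)
print(w)
```

Each iteration needs only matrix-vector products with K, which is why CG stays practical when K is too large to invert explicitly.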

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: The paper focuses on medical image quality assessment using statistical ideal observers and conjugate gradient optimization for dimensionality reduction. It does not involve large language models, multimodal architectures, tokenizers, visual encoders in the context of MLLM, world models, or reinforcement learning, making all provided keywords irrelevant.

关键词

Ideal Observers, Image Quality Assessment, Conjugate Gradient Method, Dimensionality Reduction, Hotelling Observer, Medical Imaging, Channel Mechanisms

335. Real-Time Retargeting Using Controllability Boundary for Chandrayaan-3 Lunar Landing FAIL

Score: 0.0 / 27.8

Authors: Suraj Kumar, Debjyoti Chakrabarti, Aditya Rallapalli, Bharat Kumar GVP, Ashok Kumar Kakula

Published: 2026-05-28

TL;DR: This paper presents a real-time retargeting guidance policy for the Chandrayaan-3 lunar landing mission that utilizes a convex controllability boundary to ensure safe and feasible landing site selection.

摘要翻译

本文介绍了为 Chandrayaan-3（月船 3 号）月球着陆任务开发的实时重目标（retargeting）制导策略。基准制导生成近似燃料最优（fuel-optimal）的下降轨迹，而当标称着陆点（nominal site）不可行时，高层策略可实现向备选着陆点（alternate sites）的安全重目标。该重目标策略利用可控性边界（controllability boundary）的凸表示（convex representation），从而实现快速的可行性检查（feasibility checks）和实时目标更新。据作者所知，这代表了数据驱动（data-driven）的重目标框架在实际运行的月球着陆任务中的首次应用。飞行前仿真及 Chandrayaan-3 飞行结果验证了所提方法的有效性。

Abstract

This paper presents the real-time retargeting guidance policy developed for the Chandrayaan-3 lunar landing mission. The baseline guidance generates approximate fuel-optimal descent trajectories, while a high-level policy enables safe retargeting to alternate sites when the nominal site becomes infeasible. The retargeting strategy leverages a convex representation of the controllability boundary, allowing rapid feasibility checks and real-time target updates. To the best of the authors' knowledge, this represents the first application of a data-driven retargeting framework in an operational lunar landing mission. Pre-flight simulations and Chandrayaan-3 flight results validate the effectiveness of the proposed approach.

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: The paper focuses on aerospace guidance and control for the Chandrayaan-3 lunar landing mission, utilizing convex controllability boundaries and fuel-optimal trajectory optimization. The provided keywords (Tokenizer, MLLM, Visual Encoder, Unify Models, etc.) pertain to Large Language Models and Multi-modal AI architectures, which are unrelated to the paper's domain of control theory and aerospace engineering. Thus, all keyword relevance scores are 0. None of the specified expert authors are present in the author list.

关键词

Real-Time Retargeting, Controllability Boundary, Chandrayaan-3, Lunar Landing, Guidance Policy, Fuel-Optimal, Data-Driven, Feasibility Checks

336. Information-Directed Offline-to-Online Reinforcement Learning FAIL

Score: 0.0 / 27.8

Authors: Keru Chen

Published: 2026-05-28

摘要翻译

基于离线数据集的决策通常从固定的离线数据预热（warm-start）一个策略或评分模型，随后通过有限的在线交互对其进行精炼。离线数据降低了不确定性，但并未消除对探索的需求；它改变了剩余需要探索的内容。我们通过条件互信息 $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ 形式化这种残差不确定性，该信息衡量的是在给定离线数据集条件下，学习目标 $χ$ 与在线轨迹之间的依赖关系。这一观点自然引出了信息导向采样（IDS, Information-Directed Sampling），这是一个由 $η\ge 0$ 参数化的方法族，通过权衡瞬时遗憾与信息增益来选择动作。我们通过比率证书证明了 IDS 的一个通用离线到在线贝叶斯遗憾界：任何参考汤普森采样策略（Thompson-sampling）在相同随机化策略类上满足的信息比率界，IDS 均可继承。在已知动力学的贝叶斯线性奖励模型中，条件互信息具有对数行列式形式，且原始 IDS（$η=0$）满足 $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right)$，其中覆盖系数与原始 IDS 自身诱导的访问分布相关联。我们还识别出一种预热情形（warm-start regime），其中存在一个被支配但具信息量的探测：在此情形下，原始 IDS 会选择该探测，而汤普森采样（Thompson-sampling）则永远不会选择，从而产生常数因子级别的贝叶斯遗憾差距。受控多臂老虎机（Bandit）实验和 D4RL 离线到在线强化学习（RL）实验验证了这一机制：当离线数据具有信息量但留下了有偏或低概率的残差不确定性，且可通过针对性的在线动作加以解决时，IDS 最为有益；这一情形同样存在于离线强化学习（RL）、离线黑盒优化和贝叶斯优化中。

Abstract

Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for exploration; it changes what remains to be explored. We formalise this residual uncertainty by the conditional mutual information $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ between a learning target $χ$ and the online trajectories after conditioning on the offline dataset. This view leads naturally to information-directed sampling (IDS), a family parameterised by $η\ge 0$ that selects actions by trading off instantaneous regret against information gain. We prove a generic offline-to-online Bayesian regret bound for IDS through a ratio certificate: any information-ratio bound satisfied by a reference Thompson-sampling policy over the same randomised policy class is inherited by IDS. In a known-dynamics Bayesian linear-reward model, the conditional mutual information has a log-determinant form, and vanilla IDS ($η=0$) satisfies $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right),$ where the coverage coefficient is tied to the visitation distribution induced by vanilla IDS itself. We also identify a warm-start regime with a dominated but informative probe in which vanilla IDS selects the probe while Thompson sampling never does, giving a constant-factor Bayesian regret separation. Controlled bandit experiments and D4RL offline-to-online RL experiments validate this mechanism: IDS is most beneficial when offline data is informative but leaves biased or low-probability residual uncertainty that targeted online actions can resolve, a regime shared by offline RL, offline black-box optimization, and Bayesian optimization.

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: Scoring failed: Expecting ',' delimiter: line 12 column 64 (char 287)

337. GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models FAIL

Score: 0.0 / 27.8

Authors: Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic

Published: 2026-05-28

摘要翻译

强化学习（RL）可用于改进扩散大语言模型（dLLMs）的策略（去噪器），但受限于策略似然性的不可计算性。一类主导且高效的方法用证据下界（ELBO）替换标准强化学习（RL）中的似然性，该下界通过随机掩码序列进行估计。尽管这些方法与预训练过程对齐良好，但它们通过使用证据下界（ELBO）作为似然代理，引入了训练 - 推理不匹配（TIM）偏差，从而可能降低性能。本文提出引导去噪器自蒸馏（GDSD），旨在直接从优势引导的自教师蒸馏扩散大语言模型（dLLMs）的去噪器，该自教师源自反向 KL 正则化强化学习（RL）的闭式最优解。GDSD 通过无归一化目标将扩散大语言模型（dLLMs）的去噪器 logits 匹配到教师，这将强化学习（RL）简化为无似然自蒸馏，从而避开了训练 - 推理不匹配（TIM）偏差。近期基于证据下界（ELBO）的方法被视为应用不同蒸馏散度的实例，但它们存在可诊断的缺陷，而 GDSD 能够避免这些问题。在规划、数学和编码基准测试上，基于 LLaDA-8B 和 Dream-7B，GDSD 一致优于先前基于证据下界（ELBO）的最先进方法，且具有更稳定的训练奖励动态，实现了高达 +19.6% 的测试准确率提升。这些结果表明，直接去噪器自蒸馏，无需依赖证据下界（ELBO）似然代理，可为扩散大语言模型（dLLMs）提供更稳定和有效的强化学习（RL）流程。代码可在 https://github.com/GaryBall/GDSD 获取。

Abstract

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce training-inference mismatch (TIM) bias by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM bias. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: Scoring failed: Expecting ',' delimiter: line 12 column 36 (char 259)

338. On the Optimizer Dependence of Neural Scaling Laws FAIL

Score: 0.0 / 27.8

Authors: Vansh Ramani, Shourya Vir Jain

Published: 2026-05-28

TL;DR: This study reveals that the scaling exponent in neural scaling laws systematically depends on the optimizer choice, showing preconditioned optimizers achieve steeper scaling than gradient descent.

摘要翻译

神经缩放定律 $L(N) \propto N^{-α}$ 中的缩放指数 $α$ 通常被视为由模型架构和数据集确定的固定常数。本文提供了证据表明，$α$ 系统性地依赖于优化器 (optimizer)。在受控的随机特征回归 (random-feature regression) 实验中——这是神经缩放的经典理论框架——我们在五种优化器变体和六种谱条件 (spectral conditions) 下测量了 $α$。预条件优化器 (Preconditioned optimizers) 一致地产生更陡峭的缩放行为（即更大的 $α$），$α$ 的偏移量在测试的谱范围内大部分呈增加趋势，在 $s = 1.5$ 附近达到峰值，并在 $s = 2.0$ 时仍保持较大值。在 $s \approx 1.0$（自然语言的特征）下，全自然梯度 (full natural gradient) 实现的 $α \approx 0.31$，而梯度下降 (gradient descent) 仅为 $α \approx 0.12$ —— 这是一个 2.6 倍更大的拟合指数，在随机特征模型中，该指数随模型规模加倍而累积放大。这种指数偏移是否以及如何转移到大语言模型 (LLM) 的大规模训练——其中近期证据表明该优势可能随规模衰减——仍然是一个重要的开放性问题。我们的结果表明，缩放定律的预测应考虑优化器的选择，并且我们提供了一种谱诊断方法 (spectral diagnostic)，用于预测高级优化器何时会产生收益。

Abstract

The scaling exponent $α$ in neural scaling laws $L(N) \propto N^{-α}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $α$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $α$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $α$), with the $α$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $α\approx 0.31$ versus $α\approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
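To illustrate what a fitted exponent means in practice, the sketch below recovers α from synthetic, noise-free losses by least squares in log-log space and reports the factor by which the loss shrinks per model-size doubling under L(N) ∝ N^{-α}. Only the two exponent values (0.12 and 0.31) come from the abstract; the model sizes, the generated losses, and the function names are assumptions, and nothing here reproduces the paper's random-feature experiments.

```python
import math

def fit_exponent(sizes, losses):
    """Least-squares slope of log L against log N; returns alpha = -slope."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope

def loss_ratio_per_doubling(alpha):
    """Factor by which L(N) = c * N**(-alpha) shrinks when N doubles."""
    return 2.0 ** (-alpha)

# Synthetic model sizes (hypothetical) and losses generated from the two
# exponents quoted in the abstract.
sizes = [2 ** k for k in range(8, 16)]
for alpha in (0.12, 0.31):
    losses = [n ** (-alpha) for n in sizes]
    print(alpha, round(fit_exponent(sizes, losses), 4),
          round(loss_ratio_per_doubling(alpha), 4))
```

Because the per-doubling factor multiplies across successive doublings, a larger exponent widens the loss gap geometrically as N grows, which is the compounding the abstract refers to.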

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: The paper focuses on neural scaling laws and optimizer dependence in random-feature regression. It does not discuss multimodal architectures, tokenizers, visual encoders, world models, or reinforcement learning, resulting in no direct relevance to the provided keyword set which targets multimodal/RL domains.

关键词

Neural Scaling Laws, Optimizer Dependence, Random-Feature Regression, Spectral Conditions, Preconditioned Optimizers, Scaling Exponent, Large-scale LLM Training

339. Deep Adaptive Dimension Reduction for Bayesian Inference in Inverse Problems FAIL

Score: 0.0 / 27.8

Authors: Yueyang Wang, Xili Wang, Kejun Tang, Xiaoliang Wan, Tao Zhou, Chao Yang

Published: 2026-05-28

TL;DR: This paper proposes a deep adaptive dimension-reduction Bayesian inference framework based on Variational Flow to solve high-dimensional PDE-governed inverse problems, achieving superior accuracy compared to traditional sampling methods.

摘要翻译

求解由偏微分方程（PDE）支配的高维逆问题通常具有挑战性，原因在于复杂的非高斯后验分布、昂贵的正向模型评估以及先验信息设定不当。为了解决这些问题，我们提出了一种基于变分流（VF）模型的深度自适应降维贝叶斯推断框架。由于标准归一化流受限于双射映射且无法直接降维，VF 通过将基于变分自编码器（VAE）的非线性降维与用于潜在先验和编码器的双归一化流相结合，克服了这一限制。这种设计提供了严格高于变分自编码器（VAE）的证据下界（ELBO），并允许对复杂后验分布进行更灵活的近似。我们进一步引入了一种迭代先验更新策略，该策略逐渐将先验均值移向高概率后验区域，避免了手动先验调优。这些组件与自适应微调的傅里叶神经算子（FNO）代理模型共同构成一个闭环自适应循环：VF 生成聚焦后验分布的样本以精炼代理模型，而更新后的代理模型进一步改进后验推断。在 100 维 Rosenbrock 问题和三个标准偏微分方程（PDE）支配的逆问题上的数值实验表明，我们的方法在所有测试配置下相比 MCMC、UKI 和 SVGD 基线方法具有竞争力或更优的准确性，最显著的优势出现在高噪声观测和高维参数空间等挑战性场景中。

Abstract

Solving high-dimensional PDE-governed inverse problems is often challenging due to complex non-Gaussian posterior distributions, expensive forward model evaluations, and misspecified prior information. To address these issues, we propose a deep adaptive dimension-reduction Bayesian inference framework based on the Variational Flow (VF) model. Since standard normalizing flows are restricted by bijective mappings and cannot directly reduce dimensions, VF overcomes this limitation by integrating VAE-based nonlinear dimension reduction with dual normalizing flows for the latent prior and encoder. This design provides a strictly higher evidence lower bound than VAE and allows more flexible approximation of complex posterior distributions. We further introduce an iterative prior updating strategy that gradually moves the prior mean toward high-probability posterior regions, avoiding manual prior tuning. These components form a closed adaptive loop together with an adaptively fine-tuned Fourier Neural Operator (FNO) surrogate: VF generates posterior-concentrated samples to refine the surrogate, while the updated surrogate further improves posterior inference. Numerical experiments on a 100-dimensional Rosenbrock problem and three standard PDE-governed inverse problems show that our method delivers competitive or superior accuracy compared with MCMC, UKI, and SVGD baselines across all tested configurations, with the most pronounced advantages emerging in challenging scenarios such as high-noise observations and high-dimensional parameter spaces.

评分详情

关键词	权重	相关度
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

评分理由: The paper focuses on Bayesian inference for PDE-governed inverse problems using deep learning techniques (VAE, Normalizing Flows, FNO), whereas the provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is no direct overlap in methodology or application domain regarding tokenizers, visual encoders, or RL, resulting in zero relevance for all specified keywords.

关键词

Bayesian Inference, Dimension Reduction, Variational Flow, Fourier Neural Operator, PDE Inverse Problems, Normalizing Flows, Posterior Distribution

340. Kernel-based potential mean-field games with unbiased random Fourier $U$-statistics FAIL

Score: 0.0 / 27.8

Authors: Yumiharu Nakano

Published: 2026-05-28

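TL;DR: This paper proposes a computational framework based on random Fourier U-statistics for potential mean-field games with MMD penalties, proves its convergence, and applies it to the Schrödinger bridge problem and electric vehicle charging coordination.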

摘要翻译

我们研究了一类势平均场博弈（Mean-Field Games）的子类，其中运行交互成本和终端目标成本均通过再生核最大均值差异（MMD）惩罚项表示，并开发了一种利用这种核结构的计算框架。这两个成本均利用随机傅里叶 U 统计量（U-statistic）表示，从有限样本经验分布中进行估计，该方法具有无偏性且计算成本随批量大小线性增长。受控扩散过程的漂移项由神经网络参数化，并通过随机梯度下降进行训练。针对该子类，我们在惩罚参数、随机特征数量、样本量和优化容差的耦合率条件下，证明了样本级几乎必然收敛定理及显式的几乎必然收敛速率。该框架将核 MMD 惩罚 Schrödinger 桥问题（Schrödinger Bridge Problem）作为交互成本消失的特例包含在内。数值实验展示了该方法在维度高达一百的 Schrödinger 桥问题以及具有每辆车物理异质性的电动汽车充电协调问题上的应用，其中总需求拥堵成本代表了群体层面的价格反馈竞争，而终端 MMD 惩罚则塑造了截止时刻的荷电状态分布。

Abstract

We study the subclass of potential mean-field games in which the running interaction cost and the terminal target cost are both expressed through reproducing-kernel maximum mean discrepancy (MMD) penalties, and develop a computational framework that exploits this kernel structure. Both costs are estimated from finite-sample empirical distributions using a random Fourier U-statistic representation that is unbiased and has linear cost in the batch size. The drift of the controlled diffusion is parametrized by a neural network and trained via stochastic gradient descent. For this subclass we prove a sample-level almost-sure convergence theorem and an explicit almost-sure rate of convergence, under coupled rate conditions on the penalty parameter, the random-feature count, the sample size, and the optimization tolerance. The framework includes the kernel-MMD-penalty Schrödinger bridge problem as the special case of a vanishing interaction cost. Numerical experiments illustrate the method on the Schrödinger bridge problem in dimensions up to one hundred, and on an electric vehicle charging coordination problem with per-vehicle physical heterogeneity, where an aggregate-demand congestion cost represents price-feedback competition at the population level and the terminal MMD penalty shapes the state-of-charge distribution at the deadline.
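The linear-cost claim can be illustrated with a standard identity: when the kernel is replaced by an inner product of random Fourier features φ, the sum over pairs i ≠ j of φ_i · φ_j equals |Σ φ_i|² − Σ |φ_i|², so the U-statistic needs only feature sums. The sketch below applies this to a one-dimensional Gaussian kernel and checks it against explicit pair sums. The kernel choice, bandwidth, sample sizes, and function names are assumptions, and this is not claimed to be the paper's exact estimator.

```python
import math
import random

def draw_features(num_features, bandwidth, rng):
    """Frequencies and phases for the kernel exp(-(x - y)**2 / (2 * bandwidth**2))."""
    freqs = [rng.gauss(0.0, 1.0 / bandwidth) for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    return freqs, phases

def phi(x, freqs, phases):
    """Random Fourier feature map; phi(x) . phi(y) is an unbiased kernel estimate."""
    scale = math.sqrt(2.0 / len(freqs))
    return [scale * math.cos(w * x + b) for w, b in zip(freqs, phases)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mmd2_linear(xs, ys, freqs, phases):
    """U-statistic MMD^2 estimate in O((n + m) * D) time via feature sums."""
    fx = [phi(x, freqs, phases) for x in xs]
    fy = [phi(y, freqs, phases) for y in ys]
    n, m = len(fx), len(fy)
    sx = [sum(col) for col in zip(*fx)]
    sy = [sum(col) for col in zip(*fy)]
    # sum_{i != j} phi_i . phi_j = |sum_i phi_i|^2 - sum_i |phi_i|^2
    xx = (dot(sx, sx) - sum(dot(f, f) for f in fx)) / (n * (n - 1))
    yy = (dot(sy, sy) - sum(dot(f, f) for f in fy)) / (m * (m - 1))
    xy = dot(sx, sy) / (n * m)
    return xx + yy - 2.0 * xy

def mmd2_pairwise(xs, ys, freqs, phases):
    """Same estimator written as explicit O(n^2) pair sums, kept only as a check."""
    fx = [phi(x, freqs, phases) for x in xs]
    fy = [phi(y, freqs, phases) for y in ys]
    n, m = len(fx), len(fy)
    xx = sum(dot(fx[i], fx[j]) for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    yy = sum(dot(fy[i], fy[j]) for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    xy = sum(dot(a, b) for a in fx for b in fy) / (n * m)
    return xx + yy - 2.0 * xy

rng = random.Random(0)
freqs, phases = draw_features(64, 1.0, rng)
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]   # hypothetical samples from one law
ys = [rng.gauss(1.0, 1.0) for _ in range(40)]   # hypothetical samples from a shifted law
fast = mmd2_linear(xs, ys, freqs, phases)
slow = mmd2_pairwise(xs, ys, freqs, phases)
print(fast, slow)
```

Dropping the diagonal terms is what keeps the within-sample averages unbiased, while the feature-sum form avoids ever forming the n-by-n kernel matrix.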

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

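Scoring rationale: The paper's core is a mathematical framework for kernel methods and mean-field games, involving random Fourier U-statistics and neural-network parametrization. Among the keywords, Unify Models, World Models, MLLM, and MultiModal belong to the multimodal large-model area, Tokenizer and Visual Encoder are model components, and model-based RL is a reinforcement-learning paradigm. The paper does not involve multimodal data fusion, visual processing, tokenization, world-model construction, or reinforcement-learning algorithms, so every keyword receives a relevance of 0. The author list does not include Yang Shi or the other designated experts, so no bonus is added.

Keywords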

Kernel-based, potential mean-field games, random Fourier U-statistics, MMD penalties, neural network, convergence theorem, Schrödinger bridge, electric vehicle charging

341. PassNet: Scaling Large Language Models for Graph Compiler Pass Generation [FAIL]

Score: 0.0 / 27.8

Authors: Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

Published: 2026-05-28

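TL;DR: PassNet proposes an LLM-based system for generating compiler optimization passes, aimed at the performance ceiling that tensor compilers hit on long-tail workloads, and its benchmark shows that LLMs can deliver substantial speedups on individual subgraphs.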

Abstract

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

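Scoring rationale: The paper addresses LLM-driven compiler optimization (pass generation), which falls under AI for software engineering. The supplied keyword set focuses on multimodal learning (MLLM, MultiModal, Visual Encoder), world models (World Models), and model-based reinforcement learning (model-based RL). The paper does not involve visual encoding, multimodal representation fusion, world-model construction, or reinforcement-learning algorithms, and has no direct connection to any of the given keywords. The author list does not include Yang Shi or the other designated experts, so no bonus is added.

Keywords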

Large Language Models, Graph Compiler, Pass Generation, Tensor Compilers, Optimization, Benchmark, Performance Speedup

342. Mixing Vector Model for Copolymer Inference via Mixed Integer Linear Programming [FAIL]

Score: 0.0 / 27.8

Authors: Jianshen Zhu, Raveena Rai, Taiyo Sohkawa, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

Published: 2026-05-28

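TL;DR: This paper proposes a copolymer inverse-design framework based on the mixing vector model and mixed integer linear programming, achieving accurate prediction of physicochemical properties and exact inference.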

Abstract

A novel two-phase molecule inference framework, mol-infer, has recently been developed to infer chemical graphs with prescribed abstract structures and desired property values through mixed integer linear programming (MILP) under the two-layered model, with guaranteed optimality and exactness relative to the given learned prediction function and structural constraints. In this study, we extend this framework to copolymers by introducing a simple feature representation, called the mixing vector (MV) model. In the proposed model, a copolymer feature vector is represented as a convex combination of MILP-tractable monomer descriptors weighted by the mixing ratio of the constituent monomers. This representation does not require explicit sequence-class information and is therefore naturally compatible with MILP-based inverse design. Under this model, we construct prediction functions for several copolymer property datasets using artificial neural networks, reduced quadratic multiple linear regression, and random forests. The proposed representation achieves practically useful predictive performance across multiple physicochemical property datasets; in particular, the best test R^2 score exceeds 0.7 for nine of the ten datasets and exceeds 0.9 for six datasets. We also formulate a multi-monomer inverse-design problem under the MV representation with a prescribed mixing ratio and show that the resulting MILP instances remain tractable, even for three-monomer settings. Finally, we perform an external consistency check by re-evaluating the inferred candidates and comparing the re-computed property values with those predicted by the learned model. Overall, the proposed framework gives a tractable first step toward model-level exact inverse design of copolymers under the two-layered model.
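
The mixing vector described in the abstract, a convex combination of monomer descriptors weighted by the mixing ratio, can be sketched in a few lines of Python. The descriptor values and the function name below are hypothetical; the actual descriptors used by the paper are not given in this report.

```python
def mixing_vector(descriptors, ratios, tol=1e-9):
    """Convex combination of monomer descriptor vectors weighted by mixing ratios."""
    if len(descriptors) != len(ratios):
        raise ValueError("need one mixing ratio per monomer")
    if any(r < 0 for r in ratios) or abs(sum(ratios) - 1.0) > tol:
        raise ValueError("mixing ratios must be non-negative and sum to 1")
    dim = len(descriptors[0])
    return [sum(r * d[k] for r, d in zip(ratios, descriptors)) for k in range(dim)]


if __name__ == "__main__":
    # Two hypothetical monomers with two-dimensional descriptors, mixed 25:75.
    print(mixing_vector([[1.0, 0.0], [0.0, 2.0]], [0.25, 0.75]))
```

Since the combination is linear in the descriptors once the ratios are fixed, a representation of this kind stays expressible with linear constraints, which is consistent with the abstract's remark about MILP compatibility.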

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

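Scoring rationale: The paper belongs to computational chemistry and materials science and studies copolymer inference and inverse design based on mixed integer linear programming (MILP). It covers molecular feature representation, machine-learning property prediction, and optimization algorithms, and has no direct connection to multimodal large language models (MLLM), World Models, reinforcement learning (RL), Visual Encoder, Tokenizer, or Unify Models. The author list does not include any designated expert, so every keyword receives a relevance of 0 and no expert bonus is added.

Keywords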

Copolymer Inference, Mixed Integer Linear Programming, Mixing Vector Model, Inverse Design, Physicochemical Properties, Monomer Descriptors, Convex Combination

343. NeuroEdge: Real-Time Hand Gesture Recognition with High-Density EMG Using Deep Learning at the Edge [FAIL]

Score: 0.0 / 27.8

Authors: Peter Chudinov, Zhenyu Lin, Jay Motamarry, Srihita Panati, Xiaorong Zhang, Zhuwei Qin

Published: 2026-05-28

TL;DR: NeuroEdge achieves real-time hand gesture recognition with 90% accuracy using high-density EMG and lightweight CNNs deployed on resource-constrained edge microcontrollers.

Abstract

High-density electromyography (HD-EMG) has emerged as a powerful modality for decoding fine-grained neuromuscular activity, enabling real-time neural-machine interfaces (NMIs) for applications such as prosthetic control, rehabilitation, and augmented interaction. While deep learning approaches such as convolutional neural networks (CNNs) have demonstrated high classification accuracy for EMG-based gesture recognition, their deployment on embedded hardware remains a major challenge due to computational and memory constraints. This paper presents NeuroEdge, a real-time HD-EMG-based NMI system that performs gesture recognition entirely on resource-constrained microcontrollers. The system features two custom-designed modules: the HD-EMG StreamBridge, a wireless communication interface that streams raw HD-EMG data from a Quattrocento amplifier to an ESP32 microcontroller; and the EdgeDL Inference Engine, a lightweight deep learning framework executing on a Sony Spresense microcontroller. A compact 1-dimensional CNN optimized for embedded inference processes sliding windows of EMG data in real time. Data streaming and inference are pipelined and synchronized through an architecture that utilizes Direct Memory Access (DMA) for data transfer and Serial Peripheral Interface (SPI) burst communication between the ESP32 and Spresense, ensuring low-latency performance. Experimental results show that NeuroEdge achieves a real-time classification accuracy of 90% across seven hand gestures, with a total average latency of 83 ms using 192 channels of HD-EMG recorded from the forearm. Our system demonstrates the feasibility of deploying complex HD-EMG-based gesture recognition on microcontroller-based edge devices, bridging the gap between high-resolution biosignal acquisition and deep learning-based embedded inference for next-generation NMIs.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on EMG signal processing and edge deployment of CNNs for gesture recognition, which has no conceptual overlap with Unify Models, Tokenizers, Visual Encoders, World Models, MLLMs, MultiModal large models, or Model-Based RL. None of the specified expert authors are present.

Keywords

High-density EMG, Gesture Recognition, Edge Computing, Deep Learning, Microcontrollers, Neural-Machine Interfaces, Real-time Inference, 1D CNN

344. A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm [FAIL]

Score: 0.0 / 27.8

Authors: Sakshi Kumari, Shyam Kumar M, Sushmitha P

Published: 2026-05-28

Abstract

A crucial component of machine learning algorithms is minimizing loss functions with less computational cost and fewer oscillations. While adaptive learning rate-based optimizers have been widely used for real-world tasks, they do not guarantee convergence, which is why AMSGrad was later introduced to investigate the non-convergence behaviour of Adam. In this paper, popular adaptive optimization methods like Adam and AMSGrad are critically reviewed with an emphasis on their fundamental design concepts. To address limitations of the above-mentioned optimizers, a new optimizer variant, C-Adam, is proposed based on the line of sight approach. A theoretical proof of convergence is also provided, and the optimizer is validated through a number of numerical experiments based on real-life problems.
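
For readers unfamiliar with the two baselines the abstract reviews, here is a minimal scalar sketch of the Adam update and the AMSGrad modification, which keeps a running maximum of the second-moment estimate so the denominator never shrinks. Applying bias correction in both branches is a choice made for this sketch, and the function name and default hyperparameters are assumptions. C-Adam is not specified in the abstract and is deliberately not implemented.

```python
import math


def minimize(grad, x0, steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """Scalar Adam; with amsgrad=True the second moment is replaced by its running maximum."""
    x, m, v, v_max = x0, 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        v_max = max(v_max, v)                  # non-decreasing variant used by AMSGrad
        second = v_max if amsgrad else v
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = second / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


if __name__ == "__main__":
    quadratic_grad = lambda x: 2.0 * x  # gradient of f(x) = x^2
    print(minimize(quadratic_grad, 5.0, 1000))
    print(minimize(quadratic_grad, 5.0, 1000, amsgrad=True))
```

The only difference between the two branches is the `v_max` line, which is what makes the effective step size non-increasing in the AMSGrad variant.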

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 170 (char 393)

345. Causal Label Recovery in Payment Networks [FAIL]

Score: 0.0 / 27.8

Authors: Gaurav Dhama

Published: 2026-05-28

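TL;DR: This paper proposes a Sequential Triply Robust (STR) estimator for recovering labels in payment networks that are affected by systematic bias and missing data, and shows that it attains the semiparametric efficiency bound.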

Abstract

Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequential gates: authorization (declined transactions generate no labels), issuer reporting (unreported fraud is invisible), and delay (pending chargebacks are missing at training time). Labels that do arrive may be corrupted by first-party misuse or issuer misclassification. A companion paper [arXiv:2605.27557] proved that these four impairments impose a minimax lower bound on detection performance. This paper asks: can that bound be achieved? We formalize the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, and construct the Sequential Triply Robust (STR) estimator. The STR corrects for all four impairments simultaneously and achieves the semiparametric efficiency bound -- no estimator can have lower asymptotic variance. It is sequentially triply robust: at each gate, consistency requires only that either the propensity model or the outcome regression is correctly specified, not both. We provide corruption correction via noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage to stabilize inverse-propensity weights for small issuers, a plug-in variance estimator yielding valid confidence intervals, and a Bernstein concentration inequality for finite-sample guarantees. On the operational side, we derive the optimal training delay -- the maturity window that minimizes the sum of label-quality loss and model staleness -- and prove that the STR permits training on data that is days old rather than months old, decoupling model freshness from the chargeback maturity cycle. The STR provably dominates naive chargeback-based training in mean squared error for any sample size.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

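Scoring rationale: The paper focuses on causal inference and fraud detection in payment networks, handling label bias and missing data, whereas the supplied keywords concern multimodal large models, world models, and reinforcement learning. The two share no research content, methods, or technology stack, so the relevance score is 0.

Keywords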

Fraud detection, Payment networks, Causal inference, Label bias, STR estimator, Sequential missing data, Chargeback labels, Propensity scores

346. Robust Frequency-Calibrated Virtual EEG Channel Generation from Four Frontal Electrodes for Wearable EEG Augmentation [FAIL]

Score: 0.0 / 27.8

Authors: Minghao Xiao

Published: 2026-05-28

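TL;DR: The paper proposes FAVC-Net, a network that uses a frequency-calibrated attention mechanism to generate 13 virtual EEG channels from four frontal electrodes, markedly improving spectral fidelity for wearable EEG augmentation.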

Abstract

Low-channel wearable electroencephalography (EEG) is attractive for long-term monitoring, but four frontal electrodes provide only a sparse and spatially biased view of distributed scalp activity. We present FAVC-Net, a compact frequency-calibrated virtual-channel network that generates 13 unmeasured EEG channels from Fp1, Fp2, F7, and F8. The model combines shared multi-scale source encoding, source-state embeddings, target-conditioned signed source-block mixing, GATv2-based attention refinement, attention-consistent skip fusion, and weak Welch power spectral density calibration. Rather than treating sparse-to-dense EEG generation as a purely waveform-matching task, the framework jointly emphasizes amplitude fidelity, spectral allocation, channel-frequency texture, and robustness to corrupted wearable inputs. On the PRED+CT dataset, FAVC-Net achieved the best joint waveform-spectral operating point among neural and interpolation baselines. Its time-domain gains were modest, whereas log-spectral distance and PSD KL divergence were reduced by 30.09% and 37.98% relative to the strongest non-FAVC comparator. Under wearable-like source perturbations, the model preserved spectral fidelity and resisted spectral collapse. These results support virtual EEG channel generation as a dual-domain augmentation problem, while emphasizing that generated posterior and parietal channels should be interpreted as frequency-calibrated representations derived from sparse frontal measurements rather than as independent physical recordings.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

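Scoring rationale: The paper focuses on electroencephalography (EEG) signal processing and virtual channel generation (FAVC-Net), which belongs to biomedical signal processing and has no substantive connection to the directions covered by the keywords: large language models, multimodal foundation models, tokenizers, visual encoders, world models, and reinforcement learning. The author list does not include anyone on the designated expert list, so the expert bonus is not triggered.

Keywords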

Virtual EEG Channel Generation, Wearable EEG Augmentation, Frequency-Calibrated, FAVC-Net, Multi-scale Source Encoding, GATv2 Attention, Power Spectral Density

347. HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization [FAIL]

Score: 0.0 / 27.8

Authors: Artur Zagitov, Gleb Molodtsov, Aleksandr Beznosikov

Published: 2026-05-28

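TL;DR: The paper proposes HARP, a learnable adaptive rotation processor for extreme low-bit quantization of large language models that improves perplexity and zero-shot accuracy while preserving high inference efficiency.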

Abstract

Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-based PTQ methods mitigate this issue with fixed randomized Hadamard transforms (RHTs), which improve quantization robustness but cannot adapt the rotated basis to the layer, calibration distribution, or quantizer. We introduce HARP (Hadamard-preconditioned Adaptive Rotation Processor), a learnable structured two-sided orthogonal processor that replaces fixed Hadamard mixing while preserving exact full-precision equivalence. HARP represents each rotation as a product of sparse butterfly-like block-orthogonal stages, supports non-power-of-two dimensions via Mixed-Radix schedules, and initializes to the RHT processor up to a fixed permutation. Fitted only on calibration data, HARP adapts the quantization basis to each layer and backend. Across 2-4 bit settings on models ranging from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT. Importantly, HARP preserves deployment efficiency, reaching 128 tok/s versus 61 tok/s for FP16.
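
The fixed randomized Hadamard transform (RHT) that the abstract uses as its baseline can be sketched in plain Python. The sketch illustrates two properties: rotating a weight row and an activation vector with the same orthogonal transform leaves their dot product unchanged (the "full-precision equivalence"), and a single large outlier is spread evenly across coordinates. The function names are illustrative; HARP's learnable butterfly stages and mixed-radix schedules are not reproduced here.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; the length must be a power of two."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [scale * u for u in out]


def rht(vec, signs):
    """Randomized Hadamard transform: flip signs, then apply the orthonormal Hadamard matrix."""
    return fwht([s * u for s, u in zip(signs, vec)])


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(8)]
    weight_row = [rng.uniform(-1, 1) for _ in range(8)]
    activation = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one large outlier
    print(dot(weight_row, activation), dot(rht(weight_row, signs), rht(activation, signs)))
    print(rht(activation, signs))
```

After the rotation, every coordinate of the outlier vector has the same magnitude, which is the kind of flattening that makes low-bit quantization grids less wasteful.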

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

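Scoring rationale: The paper's core is extreme LLM quantization (HARP), which uses an adaptive rotation processor to improve low-bit quantization. It does not involve multimodal architectures, world-model construction, reinforcement learning, or visual encoders. Although it mentions inference throughput (tok/s), it does not address tokenizer design. It therefore has no substantive connection to the given multimodal and RL keywords.

Keywords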

Extreme LLM Quantization, Post-training quantization, Hadamard Transform, Adaptive Rotation, Inference Efficiency, Activation Outliers, Zero-shot Accuracy, Orthogonal Processor

348. Open World Autoencoding Drift Detection with Novel Class Recognition in Tabular Non-stationary Data Streams [FAIL]

Score: 0.0 / 27.8

Authors: Joanna Komorniczak

Published: 2026-05-28

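TL;DR: This paper proposes an unsupervised autoencoder-based method for detecting concept drift and recognizing novel classes in non-stationary tabular data streams; experiments show competitive performance.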

Abstract

Data stream processing has become a landmark in modern machine learning applications, with concept drifts and novel class appearances posing the primary challenges faced by sophisticated recognition methods. This work proposes an unsupervised concept drift detection method that identifies shifts in known class distributions based on the reconstruction errors of an autoencoder, while also enabling the recognition of novel class samples through density estimation of a proxy representation of samples. Using mirrored autoencoders allows for independent incremental adaptation to changing problem distributions for the two considered tasks, resulting in continuous adjustment to evolving concepts and reliable recognition of unknown samples. Conducted experiments used a diverse set of synthetic tabular data streams, where both concept drifts and the emergence of novelties were observed. The results show that the proposed approach is competitive with current state-of-the-art unsupervised drift detectors and novelty classifiers.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

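Scoring rationale: The paper focuses on concept drift detection and novel class recognition in tabular data streams using autoencoders. The supplied keywords all point to multimodal large models, world models, and reinforcement-learning architectures, which have no direct connection to the traditional stream data mining studied here, so every keyword is scored 0.

Keywords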

Data stream processing, Concept drift detection, Novel class recognition, Autoencoder, Tabular data, Unsupervised learning, Density estimation, Non-stationary data

349. Resolution Diagnostics for Paired LLM Evaluation [FAIL]

Score: 0.0 / 27.8

Authors: Anany Kotawala

Published: 2026-05-28

TL;DR: This paper analyzes statistical resolution issues in paired LLM evaluations, demonstrating that many leaderboard comparisons lack sufficient power and that common sample size calculation shortcuts are inaccurate.

Abstract

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely-used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.
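
The resolution ratio q = N/N* from the abstract can be illustrated with a first-order sketch. It assumes a two-sided z-test on per-item correctness differences, with the difference variance built from the two accuracies and their per-item correlation `rho`; it omits the second-order constant the abstract refers to, and reading q >= 1 as "resolved" is an interpretation made here. The function names and example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def required_n(p_a, p_b, rho, alpha=0.05, power=0.8):
    """First-order paired sample size N* for detecting p_a - p_b (normal approximation)."""
    delta = p_a - p_b
    if delta == 0:
        raise ValueError("accuracies must differ")
    var_a, var_b = p_a * (1 - p_a), p_b * (1 - p_b)
    # Variance of the per-item difference of two correlated correctness indicators.
    var_d = var_a + var_b - 2 * rho * math.sqrt(var_a * var_b)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * z * var_d / (delta * delta)


def resolution_ratio(n_items, p_a, p_b, rho, alpha=0.05, power=0.8):
    """q = N / N*: values below 1 mean the benchmark is too small for this pair."""
    return n_items / required_n(p_a, p_b, rho, alpha, power)


if __name__ == "__main__":
    # Hypothetical pair: 72% vs 70% accuracy, per-item correlation 0.6, 1000 items.
    print(required_n(0.72, 0.70, 0.6))
    print(resolution_ratio(1000, 0.72, 0.70, 0.6))
```

Higher per-item correlation shrinks the difference variance and therefore N*, which is why pairing matters for closely ranked models.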

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on statistical evaluation methodology (paired hypothesis testing, resolution diagnostics) for LLMs, whereas the provided keywords pertain to model architectures (Tokenizer, Visual Encoder), learning paradigms (World Models, Model-Based RL), and multimodality (MultiModal, MLLM, Unify Models). There is no technical overlap regarding model structure, training, or multimodal processing, resulting in zero relevance for all keywords. No expert authors from the specified list were found.

Keywords

Paired LLM Evaluation, Resolution Diagnostics, Hypothesis Testing, Sample Size Calculation, Open LLM Leaderboard, MMLU-Pro, Statistical Power, Paired Comparisons

350. GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German [FAIL]

Score: 0.0 / 27.8

Authors: Fabian Mewes, Anne Lauscher, Vagrant Gautam

Published: 2026-05-28

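TL;DR: This paper builds the GRUFF dataset to evaluate pronoun fidelity and bias of LLMs in German, finding that models show strong grammatical agreement but handle neopronouns and distractors poorly.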

Abstract

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

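Scoring rationale: The paper focuses on pronoun fidelity, reasoning, and biases of large language models (LLMs) in German, which is a text-only natural language processing (NLP) task. Of the supplied keywords, Visual Encoder, MultiModal, and MLLM concern multimodal techniques, World Models concerns environment modeling, and model-based RL concerns reinforcement learning; Tokenizer and Unify Models are likewise not central to the paper, so every keyword receives a relevance of 0. The author list (Fabian Mewes, Anne Lauscher, Vagrant Gautam) does not include anyone on the designated expert list.

Keywords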

Pronoun Fidelity, German LLMs, Stereotypical Biases, Gender Agreement, Referential Reasoning, Encoder-only Models, Occupational Stereotypes, Neopronouns

351. A Dual-Path Architecture for Scaling Compute and Capacity in LLMs [FAIL]

Score: 0.0 / 27.8

Authors: Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

Published: 2026-05-28

TL;DR: The paper proposes a dual-path transformer architecture to independently scale compute and capacity in language models, achieving better performance than baseline transformers under fixed FLOP budgets.

Abstract

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale compute, the number of sequential operations applied to a hidden state, and capacity, the parameters available at a single step. For this we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
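
A rough per-token sketch of the dual-path idea follows, under loudly stated assumptions: both paths are plain ReLU feed-forward networks, the gates are sigmoids of a linear read-out of the input token, and the two gated paths are added to the residual. None of these details (nor the names `init_params` and `dual_path_block`) come from the paper; attention, normalization, and training are omitted entirely. The sketch only shows the structural point that the deep path reuses the same parameters K times while the wide path spends parameters once.

```python
import math
import random


def matvec(mat, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in mat]


def ffn(h, w_in, w_out):
    """Two-layer feed-forward network with ReLU."""
    hidden = [max(0.0, u) for u in matvec(w_in, h)]
    return matvec(w_out, hidden)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def init_params(d_model, d_deep, d_wide, rng):
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "deep_in": mat(d_deep, d_model), "deep_out": mat(d_model, d_deep),
        "wide_in": mat(d_wide, d_model), "wide_out": mat(d_model, d_wide),
        "gate_deep": [rng.gauss(0.0, 0.1) for _ in range(d_model)],
        "gate_wide": [rng.gauss(0.0, 0.1) for _ in range(d_model)],
    }


def dual_path_block(h, params, k):
    """Apply the shared deep sublayer k times, the wide sublayer once, and gate both per token."""
    deep = list(h)
    for _ in range(k):  # same parameters on every iteration
        update = ffn(deep, params["deep_in"], params["deep_out"])
        deep = [u + d for u, d in zip(deep, update)]
    wide = ffn(h, params["wide_in"], params["wide_out"])
    g_deep = sigmoid(sum(a * b for a, b in zip(params["gate_deep"], h)))
    g_wide = sigmoid(sum(a * b for a, b in zip(params["gate_wide"], h)))
    out = [u + g_deep * (d - u) + g_wide * w for u, d, w in zip(h, deep, wide)]
    return out, (g_deep, g_wide)


if __name__ == "__main__":
    rng = random.Random(0)
    params = init_params(d_model=4, d_deep=4, d_wide=16, rng=rng)
    token = [0.5, -1.0, 0.25, 2.0]
    print(dual_path_block(token, params, k=3))
```

Inspecting the returned gate pair per token is the hook for the kind of routing analysis the abstract describes.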

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper proposes a dual-path architecture for scaling LLMs, focusing on compute and capacity trade-offs. It is unrelated to multimodal learning (MultiModal, MLLM, Visual Encoder), world models, reinforcement learning (model-based RL), or tokenizer design. The 'Unify Models' keyword is not relevant as the paper does not unify modalities or tasks in the context of the provided keyword cluster. None of the specified expert authors are listed.

Keywords

Dual-Path Architecture, Scaling Compute, Scaling Capacity, LLMs, Looped Transformers, Per-Token Gates, Parameter-Efficient, FLOP Budgets

352. UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering [FAIL]

Score: 0.0 / 27.8

Authors: Yingdong Shi, Ruiming Zhang, Changming Li, Zhiyu Yang, Kaixing Zhang, Jingyi Yu, Kan Ren

Published: 2026-05-28

Abstract

Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 157 (char 380)

353. Personalized Turn-Level User Conversation Satisfaction Benchmark [FAIL]

Score: 0.0 / 27.8

Authors: Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang, Quanjia Yan, Hengliang Luo

Published: 2026-05-28

TL;DR: This paper proposes PersTurnBench, a personalized turn-level conversation satisfaction benchmark using user memories and LLM-based scoring, but it does not address multimodal world models or reinforcement learning.

Abstract (Chinese translation)

用户对 AI 助手的满意度高度个性化：相同的回复可能满足一位用户，却令另一位用户失望，这取决于每位用户的期望以及他们此前询问的内容。现有的自动评估方法主要衡量通用响应质量，因此难以判断某次特定轮次的回复是否满足了用户。本文将此问题定义为个性化轮次级用户对话满意度评估。我们构建了一个对话满意度评估器，该评估器结合紧凑的用户记忆与目标轮次上下文，以生成满意度分数和面向不满的推理理由。针对人类满意度标注的元评估表明，相较于监督式、基于检索的以及通用 LLM-as-a-judge 基线，个性化记忆和事后分数校准在序数一致性和不满轮次检测方面表现更优。此外，我们还引入了 PersTurnBench，这是一个个性化轮次级用户对话满意度基准，它利用经过验证的评估器通过回放来评估生成模型。通过固定回放状态，PersTurnBench 使得通用生成模型与记忆增强个性化系统之间的受控比较成为可能，而无需为每个候选模型收集新的标注。该评估器和基准使研究人员能够在个性化满意度上比较候选生成模型，而无需为每个模型收集新的用户反馈。

Abstract

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on conversational AI evaluation and user memory modeling, whereas the keywords pertain to multimodal architecture (Visual Encoder, MultiModal, MLLM), sequence processing (Tokenizer), and reinforcement learning paradigms (World Models, model-based RL). There is no technical overlap regarding model architecture or learning paradigms.

Keywords

Personalized Turn-Level Evaluation, User Conversation Satisfaction, Compact User Memories, PersTurnBench, LLM-as-a-Judge, Post-hoc Score Calibration, Conversational AI Benchmark

354. Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems [FAIL]

Score: 0.0 / 27.8

Authors: Lorenz Kutschka, Bernhard Geiger

Published: 2026-05-28

Abstract (Chinese translation)

智能体系统中的大语言模型接收工具模式与执行结果，并将工具调用输出为结构化数据。这种交换的默认语言 JSON 旨在实现应用间交互，而非追求 token 效率，因此其结构元素带来了显著的 token 开销。近期工作提出了诸如 TOON（Token-Oriented Object Notation）和 TRON（Token Reduced Object Notation）等 token 优化替代方案，作为更紧凑的替换格式，但这些格式仅在孤立的理解或生成任务上进行了评估。因此，这些格式在端到端智能体闭环中能否保持 token 减少效果，仍是一个开放性问题。我们在四个智能体基准（BFCL、MCPToolBenchPP、MCP-Universe、StableToolBench）和五个开源权重大语言模型上评估了 TOON 和 TRON，通过将输入压缩与输出压缩解耦，独立测量理解与生成性能。TRON 最多可减少 27% 的 token，准确率在 JSON 基线 14 个百分点以内。TOON 实现了最多 18% 的减少，准确率代价约为 9 个百分点，但此外还会在多轮解析失败时发生级联错误，并对大多数模型的并行工具调用输出产生坍塌。

Abstract

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 93 (char 316)

355. EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL [FAIL]

Score: 0.0 / 27.8

Authors: Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang, Haozhe Feng, Haoxuan Li, Xuan Yi, Chao Hu, Defeng Xie, Chen Hou, Danqing Huang, Wei Chen, Yingcai Wu, Peng Chen, Dazhen Deng

Published: 2026-05-28

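TL;DR: EviLink improves schema linking for large-scale Text-to-SQL through uncertainty-guided multi-path evidence acquisition, raising schema completeness while lowering token cost.

Abstract (Chinese translation)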

模式链接是大规模 Text-to-SQL 中一个困难且重要的步骤，系统必须从庞大且模糊的数据库中识别出紧凑且充分的模式上下文。现有方法通常将模式链接视为围绕单个 SQL 路径的确定性选择，但复杂问题可能允许多个具有不同模式需求的有效实现。我们将模式链接重新定义为在多个合理 SQL 路径上进行不确定性感知的模式需求推断，系统区分必需的模式项与依赖路径的不确定性项，并仅在需要时获取证据。我们通过 EviLink 实现了这一重构，该方法结合了多假设模式锚定与不确定性引导的证据获取。在 BIRD-Dev 和 Spider2-Snow 上的实验表明，这种视角改善了模式完整性、模式相关性与 token 成本之间的平衡。在 Spider2-Snow 上，EviLink 实现了 90.15% 的字段级严格召回率，平均使用 123.30K 个 token，并在固定生成器下改进了下游 SQL 生成。

Abstract

Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.
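One simple way to picture the split between required and path-dependent schema items is with set operations over several hypothesized SQL paths. The sketch below is only an illustration: treating "required" as the intersection of all hypotheses and "uncertain" as the remainder of their union is an assumption, not EviLink's actual procedure, and the function name and table/column names are made up.

```python
def split_schema_needs(path_hypotheses):
    """Split schema items into required (used by every path) and uncertain (used by some)."""
    paths = [set(p) for p in path_hypotheses]
    if not paths:
        return set(), set()
    required = set.intersection(*paths)
    uncertain = set.union(*paths) - required
    return required, uncertain

# Three hypothetical SQL realizations of the same question.
hypotheses = [
    {"orders.id", "orders.customer_id", "customers.name"},
    {"orders.id", "orders.customer_id", "customers.name", "customers.region"},
    {"orders.id", "customers.name", "order_items.price"},
]
required, uncertain = split_schema_needs(hypotheses)
```

Under this reading, extra evidence (for example sampled column values) would only need to be fetched for the items in `uncertain`, which is where the token savings would come from.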

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

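Scoring rationale: The paper addresses schema linking and uncertainty-guided evidence acquisition in Text-to-SQL, which belongs to database querying and natural language processing. The provided keyword set centers on multimodal large models (MLLM, MultiModal, Visual Encoder), world models (World Models), and reinforcement learning (model-based RL). The paper does not involve visual encoding, multimodal fusion, world-model construction, or reinforcement learning algorithms and is entirely unrelated to the keyword topics, so every keyword is scored 0.

Keywords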

Text-to-SQL, Schema Linking, Uncertainty-Guided, Multi-Path, Evidence Acquisition, SQL Generation, Large-Scale Database

356. Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese [FAIL]

Score: 0.0 / 27.8

Authors: Wajdi Zaghouani, Kholoud K. Aldous, Yicheng Gao

Published: 2026-05-28

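TL;DR: To address the breakdown of LLM safety systems in Chinese-language settings, this paper builds ChiSafe-PAS, a human-annotated benchmark of 1,897 adversarial Chinese prompts for evaluating safety alignment in high-stakes domains.

Abstract (Chinese translation)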

当大型语言模型（LLMs）被部署于中文语境时，一种令人担忧的模式浮现出来：在英语环境中表现良好的安全系统会失效。这些系统难以跨越语言和文化界限，导致模型暴露于利用特定中文规避技术的对抗性提示之下，包括拼音转写、字形拆解、网络用语及模糊语气。为填补这一空白，我们提出 ChiSafe-PAS（中国安全试点标注集），这是一个包含 1,897 个对抗性中文提示的人工标注基准，涵盖四个高风险领域：自残与暴力、毒品与非法贸易、欺诈以及讽刺。其中，1,544 条条目包含完整的金标准标注：三类响应标签（REFUSE、SAFE-REDIRECT、RESPOND）、九类混淆分类法、风险等级评定以及标注者理由。我们详细介绍了该数据集的设计、标注过程以及混淆分类法。我们的主要目标是实践性的：为研究社区提供一个高质量、植根于文化的资源，用于基准测试 LLM 安全对齐。在此过程中，我们触及了该领域三个更广泛的张力：训练数据与评估数据之间界限的模糊、基于现实风险的领域覆盖需求，以及规模作为文化专业知识替代品的局限性。

Abstract

When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

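Scoring rationale: The paper focuses on building a safety evaluation benchmark for Chinese-language LLMs, involving adversarial prompts and human annotation. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) mainly concern model architecture, representation learning, and reinforcement learning, and have no direct technical connection to safety benchmark research, so every keyword receives a relevance score of 0. The weighted total is 0, far below the dynamic pass score of 27.8. The author list contains none of the specified expert authors, so no bonus is added.

Keywords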

LLM Safety Evaluation, Chinese Language, Adversarial Prompts, Benchmark Dataset, Human Annotation, Obfuscation Taxonomy, Safety Alignment

357. Classification of non-analyzable word types in web documents to implement an effective Korean e-learning system [FAIL]

Score: 0.0 / 27.8

Authors: Sang-Taek Park, Ae-Lim Ahn, Eric Laporte, Jee-Sun Nam

Published: 2026-05-28

TL;DR: This paper proposes Local Grammar Graphs to classify informal Korean text in web documents, aiming to enhance Korean e-learning systems by incorporating real-world language expressions.

Abstract (Chinese translation)

电子学习系统 (E-learning systems) 应提供反映语言实际使用各种现象的内容。除了标准韩语外，包含网络文档、手机短信或推特帖子中的实际韩语表达的电子学习系统对高级学习者很有用。我们构建了两类语料库 (Corpora)：一类是由在线新闻文章等正式文档构成的；另一类是由网络博客中关于新产品的客户评论等非正式文档构成的。通过比较这些语料库，我们展示了这两种语料库中表达方式的差异。我们分析了非正式语料库的主要特征。鉴于文本中非正式内容占显著比例，我们提出局部语法图 (LGG) 作为在韩语电子学习系统中有效处理它们的合适模型。

Abstract

E-learning systems should deliver content that reflects various phenomena of the language as it is used. In addition to formal Korean, e-learning systems that include real-world Korean expressions, such as those in web documents, mobile text messages, or Twitter posts, would be useful to high-level learners. We construct two types of corpora: one is made of formal documents like online news articles; the other is made of informal documents like customer reviews about new products in web blogs. By comparing these corpora, we show how expressions differ between them. We survey the main characteristics of the informal corpus. Given that a significant proportion of text is informal, we propose Local Grammar Graphs (LGG) as an appropriate model for handling informal expressions effectively in Korean e-learning systems.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on Korean NLP and e-learning using Local Grammar Graphs for text classification, while the keywords pertain to Multimodal LLMs, World Models, and Reinforcement Learning. There is no methodological or thematic overlap, resulting in zero relevance for all specified keywords.

Keywords

Korean e-learning, Local Grammar Graphs, informal documents, formal documents, web documents, language expressions, corpora comparison

358. World Models in Words: Auditing Physical State-Transition Commitments in Vision-Language Models [FAIL]

Score: 0.0 / 27.8

Authors: Emmanuelle Bourigault

Published: 2026-05-28

Abstract (Chinese translation)

视觉 - 语言模型（VLMs）日益被用于回答有关物理场景的问题，然而大多数评估仅将性能归结为最终答案。这掩盖了模型是否感知到了正确的对象、是否正确表示了物理状态、是否预测了合理的状态转移，或者仅仅是因为错误的原因而选择了正确的选项。我们引入 WMW（\wmw），一种用于审查 VLMs 的“语言表达物理承诺”的评估框架。我们不再仅对 $I,q\mapsto a$ 进行评分，而是要求模型生成一个类型化轨迹 $I,q\mapsto(s_0,Δs,s_1,a)$，包括初始状态、状态转移、结果状态以及最终答案。随后，一个混合验证器会检查模式有效性、状态锚定、转移一致性以及答案 - 轨迹兼容性，从而生成类型化错误标签，例如对象错误、关系错误、力错误、转移错误、时间错误、单位/尺度错误以及忠实度错误。我们发布了 TraceBank（\tracebank），这是一个受控轨迹资源库，包含 \nSeed 个经过模式验证和重新计算验证的合成场景，跨越 \nFamilies 个物理家族、\nPairs 个最小扰动对比偏好对，并提供验证器代码、审计指南及模型输出。我们在受控及外部物理推理示例上评估了多个 VLMs（\nModels）。WMW（\wmw）揭示了仅基于答案的评估所遗漏的失败：中等水平模型中有 35% 的正确答案是由物理上无效的轨迹支撑的。基于验证器的重排序在不牺牲答案准确性的前提下，恢复了高达 7 个百分点的轨迹有效性；而轨迹级偏好微调则使隐藏的不一致性相对减少了 41%。本文的贡献并非另一个基于最终答案的物理基准，而是一个可重用的协议，用于衡量 VLMs 所陈述的物理世界是否能与其答案同时为真。

Abstract

Vision-language models (VLMs) are increasingly used to answer questions about physical scenes, yet most evaluations reduce performance to a final answer. This hides whether the model perceived the right objects, represented the right physical state, predicted a plausible transition, or merely selected the right option for the wrong reasons. We introduce \wmw, an evaluation framework for auditing the language-expressed physical commitments of VLMs. Instead of scoring only $I,q\mapsto a$, we ask models to produce a typed trace $I,q\mapsto(s_0,\Delta s,s_1,a)$: an initial state, a state transition, a resulting state, and an answer. A hybrid verifier then checks schema validity, state grounding, transition consistency, and answer-trace compatibility, yielding typed error labels such as object, relation, force, transition, temporal, unit/scale, and faithfulness errors. We release \tracebank, a controlled trace resource with \nSeed schema- and recomputation-validated synthetic scenarios across \nFamilies physics families, \nPairs minimally perturbed contrastive preference pairs, verifier code, audit guidelines, and model outputs. We evaluate \nModels VLMs on both controlled and external physical-reasoning examples. \wmw reveals failures that answer-only evaluation misses: 35% of correct answers from mid-tier models are backed by physically invalid traces. Verifier-guided reranking recovers up to 7 percentage points of trace validity without sacrificing answer accuracy, and trace-level preference tuning reduces hidden inconsistency by 41% relative. The contribution is not another final-answer physics benchmark, but a reusable protocol for measuring whether a VLM's stated physical world can be true at the same time as its answer.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 26 (char 249)

359. Mask the Target: A Plug-and-Play Regularizer Against LoRA Forgetting [FAIL]

Score: 0.0 / 27.8

Authors: Runze Xu, Arpit Garg, Hemanth Saratchandran, Simon Lucey

Published: 2026-05-28

Abstract (Chinese translation)

低秩适配（LoRA）已成为最广泛使用的微调机制之一，用于将大语言模型（LLM）适配到新领域、新任务及新用户。然而，仅凭适应性能可能掩盖一种重要的失效模式：LoRA 更新可能在目标分布上提升性能，同时削弱在预训练和对齐过程中习得的先前能力。我们发现，当适应分布与模型的原始训练或对齐分布存在显著差异时，这种遗忘现象会变得尤为严重。在实际场景中，这一挑战尤为严峻，因为原始训练和对齐数据通常不可用。鉴于此，我们研究了基于 LoRA 的适应如何在无回放设置中平衡新学习与遗忘，并引入了一种简单的输出空间正则化器，可直接集成到现有的训练流程中。该方法从基础模型和适配模型的分布中移除目标标记，重新归一化剩余概率，并仅对非目标词汇应用 KL 正则化。这保留了基础模型在替代标记之间的相对偏好，而不直接对抗适应过程所需的交叉熵信号。由于该正则化器仅在损失层面起作用，它无需回放数据、架构修改、适配器重新设计或推理时开销，且可直接应用于现有的 LoRA 变体。在所有测试的 LoRA 变体及多种骨干网络上，当适应分布与基础模型的原始训练或对齐分布存在显著差异时，该方法改善了新学习与遗忘之间的前沿，表明了一种更广泛适用的途径，以实现更可靠的 LLM 更新。

Abstract

Low-Rank Adaptation (LoRA) has become one of the most widely used fine-tuning mechanisms for adapting large language models to new domains, tasks, and users. Yet adaptation performance alone can obscure an important failure mode: LoRA updates may improve performance on the target distribution while degrading prior capabilities learned during pretraining and alignment. We show that this forgetting becomes especially severe when the adaptation distribution differs substantially from the model's original training or alignment distributions. The challenge is amplified in practical settings, where the original training and alignment data are typically unavailable. Motivated by this constraint, we study how LoRA-based adaptation balances new learning against forgetting in a replay-free setting, and introduce a simple output-space regularizer that can be added directly to existing training pipelines. Our method removes the ground-truth token from both the base and adapted model distributions, renormalizes the remaining probabilities, and applies KL regularization only over the non-target vocabulary. This preserves the base model's relative preferences among alternative tokens without directly opposing the cross-entropy signal required for adaptation. As the regularizer acts only at the loss level, it requires no replay data, architectural changes, adapter redesign, or inference-time overhead, and can be applied directly to existing LoRA variants. Across all LoRA variants tested and across various backbones, our method improves the frontier between new learning and forgetting when the adaptation distribution differs substantially from the base model's original training or alignment distributions, suggesting a broadly applicable route toward more reliable LLM updating.
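The regularizer described in the abstract (drop the ground-truth token, renormalize, apply KL over the remaining vocabulary) can be sketched for a single token position in plain Python. This is a hedged illustration rather than the authors' code: the KL direction (base relative to adapted), the weight `lam`, the `eps` smoothing, and all function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_renormalized(probs, target):
    """Remove the target token's probability and renormalize the rest."""
    rest = [p for i, p in enumerate(probs) if i != target]
    total = sum(rest)
    return [p / total for p in rest]

def masked_kl(base_logits, adapted_logits, target, eps=1e-12):
    """KL(base || adapted) over the non-target vocabulary (direction assumed)."""
    p = masked_renormalized(softmax(base_logits), target)
    q = masked_renormalized(softmax(adapted_logits), target)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def total_loss(base_logits, adapted_logits, target, lam=0.1):
    """Cross-entropy on the target plus the masked KL penalty (lam is hypothetical)."""
    ce = -math.log(softmax(adapted_logits)[target])
    return ce + lam * masked_kl(base_logits, adapted_logits, target)
```

A useful property of this form: if the adapted model only raises the logit of the ground-truth token, the penalty stays at (numerically) zero, so it does not fight the cross-entropy signal. It only grows when the relative ordering among the other tokens drifts from the base model.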

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 13 column 22 (char 805)

360. Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification [FAIL]

Score: 0.0 / 27.8

Authors: Joy Bose

Published: 2026-05-28

TL;DR: This paper investigates the amplification disparity between capital and labour discourse on X regarding AI layoffs, finding that capital-related conversations receive significantly higher reach (4.18x mean amplification) than labour-related conversations even after normalizing for audience size.

Abstract (Chinese translation)

当工人因 AI 驱动的结构调整而失业时，X（前身为 Twitter）上同时发生着两种截然不同的对话。科技高管和人工智能研究人员谈论生产力、转型和机遇；而被解雇的工人和劳工批评者则谈论失业、不确定性和恐惧。本文提出一个简单的问题：哪种对话获得了更多的触达？我们报告了三项研究，采用两种收集方法，分析了来自 20 个指定公共账户的 763 条推文。研究 1 使用了基于关键词的收集（n=392），发现语料库之间无显著差异（p=0.891），表明关键词搜索对于此项任务而言噪声过大。研究 2 使用了基于账户的收集（n=96），发现资本话语相对于劳工话语具有 3.12 倍的平均放大优势（p=0.000003，Cohen's d=0.555）。研究 3 结合了两种方法（n=763），确认了该发现，平均放大比率为 4.18 倍，中位放大比率为 10.77 倍（p<0.000001）。至关重要的是，在对粉丝数进行归一化处理后，这种不对称性仍保持在 2.69 倍（p=0.000009，Cohen's d=0.491），表明该效应并非简单地源于资本账户拥有更大的受众。该发现在所有测试的放大度量权重下均稳健。我们引入了 Amplification Ratio（放大比率）和 Amplification Normalisation Index（放大归一化指数）作为衡量平台层面话语不平等性的简单度量。在 Reddit 上的跨平台复现（n=647 篇帖子）未能复现该发现，表明这种不对称性可能特定于 X 的基于账户的放大架构。我们讨论了跨平台话语分析的方法论启示。

Abstract

When workers lose jobs to AI-driven restructuring, two very different conversations happen on X (formerly Twitter) at the same time. Tech executives and AI researchers talk about productivity, transformation, and opportunity. Laid-off workers and labour critics talk about job loss, uncertainty, and fear. This paper asks a simple question: which conversation gets more reach? We report three studies using two collection methods and 763 tweets from 20 named public accounts. Study 1 used keyword-based collection (n=392) and found no significant difference between corpora (p=0.891), revealing that keyword search is too noisy for this task. Study 2 used account-based collection (n=96) and found a 3.12x mean amplification advantage for capital discourse over labour discourse (p=0.000003, Cohen's d=0.555). Study 3 combined both methods (n=763) and confirmed the finding at 4.18x mean and 10.77x median amplification ratio (p<0.000001). Critically, after normalising for follower count, the asymmetry persists at 2.69x (p=0.000009, Cohen's d=0.491), demonstrating that the effect is not simply a consequence of capital accounts having larger audiences. The finding is robust across all tested amplification metric weightings. We introduce the Amplification Ratio and Amplification Normalisation Index as simple metrics for measuring platform-level discourse inequality. A cross-platform replication on Reddit (n=647 posts) did not replicate the finding, suggesting the asymmetry may be specific to X's account-based amplification architecture. We discuss the methodological implications for cross-platform discourse analysis.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on computational social science and media analysis regarding AI-induced layoffs on X (Twitter), specifically comparing discourse amplification between capital and labour groups. The provided keywords relate to machine learning model architectures (multimodal, RL, tokenization, etc.), which are not discussed in the paper's methodology or technical content. Therefore, there is no relevance between the paper's content and the specified ML keywords.

Keywords

AI Layoff Discourse, X Platform, Capital vs Labour, Amplification Ratio, Computational Analysis, Discourse Inequality, Social Media

361. Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset [FAIL]

Score: 0.0 / 27.8

Authors: Hyojeong Yu, Hyukhun Koh, Minsung Kim, Kyomin Jung

Published: 2026-05-28

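TL;DR: By introducing "casual" as an anchor, this paper builds a three-level formality spectrum and the 3LF dataset, addressing supervision misalignment in formality transfer and substantially improving the agreement of generated text with human perception.

Abstract (Chinese translation)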

正式性转换（Formality Transfer）通常被建模为非正式与正式语体之间的一种对称双向任务。我们认为，这种框架掩盖了现有基准（如 GYAFC）中存在的监督设计缺陷：二元的人类改写编码了相对的风格偏移，而非人类对正式性的绝对认知。因此，模型学会生成伪正式输出以满足基准标签，却无法产生真正正式的语言。我们通过基于人类对齐的正式性定义重新评估基准的正式性标签来量化这种不一致，揭示了显著差异，这些差异导致了跨模型家族中一致的非正式到正式转换失败。为了解决这一问题，我们将正式性转换重新概念化为一个分级维度，而非二元属性。我们引入一个三级谱系：非正式（informal）、随意（casual）和正式（formal），其中随意作为一种明确的中间状态，用于澄清监督信号。基于此框架，我们引入了 3LF 数据集，该数据集提供了跨越这三个级别的平行监督。在 3LF 上训练显著减少了非正式到正式的转换失败，并提高了与人类感知的一致性。例如，尽管 3LF 显著小于 GYAFC，GPT-4.1-nano 在非正式到正式方向上的 F1 分数仍从 0.06 提升至 0.88。我们进一步证明，这些增益无法仅通过上下文学习（in-context learning）复现，并提供了关于歧义驱动错误和意义扭曲的定性分析。总体而言，我们的发现展示了监督设计如何塑造风格对齐，并强调了在可控文本生成中构建对齐感知基准的重要性。

Abstract

Formality transfer is commonly framed as a symmetric bidirectional task between informal and formal registers. We argue that this framing conceals a supervision design flaw in existing benchmarks such as GYAFC: binary human rewrites encode relative stylistic shifts rather than absolute human notions of formality. Consequently, models learn to generate pseudo-formal outputs that satisfy benchmark labels while failing to produce genuinely formal language. We quantify this misalignment by re-evaluating benchmark formal labels under a human-aligned definition of formality, revealing substantial discrepancies that propagate to consistent informal-to-formal failures across model families. To address this issue, we reconceptualize formality transfer as a graded dimension rather than a binary attribute. We introduce a three-level spectrum: informal, casual, and formal, where casual serves as an explicit intermediate state that clarifies supervision signals. Based on this framework, we introduce 3LF, a dataset providing parallel supervision across all three levels. Training on 3LF substantially reduces informal-to-formal failures and improves alignment with human perception. For example, GPT-4.1-nano improves from 0.06 to 0.88 F1 in the informal-to-formal direction despite 3LF being significantly smaller than GYAFC. We further demonstrate that these gains cannot be reproduced through in-context learning alone and provide qualitative analyses of ambiguity-driven errors and meaning distortions. Overall, our findings demonstrate how supervision design shapes stylistic alignment and highlight the importance of alignment-aware benchmark construction in controllable text generation.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

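Scoring rationale: The paper focuses on the formality transfer task in natural language processing; its core contribution is identifying the supervision misalignment in existing benchmarks and proposing a three-level spectrum and the 3LF dataset. The provided keywords concern multimodal large models, reinforcement learning, and world models, which have no direct connection to this text-only style transfer work, so all relevance scores are 0. The author list contains none of the specified expert authors.

Keywords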

Formality transfer, Supervision misalignment, Style transfer, Dataset construction, Text generation, Casual anchor, 3LF dataset, Human perception alignment

362. A Study on Question-Answer Dataset for LLM Safety Evaluation with a Focus on Illegal Activities [FAIL]

Score: 0.0 / 27.8

Authors: Kenji Imamura, Masao Ideuchi, Atsushi Fujita

Published: 2026-05-28

TL;DR: This paper proposes a question-answer dataset and evaluation rubric for assessing LLM safety regarding illegal activities, intended for the JAI-Trust project.

Abstract (Chinese translation)

本文探讨了用于大语言模型（LLM）安全性评估的问答数据集，重点聚焦于非法活动。具体而言，基于对 AnswerCarefully 的人工分析，我们引入了若干补充信息、创建问答示例的方法以及用于评估 LLM 生成回答的评分标准。本研究的成果旨在与"JAI-Trust"项目共享。

Abstract

In this paper, we discuss a question-answer dataset for LLM safety evaluation, with a focus on illegal activities. Specifically, on the basis of a manual analysis of AnswerCarefully, we introduce several pieces of additional information, methods for creating question-answer examples, and a rubric for evaluating LLM-generated responses. The outcomes of this study are intended to be shared with the "JAI-Trust" project.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on constructing a QA dataset for LLM safety evaluation regarding illegal activities and evaluation rubrics. It does not address model architecture unification, tokenization, visual encoders, world models, multimodal architectures, or reinforcement learning, resulting in zero relevance to the provided technical keywords.

Keywords

LLM Safety Evaluation, Question-Answer Dataset, Illegal Activities, Evaluation Rubric, LLM-generated Responses, Manual Analysis, JAI-Trust Project

363. Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding [FAIL]

Score: 0.0 / 27.8

Authors: Riza Setiawan Soetedjo, Yusuke Sakai, Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Taro Watanabe

Published: 2026-05-28

TL;DR: This paper proposes ConSUM, which enhances summarization factuality by reranking candidates based on source consistency and consensus achieved through Minimum Bayes Risk decoding.

Abstract (Chinese translation)

提高模型生成摘要的质量，特别是事实性（即摘要相对于源内容的准确性），仍然是一个挑战。虽然重排序可以从多个生成候选中选取最优输出，但其仅限于仅以源文档作为指导，从而导致生成摘要不可靠。为了解决这一局限性，我们提出了 ConSUM，该方法通过考虑两个因素来重排序候选摘要：与源文档的一致性以及其他候选之间的共识。共识是通过在生成的摘要集合上使用最小贝叶斯风险（MBR）解码建立的，而通过采用事实性感知指标将摘要与源文档进行比较来确保一致性。严格测试表明，我们的系统与现有方法具有竞争力，人工评估进一步确认其生成的摘要优于其他系统生成的摘要。我们的代码可在 https://github.com/naist-nlp/ConSUM 获取。

Abstract

Improving the quality of model-generated summaries, especially factuality (the accuracy of a summary with respect to its source content), remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to only using the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM, which reranks candidate summaries by considering two factors: consistency with the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while consistency is ensured by factuality-aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at https://github.com/naist-nlp/ConSUM.
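The consensus-plus-consistency reranking can be illustrated with a small MBR-style scorer. This is a sketch under stated assumptions, not the ConSUM implementation (see the repository linked above for that): the unigram-F1 utility is a stand-in, the consistency scores are assumed to come from some external factuality metric, and the weighted sum with `alpha` is a hypothetical way of combining the two factors.

```python
from collections import Counter

def token_f1(a, b):
    """Unigram-overlap F1 between two strings (a stand-in utility function)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)

def consensus_scores(candidates, utility=token_f1):
    """MBR-style expected utility of each candidate against all other candidates."""
    scores = []
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        scores.append(sum(utility(cand, o) for o in others) / len(others) if others else 0.0)
    return scores

def rerank(candidates, consistency, alpha=0.5):
    """Combine consensus with per-candidate source-consistency scores (assumed given)."""
    consensus = consensus_scores(candidates)
    combined = [alpha * c + (1 - alpha) * f for c, f in zip(consensus, consistency)]
    best = max(range(len(candidates)), key=combined.__getitem__)
    return candidates[best], combined
```

For instance, with candidates `["the cat sat on the mat", "the cat sat on a mat", "dogs bark loudly"]` and consistency scores `[0.9, 0.8, 0.1]`, the first candidate wins: it agrees with the second candidate and also scores highest on consistency, while the outlier is penalized on both factors.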

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on text summarization factuality using Minimum Bayes Risk (MBR) decoding and reranking strategies. It does not involve multimodal integration, visual encoders, world models, or model-based reinforcement learning, making it completely unrelated to the provided keyword set which targets multimodal and world model architectures.

Keywords

Summarization, Factuality, Consensus, Consistency, Minimum Bayes Risk Decoding, Reranking, Text Generation

364. Accommodation Goes Both Ways: Studying Linguistic Convergence Between Humans and Language Models [FAIL]

Score: 0.0 / 27.8

Authors: Terra Blevins

Published: 2026-05-28

TL;DR: This study investigates asymmetric linguistic convergence in human-LLM dialogue, finding that LLMs over-adapt to users while humans accommodate LLMs similarly to other humans.

Abstract (Chinese translation)

随着大语言模型（LLMs）日益融入日常生活，理解其存在将如何塑造人类语言行为仍是一个开放性问题。我们开展了一项关于人机对话中语言趋同的大规模研究，考察人类与大语言模型在多轮对话中如何相互调适彼此的语言风格。基于 WildChat（一个真实世界 ChatGPT 语料库）上的非对称趋同度量，我们发现尽管大语言模型在八种语言的功能词和开放类特征上显著过度趋同于用户，但在此设置下人类的趋同率与人类 - 人类基线大致一致。这些发现表明，人机对话中的调适是非对称的：尽管大语言模型过度拟合用户的风格，但人类在语言上调适大语言模型的方式与适应他人并无不同。

Abstract

As LLMs become increasingly integrated into daily life, understanding how their presence will shape human linguistic behavior is an open question. We present a large-scale study of linguistic convergence in human-LLM dialogue, examining how humans and LLMs accommodate each other's linguistic style during multi-turn conversations. Using an asymmetric convergence metric on WildChat, a corpus of real-world ChatGPT transcripts, we find that while LLMs significantly overconverge toward their users on both function word and open-class features across eight languages, human convergence rates in this setting are broadly consistent with human-human baselines. These findings suggest that accommodation in human-LLM dialogue is asymmetric: while LLMs dramatically overfit to their users' style, humans linguistically accommodate LLMs no differently than they would another person.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on sociolinguistic analysis of human-LLM dialogue convergence, whereas the provided keywords relate to model architecture (Unify Models, Tokenizer, Visual Encoder), multimodal capabilities (MLLM, MultiModal), and reinforcement learning frameworks (World Models, model-based RL). There is no overlap in technical content or methodology between the paper and the specified keywords. The calculated weighted total score is 0.0, well below the dynamic pass score of 27.8. In addition, the author list does not contain any of the specified expert authors.
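The rationale above refers to a weighted total score and a pass threshold. The report does not state how the total is aggregated, so the following is only a minimal sketch that assumes the total is the sum of weight × relevance over all keywords. The names `KeywordScore`, `weighted_total`, and `verdict` are hypothetical, and the 27.8 threshold is simply taken as given from this entry.

```python
from dataclasses import dataclass


@dataclass
class KeywordScore:
    keyword: str
    weight: float      # per-keyword weight, 1.5 for every keyword in this report
    relevance: float   # relevance on the report's 0-10 scale


def weighted_total(scores):
    """Assumed aggregation: sum of weight * relevance over all keywords."""
    return sum(s.weight * s.relevance for s in scores)


def verdict(scores, pass_score):
    """Return 'PASS' if the assumed total reaches the threshold, else 'FAIL'."""
    return "PASS" if weighted_total(scores) >= pass_score else "FAIL"


KEYWORDS = [
    "Unify Models", "Tokenizer", "Visual Encoder", "World Models",
    "MLLM", "MultiModal", "model-based RL",
]

# Entry 364: every keyword was rated 0.0/10 with weight 1.5.
paper_364 = [KeywordScore(k, 1.5, 0.0) for k in KEYWORDS]
print(weighted_total(paper_364), verdict(paper_364, pass_score=27.8))
```

Under this assumed formula, a paper rated 10/10 on all seven keywords would total 7 × 1.5 × 10 = 105, which gives a sense of scale for the 27.8 threshold.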

Keywords

Linguistic Convergence, Human-LLM Dialogue, Asymmetric Convergence, WildChat, LLM Behavior, Multi-turn Conversations, Language Models, ChatGPT Transcripts

365. Colored Noise Diffusion Sampling [FAIL]

Score: 0.0 / 27.8

Authors: Hadar Davidson, Noam Issachar, Sagie Benaim

Published: 2026-05-28

TL;DR: This paper introduces Colored Noise Sampling, a training-free inference-time method for diffusion models that leverages spectral bias to improve image generation quality, achieving lower FID scores on ImageNet compared to standard solvers.

Abstract (Chinese translation)

扩散模型实现了当前最先进的图像生成，其生成轨迹本质上表现出谱偏差，早期解析低频全局结构，后期解析高频精细细节。传统的随机微分方程 (SDE) 求解器未能考虑到这一动态，天真地在全过程中注入均匀白噪声，从而误用了有限的能量预算。在本文中，我们建立了一个数学框架，将 SDE 推断重新审视为目标明确且频率解耦的能量转移过程。基于此框架，我们提出了一种新颖的、无需训练的随机求解器——有色噪声采样 (CNS)。与注入均匀白噪声不同，CNS 采用一种动态的、依赖于时间步和频率的调度策略，更有效地将注入的能量分配给结构尚未解析的频率带。通过积极利用模型的固有谱偏差，CNS 系统地将生成分布引导至真实数据流形。广泛的实验表明，CNS 作为一种严格的即插即用式推理时采样器替换方法，在多种架构 (SiT, JiT, FLUX) 上显著优于标准的 ODE 和 SDE 基线。在 ImageNet-256 上，与标准采样相比，CNS 实现了显著的无引导 FID 降低：在 SiT-XL/2 上从 8.26 降至 6.27，在 JiT-B/16 上从 32.39 降至 26.69，在 JiT-H/16 上从 11.88 降至 8.31；同时在使用无分类器引导 (Classifier-Free Guidance) 时，也获得了相对一致的 FID 改进。项目主页见 https://hadardavidson.github.io/CNS/。

Abstract

Diffusion models achieve state-of-the-art image synthesis, with their generative trajectories fundamentally exhibiting a spectral bias, resolving low-frequency global structures early and high-frequency fine details later. Conventional stochastic differential equation (SDE) solvers fail to account for this dynamic, naively injecting uniform white noise throughout the entire process and misusing the finite energy budget. In this work, we establish a mathematical framework that reconsiders SDE inference as a targeted, frequency-decoupled energy transfer. Leveraging this framework, we introduce Colored Noise Sampling (CNS), a novel, training-free stochastic solver. Rather than injecting uniform white noise, CNS utilizes a dynamic, timestep- and frequency-dependent schedule that more efficiently allocates injected energy toward structurally unresolved frequency bands. By actively exploiting the model's inherent spectral bias, CNS systematically steers the generated distribution toward the true data manifold. Extensive experiments demonstrate that CNS significantly outperforms standard ODE and SDE baselines as a strictly plug-and-play, inference-time sampler substitution across diverse architectures (SiT, JiT, FLUX). Compared to standard sampling on ImageNet-256, CNS achieves substantial unguided FID reductions, improving from 8.26 to 6.27 on SiT-XL/2, 32.39 to 26.69 on JiT-B/16, and 11.88 to 8.31 on JiT-H/16, while yielding consistent relative FID improvements with Classifier-Free Guidance. Project page is available at https://hadardavidson.github.io/CNS/.
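To make the idea of timestep- and frequency-dependent noise concrete, here is a minimal NumPy sketch of spectrally shaped ("colored") Gaussian noise. It is not the authors' CNS schedule, which the abstract does not specify. The sigmoid gain, the `cutoff = 0.5 * progress` rule, and the function names are all assumptions chosen for illustration: as sampling progresses, the sketch moves noise energy away from low frequencies (assumed already resolved) toward higher ones, while rescaling so the total noise variance stays the same as for white noise.

```python
import numpy as np


def radial_frequency(height, width):
    """Radial spatial frequency (cycles per pixel) for each 2-D FFT bin."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return np.sqrt(fx ** 2 + fy ** 2)


def colored_noise(height, width, progress, rng, sharpness=20.0):
    """Draw unit-variance noise whose spectrum depends on sampling progress.

    progress: 0.0 at the start of sampling, 1.0 at the end (hypothetical
    parameterization; real solvers index timesteps differently).
    """
    white = rng.standard_normal((height, width))
    freq = radial_frequency(height, width)

    # Hypothetical schedule: frequencies below the cutoff are treated as
    # already resolved and receive little noise; the cutoff rises over time.
    cutoff = 0.5 * progress
    gain = 1.0 / (1.0 + np.exp(-sharpness * (freq - cutoff)))

    # Shape the white noise in the Fourier domain. The gain is real and
    # symmetric in frequency, so the inverse transform is real up to rounding.
    shaped = np.fft.ifft2(np.fft.fft2(white) * gain).real

    # Rescale to unit variance so the injected energy matches white noise.
    return shaped / shaped.std()


rng = np.random.default_rng(0)
early = colored_noise(64, 64, progress=0.0, rng=rng)  # close to white noise
late = colored_noise(64, 64, progress=1.0, rng=rng)   # mostly high frequencies
```

In a real sampler, noise like this would replace the white-noise term of an SDE step; how the gain should depend on the timestep is exactly what the paper's framework addresses and is not reproduced here.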

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper presents Colored Noise Sampling, an inference-time sampler for diffusion-based image synthesis. It does not address MLLM architecture (Tokenizer, Visual Encoder), multimodal integration, world models for reinforcement learning, or model-based RL. The research direction (sampling quality at inference time) is unrelated to the provided keyword themes (unification, representation learning, RL).

Keywords

Diffusion models, Colored Noise Sampling, Spectral bias, SDE solvers, Image synthesis, Inference-time sampler, FID improvements

366. Ambient-robust Inverse Rendering using Active RGB-NIR Imaging [FAIL]

Score: 0.0 / 27.8

Authors: Hoon-Gyu Chung, Jinnyeong Kim, Hyunwoo Kang, Seung-Hwan Baek

Published: 2026-05-28

Abstract (Chinese translation)

逆渲染旨在从图像中重建物体的几何结构与反射率。尽管近期取得了进展，现有方法往往产生不准确的重建结果，且对环境光照条件敏感。本文介绍了一种由主动式 RGB-NIR 成像支持的、对环境光照鲁棒的逆渲染方法。我们的关键洞察是利用近红外 (NIR) 闪光灯照明（对人眼不可感知）来获得稳定的点光源着色，该着色在很大程度上不受环境光照的影响。通过使用环境光照下的多视角 RGB 图像和主动式 NIR 闪光灯照明下获取的 NIR 图像，我们通过一个三阶段逆渲染方法，利用 RGB 和 NIR 图像的互补优势，重建准确的几何结构与反射率。为了实现密集多视角采集，我们开发了一种主动式成像系统，该系统配备了一个 RGB-NIR 相机和一个安装在移动基座上的 NIR 闪光灯。利用该系统，我们收集了首个在多种环境光照条件下捕获的多视角 RGB-NIR 逆渲染数据集。实验表明，我们的方法优于先前方法，在多种环境光照场景下实现了准确的几何结构与反射率估计。

Abstract

Inverse rendering aims to reconstruct geometry and reflectance of objects from images. Despite recent progress, existing methods often produce inaccurate reconstructions that are sensitive to ambient illumination conditions. Here we introduce an ambient-robust inverse rendering method enabled by active RGB-NIR imaging. Our key insight is to leverage near-infrared (NIR) flash illumination, which is imperceptible to human observers, to obtain stable point-light shading that is largely invariant to ambient illumination. By using multi-view RGB images illuminated by ambient light and NIR images acquired with active NIR flash illumination, we reconstruct accurate geometry and reflectance by exploiting the complementary benefits of RGB and NIR images via a three-stage inverse rendering method. To enable dense multi-view acquisition, we develop an active imaging system equipped with an RGB-NIR camera and an NIR flash mounted on a mobile base. Using this system, we collect the first multi-view RGB-NIR inverse rendering dataset captured under multiple ambient illumination conditions. Experiments demonstrate that our method outperforms prior approaches, achieving accurate geometry and reflectance estimation across multiple ambient lighting scenarios.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 83 (char 292)
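The failure text above has the shape of a message from Python's `json` module. The report does not show the scorer's raw output, so the following is only a sketch of one common way such an error arises (an unescaped double quote inside a JSON string) and of catching it so the entry can record the failure. The function name `parse_score_response` and the sample payloads are hypothetical.

```python
import json


def parse_score_response(raw):
    """Parse a JSON scoring response.

    Returns (parsed, None) on success or (None, error_text) on failure,
    where error_text mirrors the message format of json.JSONDecodeError.
    """
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        detail = f"{exc.msg}: line {exc.lineno} column {exc.colno} (char {exc.pos})"
        return None, detail


# Hypothetical malformed payload: the inner quotes around plug-and-play are
# not escaped, so the string ends early and the parser expects ',' or '}'.
bad = '{"reason": "The paper calls this "plug-and-play" sampling"}'
parsed, error = parse_score_response(bad)
print(parsed, error)

# A well-formed payload parses normally.
ok, ok_error = parse_score_response('{"score": 0.0}')
print(ok, ok_error)
```

Whether the pipeline should retry, repair the payload, or fall back to a zero score, as entry 366 appears to, is a design choice this sketch leaves open.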

367. BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval [FAIL]

Score: 0.0 / 27.8

Authors: Marco Peer, Anna-Scius Bertrand, Patricia Scheurer, Andreas Fischer

Published: 2026-05-28

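TL;DR: BullingerDB is a benchmark dataset for handwritten text recognition and writer retrieval on historical documents, with reported results of a 9.1% character error rate and 78.3% mAP; its content has no direct connection to the configured keywords on modern large models and reinforcement learning.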

Abstract (Chinese translation)

本文提出 BullingerDB，一个基于海因里希·布林格（1504-1575）通信的历史文档分析大规模基准数据集。该语料库包含 20,898 页和 499,222 行文本，由 796 位作者在六十年间书写，具有风格变异、多语言内容（主要是拉丁语和早期新高地德语）以及元信息，如作者身份和时间。我们在文本识别和作者检索任务上对 BullingerDB 进行了评估。TrOCR 作为表现最佳的模型，实现了 9.1% 的 CER（字符错误率）。针对作者检索，我们引入了一种时间感知 nDCG（归一化折损累积增益）指标来评估时间感知检索。尽管时间一致检索是可行的，但 mAP（平均精度）78.3% 的得分表明长期风格变异带来了挑战。借助 BullingerDB，我们旨在为多语言历史文本识别和时间感知作者分析建立一个新的基准。

Abstract

We present BullingerDB, a large-scale benchmark dataset for historical document analysis based on the correspondence of Heinrich Bullinger (1504-1575). The corpus comprises 20,898 pages and 499,222 text lines written by 796 writers over six decades, featuring stylistic variation, multilingual content (mostly Latin and Early New High German) as well as meta-information such as writer identity and time. We evaluate BullingerDB on text recognition and writer retrieval. TrOCR, the best performing model, achieves a CER of 9.1%. For writer retrieval, we introduce a temporal nDCG metric to assess time-aware retrieval. While temporally coherent retrieval is achievable, mAP (78.3%) scores indicate challenges due to long-term stylistic variation. With BullingerDB, we aim to establish a new benchmark for multilingual historical text recognition and temporally-aware writer analysis.
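For readers unfamiliar with the CER figure quoted above, character error rate is conventionally the character-level edit distance between the recognized text and the reference, divided by the reference length. The sketch below uses a plain Levenshtein distance; the paper's exact evaluation protocol (normalization, whitespace handling, averaging over lines) is not given in the abstract, and the example strings are invented.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete ref_char
                current[j - 1] + 1,                   # insert hyp_char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Invented example: one missing character in a nine-character reference.
print(character_error_rate("Bullinger", "Bulinger"))
```

A CER of 9.1% therefore corresponds to roughly nine character edits per hundred reference characters under this definition.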

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

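Scoring rationale: The paper focuses on building a dataset for historical handwritten text recognition and writer retrieval. It does not involve core algorithm or architecture research on unified models, world models, visual encoders, tokenizers, MLLMs, or model-based reinforcement learning. Although the evaluation uses the TrOCR model, the paper's contribution is the data benchmark rather than model technology, so relevance is 0 for all keywords.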

Keywords

Handwritten Text Recognition, Writer Retrieval, Historical Document Analysis, Benchmark Dataset, Multilingual Content, Stylistic Variation, Temporal-aware Retrieval

368. Ciphera: A Decentralised Biometric Identity Framework [FAIL]

Score: 0.0 / 27.8

Authors: Ankit Kanaiyalal Prajapati, Shahzad Memon, Mohammed Mahir Rahman, Ameer Al-Nemrat

Published: 2026-05-28

TL;DR: Ciphera presents a decentralized biometric identity framework leveraging blockchain and IPFS for privacy-preserving authentication, achieving stable verification latency but encountering challenges in revocation propagation and deepfake susceptibility.

Abstract (Chinese translation)

集中式生物识别身份系统使用户面临单点故障、不透明的验证过程以及不可逆的生物特征泄露。去中心化标识符（DIDs）和可验证凭证（VCs）提供更强的隐私保障，但它们与生物特征认证及分布式验证的结合尚未得到充分探索。本文提出 Ciphera，一种去中心化生物识别身份框架，该框架结合了隐私保护人脸识别、多节点验证、基于 IPFS 的凭证元数据存储以及基于区块链锚定的撤销机制。在功能、性能、安全及分布式一致性四个维度上进行了评估，Ciphera 实现了 81% 的功能成功率，注册与认证过程稳定，但存在可测量的撤销传播延迟以及偶尔出现的审计日志不一致问题。性能测试表明，在并发多节点条件下，其 p95 验证延迟约为 820 毫秒，达到次秒级。安全分析确认了强大的机密性和完整性保障，但由于活体检测尚不完善，系统仍易受深度伪造和重放攻击的影响。结果表明去中心化生物识别身份具有可行性，同时也指出了面向生产级部署的关键工程挑战。

Abstract

Centralised biometric identity systems expose users to single points of failure, opaque verification processes, and irreversible biometric compromise. Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) offer stronger privacy guarantees, yet their integration with biometric authentication and distributed verification remains insufficiently explored. This paper presents Ciphera, a decentralised biometric identity framework combining privacy-preserving facial recognition, multi-node verification, IPFS-based credential metadata storage, and blockchain-anchored revocation. Evaluated across functional, performance, security, and distributed consistency dimensions, Ciphera achieved an 81% functional success rate, with stable enrolment and authentication but measurable revocation propagation delays and occasional audit-log inconsistencies. Performance testing demonstrated sub-second p95 verification latency of approximately 820ms under concurrent multi-node conditions. Security analysis confirmed strong confidentiality and integrity guarantees, though incomplete liveness detection leaves susceptibility to deepfake and replay attacks. The results demonstrate the feasibility of decentralised biometric identity while identifying key engineering challenges for production-grade deployment.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on decentralized biometric identity systems using blockchain and IPFS, whereas the provided keywords relate to AI model architectures (Unify Models, Tokenizer, Visual Encoder), Multimodal Large Language Models (MLLM, MultiModal), and reinforcement learning (World Models, model-based RL). The paper does not discuss model unification, tokenization strategies, visual encoder architectures for multimodal learning, world modeling, or reinforcement learning. The research domains (security and blockchain versus AI/ML) are distinct, so relevance is zero for all specified keywords.

Keywords

Decentralised Biometric Identity, Privacy-preserving Facial Recognition, Multi-node Verification, IPFS-based Credential Storage, Blockchain-anchored Revocation, Distributed Verification, Security Guarantees

369. Subcortical Shape Variations and Their Associations with Cognition Across the 8th Decade of Life. A Study in the Lothian Birth Cohort 1936 [FAIL]

Score: 0.0 / 27.8

Authors: Maria del C. Valdes-Hernandez, Wonjung Park, Joanna Moodie, Susana Muñoz Maniega, Janie Corley, Fraser N. Sneden, Mark E. Bastin, Joanna M. Wardlaw, Simon R. Cox, Jinah Park

Published: 2026-05-28

TL;DR: This study investigates the association between subcortical brain shape changes and cognitive aging in individuals during their 8th decade, revealing heterogeneous atrophy patterns linked to vertex displacements.

Abstract (Chinese translation)

对正常个体脑形态变化的研究可能捕捉到功能相关的脑衰老的某些方面，而这些方面并未完全通过总体积测量得到体现。尽管皮质下脑结构在认知中起着重要作用，但它们形态轨迹与衰老过程中认知变化之间的关联尚未被阐明。我们利用来自一项大型认知衰老纵向研究——洛锡安出生队列 1936（Lothian Birth Cohort 1936）的神经影像学、人口统计学和认知数据，探索社区居住个体在生命第 8 个十年期间皮质下脑结构的形状变化。我们使用协方差分析（ANCOVA）和混合线性模型分析来探究这些变化与认知衰老之间的关联。皮质下形状变化具有异质性，在整个时期内表现出不同的萎缩模式。海马和腹侧 DC 经历了不同的形态变形（相对于基线点），且左右半球有所不同；而例如丘脑和苍白球的形状则经历了更均匀的体积收缩，在不同时间线上几乎是对称的。一般认知的变化主要与时间点之间的向内和向外顶点位移相关。

Abstract

The study of brain morphology changes in normal individuals may capture aspects of functionally-relevant brain aging not fully indicated by gross volumetry. Despite the important role of subcortical brain structures in cognition, the associations between their morphological trajectories and cognitive changes in aging have not been documented. We use neuroimaging, demographic, and cognitive data from a large longitudinal study of cognitive aging, the Lothian Birth Cohort 1936, to explore shape changes in subcortical brain structures of community-dwelling individuals across their 8th decade of life. We investigate the association of these changes with cognitive aging using ANCOVA and mixed linear model analyses. Subcortical shape changes were heterogeneous, with varied atrophy patterns across the whole period. The hippocampus and the ventral DC experienced varied morphological deformations (relative to their baseline) that differed between the left and right hemispheres, while the thalami and globus pallidi shapes, for example, experienced a more uniform volume contraction, nearly symmetrical throughout different timelines. Changes in general cognition were mainly associated with inwards and outwards vertex displacements between the time-points.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on neuroscience and gerontology, analyzing subcortical brain shape variations and cognitive aging using neuroimaging and statistical methods. The provided keywords pertain to Artificial Intelligence, Large Language Models, and Reinforcement Learning architectures (e.g., Tokenizer, Visual Encoder, World Models). There is no conceptual overlap between the medical research content and the AI/ML model keywords, resulting in zero relevance for all terms.

Keywords

Subcortical Shape Variations, Cognitive Aging, Neuroimaging, Lothian Birth Cohort 1936, Brain Morphology, Longitudinal Study, Vertex Displacements

370. MARTIAN: A Rendering Framework for Aerial Mars Imagery from HiRISE Orbital Data [FAIL]

Score: 0.0 / 27.8

Authors: Dario Pisanti, Georgios Georgakis

Published: 2026-05-28

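TL;DR: This paper presents MARTIAN, a rendering framework that uses HiRISE data and Blender to synthesize aerial Mars imagery, addressing the scarcity of training data for vision-based navigation on Mars.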

Abstract (Chinese translation)

火星空中导航需要基于视觉的流水线，这些流水线需对火星表面多样的光照条件和地形形态具有鲁棒性。训练和评估此类方法的一个关键瓶颈在于大规模标注空中数据集的稀缺性。我们提出 MARTIAN，一个基于 Blender 的开源渲染框架，该框架利用真实的 HiRISE 轨道地图产品，在可控光照条件和不同高度下合成逼真的火星地形空中视图。MARTIAN 生成带有准确姿态标注的观测数据，直接应对了火星基于视觉导航训练数据稀缺的问题。该框架已通过其在 Ingenuity（机智号）和未来火星旋翼飞行器的基于地图定位系统中的部署得到验证，其中合成训练的深度图像匹配器在真实火星影像上得到了成功评估。MARTIAN 公开可用，网址为：https://github.com/nasa-jpl/martian。

Abstract

Aerial navigation on Mars requires vision-based pipelines that are robust to the diverse illumination conditions and terrain morphology of the Martian surface. A key bottleneck for training and evaluating such methods is the scarcity of large-scale, annotated aerial datasets. We present MARTIAN, an open-source Blender-based rendering framework that leverages real HiRISE orbital map products to synthesize realistic aerial views of the Martian terrain under controllable lighting conditions and at varying altitudes. MARTIAN generates observations with accurate pose annotations, directly addressing the scarcity of training data for vision-based navigation on Mars. The framework has been validated through its deployment in concurrent work on map-based localization systems for Ingenuity and future Mars rotorcraft, where synthetically trained deep image matchers were successfully evaluated on real Mars imagery. MARTIAN is publicly available at: https://github.com/nasa-jpl/martian.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

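Scoring rationale: The paper presents MARTIAN, a Blender-based framework for rendering Mars imagery. It belongs to computer graphics and simulation and aims to address the scarcity of data for Mars navigation. The keywords concern multimodal large models, representation learning, and reinforcement learning architectures (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) and have no direct connection to the paper's content, so relevance is 0 for all of them. The author list does not include any of the specified experts, so no bonus is applied.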

Keywords

MARTIAN, Rendering Framework, HiRISE, Aerial Imagery, Mars Navigation, Blender, Synthetic Data, Pose Annotations

Token usage: 5,756,219 tokens (input 740,859 / output 5,015,360)