arXiv Daily Report 2026-06-13

DailyPapers


ArXiv Research Report

Generated: 2026-06-13 01:13:34 | Passing score: 35.2

253

Total

Qualified

Analyzed

21%

Pass Rate

Papers

1. RepWAM: World Action Modeling with Representation Visual-Action Tokenizers [PASS]

Score: 112.5 / 35.2

Authors: Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu, Yu-Gang Jiang, Yinghao Xu

Published: 2026-06-11

TL;DR: RepWAM proposes a representation-centric world action model utilizing visual-action tokenizers to improve instruction-following dynamics for robot manipulation tasks.

Abstract

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	10.0/10	15.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	10.0/10	15.0
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	8.0/10	12.0
Latent Reasoning	1.5	8.0/10	12.0
Agentic Reasoning	1.5	7.0/10	10.5

Scoring rationale: The paper focuses on RepWAM, a world action model using representation tokenizers, making Tokenizer (10) and World Models (10) highly relevant. MultiModal (9) reflects the integration of vision, language, and action. Latent Reasoning (8) and Unify Models (8) correspond to the semantic latent space and unification of visual/action representations. model-based RL (8) aligns with control tasks. Visual Encoder (8) is implicit in visual processing. MLLM (7) and Agentic Reasoning (7) are relevant due to language conditioning and robot agent nature but secondary to the core tokenization framework. No expert authors from the provided list were found in the author list. Total weighted score is 112.5, exceeding the 35.2 threshold.
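The score table can be checked by hand: each row is weight times relevance, and the rows sum to the reported total. A minimal sketch of that arithmetic follows, using the RepWAM ratings from the table. The function name and the "greater than or equal" pass rule are assumptions for illustration; the report only states that 112.5 exceeds the 35.2 threshold.

```python
# Relevance ratings (0-10) copied from the RepWAM score table above.
relevance = {
    "Unify Models": 8.0,
    "Tokenizer": 10.0,
    "Visual Encoder": 8.0,
    "World Models": 10.0,
    "MLLM": 7.0,
    "MultiModal": 9.0,
    "model-based RL": 8.0,
    "Latent Reasoning": 8.0,
    "Agentic Reasoning": 7.0,
}
WEIGHT = 1.5          # every keyword in this table carries the same weight
PASSING_SCORE = 35.2  # threshold from the report header


def weighted_total(ratings, weight):
    """Sum weight * relevance over all keywords (hypothetical helper)."""
    return sum(weight * r for r in ratings.values())


total = weighted_total(relevance, WEIGHT)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASSING_SCORE else "FAIL"
print(f"{total:.1f} / {PASSING_SCORE} -> {verdict}")
```

The same check applies to the other papers in this report by swapping in their relevance column.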

Keywords

World Action Modeling, Representation Visual-Action Tokenizers, Robot Manipulation, Latent Action Space, Instruction Following, Closed-loop Control, Representation Learning

In-Depth Analysis

Chinese Title: RepWAM：基于表征视觉-动作分词器的世界动作建模

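Summary: The paper proposes RepWAM, a representation-centric world action model (WAM) built around a representation visual-action tokenizer (RepViTok). Existing WAMs typically inherit reconstruction-oriented tokenizers from pretrained video generation models. These preserve visual fidelity, but pixel-level reconstruction offers limited help in learning the instruction-following dynamics that connect future prediction with robot control. The authors therefore explore a semantic visual-action latent space. They first train a representation visual-action tokenizer that maps visual inputs to aligned visual tokens and latent action tokens, then pretrain the WAM to jointly model future visual states and the latent actions between them under language instructions, and finally fine-tune on real robot trajectories for closed-loop manipulation. Experiments show that RepWAM performs strongly on real-world manipulation tasks and on simulation benchmarks such as RoboTwin 2.0, and ablations confirm the advantage of semantic visual-action tokenization over reconstruction-oriented approaches. The work offers a promising representational foundation for world action models and a step toward generalist robot policies.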

Innovations:

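Proposes a representation visual-action tokenizer (RepViTok) that aligns the visual latent space with a frozen vision foundation model, yielding semantically rich visual tokens.
Learns latent action tokens inside the semantic visual latent space, coupling an inverse dynamics model with a forward dynamics model so that actions are represented as transferable transformations between visual states.
Builds a causal diffusion Transformer that jointly generates visual tokens and action tokens conditioned on language instructions, enabling causal pretraining of the world action model.
Adapts the pretrained world action model to closed-loop manipulation by fine-tuning on real robot trajectories, remaining competitive without online imagination.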

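Methodology: A visual tokenizer is trained first. It uses a ViT autoencoder and combines reconstruction losses (L1, perceptual, and GAN) with a feature alignment loss against a frozen vision foundation model, DINOv2. With the visual tokenizer frozen, a latent action tokenizer is then trained; it consists of an inverse dynamics model (IDM) and a forward dynamics model (FDM), optimized with a forward prediction loss and a backward consistency loss. Finally, a causal diffusion Transformer organizes language instructions, visual tokens, and action tokens into blocks, is pretrained with a block-causal mask and a conditional flow matching objective, and is then fine-tuned on real robot data.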
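A block-causal mask lets every token attend to its own block and to all earlier blocks, but never to later ones. The sketch below builds such a mask in plain Python to make the idea concrete. The block sizes and the function name are hypothetical; the report does not state how RepWAM actually lays out its instruction, visual, and action blocks.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] = True if token i may attend to token j.

    Tokens see every token in their own block and in earlier blocks,
    and nothing in later blocks.
    """
    # Map each token position to the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical layout: 2 instruction tokens, 3 visual tokens, 1 action token.
mask = block_causal_mask([2, 3, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Within a block attention is bidirectional, which is what distinguishes this from an ordinary token-level causal mask.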

Key Results:

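On the RoboTwin 2.0 benchmark, RepWAM scores 89.3 on Easy tasks and 88.4 on Hard tasks, outperforming VLA and WAM pretraining baselines.
Ablations show that semantic visual-action tokenization significantly improves world action model performance over reconstruction-oriented tokenizers.
On real-world manipulation tasks, RepWAM shows strongly competitive closed-loop behavior.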

Tech Stack:

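Visual tokenizer: Vision Transformer (ViT) autoencoder with 16×16 image patches and 4×16×16 spatiotemporal tubelets.
Feature alignment: frozen DINOv2 vision foundation model, linear projection layer, temporal average pooling.
Latent action tokenizer: inverse dynamics model (IDM) and forward dynamics model (FDM), a soft transport operator (similar to optical flow), and a residual term.
World action model: causal diffusion Transformer, block-causal mask, conditional flow matching objective.
Language encoding: pretrained text encoder.
Loss functions: L1 loss, perceptual loss (LPIPS), GAN loss, feature alignment loss, forward prediction loss, backward consistency loss.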

Strengths:

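Unifies visual and action tokens in a semantic latent space, narrowing the modality gap and improving the instruction-following ability of the world action model.
Takes a representation learning view, shifting from reconstruction-oriented to semantics-oriented tokenization, which better fits the needs of robot manipulation.
Thorough experimental validation on both simulation benchmarks and real-world tasks, with ablations that support the core design.
Code and weights are to be released (per the abstract), easing reproduction and follow-up research.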

Limitations:

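Depends on a frozen vision foundation model (DINOv2), whose semantic space may not fully cover all manipulation scenarios.
The latent action tokenizer needs pairs of consecutive frames for training, which places high demands on data quality.
Training and inference with the causal diffusion Transformer may be computationally expensive, and real-time performance remains to be verified.
The paper does not discuss cross-embodiment generalization in detail and focuses on a single-embodiment setting.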

Relevance To Keywords:

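Unify Models: RepWAM merges the world model and the action policy through a unified visual-action latent space, reflecting unified multimodal understanding and generation.
World Models: The core contribution is a world action model (WAM) that jointly models visual dynamics and actions, extending world models to robot control.
Representation Learning: The core innovation is in representation learning: a semantic visual-action tokenizer that shifts from reconstruction-oriented to semantics-oriented design and improves representation quality.
Model-Based RL: A world action model can be viewed as a form of model-based reinforcement learning that guides the policy by predicting future states and actions.
Native multimodal large models: No large language model is used directly, but the language-instruction encoding and causal Transformer architecture are in line with multimodal large model ideas.
Unified understanding and generation in multimodal large models: RepWAM performs both visual understanding (semantic alignment) and generation (future state prediction).
Representation learning: The paper centers on representation learning, obtaining better visual-action representations by aligning with a vision foundation model and learning latent actions.
World models: It directly builds a world action model that predicts future visual states and actions, which falls within world models.
Reinforcement learning: A world action model can be used for planning or policy learning in reinforcement learning; the paper adapts it to closed-loop manipulation through fine-tuning.
Post-training: The paper follows a pretrain-then-fine-tune paradigm, pretraining the world action model on large-scale data and then fine-tuning on real robot data.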

2. LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories [PASS]

Score: 96.0 / 35.2

Authors: Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

Published: 2026-06-11

TL;DR: LabVLA grounds Vision-Language-Action models in scientific laboratories using a simulation-based data engine and two-stage training, achieving state-of-the-art success rates on the LabUtopia benchmark.

Abstract

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	5.0/10	7.5
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	7.0/10	10.5

Scoring rationale: The paper proposes LabVLA, a Vision-Language-Action model for scientific laboratories. It heavily relies on MLLM (Qwen3-VL) and MultiModal integration, scoring high on these keywords. Action Tokenization is explicitly mentioned in the pretraining stage. A unified learning framework is noted for accommodating diverse embodiments. While simulation (World Models/Model-based RL) aids data generation, it is not the core policy architecture. Agentic Reasoning is relevant to the autonomous execution goal. Visual Encoder is implicit in the backbone. Latent Reasoning is less explicit.

Keywords

Vision-Language-Action, Scientific Laboratories, Action Tokenization, MLLM, Simulation-based Data, LabUtopia Benchmark, Two-stage Training

In-Depth Analysis

Chinese Title: LabVLA：在科学实验室中落地视觉-语言-动作模型

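Summary: The paper targets two bottlenecks in scientific laboratory automation, data and embodiment, and proposes the LabVLA framework. Background: existing VLA models are trained mostly on household and tabletop scenes and lack the instruments, transparent liquids, and protocol-level operation data that laboratories require. On the data side, the paper builds RoboGenesis, a simulation data engine that synthesizes cross-embodiment laboratory demonstrations (LabEmbodied-Data) in three steps: environment construction, agentic workflow generation, and structured export. On top of this, LabVLA is trained in two stages: FAST action token pretraining first makes the Qwen3-VL backbone action-aware, and flow matching post-training then attaches a DiT action expert, with knowledge insulation (stop-gradient) to reduce interference. On the LabUtopia benchmark, LabVLA achieves the highest average success rate in both in-distribution and out-of-distribution settings. The conclusion is that simulated data combined with two-stage training can effectively bridge the gap between scientific reasoning and physical execution.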

Innovations:

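Explicitly frames scientific laboratory automation as a VLA learning problem and identifies data and embodiment as the core bottlenecks.
Proposes the RoboGenesis simulation data engine, which supports text-to-3D asset generation, composition of long-horizon workflows from atomic skills, deployment across 16 robot embodiments, and export filtered by success.
Proposes the two-stage LabVLA training strategy: FAST action token pretraining gives the VLM action semantics, and flow matching post-training enables continuous action prediction.
Introduces a knowledge insulation (stop-gradient) design that reduces interference between language representations and the action expert.
Builds the LabEmbodied-Data dataset, covering laboratory task families such as single-arm, dual-arm, and mobile manipulation.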

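Methodology: The paper combines simulated data synthesis with two-stage policy training. First, the RoboGenesis engine generates images from text and reconstructs 3D assets with TRELLIS 2.0 to build a library of laboratory scenes; uses an LLM to decompose natural-language instructions into sequences of atomic skills, instantiates them on 16 robot platforms, and applies six-axis domain randomization; and filters for successful trajectories and exports multimodal annotations. LabVLA then uses Qwen3-VL-4B-Instruct as its vision-language backbone. Stage one uses FAST action token pretraining to align the model with action semantics. Stage two uses flow matching to train a DiT action expert that predicts continuous actions, while gradients into the VLM are stopped to insulate its knowledge. The model is finally evaluated on the LabUtopia benchmark.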
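The flow matching stage can be illustrated with a single-sample loss. The sketch below assumes the common straight-line path between Gaussian noise and the target action, for which the regression target is the constant velocity along that line; the report does not say which path or time distribution LabVLA uses, and `predict_velocity` is a hypothetical stand-in for the DiT action expert. Knowledge insulation (stop-gradient) is not shown, since plain Python has no autograd.

```python
import random


def flow_matching_loss(predict_velocity, action, rng):
    """One-sample conditional flow matching loss on a linear noise-to-data path."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]  # x_0 ~ N(0, I)
    t = rng.random()                               # t ~ U[0, 1), an assumption
    # Point on the straight line from noise (t=0) to the action (t=1).
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # Along that line the velocity is constant: dx_t/dt = action - noise.
    target = [a - n for n, a in zip(noise, action)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(action)


def zero_model(x_t, t):
    """Untrained placeholder that always predicts zero velocity."""
    return [0.0] * len(x_t)


demo_action = [0.5, -0.2, 0.1]  # hypothetical 3-DoF continuous action
print(flow_matching_loss(zero_model, demo_action, random.Random(0)))
```

In training, this loss would be averaged over a batch and minimized with respect to the action expert's parameters only.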

Key Results:

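LabVLA achieves the highest average success rate on the LabUtopia benchmark, outperforming all compared baselines.
It performs best in both in-distribution and out-of-distribution settings, indicating that it generalizes.
RoboGenesis can synthesize high-quality demonstration data covering 16 robot embodiments and a variety of laboratory scenes.
FAST pretraining and flow matching post-training each improve performance significantly, and the knowledge insulation design effectively reduces interference.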

Tech Stack:

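Qwen3-VL-4B-Instruct (vision-language backbone)
DiT (Diffusion Transformer, action expert)
FAST (action token pretraining method)
Flow Matching (post-training)
Isaac Sim (simulation platform)
TRELLIS 2.0 (3D reconstruction)
RoboGenesis (data engine)
LabEmbodied-Data (dataset)
LabUtopia (benchmark)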

Strengths:

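Systematically addresses data scarcity for laboratory VLA models, using simulation-based synthesis for low-cost, scalable data generation.
Supports multiple robot embodiments and task families, with cross-embodiment transfer capability.
The two-stage training strategy effectively combines large-scale pretraining with domain fine-tuning, and the knowledge insulation design is novel.
Achieves leading results on a standard benchmark, validating the method.
Open-sources code and models, supporting community reproduction and follow-up research.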

Limitations:

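Trained entirely in simulation; generalization and robustness in real laboratory settings have not been verified.
Depends on high-quality 3D assets and accurate physics simulation; physical effects such as complex liquids and chemical reactions may not be realistic enough.
The atomic skill library has limited coverage and cannot span all laboratory operations.
Training is computationally demanding (4B-parameter VLM plus DiT).
No evaluation on real robots; actual deployment may face hardware differences and safety issues.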

Relevance To Keywords:

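Native multimodal large models: LabVLA is built on Qwen3-VL, an application of multimodal large models to robotics.
Unified understanding and generation in multimodal large models: The VLM understands language and vision while the DiT generates action sequences, combining understanding with generation.
Representation learning: FAST pretraining teaches the VLM action-semantic representations, and knowledge insulation preserves its language representations.
World models: The RoboGenesis simulation environment can be viewed as a laboratory world model used for data generation and evaluation.
Reinforcement learning: RL is not used directly, but flow matching post-training can be viewed as a form of imitation learning, which is related to RL.
Post-training: Flow matching post-training is a core step of the two-stage recipe.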

3. MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models [PASS]

Score: 96.0 / 35.2

Authors: Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan

Published: 2026-06-11

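TL;DR: MaskWAM unifies mask prompting and prediction in a world-action model to reduce language ambiguity and visual noise, significantly outperforming baselines on robot control tasks.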

Abstract

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	10.0/10	15.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	8.0/10	12.0
Latent Reasoning	1.5	8.0/10	12.0
Agentic Reasoning	1.5	6.0/10	9.0

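Scoring rationale: The paper centers on world action models (World Models = 10), and its title explicitly highlights unifying mask prompting and prediction (Unify Models = 9). The model combines visual masks with text, making it a multimodal architecture (MultiModal = 8) with an implicit visual encoder (Visual Encoder = 7). Mask prediction provides supervision in latent space (Latent Reasoning = 8), and the application to robot control fits model-based reinforcement learning (model-based RL = 8). Agentic reasoning is relevant but not a core term (Agentic Reasoning = 6). MLLM relevance is moderate (MLLM = 5), and tokenizers are not a focus (Tokenizer = 3). The weighted total is 96.0, well above the 35.2 passing score. The author list does not include any of the designated experts.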

Keywords

World-Action Models, Mask Prompting, Visual Prompts, Mixture of Transformers, Robotic Control, Object-centric, Semantic Grounding, Video Prediction

In-Depth Analysis

Chinese Title: MaskWAM：统一掩码提示与预测的世界-动作模型

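Summary: The paper proposes MaskWAM, an object-centric world-action model (WAM) that targets two spatial grounding bottlenecks in existing WAMs: the referential ambiguity of text instructions in cluttered scenes, and unstructured RGB predictions that lack semantic grounding and are easily biased by backgrounds. MaskWAM uses masks as both explicit inputs and outputs and adopts a Mixture of Transformers (MoT) architecture to jointly predict future RGB frames, task-relevant masks, and action chunks. Its core innovations are that predicting future masks provides object-centric semantic supervision and suppresses visual noise, and that first-frame visual prompts (such as target masks) establish a precise spatial anchor and reduce language ambiguity. Experiments show that MaskWAM significantly outperforms baselines on LIBERO, RoboTwin, and real-world tasks, achieving the highest success rates on both language-clear and language-ambiguous tasks.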

Innovations:

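First to use masks as both explicit inputs and prediction targets within a world-action model, enabling object-centric manipulation.
Uses future mask prediction as semantic supervision, forcing the model to attend to task-relevant regions and suppressing background noise.
Introduces first-frame mask visual prompts that establish a precise spatial anchor and resolve the referential ambiguity of language instructions in complex scenes.
Adopts a Mixture of Transformers (MoT) architecture that jointly processes RGB, masks, and actions for end-to-end training.
Proposes a mask dropout training strategy so that the model handles both language-clear and language-ambiguous tasks without switching modes.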

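Methodology: MaskWAM builds on the Wan 2.2 video VAE encoder. It compresses the current RGB frame and an optional first-frame mask into latents, concatenates them with noisy future latents, robot proprioception, and T5 language embeddings, and feeds the result to a Mixture of Transformers (MoT) for joint denoising. The visual branch denoises RGB and mask latents synchronously with a shared timestep τv, while the action branch denoises action chunks with an independent timestep τa. Training uses a decoupled flow matching objective whose loss is the sum of the video, mask, and action flow matching losses. At inference, a partial denoising strategy generates actions efficiently.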
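The per-sample randomness in this training recipe (a shared visual timestep τv, an independent action timestep τa, and mask-prompt dropout) can be sketched as follows. The uniform timestep distribution, the function names, and the unweighted loss sum are assumptions for illustration; the report only gives the dropout probability p=0.5 and says the three flow matching losses are summed.

```python
import random


def sample_training_conditions(rng, mask_drop_prob=0.5):
    """Draw the per-sample randomness for one training example (a sketch)."""
    return {
        "tau_v": rng.random(),  # shared timestep for RGB and mask latents
        "tau_a": rng.random(),  # independent timestep for the action chunk
        # Keep the first-frame mask prompt only some of the time, so the
        # model also learns to act from text alone.
        "use_mask_prompt": rng.random() >= mask_drop_prob,
    }


def total_loss(video_loss, mask_loss, action_loss):
    """Sum of the three flow matching terms (no extra weighting assumed)."""
    return video_loss + mask_loss + action_loss


rng = random.Random(0)
for _ in range(3):
    print(sample_training_conditions(rng))
```

Sampling τv and τa separately means the action branch can see a cleaner or noisier visual context than its own noise level, which is the decoupling the methodology describes.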

Key Results:

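Reaches a 98.4% success rate on the LIBERO benchmark and 92.2% on RoboTwin.
On real-world tasks, the success rate is 84.3% in language-clear scenarios and 84.9% in language-ambiguous scenarios, surpassing the strongest baseline by 33.2%.
Future mask prediction acts as a semantic regularizer and significantly improves standard text-conditioned WAMs.
First-frame mask prompts provide precise spatial localization in language-ambiguous scenarios and greatly reduce referential ambiguity.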

Tech Stack:

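Wan 2.2 video VAE (causal 3D VAE)
Mixture of Transformers (MoT)
T5 text encoder (frozen)
Flow matching
Decoupled noise schedules (τv and τa)
Mask dropout training strategy (dropout probability p=0.5)
SAM-3 segmentation model (used to obtain first-frame masks)
KV cache (inference acceleration)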

Strengths:

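Unifies mask prompting and prediction into closed-loop semantic supervision, strengthening the model's focus on task-relevant regions.
Visual prompts (masks) provide a more precise spatial anchor than text alone, especially in cluttered scenes and with similar-looking objects.
End-to-end policy architecture with no separate segmentation or detection module inside the model (first-frame masks are still obtained externally, for example with SAM-3), with efficient training and inference.
Significantly outperforms existing methods on multiple benchmarks and real-world tasks.
Simple design that extends an existing video diffusion model, making it easy to reproduce and deploy.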

Limitations:

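Depends on obtaining a first-frame mask (which requires a segmentation model such as SAM), so it may be limited where no segmentation model is available.
Mask prediction covers only task-relevant regions rather than all objects, so some contextual information may be ignored.
Experiments cover a limited set of tasks; generalization to more complex long-horizon tasks needs further evaluation.
The model has a large parameter count, and inference speed may be limited by the video diffusion process.
Does not discuss how mask prediction failures or noise affect policy robustness.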

Relevance To Keywords:

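Unify Models: MaskWAM unifies mask prompting and prediction, which places it in the unified-model paradigm.
World Models: MaskWAM is a world-action model that learns physical dynamics through video prediction.
Representation Learning: Predicting future masks teaches object-centric representations and improves semantic representation quality.
Model-Based RL: The model predicts future states and actions and can be viewed as a model-based reinforcement learning method.
Native multimodal large models: Uses a T5 text encoder and a video VAE to fuse language, vision, and action.
Unified understanding and generation in multimodal large models: Generates RGB, masks, and actions together, unifying understanding and generation.
Representation learning: Mask prediction forces the model to learn task-relevant representations and suppress background noise.
World models: Fits the definition of a world model: it predicts future observations and uses them for control.
Reinforcement learning: Because it predicts actions and future states, it could be combined with reinforcement learning for policy optimization.
Post-training: Fine-tunes a pretrained video model, which falls under post-training.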

4. NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation [PASS]

Score: 91.5 / 35.2

Authors: Daichi Azuma, Taiki Miyanishi, Koya Sakamoto, Shuhei Kurita, Yaonan Zhu, Petr Khrapchenkov, Motoaki Kawanabe, Yusuke Iwasawa, Yutaka Matsuo

Published: 2026-06-11

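TL;DR: NavWAM uses a diffusion transformer to unify world-model prediction and action generation in a shared latent space, enabling closed-loop visual navigation without an external planner.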

Abstract

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	10.0/10	15.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	9.0/10	13.5
Latent Reasoning	1.5	9.0/10	13.5
Agentic Reasoning	1.5	6.0/10	9.0

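Scoring rationale: The paper's core contribution is NavWAM, which unifies world-model prediction and action generation in a shared latent space (Unify Models, Latent Reasoning) and belongs to the model-based reinforcement learning framework (model-based RL). The task involves visual navigation (Visual Encoder, MultiModal), but there is no large language model (MLLM) or explicit tokenizer innovation (Tokenizer).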

Keywords

World Action Model, Visual Navigation, Latent Sequence, Diffusion Transformer, Goal-Conditioned, Closed-Loop Control, Visual Foresight

In-Depth Analysis

Chinese Title: NavWAM：一种用于目标条件视觉导航的导航世界动作模型

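Summary: The paper proposes NavWAM (Navigation World Action Model), a diffusion-transformer policy that turns the visual predictions of a navigation world model into executable actions. Goal-conditioned visual navigation requires a robot under partial observability to predict future egocentric views and judge whether it is getting closer to the goal, yet existing navigation world models are only prediction modules and need an external planner to convert predicted futures into closed-loop control. NavWAM encodes future observations, goal-progress values, and action chunks in a shared latent sequence and jointly learns future prediction, action generation, and value estimation, so that visual prediction is directly usable for robot control. The model is pretrained in simulation and adapted on a real robot. In offline benchmarks and closed-loop real-world deployment it outperforms planning-based world-model baselines such as NWM without CEM-style test-time trajectory optimization, remains competitive with larger direct navigation policies such as OmniVLA, and retains interpretable future-view predictions.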

Innovations:

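Proposes NavWAM, which turns the visual predictions of a navigation world model into a policy that outputs actions directly, with no external planner.
Introduces a joint prediction formulation that represents future observations, goal-progress values, and executable action chunks in a shared latent sequence, so that future prediction directly supports closed-loop action generation.
After simulation pretraining and real-robot adaptation, NavWAM outperforms planning-based world-model baselines in offline benchmarks and closed-loop deployment without CEM-style test-time trajectory optimization.
With a 2B-parameter video backbone, NavWAM remains competitive with a larger direct navigation policy (7B parameters) while providing interpretable future-view predictions.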

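Methodology: NavWAM builds on a pretrained diffusion-transformer video world model (Cosmos Predict2). The variables needed for navigation control (current observation, goal image, robot state, action chunk, future observation, and goal-progress value) are assigned to different frames of a fixed nine-frame latent canvas and predicted through a joint denoising process. The model is pretrained in simulation and then adapted on a real robot with a small amount of data. At inference, conditioned on the current observation and the goal, the model directly outputs an action chunk that is executed in a receding-horizon manner, and it also outputs a future view and a progress value as interpretable predictions.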
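Receding-horizon execution means the policy proposes a chunk of actions, only the first few are executed, and the policy is queried again from the new state. The toy sketch below shows that loop on a 1-D position. `toy_policy` is a hypothetical stand-in for the model, and the chunk length and number of executed actions per plan are illustrative choices, not values from the paper.

```python
def toy_policy(position, goal, chunk_len=4, max_step=1.0):
    """Hypothetical stand-in for the model: propose a chunk of 1-D moves."""
    chunk = []
    for _ in range(chunk_len):
        # Move toward the goal, clipped to the maximum step size.
        step = max(-max_step, min(max_step, goal - position))
        chunk.append(step)
        position += step
    return chunk


def receding_horizon_rollout(start, goal, execute_per_plan=2, max_plans=20, tol=1e-6):
    """Re-plan after executing only the first few actions of each chunk."""
    position, plans = start, 0
    while abs(goal - position) > tol and plans < max_plans:
        chunk = toy_policy(position, goal)
        for action in chunk[:execute_per_plan]:
            position += action  # on a robot this would be a motion command
        plans += 1
    return position, plans


print(receding_horizon_rollout(start=0.0, goal=5.0))
```

Discarding the tail of each chunk costs extra model calls but lets every executed action benefit from the most recent observation.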

Key Results:

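In offline benchmarks and closed-loop real-robot deployment, NavWAM navigates better than planning-based world-model baselines such as NWM.
NavWAM achieves effective closed-loop control without CEM-style test-time trajectory optimization.
With a 2B-parameter video backbone, NavWAM remains competitive with OmniVLA, which uses a larger 7B-parameter VLA backbone.
NavWAM produces future-view predictions and value estimates alongside actions, providing interpretable visual foresight.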

Tech Stack:

Diffusion Transformer (DiT)
Causal VAE
Cosmos Predict2 pretrained video world model
Latent-frame modeling principle from Cosmos Policy
Receding-horizon control
Simulation pretraining and real-robot adaptation

Strengths:

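Unifies visual prediction and action generation in a single policy representation, removing the external-planner bottleneck.
For navigation under partial observability, jointly learns future views, progress values, and actions so that visual foresight directly serves closed-loop control.
Leverages a pretrained video world model, reducing dependence on large-scale navigation data and supporting sim-to-real transfer.
Outperforms planning baselines in both offline benchmarks and real deployment, at lower compute cost because no CEM search is needed.
Retains interpretable future-view predictions, which help explain the model's decisions.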

Limitations:

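Depends on a pretrained video world model (Cosmos Predict2) and may be limited by that model's generalization.
Currently addresses only image-goal navigation and has not been extended to other goal forms such as language instructions.
Real-robot adaptation may require additional domain adaptation data, so deployment cost remains.
Performance is close to but does not exceed that of larger direct policies, possibly because of model capacity.
Robustness in complex dynamic environments is not discussed in detail.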

Relevance To Keywords:

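Unify Models: NavWAM unifies future prediction, value estimation, and action generation in a single model.
World Models: The paper centers on navigation world models and uses a video world model for visual prediction.
Representation Learning: The latent canvas encodes different kinds of information into a shared representation.
Model-Based RL: NavWAM learns a world model and uses it for control, which falls under model-based reinforcement learning.
Native multimodal large models: Uses a pretrained multimodal video model (Cosmos Predict2) as the backbone, but does not emphasize native multimodality.
Unified understanding and generation in multimodal large models: The model generates future views (generation) and actions (understanding/control) together.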

5. HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers [PASS]

Score: 90.0 / 35.2

Authors: Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang

Published: 2026-06-11

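TL;DR: HYDRA-X is a native unified multimodal model that uses a holistic visual tokenizer to unify image and video representations and moves editing interaction into latent space, achieving strong image and video understanding and generation.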

Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated at the 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	10.0/10	15.0
Tokenizer	1.5	10.0/10	15.0
Visual Encoder	1.5	10.0/10	15.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	0.0/10	0.0

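Scoring rationale: The core contribution is the visual tokenizer design for unified multimodal models (Unify Models), which is highly relevant to Tokenizer, Visual Encoder, and MultiModal (10 each). MLLM is closely related background (8), and Latent Reasoning corresponds to the latent-space operations (7). World Models is mentioned as background but is not central (5). Model-based RL and Agentic Reasoning do not appear in the paper (0). The author list does not include any of the designated experts, such as Yang Shi.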

Keywords

Unified Multimodal Models, Holistic Visual Tokenizers, Vision Transformer, Image and Video Tokenization, Latent Space, Representation Learning, Native Unified Models

In-Depth Analysis

Chinese Title: HYDRA-X：基于整体视觉分词器的原生统一多模态模型

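Summary: HYDRA-X is the first native unified multimodal model (UMM) to unify image and video tokenization in a single Vision Transformer (ViT). It addresses two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. The study finds that frame-level causal temporal attention outperforms full spatiotemporal attention and that hierarchical temporal compression outperforms single-step compression. A lightweight decompressor upsamples compressed features under joint image-video teacher supervision, enforcing complementary semantic structures in the compact latent space. Building on this holistic tokenizer, the editing pipeline is improved as well: source-target interaction takes place at the latent level inside the tokenizer rather than at the semantic level inside the LLM, which significantly improves editing consistency and accelerates convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance on image and video understanding and generation, laying groundwork for future unified-tokenizer UMMs.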

Innovations:

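First to unify image and video tokenization in a single ViT, yielding a native unified multimodal model.
Finds that frame-level causal temporal attention outperforms full spatiotemporal attention, and that hierarchical temporal compression outperforms single-step compression.
Proposes a lightweight decompressor that performs spatiotemporal semantic distillation under joint image-video teacher supervision.
Places source-target interaction at the tokenizer's latent level rather than the LLM's semantic level, improving editing consistency.
At 7B scale, unifies five tasks in one model: image and video understanding, image and video generation, and editing.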

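Methodology: Building on the HYDRA framework, the ViT is split into a Gen-ViT and a Sem-ViT connected by a generation-semantic bottleneck. Spatiotemporal reconstruction uses frame-level causal temporal attention (attending only to the previous frame) and hierarchical temporal compression (multi-stage folding). A decompressor upsamples the compressed latent features to the original frame rate so that they can be distilled from pretrained image and video teachers. For editing, the source and target images are placed in the same temporal window and interact across frames inside the tokenizer. The model is instantiated on Qwen2.5-7B-Instruct and trained jointly on multiple tasks.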

Key Results:

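Frame-level causal temporal attention beats full spatiotemporal attention on all reconstruction metrics (PSNR, SSIM, rFID, rFVD).
Hierarchical 2×2 temporal compression outperforms single-step 4× compression, and reconstruction fidelity exceeds that of dedicated 3D-convolutional video VAEs such as Wan2.2-VAE.
On ImageNet and DAVIS, HYDRA-XTOK reaches advanced image and video reconstruction quality.
At 7B scale, HYDRA-X achieves strong performance on image/video understanding, generation, and editing.
Latent-level source-target interaction significantly improves editing consistency and accelerates convergence.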

Tech Stack:

Vision Transformer (ViT)
Frame-level causal temporal attention
Hierarchical temporal compression
Generation-semantic bottleneck
Decompressor
Semantic distillation
Qwen2.5-7B-Instruct (LLM backbone)
3D-convolutional VAE (comparison baseline)
PSNR, SSIM, rFID, rFVD (evaluation metrics)
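Of the reconstruction metrics listed, PSNR is the simplest to state: it is 10·log10(MAX² / MSE), where MAX is the peak pixel value and MSE is the mean squared error between the reference and the reconstruction. The sketch below applies the standard definition to flat lists of pixel values. The 8-bit peak of 255 is an assumption; the report does not describe the paper's evaluation protocol.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 4-pixel example: each pixel is off by exactly one intensity level.
print(psnr([10, 20, 30, 40], [11, 21, 31, 41]))
```

Higher is better; SSIM, rFID, and rFVD capture structural and distributional quality that PSNR does not.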

Strengths:

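First to unify images and video in a single ViT tokenizer, with a simple and efficient architecture.
Extensive ablations reveal a counterintuitive design principle: causal attention outperforms global attention.
The decompressor neatly resolves the temporal-resolution mismatch in video semantic distillation.
Latent-level interaction for editing significantly outperforms existing approaches and needs no extra modules.
Validates the unified tokenizer at 7B scale with strong performance.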

Limitations:

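Only a 7B dense model is explored; larger scales and MoE architectures are not validated.
Frame-level causal attention looks only at the previous frame, which may limit modeling of long-range temporal dependencies.
The decompressor adds computational overhead, even though it is lightweight.
For editing, source-target interaction happens only inside the tokenizer, which may not cope with complex editing instructions.
No direct comparison with recent video understanding/generation models such as Sora or VideoPoet.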

Relevance To Keywords:

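Unify Models: HYDRA-X is a native unified multimodal model that unifies image and video understanding and generation.
World Models: The unified tokenizer learns a unified representation of the visual world, providing a basis for world models.
Representation Learning: The holistic visual tokenizer learns a unified representation that carries both pixel-level and semantic-level information.
Model-Based RL: The unified representation could serve as the visual state encoding in model-based reinforcement learning.
原生多模态大模型 (native multimodal large models): A single backbone is trained directly on multimodal tasks, matching the native multimodal paradigm.
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): Bidirectional distillation between understanding and generation happens inside a single tokenizer.
表征学习 (representation learning): Semantic distillation and the decompressor yield a compact, semantically rich latent representation.
世界模型 (world models): The unified visual tokenizer could act as the visual encoder of a world model.
强化学习 (reinforcement learning): The unified representation could simplify visual input processing in reinforcement learning.
后训练 (post-training): The paper does not explicitly cover post-training, but the unified tokenizer can support later fine-tuning.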

6. IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing PASS

Score: 84.0 / 35.2

Authors: Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang

Published: 2026-06-11

TL;DR: IterCAD proposes a unified multimodal agent framework for closed-loop, interactive CAD generation and editing, significantly outperforming existing approaches in code executability and geometric precision through progressive SFT and geometry-aware reinforcement learning.

摘要翻译

计算机辅助设计（CAD）在现代制造中至关重要，然而现有的自动化方法主要依赖于开环、一次性生成，这与迭代式的现实实践存在偏差。本文提出了一种名为 IterCAD 的统一多模态智能体框架，用于闭环交互式 CAD 生成与编辑。我们将该任务建模为多模态智能体与可执行 CAD 沙盒之间的多轮交互，涵盖三个任务：绘图到代码（Drawing-to-Code）、文本到代码（Text-to-Code）以及交互式编辑（Interactive Editing）。为此，我们开发了一个数据合成管道，融入先进的工业制造特征，用于生成符合标准的多视图工程图纸、复杂的代码编辑任务以及高保真交互轨迹。我们通过渐进式 SFT 随后结合几何感知强化学习（并采用可行前缀掩码技术）来优化智能体，以提升代码可执行性和几何保真度。最后，我们引入了 IterCAD-Bench 评估套件，并提出倒角距离容差-召回率（Chamfer Distance Tolerance-Recall, CD-TR）曲线及其 AUC-TR 指标，建立了一个无幸存者偏差的标准，该标准统一了代码有效性与几何精度。大量实验表明，IterCAD 在多个基准测试中实现了极具竞争力的性能，在代码可执行性和几何精度方面显著优于现有方法，同时在闭环迭代优化方面展现出卓越的能力。

Abstract

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	6.0/10	9.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	9.0/10	13.5

Scoring rationale: The paper centers on a unified multimodal agent for CAD, scoring high on MultiModal (10), Agentic Reasoning (9), and Unify Models (8). It utilizes MLLM (8) and Visual Encoder (6) for grounding, and applies RL (6) with sandbox interaction. Tokenizer and Latent Reasoning are not highlighted (2 each). World Models is moderately relevant (5). No expert authors from the specified list were found. Total weighted score is 84.0, exceeding the dynamic passing score of 35.2.

Keywords
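The arithmetic behind the total can be reproduced from the table: each keyword contributes weight × relevance, and the sum is compared with the passing score. The sketch below uses the IterCAD row values from this report; the function name, the `author_bonus` argument, and the use of `>=` for the pass check are assumptions, and how the report derives the dynamic passing score of 35.2 is not stated, so it is treated as a given constant.

```python
# Relevance values (out of 10) copied from the IterCAD scoring table.
relevance = {
    "Unify Models": 8.0, "Tokenizer": 2.0, "Visual Encoder": 6.0,
    "World Models": 5.0, "MLLM": 8.0, "MultiModal": 10.0,
    "model-based RL": 6.0, "Latent Reasoning": 2.0, "Agentic Reasoning": 9.0,
}

def total_score(relevance, weight=1.5, author_bonus=0.0):
    """Sum of weight * relevance over all keywords, plus an optional expert-author bonus."""
    return sum(weight * r for r in relevance.values()) + author_bonus

PASSING_SCORE = 35.2  # taken from the report; its derivation is not shown there

score = total_score(relevance)
passed = score >= PASSING_SCORE  # assumption: ties count as a pass
```

For entries with an expert author (such as the +5.0 bonus row that appears for InterleaveThinker), the same function applies with `author_bonus=5.0`.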

Iterative Multimodal Agent, Visually-Grounded CAD, Closed-loop Interaction, Reinforcement Learning, Code Executability, Geometric Precision, Progressive SFT, Multi-turn Interaction

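In-Depth Analysis

Chinese Title: IterCAD: 一种用于视觉引导的CAD生成与编辑的迭代多模态智能体

Summary: The paper proposes IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. Existing automated methods rely mainly on open-loop, one-shot generation, which does not match iterative design practice. IterCAD models the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. A data synthesis pipeline that incorporates advanced industrial manufacturing features generates standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. The agent is optimized with progressive supervised fine-tuning (SFT) followed by geometry-aware reinforcement learning (RL) with geometric viable-prefix masking (GVPM), improving code executability and geometric fidelity. Finally, the paper introduces the IterCAD-Bench evaluation suite and the Chamfer Distance Tolerance-Recall (CD-TR) curve with its AUC-TR metric, a survivor-bias-free standard that unifies code validity and geometric precision. Experiments show that IterCAD is competitive on multiple benchmarks, significantly outperforms existing methods in code executability and geometric precision, and is strong at closed-loop iterative refinement.

Innovations:

Proposes IterCAD, a drawing-driven agent framework that unifies CAD generation and editing as a closed-loop, multi-turn synthesis process, using multi-view engineering drawings as spatial anchors.
Designs a two-stage training recipe: progressive cold-start SFT followed by geometry-aware RL with geometric viable-prefix masking (GVPM), which explicitly injects robust self-correction ability.
Builds the IterCAD-Bench multimodal dataset covering diverse engineering operations, and proposes the CD-TR curve as a survivor-bias-free evaluation standard that jointly quantifies execution validity and geometric precision.
Shifts CAD modeling from static one-shot generation to an interactive, self-correcting design process, with incremental improvement driven by compiler, execution, and visual feedback.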

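Methodology: Training has two stages. Stage one is progressive cold-start supervised fine-tuning (SFT) on high-quality multi-turn interaction trajectories, which initializes procedural CAD reasoning. Stage two is geometry-aware reinforcement learning (RL) with geometric viable-prefix masking (GVPM), which encourages robust iterative correction and executable consistency. On the data side, a synthesis pipeline generates three kinds of CAD pairs (drawing-code, text-code, edit-code), and an expert LLM (Qwen3-VL-235B-A22B-Instruct) rolls out multi-turn interaction trajectories that are filtered for format, logic, and geometric correctness to form the cold-start corpus. For evaluation, the paper proposes the Chamfer Distance Tolerance-Recall (CD-TR) curve and the AUC-TR metric to remove survivor bias.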
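To make the survivor-bias point concrete, here is a rough sketch of a tolerance-recall curve built on Chamfer distance. The report does not give the paper's exact definitions, so everything below is an assumption for illustration: the symmetric mean-of-nearest-neighbour form of Chamfer distance, the rule that a sample whose code fails to execute gets an infinite distance (so it stays in the denominator instead of being dropped), and the normalization of the area by the tolerance range.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def tolerance_recall(cds, tolerances) -> np.ndarray:
    """Fraction of ALL samples whose Chamfer distance is within each tolerance.

    Failed executions are passed in as np.inf, so they count as misses
    at every tolerance rather than being excluded.
    """
    cds = np.asarray(cds, dtype=float)
    return np.array([(cds <= t).mean() for t in tolerances])

def auc_tr(cds, tolerances) -> float:
    """Trapezoidal area under the tolerance-recall curve, normalized to [0, 1]."""
    tolerances = np.asarray(tolerances, dtype=float)
    recall = tolerance_recall(cds, tolerances)
    area = np.sum((recall[1:] + recall[:-1]) / 2.0 * np.diff(tolerances))
    return float(area / (tolerances[-1] - tolerances[0]))

# Three samples: a perfect match, a mediocre one, and a non-executable program.
cds = [0.0, 0.5, np.inf]
tolerances = [0.0, 0.5, 1.0]
curve = tolerance_recall(cds, tolerances)
score = auc_tr(cds, tolerances)
```

Averaging Chamfer distance only over programs that execute would report a perfect-looking number for a model that produces one valid shape and many crashes; keeping failures in the denominator is what removes that bias in this sketch.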

Key Results:

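IterCAD is highly competitive on multiple benchmarks and significantly outperforms existing methods in code executability and geometric precision.
Through closed-loop iterative refinement, IterCAD shows strong self-correction and handles execution failures and geometric defects effectively.
The CD-TR curve and AUC-TR metric remove the survivor bias of conventional evaluation and unify the measurement of code validity and geometric precision.
Combining progressive SFT with geometry-aware RL markedly improves the agent's robustness and interactive design ability.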

Tech Stack:

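CadQuery (procedural CAD framework)
build123d (CAD framework)
SolidWorks Component Object Model (COM) interface (engineering drawing generation)
OCCT kernel (projection and dimensioning)
Qwen3-VL-235B-A22B-Instruct (expert LLM)
Chamfer Distance (geometric metric)
Geometric viable-prefix masking (GVPM)
Supervised fine-tuning (SFT)
Reinforcement learning (RL)
CD-TR curve and AUC-TR metric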

Strengths:

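The closed-loop iterative framework is closer to real engineering design workflows and overcomes the limits of open-loop, one-shot generation.
The two-stage SFT plus RL training effectively improves code executability and geometric fidelity.
Builds a high-quality dataset and benchmark with advanced manufacturing features, and introduces a survivor-bias-free metric.
Unifies Drawing-to-Code, Text-to-Code, and Interactive Editing, which makes the approach general.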

Limitations:

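Training trajectories depend on an expert LLM, which may introduce bias or leave coverage gaps.
The data synthesis pipeline involves commercial software such as SolidWorks, which may limit reproducibility.
The experiments do not discuss ablations of the training strategies in detail, and the RL reward design needs further explanation.
The framework currently targets CadQuery code; generalization to other procedural CAD frameworks is not verified.

Relevance To Keywords: The paper is highly relevant to "原生多模态大模型" (native multimodal large models) and "多模态大模型的理解和生成一体化" (unified understanding and generation in multimodal large models), because IterCAD uses a multimodal large language model (MLLM) to understand engineering drawings, text, and code while generating executable code. It is somewhat related to "世界模型" (world models) and "表征学习" (representation learning), since the agent must learn representations and physical regularities of the CAD geometry world (such as projection in the OCCT kernel). It is directly related to "强化学习" (reinforcement learning) and "后训练" (post-training), as RL is used for post-training optimization. It is indirectly related to "Model-Based RL", because the agent improves iteratively from sandbox feedback (a simulated environment). It is related to "Unify Models" because IterCAD unifies generation and editing tasks.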

7. EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation PASS

Score: 82.5 / 35.2

Authors: Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng

Published: 2026-06-11

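TL;DR: EA-WM proposes an event-aware world-model framework that combines task-specification-grounded event prediction and verification in visual-feature space, significantly improving planning reliability and task alignment for long-horizon robot manipulation.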

摘要翻译

预训练特征世界模型为机器人想象提供了有用基础，但仅靠视觉或潜在预测无法确定想象的未来是否满足任务相关事件。长时程操作需要具有关系性、谓词级和物理基础的进展信号：物体是否移动，抽屉或接触状态是否改变，放置谓词是否满足，以及候选未来是否足够可靠以供执行。我们引入了 EA-WM，这是一个事件感知世界模型框架，它在冻结的视觉特征动力学基础上，增强了基于任务规范的事件预测与验证。EA-WM 在预训练视觉特征空间中展开候选未来，将其解码为结构化事件状态，并利用任务进展、语义一致性、物理可行性和不确定性项对其进行评分。验证器引导基于采样的规划，筛选候选动作，并在接触敏感的 LIBERO 酒架设置中，从 PPO 生成的提案中进行选择。在导航、可变形物体、墙壁约束及语言描述的操作研究中，EA-WM 表明事件感知验证可以使特征空间世界模型更具可解释性，并且更好地与任务进展对齐。

Abstract

Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant events. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce EA-WM, an event-aware world-model framework that augments frozen visual-feature dynamics with task-specification-grounded event prediction and verification. EA-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EA-WM shows that event-aware verification can make feature-space world models more interpretable and better aligned with task progress.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	10.0/10	15.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	8.0/10	12.0
Latent Reasoning	1.5	8.0/10	12.0
Agentic Reasoning	1.5	6.0/10	9.0

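Scoring rationale: The core contribution is event-aware augmentation of world models (World Models, 10) in visual-feature space (Visual Encoder, 8), involving latent-space reasoning (Latent Reasoning, 8) and model-based planning (model-based RL, 8). Language-described tasks are involved (MultiModal, 5), but large language models are not central (MLLM, 3), and the paper does not explicitly address unified model architectures (Unify Models, 5) or tokenizers (Tokenizer, 2). Agentic reasoning (Agentic Reasoning, 6) shows up in the plan-and-verify loop.

Keywords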

Event-Aware World Models, Task-Specification Grounding, Long-Horizon Manipulation, Visual-Feature Dynamics, Event Prediction, Sampling-Based Planning, Verifier, Physical Feasibility

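In-Depth Analysis

Chinese Title: EA-WM: 面向长时程操作的事件感知世界模型与任务规范对齐

Summary: The paper proposes the EA-WM framework to address a gap in long-horizon robot manipulation: visual world models predict future features but do not verify task-relevant events. On top of a pretrained visual-feature dynamics model, the method adds a task-specification-grounded event prediction and verification layer. Event labels (such as object state changes, spatial relations, and task progress) are generated automatically from simulator state; an event predictor is trained to decode imagined futures into structured event states; and a verifier scoring function covering task progress, semantic consistency, physical feasibility, and uncertainty guides sampling-based planning with the cross-entropy method (CEM). On PointMaze, Deformable, Wall-Single, and LIBERO, EA-WM raises success rates (for example, PointMaze random-goal success rises from 0.90 to 0.94, and the LIBERO wine-rack task reaches a hybrid success rate of 97/100), indicating that event-aware verification can make feature-space world models more interpretable and better aligned with task progress.

Innovations:

Proposes an event-aware verification layer that decodes the future predictions of a pretrained-feature world model into task-relevant event states, separating visual imagination from task-progress verification.
Designs a multi-dimensional verifier scoring function covering task progress, semantic consistency, physical feasibility, and uncertainty to guide sampling-based planning.
Introduces PPO as the action-proposal distribution for contact-sensitive LIBERO tasks, combined with verifier gating to raise success on complex manipulation.
Generates event labels automatically from simulator state and BDDL rules, with no manual annotation, which makes event supervision scalable.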

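Methodology: EA-WM has four core components: (1) a pretrained visual-feature world model (DINO-WM style) that rolls out candidate future features; (2) an automatic event-label generator that extracts object states, spatial relations, progress, and other events from simulator state and task definitions (such as BDDL rules); (3) an event predictor and verifier that map imagined features to event states and compute a combined score; and (4) a verifier-guided CEM planner that selects actions using both the feature cost and the verifier score. For the LIBERO wine-rack task, PPO is additionally used to optimize the action-proposal distribution.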
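A toy sketch of how a verifier score might be folded into CEM planning is shown below. It is not the paper's code: the world model and event predictor are replaced by a hypothetical one-dimensional rollout (`toy_events`), and the weight names, the linear form of the score, and the `lam` trade-off are assumptions chosen only to show the control flow.

```python
import numpy as np

def verifier_score(events, w_p=1.0, w_s=1.0, w_f=1.0, w_u=1.0):
    """Linear combination of four event terms; uncertainty is penalized (assumed form)."""
    return (w_p * events["progress"] + w_s * events["semantic"]
            + w_f * events["feasible"] - w_u * events["uncertainty"])

def toy_events(actions, goal=1.0):
    """Hypothetical stand-in for rollout + event decoding: final state is the action sum."""
    dist = abs(actions.sum() - goal)
    return {
        "progress": max(0.0, 1.0 - dist),
        "semantic": 1.0,
        "feasible": float(np.all(np.abs(actions) <= 1.0)),  # per-step action limit
        "uncertainty": 0.0,
    }, dist

def planning_cost(actions, lam=1.0):
    """Feature cost minus a weighted verifier score (lower is better)."""
    events, feature_cost = toy_events(actions)
    return feature_cost - lam * verifier_score(events)

def cem_plan(cost_fn, horizon, iters=10, pop=64, elite=8, seed=0):
    """Cross-entropy method: sample action sequences, keep elites, refit a Gaussian."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, horizon))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:elite]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

plan = cem_plan(planning_cost, horizon=3)
```

In this toy setting the planner is pulled toward action sequences that reach the goal while respecting the feasibility predicate; gating (discarding candidates below a score threshold) would be a natural variant of the same loop.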

Key Results:

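PointMaze random-goal success rises from 0.90 to 0.94 while dataset-goal performance is maintained.
On the Deformable e10 block task, retrieval-initialized EA-CEM reaches a 94% success rate.
On the Wall-Single task, archive-verified EA-CEM reaches a 95% success rate.
On LIBERO-goal tasks, the verifier's AUC for alignment with checked success reaches 0.993947.
On the LIBERO wine-rack task, PPO proposals with H=20 reach an online hybrid success rate of 97/100.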

Tech Stack:

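DINO-WM (pretrained visual-feature world model)
Cross-entropy method (CEM)
PPO (Proximal Policy Optimization)
BDDL (task rule description language)
AUC (area under the curve) evaluation metric
Automatic label generation from simulator state
Multi-objective optimization (feature cost + verifier score)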

Strengths:

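Brings event verification into world-model planning, compensating for the lack of task semantics in purely visual prediction.
The multi-dimensional verifier score (progress, consistency, feasibility, uncertainty) is comprehensive and makes planning more robust.
Automatic event-label generation lowers annotation cost and extends easily to new tasks.
Generalization is validated on several benchmarks (navigation, deformable objects, constrained manipulation, language-described manipulation).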

Limitations:

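Event labels depend on simulator state and predefined rules, which are hard to obtain directly in real physical environments.
The weights in the verifier scoring function (such as w_f) must be tuned by hand or adapted across tasks.
PPO proposals are introduced only for the LIBERO wine-rack task; their generality on other tasks is not verified.
The framework is fairly complex, since a world model, an event predictor, and a verifier must all be maintained.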

Relevance To Keywords:

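Unify Models: EA-WM unifies visual-feature prediction with event verification, but does not address unification in multimodal large models.
World Models: The core contribution is strengthening the event awareness of world models, squarely within world-model research.
Representation Learning: Uses pretrained visual features (DINO-WM) as the representational basis and learns event representations.
Model-Based RL: Uses model-based planning (CEM) together with policy optimization (PPO), placing it within the model-based RL paradigm.
原生多模态大模型 (native multimodal large models): Not directly relevant; no multimodal large model is used.
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): Not directly relevant; no generation task is involved.
强化学习 (reinforcement learning): Uses PPO for policy optimization, which is a reinforcement learning method.
后训练 (post-training): Not directly relevant; no post-training stage is involved.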

8. ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages PASS

Score: 76.5 / 35.2

Authors: Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Published: 2026-06-11

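TL;DR: To address weak multilingual medical reasoning, the paper proposes the ArogyaSutra multi-agent framework and the ArogyaBodha dataset, significantly improving multimodal medical reasoning accuracy in Indic languages.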

摘要翻译

多模态大语言模型（MLLMs）在通用领域已展现出颇具潜力的推理能力，但在医疗等专业领域，尤其是在多语言和低资源场景下，其性能仍受限。这种差距在印度农村地区等地区尤为关键，因为患者常使用本土印度语言提出复杂的医疗查询，并依赖医学图像等多模态输入。现有的以英语为中心的 MLLMs 难以支持此类应用场景，限制了患者公平获取 AI 驱动的医疗辅助。为应对这一挑战，我们提出 ArogyaBodha，这是一个从八个异构来源构建的大规模多模态多语言医学问答数据集，涵盖 31 个人体系统、六种成像模态和 21 个临床领域，涉及英语及七种主要印度语言。我们进一步提出 ArogyaSutra，一种基于 Actor-Critic 的多智能体框架，该框架整合了工具定位（tool grounding）与双记忆机制，用于实现逐步、推理感知的决策，并利用存储的 Actor-Critic 模拟轨迹进行蒸馏。实验表明，我们的数据集和框架在所有印度语言中均提升了多语言医疗推理准确性，消融实验验证了各组件的贡献。源代码和数据集可在以下网址获取：https://iitp-cse.github.io/ArogyaSutra/

Abstract

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	6.0/10	9.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	8.0/10	12.0

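Scoring rationale: The paper focuses on applying multimodal large language models (MLLMs) to healthcare, so MLLM and MultiModal are the most relevant keywords (9 each). The multi-agent framework, tool grounding, and decision mechanism are highly relevant to Agentic Reasoning (8). The actor-critic approach touches on reinforcement learning and trajectory distillation is mentioned, but whether it is model-based is unclear, hence 6. A visual encoder is implicit in the medical image inputs (5). Unify Models, Tokenizer, Latent Reasoning, and World Models are not presented as core contributions in the abstract, so their relevance is low. The author list contains none of the designated expert authors, so no bonus is applied.

Keywords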

Multimodal Medical Reasoning, Multi-Agent Framework, Indic Languages, Actor-Critic, Tool Grounding, Multilingual Dataset, ArogyaSutra, MLLM

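In-Depth Analysis

Chinese Title: ArogyaSutra：面向印度语言的多模态医疗推理多智能体框架

Summary: The paper addresses the weak reasoning of multimodal large language models in healthcare, particularly for low-resource Indic languages, and proposes the ArogyaSutra framework. It first builds ArogyaBodha, a large-scale multilingual multimodal medical QA dataset drawn from 8 sources and covering 31 body systems, 6 imaging modalities, and 21 clinical domains in English and 7 major Indian languages. It then proposes an actor-critic multi-agent framework that integrates visual tool grounding with a dual-memory mechanism for step-wise, reasoning-aware decision making, and distills from stored actor-critic simulation trajectories. Experiments show improved multilingual medical reasoning accuracy across all Indic languages, and ablations confirm the contribution of each component. The code and dataset are open-sourced.

Innovations:

Formalizes a new task, multimodal medical reasoning in Indic languages, which requires step-by-step clinical reasoning and answers that are medically correct and linguistically consistent.
Builds ArogyaBodha, a large-scale multilingual multimodal medical reasoning dataset covering 7 Indian languages plus English, expert-validated to support reliable evaluation.
Proposes ArogyaSutra, an actor-critic multi-agent framework combining visual tool grounding with dual memory (long-term and short-term) for step-wise reasoning and error correction.
Introduces an adaptive code-switching mechanism: the critic gives feedback in different languages depending on the error type (language error or reasoning error) and updates the memory modules for iterative refinement.
Distills from stored actor-critic simulation trajectories to improve reasoning stability and linguistic consistency in low-resource languages.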

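Methodology: The paper uses a multi-agent actor-critic framework. The actor is a multimodal language model that processes medical images and Indic-language queries, calls lightweight tools (zoom/crop, edge detection, depth analysis, region detection) to extract clinical evidence, and predicts intermediate semantic reasoning steps. The dual-memory mechanism works as follows: long-term memory summarizes earlier reasoning steps, linguistic formulations, and identified errors; short-term memory captures the most recent prediction errors and feedback. The critic evaluates the medical correctness and linguistic consistency of the actor's output and gives targeted corrective feedback (in English for language errors, in the corresponding Indic language otherwise). The feedback updates the memory modules and the loop iterates. Dataset construction: data is collected from 8 medical sources (benchmark datasets, medical imaging repositories, and Indian postgraduate medical exam questions), questions are generated few-shot with GPT-4o-mini, filtered with Gemini-2.5-Flash, and finally validated by medical experts.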
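The control flow of this loop can be sketched in a few lines. The sketch is illustrative only: `DualMemory`, `actor_critic_loop`, the error-type strings, and the toy actor and critic are all hypothetical names, and real actors and critics would be model calls. The one rule taken from the description above is the feedback language: English for language errors, the query's Indic language otherwise.

```python
from dataclasses import dataclass, field

@dataclass
class DualMemory:
    long_term: list = field(default_factory=list)   # running record of past answers and error types
    short_term: list = field(default_factory=list)  # only the most recent critic feedback

    def update(self, answer, feedback, keep_recent=2):
        self.long_term.append({"answer": answer, "error": feedback["error_type"]})
        self.short_term = (self.short_term + [feedback])[-keep_recent:]

def feedback_language(error_type, query_language):
    """Language errors are corrected in English, other errors in the query language."""
    return "English" if error_type == "language" else query_language

def actor_critic_loop(actor, critic, query, query_language, max_rounds=3):
    """Iterate actor -> critic until the critic reports no error or rounds run out."""
    memory = DualMemory()
    answer = None
    for _ in range(max_rounds):
        answer = actor(query, memory)
        error_type = critic(query, answer)  # None, "language", or "reasoning"
        if error_type is None:
            break
        feedback = {"error_type": error_type,
                    "language": feedback_language(error_type, query_language)}
        memory.update(answer, feedback)
    return answer, memory

# Toy stand-ins: the actor revises once it sees feedback; the critic rejects the first draft.
def toy_actor(query, memory):
    return "revised" if memory.short_term else "draft"

def toy_critic(query, answer):
    return "reasoning" if answer == "draft" else None

answer, memory = actor_critic_loop(toy_actor, toy_critic, "query", "Hindi")
```

The stored `memory.long_term` records are the kind of trajectory data that could later be used for distillation, though the format here is invented for the example.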

Key Results:

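The ArogyaBodha dataset contains 40,857 samples (5,107 in English and roughly 5,100 in each Indic language).
ArogyaSutra outperforms strong multimodal baselines in all Indic languages, improving both reasoning accuracy and multilingual alignment.
Ablations confirm the contribution of each component, including tool grounding, dual memory, and actor-critic iteration.
Back-translation analysis (cosine similarity and BLEU4) is used to check translation quality, preserving clinical meaning and linguistic fidelity.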

Tech Stack:

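Multimodal large language model (MLLM) as the actor
GPT-4o-mini for few-shot question generation and error detection
Gemini-2.5-Flash for question filtering (clinical consistency assessment)
Lightweight visual grounding tools: zoom/crop, edge detection, depth analysis, region detection
Dual-memory mechanism: long-term memory (summaries of past steps and errors) and short-term memory (recent errors and feedback)
Actor-critic architecture (Actor-Critic)
Back-translation evaluation: cosine similarity, BLEU4
Distillation from stored actor-critic simulation trajectories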

Strengths:

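Targets multimodal medical reasoning in low-resource Indic languages, filling an important gap with clear social value.
The dataset is large, drawn from diverse sources, and expert-validated.
Tool grounding and dual memory enable step-wise reasoning and error correction in a transparent, interpretable way.
The adaptive code-switching mechanism handles both language and reasoning errors, improving multilingual stability.
The experiments are thorough, covering cross-language, cross-modality, and cross-domain evaluation plus ablations.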

Limitations:

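The framework relies on external tools and on GPT-4o-mini as an error detector, which may add latency and cost.
The dataset covers only 7 Indian languages and omits other low-resource languages (such as Urdu and Sanskrit).
Actor-critic iteration may increase inference time; real-time use needs optimization.
Correctness of the medical reasoning depends on expert validation, and the limited number of experts may affect generalization.
No comparison with recent world-model or unified multimodal model approaches; these are mentioned in the relevance analysis but not explored.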

Relevance To Keywords:

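原生多模态大模型 (native multimodal large models): The actor is a multimodal large language model that processes images and text, which falls under the native multimodal category.
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): The framework generates step-by-step reasoning as well as answers, combining understanding and generation.
表征学习 (representation learning): Tool grounding and dual memory help the model learn richer visual and linguistic representations.
世界模型 (world models): Not directly addressed, although medical reasoning requires clinical world knowledge and can be seen as an implicit world model.
强化学习 (reinforcement learning): Actor-critic is a classic reinforcement learning architecture, used here for iterative refinement and distillation.
后训练 (post-training): Distillation from actor-critic simulation trajectories acts as post-training that improves model performance.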

9. Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach PASS

Score: 76.5 / 35.2

Authors: Ruichao Mao, Zhou Fang, Teng Guo, Hao Yang, Yaping Li, Shaohua Peng, Maji Huang, Xiaoyu Lin, Shuoyang Liu, Xuepeng Li, Yuyu Zhang, Hai Rao

Published: 2026-06-11

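TL;DR: The paper proposes the UXBench benchmark for evaluating the UI reasoning ability of multimodal large models, and builds UI-UX, a reinforcement-learning-based model that achieves state-of-the-art performance on the benchmark.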

摘要翻译

以可用性、感知一致性和功能清晰度为核心的用户体验 (UX) 是真实用户界面 (UI) 的根本基础。多模态大语言模型 (MLLMs) 在用户界面领域的应用正在迅速发展，涵盖视觉元素定位、图形用户界面 (GUI) 代理以及设计到代码生成等方面。然而，基于 UI 截图评估用户体验 (UX) 的研究尚不成熟。为此，我们提出了 UXBench，一种新颖的多模态基准，包含 2,000 个 VQA 数据样本，旨在评估多模态大语言模型 (MLLMs) 执行基于 UI 推理的能力。UXBench 包含 8 个基于真实 UI 截图的任务，要求对布局关系、视觉层次和内容一致性等方面的用户体验 (UX) 问题进行细粒度诊断。我们对主流多模态大语言模型 (MLLMs) 的广泛评估表明，它们在基于 UI 推理的能力上仍存在根本性局限。研究结果凸显了该领域进一步发展的必要性。为弥合这一差距，我们提出了 UI-UX，一种基于 Qwen3-VL-4B-Thinking 基础模型的多模态大语言模型 (MLLM)，并通过强化学习进行了增强，包含两项关键创新：一项是在推理过程中动态平衡感知理解与逻辑推理的奖励路由机制，另一项是抑制冗余或不足推理步骤的非对称转换奖励。实验表明，UI-UX 在 UXBench 上实现了最先进 (SOTA) 性能，准确率达到 0.7963，超越了 Claude-4.5-Sonnet 的 0.6550，同时在多样化的 UI 任务上展现出强大的泛化能力，并保持较低的推理延迟。

Abstract

User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI). The application of multimodal large language models (MLLMs) in the field of user interfaces is evolving rapidly, such as visual element grounding, graphical user interface (GUI) agents, and design-to-code generation. However, research efforts on evaluating UX based on UI screenshots are still immature. To address this, we propose UXBench, a novel multimodal benchmark consisting of 2,000 VQA data samples designed to assess MLLMs' ability to perform UI-based reasoning. UXBench includes 8 tasks based on real-world UI screenshots that require fine-grained diagnosis of UX issues across layout relationships, visual hierarchy, and content consistency. Our extensive evaluation of mainstream MLLMs shows that they remain fundamentally limited in their capacity for UI-based reasoning. The results underscore the need for further advancements in this area. To bridge this gap, we propose UI-UX, an MLLM based on Qwen3-VL-4B-Thinking foundation model and enhanced via reinforcement learning with two key innovations: a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning during inference, and an asymmetric transition reward that suppresses redundant or insufficient reasoning steps. Experiments demonstrate that UI-UX achieves state-of-the-art (SOTA) performance on UXBench, attaining an accuracy of 0.7963 -- surpassing Claude-4.5-Sonnet's 0.6550 -- while exhibiting strong generalization across diverse UI tasks and maintaining low inference latency.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	10.0/10	15.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	4.0/10	6.0
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	6.0/10	9.0

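Scoring rationale: The paper centers on applying multimodal large models (MLLM) to user experience (UX) evaluation, so MLLM and MultiModal score highest (10.0). Reasoning is involved, so Latent Reasoning and Agentic Reasoning are somewhat relevant (6.0 to 7.0). Reinforcement learning is used to optimize the model, but it is not typical model-based RL, hence a moderate score (4.0). Visual Encoder and Unify Models are implicit components of an MLLM and score moderately (5.0). Tokenizer and World Models are not directly involved and score low (2.0). The author list contains none of the designated experts, so no bonus is applied.

Keywords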

Multimodal LLMs, User Experience, User Interface, Benchmark, Reinforcement Learning, UI Reasoning, Reward Mechanism

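In-Depth Analysis

Chinese Title: 基于多模态大模型的移动用户体验推理：任务、基准与方法

Summary: The paper focuses on user experience (UX) reasoning for mobile user interfaces (UI), noting that existing benchmarks concentrate on visual perception tasks and do not evaluate causal UX reasoning. The authors propose UXBench, presented as the first benchmark dedicated to evaluating the UX reasoning of multimodal large models (MLLMs) in UI scenarios, with 2,000 samples based on real UI screenshots covering 8 fine-grained diagnostic tasks across three dimensions: usability, efficiency, and trustworthiness. Experiments show that mainstream MLLMs are limited at UX reasoning. To close the gap, the authors propose the UI-UX model, built on Qwen3-VL-4B-Thinking and optimized with reinforcement learning, introducing a reward routing mechanism and an asymmetric transition reward to dynamically balance perception and reasoning and to suppress redundant reasoning. UI-UX reaches an accuracy of 0.7963 on UXBench, surpassing Claude-4.5-Sonnet's 0.6550, and shows good generalization and low inference latency.

Innovations:

Proposes UXBench, the first multimodal benchmark for UX reasoning, with 8 fine-grained diagnostic tasks across usability, efficiency, and trustworthiness.
Proposes a reward routing mechanism that dynamically balances perceptual understanding and logical reasoning and adapts to different task types (such as classification, semantic alignment, and grounding).
Introduces an asymmetric transition reward (Asymmetric Transition Reward) that suppresses redundant or insufficient reasoning steps, improving reasoning efficiency and lowering latency.
Optimizes end to end with the GRPO reinforcement learning algorithm on top of Qwen3-VL-4B-Thinking, with no human preference annotation.
Builds a large-scale dataset of real UI screenshots, combining MLLM-assisted annotation with two rounds of manual validation by four senior UX experts to ensure data quality.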

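Methodology: Data construction is multi-stage: UI screenshots and text descriptions are first collected from real application feedback; Gemini-2.5-Pro then filters for relevance, and a fine-tuned Qwen3-VL-2B makes large-scale duplicate decisions; few-shot prompting next assigns fine-grained categories to the samples; finally, four senior UX experts perform two rounds of independent annotation and cross-validation. For training, the model starts from Qwen3-VL-4B-Thinking and is optimized end to end with the GRPO reinforcement learning algorithm, combining reward routing (accuracy, ROUGE-L, hit reward) with the asymmetric transition reward.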
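Reward routing can be read as dispatching each sample to a reward function that suits its task type. The sketch below pairs the three task types mentioned in the analysis with the three reward names (accuracy, ROUGE-L, hit reward); that pairing, the task-type strings, and the point-in-box definition of the hit reward are assumptions, and the asymmetric transition reward is not modeled because the report gives no formula for it.

```python
def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 on whitespace tokens, via longest common subsequence."""
    p, r = prediction.split(), reference.split()
    table = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, p_tok in enumerate(p, 1):
        for j, r_tok in enumerate(r, 1):
            if p_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    lcs = table[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def hit_reward(point, box) -> float:
    """1.0 if the predicted (x, y) falls inside the target box (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0

def routed_reward(task_type: str, prediction, target) -> float:
    """Dispatch to the reward matching the task type (hypothetical routing table)."""
    if task_type == "classification":
        return 1.0 if prediction == target else 0.0
    if task_type == "semantic_alignment":
        return rouge_l_f1(prediction, target)
    if task_type == "grounding":
        return hit_reward(prediction, target)
    raise ValueError(f"unknown task type: {task_type}")

r_cls = routed_reward("classification", "B", "B")
r_sem = routed_reward("semantic_alignment", "a b c", "a c")
r_hit = routed_reward("grounding", (5, 5), (0, 0, 10, 10))
```

Keeping every reward in the range 0 to 1 makes the routed values comparable across task types before they are fed to a policy-gradient method such as GRPO.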

Key Results:

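UI-UX reaches an accuracy of 0.7963 on UXBench, surpassing Claude-4.5-Sonnet's 0.6550.
Mainstream MLLMs (such as GPT-4o and Gemini-2.5-Pro) are limited on UX reasoning, with accuracy generally below 0.70.
UI-UX generalizes well across multiple UI tasks while keeping inference latency low.
Reward routing and the asymmetric transition reward markedly improve reasoning efficiency and diagnostic accuracy.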

Tech Stack:

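Qwen3-VL-4B-Thinking (base model)
GRPO (Group Relative Policy Optimization) reinforcement learning algorithm
Gemini-2.5-Pro (data filtering and classification)
Qwen3-VL-2B (fine-tuned for large-scale duplicate decisions)
ROUGE-L (semantic alignment reward)
Reward routing mechanism (Reward Routing)
Asymmetric transition reward (Asymmetric Transition Reward)
Few-shot prompting (Few-shot Prompting)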

Strengths:

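First systematic definition of the UX reasoning task, filling a gap in existing benchmarks.
A rigorous data pipeline that combines automated annotation with expert validation to ensure quality.
The model design balances reasoning accuracy and efficiency, using reward mechanisms to suppress redundant reasoning.
Results show the model clearly surpasses the strongest existing models on UX reasoning, giving it practical value.
The task definition, data distribution, model architecture, and training method are described in detail, which supports reproducibility.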

Limitations:

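UXBench contains only mobile UI screenshots and does not cover web, desktop, or wearable platforms.
The dataset is relatively small (2,000 samples), which may limit generalization to broader scenarios.
Tasks are two-way or three-way multiple-choice questions, which may not fully reflect the complexity and open-endedness of real UX problems.
The model is built on Qwen3-VL-4B-Thinking, whose base capability may cap final performance.
UX reasoning across different languages or cultural contexts is not evaluated.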

Relevance To Keywords:

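Unify Models: The UI-UX model fuses visual perception with language reasoning, reflecting unified multimodal modeling.
World Models: Through causal reasoning and user-behavior modeling, the model can predict the consequences of UI interactions, in line with the implicit reasoning goals of world models.
Representation Learning: Reinforcement learning refines the model's representation of UI layout, semantics, and interaction logic, improving its ability to discriminate UX problems.
Model-Based RL: The paper uses the GRPO algorithm with reward routing and the asymmetric transition reward; this is model-free reinforcement learning, although the reward design models the reasoning process.
原生多模态大模型 (native multimodal large models): The model is based on Qwen3-VL-4B-Thinking, a native multimodal large model, and is post-trained on top of it.
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): The model both understands UI screenshots and generates reasoning results, reflecting an integrated design.
表征学习 (representation learning): The reward mechanisms steer the model toward more effective UI feature representations for fine-grained UX diagnosis.
世界模型 (world models): The model must reason about the consequences of UI interactions (for example, a pop-up covering a button and making an action irreversible), which reflects modeling of world-state changes.
强化学习 (reinforcement learning): The core training method is GRPO, which optimizes reasoning behavior through reward signals.
后训练 (post-training): The model is post-trained from Qwen3-VL-4B-Thinking, with reinforcement learning further improving UX reasoning.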

10. MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold PASS

Score: 73.5 / 35.2

Authors: Yang Zhou, Ziheng Wang, Yuqin Lu, Haofeng Liu, Jun Liang, Shengfeng He, Jing Li

Published: 2026-06-11

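TL;DR: MoVerse tackles the problem of building an interactive panoramic video world model from a single narrow-field-of-view image, using panoramic diffusion and a Gaussian scaffold to achieve real-time rendering and navigation.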

摘要翻译

我们提出 MoVerse，一种实时视频世界模型，能够从单张窄视场图像构建出可交互导航的场景。这一设定极具挑战性，因为输入仅观测到环境的一小部分，而交互式漫游则需要完整的周围世界、持久几何、可控的相机运动以及时间一致的高保真观测。MoVerse 通过分离世界构建与观测渲染来解决这一问题。它首先利用拓扑感知扩散将输入扩展为重力对齐的 360°全景，在 3D 推理之前补全缺失的视场。随后，它利用全景几何感知残差预测将全景提升为持久的 3D Gaussian 支架，从而生成一个密集且可直接渲染的空间记忆。最后，一个基于 Gaussian 条件的视频渲染器将支架渲染结果沿用户指定的相机轨迹转换为照片级逼真的视频。为了使该渲染器适用于交互场景，我们训练了一个双向扩散教师模型以实现高质量的条件渲染，并将其蒸馏为一个因果自回归学生模型，以实现有界延迟的流式传输。该设计结合了显式 3D 表示的可控性与长程一致性，以及生成视频模型的感知质量。MoVerse 在单张 NVIDIA RTX 4090 GPU 上支持实时场景漫游（约 8 FPS），展示了通过单张图像创建世界并输出交互式视频的实用路径。

Abstract

We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, persistent geometry, controllable camera motion, and temporally coherent high-fidelity observations. MoVerse addresses this problem by separating world construction from observation rendering. It first expands the input into a gravity-aligned 360° panorama with topology-aware diffusion, closing the missing field of view before 3D reasoning. It then lifts the panorama into a persistent 3D Gaussian scaffold using panoramic geometry-aware residual prediction, yielding a dense and directly renderable spatial memory. Finally, a Gaussian-conditioned video renderer translates scaffold renderings along user-specified camera trajectories into photorealistic video. To make this renderer practical for interaction, we train a bidirectional diffusion teacher for high-quality conditional rendering and distill it into a causal autoregressive student for bounded-latency streaming. This design combines the controllability and long-range consistency of explicit 3D representations with the perceptual quality of generative video models. MoVerse supports real-time scene roaming at 8 FPS on a single NVIDIA RTX 4090 GPU, demonstrating a practical path toward single-image world creation with interactive video output.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	10.0/10	15.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	5.0/10	7.5
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	5.0/10	7.5

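Scoring rationale: The core contribution is a real-time video world model (World Models), which earns full marks. The system unifies panorama generation, 3D construction, and video rendering (Unify Models) and performs image-to-video multimodal conversion (MultiModal), so both are highly relevant. It uses diffusion models and a Gaussian scaffold for spatial reasoning (Latent Reasoning) and supports interactive roaming (Agentic Reasoning) and potential model-based RL applications (model-based RL), giving moderate relevance. Visual encoding of the input image is implicit (Visual Encoder), while tokenizers (Tokenizer) and large language models (MLLM) are not explicitly mentioned, so their relevance is low. The author list does not include Yang Shi or any of the other designated experts.

Keywords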

World Modeling, Gaussian Scaffold, Panoramic Diffusion, Real-time Rendering, Interactive Navigation, Video Generation, 3D Representation

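In-Depth Analysis

Chinese Title: MoVerse: 基于全景高斯支架的实时视频世界建模

Summary: The paper proposes MoVerse, a video world model that creates an interactively navigable scene in real time from a single narrow-field-of-view image. Existing methods either rely on explicit 3D representations with limited quality or use implicit video models that suffer from geometric drift. MoVerse addresses this with a three-stage, decoupled architecture. Stage one uses a geometry-aware diffusion model to complete the input image into a gravity-aligned 360° panorama, with circular latent encoding and shift-equivariant generation preserving horizontal topology. Stage two lifts the panorama into a directly renderable 3D Gaussian scaffold, using latitude-aware scale correction and residual prediction in angular-inverse-depth space to form a dense spatial memory. Stage three trains a bidirectional diffusion teacher for Gaussian-conditioned video rendering and distills it into a causal autoregressive student, combining the persistence of the explicit scaffold with the perceptual quality of generative models for low-latency streaming output. The system reaches 8 FPS real-time roaming on a single NVIDIA RTX 4090, showing a practical path from a single image to interactive video.

Innovations:

A three-stage architecture that separates world construction from observation rendering: a reusable Gaussian scaffold is built offline and rendering is streamed online, avoiding a compute bottleneck.
Geometry-aware panorama generation: automatic gravity leveling and topology-aware diffusion (circular latent encoding, shift equivariance) that preserve the horizontal S1 topology of the ERP.
Panoramic Gaussian scaffold: latitude-aware scale correction and residual prediction in angular-inverse-depth space keep Gaussian updates consistent with ERP geometry.
Gaussian-conditioned streaming video rendering: a bidirectional diffusion teacher is distilled into a causal autoregressive student, combining the scaffold's long-range memory with generative quality.
Real-time interactive roaming (8 FPS) from a single narrow-field-of-view image, with support for arbitrary camera trajectories.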

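Methodology: MoVerse uses a three-stage pipeline. Stage I: the input image is automatically leveled, in a differentiable way, to a gravity-aligned view, and a masked latent diffusion model completes the panorama, with circular latent encoding and shift-equivariant generation ensuring horizontal closure. Stage II: depth is predicted from the panorama, 3D Gaussians are initialized by spherical back-projection, latitude-aware scale correction (proportional to cos ϕ) is applied, and residuals are predicted in angular-inverse-depth space to stay consistent with ERP geometry. Stage III: a bidirectional diffusion teacher is initialized from Wan2.1-T2V and learns video generation conditioned on Gaussian renderings; it is then distilled via self-forcing and distribution matching into a causal autoregressive student that streams output using a local KV cache.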
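Stage II's initialization can be illustrated with plain geometry: every ERP pixel maps to a direction on the unit sphere, the Gaussian centre sits at depth times that direction, and the horizontal footprint shrinks with cos ϕ toward the poles. The sketch below is an assumption-laden illustration rather than the paper's code: the axis convention, the function names, and the specific scale formula (depth × angular pixel width × cos ϕ) are chosen for the example; only the cos ϕ proportionality comes from the description above.

```python
import numpy as np

def erp_directions(height: int, width: int):
    """Unit view directions and latitudes for the pixel centres of an ERP image."""
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi    # [-pi, pi)
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi  # top row is near +pi/2
    lon, lat = np.meshgrid(lon, lat)                                # both (H, W)
    dirs = np.stack([np.cos(lat) * np.sin(lon),   # x (right)
                     np.sin(lat),                 # y (up)
                     np.cos(lat) * np.cos(lon)],  # z (forward)
                    axis=-1)
    return dirs, lat

def init_gaussians(depth: np.ndarray, base_scale: float = 1.0):
    """Back-project an ERP depth map to Gaussian centres with latitude-aware scales."""
    h, w = depth.shape
    dirs, lat = erp_directions(h, w)
    centers = depth[..., None] * dirs              # spherical back-projection
    pixel_angle = 2.0 * np.pi / w                  # angular width of one ERP column
    scales = base_scale * depth * pixel_angle * np.cos(lat)  # shrink toward the poles
    return centers.reshape(-1, 3), scales.reshape(h, w)

# A unit-depth sphere sampled on a tiny 4 x 8 panorama.
centers, scales = init_gaussians(np.ones((4, 8)))
```

Without the cos ϕ factor, rows near the poles, where many pixels cover a small solid angle, would be initialized with oversized, heavily overlapping Gaussians.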

Key Results:

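Real-time roaming at 8 FPS on a single NVIDIA RTX 4090.
Generated panoramas are gravity-aligned and topologically consistent, giving complete context for the subsequent 3D stage.
The Gaussian scaffold is directly renderable and provides persistent geometric memory, avoiding scene drift over long trajectories.
The causal autoregressive video renderer keeps latency low while markedly improving perceptual quality and temporal coherence.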

Tech Stack:

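Latent diffusion model (Latent Diffusion)
Circular latent encoding (Circular Latent Encoding)
Shift-equivariant generation (Shift-Equivariant Generation)
3D Gaussian Splatting (3DGS)
Spherical back-projection (Spherical Back-Projection)
Latitude-aware scale correction (Latitude-Aware Scale Correction)
Angular-inverse-depth space (Angular-Inverse-Depth Space)
Wan2.1-T2V (video generation model)
Self-forcing distillation (Self-Forcing Distillation)
Distribution matching (Distribution Matching)
Causal autoregressive modeling (Causal Autoregressive)
Key-value cache (Key-Value Cache)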

Strengths:

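A novel three-stage decoupled architecture that combines the persistence of explicit 3D with the visual quality of implicit generative models.
Generates a complete, roamable world from a single narrow-field-of-view image, which is highly practical.
Real-time performance (8 FPS) meets interactive needs.
The explicit Gaussian scaffold provides stable geometry and avoids the scene drift and boundary discontinuities of video models.
The causal distillation design gives video rendering low-latency streaming ability.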

Limitations:

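Panorama quality depends on the training data (Horizon360), so generalization to unseen scenes may be limited.
Scaffold accuracy is bounded by depth prediction from a single panorama, so geometric errors are possible.
8 FPS is still too low for high-refresh-rate VR applications.
Only static scenes are supported; dynamic objects and lighting changes are not handled.
Multi-view input and incremental scene updates are not discussed.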

Relevance To Keywords:

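世界模型 (world models): Directly relevant; MoVerse builds an interactively navigable 3D world and generates video observations.
表征学习 (representation learning): The Gaussian scaffold is an explicit spatial representation, combined with the implicit representation of the diffusion model.
模型基强化学习 (model-based reinforcement learning): Not directly addressed, but the generated interactive world could be used for agent training.
后训练 (post-training): Distillation is a post-training technique that turns the bidirectional teacher into a causal student.
原生多模态大模型 (native multimodal large models): Not used directly, although the video generation component builds on Wan2.1-T2V.
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): Partly relevant; the model goes from image understanding to video generation.
强化学习 (reinforcement learning): The paper does not involve reinforcement learning.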

11. InterleaveThinker: Reinforcing Agentic Interleaved Generation PASS

Score: 69.5 / 35.2

Authors: Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang, Hongsheng Li

Published: 2026-06-11

TL;DR: InterleaveThinker introduces a multi-agent reinforcement learning pipeline that enables existing image generators to perform interleaved text-image generation, achieving performance comparable to advanced models like GPT-5 on specific benchmarks.

摘要翻译

近期图像生成器在单图像生成与编辑方面已展现出令人印象深刻的照片级真实感（photorealism）和指令遵循能力。然而，受限于其架构，它们无法实现交错生成（text-image sequence），而交错生成在视觉叙事、引导及具身操作等领域具有关键应用。即使是最新的开源统一多模态模型（Unified Multimodal Models, UMMs）在此方面也表现出有限的性能。本文介绍了 InterleaveThinker，这是首个旨在赋予任何现有图像生成器交错生成能力的多智能体流水线。具体而言，我们采用规划器代理（planner agent）来组织图像 - 文本输入序列，并指导图像生成器在每个步骤所需的操作。随后，我们引入批评器代理（critic agent）以评估生成器的输出，识别偏离计划指令的样本，并细化指令以进行重新生成。为实现该流水线，我们构建了 Interleave-Planner-SFT-80k 和 Interleave-Critic-SFT-112k，以执行格式冷启动（format cold-start）。随后，我们开发了 Interleave-Critic-RL-13k，利用 GRPO 强化生成轨迹内的逐步指令修正能力。由于单个交错生成轨迹可能涉及超过 25 次生成器调用，优化整个轨迹在计算上并不切实际。因此，我们提出了精度奖励（accuracy reward）和逐步奖励（step-wise reward），使得单步强化学习（RL）能够有效引导整个生成轨迹。结果表明，InterleaveThinker 在各种图像生成器上均提升了性能。在交错生成基准测试中，其性能达到了与 Nano Banana 和 GPT-5 相当的水平。令人惊讶的是，它还在基于推理的基准上显著增强了基础模型；例如，在 4-step FLUX.2-klein 上，我们在 WISE 和 RISE 指标上观察到了显著的提升。

Abstract

Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual narratives, guidance, and embodied manipulation. Even the latest open-source Unified Multimodal Models (UMMs) exhibit limited performance in this regard. In this paper, we introduce InterleaveThinker, the first multi-agent pipeline designed to endow any existing image generator with interleaved generation capabilities. Specifically, we employ a planner agent to organize the image-text input sequence, instructing the image generator on the required execution at each step. Subsequently, we introduce a critic agent to evaluate the generator's outputs, identify samples that deviate from the planned instructions, and refine the instructions for regeneration. To implement this pipeline, we construct the Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k to perform a format cold-start. Then we develop Interleave-Critic-RL-13k to reinforce the step-wise instruction correction capability within a generation trajectory using GRPO. Since a single interleaved generation trajectory may involve over 25 generator calls, optimizing the entire trajectory is computationally impractical. Therefore, we propose accuracy reward and step-wise reward, allowing single-step RL to effectively guide the entire generation trajectory. The results show that InterleaveThinker improves performance across various image generators. On interleaved generation benchmarks, it achieves performance comparable to Nano Banana and GPT-5. Surprisingly, it also significantly enhances the base model on reasoning-based benchmarks; for example, on 4-step FLUX.2-klein, we observe substantial gains on WISE and RISE.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	7.0/10	10.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	10.0/10	15.0
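Author bonus	-	+5.0	Expert: Manyuan Zhang

Scoring rationale: The paper's core is the use of multiple agents (Planner/Critic) and reinforcement learning (GRPO) for interleaved text-image generation, so 'Agentic Reasoning' is highly relevant (10), 'MultiModal' is highly relevant because of the text-image task (9), and 'model-based RL' is relevant because of trajectory optimization and planning (7). 'MLLM' and 'Unify Models' relate to the multimodal generation background and are somewhat relevant (6 and 5). 'Tokenizer', 'Visual Encoder', 'World Models', and 'Latent Reasoning' are not presented as core contributions in the abstract and have low relevance (1 to 2). The author list includes Manyuan Zhang, which qualifies for the expert bonus.

Keywords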

InterleaveThinker, Agentic Interleaved Generation, Multi-agent Pipeline, Reinforcement Learning, Text-image Sequence, Planner Agent, Critic Agent

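In-Depth Analysis

Chinese Title: InterleaveThinker: 强化智能交错生成

Summary: The paper proposes InterleaveThinker, presented as the first multi-agent framework that gives any fixed image generator strong interleaved generation (text-image sequence) ability. To counter the over-reliance on visual context and step-by-step error accumulation seen in existing unified multimodal models (UMMs) on long-horizon tasks, the framework uses a Planner to predict the full instruction sequence in advance, a Critic to check generated outputs and revise instructions, and a Generator to perform image generation and editing. For training, the authors build a data pipeline that produces three datasets, Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k, used for format cold-start and reinforcement learning respectively. With the GRPO algorithm and a dual-reward strategy (accuracy reward and step-wise reward), single-step reinforcement learning aligns the whole generation trajectory at much lower compute cost. With 4-step FLUX.2-klein as the generator, InterleaveThinker matches Nano Banana and GPT-5 on interleaved generation benchmarks and substantially improves the base model on the reasoning benchmarks WISE (from 0.47 to 0.73) and RISE (from 13.3 to 28.9).

Innovations:

The first multi-agent framework to give any fixed image generator interleaved generation ability, addressing visual over-reliance and step-by-step error accumulation.
A dedicated data pipeline that produces three high-quality datasets (Planner-SFT-80k, Critic-SFT-112k, Critic-RL-13k) for cold-start and reinforcement learning.
A dual-reward strategy (accuracy reward and step-wise reward) that achieves trajectory-level alignment with single-step GRPO, greatly reducing compute.
An unexpectedly large improvement of the base model on reasoning benchmarks, suggesting that multi-agent collaboration helps with complex sequential reasoning.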

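Methodology: The method is a multi-agent pipeline. The Planner (VLM-based) predicts the entire instruction sequence, avoiding intermediate visual feedback; the Generator (e.g. FLUX.2-klein) performs image generation and editing; the Critic (VLM-based) evaluates each step's output, identifies deviations, and revises the instructions. Data construction: trajectories are generated with Gemini 2.5 Pro and Nano Banana Pro and strictly filtered into the three datasets. Training: the Planner and Critic first go through SFT for format cold-start, and the Critic is then trained with GRPO to strengthen step-wise correction. Reward design: the accuracy reward checks whether execution follows the plan, and the step-wise reward evaluates the image quality of each step.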
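GRPO scores each sampled response relative to the other responses in its group rather than against a learned value function. The sketch below shows that group-relative normalization applied to a combined accuracy plus step-wise reward; the function names, the equal weights `w_acc` and `w_step`, and the toy reward values are assumptions for illustration and do not come from the paper.

```python
import numpy as np

def step_reward(followed_plan: bool, image_quality: float,
                w_acc: float = 1.0, w_step: float = 1.0) -> float:
    """Combine an accuracy reward (did the step follow the plan?) with a
    step-wise reward (image quality in [0, 1]). Weights are hypothetical."""
    return w_acc * float(followed_plan) + w_step * image_quality

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four candidate critic outputs sampled for the same single generation step.
rewards = [
    step_reward(True, 0.9),
    step_reward(True, 0.4),
    step_reward(False, 0.7),
    step_reward(False, 0.2),
]
advantages = group_relative_advantages(rewards)
```

Because the group here is formed at a single step, no full trajectory with 25 or more generator calls has to be rolled out per sample, which is the cost saving the analysis describes.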

Key Results:

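On interleaved generation benchmarks, performance is comparable to Nano Banana and GPT-5.
On the reasoning benchmarks, WISE improves from 0.47 to 0.73 and RISE from 13.3 to 28.9.
Generality is confirmed: consistent gains are obtained across several image generators (such as FLUX.2-klein and Qwen-image-Edit).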

Tech Stack:

GRPO (Group Relative Policy Optimization)
FLUX.2-klein (4-step diffusion model)
Qwen-image-Edit
Gemini 2.5 Pro
Nano Banana Pro
SFT (supervised fine-tuning)
Multi-agent framework (Planner-Critic-Generator)

Strengths:

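Highly general: adapts to any existing image generator without modifying the generator itself.
Effectively addresses inherent UMM problems (visual over-reliance, error accumulation).
Rigorous data construction and efficient training (single-step RL achieves trajectory-level alignment).
Clear performance gains, especially on reasoning benchmarks.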

Limitations:

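Training data is generated with external advanced models (such as Gemini 2.5 Pro), which may introduce bias.
The multi-agent pipeline adds inference latency and compute overhead.
The effect of the generator's own quality is not analyzed in depth, so errors may remain in some scenarios.
Only some image generators are tested; generality needs broader validation.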

Relevance To Keywords:

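Unify Models: Relevant. The framework covers image generation and editing in one pipeline, but it is a multi-agent system built around a fixed generator rather than a unified model.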
World Models: 不直接相关。
Representation Learning: 不直接相关。
Model-Based RL: 部分相关，Critic评估可视为基于模型的方法。
原生多模态大模型: 背景相关，本文旨在弥补UMM的不足。
多模态大模型的理解和生成一体化: 相关，本文赋予生成器交错生成能力。
表征学习: 不直接相关。
世界模型: 不直接相关。
强化学习: 高度相关，使用GRPO进行后训练。
后训练: 高度相关，SFT+RL是核心训练方法。

12. Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models [PASS]

Score: 69.0 / 35.2

Authors: Quan Quan

Published: 2026-06-11

TL;DR: 针对传统电力巡检在语义理解和自动化方面的局限，本文提出多模态智能体框架评估基础模型，证明其在感知、推理和工具使用方面的集成能力与局限性。

摘要翻译

配电网对于保障电力输送的可靠性至关重要，然而传统巡检方法在语义理解、泛化能力及闭环自动化方面仍存在局限性。为应对这些挑战，本文提出了一种专门针对配电网缺陷检测的多模态智能体（Multi-Modal Agent）框架。本研究的核心在于将多模态基础模型（Multimodal Foundation Models）作为统一认知引擎进行系统评估。我们严格评估了它们在以下三个关键能力上的综合表现：（1）感知（Perception），即模型需准确识别设备并生成专家级缺陷描述；（2）推理（Reasoning），即模型基于领域知识解释视觉发现以诊断原因、评估严重程度并规划维护策略；（3）工具使用（Tool Usage），即模型作为自主操作者执行动作——例如查询知识库或生成工单——以实现闭环维护。为此，本文构建了一个领域特定的评估数据集及一个综合基准（Benchmark）。实验结果表明了当前基础模型在这三个维度上的优势与局限性，为在高风险工业环境中部署自主智能体提供了实证依据。

Abstract

The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semantic understanding, generalization, and closed-loop automation. To address these challenges, this paper proposes a Multi-Modal Agent framework specifically for power distribution defect detection. Central to this study is the systematic evaluation of multimodal foundation models as unified cognitive engines. We rigorously assess their integrated performance across three critical capabilities: (1) Perception, where the model must accurately identify equipment and generate expert-level descriptions of defects; (2) Reasoning, where the model interprets visual findings to diagnose causes, assess severity, and plan maintenance strategies based on domain knowledge; and (3) Tool Usage, where the model acts as an autonomous operator to execute actions -- such as querying knowledge bases or generating work orders -- to achieve closed-loop maintenance. To support this evaluation, a domain-specific evaluation dataset and a comprehensive benchmark are developed. Experimental results demonstrate the strengths and limitations of current foundation models in these three dimensions, providing empirical evidence for deploying autonomous agents in high-stakes industrial environments.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	4.0/10	6.0
Agentic Reasoning	1.5	8.0/10	12.0

评分理由: 论文核心在于多模态大模型（MLLM）在电力缺陷检测中的应用评估，因此 MultiModal (9.0) 和 MLLM (8.0) 相关性最高。框架涉及智能体自主操作，故 Agentic Reasoning (8.0) 相关。摘要提到'统一认知引擎'，与 Unify Models (6.0) 有一定关联。视觉感知和推理任务隐含了 Visual Encoder (4.0) 和 Latent Reasoning (4.0) 的概念，但未深入架构。Tokenizer (2.0)、World Models (2.0) 和 model-based RL (3.0) 在摘要中未明确提及或不是重点。作者仅为 Quan Quan，未包含指定的专家列表，故无额外加分。加权总分为 69.0，高于动态及格分 35.2。
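
The weighted total in the score table can be reproduced with a few lines of arithmetic. The sketch below copies the relevance values and the pass threshold from the table for this paper; the `bonus` parameter is a hypothetical placeholder for the expert bonus, whose size the report does not state.

```python
WEIGHT = 1.5           # every keyword in the table uses the same weight
PASS_THRESHOLD = 35.2  # dynamic pass score shown next to the total

# Relevance values (out of 10) copied from the table for paper 12.
relevance = {
    "Unify Models": 6.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 4.0,
    "World Models": 2.0,
    "MLLM": 8.0,
    "MultiModal": 9.0,
    "model-based RL": 3.0,
    "Latent Reasoning": 4.0,
    "Agentic Reasoning": 8.0,
}


def weighted_total(scores, weight=WEIGHT, bonus=0.0):
    """Sum weight * relevance over all keywords, plus an optional bonus."""
    return sum(weight * value for value in scores.values()) + bonus


total = weighted_total(relevance)
passed = total >= PASS_THRESHOLD
print(total, passed)
```

The per-keyword products match the last column of the table (for example 1.5 × 9.0 = 13.5 for MultiModal), and their sum is the 69.0 reported in the Score line.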

关键词

Multi-Modal Agents, Foundation Models, Power Distribution Defect Detection, Perception, Reasoning, Tool Usage, Autonomous Agents, Evaluation Benchmark

深度分析

Chinese Title: 面向配电缺陷检测的多模态智能体：基础模型评估

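Summary: The paper proposes a multi-modal agent framework for power distribution network defect detection, targeting the limits of traditional inspection in semantic understanding, generalization, and closed-loop automation. It systematically evaluates multimodal foundation models as unified cognitive engines on three capabilities: perception (accurately identifying equipment and producing expert-level defect descriptions), reasoning (using domain knowledge to diagnose causes, assess severity, and plan maintenance), and tool usage (autonomously querying knowledge bases, generating work orders, and similar actions to close the maintenance loop). To support the evaluation, the authors build a domain-specific dataset and a comprehensive benchmark. The experiments expose the strengths and weaknesses of current foundation models along the three dimensions and provide empirical evidence for deploying autonomous agents in high-stakes industrial environments.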

Innovations:

提出了面向配电缺陷检测的多模态智能体评估框架，聚焦感知、推理和工具使用三大核心能力，实现可复现的性能对比。
构建了多任务、多维度基准，包含用于评估感知和推理的多模态数据集以及用于测试端到端工具执行能力的复杂巡检场景。
系统评估了不同智能体架构和核心模型在配电巡检任务中的表现，揭示了基础模型的优势与局限，为技术选型和优化提供实证指导。
将检索增强生成（RAG）与领域知识库结合，增强推理阶段的逻辑推断能力并缓解幻觉风险。
设计了严格的提示工程策略，通过角色定义、任务约束和少样本示例将通用基础模型适配到专业电力领域。

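Methodology: The paper first designs a multi-modal agent framework with a single foundation model (an integrated VLM-LLM) as the core cognitive engine. It takes multimodal input (high-resolution images plus natural-language instructions) and produces two kinds of output (natural-language descriptions plus JSON-structured commands). Prompt engineering (role definition, task constraints, few-shot examples) adapts the model to the power domain. Perception relies on vision-language alignment for equipment identification and semantic description. Reasoning uses retrieval-augmented generation (RAG) to retrieve knowledge from a "power equipment defect grading standard" and a "historical defect case library" before drawing inferences. Tool usage lets the agent call external tools (such as knowledge-base queries and work-order generation) to complete closed-loop operations. For evaluation, a domain-specific dataset and benchmark test perception, reasoning, and tool usage along multiple dimensions.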
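
To make the JSON-structured command output concrete, here is a minimal sketch of how an agent runtime might route such a command to a tool. The command schema (`tool`, `arguments`), the two tool functions, and their return values are all assumptions made for illustration; the report does not give the paper's actual schema or API.

```python
import json


def query_knowledge_base(defect_type: str) -> dict:
    # Hypothetical stand-in for a lookup in a defect-grading knowledge base.
    return {"tool": "query_knowledge_base", "defect_type": defect_type, "hits": []}


def create_work_order(equipment: str, severity: str) -> dict:
    # Hypothetical stand-in for a work-order system.
    return {
        "tool": "create_work_order",
        "equipment": equipment,
        "severity": severity,
        "status": "created",
    }


# Registry of the tools the agent is allowed to call.
TOOLS = {
    "query_knowledge_base": query_knowledge_base,
    "create_work_order": create_work_order,
}


def dispatch(model_output: str) -> dict:
    """Parse a JSON command emitted by the model and run the named tool."""
    command = json.loads(model_output)
    name = command.get("tool")
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    return TOOLS[name](**command.get("arguments", {}))


order = dispatch(
    '{"tool": "create_work_order", '
    '"arguments": {"equipment": "insulator", "severity": "high"}}'
)
```

Rejecting tool names outside the registry is one simple way to keep a hallucinated command from reaching a real system, which matters in the closed-loop setting the paper targets.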

Key Results:

当前基础模型在感知任务中能够生成专家级缺陷描述，但在细粒度设备识别和罕见缺陷泛化上仍有局限。
推理能力通过RAG增强后，模型能结合领域知识进行缺陷评级和维修计划制定，但复杂因果推理仍存在幻觉风险。
工具使用能力使智能体能够自主执行知识库查询和工单生成，但在长序列工具调用和动态环境适应性上表现不足。
不同模型架构（如LLaVA、Qwen-VL等）在三个维度上各有优劣，没有单一模型在所有任务上全面领先。
提示工程和少样本示例显著提升了模型在电力领域的专业性和输出格式规范性。

Tech Stack:

视觉语言模型（VLM）：CLIP、BLIP系列、InstructBLIP、Flamingo、LLaVA、MiniGPT-4、Qwen-VL等
大语言模型（LLM）：作为智能体核心推理引擎
检索增强生成（RAG）：结合领域知识库（缺陷评级标准、历史案例库）
提示工程：角色定义、任务约束、少样本示例（few-shot）
评估指标：BLEU、ROUGE、Pass@k、Elo评分、POPE（幻觉检测）、CHAIR、成功率（SR）、MRR、NDCG、USI、PRM、HCAPO、TTFT、吞吐量等
工具调用：JSON结构化命令、API接口、知识库查询、工单生成

Strengths:

首次系统性地评估多模态基础模型在配电缺陷检测中的感知、推理和工具使用能力，填补了工业场景评估框架的空白。
构建了领域专用数据集和基准，支持可复现的对比实验，具有实用价值。
将检索增强生成（RAG）引入推理阶段，有效缓解了通用模型的幻觉问题，提升了领域专业性。
设计了清晰的智能体架构，将视觉感知与决策执行闭环连接，为自主巡检提供了可行方案。
实验覆盖多种主流模型和架构，结果具有广泛参考意义。

Limitations:

数据集规模可能有限，未公开具体数量，泛化性有待验证。
评估主要基于离线静态场景，未涉及真实动态环境中的实时交互和鲁棒性测试。
工具使用能力仅测试了知识库查询和工单生成等有限动作，未涵盖更复杂的物理操作（如无人机控制）。
未深入探讨模型在极端光照、遮挡等恶劣条件下的感知退化问题。
缺乏与人类专家在相同任务上的直接对比，难以量化智能体与人类水平的差距。

Relevance To Keywords:

Unify Models: 论文评估的多模态基础模型（VLM-LLM）属于统一模型范畴，但未涉及理解与生成一体化。
World Models: 论文未直接涉及世界模型，但推理中的RAG可视为部分环境建模。
Representation Learning: 感知能力依赖视觉-语言表征学习，但论文未深入探讨表征学习机制。
Model-Based RL: 论文未涉及强化学习或基于模型的RL，智能体行为基于提示和规则而非学习。
原生多模态大模型: 论文评估的LLaVA、Qwen-VL等属于原生多模态大模型，相关性高。
多模态大模型的理解和生成一体化: 论文中模型同时具备理解（感知、推理）和生成（描述、工单）能力，但未强调一体化训练。
表征学习: 感知部分依赖表征学习，但论文未重点分析。
世界模型: 不直接相关。
强化学习: 不直接相关。
后训练: 论文未涉及后训练技术，仅评估预训练模型。

13. Real-Time Execution with Autoregressive Policies [PASS]

Score: 67.5 / 35.2

Authors: Sangkyu Lee, Seohyeon Park, Tackgeun You, Avi Caciularu, Idan Szpektor, Hwasup Lim, Youngjae Yu

Published: 2026-06-11

TL;DR: 本文通过调整 tokenization horizon 和约束解码，实现了视觉 - 语言 - 动作模型中自回归策略的实时执行，其任务完成速度显著优于流匹配策略。

摘要翻译

由异步推理实现的实时执行，能够确保平滑的动作轨迹和快速的反应性，对于大规模视觉 - 语言 - 动作模型（Vision-Language-Action, VLA）的实际部署至关重要。然而，近期关于实时执行的研究主要集中于扩散策略（diffusion policies）的变体，尽管对于自回归策略（autoregressive policies）而言这更为关键，因为它们在同步推理（synchronous inference）下的轨迹展开速度较慢。相比之下，我们证明自回归策略可以通过调整分词范围（tokenization horizon）并应用约束解码（constrained decoding）来实现实时执行，从而保证严格的延迟界限（latency bounds），使得多轨迹解码（multi-trajectory decoding）成为可能，以最大化性能。在模拟环境和真实环境中，我们发现自回归策略始终优于同等水平的流匹配策略（flow-matching policy），同时通过同步推理显著提高了任务完成速度。结合自回归策略固有的优势，例如更快的收敛速度和更好的指令跟随泛化性，这些结果证实了自回归策略可以作为一种支持实时执行的有竞争力的策略类型。

Abstract

Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for realistic deployments of large-scale Vision-Language-Action models. However, recent work on real-time execution primarily focuses on variants of diffusion policies, even though it is more critical for autoregressive policies given their slower rollout speed in synchronous inference. In contrast, we demonstrate that autoregressive policies can achieve real-time execution by adjusting the tokenization horizon and applying constrained decoding, thereby guaranteeing strict latency bounds that enable multi-trajectory decoding to maximize performance. Across simulated and real-world environments, we find that the autoregressive policy consistently outperforms its equivalent-level flow-matching policy counterpart while achieving significantly improved task completion speeds from synchronous inference. Coupled with the inherent advantages of autoregressive policies, such as faster convergence and better generalizability in instruction-following, these results confirm that autoregressive policies can remain a competitive policy type supporting real-time execution.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	6.0/10	9.0

评分理由: 论文聚焦 VLA 模型自回归策略的实时执行。Tokenizer（8 分）因 tokenization horizon 调整高度相关；MultiModal（7 分）因多模态本质相关；Agentic Reasoning（6 分）与 Unify Models（5 分）中度相关；MLLM（5 分）与 Latent Reasoning（5 分）有关联；Visual Encoder（4 分）为背景；World Models（2 分）与 model-based RL（3 分）相关性低。无指定专家，加权总分 67.5，高于及格分 35.2。

关键词

Real-Time Execution, Autoregressive Policies, Vision-Language-Action, Tokenization Horizon, Constrained Decoding, Action Trajectories, Instruction-Following

深度分析

Chinese Title: 基于自回归策略的实时执行

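Summary: The paper addresses action pauses caused by inference latency when large-scale Vision-Language-Action (VLA) models are deployed on robots, and proposes a real-time execution method for autoregressive policies. Prior work focuses on asynchronous inference for diffusion policies, while autoregressive policies are slower because they decode sequentially, which makes real-time execution harder for them. The method adjusts the tokenization horizon, applies constrained decoding to guarantee an upper bound on latency, and uses multi-trajectory decoding to exploit otherwise idle compute. Experiments in the LIBERO simulation environment and the DROID real-world environment show higher task success and faster execution than an equivalent-level flow-matching policy, while keeping the fast convergence and instruction-following generalization of autoregressive policies.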

Innovations:

提出将自回归策略适配到实时执行的四步方法：选择足够动作窗口、基于半窗口分词、约束解码保证延迟、多轨迹解码提升性能。
首次系统论证自回归策略在实时执行中的可行性，并证明其可超越同等水平的流匹配策略。
通过调整分词窗口长度（而非完整动作窗口）来控制推理延迟，解决了自回归策略因变长分词导致的延迟不确定性问题。
在真实和模拟环境中验证了自回归策略在实时执行下仍保持快速收敛和指令跟随泛化能力。

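Methodology: An asynchronous inference framework decouples the policy server from the robot controller, and an action queue supplies actions continuously. The recipe has four steps: 1) choose an action horizon H = 2m such that the tokenization horizon m corresponds to an acceptable latency; 2) tokenize each m-step window rather than the full H-step window, which supports conditioning on an action prefix; 3) apply constrained decoding so that the decoding time satisfies d_m ≤ m and the action queue never runs dry; 4) use the remaining idle time for multi-trajectory decoding and execute the best trajectory. Experiments fine-tune the π0-FAST model and compare against baselines such as π0+RTC and π0.5.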
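
The role of the latency bound d_m ≤ m can be illustrated with a small discrete-time simulation of an action queue. This is a simplified model built on assumptions, not the paper's system: time is measured in control ticks, each decode yields exactly m actions after d ticks, the queue holds at most H = 2m actions, and the server starts a new decode whenever the queue has room for another chunk.

```python
def count_stalls(m: int, d: int, ticks: int) -> int:
    """Count control ticks on which the robot finds the action queue empty.

    m: actions per decoded chunk (tokenization horizon, in ticks)
    d: ticks needed to decode one chunk
    """
    queue = m          # assume one chunk is ready at start
    remaining = None   # ticks left on the in-flight decode, None when idle
    stalls = 0
    for _ in range(ticks):
        # Start decoding when there is room for another chunk (queue <= m,
        # so the queue never exceeds H = 2m).
        if remaining is None and queue <= m:
            remaining = d
        if remaining is not None:
            remaining -= 1
            if remaining == 0:
                queue += m
                remaining = None
        # The controller consumes one action per tick, or stalls.
        if queue > 0:
            queue -= 1
        else:
            stalls += 1
    return stalls


stalls_within_bound = count_stalls(m=4, d=4, ticks=200)  # d <= m
stalls_over_bound = count_stalls(m=4, d=6, ticks=200)    # d > m
```

Under these assumptions, a decode that finishes within m ticks always lands before the queue empties, while a slower decode leaves the controller without actions on some ticks, which is the pause the method is designed to avoid.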

Key Results:

在LIBERO和DROID数据集上，π0-REALFAST在实时执行中任务成功率显著优于π0+RTC（流匹配策略）。
执行速度（rollout speed）相比同步推理大幅提升，接近更先进的π0.5模型。
自回归策略在实时执行下仍保持快速收敛和更好的指令跟随泛化能力。
通过调整分词窗口长度，自回归策略可实现与扩散策略相当的推理延迟（约70ms）。

Tech Stack:

自回归策略（Autoregressive Policy）
离散余弦变换（DCT）分词（FAST+ tokenizer）
约束解码（Constrained Decoding）
多轨迹解码（Multi-trajectory Decoding）
动作队列（Action Queue）
异步推理（Asynchronous Inference）
动作分块策略（Action Chunking）
π0-FAST模型
LIBERO模拟环境
DROID真实环境

Strengths:

首次将自回归策略成功应用于实时执行场景，填补了该领域空白。
方法简单有效，仅需微调即可实现，无需重新训练模型。
在多个环境（模拟+真实）中验证了泛化性和鲁棒性。
系统分析了自回归策略在实时执行中的延迟特性，并提出了针对性的解决方案。

Limitations:

依赖预训练的自回归策略模型（如π0-FAST），对从头训练的模型适用性未验证。
约束解码和多轨迹解码增加了实现复杂度，可能对计算资源有额外要求。
实验主要基于桌面操作任务，在更复杂动态环境中的表现有待进一步验证。
未与最新的扩散策略变体（如实时扩散策略）进行充分对比。

Relevance To Keywords:

Unify Models: 论文涉及视觉-语言-动作模型的统一，属于多模态大模型在机器人领域的应用。
World Models: 论文未直接涉及世界模型，但异步推理和动作队列可视为对环境的隐式建模。
Representation Learning: 通过DCT分词和自回归策略学习动作表征，属于表征学习范畴。
Model-Based RL: 论文未使用强化学习，但多轨迹解码和动作规划与模型预测控制（MPC）思想相关。
原生多模态大模型: 论文基于π0-FAST（原生多模态模型）进行实时执行优化。
多模态大模型的理解和生成一体化: 论文中的VLA模型同时处理视觉、语言理解和动作生成。
表征学习: 动作分词和约束解码涉及动作表征的离散化和压缩。
世界模型: 论文未明确使用世界模型，但异步推理依赖对环境的预测。
强化学习: 论文未使用强化学习，但多轨迹解码可视为一种规划方法。
后训练: 论文通过微调（fine-tuning）实现实时执行，属于后训练阶段优化。

14. Proprioceptive-visual correspondence enables self-other distinction in humanoid robots [PASS]

Score: 67.5 / 35.2

Authors: Yurun Chen, Tianyuan Gao, Yizhong Ge, Shikun Ban, Yizhou Wang, Hongkai Xiong, Wenjun Zeng, Wentao Zhu

Published: 2026-06-11

TL;DR: 该论文提出人形机器人可通过本体感觉-视觉对应关系无需标签即可区分自我与他人，从而建立预测性身体模型以支持运动规划等下游任务。

摘要翻译

自我与他人的区分是社会智能的先决条件，然而越来越多与人共享工作空间的人形机器人仍缺乏这种能力。本文展示了一种人形机器人可以通过本体感觉 - 视觉对应关系学习自我 - 他人区分，而无需任何身份标签或运动学模型。一旦建立，这种区分将引导出一个预测性自我模型，该模型将关节构型映射到三维身体占据，捕捉机器人身体随动作的变化情况。在涉及人类或形态相同机器人的多智能体场景中，该系统能可靠地识别自身，学习三维自我模型，并支持下游任务，包括目标到达、避障运动规划以及人 - 机器人运动重映射。综上所述，这些结果为在共享物理环境中与他人协同行动的机器人实现身体自我表征指明了一条路径。项目页面：https://euron-zc.github.io/humanoid-self-model/。

Abstract

Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with humans still lack this ability. Here we show that a humanoid robot can learn self-other distinction from proprioceptive-visual correspondence, without any identity labels or kinematic models. Once established, this distinction bootstraps a predictive self-model that maps joint configurations to three-dimensional body occupancy, capturing how the robot's body changes with action. In multi-agent scenes involving humans or morphologically identical robots, the system reliably identifies itself, learns a 3D self-model, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. Together, these results outline a route toward bodily self-representation in robots that act and coordinate alongside others in shared physical environments. Project page: https://euron-zc.github.io/humanoid-self-model/.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	8.0/10	12.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	7.0/10	10.5
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	6.0/10	9.0

评分理由: 论文核心在于利用本体感觉与视觉的对应关系构建人形机器人的自我模型。MultiModal (8.0) 和 World Models (8.0) 评分最高，因为论文本质是多模态学习且构建了预测性的身体世界模型。model-based RL (7.0) 相关，因为利用内部模型进行碰撞感知运动规划。Unify Models (5.0) 和 Latent Reasoning (5.0) 中度相关，涉及模态统一与无标签隐式学习。Visual Encoder (6.0) 涉及视觉特征提取。Tokenizer (0.0) 和 MLLM (0.0) 无关，因不涉及文本或语言模型。作者列表中未包含 Yang Shi 等指定专家。

关键词

Proprioceptive-visual correspondence, Self-other distinction, Humanoid robots, Predictive self-model, Body occupancy, Motion planning, Multi-modal learning

深度分析

Chinese Title: 本体感觉-视觉对应使能人形机器人的自我-他人区分

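Summary: The paper proposes a framework that lets a humanoid robot distinguish itself from others and learn a predictive self-model using only the correspondence between proprioceptive signals and visual observations, with no identity labels or kinematic model. Inspired by developmental psychology, the framework first performs self-supervised self-other distinction through temporal contrastive learning (proprioception and vision from the same frame match; those from different frames do not). It then uses the resulting self masks to train a conditional neural occupancy field that maps joint configurations to 3D body-occupancy probabilities. In simulation and real-world experiments on a 29-degree-of-freedom humanoid, the system reaches at least 99.5% distinction accuracy in both human-robot and robot-robot scenes, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting. The work offers a route for robots to acquire body knowledge from experience rather than from manual specification.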

Innovations:

提出无需身份标签或运动学模型、仅依赖本体感觉-视觉时间对应性的自监督自我-他人区分方法。
将自我区分与自我建模解耦，先区分后建模，解决了“先有鸡还是先有蛋”的耦合问题。
利用注意力机制融合各候选掩码与本体感觉的相似度，而非简单平均，显著提升了区分鲁棒性。
学习无运动学先验的神经占据场作为预测性自我模型，可泛化到不同姿态并支持多种下游任务。
在形态完全相同的双机器人场景中验证了方法不依赖外观差异，仅凭本体感觉-视觉对应即可区分。

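Methodology: A self-supervised contrastive learning framework. Two encoders embed the proprioceptive state and each candidate body mask into a shared space; a similarity is computed for every pair, and attention-based fusion turns these into a frame-level score. The contrastive loss exploits a temporal asymmetry: proprioception matches the corresponding mask from the same frame and does not match masks from other frames, so no labels are needed. The distinguished self masks then supervise a neural occupancy field conditioned on joint configuration, which predicts the probability that any 3D point belongs to the robot's body. Training data come from multi-agent scenes collected in simulation and in the real world, in both human-robot and robot-robot settings.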
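
The same-frame versus different-frame contrast can be written as an InfoNCE-style loss. The sketch below is a generic version under stated assumptions: embeddings are plain Python lists standing in for encoder outputs, the temperature of 0.1 is arbitrary, and the attention-based fusion over candidate masks is omitted, so each frame contributes a single visual embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(proprio, visual, temperature=0.1):
    """Mean InfoNCE loss; frame i's proprioception should match frame i's mask."""
    n = len(proprio)
    total = 0.0
    for i in range(n):
        logits = [cosine(proprio[i], visual[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all frames (positives + negatives).
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / n


# Two frames with orthogonal embeddings (toy data, not from the paper).
proprio_embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(proprio_embeddings, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce(proprio_embeddings, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each frame's proprioceptive embedding sits closest to the visual embedding from the same frame, and high when the pairing is swapped, which is the signal that lets training proceed without identity labels.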

Key Results:

在人类-机器人场景中，自我区分准确率达99.69%，在机器人-机器人场景中达99.50%，远超VLM基线（最高77.02%）。
训练后，同一帧内本体感觉与自我掩码的余弦相似度显著高于与其他机器人掩码或不同帧掩码的相似度。
t-SNE投影显示，训练后本体感觉嵌入与自我掩码嵌入在空间中聚集，与其他掩码分离。
学习到的自我模型可准确预测3D身体占据，并成功用于目标到达、碰撞避免运动规划和人类到机器人运动重定向。
在真实世界人类-机器人交互中，框架保持鲁棒，无需重新训练。

Tech Stack:

对比学习（Contrastive Learning）
注意力机制（Attention Mechanism）
神经占据场（Neural Occupancy Field）
余弦相似度（Cosine Similarity）
t-SNE降维（t-SNE Projection）
ResNet或类似视觉编码器
MLP或Transformer编码器（用于本体感觉）
Unitree G1人形机器人（29自由度）
固定外部RGB摄像头

Strengths:

完全自监督，无需任何人工标注或运动学模型，具有强可扩展性。
方法简洁有效，在形态相同和不同的干扰者场景下均表现优异。
将认知科学原理（时序对应性）成功迁移到机器人学习，具有跨学科启发性。
支持从区分到建模再到下游任务的完整闭环，实用性高。
在真实世界和仿真中均验证了鲁棒性，实验设计全面。

Limitations:

依赖固定外部摄像头，未考虑机器人自身视角或移动摄像头的情况。
当前仅处理单帧静态图像，未利用视频时序连续性进一步优化。
自我模型为占据场形式，可能难以直接用于精细操作或接触力预测。
在非常复杂或遮挡严重的多智能体场景中，掩码提取可能成为瓶颈。
未与基于运动学模型的方法进行定量比较，缺乏基线对比。

Relevance To Keywords:

Unify Models: 论文未直接涉及统一模型，但其自监督框架可视为感知与运动表征的统一学习。
World Models: 学习的自我模型（神经占据场）是一种局部世界模型，预测身体与环境的交互。
Representation Learning: 核心贡献之一是通过对比学习学习本体感觉与视觉的联合表征。
Model-Based RL: 自我模型可用于规划（如碰撞避免运动规划），属于模型预测控制范畴，与基于模型的强化学习相关。
原生多模态大模型: 论文对比了VLM基线，但方法本身不依赖大模型，而是轻量级自监督学习。
多模态大模型的理解和生成一体化: 不直接相关，但本体感觉-视觉对应可视为一种多模态对齐。
表征学习: 高度相关，论文核心是学习本体感觉与视觉的对应表征。
世界模型: 相关，自我占据场可视为身体与世界交互的预测模型。
强化学习: 间接相关，自我模型可辅助强化学习中的状态估计和规划。
后训练: 不直接相关，论文方法为从零开始的在线或离线学习。

15. LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck [PASS]

Score: 67.5 / 35.2

Authors: Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin, Wei Yuan, Fan Yang, Tingting Gao, Hebei Li, Xiaoyan Sun

Published: 2026-06-11

TL;DR: LaME 提出一种基于信息瓶颈的潜在空间推理方法，用于高效多模态嵌入，在保持竞争力的同时将推理速度提升了 60 倍。

摘要翻译

推理驱动的通用多模态嵌入通过将思维链（CoT）推理引入嵌入流程而迅速发展。尽管该范式在通用及复杂任务上均表现出强大性能，但仍存在两个核心局限性：（i）自回归思维链推理带来高昂的计算成本，使其难以适用于低延迟检索；（ii）嵌入性能与思维链标注质量高度耦合，导致大规模训练不可靠。这引发了根本性问题：文本形式的思维链是嵌入推理的最佳形式吗？有效的嵌入推理能否在潜在空间中完成？为此，我们提出 LaME（潜在推理多模态嵌入），该模型将面向嵌入的潜在推理形式化为弱监督信息瓶颈。LaME 采用 K 个可学习推理令牌作为固定容量瓶颈，并在单次前向传播中完成所有推理过程。这两种弱监督信号在结构上解耦了对比目标与自回归目标，消除了对思维链标注的依赖，同时两阶段训练流程确保了稳定收敛。在 MMEB-v2 和 MRMR 上的实验表明，LaME 实现了具有竞争力的性能，超越了部分基于显式思维链的模型，同时推理速度比显式思维链方法快 60 倍，比潜在基线快 2 倍，且吞吐量与判别式嵌入模型相当。代码将开源。

Abstract

Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	10.0/10	15.0
Agentic Reasoning	1.5	3.0/10	4.5

评分理由: 论文核心贡献在于潜在空间推理（Latent Reasoning）与多模态嵌入（MultiModal），故得分最高；涉及多模态学习框架，故 MLLM 和 Visual Encoder 有一定相关性；未涉及强化学习（model-based RL）或世界模型（World Models），相关性极低；Token 仅指内部推理 token 而非分词器架构，Agentic Reasoning 未涉及代理任务。加权总分为 67.5 分，远超动态及格分 35.2 分。

关键词

Latent Reasoning, Multimodal Embedding, Information Bottleneck, Weakly Supervised, Reason Tokens, Inference Efficiency, Single Forward Pass

深度分析

Chinese Title: LaME：通过信息瓶颈在潜在空间中学习多模态嵌入的思考

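Summary: The paper proposes LaME (Latent Reasoning Multimodal Embedding) to address two problems of explicit Chain-of-Thought (CoT) reasoning in multimodal embedding: high computational cost and dependence on CoT annotation quality. LaME formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck: K learnable reason tokens act as a fixed-capacity bottleneck, and all reasoning completes in a single forward pass. Two weak supervision signals from a decoding head and an embedding head structurally separate the contrastive objective from the autoregressive objective and remove the dependence on CoT annotations; a two-stage training pipeline (bottleneck warm-up, then joint optimization) ensures stable convergence. On the MMEB-v2 and MRMR benchmarks, LaME achieves competitive performance, with inference 60x faster than explicit CoT methods and 2x faster than iterative latent baselines, and throughput close to that of discriminative embedding models.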

Innovations:

提出将嵌入导向的潜在推理建模为弱监督信息瓶颈，完全解耦内部思考与显式CoT格式，在单次前向传播中完成推理。
设计双头瓶颈监督机制，结构性地分离对比监督和生成监督，消除对CoT注释的依赖。
采用两阶段训练流程（瓶颈预热+联合优化），确保瓶颈稳定收敛。
通过将潜在推理限制在预填充令牌中，实现60倍于显式CoT方法的推理加速，2倍于潜在迭代基线。

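Methodology: LaME appends K learnable reason tokens to the MLLM input as an information bottleneck and obtains their hidden states in a single forward pass. Supervision is dual-headed: a decoding head decodes the retrieval target (a predefined answer and keywords) from the first K_r latent tokens and serves as a reasoning probe, while an embedding head aggregates the remaining K_e latent tokens into the embedding and optimizes the retrieval objective with a contrastive loss. Training has two stages: stage one freezes the MLLM backbone and optimizes only the reason tokens and the supervision heads (bottleneck warm-up); stage two unfreezes all parameters except the visual encoder and adds extra embedding tokens for target contrastive supervision (joint optimization).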
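
The split of the K reason-token hidden states between the two heads can be sketched in a few lines. The report does not say how the embedding head aggregates its tokens, so the mean pooling and L2 normalization below are assumptions, as are the function name `split_and_pool` and the toy hidden states.

```python
import math


def split_and_pool(hidden_states, k_r):
    """Split K reason-token states into probe tokens and a pooled embedding.

    hidden_states: list of K vectors, one per reason token
    k_r: number of leading tokens routed to the decoding head
    """
    probe_tokens = hidden_states[:k_r]       # input to the decoding head
    embedding_tokens = hidden_states[k_r:]   # the remaining K_e tokens
    dim = len(embedding_tokens[0])
    # Assumed aggregation: mean pooling followed by L2 normalization.
    mean = [
        sum(token[d] for token in embedding_tokens) / len(embedding_tokens)
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return probe_tokens, [x / norm for x in mean]


# K = 4 toy hidden states of dimension 2, with K_r = 2 and K_e = 2.
states = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0], [3.0, 4.0]]
probe, embedding = split_and_pool(states, k_r=2)
```

Because both heads read from the same fixed set of K tokens produced in one forward pass, no autoregressive decoding is needed at retrieval time; only the pooled embedding is used.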

Key Results:

在MMEB-v2和MRMR基准上达到与显式CoT方法竞争的性能，超越部分显式CoT模型。
推理速度比显式CoT方法快60倍，比潜在迭代基线快2倍。
吞吐量接近判别式嵌入模型，单次前向传播仅需8个推理令牌。
两阶段训练有效稳定瓶颈收敛，避免随机初始化导致的早期训练不稳定。

Tech Stack:

多模态大语言模型（MLLM）作为骨干网络
信息瓶颈（IB）原理
可学习推理令牌（reason tokens）
因果注意力机制（causal attention）
对比学习（InfoNCE损失）
自回归解码头（轻量级解码器）
两阶段训练策略（瓶颈预热+联合优化）
MMEB-v2和MRMR基准数据集

Strengths:

完全消除对显式CoT注释的依赖，训练更可靠。
单次前向推理，计算效率极高，适合低延迟检索场景。
双头监督结构清晰，分离对比与生成目标，避免目标冲突。
两阶段训练确保瓶颈稳定收敛，提升训练鲁棒性。
在保持高性能的同时大幅提升推理速度。

Limitations:

推理令牌数量K为超参数，可能对性能敏感，需要调优。
两阶段训练增加训练复杂度，需要额外预热阶段。
当前仅在两个基准上验证，泛化性有待更多场景测试。
与显式CoT方法相比，潜在推理的可解释性较差。
可能对复杂多模态推理任务（如长程依赖）存在能力上限。

Relevance To Keywords:

Unify Models: LaME将多模态理解与生成统一在潜在推理框架中，与统一模型方向高度相关。
World Models: 潜在推理可视为世界模型的一种简化形式，通过瓶颈压缩输入并预测目标。
Representation Learning: 核心目标是通过信息瓶颈学习高效的多模态嵌入表示。
Model-Based RL: 潜在推理与模型预测控制有相似之处，但本文未直接涉及RL。
原生多模态大模型: 使用MLLM作为骨干，属于原生多模态大模型的微调应用。
多模态大模型的理解和生成一体化: 双头监督同时涉及理解（嵌入）和生成（解码），体现一体化思想。
表征学习: 直接优化嵌入表示，属于表征学习范畴。
世界模型: 潜在推理可视为对输入-目标映射的内部建模，与世界模型概念相关。
强化学习: 本文未使用强化学习，但潜在推理与RL中的隐状态推理有潜在联系。
后训练: 采用两阶段微调策略，属于后训练技术。

16. VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models [PASS]

Score: 64.5 / 35.2

Authors: Ruiqi Xian, Yuehan Xian, Jing Liang, Xuewei Qi, Dinesh Manocha

Published: 2026-06-11

TL;DR: VISA 通过引入 VLM 指导的实例语义审计，显著提升了 3D 占据世界模型在稀有类别上的检测性能，且推理阶段无需 VLM。

摘要翻译

语义 3D 占据为自动驾驶和机器人决策提供了体素化的世界状态，但物体和稀有类别的错误可能影响自由空间解释、碰撞检测和时序状态传播。我们表明，一种常见的 VLM（视觉语言模型）策略，即将 3D 体素或物体特征与裁剪 - 标题嵌入对齐，提高了文本空间相似性，但并未可靠地提高闭集占据 mIoU。鉴于这种不匹配，我们提出了 VISA，一种用于现有占据世界模型的训练时语义审计方法。VISA 对每个物理物体实例的代表性裁剪查询一个离线 VLM，获得包含类别假设、可能混淆、可靠性、属性和证据的结构化审计，并沿物体轨迹传播该审计。该审计被锚定至匹配的 3D 物体体素，并通过可靠性加权分类法、属性因子和场景级审计图损失蒸馏为语义 logits，而推理保持不变且无需 VLM。在 nuScenes 数据集上，三次运行平均后，VISA 将 OccWorld 的 mIoU 从 19.06 提升至 20.05，将 GaussianWorld 的 mIoU 从 21.36 提升至 21.91；在 GaussianWorld 上，物体 mIoU 从 18.18 提升至 19.16，稀有类别 mIoU 从 15.60 提升至 16.79。这些结果表明，VLM 作为感知可靠性的语义审计器比作为通用的标题嵌入目标更适合闭集占据任务。

Abstract

Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	9.0/10	13.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	4.0/10	6.0
Agentic Reasoning	1.5	2.0/10	3.0

评分理由: 论文核心贡献在于利用 VLM 对 3D 占据世界模型进行语义审计。World Models (9.0) 和 MLLM (8.0) 高度相关，因标题和摘要均明确提及世界模型与视觉语言模型。MultiModal (8.0) 涉及 3D 视觉与语言信息的融合。Unify Models (5.0) 结合了 VLM 与占据模型，但未形成统一架构。Tokenizer (1.0) 和 Visual Encoder (3.0) 仅为 VLM 内部组件，非本文贡献。model-based RL (3.0) 为背景应用方向，非方法核心。Latent Reasoning (4.0) 涉及语义推理，Agentic Reasoning (2.0) 与本文感知任务无关。作者列表中未包含指定的 Yang Shi 等专家，故无加分。

关键词

3D Occupancy, World Models, VLM, Semantic Auditing, Rare-class, Distillation, Autonomous Driving

深度分析

Chinese Title: VISA：基于VLM引导的实例语义审计用于3D占据世界模型

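Summary: The paper proposes VISA, a training-time semantic auditing method that improves the semantic predictions of existing 3D occupancy world models. It first shows that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, raises text-space similarity but does not reliably improve closed-set occupancy mIoU. VISA instead queries an offline VLM on a representative crop of each physical object instance to obtain a structured audit (class hypotheses, plausible confusions, reliability, attributes, and evidence), propagates the audit along the object track, and grounds it only to the matched 3D object voxels. During training, the audit is distilled into the semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses; inference requires no VLM. On nuScenes, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU, with gains of roughly 1.0 to 1.4 mIoU on object and rare classes. The authors conclude that VLMs suit closed-set occupancy better as reliability-aware semantic auditors than as generic caption-embedding targets.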

Innovations:

诊断了通用VLM监督在占据任务中的失效模式：开放词汇描述/特征对齐优化文本空间目标，但与封闭集体素级语义预测弱耦合。
提出VISA训练时语义审计方法，将VLM裁剪图像理解转化为可靠性感知的封闭集分类学和视觉因子监督，替代通用描述嵌入。
提出审计到占据的物理实例接地：审计随时间与同一物体关联，仅应用于匹配的3D物体体素，并通过场景级审计图组织共现物体。
验证VISA作为现有占据世界模型的训练时监督，在nuScenes上显著提升语义mIoU，且推理时无VLM开销。

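Methodology: VISA has four components. 1) Offline instance audit generation: a VLM processes the crop of each object instance and returns a structured audit tuple (class hypotheses, plausible confusions, reliability, attributes, evidence). 2) Track-to-voxel grounding: the audit is associated with the track of the same physical object and applied only to the matched set of 3D object voxels. 3) Reliability-weighted taxonomy and attribute-factor distillation: weighted losses distill the audit into the voxel semantic logits. 4) Scene-level audit graph regularization: a graph-structured constraint is imposed on co-occurring audited objects. During training the VISA losses are optimized jointly with the standard occupancy loss; at inference the original model is unchanged.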
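
One plausible form of reliability-weighted distillation is a soft-label cross-entropy scaled by the audit's reliability. The report does not give the paper's loss formulas, so the sketch below is an assumed form for illustration: `audit_loss`, the soft target distribution, and the scalar reliability in [0, 1] are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def audit_loss(voxel_logits, target_probs, reliability):
    """Reliability-scaled soft cross-entropy, averaged over an object's voxels.

    voxel_logits: per-voxel class logits for the voxels matched to one object
    target_probs: class distribution from the audit (hypothesis + confusions)
    reliability: audit confidence in [0, 1]; 0 switches the term off
    """
    total = 0.0
    for logits in voxel_logits:
        log_probs = log_softmax(logits)
        total += -sum(p * lp for p, lp in zip(target_probs, log_probs))
    return reliability * total / len(voxel_logits)


# Audit says: mostly class 0, plausibly confused with class 1 (toy values).
audit_target = [0.8, 0.2, 0.0]
agreeing = audit_loss([[4.0, 1.0, 0.0]], audit_target, reliability=0.9)
disagreeing = audit_loss([[0.0, 1.0, 4.0]], audit_target, reliability=0.9)
ignored = audit_loss([[0.0, 1.0, 4.0]], audit_target, reliability=0.0)
```

Scaling by reliability means a low-confidence audit contributes little gradient, which is one way to limit the effect of VLM mistakes on the occupancy model.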

Key Results:

在nuScenes数据集上，三次运行平均，VISA将OccWorld的语义mIoU从19.06提升至20.05（+0.99），GaussianWorld从21.36提升至21.91（+0.55）。
GaussianWorld上，物体类mIoU从18.18提升至19.16（+0.98），稀有类mIoU从15.60提升至16.79（+1.19）。
OccWorld上，物体类mIoU提升+1.40，稀有类mIoU提升+1.42。
诊断实验表明，通用描述对齐（caption alignment）和语义原型蒸馏（semantic prototype distillation）均无法提升占据mIoU，验证了VLM作为审计器的有效性。

Tech Stack:

VLM（视觉语言模型，如GPT-4V等）用于离线结构化审计
3D占据世界模型（OccWorld, GaussianWorld）
交叉熵损失和Lovasz损失用于标准占据训练
余弦相似度用于特征对齐诊断
可靠性加权分类学损失、属性因子损失、场景级审计图损失
JSON解析用于结构化审计元组提取
物体轨迹匹配与体素接地（3D物体体素集定义）

Strengths:

创新性地将VLM从嵌入目标转为语义审计器，解决了通用VLM监督与封闭集占据任务不匹配的问题。
训练时审计，推理时无额外开销，易于集成到现有占据世界模型中。
结构化审计元组（类别、混淆、可靠性、属性）提供了比自由描述更丰富且可接地的监督信号。
在多个基线模型和物体/稀有类上取得一致且显著的mIoU提升，实验充分。

Limitations:

依赖离线VLM审计，审计质量受VLM能力影响，可能引入噪声或错误。
仅对物体实例进行审计，未覆盖背景、自由空间等区域，可能限制整体场景语义提升。
需要物体轨迹和3D框标注作为输入，增加了训练数据预处理要求。
在nuScenes单一数据集上验证，泛化性需在其他数据集或真实场景中进一步测试。

Relevance To Keywords:

Unify Models: 论文涉及3D占据世界模型与VLM的融合，但VLM仅用于离线审计，未实现统一模型。
World Models: 核心研究对象是3D占据世界模型（OccWorld, GaussianWorld），VISA改进其语义预测。
Representation Learning: VISA通过审计蒸馏学习更鲁棒的体素语义表示。
Model-Based RL: 占据世界模型可用于规划和控制，但论文未直接涉及强化学习。
原生多模态大模型: VLM作为多模态大模型被用作审计器，但非原生集成。
多模态大模型的理解和生成一体化: VLM用于理解（审计），未涉及生成。
后训练: VISA是训练时方法，非后训练阶段。

17. Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback [PASS]

Score: 63.0 / 35.2

Authors: Animesh Tripathy, Aswanth Krishnan

Published: 2026-06-11

TL;DR: 本文提出迭代视觉思考框架，通过视觉反馈闭环和强化学习使视觉语言模型具备空间自纠错能力，无需人工标注即可提升定位精度。

摘要翻译

视觉 - 语言模型（VLMs）实现了强大的单次空间定位，但缺乏观察和纠正自身预测的机制。我们发现，简单地提示 VLM 迭代其预测的渲染可视化会导致灾难性失败：指称表达理解任务上的 [email protected] 从 79.6% 急剧下降至 48.7%（下降了 31 个百分点），揭示了定位能力与自我纠正能力之间存在根本性差距。我们提出迭代视觉思考（IVT），这是一种闭环框架，在该框架中，模型预测一个边界框，观察图像上渲染的预测结果，并通过视觉反馈进行迭代细化。该框架采用一种两阶段训练方案以弥补自我纠正差距：首先，我们利用基础模型自身的预测作为真实错误，并提示教师 VLM 生成纠正性推理轨迹，从而无需人工标注即可生成监督数据；其次，我们应用群体相对策略优化（GRPO）配合简单的交并比（IoU）奖励以稳定多步细化过程。在涵盖 RefCOCOg、Ref-Adv 和 Ref-L4 的混合基准（505 个测试样本）上，基于 IVT 的监督微调（SFT）热身在所有指标上均优于单次基准模型：[email protected] 提升至 82.0%（+2.4 个百分点），[email protected] 提升至 74.1%（+3.2 个百分点），[email protected] 提升至 48.3%（+2.8 个百分点）。GRPO 进一步将每步 IoU 退化降低了 5 倍，从而稳定了细化轨迹。所有训练仅在单个图形处理器（GPU）上使用 2,400 个样本完成，表明空间自我纠正是一种可学习的能力，可以在适度规模下习得。

Abstract

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: [email protected] on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: [email protected] rises to 82.0% (+2.4pp), [email protected] to 74.1% (+3.2pp), and [email protected] to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	8.0/10	12.0

评分理由: 论文核心针对视觉语言模型（MLLM, MultiModal）的空间自纠错能力，故这两项得分最高。提出的闭环反馈机制（预测 - 观察 - 修正）符合代理推理（Agentic Reasoning）的特征，且使用了推理轨迹（Latent Reasoning）。论文未涉及模型统一架构、分词器设计、视觉编码器架构改进或世界模型构建，强化学习部分为策略优化（GRPO）而非模型基强化学习，故相关度较低。作者列表中不包含指定的专家，无额外加分。加权总分 63.0，高于动态及格分 35.2。

关键词

Iterative Visual Thinking, Vision-Language Models, Spatial Self-Correction, Visual Feedback, Group Relative Policy Optimization, Referring Expression Comprehension, Closed-loop Framework

深度分析

Chinese Title: 迭代视觉思维：通过视觉反馈教会视觉语言模型空间自我修正

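Summary: The paper proposes Iterative Visual Thinking (IVT), a framework that gives vision-language models (VLMs) the ability to correct their own spatial predictions. Naively letting a VLM iterate over rendered visualizations of its own predictions causes catastrophic degradation ([email protected] falls from 79.6% to 48.7%), which exposes a fundamental gap between grounding capability and self-correction ability. IVT is a closed loop: the model predicts a bounding box, observes the prediction rendered on the image, and refines it iteratively through visual feedback. A two-phase training recipe closes the gap. First, the base model's own predictions serve as realistic errors and a teacher VLM is prompted to generate corrective reasoning traces, so no human annotation is needed. Second, Group Relative Policy Optimization (GRPO) with a simple IoU reward stabilizes multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: [email protected] rises to 82.0% (+2.4pp), [email protected] to 74.1% (+3.2pp), and [email protected] to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x and stabilizes the refinement trajectory. All training uses only 2,400 samples on a single GPU, showing that spatial self-correction is learnable at modest scale.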

Innovations:

提出IVT闭环空间推理框架，使VLM通过观察自身预测的渲染结果进行迭代自我修正，并揭示VLM原生不具备此能力（性能下降31pp）。
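Proposes a self-referential data synthesis strategy: the base VLM's own spatial predictions serve as step-0 errors, which avoids human-annotated correction trajectories. The paper also finds that using randomly perturbed ground-truth boxes instead leads to a step-0 sandbagging problem during the GRPO stage.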
两阶段训练方案（SFT+GRPO）实现非对称分工：SFT是主要推动力，教授迭代视觉思维结构并驱动所有精度提升；GRPO贡献优化稳定性，将每步IoU退化降低5倍。
仅用2400个训练样本和单GPU即可超越单次基础模型所有指标，证明空间自我修正能力可在小规模下习得。

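Methodology: Training has two phases. Phase one, SFT warm-up: the student model's own prediction is the initial error, a correction trajectory is formed by interpolating toward the ground-truth box, a teacher VLM writes step-by-step reasoning traces, and the student is trained with cross-entropy loss. Phase two, GRPO fine-tuning: starting from the SFT-initialized policy, N = 6 trajectories are sampled per prompt, the reward is the final-step IoU plus a format reward, and the policy-gradient update uses group-relative advantages with KL regularization. At inference, the model produces an initial prediction within a single dialogue and then loops over three operations (render, inject, refine); at each step the previous prediction is rendered as a semi-transparent red box over the original image and fed back as visual feedback.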
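
The IoU reward and the group-relative advantage used by GRPO can be sketched as follows. This is a minimal illustration under assumptions: boxes are `(x1, y1, x2, y2)` tuples, the format reward and the KL term are left out, and the six candidate boxes are made-up values standing in for the final-step predictions of N = 6 sampled trajectories.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Ground truth and six hypothetical final-step boxes on the 0-1000 grid.
ground_truth = (100, 100, 500, 500)
final_boxes = [
    (100, 100, 500, 500),
    (150, 100, 550, 500),
    (300, 300, 700, 700),
    (100, 100, 300, 300),
    (600, 600, 900, 900),
    (120, 110, 510, 490),
]
rewards = [iou(box, ground_truth) for box in final_boxes]
advantages = group_relative_advantages(rewards)
```

Trajectories whose final box beats the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to rank the samples.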

Key Results:

直接迭代导致[email protected]从79.6%降至48.7%（下降31pp），揭示空间自我修正差距。
SFT预热后IVT在所有指标上超越单次基础模型：[email protected] 82.0%（+2.4pp），[email protected] 74.1%（+3.2pp），[email protected] 48.3%（+2.8pp）。
GRPO将每步IoU退化从0.14降至0.03（降低5倍），稳定优化轨迹。
无SFT预热时GRPO产生退化停滞（模型每步复制首次预测）。
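Using the student's predictions rather than randomly perturbed ground-truth boxes as step-0 errors avoids the step-0 sandbagging problem during the GRPO stage.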
全部训练仅使用2400个样本和单GPU。

Tech Stack:

Qwen3-VL-4B-Instruct作为基础VLM
LoRA（低秩适配）微调
4-bit量化（NF4）
组相对策略优化（GRPO）
IoU（交并比）作为奖励函数
交叉熵损失用于SFT
KL正则化防止策略偏移
边界框坐标归一化到[0,1000]整数表示

Strengths:

揭示了VLM在空间自我修正上的根本差距，并提供了可复现的解决方案。
数据合成策略无需人工标注，利用模型自身预测生成真实错误，实用性强。
两阶段训练设计合理，SFT提供结构知识，GRPO提供稳定性，互补效果好。
仅需少量数据（2400样本）和单GPU即可显著提升性能，资源需求低。
在多个基准测试上全面超越单次模型，证明了方法的有效性。

Limitations:

实验仅基于Qwen3-VL-4B模型，未验证在其他VLM上的泛化性。
训练数据量较小（2400样本），可能未覆盖所有错误模式。
GRPO阶段精度略低于SFT单独结果，表明稳定性与精度之间存在权衡。
迭代步骤数固定（T=3），未探索自适应步数或动态停止机制。
仅针对空间定位任务，未扩展到其他视觉推理任务（如视觉问答、目标检测）。

Relevance To Keywords:

Unify Models: 论文使用统一视觉语言模型（Qwen3-VL）进行空间定位与自我修正，体现了多模态模型的统一能力。
World Models: IVT通过渲染预测框作为视觉反馈，使模型能够观察自身预测结果，类似于世界模型中的自我模拟与修正。
Representation Learning: 模型学习如何从渲染的视觉反馈中提取空间错误信息并更新表示，涉及表征学习。
Model-Based RL: 两阶段训练中GRPO属于强化学习，且IVT闭环过程类似于基于模型的强化学习中的规划与修正。
原生多模态大模型: 论文基于原生多模态大模型Qwen3-VL，并扩展其能力。
多模态大模型的理解和生成一体化: 模型同时理解图像和文本，并生成边界框和推理文本，体现理解与生成一体化。
强化学习: GRPO是强化学习方法，用于优化多步修正策略。
后训练: SFT和GRPO均为后训练阶段，在预训练模型基础上进行微调。

18. EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models [PASS]

Score: 63.0 / 35.2

Authors: Vedant Pandya

Published: 2026-06-11

TL;DR: 该论文针对 JEPA 家族世界模型在分布偏移下静态预测器无法适应的问题，提出了一种基于 LoRA 的操作符侧经验调制机制（EPM-JEPA），实验证明其比操作数侧注入更能有效提升模型性能。

摘要翻译

JEPA 家族的世界模型使用静态预测器，其权重在测试时动力学偏离训练分布时不会自适应调整。我们在分布偏移下比较了两种将累积经验纳入 JEPA 预测器的机制：操作数侧注入（EI-JEPA），即将压缩的经验表示作为残差添加到预测器的隐藏状态中；以及算子侧调制（EPM-JEPA），即通过应用于预测器权重的 LoRA，由同一表示生成低秩权重增量。在预注册比较（Moving MNIST，重力偏移）中，EPM-JEPA（D_shift^{n=50} = 0.7848 +/- 0.0078，三个随机种子）与 EI-JEPA（0.8238）的差异为 delta = 4.74% —— 结果 C：一个零结果 —— 根据我们设定的标准，这是一个有效结果。作为一个次要的、非预注册的观察结果，EPM-JEPA 在无记忆基线（0.8000）上提升了 1.90%，且在各个随机种子上一致，而 EI-JEPA 表现逊于基线，表明该收益特定于权重级调制。我们的主要贡献是机制分析：D_shift^{n=50} 轨迹反映了三个独立的动力学过程——缓冲区循环、EMA 目标漂移以及一个内在的 LoRA 稳定瞬态（settling transient，+0.021）——而非收敛至平衡。这些发现启发了 PEM-JEPA，这是一个基于物理的后续模型，旨在解决这一动力学峰值限制。

Abstract

JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74% - Outcome C: a null result - by our stated criterion, a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes - buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021 - rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	10.0/10	15.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	7.0/10	10.5
Latent Reasoning	1.5	8.0/10	12.0
Agentic Reasoning	1.5	2.0/10	3.0

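Scoring rationale: The paper centers on JEPA-family world models (World Models), involving latent-space prediction (Latent Reasoning) and weight modulation under distribution shift, and is closely related to the model-based RL setting. Tokenizer and Agentic Reasoning are not reflected in the abstract, so their relevance is low. The author, Vedant Pandya, is not on the specified expert list, so no expert bonus was applied.

Keywords

JEPA-family world models, Operator-Side Experience Modulation, Distribution Shift, LoRA, Weight Modulation, Moving MNIST, Experience Modulation

In-Depth Analysis

Chinese Title: EPM-JEPA：JEPA系列世界模型中的算子侧经验调制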

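Summary: The predictor weights of JEPA-family world models do not adapt to test-time distribution shift. The paper compares two mechanisms for incorporating accumulated experience into the predictor: operand-side injection (EI-JEPA), which adds a compressed experience representation as a residual to the predictor's hidden state, and operator-side modulation (EPM-JEPA), which generates low-rank weight deltas via LoRA. In the pre-registered comparison on the Moving MNIST gravity-shift task, EPM-JEPA and EI-JEPA differ by δ = 4.74%, below the 5% threshold, so the outcome is a null result (Outcome C). EPM-JEPA nevertheless improves 1.90% over the no-memory baseline, whereas EI-JEPA underperforms it. The main contribution is a mechanism analysis: the performance trajectory is driven by three independent dynamical processes (buffer cycling, EMA target drift, and a LoRA settling transient) rather than by convergence to equilibrium. These findings motivate the follow-up model PEM-JEPA.

Innovations:

Proposes operator-side experience modulation (EPM-JEPA), which generates weight deltas online via LoRA, in contrast to operand-side injection (EI-JEPA).
Adds an online experience buffer, a boundary detector, and attention aggregation to the JEPA framework so that the predictor can adapt to recent distribution history.
Finds that the interaction of LoRA modulation with the EMA target network and VICReg variance regularization yields a performance trajectory with a dynamical peak rather than a stable equilibrium, exposing a structural tension.
Reports the pre-registered null result candidly and frames the mechanism analysis, rather than hypothesis confirmation, as the scientific contribution.

Methodology: A three-track comparison: Track A (vanilla JEPA, no memory), Track B (EI-JEPA, experience residual injection), and Track C (EPM-JEPA, LoRA weight modulation). All tracks share the encoder, the EMA target encoder, the predictor base, and the experience-encoding pipeline. The dataset is Moving MNIST, with training in a zero-gravity world and testing in a shifted world with gravity of 0.5 px/frame². The experience subsystem consists of a boundary detector (EMA statistics of the batch-mean prediction error), a FIFO buffer (capacity 256), a 2-layer Transformer experience encoder, and attention aggregation. The training loss is prediction MSE plus a VICReg variance regularization term (λ = 0.05, γ = 0.75). Hyperparameters were set by sequential tuning in Phase 1.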
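The difference between the two tracks can be made concrete with a minimal NumPy sketch. It is not the authors' implementation: the dimensions, the linear heads `P`, `GA`, `GB`, and the single-layer predictor are all assumptions chosen for illustration. It only shows where the experience vector enters in each case: the operand (hidden state) for EI-JEPA versus the operator (weights) for EPM-JEPA.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden width and LoRA rank (illustrative values)

W = rng.normal(size=(d, d))          # frozen base predictor weight
h = rng.normal(size=d)               # predictor hidden state
e = rng.normal(size=d)               # compressed experience representation

# Hypothetical linear heads mapping the experience vector to each mechanism's parameters.
P = rng.normal(size=(d, d)) * 0.1        # operand side: experience -> residual
GA = rng.normal(size=(r * d, d)) * 0.1   # operator side: experience -> LoRA factor A
GB = rng.normal(size=(d * r, d)) * 0.1   # operator side: experience -> LoRA factor B


def operand_side(W, h, e):
    """EI-style: add the experience as a residual to the hidden state."""
    return W @ (h + P @ e)


def lora_delta(e):
    """EPM-style: build a weight delta of rank at most r from the experience vector."""
    A = (GA @ e).reshape(r, d)
    B = (GB @ e).reshape(d, r)
    return B @ A


def operator_side(W, h, e):
    """Apply the base weight plus the experience-conditioned low-rank delta."""
    return (W + lora_delta(e)) @ h


y_ei = operand_side(W, h, e)
y_epm = operator_side(W, h, e)
delta_rank = np.linalg.matrix_rank(lora_delta(e))
```

In this sketch both mechanisms reduce to the plain predictor `W @ h` when the experience vector is zero, which mirrors the no-memory Track A.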

Key Results:

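Pre-registered test: EPM-JEPA (Track C) and EI-JEPA (Track B) differ by δ = 4.74%, below the 5% threshold, giving a null result (Outcome C).
Secondary observation: EPM-JEPA improves 1.90% over the no-memory baseline (Track A), consistently across three seeds; EI-JEPA underperforms the baseline.
Mechanism analysis: the performance trajectory is driven by three independent dynamical processes, namely buffer cycling (period of about 50 steps), EMA target drift (caused by the cosine schedule), and a LoRA settling transient (an initial peak of +0.021), rather than by convergence to equilibrium.
LoRA modulation is in structural tension with VICReg variance regularization: during the LoRA convergence window the output manifold contracts, which lowers variance and cannot be resolved by tuning λ alone.

Tech Stack:

JEPA (Joint-Embedding Predictive Architecture)
LoRA (Low-Rank Adaptation)
EMA (Exponential Moving Average) target network
VICReg (Variance-Invariance-Covariance Regularization) variance term
Transformer (2 layers, d=64, 2 attention heads)
AdamW optimizer
Cosine-annealing learning-rate schedule (with warm restarts)
Boundary detector (anomaly detection based on EMA statistics)
FIFO experience buffer (capacity 256)
Attention aggregation (single-head soft attention)
Moving MNIST dataset (custom gravity shift)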
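The boundary detector and FIFO buffer listed above could look roughly like the following sketch. Only the buffer capacity (256) and the idea of EMA statistics over the batch-mean prediction error come from the report; the decay, the threshold `k`, the warm-up length, and the exact variance update are assumptions.

```python
from collections import deque


class BoundaryDetector:
    """Flags a step when the batch-mean error leaves an EMA mean/std band."""

    def __init__(self, decay=0.9, k=3.0, warmup=5):   # illustrative values, not from the paper
        self.decay, self.k, self.warmup = decay, k, warmup
        self.mean, self.var, self.n = None, 0.0, 0

    def update(self, err):
        self.n += 1
        if self.mean is None:               # first observation initialises the statistics
            self.mean = err
            return False
        diff = err - self.mean
        # Only flag after a short warm-up so the variance estimate is meaningful.
        is_boundary = self.n > self.warmup and abs(diff) > self.k * self.var ** 0.5
        self.mean += (1 - self.decay) * diff
        self.var = self.decay * (self.var + (1 - self.decay) * diff * diff)
        return is_boundary


buffer = deque(maxlen=256)                  # FIFO: the oldest entry is evicted when full
detector = BoundaryDetector()

errors = [0.10, 0.11, 0.09, 0.10, 0.11, 0.10, 0.60]   # last value mimics a dynamics shift
flags = []
for step, err in enumerate(errors):
    flags.append(detector.update(err))
    buffer.append((step, err))
```

With these toy errors only the final, much larger value is flagged. `deque(maxlen=256)` gives first-in-first-out eviction without extra bookkeeping.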

Strengths:

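Clearly contrasts the two memory-injection routes (operand-side vs. operator-side) with a rigorous experimental design.
Candidly reports the pre-registered null result and stresses the scientific value of the mechanism analysis, which counters publication bias.
Analyzes the dynamic interplay of LoRA, EMA, and VICReg in depth and explains why the performance trajectory is non-equilibrium.
Code and experimental setup are transparent, which eases reproduction and follow-up work.

Limitations:

Validated on a single synthetic task (Moving MNIST gravity shift); generalization is unknown.
The LoRA transient produces a performance peak rather than a sustained gain, so practical use would need an additional mechanism to hold the improvement.
Limited compute (GTX 1050 Ti), which may affect thermal stability on long runs.
Not tested on more complex world models (e.g. V-JEPA) or on real video data.

Relevance To Keywords:

World models: the paper directly studies how JEPA-family world models adapt under distribution shift.
Representation learning: uses JEPA latent-space prediction and VICReg regularization.
Model-based RL: world models are a core component of model-based reinforcement learning, and the online adaptation mechanism has potential applications there.
Post-training: LoRA modulation can be seen as a post-training adaptation method, although the paper emphasizes online rather than offline adaptation.
Multimodal large models: the paper handles vision only, but the JEPA framework can extend to multiple modalities and the experience-modulation idea is general.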

19. SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning [PASS]

Score: 61.5 / 35.2

Authors: Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

Published: 2026-06-11

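TL;DR: SpatialClaw is a training-free framework that uses code as the action interface so that vision-language models can carry out flexible agentic spatial reasoning; it reaches 59.9% average accuracy across 20 benchmarks.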

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	9.0/10	13.5

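Scoring rationale: The paper centers on the agentic behavior (Agentic Reasoning) of vision-language models (MLLM/MultiModal) in spatial reasoning and proposes code as the action interface. It has little connection to Tokenizer, World Models, or conventional reinforcement-learning mechanisms, so the relevance scores are polarized.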
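The score table above can be checked with a few lines of Python. This is a sketch of the arithmetic only: each row's score is weight times relevance, and the total is compared with the passing score on the "Score:" line. That the comparison is `>=`, and that no expert bonus is added here, are assumptions.

```python
WEIGHT = 1.5            # every keyword in the table carries the same weight
PASSING_SCORE = 35.2    # the second number on the "Score:" line

relevance = {           # relevance values out of 10, copied from the table
    "Unify Models": 3.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 4.0,
    "World Models": 2.0,
    "MLLM": 8.0,
    "MultiModal": 8.0,
    "model-based RL": 3.0,
    "Latent Reasoning": 3.0,
    "Agentic Reasoning": 9.0,
}

scores = {kw: WEIGHT * rel for kw, rel in relevance.items()}
total = sum(scores.values())
passed = total >= PASSING_SCORE
print(f"{total:.1f} / {PASSING_SCORE} -> {'PASS' if passed else 'FAIL'}")
# prints "61.5 / 35.2 -> PASS"
```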

Keywords

Spatial Reasoning, Agentic Reasoning, Vision-Language Models, Code as Interface, 3D/4D Reasoning, Training-free Framework, Perception Primitives

In-Depth Analysis

Chinese Title: SpatialClaw：重新思考智能体空间推理的动作接口

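Summary: Spatial reasoning is a fundamental challenge for vision-language models (VLMs). Existing tool-augmented agents assist the VLM with specialist perception modules, but their effectiveness is bounded by the design of the action interface: single-pass code execution commits to a full strategy before any intermediate result is observed, and structured tool calls lack flexibility for composing operations. The paper proposes SpatialClaw, a training-free spatial reasoning framework that uses code as the action interface. SpatialClaw maintains a persistent Python kernel pre-loaded with the input frames and perception/geometry primitives, and the VLM agent writes executable code cells step by step, composing and revising its analysis based on prior outputs. Across 20 spatial reasoning benchmarks, SpatialClaw reaches 59.9% average accuracy, 11.2 points above the most recent spatial agent, with consistent gains across six VLM backbones and no benchmark- or model-specific adaptation.

Innovations:

Redefines the action interface for spatial reasoning agents: code as the action interface replaces single-pass code execution and structured tool calls, enabling flexible, iterative composition.
Persistent kernel workspace: the Python kernel state is kept across steps, so intermediate results (e.g. masks, depth maps) remain available as variables and can be revised step by step.
Training-free, general framework: delivers consistent gains across multiple VLM backbones without fine-tuning for any benchmark or model.
Broad benchmark evaluation: validated on 20 spatial reasoning benchmarks covering static and dynamic 3D/4D tasks.

Methodology: SpatialClaw is a training-free agent loop. For each example it initializes a persistent Python kernel pre-loaded with the input frames, perception modules (e.g. depth estimation, segmentation), and scientific computing libraries (NumPy, SciPy, Matplotlib). It then runs a five-stage loop: a planner produces an analysis plan, the main agent writes and executes Python code cells one at a time, and the feedback (standard output, variable summaries, visualization images) is appended to the model context until the agent submits an answer or reaches the maximum number of steps. No model fine-tuning is involved; the framework relies only on a general system prompt.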
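The core of the loop described above, one cell per step executed in a namespace that persists across steps, can be sketched with the standard library alone. This is not SpatialClaw's code: the "agent" is a fixed list of cells standing in for a VLM, and the depth values are made up. It only illustrates how later cells reuse variables produced by earlier ones and how captured output becomes feedback.

```python
import contextlib
import io


def run_cell(namespace, source):
    """Execute one cell in the shared namespace and capture what it prints."""
    out = io.StringIO()
    with contextlib.redirect_stdout(out):
        exec(source, namespace)
    return out.getvalue()


# Stand-in for the VLM: a fixed script of cells. A real agent would generate
# each cell after reading the context accumulated so far.
scripted_cells = [
    "depths = {'chair': 2.0, 'table': 3.5}\nprint(sorted(depths))",
    "gap = depths['table'] - depths['chair']\nprint(f'gap={gap}')",
    "answer = 'chair' if gap > 0 else 'table'\nprint(answer)",
]

namespace = {}          # persists across steps, like a kernel's global state
context = []            # feedback that would be appended to the model context
for step, cell in enumerate(scripted_cells):
    context.append((step, cell, run_cell(namespace, cell)))
    if "answer" in namespace:       # the agent has committed to an answer
        break
```

A single-pass design would have to emit all three cells at once, before seeing the printed `gap`; the stepwise loop lets the third cell depend on what the second one revealed.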

Key Results:

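59.9% average accuracy across 20 spatial reasoning benchmarks, 11.2 points above the most recent spatial agent, SpaceTools-Toolshed (48.7%).
The largest gains come on dynamic 4D video reasoning and multi-view reasoning, which require chained geometric computation across frames and viewpoints.
Consistent gains across the Qwen and Gemma4 model families (27B to 397B parameters), without any benchmark- or model-specific adaptation.
Ablations show that code as the action interface keeps its advantage even when all predefined tool wrappers are removed.

Tech Stack:

Persistent Python kernel
Perception modules: segmentation (e.g. SAM), depth estimation, camera pose estimation, trajectory estimation
Scientific computing libraries: NumPy, SciPy, Matplotlib
Geometric algorithms: scipy.spatial.KDTree, RANSAC
VLM backbones: Qwen series, Gemma4 series
Code generation and execution loop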

Strengths:

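Novel action-interface design: code offers more flexibility and expressiveness than structured calls and supports complex composition.
General: consistent gains across multiple VLM backbones and a wide range of benchmarks without training or fine-tuning.
Interpretable: visualized intermediate results (e.g. depth maps, masks) make the reasoning process transparent and easier to debug and correct.
Broad evaluation: 20 benchmarks spanning static, dynamic, and multi-view tasks support the method's robustness.

Limitations:

Depends on code generation: performance is bounded by the code quality of the underlying VLM, which may produce incorrect code for complex logic.
Compute overhead: a persistent kernel and repeated code execution may increase inference time and resource use.
No training: being training-free is an advantage, but it may limit the potential for further gains through post-training.
Fixed tool set: the perception modules and libraries are predefined, so entirely new kinds of spatial reasoning tasks may require extending them.

Relevance To Keywords:

Unify Models, World Models, Representation Learning, Model-Based RL: by composing perception and geometry primitives through a code interface, SpatialClaw implicitly builds a spatial world model, which relates to representation learning and world models.
Native multimodal large models; unified understanding and generation in multimodal large models: the framework strengthens the VLM's spatial understanding but does not address unified generation.
Representation learning: intermediate variables (depth, masks) represent spatial features explicitly, in line with representation learning.
Reinforcement learning, post-training: the framework is currently training-free, but post-training or RL could later be used to optimize the code-generation policy.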

20. CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts [PASS]

Score: 60.0 / 35.2

Authors: Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

Published: 2026-06-11

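TL;DR: Existing Granger causal discovery methods struggle to capture distribution shifts and dynamic regime changes. CausalMoE combines multimodal priors with a mixture-of-experts architecture to recover interpretable causal graphs and to generalize in few-shot settings.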

Abstract

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	0.0/10	0.0

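Scoring rationale: The paper's core is Granger causal discovery with a multimodal foundation model, so MultiModal (9.0) and MLLM (8.0) score highest. The model handles latent patterns with a mixture of experts, so Latent Reasoning (7.0) and Unify Models (7.0) are relevant. Visual Encoder is partly relevant as a VLM component (5.0). Tokenizer is not explicitly mentioned (2.0), and the link to World Models is weak (2.0). The paper focuses on causal analysis of time series and is unrelated to reinforcement learning (model-based RL) and Agentic Reasoning (0.0). No author is on the specified expert list, so no bonus was applied.

Keywords

Granger Causal Discovery, Multimodal Foundation Model, Mixture of Experts, Pattern-Routed, LLM and VLM Integration, Causality-Aware Self-Attention, Few-shot Generalization

In-Depth Analysis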

Chinese Title: CausalMoE：面向格兰杰因果发现的十亿级多模态基础模型与模式路由异构专家

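Summary: The paper proposes CausalMoE, a billion-scale multimodal foundation model for Granger causal discovery. It targets the spurious causal graphs that result when existing methods ignore distribution shifts and dynamic regime changes in time series. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts (MoHE) that dynamically identifies latent temporal patterns and routes patches to specialized domain experts, decoupling regime-specific mechanisms from shared dynamics. A Causality-Aware Self-Attention mechanism yields sparse Granger causal graphs via proximal optimization. CausalMoE is also the first to integrate large language models (LLMs) and vision-language models (VLMs) into the causal discovery loop, aligning numerical signals with textual and visual priors to improve causal estimation in complex scenarios. Experiments show a new state of the art on fully supervised benchmarks and effective generalization to few-shot settings where traditional methods fail.

Innovations:

Proposes the Pattern-Routed Mixture of Heterogeneous Experts (MoHE) architecture, which explicitly models patch-level temporal heterogeneity and decouples regime-specific causal mechanisms from shared dynamics.
First to integrate LLMs and VLMs into the Granger causal discovery pipeline, using multimodal semantic priors to resolve causal ambiguities that numerical data alone cannot identify.
Designs a Causality-Aware Self-Attention mechanism combined with proximal optimization to recover sparse, interpretable Granger causal graphs.
Builds a billion-scale multimodal causal foundation model that supports few-shot causal inference and removes the homogeneity assumption that traditional methods depend on.

Methodology: CausalMoE has three modules: (1) multimodal patch encoding, which splits the time series into patches and produces text tokens via prompt templates and visual tokens via image interpolation; (2) patch-specific pattern routing, which dynamically identifies latent temporal patterns and routes each patch to the most suitable heterogeneous expert; (3) a mixture of heterogeneous experts, which integrates domain-specific representations through Causality-Aware Self-Attention and infers a sparse Granger causal graph via proximal optimization. Training uses large-scale multimodal time-series data and jointly optimizes forecasting and causal constraints.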
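The report does not say which proximal operator CausalMoE uses. A common choice for sparse graph recovery is the L1 proximal operator (soft-thresholding), and the sketch below assumes it purely for illustration. The score matrix, step size `lr`, and penalty `lam` are made-up values.

```python
import numpy as np


def soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_1: shrink every entry toward zero by tau."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)


def proximal_step(A, grad, lr, lam):
    """One proximal-gradient step: gradient step on the smooth loss, then the L1 prox."""
    return soft_threshold(A - lr * grad, lr * lam)


# Toy 3-variable score matrix; entry (i, j) scores "variable j Granger-causes variable i".
A = np.array([[0.90, 0.02, -0.40],
              [0.01, 0.80, 0.03],
              [-0.02, 0.50, 0.70]])

# With a zero gradient the step reduces to pure thresholding at lr * lam = 0.05.
A_sparse = proximal_step(A, grad=np.zeros_like(A), lr=1.0, lam=0.05)
graph = A_sparse != 0          # boolean adjacency after thresholding
```

Entries with magnitude at most 0.05 become exactly zero, which is what makes the recovered graph sparse rather than merely small-valued.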

Key Results:

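New state of the art on fully supervised benchmarks, ahead of existing neural Granger causal discovery methods.
Effective generalization in few-shot settings, where traditional methods fail.
Multimodal alignment (LLM/VLM) markedly improves the accuracy and interpretability of the recovered causal graphs.
Pattern routing decouples the causal relations of different temporal regimes and reduces spurious causal links.

Tech Stack:

Granger causality
Mixture of Experts (MoE)
Large language models (LLMs)
Vision-language models (VLMs)
Self-attention
Proximal optimization
Patching (patch encoding)
Bilinear interpolation
Sliding-window strategy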

Strengths:

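First to bring multiple modalities (text, images) into Granger causal discovery, going beyond purely numerical inputs.
Explicitly models temporal heterogeneity, avoiding the spurious causal links caused by a homogeneity assumption.
The billion-parameter foundation model generalizes well in few-shot settings.
Causality-aware attention with proximal optimization yields sparse, interpretable causal graphs.

Limitations:

The model is very large, so training and inference are computationally expensive.
Depends on high-quality multimodal data (text prompts, images), which may be hard to obtain in practice.
Granger causality is not the same as true causality; it requires assumptions such as no unobserved confounders.
Validated only on synthetic and some real benchmarks; generalization to large-scale real-world settings needs further evaluation.

Relevance To Keywords:

Unify Models: the paper proposes a unified multimodal foundation model that fuses numerical, textual, and visual modalities, which is highly relevant to unified models.
World Models: Granger causal discovery can be seen as one way to learn the causal structure of world dynamics, but the paper does not build a world model directly.
Representation Learning: learns disentangled causal representations through MoHE and causality-aware attention.
Model-Based RL: causal models could be used for environment modeling in reinforcement learning, but the paper has no RL application, so relevance is weak.
Native multimodal large models: the paper integrates LLMs and VLMs, an application of multimodal large models to causal discovery.
Unified understanding and generation in multimodal large models: the paper mainly uses multimodal understanding (encoding and alignment) and does not cover generation; relevance is moderate.
Representation learning: as above; the core is learning causal representations.
World models: indirectly relevant; a causal graph can be viewed as part of a world model.
Reinforcement learning: not directly relevant.
Post-training: the paper does not emphasize a post-training stage and focuses on pre-training and fine-tuning; relevance is limited.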

21. Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models [PASS]

Score: 58.5 / 35.2

Authors: Hongbo Wang

Published: 2026-06-11

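TL;DR: The paper gives a computable certificate of the predictable horizon, based on the Lyapunov spectrum, and shows that the structure of equivariant world models provides a more reliable prediction guarantee for model-based reinforcement learning than scale does.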

Abstract

Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, $T_j(ε)\sim\log(1/ε)/λ_j$. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2{=}0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	10.0/10	15.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	8.0/10	12.0
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	4.0/10	6.0

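Scoring rationale: The paper is about the predictability limits of world models (World Models) and explicitly mentions TD-MPC2 and V-JEPA, so it is highly relevant to model-based RL. It analyzes the spectrum of the latent dynamics (Latent Reasoning) to compute the predictable horizon, which is also highly relevant. It does not involve tokenizers, MLLMs, or unified model architectures, so those scores are low. A visual encoder is implicit in V-JEPA but is not a core contribution, and multimodality is not a focus. No author is on the specified expert list (Yang Shi and others), so no bonus was applied.

Keywords

Equivariant World Models, Predictable Horizon, Lyapunov Spectrum, Certified Predictability, Model-Based RL, Latent Dynamics, Symmetry Orbit

In-Depth Analysis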

Chinese Title: 规模换取插值，结构换取预测视界：等变世界模型的可认证可预测性

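Summary: For equivariant latent world models, the paper gives a computable, multi-step certificate of the predictable horizon. It proves that the T-step rollout error of an equivariant model is constant over each symmetry orbit (Theorem A) and is stratified channel by channel by the predictor's Lyapunov spectrum, with horizon T_j(ε) ~ log(1/ε)/λ_j. A key result is the matching lower bound (Proposition 6): approximate equivariance limits the horizon to Θ(log(1/ε)/λ), and only exactly equivariant or conserved channels attain an unbounded horizon. The certificate is exclusive to structure: no non-equivariant model has it at any scale (Lemma 2). Empirically, on the 40-dimensional Lorenz-96 system only the Z_N-equivariant network recovers the full Lyapunov spectrum (R² = 0.98), while dense and recurrent baselines fail. The certificate can audit pretrained world models (e.g. TD-MPC2, V-JEPA 2-AC) without calibration data and can be reproduced at deployment at no extra cost. Scale improves interpolation but does not provide a calibrated horizon.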
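The horizon scaling log(1/ε)/λ quoted above follows from a simple picture: an initial error ε that grows like ε·exp(λT) reaches order one at T = log(1/ε)/λ. The sketch below evaluates that expression per channel. It is an illustration of the formula only, not the paper's certificate; the spectrum values are made up, and treating every channel with λ ≤ 0 as having an unbounded horizon is a simplifying assumption.

```python
import math


def predictable_horizon(eps, lyapunov):
    """Per-channel horizon T_j(eps) ~ log(1/eps) / lambda_j for expanding channels."""
    horizons = []
    for lam in lyapunov:
        if lam > 0:
            horizons.append(math.log(1.0 / eps) / lam)
        else:
            horizons.append(math.inf)   # no exponential error growth in this channel
    return horizons


spectrum = [1.2, 0.4, 0.0, -0.3]    # illustrative Lyapunov exponents, not from the paper
for lam, T in zip(spectrum, predictable_horizon(1e-3, spectrum)):
    print(f"lambda={lam:+.1f}  horizon={T:.2f}")
```

Halving an exponent doubles that channel's horizon, while improving the initial error ε only helps logarithmically, which is one way to read the claim that structure matters more than raw accuracy.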

Innovations:

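Proposes a certified predictable horizon for equivariant world models with matching upper and lower bounds, proving that the horizon under approximate equivariance is limited to Θ(log(1/ε)/λ).
Proves that orbit-constant error is equivalent to equivariance (Lemma 2), so no non-equivariant model has the certificate at any scale.
Identifies an unbounded-horizon guarantee for conserved/invariant channels (the "Noether hinge") and relates it to the Lyapunov spectrum.
Carries the certificate from theory to practice: large pretrained world models (e.g. TD-MPC2, V-JEPA 2-AC) can be audited without training data, and the audit can be reproduced at deployment at no extra cost.
Shows on 40-dimensional Lorenz-96 that the equivariant network recovers the full Lyapunov spectrum while dense and recurrent baselines fail, pointing to structure rather than scale as the deciding factor.

Methodology: The paper combines proofs with experiments. Theory: group representation theory, Lyapunov exponents, and the Oseledets theorem are used to establish the predictable-horizon certificate for equivariant models, including orbit-constant error, a spectral-degradation upper bound, a matching lower bound, and the unbounded horizon of conserved channels. Experiments: equivariant networks are trained on Lorenz-96, a contact simulator, the non-abelian group SO(3), and raw pixels, and compared with dense and recurrent networks; pretrained models (TD-MPC2, LeWM, V-JEPA 2-AC) are audited zero-shot to validate the certificate.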

Key Results:

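The T-step rollout error of an equivariant model is constant over each symmetry orbit (Theorem A).
For approximately equivariant models, the horizon of channel j is Θ(log(1/ε)/λ_j) with a matching lower bound (Proposition 6), so approximate equivariance is provably horizon-limited.
In conserved channels (λ ≤ 0) the error grows linearly rather than exponentially in the number of steps, which gives an unbounded horizon.
On 40-dimensional Lorenz-96, the Z_N-equivariant network recovers the full Lyapunov spectrum (R² = 0.98), while dense and recurrent networks fail (R² < 0).
Auditing the official TD-MPC2 checkpoints (1M to 317M parameters): the certificate is calibrated in strongly expansive channels (ratio 0.94-1.02), optimistic in weakly expansive channels, and correctly abstains in contracting channels.
On V-JEPA 2-AC (1B parameters, real robot data), the measured cross-check correctly overrides an over-promising tangent spectrum.

Tech Stack:

Group representation theory and equivariance
Lyapunov exponents and the Oseledets theorem
Local linearization and spectral decomposition of the Jacobian
Orbit-constant error proofs (Lemma 1, Theorem A)
Matching lower-bound construction (Proposition 6)
Noether's theorem and conserved quantities
Lorenz-96 chaotic system
Pretrained models: TD-MPC2, LeWM, V-JEPA 2-AC
Frame averaging
Numerical experiments (relative error, R², calibration ratio)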

Strengths:

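Theoretical depth: rigorous upper and lower bounds on the predictable horizon, and a proof that equivariance is equivalent to having the certificate.
Practical: the certificate audits large pretrained models without calibration data and can be reproduced at deployment at no extra cost.
Thorough experiments: validated on several systems (chaotic, contact, non-abelian group, pixels) and on real models.
Key insight: scale only improves interpolation; structure (equivariance) is what provides a certifiable horizon.
The Noether hinge: conserved channels have an unbounded horizon, which gives a theoretical basis for long-term prediction.

Limitations:

The certificate relies on a group-symmetry assumption and has limited applicability to systems without clear symmetries.
The theory targets latent-space models; applicability to direct prediction in raw pixel space needs further study.
Experiments ran on 1-2 GPUs and were not validated at larger scale (e.g. multi-GPU training).
The residual ε of approximate equivariance is hard to measure precisely in practice, which may affect the certificate's accuracy.
The audit of non-equivariant models rests on spectral analysis alone and offers no method for improving them.

Relevance To Keywords:

World models: the paper's core is a predictability certificate for equivariant world models; directly relevant.
Representation learning: equivariant encoders are a form of representation learning, and the analysis uses group representation theory.
Model-based RL: the certificate can be used for trust assessment in planning.
Post-training: the zero-shot audit of pretrained models (TD-MPC2 and others) is a form of post-training analysis.
Native multimodal large models: V-JEPA 2-AC is a vision model and the paper uses real robot data, but it does not go into multimodal fusion.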

22. OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data [PASS]

Score: 57.0 / 35.2

Authors: Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

Published: 2026-06-11

TL;DR: OmniDirector introduces a unified framework with a camera grid representation and a hierarchical prompt expansion agent to achieve general multi-shot camera cloning without requiring cross-paired data.

Abstract

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffer from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on a million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	7.0/10	10.5

Scoring rationale: The paper proposes OmniDirector, a unified framework for multi-shot camera cloning, aligning well with the Unify Models and MultiModal keywords. It uses a hierarchical prompt expansion agent, which supports Agentic Reasoning. World Models is moderately relevant because of the camera grid representation. Tokenizer, Visual Encoder, Latent Reasoning, and model-based RL are not core contributions. No author is on the specified expert list, so no bonus points were applied. The weighted total score is 57.0, exceeding the dynamic passing score of 35.2.

Keywords

Camera Cloning, Multi-Shot Generation, Unified Framework, Camera Grid, Diffusion Transformers, Prompt Expansion Agent, Multimodal Control

In-Depth Analysis

Chinese Title: OmniDirector：无需交叉配对数据的通用多镜头相机克隆

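Summary: The paper proposes OmniDirector, a unified video generation framework that clones camera motion from a reference video without cross-paired data. Existing methods either use parametric representations, which cannot handle multi-shot generation, or depend on scarce cross-paired data, which hurts performance. The authors introduce a general camera motion representation, the Camera Grid, which encodes camera parameters as grid motion videos and supports integrating trajectories across multiple shots. OmniDirector is trained on million-scale Camera Grid-video pairs and coordinates characters, actions, and cameras to give director-level control over multimodal diffusion transformers (MMDiT). A hierarchical prompt expansion agent integrates the different control signals by systematically describing camera motion and visual content. Experiments show that the framework clearly outperforms existing methods in the accuracy and controllability of camera motion cloning.

Innovations:

Proposes the Camera Grid representation, which visualizes camera parameters as a grid motion video in an empty scene, giving a general, disentangled, and scalable representation of camera motion.
Builds a million-scale dataset of Camera Grid-video pairs that needs no cross-paired data and draws on internet-scale video.
Designs a Hierarchical Prompt Expansion Agent that, at inference time, fuses camera motion, subject, and object motion into a single text description so that multimodal control signals work together.
First general multi-shot camera cloning without cross-paired data, supporting shot transitions and complex camera trajectories.

Methodology: Camera parameters (rotation matrix R and translation vector t) are first extracted from the reference video. A grid motion video (the Camera Grid) is then rendered in an empty 3D scene; it contains only the grid lines of the spatial coordinate axes, which convey the camera trajectory. A multimodal diffusion transformer (MMDiT) takes the Camera Grid as a visual condition and is trained jointly with text and image conditions. The training data are million-scale Camera Grid-video pairs generated automatically from internet videos. At inference, the hierarchical prompt expansion agent first has a camera prompt generator describe the camera motion at two levels (inter-shot and intra-shot), and then semantically fuses camera motion, subject description, and object motion into one text prompt that is fed to the model.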
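To make the rendering step above more tangible, here is a minimal sketch of projecting a fixed 3D grid through a moving camera, one frame per pose. It is not the authors' renderer: the pinhole intrinsics `K`, the world-to-camera convention `x_cam = R @ x_world + t`, the ground-plane grid, and the toy trajectory are all assumptions. A real pipeline would rasterize grid lines; this only computes where grid vertices land in the image.

```python
import numpy as np


def project_points(points_w, R, t, K):
    """Pinhole projection of world points, using x_cam = R @ x_world + t."""
    cam = points_w @ R.T + t                 # (N, 3) camera-frame coordinates
    in_front = cam[:, 2] > 1e-6              # keep points in front of the camera
    uvw = cam[in_front] @ K.T
    return uvw[:, :2] / uvw[:, 2:3]          # perspective divide -> pixel coordinates


def yaw(theta):
    """Rotation about the vertical (y) axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


# Vertices of a ground-plane grid (y = 1 in a y-down, camera-style world frame).
xs, zs = np.meshgrid(np.arange(-2, 3), np.arange(2, 7))
grid = np.stack([xs.ravel(), np.ones(xs.size), zs.ravel()], axis=1).astype(float)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])              # hypothetical intrinsics

# One projected grid per frame: a slow pan combined with a small translation.
frames = [project_points(grid, yaw(0.02 * k), np.array([0.0, 0.0, -0.05 * k]), K)
          for k in range(4)]
```

Because the grid itself never changes, any frame-to-frame motion of the projected vertices is due to the camera alone, which is the property that makes such a video a content-free carrier of camera motion.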

Key Results:

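OmniDirector accurately clones the camera motion of a reference video, including translation, rotation, zoom, fisheye distortion, and multi-shot transitions.
Robust across content, aspect ratios, and spatial scales, and unaffected by content differences.
Clear gains in camera-cloning accuracy and visual quality over existing parametric methods and methods based on cross-paired data.
The hierarchical prompt expansion agent integrates camera control with other control signals (e.g. subject, action), enabling coordinated creation.

Tech Stack:

Multimodal diffusion transformer (MMDiT)
6-degree-of-freedom (6DoF) camera extrinsics (rotation matrix R, translation vector t)
Plücker coordinates (for camera encoding)
Kannala-Brandt fisheye distortion model
Grid rendering in an empty 3D scene (OpenGL or a similar tool)
Hierarchical prompt expansion agent (LLM-based text generation and fusion)
Million-scale dataset construction (automatic camera-parameter extraction and rendering pipeline)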

Strengths:

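The Camera Grid is a novel representation that is general, disentangled, and scalable, and fits easily with diffusion models.
No expensive cross-paired data: training data are generated automatically from internet videos, which greatly lowers data cost.
Supports multi-shot camera cloning, including shot transitions and complex trajectories, filling a gap left by existing methods.
The hierarchical prompt expansion agent fuses several control signals, improving the semantic consistency and controllability of the generated video.
Performs well across many scenarios and is robust to changes in content, aspect ratio, and scale.

Limitations:

The Camera Grid depends on accurate camera-parameter extraction; for reference videos of textureless or dynamic scenes, the estimates may be inaccurate.
The empty-scene grid conveys only camera motion and cannot encode scene depth or object motion, which may limit modeling of complex interactions.
Training and inference are computationally expensive and need large-scale GPU resources.
Extremely complex camera motion (e.g. rapid shake, nonlinear distortion) may still pose generalization challenges.
The hierarchical prompt expansion agent relies on a language model, which may introduce bias or information loss in the text description.

Relevance To Keywords:

Unify Models: OmniDirector unifies camera control, characters, actions, and scene generation, in line with unified models.
World Models: the Camera Grid can be seen as an abstract model of motion in 3D space, related to spatial representation learning in world models.
Representation Learning: the Camera Grid is a new visual representation that turns camera parameters into grid videos a model can learn from easily.
Model-Based RL: the paper does not involve reinforcement learning directly, but camera control can be viewed as an action policy and could later be combined with model-based RL for interactive video generation.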

23. Modality Forcing for Scalable Spatial Generation [PASS]

Score: 55.5 / 35.2

Authors: Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park

Published: 2026-06-11

TL;DR: The paper proposes Modality Forcing, a scalable post-training method for joint image-depth generation using a single DiT, demonstrating that image generation objectives effectively improve spatial perception accuracy.

Abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	0.0/10	0.0

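Scoring rationale: The paper's core is joint image-depth generation with a single DiT, which fits MultiModal and Unify Models closely. It involves visual encoding (DiT embeddings) and latent-space reasoning (diffusion models), but is unrelated to reinforcement learning (model-based RL), Agentic Reasoning, and language models (MLLM). Tokenizer is a generic component rather than a core contribution, and World Models is only partly relevant (spatial priors).

Keywords

Modality Forcing, Spatial Generation, Image-Depth Generation, Diffusion Transformer, Sparse Depth Data, Scalable Post-training, Spatial Perception, Photorealistic Scenes

In-Depth Analysis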

Chinese Title: 模态强制：面向可扩展空间生成的训练后方法

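Summary: The paper proposes Modality Forcing, a simple and scalable post-training method that draws on the spatial priors of text-to-image (T2I) models to achieve joint and conditional generation of images and depth maps. By assigning separate noise levels to the RGB and depth modalities, a single diffusion transformer (DiT) models the joint distribution and supports conditional generation in any permutation. Depth is diffused in pixel space, so the model can be trained on sparse real-world depth annotations. By training a series of T2I models from scratch, from 370M to 3.3B parameters, the authors show that depth accuracy improves with T2I model size and training data. Applied to FLUX.2-klein-9B, the method is competitive with state-of-the-art monocular depth estimators and reduces absolute relative error (AbsRel) by 57% relative to existing joint generative models. The results indicate that image generation is a scalable pre-training objective for spatial perception.

Innovations:

Proposes Modality Forcing: assigning separate noise levels to RGB and depth lets one model cover joint generation, image-to-depth, and depth-to-image.
Uses pixel-space depth diffusion, so the model can learn from sparse real-world depth annotations instead of depending on dense synthetic data.
Trains T2I models of different sizes from scratch (370M to 3.3B parameters) and is the first to show systematically that depth accuracy improves with T2I model size and training data, supporting T2I as a scalable pre-training objective for spatial perception.
Applied to FLUX.2-klein-9B, depth prediction is competitive with state-of-the-art monocular depth estimators, with AbsRel 57% lower than existing joint generative models.

Methodology: The method is a post-training fine-tuning strategy that extends a pretrained text-to-image diffusion transformer (DiT). During training, the RGB and depth modalities use independent noise levels, each with its own timestep embedder, and a lightweight cross-modal mixing module exchanges noise-level information. Depth is diffused in pixel space, and missing pixels are filled with isotropic Gaussian noise to support sparse annotations. The training data are triplets of text, RGB image, and sparse depth map. At inference, the noise level of each modality is set (t = 0 for a condition, t = 1 for a generation target) to select joint generation, image-to-depth, or depth-to-image. The authors also train DiT models of different sizes from scratch (370M, 1.2B, 3.3B parameters) and compare them at different data scales to verify scalability.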
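The per-modality noise levels described above can be illustrated without any model at all. The sketch below is not the authors' code: the linear data-to-noise interpolation is a common flow-matching path assumed here, the uniform sampling of training noise levels is also an assumption, and the array sizes are toy values. It shows how the pair (t_rgb, t_depth) selects the task and how unlabelled depth pixels are filled with Gaussian noise.

```python
import numpy as np

# Noise level per modality: 0 = clean (used as a condition), 1 = pure noise (to be generated).
MODES = {
    "joint":          {"rgb": 1.0, "depth": 1.0},
    "image_to_depth": {"rgb": 0.0, "depth": 1.0},
    "depth_to_image": {"rgb": 1.0, "depth": 0.0},
}


def noise_to_level(x, t, rng):
    """Linear interpolation between data and Gaussian noise (assumed flow-matching path)."""
    return (1.0 - t) * x + t * rng.normal(size=x.shape)


def fill_missing_depth(depth, valid, rng):
    """Replace unlabelled depth pixels with isotropic Gaussian noise."""
    return np.where(valid, depth, rng.normal(size=depth.shape))


def training_inputs(rgb, depth, valid, rng):
    """Draw an independent noise level for each modality and noise both inputs."""
    t_rgb, t_depth = rng.uniform(size=2)
    depth = fill_missing_depth(depth, valid, rng)
    return noise_to_level(rgb, t_rgb, rng), noise_to_level(depth, t_depth, rng), (t_rgb, t_depth)


rng = np.random.default_rng(0)
rgb = rng.uniform(size=(4, 4, 3))
depth = rng.uniform(size=(4, 4))
valid = rng.uniform(size=(4, 4)) < 0.3          # sparse depth labels

levels = MODES["image_to_depth"]
rgb_in = noise_to_level(rgb, levels["rgb"], rng)         # stays clean: conditioning signal
depth_in = noise_to_level(depth, levels["depth"], rng)   # pure noise: generation target
```

In a full system a loss mask over `valid` would presumably restrict depth supervision to labelled pixels; that detail is not spelled out in the report and is left out here.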

Key Results:

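On FLUX.2-klein-9B, Modality Forcing is competitive with state-of-the-art monocular depth estimators (e.g. Depth Anything V2).
AbsRel is 57% lower than existing joint image-depth generative models (e.g. JointDiT).
Depth accuracy keeps improving as T2I model size (370M to 3.3B) and the number of training images (0 to 1.92 billion) increase.
One model supports joint generation, image-to-depth, and depth-to-image, and its depth-to-image quality is comparable to specialized models.

Tech Stack:

Diffusion Transformer (DiT)
Flow matching objective
v-prediction and x-prediction parameterizations
Pixel-space depth diffusion
Per-modality noise levels
Cross-stream timestep mixing module
VAE (variational autoencoder) for RGB encoding
ODE solver for sampling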

Strengths:

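Simple and scalable: post-training fine-tuning alone is enough to build on a strong pretrained T2I model.
Supports sparse real-world depth annotations, which lowers data-collection cost.
Systematic scaling experiments give strong evidence that the spatial prior of T2I models strengthens with model and data scale.
One model unifies several tasks, reducing deployment and maintenance complexity.
Reaches or approaches state-of-the-art performance on several benchmarks.

Limitations:

Depth is handled only in pixel space, which may limit extension to other spatial modalities (e.g. meshes, point clouds).
Depends on the quality of the pretrained T2I model; with a weak base model the gains are limited.
Experiments focus on depth prediction; generalization to other spatial tasks (e.g. normal estimation, semantic segmentation) is not fully validated.
Training processes RGB and depth together, which is compute-intensive.
Depth-to-image quality may be limited by the sparsity and noise of the depth data.

Relevance To Keywords:

Unify Models: one model unifies image generation, depth estimation, and joint generation.
World Models: by modeling RGB and depth jointly, the model learns spatial geometric priors that help build richer world models.
Representation Learning: the study shows that T2I pre-training is a scalable way to learn spatial representations, with depth accuracy improving with model size.
Model-Based RL: joint RGB and depth generation could provide richer environment representations and planning signals for model-based RL.
Native multimodal large models: the method adds a depth modality to a pretrained T2I model, similar in spirit to native multimodal training but done through post-training.
Unified understanding and generation in multimodal large models: the model supports both depth understanding (I2D) and depth-to-image generation (D2I).
Post-training: the core contribution is a post-training recipe that gives a T2I model spatial perception without training from scratch.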

24. Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality [PASS]

Score: 51.0 / 35.2

Authors: Wei Li, Zhen Huang, Xinmei Tian

Published: 2026-06-11

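TL;DR: The paper proposes MACCO, a framework that strengthens the compositional understanding of vision-language models through cross-modal masked concept modeling, and that also benefits text-to-image generation and multimodal large language models.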

Abstract

Contrastively trained vision-language models like CLIP, have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" behavior--struggling to capture the object relations, attribute-object bindings, and word order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. Additionally, the improved compositionality also benefits text-to-image generation and multimodal large language model. Code is available at https://github.com/hiker-lw/MACCO.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	6.0/10	9.0
Agentic Reasoning	1.5	1.0/10	1.5

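Scoring rationale: The paper's core is improving the compositional understanding of vision-language models, which belongs to the MultiModal area, and it states a benefit for multimodal large language models (MLLM), so both score high. The method masks and reconstructs concepts in latent space, which relates somewhat to Latent Reasoning. A visual encoder is present as a basic VLM component but is not the novelty. The remaining keywords (Tokenizer, World Models, RL, Agentic, Unify Models) are absent from the abstract or only weakly related, so they score low.

Keywords

Cross-Modal Masked Compositional Concept Modeling, Visio-Linguistic Compositionality, Vision-Language Models, Masked Reconstruction, Multimodal Representation Learning, Text-to-Image Generation, MLLM Enhancement

In-Depth Analysis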

Chinese Title: 跨模态掩码组合概念建模以增强视觉-语言组合性

Summary: 本文针对对比学习训练的视觉语言模型（如CLIP）在组合理解上的不足（如“词袋”行为、难以捕捉对象关系、属性绑定和词序依赖），提出MACCO框架。该框架通过在一个模态中掩码组合概念，并利用另一模态的完整上下文信息进行重构，从而更有效地捕获和对齐跨模态组合结构。具体地，文本掩码后利用完整图像特征重构组合概念词，图像掩码后利用完整文本特征重构对应区域。为辅助重构，引入两个辅助目标：掩码增强跨模态对齐损失（MCA）和掩码增强模态内正则化损失（MIR），分别用于对齐和正则化掩码特征。实验在五个组合性基准上验证了有效性，并表明该方法不仅提升了组合性，还增强了模型捕获句法结构和语言信息的能力，同时改善了文本到图像生成和多模态大语言模型的表现。

Innovations:

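Proposes MACCO, a framework that improves compositionality through cross-modal masked reconstruction without explicitly constructing hard negatives.
Designs a parameter-free global-to-local semantic injection operation that enriches local tokens with global contextual semantics.
Introduces two auxiliary objectives: the mask-augmented cross-modal alignment loss (MCA), which promotes cross-modal alignment, and the mask-augmented intra-modal regularization loss (MIR), which prevents representation collapse.
Shows that the improved compositionality transfers to downstream tasks such as text-to-image generation and multimodal large language models.
The framework is compatible with existing hard-negative mining methods and yields further gains when combined with them.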

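Methodology: First, a scene graph parser extracts compositional concepts (relation and attribute phrases) from the text and produces a text mask M_T, and GroundingDINO localizes the corresponding image regions and produces an image mask M_I. The masked tokens are then replaced with learnable mask tokens, positional encodings are added, and the inputs are passed through the image and text encoders to obtain masked and full features. During reconstruction, global-to-local semantic injection is applied to the masked text and image features (the global CLS feature is injected into the local tokens). The text predictor uses two cross-attention layers in which masked text tokens attend to the full image features, followed by a classification head that predicts vocabulary items. The image predictor uses three cross-attention layers with masked image tokens as queries and full text features as keys and values, and reconstructs pixel values. In addition, the global features of masked instances are added to standard contrastive learning (the MCA loss), and the masked global features are regularized within each modality (the MIR loss). Training optimizes only the text encoder (gradients to the image features are stopped), and the predictors are removed at inference.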
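The methodology builds on standard contrastive learning with an InfoNCE loss. The following is a minimal NumPy sketch of the symmetric InfoNCE objective only; it is not the paper's MCA loss. The function name `info_nce`, the temperature value, and the toy data are illustrative assumptions, and the way masked global features enter the loss is not specified here.

```python
import numpy as np

def info_nce(image_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch where row i of each matrix is a matched pair."""
    # L2-normalize so that dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (batch, batch)

    def diagonal_cross_entropy(scores: np.ndarray) -> float:
        # Row-wise log-softmax, with the matched pair on the diagonal.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_prob)))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy check (hypothetical data): aligned pairs should score a lower loss than shuffled pairs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = info_nce(emb, emb)
shuffled_loss = info_nce(emb, np.roll(emb, 1, axis=0))
```

In a MACCO-style setup, the embeddings of masked inputs would presumably be fed through the same kind of objective alongside the unmasked ones, but that wiring is an assumption rather than something stated in the summary.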

Key Results:

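On five compositionality benchmarks (e.g., SugarCrepe and VL-CheckList), MACCO significantly improves the compositional understanding of CLIP and similar models.
The model captures syntactic structure and semantic nuance better and produces more concept-aware embeddings.
It is more robust to semantics-preserving perturbations and better retains fine-grained linguistic information.
Combining it with hard-negative methods (e.g., NegCLIP) yields further gains.
The improved compositionality also improves text-to-image generation (e.g., Stable Diffusion) and multimodal large language models (e.g., LLaVA).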

Tech Stack:

CLIP (ViT image encoder, Transformer text encoder)
Scene graph parser (Wu et al., 2019)
GroundingDINO (object detection and grounding)
Cross-attention
Masked language modeling (MLM) and masked image modeling (MIM)
InfoNCE contrastive loss
Mean squared error (MSE) loss
Global-to-local semantic injection
Stop-gradient (stopgrad)

Strengths:

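No additional hard negatives need to be constructed, which lowers cost and noise and avoids overfitting to specific negative patterns.
The framework is simple and effective, plugs into existing CLIP models, and is compatible with other methods.
The paper analyzes the gains in syntactic and semantic robustness in depth and demonstrates transfer value on downstream tasks.
The experiments are thorough, covering multiple benchmarks and backbones, with ablations and visualization analyses.

Limitations:

The method depends on a scene graph parser and GroundingDINO to extract compositional concepts, which may add computational overhead and errors.
The reconstruction loss optimizes only the text encoder, so the image encoder does not benefit directly, which may limit compositionality gains on the image side.
Effectiveness may be limited in extremely complex scenes (e.g., long text or dense relations).
The method is not validated on larger models (e.g., EVA-CLIP) or on additional modalities.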

Relevance To Keywords:

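Representation learning: The paper strengthens the compositionality of vision-language representations through masked reconstruction and contrastive learning, which falls under representation learning.
Unified understanding and generation in multimodal large models: The method improves CLIP's understanding ability and shows positive effects on text-to-image generation and multimodal large language models, reflecting synergy between understanding and generation.
World models: Compositional understanding is one of the basic capabilities needed to build world models, and this work helps models represent object relations and attributes more accurately.
Post-training: MACCO is a post-training fine-tuning method for CLIP that does not require pretraining from scratch, which fits the post-training paradigm.
Native multimodal large models: Although based on CLIP, the method can be extended to other multimodal models to improve their compositional reasoning.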

25. TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data? (PASS)

Score: 51.0 / 35.2

Authors: Dat Tien Nguyen, Thao Nguyen, Fadillah Adamsyah Maani, Huy M. Le, Muhammad Umer Sheikh, Numan Saeed, Muhammad Haris Khan, Salman Khan

Published: 2026-06-11

TL;DR: TerraBench addresses the lack of interactive reasoning in Earth science by introducing a benchmark and agent framework that couples LLM planning with scientific tools to coordinate heterogeneous workflows and preserve artifact provenance.

摘要翻译

气候与环境决策日益需要对异构输入进行推理，这些输入包括网格化物理数据、卫星影像、地理空间上下文以及模拟器输出。天气与气候基础模型虽能做出良好预测，却无法在语言中进行交互推理；而大型语言模型（LLMs）虽能在语言中进行推理，却无法直接处理高维地球系统数据。因此，地球科学领域的真实科学工作流程仍面临服务不足的问题。我们介绍了 TerraBench，这是一个基于 TerraAgent 构建的接地式地球科学推理基准。TerraAgent 是一种 ReAct 风格的可执行框架，它交织推理、工具调用和观察，从而将 LLM 规划与环境检索、地理空间处理、模拟及基于产物的计算等科学工具耦合起来。TerraBench 在单一可执行接口中统一了地球观测影像、网格数据、GIS 推理和模拟的分析，而此前的基准将这些能力隔离为狭窄的独立任务。此外，它也是该领域首个将过程级工具使用指标与容差感知数值评分相结合的基准。该基准涵盖三个赛道（基础、基于模拟器的、基于文档的验证）和八个应用领域的 403 个广泛智能体任务，共计 24,500 个验证执行步骤。结果表明，可靠的地球科学智能体必须超越简单的工具访问，能够协调异构工作流、精确参数化工具并保留产物溯源。

Abstract

Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, satellite imagery, geospatial context, and simulator outputs. Weather and climate foundation models can forecast well, but do not reason interactively in language, while large language models (LLMs) reason in language but cannot operate directly on high-dimensional Earth-system data. As a result, real scientific workflows in Earth-science remain underserved. We introduce TerraBench, a benchmark for grounded Earth-science reasoning, built on TerraAgent, a ReAct-style executable framework that interleaves reasoning, tool calls, and observations to couple LLM planning with scientific tools for environmental retrieval, geospatial processing, simulation, and artifact-backed computation. TerraBench unifies analysis of Earth observation imagery, gridded data, GIS reasoning and simulation in a single executable interface, whereas prior benchmarks isolate these capabilities into narrow individual tasks. It is also the first in this space to pair process-level tool-use metrics with tolerance-aware numeric scoring. The benchmark comprises 403 extensive agentic tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains with 24,500 verified execution steps. These results indicate that reliable Earth-science agents must go beyond tool access to coordinate heterogeneous workflows, parameterize tools precisely, and preserve artifact provenance.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	9.0/10	13.5

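Scoring rationale: The paper centers on an Earth-science agent benchmark (TerraBench) and framework (TerraAgent), so 'Agentic Reasoning' is the most relevant keyword (9). It handles multimodal data (imagery, gridded data, text), so 'MultiModal' is fairly relevant (6). It involves LLM interaction with tools, so 'MLLM' and 'Unify Models' are moderately relevant (5). The remaining keywords, such as Tokenizer, Visual Encoder, and World Models, are not mentioned as core methods in the abstract and have low relevance (1-3). No designated expert authors were found. The weighted total of 51.0 exceeds the dynamic passing score of 35.2.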
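The weighted total quoted in the rationale can be reproduced from the score table: each keyword contributes weight times relevance. The short sketch below uses the TerraBench rows; the names `weighted_total` and `PASS_THRESHOLD`, and the assumption that a paper passes when its total is strictly above the threshold, are illustrative and not taken from the report generator.

```python
# Relevance values copied from the TerraBench score table; every keyword has weight 1.5.
WEIGHT = 1.5
PASS_THRESHOLD = 35.2  # the "dynamic passing score" shown next to the total

relevance = {
    "Unify Models": 5.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 2.0,
    "World Models": 1.0,
    "MLLM": 5.0,
    "MultiModal": 6.0,
    "model-based RL": 2.0,
    "Latent Reasoning": 3.0,
    "Agentic Reasoning": 9.0,
}

def weighted_total(scores: dict, weight: float) -> float:
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in scores.values())

total = weighted_total(relevance, WEIGHT)
passed = total > PASS_THRESHOLD  # assumed pass rule: strictly above the threshold
print(f"Score: {total} / {PASS_THRESHOLD}, passed={passed}")
```

The relevance values sum to 34, and 34 x 1.5 = 51.0, which matches the reported score.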

Keywords

TerraBench, Agentic Reasoning, Earth-System Data, Heterogeneous Inputs, ReAct-style Framework, Grounded Reasoning, Scientific Tools

In-depth analysis

Chinese Title: TerraBench：智能体能否在异构地球系统数据上进行推理？

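Summary: The paper introduces TerraBench, a benchmark for evaluating Earth-science reasoning, together with the TerraAgent executable framework. The benchmark contains 403 long-horizon executable tasks across three tracks (Fundamentals, Simulator-Grounded, and Document-Grounded Verification) and eight application domains such as weather, air quality, and emergency response, totaling about 24,500 verified execution steps. TerraAgent follows the ReAct reason-act-observe paradigm and integrates 77 scientific sub-tools, including reanalysis data retrieval, satellite image processing, GIS analysis, deterministic simulation (e.g., AquaCrop, DSSAT, CLIMADA), and visualization. The evaluation protocol separates process-level tool-use proficiency from outcome-level tolerance-aware numeric scoring. Experiments show that the strongest frontier model (Claude Sonnet 4.6) reaches only a tool-use score of 59.2 and a tolerance hit rate of 22.9, and the open-source model Qwen3.5-35B does worse. The dominant failure mode is parameter and numeric grounding errors rather than tool selection errors. The paper exposes the shortcomings of current models in coordinating heterogeneous workflows, parameterizing tools precisely, and preserving artifact provenance.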

Innovations:

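First to unify Earth observation imagery, gridded environmental data, GIS reasoning, deterministic simulation, and document verification under a single executable interface.
Proposes the TerraAgent framework, which couples LLM planning with scientific tools in a ReAct style and produces auditable reasoning traces and artifact-backed outputs (NetCDF, GeoTIFF, CSV, PNG).
Designs a tolerance-aware evaluation protocol that decouples process-level tool-use proficiency from the numeric correctness of the final answer, exposing systematic gaps that conventional tool-trace evaluation cannot detect.
Builds a large benchmark of 403 tasks and about 24,500 execution steps, with strict human screening and multiple rounds of execution verification that retained only 50.9% of the initial samples.
Provides annotations for four causal reasoning levels (Level 0-3), from observational grounding to counterfactual reasoning, laying a foundation for evaluating the causal abilities of Earth-science agents.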

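Methodology: First, the TerraAgent framework is designed on the ReAct paradigm. It turns a user question into a tool-grounded workflow and separates planning from execution through a domain-organized tool registry (77 sub-tools). Second, the benchmark is created with a manual plus semi-automatic annotation pipeline: starting from real scientific questions, it generates executable programs that include a structured output contract, a canonical reasoning trace, tool observations, supporting artifacts, and a verified answer. The benchmark has three tracks: Fundamentals (deterministic execution), Simulator-Grounded (intervention and counterfactual simulation), and Document-Grounded Verification (reproducing results from the scientific literature). For evaluation, the tool-use score (ToolUseScore) measures process correctness, and the numeric score (NumScore) and tolerance hit rate (Hit@tol) measure outcome accuracy. Experiments run on several frontier and open-source models and analyze failure modes.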
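To make the idea of tolerance-aware numeric scoring concrete, here is a minimal sketch of a Hit@tol-style metric. The exact tolerance rule used by TerraBench is not given in this summary, so the relative tolerance of 5%, the absolute floor, the treatment of missing answers as misses, and the function names are all assumptions.

```python
from typing import Optional, Sequence

def within_tolerance(pred: Optional[float], ref: float,
                     rel_tol: float = 0.05, abs_tol: float = 1e-9) -> bool:
    """True if the prediction lies within a relative tolerance band around the reference."""
    if pred is None:  # assumed convention: a missing answer counts as a miss
        return False
    return abs(pred - ref) <= max(rel_tol * abs(ref), abs_tol)

def hit_at_tol(preds: Sequence[Optional[float]], refs: Sequence[float],
               rel_tol: float = 0.05) -> float:
    """Percentage of answers that fall inside the tolerance band."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    hits = sum(within_tolerance(p, r, rel_tol) for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)

# Hypothetical agent answers versus verified reference values.
predictions = [10.2, 5.0, 100.0, None]
references = [10.0, 6.0, 101.0, 3.0]
score = hit_at_tol(predictions, references)
```

Scoring the final number separately from the tool trace is what lets an evaluation of this kind detect an agent that calls the right tools but passes slightly wrong parameters.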

Key Results:

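The strongest frontier model, Claude Sonnet 4.6, reaches ToolUseScore 59.2, NumScore 28.4, and Hit@tol 22.9.
The strongest open-source model, Qwen3.5-35B, reaches ToolUseScore 40.0, NumScore 7.5, and Hit@tol 5.9.
For every model, more than 84% of numeric answers fall outside the acceptable error range.
Simulator-Grounded tasks are especially hard; the main failure modes are wrong parameter values, wrong tool order, and values outside tolerance.
Tool selection errors are not the main cause of failure; parameter and numeric grounding errors dominate.
The benchmark is strictly filtered: only 50.9% of the initial samples passed human review, and 74.4% of those needed multiple executions before being finalized.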

Tech Stack:

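ReAct reason-act-observe paradigm
Pangu-Weather (weather forecasting model)
Aurora (weather forecasting model)
AquaCrop (crop-water response simulator)
DSSAT (cropping system simulator)
CLIMADA (event impact assessment simulator)
UTCI (heat stress assessment)
EnergyPlus (building energy analysis)
SUMO (traffic disruption analysis)
Artifact formats such as NetCDF, GeoTIFF, CSV, and PNG
OpenStreetMap (GIS data)
ERA5/CMIP6/C3S (reanalysis and climate datasets)
Tolerance-aware numeric scoring (Hit@tol)
Tool-use score (ToolUseScore)
Numeric score (NumScore)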

Strengths:

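Comprehensiveness: It covers many heterogeneous data sources and task types in Earth science and is the first unified evaluation framework.
Executability: Every task is an executable benchmark program with structured output and a canonical trace, which supports automatic evaluation.
Fine-grained evaluation: Process and outcome are evaluated separately, exposing numeric grounding problems that conventional methods miss.
Strict quality control: Human review and multiple rounds of execution ensure benchmark quality.
Open source: The code and benchmark are publicly available, supporting reproducible research.

Limitations:

Current model performance is low, which indicates high task difficulty but may limit the benchmark's discriminative power (a floor effect).
The benchmark focuses on deterministic simulation and document verification and does not cover probabilistic forecasting or uncertainty quantification.
The tool set is broad but may not cover every Earth-science field (e.g., oceanography or glaciology).
Evaluation relies on human-annotated canonical traces, which may introduce subjective bias.
Generalization to unseen tools or dynamic environments is not explored.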

Relevance To Keywords:

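Unify Models: The TerraAgent framework unifies many scientific tools and data types, which relates to the idea of unified models.
World Models: Earth-system simulators (e.g., AquaCrop, CLIMADA) can be seen as world models, and the paper evaluates how well agents reason with them.
Representation Learning: The paper does not study representation learning directly, but tool calling and numeric grounding depend on effective representations of Earth-system data.
Model-Based RL: The Simulator-Grounded tasks involve intervention and counterfactual reasoning, which overlaps with planning in model-based reinforcement learning.
Native multimodal large models: The paper evaluates multimodal large models (e.g., Claude, Qwen) on Earth-science tasks but does not address unified multimodal understanding and generation.
Post-training: The paper does not cover post-training methods, but the benchmark can be used to evaluate post-training effects.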

26. Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents (PASS)

Score: 49.5 / 35.2

Authors: Saehun Chun, Wonje Choi, Sera Choi, Sanghyun Ahn, Honguk Woo

Published: 2026-06-11

TL;DR: This paper proposes FCGraft, a framework that reduces latency and improves robustness in generating code policies for embodied agents by reusing validated function caches and performing local patching.

摘要翻译

代码生成大语言模型（CodeLLMs）通过将自然语言目标和环境约束转换为结构化控制程序，为具身智能体生成可执行代码策略。然而，开放域具身环境中的策略生成存在两个基本局限性：(i) 由于长提示词上的重复预填充计算导致的解码延迟，(ii) 由于完全生成式解码导致的鲁棒性不足，这通常会产生 API 不匹配、缺失的安全防护以及不稳定的控制逻辑。为了解决这些局限性，我们提出了 FCGraft（功能缓存嫁接框架）。FCGraft 维护一个包含经过函数级验证的代码骨架及其相关的提示词级 Transformer 键值 (KV) 缓存的库，并在提供新任务时通过检索相关函数并嫁接它们的 KV 缓存来合成新策略。给定检索到的函数缓存，FCGraft 通过拼接将缓存的函数片段组合成组合策略，并通过修补仅局部适配必要的代码区域，以满足任务特定参数和约束，同时最小化额外解码开销。通过消除冗余的预填充计算，该方法减少了生成延迟，而重用验证过的控制结构提高了相对于提示词级缓存方法 RAGCache 的鲁棒性，实现了 18.31% 更高的任务成功率和提速 2.3 倍的策略合成速度。

Abstract

Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals and environmental constraints into structured control programs. However, policy generation in open-domain embodied environments suffers from two fundamental limitations: (i) delayed decoding caused by repetitive prefill computation over long prompts, and (ii) limited robustness due to fully generative decoding, which often produces API mismatches, missing safety guards, and unstable control logic. To address these limitations, we present FCGraft, a Functional Cache Grafting framework. FCGraft maintains a library of function-level validated code skeletons and their associated prompt-level Transformer key-value (KV) caches, and synthesizes new policies by retrieving relevant functions and grafting their KV caches when a new task is provided. Given retrieved function caches, FCGraft performs cache grafting via stitching, which composes cached function segments into a composite policy, and patching, which locally adapts only the necessary code regions to satisfy task-specific parameters and constraints with minimal additional decoding. By eliminating redundant prefill computation, this approach reduces generation latency, while reusing validated control structures improves robustness over the prompt-level caching method RAGCache, achieving an 18.31% higher task success rate and 2.3x faster policy synthesis.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	5.0/10	7.5
Latent Reasoning	1.5	6.0/10	9.0
Agentic Reasoning	1.5	9.0/10	13.5

Scoring rationale: Agentic Reasoning is highly relevant due to the focus on embodied agents and policy synthesis. Latent Reasoning is moderately relevant as KV caches retain latent state information. model-based RL is relevant in the context of embodied control. World Models and MLLM have low relevance since the paper focuses on CodeLLMs rather than world modeling or multimodal fusion. Unify Models, MultiModal, Tokenizer, and Visual Encoder are largely irrelevant as the text-to-code approach does not involve vision, tokenizer design, or model unification.

Keywords

CodeLLMs, Embodied Agents, Functional Cache Grafting, Policy Synthesis, KV Caches, Latency Reduction, Robustness

In-depth analysis

Chinese Title: 功能缓存嫁接：面向具身智能体的鲁棒快速代码策略合成

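Summary: This paper addresses the high latency and poor robustness that arise when embodied agents use code large language models (CodeLLMs) to generate policies in open-domain environments, and proposes the FCGraft framework. FCGraft maintains function-level validated code skeletons and their corresponding Transformer key-value (KV) caches. When a new task arrives, it retrieves relevant functions and grafts their KV caches, avoiding redundant prefill computation. Specifically, cache stitching composes validated function segments into a composite policy, and cache patching locally adjusts only the necessary code regions to minimize additional decoding. Experiments show that, compared with RAGCache, FCGraft raises the average task success rate by 18.31% and speeds up policy synthesis by 2.3x on benchmarks such as ALFRED, TEACh, and RLBench and on real robot manipulation tasks, achieving the best trade-off between robustness and latency.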

Innovations:

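Proposes a function-level KV cache mechanism that decomposes validated code functions into stored key-value caches, enabling fine-grained code reuse.
Introduces cache grafting, which consists of two interdependent stages, cache stitching and cache patching, that eliminate structural errors and local errors respectively.
Designs a two-level code cache (an interface layer and an implementation layer) that supports lightweight referencing and in-place editing, making code policy synthesis more efficient.
Improves function reusability in open-domain tasks with a dynamic cache management policy based on recency and frequency of use and on semantic diversity.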

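Methodology: FCGraft uses a two-stage code policy synthesis pipeline. First, previously generated and validated code policies are decomposed into function-level KV caches and stored in a two-level code cache (interface layer and implementation layer). Then, for a new task, it retrieves relevant functions and performs cache stitching (composing several function KV caches into a complete policy) and cache patching (locating erroneous spans and generating only the corrected parts). The method uses the Transformer KV cache mechanism to avoid repeated prefill and improves robustness by reusing validated control structures.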
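The retrieve, stitch, and patch flow can be illustrated with a toy example that works on plain source text. This sketch does not touch real Transformer KV caches: each library entry holds a validated code skeleton as a string, retrieval is a simple word-overlap ranking, and patching is placeholder substitution. All names (`CachedFunction`, `retrieve`, `stitch`, `patch`) and the robot API inside the skeletons are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CachedFunction:
    name: str
    description: str
    skeleton: str  # validated code with {placeholders}; stands in for a cached KV segment

def retrieve(library: List[CachedFunction], task: str, top_k: int = 2) -> List[CachedFunction]:
    """Rank library functions by word overlap with the task (a stand-in for real retrieval)."""
    task_words = set(task.lower().split())
    def overlap(fn: CachedFunction) -> int:
        return len(task_words & set(fn.description.lower().split()))
    return sorted(library, key=overlap, reverse=True)[:top_k]

def stitch(functions: List[CachedFunction]) -> str:
    """Compose retrieved segments into one policy without regenerating them."""
    return "\n\n".join(fn.skeleton for fn in functions)

def patch(policy: str, params: Dict[str, str]) -> str:
    """Adapt only the task-specific spans, leaving validated structure untouched."""
    for key, value in params.items():
        policy = policy.replace("{" + key + "}", value)
    return policy

library = [
    CachedFunction("pick", "pick up an object",
                   "def pick(robot):\n    robot.grasp('{object}')"),
    CachedFunction("place", "place an object on a receptacle",
                   "def place(robot):\n    robot.release_on('{target}')"),
    CachedFunction("open", "open a door or drawer",
                   "def open_container(robot):\n    robot.open('{container}')"),
]

selected = retrieve(library, "pick up the mug and place it on the shelf")
policy = patch(stitch(selected), {"object": "mug", "target": "shelf"})
compile(policy, "<policy>", "exec")  # syntax check only; nothing is executed
```

The design point the toy preserves is that the bulk of the policy comes from previously validated material, and new generation is confined to a few small spans.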

Key Results:

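On benchmarks such as ALFRED, TEACh, and RLBench, FCGraft's task success rate is on average 18.31% higher than RAGCache's.
Policy synthesis is on average 2.3x faster.
Real robot manipulation tasks confirm the framework's practicality and robustness.
Using cache stitching and cache patching together reduces structural and local errors and improves code quality.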

Tech Stack:

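CodeLLM (code large language model)
Transformer key-value cache (KV cache)
Retrieval-augmented generation (RAG)
Fill-in-the-Middle (FIM)
Function-level code decomposition and cache management
Semantics-aware cache replacement policy (based on recency and frequency of use and functional diversity)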

Strengths:

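Extends KV caching from the document level to the function level, enabling finer-grained code reuse.
The joint design of cache stitching and patching addresses two different error types, structural and local.
Delivers clear gains on several embodied-agent benchmarks and real tasks (success rate +18.31%, speed 2.3x).
The framework is general and can be integrated into the existing Code-as-Policies paradigm.

Limitations:

It depends on a prebuilt function cache library, so a certain amount of validated code must be accumulated at the start.
The cache management policy (e.g., semantic diversity assessment) may add computational overhead.
Retrieval and grafting may be limited for entirely new domains or tasks that differ greatly in semantics from the cached functions.
The experiments rely mainly on simulated environments and a limited set of real tasks, so generalization to large-scale open-domain deployment needs further validation.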

Relevance To Keywords:

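Unify Models: The paper does not address unified models directly, but code policy synthesis can be viewed as unifying multimodal understanding and control.
World Models: The paper does not build an explicit world model, but the code policies implicitly model environmental constraints.
Representation Learning: The paper uses Transformer KV caches as function representations, which falls under representation learning.
Model-Based RL: The paper does not involve reinforcement learning directly, but code policies can be viewed as a model-based control method.
Native multimodal large models: The paper uses a CodeLLM to handle natural language instructions and API calls, which is an application of multimodal large models.
Unified understanding and generation in multimodal large models: The CodeLLM both understands instructions and generates code, reflecting unified understanding and generation.
Representation learning: Function-level KV caches can be viewed as an efficient form of representation reuse.
World models: The code policies implicitly model environment dynamics.
Reinforcement learning: The paper does not train with reinforcement learning, but policy synthesis could help initialize policies for reinforcement learning.
Post-training: The paper does not involve model post-training; it applies inference-time optimization to a pretrained CodeLLM.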

27. DeepJEB++: Foundation Model-Driven Large-Scale 3D Engineering Dataset via 2D Latent Space Augmentation (PASS)

Score: 49.5 / 35.2

Authors: Soyoung Yoo, Leekyo Jeong, Jinsu Ra, Dongeon Lee, Sunwoong Yang, Hyogu Jeong, Namwoo Kang

Published: 2026-06-11

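TL;DR: DeepJEB++ addresses the scarcity of large-scale 3D engineering datasets by using 2D latent space augmentation and automated physics labeling to expand a small set of seed designs into 15,360 simulation-labeled 3D brackets.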

摘要翻译

数据驱动的工程设计受限于缺乏大规模 3D 数据集，这些数据集需将几何形状与基于物理的性能标签配对。特别是，现有的 3D 数据增强技术在保持细微且多样的几何变化方面存在局限，且难以自动化后续的仿真标注过程，因为边界条件会随生成的几何形状而变化。我们提出 DeepJEB++，一种基于基础模型的数据增强框架，旨在资源受限的条件下，将少量发动机支架种子集扩展为大规模、经仿真标注的 3D 数据集。我们的核心思想是在数据丰富的 2D 潜在空间中进行增强，然后转换至 3D。第一阶段，我们在多视图渲染上微调预训练的 2D 潜在扩散模型，并通过潜在插值合成新视图，通过视觉语言模型（VLM）质量过滤器保留可制造的设计。第二阶段，经过验证的图像通过领域自适应生成基础模型被提升为 3D 网格。第三阶段，自动化流水线识别每个网格上的载荷和螺栓接口，并自动分配有限元标签——包括质量、应力和位移——无需人工干预。我们从三个内在维度评估增强质量：可制造性、与 SimJEB 真实值相比的标签保真度以及分布一致性。从少于 400 个种子设计出发，DeepJEB++ 在每个阶段仅使用单个 GPU 的情况下，生成了 15,360 个仿真标注的 3D 支架，实现了 40 倍的扩展。该数据集将公开提供，以支持可复现的工程与 AI 研究。

Abstract

Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. In particular, existing 3D data augmentation techniques have limitations in preserving subtle and diverse geometric variations, and it remains difficult to automate the subsequent simulation-labeling process, where boundary conditions vary depending on the generated geometry. We present DeepJEB++, a foundation-model-driven data-augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. Our key idea is to augment in the data-rich 2D latent space, then transfer to 3D. In Stage 1, we fine-tune a pretrained 2D latent diffusion model on multi-view renders and synthesize novel views by latent interpolation, retaining manufacturable designs through a vision-language-model (VLM) quality filter. In Stage 2, the validated images are lifted to 3D meshes by a domain-adapted generative foundation model. In Stage 3, an automated pipeline recognizes the load and bolt interfaces on each mesh and assigns finite-element labels -- mass, stress, and displacement -- without manual intervention. We assess augmentation quality along three intrinsic axes: manufacturability, label fidelity against the SimJEB ground truth, and distributional consistency. Starting from fewer than 400 seed designs, DeepJEB++ yields 15,360 simulation-labeled 3D brackets -- a 40x expansion -- using a single GPU per stage. The dataset will be made publicly available to support reproducible engineering-AI research.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	8.0/10	12.0
Agentic Reasoning	1.5	2.0/10	3.0

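Scoring rationale: The paper's core is using foundation models (a diffusion model and a VLM) for 3D engineering data augmentation, which is highly relevant to Latent Reasoning (latent space augmentation) and somewhat related to MultiModal (multimodal data) and MLLM (use of a VLM). It is only weakly related to reinforcement learning and agent directions such as World Models, model-based RL, and Agentic Reasoning, and Tokenizer and Visual Encoder appear only as components, not as the core. Unify Models refers to the multi-model pipeline and has moderate relevance. The weighted total of 49.5 exceeds the dynamic passing score of 35.2.

Keywords

3D Engineering Dataset, 2D Latent Space Augmentation, Foundation Model, Automated Simulation Labeling, Diffusion Model, Vision-Language Model, Data Augmentation

In-depth analysis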

Chinese Title: DeepJEB++：基于基础模型的二维潜空间增强驱动的大规模三维工程数据集

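Summary: Data-driven engineering design is constrained by the lack of large-scale 3D datasets that pair geometry with physics-based performance labels. Existing 3D data augmentation methods struggle to preserve subtle and diverse geometric variation, and automating simulation labeling is difficult because boundary conditions vary with the generated geometry. The paper proposes DeepJEB++, a foundation-model-driven data augmentation framework that expands a small seed set of jet engine brackets into a large, simulation-labeled 3D dataset under constrained resources. The key idea is to augment in the data-rich 2D latent space and then transfer to 3D. Stage 1 fine-tunes a pretrained 2D latent diffusion model, synthesizes novel views by latent interpolation, and retains manufacturable designs with a vision-language model (VLM) quality filter. Stage 2 lifts the validated images to 3D meshes with a domain-adapted generative foundation model. Stage 3 is an automated pipeline that recognizes the load and bolt interfaces on each mesh and assigns finite-element labels (mass, stress, displacement) without manual intervention. Starting from fewer than 400 seed designs, DeepJEB++ produces 15,360 simulation-labeled 3D brackets (a 40x expansion) using a single GPU per stage. The dataset will be released to support reproducible engineering-AI research.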

Innovations:

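Proposes a cross-dimensional augmentation strategy: augment in the data-rich 2D latent space (which benefits from pretraining on billions of images), then reconstruct with a domain-adapted 3D foundation model, overcoming 3D data scarcity.
Uses a vision-language model (VLM) to filter engineering images by quality, and identifies and resolves the "negation-word negation" (NWN) problem that arises when a VLM describes defects.
Automates boundary condition recognition: by detecting the load and bolt interfaces on each generated mesh, it assigns boundary conditions adapted to each sample's geometry instead of a fixed template, restoring boundary condition diversity.
Achieves a 40x dataset expansion on single-GPU resources (from 380 seeds to 15,360 labeled 3D samples), showing that foundation-model pipelines are feasible in academic and small-business settings.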

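Methodology: The method is a three-stage pipeline. Stage 1 (2D augmentation): render 380 seed brackets from multiple views (26 views), fine-tune Stable Diffusion (full-parameter fine-tuning) on 7,800 images, synthesize novel views by latent interpolation, and filter for quality with a LLaVA model. Stage 2 (3D reconstruction): lift the validated 2D images to 3D meshes with a domain-adapted TRELLIS foundation model (MIT license), using single-view or multi-view conditional fine-tuning to ensure geometric fidelity. Stage 3 (CAE labeling): an automated pipeline identifies the load and bolt interfaces on each mesh, assigns finite-element boundary conditions, and runs the simulation to obtain mass, stress, and displacement labels.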
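Stage 1 synthesizes new designs by interpolating between latents. The summary does not say whether the interpolation is linear or spherical, so the sketch below assumes plain linear interpolation between two latent tensors; the latent shape and the function name `interpolate_latents` are also illustrative. Spherical interpolation is a common alternative for Gaussian-distributed diffusion latents.

```python
import numpy as np

def interpolate_latents(z_a: np.ndarray, z_b: np.ndarray, num_steps: int) -> list:
    """Return num_steps latents strictly between z_a and z_b (endpoints excluded)."""
    if z_a.shape != z_b.shape:
        raise ValueError("latents must share a shape")
    # linspace includes both endpoints, so drop the first and last mixing weights.
    alphas = np.linspace(0.0, 1.0, num_steps + 2)[1:-1]
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

# Hypothetical latents for two seed renders (4 channels, 8x8 spatial grid).
z_seed_a = np.zeros((4, 8, 8))
z_seed_b = np.ones((4, 8, 8))
new_latents = interpolate_latents(z_seed_a, z_seed_b, num_steps=3)
```

In the pipeline described above, each interpolated latent would then be decoded to an image and passed to the VLM quality filter, since not every point between two valid designs is itself manufacturable.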

Key Results:

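Expands 380 seed designs into 15,360 simulation-labeled 3D brackets, a 40x expansion.
Each stage uses a single GPU (A100), with a training time of 1-2 days.
The VLM quality filter selects 22,495 valid designs from about 147,000 candidate images.
Augmentation quality is assessed along three intrinsic axes: manufacturability, label fidelity (against the SimJEB ground truth), and distributional consistency.
Resolving the "negation-word negation" (NWN) problem in the VLM improves quality-filter accuracy.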

Tech Stack:

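Stable Diffusion (latent diffusion model)
TRELLIS (structured 3D latents, based on rectified flow transformers)
LLaVA (vision-language model)
BLIP-2 (bootstrapped language-image pre-training)
DeepSDF (implicit signed distance function)
Finite element analysis (FEA)
Multi-view rendering (26 views: 8 azimuths x 3 elevations + top/bottom views)
Latent interpolation
Camera parameters recorded in JSON format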
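The 26-view rendering layout listed above (8 azimuths x 3 elevations plus top and bottom views) can be enumerated in a few lines. The specific elevation angles are not given in this summary, so the values of -30, 0, and 30 degrees below are assumptions, as are the function names.

```python
import math

def camera_views(elevations_deg=(-30.0, 0.0, 30.0), num_azimuths=8):
    """Enumerate (azimuth, elevation) pairs: a ring per elevation plus top and bottom."""
    views = []
    for elevation in elevations_deg:
        for k in range(num_azimuths):
            views.append((360.0 * k / num_azimuths, elevation))
    views.append((0.0, 90.0))   # top view
    views.append((0.0, -90.0))  # bottom view
    return views

def view_direction(azimuth_deg: float, elevation_deg: float):
    """Unit vector from the object center toward the camera."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

views = camera_views()
directions = [view_direction(az, el) for az, el in views]
```

A list like `views` is also the natural thing to serialize when recording camera parameters per render.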

Strengths:

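Uses the rich priors of 2D foundation models to augment 3D engineering data, overcoming the 3D data scarcity bottleneck.
The automated pipeline needs no manual intervention from augmentation to labeling, which greatly lowers the cost of dataset construction.
Achieves large-scale expansion on single-GPU resources, making it highly reproducible and accessible.
The public dataset will support reproducibility in engineering-AI research.
Resolves the "negation-word negation" problem of VLMs in engineering quality assessment, making filtering more reliable.

Limitations:

The method currently targets only jet engine brackets, and cross-domain generalization is not validated.
It depends on foundation models such as TRELLIS, whose training data (500K+ 3D assets) may not include every type of engineering geometry, so reconstruction of some complex shapes may be inaccurate.
Automated boundary condition recognition may fail on extreme or non-standard geometry and needs further robustness validation.
The dataset size (15,360) is still far smaller than natural image datasets and may not be enough to train very large models.
Only single-view reconstruction is used, which may lose some geometric detail; multi-view conditioning could improve quality further at higher computational cost.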

Relevance To Keywords:

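Unify Models: The paper does not address unified models directly, but its use of foundation models (Stable Diffusion, TRELLIS) reflects the trend toward unifying multimodal generation and understanding.
World Models: The paper does not involve world models, but assigning physical attributes to geometry through simulation labels (stress, displacement) can be viewed as building toward an engineering world model.
Representation Learning: The paper augments in the 2D latent space and uses structured 3D latents (SLAT), which involves representation learning.
Model-Based RL: The paper does not involve reinforcement learning, but the labeled dataset can be used to train surrogate models that support model-based optimization or RL.
Native multimodal large models: The paper uses a multimodal large model (LLaVA) for quality assessment, reflecting multimodal understanding ability.
Unified understanding and generation in multimodal large models: The paper combines 2D generation (Stable Diffusion), 3D generation (TRELLIS), and VLM understanding, but does not achieve end-to-end unification.
Representation learning: As above, latent spaces and structured latents fall under representation learning.
World models: Assigning physical behavior to geometry through simulation labels can be viewed as a data foundation for engineering world models.
Reinforcement learning: Indirectly related; the generated dataset can be used to train reinforcement learning policies for engineering design.
Post-training: Fine-tuning the foundation models (Stable Diffusion, TRELLIS) falls under post-training.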

28. SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents (PASS)

Score: 49.5 / 35.2

Authors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

Published: 2026-06-11

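TL;DR: SkillCAT proposes a training-free skill self-evolution framework for LLM agents that uses contrastive assessment and topology-aware task execution to raise benchmark scores by up to 40.40% without retraining the model.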

摘要翻译

大语言模型（LLM）代理的技能自演化方法旨在将执行轨迹转化为可重用的技能文档，但当前流程通常仅从每个任务的单个轨迹中学习，在验证前合并候选技能补丁，并在推理前加载整个技能语料库。我们提出 SkillCAT，这是一个无需训练的框架，将该过程分为三个阶段。对比因果提取（Contrastive Causal Extraction, CCE）为每个任务采样多个轨迹，并比较同任务的成功/失败样本对，以识别解释结果差异的证据。评估增强演化（Assessment-Augmented Evolution, AAE）在每个源任务副本上重放每个候选补丁，仅保留能改进或保持任务结果的补丁，随后进行层次化技能补丁合并。拓扑感知任务执行（Topology-Aware Task Execution, TTE）将演化后的技能编译为可路由的子技能拓扑，使得推理仅加载与任务相关的功能节点。我们在常见的代理基准上评估 SkillCAT，包括 SpreadsheetBench、WikiTableQuestions 和 DocVQA，并进一步测试跨模型及分布外泛化能力。在这些设置下，SkillCAT 相较于基线将平均得分提高了高达 40.40%，证明了无需模型训练即可实现可靠的技能演化。

Abstract

Skill self-evolution methods for LLM agents aim to turn execution trajectories into reusable skill documents, but current pipelines typically learn from one trajectory per task, merge candidate skill patches before checking them, and load the full skill corpus before inference. We propose SkillCAT, a training-free framework that separates this process into three stages. Contrastive Causal Extraction (CCE) samples multiple trajectories for each task and compares same-task success/failure pairs to identify evidence that explains outcome differences. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones and keeps only patches that improve or preserve task outcomes before hierarchical skill patch merging. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology, so inference loads only the capability nodes relevant to the task. We evaluate SkillCAT on common agent benchmarks, including SpreadsheetBench, WikiTableQuestions, and DocVQA, and further test cross-model and out-of-distribution generalization. Across these settings, SkillCAT raises the average score over baselines by up to 40.40%, demonstrating reliable skill evolution without model training.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	5.0/10	7.5
Latent Reasoning	1.5	6.0/10	9.0
Agentic Reasoning	1.5	9.0/10	13.5

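Scoring rationale: The paper studies skill self-evolution for LLM agents and is highly relevant to 'Agentic Reasoning'. It optimizes skills through trajectory comparison and topology-based reasoning, which gives it some connection to 'Latent Reasoning' and 'model-based RL'. It uses visual task benchmarks, but its core is not a multimodal model architecture, so 'MLLM' and 'MultiModal' have moderate relevance. 'Tokenizer', 'Visual Encoder', and 'Unify Models' are not involved and have low relevance. 'World Models' is partly relevant. The author list does not include Yang Shi or the other designated experts, so no bonus is added.

Keywords

Skill Self-Evolution, LLM Agents, Contrastive Causal Extraction, Topology-Aware, Training-free, Skill Documents, Trajectory Analysis

In-depth analysis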

Chinese Title: SkillCAT：面向LLM智能体的对比评估与拓扑感知技能自我进化

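Summary: The paper proposes SkillCAT, a training-free framework for skill self-evolution in LLM agents. Current methods have three problems: single-trajectory bias (one trajectory provides too little evidence), unverified merging (merging patches directly can introduce noise), and context overload (a growing skill library interferes at inference). SkillCAT divides the skill lifecycle into three stages. Contrastive Causal Extraction (CCE) uses multi-seed sampling to produce same-task success/failure pairs and extracts experience at the causal watershed. Assessment-Augmented Evolution (AAE) replays each candidate patch on source-task clones, scores it by outcome transition, keeps only patches that improve or preserve behavior, and then merges them hierarchically. Topology-Aware Task Execution (TTE) compiles the evolved skills into a routable sub-skill topology so that inference loads only task-relevant nodes. On benchmarks such as SpreadsheetBench, WikiTableQuestions, and DocVQA, SkillCAT improves the average score over baselines by up to 40.40%, and the skills generalize across models and distributions.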

Innovations:

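Proposes Contrastive Causal Extraction (CCE), which uses success/failure pairs from multi-seed trajectories of the same task to locate the causal watershed and extract key experience, avoiding single-trajectory bias.
Proposes Assessment-Augmented Evolution (AAE), which verifies each candidate patch by replay on source-task clones, scores it by outcome transition (failure to success scores highest), keeps only effective patches, and then merges them hierarchically.
Proposes Topology-Aware Task Execution (TTE), which compiles the evolved skills into a routable sub-skill topology and loads only task-relevant nodes at inference, easing context overload.
Decomposes skill self-evolution into three observable stages (evidence extraction, patch verification and integration, and test-time deployment), each of which can be ablated independently.
Requires no model training and evolves skills purely through LLM inference; the skills generalize across models (e.g., gemma-4-31B-it, gpt-5.4-mini) and across distributions.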

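Methodology: The paper uses a three-stage pipeline. Stage 1, CCE: run the agent on each task with several random seeds and collect successful and failed trajectories. When outcomes are mixed, randomly sample one success/failure pair, locate the first point where the actions diverge (the causal watershed), and have the LLM extract candidate experience records. If all runs succeed or all fail, fall back to single-trajectory extraction. Stage 2, AAE: replay each candidate patch on source-task clones, compare the baseline outcome with the replay outcome, and assign a score from the four outcome transitions (failure to success, success to success, failure to failure, success to failure) of 2, 1, 0, and -1 respectively. Keep only patches with a score ≥ 2, then tier the patches by score and merge from the lowest tier to the highest, using a Map-Reduce style merge within each tier. Stage 3, TTE: compile the evolved skill document into a directed acyclic graph whose nodes are sub-skills and whose edges are dependencies. At inference, select the relevant nodes from the task context (by embedding similarity or LLM routing) and assemble them into a compact skill that is injected into the agent's context.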
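Two mechanical pieces of this pipeline are easy to sketch: finding the first action divergence between a successful and a failed trajectory, and scoring a patch by its outcome transition. The scores 2, 1, 0, and -1 come from the description above. The retention threshold is left as a parameter because the summary states a score of at least 2 while the abstract describes keeping patches that improve or preserve outcomes. All function names and the example data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# (baseline_success, replay_success) -> score, as listed in the methodology.
TRANSITION_SCORES = {
    (False, True): 2,    # failure -> success
    (True, True): 1,     # success -> success
    (False, False): 0,   # failure -> failure
    (True, False): -1,   # success -> failure
}

def first_divergence(success_actions: Sequence[str],
                     failure_actions: Sequence[str]) -> Optional[int]:
    """Index of the first step where two same-task trajectories differ."""
    for i, (a, b) in enumerate(zip(success_actions, failure_actions)):
        if a != b:
            return i
    if len(success_actions) != len(failure_actions):
        return min(len(success_actions), len(failure_actions))
    return None  # identical trajectories

def score_patch(baseline_success: bool, replay_success: bool) -> int:
    return TRANSITION_SCORES[(baseline_success, replay_success)]

def keep_patches(candidates: List[Tuple[str, bool, bool]], min_score: int) -> List[str]:
    """Keep patch names whose outcome-transition score meets the threshold."""
    return [name for name, base, replay in candidates
            if score_patch(base, replay) >= min_score]

watershed = first_divergence(["open", "read", "sum"], ["open", "read", "avg"])
candidates = [("fix_header", False, True), ("noop", True, True), ("bad_cast", True, False)]
kept_strict = keep_patches(candidates, min_score=2)
kept_lenient = keep_patches(candidates, min_score=1)
```

Replaying each patch before merging means a harmful edit such as `bad_cast` is rejected on evidence rather than merged and discovered later.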

Key Results:

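On SpreadsheetBench with Qwen3.5-35B-A3B, SkillCAT reaches 55.50% Vrf, 25.83 percentage points above Trace2Skill; with Qwen3.5-122B-A10B it reaches 69.50% Vrf.
In the out-of-distribution test on WikiTableQuestions, Qwen3.5-35B-A3B and Qwen3.5-122B-A10B reach 81.55% and 84.47% accuracy respectively.
On the multimodal DocVQA benchmark, Qwen3.5-35B-A3B and Qwen3.5-122B-A10B reach 0.9159 and 0.7200 ANLS respectively.
Ablations show that the CCE, AAE, and TTE modules each contribute significantly.
Cross-model experiments show that the skills can be reused by gemma-4-31B-it and gpt-5.4-mini, with an average improvement of up to 40.40%.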

Tech Stack:

LLM (large language model) as the agent core
Multi-seed sampling
Contrastive Causal Extraction
Causal watershed identification
Source-task clone replay
Outcome transition scoring
Hierarchical tiered merge
Directed acyclic graph (DAG) topology
Embedding similarity or LLM routing
Map-Reduce style merge

Strengths:

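Requires no model training and runs purely on inference, so it is cheap to compute and easy to deploy.
Contrastive causal extraction mitigates single-trajectory bias and makes the extracted experience more reliable.
Replay verification ensures patch quality and keeps harmful patches out of the skill library.
Topology-aware routing substantially shortens the inference-time context, improving efficiency and reducing interference.
Achieves clear gains across several benchmarks and in cross-model and out-of-distribution settings, showing strong generalization.
The modular design lets each stage be ablated and replaced independently, which eases later improvement.

Limitations:

It depends on the LLM's reasoning ability; if the LLM is weak at causal reasoning or contrastive analysis, extraction quality may suffer.
Multi-seed sampling increases offline compute (each task must be run several times).
Replay verification needs source-task clones, so it may not apply to environments that cannot be reproduced (e.g., real-time interactive tasks).
Topology construction and routing depend on the task context representation, so a vague context may cause routing errors.
The paper focuses mainly on text and table tasks; applicability to purely visual or multimodal tasks needs further validation (DocVQA is the only multimodal test).
It is only weakly related to keyword directions such as world models, representation learning, and reinforcement learning, and does not involve internal model representations or world modeling.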

Relevance To Keywords:

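Unify Models: The paper does not involve model unification, but skill evolution can be viewed as a lightweight way of unifying model behavior.
World Models: The paper does not build a world model directly, but skill documents can be viewed as structured knowledge of the task world.
Representation Learning: The paper does not involve representation learning; skill extraction relies on the LLM's implicit representations.
Model-Based RL: The paper does not involve reinforcement learning, but replay verification resembles model-based evaluation.
Native multimodal large models: The paper mainly targets LLMs, but the experiments include the multimodal DocVQA benchmark, and the skills can be used across modalities.
Unified understanding and generation in multimodal large models: The paper does not involve unified generation and focuses only on understanding tasks.
Representation learning: No direct connection.
World models: Weak connection; the skill topology can be viewed as an abstract model of the task world.
Reinforcement learning: Weak connection; replay verification resembles value estimation, but there is no policy optimization.
Post-training: The method needs no training and is a skill evolution method applied at the post-training stage.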

29. Trajectory-Level Redirection Attacks on Vision-Language-Action Models (PASS)

Score: 49.5 / 35.2

Authors: Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

Published: 2026-06-11

TL;DR: This paper identifies a trajectory-level vulnerability in Vision-Language-Action models where adversarial prompts can redirect robot outcomes while preserving command semantics, demonstrated through an on-policy prompt search method.

摘要翻译

视觉 - 语言 - 动作（VLA）策略将自然语言引入闭环机器人控制，使机器人能够直接根据文本指令执行操作任务。相同的接口使文本在控制中扮演持续的角色，因为提示在每个重规划步骤中被重用，且每个基于提示的动作都会改变策略所作用的未来观测。现有的 VLA 攻击研究对抗性提示，这些提示能诱导出目标低级动作，或使此类动作在变化的图像中持续存在。我们发现了一种更强的轨迹级故障模式：一个提示看似仍指定了预期任务，却重定向了最终的物理结果。我们在数学上将此设定形式化为“命令保持轨迹重定向”（command-preserving trajectory redirection），这是一种仅提示的威胁模型：攻击者在回合前选择一个提示，所有策略和环境组件保持固定，且提示必须接近良性指令，同时省略目标词和修正语言。为了找到此类提示，我们引入了一种同策略提示搜索方法，该方法利用轨迹模拟发现扰动，使其闭环行为跟踪目标任务，同时满足命令保持约束。在仿真和硬件上的实验表明，接近良性的提示扰动可将 VLA 轨迹重定向至攻击者指定的目标。这些结果揭示了 VLA 指令关联（grounding）中的轨迹级漏洞：看似保持预期命令的文本仍可能赋予攻击者控制机器人最终物理结果的能力。项目网站：https://vla-redirection-attack.github.io/

Abstract

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	5.0/10	7.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	5.0/10	7.5

Scoring rationale: The paper focuses on security vulnerabilities in Vision-Language-Action (VLA) models rather than architectural construction. MultiModal is highly relevant as VLA integrates vision, language, and action. MLLM and model-based RL are moderately relevant due to the underlying technology and use of rollouts for attack generation. Unify Models is relevant as VLA unifies modalities. Tokenizer, Visual Encoder, World Models, and Latent Reasoning are less relevant as they are not the focus of this security study. No expert authors from the specified list were found. The weighted total score is 49.5, exceeding the dynamic passing score of 35.2.

Keywords

Vision-Language-Action, Adversarial Attacks, Trajectory Redirection, Prompt Perturbation, Robot Control, Command-Preserving, Rollouts

In-depth analysis

Chinese Title: 视觉-语言-动作模型的轨迹级重定向攻击

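Summary: This paper studies a security vulnerability of vision-language-action (VLA) models in closed-loop robot control. VLA models treat the natural language instruction as a persistent conditioning signal that is reused at every replanning step, so an attacker can redirect the robot's final physical outcome with a text perturbation that appears to preserve the original instruction. The authors formally define the "command-preserving trajectory redirection" threat model, which requires the attack prompt to stay textually close to the benign instruction, to contain no target-task vocabulary, and to contain no override or correction language. To find such prompts, they propose an on-policy prompt search method that uses rollouts to collect the state distribution induced by the current candidate prompt and re-optimizes under the constraints. Simulation and hardware experiments evaluate several VLA architectures (including discrete-token action prediction, flow matching, diffusion action heads, and continuous action chunks) and show attack success rates above 90% on 7 of 9 architectures. The paper also evaluates defenses and concludes that command-level normalization is needed rather than surface-level sanitization alone. The work exposes a trajectory-level vulnerability in VLA instruction grounding.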

Innovations:

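First to formally define the command-preserving trajectory redirection threat model, in which the attacker can only modify the prompt text and must keep it semantically consistent with the benign instruction.
Proposes an on-policy prompt search method that follows the DAgger idea of aggregating data from the state distribution induced by the current candidate prompt, overcoming the inaccuracy of scoring on fixed observations.
Systematically evaluates the attack on several VLA architectures (discrete tokens, flow matching, diffusion, continuous action chunks, and others), showing that the vulnerability is widespread.
Evaluates defenses against trajectory-level attacks and shows that surface-level text sanitization alone is not enough; command-level normalization is needed.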

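Methodology: The paper combines a formal threat model with an on-policy prompt search algorithm. It first defines the constraints on command-preserving prompt perturbations (character edit distance, target vocabulary detection, readability checks, and so on). It then proposes on-policy prompt search: for each candidate prompt, run a rollout in the environment and record the visited observations; relabel those observations with actions from the frozen VLA policy under the target instruction; compute the loss and optimize the prompt while enforcing the text constraints. Like DAgger, this ensures that optimization is based on the closed-loop state distribution rather than on fixed offline observations. Experiments run in simulation (e.g., Maniskill2) and on real robots, and evaluate several VLA models.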
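The text-side constraints can be made concrete with a small filter that checks a candidate prompt against a character edit-distance budget and a banned-word list. This is a sketch of the constraint check only, not the search procedure, and it omits the readability check. The edit budget, the banned words, and the function names are assumptions chosen for illustration.

```python
from typing import Iterable

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def is_command_preserving(benign: str, candidate: str,
                          banned_words: Iterable[str], max_edits: int) -> bool:
    """Accept a candidate only if it stays close to the benign prompt and names no target."""
    tokens = set(candidate.lower().split())
    if tokens & {word.lower() for word in banned_words}:
        return False
    return edit_distance(benign, candidate) <= max_edits

benign_prompt = "pick up the red block"
banned = ["blue", "instead", "ignore"]  # hypothetical target and correction words
ok_small_typo = is_command_preserving(benign_prompt, "pick up the red blokc", banned, max_edits=3)
ok_names_target = is_command_preserving(benign_prompt, "pick up the blue block", banned, max_edits=3)
```

A filter of this kind rules out the trivial attack of simply writing the target task into the prompt, which is what makes successful redirection under the constraint notable.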

Key Results:

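On 7 of 9 VLA architectures the attack success rate exceeds 90%, showing that trajectory-level redirection attacks are widespread.
Attack prompts need only small text changes (such as character edits) to make the robot complete the attacker-specified alternative task while the benign task fails.
Existing defenses (such as input sanitization and spell checking) do not mitigate the attack effectively; command-level normalization or adversarial training is needed.
On-policy prompt search outperforms optimization on fixed offline observations, because the latter cannot capture the shift in the closed-loop state distribution.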

Tech Stack:

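VLA models: π0.5, OpenVLA, RT-2, and others
Attack optimization: Greedy Coordinate Gradient (GCG) algorithm
Online data aggregation: DAgger (Dataset Aggregation)
Text constraints: character edit distance, vocabulary blacklist, readability regularization
Simulation environments: Maniskill2 and others
Action representations: discrete tokens, flow matching, diffusion, continuous action chunks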

Strengths:

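First systematic study of trajectory-level redirection attacks on VLA models, filling the gap left by prior work that considers only single-step or persistent actions.
The formal threat model is clear, and its constraints (command preservation) are reasonable and rule out simple prompt substitution.
The experiments cover several mainstream VLA architectures, and the results are convincing.
The proposed on-policy prompt search method handles the closed-loop state distribution shift effectively.

Limitations:

The attack needs access to the environment for rollouts, so it may not apply in fully black-box settings.
The defense evaluation is preliminary, and no robust defense is proposed.
The experimental environments may be limited, and generalization to complex real-world scenes remains to be validated.
Automating the command-preserving constraints (e.g., readability checks) may depend on humans or rules and is therefore somewhat subjective.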

Relevance To Keywords:

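Native multimodal large models: VLA models are a typical robotics application of multimodal large models, and this paper studies their security vulnerabilities.
Representation learning: VLA models jointly learn visual and language representations, and the attack exploits the fragility of those representations.
World models: Closed-loop VLA control relies on implicit modeling of world state, and the attack influences state evolution by changing the prompt.
Reinforcement learning: A VLA policy can be viewed as a mapping from language to actions, and the attack resembles a policy perturbation, which relates to adversarial attacks in RL.
Post-training: The attack optimization resembles adversarial training in post-training, but this paper is about attack rather than defense.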

30. Surflo: Consistent 3D Surface Flow Model with Global State (PASS)

Score: 46.5 / 35.2

Authors: Antoine Guédon, Shu Nakamura, Nicolas Dufour, Jiahui Lei, Ko Nishino, Angjoo Kanazawa

Published: 2026-06-11

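TL;DR: Surflo removes the fixed-resolution and overlapping-pointmap limitations of feed-forward 3D reconstruction by compressing multiple views into a global latent state and decoding 3D surfaces at arbitrary resolution with flow matching.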

摘要翻译

几何结构具有视图不变性，这使得任何图像集合都成为单一 3D 状态的冗余编码。现有的前馈重建模型未能利用这一点：逐视图方法生成重叠且未对齐的点图 (pointmaps)，其数量随输入数量线性增长；而全局潜在方法则局限于固定的低分辨率输出。我们提出了 Surflo，它将可变数量的未标定姿态 RGB 视图压缩为 K 个潜在标记——一个全局状态——并通过流匹配 (flow matching) 独立地将它们从噪声传输到表面上，从而解码出定向 3D 表面点。这使得输出摆脱了任何固定网格或标记预算的限制：同一个潜在变量在单次前向传播中可生成数千至百万个点。为了抑制独立逐点解码固有的局部不一致性，一个推理时引导项通过在 ODE 积分期间注入光度梯度来关联邻近点。Surflo 在表面度量上匹配或超越了前馈基准方法，比需要数百个视图的基于优化的方法快一个数量级，并且是唯一结合全局潜在变量与任意分辨率解码的前馈方法。

Abstract

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	6.0/10	9.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	0.0/10	0.0

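Scoring rationale: The paper proposes the Surflo model, whose core is using a global latent and flow matching to compress multi-view images into a unified representation and decode 3D surfaces. Relevance analysis: 'Latent Reasoning' (7.0) and 'Tokenizer' (6.0) are highly relevant because the model depends on latent tokens and geometric reasoning. 'Unify Models' (5.0) and 'Visual Encoder' (6.0) are moderately relevant because the model unifies and encodes views. 'World Models' (4.0) relates to the global state concept, but this is not a typical temporal model. 'MultiModal' (3.0) covers the cross-modal mapping from images to 3D. 'MLLM', 'model-based RL', and 'Agentic Reasoning' (0.0) are unrelated. The author list does not include any designated experts, so no bonus is added. The weighted total of 46.5 exceeds the dynamic passing score of 35.2.

Keywords

3D Surface Flow, Global State, Flow Matching, Latent Tokens, Viewpoint Invariant, Feed-forward, Arbitrary-resolution

In-depth analysis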

Chinese Title: Surflo：具有全局状态的一致3D表面流模型

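Summary: This paper proposes Surflo, a feed-forward 3D surface reconstruction model. Its core idea is to compress any number of unposed RGB views into a fixed-size global latent representation and to decode any number of oriented surface points independently from noise via flow matching. Existing methods either output per-view point clouds, which are redundant and hard to fuse, or are limited to fixed-resolution output. Surflo's encoder extracts geometric features with a frozen VGGT backbone and compresses them into K latent tokens with a Perceiver. The decoder transports each query point to the surface independently with conditional flow matching and adds an inference-time guidance term (based on gradients of a photometric loss and a depth loss) that correlates nearby points and removes the inconsistencies of independent decoding. On 8 benchmarks, Surflo matches or surpasses feed-forward baselines and runs an order of magnitude faster than optimization methods that need hundreds of views. The authors also contribute a version of the DL3DV dataset with watertight meshes.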

Innovations:

将可变数量未标定RGB视图压缩为固定大小的全局潜在表示，独立于视图数量。
基于流匹配的解码器，可从同一潜在表示解码任意数量的有向表面点（从数千到百万），实现任意分辨率输出。
推理时引导机制，通过注入光度梯度（和可选深度梯度）关联邻近点，抑制独立解码产生的离群点和不一致性。
贡献了首个大规模真实场景级水密网格数据集（DL3DV的网格版本），包含约10.5K场景及约10^7个有向点。

Methodology: 编码器：使用冻结的VGGT骨干提取多视图补丁token和相机token，通过3D位置编码（傅里叶特征）增强，再经Perceiver交叉注意力压缩为K个固定潜在token。解码器：采用流匹配框架，将查询点（3D坐标+法线）从混合高斯源分布线性插值到目标表面分布，通过条件变换器（交叉注意力+Ada-LN）预测速度，训练损失为流匹配损失。推理时：使用欧拉求解器积分ODE，并在t≥0.95时引入引导项，通过梯度下降优化全局渲染损失（基于高斯泼溅的RGB损失和可选单目深度损失）来调整速度。
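The decoding procedure described above (linear interpolation path, velocity prediction, Euler integration) can be illustrated with a toy example. This is a minimal sketch, not Surflo's implementation: `toy_velocity` is a hypothetical stand-in for the learned conditional transformer, the point is a plain 3D coordinate without a normal, and the inference-time guidance term is omitted.

```python
import random

def interpolate(x0, x1, t):
    """Linear flow-matching path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network along the linear path."""
    return [b - a for a, b in zip(x0, x1)]

def toy_velocity(x, t, surface_point):
    """Stand-in for the learned network (hypothetical): the conditional
    velocity that carries x to a known surface point by t = 1."""
    return [(s - xi) / (1.0 - t) for xi, s in zip(x, surface_point)]

def euler_decode(x0, surface_point, steps=20):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = toy_velocity(x, t, surface_point)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]   # sample from the source distribution
surface = [0.5, -0.2, 1.0]                        # toy target surface point
decoded = euler_decode(noise, surface)
print(decoded)
```

Because each query point is integrated on its own, the number of decoded points is a free choice at inference time, which is the property the summary calls arbitrary-resolution decoding.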

Key Results:

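On 8 3D reconstruction benchmarks, Surflo matches or surpasses feed-forward baselines (such as VGGT and InstantSplat) on surface metrics.
Runs an order of magnitude faster than optimization-based methods that require hundreds of views (such as Gaussian Wrapping).
Decodes from 1K to 1M points from the same latent representation, supporting both quick previews and dense surfaces.
The guidance mechanism effectively reduces outliers and recovers fine geometric detail.

Tech Stack:

VGGT (frozen backbone)
Perceiver cross-attention compression
Fourier feature encoding (3D positional encoding)
Flow matching
LogitNormal time distribution
Ada-LN (adaptive layer normalization)
Euler ODE solver
Gaussian splatting for the rendering loss
Monocular depth estimation (from VGGT or Depth Anything)

Strengths:

The global latent representation follows from the viewpoint invariance of geometry and avoids per-view redundancy.
Decoding resolution is flexible, from sparse to dense output, to suit different applications.
Inference-time guidance effectively addresses the inconsistency of independent per-point decoding.
Building on a frozen VGGT backbone exploits its strong geometric features and keeps training efficient.
Reaches or approaches SOTA on multiple benchmarks while remaining fast.

Limitations:

Depends on VGGT's geometric features; if VGGT fails under extreme viewpoints or lighting, performance may suffer.
The guidance mechanism needs extra computation (gradient-descent steps), which increases inference time.
Currently handles static scenes only; dynamic or deformable objects are not addressed.
Training data relies on the meshed version of DL3DV and may be limited by that dataset's quality and diversity.

Relevance To Keywords:

Representation learning: Surflo compresses multi-view images into a fixed-size global latent representation, a form of scene-level representation learning.
World models: the latent representation can be seen as an implicit world state of the scene that supports geometry queries from any viewpoint, which relates to building interactive world models.
Unified understanding and generation in multimodal large models: although Surflo focuses on 3D geometry, its encode-decode framework resembles the unification of multimodal understanding (images to geometry) and generation (noise to surface).
Post-training: the paper involves neither reinforcement learning nor post-training, although flow-matching training could loosely be seen as a form of generative post-training.
Unify Models: Surflo handles varying numbers of inputs and varying output resolutions in a single model, which reflects model unification.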

31. JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space [PASS]

Score: 46.5 / 35.2

Authors: Xinnan Zhu, Ruijie Xu, Jiayu Ying, Daoguo Dong, Jiachen Xu, Yuan Xie, Xin Tan

Published: 2026-06-11

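TL;DR: JointEdit3D proposes a feed-forward 3D scene editing method built on a unified latent space, improving edit quality and structural completeness without per-scene optimization.

Abstract (Chinese Translation)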

现有的 3D 场景编辑方法通常依赖于基于显式 3D 表示的逐场景优化，或级联编辑与重建流程，导致测试时成本高、3D 感知能力有限以及结构不一致。为了在编辑过程中耦合外观合成与几何预测，我们基于一个统一的 RGB-几何重建生成潜在空间，并将其适配用于前馈 3D 场景编辑。所得框架 JointEdit3D 通过仅观察一个编辑过的 RGB 参考潜在向量，并在源场景锚定下生成剩余的 RGB 视图和编辑几何潜在向量，执行非对称潜在图像修复。JointEdit3D 引入一个专用的 SceneAnchor Branch 以注入源场景结构而不强制直接复制，并采用编辑/背景感知损失来平衡编辑区域保真度与未编辑内容的保留。为了解决标准化 3D 场景编辑评估中缺乏配对数据的问题，我们引入了 SceneEdit3D-15K，这是一个包含 15K 个配对编辑样本及渲染器提供 3D 标注的数据集，以及 SceneEdit3D-Bench，一个包含 100 个样本的精选基准。实验表明，JointEdit3D 在提高编辑区域质量和 3D 结构完整性方面优于先前基线方法，同时保持了具有竞争力的背景保留效果。

Abstract

Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	8.0/10	12.0
Agentic Reasoning	1.5	0.0/10	0.0

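Scoring rationale: The paper's core is feed-forward 3D scene editing in a unified latent space, which is highly relevant to 'Unify Models' and 'Latent Reasoning'; it involves a multimodal RGB-and-geometry representation but has no direct connection to Tokenizer, MLLM, RL, or agents. The author list contains none of the designated experts.

Keywords

3D Scene Editing, Unified Latent Space, Feed-Forward, RGB-Geometry, Latent Inpainting, SceneAnchor Branch, Structural Completeness

In-Depth Analysis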

Chinese Title: JointEdit3D：统一潜在空间中的前馈式3D场景编辑

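Summary: JointEdit3D proposes a feed-forward 3D scene editing method that targets the problems of existing approaches: editing decoupled from reconstruction, high test-time cost, and poor 3D consistency. Building on a unified RGB-geometry reconstruction-generation latent space, it treats editing as asymmetric latent inpainting: it observes only a single edited RGB reference latent and generates the remaining RGB views and the edited geometry latent. The paper introduces a SceneAnchor Branch that injects source-scene structure without forcing direct copying, and designs edit/background-aware losses that balance edited-region fidelity against preservation of unedited content. To fill the gap in standardized evaluation data for 3D scene editing, the authors build the SceneEdit3D-15K dataset (15K paired editing samples with 3D annotations) and the SceneEdit3D-Bench benchmark. Experiments show that JointEdit3D outperforms existing methods on edited-region quality and 3D structural completeness while keeping background preservation competitive.

Innovations:

Formulates 3D scene editing as feed-forward generation in a unified RGB-geometry latent space, coupling edit decisions, appearance synthesis, and geometry updates instead of separating view editing from reconstruction.
Proposes the SceneAnchor Branch, which injects source-scene structure through interleaved residual conditioning, achieving edit-aware preservation of source content without an edit mask at inference time.
Designs edit/background-aware loss functions that separate the RGB and geometry latents of edited and background regions and emphasize positions with large latent changes and positions far from the reference frame.
Builds the SceneEdit3D-15K dataset (the first paired scene-level 3D editing dataset, with 15K samples and renderer-provided 3D annotations) and the SceneEdit3D-Bench benchmark (100 samples) to support standardized evaluation.

Methodology: The paper uses a feed-forward generative framework built on Gen3R's joint RGB-geometry latent space. Edit conditioning is implemented as single-frame inpainting: the edited reference frame is encoded into a latent and placed at its position in a zero-padded tensor, all other positions are set to zero, and a binary mask is appended. Source-scene preservation is handled by the SceneAnchor Branch, which receives edit cues from the main branch and injects source RGB-geometry features into the frozen generator through interleaved residual conditioning. Training uses edit/background-aware losses that compute the MSE separately over edited and background regions and give higher weight to frames with larger latent changes and frames farther from the reference frame. The dataset is rendered with Blender and contains paired before/after scenes, language instructions, edited reference frames, edit masks, and 3D annotations.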
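Two pieces of the methodology above lend themselves to a small illustration: the zero-padded conditioning with a binary frame mask, and a loss that averages the MSE separately over edited and background positions. The sketch below uses flat Python lists in place of latent tensors; the function names and the weights `w_edit` and `w_bg` are hypothetical, and the extra weighting by latent change and distance from the reference frame is left out.

```python
def build_condition(ref_latent, ref_index, num_frames):
    """Zero-padded conditioning: only the reference frame carries a latent,
    and a binary mask marks which frame is observed."""
    dim = len(ref_latent)
    frames = [[0.0] * dim for _ in range(num_frames)]
    frames[ref_index] = list(ref_latent)
    mask = [1 if i == ref_index else 0 for i in range(num_frames)]
    return frames, mask

def region_mse(pred, target, mask, select):
    """Mean squared error over positions where mask == select."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m == select]
    return sum(errs) / len(errs) if errs else 0.0

def edit_background_loss(pred, target, edit_mask, w_edit=1.0, w_bg=1.0):
    """Separately averaged edited-region and background MSE terms.
    w_edit and w_bg are illustrative weights, not values from the paper."""
    return (w_edit * region_mse(pred, target, edit_mask, 1)
            + w_bg * region_mse(pred, target, edit_mask, 0))

frames, frame_mask = build_condition([0.3, -0.1], ref_index=1, num_frames=3)

pred = [1.0, 2.0, 3.0, 4.0]
target = [1.0, 0.0, 3.0, 2.0]
edit_mask = [0, 1, 0, 1]          # 1 marks edited latent positions
loss = edit_background_loss(pred, target, edit_mask)
print(frames, frame_mask, loss)
```

Averaging the two regions separately keeps a small edited area from being swamped by a large, easy background, which is one plausible reading of why the loss is split.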

Key Results:

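JointEdit3D outperforms existing optimization-based and feed-forward baselines on edited-region quality and 3D structural completeness.
On background preservation, JointEdit3D stays competitive with existing methods and shows no clear degradation.
The SceneAnchor Branch and the edit/background-aware losses both contribute significantly to edit fidelity and source-content preservation.
The SceneEdit3D-15K dataset and SceneEdit3D-Bench benchmark provide a basis for standardized 3D scene editing evaluation.

Tech Stack:

Gen3R unified RGB-geometry latent space
Wan video diffusion model (Wan Flow Transformer)
Video VAE
VGGT feature extractor
CLIP image encoder (ϕCLIP)
T5 text encoder (ϕT5)
3D convolutional patch embedding
Interleaved residual fusion (Residual Adapter / ControlNet-style branch)
MSE loss (edit/background-aware weighting)
Blender rendering engine (dataset construction)

Strengths:

The feed-forward design greatly reduces test-time compute; no per-scene optimization is needed.
The unified latent space couples appearance and geometry generation, improving cross-view consistency and 3D structural completeness.
The SceneAnchor Branch preserves source-scene content effectively without an explicit mask, giving it implicit edit localization.
Builds a large-scale paired 3D editing dataset and a standardized benchmark, filling a gap in the field.
The edit/background-aware loss is well designed and balances edited-region fidelity against background preservation.

Limitations:

Conditioning on a single edited reference frame may lack expressive power for edits that need multiple frames or complex spatio-temporal changes.
The SceneAnchor Branch uses only RGB source information and does not exploit source geometry, which may limit geometry preservation.
The dataset is synthesized with Blender, so there is a domain gap to real scenes; generalization to real scenes needs further validation.
The loss weights are sensitive to the size and position of the edited region, so hyperparameters may need tuning for different edit types.

Relevance To Keywords:

Unify Models: JointEdit3D unifies appearance generation and geometry prediction in one latent space, reflecting the idea of model unification.
World Models: modeling a scene's visual and structural information in a joint RGB-geometry latent space can be seen as implicit modeling of the scene's world state.
Representation Learning: one of the paper's core contributions is building and adapting a unified RGB-geometry latent representation for editing.
Model-Based RL: the paper does not involve reinforcement learning; feed-forward editing can be viewed as a model-based scene transformation, and future work could combine it with RL to optimize editing strategies.
Native multimodal large models: uses CLIP and T5 to encode multimodal conditions (image plus text) with a diffusion model as the generative backbone, in line with the multimodal large model paradigm.
Unified understanding and generation in multimodal large models: the model both understands the source scene and the editing instruction and generates the edited multi-view RGB and geometry, unifying understanding and generation.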

32. ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance [PASS]

Score: 46.5 / 35.2

Authors: Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt

Published: 2026-06-11

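TL;DR: ReFree-S2V combines multi-level speech guidance with a reward-free reinforcement learning scheme to resolve the trade-off between lip synchronization and expressiveness in speech-driven video generation, achieving state-of-the-art performance.

Abstract (Chinese Translation)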

语音驱动的说话角色动画旨在生成逼真的肖像视频，以呈现自然的对话行为，并使面部运动与语音对齐。尽管视频生成领域的最新进展显著提升了基于视频动画的真实感，但实现准确的唇部发音与富有表现力的行为仍然具有挑战性。现有方法通常在精确的音素 - 唇同步与动态面部表情及头部运动之间进行权衡，导致生成的动画要么准确但僵硬，要么表现力强但同步性不佳。为应对这一挑战，我们提出了 ReFree-S2V，这是一种基于预训练视频生成模型的流匹配 (Flow-Matching) 语音 - 肖像动画框架，旨在实现精细的语音发音与高层级的表现力提示。该模型引入了多层次语音表示，能够在局部和全局粒度下捕捉音系及韵律信息。这些表示通过可学习的层级选择器被选择性注入到 Transformer 模块中，从而实现准确的唇同步与自然的表现性运动。为实现自然的头部运动，我们进一步将一种新颖的无奖励强化学习方案引入流匹配训练，以抑制感知上不可信的运动，而无需依赖手工设计的同步指标或奖励模型，也无需承担人类偏好标注的高昂成本。大量实验表明，ReFree-S2V 达到了最先进的性能，在定量唇同步准确性以及定性人类评估的自然性和表现力方面均显著优于现有方法。

Abstract

Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	6.0/10	9.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	1.0/10	1.5

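Scoring rationale: The paper studies speech-driven video generation using flow matching and reward-free reinforcement learning. MultiModal is highly relevant (it fuses the speech and video modalities), and model-based RL is moderately relevant (the title and abstract explicitly describe an RL scheme used in training). Unify Models and World Models are somewhat related (generative models and video prediction) but are not the core innovation. Tokenizer, MLLM, Latent Reasoning, and Agentic Reasoning are only weakly related to the paper's core content. The author list contains none of the designated experts (Yang Shi et al.), so no bonus is added. The weighted total is 46.5, above the dynamic passing score of 35.2.

Keywords

Co-Speech Video Generation, Reward-Free RL, Flow-Matching, Multi-level Speech Guidance, Speech-to-Portrait Animation, Lip Synchronization, Expressive Motion

In-Depth Analysis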

Chinese Title: ReFree: 面向真实感共语视频生成的无奖励强化学习与多级语音引导

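Summary: This paper proposes ReFree-S2V, a framework for speech-driven portrait animation. Built on a pretrained video generation model (Wan 5B), it uses a multi-level speech representation (from local phonemes to global prosody) and learnable level selectors to inject speech features of different granularity into different Transformer layers, achieving accurate lip synchronization together with natural expressions and head motion. Because supervised fine-tuning struggles to deliver lip sync and expressiveness at the same time, the authors introduce a reward-free reinforcement learning fine-tuning strategy: a contrastive curriculum of real samples and negative samples (containing unnatural motion) lets the model learn perceptual realism in a self-supervised way, with no human annotation or handcrafted reward function. Experiments show that the method reaches state-of-the-art results on quantitative lip-sync accuracy and qualitative human evaluation, significantly outperforming existing methods.

Innovations:

Proposes a multi-level speech representation that captures information at multiple granularities, from local phonemes to global prosody, using CNN layers with different kernel sizes and a shared MLP.
Designs learnable level selectors that dynamically choose the most suitable speech-feature granularity for each Transformer block, enabling targeted injection.
Replaces conventional windowed cross-attention with multi-head FiLM layers, which provide more flexible scale-and-shift transformations and strengthen speech conditioning.
Proposes a reward-free RL fine-tuning strategy that performs self-supervised contrastive learning over a curriculum of negative samples (containing unnatural motion), with no human preference annotation or explicit reward function.

Methodology: Built on the pretrained video generation model Wan 5B, training has two stages. Stage one uses the flow-matching objective, fine-tuning the DiT blocks with LoRA while fully optimizing the speech injection blocks and the multi-level speech encoder. Stage two adds reward-free RL fine-tuning: training presents both real videos and constructed negative samples (for example, poor lip sync or rigid head motion), and a contrastive loss steers the model toward more natural motion. The multi-level speech encoder consists of CNN layers with different kernel sizes and a shared MLP; its outputs are fused by weighted gating and fed to multi-head FiLM layers, which scale and shift the video hidden states.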
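The last step of the methodology, weighted gating followed by FiLM scale-and-shift, can be sketched on toy vectors. This is an illustration under stated assumptions: the softmax gate, the two-level toy features, and the mapping from fused features to `gamma` and `beta` are hypothetical choices; only the FiLM form `gamma * h + beta` is the standard definition.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(level_features, gate_logits):
    """Weighted-gating fusion: softmax weights mix the per-level features.
    (Using a softmax gate is an assumption made for this sketch.)"""
    weights = softmax(gate_logits)
    dim = len(level_features[0])
    return [sum(w * f[i] for w, f in zip(weights, level_features)) for i in range(dim)]

def film(hidden, gamma, beta):
    """Feature-wise linear modulation: scale then shift each channel."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]

levels = [[1.0, 0.0], [0.0, 1.0]]                      # two toy speech levels, 2 channels
fused = gated_fusion(levels, gate_logits=[0.0, 0.0])   # equal gates average the levels
hidden = [2.0, 4.0]                                    # toy video hidden state
# Hypothetical mapping from fused speech features to FiLM parameters.
out = film(hidden, gamma=[1.0 + f for f in fused], beta=fused)
print(fused, out)
```

In a per-block level selector, each Transformer block would own its gate logits, so different blocks can favor phoneme-scale or prosody-scale features.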

Key Results:

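On quantitative metrics, lip-sync accuracy (such as LMD and SyncNet confidence) is significantly better than existing methods including Wav2Lip, SadTalker, and FantasyTalking.
In qualitative human evaluation, the generated videos receive the highest ratings for naturalness, expressiveness, and lip-sync quality.
Ablations confirm the effectiveness of the multi-level speech representation, the level selectors, and multi-head FiLM.
Reward-free RL fine-tuning further improves the naturalness of head motion and overall perceptual realism.

Tech Stack:

Flow-matching generative framework
DiT (Diffusion Transformer) architecture
LoRA (Low-Rank Adaptation) fine-tuning
Multi-head FiLM (Feature-wise Linear Modulation) layers
CNN (convolutional neural network) and MLP (multilayer perceptron)
Weighted gating
Reward-free RL with contrastive learning
Pretrained model: Wan 5B

Strengths:

Achieves accurate lip sync and natural expression and head motion at the same time, overcoming the trade-off in existing methods.
The reward-free RL strategy avoids costly human annotation and complex reward design, which makes it scalable.
The multi-level speech representation and level selectors are well designed and adapt flexibly to speech-motion associations at different granularities.
Builds on a large pretrained model, giving high video quality with efficient LoRA fine-tuning.

Limitations:

Depends on the quality of the pretrained video generation model (Wan 5B); biases in the base model may carry over to the final results.
The negative-sample construction strategy (which kinds of unnatural motion are included) may not be comprehensive, which could weaken contrastive learning.
The paper does not discuss generalization to multiple languages, accents, or complex backgrounds.
Compute requirements are high; training and inference cost may limit practical deployment.

Relevance To Keywords:

Unified understanding and generation in multimodal large models: the paper combines speech understanding (multi-level representation) with video generation (flow matching) to generate portrait animation from speech.
Representation learning: the multi-level speech representation extracts features at different time scales.
World models: a video generation model can be seen as a simulator of human motion; the paper uses speech to guide the generation of plausible motion.
Reinforcement learning / post-training: the proposed reward-free RL fine-tuning is a post-training stage that improves generation quality through contrastive learning.
Native multimodality: speech and video are deeply fused within the framework rather than simply concatenated.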

33. Agents-K1: Towards Agent-native Knowledge Orchestration [PASS]

Score: 45.0 / 35.2

Authors: Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou, Liang He, Bo Zhang, Lei Bai

Published: 2026-06-11

TL;DR: Agents-K1 introduces an agent-native knowledge orchestration pipeline that constructs scientific knowledge graphs from raw documents to improve multi-hop scientific reasoning and information extraction.

Abstract (Chinese Translation)

当前基于大语言模型（LLM）的研究代理虽已通过代理编排取得进展，却很大程度上忽视了科学知识编排。现有工作常将论文简化为摘要、表面提及和扁平的引用边，遗漏了科学推理所需的关键实体、主张、证据、机制和方法谱系。为此，我们引入 Agents-K1，这是一个端到端的知识编排管道，可将原始文档转换为代理原生的科学知识图谱。Agents-K1 在统一理论基础上整合了三个组件：一个多模态解析器，其五模块模式捕获全文而非仅摘要中的实体、多模态证据、引用和实体间类型化关系；一个基于规则奖励使用 GRPO 训练的 4B 信息提取骨干网络；以及一个 graphanything CLI，一个统一网络搜索、多模态图检索和跨文档遍历的三源代理接口。在此基础上，我们处理了六个学科领域的 246 万篇科学论文以生成 Scholar-KG，其中我们发布了一百万篇论文的子集，完整的 Scholar-KG 可通过下方的 SCP 链接访问。相同的管道可扩展至通用领域语料库及符合模式的数据合成。广泛的实验表明，Agents-K1 在科学信息提取、知识图谱构建和多跳科学推理方面实现了卓越的性能。

Abstract

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	9.0/10	13.5

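Scoring rationale: The paper's core is scientific knowledge graph construction and agent orchestration, which is highly relevant to Agentic Reasoning (title and core content) and MultiModal (the multimodal parser). It mentions a unifying theoretical foundation but does not involve a Unify Models architecture; the 4B backbone may involve an MLLM, but that is not the core contribution; it makes no explicit mention of Tokenizer, Visual Encoder, World Models, or model-based RL (GRPO is used for training, but it is not model-based reinforcement learning). The author list contains none of the designated experts.

Keywords

Knowledge Graph Construction, Scientific Reasoning, Multimodal Parsing, Agent Orchestration, Information Extraction, Scholar-KG, GRPO Training

In-Depth Analysis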

Chinese Title: Agents-K1：迈向智能体原生的知识编排

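Summary: This paper proposes Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw scientific literature into agent-native multimodal knowledge graphs. The pipeline integrates three core components: a multimodal parser with a five-module schema that extracts entities, multimodal evidence, citation intents, and typed relations from the full text rather than the abstract alone; a 4B-parameter information-extraction backbone trained with GRPO under a rule-based reward; and the GraphAnything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. With this pipeline the authors process 2.46 million scientific papers across six subjects to build Scholar-KG and release a one-million-paper subset. Experiments show that Agents-K1 performs strongly on scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning, and significantly raises LLM accuracy on research questions.

Innovations:

Proposes a unified agent-native knowledge orchestration framework that integrates the KG, the LLM, and the CLI into an end-to-end pipeline designed for a research agent's complete workflow.
Builds Scholar-KG, a million-scale, full-text, multimodal scientific knowledge graph covering 2.46 million papers, extracting structured knowledge such as entities, claims, evidence, method lineages, and citation intents.
Trains a 4B-parameter information-extraction backbone with GRPO and a rule-based reward; on NER, relation extraction, and long-document structured extraction it surpasses 8B open-source models and approaches 32B model performance.
Develops the GraphAnything CLI, which provides tri-source knowledge retrieval (web search, graph retrieval, cross-document traversal) and a closed-loop research workflow (idea generation, method specification, code synthesis).
Provides a theoretical analysis showing that organizing evidence in a single connected graph is more reliable than searching independent text fragments, supporting auditable retrieval and knowledge provenance.

Methodology: Agents-K1 uses a three-stage pipeline. 1) KG layer: the MinerU offline parser performs multimodal parsing of PDFs, and a five-module schema (metadata, explicit mentions, implicit abstractions, citation intents, fine-grained entity relations) is applied to build the knowledge graph; the layer also supports LLM-guided multi-hop QA dataset generation and schema-adaptive extension to general documents. 2) LLM layer: starting from a 4B-parameter model, the information-extraction backbone is trained with Group Relative Policy Optimization (GRPO) and a rule-based reward function that jointly supervises format compliance, JSON validity, and a task-conditioned F1 score. 3) CLI layer: implements tri-source knowledge retrieval and fusion (web search, multimodal graph retrieval, cross-document traversal) and provides graph operators and multi-agent coordination, supporting a closed loop from idea to experiment. The theory section uses projection rules and propositions with proofs to explain why a connected graph makes cross-source reasoning more reliable.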
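A rule-based reward of the kind described in the LLM layer (format compliance, JSON validity, task F1) might look like the sketch below. The three components come from the text; the weights, the `"entities"` key, and the exact compliance rule are hypothetical choices made for illustration, not the paper's reward.

```python
import json

def f1_score(predicted, gold):
    """Set-based F1 between predicted and gold items."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def rule_reward(output_text, gold_entities, w_format=0.2, w_json=0.2, w_f1=0.6):
    """Combine format compliance, JSON validity, and F1 into one scalar.
    The weights and the "entities" key are illustrative assumptions."""
    text = output_text.strip()
    # Format compliance: the answer is wrapped in a single JSON object.
    format_ok = text.startswith("{") and text.endswith("}")
    try:
        parsed = json.loads(text)
        json_ok = isinstance(parsed, dict)
    except json.JSONDecodeError:
        parsed, json_ok = {}, False
    raw = parsed.get("entities", []) if json_ok else []
    entities = [e for e in raw if isinstance(e, str)] if isinstance(raw, list) else []
    return (w_format * float(format_ok)
            + w_json * float(json_ok)
            + w_f1 * f1_score(entities, gold_entities))

gold = ["GRPO", "MinerU", "Scholar-KG"]
good = rule_reward('{"entities": ["GRPO", "MinerU"]}', gold)
bad = rule_reward("entities: GRPO", gold)
print(good, bad)
```

Because every term is computed by fixed rules, the reward needs no learned reward model, which is what makes it cheap to evaluate inside a GRPO training loop.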

Key Results:

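On the FrontierScience-Research benchmark, overall accuracy rises from 7.9% to 24.6% for Gemini-3 and from 25.2% to 39.4% for GPT-5.2.
On earth-science research questions, Gemini-3 reasoning accuracy rises from 52.3% to 69.5%.
Achieves state-of-the-art results on multi-hop QA tasks (HotpotQA, 2WikiMultiHopQA, MuSiQue), surpassing nine graph-augmented retrieval baselines.
The information-extraction backbone surpasses 8B open-source models on ten benchmarks and matches 32B model performance on NER.
Builds Scholar-KG from 2.46 million papers spanning six subjects: computer science, chemistry, biology, earth science, physics, and materials science.

Tech Stack:

MinerU (offline PDF parser)
Group Relative Policy Optimization (GRPO)
Rule-based reward function (format compliance, JSON validity, task-conditioned F1)
Five-module knowledge graph schema (metadata, explicit mentions, implicit abstractions, citation intents, fine-grained entity relations)
GraphAnything CLI (tri-source interface: web search, multimodal graph retrieval, cross-document traversal)
Multi-agent coordination and graph operators
LLM-guided multi-hop QA generation
Schema-adaptive extension (General-KG)
Theoretical framework: projection rules, identity-preserving joins, cross-view reachability, candidate coverage proofs

Strengths:

End-to-end unified pipeline from raw PDFs to an agent-usable knowledge graph, covering the full research workflow.
A million-scale scientific knowledge graph across multiple disciplines, with a released subset to support community research.
The RL-trained information-extraction model performs very well at a small parameter count, keeping cost manageable.
The theoretical analysis gives graph-structured knowledge organization a mathematical basis and improves interpretability.
The GraphAnything CLI turns a static graph into an executable research tool that supports closed-loop research automation.

Limitations:

The current pipeline mainly targets scientific literature; schema-adaptive extension to general documents (General-KG) still needs further validation.
The information-extraction backbone has 4B parameters; despite strong results, it may still trail larger models in extremely complex cases.
Knowledge graph construction depends on PDF parsing quality; MinerU may be less effective on low-quality PDFs or non-standard formats.
Evaluation is mainly on English scientific literature; generalization to Chinese or other languages is not fully explored.
The practical research value of multi-agent coordination and the closed-loop workflow needs validation on more case studies.

Relevance To Keywords:

Unify Models: the Agents-K1 framework unifies the knowledge graph, the LLM, and the CLI, reflecting the idea of model unification.
World Models: the knowledge graph acts as an external world model that gives the LLM structured scientific knowledge for reasoning and planning; Scholar-KG can be seen as a world model of the scientific domain that supports multi-hop reasoning and causal inference.
Representation Learning: the information-extraction backbone learns structured representations through GRPO, turning raw text into entities and relations; knowledge graph construction is itself a representation learning process that encodes document content as a graph.
Model-Based RL: the extraction model is trained with GRPO, which is a model-free policy-optimization algorithm rather than model-based RL; the reward is rule-based and does not come from environment interaction.
Native multimodal large models: the pipeline handles multimodal evidence such as text, figures, and tables, but it does not train a multimodal large model directly; multimodal knowledge is integrated through parsing and extraction.
Unified understanding and generation in multimodal large models: the paper focuses on understanding (extracting structured knowledge) rather than generation, although the CLI supports generating research ideas and code.
Reinforcement learning: the core training algorithm, GRPO, is a reinforcement learning method used to optimize information-extraction performance.
Post-training: the information-extraction backbone is post-trained (with GRPO) on top of a pretrained model to adapt it to specific extraction tasks.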

34. Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing [PASS]

Score: 45.0 / 35.2

Authors: Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan

Published: 2026-06-11

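TL;DR: This paper proposes the Custom ZeroCLIP framework, which uses a retrieval-augmented vision-language model to generate captions for traditional Indonesian clothing in a zero-shot setting, showing strong cultural accuracy and fluency on unseen provinces.

Abstract (Chinese Translation)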

本文提出了一种名为 Custom ZeroCLIP 的检索增强视觉 - 语言框架，用于印尼传统服饰的零样本标注。该数据集包含来自所有 38 个印尼省份的 3,800 张专家标注图像。采用基于省份的归纳零样本协议，模型在 24 个已见省份上进行训练，在 6 个已见省份上进行验证，并在 8 个未见省份上进行评估。该框架结合了冻结的 CLIP ViT-B/32 图像编码器、CLIP 文本编码器、BERT 文本编码器和 LSTM 标注解码器。推理时，未见省份的标签和标注不可用，检索仅使用来自训练省份的标注。在训练、验证及检索库构建过程中，均未使用任何未见省份的图像、标签或标注。Custom ZeroCLIP 实现了 CLIPScore 0.8536、BLEU-4 0.3342 和 METEOR 0.4859，优于现有基线模型。消融结果表明，检索机制提升了文化词汇恢复能力，使 METEOR 指标提升了 19.3%，同时人工评估证实了其在文化准确性和流畅性方面的优势。结果展示了检索增强领域适应在低资源遗产环境下生成具有文化根基的标注的有效性。该数据集公开可用，地址为 https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset。

Abstract

This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	9.0/10	13.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

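Scoring rationale: The paper is at its core a multimodal vision-language task. MultiModal (10) and Visual Encoder (9) are highly relevant because it uses CLIP ViT for image encoding. MLLM (5) is relevant, but the architecture is more traditional (LSTM/BERT rather than a large language model). Unify Models (4) and Tokenizer (2) are somewhat related but not central. World Models, model-based RL, Latent Reasoning, and Agentic Reasoning (0) are unrelated to the paper. The author list contains none of the designated experts, so no bonus is added. The weighted total is 45.0, above the dynamic passing score of 35.2.

Keywords

Zero-Shot Captioning, Cultural Heritage, Vision-Language Framework, Traditional Indonesian Clothing, Retrieval-Augmented, CLIP ViT, Multi-modal Learning, Domain Adaptation

In-Depth Analysis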

Chinese Title: 文化遗产的零样本描述：传统印尼服饰的自动图像分析

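Summary: This paper proposes Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of traditional Indonesian clothing. The motivation is that modernization is eroding the cultural vocabulary of traditional dress and that existing vision-language models struggle to capture Indonesian cultural terms. The authors build a dataset of 3,800 expert-annotated images covering 38 provinces and adopt a province-level inductive zero-shot protocol: 24 provinces for training, 6 for validation, and 8 entirely unseen provinces for testing. The framework uses frozen CLIP ViT-B/32 image and text encoders together with a trainable BERT encoder and an LSTM decoder; at inference it retrieves the top-K captions from the training caption bank by cosine similarity and uses them as context. Custom ZeroCLIP outperforms the baselines on CLIPScore (0.8536), BLEU-4 (0.3342), and METEOR (0.4859); retrieval yields a 19.3% METEOR gain, and human evaluation confirms its cultural accuracy and fluency. The dataset is publicly available.

Innovations:

Proposes a province-level inductive zero-shot captioning protocol in which training, validation, and the retrieval bank exclude all unseen-province data, giving a strict test of generalization.
Builds the first traditional clothing dataset covering all 38 Indonesian provinces, with 3,800 expert-annotated images that support fine-grained cultural description.
Proposes the Custom ZeroCLIP framework, which combines frozen CLIP encoders with a trainable BERT-LSTM decoder and adds retrieval augmentation to recover cultural vocabulary in a low-resource setting.
Uses retrieval augmentation to generate culturally grounded captions without unseen-province labels or captions, which suits the digital preservation of cultural heritage.

Methodology: The method follows an inductive zero-shot learning protocol that splits the 38 provinces into 24 training, 6 validation, and 8 unseen test provinces. The architecture consists of frozen CLIP ViT-B/32 image and text encoders plus a trainable BERT encoder, projection layer, and LSTM decoder. Training optimizes the BERT-LSTM decoder on paired image-caption data with a loss that combines cross-entropy and a CLIP contrastive loss. At inference, for each unseen-province image the top-K (K=5) captions are retrieved from the training caption bank by cosine similarity and used as context for the LSTM decoder, which generates a caption without any label. Data augmentation includes random horizontal flips, color jitter, rotation, synonym replacement, and back-translation.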
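The retrieval step described above, ranking training captions by cosine similarity to the image embedding and keeping the top K, can be sketched with plain lists. The captions and 2D embeddings below are made-up toy data; in the described system the embeddings would come from the frozen CLIP encoders, which are not called here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve_top_k(image_emb, caption_bank, k=5):
    """Return the k captions whose embeddings are most similar to the image.
    caption_bank holds (caption, embedding) pairs from training provinces only."""
    scored = [(cosine(image_emb, emb), cap) for cap, emb in caption_bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cap for _, cap in scored[:k]]

# Toy caption bank with hypothetical captions and 2D embeddings.
bank = [
    ("a woman wearing a kebaya", [1.0, 0.0]),
    ("a songket cloth with gold thread", [0.8, 0.6]),
    ("a man wearing a blangkon", [0.0, 1.0]),
]
query = [1.0, 0.1]   # toy embedding of an unseen-province image
context = retrieve_top_k(query, bank, k=2)
print(context)
```

Restricting `caption_bank` to training provinces is what keeps the protocol inductive: the retrieved context can supply cultural vocabulary, but never the unseen province's own captions.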

Key Results:

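On the unseen-province test set, Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859.
Compared with the strongest baseline, InstructBLIP, CLIPScore improves by 1.97%, BLEU-4 by 18.64%, and METEOR by 10.18%.
Ablations show that retrieval brings a 19.3% METEOR gain and recovers cultural vocabulary (such as kebaya, songket, Meukutop, blangkon).
Human evaluation shows the generated captions beat the baselines on cultural accuracy and fluency.

Tech Stack:

CLIP ViT-B/32 (frozen image and text encoders)
BERT (trainable text encoder)
LSTM (trainable autoregressive decoder)
Cosine similarity (retrieval)
AdamW optimizer (learning rate 2e-5)
Data augmentation: random horizontal flips, color jitter, rotation, synonym replacement, back-translation
Loss: cross-entropy loss + CLIP contrastive loss

Strengths:

The strict inductive zero-shot setup makes the evaluation of generalization to unseen provinces credible.
Retrieval augmentation compensates for missing cultural vocabulary and significantly improves METEOR.
The dataset covers all 38 Indonesian provinces, and expert annotation ensures cultural accuracy.
Strong results in a low-resource cultural heritage setting offer a reference for similar tasks.
Public dataset and code support reproducible research.

Limitations:

Relies on CLIP pretraining, whose data is dominated by Western culture and may be biased with respect to Indonesian culture.
The retrieval bank comes only from seen provinces and cannot cover cultural terms unique to unseen provinces, which may limit caption diversity.
Generated captions need expert verification before use in real heritage digitization, which raises the bar for deployment.
The dataset is small (100 images per province), which may hurt generalization to rare garments.
The comparison with large models such as BLIP-2 is asymmetric (this work freezes CLIP, while baselines may be fine-tuned), so the comparison should be read with care.

Relevance To Keywords: The paper focuses on zero-shot captioning and cultural heritage, which belongs to multimodal understanding and generation, but it does not involve world models, representation learning, model-based reinforcement learning, or post-training. It has some connection to "native multimodal large models" (it uses CLIP and BERT) but does not propose a unified understanding-and-generation framework. Overall relevance is moderate to weak.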

35. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning [PASS]

Score: 43.5 / 35.2

Authors: Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

Published: 2026-06-11

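TL;DR: This paper proposes the SWITCH framework, which introduces explicit boundary tokens to make hidden-state latent reasoning compatible with policy-gradient reinforcement learning, yielding a latent reasoning mechanism that is both trainable and interpretable.

Abstract (Chinese Translation)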

潜在思维链（Latent chain-of-thought）通过将可见推理轨迹替换为连续隐藏状态循环（hidden-state recurrence）来压缩推理，但现有的方法难以使用标准的策略内强化学习（RL）进行优化，且在因果解释方面存在困难。我们的关键洞察是，一对明确的边界标记可以同时解决这两个问题：离散的进入和退出锚点使潜在模块与标准策略内 RL 兼容，而相同的锚点也为机制分析提供了自然的立足点。受此启发，我们提出了 SWITCH，一种可切换的潜在推理框架。该模型生成 <swi> 以进入潜在模式，并生成 </swi> 以退出。由于边界是普通的离散标记，GRPO 策略比率在每个决策点均有明确定义。相同的锚点也使潜在步骤可直接被探测和进行因果干预。我们采用可见到潜在的课程学习和 Switch-GRPO 目标函数来训练该模型，后者通过循环潜在计算传播梯度。SWITCH 在相似规模下始终优于先前的隐藏状态循环潜在推理方法。通过边界标记的机制分析进一步揭示了三个发现：(i) <swi> 是一个高度局部化、学习到的切换策略，而非风格伪影；(ii) 它所打开的潜在步骤执行问题特定的、因果重要的计算，而非充当惰性占位符；(iii) 该计算集中在进入时的单个隐藏状态转换上。综上所述，这些结果表明，隐藏状态循环潜在推理既强化学习可训练，也便于直接进行机制分析，包括策略内强化学习本身如何从内部改进模型。

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0
Latent Reasoning	1.5	10.0/10	15.0
Agentic Reasoning	1.5	5.0/10	7.5

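Scoring rationale: The paper centers on combining Latent Reasoning with on-policy RL, so Latent Reasoning scores highest (10.0) and model-based RL receives a moderate 4.0; it involves discrete token boundaries but no multimodal visual encoding, so MultiModal and Visual Encoder score 0; the author list contains none of the designated experts (Yang Shi et al.).

Keywords

Latent Reasoning, Hidden-State Recurrence, On-Policy Reinforcement Learning, Switchable Latent Reasoning, Boundary Tokens, Mechanistic Analysis, GRPO, Chain-of-Thought

In-Depth Analysis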

Chinese Title: 揭秘隐状态递归：基于在线策略强化学习的可切换潜在推理

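Summary: The paper addresses two problems of hidden-state recurrence in latent chain-of-thought: it is hard to train with on-policy reinforcement learning (RL) and hard to interpret causally. It proposes the SWITCH framework, whose key idea is a pair of explicit boundary tokens, <swi> and </swi>, that make the latent block compatible with standard GRPO and give mechanistic analysis an anchor. Training has three phases: an SFT phase teaches the model to insert boundary tokens at suitable positions; a curriculum phase gradually replaces the text inside the boundaries with latent steps; and a Switch-GRPO phase redefines the factorization of the trajectory likelihood so that gradients propagate through the recurrent latent computation. Experiments show that SWITCH reaches 79.3% on MATH-500, 25.7 percentage points above the strongest Coconut-style baseline. Mechanistic analysis through the boundary tokens yields three findings: <swi> is a learned switching policy rather than a stylistic artifact; the latent steps perform problem-specific, causally important computation; and that computation is concentrated at a single hidden-state transition on entry to the latent block.

Innovations:

Proposes explicit boundary tokens <swi>/</swi> that make hidden-state-recurrence latent reasoning compatible with on-policy RL (GRPO) and provide a natural anchor for mechanistic analysis.
Designs the Switch-GRPO objective, which redefines the factorization of the trajectory likelihood (policy density is computed at text positions only) so that gradients propagate through the recurrent latent computation.
Proposes a visible-to-latent curriculum (parallel replacement) that gradually replaces textual reasoning with latent steps and prevents the model from collapsing into a no-op.
Uses the boundary tokens for mechanistic analysis, directly verifying for the first time that hidden-state-recurrence latent steps perform problem-specific, causally important computation rather than acting as inert placeholders.
Reaches 79.3% accuracy on MATH-500, 25.7 percentage points above a same-scale Coconut baseline; Switch-GRPO further lowers the latent invocation rate while improving accuracy.

Methodology: The paper uses a three-phase training recipe. Phase 1 uses SFT to train the model to insert <swi>/</swi> boundary tokens at high-entropy positions in a math CoT corpus. Phase 2 applies a parallel curriculum that gradually replaces the text inside the boundaries with <latent> positions while keeping the boundary tokens in the loss. Phase 3 runs on-policy RL with Switch-GRPO, which redefines the factorization of the trajectory likelihood (policy density at text positions only) and uses memory-segmented backpropagation to optimize correctness, tag format, latent usage, and compression. At inference the model enters latent mode with <swi>, runs at least a preset minimum number of latent steps (hidden-state recurrence), and exits with </swi>.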
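The reason discrete boundary tokens keep the GRPO policy ratio well-defined can be shown with a small sketch: the trajectory log-likelihood sums only over positions where a token was actually sampled, and latent positions are skipped. The masking rule follows the description above; the function names and toy numbers are hypothetical, and the advantage uses the usual GRPO group standardization rather than anything specific to Switch-GRPO.

```python
import math

def trajectory_logprob(token_logprobs, is_latent):
    """Sum log-probabilities over text positions only; latent positions
    sample no token, so they contribute no policy density."""
    return sum(lp for lp, lat in zip(token_logprobs, is_latent) if not lat)

def policy_ratio(new_logprob, old_logprob):
    """Importance ratio pi_new / pi_old for one trajectory."""
    return math.exp(new_logprob - old_logprob)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

# Toy trajectory: <swi>, two latent steps, </swi>, then one answer token.
logprobs = [-0.1, 0.0, 0.0, -0.2, -0.3]
latent_mask = [False, True, True, False, False]
logp = trajectory_logprob(logprobs, latent_mask)

ratio = policy_ratio(logp, logp)                            # unchanged policy
advantages = group_relative_advantages([1.0, -1.0, 1.0, -1.0])
print(logp, ratio, advantages)
```

The boundary tokens themselves are ordinary sampled tokens, so the decision to enter or leave latent mode is part of the policy and receives gradient like any other token choice.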

Key Results:

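SWITCH reaches 79.3% accuracy on MATH-500, 25.7 percentage points above the strongest Coconut-style baseline (53.6%).
Starting from the SFT checkpoint, Switch-GRPO halves the latent invocation rate while raising accuracy on problems that invoke latent steps by 12.6 percentage points.
Mechanistic analysis finds that <swi> is a learned switching policy rather than a stylistic artifact; that the latent steps perform problem-specific, causally important computation; and that this computation is concentrated at a single hidden-state transition on entry to the latent block.
The parallel curriculum beats the sequential curriculum because parallel replacement forces the model to adapt to latent computation on all spans at once.

Tech Stack:

Model: Qwen3-8B (base model)
RL algorithm: Group Relative Policy Optimization (GRPO)
Training framework: LoRA (low-rank adaptation)
Math verification tool: math-verify
Mechanistic analysis tools: logit lens, linear probing, causal activation intervention
Curriculum learning: parallel replacement strategy, parameters ??=2, ??max=8
Loss functions: cross-entropy loss (SFT), clipped surrogate loss (Switch-GRPO)
Reward functions: correctness reward (±1), tag-format reward (±1), latent-usage reward ({0,1}), correctness-gated conciseness reward ([0,1])

Strengths:

Addresses both core challenges of hidden-state-recurrence latent reasoning at once: infeasible RL training and difficulty of mechanistic analysis.
The method is simple and effective: two boundary tokens are enough to make standard GRPO applicable, with no change to the model architecture.
Provides a thorough mechanistic analysis that directly verifies the causal importance of latent steps for the first time, which strengthens the method's credibility.
Delivers significant gains on mathematical reasoning with efficient training (LoRA).
Open-sources model weights and code, making results easy to reproduce.

Limitations:

Experiments are run only at the 8B scale; behavior on larger models is unknown.
The curriculum hyperparameters (such as ??, ??max) need tuning per task.
The minimum dwell ??min for latent steps is set by hand, which may limit the model's flexibility.
Mechanistic analysis focuses on mathematical reasoning; generalization to other domains (such as commonsense reasoning) is not verified.
The Switch-GRPO reward design depends on a correctness verifier, which may limit it on tasks without one.

Relevance To Keywords:

Reinforcement learning: the paper's core is training a latent reasoning model with on-policy RL (GRPO), which is a post-training stage.
Post-training: the three-phase recipe (SFT + curriculum + RL) is a typical post-training pipeline, related to alignment and improving reasoning.
World models: latent reasoning can be viewed as building an internal world model in latent space to simulate reasoning steps.
Representation learning: hidden-state recurrence uses the model's own representation space for continuous reasoning, so it is closely tied to representation learning.
Base model: the method builds on Qwen3-8B, a text-only language model, and the paper evaluates only text math tasks, so any link to native multimodal large models or unified multimodal understanding and generation is indirect.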

36. From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent [PASS]

Score: 43.5 / 35.2

Authors: Haishuo Fang, Yue Feng, Iryna Gurevych

Published: 2026-06-11

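TL;DR: This paper proposes the ProReviewer agent, which uses reinforcement learning and a structured review log to give LLM-based peer review the ability to investigate proactively, and significantly outperforms baselines on quality scores and human evaluation.

Abstract (Chinese Translation)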

大型语言模型（LLMs）在自动化科学同行评审方面展现出潜力。然而，现有方法往往难以生成基于具体证据的深度评审。我们认为，关键限制在于缺乏灵活性，无法像人类评审者那样基于累积证据主动调查论文中的可疑部分。本文探讨了如何使基于 LLM 的评审代理执行此类主动调查。我们发现，这一问题可自然形式化为马尔可夫决策过程（MDP），并提出 ProReviewer，一种基于维护的结构化评审日志主动评审论文的同行评审代理。该结构化评审日志作为代理的工作空间，用于跟踪评审过程中收集的证据及中间发现。实验表明，ProReviewer 采用 8B 骨干模型，经监督微调训练并通过强化学习优化，在五个质量维度上获得最高平均分，相较于使用更大规模前沿 LLM 的基于提示的方法高出最多 39%，相对最强的微调基线高出 16%。此外，它在人类评估中也取得了相对于基线最高的胜率。

Abstract

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured review log. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	9.0/10	13.5

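Scoring rationale: The paper's core is building a scientific peer review agent that investigates proactively, so 'Agentic Reasoning' scores highest. The paper uses an MDP and reinforcement learning, which gives some connection to 'model-based RL' and 'World Models'. Because it handles text only, 'Visual Encoder' scores 0 and 'MultiModal' and 'MLLM' have low relevance. 'Unify Models' and 'Tokenizer' are generic components with limited relevance. The author list contains none of the designated experts (Yang Shi et al.), so no bonus is added.

Keywords

Scientific Peer Review, LLM Agent, Markov Decision Process, Reinforcement Learning, Structured Review Log, Proactive Investigation, Evidence Tracking

In-Depth Analysis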

Chinese Title: 从被动生成到主动调查：一种主动式科学同行评审代理

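Summary: This paper proposes ProReviewer, a proactive scientific peer review agent built on a large language model. Existing automated review methods usually follow a fixed procedure and cannot flexibly investigate suspicious parts of a paper based on the evidence accumulated so far. ProReviewer formulates review as a Markov Decision Process (MDP) and maintains a structured review log (recording claims, issues, and notes) to track evidence and guide later checks. The agent is trained by supervised fine-tuning (SFT) followed by reinforcement learning with Group Relative Policy Optimization (GRPO) under a multi-dimensional reward function. Experiments use a version-matched dataset of ICLR 2025/2026 paper-review pairs. ProReviewer (8B parameters) achieves the highest average score across five quality dimensions, beating prompt-based frontier LLM methods by up to 39% and the strongest fine-tuned baseline by 16% in relative terms, and it also attains the highest win rate in human evaluation. The method detects cross-section inconsistencies effectively and stays robust as paper length grows.

Innovations:

Models peer review as an MDP so that the agent decides what to investigate next based on accumulated evidence instead of following a fixed procedure.
Proposes a structured review log with three entry types (claims, issues, notes) that supports evidence tracking and selective revision; the final review cites log entries directly, which ensures traceability.
Trains the review agent with SFT and GRPO under a multi-dimensional reward (action validity, structural completeness, score alignment, content depth).
Builds a version-matched dataset of ICLR 2025/2026 paper-review pairs (5K pairs) and keeps training and test data free of contamination (test-set papers were published after the model's knowledge cutoff).
Experiments show that ProReviewer beats existing methods in both automatic and human evaluation and is especially good at detecting cross-section logical inconsistencies.

Methodology: The paper formulates review as an MDP whose state comprises the current context, the review log, and a paper index; actions are divided into environment actions (read a section, keyword lookup, terminate) and log actions (record, update, generate an outline). An 8B-parameter LLM serves as the policy network: it is first trained by SFT on synthetic trajectories and then optimized by reinforcement learning with GRPO, with a reward function of four components: action validity, structural completeness, score alignment (agreement with human scores), and content depth (judged by GPT-4o). The training data is 4K ICLR 2025 paper-review pairs and the test data is 1K ICLR 2026 paper-review pairs.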
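The MDP described above can be made concrete with a small data-structure sketch: a state holding the current context and the review log, and a step function that applies environment and log actions. Everything here is illustrative. The class names, action strings, and the toy paper are hypothetical, only a subset of the listed actions is implemented, and the LLM policy that would choose actions is replaced by a hand-written action sequence.

```python
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    kind: str      # "claim", "issue", or "note"
    text: str
    section: str   # section where the evidence was found

@dataclass
class ReviewState:
    context: str = ""                               # text currently in view
    log: list = field(default_factory=list)         # structured review log
    sections_read: set = field(default_factory=set)
    done: bool = False

def step(state, action, arg, paper):
    """Apply one action and return the state; unknown or invalid actions are no-ops."""
    if action == "read_section" and arg in paper:
        state.context = paper[arg]
        state.sections_read.add(arg)
    elif action == "record" and isinstance(arg, LogEntry):
        state.log.append(arg)
    elif action == "terminate":
        state.done = True
    return state

# Toy paper with a cross-section inconsistency.
paper = {
    "intro": "We claim a 10% gain over the baseline.",
    "experiments": "Table 2 shows a 3% gain over the baseline.",
}
trajectory = [
    ("read_section", "intro"),
    ("record", LogEntry("claim", "10% gain claimed", "intro")),
    ("read_section", "experiments"),
    ("record", LogEntry("issue", "Reported gain (3%) contradicts the claim (10%)", "experiments")),
    ("terminate", None),
]
state = ReviewState()
for action, arg in trajectory:
    state = step(state, action, arg, paper)
print(state.log, state.done)
```

The second `record` only makes sense because the earlier claim is already in the log, which is the sense in which the log guides later investigation.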

Key Results:

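ProReviewer has the highest average score across five quality dimensions (overall quality, specificity, constructiveness, evidence, score accuracy), beating prompt-based frontier LLM methods such as Gemini-3.1-flash-lite by up to 39% and the strongest fine-tuned baseline by 16%, in relative terms.
In human evaluation, ProReviewer has the highest win rate in all pairwise comparisons.
ProReviewer detects cross-section inconsistencies more effectively (for example, a claim in the introduction that contradicts the experimental results).
As paper length grows, ProReviewer stays robust while baseline performance drops.
Ablations show that both the structured review log and RL training contribute significantly to performance.

Tech Stack:

Markov Decision Process (MDP)
Group Relative Policy Optimization (GRPO)
Supervised fine-tuning (SFT)
8B-parameter LLM (not specified; possibly Qwen or a similar model)
Structured review log (three entry types: claims, issues, notes)
Multi-dimensional reward function (action validity, structural completeness, score alignment, content depth)
GPT-4o (for content-depth evaluation)
ICLR 2025/2026 paper-review pair dataset

Strengths:

Proposes a proactive review paradigm that overcomes the limits of a fixed procedure and comes closer to how human expert reviewers behave.
The structured review log provides a traceable chain of evidence, improving the credibility and interpretability of the review.
RL training lets the agent adapt its depth of investigation instead of relying on hand-written rules.
The version-matched dataset mitigates data contamination and makes evaluation more reliable.
Significantly outperforms existing methods, including frontier large models, on multiple automatic and human evaluation metrics.

Limitations:

The action space is currently limited to the paper itself and excludes external retrieval (such as literature search), which may affect novelty assessment.
Training data comes only from ICLR; generalization to other fields or venue formats remains to be verified.
The 8B model is efficient but may be limited by the base model's capability; a larger model might improve performance further.
RL training relies on GPT-4o to judge content depth, which introduces evaluator bias and cost.
Maintaining the review log adds reasoning steps and compute overhead, which may affect latency.

Relevance To Keywords: The paper focuses on applying LLMs to scientific peer review, so its direct relevance to the given keywords (Unify Models, World Models, Representation Learning, Model-Based RL, native multimodal large models, etc.) is low. The reinforcement learning it uses (GRPO) relates to the RL area, though GRPO itself is a model-free method; the structured review log can be seen as a form of representation learning (representing paper content as structured entries); and the proactive investigation strategy relates to agent decision-making and could carry over to exploration in world models. The paper does not involve multimodality or the construction of world models, so overall relevance is weak.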

37. Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm [PASS]

Score: 42.0 / 35.2

Authors: Francesco Massa, Marco Cristofanilli

Published: 2026-06-11

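TL;DR: This paper proposes Brick, a multimodal router that dispatches models by combining capability scores with a per-query difficulty estimate, cutting compute cost substantially while keeping accuracy high.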

摘要翻译

定义查询难度是部署工程中最困难的问题之一。现有的 LLM（大语言模型）路由器依赖于领域标签、关键词和 Token 数等表面特征，忽略了实际上决定模型成功的领域内方差。Frontier models（前沿模型）的成本是本地开源模型的十倍到一百倍，因此在生产规模下，即使每次请求的微小节省也能成为直接影响云账单的成本杠杆。我们提出了 Brick，一种多模态路由器，它在六个能力维度上对每个模型进行评分，结合单次查询难度估计，并通过成本惩罚几何规则进行调度。一个连续的偏好滑块允许操作员在部署时在最大质量模式和最大节省模式之间滑动切换。在包含 5,504 个查询的基准测试中，Brick 在最大质量模式下达到 76.98% 的准确率，优于最佳单一模型（75.02%）和所有测试的路由器。在中性成本 - 质量配置模式下，Brick 的准确率达到 74.11%，成本仅为始终使用最强模型的 4.71 分之一。在最小成本模式下，它使成本缩减至原来的 22.15 分之一，准确率下降 11.85 个百分点。中位延迟从 51.2 秒降至 22.8 秒。

Abstract

Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as domain labels, keywords, and token count, ignoring the within-domain variance that actually determines model success. Frontier models cost ten to one hundred times more than local open-weight models, so at production scale even small per-request savings become a direct cloud-bill lever. We present Brick, a multimodal router that scores each model on six capability dimensions, combines this with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. A continuous preference knob lets operators slide between max-quality and max-saving profiles at deploy time. On a benchmark of 5,504 queries, Brick at max-quality reaches 76.98% accuracy, beating the best single model (75.02%) and all tested routers. At a neutral cost-quality profile, Brick achieves 74.11% accuracy at 4.71x lower cost than always using the strongest model. At min-cost, it cuts cost 22.15x with 11.85 points accuracy loss. Median latency drops from 51.2s to 22.8s.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	1.0/10	1.5

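Scoring Rationale: The paper's core contribution is Brick, a multimodal router for model dispatch under the Mixture-of-Models (MoM) paradigm. "MultiModal" and "MLLM" are rated highly relevant because the abstract explicitly mentions a "multimodal router" and frontier-model deployment. "Unify Models" is moderately relevant because MoM involves unified scheduling and integration of several models' capabilities. The remaining keywords (Tokenizer, Visual Encoder, World Models, model-based RL, Latent Reasoning, Agentic Reasoning) do not appear in the abstract; they concern model architecture or specific learning paradigms and have little connection to this system-level routing topic, so each is scored 1. The author list does not include Yang Shi or the other designated experts, so no bonus is added.

Keywords

Multimodal Router, Mixture-of-Models, Capability Routing, Cost-Quality Tradeoff, Query Difficulty, Model Dispatch, Frontier Models

Deep Analysis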

Chinese Title: Brick: 面向混合模型（MoM）范式的空间能力路由

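Summary: This paper proposes Brick, a multimodal router for the Mixture-of-Models (MoM) paradigm that targets the high cost and inefficiency of query dispatch over a heterogeneous LLM pool. Existing routers rely on surface features (domain, keywords, token count) and cannot capture how difficulty varies between queries within the same domain, so expensive models are overused. Brick represents each model as a six-dimensional capability vector, estimates a difficulty vector for each query, and dispatches with a cost-penalized geometric rule. A preference knob r lets the user move continuously between maximum quality and maximum savings. On Dataset A (5,504 queries), Brick at max-quality reaches 76.98% accuracy, beating the best single model kimi2.6 (75.02%) and all external routers; at the neutral profile it lowers cost by 4.71x while losing only 0.91 points of accuracy; at min-cost it lowers cost by 22.15x. Brick also cuts median end-to-end latency from 51.2 s to 22.8 s. The paper argues that MoM has advantages in heterogeneous pricing, in bridging open and closed models, and in complementary capabilities.

Innovations:

Frames model routing as a geometric coverage problem in a six-dimensional capability space: query demands and model capabilities are both vectors, and a geometric rule picks the cheapest model that covers the query.
Introduces a user-adjustable preference knob r∈[-1,1] that turns the quality-cost tradeoff into a deploy-time setting with no code changes.
Claims the first real-time routing across a heterogeneous LLM pool (open-weight and closed API models) under the MoM paradigm, and shows that it recovers 24% of the attainable performance gap.
Shows systematically that routing on surface features (domain, length/keywords) fails to capture within-domain difficulty differences and performs worse than always using the strongest single model.

Methodology: The paper proceeds as follows: 1) build a fixed pool of three models (qwen3.5-9b, deepseek-v4-flash, kimi2.6) and define a cost scalar c_m and an actual cost a$m; 2) design Brick2 Dataset A (5,504 queries) covering six capability dimensions, with quality verified by deterministic scoring (such as unit tests and SymPy matching) and an LLM judge (gpt-5.4-mini); 3) build six-dimensional capability vectors for each model and query, with model capability vectors set by a calibration procedure; 4) route with a geometric rule that computes the distance D_m between the query vector and each model's capability vector, adds the cost penalty β·c_m, and selects the model with the lowest combined cost; 5) adjust the cost-penalty weight through the preference knob r to trade quality against cost continuously.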
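
Step 4 of the routing rule can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the capability vectors, the cost scalars, the choice of D_m as the Euclidean norm of unmet demand, and the linear map from the knob r to β (with r = -1 taken as max-quality and r = +1 as max-saving). The summary does not give the paper's exact distance or knob mapping.

```python
import math

# Hypothetical pool: six-dimensional capability vectors and dimensionless cost scalars c_m.
MODELS = {
    "small": {"cap": [0.4] * 6, "cost": 0.05},
    "mid":   {"cap": [0.6] * 6, "cost": 0.20},
    "large": {"cap": [0.9] * 6, "cost": 1.00},
}

def beta_from_knob(r, beta_max=1.0):
    """Assumed mapping: r=-1 gives no cost penalty, r=+1 gives the full penalty."""
    return beta_max * (r + 1.0) / 2.0

def shortfall(query, cap):
    """D_m in this sketch: Euclidean norm of the demand the model does not cover."""
    return math.sqrt(sum(max(q - c, 0.0) ** 2 for q, c in zip(query, cap)))

def route(query, r, models=MODELS):
    """Pick the model minimizing D_m + beta * c_m."""
    beta = beta_from_knob(r)
    return min(models, key=lambda m: shortfall(query, models[m]["cap"]) + beta * models[m]["cost"])

hard_query = [0.8] * 6
easy_query = [0.3] * 6
choices = {
    "hard, max quality": route(hard_query, r=-1.0),
    "hard, max saving": route(hard_query, r=1.0),
    "easy, max saving": route(easy_query, r=1.0),
}
```

With these made-up numbers the hard query goes to the large model when quality is prioritized and falls back to the mid model when cost is penalized, while the easy query stays on the small model.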

Key Results:

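At max-quality, Brick reaches 76.98% accuracy, above the best single model kimi2.6 (75.02%) and all external routers (RouteLLM, FrugalGPT, Cascade Routing).
At the neutral profile, Brick reaches 74.11% accuracy at 1/4.71 of the cost of always using kimi2.6 (a cost reduction of about 79%), losing only 0.91 points of accuracy.
At min-cost, cost drops by 22.15x with an accuracy loss of 11.85 points.
Median end-to-end latency drops from 51.2 s to 22.8 s.
The three-model oracle upper bound is 83.25%, and Brick recovers 24% of the attainable performance gap.
Routers based on surface features (domain, length/keywords) all score below always using kimi2.6.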
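
The headline figures above are mutually consistent, which a few lines of arithmetic confirm. The helper names are ours; the inputs are the numbers reported in the abstract and summary.

```python
def cost_reduction_pct(factor):
    """An 'N x lower cost' claim expressed as a percentage reduction."""
    return (1.0 - 1.0 / factor) * 100.0

def gap_recovered(router_acc, best_single_acc, oracle_acc):
    """Share of the best-single-to-oracle accuracy gap closed by the router."""
    return (router_acc - best_single_acc) / (oracle_acc - best_single_acc)

neutral_saving = round(cost_reduction_pct(4.71), 1)         # neutral profile
min_cost_saving = round(cost_reduction_pct(22.15), 1)       # min-cost profile
recovered = round(gap_recovered(76.98, 75.02, 83.25), 2)    # max-quality profile
neutral_loss = round(75.02 - 74.11, 2)                      # points lost vs. kimi2.6
```

A 4.71x reduction means paying roughly one fifth of the original bill, not 4.71 times as much, and the 24% gap recovery corresponds to 1.96 of the 8.23 points between the best single model and the oracle.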

Tech Stack:

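Six-dimensional capability space (instruction following, mathematical reasoning, code, world knowledge, tool calling, creativity/planning)
Geometric routing rule (distance D_m plus cost penalty β·c_m)
Preference knob r∈[-1,1]
Calibration procedure (sets the model capability vectors)
Brick2 Dataset A (14 upstream sources and 3 custom subsets)
Deterministic scoring (LiveCodeBench unit tests, SymPy matching, IFEval/IFBench checkers, BFCL AST/state checkers)
LLM judging (openai/gpt-5.4-mini as the single judge, with a three-judge panel used to characterize variance)
Cost model (c_m is a dimensionless routing scalar, a$m is the actual dollar cost)

Strengths:

Proposes the MoM paradigm, which makes good use of an existing heterogeneous model pool and avoids training from scratch.
Routes in capability space rather than on surface features, so it captures within-domain difficulty differences and improves both accuracy and cost efficiency.
Provides a continuous preference knob that lets operators balance quality and cost flexibly.
The experimental design is rigorous, with multi-dimensional evaluation, an oracle upper-bound analysis, and a broad comparison against other routers.
Dataset and code are open source (regolo-ai/brick-SR1), which supports reproducibility.

Limitations:

Only three models are used, so scalability to larger pools is unverified.
Capability-vector calibration depends on manual annotation and evaluation, which may introduce subjective bias.
Dataset A is fairly small (5,504 queries), and its six dimensions may not be exhaustive.
Routing relies on static calibration and does not adapt to model updates or price changes.
Refusals are only counted as unsolved; their effect on routing is not analyzed in depth.

Relevance To Keywords: The paper is about LLM routing and the Mixture-of-Models paradigm, and is only weakly related to the research keywords "Unify Models" and "World Models". Its capability-space representation and geometric routing have some connection to "Representation Learning", since model capability is represented as a vector. "Model-Based RL", "native multimodal large models", and "unified understanding and generation in multimodal large models" are not directly related to the topic. The paper concentrates on inference-time cost-quality optimization rather than on model training or world-model construction.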

38. ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm [PASS]

Score: 42.0 / 35.2

Authors: Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

Published: 2026-06-11

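TL;DR: ComAct proposes the COM-as-Action paradigm, which reframes professional software manipulation as deterministic program synthesis, and its self-correcting agent ComActor achieves state-of-the-art performance on a CAD benchmark.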

摘要翻译

现有的计算机使用代理在专业软件操作方面仍存在根本性局限：基于图形用户界面（GUI）的代理面临脆弱的视觉定位和长程误差积累问题，而基于 API 的方法则难以应对异构协议和无法访问的商业接口。在这项工作中，我们将组件对象模型（COM）视为一种统一的可执行抽象，提出 COM-as-Action：一种将专业软件交互重构为确定性程序合成而非顺序视觉控制的新范式。为了在最具有挑战性的环境中验证这一范式，我们引入了 ComCADBench，这是首个针对操作真实工业 CAD 软件的代理基准。我们的实验揭示了一个显著的范式差距：前沿专有模型在基于 GUI 的交互下实现近乎零的成功率，而基于 COM 的执行则带来了显著的即时收益。为了弥合语法正确性与几何精度之间剩余的差距，我们开发了 ComActor，这是一种通过渐进式三阶段框架训练的自校正代理，以及 ComForge，一个用于在 Windows Containers 中进行大规模训练的可扩展平台。广泛实验表明，ComActor 在 ComCADBench 上实现了最先进的性能，在长程任务中展现出强大的鲁棒性（基线模型在此类任务中失效），并能泛化至外部 CAD 基准。

Abstract

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmarks.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	8.0/10	12.0

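Scoring Rationale: The paper's core contribution is the COM-as-Action paradigm for software manipulation, which fits "Agentic Reasoning" (the self-correcting agent ComActor) and "Unify Models" (a unified executable abstraction) well. The paper uses an LLM for program synthesis, so it has some connection to "MLLM" and "model-based RL", but visual encoders, tokenizers, and world models are not central to it, so those keywords score low. The author list does not include any of the designated experts, so no bonus is added.

Keywords

COM-as-Action, Software Manipulation, ComActor, CAD Benchmark, Program Synthesis, Self-correcting Agent, Windows Containers

Deep Analysis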

Chinese Title: ComAct：通过COM即行动范式重塑专业软件操作

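Summary: The paper proposes ComAct (COM-as-Action), a new paradigm that replaces fragile GUI interaction and fragmented API calls with deterministic program synthesis for professional software manipulation. The authors identify the Component Object Model (COM) as a unified, executable abstraction in the Windows ecosystem, so an agent can generate COM programs directly to control professional software (such as AutoCAD and SolidWorks) while keeping visual perception for coarse observation and verification. To validate the paradigm, the paper builds ComCADBench, the first benchmark on real industrial CAD software, and develops ComActor, an agent that learns to self-correct through progressive three-stage training (single-turn SFT, multi-turn SFT, and GRPO reinforcement learning with a continuous geometric reward). Experiments show that conventional GUI agents achieve near-zero success on ComCADBench while COM-driven methods bring a clear improvement; ComActor reaches state-of-the-art performance and generalizes well to external CAD benchmarks (Text2CAD, CADPrompt). The paper also builds the ComForge platform, which supports thousands of parallel Dockerized Windows environments for large-scale training.

Innovations:

Proposes the COM-as-Action (ComAct) paradigm, which reframes professional software manipulation as executable program synthesis in place of fragile GUI interaction and fragmented API calls.
Builds ComCADBench, the first agent benchmark on real industrial CAD software (AutoCAD, SolidWorks), evaluated on the final software artifact.
Develops ComActor, a specialized CAD agent with self-correction ability, trained through a progressive three-stage framework (single-turn SFT, multi-turn SFT, GRPO reinforcement learning).
Builds ComForge, a highly scalable, parallel Windows-container infrastructure that supports thousands of concurrent real environments for large-scale training and evaluation.
Uses COM as a unified action space, which naturally supports cross-application workflows (such as Microsoft Office, the Adobe suite, and industrial CAD).

Methodology: First, structured JSON geometry specifications are extracted from three public CAD datasets (SketchGraphs, Text2CAD, Fusion360Gallery); an MLLM generates natural-language instructions and the corresponding COM scripts, which are executed and verified in Dockerized Windows environments to build a reliable corpus of instruction-code pairs. Training then follows a progressive three-stage framework: stage one is single-turn instruction-to-code SFT, stage two is SFT on multi-turn agent interaction, and stage three is GRPO reinforcement learning with a continuous geometric reward, which turns the agent from a static code generator into a self-correcting closed-loop agent. Training and inference both run on ComForge, which supports thousands of parallel Dockerized Windows environments, each running real CAD software. At inference time the agent runs a think-decide-act loop, iteratively generating and correcting COM scripts based on live screenshots and terminal feedback.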
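
The think-decide-act loop at the end of the methodology can be sketched as a plain feedback loop. This is a simplified illustration under stated assumptions: `run_with_self_correction`, `fake_generate`, and `fake_execute` are hypothetical names, the "scripts" are placeholder strings, and execution is stubbed. In the setting the paper describes, the executor would run a real COM script (for example through `win32com.client`) inside a Windows container and return screenshots and terminal output.

```python
def run_with_self_correction(generate, execute, max_turns=3):
    """Generate a script, execute it, and feed failures back until success or budget end."""
    feedback = None
    trace = []
    for _ in range(max_turns):
        script = generate(feedback)       # the policy sees the previous feedback
        ok, feedback = execute(script)    # the environment reports success or an error
        trace.append((script, ok))
        if ok:
            break
    return trace

def fake_generate(feedback):
    """Stub policy: emits a bad script first, then repairs it once it sees an error."""
    if feedback is None:
        return "draw_circle(radius=-5)"
    return "draw_circle(radius=5)"

def fake_execute(script):
    """Stub environment: rejects negative radii, accepts everything else."""
    if "radius=-" in script:
        return False, "error: radius must be positive"
    return True, "ok"

trace = run_with_self_correction(fake_generate, fake_execute)
```

The loop ends after two turns here: the first script fails, the error message is passed back to the generator, and the corrected script succeeds.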

Key Results:

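Conventional GUI agents achieve near-zero success on ComCADBench, while COM-driven methods bring a clear immediate improvement.
ComActor reaches state-of-the-art performance on ComCADBench and stays resilient on long-horizon tasks where baselines fail completely.
ComActor generalizes well to the external CAD benchmarks Text2CAD and CADPrompt.
The progressive three-stage training framework improves the agent's self-correction ability and task completion rate.
ComForge supports thousands of parallel Windows environments for large-scale training and evaluation.

Tech Stack:

Component Object Model (COM) interface
win32com.client (Python COM bindings)
MLLM (multimodal large language model)
Supervised fine-tuning (SFT)
GRPO (Group Relative Policy Optimization) reinforcement learning
Continuous geometric reward function
Docker-containerized Windows environments
SketchGraphs, Text2CAD, and Fusion360Gallery datasets
Python script generation and execution

Strengths:

Proposes a new, unified paradigm for professional software manipulation that addresses the visual fragility of GUI agents and the fragmentation of API agents at the root.
Builds a complete ecosystem (benchmark, agent, training framework, infrastructure) that is systematic and practical.
The progressive three-stage training framework is well designed: it removes capability bottlenecks step by step and turns the agent from a static code generator into a self-correcting closed-loop agent.
Evaluation on real industrial CAD software makes the results convincing, and cross-benchmark generalization is demonstrated.
COM as a unified action space is broadly applicable and could extend to Office, Adobe, and many other professional applications.

Limitations:

The work focuses on CAD; although COM interfaces are common in the Windows ecosystem, results on other professional software (such as Adobe or Office) are not fully validated.
Dependence on Windows and COM limits deployment on non-Windows platforms.
Training data is built from public CAD datasets and may not cover all industrial-grade complex scenarios.
Self-correction depends on real-time environment feedback and may degrade when feedback is delayed or noisy.
The paper does not discuss the security and permission control of COM interfaces in detail, which may be a challenge in real deployments.

Relevance To Keywords:

Unify Models: The paper uses a single MLLM to handle instruction understanding, code generation, and self-correction, which reflects the idea of model unification.
World Models: The agent interacts with the real software environment through COM programs, and the environment feedback (screenshots, terminal output) can be seen as observations of world state or as implicit modeling of it; the paper does not build an explicit world model.
Representation Learning: Through the COM interface, software operations are expressed as program code, which abstracts complex GUI operations into semantic-level program calls and serves as a high-level action representation.
Model-Based RL: The paper optimizes the agent policy with GRPO but does not build an explicit environment model, so it belongs to model-free RL.
Native multimodal large models: The agent processes text instructions and visual screenshots and generates code, which reflects combined multimodal understanding and generation.
Unified understanding and generation in multimodal large models: The agent both understands natural-language instructions and generates executable code, unifying understanding and generation.
Reinforcement learning: The paper applies GRPO, guiding the agent with a continuous geometric reward.
Post-training: The SFT and GRPO stages of the three-stage framework are all post-training, used to strengthen a pretrained model on a specific task.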

39. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments [PASS]

Score: 42.0 / 35.2

Authors: Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

Published: 2026-06-11

TL;DR: To address LLM agents' struggle in dynamic environments, this paper proposes the EvoArena benchmark and EvoMem memory paradigm, which significantly improves agent robustness and performance on evolving tasks.

摘要翻译

大语言模型（LLM）智能体在各类基准测试上已展现出卓越的性能，然而大多数评估仍基于静态环境的假设。相比之下，现实世界的部署本质上是动态的，要求智能体持续将其知识、技能和行为与变化的环境及更新的任务条件进行对齐。为填补这一空白，我们提出了 EvoArena，这是一个基准测试套件，它将环境变化建模为跨越终端、软件和社交领域的渐进更新序列。此外，我们还提出了 EvoMem，这是一种基于补丁的记忆范式，它将记忆演化记录为结构化更新历史，使智能体能够通过记忆中的变化来推理环境的演化。实验表明，当前智能体在 EvoArena 上表现不佳，在演化的终端、软件和社交偏好领域上的平均准确率仅为 39.6%。EvoMem 持续提升性能，在 EvoArena 上平均增益 1.5%，同时在 GAIA 和 LoCoMo 等标准基准上也分别提升了 6.1% 和 4.8%。除了单个任务之外，EvoMem 在 EvoArena 上进一步将链级准确率提升了 3.7%，其中成功需要完成一系列连续的、相关的进化子任务。机制分析表明，EvoMem 改进了记忆中的证据捕获，这意味着能够更好地保存完整的演化环境状态。我们的结果强调了在评估和记忆建模中考虑演化对于实现可靠智能体部署的重要性。

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	9.0/10	13.5
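
The total for a score table like the one above is simply the sum of weight times relevance over the nine keywords, compared against the dynamic pass score. A short check, using the relevance values from this table and the 35.2 threshold shown in the Score line (variable names are ours):

```python
WEIGHT = 1.5        # every keyword in the table carries the same weight
PASS_SCORE = 35.2   # dynamic pass score shown next to the paper's score

relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 1.0,
    "World Models": 4.0,
    "MLLM": 3.0,
    "MultiModal": 2.0,
    "model-based RL": 3.0,
    "Latent Reasoning": 3.0,
    "Agentic Reasoning": 9.0,
}

def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * score for score in relevance.values())

total = weighted_total(relevance)
passed = total >= PASS_SCORE
```

The same calculation reproduces the totals of the other papers in this report, for example 40.5 for a relevance sum of 27.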

Scoring Rationale: The paper focuses on LLM agents in dynamic environments, making Agentic Reasoning highly relevant (9). World Models (4) and model-based RL (3) have moderate relevance due to environment modeling and agent interaction, though the core contribution is memory mechanisms rather than generative world models or model-based planning. Latent Reasoning (3) relates to memory evolution reasoning. MLLM (3) and MultiModal (2) are less relevant as the abstract emphasizes text/logic domains without explicit multi-modal fusion. Unify Models (2), Tokenizer (1), and Visual Encoder (1) are largely irrelevant as the paper does not discuss model unification, tokenizer design, or visual encoders. No expert authors from the specified list were found. The weighted sum is 42.0, exceeding the dynamic pass score of 35.2.

Keywords

LLM Agents, Dynamic Environments, Memory Evolution, EvoArena, EvoMem, Benchmark Suite, Robustness, Reasoning

Deep Analysis

Chinese Title: EvoArena：追踪记忆演化以实现动态环境中鲁棒的LLM智能体

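Summary: Most evaluations of LLM agents assume static environments and do not test adaptation to continual environment change. To address this, the paper proposes the EvoArena benchmark suite, which models environment change as sequences of progressive updates across terminal, software, and social domains. It also proposes EvoMem, a patch-based memory paradigm that records memory evolution as a structured update history so the agent can reason about environment evolution through changes in its memory. Experiments show that existing agents reach an average accuracy of only 39.6% on EvoArena; EvoMem adds 1.5% on average and improves the standard benchmarks GAIA and LoCoMo by 6.1% and 4.8%. EvoMem also raises chain-level accuracy by 3.7%, and a mechanistic analysis shows that it improves evidence capture in memory. The results underline the importance of modeling evolution in both evaluation and memory design for reliable agent deployment.

Innovations:

Proposes the EvoArena benchmark suite, the first systematic evaluation of LLM agents under persistent environment evolution, covering terminal workflows, software engineering, and user preferences.
Proposes the EvoMem memory paradigm, which records memory updates as a Git-like patch history (state before and after the update, rationale, and evidence) so the agent can trace and reason about environment evolution.
Finds that existing agents degrade sharply in evolving environments (39.6% average accuracy) and identifies "state collapse" as a common failure mode.
Shows that EvoMem helps beyond evolving environments: it improves standard long-horizon benchmarks (GAIA by 6.1%, LoCoMo by 4.8%) and raises chain-level task accuracy by 3.7%.

Methodology: The paper first converts static agent benchmarks into versioned evolution chains: Terminal-Bench-Evo (evolving terminal workflows), SWE-Chain-Evo (evolving codebases), and PersonaMem-Evo (evolving user preferences). Each chain contains several progressive versions that keep the same goal but change the interface, rules, code state, or preferences. EvoMem then attaches an append-only patch history to a standard memory system; each patch stores the memory before the update, the memory after the update, the rationale, and the triggering context as evidence. At inference time the agent retrieves the latest memory by default and selectively retrieves relevant patches when a query concerns an overwritten state or an earlier version. Experiments use several LLM backbones and agent frameworks, evaluate on EvoArena and standard benchmarks, and include mechanistic analyses such as evidence capture rate.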
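
The append-only patch history described above can be sketched as a small data structure. This is an illustrative reading of the summary, not EvoMem's actual code: the class and field names (`Patch`, `PatchMemory`, `before`, `after`, `reason`, `evidence`) are our own, and memory entries are reduced to key-value strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Patch:
    key: str
    before: Optional[str]   # memory content before the update (None if new)
    after: str              # memory content after the update
    reason: str             # why the update was made
    evidence: str           # triggering context that justified it

class PatchMemory:
    """Latest-state memory plus an append-only history of updates."""

    def __init__(self) -> None:
        self.current: Dict[str, str] = {}
        self.patches: List[Patch] = []

    def update(self, key: str, value: str, reason: str, evidence: str) -> None:
        self.patches.append(Patch(key, self.current.get(key), value, reason, evidence))
        self.current[key] = value

    def latest(self, key: str) -> Optional[str]:
        """Default retrieval path: only the newest state."""
        return self.current.get(key)

    def history(self, key: str) -> List[Patch]:
        """Selective retrieval for questions about overwritten or earlier states."""
        return [p for p in self.patches if p.key == key]

memory = PatchMemory()
memory.update("build_cmd", "make all", "initial setup", "README v1")
memory.update("build_cmd", "cmake --build .", "project migrated to CMake", "changelog v2")
```

Because patches are never deleted, a question such as "what was the build command before the migration?" can still be answered from `memory.history("build_cmd")` even though `latest` returns only the new value.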

Key Results:

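Existing agents reach only 39.6% average accuracy on EvoArena, which shows that persistent environment evolution is a real challenge for current agents.
EvoMem improves step-level accuracy on EvoArena by 1.5% on average and improves GAIA and LoCoMo by 6.1% and 4.8%.
EvoMem raises chain-level accuracy by 3.7%, showing that it helps agents complete consecutive sequences of evolving subtasks.
Mechanistic analysis shows that EvoMem improves evidence capture in memory, especially on temporal-trajectory and multi-pattern synthesis questions.

Tech Stack:

LLM agents (such as GPT-4 and Claude)
Memory systems (structured long-term memory, production-grade persistent memory)
Git-style patch mechanism
Terminal workflow benchmark (Terminal-Bench)
Software engineering benchmark (SWE-bench)
Personalized memory benchmark (PersonaMem)
Standard benchmarks such as GAIA and LoCoMo
Evidence capture rate analysis

Strengths:

First to define and systematically evaluate persistent environment evolution, an important but neglected agent capability.
EvoMem is lightweight and general: it can be added to existing memory systems without retraining the model.
Covers several domains (terminal, software, social), which supports the method's broad applicability.
Provides a detailed mechanistic analysis of how the patch history improves evidence retention.

Limitations:

The designed evolution chains may not fully reflect the complex, non-linear environment changes of the real world.
EvoMem adds memory storage and retrieval overhead, which may need optimization at very large deployment scale.
Experiments use a limited number of LLM backbones and agent frameworks, so generality needs further verification.
The agent's ability to actively probe for environment changes is not explored; the method relies on passive memory updates.

Relevance To Keywords:

Unify Models, World Models, Representation Learning, Model-Based RL: The paper is about memory evolution of LLM agents in dynamic environments. It relates loosely to these keywords, since memory evolution can be seen as a dynamic update of the agent's internal world model, but it does not address unified understanding and generation in multimodal large models or RL post-training.
Native multimodal large models: The paper has no multimodal input or output, although the notion of a dynamic environment could extend to multimodal settings.
Representation learning: Memory patches can be seen as a representation of how the environment state evolves, but the paper does not examine representation learning theory.
Model-based RL: Environment evolution resembles a non-stationary MDP, but the paper uses LLM reasoning and memory rather than RL methods.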

40. Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework [PASS]

Score: 40.5 / 35.2

Authors: Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

Published: 2026-06-11

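TL;DR: The paper proposes an end-to-end framework based on a Transformer and a graph attention network that reconstructs cardiac meshes directly from 3D medical images, removing the conventional segmentation and post-processing steps and making clinical digital-twin pipelines easier to adopt.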

摘要翻译

构建患者特异性心脏模型处于精准心脏病学核心地位，然而将这些模型投入临床应用却总是遭遇相同的瓶颈：网格生成缓慢、混乱且令人沮丧。标准工作流程——图像分割、运行 Marching Cubes（行进立方体算法），随后手动清理结果——耗时漫长、操作员间结果不一致，且需要大多数临床团队所不具备的专业知识。我们采取了一种根本不同的方法。不再将分割和网格生成视为两个独立问题，而是训练了一个单一的端到端网络，直接将原始 3D 医学图像映射为平滑的、可直接用于仿真的心脏表面网格。其核心是一个 3D Swin Transformer 编码器 - 解码器，用于从 CT 或 MRI 体数据中提取体积特征，并结合一个图注意力网络 (GAT) 头，该头迭代变形模板网格以拟合患者的心脏边界。我们在 MM-WHS 2017 基准测试上分别使用 CT 和 MRI 数据进行了测试。分割分数具有竞争力（CT 上 Dice 系数为 0.84，MRI 上为 0.83），但主要关注网格质量：平均 Chamfer 距离为 1.8 mm，第 95 百分位表面距离低于 5 mm。每个网格均通过单次前向传播生成——无需 Marching Cubes 算法，无需平滑滤波器，也无需手动清理。我们认为，在心脏数字孪生流程中，几何保真度和拓扑正确性比像素级 Dice 分数更为重要。通过消除后处理瓶颈，该方法使患者特异性心脏模拟在临床应用上显著更易实现。

Abstract

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	9.0/10	13.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	1.0/10	1.5

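Scoring Rationale: The paper is about mesh reconstruction from medical images. It uses a 3D Swin Transformer as the visual encoder (Visual Encoder) and unifies segmentation and mesh generation in one end-to-end model (Unify Models), so these two keywords score highest. Tokenizer, MLLM, RL, and the reasoning keywords do not appear in the paper, and World Models is touched only through the digital-twin concept rather than a generative dynamics model, so their relevance is low.

Keywords

Cardiac Mesh Reconstruction, 3D Swin Transformer, Graph Attention Network, End-to-End Learning, Digital Twin, Medical Image Analysis, Surface Mesh Generation

Deep Analysis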

Chinese Title: 基于Transformer引导的图注意力直接心脏网格重建：一种结构数字孪生框架

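Summary: The paper proposes an end-to-end deep learning framework that generates smooth, simulation-ready cardiac surface meshes directly from raw 3D medical images (CT or MRI), addressing the slow and labor-intensive segmentation-then-meshing workflow. The framework combines a 3D Swin Transformer encoder-decoder, which extracts rich volumetric features, with a graph attention network (GAT) head that iteratively deforms a template mesh to match the patient's cardiac boundary. On the MM-WHS 2017 benchmark the method reaches Dice scores of 0.84 on CT and 0.83 on MRI, a mean Chamfer distance of 1.8 mm, and a 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass, with no Marching Cubes, smoothing, or manual cleanup. The authors argue that for cardiac digital-twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores.

Innovations:

Proposes a direct-to-mesh, end-to-end architecture that combines a 3D Swin Transformer with a graph attention network and outputs smooth cardiac surface meshes without Marching Cubes.
Unifies segmentation and mesh reconstruction in one framework, jointly training a voxel segmentation objective and a geometric mesh-quality objective to secure both anatomical accuracy and simulation compatibility.
Evaluates on the multimodal MM-WHS 2017 dataset and reports mesh-quality metrics for both CT and MRI: mean Chamfer distance of 1.8 mm and 95th-percentile surface distance below 5 mm.
Removes the post-processing bottleneck of the conventional pipeline (Marching Cubes, smoothing, manual cleanup), making mesh generation fully automatic and easier to deploy clinically.

Methodology: The paper uses a two-stage end-to-end architecture. First, a 3D Swin Transformer encoder-decoder processes the volumetric image and extracts multi-scale contextual features through hierarchical windowed self-attention. Then a graph attention network (GAT) head starts from a template mesh (such as a sphere or ellipsoid), samples vertex features from the encoder feature maps, and iteratively deforms the mesh through several graph attention layers until it fits the cardiac boundary. Training jointly optimizes a voxel segmentation loss and a geometric mesh-quality loss. Preprocessing includes resampling to 160×160×80 voxels, z-score normalization, and data augmentation with random rotation, flipping, and elastic deformation.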

Key Results:

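On MM-WHS 2017, the Dice score is 0.84 on CT and 0.83 on MRI.
The mean Chamfer distance is 1.8 mm, and the 95th-percentile surface distance is below 5 mm.
Every mesh is produced in a single forward pass, with no Marching Cubes, smoothing, or manual cleanup.
Mesh quality meets simulation requirements, so the meshes can be used directly for finite-element or fluid-dynamics computation.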
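
For readers less familiar with the two metrics quoted above, here is a minimal pure-Python sketch of each. The Chamfer convention used (mean nearest-neighbor distance, averaged over both directions) is one common choice and is an assumption; the summary does not state which variant the paper reports. The brute-force nearest-neighbor search is for illustration only and would be too slow for real meshes.

```python
import math

def dice(mask_a, mask_b):
    """Dice coefficient between two flat binary masks (sequences of 0/1)."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_sum = sum(mask_a) + sum(mask_b)
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum

def _mean_nearest(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets (same units as input)."""
    return 0.5 * (_mean_nearest(points_a, points_b) + _mean_nearest(points_b, points_a))

# Toy example: predicted surface points are the reference points shifted by 1 mm along x.
reference = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
predicted = [(x + 1.0, y, z) for x, y, z in reference]
shift_error_mm = chamfer_distance(predicted, reference)
overlap = dice([1, 1, 0, 0], [1, 0, 0, 0])
```

Dice measures volumetric overlap of segmentation masks, while Chamfer distance measures how far the reconstructed surface lies from the reference surface, which is why the paper treats the latter as the more relevant number for simulation-ready meshes.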

Tech Stack:

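3D Swin Transformer (encoder-decoder)
Graph Attention Network (GAT)
Marching Cubes (comparison only; not used by the method)
Chamfer distance (geometric quality evaluation)
Dice coefficient (segmentation evaluation)
z-score normalization
Data augmentation (random rotation, flipping, elastic deformation)
MM-WHS 2017 dataset

Strengths:

Generates meshes directly and end to end, removing the post-processing bottleneck of the conventional pipeline and improving efficiency and automation.
Combines the Transformer's global context modeling with the GAT's topology-aware deformation, securing both segmentation accuracy and mesh quality.
Validated on both CT and MRI, showing good cross-modality generalization.
Emphasizes geometric fidelity and topological correctness, which matches the needs of digital-twin and simulation applications.

Limitations:

Mesh-quality metrics such as Chamfer distance are better than those of the conventional pipeline, but the absolute value (1.8 mm) may still leave room for improvement in high-precision clinical settings.
The initial template shape (sphere or ellipsoid) may limit the fit to complex anatomy, especially atypical heart morphology.
Validation uses only MM-WHS 2017; tests on larger and more diverse clinical data are missing.
Compute cost and inference speed are not discussed in detail, which may matter for real-time or near-real-time clinical deployment.

Relevance To Keywords: The paper's direct relevance to the provided research keywords (Unify Models, World Models, Representation Learning, Model-Based RL, native multimodal large models, etc.) is low. It focuses on medical image segmentation and mesh reconstruction, which belongs to computer vision and digital twins rather than multimodal large models, world models, or reinforcement learning. Its end-to-end framework and feature learning have some connection to Representation Learning, but overall it falls outside the core keyword scope.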

41. Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning [PASS]

Score: 40.5 / 35.2

Authors: Paolo Muratore, Mackenzie Weygandt Mathis

Published: 2026-06-11

TL;DR: The paper proposes DYSCO, a multi-view contrastive learning method to recover latent trajectories and symbolic governing equations from noisy high-dimensional measurements.

摘要翻译

从含噪高维测量中识别潜在动力系统是表示学习、系统辨识和科学发现交叉领域的核心问题。我们提出 DYSCO，这是一种多视角时序对比学习算法，通过利用同一底层过程的多个独立含噪视图来分离信号与噪声，从而从此类观测中联合恢复潜在轨迹和支配动力学。通过在结构化函数基上参数化动力学，我们的框架进一步能够在仿射规范下实现支配方程的符号恢复。我们提供了理论保证，表明在仿射不确定性范围内可实现强识别，将先前的可识别性结果扩展到含噪非线性观测的实际场景。实验上，我们在多种动力学 regime（如混沌、振荡和亚稳态）下，展示了在高斯和泊松观测噪声中准确恢复潜在轨迹和流场的能力，后者尤其适用于神经记录。

Abstract

Identifying latent dynamical systems from noisy, high-dimensional measurements is a central problem at the intersection of representation learning, system identification, and scientific discovery. We present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and the governing dynamics from such observations, by leveraging multiple independent noisy views of the same underlying process to disentangle signal from noise. By parameterizing the dynamics in a structured functional basis, our framework further enables symbolic recovery of the governing equations within an affine gauge. We offer theoretical guarantees for strong identification up to an affine indeterminacy, extending prior identifiability results to the realistic setting of noisy nonlinear observations. Empirically, we demonstrate accurate recovery of both latent trajectories and flow fields across a diverse set of dynamical regimes (e.g., chaotic, oscillatory, and metastable) under both Gaussian and Poisson observation noise, the latter being particularly relevant for neural recordings.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	4.0/10	6.0
Latent Reasoning	1.5	9.0/10	13.5
Agentic Reasoning	1.5	1.0/10	1.5

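Scoring Rationale: The paper's core is recovering latent trajectories and governing equations from noisy high-dimensional measurements with multi-view contrastive learning, so it is highly relevant to "Latent Reasoning" (9). "MultiModal" is moderately relevant (5) because of the multi-view representation learning. "model-based RL" and "World Models" have some connection through dynamics modeling (4 and 3). The remaining keywords, such as "Tokenizer", "MLLM", and "Agentic Reasoning", have no direct link to the paper's scientific-discovery and system-identification theme (1-2). The weighted total is 40.5, above the pass score of 35.2.

Keywords

Multi-View Contrastive Learning, Latent Dynamics, Governing Equations, System Identification, Symbolic Recovery, Noisy Measurements, Temporal Contrastive Learning

Deep Analysis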

Chinese Title: 通过多视图对比学习从潜在动力学中提取控制方程

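Summary: The paper proposes DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and their governing dynamics from high-dimensional, noisy observations. The background is the challenge, in neuroscience and machine learning, of extracting a system's computational laws from noisy observations. DYSCO uses multiple independent noisy views to separate signal from noise and parameterizes the dynamics in a structured functional basis, which enables symbolic recovery of the governing equations. The theoretical contribution is an identifiability guarantee up to an affine transformation under noisy nonlinear observations. Experiments confirm the method on chaotic, oscillatory, and metastable dynamical systems under both Gaussian and Poisson noise. The paper concludes that contrastive learning combined with multi-view consistency can address Marr's inverse problem, recovering an interpretable computational description from the implementation level.

Innovations:

Proposes DYSCO, a multi-view contrastive system-identification framework that uses temporal structure and multiple noisy views to jointly recover latent states and dynamics.
Provides a theoretical identifiability guarantee: under noisy nonlinear observations, the latent dynamical system is identified up to an affine transformation.
Combines symbolic regression with contrastive learning, recovering governing equations symbolically by parameterizing the dynamics in a structured functional basis.
Verifies robustness to noise, both Gaussian and Poisson, on several dynamical systems (chaotic, oscillatory, metastable).

Methodology: The paper uses a multi-view temporal contrastive learning framework. An encoder h_θ maps high-dimensional observations to a latent space, and a symbolic dynamics model f̂_Ξ (built on polynomial basis functions) parameterizes the latent dynamics. The training objective is the InfoNCE loss, which maximizes the similarity of positive pairs (different noisy views of the same latent state) and minimizes the similarity of negative pairs. The model generates trajectories by forward integration in time and separates signal from noise through a multi-view consistency constraint. The theoretical analysis proves identifiability under asymptotic assumptions.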
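
The InfoNCE objective named above can be written out in a few lines. This is a generic sketch, not DYSCO's implementation: it assumes that the positive for the embedding of time step i in one view is the embedding of the same time step in the other view, with all other time steps as negatives, and it uses a dot-product similarity with a hypothetical temperature of 0.1. The dynamics model and forward integration are left out.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss where view_a[i] and view_b[i] form the positive pair."""
    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [dot(view_a[i], view_b[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_partition = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_partition - logits[i]
    return loss / n

# Two toy latent states embedded from two noisy views.
view_a = [(1.0, 0.0), (0.0, 1.0)]
aligned_view_b = [(1.0, 0.0), (0.0, 1.0)]     # views agree on each time step
shuffled_view_b = [(0.0, 1.0), (1.0, 0.0)]    # views disagree

aligned_loss = info_nce(view_a, aligned_view_b)
shuffled_loss = info_nce(view_a, shuffled_view_b)
```

The loss is near zero when the two views agree on which states match and large when they do not, which is the pressure that pushes the encoder to keep the shared signal and discard view-specific noise.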

Key Results:

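DYSCO accurately recovers latent trajectories and flow fields on chaotic (Lorenz), oscillatory, and metastable dynamical systems.
Performance stays robust under both Gaussian and Poisson observation noise; Poisson noise is especially relevant to neural recordings.
Symbolic regression recovers the governing equations of the Lorenz system, confirming compatibility with the affine gauge structure.
Theory shows that, asymptotically, multi-view contrastive learning identifies the latent dynamical system up to an affine transformation.

Tech Stack:

Contrastive learning (InfoNCE loss)
Multi-view encoder (neural network h_θ)
Symbolic dynamics model (polynomial basis library Ξ)
Forward integration in time (trajectory rollout)
Negative log-likelihood minimization
Identifiability analysis up to affine transformation

Strengths:

Theoretically rigorous: gives identifiability guarantees under noisy nonlinear observations, extending prior work.
Unified: integrates representation learning, system identification, and symbolic regression in one contrastive learning framework.
Robust: validated across several noise types and dynamical systems.
Interpretable: symbolic recovery of governing equations supports scientific discovery.

Limitations:

Depends on multi-view observations and may not apply to single-view settings.
The choice of symbolic basis (for example polynomial degree) requires prior knowledge and may limit expressiveness.
Identifiability rests on asymptotic assumptions, so finite-sample performance may deviate.
Experiments cover only low-dimensional latent systems (such as Lorenz); scaling to high-dimensional latent spaces is not fully validated.

Relevance To Keywords:

Representation learning: Learning latent representations is the core of the paper, so the relevance is high.
World models: By recovering latent dynamics, DYSCO can be seen as one way to build a world model.
Model-based RL: A latent dynamics model can be used for planning and control, which relates to model-based RL.
Multimodal large models: The multi-view contrastive idea could extend to multimodal data, but the paper does not deal with large models.
Post-training: The paper does not involve post-training, although symbolic recovery could be viewed as a post-hoc interpretability step.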

42. HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents [PASS]

Score: 40.5 / 35.2

Authors: Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen

Published: 2026-06-11

TL;DR: HyperTool improves multi-step tool use accuracy for LLM agents by unifying tool calls into executable code blocks, reducing context consumption and execution-granularity mismatch.

摘要翻译

工具增强的 LLM 代理通常依赖逐步原子化的工具调用，其中每次调用、观察和值传递均在主推理轨迹中暴露。这造成了一个“执行粒度不匹配”：局部确定性的工具工作流被展开为重复的模型可见决策，消耗上下文，并迫使模型在轨迹中管理低级数据流。我们引入了 HyperTool，这是一种统一的可执行 MCP 风格工具接口，它改变了工具执行的模型可见单元。模型通过代码块调用 HyperTool，该代码块可通过其原始模式调用现有工具，操纵返回值，并本地传递中间结果，从而将确定性工具子程序折叠为单次外层调用。为了训练模型使用此接口，我们从跨工具组合任务中合成 HyperTool 格式的轨迹，并在真实 MCP 环境中进行验证。在 MCP-Universe 上，HyperTool 将 Qwen3-32B 的平均准确率从 15.69% 提升至 35.29%，将 Qwen3-8B 的平均准确率从 9.93% 提升至 33.33%，并在平均准确率上超越 GPT-OSS 和 Kimi-k2.5，表明我们的 HyperTool 能够显著改进多步工具使用。

Abstract

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that our HyperTool can substantially improve multi-step tool use.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	9.0/10	13.5

Scoring Rationale: The paper focuses on tool-augmented LLM agents, achieving high relevance for Agentic Reasoning (9.0) due to its focus on agent tool usage logic and Unify Models (7.0) regarding the unified MCP-style interface. Latent Reasoning (5.0) is moderately relevant as it abstracts tool subroutines into single calls. MLLM (6.0) is relevant as the method is evaluated on Qwen3 models. Keywords related to vision, tokenization, world models, and RL are not addressed in the abstract (0.0). No expert authors from the specified list were found.

Keywords

HyperTool, Tool-Augmented Agents, Unified Interface, Execution-Granularity Mismatch, MCP, Tool Composition, LLM Agents

Deep Analysis

Chinese Title: HyperTool：超越逐步工具调用的工具增强型代理

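Summary: Step-wise atomic tool calls in tool-augmented LLM agents inflate the context and fragment reasoning. To address this, the paper proposes HyperTool, a unified executable MCP-style tool interface. It lets the agent call existing tools, manipulate return values, and pass intermediate results inside a single code block, folding deterministic tool subroutines into one outer call. HyperTool-format trajectories are synthesized from cross-tool compositional tasks, verified in real MCP environments, and used for supervised fine-tuning, so the model learns when to invoke a code block, how to compose tools, and when to return intermediate results. On the MCP-Universe benchmark, HyperTool raises average accuracy from 15.69% to 35.29% with Qwen3-32B and from 9.93% to 33.33% with Qwen3-8B, surpassing advanced models such as GPT-OSS and Kimi-k2.5.

Innovations:

Identifies a key limitation of step-wise atomic tool calls: context inflation and fragmented reasoning.
Proposes HyperTool, a unified executable tool interface that keeps the original MCP tool schemas while letting atomic calls and multi-tool workflows be expressed in the same code block.
Builds verified HyperTool-format trajectories for supervised fine-tuning, shifting supervision from isolated atomic calls to executable tool subroutines.
Achieves clear gains on several models and surpasses non-HyperTool baselines and some advanced models.

Methodology: The paper uses a three-stage data synthesis pipeline: it first constructs cross-tool compositional tasks, then generates HyperTool-format trajectories through local repair and trajectory-level context compression, and finally verifies the trajectories for execution correctness and evidence consistency. Models are trained to use the HyperTool interface with supervised fine-tuning (SFT). Evaluation is on the MCP-Universe benchmark against baselines and advanced models.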
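
The idea of folding several tool calls into one outer call can be illustrated with a toy executor. Everything here is hypothetical: the `geocode` and `forecast` tools, their return values, and the `hypertool` function are stand-ins, and bare `exec` is used only for illustration (a real system would run the block in a sandbox and call tools over MCP).

```python
CALL_LOG = []

# Hypothetical tools with fixed return values, standing in for real MCP tools.
TOOLS = {
    "geocode": lambda city: {"lat": 48.86, "lon": 2.35},
    "forecast": lambda lat, lon: {"temp_c": 21, "lat": lat, "lon": lon},
}

def call_tool(name, **kwargs):
    """Dispatch to a tool through its original keyword schema and log the call."""
    CALL_LOG.append(name)
    return TOOLS[name](**kwargs)

def hypertool(code):
    """Run one model-written code block; only the final `result` is returned."""
    scope = {"call_tool": call_tool}
    exec(code, scope)  # illustration only: not sandboxed
    return scope.get("result")

BLOCK = """
loc = call_tool("geocode", city="Paris")
weather = call_tool("forecast", lat=loc["lat"], lon=loc["lon"])
result = weather["temp_c"]
"""

answer = hypertool(BLOCK)
```

Two tool calls run inside the block, but the model's trace contains a single invocation and a single observation (`21`); the intermediate coordinates never enter the context.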

Key Results:

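Qwen3-32B average accuracy rises from 15.69% to 35.29%.
Qwen3-8B average accuracy rises from 9.93% to 33.33%.
HyperTool models surpass GPT-OSS and Kimi-k2.5 in average accuracy.
HyperTool models beat every non-HyperTool baseline built on a Qwen3 backbone.

Tech Stack:

MCP (Model Context Protocol)-style tool interface
Code block execution (Python-style executable programs)
Supervised fine-tuning (SFT)
Trajectory synthesis and verification (execution correctness and evidence consistency checks)
Qwen3-8B and Qwen3-32B large language models

Strengths:

Well-targeted problem: the context and reasoning costs of step-wise calls are a practical concern.
Simple and effective solution: it changes execution granularity rather than replacing the tool interface, so it stays compatible with the existing MCP ecosystem.
Reliable data synthesis: trajectories are verified in real environments, which protects training data quality.
Strong results: large gains on several models, including over stronger baselines.

Limitations:

Effective only for deterministic tool workflows; complex tasks that need the model's decisions between steps may still require step-wise calls.
Data synthesis depends on predefined compositional tasks and may not cover all real scenarios.
Scalability to long contexts or extremely complex tasks is not discussed.
Experiments use only the MCP-Universe benchmark, so generalization needs further validation.

Relevance To Keywords: The paper focuses on optimizing execution granularity for tool-augmented LLM agents and has low relevance to the user's stated research background (native multimodal large models, world models, representation learning, reinforcement learning, post-training). It does not cover unified multimodal understanding and generation, world-model construction, or representation learning, and it does not use reinforcement learning; its training is limited to supervised fine-tuning on synthesized trajectories. Its core contribution is tool-call interface design, an engineering optimization for LLM agents.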

43. From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation [PASS]

Score: 40.5 / 35.2

Authors: Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber

Published: 2026-06-11

TL;DR: This paper investigates discrete speech representations for 3D facial animation, finding that phonetic class encoding improves animation accuracy and proposing an AVTTS pipeline using shared discrete spaces for speech and motion decoding.

摘要翻译

语音表示的选择在语音驱动的 3D 面部动画中至关重要。这些表示在编码内容上有所不同：自监督学习（SSL）特征强调音段和语义线索，神经编解码器产生针对声学重建优化的潜在表示，而自动语音识别（ASR）风格的目标则生成基于标签的空间。我们评估了四种语音表示类别用于 3D 面部合成，通过客观指标和感知评估，在两种面部解码器上比较了它们的面部重建质量。此外，我们还进行了探测分析，将离散表示与语音单位及发音形变相关联。我们发现，在语义表示和基于标签的表示中，编码音类均有利于准确的面部动画预测，且两者具有相当的面部动画质量。基于后者，我们提出了一种视听文本到语音（AVTTS）管道，利用离散表示作为共享空间来联合解码语音和 3D 面部运动。

Abstract

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	1.0/10	1.5

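Scoring Rationale: The paper studies speech representations for generating 3D facial animation and has little connection to core keywords such as MLLM, world models, and reinforcement learning. It does involve discrete speech representations (Tokenizer) and audio-visual multimodality (MultiModal), where relevance is high. None of the designated expert authors are included.

Keywords

Discrete Speech Representations, 3D Facial Animation, Audio-Visual Synthesis, Tokenized Representations, Phonetics, Neural Codecs, Facial Decoders, Shared Space

Deep Analysis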

Chinese Title: 从令牌到面孔：探究用于3D面部动画的离散语音表示

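Summary: This paper systematically evaluates four families of speech representations (acoustic, semantic+acoustic, semantic, and label-based) for speech-driven 3D facial animation. It runs objective and perceptual evaluations with two facial decoders (GRU and Transformer) and conducts probing analyses that relate tokenized representations to phonetic units and articulatory deformations. The study finds that encoding phonetic classes benefits accurate facial animation prediction, and that semantic and label-based representations reach comparable animation quality. Building on this, the paper proposes an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations as a shared space to decode both speech and 3D facial motion.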

Innovations:

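Presents the first systematic comparison of four discrete speech representations of different natures (acoustic, semantic+acoustic, semantic, label-based) for 3D facial animation.
Introduces probing analyses that relate tokenized representations to phonetic units and articulatory deformations (blendshape clusters), revealing what information the representations encode.
Proposes an Audio Visual Text-to-Speech (AVTTS) pipeline that uses discrete representations to generate speech and 3D facial animation together, without a separate audio-driven stage.
Compares the representations comprehensively on objective metrics (LVE, Jitter, Bilabial Closure Score) and on subjective perceptual evaluation (MUSHRA).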

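Methodology: A comparative experimental framework freezes the pretrained weights of four speech encoders (HuBERT, SpeechTokenizer, WavTokenizer, CosyVoice2) and trains two facial decoders (GRU and Transformer) to map speech from the BEAT2 dataset to ARKit blendshapes. Objective metrics are analyzed with linear regression and post-hoc comparisons, and MUSHRA ratings with beta regression. The probing analyses quantize HuBERT representations, align phonemes with facial features, and cluster visemes.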

Key Results:

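The label-based representation (CosyVoice2) performs best on objective metrics, on par with the semantic representation (HuBERT).
The purely acoustic representation (WavTokenizer) performs poorly on facial animation prediction.
The Transformer decoder outperforms the GRU decoder in most cases.
In the perceptual evaluation, CosyVoice2+Transformer and HuBERT+Transformer perform similarly, and both outperform the HuBERT+GRU baseline.
The probing analyses show that semantic and label-based representations encode clear phonetic class information, whereas acoustic representations lack such structure.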

Tech Stack:

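HuBERT (self-supervised learning)
SpeechTokenizer (residual vector quantization + semantic distillation)
WavTokenizer (single-codebook extreme compression)
CosyVoice2 (ASR label supervision)
GRU (gated recurrent unit)
Transformer (cross-attention decoder)
ARKit blendshapes (51-dimensional)
BEAT2 dataset
L1 loss, smoothness loss (velocity, acceleration)
MUSHRA perceptual evaluation protocol
Linear regression (lme), beta regression (glmmTMB), emmeans post-hoc comparisons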
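
The L1 and smoothness (velocity, acceleration) losses listed above could be combined as in the sketch below. It is only an illustration: the report does not give the exact formulation, so the finite-difference form of the smoothness terms, the function name `animation_loss`, and the weights `w_vel` and `w_acc` are all assumptions.

```python
import numpy as np

def animation_loss(pred, target, w_vel=1.0, w_acc=1.0):
    """L1 reconstruction loss plus velocity and acceleration terms.

    pred, target: arrays of shape (frames, blendshapes).
    w_vel, w_acc: hypothetical weights, not taken from the paper.
    """
    l1 = np.abs(pred - target).mean()
    # First-order differences approximate per-frame velocity.
    vel = np.abs(np.diff(pred, axis=0) - np.diff(target, axis=0)).mean()
    # Second-order differences approximate per-frame acceleration.
    acc = np.abs(np.diff(pred, n=2, axis=0) - np.diff(target, n=2, axis=0)).mean()
    return l1 + w_vel * vel + w_acc * acc
```

A constant offset between prediction and target is penalized only by the L1 term, while frame-to-frame jitter also raises the velocity and acceleration terms.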

Strengths:

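A systematic and comprehensive comparison framework that covers several mainstream discrete speech representations.
Combines objective metrics with subjective perceptual evaluation, which strengthens the reliability of the conclusions.
The probing analyses give a deeper view of the information encoded inside the representations.
The proposed AVTTS pipeline has practical value, since it generates speech and facial animation jointly from text.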

Limitations:

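Only the BEAT2 dataset is used, which may limit generalization.
The facial decoder architectures are relatively simple (GRU and Transformer), and more complex models are not explored.
The perceptual evaluation compares only three representative models and does not cover all combinations.
The viseme clustering in the probing analysis is based on the test set, which may introduce data leakage.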

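Relevance To Keywords: The paper centers on applying speech representation learning to 3D facial animation and involves discrete tokenized representations (related to the tokenization ideas used in multimodal large models). Although it does not directly study world models or reinforcement learning, its AVTTS pipeline reflects unified multimodal understanding and generation (joint generation of speech and facial animation). The probing analyses of representation properties help explain how semantic and acoustic information is encoded in learned representations. Overall relevance is moderate, mainly to "representation learning" and "unified understanding and generation in multimodal large models".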

44. Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video (PASS)

Score: 40.5 / 35.2

Authors: Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan

Published: 2026-06-11

TL;DR: BabyMind proposes an object-first contrastive learning method that improves language grounding accuracy in noisy child-view video by aligning utterances with tracked object embeddings.

摘要翻译

从自然经验中学习具象化词语意义，需要解决婴儿视角录像中的两个歧义：命名指称物何时出现，以及它在杂乱画面中的位置。在 SAYCam 风格数据中，看护者语音稀疏且与第一人称视频弱同步，导致单帧对比配对产生噪声正样本，其中目标对象要么缺失，要么与干扰物纠缠。我们提出 BabyMind，这是一种面向儿童视角对比学习的物体优先偏差，适用于稀疏且噪声的监督环境。BabyMind 利用基于掩码的离线区域接口提取候选对象嵌入，通过跟踪将候选对象在短话语中心窗口内链接为轻量级对象文件，并利用原型空间多实例对比目标将话语对齐至对象文件袋。轨迹一致性和全局对象一致性正则化器稳定了学习过程，并将对象文件结构迁移至评估时使用的全局帧嵌入中。在 SAYCam-S 数据集上，BabyMind 相比 CVCL 将 Labeled-S 15 强制选择准确率提高了 2.6 个百分点，并在词汇内分布外基准上取得了稳定提升。代码可在 https://github.com/sathiiii/BabyMind 获取。

Abstract

Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on object-first contrastive learning for grounding language in child-view video, showing moderate relevance to MultiModal (7) and Visual Encoder (5) due to video-text alignment and object embeddings. It does not address model unification, tokenization, world models, reinforcement learning, or agentic reasoning, resulting in low scores for those keywords (1-3). No matching expert authors from the provided list were found in the author list. The weighted total score is 40.5, exceeding the dynamic passing score of 35.2.
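
The total in the table above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only: the relevance values and the 35.2 threshold are copied from this report, while the function name and the assumption that a paper passes when its total is at least the threshold are illustrative.

```python
# Relevance ratings (0-10) copied from the table above; every keyword has weight 1.5.
WEIGHT = 1.5
PASS_THRESHOLD = 35.2  # dynamic passing score quoted in the report

relevance = {
    "Unify Models": 2.0, "Tokenizer": 2.0, "Visual Encoder": 5.0,
    "World Models": 2.0, "MLLM": 4.0, "MultiModal": 7.0,
    "model-based RL": 1.0, "Latent Reasoning": 3.0, "Agentic Reasoning": 1.0,
}

def weighted_total(ratings, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in ratings.values())

total = weighted_total(relevance)
# Assumption: "pass" means total >= threshold; the report does not state the rule.
status = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {status}")
```

The relevance ratings sum to 27, and 27 × 1.5 = 40.5, which matches the score shown for this paper.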

Keywords

Object-First Inductive Biases, Child-View Video, Contrastive Learning, Grounding Language, SAYCam Dataset, Object Embeddings, Multiple-Instance Learning

In-Depth Analysis

Chinese Title: 对象先于词汇：面向儿童视角视频中语言接地的对象优先归纳偏置

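Summary: The paper addresses the spatiotemporal ambiguity of grounding language in child-view video (when the named referent appears and where it sits in a cluttered frame) and proposes BabyMind, an object-first inductive bias. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the target object is absent or confounded with distractors. BabyMind extracts candidate object embeddings with offline automatic mask generation, links candidates within a short utterance window into object files via lightweight tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by 2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. The method mitigates the grounding problem under sparse, noisy supervision.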

Innovations:

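Proposes BabyMind, an object-first inductive-bias extension that resolves spatiotemporal ambiguity by aligning utterances with instance candidates tracked within a short window, while keeping the original CVCL global contrastive objective and evaluation interface.
Uses offline automatic mask generation (AMG) to obtain instance-level region masks and forms short-window object files through lightweight tracking, which suits egocentric video.
Designs a prototype-space multiple-instance contrastive objective, combined with track-coherence and global-object agreement regularization, to stabilize object selection and inject object-centric structure into the global embedding.
Improves Labeled-S 15 forced-choice accuracy on SAYCam-S by 2.6 points and achieves consistent improvements in in-vocabulary out-of-distribution evaluation.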

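Methodology: The paper builds on the CVCL contrastive learning framework and adds an object-file pathway. The steps are: (1) for each utterance anchor frame and its neighboring frames, an offline automatic mask generator such as SAM extracts candidate regions, and masked average pooling yields region embeddings; (2) within a short window, greedy similarity tracking links candidate embeddings into object files (tracks), and each track embedding is the mean of its assigned candidates; (3) a learnable prototype memory is introduced, text and object-file embeddings are projected into prototype space, and similarity is computed via soft assignments; (4) a multiple-instance (MIL) contrastive loss aligns utterances with bags of object files, aggregating with log-sum-exp pooling; (5) a track-coherence regularizer (KL divergence) and a global-object agreement loss transfer the object-file signal into the global frame embedding. Training uses a symmetric contrastive loss scaled by a temperature parameter.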
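
Step (4), the multiple-instance contrastive loss with log-sum-exp pooling, can be sketched roughly as follows. This is a simplified NumPy illustration rather than the authors' implementation: it scores utterances against object files directly in embedding space and skips the prototype projection of step (3). The function names and the default temperature are assumptions.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def mil_contrastive_loss(text, bags, temperature=0.1):
    """Symmetric InfoNCE over utterance-to-bag scores.

    text: (B, D) utterance embeddings, assumed L2-normalized.
    bags: (B, K, D) object-file embeddings, K files per utterance window.
    """
    # sims[i, j, k]: similarity of utterance i to object file k of bag j.
    sims = np.einsum("id,jkd->ijk", text, bags) / temperature
    # Log-sum-exp pooling turns each bag into a single soft-max score.
    bag_scores = logsumexp(sims, axis=2)  # shape (B, B)
    log_p_text_to_bag = bag_scores - logsumexp(bag_scores, axis=1)[:, None]
    log_p_bag_to_text = bag_scores - logsumexp(bag_scores, axis=0)[None, :]
    diag = np.arange(len(text))  # matching pairs sit on the diagonal
    return -0.5 * (log_p_text_to_bag[diag, diag].mean()
                   + log_p_bag_to_text[diag, diag].mean())
```

Because pooling behaves like a soft maximum, a bag scores highly as soon as one of its object files matches the utterance, which is what lets the loss tolerate distractor objects in the same window.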

Key Results:

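On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by 2.6 points over the CVCL baseline.
On in-vocabulary out-of-distribution (IV OOD) benchmarks, BabyMind achieves consistent but modest gains.
Qualitative analysis shows that the object-file pathway selects utterance-relevant target objects more accurately and reduces background interference.
Ablation experiments confirm the effectiveness of prototype-space MIL, track coherence, and global-object agreement regularization.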

Tech Stack:

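Offline automatic mask generation (AMG): uses SAM (Segment Anything Model) to generate instance masks.
Lightweight tracking: greedy cross-frame linking based on cosine similarity.
Prototype memory: learnable prototype vectors, updated with EMA and Sinkhorn normalization (similar to SwAV).
Multiple-instance contrastive learning (MIL): aggregates instance similarities with log-sum-exp pooling.
Contrastive loss: symmetric InfoNCE loss, scaled by a temperature parameter.
KL-divergence regularization: used for track coherence.
Global-object agreement loss: projects object-file embeddings into the global embedding space.
Feature extraction: a text encoder (e.g., the CLIP text encoder) and a visual encoder (e.g., ResNet or ViT) that output normalized embeddings and spatial feature maps.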

Strengths:

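Proposes an effective object-first bias for the core ambiguity of language grounding in child-view video (spatiotemporal uncertainty), in line with cognitive-science findings that infants learn language through object tracking.
The design keeps the CVCL global contrastive framework unchanged and introduces the inductive bias through an auxiliary object-file pathway, so it is easy to integrate.
Uses the off-the-shelf SAM model to obtain instance masks without extra annotation, which makes it scalable.
Several regularizers (track coherence, global-object agreement) stabilize multiple-instance learning and promote the transfer of object structure into the global representation.
Achieves notable improvements on standard benchmarks, and the code is open source and reproducible.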

Limitations:

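Relies on offline automatic mask generation (SAM), which adds computational overhead and a preprocessing step and may not suit real-time or large-scale settings.
The lightweight tracking is based on simple cosine similarity and may fail under fast motion or severe occlusion.
The size of the prototype memory (number of prototypes) must be set manually, which may affect the learning of long-tail concepts.
Experiments are run only on the SAYCam-S subset; generalization to the full SAYCam or other child-view datasets is not verified.
The improvement is relatively limited (+2.6 points) and the out-of-distribution gains are "modest", which leaves room for improvement.
No comparison is made with recent methods based on world models or multimodal large models.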

Relevance To Keywords:

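Representation learning: The core of the paper is learning joint vision-language representations and improving their quality with an object-first bias, so it is highly relevant.
World models: The object-file pathway and track coherence can be viewed as a form of short-term world model (object permanence), but the paper does not explicitly use a world-model framework; relevance is moderate.
Multimodal large models: The method is based on contrastive learning and does not use large generative multimodal models, although the language grounding problem it addresses relates to multimodal understanding; relevance is weak.
Unified understanding and generation in native multimodal large models: The paper does not involve generation and focuses only on understanding (grounding); relevance is low.
Reinforcement learning: The paper does not use reinforcement learning; relevance is low.
Post-training: The method belongs to the pretraining/training stage and does not involve post-training (such as instruction tuning); relevance is low.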

45. MemRefine: LLM-Guided Compression for Long-Term Agent Memory (PASS)

Score: 39.0 / 35.2

Authors: Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

Published: 2026-06-11

TL;DR: MemRefine proposes an LLM-guided memory compression framework that manages storage budgets for long-term agent interactions by judging factual content rather than surface similarity, preserving performance while reducing memory size.

摘要翻译

大语言模型（LLM）智能体日益被期望在长期交互中运行，其中必须保留并回忆过去对话中的信息以支持未来任务。然而，随着交互的积累，记忆库无限增长并充满冗余条目，这不仅增加了存储成本，还通过排挤最有用的证据而损害了检索性能。此外，这对具有硬性内存预算的资源受限平台尤其构成限制，促使我们提出存储预算约束的记忆管理任务，即在固定预算内维持已构建的记忆库，同时保留对未来交互有用的信息。为此，我们提出了 MemRefine，一个基于 LLM 的指导框架。由于表面相似性难以准确反映事实价值，该框架仅利用相似性提出候选对，并将删除、合并及保留的决策委托给基于事实内容的 LLM 评判器，迭代直至满足预算限制。在多个记忆框架和长期对话基准上，MemRefine 始终满足目标预算，同时保留下游性能，并在严格预算下优于基于规则的基线。

Abstract

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. Furthermore, this is especially limiting on resource-constrained platforms with hard memory budgets, motivating us to formulate storage-budgeted memory management, the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we then propose MemRefine, an LLM-guided framework that, since surface similarity poorly reflects factual value, uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	8.0/10	12.0

Scoring rationale: The paper focuses on LLM agent memory compression and budget management, showing high relevance to Agentic Reasoning. It has low-to-moderate relevance to World Models and Latent Reasoning (3 each) due to internal state management and LLM judging logic. It has low relevance to multimodal-specific keywords (Visual Encoder, MultiModal, MLLM) as the abstract lacks vision or modality fusion content. Tokenizer and Unify Models are not central contributions. No matching expert authors were found.

Keywords

LLM Agent Memory, Memory Compression, Storage Budget, LLM-Guided, Factual Content, Long-Term Interaction, Memory Management

In-Depth Analysis

Chinese Title: MemRefine：LLM引导的长期智能体记忆压缩

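Summary: The paper addresses the problem that the memory store of an LLM agent grows without bound over long-term interactions and fills with redundant information, which raises storage cost and degrades retrieval. It formulates the task of memory management under a storage budget. The authors design MemRefine, a framework that runs after memory construction and before retrieval: it uses surface similarity to shortlist candidate memory pairs, but leaves the decision to delete, merge, or preserve to an LLM judge that reasons over factual content, and it iterates until the memory store fits the budget. Experiments on several memory frameworks and long-term conversation benchmarks show that MemRefine meets the target budget while preserving downstream task performance, and that it outperforms rule-based baselines under tight budgets.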

Innovations:

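Formalizes, for the first time, long-term agent memory management under a storage budget, posed as a query-agnostic max-min optimization problem.
Proposes a post-construction memory compression paradigm in which the compression module is independent of the memory construction and retrieval pipelines, making it general.
Uses an LLM to make fact-level judgments (redundant / complementary / distinct) on semantically similar memory pairs instead of relying on surface similarity alone.
Designs an iterative pairwise refinement algorithm that compresses memory toward the target budget through the LLM judge's delete, merge, and preserve decisions.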

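Methodology: The paper first formalizes storage-budgeted memory management as a query-agnostic max-min optimization problem and then proposes the MemRefine framework: (1) candidate pair selection: pick the most similar unprocessed memory pair by cosine similarity of embeddings; (2) LLM judge decision: the LLM decides from factual content whether the pair is redundant (delete), complementary (merge), or distinct (preserve); (3) iteration: repeat until the memory store fits the budget or no candidate pairs remain. The algorithm is validated on two representative memory frameworks, the A-MEM graph memory and the Mem0 pipeline, and evaluated on the LoCoMo benchmark and an extended version of it.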
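
The three-step loop described above might look like the following sketch. It is written under several assumptions that are not from the paper: the budget is counted in number of entries (the paper may measure tokens or bytes), and `embed`, `judge`, and `merge` are hypothetical callables standing in for an embedding model and the LLM judge.

```python
import numpy as np

def refine(memories, embed, judge, merge, budget):
    """Shrink `memories` (list of str) until len(memories) <= budget.

    embed: str -> 1-D non-zero vector (placeholder for an embedding model).
    judge: (a, b) -> 'delete' | 'merge' | 'preserve' (placeholder LLM judge).
    merge: (a, b) -> str combining two complementary entries.
    """
    memories = list(memories)
    settled = set()  # pairs the judge already chose to preserve
    while len(memories) > budget:
        vecs = np.array([embed(m) for m in memories], dtype=float)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        sims = vecs @ vecs.T  # cosine similarity matrix
        best, pair = -np.inf, None
        for i in range(len(memories)):
            for j in range(i + 1, len(memories)):
                key = frozenset((memories[i], memories[j]))
                if key not in settled and sims[i, j] > best:
                    best, pair = sims[i, j], (i, j)
        if pair is None:
            break  # no unprocessed candidate pairs remain
        i, j = pair
        a, b = memories[i], memories[j]
        verdict = judge(a, b)
        if verdict == "delete":      # redundant: drop the second entry
            del memories[j]
        elif verdict == "merge":     # complementary: fuse into one entry
            memories[i] = merge(a, b)
            del memories[j]
        else:                        # distinct: keep both, never re-ask
            settled.add(frozenset((a, b)))
    return memories
```

Similarity only proposes which pair to look at; every deletion or merge goes through `judge`, which mirrors the division of labor the summary describes.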

Key Results:

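MemRefine reliably reaches the target storage budget across compression budgets while preserving downstream task performance.
At moderate compression rates MemRefine loses almost no performance, and under tight budgets performance degrades gradually instead of collapsing.
Compared with rule-based baselines built on similarity or graph-structure heuristics, MemRefine is more robust in all settings.
The LLM judge's ability to assess facts is key to the compression decisions; similarity alone cannot separate redundant, complementary, and distinct content.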

Tech Stack:

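LLM (as the judge for factual assessment)
Cosine similarity (for candidate pair selection)
Embedding vectors (memory entry representation)
A-MEM graph memory framework
Mem0 memory pipeline
LoCoMo benchmark (long-term conversational memory evaluation)
Token-level F1 (downstream task metric)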

Strengths:

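The problem is clearly defined, and a new task is proposed for the hard storage budgets found in real deployments.
The method is modular and can be plugged into existing memory frameworks without changing their construction and retrieval flows.
It uses an LLM for factual reasoning instead of simple statistics, which better matches what memory compression requires.
Experiments cover several memory frameworks and compression budgets, supporting the generality and robustness of the method.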

Limitations:

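Relies on an LLM as the judge, which may add inference cost and latency.
Candidate pair selection uses only cosine similarity and may miss pairs that are semantically similar but not close in embedding space.
Does not consider how the temporal order of memory entries or the interaction context affects compression decisions.
Experiments are evaluated only in dialogue settings; applicability to other kinds of agent memory (such as task logs) is not verified.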

Relevance To Keywords:

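Unify Models, World Models, Representation Learning, Model-Based RL: The paper focuses on long-term memory compression for LLM agents and has only a weak direct connection to unified models, world models, representation learning, or model-based reinforcement learning. Memory management is nonetheless an important component of agent systems and indirectly touches on representation learning (embedding representations) and model decision-making (the LLM judge).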

46. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning (PASS)

Score: 39.0 / 35.2

Authors: Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

Published: 2026-06-11

TL;DR: This paper investigates the mechanistic processes of reinforcement learning post-training for reasoning models, identifying strategy selection and improvement as key drivers enhanced by diverse SFT and difficult RL data.

摘要翻译

强化学习已迅速成为推理与代码模型训练中的关键组成部分，但从机制视角来看，其理解仍不充分。我们研究了通过强化学习后训练，能力是如何以及通过何种底层过程被获取或增强的。基于 Qwen-2.5-1.5B 的受控数学推理实验分析揭示了两个核心机制：策略选择和策略改进。我们的结果突出了 SFT（监督微调）数据和强化学习数据在激活这些机制中的作用，特别是展示了通过在多样化的推理策略上对模型进行监督来实现策略选择，以及通过在强化学习数据中增加难度来实现策略改进。总体而言，我们的结果为强化学习训练提供了机制性见解，并提出了一些实用干预措施，以继续扩展推理能力。

Abstract

Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	5.0/10	7.5

Scoring rationale: The paper focuses on mechanistic analysis of RL post-training for reasoning (Qwen-2.5), highlighting strategy selection and improvement. It is most relevant to Latent Reasoning (7), with moderate relevance to Agentic Reasoning (5) and model-based RL (4), and low relevance to the multimodal and visual keywords (Visual Encoder, MultiModal, Tokenizer) and to World Models. No expert authors from the specified list are present.

Keywords

Reinforcement Learning, Post-Training, Reasoning Mechanisms, Strategy Selection, Strategy Improvement, Qwen-2.5, SFT Data, Model Analysis

In-Depth Analysis

Chinese Title: 选择与改进：理解推理后训练的机制

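Summary: Through controlled experiments (based on the Qwen-2.5-1.5B model and finite-field arithmetic tasks), this paper systematically studies the underlying mechanisms by which reinforcement learning (RL) post-training improves reasoning. It identifies two core mechanisms: strategy selection and strategy improvement. Strategy selection routes a problem to a reasoning pattern that already exists from pretraining and is the main driver of performance gains. Strategy improvement refines existing reasoning patterns but is only activated when the RL data is harder than the SFT data. The experiments also observe strategy amplification and strategy composition; the former can be attributed to strategy selection and the latter is a special case of strategy improvement. The study stresses the importance of high-quality pretraining data and notes that RL training mainly refines behavior and does not create entirely new behavior.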

Innovations:

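Proposes and distinguishes, for the first time, two core mechanisms of RL post-training, strategy selection and strategy improvement, and shows their different contributions to performance gains.
By controlling the diversity of reasoning strategies in the SFT data (forward/backward reasoning), shows that strategy selection depends on the presence of multiple selectable reasoning patterns in the pretraining data.
Finds that strategy improvement requires RL data that is markedly harder than the SFT data; otherwise OOD generalization does not improve.
Attributes strategy amplification and strategy composition to strategy selection and strategy improvement, giving a more parsimonious mechanistic account.
Uses finite-field arithmetic tasks to isolate and control reasoning strategies, providing a reproducible paradigm for later mechanistic studies.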

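Methodology: A standard two-stage training pipeline is used: Qwen2.5-1.5B-Instruct is first supervised fine-tuned (SFT) and then trained with reinforcement learning (RL) using the GRPO algorithm. The SFT data contains solution steps with forward reasoning, backward reasoning, or a mix of both. The RL data creates distribution shift by changing problem difficulty (number of arithmetic steps) and the ratio of problem types (evaluation vs. inverse problems). LoRA is used for parameter-efficient fine-tuning with the AdamW optimizer. A rule-based classifier identifies the type of reasoning strategy in model outputs so that changes in the strategy distribution during training can be analyzed.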
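
For readers unfamiliar with GRPO, its defining step is the group-relative advantage: several completions are sampled per prompt and each reward is standardized against its own group. The sketch below shows only that step, under the usual GRPO formulation; the clipped policy-gradient objective and the KL penalty are omitted, and the `eps` stabilizer is an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize outcome rewards within each sampled group.

    rewards: (num_prompts, group_size), e.g. 1.0 for a correct final answer
    and 0.0 otherwise.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    # eps avoids division by zero when every sample in a group ties.
    return (rewards - mean) / (std + eps)
```

With a group size of 8, a prompt solved by 2 of 8 samples gives the correct samples a positive advantage and the rest a negative one. A prompt that all samples solve (or all fail) yields zero advantages and therefore no learning signal, which is one way to read the finding that RL data needs to be harder than the SFT data.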

Key Results:

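Strategy selection is the main driver of performance gains in RL post-training, and activating it requires SFT data that contains multiple reasoning strategies.
Single-strategy models (forward-only or backward-only) reach close to 100% accuracy on their natural problem type but only 55-65% on the unnatural type.
The mixed-strategy model (FB) reaches about 80% accuracy after SFT and improves further to about 90% after RL.
Strategy improvement occurs only when the RL data (6-9 steps) is harder than the SFT data (2-5 steps); at equal difficulty, RL does not improve OOD generalization.
RL training does not create entirely new reasoning patterns; it only refines existing ones and routes between them.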

Tech Stack:

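Qwen2.5-1.5B-Instruct (pretrained language model)
LoRA (low-rank adaptation, rank=64, α=128, dropout=0.05)
GRPO (Group Relative Policy Optimization, group size 8, KL coefficient β=0.05, temperature 0.7)
AdamW optimizer (SFT: lr=2e-4, RL: lr=1e-5)
Finite-field arithmetic tasks (GF(11) and GF(13))
Rule-based classifier (to identify forward/backward reasoning strategies)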

Strengths:

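Provides a clear taxonomy of RL post-training mechanisms with experimental evidence, filling a gap in mechanistic understanding of the area.
The experimental design is highly controlled: synthetic tasks isolate reasoning strategy, difficulty, and problem type, which makes the conclusions reliable.
Identifies the key influence of data difficulty and diversity on RL effectiveness, with direct implications for training practice.
Explains several previously observed phenomena (strategy amplification, strategy composition) in a unified way, making the theory more parsimonious.
Uses an open-source model and a standard training framework, so the results are easy to reproduce and extend.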

Limitations:

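Experiments are limited to a math reasoning task (finite-field arithmetic); generalization to other domains (such as code generation or general reasoning) remains to be verified.
Only a 1.5B-parameter model is used, and larger models may behave differently.
RL training uses only outcome rewards (correct/incorrect); process rewards and more complex reward designs are not explored.
The strategy selection mechanism depends on reasoning patterns that already exist from pretraining; how these patterns form during pretraining is not studied.
The internal mechanics of strategy improvement (for example, how gradient updates refine reasoning steps) are not analyzed in depth.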

Relevance To Keywords:

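Unify Models: The paper studies language-model post-training and does not directly involve unified multimodal models, although the mechanistic understanding may transfer.
World Models: The finite-field arithmetic task can be viewed as a simple world model, but the paper does not explicitly discuss the concept.
Representation Learning: Strategy selection and improvement involve the use and refinement of internal model representations, so there is some relevance.
Model-Based RL: The paper uses GRPO (model-free RL) and does not involve model-based RL.
Native multimodal large models: The paper uses a text-only model and does not involve multimodality.
Unified understanding and generation in multimodal large models: Not directly relevant.
Reinforcement learning: The core of the paper is the mechanism of RL post-training; highly relevant.
Post-training: Post-training is the topic of the paper; highly relevant.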

47. Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs (PASS)

Score: 37.5 / 35.2

Authors: Huyen Vo, María Martínez-García, Isabel Valera

Published: 2026-06-11

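TL;DR: Hölder++ improves the trade-off between generative quality and coherence in multimodal VAEs by implementing exact Hölder pooling and modeling shared and private representations.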

摘要翻译

现有的多模态变分自编码器（VAE）方法在生成质量与一致性之间面临权衡——即它们难以同时生成真实且多样的样本，而这些样本在模态间保持语义一致性。最近的一项研究表明，使用 Hölder 池化的简单近似作为聚合方法，相比当前最先进的 MMVAE+（SOTA），能提高一致性，尽管它假设所有模态之间共享单一表示。然而，这种方法略微牺牲了样本多样性。受此启发，我们提出了 Hölder++，一种新颖的多模态 VAE，它通过以下方式改进了生成质量 - 一致性权衡：(i) 首次实现了适用于多模态 VAE 且无需任何近似的 Hölder 池化；(ii) 一种扩展架构，用于建模不同的共享表示和私有表示（即模态特定的表示，称为 Hölder+）；以及 (iii) 分层推理，进一步增强了共享表示与私有表示之间的解耦（即 Hölder++）。我们的实验证实，Hölder++ 一致改进了生成质量 - 一致性权衡，产生了更结构化的潜在空间，并学习了对下游任务具有信息量的共享表示。

Abstract

Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence-i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	4.0/10	6.0
Agentic Reasoning	1.5	0.0/10	0.0

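Scoring rationale: The paper is about multimodal variational autoencoders (Multimodal VAEs), so 'MultiModal' is highly relevant (10). It unifies modality representations through Hölder pooling and improves the structure of the latent space, so 'Unify Models' and 'Latent Reasoning' are moderately relevant (4). 'Visual Encoder' is implicit in the multimodal architecture (3), and 'MLLM' and 'World Models' are weakly related as neighboring areas of generative modeling (2). 'Tokenizer', 'model-based RL', and 'Agentic Reasoning' are unrelated to the paper (0). The weighted total of 37.5 is above the passing score of 35.2. The author list contains none of the designated experts, so no bonus is added.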

Keywords

Multimodal VAEs, Hölder Pooling, Shared Representations, Private Representations, Generative Quality, Coherence Trade-off, Hierarchical Inference

In-Depth Analysis

Chinese Title: Hölder++：提升多模态变分自编码器中的质量-一致性权衡

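Summary: The paper addresses the trade-off between generative quality and cross-modal semantic coherence in multimodal variational autoencoders (VAEs) and proposes a new multimodal VAE called Hölder++. Existing approaches such as Product-of-Experts (PoE) and Mixture-of-Experts (MoE) suffer from poor coherence or insufficient diversity, respectively. Inspired by Hölder pooling, the paper first implements exact symmetric Hölder pooling (α=0.5) as the aggregation mechanism, with no approximation. Second, it introduces a decomposition into shared and private (modality-specific) latent variables, giving the Hölder+ model. Finally, a hierarchical inference structure further promotes disentanglement between shared and private representations, giving Hölder++. Experiments on four benchmark datasets (PolyMNIST, MNIST-SVHN, CUBICC, and CelebAMask-HQ) show that Hölder++ consistently improves the quality-coherence trade-off over existing methods, learns a more structured latent space, and learns shared representations that are informative for downstream tasks.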

Innovations:

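First implementation of exact symmetric Hölder pooling (α=0.5) as the aggregation mechanism in a multimodal VAE, with no approximation.
Extends Hölder pooling into the Hölder+ model with shared and private latent subspaces.
Introduces a top-down hierarchical inference structure that promotes disentanglement of shared and private representations by design, giving Hölder++.
Consistently improves the quality-coherence trade-off on several benchmark datasets and learns a more structured latent space.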

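Methodology: The paper uses a probabilistic opinion pooling framework. From α-divergence minimization it derives the exact form of symmetric Hölder pooling (α=0.5) and expresses it as a Gaussian mixture with unimodal and pairwise components. On top of this, the latent space is decomposed into a shared variable z and private variables w, with a cross-reconstruction objective similar to MMVAE+ to prevent shortcut learning. A hierarchical inference structure is then introduced, in which the conditional dependence between the top-level shared variable and the bottom-level private variables strengthens disentanglement. The model is trained by maximizing the evidence lower bound (ELBO).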
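
One way to see where the "unimodal and pairwise components" come from: if the pooled posterior is proportional to the square of the sum of the square roots of the unimodal posteriors (an assumed reading of α=0.5 pooling, not a formula quoted from the paper), expanding the square gives one term per ordered pair of experts. For Gaussians each term is again Gaussian, with mass equal to the Bhattacharyya coefficient. The sketch below works this out for univariate Gaussians with equal expert weights; all function names are illustrative.

```python
import math

def bhattacharyya_coefficient(mu1, var1, mu2, var2):
    """Integral of sqrt(N(mu1, var1) * N(mu2, var2)) over the real line."""
    s = var1 + var2
    scale = math.sqrt(2.0 * math.sqrt(var1 * var2) / s)
    return scale * math.exp(-((mu1 - mu2) ** 2) / (4.0 * s))

def pairwise_component(mu1, var1, mu2, var2):
    """Mean and variance of the Gaussian proportional to sqrt(N1 * N2)."""
    precision = 0.5 / var1 + 0.5 / var2
    mean = (0.5 * mu1 / var1 + 0.5 * mu2 / var2) / precision
    return mean, 1.0 / precision

def mixture_weights(experts):
    """experts: list of (mu, var). Normalized weight of every ordered pair (i, j).

    Pairs with i == j are the unimodal components (coefficient 1 before
    normalization); pairs with i != j are the pairwise components.
    """
    raw = {(i, j): bhattacharyya_coefficient(*a, *b)
           for i, a in enumerate(experts) for j, b in enumerate(experts)}
    z = sum(raw.values())
    return {pair: w / z for pair, w in raw.items()}
```

Experts that disagree strongly have a small coefficient, so their pairwise component contributes little mass. The double loop over experts also makes the quadratic cost in the number of modalities visible.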

Key Results:

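Hölder++ consistently improves the quality-coherence trade-off over existing methods such as MMVAE+ and HELVAE.
Hölder++ learns a more structured and better disentangled latent space.
The shared representations are more informative for downstream tasks (such as classification).
Effectiveness is validated on four datasets: PolyMNIST, MNIST-SVHN, CUBICC, and CelebAMask-HQ.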

Tech Stack:

Hölder pooling (symmetric, α=0.5)
Gaussian mixture model (GMM)
α-divergence
Bhattacharyya coefficient
Variational autoencoder (VAE)
Evidence lower bound (ELBO)
Product-of-Experts (PoE)
Mixture-of-Experts (MoE)
Hierarchical inference
Cross-reconstruction loss

Strengths:

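First exact implementation of Hölder pooling, which avoids approximation error.
The shared-private decomposition and hierarchical inference improve both generative diversity and coherence.
The method applies to any number of modalities and is therefore general.
The experiments are thorough and validate the advantages on several datasets.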

Limitations:

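Hölder pooling requires the Bhattacharyya coefficient for every pair of modalities, so the cost is O(M^2) in the number of modalities M, which may become a bottleneck when there are many modalities.
The paper does not discuss how the model scales to large real-world multimodal data.
The hierarchical inference structure may make training harder and increase the need for hyperparameter tuning.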

Relevance To Keywords:

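Unify Models, World Models, Representation Learning, Model-Based RL: The paper focuses on multimodal representation learning and improves the VAE architecture to better disentangle shared and private representations, so it is highly relevant to representation learning.
Native multimodal large models, unified understanding and generation in multimodal large models: The proposed method can be seen as a building block for multimodal generative models and may help improve the coherence and quality of multimodal large models in generation tasks.
World models, reinforcement learning, post-training: The paper does not directly involve world models or reinforcement learning, but its structured latent space could be used to build more robust world models and thus indirectly support model-based RL and post-training scenarios.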

48. MiniMax Sparse Attention (PASS)

Score: 36.0 / 35.2

Authors: Xunhao Lai, Weiqi Xu, Yufeng Yang, Qiaorui Chen, Yang Xu, Lunbin Zeng, Xiaolong Li, Haohai Sun, Haichao Zhu, Vito Zhang, Pengyu Zhao

Published: 2026-06-11

TL;DR: MiniMax Sparse Attention (MSA) enables ultra-long-context inference for multimodal LLMs by reducing attention compute 28.4x through blockwise sparse retrieval, achieving significant speedups on GPUs.

摘要翻译

超长上下文能力对于前沿大语言模型（LLM）而言正变得不可或缺：智能体工作流、仓库级代码推理和持久化记忆均要求模型联合关注数十万至数百万个 token，然而 Softmax 注意力的二次计算成本使得这一需求在部署规模下难以实现。我们提出了 MiniMax 稀疏注意力（MSA），这是一种基于分组查询注意力（GQA）构建的块级稀疏注意力机制。轻量级的索引分支会对键值块进行打分，并为每个 GQA 组独立选择一个 Top-k 子集，从而实现组特定的稀疏检索，同时保持高效的块级执行；随后，主分支仅在选定的块上执行精确的块稀疏注意力。MSA 围绕简单性和可扩展性原则设计，经过刻意精简，使其能够轻松地在各类 GPU 上高效部署。为了将稀疏性转化为实际加速，我们协同设计了 MSA 与一个 GPU 执行路径，该路径采用无指数运算的 Top-k 选择及 KV-outer 稀疏注意力，以提高块粒度访问下的张量核利用率。在具有原生多模态训练的 109B 参数模型上，MSA 的表现与 GQA 相当，同时在 1M 上下文长度下将每个 token 的注意力计算量减少了 28.4 倍。与我们的协同设计内核配合，MSA 在 H800 上实现了 14.2 倍的预填充加速和 7.6 倍的解码墙钟加速。我们的推理内核可在 https://github.com/MiniMax-AI/MSA 获取。一个由 MSA 驱动的生产级原生多模态模型已在 https://huggingface.co/MiniMaxAI/MiniMax-M3 公开发布。

Abstract

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	5.0/10	7.5

Scoring rationale: The paper proposes MiniMax Sparse Attention (MSA) for ultra-long-context efficiency in LLMs. It explicitly mentions native multimodal training and agentic workflows as motivations, justifying moderate scores for MultiModal, MLLM, and Agentic Reasoning. However, it does not discuss world models, RL, tokenizers, visual encoders, or latent reasoning mechanisms, resulting in low scores for those keywords. No expert authors from the target list were found in the author list, so no bonus points were added.

Keywords

MiniMax Sparse Attention, Ultra-long-context, Grouped Query Attention, Blockwise Sparse Attention, GPU Optimization, Native Multimodal Training, Inference Efficiency

In-Depth Analysis

Chinese Title: MiniMax 稀疏注意力

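Summary: This paper proposes MiniMax Sparse Attention (MSA), a blockwise sparse attention mechanism built on Grouped Query Attention (GQA) that targets the quadratic cost of softmax attention in ultra-long-context settings. MSA contains a lightweight Index Branch that performs top-k block selection independently for each GQA group and always keeps the local block; the Main Branch then performs exact block-sparse attention over only the selected blocks. To turn theoretical sparsity into real speedups, the paper designs an exp-free top-k selection kernel and a KV-outer sparse attention kernel that improve tensor-core utilization. On a 109B-parameter MoE multimodal model, MSA reduces per-token attention compute by 28.4x at 1M context and achieves 14.2x prefill and 7.6x decoding speedups on H800 GPUs, while performing on par with GQA. The paper also releases an open-source inference kernel and a production-grade multimodal model based on MSA.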

Innovations:

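Proposes MSA, a minimal, scalable, and accelerated blockwise sparse attention mechanism that supports both training from scratch and near-lossless conversion from pretrained GQA checkpoints.
Designs an exp-free top-k selection kernel, optimized for small k, that avoids unnecessary softmax computation before selection.
Proposes the KV-outer sparse attention layout, which gathers and concatenates the queries associated with each selected KV block to fill tensor-core MMAs, and handles highly skewed block popularity with pre-scheduled tiling and a two-phase combine, with no atomic updates.
For training, fuses the auxiliary LSE computation into the forward pass and uses persistent load balancing in the backward pass for efficiency.
Validates on a 109B-parameter MoE multimodal model that MSA matches GQA on text and multimodal capabilities while delivering substantial speedups.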

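Methodology: MSA uses a two-stage sparse attention architecture. The Index Branch uses one shared index key head and one index query head per GQA group, aggregates token-level scores to block level by max-pooling, and then performs top-k selection (always including the local block). The Main Branch performs standard scaled dot-product attention only over the tokens in the selected blocks. During training, a KL-divergence loss aligns the Index Branch distribution with the Main Branch distribution (over the selected blocks), and gradient detachment, indexer warm-up, and a forced local block stabilize training. For inference, dedicated GPU kernels are designed: the exp-free top-k selection kernel uses the block-level indexer to bypass softmax, and the KV-outer sparse attention kernel groups the queries associated with each selected KV block to fill the tensor cores, using a two-phase combine to handle non-uniform block access.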
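
The two-stage design described above can be illustrated for a single decoding step of one GQA group. This NumPy sketch is a reference-style illustration under stated assumptions, not the released kernel: the function names, the single-query setting, and treating the last block as the "local block" are all choices made here for clarity.

```python
import numpy as np

def select_blocks(index_q, index_k, block_size, top_k):
    """Index Branch: pick top_k key-value blocks for one query.

    index_q: (D,) index query of one GQA group at the current position.
    index_k: (T, D) shared index keys of the T tokens seen so far.
    Returns the sorted ids of the selected blocks.
    """
    scores = index_k @ index_q                    # token-level scores, no softmax
    n_blocks = -(-len(scores) // block_size)      # ceiling division
    padded = np.full(n_blocks * block_size, -np.inf)
    padded[: len(scores)] = scores
    block_scores = padded.reshape(n_blocks, block_size).max(axis=1)  # max-pool
    block_scores[n_blocks - 1] = np.inf           # always keep the local block
    k = min(top_k, n_blocks)
    return np.sort(np.argpartition(-block_scores, k - 1)[:k])

def sparse_attention(q, keys, values, block_ids, block_size):
    """Main Branch: exact softmax attention over the selected blocks only."""
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, len(keys)))
        for b in block_ids
    ])
    logits = keys[idx] @ q / np.sqrt(q.shape[-1])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ values[idx]
```

Top-k selection needs only the ordering of the block scores, and softmax preserves that ordering, which is why the selection step can skip the exponential entirely. When every block is selected, `sparse_attention` reduces to ordinary dense attention.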

Key Results:

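On a 109B-parameter MoE multimodal model, MSA reduces per-token attention compute by 28.4x at 1M context.
On H800 GPUs, MSA achieves a 14.2x prefill speedup and a 7.6x decoding speedup.
MSA performs on par with GQA across several downstream benchmarks (text and multimodal), with no noticeable loss.
Ablation experiments validate design choices such as block size, top-k count, the KL loss, and gradient detachment.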

Tech Stack:

Grouped Query Attention (GQA)
Blockwise sparse attention
Max-pooling for block score aggregation
Top-k selection algorithm
KL divergence loss
Gradient detachment (stop-gradient)
Exp-free top-k kernel
KV-outer sparse attention kernel
Tensor-core MMA (matrix multiply-accumulate)
Two-phase combine
Persistent load balancing
Fused auxiliary LSE (log-sum-exp) computation

Strengths:

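The design is simple and follows Occam's razor, keeping only the necessary components, so it is easy to deploy and scale.
The algorithm is co-designed with the GPU execution path, turning theoretical sparsity into real wall-clock speedups that are substantial in long-context settings.
Supports both training from scratch and conversion from pretrained GQA models, giving strong compatibility.
Validated as lossless on a 109B-parameter multimodal model, which demonstrates scalability and practicality.
The open-source inference kernel and the released production-grade model help the community reproduce and apply the work.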

Limitations:

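The Index Branch adds parameters and compute overhead, which may not pay off in short-context settings.
Block-level selection may lose fine-grained token-level attention patterns and hurt accuracy on some tasks.
Training needs auxiliary mechanisms such as the KL loss and warm-up, which adds training complexity.
Only causal attention is currently supported; bidirectional attention (as in encoders) would need extra adaptation.
The speedup depends on the GPU architecture and on hyperparameters such as block size, and may vary across hardware.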

Relevance To Keywords:

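Native multimodal large models: The paper trains and validates MSA on a 109B-parameter MoE multimodal model, directly supporting unified multimodal understanding and generation.
Representation learning: MSA learns efficient long-context representations through sparse attention, and the KL-loss alignment of the Index Branch helps it learn better attention patterns.
World models: Ultra-long-context capability is key to the settings world models target (such as agent workflows and persistent memory), and MSA provides an efficient implementation.
Model-Based RL: Long-context agent workflows often involve RL post-training, and the speedups from MSA help RL training and inference.
Post-training: The paper notes that MSA can be converted from pretrained GQA checkpoints, which supports efficient fine-tuning in the post-training stage.
Unify Models: As a single attention mechanism, MSA can be applied across modalities and tasks, which supports model unification.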

49. Rethinking RAG in Long Videos: What to Retrieve and How to Use It? (PASS)

Score: 36.0 / 35.2

Authors: Yuho Lee, Jisu Shin, Nicole Hee-Yeon Kim, Jihwan Bang, Juntae Lee, Kyuwoong Hwang, Fatih Porikli, Hwanjun Song

Published: 2026-06-11

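TL;DR: This paper proposes the CARVE method and the V-RAGBench benchmark, using chunk-adaptive reranking to improve retrieval-augmented generation on long videos and outperforming existing baselines.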

摘要翻译

检索增强生成正从文本领域扩展至长时、第一人称视角的视频中，在此场景中，系统需在多种模态和时间粒度下选择与查询相关的片段。然而，VideoRAG 的进展受限于两个缺陷：现有的基准允许在不使用视频的情况下回答查询，从而掩盖了检索错误；而先前方法对每个查询仅应用单一的模态 - 粒度配置，忽略了片段级别的差异性。我们通过引入 V-RAGBench 和 CARVE 来解决上述问题。V-RAGBench 是一个基于 $\langle$查询，证据片段，答案$\rangle$ 三元组的基准，能够实现检索与生成过程的忠实且解耦的评估；CARVE 是一种简单的方法，它在不同配置下并行运行检索器，并采用片段自适应重排序来为每个片段确定获胜配置。随后，每个片段均以其在检索阶段选定的获胜配置进入生成器，从而形成一种交错证据形式，使得片段级别的决策贯穿了这两个阶段。CARVE 的性能优于八种近期 VideoRAG 基线方法，其提供给生成器的片段交错使用了多种配置，而非共享单一配置，这是查询级别方法无法实现的行为。

Abstract

Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across multiple modalities and temporal granularities. Yet progress in VideoRAG is limited by two gaps: existing benchmarks allow queries to be answered without the video, obscuring retrieval errors, and prior methods apply a single modality-granularity configuration per query, ignoring chunk-level variability. We address both by introducing V-RAGBench, a benchmark of $\langle$query, evidence chunk, answer$\rangle$ triplets that enables faithful, decoupled evaluation of retrieval and generation, and CARVE, a simple method that runs parallel retrievers across configurations and employs chunk-adaptive reranking to identify the winning configuration for each chunk. Each chunk then enters the generator under its winning configuration selected during retrieval, yielding an interleaved evidence form where the chunk-level decision propagates across both stages. CARVE outperforms eight recent VideoRAG baselines, with the chunks supplied to the generator interleaving multiple configurations rather than sharing a single one, a behavior unattainable by query-level methods.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	2.0/10	3.0

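Scoring rationale: The paper focuses on retrieval-augmented generation for long videos (VideoRAG). It is highly relevant to MultiModal (8), since the abstract explicitly mentions multiple modalities and temporal granularities, and moderately relevant to MLLM (5), since the generator is likely based on an MLLM architecture. The remaining keywords, such as World Models, model-based RL, and Tokenizer, have little connection to the paper's core contributions (retrieval strategy and benchmark construction) and are scored 1-2. The weighted total of 36.0 is above the dynamic passing score of 35.2.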

Keywords

VideoRAG, Long Videos, Chunk-adaptive reranking, V-RAGBench, MultiModal, Retrieval-Augmented Generation, Egocentric Video

In-Depth Analysis

Chinese Title: 重新思考长视频中的检索增强生成：检索什么以及如何使用？

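Summary: This paper rethinks retrieval-augmented generation for long videos (VideoRAG). Existing benchmarks allow queries to be answered without the video, which hides retrieval errors, and existing methods use a single modality-granularity configuration per query, which ignores chunk-level differences. The authors therefore propose V-RAGBench, a benchmark of <query, evidence chunk, answer> triplets that supports decoupled evaluation of retrieval and generation, and CARVE, a method that uses parallel retrieval and chunk-adaptive reranking to pick the best configuration for each evidence chunk and carries that chunk-level decision into the generation stage. Experiments show that CARVE outperforms eight baselines on both retrieval and generation, and that its chunk-level decisions are spread evenly across the four configurations, going beyond query-level routing methods.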

Innovations:

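Proposes the V-RAGBench benchmark, which ensures that queries cannot be answered from non-video cues and enables independent evaluation of retrieval and generation.
Proposes CARVE, which uses parallel multi-configuration retrieval and chunk-adaptive reranking to choose the best modality-granularity configuration for each evidence chunk.
Carries the chunk-level configuration decision from the retrieval stage into the generation stage, forming modality-interleaved evidence that improves generation accuracy.
Shows that no single configuration is best for all chunks and that chunk-level diversity is real and does not collapse to a single choice.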

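Methodology: The paper uses a two-stage approach. In the first stage, four retrievers run in parallel (visual frame-level, visual clip-level, text frame-level, text clip-level), and each contributes its top-k chunks to a candidate pool. In the second stage, a multimodal cross-encoder rescores each candidate chunk under its retrieval configuration, and sorting yields the final top-k evidence, with each chunk carrying its winning configuration. At generation time, the evidence from different configurations is fed to the generator in interleaved form. The benchmark is built on long videos from Ego4D and EgoLife, using temporal segmentation and clustering to ensure non-duplicated evidence, and post-hoc filtering to ensure visual grounding and evidence localization.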
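
The retrieve-then-rerank flow described above can be sketched with placeholder components. The callables `retrievers` and `rerank_score` stand in for the real retrievers and the multimodal cross-encoder, and the rule that a chunk returned under several configurations keeps its highest-scoring one is an assumption made for this sketch.

```python
def carve_retrieve(query, retrievers, rerank_score, k):
    """Pool candidates from parallel retrievers, then rerank per chunk.

    retrievers: dict mapping a configuration name (e.g. "visual-frame")
        to a callable (query, k) -> list of chunk ids.
    rerank_score: callable (query, chunk_id, config) -> float, a stand-in
        for the cross-encoder.
    Returns the top-k (chunk_id, winning_config, score) triples.
    """
    best = {}  # chunk_id -> (score, config)
    for config, retrieve in retrievers.items():
        for chunk_id in retrieve(query, k):
            score = rerank_score(query, chunk_id, config)
            # Keep the configuration under which this chunk scores highest.
            if chunk_id not in best or score > best[chunk_id][0]:
                best[chunk_id] = (score, config)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(cid, cfg, score) for cid, (score, cfg) in ranked[:k]]
```

Because each returned triple names its own configuration, the generator can receive one chunk as frames and the next as transcript text, which is the interleaved evidence form the summary refers to.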

Key Results:

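CARVE significantly outperforms eight VideoRAG baselines on retrieval and generation metrics.
Modality-granularity ablations show large performance differences between configurations in both the retrieval and generation stages, with no single configuration best across the board.
CARVE's chunk-level decisions are spread evenly across the four configurations, indicating that chunk-level diversity is real.
In the generation stage, CARVE even surpasses trained query-level routing methods without any additional training.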

Tech Stack:

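Retrievers: visual embeddings (e.g., CLIP), text embeddings (e.g., Sentence-BERT)
Reranking: multimodal cross-encoder (e.g., a CLIP-based cross-encoder)
Benchmark construction: temporal segmentation, clustering, post-hoc filtering (visual grounding check, evidence localization check)
Evaluation metrics: retrieval recall, generation accuracy (based on V-RAGBench triplets)
Baselines: include multimodal fusion, query decomposition, and iterative agent loops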

Strengths:

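Proposes a benchmark built specifically for VideoRAG, addressing the inability of existing benchmarks to evaluate retrieval independently.
CARVE is simple and effective, achieving chunk-level adaptive configuration selection without training.
The experimental design is thorough, including ablations, distribution analysis, and a comparison with query-level routing.
Shows the importance of chunk-level configuration decisions in the generation stage, offering a new angle for follow-up work.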

Limitations:

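The benchmark is based only on Ego4D and EgoLife and may not cover all long-video scenarios.
CARVE relies on four fixed configurations and does not explore finer-grained combinations of modality or granularity.
The reranking stage uses a cross-encoder, which is computationally expensive and may affect real-time use.
Generalization to other video types (such as third-person or multi-view video) is not discussed.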

Relevance To Keywords:

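Unify Models: The paper involves the unified use of multimodal large models for video retrieval and generation.
World Models: Video understanding and retrieval can be seen as part of building a world model, but the paper does not discuss this directly.
Representation Learning: The paper explores how visual and textual representations perform in retrieval at different granularities and, by comparing configurations, indirectly studies representation learning.
Model-Based RL and reinforcement learning: Not involved, although the retrieval-generation framework is loosely analogous to model predictive control.
Native multimodal large models: The paper uses multimodal embeddings and a cross-encoder, which relate to native multimodal models.
Unified understanding and generation in multimodal large models: Both the retrieval and generation stages rely on multimodal representations, reflecting the unification idea.
Post-training: The method requires no training, though benchmark construction involves post-hoc filtering.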

50. SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants (PASS)

Score: 36.0 / 35.2

Authors: Yimin Hu, Mengtao Xu, Hao Guo, Yuheng Song, Xiaoyong Zhu, Bo Zheng

Published: 2026-06-11

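TL;DR: The paper proposes the SkillChain framework, which uses an automated closed loop for Skill evolution to resolve multi-intent confusion in e-commerce image assistants, significantly improving response quality and user engagement.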

摘要翻译

基于图像的 AI 助手现已在电商平台以生产规模部署，其中单个上传图像可触发截然不同的用户意图：产品搜索、风格推荐、视觉百科或实用工具调用，每种意图均需其独特的响应格式、工具调用及领域知识。若缺乏基于意图的行为约束，基于大语言模型（LLM）的系统便会混淆这些异构模式，难以达到领域质量标准；同时，意图空间的广度与动态性使得人工工程变得不可行。为此，我们提出 SkillChain，该机制闭环了技能演化的生产反馈回路，通过三个阶段自动化技能的生命周期：Skill Creator 用于基于任务规范和轨迹进行自举，Route Optimizer 用于路由对齐，Body Refiner 则通过双路径 LLM-Judge 评估迭代优化 Skill Body（技能主体）。在生产规模的电商图像助手上部署后，SkillChain 显著提升了整体响应质量，其中在结构合规性和内容质量方面的提升最为显著；为期一周的在线 A/B 实验进一步证实了其在用户参与度、内容消费及长期留存方面均取得了显著收益。

Abstract

Image-based AI assistants are now deployed at production scale on e-commerce platforms, where a single uploaded image can trigger fundamentally different user intents: product search, style recommendation, visual encyclopedia, or utility tool calls, each demanding its own response format, tool invocation, and domain knowledge. Without per-intent behavioral constraints, LLM-based systems conflate these heterogeneous modes and fall short of domain quality standards, while the breadth and dynamism of the intent space render manual engineering infeasible. To address this, we present SkillChain, which closes the production feedback loop on Skill evolution, automating the lifecycle of Skills through three stages: Skill Creator for bootstrapping from task specs and trajectories, Route Optimizer for routing alignment, and Body Refiner for iterative Skill Body refinement via dual-path LLM-Judge evaluation. Deployed on a production-scale e-commerce image assistant, SkillChain substantially improves aggregate response quality, with the strongest gains on structural compliance and content quality; a one-week online A/B experiment further confirms significant gains in user engagement, content consumption, and long-term retention.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	5.0/10	7.5

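Scoring Rationale: The paper focuses on skill lifecycle management for e-commerce image assistants. It involves multimodal (image) input with LLM decision-making (MLLM, MultiModal) and tool invocation (Agentic Reasoning), but it offers no core architectural innovation in tokenizers, visual encoders, world models, or reinforcement learning, and it does not emphasize a unified model architecture, so the related keywords score low. None of the designated expert authors appear in the author list.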
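The score table above can be checked by hand: each row's score is weight times relevance, and the total is compared with the passing score from the "Score" line. The sketch below is a minimal illustration of that arithmetic; the function name, the row layout, and the `>=` comparison are assumptions, not the report generator's actual code.

```python
# Hypothetical helper: names and the ">=" pass rule are illustrative assumptions.
PASS_THRESHOLD = 35.2  # passing score shown after the slash in the "Score" line


def weighted_total(rows):
    """Sum weight * relevance over rows of (keyword, weight, relevance on 0-10)."""
    return sum(weight * relevance for _, weight, relevance in rows)


# Rows copied from the SkillChain score table.
skillchain_rows = [
    ("Unify Models", 1.5, 3.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 2.0),
    ("World Models", 1.5, 1.0),
    ("MLLM", 1.5, 6.0),
    ("MultiModal", 1.5, 6.0),
    ("model-based RL", 1.5, 1.0),
    ("Latent Reasoning", 1.5, 0.0),
    ("Agentic Reasoning", 1.5, 5.0),
]

total = weighted_total(skillchain_rows)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)  # 36.0 PASS
```

The same helper reproduces the totals of the other entries in this report when given their rows.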

Keywords

Image-based AI Assistants, Skill Evolution, Tool Invocation, Feedback Loop, E-commerce, LLM-based, Route Optimization, Skill Lifecycle

In-Depth Analysis

Chinese Title: SkillChain：为基于图像的电商AI助手闭环技能演化

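Summary: The paper proposes SkillChain, a framework that addresses skill degradation in image-based e-commerce AI assistants under multi-intent scenarios. A single uploaded image may trigger product search, style recommendation, encyclopedia lookup, or tool calls, and existing LLM systems lack per-intent behavioral constraints, which leads to inconsistent response formats and incorrect tool calls. SkillChain automates the skill lifecycle through three closed-loop stages: the Skill Creator bootstraps skills from task specifications and user trajectories, with human review to ensure initial quality; the Route Optimizer continuously mines routing failures and updates skill descriptions to stay aligned with the traffic distribution; and the Body Refiner iteratively repairs Skill Body defects through dual-path LLM evaluation and cross-sample attribution. Deployed on an industrial-scale e-commerce image assistant, the framework shows substantial offline gains in structural compliance and content quality, and an online A/B experiment confirms clear improvements in user engagement, content consumption, and long-term retention. A key property is that modifications made in different stages do not interfere with one another, which guarantees monotonic quality improvement.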

Innovations:

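Identifies routing drift and behavior drift as two separate stages of skill degradation in production and designs a dedicated pipeline for each.
Proposes the first framework to close all three skill feedback loops in an image-based e-commerce setting, with a stage-wise monotonic quality guarantee.
Introduces dual-path evaluation (rules plus LLM Judge) and a cross-sample attribution mechanism that avoids single-sample noise and focuses on systematic defects.
Decouples updates to the skill description (routing) from updates to the Skill Body (behavior), so that modifications in one stage do not interfere with another.
Validates strictly additive stage-wise gains across five visual intent categories at industrial production scale, and demonstrates effectiveness through an online A/B experiment.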

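Methodology: The paper uses a three-stage pipeline. Stage 1 (skill creation) uses an LLM to bootstrap a skill draft from task specifications and user trajectories; an engineer loop validates static components and dynamic operators, and the skill is deployed after passing a human review gate. Stage 2 (route optimization) uses a Judge LLM to compare routing decisions against human annotations, collects failure cases and classifies them as boundary ambiguity, missing skill, or visual parsing error, then iteratively applies update/merge/discard operations, accepting a change only if F1 improves monotonically. Stage 3 (body refinement) uses dual-path evaluation (rule checks plus four-dimension LLM Judge scoring), discretizes each score into three tiers, and computes the tier distribution across samples; when the share of poor results in a dimension exceeds its threshold, an LLM generates structured directives (skill suggestions, rule violations, gaps from the ideal), which update the Skill Body after a human review gate. The pipeline runs in one direction, and modifications in different stages do not interfere with one another.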
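The Stage 3 trigger described above (discretize each judge score into three tiers, aggregate across samples, and act only when the share of poor results in a dimension exceeds its threshold) can be sketched as follows. This is a minimal illustration under stated assumptions: the 1-5 score scale, the tier cut-offs, the threshold values, and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Dimension abbreviations as listed in the Tech Stack section.
DIMENSIONS = ("TCR", "CCC", "CQ", "CA")


def to_tier(score, good_min=4.0, poor_max=2.0):
    """Discretize a judge score into three tiers (cut-offs are assumed)."""
    if score >= good_min:
        return "Good"
    if score <= poor_max:
        return "Poor"
    return "Average"


def dimensions_to_refine(samples, thresholds):
    """Return the dimensions whose share of 'Poor' samples exceeds its threshold.

    Aggregating over many samples means one noisy judgment cannot trigger
    a Skill Body update on its own.
    """
    flagged = []
    for dim in DIMENSIONS:
        tiers = Counter(to_tier(sample[dim]) for sample in samples)
        poor_ratio = tiers["Poor"] / len(samples)
        if poor_ratio > thresholds[dim]:
            flagged.append(dim)
    return flagged


# Four hypothetical judged responses on an assumed 1-5 scale.
samples = [
    {"TCR": 5, "CCC": 1, "CQ": 3, "CA": 5},
    {"TCR": 4, "CCC": 2, "CQ": 4, "CA": 5},
    {"TCR": 5, "CCC": 1, "CQ": 2, "CA": 4},
    {"TCR": 5, "CCC": 5, "CQ": 4, "CA": 5},
]
thresholds = {dim: 0.3 for dim in DIMENSIONS}  # stand-in for the calibrated thresholds
print(dimensions_to_refine(samples, thresholds))  # ['CCC']
```

Here CQ has one poor sample out of four (0.25), which stays under the assumed 0.3 threshold, so only CCC is flagged for refinement.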

Key Results:

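In offline evaluation, the full SkillChain (Stage 3) achieves the highest aggregate LLM Judge score of all configurations, with the largest gains in structural compliance (CCC) and content quality (CQ).
A one-week online A/B experiment comparing Stage 3 against the deployed Stage 2 baseline shows significant improvements in user engagement, content consumption, and long-term retention.
Routing failure analysis shows that boundary ambiguity is the dominant failure type; once the initial skill bank covers the core intents, missing-skill cases are rare.
Stage gains are strictly additive: Stage 1 outperforms the no-skill baseline, Stage 2 further improves routing, and Stage 3 improves response quality.
In Skill Body refinement, cross-sample attribution effectively filters single-sample noise and focuses on systematic defects.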

Tech Stack:

LLM (large language models, e.g., the Qwen series)
MLLM (multimodal large language models)
Skill Bank
LLM-as-Judge
Dual-Path Evaluation (rule path + LLM Judge path)
Cross-Sample Attribution
Engineer Loop (validation by engineers)
Human Reflection Gate (human review gate)
F1 score (acceptance criterion for route optimization)
Four-dimension scoring: Tool Call Rationality (TCR), Card Composition Compliance (CCC), Content Quality (CQ), Constraint Adherence (CA)
Tier discretization (Good/Average/Poor)
Threshold calibration (θd)

Strengths:

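Offers a systematic solution to multi-intent ambiguity in e-commerce image assistants, covering the full lifecycle of skill creation, route optimization, and body refinement.
Decoupled stage design guarantees monotonic quality improvement and avoids conflicting modifications.
Validated in industrial deployment, with both offline evaluation and an online A/B experiment showing significant gains.
Dual-path evaluation combines rules with an LLM to cover complementary failure modes; cross-sample attribution effectively reduces single-sample noise.
The route optimization stage automatically detects boundary ambiguity and iteratively updates descriptions, adapting to changes in the traffic distribution.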

Limitations:

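Relies on a human review gate, which may become a bottleneck at scale (although the paper states it is used only for initial creation and critical updates).
Route optimization depends on human-annotated ground truth, which is costly to label and may introduce bias.
The threshold θd for Skill Body refinement requires empirical calibration, and different dimensions may need different settings.
Experiments were run on a single e-commerce platform, so generalization remains to be verified.
Engineering details such as skill bank versioning and conflict resolution are not discussed.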

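Relevance To Keywords: The research background covers native multimodal large models, unified understanding and generation in multimodal large models, representation learning, world models, reinforcement learning, and post-training. SkillChain is applied directly to an image-based e-commerce AI assistant and relies on a multimodal large language model (MLLM) for visual understanding and intent routing, so it is an application of multimodal large models. Skill evolution in the paper can be viewed as a form of post-training or continual learning that optimizes model behavior through a closed production feedback loop. Route optimization and body refinement involve a reward signal (LLM Judge scores) and policy updates in the reinforcement learning sense. Representation learning appears in the skill descriptions and body constraints. World models are not directly involved, although the skill bank can be viewed as a structured representation of the world of user intents. Overall relevance is fairly high, especially to multimodal large models, post-training, and reinforcement learning.

51. Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction PASS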

Score: 36.0 / 35.2

Authors: Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

Published: 2026-06-11

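TL;DR: Flex4DHuman proposes a multi-view video diffusion model conditioned on camera pose that generates synchronized dense multi-view videos from monocular video, enabling high-quality 4D human reconstruction without explicit geometry priors.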

Abstract

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	0.0/10	0.0

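Scoring Rationale: The paper focuses on a multi-view video diffusion model for 4D human reconstruction, which falls within computer vision and generative modeling. Unify Models: the work only extends an existing architecture, so relevance is low. Tokenizer: an internal component, not a core topic. Visual Encoder: some relevance through the backbone network. World Models: partially relevant because the model generates dynamic scenes. MLLM: this is not a language model, so relevance is low. MultiModal: moderately relevant, since the model handles video together with text and pose. Latent Reasoning receives a low score (3.0/10), while model-based RL and Agentic Reasoning, which concern reinforcement learning and agent reasoning, are unrelated to this generation and reconstruction task and receive 0. The weighted total is 36.0, above the passing score of 35.2. The author list does not include Yang Shi or the other designated experts.

Keywords

Multi-view Video Diffusion, 4D Human Reconstruction, Camera-pose Conditioning, Relative Camera-pose Positional Encoding, 4D Gaussian Splatting, Text-to-video Model, Temporal Rollout

In-Depth Analysis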

Chinese Title: Flex4DHuman：用于4D人体重建的灵活多视角视频扩散模型

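Summary: This paper proposes Flex4DHuman, a multi-view video diffusion model that converts monocular or sparse multi-view video into synchronized dense multi-view videos, relying only on relative camera-pose conditioning and requiring no explicit geometry priors such as skeletons, depth maps, or normals. The model builds on the 1.3B-parameter Wan 2.1 text-to-video DiT architecture and injects multi-view structure through a five-axis positional encoding that combines spatial coordinates, time index, view index, and continuous SE(3) relative camera geometry. It uses a three-stage training curriculum: pose following, flexible reference-to-target view generation, and temporal rollout. The generated videos can be fed directly into downstream 4D Gaussian Splatting reconstruction, giving a complete pipeline from monocular static-camera video to dynamic 4D Gaussian splats. The model surpasses prior methods on the DNA-Rendering and ActorsHQ datasets and generalizes to animal categories. The method offers a practical route to scalable 4D content creation from casual monocular videos.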

Innovations:

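No explicit geometry priors: generation is conditioned only through relative camera-pose positional encoding, without skeletons, depth maps, normals, or rendered target-view geometry.
Flexible synchronized generation: supports monocular or variable sparse-view input, arbitrary target views, dynamic reference-to-target view conditioning, and temporal rollout, enabling joint generation of multi-view videos.
Monocular video to 4D Gaussian splats: shows that the generated synchronized multi-view videos can be used directly for downstream dynamic 4D Gaussian Splatting reconstruction, forming a complete pipeline.
Five-axis positional encoding: extends spatio-temporal RoPE to a five-axis encoding that includes view indices and continuous SE(3) camera geometry, providing permutation invariance over camera poses and better generalization.
Three-stage training curriculum: progressively introduces pose following, flexible reference-to-target generation, and temporal rollout, improving model stability and consistency.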

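Methodology: The model builds on the 1.3B-parameter Wan 2.1 text-to-video DiT and modifies only the self-attention positional encoding, which becomes five-axis (the spatial axes plus frame, view, and SE(3) geometry). The input uses a 36-channel layout (16 noisy + 16 clean + 4 mask): reference views are filled with clean latents and an all-ones mask, while target views are set to zero. PRoPE encodes the continuous SE(3) camera transform into the attention mechanism. Training follows a three-stage curriculum: stage one learns pose following; stage two introduces flexible reference-to-target view conditioning; stage three performs temporal rollout using clean historical target-view tokens. Training data comes from DNA-Rendering, with multi-view captions added for test-time text control. At inference, the model jointly generates multi-view videos, which 4D Gaussian Splatting then reconstructs into dynamic 3D assets.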
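The 36-channel input layout described above can be illustrated with a small pure-Python sketch. It assumes the channel order is noisy, then clean, then mask, and that "set to zero" for target views means both the clean slots and the mask are zero; the function name and the flat-list representation of a channel are hypothetical and chosen only to keep the example dependency-free.

```python
def build_view_input(noisy, clean, is_reference):
    """Assemble a 36-channel input for one view: 16 noisy + 16 clean + 4 mask.

    noisy, clean: lists of 16 channels, each channel a flat list of floats.
    Channel order and zero-filling of target views are assumptions.
    """
    if len(noisy) != 16 or len(clean) != 16:
        raise ValueError("expected 16 noisy and 16 clean latent channels")
    size = len(noisy[0])
    if is_reference:
        # Reference view: expose the clean latents and mark them with ones.
        cond = [list(channel) for channel in clean]
        mask = [[1.0] * size for _ in range(4)]
    else:
        # Target view: no clean information is provided.
        cond = [[0.0] * size for _ in range(16)]
        mask = [[0.0] * size for _ in range(4)]
    return [list(channel) for channel in noisy] + cond + mask


# Tiny latents with 4 spatial positions per channel.
noisy = [[0.5] * 4 for _ in range(16)]
clean = [[2.0] * 4 for _ in range(16)]

reference = build_view_input(noisy, clean, is_reference=True)
target = build_view_input(noisy, clean, is_reference=False)
print(len(reference), reference[32], target[16])  # 36 [1.0, 1.0, 1.0, 1.0] [0.0, 0.0, 0.0, 0.0]
```

In a real model these would be tensors concatenated along the channel dimension; the list version only shows which slots carry which information.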

Key Results:

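On DNA-Rendering, compared with Diffuman4D in its ground-truth skeleton setting, PSNR improves by 1.21 dB, SSIM improves by 0.0037, and LPIPS decreases by 0.0127.
Compared with the monocular baselines Diffuman4D-mono-skeleton and MV-Performer, PSNR improves by 9.32 dB and 8.00 dB, respectively.
On zero-shot ActorsHQ, compared with Diffuman4D in its monocular setting, PSNR improves by 3.35 dB, SSIM improves by 0.041, and LPIPS decreases by 0.030.
The model maintains robust cross-view consistency across different reference views, and quality improves monotonically with the number of reference views.
With mixed human-animal training, the model generalizes to animal categories without architectural changes or human-specific priors.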

Tech Stack:

Wan 2.1 T2V DiT 1.3B
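PRoPE (projective rotary position encoding)
RoPE (rotary position embedding)
SE(3) continuous camera geometry encoding
Three-stage training curriculum
4D Gaussian Splatting
DNA-Rendering dataset
ActorsHQ dataset
DFA animal dataset
Multi-view Captions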

Strengths:

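Needs no explicit geometry priors, which avoids propagating estimation errors and reduces dependence on specific human body models.
Flexibly supports many input/output combinations (monocular or sparse input, arbitrary target views) and generalizes well.
Generated videos have strong temporal and cross-view consistency and can be used directly for downstream reconstruction.
Builds on a large pretrained model with only minor architectural changes, making training efficient.
Generalizes to non-human subjects (animals), demonstrating broad applicability.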

Limitations:

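Depends on the pretrained Wan 2.1 model and may be limited by the distribution and resolution of its training data.
Generation quality may degrade for extreme poses, fast motion, or heavy occlusion.
Temporal rollout depends on clean historical tokens, so errors may accumulate in long video generation.
Validated only on a limited set of datasets; robustness in complex real-world scenes needs further testing.
The model is fairly large (1.3B parameters), so inference is computationally expensive.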

Relevance To Keywords:

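Unify Models: The model builds on the Wan 2.1 text-to-video model and combines multi-view generation with text control, reflecting the trend toward unified multimodal generation and understanding.
World Models: Video generation is one form of world model; Flex4DHuman simulates dynamic scenes by generating multi-view videos, which could serve downstream planning and simulation.
Representation Learning: The five-axis positional encoding (PRoPE) is a novel representation-learning approach that encodes camera geometry, time, and view information into the attention mechanism.
Model-Based RL: The generated 4D Gaussian splats could serve as an environment model for simulation and interaction in reinforcement learning, but the paper does not directly involve RL training.
Post-training: The model is fine-tuned through a three-stage training curriculum, a post-training strategy that improves task-specific performance.

52. PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks PASS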

Score: 36.0 / 35.2

Authors: Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng, Ting Sun, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Changda Zhou, Hongen Liu, Suyin Liang, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

Published: 2026-06-11

TL;DR: PP-OCRv6 introduces a lightweight OCR system using unified MetaFormer blocks that surpasses billion-scale VLMs in OCR accuracy and efficiency with significantly fewer parameters.

Abstract

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9× faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring Rationale: The paper focuses on efficient OCR architecture (MetaFormer) rather than World Models or RL. It unifies detection/recognition tasks (Unify Models) and involves Vision+Text (MultiModal, Visual Encoder) but lacks Tokenizer, RL, or Reasoning components. Total weighted score: 36.0 (Passes 35.2 threshold). No expert authors found.

Keywords

PP-OCRv6, MetaFormer, Lightweight OCR, VLMs Comparison, Structural Reparameterization, Detection and Recognition, Parameter Efficiency

In-Depth Analysis

Chinese Title: PP-OCRv6：从1.5M到34.5M参数，在OCR任务上超越十亿级视觉语言模型

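Summary: This paper proposes PP-OCRv6, a lightweight OCR system that, through architectural innovation and data-centric optimization, outperforms billion-scale vision-language models (VLMs) on OCR tasks with far fewer parameters. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting detection and recognition through task-specific stride configurations. The system offers three model tiers (medium, small, tiny) covering deployment from server to edge. On in-house benchmarks, PP-OCRv6_medium reaches 83.2% recognition accuracy and 86.2% detection Hmean, improving on PP-OCRv5_server by 5.1% and 4.6% respectively, while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier runs 3.9× faster than PP-OCRv5_mobile on an Intel Xeon CPU at comparable accuracy. The paper argues that lightweight, dedicated OCR systems remain practical and effective in the era of large models.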

Innovations:

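Proposes a unified, scalable model family spanning 1.5M to 34.5M parameters, in which three tiers share the same block primitives, reducing engineering complexity.
Designs the LCNetV4 backbone following the MetaFormer paradigm, decoupling spatial token mixing from channel mixing and introducing structural reparameterization (RepDWConv) to increase training-time expressiveness without adding inference cost.
Proposes RepLKFPN (a reparameterizable large-kernel feature pyramid network), which uses dilated reparameterizable depthwise convolutions to obtain a large receptive field in the detection neck.
Proposes the EncoderWithLightSVTR recognition neck, which combines local-global attention with additive skip connections to improve recognition performance.
Supports 50 languages and a range of industrial scenarios (digital displays, dot-matrix characters, tire markings, and others), markedly strengthening OCR performance in specialized settings.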

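Methodology: The paper combines architectural innovation with data-centric optimization. Architecture: the LCNetV4 backbone follows the MetaFormer paradigm, decomposing each block into a token mixer (3×3 depthwise separable convolution with reparameterization) and a channel mixer (1×1 pointwise convolution with expand-activate-compress and a residual connection); the RepLKFPN detection neck replaces dense large-kernel convolutions with dilated reparameterizable depthwise convolutions; the EncoderWithLightSVTR recognition neck uses a local-global attention mechanism. Data: the paper inherits the PP-OCRv5 data-selection methodology, which filters along three dimensions (difficulty, accuracy, and diversity). Task-specific stride configurations (standard stride-2 downsampling for detection, asymmetric stride (2,1) for recognition) let a single backbone serve both detection and recognition. Training uses multi-branch reparameterization, and the branches are fused into a single convolution at inference.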
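The last step above, fusing training-time branches into one inference-time convolution, rests on the linearity of convolution. The sketch below shows the idea for a single depthwise channel with a 3×3 branch, a 1×1 branch, and an identity branch. It is a generic RepVGG-style illustration, not the paper's RepDWConv: batch normalization is omitted, and the branch set and all names are assumptions.

```python
def conv3x3_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding (same output size)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < height and 0 <= x < width:
                        acc += kernel[di + 1][dj + 1] * image[y][x]
            out[i][j] = acc
    return out


def multi_branch(image, k3, k1):
    """Training-time form: 3x3 branch + 1x1 branch + identity branch, summed."""
    conv = conv3x3_same(image, k3)
    return [
        [conv[i][j] + k1 * image[i][j] + image[i][j] for j in range(len(image[0]))]
        for i in range(len(image))
    ]


def fuse(k3, k1):
    """Inference-time form: fold the 1x1 weight and the identity into the 3x3 center."""
    fused = [row[:] for row in k3]
    fused[1][1] += k1 + 1.0
    return fused


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.5, 0.0, -0.5], [0.25, 1.0, 0.25], [0.0, -0.25, 0.5]]
k1 = 0.75

branched = multi_branch(image, k3, k1)
single = conv3x3_same(image, fuse(k3, k1))
max_diff = max(abs(a - b) for ra, rb in zip(branched, single) for a, b in zip(ra, rb))
print(max_diff < 1e-9)  # True
```

Because a 1×1 kernel and the identity both act only on the center pixel, adding them to the center tap of the 3×3 kernel gives the same output with a single convolution, which is why the extra branches cost nothing at inference.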

Key Results:

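PP-OCRv6_medium reaches 83.2% recognition accuracy and 86.2% detection Hmean on in-house benchmarks, improving on PP-OCRv5_server by 5.1% and 4.6%, respectively.
With 34.5M parameters, PP-OCRv6_medium surpasses billion-scale VLMs including Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro.
PP-OCRv6_tiny runs 3.9× faster than PP-OCRv5_mobile on an Intel Xeon CPU while maintaining comparable accuracy.
The unified model family supports 50 languages and a range of industrial scenarios (digital displays, dot-matrix characters, tire markings, and others).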

Tech Stack:

MetaFormer architectural paradigm
Structural reparameterization (RepVGG style)
Depthwise separable convolution (DW Conv)
Dilated Reparameterizable Depthwise Convolution
Local-Global Attention
Additive Skip Connection
Squeeze-and-Excitation (SE) module
GELU activation function
Batch normalization (BN) zero-initialization trick
Asymmetric Stride
CTC/NRTR decoding

Strengths:

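Lightweight design with very high parameter efficiency: it surpasses billion-scale VLMs with a very small parameter count and is well suited to real deployment.
A unified architecture supports both detection and recognition, reducing engineering maintenance cost.
Three tiers cover scenarios from server to edge, giving high flexibility.
Structural reparameterization increases training-time expressiveness without adding inference cost.
Supports multiple languages and a range of industrial scenarios, with strong generalization.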

Limitations:

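The paper does not disclose the composition or size of the in-house benchmarks, which limits reproducibility.
The comparison with VLMs may not cover all of the latest models, and the advantages of VLMs on general vision-language tasks are not fully discussed.
In extremely difficult scenes (for example, heavy occlusion or blur), lightweight models may still fall short of large VLMs.
The data-selection methodology is inherited from PP-OCRv5, and detailed ablations on the data-quality improvements are lacking.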

Relevance To Keywords:

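Unify Models: The paper proposes a unified model family (LCNetV4) in which a single backbone supports both detection and recognition, reflecting the idea of model unification.
World Models: The paper does not directly involve world models, although OCR as a perception task can be viewed as one component of a world model.
Representation Learning: LCNetV4's MetaFormer design decouples spatial and channel mixing, which is a representation-learning optimization.
Model-Based RL: The paper does not involve reinforcement learning; relevance is weak.
Native multimodal large models: The paper compares against native multimodal large models (such as Qwen3-VL), but it is itself a dedicated OCR model, not a multimodal large model.
Unified understanding and generation in multimodal large models: The paper does not involve generation tasks and addresses only text detection and recognition (understanding).
Post-training: The paper does not mention post-training strategies; relevance is weak.

53. A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition PASS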

Score: 36.0 / 35.2

Authors: Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue, Yanbin Hao

Published: 2026-06-11

TL;DR: This paper proposes a multi-modal framework with cross-subject pseudo-labeling for micro-gesture recognition, achieving a competitive F1-score of 68.13% in the MiGA-IJCAI Challenge.

Abstract

Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing the 4th place.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	1.0/10	1.5

Scoring Rationale: The paper focuses on micro-gesture recognition using multi-modal fusion (skeleton, heatmap, RGB) and domain adaptation techniques. It scores high on 'MultiModal' (9) and moderate on 'Visual Encoder' (6) due to the explicit use of visual feature extraction pipelines. Scores are low (1-3) for others as the paper lacks content on World Models, MLLM, Reinforcement Learning, Tokenizers, or Agentic Reasoning, which are central to the provided keyword list but irrelevant to this specific computer vision task. Total weighted score is 36.0. No expert authors from the target list were found in the authorship.

Keywords

Micro-gesture Recognition, Multi-Modal Framework, Cross-Subject Pseudo-Labeling, Semantic Alignment, Domain Adaptation, Skeleton Joint Coordinates, RGB Visual Features

In-Depth Analysis

Chinese Title: 面向微手势识别的跨主体伪标签与语义对齐多模态框架

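Summary: This paper targets three challenges in micro-gesture recognition: low signal-to-noise ratio, a severely long-tailed class distribution, and cross-subject domain shift. It proposes a multi-modal framework whose saliency-guided extraction pipeline integrates 68-keypoint skeleton coordinates, 3D heatmap volumes, and high-resolution RGB visual features. To protect tail classes, it designs a square-root smoothed weighting mechanism and an Orthogonal Semantic Embedding Loss. More importantly, it proposes a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which markedly improves cross-subject generalization. Finally, a temperature-scaled soft-voting mechanism alleviates overconfidence during late fusion. Experiments show that the framework achieves an F1-score of 68.13% in Track 1 of the MiGA-IJCAI Challenge, ranking fourth.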

Innovations:

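Proposes a saliency-guided multi-modal extraction pipeline that combines skeleton keypoints, 3D heatmaps, and RGB visual features and effectively suppresses background noise.
Introduces square-root smoothed weighting and an Orthogonal Semantic Embedding Loss, protecting tail classes under an extreme long-tailed distribution without sacrificing overall performance.
Designs a Cross-Modal Pseudo-Labeling (CMPL) strategy that builds a super dataset from high-confidence consensus on the test set, achieving unsupervised domain adaptation and markedly improving cross-subject generalization.
Adopts temperature-scaled soft-voting fusion to alleviate overconfidence in the multi-modal models and improve prediction reliability.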

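Methodology: First, OpenPose extracts 68-keypoint skeleton sequences, which are center-normalized and fed to a decoupled spatio-temporal CNN; the keypoints are also rendered as 3D heatmaps and processed by a 3D-ResNet; the RGB branch applies saliency cropping based on the skeleton bounding box, after which Video Swin Transformer and R(2+1)D extract features. Second, a square-root smoothed weighted Focal Loss and an Orthogonal Semantic Embedding Loss (with a semantic matrix based on QR decomposition) handle the long-tailed distribution. Third, Cross-Modal Pseudo-Labeling (CMPL) is run iteratively: high-confidence consensus among the multi-modal models on the test set produces pseudo-labels, which are merged with the training set into a super dataset used to retrain the models. Finally, temperature-scaled soft voting fuses the predictions of all branches.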
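The final fusion step can be sketched as a temperature-scaled softmax per branch followed by an average over branches. This is a minimal illustration that assumes a softmax normalizer, a single shared temperature, and equal branch weights; the logits and function names are hypothetical and the paper's exact fusion rule may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(branch_logits, temperature):
    """Average the temperature-scaled class probabilities of all branches."""
    probs = [softmax(logits, temperature) for logits in branch_logits]
    n_branches = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_branches for c in range(n_classes)]


# Hypothetical logits from three branches (skeleton, heatmap, RGB) over three classes.
branch_logits = [
    [10.0, 0.0, 0.0],  # a very confident branch
    [1.0, 2.0, 0.0],
    [1.0, 2.0, 0.0],
]

sharp = softmax(branch_logits[0], temperature=1.0)
softened = softmax(branch_logits[0], temperature=5.0)
fused = soft_vote(branch_logits, temperature=5.0)
print(sharp[0] > softened[0], round(sum(fused), 6))
```

A temperature above 1 flattens each branch's distribution, so a single overconfident branch contributes less extreme probabilities to the average than it would with hard voting or unscaled softmax.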

Key Results:

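On the iMiGUE dataset, the framework reaches an F1-score of 68.13%, ranking fourth in Track 1 of the MiGA-IJCAI Challenge.
The cross-modal pseudo-labeling strategy markedly improves single-modal robustness and cross-subject generalization.
Square-root smoothed weighting and the Orthogonal Semantic Embedding Loss effectively reduce the bias against tail classes caused by the long-tailed distribution.
Temperature-scaled soft-voting fusion outperforms direct averaging or hard voting and reduces model overconfidence.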

Tech Stack:

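OpenPose (68-keypoint extraction)
Decoupled spatio-temporal CNN (1×3 spatial convolution + 5×1 temporal convolution)
PoseC3D (3D heatmap rendering + 3D-ResNet50)
Video Swin Transformer (Swin3D)
R(2+1)D
Focal Loss (square-root smoothed weighting)
Orthogonal Semantic Embedding Loss (semantic matrix initialized by QR decomposition)
Cross-Modal Pseudo-Labeling (CMPL) iterative strategy
Temperature-scaled soft voting (σ(z/T))
Kinetics-400 pretraining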

Strengths:

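The multi-modal fusion design is comprehensive, drawing on skeleton topology, heatmap continuity, and RGB texture, which complement one another well.
The solutions for the long-tailed distribution and domain shift are novel and effective: square-root smoothing avoids exploding gradients, and the orthogonal semantic loss widens inter-class margins.
The cross-modal pseudo-labeling strategy needs no target-domain labels, which makes it practical, and the iterative process improves model robustness.
The temperature-scaled soft-voting mechanism is simple and effective, reducing confidence bias in multi-modal fusion.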

Limitations:

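Depends on OpenPose for keypoint extraction, which may introduce noise in low-quality video or under occlusion.
The CMPL strategy requires the multi-modal models to have high initial confidence on the test set; incorrect initial predictions may produce misleading pseudo-labels.
Experiments are validated only on the iMiGUE dataset, so generalization needs testing on more benchmarks.
There is no direct comparison with other domain adaptation methods (such as adversarial training), so the advantage is not fully quantified.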

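Relevance To Keywords: The paper focuses on micro-gesture recognition, which falls under multi-modal representation learning and relates directly to "representation learning". Its multi-modal fusion (RGB + skeleton + heatmap) reflects the thinking behind "native multimodal large models", but it does not use a large-model architecture. The cross-modal pseudo-labeling strategy can be viewed as an unsupervised domain adaptation method and has no direct connection to "world models" or "model-based reinforcement learning". The paper does not involve unified understanding and generation or post-training. Overall relevance is moderate, with the main contributions in representation learning and multi-modal fusion.

54. MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling FAIL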

Score: 34.5 / 35.2

Authors: Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

Published: 2026-06-11

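TL;DR: MaxProof uses generative-verifier reinforcement learning and population-level test-time scaling to bring the M3 model to gold-medal level on competition mathematical proofs.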

Abstract

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
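The abstract's last test-time step, returning one proof from a population through tournament selection, can be illustrated with a single-elimination bracket. The abstract does not specify the tournament format, so the bracket shape, the bye rule, and the `prefer` comparator (which stands in for the model acting as a ranker) are all assumptions made for this sketch.

```python
def tournament_select(candidates, prefer):
    """Single-elimination tournament over candidates.

    prefer(a, b) must return whichever of the two candidates is judged better.
    An unpaired candidate at the end of a round advances with a bye.
    """
    pool = list(candidates)
    if not pool:
        raise ValueError("tournament needs at least one candidate")
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # bye for the unpaired candidate
        pool = next_round
    return pool[0]


# Hypothetical population: (proof id, verifier score). In practice the
# comparison would come from a model call, not a stored number.
population = [("p1", 0.2), ("p2", 0.9), ("p3", 0.5), ("p4", 0.7), ("p5", 0.1)]


def prefer(a, b):
    """Stand-in ranker: keep the candidate with the higher score."""
    return a if a[1] >= b[1] else b


winner = tournament_select(population, prefer)
print(winner)  # ('p2', 0.9)
```

A bracket needs only one pairwise comparison per eliminated candidate, which keeps the number of ranker calls linear in the population size.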

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	8.0/10	12.0

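Scoring Rationale: The paper's core is the generation and verification of mathematical proofs, using reinforcement learning and test-time scaling. It is highly relevant to Agentic Reasoning (the model plays multiple roles in search and selection) and moderately relevant to Latent Reasoning (it involves symbolic reasoning). It has some relevance to Unify Models, MLLM, and MultiModal (the M3 model merges several capabilities and is presumably built on a large language model). It is almost unrelated to Tokenizer, Visual Encoder, and World Models (the task is purely symbolic mathematics, with no visual input and no environment-dynamics modeling). Relevance to model-based RL is low (the method is mainly generative-verifier RL, not typical model-based reinforcement learning). The author list does not include any of the designated experts.

Keywords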

Mathematical Proof, Generative-Verifier RL, Test-Time Scaling, Population-Level, Tournament Selection, MiniMax-M3, Proof Verification, Proof Generation

55. An LLM System for Autonomous Variational Quantum Circuit Design FAIL

Score: 34.5 / 35.2

Authors: Kenya Sakka, Wataru Mizukami, Kosuke Mitarai

Published: 2026-06-11

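TL;DR: This paper proposes an LLM-based autonomous agentic framework for iterative quantum circuit design and achieves competitive results on quantum feature map construction and on ansatz generation for the variational quantum eigensolver.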

Abstract

The design of high performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework that employs large language models (LLMs) to conduct iterative quantum circuit designs under explicit design constraints. Our system integrates seven components: Exploration, Generation, Discussion, Validation, Storage, Evaluation, and Review. These components form a closed-loop workflow that combines web-based knowledge acquisition, literature-grounded critique, executable code generation, and experimental feedback. We evaluate the framework on two tasks: quantum feature map construction for quantum machine learning and ansatz generation for variational quantum eigensolver applications in quantum chemistry. In image classification benchmarks, the best generated feature map outperforms representative quantum feature maps and, when scaled to larger qubit counts, surpasses the classical radial basis function kernel. In molecular ground state estimation across seven molecules, the generated ansatz attains competitive accuracy with widely used chemically inspired and hardware-efficient constructions while satisfying the imposed scaling constraints. These results establish LLM driven agentic system as a viable paradigm for automated quantum circuit design and illustrate how AI systems can participate in iterative scientific optimization workflows across scientific domains.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	9.0/10	13.5

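Scoring Rationale: The paper's core is an LLM-based autonomous agentic framework for quantum circuit design, so Agentic Reasoning is highly relevant (9). MLLM and model-based RL have some relevance (3) because the work uses LLMs and closed-loop feedback, but it does not explicitly involve multimodality or the core mechanisms of model-based reinforcement learning. Unify Models, World Models, and Latent Reasoning have low relevance (2), and Visual Encoder and MultiModal are entirely unrelated (0) because the task involves no visual or multimodal data. The author list does not include any of the designated experts, so there is no bonus. The weighted total is 34.5, below the dynamic passing score of 35.2.

Keywords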

LLM System, Autonomous Agentic Framework, Quantum Circuit Design, Variational Quantum Circuit, Iterative Design, Quantum Feature Map, Ansatz Generation

56. Exposure Bias as Epistemic Underidentification in Recursive Forecasting FAIL

Score: 34.5 / 35.2

Authors: Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho

Published: 2026-06-11

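TL;DR: The paper recasts exposure bias in recursive forecasting as a problem of epistemic underidentification on induced states and, by introducing provenance variables, shows that provenance-aware correction can mitigate the effects of distribution shift.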

Abstract

Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states Z and provenance variables P, and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation-class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.
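The setup the abstract starts from, a model supervised on observed histories but deployed on its own predictions, can be made concrete with a toy example. The sketch assumes a noiseless first-order linear process and a slightly mis-estimated coefficient; neither comes from the paper. It only illustrates the teacher-forcing versus rollout gap, and the binary flag shows one way a context could be tagged as observed or self-generated.

```python
def true_series(x0, a, steps):
    """Generate x[t+1] = a * x[t] for the given number of steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])
    return xs


def teacher_forced(xs, a_hat):
    """One-step predictions, each made from the observed previous value."""
    return [a_hat * x for x in xs[:-1]]


def rollout(x0, a_hat, steps):
    """Multi-step predictions, each made from the model's own previous output."""
    preds, x = [], x0
    for _ in range(steps):
        x = a_hat * x
        preds.append(x)
    return preds


def tag_provenance(values, self_generated):
    """Attach a binary flag: 0 for observed inputs, 1 for self-generated ones."""
    flag = 1 if self_generated else 0
    return [(value, flag) for value in values]


xs = true_series(1.0, a=0.9, steps=10)
tf = teacher_forced(xs, a_hat=0.95)
ro = rollout(1.0, a_hat=0.95, steps=10)

tf_error = abs(tf[-1] - xs[-1])  # error at step 10 when fed observed x[9]
ro_error = abs(ro[-1] - xs[-1])  # error at step 10 when fed its own output
print(ro_error > tf_error)  # True
```

The one-step error stays small because every input is an observed value, while the rollout error compounds because later inputs are states the model induced itself; those are exactly the contexts on which one-step supervision provides no targets.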

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	6.0/10	9.0
Latent Reasoning	1.5	7.0/10	10.5
Agentic Reasoning	1.5	3.0/10	4.5

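Scoring Rationale: The paper is mainly a theoretical analysis of exposure bias and epistemic underidentification in recursive forecasting. It involves latent dynamics and induced states, which gives it some conceptual connection to World Models and model-based RL (higher relevance scores), but it does not cover multimodal architectures, tokenizers, or MLLMs (relevance of 0). Unify Models here refers only to unifying a theoretical framing, not a model architecture, so it scores low.

Keywords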

Exposure Bias, Recursive Forecasting, Epistemic Underidentification, Induced States, Provenance Variables, Teacher-Forcing, Epistemic Uncertainty

57. Visual Place Recognition in Forests with Depth-Aware Distillation FAIL

Score: 34.5 / 35.2

Authors: Walter Nedov, Saimunur Rahman, Kavindie Katuwandeniya, David Hall, Kaushik Roy, Peyman Moghadam

Published: 2026-06-11

TL;DR: This paper proposes a depth-aware distillation framework to enhance visual place recognition in forests by injecting geometric cues into a DINOv2 model, improving robustness to appearance variations.

Abstract

Visual place recognition in natural forest environments remains challenging due to repetitive vegetation, weak structural cues, and significant appearance variation across traversals. To address this limitation, this paper proposes a lightweight depth-aware distillation framework that injects geometric cues into a DINOv2-based place recognition model, while maintaining its pre-trained descriptor space. Evaluated on the recent WildCross benchmark, the proposed approach yields gains over an appearance-only counterpart, providing robustness to appearance variations. These results demonstrate the importance of depth as a strong complementary modality for place recognition in natural environments and identify depth-aware distillation as a promising direction for more robust forest perception.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	1.0/10	1.5

Scoring Rationale: The paper focuses on Visual Place Recognition using depth-aware distillation with DINOv2. It strongly aligns with 'Visual Encoder' (DINOv2 backbone) and moderately with 'MultiModal' (fusion of RGB and depth). It does not involve Large Language Models, World Models, Reinforcement Learning, or Agentic Reasoning, resulting in low scores for those keywords. No specified expert authors were found in the author list.

Keywords

Visual Place Recognition, Depth-Aware Distillation, DINOv2, Forest Environments, Geometric Cues, Appearance Variation, WildCross Benchmark

58. SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing FAIL

Score: 34.5 / 35.2

Authors: Xiangyu Lyu, Dan Lei

Published: 2026-06-11

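TL;DR: SeamEdit proposes a training-free, model-agnostic pipeline that uses a black-box VLM for semantic editing of large images and reduces stitching artifacts through geometric correction and dynamic-programming fusion.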

Abstract

Semantic region editing for large images must satisfy two requirements at the same time: high generative quality and natural integration with surrounding content. Some related methods rely on white-box models and leave the strong generation capability of closed-source models underexplored. Directly applying closed-source models to tiled editing, however, introduces several failure modes: semantic deformation, canvas-level alignment drift, and visible seam artifacts. This paper presents SeamEdit, a training-free and model-agnostic pipeline that treats any VLM with inpainting capability as a black-box oracle. SeamEdit mitigates these issues through a five-stage post-hoc pipeline: overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion. The pipeline reduces seam visibility and supports semantic modification of arbitrary tile regions.
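The last stage named in the abstract, dynamic-programming curved seam fusion, can be illustrated with the classic minimum-cost seam recurrence over an overlap region. This is a generic sketch, not SeamEdit's implementation: the cost map, the one-column-per-row connectivity, and the function name are assumptions.

```python
def min_cost_seam(cost):
    """Find a top-to-bottom seam of minimum total cost.

    cost[r][c] is the mismatch between two overlapping tiles at pixel (r, c).
    The seam picks one column per row and moves at most one column between
    adjacent rows, which lets it curve around high-mismatch areas.
    """
    rows, cols = len(cost), len(cost[0])
    acc = [row[:] for row in cost]
    # Forward pass: cheapest cumulative cost of reaching each pixel.
    for r in range(1, rows):
        for c in range(cols):
            lo, hi = max(0, c - 1), min(cols - 1, c + 1)
            acc[r][c] += min(acc[r - 1][lo:hi + 1])
    # Backward pass: trace the cheapest path from the bottom row upward.
    c = min(range(cols), key=lambda j: acc[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(0, c - 1), min(cols - 1, c + 1)
        c = min(range(lo, hi + 1), key=lambda j: acc[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam


# Hypothetical 3x3 mismatch map; low values mark where the tiles agree.
cost = [
    [5, 1, 5],
    [5, 5, 1],
    [5, 1, 5],
]
seam = min_cost_seam(cost)
print(seam)  # [1, 2, 1]
```

Pixels left of the seam would be taken from one tile and pixels right of it from the other, so the cut follows the path where the two tiles already agree and the transition is least visible.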

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	1.0/10	1.5

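Scoring Rationale: The paper focuses on a pipeline implementation for semantic editing of large images. Its core contribution is calling a black-box VLM and optimizing the stitching, not unifying model architectures or building a reinforcement learning framework. It therefore has low relevance (0-2) to Unify Models, Tokenizer, World Models, and model-based RL. MLLM and MultiModal have higher relevance (6-7) because the pipeline builds on the capabilities of multimodal large models. Latent Reasoning and Agentic Reasoning have low relevance (1-3) because the pipeline is a fixed procedure, not an autonomous reasoning system. The author list does not include any designated experts, so there is no bonus. The weighted total is 34.5, slightly below the dynamic passing score of 35.2.

Keywords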

Large-Image Semantic Editing, Black-Box VLM, Model-Agnostic Pipeline, Seam Reduction, Inpainting, Dynamic Programming, Visual-Language Model

59. ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning FAIL

Score: 33.0 / 35.2

Authors: Xucong Wang, Ziyu Ma, Yong Wang, Shidong Yang, Hailang Huang, Renda Li, Pengkun Wang, Xiangxiang Chu

Published: 2026-06-11

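TL;DR: ReSum introduces a reinforcement learning framework with a self-summarization mechanism that improves LLM reasoning performance by an average of 4% while reducing rollout length by 18.6%.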

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models (LLMs). However, existing RLVR methods often encourage unnecessarily long reasoning rollouts, which can degrade reasoning coherence and exhaust the available context budget. Existing approaches to long-context organization often depend on external mechanisms to organize rollouts, rather than enabling the model to manage its own reasoning trajectory. To address this limitation, we propose ReSum, a novel RLVR framework that enables LLMs to compress and organize their reasoning trajectories through self-summarization. Our pilot studies show that self-summarization stabilizes generation by lowering token-level entropy, and that introducing a "summarization" phrase can substantially mitigate errors propagated from an incorrect rollout prefix. Motivated by these findings, ReSum adopts a summarization-aware adaptive rollout mechanism that contrastively evaluates whether self-summarization benefits the ongoing reasoning process. Specifically, when the model spontaneously triggers self-summarization, ReSum masks the summarization phrase to create a contrastive branch; for non-summarization positions, it instead randomly injects the phrase to create a matched branch. We further design a summarization-aware advantage to enable finer-grained comparison between contrastive rollout trajectories. Extensive experiments show that ReSum improves performance by an average of 4% while reducing rollout length by 18.6%.
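
The token-level entropy mentioned in the pilot study is typically the Shannon entropy of the model's next-token distribution. The sketch below shows that quantity for two toy distributions; the probability values are made up for illustration and are not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]  # model is unsure among four tokens
peaked = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain of one token

print(token_entropy(uniform), token_entropy(peaked))
```

A lower value means the probability mass is concentrated on fewer tokens, which is the sense in which lower entropy corresponds to more stable generation.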

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	7.0/10	10.5

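Scoring rationale: The paper focuses on a reinforcement learning framework for LLM reasoning and summarization and matches the multimodal and world-model keywords poorly. Visual Encoder and MultiModal are unrelated (0 points); MLLM applies only insofar as the work concerns text-only LLMs (2 points); Unify Models refers here to task synergy rather than architectural unification (3 points); Tokenizer is touched only through token-level entropy (2 points); World Models and Latent Reasoning relate only indirectly through reasoning trajectories (2-3 points); model-based RL scores 3 points because the paper uses RLVR rather than an environment model; Agentic Reasoning is the most relevant (7 points) because the model manages its own reasoning trajectory. The weighted total is 33.0, below the dynamic passing score of 35.2. The author list contains none of the designated experts, such as Yang Shi, so no bonus applies.

Keywords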

LLM Reasoning, Summarization, Reinforcement Learning, Self-summarization, Adaptive Rollout, Verifiable Rewards, Reasoning Trajectory

60. Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation [FAIL]

Score: 33.0 / 35.2

Authors: Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang, Han-Jia Ye

Published: 2026-06-11

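TL;DR: This paper analyzes the sparsity and geometry of parameter updates in on-policy distillation for language and vision-language models, finding that the updates are coordinate-sparse and spectrally concentrated and that they retain the geometric signatures of on-policy post-training.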

Abstract

On-policy distillation (OPD) has recently become a prominent post-training recipe as it combines two desirable ingredients: on-policy student trajectories and dense teacher supervision, yet how this hybrid changes a model's parameters remains unclear. Across several language and vision-language model pairs and use cases, our analysis yields two main findings. On sparsity, OPD-style updates are small and coordinate-sparse. They are distributed across layers and are usually FFN-heavy. This sparse structure is operationally useful: training only the discovered subnetwork recovers nearly the same performance as full OPD. However, the sparsity-inducing SGD optimizer underperforms AdamW in our optimizer ablation, likely because dense teacher supervision preserves heterogeneous coordinate-wise gradient scales where AdamW's adaptive scaling remains useful. On geometry, the updates are numerically full-rank but spectrally concentrated; they lie mostly away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are close to zero. These findings suggest that dense teacher supervision does not turn OPD into ordinary dense parameter rewriting; instead, OPD retains important geometric signatures of on-policy post-training.
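
Coordinate sparsity of an update can be measured as the fraction of parameters that stay essentially unchanged between the source and the trained weights. The sketch below is one plausible way to compute it on flattened weight lists; the tolerance `tol`, the function name, and the toy weights are assumptions, not the paper's measurement protocol.

```python
def update_sparsity(before, after, tol=1e-6):
    """Fraction of coordinates whose change is at most `tol` (treated as unchanged)."""
    unchanged = sum(1 for b, a in zip(before, after) if abs(a - b) <= tol)
    return unchanged / len(before)


# Toy flattened weights: only one of six coordinates moves.
source = [0.50, -0.20, 0.00, 0.10, 0.00, 0.30]
student = [0.50, -0.20, 0.04, 0.10, 0.00, 0.30]

print(update_sparsity(source, student))
```

A value near 1.0 indicates a sparse update, since few coordinates moved by more than the tolerance.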

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	1.0/10	1.5

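Scoring rationale: The paper studies the sparsity and geometry of parameter updates in on-policy distillation and involves vision-language models. It is therefore somewhat related to MLLM, MultiModal, and Visual Encoder (through model architecture), but has no direct connection to World Models, model-based RL, Tokenizer, or the reasoning keywords. The weighted total is 33.0, below the dynamic passing score of 35.2, indicating low relevance to the given keyword themes (such as world models and reinforcement learning).

Keywords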

On-policy Distillation, Sparsity, Geometry, Post-training, Vision-Language, Parameter Updates, Teacher-Student, Spectral Concentration

61. Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models [FAIL]

Score: 33.0 / 35.2

Authors: Jialin Gan, Xin Qiu, Guangzhe Chen, Xue Wang

Published: 2026-06-11

TL;DR: This paper proposes an adaptive token compression framework for Time Series Language Models that achieves significant inference acceleration and performance gains by treating time series and prompt tokens asymmetrically.

Abstract

Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to 7.68× inference acceleration and performance gains in 78% of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Time Series Language Models and token efficiency, showing high relevance to Tokenizer (token compression strategy) and moderate relevance to MultiModal/MLLM (unifying numerical and text data). It lacks direct connection to Visual Encoders, World Models, RL, or Agentic Reasoning, resulting in low scores for those keywords. The total weighted score is 33.0, which is below the dynamic passing score of 35.2, indicating limited alignment with the specific background keywords (World Models/RL).

Keywords

Time Series Language Models, Adaptive Token Compression, Asymmetric Token Processing, Inference Acceleration, Frequency-Domain Structure, Numerical-Textual Unification, Token Efficiency

62. Reward Modeling for Multi-Agent Orchestration [FAIL]

Score: 31.5 / 35.2

Authors: King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang

Published: 2026-06-11

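TL;DR: To address the limited supervision and high cost of training multi-agent orchestrators, this paper proposes OrchRM, a self-supervised reward modeling framework that improves training efficiency and orchestration performance without human annotations.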

Abstract

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
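
For readers unfamiliar with Bradley-Terry reward model training, the standard pairwise objective is the negative log-likelihood that the winning item outscores the losing one, -log sigmoid(s_win - s_lose). The sketch below shows that loss on scalar scores only; how OrchRM builds its win-lose pairs and parameterizes the reward model is not specified here, and the function name and example scores are illustrative assumptions.

```python
import math


def bradley_terry_loss(score_win, score_lose):
    """Pairwise Bradley-Terry loss: -log sigmoid(score_win - score_lose)."""
    margin = score_win - score_lose
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for one win-lose pair.
correct_order = bradley_terry_loss(2.0, 0.0)  # winner scored higher: small loss
wrong_order = bradley_terry_loss(0.0, 2.0)    # winner scored lower: large loss
print(correct_order, wrong_order)
```

Minimizing this loss over many pairs pushes the reward model to score preferred orchestrations above rejected ones.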

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	8.0/10	12.0

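Scoring rationale: The paper centers on orchestration and reward modeling for multi-agent systems (MAS) and is highly relevant to 'Agentic Reasoning'. It does not involve multimodal components (MultiModal, MLLM, Visual Encoder), tokenizer design (Tokenizer), world models (World Models), or model-based RL (it uses reward modeling rather than a dynamics model), nor does it emphasize unifying model architectures (Unify Models), so these keywords score low.

Keywords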

Multi-Agent Systems, Reward Modeling, Orchestration, LLM, Self-supervised, Bradley-Terry, Test-time scaling

63. VideoMDM: Towards 3D Human Motion Generation From 2D Supervision [FAIL]

Score: 31.5 / 35.2

Authors: Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

Published: 2026-06-11

TL;DR: VideoMDM introduces a diffusion framework generating 3D human motion from 2D video supervision without 3D ground truth, closing the gap to fully 3D-supervised methods.

Abstract

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	1.0/10	1.5

Scoring rationale: This paper focuses on 3D motion generation via diffusion from 2D supervision, showing low relevance to MLLM/RL keywords (Tokenizer, MLLM, RL, Agentic). It moderately aligns with MultiModal (video+motion) and World Models (prior learning), but has limited connection to Unify Models, Visual Encoder, and Latent Reasoning as defined in the prompt's context.

Keywords

3D Human Motion Generation, 2D Supervision, Diffusion-based Framework, Monocular Videos, Reprojection Loss, Motion Prior, 2D-to-3D Lifter

64. Recursive Agent Harnesses [FAIL]

Score: 31.5 / 35.2

Authors: Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah

Published: 2026-06-11

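TL;DR: This paper proposes the Recursive Agent Harness (RAH) architecture, in which a parent agent generates and runs a script that spawns subagent harnesses in parallel, substantially improving long-context reasoning over a coding-agent baseline.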

Abstract

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.
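
The fan-out pattern described in the abstract, a parent that runs a script which launches subagents in parallel and then aggregates their results, can be sketched with a thread pool. Everything here is a hypothetical stand-in: `run_subagent` only counts words, whereas a real subagent harness would have its own tools, code execution, and planning, and the chunking rule is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(chunk):
    """Hypothetical stand-in for a full subagent harness; here it just counts words."""
    return len(chunk.split())


def parent_agent(document, chunk_size=4, max_workers=4):
    """Split the input, fan out to subagents in parallel, then aggregate."""
    words = document.split()
    chunks = [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input chunks.
        partials = list(pool.map(run_subagent, chunks))
    return sum(partials)


print(parent_agent("a b c d e f g h i j"))
```

Splitting the long input keeps each subagent's context small, which is the motivation for recursion in long-context settings.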

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	9.0/10	13.5

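Scoring rationale: The paper's core proposal is the Recursive Agent Harness (RAH) architecture, which focuses on recursive agent invocation and tool use, so it is highly relevant to 'Agentic Reasoning'. It unifies RLMs with agent patterns, hence some relevance to 'Unify Models'; it uses large language model backbones, hence some relevance to 'MLLM'. It does not involve multimodality, visual encoders, tokenizers, or world-model learning, so those keywords score low. 'model-based RL' applies only through planning, and 'Latent Reasoning' scores low because the reasoning the paper covers is explicit.

Keywords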

Recursive Agent Harnesses, Long-context Reasoning, Agent Recursion, Tool Use, Coding Agents, Subagent Parallelism, Function Calls

65. EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge [FAIL]

Score: 30.0 / 35.2

Authors: Yunhan Wang, Jiaan Wang, Lianzhe Huang, Xianfeng Zeng, Fandong Meng

Published: 2026-06-11

TL;DR: This paper introduces EvoBrowseComp, a contamination-free benchmark using live-web traversal to evaluate search agents' genuine retrieval and reasoning capabilities on evolving knowledge.

Abstract

Search Agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing benchmarks such as BrowseComp rely on static knowledge, making them vulnerable to test-set contamination and parametric memorization. Consequently, models can achieve high scores through fact recall rather than genuine retrieval, obscuring true browsing competence via reasoning shortcuts. In this paper, we introduce EvoBrowseComp, an evolving benchmark of 400 English and 400 Chinese contamination-free complex questions synthesized via live-web traversal. To collect these questions, we design a three-agent collaborative framework: (1) a QA synthesis agent that retrieves fresh knowledge from the live web to synthesize QA pairs; (2) an information filtering agent that filters retrieved knowledge in terms of credibility and popularity to block parametric shortcuts; and (3) a high-level guidance agent that formalizes questions into reasoning graphs to reduce logical redundancy and shortcuts in synthesized QA pairs. Because the framework supports fully automated synthesis, EvoBrowseComp can be regularly updated to prevent data contamination and maintain temporal freshness. Extensive experiments confirm its great difficulty, requiring broad horizontal search. It establishes a scalable paradigm for auto-updatable, high-difficulty benchmarking that keeps pace with both evolving world knowledge and advancing agent capabilities.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	8.0/10	12.0

Scoring rationale: The paper focuses on benchmarking search agents via live-web traversal, which aligns strongly with 'Agentic Reasoning' (core topic) and weakly with 'World Models' (evolving knowledge). It has minimal connection to multimodal components (Visual Encoder, MultiModal, MLLM), tokenizers, model-based RL, or model unification, resulting in low scores for those keywords. No expert authors from the specified list were found in the author roster.

Keywords

Search Agents, Evolving Knowledge, Live-web Traversal, Contamination-free, QA Synthesis, Reasoning Graphs, Benchmarking

66. EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis [FAIL]

Score: 28.5 / 35.2

Authors: Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor, Kenny Workman

Published: 2026-06-11

TL;DR: EpiBench evaluates AI agents on epigenomics workflows, finding that while agents retrieve correct data, they often fail at deeper scientific judgment required for analysis decisions.

Abstract

We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3--53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6--48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2--47.8 and 31.0--47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.
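
The reported pass rates follow directly from the attempt counts in the abstract, and the trajectory total is consistent with 16 model-harness pairs at 318 attempts each. The short check below reproduces that arithmetic. It does not attempt to reproduce the confidence intervals, whose construction the abstract does not describe; the observation that 318 equals 3 × 106 suggests three attempts per evaluation, but that is an inference rather than something the abstract states.

```python
ATTEMPTS_PER_PAIR = 318   # attempts per model-harness pair, from the abstract
MODEL_HARNESS_PAIRS = 16

# Pass counts quoted in the abstract.
passes = {
    "GPT-5.5 / Pi": 143,
    "GPT-5.5 / OpenAI Codex": 127,
    "Claude Opus 4.8 Max / Pi": 124,
    "GPT-5.4 / Pi": 124,
}

# Pass rate in percent, rounded to one decimal place as in the abstract.
rates = {name: round(100 * n / ATTEMPTS_PER_PAIR, 1) for name, n in passes.items()}
total_trajectories = MODEL_HARNESS_PAIRS * ATTEMPTS_PER_PAIR

for name, rate in rates.items():
    print(f"{name}: {rate}%")
print(total_trajectories)
```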

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	7.0/10	10.5

Scoring rationale: The paper introduces EpiBench, a benchmark for evaluating AI agents on epigenomics analysis tasks. It does not propose new model architectures (Unify Models, Tokenizer, Visual Encoder) or focus on World Models/Model-Based RL. MLLM and MultiModal have low relevance as the paper evaluates existing models rather than studying their architecture. Agentic Reasoning is the most relevant keyword as the core focus is on agent decision-making workflows. No expert authors from the specified list are present. The weighted total score is 28.5, below the dynamic passing score of 35.2, indicating low alignment with the provided keyword set.

Keywords

EpiBench, Epigenomics Analysis, AI Agents, Verifiable Benchmark, Workflow States, Scientific Judgment, Model-Harness Pairs

67. A Three-Layer Framework for AI in Scientific Discovery [FAIL]

Score: 28.5 / 35.2

Authors: Guojun Liao

Published: 2026-06-11

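TL;DR: To address the underdeveloped model-formation capability of AI in scientific discovery, this paper proposes a three-layer framework of search, model formation through qualitative reasoning, and execution, and argues that model formation is the central innovation.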

Abstract

Current discussions of AI in scientific discovery are often dominated by two visible capabilities: search over existing knowledge and execution through optimization, simulation, and automation. Both are important, but neither fully captures the central act of discovery: the formation and evolution of models. This paper proposes a three-layer view of AI in discovery. Layer 1 is search and retrieval by large language models. Layer 2, as the main innovation of this paper, is model formation through qualitative reasoning: the capacity to recognize when a current framework is structurally inadequate and to understand the problem within a broader representational space, not through trial and error, but through structural insight into what is missing and where it can be found. Layer 3 is execution, optimization, and refinement. The main claim is that Layer 2 is both the most important and the least developed. Search without model formation remains confined to inherited frameworks, while execution without conceptual revision only amplifies an existing formulation. We illustrate Layer 2 reasoning through three case studies: S. S. Chern's intrinsic proof of the Gauss-Bonnet theorem, the resolution of the Nesterov Accelerated Gradient convergence problem via Lyapunov functions, and the autonomous disproof of the Erdos unit distance conjecture by OpenAI in 2026. Each case exhibits the same structural signature: a framework that had become inadequate, a missing conceptual object, and a resolution found in an unexpected neighboring field.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	4.0/10	6.0
Agentic Reasoning	1.5	3.0/10	4.5

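Scoring rationale: The paper proposes a three-layer framework for AI in scientific discovery, centered on qualitative reasoning and model formation. The provided keywords mainly concern multimodal architectures (Tokenizer, Visual Encoder, MLLM, MultiModal) and reinforcement learning (World Models, model-based RL, Agentic Reasoning). The paper matches these technical keywords poorly, with only partial connections through model formation (World Models), reasoning (Latent Reasoning, Agentic Reasoning), and a unifying framework (Unify Models), so the scores are low. The author list contains none of the designated experts, so no bonus applies.

Keywords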

AI in Scientific Discovery, Three-Layer Framework, Model Formation, Qualitative Reasoning, Structural Insight, Execution Optimization, LLM Search

68. Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations [FAIL]

Score: 28.5 / 35.2

Authors: Xinxin Li, Huiyao Chen, Meishan Zhang, Yunxin Li, Zulong Chen, Zhibo Ren, Xiaoqing Dong, Baotian Hu, Min Zhang

Published: 2026-06-11

TL;DR: The paper proposes an ontology memory-augmented framework to improve ASR correction in long text-speech conversations by organizing dialogue history into retrievable nodes, achieving better performance than direct correction methods.

Abstract

Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text and speech become increasingly interleaved in long interactions, ASR correction requires conversation-level contextual evidence. Existing ASR correction methods often rely on the current hypothesis or concatenate raw dialogue history. In such contexts, sparse correction evidence can be difficult to locate amid redundancy and noise. Addressing these challenges, we propose an ontology memory-augmented ASR correction framework for long text-speech interleaved conversations. The framework organizes preceding interaction history into a dynamically updatable ontology memory, where entities, terminology, surface variants, potential ASR confusions, and semantic relations are stored as retrievable nodes for context-grounded correction. To evaluate this setting, we construct RAMC-Corr, a dataset derived from MAGIC-RAMC for long-range ASR correction with grounded context. Experiments on RAMC-Corr show that our method improves over direct correction in 9 out of 10 paired backbone-setting combinations and encourages more selective and evidence-grounded corrections for context-dependent ASR errors.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on ASR correction using ontology memory for text-speech conversations. It has moderate relevance to MultiModal (text/speech) and Latent Reasoning (ontology), but low relevance to Vision, RL, and specific World Model architectures. The work does not involve Visual Encoder, model-based RL, or unified model training typical of the keyword set.

Keywords

ASR Correction, Ontology Memory, Text-Speech Interleaved, Long Conversations, Context-Grounded, Dialogue History, Memory-Augmented

69. Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda [FAIL]

Score: 28.5 / 35.2

Authors: Alexander Rombach, Chantale Lauer, Nijat Mehdiyev

Published: 2026-06-11

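TL;DR: To address compliance in regulated process automation, this paper proposes embedding regulatory symbolic structures directly into LLM agent architectures to achieve compliance-by-construction, rather than relying solely on external monitoring.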

Abstract

LLM-based agents are entering regulated industries where they automate judgment-intensive quality management processes. We argue that symbolic structures already embedded in these domains, including regulations, typed process models, and compliance constraints, should be treated not merely as external monitoring mechanisms but as core architectural components that shape the agent's decision-making and behavior. We propose compliance-by-construction as a complementary paradigm to guardrail-based monitoring: a structural foundation that prevents control-flow violations, while guardrails remain essential for catching semantic errors. We identify a structured set of neuro-symbolic research challenges at the foundational and capability levels and show that addressing them jointly enables compliance-by-construction. We call on the neuro-symbolic community to engage with regulated process automation as a high-impact research domain.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	6.0/10	9.0

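Scoring rationale: The paper focuses on neuro-symbolic agents for regulated process automation, centered on integrating symbolic structures into the agent architecture. Only Agentic Reasoning is highly relevant, because the paper focuses on agent decision-making. Tokenizer, Visual Encoder, and MultiModal concern multimodal components and are unrelated to this text- and symbol-based agent work; World Models and model-based RL concern reinforcement learning and environment modeling, which the paper does not address; Unify Models is loosely connected through the combination of neural and symbolic methods, but in a different sense. Overall relevance is therefore low.

Keywords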

Neuro-Symbolic Agents, Regulated Process Automation, Compliance-by-Construction, LLM-based Agents, Symbolic Structures, Decision-Making, Research Agenda

70. scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing [FAIL]

Score: 28.5 / 35.2

Authors: Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

Published: 2026-06-11

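TL;DR: This paper proposes scLLM-DSC, a framework that integrates biological semantic knowledge with topological structure and uses cross-modal contrastive alignment to significantly improve clustering accuracy for single-cell RNA sequencing.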

Abstract

Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	0.0/10	0.0
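
The table above appears to follow a simple rule: each keyword's score is weight times relevance, and the entry's total is the sum of those scores, compared against the pass line shown in the Score field. A minimal sketch of that arithmetic for this entry follows; the `>=` pass comparison is an assumption, and the expert-author bonus mentioned in some rationales is not modeled.

```python
# Relevance values (0-10) copied from the table above; every keyword has weight 1.5.
WEIGHT = 1.5
PASS_THRESHOLD = 35.2  # pass line quoted in this entry's Score field

relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 3.0,
    "MultiModal": 8.0,
    "model-based RL": 0.0,
    "Latent Reasoning": 5.0,
    "Agentic Reasoning": 0.0,
}

def weighted_total(relevance_by_keyword, weight=WEIGHT):
    """Sum weight * relevance over all keywords."""
    return sum(weight * r for r in relevance_by_keyword.values())

total = weighted_total(relevance)
# Assumption: an entry passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")  # 28.5 / 35.2 -> FAIL
```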

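Scoring rationale: The paper addresses single-cell RNA sequencing clustering; its core is the cross-modal alignment of LLM knowledge with topological structure. 'MultiModal' is the most relevant (8) because the method fuses semantic and topological modalities; 'Latent Reasoning' is moderate (5) because of the latent-space alignment; 'MLLM' is low (3) because the paper only uses the semantic capabilities of an LLM, not a multimodal architecture; 'Unify Models' and 'Tokenizer' are low (2 and 1), reflecting the unification of knowledge views and the tokenizer as an implicit LLM component, respectively; the remaining keywords (Visual Encoder, World Models, model-based RL, Agentic Reasoning) are unrelated to a bioinformatics clustering task (0). None of the designated expert authors appear in the author list. The weighted total is 28.5, below the pass threshold of 35.2.

Keywords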

scRNA-seq, LLM-Knowledge Enhanced, Cross-Modal, Deep Structural Clustering, Contrastive Alignment, Graph-guided Encoder, Biological Semantics

71. TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum [FAIL]

Score: 28.5 / 35.2

Authors: Rawan Hesham, Ali Ashraf, Amr Ahmed, Malak Alaa, Omar Ahmed, Omar Wagih

Published: 2026-06-11

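TL;DR: TimeLens is a museum mobile app that combines YOLO detection with RAG question answering to deliver offline, real-time artifact recognition and bilingual Q&A.

Abstract (Chinese translation)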

TimeLens 是一款面向埃及大博物馆（GEM）的人工智能驱动双语移动导览系统。游客只需将手机对准展品，即可实时识别文物，并能用英语或阿拉伯语提出后续问题并获得回答。本研究针对展厅部署面临的三个特定问题：51 种编目文物之间的细粒度视觉相似性（其中包含许多几乎相同的拉美西斯时期雕像）、策展训练数据与手持拍摄条件之间的差距，以及 AI 导览陈述未经证实历史事实的风险。本文报告了两项工程贡献。首先，通过一项数据质量驱动的迭代研究开发了设备端文物检测器——从基础模型自动标注（YOLO-World），经空间标签清理规则优化，至完全手动标注的数据集——研究确定标签质量是决定性因素：最终的 YOLOv8n 模型解决了所有此前未能识别的类别，同时保持为 5.97 MB 的 TensorFlow Lite 模型文件，可在中端手机上实时运行（mAP@0.5 = 0.995，mAP@0.5:0.95 = 0.924）。其次，基于包含 108 条记录的 ChromaDB 知识库的双语检索增强生成（RAG）导览系统，在七个候选语言模型上进行了基准测试，最终选定 Gemma 4 E2B (Q4_K_M)；通过十项针对性优化，将端到端延迟从超过 30 秒降低至约 10 秒。这两个子系统被集成到一个生产级的 Flutter 应用程序中，该应用具备双语界面、基于位置的访问控制以及文本到语音（TTS）支持。

Abstract

TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-device artifact detector was developed through a data-quality-driven iteration study -- from foundation-model auto-annotation (YOLO-World), through spatial label-cleaning rules, to a fully hand-annotated dataset -- isolating label quality as the decisive factor: the final YOLOv8n model resolves every previously failing class while remaining a 5.97 MB TensorFlow Lite asset that runs in real time on a mid-range phone (mAP@0.5 = 0.995, mAP@0.5:0.95 = 0.924). Second, a bilingual Retrieval-Augmented Generation (RAG) guide, grounded in a 108-record ChromaDB knowledge base, was benchmarked across seven candidate language models, with Gemma 4 E2B (Q4_K_M) selected; ten targeted optimizations reduce end-to-end latency from over 30 s to approximately 10 s. Both subsystems are integrated in a production Flutter application with bilingual interface, museum location gating, and text-to-speech support.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	4.0/10	6.0

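Scoring rationale: This is an applied engineering paper focused on implementing an artifact recognition and question-answering system for a museum mobile app. It does not involve theoretical architectures such as world models, reinforcement learning, or latent reasoning. Although the system combines a vision module (YOLO) with a language module (Gemma) and therefore counts as a multimodal application (MultiModal), it is neither a unified MLLM nor an architectural study of tokenizers or visual encoders, so the related keywords score low. None of the designated expert authors appear in the author list.

Keywords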

On-Device Artifact Recognition, Retrieval-Augmented Generation, Bilingual Mobile Guide, YOLOv8n, Gemma LLM, Grand Egyptian Museum, Real-time Inference, Flutter Application

72. EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery [FAIL]

Score: 27.0 / 35.2

Authors: Amy Xin, Jiening Siow, Junjie Wang, Zijun Yao, Fanjin Zhang, Jian Song, Lei Hou, Juanzi Li

Published: 2026-06-11

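TL;DR: EurekAgent is an environment-engineered agent system that advances autonomous scientific discovery by engineering permissions, artifacts, budgets, and human-in-the-loop interaction.

Abstract (Chinese translation)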

基于大语言模型（LLM）的智能体在自动化科学发现方面显示出日益增长的潜力。给定一个可优化指标和执行环境，它们可以提出、验证并迭代科学解决方案，并产生了优于人类设计方法的结果。随着模型能力的持续提升，我们认为自主科学发现的瓶颈正从规定智能体工作流程转向设计智能体环境：即塑造智能体行为的资源、约束和接口。我们将此定义为环境工程（Environment Engineering）：构建能够放大生产性行为（如开放式探索、系统化的工件管理以及智能体间协作）的环境，同时抑制有害行为（如奖励黑客和高摩擦的人类监督）。我们提出了 EurekAgent，一种面向基于指标驱动自主科学发现的环境工程智能体系统。EurekAgent 从四个维度对环境进行工程化设计：权限工程用于限制智能体执行范围和隔离评估；工件工程用于基于文件系统和 Git 的协作；预算工程用于感知预算的探索；人在回路工程用于简化人类监督和干预。EurekAgent 在多个数学、内核工程和机器学习任务上设定了新的最先进结果，包括使用不到 11 美元总 API 成本发现的新最先进 26 圆填充结果。我们开源了代码和结果，并呼吁将环境工程作为开发可靠自主研究智能体的核心研究方向。

Abstract

LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they can propose, validate, and iterate scientific solutions, and have produced results that outperform human-designed approaches. As model capabilities continue to improve, we argue that the bottleneck for autonomous scientific discovery is shifting from prescribing agent workflows to designing agent environments: the resources, constraints, and interfaces that shape agent behavior. We frame this as environment engineering: building environments that amplify productive behaviors, such as open-ended exploration, systematic artifact management, and inter-agent collaboration, while suppressing harmful behaviors, such as reward hacking and high-friction human oversight. We present EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. EurekAgent engineers the environment along four dimensions: permissions engineering for bounded agent execution and isolated evaluation; artifact engineering for filesystem and Git-based collaboration; budget engineering for budget-aware exploration; and human-in-the-loop engineering for easy human supervision and intervention. EurekAgent sets new state-of-the-art results on multiple mathematics, kernel engineering, and machine learning tasks, including new state-of-the-art 26-circle packing results discovered with less than $11 in total API cost. We open-source our code and results, and call for environment engineering as a core research direction for developing reliable autonomous research agents.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	6.0/10	9.0

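Scoring rationale: The paper focuses on agent environment engineering in support of autonomous scientific discovery, not on unifying model architectures, multimodal fusion, or concrete reinforcement learning algorithms. Technical components such as Tokenizer, Visual Encoder, and MultiModal are therefore essentially unrelated (1). Although the paper involves an agent system, its core contribution lies in environment design (permissions, artifacts, budgets, human-in-the-loop interaction) rather than the reasoning mechanism itself, so Agentic Reasoning is moderately relevant (6). World Models and Unify Models have some conceptual connection (the environment as a world model) but are not the paper's core method (2). model-based RL and Latent Reasoning are only weakly related (2 and 1). The weighted total is 27.0, below the dynamic pass threshold of 35.2, indicating low overall relevance to the given research-background keywords.

Keywords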

Agent Environment Engineering, Autonomous Scientific Discovery, LLM-based Agents, Permission Engineering, Artifact Engineering, Budget Engineering, Human-in-the-loop

73. Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning [FAIL]

Score: 27.0 / 35.2

Authors: Zach Studdiford, Gary Lupyan

Published: 2026-06-11

TL;DR: This paper investigates whether human and LLM everyday reasoning relies on abstract world models or pattern matching, finding similar error patterns and identifying attention heads that implement pattern-matching in LLMs.

Abstract (Chinese translation)

当大型语言模型（LLMs）无法泛化或在推理过程中出现随意错误时，这通常被视为证据，表明 LLMs 并未真正进行推理，而是在执行某种模式匹配。这意味着人类行为不会表现出同样的失败类型，因为人类推理使用的是基于原则的抽象世界模型。我们评估了人类参与者和 25 个 LLMs 在多种日常情境下进行常识推理的能力，并观察到人类和模型均出现了类似的错误模式。随后，我们确定了驱动 LLM 响应的一组注意力头，发现这些头实现了某种形式的模式匹配。这些注意力头使我们能够预测人类看似无法解释的推理错误，这些错误是由表面上无关的提示词细节引起的。综上所述，我们的结果表明，人类和 LLMs 的日常因果推理更符合某种模式匹配形式，而非抽象世界模型。

Abstract

When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	8.0/10	12.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	0.0/10	0.0

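Scoring rationale: The paper examines the mechanisms of everyday reasoning in humans and LLMs (pattern matching versus world models), so 'World Models' is the most relevant (8). 'Unify Models' is somewhat relevant (4) because the paper seeks a unified account of human and model mechanisms; 'Latent Reasoning' is somewhat relevant (3) because of the attention-head analysis; 'MLLM' scores low (3) because the paper concerns LLMs but not multimodal ones. The remaining keywords (Tokenizer, Visual Encoder, MultiModal, model-based RL, Agentic Reasoning) are not mentioned in the paper and score 0. The weighted total is 27.0, below the dynamic pass threshold of 35.2, indicating a weak match with the given multimodal and reinforcement learning keywords.

Keywords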

Pattern Matching, Human Reasoning, LLM Reasoning, World Models, Attention Heads, Everyday Reasoning, Common-sense Reasoning

74. From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification [FAIL]

Score: 25.5 / 35.2

Authors: Rongxin Yang, Shenghong He, Siyuan Zhu, Chao Yu

Published: 2026-06-11

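TL;DR: This paper proposes ProFact, a framework that uses reinforcement learning with process-aware rewards to jointly optimize multi-stage fact verification trajectories, significantly improving verification performance and inference efficiency.

Abstract (Chinese translation)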

近期结合大型语言模型（LLMs）与检索增强推理的方法在自动事实验证任务中展现出广阔前景。为了处理复杂的主张，这些验证管道通常执行多阶段工作流，协调紧密耦合的模块，包括主张分解、证据收集以及判决预测。然而，现有方法往往孤立地优化各个阶段或依赖固定的启发式规则，这限制了阶段间的自适应协调，并可能导致次优的结果。本文提出 ProFact，一个用于多阶段事实验证轨迹端到端优化的智能体强化学习框架。ProFact 训练一个统一策略，以协调主张分解、证据寻求、答案生成以及判决预测。为了解决最终真实性标签所提供的稀疏且延迟的监督问题，ProFact 引入了过程感知奖励，从而在整个验证过程中提供阶段级的学习信号。实证评估表明，ProFact 在验证性能和推理效率方面始终优于强基线。这些结果突显了过程感知轨迹优化在多阶段事实验证中的有效性。

Abstract

Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verification. To process complex claims, these verification pipelines typically execute multi-stage workflows that coordinate tightly coupled modules, including claim decomposition, evidence gathering, and verdict prediction. However, existing methods optimize individual stages in isolation or rely on fixed heuristics, which limits adaptive coordination among stages and can lead to suboptimal outcomes. In this work, we propose ProFact, an agentic reinforcement learning framework for end-to-end optimization of multi-stage fact verification trajectories. ProFact trains a unified policy to coordinate claim decomposition, evidence seeking, answer generation, and verdict prediction. To address the sparse and delayed supervision provided by final veracity labels, ProFact introduces process-aware rewards that provide stage-level learning signals throughout the verification process. Empirical evaluation shows that ProFact consistently outperforms strong baselines in both verification performance and inference efficiency. These results highlight the effectiveness of process-aware trajectory optimization for multi-stage fact verification.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	10.0/10	15.0

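Scoring rationale: The paper's core contribution is using reinforcement learning to optimize a multi-stage fact verification pipeline, so 'Agentic Reasoning' is highly relevant (10). The paper proposes a 'unified policy' that coordinates multiple stages, which gives some connection to 'Unify Models' (5). Its reinforcement-learning trajectory optimization is weakly related to 'model-based RL' (2), since learning an environment model is not explicitly mentioned. The other keywords (Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, Latent Reasoning) are not mentioned in the abstract and score 0. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, and others), so no bonus is added. The weighted total is 25.5, below the dynamic pass threshold of 35.2.

Keywords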

Agentic Reinforcement Learning, Multi-Stage Fact Verification, Process-Aware Rewards, Unified Policy, Claim Decomposition, Evidence Seeking, Verdict Prediction

75. Adaptive Turn-Taking for Real-time Multi-Party Voice Agents [FAIL]

Score: 24.0 / 35.2

Authors: Soumyajit Mitra, Prabhat Pandey, Abhinav Jain, Shanmukha Sahith, K V Vijay Girish

Published: 2026-06-11

TL;DR: The paper proposes ModeratorLM, a role-playing speech LLM agent that significantly improves multi-party turn-taking precision and recall by conditioning behavior on assigned roles and incorporating chain-of-thought reasoning.

Abstract (Chinese translation)

多方口语对话中的话轮转换仍然是基于语音的代理面临的基本挑战，尤其是在动态发言权竞争和用户期望变化的情况下。我们提出了 ModeratorLM，一种角色扮演语音代理，该代理在多方场景中根据明确指定的角色来调节话轮转换行为。该系统构建于一个以分块流式方式运行的语音大语言模型之上。我们进一步引入了一种推理增强变体，该变体结合了对对话上下文和指定角色的思维链推理。我们构建了 RolePlayConv，一个包含多样化助手角色的口语多方对话的大规模合成数据集。在真实会议数据和 RolePlayConv 上的实验表明，与非角色条件基线相比，话轮转换的精确率提高了 40% 以上，召回率提高了 70% 以上，同时大幅减少了假阳性中断。

Abstract

Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show that turn-taking precision improves by over 40% and recall by more than 70%, while false-positive interruptions are substantially reduced compared to non-role-conditioned baselines.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	6.0/10	9.0

Scoring rationale: The paper addresses turn-taking in multi-party voice conversations using a speech LLM, which has limited alignment with the provided keywords focusing on multimodal unification, visual encoders, world models, and model-based RL. Agentic Reasoning is moderately relevant (6) due to the agent role and reasoning mechanism, while Latent Reasoning is slightly relevant (3). Visual Encoder is irrelevant (0), and World Models is nearly so (1). None of the specified expert authors are listed, so no bonus points are added.

Keywords

Turn-taking, Multi-party, Voice Agents, Speech LLM, Role-playing, Chain-of-thought, Real-time

76. Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis [FAIL]

Score: 24.0 / 35.2

Authors: Gabriel Steele, Alzahra Altalib, Alessandro Perelli

Published: 2026-06-11

TL;DR: This paper proposes a Dual-Domain Equivariant GAN for multimodal CT-PET synthesis, achieving improved anatomical accuracy by integrating spatial and frequency domain learning with rotational equivariance constraints.

Abstract (Chinese translation)

我们提出了一种用于多模态 CT-PET 图像合成的双域等变生成对抗网络（DDE-GAN）。传统的基于 GAN 的方法通常仅在空间域运行，忽略几何一致性，导致结构保真度有限。DDE-GAN 通过同时从空间域和频率域（傅里叶域）联合学习，捕捉互补的解剖与频谱信息，以解决这些挑战。此外，将嵌入在 CT 和 PET 测量物理机制中的旋转等变性集成到生成器和判别器的损失函数中，以确保在旋转下响应一致，从而提高解剖准确性。分层双域训练策略通过多阶段损失函数强制域内与域间的一致性。在 HECKTOR 2022 CT-PET 数据集上评估，DDE-GAN 在 CT-PET 图像合成方面展现出优于基线模型的合成质量。结果表明，结合双域学习与几何等变性显著提高了多模态图像合成的准确性和鲁棒性，实现了 PET 补全和数据增强等实际应用。

Abstract

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the loss of both the generator and discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.
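
The abstract does not give the form of the equivariance term. One generic way to express rotational equivariance as a penalty is to compare "rotate, then apply the model" with "apply the model, then rotate". The sketch below does this for 90-degree rotations of small square images; the helper names and the two toy models are hypothetical and stand in for a generator, not for DDE-GAN itself.

```python
def rot90(image):
    """Rotate a square 2D list by 90 degrees counter-clockwise."""
    n = len(image)
    return [[image[c][n - 1 - r] for c in range(n)] for r in range(n)]

def equivariance_penalty(model, image):
    """Mean squared gap between model(rot90(x)) and rot90(model(x))."""
    rotated_then_mapped = model(rot90(image))
    mapped_then_rotated = rot90(model(image))
    n = len(image)
    return sum(
        (rotated_then_mapped[r][c] - mapped_then_rotated[r][c]) ** 2
        for r in range(n) for c in range(n)
    ) / (n * n)

def scale(image):
    """Pointwise intensity scaling: commutes with rotation."""
    return [[2.0 * v for v in row] for row in image]

def shift_right(image):
    """Cyclic shift along rows: does not commute with rotation."""
    return [[row[-1]] + row[:-1] for row in image]

image = [[1.0, 2.0], [3.0, 4.0]]
print(equivariance_penalty(scale, image))        # 0.0
print(equivariance_penalty(shift_right, image))  # 5.0
```

A penalty of zero means the mapping responds consistently under the tested rotation; adding such a term to a training loss pushes a model toward that behavior.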

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

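Scoring rationale: The paper focuses on medical image synthesis with a GAN and has little alignment with the LLM/RL keywords (Tokenizer, MLLM, World Models, model-based RL, Agentic Reasoning), which score 0. 'MultiModal' is highly relevant (9) because the work involves CT-PET multimodal fusion. 'Unify Models' and 'Latent Reasoning' are weakly relevant (2), referring to domain unification and the use of a latent space, respectively. 'Visual Encoder' is partially relevant (3) because the generator contains an encoding structure. None of the designated expert authors appear in the author list.

Keywords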

Dual-Domain Equivariant, Generative Adversarial Network, CT-PET Synthesis, Spatial and Frequency Domains, Rotational Equivariance, Multimodal Image Synthesis

77. MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment [FAIL]

Score: 24.0 / 35.2

Authors: Minlin Zeng, Zhipeng Zhou, Yang Qiu, Martin J. McKeown, Zhiqi Shen

Published: 2026-06-11

TL;DR: MOSAIC proposes a statistics-decoupled continual learning framework for Parkinson's gait assessment that mitigates forgetting and improves performance when handling incremental heterogeneous sensor modalities.

Abstract (Chinese translation)

基于步态的帕金森病评估日益依赖于异构传感器，然而临床系统很少能同时收集所有模态的数据。新传感器可能通过设备升级、协议变更或多中心部署引入，而历史患者数据往往因隐私保护和存储限制而不可用。这种模态增量设置面临三大挑战：不可靠的跨模态蒸馏、模态特定的统计偏移以及保留知识后的可塑性降低。本文提出 MOSAIC，一种紧凑的持续学习框架。首先，我们识别了毒教师（Toxic Teacher）现象，并引入 Modality-Specific Warm-Up（模态特定预热），以在蒸馏之前稳定新学习的模态表示。其次，我们提出一种统计解耦的 MSBN 架构，该架构在隔离传感器统计量的同时保持共享语义骨干。第三，我们设计了一种课程引导的排斥性目标用于 Plasticity Recovery（可塑性恢复），在保留遗留知识的同时恢复模态特定容量。在三个多模态帕金森步态数据集上的实验表明，MOSAIC 提高了最终性能并减轻了遗忘。项目代码可在以下网址获取：https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

Abstract

Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities simultaneously. New sensors may arrive through device upgrades, protocol changes, or multi-center deployment, while historical patient data are often unavailable because of privacy and storage constraints. This modality-incremental setting faces three challenges: unreliable cross-modal distillation, modality-specific statistical shifts, and reduced plasticity after preservation. We propose MOSAIC, a compact continual learning framework. First, we identify the Toxic Teacher phenomenon and introduce Modality-Specific Warm-Up to stabilize newly learned modality representations before distillation. Second, we propose a statistics-decoupled MSBN architecture that isolates sensor statistics while maintaining a shared semantic backbone. Third, we design a curriculum-guided repulsive objective for Plasticity Recovery, preserving legacy knowledge while recovering modality-specific capacity. Experiments on three multimodal Parkinson's gait datasets show that MOSAIC improves final performance and mitigates forgetting. Project code is available at: https://github.com/minlinzeng/MOSAIC_Modality-Specific-Adaptation-for-Incremental-Continual-Learning-in-PD-Gait-Assessment.git

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

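Scoring rationale: The paper's core is multimodal continual learning for Parkinson's gait assessment. 'MultiModal' is highly relevant (9.0) because the work fuses heterogeneous sensors; 'Unify Models' and 'Latent Reasoning' are weakly to moderately relevant (3.0 and 2.0), relating to knowledge unification and latent representations; 'Visual Encoder' is slightly relevant (2.0) because of the sensor processing. The remaining keywords (Tokenizer, MLLM, World Models, model-based RL, Agentic Reasoning) are unrelated to a medical continual learning task (0.0). The weighted total is 24.0, below the dynamic pass threshold of 35.2, indicating a weak match with the given keyword set, which leans toward MLLM and RL. None of the target expert authors appear in the author list.

Keywords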

Modality-Incremental Learning, Continual Learning, Parkinson's Disease Gait Assessment, Heterogeneous Sensors, Statistics-Decoupled Architecture, Plasticity Recovery, Cross-Modal Distillation, MSBN

78. Under What Conditions Can a Machine Become Genuinely Creative? [FAIL]

Score: 24.0 / 35.2

Authors: Yong Zeng

Published: 2026-06-11

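TL;DR: This paper proposes a Designics-based framework arguing that genuine machine creativity must satisfy ten requirements and emphasizing human-AI co-living, rather than relying on generative models or agentic workflows alone.

Abstract (Chinese translation)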

近期的人工智能系统能够生成文本、软件架构、假设、设计及科学工作流，这些成果看似具有创造力。本文探讨了机器在何种条件下能够真正具备创造力，以及人类能动性如何在共享的认知与创造环境中得以保留。本文构建了一个基于 Designics（承载意义的意向性改变的科学）的需求框架。本文主张，真正的机器创造力不应仅由输出新颖性、当前表现或临时架构来界定。相反，创造力被理解为通过递归干预动力学对不完整情境的结构转变。基于此观点，它依赖于十大要求：环境表征、范围感知、冲突识别、干预能力、后果观察、知识与环境更新、重新界定范围、从局部到全局的展开、基于价值的范围界定，以及人机共生（Human-AI Co-living）。这些要求通过 Designics 的三大定律（感知、冲突与能力）进行组织。本文通过选定的信息物理（Cyber-physical）与信息生物（Cyber-biological）研究，展示了这些要求的计算可行性，包括递归元素提取、自主网格生成，以及神经生理学与工作负载分析。随后，本文将开放系统、自动化发现框架、自修改智能体、基础模型（Foundation models）及智能体工作流（Agentic workflows）视为压力案例：它们虽展示了强大的生成手段，但本身并不能确立真正的机器创造力。最后，本文主张主动式人工智能伦理内在于真正的机器创造力之中，而非事后过滤器。基于价值的范围界定与人机共生必须塑造创造性机器感知环境、识别冲突、选择干预、观察后果、更新知识以及重新界定未来行动的方式。

Abstract

Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This paper asks under what conditions a machine can become genuinely creative, and how human agency can be preserved within shared cognitive and creative environments. It develops a requirement framework derived from Designics, the science of meaning-bearing intentional change. The paper argues that genuine machine creativity should not be defined by output novelty, current performance, or transient architecture alone. Instead, creativity is understood as the structural transformation of incomplete situations through recursive intervention dynamics. On this view, it depends on ten requirements: environment representation, scoped perception, conflict identification, intervention capability, consequence observation, knowledge and environment update, rescoping, local-to-global unfolding, value-based scoping, and human-AI co-living. These are organized through the three laws of Designics: perception, conflict, and capability. The paper illustrates the computational tractability of these requirements through selected cyber-physical and cyber-biological studies, including recursive element extraction, autonomous mesh generation, and neurophysiological and workload analysis. It then treats open-ended systems, automated discovery frameworks, self-modifying agents, foundation models, and agentic workflows as pressure cases: they demonstrate powerful generative means but do not by themselves establish genuine machine creativity. Finally, the paper argues that proactive AI ethics is internal to genuine machine creativity rather than an after-the-fact filter. Value-based scoping and human-AI co-living must shape how creative machines perceive environments, identify conflicts, select interventions, observe consequences, update knowledge, and rescope future action.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	6.0/10	9.0

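Scoring rationale: The paper focuses on a philosophical and theoretical framework for machine creativity (Designics) and does not involve specific model architectures or training techniques. Keywords such as Tokenizer and Visual Encoder are unrelated (0). Agentic Reasoning is the most relevant (6) because the paper explicitly discusses workflows and agent mechanisms. World Models and model-based RL have a moderate conceptual connection (3) through environment representation, intervention, and consequence observation. The remaining keywords (Unify Models, MLLM, MultiModal, Latent Reasoning) are weakly relevant (1). The weighted total is 24.0, below the dynamic pass threshold of 35.2, indicating a weak match with the given technical keywords. None of the designated expert authors appear in the author list.

Keywords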

Machine Creativity, Designics, Human-AI Co-living, Agentic Workflows, Environment Representation, Value-based Scoping, Recursive Intervention Dynamics, Foundation Models

79. The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems [FAIL]

Score: 24.0 / 35.2

Authors: Jiaqi Luo, Jiarun Dai, Zhile Chen, Jia Xu, Weibing Wang, Yawen Duan, Brian Tse, Geng Hong, Xudong Pan, Yuan Zhang, Min Yang

Published: 2026-06-11

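TL;DR: This paper proposes a new framework for evaluating autonomous penetration by LLM-powered AI systems, finding that current models achieve penetration success rates between 10.7% and 69.3% in standalone cyberattack scenarios and that the capability improves as models advance.

Abstract (Chinese translation)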

如今，能够造成实质性现实危害的网络攻击的自主执行，被广泛认为是前沿人工智能系统不得跨越的一条关键红线。在这一更广泛的红线情境中，自主渗透代表了一种核心使能能力和子任务：即由大语言模型（LLM）驱动的人工智能系统能够在无人干预的情况下，独立针对目标服务器开展对抗性操作，识别并利用漏洞，从而获得未经授权的访问或控制权。越来越多的研究工作试图评估人工智能系统的自主渗透能力。然而，现有的评估往往采用不透明的方法，依赖不切实际或过度简化的渗透测试场景，或者向 LLM 提供过多的先验知识和特定任务指导，无法准确捕捉现代人工智能系统在更广泛的高影响网络攻击场景中自主执行这一核心能力的程度。为了解决这些局限性，我们构建了一个新的自主渗透评估框架，包含两个组成部分：目标服务器和代理支架（agent scaffolding）。具体而言，在目标服务器端，我们基于与易受攻击服务一同部署的、无已知漏洞的安全服务数量，设计了两个级别的目标环境：Tier 1（一个安全服务）和 Tier 2（三个安全服务），共计 300 个目标服务器。与此同时，代理支架采用通用代理架构，配备了一套通用网络安全工具，且没有任何针对目标的先验知识。我们评估了 19 个开源及专有 LLM，发现当前模型的渗透成功率范围在 10.7% 到 69.3% 之间。此外，我们观察到自主渗透能力随着整体模型能力的进步而持续提升。

Abstract

Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical red lines that frontier AI systems must not cross. Within this broader red-line scenario, autonomous penetration represents a core enabling capability and subtask: the ability of LLM-powered AI systems to independently conduct adversarial operations against a target server without human intervention, identify and exploit vulnerabilities, and obtain unauthorized access or control. A growing body of work has sought to assess the autonomous penetration capabilities of AI systems. However, existing evaluations often employ opaque methodologies, rely on unrealistic or overly simplified penetration-testing scenarios, or provide LLMs with excessive prior knowledge and task-specific guidance, and cannot accurately capture the extent to which modern AI systems can autonomously perform this core capability within broader high-impact cyberattack scenarios. To address these limitations, we construct a new autonomous penetration evaluation framework consisting of two components: target servers and agent scaffolding. Specifically, on the target-server side, we design two levels of target environments based on the number of secure services without known vulnerabilities deployed alongside a vulnerable service: Tier 1 (one secure service) and Tier 2 (three secure services), resulting in a total of 300 target servers. Meanwhile, the agent scaffolding adopts a general-purpose agent architecture equipped with a set of general-purpose cybersecurity tools, without any target-specific prior knowledge. We evaluate 19 open-weight and proprietary LLMs, and find that current models achieve penetration success rates ranging from 10.7% to 69.3%. Moreover, we observe that autonomous penetration capability continues to improve alongside advances in overall model capability.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	8.0/10	12.0

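Scoring rationale: The paper evaluates the autonomous penetration capabilities of LLM agents rather than developing multimodal architectures. 'Agentic Reasoning' is highly relevant (8.0) because the paper centers on autonomous agent scaffolding and independent operation. Architecture keywords such as 'Tokenizer', 'Visual Encoder', and 'MultiModal' are unrelated (0.0) because the work concerns text-based LLMs. 'Unify Models', 'World Models', and 'model-based RL' are weakly relevant (2.0) because the paper evaluates existing models rather than proposing a new unified architecture or a specific RL or world-model method.

Keywords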

Autonomous Penetration, LLM-Powered AI Systems, Evaluation Framework, Target Servers, Agent Scaffolding, Cybersecurity Tools, Vulnerability Exploitation

80. A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding [FAIL]

Score: 24.0 / 35.2

Authors: Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee

Published: 2026-06-11

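TL;DR: A2D2 is a unified framework for reward-guided fine-tuning of any-length discrete diffusion models; jointly optimizing the insertion and unmasking policies improves reward optimization and generation flexibility.

Abstract (Chinese translation)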

离散扩散模型（Discrete diffusion models）为序列生成提供了一个简单且稳定的基于似然的框架，最近通过词元插入（token insertion）扩展到了任意长度设置。然而，针对任意长度离散扩散的严谨的基于奖励的微调（reward-guided fine-tuning）在很大程度上仍未被探索。我们提出了自适应解码的任意长度离散扩散微调（Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding, A2D2），这是一个统一框架，通过联合优化插入和去掩码策略（insertion and unmasking policies）以及基于质量的推理调度（quality-based inference schedule），实现对任意长度离散扩散模型的基于奖励的微调。我们推导了联合插入 - 去掩码路径测度的拉东 - 尼科迪姆导数（Radon-Nikodym derivative），这使得无需目标样本即可理论保证收敛到难以处理的奖励倾斜序列分布（reward-tilted sequence distribution）。在此基础上，我们将去掩码和插入质量确立为最小化解码误差的可行方法，并引入了自适应联合解码（Adaptive Joint Decoding, AJD）损失，该损失可证明能产生生成奖励分布的最优路径测度。实验表明，A2D2 在提升奖励优化的同时，相较于先前的固定长度微调和推理时指导方法，增强了生成的灵活性和准确性。

Abstract

Discrete diffusion models offer a simple and stable likelihood-based framework for sequence generation, recently extended to any-length settings via token insertion. Principled reward-guided fine-tuning for any-length discrete diffusion, however, remains largely unexplored. We introduce Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding (A2D2), a unified framework for reward-guided fine-tuning of any-length discrete diffusion models via joint optimization of the insertion and unmasking policies together with a quality-based inference schedule. We derive the Radon-Nikodym derivative for the joint insertion-unmasking path measures, enabling theoretically guaranteed convergence to the intractable reward-tilted sequence distribution without requiring target samples. Building on this, we establish unmasking and insertion quality as tractable approaches for minimizing decoding error and introduce the Adaptive Joint Decoding (AJD) loss, which provably yields the optimal path measure that generates the reward-tilted distribution. Empirically, A2D2 improves reward optimization while enhancing generation flexibility and accuracy over prior fixed-length fine-tuning and inference-time guidance methods.
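
For readers unfamiliar with the term, a "reward-tilted" distribution is usually written as a reference distribution reweighted by an exponentiated reward. The notation below (reference model $p_{\mathrm{ref}}$, reward $r$, temperature $\alpha > 0$) is the standard textbook form and an assumption here; the paper's own symbols may differ.

```latex
\[
p^{\star}(x) \;=\; \frac{p_{\mathrm{ref}}(x)\,\exp\!\big(r(x)/\alpha\big)}{Z},
\qquad
Z \;=\; \sum_{x'} p_{\mathrm{ref}}(x')\,\exp\!\big(r(x')/\alpha\big),
\]
\[
p^{\star} \;=\; \arg\max_{p}\;\; \mathbb{E}_{x \sim p}\big[r(x)\big] \;-\; \alpha\,\mathrm{KL}\big(p \,\|\, p_{\mathrm{ref}}\big).
\]
```

The normalizer $Z$ sums over every possible sequence, which is why the target is intractable to sample from directly and is instead approached by fine-tuning.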

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	0.0/10	0.0

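Scoring rationale: The paper concerns any-length sequence generation and reward-guided fine-tuning for discrete diffusion models, and is only weakly related to background topics such as multimodality and world models. 'Unify Models' scores 4 because the abstract describes a unified framework; 'Tokenizer' scores 5 because any-length discrete diffusion relies on a token insertion mechanism; 'model-based RL' scores 4 because reward-guided optimization is related to RL objectives; 'Latent Reasoning' scores 3 because diffusion models involve latent variables; the remaining keywords (Visual Encoder, World Models, MLLM, MultiModal, Agentic Reasoning) are not addressed in the paper and score 0. The author list (Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee) does not include the designated experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang, so no bonus is added.

Keywords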

Discrete Diffusion Models, Any-Length Sequence Generation, Reward-Guided Fine-Tuning, Adaptive Decoding, Insertion and Unmasking Policies, Joint Optimization, Token Insertion

81. Reliability of Probabilistic Emulation of Physical Systems [FAIL]

Score: 24.0 / 35.2

Authors: Sam F. Greenbury, Radka Jersakova, Paolo Conti, Marjan Famili, Christopher Iliffe Sprague, Edwin Brown, Jason D. McEwen

Published: 2026-06-11

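TL;DR: This paper evaluates the reliability of probabilistic emulation of physical systems, finding that CRPS-trained ensembles give better uncertainty coverage than latent-space generative models and faster inference.

Abstract (Chinese translation)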

针对物理系统的概率预测，已涌现出两种主导方法：生成模型（如扩散模型或流匹配方法）；以及注入随机性的确定性模型集成，后者使用连续排名概率分数（CRPS）损失进行训练。尽管这两种方法均已展现出强大的预测准确性，但它们的不确定性可靠性尚未得到系统评估。为填补这一空白，我们开发了一个框架，在匹配的模型规模和计算预算下，针对多样化的二维时空物理系统评估这两种方法。我们通过检查预测区间的经验覆盖率来评估概率代理的可靠性，同时兼顾准确性与计算效率指标。CRPS 训练的集成通常在单步预测和自回归滚动预测中实现更可靠的不确定性，其覆盖率优于在潜在空间训练生成模型的标准替代方案。此外，CRPS 方法提供了显著更快的推理速度。当生成模型在原始空间（而非压缩的潜在空间）中进行训练时（后者对于高维问题通常不可行），它们表现出与 CRPS 训练集成相当的覆盖率，但推理延迟显著更大。相比之下，当 CRPS 训练集成在潜在空间中进行训练时，其覆盖率相对于原始空间并未出现明显退化。生成模型与 CRPS 训练集成均展现出良好的预测准确性。为了促进未来的研究与实际应用，我们发布了 AutoCast，这是一个实现生成模型与 CRPS 训练集成的模块化框架，同时还发布了 AutoSim，这是一个用于快速原型设计的灵活数据集生成包。

Abstract

Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient space rather than a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
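
The two quantities the abstract leans on, the CRPS of an ensemble and the empirical coverage of a predictive interval, can be sketched for scalar forecasts as follows. This is a generic illustration using the standard sample-based CRPS estimator and a crude quantile rule; it is not the AutoCast implementation, and the function names and toy Gaussian data are assumptions.

```python
import random

def crps_ensemble(members, observation):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    m = len(members)
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return accuracy - 0.5 * spread

def empirical_coverage(ensembles, observations, level=0.9):
    """Fraction of observations inside each ensemble's central `level` interval."""
    hits = 0
    for members, y in zip(ensembles, observations):
        ordered = sorted(members)
        # Crude quantiles: nearest order statistics, no interpolation.
        lo_idx = round((1 - level) / 2 * (len(ordered) - 1))
        hi_idx = round((1 + level) / 2 * (len(ordered) - 1))
        hits += ordered[lo_idx] <= y <= ordered[hi_idx]
    return hits / len(observations)

# Toy data: truth ~ N(0, 1); one ensemble matches it, one is too narrow.
rng = random.Random(0)
observations = [rng.gauss(0.0, 1.0) for _ in range(500)]
calibrated = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in observations]
overconfident = [[rng.gauss(0.0, 0.3) for _ in range(50)] for _ in observations]

cov_cal = empirical_coverage(calibrated, observations)
cov_over = empirical_coverage(overconfident, observations)
print("coverage, calibrated:", cov_cal)
print("coverage, overconfident:", cov_over)
print("CRPS of members {0, 2} for truth 1:", crps_ensemble([0.0, 2.0], 1.0))
```

A reliable emulator has coverage close to the nominal level; the too-narrow ensemble falls well short of it even though its members are centered on the right value.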

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	0.0/10	0.0

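Scoring rationale: The paper focuses on probabilistic emulation of physical systems and uncertainty evaluation, comparing generative models with CRPS-trained ensembles. It overlaps conceptually with World Models and model-based RL (system modeling, latent space), but has little connection to MLLM, Tokenizer, Agentic Reasoning, and other core topics of multimodal large models and agentic reasoning, so most keywords score low.

Keywords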

Probabilistic Emulation, Physical Systems, Generative Models, CRPS Loss, Uncertainty Reliability, Spatiotemporal Forecasting, AutoCast Framework

82. Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning (FAIL)

Score: 22.5 / 35.2

Authors: Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez

Published: 2026-06-11

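TL;DR: This paper proposes RA-RFT, a framework that trains language models to reason by analogy through retrieval-augmented reinforcement fine-tuning, and that clearly outperforms standard reinforcement fine-tuning on mathematical reasoning tasks.

Abstract (Chinese translation)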

检索增强生成（RAG）已成为将语言模型锚定在外部知识中的标准机制，然而基于词汇或语义相似度的常规检索并不适合复杂的推理任务：语义相似的问题可能需要完全不同的解决方案策略，而表面看似不同的问题可能共享相同的底层推理模式。我们提出检索增强强化微调（RA-RFT），这是一种后训练框架，旨在教导语言模型通过类比进行推理。RA-RFT 使用金标准相关性蒸馏来训练一个检索器，该检索器根据预期推理收益而非语义重叠对上下文进行排序，随后利用检索到的类比示范，通过强化微调方法对策略模型进行微调，从而使模型能够在可验证的结果奖励下学习利用推理轨迹。我们进一步分析了检索上下文的多样性，发现推理感知检索揭示了互补的解决方案策略，为具体问题提供不同的推理支架。在具有挑战性的数学推理基准上，RA-RFT 一贯优于标准的强化微调方法。例如，相较于 GRPO，它在 Qwen3-1.7B 和 Qwen3-4B 上分别将 AIME 2025 平均@32 准确率提高了 7.1 和 2.8 个百分点——这表明推理感知检索是一个互补的改进维度，且正交于奖励设计或训练课程方面的进展。

Abstract

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively -- suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	5.0/10	7.5
Agentic Reasoning	1.5	3.0/10	4.5
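
The totals in this report appear to be the sum of weight times relevance over the keyword rows. A minimal Python sketch that reproduces the total for this entry is shown here. The function names and the `>=` pass rule are assumptions for illustration; the report itself only states that totals below the dynamic passing score of 35.2 fail.

```python
PASS_THRESHOLD = 35.2  # dynamic passing score quoted in this report

# (keyword, weight, relevance out of 10), copied from the table for this entry
ROWS = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 2.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 2.0),
    ("Latent Reasoning", 1.5, 5.0),
    ("Agentic Reasoning", 1.5, 3.0),
]


def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


def verdict(total, threshold=PASS_THRESHOLD):
    """Assumed rule: a paper passes when its total reaches the threshold."""
    return "PASS" if total >= threshold else "FAIL"


if __name__ == "__main__":
    total = weighted_total(ROWS)
    print(total, verdict(total))
```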

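Scoring rationale: The paper centers on analogical reasoning in language models and retrieval-augmented reinforcement fine-tuning, with no direct connection to multimodality, visual encoders, or world models (0-1 points). It does involve reinforcement learning, but it uses policy-optimization methods such as GRPO, which are model-free, so relevance to model-based RL is low. Analogical reasoning touches on latent reasoning patterns, so Latent Reasoning receives a moderate score. None of the listed expert authors were found.

Keywords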

Retrieval-Augmented Reinforcement Fine-Tuning, Reasoning by Analogy, Language Models, Mathematical Reasoning, Gold-Relevance Distillation, Analogous Demonstrations, Reinforcement Fine-Tuning

83. SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation (FAIL)

Score: 22.5 / 35.2

Authors: Zian Yang, Zixin Wang

Published: 2026-06-11

TL;DR: SmartFont proposes a diffusion-based framework with dynamic condition allocation to balance global and local features for high-quality few-shot font generation.

Abstract (Chinese translation)

少样本 (Few-shot) 字体生成同时需要全局结构完整性和细粒度局部风格保真度。现有方法通常要么依赖全局内容 - 风格建模 (Global Content-Style Modeling)，虽然稳健但解耦不完全，要么强调组件/局部建模 (Component/Local Modeling)，虽然能捕捉细节却严重依赖局部先验 (Local Priors) 和参考覆盖度 (Reference Coverage)。我们认为，关键挑战不仅在于学习更纯粹的条件，而在于生成过程中通过多层分配 (Multi-level Allocation) 组织互补但有偏的全局与局部条件。为此，我们提出 SmartFont，一种基于扩散 (Diffusion) 的少样本字体生成框架，该框架将全局内容 - 风格生成与弱监督局部校正专家 (Weakly Supervised Local Corrective Experts) 相结合。局部分支通过弱组件监督学习专家级局部概念和语义有意义的空间映射，执行语义 - 空间分配 (Semantic-Spatial Allocation)，从而实现细粒度校正，而无需显式组件条件推断 (Explicit Component-Conditioned Inference)。在此基础上，一个去噪状态条件分配模块 (Denoising-State Condition Allocation Module) 在时间步 (Timesteps) 和注入块 (Injection Blocks) 上自适应地加权全局内容、全局风格和局部校正特征。大量实验表明，SmartFont 实现了更好的全局 - 局部平衡，提高了字形 (Glyph) 质量和局部细节保真度。

Abstract

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

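Scoring rationale: The paper focuses on few-shot font generation with diffusion models and has little connection to world models, reinforcement learning, or unified large-model architectures. Only MultiModal is moderately relevant, owing to text-image generation. No target expert authors are included.

Keywords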

Few-shot font generation, Diffusion model, Global-local balance, Condition allocation, Glyph quality, Weakly supervised, Dynamic condition allocation

84. ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space (FAIL)

Score: 22.5 / 35.2

Authors: Pratyush Chaudhari

Published: 2026-06-11

TL;DR: The paper introduces an Ethical Robustness Testing System (ERTS) using semantic perturbations within a bounded ethical consequence space to evaluate LLMs, revealing that most models fail robustness assessments against fairness and information degradation attacks.

Abstract (Chinese translation)

随着人工智能系统被部署在医疗分诊、自动驾驶控制和就业筛选等高风险伦理情境中，评估其针对伦理推理的对抗性操纵的鲁棒性的形式化方法仍显不足。本文引入了伦理鲁棒性测试系统 (ERTS)，这是一种闭环管道框架，其特点包括：(1) 基于既定伦理理论，将伦理困境编码为 22 维伦理后果空间 (ECS)；(2) 应用 17 种语义扰动函数，受包括新颖语义一致性约束在内的 6 类有效性约束限制；(3) 通过 4 分量伦理不稳定性指数 (EII) 衡量决策偏差；(4) 生成领域自适应的部署前鲁棒性评估裁决。我们在涵盖 8 个部署领域的 50 个伦理场景中，评估了 4 个结构化基线模型和 2 个生产级大语言模型 (LLMs)（Gemini 2.0 Flash 和 Llama 3.2），生成了 1,500 个对抗性测试用例。结果显示，仅有 33% 的模型通过了评估，其中本地 Llama-3.2 模型在公平性污染和信息退化攻击下尤为脆弱 (ERS = 0.737)。据我们所知，尚无现有框架能在单一对抗性测试管道中结合有界伦理后果空间、语义一致性约束和领域自适应评估。

Abstract

As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening, formal methods for evaluating their robustness against adversarial manipulation of ethical reasoning remain underdeveloped. This paper introduces the Ethical Robustness Testing System (ERTS), a closed-pipeline framework that: (1) encodes ethical dilemmas into a 22-dimensional Ethical Consequence Space (ECS) grounded in established ethical theory; (2) applies 17 semantic perturbation functions subject to 6 validity constraint classes including a novel semantic coherence constraint; (3) measures decision deviation via a 4-component Ethical Instability Index (EII); and (4) produces domain-adaptive pre-deployment robustness assessment verdicts. We evaluate 4 structured baseline models and 2 production LLMs (Gemini 2.0 Flash and Llama 3.2) across 50 ethical scenarios spanning 8 deployment domains, generating 1,500 adversarial test cases. Results demonstrate that only 33% of models achieve assessment clearance, with the local Llama-3.2 model proving particularly vulnerable to fairness corruption and information degradation attacks (ERS = 0.737). To the best of our knowledge, no existing framework combines a bounded ethical consequence space, semantic coherence constraints, and domain-adaptive assessment in a single adversarial testing pipeline.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on ethical robustness testing of LLMs using semantic perturbations and an ethical consequence space. The provided keywords primarily relate to multimodal architectures, world models, and reinforcement learning mechanisms (e.g., Tokenizer, Visual Encoder, World Models, Model-based RL). There is a significant mismatch; the paper does not address multimodal integration, tokenization strategies, world model construction, or RL mechanisms. Latent Reasoning receives a moderate score due to the Ethical Consequence Space representing latent ethical dimensions, but overall relevance to the keyword set is low. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found in the author list.

Keywords

Adversarial Robustness Testing, Ethical AI, Semantic Perturbation, Ethical Consequence Space, LLM Evaluation, Ethical Instability Index, Bounded Consequence Space

85. Towards More General Control of Diffusion Models Using Jeffrey Guidance (FAIL)

Score: 22.5 / 35.2

Authors: Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen, Pierre-Alexandre Mattei

Published: 2026-06-11

TL;DR: This paper proposes Jeffrey guidance to control diffusion models at sampling time by updating marginal distributions towards a target, achieving better FID scores and fairness enforcement without explicit conditional training.

Abstract (Chinese translation)

扩散模型的一个关键优势在于其灵活性，因为它们的输出可以在采样时通过引导进行控制。然而，除了条件采样等简单情况外，目标分布通常保持隐式，仅通过采样规则或启发式能量函数来定义。为了解决这一问题，我们提出了 Jeffrey 引导（Jeffrey guidance），这是一个严谨的框架，将扩散模型控制扩展到了标准引导所能表达的应用范围之外。它利用 Jeffrey 条件化规则，将边缘分布更新到指定目标，同时保留条件结构并最小化对联合分布的扰动。我们首先通过针对指定嵌入分布来演示 Jeffrey 引导。以 Inception 嵌入为目标，这在 CIFAR-10 和 FFHQ 上均导致了 FID 的显著降低。我们进一步将 Jeffrey 引导应用于 CelebA-HQ 上的公平性问题，更新一个无条件扩散模型以强制属性之间的独立性。

Abstract

A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
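
Jeffrey's rule of conditioning, which the abstract builds on, replaces the marginal of one variable with a prescribed target while keeping every conditional unchanged: p'(x, y) = q(x) p(y | x). A minimal Python sketch on a discrete toy table is given here. It illustrates the rule only, not the paper's guidance procedure for diffusion models; the attribute names and probabilities are hypothetical.

```python
def jeffrey_update(joint, target_marginal):
    """Apply Jeffrey's rule to a discrete joint distribution.

    joint: dict mapping (x, y) to a probability.
    target_marginal: dict mapping x to the prescribed new probability q(x).
    Returns the updated joint q(x) * p(y | x).
    """
    marginal = {}
    for (x, _), p in joint.items():
        marginal[x] = marginal.get(x, 0.0) + p
    return {
        (x, y): target_marginal[x] * p / marginal[x]
        for (x, y), p in joint.items()
        if marginal[x] > 0.0
    }


if __name__ == "__main__":
    # Hypothetical attribute x in {"a", "b"} and binary attribute y.
    joint = {("a", 0): 0.4, ("a", 1): 0.4, ("b", 0): 0.05, ("b", 1): 0.15}
    target = {"a": 0.5, "b": 0.5}  # prescribed marginal over x
    for key, value in sorted(jeffrey_update(joint, target).items()):
        print(key, round(value, 4))
```

In the toy table the marginal of x moves from (0.8, 0.2) to (0.5, 0.5), while p(y | x) is the same before and after the update.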

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on diffusion model guidance and distribution control using Jeffrey's rule, showing low direct relevance to MLLM, Tokenizers, Model-Based RL, and Agentic Reasoning. It loosely relates to Visual Encoder and World Models via generative image modeling. No target expert authors were found. The weighted total score (22.5) is below the dynamic passing threshold (35.2).

Keywords

Diffusion Models, Jeffrey Guidance, Distribution Control, Generative Modeling, Sampling Time Control, Inception Embeddings, Fairness Enforcement

86. Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization (FAIL)

Score: 22.5 / 35.2

Authors: Sanxin Jiang, Jiro Katto, Heming Sun

Published: 2026-06-11

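TL;DR: This paper proposes Dual-Constrained Diffusion Image Compression (DCIC), which jointly applies distortion and idempotence constraints to navigate the rate-distortion-perception trade-off without additional rate overhead, and validates it on several benchmark datasets.

Abstract (Chinese translation)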

率失真感知（RDP）权衡通过在重构图像上施加分布约束，扩展了经典率失真理论，为神经图像压缩提供了统一框架，共同控制保真度与感知真实性。尽管先前工作已实现接近最优的率感知权衡，但明确实现完整 RDP 表面的实用框架仍较为稀缺，主要源于解码器处引入公共随机性的困难。我们提出 DCIC（双约束扩散图像压缩），该方法将学习编解码器与基于扩散的解码器相结合，后者由联合失真约束与幂等性约束共同控制。失真约束相对于基础编解码器输出限制了重构保真度；幂等性约束——要求对恢复的图像进行重新编码以恢复基础编解码器的重构结果——作为分布感知要求的可行代理。二者共同通过带有一致噪声注入的迭代优化引导反向去噪过程，在不增加额外率开销的情况下实现公共随机性。在固定码率下，双衰减因子 $(K_D, K_P)$ 共同遍历失真 - 感知平面的帕累托前沿，从而能够从单个比特流中实现连续可调的保真度 - 真实性权衡。DCIC$_{RD}$ ($K_P{=}0$) 与 DCIC$_{RP}$ ($K_D{=}0$) 分别作为边界曲线出现，而 DCIC$_{RDP}$ ($K_D = K_P=1$) 实现了最优的内部操作点。在 CelebA-HQ、CLIC2020 和 ImageNet-1K 数据集上，针对 CNN、Transformer 及混合架构的实验表明，DCIC$_{RDP}$ 在所有感知编解码器中实现了更优的 BD-PSNR，而 DCIC$_{RP}$ 在 BD-FID 指标上与专用感知导向方法相当，从而验证了完整 RDP 表面导航的实际价值。

Abstract

The rate-distortion-perception (RDP) trade-off extends classical rate--distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal rate--perception trade-offs, practical frameworks explicitly realizing the full RDP surface remain scarce, primarily due to the difficulty of introducing common randomness at the decoder. We propose DCIC (Dual-Constrained Diffusion Image Compression), which integrates a learned codec with a diffusion-based decoder governed by joint distortion and idempotence constraints. The distortion constraint bounds reconstruction fidelity relative to the base codec output; the idempotence constraint -- requiring that re-encoding the restored image recovers the base codec reconstruction -- serves as a tractable surrogate for the distributional perception requirement. Together, they steer the reverse denoising process via iterative optimization with consistent noise injection, realizing common randomness without additional rate overhead. At fixed rate, dual attenuation factors $(K_D, K_P)$ jointly navigate the Pareto frontier of the distortion-perception plane, enabling continuously adjustable fidelity-realism trade-offs from a single bitstream. DCIC$_{RD}$ ($K_P{=}0$) and DCIC$_{RP}$ ($K_D{=}0$) arise as boundary curves, with DCIC$_{RDP}$ ($K_D = K_P=1$) realizing the optimal interior operating point. Experiments on CelebA-HQ, CLIC2020, and ImageNet-1K across CNN, Transformer, and hybrid architectures confirm that DCIC$_{RDP}$ achieves superior BD-PSNR over all perceptual codecs, while DCIC$_{RP}$ matches dedicated perception-oriented methods in BD-FID, validating the practical value of full RDP surface navigation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	1.0/10	1.5

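Scoring rationale: The paper focuses on the rate-distortion-perception (RDP) trade-off in image compression and proposes Dual-Constrained Diffusion Image Compression (DCIC). The keyword set mainly covers multimodal large language models (MLLM), World Models, and reinforcement learning (RL), which differ substantially from this paper's compression topic. Apart from weak links to 'Visual Encoder' (the encoder component) and 'Unify Models' (unifying constraints), the remaining keywords, such as MLLM, MultiModal, model-based RL, and Agentic Reasoning, are largely unrelated to the paper. The author list does not include Yang Shi or the other specified experts, so no bonus is applied. The weighted total is 22.5, below the dynamic passing score of 35.2.

Keywords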

Image Compression, Diffusion Models, Rate-Distortion-Perception, Dual-Constrained, Codec, Perceptual Optimization, Pareto Frontier

87. One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders (FAIL)

Score: 21.0 / 35.2

Authors: Minghao Luo, Liang Chen

Published: 2026-06-11

TL;DR: This paper introduces the FORGE benchmark to demonstrate that search-augmented LLMs are highly vulnerable to recommending fake products when exposed to polluted web content, a risk that reasoning mechanisms fail to mitigate.

Abstract (Chinese translation)

搜索增强型大语言模型（LLM）正越来越多地通过检索实时网络内容来中介日常消费者推荐。这带来了一种新风险：生成式推荐器可能会摄入污染的网络内容，例如旨在误导推荐的虚假评论和推广页面。我们提出一个问题：当摄入污染的检索结果时，搜索增强型 LLM 在多大程度上会成为虚假产品的无意识推广者？为了解答这一问题，我们引入了 FORGE（生成环境中的虚假在线推荐），这是一个用于在受控网络内容污染下衡量虚假产品推广情况的基准。给定一个上游搜索结果，FORGE 会将检索到的网页中的真实产品局部重写为虚假产品，以此模拟网络内容污染，并测量 LLM 推荐虚假产品的频率。FORGE 涵盖了 15 个类别和 5 种消费者场景下的 225 款真实产品。在 12 个商业和开源权重（open-weights）的 LLM 中，所有模型均存在脆弱性：单个污染页面即可导致受骗率高达 27%，而完整的前 3 名替换则使这一比例升至 73.8%。脆弱性在不同类别间存在显著差异，当模型缺乏对相关产品的稳定先验知识时，脆弱性会增加。推理并不能缓解这种脆弱性；相反，它经常生成虚假的社会证明来为虚假推荐辩护。我们评估了三种防御措施：怀疑提示和共识过滤（基于模型先验或跨文档证据）。怀疑提示可能会加剧脆弱性，类似于推理的效果，而过滤措施则存在抑制合法产品的风险。我们在 https://github.com/leoluolol/forge-benchmark 上发布了 FORGE。

Abstract

Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	3.0/10	4.5

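Scoring rationale: The paper focuses on assessing content-pollution risk for search-augmented LLMs in recommendation (the FORGE benchmark), mainly involving text retrieval and generation safety. The keywords mostly concern unified multimodal models, world models, visual encoders, and reinforcement learning, which match the paper's topic (LLM safety and recommendation pollution) poorly, so relevance scores are low across the board. Only the reasoning-related keywords have a weak connection.

Keywords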

Generative Recommenders, Web Content Pollution, Search-augmented LLMs, FORGE Benchmark, Fake Product Recommendation, Retrieval-Augmented Generation, LLM Vulnerability

88. AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility (FAIL)

Score: 21.0 / 35.2

Authors: Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

Published: 2026-06-11

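TL;DR: This paper proposes the AgentBeats framework, which uses standardized protocols and judge agents to make agent evaluation open, reproducible, and unified, addressing the fragmentation of existing benchmarks and the test-production mismatch.

Abstract (Chinese translation)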

智能体系统在各领域快速发展，但其评估体系仍显碎片化。大多数基准测试依赖于固定的、以大语言模型（LLM）为中心的评测框架（harnesses），这些框架需要深度集成，会导致测试与生产环境的不匹配，并限制了在不同智能体设计之间进行公平比较的能力。根本原因在于缺乏一个开放的、与智能体无关的评估接口。我们主张“智能体化的智能体评估”（Agentified Agent Assessment, AAA），在此框架下，评估由评判智能体（judge agents）执行，所有参与者通过标准化协议进行交互：A2A 用于任务管理，MCP 用于工具访问。传统基准测试定义了两个独立的接口——一个用于基准测试，一个用于智能体，而 AAA 仅需一个接口；这形成了一个通用的、统一的框架，将评估逻辑与智能体实现分离，从而支持可复现、可互操作及多智能体评估。我们进一步引入 AgentBeats 作为 AAA 的具体实现方案：我们识别出五种实用的操作模式，使标准化评估能够兼容现实世界中关于开放性、隐私性和可复现性的约束。为了在大规模上评估我们的设计，我们开展了两项研究：一项为期五个月的开放竞赛，吸引了来自 12 个类别的 298 个评判智能体以及来自独立参与者的 467 个主体智能体，表明 AAA 适用于一系列异构基准测试；另一项是关于编码智能体的案例研究，证实了智能体化的评估能够保持与公共记录的一致性，同时揭示了此前缺失的直接对比结果，从而为智能体设计提供了研究见解。结合社区规模的实地研究与受控编码案例研究，我们验证了 AAA 能够在大规模异构场景中提供全面的覆盖范围、实用性和保真度。综上所述，AAA 与 AgentBeats 共同为开放、标准化且可复现的智能体评估提供了一条清晰的路径。

Abstract

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	7.0/10	10.5

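Scoring rationale: The paper's core is the standardization and reproducibility of an agent evaluation framework (AgentBeats), not model-internal architecture. Relevance to Tokenizer, Visual Encoder, World Models, MultiModal, model-based RL, and Latent Reasoning is therefore minimal (0 points). 'Agentic Reasoning' is highly relevant (7 points) because the core subject is agent systems and the assessment of their capabilities; 'Unify Models' is moderately relevant (4 points) because the paper proposes a unified assessment interface framework (AAA); 'MLLM' has low relevance (3 points) because judge agents may be built on LLMs, but MLLM architecture is not the focus. The weighted total is about 21.0, below the dynamic passing score of 35.2, indicating a poor match between the paper's topic and the given keywords.

Keywords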

Agentified Agent Assessment, AgentBeats, Standardization, Reproducibility, Judge Agents, Multi-agent Evaluation, A2A Protocol

89. Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video (FAIL)

Score: 21.0 / 35.2

Authors: Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar

Published: 2026-06-11

TL;DR: This paper proposes a physics-guided spatiotemporal deep learning framework to accurately estimate coastal wave peak periods from video streams while ensuring physical consistency.

摘要翻译

近岸波浪参数对于海岸工程、海岸线保护、海洋灾害评估以及气候韧性海岸管理至关重要。传统的监测系统（如浮标和雷达平台）虽能提供准确监测，但往往具有较高的安装与维护成本，且空间覆盖范围有限。尽管利用深度学习已实现了基于视频的被动海洋监测，但许多方法缺乏可物理解释性、可行性及海洋学验证。本文提出了一种物理引导的深时空学习框架（Physics-Guided Deep Spatiotemporal Learning Framework），用于直接从被动海岸视频流估计近岸波浪峰值周期（nearshore wave peak periods）。该框架融合了基于时间方差的自动感兴趣区域检测（region-of-interest detection）、多阶段仿真到现实迁移学习（Sim-to-Real transfer learning）及物理信息正则化（physics-informed regularization），以提升预测精度（predictive accuracy）与物理一致性（physical consistency）。研究评估了多种时空架构（spatiotemporal architectures），包括基于 Transformer 的（transformer-based）与循环卷积的（recurrent-convolutional）架构，以及合成预训练（synthetic pretraining）、银标签适应（silver-label adaptation）和专家微调（expert fine-tuning）。结果表明，基于 Transformer 的架构在瞬时预测（instantaneous prediction）精度方面表现更优，而轻量级循环卷积架构则实现了更高的时间稳定性（temporal stability）和业务海洋学技能（operational oceanographic skill）。消融研究（ablation studies）还表明，物理引导正则化（physics-guided regularization）在趋势跟随一致性（trend-following consistency）及减少物理上不合理的预测（physically implausible predictions）方面具有显著优势。可解释性审计（explainability auditing）有助于将注意力集中在水动力活跃的近岸破碎带区域（hydrodynamically active surf-zone regions），并与物理推导的波浪传播行为（wave propagation behavior）表现出良好的一致性。总体而言，所提出的框架展示了物理引导的视频深度学习系统（physics-guided video-based deep learning systems）在长期海岸波浪监测方面的潜力，这些系统具有成本效益（cost-efficient）且业务可行（operationally feasible）。

Abstract

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems such as buoys and radar platforms offer accurate monitoring but can have high installation and maintenance costs and limited spatial coverage. Deep learning has enabled passive, video-based ocean monitoring; however, many methods lack physical interpretability, feasibility, and oceanographic validation. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework is proposed for direct estimation of nearshore wave peak periods from passive coastal video streams. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, including transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures achieved higher instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization for trend-following consistency and for reducing physically implausible predictions. Explainability auditing also helped to focus attention on hydrodynamically active surf-zone regions and showed good agreement with the physically derived wave propagation behavior. Overall, the proposed framework shows the promise of cost-efficient and operationally feasible physics-guided video-based deep learning systems for long-term coastal wave monitoring.
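
For readers unfamiliar with the target quantity, the peak period is the period of the most energetic frequency in the wave signal. A classical spectral baseline can be sketched in Python as shown here. This is not the paper's learned estimator: the use of a single pixel-intensity time series, the plain discrete Fourier transform, and all names and values are assumptions made for illustration.

```python
import cmath
import math


def peak_period(signal, sample_rate_hz):
    """Return the period (seconds) of the strongest non-zero frequency."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [v - mean for v in signal]  # remove the constant offset
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):
        # k-th discrete Fourier coefficient of the centered series
        coeff = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    # Frequency of bin k is k * sample_rate / n, so the period is its inverse.
    return n / (best_k * sample_rate_hz)


if __name__ == "__main__":
    rate = 2.0  # hypothetical video frame rate in Hz
    # Synthetic intensity: a dominant 8 s swell plus a weaker 3 s component.
    series = [
        math.sin(2 * math.pi * (t / rate) / 8.0)
        + 0.3 * math.sin(2 * math.pi * (t / rate) / 3.0)
        for t in range(240)
    ]
    print("estimated peak period (s):", peak_period(series, rate))
```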

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on physics-guided spatiotemporal learning for oceanography (wave estimation from video), which has low alignment with the provided keyword set centered on Multimodal LLMs, RL, and World Models. 'Visual Encoder' is moderately relevant due to video processing, while others like Tokenizer, MLLM, and RL are irrelevant. 'Unify Models' and 'Latent Reasoning' have minor relevance due to method combination and representation learning.

Keywords

Physics-Guided, Spatiotemporal Learning, Coastal Wave, Video Estimation, Deep Learning, Wave Peak Period, Physics-Informed, Sim-to-Real Transfer

90. Mental-R1: Aligning LLM Reasoning for Mental Health Assessment (FAIL)

Score: 21.0 / 35.2

Authors: Xin Wang, Boyan Gao, Yibo Yang, David A. Clifton

Published: 2026-06-11

TL;DR: Mental-R1 proposes a Cognitive Relative Policy Optimization framework to align LLM reasoning for mental health assessment, achieving significant performance improvements on multiple datasets.

摘要翻译

焦虑、抑郁及自杀等心理健康问题仍是紧迫的全球挑战，而及时准确的评估对于有效干预至关重要。近期，大型语言模型（Large Language Models, LLMs）已被探索用于心理健康评估。然而，现有的通用后训练方法并未与人类评估的认知过程相一致，这可能导致不可靠的推理结果。为弥合这一差距，本文提出认知相对策略优化（Cognitive Relative Policy Optimization, CRPO），这是一种专为心理健康领域量身定制的强化学习框架。CRPO 通过整合阶段依赖不确定性建模至策略优化过程，从而扩展了组相对策略优化（Group Relative Policy Optimization）。具体而言，我们引入了一种阶段式熵正则化机制，该机制鼓励在早期推理阶段进行广泛探索，并在后期阶段逐步强化自信决策，从而模拟人类认知从不确定性向确定性转变的过程。此外，受认知评价理论（Cognitive Appraisal Theory）启发，我们形式化了认知推理阶段，从而引导基于理论的可解释推理。在 8 个心理健康数据集上的实验表明，CRPO 在加权 F1 分数上相比最佳强化学习基线平均提升了 10.4 个百分点。此外，经 CRPO 训练的模型 Mental-R1 在推理密集型案例上相比现有大型语言模型展现出明显优势，这表明 CRPO 增强了心理健康评估的推理能力。

Abstract

Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is critical for effective intervention. Recently, large language models have been explored for mental health assessment. However, existing general-purpose post-training methods do not align with the cognitive processes of human assessment, which may lead to unreliable reasoning outcomes. To bridge this gap, we propose Cognitive Relative Policy Optimization (CRPO), a reinforcement learning framework tailored for the mental health domain. CRPO extends group relative policy optimization by integrating stage-dependent uncertainty modeling into the policy optimization process. Specifically, we introduce a stage-wise entropy regularization mechanism that encourages broad exploration in early reasoning phases and progressively enforces confident decision-making in later stages, mimicking the human cognitive shift from uncertainty to certainty. In addition, inspired by cognitive appraisal theory, we formalize cognitive reasoning stages, thereby guiding theory-grounded interpretable inference. Experiments on 8 mental health datasets show that CRPO achieves an average improvement of 10.4 percentage points in weighted F1-score over the best reinforcement learning baseline. Furthermore, the CRPO-trained model Mental-R1 demonstrates clear advantages compared with existing large language models on reasoning-intensive cases, suggesting that CRPO enhances reasoning capabilities for mental health assessment.
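
The abstract describes CRPO as group relative policy optimization plus a stage-dependent entropy term. A rough Python sketch of those two ingredients follows. The group-relative advantage is the usual GRPO normalization; the linear schedule that moves from an entropy bonus to an entropy penalty is purely a hypothetical reading of "stage-wise entropy regularization", and the coefficient values and function names are assumptions, not the paper's formulation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean, unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def entropy_coefficient(stage, num_stages, start=0.02, end=-0.02):
    """Hypothetical schedule: encourage entropy early, discourage it late.

    A positive value rewards exploration in early reasoning stages; a
    negative value pushes toward confident decisions in the final stages.
    """
    if num_stages == 1:
        return start
    fraction = stage / (num_stages - 1)
    return start + fraction * (end - start)


if __name__ == "__main__":
    # Four sampled responses to one prompt with verifiable 0/1 rewards.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # One coefficient per reasoning stage (the stage count is an assumption).
    print([round(entropy_coefficient(s, 4), 4) for s in range(4)])
```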

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	2.0/10	3.0

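Scoring rationale: The paper focuses on aligning LLM reasoning for mental health assessment, using the reinforcement learning framework CRPO. It does not involve multimodality, visual encoders, world models, or unified model architectures, so relevance is low. 'model-based RL' and 'Latent Reasoning' receive moderate scores because the work involves RL and cognitive-state modeling. No specified expert authors were found, so no bonus is applied. The total is below the dynamic passing score because the paper's domain does not match the keyword focus (multimodal / world models).

Keywords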

Mental Health Assessment, LLM Reasoning, Reinforcement Learning, Cognitive Relative Policy Optimization, Stage-wise Entropy Regularization, Interpretable Inference, Weighted F1-score

91. TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment (FAIL)

Score: 21.0 / 35.2

Authors: Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

Published: 2026-06-11

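TL;DR: This paper proposes TetherCache, a cache management strategy that uses gated recall and trusted alignment to stabilize autoregressive long-video generation and substantially reduce quality drift.

Abstract (Chinese translation)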

自回归视频扩散模型通过将新生成的帧基于先前生成的内容进行条件化，为流式和可变长度视频生成提供了一种自然的表述方式。然而，将这些模型扩展至分钟级生成仍然具有挑战性：有限的 KV-cache 预算限制了模型保留完整历史的能力，而反复基于自生成帧进行条件化会引发上下文分布偏移，该偏移随时间累积，从而导致视觉伪影、质量退化及时间漂移。本文提出 TetherCache，一种无需训练且即插即用的缓存管理策略，旨在实现具有抗漂移能力的长视频生成。TetherCache 将缓存组织为汇点（sink）、记忆（memory）和近期（recent）区域，并引入了两种互补机制。首先，GRAB（带注意力多样性平衡的门控召回）利用一种结合注意力相关性与时间多样性的门控分数来选择长程记忆帧，从而在固定缓存预算下保留信息丰富且多样的历史上下文。其次，TAME（基于记忆编辑的可信对齐）通过将新召回的记忆令牌的统计量对齐至可信上下文分布来对其进行轻微编辑，从而减少由漂移历史特征引起的污染。基于 Self-Forcing，TetherCache 在 VBench-Long 基准测试上，于 30 秒、60 秒及 240 秒设置下均一致地提升了长视频生成质量。特别是在 240 秒生成任务中，它显著提升了总体得分和语义得分，同时将质量漂移从 7.84 降低至 1.33，证明了其在稳定长时程自回归视频扩散中的有效性。

Abstract

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
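
The two mechanisms can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's algorithm: the gate is a fixed mixing weight, temporal diversity is the normalized distance to the nearest already-selected frame, and "aligning statistics" is taken to mean matching mean and standard deviation. The names `grab_select` and `tame_align` and all parameters are hypothetical.

```python
import statistics

def grab_select(relevance, budget, gate=0.5):
    """Greedily pick `budget` frame indices using gate*relevance + (1-gate)*diversity."""
    n = len(relevance)
    selected = []
    while len(selected) < min(budget, n):
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in selected:
                continue
            # Diversity: normalized distance to the nearest frame already selected
            # (1.0 when nothing has been selected yet).
            diversity = min((abs(i - j) for j in selected), default=n) / n
            score = gate * relevance[i] + (1.0 - gate) * diversity
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return sorted(selected)

def tame_align(tokens, trusted, strength=0.3):
    """Move token values part-way toward the trusted mean and standard deviation."""
    mu, sigma = statistics.fmean(tokens), statistics.pstdev(tokens)
    t_mu, t_sigma = statistics.fmean(trusted), statistics.pstdev(trusted)
    scale = t_sigma / sigma if sigma > 0 else 1.0
    aligned = [(x - mu) * scale + t_mu for x in tokens]
    # strength=0 leaves tokens unchanged; strength=1 matches the trusted statistics.
    return [(1 - strength) * x + strength * a for x, a in zip(tokens, aligned)]

# Toy attention relevance for 8 past frames: three similar early frames, one late frame.
relevance = [0.9, 0.85, 0.8, 0.1, 0.1, 0.1, 0.1, 0.6]
print(grab_select(relevance, budget=3, gate=1.0))  # relevance only
print(grab_select(relevance, budget=3, gate=0.5))  # relevance balanced with diversity
print(tame_align([10.0, 12.0, 14.0], trusted=[0.0, 1.0, 2.0], strength=1.0))
```

With relevance alone the selection clusters on adjacent early frames; adding the diversity term spreads it across the history, which is the behavior the abstract attributes to GRAB.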

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

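Scoring rationale: The paper proposes the TetherCache strategy to stabilize autoregressive long-video generation; its core contributions are cache management and distribution-drift correction. The keyword set focuses mainly on multimodal large language models (MLLM), reinforcement learning (RL), and reasoning mechanisms, which match this paper's video-diffusion engineering focus poorly. The paper does not address model unification, tokenizer design, visual encoder architecture, reinforcement learning, or agentic reasoning, and is only weakly related to video generation modeling in the broad sense (World Models) and visual sequences (MultiModal), so most keywords receive low scores.

Keywords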

Autoregressive Video Generation, Cache Management, Long-Form Video, Diffusion Models, Gated Recall, Trusted Alignment, Temporal Drift, KV-Cache

92. Leveraging Audio-LLMs to Filter Speech-to-Speech Training Data [FAIL]

Score: 21.0 / 35.2

Authors: Qixu Chen, Satoshi Nakamura

Published: 2026-06-11

TL;DR: This paper proposes filtering noisy speech-to-speech translation training data using an Audio-LLM, achieving improved end-to-end translation performance.

Abstract translation (Chinese)

大规模挖掘语料库为端到端语音到语音翻译（S2ST）提供了丰富的训练数据，但可能包含噪声、不对齐和语义错误。过滤噪声数据对于维持鲁棒的语音翻译性能至关重要。我们研究如何训练一个音频语言模型，直接从音频对配对语音做出保留/丢弃决策。为在无需人工标注的情况下获得可靠监督，我们采用了一种可扩展的两阶段 Rank-to-Distill 策略。轻量级排序器从噪声语音对中生成保留/丢弃伪标签，随后训练一个音频大语言模型直接从原始配对语音预测保留/丢弃。所得模型联合捕获声学保真度与跨语言语义一致性，以进行语音条件数据的选择。在 CVSS-C 和 SpeechMatrix 上的实验表明，相较于未过滤训练，该方法始终表现出改进，在端到端 S2ST 任务上最高可达 +1.4 ASR-BLEU 的提升。

Abstract

Large-scale mined corpora provide abundant training data for end-to-end speech-to-speech translation (S2ST) but may contain noise, misalignment, and semantic errors. Filtering noisy data is crucial to maintain robust speech translation performance. We study how to train an audio-language model to make keep/drop decisions on paired speech directly from audio. To obtain reliable supervision without manual labels, we adopt a scalable two-stage Rank-to-Distill strategy. A lightweight ranker generates keep/drop pseudo-labels from noisy speech pairs, and these labels are then used to train an audio large language model to predict keep/drop directly from raw paired speech. The resulting model jointly captures acoustic fidelity and cross-lingual semantic consistency for the selection of speech-conditioned data. Experiments on CVSS-C and SpeechMatrix show consistent improvements over unfiltered training, yielding up to +1.4 ASR-BLEU for end-to-end S2ST.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on speech-to-speech translation data filtering using Audio-LLMs. It has zero relevance to Visual Encoder, World Models, and model-based RL as it involves no vision, world modeling, or reinforcement learning. Moderate relevance to MLLM and MultiModal exists due to audio modality and large language model usage. Minor relevance (2.0) to Unify Models, Tokenizer, Latent Reasoning, and Agentic Reasoning as decision-making is peripheral to core unification or latent reasoning.

Keywords

Audio-LLMs, Speech-to-Speech, Data Filtering, Rank-to-Distill, End-to-end, Training Data, ASR-BLEU

93. Low-Latency Real-Time Audio Game Commentary System via LLM-Based Parallel Text Generation [FAIL]

Score: 21.0 / 35.2

Authors: Ryota Kawamatsu, Anum Afzal, Yuki Saito, Shinnosuke Takamichi, Graham Neubig, Katsuhito Sudoh, Hiroya Takamura, Tatsuya Ishigaki

Published: 2026-06-11

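TL;DR: This paper presents an LLM-based parallel text generation system for real-time game video commentary that substantially reduces the silence between utterances and improves speaking rhythm.

Abstract translation (Chinese)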

我们提出了一种低延迟实时音频游戏解说系统，该系统直接从实时游戏画面视频生成口语解说。在这种端到端（end-to-end）设置中，关键瓶颈是累积等待时间；传统流水线（pipeline）为每个语句（utterance）顺序地捕获帧、生成文本和合成语音，且在语音播放完成前不请求下一次生成。这种严格的顺序性导致语句之间出现长且不自然的沉默。为了解决这一延迟瓶颈，我们的系统在语音播放的同时并行运行文本生成，并提前缓冲多个候选语句（candidate utterances），以便在播放边界处立即合成。在快节奏游戏视频上的实验表明，与顺序基线（sequential baselines）相比，我们的并行设计将平均语句间沉默时间从 9.6 秒减少到 0.3 秒。此外，它还使专业说话——沉默时间模式（silence timing patterns）的相似度提高了 40% 以上，一项针对 120 名经验丰富的游戏玩家的用户研究（user study）证实了感知说话节奏的显著改善。我们的演示视频可在以下网址获取：https://youtu.be/pmrRUlvav8M.

Abstract

We present a low-latency real-time audio game commentary system that generates spoken commentary directly from live gameplay video. In this end-to-end setting, a key bottleneck is accumulated waiting time; conventional pipelines capture frames, generate text, and synthesize speech sequentially for each utterance, and do not request the next generation until speech playback has completed. This strict sequentiality causes long and unnatural silence between utterances. To address this latency bottleneck, our system runs text generation in parallel with speech playback and buffers multiple candidate utterances ahead of time, enabling immediate synthesis at playback boundaries. Experiments on fast-paced game videos show that our parallel design reduces the mean inter-utterance silence from 9.6 seconds to 0.3 seconds compared to sequential baselines. It also improves similarity to professional speaking-silence timing patterns by over 40%, and a user study with 120 experienced game players confirms significantly improved perceived speaking rhythm. Our demo video is available at: https://youtu.be/pmrRUlvav8M.
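
Why overlapping generation with playback removes most of the silence can be seen in a small timing model. The sketch below is a deterministic simulation, not the authors' system: stage durations are constant, frame capture is folded into `gen_time`, the generator is assumed to produce texts back to back, and all numbers are illustrative rather than the paper's measurements.

```python
def sequential_silence(gen_time, synth_time, n_utterances):
    """Silence before each utterance after the first when generation waits for playback to end."""
    return [gen_time + synth_time] * (n_utterances - 1)

def parallel_silence(gen_time, synth_time, play_time, n_utterances):
    """Silence when text generation runs during playback and finished texts are buffered."""
    silences = []
    ready_at = gen_time                          # time the first text is available
    clock = ready_at + synth_time + play_time    # time the first playback ends
    for _ in range(n_utterances - 1):
        ready_at += gen_time                     # next text finishes generating
        # Synthesis starts once playback has ended and a text is available.
        start = max(clock, ready_at) + synth_time
        silences.append(start - clock)
        clock = start + play_time
    return silences

seq = sequential_silence(gen_time=3.0, synth_time=0.3, n_utterances=4)
par = parallel_silence(gen_time=3.0, synth_time=0.3, play_time=5.0, n_utterances=4)
print([round(s, 2) for s in seq])
print([round(s, 2) for s in par])
```

When playback is longer than generation, a text is always waiting in the buffer and only the synthesis time remains as silence. When generation is slower than playback, the buffer runs empty and the gap grows again.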

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

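Scoring rationale: The paper focuses on latency optimization for a real-time game commentary system using a parallel generation architecture. Although it involves video input and speech output (multimodal) and an LLM application, it does not touch the areas targeted by the core keywords, such as world models, reinforcement learning, unified model architectures, or reasoning mechanisms. The visual encoder and tokenizer are only underlying components, not the research focus.

Keywords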

Low-Latency, Real-Time Audio Commentary, LLM-Based Text Generation, Parallel Processing, Speech Synthesis, Gameplay Video, System Architecture

94. DuET: Dual Expert Trajectories for Diffusion Image Editing [FAIL]

Score: 21.0 / 35.2

Authors: Lidia Troeshestova, Alexander Ustyuzhanin, Sergey Kastryulin

Published: 2026-06-11

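TL;DR: DuET introduces a dual-expert-trajectory method that temporarily relaxes source-image conditioning during diffusion image editing, improving the semantic fidelity and perceptual quality of edits without modifying model weights.

Abstract translation (Chinese)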

近期的 diffusion editors 在每一个 denoising step 上均基于源图像进行条件化，执行多样化的基于指令的编辑。然而，持续的 source-image conditioning 可能会限制编辑执行的充分程度以及结果的自然程度，尤其是当目标场景与输入显著偏离时。我们引入了 DuET (Dual Expert Trajectories)，这是一种无需训练的推理方法，它通过过渡到 text-to-image 阶段后再返回 edit mode，暂时放松了 source-image conditioning，使得 denoising trajectory 能够朝向 target distribution 移动，同时保留 image-conditioned editing 的结构优势。在不修改 model weights 或不增加 sampling cost 的情况下，DuET 在 diverse models 和 benchmarks 上始终提高了 instruction relevance、semantic fidelity 和 perceptual quality。在某些情况下，这些收益伴随着 source-image preservation 的适度减少，揭示了 source preservation 和 edit fidelity 之间可预测的 trade-off。

Abstract

Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

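Scoring rationale: The paper's core is an inference strategy for diffusion-based image editing (DuET), which mainly involves image- and text-conditioned generation, so it has some relevance to MultiModal (5 points). The diffusion process implicitly operates in a latent space, giving a slight link to Latent Reasoning (2 points). However, the paper does not address unified model architectures, tokenizer design, visual encoder innovations, world models, large language models (MLLM), reinforcement learning (model-based RL), or agentic reasoning (Agentic Reasoning), so these keywords have very low relevance (0-2 points). The author list contains none of the designated experts, so no bonus was added.

Keywords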

Diffusion Image Editing, Dual Expert Trajectories, Source-image Conditioning, Text-to-Image Phase, Inference Method, Semantic Fidelity, Perceptual Quality, Training-free

95. Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X [FAIL]

Score: 21.0 / 35.2

Authors: Muhammad Shahbaz, Shaurya Agarwal

Published: 2026-06-11

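TL;DR: The paper proposes a cooperative 3D object detection method based on camera and LiDAR BEV fusion, reaching a 3D mAP of 0.85 on the TUMTraf benchmark.

Abstract translation (Chinese)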

我们介绍了一种为 DriveX 2026 挑战赛的 TUMTraf V2X 协作 3D 目标检测赛道开发的相机与 LiDAR 融合检测器。该检测器在共享的鸟瞰空间 (BEV) 中将三个路边摄像头与融合的基础设施 + 车辆点云进行融合，并通过 CenterPoint 风格的检测头预测边界框，该头采用广义 IoU 回归损失和 IoU 质量重排序头。该模型在提供的训练集和验证集划分上进行训练，在公共 Codabench 测试集划分上达到了 0.85 的 3D mAP。在系统迭代过程中，我们发现 50 个测试帧中的 44 个也出现在发布的训练集 (40) 和验证集 (4) 划分中，且带有其标签。因此，我们进行了两项额外研究来量化这种重叠对最终得分的影响：(1) 一项微调运行，对 44 个重叠帧进行过采样，达到 0.89 mAP；以及 (2) 一项后处理运行，用发布的真实值 (Ground Truth) 替换这些帧上的预测，达到 0.99 mAP（上传至我们的 Codabench 账户用于测试，但未发布在排行榜上）。报告了所有三种配置及其每类结果。

Abstract

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0
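
The arithmetic behind this table can be reproduced with a short sketch. It assumes, based on the rows above, that each keyword score is weight × relevance and that the entry's total is the sum of those scores, compared against the passing score of 35.2 shown in the Score line; the function and variable names are illustrative only.

```python
# Relevance values (out of 10) copied from the table above; every weight is 1.5.
WEIGHT = 1.5
PASS_SCORE = 35.2  # passing score shown in this entry's "Score" line

relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 4.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 8.0,
    "model-based RL": 0.0,
    "Latent Reasoning": 0.0,
    "Agentic Reasoning": 0.0,
}

def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * r for r in relevance.values())

total = weighted_total(relevance)
verdict = "PASS" if total >= PASS_SCORE else "FAIL"
print(f"{total:.1f} / {PASS_SCORE} -> {verdict}")  # 21.0 / 35.2 -> FAIL
```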

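Scoring rationale: The paper studies camera-LiDAR multimodal fusion and 3D object detection for autonomous driving, built on a conventional deep learning framework (CenterPoint); it does not involve large language models, reinforcement learning, or world model architectures. Only MultiModal and Visual Encoder are therefore moderately relevant, and the remaining keywords are entirely unrelated. No designated experts were found in the author list, so no bonus was added. The weighted total is 21.0, below the dynamic passing score of 35.2.

Keywords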

Camera and LiDAR Fusion, 3D Object Detection, Bird's-Eye View, V2X Cooperative, CenterPoint-style Head, Sensor Fusion, Roadside Infrastructure

96. ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning [FAIL]

Score: 19.5 / 35.2

Authors: Fuqiang Niu, Bowen Zhang

Published: 2026-06-11

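TL;DR: ARMOR-MAD introduces an adaptive routing mechanism to optimize heterogeneous multi-agent debate, improving the accuracy and efficiency of large language model reasoning.

Abstract translation (Chinese)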

多智能体辩论（MAD）能够提升大语言模型的推理能力，但固定的辩论流程往往浪费计算开销，并可能放大相似智能体之间的相关错误。我们提出 ARMOR-MAD，一种无需训练的异构 MAD 框架，该框架将辩论视为条件计算。ARMOR-MAD 结合了三个组件：辩论前一致性路由（PAR）决定独立生成的第 0 轮答案是否需要辩论；早期一致性停止评估器（EASE）在收敛后停止辩论；语义异常值检测（SOD）在聚合过程中降低异常最终答案的权重。在 MATH Level 5、GSM8K、MMLU 和 MMLU-Pro 上，ARMOR-MAD 始终优于使用相同模型池的固定轮次异构辩论，准确率分别达到 65.5%、96.5%、90.0% 和 81.5%。结果表明，真实的模型异构性和基于一致性的控制对于提升 MAD 的准确性和效率均至关重要。

Abstract

Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify correlated errors among similar agents. We propose ARMOR-MAD, a training-free heterogeneous MAD framework that treats debate as conditional computation. ARMOR-MAD combines three components: Pre-debate Agreement Routing (PAR) decides whether independently generated Round-0 answers require debate; Early Agreement Stopping Evaluator (EASE) stops debate after convergence; and Semantic Outlier Detection (SOD) down-weights abnormal final answers during aggregation. Across MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD consistently improves over fixed-round heterogeneous debate with the same model pool, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results suggest that genuine model heterogeneity and agreement-based control are both important for making MAD more accurate and efficient.
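
The idea of debate as conditional computation can be sketched as a control loop. This is a minimal illustration, not the ARMOR-MAD implementation: agents are plain Python callables, agreement is measured by exact string match, the thresholds are hypothetical, and the outlier step simply down-weights answers given by a single agent as a stand-in for semantic outlier detection.

```python
from collections import Counter

def agreement(answers):
    """Fraction of agents that gave the most common answer."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def weighted_vote(answers, outlier_weight=0.5):
    """Stand-in for SOD: an answer given by only one agent counts for less."""
    counts = Counter(answers)
    weights = Counter()
    for a in answers:
        weights[a] += outlier_weight if counts[a] == 1 else 1.0
    return weights.most_common(1)[0][0]

def routed_debate(agents, question, route_threshold=1.0, stop_threshold=1.0, max_rounds=3):
    """Return (final_answer, number_of_debate_rounds_used)."""
    # Round 0: every agent answers independently (no peer answers visible).
    answers = [agent(question, None) for agent in agents]
    rounds = 0
    # PAR-like gate: debate only if round-0 agreement is below the threshold.
    if agreement(answers) < route_threshold:
        while rounds < max_rounds:
            peers = list(answers)
            answers = [agent(question, peers) for agent in agents]
            rounds += 1
            # EASE-like check: stop as soon as the agents have converged.
            if agreement(answers) >= stop_threshold:
                break
    return weighted_vote(answers), rounds

def stubborn(answer):
    """Toy agent that always returns the same answer."""
    return lambda question, peers: answer

def follower(initial):
    """Toy agent that adopts the majority answer once it can see its peers."""
    def agent(question, peers):
        if peers is None:
            return initial
        return Counter(peers).most_common(1)[0][0]
    return agent

print(routed_debate([stubborn("4"), stubborn("4"), stubborn("4")], "2+2?"))  # ('4', 0)
print(routed_debate([stubborn("4"), stubborn("4"), follower("5")], "2+2?"))  # ('4', 1)
```

The first call skips debate entirely because the round-0 answers already agree; the second runs one round and stops on convergence, so compute is spent only where the agents disagree.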

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	5.0/10	7.5

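Scoring rationale: The paper focuses on text reasoning through multi-agent debate. It has some relevance to Agentic Reasoning (multiple agents) and Unify Models (a heterogeneous model pool), but is largely unrelated to keywords such as vision, reinforcement learning, and world models.

Keywords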

Multi-Agent Debate, Adaptive Routing, Large Language Model Reasoning, Heterogeneous Agents, Pre-debate Agreement Routing, Early Agreement Stopping, Semantic Outlier Detection

97. When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals [FAIL]

Score: 19.5 / 35.2

Authors: Aydin Javadov

Published: 2026-06-11

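TL;DR: The paper examines whether the routing exposed by Block Attention Residuals is sufficient for mechanistic interpretation, and finds that trained routing shows localized motifs but that routing mass does not line up with causal importance.

Abstract translation (Chinese)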

块注意力残差（Block AttnRes）通过用对早期深度源表示的学习 softmax 替换固定加法残差，使跨层路由在前向传播中显现为一个可检查的张量。这是一个引人注目的可解释性目标：通常需间接推断的信息流现在可直接观察。我们探究这种呈现是否足以支持机制性解释。我们在相同的路由消融干预下探究两个相同规模（0.6B）的 Block AttnRes 检查点：一个是经由确定性近期偏置调度封装的原始 Qwen3 推理，代码库认可其为一种路由等效加载路径；另一个是作为优化过程一部分从头训练的 Block AttnRes Qwen3。该封装基线的路由权重是内容无关的，并复现了调度的解析预测。训练后的 Block AttnRes 检查点则表现出三种局部化的路由模式：通过早期层 MLP 的嵌入源路径，通过早期层注意力和 MLP 的当前状态路径，以及通过晚期层注意力的较旧历史路径。超越这种分层，我们发现平均路由权重分布与因果重要性之间存在显著分离：在两个子层中，权重分布最大的部分并非贡献最大的因果部分，且一种源类型携带可观的权重，但在干预下未检测到其因果角色。因此，路由的架构暴露对于机制性解释是必要但不充分的：结构化深度路由仅在路由成为训练过程的一部分时才会涌现，即便如此，描述性路由总结也应被视为候选假设，需通过因果干预进行检验，而不能视为机制本身的证据。

Abstract

Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale (0.6B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.
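
A toy example shows how routing mass and causal importance can come apart. The sketch below is not the paper's probe: it mixes three hand-picked source vectors with softmax routing weights and defines the causal effect of a source as the change in the mixed output when that source is removed and the remaining weights are renormalized, which is only one possible ablation. All names and values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(sources, logits, ablate=None):
    """Mix earlier depth-source vectors with softmax routing weights.

    If `ablate` is an index, that source is dropped and the remaining
    weights are renormalized (an assumed form of routing ablation).
    """
    weights = softmax(logits)
    if ablate is not None:
        weights = [0.0 if i == ablate else w for i, w in enumerate(weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    dim = len(sources[0])
    mixed = [sum(w * src[d] for w, src in zip(weights, sources)) for d in range(dim)]
    return mixed, weights

def causal_effect(sources, logits, index):
    """Euclidean distance between the full output and the output with one source ablated."""
    base, _ = route(sources, logits)
    ablated, _ = route(sources, logits, ablate=index)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(base, ablated)))

# Two redundant sources and one distinctive source.
sources = [[1.0, 1.0], [1.0, 1.0], [5.0, -3.0]]
logits = [math.log(0.4), math.log(0.3), math.log(0.3)]  # routing mass 0.4 / 0.3 / 0.3

_, masses = route(sources, logits)
effects = [causal_effect(sources, logits, i) for i in range(len(sources))]
print("largest mass:  source", masses.index(max(masses)))
print("largest effect: source", effects.index(max(effects)))
```

Source 0 carries the most mass, yet removing it matters less than removing source 2, because source 1 duplicates it. This mirrors the abstract's caution that routing summaries are hypotheses to be tested by intervention.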

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	1.0/10	1.5

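Scoring rationale: The paper is mainly an interpretability study of Block Attention Residuals in Transformer architectures, analyzing the routing mechanism with causal probes. The keywords mostly concern multimodality, world models, and reinforcement learning, and overlap little with this paper's topic (interpretability of a model's internal routing). 'MLLM' is slightly relevant because Qwen3 is mentioned, and 'Latent Reasoning' is slightly relevant because latent representations are involved; the remaining keywords, such as visual encoder, tokenizer, world models, and reinforcement learning, have no direct connection.

Keywords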

Block Attention Residuals, Routing, Interpretability, Causal Probes, Mechanistic Interpretation, Qwen3, Routing Motifs, Depth-source representations

98. Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison [FAIL]

Score: 19.5 / 35.2

Authors: Yupeng Cai, Jia Wei, Jianlong Zhou

Published: 2026-06-11

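TL;DR: This paper proposes HTSCGAN, a unified multi-modal brain image translation model that preserves hierarchical tumor structure information to improve translation quality and clinical applicability.

Abstract translation (Chinese)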

基于可用模态进行多模态 MRI 脑图像翻译在现代医学中具有重要的实际意义，为疾病的早期诊断、治疗计划和结果评估提供了强有力的支持。为此，确保翻译后肿瘤区域的保真度至关重要。然而，现有的脑图像翻译方法忽略了不同肿瘤区域的结构信息，而这些信息有助于翻译模型提高所生成图像的质量和临床适用性。在本文中，我们提出了一种名为 HTSCGAN 的新型翻译模型，这是一种统一的多模态脑图像翻译生成对抗模型 (GAN)，旨在通过整合肿瘤区域内的结构信息来提高脑图像翻译的质量。具体来说，生成器采用了三个不同 patch 尺寸的 Patch Contrast Module (PCM)，以捕捉肿瘤区域的层次结构信息。此外，还使用了预训练的 Patch Classifier (PC) 和预训练的 Structure-Aware Encoder (SAE)，分别利用 patch 分类损失和肿瘤感知损失，使生成的图像包含与真实图像相同的肿瘤区域结构。在 BraTS2020 和 BraTS2021 上的实验表明，我们的模型在翻译任务和下游分割任务中均表现出强大的性能，突显了其在提高所翻译脑图像的质量和临床相关性方面的有效性。我们的代码可在 https://anonymous.4open.science/r/HTSCGAN 获取。

Abstract

Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structural information of different tumor regions, which could help translation models improve the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, a unified multi-modal brain image translation generative adversarial model that integrates the structural information within tumor regions to improve the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed, via a patch classification loss and a tumor perceptual loss respectively, so that the generated image contains the same tumor-region structure as the ground-truth image. The experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at https://anonymous.4open.science/r/HTSCGAN.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

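Scoring rationale: The paper focuses on MRI brain image translation with a GAN. 'MultiModal' is highly relevant (multi-modal MRI); 'Unify Models' matches the title but refers to the translation framework rather than model unification for LLM/RL; 'Visual Encoder' corresponds to an internal encoder but lacks a large-model context. The other keywords (Tokenizer, World Models, MLLM, RL, Reasoning) are not relevant. The weighted total is 19.5, below the passing score of 35.2. No expert authors are on the list, so no bonus was added.

Keywords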

Multi-modal MRI, Brain Image Translation, Tumor Structure, Hierarchical, GAN, Structure-Aware Encoder, Patch Contrast Module, BraTS

99. GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection [FAIL]

Score: 19.5 / 35.2

Authors: Rui Tang, Guankun Wang, Long Bai, Haochen Yin, Huxin Gao, Jiewen Lai, Jiazheng Wang, Hongliang Ren

Published: 2026-06-11

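TL;DR: GeoCFNet uses a geometry-aware confidence field network built on a DINOv3 backbone to provide precise visual guidance for robot-assisted endoscopic surgery, achieving stable dissection corridor estimation.

Abstract translation (Chinese)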

先进的外科机器人技术使机器人辅助内镜黏膜下剥离术（ESD）成为大病变整块切除的一种有前景的方法，有望降低复发率并改善长期预后。然而，ESD 的技术复杂性和并发症风险要求稳定且精确的视觉引导，以维持准确的剥离通道和安全组织边缘。密集置信场为此目的提供了一种有效的表示，既能描述目标剥离区域，又能描述其与周围组织的空间过渡。然而，由于烟雾、镜面高光、组织形变、弱纹理以及目标区域的薄几何结构，在动态内镜场景中可靠的置信场估计仍然具有挑战性。为了解决这些挑战，我们将剥离引导问题表述为几何感知置信场估计问题，并提出 GeoCFNet，这是一种基于预训练 DINOv3 骨干网络的几何感知置信场网络。GeoCFNet 整合了一个 Token-Differentiated Fusion 模块，用于聚合类别 Token 上下文与密集 Patch 表示，配备一个 SegFormer 解码器用于置信度回归，以及几何感知空间正则化（GASR），以保持空间一致性和局部几何过渡。实验结果表明，GeoCFNet 实现了 RMSE 0.0480、PSNR 27.1995、SSIM 0.3397 和 CC 0.2466，表明其能为机器人辅助 ESD 引导提供准确且几何稳定的置信场估计。

Abstract

Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	1.0/10	1.5

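Scoring rationale: The paper focuses on a medical-robotics vision task (confidence field estimation) and centers on a visual encoder (DINOv3) and token fusion, so 'Visual Encoder' and 'Tokenizer' have some relevance. It does not involve core concepts such as unified models, world models, multimodal large language models, model-based reinforcement learning, or agentic reasoning, so the remaining keywords have very low relevance.

Keywords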

Geometry-Aware Confidence Field Network, Robot-Assisted Endoscopy, DINOv3 Backbone, Dense Confidence Fields, Visual Guidance, SegFormer Decoder, Geometry-Aware Spatial Regularization

100. Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests [FAIL]

Score: 18.0 / 35.2

Authors: Ali Arabat, Mohammed Sayagh

Published: 2026-06-11

TL;DR: This paper investigates whether structured instruction files improve AI-agent performance in software engineering, finding mixed effects on merge rates and suggesting instruction management should be formalized as a software engineering practice.

Abstract translation (Chinese)

AI 代理（AI-agents，例如 GitHub Copilot）在不同的软件工程任务中作为队友协作，包括通过拉取请求（pull requests）提出的代码生成（即 Agentic-PRs）。为了提高代理效率，开发人员会创建指令文件以指导 AI 代理，内容涵盖如何导航项目、定位正确组件、运行测试、遵循最佳实践等方面。本文探讨了这些指令文件的创建与 AI 代理创建更优拉取请求（pull requests）性能之间的关系。这些拉取请求具有更高的成功概率（即合并率，merge rate），能处理更复杂的任务（例如代码 churn），且合并所需工作量更少（例如合并时间，time to merge）。为此，我们分析了 AIDev 数据集中来自 148 个项目的 15,549 个代理拉取请求（agentic PRs）。基于这三个维度，我们比较了每个项目在创建指令文件前后的表现。我们发现，为 AI 代理指定指令并不一定能带来更好的结果。使用指令文件后，27.7% 的项目合并率（merge rate）至少提升了 20%，而 26.35% 的项目合并率则有所下降。这一观察结果在变更量（例如代码 churn、修改文件数量）以及合并代理拉取请求所需的工作量（例如合并时间、评论数量）上也同样存在。初步探索发现，成功提高合并率的项目通常拥有显著更长的指令文件，且这些文件结构更为完善，包含更多的章节和子章节。我们的研究结果表明，有必要开展研究，协助从业者将指令文件的开发视为一项软件工程活动（即，**Instructions-as-Code**）。

Abstract

AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed through pull requests (Agentic-PRs). For better agent efficiency, developers create instruction files that guide the AI-agents, including how to navigate the project, locate the right components, run tests, respect best practices, and more. In this paper, we investigate the relationship between the creation of these instructions and the performance of AI-agents in creating better pull requests, which have a higher chance of success (i.e., the merge rate), address more complex tasks (e.g., code churn), and require less effort to be merged (e.g., time to merge). To this end, we analyze 15,549 agentic PRs from 148 projects in the AIDev dataset. Using the three dimensions, we compare each project before and after the creation of the instruction files. We find that specifying instructions for AI-agents does not necessarily lead to better results. With the instruction files, 27.7% of the projects increased their merge rate by at least 20%, while 26.35% decreased it. The same observation is seen with the amount of changes (e.g., code churn, number of modified files) and with the efforts to merge an agentic PR (e.g., merge time and number of comments). From a first exploration, we find that projects that managed to increase their merge rate have substantially longer instruction files, which are also well structured into a higher number of sections and sub-sections. Our results motivate the need for research to assist practitioners in framing the development of instruction files as a software engineering activity (i.e., Instructions-as-Code).

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	8.0/10	12.0

Scoring rationale: The paper focuses on software engineering and AI-agents, creating a domain mismatch with keywords centered on multimodal learning and RL. 'Agentic Reasoning' is highly relevant (8.0) due to the focus on AI-agents and instruction guidance. 'MLLM' has weak relevance (2.0) as agents likely use LLMs but multimodality isn't central. All other keywords (Visual Encoder, World Models, etc.) are irrelevant (0.0-1.0) as the paper involves text/code analysis rather than visual or latent world modeling.

Keywords

Instructions-as-Code, Agentic Pull Requests, AI-agents, Instruction Files, Merge Rate, Software Engineering, Code Churn

101. EPIG: Emotion-Based Prompting for Personalised Image Generation [FAIL]

Score: 18.0 / 35.2

Authors: Emna Othmen, Mohamed Yassine Landolsi, Lotfi Ben Romdhane

Published: 2026-06-11

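TL;DR: EPIG improves the accuracy of emotional expression in text-to-image generation through prompt enrichment based on psychological emotion representations, without retraining the model.

Abstract translation (Chinese)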

文本到图像扩散模型在利用自然语言提示合成高质量图像方面取得了令人印象深刻的成果。然而，常用的提示策略仍然相对通用，限制了模型准确表达情感意图和细微情感属性的能力。本文提出了 EPIG，一种在图像生成之前于提示层增强情感表达力的方法。该方法基于心理学原理的情感表征（效价 - 唤醒度，Valence-Arousal），并利用结构化、角色感知的提示增强，在不修改或重新训练图像生成骨干网络的情况下丰富提示中的情感相关组件。生成的情感感知提示引导生成过程朝向更具情感连贯性的视觉输出，尤其在控制唤醒度方面效果显著。EPIG 轻量级且无需训练，非常适合资源受限和个性化图像生成场景。实验结果表明，在包含 10 个多样提示的基准上，与强基线方法（包括朴素插入和基于大语言模型（LLM）的提示扩展）相比，EPIG 降低了平均唤醒度误差，分别降低了 14% 和 12%。这些改进具有统计显著性。EPIG 还保持了效价对齐和语义一致性，这由 CLIPScore 衡量并由消融实验支持。在包含明确主体（如人类、儿童或动物）的提示中，该效果更为明显，误差降低幅度达到 17%，凸显了所提出方法对主体的敏感性。

Abstract

Text-to-image diffusion models have achieved impressive results in synthesizing high-quality images from natural language prompts. However, commonly used prompting strategies remain relatively generic, limiting the model's ability to accurately express emotional intent and nuanced affective attributes. This work proposes EPIG, a method that enhances emotional expressiveness at the prompt level prior to image generation. Grounded in psychologically informed emotion representations (valence-arousal) and leveraging structured, role-aware prompt enrichment, EPIG enriches emotion-related components of prompts without modifying or retraining the image generation backbone. The resulting emotion-aware prompts guide the generative process toward more emotionally coherent visual outputs, with particular effectiveness in controlling arousal. EPIG is lightweight, training-free, and well suited for resource-constrained and personalized image generation scenarios. Experimental results on a benchmark of 10 diverse prompts show that EPIG reduces mean arousal error compared to strong baselines, including naive insertion and LLM-based prompt expansion, with reductions of 14% and 12%, respectively. These improvements are statistically significant. EPIG also preserves valence alignment and semantic consistency, as measured by CLIPScore and supported by ablation studies. The effect is more pronounced on prompts containing explicit subjects such as humans, children, or animals, where the reduction reaches 17%, highlighting the subject-sensitive behavior of the proposed method.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

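Scoring rationale: The paper mainly concerns emotion-oriented prompt engineering for text-to-image generation and has little connection to the keywords (such as world models, reinforcement learning, and unified models). It is only somewhat related to MultiModal and MLLM (through LLM-based prompt expansion); the other keywords, such as Tokenizer, Visual Encoder, and RL, do not appear in its core contributions.

Keywords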

Emotion-Based Prompting, Personalised Image Generation, Text-to-image diffusion, Valence-arousal, Prompt enrichment, Emotion-aware prompts, Arousal control

102. NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning [FAIL]

Score: 18.0 / 35.2

Authors: Feng Lyu, Huiqin Yan, Sijing Duan, Hao Wu, Shuang Gu, Xue Qiao, Weixu Zhang, Haolun Wu

Published: 2026-06-11

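TL;DR: This paper proposes the NTS-CoT framework, which uses chain-of-thought reasoning to mitigate hallucinations in LLM-based news timeline summarization, improving the faithfulness and completeness of the summaries.

Abstract translation (Chinese)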

在线新闻的快速更新使得追踪事件发展颇具挑战，凸显了对时间线摘要（TLS）的需求。在基于 LLM（大型语言模型）的时间线摘要中，幻觉（即 LLM 生成的内容与源新闻不符）仍然是一个关键问题，且在现有工作中研究尚不充分。为填补这一空白，我们识别出两种主要的幻觉类型：新闻摘要过程中的不忠实内容以及日期事件摘要中的信息遗漏。随后，我们提出了一种名为 NTS-CoT 的新颖框架，该框架利用思维链（CoT）推理来缓解时间线摘要中的幻觉。该框架包含三个关键模块：i) Element-CoT，用于捕捉关键新闻元素以实现忠实摘要；ii) Date Selection，结合时间显著性与事件显著性以进行时间戳选择；iii) Causal-CoT，用于推断因果关系并减少日期事件摘要中的遗漏。广泛的实验（包括在三个 TLS 基准上的定量分析及人工评估）表明，NTS-CoT 优于最先进基线，能有效缓解幻觉并提升基于 LLM 的时间线摘要性能。我们的源代码可在 https://anonymous.4open.science/r/NTS-CoT 获取。

Abstract

The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, remain a critical issue in LLM-based TLS and are not well studied in existing work. To bridge this gap, we identify two primary types of hallucinations: unfaithful content during news summarization and information omission in date-event summarization. Then, we propose NTS-CoT, a novel framework that leverages Chain-of-Thought (CoT) reasoning to mitigate hallucinations in TLS. The framework consists of three key modules: i) Element-CoT to capture essential news elements for faithful summarization, ii) Date Selection to combine temporal saliency and event prominence for timestamp selection, and iii) Causal-CoT to infer causal relationships and reduce omissions in date-event summarization. Extensive experiments, including quantitative analysis on three TLS benchmarks and human evaluation, demonstrate that NTS-CoT outperforms state-of-the-art baselines, effectively mitigating hallucinations and improving LLM-based TLS performance. Our source code is available at https://anonymous.4open.science/r/NTS-CoT.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	2.0/10	3.0

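Scoring rationale: The paper focuses on LLM-based news timeline summarization and hallucination mitigation, with Chain-of-Thought (CoT) reasoning as its core method. The supplied keyword set mainly covers multimodal architectures, world models, and reinforcement learning (e.g., Visual Encoder, World Models, model-based RL), which differ substantially from this text-only summarization task, so most keywords receive low relevance (0-2 points). The reasoning-related keywords (Latent/Agentic) are slightly related because the paper involves a reasoning process (2-3 points). None of the designated expert authors were found. The weighted total is 18.0, below the dynamic pass mark of 35.2.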
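
The total in the scoring table above can be reproduced by hand: each row's score is weight × relevance, and the entry is marked FAIL when the sum falls below the pass mark. A minimal sketch, assuming the report aggregates with a plain sum (the function name `weighted_total` is hypothetical; the relevance values and the 35.2 pass mark are copied from this entry):

```python
# Relevance values (out of 10) copied from the scoring table for paper 102.
WEIGHT = 1.5
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 2.0,
    "MultiModal": 2.0,
    "model-based RL": 0.0,
    "Latent Reasoning": 3.0,
    "Agentic Reasoning": 2.0,
}
PASS_MARK = 35.2  # dynamic pass mark quoted in the report


def weighted_total(relevance, weight):
    """Sum weight * relevance over all keywords (assumed aggregation rule)."""
    return sum(weight * r for r in relevance.values())


total = weighted_total(relevance, WEIGHT)
status = "PASS" if total >= PASS_MARK else "FAIL"
print(total, status)
```

The same check applies to every entry in this report, since all keywords share the weight 1.5.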

Keywords

News Timeline Summarization, Hallucination Mitigation, Chain-of-Thought Reasoning, LLM-based, Causal Relationships, Temporal Saliency, Faithful Summarization

103. Budget-Constrained Step-Level Diffusion Caching [FAIL]

Score: 18.0 / 35.2

Authors: Mingkun Lei, Tong Zhao, Liangyu Yuan, Chi Zhang

Published: 2026-06-11

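TL;DR: This paper proposes BudCache, which uses offline search to optimize the step-level caching policy of diffusion models under a fixed compute budget, keeping inference latency under control while preserving generation quality.

Abstract (Chinese translation)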

步级缓存通过利用去噪步骤之间的时间冗余来加速扩散模型。现有方法使用基于阈值的启发式方法做出每步缓存决策，而未直接优化最终输出质量。因此，它们的推理延迟随输入变化，且在部署时难以控制。在本文中，我们提出 BudCache，反转了这一设定：不是让每步误差阈值决定运行时开销，而是预先固定计算预算，并搜索能最好地保留最终输出的缓存策略。为了解决步骤选择的组合复杂性，我们将模拟退火 (Simulated Annealing) 与确定性爬山算法 (Hill Climbing) 相结合。这种离线搜索在几分钟内即可识别出高质量缓存策略，且在推理过程中不引入任何在线搜索或阈值化开销。当计算预算非常紧张时，我们进一步引入感知缓存的调度对齐 (cache-aware schedule alignment)，该机制根据所选缓存策略调整时间离散化，以减少由缓存引起的轨迹不匹配。在 FLUX.1-dev 和 Wan2.1 上的实验表明，在相同的推理预算下，BudCache 比启发式缓存基线实现了更好的生成质量。代码可在 https://github.com/Westlake-AGI-Lab/BudCache 获取。

Abstract

Step-level caching accelerates diffusion models by exploiting temporal redundancy across denoising steps. Existing methods make per-step cache decisions using threshold-based heuristics, without directly optimizing for final output quality. As a result, their inference latency varies across inputs and is difficult to control at deployment. In this work, we propose BudCache, which inverts this formulation: rather than letting per-step error thresholds dictate the runtime cost, we fix the compute budget in advance and search for the cache policy that best preserves the final output. To tackle the combinatorial complexity of step selection, we combine Simulated Annealing with deterministic Hill Climbing. This offline search identifies high-quality cache policies within minutes and introduces no online search or thresholding overhead during inference. When the compute budget is very tight, we further introduce cache-aware schedule alignment, which adapts the time discretization to the selected cache policy to reduce cache-induced trajectory mismatch. Experiments on FLUX.1-dev and Wan2.1 show that BudCache achieves better generation quality than heuristic caching baselines under the same inference budgets. Code is available at https://github.com/Westlake-AGI-Lab/BudCache
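
The search described in the abstract (fix the budget, then look for the best cache policy with Simulated Annealing followed by deterministic Hill Climbing) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' implementation: `proxy_error` is a made-up stand-in for the final-output error, a policy is a boolean mask over denoising steps, and moves swap one computed step with one cached step so the budget never changes.

```python
import math
import random


def proxy_error(mask):
    """Toy stand-in for final-output error: a cached step costs more the
    longer it has been since the last fully computed step."""
    error, gap = 0.0, 0
    for computed in mask:
        gap = 0 if computed else gap + 1
        error += gap * gap
    return error


def neighbors(mask):
    """All masks reachable by swapping one computed step with one cached
    step (step 0 always stays computed), so the budget is preserved."""
    on = [i for i in range(1, len(mask)) if mask[i]]
    off = [i for i in range(1, len(mask)) if not mask[i]]
    for i in on:
        for j in off:
            new = list(mask)
            new[i], new[j] = False, True
            yield new


def search_policy(num_steps, budget, iters=2000, seed=0):
    if not 2 <= budget < num_steps:
        raise ValueError("need 2 <= budget < num_steps")
    rng = random.Random(seed)
    # Initial policy: compute the first `budget` steps, cache the rest.
    cur = [i < budget for i in range(num_steps)]
    cur_err = proxy_error(cur)
    best, best_err = cur, cur_err
    # Phase 1: simulated annealing with a linear cooling schedule.
    for k in range(iters):
        temp = max(1e-3, 1.0 - k / iters)
        cand = rng.choice(list(neighbors(cur)))
        cand_err = proxy_error(cand)
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if cand_err <= cur_err or rng.random() < math.exp((cur_err - cand_err) / temp):
            cur, cur_err = cand, cand_err
            if cur_err < best_err:
                best, best_err = cur, cur_err
    # Phase 2: deterministic hill climbing until no swap improves the policy.
    while True:
        cand = min(neighbors(best), key=proxy_error)
        cand_err = proxy_error(cand)
        if cand_err >= best_err:
            break
        best, best_err = cand, cand_err
    return best, best_err


policy, err = search_policy(num_steps=20, budget=5)
print(sum(policy), err)
```

Because the search runs offline, inference would only need to look up `policy[t]` at each step, which is consistent with the abstract's claim of no online search or thresholding overhead.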

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

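Scoring rationale: The paper studies budget-constrained step-level caching (BudCache) for accelerating diffusion-model inference, which has little connection to the core topics in the keyword set: world models, reinforcement learning, and model architecture components (e.g., Tokenizer, Visual Encoder). Although the test models (FLUX.1, Wan2.1) are multimodal, the paper's focus is the caching algorithm rather than multimodal understanding or unified understanding and generation. None of the designated expert authors are included. The weighted total is 18.0, below the dynamic pass mark of 35.2, indicating weak relevance to the given keyword topics.

Keywords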

Budget-Constrained, Step-Level Diffusion Caching, Inference Acceleration, Compute Budget, Simulated Annealing, FLUX.1, Wan2.1

104. Mana: Dexterous Manipulation of Articulated Tools [FAIL]

Score: 16.5 / 35.2

Authors: Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

Published: 2026-06-11

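TL;DR: Mana proposes a sim-to-real framework that treats dexterous tool manipulation as an animation problem, achieving zero-shot real-world transfer for several articulated tools through motion planning and reinforcement learning.

Abstract (Chinese translation)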

Articulated tool manipulation（铰接式工具操作）在灵巧机器人中仍是一个重大挑战，因为需要协调内部自由度和 contact-rich interactions（接触丰富交互）。尽管先前工作主要关注 rigid objects（刚体），但由于其物理复杂性及学习 functional grasping and manipulation policies（功能性抓取和操作策略）的难度，articulated tool use（铰接式工具使用）仍研究不足。我们提出 Mana (Manipulation Animator)，一种通用的 sim-to-real（仿真到真实）框架，将灵巧操作重新诠释为动画问题。受计算机动画启发，Mana 采用 coarse-to-fine pipeline（粗到细流程），通过运动规划和 reinforcement learning（强化学习）将程序化生成的 grasp keyframes（抓取关键帧）转换为 manipulation trajectories（操作轨迹）。数据生成过程大体自动，仅需几次鼠标点击即可指定功能性 affordances（功能可用性）（每种工具<1 分钟）。在四种跨越不同尺度和关节类型的 articulated tools（铰接式工具）上，Mana 实现了 grasping（抓取）和 in-hand manipulation（手内操作）的 zero-shot sim-to-real transfer（零样本仿真到真实转移），展示了灵巧 articulated tool（铰接式工具）使用的可扩展方法。

Abstract

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	5.0/10	7.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

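Scoring rationale: The paper focuses on dexterous robotic manipulation of tools, using a sim-to-real framework, motion planning, and reinforcement learning. Although it involves reinforcement learning (partially related to model-based RL), it does not address multimodal large models, tokenizers, visual encoders (as the core of representation learning), generative world models, or unified model architectures. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords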

Dexterous Manipulation, Articulated Tools, Sim-to-Real Transfer, Motion Planning, Reinforcement Learning, Coarse-to-Fine Pipeline, Functional Affordances, Zero-shot Transfer

105. Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints [FAIL]

Score: 16.5 / 35.2

Authors: Omar Alshahrani, Muzammil Behzad

Published: 2026-06-11

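TL;DR: This paper presents a cross-modality analytical framework for classifying, detecting, and mitigating hallucinations in medical imaging AI, and finds that general-purpose foundation models can outperform medical-specialized models on hallucination benchmarks.

Abstract (Chinese translation)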

AI 系统在医学成像领域的部署速度超过了对其故障模式的理解速度。现阶段，临床最关注的故障是“幻觉”（hallucination）：即临床合理但事实错误的输出，包括虚构的解剖结构、漏诊、侧别错误以及生成报告中虚构的测量值，这些后果直接影响活检决策、分期和治疗计划等。本结构化叙述综合了五种成像模态下的同行评审研究、基准数据集及 FDA 监管指南，旨在对幻觉的分类法、病因、检测和缓解进行跨模态分析。具体而言，本研究旨在回答以下三个问题：(1) 现有分类法如何在不同模态间实现统一？(2) 医学专用基础模型为何比通用基础模型产生的幻觉更少？(3) 哪些缓解策略既有效又能与 FDA 生命周期监督兼容？我们发现，三种分类法框架共同覆盖了成像流程，这是任何单一框架单独无法做到的。我们还指出，通用基础模型在针对幻觉的基准测试中表现优于医学专用模型，这表明窄域微调可能引入过拟合导致的虚构（confabulation）。与此同时，放射科医生的监督依然至关重要；例如，极高比例的 AI 生成标记在临床使用前需要专家校正。物理信息架构约束（Physics-informed architectural constraints）、思维链提示（Chain-of-Thought prompting）以及人在回路保障（human-in-the-loop safeguards）各自针对不同的故障模式，且当三者结合使用时效果显著。所有发现均映射至 FDA 的全产品生命周期（Total Product Lifecycle）和预先确定的变更控制计划（Predetermined Change Control Plan）框架，这些框架将幻觉管理视为一种生命周期义务，而非部署前的检查清单。

Abstract

AI systems are being deployed across medical imaging faster than their failure modes are understood. At this point in time, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences, for example, for biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities?, (2) how do medical-specialized foundation models hallucinate less than general-purpose ones?, and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

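Scoring rationale: The paper focuses on hallucination in medical imaging AI, covering cross-modality analysis, taxonomy unification, and regulatory compliance. Although it mentions foundation models (MLLM) and cross-modality (MultiModal) analysis and discusses unifying taxonomies (Unify Models), it gives no technical detail on tokenizers, visual encoder architectures, world models, reinforcement learning, or specific reasoning architectures (Latent/Agentic). It is therefore moderately related to a few keywords and unrelated to the rest. The author list does not include any of the designated experts, so no bonus points are added.

Keywords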

Hallucination, Medical Imaging AI, Cross-Modality, Taxonomy, Detection, Mitigation, Regulatory Constraints, Foundation Models

106. Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models [FAIL]

Score: 16.5 / 35.2

Authors: Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

Published: 2026-06-11

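TL;DR: This paper proposes a robust, anti-collusion fingerprinting method for text-to-image diffusion models that encodes identifiers into a personalized normalization module, achieving high extraction accuracy and protecting intellectual property against collusion attacks.

Abstract (Chinese translation)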

模型指纹技术（即将用户特定标识（指纹）嵌入生成输出中）最近已成为一种流行解决方案，用于保护生成式文本到图像（T2I）模型的知识产权（IPR），并防止未经授权的再分发。在这项工作中，我们揭示了现有生成模型指纹方法中先前未被探索的系统性漏洞：它们对合谋攻击缺乏鲁棒性，即多个攻击者联合其模型以移除或掩盖指纹。为了解决这一问题，我们迈出了构建具有抗合谋能力的稳健 T2I 模型指纹方法的第一步。所提出的方法将比特串（即指纹）编码到纳入 T2I 模型中的个性化归一化模块（PNM）的系数中，从而能够从任何生成的图像中可靠地恢复指纹。为了防御合谋攻击并防止未经授权的模型再分发，我们引入了一种基于无损函数不变参数变换的抗合谋机制。该机制显著降低了合谋模型的图像生成质量，使其实际上无法使用。此外，我们的方法允许开发人员通过重参数化 PNM 而无需重新训练，高效地创建多个带有指纹的 T2I 模型副本。我们还引入了一种最坏情况优化策略，以提高对模型级别攻击的鲁棒性。我们的实验表明，所提出的方法在多个 T2I 图像生成和编辑任务中实现了高保真度和鲁棒性，指纹提取准确率超过 99.5%。与现有方法相比，我们的方法首次表现出对合谋攻击的显著主动鲁棒性，通过显著增加合谋模型的 FID（弗雷歇起始距离）。

Abstract

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.
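
The idea behind a function-invariant parameter transformation can be illustrated with a generic example. The abstract does not say which transformations the method uses, so the sketch below substitutes a well-known one, reordering the hidden units of a small ReLU network, and assumes that colluders simply average their parameters. All names and numbers are illustrative.

```python
def mlp(x, w1, w2):
    """Tiny network: y = w2 . relu(w1 * x) with scalar input and output."""
    hidden = [max(0.0, w * x) for w in w1]
    return sum(a * h for a, h in zip(w2, hidden))


def permute(w1, w2, order):
    """Reorder hidden units; the function the network computes is unchanged."""
    return [w1[i] for i in order], [w2[i] for i in order]


def average(model_a, model_b):
    """Naive collusion (an assumption): average parameters element-wise."""
    (a1, a2), (b1, b2) = model_a, model_b
    return ([(p + q) / 2 for p, q in zip(a1, b1)],
            [(p + q) / 2 for p, q in zip(a2, b2)])


base = ([1.0, -2.0, 3.0], [0.5, 1.0, -1.0])
copy_a = permute(*base, order=[0, 1, 2])  # user A's copy
copy_b = permute(*base, order=[2, 0, 1])  # user B's copy, different ordering
colluded = average(copy_a, copy_b)

print(mlp(1.0, *copy_a), mlp(1.0, *copy_b), mlp(1.0, *colluded))
```

Each distributed copy behaves identically on its own, but the averaged model mixes unrelated hidden units and no longer computes the original function. This is the qualitative effect the abstract describes as degraded generation quality for colluded models.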

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	1.0/10	1.5

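Scoring rationale: The paper studies fingerprint protection for text-to-image diffusion models and defense against collusion attacks; its core contributions are fingerprint encoding in a personalized normalization module and parameter transformations. The supplied keywords cover unified models, tokenizers, visual encoders, world models, MLLMs, reinforcement learning, and reasoning systems, which are a poor match for this topic; only 'MultiModal' is slightly relevant because the paper involves text-to-image generation. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords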

Model Fingerprinting, Anti-Collusion, Image Diffusion Models, Personalized Normalization Module, Intellectual Property Protection, Text-to-Image Generation, Parameter Transformation, Robustness

107. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents [FAIL]

Score: 16.5 / 35.2

Authors: Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

Published: 2026-06-11

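TL;DR: This paper proposes TRACE, which compiles user corrections into runtime rules and substantially reduces the rate of preference violations by coding agents across sessions.

Abstract (Chinese translation)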

交互式大语言模型（LLM）智能体正逐渐成为日常工作的一部分，但它们并不能可靠地随着时间的推移变得更易于协作：在一个会话中记住的修正，可能在下一个会话中仍会被违反。我们研究了偏好获取与偏好遵守之间的差距。在基于匿名真实用户摩擦案例衍生的任务中，Mem0 记忆机制仍有 57.5% 的适用偏好检查被违反。我们引入了测试时规则获取与编译执行（TRACE），这是一个用于编码智能体运行时的即插即用技能层管道，它提取用户修正，将其重写为原子规则，并编译为运行时检查，这些检查必须在智能体完成未来任务之前通过。与开发人员预先编写的运行时检查不同，TRACE 机制源自用户自身的聊天修正。我们在 ClawArena 编码智能体任务和 MemoryArena 衍生的内存密集型任务上，通过模拟人在回路实验评估 TRACE。在 ClawArena 上，TRACE 将分布内任务的未见偏好违反从 100.0% 降低至 37.6%，将分布外任务的未见偏好违反从 100.0% 降低至 2.0%。在 MemoryArena 衍生的任务上，TRACE 将分布内违反率从 100.0% 降低至 60.5%，同时在任务通过率上匹配或超越了最强的记忆基线。这些结果表明，将修正编译为运行时执行可以解决一种仅靠记忆机制无法可靠解决的重复摩擦故障模式，从而减少了用户在未来会话中重复陈述同一修正的需求。实验代码见 https://github.com/YujunZhou/TRACE_exp，可部署的技能模块见 https://github.com/YujunZhou/tellonce。

Abstract

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
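
The "compile corrections into runtime checks that must pass before completion" idea can be sketched as a gate in front of the agent's final answer. This is a minimal illustration, not the TRACE pipeline: the `Rule` schema, the two example rules, and the string-matching checks are hypothetical stand-ins for rules that would be mined from a user's chat corrections.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One atomic rule derived from a user correction (hypothetical schema)."""
    name: str
    check: Callable[[str], bool]  # returns True when the output complies


def compile_rules() -> List[Rule]:
    # Hand-written stand-ins for corrections such as
    # "don't use print for logging" and "always annotate return types".
    return [
        Rule("no-print-logging", lambda code: "print(" not in code),
        Rule("typed-signatures", lambda code: "->" in code),
    ]


def violations(output: str, rules: List[Rule]) -> List[str]:
    """Names of the rules that the candidate output fails."""
    return [r.name for r in rules if not r.check(output)]


def finish_task(output: str, rules: List[Rule]) -> str:
    """Gate completion: refuse to finish while any compiled check fails."""
    failed = violations(output, rules)
    if failed:
        raise ValueError(f"blocked by rules: {failed}")
    return output


rules = compile_rules()
bad = "def f(x):\n    print(x)\n"
good = "def f(x: int) -> int:\n    return x\n"
print(violations(bad, rules))
print(finish_task(good, rules) == good)
```

The design point is that a rule stored only in memory can still be ignored by the model, whereas a compiled check fails the task deterministically until the output complies.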

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	6.0/10	9.0

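Scoring rationale: The paper focuses on user corrections and runtime enforcement for coding agents, a markedly different domain from the supplied keyword list (which emphasizes multimodality, world models, and unified models). Apart from 'Agentic Reasoning', which is reasonably relevant because the paper concerns agent behavior (6 points), keywords such as Visual Encoder, MultiModal, and World Models are unrelated (0 points). 'MLLM' scores low (3 points) because the paper involves large language models but not multimodal ones. The weighted total is 16.5, below the dynamic pass mark of 35.2. The author list does not include any of the designated experts, so no bonus points are added.

Keywords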

Coding Agents, User Corrections, Runtime Enforcement, Preference Compliance, Test-time Rule Acquisition, Compiled Enforcement, LLM Agents

108. PolyAlign: Conditional Human-Distribution Alignment [FAIL]

Score: 16.5 / 35.2

Authors: L. D. M. S. Sai Teja, Ufaq Khan, Sathira Silva, Xiao Wu, Muhammad Haris Khan

Published: 2026-06-11

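TL;DR: PolyAlign proposes a conditional human-distribution alignment framework that matches context-specific response distributions instead of a single global alignment target, improving the naturalness and distributional faithfulness of bilingual dialogue.

Abstract (Chinese translation)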

后训练方法，如监督微调（SFT）和偏好优化，通常将语言模型对齐至单一的全局助手行为。虽然这对提高平均有用性有效，但这可能会抑制人类响应在语言、任务和对话场景之间的自然变异。我们将此问题研究为条件人类分布对齐：模型应匹配与当前交互上下文相适应的人类响应分布，而非通用的响应风格。我们引入了 PolyAlign，这是一个分布感知对齐框架，它将双语交互数据组织为桶特定的人类参考分布，这些分布由语言、交互轨迹、响应家族和长度定义。PolyAlign 结合了桶感知 SFT（Bucket-Aware SFT），该机制平衡了异构桶之间的优化，以及人类分布偏好优化（HDPO），该机制利用评论者估计的到桶特定人类支持集的距离来正则化偏好学习。在涵盖英语和中文单轮及多轮设置的双语评估套件上，PolyAlign 提高了条件自然性和分布保真度，同时保持了具有竞争力的任务效用。结果表明，后训练应超越全局对齐目标，转向与人类响应分布相一致的交互感知对齐。

Abstract

Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.
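
One simple way to "balance optimization across heterogeneous buckets" is to weight each training example inversely to the size of its bucket, so that every bucket contributes the same total weight to the loss. The sketch below is an assumption about how such balancing could look, not the paper's Bucket-Aware SFT: the field names, the 50-token length threshold, and the weighting rule are all hypothetical.

```python
from collections import Counter


def bucket_key(example):
    """Bucket = (language, interaction track, response family, length bin)."""
    length_bin = "short" if example["num_tokens"] < 50 else "long"
    return (example["lang"], example["track"], example["family"], length_bin)


def bucket_weights(examples):
    """Per-example weights that give every bucket the same total weight."""
    counts = Counter(bucket_key(e) for e in examples)
    return [1.0 / (len(counts) * counts[bucket_key(e)]) for e in examples]


# Three examples fall into one bucket and one example into another.
examples = [
    {"lang": "en", "track": "single", "family": "chat", "num_tokens": 20},
    {"lang": "en", "track": "single", "family": "chat", "num_tokens": 30},
    {"lang": "en", "track": "single", "family": "chat", "num_tokens": 10},
    {"lang": "zh", "track": "multi", "family": "chat", "num_tokens": 120},
]
weights = bucket_weights(examples)
print(weights)
```

With these weights the lone Chinese multi-turn example carries as much total weight as the three English single-turn examples combined, which keeps a large bucket from dominating the gradient.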

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	2.0/10	3.0

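Scoring rationale: The paper focuses on post-training alignment of language models, specifically distribution matching across languages and dialogue contexts. It does not involve multimodal components (Visual Encoder, MultiModal), world models, model-based RL, or a specific tokenizer architecture. Although it uses LLMs (MLLM), its core contribution is distribution alignment rather than unified model architectures or reasoning ability, so it is only weakly related to most of the supplied keywords.

Keywords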

Conditional Human-Distribution Alignment, Post-training methods, Supervised Fine-Tuning, Preference Optimization, Bilingual Interaction, Bucket-Aware SFT, Human-Distribution Preference Optimization

109. IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds [FAIL]

Score: 15.0 / 35.2

Authors: Micaela Vaucher, Santiago Silveira, Santiago Góngora, Luis Chiruzzo

Published: 2026-06-11

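TL;DR: IVIE proposes a neuro-symbolic framework that uses LLMs for creative generation and symbolic validation to build coherent interactive fiction worlds, though consistency challenges remain.

Abstract (Chinese translation)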

互动小说（Interactive Fiction）中的计算创造力面临根本性的张力：大语言模型（LLM）虽能生成富有创意的叙事，却在世界连贯性上捉襟见肘；而符号系统虽能确保一致性，却缺乏创意灵活性。本文提出 IVIE（Incremental & Validated Interactive Experiences），一种神经符号方法，旨在从零开始生成完整且可玩的互动小说世界。基于 PAYADOR 的神经符号框架，IVIE 实施了一个四阶段增量生成管道，将创意决策（包括场景与角色创建、谜题设计）委托给 LLM，同时通过符号验证来锚定世界状态。该系统生成包含相互连接的地点、功能物品、非玩家角色（NPC）及连贯谜题的世界，所有元素均围绕一个中心目标导向架构进行组织。人类评估表明，该方法生成的世界具有沉浸感、主题连贯性，且玩家参与度较高。结果表明，神经符号方法成功平衡了灵活性与叙事连贯性：符号验证为 LLM 的生成提供了基础，同时并未消除生成自由。然而，挑战依然存在：LLM 的不一致性偶尔会绕过谜题约束，且客观验证的差距允许了一些结构上不可能实现的目标。我们确定了未来神经符号交互式叙事系统的关键设计考量，特别是关于 LLM 的能力及其局限性。

Abstract

Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLM) may produce creative narratives but struggle with world coherence, while symbolic systems ensure consistency but lack creative flexibility. We present IVIE (Incremental & Validated Interactive Experiences), a neuro-symbolic approach to generating complete and playable interactive fiction worlds from scratch. Building upon PAYADOR's neuro-symbolic framework, IVIE implements a four-stage incremental generation pipeline that delegates creative decisions--setting and character creation, puzzle design--to LLMs while grounding the world state through symbolic validation. The system generates worlds with interconnected locations, functional items, non-player characters, and coherent puzzles, all structured around a central goal-oriented architecture. Human evaluation shows the approach generates immersive, thematically coherent worlds with high player engagement. Results seem to indicate that the neuro-symbolic approach successfully balances flexibility with narrative coherence: symbolic validation grounds LLM generation without eliminating generative freedom. However, challenges remain: LLM inconsistencies occasionally bypass puzzle constraints, and objective validation gaps allow some structurally impossible goals. We identify key design considerations for future neurosymbolic interactive storytelling systems, particularly regarding LLM capabilities and their limitations.
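
Symbolic validation of a generated world can be as simple as checking the location graph before the game is played. The sketch below is illustrative only and is not taken from IVIE or PAYADOR: the world representation (a dict of exits) and the two checks (no dangling exits, goal reachable from the start) are assumptions about what such a validator might verify.

```python
from collections import deque


def validate_world(exits, start, goal):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Check 1: every exit must lead to a location that exists.
    for src, dests in exits.items():
        for dst in dests:
            if dst not in exits:
                problems.append(f"{src} -> {dst}: unknown location")
    # Check 2: breadth-first search to confirm the goal is reachable.
    seen, queue = {start}, deque([start])
    while queue:
        here = queue.popleft()
        for nxt in exits.get(here, []):
            if nxt in exits and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    if goal not in seen:
        problems.append(f"goal {goal} unreachable from {start}")
    return problems


world = {
    "cellar": ["hall"],
    "hall": ["cellar", "tower"],
    "tower": [],
    "vault": [],  # generated by the LLM but never connected
}
print(validate_world(world, "cellar", "tower"))
print(validate_world(world, "cellar", "vault"))
```

A check like this targets the "structurally impossible goals" the abstract mentions: an LLM can describe a vault as the final objective without ever generating a path to it.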

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	2.0/10	3.0

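Scoring rationale: The paper focuses on neuro-symbolic generation of interactive fiction and has low overall relevance to the keyword set (multimodality, reinforcement learning, unified model architectures). Only 'World Models' (world-state modeling) and 'Agentic Reasoning' (LLM decision-making) are weakly related; 'MLLM' and 'MultiModal' score low because the paper has no visual or multimodal component; the rest, such as 'Visual Encoder' and 'model-based RL', are unrelated. None of the designated expert authors were found, so no bonus points are added. The weighted total is far below the dynamic pass mark.

Keywords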

Neuro-symbolic Approach, Interactive Fiction, Incremental Generation, Symbolic Validation, LLM Creativity, World Coherence, Puzzle Design

110. Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior [FAIL]

Score: 15.0 / 35.2

Authors: Haowei Qian

Published: 2026-06-11

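TL;DR: This paper studies whether human cognitive diversity can be extracted from trading behavior and transferred to LLM agents through prompt injection; extraction partially works, but prompt injection neither reduces the correlation of agent errors nor improves performance.

Abstract (Chinese translation)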

随着大语言模型（LLM）代理在预测市场和集体决策中的普及，它们面临认知同质化的风险：基于共享基础模型构建的代理会产生相关预测，最近的测量发现前沿模型误差的相关性约为 r ~ 0.77。我们探究是否可以从人类行为中恢复认知多样性并将其注入大语言模型代理。Nous 从真实的 Polymarket 交易活动中提取了一个结构化的八维行为画像，并通过提示词将其注入代理中。我们的核心发现是该流程两个部分之间的脱节。提取部分有效但不完全：在 100 个钱包中，14 个参数中有 8 个在时间上稳定（折半信度系数 (ICC) >= 0.5，自助法置信区间 (CI) 下限 > 0.3；逆向得分达到 ICC ~ 0.9）；钱包可通过其画像被识别，显著高于随机水平（Top-1 检索 17-22% vs. 1% 概率）；且四个预设维度中的两个与未来实现利润在样本外呈秩相关，尽管这些相关性在行为混淆控制下不显著。提示词级注入无法显著传输它：在语义嵌入度量上，结构化注入在任何模型上相比长度匹配的对照组都没有显著优势，且它诱导的多样性既未降低集成误差相关性，也未改善布里尔分数 (Brier score)——这一零结果在探索性检查采样温度、画像多样性和问题时难程度上均持续存在。测量提示词本身将压缩定位在模型之前：结构到叙事转换器发出的提示词几乎均匀，其分布不跟随画像分布。我们将 Nous 定位为测量认知同质化问题和提示词级补救措施局限性的工具，激励更深层的、提示词以下的注入（微调、激活引导）。代码、冻结的画像、提示词和模型输出：https://github.com/WillChienT/nous-paper

Abstract

As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared foundation models produce correlated forecasts, and recent measurement finds frontier-model errors correlated at r ~ 0.77. We ask whether human cognitive diversity can be recovered from behavior and transferred to LLM agents. Nous extracts a structured eight-dimension behavioral profile from real Polymarket trading activity and injects it into agents through prompts. Our central finding is a dissociation between the two halves of that pipeline. Extraction works, partially: across 100 wallets, 8 of 14 parameters are temporally stable (split-half ICC >= 0.5, bootstrap CI lower bound > 0.3; contrarian score reaches ICC ~ 0.9); wallets are identifiable from their profiles well above chance (top-1 retrieval 17-22% vs. 1% chance); and two of four pre-specified dimensions rank-correlate with future realized profit out-of-sample, though the correlations do not survive behavioral-confound controls. Prompt-level injection does not measurably transmit it: on a semantic embedding metric, structured injection shows no significant advantage over a length-matched control on any model, and the diversity it induces neither reduces ensemble error correlation nor improves Brier score -- a null that persists across exploratory checks on sampling temperature, profile diversity, and question difficulty. Measuring the prompts themselves locates the compression before the model: the structure-to-narrative translator emits near-uniform prompts whose spread does not track profile spread. We position Nous as measuring the cognitive-monoculture problem and the limits of a prompt-level remedy, motivating deeper, below-the-prompt injection (fine-tuning, activation steering). Code, frozen profiles, prompts, and model outputs: https://github.com/WillChienT/nous-paper
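
The two outcome metrics named in the abstract, Brier score and ensemble error correlation, are straightforward to compute. The sketch below uses the standard Brier score for binary outcomes and, as an assumption, measures error correlation as the Pearson correlation between two agents' signed forecast errors; the paper's exact definition may differ. The forecasts are made-up numbers for illustration.

```python
import math


def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def error_correlation(f_a, f_b, outcomes):
    """Correlation between two agents' signed forecast errors (assumed metric)."""
    err_a = [p - y for p, y in zip(f_a, outcomes)]
    err_b = [p - y for p, y in zip(f_b, outcomes)]
    return pearson(err_a, err_b)


# Made-up forecasts for four binary questions.
outcomes = [1, 0, 1, 0]
agent_a = [0.8, 0.3, 0.6, 0.1]
agent_b = [0.7, 0.4, 0.5, 0.2]

print(brier_score(agent_a, outcomes))
print(error_correlation(agent_a, agent_b, outcomes))
```

Two agents whose errors are highly correlated add little to an ensemble even if each has a good Brier score on its own, which is why the abstract reports both quantities.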

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	6.0/10	9.0

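Scoring rationale: The paper explores cognitive diversity and prompt-injection strategies for LLM agents in prediction markets and has no direct technical connection to keywords such as Visual Encoder, Tokenizer, World Models, or model-based RL. It is reasonably relevant only to 'Agentic Reasoning' (it concerns agent behavior); 'MLLM' scores low because no multimodality is involved, and 'Latent Reasoning' is slightly related because the paper involves cognition extraction.

Keywords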

LLM agents, Prediction markets, Cognitive diversity, Behavioral profiling, Prompt injection, Cognitive monoculture

111. SciR: A Controllable Benchmark for Scientific Reasoning in LLMs [FAIL]

Score: 15.0 / 35.2

Authors: Pierre Beckmann, Marco Valentino, Andre Freitas

Published: 2026-06-11

TL;DR: SciR introduces a controllable benchmark for evaluating scientific reasoning in LLMs across deduction, induction, and abduction paradigms, revealing that both information extraction and inference difficulty significantly degrade model performance.

Abstract (Chinese translation)

在科学推理中，三种范式形式的推理反复出现：演绎、归纳以及因果溯因。目前，在科学环境中可靠地评估大语言模型（LLMs）尚不可行：基于人工标注的科学基准成本高昂且缺乏机制性真值，而合成逻辑推理基准又不类似于真实的科学文档。我们引入 SciR，这是一个结合多范式推理与可控科学渲染的基准，锚定于三个范式科学问题。任务从形式对象（演绎树、归纳规则假设、因果图）生成，以确保答案可验证，随后通过按轨道领域调优的体裁渲染为多文档科学话语。该构建使我们能够独立调节两个难度轴：提取推理所需关键信息的难度，以及规范性推理本身的难度。我们测试了六个模型。这两个难度轴都对所有模型产生了负面影响，且其效应相互累积。这种渲染甚至损害了神经符号管道，后者将推理任务交给验证求解器。这两个轴产生了每个模型的提取 - 推理画像：例如，像 DeepSeek-R1 这样的推理模型主要在推理轴上超越非推理指令模型。据我们所知，SciR 是首个在提取和推理难度上均具备参数化控制的多范式科学推理基准。

Abstract

Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LLMs on these in scientific settings is currently out of reach: scientific benchmarks built on human annotations are costly and lack mechanistic ground truth, while synthetic logical-reasoning benchmarks do not resemble real scientific documents. We introduce SciR, a benchmark that combines multi-paradigm reasoning with controllable scientific rendering, anchored on three paradigmatic scientific problems. Tasks are generated from formal objects (deduction tree, inductive rule hypothesis, causal graph) to guarantee verifiable answers, then rendered into multi-document scientific discourse via per-track domain-tuned genres. The construction lets us independently vary two difficulty axes: how hard it is to extract the key information needed for inference, and how hard the principled inference itself is. We test six models. Both axes hurt every model, and their effects compound. The rendering even hurts neurosymbolic pipelines, which hand inference to a verified solver. The two axes yield a per-model extraction-vs-inference profile: for instance, reasoning models like deepseek-r1 mostly surpass non-reasoning instruct models on the inference axis. To our knowledge, SciR is the first multi-paradigm scientific-reasoning benchmark with parametric control on both extraction and inference difficulty.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	3.0/10	4.5
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper introduces a scientific reasoning benchmark for LLMs focusing on logical paradigms (deduction, induction, abduction) and controllable difficulty. It does not address Unify Models, Tokenizers, Visual Encoders, World Models, MLLM architectures, MultiModal learning (vision-language), Model-Based RL, Latent Reasoning (latent space), or Agentic Reasoning. Only general Reasoning and LLM terms provide minimal overlap with Latent Reasoning and MLLM/MultiModal keywords, resulting in low relevance scores.

Keywords

Scientific Reasoning, LLMs, Benchmark, Controllable Difficulty, Deduction Induction Abduction, Multi-document Discourse, Inference Extraction

112. How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators [FAIL]

Score: 15.0 / 35.2

Authors: Jihyeon Hur, Yongseok Kwon, Min-Gi Jo, Jeongwhan Choi, Noseong Park

Published: 2026-06-11

TL;DR: This paper proposes an adaptive memory gate mechanism for neural operators to dynamically adjust memory usage based on observation resolution, achieving significant error reduction in solving partial differential equations at low resolutions.

Abstract (Chinese translation)

神经算子（Neural Operators）已成为求解时间依赖偏微分方程（PDEs）的一种强大的数据驱动方法。在最近的研究进展中，记忆增强神经算子（Memory-Augmented Neural Operators）显式地纳入了过去状态，并在低分辨率观测设置下取得了显著的性能。然而，现有方法无论观测条件（如分辨率或物理参数）如何，均采用固定的记忆权重，这限制了它们的适应性。我们的初步实验表明，最优记忆权重随分辨率和粘度变化，这意味着固定记忆权重无法在不同设置下同时优化性能。我们提出了 AMGFNO，它通过一个可学习门控（learnable gate）动态调制记忆权重。在柯拉莫托 - 西瓦辛斯基方程（Kuramoto-Sivashinsky equation）和伯格斯方程（Burgers' equation）上，AMGFNO 在低分辨率下实现了 55-79% 的 nRMSE 降低，随着分辨率增加，学习到的门控值自动从 $\bar{g} \approx 0.7$ 降至接近零。

Abstract

Neural operators have emerged as a powerful data-driven approach for solving time-dependent PDEs. Among recent advances, memory-augmented neural operators explicitly incorporate past states and have achieved remarkable performance under low-resolution observation settings. However, existing approaches apply a fixed memory weight regardless of observation conditions, such as resolution or physical parameters, limiting their adaptability. Our preliminary experiments reveal that the optimal memory weight varies with resolution and viscosity, implying that a fixed memory weight cannot simultaneously optimize performance across diverse settings. We propose AMGFNO, which dynamically modulates the memory weight through a learnable gate. On the Kuramoto-Sivashinsky and Burgers' equations, AMGFNO achieves a 55-79% nRMSE reduction at low resolution, with the learned gate value automatically decreasing from $\bar{g} \approx 0.7$ to near-zero as resolution increases.
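
The gating idea can be illustrated with a scalar gate that scales the memory term before it is added to the current features. This is a toy sketch, not the AMGFNO architecture: the sigmoid-of-log-resolution form, the hand-set parameters `w` and `b`, and the resolutions 16 and 1024 are all assumptions chosen only to mimic the qualitative trend reported in the abstract (a gate near 0.7 at low resolution that falls toward zero as resolution grows).

```python
import math


def gate(resolution, w=-2.0, b=6.4):
    """Hypothetical scalar gate: sigmoid of an affine function of
    log-resolution. In a trained model w and b would be learned."""
    return 1.0 / (1.0 + math.exp(-(w * math.log(resolution) + b)))


def gated_update(current, memory, resolution):
    """Add the memory term to the current features, scaled by the gate."""
    g = gate(resolution)
    return [c + g * m for c, m in zip(current, memory)]


for res in (16, 64, 256, 1024):
    print(res, round(gate(res), 3))

print(gated_update([1.0, 1.0], [2.0, -2.0], resolution=16))
```

A fixed memory weight corresponds to replacing `gate(resolution)` with a constant, which is the setting the abstract argues cannot be optimal across resolutions.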

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on Neural Operators for solving Partial Differential Equations (PDEs) using an adaptive memory gate mechanism. This belongs to the domain of Scientific Computing. The provided keywords primarily target Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning, resulting in minimal domain overlap. 'Latent Reasoning' has slight relevance (2.0) due to the latent function space learning in Neural Operators, while all other keywords (Tokenizer, Visual Encoder, MLLM, etc.) are unrelated (1.0). No expert authors from the specified list were found, so no bonus points were added.

Keywords

Neural Operators, Adaptive Memory Gate, Partial Differential Equations, Memory-augmented, Resolution-dependent, AMGFNO, Time-dependent PDEs

113. Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance [FAIL]

Score: 15.0 / 35.2

Authors: Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

Published: 2026-06-11

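TL;DR: This paper proposes the VER framework, which detects explanatory insufficiency by monitoring residual structures in learned representations, complementing conventional machine learning evaluation methods.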

Abstract

Learned representations are central to modern machine learning and are commonly evaluated through predictive performance, robustness, uncertainty estimation, or generalization. However, a learned representation may remain operationally successful while progressively failing to organize persistent residual structures that are not fully captured by conventional evaluation metrics. This article introduces VER, the Vigilant Evaluator of Representations, a conceptual framework for monitoring representational adequacy in learned representations. VER does not propose a new learning algorithm, loss function, or model architecture. Instead, it formalizes a diagnostic process through which persistent residual structures may be identified, analyzed, and interpreted as potential indicators of explanatory insufficiency. The framework distinguishes representational inadequacy from ordinary prediction error, uncertainty, noise, and distribution shift. It introduces a monitoring sequence based on representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. VER is intended as a contribution to representation diagnostics in machine learning. Its objective is not to replace existing evaluation methods but to complement them by treating representational adequacy as an explicit object of inquiry. A path toward empirical evaluation through representational-vigilance benchmarks is also outlined.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	1.0/10	1.5

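Scoring rationale: The paper proposes a general representation-diagnostics framework (VER) that focuses on detecting explanatory insufficiency and residual structures in learned representations, which places it in machine learning evaluation methodology. It does not address multimodal large-model architectures (MLLM, MultiModal), specific components (Tokenizer, Visual Encoder), world models (World Models), reinforcement learning (model-based RL), or agentic reasoning (Agentic Reasoning), so these keywords are nearly unrelated to the paper's core content and each scores 1. 'Latent Reasoning' scores 2 because the paper concerns the latent-space properties of learned representations. 'Unify Models' refers to unifying model architectures, whereas the paper is only a diagnostic framework and does not involve model unification, so it scores 1. Overall, the paper has low relevance to the given multimodal and reinforcement-learning keywords.

Keywords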

Learned Representations, Representational Vigilance, Diagnostic Framework, Residual Structures, Explanatory Insufficiency, Machine Learning Evaluation, Monitoring Sequence

114. VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits [FAIL]

Score: 15.0 / 35.2

Authors: Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

Published: 2026-06-11

TL;DR: The paper introduces VietFashion, a new benchmark for sketch-text composed image retrieval of traditional Vietnamese outfits, revealing significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition among existing methods.

Abstract

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts that describe detailed outfit attributes are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.
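
The multi-target setting mentioned in the abstract changes how a retrieval result is scored, because a query can have several correct answers. The sketch below shows two common ways to score a single query under that setting. The abstract does not state which metric VietFashion's protocol uses, so both functions, their names, and the example identifiers are illustrative assumptions.

```python
def recall_at_k(ranked_ids, valid_targets, k):
    """Fraction of a query's valid targets that appear in its top-k results."""
    if not valid_targets:
        raise ValueError("each query needs at least one valid target")
    top_k = set(ranked_ids[:k])
    return len(top_k & set(valid_targets)) / len(valid_targets)


def hit_at_k(ranked_ids, valid_targets, k):
    """1.0 if any valid target appears in the top-k results, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(valid_targets) else 0.0


# Hypothetical ranking for one sketch-text query with two valid gallery images.
ranked = ["img_07", "img_42", "img_13", "img_99"]
targets = {"img_42", "img_99"}
print(recall_at_k(ranked, targets, k=2), hit_at_k(ranked, targets, k=2))
```

The first function rewards retrieving all valid targets, while the second only asks whether at least one was found; a single-target benchmark would not distinguish the two.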

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on a fashion retrieval benchmark (VietFashion) using sketches and text. It is highly relevant to 'MultiModal' due to the sketch-text-image interaction. 'Visual Encoder' has minor relevance as retrieval systems utilize encoders, but the paper is a dataset benchmark rather than a model architecture paper. All other keywords (Unify Models, Tokenizer, World Models, MLLM, RL, Reasoning) are unrelated to this fashion retrieval study. No listed experts are authors.

Keywords

Sketch-Text Composed Image Retrieval, Cultural Outfits, Ao Dai, Multi-modal Retrieval, Fashion Benchmark, Fine-grained Semantics, Generative Models

115. OR-Action: Multi-Role Video Understanding with Fine-Grained Actions [FAIL]

Score: 15.0 / 35.2

Authors: Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab

Published: 2026-06-11

TL;DR: This paper proposes a vision-only temporal model and benchmark for fine-grained multi-role action recognition in operating room videos, outperforming graph-based methods.

Abstract

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that significantly outperforms graph-based methods when using all available egocentric video as input. Building on this model, we also introduce a novel multi-view to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on fine-grained action recognition in operating room videos using scene graphs and temporal modeling, showing low relevance to Unify Models, Tokenizer, MLLM, model-based RL, Latent Reasoning, and Agentic Reasoning. Visual Encoder is moderately relevant as a core component of video understanding. World Models and MultiModal have slight relevance due to temporal environment modeling and multi-view data, but are not core focuses. Total weighted score is 15.0, below the dynamic passing score of 35.2.
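
The weighted total quoted in the rationale can be reproduced from the score table for this entry. The sketch below assumes, based only on the numbers shown in this report, that each keyword's score is its weight multiplied by its relevance and that a paper passes when the sum reaches the passing score. The function and variable names are illustrative and are not taken from the report generator.

```python
# Relevance values (out of 10) copied from the OR-Action score table.
RELEVANCE = {
    "Unify Models": 0.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 5.0,
    "World Models": 2.0,
    "MLLM": 0.0,
    "MultiModal": 3.0,
    "model-based RL": 0.0,
    "Latent Reasoning": 0.0,
    "Agentic Reasoning": 0.0,
}
WEIGHT = 1.5          # every keyword in this report carries the same weight
PASSING_SCORE = 35.2  # the "dynamic passing score" quoted in the rationale


def weighted_total(relevance, weight=WEIGHT):
    """Sum weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * r for r in relevance.values())


def verdict(total, passing_score=PASSING_SCORE):
    """Return the PASS/FAIL label for a total (assumed threshold rule)."""
    return "PASS" if total >= passing_score else "FAIL"


total = weighted_total(RELEVANCE)
print(f"{total:.1f} / {PASSING_SCORE} -> {verdict(total)}")
```

Under these assumptions the three non-zero rows (7.5, 3.0, and 4.5) sum to the 15.0 shown in the entry header.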

Keywords

Fine-grained Actions, Video Understanding, Operating Room, Scene Graphs, Temporal Modeling, Egocentric Video, Action Taxonomy, Feature Alignment

116. AgentRivet: an automated system for producing Rivet routines from journal publications [FAIL]

Score: 13.5 / 35.2

Authors: Antonio J. Costa, Caterina Doglioni, Christian Gütschow, Andrew D. Pilkington, Sukanya Sinha

Published: 2026-06-11

TL;DR: AgentRivet utilizes large language models to automatically generate missing Rivet routines for particle physics analysis from published papers, achieving reasonable physics fidelity despite occasional implementation challenges.

Abstract

Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measurements. Rivet is a C++ toolkit that allows new theoretical models to be compared to the measurements, thus aiding the development and tuning of Monte Carlo event generators as well as searches for physics beyond the Standard Model. However, analysis coverage is known to be incomplete, with only 39% of measurements having documented and publicly available Rivet routines. In this article, we design and implement an automated workflow based on Large Language Models with the goal of providing the missing routines. This multi-step workflow, referred to as AgentRivet, extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code and physics reviews as part of an autonomous quality control. We report the results obtained using commercial Large Language Models, provided by OpenAI, Anthropic, and Google, for two recent measurements from the ATLAS and CMS experiments. We find that AgentRivet produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications. Nevertheless, physics-implementation issues do arise and are investigated using the artefacts produced by AgentRivet. The majority of physics-implementation issues arise from subtle-but-ambiguous definitions in the given publication, although some models struggle to implement complex observables even when clear definitions are given.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	7.0/10	10.5

Scoring rationale: The paper focuses on particle physics analysis preservation using LLMs to generate Rivet routines, showing low relevance to multimodal architectures, world models, or RL. Agentic Reasoning is moderately relevant due to the automated agent workflow, while MLLM and MultiModal have minimal relevance as the method relies primarily on text-based code generation.

Keywords

AgentRivet, Large Language Models, Rivet routines, Particle physics, Automated workflow, Code generation, Analysis preservation

117. Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset [FAIL]

Score: 13.5 / 35.2

Authors: Mahmoud Abujadallah, Ali Arabat, Mohammed Sayagh

Published: 2026-06-11

TL;DR: This paper analyzes the reasons behind the high rejection rate of code fixes generated by AI coding agents, identifying four categories of failures to propose better guidance and prioritization strategies.

Abstract

AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	8.0/10	12.0

Scoring rationale: The paper focuses on software engineering and AI agent behavior regarding code pull requests, aligning primarily with 'Agentic Reasoning' (score 8) as it analyzes AI agent failures in generating code fixes. It has minimal connection to 'MLLM' (score 1) as agents use LLMs but the study is not about multi-modal modeling. All other keywords are unrelated. No expert authors found. Total weighted score is 13.5, below the dynamic passing score of 35.2.

Keywords

AI coding agents, Pull Requests, Rejection Analysis, Software Engineering, Agent Failure Modes, AIDev Dataset, Code Fix Validation, Human Review

118. G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents [FAIL]

Score: 13.5 / 35.2

Authors: Minjun Choi, Yoonjin Jang, Sangwon Youn, Youngjoong Ko

Published: 2026-06-11

TL;DR: G-Long introduces a graph-enhanced memory framework utilizing small language models and attention-aware scoring to enhance long-term dialogue consistency while significantly reducing computational overhead.

Abstract

While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to inherent limitations in long-context reasoning and the inefficiency of processing extensive raw text. Existing approaches typically rely on either unstructured memory storage, which is prone to information loss, or computationally expensive LLMs that incur high latency. To address these limitations, we propose G-Long, a graph-enhanced framework that utilizes a fine-tuned small Language Model (sLM) for structured triplet extraction and associative retrieval, significantly reducing operational costs. Furthermore, we introduce a novel attention-aware importance scoring mechanism that leverages the intrinsic cross-attention signals of a T5 summarizer to identify salient memories. Extensive experiments across diverse benchmarks demonstrate that G-Long achieves state-of-the-art performance in both response generation and memory retrieval, yielding performance gains of up to 9.8% in response quality on MSC and 40.8% in retrieval recall on LME, while significantly minimizing computational overhead.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on graph-enhanced memory management for text-based dialogue agents using small language models and summarization. It does not address multimodal learning, visual encoders, world models, or model-based RL, resulting in low scores for most keywords. Only weak relevance exists for 'Unify Models' (graph+LM integration) and 'Agentic Reasoning' (dialogue agents). The calculated weighted score (13.5) is below the dynamic pass score (35.2), indicating low relevance to the provided keyword set.

Keywords

Graph-Enhanced Memory Management, Long-Term Dialogue Agents, Small Language Model, Structured Triplet Extraction, Attention-Aware Importance Scoring, Associative Retrieval

119. Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI [FAIL]

Score: 13.5 / 35.2

Authors: Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal

Published: 2026-06-11

TL;DR: This paper investigates self-supervised foundation models using MAE and JEPA for 3D brain MRI disease detection, demonstrating that specific regularization techniques improve downstream performance based on task structure.

Abstract

Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance-covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task's structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task's structure.
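
The two auxiliary objectives named in the abstract can be illustrated with a short NumPy sketch. The abstract gives no formulas, so both functions are assumptions: the spectral term is written as a mean squared difference between FFT magnitudes, and the variance and covariance terms follow the general form used in VICReg-style regularizers rather than the paper's exact definition. The function names and the constants `gamma` and `eps` are hypothetical.

```python
import numpy as np


def spectral_loss(pred, target):
    """Mean squared difference between the FFT magnitudes of two 3D volumes."""
    pred_mag = np.abs(np.fft.fftn(pred))
    target_mag = np.abs(np.fft.fftn(target))
    return float(np.mean((pred_mag - target_mag) ** 2))


def variance_covariance_penalty(z, gamma=1.0, eps=1e-4):
    """VICReg-style terms for a batch of embeddings z with shape (batch, dim)."""
    z = z - z.mean(axis=0)
    # Variance term: hinge that penalizes dimensions whose std falls below gamma.
    std = np.sqrt(z.var(axis=0) + eps)
    variance_term = float(np.mean(np.maximum(0.0, gamma - std)))
    # Covariance term: squared off-diagonal covariance, scaled by the dimension.
    cov = (z.T @ z) / (z.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    covariance_term = float((off_diag ** 2).sum() / z.shape[1])
    return variance_term, covariance_term


rng = np.random.default_rng(0)
volume = rng.normal(size=(8, 8, 8))          # stand-in for an MRI patch
embeddings = rng.normal(size=(512, 4))       # stand-in for latent vectors
print(spectral_loss(volume, np.zeros_like(volume)))
print(variance_covariance_penalty(embeddings))
```

The covariance term is small when feature dimensions are decorrelated and grows when they carry redundant information, which is the behavior the abstract attributes to covariance regularization.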

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	4.0/10	6.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on self-supervised learning for medical MRI, showing limited alignment with keywords related to multimodal LLMs, RL, and world models. 'Visual Encoder' and 'Latent Reasoning' have moderate relevance due to the use of encoders and latent representations in MAE/JEPA frameworks. 'Unify Models' has slight relevance as it studies multiple foundation paradigms. Other keywords (Tokenizer, MLLM, MultiModal, World Models, model-based RL, Agentic Reasoning) are irrelevant as the paper deals with single-modality medical imaging without text, agents, or reinforcement learning. No expert authors from the target list were found.

Keywords

Self-Supervised Learning, Foundation Models, 3D Brain MRI, Disease Detection, Masked Autoencoders, Joint Embedding Predictive Architectures, Latent Representations, Spectral Regularization

120. PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update [FAIL]

Score: 12.0 / 35.2

Authors: Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

Published: 2026-06-11

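TL;DR: PolyFlow embeds constraints directly into the flow matching dynamics to make flow-based generative models deployable in safety-critical physical systems, achieving zero constraint violation and high distributional fidelity.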

Abstract

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available at https://github.com/MJianM/PolyFlow.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

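Scoring rationale: The paper focuses on handling safety constraints in flow matching models and has no direct connection to multimodal large-model architectures (MLLM, MultiModal, Tokenizer, Visual Encoder). Because it involves control tasks, it is indirectly related to model-based RL and World Models through dynamics modeling (scores 4 and 3); Latent Reasoning scores 1 for latent-space operations, and the remaining keywords score 0.

Keywords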

Flow Matching, Polytope-Constrained, Safety-Critical, Constraint Embedding, Projection-free Update, Planning and Control, Generative Models, Physical Systems

121. Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning [FAIL]

Score: 12.0 / 35.2

Authors: Yashdeep Chaudhary, Roberto Armellin, Harry Holt, Marco Sagliano

Published: 2026-06-11

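TL;DR: This paper proposes a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning, which corrects a nominal trajectory through a closed-loop correction law and demonstrates feasibility and competitiveness under uncertainty on Earth-Mars transfer and rocket landing tasks.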

Abstract

This paper presents a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning. The uncertainty is represented here through initial conditions and process noise, with the only requirement being that it can be sampled. A deterministic nominal trajectory is first computed offline, and reinforcement learning is then used only to robustify that baseline through a structured affine closed-loop correction law comprising a feedforward control adjustment and time-varying feedback gains. Probabilistic feasibility is enforced empirically through rollout-based upper-tail quantiles, while terminal dispersion is regulated through covariance-feasibility penalties. The framework is assessed on two materially different trajectory design problems. The flagship case study is a three-dimensional multi-impulse Earth-Mars transfer, where the learned policy is benchmarked against a recent robust trajectory-optimization reference under Gaussian uncertainty and then evaluated under bounded uniform uncertainty and under process disturbances not seen during training. The second case study is a stochastic atmospheric pinpoint rocket landing problem, used to assess portability to a short-horizon continuous-thrust setting with drag, mass depletion, and glide-slope constraints. The results show that the proposed framework can remain competitive in upper-tail fuel cost while preserving probabilistic feasibility, and that the same robustification scaffold can be carried across heterogeneous spacecraft trajectory planning problems without redesign of its core stochastic-control structure.
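
The affine correction law and the rollout-based quantile check described in the abstract can be illustrated on a toy problem. The sketch below is an assumption-heavy stand-in: it uses a scalar system x[k+1] = x[k] + u[k] + w[k] instead of spacecraft dynamics, hand-picked feedback gains instead of gains learned by reinforcement learning, and a simple hinge penalty on the empirical quantile. All names, noise levels, and the tolerance are hypothetical.

```python
import math
import random


def rollout(x0, u_nom, x_nom, du, gains, noise_std, rng):
    """Simulate x[k+1] = x[k] + u[k] + w[k] under the affine correction law."""
    x = x0
    for k in range(len(u_nom)):
        # nominal control + feedforward adjustment + feedback on the deviation
        u = u_nom[k] + du[k] + gains[k] * (x - x_nom[k])
        x = x + u + rng.gauss(0.0, noise_std)
    return x


def upper_quantile(samples, level):
    """Smallest sample such that at least `level` of the samples are at or below it."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, max(0, math.ceil(level * len(ordered)) - 1))
    return ordered[index]


def terminal_miss_quantile(gains, level=0.95, n_rollouts=2000, seed=0):
    """Empirical upper-tail quantile of the terminal miss distance over sampled rollouts."""
    rng = random.Random(seed)
    steps, target = len(gains), 1.0
    u_nom = [target / steps] * steps                   # deterministic nominal controls
    x_nom = [k * target / steps for k in range(steps)]  # nominal state at each step
    du = [0.0] * steps                                 # feedforward adjustment left at zero
    misses = []
    for _ in range(n_rollouts):
        x0 = rng.gauss(0.0, 0.05)                      # sampled initial-condition error
        x_final = rollout(x0, u_nom, x_nom, du, gains, 0.02, rng)
        misses.append(abs(x_final - target))
    return upper_quantile(misses, level)


open_loop = terminal_miss_quantile([0.0] * 10)
closed_loop = terminal_miss_quantile([-0.5] * 10)
tolerance = 0.1
penalty = max(0.0, closed_loop - tolerance)  # zero when the quantile meets the tolerance
print(open_loop, closed_loop, penalty)
```

Because the check only needs samples of the uncertainty, the same quantile test applies unchanged if the Gaussian draws are replaced with any other sampler, which is the sense in which such a scheme is distribution-agnostic.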

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	6.0/10	9.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

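Scoring rationale: The paper focuses on aerospace trajectory optimization and chance-constrained reinforcement learning, and has very low relevance to the multimodal large-model and world-model keywords. Only 'model-based RL' is somewhat relevant because the work involves reinforcement-learning control (score 6); 'World Models' and 'Unify Models' are only marginally related (score 1 each), and the remaining keywords (Tokenizer, Visual Encoder, MLLM, MultiModal, Latent Reasoning, Agentic Reasoning) are entirely unrelated (score 0). The author list does not include any of the specified experts.

Keywords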

Chance-Constrained Reinforcement Learning, Robust Trajectory Optimization, Distribution-Agnostic, Closed-Loop Correction Law, Earth-Mars Transfer, Rocket Landing, Probabilistic Feasibility, Nominal Trajectory

122. A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis [FAIL]

Score: 11.2 / 35.2

Authors: Manex Atxa, Bruno Simoes, Julen Balzategui

Published: 2026-06-11

TL;DR: This paper proposes a real-time personalized ergonomic pose analysis framework using volumetric video data and deep learning classifiers to enable scalable workplace safety monitoring.

Abstract

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint, thereby restricting the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.5/10	3.8
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on ergonomic pose analysis using traditional computer vision and deep learning (RGB-D, point clouds), showing weak alignment with the provided keyword cluster focused on Large AI models (MLLM, Tokenizers, World Models) and Reinforcement Learning. Scores reflect minimal overlap: 'MultiModal' and 'Visual Encoder' have slight relevance due to RGB-D visual processing, 'Unify Models' due to combining 3D/2D methods, while others are irrelevant.

Keywords

Real-time, Personalized, Ergonomic Pose Analysis, Volumetric Video, Deep Learning, RGB-D Cameras, Skeletal Labeling

123. Uncertainty-Aware Hybrid Retrieval for Long-Document RAG [FAIL]

Score: 10.5 / 35.2

Authors: Hoin Jung, Xiaoqian Wang

Published: 2026-06-11

TL;DR: This paper proposes an uncertainty-aware hybrid retrieval framework (UMG-RAG) that fuses evidence from multiple chunk granularities using existing dense and sparse retrievers to enhance long-document question answering without requiring model training.

Abstract (Chinese translation)

检索增强生成（RAG）高度依赖于检索证据的质量与粒度。大粒度检索单元虽能保留上下文，但常引入无关内容，这可能会稀释承载答案的证据，并恶化长上下文利用效果。细粒度单元更为紧凑，但可能难以可靠检索，因为短文本块可能缺乏匹配查询所需的语义、词汇或连接线索。本文提出不确定性感知多粒度 RAG（UMG-RAG），这是一种无需训练的混合检索框架，它将块粒度视为查询特定的可靠性估计。UMG-RAG 无需训练新的检索器或修改生成器，而是利用现有的稠密检索器和稀疏检索器作为互补专家，处理多个块粒度。针对每个查询，它将每个专家 - 粒度组合的分数列表转换为证据分布，通过分布熵估计可靠性，并根据查询特定的语义、词汇和粒度置信度融合候选项。本文进一步引入了 UMGP-RAG，这是一种父块提升变体，它利用细粒度命中定位相关证据，同时返回更宽泛的非冗余父块以保证局部连贯性。在问答基准上的实验表明，不确定性感知融合与父块提升在保持轻量级、即插即用检索管道的前提下，提高了生成质量。

Abstract

Retrieval-augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserve context but often introduce irrelevant content, which can dilute answer-bearing evidence and worsen long-context utilization. Fine-grained units are more compact, but they may be difficult to retrieve reliably because short chunks can lack semantic, lexical, or bridging cues needed to match the query. We propose Uncertainty-aware Multi-Granularity RAG (UMG-RAG), a training-free hybrid retrieval framework that treats chunk granularity as query-specific reliability estimation. Instead of training a new retriever or modifying the generator, UMG-RAG uses existing dense and sparse retrievers as complementary experts across multiple chunk granularities. For each query, it converts each expert-granularity score list into an evidence distribution, estimates reliability from distribution entropy, and fuses candidates according to query-specific semantic, lexical, and granularity confidence. We further introduce UMGP-RAG, a parent promotion variant that uses fine-grained hits to locate relevant evidence while returning broader non-redundant parent chunks for local coherence. Experiments on question answering benchmarks show that uncertainty-aware fusion and parent promotion improve generation quality while maintaining a lightweight, plug-and-play retrieval pipeline.
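
The entropy-based fusion described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the softmax normalisation, the `1 - H / log n` reliability weight, the function names (`evidence_distribution`, `reliability`, `fuse`), and the example scores are all assumptions made for this sketch.

```python
import math
from collections import defaultdict

def evidence_distribution(scores, temperature=1.0):
    """Softmax over one expert-granularity score list (assumed normalisation)."""
    m = max(scores.values())
    exps = {doc: math.exp((s - m) / temperature) for doc, s in scores.items()}
    z = sum(exps.values())
    return {doc: e / z for doc, e in exps.items()}

def reliability(dist):
    """1 - normalised entropy: peaked distributions count as more reliable."""
    n = len(dist)
    if n < 2:
        return 1.0
    h = -sum(p * math.log(p) for p in dist.values() if p > 0)
    return 1.0 - h / math.log(n)

def fuse(score_lists):
    """Reliability-weighted sum of evidence distributions across experts."""
    fused = defaultdict(float)
    for scores in score_lists.values():
        dist = evidence_distribution(scores)
        w = reliability(dist)
        for doc, p in dist.items():
            fused[doc] += w * p
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for one query: (retriever, chunk size) -> {chunk id: score}
score_lists = {
    ("dense", 128): {"c1": 9.0, "c2": 1.0, "c3": 1.0},   # peaked, low entropy
    ("sparse", 512): {"c1": 2.0, "c2": 2.1, "c3": 2.0},  # nearly uniform
}
ranking = fuse(score_lists)
```

In this toy case the peaked dense list dominates the fused ranking, while the nearly uniform sparse list contributes almost nothing because its normalised entropy is close to 1.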

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on Uncertainty-Aware Hybrid Retrieval for Long-Document RAG, addressing text chunking and evidence fusion. It lacks content related to Visual Encoders, World Models, MLLM, MultiModal data, or Model-based RL, resulting in 0 scores for these keywords. While it unifies retrievers (Unify Models) and treats them as experts (Agentic Reasoning), the core domain is text retrieval rather than the multimodal/RL focus of the keywords. Tokenizer relevance is low as it deals with document chunking rather than tokenization algorithms. Latent Reasoning is slightly relevant due to uncertainty distribution estimation. No listed expert authors are present in the author list.

Keywords

Uncertainty-Aware Hybrid Retrieval, Long-Document RAG, Chunk Granularity, Evidence Distribution, Dense and Sparse Retriever, Parent Promotion, Question Answering, Training-free

124. Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation [FAIL]

Score: 10.5 / 35.2

Authors: Beinan Xu, Andy Song, Jiti Gao, Feng Liu

Published: 2026-06-11

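TL;DR: This paper proposes Equilibrium State Estimation (ESE), which enables scalable simultaneous forecasting for multiple interacting systems and achieves a 10-70x speedup while maintaining accuracy.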

Abstract (Chinese translation)

我们提出均衡状态估计（ESE），这是一种用于同时预测的新范式，旨在处理多个相互作用的系统，这些系统需要独立但协调的预测。此类场景常出现在现实世界场景中，例如经济学和医疗保健建模。与现有逐个预测系统的方法不同，ESE 能够一次性预测所有系统。它首先估计系统间的均衡状态，然后基于当前状态与估计均衡状态之间的差异生成整体预测。在合成和真实世界数据集（包括汇率和 COVID-19 传播建模）上的广泛实验表明，ESE 至少与最先进（SOTA）方法一样准确，且显著更快。此外，ESE 能与传统预测器无缝集成，结合它们的准确性与自身的卓越效率，实现 10 至 70 倍的加速。凭借线性时间复杂度，随着系统数量的增加，ESE 的可扩展性远优于 SOTA 方法。此外，ESE 在各种扰动下仍能保持准确，从而确立其作为一种快速、泛化性强、鲁棒且可扩展的多预测方法。

Abstract

We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

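Scoring rationale: The paper addresses simultaneous forecasting for multiple systems, which belongs to the traditional time-series domain. Only Unify Models (a unified forecasting framework) and Latent Reasoning (equilibrium state estimation) are weakly related; the remaining keywords (Tokenizer, Visual Encoder, MLLM, MultiModal, model-based RL, Agentic Reasoning, World Models) are unrelated to the paper's content. The total score of 10.5 is far below the passing score of 35.2, indicating low relevance.

Keywords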

Equilibrium State Estimation, Simultaneous Forecasting, Scalable Prediction, Multiple Interacting Systems, Linear-time Complexity, Conventional Predictors, Robustness

125. Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis [FAIL]

Score: 10.5 / 35.2

Authors: Seongjin Kim, Sungil Kim

Published: 2026-06-11

TL;DR: This paper proposes a multi-field hybrid RAG framework that improves maritime accident root cause analysis by structuring incident cards and fusing sparse/dense retrieval to enhance LLM generation quality.

Abstract (Chinese translation)

海事事故裁决报告包含用于根本原因分析（RCA）的关键法庭裁决结果，然而从数十年的记录中检索相关先例并撰写一致的报告仍然是一项劳动密集型任务。本文提出了一种用于自动化海事根本原因分析（RCA）的多领域混合检索增强生成（RAG）框架，该框架利用了一个包含 13,329 份韩国海事安全法庭（KMST）报告（1971-2025）的综合数据集。我们将原始裁决转换为结构化的“事件卡片”知识库，索引三个不同的字段——摘要、原因和处置——以及一个层级化的 L1/L2 原因分类体系。我们的检索策略采用领域感知混合方法，通过互逆排名融合（RRF）技术融合稀疏排名与密集排名。鉴于缺乏大规模专家相关性标签，我们基于元数据推导的代理相关性得分，使用天花板归一化召回率和 nDCG 来评估检索性能。实验结果表明，我们提出的检索方法显著优于基线方法，将 NormRecall@100 从 0.18 提升至 0.55。此外，基于检索到的先例对生成器进行增强（相对于仅使用大语言模型的基线）提升了 RCA 生成质量，使 LLM-as-a-judge 评分从 3.34 提升至 3.72。这些发现表明，领域感知 RAG 能够通过实现更快的先例搜索和更一致、基于证据的 RCA 起草，显著简化海事安全调查工作流程。

Abstract

Maritime accident adjudication reports contain critical tribunal findings for root cause analysis (RCA), yet retrieving relevant precedents and drafting consistent reports from decades of records remains labor-intensive. This paper proposes a multi-field hybrid retrieval-augmented generation (RAG) framework for automated maritime RCA, utilizing a comprehensive dataset of 13,329 Korea Maritime Safety Tribunal (KMST) reports (1971-2025). We transform raw adjudications into a structured knowledge base of "incident cards", indexing three distinct fields (Summary, Causes, and Disposition) alongside a hierarchical L1/L2 cause taxonomy. Our retrieval strategy employs a field-aware hybrid approach, fusing sparse and dense rankings via Reciprocal Rank Fusion (RRF). Given the lack of large-scale expert relevance labels, we evaluate retrieval performance using ceiling-normalized recall and nDCG based on a metadata-derived proxy relevance score. Experimental results demonstrate that our proposed retrieval significantly outperforms baseline methods, improving NormRecall@100 from 0.18 to 0.55. Furthermore, grounding the generator on the retrieved precedents enhances RCA generation quality over an LLM-only baseline, increasing the LLM-as-a-judge score from 3.34 to 3.72. These findings suggest that field-aware RAG can substantially streamline maritime safety investigation workflows by enabling faster precedent search and more consistent, evidence-based RCA drafting.
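
Reciprocal Rank Fusion, which the abstract uses to combine sparse and dense rankings, is simple enough to sketch. The snippet below shows the standard RRF formula for a single field; the constant `k=60` is the commonly used default rather than a value stated in the abstract, the case identifiers are made up, and how the paper combines the three fields is not specified here, so that part is left out.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical rankings for a single field (for example, "Causes")
sparse = ["case_17", "case_03", "case_42"]
dense = ["case_03", "case_42", "case_99"]
fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
```

Because RRF only looks at ranks, it needs no score calibration between the sparse and dense retrievers; documents that appear in both lists rise above documents that top only one of them.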

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on text-based RAG for maritime safety, lacking multimodal components (Visual Encoder, MLLM, MultiModal = 0) and reinforcement learning (model-based RL = 0). It unifies retrieval and generation pipelines but not model architectures (Unify Models = 2). It uses standard tokenization (Tokenizer = 1) and explicit reasoning (Latent Reasoning = 1). The automated system shows slight agentic traits (Agentic Reasoning = 2). World Models refer to environment modeling, not legal knowledge bases (World Models = 1). Total weighted score: 10.5, below the dynamic passing score of 35.2. No expert authors from the specified list were found.
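
The total quoted above follows from the table as a weighted sum. The sketch below reproduces that arithmetic for this entry; the `>=` comparison against the passing score and the absence of any author bonus are assumptions about how the report decides PASS or FAIL.

```python
PASS_THRESHOLD = 35.2  # dynamic passing score quoted in the report

# (keyword, weight, relevance out of 10) copied from the table above
rows = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 1.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
    ("Latent Reasoning", 1.5, 1.0),
    ("Agentic Reasoning", 1.5, 2.0),
]

def weighted_total(rows):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * relevance for _, weight, relevance in rows)

total = weighted_total(rows)
status = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {status}")
```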

Keywords

Multi-Field Hybrid Retrieval-Augmented Generation, Maritime Accident Root Cause Analysis, Incident Cards, Reciprocal Rank Fusion, LLM-as-a-judge, Knowledge Base, Sparse and Dense Rankings

126. Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation [FAIL]

Score: 10.5 / 35.2

Authors: Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

Published: 2026-06-11

TL;DR: This study proposes a cascade classification framework for dermoscopic images to enhance sensitivity control and clinical generalization, demonstrating that tunable triage thresholds outperform standard single-stage classification despite remaining generalization gaps.

Abstract (Chinese translation)

目的：比较皮肤肿瘤皮肤镜图像的深度学习架构与分类方案，并评估其在从开放国际数据集迁移至俄罗斯独立临床数据集时的泛化性能。方法：比较了四种架构（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）在三种方案下的表现：二分类（恶性/良性）、单阶段四分类（良性、恶性黑色素瘤 MEL、鳞状细胞癌 SCC、基底细胞癌 BCC）以及两阶段级联（二分类分诊，随后进行 MEL/SCC/BCC 三分类区分）。所有模型均采用 ImageNet 预训练权重，并在聚合的开放 ISIC Archive 数据集上使用单一增强协议进行训练，随后在内部保留样本及两个临床数据集（Melanoscope AI 移动系统；Sechenov 大学）上进行评估。结果：在内部评估集上，二分类阶段的 ROC-AUC 达到 0.952-0.966；而在 Sechenov 大学数据集上，该值下降至 0.797-0.893，敏感性降至 0.53-0.67，ECE（期望校准误差）从 0.02 上升至 0.27-0.39，伴随恶性程度低估，量化了排序与校准方面的泛化差距。配对测试确认了临床数据上的一种架构间差异：ViT-B/16 在二分类阶段存在显著劣势（p<0.05）；而在区分阶段，各架构间无显著优势差异。级联方案相较于单阶段四分类，提高了大多数架构的宏观 F1 分数，但仅在 ViT-B/16 上具有统计学显著性，其机制在于找回了被错误归入主导良性类的恶性病变。在 ISIC MILK10k 数据集上，直接进行 11 类分类的平均类别敏感性为 0.525。结论：可调的分诊阈值提供了标准单阶段（argmax）分类无法实现的敏感性控制，并更好地复现了临床鉴别诊断逻辑。持续的泛化差距要求在部署前必须进行外部临床验证和重新校准。

Abstract

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
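
The two-stage cascade with a tunable triage threshold can be illustrated with a few lines of code. This is a sketch under stated assumptions: the function name `cascade_predict`, the probabilities, and the two thresholds are hypothetical and only show why a threshold gives sensitivity control that a plain argmax does not.

```python
def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Two-stage cascade: binary triage, then MEL/SCC/BCC differentiation.

    p_malignant: stage-1 probability that the lesion is malignant.
    subtype_probs: stage-2 probabilities over the malignant subtypes.
    triage_threshold: operating point; lowering it raises sensitivity.
    """
    if p_malignant < triage_threshold:
        return "benign"
    return max(subtype_probs, key=subtype_probs.get)

# Hypothetical model outputs for one lesion
p_malignant = 0.35
subtype_probs = {"MEL": 0.6, "SCC": 0.1, "BCC": 0.3}

default_call = cascade_predict(p_malignant, subtype_probs)          # threshold 0.5
sensitive_call = cascade_predict(p_malignant, subtype_probs, 0.2)   # threshold 0.2
print(default_call, sensitive_call)
```

With the default threshold the lesion is triaged as benign; lowering the threshold to 0.2 sends it to the second stage, trading specificity for sensitivity.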

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on medical image classification for skin neoplasms using standard CNN/Transformer architectures (ViT, Swin, etc.) and cascade schemes. It does not involve tokenization, world models, reinforcement learning, or multimodal language modeling, resulting in low relevance scores for most keywords. 'Visual Encoder' has moderate relevance due to the use of vision backbones. No expert authors from the specified list are present. The total weighted score is 10.5, which is below the dynamic passing threshold of 35.2.

Keywords

Dermoscopic Images, Skin Neoplasms, Cascade Classification, Clinical Validation, Sensitivity Control, Generalization Gap, Deep Learning Architectures

127. Accelerating Speculative Diffusions via Block Verification [FAIL]

Score: 10.5 / 35.2

Authors: Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet

Published: 2026-06-11

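TL;DR: This paper tackles inference acceleration for diffusion models, proposing a speculative sampling scheme based on block verification that raises the acceptance rate and speeds up inference without additional training.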

Abstract (Chinese translation)

推测解码 (Speculative Decoding) 通过使用草稿模型 (Draft Model) 生成标记 (Tokens) 来加速 LLM (大语言模型) 推理，并采用接受 - 拒绝方案 (Acceptance-Rejection Scheme) 以确保输出匹配目标分布 (Target Distribution)。将这一方法应用于连续扩散模型 (Continuous Diffusions) 是困难的，因为推测采样 (Speculative Sampling) 需要从残差分布 (Residual Distribution) 中采样。虽然在离散空间 (Discrete Spaces) 中很简单，但在连续空间 (Continuous Space) 中高效采样这一残差分布并非易事。因此，现有的扩散模型适配 (Diffusion Adaptations) 要么使用计算效率低下的采样技术，要么依赖于替代方案。在这项工作中，我们引入了一种新颖的方案，能够高效地实现扩散模型的原始推测采样机制 (Speculative Sampling Mechanism)。我们的方法相较于当前方法具有关键优势：它使我们能够将块验证 (Block Verification) 从 LLM 适配到扩散模型——这被证明能提高草稿 (Drafts) 的接受率 (Acceptance Rate)。此外，我们形式化并分析了 Free Drafter (自由草稿生成器)，这是一种无需训练的扩散模型启发式自推测草稿生成器 (Heuristic Self-Speculative Drafter)。通过启用块验证，我们的 Free Drafter 相较于现有推测方法可实现高达 6.3% 的加速 (Speedup)，无需额外训练，且除了现有的并行验证步骤 (Parallel Verification Pass) 外开销可忽略不计。

Abstract

Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
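
For context, the discrete acceptance-rejection scheme that the abstract calls "straightforward" looks roughly like this. The sketch covers only the standard token-level case (accept a draft with probability min(1, p/q), otherwise resample from the normalised residual max(0, p - q)); it does not attempt the paper's continuous-space scheme or block verification, and the distributions are invented for illustration.

```python
import random

def speculative_sample(p, q, rng):
    """One step of discrete speculative sampling.

    p: target probabilities, q: draft probabilities (lists over the same tokens).
    Returns a token index distributed according to p.
    """
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]             # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):          # accept the draft
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0]   # reject: resample

p = [0.6, 0.3, 0.1]   # hypothetical target distribution
q = [0.3, 0.3, 0.4]   # hypothetical draft distribution
rng = random.Random(0)
n = 100_000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
freqs = [c / n for c in counts]   # empirical frequencies, close to p
```

The residual is easy to sample here because it is a finite list of weights; in continuous space the same residual is an unnormalised density, which is the difficulty the abstract refers to.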

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

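Scoring rationale: The paper's core contribution is inference acceleration for diffusion models, adapting speculative decoding from discrete to continuous spaces through a block verification mechanism. This has no direct connection to MultiModal, Visual Encoder, model-based RL, or Agentic Reasoning, which are therefore scored 0. Although the paper borrows the idea of speculative decoding from large language models (scored under MLLM) and involves continuous-space sampling (in contrast to the Tokenizer concept), it does not address multimodal unification or the construction of world models, so Unify Models and World Models receive low scores (2.0). The author list does not include the specified experts (Yang Shi and others), so no bonus is applied. The weighted total is 10.5, far below the dynamic passing score of 35.2, indicating weak relevance to the given research topics.

Keywords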

Speculative Decoding, Diffusion Models, Block Verification, Inference Speedup, Continuous Sampling, Free Drafter, Acceptance Rate

128. The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics [FAIL]

Score: 10.5 / 35.2

Authors: Ryosuke Sakamoto, Kotaro Sakamoto

Published: 2026-06-11

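TL;DR: This paper uses projection caustics to explain phase-transition-like behaviour in generative dynamics, and introduces the Critical Boundary Detector to identify sensitive regions where the score direction is unstable.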

Abstract (Chinese translation)

连续状态生成采样器，包括扩散（diffusion）和流匹配（flow-matching）模型，通过连续反向时间动力学演化，然而其样本往往经历突然的定性变化：轨迹趋向于特定模态，语义替代方案坍缩，且在狭窄时间窗口内的微小扰动可能产生巨大的下游效应。本文提出了针对此类类似相变行为的几何解释。我们将去噪视为自由能景观上的梯度下降，并证明尖锐转变发生在投影焦散（projection caustics）附近，此时到数据支撑集的最近点投影不再唯一。受此视角启发，我们引入了临界边界检测器（Critical Boundary Detector, CBD），作为诊断得分方向不稳定性（score-direction instability）的实用诊断工具。在玩具模型、标准扩散模型以及潜在文本到图像扩散模型中，CBD 能够定位模态承诺，预测干预敏感窗口，并在几何敏感区域支持靶向控制。我们的结果连接了数据的几何结构与扩散生成的动力学。

Abstract

Continuous-state generative samplers, including diffusion and flow-matching models, evolve through continuous reverse-time dynamics, yet their samples often undergo abrupt qualitative changes: trajectories commit to modes, semantic alternatives collapse, and small perturbations in narrow time windows can produce large downstream effects. This paper develops a geometric account of such phase-transition-like behaviour. We view denoising as gradient descent on a free energy landscape and show that sharp transitions arise near projection caustics, where the nearest-point projection onto the data support ceases to be unique. Motivated by this perspective, we introduce the Critical Boundary Detector (CBD) as a practical diagnostic for score-direction instability. Across toy models, standard diffusion models, and latent text-to-image diffusion models, CBD localises mode commitment, predicts intervention-sensitive windows, and supports targeted control in geometrically sensitive regions. Our results connect the geometry of data with the dynamics of diffusion generation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

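Scoring rationale: The paper studies a geometric theory of phase transitions in generative models such as diffusion models, involving free energy landscapes and projection caustics. It has little connection to the multimodal large model, reinforcement learning, and agentic reasoning topics in the keyword list; only its latent-space dynamics and text-to-image diffusion experiments provide a weak link.

Keywords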

Generative Dynamics, Phase Transitions, Projection Caustics, Diffusion Models, Critical Boundary Detector, Score-Direction Instability, Free Energy Landscape

129. A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning [FAIL]

Score: 10.5 / 35.2

Authors: Ioannis Kouroudis, Simon Ternes, Zhaosu Gu, Gohar Ali Siddiqui, Marina Ustinova, Angelo Lembo, Alessio Gagliardi, Aldo Di Carlo

Published: 2026-06-11

TL;DR: This paper proposes a transformer-based transfer learning tool with uncertainty quantification to predict solvent properties for green solvent screening in materials science, achieving high performance even with limited data.

Abstract (Chinese translation)

准确预测溶解度（solubility）仍然是材料科学和可持续化学领域的核心挑战。特别是由于有机和混合光伏（organic and hybrid photovoltaics）、电池以及催化等新兴技术的发展，预计未来几年溶剂的使用量将显著增加。因此，用更绿色的替代品替代溶剂至关重要。这正是机器学习（machine learning）可以产生重大影响的地方。然而，关于溶解度关键参数的有限数据显著限制了机器学习的效能。在这项工作中，我们将基于 QM9 目标的预训练基础模型迁移至我们的应用中，且对数据需求极低。此外，该流程集成了不确定性量化（uncertainty quantification），使用户能够评估预测的置信度。作为基线，我们成功预测了汉森溶解度参数（Hansen solubility parameters）和介电常数（Dielectric Constant），这两者拥有广泛的数据库。重要的是，我们在其他目标上实现了高模型性能，例如古特曼给体数和受体数（Gutmann Donor and Acceptor numbers），而这些数据极其有限。总体而言，我们通过高质量的预测将溶解度描述符（solubility descriptors）的数据量扩大了数个数量级。为了有效传播，我们部署了一个易于使用、易于与高通量实验室（high throughput labs）集成且可定制的用于排名和筛选潜在溶剂替代品的工具。最后，我们重新发现了已知的绿色溶剂替代品并提出了新的候选者，证明了其在寻找环保溶剂（eco-friendly solvents）方面的相关性。

Abstract

Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies like organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly within the coming years. Therefore, substituting solvents with greener alternatives is vital. This is where machine learning can have substantial impact. However, the limited data on critical parameters of solubility significantly constrains machine learning efficacy. In this work, we transfer a foundational model pre-trained on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and Dielectric Constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann Donor and Acceptor numbers, where the available data is extremely limited. Overall, we augment data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool, easily integrable with high-throughput labs, for ranking and screening possible solvent substitutes. Finally, we rediscover known green solvent alternatives and propose new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on computational chemistry (solvent screening) using transfer learning and uncertainty quantification. The provided keywords primarily relate to multimodal large language models, reinforcement learning, and world models (AI/RL domain). There is minimal overlap; the paper uses transformers (Tokenizer) and foundational models (Unify Models/Latent) but lacks visual encoders, MLLM architectures, RL components, or agentic reasoning.

Keywords

Green solvent screening, Uncertainty aware, Transformer enhanced, Transfer learning, Solubility prediction, Materials science, QM9, Hansen solubility parameters

130. HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue [FAIL]

Score: 9.8 / 35.2

Authors: Sangwon Youn, Yoonjin Jang, Youngjoong Ko

Published: 2026-06-11

TL;DR: HyPE introduces a hypergraph-based persona encoding method that captures high-order attribute relations to enhance dialogue consistency across diverse language model backbones.

Abstract (Chinese translation)

基于人设的对话系统旨在生成与人设一致的响应，然而现有方法将人设视为扁平的句子集合，未能建模人设属性之间的高阶关系——例如，多个句子共享同一主题类别。我们提出 HyPE（超图人设编码器），该框架（i）将每个包含人设的文本分析为 (Core, Expression, Sentiment, Category) 四元组，（ii）将人设元素组织成超图，其超边由共享类别标签诱导。HyperGCN 超图神经网络将此结构传播为人设摘要向量和软记忆库，以调节响应生成器。我们进一步提出持久边嵌入（PEE），这是一种轻量级的每类别可学习先验，融合到 HyperGCN 的消息传递步骤中。在 PersonaChat 数据集上使用贪婪解码时，HyPE 在 GPT-2、LLaMA-3.2-3B 和 Qwen2.5-3B 骨干模型上始终优于句子级池化基线，这表明结构化超边级人设编码在不同模型规模下提供了可迁移的优势。

Abstract

Persona-grounded dialogue systems aim to produce responses consistent with a speaker's persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, for example that several persona sentences share a topical category. We propose HyPE (Hypergraph Persona Encoder), a framework that (i) analyzes each persona-bearing text as a (Core, Expression, Sentiment, Category) quadruple, and (ii) organizes persona elements into a hypergraph whose hyperedges are induced by shared category labels. A HyperGCN hypergraph neural network propagates this structure into a persona summary vector and a soft-memory bank that condition the response generator. We further propose Persistent Edge Embeddings (PEE), lightweight per-category learnable priors fused into the HyperGCN message-passing step. On PersonaChat under greedy decoding, HyPE consistently outperforms sentence-level pooling baselines across GPT-2, LLaMA-3.2-3B, and Qwen2.5-3B backbones, demonstrating that structured hyperedge-level persona encoding provides a transferable advantage across model scales.
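
The idea of hyperedges induced by shared category labels can be made concrete with a small sketch. Everything below is illustrative: the persona quadruples and embeddings are invented, and `propagate` is a plain mean-aggregation step standing in for message passing, not the HyperGCN layer or the PEE mechanism from the paper.

```python
from collections import defaultdict

# Hypothetical persona elements as (core, expression, sentiment, category)
persona = [
    ("dog", "I have two dogs", "positive", "pets"),
    ("cat", "My cat sleeps all day", "neutral", "pets"),
    ("hiking", "I hike on weekends", "positive", "hobbies"),
]

def build_hyperedges(elements):
    """Group element indices by category label; each group is one hyperedge."""
    edges = defaultdict(list)
    for idx, (_, _, _, category) in enumerate(elements):
        edges[category].append(idx)
    return dict(edges)

def propagate(features, hyperedges):
    """One mean-aggregation step: node -> hyperedge -> node (not HyperGCN)."""
    dim = len(features[0])
    out = [[0.0] * dim for _ in features]
    for members in hyperedges.values():
        edge_mean = [sum(features[i][d] for i in members) / len(members)
                     for d in range(dim)]
        for i in members:
            out[i] = list(edge_mean)  # each node has exactly one category here
    return out

hyperedges = build_hyperedges(persona)
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # toy node embeddings
smoothed = propagate(features, hyperedges)
```

After one step the two "pets" elements share the same averaged embedding, which is the kind of category-level coupling a flat set of sentences cannot express.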

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.5/10	3.8
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on text-based dialogue systems using hypergraph encoding, showing minimal alignment with multimodal, world model, or reinforcement learning keywords. Minor relevance exists for latent representations (HyperGCN embeddings) and multi-backbone evaluation, but core topics differ significantly.

Keywords

Persona-grounded dialogue, Hypergraph encoding, Category-aware, HyperGCN, Persistent Edge Embeddings, Response generation, Text representation, Multi-backbone evaluation

131. Valid Inference with Synthetic Data via Task Exchangeability [FAIL]

Score: 9.0 / 35.2

Authors: Lezhi Tan, Tijana Zrnic

Published: 2026-06-11

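TL;DR: This paper proposes statistical principles based on task exchangeability that provide provable validity guarantees for the use of synthetic data in science.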

Abstract (Chinese translation)

大量研究主张在科学研究中使用合成数据。例如，社会科学家主张在试点研究中使用大语言模型（LLM）生成的“硅样本”（silicon samples）；AI 评估日益依赖"LLM-as-a-judge"输出；蛋白质组学研究因生成模型产生合成蛋白质结构而加速。这些发展提出了一个引人入胜的可能性：合成数据可能帮助研究人员提出更多问题、开展更多研究并加速发现。但它们也引发了一个根本性担忧：合成数据可能存在偏差、噪声以及设定错误。本文提出了在科学研究中使用合成数据的统计原则，并提供可证明的有效性保证。关键洞察是一种新的技术条件，我们称之为任务可交换性（task exchangeability）。通俗而言，这意味着研究人员需能够识别出拥有真实数据的历史任务，且其当前感兴趣的任务在适当的数学意义上与这些历史任务具有可交换性。我们在任务可交换性下开发了有效推断的方法，以及即使在可交换性之外也能提供保证的扩展。我们在带有“硅样本”的民意调查和带有自动评分器（autoraters）的 AI 评估中展示了该框架。

Abstract

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

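Scoring rationale: The paper centers on statistical inference and validity guarantees for synthetic data (task exchangeability), which falls under statistics and applications of generative models. The keyword list emphasizes multimodal large-model architecture (e.g., Tokenizer, Visual Encoder), reinforcement learning (model-based RL), and World Models, which are a poor match for this paper, so relevance scores are low across the board. MLLM receives a small score only because the paper mentions LLM-generated samples.

Keywords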

Synthetic Data, Task Exchangeability, Statistical Inference, Validity Guarantees, Generative Models, Scientific Research, AI Evaluation

132. Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models [FAIL]

Score: 9.0 / 35.2

Authors: Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti

Published: 2026-06-11

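TL;DR: This paper studies the causal structure of chain-of-thought reasoning in large language models, identifies a "commitment boundary" at which the answer stabilizes, and exploits that boundary for early exit, substantially shortening reasoning with negligible impact on performance.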

Abstract (Chinese translation)

思维链 (CoT) 推理是语言模型推理时扩展的主导范式，但单个步骤对最终答案的因果影响尚不为人所熟知。我们通过 early exit（早期退出）估计每个步骤的因果重要性，并利用该度量来研究答案如何在多种模型族的推理轨迹中形成。在各类任务中，我们发现推理通常跨越一个 commitment boundary（承诺边界）——即从瞬时的中间猜测到稳定、高置信度答案的急剧转变。这种转变通常发生在单个步骤中，远早于模型 reasoning block（推理块）的结束，随后是 epiphenomenal（伴随现象的）CoT 步骤，这些步骤不会改变最终答案的概率。利用 attention probes（注意力探针），我们表明答案形成阶段可以从中间推理步骤中以高准确性进行线性解码，并能稳健地泛化到未见过的推理任务。我们利用这一信号在 commitment boundary（承诺边界）处对 reasoning blocks（推理块）进行 early-exit（早期退出），平均可将 CoT 长度减少 55%，且对模型性能影响微乎其微。

Abstract

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer is poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary: a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing the length of CoTs by up to 55% on average with negligible impact on model performance.
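
One simple way to picture a commitment boundary is as the first reasoning step after which the final answer's early-exit probability stays high. The sketch below implements that reading; the 0.9 threshold, the function name, and the probability trace are assumptions, and it uses oracle per-step probabilities rather than the attention probes the paper trains to predict the boundary.

```python
def commitment_boundary(answer_probs, threshold=0.9):
    """Return the first step index from which the final answer's probability
    stays at or above `threshold` for every remaining step, or None."""
    boundary = None
    for step, prob in enumerate(answer_probs):
        if prob >= threshold:
            if boundary is None:
                boundary = step
        else:
            boundary = None   # confidence dropped again: not yet committed
    return boundary

# Hypothetical early-exit probabilities of the final answer after each step
answer_probs = [0.10, 0.25, 0.15, 0.95, 0.97, 0.96, 0.98, 0.97]
boundary = commitment_boundary(answer_probs)
saved = 1 - (boundary + 1) / len(answer_probs)   # fraction of steps skipped
print(boundary, saved)
```

In this toy trace the answer is committed at step 3, so exiting there would skip half of the reasoning steps.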

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	2.0/10	3.0

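Scoring rationale: This paper examines the causal structure and efficiency of chain-of-thought (CoT) reasoning in large language models, identifying a "commitment boundary" that enables early exit. The keywords mainly concern multimodality, world models, and reinforcement learning (e.g., Visual Encoder, World Models, model-based RL), which are largely unrelated to this text-only reasoning study. Although the paper involves reasoning, Latent Reasoning and Agentic Reasoning are not its core topics (it focuses on token-level causal analysis rather than latent-space planning or autonomous agents). Most keywords therefore score 0 or 1, and only the reasoning-related keywords receive a few points.

Keywords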

Chain-of-Thought, Commitment Boundary, Early Exit, Attention Probes, Answer Formation, Epiphenomenal Steps, Large Reasoning Models

133. EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution [FAIL]

Score: 9.0 / 35.2

Authors: Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

Published: 2026-06-11

TL;DR: EvTexture++ proposes an event-driven framework for video super-resolution that leverages high-frequency spatiotemporal details from event cameras to enhance texture recovery and temporal consistency, achieving state-of-the-art performance.

Abstract (Chinese translation)

事件视觉因其独特的特性而受到越来越多的关注，包括超高时间分辨率和极高动态范围。近期工作已将其引入视频超分辨率（VSR），以增强光流估计和时间对齐。相比之下，本文将事件信号的关注点从运动细化转向了 VSR 中的纹理增强。我们提出了 EvTexture++，这是首个专门致力于 VSR 中纹理增强的事件驱动框架。该方法利用事件中的高频时空细节来提升纹理恢复效果。EvTexture++ 包含一个定制的纹理增强分支，以及一个迭代纹理增强模块，该模块逐步利用高时间分辨率的事件信息来进行纹理恢复。这使得纹理区域能够在迭代过程中逐渐细化，从而生成更准确且细节丰富的高分辨率输出。除了帧内纹理恢复外，大运动可能会降低帧间时间一致性，尤其是在纹理区域，从而导致纹理闪烁现象。为此，我们进一步利用事件的连续时间运动线索来增强时间一致性，引入一个时间纹理对齐模块，该模块估计事件引导的纹理感知光流，以实现精确的帧间纹理对齐。此外，EvTexture++ 被设计为一种即插即用工具，可灵活地提升现有 VSR 模型的性能。在五个数据集上的实验表明，EvTexture++ 达到了最先进的性能。当集成到近期 VSR 模型时，它能带来显著提升，在纹理丰富的 Vid4 数据集上，PSNR 增益高达 1.55 dB。代码：https://github.com/DachunKai/EvTexture.

Abstract

Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Event-based Video Super-Resolution (VSR) using event cameras for texture enhancement. The provided keywords are primarily oriented towards Large Language Models and Reinforcement Learning (e.g., Tokenizer, MLLM, model-based RL), creating a significant domain mismatch. Consequently, most keywords receive a score of 0. 'MultiModal' receives a moderate score (3) as the method utilizes both frame and event data streams, and 'Visual Encoder' receives a low score (2) as it involves neural encoding but not in the context of MLLM architectures. 'Unify Models' receives a low score (1) as the method is a plug-in module rather than a unified model architecture. The calculated weighted total score is approximately 9.0, which is well below the dynamic passing threshold of 35.2, indicating low relevance to the provided keyword set.
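The arithmetic behind this verdict can be sketched as follows. The sketch assumes, as the table rows suggest, that each keyword's score is weight × relevance and that the total is their sum, compared against the passing threshold. The function name and the use of `>=` for passing are assumptions, not the report generator's actual code.

```python
# Illustrative sketch: names are hypothetical, not the report tool's API.
# Assumes score = weight * relevance per keyword, summed into the total.

def weighted_total(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

# Non-zero rows from the EvTexture++ table; all other keywords score 0.
rows = [
    ("Unify Models", 1.5, 1.0),
    ("Visual Encoder", 1.5, 2.0),
    ("MultiModal", 1.5, 3.0),
]

PASS_THRESHOLD = 35.2  # dynamic passing threshold stated in the report

total = weighted_total(rows)
# Whether the real tool uses >= or > is not stated; >= is assumed here.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)
```

With these rows the total is 1.5 + 3.0 + 4.5 = 9.0, which matches the "Score: 9.0 / 35.2" line for this entry.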

Keywords

Event-based vision, Video Super-Resolution, Texture Enhancement, Event-driven framework, Temporal consistency, High-resolution outputs, Flow estimation

134. Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer [FAIL]

Score: 9.0 / 35.2

Authors: Zhanglu Yan, Jiayi Mao, Kaiwen Tang, Fanfan Li, Gang Pan, Tao Luo, Bowen Zhu, Qianhui Liu, Weng-Fai Wong

Published: 2026-06-11

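TL;DR: This paper proposes an optical spiking Transformer that uses the natural decay of optoelectronic synapses to implement time-to-first-spike coding, achieving energy-efficient and competitive NLP inference on the GLUE dataset.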

Abstract (Chinese translation)

脉冲神经网络（SNNs）在节能推理方面颇具前景，而首次脉冲时间编码（TTFS）尤为引人注目，因为每个神经元最多仅发射一次脉冲。然而在实际应用中，这种优势往往因计算时间衰减项并将其乘以突触权重的开销而被削弱。我们通过将物理硬件的“缺陷”——光电器件中的自然信号衰减——转化为 TTFS 的主要计算来解决这一问题，该方法名为 Otters++。具体而言，我们利用定制的 In₂O₃ 光电器件突触的测量衰减直接实现 TTFS 的时间项，从而无需进行显式的数字衰减计算。为了将该思想扩展至 Transformer 模型，我们在 Otters++ 与量化神经网络（QNN）之间建立了层功能等价性，并开发了一种混合训练方法：在前向传播中使用器件保真的 SNN 计算，在反向传播中通过等效 QNN 路径使用 QNN 直通梯度，并结合模型蒸馏。这避免了通过离散首次脉冲事件进行求导，并缓解了直接 TTFS-SNN 训练中的过度稀疏问题。此外，我们通过采样运行间变异使训练过程感知到测量的器件噪声，并通过考虑器件共享和多跳通信来细化系统级能耗模型。在 GLUE 数据集上，Otters++ 将平均得分提升至 84.17%，同时相较于先前的脉冲 Transformer 基线保持了显著的能耗优势。这些结果表明，基于物理机制的 TTFS 计算在真实硬件效应下可以实现高效、可训练且稳健的性能。

Abstract

Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive because each neuron fires at most once. In practice, however, this benefit is often reduced by the cost of computing a temporal decay term and multiplying it by the synaptic weight. We address this issue with Otters++, which turns a physical hardware "bug," the natural signal decay in optoelectronic devices, into the main computation of TTFS. Specifically, we use the measured decay of a custom In₂O₃ optoelectronic synapse to directly realize the TTFS temporal term, removing the need for explicit digital decay computation. To scale this idea to Transformer models, we establish a layer-wise functional equivalence between Otters++ and a quantized neural network (QNN), and develop a hybrid training method that uses device-faithful SNN computation in the forward pass and QNN straight-through gradients through the equivalent QNN path in the backward pass, together with model distillation. This avoids differentiation through discrete first-spike events and reduces the over-sparsity problem in direct TTFS-SNN training. We further make training aware of measured device noise by sampling run-to-run variation, and refine the system-level energy model by accounting for device sharing and multi-hop communication. On the GLUE dataset, Otters++ improves the average score to 84.17% while maintaining a clear energy advantage over prior spiking Transformer baselines. These results show that physically grounded TTFS computing can be efficient, trainable, and robust under realistic hardware effects.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

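Scoring rationale: The paper focuses on an energy-efficient implementation of spiking neural networks (SNNs) using optoelectronic synapses, including hardware-aware training. The keyword set centers on multimodality, world models, and reinforcement learning, which have little connection to this paper's NLP and hardware topics. Only weak links exist: the Transformer architecture implies a connection to Tokenizer, the temporal state of SNNs implies Latent Reasoning, and the unification of hardware and computation logic implies Unify Models. The remaining keywords are entirely unrelated.

Keywords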

Spiking Neural Networks, Optical Computing, Time-to-First-Spike, Energy Efficiency, Transformer Architecture, Hardware-Aware Training, GLUE Dataset

135. Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation [FAIL]

Score: 9.0 / 35.2

Authors: En-Ming Huang, Yu-Hung Kao, Ren-Hao Deng, Wei-Po Hsin, Yao-Ting Hsieh, Cheng Liang, Hsiang-Yu Tsou, Mu-Chi Chen, Yu-Kai Hung, Shao-Chun Ho, Po-Hsuang Huang, Shih-Hao Hung, H. T. Kung

Published: 2026-06-11

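TL;DR: This paper proposes a structured testbench generation framework that accelerates LLM-driven hardware design verification and data curation, achieving substantial speedups and efficiency gains over iterative LLM-based approaches.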

Abstract (Chinese translation)

自动化测试平台生成已成为大型语言模型（LLM）驱动的寄存器传输级（RTL）工作流中的关键瓶颈，在此过程中需要快速且可靠地验证大量候选设计。现有的基于提示的方法将测试平台生成视为无约束代码合成，导致产生随机输出，具有高令牌成本、低可复现性及覆盖率不足的问题。为了解决这一差距，我们提出了 STG（Structured Testbench Generation，结构化测试平台生成）框架，该框架利用硬件设计的内在结构来生成确定性测试平台。作为直接验证工具，STG 的运行速度比基于 LLM 的迭代测试平台生成流程快 720 倍，成功编译率更高，覆盖率更高，并减少了在错误待测设备（DUT）上的错误通过判定。STG 还通过揭示有缺陷的基准测试平台，帮助识别 RTL 生成基准中的错误。作为数据策展引擎，它在单个 CPU 核心上的运行速度比基于 LLM 的过滤方法快 11 倍，能耗降低 127 倍，且由此产生的蒸馏模型在多基准评估中提供了最先进的性能。作为测试时缩放预言机，它将节点数减少了 14% 至 47%。我们的模型可在 https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12 获取。

Abstract

Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflows, where large numbers of candidate designs must be verified rapidly and reliably. Existing prompt-based approaches treat testbench generation as unconstrained code synthesis, yielding stochastic outputs with high token cost, low reproducibility, and insufficient coverage. To address this gap, we present STG, a Structured Testbench Generation framework that exploits the inherent structure of hardware designs to generate deterministic testbenches. As a direct verification tool, STG runs 720x faster than an iterative LLM-based testbench generation flow, attains a higher rate of successful compilation, achieves higher coverage, and reduces false-pass verdicts on incorrect DUTs. STG also helps identify errors in RTL generation benchmarks by exposing faulty benchmark testbenches. As a data curation engine, it is 11x faster than LLM-based filtering on a single CPU core with 127x less energy, and the resulting distilled models provide state-of-the-art performance in our multi-benchmark evaluation. As a test-time scaling oracle, it reduces node count by 14-47%. Our models are available at https://huggingface.co/collections/AS-SiliconMind/siliconmind-v12.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	1.0/10	1.5

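Scoring rationale: The paper focuses on EDA and LLM-driven HDL verification, whereas the keywords target multimodality, world models, and reinforcement learning. Only Tokenizer and MLLM are marginally relevant because of the use of LLMs; the other keywords are unrelated to the hardware verification context. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points are added.

Keywords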

Structured Testbench Generation, LLM-Driven HDL Design, Verification-Oriented Data Curation, Deterministic Testbenches, RTL Workflows, Model Distillation, Hardware Verification

136. The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning [FAIL]

Score: 9.0 / 35.2

Authors: Ayushman Trivedi, Bhavika Melwani

Published: 2026-06-11

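TL;DR: This paper proposes the Stable Recovery Manifold hypothesis, using geometric analysis to show that catastrophic forgetting in continual learning is primarily an accessibility problem rather than information loss, and that forgotten knowledge remains recoverable from the latent space.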

Abstract (Chinese translation)

灾难性遗忘通常被视为在顺序学习过程中先前习得知识的破坏。基于可访问性坍缩（Accessibility Collapse）框架，我们探究持续学习中可恢复性的几何结构。利用 Split CIFAR-100 数据集和顺序训练的 ResNet-18 模型，我们分析了十个任务上的可恢复性、表征漂移及恢复复杂度。我们引入恢复子空间维度（Recovery Subspace Dimensionality, k_t），该指标衡量了保留 90% 全探针性能所需的最小奇异方向数量。与我们的可恢复性扩散（Recoverability Diffusion）假设相反，尽管存在显著的表征漂移，恢复维度在整个训练过程中仍保持稳定（平均 k_t = 8.0）。主角度漂移（Principal-angle drift）强烈预测可恢复性（r = -0.862），一个简单的几何模型解释了 82.2% 的可恢复性方差。这些发现支持稳定恢复流形（Stable Recovery Manifold）假设，表明尽管发生了表征重组，遗忘的知识仍保持紧凑可解码。结果表明，灾难性遗忘主要是一个可访问性和流形对齐问题，而非信息的破坏。

Abstract

Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

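Scoring rationale: The paper focuses on catastrophic forgetting in continual learning and its geometric recovery properties, which is a poor match for the provided keyword areas of multimodal large models, world models, and reinforcement learning. 'Visual Encoder' is slightly relevant only because the experiments use ResNet-18 as a visual backbone; 'Latent Reasoning' and 'Unify Models' are weakly relevant because the paper analyzes latent-space dimensionality. The remaining keywords, such as Tokenizer, MLLM, MultiModal, and RL, have no direct connection. The author list does not include any of the designated experts. The weighted total is 9.0, below the dynamic passing threshold of 35.2.

Keywords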

Continual Learning, Catastrophic Forgetting, Geometric Principles, Recoverability, Stable Recovery Manifold, Representational Drift, Accessibility Collapse, ResNet-18

137. GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving [FAIL]

Score: 9.0 / 35.2

Authors: Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo

Published: 2026-06-11

TL;DR: GF-DiT enhances Diffusion Transformer serving performance by dynamically scheduling GPU parallelism, achieving up to 6x throughput improvement and 95% latency reduction.

Abstract (Chinese translation)

扩散变换器（DiTs）已成为图像和视频生成的主导架构，对高效 DiT 服务的需求日益增长。现有系统在整个请求生命周期内为每个请求分配固定的并行配置。然而，DiT 工作负载在请求、执行阶段和系统条件方面表现出显著的异构性，这使得静态并行效率低下，通常导致 GPU 利用率低下和服务质量下降。本文认为，DiT 服务应将 GPU 并行性视为第一类可调度资源。我们提出了 GF-DiT，这是一种用于弹性 DiT 服务的策略可编程运行时，可根据工作负载需求和服务目标动态调整正在运行的请求的并行性。GF-DiT 引入了异步执行抽象，将请求分解为可独立调度的轨迹任务，并支持在线 GPU 重新分配。为了使弹性并行性实用化，GF-DiT 进一步提出了无组集体操作（group-free collectives），这是一种轻量级通信抽象，支持任意执行组的低开销在线形成和重新配置。我们在 vLLM-Omni 中实现了 GF-DiT，并在代表性的图像和视频扩散工作负载上对其进行了评估。与具有静态并行性的固定流水线执行相比，GF-DiT 将吞吐量提高了高达 6.01 倍，将平均延迟降低了高达 95%，将 SLO（服务等级目标）违规率降低了高达 90%，并将通信组设置开销从 778 毫秒降低到约 60 微秒。

Abstract

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 μs.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on system-level optimization for Diffusion Transformer serving (GF-DiT), specifically dynamic GPU parallelism scheduling. It does not address model unification, tokenization design, visual encoder architecture, world modeling, reinforcement learning, or reasoning mechanisms. While DiT is used in MLLM and Multimodal contexts (slight relevance score of 2), the core contribution is infrastructure efficiency rather than model learning or reasoning capabilities. Keywords related to RL and Reasoning are completely unrelated.

Keywords

Diffusion Transformers, Serving Efficiency, Dynamic Parallelism, GPU Scheduling, Trajectory Tasks, Group-free Collectives, vLLM-Omni, Image and Video Generation

138. From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation [FAIL]

Score: 9.0 / 35.2

Authors: Bora Kargi, David Salinas

Published: 2026-06-11

TL;DR: This paper proposes a Conformal Elo Estimation method to calibrate LLM-as-a-judge rankings and provide uncertainty bounds without requiring large-scale human annotations.

Abstract (Chinese translation)

评估新的大语言模型（LLM）通常需要大规模且昂贵的人工标注活动。以大语言模型作为评判者（LLM-as-a-judge）提供了一种更经济的替代方案，但评判者分数存在系统性误差——如位置偏差、自我偏好或非传递性——这可能导致最终排名的严重校准不当。我们在两个互补的层面上量化了由此产生的评判者与人类之间的分歧。在局部层面，我们通过将校准后的获胜概率而非硬标签传播至 Bradley-Terry 模型，依据评判者自身的分数差异来估计每场对决的不确定性。仅此一项便大幅提升了 Elo 估计精度，使得在 LMArena 上对 55 个保留模型取平均时，基于 LLM 的评分与基于人类的评分之间的 Elo 平均绝对误差（MAE）控制在 17.9 以内。在全局层面，我们将分裂 conformal 预测（split conformal prediction）应用于跨保留模型的基于 LLM 与基于人类的 Elo 评分之间的残差差距，生成具有无分布边际覆盖保证的预测区间，该区间考虑了不可约的 LLM-人类分歧。结合这两个层面，我们构建了一个低成本评估工具，为开发者提供校准后的 Elo 估计和诚实的不确定性边界，而无需访问大规模人工标注。为了促进可复现性，我们在 https://github.com/kargibora/SoftElo 上发布了我们的代码。

Abstract

Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors - such as position bias, self-preference, or intransitivity - that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
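The global-level step can be sketched with the standard split conformal recipe. This is a generic illustration, not the code from the SoftElo repository. It assumes symmetric intervals built from absolute residuals between LLM-derived and human-derived Elo on a calibration set, and all ratings below are made up.

```python
import math

# Generic split conformal sketch; data and function names are hypothetical.

def conformal_radius(residuals, alpha):
    """Split-conformal quantile of absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual, which gives
    marginal coverage of at least 1 - alpha under exchangeability.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[rank - 1]

# Made-up calibration pairs: (LLM-derived Elo, human-derived Elo).
calibration = [(1200, 1190), (1100, 1125), (1300, 1295), (1250, 1270),
               (1150, 1140), (1050, 1080), (1350, 1340), (1225, 1240),
               (1175, 1170)]
residuals = [abs(human - llm) for llm, human in calibration]

radius = conformal_radius(residuals, alpha=0.25)

# Prediction interval for a new model with an LLM-derived Elo of 1210.
new_llm_elo = 1210
interval = (new_llm_elo - radius, new_llm_elo + radius)
print(radius, interval)
```

The interval width comes entirely from how far the judge's ratings fell from human ratings on held-out models, which is why it can account for disagreement that better per-battle calibration cannot remove.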

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on statistical calibration of LLM evaluations using Conformal Prediction and Elo ratings, which does not align with multimodal architecture (Tokenizer, Visual Encoder), generative world models, or reinforcement learning (model-based RL, Agentic Reasoning). Relevance is limited to general LLM context (MLLM, MultiModal) and methodological unification of uncertainty layers (Unify Models).

Keywords

LLM Evaluation, Conformal Prediction, Elo Estimation, Uncertainty Calibration, LLM-as-a-Judge, Prediction Intervals

139. SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale [FAIL]

Score: 9.0 / 35.2

Authors: Nils Blank, Paul Mattes, Maximilian Xiling Li, Jakub Suliga, Thomas Roth, Moritz Reuss, Pankhuri Vanjani, Rudolf Lioutikov

Published: 2026-06-11

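TL;DR: SPARC proposes a framework that automatically generates spatial annotations with reliability scores from robot demonstrations, markedly improving localization accuracy and retaining more useful samples than detector-only baselines.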

Abstract (Chinese translation)

本文介绍了基于机器人演示的空间标注与可靠性校准（SPARC），这是一种风险感知框架，能够自动为机器人演示生成结构化空间标注，并为每个标注分配一个可靠性分数。结构化空间标注（例如边界框、物体轨迹和操作阶段标签）有利于广泛的机器人应用，涵盖从训练具身机器人策略和具身基础模型，到运动规划和层级任务组合等多个方面。现有的自动化流水线虽能大规模生成此类标注，却缺乏可靠的质量信号：检测器置信度对于标注正确性的校准不佳，迫使人们在接受噪声标签或丢弃有用样本之间做出选择。与现有自动化流水线不同，SPARC 利用机器人任务固有的时空结构生成可靠性信号，从而减少噪声标签并保留更多有用样本。此外，我们还引入了交互感知基准（IA-Bench），该基准用于测量模型在机器人演示中定位交互物体位置的准确性。在涵盖不同构型与场景的 1700 个人工标注演示上，SPARC 在定位精度上显著优于仅检测基线，同时在高精确度操作点保留了三倍多的样本。实验表明，基于我们的标注进行微调的模型在物体定位与指向基准上达到了同等规模模型的最先进水平，同时在更广泛的空间推理套件中仍具竞争力，且无需人工验证或标注的训练数据。此外，基于 SPARC 生成标注训练的策略在杂乱且视觉模糊的真实世界场景中优于基线。代码、数据和模型可在 intuitive-robots.github.io/sparc-labeling 获取。

Abstract

This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at intuitive-robots.github.io/sparc-labeling.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

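Scoring rationale: The paper focuses on a spatial annotation framework (SPARC) for robot demonstration data and its reliability calibration, which falls under robotic systems and data annotation. The provided keyword set centers on large-model architecture (e.g., Tokenizer, Unify Models, World Models, Latent Reasoning, Agentic Reasoning) and core MLLM components. The paper touches only on visual grounding (Visual Encoder, MultiModal) and robot policy training (model-based RL, and MLLM through its mention of foundation models), and does not explore model unification, tokenizers, world models, or latent reasoning in depth, so most keywords have zero or very low relevance. The weighted total is approximately 9.0, far below the dynamic passing threshold of 35.2, indicating a poor match between the paper and the given keyword set. The author list does not include any of the designated experts, so no bonus points are added.

Keywords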

Spatial Annotations, Robot Demonstrations, Reliability Calibration, Grounding, Embodied Foundation Models, Localization Accuracy, Policy Training, Interaction-Aware Bench

140. SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking [FAIL]

Score: 9.0 / 35.2

Authors: Alexander Holmberg

Published: 2026-06-11

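TL;DR: This paper proposes a multi-object tracking method based on selective mask propagation, which addresses identity loss by invoking a video object segmentation model only when assignment uncertainty is high. It improves three base trackers on DanceTrack and achieves state-of-the-art performance on SportsMOT with 86.8 HOTA.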

Abstract (Chinese translation)

多目标跟踪具有重尾难度分布：大多数帧对轻量级基础跟踪器而言较为容易，而一小部分本质上困难。视频目标分割（VOS）模型通常能在基础跟踪器失败的困难帧中保持身份，但其计算和内存开销大得多。我们提出选择性掩码传播（Selective Mask Propagation），这是一种跟踪算法，仅在分配不确定性信号触发的时间窗口内，从基础跟踪器切换至 VOS 模型。仅当 VOS 模型做出与基础跟踪器身份分配相冲突的置信预测时，才修改基础跟踪器的输出；弱或不确定的预测则保留基础输出。该方法无需训练，将基础跟踪器和 VOS 模型均视为黑盒，且可通过用性能更强的模型替换 VOS 组件而受益。在 DanceTrack 上，选择性掩码传播提升了三种不同的基础跟踪器。在 SportsMOT 数据集上，身份保持是体育分析的核心，SAM3-Deep-EIoU 结合全局轨迹关联在该基准上取得了 86.8 HOTA 的最先进性能。

Abstract

Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
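The override rule described in this abstract can be sketched as a small decision function. The confidence threshold, the uncertainty trigger, and the function names are hypothetical, since the abstract does not specify how the uncertainty signal or the VOS confidence is computed.

```python
# Hypothetical decision rules; thresholds and signatures are assumptions.

def should_dispatch(assignment_uncertainty, trigger=0.5):
    """Call the expensive VOS model only when the base tracker is uncertain."""
    return assignment_uncertainty >= trigger

def resolve_identity(base_id, vos_id, vos_confidence, threshold=0.8):
    """Keep the base tracker's ID unless the VOS model confidently disagrees."""
    if vos_id is None or vos_confidence < threshold:
        return base_id  # weak or inconclusive VOS prediction: keep base output
    if vos_id == base_id:
        return base_id  # agreement: nothing to change
    return vos_id       # confident contradiction: override the base tracker

# Toy window: the base tracker swapped an identity, the VOS model is confident.
print(should_dispatch(0.7))          # uncertain window, so VOS is consulted
print(resolve_identity(7, 3, 0.95))  # confident contradiction
print(resolve_identity(7, 3, 0.40))  # low confidence
```

Because both components are consulted only through their outputs, the rule treats the base tracker and the VOS model as black boxes, which is the property the abstract highlights.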

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

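Scoring rationale: The paper studies identity preservation in multi-object tracking, using a selective mask propagation strategy that combines a base tracker with a video object segmentation model (such as SAM). The provided keyword set centers on large language models (MLLM), world models, reinforcement learning (RL), and unified model architectures. Consequently, apart from weak links to 'Visual Encoder' (the SAM model contains a visual encoder) and 'Unify Models' (the strategy unifies tracking and segmentation models), keywords such as Tokenizer, World Models, MLLM, model-based RL, Latent Reasoning, and Agentic Reasoning are entirely unrelated to the paper's core content. 'MultiModal' has low relevance because video processing is usually treated as a single visual modality. The author list does not include Yang Shi or the other designated experts, so no bonus points are added. The weighted total is well below the dynamic passing threshold of 35.2.

Keywords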

Multi-object tracking, Selective mask propagation, Video object segmentation, Segment Anything Model, Base tracker, Identity preservation, HOTA

141. Automated reproducibility assessments in the social and behavioral sciences using large language models [FAIL]

Score: 7.5 / 35.2

Authors: Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

Published: 2026-06-11

TL;DR: This study shows that large language models can automate reproducibility assessments in the social and behavioral sciences, reaching the same qualitative conclusion as the original study in 96% of cases, compared with 74% for human reanalysts.

Abstract (Chinese translation)

社会科学与行为科学领域的可复现性通常由独立研究者通过重新分析原始数据进行评估，以判断发表的研究发现是否可被复现。然而，此类方法资源密集型且难以规模化。本文表明，大语言模型（LLMs）可实现可复现性评估的自动化。本研究基于 N=76 项来自行为与社会科学领域的已发表研究（含预定义主张），比较了大语言模型生成的分析与原始发现及人类重新分析的结果。在 7 项研究中，大语言模型未能产生有效的效应量估计。对于其余研究，在科恩 d（Cohen's d）容差为 ±0.05 的条件下，我们的大语言模型（LLM）流程在 41% 的研究中复现了原始效应量。此外，我们的大语言模型（LLM）流程在 96% 的案例中与原始研究得出了相同的定性结论（此处结论指重新分析是否支持原始主张）。作为对比，人类重新分析者在 34% 的研究中复现了原始效应量，并在 74% 的案例中得出了相同的定性结论。综上所述，这些结果表明，大语言模型可作为可扩展的工具用于自动化可复现性评估，并为社会科学与行为科学中经验结果的系统性审查奠定基础。

Abstract

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
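The effect-size criterion can be made concrete with a short sketch. It assumes the common pooled-standard-deviation form of Cohen's d for two independent groups and an inclusive ±0.05 tolerance. The abstract does not say which variant of d or which boundary convention the authors used, and the data below are made up.

```python
import math
import statistics

# Illustrative sketch: pooled-SD Cohen's d is an assumption, not confirmed
# by the abstract; function names and data are hypothetical.

def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    mean_diff = statistics.mean(group_a) - statistics.mean(group_b)
    return mean_diff / math.sqrt(pooled_var)

def is_recovered(original_d, reanalysis_d, tolerance=0.05):
    """True if the reanalysis effect size lies within the tolerance band."""
    return abs(original_d - reanalysis_d) <= tolerance

d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, is_recovered(0.50, 0.53), is_recovered(0.50, 0.60))
```

A reanalysis can miss this numeric band and still support the original claim, which may help explain why the qualitative agreement rate (96%) is much higher than the effect-size recovery rate (41%).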

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on applying Large Language Models (LLMs) to automate reproducibility assessments in social sciences. It does not involve multimodal architectures, visual encoders, world models, or reinforcement learning mechanisms specified in the keywords. Thus, relevance is minimal across all architectural and RL-related terms, with only general LLM usage providing slight relevance to Tokenizer, MLLM, and Latent/Agentic Reasoning.

Keywords

Large language models, Reproducibility assessments, Social and behavioral sciences, Automated analysis, Effect size estimation, Qualitative conclusions, Empirical auditing

142. SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation [FAIL]

Score: 7.5 / 35.2

Authors: Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

Published: 2026-06-11

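TL;DR: This paper introduces SkMTEB, a Slovak text embedding benchmark, and develops efficient, locally deployable embedding models by fine-tuning Multilingual E5 models, achieving competitive performance for semantic search and RAG.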

Abstract (Chinese translation)

我们介绍了 SkMTEB，这是首个针对斯洛伐克语（一种低资源的西斯拉夫语言）的全面 MTEB 风格文本嵌入基准，包含 7 个任务类型的 31 个数据集——其覆盖深度几乎是现有斯洛伐克语多语言基准的 4 倍。我们对 31 个嵌入模型的评估表明，大型指令微调的多语言模型表现最佳，而现有专为自然语言理解 (NLU) 任务训练的斯洛伐克特定模型在嵌入任务上的迁移效果较差。为了解决对高效、可本地部署的斯洛伐克嵌入的需求，我们通过应用词汇修剪和微调技术，基于多语言 E5 模型开发了 e5-sk-small（4500 万参数）和 e5-sk-large（3.65 亿参数）。尽管模型规模减少了高达 62%，我们的开源模型在与专有 API 性能相当的同时，仍保持本地部署能力，适用于语义搜索和检索增强生成 (RAG)。我们公开发布了基准、模型、数据集和代码，希望我们的方法能为其他低资源语言提供一条可复制的路径。

Abstract

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

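Scoring rationale: The paper focuses on a Slovak text embedding benchmark and model adaptation, which belongs to NLP. The provided keywords mainly concern multimodality, world models, and reinforcement learning and are largely unrelated to this paper. Only Tokenizer (because of vocabulary trimming) and Latent Reasoning (because embedding vectors are latent representations) are weakly relevant. The weighted total is approximately 7.5, far below the passing threshold of 35.2.

Keywords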

Text Embedding, Slovak Language, Benchmark, Model Adaptation, Multilingual E5, Semantic Search, RAG, Vocabulary Trimming

143. Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks [FAIL]

Score: 7.5 / 35.2

Authors: Achraf Hsain, Sultan Almuhammadi

Published: 2026-06-11

TL;DR: This paper proposes shield synthesis as a design-time defensibility analysis framework for adversarial networks using safety games, providing formal certificates and operational metrics rather than just runtime safety enforcement.

Abstract (Chinese translation)

屏蔽强化学习（Shielded Reinforcement Learning）通常被呈现为一种运行时安全机制，它将时序逻辑规范（temporal-logic specifications）编译为限制智能体动作的自动机（automata）。我们认为这是错误的解读。同样的自动机理论机制——规范编译（specification compilation）、积博弈构造（product game construction）、吸引子计算（attractor computation）和获胜区域提取（winning-region extraction）——更应被视为一种设计时分析工具（design-time analytical instrument），其输出是关于系统的结构洞察（structural insights），而非对已部署智能体的运行时约束（runtime constraints）。我们通过一个用于网络防御的受限双人安全博弈（constrained two-player safety game）来实例化这一观点。两种规范是非对称执行的：防御者规范（defender specification）定义了游戏的不安全区域（unsafe region），而攻击者规范（attacker specification）则在吸引子计算（attractor computation）期间限制对手的合法动作（legal actions）。求解该博弈可得一个可防御性判定（defensibility verdict）——即关于拓扑 - 规范对（topology-specification pair）是否可防御的形式化证书（formal certificate）——以及相关的获胜区域（winning region）和屏蔽器（shield）。除了二元判定外，我们从吸引子结构推导出拓扑级指标（topology-level metrics），并将它们与屏蔽约束的对抗性多智能体强化学习（shield-constrained adversarial multi-agent reinforcement learning）的收敛后行为（post-convergence behavior）相结合。这两者共同构成了一个可防御性指纹（defensibility fingerprint），既捕捉了网络的形式化安全属性（formal safety properties），也捕捉了其在自适应博弈（adaptive play）下的操作行为（operational behavior）。假设分析（what-if analysis）表明，形式可防御性（formal defensibility）和操作有效性（operational effectiveness）捕捉了安全的不同方面：微小的架构变化（architectural changes）可能导致操作结果的巨大变化，同时几乎不改变形式安全余量（formal safety margins）。因此，屏蔽合成（Shield synthesis）作为安全智能体的部署机制（deployment mechanism）价值有限，而更有价值的是作为一种框架，用于回答关于系统能否、在何处以及如何被防御的架构问题。可防御性判定是输出，而非安全策略（safe policy）。

Abstract

Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
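The attractor computation and winning-region extraction named in the abstract can be sketched on a toy graph. This is a minimal illustration of the standard attractor algorithm for two-player safety games, not the authors' implementation: the function name `attacker_attractor`, the five-state topology, and the assumption that every state has at least one successor are all hypothetical. In the paper's setting, the attacker specification would presumably filter the attacker's edges before this step.

```python
from collections import deque


def attacker_attractor(states, owner, edges, unsafe):
    """Return the states from which the attacker can force a visit to `unsafe`.

    owner[s] is "attacker" or "defender"; edges[s] lists the successors of s.
    Assumes every state has at least one successor.
    """
    preds = {s: [] for s in states}
    for s in states:
        for t in edges[s]:
            preds[t].append(s)
    # For defender states: number of successors not yet known to be losing.
    escapes = {s: len(edges[s]) for s in states}
    attractor = set(unsafe)
    queue = deque(unsafe)
    while queue:
        t = queue.popleft()
        for s in preds[t]:
            if s in attractor:
                continue
            if owner[s] == "attacker":
                # One move into the attractor is enough for the attacker.
                attractor.add(s)
                queue.append(s)
            else:
                # The defender loses only when every move leads into the attractor.
                escapes[s] -= 1
                if escapes[s] == 0:
                    attractor.add(s)
                    queue.append(s)
    return attractor


# Hypothetical toy topology (not taken from the paper).
STATES = ["d0", "a1", "d2", "safe", "bad"]
OWNER = {"d0": "defender", "a1": "attacker", "d2": "defender",
         "safe": "defender", "bad": "defender"}
EDGES = {"d0": ["a1", "safe"], "a1": ["d2", "bad"], "d2": ["bad"],
         "safe": ["safe"], "bad": ["bad"]}
UNSAFE = {"bad"}

attractor = attacker_attractor(STATES, OWNER, EDGES, UNSAFE)
winning = set(STATES) - attractor           # defender's winning region
shield = {s: [t for t in EDGES[s] if t in winning]
          for s in STATES if s in winning and OWNER[s] == "defender"}
defensible = "d0" in winning                # verdict for initial state d0
print(sorted(winning), shield, defensible)
```

In this toy example the shield removes the defender's move from `d0` to `a1`, and the verdict for `d0` is the design-time output the abstract emphasizes.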

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	2.0/10	3.0
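
The rows above are consistent with a total equal to the sum of weight × relevance. A minimal Python sketch under that assumption follows; the name `weighted_score` and the pass rule `total >= threshold` are illustrative guesses, not taken from the report generator.

```python
# Rows copied from the table above: (keyword, weight, relevance out of 10).
ROWS = [
    ("Unify Models", 1.5, 0.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 3.0),
    ("Latent Reasoning", 1.5, 0.0),
    ("Agentic Reasoning", 1.5, 2.0),
]
PASS_THRESHOLD = 35.2  # passing score shown on the "Score" line


def weighted_score(rows):
    """Sum weight * relevance over all keyword rows (assumed scoring rule)."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_score(ROWS)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```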

Scoring rationale: The paper focuses on formal methods for network defense (shield synthesis, safety games) rather than multimodal learning or large language models. It mentions reinforcement learning in the context of adversarial multi-agent games, giving low relevance to 'model-based RL' (3.0/10) and 'Agentic Reasoning' (2.0/10), and zero relevance to Tokenizer, Visual Encoder, MLLM, MultiModal, Unify Models, World Models, and Latent Reasoning. No target experts were found in the author list, so no bonus points were applied.

Keywords

Shield Synthesis, Defensibility Analysis, Adversarial Networks, Safety Games, Formal Verification, Multi-agent Reinforcement Learning, Network Defense, Runtime Enforcement

144. Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting [FAIL]

Score: 7.5 / 35.2

Authors: Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu, Zewei Dong, Jiang-Ming Yang

Published: 2026-06-11

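TL;DR: This paper proposes Timeflies, a framework that jointly models whether future observations will exist and what their values will be, to address time series forecasting with missing data; experiments show it outperforms existing methods.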

Abstract (Chinese translation)

现实世界的时间序列常因传感器休眠、传输延迟及事件驱动采样而高度不完整且不规则，这使得可靠预测在根本上极具挑战性。现有方法已从先插补后预测流程演变为连续时间模型，例如 Neural ODEs（神经微分方程）和连续时间图网络。尽管这些方法改进了对历史不规则性的建模，但它们在推理阶段仍依赖于一个隐式预言机假设：未来有效观测的时间戳被假定提前已知。这一假设限制了其实用性，因为在许多真实系统中，更根本的问题不仅在于未来值是多少，更在于是否会发生有效观测。本文提出 Timeflies，一个统一框架，将预测重新表述为未来可观测性推断与值估计的联合问题。为了显式建模观测动力学与状态演化之间的相互作用，Timeflies 采用了观测流（observation stream）和值流（value stream），并通过三个专用模块进行耦合：可靠性感知嵌入、观测引导的依赖建模以及联合预测。此外，我们构建了 Shadow 基准，该基准结合了公共数据集的自然缺失性与真实世界工业数据，并引入了观测值联合熵（OVJE）指标，以全面评估这种耦合可预测性。大量实验表明，Timeflies 始终优于现有方法，凸显了在含缺失值的时间序列预测中显式建模未来可观测性的重要性。代码及数据集可在 https://github.com/ant-intl/Timeflies 获取。

Abstract

Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available in https://github.com/ant-intl/Timeflies.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.5/10	5.2
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.5/10	2.2
Agentic Reasoning	1.5	0.0/10	0.0

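Scoring rationale: The paper addresses time series forecasting and missing-data handling, proposing the unified Timeflies framework to jointly model observational existence and values. Among the keywords, only 'Unify Models' is semantically related to the paper's 'unified framework' (3.5), and 'Latent Reasoning' is weakly related to state-evolution modeling (1.5). The remaining keywords (Tokenizer, Visual Encoder, MLLM, MultiModal, model-based RL, Agentic Reasoning) concern multimodal large models or reinforcement learning and are unrelated to this time series work (0). No specified experts appear in the author list, so no bonus was applied. The weighted total is far below the passing line, indicating a poor match between the paper and the given keywords.

Keywords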

Time Series Forecasting, Missing Data, Observational Existence, Value Estimation, Unified Framework, Shadow Benchmark, Observation-Value Joint Entropy, Continuous-time Models

145. Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization [FAIL]

Score: 7.5 / 35.2

Authors: Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza

Published: 2026-06-11

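TL;DR: This paper studies whether contrast-informed augmentation and domain-adversarial training improve the adult-to-neonatal generalization of MRI reconstruction; mixed training with a domain-adversarial objective improved reconstruction quality and increased the overlap of latent representations.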

Abstract (Chinese translation)

目的：探究对比度引导的数据增强（contrast-informed data augmentation）和域对抗训练（domain-adversarial training）是否能提高 E2E-VarNet 在成人到新生儿之间的泛化能力。方法：研究了三种训练方案：（1）仅使用未增强成人数据的成人训练；（2）混合训练，使用成对的未增强成人数据与基于新生儿信息增强的成人数据；（3）混合训练，采用域对抗目标。模型在回顾性欠采样的多线圈成人 T2 加权脑部磁共振（MR）数据上进行训练，并在加速因子 R=4 和 R=8 下，使用定量指标和定性评估对新生儿和成人测试数据进行评估。特征分析评估了域对抗训练是否改变了未增强成人、增强成人和新生儿测试样本的潜在表示（latent representations）。结果：在新生儿数据上评估时，混合训练（Mixed）和混合域对抗训练（Mixed-DAT）优于未增强成人单独训练（Unaug-Only）。在 R=4 时，Mixed-DAT 表现最佳（SSIM = 0.924 +/- 0.027，PSNR = 33.98 +/- 1.15 dB）。在 R=8 时，使用 SSIM 衡量时 Mixed-DAT 表现最佳（0.848 +/- 0.031 vs. Unaug-Only 的 0.766 +/- 0.037 和 Mixed 的 0.814 +/- 0.035），而使用 PSNR 衡量时 Mixed 表现最佳（29.56 +/- 0.83 dB vs. Unaug-Only 的 26.26 +/- 0.78 dB 和 Mixed-DAT 的 29.43 +/- 0.83 dB）。t-SNE 图的定性评估表明，Mixed-DAT 增加了未增强成人、增强成人和新生儿测试数据潜在表示之间的重叠。结论：对比度引导的增强和域对抗训练提高了基于深度学习的磁共振（MR）重建在成人到新生儿之间的泛化能力。这些发现表明，对比度引导的数据增强结合对抗训练可能提高欠采样新生儿磁共振（MR）重建对域偏移（domain shift）的鲁棒性。

Abstract

Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors $R=4$ and $R=8$ using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

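Scoring rationale: The paper focuses on medical image (MRI) reconstruction and domain adaptation, whereas the provided keywords center on multimodal large language models (MLLM), world models, and reinforcement learning, so the domains differ substantially. 'Visual Encoder' and 'MultiModal' are weakly related through image processing and multi-coil data, 'Latent Reasoning' has slight relevance because the paper analyzes latent representations, and the rest (Tokenizer, MLLM, RL, and so on) are unrelated. The weighted total (7.5) is far below the dynamic passing score (35.2). The author list does not include any of the specified experts.

Keywords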

MR Reconstruction, Domain-Adversarial Training, Adult-to-Neonatal Generalization, Contrast-Informed Augmentation, Latent Representations, E2E-VarNet, Deep Learning

146. Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation [FAIL]

Score: 7.5 / 35.2

Authors: Aruna Dey, Suraj Biswas

Published: 2026-06-11

TL;DR: This paper proposes a Bayesian inference framework using genomic profiles as fixed priors to distinguish constitutional from environmentally driven physiological deviations, enabling personalized health interpretation without extensive individual behavioral data.

Abstract (Chinese translation)

个性化健康人工智能系统面临一个根本性的冷启动问题：用于生理解释的机器学习模型需要数周的个体行为数据，才能区分固有变异与环境驱动偏差。我们提出了一种基于因果推断和贝叶斯先验设计的解决方案。个体的基因组谱型作为一种外源性遗传锚点——一种基于领域知识的、个性化的先验，在受孕时确定，不受反向因果影响，且在收集任何行为观测之前即可获取。该锚点初始化了一个关于个体生理设定点的贝叶斯信念状态 Ĝ = μ + Σ(βi · gi)，其中 βi 是基于 GWAS（全基因组关联研究）推导的效应量，gi 是风险等位基因计数。每次传入的生理测量值 P 产生一个非固有偏差 δ = P - Ĝ，将归因于环境与状态的信号从固有固定的基线中分离出来。随着行为数据的积累，先验根据 Ĝt = w(t)·Ĝgenomic + [1-w(t)]·P̄t 衰减，从基因组主导过渡到经验基线主导的推断。同样的观测心率变异性（HRV）为 55 ms，对于先验预测值为 80 ms 的人会产生抑制假设，而对于先验预测值为 30 ms 的人会产生增强假设——这种反转在没有个性化锚点的情况下是不可能的。我们在六个生理领域构建了此架构，根据证据强度对基因组先验进行分级，区分稳健复制的锚点（FTO、FADS1/2、FKBP5）与存疑的候选基因（SLC6A4、MAOA、DRD2）。我们探讨了关联、孟德尔随机化与个体实例因果之间的推断边界，并定义了四个部署约束：证据分级先验、动态衰减、祖先匹配效应量以及归因而非确定性输出。

Abstract

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point G-hat = mu + sum(beta_i * g_i), where beta_i are GWAS-derived effect sizes and g_i are risk-allele counts. Each incoming physiological measurement P produces a non-constitutional deviation delta = P - G-hat that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to G-hat_t = w(t)*G-hat_genomic + [1-w(t)]*P-bar_t, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.
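
The three formulas in the abstract can be traced with a short Python sketch. The function names, the baseline `mu`, the effect sizes, and the weight value below are illustrative assumptions; the abstract does not specify the form of w(t), so the weight is passed in directly. Only the HRV figures (55 ms observed, priors of 80 ms and 30 ms) come from the abstract.

```python
def genomic_set_point(mu, betas, allele_counts):
    """G-hat = mu + sum(beta_i * g_i): the genome-anchored prior set point."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))


def deviation(measurement, set_point):
    """delta = P - G-hat: the non-constitutional deviation."""
    return measurement - set_point


def blended_set_point(w, g_genomic, p_bar):
    """G-hat_t = w(t) * G-hat_genomic + (1 - w(t)) * P-bar_t."""
    return w * g_genomic + (1.0 - w) * p_bar


# Hypothetical effect sizes and allele counts, for illustration only.
example_prior = genomic_set_point(mu=50.0, betas=[5.0, -2.5], allele_counts=[2, 0])

# HRV example from the abstract: the same 55 ms reading, two different priors.
delta_high_prior = deviation(55.0, 80.0)  # negative: suppression hypothesis
delta_low_prior = deviation(55.0, 30.0)   # positive: enhancement hypothesis

# Halfway through the decay (w = 0.5 is an arbitrary illustrative value).
blended = blended_set_point(0.5, g_genomic=80.0, p_bar=60.0)
print(example_prior, delta_high_prior, delta_low_prior, blended)
```

The sign flip between the two deviations is the reversal the abstract describes: identical measurements lead to opposite hypotheses once the prior is personalized.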

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

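Scoring rationale: The paper belongs to biomedical AI and proposes a Bayesian inference framework built on genomic priors. The provided keyword list mainly targets multimodal large models and reinforcement learning architectures. The paper has almost no direct technical overlap with these architectures and is only weakly related at the conceptual level (for example, state modeling and latent-variable inference), so relevance scores are low across the board. The weighted total is 7.5, far below the dynamic passing score of 35.2.

Keywords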

Bayesian Inference, Genomic Profiles, Personalized Physiological Interpretation, Causal Inference, Genetic Anchor, Physiological Set Point, Dynamic Decay

147. Can I Buy Your KV Cache? [FAIL]

Score: 7.5 / 35.2

Authors: Luoyuan Zhang

Published: 2026-06-11

TL;DR: This paper proposes a KV cache reuse mechanism to eliminate redundant prefill computation for multiple agents processing the same document, achieving significant inference cost savings without accuracy loss.

Abstract (Chinese translation)

目前，在全球范围内，人工智能代理正在重复着同样的荒谬举动：每读取一份文档，它们都各自从头开始重新计算。每个代理都在相同的文本上重新运行 prefill（预填充），这是大模型最密集的计算步骤，只为重建一个与上一个代理刚刚构建的完全相同的 KV 缓存（Key-Value Cache）。同一个答案，被重复计算了百万次。我们提出一个近乎“冒犯性”简单的方案：只计算一次。让发布者预先计算好文档的 KV 缓存，让其他代理购买加载该缓存的权利从而跳过 prefill 步骤。该方法有效，且具有 token-exact（令牌精确性）：加载预计算的 KV 缓存并继续生成，与从头开始 prefill 的结果完全一致（在 24/24 greedy tokens 及 logit 级别上均匹配），且无精度损失。在 Qwen3-4B 上，复用的计算成本比 prefill 低 9 至 50 倍，且随着长度增加差距进一步扩大（预填充的注意力机制复杂度随 L^2 增长），因此单次复用即可收回成本。随后是关键所在：KV 缓存的托管位置。直接传输（Shipping）不可行，因为 KV 缓存几乎不可压缩，导致每次加载的 egress（出站）成本超过了其所节省的 prefill 成本。在提供商侧托管 KV 缓存（与生产环境中的提示词缓存机制相同），则可完全消除 egress 成本。收益规模取决于我们测量的计算节省量：向 8000 万代理服务一份热门的 3774 token 文档，重新 prefill 成本约为 150 万美元，而复用计算成本仅需约 3 万美元（节省 49.7 倍）。0.1 倍缓存读取费率 API 向用户提供 10 倍折扣，且该费用位于我们测量的成本范围内，因此 10 倍折扣是下限，而测量到的约 50 倍计算节省远超此下限，剩余差距即为提供商利润：每份热门文档可达数百万美元。我们将由此构建的原生代理预填充 CDN（内容分发网络）作为解决方案，并将无损 KV 压缩及跨方支付层留作待解决的开放性问题。

Abstract

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
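
The headline cost figures in the abstract can be checked with simple arithmetic. This sketch uses only the rounded totals quoted above (80M agents, ~$1.5M to re-prefill, ~$0.03M to reuse); the variable and function names are illustrative. Because the inputs are rounded, the ratio comes out slightly different from the reported 49.7x.

```python
AGENTS = 80_000_000        # agents served one hot document (from the abstract)
REPREFILL_USD = 1.5e6      # approximate total cost if every agent re-prefills
REUSE_USD = 0.03e6         # approximate total cost if every agent reuses the KV cache


def per_agent_cost(total_usd, agents):
    """Average cost per agent for one document load."""
    return total_usd / agents


prefill_per_agent = per_agent_cost(REPREFILL_USD, AGENTS)
reuse_per_agent = per_agent_cost(REUSE_USD, AGENTS)
saving_ratio = REPREFILL_USD / REUSE_USD

print(f"prefill per agent: ${prefill_per_agent:.5f}")
print(f"reuse per agent:   ${reuse_per_agent:.6f}")
print(f"saving ratio from rounded totals: {saving_ratio:.1f}x")
```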

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on LLM inference optimization via KV cache reuse, which has minimal overlap with keywords centered on World Models, Multimodal Learning, and Reinforcement Learning. Only Agentic Reasoning (3) and Tokenizer (2) have slight contextual relevance; the others are 0. The total weighted score is 7.5, below the 35.2 threshold. No expert authors from the specified list were found.

Keywords

KV Cache, Inference Efficiency, Large Language Model, Agent, Prefill, Compute Saving, Caching Infrastructure, Token-Exact

148. Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier [FAIL]

Score: 7.5 / 35.2

Authors: Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece

Published: 2026-06-11

TL;DR: This paper introduces PULSE, a semi-supervised multi-task framework for Orthoptera bioacoustic classification that achieves superior performance over general models through self-supervised learning and knowledge distillation on field audio.

Abstract (Chinese translation)

被动声学监测（Passive acoustic monitoring）在生态推断方面具有巨大潜力，然而现有的自动化工具通常训练范围狭窄且不可迁移。我们提出了 PULSE，这是一个面向直翅目（Orthoptera）生物声学的半监督多任务框架，结合了弱监督物种分类、在未标记野外音频上的自监督学习以及从通用生物声学模型中进行的知识蒸馏。我们的领域适应专用模型在所有指标上均优于最先进的通用模型（宏 F1：0.21 比 0.07；AUC：0.74 比 0.45；AP：0.32 比 0.19），而主动学习进一步将 F1 提升至 0.34，将 AUC 提升至 0.84。除分类任务外，学习到的嵌入（embeddings）编码了具有生态意义的结构，并通过一个交互式可视化工具展现，以支持生态发现。

Abstract

Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on bioacoustic classification using semi-supervised and multi-task learning, and lacks the core components typical of the provided keywords, such as LLMs, visual encoders, RL, or tokenizers. Only latent embeddings (Latent Reasoning) and task unification (Unify Models) show minor relevance. No expert authors from the specified list were found.

Keywords

Bioacoustic Classifier, Semi-supervised, Multi-task, Orthoptera, Passive acoustic monitoring, Knowledge distillation, Self-supervised learning

149. TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization [FAIL]

Score: 7.5 / 35.2

Authors: Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Xing Hu, Zhe Jiang, Dawei Yang

Published: 2026-06-11

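TL;DR: This paper proposes TWLA, a post-training quantization framework that compresses LLM weights to 1.58 bits and activations to 4 bits, maintaining high accuracy while significantly accelerating inference.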

Abstract (Chinese translation)

大语言模型 (LLMs) 展现出卓越的语言处理能力，但其内存和计算成本阻碍了部署。三值化已成为一种颇具前景的压缩技术，能显著降低模型大小和推理复杂度。然而，现有方法难以应对重尾激活分布，因此不得不保持激活值的高精度，这从根本上限制了端到端的推理加速。为克服这一限制，我们提出 TWLA，一种后训练量化 (PTQ) 框架，该框架在保持高精度的同时，实现了 1.58 位权重压缩和 4 位激活量化。TWLA 包含三个组件：(1) 欧几里得流形非对称三值量化器 (E2M-ATQ)，通过从欧几里得初始化到流形重定位的两阶段优化，在权重三值化过程中最小化层输出误差；(2) 克罗内克正交三模态塑形 (KOTMS)，应用克罗内克结构正交旋转将权重重塑为适合三值化的三模态分布，同时共享旋转在统计上抑制激活异常值；(3) 层间感知激活混合精度 (ILA-AMP)，在比特分配中显式引入相邻层二阶交互代价，并联合优化由共享正交变换引起的激活量化增益的层间差异，从而防止由少数弱层触发的级联效应。大量实验表明，TWLA 在 W1.58A4 配置下保持高精度，同时提供显著的推理加速。代码可在 <https://github.com/Kishon-zzx/TWLA> 获取。

Abstract

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at <https://github.com/Kishon-zzx/TWLA>.
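
The "1.58-bit" figure follows from the information content of a three-valued weight, log2(3). The sketch below computes it and a rough weight-memory comparison; the 7-billion-parameter model size, the FP16 baseline, and the helper name `weight_memory_gib` are hypothetical, and the estimate ignores scales, packing overhead, and activations.

```python
import math

# A ternary weight takes one of three values, so it carries log2(3) bits.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def weight_memory_gib(num_params, bits_per_weight):
    """Idealized weight storage in GiB, ignoring scales and packing overhead."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # hypothetical model size, for illustration only
fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
ternary_gib = weight_memory_gib(NUM_PARAMS, BITS_PER_TERNARY_WEIGHT)

print(f"bits per ternary weight: {BITS_PER_TERNARY_WEIGHT:.3f}")
print(f"FP16 weights:    {fp16_gib:.2f} GiB")
print(f"ternary weights: {ternary_gib:.2f} GiB")
```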

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

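Scoring rationale: The paper's core topic is post-training quantization and compression of large language models (LLMs) to reduce inference cost. The provided keyword set focuses on multimodality, world models, reinforcement learning, and agentic reasoning, which matches this paper poorly. Apart from MLLM, which is slightly related because the paper involves LLMs, the other keywords (such as Visual Encoder, World Models, and reinforcement learning) are not reflected in the paper, so relevance scores are very low.

Keywords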

Post-Training Quantization, Ternary Weights, Low-Bit Activations, LLM Compression, Inference Acceleration, Weight Ternarization, Activation Quantization

150. APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization [FAIL]

Score: 7.5 / 35.2

Authors: Yifan Zhao, Lang Qin, Jintai Chen

Published: 2026-06-11

TL;DR: APCyc introduces a target-aware generative framework for cyclic peptide design that optimizes physicochemical properties through expanded vocabulary and Bayesian posterior guidance, overcoming limitations of linear peptide-trained models.

Abstract (Chinese translation)

环肽代表了现代药物发现中一类有前景的治疗化合物，通常能提供增强的稳定性及结合亲和力。然而，环肽的从头设计（de novo design）仍然具有挑战性，因为方法必须识别适配靶标口袋的环化模式和连接位点，同时控制药物相关性质。这一挑战对于主要基于线性肽数据训练的近期生成模型而言尤为突出，因为它们可能无法捕捉环化特异性约束。为了解决这一局限性，我们引入了 APCyc，这是一个靶标感知的从头环肽生成框架，该框架显式建模环化过程，并联合优化多个关键的理化性质。通过使用扩展的残基词汇表并显式编码环化位点和连接类型信息，APCyc 学习环化感知表示，并利用贝叶斯后验引导（Bayesian posterior guidance）将采样引导至满足多个性质目标的环肽。实验结果表明，我们的模型学习了靶标依赖的环化偏好，并实现了环肽设计的有效且可控的多性质优化。本文的源代码可在 https://github.com/HKUSTGZ-ML4Health-Lab/APCyc 获取。

Abstract

Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and binding affinity. However, the de novo design of cyclic peptides remains challenging because methods must identify pocket-adaptive cyclization patterns and linkage sites while simultaneously controlling drug-relevant properties. This challenge is particularly pronounced for recent generative models trained predominantly on linear peptide data, which may fail to capture cyclization-specific constraints. To address the limitation, we introduce APCyc, a target-aware de novo cyclic peptide generation framework that explicitly models cyclization and jointly optimizes multiple essential physicochemical properties. By using an expanded residue vocabulary and explicitly encoding cyclization-site and linkage-type information, APCyc learns cyclization-aware representations and leverages Bayesian posterior guidance to steer sampling toward cyclic peptides satisfying multiple property objectives. Experimental results demonstrate that our model learns target-dependent cyclization preferences, and enables effective and controllable multi-property optimization for cyclic peptide design. The source code of this paper is available at https://github.com/HKUSTGZ-ML4Health-Lab/APCyc.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on cyclic peptide generation using generative models and Bayesian guidance in computational chemistry. It does not involve multimodal learning (MLLM, MultiModal, Visual Encoder), world modeling, reinforcement learning, or agentic systems. Tokenizer and Latent Reasoning have minimal relevance due to vocabulary expansion and latent property optimization, but are not core focuses. Unify Models is not applicable as it is a specific framework rather than a unification effort. No expert authors from the specified list are present. The weighted total score is 7.5, well below the dynamic passing score of 35.2.

Keywords

Cyclic Peptides, Property-Informed Design, Automated Cyclization, Generative Models, Bayesian Posterior Guidance, Physicochemical Properties, Target-Aware Framework

151. Learning with Simulators: No Regret in a Computationally Bounded World [FAIL]

Score: 7.5 / 35.2

Authors: Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin

Published: 2026-06-11

TL;DR: This paper introduces a framework of simulatable processes to achieve PAC-like learning guarantees in computationally bounded environments using a simulator, extending classical learning theory.

Abstract (Chinese translation)

理解泛化所需的最小假设是学习理论中的基本问题。不幸的是，大多数结果严重依赖于数据生成过程的独立性（或其某种代理形式），而对于强依赖数据的结果则非常有限。为填补这一空白，我们引入了可模拟过程（simulatable processes）的框架，其中学习者可以访问一个模拟器，该模拟器近似生成数据的分布（该分布可能是任意复杂且依赖的过程）。令人惊讶的是，给定访问此类模拟器的权限，我们表明我们可以恢复与独立数据经典设置下相同的学习保证，即依赖于 VC 维（VC dimension）的误差界。此外，我们利用此框架研究条件采样的能力，并在此设置下展示了严格的统计和计算优势。作为我们框架的一个亮点，我们展示了一个单一算法，该算法在所有可在有界多项式时间内采样的过程下同时学习任何给定的 VC 类（VC class），其遗憾由该过程的时间有界 Kolmogorov 复杂度（time-bounded Kolmogorov complexity）控制。这为经典 PAC 模型（PAC model）提供了显著的概念扩展。

Abstract

Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far more limited. Towards addressing this gap, we introduce the framework of simulatable processes, where the learner has access to a simulator that approximates the distribution generating the data (which may be an arbitrarily complex and dependent process). Surprisingly, given access to such a simulator, we show that we can recover the same learning guarantees as in the classical setting with independent data, namely, error bounds that depend on the VC dimension. Further, we use this framework to study the power of conditional sampling and show strict statistical and computational advantages in this setting. As a highlight of our framework, we exhibit a single algorithm that simultaneously learns any given VC class under all processes samplable in bounded polynomial time, with regret controlled by the time-bounded Kolmogorov complexity of the process. This provides a significant conceptual broadening of the classical PAC model.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on theoretical machine learning regarding simulatable processes and computational bounds, lacking direct content on multimodal architectures, tokenization, visual encoders, or specific agent-based reasoning systems implied by the keywords. While 'simulator' and 'regret' loosely relate to World Models and model-based RL, the core contribution is learning theory (VC dimension, Kolmogorov complexity) rather than model architecture or agent behavior.

Keywords

Simulatable Processes, Learning Theory, Conditional Sampling, VC Dimension, Regret Bounds, Kolmogorov Complexity, Computational Bounds, Generalization

152. NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks [FAIL]

Score: 7.5 / 35.2

Authors: Fabien Chraim, Jian Zhang, Dominik Janzing, Xiang Song, Christos Faloutsos, John Evans

Published: 2026-06-11

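TL;DR: NetCause is a self-supervised counterfactual learning framework for root cause analysis in large-scale networks; it ranks candidate root causes by counterfactual simulation and achieves a 16.1% accuracy improvement over a rule-based heuristic baseline.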

Abstract (Chinese translation)

学习模型能否捕捉故障在大型网络中的传播方式，并利用此知识将客户影响因果归因于其根本原因？现有的根本原因分析技术通常依赖静态规则、相关性启发式方法或拓扑局部推理，这些方法在动态环境中难以泛化，而故障正是通过复杂的物理和逻辑依赖关系传播的。我们提出了 NetCause，这是一个基于自监督学习的框架，它将网络事件建模为图 - 时序过程，并使用反事实模拟来对候选根本原因进行排序。该方法产生可解释的根本原因假设排名，并能自然地与操作员定义的缓解与修复措施集成。我们在一家领先的云提供商的生产网络中收集的六个月内的 1500 多个事件上训练该模型，并在 31 个专家标注的事件上对其进行了评估。NetCause 在与运营决策最相关的场景中始终提高了根本原因排名质量，相比基于规则的启发式基线，准确率提升了 16.1%。尽管训练计算密集，但推理过程轻量，每个事件仅需数秒的 GPU 运行时间（远低于典型的遥测收集延迟）。

Abstract

Can a learned model capture how faults propagate through a large-scale network and use this knowledge to causally attribute customer impact to its underlying root cause? Existing root cause analysis techniques often rely on static rules, correlation heuristics, or topology-local reasoning, which struggle to generalize in dynamic environments where faults propagate across complex physical and logical dependencies. We present NetCause, a self-supervised learning-based framework that models network incidents as graph-temporal processes and uses counterfactual simulation to rank candidate root causes. This approach produces an interpretable ranking of root cause hypotheses and integrates naturally with operator-defined mitigation and remediation actions. We train the model on over 1,500 incidents collected over six months from a leading cloud provider's production network and evaluate it on 31 expert-labeled incidents. NetCause consistently improves root cause ranking quality in the regime most relevant to operational decision-making, achieving a 16.1% accuracy improvement over a rule-based heuristic baseline. While training is computationally intensive, inference is lightweight, requiring only seconds of GPU runtime per incident (well below typical telemetry collection latencies).

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

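Scoring rationale: The paper addresses network root cause analysis using graph-temporal modeling and counterfactual learning. The provided keyword set mainly covers multimodal large models, reinforcement learning, and vision components, and has almost no direct connection to this topic. Apart from weak links to 'Latent Reasoning' (self-supervised representations) and 'World Models' (system modeling), the remaining keywords have very low relevance. The author list does not include any of the specified experts.

Keywords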

Root Cause Analysis, Counterfactual Learning, Large-Scale Networks, Graph-Temporal Modeling, Self-Supervised Learning, Causal Attribution, Network Incidents, Operational Decision Making

153. Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling [FAIL]

Score: 7.5 / 35.2

Authors: Jagriti Singh, Shekhar Verma, Muneendra Ojha

Published: 2026-06-11

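TL;DR: This paper proposes a modified reverse-diffusion sampling strategy that strengthens the exploration of low-density regions by classifier-guided diffusion models without any additional training, thereby improving sample recall.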

摘要翻译

扩散模型已成为高保真图像合成的最先进的生成模型，特别是在无分类器引导（classifier-free guided）和分类器引导（classifier-guided）的形式中。然而，标准分类器引导倾向于将概率质量集中在高密度类别均值附近，导致对类别条件分布（class-conditional distributions）尾部稀有样本的覆盖不足。近期基于扩散的尾部采样工作通过训练一个额外的低密度寻求分类器（low-density-seeking classifier）并结合合成 - 真实判别器（synthetic-vs-real discriminator）来缓解这一问题，但代价是需要额外的网络和训练过程。与此同时，多种采样器和蒸馏技术旨在加速或优化扩散采样，但并未明确解决长尾覆盖（long-tail coverage）问题。我们提出了一种纯采样时（sampling-time）、密度感知（density-aware）的分类器引导条件扩散模型扩展，旨在针对低密度区域，且无需任何额外训练。与大多数扩散模型在预测噪声上应用引导不同，我们在含噪图像上应用引导。基于 ImageNet 上预训练的条件扩散模型和分类器，我们通过修改后的分类器梯度将轨迹导向低置信度区域，从而修改引导反向动力学（reverse dynamics）；同时，在每个时间步，我们也引导采样过程朝向预测的真实图像。第一种引导有助于探索低概率样本，而第二种引导有助于生成接近真实数据流形（data manifold）的样本。所提出的采样器在 64x64 分辨率下一致提高了 ADM 模型的召回率，同时保持了可比的 FID 分数；对于 256x256 的 ADM 模型，我们通过两种引导的不同组合展示了视觉效果。我们还表明，标准的 ADM 分类器引导与预测真实图像引导相结合，有助于在 ImageNet 上使用 256x256 ADM 模型生成高感知质量的样本。

Abstract

Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. Unlike most diffusion models, which apply guidance to the predicted noise, we apply guidance to the noisy images. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via a modified classifier gradient, and at each time step we also guide the sampling process toward the predicted real image. The first guidance term helps explore low-probability samples, and the second helps keep generated samples close to the real data manifold. The proposed sampler consistently improves ADM model recall at 64x64 resolution while maintaining a comparable FID; with a 256x256 ADM model, we show results visually for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted-real-image guidance, helps generate samples of high perceptual quality with a 256x256 ADM model on ImageNet.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

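Scoring rationale: The paper focuses on classifier-guided sampling strategies for diffusion models, aiming to strengthen exploration of low-density regions. The provided keyword list centers on multimodal large language models (MLLM), World Models, reinforcement learning (RL), and Unify Models. The paper has little connection to these keywords, with only faint links to visual feature extraction (Visual Encoder) and latent-space operations (Latent Reasoning); the other keywords, such as Tokenizer, MLLM, MultiModal, and RL, are unrelated to the paper's core topic (sampling for image generation).

Keywords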

Diffusion Models, Classifier Guidance, Low-Density Exploration, Reverse Diffusion Sampling, Image Generation, Sampling Strategy, Tail Sampling

154. MÖVE: A Holistic LLM Benchmark for the German Public Sector FAIL

Score: 7.5 / 35.2

Authors: Camilla Dalerci, Thilo Michael, Robin Schaefer, Daniel Weinland

Published: 2026-06-11

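TL;DR: MÖVE is a holistic large language model benchmark for the German public sector that evaluates 39 models on performance and governance criteria; the results show that no single model dominates across all tasks and criteria.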

摘要翻译

我们提出 MÖVE（Modelle für die Öffentliche Verwaltung Evaluieren），这是一个用于在德国公共部门背景下评估大型语言模型（LLM）的全面基准。尽管大型语言模型（LLM）在公共行政中的应用日益广泛，但模型选择仍很大程度上是临时性的，且现有基准提供的指导有限：它们主要以英语为中心，内容以美国为中心，且仅专注于任务表现。MÖVE 通过从两个互补维度评估 39 个模型来弥补这些不足。性能准则涵盖文本摘要、问答和主题提取。治理准则评估幻觉倾向、能源消耗、提供商透明度，以及与德国宪法价值观的一致性以及对德国政党立场的认知。总体而言，我们使用了十个德语数据集，其中包括我们构建的金标准和银标准数据集，用以反映公共行政领域。我们采用一种多指标评估策略，结合了经典 NLP 指标、基于嵌入的方法以及以大语言模型为裁判（LLM-as-a-judge）的方法。我们的结果显示，没有任何一个模型在所有准则上占据主导：不同任务下的最佳表现者不同，且仅凭模型大小是预测质量的糟糕指标。我们进一步评估基准本身，分析其统计精度、LLM 裁判可靠性、自建数据集对模型排名的影响、结果对提示词构造的敏感性以及能源消耗估计的有效性。MÖVE 被设计为处于积极开发中的持续更新基准；结果公开发布于 https://moeve.bundesdruckerei.de/。

Abstract

We present MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren), a holistic benchmark for evaluating large language models (LLMs) in the context of the German public sector. While LLMs are increasingly adopted in public administration, model selection remains largely ad hoc, and existing benchmarks offer limited guidance: they are predominantly English-centric, US-centric in content, and focus exclusively on task performance. MÖVE addresses these gaps by evaluating 39 models across two complementary dimensions. Performance criteria cover summarization, question answering, and topic extraction. Governance criteria assess hallucination tendencies, energy consumption, provider transparency, alignment with German constitutional values, and knowledge of the positions of German political parties. In total, we utilize ten German-language datasets, including gold- and silver-standard datasets that we constructed to reflect public-administration domains. We employ a multi-metric evaluation strategy combining classical NLP metrics, embedding-based methods, and LLM-as-a-judge approaches. Our results show that no single model dominates across all criteria: top performers differ between tasks, and model size alone is a poor predictor of quality. We further evaluate the benchmark itself, analyzing its statistical precision, LLM judge reliability, the impact of our private datasets on model rankings, the sensitivity of our results to prompt formulation, and the validity of our energy consumption estimates. MÖVE is designed as a living benchmark under active development; results are publicly available at https://moeve.bundesdruckerei.de/.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	1.0/10	1.5

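Scoring rationale: The paper focuses on evaluating text-based language models for the German public sector and does not involve multimodal architectures (Visual Encoder, MultiModal), world models, reinforcement learning, or specific tokenization or latent reasoning mechanisms, so most keywords have very low or no relevance. A few keywords receive a faint score only because the paper concerns language model evaluation.

Keywords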

LLM Benchmark, German Public Sector, Model Evaluation, Text-based, Governance Criteria, Summarization, Question Answering

155. Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios FAIL

Score: 7.5 / 35.2

Authors: Vinícius Orrú, Bruno H. Foggiatto, Gabriel E. Lima, David Menotti, Rayson Laroca

Published: 2026-06-11

TL;DR: This paper addresses vehicle color recognition in surveillance under severe class imbalance by leveraging synthetic data augmentation from generative models, achieving improved macro accuracy compared to recent literature.

摘要翻译

车辆颜色识别是监控系统中车辆识别的重要线索，尤其是在由于低分辨率、遮挡、运动模糊或光照不足导致车牌难以辨认的情况下。然而，现实世界中的车辆颜色分布高度不平衡，导致整体准确率不足以评估在稀有但具有操作相关性的颜色上的性能。本文利用 UFPR-VeSV（一个具有挑战性的现实世界监控数据集）对严重类别不平衡下的车辆颜色识别进行了全面研究。我们通过两种现成的生成策略探究合成少数类增强：基于 RunDiffusion/JuggernautXL 的文本条件图像生成以及基于 Gemini 2.0 Flash 的图像条件颜色编辑。精心筛选的合成数据与现代视觉表示、损失重加权、学习率调度、颜色安全增强、前景感知预处理及集成融合相结合。表现最佳的方法实现了 94.6% 的微平均准确率（micro accuracy）和 79.7% 的宏平均准确率（macro accuracy），相较于近期文献，宏平均准确率提升了 8.2 个百分点。人工误差分析进一步表明，许多剩余的失败案例即使对于人工标注者来说在视觉上也是模糊的，这凸显了基于颜色的车辆识别在无约束监控图像中的实际局限性。生成的图像和源代码已在 https://github.com/viniciusorru/vcr-synthetic 上公开提供。

Abstract

Vehicle color recognition is an important cue for vehicle identification in surveillance systems, especially when license plates are illegible due to low resolution, occlusion, motion blur, or poor illumination. However, real-world vehicle color distributions are highly imbalanced, making overall accuracy insufficient to assess performance on rare but operationally relevant colors. This paper presents a comprehensive study of vehicle color recognition under severe class imbalance using UFPR-VeSV, a challenging real-world surveillance dataset. We investigate synthetic minority-class augmentation through two off-the-shelf generative strategies: text-conditioned image generation with RunDiffusion/JuggernautXL and image-conditioned color editing with Gemini 2.0 Flash. The curated synthetic data are combined with modern visual representations, loss reweighting, learning-rate scheduling, color-safe augmentation, foreground-aware preprocessing, and ensemble fusion. The best-performing approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 percentage points over recent literature. A manual error analysis further shows that many remaining failures are visually ambiguous even for human annotators, highlighting the practical limits of color-based vehicle identification in unconstrained surveillance imagery. The generated images and source code are publicly available at https://github.com/viniciusorru/vcr-synthetic
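
The gap between micro and macro accuracy reported in this abstract is easier to see on a toy example. The sketch below assumes the usual definitions (micro accuracy as the fraction of all samples classified correctly, macro accuracy as the unweighted mean of per-class accuracy); the abstract does not spell them out, and the labels and predictions are invented for illustration, not taken from UFPR-VeSV.

```python
from collections import defaultdict

def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy.

    micro: fraction of all samples classified correctly.
    macro: unweighted mean of per-class accuracy (per-class recall).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro

# Toy data (invented): 18 white cars and 2 green cars.
y_true = ["white"] * 18 + ["green"] * 2
# Every white car is right, but only one of the two green cars is.
y_pred = ["white"] * 18 + ["green", "white"]

micro, macro = micro_macro_accuracy(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

A single mistake on the rare class barely moves the micro figure but halves that class's contribution to the macro figure, which is why the abstract treats overall accuracy as insufficient under imbalance.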

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

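Scoring rationale: The paper addresses vehicle color recognition in computer vision, in particular the data imbalance caused by a long-tailed distribution. Although it uses generative models (such as RunDiffusion and Gemini) for data augmentation, these are applied as tools rather than being the core research contribution. The paper does not cover unified model architectures, tokenizer design, world models, reinforcement learning, or agentic reasoning. Its relevance to the given keyword set (which leans toward multimodal large models, world models, and reinforcement learning) is therefore very low. Only "Visual Encoder" and "MultiModal" receive low scores, because the paper involves visual representations and text/image generation tools.

Keywords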

Vehicle Color Recognition, Class Imbalance, Synthetic Data Augmentation, Long-Tailed Distribution, Surveillance Scenarios, Text-Conditioned Generation, Image-Conditioned Editing

156. Point-Wise Geometry-Aware Transformer for Partial-to-Full Point Cloud Registration in Computer-Assisted Surgery FAIL

Score: 7.5 / 35.2

Authors: Siyu Zhou, Zhongliang Jiang

Published: 2026-06-11

TL;DR: This paper proposes GAPR-Net, a geometry-aware transformer framework for partial-to-full point cloud registration in computer-assisted surgery, achieving high registration accuracy and low RMSE.

摘要翻译

由于重叠率变化、点密度波动以及噪声的存在，部分到完整配准（Partial-to-full registration）仍然具有挑战性。尽管 Transformer 在点云处理中展现出强大潜力，但先前方法通常仅将其局限于全局上下文聚合，忽略了对于准确对应至关重要的细粒度局部几何。本文提出 GAPR-Net，一种基于学习的点云配准框架，该框架采用粗到细（coarse-to-fine）架构，结合卷积与 Transformer 模块，利用交叉注意力机制在部分点云与完整点云之间融合局部与全局信息。为此，本文提出了一种变换不变性的点级几何特征表示，能够稳健地捕捉单个点相对于其邻点的相对几何特征。为评估所提方法的有效性，实验在四种几何结构不同的骨骼上进行，包括胫骨、股骨、骨盆和胸软骨。整体配准召回率达到 94.2%，该方法得到的均方根误差（RMSE）低至 1.992 mm，旋转和平移的决定系数（$R^2$）分别为 0.908 和 0.974。结果表明，所提方法有效解决了部分到完整点云配准问题。所提方法利用部分观测实现了高精度 3D 点云配准，为计算机辅助手术中的精确手术导航及机器人干预提供了关键基础。代码将在双盲评审过程后公开。

Abstract

Partial-to-full registration remains challenging due to varying overlap ratios, fluctuating point densities, and the presence of noise. While transformers have shown strong potential for point cloud processing, prior methods typically confine them to global context aggregation, overlooking fine-grained local geometry crucial for accurate correspondence. We propose GAPR-Net, a learning-based point cloud registration framework with a coarse-to-fine architecture that combines convolution and transformer modules, in which local and global information is fused between the partial and full point clouds using a cross-attention mechanism. To achieve this, a transformation-invariant point-wise geometric feature representation is proposed, which can robustly capture relative geometric features for individual points with respect to their neighboring points. To evaluate the effectiveness of the proposed approach, experiments are conducted on four geometrically distinct bones, including the tibia, femur, pelvis, and thoracic cartilage. The overall registration recall reaches 94.2%, and the method achieves a low RMSE of 1.992 mm and $R^2$ values of 0.908 and 0.974 for rotation and translation, respectively. The results demonstrate that the proposed method effectively addresses the partial-to-full point cloud registration problem. The proposed method enables highly accurate 3D point cloud registration using partial observation, providing a critical foundation for precise surgical navigation and robotic interventions in computer-assisted surgery. The code will be made available after the double-blind review process.
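
To illustrate what "transformation-invariant" means for a point-wise feature, the sketch below uses the sorted distances from a point to its nearest neighbours, which do not change under rotation and translation. This is a generic stand-in chosen for illustration; the abstract does not specify the feature GAPR-Net actually uses, and the helper names and toy point cloud are assumptions.

```python
import math

def knn_distances(points, index, k):
    """Sorted distances from points[index] to its k nearest neighbours."""
    p = points[index]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != index)
    return dists[:k]

def rigid_transform(points, angle, shift):
    """Rotate about the z-axis by `angle` (radians), then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]

# Toy point cloud (invented for illustration).
cloud = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.0, 0.0),
    (0.0, 0.0, 3.0),
    (1.0, 1.0, 1.0),
]
moved = rigid_transform(cloud, angle=0.7, shift=(5.0, -2.0, 1.5))

before = knn_distances(cloud, index=4, k=3)
after = knn_distances(moved, index=4, k=3)
print(before)
print(after)
```

The two printed lists agree up to floating-point error, even though every coordinate of the cloud has changed.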

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on point cloud registration for surgery using a hybrid Conv-Transformer architecture. It has low relevance to the provided keywords as it lacks world models, MLLM, RL, and agentic components. It only marginally relates to visual encoding and architectural unification (Conv+Transformer). No listed expert authors are present.

Keywords

Point Cloud Registration, Transformer, Geometry-Aware, Computer-Assisted Surgery, Cross-Attention, Partial-to-Full, Geometric Feature, Surgical Navigation

157. MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification FAIL

Score: 7.5 / 35.2

Authors: Sliman Jammal, Andrei Sharf

Published: 2026-06-11

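TL;DR: MagPlus uses a learnable magnification mechanism to convert micro-expressions into regular facial expression signals, so that pretrained facial animation models can generate realistic micro-expression motion without retraining.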

摘要翻译

面部微表情（Facial micro-expressions）是微妙且短暂的面部运动，能够提供关于真实人类情感的重要线索。然而，由于标注的微表情数据有限且底层面部运动极其微弱，对其进行建模和生成仍然具有挑战性。因此，现有的微表情生成方法往往存在质量有限、鲁棒性较弱以及泛化能力较差的问题。本文提出 MagPlus，一种可迁移的微表情处理流程，该流程将微表情分析与标准面部动画模型（standard facial animation models）连接起来。与从头训练专用生成器不同，MagPlus 学习将细微的面部运动放大至常规面部表情的范围，从而将微表情转换为与现有面部表情处理模型兼容的信号。放大后的序列随后由标准面部表情模型用于迁移（transfer）和合成（synthesis）等任务。随后，一个互补的 DeMagPlus 模块将生成的运动恢复至真实的微表情强度水平，同时保持合成的动力学特征。我们在四个面部动画模型上评估了该框架：FOMM、FSRT、MetaPortrait 和 EmoPortraits。这些模型均未在微表情数据上进行训练。实验表明，MagPlus-DeMagPlus 能够使预训练的宏观表情模型（macro-expression models）生成更真实的微表情运动，而无需重新训练骨干网络（backbones）。

Abstract

Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing micro-expression generation methods therefore often suffer from limited quality, weak robustness, and poor generalization. We propose MagPlus, a transferable micro-expression processing pipeline that connects micro-expression analysis with standard facial animation models. Instead of training a dedicated generator from scratch, MagPlus learns to magnify subtle facial motions into the range of regular facial expressions, transforming micro-expressions into signals that are compatible with existing facial expression processing models. The magnified sequence is then used by a standard facial expression model for tasks such as transfer and synthesis. A complementary DeMagPlus module then restores the generated motion back to realistic micro-expression intensity levels while preserving the synthesized dynamics. We evaluate the framework using four facial animation models: FOMM, FSRT, MetaPortrait, and EmoPortraits. None of these models are trained on micro-expression data. Experiments show that MagPlus-DeMagPlus enables pretrained macro-expression models to generate more realistic micro-expression motion without retraining the backbones.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

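Scoring rationale: The paper's core is magnifying facial micro-expressions into regular expressions and making them compatible with existing animation models, which belongs to computer vision and graphics. The provided keywords mainly concern large language models, reinforcement learning, and world models, and have very little connection to the paper. Only 'Unify Models' has a faint link, in the sense of bridging micro and regular expressions; 'Visual Encoder' and 'Latent Reasoning' touch on underlying techniques but are not core contributions, and the remaining keywords are entirely unrelated.

Keywords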

Micro-expressions, Facial Animation, Learnable Magnification, Motion Synthesis, Transferable Pipeline, Facial Expressions, DeMagPlus

158. Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection FAIL

Score: 6.0 / 35.2

Authors: Yongmin Kim, ByeongHoon Jeon, Sungil Kim

Published: 2026-06-11

TL;DR: This paper proposes a rarity-aware conditioning module (RGFiLM) to reduce false alarms in maritime trajectory anomaly detection by accounting for imbalanced context distributions.

摘要翻译

上下文异常检测旨在基于上下文变量识别异常行为，但实际部署中常面临高度不平衡的上下文分布，其中稀有状态可能包含关键信息。在此频率偏差下，上下文条件模型在罕见情境中可能产生不稳定的决策和过多的误报。我们提出稀有度门控特征级线性调制（RGFiLM），这是一种感知稀有性的条件模块，它将特征级调制（即基于上下文对隐藏特征进行缩放与移位）与一个由数据驱动的稀有度评分所控制的门控机制相结合。该稀有度评分基于上下文变量的经验分布进行估计，并调节上下文对中间表示的调制强度：在罕见情境下门控机制更为果断，而在常见情境下则保持保守。我们在海上轨迹异常检测任务中评估 RGFiLM，该方法使用 AIS 运动序列并结合 ERA5 环境上下文，应用于环境敏感绕行场景。当部署于序列异常评分流程时，RGFiLM 在对比的上下文无关及上下文条件方法中实现了最佳的平均 F1-误报率（FPR）权衡。这些结果表明，显式考虑上下文稀有性是减少上下文敏感异常检测中误报的有效方法。

Abstract

Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean trade-off between F1 and false positive rate (FPR) among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
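
One way to picture the mechanism described in this abstract is a gate that blends the unmodified features with their FiLM-modulated version. The sketch below is a minimal reading under stated assumptions, not the authors' formulation: the rarity score is taken as the negative log of a context bin's empirical frequency, the gate is a sigmoid of that score, and the slope, midpoint, FiLM parameters, and context labels are all invented for illustration.

```python
import math
from collections import Counter

def rarity_scores(context_bins):
    """Empirical rarity of each context bin: -log of its relative frequency."""
    counts = Counter(context_bins)
    n = len(context_bins)
    return {c: -math.log(k / n) for c, k in counts.items()}

def gate(rarity, slope=2.0, midpoint=1.5):
    """Sigmoid gate in (0, 1); slope and midpoint are invented values."""
    return 1.0 / (1.0 + math.exp(-slope * (rarity - midpoint)))

def gated_film(h, gamma, beta, g):
    """Blend identity and FiLM: h + g * ((gamma * h + beta) - h)."""
    return [x + g * ((ga * x + be) - x) for x, ga, be in zip(h, gamma, beta)]

# Toy history of discretised contexts: "calm" is frequent, "storm" is rare.
history = ["calm"] * 95 + ["storm"] * 5
rarity = rarity_scores(history)

h = [1.0, -2.0]                       # hidden features
gamma, beta = [1.5, 0.5], [0.2, 0.0]  # FiLM scale and shift (learned in practice)

out_calm = gated_film(h, gamma, beta, gate(rarity["calm"]))
out_storm = gated_film(h, gamma, beta, gate(rarity["storm"]))
print(out_calm, out_storm)
```

With these toy numbers the frequent context leaves the features almost unchanged, while the rare context applies nearly the full scale and shift. Whether the real module opens or closes the gate with rarity in exactly this way is not stated in the abstract.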

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0
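
The arithmetic behind the table above can be reproduced in a few lines: each row's score is its weight times its relevance, and the total is compared with the passing score. The function name, the dictionary layout, and the use of `>=` for the pass check are assumptions made for illustration; this is not the report generator's actual code.

```python
# Hypothetical sketch: reproduces the arithmetic visible in the table above.
WEIGHT = 1.5            # weight shown for every keyword in this report
PASS_THRESHOLD = 35.2   # passing score shown next to each paper's total

# Relevance values (out of 10) copied from the table for paper 158.
relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 2.0,
    "model-based RL": 1.0,
    "Latent Reasoning": 0.0,
    "Agentic Reasoning": 0.0,
}

def weighted_total(relevance_by_keyword, weight=WEIGHT):
    """Sum weight * relevance over all keywords."""
    return sum(weight * r for r in relevance_by_keyword.values())

total = weighted_total(relevance)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

This matches the 6.0 total and FAIL status shown for this paper.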

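Scoring rationale: The paper focuses on context-conditioned maritime anomaly detection (RGFiLM) and has low relevance to the provided keywords, which emphasize large models, world models, and MLLM architectures. 'MultiModal' receives a low score (2.0) for the fusion of AIS and ERA5 data, and 'model-based RL' receives a low score (1.0) because of 'Offline Imitation Learning' in the title, while core concepts such as Tokenizer, Visual Encoder, MLLM, and World Models are absent (0.0). The weighted total is 6.0, far below the dynamic passing score of 35.2. None of the specified expert authors were found.

Keywords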

Maritime Anomaly Detection, Context Conditioning, Rarity-Gated Feature-wise Linear Modulation, Offline Imitation Learning, AIS Motion Sequences, ERA5 Environmental Context, Imbalanced Context Distributions, False Positive Rate

159. Optical Implementation of Equilibrium Propagation Using Spatial Photonic Ising Machines FAIL

Score: 6.0 / 35.2

Authors: Dimitri Vanden Abeele, Daniele Veraldi, Davide Pierangeli, Claudio Conti, Serge Massar

Published: 2026-06-11

TL;DR: This paper demonstrates a hybrid optical-digital implementation of Equilibrium Propagation using a Spatial Photonic Ising Machine to train energy-based networks efficiently for classification tasks.

摘要翻译

均衡传播 (Equilibrium Propagation, EP) 为训练基于能量的网络提供了一种比传统机器学习更具吸引力的替代方案。本文展示了利用空间光子伊辛机 (Spatial Photonic Ising Machine, SPIM) 实现的 EP 混合光数字系统。该 SPIM 利用规范变换方法，通过空间光调制器 (Spatial Light Modulator, SLM) 将连续神经元状态和秩 -1 二值可训练模式光学编码为相位调制，并通过有限差分方案实现推理。该系统在 Wine 分类数据集上进行了实验评估。该方法（包括使用连续耦合和结构化耦合矩阵）的潜力在更复杂的 MNIST 数据集上进行了数值评估。我们的工作为均衡传播的能量高效物理实现提供了一条切实可行的途径。

Abstract

Equilibrium Propagation offers a compelling alternative to traditional machine learning for training energy-based networks. Here we demonstrate a hybrid optical-digital implementation of EP using a Spatial Photonic Ising Machine (SPIM). The SPIM exploits the gauge transformation method to optically encode both continuous neuron states and rank-1 binary trainable patterns as phase modulations via a spatial light modulator, with inference realized using a finite difference scheme. The experimental system is evaluated on the Wine classification dataset. The potential of this approach, including the use of continuous couplings and structured coupling matrices, is evaluated numerically on the more complex MNIST dataset. Our work provides a concrete pathway toward energy-efficient physical implementations of Equilibrium Propagation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on optical hardware implementation of Equilibrium Propagation for energy-based networks, showing low relevance to keywords concerning Multimodal LLMs, World Models, and Reinforcement Learning architectures. Minor overlaps exist regarding visual data (MNIST) and general learning rules, but no tokenization, agentic reasoning, or world modeling is present. The weighted score (6.0) is significantly below the dynamic passing threshold (35.2), indicating low relevance to the specified research direction. No expert authors from the target list were found.

Keywords

Equilibrium Propagation, Spatial Photonic Ising Machine, Optical Implementation, Energy-based Networks, Phase Modulations, Hybrid Optical-Digital, Energy-efficient, Classification Tasks

160. To GAN or Not To GAN: Segmentation Analysis on Mars DEM FAIL

Score: 6.0 / 35.2

Authors: Douglas Dziedzorm Agbeve, Aditya V. Handrale, Salim Fares, Seif E. Idani

Published: 2026-06-11

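TL;DR: The paper compares semantic segmentation and generative adversarial network approaches for automatically detecting mounds in Mars digital elevation models, and finds that artificially generated data did not improve detection performance.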

摘要翻译

为了更好地理解火星表面（以便使火星车能够轻松导航火星），确定土丘的位置是必要的。检测和研究这些形态学特征也有助于我们寻找地外生命的证据，具体而言，即水或生命适宜环境的迹象。此前，土丘的检测是通过将形态学参数手动映射到数字高程模型（DEM）上完成的。本文通过使用基于神经网络（NN）的语义分割方法（Semantic Segmentation），自动检测或预测火星上的土丘来解决这一问题。该方法采用了监督语义分割模型以及生成对抗方法（GAN）。对比实验表明，添加额外的人工生成数据并未改善结果。

Abstract

To better understand the Martian surface, which is needed for rovers to navigate Mars with ease, it is necessary to determine the location of mounds. Detecting and studying these morphologies can also help us find evidence of extraterrestrial life, in this case more specifically water or signs of environments conducive to life. Previously, mounds were detected by manually mapping morphological parameters onto Digital Elevation Models (DEMs). This paper addresses the problem by automatically detecting and/or predicting mounds on Mars using neural-network-based semantic segmentation. Two approaches are used: a supervised semantic segmentation model and a generative adversarial approach. A comparison of the approaches shows that adding extra artificially generated data did not improve the results.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

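Scoring rationale: The paper belongs to traditional computer vision and planetary science, and mainly compares semantic segmentation and GAN approaches on Mars DEM data. The provided keyword set focuses on frontier AI directions such as multimodal large language models (MLLM), world models, reinforcement learning, and unified architectures. The paper does not cover tokenizers, world models, reinforcement learning, or large-model reasoning; it has only faint links to visual feature extraction (akin to Visual Encoder) and model comparison (akin to Unify Models, at the level of comparing approaches). The relevance scores are therefore very low, and the weighted total is far below the dynamic passing score of 35.2.

Keywords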

Semantic Segmentation, GAN, Mars DEM, Mound Detection, Neural Network, Supervised Learning, Morphological Parameters

161. Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization FAIL

Score: 6.0 / 35.2

Authors: Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk

Published: 2026-06-11

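TL;DR: This paper proposes a layer-resolved optimal transport method for detecting hallucinations in machine translation and summarization, and finds that it is effective for source-attention disengagement but limited for faithfulness failures that occur downstream of attention.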

摘要翻译

最优传输 (OT) 已被证明能够在无需任何监督的情况下，通过测量交叉注意力分布与参考分布之间的几何距离，检测神经机器翻译 (NMT) 中的幻觉。我们将此分析扩展到 Fairseq 德英模型 (N=3,414) 的所有六个解码器层，结果表明，Wass-to-Unif 和 Wass-to-Data 是针对不同幻觉类型互补的检测器；检测主要集中在 L1 至 L4 层，而 L5 层对更细微的幻觉类型具有反预测性；此外，幻觉翻译缺乏从第一个解码步骤开始存在于正确翻译中的探索性注意力阶段。我们进一步评估该几何信号是否适用于抽象式摘要生成的忠实度检测：我们的无监督 OT 检测器在 AggreFact 数据集 (N=1,116) 上，于 CNN/XSum 任务上实现了 57.2%/57.6% 的平衡准确率——虽高于随机水平，但显著低于监督式 MiniCheck-Flan-T5-L (69.9%/74.3%)。这种差距是有其原理的：与 NMT 幻觉不同，不忠实的摘要可以正确关注源令牌，同时误述其内容，这是一种从构造上就对基于浓度的 OT 指标不可见的失败模式。在 T5-base 模型上的结构实验证实了深度方向上解码器组织的一致性，其中第 3 层显示出峰值浓度，而第 12 层对生成质量最为关键。综上所述，研究结果表明，当失败模式为源端脱离时，基于交叉注意力的最优传输 (OT) 是一种可靠的检测器；它是一种无论任务如何都具有原理性的可解释性工具；然而，当忠实度失败发生在注意力机制下游时，其能力受到根本性限制。

Abstract

Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model (N = 3,414), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1-L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact (N = 1,116) achieves 57.2%/57.6% balanced accuracy on CNN/XSum, which is above chance but substantially below the supervised MiniCheck-Flan-T5-L (69.9%/74.3%). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
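
As a rough illustration of a "distance to uniform" score on an attention distribution, the sketch below assumes a 0/1 ground cost, under which the optimal transport cost equals the total variation distance. The abstract does not state which cost the paper uses, the function name is hypothetical, and the attention vectors are invented; how such a score is thresholded into a hallucination decision is deliberately left out.

```python
def wass_to_uniform(attention):
    """Transport cost between an attention distribution and the uniform one.

    With a 0/1 ground cost, the optimal transport cost equals the total
    variation distance: 0.5 * sum(|p_i - 1/n|).
    """
    n = len(attention)
    total = sum(attention)
    p = [a / total for a in attention]  # renormalise defensively
    return 0.5 * sum(abs(pi - 1.0 / n) for pi in p)

spread = [0.25, 0.25, 0.25, 0.25]  # attention spread evenly over source tokens
peaked = [0.97, 0.01, 0.01, 0.01]  # attention collapsed onto a single token

print(wass_to_uniform(spread), wass_to_uniform(peaked))
```

The score depends only on how concentrated the attention is, not on whether the attended content is represented correctly, which is the limitation the abstract identifies for summarization.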

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

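Scoring rationale: The paper's core is hallucination detection for NLP tasks (NMT and summarization), using optimal transport to analyze attention layers. The provided keyword set (multimodality, world models, RL, and so on) is a poor match for the paper's topic. Only Unify Models, Tokenizer, and Latent Reasoning show faint links (1.0, 1.0, and 2.0 respectively); the rest score 0.0. The weighted total is 6.0, far below the dynamic passing score of 35.2, indicating that the paper is essentially unrelated to the given research background.

Keywords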

Optimal Transport, Hallucination Detection, Neural Machine Translation, Abstractive Summarization, Cross-Attention, Decoder Layers, Faithfulness Detection

162. $α$-fair heterogeneous agent reinforcement learning FAIL

Score: 6.0 / 35.2

Authors: Yao-hua Franck Xu, Tayeb Lemlouma, Jean-Marie Bonnin, Arnaud Braud

Published: 2026-06-11

TL;DR: This paper proposes a novel alpha-fair framework for heterogeneous-agent reinforcement learning that ensures monotonic improvement and convergence to Nash Equilibria while balancing utilitarian efficiency and equitable reward distribution.

摘要翻译

多智能体系统中的合作通常通过功利主义目标进行优化，这些目标旨在最大化整体效率，却往往忽视奖励分配，从而导致不平等的“领导者 - 跟随者”动态。尽管基于公平性的方法鼓励亲社会行为，使每个智能体都能从合作中受益，但许多现有算法——包括那些利用奖励塑造的——要么破坏了马尔可夫博弈（Markov Games）的平稳性，要么缺乏严格的理论保证。这在公平目标方法与理论上安全的学习框架之间造成了关键的鸿沟。我们提出了一种新颖的框架，将 α-公平性与异质智能体信任区域学习（HATRL）相结合，确保单调改进并收敛至纳什均衡（Nash Equilibria）。我们的方法利用公平优势函数，根据智能体的期望回报动态加权其效用，从而使全局目标能够从纯粹的功利主义效率过渡到基于参数 α 的 α-公平性福利。我们提出了两种实用算法：α-公平 HATRPO 和 α-公平 HAPPO，并通过在 CleanUp 和 CommonHarvest 等序贯社会困境中的实验表明，这两种算法在功利主义视角下优于 HATRL 原有的算法，同时实现了更高的社会整体收益。

Abstract

Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable "leader-follower" dynamics. While fairness-based approaches encourage pro-social behaviors where every agent benefits from cooperation, many current algorithms, including those that use reward shaping, either break the stationarity of Markov Games or lack rigorous theoretical guarantees. This creates a critical gap between fair objective methods and theoretically safe learning frameworks. We propose a novel framework that bridges $α$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL), ensuring monotonic improvement and convergence toward Nash Equilibria. Our approach leverages a fair advantage function that dynamically weights agent utilities based on their expected returns, allowing the global objective to transition from purely utilitarian efficiency to $α$-fairness welfare based on the parameter $α$. We introduce two practical algorithms, $α$-fair HATRPO and $α$-fair HAPPO, and demonstrate through experiments in sequential social dilemmas like CleanUp and CommonHarvest that they perform better than HATRL's algorithms from a utilitarian point of view while achieving socially better outcomes.
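
The transition from utilitarian efficiency to fairness as $α$ grows can be seen with the standard α-fair welfare function from the resource-allocation literature. The sketch below assumes that textbook form (sum of x^(1-α)/(1-α), with the log case at α = 1); the abstract does not give the paper's exact objective or its fair advantage function, and the returns are invented.

```python
import math

def alpha_fair_welfare(returns, alpha):
    """Standard alpha-fair welfare of positive per-agent returns.

    alpha = 0 gives the utilitarian sum, alpha = 1 gives the sum of logs
    (proportional fairness), and larger alpha weights low returns more.
    """
    if any(r <= 0 for r in returns):
        raise ValueError("returns must be positive")
    if math.isclose(alpha, 1.0):
        return sum(math.log(r) for r in returns)
    return sum(r ** (1.0 - alpha) / (1.0 - alpha) for r in returns)

unequal = [9.0, 1.0]  # a "leader-follower" split
equal = [5.0, 5.0]    # same total, shared evenly

for alpha in (0.0, 1.0, 2.0):
    print(alpha, alpha_fair_welfare(unequal, alpha), alpha_fair_welfare(equal, alpha))
```

At α = 0 the two allocations tie because only the total matters; for α = 1 and α = 2 the even split scores higher.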

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on fairness in heterogeneous multi-agent RL, while keywords target multimodal/LLM architectures (e.g., Tokenizer, MLLM, MultiModal). No overlap exists for vision, tokenization, or world models. 'model-based RL' and 'Agentic Reasoning' receive minimal scores due to general RL/Agent domain relevance, but specific methodological mismatches limit relevance. Total Weighted Score: 6.0, below the 35.2 threshold.

Keywords

Heterogeneous-Agent Reinforcement Learning, Alpha-fairness, Trust Region Learning, Nash Equilibria, Multi-agent Systems, Fair Advantage Function, Sequential Social Dilemmas

163. Operads for compositional reasoning in LLMs FAIL

Score: 6.0 / 35.2

Authors: Nathaniel Bottman, Kyle Richardson

Published: 2026-06-11

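TL;DR: This paper proposes an operad-based framework for compositional reasoning in LLMs to improve question-answering consistency; it does not address multimodality, world models, or reinforcement learning.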

摘要翻译

问题分解（即把复杂问题分解为更简单的子问题，其答案组合后生成最终答案）是一种广泛用于改进大语言模型（LLM）推理的策略，但目前尚缺乏严格的数学基础。在本文中，我们提出操作子（Operads）——一种建模多输入单输出操作及其复合的数学结构——作为描述问题分解的自然框架。我们定义了问题操作子 Q，其中操作对应于问题模板，复合对应于子答案的替换，并展示了如何将问答（QA）模型解释为 Q 上的代数。除了重构现有实践外，这种操作子视角还指向新方法，特别是操作子一致性的概念，它衡量问答（QA）模型的答案在问题分解树的部分坍缩上是否保持一致。操作子一致性的实证评估已在我们的配套论文（Bottman, Liu, and Richardson, 2026）中报告，研究发现该指标在十二个大语言模型和四个多跳问答（QA）数据集上与准确率强相关，且优于标准的基于温度的自一致性基线。我们认为操作子是问题分解的自然数学归宿，且操作子一致性等不变量为分析和改进多步推理的可靠性开辟了新的方向。

Abstract

Question decomposition, i.e. breaking a complex query into simpler sub-queries whose answers are composed to produce a final answer, is a widely used strategy for improving LLM reasoning, yet it currently lacks a rigorous mathematical foundation. In this paper, we propose operads, mathematical structures that model many-in, one-out operations and compositions thereof, as a natural framework for describing question decomposition. We define the questions operad $Q$, in which operations correspond to question templates and composition corresponds to substitution of sub-answers, and show how QA models can be interpreted as algebras over $Q$. Beyond reframing existing practice, this operadic perspective points toward new methods, in particular a notion of operadic consistency, which measures whether a QA model's answers agree across the partial collapses of a question decomposition tree. Empirical evaluation of operadic consistency is reported in our companion paper (Bottman, Liu, and Richardson, 2026), which finds it strongly correlated with accuracy across twelve LLMs and four multi-hop QA datasets and outperforming standard temperature-based self-consistency baselines. We argue that operads are the natural mathematical home for question decomposition, and that invariants such as operadic consistency open new directions for analyzing and improving the reliability of multi-step reasoning.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	2.0/10	3.0

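Scoring rationale: The paper's core is the application of an algebraic structure (operads) to LLM reasoning, which has no direct overlap with multimodality (MLLM, MultiModal, Visual Encoder), generative modeling (World Models, Tokenizer), or reinforcement learning (model-based RL). It does concern reasoning, but symbolic compositional reasoning rather than latent-space or agentic reasoning, so the last two keywords (Latent Reasoning, Agentic Reasoning) receive only a weak relevance score.

Keywords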

Operads, Compositional Reasoning, Question Decomposition, LLMs, Operadic Consistency, QA Models, Multi-hop QA, Algebraic Structures

164. When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval [FAIL]

Score: 6.0 / 35.2

Authors: Tongyao Zhu, Chao-Ming Huang, Min-Yen Kan

Published: 2026-06-11

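TL;DR: This paper analyzes query embedding interpolation for mixed-language queries in multilingual dense retrieval, finding that an optimal mixing ratio exists and that English dominance produces an asymmetry in retrieval performance.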

Abstract (Chinese translation)

尽管在多语言社区中混合语言查询普遍存在，但稠密检索器对此类查询的敏感性仍知之甚少。我们在 mMARCO 上开展了一项比例控制研究，通过嵌入级混合（即将混合查询构建为单语言嵌入的插值）调整查询的平行翻译的混合比例，从而系统性地评估检索性能。基于 BGE-M3 的实验表明，最优混合比例在 105 种情况中的 88 种情况下优于最佳单语言端点。我们发现了一种由英语主导驱动的非对称性：从非英语文档索引中检索时，混合查询普遍有益；而包含英语的索引则最适合使用纯英语查询。此外，英语对于每一种非英语文档语言而言都是最强的混合伙伴。最后，在控制英语主导因素后，混合收益与类型学距离呈负相关。我们得出结论，语言混合敏感性是结构化且可预测的，并且我们在不同模型家族和规模上验证了这些模式的稳健性。

Abstract

While mixed-language querying is ubiquitous in multilingual communities, the sensitivity of dense retrievers to such queries remains poorly understood. We present a ratio-controlled study on mMARCO that systematically evaluates retrieval performance by varying the mixing proportion of parallel query translations via embedding-level mixing -- constructing mixed queries as an interpolation of monolingual embeddings. Experiments with BGE-M3 demonstrate that an optimal mixing ratio outperforms the best monolingual endpoint in 88/105 cases. We uncover a distinct asymmetry driven by English dominance: mixing is uniformly beneficial when retrieving from non-English document indices, whereas indices containing English are best served by pure English queries. Furthermore, English acts as the strongest mixing partner for every non-English document language. Finally, when controlling for English dominance, mixing gains correlate negatively with typological distance. We conclude that language-mix sensitivity is structured and predictable, and we validate the robustness of these patterns across model families and scales.
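
The embedding-level mixing described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names, the toy 3-dimensional vectors, and the re-normalization step are assumptions made for illustration, and a real experiment would use embeddings from a retriever such as BGE-M3.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mix_embeddings(e_a, e_b, alpha):
    """Interpolate two monolingual query embeddings.

    alpha = 0.0 returns language A only, alpha = 1.0 language B only.
    Re-normalizing the result is an assumption, not a detail from the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(e_a, e_b)]
    return l2_normalize(mixed)

# Toy stand-ins for the embeddings of two parallel query translations
# and of one document; real embeddings would have hundreds of dimensions.
q_lang_a = l2_normalize([1.0, 0.0, 0.0])
q_lang_b = l2_normalize([0.0, 1.0, 0.0])
doc = l2_normalize([1.0, 1.0, 0.0])

# Sweep the mixing ratio and score the document by cosine similarity.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    q_mix = mix_embeddings(q_lang_a, q_lang_b, alpha)
    print(f"alpha={alpha:.2f}  similarity={dot(q_mix, doc):.3f}")
```

In this toy setup the similarity peaks at an intermediate ratio, which mirrors the kind of "optimal mixing ratio beats both monolingual endpoints" pattern the abstract reports, though the numbers here carry no empirical meaning.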

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0
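
The total in the "Score" line can be reproduced from the table above. The sketch below assumes, based only on the numbers shown in this report, that each row's score is weight × relevance, that the total is the sum of the rows, and that an entry passes when the total reaches the passing score (35.2); the exact comparison rule is not stated in the report.

```python
# Assumption: row score = weight * relevance; total = sum of row scores.
# Relevance values are copied from the table for entry 164.
WEIGHT = 1.5
PASS_THRESHOLD = 35.2  # the value after the slash in "Score: 6.0 / 35.2"

relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 1.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
    "Latent Reasoning": 1.0,
    "Agentic Reasoning": 0.0,
}

def total_score(rel, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in rel.values())

total = total_score(relevance)
# Assumption: ">=" is the pass rule; the report only shows the two numbers.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```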

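Scoring rationale: The paper analyzes the performance of query embedding interpolation in multilingual dense retrieval and belongs to information retrieval. The keyword set mainly covers multimodal large models, world models, and reinforcement-learning architectures (e.g. visual encoders, model-based RL, agentic reasoning), a clearly different domain from multilingual text retrieval. 'Tokenizer' and 'Latent Reasoning' are weakly related because the work operates on embedding spaces, while keywords such as 'Visual Encoder', 'World Models', and 'MultiModal' are entirely unrelated, so relevance scores are low across the board.

Keywords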

Multilingual Dense Retrieval, Query Embedding Interpolation, Mixed-language Querying, English Dominance, Typological Distance, mMARCO, BGE-M3

165. An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect [FAIL]

Score: 6.0 / 35.2

Authors: Dihia Lanasri, Fatima Benbarek

Published: 2026-06-11

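TL;DR: This paper proposes a hybrid framework that combines transformer embeddings with a classical classifier; with domain-specific pre-training and synthetic data augmentation, it reaches an F1-score of 0.84 on rumour detection in low-resource Algerian-dialect social media.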

Abstract (Chinese translation)

社交媒体的快速增长加剧了谣言的传播。在阿尔及利亚背景下，这一问题更具挑战性，原因在于方言内容的非正式性和语码切换特性、标注资源的稀缺性，以及标准阿拉伯语 NLP 工具在方言文本上的有效性有限。本文提出了一种面向阿尔及利亚方言社交媒体内容的端到端谣言检测混合框架。我们通过结合真实社交媒体帖子、合成数据和 FASSILA 语料库，构建了一个领域特定的标注数据集，并采用基于相似性的标注过程进行自动标注。此外，还引入了一种转写管道，用于生成阿拉伯字母和阿拉伯语拉丁化（Arabizi）的平行数据集。我们评估了多种方法，包括经典机器学习、深度学习、transformers 和混合模型。实验结果表明，结合 transformer 嵌入与经典分类器的混合方法取得了最佳性能，F1-score 达到 0.84。我们还发现，领域特定的预训练比模型规模更重要，在社交媒体上训练的模型优于在正式阿拉伯语语料库上训练的大型模型。这些结果表明，在低资源阿尔及利亚方言环境下进行谣言检测是可行的。

Abstract

The rapid growth of social media has intensified the spread of rumours. This issue is more challenging in the Algerian context due to the informal and code-switched nature of dialectal content, the scarcity of annotated resources, and the limited effectiveness of standard Arabic NLP tools on dialect text. This paper presents an end-to-end rumour detection hybrid framework for Algerian dialect social media content. We build a domain-specific annotated dataset by combining real social media posts, synthetic data, and the FASSILA corpus, with automatic labeling based on a similarity-based annotation process. A transliteration pipeline is also introduced to generate parallel datasets in Arabic script and Arabizi. We evaluate multiple approaches, including classical machine learning, deep learning, transformers, and hybrid models. Experimental results show that a hybrid approach combining transformer embeddings with a classical classifier achieves the best performance, reaching an F1-score of 0.84. We also find that domain-specific pre-training is more important than model size, with social media-trained models outperforming larger models trained on formal Arabic corpora. These results demonstrate the feasibility of rumour detection in low-resource Algerian dialect settings.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

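Scoring rationale: The paper addresses rumour detection, a natural language processing task, for the low-resource Algerian dialect. The keyword set mainly covers multimodal large models, world models, and reinforcement learning, which differ greatly from this paper's domain. 'Unify Models' scores 2 because the paper adopts a hybrid framework (Transformer + classical classifier), and 'Tokenizer' scores 2 because a text-conversion pipeline is involved; the remaining keywords, such as visual encoders, RL, multimodality, and world models, are entirely unrelated to a text-only task and score 0. None of the designated experts appears in the author list.

Keywords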

Rumour Detection, Algerian Dialect, Low-Resources, Hybrid Framework, Transformer Embeddings, Social Media, Domain-specific Pre-training, Transliteration Pipeline

Score: 6.0 / 35.2

Authors: Junhong Liang, Noor Abo Mokh, Bashar Alhafni

Published: 2026-06-11

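TL;DR: This paper uses SemCog Bench to evaluate LLMs on Arabic-Hebrew word pairs (cognates, false friends, and loanwords), finding that models over-rely on surface-form similarity and show significant limitations in semantic disambiguation.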

Abstract (Chinese translation)

阿拉伯语和希伯来语作为关系密切的闪米特语族（Semitic languages）语言，共享大量包含真同源词（true cognates）、误导性假朋友（false friends）以及现代借词（modern loanwords）的词汇。这种词汇重叠给大语言模型（LLMs）的跨语言语义理解带来了挑战。为了评估这一能力，我们引入了 SemCog Bench，这是一个精心构建的基准，包含 1,858 个阿拉伯语 - 希伯来语词对，并附有用于同源词识别和语义消歧的句子级标注。我们在多种输入表示（原始形式、带元音符号形式、罗马化形式和音译形式）下评估了开源和商业大语言模型，并揭示了跨语言推理中存在的关键差距。尽管模型在真同源词上实现了高准确率，但在假朋友和借词上的性能却急剧下降，这反映了模型对表面形式相似性的强烈依赖。此外，句子级上下文仅带来了有限的改进，表明仅凭上下文线索不足以克服误导性的基于形式的信号。这些发现揭示了当前大语言模型在解决跨语言形式 - 意义冲突方面的根本局限性，并确立了 SemCog Bench 作为多语言语义推理的严格基准。我们的代码和数据已公开提供。

Abstract

Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic--Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form--meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

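Scoring rationale: The paper focuses on cross-lingual semantic evaluation of Arabic-Hebrew cognate word pairs, a text-only NLP task. The supplied keywords mainly concern multimodal world models, reinforcement learning, and unified architectures (e.g. Visual Encoder, World Models, MLLM, model-based RL), which are largely unrelated to this paper, so most keywords score 0. Tokenizer has a slight link (2 points) because the work involves input-representation preprocessing (e.g. diacritized, Romanized); Unify Models and Latent Reasoning have a very weak link (1 point each) because the work involves model evaluation and semantic reasoning. The author list does not include any designated expert. The weighted total is 6.0, below the dynamic passing score of 35.2.

Keywords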

Arabic-Hebrew Cognates, LLM Evaluation, Cross-lingual Understanding, SemCog Bench, Semantic Disambiguation, Surface-form Similarity, Multilingual Benchmark

167. sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling [FAIL]

Score: 6.0 / 35.2

Authors: Katharina Sommer, Tristan Till, Florian Matthes

Published: 2026-06-11

TL;DR: This paper proposes a privacy-preserving, two-stage local LLM pipeline for medical CRF filling that achieves competitive performance without external APIs or fine-tuning.

Abstract (Chinese translation)

从非结构化电子病历（EHR）记录中提取结构化临床信息，一直是医疗信息学领域面临的持续瓶颈。尽管大语言模型（LLMs）表现优异，但其在临床环境中的部署却受到隐私风险、推理成本以及倾向于在文本证据之外产生幻觉的阻碍。针对 CL4Health 2026 病例报告表（CRF）填写任务，我们提出了一种完全本地化、领域适配的流程方案，该方案基于 MedGemma-27B 模型构建。我们的两阶段架构将二元存在性分类与值提取分离开来，强制严格遵守文本证据，并确保对于否定、不确定或未知状态产生确定性输出。通过利用针对特定项目的少样本上下文学习，无需外部 API 调用或微调，我们的方法在官方英语测试赛道上取得了 0.55 的宏 F1 分数。这一成绩在所有本地托管的开源提交方案中位列第二。我们的工作表明，隐私保护、本地部署的 LLM 流程可以实现与专有前沿模型近乎相当的性能，为临床自然语言处理（NLP）提供了一个实用的、数据主权框架。

Abstract

The extraction of structured clinical information from unstructured EHR notes is a persistent bottleneck in healthcare informatics. While large language models (LLMs) offer high performance, their deployment in clinical settings is hindered by privacy risks, inference costs, and the tendency to hallucinate beyond textual evidence. We address these challenges for the CL4Health 2026 Case Report Form (CRF) filling task by proposing a fully local, domain-adapted pipeline using the MedGemma-27B model. Our two-stage architecture, which separates binary presence classification from value extraction, enforces strict adherence to textual evidence and ensures deterministic outputs for negated, uncertain, or unknown states. By leveraging item-specific, few-shot in-context learning without external API calls or fine-tuning, our approach achieves a macro-F1 score of 0.55 on the official English test track. This result secures second place among all locally-hosted, open-source submissions. Our work demonstrates that privacy-preserving, on-premise LLM pipelines can achieve near-competitive performance with proprietary frontier models, providing a practical, data-sovereign framework for clinical NLP.
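
The two-stage idea in the abstract (decide presence first, extract a value only if present, and fall back to a fixed output otherwise) can be sketched as follows. Everything here is hypothetical: the prompts, the `fill_item` helper, the "unknown" fallback string, and the rule-based `toy_model` that stands in for a local LLM call are illustrative assumptions, not the authors' pipeline or any real model API.

```python
from typing import Callable

UNKNOWN = "unknown"  # assumed fixed fallback for absent or unsupported items

def fill_item(note: str, item: str, ask: Callable[[str], str]) -> str:
    """Fill one CRF item from a clinical note in two stages."""
    # Stage 1: binary presence classification.
    presence = ask(f"Does the note mention '{item}'? Answer yes or no.\n\n{note}")
    if presence.strip().lower() != "yes":
        return UNKNOWN
    # Stage 2: value extraction, run only when the item is present.
    value = ask(f"Quote the value of '{item}' from the note.\n\n{note}").strip()
    # Evidence check: keep the value only if it appears verbatim in the note.
    return value if value and value.lower() in note.lower() else UNKNOWN

def toy_model(prompt: str) -> str:
    """Rule-based stand-in for an LLM; it does NOT handle negation."""
    question, _, note = prompt.partition("\n\n")
    item = question.split("'")[1]
    if question.startswith("Does"):
        return "yes" if item in note.lower() else "no"
    # Stage-2 stub: return the text after the item name, up to the period.
    start = note.lower().index(item) + len(item)
    return note[start:].split(".")[0].strip()

note = "Heart rate 88 bpm. Patient resting comfortably."
print(fill_item(note, "heart rate", toy_model))
print(fill_item(note, "blood pressure", toy_model))
```

Separating the stages means the extraction prompt is never issued for absent items, which is one plausible way to obtain the fixed outputs for unknown states that the abstract mentions.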

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on Clinical NLP using local LLMs for structured extraction from EHR notes. It does not involve multimodal components (Visual Encoder, MLLM, MultiModal), reinforcement learning (model-based RL), or world modeling concepts. The two-stage pipeline offers slight relevance to 'Unify Models' regarding task unification, but lacks architectural unification. Tokenizer and reasoning keywords are not core contributions. No listed expert authors are present.

Keywords

Clinical NLP, Local LLM, CRF Filling, Two-Stage Pipeline, In-Context Learning, Privacy-Preserving, EHR Extraction, MedGemma

168. Emotional regulation improves deep learning-based image classification [FAIL]

Score: 4.5 / 35.2

Authors: Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici

Published: 2026-06-11

TL;DR: This paper proposes an Emotional Regulation framework that pre-trains deep learning models on affective stimuli to enhance image classification performance and generalization.

Abstract (Chinese translation)

情绪显著影响认知，在某些条件下增强记忆和学习。基于这一原理，情绪增强深度学习（emotion-augmented deep learning）探究情感状态如何改进神经网络架构和学习范式，从而实现比非情绪模型更好的泛化能力。然而，现有方法往往仅依赖客观神经生理因素，忽视了情绪中主观性的作用。为填补这一空白，本研究引入了 Emotional Regulation（情绪调节），这是一种通过人工主观体验在深度学习中建模情绪的新框架。该方法基于情感刺激进行预训练，在下游任务优化中平衡非情绪和受情绪影响的响应。在图像分类任务中进行了广泛的实验，在四个情感数据集上预训练 ResNet 和 ViT 架构，使用 CIFAR-10 和 CIFAR-100 作为目标基准。结果表明相较于上述骨干网络有所改进，提供了证据表明 Emotional Regulation 是一种通过人工主观体验定义情绪增强深度学习的有效方法。此外，该方法超越了基于 CIFAR 的图像分类相关工作，表明 Emotional Regulation 成为大规模视觉数据集上情绪增强深度学习的新最先进技术。本研究还提供了证据，证明情感状态对改善机器学习任务优化的影响，鼓励进一步研究情绪启发式架构。

Abstract

Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally-influenced responses in downstream task optimization. Extensive experimentation was conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets, using CIFAR-10 and -100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence of Emotional Regulation as a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach overcomes the related work in image classification based on CIFAR, revealing Emotional Regulation as the new state-of-the-art in emotion-augmented deep learning for large-scale vision datasets. The study also enforces evidence of the impact of affective states in improving machine learning tasks' optimization, encouraging further investigation on emotion-inspired architectures.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

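Scoring rationale: The paper studies how emotional regulation improves deep-learning image classification, using ResNet and ViT architectures. The topics behind the keywords, including multimodal large models (MLLM), world models, reinforcement learning, Tokenizer, and agentic reasoning, have no direct connection to the paper. Visual Encoder and MultiModal receive very low scores only because the paper uses the ViT architecture (which contains a visual encoder) and touches on the interaction between emotion and vision; the remaining keywords are entirely unrelated.

Keywords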

Emotional Regulation, Image Classification, Affective Stimuli, Deep Learning, ResNet, ViT, Generalization, Pre-training

169. Augmentation techniques for video surveillance in the visible and thermal spectral range [FAIL]

Score: 4.5 / 35.2

Authors: Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

Published: 2026-06-11

TL;DR: This paper investigates data augmentation techniques for CNN-based object detection using visible and thermal infrared cameras to improve robustness in video surveillance systems.

Abstract (Chinese translation)

在智能视频监控中，摄像机在白天和夜间记录图像序列。通常，这需要不同的传感器。为了获得更好的性能，将它们结合起来并不罕见。我们关注的情况是：一个长波红外（long-wave infrared）摄像机连续记录，此外，另一个摄像机在白天记录可见光谱范围（visible spectral range）内的图像，且一种智能算法对所采集的影像进行监控。更准确地说，我们的任务是基于多光谱（multispectral）卷积神经网络（CNN）的目标检测。乍一看，源自可见光谱范围的图像与热红外图像有所不同：一方面，前者包含颜色和明显的纹理信息；另一方面，前者不包含物体发出的热辐射信息。尽管颜色可以为分类任务提供有价值的信息，但诸如光照变化和各种传感器的特性等因素仍然构成显著问题。无论如何，获取足够且实用的热红外数据集以训练深度神经网络（deep neural network）仍然是一个挑战。这就是为什么借助可见光谱范围数据进行训练可能具有优势，特别是当待评估的数据同时包含可见光和红外数据时。然而，目前尚不清楚热辐射、形状或颜色信息的变异在多大程度上影响分类精度。为了更深入地了解卷积神经网络（Convolutional Neural Networks）如何做出决策以及它们从不同传感器输入数据中学到什么，我们调查了不同数据增强（augmentation）技术的适用性和鲁棒性...

Abstract

In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors. To achieve a better performance it is not unusual to combine them. We focus on the case that a long-wave infrared camera records continuously and in addition to this, another camera records in the visible spectral range during daytime and an intelligent algorithm supervises the picked up imagery. More accurate, our task is multispectral CNN-based object detection. At first glance, images originating from the visible spectral range differ between thermal infrared ones in the presence of color and distinct texture information on the one hand and in not containing information about thermal radiation that emits from objects on the other hand. Although color can provide valuable information for classification tasks, effects such as varying illumination and specialties of different sensors still represent significant problems. Anyway, obtaining sufficient and practical thermal infrared datasets for training a deep neural network poses still a challenge. That is the reason why training with the help of data from the visible spectral range could be advantageous, particularly if the data, which has to be evaluated contains both visible and infrared data. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

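Scoring rationale: The paper belongs to traditional computer vision, studying multispectral CNN-based object detection over visible and thermal-infrared imagery together with data augmentation techniques. The supplied keyword set targets modern large models, world models, and reinforcement learning (e.g. Tokenizer, MLLM, World Models, Agentic Reasoning), a clear domain mismatch with this paper. Only 'MultiModal' has a weak link (score 2) because of the multi-sensor fusion of visible and thermal imagery, and 'Visual Encoder' has a very weak link (score 1) because a CNN acts as an encoder; the remaining keywords are entirely unrelated. The weighted total is far below the dynamic passing score. The author list does not include any designated expert (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus applies.

Keywords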

Visible spectral range, Thermal infrared, CNN-based object detection, Data augmentation, Multispectral fusion, Video surveillance, Deep neural network, Sensor input data

170. The Illusion of Multi-Agent Advantage [FAIL]

Score: 4.5 / 35.2

Authors: Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

Published: 2026-06-11

TL;DR: This paper critically evaluates Multi-Agent Systems against Single-Agent baselines, demonstrating that automatically generated multi-agent architectures underperform single-chain-of-thought methods despite higher computational costs, revealing significant architectural inefficiencies.

Abstract (Chinese translation)

普遍共识主张多智能体系统 (MAS) 优于单智能体系统 (SAS)，其依据在于上下文保护、并行处理及分布式决策等优势。然而，对该主张的实证支持主要依赖于与 SAS 基线的比较，所使用的基准测试优先于孤立推理任务，无法充分评估上述优势。针对旨在比手动设计版本具有更强泛化性的自动生成的 MAS，本研究将其与 SAS 进行了严格系统的评估，具体对象为带自洽性的思维链 (CoT-SC)。在传统推理数据集及具有交互式多步工作流的任务（例如 BrowseComp-Plus）上，我们发现自动生成的 MAS 的表现始终不及 CoT-SC，尽管其成本高达 10 倍。为了排除任务结构固有局限性对这些失败的影响，我们引入了一种诊断性合成数据集，该数据集专为 MAS 设计，具备显式任务分解、上下文分离及并行化潜力。我们表明，在该数据集上，专家架构的 MAS 在原始性能和成本效益方面始终优于自动生成的架构，这一结果揭示了现有评估框架因未考虑增加计算成本的边际效用，从而掩盖了复杂 MAS 的关键架构差距与低效问题。至关重要的是，对生成的 MAS 架构的系统性解构揭示了当前的自动化设计范式产生了架构臃肿，其优先于表面复杂性而无法转化为功能效用，这暴露了与多智能体原则的根本性错位。

Abstract

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
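
For readers unfamiliar with the single-agent baseline named in the abstract, Chain-of-Thought with Self-Consistency is commonly described as sampling several reasoning chains and taking a majority vote over their final answers. The sketch below shows only that voting step; `self_consistent_answer` and the canned sampler are hypothetical stand-ins, and no real model is called.

```python
from collections import Counter

def self_consistent_answer(sample_answer, question, n_samples=5):
    """Sample n final answers and return (majority answer, agreement rate).

    `sample_answer` is any callable that returns the final answer of one
    sampled reasoning chain; here it is an assumption, not a real API.
    """
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Canned "samples" so the example is deterministic.
canned = iter(["42", "41", "42", "42", "40"])
answer, agreement = self_consistent_answer(lambda q: next(canned), "6 * 7 = ?")
print(answer, agreement)  # 42 0.6
```

The cost of this baseline grows linearly with `n_samples`, which is the kind of compute accounting the abstract's "up to 10x more expensive" comparison depends on.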

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	3.0/10	4.5

Scoring rationale: The paper evaluates Multi-Agent Systems versus Single-Agent Systems in reasoning tasks, highlighting architectural inefficiencies in automatic MAS design. It does not address multimodal architectures, tokenizers, visual encoders, world models, or reinforcement learning. Only 'Agentic Reasoning' has marginal relevance due to the multi-agent context, while all other keywords are completely unrelated to the paper's content.

Keywords

Multi-Agent Systems, Single-Agent Systems, Chain-of-Thought, Self-Consistency, Architectural Bloat, Reasoning Evaluation, Automatic vs Expert Architecture

171. Generative Modeling of Bach-Style Symbolic Music: A Comparative Study of Autoregressive, Latent-Variable, and Adversarial Approaches [FAIL]

Score: 4.5 / 35.2

Authors: Kyuil Lee, Dezhi Yu, Yongkang Huang

Published: 2026-06-11

TL;DR: This paper compares autoregressive, latent-variable, and adversarial models for generating Bach-style symbolic music, finding that autoregressive LSTMs produce the most musically coherent samples.

Abstract (Chinese translation)

本研究利用共享 MIDI 语料库，采用三类模型家族对巴赫风格的符号化钢琴音乐进行生成建模：带注意力机制的自回归 LSTM、包括循环变分自编码器（VAE）和向量量化变分自编码器（VQ-VAE）在内的潜在变量模型，以及生成对抗网络（GAN）。我们比较了它们在建模复调音符序列、学习有效潜在表示以及生成风格连贯作品方面的能力。实验结果表明，带注意力机制的自回归 LSTM 生成的样本音乐连贯性最高，而向量量化有助于缓解后验坍塌，并产生比常规循环变分自编码器更具结构化的输出。对抗方法能够捕捉局部音高模式，但训练依然困难，且在泛化至巴赫风格时可靠性较低。这些结果突显了自回归、潜在变量及对抗方法在符号化音乐生成中的相对优势和失效模式。

Abstract

We study generative modeling of Bach-style symbolic piano music using a shared MIDI corpus and three model families: autoregressive LSTMs with attention, latent-variable models including recurrent VAEs and vector-quantized VAEs, and generative adversarial networks. We compare their ability to model polyphonic note sequences, learn useful latent representations, and generate stylistically coherent compositions. Our experiments show that the autoregressive LSTM with attention produces the most musically coherent samples, while vector quantization helps mitigate posterior collapse and yields more structured outputs than conventional recurrent VAEs. The adversarial approach captures local pitch patterns but remains difficult to train and generalizes less reliably to Bach's style. These results highlight the relative strengths and failure modes of autoregressive, latent-variable, and adversarial approaches for symbolic music generation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on symbolic music generation using classical generative models (LSTM, VAE, GAN), showing minimal overlap with keywords centered on Multimodal LLMs, World Models, and Reinforcement Learning. 'Latent Reasoning' and 'Tokenizer' have slight relevance due to latent-variable models and symbolic MIDI tokens, while others are completely unrelated. No matching experts were found from the provided list.

Keywords

Generative Modeling, Symbolic Music, Autoregressive Models, Latent-Variable Models, Adversarial Approaches, MIDI Corpus, Polyphonic Sequences, Style Coherence

172. Quantizing Time-Series Models As Dynamical Systems: Trajectory-Based Quantization Sensitivity Score [FAIL]

Score: 4.5 / 35.2

Authors: Mariya Pavlova, Harrison Bo Hua Zhu, Elizsveta Semenova, Yingzhen Li

Published: 2026-06-11

TL;DR: This paper proposes a trajectory-based quantization sensitivity score to optimize mixed-precision deployment for time-series models without calibration data, leveraging dynamical systems stability analysis.

Abstract (Chinese translation)

我们引入了基于轨迹的量化敏感度评分（TQS），这是一种通过动力系统稳定性视角重新审视后训练量化（PTQ）的指标。通过将网络的展开过程建模为离散时间动力系统，TQS 刻画了量化引起的误差如何在展开时间范围内传播和放大。与常规 PTQ 方法不同，其中敏感度分析通常与量化过程耦合，TQS 实现了先验敏感度估计，该估计与量化器选择和位宽分配解耦。这种分离使得即使对于具有融合算子的黑盒或编译网络，也能进行量化预算规划。基于此，我们提出了 TQS-PTQ，这是一种灵活的混合精度框架，无需校准数据或昂贵的二阶近似。我们的实验表明，动力系统视角为资源受限环境下的低精度部署提供了一种稳健且高性能的路径。

Abstract

We introduce the Trajectory-based Quantization Sensitivity Score (TQS), a metric that reframes post-training quantization (PTQ) through the lens of dynamical-systems stability. By modeling the network's rollout as a discrete-time dynamical system, TQS characterizes how quantization-induced errors propagate and amplify over the rollout horizon. Unlike conventional PTQ methods, where sensitivity analysis is often coupled to the quantization procedure, TQS enables a priori sensitivity estimation decoupled from quantizer selection and bit-width assignment. This separation allows for quantization budget planning even for black-box or compiled networks with fused operators. Building on this, we present TQS-PTQ, a flexible mixed-precision framework that requires no calibration data or costly second-order approximations. Our experiments show that a dynamical-systems perspective provides a robust, high-performing pathway for low-precision deployment in resource-constrained settings.
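
The intuition that quantization errors "propagate and amplify over the rollout horizon" can be illustrated with a toy scalar system. The sketch below is not the TQS metric, whose definition is not given in the abstract; it is an assumed minimal example in which the same weight perturbation (0.04, from rounding to a 0.1 grid) is applied to a contracting and to an expanding linear recurrence x_{t+1} = a * x_t.

```python
def rollout(a, x0, steps):
    """Trajectory of the scalar recurrence x_{t+1} = a * x_t."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])
    return xs

def quantize(w, step=0.1):
    """Uniform rounding of a weight to a grid with spacing `step`."""
    return round(w / step) * step

def max_trajectory_gap(a, x0=1.0, steps=20, step=0.1):
    """Largest gap between full-precision and quantized trajectories."""
    reference = rollout(a, x0, steps)
    quantized = rollout(quantize(a, step), x0, steps)
    return max(abs(r - q) for r, q in zip(reference, quantized))

stable_gap = max_trajectory_gap(0.54)    # weight rounds to 0.5, |a| < 1
unstable_gap = max_trajectory_gap(1.04)  # weight rounds to 1.0, |a| > 1

print(f"contracting system: max gap = {stable_gap:.4f}")
print(f"expanding system:   max gap = {unstable_gap:.4f}")
```

The weight error is identical in both cases, yet the trajectory error stays small when the dynamics contract and grows with the horizon when they expand, which is why a stability-based view can rank sensitivity without reference to a particular quantizer.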

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on quantization techniques for time-series models using dynamical systems theory, which has minimal overlap with keywords targeting multimodal LLMs and RL. Only slight conceptual links exist for 'model-based RL' and 'World Models' due to dynamical system and rollout terminology. No specified expert authors are found.

Keywords

Quantization Sensitivity Score, Dynamical Systems, Post-Training Quantization, Time-Series Models, Mixed-Precision Framework, Error Propagation, Low-Precision Deployment

173. ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization [FAIL]

Score: 4.5 / 35.2

Authors: Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen

Published: 2026-06-11

TL;DR: ProtoX-AD introduces a prototype-based self-explainable framework for time series anomaly detection that achieves competitive detection performance while providing semantically meaningful explanations through interpretable prototypes.

Abstract (Chinese translation)

近年来，时间序列异常检测（TSAD）领域的进展凸显了基于自监督分类方法的有效性。这些方法通过对正常训练样本施加变换，训练分类器识别变换特定模式，从而通过分类错误率的增加来识别异常。尽管性能优异，但一个显著挑战在于其可解释性不足，因为它们对标记出的异常特征提供的洞察有限。为了解决这一局限性，我们提出 ProtoX-AD，一种用于自监督时间序列异常检测（TSAD）的基于原型的自解释框架。ProtoX-AD 能够学习感知变换的潜在表示以及可解释原型，从而实现准确的异常检测，并通过基于原型的解释识别出不同的异常特征。此外，它还允许系统分析变换设计如何影响检测性能与可解释性。在合成数据集和真实世界数据集上的实验结果表明，ProtoX-AD 达到了与其黑盒对应方法相当的检测性能，同时提供了比现有可解释基线方法更一致且语义上更有意义的解释。我们的代码公开发布于 https://github.com/Aitorzan3/ProtoX-AD。

Abstract

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at https://github.com/Aitorzan3/ProtoX-AD.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on time series anomaly detection using prototype-based self-explainable methods, which is unrelated to multimodal foundation models, world models, or reinforcement learning. Only 'Latent Reasoning' has slight relevance due to the use of latent representations, while other keywords are completely irrelevant to the domain.

Keywords

Time Series Anomaly Detection, Self-Explainable, Prototype-Based, Self-Supervised, Latent Representations, Transformation-Aware, Interpretability

174. Understanding helpfulness and harmless tension in reward models [FAIL]

Score: 4.5 / 35.2

Authors: Eshaan Tanwar, Pepa Atanasova

Published: 2026-06-11

TL;DR: This paper investigates the internal tension between helpfulness and harmlessness objectives in reward models for RLHF, identifying shared neurons that cause interference and help explain why multi-objective alignment remains challenging.

Abstract (Chinese translation)

奖励模型 (Reward Models) 是基于人类反馈的强化学习 (Reinforcement Learning from Human Feedback, RLHF) 的一个关键组成部分，旨在使语言模型 (Language Models) 对齐到有益且无害的行为。然而，这些目标及其冲突背后的内部机制仍知之甚少。我们研究了在仅有益性 (Helpfulness-only)、仅无害性 (Harmlessness-only) 以及混合目标设置 (Mixed-objective settings) 下训练的奖励模型中的对齐张力 (Alignment Tension)。我们发现混合目标模型的性能通常低于单目标模型 (single-objective models)，这表明目标之间存在干扰。基于激活的方法 (Activation-based Methods)，我们识别出与每个目标相关的神经元 (Neurons)，并通过针对性消融实验 (Targeted Ablations) 研究它们的功能角色。我们发现这些神经元因果地支持其对应的目标，同时通常对对立的目标产生负面影响。我们发现相当大比例的神经元在有益性和无害性之间是共享的，且这些共享神经元对模型行为施加不成比例的影响，从而加剧了对齐张力。此外，我们的结果提供了见解和机制性解释，阐明了对齐目标如何在奖励模型中表示，以及为何多目标对齐仍然具有挑战性，这激励了未来关于解耦且可控对齐方法 (Disentangled and Controllable Alignment Methods) 的研究。

Abstract

Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

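Scoring rationale: The paper studies the internal tension between the helpfulness and harmlessness objectives of reward models in reinforcement learning from human feedback (RLHF), and the neuron-level mechanisms behind it. It focuses on interpretability of text models and does not involve multimodality (MultiModal, MLLM, Visual Encoder), tokenizers (Tokenizer), world models (World Models), or agentic reasoning (Agentic Reasoning). Although it touches on reinforcement learning, it concerns reward learning rather than model-based RL, and it analyzes latent representations only through neuron activations (Latent Reasoning), so relevance is low. The author list does not include any of the specified experts. Most keywords therefore score 0; only model-based RL and Latent Reasoning receive low scores for domain proximity, and the total is far below the dynamic passing score.

Keywords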

Reward Models, RLHF, Helpfulness, Harmlessness, Neuron Analysis, Alignment Tension, Multi-objective Learning, Mechanistic Interpretability

175. No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions [FAIL]

Score: 4.5 / 35.2

Authors: Xu Yang, Zhizhou Sha, Junbo Li, Jian Yu, Yifan Sun, Matthew Zhao, Jinrui Fang, Xinyue Guo, Yining Wu, Xu Hu, Yifu Luo, Qiang Liu, Zhangyang Wang

Published: 2026-06-11

TL;DR: This paper investigates the vulnerability of AI peer review systems to presentation-only adversarial attacks, demonstrating that manipulating narrative structure without changing scientific content can significantly increase review scores.

Abstract (Chinese translation)

随着 AI 生成评论从实验工具转变为同行评审基础设施，大多数鲁棒性担忧都集中在显式攻击上，例如隐藏指令和 prompt injection（提示注入）。我们研究了一种更难且更具政策相关性的失效模式：无隐藏文本、无 prompt injection，且不对方法、实验、图表、公式、证明或数值结果进行更改。攻击者仅修改呈现层面的内容，例如摘要、贡献框架、相关工作、讨论和叙事结构。我们引入 adversarial repackaging（对抗性重组）：一种利用 AI-reviewer（AI 评审者）反馈来搜索呈现层面修订、同时保持科学证据不变的 closed-loop attack（闭环攻击）。在三种主流 AI-reviewer 上，adversarial repackaging 实现了 75.1% 的攻击成功率和平均得分增益 +1.21/10。这种效果无法用 ordinary prose polishing（普通润色）来解释。我们还揭示了改变评审者解读论文的策略，例如 related-work repositioning（相关工作重新定位）和 analytical discussion expansion（分析性讨论扩展），显著优于 surface edits（表面编辑），例如 local polishing（局部润色）、table formatting（表格格式）和 algorithm boxes（算法框）。我们的分析揭示了两种更深层的结构失效模式。首先，AI-reviewer 更容易被取悦而非被说服：强调优势可靠地增加感知价值，而试图消除弱点通常会适得其反。其次，AI-reviewer 可能会混淆解决局限性的外观与实际上解决它，允许未更改的证据被重新解释为更强的科学贡献。这些结果表明，部署风险不仅在于恶意隐藏指令，还在于论文呈现本身作为一种 optimization surface（优化表面）的出现。我们发布了一个 contamination-free rolling benchmark（无污染的滚动基准）和 attack framework（攻击框架），用于测试 AI-reviewer 是否在 presentation-only edits（仅呈现编辑）下仍锚定在科学内容上。

Abstract

As AI-generated reviews move from experimental tools into peer-review infrastructure, most robustness concerns have focused on explicit attacks such as hidden instructions and prompt injection. We study a harder and more policy-relevant failure mode: no hidden text, no prompt injection, and no changes to methods, experiments, figures, equations, proofs, or numerical results. The attacker modifies only presentation-level content, such as the abstract, contribution framing, related work, discussion, and narrative structure. We introduce adversarial repackaging: a closed-loop attack that uses AI-reviewer feedback to search for presentation-level revisions while keeping the scientific evidence fixed. Across three mainstream AI reviewers, adversarial repackaging achieves a 75.1% attack success rate and a mean score gain of +1.21/10. The effect is not explained by ordinary prose polishing. We also reveal that strategies that change how the reviewer interprets the paper, such as related-work repositioning and analytical discussion expansion, substantially outperform surface edits such as local polishing, table formatting, and algorithm boxes. Our analysis reveals two deeper structural failure modes. First, AI reviewers are easier to impress than to convince: highlighting strengths reliably increases perceived merit, while attempts to dissolve weaknesses frequently backfire. Second, AI reviewers can confuse the appearance of addressing a limitation with actually resolving it, allowing unchanged evidence to be reinterpreted as stronger scientific contribution. These results show that the deployment risk is not only malicious hidden instructions, but the emergence of paper presentation itself as an optimization surface. We release a contamination-free rolling benchmark and attack framework for testing whether AI reviewers remain anchored to scientific content under presentation-only edits.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on AI safety and robustness within peer review systems, specifically analyzing 'adversarial repackaging', where attackers modify presentation without altering scientific content. It does not align with the provided keyword set, which focuses on model architectures and learning paradigms (Unify Models, Tokenizer, Visual Encoder, World Models, Model-Based RL, Latent Reasoning). While the AI reviewers involved could be classified as MLLMs (score 1.0) and the attack loop involves agent-like optimization (score 2.0), the paper does not delve into the technical mechanisms of these keywords. Consequently, the weighted total score is significantly below the dynamic passing threshold of 35.2, indicating low relevance to the specified research domain.
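The arithmetic behind the score table above can be sketched in a few lines. This is a minimal reconstruction that assumes, from the table columns alone, that each keyword's score is weight times relevance and that the total is their sum; the function name `weighted_total` and the `>=` comparison against the passing score are illustrative assumptions, not the report generator's actual code.

```python
# Hypothetical reconstruction of the scoring arithmetic shown in the table.
# Assumption: per-keyword score = weight * relevance; total = sum of scores.

def weighted_total(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

# Rows copied from the score table for this paper.
rows = [
    ("Unify Models", 1.5, 0.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 1.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
    ("Latent Reasoning", 1.5, 0.0),
    ("Agentic Reasoning", 1.5, 2.0),
]

PASSING_SCORE = 35.2  # dynamic passing score quoted in the report

total = weighted_total(rows)
# Assumption: a paper passes when its total reaches the passing score.
status = "PASS" if total >= PASSING_SCORE else "FAIL"
print(f"Score: {total} / {PASSING_SCORE} -> {status}")
```

Under these assumptions the total matches the 4.5 reported for this paper, well short of 35.2.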

Keywords

AI Peer Review, Adversarial Repackaging, Presentation-Only Revisions, Robustness Concerns, Scientific Evidence, Narrative Structure, AI Reviewer, Attack Success Rate

176. Multiagent Protocols with Aggregated Confidence Signals [FAIL]

Score: 3.0 / 35.2

Authors: Ali Elahi, Barbara Di Eugenio

Published: 2026-06-11

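TL;DR: This paper proposes protocols that aggregate confidence signals in multiagent debate systems; combining them via soft voting or Bayesian fusion substantially improves discriminative power while keeping correctness stable.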

Abstract (Chinese translation)

置信度在自然语言处理（NLP）中被用于可靠性、监督以及一系列下游决策任务，然而，尚无现有方法能够为多智能体系统的输出生成或评估置信度。先前工作在多智能体辩论（MAD）中使用置信度来对消息加权、触发辩论或校准单个智能体，但它们从未将这些聚合为系统本身的单一置信度。我们提出了三种方案，旨在生成最终答案及单一聚合置信度：首先转换原始置信度信号以使其跨模型可比，然后通过软投票（soft voting）或我们称之为贝叶斯融合（Bayesian fusion）的概率融合方法将它们组合起来。这种聚合置信度的区分性（AUARC）显著优于最佳单个智能体或标准辩论基线，同时正确性（F1 分数（F1-score））保持稳定，并弥补了 MAD 在更模糊任务上产生的损失。通过分析两种估计器（序列概率和自我报告）以及参数化和非参数化校准器，我们发现校准能提高两种估计器的 F1 分数（F1-score），而 AUARC 对其依赖较低。我们在每个基准上评估六对同质和异质辩论对，涵盖五个基准和四种任务类型，涉及一系列模型能力和规模。

Abstract

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
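To make the two combination rules named in the abstract more concrete, here is a minimal sketch for a binary decision. It assumes each agent's raw confidence has already been transformed into a comparable probability that the candidate answer is correct. The abstract does not give the paper's formulas, so `soft_vote` (a plain average) and `bayesian_fusion` (an independence-based product rule with a uniform prior) are stand-in textbook forms, and both function names are hypothetical.

```python
import math

def soft_vote(probs):
    """Soft voting: average the agents' probabilities for the answer."""
    return sum(probs) / len(probs)

def bayesian_fusion(probs):
    """Product-rule fusion under an independence assumption and a uniform prior.

    Assumption: this is a generic naive-Bayes style fusion, not necessarily
    the exact rule the paper calls Bayesian fusion.
    """
    p_yes = math.prod(probs)                  # evidence that the answer is correct
    p_no = math.prod(1.0 - p for p in probs)  # evidence that it is wrong
    return p_yes / (p_yes + p_no)

# Two agents that both lean towards the same answer (illustrative values).
agent_probs = [0.8, 0.7]
print("soft vote:", soft_vote(agent_probs))
print("bayesian fusion:", bayesian_fusion(agent_probs))
```

With agreeing agents the product rule yields a higher aggregated confidence than the average, which is one reason to calibrate the inputs before fusing them.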

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	2.0/10	3.0

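Scoring rationale: The paper studies confidence aggregation and calibration protocols in multiagent systems and belongs to natural language processing (NLP). The provided keyword set centers on multimodal learning, world models, and reinforcement learning, and is a poor match for the paper's topic. Only 'Agentic Reasoning' is weakly relevant because the paper involves multiagent systems; the other keywords, such as Tokenizer, Visual Encoder, World Models, and model-based RL, are entirely unrelated. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied.

Keywords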

Multiagent Protocols, Aggregated Confidence Signals, Multiagent Debate, Confidence Calibration, Bayesian Fusion, Soft Voting, NLP

177. Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization [FAIL]

Score: 3.0 / 35.2

Authors: Mateo Toro Diz, Jonathan Hoss, Noah Klarmann

Published: 2026-06-11

TL;DR: This paper proposes a measurement-calibrated multi-camera fusion method to reduce trajectory variance and improve motion smoothness in indoor vision-based localization systems by explicitly characterizing component-wise errors.

Abstract (Chinese translation)

基于视觉的室内定位系统（Indoor Vision-based Localization Systems）受到检测噪声（Detection Noise）、遮挡（Occlusions）和相机覆盖范围（Camera Coverage）有限的影响，导致处理流程（Pipeline）多个阶段存在不确定性。虽然多相机数据融合（Multi-camera Data Fusion）被广泛用于缓解这些问题，但它通常被视为黑盒组件（Black-box Component），仅通过端到端（End-to-end）进行评估，掩盖了其机制性贡献。为了解决这一差距，本研究探讨了是否可以通过显式表征单相机定位误差（Single-camera Localization Errors）来校准和优化多相机数据融合。我们引入了一种测量校准融合（Measurement-calibrated Fusion）方法，该方法整合了组件级误差量化（Component-wise Error Quantification），具体隔离了单应性校准（Homography Calibration）、人体检测（Human Detection）和运动跟踪（Motion Tracking）。进行了组件级评估（Component-wise Evaluation），以量化来自单应性校准、人体检测和运动跟踪的误差贡献。实验结果表明，数据融合相比单相机基线提高了定位精度。虽然测量校准融合相对于标准融合在绝对精度上仅提供有限的改进，但它显著降低了轨迹方差（Trajectory Variance）并改善了运动平滑性（Motion Smoothness），这对于需要稳定和连续的运动估计（Motion Estimates）的应用至关重要。这些结果突显了在基于视觉的室内定位系统设计数据融合策略时，显式误差表征（Explicit Error Characterization）的价值。

Abstract

Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, specifically isolating homography calibration, human detection, and motion tracking. A component-wise evaluation is conducted to quantify error contributions from homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.
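One common way to let measured per-camera errors drive fusion is inverse-variance weighting, sketched below. The abstract does not state the paper's fusion rule, so this is a stand-in standard technique rather than the authors' method; the function name `fuse_positions` and the example variances are hypothetical.

```python
def fuse_positions(estimates):
    """Fuse 2D position estimates by inverse-variance weighting.

    estimates: list of ((x, y), variance) pairs, one per camera, where
    variance is an assumed per-camera localization error variance.
    Returns the fused (x, y) and the variance of the fused estimate.
    """
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, estimates)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, estimates)) / total
    return (x, y), 1.0 / total

# Camera A is assumed to be four times more precise than camera B.
cameras = [((2.0, 4.0), 0.25), ((3.0, 4.0), 1.0)]
fused_position, fused_variance = fuse_positions(cameras)
print(fused_position, fused_variance)
```

The fused estimate sits closer to the lower-variance camera, and its variance is smaller than either input's, which is consistent with the reduced trajectory variance the abstract reports for calibrated fusion.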

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on traditional computer vision and sensor fusion for indoor localization, whereas the provided keywords primarily target Large Language Models, World Models, and Reinforcement Learning paradigms. There is minimal conceptual overlap; 'Visual Encoder' and 'MultiModal' loosely relate to camera inputs but lack the deep learning representation learning context implied by the keyword set (e.g., Tokenizer, MLLM, Agentic Reasoning).

Keywords

Indoor Localization, Multi-Camera Fusion, Error Quantification, Homography Calibration, Motion Tracking, Trajectory Variance, Vision-Based Localization

178. Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents [FAIL]

Score: 3.0 / 35.2

Authors: Zihao Wang, Yiming Li, Yutong Wu, Zheyu Liu, Kangjie Chen, Fok Kar Wai, Pin-Yu Chen, Vrizlynn L. L. Thing, Bo Li, Dacheng Tao, Tianwei Zhang

Published: 2026-06-11

TL;DR: This paper proposes a stakeholder-centric benchmark to evaluate prompt injection vulnerabilities in LLM-based web agents, revealing heterogeneous harm distributions that conventional evaluations overlook.

Abstract (Chinese translation)

由大语言模型（LLMs）驱动的 Web 代理正越来越多地部署在真实环境中，它们在不可信的 Web 内容上运行并执行具有直接后果的操作。这使它们容易受到提示注入攻击的影响，其中看似无害的内容嵌入了操纵代理行为的对抗性指令。现有的安全基准采用以攻击为中心的视角，关注注入的技术可行性，而忽视了所产生的危害分布的细微之处。然而在实践中，提示注入风险是依赖受害者（利益相关者）的：单个漏洞利用可为不同利益相关者产生不对称的后果，且相同的攻击模式可能表现出显著不同的有效性，这取决于其目标对象。为了捕捉这些特性，我们引入了本基准（\sysname），这是一个以利益相关者为中心的基准，用于系统地分类和归因于真实 Web 代理系统中的危害。它区分了受影响实体（例如用户、卖家、平台），将攻击分解为具体目标，并使用互补的结果级和过程级指标对每个案例进行评估。我们的结果揭示了实质性和异构的漏洞：没有任何一个攻击目标能被当前代理可靠地防御，且失败分布在定性不同的模式上，从隐蔽寄生（stealthy parasitism，攻击成功而未干扰用户委托的任务）到错配干扰（misaligned disruption，任务被干扰但攻击未成功）以及复合失败（compounded failure，对抗性目标和任务完整性同时被违反）。这些模式被常规评估所忽略，突显了在真实部署中对基于 LLM 的代理进行利益相关者感知评估的必要性。该基准可在 https://github.com/StakeBench/SBC 获取。

Abstract

Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted web content and execute actions with direct consequences. This makes them vulnerable to prompt-injection attacks, in which seemingly benign content embeds adversarial instructions that manipulate agent behaviour. Existing security benchmarks adopt an attack-centric perspective, focusing on the technical feasibility of injections while overlooking the nuanced distribution of resulting harms. In practice, however, prompt-injection risk is victim-dependent: a single exploit can produce asymmetric consequences for different stakeholders, and the same attack pattern may exhibit substantially different effectiveness depending on whom it targets. To capture these properties, we introduce \sysname, a stakeholder-centric benchmark to systematically categorize and attribute harm in real-world web agent systems. It distinguishes between affected entities (e.g., user, seller, platform), decomposes the attacks into concrete objectives, and evaluates each case with complementary outcome- and process-level metrics. Our results reveal substantial and heterogeneous vulnerabilities: not a single attack objective is reliably resisted by current agents, and failures distribute across qualitatively distinct modes ranging from stealthy parasitism (attack succeeds without disrupting the user's delegated task) to misaligned disruption (task disrupted without attack success) and compounded failure (both adversarial objective and task integrity simultaneously violated). These patterns are missed by conventional evaluation, highlighting the need for stakeholder-aware assessment of LLM-based agents in real-world deployments. Benchmark is available at https://github.com/StakeBench/SBC.
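The three failure modes named in the abstract can be read as combinations of two binary outcomes: whether the attack succeeded and whether the user's task was disrupted. The sketch below encodes that reading. The function name `failure_mode` and the "no failure" label for the remaining case are assumptions for illustration; the benchmark's real metrics are described as outcome- and process-level and are likely richer than two booleans.

```python
def failure_mode(attack_succeeded: bool, task_disrupted: bool) -> str:
    """Map two binary outcomes onto the failure modes named in the abstract."""
    if attack_succeeded and task_disrupted:
        # Adversarial objective met and task integrity violated.
        return "compounded failure"
    if attack_succeeded:
        # Attack succeeds while the user's delegated task still completes.
        return "stealthy parasitism"
    if task_disrupted:
        # Task is disrupted even though the attack itself did not succeed.
        return "misaligned disruption"
    # Assumed label: the abstract does not name the benign case.
    return "no failure"

for attack, disrupted in [(True, False), (False, True), (True, True), (False, False)]:
    print(attack, disrupted, "->", failure_mode(attack, disrupted))
```

Scoring only attack success would merge the first and third cases and miss the second entirely, which is the gap the abstract attributes to conventional evaluation.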

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on security evaluation (prompt injection) for LLM-based web agents from a stakeholder perspective, while the provided keywords primarily concern multimodal architecture, world models, and reinforcement learning methodologies. There is minimal overlap; only 'Agentic Reasoning' has a tangential connection due to the focus on 'agents', whereas keywords like Visual Encoder, Tokenizer, and World Models are irrelevant to this security benchmarking study. Additionally, none of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list.

Keywords

Prompt Injection, Web Agents, Stakeholder-Centric, Benchmarking, LLM Security, Harm Distribution, Adversarial Instructions

179. Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories [FAIL]

Score: 3.0 / 35.2

Authors: Oliver Aleksander Larsen, Mahyar T. Moghaddam

Published: 2026-06-11

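TL;DR: Using causal inference, this study finds that adopting agentic AI coding tools increases code volume without materially improving software architecture quality; the decline in architectural smell density is mainly a denominator effect.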

Abstract (Chinese translation)

目前，大多数开发者均在使用 AI 编码工具，而这些工具的代理式使用（agentic use）普及了一种通俗称为"vibe coding"的做法。然而，关于其对软件架构影响的因果证据却十分稀缺。先前的因果研究主要测量了代码层面的结果（如复杂度、静态分析警告）；然而，此类退化是否会传播至架构层面的结果仍未知。我们挖掘了 151 个开源 Java 仓库，其中 74 个具有可检测的代理式 AI 采用（通过配置文件和 Co-Authored-By 提交尾缀识别），以及 77 个倾向性匹配对照组；在每个仓库为期 13 个月的窗口期内，共获得了 1,811 个月度 Arcan 快照。我们采用交错双重差分（staggered difference-in-differences）设计和 Borusyak 插值估计量，估计采用行为对架构异味密度（ASD）的因果效应，将最近应用于代码层面指标的因果设计延伸至架构层面。总异味计数基本保持不变（+1.1%，p = 0.82），而代码行数（lines of code）增长了 +12.8%（p = 0.003）；因此，由此产生的 6.7% ASD 下降（p = 0.004）实为分母效应，而非架构层面的改进。按类型估计及稳健性检验（包括 wild cluster bootstrap、Lee bounds 和 stale-observation sensitivity）均证实了这一模式；预处理趋势平坦（Wald p = 0.90），符合平行趋势假设。当处理变量影响系统规模时，密度归一化结果可能产生误导：在 AI 工具采用的因果挖掘研究中，需要使用原始计数和显式分解。完整的复现包（包括精选的 151 个仓库月度面板）已公开提供。

Abstract

AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "vibe coding". Yet causal evidence on their effect on software architecture is scarce. Prior causal work has measured code-level outcomes (complexity, static analysis warnings); whether such degradation propagates to architecture-level outcomes remains unknown. We mine 151 open-source Java repositories, 74 with detectable agentic AI adoption (identified via configuration files and Co-Authored-By commit trailers) and 77 propensity-matched controls, across a 13-month per-repository window yielding 1,811 monthly Arcan snapshots. We estimate the causal effect of adoption on architectural smell density (ASD) with a staggered difference-in-differences design and the Borusyak imputation estimator, applying a causal design recently used for code-level metrics to the architecture level. Total smell counts are essentially unchanged (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003); the resulting 6.7% ASD decline (p = 0.004) is therefore a denominator effect rather than an architectural improvement. Per-type estimates and robustness checks (wild cluster bootstrap, Lee bounds, stale-observation sensitivity) corroborate the pattern; pre-trends are flat (Wald p = 0.90), consistent with parallel trends. Density-normalized outcomes can mislead when treatment affects system size: raw counts and explicit decomposition are required for causal mining studies of AI tool adoption. The complete replication package, including the curated 151-repository monthly panel, is publicly available.
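The denominator effect described in the abstract is easy to see with a toy calculation. The numbers below are hypothetical, chosen only for illustration, and density is assumed to mean smells per thousand lines of code; they are not the paper's data and do not reproduce its difference-in-differences estimates.

```python
def smell_density(smell_count: int, lines_of_code: int) -> float:
    """Smells per thousand lines of code (assumed density definition)."""
    return smell_count / (lines_of_code / 1000)

# Hypothetical repository: the smell count stays flat while the code base grows.
density_before = smell_density(200, 50_000)
density_after = smell_density(200, 56_000)
relative_change = density_after / density_before - 1

print(f"before: {density_before:.3f} smells/KLOC")
print(f"after:  {density_after:.3f} smells/KLOC")
print(f"relative change in density: {relative_change:.1%}")
```

The density drops even though no smell was removed, which is why the abstract recommends reporting raw counts and an explicit decomposition alongside any density-normalized outcome.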

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	2.0/10	3.0

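Scoring rationale: The paper belongs to software engineering and studies the effect of AI tools on software architecture quality. The provided keyword list mainly covers machine learning architecture topics such as multimodal large models, world models, and reinforcement learning. Apart from 'Agentic Reasoning', which receives a small score for terminological overlap (the paper discusses agentic AI adoption), the keywords have no substantive connection to the paper, so relevance is very low.

Keywords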

Agentic AI Adoption, Software Architecture, Architectural Smell Density, Causal Study, Java Repositories, AI Coding Tools, Difference-in-Differences

180. Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes [FAIL]

Score: 3.0 / 35.2

Authors: Anna-Maria Velentza, Anne-Gwenn Bosser

Published: 2026-06-11

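TL;DR: This study examines how humor style and topic affect the perceived funniness and acceptability of robot-delivered AI jokes in a bilingual setting, finding that affiliative and aggressive humor are rated funnier and that person-related topics are more acceptable than political ones.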

Abstract (Chinese translation)

幽默在人类社会关系中扮演着核心角色，而计算幽默领域的最新进展为将幽默融入人机交互（HRI）创造了新机遇。尽管大语言模型（LLMs）能够生成多种形式的幽默，但在群体环境中，幽默风格、笑话内容以及语言偏好如何影响人们对机器人传达的幽默的感知尚不明确。在这项探索性研究中，我们采用了混合因子设计，参与者在一间大学教室里评估由机器人传达的人工智能生成的笑话。我们考察了幽默类型（亲和型、自我提升型、攻击型、自我挫败型）和笑话内容（人际相关 vs. 政治相关）对感知到的好笑程度、得体性以及偏好语言的影响。结果显示，幽默类型显著影响感知到的好笑程度，攻击型和亲和型幽默评分更高，而笑话内容主要影响得体性，人际相关笑话比政治相关笑话更受青睐。语言偏好受笑话内容以及参与者自报的流利度和幽默实践习惯的共同塑造。

Abstract

Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrating humor into human-robot interaction (HRI). While large language models (LLMs) can generate diverse forms of humor, it remains unclear how humor style, joke content, and language preference shape perceptions of robot-delivered humor in group settings. In this exploratory study, we employed a mixed factorial design in which participants evaluated AI-generated jokes delivered by a robot in a university classroom. We examined the effects of humor type (Affiliative, Self-Enhancing, Aggressive, Self-Defeating) and joke content (person-related vs. political) on perceived funniness and appropriateness, as well as preferred language. Results show that humor type significantly influences funniness, with Aggressive and Affiliative humor rated higher, while joke content primarily affects appropriateness, with person-related jokes preferred over political ones. Language preference was shaped by both joke content and participants' self-reported fluency and humor practices.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

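Scoring rationale: The paper focuses on the psychology of humor perception in human-robot interaction. It uses LLMs to generate jokes but does not address model architecture (Tokenizer, Visual Encoder), representation learning (World Models, Latent Reasoning), unified modeling (Unify Models), or reinforcement learning (model-based RL). Apart from the mention of LLMs, relevance to the given technical keywords is therefore very low.

Keywords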

Human-Robot Interaction, Computational Humor, Large Language Models, Humor Style, Bilingual, Joke Content, Social Perception, Robot Delivery

181. Towards Personalized Federated Learning for Dysarthric Speech Recognition [FAIL]

Score: 3.0 / 35.2

Authors: Tao Zhong, Mengzhe Geng, Jiajun Deng, Shujie Hu, Xunying Liu

Published: 2026-06-11

TL;DR: This paper proposes personalized federated learning strategies for dysarthric speech recognition that effectively reduce word error rates by addressing speaker heterogeneity compared to standard FedAvg.

Abstract (Chinese translation)

语音识别对于构音障碍说话人具有挑战性。虽然基于联邦学习（FL）的自动语音识别（ASR）可作为保护隐私的有效工具，但其面临着由说话人变异性引起的异构性问题。在此异构性下，强制所有说话人共享相同的模型组件可能是次优的，使个性化成为一个有前景的方向；然而，针对构音障碍语音的相关研究仍然有限。为此，本文探索了两种实现个性化的聚合策略，包括基于参数的平均策略与基于嵌入的平均策略。在 UASpeech 和 TORGO 数据集上的实验表明，所提出的方法优于基线正则化 FedAvg，分别在 UASpeech 和 TORGO 上实现了高达 0.99% 绝对值（3.15% 相对值）和 0.56% 绝对值（4.73% 相对值）的统计显著词错误率（WER）降低。

Abstract

Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting privacy, it suffers from heterogeneity issues caused by speaker variability. Forcing all speakers to share the same model components can be suboptimal under such heterogeneity, making personalization a promising direction; however, related research on dysarthric speech remains limited. To this end, this paper explores two aggregation strategies to achieve personalization, including the parameter-based averaging strategy and the embedding-based averaging strategy. Experiments on UASpeech and TORGO show that the proposed methods outperform the baseline regularized FedAvg by statistically significant WER reductions of up to 0.99% absolute (3.15% relative) on UASpeech and 0.56% absolute (4.73% relative) on TORGO, respectively.
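For context on the baseline, here is a minimal sketch of the aggregation step in plain FedAvg: the server averages client parameters weighted by each client's number of training samples. This is the standard, non-personalized rule, shown without the regularization the baseline uses and without the paper's parameter-based or embedding-based personalization. The function name `fedavg` and the toy values are hypothetical.

```python
def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors (plain FedAvg).

    client_params: list of equal-length parameter lists, one per client.
    client_sizes: number of local training samples for each client.
    """
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]

# Two speakers (clients); the second contributes three times as much data.
global_params = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)
```

Every client receives the same averaged parameters here, which is the shared-component constraint that the abstract argues can be suboptimal under strong speaker heterogeneity.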

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

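Scoring rationale: The paper focuses on federated learning and personalization for speech recognition, which is a poor match for the keyword set's areas of multimodal large models, world models, and reinforcement learning. There is only a weak connection to model aggregation (Unify Models) and speech tokenization (Tokenizer); the remaining keywords, such as Visual Encoder, MultiModal, and model-based RL, are not addressed.

Keywords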

Dysarthric Speech Recognition, Federated Learning, Personalization Strategy, Speaker Heterogeneity, Parameter-based Averaging, Embedding-based Averaging, Word Error Rate Reduction

182. Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning [FAIL]

Score: 3.0 / 35.2

Authors: Meher Sai Preetam, Meher Bhaskar

Published: 2026-06-11

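TL;DR: This paper proposes Simplex-Constrained Sparse Bagging, which optimizes ensemble weights to compress and calibrate the model, addressing the uniform voting of conventional ensembles.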

Abstract (Chinese translation)

我们提出单纯形约束稀疏装袋（Simplex-Constrained Sparse Bagging, SCSB），这是一个数学上严谨的框架，用于基于自助法（bootstrap）的装袋集成模型（bagging ensembles）的训练后压缩和概率校准。标准装袋集成（如随机森林（Random Forests）、装袋 SVM（Bagged SVMs）和装袋神经网络（Bagged Neural Networks））对所有组成估计器分配统一的投票权重。然而，这种朴素均匀先验忽略了基估计器变化的局部能力，并导致模型过度自信。我们通过最小化袋外（Out-Of-Bag, OOB）损失，将集成剪枝和校准表述为概率单纯形（probability simplex）上的联合优化问题。为了诱导稀疏性，我们通过引入凹二次惩罚，解决了理论上的"L1-单纯形悖论”（L1-simplex paradox）——即 L1 范数（L1 norm）在单纯形上恒为常数从而无法实现剪枝的数学事实。SCSB 与模型无关（model-agnostic），实现了高达 96% 的集成压缩，产生线性推理加速和更优的概率校准（降低期望校准误差（Expected Calibration Error）），同时保持或提升泛化精度。

Abstract

We present Simplex-Constrained Sparse Bagging (SCSB), a mathematically rigorous framework for post-training compression and probability calibration of bootstrap-based bagging ensembles. Standard bagging ensembles (such as Random Forests, Bagged SVMs, and Bagged Neural Networks) assign uniform voting power to all constituent estimators. However, this naive uniform prior ignores the varying local competence of base estimators and contributes to model overconfidence. We formulate ensemble pruning and calibration as a joint optimization problem over the probability simplex by minimizing the Out-Of-Bag (OOB) loss. To induce sparsity, we address the theoretical "L1-simplex paradox" -- the mathematical reality that the L1 norm is constant on the simplex and fails to prune -- by introducing a concave quadratic penalty. SCSB is model-agnostic and achieves up to 96% ensemble compression, yielding linear inference speedups and superior probability calibration (lowered Expected Calibration Error) while preserving or enhancing generalization accuracy.
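The "L1-simplex paradox" can be checked numerically: for non-negative weights that sum to one, the L1 norm is always exactly one, so it cannot distinguish sparse from dense weightings. The sketch below contrasts it with one possible concave quadratic penalty, 1 minus the squared L2 norm, which is zero at a vertex of the simplex and largest at the uniform weighting. The abstract does not give the exact penalty used in SCSB, so this form and the function names are assumptions.

```python
def l1_norm(weights):
    """L1 norm; constant (equal to 1) for any point on the probability simplex."""
    return sum(abs(w) for w in weights)

def concave_penalty(weights):
    """Assumed concave quadratic penalty: 1 - ||w||_2^2.

    Minimizing it over the simplex pushes mass onto few estimators.
    """
    return 1.0 - sum(w * w for w in weights)

candidates = {
    "uniform": [0.25, 0.25, 0.25, 0.25],  # standard bagging weights
    "mixed": [0.7, 0.3, 0.0, 0.0],
    "sparse": [1.0, 0.0, 0.0, 0.0],       # a single surviving estimator
}

for name, w in candidates.items():
    print(f"{name:8s} L1 = {l1_norm(w):.2f}  penalty = {concave_penalty(w):.2f}")
```

The L1 column is identical for all three candidates while the penalty decreases as the weights become sparser, which illustrates why a concave term rather than an L1 term is needed to prune on the simplex.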

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

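Scoring rationale: The paper focuses on ensemble learning and model compression and is unrelated to the keyword areas of multimodality, world models, and reinforcement learning. Only 'Unify Models' is weakly relevant, because the method unifies pruning and calibration. The weighted total is 3.0, far below the dynamic passing score of 35.2.

Keywords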

Simplex-Constrained Sparse Bagging, Ensemble Learning, Model Compression, Probability Calibration, Bootstrap-based Bagging, Sparsity Induction, Out-Of-Bag Loss

183. Clustering Node Attributed Networks with Graph Neural Networks and Self Learning [FAIL]

Score: 3.0 / 35.2

Authors: Rodrigo de Sapienza Luna, Daniel Ratton Figueiredo

Published: 2026-06-11

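TL;DR: This paper proposes a GNN-based self-learning framework for clustering node-attributed networks; by iteratively refining node representations and cluster assignments, it outperforms single-round training in the unsupervised setting.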

Abstract (Chinese translation)

图聚类（Graph clustering）——将图的节点集划分为反映某些潜在信息的不相交子集——是一个基本问题，因为它在众多不同的场景中都有应用。尽管这一经典问题长期以来已被不同社区所研究，但近年来受真实数据驱动的问题变体考虑了节点具有属性且这些属性也具有信息量的场景。这促使了新颖方法的出现，这些方法在设计新型聚类算法时同时利用网络信息（边）和节点信息（属性）。本文提出了一种新颖的框架，该框架建立在先前将图神经网络（GNN）应用于图聚类的工作基础之上。所提出的框架在完全无监督设置下，以轮次方式进行自学习。在每一轮中，图神经网络（GNN）生成节点的表示，这些表示被用于对节点进行聚类。这种聚类结果会影响下一轮中用于生成节点表示的图。此外，每一轮基于原始图构建的上下文图（Context graph）也被用于生成节点表示。实验结果表明，所提出的方法在合成数据中能够从网络边和节点属性中提取信息，当两者均不具备很强的信息量时，其性能优于仅关注网络或仅关注属性的算法。多轮学习同样提升了性能，且始终优于单次长轮次训练（即经典的 GNN 图聚类方法）。在考虑真实数据集时，实验结果表明，当簇大小平衡时，所提出的方法与最先进方法具有竞争力。

Abstract

Graph clustering - partitioning the node set of a graph into disjoint subsets that reflect some latent information - is a fundamental problem as it finds applications in a myriad of different scenarios. While this classic problem has been tackled for decades by different communities, a recent variation of the problem driven by real data considers the scenario where nodes have attributes that are also informative. This has triggered novel methods that simultaneously leverage network information (edges) and node information (attributes) in the design of novel clustering algorithms. This work proposes a novel framework that builds on prior works that have applied graph neural networks (GNN) to graph clustering. The proposed framework operates in rounds of self learning in a fully unsupervised setting. In each round, a GNN generates representations for nodes that are used to cluster the nodes. This clustering influences the graph used to generate the node representation in the next round. Moreover, a context graph built in each round using the original graph is used to generate the node representations. Empirical results show that the proposed methodology extracts information from both network edges and node attributes in synthetic data, outperforming algorithms focused solely on the network or attributes when neither is very informative. Multiple rounds of learning also improve the performance and always outperform a long single round of training (i.e., classic GNN graph clustering). When considering real datasets, empirical results indicate that the proposed methodology is competitive with state-of-the-art methods when cluster sizes are balanced.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

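Scoring rationale: The paper focuses on graph clustering with graph neural networks (GNNs), using an unsupervised self-learning framework. The provided keyword set mainly covers multimodal large language models (MLLM), world models, reinforcement learning, tokenizers, and visual encoders. The paper differs substantially from the research directions these keywords represent (multimodal generation, reinforcement learning, foundation-model architecture) and does not involve visual encoding, tokenization, world-model construction, or reinforcement learning mechanisms. There is only a weak conceptual link to 'Unify Models' (unifying edge and attribute information) and 'Latent Reasoning' (learning node representations), so the relevance scores are very low.

Keywords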

Graph Clustering, Node Attributed Networks, Graph Neural Networks, Self Learning, Unsupervised Setting, Node Representations, Context Graph

184. Simultaneous Latent Budget Trees for Stratified Classification [FAIL]

Score: 3.0 / 35.2

Authors: Cristian Buoncompagni, Stefano Pellegrino, Giulia Vannucci, Roberta Siciliano

Published: 2026-06-11

TL;DR: This paper proposes Simultaneous Latent Budget Trees, an interpretable probabilistic framework for stratified classification that utilizes latent mixture models to account for confounding variables, applied to analyze ALS disease progression.

Abstract (Chinese translation)

在可解释人工智能（Explainable Artificial Intelligence）时代，由于易于解释，人们对单棵树（single trees）重新产生了关注。本文介绍了同时潜在预算树（Simultaneous Latent Budget Trees, SLBT），这是一种用于分类树的概率机器学习框架，适用于存在分层因子（如时间、空间或人口统计学变量）的情况，这些变量充当控制变量或潜在混杂因子。标准的树生长过程并非旨在优化条件分裂规则。本文提出了一种基于模型的分裂规则，其中子节点被解释为同时混合模型（simultaneous mixture model）的潜在成分，例如同时潜在预算模型（Simultaneous Latent Budget Model）及其约束版本，这些模型拟合于父节点。混合参数驱动观测值（针对不同组别有所不同）进入子节点，而潜在预算参数则更新控制变量各层级上的响应类别概况。参数通过最小二乘法进行估计，并考虑了该模型的神经网络视角。信息丰富的树结构可借助节点和路径上的解释辅助工具进行交互式可视化，包括可视化剪枝和决策树选择过程。提出了适当的措施以处理不平衡的响应类别分布问题。所提出的方法被应用于调查肌萎缩侧索硬化症（Amyotrophic Lateral Sclerosis）疾病进展中的性别相关差异。包含各种基于树的算法的 SLBT 库可在提供的 GitHub 仓库中获取。

Abstract

In the era of Explainable Artificial Intelligence, there is a renewed focus on single trees for their ease of interpretation. This paper introduces Simultaneous Latent Budget Trees, a probabilistic machine learning framework for classification trees in the presence of a stratification factor such as a temporal, spatial, or demographic variable, acting as a control variable or potential confounder. Standard tree growth procedures are not designed to optimize a conditional split rule. A model-based split rule is proposed in which child nodes are interpreted as latent components of a simultaneous mixture model, such as the Simultaneous Latent Budget Model and its constrained versions, fitted to the parent node. Mixing parameters drive the observations, differently for each group, to the child nodes, whereas latent budget parameters update the response-class profile of each level of the control variable. Parameters are estimated by least squares considering a neural network perspective of the model. An informative tree structure can be interactively visualized with interpretation aids on the nodes and the paths, including visual pruning and a decision tree selection procedure. Suitable measures are proposed to handle an unbalanced response class distribution. The proposed methodology is applied to investigate gender-related differences in disease progression of Amyotrophic Lateral Sclerosis. The SLBT library with the various tree-based algorithms is available in the linked GitHub repository.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	2.0/10	3.0
Agentic Reasoning	1.5	0.0/10	0.0

评分理由: The paper focuses on statistical classification trees and stratified analysis using latent budget models, which is unrelated to multimodal LLMs, tokenizers, visual encoders, world models, or reinforcement learning. Only 'Latent Reasoning' has superficial overlap (score 2.0) due to the use of latent variables, though it lacks the reasoning aspect typical of the keyword cluster. Total weighted score is 3.0, well below the dynamic passing score of 35.2. No specified expert authors were found.
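
The arithmetic behind the score table can be reproduced in a few lines. This is a minimal sketch, assuming each keyword's score is weight × relevance and the total is their plain sum, which matches the table above. The function name `weighted_total` is hypothetical, the `>=` comparison is an assumption, and the report does not say how the 35.2 passing score is derived, so it is taken as given.

```python
# Relevance values (0-10) copied from the table above; every weight is 1.5.
WEIGHT = 1.5
PASSING_SCORE = 35.2  # taken from the report; its derivation is not stated

relevance = {
    "Unify Models": 0.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
    "Latent Reasoning": 2.0,
    "Agentic Reasoning": 0.0,
}


def weighted_total(relevance_by_keyword, weight=WEIGHT):
    """Sum weight * relevance over all keywords (hypothetical helper)."""
    return sum(weight * r for r in relevance_by_keyword.values())


total = weighted_total(relevance)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASSING_SCORE else "FAIL"
print(f"{total:.1f} / {PASSING_SCORE} -> {verdict}")
```

With nine keywords at weight 1.5 and relevance capped at 10, the maximum attainable total under this reading is 135.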

关键词

Simultaneous Latent Budget Trees, Stratified Classification, Probabilistic Machine Learning, Explainable Artificial Intelligence, Mixture Model, ALS Disease Progression, Interpretation Aids, Least Squares Estimation

185. An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors [FAIL]

Score: 3.0 / 35.2

Authors: Saurabh Kumar, Nutan Sairam Yenneti

Published: 2026-06-11

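TL;DR: This paper proposes a modular, lightweight unified demosaicing architecture for pixel-bin image sensors and introduces a learning-free CFA-identification module for plug-and-play operation.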

摘要翻译

像素 binning（Pixel-bin）图像传感器正逐渐成为智能手机相机的默认选择，这得益于其在分辨率与集光能力之间的权衡。然而，相较于拜耳色彩滤镜阵列（CFA），其较大的色间分离使得去马赛克过程更具挑战性。此外，现有的基于深度学习的去马赛克方法通常针对特定 CFA 设计，需要多个独立模型，这不仅占用宝贵的板载资源，还增加了开发与维护的成本。本文提出了一种模块化统一架构，用于对各种像素 binning 传感器进行去马赛克，该架构在保证更高图像质量的同时，兼具可扩展性和轻量级特性。此外，为了实现即插即用操作，我们引入了一种无需学习的 CFA 识别模块，能够准确检测原始数据的 CFA 类型。

Abstract

Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-based demosaicing methods are CFA-specific, requiring multiple individual models that take up precious onboard resources and demand larger development and maintenance efforts. In this work, we propose a modular unified architecture for demosaicing various pixel-bin sensors that provides higher image quality while being extensible and lightweight. Additionally, to enable plug-and-play operation, we introduce a learning-free CFA-identification module to detect the CFA type of raw data accurately.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

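评分理由: The paper studies architecture design for image-sensor demosaicing, which falls under computer vision and signal processing. The scoring keywords mainly cover large language models, reinforcement learning, world models, and related AI topics, and have no substantive connection to the paper's content. Only 'Unify Models' has a weak semantic match because the title mentions 'Unified Architecture'; the remaining keywords, such as Tokenizer, Visual Encoder, MLLM, and RL, are not addressed in the paper. The author list does not include Yang Shi or the other specified experts.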

关键词

Demosaicing, Pixel-bin Image Sensors, Unified Architecture, Modular, Lightweight, CFA-identification, Extensible

186. LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction [FAIL]

Score: 3.0 / 35.2

Authors: Charles Moslonka, Amaury de Vitry, Arthur Garnier, Hicham Randrianarivo, Emmanuel Malherbe

Published: 2026-06-11

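TL;DR: This paper presents LEDGER, a long-context benchmark for extracting financial KPIs from corporate annual reports that include figures and tables, and for evaluating retrieval.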

摘要翻译

财务报告是大型语言模型（LLM）的天然试验场，而近期各类规模模型所具备的超长上下文能力，使得在该领域进行严格评估的需求日益迫切。然而，大多数公共金融资源将任务简化为纯文本的 SEC 10-K 文件（美国证券交易委员会 10-K 年报）与少量问答项的配对。我们发布了 LEDGER（Long-context Evaluation of Documents for Grounded Extraction and Retrieval），这是一个包含 4,999 份数字化公司年度报告的语料库——包含图表、表格及叙述性文本的完整文档，而不仅仅是监管文件。每份报告均标注了 31 个需提取的合并财务关键绩效指标（KPI），并将其与财报发布日的市场反应相关联。基于此数据，我们构建了三个覆盖难度谱系的评估基准：一个纯页面级 KPI 检索任务，包含针对 118,048 个自然语言问题的 TREC 风格相关性判断；一个对话式的"needle-in-a-haystack"单值查找任务；以及一个完整 KPI 提取任务，后两者均源自长篇幅且数值密集的文档。此外，我们还提供了人工 OCR 质量标注（含标注者间一致性评估）以及完整的提取、验证和评分工具链。我们通过一项案例研究进一步展示了该数据集的研究效用，该研究将 CEO 信函修辞与发布后的市场影响联系起来。

Abstract

Finance reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports - full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 questions in natural language, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, both from long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

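评分理由: The paper's main contribution is LEDGER, a long-context benchmark dataset for the financial domain that focuses on extracting and retrieving financial KPIs. The provided keywords mainly concern multimodal large-model architectures (MLLM, MultiModal, Visual Encoder), reinforcement learning (model-based RL), and advanced reasoning frameworks (Latent/Agentic Reasoning). Although annual reports contain both figures and text, and so touch on multimodality, the paper's core is dataset construction and evaluation protocols rather than a new model architecture, world model, or reinforcement learning algorithm. Apart from a slight connection to MultiModal and MLLM through the data modalities, the keywords are unrelated to the paper's core content, and the total score is far below the dynamic passing score.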

关键词

Corporate Annual Reports, Financial KPI, Long-Context Benchmark, Grounded Extraction, Retrieval Task, OCR Annotations, Market Impact

187. CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation [FAIL]

Score: 1.5 / 35.2

Authors: Xiaobin Zhang, Lefei Shen, Mouxiang Chen, Zhuo Li, Hongkai Li, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

Published: 2026-06-11

TL;DR: This paper proposes CloudCons, a benchmark for cloud resource consolidation using time series forecasting, finding that foundation model accuracy does not guarantee better decision utility without proper quantile calibration.

摘要翻译

为确保服务可靠性而采取的保守过度配置策略，导致云数据中心的资源利用率仍处于较低水平。为缓解这一问题，先预测后优化范式（forecast-then-optimize paradigm）应运而生，旨在通过预测未来需求来优化资源整合。尽管新兴的时间序列基础模型（time series foundation models）有望通过零样本泛化（zero-shot generalization）增强这一范式，但现有的基准测试（benchmarks）仅关注预测误差指标。这些先进模型的实际决策效用尚未得到验证，导致其在下游任务（downstream tasks）中的实用价值尚不确定。为弥合这一差距，我们提出了 CloudCons，一个全面的端到端基准，旨在云资源整合（cloud resource consolidation）的特定背景下评估预测模型。我们构建了高质量数据集，涵盖了来自华为云（Huawei Cloud）、微软 Azure（Microsoft Azure）和谷歌 Borg（Google Borg）的多样化工作负载，捕捉了从同步昼夜节律到随机脉冲式爆发和高频噪声等不同服务特性。我们对统计模型、深度学习模型及基础模型进行了广泛评估。实验揭示了一个关键发现：尽管基础模型展现出优越的零样本预测准确率，但这种优势并不必然转化为更好的决策效用。具有实际意义的是，我们系统分析了预测分位数（predictive quantiles）的选择如何作为一个关键杠杆发挥作用。我们提供了可操作的指南，用于校准这些选择以平衡资源效率与服务可靠性之间的权衡，为实际部署决策提供重要见解。

Abstract

Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels. To mitigate this, the forecast-then-optimize paradigm has emerged to optimize consolidation by anticipating future demands. While emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, existing benchmarks focus solely on prediction error metrics. The actual decision utility of these advanced models remains unverified, rendering their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark designed to evaluate forecasting models within the specific context of cloud resource consolidation. We build high-quality datasets that cover diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, capturing distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: while foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. Of practical significance, we systematically analyze how the selection of predictive quantiles acts as a critical lever. We provide actionable guidelines for calibrating these selections to balance the trade-off between resource efficiency and service reliability, offering vital insights for real-world deployment decisions.
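
The role of the predictive quantile as a lever can be illustrated with a toy calculation. The sketch below is not the CloudCons evaluation protocol: it assumes capacity is reserved at the chosen quantile of each step's forecast samples and reports two hypothetical metrics, mean reserved capacity and the fraction of steps where demand exceeded it. All data and function names are made up for illustration.

```python
def empirical_quantile(samples, q):
    """q-quantile of a list of numbers, with linear interpolation."""
    xs = sorted(samples)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac


def evaluate_quantile(forecast_samples, actual_demand, q):
    """Reserve capacity at the q-quantile of each step's forecast samples.

    Returns (mean reserved capacity, fraction of steps where the actual
    demand exceeded the reservation).
    """
    reserved = [empirical_quantile(s, q) for s in forecast_samples]
    violations = sum(a > r for a, r in zip(actual_demand, reserved))
    return sum(reserved) / len(reserved), violations / len(actual_demand)


# Toy probabilistic forecasts (five samples per time step) and realized demand.
forecast_samples = [
    [10, 12, 14, 16, 18],
    [20, 22, 24, 26, 28],
    [5, 6, 7, 8, 9],
]
actual_demand = [15, 27, 6]

for q in (0.5, 0.9):
    capacity, violation_rate = evaluate_quantile(forecast_samples, actual_demand, q)
    print(f"q={q}: mean capacity={capacity:.2f}, violation rate={violation_rate:.2f}")
```

Raising the quantile reserves more capacity and lowers the violation rate in this toy setting, which is the efficiency versus reliability trade-off the abstract describes.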

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

评分理由: The paper focuses on cloud resource consolidation and time series forecasting benchmarks, whereas the provided keywords primarily concern multimodal large models, world models, and reinforcement learning architectures. There is minimal technical overlap; the paper does not discuss tokenizers, visual encoders, multimodal architectures, world models, or reinforcement learning agents. 'Unify Models' receives a low score (1.0) due to the unification of evaluation processes in the benchmark, but it does not address unified model architectures implied by the keyword context. All other keywords are irrelevant (0.0). The total weighted score is 1.5, well below the dynamic passing score of 35.2, indicating a significant domain mismatch.

关键词

Cloud Resource Consolidation, Time Series Forecasting, Foundation Models, Decision Utility, Benchmarking, Cloud Data Centers, Quantile Calibration

188. SupraBench: A Benchmark for Supramolecular Chemistry [FAIL]

Score: 1.5 / 35.2

Authors: Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

Published: 2026-06-11

TL;DR: SupraBench establishes a benchmark for evaluating LLMs on supramolecular chemistry tasks, revealing performance gaps and the impact of domain adaptation.

摘要翻译

超分子化学（涵盖非共价主客体组装的研究）已推动了各类应用的进步。然而，设计主客体系统仍然耗时，每对候选对象均需数天的干实验室（dry-lab）验证。尽管大型语言模型（LLMs）已成为一种快速替代方案，且在分子结合任务上表现优异，但目前尚无基准系统性地评估 LLMs 在主客体推理方面的能力，涵盖基础超分子化学任务（如结合亲和力预测）。为此，我们与领域专家合作发布了首个超分子基准（SupraBench），旨在评估 LLMs 在化学推理方面的能力。具体而言，我们设计了四个基础任务，即结合亲和力预测、最佳结合物选择、溶剂识别和主客体描述，此外还包括一个基于视觉的辅助任务，用于分子识别。此外，我们还发布了 SupraPMC，这是一个从 Europe PMC 提炼而成的、包含 1600 万标记的精心整理的超分子化学文章语料库，旨在支持模型向超分子领域的适配。我们对一系列开放源和专有 LLMs 进行了基准测试，发现 LLMs 在所有任务上均留有巨大的提升空间。在 SupraPMC 上进行领域适应预训练可顺利迁移至分布内回归任务，但在严格的字母格式输出要求上存在权衡。此外，不同任务家族之间的难度分布差异显著，揭示了截然不同的失败模式，这表明当前超分子化学推理仍存在特定的不足。我们的源代码及基准数据集可在 https://github.com/Tianyi-Billy-Ma/SupraBench 获取。

Abstract

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of Supramolecular chemistry articles distilled from Europe PMC, to support the adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source codes and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

评分理由: The paper focuses on a supramolecular chemistry benchmark for LLMs, showing no relevance to Unify Models, Tokenizers, Visual Encoders, World Models, MLLM architectures, Model-Based RL, Latent Reasoning, or Agentic Reasoning. A single vision task offers minimal MultiModal relevance. No specified expert authors are found.

关键词

Supramolecular Chemistry, LLM Benchmark, Host-Guest Binding, Binding Affinity Prediction, Domain Adaptation, Molecular Identification, Chemistry Reasoning

189. ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling [FAIL]

Score: 1.5 / 35.2

Authors: Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi

Published: 2026-06-11

TL;DR: ReSET improves inference accuracy and latency for quantized Large Reasoning Models by employing step-aware temperature scaling and optimized CUDA kernels.

摘要翻译

大推理模型（LRMs）通过生成长中间推理轨迹来提升复杂问题求解能力，但这显著增加了推理成本。NVFP4 推理提供了一种有前景的方法，通过硬件支持的低位宽执行来同时降低计算和内存成本。然而，直接将 NVFP4 应用于 LRMs 引入了两个实际局限性：量化会导致推理准确性下降，且现有的 NVFP4 内核在小批量自回归解码中未能完全实现延迟优势。在这项工作中，我们分析了 NVFP4 量化对推理过程中词元级不确定性的影响。我们发现，量化会增加低熵符号词元处的错误采样，同时在高不确定性推理步骤中导致对一小部分词元的过度集中。基于这一观察，我们提出 ReSET，一种基于推理步骤熵的温度缩放方法，该方法在线估计步骤级不确定性，并利用词元级和步骤级熵信号来调整解码温度。为了解决延迟差距，我们进一步设计了一个基于 CUDA 核心的小-M NVFP4 内核，用于延迟敏感的自回归解码。在各类推理基准和模型规模上，ReSET 相较于 NVFP4 基线将 NVFP4 推理准确性提高了约 2 点。我们的 CUDA 核心小-M 内核进一步改进了延迟敏感解码，相较于 NVFP4 vLLM 实现了高达 2.5 倍的核级加速，且相较于 BF16 实现了约 2 倍的端到端解码加速。代码可在 https://github.com/aiha-lab/ReSET 获取。

Abstract

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to ~2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.
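
The idea of steering the decoding temperature with token-level and step-level entropy can be sketched as follows. This is not the ReSET algorithm, whose exact rule is not given in the abstract. It is a hypothetical illustration that lowers the temperature at low-entropy tokens (to reduce mis-sampling) and raises it when the running step entropy is high (to counter over-concentration). The thresholds, temperature values, and all names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class StepEntropyTracker:
    """Running mean of token entropies within the current reasoning step."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def update(self, token_entropy):
        self.total += token_entropy
        self.count += 1
        return self.total / self.count

    def reset(self):
        """Call at a reasoning-step boundary (boundary detection not shown)."""
        self.total, self.count = 0.0, 0


def adapt_temperature(token_entropy, step_entropy, base=1.0,
                      low=0.5, high=1.0, t_cold=0.6, t_hot=1.2):
    """Hypothetical rule: sharpen confident tokens, flatten uncertain steps."""
    if token_entropy < low:
        return t_cold   # near-deterministic token: sample more greedily
    if step_entropy > high:
        return t_hot    # uncertain step: spread probability mass wider
    return base


tracker = StepEntropyTracker()
for logits in ([5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]):
    h = entropy(softmax(logits))
    t = adapt_temperature(h, tracker.update(h))
    print(f"token entropy={h:.3f}, temperature={t}")
```

In a real decoder the chosen temperature would be applied to the same logits before sampling the next token.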

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

评分理由: The paper focuses on inference optimization for Large Reasoning Models (LRMs) using NVFP4 quantization and step-aware temperature scaling. It does not address Multimodal integration (MLLM, MultiModal, Visual Encoder), World Models, Unify Models, or Reinforcement Learning (model-based RL, Agentic Reasoning). 'Latent Reasoning' receives a minimal score (1.0) due to the focus on reasoning traces, though the core methodology is quantization and latency optimization rather than latent reasoning architectures. Tokenizer is not discussed. No expert authors from the specified list were found.

关键词

NVFP4, Large Reasoning Models, Temperature Scaling, Quantization, Latency Optimization, Step-Aware, Inference Efficiency

190. MiniPIC: Flexible Position-Independent Caching in <100LOC [FAIL]

Score: 1.5 / 35.2

Authors: Nathan Ordonez, Thomas Parnell

Published: 2026-06-11

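TL;DR: MiniPIC proposes a flexible position-independent caching mechanism for the vLLM engine that needs fewer than 100 lines of core-engine changes, significantly improving prefill throughput for recurring structured inputs and lowering latency.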

摘要翻译

检索增强和智能体工作负载反复预填充重复出现的可预测结构化输入（我们称之为“片段（spans）”），例如文档和代码文件。然而，在 vLLM 等引擎中的前缀缓存无法重用其 KV 条目，除非它们与另一个请求共享相同的前缀；而生产级推理服务器中的位置无关缓存（PIC）实现通常需要大幅修改服务器代码，或将 KV 状态保留在服务器之外，从而产生主机到设备传输开销。我们提出极简 PIC（MiniPIC）：一种基于两个核心组件构建的极简、灵活且快速的 vLLM 设计：无位置编码的 KV 缓存和用户控制的缓存重用原语。MiniPIC 在 KV 缓存中存储未旋转的 K 向量，并在注意力机制中使用每个请求的逻辑位置对 K 块应用 RoPE，同时暴露三个面向用户且基于 token 的原语：块对齐填充、片段分隔符（SSep）和提示依赖（PDep），这些原语可修改哈希行为及有效的块级因果注意力结构。仅需少于 100 行的核心引擎更改以及一个自定义注意力后端，这些原语便足以在同一个运行的 vLLM 实例中实现多种 PIC 方法，包括 Block-Attention、EPIC 和 Prompt Cache，同时原生集成 KV 缓存 CPU 卸载实现。在 2WikiMultihopQA 数据集上，结合交错调度的 MiniPIC 相比基线 vLLM 将预填充吞吐量提高了 49%，将缓存片段的首 token 时间减少多达两个数量级，保留未缓存片段的线性预填充扩展特性，且仅产生 5.7% 的最坏情况开销。

Abstract

Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
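
Why caching unrotated K vectors makes a span reusable at any position follows from a property of RoPE: the attention score between a rotated query and a rotated key depends only on their relative offset. The sketch below is a plain-Python illustration of that property, not MiniPIC's attention backend. The vectors, positions, and function names are made up for the example.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to consecutive pairs of `vec`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# The cache holds the *unrotated* key of a span token (toy values).
cached_k = [0.3, -1.2, 0.7, 0.5]
query = [1.0, 0.2, -0.4, 0.9]


def score(q_pos, k_pos):
    """Rotate at attention time, using the request's logical positions."""
    return dot(rope(query, q_pos), rope(cached_k, k_pos))


# The same cached entry serves two requests that place the span differently.
early = score(q_pos=5, k_pos=3)
late = score(q_pos=105, k_pos=103)
print(abs(early - late) < 1e-9)
```

Had the key been rotated before caching, the entry would be tied to the position it was first computed at, which is the limitation the abstract attributes to prefix caching.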

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	1.0/10	1.5

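评分理由: The paper studies cache optimization in LLM inference engines (MiniPIC), which belongs to systems infrastructure. The scoring keywords center on multimodality, world models, reinforcement learning, and unified model architectures (e.g., Visual Encoder, World Models, MLLM, model-based RL) and are largely unrelated to the paper's topic, so the weighted total is far below the dynamic passing score (35.2). Only Agentic Reasoning is touched: the abstract mentions agentic workloads as an application scenario but does not address reasoning mechanisms, so relevance is very low. The author list does not include any of the five specified experts.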

关键词

MiniPIC, Position-Independent Caching, KV Cache, vLLM, Inference Optimization, Prefix Caching, Throughput Improvement

191. NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation [FAIL]

Score: 1.5 / 35.2

Authors: Dongwook Lee, Youngho Cho, Sangkwon Park, Heeseung Kim, Sungroh Yoon

Published: 2026-06-11

TL;DR: This paper proposes a fluency-aware optimization framework to reduce disruptive pauses in simultaneous speech-to-speech translation, achieving natural speech flow without compromising latency or translation quality.

摘要翻译

同时语音到语音翻译 (speech-to-speech translation) 旨在通过最小化延迟 (latency) 来实现近实时通信，提供了一种具有吸引力的实时替代方案，以解决连续翻译 (consecutive translation) 高延迟的问题。然而，对低延迟的过度追求往往导致碎片化的 chunk-wise 语音。因此，听众会经历不自然的声学流 (acoustic flow)，其间穿插着频繁停顿，这可能会增加他们的认知负荷 (cognitive load)。为弥合这一差距，我们引入一个流畅性感知优化框架 (fluency-aware optimization framework)，旨在发现同时翻译的低延迟优势与连续翻译的自然流之间的最佳平衡点 (sweet spot)。该框架通过利用模型内部信号来最小化块间静音 (inter-chunk silences)，其中包括语言多样性以及语音持续时间诱导的时变 (temporal variability)。在短形式和长形式基准 (benchmarks) 上的实验表明，该框架能够产生自然的语音流 (speech flow)，同时保持具有竞争力的延迟和翻译质量。

Abstract

Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-time alternative to the high latency of consecutive translation. However, the excessive pursuit of low latency often results in fragmented chunk-wise speech. Consequently, listeners are subjected to an unnatural acoustic flow punctuated by frequent pauses, which could increase their cognitive load. To bridge this gap, we introduce a fluency-aware optimization framework designed to discover the sweet spot between the low-latency benefits of simultaneous translation and the natural flow of consecutive translation. Our framework minimizes inter-chunk silences by leveraging model-internal signals, including linguistic diversity and induced temporal variability in speech durations. Experiments on short- and long-form benchmarks show that our framework produces natural speech flow while maintaining competitive latency and translation quality.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

评分理由: The paper focuses on simultaneous speech-to-speech translation and fluency optimization, addressing latency vs. natural flow trade-offs. The provided keywords relate to Multimodal World Models, Unified Architectures, and Reinforcement Learning. There is almost no thematic overlap; only 'MultiModal' receives a minimal score due to audio being a modality, but there is no connection to Vision, RL, World Models, or Agent reasoning. The total weighted score (1.5) is far below the dynamic pass threshold (35.2).

关键词

Simultaneous speech-to-speech translation, Natural speech flow, Fluency-aware optimization, Inter-chunk silences, Low latency, Translation quality, Model-internal signals

192. MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting [FAIL]

Score: 1.5 / 35.2

Authors: Lilan Peng, Yandi Liu, Qingren Yao, Chongshou Li, Tianrui Li

Published: 2026-06-11

TL;DR: MP3 proposes a plug-and-play multi-period pattern pre-training plugin for STGNNs that addresses temporal mirages in spatio-temporal forecasting, reducing MAE by 4.7% and RMSE by 5.0% on average across five baselines and five datasets.

摘要翻译

时空预测在交通、气候和能源等众多领域中至关重要。城市时空数据表现出时间幻影（temporal mirage）：相似的短窗口输入具有截然不同的未来趋势，反之亦然。现有的时空图神经网络（STGNNs）无法有效识别此类幻影。我们认为，核心原因在于短窗口输入存在不完整的周期观测、异构的全局空间关联以及跨周期的叠加因果性。为弥合这一差距，我们提出了一种新颖的多周期模式预训练（MP3），这是一种用于区分时间幻影的即插即用预训练插件。MP3 提出了两项核心创新：（1）多周期模式学习旨在从长时序中学习多周期模式。具体而言，多周期时间建模利用边卷积来识别不同的多周期模式。多周期空间建模采用瓶颈投影和全局记忆库，以高效捕获异构的全局空间关系。跨周期模式交互采用因果增强 Transformer，以捕获不同周期模式之间的依赖关系。（2）该插件可无缝集成到现有的 STGNN 骨干网络中，以增强其预测性能。在五个真实世界数据集（包括大规模数据集 CA）上，针对五种 STGNN 基线的实验验证了 MP3 的有效性、卓越的可扩展性和强大的适应性，在所有评估基线上均带来一致且稳健的性能提升。平均而言，MP3 使 MAE 降低 4.7%，使 RMSE 降低 5.0%。代码可在 https://github.com/YAN-outlook/MP3 获取。

Abstract

Spatio-temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits temporal mirage: similar short-window inputs have divergent future trends, and vice versa. Existing spatio-temporal graph neural networks (STGNNs) cannot effectively identify such mirages. We argue that the core reason lies in the short-window inputs that have incomplete period observation, heterogeneous global spatial correlation, and cross-period superposition causality. To bridge this gap, we develop a novel Multi-Period Pattern Pre-training (MP3), a plug-and-play pre-training plugin for distinguishing temporal mirages. MP3 presents two core innovations: (1) The multi-period pattern learning is designed to learn multi-period patterns from long time series. Specifically, multi-period temporal modeling leverages edge convolution to identify different multi-period patterns. Multi-period spatial modeling uses a bottleneck projection and a global memory bank to capture heterogeneous global spatial relations efficiently. Cross-period pattern interaction employs a causality-enhanced Transformer to capture dependencies across different period patterns. (2) This plugin can seamlessly integrate into existing STGNN backbones to strengthen their forecasting performance. Experiments on five STGNN baselines across five real-world datasets (including a large-scale dataset, CA) verify the effectiveness, superior scalability, and strong adaptability of MP3, which brings consistent and robust performance improvements across all evaluated baselines. On average, MP3 reduces MAE by 4.7% and RMSE by 5.0%. The code is available at https://github.com/YAN-outlook/MP3.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

评分理由: The paper focuses on spatio-temporal forecasting using STGNNs and multi-period pattern pre-training, unrelated to MLLM, RL, World Models, or Agentic workflows. It involves latent pattern learning (Latent Reasoning) but lacks tokenizers, visual encoders, or model unification. No specified expert authors are found.

关键词

Spatio-Temporal Forecasting, Multi-Period Pattern Pre-training, STGNN, Temporal Mirage, Edge Convolution, Causality-enhanced Transformer, Global Memory Bank

193. Operadic consistency: a label-free signal for compositional reasoning failures in LLMs [FAIL]

Score: 1.5 / 35.2

Authors: Nathaniel Bottman, Yinhong Liu, Kyle Richardson

Published: 2026-06-11

TL;DR: This paper proposes Operadic Consistency, a label-free diagnostic metric based on operad theory to detect compositional reasoning failures in LLMs, demonstrating strong correlation with accuracy across multiple multi-hop QA datasets.

摘要翻译

在没有真实标签的情况下检测推理时的大语言模型（LLM）推理失败，催生了多种置信度基线，包括基于问题内采样和自我评估的自一致性、语义熵以及 P(True)。操作子理论（Operad theory），即通过迭代替换构建系统的形式化框架，提出了一种互补的诊断方法：模型对组合查询的直接回答，应与其通过组合该查询的既定分解所产生的回答一致。我们将这一想法实例化为操作子一致性（Operadic Consistency, OC），作为一种逐问题信号。在四个多跳问答（QA）数据集上，针对十二个指令微调的大语言模型（参数规模从 4B 到 671B，涵盖开源权重和闭源模型），OC 在每个数据集上都与准确性高度相关（皮尔逊相关系数 $r \in [0.86, 0.94]$，所有 $p \leq 0.0004$），并且是唯一一个在所有四个数据集上均满足 $r \geq 0.85$ 的评估信号。思维链自一致性（Chain-of-thought self-consistency, CoT-SC; Wang et al., 2023）在 HotpotQA 和 DROP 数据集上与 OC 表现相当（$r = 0.93, 0.87$），但在 MuSiQue 和 StrategyQA 数据集上下降至 $r \approx 0.45$。在逐问题层面，OC 在每个数据集上都提供了超越 CoT-SC 和语义熵的信息（OC 系数的聚类稳健 $p \leq 10^{-16}$），且该结论在额外控制构造的感知分解基线后依然稳健（$p \leq 10^{-13}$）。该信号在同等成本（$K=3$）预算下，相对于调优的 CoT-SC 基线，实现了选择性预测的提升（固定覆盖率下的准确率）（AUARC 提升幅度为 +0.086 至 +0.096，AUROC 提升幅度为 +0.092 至 +0.164；95% 置信区间在所有实验设置中均不包含零）。在五个前沿思维模型上（其中分解是从模型自身的思维链中提取的），相同的等成本比较在所有 16 个（数据集、预算、指标）实验设置中均给出了正向的选择性预测点估计提升，其中 16 个实验设置的 95% 置信区间有 12 个不包含零。

Abstract

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
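
The consistency check described above can be sketched in a few lines. This is a simplified binary, exact-match variant under stated assumptions, not the paper's definition of OC: `ask` stands for any callable that returns a model's answer, the `{prev}` placeholder convention for chaining sub-questions is hypothetical, and the stub "model" is a lookup table used only to make the example runnable.

```python
def normalize(answer):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def operadic_consistency(ask, question, decomposition):
    """Return True if the direct answer matches the composed answer.

    `decomposition` is a list of sub-question templates answered in order;
    each may contain '{prev}', which is filled with the previous sub-answer.
    """
    direct = ask(question)
    prev = ""
    for template in decomposition:
        prev = ask(template.format(prev=prev))
    return normalize(direct) == normalize(prev)


question = "What is the capital of the country where the Eiffel Tower is located?"
decomposition = [
    "In which country is the Eiffel Tower located?",
    "What is the capital of {prev}?",
]

# Stub model: answers sub-questions correctly and the direct query as given.
sub_answers = {
    "In which country is the Eiffel Tower located?": "France",
    "What is the capital of France?": "Paris",
}


def make_stub(direct_answer):
    def ask(prompt):
        return direct_answer if prompt == question else sub_answers[prompt]
    return ask


print(operadic_consistency(make_stub("Paris"), question, decomposition))
print(operadic_consistency(make_stub("Lyon"), question, decomposition))
```

No ground-truth label is consulted at any point, which is what makes the signal label-free: only the model's own direct and composed answers are compared.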

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

评分理由: The paper focuses on label-free detection of compositional reasoning failures in LLMs using operad theory (Operadic Consistency). It has minimal semantic overlap with the provided keywords which primarily target multimodal integration, world models, and reinforcement learning. Only 'Latent Reasoning' shares the lexical root 'reasoning', but the paper focuses on textual consistency rather than latent space reasoning. Consequently, most keywords receive a score of 0, resulting in a low weighted total score (1.5), indicating low relevance to the specified keyword track.

关键词

Operadic consistency, Compositional reasoning, LLM reasoning failures, Label-free signal, Multi-hop QA, Selective prediction, Chain-of-thought

194. Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model [FAIL]

Score: 1.5 / 35.2

Authors: Ion Matei, Maksym Zhenirovskyy, Takuya Kurihana, Rohit Vupala, Anthony Wong

Published: 2026-06-11

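TL;DR: This paper proposes a hybrid CNN-cellular automata framework for planning aerial wildfire suppression under uncertainty; optimizing water and retardant drops reduces the fire-affected area.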

摘要翻译

空中野火抑制不仅需要预测火势蔓延，还需要在操作和环境不确定性下设计有效的干预策略。我们提出了一种用于空中野火抑制的建模与优化框架，该框架结合了混合神经 - 元胞自动机（Hybrid Neural-Cellular Automaton）野火模型与基于梯度的定点空中投放设计。野火模型根据地形、燃料和风力数据预测空间变化的蔓延行为，而干预模块则确定二元投放动作，其位置和方向参数为连续值并映射到模拟网格上。水和阻燃剂被表示为具有不同的抑制效果，分别对应立即减少活跃燃烧和持续减少未来蔓延。为了评估所得抑制计划的鲁棒性，我们通过蒙特卡洛（Monte Carlo）采样每日火灾状态实现来量化偶然性不确定性（Aleatoric Uncertainty），并通过空间相关的预测误差扰动来量化认知性不确定性（Epistemic Uncertainty）。基于 2020 年熊火（Bear Fire）的案例研究表明，该框架能够生成连贯的空中抑制计划以减少总过火面积，并支持对野火干预策略的不确定性感知分析。

Abstract

Aerial wildfire suppression requires not only predicting fire spread, but also designing effective intervention strategies under operational and environmental uncertainty. We present a modeling and optimization framework for aerial wildfire suppression that combines a hybrid neural-cellular automaton wildfire model with gradient-based design of targeted aerial drops. The wildfire model predicts spatially varying spread behavior from terrain, fuel, and wind data, while the intervention module determines binary drop actions with continuous-valued location and orientation parameters mapped to the simulation grid. Water and retardant are represented with distinct suppression effects, corresponding to immediate reduction of active burning and persistent reduction of future spread. To evaluate the robustness of the resulting suppression plans, we quantify both aleatoric uncertainty through Monte Carlo sampling of daily fire-state realizations and epistemic uncertainty through spatially correlated prediction-error perturbations. A case study based on the 2020 Bear Fire shows that the framework can generate coherent aerial suppression schedules for reducing total fire-affected area and can support uncertainty-aware analysis of wildfire intervention strategies.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0
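
The per-keyword values in the table above are consistent with a simple rule: each keyword's score is its weight times its relevance (1.5 × 1.0 = 1.5 for model-based RL), and the entry's total of 1.5 is the sum over keywords. The second number in "Score: 1.5 / 35.2" appears to be the passing threshold. The sketch below reproduces that arithmetic; the function names are hypothetical, and both the `>=` comparison and the omission of any author bonus are assumptions, since the report does not spell them out.

```python
def weighted_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


def verdict(total, passing_score):
    # Assumption: an entry passes when its total reaches the threshold.
    return "PASS" if total >= passing_score else "FAIL"


# Rows copied from the table above: (keyword, weight, relevance out of 10).
rows = [
    ("Unify Models", 1.5, 0.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 1.0),
    ("Latent Reasoning", 1.5, 0.0),
    ("Agentic Reasoning", 1.5, 0.0),
]

total = weighted_score(rows)
status = verdict(total, passing_score=35.2)
```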

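Scoring rationale: The paper focuses on wildfire suppression planning with a hybrid CNN-cellular automaton model, which falls under environmental modeling and control. The keyword set (e.g. MLLM, Tokenizer, World Models, Agentic Reasoning) targets multimodal large models and reinforcement-learning agents, so it is a poor match for this topic. Only 'model-based RL' has a weak connection (model-based planning); the remaining keywords are unrelated. None of the specified expert authors appear in the author list.

Keywords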

Aerial Wildfire Suppression, Hybrid CNN-Cellular Automata, Fire Spread Prediction, Gradient-Based Optimization, Uncertainty Quantification, Intervention Strategies, Terrain Data, Monte Carlo Sampling

195. WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition [FAIL]

Score: 1.5 / 35.2

Authors: Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger

Published: 2026-06-11

TL;DR: This paper benchmarks wearable human activity recognition models on efficiency and performance, concluding that compact models and classical algorithms offer the best trade-off despite predictive performance plateauing.

Abstract

Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.
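
The Pareto frontier mentioned in the abstract can be made concrete with a short sketch. A model is on the frontier if no other model is at least as good on every objective and strictly better on one. The example below uses only two objectives (macro-F1, higher is better; latency, lower is better), although the paper also measures peak memory and model size. The model names and numbers are hypothetical placeholders, not results from the benchmark.

```python
def pareto_front(models):
    """Return names of models not dominated on (higher macro-F1, lower latency)."""
    front = []
    for name, f1, latency in models:
        dominated = any(
            (f2 >= f1 and l2 <= latency) and (f2 > f1 or l2 < latency)
            for _, f2, l2 in models
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, macro_f1, latency_ms) triples for illustration only.
candidates = [
    ("large_hybrid", 0.69, 40.0),
    ("compact_net", 0.70, 3.0),
    ("random_forest", 0.66, 0.5),
    ("mid_rnn", 0.68, 12.0),
]

front = pareto_front(candidates)
```

In this made-up setting the two larger models are dominated by the compact one, which mirrors the qualitative pattern the abstract describes.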

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

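Scoring rationale: The paper focuses on benchmarking and efficiency optimization for wearable human activity recognition (WHAR), which is unrelated to the keyword areas of multimodal large models, world models, and reinforcement learning. Only 'Unify Models' has a weak, purely lexical connection because the abstract mentions 'unified model interfaces'; the remaining keywords are irrelevant. None of the specified expert authors appear in the author list.

Keywords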

Wearable Human Activity Recognition, Benchmarking, Efficiency, Deep Learning, Model Interfaces, Android Deployment, Predictive Performance, Standardized Processing

196. Disparate Impact in Synthetic Data Generation [FAIL]

Score: 1.5 / 35.2

Authors: Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi

Published: 2026-06-11

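TL;DR: This paper examines disparate impact across sensitive groups in synthetic data generation, analyzes the sources of error, and proposes learning group-wise models to improve both the overall utility and the fairness of the generated data.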

Abstract

We revisit the fairness notion of disparate impact for synthetic data generation (SDG), that assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, that address the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.
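
One of the error sources named in the abstract, sampling error due to group proportions, can be illustrated with a standard statistical fact: the standard error of an estimated frequency shrinks with the square root of the number of records in the group. The sketch below assumes a single binary attribute with the same true frequency in both groups; the group sizes and the frequency are hypothetical, and the paper's own analysis is more general than this.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a frequency estimated from n independent records."""
    return math.sqrt(p * (1.0 - p) / n)


majority_n, minority_n = 9000, 1000  # hypothetical 90% / 10% split
p = 0.3                              # hypothetical true frequency in both groups

se_majority = proportion_standard_error(p, majority_n)
se_minority = proportion_standard_error(p, minority_n)

# The minority estimate is noisier by sqrt(9000 / 1000) = 3.
ratio = se_minority / se_majority
```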

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	0.0/10	0.0

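Scoring rationale: The paper studies fairness (disparate impact) in synthetic data generation (SDG), involving probabilistic graphical models and a group-wise learning strategy. The keyword set centers on technical architectures such as multimodal large models (MLLM), world models, reinforcement learning, and visual encoders. The paper has almost no overlap with these directions (e.g. Tokenizer, Visual Encoder, RL, Agentic Reasoning); only 'Unify Models' has a very weak conceptual connection (a unified fairness assessment), so the relevance score is very low.

Keywords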

Synthetic Data Generation, Disparate Impact, Fairness, Probabilistic Graphical Models, Group-wise Models, Sensitive Groups, Utility Parity

197. Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers [FAIL]

Score: 1.5 / 35.2

Authors: Achyuthan Sivasankar

Published: 2026-06-11

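TL;DR: This paper uses Fourier-structure analysis to provide causal evidence that circuit synchronization precedes generalization, and introduces the Frequency Synchronization Degree as an early predictor of grokking.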

Abstract

Grokking -- where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy -- is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test p~0.004), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay lambda produces strictly monotone earlier grokking, with Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53,97,131}; R^2=1.00 and R^2=0.99 for two clean cases), captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau). Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
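
Two numbers in the abstract can be checked or illustrated with a few lines of code. If all nine leads are positive, an exact two-sided sign test gives 2 × (1/2)^9 ≈ 0.0039, which is consistent with the reported p~0.004 (assuming the test was two-sided; the abstract does not say). The second function is a hypothetical helper showing how a one-parameter law Delta_t = C/lambda could be fitted by least squares; the abstract lists no raw delays, so any data passed to it would be the reader's own.

```python
from math import comb


def sign_test_two_sided(n_positive, n_total):
    """Exact two-sided sign test p-value under H0: P(positive) = 0.5."""
    k = max(n_positive, n_total - n_positive)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)


def fit_inverse_law(lambdas, delays):
    """Least-squares C for the model delay = C / lambda (no intercept)."""
    xs = [1.0 / lam for lam in lambdas]
    return sum(x * d for x, d in zip(xs, delays)) / sum(x * x for x in xs)


# Nine positive leads out of nine configurations, as stated in the abstract.
p_value = sign_test_two_sided(9, 9)
```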

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	1.0/10	1.5
Agentic Reasoning	1.5	0.0/10	0.0

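Scoring rationale: The paper studies grokking in transformers on modular arithmetic and the synchronization of Fourier circuits, which belongs to deep learning theory. The keywords focus on applied directions such as multimodality, world models, reinforcement learning, and agentic reasoning, so the two areas differ substantially. Only 'Latent Reasoning' has a weak connection, through the analysis of internal circuits; the remaining keywords are irrelevant.

Keywords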

Grokking Transformers, Fourier Structure, Circuit Synchronization, Modular Arithmetic, Frequency Synchronization Degree, Generalization, Causal Evidence

198. RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue [FAIL]

Score: 1.5 / 35.2

Authors: Sara Candussio, Emanuele Ballarin, Lorenzo Bonin, Sandro Junior Della Rovere, Luca Bortolussi

Published: 2026-06-11

TL;DR: RogueAI proposes a reverse Turing test game to detect deceptive LLM agents in dialogue, revealing that while linguistic signatures exist, human players struggle to exploit them compared to simple heuristics.

Abstract

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally-present linguistic signature - differential helpfulness, brevity, hedging - that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0
Latent Reasoning	1.5	0.0/10	0.0
Agentic Reasoning	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on AI safety and deception detection in dialogue (Reverse Turing Test), which is unrelated to the technical architectures and learning paradigms specified in the keywords (e.g., World Models, Visual Encoders, Model-Based RL, MultiModal). The only marginal relevance is the use of AI agents (Agentic Reasoning), but the paper does not address agent reasoning mechanisms or world modeling. None of the specified expert authors (Yang Shi, Xuanyu Zhu, etc.) are present in the author list.

Keywords

RogueAI, Reverse Turing Test, AI Deception, LLM Agents, Dialogue Detection, Social Deduction, Linguistic Signatures, Human-AI Interaction

199. Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization [FAIL]

Score: 0.0 / 35.2

Authors: Marianna Bergamaschi Ganapini, Massimo Chiriatti, Enrico Panai, Giuseppe Riva

Published: 2026-06-11

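TL;DR: This paper examines the hidden influence of AI on human cognition through the System 0 framework and proposes the concept of 'cognitive colonization' to describe how AI embeds external interests within the architecture of the self.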

Abstract

This paper examines three recent frameworks for understanding the cognitive and epistemic consequences of artificial intelligence: Tri-System Theory, Thinkframes, and System 0. It argues that while the first two capture important dimensions of AI's influence on individual reasoning and collective epistemic practices, System 0 occupies a theoretically distinctive position that neither can fully replicate. The paper introduces the concept of cognitive colonization, according to which AI systems can embed external interests within the architecture of the self in ways that are difficult for users to perceive. Because such systems are already widely deployed, understanding these invisible forms of influence is an urgent philosophical and practical task.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

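Scoring rationale: The paper belongs to the philosophy of AI and cognitive science and mainly discusses how AI influences human cognition and the concept of 'cognitive colonization'. The keywords are all technical-architecture terms from machine learning and reinforcement learning (e.g. Tokenizer, Visual Encoder, Model-Based RL). The paper does not involve any model architecture, representation learning, reinforcement learning algorithm, or multimodal implementation, so it has no substantive connection to any of the keywords.

Keywords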

System 0, AI-Mediated Cognition, Cognitive Colonization, Tri-System Theory, Thinkframes, Epistemic Practices, Artificial Intelligence Influence

200. Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch [FAIL]

Score: 0.0 / 35.2

Authors: Haochen Wu, Yi Hou, Shiguang Xie

Published: 2026-06-11

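TL;DR: This paper presents a multi-agent reinforcement learning system that uses delayed marketplace feedback to dynamically adjust objective weights in a three-sided delivery marketplace, increasing batching efficiency and reducing courier costs without degrading delivery quality.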

Abstract

Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
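
The two training ingredients named in the abstract can be sketched for a single logged transition. In Double Q-learning the online network picks the next action and the target network evaluates it, which reduces overestimation. The abstract does not say which conservative regularizer is used; the log-sum-exp penalty below follows the style of Conservative Q-Learning and is an assumed stand-in, not the deployed system's loss. Here each action index would correspond to one discrete multiplier, and all values are hypothetical.

```python
import math


def double_q_target(reward, gamma, q_online_next, q_target_next, done):
    """Double Q-learning target: select with the online net, evaluate with the target net."""
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    bootstrap = 0.0 if done else gamma * q_target_next[best]
    return reward + bootstrap


def conservative_penalty(q_values, logged_action):
    """CQL-style penalty (assumed): logsumexp over actions minus Q of the logged action."""
    m = max(q_values)
    log_sum_exp = m + math.log(sum(math.exp(q - m) for q in q_values))
    return log_sum_exp - q_values[logged_action]


# Hypothetical Q-values over three discrete multipliers for one store.
target = double_q_target(
    reward=1.0,
    gamma=0.9,
    q_online_next=[0.2, 0.8, 0.5],
    q_target_next=[1.0, 0.4, 2.0],
    done=False,
)
penalty = conservative_penalty([0.2, 0.8, 0.5], logged_action=1)
```

The penalty is always non-negative and grows when actions absent from the logs receive high values, which is the overestimation the abstract says the regularizer is meant to curb.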

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

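Scoring rationale: The paper studies multi-agent reinforcement learning and offline policy optimization in a logistics setting, using Double Q-learning. The keyword set focuses on multimodal large-model architectures (MLLM, MultiModal), unified models (Unify Models), world models (World Models), and vision components (Visual Encoder, Tokenizer). Although the abstract mentions 'world feedback', this refers to real-world operational signals and is unrelated to generative world-model architectures; the paper uses model-free rather than model-based RL, and it does not touch multimodal representation learning or large-model reasoning. The paper is therefore largely unrelated to the given keywords. The author list does not include the specified experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang.

Keywords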

Multi-Agent Reinforcement Learning, Delayed Feedback, Objective-Weight Adaptation, Three-Sided Dispatch, Double Q-learning, Offline Policy Learning, Logistics Optimization

201. Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments [FAIL]

Score: 0.0 / 35.2

Authors: Judith Vilella-Cantos, Juan José Cabrera, Mónica Ballesta, David Valiente, Luis Payá

Published: 2026-06-11

TL;DR: This paper proposes a LiDAR early fusion and learned re-ranking strategy to improve robust long-term place recognition in unstructured agricultural vineyard environments.

Abstract

Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) and a learned re-ranking strategy in inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, as finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and +30% including re-ranking. The code of our method is publicly available for reproduction.
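
For readers unfamiliar with the metric, Recall@1 is the fraction of queries whose top-ranked retrieved place is a true match, and re-ranking reorders the first few candidates before that check. The sketch below shows both steps on toy data. The paper's re-ranker is learned; the `score_fn` here is a hypothetical lookup table standing in for it, and the identifiers and scores are invented for the example.

```python
def recall_at_k(rankings, true_matches, k=1):
    """Fraction of queries with at least one true match among the top-k candidates."""
    hits = sum(
        1
        for query, ranked in rankings.items()
        if any(candidate in true_matches[query] for candidate in ranked[:k])
    )
    return hits / len(rankings)


def rerank(ranked, score_fn, top_n=3):
    """Reorder the first top_n candidates by score, best first; keep the rest as-is."""
    head = sorted(ranked[:top_n], key=score_fn, reverse=True)
    return head + ranked[top_n:]


# Toy retrieval results: query id -> candidate place ids, best first.
rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
true_matches = {"q1": {"b"}, "q2": {"d"}}

# Hypothetical re-ranking scores (a learned model would produce these).
scores = {"a": 0.1, "b": 0.9, "c": 0.2, "d": 0.8, "e": 0.3, "f": 0.1}

before = recall_at_k(rankings, true_matches, k=1)
reranked = {q: rerank(r, scores.__getitem__) for q, r in rankings.items()}
after = recall_at_k(reranked, true_matches, k=1)
```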

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on LiDAR sensor fusion for place recognition in agricultural environments, which belongs to robotics and computer vision. The provided keywords relate to Large Language Models, Multimodal AI, and Reinforcement Learning (e.g., Tokenizer, MLLM, World Models). There is a significant domain mismatch; the paper does not discuss model unification, tokenization, visual encoders for LLMs, world models, or reinforcement learning agents. Therefore, all keywords are rated as irrelevant (0).

Keywords

LiDAR Early Fusion, Place Recognition, Heterogeneous Sensors, Learned Re-ranking, Unstructured Environments, MinkUNeXt-VINE++, Agricultural Fields

202. CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection [FAIL]

Score: 0.0 / 35.2

Authors: William Smits

Published: 2026-06-11

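TL;DR: CRAFTIIF is an unsupervised framework based on wavelet features and Isolation Forests that detects four distinct types of time-series anomalies in a unified way and ranks first on VUS-PR among the 25 methods evaluated on the mTSBench benchmark.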

Abstract

Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) -- each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests -- one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework -- oracle F1, detectability limits, and branch separation ratios -- identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: https://github.com/smitswil/craftiif
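
The abstract's relative improvement can be verified directly: (0.463 − 0.329) / 0.329 ≈ 0.407, i.e. +40.7%. The sketch below also shows a generic median-absolute-deviation (MAD) cut-off, one common way to set a threshold without knowing the anomaly rate in advance. How CRAFTIIF combines Otsu and MAD is not described in the abstract, so `mad_threshold`, the multiplier `k`, and the sample scores are assumptions for illustration.

```python
from statistics import median


def mad_threshold(scores, k=3.0):
    """Robust cut-off: median + k * scaled MAD (1.4826 matches sigma for Gaussian data)."""
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    return med + k * 1.4826 * mad


# Hypothetical anomaly scores with one obvious outlier at index 5.
scores = [0.10, 0.12, 0.11, 0.13, 0.09, 0.95]
cutoff = mad_threshold(scores)
flagged = [i for i, s in enumerate(scores) if s > cutoff]

# Relative VUS-PR gain reported in the abstract.
relative_gain = (0.463 - 0.329) / 0.329
```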

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

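Scoring rationale: The paper focuses on anomaly detection in multivariate time series; its core techniques are Isolation Forests and wavelet transforms, which belong to classical machine learning and statistical learning. The keyword set (e.g. Tokenizer, Visual Encoder, MLLM, World Models, model-based RL) points to multimodal large models, representation learning, and reinforcement learning. The two do not overlap in data type (time series vs. multimodal), model architecture (forests vs. Transformer/RL), or task objective (detection vs. generation/decision-making), so every keyword receives a relevance of 0. The author list does not include Yang Shi or the other specified experts, so there are no bonus points. The weighted total is 0, far below the dynamic passing score of 35.2.

Keywords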

Multivariate Time Series, Anomaly Detection, Isolation Forest, Wavelet Features, Unsupervised Learning, Interpretable Model, Four-Type Anomaly

203. Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models [FAIL]

Score: 0.0 / 35.2

Authors: Joseph Keshet

Published: 2026-06-11

Abstract

Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 112 (char 394)

204. Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems [FAIL]

Score: 0.0 / 35.2

Authors: Raymond Vasquez

Published: 2026-06-11

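TL;DR: This paper introduces the concept of 'evaluation sovereignty', shows that performance metrics in weakly supervised metadata classification systems often reflect label alignment rather than true predictive capability, and builds a multi-track evaluation framework for auditing such systems.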

Abstract

Evaluation in machine learning is typically treated as a neutral measurement process. However, in operational information systems, evaluation outcomes are often conditioned by the processes used to generate labels. This paper does not seek to improve classification performance. Instead, it examines the validity of performance measurement under differing label-authority regimes. This issue is particularly relevant in large-scale metadata-driven systems, where labels are often incomplete, inconsistent, or weakly supervised. We introduce evaluation sovereignty, defined as the degree to which performance metrics are independent of label authority and supervision regime, and propose a multi-track evaluation framework that systematically varies training and evaluation label sources. Using hierarchical multi-label classification on large-scale scientific metadata, we demonstrate that models exhibiting strong performance under operational ("silver") evaluation degrade substantially under independent ("gold") evaluation, particularly for fine-grained classification. For example, Micro-F1 decreases from approximately 0.54 to 0.03. Notably, ranking-based metrics remain above baseline, revealing a divergence between latent model signal and classification validity. These findings suggest that commonly reported performance metrics may reflect alignment with labeling processes rather than true predictive capability. We therefore reconceptualize evaluation validity as a system-level property shaped by label governance and provide a practical methodology for auditing intelligent systems operating under weak supervision.
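
The gap between "silver" and "gold" evaluation described in the abstract comes down to scoring the same predictions against two different label sources. The sketch below computes micro-averaged F1 for multi-label predictions and applies it to both tracks. The label sets are toy data chosen to show the effect, not values from the paper, and `micro_f1` is a plain textbook implementation, not the authors' evaluation code.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 for multi-label data given as parallel lists of label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy predictions for two documents.
predictions = [{"a", "b"}, {"c"}]

# "Silver" labels come from the same process the model was trained to imitate.
silver_labels = [{"a", "b"}, {"c"}]
# "Gold" labels come from an independent authority.
gold_labels = [{"a"}, {"d"}]

silver_score = micro_f1(silver_labels, predictions)
gold_score = micro_f1(gold_labels, predictions)
```

Identical predictions score perfectly on the silver track and much lower on the gold track, which is the kind of divergence the multi-track framework is designed to expose.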

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

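Scoring rationale: The paper focuses on evaluation sovereignty and label-authority regimes in metadata classification for weakly supervised information systems. It has no direct connection to the areas the keywords represent (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL, Latent Reasoning, Agentic Reasoning), namely multimodal large models, world models, and reinforcement learning, so every keyword receives a relevance of 0. None of the specified expert authors appear in the author list, so there are no bonus points.

Keywords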

Evaluation Sovereignty, Metadata-Driven Classification, Weakly Supervised, Multi-Track Framework, Label Authority, Performance Metrics, Hierarchical Multi-Label

205. Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms [FAIL]

Score: 0.0 / 35.2

Authors: Hiba Ahmed, Alexander E. I. Brownlee, Jason Adair, Simon T. Powers

Published: 2026-06-11

TL;DR: This paper addresses the mismatch between solar energy generation and household consumption by optimizing multi-day appliance scheduling with Iterated Local Search and Simulated Annealing, aiming to maximize renewable utilization while minimizing user inconvenience.

Abstract

Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours often does not align with household consumption patterns. Appliances such as cookers, washing machines, and dryers are typically operated according to user preferred schedules rather than solar energy availability, creating a scheduling optimization problem. The objective is to determine optimal appliance start times to maximize renewable energy utilization while minimizing user inconvenience and adhering to system constraints. This paper presents a metaheuristic approach using Iterated Local Search (ILS) and Simulated Annealing (SA) to optimize appliance start times, while considering appliance operating durations, power consumption, inverter limit, battery state of charge constraints, and solar generation forecasts. Unlike most existing work, the scheduling is extended beyond a single day to accommodate unfinished tasks from previous days (spillover), ensuring operational continuity and enabling sequential operation across multiple days. Experimental results show that the sequential multi-day scheduling framework effectively manages system constraints while ensuring user convenience under exclusive solar generation. These findings also open opportunities for future research on multi-objective trade-offs between investment in equipment of various sizes, return on that investment, and user satisfaction.
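
A minimal simulated-annealing sketch of the scheduling problem is given below, under strong simplifications. It covers a single day, treats any load above the solar forecast as grid energy, and penalizes the distance of each start time from the user's preferred hour. Battery state of charge, the inverter limit, multi-day spillover, and the Iterated Local Search variant from the paper are all left out. The appliance data, solar profile, cost weights, and cooling schedule are hypothetical.

```python
import math
import random

HOURS = 24

# Hypothetical appliances: (duration_h, power_kw, preferred_start_hour).
APPLIANCES = [(2, 2.0, 19), (2, 1.0, 20), (1, 2.5, 21)]
# Hypothetical hourly solar forecast in kW (zero at night).
SOLAR = [0.0] * 6 + [0.5, 1, 2, 3, 3.5, 4, 4, 3.5, 3, 2, 1, 0.5] + [0.0] * 6


def cost(starts, appliances, solar, inconvenience_weight=0.1):
    """Grid energy drawn plus a weighted shift from preferred start times."""
    load = [0.0] * HOURS
    for start, (duration, power, _) in zip(starts, appliances):
        for hour in range(start, start + duration):
            load[hour] += power
    grid_energy = sum(max(0.0, l - s) for l, s in zip(load, solar))
    shift = sum(abs(s - pref) for s, (_, _, pref) in zip(starts, appliances))
    return grid_energy + inconvenience_weight * shift


def anneal(appliances, solar, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over start hours; returns the best schedule seen."""
    rng = random.Random(seed)
    current = [pref for _, _, pref in appliances]
    current_cost = cost(current, appliances, solar)
    best, best_cost = current[:], current_cost
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(appliances))
        candidate = current[:]
        latest = HOURS - appliances[i][0]  # last start that still ends by midnight
        candidate[i] = min(max(candidate[i] + rng.choice((-1, 1)), 0), latest)
        cand_cost = cost(candidate, appliances, solar)
        delta = cand_cost - current_cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
        temp *= cooling
    return best, best_cost


schedule, schedule_cost = anneal(APPLIANCES, SOLAR)
```

Starting from the preferred evening times, where every kilowatt-hour comes from the grid, the search can only return a schedule that costs the same or less, because it keeps the best schedule seen so far.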

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on energy management and appliance scheduling using classical metaheuristic algorithms (ILS, SA). It does not involve large language models, multimodal architectures, tokenization, visual encoders, world models, or reinforcement-learning-based reasoning. All of the provided AI/ML keywords are therefore irrelevant.

Keywords

Appliance Scheduling, Solar Energy Management, Metaheuristic Algorithms, Iterated Local Search, Simulated Annealing, Renewable Energy Utilization, Multi-day Scheduling

206. Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities [FAIL]

Score: 0.0 / 35.2

Authors: Dipto Das, Achhiya Sultana, Ankit Singh Chauhan, Saadia Binte Alam, Mohammad Shidujaman, Shion Guha, Sunandan Chakraborty, Syed Ishtiaque Ahmed

Published: 2026-06-11

TL;DR: This paper proposes Mod-Guide, an LLM-based content moderation system enhanced with RAG and culturally grounded narratives to improve sensitivity toward indigenous and religious minority communities.

Abstract (Chinese translation)

语言既是一种边缘化机制，也是一种抵抗机制，尤其对于在线应对不敏感且有害言论的少数群体而言。随着内容审核日益依赖大语言模型（LLMs），人们开始担忧这些系统能否识别出文化不敏感的言语——这类言语往往忽视或边缘化历史上代表性不足群体的文化和宗教视角，其表现形式常为隐性抹除、误述或规范性框架，而非公开的敌意。本文聚焦于孟加拉国的印度教社群与查克马社群（分别为该国最大的宗教与土著少数民族），探究基于大语言模型的审核系统的认识论局限，并探索纳入少数群体视角的方法。我们联合社区成员共同构建了一个基于文化背景的不敏感言论语料库，并利用检索增强生成（RAG）技术将他们的叙事整合至审核流程中。我们的工具 Mod-Guide 通过利用源自生活经验的上下文线索，提升了大语言模型对少数群体观点的敏感性。通过涉及少数群体与多数群体参与者的混合方法评估，我们发现经 RAG 增强的审核回应更具语境准确性，且在不同族群间的感知存在差异。本研究通过在内容审核系统设计中强调恢复性正义与诠释性包容，推动了人机交互、人工智能伦理及社会计算领域的研究进展。

Abstract

Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and harmful speech online. As content moderation increasingly depends on large language models (LLMs), concerns arise about whether these systems can recognize culturally insensitive speech -- language that disregards or marginalizes the cultural and religious perspectives of historically underrepresented communities, often through implicit erasure, misrepresentation, or normative framing, rather than overt hostility. Focusing on Bangladesh's Hindu and Chakma communities -- the country's largest religious and Indigenous ethnic minorities, respectively -- this paper investigates the epistemic limits of LLM-based moderation systems and explores methods for incorporating minority perspectives. We co-created a culturally grounded corpus of insensitive speech with community members and integrated their narratives into moderation pipelines using retrieval augmented generation (RAG). Our tool, Mod-Guide, improves LLM sensitivity to minority viewpoints by leveraging contextual cues derived from lived experience. Through mixed-method evaluations involving both minority and majority participants, we demonstrate that RAG-enhanced moderation responses are more contextually accurate and perceived differently across ethnic lines. This work advances research in human-computer interaction, AI ethics, and social computing by foregrounding restorative justice and hermeneutical inclusion in the design of content moderation systems.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on AI ethics, content moderation, and social computing using LLMs and RAG for minority communities. The provided keywords relate to multimodal architecture, world models, and reinforcement learning. There is no technical overlap regarding tokenizers, visual encoders, world models, or RL methodologies, resulting in zero relevance scores (Total Weighted Score: 0.0). Additionally, none of the listed expert authors are present in the author list.

Keywords

Content Moderation, LLM-based, RAG, Minority Communities, Cultural Sensitivity, Restorative Justice, Human-Computer Interaction, Indigenous Ethnic

207. A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget [FAIL]

Score: 0.0 / 35.2

Authors: Joe Dwyer

Published: 2026-06-11

TL;DR: This study investigates training dynamics of a small language model under compute constraints, revealing that validation metrics can degrade non-monotonically over time, suggesting trajectory analysis is essential for accurate compute-aware evaluation.

Abstract (Chinese translation)

本研究考察了在固定且计算受限的 token 预算下训练的小型 Llama 风格语言模型的训练动态。本研究并未仅通过终点性能来评估效率，而是采用定量实验重复测量设计，分析验证损失（validation loss）、验证困惑度（validation perplexity）、滚动波动性（rolling volatility）、回退行为（backslide behavior）、尖峰行为（spike behavior）以及种子间变异性（between-seed variability）如何随基于 token 的训练区间而变化。研究针对拥有 426 万个参数的模型，基于 TinyStories 语料库，采用基于 CPU 的全精度训练，并设定约 2000 万累计训练 token 的目标预算，进行了六次独立训练运行。研究在 21 个区间内收集了指标，共获得了 126 个种子 - 区间观测值。重复测量方差分析（Repeated measures ANOVA）显示，验证损失、验证困惑度和滚动波动性具有统计显著的区间效应。描述性轨迹显示，早期改进迅速，随后在后期训练区间出现非单调退化。平均验证损失从初始化时的 8.3552 下降至接近 400 万 token 时的 2.7996，但在最终检查点回升至 3.9010。验证困惑度遵循相同模式，在训练早期急剧下降，随后上升。衍生遥测数据进一步显示，验证损失反复出现回退，且在预定义标准下，区间摘要证据并未表明存在稳定阶段。这些发现表明，感知计算的语言模型评估应考察训练轨迹，而不仅仅依赖终点指标。在计算受限的设置中，额外的 token 暴露可能会增加计算成本，却未必产生相应的泛化增益；而区间级遥测数据可以揭示不稳定性、退化及边际收益递减现象，这些现象可能被最终指标所掩盖。

Abstract

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative experimental repeated measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected across 21 intervals, producing 126 seed-by-interval observations. Repeated measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, but increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training before rising later. Derived telemetry further showed recurrent validation-loss backslides and no interval-summary evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
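
The reported losses can be turned into perplexities and scanned for regressions with a few lines of code. The sketch below assumes the common convention that perplexity is the exponential of the mean cross-entropy in nats, and it uses a hypothetical definition of a "backslide" (an interval whose loss exceeds the best loss seen so far); the abstract does not state the study's exact definitions, so both are assumptions.

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity under the common convention PPL = exp(mean cross-entropy in nats)."""
    return math.exp(mean_loss_nats)


def count_backslides(losses, tolerance=0.0):
    """Count intervals where validation loss rises above the best value seen so far."""
    best = losses[0]
    backslides = 0
    for loss in losses[1:]:
        if loss > best + tolerance:
            backslides += 1
        best = min(best, loss)
    return backslides


# The three mean validation losses reported in the abstract:
# initialization, near 4 million tokens, and the final checkpoint.
reported = [8.3552, 2.7996, 3.9010]
for loss in reported:
    print(f"loss={loss:.4f}  perplexity={perplexity(loss):.2f}")
print("backslides:", count_backslides(reported))
```

Run on the full 21-interval trajectory of each seed, the same check would show where the loss minimum falls relative to the token budget, which is the kind of trajectory-level signal the abstract argues endpoint metrics can hide.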

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on training dynamics and compute efficiency of a text-only language model under a token budget. The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, Reinforcement Learning, and Unified Architectures. There is no overlap regarding visual encoders, multimodality, world modeling, or RL mechanisms. Thus, all keyword scores are 0. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list (Joe Dwyer).

Keywords

Training Dynamics, Compute-Aware, Llama Style, Repeated Measures, Validation Loss, Token Budget, Backslide Behavior, Small Language Model

208. Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization [FAIL]

Score: 0.0 / 35.2

Authors: Kirato Yoshihara

Published: 2026-06-11

Abstract (Chinese translation)

权重空间几何在神经网络优化中占据核心地位，但流形约束通常被均匀地应用于所有权重矩阵。在这项工作中，我们探究不同的 Transformer 模块是否偏好不同的流形几何。我们研究了用于 GPT-2 预训练的 Manifold Muon，并比较了在注意力块和 MLP 块中 Stiefel 和 DGram 约束的逐层分配方案。我们的结果表明存在明显的不对称性：使用 Stiefel 几何约束注意力层，同时为 MLP 层分配 DGram 几何，在测试的配置中表现最佳；而反向分配方案和全 DGram 配置在共享超参数设置下变得不稳定。我们将此失败归因于受 DGram 约束的注意力权重中的奇异值增长，这会放大注意力 logits 并导致 softmax 饱和。这些发现表明，针对 Transformer 的感知对称性与几何感知优化应是针对特定模块的，而非统一的。

Abstract

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.
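
The saturation mechanism named in the abstract can be illustrated numerically. The sketch below makes a simplifying assumption that is not from the paper: growth in the singular values of the attention weights is modeled as multiplying a fixed set of hypothetical logits by a scalar. Under that assumption, larger scales push the softmax toward a one-hot distribution and its entropy toward zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


base_logits = [1.0, 0.5, -0.5, -1.0]  # hypothetical attention logits for one query
for scale in (1.0, 4.0, 16.0):
    probs = softmax([scale * x for x in base_logits])
    print(f"scale={scale:>4}: max prob={max(probs):.4f}, entropy={entropy(probs):.4f}")
```

Once the distribution is nearly one-hot, the gradient through the softmax becomes very small for most keys, which is one common explanation for why saturated attention destabilizes or stalls training.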

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 138 (char 420)

209. LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis [FAIL]

Score: 0.0 / 35.2

Authors: Fabrizio Marozzo, Pietro Liò

Published: 2026-06-11

Abstract (Chinese translation)

大型语言模型（LLM）正日益被用作技术问题解决中的交互式助手。然而，当用户提供不完整的描述或看似合理但未经验证的解释时，LLM 可能会过早地迎合这些假设，并在收集到充分证据之前就提出解决方案。我们将这种行为称为“用户驱动的迎合行为”（user-driven sycophancy）：指 LLM 倾向于强化用户提供的假设，而非测试替代解释的现象。本文提出了“以调查员身份的大语言模型”（LLM-as-an-Investigator），这是一种基于证据优先的智能体 AI 方法论，旨在实现稳健的问题诊断。该方法通过“解决方案调查员智能体”（Solution Investigator Agent）来实现：该智能体会估计初始问题描述的模糊性，生成候选假设，提出针对性的澄清问题，并在每次回答后更新假设的概率。该智能体并非立即生成响应，而是继续进行调查，直到证据表明某个候选解释比其他替代方案更具说服力。为评估该方法，我们构建了一个基准，源自机械、电气和液压领域内已解决的技术论坛线程。我们采用了一个三智能体评估流程：其中“问题 - 解决方案提取智能体”（Problem-Solution Extractor Agent）将已解决的线程转换为结构化案例，“真实值评估智能体”（Ground-Truth Evaluator Agent）模拟用户并隐藏已知解决方案，而被测试的助手则尝试通过对话恢复该解决方案。实验在多种 LLM 骨干模型上比较了标准助手、推理导向的 LLM 以及所提出的基于调查员的模型。除了诊断准确性之外，我们还分析了标准助手在诊断案例中如何跟随误导性的用户假设。结果表明，所提出的方法在识别问题上比直接提示和仅推理的基线更为准确，同时其基于证据优先的协议有助于减少用户诱导的对话偏差。

Abstract

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.
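
The investigate-then-answer loop described above can be sketched as a probability update over candidate hypotheses. The abstract does not say how the agent updates its hypothesis probabilities, so the sketch assumes a plain Bayes update and a hypothetical confidence threshold as the stopping rule; the fault names, priors, and likelihoods are invented for illustration.

```python
def update(priors, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood of the answer."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


def investigate(priors, answers, threshold=0.8):
    """Apply answers in order; stop once one hypothesis reaches the threshold."""
    beliefs = dict(priors)
    asked = 0
    for likelihoods in answers:
        if max(beliefs.values()) >= threshold:
            break
        beliefs = update(beliefs, likelihoods)
        asked += 1
    leader = max(beliefs, key=beliefs.get)
    return leader, beliefs, asked


# Hypothetical hydraulic-fault case in which the user initially suspects the pump.
priors = {"worn pump": 0.5, "clogged filter": 0.3, "air in line": 0.2}
# Each entry is P(observed answer | hypothesis) for one clarification question.
answers = [
    {"worn pump": 0.2, "clogged filter": 0.7, "air in line": 0.4},
    {"worn pump": 0.1, "clogged filter": 0.8, "air in line": 0.3},
    {"worn pump": 0.3, "clogged filter": 0.9, "air in line": 0.2},
]
leader, beliefs, asked = investigate(priors, answers)
print(leader, round(beliefs[leader], 3), asked)
```

In this toy case the user's preferred explanation starts with the highest prior, yet two clarification answers are enough to move the leading hypothesis elsewhere, which is the behavior an evidence-first protocol is meant to encourage over immediate agreement.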

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 91 (char 373)

210. A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice [FAIL]

Score: 0.0 / 35.2

Authors: Manisha Dubey, Anirban Sarkar, Subramanian Ramamoorthy

Published: 2026-06-11

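TL;DR: This paper proposes a bounded trade-off screening framework to explain how people, in multi-attribute choice, reject options based on their performance on critical attributes rather than applying fully compensatory utility aggregation.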

Abstract (Chinese translation)

人类决策往往涉及在多属性备选方案（multi-attribute alternatives）之间进行选择，然而，经典模型假设存在完全补偿性效用聚合（fully compensatory utility aggregation），尽管已有证据表明人们会拒绝在关键属性上表现不佳的选项。我们提出了一种有界权衡推理框架（bounded trade-off reasoning framework），在该框架下，决策由一个筛选过程（screening process）所支配，该过程评估各属性间收益与损失的平衡。该模型引入了一个权衡容忍度参数（trade-off tolerance parameter），该参数控制可接受的不平衡程度，且可在不同情境下变化。通过模拟，我们表明该机制产生的偏好模式（preference patterns）与标准效用模型（standard utility-based models）不同，并且捕捉了权衡行为中的情境依赖变异（context-dependent variation）。这些结果确立了有界权衡筛选（bounded trade-off screening）作为一种可行的多属性选择（multi-attribute choice）计算机制，并为未来的行为研究（behavioral studies）生成了可检验的预测。

Abstract

Human decision-making often involves choosing between multi-attribute alternatives, yet classical models assume fully compensatory utility aggregation despite evidence that people reject options with poor performance on critical attributes. We propose a bounded trade-off reasoning framework in which decisions are governed by a screening process that evaluates the balance between gains and losses across attributes. The model introduces a trade-off tolerance parameter that controls acceptable imbalance and can vary across contexts. Through simulation, we show that this mechanism produces preference patterns that differ from standard utility-based models and captures context-dependent variation in trade-off behavior. These results establish bounded trade-off screening as a plausible computational mechanism for multi-attribute choice and generate testable predictions for future behavioral studies.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

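Scoring rationale: The paper belongs to behavioral economics and cognitive science and studies trade-off screening in human multi-attribute decision-making, whereas the provided keywords all point to artificial intelligence, multimodal large models, and reinforcement learning (e.g., Tokenizer, Visual Encoder, World Models). The research paradigms and technology stacks are entirely different, with no substantive connection, so the relevance of every keyword is 0. The author list does not include any of the specified AI experts.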

Keywords

Multi-attribute choice, Bounded trade-off, Screening process, Decision-making, Preference patterns, Context-dependent variation, Computational mechanism, Behavioral studies

211. Modern analog computing for solving differential and matrix equations [FAIL]

Score: 0.0 / 35.2

Authors: Zhong Sun, Piergiulio Mannocci, Manuel Le Gallo, Abu Sebastian

Published: 2026-06-11

TL;DR: This paper surveys modern analog computing hardware, including analog CMOS circuits and resistive memory arrays, for solving differential and matrix equations.

Abstract (Chinese translation)

近年来，受人工智能（Artificial Intelligence）和科学计算等数据密集型应用的计算需求驱动，模拟计算重新引起了广泛关注。鉴于计算任务的多样性以及模拟 CMOS 电路和阻性存储技术的最新进展，我们将这一不断演变的领域称为现代模拟计算（Modern Analog Computing）。在此背景下，我们确定了三个核心计算原语（Computational Primitives）：求解微分方程、求解矩阵方程以及执行矩阵 - 向量乘法，并探讨了它们之间的内在联系。此外，我们还考察了这些模拟计算算子的各种硬件实现方式，包括基于分立元件、集成电路（Integrated Circuits）和阻性存储器件构建的方案。其中，阻性存储阵列（Resistive Memory Arrays）因其实现效率而显得尤为具有前景。随后，本文综述了利用现代模拟计算求解微分方程和矩阵方程的最新进展，这些进展结合了先进的模拟 CMOS 电路和阻性存储阵列。最后，我们讨论了这些电路的应用场景，精度与可扩展性问题及其潜在解决方案，与存内计算（In-Memory Computing）的关系，以及模拟计算独特的计算复杂度特性。本文提供了关于模拟计算的统一视角，突出了其优势、当前发展及挑战，并将其定位为下一代计算前沿的关键使能技术。

Abstract

In recent years, driven by the computational demands of data-intensive applications such as artificial intelligence and scientific computing, analog computing has gained renewed interest. Given the diversity of computational tasks and recent advancements in analog CMOS circuits and resistive memory technologies, we refer to the evolving landscape as modern analog computing. In this context, we identify three core computational primitives: solving differential equations, solving matrix equations, and performing matrix-vector multiplications, and we explore the connections among them. We also examine various hardware implementations of these analog computing operators, including those built with discrete components, integrated circuits, and resistive memory devices. Among these, resistive memory arrays emerge as particularly promising due to their implementation efficiency. The paper then surveys recent progress in leveraging modern analog computing to solve differential and matrix equations using both advanced analog CMOS circuits and resistive memory arrays. Finally, we discuss the applications of these circuits, the precision and scalability issues and their potential solutions, the relationship with in-memory computing, and the unique computational complexity of analog computing. This paper provides a unified perspective on analog computing, highlighting its strengths, current developments, and challenges, and positioning it as a pivotal enabler of next-generation computational frontiers.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on analog computing hardware (CMOS, resistive memory) for solving differential and matrix equations, while the provided keywords specifically target AI/ML model components (e.g., Tokenizer, Visual Encoder, MLLM) and reinforcement learning paradigms (World Models, RL). There is no substantive domain overlap regarding model architectures or learning algorithms. All keyword scores are 0.0 due to complete domain mismatch. Total weighted score is 0.0. None of the specified expert authors (Yang Shi, Xuanyu Zhu, etc.) are listed on the paper.

Keywords

Analog computing, Differential equations, Matrix equations, Resistive memory, CMOS circuits, Hardware implementations, Matrix-vector multiplications

212. "Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System [FAIL]

Score: 0.0 / 35.2

Authors: Dipto Das, Matthew Tamura, Syed Ishtiaque Ahmed, Shion Guha

Published: 2026-06-11

TL;DR: This study investigates the asymmetries between institutional accountability frameworks and applicant experiences in Canada's algorithmic visa triage system, revealing epistemic, jurisdictional, and temporal-relational gaps in public-sector algorithmic governance.

Abstract (Chinese translation)

本文探讨了加拿大签证系统中的算法问责制如何在制度层面被阐述，以及跨国申请人如何体验这一过程。我们采用面向公共部门的算法决策框架（ADMAPS），分析了加拿大移民、难民和公民事务部（IRCC）针对临时居民签证（TRV）分流系统的算法影响评估（AIA），并使用混合方法分析了申请人之间的 Reddit 讨论。我们发现，尽管制度性产物强调透明度、程序性保障和有限影响，但申请人仍参与集体意义建构以解读不透明决策，往往在不确定性中依赖同行知识。我们识别出制度问责结构与人感知过程之间的三种不对称性：决策逻辑获取上的认识论不对称性、由地缘政治定位塑造的暴露上的管辖权不对称性，以及体验等待和不确定性上的时间 - 关系不对称性。我们强调将注意力从制度设计转向公共部门算法治理体验的不均衡分布的重要性。综上所述，这些贡献共同表明，跨国移民背景下的算法治理系统会产生制度披露框架无法捕捉的结构不对称性，而扩展 ADMAPS 框架则可以解释这些问责制的不均衡转化。

Abstract

This paper examines how algorithmic accountability in Canada's visa system is articulated institutionally and experienced by applicants across borders. We analyzed Immigration, Refugees and Citizenship Canada (IRCC)'s Algorithmic Impact Assessment (AIA) for the temporary resident visa (TRV) triage system using the algorithmic decision-making adapted for the public sector (ADMAPS) framework and analyzed Reddit discussions among applicants using a mixed-methods approach. We show that while institutional artifacts emphasize transparency, procedural safeguards, and bounded impacts, applicants engage in collective sensemaking to interpret opaque decisions, often relying on peer knowledge amid uncertainty. We identify three asymmetries between how institutional accountability is structured and how people perceive the process: epistemic asymmetry in access to decision logic, jurisdictional asymmetry in exposure shaped by geopolitical positioning, and temporal-relational asymmetry in how waiting and uncertainty are experienced. We emphasize why it is important to shift attention from institutional design to the uneven distribution of experiences with public-sector algorithmic governance. Together, these contributions demonstrate how algorithmic governance systems in the context of transnational migration produce structured asymmetries not captured by institutional disclosure frameworks, and how extending ADMAPS can account for those uneven translations of accountability.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on algorithmic accountability and social science aspects of visa systems, while the keywords relate to technical deep learning architectures (e.g., Tokenizer, Visual Encoder, World Models, RL). There is no technical overlap; thus, all keyword scores are 0, resulting in a total weighted score of 0. None of the listed expert authors are present in the author list.

Keywords

Algorithmic Accountability, Institutional Accountability, Collective Sensemaking, Visa Triage System, ADMAPS Framework, Transnational Migration, Public-sector Algorithmic Governance, Epistemic Asymmetry

213. AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction [FAIL]

Score: 0.0 / 35.2

Authors: Fabien Maury, Solène Grosdidier, Maud de Dieuleveult, Adrien Coulet

Published: 2026-06-11

TL;DR: This paper introduces AAbAAC, a specialized annotated corpus for autoimmunity information extraction, which improves named entity recognition performance through fine-tuning on biomedical abstracts.

Abstract (Chinese translation)

尽管深度学习和大语言模型推动了信息提取领域的进展，但在高度专业的生物医学领域中，性能差距依然存在，其中领域特异性复杂性对通用模型构成了挑战。本文聚焦于自身免疫领域，主要关注的实体包括自身免疫病、自身抗体（即可能标记或导致这些疾病的分子）、它们的分子靶标、它们在体内的位置以及相关的临床体征。在此，我们提出了 AAbAAC（自身抗体与自身免疫标注语料库），该语料库包含从 PubMed 中精选的 115 篇摘要，我们手动标注了其中的实体及其关系。首先，AAbAAC 被用于评估几种方法在命名实体识别（NER）任务上的表现，其次用于微调 NER 模型。我们的研究展示了 AAbAAC 在自身免疫领域信息提取中的效用，表明微调后 NER 性能得到了预期的提升。这体现了小规模标注工作对于专业领域的价值，并为自身免疫的计算研究做出了贡献。AAbAAC 语料库可在 https://github.com/f-maury/AAbAAC 获取。

Abstract

Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed, where we manually annotated entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and secondly, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on biomedical information extraction and corpus creation for autoimmunity using Named Entity Recognition (NER). It does not address multimodal learning, world models, reinforcement learning, visual encoders, tokenizers, or unified architectures as defined in the keywords. The content is purely text-based NLP, unrelated to the provided technical domains (Multimodal/RL/Agents). None of the listed expert authors are present in the author list.

Keywords

Autoimmunity, Information Extraction, Named Entity Recognition, Annotated Corpus, PubMed, Fine-tuning, Biomedical, Text-based

214. Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation [FAIL]

Score: 0.0 / 35.2

Authors: Sitong Lyu, Shabnam Taghiyeva, Mohit Kukadia, Denis Newman-Griffis

Published: 2026-06-11

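TL;DR: This paper studies the ethical and governance fault lines the UK public sector faces when translating national AI policy into local practice, revealing challenges around accountability, fairness, and human oversight, particularly in the area of special educational needs.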

Abstract (Chinese translation)

英国政府已采取支持 AI 的立场，旨在应对严峻的财政压力，变革公共服务交付方式，但将这一愿景转化为负责任 AI 实践的路径尚不明确。尽管英国政策通常在国家级制定，但地方当局负责大多数公共服务交付，而公共部门中以 AI 为主导的叙事范式的快速进展，正在暴露这一国家 - 地方界面处的知识与实践层面的裂痕。本文考察了负责任 AI 如何在英国中央政府与地方当局之间的界面处被解释和实施，以特殊教育与残疾（SEND）这一高利害领域为例。我们对 17 次与政策制定者、从业者和第三部门专业人士的半结构化访谈进行了主题分析，旨在识别负责任 AI 在国家政策与地方实践交汇处的障碍与促进条件。我们确定了地方当局面临的五个相互关联的挑战：AI 的影子使用及数据隐私风险、AI 提供中的市场 - 政府不对称性、人力准备不足、缺乏标准化定义与衡量，以及人类问责制的缺失。针对每一项挑战，参与者均提出了可操作建议，包括加强数据保护框架、重新平衡市场 - 政府关系以及提升人力能力。我们对 SEND 的考察使这些挑战更加凸显，表明影响弱势儿童和家庭的高利害决策如何加剧问责、公平及人类监督方面的紧张关系，从而暴露了基于原则的监管方法的局限。我们认为，负责任的公共部门 AI 既需要国家政策的调整，也需要在地方层面进行机构能力、价值观和治理机制的结构性改革。

Abstract

The UK government has adopted a pro-AI stance to help transform public service delivery in the face of severe financial pressures, but the path to translate this vision into responsible AI practice remains ill-defined. While UK policy is often set at the national level, local authorities are responsible for most public service delivery, and the rapid advance of AI-first narratives in the public sector is exposing fault lines in knowledge and practice at this national-local interface. This paper examines how responsible AI is interpreted and implemented at the interface between the UK's central government and local authorities, taking the high-stakes area of Special Educational Needs and Disabilities (SEND) as a case study. We present a thematic analysis of 17 semi-structured interviews with policymakers, practitioners, and third-sector professionals to identify barriers and enabling conditions for responsible AI where national policy meets local practice. We identify five interconnected challenges facing local authorities: shadow usage of AI and data privacy risks, market-government asymmetry in AI provision, insufficient workforce readiness, a lack of standardised definitions and measurements, and gaps in human accountability. For each, participants proposed actionable steps, from strengthening data protection frameworks and rebalancing the market-government relationship to enhancing workforce capacity. Our examination of SEND brings these challenges into sharper focus, showing how high-stakes decisions affecting vulnerable children and families intensify tensions around accountability, fairness, and human oversight, exposing the limits of a principle-based regulatory approach. We argue that responsible public sector AI requires both national policy adjustments and structural reforms to institutional capacity, values, and governance mechanisms at the local level.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

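Scoring rationale: The paper's topic is public-sector AI ethics and policy governance (with SEND as the case study), whereas the keywords are all technical model architectures (e.g., Tokenizer, Visual Encoder, World Models, MLLM) and reinforcement learning techniques. The paper does not address specific model techniques, representation learning, or algorithm implementation, so all technical keywords have a relevance of 0. The author list does not include any of the specified experts.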

Keywords

Responsible AI, Public Sector Transformation, National Policy, Local Practice, Special Educational Needs, Governance Mechanisms, Accountability, AI Ethics

215. Democracy in the Era of Artificial Intelligence [FAIL]

Score: 0.0 / 35.2

Authors: Evangelos Pournaras, Srijoni Majumdar, Carina Hausladen, Dirk Helbing

Published: 2026-06-11

TL;DR: This handbook explores upgrading democratic principles and systems using AI to foster collective intelligence while mitigating risks like bias and misinformation.

Abstract (Chinese translation)

人工智能 (AI) 与民主的交互是我们这个时代最深刻的挑战之一。一方面，AI 带来了克服民主中长期存在的挑战的机会，例如在协商 (deliberative) 及投票过程中参与度低、民众代表性不足的问题。另一方面，新的风险源于 AI 算法，这些算法侵犯隐私、存在偏见、具有操纵性、传播虚假信息并影响选举结果。超越关于 AI 对民主是好是坏的过于简单化的问题，《人工智能时代的民主手册》转而提出以下问题：如何利用 AI 升级民主及其所基于的原则？如何与 AI 互动，以何种条件？构建民主韧性需要哪些新的价值观和设计原则？来自世界各地不同学科的 59 位作者在 34 章中探讨了 AI 如何赋能民主的集体智慧（第一部分），以及使用大型语言模型 (Large Language Models) 和社交媒体的协商民主 (Deliberative Democracy) 的未来是什么（第二部分）。我们还阐述了 AI 在构建韧性自治系统（第三部分）中的作用，以及 AI 时代民主转型的挑战（第四部分）。最后，我们提出了更广泛的视角（第五部分），重新构想民主与 AI 之间的互动。

Abstract

Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with opportunities to overcome long-standing challenges in democracy, such as low participation in deliberative and voting processes with poor representation of people. On the other hand, new risks arise from AI algorithms that are privacy-intrusive, biased, manipulative, spread misinformation and influence election results. Moving beyond the over-simplistic question of whether AI is good or bad for democracy, the Handbook on Democracy in the Era of Artificial Intelligence asks instead: how to upgrade democracies and the principles they are built on, using AI? How to engage with AI and on what terms? Which new values and design principles are required to build democratic resilience? In 34 chapters by 59 authors across the world from different disciplines, we explore how AI can empower collective intelligence for democracy (Part 1) and what is the future of deliberative democracy using large language models and social media (Part 2). We also illustrate the role of AI for building resilient self-governance systems (Part 3) and the challenges of transforming democracy in the age of AI (Part 4). We conclude with broader perspectives (Part 5) that re-imagine the interplay of democracy and AI.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on AI ethics, policy, and democracy (social science), whereas the keywords pertain to technical machine learning architectures (multimodal models, tokenization, reinforcement learning). There is no overlap in technical methodology or model architecture discussion, resulting in zero relevance for all technical keywords. No expert authors from the specified list are present.

Keywords

Democracy, Artificial Intelligence, Collective Intelligence, Deliberative Democracy, Resilience, Self-governance, Ethics, Large Language Models

216. Diffusion Transformer World-Action Model for AV Scene Prediction [FAIL]

Score: 0.0 / 35.2

Authors: Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

Published: 2026-06-11

Abstract (Chinese translation)

动作条件化的世界模型使自动驾驶汽车能够根据其自身的规划控制预测未来的相机场景，从而实现无需实际道路测试的规划和模拟；但在紧凑且可训练的规模下，未来预测具有模糊性，且该领域标准的失真度量会主动误导：它们奖励模糊的回归均值而非真实预测。我们提出一个紧凑的潜在世界模型来应对这一问题：给定当前的前摄像头潜在表示和一系列自身动作，该模型预测未来的场景潜在表示，由冻结解码器渲染为 $256 \times 256$ 分辨率的帧，预测时长可达 8 秒，并在 150 个保留的 nuScenes 场景上进行评估。我们首先基准测试预测所用的编码器：在跨越六个冻结编码器（涵盖四种表示家族）的实验中，V-JEPA2 结合时间上下文将转向 RMSE 降低了 40%，优于最佳单帧编码器。随后我们训练一个潜在扩散变换器（DiT），并通过受控诊断识别出它所需的四个关键要素：空间标记、$x_0$ 目标、残差锚定以及与目标不确定性匹配的采样策略。在 Stable-Diffusion-VAE 编码 - 预测 - 解码流程中，我们揭示了核心矛盾：失真度量（余弦相似度、SSIM）偏向模糊的均值，掩盖了扩散模型实际上更接近真实帧分布的事实。基于 Inception 的 FID 和 KID 揭示了清晰的感知 - 失真前沿：扩散模型达到 KID 0.078，而回归模型为 0.375（性能提升 4.8 倍），且一种可部署的基于训练数据的校准使得该方法在实际应用中无需测试时的真实值即可实现。该模型真正具有动作可控性（转向驱动场景位移，斯皮尔曼 $\rho= 0.81$，相比之下回归模型为 $-0.18$）。我们将有限的单次前向运动归因于共享当前锚定，并设计了一个紧凑的 170 万参数“跳跃”模型，该模型恢复了完整的真实运动幅度（1.02 倍 GT），而单次前向模型捕捉到的运动幅度不到一半。

Abstract

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
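
The action-controllability claim above is stated as a Spearman rank correlation between steering and scene displacement. As a rough illustration of that statistic only (this is not the authors' evaluation code), the sketch below computes Spearman's $\rho$ for tie-free data. The function names and the sample values are made up for the example.

```python
def ranks(values):
    """Return 1-based ranks; assumes no ties (a simplifying assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid without ties."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical steering commands and measured scene displacements.
steering = [-0.4, -0.1, 0.0, 0.2, 0.5]
displacement = [-3.0, -1.2, 0.1, 0.9, 4.0]
rho = spearman_rho(steering, displacement)
```

A value near 1 means that larger steering inputs consistently go with larger displacements, which is the sense in which the abstract calls the diffusion model action-controllable.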

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 61 (char 344)
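
The failure message above uses the wording that Python's `json` module emits when a parsed value is not followed by a comma or a closing brace. One common trigger is an unescaped double quote inside a string value. Whether that is what happened in this report is an assumption. The sketch below reproduces that kind of error on a made-up reply and shows a guarded parse; `raw_reply` and `parse_scores` are hypothetical names.

```python
import json

# Hypothetical model reply with an unescaped quote inside a string value.
raw_reply = '{"reason": "The paper calls it a "jump" model", "score": 0}'


def parse_scores(text):
    """Return (parsed_dict, None) on success or (None, error_message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        message = f"{err.msg}: line {err.lineno} column {err.colno} (char {err.pos})"
        return None, message


parsed, error = parse_scores(raw_reply)
```

Returning the error text instead of raising lets a report generator record the failure for one entry and continue with the rest.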

217. A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning [FAIL]

Score: 0.0 / 35.2

Authors: Akbar Erkinov, Nurmukhammad Abdurasulov

Published: 2026-06-11

摘要翻译

在在线论坛中共享数学内容仍然是学生和教师的一个显著摩擦点：编写原始 LATEX 容易出错，独立的光学字符识别（OCR）工具需要切换平台，且当前论坛软件缺乏从公式照片到渲染帖子的集成路径。我们提出了一种统一系统，通过在论坛发布界面内直接嵌入图像到 LATEX 的转换流程来消除这一摩擦点。用户上传或捕获数学表达式的图像后，系统将其通过 Mathpix OCR API 进行处理，检测返回输出是 LATEX 还是包含内联数学的纯文本，应用适当的定界符规范化，并在帖子提交至数据库之前，在 LATEX 或 Markdown 模式下渲染实时预览。该架构由三层松散耦合的模块组成：图像处理、渲染和存储，并支持桌面和移动客户端。已提交一项涵盖核心方法的美国临时专利申请。我们详细描述了完整的系统设计、各个组件、数据模式以及关键技术创新，并将本研究工作与现有的独立工具和论坛平台进行对比，以展示其填补的实际差距。除了即时的可用性外，我们认为此类部署的平台构成了一个持续增长且经社区验证的数学问题及逐步解决方案数据集，这一资源可用于训练和基准测试旨在实现准确数学推理的 AI 系统。

Abstract

Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-prone, standalone optical character recognition tools require platform switching, and current forum software offers no integrated path from a photograph of a formula to a rendered post. We present a unified system that eliminates this friction by embedding an image-to-LaTeX conversion pipeline directly inside a forum posting interface. A user uploads or captures an image of a mathematical expression; the system routes it through the Mathpix OCR API, detects whether the returned output is LaTeX or plain text containing inline math, applies the appropriate delimiter normalisation, and renders a live preview in either LaTeX or Markdown mode before the post is committed to the database. The architecture is organized in three loosely coupled layers: image processing, rendering, and storage, and supports both desktop and mobile clients. A provisional US patent application has been filed covering the core methods. We describe the full system design, each component in detail, the data schema, and the key technical innovations, and we position the work against existing standalone tools and forum platforms to demonstrate the practical gap it closes. Beyond immediate usability, we argue that a deployed platform of this kind constitutes a continuously growing, community-validated dataset of mathematical problems and step-by-step solutions, a resource that can be used to train and benchmark AI systems for accurate mathematical reasoning.
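
The detection and delimiter-normalisation step described above could look roughly like the following. This is a minimal sketch, not the authors' implementation: the target convention (`$...$` and `$$...$$`), the classification heuristic, and the function names are all assumptions, and no OCR service is called.

```python
import re

INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)   # matches \( ... \)
DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)  # matches \[ ... \]


def normalize_delimiters(text):
    """Rewrite \\( \\) and \\[ \\] delimiters to $ and $$ (a hypothetical convention)."""
    text = DISPLAY.sub(lambda m: "$$" + m.group(1).strip() + "$$", text)
    return INLINE.sub(lambda m: "$" + m.group(1).strip() + "$", text)


def classify_output(text):
    """Guess whether OCR output is bare LaTeX, prose with embedded math, or plain text."""
    has_delimiters = bool(INLINE.search(text) or DISPLAY.search(text) or "$" in text)
    if has_delimiters:
        return "text_with_math"
    return "latex" if "\\" in text else "plain_text"


normalized = normalize_delimiters(r"Solve \( x^2 = 4 \) for x.")
```

Using a function as the replacement argument avoids the backslash-escape rules that apply to replacement template strings in `re.sub`.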

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 195 (char 477)

218. Understanding Truncated Positional Encodings for Graph Neural Networks [FAIL]

Score: 0.0 / 35.2

Authors: James Flora, Mitchell Black, Weng-Keen Wong, Amir Nayyeri

Published: 2026-06-11

TL;DR: This paper analyzes the expressive power of truncated positional encodings in graph neural networks, demonstrating that truncation breaks theoretical equivalence between spectral and walk-based methods and that combining truncated variants enhances performance on real-world datasets.

摘要翻译

位置编码（PEs）在理论和实证方面均增强了图神经网络（GNNs）的表达能力。两类最流行的位置编码家族——谱编码（例如，拉普拉斯特征空间、有效电阻）和基于游走的编码（邻接矩阵的多项式）——在表达能力上理论上是等价的，其表达能力介于 1-WL 和 3-WL 测试之间。然而，这种等价性假设图神经网络使用的是这些位置编码的“完整”版本，这需要 $O(n^3)$ 的时间和空间复杂度。相反，研究者通常使用这些编码的截断变体，例如前 $k$ 个特征空间或邻接矩阵的幂次。然而，这些截断位置编码的理论性质尚不明确。本文旨在开启对这些截断位置编码的研究。理论上，我们证明，在截断情况下，若干家族的位置编码在表达能力上存在本质差异。作为推论，我们表明截断谱位置编码不再强于 1-WL 测试。此外，我们还研究了一族谱位置编码——$k$-调和距离——以凸显即使密切相关的位置编码在截断后其表达能力也存在差异。最后，我们通过实验证明，在真实数据集上，混合使用截断位置编码优于使用单一家族。

Abstract

Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs - spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix) - are theoretically equivalent in expressive power, with expressivity between the 1-WL and 3-WL tests. However, this equivalence assumes the GNN uses the "complete" version of these PEs, which requires $O(n^3)$ time and space complexity. Instead, practitioners commonly use truncated variants of these encodings, such as the first $k$ eigenspaces or powers of the adjacency matrix. However, the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the $k$-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.
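
To make "truncated walk-based PE" concrete, here is a small sketch that uses the diagonals of the first few adjacency-matrix powers (closed-walk counts) as per-node features. This is one simple instance chosen for illustration; the paper's walk-based family is the broader class of polynomials of the adjacency matrix, and `truncated_walk_pe` is a hypothetical name.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def truncated_walk_pe(adjacency, num_powers):
    """Per-node features: diagonals of A^1 .. A^num_powers (closed-walk counts)."""
    n = len(adjacency)
    features = [[] for _ in range(n)]
    power = [row[:] for row in adjacency]
    for _ in range(num_powers):
        for node in range(n):
            features[node].append(power[node][node])
        power = matmul(power, adjacency)
    return features


# Triangle graph: every node has 0 closed walks of length 1, 2 of length 2, 2 of length 3.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
pe = truncated_walk_pe(triangle, 3)
```

Truncation here means stopping at `num_powers`, which avoids the full $O(n^3)$ computation but, as the abstract argues, changes what the encoding can distinguish.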

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper investigates truncated Positional Encodings within Graph Neural Networks (GNNs), focusing on theoretical expressivity and empirical performance. The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is no conceptual or methodological overlap between GNN encoding theory and the specified multimodal/RL topics, resulting in zero relevance for all keywords.

Keywords

Positional Encodings, Graph Neural Networks, Truncated Variants, Expressive Power, 1-WL Test, Spectral Encodings, Walk-based Encodings

219. Majority-of-Three is Optimal [FAIL]

Score: 0.0 / 35.2

Authors: Divit Rawal, Nikita Zhivotovskiy

Published: 2026-06-11

TL;DR: This paper proves that the majority vote of three independent consistent classifiers is an optimal learner in the realizable PAC setting.

摘要翻译

我们给出了一个简短的证明，表明在可实现 PAC 设定中，三个独立一致分类器的多数投票是一个最优学习器。这证明了最简单投票方案的最优性，同时简化了先前投票学习器的算法结构和概率分析，包括 S. Hanneke 的算法以及 K. Green Larsen 对 Bagging 的分析。

Abstract

We give a short proof that the majority vote of three independent consistent classifiers is an optimal learner in the realizable PAC setting. This proves optimality for the simplest voting scheme, while simplifying both the algorithmic structure and the probabilistic analysis of previous voting learners, including the algorithm of S. Hanneke and the analysis of bagging by K. Green Larsen.
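
The voting scheme itself is short enough to sketch. The example below assumes a toy concept class of one-dimensional thresholds and trains each voter on its own independently drawn sample; both choices, and all names, are illustrative assumptions rather than the paper's construction.

```python
import random


def consistent_threshold(sample):
    """Return a threshold classifier that fits a realizable labeled sample exactly.

    Hypothetical concept class: h_t(x) = 1 if x >= t else 0.
    """
    positives = [x for x, y in sample if y == 1]
    t = min(positives) if positives else float("inf")
    return lambda x: 1 if x >= t else 0


def majority_of_three(h1, h2, h3):
    """Predict the label that at least two of the three classifiers agree on."""
    return lambda x: 1 if h1(x) + h2(x) + h3(x) >= 2 else 0


def draw_sample(rng, size, true_threshold=0.5):
    """Draw uniform points on [0, 1) labeled by a fixed target threshold."""
    xs = [rng.random() for _ in range(size)]
    return [(x, 1 if x >= true_threshold else 0) for x in xs]


rng = random.Random(0)
voters = [consistent_threshold(draw_sample(rng, 50)) for _ in range(3)]
vote = majority_of_three(*voters)
```

Each voter is consistent with its own sample, and the final predictor only errs on a point where at least two voters err at once.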

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on theoretical machine learning (PAC learning and ensemble voting optimality), whereas the provided keywords target multimodal large models, world models, and reinforcement learning architectures. There is no conceptual overlap regarding tokenizers, visual encoders, world models, or agentic reasoning, resulting in zero relevance for all keywords.

Keywords

Majority Vote, PAC Learning, Ensemble Methods, Consistent Classifiers, Realizable Setting, Voting Scheme, Probabilistic Analysis

220. Adjusted Cup-Product Neural Layer [FAIL]

Score: 0.0 / 35.2

Authors: Snigdha Chandan Khilar

Published: 2026-06-11

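TL;DR: This paper proposes a neural network layer that hard-wires the cup product from algebraic topology and proves that the adjustment coefficient is the only source of gauge-invariant signal on closed cycles.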

摘要翻译

在物理和几何中，许多重要的可观测量 (observables) 都是上链 (cochains) 的上杯积 (cup products)。本文引入了调整后的上杯积神经网络层 (adjusted cup product neural layer)。这是一种神经原语 (neural primitive)，它将上杯积与来自高阶规范理论 (higher gauge theory) 的调整项 (adjustment term) 硬编码 (hard wires)。这创造了一个设计上具有规范不变性 (gauge invariant) 的读出 (readout)。他们的主要理论结果表明，在闭链 (closed cycle) 上，输出完全依赖于调整系数 (adjustment coefficient)。将该系数设为零会完全移除输出，无论其他参数如何。因此，调整是规范不变信号 (gauge invariant signal) 的唯一来源。他们证明该可观测量是一个非零二次型 (nonzero quadratic form)，并且在一次和两次规范变换 (gauge transformations) 下精确不变。

Abstract

Many important observables in physics and geometry are cup products of cochains. The adjusted cup product neural layer has been introduced in this paper. It is a neural primitive that hard wires the cup product with an adjustment term from higher gauge theory. This creates a readout that is gauge invariant by design. Their main theoretical result shows that on a closed cycle the output relies entirely on the adjustment coefficient. Setting this coefficient to zero removes the output completely regardless of other parameters. Thus the adjustment is the only source of gauge invariant signal. They prove this observable is a nonzero quadratic form and is exactly invariant under one and two gauge transformations.
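
For readers unfamiliar with the operation the layer hard-wires, the sketch below evaluates the standard cup product of two 1-cochains on an ordered 2-simplex. It covers only the textbook front-face/back-face formula; the adjustment term from higher gauge theory is not implemented, and the cochain values are made up.

```python
def cup_product_1_1(alpha, beta, triangle):
    """Evaluate (alpha cup beta) on an ordered 2-simplex [v0, v1, v2].

    alpha and beta are 1-cochains stored as dicts keyed by ordered edges.
    Standard formula: alpha([v0, v1]) * beta([v1, v2]).
    """
    v0, v1, v2 = triangle
    return alpha[(v0, v1)] * beta[(v1, v2)]


# Hypothetical cochain values on the edges of the simplex [0, 1, 2].
alpha = {(0, 1): 2.0, (1, 2): -1.0, (0, 2): 0.5}
beta = {(0, 1): 1.0, (1, 2): 3.0, (0, 2): 4.0}
value = cup_product_1_1(alpha, beta, (0, 1, 2))
```

Note that the result depends on the vertex ordering: swapping the roles of `alpha` and `beta` generally gives a different number at the cochain level.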

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

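Scoring rationale: The paper focuses on applying algebraic topology and gauge theory to neural networks (cup products, gauge invariance), whereas the provided keywords center on multimodal large models, world models, reinforcement learning, and tokenizers. The paper's content is entirely unrelated to these keyword areas and does not involve multimodal understanding, generation, reinforcement learning agents, or unified model architectures. The author list contains none of the specified experts.

Keywords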

Adjusted Cup-Product, Neural Layer, Gauge Invariant, Higher Gauge Theory, Cochains, Topological Invariants, Algebraic Topology

221. Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks [FAIL]

Score: 0.0 / 35.2

Authors: Fabien Chraim, Dominik Janzing, John Evans

Published: 2026-06-11

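TL;DR: This paper proposes a root cause analysis method for cloud networks based on graphical causal reasoning, recalling the correct root cause in 85.7% of 35 labeled production incidents and producing an exact match in 74.3%.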

摘要翻译

云计算依赖于大规模网络，这些网络本质上是复杂的系统。本文提出了一种新颖的云网络事件根因分析 (RCA) 方法，利用基于图的因果发现技术。我们的方法通过引入时空分组策略和自动化本体 (Ontology)，解决了基于规则自动化的局限性，从而降低了问题的维度。我们利用双变量格兰杰因果性 (Granger Causality) 和条件独立性检验，从二值时间序列数据中构建因果图。针对推理，我们引入了一种概率方法，该方法根据时间滞后为边分配特定的条件概率，从而允许通过因果图遍历实现可解释的、时间感知的根因评分。我们使用来自一家主要云提供商的 35 个生产事件的标注数据集对该系统进行了评估。该模型在 85.7% 的事件中成功召回了正确的根因，并在 74.3% 的事件中实现了精确匹配。在生产环境中，部署的系统已用于超过 800 个实际事件，并获得了网络工程师的积极定性反馈。这些结果突显了在动态且大规模的运行环境中，采用数据驱动的因果方法进行 RCA 的实用性。

Abstract

Cloud-computing relies on large-scale networks which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

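Scoring rationale: The paper focuses on graphical causal reasoning for root cause analysis in cloud networks, which belongs to causal inference and operations. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL, Latent Reasoning, Agentic Reasoning) all belong to multimodal large models, reinforcement learning, and world models. The two share no technical approach or application scenario, so every keyword receives a relevance score of 0. The author list (Fabien Chraim, Dominik Janzing, John Evans) contains none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points are added.

Keywords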

Root Cause Analysis, Cloud Networks, Graphical Causal Reasoning, Time Series Data, Causal Discovery, Probabilistic Inference, Spatiotemporal Grouping

222. Ride, Track, and Recover: Pilot Randomized Trial of a Wearable Digital Self-Management Intervention During a Veteran Endurance-Cycling Program [FAIL]

Score: 0.0 / 35.2

Authors: Alan Ta, Nilsu Salgin, Caleb Armstrong, Kala Phillips Reindel, Farzan Sasangohar

Published: 2026-06-11

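TL;DR: This pilot randomized trial suggests that, during an endurance-cycling program, a wearable digital self-management intervention stabilizes veterans' hyperarousal symptoms better than physical activity alone.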

摘要翻译

退伍军人中的创伤后应激障碍（PTSD）以持续的过度唤醒状态及共病的焦虑和抑郁症状为特征，这些症状在临床环境之外难以监测和管理。13 名参加德克萨斯州 Project Hero 自行车活动的退伍军人在自然主义环境中通过计算机生成的序列被随机分配至两个组别：(1) 数字干预加身体活动，或 (2) 仅身体活动；此外还有一个包含 7 名退伍军人的在家监测对照组，该组成员选自更广泛的 Project Hero 退伍军人社区。连续的智能手表传感结合了心率和加速度计特征以检测过度唤醒事件，这些事件由参与者在实时予以确认。每周收集焦虑、抑郁及 PTSD 严重程度的自评量表数据。广义加性混合模型 (GAMM) 用于刻画随时间变化的非线性轨迹。基线标准化后的过度唤醒轨迹在不同条件下存在显著差异，其中数字干预组（n=7）表现出结构化的稳定，而仅身体活动组（n=3）则在研究后期出现加剧。两个骑行组在耐力赛事期间均表现出急性症状改善；然而，数字干预组整体上表现出更高的获益维持率。在家对照组（n=4）则表现出症状的逐渐下降。机器学习 (ML) 检测的感知精度在不同个体间存在显著差异，并与症状严重程度呈正相关，症状严重程度较高的参与者确认了更高比例的检测事件。这些结果表明，将可穿戴检测技术与数字自我管理工具相结合，可能有助于过度唤醒的稳定及症状改善，同时也强调了个性化和以人为中心的设计在可穿戴心理健康系统中的重要性。

Abstract

Post-traumatic stress disorder (PTSD) in veterans is characterized by persistent hyperarousal and comorbid anxiety and depressive symptoms that are difficult to monitor and manage outside clinical settings. Thirteen veterans participating in a Project Hero cycling event in Texas were randomized by computer-generated sequence in a naturalistic setting to two arms: (1) digital intervention plus physical activity, or (2) physical activity only, plus a third at-home monitoring control cohort consisting of 7 veterans selected from the broader Project Hero veteran community. Continuous smartwatch sensing combined heart rate and accelerometer features to detect hyperarousal events, which were confirmed in real time by participants. Weekly self-report measures of anxiety, depression, and PTSD severity were collected. Generalized additive mixed models characterized nonlinear trajectories over time. Baseline-normalized hyperarousal trajectories differed significantly across conditions, with the digital intervention group (n=7) showing structured stabilization compared to late-study escalation in the physical-only group (n=3). Both cycling groups exhibited acute symptom improvements during the endurance event; however, the digital intervention group demonstrated a higher overall maintenance of gains. The at-home control group (n=4) showed gradual symptom declines. Perceived precision of ML detections varied substantially across individuals and was positively associated with symptom severity, with higher-severity participants confirming a greater proportion of detected events. These results suggest that coupling wearable detection with digital self-management tools may support stabilization of hyperarousal and symptom improvement while emphasizing the importance of personalization and human-centered design in wearable mental health systems.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

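Scoring rationale: This study is a randomized controlled clinical trial of a wearable intervention for PTSD in veterans. Its core is physiological signal monitoring and mental health management, not AI model architecture. The keywords in the list (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL, Latent Reasoning, and Agentic Reasoning) all point to deep learning, reinforcement learning, and large-model architectures. The paper involves only basic sensor data processing and machine learning classification for event detection, and touches none of these advanced AI concepts or model components. Every keyword therefore receives a relevance score of 0. In addition, the author list contains none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points are added.

Keywords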

PTSD, Veterans, Wearable, Smartwatch, Hyperarousal, Digital Intervention, Randomized Trial, Mental Health

223. Reinforcement Learning for Neural Model Editing [FAIL]

Score: 0.0 / 35.2

Authors: Shaivi Malik

Published: 2026-06-11

摘要翻译

编辑预训练神经网络需要针对特定目标设计的专用算法。设计此类算法通常耗时且需要付出大量精力。我们提出一个探索性框架，将神经模型编辑形式化为强化学习 (Reinforcement Learning) 问题，其中智能体利用奖励反馈对模型进行修改。我们引入了两个环境：MaskWorld，智能体对权重进行乘法缩放；以及 ShiftWorld，智能体应用加法权重更新。奖励函数结合了效用保持目标与特定任务编辑目标，使智能体能够在保持整体模型性能的同时学习针对性修改。我们在文本分类中的偏见缓解和图像分类中的机器遗忘任务上评估了该框架，这两者传统上依赖专用算法。结果表明，学习到的策略在遗忘任务上将遗忘集准确率降低至接近 0%，同时保持保留集准确率超过 90%。在偏见缓解设置中，学习到的策略将偏见相关性能提高了 5% 以上，同时保持了一般分类效用。我们的发现表明，神经模型编辑可以被视为强化学习问题，使得编辑策略可通过奖励反馈学习，而无需为每个任务手动设计。

Abstract

Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.
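
The two action types and the combined reward can be sketched in a few lines. The abstract does not give the exact reward formula, so the convex combination and the `trade_off` parameter below are assumptions, as are the function names and the flat weight list.

```python
def mask_edit(weights, scales):
    """MaskWorld-style action: scale each weight multiplicatively."""
    return [w * s for w, s in zip(weights, scales)]


def shift_edit(weights, deltas):
    """ShiftWorld-style action: add an update to each weight."""
    return [w + d for w, d in zip(weights, deltas)]


def editing_reward(utility, edit_score, trade_off=0.5):
    """Combine utility preservation with the task-specific editing objective.

    The convex combination is an assumption made for illustration.
    """
    return trade_off * utility + (1.0 - trade_off) * edit_score


weights = [0.5, -1.0, 2.0]
masked = mask_edit(weights, [1.0, 0.0, 0.5])     # zero out the second weight
shifted = shift_edit(weights, [0.5, 0.0, -2.0])  # nudge the first and third
reward = editing_reward(utility=0.9, edit_score=1.0)
```

In an unlearning setting, `utility` might be retain-set accuracy and `edit_score` one minus forget-set accuracy, so that the agent is rewarded for forgetting without degrading the rest of the model.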

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 79 (char 361)

224. Uncertainty Estimation for Molecular Diffusion Models [FAIL]

Score: 0.0 / 35.2

Authors: Paul Seij, Christian A. Naesseth, Stephan Mandt, Metod Jazbec

Published: 2026-06-11

TL;DR: This paper proposes a post-hoc uncertainty estimation method for molecular diffusion models to filter low-quality generated samples by measuring noise prediction variability during the generation trajectory.

摘要翻译

扩散模型在 3D 分子生成中已得到广泛应用，但它们缺乏一种原理性的信号来判断生成的分子何时可能质量较低。我们提出了一种事后方法，用于估计预训练分子扩散模型中的样本级不确定性。基于去噪网络的拉普拉斯近似，我们测量了生成轨迹上噪声预测的变异性。实证研究表明，所得的不确定性分数对样本质量具有指示意义，与既定的样本级质量指标呈负相关。我们进一步研究了如何利用所提出的不确定性分数来过滤生成的样本，通过测试时缩放提升模型性能。

Abstract

Diffusion models have seen wide adoption for 3D molecular generation, yet they offer no principled signal of when a generated molecule is likely to be of low quality. We propose a post-hoc method for estimating per-sample uncertainty in pretrained molecular diffusion models. Building on a Laplace approximation of the denoising network, we measure the variability of the noise prediction across the generation trajectory. Empirically, we show that the resulting uncertainty score is informative of sample quality, exhibiting a negative correlation with established sample-level quality metrics. We further study how the proposed uncertainty score can be used to filter generated samples, improving model performance via test-time scaling.
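
The scoring-and-filtering idea can be illustrated without any diffusion model. The sketch below assumes that, for each generated sample, several noise predictions per timestep are already available (for example from different weight draws under a Laplace approximation, which is not implemented here). Predictions are reduced to scalars for brevity, and the aggregation rule is an assumption that may differ from the paper's.

```python
from statistics import pvariance


def uncertainty_score(noise_predictions):
    """Average, over timesteps, the variance of the noise predictions at each step.

    noise_predictions[t] holds several scalar predictions for step t.
    """
    per_step = [pvariance(step) for step in noise_predictions]
    return sum(per_step) / len(per_step)


def filter_samples(samples, scores, keep_fraction=0.5):
    """Keep the keep_fraction of samples with the lowest uncertainty."""
    keep = max(1, int(len(samples) * keep_fraction))
    ranked = sorted(zip(scores, range(len(samples))))
    return [samples[i] for _, i in ranked[:keep]]


# Hypothetical trajectories: mol_a has stable predictions, mol_b does not.
trajectories = {
    "mol_a": [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]],
    "mol_b": [[0.0, 1.0, 2.0], [1.0, 3.0, 5.0]],
}
names = list(trajectories)
scores = [uncertainty_score(trajectories[name]) for name in names]
kept = filter_samples(names, scores, keep_fraction=0.5)
```

Generating more candidates than needed and keeping only the low-uncertainty ones is the test-time scaling use the abstract describes.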

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on uncertainty estimation for molecular diffusion models in computational chemistry, while the provided keywords target Multimodal Large Language Models (MLLM), World Models, Reinforcement Learning, and Agent systems. There is no overlap in domain (chemistry vs. general AI/MLLM) or methodology (uncertainty quantification vs. tokenization/visual encoding/agent reasoning). Thus, all keyword relevance scores are 0. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list.

Keywords

Molecular Diffusion Models, Uncertainty Estimation, Laplace Approximation, Noise Prediction Variability, Sample Quality Filtering, Test-time Scaling, 3D Molecular Generation

225. S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP [FAIL]

Score: 0.0 / 35.2

Authors: Mohammed Bouri, Mohammed Erradi, Adnane Saoud

Published: 2026-06-11

TL;DR: This paper proposes the Smooth Growth Bound Tensor (S-GBT), a second-order method that bounds the Hessian element-wise to tighten certified robustness against word substitution attacks, improving certified robust accuracy by up to 23.4% over prior methods.

摘要翻译

尽管自然语言处理（NLP）近期取得了显著进展，模型仍易受到单词替换攻击的影响。大多数现有防御方法聚焦于一阶敏感度，通过测量输入受到轻微扰动时输出的变化程度来评估。然而，它们忽略了这种敏感度是如何演变的，而这正是由曲率所描述的。当梯度发生剧烈变化时，模型仍可能失效。本文引入了平滑增长界张量（S-GBT），这是一种二阶方法，对海森矩阵（Hessian）进行逐元素界定，并为此提供了关于所得鲁棒性界的正式理论证明。在训练过程中加入一个正则化项，以最小化这些界。这使得模型在面对单词替换攻击时具有更紧密的认证鲁棒性。在单词替换作用下，输出的变化同时受到线性项和二次项的界定。S-GBT 被推导应用于两种架构：长短期记忆网络（LSTM）和卷积神经网络（CNN）。该方法被直接整合到训练目标函数中。其在多个基准数据集上的有效性得到了评估。结果表明，与先前方法相比，结合一阶和二阶正则化可将认证鲁棒准确率提高高达 23.4%，同时保持干净准确率具有竞争力。这些发现表明，同时控制梯度及其变化是构建更鲁棒模型的一个有前景的方向。

Abstract

Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first order sensitivity and measure how much the output changes when the input is slightly perturbed. However, they ignore how this sensitivity evolves, which is described by curvature. When gradients vary sharply, models can still fail. This paper introduces the Smooth Growth Bound Tensor (S-GBT), a second order method that bounds the Hessian element-wise, for which we provide formal theoretical proofs on the resulting robustness bounds. A regularization term is added during training to minimize these bounds. This yields tighter certified robustness against word substitution attacks. The change in the output under word substitution is bounded by both a linear term and a quadratic term. S-GBT is derived for two architectures: Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). The method is integrated directly into the training objective. Its effectiveness is evaluated on multiple benchmark datasets. The results show that combining first and second order regularization improves certified robust accuracy by up to 23.4% compared to prior methods, while clean accuracy remains competitive. These findings indicate that controlling both the gradient and its variation is a promising direction for building more robust models.
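
The linear-plus-quadratic shape of the bound mirrors a second-order Taylor expansion. As a generic sketch (not necessarily the paper's exact statement), assume a scalar output $f$ is twice differentiable and its Hessian satisfies $|H_{ij}(z)| \le B_{ij}$ at every $z$ on the segment from $x$ to $x+\delta$, where $\delta$ is the input change caused by a substitution. The symbols $B$ and $\delta$ are notation introduced here.

```latex
\begin{align*}
f(x+\delta) - f(x) &= \nabla f(x)^\top \delta + \tfrac{1}{2}\, \delta^\top H(z)\, \delta
  && \text{for some } z = x + t\delta,\ t \in (0,1), \\
|f(x+\delta) - f(x)| &\le \sum_i |\partial_i f(x)|\,|\delta_i|
  + \tfrac{1}{2} \sum_{i,j} B_{ij}\, |\delta_i|\,|\delta_j| .
\end{align*}
```

The first sum is the first-order sensitivity that earlier defenses control. The second is the curvature term, which shrinks when training regularizes the entries of $B$.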

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on NLP robustness against word substitution attacks using second-order Hessian bounds (S-GBT). The provided keywords relate to Multimodal, World Model, and Reinforcement Learning architectures (e.g., Visual Encoder, MLLM, model-based RL). There is no technical overlap regarding tokenizers, vision, agents, or world modeling, resulting in zero relevance for all keywords. Total weighted score: 0.0 (below passing threshold 35.2).
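
The total and the pass/fail badge can be checked against the table with a few lines of Python. The summation rule (weight times relevance, summed over keywords) is an assumption, since the report shows only the per-keyword values and the final total; the threshold of 35.2 is taken from the rationale above.

```python
def weighted_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows.

    The summation rule is an assumption made for illustration.
    """
    return sum(weight * relevance for _, weight, relevance in rows)


rows = [
    ("Unify Models", 1.5, 0.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
    ("Latent Reasoning", 1.5, 0.0),
    ("Agentic Reasoning", 1.5, 0.0),
]
PASS_THRESHOLD = 35.2  # threshold quoted in the report
total = weighted_score(rows)
status = "PASS" if total >= PASS_THRESHOLD else "FAIL"
```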

Keywords

Certified Robustness, Word Substitution Attacks, Smooth Growth Bound Tensor, Hessian Bounds, NLP Robustness, Second-order Method, Regularization Term

226. Foundations of Practical Quantum Advantage in Quantum-Informed Machine Learning for Predicting Chaos [FAIL]

Score: 0.0 / 35.2

Authors: Maida Wang, Xiao Xue, Minh Chung, Peter V. Coveney

Published: 2026-06-11

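TL;DR: This paper develops theoretical foundations for practical quantum advantage in predicting chaotic dynamical systems, using quantum statistical priors to prove a quantum-classical separation in copy-measurement complexity and demonstrating the mechanism in turbulence and weather-forecasting workflows.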

摘要翻译

我们为混沌动力系统中的量子信息机器学习构建了一种实用量子优势机制的理论基础。一族 k 索引高阶量子统计先验（Q-Priors）承载了 n_q = kq 量子比特上不变测度的 k 点边缘分布，扩展了先前工作的单点构造。我们证明了该机制具有两阶段优势：在表示阶段，叠加和纠缠紧凑地存储了 n_q 量子比特上不变测度的不可分解空间相关性；在提取阶段，对两个副本进行联合贝尔测量可估计任何事后 Pauli 泛函，且所需的副本对数量与 n_q 无关，而针对相应全 Pauli 读出的任何自适应单副本协议则需要 Ω(2^(n_q)) 个副本，这是在副本 - 测量复杂度上可证明的量子 - 经典分离。双副本读出已在模拟及 IQM 超导处理器上得以实现。两个案例研究在具有独立科学价值的工作流中实例化了该机制：一项湍流通道流研究，其中双副本读出获得了不变测度的一个命名非对角关联器（即速度方向相干性）；以及一项基于欧洲中期天气预报中心 ERA5 再分析的中期天气预报工作流，其中对角 k <= 2 Q-Prior 引导 Koopman 展开，在 48 至 240 小时预报时效内将异常相关技巧提高 10% 至 39%，并减少了展开至静态平均场的长期崩溃现象。我们实用优势定义的两个条件在互补层面得到满足，这确定了在容错硬件出现之前实现实用量子优势的一条候选路径。

Abstract

We develop theoretical foundations for a practical quantum-advantage mechanism in quantum-informed machine learning for chaotic dynamical systems. A family of $k$-indexed higher-order quantum statistical priors (Q-Priors) hosts the $k$-point marginal of the invariant measure on $n_q = kq$ qubits, extending the single-site construction of prior work. We prove a two-stage advantage. In the representation stage, superposition and entanglement compactly store non-factorisable spatial correlations of the invariant measure on $n_q$ qubits. In the extraction stage, joint Bell measurements on two copies estimate any post hoc Pauli functional with a copy-pair count independent of $n_q$, whereas any adaptive single-copy protocol for the corresponding full-Pauli read-out requires $\Omega(2^{n_q})$ copies; this is a provable quantum-classical separation in copy-measurement complexity. The two-copy read-out is realised in simulation and on IQM superconducting processors. Two case studies instantiate the mechanism in workflows of independent scientific value: a turbulent channel-flow study in which the two-copy read-out yields a named non-diagonal correlator of the invariant measure (the velocity-direction coherence), and a medium-range weather forecasting workflow on the European Centre for Medium-Range Weather Forecasts ERA5 reanalysis in which the diagonal $k \le 2$ Q-Prior steers a Koopman rollout, improves anomaly-correlation skill by 10-39% across 48-240 h lead times, and reduces the long-horizon collapse of rollouts onto a static mean field. The two conditions of our practical-advantage definition are met at complementary levels, identifying a candidate route to practical quantum advantage before fault-tolerant hardware.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

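Scoring rationale: The paper studies quantum machine learning for predicting chaotic systems, centering on quantum statistical priors and Bell measurements to achieve quantum advantage. The provided keywords (such as Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL, Latent Reasoning, Agentic Reasoning) all point to multimodal large language models, representation learning, and reinforcement learning architectures. The paper has no direct connection to the multimodal AI areas these keywords represent; it sits at the intersection of quantum physics and classical chaos theory, so every keyword receives a relevance score of 0.

Keywords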

Quantum Advantage, Quantum-Informed Machine Learning, Chaotic Dynamical Systems, Quantum Statistical Priors, Invariant Measure, Bell Measurements, Weather Forecasting

227. Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition [FAIL]

Score: 0.0 / 35.2

Authors: Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

Published: 2026-06-11

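TL;DR: This paper shows that large outputs of transformed positional encodings degrade analog-to-digital conversion in memristor-based speech recognition, and reduces that degradation by about 50% relative by rebalancing the ADC's weight and precision bits at stable estimated energy consumption, or by about 30% relative by removing encoding-related linear transformations when the ADC cannot be modified.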

摘要翻译

忆阻器通过使向量 - 矩阵乘法得以模拟执行，为自然语言处理中神经模型的资源高效计算提供了新机会。然而，目前在这些设备上的计算在权重编程和执行过程中均面临更大的失真。在这项工作中，我们发现变换后的位置编码的大输出值会导致基于忆阻器的计算中模数转换（ADC）内的显著性能退化。通过调整特定忆阻器层中 ADC 的权重和精度位比例，我们在保持估计能耗稳定的情况下，将执行的性能退化相对减少了约 50%。此外，我们还研究了无法修改 ADC 的场景。在这种情况下，去除与编码相关的线性变换后，性能退化可相对减少约 30%。

Abstract

Memristors provide a new chance for resource-efficient computation of neural models for natural language processing by enabling analog execution of vector-matrix-multiplication. Yet, computations on these devices are currently subject to larger distortion, both in weight programming and execution. In this work, we identify large output values of transformed positional encodings to cause major degradation within analog-to-digital conversion (ADC) as part of memristor-based computation. By adjusting the proportion of weight and precision bits of the ADC of specific memristor layers, we reduce the degradation of the execution by ~50% relative, while keeping the estimated energy consumption stable. Additionally, we investigate scenarios where the ADC cannot be modified. In that case the degradation can be reduced by ~30% relative after removing encoding-related linear transformations.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on memristor-based analog computation for Automatic Speech Recognition, specifically optimizing positional encodings to reduce ADC distortion. The provided keywords pertain to high-level AI architectures such as Multimodal LLMs, World Models, and Reinforcement Learning, which have no thematic overlap with this hardware/speech study. Therefore, all keyword scores are 0. Additionally, none of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Memristor, Analog Computation, Positional Encoding, Automatic Speech Recognition, ADC Distortion, Vector-Matrix-Multiplication, Energy Consumption

228. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios [FAIL]

Score: 0.0 / 35.2

Authors: Kaijie Xu, Anqi Wang, Xilin Dai

Published: 2026-06-11

TL;DR: This paper introduces PowerPhase, a probabilistic forecasting benchmark for power systems with up to 36,964 jointly forecasted channels, and PowerForge, a scenario-based quantile forecaster, and shows that distributional accuracy and constraint satisfaction rank models differently (the safety-fidelity trade-off).

摘要翻译

概率预测模型越来越多地部署在具有不同通道物理特性和运行约束的多变量系统中，但现有的基准测试在大规模下并未评估这两种属性。公共规范多变量基准的上限为 2,000 个通道，而电力系统基准要么缺乏时间结构，要么缺乏概率评估。我们引入了 PowerPhase，这是一个基于六个输电网的概率预测基准，涵盖从 2,000 到 36,964 个联合预测通道，比流行的规范多变量基准高出超过一个数量级。每个目标轨迹均为交流潮流计算的结果，PowerPhase 提供了感知约束的指标，包括 Safety_mBrier、NECV 和 CVaR-alpha，这些指标补充了 CRPS 和 Distortion。在八个基线模型和三个随机种子下，分布准确性和约束满足度对模型的排名不同，这种权衡我们称之为安全 - 保真度（safety-fidelity）。我们进一步提出了 PowerForge，一种基于场景的分位数预测器，具有特定类型的解码头和变量组之间的因果桥，在每个电网上都取得了最佳平均排名。

Abstract

Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.
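
PowerForge is described above as a scenario-based quantile forecaster, and CRPS is among the reported metrics. The abstract does not state the training objective, so what follows is only a generic sketch of the pinball (quantile) loss that quantile forecasts are commonly scored with. The function names and the sample values are illustrative assumptions, not taken from the paper.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss of one prediction for quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(y_true, quantile_preds):
    """Average pinball loss over a dict {quantile level: predicted value}."""
    losses = [pinball_loss(y_true, pred, q) for q, pred in quantile_preds.items()]
    return sum(losses) / len(losses)


# Hypothetical example: one observed value and three predicted quantiles.
observed = 10.0
predicted = {0.1: 7.0, 0.5: 9.5, 0.9: 12.0}
print(mean_pinball_loss(observed, predicted))
```

Averaging the pinball loss over a dense grid of quantile levels gives, up to a factor of two, an approximation of CRPS, which is why the two are often reported together.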

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on probabilistic forecasting for power systems (multivariate time series data), whereas the provided keywords pertain to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is no technical overlap regarding tokenizers, visual encoders, agents, or latent reasoning spaces typical of the specified research background. Terms like 'Multivariate' in the title do not equate to 'MultiModal' in the context of MLLM architectures.

Keywords

PowerPhase, Probabilistic Forecasting, Power Systems, Time Series, Safety-Fidelity Trade-off, PowerForge, Multivariate Channels, Constraint-aware Metrics

229. Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers [FAIL]

Score: 0.0 / 35.2

Authors: Samuel Erickson, Mikael Johansson

Published: 2026-06-11

TL;DR: This paper theoretically demonstrates that gradient clipping stabilizes asynchronous stochastic gradient descent by eliminating the dependence on maximum update delay, ensuring convergence under sub-Weibull noise distributions.

Abstract Translation (Chinese)

在现代机器学习中，训练并行化是扩大规模的重要策略。异步随机梯度下降（ASGD）通过避免等待慢速工作节点，最大化了可用硬件的利用率。然而，在恒定步长下，ASGD 的收敛性仍因更新中的大延迟而受到慢速工作节点的负面影响。同时，在深度学习模型的异步训练中，经验上已观察到梯度裁剪能够“稳定”训练。在此工作中，我们为这种行为提供了理论依据，因为我们证明了裁剪消除了 Oracle 复杂度对最大延迟的依赖。我们采用了一种梯度噪声的 sub-Weibull 模型，该模型将 sub-Gaussian 和 sub-exponential 分布推广到更重的尾部分布，这一动机源于深度学习中的经验观察。我们证明了期望收敛，并在异步优化领域首次实现了高概率收敛。

Abstract

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is nonetheless affected negatively by slow workers due to large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior, as we show that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise which generalizes sub-Gaussian and sub-exponential distributions to more heavy-tailed distributions, motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
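
The abstract analyzes gradient clipping applied to delayed (stale) gradients. A minimal sketch of clipping by Euclidean norm followed by a constant-step update is given here for orientation. The function names, the plain-list parameter representation, and the threshold value are assumptions for illustration; the paper's exact clipping rule and step-size schedule are not given in the abstract.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def async_sgd_step(params, stale_grad, step_size, max_norm):
    """Apply one update with a possibly delayed gradient, clipped first."""
    clipped = clip_by_norm(stale_grad, max_norm)
    return [p - step_size * g for p, g in zip(params, clipped)]


# Hypothetical example: a large stale gradient arrives from a slow worker.
params = [0.0, 0.0]
stale_grad = [3.0, 4.0]  # norm 5, larger than the threshold below
print(async_sgd_step(params, stale_grad, step_size=0.5, max_norm=1.0))
```

Because the clipped gradient has norm at most `max_norm`, a single stale update can move the parameters by at most `step_size * max_norm`, which gives some intuition for why clipping limits the damage from very late gradients.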

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on optimization theory (Asynchronous SGD, gradient clipping, stragglers) for distributed training, while the provided keywords pertain to multimodal large models, world models, and reinforcement learning architectures. There is no conceptual overlap between the paper's content and the specified keywords. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list (Samuel Erickson, Mikael Johansson).

Keywords

Asynchronous SGD, Gradient Clipping, Stragglers, Distributed Optimization, Convergence Analysis, Sub-Weibull Noise, Robustness

230. Distributional Loss for Robust Classification [FAIL]

Score: 0.0 / 35.2

Authors: Kathleen Anderson, Thomas Martinetz

Published: 2026-06-11

TL;DR: This paper proposes a bimodal Gaussian distribution loss for supervised classification that captures class ambiguity and improves robustness, especially in low-data regimes, without requiring additional label information.

Abstract Translation (Chinese)

本文提出了一种用于监督分类任务的新型损失概念（loss concept）。而非强制将每个输入样本直接映射到单个指定标签，我们将所有分类器输出的优化目标定义为双峰高斯分布（bimodal Gaussian distribution）。这种更软的目标形式隐式地捕捉了类别模糊性，缓解了过拟合，并鼓励学习更鲁棒的决策边界，且无需额外的标签信息。实验结果表明鲁棒性得到了一致的提升，特别是在低数据场景（low-data regimes）下收益尤为显著，同时仅需对标准训练流程（training pipelines）进行极小的修改。

Abstract

This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on a distributional loss for supervised classification to improve robustness and handle ambiguity via a bimodal Gaussian distribution. The provided keywords relate to multimodal large models, world models, and reinforcement learning (e.g., Tokenizer, Visual Encoder, MLLM, World Models, model-based RL), which are unrelated to this supervised learning task. Thus, all keyword scores are 0. No expert authors from the specified list were found.

Keywords

Distributional Loss, Robust Classification, Supervised Learning, Bimodal Gaussian, Decision Boundaries, Low-data Regimes, Overfitting Mitigation

231. Loss-Shift Transfer via Bayes Quotients [FAIL]

Score: 0.0 / 35.2

Authors: Vasileios Sevetlidis

Published: 2026-06-11

TL;DR: This paper investigates transfer learning failures caused by changing loss functions rather than data distribution shifts, formalizing representation requirements using Bayes quotients.

Abstract Translation (Chinese)

迁移学习通常被视为分布偏移的后果进行研究。本文识别了一种正交的失效模式，其中数据分布固定而损失函数发生变化。这种设置被称为 loss shift（损失偏移）。损失函数决定了 X 中哪些信息是贝叶斯相关的，因此即使在相同的联合分布 P(X,Y) 下，两个损失函数也可能需要不同的表示。该思想利用 Bayes quotients（贝叶斯商）进行了形式化，这使得损失函数可以根据细化程度进行排序。在 Bayes-quotient 表述中，严格细化会立即产生一个定性障碍。对于更粗糙的损失的源最小表示，不足以处理更精细的目标损失。对于有限输出 log loss（对数损失），这种障碍变成了一个精确的定量恒等式。超额风险是该表示所丢弃的关于 Y 的条件信息。在控制、学习、合成图像和真实图像设置中的实验显示了预测的效果，即分类等价表示在固定数据分布下可以具有不同的最优 log-loss 性能。

Abstract

Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called loss shift. A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.
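
The claim that a classification-equivalent representation can still lose log-loss performance can be checked on a toy problem. The sketch below relies on a standard information-theoretic fact rather than the paper's own derivation: for a representation Z that is a function of X, the best achievable log loss is H(Y|Z), so the excess over using X directly is H(Y|Z) - H(Y|X). The distribution and all names are hypothetical.

```python
import math


def entropy(p):
    """Entropy in nats of a Bernoulli(p) label."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


# Hypothetical toy problem: X uniform on four values, Y binary.
p_y1_given_x = {0: 0.6, 1: 0.9, 2: 0.1, 3: 0.4}
p_x = {x: 0.25 for x in p_y1_given_x}


def representation(x):
    """Keep only the Bayes-optimal 0-1 decision, which suffices for classification."""
    return int(p_y1_given_x[x] > 0.5)


def cond_entropy_given_x():
    """H(Y|X): optimal log loss when the full input is available."""
    return sum(p_x[x] * entropy(p_y1_given_x[x]) for x in p_x)


def cond_entropy_given_z():
    """H(Y|Z): optimal log loss when only the representation is available."""
    total = 0.0
    for z in (0, 1):
        xs = [x for x in p_x if representation(x) == z]
        p_z = sum(p_x[x] for x in xs)
        p_y1 = sum(p_x[x] * p_y1_given_x[x] for x in xs) / p_z
        total += p_z * entropy(p_y1)
    return total


excess_log_loss = cond_entropy_given_z() - cond_entropy_given_x()
print(excess_log_loss)  # strictly positive: calibration detail was discarded
```

In this example the representation yields the same 0-1 decisions as the full input, yet it merges inputs with different label probabilities, so its optimal log loss is strictly worse.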

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on theoretical transfer learning under loss shift using Bayes quotients, which is unrelated to the provided keywords concerning multimodal architectures (Tokenizer, Visual Encoder, MLLM, MultiModal), world models, reinforcement learning (model-based RL), or agentic systems. The weighted total score is 0.0, well below the dynamic pass score of 35.2. No expert authors from the specified list are present.
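
The rationale refers to a weighted total score and a pass score of 35.2, but the report does not state how the total is computed. A minimal sketch follows, assuming the total is simply the sum of weight times relevance over the keyword table; the aggregation rule, the function names, and the comparison against the pass score are all assumptions.

```python
# Keyword weights as listed in the scoring tables of this report.
WEIGHTS = {
    "Unify Models": 1.5,
    "Tokenizer": 1.5,
    "Visual Encoder": 1.5,
    "World Models": 1.5,
    "MLLM": 1.5,
    "MultiModal": 1.5,
    "model-based RL": 1.5,
    "Latent Reasoning": 1.5,
    "Agentic Reasoning": 1.5,
}


def weighted_total(relevance, weights=WEIGHTS):
    """Assumed aggregation: sum of weight * relevance (0 to 10) per keyword."""
    return sum(w * relevance.get(k, 0.0) for k, w in weights.items())


def verdict(total, pass_score):
    """Assumed rule: an entry passes when its total reaches the pass score."""
    return "PASS" if total >= pass_score else "FAIL"


# Every keyword scored 0.0/10, as in this entry.
relevance = {k: 0.0 for k in WEIGHTS}
total = weighted_total(relevance)
print(total, verdict(total, pass_score=35.2))
```

Under this assumed rule, an entry with all relevances at zero totals 0.0 and fails against any positive pass score, which matches the "Score: 0.0 / 35.2" line shown for these entries.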

Keywords

Transfer learning, Loss shift, Bayes quotients, Representation learning, Excess risk, Log loss, Distribution shift, Conditional information

232. Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering [FAIL]

Score: 0.0 / 35.2

Authors: Federico P. Cortese, Alessio Farcomeni

Published: 2026-06-11

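TL;DR: This paper proposes a robust statistical method for temporal clustering based on feature-weighted jump models; it is unrelated to the multimodal AI and reinforcement learning topics specified in the evaluation keywords.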

Abstract Translation (Chinese)

我们提出了一种用于时间依赖聚类的稳健特征加权跳跃模型。采用惩罚项以鼓励随时间过渡的平滑性，而稳健性则通过引入 Tukey 双权损失函数 (Tukey's biweight loss function) 来实现。此外，一个参数控制不同状态间特征权重的变异性，从而使模型能够为每个特征分配状态特定的重要性。模拟结果表明，该方法能准确恢复真实的聚类序列并可靠地识别相关特征，其表现优于竞争方法，尤其是在存在异常值的情况下。最后，我们通过两个实证应用进行总结：一项是关于 1998-2000 年科索沃冲突相关凶杀案数量的研究，另一项是关于 1949-2024 年十二个欧洲国家宏观经济表现的研究。

Abstract

We propose a robust feature-weighted jump model for time-dependent clustering. A penalty is used to encourage smoothness of transitions over time, while robustness is achieved through Tukey's biweight loss function. An additional parameter controls the variability of feature weights across states, allowing the model to assign state-specific relevance to each feature. We illustrate in simulation how the method accurately recovers the true cluster sequence and reliably identifies relevant features, outperforming competing approaches, particularly in the presence of outliers. We conclude with two empirical applications, one on the number of conflict-related homicides in Kosovo in the period 1998-2000, and another on macroeconomic performance of twelve European countries in the period 1949-2024.
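
Robustness in this model comes from Tukey's biweight loss, which grows roughly quadratically for small residuals and is constant beyond a cutoff, so outliers have bounded influence. The sketch below shows the standard form of that loss. The default cutoff `c=4.685` is the conventional tuning constant from the robust regression literature and is an assumption here; the abstract does not say which value or scaling the authors use.

```python
def tukey_biweight_loss(residual, c=4.685):
    """Tukey's biweight (bisquare) loss, bounded above by c**2 / 6."""
    cap = c * c / 6.0
    if abs(residual) >= c:
        # Beyond the cutoff every residual costs the same.
        return cap
    u = 1.0 - (residual / c) ** 2
    return cap * (1.0 - u ** 3)


# Hypothetical residuals: the two large outliers incur the same capped loss.
for r in (0.0, 1.0, 3.0, 10.0, 1000.0):
    print(r, tukey_biweight_loss(r))
```

A squared-error loss would let the residual of 1000 dominate the fit, whereas here it contributes no more than any residual at or past the cutoff.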

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

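Scoring rationale: The paper focuses on statistical time-series clustering using a robust loss function (Tukey's biweight) and feature weighting, whereas the provided keywords concern multimodal large language models (MLLM), world models, and reinforcement learning. The two do not overlap in methodology (e.g., tokenizers, visual encoders) or in research area (classical statistics vs. deep learning/reinforcement learning), so the relevance of every keyword is 0.

Keywords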

Temporal Clustering, Feature-Weighted Jump Model, Tukey's Biweight Loss, State-Conditional, Time-Dependent, Robust Statistics, Conflict-Related Homicides, Macroeconomic Performance

233. Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling [FAIL]

Score: 0.0 / 35.2

Authors: Kaito Baba, Evripidis Bampis, Giorgos Mitropoulos

Published: 2026-06-11

TL;DR: This paper develops a learning-augmented approximation algorithm for makespan minimization on unrelated machines, achieving (1+epsilon)-approximation for accurate predictions and degrading gracefully to a 2-approximation under error.

Abstract Translation (Chinese)

最近，Antoniadis 等人（ICLR 2025）提出了一种结合预测以近似 NP-hard selection problems 的框架。尽管该方法简单，但它紧密匹配 theoretical lower bounds，使其 generalization 极具吸引力。我们解决了 Antoniadis 等人工作中提出的一个开放性问题，涉及将该方法扩展到 selection problems 类别之外的重要问题，例如 scheduling。我们开发了一种 learning-augmented 算法，用于解决 unrelated machines 上的 makespan minimization 问题（记为 $R\|C_{\max}$）。通过利用对 heavy job assignments 的预测，对于准确的预测，我们实现了 polynomial-time $(1+\varepsilon)$-approximation，随着误差增加，该近似平滑退化至 worst-case 2-approximation。我们通过 empirical analysis 结束我们的工作。

Abstract

Recently, Antoniadis et al. (ICLR 2025) proposed a framework for incorporating predictions to approximate NP-hard selection problems. Despite its simplicity, this approach tightly matches theoretical lower bounds, making its generalization highly compelling. We address an open question raised in the work of Antoniadis et al., concerning the extension of this approach to other important problems outside the class of selection problems, such as scheduling. We develop a learning-augmented algorithm for the makespan minimization problem on unrelated machines, denoted by $R\|C_{\max}$. By using predictions of heavy job assignments, we achieve a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions that smoothly degrades to a worst-case 2-approximation as the error increases. We conclude our work with an empirical analysis of our method.
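
For readers unfamiliar with $R\|C_{\max}$, the objective itself is easy to state in code: each job has a machine-dependent processing time, and the makespan of an assignment is the largest total load on any machine. The sketch below only evaluates that objective for a given assignment. It is not the learning-augmented algorithm from the paper, and the names and the instance are hypothetical.

```python
def makespan(processing_times, assignment):
    """Return the maximum machine load.

    processing_times[i][j] is the time job j takes on machine i;
    assignment[j] is the index of the machine that runs job j.
    """
    loads = [0.0] * len(processing_times)
    for job, machine in enumerate(assignment):
        loads[machine] += processing_times[machine][job]
    return max(loads)


# Hypothetical instance: 2 unrelated machines, 3 jobs.
times = [
    [2.0, 3.0, 10.0],  # machine 0
    [4.0, 1.0, 2.0],   # machine 1
]
print(makespan(times, [0, 1, 1]))  # job 0 on machine 0, jobs 1 and 2 on machine 1
print(makespan(times, [0, 0, 0]))  # everything on machine 0
```

An approximation guarantee such as the worst-case factor of 2 mentioned in the abstract bounds the ratio between the makespan of the returned assignment and that of an optimal one.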

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on learning-augmented algorithms for scheduling problems within theoretical computer science, whereas the provided keywords specifically target Multimodal Large Language Models, World Models, and Reinforcement Learning architectures (e.g., Tokenizer, Visual Encoder, MLLM). There is no conceptual or methodological overlap between the paper's content (makespan minimization, predictions on job assignments) and the specified keywords. Additionally, none of the listed expert authors are present in the author list.

Keywords

Learning-Augmented, Approximation, Unrelated-Machines, Makespan Scheduling, Predictions, NP-hard, Heavy Job Assignments

234. Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models [FAIL]

Score: 0.0 / 35.2

Authors: Aryan Khurana, Aravind Ramana RN, Dhruv Kumar

Published: 2026-06-11

TL;DR: This paper introduces AuthorityBench to demonstrate that citation presence significantly increases hallucination rates in large language models regardless of the citation's factual accuracy.

Abstract Translation (Chinese)

大型语言模型（LLMs）日益被部署于引文增强场景中，然而引文的存在对模型行为的影响（独立于事实内容）仍知之甚少。我们引入了 AuthorityBench，这是一个包含 220,564 个提示的多领域基准，旨在探究基于引文的权威信号如何影响 LLMs 的认知行为。该基准采用完全平衡的 2x2 析因设计，交叉检验陈述真实性与引文真实性，这是首次如此操作，涵盖四个领域（通用知识、科学、法律和医学），并针对 40 个提示模板、四个出版声望层级以及国家编码的作者姓名数据集进行了受控变化。在 12 个结构化研究问题上评估七个模型后，我们发现，无论引文是真实的还是虚构的，引文的存在相对于无引文基线一致地增加了幻觉率。当虚构引文伴随真实陈述时，该效应最强，使幻觉率提高了 3 至 22 个百分点，在通用知识领域达到 35% 至 77%，而法律主张相对稳健，出版声望和作者人口统计特征的影响可忽略不计。所有数据集和评估代码均可在以下网址获取：https://github.com/floating-reeds/AuthorityBench

Abstract

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The provided keywords focus on multimodal architectures (Visual Encoder, MultiModal, MLLM), world modeling, and reinforcement learning (model-based RL, Agentic Reasoning). However, the submitted paper exclusively investigates citation bias and epistemic susceptibility in text-based Large Language Models using a benchmarking approach. There is no overlap in methodology, model architecture, or research focus between the keywords and the paper content.

Keywords

Large Language Models, Citation Bias, Epistemic Susceptibility, Hallucination, AuthorityBench, Multi-Domain, Benchmark

235. Limits of spectral learning under noise [FAIL]

Score: 0.0 / 35.2

Authors: Sabin Roman, Ljupco Todorovski, Saso Dzeroski, Marta Sales-Pardo, Roger Guimera

Published: 2026-06-11

TL;DR: This study derives a universal degradation curve for spectral learning under additive label noise, revealing a fundamental noise threshold that limits the stability of coefficient estimates in recovering functional structures.

Abstract Translation (Chinese)

从含噪数据中学习函数关系是科学推断中的核心问题。谱方法通过将未知函数展开为基函数并从数据中估计相应系数来近似未知函数，但这些系数在噪声下的稳定性仍知之甚少。本文研究了使用多基和多维稀疏谱表示的、带有加性标签噪声的监督回归问题。我们发现，噪声会导致学习到的系数向量产生可预测的漂移，其幅度取决于有效活跃谱模的数量。在对经验特征几何进行白化处理后，我们推导出了含噪与无噪系数向量之间重叠度的闭式表达式，揭示了一条由单一内在噪声尺度控制的通用退化曲线。在 Fourier、Legendre、Bessel 和 Haar 基上的数值实验证实了理论预测。结果表明，谱学习表现出一个基本的噪声阈值，超过该阈值后系数估计变得不稳定，从而为从含噪数据中恢复函数结构设置了内在限制。

Abstract

Learning functional relationships from noisy data is a central problem in scientific inference. Spectral methods approximate unknown functions by expanding them in a basis and estimating the corresponding coefficients from data, but the stability of these coefficients under noise remains poorly understood. Here we study supervised regression with additive label noise using sparse spectral representations across multiple bases and dimensions. We show that noise induces a predictable drift in the learned coefficient vector whose magnitude depends on the effective number of active spectral modes. After whitening the empirical feature geometry, we derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve governed by a single intrinsic noise scale. Numerical experiments across Fourier, Legendre, Bessel, and Haar bases confirm the theoretical prediction. The results demonstrate that spectral learning exhibits a fundamental noise threshold beyond which coefficient estimates become unstable, placing intrinsic limits on recovering functional structure from noisy data.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on theoretical spectral learning and regression stability under noise using mathematical bases (Fourier, Legendre, etc.). It belongs to statistical learning theory and does not involve multimodal architectures, tokenizers, visual encoders, world models, MLLMs, reinforcement learning, or reasoning agents, hence all provided keywords receive a score of 0.

Keywords

Spectral learning, Noisy data, Supervised regression, Coefficient drift, Noise threshold, Basis expansion, Functional relationships

236. A solvable model for unsupervised federated learning [FAIL]

Score: 0.0 / 35.2

Authors: Giovanni Catania, Aurélien Decelle, Gianluca Manzan, Beatriz Seoane, Daniele Tantari

Published: 2026-06-11

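TL;DR: This paper proposes a theoretical framework for unsupervised federated learning based on a teacher-student model and Restricted Boltzmann Machines, showing that interactions among students significantly improve learning performance and strengthen recovery of the underlying pattern.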

Abstract Translation (Chinese)

我们提出了一种理论框架，旨在通过“教师 - 多个交互学生”场景分析生成式设定下的联邦学习（federated learning）。在该场景中，每个学生接收数据的不同实现，这要么源于不同的噪声污染，要么源于访问不同的子集（子集大小可能各异）。利用平衡无序系统（equilibrium disordered system）的理论工具，我们解析地证明学生之间的交互系统性地提升了学习性能：高噪声学生需要更少的样本即可恢复潜在模式，而低噪声学生则能与真实信号（ground-truth signal）实现更大的重叠。我们推导了教师恢复的最优贝叶斯条件，将其表示为样本复杂度（sample complexity）、噪声水平（noise level）和交互强度（interaction strength）的函数，并通过数值模拟验证了这些预测。所得动力学可映射到具有结构化隐藏层的受限玻尔兹曼机（Restricted Boltzmann Machine）的平衡采样上，从而为交互如何改进分布式生成建模（distributed generative modeling）提供了严谨的理论理解。

Abstract

We introduce a theoretical framework for analyzing federated learning in a generative setting through a teacher-multiple interacting students scenario, in which each student receives a distinct realization of the data, either through a different noise corruption or by accessing a different subset, possibly of varying size. Using theoretical tools from equilibrium disordered systems, we analytically show that interactions among students systematically enhance learning performance: highly noisy students require fewer samples to recover the underlying pattern, while low-noise students achieve a larger overlap with the ground-truth signal. We derive the optimal Bayesian conditions for teacher recovery as functions of the sample complexity, noise level, and interaction strength, and validate these predictions through numerical simulations. The resulting dynamics can be mapped onto equilibrium sampling in a Restricted Boltzmann Machine with a structured hidden layer, providing a principled theoretical understanding of how interactions improve distributed generative modeling.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

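Scoring rationale: The paper addresses unsupervised federated learning and statistical physics (RBM), whereas the keywords concern multimodal large models, world models, and reinforcement learning. The fields do not match at all, so the relevance of every keyword is 0. None of the specified experts appear in the author list, so no bonus is awarded. The weighted total score is 0.0, a fail.

Keywords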

Unsupervised Federated Learning, Teacher-Student Model, Restricted Boltzmann Machine, Generative Setting, Equilibrium Disordered Systems, Distributed Generative Modeling, Sample Complexity, Interaction Strength

237. Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition [FAIL]

Score: 0.0 / 35.2

Authors: Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

Published: 2026-06-11

TL;DR: This paper proposes a distribution-based adversarial attack method for skeleton-based human action recognition that preserves motion quality without introducing noise-like perturbations, highlighting robustness concerns in action recognizers.

Abstract Translation (Chinese)

针对骨骼人类动作识别（S-HAR）的对抗性攻击已受到广泛关注。然而，现有方法通常引入类似噪声的扰动，这会损害攻击后的运动质量，因而随着 S-HAR 系统的最新进展，这些扰动本质上仍具有可感知性。我们发现，这种退化源于先前对抗性攻击优化过程中经验风险与真实风险之间的差距。为了解决这一问题，我们提出一种攻击方法，该方法能够在不损害运动质量的前提下生成对抗性运动。为了最小化风险差距并保持运动质量，我们提出一种基于分布的对抗性攻击方法，该方法不引入类似噪声的扰动。为了忠实评估运动质量，我们提出了一种新的度量标准，该标准与人类对现实世界自然性的感知相一致。我们在两个数据集上的最先进 S-HAR 方法上进行了实验，通过定性和定量分析展示了我们的方法在攻击成功率和攻击后运动质量方面的优越性。我们提出的质量保持攻击及其基于分布方法的成功，引发了对动作识别器鲁棒性的严重担忧，突显了该领域亟需进一步增强。

Abstract

Adversarial attacks on skeleton-based human action recognition (S-HAR) have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack, and are therefore inherently perceptible given recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method without introducing noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception of real-world naturalness. Experiments have been conducted on the state-of-the-art S-HAR methods across two datasets, demonstrating the superiority of our method in both the attack success rate and the post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

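Scoring rationale: The paper focuses on adversarial attacks and motion-quality preservation in skeleton-based action recognition, which belongs to computer vision security. The provided keywords mainly concern multimodal large models, world models, and reinforcement learning. The two share no core concepts (e.g., Tokenizer, visual encoder, world models, agentic reasoning), so all relevance scores are 0.

Keywords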

Adversarial Attack, Skeleton-based Human Action Recognition, Motion Quality Preservation, Distribution-based Attack, Robustness Evaluation, Imperceptible Perturbation, Risk Gap Minimization

238. Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback [FAIL]

Score: 0.0 / 35.2

Authors: Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

Published: 2026-06-11

TL;DR: This study proposes a passive BCI approach using EEG criticality features and Naive Bayes classification to accurately identify deep sleep stages for sleep-improvement neurofeedback.

Abstract Translation (Chinese)

自动睡眠分期是被动脑机接口（pBCI）的一项基本应用，通过解码自发神经状态来实现独立于用户意图的闭环干预。本研究评估了源自去趋势波动分析（DFA）的临界性特征，用于特异性识别深睡期（N3）。我们利用 UMAP 流形学习分析了来自 290 名老年女性的 347,232 个脑电（EEG）时段，以可视化状态转换。随后，通过 10 折交叉验证对六种分类器进行了基准测试，使用平衡准确率来确定神经反馈的最佳“状态感知”引擎。朴素贝叶斯（Naive Bayes）取得了最高的平均平衡准确率（87.17% ± 0.24%），显著优于全连接深度神经网络（FNN: 81.58%）和随机森林（Random Forest: 80.97%）。线性模型（LDA: 57.21%; SVM: 51.01%）表现不佳，表明源自 DFA 的临界性特征位于一个独特的非线性流形上。脑电临界性的概率解码为被动脑机接口（pBCIs）提供了一种高准确率的感知机制。这种稳健的分类流程支持开发状态依赖型神经反馈，例如靶向听觉刺激，以促进认知恢复。

Abstract

Automated sleep staging is a fundamental application of passive Brain-Computer Interfaces (pBCI), decoding spontaneous neural states to enable closed-loop interventions independent of user intent. This study evaluates criticality features derived from Detrended Fluctuation Analysis (DFA) for the specific identification of deep sleep (N3). We analyzed $347,232$ EEG epochs from $290$ older women using UMAP manifold learning to visualize state transitions. Subsequently, six classifiers were benchmarked via 10-fold cross-validation, using balanced accuracy to determine the optimal "state-sensing" engine for neurofeedback. Naive Bayes achieved the highest mean balanced accuracy ($87.17\% \pm 0.24\%$), significantly outperforming a fully connected deep neural network (FNN: $81.58\%$) and Random Forest ($80.97\%$). Linear models (LDA: $57.21\%$; SVM: $51.01\%$) performed poorly, indicating that DFA-derived criticality features reside on a distinct, non-linear manifold. Probabilistic decoding of EEG criticality provides a high-accuracy sensing mechanism for pBCIs. This robust classification pipeline supports the development of state-dependent neurofeedback, such as targeted auditory stimulation, to enhance cognitive recovery.
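
The classifiers are compared by balanced accuracy, which matters here because deep sleep (N3) is typically a minority class among sleep epochs. A minimal sketch of the metric, computed as the mean of per-class recall, is shown with a made-up label set. The function name and the toy data are assumptions; the study's evaluation code is not described in the abstract.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so that every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical epochs: 2 deep-sleep (N3) and 8 other.
y_true = ["N3"] * 2 + ["other"] * 8
always_other = ["other"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, always_other)) / len(y_true)
print(plain_accuracy)                           # 0.8
print(balanced_accuracy(y_true, always_other))  # 0.5
```

A classifier that never predicts N3 looks good on plain accuracy but scores 0.5 on balanced accuracy, so a balanced accuracy near 0.5 for a two-class problem indicates close to chance-level discrimination.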

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on EEG signal processing and sleep stage classification using classical machine learning (Naive Bayes, FNN) and signal analysis (DFA), belonging to neuroscience and BCI. The provided keywords relate to Large Language Models, Multimodal Architectures, and Reinforcement Learning (e.g., Tokenizer, Visual Encoder, World Models). There is no technical or thematic overlap. Additionally, none of the listed expert authors are present in the author list.

Keywords

EEG Signal Classification, Deep Sleep Staging, Passive Brain-Computer Interface, Detrended Fluctuation Analysis, Neurofeedback, UMAP Manifold Learning, Criticality Features

239. Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations [FAIL]

Score: 0.0 / 35.2

Authors: Tahiya Chowdhury

Published: 2026-06-11

TL;DR: The study investigates whether speech and interaction dynamics can predict perceived cognitive load during dyadic conversations, finding that conversational interaction provides useful signals related to time pressure and mental demand.

Abstract Translation (Chinese)

从语音中估计认知负荷（cognitive load）的研究主要局限于受控实验室环境，对其在自然协作对话中的可靠性尚缺乏深入了解。我们探究语音和交互动态（interaction dynamics）是否能预测双人对话（dyadic conversations）期间的感知认知负荷。我们分析了 53 个二人组（dyads）执行九项协作任务时的音频，提取静态声学、动态及交互特征，以训练一个双头门控循环单元（Gated Recurrent Unit, GRU）编码器，用于预测认知负荷评分。结果表明，会话交互提供了有用的信号，可用于预测与时间压力、脑力工作、努力程度及任务表现相关的认知负荷。时间需求（temporal demand）与轮流说话动态（turn-taking dynamics，如重叠和说话者切换）相关，而心智需求（mental demand）则与说话者之间的参与不平衡有关。这些发现突显了任务结构和会话交互在自然协作环境中建模认知负荷的重要性。

Abstract

Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.

Scoring Details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on predicting cognitive load using audio and interaction dynamics in dyadic conversations via GRU encoders. It does not involve Unify Models, Tokenizers, Visual Encoders, World Models, MLLM, Model-Based RL, or specific reasoning frameworks (Latent/Agentic) associated with large-scale AI architectures. The domain is psychology/signal processing rather than generative AI or reinforcement learning, resulting in zero relevance to all provided technical keywords.

Keywords

Cognitive Load, Speech Analysis, Interaction Dynamics, Dyadic Conversations, GRU Encoder, Acoustic Features, Collaborative Tasks, Turn-taking

240. Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution [FAIL]

Score: 0.0 / 35.2

Authors: Dimitri Kachler, Damien Sileo, Pascal Denis

Published: 2026-06-11

TL;DR: This paper proposes Influcoder to accelerate influence-based data attribution for LLMs via encoder distillation, but it is unrelated to multimodal, world model, or reinforcement learning topics.

Abstract Translation (Chinese)

随着大语言模型（LLM）能力的提升，通过过滤训练数据中的样本以构建高质量数据集的趋势日益增强。一般来说，数据归因（DA）方法旨在估计训练数据集中的单个样本如何影响模型生成特定输出。例如，人们可能关心的是，在训练大语言模型后，数据中的哪些样本可能是有害行为的来源。许多方法通过影响函数（influence functions）范式来量化这种影响。尽管这类方法在功能上有效，但它们缺乏必要的处理速度和存储紧凑性，难以在实际中应用于大规模数据集。我们提出了一种名为 Influcoder 的方法，作为一种快速且成本效益高的方法，用于基于影响的大规模数据归因。

Abstract

With the growth of the capabilities of Large Language Models (LLMs), there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods of this family are effective, they lack the necessary processing speed and storage compactness to be practically implemented on large datasets. We propose a method, Influcoder, as a quick and cost-effective approach to influence-based Data Attribution at scale.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on Data Attribution for Large Language Models (LLMs) using influence functions and encoder distillation to improve speed and storage efficiency. It does not involve multimodal integration, world models, reinforcement learning, visual encoders, tokenizers, or agentic/latent reasoning as specified in the keyword set. Therefore, all provided keywords are irrelevant (0.0). The total weighted score is 0.0, which is below the dynamic pass score of 35.2. None of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Data Attribution, Influence Functions, Large Language Models, Encoder Distillation, Gradient Influence Rankings, Training Data Curation
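
The verdict for this entry can be reproduced with a few lines of arithmetic. The sketch below is only an illustration under a stated assumption: the report does not give its aggregation formula, so the total is taken to be the plain sum of weight times relevance, and any expert-author bonus is left out. How the dynamic pass score of 35.2 is derived is not stated either, so it is used here as a fixed constant.

```python
# Weights and relevance values copied from the scoring table of this entry.
KEYWORDS = [
    "Unify Models", "Tokenizer", "Visual Encoder", "World Models", "MLLM",
    "MultiModal", "model-based RL", "Latent Reasoning", "Agentic Reasoning",
]
weights = {name: 1.5 for name in KEYWORDS}
relevance = {name: 0.0 for name in KEYWORDS}  # each on a 0-10 scale

PASS_SCORE = 35.2  # dynamic pass score quoted in the report; its derivation is not given


def weighted_total(weights, relevance):
    """Assumed aggregation: sum of weight * relevance over all keywords."""
    return sum(weights[name] * relevance[name] for name in weights)


total = weighted_total(weights, relevance)
verdict = "PASS" if total >= PASS_SCORE else "FAIL"
print(f"Score: {total:.1f} / {PASS_SCORE} -> {verdict}")
```

Under this assumption every keyword contributes nothing, so the entry falls below the threshold regardless of how the threshold was chosen.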

241. The Tone of Awareness: Topic, Sentiment, and Toxicity Maps During Mental Health Month on TikTok [FAIL]

Score: 0.0 / 35.2

Authors: Henrique Ferraz de Arruda, Andreia Sofia Teixeira, Pranay Gundala Reddy, Anindya Mondal, Kleber Andrade Oliveira, Filipi Nascimento Silva

Published: 2026-06-11

TL;DR: This study analyzes sentiment and toxicity in TikTok mental health content during awareness months, revealing that videos tend to be negative while comments are more mixed, with toxicity concentrated in specific topics.

Abstract (Chinese translation)

尽管人们担忧使用 TikTok 可能对心理健康产生影响，但创作者如何构建相关内容以及受众如何接收这些信息尚知之甚少。我们通过 TikTok Research API 收集了 2023 年和 2024 年心理健康意识月（五月）期间的 28,341 个 TikTok 视频及 80,130 条评论，并研究意识语调在不同主题和年份间的差异。我们将“语调”定义为心理健康话语的情感和人际框架，并通过情感（sentiment）和毒性（toxicity）指标进行操作化。我们使用 BERTopic 和对数几率关键词从视频文本中提取主题，然后分别针对视频转录文本和评论量化主题条件情感（XLM-T）和毒性（Detoxify）。情感（Sentiment）捕捉内容的情感效价，而毒性（Toxicity）反映有害或虐待性语言的存在。我们发现跨年份存在一组稳定的反复出现的主题，涵盖临床状况、情感披露、自我护理以及以活动为导向的内容，且参与度高度偏向于少数特定主题。所有情感与毒性分析均分别针对视频内容和评论进行计算，使我们能够区分内容生产与受众接收。视频中的情感在情绪化话题上往往呈现负面，而评论则倾向于转向更混合或积极的极性，尤其是在自杀预防（Suicide Prevention）方面。毒性中位数整体较低，但在评论中表现出比视频更长的尾部异常值，这种异常值在评论中更为明显，且集中在特定主题上（例如"Duet"、"Suicide Prevention"和"Psychisch"）。总体而言，我们的结果提供了意识月运动期间 TikTok 上心理健康话语的主题级分解。

Abstract

Despite raising concerns about the mental health effects associated with the usage of TikTok, little is known about how related content is framed by creators and received by audiences. We collect the content of 28,341 TikTok videos and 80,130 comments from Mental Health Awareness Month (May) in 2023 and 2024 via the TikTok Research API, and study how the tone of awareness varies across topics and years. We characterize "tone" as the emotional and interpersonal framing of mental health discourse, operationalized through sentiment and toxicity measures. We extract topics from video text using BERTopic and log-odds keywords, then quantify topic-conditioned sentiment (XLM-T) and toxicity (Detoxify) separately for video transcriptions and comments. Sentiment captures the affective valence of content, while toxicity reflects the presence of harmful or abusive language. We find a stable set of recurring themes across years, spanning clinical conditions, emotional disclosure, self-care, and campaign-oriented content, with engagement highly skewed toward a small subset of topics. All sentiment and toxicity analyses are computed separately for video content and comments, allowing us to distinguish between content production and audience reception. Sentiment in videos is often negative for emotionally charged topics, while comments tend to shift toward more mixed or positive polarity, especially for suicide prevention. Toxicity is low in median overall, but exhibits longer-tailed outliers in comments than in videos that are more pronounced in comments and concentrated in specific topics (e.g., "Duet", "Suicide Prevention", and "Psychisch"). Overall, our results provide a topic-level decomposition of mental health discourse on TikTok during awareness-month campaigns.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on social media analysis of mental health discourse on TikTok using NLP tools (BERTopic, XLM-T, Detoxify). It does not propose or utilize Unify Models, Tokenizers, Visual Encoders, World Models, MLLMs, Model-Based RL, Latent Reasoning, or Agentic Reasoning frameworks. The study is a text-based analysis of transcriptions and comments and has no connection to the AI model architecture or reinforcement learning themes defined by the keywords.

Keywords

Mental Health Awareness, TikTok, Sentiment Analysis, Toxicity Detection, Topic Modeling, Audience Reception

242. Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models [FAIL]

Score: 0.0 / 35.2

Authors: Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze

Published: 2026-06-11

Abstract (Chinese translation)

基于文本引导的视觉自回归（VAR）生成器图像编辑需要同时控制模型采样的内容以及将采样变化写回图像代码的位置。现有的 VAR 编辑器主要基于 token 流、特征或平坦的下一 token logits 进行操作，导致位残差 VAR 模型的两个本征结构未被充分利用：即每比特伯努利预测头和从中组装图像的加性多尺度残差代码场。本文提出 BitResEdit，这是一种针对 Infinity 等位残差 VAR 生成器的无训练编辑器。BitEdit 通过在共享编辑前缀上计算源 - 目标对比度，沿该对比度倾斜后 CFG 的每比特 log-odds，从而实现源负引导，随后将每个更新投影到围绕干净 CFG 采样器的闭式伯努利 -KL 信任区域。ResEdit 将采样的比特转换为每尺度的连续代码残差，利用定位掩码对其进行门控，并通过生成器的原生多尺度之和重新注入。二者共同将决策时的比特引导与组合时的代码组合相耦合，使得被掩码的潜在特征可通过代码算术被精确保留，同时局部化、尺度感知的编辑被应用于目标区域内部。在 PIE-Bench 基准上配合 Infinity-2B 模型，BitResEdit 在同骨干 VAR 编辑器中实现了最强的文本对齐，其在编辑区域上的 CLIP 得分比最强先前编辑器高出 +1.07，同时保持背景保存能力与之相当。消融实验表明，BitEdit 和 ResEdit 在目标对齐与背景保存方面发挥着互补作用。

Abstract

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source-target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.
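
The per-bit guidance step described in the abstract can be pictured with a small numerical sketch. This is not the authors' implementation: the abstract mentions a closed-form projection that it does not spell out, so the code below substitutes a simple bisection on the step size. The direction of the KL divergence, the function name `guided_logit`, and the parameters `strength` and `kl_budget` are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bernoulli_kl(q, p):
    """KL(Bern(q) || Bern(p)), clamped away from 0 and 1 for numerical safety."""
    tiny = 1e-12
    q = min(max(q, tiny), 1.0 - tiny)
    p = min(max(p, tiny), 1.0 - tiny)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))


def guided_logit(clean, source, target, strength, kl_budget, iters=50):
    """Tilt one bit's clean log-odds along (target - source), then shrink the
    step until the tilted Bernoulli stays within kl_budget of the clean one.
    Bisection stands in for the closed-form projection named in the abstract."""
    direction = strength * (target - source)

    def kl_at(scale):
        return bernoulli_kl(sigmoid(clean + scale * direction), sigmoid(clean))

    if kl_at(1.0) <= kl_budget:
        return clean + direction  # the full tilt already lies inside the trust region
    lo, hi = 0.0, 1.0  # KL grows monotonically with the step, so bisection applies
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if kl_at(mid) <= kl_budget:
            lo = mid
        else:
            hi = mid
    return clean + lo * direction


small = guided_logit(clean=0.0, source=0.0, target=0.1, strength=1.0, kl_budget=0.05)
large = guided_logit(clean=0.0, source=-1.0, target=2.0, strength=2.0, kl_budget=0.05)
```

A small tilt passes through unchanged, while a large one is scaled back until it sits on the boundary of the trust region around the clean sampler.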

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 39 (char 321)
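
The failure message recorded for this entry has the shape of the error text produced by Python's standard `json` module when a comma between two members is missing. Assuming the scorer parses a JSON reply in Python (the report does not say how it is implemented), the sketch below shows how such a message can arise and how a caller could capture it rather than crash. The function `parse_scores` and the sample payload are hypothetical.

```python
import json


def parse_scores(raw):
    """Return (scores, error). scores is None when raw is not valid JSON."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        # str(exc) carries the message plus line, column, and character offset.
        return None, f"Scoring failed: {exc}"


# Hypothetical payload with a missing comma between two members.
bad_payload = '{"Tokenizer": 0.0 "MLLM": 0.0}'
scores, error = parse_scores(bad_payload)
print(error)

# A well-formed payload parses normally.
good_scores, good_error = parse_scores('{"Tokenizer": 0.0, "MLLM": 0.0}')
```

Because the error carries a line and column, logging the raw reply alongside it would make the malformed position easy to locate.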

243. Examining the Cognitive Gap Between Authors and Peer Reviewers on Academic Paper Novelty [FAIL]

Score: 0.0 / 35.2

Authors: Chenggang Yang, Chengzhi Zhang

Published: 2026-06-11

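TL;DR: This paper examines the cognitive gap between authors and peer reviewers on paper novelty, finding that promotional language is most strongly associated with reviewer disagreement for papers of moderate innovativeness.

Abstract (Chinese translation)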

新颖性是评估学术论文质量的关键指标。学者们努力突出其工作的新颖之处，尤其是在标题、摘要和引言中。同行评审 (Peer review) 作为科学严谨性的守门人，严格评估论文的新颖性，然而作者自我推广与审稿人评估之间可能存在认知差距。为此，我们分析了 2016 年至 2021 年发表在 Nature Communications (《自然·通讯》) 上的 15,328 篇学术论文及其审稿意见。我们发现，审稿人和作者均强调结果导向型创新，但审稿人采用了更为全面的评估视角。此外，通过考察宣传强度与论文内在新颖性的关系，我们发现其效应取决于论文的实际创新水平。高创新性论文得益于更强的宣传用语，获得了更多正面评价。我们还发现，宣传用语与审稿人关于新颖性的分歧显著相关，这种关联在中等创新性的论文中尤为显著，而对于新颖性极高或极低的论文，其影响则微乎其微。这表明宣传用语在学术评价的灰色地带中作用最为显著。

Abstract

Novelty is a crucial metric for assessing the quality of academic papers. Scholars strive to highlight the novel aspects of their work, particularly in the title, abstract, and introduction. Peer review, serving as the gatekeeper of scientific rigor, rigorously evaluates the novelty of papers, yet a cognitive gap may exist between author self-promotion and reviewer evaluation. To investigate this, we analyzed 15,328 academic papers published in Nature Communications from 2016 to 2021, along with their peer-review comments. We found that both reviewers and authors emphasize result-oriented innovation, with reviewers adopting a more comprehensive evaluation perspective. Furthermore, by examining promotional intensity against inherent paper novelty, we found that its effect depends on the paper's actual innovation level. Highly innovative papers benefit from stronger promotional language, receiving more positive evaluations. We also found that promotional language significantly correlates with reviewer disagreement on novelty specifically for papers of moderate innovativeness, whereas it has negligible impact for papers with either very high or very low novelty. This reveals how promotional language operates most prominently in the gray area of academic evaluation.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

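Scoring rationale: The paper focuses on novelty assessment in academic publishing, the cognitive gap in peer review, and the influence of promotional language, which places it in scientometrics or the social sciences. The provided keywords all concern machine learning model architectures (e.g., MLLM, World Models, RL, Tokenizer), so there is no technical connection between the two. The author list does not include any of the specified experts, so the relevance of every keyword is 0.

Keywords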

Novelty Assessment, Peer Review Process, Cognitive Gap, Promotional Language, Academic Evaluation, Nature Communications, Author-Reviewer Disagreement

244. Evaluating Pluralism in LLMs through Latent Perspectives [FAIL]

Score: 0.0 / 35.2

Authors: Laura Majer, Jan Šnajder, Martin Tutek

Published: 2026-06-11

Abstract (Chinese translation)

对表达多元视角的需求日益增长，提高了人们对多元主义 LLM（大型语言模型）生成的兴趣。尽管难以操作化，但识别文本中表达的视角将为多元主义对齐提供明确指导，并更清晰地阐明 LLM 生成中的多元主义差距。虽然模型已被证明会减少训练数据的多样性并同质化生成，但这主要在多项选择题问卷或使用自由格式文本的高层特征上得到验证。本文介绍并实现了一个领域无关的多层框架，用于无监督提取视角，适用于识别 LLM 生成文本中的多元主义差距。我们在书评这一观点鲜明、代表多元视角的数据集上评估了该框架，并比较了不同的提示词和模型。结果表明，尽管某些模型和提示技术接近覆盖广泛的视角谱系，但稀有视角仍不成比例地代表性不足，导致分布偏离人类文本。

Abstract

The growing need to represent diverse perspectives has increased interest in pluralistic LLM generation. Although difficult to operationalize, identifying perspectives expressed in text would provide clear guidance on pluralistic alignment and more clearly articulate the pluralistic gap in LLM generation. While models have been shown to reduce the diversity of training data and generate homogeneously, this has been demonstrated primarily on multiple-choice questionnaires or using high-level characteristics of free-form text. In this paper, we introduce and implement a domain-agnostic multi-layered framework for unsupervised extraction of perspectives suitable for identifying the pluralistic gap in LLM-generated text. We evaluate our framework on book reviews, a highly opinionated dataset representing diverse perspectives, and compare various prompts and models. Our results show that while some models and prompting techniques come close to covering a broad spectrum of perspectives, rarer perspectives remain disproportionately underrepresented, resulting in distributions that diverge from human text.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 103 (char 385)

245. SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection [FAIL]

Score: 0.0 / 35.2

Authors: Fuqiang Niu, Bowen Zhang

Published: 2026-06-11

TL;DR: This paper introduces a semantic-pragmatic complexity index (SICI) to diagnose LLM performance in stance detection, revealing regime shifts in accuracy that standard interventions like prompting or debate fail to resolve.

Abstract (Chinese translation)

基于提示的大语言模型（LLM）在立场检测中的应用日益广泛，但更具挑战性的例子并不总能通过更清晰的指令、推理提示、检索或辩论得到修正。我们引入了 SICI（立场推断复杂度指数），这是一种衡量目标 - 文本对施加的语义 - 语用负担的七维诊断性度量。在 SemEval-2016 和 VAST 数据集上，SICI 预测 LLM 准确度的效果优于表面代理指标，且显示出较高的跨评分者可靠性（α=0.771）。更重要的是，随着 SICI 的增加，LLM 错误模式发生改变：低复杂度例子倾向于导致过度归因，尤其是反对预测；中等复杂度例子形成不稳定的边界；而高复杂度例子则迅速集中于无立场。这种类似相变的结构在 GPT-3.5、GPT-4o-mini、DeepSeek-V3 和 GPT-4o 模型中均保持一致，尽管更强的模型会移动这些边界。一项包含 15 种方法的干预研究进一步表明，提示、检索和辩论通常会使模型沿归因 - 弃权轴移动，而非消除高复杂度瓶颈。

Abstract

Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden imposed by a target-text pair. Across SemEval-2016 and VAST, SICI predicts LLM accuracy better than surface proxies and shows substantial cross-scorer reliability (α = 0.771). More importantly, LLM errors change regime as SICI increases: low-complexity examples invite over-attribution, especially Against predictions; intermediate examples form an unstable boundary; and high-complexity examples rapidly concentrate on None. This phase-transition-like structure persists across GPT-3.5, GPT-4o-mini, DeepSeek-V3, and GPT-4o, although stronger models move the boundaries. A 15-method intervention study further shows that prompting, retrieval, and debate often shift models along the attribution-abstention axis rather than removing the high-complexity bottleneck.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on text-based stance detection evaluation using a semantic-pragmatic complexity index (SICI), analyzing regime shifts in LLM accuracy. The provided keywords primarily pertain to multimodal architectures (MLLM, MultiModal, Visual Encoder), world modeling, and reinforcement learning (model-based RL, Agentic Reasoning). Since the paper is text-only, does not involve world models or RL, and does not discuss tokenizers or visual encoders, there is negligible relevance to the specific keyword set. The calculated weighted total score is 0.0, which is well below the dynamic pass score of 35.2. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are listed as authors.

Keywords

Stance Detection, LLM Evaluation, Semantic-Pragmatic Complexity, Regime Shifts, Prompting Interventions, Diagnostic Measure, Text-based Reasoning

246. A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit [FAIL]

Score: 0.0 / 35.2

Authors: Hu Huang, Genan Dai, Fuqiang Niu, Yi Yang, Zhaoya Gong, Bowen Zhang

Published: 2026-06-11

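TL;DR: This paper presents the BioStance dataset for stance detection in bioethical controversies, but it does not involve multimodal modeling or reinforcement learning.

Abstract (Chinese translation)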

生物伦理学辩论日益在社交媒体上展开，然而立场检测研究缺乏用于建模此类语境依赖论述的大规模、领域特定资源。我们提出了 BioStance，这是一个源自 Reddit 生物伦理讨论的、包含 39,600 个标注帖子 - 评论对的语境感知数据集。BioStance 涵盖了六个争议性议题，涉及生物伦理争议的三个维度：根本价值冲突、个人自由与集体责任的对立，以及技术不确定性。每个实例均保留了层级对话语境，并由三位独立标注者采用三类立场方案进行标注：赞成（Favor）、反对（Against）和无立场（None）。这些标注达到了平均 Krippendorff's α 为 0.82，表明具有高度可靠性。通过结合主题多样性、对话结构及高质量人工标注，BioStance 支持语境感知立场检测、论点挖掘以及生物伦理论述计算分析方面的研究。

Abstract

Bioethical debates increasingly unfold on social media, yet stance detection research lacks large-scale, domain-specific resources for modeling such context-dependent discourse. We present BioStance, a context-aware dataset of 39,600 annotated Post-Comment pairs from Reddit bioethical discussions. BioStance covers six controversial targets across three dimensions of bioethical controversy: fundamental value conflicts, individual liberty versus collective responsibility, and technological uncertainty. Each instance preserves hierarchical conversational context and is labeled by three independent annotators using a three-class stance scheme: Favor, Against, and None. The annotations achieve a mean Krippendorff's α of 0.82, indicating substantial reliability. By combining thematic diversity, conversational structure, and high-quality human annotation, BioStance supports research on context-aware stance detection, argument mining, and computational analysis of bioethical discourse.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

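Scoring rationale: The paper focuses on constructing a stance detection dataset for bioethical controversies (BioStance), a text analysis task in natural language processing (NLP). The provided keywords all concern frontier AI areas such as multimodal large models, world models, reinforcement learning, and unified architectures, and have no direct methodological or content connection to this text-only dataset study. All keywords therefore receive a relevance score of 0. The author list does not include Yang Shi or the other specified experts, so no bonus is applied. The weighted total score is 0, far below the dynamic pass score of 35.2.

Keywords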

Stance Detection, Bioethical Controversies, Context-Aware Dataset, Reddit Social Media, Conversational Structure, Argument Mining, Computational Analysis

247. LAUKIN: A Multi-jurisdictional Common Law Contract Dataset [FAIL]

Score: 0.0 / 35.2

Authors: Amrita Singh, Aditya Joshi, Jiaojiao Jiang, Hye-young Paik, May Fong Cheong

Published: 2026-06-11

TL;DR: This paper introduces LAUKIN, a multi-jurisdictional contract equivalence dataset for legal NLP, evaluating text models on cross-jurisdictional classification tasks.

Abstract (Chinese translation)

跨国公司日益需要跨司法管辖区的合同审查，然而现有的法律自然语言处理（NLP）数据集大多局限于单一司法管辖区。我们引入了 LAUKIN（Legal equivalence dataset of Australia, UK, and India），这是一个包含条款对（clause pairs，AU-UK、UK-IN、IN-AU）的数据集，标注了布尔法律等价性（boolean legal equivalence）。我们开发了一种新颖的多阶段检索与重排序（retrieval and reranking）流程（pipeline），用于构建初始条款对映射，随后由法律专家对部分条款对进行标注，判定为“等价（Equivalent）”或“不等价（Not Equivalent）”。该数据集包含来自 204 份合同、涵盖 8 种协议类型的 14,727 个条款对，其中 3,000 个经过人工标注：900 个用于训练（train），600 个用于开发（dev），1,500 个用于测试（test）。我们在 4 种方法上评估了 12 个模型，取得了最佳宏观 F1 值（macro-F1）为 65.11%，确立了 LAUKIN 作为一个具有挑战性的基准（benchmark）。结果表明，尽管共享法律传统，但起草惯例在不同司法管辖区之间存在显著差异，使得跨司法管辖区的等价性分类并非易事。LAUKIN 还包含 11,727 个未标注的训练对，以支持未来法律自然语言处理（NLP）领域的半监督学习（semi-supervised learning）研究。

Abstract

Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled for boolean legal equivalence. We develop a novel multi-stage retrieval and reranking pipeline to construct the initial clause pair mapping, with a subset of clause pairs subsequently annotated by legal experts as Equivalent or Not Equivalent. The dataset comprises 14,727 clause pairs from 204 contracts across 8 agreement types, of which 3,000 are manually labelled: 900 train, 600 dev, and 1,500 test. We evaluate 12 models across 4 techniques, achieving a best macro-F1 of 65.11%, establishing LAUKIN as a challenging benchmark. Results reveal that, despite shared legal heritage, drafting conventions diverge significantly across jurisdictions, making cross-jurisdictional equivalence classification non-trivial. LAUKIN also includes 11,727 unlabelled training pairs to support future semi-supervised learning research in legal NLP.
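
The split sizes and the macro-F1 metric quoted in the abstract can be checked with a short sketch. The counts are taken from the abstract; the label vectors are made-up toy data rather than LAUKIN annotations, and `macro_f1` is a plain reference implementation written for illustration, not the authors' evaluation code.

```python
# Split sizes reported in the abstract.
TOTAL_PAIRS = 14_727
LABELLED = {"train": 900, "dev": 600, "test": 1_500}
labelled_total = sum(LABELLED.values())
unlabelled_total = TOTAL_PAIRS - labelled_total


def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating that class as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so both classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: 1 = Equivalent, 0 = Not Equivalent (illustrative only).
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_score = macro_f1(toy_true, toy_pred)
```

Macro averaging weights the Equivalent and Not Equivalent classes equally, which matters when one class is rarer than the other.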

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper focuses on constructing a legal text dataset (LAUKIN) for cross-jurisdictional contract equivalence classification. The provided keywords relate to multimodal large models, world models, and reinforcement learning paradigms, which are not addressed in this text-only legal NLP study. The weighted total score is 0.0 (below the dynamic pass score of 35.2). None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Legal NLP, Contract Equivalence, Cross-jurisdictional, Dataset Construction, Clause Pair, Text Classification, Legal Benchmark

248. World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible [FAIL]

Score: 0.0 / 35.2

Authors: Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang

Published: 2026-06-11

Abstract (Chinese translation)

图像到 3D 的方法通常在保真度与完整性之间进行权衡：深度估计器虽锚定在输入像素上，但仅止步于可见表面；而图像到 3D 模型生成的完整形状往往与输入未对齐。我们引入 World Tracing（世界追踪），这是一种生成式像素对齐几何表示，它能够预测与观测像素对齐的 3D 点，同时完成可见表面之外的几何结构。对于每个输入像素，World Tracing 预测一个有序堆叠的相机空间 3D 点，其中第一层代表可见表面，后续层则代表与遮挡表面从前到后的交点。我们通过世界追踪扩散变换器（WT-DiT）实例化该表示，WT-DiT 将多个几何层视为独立的去噪标记，并通过分解式和全局注意力进行耦合。WT-DiT 采用像素空间流匹配和混合噪声调度进行训练，该调度平衡了可见表面重建与遮挡几何生成。World Tracing 在物体、场景及动态基准上的可见表面重建和完整几何生成方面表现优异，优于深度预测器和图像到 3D 生成器。它还保留了 2D 到 3D 的对应关系，从而支持文本驱动的 3D 场景编辑、基于几何条件的新视角视频合成，以及与纹理网格生成器的免训练集成。

Abstract

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 14 column 54 (char 336)

249. Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background [FAIL]

Score: 0.0 / 35.2

Authors: Mamoona Javaid, Mubashir Noman, Abdul Hannan, Shah Nawaz, Mustansar Fiaz, Sajid Ghuffar

Published: 2026-06-11

TL;DR: This paper proposes a cascaded waste segmentation network leveraging spatial and spectral domains with auxiliary feature enhancement to improve automated waste recycling performance in cluttered backgrounds.

Abstract (Chinese translation)

城市区域的快速扩张和人口增长导致废物产量大幅增加，这迫切需要高效且自动化的废物管理。在此背景下，利用深度学习方法实现的自动化废物回收（AWR）可辅助人类实现最优废物管理。近期针对 AWR 的深度学习算法提供了令人鼓舞的废物分割性能，然而，这些方法依赖于大型骨干网络，对于 AWR 系统而言效率低下，且在杂乱场景中性能会出现退化。为此，本文引入了一种最优废物分割网络，该网络有效利用空间域捕获局部结构依赖，并利用谱域高效提取全局上下文关系。这种级联设计使网络能够逐步利用互补域中的局部和全局表示，突出显示对各种废物对象进行有效分割所需的语义信息。此外，还引入了辅助特征增强模块（AFEM），以增强目标物体的边界并进行 blob 放大，从而在杂乱场景中实现更好的分割效果。在 ZeroWaste-aug、ZeroWaste-f 和 SpectralWaste 数据集上的广泛实验证明了所提出方法的优势。

Abstract

Rapid expansion of urban areas and population growth is causing an immense increase in waste production, which demands the need for efficient and automated waste management. In this scenario, automated waste recycling (AWR) using deep learning methods can assist humans in optimal waste management. Recent deep learning approaches for AWR provide promising waste segmentation performance, however, these methods rely on large backbone networks that are inefficient for AWR systems and suffer from performance deterioration in cluttered scenes. To this end, an optimal waste segmentation network is introduced which effectively utilizes the spatial domain to capture localized structural dependencies and the spectral domain to efficiently extract global contextual relationships. This cascaded design allows the network to progressively leverage both local and global representations across complementary domains to highlight the semantic information necessary for effective segmentation of various waste objects. Furthermore, auxiliary feature enhancement module (AFEM) is introduced to enhance the target objects' boundaries and blob amplification for better segmentation in cluttered scenarios. Extensive experimentation on ZeroWaste-aug, ZeroWaste-f and SpectralWaste datasets reveals the merits of the proposed method.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

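Scoring rationale: The paper focuses on waste segmentation, a computer vision task, using feature extraction in the spatial and spectral domains. The provided keywords mainly concern generative AI and decision-making concepts such as multimodal large language models (MLLM), world models, reinforcement learning, and unified models. The paper's content has no direct connection to these keywords (e.g., Tokenizer, World Models, model-based RL); there is only a very weak conceptual overlap at the level of visual processing, and not in the same architectural context, so every keyword receives a relevance score of 0. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied. The weighted total score is 0, far below the dynamic pass score of 35.2.

Keywords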

Waste Segmentation, Automated Waste Recycling, Cluttered Background, Spatial Domain, Spectral Domain, Feature Enhancement, Deep Learning

250. What's Old is New Again: Classical Dimensionality Reduction for Efficient Saliency-Guided Biometric Attack Detection [FAIL]

Score: 0.0 / 35.2

Authors: Samuel Webster, Walter Scheirer

Published: 2026-06-11

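TL;DR: This paper proposes generating saliency maps with classical dimensionality reduction techniques for biometric attack detection, achieving high scalability and robustness across domains without human annotation.

Abstract (Chinese translation)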

显著性引导训练是视觉识别领域的一种范式，旨在鼓励模型在学习过程中聚焦于最相关的图像区域。尽管其在生物特征呈现攻击检测（PAD）中的应用已展现出在鲁棒性和泛化能力方面的显著优势，但现有显著性获取方法（如基于有限数据集的人工标注）的高成本、领域特异性及有限的可扩展性往往限制了其采用。本文提出了一种新颖、成本高效且高度可扩展的显著性获取方法，该方法利用受经典降维技术（主成分分析 PCA 和线性判别分析 LDA）启发的显著性图。所提出的方法直接从原始训练数据生成显著性图，无需人工标注或领域知识。我们在三个已有显著性研究的领域（虹膜 PAD、合成人脸检测、指纹 PAD）中验证了这些显著性源的有效性，并在两个尚未涉及显著性的领域（指纹静脉 PAD 和身份证 PAD）中展示了其可扩展性。在所有测试的领域中，使用基于降维技术生成的显著性图训练的模型均优于基线方法，有时甚至优于当前最先进的（SOTA）显著性方法，且无需任何资源投入或领域特定工具。我们的发现克服了生物特征攻击检测及其他领域中显著性引导训练所面临的一个重要但尚未被解决的障碍。

Abstract

Saliency-guided training is a paradigm in visual recognition that encourages models to focus on the most relevant image regions during learning. While its application in biometric presentation attack detection (PAD) has shown strong benefits in robustness and generalization, adoption is often limited by the high cost, domain specificity, and limited scalability of existing saliency acquisition methods, such as human annotations over a limited dataset. We present a novel, cost-efficient, and highly-scalable approach to saliency acquisition using maps inspired by classical dimensionality reduction techniques: PCA and LDA. Our proposed methods generate saliency maps directly from raw training data, requiring no human annotation nor domain knowledge. We contextualize the effectiveness of these saliency sources in three saliency-explored domains (iris PAD, synthetic face detection, fingerprint PAD) and demonstrate its scalability in two saliency-novel domains (fingerprint vein PAD and ID card PAD). Across all domains tested, models trained using dimensionality reduction-sourced saliency maps exceed baseline and sometimes SOTA saliency methods without any resource investment or domain-specific tooling. Our findings overcome an important yet unaddressed barrier to saliency-guided training for biometric attack detection and beyond.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

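Scoring rationale: The paper studies the application of classical dimensionality reduction methods (PCA/LDA) to biometric attack detection, which falls within traditional computer vision. The provided keywords all revolve around large language models, world models, reinforcement learning, and multimodal generative models, and have no substantive connection to the paper's methods, model architecture, or task objectives. The author list does not include any of the specified experts.

Keywords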

Dimensionality Reduction, Saliency-Guided, Biometric Attack Detection, PCA, LDA, Presentation Attack Detection, Scalability, Classical Techniques

251. Person Identification from Contextual Motion [FAIL]

Score: 0.0 / 35.2

Authors: Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni

Published: 2026-06-11

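TL;DR: This paper proposes a person identification method based on a probabilistic generative model of motion style and interaction with visual stimuli, achieving high recognition rates on multiple datasets.

Abstract (Chinese translation)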

本文考虑基于运动风格识别个体的问题。我们提出了一种描述动作实例创建过程的生成模型，并推导了一种概率身份推断方案，该方案适用于由监控和认证应用驱动的两种常见人体识别场景。我们引入了一种新颖的、基于运动模式的人体识别交互式场景。为此，我们在主体与系统之间的顺序消息交换会话背景下，对识别过程进行了形式化描述。主体的行为采用一种概率生成模型进行建模，该模型受到人类信息处理（Human Information Processing，简称 HIP）范式的启发。在每个阶段，系统向主体呈现一个视觉刺激（即提示），并记录其运动响应。提示的选择旨在最大化预期响应与主体身份之间的互信息。一旦记录，该响应便用于更新主体可能身份的后验概率。当达到足够的分类置信度水平时，该过程终止。据我们所知，这是首次在这样的交互式设置中探讨人体识别问题。我们在五个公开数据集以及我们自建的新数据集上报告了较高的识别率，该新数据集包含 22 个测试主体对 15 个提示的 4,476 次记录。

Abstract

We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by surveillance and authentication applications. We introduce a novel, interactive scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject's identity. Once recorded, the response is used to update the a posteriori probability over possible subjects' identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
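
The interactive loop in the abstract (choose the cue with the highest mutual information, observe the response, update the posterior, stop at a confidence threshold) can be sketched on a toy discrete model. Everything below is an illustrative assumption: responses are reduced to a small set of discrete outcomes, the likelihood table `lik[cue][identity][response]` is invented, and the function names are hypothetical. The paper's actual generative model of motion is not reproduced here.

```python
import math


def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0)


def mutual_information(prior, lik_c):
    """I(response; identity) for one cue, where lik_c[i][r] = P(r | identity i)."""
    n_resp = len(lik_c[0])
    marginal = [sum(prior[i] * lik_c[i][r] for i in range(len(prior)))
                for r in range(n_resp)]
    conditional = sum(prior[i] * entropy(lik_c[i]) for i in range(len(prior)))
    return entropy(marginal) - conditional


def select_cue(prior, lik):
    """Pick the cue whose expected response is most informative about identity."""
    return max(range(len(lik)), key=lambda c: mutual_information(prior, lik[c]))


def update(prior, lik_c, response):
    """Bayes update of the identity posterior after observing one response."""
    unnorm = [prior[i] * lik_c[i][response] for i in range(len(prior))]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def identify(lik, respond, threshold=0.95, max_steps=20):
    n = len(lik[0])
    posterior = [1.0 / n] * n  # uniform prior over enrolled subjects
    for _ in range(max_steps):
        if max(posterior) >= threshold:
            break  # confident enough to stop querying
        cue = select_cue(posterior, lik)
        posterior = update(posterior, lik[cue], respond(cue))
    return posterior.index(max(posterior)), posterior


# Toy table: cue 0 is uninformative, cue 1 separates the two subjects.
lik = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.9, 0.1], [0.1, 0.9]],
]
identity, posterior = identify(lik, respond=lambda cue: 0)
```

In this toy table the uninformative cue is never selected, and two consistent responses to the informative cue are enough to cross the 0.95 threshold.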

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

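Scoring rationale: The paper belongs to classical computer vision and biometrics, focusing on person identification from motion style using a probabilistic generative model and Bayesian inference. Its core content does not involve modern large-model architectures (e.g., MLLM, Tokenizer, Visual Encoder), world models (in the World Models sense), model-based RL, or agentic reasoning. Although it involves visual-motion interaction and latent-variable inference, its technical paradigm does not match the modern AI research directions the given keywords refer to, so the relevance score is 0.

Keywords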

Person Identification, Contextual Motion, Generative Model, Probabilistic Inference, Interactive Scenario, Visual Stimulus, Motion Response, Mutual Information

252. Fully Distributed Multi-View 3D Tracking in Real-Time [FAIL]

Score: 0.0 / 35.2

Authors: Byron Hernandez, Fangyu Li, Aotian Wu, Paul J. Shin, Kaustubh Purandare, Henry Medeiros

Published: 2026-06-11

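TL;DR: This paper proposes the MV3DT framework, which achieves real-time multi-view 3D tracking through peer-to-peer coordination, removing the central aggregation bottleneck and demonstrating high accuracy and scalability in large-scale camera networks.

Abstract (Chinese translation)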

重叠视场下的多相机跟踪通常依赖于集中式融合，这会产生计算瓶颈，阻碍其规模化部署。本文提出 MV3DT，一种用于实时多视角 3D 跟踪的完全分布式框架，该框架通过点对点协调实现准确的身份传播与遮挡恢复，从而消除了对集中聚合的需求。每个相机节点执行一个轻量级模块化流水线，包含单目 3D 感知、分布式多视角关联以及通过轻量级消息传递实现的协作融合。MV3DT 在 WILDTRACK 数据集上取得了 94.3% 的 IDF1 和 93.3% 的 MOTA，与最先进的集中式方法相当，同时展现出卓越的可扩展性：在 100 台相机上维持 30 FPS，相机间延迟低于 10 ms，通信开销仅为 2.2%。在给定相机标定的情况下，MV3DT 以零样本设置运行，无需进行场景特定学习，使其可直接部署于新环境中。这些结果表明，MV3DT 是大规模重叠相机网络中实时多视角跟踪的一种实用解决方案。

Abstract

Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 94.3% IDF1 and 93.3% MOTA on WILDTRACK, competitive with state-of-the-art centralized methods, while demonstrating superior scalability by sustaining 30 FPS on 100 cameras with less than 10 ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

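Scoring rationale: This paper presents a distributed framework for multi-view 3D tracking (MV3DT), which falls under computer vision and distributed systems. The provided keywords (e.g., Tokenizer, MLLM, World Models, model-based RL) focus on large language models, world models, and reinforcement learning. The paper does not address model unification, tokenizers, visual encoders (in the MLLM sense), world-model construction, multimodal large models, model-based reinforcement learning, or agentic reasoning. The two areas differ substantially, so the paper has no meaningful relevance to any of the given keywords. None of the specified expert authors appear in the author list.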

Keywords

Multi-view 3D Tracking, Fully Distributed, Real-time, Peer-to-peer coordination, Monocular 3D perception, Distributed multi-view association, Collaborative fusion

253. Comparing Commercial Depth Sensor Accuracy for Medical Applications (FAIL)

Score: 0.0 / 35.2

Authors: Pit Henrich, Maximilian Weiherer, Franziska Hansen, Bernhard Egger, Franziska Mathis-Ullrich

Published: 2026-06-11

TL;DR: This paper benchmarks four commercial depth sensors on medical specimens and finds that the Zivid 2M+ 60 sensor offers the highest accuracy across various challenging surfaces compared to RealSense, PMD Flexx2, and ZED 2i.

Abstract

Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10
Latent Reasoning	1.5	0.0/10
Agentic Reasoning	1.5	0.0/10

Scoring rationale: The paper benchmarks commercial depth sensors for medical applications, evaluating their accuracy on physical specimens against stylus-sampled references. The provided keywords concern AI/ML model architectures (e.g., LLMs, World Models, Reinforcement Learning, Tokenizers). The paper's content (sensor hardware) has no semantic overlap with the evaluation keywords (AI models), so relevance is zero for every keyword. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Depth Sensor, Medical Applications, Accuracy Benchmarking, Porcine Specimen, Stereo, Structured-Light, Time-of-Flight

Token usage: 4,059,886 tokens (input 507,379 / output 3,552,507)