arXiv Daily Report 2026-05-28

ArXiv Research Report

Generated: 2026-05-28 19:39:11 | Passing score: 26.5

215

Total

Qualified

Analyzed

26%

Pass Rate

Papers

1. Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning [PASS]

Score: 70.0 / 26.5

Authors: Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yanfeng Wang, Yu Wang

Published: 2026-05-27

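TL;DR: To address the scattered evidence that multimodal large models face in multi-hop audio-visual reasoning, this paper proposes the AOP-Agent framework, which consistently improves reasoning performance through active perception and hierarchical memory.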

Abstract

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	6.0/10	12.0
World Models	2.0	5.0/10	10.0
MLLM	2.0	9.0/10	18.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	5.0/10	10.0

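Scoring rationale: The paper centers on multimodal large language models (MLLMs) and multimodal data, so those two keywords score highest. AOP-Agent's memory and replan mechanisms are partially related to world models and to planning in reinforcement learning. The paper does not directly address unified model architectures, so that keyword is rated lower than the core topics.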
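
The table's arithmetic can be checked with a minimal sketch. It assumes the total is simply the sum of weight times relevance, and that a paper passes when its total is at least the passing score; the report does not say whether the comparison is strict, so `>=` is an assumption, as are all the names below.

```python
# Rows copied from the scoring table above: (keyword, weight, relevance out of 10).
ROWS = [
    ("Unify Models", 2.0, 6.0),
    ("World Models", 2.0, 5.0),
    ("MLLM", 2.0, 9.0),
    ("MultiModal", 2.0, 10.0),
    ("model-based RL", 2.0, 5.0),
]
PASSING_SCORE = 26.5  # taken from the report header


def weighted_score(rows):
    """Sum weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_score(ROWS)
# Assumption: ">=" counts as passing; the report does not state the rule.
verdict = "PASS" if total >= PASSING_SCORE else "FAIL"
print(f"{total:.1f} / {PASSING_SCORE} -> {verdict}")
```

The same function reproduces the totals of the other papers in this report when given their table rows.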

Keywords

Multi-hop audio-visual reasoning, Omni-LLMs, AOP-Agent, Active perception, Hierarchical memory, Observe-reflect-replan, MOV-Bench

In-Depth Analysis

Chinese Title: 面向多跳音频-视觉推理的智能体主动全模态感知

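Summary: This paper targets multi-hop audio-visual reasoning, where evidence is sparse, temporally dispersed, and spread across modalities. It introduces the MOV-Bench benchmark (519 carefully designed multiple-choice questions) to evaluate existing omni-modal large language models (Omni-LLMs). Experiments show that current models struggle to locate and integrate dispersed evidence. The authors therefore propose AOP-Agent, a low-resource agentic framework that builds a hierarchical omni-modal memory (global summary, segment descriptions, audio and visual key points, retrieval keywords) and runs a multi-agent observe-reflect-replan loop, allowing open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench show that AOP-Agent improves multi-hop reasoning performance, with the largest gains on long videos and reasoning-intensive questions.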

Innovations:

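Proposes the MOV-Bench benchmark, which specifically evaluates multi-hop audio-visual reasoning across modalities and multiple temporal segments and covers five reasoning themes (causal, referential, relational, hypothetical, intentional).
Proposes the AOP-Agent framework, which uses a hierarchical omni-modal memory for coarse-to-fine evidence localization, making active perception easier for open-source Omni-LLMs.
Designs a multi-agent collaborative loop (planner, observation tools, reflector, reasoner) that lets the model iteratively locate and integrate sparse evidence without additional training or proprietary models.
In a low-resource setting (open-source Omni-LLMs only), achieves reasoning performance comparable to or better than methods that rely on proprietary models or extensive training.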

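Methodology: MOV-Bench is built first: videos are selected from the Fine-Video dataset, an LLM generates multiple-choice questions in five reasoning themes from ASR transcripts and event descriptions, and language filtering plus human verification ensure that each question is cross-modal and multi-hop. AOP-Agent then splits each video into fine-grained clips by ASR timestamps and merges them into mid-level semantic segments (≤30 seconds). An Omni-LLM generates visual key points, audio key points, retrieval keywords, and a description for each segment, which together form the hierarchical memory. At inference time, the planner selects an observation target based on the current state, the observation tools retrieve the corresponding segment, and the reflector judges whether the gathered information is sufficient; the loop repeats until it is, and the reasoner then produces the answer.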
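
The control flow of the observe-reflect-replan loop can be sketched as follows. This is a toy illustration, not the paper's implementation: the `Segment` schema, the keyword-overlap planner, and the coverage-based reflector are hypothetical stand-ins for what would be Omni-LLM calls, and the final reasoner step is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One mid-level memory entry (hypothetical schema, not the paper's)."""
    start: float
    end: float
    description: str
    keywords: set = field(default_factory=set)


def plan(question_terms, memory, visited):
    """Planner stand-in: choose the unvisited segment with most keyword overlap."""
    candidates = [i for i in range(len(memory)) if i not in visited]
    if not candidates:
        return None
    return max(candidates, key=lambda i: len(memory[i].keywords & question_terms))


def reflect(question_terms, evidence):
    """Reflector stand-in: evidence suffices once every question term is covered."""
    covered = set()
    for segment in evidence:
        covered |= segment.keywords
    return question_terms <= covered


def gather_evidence(question_terms, memory, max_steps=5):
    """Run observe-reflect-replan until the reflector is satisfied."""
    evidence, visited = [], set()
    for _ in range(max_steps):
        target = plan(question_terms, memory, visited)
        if target is None:
            break
        visited.add(target)
        evidence.append(memory[target])        # observe the chosen segment
        if reflect(question_terms, evidence):  # enough evidence: stop looping
            break
    return [(seg.start, seg.end) for seg in evidence]


memory = [
    Segment(0, 25, "a dog barks in the yard", {"dog", "bark"}),
    Segment(25, 50, "a door slams", {"door", "slam"}),
    Segment(50, 80, "a man leaves the house", {"man", "leave"}),
]
spans = gather_evidence({"bark", "door"}, memory)
print(spans)
```

In a real system the returned spans would be handed to the reasoner together with the question; the `max_steps` cap is an assumed safeguard against a loop that never gathers enough evidence.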

Key Results:

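Current Omni-LLMs perform poorly on MOV-Bench and struggle to locate and integrate evidence dispersed across modalities.
AOP-Agent consistently improves multi-hop reasoning accuracy on both MOV-Bench and OmniVideoBench, with the largest gains on long videos (>5 minutes) and reasoning-intensive questions (3-4 hops).
Compared with agent frameworks that rely on proprietary models (such as GPT-4o) or require additional training, AOP-Agent achieves more efficient active perception in a low-resource setting.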

Tech Stack:

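Open-source Omni-LLMs (e.g., Qwen2.5-VL; the paper does not name a specific model)
Hierarchical memory construction: video segmentation (based on ASR timestamps) and Omni-LLM generation of structured semantic information (visual/audio key points, retrieval keywords, segment descriptions)
Multi-agent loop: Planner Agent, Reflector Agent, Reasoner Agent
Observation Tools: locate mid-level segments by retrieval keywords or timestamps
ASR (automatic speech recognition) transcription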

Strengths:

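Introduces MOV-Bench, presented as the first benchmark dedicated to cross-modal multi-hop audio-visual reasoning, filling a gap in existing evaluations.
AOP-Agent is well designed: its hierarchical memory and collaborative loop lower the barrier to active perception for open-source models, with no additional training or proprietary models.
The experiments are thorough, validating effectiveness on multiple benchmarks and analyzing the effect of video length and reasoning complexity.
The method is general and can be applied to other open-source Omni-LLMs, supporting low-resource multimodal reasoning research.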

Limitations:

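MOV-Bench is small (519 questions) and may not fully reflect model capability.
Hierarchical memory construction depends on the generation quality of the Omni-LLM; if the model misunderstands a segment, later reasoning may suffer.
The method is compared with only a few agent frameworks, not with more active-perception methods based on reinforcement learning or post-training.
Experiments cover only specific open-source Omni-LLMs, so generalization needs further validation.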

Relevance To Keywords:

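Native multimodal large models: the paper directly studies Omni-LLMs on multi-hop audio-visual reasoning and builds an agent framework on open-source models; highly relevant.
Representation learning: building the hierarchical omni-modal memory involves structured representations of video segments (key points, descriptions), which falls under representation learning.
World models: active perception with an iterative observe-reflect-replan loop can be seen as a simplified form of world-model exploration, but the paper does not use world-model terminology; moderately relevant.
Reinforcement learning / post-training: the method involves neither; it achieves active perception through an agent loop, so relevance to these two keywords is low.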

2. Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning [PASS]

Score: 70.0 / 26.5

Authors: Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu, Xin Li, Zhipeng Wang, Shao Tang, Gen Li, Yujian Xiong, Hao Wang, Yanxi Chen, Prayag Tiwari, Yalin Wang

Published: 2026-05-27

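TL;DR: Mags-RL introduces a reinforcement-learning-based magnifying-glass agent that strengthens multimodal large models' reasoning in complex scenes and achieves precise visual grounding without additional annotations.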

Abstract

Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as only 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	7.0/10	14.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	10.0/10	20.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	5.0/10	10.0

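Scoring rationale: MLLM and MultiModal are the paper's core topics, so both score 10.0. The paper combines an MLLM with an external agent, which falls under model unification, so Unify Models scores 7.0. Although the paper uses reinforcement learning, its focus is a reasoning agent rather than world-model construction or model-based policies, so World Models scores 3.0 and model-based RL scores 5.0. On checking, the author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). The weighted total is 70.0, well above the dynamic passing score of 26.5.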

Keywords

MLLM, Agentic Reinforcement Learning, Super-resolution, Complex Scene Reasoning, Visual Grounding, Curriculum Learning, Magnifying Glass Agent

In-Depth Analysis

Chinese Title: Mags-RL：通过代理强化学习为多模态大模型戴上放大镜以进行复杂场景推理

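Summary: Multimodal large language models (MLLMs) often reason poorly in complex scenes (e.g., high object density, background clutter) because they interpret images inaccurately. Existing methods rely on bounding boxes that require extra annotation, and the resulting low-resolution crops lose detail. This paper proposes Mags-RL, an agentic reinforcement learning framework that equips an MLLM with an external super-resolution "magnifying glass" agent and performs two-round reasoning: in the first round the model generates an initial rationale and autonomously identifies regions of interest (without extra annotations); in the second round it invokes the super-resolution agent to crop and upscale those regions, then verifies and revises its reasoning to produce the final answer. With the GRPO algorithm and a curriculum learning strategy, as few as 40 training samples suffice for effective training. On the VSR, TallyQA, and GQA benchmarks, Mags-RL outperforms strong baselines such as CoT, Zoom-Refine, and GRIT on complex scene reasoning, demonstrating high-quality reasoning with precise visual grounding.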

Innovations:

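Proposes a two-round reasoning framework that dynamically invokes an external super-resolution agent to "think with images", improving the capture of fine-grained detail.
Introduces a data-efficient RL training paradigm based on GRPO and curriculum learning that aligns the model's reasoning behavior with only 40 samples.
Requires no extra annotations: the model identifies regions of interest on its own, removing the dependence on manual labels such as bounding boxes.
Achieves leading performance on several complex scene reasoning benchmarks (VSR, TallyQA, GQA), supporting the method's effectiveness and generalization.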

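Methodology: Mags-RL uses a two-round reasoning pipeline. In the first round, the MLLM generates an initial reasoning chain from the image and question and outputs a set of coordinates as regions of interest. In the second round, the super-resolution agent crops and upscales those regions and feeds the high-resolution images back to the MLLM, which verifies and revises its initial reasoning and outputs the final answer. Training uses GRPO with reward signals for format correctness and answer accuracy, plus a curriculum learning strategy that gradually increases task difficulty for efficient training.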
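
The reward and advantage computation can be illustrated with a short sketch of GRPO's group-relative normalization. The report names only the two reward signals (format correctness and answer accuracy); the reward weights, the `eps` term, and the use of the population standard deviation are assumptions made for this example, not values from the paper.

```python
from statistics import mean, pstdev


def reward(format_ok, answer_ok, w_format=0.5, w_answer=1.0):
    """Combine the two reward signals; the weights are illustrative assumptions."""
    return w_format * float(format_ok) + w_answer * float(answer_ok)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one question: (format_ok, answer_ok).
group = [(True, True), (True, False), (False, False), (True, True)]
rewards = [reward(f, a) for f, a in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Because advantages are measured relative to the group mean, no separate value network is needed; responses that beat their siblings get positive advantages and the rest get negative ones.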

Key Results:

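On VSR (visual spatial reasoning), Mags-RL outperforms CoT, Zoom-Refine, and GRIT.
On TallyQA (counting question answering), Mags-RL counts objects in dense scenes accurately, avoiding the omissions and misjudgments of baseline methods.
On GQA (scene-graph question answering), Mags-RL performs better on complex logical reasoning tasks.
Ablations show that the super-resolution module improves performance over plain crop-and-resize, and that curriculum learning further speeds up convergence.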

Tech Stack:

Group Relative Policy Optimization (GRPO)
Super-resolution agent (pretrained model)
Curriculum learning strategy
KL-divergence regularization (the policy constraint in GRPO)
Importance sampling
Multimodal large language models (MLLMs, e.g., LLaVA)

Strengths:

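Data-efficient: reaches reasonable performance with only 40 training samples, greatly reducing data requirements.
No extra annotations: the model localizes regions of interest itself, reducing labeling cost.
Dynamic interaction: two-round reasoning combined with super-resolution magnification improves perception of fine-grained detail and complex scene reasoning.
General: achieves strong results on several benchmarks of differing complexity, suggesting the method transfers.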

Limitations:

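Depends on the quality of the super-resolution agent; a weak agent may cap final performance.
Two-round reasoning adds compute and inference latency, which may hurt real-time use.
Training on only 40 samples is efficient, but generalization from so few samples may be limited and needs validation on more datasets.
Not tested on more diverse complex settings (e.g., video, 3D), so the scope of applicability remains to be extended.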

Relevance To Keywords:

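Native multimodal large models: the paper directly improves MLLM reasoning and falls under post-training of multimodal large models.
Reinforcement learning: the model is trained with GRPO, an RL-based post-training method.
Post-training: the paper focuses on the post-training stage of MLLMs, using RL to improve reasoning.
Representation learning: the super-resolution agent enhances the visual input, helping the model obtain finer features.
World models: no world model is built directly, but the verification step in two-round reasoning implicitly models the scene's internal logic.
Model-based RL: in the agentic RL framework the model interacts with an environment (the super-resolution tool); the report counts this as loosely related to model-based RL, although no environment model is learned.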

3. ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning [PASS]

Score: 68.0 / 26.5

Authors: Guannan Lv, Ren Nie, Hongjian Dou

Published: 2026-05-27

TL;DR: ROVER addresses holistic scene understanding limitations in grounded multi-image reasoning by routing object-centric visual evidence, achieving state-of-the-art performance on MM-GCoT and VideoEspresso benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	4.0/10	8.0
MLLM	2.0	10.0/10	20.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	5.0/10	10.0

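Scoring rationale: The paper focuses on multi-image reasoning with multimodal large language models (MLLMs), so MLLM and MultiModal both score 10. It trains with GRPO reinforcement learning, which is moderately related to model-based RL (5). Integrating a plugin into a base model touches on model unification, so Unify Models is moderately related (5). The visual working space is weakly related to the world-model concept (4). The author list contains none of the designated experts, so no bonus is added.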

Keywords

ROVER, Object-Centric Visual Evidence, Grounded Multi-Image Reasoning, Multimodal Large Language Models, Visual Working Space, SFT-to-GRPO Training, Qwen2.5-VL-7B, Differential Attention

In-Depth Analysis

Chinese Title: ROVER: 面向多图像推理的对象中心视觉证据路由方法

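Summary: The paper proposes ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for multi-image reasoning with multimodal large language models (MLLMs). It targets two problems of existing RoI-based methods: they neglect holistic scene understanding and inter-object relations, and their decoding cost grows with the number and size of RoIs. After each object grounding prediction, ROVER injects a fixed-length token triplet (Link/Sift/Weave). Object-centric differential attention (DiffAttn) distills complementary cues from the image and suppresses distractors, while a visual working space (VWS) serves as a structured routing substrate for history-aware evidence routing across objects and images. The plugin is integrated into Qwen2.5-VL-7B and trained with a unified SFT-to-GRPO pipeline. Strictly following the original datasets and evaluation protocols, ROVER achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and improves answer accuracy on VideoEspresso by 8.6%. The VideoEspresso-trained model also transfers well, gaining +4.7% on average across multiple benchmarks.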

Innovations:

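Proposes the ROVER plugin, which replaces variable-length RoI features with a fixed-length token triplet (Link/Sift/Weave) for global object-centric visual evidence routing, keeping the decoding cost per grounded object constant.
Designs an object-centric differential attention mechanism (DiffAttn) that, in the Sift module, distills complementary visual cues and suppresses distracting regions, strengthening holistic scene understanding.
Introduces a visual working space (VWS) as a structured routing substrate; the Weave module integrates history-aware evidence across objects and images.
Develops a unified SFT-to-GRPO training pipeline that, under a strict setting using only the original datasets, yields consistent gains and strong transferability.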

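Methodology: ROVER uses a trigger-and-route mechanism. When the model generates a valid grounding pattern (such as <obj>...<box>...</box>), a routing event is triggered and three learnable tokens, Link, Sift, and Weave, are appended. Link absorbs the current reasoning context. Sift applies differential attention (DiffAttn): an object-centric query (obtained by average-pooling the RoI features and applying a linear projection) cross-attends to all image patches, suppressing distracting non-RoI regions while extracting contextual cues, and fills the VWS. Weave uses standard attention to integrate historical evidence from the VWS. Sift and Weave are each a single-layer Transformer block (cross-attention + FFN). Training has two stages: supervised fine-tuning (SFT) teaches the model to generate the grounding and routing patterns, and Group Relative Policy Optimization (GRPO) then strengthens its reasoning. The base model is Qwen2.5-VL-7B.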
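
The object-centric query step can be pictured with a simplified, dependency-free sketch: mean-pool the RoI features into one query, then let it cross-attend over every image patch. This is ordinary single-query softmax attention only. The learned linear projection is replaced by the identity and the differential term of DiffAttn is not modelled, so the function below is an assumption-laden illustration rather than the paper's Sift module.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length feature vectors into one vector."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def object_query_attention(roi_feats, patch_feats):
    """Cross-attend from a pooled RoI query to all image patches."""
    # Assumption: identity projection stands in for the learned linear layer.
    query = mean_pool(roi_feats)
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, patch)) / scale
        for patch in patch_feats
    ]
    weights = softmax(scores)
    dim = len(patch_feats[0])
    context = [
        sum(w * patch[d] for w, patch in zip(weights, patch_feats))
        for d in range(dim)
    ]
    return weights, context


roi = [[1.0, 0.0], [0.8, 0.2]]                  # features inside the box
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # all image patches
weights, context = object_query_attention(roi, patches)
print([round(w, 3) for w in weights])
```

The output is one context vector per grounded object regardless of how many patches the image has, which is the property that keeps the number of injected tokens fixed.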

Key Results:

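On MM-GCoT, answer accuracy improves by 4.8% and grounding accuracy by 14.6%.
On VideoEspresso, answer accuracy improves by 8.6%, surpassing the previous best.
The model trained only on VideoEspresso transfers to other benchmarks: Mantis +2.2%, V-Star +4.8%, TreeBench +5.9%. The abstract reports an average gain of +4.7% across diverse benchmarks; the three figures listed here average about 4.3%, so the reported mean presumably covers additional benchmarks.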

Tech Stack:

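Qwen2.5-VL-7B (base multimodal large language model)
Differential attention (DiffAttn)
Cross-attention
Feed-forward network (FFN)
Single-layer Transformer block
Average pooling
Linear projection
Supervised fine-tuning (SFT)
Group Relative Policy Optimization (GRPO)
Vision encoder (ViT)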

Strengths:

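Lightweight: each grounded object adds only three tokens, so decoding cost is constant and does not grow with RoI size.
Holistic scene understanding: differential attention suppresses distractors and distills the context around each object, avoiding hallucinations caused by isolated regions.
History-aware routing: the VWS supports evidence integration across objects and images, strengthening multi-hop reasoning.
Efficient training: the SFT-to-GRPO pipeline uses only the original datasets, with no extra annotation or complex heuristics.
Strong transfer: clear gains on several different benchmarks support the method's generality.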

Limitations:

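Depends on the model's own grounding predictions; inaccurate grounding may degrade later routing.
Differential attention must attend over all non-RoI regions, which is lightweight but still adds compute.
Validated only on a 7B-parameter model; effectiveness and efficiency on larger models are unknown.
Trained only on the original datasets; data augmentation and multi-source data fusion are unexplored.
Tasks needing fine spatial relations or complex geometric reasoning may still require additional mechanisms.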

Relevance To Keywords:

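Native multimodal large models: as an MLLM plugin, ROVER directly strengthens multi-image reasoning; highly relevant.
Unified understanding and generation in multimodal large models: ROVER routes visual evidence dynamically during generation, coupling understanding with generation; closely relevant.
Representation learning: differential attention learns object-centric representations and the VWS stores them in structured form, which falls under representation learning.
World models: the VWS can be seen as an internal working space that implicitly models inter-object relations, giving some connection to world models.
Reinforcement learning: GRPO is used for post-training, an application of RL to MLLMs; relevant.
Post-training: SFT-to-GRPO is a typical post-training strategy; directly relevant.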

4. ProgVLA: Progress-Aware Robot Manipulation Skill Learning [PASS]

Score: 62.0 / 26.5

Authors: Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

Published: 2026-05-27

TL;DR: ProgVLA adds progress-aware heads and offline RL objectives to enable efficient long-horizon multimodal robot manipulation under tight compute budgets.

Abstract

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by maintaining an explicit representation of task progress over extended horizons. To this end, ProgVLA integrates two key components. First, a multi-modal encoder with a two-stage Perceiver resampling scheme compresses variable-length visual, language, and proprioceptive streams into a fixed set of control-ready context tokens, substantially reducing sequence length while preserving cross-modal grounding. Second, an auxiliary set of progress heads is trained with offline reinforcement learning (RL) objectives to jointly learn critics over normalized remaining-horizon targets. This provides the policy with an internal estimate of task progress and enables advantage- and success-weighted flow-matching imitation learning. On two well-established multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA model reaches success rates that are competitive with, and on long-horizon and harder task tiers exceed, substantially larger pretrained baselines. Ablations indicate that the learned context resampler and task-adaptive visual fine-tuning are the largest single contributors, while progress-aware training provides a consistent additional gain that is concentrated on long-horizon and multi-object tasks. We further validate the approach in real-world toy-kitchen environments.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	7.0/10	14.0
World Models	2.0	4.0/10	8.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	3.0/10	6.0

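Scoring rationale: The paper proposes ProgVLA, a VLA model. Unify Models (7.0): it unifies the vision, language, and action modalities. World Models (4.0): it involves long-horizon progress representation; moderately related. MLLM (8.0): VLA models are a branch of MLLMs; highly related. MultiModal (9.0): the core is a multimodal encoder; highly related. model-based RL (3.0): the main method is imitation learning; weakly related. No designated experts were found. The weighted total is 62.0, above the passing score.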

Keywords

ProgVLA, Vision-Language-Action, Multi-modal Encoder, Progress-Aware, Offline RL, Robot Manipulation, Imitation Learning

In-Depth Analysis

Chinese Title: ProgVLA：进度感知的机器人操作技能学习

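Summary: This paper proposes ProgVLA, a compact vision-language-action (VLA) model for reliable robot manipulation under tight compute and memory budgets. The model focuses on processing long multimodal sequences efficiently and handles long horizons by maintaining an explicit representation of task progress. ProgVLA has two key components. The first is a multimodal encoder with a two-stage Perceiver resampling scheme that compresses variable-length visual, language, and proprioceptive streams into a fixed number of control-ready context tokens, sharply reducing sequence length while preserving cross-modal grounding. The second is a set of auxiliary progress heads trained jointly with offline reinforcement learning objectives to learn critics over normalized remaining-horizon targets; these give the policy an internal progress estimate and enable advantage-weighted and success-weighted flow-matching imitation learning. On two multi-task robot manipulation benchmarks, a 0.1B-parameter ProgVLA matches the success rates of much larger pretrained baselines and exceeds them on long-horizon and harder tasks. Ablations show that the learned context resampler and task-adaptive visual fine-tuning contribute most, while progress-aware training adds a consistent gain on long-horizon and multi-object tasks. The method is also validated in real-world toy-kitchen environments.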

Innovations:

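Proposes two-stage Perceiver resampling (per-modality and post-fusion) that compresses multimodal sequences into fixed-size context tokens, sharply reducing sequence length while retaining cross-modal information.
Designs auxiliary progress heads that share the policy's context representation and are trained jointly with offline RL objectives (Huber loss and expectile regression), providing an internal progress estimate that also serves as sample weights for imitation learning.
Uses the general-purpose vision backbone DUNE (ViT-Small) rather than large-scale robot pretraining; task-adaptive fine-tuning yields a strong visual prior, and the 0.1B-parameter model is competitive with larger models.
Treats the progress signal as an internal representation regularizer and sample-reweighting mechanism rather than an external classifier, so the policy itself becomes progress-aware.
Progress-aware training gives consistent gains on long-horizon and multi-object tasks, and the model needs no cross-embodiment robot pretraining, only demonstrations from the target benchmark.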

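Methodology: ProgVLA has three parts: a multimodal encoder, progress heads, and an action expert. The encoder extracts features with DUNE (vision), T5 (language), and an MLP (proprioception), compresses them into a fixed number of tokens with per-modality Perceiver resampling, applies cross-modal self-attention in a fusion Transformer, and finally uses post-fusion resampling to obtain compact context tokens c_t. The action expert uses flow matching conditioned on c_t and generates action chunks with a Heun ODE solver; during training it samples noise and an interpolation time and predicts the velocity field. The progress heads comprise a state-action critic Q̂(c_t, a_t) and a value head V̂(c_t) (with a success logit), both regressed onto the Monte Carlo remaining horizon G_t: Q̂ uses a Huber loss and V̂ uses IQL expectile regression (τ=0.8). The advantage A_t = Q̂(c_t, a_t) - V̂(c_t) is temperature-scaled and clipped, then used as the sample weight in the flow-matching loss. The total training loss is the sum of the flow-matching loss, the Q loss, and the value loss.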
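
The three scalar pieces named above (Huber loss for Q̂, expectile loss for V̂, and the advantage-derived sample weight) can be sketched for a single sample. The value τ=0.8 comes from the report; the Huber `delta`, the temperature, the clipping bound, and the exponential form of the weight (borrowed from advantage-weighted regression) are assumptions, since the report says only that the advantage is temperature-scaled and clipped.

```python
import math


def huber_loss(pred, target, delta=1.0):
    """Quadratic near zero, linear in the tails; delta is an assumed threshold."""
    err = abs(target - pred)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def expectile_loss(pred, target, tau=0.8):
    """IQL-style asymmetric squared error: under-prediction is weighted by tau."""
    diff = target - pred
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def advantage_weight(q_value, v_value, temperature=1.0, max_weight=10.0):
    """Sample weight from A = Q - V; exp() and both constants are assumptions."""
    advantage = q_value - v_value
    return min(math.exp(advantage / temperature), max_weight)


g_t = 0.4  # normalized remaining horizon used as the regression target
q_hat, v_hat = 0.5, 0.3
q_loss = huber_loss(q_hat, g_t)
v_loss = expectile_loss(v_hat, g_t)
weight = advantage_weight(q_hat, v_hat)
print(round(q_loss, 4), round(v_loss, 4), round(weight, 4))
```

With τ above 0.5 the value head is pushed toward the upper range of its targets, so the advantage Q̂ - V̂ is positive only for samples that do better than a fairly optimistic baseline.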

Key Results:

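On two multi-task robot manipulation benchmarks, the 0.1B-parameter ProgVLA matches the success rates of larger pretrained baselines (such as SmolVLA 450M) and exceeds them on long-horizon and harder tasks.
Ablations show that the two-stage Perceiver resampler and task-adaptive visual fine-tuning are the largest contributors, each bringing a clear performance gain.
Progress-aware training gives a consistent additional gain (about 2-5% in success rate) on long-horizon and multi-object tasks, and a smaller gain on short-horizon tasks.
In real-world toy-kitchen environments, ProgVLA executes multi-step manipulation instructions successfully, supporting the method's practical feasibility.
The model needs no large-scale robot pretraining, only demonstrations from the target benchmark, which lowers the barrier to deployment.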

Tech Stack:

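Perceiver Resampler (two stages: per-modality and post-fusion)
DUNE (ViT-Small, general-purpose vision backbone)
T5 text encoder (frozen)
Flow matching
Heun ODE solver (10 steps)
Offline reinforcement learning: IQL expectile regression (τ=0.8), Huber loss
Monte Carlo remaining-horizon target (G_t)
Beta-distribution sampling (α=2, β=2) for the flow-matching time step
Advantage weighting (temperature scaling + clipping)
MLP projection (proprioception)
Transformer self-attention (cross-modal fusion)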

Strengths:

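Compact (0.1B parameters) and free of large-scale robot pretraining, which suits edge deployment.
Two-stage resampling handles long multimodal sequences efficiently and substantially lowers computational cost.
The internal progress estimate comes from offline RL objectives with no external classifier and improves long-horizon performance.
The ablations are comprehensive and show each component's contribution clearly.
Validation in real-world environments demonstrates practicality.
Competing with larger models shows the potential of small models in specific settings.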

Limitations:

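The progress target is based only on temporal phase (normalized remaining horizon), not semantic subgoals, so it may not suit tasks that are not time-aligned or that have complex subtask structure.
Training uses only successful demonstrations, so the advantage estimate is weakly identified in the strict offline-RL sense and acts mainly as a reweighting by trajectory phase.
Depends on the pretrained DUNE and T5 models, whose capability may limit performance.
Generalization is not tested in more diverse settings (e.g., different robot platforms, complex environments).
May be sensitive to hyperparameters (e.g., temperature, clipping threshold, expectile τ) and require tuning.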

Relevance To Keywords:

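Native multimodal large models: ProgVLA is a typical VLA model that fuses the vision, language, and action modalities, and so belongs to multimodal large models.
Representation learning: two-stage Perceiver resampling and learned context tokens are the core of its multimodal representation compression, reflecting representation learning.
Reinforcement learning: the auxiliary progress heads are trained with offline RL objectives (IQL, Huber loss) and used to weight imitation learning; closely related.
Post-training: the model applies task-adaptive fine-tuning on top of a pretrained vision backbone (DUNE) and language encoder (T5), which counts as post-training.
World models: the paper neither builds an explicit world model nor performs model-based prediction; weakly related.
Model-based RL: no model-based method is used; the policy and value functions are learned directly; low relevance.
Unified understanding and generation in multimodal large models: ProgVLA focuses on action generation and uses language understanding for instruction conditioning, but does not emphasize a unified understanding-generation framework.
Unify Models: the paper does not unify different models but builds a single compact model; moderate relevance.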

5. Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts [PASS]

Score: 60.0 / 26.5

Authors: Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen, Ruihua Song

Published: 2026-05-27

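TL;DR: This paper proposes PlanAudio, a framework that uses a unified LLM-based architecture and a semantic latent chain-of-thought to generate composite speech and sound from free-form text prompts, generally outperforming existing baselines.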

Abstract

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	9.0/10	18.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	1.0/10	2.0

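Scoring rationale: Both the title and abstract emphasize "Unified", which is highly relevant to Unify Models (9.0). Text-to-audio generation is a multimodal task, relevant to MultiModal (9.0) and MLLM (8.0). The paper includes a planning mechanism but does not form a typical world-model or reinforcement learning setting, so World Models (3.0) and model-based RL (1.0) have low relevance. The weighted total is 60.0, above the passing score of 26.5, and the author list contains none of the designated experts.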

Keywords

Unified Synthesis, Speech and Sound, Free-Form Text, PlanAudio, LLM-based, Semantic Latent, Chain-of-Thought

In-Depth Analysis

Chinese Title: 基于自由形式文本提示的统一合成语音与声音

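Summary: This paper introduces a new task: generating unified audio containing speech, sound, and their composites directly from free-form text prompts. Existing methods either rely on disjoint pipelines (which fail to capture fine-grained interactions) or require structured inputs or external text rewriting (which limits flexibility). The authors propose PlanAudio, a unified framework based on an autoregressive LLM. It uses the LLM's intrinsic reasoning capability in place of a traditional text encoder, which simplifies the architecture, and introduces a semantic latent chain-of-thought (latent CoT) mechanism that plans semantics implicitly in latent space, bridging high-level semantic understanding and low-level acoustic synthesis. The authors also build the PlanAudio-Bench benchmark for evaluating composite audio scenarios. Experiments show that PlanAudio generally outperforms existing pipeline and unified baselines in the composite, sound, and speech scenarios and stays competitive with models designed for a single scenario. Further analysis shows that the semantic latent CoT outperforms other CoT mechanisms and highlights the importance of a continuous multi-scenario training curriculum.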

Innovations:

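Defines, for the first time, the task of free-form-text-prompt-to-unified-audio generation, with no structured input or external rewriting.
Proposes the PlanAudio framework, which processes text directly with the LLM's intrinsic language understanding, avoiding traditional text encoders and multi-module pipelines.
Introduces the semantic latent chain-of-thought (latent CoT) mechanism, which plans semantics implicitly in a continuous latent space and bridges high-level semantics and low-level acoustic synthesis.
Builds PlanAudio-Bench, a benchmark dedicated to composite audio evaluation, filling a gap in jointly annotated data.
Verifies the advantage of semantic latent CoT over explicit CoT and shows the need for a continuous multi-scenario training curriculum.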

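Methodology: PlanAudio generates in two stages. The free-form text prompt is first converted to text tokens by the LLM tokenizer. In the semantic latent CoT stage, the model predicts a fixed-length sequence of continuous latents, which a linear projection layer aligns with semantic embeddings extracted by a pretrained Audio Flamingo 3 encoder, optimized with MSE and cosine-similarity losses. In the acoustic generation stage, the model autoregressively predicts hierarchical discrete audio tokens (from the AudioCraft tokenizer, across Q codebooks) conditioned on the text and the latent sequence, optimized with a cross-entropy loss. All modalities (text, latent, audio) are organized into a single sequence with special delimiter tokens and processed by one unified Transformer backbone. Training uses a weighted combination of the losses to balance semantic planning and generation quality. At inference time the model first predicts the latent sequence and then generates audio tokens until the end token.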
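
The latent alignment objective can be illustrated with a small sketch that combines MSE with a cosine-similarity term over a sequence of latent vectors. The report states only that both losses are used; the equal weights, the `1 - cosine` form, the averaging over the K latent steps, and the `eps` guard are assumptions for this example.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(a, b, eps=1e-12):
    """Cosine of the angle between a and b; eps guards against zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def latent_alignment_loss(pred_seq, target_seq, w_mse=1.0, w_cos=1.0):
    """Average over latent steps of w_mse * MSE + w_cos * (1 - cosine)."""
    per_step = [
        w_mse * mse(p, t) + w_cos * (1.0 - cosine_similarity(p, t))
        for p, t in zip(pred_seq, target_seq)
    ]
    return sum(per_step) / len(per_step)


target = [[1.0, 2.0], [0.0, 1.0]]    # stand-in for encoder embeddings
matched = [[1.0, 2.0], [0.0, 1.0]]   # perfect prediction
rotated = [[2.0, -1.0], [1.0, 0.0]]  # orthogonal to the targets
print(latent_alignment_loss(matched, target))
print(latent_alignment_loss(rotated, target))
```

MSE penalizes differences in magnitude while the cosine term penalizes differences in direction, so the two together constrain the predicted latents more tightly than either alone.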

Key Results:

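PlanAudio generally outperforms existing pipeline and unified baselines in the composite, sound, and speech scenarios.
Compared with models designed for a single task, PlanAudio stays competitive in each scenario.
The semantic latent CoT mechanism outperforms other, explicit CoT mechanisms (such as a natural-language chain of thought).
A continuous multi-scenario training curriculum (training on speech, sound, and composite data together) is critical to overall performance.
The PlanAudio-Bench benchmark evaluates composite audio generation quality effectively.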

Tech Stack:

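LLM (large language model) as the backbone
Audio Flamingo 3 Encoder (AF3Encoder) for extracting semantic embeddings
AudioCraft tokenizer (hierarchical discrete audio tokens, multiple codebooks)
Transformer architecture (autoregressive generation)
Mean squared error (MSE) loss
Cosine-similarity loss
Cross-entropy loss
Linear projection layer (ϕ) for latent-space alignment
Special delimiter tokens (<|sot|>, <|sol|>, <|soa|>, <|eoa|>)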

Strengths:

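Unified architecture with no external text rewriting or multi-module pipeline, which reduces system complexity and cascading errors.
Uses the LLM's inherent reasoning capability, simplifying model design and strengthening semantic understanding.
Semantic latent CoT provides implicit planning, avoiding the cost of generating explicit intermediate text while keeping structural guidance.
Performs well across scenarios (speech, sound, composite), showing generality.
Builds a dedicated composite audio benchmark, advancing standardized evaluation in this area.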

Limitations:

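Depends on the pretrained Audio Flamingo 3 encoder and AudioCraft tokenizer, which may introduce pretraining bias.
The latent sequence length K is a fixed constant and may not adapt to audio of varying structural complexity.
The training data is synthetically annotated (PlanAudio-Bench) and may differ in distribution from real-world data.
Scalability to very long audio or complex multi-event scenes is not discussed in detail.
A small gap to specialized models may remain in single scenarios (the paper says "competitive", not "superior").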

Relevance To Keywords:

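Native multimodal large models: PlanAudio handles text and audio directly with an LLM, unifying multimodal understanding and generation.
Unified understanding and generation in multimodal large models: the framework performs semantic planning (understanding) and acoustic synthesis (generation) together, trained end to end.
Representation learning: it extracts semantic representations with the Audio Flamingo 3 encoder and learns implicit semantic representations through the latent CoT.
World models: the semantic latent CoT can be seen as implicit structural modeling of the audio world, predicting event arrangement and acoustic layout.
Reinforcement learning / post-training: the paper does not involve reinforcement learning directly, but the continuous multi-scenario training curriculum can be seen as a post-training strategy, and RL could later be used to optimize generation quality.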

6. Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models [PASS]

Score: 60.0 / 26.5

Authors: Haozhan Shen, Tiancheng Zhao, Kangjia Zhao, Jianwei Yin

Published: 2026-05-27

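TL;DR: Through an empirical comparison, the paper finds that vision-language models and video generation models are complementary on spatial intelligence tasks, and that fusing their features can yield a stronger spatial-intelligence backbone.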

Abstract

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it still remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using the lightweight probe, our framework enables a controlled comparison of what information is already encoded in frozen representations from two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.

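Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	7.0/10	14.0
World Models	2.0	6.0/10	12.0
MLLM	2.0	7.0/10	14.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper is fundamentally about multimodal representation learning, and its VLM-versus-VGM comparison bears on MultiModal and MLLM. It proposes fusing features from the two model families, so Unify Models is moderately relevant. VGMs model world dynamics, so World Models is moderately relevant. Reinforcement learning is not addressed, so model-based RL has low relevance. No designated expert appears in the author list.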

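Keywords

Spatial intelligence, Vision-Language Models, Video Generation Models, Pre-training paradigm, Semantic tagging, 3D geometry prediction, Representation learning

In-Depth Analysis

Chinese Title: 哪种预训练范式更有利于空间智能？视觉语言模型与视频生成模型的实证比较

Summary: The paper systematically compares the representational capacity of vision-language models (VLMs) and video generation models (VGMs) for spatial intelligence. Using frozen-feature probing, it contrasts the two model families along three axes: semantic tagging, instance grouping, and 3D geometry prediction. The experiments show clear complementarity: VLMs do better on semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera-motion prediction. Simple feature fusion captures both the geometric and the semantic advantages, suggesting that integrating features from both families is a promising route to stronger spatial-intelligence backbones. By avoiding confounders such as downstream fine-tuning, the study is the first to compare the two pre-training paradigms systematically at the level of frozen representations.

Innovations:

First systematic comparison, at the level of frozen representations, of how well VLMs and VGMs support spatial intelligence, avoiding confounders such as downstream fine-tuning.
A unified probing framework that evaluates spatial intelligence along three complementary axes: semantic tagging, instance grouping, and 3D geometry prediction.
A clear division of strengths: VLMs excel at semantics and instances, VGMs at geometry and camera motion.
Simple feature fusion demonstrates that the two kinds of representation are complementary and points toward stronger spatial-intelligence backbones.

Methodology: The study uses frozen-feature probing: the VLM and VGM are frozen, video features are extracted from intermediate layers, and a lightweight probe backbone (an alternating-attention Transformer) is trained together with task-specific readout heads. The three tasks are semantic tagging (multi-label classification), instance grouping (cross-view pixel correspondence), and 3D geometry prediction (depth, point maps, camera motion). Feature extraction samples 20 frames per video; VGM features are the generator's internal hidden activations, and VLM features are the hidden states of visual tokens. The probe backbone architecture is shared across tasks, while each task head is trained separately.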

Key Results:

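VLMs significantly outperform VGMs on semantic tagging and instance grouping.
VGMs perform better on 3D geometry prediction (depth, point maps, camera motion).
Simple feature fusion (normalize, then concatenate) retains both the semantic advantage of VLMs and the geometric advantage of VGMs.
The two kinds of representation are complementary rather than mutually exclusive.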
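
The "normalize, then concatenate" fusion mentioned in the results can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes per-token L2 normalization, token-aligned features of shape (num_tokens, dim), and the hypothetical helper names `l2_normalize` and `fuse_features`. The report does not say which normalization the paper uses.

```python
import numpy as np

def l2_normalize(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale each token's feature vector to unit L2 norm."""
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    return features / np.maximum(norms, eps)

def fuse_features(vlm_feats: np.ndarray, vgm_feats: np.ndarray) -> np.ndarray:
    """Normalize each family's features, then concatenate along channels.

    Assumption: both inputs are (num_tokens, dim) arrays aligned token by token.
    """
    if vlm_feats.shape[0] != vgm_feats.shape[0]:
        raise ValueError("token counts must match before concatenation")
    return np.concatenate(
        [l2_normalize(vlm_feats), l2_normalize(vgm_feats)], axis=-1
    )

rng = np.random.default_rng(0)
vlm = rng.normal(size=(4, 8)) * 50.0  # hypothetical VLM tokens, large scale
vgm = rng.normal(size=(4, 6)) * 0.1   # hypothetical VGM tokens, small scale
fused = fuse_features(vlm, vgm)
print(fused.shape)  # (4, 14)
```

Normalizing first keeps one family from dominating the fused vector purely because its activations have a larger scale.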

Tech Stack:

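Vision-language models: Qwen3-VL, InternVL3
Video generation models: WAN2.1, CogVideoX
Probe backbone: N-layer alternating-attention Transformer (intra-frame attention and global attention)
Feature extraction: single-step denoising hidden activations for VGMs; visual tokens from language-model layers for VLMs
Task heads: semantic tagging (multi-label classification head), instance grouping (cross-view matching head), 3D geometry (depth/point-map regression heads, camera-motion regression head)
Evaluation metrics: semantic tagging (mAP), instance grouping (IoU/accuracy), 3D geometry (depth error, point-map error, camera-motion error)

Strengths:

Rigorous experimental design: frozen-feature probing isolates the effect of the pre-training paradigm.
Covers three key dimensions of spatial intelligence, giving a broad evaluation.
Identifies complementarity and validates simple fusion, which offers practical guidance.
Open-source code supports reproduction and extension.

Limitations:

Only a few representative models are tested, so the conclusions need validation on more models.
The probing tasks are simplified settings and do not cover complex spatial reasoning such as navigation or manipulation.
The fusion method is simple; more sophisticated strategies might improve performance further.
The effects of model scale and training-data volume are not examined.

Relevance To Keywords:

Unify Models: The paper compares the VLM and VGM pre-training paradigms and explores fusing them, which relates to unified models.
World Models: A VGM can be viewed as a kind of world model, and the paper evaluates its geometric representations.
Representation Learning: The core topic is representation learning; probing compares visual representations obtained from different pre-training schemes.
Model-Based RL: Spatial intelligence underlies environment modeling in reinforcement learning, so the results have implications for model-based RL.
原生多模态大模型 (native multimodal large models): Both VLMs and VGMs are multimodal large models, and the paper compares their spatial-intelligence representations.
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): The paper notes the complementarity of VLMs (understanding) and VGMs (generation), which hints at unification.
表征学习 (representation learning): As above, representation learning is the core topic.
世界模型 (world models): As above, VGMs are candidate world models.
强化学习 (reinforcement learning): Spatial intelligence matters for perception and planning in RL.
后训练 (post-training): The paper focuses on pre-trained representations; post-training could change them, but the paper does not address this.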

7. Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation [PASS]

Score: 58.0 / 26.5

Authors: Mariam Hassan, Kaouther Messaoud, Wuyang Li, Alexandre Alahi

Published: 2026-05-27

TL;DR: Proprio proposes a training-free framework that improves physical plausibility in video generation via latent self-scoring and inference-time refinement, outperforming VLM-based scoring and external world-model baselines.

摘要翻译

现代视频生成模型虽能产出视觉上令人印象深刻的结果，却经常违反基本的物理原理。本文提出 Proprio，一种无需训练的框架，使冻结的视频生成器能够评估并改进其自身输出的物理合理性。受本体感觉（即生物体对自身运动的感知）启发，Proprio 将模型在受控潜在扰动下的光流残差视为一种自评分信号。更能被生成器所学动力学解释的样本会产生更小且更稳定的残差。我们在时间步和扰动维度上聚合该信号，利用动态时空掩码将其聚焦于与运动相关的区域，并将其用于最佳 -N 搜索、基于梯度的自精炼，或两者结合。在文本到视频和图像到视频基准测试中，Proprio 一致提高了物理合理性，在多种设置下优于基于 VLM（视觉语言模型）的评分方法及外部世界模型基线。基于 TurboWan2.2，Proprio 将 Physics-IQ 从 32.2 提升至 37.5 (+16.5%)，并将 VideoPhy2-hard physical commonsense 从 45.6 提升至 55.0 (+20.6%)。人类评估进一步表明，在约三分之二的比较中，评估者更倾向于选择 Proprio 筛选或精炼的视频，因其物理合理性更佳。这些结果表明，冻结的视频生成器内部包含可用于评估和改进其自身输出物理合理性的可操作信号。

Abstract

Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one's own movement, Proprio treats the model's flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator's learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.

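Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	8.0/10	16.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	5.0/10	10.0

Scoring rationale: 1. Unify Models (5.0): the paper builds a single framework for scoring and inference-time optimization but does not unify multiple model architectures, so relevance is moderate. 2. World Models (8.0): the abstract explicitly compares against external world-model baselines and relies on modeling latent dynamics, which is closely tied to the core idea of world models. 3. MLLM (3.0): the core model is a video generator rather than a multimodal large language model, and VLMs appear only as baselines, so relevance is low. 4. MultiModal (8.0): the tasks are text-to-video and image-to-video, which are multimodal generation, so relevance is high. 5. model-based RL (5.0): the method uses a generative model for inference-time optimization and is model-driven in that sense, but it involves no reinforcement learning framework, so relevance is moderate. Expert check: the author list contains no designated expert, so no bonus is added. The weighted total of 58.0 is well above the dynamic passing score of 26.5.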
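
The weighted total in the table above is the sum of weight times relevance over the five keywords. The sketch below reproduces that arithmetic. The `>=` comparison against the passing score is an assumption, since the report only states that the total exceeds the threshold.

```python
PASS_THRESHOLD = 26.5  # dynamic passing score quoted in the report

# (weight, relevance out of 10) per keyword, copied from the table above
scores = {
    "Unify Models": (2.0, 5.0),
    "World Models": (2.0, 8.0),
    "MLLM": (2.0, 3.0),
    "MultiModal": (2.0, 8.0),
    "model-based RL": (2.0, 5.0),
}

def weighted_total(table: dict) -> float:
    """Sum of weight * relevance over all keywords."""
    return sum(weight * relevance for weight, relevance in table.values())

total = weighted_total(scores)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)  # 58.0 PASS
```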

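Keywords

Video Generation, Physical Plausibility, Self-Scoring, Inference-Time Refinement, Latent Perturbations, World Models, Text-to-Video

In-Depth Analysis

Chinese Title: Proprio: 基于潜在自评分与推理时优化的物理合理视频生成

Summary: Modern video generation models achieve high visual quality but often violate physical laws. The paper proposes Proprio, a training-free framework that keeps the generator frozen and lets it assess and improve the physical plausibility of its own outputs. Inspired by proprioception, Proprio uses the model's flow residual under controlled latent perturbations as a self-scoring signal: samples more consistent with the generator's learned dynamics produce smaller, more stable residuals. Residuals from multiple timesteps are aggregated with inverse-variance weighting, and a dynamic spatiotemporal mask focuses the score on moving regions. The score can drive best-of-N search, gradient-based self-refinement, or both. On text-to-video and image-to-video benchmarks Proprio consistently improves physical plausibility, with relative gains of 16.5% on Physics-IQ and 20.6% on VideoPhy2-hard, and human raters prefer Proprio-selected videos in roughly two-thirds of comparisons. The results indicate that a frozen video generator contains internal signals that can evaluate and improve the physical plausibility of its own outputs.

Innovations:

A generator-native self-scoring method that uses the frozen generator's denoising residual as an intrinsic proxy for physical plausibility, with no external evaluator.
A dynamic spatiotemporal mask that focuses on motion-relevant regions and makes the score more sensitive to physical failures.
Inverse-variance weighted aggregation of residuals across timesteps and perturbations, which makes the score more robust.
The self-score is turned into an inference-time objective: backpropagation updates the initial noise for self-refinement, and this can be combined with best-of-N search.
Outperforms VLM-based and external world-model baselines on several physical-plausibility benchmarks, showing that a frozen generator can serve as both sampler and self-evaluator.

Methodology: Proprio builds on the flow matching framework: it applies controlled noise perturbations to a generated latent video and computes the residual between the model's predicted flow and the target flow. Residuals from multiple timesteps and multiple noise realizations are aggregated with inverse-variance weighting, and a dynamic spatiotemporal mask (based on motion magnitude and gradients) emphasizes moving regions. The resulting score drives three inference-time strategies: best-of-N search (select the highest-scoring sample), gradient-based self-refinement (update the initial noise by backpropagating through the frozen sampler), and a hybrid of search and refinement. Experiments cover text-to-video (TurboWan2.2) and image-to-video (I2VGen-XL) models and evaluate physical-plausibility metrics (Physics-IQ, VideoPhy2) and human preference.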
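
The aggregation and best-of-N steps described above can be illustrated with a small sketch. It is not the paper's implementation. It assumes the per-timestep variance is estimated across perturbations, that a smaller residual means a more plausible sample (so the score is the negated weighted residual), and that residuals are already reduced to one scalar per timestep and perturbation. `self_score` and `best_of_n` are hypothetical names, and the spatiotemporal mask and gradient-based refinement are omitted.

```python
import numpy as np

def self_score(residuals: np.ndarray, eps: float = 1e-8) -> float:
    """Aggregate one sample's residuals, shape (timesteps, perturbations).

    Each timestep's mean residual is weighted by the inverse of its variance
    across perturbations, so noisy timesteps count less. The sign is flipped
    so that a higher score means a smaller residual.
    """
    mean_t = residuals.mean(axis=1)
    var_t = residuals.var(axis=1)
    weights = 1.0 / (var_t + eps)
    weights /= weights.sum()
    return -float(np.dot(weights, mean_t))

def best_of_n(candidates: np.ndarray) -> int:
    """Return the index of the candidate with the highest self-score."""
    return int(np.argmax([self_score(r) for r in candidates]))

# Three hypothetical candidates, each with 2 timesteps x 2 perturbations.
candidates = np.array([
    [[1.0, 1.2], [0.9, 1.1]],    # uniformly large residuals
    [[0.2, 0.3], [0.25, 0.35]],  # small, stable residuals
    [[0.5, 2.5], [0.6, 0.7]],    # one very noisy timestep
])
best = best_of_n(candidates)
print(best)  # 1
```

In the third candidate the noisy first timestep receives almost no weight, which is the intended effect of inverse-variance weighting.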

Key Results:

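On TurboWan2.2, Proprio raises Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%).
On image-to-video tasks, Proprio outperforms VLM-based scoring and external world-model baselines (such as V-JEPA) on several physical metrics.
In human evaluation, raters judge Proprio-selected or refined videos to be more physically plausible in roughly two-thirds of comparisons.
A diagnostic experiment shows that in 82.2% of pairs the generator's internal residual signal prefers the real video over its own physically implausible generation.
Both the dynamic spatiotemporal mask and inverse-variance weighting bring significant gains, and ablations confirm the contribution of each component.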
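
The percentages quoted for Physics-IQ and VideoPhy2-hard are relative gains over the baseline score, not differences in points. A quick check of the arithmetic, using only the numbers quoted above:

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0

physics_iq_gain = round(relative_gain(32.2, 37.5), 1)
videophy2_gain = round(relative_gain(45.6, 55.0), 1)
print(physics_iq_gain)  # 16.5
print(videophy2_gain)   # 20.6
```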

Tech Stack:

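Flow matching generative framework
Diffusion models
Inverse-variance weighted aggregation
Dynamic spatiotemporal masking
Gradient descent optimization
Best-of-N search
VAE encoder/decoder
PyTorch (inferred)

Strengths:

Entirely training-free: it uses the frozen generator's internal signals directly and avoids the bias and representation mismatch of external evaluators.
General: applicable to different video generation models (T2V and I2V) and to several inference-time strategies.
Combines self-scoring with self-refinement, so it can both select and improve samples.
The dynamic spatiotemporal mask focuses on physically relevant regions and improves scoring accuracy.
Thorough experiments: effectiveness is validated on multiple benchmarks and in human evaluation, with clear ablations.

Limitations:

Relies on the dynamics the generator itself has learned; if the generator models some physical laws poorly, the self-score may fail.
High computational cost: multiple forward passes (several perturbations and timesteps) and possibly backpropagation for refinement.
The dynamic mask relies on heuristics (such as a motion-magnitude threshold) that may not suit every scene.
Applies only at inference time and cannot fundamentally improve how the generator is trained.
Validated only on specific models (TurboWan2.2, I2VGen-XL); generalization needs more testing.

Relevance To Keywords:

Unify Models: The paper explores using the generator itself as an evaluator, which is in line with the idea of unifying generation and understanding.
World Models: A video generation model can be viewed as a world model, and Proprio uses its internal dynamics to assess physical plausibility.
Representation Learning: The residual signal in latent space can be seen as a kind of representation-learning signal.
Model-Based RL: The self-score could serve as a reward signal, and inference-time refinement resembles model predictive control or policy optimization.
原生多模态大模型 (native multimodal large models): The method applies to multimodal (text/image-to-video) generative models.
后训练 (post-training): Inference-time refinement happens after training and requires no retraining of the model.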

8. CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning [PASS]

Score: 58.0 / 26.5

Authors: He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

Published: 2026-05-27

TL;DR: CogPortrait introduces a two-stage framework combining MLLM-based hierarchical planning and DiT generation to achieve precise fine-grained eye-region control in portrait animation.

摘要翻译

肖像动画方法在视觉质量和唇同步方面已取得了显著进展，但眼部区域的精细操控仍面临输入粒度与运动准确性之间的权衡。现有使用情感标签或粗略文本提示的方法不足以描述细微的眼部动态，而基于动作单元（Action Units）或驱动视频的方法虽然提供了更高的保真度，却以更大的输入负担为代价。这些限制对于超越情感的状态（例如思考）以及困倦状态而言仍然具有局限性。鉴于上述问题，我们提出了 CogPortrait，这是一种从高级标签生成肖像动画的两阶段框架。在第一阶段，三个基于链式思维的多模态大语言模型（MLLMs）代理通过时间事件规划、原型检索、基于真实行为库的组合以及语义生理约束执行，将高级标签编译为面部关键点。在第二阶段，一个基于扩散变换器（DiT）的视频生成骨干根据关键点、参考肖像、音频和文本提示合成最终动画，并通过动态无分类器指导策略进行增强，该策略包含眼部区域感知的重新加权以及针对边界情况的基于 KTO 的细化。此外，我们引入了 EMH 基准，涵盖多样情感及超越情感类别，并包含两个 AU 级指标，用于评估精细的眼部区域和头部运动控制。在 HDTF 和 EMH 基准上的广泛实验表明，CogPortrait 相较于现有方法实现了更精确的眼部区域控制，同时保持了卓越的视觉质量和身份一致性。

Abstract

Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond-emotion states (e.g., thinking) and drowsiness. In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	6.0/10	12.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	10.0/10	20.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: MLLM (10.0) and MultiModal (10.0) are core to the paper, which explicitly uses MLLM agents and processes multimodal inputs (video, text, audio, keypoints). Unify Models (6.0) is moderately relevant because of the unified pipeline combining planning (MLLM) and generation (DiT). World Models (2.0) and model-based RL (1.0) have low relevance, as the work focuses on conditional video generation rather than latent world-dynamics modeling or reinforcement learning optimization. No expert authors from the target list were found, so no bonus was added.

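Keywords

CogPortrait, Portrait Animation, MLLM, Eye-Region Control, DiT, Video Generation, Hierarchical Planning, Facial Keypoints

In-Depth Analysis

Chinese Title: CogPortrait: 通过分层智能体规划实现肖像动画中的细粒度眼部区域控制

Summary: The paper proposes CogPortrait, a two-stage framework that generates portrait animation with fine-grained eye-region control from high-level labels such as emotions and cognitive states. The first stage uses three chain-of-thought multimodal large language model (MLLM) agents: a planning agent decomposes the label into temporal events, a composition agent retrieves and assembles prototypes from a real-behavior library, and a critic agent enforces semantic and physiological constraints, producing a facial keypoint sequence. The second stage uses a DiT-based video diffusion backbone conditioned on the reference portrait, audio, text prompt, and keypoints, and improves generation quality with a dynamic classifier-free guidance (CFG) strategy (eye-region reweighting) and KTO-based refinement for boundary cases. The authors also build the EMH benchmark (6 core emotions and 6 beyond-emotion states) and introduce two AU-level metrics (AU-F1 and AU-Temp). Experiments show that CogPortrait outperforms existing methods in fine-grained eye control, visual quality, and identity consistency.

Innovations:

A two-stage framework that achieves fine-grained eye-region control directly from high-level labels, without motion-level driving signals.
A hierarchical agent planning mechanism (planning, composition, critique) that generates facial keypoints with realistic dynamics through chain-of-thought reasoning and prototype-library retrieval.
A dynamic CFG strategy with spatiotemporal eye-region reweighting, which strengthens high-frequency eye motion and reduces color shift.
KTO-based post-training on long-tail patterns (such as asymmetric eyebrows and large head rotations) to improve boundary cases.
The EMH benchmark and two AU-level metrics, AU-F1 and AU-Temp, designed to evaluate fine-grained eye-region and head-motion control.

Methodology: The paper follows a two-stage pipeline. Stage one: three MLLM agents (planning, composition, critique) use chain-of-thought reasoning to compile high-level labels into a facial keypoint sequence. The planning agent generates temporal events, the composition agent retrieves and assembles prototypes from the real-behavior library, and the critic agent checks semantic consistency and physiological plausibility, with iterative correction if needed. Stage two: a Flow Matching based DiT video diffusion model (Wan2.2) serves as the generation backbone, conditioned on the reference portrait, audio, text prompt, and the keypoints from stage one. At inference time, dynamic CFG reweights the eye region across space and time; for boundary cases, KTO fine-tunes the model on human-annotated desirable and undesirable samples.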
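
One way to picture eye-region reweighting in classifier-free guidance is to make the guidance scale a per-pixel map instead of a scalar. The sketch below is only an illustration under that assumption. The function name `region_weighted_cfg`, the linear blend between two scales, and the scale values are hypothetical, and the report does not give the paper's exact formulation.

```python
import numpy as np

def region_weighted_cfg(
    cond: np.ndarray,
    uncond: np.ndarray,
    eye_mask: np.ndarray,
    base_scale: float = 5.0,
    eye_scale: float = 7.5,
) -> np.ndarray:
    """Classifier-free guidance with a spatially varying guidance scale.

    cond, uncond: model predictions with and without conditioning, shape (T, H, W).
    eye_mask: values in [0, 1] with the same shape, 1 inside the eye region.
    The mask may differ per frame, which makes the scale vary over time too.
    """
    scale = base_scale + (eye_scale - base_scale) * eye_mask
    # Standard CFG update, applied element-wise with the per-pixel scale.
    return uncond + scale * (cond - uncond)

# Toy example: one frame of 2x2 "pixels", eye region in the top-left corner.
cond = np.ones((1, 2, 2))
uncond = np.zeros((1, 2, 2))
eye_mask = np.array([[[1.0, 0.0], [0.0, 0.0]]])
guided = region_weighted_cfg(cond, uncond, eye_mask)
print(guided[0].tolist())  # [[7.5, 5.0], [5.0, 5.0]]
```

With a mask of all zeros this reduces to ordinary CFG with a single scale.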

Key Results:

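On the EMH benchmark, CogPortrait outperforms all baselines (such as AnimateDiff, SadTalker, and MuseTalk) on AU-F1 and AU-Temp.
On the HDTF dataset, CogPortrait matches or exceeds existing methods in visual quality (FID, FVD) and identity consistency (CSIM).
Ablations confirm the contribution of each component (agent planning, dynamic CFG, KTO).
Generated examples of beyond-emotion states (such as thinking and drowsiness) and fine-grained instructions (such as a left-eye blink or a gaze shift) demonstrate generalization.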

Tech Stack:

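Flow Matching (FM) model
DiT (Diffusion Transformer) video generation backbone (Wan2.2)
Multimodal large language models (MLLMs) with chain-of-thought reasoning
Kahneman-Tversky Optimization (KTO)
3D VAE encoder/decoder
Dynamic classifier-free guidance (Dynamic CFG) strategy
AU (Action Unit) metrics: AU-F1, AU-Temp

Strengths:

Maps high-level semantic labels directly to fine-grained eye-motion control, which reduces the user's input burden.
Agent planning combined with a real-behavior library yields facial keypoints with realistic dynamics and avoids over-smoothing.
Dynamic CFG and KTO handle long-tail and boundary cases effectively and make generation more robust.
The dedicated EMH benchmark and AU-level metrics fill a gap in evaluating fine-grained eye control.
Reaches SOTA on several datasets and metrics, supporting the method's effectiveness.

Limitations:

Depends on the quality and coverage of the prototype library; rare patterns missing from the library may degrade results.
MLLM agent reasoning may add computational overhead, and real-time performance remains to be verified.
Identity preservation may still be difficult under extreme head rotations or complex occlusions.
The paper does not discuss generalization across identities (for example, reference portraits of different ethnicities and ages).

Relevance To Keywords:

原生多模态大模型 (native multimodal large models): The paper uses MLLM agents for chain-of-thought reasoning, which is an application of multimodal large models but not a natively multimodal one (no unified model is trained).
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): The MLLM interprets high-level labels and produces control signals, which relates to unifying understanding and generation.
表征学习 (representation learning): Facial keypoints act as an intermediate representation, but no self-supervised representation learning is involved.
世界模型 (world models): Agent planning can be seen as modeling a dynamic world, but no explicit world model is built.
强化学习 (reinforcement learning): KTO falls under the paradigm of learning from human feedback and is related to post-training.
后训练 (post-training): KTO fine-tuning is a post-training stage and matches the keyword.
Unify Models, Model-Based RL: The paper does not address unified models or model-based reinforcement learning, so relevance is weak.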

9. Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs [PASS]

Score: 58.0 / 26.5

Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu, Siyi Wang, Wei Ji

Published: 2026-05-27

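TL;DR: Addressing the inconsistency that arises when video-language models face missing modalities, the paper proposes a unified incomplete multimodal model that works as a plug-and-play module to improve performance on multimodal tasks.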

摘要翻译

视频 - 语言模型（VLMs）已在多样化的计算机视觉应用中展现出令人印象深刻的多模态推理能力。然而，这些 VLMs 是任务特定的，并假设视频和语言输入均完整。然而，现实世界中的 VLM 应用可能会因传感器停用（例如，由于数据隐私导致相机不可用）而面临挑战，从而产生模态不完整的数据，并导致训练数据与测试数据之间的不一致。虽然直接处理不完整的输入可能会损害训练泛化能力并导致训练失败，但其对 VLMs 在安全性和可信性方面的潜在风险在很大程度上被忽视了。为此，我们首次尝试提出一种统一的非完整视频 - 语言模型，以处理不完整的多模态输入。广泛的实验结果表明，我们的方法可以作为即插即用模块应用于先前的工作，以提高其在各种多模态任务中的性能。

Abstract

Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. However, real-world VLM applications might face challenges due to deactivated sensors (e.g., cameras are unavailable due to data privacy), yielding modality-incomplete data and leading to inconsistency between training and testing data. While straightforward incomplete input can boast training generalization-ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process the incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.

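Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	9.0/10	18.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	7.0/10	14.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: Both the title and the abstract emphasize "Unified" and "Multi-Modal", so the paper is highly relevant to Unify Models and MultiModal. MLLM, as shorthand for multimodal large models, fits the paper's Video-Language Model setting well and receives a medium-high score. World Models and model-based RL concern reinforcement learning and environment modeling, which have no direct connection to the paper's multimodal perception and reasoning tasks, so they score lowest. The weighted total is 58.0, above the dynamic passing score of 26.5.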

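Keywords

Unified Vision-Language Models, Incomplete Multi-Modal Inputs, Video-Language Models, Modality-Incomplete Data, Multi-modal Reasoning, Plug-and-play Module, Robustness

In-Depth Analysis

Chinese Title: 面向不完整多模态输入的统一视觉语言模型

Summary: The paper addresses incomplete multimodal input (such as missing video frames or missing words) that video-language models (VLMs) face in real applications, and proposes what it describes as the first unified, task-agnostic incomplete video-language model. Existing VLM methods depend heavily on complete video-text pairs and cannot handle incomplete input, which causes severe performance drops. The paper designs three core modules: a multimodal feature approximation module that approximates features of the missing modality by building a cross-modal relation graph; a multimodal knowledge distillation module that reduces over-reliance on the complete modality; and a multi-granularity multimodal integration module that maps semantically similar video-text pairs more compactly into a shared feature space. The method works as a plug-and-play module and significantly improves several existing VLM methods on video-text retrieval, video question answering, and video sentence grounding. Experiments at different incompleteness rates confirm its effectiveness and robustness.

Innovations:

First to propose and define the realistic setting of "incomplete video-text alignment", in which video and text are both partially missing and the missing rates are unbalanced.
A unified incomplete multimodal completion network that applies to several downstream multimodal tasks (retrieval, question answering, grounding) without a task-specific fusion design.
A multimodal feature approximation module that uses cross-modal and intra-modal K-reciprocal nearest-neighbor relations to approximate missing frame/word features reliably from the available features.
A multimodal knowledge distillation module that balances the model's reliance on the complete modality and improves robustness under incomplete input.
A prototype-based multi-granularity multimodal integration module that aligns semantically similar video-text pairs more compactly through shared prototypes and cross-attention.

Methodology: The pipeline is as follows. First, feature encoders extract initial features for video frames and text words. Next, prototype learning builds shared prototypes, and multi-head cross-attention (MCA) with a feed-forward network (FFN) reconstructs the features. Then, using Jaccard distance and the K-reciprocal nearest-neighbor algorithm, the most reliable neighbor features are selected across and within modalities to approximate the missing features. After that, multimodal knowledge distillation (complete-modality features as teacher, incomplete-modality features as student) reduces overfitting. Finally, the multi-granularity integration module maps video and text features into a common embedding space for compact alignment. The whole framework is a task-agnostic plug-and-play module.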
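
The K-reciprocal neighbor and Jaccard distance steps named in the methodology are standard building blocks and can be sketched generically. This is not the paper's module: it assumes cosine similarity over a single set of feature vectors, and the helper names (`knn_indices`, `k_reciprocal_sets`, `jaccard_distance`) are hypothetical.

```python
import numpy as np

def knn_indices(sim: np.ndarray, k: int) -> np.ndarray:
    """Top-k neighbor indices per row of a similarity matrix, self excluded."""
    sim = sim.copy()
    np.fill_diagonal(sim, -np.inf)
    return np.argsort(-sim, axis=1)[:, :k]

def k_reciprocal_sets(sim: np.ndarray, k: int) -> list[set[int]]:
    """For each item i, keep neighbor j only if i is also in j's top-k."""
    neighbors = [set(row.tolist()) for row in knn_indices(sim, k)]
    return [
        {j for j in neighbors[i] if i in neighbors[j]}
        for i in range(len(sim))
    ]

def jaccard_distance(a: set[int], b: set[int]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as maximally distant."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

# Four hypothetical 2-D features forming two loose clusters.
feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
unit = feats / np.linalg.norm(feats, axis=1, keepdims=True)
sim = unit @ unit.T  # cosine similarity

recip = k_reciprocal_sets(sim, k=2)
print(recip)  # [{1}, {0, 3}, {3}, {1, 2}]
print(jaccard_distance(recip[1], recip[2]))  # 0.5
```

The reciprocity check drops one-sided neighbors (item 3 is in item 0's top-2, but not the reverse), which is what makes the retained neighbors more reliable sources for approximating a missing feature.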

Key Results:

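As a plug-and-play module, the method significantly improves several existing SOTA VLM methods (for video-text retrieval, video question answering, and video sentence grounding) under incomplete input.
At different incompleteness rates (such as 10%, 30%, and 50%), the method recovers missing information effectively, and its performance drop is much smaller than that of baselines that use the incomplete input directly.
The multimodal feature approximation module uses cross-modal semantic similarity to fill in missing frame/word features accurately, and the knowledge distillation module effectively reduces over-reliance on the complete modality.
Experiments on several public datasets (such as MSRVTT, ActivityNet, and Charades-STA) confirm the method's generality and robustness.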

Tech Stack:

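Feature encoders (video frame encoding, text word encoding)
Prototype learning
Multi-head cross-attention (MCA)
Feed-forward network (FFN)
Jaccard distance
K-reciprocal nearest neighbors
Cosine similarity
Knowledge distillation
Multi-granularity feature integration

Strengths:

First systematic study of incomplete multimodal input, filling a gap in the field.
The method is a task-agnostic plug-and-play module that can be applied directly to several existing VLM methods, which makes it practical.
Three complementary modules address feature completion, overfitting mitigation, and alignment enhancement, giving a complete technical solution.
Comprehensive experiments across incompleteness rates, tasks, and datasets support the method's robustness and generalization.

Limitations:

The paper does not discuss performance limits under extreme incompleteness (for example, more than 80% of frames or more than 70% of words missing).
The method depends on the quality of the available features; if they are noisy, the approximation may degrade.
Computational efficiency is not validated in real deployment settings (such as real-time surveillance video), where latency may be a concern.
The knowledge distillation module requires extra training steps, which may increase model complexity.

Relevance To Keywords: The paper studies a unified vision-language model for incomplete multimodal input, which is highly relevant to "Unify Models". It involves multimodal representation learning, aligning features through prototypes and cross-attention. It does not address world models, model-based RL, or post-training, so relevance to those keywords is weak. Overall, the paper focuses on unified modeling of multimodal large models under incomplete input and belongs to the understanding side of native multimodal large models.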

10. VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning [PASS]

Score: 56.0 / 26.5

Authors: Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

Published: 2026-05-27

TL;DR: VCap proposes a Witness-Adjudicator reward based on hypergeometric distribution to enhance factual consistency for MLLM visual captioning, achieving weak-to-strong generalization and surpassing SOTA.

摘要翻译

视觉描述任务要求模型忠实捕捉视觉内容，同时最小化遗漏与幻觉现象。作为描述任务的主导范式，多模态大语言模型（MLLMs）通过模型扩展和高质量数据取得了显著性能。近期，强化学习（RL）已成为推动多模态大语言模型（MLLMs）向更高精度和更广覆盖发展的关键路径，然而，现有的描述任务奖励设计未能提供细粒度且可靠的事实验证信号，限制了其有效性。为了解决这一问题，我们提出了 VCap，这是一种基于 Witness-Adjudicator（证人 - 裁判）机制的奖励函数，它将参考描述（作为证人）与视觉信号（作为裁判）进行配对。通过在视觉信号基础上显式验证参考描述与策略生成描述之间的事实一致性，VCap 提供了具有超几何分布级别精度的奖励信号，用于描述质量验证。该设计使得模型即使从不完美的参考描述中也能进行有效学习，从而促进强化学习训练中的弱到强泛化能力。实验结果表明，使用 VCap 训练的 8B 模型在多个图像和视频描述基准上均优于开源和闭源的最先进（SOTA）模型。人类评估进一步证实了其与事实正确性的高度对齐。此外，VCap 提升了多模态大语言模型的感知能力，具备良好的跨任务泛化性，并超越了最佳 N 蒸馏方法，从而挑战了关于视觉奖励强化学习（RLVR）的先前假设。

Abstract

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage, however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

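Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	10.0/10	20.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	5.0/10	10.0

Scoring rationale: 1. Unify Models: the paper focuses on reward design for visual captioning and does not unify multiple model architectures, so the link to "Unify Models" is weak. 2. World Models: the content centers on captioning and reward learning, with no mention of world models or environment prediction, so it is not relevant. 3. MLLM: the paper explicitly optimizes multimodal large language models (MLLMs) for visual captioning, which makes them the core object of study and highly relevant. 4. MultiModal: visual captioning involves interaction between images/videos and text and is a typical multimodal task, so it is highly relevant. 5. model-based RL: the paper uses a reinforcement learning (RL) framework, but the focus is on reward design rather than model-based RL planning in the traditional sense, so relevance is moderate.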

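Keywords

Visual Captioning, MLLM, Reward Design, Hypergeometric Distribution, Weak-to-Strong, Factual Consistency, Reinforcement Learning

In-Depth Analysis

Chinese Title: VCap：用于弱到强视觉描述的超几何奖励

Summary: The paper proposes VCap, a Witness-Adjudicator reward mechanism for visual captioning. The reference caption is treated as a random witness and the visual signal as the adjudicator; a hypergeometric analysis is used to measure the correctness and completeness of a caption precisely. This enables weak-to-strong generalization in reinforcement learning training: even imperfect references can train a stronger captioner. In the experiments, an 8B model outperforms open- and closed-source SOTA models on several image and video captioning benchmarks, and human evaluation confirms its factual correctness. VCap also improves the perceptual capability of multimodal large models and generalizes to image and video question answering, and ablations show that the reward design outperforms self-distillation.

Innovations:

A fact-level framework for visual captioning that decomposes caption quality into two complementary axes, correctness and completeness.
A Witness-Adjudicator reward that separates the roles of the reference caption and the visual signal, enabling weak-to-strong generalization.
A reward signal derived from the hypergeometric distribution, so that imperfect references can still guide a stronger captioner.
SOTA results on image and video captioning benchmarks, with evidence of improved perception and cross-task generalization.

Methodology: A frozen reward model (an MLLM) makes a structured judgment on the triple (visual signal x, reference caption yref, policy caption y) and outputs three scores, correctness, completeness, and text quality, which are combined with weights into the reward. For video, a local reward computed on a random temporal segment is additionally combined with the global reward. Training uses reinforcement learning (RL), and the policy model iteratively generates stronger references to improve itself.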
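
For readers unfamiliar with the hypergeometric distribution, the sketch below shows the distribution itself and a weighted reward combination. The modeling story is an assumption made for illustration, not the paper's derivation: a scene is taken to contain N facts, of which the policy caption states K correctly, and an imperfect reference is taken to mention n facts drawn at random without replacement. The reward weights are hypothetical as well.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(exactly k hits) when drawing n items without replacement
    from a population of N items that contains K 'hit' items."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def expected_hits(N: int, K: int, n: int) -> float:
    """Mean of the hypergeometric distribution: n * K / N."""
    return n * K / N

def combined_reward(correctness: float, completeness: float, text_quality: float,
                    weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Weighted sum of the three judge scores (weights are hypothetical)."""
    w_corr, w_comp, w_text = weights
    return w_corr * correctness + w_comp * completeness + w_text * text_quality

# Illustrative numbers: 20 facts in the scene, the policy caption covers 15,
# and the reference mentions 5 facts.
N, K, n = 20, 15, 5
pmf = [hypergeom_pmf(k, N, K, n) for k in range(n + 1)]
print(round(sum(pmf), 6))      # 1.0
print(expected_hits(N, K, n))  # 3.75
```

Under this toy model, the share of reference facts that the policy caption also covers is an unbiased estimate of K / N even though the reference lists only a quarter of the facts, which gives some intuition for why an incomplete reference can still carry a useful signal.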

Key Results:

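The 8B model outperforms several larger models on image captioning benchmarks such as CapMAS and Decap-Bench.
It reaches SOTA on the video captioning benchmark VDC and ranks near the top on VCapsBench.
VCap ranks first in human evaluation, and the VCap reward agrees with human preference in 61.1% of cases.
Performance on downstream QA tasks improves, and ablations confirm that the reward design outperforms self-distillation.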

Tech Stack:

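Reinforcement learning (RL)
Hypergeometric distribution
Multimodal large language models (MLLMs)
Witness-Adjudicator reward mechanism
Self-evolving reference
Best-of-N sampling
Ablation studies

Strengths:

Treats the reference caption as a random witness, which avoids the limits of imitation-style rewards.
The hypergeometric analysis provides theoretical backing for weak-to-strong generalization.
Reaches SOTA on several benchmarks, with human evaluation supporting factual correctness.
Improves perception and generalizes to QA tasks, which supports the effectiveness of the reward design.

Limitations:

Depends on a frozen reward model, whose capability may become a bottleneck.
The hypergeometric model assumes facts are uniformly distributed, which may not match the real distribution.
Random temporal-segment selection for video may miss key facts.
Computational cost is high because multiple inference passes are needed.
Not validated on larger models.

Relevance To Keywords: The paper is highly relevant to the keywords. VCap is used to train native multimodal large models for visual captioning and touches on unifying understanding and generation in multimodal large models. It improves representation learning through reinforcement-learning post-training. On world models, captioning requires modeling the visual world. Reinforcement learning is the core method, and post-training is the main application stage.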

11. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning [PASS]

Score: 54.0 / 26.5

Authors: Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun, Yidong Chen, Jiayi Ji, Qian Chen, Rongrong Ji

Published: 2026-05-27

TL;DR: This paper proposes a cognitive scheduling framework (CSMR) that enables language models to dynamically acquire visual evidence during reasoning, overcoming limitations of unified models and improving accuracy in multimodal tasks.

摘要翻译

现有的多模态推理 (multimodal reasoning) 方法主要遵循两种范式：一种是在推理前将视觉输入转换为文本，另一种是在统一的视觉 - 语言 (vision-language) 表示空间内进行端到端 (end-to-end) 推理。尽管它们在经验上取得了进展，但这两种范式都存在根本性的结构局限。前者依赖于静态的视觉 - 文本 (visual-to-text) 转换，往往会导致细粒度视觉细节的压缩和丢失。后者容易受到联合优化 (joint optimization) 和注意力机制 (attention mechanisms) 引发的语言主导的影响，导致在推理过程中系统性地削弱了对视觉证据的忠实性。本文认为，核心挑战在于视觉证据如何以及何时被引入推理过程。受此启发，我们提出了 CSMR，这是一种多模态推理框架，其中语言模型 (language model) 通过决定何时调用独立的视觉感知模块 (visual perception module) 来获取任务相关的视觉证据来控制推理过程。在多个多模态推理基准测试 (benchmarks) 上的实验表明，CSMR 在零样本设置 (zero-shot setting) 下的准确率始终优于代表性基线方法 (baseline methods)。进一步的实验分析证实，这些优势主要源于所提出的认知调度机制 (cognitive scheduling mechanism)。

Abstract

Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or performing end-to-end reasoning within a unified vision-language representation space. Despite their empirical progress, both paradigms suffer from fundamental structural limitations. The former relies on static visual-to-text conversion, which tends to compress and lose fine-grained visual details. The latter is prone to linguistic dominance induced by joint optimization and attention mechanisms, leading to systematically weakened faithfulness to visual evidence during reasoning. In this work, we argue that a central challenge is how and when visual evidence is introduced into the reasoning process. Motivated by this insight, we propose CSMR, a multimodal reasoning framework in which a language model controls the reasoning process by deciding when to invoke an independent visual perception module to acquire task-relevant visual evidence. Experiments across multiple multimodal reasoning benchmarks show that CSMR consistently outperforms representative baseline methods in accuracy under a zero-shot setting. Further experimental analysis confirms that these advantages primarily arise from the proposed cognitive scheduling mechanism.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	6.0/10	12.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper is highly relevant to MultiModal (10) and MLLM (8) as it proposes an MLLM-based reasoning framework. It addresses Unify Models (6) by critiquing unified spaces and proposing scheduling. World Models (2) and model-based RL (1) are minimally relevant as the work focuses on evidence acquisition rather than dynamics or reinforcement learning. The total weighted score of 54.0 exceeds the 26.5 threshold.

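Keywords

Multimodal Reasoning, Visual Evidence Acquisition, Cognitive Scheduling, Language Model Control, Visual Perception Module, Zero-shot Setting, Faithfulness to Visual Evidence

In-Depth Analysis

Chinese Title: 按需查看：多模态推理中视觉证据获取的认知调度框架

Summary: The paper addresses when and how visual evidence should enter multimodal reasoning and proposes a cognitive scheduling framework, CSMR. Existing approaches fall into two paradigms: converting vision to text before reasoning (static textualization loses detail) and reasoning in a unified vision-language space (language priors dominate and weaken visual faithfulness). Through experimental analysis the authors argue that the unified paradigm has structural flaws: the training objective places no constraint on visual faithfulness, and self-attention is biased toward text tokens, so the visual encoder is misled by language-dominated gradients. CSMR makes the language model the cognitive core: it maintains an explicit reasoning state, dynamically invokes an independent visual perception module to obtain evidence, and can invoke it repeatedly until the evidence is sufficient. Zero-shot experiments on several multimodal reasoning benchmarks show that CSMR consistently beats representative baselines in accuracy, with the gains coming from reasoning-state-driven visual queries and early termination.

Innovations:

Identifies a structural flaw in the unified multimodal reasoning paradigm, namely that visual representations are shaped by language priors, and confirms the attention bias toward text tokens through attention analysis.
Proposes the CSMR framework, in which the language model acts as the cognitive core and uses an explicit reasoning state to schedule an independent visual perception module, acquiring visual evidence on demand.
Supports repeated calls to the visual module during reasoning, verifying visual evidence step by step and avoiding the detail loss of one-shot conversion.
Structurally decouples perception from reasoning, which keeps visual evidence from being distorted by language priors and strengthens visual faithfulness.
Delivers consistent zero-shot gains on several benchmarks and validates reasoning-state-driven visual queries and early termination.

Methodology: The paper combines empirical analysis with framework design. It first uses attention-distribution statistics and an ablation (removing the image input) to confirm that language priors dominate in the unified paradigm. It then proposes CSMR: a cognitive reasoning core (CRC), implemented by an LLM, maintains the reasoning state and decides when to call the visual perception module (a VLM); the perception module independently extracts image evidence and returns it as text; the CRC judges from the current state whether the evidence is sufficient, calls the module again if not, and otherwise stops and produces the answer. Experiments are zero-shot evaluations on benchmarks such as ScienceQA, MMBench, and MMStar, with baselines including LLaVA, Qwen-VL, and DDCoT.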
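
The control flow described above (a reasoning core that keeps a state, queries a perception module when it needs evidence, and stops once the evidence is sufficient) can be sketched as a plain loop. Everything here is illustrative: `ReasoningState`, `run_scheduler`, the `max_calls` cap, and the stub callables standing in for the LLM and the VLM are assumptions, not CSMR's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ReasoningState:
    """Explicit state kept by the reasoning core."""
    question: str
    evidence: list[str] = field(default_factory=list)

def run_scheduler(
    question: str,
    next_query: Callable[[ReasoningState], Optional[str]],
    perceive: Callable[[str], str],
    answer: Callable[[ReasoningState], str],
    max_calls: int = 4,
) -> tuple[str, int]:
    """Call the perception module on demand until evidence is sufficient."""
    state = ReasoningState(question)
    calls = 0
    while calls < max_calls:
        query = next_query(state)
        if query is None:  # evidence judged sufficient: stop early
            break
        state.evidence.append(perceive(query))
        calls += 1
    return answer(state), calls

# Stubs standing in for the LLM core and the visual perception module.
facts = {"color?": "red", "shape?": "cube"}

def next_query(state: ReasoningState) -> Optional[str]:
    pending = [q for q in facts if facts[q] not in state.evidence]
    return pending[0] if pending else None

def perceive(query: str) -> str:
    return facts[query]

def answer(state: ReasoningState) -> str:
    return " ".join(state.evidence)

result, calls = run_scheduler("What is on the table?", next_query, perceive, answer)
print(result, calls)  # red cube 2
```

The loop ends after two perception calls even though four are allowed, which mirrors the early-termination behavior credited in the analysis.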

Key Results:

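On several multimodal reasoning benchmarks, including ScienceQA, MMBench, and MMStar, CSMR consistently beats representative baselines (such as LLaVA-1.6, Qwen-VL, and DDCoT) in zero-shot accuracy.
Attention analysis shows that in Qwen3-VL-8B and LLaVA-1.6-7B the pre-softmax attention scores of text tokens are on average about 2.5 times those of visual tokens, and total attention is also skewed toward text.
Models retain fairly high accuracy after the image input is removed, confirming a strong reliance on language priors.
Ablations show that reasoning-state-driven visual queries and early termination are the key sources of the gains.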

Tech Stack:

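LLM (large language model) as the cognitive reasoning core
VLM (vision-language model, such as Qwen-VL or LLaVA) as the visual perception module
Self-attention (scaled dot-product attention)
Zero-shot evaluation setting
Benchmark datasets such as ScienceQA, MMBench, and MMStar
Pre-softmax attention score statistics

Strengths:

Clear problem definition and sound motivation, starting from structural flaws in existing paradigms.
Detailed attention analysis and ablations provide solid empirical support.
A simple, effective framework that decouples perception from reasoning and avoids interference from language priors.
Consistent gains in the zero-shot setting indicate good generalization.
The code is open source, which supports reproduction and follow-up research.

Limitations:

Relies on an external VLM as the perception module, which may add inference latency and compute cost.
Visual evidence is returned as text, so some information compression remains (though repeated calls can mitigate it).
Experiments are zero-shot only; fine-tuning and post-training settings are not explored.
The design of the reasoning-state representation (such as state-update rules) is not analyzed in depth.
The direct connection to keywords such as reinforcement learning and world models is weak; the paper does not cover these directions.

Relevance To Keywords: The paper is highly relevant to "原生多模态大模型" (native multimodal large models) because its core contribution improves visual evidence acquisition in multimodal reasoning. It relates to "表征学习" (representation learning) because it analyzes how language priors affect visual representations in the unified paradigm and proposes a decoupled alternative. It has some connection to "世界模型" (world models), since the reasoning state can be seen as an abstraction of the current task world, but the paper does not explicitly build a world model. Its connection to "强化学习" (reinforcement learning) and "后训练" (post-training) is weak: no RL or post-training technique is used, and the focus is on inference-time scheduling. It relates to "多模态大模型的理解和生成一体化" (unified understanding and generation in multimodal large models), but the emphasis is on understanding (reasoning) rather than generation.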

12. VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer [PASS]

Score: 52.0 / 26.5

Authors: Rui Lin, Chuanming Wang, Huadong Ma

Published: 2026-05-27

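TL;DR: VidPrism proposes a heterogeneous Mixture-of-Experts framework for image-to-video transfer learning; by specializing spatial and temporal experts, it achieves state-of-the-art performance on video recognition benchmarks.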

摘要翻译

随着预训练技术的快速发展，将大规模视觉 - 语言模型（VLMs）应用于视频理解（即图像到视频迁移学习）已成为一种主导范式。为取得优越性能，近期进展中采用混合专家模型（MoE）以增强 VLMs 的时序建模能力成为一种有效策略。然而，传统的 MoE 设计存在专家同质化问题，即所有专家均充当功能相同的通才，无法高效地从未区分的视频流中学习时空特征。为了解决这一问题，我们提出了 VidPrism，一种新颖的异构时序混合专家框架。VidPrism 通过部署功能专用专家实现了分工，每个专家承担的角色涵盖从空间理解到时序建模的范围。为了向这些专家提供适当的输入，我们引入了一种内容感知的多速率采样模块，该模块动态生成从语义丰富到运动聚焦的表示流，从而为专家提供专用输入。此外，一种动态双向融合机制使这些路径之间能够进行协同信息交换，进而生成综合的视频表示。在各种视频识别基准上的广泛实验表明，VidPrism 实现了最先进性能，并有效地促进了专家专业化。我们的源代码可在 https://github.com/Lrrrr549/VidPrism.git 处获取。

Abstract

With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. To achieve superior performance, employing Mixture-of-Experts (MoE) to enhance VLMs' temporal modeling capabilities has emerged as an effective strategy among recent advances. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.

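Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	6.0/10	12.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	9.0/10	18.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper builds on vision-language models (VLMs) and a Mixture-of-Experts (MoE) architecture for video understanding, which is highly relevant to MLLM and multimodal techniques. It unifies spatial and temporal modeling through heterogeneous experts, which gives some connection to Unify Models. It involves representation learning but does not match the generative or dynamics-modeling character of World Models. It does not involve reinforcement learning, so model-based RL scores 0. The authors are Rui Lin, Chuanming Wang, and Huadong Ma; none is on the designated expert list, so no bonus is added.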

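Keywords

Image-to-Video Transfer, Mixture-of-Experts, Vision-Language Models, Video Recognition, Heterogeneous Experts, Spatio-temporal Modeling, Content-aware Sampling

In-Depth Analysis

Chinese Title: VidPrism: 异构专家混合模型用于图像到视频迁移

Summary: The paper proposes VidPrism, a heterogeneous temporal Mixture-of-Experts (MoE) framework for transferring pre-trained vision-language models (VLMs) from the image domain to video understanding. To address expert homogenization in conventional MoE, which makes spatio-temporal feature learning inefficient, VidPrism deploys functionally specialized experts (for example, spatial understanding and temporal modeling) to divide the work. Specifically, a content-aware multi-rate sampling module dynamically generates a semantically rich slow stream and a motion-dense fast stream, giving each expert specialized input, and a dynamic bidirectional fusion mechanism promotes information exchange between the spatial and temporal pathways. Experiments on several video recognition benchmarks show that VidPrism reaches state-of-the-art performance and effectively fosters expert specialization.

Innovations:

First heterogeneous MoE architecture for image-to-video transfer, with expert networks explicitly specialized and tied to different temporal scales.
A content-aware multi-rate input generation module that dynamically produces a semantically rich slow stream and a motion-dense fast stream.
A dynamic bidirectional fusion mechanism that enables information exchange between the spatial and temporal expert pathways.

Methodology: A visual encoder first extracts features for the video frames. Several rate-guided spatio-temporal aggregation modules (RgSTA) then sample and fuse frame features at different rates while keeping key information. Next, a dynamic bidirectional interaction module (DBI) selectively exchanges information between the parallel pathways. Finally, a heterogeneous MoE (HMoE) models the enhanced features of each pathway and combines them into a global video representation for classification.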
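
To make the multi-rate idea concrete, the sketch below samples frame indices at several strides and fuses per-pathway features with a temperature-scaled softmax. It deliberately swaps in uniform-stride sampling for the paper's content-aware module, and the function names, the strides, and the use of a softmax for fusion are all assumptions for illustration.

```python
import numpy as np

def multi_rate_indices(num_frames: int, rates: tuple) -> dict:
    """Frame indices per stride: a small stride gives a dense 'fast' stream,
    a large stride gives a sparse 'slow' stream."""
    return {r: np.arange(0, num_frames, r) for r in rates}

def softmax(scores: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax; a lower tau gives sharper weights."""
    z = (scores - scores.max()) / tau
    e = np.exp(z)
    return e / e.sum()

def fuse_paths(path_feats: np.ndarray, scores: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Weighted sum of per-pathway features: (paths, dim) -> (dim,)."""
    return softmax(scores, tau) @ path_feats

streams = multi_rate_indices(16, rates=(1, 4, 8))
print(streams[4].tolist())  # [0, 4, 8, 12]
print(streams[8].tolist())  # [0, 8]

# Two hypothetical pathway features with equal scores: fusion is their mean.
path_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
fused = fuse_paths(path_feats, scores=np.array([0.3, 0.3]))
print(fused.tolist())  # [0.5, 0.5]
```

Lowering `tau` pushes the fusion toward the highest-scoring pathway, while raising it moves the weights toward a uniform average.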

Key Results:

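Reaches state-of-the-art performance on several video recognition benchmarks (such as UCF-101).
Visualizations show clear attention peaks at key action moments, whereas a conventional homogeneous MoE baseline shows a flat distribution.
Effectively fosters expert specialization: different experts focus on spatial semantics and on motion dynamics, respectively.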

Tech Stack:

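LayerNorm
Linear layers (MetricProj, ScoreHead)
Softmax function
Cosine similarity
Temperature coefficient τ
Scaling factor δ
L2 norm
Weighted fusion (hyperparameter α)

Strengths:

Addresses expert homogenization in MoE in a novel way and achieves functional specialization.
Content-aware multi-rate input generation gives each expert the most relevant input.
Dynamic bidirectional fusion promotes cooperation between pathways and improves the overall video representation.
Reaches SOTA on several benchmarks, supporting the method's effectiveness.

Limitations:

The heterogeneous MoE and bidirectional fusion may increase computational complexity and parameter count.
Hyperparameters such as α, τ, and δ need tuning, which may affect generalization.
The paper does not discuss performance on long videos or complex scenes in detail.

Relevance To Keywords:

Unify Models: VidPrism unifies the spatial and temporal modeling pathways through a heterogeneous MoE.
World Models: Video understanding is an important part of building world models, and VidPrism improves video representations.
Representation Learning: Multi-rate sampling and bidirectional fusion learn a more complete video representation.
Model-Based RL: Video understanding can provide environment perception for model-based reinforcement learning.
原生多模态大模型 (native multimodal large models): The method transfers from a VLM and is an application of multimodal large models to video.
多模态大模型的理解和生成一体化 (unified understanding and generation in multimodal large models): The paper focuses on understanding, though the framework could be extended to generation.
后训练 (post-training): Image-to-video transfer is a post-training stage that fine-tunes a pre-trained VLM.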

13. When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness? [PASS]

Score: 52.0 / 26.5

Authors: Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

Published: 2026-05-27

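TL;DR: The paper studies how different inference paradigms affect multimodal jailbreak robustness and finds that explicit image-tool interaction significantly lowers attack success rates by shifting representations toward a safety-relevant direction.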

摘要翻译

基于图像的推理 (Think-with-image reasoning) 正在成为大型视觉语言模型 (VLMs) 的一种新兴推理范式，但其安全性影响仍知之甚少。现有的系统已涵盖多种处理范式，包括直接响应生成、纯文本前置轮次、视觉状态操控以及显式外部图像工具调用。本文探究了这些被评估的范式中哪一种能提高多模态越狱鲁棒性，以及其背后的原因。在多个视觉语言模型中，显式图像工具交互在我们的实验中产生了最低的攻击成功率 (ASR)，相对于被评估模型的平均值，其越狱成功率降低了约 30%。这一发现起初令人惊讶：即使返回的图像工具输出被手动覆盖或本身看似不安全，ASR 仍然保持较低；然而，在纯文本前置轮次控制条件下，ASR 会恢复到接近直接回答的水平。这些结果表明，较低的 ASR 无法仅由良性返回图像的语义或文本图像工具轨迹来解释。为了解释这一现象，我们引入了一个图像工具安全向量框架 (Image-Tool Safety Vector Framework)，该框架将图像工具调用建模为隐藏表示向安全性相关方向的残差偏移。表示层面的分析及激活干预支持了这一解释。总体而言，我们的结果表明，显式图像工具交互是提高越狱鲁棒性的有前景的设计模式，同时也强调了针对特定推理流程进行安全评估的必要性。

Abstract

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	4.0/10	8.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	10.0/10	20.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	1.0/10	2.0

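Scoring rationale: The paper centers on MLLM safety and inference paradigms, so MLLM and MultiModal are highly relevant. Comparing paradigms touches on unification, which gives moderate relevance to Unify Models, but the paper involves neither world models nor model-based RL, so those score low. No designated experts appear in the author list. The weighted total is 52.0, above the dynamic pass threshold of 26.5.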

Keywords

Multimodal Jailbreak Robustness, Think-with-image Reasoning, Image-tool Interaction, Vision-Language Models, Representation Shift, Safety Evaluation, Inference Paradigms

In-Depth Analysis

Chinese Title: 当“以图思考”遇上安全：什么决定了多模态越狱鲁棒性？

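Summary: This paper studies the safety of large vision-language models (LVLMs) under the think-with-image inference paradigm. The authors compare several inference paradigms (direct response, text-only prior turn, simulated image generation/editing, visual-state variants, and explicit image-tool interaction) and find that explicit image-tool interaction yields the lowest jailbreak attack success rate (ASR), a relative reduction of about 30% on average. The effect persists even when the returned image is manually overridden or itself looks unsafe, but it disappears under the text-only prior-turn control. To explain this, the authors propose an image-tool safety vector framework that models image-tool invocation as a residual shift of hidden states, which makes safety behavior more linearly readable and steerable. Representation-level analyses and activation interventions support this mechanism. The results suggest that explicit image-tool interaction is a promising design pattern for jailbreak robustness and underline the need for pipeline-specific safety evaluation.

Innovations:

First systematic comparison of how several think-with-image inference paradigms (direct response, text-only prior turn, simulated image generation/editing, visual-state variants, explicit image-tool interaction) affect multimodal jailbreak robustness.
Finds that explicit image-tool interaction substantially lowers jailbreak ASR (about 30% relative reduction on average) and that the effect does not depend on the semantic content of the returned image.
Proposes the image-tool safety vector framework, which explains at the representation level how image-tool invocation improves safety through a hidden-state shift, with mechanistic evidence from activation interventions.
Shows that structural changes in the inference process, not only the input content, strongly affect safety behavior, offering a new perspective for multimodal safety evaluation.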

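Methodology: A controlled comparison: the safety queries (from MM-SafetyBench) are held fixed, and ASR is computed under each inference paradigm. CoMT tasks serve as benign prefixes, and explicit image-tool interaction is implemented with Visual Sketchpad (VSP). The evaluated model families include Qwen3-VL, Gemma-3, Llama-4, and Pixtral. Representation-level analysis (hidden-state extraction) and activation interventions (such as activation patching) are used to test the safety vector framework. An LLM-as-judge evaluator decides whether each final answer is unsafe.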
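
The "residual shift toward a safety-relevant direction" idea can be illustrated with a toy difference-of-means sketch. This is not the authors' implementation: the hidden states below are made-up 2-D vectors, the direction is estimated as a simple difference of group means (an assumption), and all function names are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def safety_direction(tool_states, direct_states):
    """Unit vector from the mean direct-answer state to the mean image-tool state."""
    diff = [t - d for t, d in zip(mean_vector(tool_states), mean_vector(direct_states))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [x / norm for x in diff]

def project(state, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(s * d for s, d in zip(state, direction))

def steer(state, direction, alpha):
    """Activation intervention: add alpha times the direction to the state."""
    return [s + alpha * d for s, d in zip(state, direction)]

# Toy hidden states (illustrative values only).
direct_states = [[0.0, 1.0], [0.0, 3.0]]
tool_states = [[2.0, 1.0], [2.0, 3.0]]

direction = safety_direction(tool_states, direct_states)
shifted = steer([0.0, 2.0], direction, alpha=2.0)
print(direction, project(shifted, direction))
```

In this toy setup the two groups differ only along the first coordinate, so the estimated direction isolates that coordinate and steering moves a direct-answer state onto the image-tool mean.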

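Key Results:

Explicit image-tool interaction achieves the lowest ASR on all evaluated models, a relative reduction of about 30% on average.
ASR stays low even when the returned image is manually overridden with unsafe content, but it climbs back to near the direct-response level under the text-only prior-turn control.
Representation-level analysis shows that image-tool invocation shifts hidden states toward a safety-relevant direction, which increases the linear separability and steerability of safety behavior.
The other inference paradigms (such as simulated image generation/editing and visual-state variants) yield ASR values between those of direct response and explicit image-tool interaction.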
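
Because the headline number is a relative reduction, a small worked example may help separate it from an absolute drop. The judgements below are invented for illustration and are not results from the paper; the function names are hypothetical.

```python
def attack_success_rate(judgements):
    """Fraction of responses the judge marked unsafe (True means unsafe)."""
    if not judgements:
        raise ValueError("no judgements given")
    return sum(judgements) / len(judgements)

def relative_reduction(baseline_asr, new_asr):
    """Reduction expressed as a fraction of the baseline ASR."""
    return (baseline_asr - new_asr) / baseline_asr

# Illustrative judge outputs for 20 fixed queries under two paradigms.
direct = [True] * 10 + [False] * 10   # direct response
tool = [True] * 7 + [False] * 13      # explicit image-tool interaction

asr_direct = attack_success_rate(direct)
asr_tool = attack_success_rate(tool)
print(asr_direct, asr_tool, relative_reduction(asr_direct, asr_tool))
```

Here the ASR falls from 0.5 to 0.35, which is an absolute drop of 15 percentage points but a relative reduction of about 30%.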

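Tech Stack:

MM-SafetyBench (multimodal safety benchmark)
Visual Sketchpad (VSP) (image tool)
CoMT (Chain of Multi-modal Thought) tasks
LLM-as-judge evaluator (automatic judging by a large language model)
LVLMs including Qwen3-VL, Gemma-3, Llama-4, and Pixtral
Stable Diffusion (used to generate simulated images)
Activation intervention techniques
Hidden-state extraction and representation analysis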

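Strengths:

Novel perspective: the inference paradigm itself is treated as the experimental variable, rather than only the input content.
Rigorous design: several control experiments (text-only prior turn, image-content override) rule out confounders.
Offers an interpretable mechanistic framework (the image-tool safety vector) with empirical support from representation analysis and interventions.
Covers several mainstream LVLMs, so the conclusions generalize to some degree.

Limitations:

Experiments use only a subset of MM-SafetyBench (202 questions), so the sample size is limited.
Only one type of benign prefix (CoMT tasks) is evaluated; other prefix types may behave differently.
The specific image-tool implementation (Visual Sketchpad) may introduce its own bias; other tool implementations need further validation.
The paper does not examine in depth how image-tool invocation interacts with the model's internal safety mechanisms (such as refusal training).

Relevance To Keywords:

Native multimodal large models: the paper directly studies safety-related inference paradigms of LVLMs, which is closely tied to native multimodal models.
World models: think-with-image reasoning involves intermediate visual states and tool calls, which is conceptually related to internal simulation and reasoning in world models.
Representation learning: the image-tool safety vector framework analyzes safety behavior at the level of hidden representations.
Reinforcement learning: the paper does not use RL directly, but post-training safety alignment (for example RLHF) is related to the jailbreak robustness it discusses.
Post-training: the mechanistic analysis can inform safety optimization strategies during post-training.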

14. EchoAvatar: Real-time Generative Avatar Animation from Audio Streams (PASS)

Score: 52.0 / 26.5

Authors: Bohong Chen, Yumeng Li, Yinglin Xu, Youyi Zheng, Yanlin Weng, Kun Zhou

Published: 2026-05-27

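TL;DR: EchoAvatar proposes a unified streaming framework that generates 3D avatar animation from audio streams in real time, uses reinforcement learning to refine quality, and exposes semantic control through LLM tool calls, achieving high-fidelity synchronization.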

Abstract (Chinese Translation)

从音频实时合成高保真 3D 角色动作是下一代交互式化身和虚拟助手的核心组件。然而，大多数现有方法仅限于对完整音频序列的离线处理，或受限于特定领域，往往难以有效同时处理语音和音乐。本文提出了一种新颖的框架，旨在从流式语音和音乐中生成连续、连贯的全身体动作，并实现低延迟。我们的方法核心在于一种统一的流式架构，能够从增量式音频输入中合成连续动作。我们采用了一种稳健的训练策略，强制模型具备强音频依赖性，使其能够无缝泛化于对话语音和节奏音乐之间，而无需显式的领域标签或模式切换。此外，我们还探索了使用强化学习（Reinforcement Learning）来优化在线生成的质量。进而，我们通过工具调用接口将反应式动画与意图驱动的行为相连接，允许上游大型语言模型（Large Language Models）注入显式的语义控制。通过将这种可控性与流式音频驱动的合成相结合，我们的框架提供了一种即插即用的解决方案，用于将语音代理转换为交互式类人化身。广泛的实验表明，我们的方法在运动质量和同步性方面优于最先进的实时基线，同时保持了实时部署所需的灵活性。我们的代码、预训练模型及视频可在 https://robinwitch.github.io/EchoAvatar-Page 上获取。

Abstract

Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explored Reinforcement Learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with stream audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art realtime baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/EchoAvatar-Page.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	7.0/10	14.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	5.0/10	10.0

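Scoring rationale: The paper proposes a unified streaming architecture for speech and music, so Unify Models is highly relevant. The audio-to-motion mapping makes MultiModal highly relevant. RL is used to refine generation, giving moderate relevance to model-based RL, and LLM integration for control gives moderate relevance to MLLM. No world model is built, so World Models scores low. No designated experts appear in the author list.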

Keywords

Real-time Avatar Animation, Audio-driven Motion, Unified Streaming Architecture, Reinforcement Learning, Tool-call Interface, Multimodal Synthesis, 3D Character Motion

In-Depth Analysis

Chinese Title: EchoAvatar：基于音频流的实时生成式虚拟形象动画

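Summary: This paper presents EchoAvatar, a unified framework for generating high-fidelity 3D character animation from streaming audio in real time. Most existing methods process complete audio sequences offline or are limited to a single domain, either speech or music. EchoAvatar uses a causal-attention motion tokenizer that discretizes continuous motion into tokens and avoids convolution artifacts. A hierarchical token corruption training strategy strengthens the dependence on audio, so the model learns conversational gestures and rhythmic dance jointly without domain labels. In addition, reinforcement learning alignment (GRPO/DPO) improves online generation quality, and a tool-call interface lets an upstream large language model inject explicit semantic control. Experiments show that the method outperforms existing real-time baselines in motion quality and synchronization, supports streaming input and output, and can be integrated into voice agent systems as a plug-and-play module.

Innovations:

Proposes a unified streaming architecture that handles speech and music together, without domain switching or explicit labels.
Designs a causal-attention motion tokenizer that overcomes the limited expressiveness and reconstruction artifacts of convolutional approaches.
Adopts a hierarchical token corruption training strategy that strengthens audio conditioning and enables synergistic cross-domain learning.
Explores reinforcement learning (GRPO/DPO) to align online generation and improve perceptual quality.
Introduces a tool-call interface that bridges reactive, audio-driven animation and intent-driven explicit semantic control.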

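Methodology: The paper follows a three-stage pipeline. First, it builds a causal-attention motion tokenizer that restricts the receptive field with a causal mask and discretizes continuous motion parameters (root velocity, height, 6D rotations) into tokens. Second, it uses a pretrained LLM as the generator and aligns the audio and motion modalities through a three-stage curriculum (including hierarchical token corruption) for unified learning. Finally, it applies reinforcement learning (GRPO and DPO) as post-training alignment of the generation policy to improve perceptual quality in zero-shot streaming scenarios. The system accepts streaming audio and outputs a continuous motion sequence in real time.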
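
The causal mask mentioned above can be sketched in a few lines. This is a generic single-head, pure-Python illustration of causal scaled dot-product attention, not EchoAvatar's tokenizer; the tiny inputs are made up. The point it shows is that the output at step t never depends on frames after t, which is what makes streaming possible.

```python
import math

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where step t attends only to steps 0..t."""
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        # Scores are computed only for past and current positions (the causal mask).
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        top = max(scores)
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
out_a = causal_attention(q, k, v)

# Changing only the last (future) value must leave earlier outputs untouched.
out_b = causal_attention(q, k, [[1.0], [2.0], [99.0]])
print(out_a[:2] == out_b[:2])
```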

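Key Results:

Outperforms existing real-time baselines in motion quality and synchronization.
Joint learning of speech and music is synergistic and improves generation fidelity in each domain.
Reinforcement learning (GRPO/DPO) markedly improves perceptual quality, especially in zero-shot streaming scenarios.
The system works as a plug-and-play module that can be integrated into voice agents and LLM-driven interactive systems.

Tech Stack:

Causal attention with a mask
Vector quantization (VQ)
Pretrained large language model (LLM)
Hierarchical token corruption
Group Relative Policy Optimization (GRPO)
Direct Preference Optimization (DPO)
6D rotation representation
Root velocity and height parameterization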
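
For readers unfamiliar with the 6D rotation representation, the usual way to recover a rotation matrix from six numbers is a Gram-Schmidt step. The sketch below follows that common construction; whether EchoAvatar stores the basis vectors as rows or columns is not stated in the report, so the column convention here is an assumption.

```python
import math

def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]

def rotation_from_6d(r6):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt.

    The first three numbers give one axis, the last three a second axis
    that is made orthogonal to the first; the third axis is their cross product.
    """
    a1, a2 = r6[:3], r6[3:]
    b1 = normalize(a1)
    overlap = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - overlap * x for x, y in zip(b1, a2)])
    b3 = cross(b1, b2)
    # Assumed convention: b1, b2, b3 are the columns of the matrix.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

# A non-orthonormal input that still decodes to the identity rotation.
matrix = rotation_from_6d([2.0, 0.0, 0.0, 1.0, 3.0, 0.0])
print(matrix)
```

The representation is popular for learned motion models because any six real numbers with linearly independent halves decode to a valid rotation, with no discontinuities of the kind Euler angles or quaternions introduce.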

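Strengths:

Achieves genuinely streaming, low-latency generation suited to real-time interaction.
Handles speech and music in one model without switching, with strong generalization.
The causal-attention tokenizer beats convolutional approaches in expressiveness and reconstruction quality.
Reinforcement learning alignment effectively improves the naturalness of online generation and its match to user preference.
The tool-call interface makes the system controllable and easy to integrate with the LLM ecosystem.

Limitations:

Reliance on audio features may limit robustness in noisy environments.
The reinforcement learning stage is complex and needs substantial preference data or a reward model.
Adaptation to multiple languages, accents, or atypical audio input is not explicitly discussed.
The generated animation has no physical constraints, so interpenetration or kinematically implausible motion may occur.

Relevance To Keywords:

Unify Models: the paper proposes a unified framework for speech and music, in line with the idea of model unification.
World Models: generating motion with an LLM can be viewed as modeling how an audio-driven world state evolves.
Representation Learning: the causal-attention motion tokenizer learns a discrete motion representation.
Model-Based RL: reinforcement learning (GRPO/DPO) is used for policy optimization as post-training alignment.
Native multimodal large models: the paper uses a pretrained LLM that is not natively multimodal and needs extra alignment of audio and motion.
Unified understanding and generation in multimodal large models: the system both interprets audio and generates motion, but it is not an end-to-end unified model.
Representation learning: the motion tokenizer learns an efficient motion representation.
World models: the generator can be viewed as an audio-to-motion world model.
Reinforcement learning: GRPO and DPO are used explicitly for post-training.
Post-training: the reinforcement learning alignment belongs to the post-training stage.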

15. Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation (PASS)

Score: 52.0 / 26.5

Authors: Runlong Cao, Ying Zang, Chuanwei Zhou, Tianrun Chen, Tong Zhang, Zhen Cui, Chunyan Xu

Published: 2026-05-27

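TL;DR: The paper proposes L2L, a reinforced self-evolving framework that extracts priors with a multimodal large language model and optimizes pseudo-labels through reinforcement learning, improving label reliability and segmentation performance in semi-supervised referring expression segmentation.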

Abstract (Chinese Translation)

半监督指称表达分割（SS-RES）旨在在有限标注下实现精确的像素级语言定位，但在利用未标注的图像 - 文本对时，却面临监督不足和伪标签不可靠的问题。本文提出学习标注（Learning to Label, L2L），这是一种强化自我演化框架，将伪标签构建视为一个可学习的决策过程。为了建立基础认知，我们利用多模态大语言模型（MLLM）提取语义 - 空间先验，这些先验被实例化为初始软分割提议，并与文本线索一同提升为可学习指导信号，从而引导一个层次化分割网络。为确保学习稳定，强化伪标签选择被构建为一个探索性决策过程，该过程基于多模态先验和模型预测，自适应地奖励高实用性的像素级监督。这一强化自我演化循环实现了分割模型与伪标签的联合优化，在稀疏监督下逐步增强标签的可靠性。在 RefCOCO、RefCOCO+ 和 RefCOCOg 上的大量实验表明，该方法优于现有方法，验证了其有效性和泛化能力。

Abstract

Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	9.0/10	18.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	5.0/10	10.0

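Scoring rationale: The paper's core is semi-supervised referring expression segmentation, using an MLLM to extract priors and reinforcement learning to optimize pseudo-labels. MLLM and MultiModal fit well (an image-text task built on a large model). Model-based RL is partially relevant (a reinforced decision process). Unify Models is weakly relevant (not a unified-architecture model). World Models is irrelevant. No designated experts appear in the author list, so no bonus is applied.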

Keywords

Semi-supervised referring expression segmentation, Reinforced self-evolving framework, Multimodal large language model, Pseudo-label selection, Hierarchical segmentation network, Semantic-spatial priors, Decision-making process

In-Depth Analysis

Chinese Title: 学习标注：一种用于半监督指代表达分割的强化自进化框架

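Summary: To address sparse annotation and unreliable pseudo-labels in semi-supervised referring expression segmentation (SS-RES), the paper proposes L2L, a reinforced self-evolving framework that turns pseudo-label construction into a learnable decision process. First, a frozen multimodal large language model (MLLM) extracts semantic-spatial priors, and SAM2 turns them into soft segmentation proposals. Next, a self-evolving segmentation module (SESM) uses the multimodal priors and textual cues as learnable guidance signals that condition a hierarchical segmentation network. Finally, reinforced pseudo-label exploration (RPLE) models pseudo-label selection as a reinforcement learning decision process that adaptively rewards high-utility pixel-level supervision based on the multimodal priors and model predictions. With weak-to-strong consistency training, the segmentation model and the pseudo-labels co-evolve in a closed loop, and label reliability improves step by step. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that the method clearly outperforms existing methods at every labeling ratio and, with only 0.1% labeled data, is competitive with zero-shot methods.

Innovations:

Proposes the L2L framework, which upgrades pseudo-label construction from fixed-threshold truncation to a learnable decision process and lets the segmentation model and pseudo-labels co-evolve in a closed loop.
Introduces MLLM-driven semantic-spatial priors: a frozen MLLM produces coarse localization cues that prompt SAM2 to yield soft segmentation priors, preserving uncertainty information.
Designs reinforced pseudo-label exploration (RPLE), which models pseudo-label selection as a reinforcement learning decision process, selects high-utility pixel regions adaptively from the sample state, and balances precision against coverage dynamically.
Proposes the self-evolving segmentation module (SESM), which injects multimodal priors into the segmentation network through hierarchical, stage-adaptive gating and gradually refines coarse priors into fine predictions.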

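Methodology: The method follows the weak-to-strong consistency training paradigm. For each unlabeled sample, a weakly augmented view and a strongly augmented view are generated and fed to the segmentation network to obtain predictions. A frozen MLLM produces semantic-spatial priors, and SAM2 turns them into soft segmentation proposals. An uncertainty-aware calibration module (SPM) then fuses the priors with the weak-view prediction to obtain a more reliable fused prior. Next, the self-evolving segmentation module (SESM) uses hierarchical gating to feed the fused prior and text features as conditioning signals, guiding the segmentation network to refine its prediction progressively. Finally, reinforced pseudo-label exploration (RPLE) takes quantities such as the sample's confidence and prior consistency as the state, decides through reinforcement learning whether to accept the pseudo-label of each pixel, and trains with a consistency loss. The whole framework is optimized jointly on a small labeled set and a large unlabeled set.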
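
To see what the learned selection replaces, it helps to look at the fixed-threshold baseline the report contrasts it with. The sketch below is that baseline only, not RPLE; the per-pixel probabilities, ground truth, and function names are invented for illustration.

```python
def threshold_pseudo_labels(probs, tau):
    """Fixed-threshold selection: keep a pixel only if its confidence reaches tau."""
    labels = [1 if p >= 0.5 else 0 for p in probs]
    mask = [max(p, 1.0 - p) >= tau for p in probs]
    return labels, mask

def precision_and_coverage(labels, mask, truth):
    """Precision of the kept pseudo-labels and the fraction of pixels kept."""
    kept = [(label, t) for label, keep, t in zip(labels, mask, truth) if keep]
    coverage = len(kept) / len(labels)
    precision = sum(label == t for label, t in kept) / len(kept) if kept else 0.0
    return precision, coverage

# Toy foreground probabilities for six pixels and their true labels.
probs = [0.95, 0.90, 0.60, 0.55, 0.10, 0.05]
truth = [1, 1, 0, 1, 0, 0]

strict = precision_and_coverage(*threshold_pseudo_labels(probs, 0.8), truth)
loose = precision_and_coverage(*threshold_pseudo_labels(probs, 0.5), truth)
print(strict, loose)
```

A strict threshold keeps only clean labels but discards a third of the pixels, while a loose one keeps everything and admits an error. A single global threshold cannot resolve this trade-off per sample, which is the motivation the report gives for treating selection as a decision process.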

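Key Results:

On the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, L2L achieves the best performance at every labeling ratio (such as 0.1%, 1%, and 5%), clearly ahead of existing semi-supervised methods.
With only 0.1% labeled data, L2L remains competitive with zero-shot methods (such as GroundingDINO+SAM), which demonstrates robustness under extremely sparse annotation.
Visualization (Figure 1) reveals a systematic confidence mismatch between MLLM priors and model predictions, which demonstrates the fragility of fixed-threshold pseudo-labels and the effectiveness of L2L's dynamic selection.

Tech Stack:

Multimodal large language model (MLLM): extracts semantic-spatial priors
SAM2 (Segment Anything Model 2): generates soft segmentation proposals
Reinforcement learning: drives pseudo-label selection decisions
Weak-to-strong consistency training
Hierarchical segmentation network
Gating mechanism
Uncertainty-aware calibration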

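Strengths:

Models pseudo-label selection as a reinforcement learning decision process, which overcomes the fragility of fixed thresholds and gives a sample-adaptive balance.
Effectively fuses external priors from the MLLM and SAM2 with the model's own predictions, improving pseudo-label quality, especially in ambiguous and occluded scenes.
The self-evolving closed loop lets the segmentation model and the pseudo-labels reinforce each other, so learning stays stable under sparse annotation.
Comprehensive experiments show clear gains on several datasets and labeling ratios, and the code is open source.

Limitations:

Dependence on external MLLM and SAM2 models adds compute overhead and inference latency.
Reinforcement learning training may be unstable and needs careful tuning.
Validation is limited to three public datasets, without long-tail or domain-shift tests, so generalization needs further study.
The impact of erroneous MLLM priors (such as hallucinations) on the framework, and ways to handle them, are not analyzed in depth.

Relevance To Keywords:

Native multimodal large models: the paper uses a frozen MLLM as a prior extractor and does not train a native multimodal model, but it shows how such models can be applied to downstream tasks.
Unified understanding and generation in multimodal large models: the paper mainly uses the MLLM's understanding ability (semantic-spatial localization) and does not address unified generation.
Representation learning: the segmentation network learns a joint vision-language representation through the self-evolving module.
World models: SAM2 can be seen as part of a visual world model, and the paper uses it to produce structured region priors; the link is weak.
Reinforcement learning: the RPLE module models pseudo-label selection as a reinforcement learning decision process, which is directly relevant.
Post-training: the whole self-evolving framework can be seen as a post-training strategy that fine-tunes the segmentation model on a small labeled set.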

16. VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking (PASS)

Score: 52.0 / 26.5

Authors: Jiyuan Fu, Kaixun Jiang, Jingkai Jia, Zhaoyu Chen, Xueyao Chen, Lingyi Hong, Shuyong Gao, Chenzhi Tan, Dingkang Yang, Wenqiang Zhang

Published: 2026-05-27

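TL;DR: The paper proposes VLA-Hijack, a unified adversarial framework that hijacks visual proprioception to overcome the cross-architecture transfer bottleneck of patch attacks, markedly improving attack efficiency against vision-language-action models.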

Abstract (Chinese Translation)

尽管视觉 - 语言 - 动作 (VLA) 模型已成为强大的通用策略，但它们对抗性补丁的显著脆弱性严重阻碍了其在安全关键领域中的部署。此外，现有的补丁攻击主要聚焦于白盒环境，严重过拟合于目标模型的具体动作输出空间，从而导致较差的跨架构迁移性。为克服这一限制，我们提出 VLA-Hijack，这是一种统一的对抗框架，通过利用本工作中识别出的一个根本性脆弱性来打破迁移性瓶颈：在规划任何运动之前，VLA 模型必须首先利用视觉信息在环境中定位其自身的机械臂。针对这一共享的视觉自定位过程，我们的方法同时优化注意力引导的本体感觉抑制（Attention-Guided Proprioceptive Suppression）以抑制真实机械臂的特征，以及多模态本体感觉注入（Multimodal Proprioceptive Injection）以将补丁建立为代理“幻影化身”。通过在语义概念锚定与视觉原型投影之间交替，VLA-Hijack 有效地切断了代理的真实化身与其控制策略之间的语义关系。在不同架构（OpenVLA、UniVLA 和 CronusVLA）上的广泛实验表明，VLA-Hijack 在白盒环境中实现了卓越的优化效率，并在跨架构和跨域黑盒迁移性方面达到了新的 SOTA。

Abstract

While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm's features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment". By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent's true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	7.0/10	14.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	3.0/10	6.0

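Scoring rationale: The paper focuses on the safety of VLA models and adversarial attacks, which overlaps partly with the keywords. Unify Models is moderately relevant because the paper proposes a unified attack framework and targets the UniVLA model, although its core contribution is not a unified model architecture. World Models is weakly relevant because no environment dynamics are modeled. MLLM is fairly relevant because VLA models extend MLLM architectures. MultiModal is highly relevant because VLA models are inherently multimodal. Model-based RL is weakly relevant because the paper targets policy vulnerabilities, not model-based RL algorithms. The weighted total is 52.0, above the pass threshold of 26.5.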

Keywords

Vision-Language-Action Models, Adversarial Patch Attack, Visual Proprioception Hijacking, Cross-Architecture Transferability, Unified Adversarial Framework, Attention-Guided Proprioceptive Suppression, Multimodal Proprioceptive Injection

In-Depth Analysis

Chinese Title: VLA-Hijack: 通过视觉本体感知劫持对视觉-语言-动作模型的可迁移补丁攻击

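Summary: This paper proposes VLA-Hijack, an adversarial patch attack framework against vision-language-action (VLA) models. Existing attacks mainly target the action output space, which leads to poor cross-architecture transfer. The authors observe that a VLA model must visually locate its own robotic arm before planning a motion (visual proprioception), and that this shared mechanism is the key to better transfer. VLA-Hijack combines Attention-Guided Proprioceptive Suppression (which suppresses the features of the real arm) with Multimodal Proprioceptive Injection (which establishes the patch as a phantom embodiment), alternating between semantic concept anchoring and visual prototype projection to sever the semantic link between the true embodiment and the control policy. Experiments on 12 victim models across OpenVLA, UniVLA, and CronusVLA show an average failure rate of 63.68% in the Real-to-Sim setting, 42.66 to 45.41 percentage points above existing SOTA baselines, with cross-architecture and cross-domain black-box transferability.

Innovations:

Reveals that VLA decision-making depends on visual proprioception and exploits this mechanism to improve attack transferability substantially.
Proposes the VLA-Hijack framework, in which Attention-Guided Proprioceptive Suppression and Multimodal Proprioceptive Injection work together to rewire the model's proprioceptive identity.
Designs a dynamic scheduling strategy that alternates between semantic concept anchoring and visual prototype projection to produce robust phantom-embodiment features.
Reaches a 63.68% average failure rate on 12 victim models, 42.66 to 45.41 percentage points above the baselines, setting a new SOTA for cross-architecture transferability.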

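Methodology: The adversarial patch is generated with a two-stage optimization loop. The first stage uses the Attention-Guided Proprioceptive Suppression loss, which backpropagates through the model to suppress the response of the real robotic arm in the attention maps and create a "feature vacuum". The second stage uses the Multimodal Proprioceptive Injection loss, which alternates between semantic concept anchoring (pulling the patch features toward the semantic concept of a robotic arm) and visual prototype projection (aligning the patch features with a visual prototype of the arm), so that the patch becomes a phantom embodiment. The overall objective L_total is a weighted combination of the two losses, and the patch pixels are updated iteratively by gradient descent.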
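
Two mechanical pieces of this pipeline can be sketched without any model: pasting a patch through a binary mask, and combining the two losses. The weight name, the strict step-parity alternation, and the pixel values below are all assumptions made for illustration; the report only says that the losses are weighted and that the two injection objectives alternate.

```python
def apply_patch(image, patch, mask):
    """Composite x_adv = (1 - m) * x + m * p per pixel, clipped to [0, 1]."""
    return [
        [min(1.0, max(0.0, (1 - m) * x + m * p)) for x, p, m in zip(row_x, row_p, row_m)]
        for row_x, row_p, row_m in zip(image, patch, mask)
    ]

def total_loss(suppression_loss, injection_loss, weight):
    """Weighted combination of the two terms; `weight` is a hypothetical coefficient."""
    return suppression_loss + weight * injection_loss

def injection_objective(step):
    """Hypothetical schedule: alternate the two injection objectives by step parity."""
    return "semantic_concept_anchoring" if step % 2 == 0 else "visual_prototype_projection"

# 2x2 grayscale toy image; the mask places the patch on the top-left pixel only.
image = [[0.2, 0.4], [0.6, 0.8]]
patch = [[1.0, 1.0], [1.0, 1.0]]
mask = [[1, 0], [0, 0]]

adversarial = apply_patch(image, patch, mask)
print(adversarial, total_loss(0.5, 0.25, weight=2.0), injection_objective(3))
```

In a real attack the gradient of the total loss would be taken with respect to the patch pixels only, so the mask also determines which pixels receive updates.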

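Key Results:

In the Real-to-Sim setting, VLA-Hijack reaches an average failure rate of 63.68% across 12 victim models.
Compared with existing SOTA baselines (UADA, UPA, and others), the absolute improvement is 42.66 to 45.41 percentage points.
Cross-architecture transfer: the attack is effective on OpenVLA, UniVLA, and CronusVLA, which use three different action-space designs.
Cross-domain transfer: the attack performs well on both simulated environments and real-scene images.
Visualization shows that VLA-Hijack suppresses attention on the real robotic arm and shifts attention entirely onto the patch.

Tech Stack:

VLA models: OpenVLA, UniVLA, CronusVLA
Adversarial patch attack
Attention maps
Semantic concept anchoring
Visual prototype projection
Gradient-based optimization
Binary mask
Real-to-Sim evaluation setting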

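Strengths:

Proposes a new attack paradigm that moves from the action space to the proprioception space, addressing the transfer bottleneck at its root.
Highly interpretable, since it builds on the visual proprioception logic inherent to VLA models.
Comprehensive experiments cover several mainstream VLA architectures and cross-domain scenarios, with strong results.
The attack is stable and the patch is generated efficiently, with strong performance in both white-box and black-box settings.

Limitations:

Targets only the visual proprioception channel and does not consider defenses that other sensing modalities (such as touch or depth) might provide.
The patch is visible and may be caught by detection-based defenses.
Experiments are mainly run in simulation (Real-to-Sim); robustness and generalization in the physical world remain to be verified.
The effect of defenses such as adversarial training is not discussed, so there is no attack-versus-defense analysis.

Relevance To Keywords:

Multimodal large models: the paper directly studies VLA (vision-language-action) models, an application of multimodal large models to robotics.
Representation learning: the attack manipulates visual proprioceptive representations and involves feature learning and semantic alignment.
World models: a VLA model must understand its environment (a world model) to plan actions, and the attack exploits that understanding process.
Reinforcement learning: action generation in a VLA model can be viewed as a policy network, and the attack disrupts policy decisions.
Post-training: the paper does not address post-training, but the attack applies to already-trained models and is an inference-time attack.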

17. Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers (PASS)

Score: 50.0 / 26.5

Authors: Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

Published: 2026-05-27

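TL;DR: The paper evaluates concept-based explainability of MLLMs in few-shot visual classification and finds that forcing formal explanations generally lowers predictive accuracy, even though successful explanations correlate strongly with correct predictions.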

Abstract (Chinese Translation)

上下文学习（ICL）使多模态大语言模型（MLLMs）能够基于少量标注样本对图像进行分类。然而，这些模型如何利用所提供的上下文仍不透明。尽管思维链提示（Chain-of-Thought）被广泛使用，但近期研究指出其可能无法反映真实的内部计算过程。本文在少样本上下文学习条件下，采用五种形式化严谨性递增的评估条件，涵盖从基线分类到描述逻辑（DL）公理生成的范围，系统评估了冻结的多模态大语言模型（MLLMs）的基于概念的可解释性。通过独立的“大语言模型作为评判者”（LLM-as-a-judge）管道评估四种最先进的多模态大语言模型（MLLMs），我们发现解释任务确实比单独预测更具挑战性。令人惊讶的是，强制模型生成形式化结构化的、基于概念的解释会单调降低预测准确率（从 93.8% 降至 90.1%），这与“显式推理普遍有助于提升性能”的假设相矛盾。然而，当模型成功表述出类别判别性视觉特征时，解释质量与正确预测高度相关。我们的发现表明，尽管多模态大语言模型（MLLMs）在视觉分类方面表现出色，但它们缺乏实现形式化、机器可验证的可解释性所需的特定指令微调。

Abstract

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	10.0/10	20.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	0.0/10	0.0

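Scoring rationale: The paper's core is evaluating the concept-based explainability of MLLMs in few-shot visual classification. MLLM and MultiModal are highly relevant because multimodal large models are the object of study. Unify Models is moderately relevant because MLLMs inherently unify vision and language, although the paper focuses on explainability, not unified model architecture. World Models and model-based RL have no direct connection to the content (classification and explainability) and score 0. No designated experts appear in the author list.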

Keywords

MLLMs, In-context learning, Concept-based explanations, Visual classification, Few-shot learning, Explainability

In-Depth Analysis

Chinese Title: 解释比单独预测更难：评估多模态大语言模型作为上下文学习视觉分类器的基于概念的解释

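Summary: This paper systematically evaluates the concept-based explanation ability of frozen multimodal large language models (MLLMs) on few-shot in-context learning (ICL) image classification. The authors design five explanation conditions of increasing formality: baseline classification, natural-language explanation, feature list, feature-value rules, and Description Logics (DL) axiom generation. Four state-of-the-art MLLMs are assessed on nine explanation-quality metrics with an independent LLM-as-a-judge pipeline. The experiments show that explaining is indeed harder than predicting: forcing models to produce formally structured explanations lowers predictive accuracy monotonically (from 93.8% to 90.1%), which contradicts the assumption that explicit reasoning universally helps. However, when a model does articulate class-discriminative visual features, explanation quality correlates strongly with correct predictions. The conclusion is that MLLMs excel at visual classification but lack the specific instruction tuning needed to produce formal, machine-verifiable explanations.

Innovations:

First systematic comparison of five explanation conditions of increasing formality (from free text to Description Logics axioms) in few-shot ICL image classification.
Proposes a reproducible protocol that uses fixed few-shot examples on four datasets and four MLLMs, allowing direct comparison across models and conditions.
Designs an LLM-as-a-judge evaluation framework with nine explanation-quality metrics (textual grounding, absence of hallucination, concept count, understandability, conciseness, specificity, local discriminativeness, instruction following, logical coherence).
Shows empirically that explanation ability varies widely across MLLMs and that classification accuracy does not reflect explanation quality.
Shows that forcing formal explanations lowers predictive accuracy, which contradicts the view that explicit reasoning is universally beneficial.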

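Methodology: The study uses an N-way K-shot ICL image classification setup in which the model must predict the class and also generate explanations at five levels of complexity (E1 to E5). The support set and query images are fixed, and all models stay frozen. An LLM-as-a-judge (an independent large language model) scores each generated explanation on nine dimensions, and classification accuracy is recorded alongside. The experiments cover four MLLMs (such as GPT-4V) and four image classification datasets.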
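
The "fixed support set" requirement is easy to get wrong in practice, so a small sketch may help. The helper below is hypothetical (it is not the paper's code, and the dataset is made up); it shows one way to make an N-way K-shot episode reproducible by seeding a private random generator, so every model and every explanation condition sees identical examples.

```python
import random

def build_episode(dataset, n_way, k_shot, seed=0):
    """Pick n_way classes and k_shot support images per class, deterministically.

    `dataset` maps a class label to a list of image identifiers.
    """
    rng = random.Random(seed)  # private generator: unaffected by global random state
    classes = sorted(rng.sample(sorted(dataset), n_way))
    return {label: rng.sample(dataset[label], k_shot) for label in classes}

# Toy dataset with hypothetical image identifiers.
dataset = {
    "cat": ["cat_1", "cat_2", "cat_3"],
    "dog": ["dog_1", "dog_2", "dog_3"],
    "bird": ["bird_1", "bird_2", "bird_3"],
}

episode_a = build_episode(dataset, n_way=2, k_shot=2, seed=7)
episode_b = build_episode(dataset, n_way=2, k_shot=2, seed=7)
print(episode_a == episode_b)
```

Sorting the class names before sampling keeps the result independent of dictionary insertion order, which matters when the same protocol is rerun from a differently ordered file listing.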

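Key Results:

Forcing formal explanations (from E1 to E5) lowers classification accuracy monotonically (93.8% to 90.1%).
When a model does produce discriminative visual features, explanation quality correlates strongly with correct predictions.
Explanation ability differs markedly across MLLMs, and accuracy does not predict explanation quality.
Description Logics axiom generation (E5) is the hardest condition, and most models perform poorly on it.
Natural-language explanation (E2) strikes a good balance between accuracy and explanation quality.

Tech Stack:

Multimodal large language models (MLLMs): GPT-4V, Gemini, LLaVA, and others
In-context learning (ICL)
Description Logics (DL): TBox, ABox, axioms
LLM-as-a-judge evaluation pipeline
Structured output format (XML tags)
Nine explanation-quality metrics (textual grounding, absence of hallucination, concept count, and others)
Fixed few-shot support set (K-shot)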

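Strengths:

Systematic design: formal constraints are tightened step by step from simple to complex, exposing the trade-off between explanation and prediction.
Comprehensive evaluation framework with nine metrics covering several dimensions of an explanation.
Reproducible experimental design with fixed examples for fair comparison.
Clear insight into the limits of current MLLMs, pointing toward future instruction tuning.
Uses a neuro-symbolic approach (DL) to assess the reasoning ability of MLLMs.

Limitations:

Only frozen models are evaluated; the effect of fine-tuning or prompt engineering on explanation ability is not explored.
The LLM-as-a-judge may introduce its own bias and needs human validation.
The number of datasets and models is limited, so generalization needs more experiments.
The definitions of the explanation-quality metrics depend on subjective judgment and may be imprecise.
The paper does not analyze in depth why accuracy drops under formal explanation (for example through attention mechanisms or representation learning).

Relevance To Keywords:

Unify Models: the paper studies the combined classification and explanation ability of MLLMs.
World Models: explanation involves visual feature extraction and logical reasoning, which relates to causal reasoning in world models.
Representation Learning: the paper evaluates how well models learn discriminative visual features from few examples.
Model-Based RL: Description Logics axiom generation can be seen as structured knowledge representation, related to explicit modeling in model-based RL.
Native multimodal large models: the paper directly evaluates the ICL and explanation ability of native multimodal models (such as GPT-4V).
Unified understanding and generation in multimodal large models: the model must produce a classification and an explanation together, which reflects unified understanding and generation.
Representation learning: the feature-list and feature-value conditions test the model's ability to extract visual representations.
World models: Description Logics axioms attempt to build formal world knowledge.
Reinforcement learning: post-training could borrow this evaluation framework to optimize a model's explanation ability.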

18. ABot-OCR Technical Report (PASS)

Score: 50.0 / 26.5

Authors: Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li, Mu Xu

Published: 2026-05-27

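TL;DR: ABot-OCR is an end-to-end vision-language model that converts page images directly into Markdown, is optimized with reinforcement learning, and achieves state-of-the-art performance among end-to-end systems on document understanding benchmarks.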

Abstract (Chinese Translation)

我们提出了 ABot-OCR，这是一种端到端视觉 - 语言模型，能够在单次前向传播中将页面图像直接转换为干净的 Markdown。通过这种方式，我们的方法完全消除了对脆弱模块化编排的需求。为了最大化解析保真度，我们开发了一个专用数据引擎，以提供大规模、结构一致性的监督。此外，我们提出了 Decoupled Heterogeneous Document Optimization（解耦异构文档优化），这是一种结构约束强化学习方法，旨在提升文本准确性，并在监督微调之外严格保证标记的良构性。广泛的评估展示了我们框架的优越性能。在 OmniDocBench v1.5 和 v1.6 基准上，ABot-OCR 在所有端到端系统中取得了 92.81 和 93.30 的最先进分数，显著缩小了相对于强大流水线基线的性能差距。最后，针对十种多样语言的全面跨语言文本识别进一步确认了 ABot-OCR 的鲁棒泛化能力。

Abstract

We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	6.0/10	12.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	3.0/10	6.0

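Scoring rationale: The paper proposes ABot-OCR, an end-to-end vision-language model that converts images directly into Markdown. Unify Models is relevant (end-to-end unification of perception and generation that removes modular orchestration, 6.0). World Models is irrelevant (no environment dynamics are modeled, 0.0). MLLM is relevant (a vision-language model is a multimodal large model, 8.0). MultiModal is relevant (it processes images and text, 8.0). Model-based RL is partially relevant (reinforcement learning is used, but not explicitly a model-based method, 3.0). The weighted total is 50.0, above the dynamic pass threshold of 26.5. The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).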

Keywords

End-to-end vision-language model, Page image transcription, Clean Markdown generation, Reinforcement learning method, Structure-constrained optimization, Single forward pass, Document understanding benchmarks, Multilingual text recognition

In-Depth Analysis

Chinese Title: ABot-OCR 技术报告

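Summary: This paper presents ABot-OCR, a 2B-parameter end-to-end vision-language model that transcribes document images directly into clean Markdown without a complex modular pipeline. To maximize parsing fidelity, the authors build a dedicated data engine, including hierarchical-consistency annotation verification and large-scale pseudo-labeling of web documents, to keep the training data high in quality and structurally consistent. Training is progressive and has three stages. Stage 1 builds modular document-parsing skills through four subtasks: text detection, formula recognition, table recognition, and layout analysis. Stage 2 unifies them into end-to-end, page-level Markdown generation. Stage 3 introduces Decoupled Heterogeneous Document Optimization (DHDO), a structure-constrained reinforcement learning method that uses a perception reward and three structure rewards with decoupled normalization to avoid advantage collapse, improving textual accuracy and markup well-formedness beyond supervised fine-tuning. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR scores 92.81 and 93.30, state of the art among end-to-end systems, and substantially narrows the gap to strong pipeline baselines. Its robust generalization is also verified on document images in ten languages.

Innovations:

Proposes Hierarchical-Consistency Annotation Verification, which combines linguistic consistency, visual consistency, and VLM-assisted scoring (DPCS) for multi-dimensional quality assessment.
Builds a Web-Scale Document Pseudo-Labeling pipeline that uses modular expert models to generate structured pseudo-labels, with DPCS as a unified quality control.
Designs a progressive three-stage training strategy that moves from modular subtasks to end-to-end parsing and then to structure-constrained reinforcement learning, building perception and structure ability step by step.
Proposes Decoupled Heterogeneous Document Optimization (DHDO), which introduces a perception reward and three structure rewards, with perception-conditioned activation and decoupled normalization to avoid multi-reward advantage collapse, and outperforms conventional GRPO.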

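Methodology: The backbone is Qwen3-VL-2B-Instruct. The data engine has two parts. (1) Hierarchical-consistency annotation verification: a linguistic consistency check (such as LaTeX syntax and HTML structure) comes first, then a visual consistency check (layout analysis plus expert recognizers), and finally a VLM reasoner computes a DPCS score for quality grading. (2) Web-document pseudo-labeling: crawled documents go through layout analysis, region classification, and expert-model annotation, are assembled into page-level representations, and are then filtered by DPCS. Training: Stage 1 uses cross-entropy on four subtasks (text detection, formula recognition, table recognition, layout analysis). Stage 2 uses cross-entropy on end-to-end Markdown generation. Stage 3 uses DHDO reinforcement learning, with a perception reward (text accuracy) and structure rewards (layout, table, and formula structure). The structure rewards are activated only when the perception reward exceeds a threshold, and each reward is normalized independently before aggregation.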
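
The gating and "normalize each reward separately, then aggregate" steps in Stage 3 can be contrasted with the usual sum-then-normalize recipe in a toy sketch. Everything below is an assumption made for illustration: the reward values, the threshold, and the choice of summation as the aggregation are not taken from the paper, and the function names are hypothetical.

```python
import statistics

def normalize(values):
    """Zero-mean, unit-variance normalization within one group of samples."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def decoupled_advantages(perception, structure, threshold):
    """Gate structure rewards on perception, normalize each reward alone, then sum."""
    gated = [
        [s if p >= threshold else 0.0 for s, p in zip(rewards, perception)]
        for rewards in structure
    ]
    parts = [normalize(perception)] + [normalize(rewards) for rewards in gated]
    return [sum(column) for column in zip(*parts)]

def summed_advantages(perception, structure):
    """Baseline: add all rewards first, then normalize the totals once."""
    totals = [p + sum(column) for p, column in zip(perception, zip(*structure))]
    return normalize(totals)

# Four sampled responses for one page (illustrative rewards).
perception = [0.9, 0.9, 0.9, 0.2]
structure = [
    [1.0, 0.0, 0.0, 0.0],  # e.g. a table-structure reward
    [0.0, 1.0, 1.0, 1.0],  # e.g. a formula-structure reward
]

summed = summed_advantages(perception, structure)
decoupled = decoupled_advantages(perception, structure, threshold=0.5)
print(summed[:2], decoupled[:2])
```

With summed rewards, the first two responses receive the same advantage even though they satisfy different structural criteria. Normalizing each reward on its own keeps them apart, because the rarer reward carries a larger normalized signal, and the gate stops the low-perception response from earning structure credit.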

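Key Results:

Scores 92.81 on OmniDocBench v1.5 and 93.30 on v1.6, both the best among all end-to-end systems.
Robust multilingual text recognition is verified on document images in ten languages (Arabic, Spanish, French, Russian, German, Japanese, Korean, Portuguese, Thai, Vietnamese).
DHDO consistently outperforms conventional GRPO-style optimization on document parsing, which supports the value of perception-conditioned rewards and decoupled normalization.

Tech Stack:

Qwen3-VL-2B-Instruct (backbone)
Hierarchical-consistency annotation verification (linguistic check, visual check, VLM reasoning)
Document Parsing Consistency Score (DPCS)
Modular expert models (layout analysis, text detection, formula recognition, table recognition)
Progressive three-stage training (cross-entropy plus reinforcement learning)
Decoupled Heterogeneous Document Optimization (DHDO)
GRPO (as a comparison baseline)
OmniDocBench v1.5/v1.6 benchmarks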

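Strengths:

The end-to-end architecture removes the fragility of traditional pipelines and the accumulation of errors between modules.
The data engine greatly improves the quality and coverage of the training data through multi-level verification and pseudo-label generation.
The three-stage strategy builds perception and structure ability systematically, and DHDO further strengthens structural constraints.
Reaches SOTA on several benchmarks and performs well in multilingual settings, showing strong generalization.
Code and models are open source, which eases reproduction and follow-up research.

Limitations:

The model has 2B parameters, which still imposes some compute requirements and may raise deployment cost.
Performance on extremely complex inputs (such as handwritten documents or low-quality scans) is not discussed in detail.
The threshold and reward weights in DHDO may need tuning per task, so generalization remains to be verified.
Only OmniDocBench results are reported, with no detailed comparison against other mainstream end-to-end models (such as GOT-OCR2).

Relevance To Keywords:

Native multimodal large models: ABot-OCR is built on Qwen3-VL-2B-Instruct, a native multimodal vision-language model that processes images and text directly.
Unified understanding and generation in multimodal large models: the model both understands document content (perception) and generates structured output (Markdown), unifying understanding and generation.
Representation learning: through three-stage training and DHDO reinforcement learning, the model learns visual and structural representations of documents.
World models: document parsing can be seen as modeling the document world (layout, semantics, structure), and the model implicitly learns structured knowledge about documents.
Reinforcement learning: Stage 3 uses structure-constrained reinforcement learning (DHDO) during post-training, which clearly improves structural generation quality.
Post-training: DHDO, a reinforcement learning stage after supervised fine-tuning, is a typical post-training method.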

19. Rethinking Video-Language Model from the Language Input Perspective (PASS)

Score: 50.0 / 26.5

Authors: Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

Published: 2026-05-27

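TL;DR: To address the reliance of video-language models on predefined text templates, the paper proposes a video-guided cross-modal bridging framework that improves model performance by generating diverse texts and applying attribute-based reasoning.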

Abstract (Chinese Translation)

在大语言模型浪潮的推动下，视频 - 语言模型（VLMs）已成为一项重要但具有挑战性的技术，旨在弥合视频与文本之间的鸿沟。尽管之前的 VLM 相关工作取得了显著进展，但它们几乎都隐含地假设所有文本均由特定模板预先定义。在实际应用场景中，这种严格的假设难以满足，原因如下：1) 预先定义所有文本极其耗时且费力；2) 这些预定义的文本输入过于受限且用户体验不佳，从而限制了其应用范围。观察发现，给定一个视频输入，具有相似语义但不同模板的文本会导致不同的性能表现。为此，本文提出了一种新颖的即插即用框架，适用于各种基于 VLM 的方法，旨在充分弥合视频与文本之间的鸿沟。具体而言，我们首先从原始文本生成正负文本，以针对特定的文本组件。随后，我们提出一种基于属性的文本推理策略，以挖掘生成文本的细粒度文本语义。最后，我们利用视频作为指导，通过设计自加权损失来进行跨模态桥接。大量实验表明，所提出的方法可作为即插即用模块，有效地提高最先进的 VLMs 的性能。

Abstract

Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by the specific template. In real-world applications, such a strict assumption is impossible to satisfy since 1) predefining all the texts is extremely time-consuming and labor-intensive. 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. It is observed that given a video input, texts with similar semantics but different templates lead to various performances. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as the plug-and-play module to effectively improve the performance of state-of-the-art VLMs.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	6.0/10	12.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	1.0/10	2.0

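Scoring rationale: The paper's core is the flexibility of text input to video-language models (VLMs), which is highly relevant to MultiModal and MLLM (multimodal fusion of video and text, driven by large models). Unify Models is somewhat relevant (it unifies the handling of text templates) but the paper does not unify model architectures. World Models and model-based RL are almost unrelated to the content (no dynamics modeling or reinforcement learning). The weighted total is 50.0, above the dynamic pass threshold of 26.5.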

Keywords

Video-Language Models, Text Input Flexibility, Plug-and-play Framework, Cross-modal Bridging, Attribute-based Reasoning, Multimodal Learning, Large Language Models

In-Depth Analysis

Chinese Title: 从语言输入视角重新思考视频语言模型

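Summary: To address fixed text-input templates and the resulting lack of diversity in video-language models (VLMs), the paper proposes a plug-and-play framework. The framework first uses a large language model (LLM) to generate positive and negative text samples at the word level and the structure level, simulating the diversity of real user input. It then applies an attribute-based text reasoning strategy to mine fine-grained semantics from the generated texts. Finally, it introduces a video-guided, self-weighted cross-modal fusion loss that adaptively integrates the different text components. Experiments on three downstream tasks (video question answering, video sentence grounding, and video-text retrieval) show that the method improves existing VLMs without changing their architecture. The paper shows how template changes affect VLM performance and is the first to address VLM robustness from the angle of text-input augmentation.

Innovations:

First to study, from the text-input perspective, how template-free text affects VLM robustness, and proposes a user-friendly scheme that accepts free-form text.
Designs a multi-level text augmentation module that uses an LLM to generate semantically similar positive texts and semantically different negative texts at the word and structure levels.
Proposes an attribute-based text reasoning strategy that mines fine-grained semantics from the generated texts and improves the model's grasp of text components.
Proposes a video-guided, self-weighted cross-modal fusion loss that integrates text components with adaptive weights and aligns video and text thoroughly.
The framework is a plug-and-play module that integrates into existing VLM methods and yields consistent gains on three downstream tasks.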

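Methodology: The pipeline has four parts. (1) A large language model (such as the GPT series) rewrites the original text through word-level substitution and structure-level rewriting to generate positive and negative samples. (2) An attribute-based text reasoning module analyzes the semantic importance of different components in the text (such as subject, predicate, and object) to produce fine-grained representations. (3) A video-guided, self-weighted loss adjusts the weight of each text component dynamically from the video features to achieve cross-modal alignment. (4) On video question answering, video sentence grounding, and video-text retrieval, the framework is plugged into several baseline models for evaluation.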
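
The report does not give the form of the self-weighted loss, so the following is only one plausible reading: weight each text component by a softmax over its cosine similarity to the video feature, then take the weighted sum of per-component losses. The weighting rule, the temperature parameter, the feature vectors, and the function names are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def self_weights(video_feature, component_features, temperature=1.0):
    """Softmax over video-to-component similarities (an assumed weighting rule)."""
    sims = [cosine(video_feature, c) / temperature for c in component_features]
    top = max(sims)
    exps = [math.exp(s - top) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_loss(component_losses, weights):
    """Weighted sum of per-component losses."""
    return sum(w * loss for w, loss in zip(weights, component_losses))

# Toy features: the first text component matches the video best.
video = [1.0, 0.0]
components = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

weights = self_weights(video, components)
print(weights, weighted_loss([0.2, 0.5, 0.9], weights))
```

Under this reading, components that the video supports dominate the objective, while components the video contradicts contribute little.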

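Key Results:

On datasets including Charades-STA, ActivityNet Captions, and MSRVTT, the framework clearly improves existing VLMs (such as VSLNet and CLIP4Clip) on video sentence grounding, video question answering, and video-text retrieval.
Experiments show that changing the text template degrades existing VLMs, and that the proposed method mitigates this and makes models robust to text variants.
Ablations confirm the contribution of each module: multi-level text augmentation, attribute reasoning, and the self-weighted loss.
On a manually built test set of text variants, the method improves on the baselines by 3 to 5 percentage points on average.

Tech Stack:

Large language models (LLMs): used for text generation and rewriting, such as GPT-4 and open-source LLMs.
Cosine similarity: used to assess word-level semantic change.
Self-weighted loss: adjusts text-component weights dynamically from video features.
Cross-modal fusion: aligns video features with text features.
Video feature extraction: pretrained video encoders (such as I3D and CLIP).
Text feature extraction: pretrained text encoders (such as BERT and the CLIP text encoder).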

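Strengths:

Novel problem: the first work to address the robustness problem caused by fixed text-input templates in VLMs, with practical value.
Plug-and-play: the framework does not depend on a specific architecture and integrates easily into existing VLM methods.
Complete method: text augmentation, fine-grained reasoning, and cross-modal alignment form an end-to-end solution.
Thorough experiments on three mainstream tasks and several datasets, including ablations and robustness analysis.
Uses an LLM to generate diverse text that simulates real user input, improving generalization.

Limitations:

Depends on the quality and diversity of LLM-generated text, and the LLM may introduce bias or errors.
Compute overhead: generating many positive and negative texts and running attribute reasoning increases training time and resource use.
Does not consider other modalities such as audio and focuses only on video and text.
In very complex video scenes (such as long or multi-event videos), text augmentation may be insufficient.
Does not explore directions tied to keywords such as world models and reinforcement learning, so its link to part of the given research background is weak.

Relevance To Keywords:

Unify Models: the plug-and-play framework can enhance many VLMs in a uniform way, but it does not involve a unified model architecture.
World Models: the paper does not involve world models and focuses on text-input augmentation.
Representation Learning: through text augmentation and cross-modal alignment, the paper learns a more robust joint video-text representation, which is highly relevant.
Model-Based RL: the paper does not involve reinforcement learning or model-based RL.
Native multimodal large models: the paper uses an LLM as a text generation tool and enhances existing VLMs without building a native multimodal large model.
Unified understanding and generation in multimodal large models: the paper focuses on understanding (video-text alignment) and does not involve generation tasks.
Representation learning: the core contribution is a cross-modal representation that is robust to text variants.
World models: not relevant.
Reinforcement learning: not relevant.
Post-training: the method can be seen as a post-training enhancement strategy, although the paper does not frame it as post-training.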

20. OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models (PASS)

Score: 50.0 / 26.5

Authors: Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li, Yujian Xiong, Jiajun Cheng, Jingjing Wang, Xiaobing Yu, Haiyu Wu, Shao Tang, Zhipeng Wang, Langechuan Liu, Shan Lin, Oana Dumitrascu, Yalin Wang

Published: 2026-05-27

TL;DR: This paper addresses the scarcity of ophthalmic instruction data by curating OphIn-500K from web-scale videos and developing OphIn-VL, which achieves superior performance in ophthalmic visual understanding and conversation compared to general medical MLLMs.

Abstract (Chinese Translation)

通用医疗多模态大语言模型（MLLMs）的发展在构建支持临床诊断的对话助手方面展现出巨大潜力。然而，它们向眼科等高度专业化领域的适应仍探索不足，主要由于缺乏大规模、领域特定的指令微调数据。现有的面向对话代理的眼科数据集通常规模有限，且主要依赖来自既定公共基准的图像，这限制了眼科 MLLMs 的可扩展性以及其捕捉现实世界临床复杂性的能力。为填补这一空白，我们提出 OphIn-Engine，这是一个特定于眼科的指令数据策展管道，旨在从开放获取的眼科网络级视频中构建高质量指令数据。该管道整合了多模态转录以提取图像 - 转录对、视觉线索分离与评分以识别临床相关的视觉描述，以及带质量控制的指令合成以生成准确且多样的临床对话。利用该引擎，我们推出了 OphIn-500K，这是一个大规模多模态眼科指令微调数据集，包含超过 50 万条指令实例和来自超过 2.9 万个视频片段的超过 15.1 万张唯一图像，格式涵盖视觉问答（VQA）、多轮对话交互以及思维链（CoT）推理。基于此数据集，我们进一步开发了 OphIn-VL，这是一个具有先进视觉理解和对话能力的特定于眼科的 MLLM。综合实验和案例研究表明，OphIn-VL 相较于最先进的通用医疗及领域特定 MLLM 实现了卓越的性能。

Abstract

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	10.0/10	20.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	0.0/10	0.0

评分理由: The paper focuses on MLLM and MultiModal technologies, scoring 10.0 for both. Unify Models is moderately relevant (5.0) as MLLM unifies vision and language, though the core contribution is data curation rather than model architecture unification. World Models and model-based RL are irrelevant (0.0) as the work involves supervised instruction tuning without reinforcement learning or world modeling. No expert authors from the specified list were found in the authorship.

关键词

Ophthalmic Multimodal Large Language Models, Instruction Data Curation, Web-Scale Visual Instructions, Visual Question Answering, Multi-turn Conversational Interactions, Chain-of-Thought Reasoning, OphIn-500K, OphIn-VL

深度分析

Chinese Title: OphIn-500K：策划网络规模的视觉指令以扩展眼科多模态大语言模型

Summary: 本文针对眼科多模态大语言模型（MLLM）领域缺乏大规模、领域特定的指令微调数据的问题，提出了OphIn-Engine——一个自动化的眼科指令数据策展管道。该管道从开放获取的网络眼科视频中提取图像-文本对，通过多模态转录、视觉线索分离与评分、指令合成及质量控制四个模块，生成了包含超过50万条指令实例、15万张独特图像、覆盖1000多种视网膜病理和800多种解剖特征的大规模数据集OphIn-500K。基于该数据集，作者进一步训练了眼科专用MLLM——OphIn-VL。实验表明，OphIn-VL在眼科视觉理解和临床对话能力上显著优于现有的通用医学和领域专用MLLM。

Innovations:

提出OphIn-Engine自动化策展管道，从网络视频中生成高质量眼科指令数据，减少对静态公共基准的依赖。
构建OphIn-500K大规模眼科指令数据集，包含超过50万实例、15万独特图像，覆盖广泛病理和解剖特征。
开发OphIn-VL眼科专用MLLM，在视觉理解和临床对话上达到领先性能。
引入视觉线索分离与评分机制，自动过滤低质量音频转录，确保临床相关性。
采用多轮对话、链式推理等多种指令格式，增强模型交互能力。

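Methodology: The paper uses a four-module pipeline. (1) Multimodal transcription: keyframes are selected from web videos using DINO feature extraction, audio is transcribed with Whisper, and the output is de-identified. (2) Visual cue separation and scoring: an LLM identifies the clinically relevant visual descriptions in the transcript, and a visual anchoring score filters out low-quality instances. (3) Instruction synthesis: the refined instances are converted into visual question answering (VQA), multi-turn dialogue, and chain-of-thought (CoT) instruction data. (4) Post-processing and quality control: implausible, inconsistent, or hallucinated content is removed. OphIn-VL is then obtained by fine-tuning a vision-language backbone on OphIn-500K.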
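
The keyframe-selection step in module (1) can be sketched roughly as follows. This is a minimal illustration, assuming per-frame feature vectors (for example DINO embeddings) are already available as a NumPy array. The function name `select_keyframes`, the "compare with the last kept frame" rule, and the 0.9 threshold are all hypothetical; the paper's actual selection rule may differ.

```python
import numpy as np

def select_keyframes(features: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Keep a frame when its cosine similarity to the last kept frame falls below `threshold`.

    features: (num_frames, dim) array of per-frame embeddings (assumed precomputed).
    """
    if len(features) == 0:
        return []
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    kept = [0]  # always keep the first frame
    for i in range(1, len(unit)):
        if float(unit[i] @ unit[kept[-1]]) < threshold:
            kept.append(i)
    return kept

# Toy 2-D "embeddings": frames 0-1 look alike, frames 2-3 look alike.
frames = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 1.0]])
keyframes = select_keyframes(frames)
print(keyframes)
```

Comparing against the last kept frame, rather than the immediately preceding one, avoids dropping a slow drift that never crosses the threshold between adjacent frames.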

Key Results:

OphIn-500K包含536,132条指令实例和151,430张独特图像，来自超过29,000个视频片段。
覆盖超过1,000种视网膜病理条件和800种眼科解剖特征。
OphIn-VL在眼科视觉理解和临床多轮对话任务上优于现有通用医学和领域专用MLLM。
通过消融实验验证了数据规模和质量对模型性能的显著提升作用。

Tech Stack:

DINO（自监督视觉特征提取）
Whisper（语音转录）
大语言模型（LLM）用于临床相关性评估和指令生成
余弦相似度用于视频片段分割
视觉锚定评分（Visual Anchoring Scoring）
视觉问答（VQA）、多轮对话、链式推理（CoT）数据格式

Strengths:

自动化管道可大规模扩展，降低人工标注成本。
数据来源为真实临床视频，更具临床复杂性和多样性。
数据集规模大、覆盖广，为眼科MLLM提供坚实基础。
模型在多项评估中表现优异，验证了数据有效性。
开源数据集和模型，促进领域研究。

Limitations:

依赖网络视频质量，可能存在噪声或非专业内容。
自动生成的指令可能仍包含少量幻觉或不一致。
仅覆盖三种成像模态（CFP、OCT、UWF），其他模态未涉及。
模型性能评估可能受限于现有基准的覆盖范围。
未探讨模型在真实临床部署中的鲁棒性和安全性。

Relevance To Keywords:

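Native multimodal large models: the paper builds an ophthalmology-specific MLLM, which falls squarely within multimodal large models.
Representation learning: OphIn-Engine uses DINO for visual feature extraction, a representation-learning technique.
Post-training: the model is instruction-tuned on OphIn-500K, which is a post-training technique.
World models: not addressed. The chain-of-thought data targets clinical reasoning rather than modeling environment dynamics.
Reinforcement learning: not used. Training is supervised instruction tuning.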

21. Reflective Dialogue between Teacher and Solver Agents for Video Question Answering [PASS]

Score: 50.0 / 26.5

Authors: Takuya Murakawa, Toru Tamaki

Published: 2026-05-27

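TL;DR: This paper proposes an inference-time context injection method based on a reflective dialogue between Teacher and Solver agents. It adapts vision-language models to video question answering from a small labeled support set without fine-tuning, and outperforms zero-shot and standard in-context learning baselines on the EgoCross benchmark.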

摘要翻译

已有多种方法被提出，用于将视觉 - 语言模型（VLMs）适配到视频问答的专用领域，其中包括微调（fine-tuning）和上下文学习（in-context learning）。然而，如何在推理阶段仅依靠少量标记支持集获取任务特定知识，且无需微调，仍然是一个挑战。本文提出了一种仅通过推理时上下文注入（inference-time context injection）即可实现适配的方法。该方法首先构建一个反思性对话（Reflective Dialogue, RD）——即两个代理之间的多轮对话，其中教师代理（Teacher）提出每个支持问题并提供正确性反馈，而求解器代理（Solver）则进行回答，并为正确和错误的答案提供视觉定位（visual grounding）解释（或反思）。随后，该对话历史在推理阶段被用作上下文。在 EgoCross 基准上的实验表明，我们的方法优于基线零样本（zero-shot）设置以及直接传递支持集示例的标准上下文学习方法，在 CVPR 2026 EgoVis 研讨会举办的第 1 届跨域 EgoCross 挑战赛的开源赛道中荣获第 3 名，本文亦作为该挑战赛的技术报告。

Abstract

Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD) -- a multi-turn conversation between two agents, in which Teacher poses each support question and delivers correctness feedback, and Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	5.0/10	10.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	2.0/10	4.0

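评分理由: Unify Models (5.0): the paper builds on VLMs that unify vision and language, but its core contribution is the dialogue-based adaptation mechanism. World Models (2.0): no environment dynamics modeling or world simulation is involved. MLLM (8.0): the method is built on vision-language models, which fall under multimodal large models. MultiModal (8.0): the task is inherently a multimodal interaction between video and text. model-based RL (2.0): the agents are conversational QA agents, not agents optimized by a reinforcement learning policy. The weighted total is 50.0, above the dynamic pass threshold of 26.5.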
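
The weighted total quoted above follows directly from the score table (weight times relevance, summed over keywords). A short sketch of that arithmetic is below. The function name `weighted_score` is hypothetical, the 26.5 threshold is simply taken from the report (how the dynamic threshold is derived is not stated here), and using a strict `>` for the pass check is an assumption.

```python
def weighted_score(rows):
    """rows: (keyword, weight, relevance) tuples, with relevance on a 0-10 scale."""
    return sum(weight * relevance for _, weight, relevance in rows)

# Values copied from the score table for paper 21.
rows = [
    ("Unify Models", 2.0, 5.0),
    ("World Models", 2.0, 2.0),
    ("MLLM", 2.0, 8.0),
    ("MultiModal", 2.0, 8.0),
    ("model-based RL", 2.0, 2.0),
]

PASS_THRESHOLD = 26.5  # taken from the report, not computed here

total = weighted_score(rows)
passed = total > PASS_THRESHOLD  # assumption: strictly above the threshold passes
print(total, passed)
```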

关键词

Video Question Answering, Vision-Language Models, Reflective Dialogue, Inference-time adaptation, Visual grounding, Teacher-Agent, Solver-Agent

深度分析

Chinese Title: 教师与求解者智能体之间的反思性对话用于视频问答

Summary: 本文提出一种基于反思性对话（Reflective Dialogue, RD）的视频问答方法，旨在无需微调的情况下，仅通过推理阶段的上下文注入实现模型对特定领域的适应。该方法构建两个智能体（Teacher和Solver）之间的多轮对话：Teacher提出支持集中的问题并给出正确性反馈，Solver回答并提供视觉基础解释（反思）。该对话历史作为静态上下文在推理时预置到测试问题前。在EgoCross基准上的实验表明，该方法优于零样本基线以及标准上下文学习（直接提供支持集QA对），并在CVPR 2026 EgoVis Workshop的开放源代码赛道中获得第三名。论文同时作为该挑战的技术报告。

Innovations:

提出反思性对话（RD）作为上下文学习的上下文形式，将正确性反馈和视觉解释融入多轮对话中。
无需微调或每问题重试循环，仅通过预构建的静态对话历史实现领域适应。
利用问题类型分组构建对话，使上下文更具针对性。
结合正确与错误答案的对比性反思，提供更丰富的推理指导。

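Methodology: First, an LLM identifies the question type of each question (for example, action temporal localization or object spatial localization). Then, for each domain and question type, a reflective dialogue is constructed: Teacher provides the video frames and the question, Solver answers, Teacher gives feedback based on correctness (asking for an explanation of the visual evidence when the answer is correct, and for a contrastive analysis when it is wrong), and Solver provides a visually grounded description. The dialogues for all questions are concatenated in order into one continuous conversation. At inference time, the reflective dialogue matching the test question's type and domain is retrieved and prepended to the test question as context, with a separator sentence inserted between them.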
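
The dialogue construction described above can be sketched as plain message-list manipulation. This is an illustrative sketch only: the `solver` callable stands in for a VLM call, video frames are omitted, and the role names, feedback wording, and separator sentence are hypothetical rather than the authors' prompts. Grouping by domain and question type would simply mean building one such dialogue per group.

```python
from typing import Callable

Message = dict[str, str]

def build_reflective_dialogue(
    support: list[tuple[str, str]],
    solver: Callable[[list[Message]], str],
) -> list[Message]:
    """Build a Teacher/Solver dialogue over (question, correct_answer) support pairs."""
    history: list[Message] = []
    for question, answer in support:
        history.append({"role": "teacher", "content": question})
        predicted = solver(history)
        history.append({"role": "solver", "content": predicted})
        if predicted.strip() == answer:
            feedback = "Correct. Explain the visual evidence that supports your answer."
        else:
            feedback = (
                f"Incorrect. The correct answer is {answer}. "
                "Compare it with your answer using the visual evidence."
            )
        history.append({"role": "teacher", "content": feedback})
        # The Solver's reflection is produced with the feedback in context.
        history.append({"role": "solver", "content": solver(history)})
    return history

def make_prompt(dialogue: list[Message], test_question: str) -> list[Message]:
    """Prepend the static dialogue to the test question, with a separator turn."""
    separator = {"role": "teacher", "content": "Now answer the following new question."}
    return dialogue + [separator, {"role": "teacher", "content": test_question}]

# Stub solver that always answers "A", so the first item is right and the second wrong.
dialogue = build_reflective_dialogue([("Q1", "A"), ("Q2", "B")], lambda history: "A")
prompt = make_prompt(dialogue, "Q3")
print(len(dialogue), len(prompt))
```

Because the dialogue is built once and reused, inference adds no per-question retry loop, which matches the efficiency point made in the analysis.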

Key Results:

在EgoCross基准上，所提方法优于零样本基线（无上下文）和标准上下文学习（直接提供支持集QA对）。
在CVPR 2026 EgoVis Workshop的开放源代码赛道中获得第三名。
方法有效利用了支持集中的少量标注样本，无需微调即可提升跨域视频问答性能。

Tech Stack:

视觉语言模型（VLM）：如Flamingo等用于视频问答推理。
大语言模型（LLM）：用于问题类型识别。
多轮对话构建：Teacher和Solver智能体交互。
上下文学习（In-Context Learning, ICL）：将反思性对话作为上下文。
EgoCross数据集：包含手术、工业、极限运动、动物四个专业领域。

Strengths:

无需微调，适用于专有模型或计算资源有限场景。
推理阶段无额外重试开销，效率高。
通过反思性对话提供丰富的视觉和推理上下文，优于简单QA对。
利用问题类型分组，使上下文更具领域和任务针对性。

Limitations:

依赖支持集的质量和数量，若支持集噪声大或分布不均可能影响效果。
问题类型识别依赖LLM，可能引入误差。
反思性对话长度可能随支持集增大而增长，影响推理效率。
方法仅在EgoCross上验证，泛化性需更多实验。

Relevance To Keywords:

原生多模态大模型：论文使用VLM进行视频问答，属于多模态大模型应用。
表征学习：反思性对话中的视觉基础解释涉及视觉表征的利用。
世界模型：视频问答需要理解场景动态，与世界模型相关。
强化学习：论文未涉及强化学习或后训练，相关性较弱。

22. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning [PASS]

Score: 46.0 / 26.5

Authors: Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao

Published: 2026-05-27

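TL;DR: Using a multilingual diagnostic benchmark, this paper finds that LLMs hit a reasoning cliff when building spatial world models from pure text, attributes the bottleneck to text-only working memory constraints, and suggests multimodal approaches as a future direction.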

摘要翻译

大型语言模型（LLMs）能否从纯文本描述中构建内部空间世界模型仍存在争议，且此类能力是否跨语言迁移尚未得到系统研究。我们引入 MentalMap（多语言诊断基准），该基准包含六级能力层级（L0-L5），范围从原子空间事实延伸至生成式世界图构建，并设有四个诊断维度，分别探测参照系、阅读方向偏差、推理努力分配及幻觉。MentalMap 基于 100 个 ProcTHOR 家庭场景构建，涵盖八种类型学多样化的语言以及一种结构化文本控制，包含 39 个任务族，分布在 1,950 个评估单元中。通过对十三款不同规模和模型家族的 LLMs 进行评估，我们发现了一个通用的 L3 推理悬崖：一旦基线原子准确率超过 40%，没有任何模型能在视角推理上保留其 L0 性能的一半以上。这一悬崖现象在语言、规模及提示策略上均持续存在，而结构化输出失败及推理模式在不同模型间则存在显著差异。在相同的纯文本协议下进行的人类评估复现了相同的失败模式，表明瓶颈源于纯文本工作记忆约束，而非特定于当前的 LLM 架构。我们的研究将纯文本空间推理重新定义为多轴世界建模问题，并激励多模态推理及 scratchpad 增强推理成为未来的研究方向。

Abstract

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	3.0/10	6.0
World Models	2.0	10.0/10	20.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	5.0/10	10.0
model-based RL	2.0	1.0/10	2.0

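评分理由: The title and abstract center on World Models, directly asking whether LLMs build spatial world models, so that keyword receives full marks. The abstract's closing sentence names multimodal reasoning as a future direction, so MultiModal is moderately relevant (5.0). MLLM is not studied directly but is a related line of development. Unify Models is background rather than part of the method. model-based RL is not mentioned in the abstract and is the least relevant. None of the specified expert authors appear in the author list, so no bonus applies.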

关键词

World Models, Spatial Reasoning, LLMs, Multilingual, Text-only, Diagnostic Benchmark, Multimodal, MentalMap

深度分析

Chinese Title: LLMs是否从文本中构建世界模型？空间推理的多语言诊断

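Summary: The paper introduces MentalMap, a multilingual diagnostic benchmark for spatial reasoning, to ask whether large language models (LLMs) build internal spatial world models from pure-text descriptions and whether that ability transfers across languages. The benchmark has a two-axis design. The capability axis is a six-level ladder (L0-L5) that runs from atomic spatial facts to generative world-graph output. The diagnostic axis consists of four orthogonal lenses: frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. It is built on 100 ProcTHOR household scenes and covers eight typologically diverse languages (English, Chinese, Japanese, Korean, Spanish, Arabic, Thai, German) plus a structured-text control, for 39 task families and 1,950 evaluation cells. Thirteen LLMs are evaluated (closed-source frontier models, open-source 7-10B families, a mid-scale 27-32B control group, and a vision-language ablation), yielding seven key findings: the capability ladder is non-monotonic, with a universal cliff at L3 viewpoint reasoning; the cliff is structured by subtask; chain-of-thought prompting is not a universal booster; node identification and relation extraction dissociate when generating world graphs; the L3 cliff persists in mid-scale models; hierarchical clustering of model performance correlations recovers a writing-script typology; and a human study under the same pure-text protocol reproduces the cliff, indicating that it reflects a working-memory bottleneck of the text modality. The paper concludes by reframing pure-text spatial reasoning as a multi-axis world-modeling problem rather than a single-number leaderboard.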

Innovations:

首次在单一基准中结合纯文本空间推理、多语言覆盖和生成世界图评估，填补了现有基准的三个空白（英语单语、单维度评估、表面与实质模糊）。
双轴设计：能力阶梯（L0-L5）与诊断镜头（参考框架、阅读方向、推理努力、幻觉）正交，系统性地分离了能力难度与诊断角度。
发现并验证了L3视角推理处的普遍性能悬崖，该悬崖将静态理解（回忆）与主动空间推理（构建）离散分开，且跨模型、跨语言一致存在。
揭示了思维链提示的非普遍有效性：对DeepSeek提升32个百分点，但对Qwen2.5-7B降低16个百分点，且对多数开源模型的严格JSON输出产生干扰。
通过人类实验在8种语言中复制了L3悬崖（L3=0%），直接证明该悬崖源于文本模态的工作记忆瓶颈，而非LLM或语言特定缺陷。

Methodology: 基于ProcTHOR家庭场景生成100个场景的文本描述，设计六级能力阶梯（L0原子事实、L1关系查询、L2多跳推理、L3视角变换、L4动态更新、L5生成世界图），并嵌入四个诊断镜头。将场景描述翻译为8种语言（英、中、日、韩、西、阿、泰、德）及结构化文本控制。评估13个LLM，包括闭源（GPT-4o、Claude等）、开源7-10B（Qwen2.5-7B、Llama-3.1-8B等）、中等规模27-32B控制组（Qwen2.5-32B、Gemma-3-27B）及视觉语言消融（Qwen2.5-VL）。使用严格通过率和部分信用图F1评分，并采用层次聚类分析多语言性能轮廓。进行人类实验（8种语言母语者）在相同纯文本协议下验证。

Key Results:

F1: 能力阶梯非单调，L3视角推理处存在普遍悬崖：所有（模型，语言）单元在悬崖诊断区间（L0>40%）中，L3得分均低于L0原子性能的一半。
F2: 悬崖具有子任务结构化特征：框架变换崩溃与布尔一致性探测猜测分离。
F3: 思维链提示非通用助推器：对DeepSeek-V4-Flash提升32个百分点，但对Qwen2.5-7B降低16个百分点，且对多数开源模型的L5严格JSON输出产生破坏。
F4: 生成世界图时，节点识别（物体存在）与关系提取（包含关系）分离；严格通过率与部分信用图F1给出不同的模型排名。
F5: L3悬崖在中等规模模型（27-32B）中持续存在，而开源7-10B模型在L5上追赶闭源，表明微调机制而非参数规模驱动结构化输出能力。
F6: 层次聚类从模型多语言性能相关性中恢复出拉丁/CJK/（阿拉伯、泰）文字脚本类型学，并将模型分为窄轮廓和宽轮廓两类。
F7: 人类实验在8种语言中复制了L3悬崖（L3=0%），L4语言不变（约41%，标准差3.3个百分点），直接证明悬崖反映文本模态的工作记忆瓶颈。
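
The cliff criterion in finding F1 (L0 accuracy above 40%, and L3 below half of L0) can be written as a small check. The sketch below assumes accuracies are expressed as fractions in [0, 1]; the function name `shows_l3_cliff` and the example numbers are made up for illustration and are not results from the paper.

```python
from typing import Optional

def shows_l3_cliff(l0_acc: float, l3_acc: float, min_l0: float = 0.40) -> Optional[bool]:
    """Return None outside the diagnostic regime (L0 <= min_l0),
    otherwise whether L3 accuracy is below half of L0 accuracy."""
    if l0_acc <= min_l0:
        return None  # atomic accuracy too low for the cliff test to be meaningful
    return l3_acc < 0.5 * l0_acc

# Illustrative (model, language) cells with invented accuracies.
cliff = shows_l3_cliff(0.80, 0.30)       # in regime, 0.30 < 0.40
outside = shows_l3_cliff(0.35, 0.30)     # L0 too low, not diagnosed
no_cliff = shows_l3_cliff(0.60, 0.45)    # in regime, 0.45 >= 0.30
print(cliff, outside, no_cliff)
```

Gating on L0 keeps weak cells from being counted: a model that barely recalls atomic facts cannot meaningfully "lose" performance at L3.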

Tech Stack:

ProcTHOR场景生成器（用于生成100个家庭场景）
JSON schema验证器（用于严格评估结构化输出）
图F1评分（部分信用指标，区分节点识别与关系提取）
层次聚类（用于分析多语言性能轮廓）
思维链提示（Chain-of-Thought）
严格通过率（Strict Pass Rate）
多语言翻译（8种语言，涵盖LTR和RTL书写系统）

Strengths:

多语言覆盖（8种类型多样语言）和双轴设计（能力阶梯+诊断镜头）系统性地揭示了纯文本空间推理的多维本质。
人类实验验证了L3悬崖的普遍性，排除了LLM特定缺陷，直接指向文本模态的工作记忆瓶颈。
提供了细粒度的评估指标（严格通过率与部分信用图F1），揭示了不同评分方式下的模型排名差异。
对思维链提示的非普遍有效性进行了实证分析，为提示工程提供了重要警示。
基准设计基于真实家庭场景（ProcTHOR），而非抽象网格，增强了生态效度。

Limitations:

仅关注纯文本模态，未涉及多模态输入（如视觉、触觉），而空间推理在现实世界中常依赖多模态信息。
场景局限于家庭环境（ProcTHOR），可能无法泛化到其他空间类型（如户外、工业环境）。
评估模型数量有限（13个），且未包括最新的长推理模型（如o1、DeepSeek-R1等）。
人类实验样本量可能较小，且未详细报告参与者数量及统计显著性。
未探索不同提示策略（如零样本、少样本）对L3悬崖的影响，仅比较了直接提示与思维链提示。

Relevance To Keywords:

世界模型：论文直接探讨LLMs是否从文本中构建内部空间世界模型，核心问题与世界模型概念高度相关。
表征学习：空间推理涉及从文本中学习并操作空间表征，论文通过能力阶梯和诊断镜头分析了表征的构建与变换。
多模态大模型：论文虽为纯文本，但明确指出多模态（如视觉）是自然的下一个探针，且视觉语言消融实验（Qwen2.5-VL）提供了初步对比。
模型后训练：论文发现微调机制（而非参数规模）驱动结构化输出能力（L5），与后训练策略（如指令微调、RLHF）直接相关。
强化学习：论文未直接涉及强化学习，但世界模型是模型基强化学习（Model-Based RL）的核心组件，论文的发现为RL中的空间推理提供了诊断工具。
原生多模态大模型：论文的纯文本诊断可视为多模态模型空间推理能力的基线，未来可扩展至原生多模态模型。
理解与生成一体化：论文的L5生成世界图任务要求模型同时理解场景并生成结构化输出，体现了理解与生成的融合。

23. Structure-Guided Visual Perturbation Neutralization for LVLMs [PASS]

Score: 46.0 / 26.5

Authors: Yuanhe Zhang, Xueting Wang, YanBin Ren, Haoran Gao, Xinhan Zheng, Zhenhong Zhou, Fanyu Meng, Li Sun, Sen Su

Published: 2026-05-27

TL;DR: This paper proposes a lightweight defense framework called SIGN to neutralize adversarial visual perturbations in Large Vision Language Models, achieving high defense success with minimal computational overhead.

摘要翻译

图像输入使大视觉语言模型（LVLMs）能够感知细粒度视觉信息，但也引入了像素级攻击面，对抗性扰动可通过该攻击面引发不安全的模型行为。然而，大多数现有的防御机制是为传统计算机视觉场景设计的，因此往往忽略了 LVLMs 所需的跨模态对齐，导致性能下降。与此同时，针对 LVLMs 的有限防御通常需要大幅度的图像修改，并引入显著的计算开销，从而损害推理质量和效率。为了解决这些局限性，我们提出了一种轻量级、即插即用的防御框架——结构诱导引导中和（SIGN），该框架通过先验结构提取提高 LVLMs 兼容性，并通过动态引导中和实现高效的扰动抑制。大量实验表明，SIGN 仅使用 0.5% 的像素修改和每张图片 0.16 秒的时间，即可实现超过 87% 的防御成功率，同时几乎保留了原始视觉表征和正常任务性能。我们的工作提供了一种轻量级替代方案，以取代需要昂贵模型训练的防御机制，并展示了利用视觉编码器进行高效对抗性保护的潜力。我们的代码在 https://anonymous.4open.science/r/SIGN-BCB1 上开源。

Abstract

Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is open source on https://anonymous.4open.science/r/SIGN-BCB1.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	9.0/10	18.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	0.0/10	0.0

评分理由: The paper focuses on adversarial defense for Large Vision Language Models (LVLMs), which are inherently multimodal and fall under the MLLM category, justifying high scores for MLLM and MultiModal. Unify Models is moderately relevant as LVLMs unify vision and language, but the core contribution is defense rather than model unification architecture. World Models and model-based RL are unrelated to the content. None of the specified expert authors appear in the author list.

关键词

LVLMs, Adversarial Perturbations, Defense Framework, Visual Perturbation Neutralization, Structure-Guided, Vision Encoder, Lightweight

深度分析

Chinese Title: 结构引导的视觉扰动中和：面向大型视觉语言模型的轻量级防御框架

Summary: 本文针对大型视觉语言模型（LVLMs）在图像输入中易受像素级对抗攻击的问题，提出了一种轻量级、即插即用的防御框架SIGN（Structure-Induced Guided Neutralization）。该方法分为两个阶段：首先，通过先验结构提取（Prior Structural Extraction）从LVLM的视觉编码器中估计出稳定的结构先验（Structure Prior），该先验捕捉了编码器在不同良性输入下共享的块级结构敏感性；其次，在动态引导中和（Dynamic Guided Neutralization）阶段，将结构先验投影到像素空间，并结合局部统计信息识别稀疏异常区域，通过稀疏像素级干预抑制对抗信号。实验表明，SIGN在仅修改0.5%像素、每张图像处理时间0.16秒的条件下，平均防御成功率超过87%，且几乎不损害良性任务的性能。该工作为无需昂贵模型训练的轻量级对抗防御提供了新思路。

Innovations:

揭示了LVLM视觉编码器中存在稳定的结构响应偏差（Structure Prior），可作为对抗防御的编码器级先验。
提出SIGN框架，利用结构先验引导稀疏像素级干预，实现高效对抗信号抑制，无需模型重训练。
设计了两阶段方法：先验结构提取（跨层跨样本聚合响应幅度）和动态引导中和（结合局部中位数参考与结构先验插值）。
在仅修改0.5%像素、处理时间低于0.2秒的条件下，达到87%以上防御成功率，且保持良性图像表示余弦相似度>0.99。
提供了一种轻量级、即插即用的防御方案，适用于多种LVLM架构和攻击类型。

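Methodology: The method has two stages. (1) Prior Structural Extraction: on an unlabeled image set, compute the L2-norm response of every patch at every layer of the LVLM's vision encoder, average across layers and samples to obtain a patch-level structural profile, and rearrange it in canonical token order into a 2D prior map. (2) Dynamic Guided Neutralization: map the prior map to the input pixel space with bilinear interpolation to obtain a pixel-level guidance map; for each pixel, compute the median of its neighborhood as a local reference and combine it with the guidance-map weight into an anomaly score; finally, sparsely modify the pixels with high anomaly scores (for example, by zeroing them or replacing them with a mean value) to suppress the adversarial perturbation. The experiments cover six LVLMs and four attack methods.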
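
The second stage can be illustrated with a rough NumPy sketch. It is not the authors' implementation: the prior map is assumed to be given, nearest-neighbour upsampling stands in for bilinear interpolation, the anomaly score is assumed to be the guidance weight times the absolute deviation from the local median, and flagged pixels are replaced by that median. The function name `neutralize` and its parameters are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def neutralize(image: np.ndarray, prior: np.ndarray, budget: float = 0.005, k: int = 3) -> np.ndarray:
    """Replace the highest-scoring fraction `budget` of pixels with their local median.

    image: (H, W) float array; prior: (h, w) patch-level map with H % h == 0 and W % w == 0.
    """
    H, W = image.shape
    h, w = prior.shape
    # Upsample the prior to pixel resolution (nearest neighbour, a simplification).
    guide = np.repeat(np.repeat(prior, H // h, axis=0), W // w, axis=1)
    # Local median over a k x k neighbourhood, with edge padding.
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    local_median = np.median(sliding_window_view(padded, (k, k)), axis=(-2, -1))
    # Assumed anomaly score: guidance weight times deviation from the local reference.
    score = guide * np.abs(image - local_median)
    n_fix = max(1, int(budget * image.size))
    idx = np.argpartition(score.ravel(), -n_fix)[-n_fix:]
    out = image.copy()
    out.flat[idx] = local_median.flat[idx]
    return out

# Flat grey image with one outlier pixel; a uniform prior for simplicity.
image = np.full((8, 8), 0.5)
image[3, 4] = 1.0
cleaned = neutralize(image, np.ones((2, 2)), budget=0.02)
print(cleaned[3, 4], np.count_nonzero(cleaned != image))
```

Capping the number of modified pixels with `budget` mirrors the sparse-intervention idea: only the most anomalous locations are touched, so benign content is left largely unchanged.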

Key Results:

SIGN在六种LVLM（如LLaVA、InstructBLIP等）上平均防御成功率超过87%。
仅修改0.5%的像素，每张图像防御构建时间低于0.2秒（典型值0.16秒）。
良性图像经过SIGN后，视觉表示余弦相似度保持在0.99以上，几乎不影响原始语义。
结构先验在同一模型内高度稳定，跨样本变化小，可作为有效引导。
与现有防御方法（如区域掩码、扩散修复）相比，SIGN在效率和保真度上具有优势。

Tech Stack:

LVLM视觉编码器（如CLIP ViT）
L2范数（响应幅度聚合）
双线性插值（先验图到像素空间映射）
局部中位数滤波（异常检测参考）
稀疏像素修改（防御操作）
余弦相似度（表示保真度评估）
六种LVLM模型：LLaVA, InstructBLIP, Qwen-VL, CogVLM等
四种攻击方法：Jailbreak, LLM-DoS, 误分类等

Strengths:

轻量级：无需模型训练，即插即用，计算开销极低。
高效：仅修改极少量像素（0.5%），防御时间<0.2秒。
通用性：适用于多种LVLM架构和多种攻击类型。
保真度高：几乎不改变良性图像语义，保持视觉表示相似度>0.99。
创新性：首次利用编码器结构响应偏差作为先验引导防御。

Limitations:

依赖无标签图像集进行先验提取，可能在不同数据分布下先验稳定性需验证。
防御机制仅针对像素级扰动，对更复杂的结构化攻击（如补丁攻击）可能效果有限。
实验仅在固定分辨率或最小输入窗口下进行，动态分辨率模型需额外处理。
未考虑自适应攻击者可能绕过基于结构先验的防御。
论文未深入分析先验提取所需样本数量对性能的影响。

Relevance To Keywords:

与“原生多模态大模型”和“多模态大模型的理解和生成一体化”高度相关：论文聚焦LVLM的视觉对抗防御，直接涉及多模态大模型的安全与鲁棒性。
与“表征学习”相关：论文利用视觉编码器的表征（L2范数响应）提取结构先验，属于表征层面的分析。
与“世界模型”和“强化学习”相关性较弱：论文未涉及世界模型构建或强化学习训练，但防御框架可视为提升模型鲁棒性的后处理技术，与“后训练”概念有一定关联（无需重训练，但属于推理时防御）。
与“Model-Based RL”无直接关联。
总体而言，论文主要贡献在LVLM对抗防御领域，与多模态大模型的安全性和表征学习紧密相关，与世界模型和强化学习关联度低。

24. Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation [PASS]

Score: 46.0 / 26.5

Authors: Niantong Li, Guangzheng Hu, Weixu Qiao, Ying Ba, Qichen Hong, Shijun Shen, Jinlin Wang, Fan Zhou, Jianye Kang, Xin Shang, Ziyi He, Wei Wang, Dalin Li, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yuxiang Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Zekai Zhang, Zhendong Wang, Zihao Liu, Zikai Zhou, Hongzhu Shi, Yi Wang, Bing Zhao, Hu Wei, Lin Qu, Chenfei Wu

Published: 2026-05-27

TL;DR: This paper introduces Qwen-Image-Bench, a creator-centric benchmark utilizing a unified judge model to effectively evaluate Text-to-Image models on real-world fidelity and creative generation.

摘要翻译

文本到图像（Text-to-Image）生成已从基础图像合成演变为专业创意工作流中常用的核心能力，在此背景下，简单的文本 - 图像对齐已无法满足用户对忠实于现实世界的重建和真实创意表达的迫切需求。然而，现有基准仍锚定在这些基础标准上，尚未捕捉到真实艺术实践中至关重要的细微能力，导致难以可靠地区分最先进的文本到图像（T2I）模型。为填补这一空白，我们提出了 Qwen-Image-Bench，这是一个以创作者为中心的基准，由专业艺术家共同设计，并基于真实世界创作场景构建。Qwen-Image-Bench 通过两个应用驱动维度丰富了传统评估：现实保真度（Real-world Fidelity）与创意生成（Creative Generation）。借鉴专业艺术工作流中固有的分阶段推理，我们将这五大支柱组织成自上而下的层级分类体系，进一步细分为 23 个二级子能力及 56 个三级可验证标准（rubrics）。为确保广泛覆盖，我们精心策划了 1000 个分层提示（stratified prompts），每个提示均在多个支柱上共同涵盖超过四个细粒度方面。我们基于 Qwen3.6-27B 训练了一个统一的评判模型 Q-Judger，该模型由来自全球艺术学院的 80 名专业标注者在盲标和三审协议下进行监督，对每张图像在所有 56 个可验证方面进行评分，从而生成细粒度、基于标准且完全可归因的诊断，而非单一的不透明分数。实证结果表明，Qwen-Image-Bench 能够可靠地区分领先的 T2I 模型，在现实保真度（Real-world Fidelity）和创意生成（Creative Generation）这两个应用驱动维度上实现了最大的区分度，而现有基准在此处几乎无法提供见解，同时也为生产级 T2I 开发提供了可信的优化信号。

Abstract

Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users' pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	6.0/10	12.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	7.0/10	14.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	1.0/10	2.0

评分理由: The paper focuses on Text-to-Image evaluation, which is inherently Multimodal (8.0) and utilizes a Qwen-based judge model (MLLM, 7.0). It introduces a unified scoring framework (Unify Models, 6.0). However, it does not address World Models or Model-Based RL (1.0 each). Expert check: None of the target experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are in the author list, so no bonus applies.

关键词

Text-to-Image Evaluation, Real-world Fidelity, Creative Generation, Unified Judge Model, Qwen-Image-Bench, Multimodal Benchmark, Professional Artistic Workflows

深度分析

Chinese Title: Qwen-Image-Bench：从生成到创作——文本到图像评估的新范式

Summary: 本文针对现有文本到图像（T2I）评估基准在专业创作场景中的不足，提出Qwen-Image-Bench——一个以创作者为中心的评估基准。该基准由专业艺术家共同设计，在传统质量、美学和文本-图像对齐三个支柱基础上，新增真实世界保真度和创意生成两个应用驱动维度，形成5个一级支柱、23个二级子能力和56个三级可验证标准的层次化分类体系。基准包含1000个精心设计的双语提示（中英文各半，长短平衡），每个提示覆盖多个细粒度评估方面。为克服现有MLLM评判器的偏见，论文训练了统一的评判模型Q-Judger（基于Qwen3.6-27B），由80位来自艺术院校的专业标注员在盲标和三重审核协议下监督训练，对每个图像在所有56个方面进行细粒度评分。实验评估了18个主流T2I模型，结果表明GPT Image 2总分最高（64.7），而真实世界保真度和创意生成两个维度上模型间差异最大，揭示了当前模型在物理逻辑、解剖保真度等四个方面的系统性瓶颈。该基准为生产级T2I模型开发提供了可靠的优化信号。

Innovations:

提出创作者中心的评估框架，在传统三个支柱基础上新增真实世界保真度和创意生成两个应用驱动维度，更贴近专业创作需求。
构建了三级层次化分类体系（5个一级支柱、23个二级子能力、56个三级可验证标准），基于专业艺术工作流程的阶段性推理设计，系统性强且避免重叠。
设计了1000个双语、长短平衡的结构化提示，每个提示联合覆盖多个细粒度评估方面，实现更密集和均衡的维度覆盖。
训练了统一的评判模型Q-Judger（基于Qwen3.6-27B），由80位专业标注员在严格盲标和三重审核协议下监督，输出56维细粒度评分向量，避免单一MLLM评判器的系统性偏差，并提供可归因的诊断信息。
揭示了当前前沿T2I模型在物理逻辑、解剖保真度、动物和接触交互四个方面的系统性天花板，为模型改进指明方向。

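Methodology: First, the authors work with professional artists to design evaluation dimensions from real creation scenarios and build the hierarchical taxonomy from L1 to L3. Second, through strict filtering and stratified sampling, 1,000 bilingual prompts are selected from a large candidate pool so that all 56 third-level facets are covered and prompt length and language are balanced. Third, images generated by 18 mainstream T2I models are collected, and 80 professional annotators score each image on all 56 facets under blind labeling and triple-review protocols, producing the supervision data. Fourth, the unified judge model Q-Judger is trained on top of Qwen3.6-27B by fine-tuning on the human annotations, so that it outputs fine-grained score vectors automatically. Finally, scores are aggregated bottom-up (from L3 to L2, then to L1 and the overall score), and agreement with human expert judgments is measured with the Spearman correlation coefficient.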
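
The bottom-up aggregation step can be sketched as nested averaging. This assumes an unweighted mean at every level, which the report does not confirm; the function name `aggregate`, the sub-capability and rubric names, and the scores in the example are invented placeholders.

```python
from statistics import mean

def aggregate(tree):
    """tree: {pillar: {sub_capability: {rubric: score}}}.

    Returns (l1_scores, l2_scores, overall), averaging L3 rubrics into L2
    sub-capabilities, L2 into L1 pillars, and L1 pillars into one overall score.
    """
    l2 = {
        pillar: {sub: mean(rubrics.values()) for sub, rubrics in subs.items()}
        for pillar, subs in tree.items()
    }
    l1 = {pillar: mean(subs.values()) for pillar, subs in l2.items()}
    return l1, l2, mean(l1.values())

# Placeholder hierarchy with made-up scores.
scores = {
    "Real-world Fidelity": {
        "sub_a": {"rubric_1": 40, "rubric_2": 20},
        "sub_b": {"rubric_3": 50},
    },
    "Creative Generation": {
        "sub_c": {"rubric_4": 80},
    },
}
l1, l2, overall = aggregate(scores)
print(l1, overall)
```

Keeping the L2 and L3 values alongside the overall score is what makes the result attributable: a low total can be traced back to the specific rubrics that caused it.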

Key Results:

Q-Judger与人类专家判断的排名一致性达到Spearman ρ=0.92，显著降低评估成本。
在18个主流T2I模型中，GPT Image 2获得最高总分64.7，Qwen Image 2.0 Pro排名第五。
真实世界保真度和创意生成两个应用驱动维度上模型间方差最大，表明它们捕捉了现有基准无法区分的性能差距。
四个三级评估方面（物理逻辑、解剖保真度、动物、接触交互）成为当前所有模型的系统性天花板，即使最佳模型得分也低于44。
基准能够可靠区分前沿T2I模型，为生产级开发提供可信的优化信号。

Tech Stack:

Qwen3.6-27B（作为评判模型的基础架构）
CLIP及其变体（用于传统文本-图像对齐评估）
VQA/VLM（视觉问答/视觉语言模型，用于自动化评估）
分层抽样与提示设计策略（确保覆盖56个三级评估方面）
盲标与三重审核协议（人工标注质量控制）
Spearman等级相关系数（评估排名一致性）
自底向上聚合评分方法（从细粒度到总体得分）

Strengths:

与专业艺术家合作设计，评估维度紧密贴合真实创作需求，具有高生态效度。
层次化分类体系系统性强、覆盖全面，避免传统扁平分类的冗余和遗漏。
双语、长短平衡的提示集兼顾不同用户群体，提高评估鲁棒性。
训练专门的评判模型而非直接使用现成MLLM，减少系统性偏差，且输出细粒度诊断信息。
实验覆盖18个主流模型，揭示了前沿模型的性能瓶颈，具有实际指导意义。

Limitations:

评判模型Q-Judger基于Qwen3.6-27B，其性能受限于基础模型能力，可能无法完全捕捉所有艺术性细微差别。
人工标注成本高（80位专家、三重审核），虽然训练后自动评估成本降低，但初始构建成本较大。
基准仅包含1000个提示，虽然精心设计但可能仍无法覆盖所有创作场景。
评估主要针对静态图像生成，未涉及视频生成或多模态交互等新兴方向。
论文未详细讨论评判模型在不同语言和文化背景下的泛化能力。

Relevance To Keywords:

原生多模态大模型：论文使用Qwen3.6-27B作为评判模型，属于多模态大模型在评估中的应用，且基准本身涉及多模态理解（文本-图像对齐）。
多模态大模型的理解和生成一体化：基准评估的正是T2I生成模型，而评判模型Q-Judger需要同时理解文本和图像，体现了理解与生成的一体化需求。
表征学习：基准中的真实世界保真度维度要求模型学习物理、文化、事实等世界知识表征，与表征学习密切相关。
世界模型：真实世界保真度评估模型对物理逻辑、解剖结构、接触交互等世界知识的掌握，直接关联世界模型概念。
强化学习：论文提到后训练（post-training）和模型优化信号，基准可用于指导强化学习中的奖励模型设计。
后训练：基准为生产级T2I模型开发提供优化信号，可应用于模型的后训练阶段。

25. Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning [PASS]

Score: 46.0 / 26.5

Authors: Yuting Ma, Lechao Cheng, Xiaohua Xu

Published: 2026-05-27

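TL;DR: This paper proposes FedDTL, a framework that decouples encoder training and applies two-stage local fine-tuning with a reinforcement learning stage to address inter-client optimization inconsistency and intra-client over-specialization in federated learning, balancing global task adaptation and generalization.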

摘要翻译

联邦学习（FL）与预训练视觉 - 语言模型（VLMs）的结合已成为一种有前景的范式，适用于各种下游任务。通过利用其强大的表示能力，近期研究在本地数据不足的情况下改进了任务适应，同时保持了泛化能力。然而，这些方法强调完全本地优化与简单的参数聚合，这在异质性和全数据联邦学习（FL）设置下会放大客户端间优化不一致性和客户端内过度专业化，导致难以平衡全局任务适应与泛化能力。为了解决这些挑战，我们提出了 FedDTL，一种新颖的联邦视觉 - 语言模型（VLM）框架，该框架在客户端与服务器之间解耦了图像编码器和文本编码器。通过解耦的编码器训练及服务器 - 客户端模态对齐，FedDTL 促进了连贯的全局语义更新，减少了客户端间优化不一致性，从而改进了全局任务适应。为了进一步缓解客户端内过度专业化，我们引入了两阶段本地微调：首先是一个监督微调阶段，可实现快速可靠的暖启动；随后是一个强化学习阶段，以增强泛化能力。在多个基准（包括标签偏移和特征偏移）上的广泛实验表明，FedDTL 在少样本和全数据模式下，针对各种联邦学习（FL）数据分布，实现了全局任务适应与泛化能力之间的有效平衡。

Abstract

Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging its strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic update and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning, where a supervised fine-tuning stage enables rapid and reliable warm-start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	3.0/10	6.0

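评分理由: The paper's core is fine-tuning vision-language models (VLMs) under federated learning, which fits MLLM and MultiModal closely. It uses reinforcement learning for local fine-tuning but shows no model-based RL characteristics. Federated parameter aggregation involves a form of unification, but not the architectural unification that Unify Models refers to. The content is unrelated to World Models.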

关键词

Federated Learning, Vision-Language Models, Decoupled Training, Local Reinforcement Fine-Tuning, Global Task Adaptation, Generalization, Server-Client Modality Alignment

深度分析

Chinese Title: 联邦学习中解耦训练与本地强化微调

Summary: 本文针对联邦学习（FL）与预训练视觉语言模型（VLM）结合时面临的客户端间优化不一致和客户端内过专门化问题，提出FedDTL框架。该方法通过解耦图像编码器（本地训练）和文本编码器（服务器端训练），并利用模态对齐减少客户端间漂移，提升全局任务适应能力。同时，设计两阶段本地微调策略：先进行监督微调（SFT）实现快速任务适应，再引入基于GRPO的强化学习阶段抑制过专门化并增强泛化。实验在标签偏移和特征偏移场景下，涵盖少样本和全数据设置，结果表明FedDTL有效平衡了全局任务适应与泛化性能，优于现有联邦VLM基线方法。

Innovations:

提出解耦编码器训练方案：图像编码器本地优化保护隐私，文本编码器服务器端学习全局一致语义，通过模态对齐减少客户端间优化不一致。
设计两阶段本地微调：先SFT快速热启动，再GRPO强化学习增强泛化，缓解客户端内过专门化。
采用LoRA高效微调，降低计算开销，适用于少样本和全数据联邦场景。
在多种数据分布（标签偏移、特征偏移）和两种数据规模（少样本、全数据）下全面评估，证明方法平衡适应与泛化的有效性。

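Methodology: The method builds on CLIP and uses LoRA for parameter-efficient fine-tuning of the image and text encoders. The server maintains a global text encoder, clients train image encoders locally, and contrastive learning aligns the visual and textual representations. Local training has two stages: stage one is SFT with a cross-entropy loss for fast adaptation to the downstream task; stage two is GRPO-based reinforcement learning, which builds rewards from the model's output probability distribution and optimizes the policy to improve generalization. In each communication round, clients upload the LoRA parameters of the image encoder and compressed visual embeddings, and the server aggregates them and broadcasts the result.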
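
The core of the GRPO stage is a group-relative advantage: each sampled output's reward is standardised against the other outputs sampled for the same input. The sketch below shows only that normalisation step in its commonly used form. How FedDTL actually derives rewards from the output distribution is not reproduced here, and the function name and the example rewards are illustrative.

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standardise rewards within one sampled group: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled outputs for one input, with made-up binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed, which keeps the local training step light enough for client-side use.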

Key Results:

在多个基准数据集上，FedDTL在标签偏移和特征偏移下均优于FedPGP、pFedDC等基线。
在少样本和全数据设置中，FedDTL均能有效平衡全局任务适应（准确率）与泛化（对未见类别或域的性能）。
消融实验验证了解耦训练和两阶段微调各自的有效性，RL阶段显著提升泛化能力。
与完全本地优化+参数平均的方法相比，FedDTL的客户端间表征一致性更高，客户端漂移更小。

Tech Stack:

CLIP (Contrastive Language-Image Pre-training)
LoRA (Low-Rank Adaptation)
GRPO (Group Relative Policy Optimization)
Supervised Fine-Tuning (SFT)
Federated Learning (FedAvg)
对比学习（模态对齐）
参数高效微调 (PEFT)

Strengths:

创新性地将CLIP的模态解耦思想引入联邦学习，从架构层面缓解客户端漂移。
两阶段微调结合SFT的快速适应和RL的泛化优势，无需额外正则化或对齐损失。
在少样本和全数据两种实际场景下均验证有效性，实验设置全面。
使用LoRA保持计算高效，易于部署。

Limitations:

依赖预训练CLIP模型，若基础模型能力不足可能影响效果。
服务器端维护文本编码器并处理视觉嵌入，可能增加通信和计算开销（虽已压缩）。
RL阶段需要设计合适的奖励函数，对超参数敏感，文中未详细讨论调参难度。
未考虑系统异构性（如客户端计算能力差异）对训练的影响。

Relevance To Keywords: 论文核心涉及多模态大模型（CLIP）、表征学习（模态对齐）、强化学习（GRPO）和后训练（微调），与关键词中的“原生多模态大模型”、“表征学习”、“强化学习”、“后训练”高度相关。但“世界模型”、“模型基RL”等概念未直接涉及，整体相关性中等偏上。

26. MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation [PASS]

Score: 44.0 / 26.5

Authors: Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen, YiMing Cheng, Xu Liu, Dian Jin, Jiajun Xu, Jingyun Liao, Tian Lan, Ziqin Zhou, Yueying Liu, Yu Bai, Changsen Yuan, Jinxing Zhou, Xian-Ling Mao, Xuefeng Chen, Yousheng Feng

Published: 2026-05-27

TL;DR: MTAVG-Bench 2.0 proposes a benchmark to diagnose cinematic expressiveness failures in multi-talker audio-video generation, demonstrating that even advanced omni models struggle with complex high-level assessments.

摘要翻译

近年来，多说话人音视频生成（MTAVG）模型在唇同步（lip-sync）和视听对齐（audio-visual alignment）等基本指标上展现出优异的表现。然而，这些指标仍不足以评估场景级生成中的电影表现力（cinematic expressiveness）。在多角色场景中，生成模型必须超越视听真实性（audio-visual realism），以传达连贯的角色表现及其他更高层级的电影品质。为了填补这一空白，我们引入了 MTAVG-Bench 2.0，这是一个用于诊断多说话人音视频生成中电影表现力失败模式的评测基准。与先前主要关注基本多轮对话（multi-turn dialogue）质量的设置不同，MTAVG-Bench 2.0 针对短剧（short-drama）和场景级生成，并建立了涵盖表演、叙事、氛围和视听语言（audio-visual language）的高层级失败分类体系（taxonomy）。基于此分类体系，我们构建了超过 10,000 个问答评估实例，以及用于短剧级评估和失败模式时间定位（temporal localization）的子集，以系统性地评估多模态大语言模型（omni large language models）诊断高层级视听失败的能力。实验结果表明，商业多模态模型（如 Gemini）显著优于其他评估模型，但即使是最强的模型在我们的基准上仍难以处理复杂失败。这些结果表明，MTAVG-Bench 2.0 为电影多说话人音视频生成中的失败诊断提供了系统性的评测基准。

Abstract

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	3.0/10	6.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	7.0/10	14.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	1.0/10	2.0

评分理由: The paper focuses on MultiModal audio-video generation (9.0) and utilizes MLLMs for evaluation (7.0). It touches on Unify Models by assessing omni models but does not propose unification architectures (3.0). It lacks direct connection to World Models (2.0) and Model-Based RL (1.0).

关键词

MTAVG-Bench, Audio-Video Generation, Cinematic Expressiveness, Failure Diagnosis, Multi-Talker, Omni LLMs, Benchmark Evaluation

深度分析

Chinese Title: MTAVG-Bench 2.0：诊断多说话人音视频生成中电影表现力的失败模式

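Summary: This paper targets multi-talker audio-video generation (MTAVG) and proposes MTAVG-Bench 2.0, a benchmark for diagnosing high-level failure modes of cinematic expressiveness. Existing evaluation focuses on low-level fidelity such as lip-sync and audio-visual alignment, which is insufficient for measuring cinematic qualities in scene-level generation, including character performance, narrative coherence, atmosphere, and audio-visual language. Built on short-drama and scene-level generation, the benchmark defines a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language, and provides more than 10,000 question-answering evaluation instances together with subsets for scene-level assessment and temporal localization of failures. Experiments show that commercial omni models such as Gemini perform best but still fall clearly short on complex failure diagnosis, indicating that current evaluators cannot yet adequately solve high-level failure diagnosis for cinematic multi-talker generation.

Innovations:

Formalizes high-level failure diagnosis as a standalone evaluation problem for scene-level cinematic expressiveness in multi-talker audio-video generation, going beyond conventional low-level fidelity evaluation.
Builds a compact failure taxonomy with three dimensions (acting, atmosphere, and cinematography) and generates more than 10,000 evaluation instances from it.
Designs scene-level assessment and temporal-localization subsets that support fine-grained, temporally localizable diagnosis of high-level failures.
Systematically evaluates current mainstream omni audio-video understanding models and reveals the limitations of commercial models in diagnosing complex character-performance failures.

Methodology: The benchmark is built with a three-stage pipeline. First, hierarchical script prompts are decomposed from classic film scenes, and state-of-the-art text-to-audio-video generation models produce candidate videos. Second, annotators identify evidence of high-level failures and map it to the failure modes defined by the taxonomy. Third, failure-aware diagnostic question-answer pairs are constructed from the reviewed evidence and refined through human annotation, expert discussion, and verification into a diverse set of evaluation instances. For evaluation, omni multimodal large language models answer the QA instances, and their diagnostic accuracy is compared.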
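The final comparison step can be illustrated with a small sketch. It assumes each QA instance is multiple choice and tagged with one taxonomy dimension; the record layout, field names, and sample data are hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for a list of graded QA records.

    Each record is a dict with hypothetical keys:
    'dimension' (taxonomy label), 'gold' (correct option), 'pred' (model answer).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["dimension"]] += 1
        if record["pred"] == record["gold"]:
            correct[record["dimension"]] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy, made-up answers from a single evaluator model.
sample = [
    {"dimension": "acting", "gold": "B", "pred": "B"},
    {"dimension": "acting", "gold": "A", "pred": "C"},
    {"dimension": "atmosphere", "gold": "D", "pred": "D"},
]
scores = accuracy_by_dimension(sample)
```

Reporting accuracy per dimension rather than one pooled number makes it easier to see which failure types an evaluator struggles with.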

Key Results:

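Commercial omni models such as Gemini substantially outperform other evaluators in overall diagnostic performance.
Even the strongest models remain clearly limited on complex failure cases involving character performance.
MTAVG-Bench 2.0 provides a systematic benchmark that exposes the shortcomings of current generation systems in cinematic expressiveness.
High-level failure diagnosis is far from solved, and richer benchmarks are needed to evaluate scene-level expressiveness.

Tech Stack:

Text-to-audio-video generation models (e.g., JavisDiT, Seedance)
Omni multimodal large language models (e.g., Gemini, GPT-4V)
Hierarchical script prompt construction
Failure taxonomy (acting, atmosphere, cinematography)
QA pair construction with human verification
Temporal-localization subset design

Strengths:

First to systematically bring cinematic-expressiveness failure diagnosis into the evaluation of multi-talker audio-video generation, filling a gap in existing benchmarks.
Builds a large-scale, multi-dimensional, temporally localizable evaluation dataset that supports fine-grained analysis.
Experiments cover a range of mainstream omni models, so the results are a useful reference.
The taxonomy is compact and operational, which makes it easy to adopt and extend.

Limitations:

The taxonomy covers only three dimensions (acting, atmosphere, and cinematography) and may not capture every aspect of cinematic expressiveness.
Evaluation instances are built from generation-model outputs and may be constrained by the quality of those models.
Human annotation and expert discussion may introduce subjectivity.
The strongest models still fall short on complex failure diagnosis, which shows the benchmark is hard, but the paper offers no directions for improvement.

Relevance To Keywords:

Unify Models: The paper evaluates several omni multimodal large models (e.g., Gemini), which fall under unified models, but the paper itself is a benchmark rather than a model.
World Models: The paper does not address world models directly, but diagnosing cinematic expressiveness requires world knowledge about scenes and character interaction, so it is indirectly related.
Representation Learning: The paper does not study representation learning directly, but the evaluation task depends on a model's high-level audio-visual representations.
Model-Based RL: The paper does not involve reinforcement learning or model-based RL, so relevance is low.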

27. POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation [PASS]

Score: 44.0 / 26.5

Authors: Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen, Zedong Chu, Zhining Gu, Wei Guo, Xiaolong Cheng, Qiming Li, Kangning Niu, Yanqing Zhu, Xiaolong Wu, Tianlun Li, Mu Xu

Published: 2026-05-27

TL;DR: POINav addresses the final-meters challenge in real-world vision-language navigation by proposing a high-fidelity benchmark and a Brain-Action framework that integrates POI-grounded reasoning with continuous waypoint prediction.

摘要翻译

现实世界导航本质上由兴趣点（POIs）驱动，但到达精确的兴趣点仍是一个关键的“最后几米”挑战。现有的视觉 - 语言导航（VLN）POI 目标导航基准常因生成场景而存在粒度粗糙或显著的仿真到现实差距。为弥合这一差距，我们提出了 POINav-Bench，这是首个专为现实世界 POI 目标导航的闭环评估设计的基准。它包含 11 个利用 3D 高斯泼溅（3DGS）从真实世界采集数据重建的商业区，总面积达 126,398 平方米，涵盖 163 个不同的兴趣点。凭借可通行性感知标注和参考轨迹，POINav-Bench 能够在逼真的、富含兴趣点的现实世界环境中实现对导航智能体的高保真评估。在此基础上，我们提出了 POINav Brain-Action 框架，其中大脑模块执行基于兴趣点的推理，以指导动作模块预测连续航点以供现实世界执行。我们还整理了 POINav-Dataset，包含 7 万个现实世界的标识 - 入口对。实验表明，我们的框架为改进现实世界 POI 目标导航提供了一条可行路径。

Abstract

Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical "final-meters" challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 m² in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	4.0/10	8.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	3.0/10	6.0

Scoring rationale: The paper focuses on Vision-Language Navigation (VLN), making it highly relevant to MultiModal (8.0). The Brain-Action framework unifies reasoning and action, offering moderate relevance to Unify Models (4.0). MLLM is moderately related (4.0) due to language involvement, though the work is not explicitly framed as a large multimodal model. World Models (3.0) and model-based RL (3.0) are less central, as 3DGS is used for reconstruction rather than generative world modeling and the focus is on a benchmark and framework rather than model-based RL algorithms. No expert authors from the provided list appear among the authors.
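The totals in this report appear to be the sum of weight times relevance over the keyword rows. The sketch below reproduces the POINav total under that reading; the function name and the use of `>=` for the pass check are assumptions, since the report only shows the two numbers on the "Score" line.

```python
def weighted_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


# Values copied from the POINav scoring table.
poinav_rows = [
    ("Unify Models", 2.0, 4.0),
    ("World Models", 2.0, 3.0),
    ("MLLM", 2.0, 4.0),
    ("MultiModal", 2.0, 8.0),
    ("model-based RL", 2.0, 3.0),
]
PASS_THRESHOLD = 26.5  # second number on the "Score" line

total = weighted_score(poinav_rows)
# Assumption: a paper passes when its total reaches the threshold.
status = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, status)
```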

Keywords

Vision-Language Navigation, Points of Interest, 3D Gaussian Splatting, Brain-Action Framework, Real-world Benchmark, Final-meters Arrival, POI-grounded Reasoning

In-Depth Analysis

Chinese Title: POINav：在真实世界视觉-语言导航中基准测试与增强最后几米到达

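Summary: This paper addresses the "final-meters" problem of precise arrival in real-world navigation and proposes the POINav ecosystem. First, it builds POINav-Bench, the first closed-loop evaluation benchmark for real-world POI-goal navigation, covering 11 commercial areas (126,398 m² in total, 163 POIs) reconstructed with 3D Gaussian Splatting (3DGS) from real-world LiDAR and photogrammetry captures and integrated into Isaac Sim to narrow the sim-to-real gap. Second, it proposes the POINav Brain-Action framework, in which the Brain module performs POI-grounded reasoning and the Action module predicts continuous waypoints. It also curates the POINav-Dataset of 70K real-world signage-entrance pairs. Experiments indicate that the framework offers a viable path toward precise real-world POI navigation. The authors plan to open-source the full benchmark and dataset to support research on visual navigation and robotics.

Innovations:

Builds POINav-Bench, the first high-fidelity closed-loop benchmark for POI-goal navigation based on real-world 3DGS reconstruction, covering 11 commercial areas and 163 POIs and supporting fine-grained final-meters evaluation.
Proposes the decoupled POINav Brain-Action framework, which splits navigation into POI-grounded reasoning (Brain module) and global-context action querying (Action module), avoiding the error accumulation of end-to-end models.
Curates the POINav-Dataset of 70K real-world signage-entrance pairs, supporting robust POI-grounded reasoning without external priors.
Integrates the 3DGS assets into Isaac Sim to provide a physically accurate simulation environment, with plans to open-source the whole ecosystem for reproducible research.

Methodology: The paper follows a two-stage approach. First, real data from 11 commercial areas are captured with high-precision LiDAR and photogrammetry and reconstructed with 3DGS; POI entrances (horizontal bounding boxes) and traversable regions are annotated manually, and reference trajectories are generated with the A* algorithm, yielding the closed-loop benchmark POINav-Bench. Second, the POINav framework is proposed: the Brain module uses a continuous spatial conditioning mechanism to first localize the POI signage as a semantic anchor and then regress the target entrance, while the Action module simplifies the BridgeNav architecture to predict continuous waypoints. Training uses the 70K real samples of the POINav-Dataset. Evaluation runs closed-loop in Isaac Sim, with distance to the entrance and collisions as the success criteria.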
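To make the reference-trajectory step concrete, here is a minimal A* sketch on a toy occupancy grid. It assumes the traversable-region annotation has been rasterized into a 2D grid with 4-connected unit-cost moves; the grid, cell size, and function name are illustrative, and the benchmark's actual map representation may differ.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on a grid; grid[r][c] == 0 means traversable.

    Returns a list of (row, col) cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance is admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry superseded by a cheaper route
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


# Toy map: 1 marks non-traversable cells; the only gap in the wall is at (1, 2).
toy_map = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = astar(toy_map, (0, 0), (2, 0))
```

In this toy case the path has to detour through the single opening in the middle row, which is the kind of traversability constraint a reference trajectory needs to respect.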

Key Results:

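POINav-Bench covers 11 commercial areas totaling 126,398 m² and 163 distinct POIs, each with a precise entrance bounding box and traversable-region annotations.
The POINav-Dataset contains 70K real-world signage-entrance pairs for training POI-grounded reasoning.
The proposed Brain-Action framework achieves effective closed-loop navigation on POINav-Bench, localizing POI entrances precisely and producing continuous paths.
The planned open-source release will provide the full 3DGS assets and dataset to support community research on real-world navigation and robotics.

Tech Stack:

3D Gaussian Splatting (3DGS) for scene reconstruction
LiDAR and photogrammetry for data capture
NVIDIA Isaac Sim for closed-loop simulation
Qwen2.5-VL-3B as the base vision-language model
BridgeNav architecture (simplified) for the Action module
A* algorithm for generating reference trajectories
Continuous spatial conditioning mechanism for POI-grounded reasoning

Strengths:

First closed-loop POI navigation benchmark built on high-fidelity real-world 3DGS reconstruction, narrowing the sim-to-real gap.
The decoupled Brain-Action framework avoids the error accumulation of end-to-end models and better matches the hierarchical decision-making of real navigation.
A large real-world dataset (70K samples) supports robust POI-grounded reasoning.
The planned open-source release should aid reproducibility and progress in the field.

Limitations:

The benchmark covers only 11 commercial areas, so scene diversity is limited and may not represent all real-world environments.
Pedestrians are removed during 3DGS reconstruction and dynamic obstacles in real environments (pedestrians, vehicles) are not modeled, so evaluation may be overly idealized.
POI entrance annotation is manual, which becomes costly at larger scale.
The framework relies on a pretrained VLM (Qwen2.5-VL-3B) and may be limited by that model's capability.

Relevance To Keywords:

Unify Models: The Brain-Action framework can be seen as one attempt at a unified model, but it does not directly address unified multimodal understanding and generation.
World Models: The 3DGS scenes can be viewed as a form of world-model representation, but the paper does not use a world model for planning.
Representation Learning: The paper learns a joint semantic-geometric representation through POI-grounded reasoning, which relates to representation learning.
Model-Based RL: The paper uses A* to generate reference trajectories but does not explicitly use reinforcement learning or model predictive control, so relevance is weak.
Native multimodal large models: The paper builds on Qwen2.5-VL-3B, a multimodal large model, but does not emphasize native multimodality.
Post-training: The paper does not involve post-training techniques.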

28. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models [PASS]

Score: 42.0 / 26.5

Authors: Fengze Yang, Bo Yu, Xuewen Luo, Cathy Liu, Chenxi Liu

Published: 2026-05-27

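TL;DR: To address the memory and latency bottlenecks that high-resolution visual tokens cause in vision-language models, this paper proposes CIVIC, an end-to-end sequence-compactness framework that shrinks KV-cache memory to roughly one-third and reduces inference latency without degrading accuracy.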

摘要翻译

视觉语言模型（VLMs）因高分辨率视觉标记而面临严重的内存与延迟瓶颈。尽管当前的标记缩减方法理论上能节省浮点运算次数（FLOPs），但事后剪枝会引入结构开销，无法实现相应的墙钟加速（wall-clock acceleration）。然而，强制采用连续紧凑路径可能导致几何迷失（geometric disorientation）及细粒度定位能力的丧失。为克服上述障碍，本文提出 CIVIC，一种路径一致紧凑视觉推理框架（path-consistent compact visual inference framework）。通过在视觉编码器（vision encoder）、投影层、大语言模型预填充（LLM prefill）及键值缓存（KV-cache）之间无缝维持紧凑序列表示，CIVIC 避免了非连续内存访问及局部解合并开销。在 Qwen3-VL 架构上的评估表明，CIVIC 成功将序列缩减转化为真实的物理硬件效率，将键值缓存（KV-cache）内存缩减至基线的约三分之一，并降低了端到端推理延迟。借助文本对齐的 KL 散度蒸馏（text-aligned KL distillation）与自适应空间保留下限（adaptive spatial retention floor），CIVIC 在严格的多模态推理与视觉定位基准上，实现了这些效率里程碑而未损害准确性。

Abstract

Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reduction methods theoretically save FLOPs, post-hoc pruning introduces structural overhead, failing to yield proportional wall-clock acceleration. However, enforcing a contiguous compact pathway risks geometric disorientation and loss of fine-grained localization. To overcome these barriers, this paper introduces CIVIC, a path-consistent compact visual inference framework. By maintaining compact sequence representations seamlessly across the vision encoder, projection layer, LLM prefill, and KV-cache, CIVIC avoids non-contiguous memory access and localized unmerging overheads. Evaluated on the Qwen3-VL architecture, CIVIC successfully translates sequence reductions into genuine physical hardware efficiency, shrinking KV-cache memory to approximately one-third of the baseline and reducing end-to-end inference latency. Enabled by text-aligned KL distillation and an adaptive spatial retention floor, CIVIC achieves these efficiency milestones without degrading accuracy across rigorous multimodal reasoning and visual grounding benchmarks.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	1.0/10	2.0

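Scoring rationale: The paper centers on inference-efficiency optimization for vision-language models, so its relevance to MLLM and MultiModal is high. It does not involve World Models or model-based RL, so relevance there is very low. Although it unifies the compact inference path, it does not address unification of model architectures, so the Unify Models score is low. The author list does not include Yang Shi or the other designated experts, so no bonus is applied.

Keywords

Vision-Language Models, Sequence Compactness, Inference Efficiency, KV-cache, Multimodal Reasoning, Text-aligned KL Distillation, End-to-End

In-Depth Analysis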

Chinese Title: CIVIC：面向高效视觉语言模型的端到端序列紧凑性

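Summary: This paper targets the memory and latency bottlenecks caused by high-resolution visual tokens in vision-language models (VLMs) and proposes CIVIC, a path-consistent compact visual inference framework. Existing token-reduction methods reduce FLOPs in theory, but the structural overhead introduced by post-hoc pruning prevents a proportional wall-clock speedup. CIVIC keeps a compact sequence representation across the vision encoder, projection layer, LLM prefill, and KV-cache, avoiding non-contiguous memory access and localized unmerging overhead. The method uses learnable anchor-based aggregation to convert dense patches into contiguous tokens, applies KV-compressed attention with an adaptive spatial retention floor to preserve localization detail, and uses text-aligned KL distillation to bypass structural training boundaries so that compact embeddings directly replace the dense placeholders in LLM prefill. Experiments on the Qwen3-VL architecture show that CIVIC shrinks KV-cache memory to about one-third of the baseline and reduces end-to-end inference latency while maintaining accuracy on multimodal reasoning and visual grounding benchmarks.

Innovations:

Identifies a compression-realization gap in VLM inference: theoretical FLOP reductions turn into real speedups only when sequence compactness is kept consistent across the vision encoder, projector, and LLM prefill.
Proposes the CIVIC framework, which combines learnable anchor-based aggregation, an adaptive spatial retention floor, and text-aligned KL distillation to obtain an end-to-end compact latent representation and resolve the geometric and training bottlenecks.
Validates on Qwen3-VL that stable compact processing lowers latency and KV-cache footprint while preserving fine-grained multimodal accuracy, removing the dependence on runtime routing or dense restoration.

Methodology: CIVIC follows a path-consistent compact inference design. Learnable compact anchors first aggregate the dense patches (Eq. 8) into a compact sequence. The compact vision encoder then applies KV-compressed attention (Eq. 9), shrinking the attention interaction from Te×Te to Me×S. A compact multimodal projector next inserts the visual tokens into the LLM prefill. Finally, training uses a text-aligned KL distillation loss so that compact embeddings directly replace the dense placeholders. The overall objective minimizes a combination of deployment cost and distillation loss (Eq. 6).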

Key Results:

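On the Qwen3-VL architecture, CIVIC shrinks KV-cache memory to about one-third of the baseline.
End-to-end inference latency drops, turning theoretical FLOP reductions into actual hardware efficiency.
Accuracy does not degrade on multimodal reasoning and visual grounding benchmarks, and fine-grained localization is preserved.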
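For intuition on why a shorter visual sequence translates into a smaller KV-cache, the standard size estimate is linear in sequence length. The sketch below uses that estimate with a made-up model configuration and token counts; none of the numbers are the actual Qwen3-VL or CIVIC settings.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (factor 2 = K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration and token counts, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
text_tokens = 64
dense_visual_tokens = 4096
compact_visual_tokens = 1323  # picked so the total lands near one-third

dense = kv_cache_bytes(text_tokens + dense_visual_tokens, **cfg)
compact = kv_cache_bytes(text_tokens + compact_visual_tokens, **cfg)
ratio = compact / dense
```

Because every other factor is fixed by the model, the ratio depends only on the two sequence lengths, which is why keeping the sequence compact all the way into the cache matters.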

Tech Stack:

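Learnable anchor-based aggregation
KV-compressed attention
Adaptive spatial retention floor
Text-aligned KL distillation
Layer normalization
ℓ2 normalization
Softmax attention
Qwen3-VL architecture

Strengths:

An end-to-end compact pathway avoids the runtime overhead of post-hoc pruning and delivers real wall-clock speedup.
The adaptive spatial retention floor balances compression rate against fine-grained localization accuracy.
Text-aligned KL distillation lets compact embeddings be used directly in LLM prefill without structural adaptation.
Effectiveness is validated on a mainstream VLM architecture, which suggests the approach may generalize.

Limitations:

Evaluation relies on a single architecture (Qwen3-VL); generalization to other VLMs remains to be verified.
Learning the compact anchors may require additional training data or fine-tuning, which adds training cost.
The paper does not discuss performance limits under extreme compression, where accuracy may degrade.
Direct connection to keywords such as reinforcement learning and world models is weak; the focus is inference efficiency.

Relevance To Keywords: The paper is highly relevant to "native multimodal large models", since it directly targets efficient inference for vision-language models. It relates to "representation learning" through compression into compact latent representations and to "post-training" through its use of distillation. It is indirectly related to "unified multimodal understanding and generation", since efficiency gains help deploy unified models. Its connection to "world models" and "reinforcement learning" is weak, as it involves neither environment interaction nor decision optimization.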

29. MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing [PASS]

Score: 42.0 / 26.5

Authors: Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang, Jinrui Zhou, Xin Li, Mingjun Xiao, Enhong Chen

Published: 2026-05-27

TL;DR: MACReD parses chemical reaction diagrams with a unified multi-agent framework built on vision-language models and achieves state-of-the-art performance on the RxnScribe benchmark.

摘要翻译

从科学文献中解析化学反应图具有挑战性，原因在于异构布局、交织的视觉元素以及难以整合识别与推理。现有的视觉语言模型虽推动了多模态理解的发展，但在复杂图表上仍表现不佳，难以在推理过程中保持空间一致性并整合多维信息。为解决这些问题，我们提出了 MACReD，这是一种分层多智能体框架，在统一的 VLM 指导架构下，协调用于分子感知、箭头理解、文本提取和反应重建的专用智能体。规划与感知层使用灵活、细粒度检测来处理视觉复杂性，而推理层则使用多图融合机制来整合异构线索，并强制执行化学一致的全局推理。在 RxnScribe 基准上的实验表明，MACReD 取得了最先进的性能，在硬匹配和软匹配准则下，F1 分数分别为 75.2% 和 84.6%，优于获得 69.1% 和 80.0% 的 RxnScribe 基线。这些结果展示了 MACReD 在多样图布局（包括多步反应和树状结构反应）上的鲁棒性。

Abstract

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	7.0/10	14.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	0.0/10	0.0

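Scoring rationale: The paper proposes a multi-agent collaborative framework with a unified VLM-guided architecture for parsing chemical reaction diagrams. It is highly relevant to MultiModal (vision plus language) and MLLM (it uses vision-language models); the architecture description includes the notion of "unified", which gives some relevance to Unify Models. The paper does not involve environment modeling or reinforcement learning, so it is unrelated to World Models and model-based RL. The author list does not include any of the designated experts.

Keywords

Multi-Agent, Reaction Diagram Parsing, Unified Architecture, Vision-Language Models, Multigraph Fusion, Chemical Reaction, Spatial Coherence

In-Depth Analysis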

Chinese Title: MACReD：面向反应图解析的多智能体协作推理框架

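Summary: The paper proposes MACReD, a hierarchical multi-agent collaborative framework for parsing chemical reaction diagrams from scientific literature. To address the visual complexity, limited generalization, and disconnect between perception and reasoning in existing methods, MACReD decomposes the task into planning, perception, and reasoning layers. The planning layer dynamically coordinates specialized agents; the perception layer uses dedicated agents for molecule localization, arrow understanding, and text extraction to produce structured entities; the reasoning layer uses a multigraph fusion mechanism to integrate spatial structure, chemical semantic constraints, and a VLM-generated reaction graph, yielding chemically consistent global reasoning. On the RxnScribe benchmark, MACReD reaches F1 scores of 75.2% (hard match) and 84.6% (soft match), surpassing the RxnScribe baseline (69.1% / 80.0%), and is robust on multi-step and tree-structured reaction diagrams.

Innovations:

Proposes a hierarchical multi-agent collaborative framework that decomposes complex reaction-diagram parsing into planning, perception, and reasoning layers for flexible task decomposition and coordination.
Designs a multigraph fusion mechanism that combines spatial structure, chemical semantic constraints, and a VLM-generated reaction graph to ensure chemically consistent global reasoning.
Uses dedicated perception-layer agents for molecules, arrows, and text, reducing error propagation among heterogeneous visual elements.
Achieves state-of-the-art performance on the RxnScribe benchmark and generalizes well to complex layouts such as multi-step and tree-structured reactions.

Methodology: MACReD uses a hierarchical multi-agent architecture. The planning layer (Planning Agent) dynamically decomposes the task according to the input query and image content and coordinates the specialized agents. The perception layer contains three dedicated agents for molecule localization, arrow understanding, and text extraction, each emitting structured entities. The reasoning layer builds a shared reaction graph and uses multigraph fusion to integrate spatial relations, chemical constraints (such as atom conservation), and the VLM-induced reaction graph before reconstructing the complete reaction. All agents are built on a VLM (e.g., GPT-4V) with customized system prompts.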
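The atom-conservation constraint mentioned above can be illustrated with a small check that compares element counts on both sides of a candidate reaction. This is only a sketch: it assumes flat molecular formulas without brackets or charges, and the function names are hypothetical. The paper's actual constraint and molecule representation are not specified in this report.

```python
import re
from collections import Counter

# One element symbol (uppercase letter, optional lowercase) plus an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms in a flat molecular formula such as 'C2H6O' (no brackets)."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def conservation_gap(reactants, products):
    """Return per-element (reactant count - product count), omitting balanced elements."""
    left = sum((atom_counts(f) for f in reactants), Counter())
    right = sum((atom_counts(f) for f in products), Counter())
    elements = set(left) | set(right)
    return {e: left[e] - right[e] for e in elements if left[e] != right[e]}


balanced = conservation_gap(["C2H4", "H2O"], ["C2H6O"])   # ethylene hydration
unbalanced = conservation_gap(["C2H4"], ["C2H6O"])        # water omitted
```

Published diagrams often omit byproducts or reagents, so a non-empty gap is better treated as a soft signal for ranking candidate reconstructions than as a hard rejection rule.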

Key Results:

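On the RxnScribe benchmark, hard-match F1 is 75.2% and soft-match F1 is 84.6%, both above the RxnScribe baseline (69.1% / 80.0%).
Performance is robust on multi-step and tree-structured reaction diagrams, supporting the framework's generalization to complex layouts.
The code is publicly available, which aids reproduction and follow-up research.

Tech Stack:

Vision-language model (VLM), e.g., GPT-4V
Multi-agent collaborative framework
Multigraph fusion mechanism
Hierarchical task decomposition (planning, perception, reasoning)
Atom-conservation constraint (chemical consistency)
Dedicated agents: molecule localization, arrow understanding, text extraction

Strengths:

The hierarchical multi-agent design decouples perception from reasoning and reduces error propagation.
Multigraph fusion integrates multiple information sources and improves chemical consistency.
Achieves state-of-the-art results on a standard benchmark and generalizes well to complex layouts.
Open-source code supports reproducibility and follow-up research.

Limitations:

Relies on a VLM (e.g., GPT-4V) as the base model, so it may be constrained by model cost and API availability.
Experiments use only the RxnScribe benchmark; validation on additional datasets is missing.
Atom conservation acts only as a weak regularizer and may not handle diagrams with intentional omissions or annotation errors.
Multi-agent collaboration adds system complexity and inference latency.

Relevance To Keywords: The paper is highly relevant to "native multimodal large models" and "unified multimodal understanding and generation", because it uses VLMs for multimodal understanding and generates structured reaction information. It relates to "representation learning" in that the perception layer extracts visual representations of molecules, arrows, and other elements. Relevance to "world models" and "model-based RL" is weak, since the paper involves neither world-model construction nor reinforcement learning. It has some connection to "post-training" but does not explicitly discuss post-training strategies. Overall, the main contribution lies in multi-agent collaboration and VLM-driven diagram parsing rather than world models or reinforcement learning.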

30. Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models [PASS]

Score: 42.0 / 26.5

Authors: Landi He, Mingde Yao, Shawn Young, Lijian Xu

Published: 2026-05-27

TL;DR: This paper proposes DiffPrune, a fully differentiable token pruning method for Vision-Language Models that maintains high accuracy while significantly accelerating inference without relying on surrogate gradients.

摘要翻译

视觉令牌剪枝通过移除冗余视觉令牌，降低了视觉 - 语言模型（VLMs）的计算成本。现有方法通常依赖 Gumbel-Softmax 来在训练期间近似离散选择。然而，优化是由替代梯度驱动的，而非真实的选取过程，导致令牌重要性学习不可靠。本文提出 DiffPrune，将剪枝重新表述为令牌信息的连续控制，而非离散选择学习。具体而言，我们引入一个信息调节器（Information Throttler），基于重要性得分使用方差保持噪声来调制每个令牌，其中较高的得分在训练期间导致较少的信息抑制。该设计直接作用于令牌表示，自然地为学习令牌重要性提供了完全可微的优化路径。在推理阶段，令牌通过基于学习到的得分进行硬阈值处理而被移除。在十个 VLM 基准上，DiffPrune 保留了 96.5% 的全模型准确率，同时将 LLM 预填充加速 2.85 倍，推理开销仅为 0.69 毫秒。

Abstract

Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	8.0/10	16.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on token pruning for Vision-Language Models (VLMs), making it highly relevant to MultiModal and MLLM, since VLMs are inherently multimodal large models. It is moderately related to Unify Models because VLMs unify vision and language modalities, though the core contribution is efficiency optimization rather than architectural unification. There is no relevance to World Models or Model-Based RL.

Keywords

Vision-Language Models, Token Pruning, Fully Differentiable, Surrogate Gradients, Information Throttler, Model Efficiency, Inference Acceleration

In-Depth Analysis

Chinese Title: 超越代理梯度：面向视觉-语言模型的完全可微令牌剪枝

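Summary: This paper proposes DiffPrune, a fully differentiable visual token pruning framework for reducing the computational cost of vision-language models (VLMs). Existing methods approximate discrete selection with surrogate gradients such as Gumbel-Softmax, which makes optimization unstable. DiffPrune recasts pruning as continuous control of token information rather than discrete selection learning. Its core is the Information Throttler, which modulates each token with variance-preserving noise conditioned on its importance score (higher scores mean less information suppression), providing a fully differentiable optimization path. During training, a Soft Top-K head converts scores into continuous retention weights under a total budget constraint; at inference, tokens are removed by hard thresholding. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x with only 0.69 ms of inference overhead. Experiments also show that its gradient-direction consistency is 28.4x higher than Gumbel-Softmax and that its loss landscape is smoother.

Innovations:

Proposes DiffPrune, a fully differentiable visual token pruning framework that avoids surrogate gradients and gives a fully differentiable optimization path from the scorer to the loss.
Introduces the Information Throttler, which continuously modulates token information with variance-preserving noise (VP-Noise) in place of a discrete keep/drop operation.
Designs a Soft Top-K head that turns importance scores into continuous retention weights under a strict budget, aligned during training with hard Top-K behavior.
Validates the method on several VLMs (LLaVA-1.5-7B, LLaVA-NEXT-7B, Qwen2.5-VL-7B), maintaining high task performance under aggressive pruning with very low inference overhead.

Methodology: DiffPrune comprises a frozen vision encoder, a trainable scorer Sθ, a Soft Top-K mapping ΦK, a token-level throttler T, and a training-only denoiser Dϕ. During training, the scorer outputs importance scores, Soft Top-K converts them into continuous retention weights α that satisfy the total budget K, the throttler applies variance-preserving noise to each token (with noise strength negatively related to α), and the denoiser reconstructs the representation before it is passed to the frozen LLM. At inference, Soft Top-K is replaced by hard Top-K selection and the denoiser is removed. Training involves no discrete operations, so gradients backpropagate through continuous functions throughout. A probe experiment on DeiT-Base compares loss landscapes and gradient-direction consistency.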
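A rough sketch of the throttling idea follows. It assumes the familiar variance-preserving mix from diffusion models, sqrt(α)·x + sqrt(1 − α)·ε with ε drawn from a standard normal; the paper's exact parameterization is not given in this report, so the formula, function names, and sample values are all illustrative assumptions. The Soft Top-K head and denoiser are omitted.

```python
import math
import random


def throttle(token, alpha, rng):
    """Variance-preserving mix: sqrt(alpha) * token + sqrt(1 - alpha) * noise."""
    keep, drop = math.sqrt(alpha), math.sqrt(1.0 - alpha)
    return [keep * x + drop * rng.gauss(0.0, 1.0) for x in token]


def hard_top_k(scores, k):
    """Indices of the k highest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


rng = random.Random(0)
tokens = [[1.0, -1.0], [0.5, 2.0], [-0.3, 0.7]]
alphas = [1.0, 0.5, 0.0]  # hypothetical retention weights from a Soft Top-K head

# Training-time view: tokens are degraded continuously instead of being dropped.
noisy = [throttle(t, a, rng) for t, a in zip(tokens, alphas)]

# Inference-time view: tokens are actually removed by hard selection on scores.
kept = hard_top_k([0.9, 0.2, 0.6], k=2)
```

With α = 1 a token passes through unchanged and with α = 0 it is replaced by pure noise, so α acts as a continuous stand-in for keep or drop. For unit-variance features the mix keeps the variance at α + (1 − α) = 1, which is what "variance-preserving" refers to in this sketch.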

Key Results:

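On DeiT-Base, DiffPrune's gradient-direction consistency is 28.4x higher than Gumbel-Softmax, and its loss landscape is smoother.
Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy and accelerates LLM prefill by 2.85x.
Inference overhead is only 0.69 ms.
High task performance is maintained on LLaVA-1.5-7B, LLaVA-NEXT-7B, and Qwen2.5-VL-7B.

Tech Stack:

Gumbel-Softmax (baseline)
Straight-Through Estimator (STE) (baseline)
Variance-preserving noise (VP-Noise)
Soft Top-K mapping
Denoiser
DeiT-Base (probe experiment)
LLaVA-1.5-7B, LLaVA-NEXT-7B, Qwen2.5-VL-7B (evaluation models)
Filter-normalized 2D slices (loss-landscape visualization)

Strengths:

The fully differentiable framework removes the optimization bias of surrogate gradients, giving more stable gradients and a smoother loss landscape.
Training and inference behavior are consistent (Soft Top-K corresponds to hard Top-K), with no extra relaxation needed.
Validated on several mainstream VLMs with minimal performance loss, substantial speedup, and very low inference overhead.
The method is simple: only the scorer and denoiser are trained while the base model stays frozen, so it is easy to integrate.

Limitations:

The denoiser is used only during training and removed at inference, which may introduce a train-inference mismatch (although experiments indicate the effect is small).
The method relies on a fixed token budget K and does not explore dynamic budget allocation.
It is validated only on visual token pruning and not extended to text tokens or cross-modal pruning.
Performance at extreme pruning ratios (e.g., keeping fewer than 10% of tokens) is not fully discussed.

Relevance To Keywords:

Native multimodal large models: The paper optimizes inference efficiency for vision-language models (VLMs), a key technique for multimodal large models.
Unified multimodal understanding and generation: Pruning can speed up both understanding and generation tasks, but the paper focuses mainly on understanding tasks (classification, question answering, and similar).
Representation learning: Learning token importance through the continuous Information Throttler is in essence learning better visual representations.
World models: Indirectly related; efficient pruning could help world models with long-sequence inference.
Reinforcement learning, post-training: The method belongs to the post-training stage (the base model is frozen) but does not use reinforcement learning.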

31. Hybrid Neural World Models [PASS]

Score: 40.0 / 26.5

Authors: Pranav Lakshmanan, Paras Chopra

Published: 2026-05-27

TL;DR: This paper presents hybrid neural world models that integrate neural surrogates with classical solvers to accelerate physical dynamics simulation and implicitly handle discontinuities via error maps.

摘要翻译

神经代理模型承诺在物理动力学计算中相比经典求解器提供巨大的加速，但在激波、前沿和接触等尖锐动力学事件上会无声地失效。我们提出用于物理动力学的混合神经世界模型：一种在物理状态空间中训练和部署多步长代理模型的方法，该方法涉及一个具有连续步长条件化的单个网络，通过与教科书式参考求解器进行直接监督训练，从而在一次前向传播中预测任意未来状态（在预测步长 T 处）。尽管训练数据、损失函数或架构的任何部分均未对间断位置进行监督，但训练好的代理模型隐式地编码了该信息，仅通过其前向传播即可恢复为每轨迹误差图。该误差图集中在激波、前沿和接触处，而在其他区域保持较小值。该误差图与标准无标签基线（包括深度集成、学习误差头、梯度幅度指示器和局部自适应共形预测）相当或更优，同时仅使用单个训练网络，且不需要校准集或控制方程知识。该方法支持两种操作模式。模式 1 单独运行代理模型以实现最大吞吐量，在偏微分方程（PDE）环境中，相对于教科书式求解器，相同硬件下的 CPU 加速比达到 26 倍至 72 倍。模式 2 利用误差图控制参考求解器的回退机制，推迟不确定轨迹，并在默认操作点处大致将代理模型的残差误差减半。该方法无需修改即可应用于反应 - 扩散、可压缩欧拉方程以及刚体碰撞动力学。

Abstract

Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as shocks, fronts, and contact. We present hybrid neural world models for physical dynamics: a recipe for training and deploying multi-horizon surrogates in physical state space, where a single network with continuous horizon conditioning is trained with direct supervision against textbook reference solvers to predict any future state at horizon T in one forward pass. Although no part of the training data, loss function, or architecture supervises discontinuity location, the trained surrogate encodes it implicitly, recoverable from its forward passes alone as a per-trajectory error map that concentrates on shocks, fronts, and contacts, and stays small elsewhere. The map is competitive with or better than standard label-free baselines including deep ensembles, learned error heads, gradient-magnitude indicators, and locally-adaptive conformal prediction, while using only a single trained network and requiring no calibration set or governing-equation knowledge. The recipe supports two operating points. Mode 1 runs the surrogate alone for maximum throughput, with same-hardware CPU speedups of 26x to 72x against textbook solvers on the PDE environments. Mode 2 uses the error map to gate a reference-solver fallback, deferring uncertain trajectories and roughly halving the surrogate's residual error at the default operating point. The recipe applies without modification across reaction-diffusion, compressible Euler, and rigid-body collision dynamics.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	9.0/10	18.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	6.0/10	12.0

Scoring rationale: World Models is highly relevant, as the title and core focus are modeling physical dynamics (the world) for prediction. Model-based RL is moderately relevant because learning accurate dynamics models is a core component of MBRL, though the paper focuses on simulation acceleration. Unify Models is moderately relevant due to the hybrid approach combining neural networks with classical solvers. MLLM and MultiModal are irrelevant, as the work involves physical simulation without language or multimodal inputs. No expert authors from the specified list were found.

Keywords

Hybrid Neural World Models, Physical Dynamics, Neural Surrogates, Reference Solvers, Error Map, Discontinuity Handling, Multi-horizon Prediction

In-Depth Analysis

Chinese Title: 混合神经世界模型

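Summary: This paper proposes hybrid neural world models for simulating physical dynamics. A surrogate network for multi-horizon prediction is trained with continuous horizon conditioning (FiLM) and direct supervised regression, so that it predicts any future state in one step. After training, an error map can be extracted without additional training, a calibration set, or knowledge of the governing equations; it localizes discontinuous events (shocks, fronts, contacts) by comparing a single-step prediction with a two-step chained prediction. Based on this error map, the paper designs two deployment modes: Mode 1 uses only the surrogate and achieves 26x to 72x speedups on CPU, while Mode 2 uses the error map to gate a reference-solver fallback and roughly halves the residual error. The method is validated on three physical systems (reaction-diffusion, compressible Euler, and rigid-body collision) and applies without modification. The error map matches or outperforms baselines such as deep ensembles and learned error heads at detecting discontinuous regions.

Innovations:

Proposes a training recipe for multi-horizon shortcut surrogates: continuous horizon conditioning (FiLM), direct supervised regression, and 10% DAgger refinement, which avoids the collapse of self-consistency losses to the identity map in physical state space.
Designs an inference-time error map computed from the trained surrogate alone, with no additional training, calibration set, or equation knowledge; it localizes unreliable regions by comparing a single-step prediction with a chained half-horizon prediction.
Proposes two deployment modes: Mode 1 runs the surrogate alone for high throughput, and Mode 2 uses the error map to gate a reference-solver fallback, roughly halving residual error at the default operating point.
Validates the method on three different physical systems (reaction-diffusion, compressible Euler, and rigid-body collision) without modification, demonstrating cross-domain generality.
The error map matches or outperforms label-free baselines (deep ensembles, learned error heads, gradient-magnitude indicators, and locally adaptive conformal prediction) at detecting discontinuous events.

Methodology: The approach has three parts. (1) Train a multi-horizon surrogate, using a U-Net (2D PDEs) or a residual MLP (low-dimensional states) with the continuous horizon T embedded through FiLM. Training data come from full rollouts of the reference solver, horizons are sampled uniformly from {2, 4, 8, 16, 32, 64}, and the loss is direct supervised regression in normalized state space. (2) Compute the error map at inference by comparing the single-step prediction fθ(s0, T) with the two-step chained prediction fθ(fθ(s0, T/2), T/2), taking the per-pixel channel norm for spatial states. (3) Deploy in one of two modes: Mode 1 uses the surrogate directly; Mode 2 computes the spatial mean of the error map for each trajectory and falls back to the reference solver if it exceeds a threshold, which is set on a validation set from the keep fraction q (default 0.75).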
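The error map and the Mode 2 gate can be sketched in a few lines. The version below works on a flat list of scalar cells, so the per-pixel channel norm reduces to an absolute difference; the two toy surrogates, the function names, and the quantile rule used for the threshold are assumptions for illustration, not the paper's implementation.

```python
import math


def error_map(surrogate, state, horizon):
    """Per-cell gap between a one-shot prediction and two chained half-horizon predictions."""
    direct = surrogate(state, horizon)
    chained = surrogate(surrogate(state, horizon / 2), horizon / 2)
    return [abs(a - b) for a, b in zip(direct, chained)]


def keep_threshold(val_scores, q=0.75):
    """Validation score at or below which roughly a fraction q of trajectories fall."""
    ordered = sorted(val_scores)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]


def predict(surrogate, solver, state, horizon, threshold):
    """Mode 2: use the surrogate unless the mean error-map value exceeds the threshold."""
    emap = error_map(surrogate, state, horizon)
    if sum(emap) / len(emap) > threshold:
        return solver(state, horizon), "solver"
    return surrogate(state, horizon), "surrogate"


def exact(state, horizon):
    """Toy self-consistent dynamics: exponential decay."""
    return [x * math.exp(-0.1 * horizon) for x in state]


def linearized(state, horizon):
    """Toy surrogate that is not self-consistent across horizons."""
    return [x * (1.0 - 0.1 * horizon) for x in state]


consistent_gap = error_map(exact, [1.0, 2.0], 4.0)
inconsistent_gap = error_map(linearized, [1.0, 2.0], 4.0)
_, route = predict(linearized, exact, [1.0, 2.0], 4.0, threshold=0.01)
```

The self-consistent toy model produces a gap near zero, while the linearized one disagrees with its own chained prediction and is routed to the solver. Lowering q in `keep_threshold` sends more trajectories to the solver, trading speed for accuracy.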

Key Results:

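Mode 1 achieves 26x to 72x speedups on CPU over the reference solvers on the PDE environments.
Mode 2 roughly halves the surrogate's residual error at the default operating point (q = 0.75).
The error map is competitive with or better than baselines including deep ensembles (K=3), learned error heads, gradient-magnitude indicators, and locally adaptive conformal prediction at detecting discontinuous events (shocks, fronts, contacts), as measured by AUROC.
The method works on all three systems (Oregonator reaction-diffusion, 2D compressible Euler, and 3D rigid-body collision) without modification.
Training with a self-consistency loss causes the model to collapse to the identity map, so direct supervision is necessary.

Tech Stack:

U-Net (2D grid-structured PDEs)
Residual MLP (low-dimensional state vectors)
FiLM (feature-wise linear modulation) for horizon conditioning
DAgger (dataset aggregation) refinement (10%)
Direct supervised regression loss (normalized state space)
Error map: difference between single-step and two-step chained predictions
Reference solvers: Oregonator reaction-diffusion, 2D compressible Euler, 3D rigid-body collision
Baselines: deep ensembles, learned error heads, gradient-magnitude indicators, locally adaptive conformal prediction

Strengths:

Simple and general: applies to PDE and ODE physical systems without modification and is architecture-agnostic.
The error map needs no additional training, calibration set, or equation knowledge; it is computed from the trained surrogate alone at low cost.
The two deployment modes trade speed against accuracy flexibly, controlled by a single parameter q.
Validated on several physical systems with large speedups (26x to 72x) and error detection that compares well with multiple label-free baselines.
A theoretical analysis derives an upper bound on the error map in smooth regions (Appendix A.3), explaining why it works.

Limitations:

The theoretical bound is one-directional: it guarantees small values in smooth regions but gives no lower bound at discontinuities, so it relies on empirical validation.
The method assumes the training data cover all horizons of interest (a geometric series) and that the half horizon is in the training set, which restricts horizon choice.
Mode 2's trajectory-level fallback aggregates the error map over the whole trajectory and cannot fall back on part of the domain (which would require boundary coupling).
Surrogate error at extreme discontinuities may remain large; Mode 2 mitigates this through fallback but cannot eliminate it.
Experiments cover only three physical systems and do not include more complex high-dimensional or coupled systems.

Relevance To Keywords:

World models: The paper directly proposes "hybrid neural world models" and trains a surrogate to predict physical world states.
Model-based RL: The surrogate can accelerate environment simulation and the error map can support trust-aware decisions, which relates to uncertainty quantification in model-based RL.
Representation learning: The surrogate learns a horizon representation through FiLM, and the error map implicitly encodes discontinuous events.
Post-training: The DAgger refinement and inference-time error-map extraction are post-training-stage techniques.
Native multimodal large models / unified multimodal understanding and generation: The paper focuses on physical dynamics and does not involve multiple modalities, so relevance is weak.
Unify Models: The paper does not discuss unified models directly, but the hybrid neural-classical solver can be seen as a unified framework.
Reinforcement learning: The paper does not use reinforcement learning, though Mode 2's thresholding is loosely analogous to action selection in RL.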

32. Exploratory Experience Shapes the Geometry of Predictive Representations [PASS]

Score: 40.0 / 26.5

Authors: Kseniia Shilova, Abdelrahman Sharafeldin, Advay Balakrishnan, Hannah Choi

Published: 2026-05-27

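TL;DR: The study finds that exploratory behavioral strategies produce more spatially organized internal predictive representations, and that the latent-space structure of exploratory agents aligns more closely with that obtained from exploratory mice.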

摘要翻译

主动感知通过行动 - 感知循环将行为与学习联系起来：行动决定了用于更新内部感知预测模型的观测值，这些观测值随后引导后续行动。预测编码（Predictive-coding）框架为模拟这一过程提供了一种自然的方式，因为内部表征会持续更新以预测未来的观测值。在此，我们探讨探索性（exploratory）与利用性（exploitative）行为策略如何塑造这些内部预测表征。我们在一个树状迷宫中构建了一个在线学习智能体（agent），该智能体包含一个可控参数，用于调节探索性与利用性模式之间的平衡。该智能体根据其自身行为产生的经验，更新一个基于预测编码的感知模型。该模型同时预测未来的迷宫状态和奖励概率，使智能体能够在探索期间依据预期信息增益（expected information gain）选择行动，或在利用期间依据预测奖励选择行动。我们发现，所得到的内部预测表征强烈依赖于智能体的行为模式（behavioral regime）。探索性智能体发展出的表征具有更强的空间组织性，并在潜在空间（latent space）中更好地保持了迷宫转换的结构。相比之下，利用性智能体学习到的表征组织性较差。随后，我们在缺水小鼠（water-deprived mice）导航同一迷宫的自然轨迹上训练该预测模型，并将所得表征与从智能体轨迹上学到的表征进行比较。更具探索性的小鼠展现出与探索性智能体高度匹配的表征几何结构（representational geometries），而访问模式更受限的小鼠则类似于受奖励驱动的利用性智能体。综上所述，这些发现表明，探索使预测模型能够通过围绕空间位置和转换上下文组织潜在空间，在人工智能体和动物中形成泛化的内部表征。

Abstract

Active sensing links behavior and learning through an action-perception loop: actions determine the observations used to update internal predictive models of perception, which subsequently guide the next actions. Predictive-coding frameworks provide a natural way to model this process, since internal representations are continuously updated to predict future observations. Here, we ask how exploratory and exploitative behavioral strategies shape these internal predictive representations. We build an online learning agent in a tree-like maze with a controllable parameter regulating the balance between exploratory and exploitative regimes. The agent updates a predictive-coding-based perception model from experience generated by its own behavior. The model predicts both future maze states and reward probability, allowing the agent to select actions either by expected information gain during exploration or by predicted reward during exploitation. We show that the resulting internal predictive representations depend strongly on the agent's behavioral regime. Exploratory agents develop representations that are more spatially organized and better preserve the structure of maze transitions in latent space. In contrast, exploitative agents learn less organized representations. We then train this predictive model on natural trajectories of water-deprived mice navigating the same maze and compare the resulting representations with those learned from agent trajectories. More exploratory mice show representational geometries that closely match those of exploratory agents, whereas mice with more restricted visitation patterns resemble reward-driven, exploitative agents. Together, these findings suggest that exploration enables predictive models to form generalized internal representations by organizing latent space around both spatial location and transition context in artificial agents and animals.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	8.0/10	16.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	9.0/10	18.0

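Scoring rationale: The paper builds its agent on a predictive-coding framework and selects actions by predicting future states and rewards, which fits the definitions of World Models and model-based RL well, hence the high scores (8.0 and 9.0). It does not involve large language models or multimodal data fusion, so MLLM and MultiModal relevance is very low (0.0 and 1.0). Unify Models is not a focus (no unification of multiple model architectures), so its score is low (2.0). The author list does not include Yang Shi or the other designated experts, so no bonus is applied. The weighted total of 40.0 is above the dynamic pass threshold of 26.5.

Keywords

Predictive Representations, Exploratory Behavior, Predictive Coding, Latent Space Geometry, Active Sensing, Model-Based Reinforcement Learning, Internal World Models, Exploitative Behavior

In-Depth Analysis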

Chinese Title: 探索性经验塑造预测表征的几何结构

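Summary: This paper studies how exploratory and exploitative behavioral strategies in active sensing affect the geometry of internal predictive representations. The authors build an online learning agent that navigates a binary-tree maze with a predictive-coding-based perception model that predicts both the next maze node and the reward probability. The agent generates different behavioral trajectories by adjusting the balance between exploration (driven by expected information gain) and exploitation (driven by predicted reward). Purely exploratory agents form latent representations that are spatially well organized and preserve the maze's transition structure, whereas exploitative agents, whose experience is more restricted, learn less organized representations. When the same model is trained on trajectories of real mice, more exploratory mice yield geometries similar to those of exploratory agents, while reward-driven mice resemble exploitative agents. The conclusion is that exploration lets predictive models form generalized internal representations by organizing latent space around spatial location and transition context.

Innovations:

Treats the behavioral sampling strategy as the experimental variable and systematically studies its effect on the geometry of predictive representations.
Uses a single predictive-coding model to support both information-driven exploration and reward-driven exploitation while keeping the learning objective fixed.
Examines the link between exploratory experience and spatially organized representations in both artificial agents and real mouse data.
Shows that exploration organizes latent space around spatial location and transition context, producing generalized representations.

Methodology: The paper uses an online learning setup in which an agent navigates a binary-tree maze. The core is an action-conditioned predictive-coding model with a prior network (which predicts the latent state from the current recurrent state, node, and action) and a posterior network (which infers the posterior latent state from the observed next node). The recurrent state is updated by a GRU, and a decoder predicts the next node and the reward probability from the latent state. For action selection, exploration mode samples actions by expected information gain (EIG), and exploitation mode builds a value map from predicted reward; the probability of exploitation mode controls the exploration-exploitation balance. The training loss is the cross-entropy of the next node and the reward. For mice, the same predictive model is trained on the real trajectories and the resulting latent geometries are compared.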
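The mode switch described above can be sketched as a two-branch action selector. This is a simplification under stated assumptions: per-action EIG and predicted-reward values are taken as given inputs, exploitation is reduced to a greedy pick on predicted reward rather than a full value map, and all names are hypothetical.

```python
import random


def select_action(eig, reward, p_exploit, rng):
    """Pick an action index.

    With probability p_exploit, act greedily on predicted reward (exploitation);
    otherwise sample an action in proportion to expected information gain.
    """
    actions = list(range(len(eig)))
    if rng.random() < p_exploit:
        return max(actions, key=lambda a: reward[a])
    if sum(eig) <= 0:
        return rng.choice(actions)  # no information signal: fall back to uniform
    return rng.choices(actions, weights=eig, k=1)[0]


rng = random.Random(0)
eig = [0.0, 1.0, 0.0]      # made-up information gain per action
reward = [5.0, 0.0, 1.0]   # made-up predicted reward per action

exploit_choice = select_action(eig, reward, p_exploit=1.0, rng=rng)
explore_choice = select_action(eig, reward, p_exploit=0.0, rng=rng)
```

Sweeping `p_exploit` between 0 and 1 gives a family of agents whose trajectories range from broad coverage of the maze to repeated visits along rewarded paths, which is the experimental variable the paper manipulates.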

Key Results:

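Purely exploratory agents learn an accurate transition model and a spatially organized latent geometry, with smooth latent trajectories that reflect the maze structure.
Increasing reward-driven behavior weakens the organization of the latent representation, even though the agent still reaches the reward.
Models trained on trajectories of exploratory mice produce geometries similar to those of exploratory agents, while reward-driven mice resemble exploitative agents.
Exploration organizes latent space around spatial location and transition context, forming generalized internal representations.

Tech Stack:

Predictive coding
Gated recurrent unit (GRU)
Cross-entropy loss
Expected information gain (EIG)
Value map
Latent-space geometry analysis
Binary-tree maze environment

Strengths:

Controlling the behavioral-strategy variable cleanly isolates the effect of the sampling strategy on representations.
A single model supports both exploration and exploitation, which simplifies comparison.
Combining artificial agents with real animal data strengthens the ecological validity of the conclusions.
Sheds light on how representations form within the behavior-learning loop, contributing to cognitive-map theory.

Limitations:

The environment is a simple binary-tree maze and may not generalize to more complex or continuous spaces.
The model assumes deterministic transitions and does not consider stochasticity or partial observability.
The mouse trajectories come from existing experiments, with no causal intervention to verify the findings.
Only the geometry of latent representations is analyzed; neurobiological correspondence is not explored in depth.

Relevance To Keywords:

Unify Models: The paper uses a single predictive-coding model to integrate perception, prediction, and action selection, in line with the idea of unified models.
World Models: The agent learns environment transitions and reward prediction, which is in essence learning a world model.
Representation Learning: The core question is how behavior shapes the geometry of predictive representations, which falls under representation learning.
Model-Based RL: The agent uses its predictive model for planning (value map) and information-gain computation, which is model-based reinforcement learning.
Native multimodal large models / unified multimodal understanding and generation: The paper does not address multimodality directly, though its predictive-coding framework could be extended to multimodal prediction and it emphasizes unifying the generation and understanding of representations.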

33. FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales [PASS]

Score: 38.0 / 26.5

Authors: Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker, Kasper Johansen, Bernard Ghanem, Fernando T. Maestre, Matthew F. McCabe

Published: 2026-05-27

TL;DR: FLORO is a multimodal geospatial foundation model that learns transferable representations across heterogeneous sensors and scales, achieving strong performance on remote sensing tasks despite using a smaller pretraining corpus than competitors.

摘要翻译

基础模型为可迁移的遥感表示提供了一种有前景的途径，但许多当前方法依赖于规模庞大的预训练数据集和固定的传感器配置，限制了其在生态和环境应用中的适用性，因为在这些应用中，观测数据往往因平台、空间与光谱分辨率以及可用模态的不同而存在差异。我们提出 FLORO，一种多模态地理空间基础模型，旨在从规模较小但高度多样化的遥感语料库中学习可迁移表示。FLORO 基于掩码自编码（Masked Autoencoding）在 Sentinel-1、Sentinel-2、SkySAT 影像、高程数据以及无人机衍生数据的异质组合上进行预训练。为适应传感器变异性，FLORO 引入了可用性感知输入（availability-aware inputs），用于指示每个样本中存在的光谱波段及辅助模态，从而在不同异质传感器配置之间实现统一输入空间。我们在 PANGAEA 基准上，基于冻结编码器协议（frozen-encoder protocol）评估了 FLORO，涵盖场景分类、分割和回归任务。尽管相较于竞争性的基础模型，FLORO 是在规模更小的语料库上进行预训练的，但它仍实现了强大且稳定的迁移能力，涵盖了光学、光学 -SAR 以及光学 - 高程基准，这些基准涉及中分辨率卫星、机载以及超高分辨率无人机影像。FLORO 在六个 PANGAEA 基准上的平均分割性能位居第二，仅落后于最近引入的一个预训练图像数量多两个数量级以上的基础模型；在场景分类任务中保持竞争力，在回归任务中表现稳健；定性结果显示，在洪水、城市、生物量及树冠高度预测场景中，空间结构的保留效果有所改善。在 EuroSAT-MS 上的另一项独立控制实验中，地理位置编码（geo-positional encoding）相较于绝对位置编码（absolute positional encoding）进一步提升了分类性能。

Abstract

Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper introduces FLORO, a multimodal geospatial foundation model, making 'MultiModal' highly relevant (9.0). It unifies heterogeneous sensor inputs into a unified space, moderately aligning with 'Unify Models' (5.0). However, the work focuses on representation learning for remote sensing without language models, reinforcement learning, or world model architectures, resulting in low scores for 'MLLM', 'World Models', and 'model-based RL' (1.0-2.0). No specified expert authors are present.

Keywords

Multimodal Geospatial Foundation Model, Transferable Representations, Remote Sensing, Heterogeneous Sensors, Masked Autoencoding, PANGAEA Benchmark

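In-Depth Analysis

Chinese Title: FLORO：面向生态遥感的跨传感器与跨尺度的多模态地理空间基础模型

Summary: The paper presents FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained with masked autoencoding (MAE) on heterogeneous data: Sentinel-1, Sentinel-2, SkySAT imagery, elevation data, and UAV-derived products. To handle sensor variability, the model adds availability-aware inputs that explicitly encode which spectral bands and auxiliary modalities are present in each sample, so heterogeneous sensor configurations share one input space. FLORO is evaluated on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression. Although its pretraining corpus is smaller than those of competing models, FLORO transfers strongly and stably across optical, optical-SAR, and optical-elevation benchmarks. It ranks second in average segmentation performance across six benchmarks, behind only a foundation model pretrained on over two orders of magnitude more images, stays competitive on scene classification, and is robust on regression. Qualitatively, FLORO better preserves spatial structure in flood, urban, biomass, and canopy-height prediction. A separate controlled experiment on EuroSAT-MS shows that geo-positional encoding further improves classification over absolute positional encoding. Overall, FLORO indicates that a small but heterogeneous pretraining corpus can still support transferable geospatial representations, offering a practical framework for ecological and environmental monitoring.

Innovations:

Availability-aware multimodal input modeling that explicitly encodes the spectral bands and auxiliary modalities available in each sample, giving heterogeneous sensor configurations a unified input space.
A cross-scale multimodal pretraining corpus combining satellite (Sentinel-1/2, SkySAT), airborne (elevation models), and UAV (multispectral, RGB, structural products) data, covering a range of spatial resolutions and sensor characteristics.
Geo-positional encoding, shown in a controlled experiment to improve downstream classification over absolute positional encoding.
Standardized benchmark evaluation under a frozen-encoder protocol, showing that a small but diverse pretraining corpus can still transfer strongly, which challenges the "scale above all" paradigm.

Methodology: FLORO builds on masked autoencoding (MAE) and a multimodal Transformer architecture. During pretraining, the model randomly masks a subset of image patches from heterogeneous remote sensing data and reconstructs the missing content with an encoder-decoder. The encoder takes availability-aware inputs: each input token carries an availability vector indicating which spectral bands and auxiliary modalities (such as SAR or elevation) are present in the sample, which lets the model handle different sensor combinations. Pretraining data include Sentinel-1 GRD SAR, Sentinel-2 L2A surface reflectance, SkySAT surface reflectance and digital terrain models, and UAV multispectral/RGB orthomosaics and digital surface models. Downstream evaluation uses the frozen-encoder protocol on PANGAEA for scene classification, semantic segmentation, and regression, plus an additional geo-positional encoding ablation on EuroSAT-MS.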
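The availability-aware input idea can be sketched as follows. This is a minimal illustration, not FLORO's implementation: the channel vocabulary `CHANNELS`, the function `to_unified_input`, and the zero-fill convention for missing channels are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical channel vocabulary (an assumption; the paper's band list is not given here).
CHANNELS: List[str] = ["S2_B02", "S2_B03", "S2_B04", "S2_B08", "S1_VV", "S1_VH", "ELEVATION"]


def to_unified_input(
    sample: Dict[str, List[float]], n_pixels: int
) -> Tuple[List[List[float]], List[int]]:
    """Stack a sample into a fixed channel layout plus a 0/1 availability vector."""
    stacked: List[List[float]] = []
    available: List[int] = []
    for name in CHANNELS:
        if name in sample:
            values = sample[name]
            if len(values) != n_pixels:
                raise ValueError(f"{name}: expected {n_pixels} values, got {len(values)}")
            stacked.append(list(values))
            available.append(1)
        else:
            # Missing channel: zero placeholder (assumed convention), flagged as absent.
            stacked.append([0.0] * n_pixels)
            available.append(0)
    return stacked, available


# An optical-only sample with three bands and two pixels.
optical_only = {"S2_B02": [0.1, 0.2], "S2_B03": [0.3, 0.4], "S2_B04": [0.5, 0.6]}
stacked, available = to_unified_input(optical_only, n_pixels=2)
print(available)  # 1 marks a channel that is present in this sample
```

Because every sample is mapped to the same channel layout, an encoder can consume optical-only, optical-SAR, and optical-elevation samples with one input shape and rely on the availability vector to tell real zeros from missing data.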

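Key Results:

Across six PANGAEA segmentation benchmarks, FLORO ranks second in average performance, behind only a foundation model pretrained on over two orders of magnitude more images.
On scene classification, FLORO remains competitive, with only a small gap to models pretrained at much larger scale.
On regression tasks (such as biomass and canopy-height prediction) it is robust, and qualitative results show better preservation of spatial structure.
In the controlled EuroSAT-MS experiment, geo-positional encoding further improves classification over absolute positional encoding.
Despite the smaller pretraining corpus, FLORO transfers stably across optical, optical-SAR, and optical-elevation sensor combinations.

Tech Stack:

Masked Autoencoder (MAE)
Multimodal Transformer architecture
Availability-aware input encoding
Geo-positional encoding
Sentinel-1 GRD SAR data
Sentinel-2 L2A surface reflectance data
SkySAT imagery and digital terrain models
UAV multispectral/RGB orthomosaics and digital surface models
PANGAEA benchmark (scene classification, segmentation, regression)
EuroSAT-MS dataset (ablation)

Strengths:

The availability-aware mechanism handles heterogeneous sensor inputs flexibly and is practical.
The pretraining corpus is small but highly diverse, suggesting that data diversity can partly compensate for sheer scale.
Evaluation under a standardized frozen-encoder protocol makes the results reproducible and comparable.
Coverage of several downstream tasks (classification, segmentation, regression) supports the model's broad applicability.
The geo-positional encoding ablation offers useful insight.

Limitations:

The small pretraining corpus may cap performance on some tasks.
Computational cost and training efficiency are not discussed in detail.
Evaluation is limited to the PANGAEA benchmark, without validation in more real-world ecological monitoring scenarios.
Availability-aware inputs may add model complexity, and the strategy for handling missing modalities needs further study.
There is no direct comparison with more recent large-scale foundation models (such as Prithvi or SatMAE).

Relevance To Keywords:

Representation learning: FLORO learns transferable representations through masked autoencoding, so representation learning is its core.
Multimodal large models: the model fuses optical, SAR, elevation, and UAV data, which places it among multimodal models.
World models: a geospatial foundation model can be seen as partially modeling the state of the Earth's surface, which relates to the "world model" concept (though the paper does not say so explicitly).
Model-based RL: the paper does not involve reinforcement learning; relevance is weak.
Native multimodal large models: FLORO handles multimodal inputs from the pretraining stage onward, consistent with a native multimodal design.
Unified understanding and generation in multimodal large models: MAE involves both understanding (encoding) and generation (decoder reconstruction), but the paper focuses on representations for understanding.
Post-training: the paper evaluates with a frozen encoder and involves no post-training (such as instruction tuning or RLHF); relevance is low.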

34. Refining Multidimensional Video Reward Models via Disentangled Influence Functions (PASS)

Score: 38.0 / 26.5

Authors: Muyao Wang, Zeke Xie, Hideki Nakayama

Published: 2026-05-27

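TL;DR: This paper proposes a disentangled influence-function framework that refines multidimensional video reward models for text-to-video generation by addressing dimensional heterogeneity, significantly improving the reward model's alignment with ground-truth annotations.

Abstract (Chinese translation)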

随着文本到视频（T2V）生成模型的持续演进，视频评估的复杂性要求在多个维度上进行细粒度的评估。为此，近期研究致力于开发多维视频奖励模型（MVRMs），该模型通过分解评估过程，以更好地契合人类视觉感知的多维度特性。然而，训练有效的 MVRMs 从根本上受到视频数据复杂性质的挑战。本文识别出一种关键现象，称为“维度异质性”（Dimensional Heterogeneity）：训练样本的可靠性在不同评估维度上可能存在显著差异，即一个样本可能为某一目标提供可靠的监督，却可能对另一目标诱导较高的监督风险。因此，基于全局标量指标进行过滤的主流数据为中心方法对于 T2V 任务而言是不适定的。为此，我们提出一个解耦影响框架，该框架能够高效估计维度特定的监督风险。基于此框架，我们提出了两种维度解耦精炼策略：维度解耦剪枝（Dimension-Disentangled Pruning），用于移除极端高风险样本；以及维度解耦重加权（Dimension-Disentangled Reweighting），用于软性降低高风险监督的权重。大量实验表明，我们的解耦策略显著优于全局过滤基线，生成的奖励模型与真实值（ground truth）具有更优的对齐效果。

Abstract

As Text-to-Video (T2V) generation models continue to evolve, the complexity of video evaluation necessitates a fine-grained assessment across various axes. To address this, recent works have focused on developing Multidimensional Video Reward Models (MVRMs), which decompose the evaluation process to better align with the multifaceted nature of human visual perception. However, training effective MVRMs is fundamentally challenged by the complex nature of video data. In this work, we identify a critical phenomenon termed Dimensional Heterogeneity: the reliability of a training sample can vary substantially across evaluation dimensions, meaning that a sample may provide reliable supervision for one objective while inducing high supervision risk for another. Consequently, prevailing data-centric methods that filter based on global scalar metrics are ill-posed for T2V tasks. To address this, we propose a disentangled influence framework that efficiently estimates dimension-specific supervision risk. Leveraging this framework, we introduce two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning, which removes extreme high-risk samples, and Dimension-Disentangled Reweighting, which softly down-weights high-risk supervision. Extensive experiments demonstrate that our disentangled strategies significantly outperform global filtering baselines, yielding reward models with superior alignment to ground truth.

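Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	3.0/10	6.0

Scoring rationale: The paper's core is refining multidimensional video reward models (MVRMs) for text-to-video (T2V) generation, using disentangled influence functions to address dimensional heterogeneity. It is highly relevant to MultiModal (T2V spans text and video) and somewhat related to MLLM and model-based RL (it falls under multimodal AI and reward modeling for reinforcement learning), but only weakly related to Unify Models and World Models. The author list does not include the designated experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang. The weighted total is 38.0, above the dynamic pass score of 26.5.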
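The weighted total quoted above can be reproduced from the score table: each row's score is weight times relevance, and the total is compared against the pass score. The sketch below is only an arithmetic check; the strict "greater than" pass rule is an assumption inferred from the wording "above the dynamic pass score", and how the report derives 26.5 is not stated.

```python
PASS_THRESHOLD = 26.5  # dynamic pass score quoted in the report

# (keyword, weight, relevance out of 10), copied from the table for paper 34.
rows = [
    ("Unify Models", 2.0, 2.0),
    ("World Models", 2.0, 2.0),
    ("MLLM", 2.0, 4.0),
    ("MultiModal", 2.0, 8.0),
    ("model-based RL", 2.0, 3.0),
]


def weighted_total(table):
    """Sum weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in table)


total = weighted_total(rows)
passed = total > PASS_THRESHOLD  # assumed pass rule
print(total, passed)
```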

Keywords

Multidimensional Video Reward Models, Text-to-Video Generation, Disentangled Influence Functions, Dimensional Heterogeneity, Supervision Risk, Dimension-Disentangled Pruning, Dimension-Disentangled Reweighting

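In-Depth Analysis

Chinese Title: 通过解耦影响函数优化多维视频奖励模型

Summary: The paper addresses uneven training-data quality for multidimensional video reward models (MVRMs) in text-to-video generation. It identifies and defines "dimensional heterogeneity": the supervision reliability of a single training sample can differ substantially across evaluation dimensions. Existing data-filtering methods based on a global scalar metric cannot capture this dimension-specific risk. The authors therefore propose a disentangled influence-function framework that efficiently estimates per-dimension supervision risk through a diagonal projection, and design two dimension-disentangled refinement strategies: Dimension-Disentangled Pruning (DDP) and Dimension-Disentangled Reweighting (DDR). Experiments show that the method significantly outperforms global filtering baselines and yields reward models better aligned with ground truth.

Innovations:

First systematic identification and naming of "dimensional heterogeneity" in video reward model training, exposing the limits of the global data-quality assumption.
A disentangled influence-function framework that decomposes scalar influence into a cross-dimension interaction matrix, enabling estimation of dimension-specific supervision risk.
Two strategies driven by dimension-specific risk signals, hard pruning (DDP) and soft reweighting (DDR), offering a flexible trade-off between robustness and data utilization.
An extension of influence functions from single-objective settings to multidimensional reward models, filling a gap in data-refinement methods for this area.

Methodology: First, the paper derives disentangled influence functions, decomposing the standard scalar influence into a K×K matrix of cross-dimension interactions and using a diagonal projection to extract each dimension's self-influence as a proxy for supervision risk. Second, it computes the required gradients efficiently with the TracIn approximation, avoiding explicit Hessian inversion. Finally, it applies two strategies based on the dimension-specific risk signal: DDP hard-filters extreme high-risk samples, and DDR softly down-weights high-risk samples. Comparative experiments on several video reward model datasets validate the method.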
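The steps above can be sketched in a few lines. This is an illustrative reading, not the authors' code: it assumes a TracIn-style sum over checkpoints, M[j][k] = Σ_t η_t ⟨g_j^t, g_k^t⟩, for a single training sample with K per-dimension loss gradients, and the pruning threshold and the reweighting formula 1 / (1 + risk / temperature) are placeholders chosen for the example.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def influence_matrix(grads_per_ckpt: List[List[Vector]], lrs: List[float]) -> List[List[float]]:
    """K x K interaction matrix for one sample, summed over checkpoints (TracIn-style)."""
    k = len(grads_per_ckpt[0])
    m = [[0.0] * k for _ in range(k)]
    for grads, lr in zip(grads_per_ckpt, lrs):
        for j in range(k):
            for i in range(k):
                m[j][i] += lr * dot(grads[j], grads[i])
    return m


def self_influence(m: List[List[float]]) -> List[float]:
    """Diagonal projection: one risk proxy per evaluation dimension."""
    return [m[k][k] for k in range(len(m))]


def prune_mask(risks: List[float], threshold: float) -> List[bool]:
    """Hard filter (DDP-like): keep a dimension's label only if its risk is low enough."""
    return [r <= threshold for r in risks]


def soft_weights(risks: List[float], temperature: float = 1.0) -> List[float]:
    """Soft down-weighting (DDR-like): weight in (0, 1], decreasing in risk (assumed form)."""
    return [1.0 / (1.0 + r / temperature) for r in risks]


# One checkpoint, two dimensions, toy 2-d gradients.
grads = [[[1.0, 0.0], [0.0, 2.0]]]
risks = self_influence(influence_matrix(grads, lrs=[0.5]))
keep = prune_mask(risks, threshold=1.0)
weights = soft_weights(risks)
print(risks, keep, weights)
```

In this toy case the sample is kept for the first dimension and dropped (or down-weighted) for the second, which is the per-dimension behavior a single global score cannot express.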

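Key Results:

Visualization shows that the self-influence distributions of different dimensions (such as visual quality and dynamic degree) are nearly uncorrelated, confirming dimensional heterogeneity.
Both Dimension-Disentangled Pruning (DDP) and Dimension-Disentangled Reweighting (DDR) significantly outperform baselines that filter on a global scalar.
The method improves the reward model's alignment with human judgments (for example, Spearman correlation) and gives more balanced performance across dimensions.

Tech Stack:

Influence functions
TracIn approximation
Hessian positive-definiteness assumption
Linear decomposition of gradients
Diagonal projection
Self-influence
Dimension-specific loss functions
Hard pruning and soft reweighting

Strengths:

Clear problem definition that identifies and names a key phenomenon in video reward model training, "dimensional heterogeneity".
Rigorous derivation that extends scalar influence functions naturally to matrix form on a solid mathematical basis.
Practical method: the TracIn approximation avoids the computational bottleneck and suits large-scale training.
The two strategies offer different robustness versus data-utilization trade-offs for different scenarios.
Thorough experiments, with both visualizations and quantitative results supporting the method.

Limitations:

Computing the disentangled influence matrix still depends on gradient storage and approximation, which may introduce approximation error.
The method is validated only on video reward models, not on broader multi-task learning or multi-objective optimization settings.
The dimension-specific risk signal (self-influence) may not cleanly separate label noise from hard samples, and the choice between the two strategies is still manual.
The potential value of cross-dimension interactions (off-diagonal terms) is not discussed; the work focuses on the diagonal.

Relevance To Keywords:

Multimodal large models: the paper targets multidimensional reward models for video generation, which falls under multimodal understanding and generation evaluation.
Representation learning: through disentangled influence functions, the method learns dimension-specific signals that improve reward model quality.
Reinforcement learning: reward models are a core component of RLHF; the paper refines reward model training data and so indirectly improves RLHF.
Post-training: the paper focuses on data refinement for reward model training, which is a post-training optimization technique.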

35. SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter (PASS)

Score: 36.0 / 26.5

Authors: Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

Published: 2026-05-27

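TL;DR: SMILE-Next proposes a laughter-specialized large language model framework that combines a mixture of experts with Self-Instruct and substantially outperforms multimodal LLM baselines on laughter detection, classification, and reasoning.

Abstract (Chinese translation)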

笑声是一种复杂的社会信号，传达的交际意图超越了单纯的娱乐。尽管先前工作专注于孤立的分析任务，但在现实场景中对笑声的全面理解仍研究不足。因此，我们引入了 SMILE-Next，这是一个用于现实场景笑声理解的数据集，包含多模态文本表示和问答标注，涵盖三个任务：笑声检测、笑声类型分类和笑声推理。基于 SMILE-Next，我们的目标是开发一个面向笑声的专用大语言模型，能够实现对现实语境中笑声的精细理解。为此，我们提出了两个关键组件：面向笑声的 Self-Instruct 和混合笑专家（MoLE）框架。面向笑声的 Self-Instruct 通过自动合成多样化的以笑声为中心的指令，增强了跨任务和领域的泛化能力。MoLE 引入了一种任务自适应专家路由机制，动态选择针对每个笑声相关任务的专用专家，从而提升特定任务的性能和效率。实验结果表明，我们提出的组件的组合显著优于多模态大语言模型（LLM）基线，推动了稳健的现实场景笑声理解。项目页面位于：https://mok0102.github.io/smile-next/。

Abstract

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.

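Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	6.0/10	12.0
MultiModal	2.0	7.0/10	14.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper unifies detection, classification, and reasoning tasks under the MoLE framework, which is moderately relevant to Unify Models. It involves multimodal textual representations and multimodal LLM baselines, making it relevant to MultiModal and MLLM. It has no environment modeling or reinforcement learning content, so it is unrelated to World Models and model-based RL. The weighted total is 36.0, above the dynamic pass score of 26.5.

Keywords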

Laughter Detection, Laughter Classification, Laughter Reasoning, Large Language Models, Multimodal Textual Representations, Mixture-of-Laugh-Experts, Self-Instruct, SMILE-Next Dataset

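In-Depth Analysis

Chinese Title: SMILE-Next：教会大型语言模型检测、分类和推理笑声

Summary: To address the complexity of understanding laughter in real-world scenarios, the paper introduces the SMILE-Next dataset, which provides multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. On top of this dataset, the authors develop a laughter-specialized large language model built on two key components. Laughter-specific Self-Instruct automatically synthesizes diverse laughter-centric instructions to improve generalization. The Mixture-of-Laugh-Experts (MoLE) framework adds a task-adaptive expert routing mechanism that dynamically selects experts for each laughter task. Experiments show that combining the two components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding.

Innovations:

The SMILE-Next dataset, covering several social settings (talk shows, sitcoms, dyadic conversations) and three complementary tasks: detection, classification, and reasoning.
Laughter-specific Self-Instruct, which automatically generates diverse laughter-related instructions to improve cross-task and cross-domain generalization.
The MoLE framework, a LoRA-based task-adaptive expert routing mechanism that dynamically selects the experts suited to a given laughter task, improving task performance and efficiency.
A multimodal textualization approach that converts video and speech signals into text descriptions, using the LLM's reasoning ability to disentangle the factors behind laughter.

Methodology: The pipeline has five parts. (1) Data construction: videos are collected from several social settings and manually annotated with question-answer pairs covering detection, classification, and reasoning. (2) Multimodal textualization: off-the-shelf models convert video and speech attributes (such as facial expressions and prosody) into text descriptions. (3) Model training: an LLM is fine-tuned with LoRA under the MoLE framework, which has multiple expert modules, each tied to a specific task and selected dynamically by a routing network. (4) Self-Instruct generation: seed instructions and an LLM are used to synthesize new laughter-related instructions that expand the training data. (5) Evaluation: the model is compared against multimodal LLM baselines on SMILE-Next, with ablation studies.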
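Task-adaptive expert routing, as described in step (3), can be illustrated with a toy top-1 router. Everything here is hypothetical: the router logits are supplied by hand, the three "experts" are plain functions standing in for per-task LoRA adapters, and whether MoLE uses top-1 or soft mixing is not stated in this summary.

```python
import math
from typing import Callable, List

TASKS = ["detection", "classification", "reasoning"]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits: List[float]) -> int:
    """Top-1 routing: index of the expert with the highest gate probability."""
    probs = softmax(router_logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Stand-ins for task-specific adapters (hypothetical).
experts: List[Callable[[str], str]] = [
    lambda text: f"[detection expert] {text}",
    lambda text: f"[classification expert] {text}",
    lambda text: f"[reasoning expert] {text}",
]


def answer(text: str, router_logits: List[float]) -> str:
    """Send the input to the single expert chosen by the router."""
    return experts[route(router_logits)](text)


print(answer("What type of laughter is this?", [0.1, 2.0, -1.0]))
```

A learned router would produce the logits from the input representation; the point of the sketch is only that one shared backbone can dispatch to different lightweight experts per task.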

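Key Results:

The model that combines laughter-specific Self-Instruct and MoLE substantially outperforms multimodal LLM baselines on laughter detection, type classification, and reasoning.
Textualized multimodal information is key to robust laughter understanding and is more effective than processing raw video directly.
The MoLE framework improves task-specific performance and efficiency through task-adaptive expert routing.
Self-Instruct generation mitigates limited data coverage and improves generalization.

Tech Stack:

Large language model (LLM)
LoRA (Low-Rank Adaptation)
Mixture-of-Experts (MoE) architecture
Self-Instruct generation
Multimodal textualization (off-the-shelf models extract text descriptions of facial expressions, prosody, and similar cues)
Classification tasks (binary and multi-class)
Free-text generation (reasoning task)

Strengths:

The dataset covers several real social settings with a comprehensive task design (detection, classification, reasoning), which supports more holistic research on laughter understanding.
The Self-Instruct and MoLE methods are general and could transfer to other social-signal understanding tasks.
The textualization strategy makes good use of the LLM's commonsense reasoning and avoids the complexity of multimodal feature fusion.
Experiments are thorough, and ablations support the contribution of each component.

Limitations:

The dataset is relatively small (6,386 samples in total), which may limit generalization to broader scenarios.
Textualization depends on off-the-shelf models that may introduce noise or miss key multimodal information.
The MoLE framework adds model complexity and training overhead.
The work covers laughter only and does not address joint understanding with other social signals (such as facial expressions or gestures).

Relevance To Keywords: The paper focuses on laughter understanding and is only weakly related to the keywords "Unify Models", "World Models", "Representation Learning", "Model-Based RL", native multimodal large models, unified understanding and generation in multimodal large models, reinforcement learning, and post-training. It does use multimodal textual representations (representation learning), Self-Instruct generation (akin to post-training data augmentation), and MoLE (an architectural contribution), which gives it some connection to representation learning and post-training, but overall it does not fall under world models or reinforcement learning.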

36. MangaFlow: An End-to-End Agentic Framework for Controllable Story to Manga Generation (PASS)

Score: 36.0 / 26.5

Authors: Muyao Wang, Zeke Xie, Yanhao Chen, Lixin Xiu, Hideki Nakayama

Published: 2026-05-27

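TL;DR: MangaFlow proposes an end-to-end agentic framework that decomposes story-to-manga generation into planning and rendering steps, enabling controllable layouts and improving cross-panel consistency.

Abstract (Chinese translation)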

端到端漫画生成（End-to-end manga generation）是一项结构化的视觉叙事任务，涉及故事分解、重复角色与场景定位、页面布局设计、面板渲染、页面构图及文字排版。然而，现有的生成模型通常直接进行页面合成，将这些因素纠缠于单一视觉输出中，从而限制了对布局几何、视觉参考以及跨面板一致性的精确控制。为了解决这些局限性，我们提出了 MangaFlow，这是一个面向可控长篇幅漫画生成的智能体框架（agentic framework），它将漫画创作分解为规划、定位、布局构建、参考条件渲染、构图及文本放置。通过将布局和视觉参考视为显式中间变量，MangaFlow 既支持简单的文本到漫画（text-to-manga）生成，也支持更精确的用户控制漫画创作。该设计将布局、视觉资产和文字排版暴露为可编辑的中间控制，以便细化面板几何、视觉参考及文本放置。为了支持长篇幅的一致性，MangaFlow 引入了故事部分记忆（story section memory），将章节描述与相应的角色、场景及对象引用相链接，以便在面板间复用。此外，我们还提出一个元基准（meta-benchmark），用于评估布局可控性、视觉一致性及生成质量。实验表明，与直接生成基线相比，MangaFlow 在布局遵循性和跨面板一致性方面有所提升，同时支持灵活的人工控制。

Abstract

End-to-end manga generation is a structured visual storytelling task that requires story decomposition, recurring character and scene grounding, page layout design, panel rendering, page composition, and lettering. However, existing generative models often perform direct page synthesis, entangling these factors in a single visual output and limiting precise control over layout geometry, visual references, and cross-panel consistency. To address these limitations, we propose MangaFlow, an agentic framework for controllable long-form manga generation that decomposes manga creation into planning, grounding, layout construction, reference-conditioned rendering, composition, and text placement. By treating layout and visual references as explicit intermediate variables, MangaFlow enables both simple text-to-manga generation and more precise user-controlled manga creation. This design exposes layout, visual assets, and lettering as editable intermediate controls for refining panel geometry, references, and text placement. To support long-form consistency, MangaFlow introduces a story section memory that links section descriptions with corresponding character, scene, and object references for reuse across panels. We further present a meta-benchmark for evaluating layout controllability, visual consistency, and generation quality. Experiments show that MangaFlow improves layout adherence and cross-panel consistency over direct generation baselines while supporting flexible human control.

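Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper's core is multimodal text-to-image generation, so MultiModal is the most relevant keyword. The framework integrates several generation steps, giving moderate relevance to Unify Models and MLLM. It does not involve world models or reinforcement learning mechanisms, so those scores are low. The author list does not include Yang Shi or the other designated experts, so no bonus applies. The weighted total is 36.0, above the dynamic pass score of 26.5.

Keywords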

End-to-End, Agentic Framework, Controllable Story, Manga Generation, Layout Design, Visual Consistency, Story Decomposition

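In-Depth Analysis

Chinese Title: MangaFlow: 一种面向可控故事到漫画生成的端到端智能体框架

Summary: The paper proposes MangaFlow, an end-to-end agentic framework for controllable long-form manga generation. Existing generative models usually synthesize pages directly, entangling layout, character consistency, scene consistency, and other factors in a single visual output, with no precise control over layout geometry, visual references, or cross-panel consistency. MangaFlow decomposes manga creation into six stages: story planning, scene memory construction, explicit layout control, reference-conditioned panel rendering, page composition, and text placement. Layout and visual references are treated as explicit intermediate variables, supporting both simple text-to-manga generation and more precise user control. The framework introduces a story section memory that binds section descriptions to the corresponding character, scene, and object references for reuse across panels. The authors also present the MangaGen-MetaBench meta-benchmark for evaluating layout controllability, visual consistency, and generation quality. Experiments show that MangaFlow improves layout adherence and cross-panel consistency over direct generation baselines while supporting flexible human control.

Innovations:

The first end-to-end agentic framework for manga, MangaFlow, which decomposes generation into editable intermediate steps with explicit layout control and reference-conditioned rendering.
A story section memory that binds section descriptions to character, scene, and object references for visual consistency across panels.
The MangaGen-MetaBench meta-benchmark, designed to evaluate layout controllability, consistency, generation quality, and text placement in manga generation.
Support for three layout sources (manual specification, extraction from a reference manga, automatic generation), treating layout as a first-class structural variable.
Editable intermediate assets (layout, visual references, lettering) that let users refine the result at each stage.

Methodology: MangaFlow uses an agentic pipeline with four main parts. (1) Story planning: an LLM decomposes the input story into pages, panels, and story sections, each section keeping its own memory. (2) Explicit layout control: a layout agent generates or retrieves the page layout in manual, reference, or automatic mode, and a layout-correction step enforces the panel count and geometric validity. (3) Reference-conditioned panel rendering: a panel agent builds prompts from the section memory, layout, and panel-level description, then renders with a diffusion model (e.g., Stable Diffusion) conditioned on references. (4) Page composition and text placement: rendered panels are composed according to the layout, and a lettering agent adds speech bubbles, narration boxes, and similar elements. The framework as a whole relies on an LLM (e.g., GPT-4) for task decomposition and coordination and on visual generation components (e.g., ControlNet, IP-Adapter) for reference-conditioned control.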
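Two of the intermediate structures described above, the story section memory and the layout check, lend themselves to a small data-structure sketch. The class `SectionMemory`, the function `validate_layout`, and the normalized rectangle format are hypothetical names and conventions chosen for illustration; the paper's actual representations may differ (it also allows polygonal panels).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height in page-normalized units


@dataclass
class SectionMemory:
    """Links a story section to reusable visual references (entity name -> reference id)."""
    description: str
    references: Dict[str, str] = field(default_factory=dict)

    def lookup(self, entities: List[str]) -> Dict[str, str]:
        """Return the references available for the entities that appear in a panel."""
        return {e: self.references[e] for e in entities if e in self.references}


def validate_layout(panels: List[Box], expected_count: int) -> List[str]:
    """Report panel-count and geometry problems; an empty list means the layout passes."""
    problems: List[str] = []
    if len(panels) != expected_count:
        problems.append(f"expected {expected_count} panels, got {len(panels)}")
    for i, (x, y, w, h) in enumerate(panels):
        if w <= 0 or h <= 0:
            problems.append(f"panel {i} has non-positive size")
        elif x < 0 or y < 0 or x + w > 1 or y + h > 1:
            problems.append(f"panel {i} leaves the page")
    return problems


memory = SectionMemory("Rooftop chase at night", {"Aki": "ref_aki_01", "rooftop": "ref_roof_03"})
found = memory.lookup(["Aki", "moon"])
issues = validate_layout(
    [(0.0, 0.0, 1.0, 0.5), (0.0, 0.5, 0.5, 0.5), (0.5, 0.5, 0.6, 0.5)],
    expected_count=3,
)
print(found, issues)
```

Keeping both structures explicit is what makes them editable: a user or a correction agent can fix the offending panel or swap a reference without regenerating the whole page.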

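Key Results:

MangaFlow improves on direct generation baselines in layout adherence, generating pages that follow the user-specified panel count and geometry.
Cross-panel consistency (characters, scenes, objects) improves, with less character drift and scene inconsistency.
User-provided character, scene, and layout reference images are supported, enabling controllable manga generation.
MangaGen-MetaBench provides an effective evaluation protocol that exposes the limits of direct generation methods in layout and consistency.
Generated pages follow standard manga conventions in reading order, text placement, and narrative flow.

Tech Stack:

LLM (e.g., GPT-4) for story planning, layout generation, panel prompt construction, and text placement
Diffusion model (e.g., Stable Diffusion) for panel image rendering
ControlNet for layout-conditioned control
IP-Adapter for reference-image-conditioned generation
Story section memory (structured text plus visual references)
Layout representation: a set of rectangular or polygonal panel regions
MangaGen-MetaBench meta-benchmark (with subtasks such as layout control and section guidance)

Strengths:

First application of an agentic framework to end-to-end manga generation, with editable intermediate controls that improve user controllability.
The story section memory addresses character and scene consistency in long stories.
Explicit layout control makes layout an editable structural variable, matching how manga is actually produced.
A dedicated meta-benchmark fills a gap in manga generation evaluation.
Experiments support the framework's advantages in layout adherence and consistency and demonstrate flexible human-in-the-loop control.

Limitations:

The framework depends on LLMs and diffusion models, so generation quality and consistency are bounded by the underlying models.
Building the story section memory requires accurate section segmentation, and complex narratives may be hard to decompose automatically.
Layout correction and text placement can still produce unnatural results and need further refinement.
The meta-benchmark is limited in scale and does not yet cover all manga styles and complex scenes.
The framework is computationally expensive: end-to-end generation needs multi-step reasoning and many image generations.

Relevance To Keywords: The paper studies manga generation and has low relevance to the given research keywords (Unify Models, World Models, Representation Learning, Model-Based RL, native multimodal large models, unified understanding and generation in multimodal large models, reinforcement learning, post-training). It focuses on the agentic framework, layout control, and visual consistency rather than on unified multimodal understanding and generation, world models, or reinforcement learning. The LLMs and diffusion models it uses are multimodal generation techniques, but the paper does not touch the core concepts of representation learning or world models.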

37. GUI Agents for Continual Game Generation (PASS)

Score: 34.0 / 26.5

Authors: Yixu Huang, Bo Li, Na Li, Zhe Wang, Kaijie Chen, Haonan Ge, Qingyi Si, Yuanzhe Shen, Ruihan Yang, Guangjing Wang, Hongcheng Guo

Published: 2026-05-27

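TL;DR: The paper proposes Play2Code, an iterative framework built on GUI agents for continual game generation, which substantially raises the playability pass rate compared with single-pass methods.

Abstract (Chinese translation)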

生成游戏并不等同于制作一款可玩的游戏。尽管代码生成领域取得了进展，现有方法仍将游戏生成视为从提示到产物的单次转换，使得交互层面的失败未被检测。我们认为，评估和改进游戏生成需要一个玩家，并研究了图形用户界面（GUI）代理在这一过程中的两种角色：(1) 作为客观评估者，为此我们引入了 PlaytestArena，这是一个新的评估环境，将涵盖八个类别的 200 个基于浏览器的游戏生成任务与预期游戏内行为的评价标准配对，由一个在浏览器中加载每个构建版本并进行游玩的 GUI 代理进行裁定；(2) 作为主观测试者，为此我们提出了 Play2Code，其中游戏代理与 GUI 代理在共享内存中持续循环运作，将游戏生成转化为编码与游玩之间的对话。我们的实验表明，即使前沿模型也难以直接生成可玩的游戏，而 Play2Code 达到了 66.8% 的评分标准通过率，分别比单次生成和代理编码基线提高了 37.1 和 14.6 个百分点。进一步分析表明，GUI 测试者的反馈比人类报告更具可追溯性，但在某些方面具有独特性，类似于人类测试者，从而确立了游戏测试作为交互式代码生成关键测试平台的作用。我们的项目网站位于 https://continual-game-generation.vercel.app/。

Abstract

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game generation as one-shot translation from prompt to artifact, leaving interaction-level failures undetected. We argue that evaluating and improving game generation requires a player, and study two roles for graphical user interface (GUI) agents in this process: (1) as an objective evaluator, for which we introduce PlaytestArena, a new evaluation environment that pairs 200 browser-based game generation tasks across eight genres with rubrics of expected in-play behaviors, adjudicated by a GUI agent that loads each build in a browser and plays it; and (2) as a subjective playtester, for which we propose Play2Code, where a game agent and a GUI agent operate in a sustained loop with shared memory, turning game generation into a dialogue between coding and playing. Our experiments show that even frontier models struggle to generate playable games directly, while Play2Code achieves a 66.8% rubric pass-rate, improving over single-pass and agentic-coding baselines by 37.1 and 14.6 points respectively. Further analysis shows that GUI playtester feedback is more traceable than a human report, yet idiosyncratic in ways reminiscent of human testers, establishing game playtesting as a critical testbed for interactive code generation. Our project website is available at https://continual-game-generation.vercel.app/.

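Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	5.0/10	10.0
MultiModal	2.0	5.0/10	10.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper centers on a closed loop between GUI agents and game generation, using frontier models (MLLM) to process visual interfaces (MultiModal), so both keywords score 5.0. It does not involve world model architecture design or model-based reinforcement learning techniques, so relevance there is low (2.0). Although there is a loop between coding and playing, it does not reach the architectural level of model unification (Unify Models), hence 3.0. On checking, the author list does not include the experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang.

Keywords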

GUI Agents, Game Generation, Play2Code, PlaytestArena, Iterative Generation, Coding-Playing Loop, Frontier Models

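In-Depth Analysis

Chinese Title: 面向持续游戏生成的图形用户界面代理

Summary: Existing game generation methods focus on one-shot code generation and overlook interaction-level defects. The paper proposes using graphical user interface (GUI) agents as both evaluators and playtesters in game generation. First, it builds the PlaytestArena evaluation environment: 200 browser-based game generation tasks across 8 genres, each paired with a rubric of expected in-play behaviors and judged by a GUI agent that loads and actually plays the game. Second, it proposes the Play2Code system, which couples generation with a testing loop: a game agent and a GUI agent interact continuously through shared memory and a shared runtime, iteratively improving the game from play feedback. Experiments show that even frontier models struggle to generate playable games directly, while Play2Code reaches a 66.8% rubric pass rate, 37.1 and 14.6 percentage points above the single-pass and agentic-coding baselines respectively. Further analysis shows that GUI playtester feedback is traceable while also showing idiosyncrasies similar to those of human testers.

Innovations:

GUI agents as objective evaluators of game generation, with the PlaytestArena benchmark of 200 game tasks and executable rubrics judged by an agent that actually plays each game.
The Play2Code system, which puts game generation and GUI-agent playtesting in a sustained loop and iterates through shared memory and a shared runtime.
Evidence that GUI agents have the observation, reasoning, and action abilities a human tester needs, approaching human level on a 20-game test.
An analysis of the traceability and human-like idiosyncrasies of GUI agent feedback, establishing a new testbed for interactive code generation.

Methodology: The paper takes an empirical approach. It first runs a small-scale test (20 games, about 120 levels) to check whether GUI agents can playtest at all, comparing the pass rates of three agent backends (GPT-5.4, Claude Sonnet 4.6, Kimi K2.5) with those of human players. It then builds PlaytestArena: 200 game themes are collected, human experts write the generation prompts and rubrics (1,548 criteria in total), and a GUI agent loads each generated HTML/CSS/JS game, plays it, and judges every criterion as pass or fail. Finally it proposes Play2Code: the game agent and the GUI agent interact in a loop over shared memory, the GUI agent supplies play observations and feedback, and the game agent revises the code accordingly over multiple rounds. Experiments use three frontier models as backends and compare the rubric pass rates of single-pass generation, agentic coding, and Play2Code.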
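The coding-playing loop can be sketched as a simple control flow. This is a schematic under stated assumptions, not the Play2Code implementation: `generate` and `playtest` are hypothetical callables standing in for the game agent and the GUI agent, shared memory is reduced to a list of feedback strings, and the stub agents below exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str, List[str]], str]      # (prompt, shared memory) -> game code
Playtest = Callable[[str], Tuple[bool, str]]    # game code -> (passed, feedback)


def play2code_loop(
    prompt: str, generate: Generate, playtest: Playtest, max_rounds: int = 3
) -> Tuple[str, List[str]]:
    """Alternate between generating code and playtesting it, accumulating feedback."""
    memory: List[str] = []
    code = generate(prompt, memory)
    for _ in range(max_rounds):
        passed, feedback = playtest(code)
        if passed:
            break
        memory.append(feedback)            # feedback becomes context for the next revision
        code = generate(prompt, memory)    # note: the last revision may go untested
    return code, memory


# Stub agents: the "game" only becomes playable after two rounds of feedback.
def fake_generate(prompt: str, memory: List[str]) -> str:
    return f"build-v{len(memory)}"


def fake_playtest(code: str) -> Tuple[bool, str]:
    ok = code == "build-v2"
    return ok, "" if ok else f"{code}: player cannot move"


final_code, notes = play2code_loop("a snake game", fake_generate, fake_playtest, max_rounds=5)
print(final_code, notes)
```

The single-pass baseline corresponds to calling `generate` once with empty memory; the loop differs only in that playtest observations feed back into later generations.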

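Key Results:

On the 20-game playtesting check, the GPT-5.4 GUI agent reaches a pass@20 of 0.82, close to the human reference of 0.92.
PlaytestArena contains 200 games with an average of 7.7 rubric criteria per game, covering mechanics, controls, progression, visual feedback, and other dimensions.
Play2Code averages a 66.8% rubric pass rate across the three frontier models, 37.1 percentage points above the single-pass baseline (29.7%) and 14.6 points above the agentic-coding baseline (52.2%).
The rubric pass rate rises monotonically with the number of iteration rounds.
GUI agent feedback is more traceable than human reports while showing human-like idiosyncrasies (such as preferences for certain interactions).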
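The headline numbers are mutually consistent, which a short arithmetic check makes explicit. The sketch only recombines figures quoted above (66.8%, the two gains, 1,548 criteria over 200 games); it does not add any new results.

```python
play2code_rate = 66.8          # rubric pass rate, percent
gain_over_single_pass = 37.1   # percentage points
gain_over_agentic = 14.6       # percentage points

# Baselines implied by the reported gains.
single_pass_rate = round(play2code_rate - gain_over_single_pass, 1)
agentic_rate = round(play2code_rate - gain_over_agentic, 1)

# Average rubric size implied by the benchmark totals.
criteria_per_game = round(1548 / 200, 1)

print(single_pass_rate, agentic_rate, criteria_per_game)
```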

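Tech Stack:

GUI agents: GPT-5.4, Claude Sonnet 4.6, Kimi K2.5
Game generation: LLM-based HTML/CSS/JS code generation
Evaluation: rubric-based pass-rate computation
System architecture: shared memory, runtime environment, loop-based interaction
Human evaluation: blinded annotators judge each criterion for 32 games

Strengths:

A new continual game generation paradigm that feeds playtest feedback into generation, addressing the shortcomings of one-shot generation.
PlaytestArena, the first public and reproducible browser-game generation benchmark, with detailed rubrics.
Rigorous experimental design with several baselines, supporting GUI agents as effective playtesters.
An in-depth analysis of GUI agent feedback that offers insight for future work.

Limitations:

Game types are limited to browser games and do not cover more complex game engines or 3D games.
GUI agent evaluation may be limited by the agent's perception and reasoning, and a gap to human testers remains.
Play2Code iterations are costly, requiring many LLM and GUI agent calls.
Rubrics are written by human experts, so they may carry subjective bias and do not cover every possible game behavior.

Relevance To Keywords:

Unify Models: the paper does not address unified models directly, but GUI agents as multimodal models (vision, language, action) can be seen as an instance of unified models applied to game testing.
World Models: generating and testing games requires understanding the dynamics of the game world, and GUI agents build an implicit world model through observation and interaction, but the paper does not use a world model framework explicitly.
Representation Learning: not addressed, although the GUI agent's visual perception and state descriptions can be seen as an application of representation learning.
Model-Based RL: Play2Code's iterative refinement resembles the plan-and-execute loop of model-based reinforcement learning, but the paper uses LLM reasoning rather than an RL algorithm.
Native multimodal large models: the GPT-5.4 and Claude Sonnet 4.6 models used are multimodal large models that process visual and text input directly, consistent with native multimodality.
Unified understanding and generation in multimodal large models: the GUI agent both understands (observes game frames) and generates (outputs actions), and the game agent generates code, which reflects unified understanding and generation.
Post-training: not addressed, but Play2Code's iteration can be seen as a form of online adaptation or prompt optimization, indirectly related to post-training.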

38. LV-OSD: Language-Vision-Complementary Open-Set Object Detection (PASS)

Score: 34.0 / 26.5

Authors: Yupeng Zhang, Ruize Han, Wei Feng, Song Wang, Liang Wan

Published: 2026-05-27

TL;DR: This paper proposes a dual-branch framework for language-vision-complementary open-set object detection that dynamically aligns text and image prompts to improve detection accuracy.

Abstract (Chinese translation)

目标检测是计算机视觉中的一项重要任务，旨在通过给定的类别列表或查询图像检测感兴趣的目标。在这项工作中，我们提出了一种新的语言 - 视觉互补开放集目标检测（LV-OSD）问题，即使用灵活的基于文本的和/或基于图像的提示来指定所需的目标类别。这种设定在实际应用中更为常见且实用。为此，我们设计了一个双分支检测框架 LVDor，该框架可同时接受文本和图像提示。具体来说，我们首先为每个类别构建包含各种文本描述和图像样本的多模态提示（MPr）。随后，为了弥合输入图像、文本提示和图像提示之间的语义鸿沟，我们设计了一个目标导向提示动态加权（TPDW）模块。在目标图像先验信息的引导下，该模块动态生成与目标语义最匹配的文本和图像提示，实现精确对齐，有效减少两个模态之间的差异，从而适应 LV-OSD 设定。我们还提出了一种简单的提示随机掩码（PRM）机制，用于在训练中模拟测试时文本和/或图像提示的任意组合。广泛的实验结果验证了我们问题设定的合理性及方法的有效性。提示与代码将公开发布。

Abstract

Object detection is an important task in computer vision, which aims to detect the objects of interest through the given category list or query images. In this work, we propose a new problem of language-visual-complementary open-set object detection (LV-OSD), i.e., using the flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combination of text and/or image prompts in testing. Extensive experimental results verify our problem formulation's reasonability and our method's effectiveness. Prompts and code will be released publicly.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	10.0/10	20.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper is highly relevant to 'MultiModal' (10.0) as it integrates language and vision for detection. It unifies these modalities in a dual-branch framework, moderately relating to 'Unify Models' (5.0). It uses text prompts but does not center on MLLM architecture (2.0). There is no connection to 'World Models' or 'model-based RL' (0.0). None of the specified expert authors are present in the author list.

Keywords

Open-Set Object Detection, Language-Vision Complementary, Multi-modal Prompts, Dual-Branch Framework, Prompt Dynamic Weighting, Text and Image Prompts, Computer Vision

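In-Depth Analysis

Chinese Title: LV-OSD：语言-视觉互补的开放集目标检测

Summary: The paper proposes a new problem, language-vision-complementary open-set object detection (LV-OSD), in which text prompts, image prompts, or any combination of the two can be used at inference to specify target categories, which is closer to practical use. The authors design LVDor, a dual-branch detection framework that accepts both text and image prompts. First, a large language model (LLM) generates several text descriptions per category, and representative images are collected, to build the Multi-modal Prompts (MPr) set. Next, the Target-guided Prompt Dynamic Weighting (TPDW) module uses prior information from the input image to dynamically select the text and image prompts that best align with the target semantics, achieving precise alignment and reducing the modality gap. During training, a Prompt Random Masking (PRM) mechanism simulates arbitrary test-time combinations of text and image prompts. Experiments show that the method reaches state-of-the-art performance on OV-LVIS and in cross-dataset evaluation (Object365), supporting both the problem formulation and the method.

Innovations:

First formulation of the LV-OSD problem, supporting four prompt modes (text, image, fused, and flexible combinations) that better match practical needs.
The dual-branch LVDor detection framework, which handles text and image prompts together and extracts features with frozen CLIP encoders to reduce training complexity.
The TPDW module, which adaptively selects and weights multimodal prompts based on the input image, suppressing within-category semantic noise for precise alignment.
The PRM training strategy, which simulates arbitrary test-time combinations of text and image prompts and improves robustness to heterogeneous prompt settings.

Methodology: The paper uses the dual-branch LVDor framework. Text branch: an LLM generates several web-caption-style descriptions per category, embedded by a frozen CLIP text encoder. Image branch: diverse images for each category are collected from the internet and encoded by a frozen CLIP image encoder. After building the MPr set, the TPDW module conditions on the input image to compute a weight for each prompt and fuse the prompts, aligning them semantically. During training, PRM randomly drops the prompts of some modalities to simulate flexible combinations. The detection head follows the DETR design; only the detection head is trained while the CLIP encoders stay frozen.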
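One plausible reading of image-conditioned prompt weighting and random prompt masking is sketched below. Both functions are assumptions made for illustration: the summary does not specify whether TPDW uses attention or similarity, so the sketch uses a temperature-scaled softmax over cosine similarities, and the masking probability and the "never drop both modalities" rule in `random_prompt_mask` are likewise hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def dynamic_weighting(
    image_feat: Vector, prompt_feats: List[Vector], temperature: float = 0.1
) -> Tuple[List[float], Vector]:
    """Weight each prompt by its similarity to the image, then fuse the prompts."""
    sims = [cosine(image_feat, p) for p in prompt_feats]
    m = max(sims)
    exps = [math.exp((s - m) / temperature) for s in sims]
    z = sum(exps)
    weights = [e / z for e in exps]
    fused = [
        sum(w * p[d] for w, p in zip(weights, prompt_feats))
        for d in range(len(image_feat))
    ]
    return weights, fused


def random_prompt_mask(rng: random.Random, p_drop: float = 0.3) -> Tuple[bool, bool]:
    """Return (use_text, use_image) for one training step; at least one stays on."""
    use_text = rng.random() >= p_drop
    use_image = rng.random() >= p_drop
    if not (use_text or use_image):
        if rng.random() < 0.5:
            use_text = True
        else:
            use_image = True
    return use_text, use_image


weights, fused = dynamic_weighting([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, fused, random_prompt_mask(random.Random(0)))
```

Training with such masks exposes the detector to text-only, image-only, and combined prompts, which is the mismatch between training and test modalities that PRM is meant to close.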

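Key Results:

On OV-LVIS, LVDor reaches state-of-the-art performance in open-vocabulary object detection.
In cross-dataset evaluation (Object365), the method shows strong robustness and generalization.
Ablations support the effectiveness of the TPDW and PRM modules and the benefit of complementary multimodal prompts.
Experiments show that a mismatch between training and test modalities degrades performance, and that LVDor mitigates the modality gap through dynamic weighting and random masking.

Tech Stack:

Large language model (LLM) for generating text descriptions
CLIP (Contrastive Language-Image Pre-training) text and image encoders (frozen)
Target-guided Prompt Dynamic Weighting (TPDW) module (based on attention or similarity computation)
Prompt Random Masking (PRM) training strategy
DETR (Detection Transformer) detection head
Open-set object detection
Multimodal fusion

Strengths:

A novel and practical problem setting that supports flexible prompt combinations and adapts to differences in what category information is available in real scenarios.
Frozen CLIP models avoid large-scale retraining and keep computation efficient.
The TPDW module performs target-aware prompt selection, suppressing semantic noise and improving detection accuracy.
The PRM mechanism is simple and effective, improving generalization to heterogeneous prompts.
State-of-the-art results on several benchmarks support the method.

Limitations:

The method depends on pretrained CLIP, whose performance ceiling may limit detection ability.
TPDW computes the similarity between every prompt and the input image, which may add inference overhead when the number of categories or prompts is large.
The method targets object detection only and does not explore extension to other vision tasks (such as segmentation or tracking).
Robustness to rare categories or extreme visual variation needs further validation.

Relevance To Keywords:

Native multimodal large models: the paper uses CLIP as its multimodal backbone, consistent with the notion of a native multimodal large model.
Unified understanding and generation in multimodal large models: the paper focuses on multimodal understanding (object detection) and does not involve generation, though prompt dynamic weighting reflects multimodal fusion for understanding.
Representation learning: CLIP's vision-language representation learning is the foundation, and TPDW further learns target-guided representation alignment.
World models: not addressed directly, but open-set detection can be seen as generalization over world knowledge (categories), an indirect connection.
Reinforcement learning and post-training: the paper uses neither reinforcement learning nor post-training techniques; relevance is weak.
Unify Models: LVDor unifies text-prompted and image-prompted detection in one framework, reflecting the idea of model unification.
Model-Based RL: not relevant.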

39. MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment (PASS)

Score: 34.0 / 26.5

Authors: Shurui Xu, Siqi Yang, Weiping Ding, Hui Wang, Mengzhen Fan, Yuyu Sun, Shuyan Li

Published: 2026-05-27

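TL;DR: MeniOmni proposes a structured multimodal benchmark that combines MRI with clinical priors to improve the accuracy of meniscus injury assessment and the quality of report generation.

Abstract (Chinese translation)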

半月板损伤的临床诊断要求放射科医生整合体积 MRI 证据与患者临床背景（例如性别、年龄、BMI），并生成结构化诊断报告。现有的膝关节 MRI 基准数据集通常为单模态，且依赖粗粒度标签，限制了其评估整体临床推理的能力。我们引入 MeniOmni，这是一个用于半月板损伤评估的结构化多模态基准，包含 746 项多中心 MRI 研究，具备三平面体积输入、临床先验（Clinical Priors）以及专家标注的临床文本。MeniOmni 支持两项任务：（1）细粒度 Stoller 严重程度分级和（2）诊断报告生成。我们进一步提出风险感知序数评估和一种语义一致性度量（Meni-Score），以更好地反映临床相关性。基线实验表明，整合临床先验可提高分级性能并减少严重错误，凸显了多模态上下文在更安全评估中的价值。代码和数据可在 https://github.com/ShuruiXu/MeniOmni 获取。

Abstract

Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at https://github.com/ShuruiXu/MeniOmni.

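Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper's core contribution is a multimodal benchmark for medical imaging (highly relevant to MultiModal), involving multimodal inputs and text generation (moderately relevant to MLLM). It does not involve a unified model architecture (low relevance to Unify Models), world models (low relevance to World Models), or reinforcement learning (low relevance to model-based RL). The author list does not include the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords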

Multimodal Benchmark, Meniscus Injury Assessment, Clinical Priors, Diagnostic Report Generation, Volumetric MRI, Severity Grading, Structured Output

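In-Depth Analysis

Chinese Title: MeniOmni：面向整体半月板损伤评估的结构化多模态基准

Summary: The paper presents MeniOmni, a structured multimodal benchmark dataset for meniscus injury assessment. It contains 746 multi-center MRI studies and combines tri-planar volumetric MRI, structured clinical priors (age, sex, BMI), and expert-annotated clinical text (morphological descriptions and diagnostic conclusions). MeniOmni supports two tasks: fine-grained Stoller severity grading (grades 0-3) and diagnostic report generation. To better match clinical needs, the paper proposes a risk-aware ordinal evaluation metric and a semantic consistency metric (Meni-Score). Baseline experiments show that adding clinical priors improves grading performance and reduces severe errors, supporting the value of multimodal context for safer assessment. The dataset and code are open source.

Innovations:

The first multimodal meniscus injury benchmark that combines volumetric MRI, structured clinical priors, and diagnostic text, filling a gap in orthopedic multimodal datasets.
A content-adaptive volumetric sampling algorithm that selects MRI slices using image quality, anatomical complexity, and a spatial prior, reducing redundancy.
A dual-task evaluation protocol (fine-grained grading plus report generation) with risk-aware ordinal evaluation and the Meni-Score semantic consistency metric, closer to clinical practice.
A tri-stream visual encoding strategy in which a weight-shared video encoder processes the orthogonal planes independently and view identifiers bind the anatomical semantics.
Serialization of clinical priors into natural-language prompts, unifying the multimodal input format so the model can use physiological risk factors to calibrate its diagnosis.

Methodology: The pipeline has five parts. (1) Data collection and standardization: 746 MRI studies are collected from three clinical centers using three standard sequences (sagittal PD-FS, coronal T2, axial PD), with Stoller grades annotated independently by two radiologists. (2) Content-adaptive volumetric sampling: image quality (standard deviation, Laplacian variance, entropy), anatomical complexity (SAM-Med2D confidence), and a spatial Gaussian prior are combined to select 16 key slices per view, forming a video sequence. (3) Clinical prior serialization: age, sex, and BMI are converted into a natural-language prompt. (4) Tri-stream visual encoding: a weight-shared video encoder processes the three planes separately and outputs visual token sequences tagged with view identifiers. (5) Task design: for grading, the visual streams and clinical priors are concatenated and fed to the LLM, which produces a structured grade; for report generation, only the visual streams are used, to assess semantic generation ability.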
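Steps (2) and (3) can be illustrated with a simplified sketch. It is not the paper's algorithm: the per-slice `quality` value stands in for the combined image-quality and anatomical-complexity terms, the product with a centered Gaussian prior is an assumed combination rule, and the prompt template in `serialize_priors` is hypothetical.

```python
import math
from typing import List


def serialize_priors(age: int, sex: str, bmi: float) -> str:
    """Turn structured clinical priors into a natural-language prompt (assumed template)."""
    return f"Patient: {age}-year-old {sex}, BMI {bmi:.1f}."


def select_slices(quality: List[float], n_keep: int = 16, sigma_frac: float = 0.25) -> List[int]:
    """Pick the n_keep slices with the highest quality * Gaussian(center) score."""
    n = len(quality)
    center = (n - 1) / 2
    sigma = max(sigma_frac * n, 1e-6)
    scored = []
    for i, q in enumerate(quality):
        prior = math.exp(-((i - center) ** 2) / (2 * sigma ** 2))  # favors central slices
        scored.append((q * prior, i))
    top = sorted(scored, reverse=True)[:n_keep]
    return sorted(i for _, i in top)  # restore anatomical order for the video sequence


prompt = serialize_priors(45, "male", 27.34)
kept = select_slices([1.0] * 5, n_keep=3)
print(prompt, kept)
```

With uniform quality the Gaussian prior alone decides, so the central slices are kept; with real per-slice scores, a sharp off-center slice can still outrank a blurry central one.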

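Key Results:

Baseline experiments show that adding clinical priors (age, sex, BMI) improves Stoller grading accuracy and reduces severe errors (such as grading a grade-3 injury as grade 0).
The proposed risk-aware ordinal evaluation reflects clinical relevance better than plain accuracy.
On report generation, the Meni-Score semantic consistency metric tracks expert assessment more closely than n-gram metrics.
Multimodal fusion improves grading accuracy by roughly 5-8% on average over vision-only models.
Tri-stream visual encoding outperforms single-stream encoding or simple concatenation and better preserves anatomical geometry.

Tech Stack:

SAM-Med2D (ViT-B) for anatomical complexity
Video encoder (e.g., Video-Swin Transformer or a 3D CNN)
Large language model (LLM) for multimodal fusion and generation
Content-adaptive sampling algorithm (Equations 1-2): standard deviation, Laplacian variance, Shannon entropy, Gaussian spatial weight
Stoller grading system (grades 0-3)
Risk-aware ordinal evaluation
Meni-Score semantic consistency metric
PyTorch deep learning framework

Strengths:

A moderately sized, multi-center dataset (746 studies) with reasonable representativeness.
The first benchmark to include clinical priors (age, sex, BMI) as a structured modality, close to the real clinical workflow.
The content-adaptive sampling algorithm reduces redundancy in MRI volumes and improves computational efficiency.
The dual-task design (grading plus report generation) evaluates both multimodal understanding and generation.
New evaluation metrics: risk-aware ordinal evaluation and Meni-Score match clinical safety needs more closely.
Open-source code and data support reproducible research.

Limitations:

The dataset is relatively small (746 studies), which may limit the generalization of deep learning models.
Only three clinical centers are included, so geographic and population diversity is limited.
Clinical priors cover only age, sex, and BMI, without further risk factors (such as surgical history or activity level).
Report generation uses only the visual streams, not the clinical priors, which may understate the benefit of multimodality.
Baselines do not include the latest multimodal large models (such as GPT-4V or Gemini), so the comparison is incomplete.
Content-adaptive sampling depends on SAM-Med2D, which may fail on low-contrast slices.

Relevance To Keywords:

Native multimodal large models: the benchmark directly serves the training and evaluation of native multimodal large models, supporting joint vision, text, and tabular input.
Unified understanding and generation in multimodal large models: MeniOmni supports both grading (understanding) and report generation (generation), matching the unification goal.
Representation learning: content-adaptive sampling and the video encoder involve visual representation learning, and clinical prior serialization involves cross-modal representation alignment.
World models: meniscus injury assessment requires understanding anatomy, pathological progression, and patient context, and can be seen as a subtask of a medical world model.
Reinforcement learning and post-training: not addressed directly, but the risk-aware ordinal evaluation could guide reward design in reinforcement learning, and Meni-Score could drive fine-tuning during post-training.
Model-Based RL: weakly related, though multimodal fusion could help build more accurate diagnostic models that indirectly support model-based decision making.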

40. FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models (PASS)

Score: 32.0 / 26.5

Authors: Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuang Wang, Yang Wang

Published: 2026-05-27

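TL;DR: This paper proposes FedMPT, which uses a causal model and optimal transport to address overfitting in multi-label recognition with vision-language models under federated learning, achieving competitive results.

Abstract (Chinese translation)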

基于视觉 - 语言模型 (VLMs) 的多标签识别 (MLR) 旨在利用其预训练知识更好地适应复杂的识别场景，从而增强模型的鲁棒性。然而，对于需要联邦学习的现实去中心化应用，将 VLMs 适配到每个拥有私有和异构数据的客户端可能导致模型过拟合虚假的标签相关性，从而在遇到新样本时触发无关类别。为了解决这一问题，我们重新审视了基于因果模型的 MLR 联邦学习，其中我们采用前门调整机制，并通过放大真实标签共现的中间变量来解耦 MLR 建模过程。基于上述分析，我们提出了 FedMPT，这是首个专门针对联邦多标签识别设计的方法。FedMPT 的核心思想是利用可泛化的条件来引导联邦多标签识别，以缓解错误的标签激活。为此，FedMPT 引入一个由大语言模型 (LLM) 驱动的管道，以揭示支配标签依赖关系的潜在条件。此外，我们在条件增强的提示词与图像块之间引入最优传输，以揭示多个区域级语义。最后，我们通过精心设计的门控机制，从不同条件生成协同预测。在多个基准数据集上的实验表明，所提方法取得了具有竞争力的结果，并在不同设置下优于 SOTA 方法。

Abstract

Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods under varied settings.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	4.0/10	8.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	5.0/10	10.0
MultiModal	2.0	7.0/10	14.0
model-based RL	2.0	0.0/10	0.0

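Scoring rationale: The paper centers on multi-label recognition under federated learning and on tuning vision-language models. 'MultiModal' is highly relevant (vision-language), 'MLLM' is moderately relevant (uses VLM/LLM techniques), and 'Unify Models' is moderately relevant (modality-unification background). 'World Models' and 'model-based RL' are unrelated (no environment modeling or reinforcement learning). No author from the specified expert list was found, so no bonus applies. The weighted total is 32.0, above the dynamic pass score of 26.5.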
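The arithmetic behind the score table is a weighted sum of the per-keyword relevance values, which the short sketch below reproduces for this entry. Treating a score equal to the threshold as a pass is an assumption, as is the omission of the expert-author bonus mentioned in the rationale (it does not apply here); the names `weighted_score` and `PASS_THRESHOLD` are illustrative only.

```python
PASS_THRESHOLD = 26.5  # the "dynamic pass score" quoted in this report

def weighted_score(rows):
    """rows: iterable of (keyword, weight, relevance out of 10)."""
    return sum(weight * relevance for _, weight, relevance in rows)

# Values copied from the FedMPT score table above.
fedmpt = [
    ("Unify Models", 2.0, 4.0),
    ("World Models", 2.0, 0.0),
    ("MLLM", 2.0, 5.0),
    ("MultiModal", 2.0, 7.0),
    ("model-based RL", 2.0, 0.0),
]

total = weighted_score(fedmpt)
# Assumption: reaching the threshold exactly counts as a pass.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)
```

The same function reproduces the totals of the other entries in this section from their respective tables.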

关键词

Federated Learning, Multi-label Recognition, Vision-Language Models, Prompt Tuning, Causal Model, Optimal Transport, Gating Mechanism

深度分析

Chinese Title: FedMPT: 视觉-语言模型的联邦多标签提示调优

Summary: 本文针对联邦学习场景下的多标签识别（MLR）问题，提出了一种名为FedMPT的新型框架。现有方法在联邦学习中容易过拟合客户端私有数据中的虚假标签相关性，导致泛化性能下降。作者从因果角度出发，采用前门调整策略，通过中间变量解耦MLR建模过程，放大真实标签共现。FedMPT的核心思想是利用可泛化的条件来引导联邦MLR，减少错误标签激活。具体地，引入大语言模型（LLM）驱动的管道生成通用抽象条件模板，并通过最优传输将条件增强的提示与图像块对齐，挖掘多区域语义。最后，设计自适应门控机制融合不同条件的预测。在多个基准数据集上的实验表明，FedMPT在多种联邦设置下均优于现有方法，展现出显著的鲁棒性。

Innovations:

首次专门针对联邦多标签识别问题提出解决方案，揭示了现有方法在非独立同分布数据下的脆弱性。
从因果视角（结构因果模型）分析联邦MLR的难点，归因于对局部分布和标签相关性的过拟合，并采用前门调整策略进行干预。
提出LLM驱动的条件生成管道，自动生成通用抽象条件模板，用于软提示学习。
引入最优传输机制，将条件增强的提示与图像块对齐，实现多区域语义挖掘。
设计自适应门控机制，动态调整不同条件在客户端中的贡献，实现协同预测。

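Methodology: The paper analyzes federated MLR with a structural causal model (SCM) and proposes a front-door adjustment that decouples the modeling process. The pipeline is: 1) use a large language model to generate general condition templates; 2) embed the conditions into prompts for soft prompt learning; 3) align the condition prompts with image patches via optimal transport to obtain condition-specific predictions; 4) fuse the multi-condition predictions with a gating mechanism; 5) optimize with the asymmetric loss (ASL) and aggregate client parameters with FedAvg.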
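Step 3 of the pipeline, aligning prompts with image patches through optimal transport, can be sketched with entropic OT solved by Sinkhorn iterations. This is an illustration only: the report does not say which OT solver FedMPT uses, and the uniform marginals, the cosine cost, the regularization `eps`, and the function names are all assumptions.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT with uniform marginals; returns the transport plan."""
    m, n = cost.shape
    a, b = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u, v = np.ones(m), np.ones(n)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def ot_similarity(prompts, patches, eps=0.1):
    """prompts: (m, d) text embeddings, patches: (n, d) patch embeddings."""
    p = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    q = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    sim = p @ q.T                    # cosine similarity
    plan = sinkhorn_plan(1.0 - sim, eps=eps)
    # Transport-weighted similarity acts as the prompt-set/image score.
    return float((plan * sim).sum()), plan
```

Compared with averaging all patch similarities, the transport plan lets each prompt concentrate its mass on the patches it matches best, which is the usual motivation for OT-based region-level alignment.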

Key Results:

在多个基准数据集上，FedMPT在多种联邦设置下均达到最先进性能，显著优于现有方法。
随着数据异质性增加，现有方法性能急剧下降，而FedMPT表现出较强的鲁棒性。
可视化结果表明，FedMPT能有效缓解虚假标签相关性（如猫-椅子），正确激活相关类别。
消融实验验证了LLM条件生成、最优传输对齐和门控机制的有效性。

Tech Stack:

视觉-语言模型（VLM）：CLIP
大语言模型（LLM）用于条件生成
最优传输（Optimal Transport）
软提示学习（Soft Prompt Tuning）
非对称损失（Asymmetric Loss, ASL）
联邦平均（FedAvg）
结构因果模型（SCM）
前门调整（Front-door Adjustment）
门控机制（Gating Mechanism）

Strengths:

首次将联邦学习与多标签识别结合，填补了该领域空白。
因果分析为问题提供了理论依据，方法设计有坚实的理论基础。
LLM驱动的条件生成自动化程度高，无需人工设计条件。
最优传输对齐能够捕捉细粒度区域语义，提升多标签识别精度。
门控机制自适应调整条件贡献，适应不同客户端分布。
实验充分，在多个基准和异质性设置下验证了鲁棒性。

Limitations:

依赖大语言模型生成条件，可能引入额外计算开销和延迟。
联邦通信轮次和客户端参与率等超参数需要调优，未提供自适应策略。
实验仅在三个基准数据集上进行，泛化性有待更多场景验证。
未讨论隐私保护细节（如差分隐私），可能在实际部署中需额外考虑。

Relevance To Keywords:

Unify Models: 论文使用统一的视觉-语言模型（CLIP）进行多模态理解，但未涉及生成一体化。
World Models: 论文未直接涉及世界模型或环境建模。
Representation Learning: 论文通过提示调优和最优传输学习更好的多标签表示，与表征学习相关。
Model-Based RL: 论文未涉及强化学习或基于模型的RL。
原生多模态大模型: 论文基于CLIP，属于多模态大模型，但未涉及原生多模态（如从头训练）。
多模态大模型的理解和生成一体化: 论文仅关注理解（识别），未涉及生成。
后训练: 论文的提示调优属于后训练范式，与后训练相关。

41. EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction [PASS]

Score: 32.0 / 26.5

Authors: Chong Jing, Zitong Lan, Junan Zhang, Zhizheng Wu

Published: 2026-05-27

TL;DR: EigeNet achieves state-of-the-art few-shot novel view RIR prediction and sim-to-real generalization by proposing a geometry-informed multi-modal framework with cross-view attention mechanisms.

摘要翻译

从稀疏观测中预测空间变化的房间脉冲响应（RIR）对于沉浸式空间音频渲染而言是一个关键但极具挑战性的逆问题。在这项工作中，我们提出了 EIGENET，一种用于少样本新视角 RIR 预测的几何感知多模态框架。其核心是一个跨视角交替注意力 Transformer，该网络迭代地细化局部视角内的声学结构以及全局视角间的空间关系。实验验证表明，该架构能够充分利用多视角多模态上下文，同时在执行 RIR 预测时进行时空推理。受声学射线追踪启发，我们设计了一个几何感知调制模块，以建立几何特征与 RIR 功率谱之间的联系。与此同时，引入一个辅助损失，将单目标波形预测转化为多任务学习框架。通过消融实验，我们证明该设计无论底层骨干网络如何均能带来一致的性能提升，从而确认了其在 RIR 预测任务中的基础效用及架构无关的泛化能力。在模拟和真实世界基准上评估，EIGENET 在少样本新视角 RIR 预测及模拟到真实泛化方面均实现了最先进的性能。代码和检查点可在 https://github.com/FEAfeatherTHER/EigeNet 上获取。

Abstract

Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. In the meantime, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	9.0/10	18.0
model-based RL	2.0	0.0/10	0.0

评分理由: The paper presents a geometry-informed multi-modal framework for RIR prediction, strongly aligning with 'MultiModal' (9.0). It does not address Large Language Models ('MLLM', 2.0), World Models for planning ('World Models', 2.0), or Reinforcement Learning ('model-based RL', 0.0). It unifies geometry and audio modalities ('Unify Models', 3.0). Total weighted score is 32.0, exceeding the dynamic pass score of 26.5. No expert authors from the specified list were found in the author list.

关键词

Geometry-Informed, Multi-Modal Learning, Few-shot Novel View, RIR Prediction, Cross-view Alternate-attention Transformer, Spatial-temporal Reasoning, Acoustic Ray Tracing, Sim-to-real Generalization

深度分析

Chinese Title: EigeNet：几何信息引导的多模态学习用于少样本新视角房间脉冲响应预测

Summary: 本文提出EIGENET，一个几何信息引导的多模态框架，用于从稀疏观测中预测新视角的房间脉冲响应（RIR）。核心是交叉视图交替注意力变换器（CVAT），它交替进行局部视图内注意力和全局跨视图注意力，以充分利用多视图多模态上下文进行时空推理。受声学射线追踪启发，设计了几何信息调制模块，将几何特征与RIR功率谱显式关联。同时引入辅助损失，将单目标波形预测转变为多任务学习框架。在模拟和真实基准上评估，EIGENET在少样本新视角RIR预测和仿真到真实泛化方面均达到最先进性能。

Innovations:

首次将交叉视图交替注意力变换器（CVAT）引入少样本新视角RIR预测任务，并解释了其相对于自注意力和交叉注意力在多视图多模态上下文中的效率优势。
设计了几何信息调制模块，基于声学射线追踪原理显式建模几何特征与RIR功率谱之间的物理关系，增强物理可解释性。
引入辅助损失，将单目标RIR预测转变为多模态多任务学习范式，提升了跨房间泛化能力和物理合理性。
通过消融实验证明调制模块和辅助损失具有架构无关的通用性，能在不同骨干网络上带来一致性能提升。

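Methodology: The model uses an encoder-transformer-decoder architecture. A geometry encoder encodes panoramic depth maps and position coordinates into geometry tokens; an acoustic encoder uses Descript-Audio-Codec to extract continuous latent representations of the reference RIRs as acoustic tokens, while the target-view RIR uses a sinusoidal encoding. The geometry-informed modulation block modulates the target acoustic tokens conditioned on the geometry tokens and regresses a multi-octave-band power spectrum. The geometry and acoustic tokens are then concatenated and fed into the CVAT, which alternates intra-frame self-attention (within a view) and global self-attention (across all views), for spatial-temporal reasoning. Finally, the target acoustic tokens are decoded into a waveform. Training uses an RIR waveform loss and an auxiliary power-spectrum loss.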
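The alternating attention pattern can be made concrete with a toy NumPy sketch. It is not the EigeNet implementation: it uses single-head attention with identity projections, no layer normalization or MLP, and hypothetical names (`self_attention`, `alternate_attention`). It only shows how the same token tensor is attended to first per view and then jointly over all views.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head attention with identity Q/K/V projections; x: (..., L, D)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def alternate_attention(tokens, n_blocks=2):
    """tokens: (views, tokens_per_view, dim)."""
    v, t, d = tokens.shape
    x = tokens
    for _ in range(n_blocks):
        # Intra-view step: the view axis acts as a batch axis,
        # so tokens only attend to tokens of the same view.
        x = x + self_attention(x)
        # Global step: flatten all views into one sequence,
        # so every token can attend to every other view.
        flat = x.reshape(1, v * t, d)
        x = (flat + self_attention(flat)).reshape(v, t, d)
    return x
```

Because the intra-view step treats views as a batch, its attention matrices have size tokens_per_view squared rather than (views × tokens_per_view) squared; only the global step pays the full cost.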

Key Results:

在AcousticRooms和Hearing-Anything-Anywhere数据集上达到最先进性能。
相比现有方法（Few-shot RIR、xRIR、FLAC），在少样本设置下表现出更强的鲁棒性和泛化能力。
消融实验表明CVAT优于标准自注意力和交叉注意力；调制模块和辅助损失在不同骨干网络上均带来一致增益。
在仿真到真实场景的泛化中表现优异。

Tech Stack:

Cross-view Alternate-attention Transformer (CVAT)
Descript-Audio-Codec (DAC) 作为声学令牌提取器
Vision Transformer (ViT) 用于几何编码
位置编码与多层感知机 (MLP)
正弦编码 (sinusoidal encoding) 用于目标视图RIR
多倍频程功率谱回归
多任务学习 (波形损失 + 功率谱辅助损失)

Strengths:

创新性地将交替注意力机制从视觉领域迁移到音频多模态任务，并给出理论解释。
物理启发的几何信息调制模块增强了模型的可解释性和泛化能力。
多任务学习框架有效提升了预测质量，且具有架构通用性。
在多个基准上达到SOTA，包括仿真到真实的泛化。

Limitations:

依赖预训练的DAC模型，可能存在领域偏移，尽管实验表明重建质量高。
仅考虑了全景深度图作为几何信息，未利用更丰富的几何表示（如网格、语义）。
方法在极稀疏参考（如1个参考）下的性能可能仍有提升空间。
未讨论计算复杂度与实时性，可能不适用于低延迟应用。

Relevance To Keywords:

Unify Models / 原生多模态大模型: 论文提出的多模态框架融合了视觉（深度图）和听觉（RIR）模态，与多模态大模型方向相关。
World Models / 世界模型: RIR预测是构建沉浸式虚拟世界的关键组件，有助于世界模型中的听觉场景理解。
Representation Learning / 表征学习: 通过交替注意力学习多视图多模态的联合表征，属于表征学习范畴。
Model-Based RL / 基于模型的强化学习: 虽然论文未直接涉及RL，但RIR预测可作为环境模型的一部分用于模拟交互。
后训练: 论文使用预训练的DAC和ViT，并在此基础上进行训练，涉及后训练策略。

42. IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage [PASS]

Score: 30.0 / 26.5

Authors: Yuhan Li, Mingxu Zhang, Dazhong Shen, Ying Sun

Published: 2026-05-27

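TL;DR: The paper proposes an interpretable RLVR data selection method based on verifier-coupled sparse autoencoder coverage, which improves LLM accuracy on math reasoning tasks.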

摘要翻译

可验证奖励的强化学习（RLVR）已成为提升大语言模型（LLM）推理的关键技术，但其数据效率低下仍是主要瓶颈。现有方法仅部分解决了这一问题，每种方法至少缺失子集级覆盖、验证器信号利用或可解释性之一。为填补这一空白，我们提出了 IRDS（可解释的 RLVR 数据选择），该方法基于稀疏自编码器（SAE）聚类基础选择 RLVR 训练实例，使得选择本身可在可识别的问题模式上进行审计。为了选择模型既失败但仍能从中学习的实例，我们在 SAE 基础上引入了验证器耦合覆盖目标，并通过贪心对数行列式最大化求解。在三个指令微调模型和六个数学推理基准上的实验表明，IRDS 实现了最高的整体准确率，在两个 Qwen 模型上超过最强基线 +3.9/+4.0 个百分点，在 Llama-3.1-8B 上超过 +0.5 个百分点，同时计算成本比基于轨迹的基线低一个数量级。

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	4.0/10	8.0

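Scoring rationale: The paper centers on RLVR data selection with sparse autoencoders. It has some connection to MLLM and model-based RL (it involves LLMs and RL) but little connection to World Models, MultiModal, or Unify Models. No specified expert author was found. The weighted total is 30.0, above the dynamic pass score of 26.5.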

关键词

RLVR, Data Selection, Sparse Autoencoder, LLM Reasoning, Interpretability, Verifier, Math Benchmarks, Instruction-tuned

深度分析

Chinese Title: IRDS：基于验证器耦合稀疏自编码器覆盖的可解释RLVR数据选择

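Summary: This paper proposes IRDS (Interpretable RLVR Data Selection) to address the data-efficiency bottleneck in reinforcement learning with verifiable rewards (RLVR). Existing methods fall short on subset-level coverage, use of verifier signals, or interpretability. IRDS represents training instances in sparse autoencoder (SAE) cluster coordinates, so the selection can be audited on recognizable problem motifs. It introduces a verifier-coupled coverage objective (with difficulty and trainability weights) and uses greedy log-determinant maximization to pick a non-redundant, failure-relevant, and trainable subset. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by 3.9/4.0 percentage points on the Qwen models and by 0.5 percentage points on Llama-3.1-8B, with a selection cost an order of magnitude lower than the trajectory-based baseline.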

Innovations:

首次将稀疏自编码器聚类基用于RLVR数据选择，使选择过程在可识别的问题模式上可审计。
设计了验证器耦合的覆盖目标，同时考虑失败相关性、可训练性和非冗余性，通过贪心对数行列式最大化求解。
提出困难度权重和可训练性权重，分别处理模型失败和梯度消失问题，并组合成验证器耦合度量。
在三个模型和六个基准上取得最佳整体准确率，选择成本比轨迹基线低一个数量级。

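Methodology: First, a BatchTopK sparse autoencoder (SAE) is trained on the base model's final-layer activations, and related latents are clustered into 256 semantic clusters; each instance is represented by its cluster activation mass. Next, a difficulty weight (monotonically decreasing) and a trainability weight (unimodal) are derived from verifier success counts and used to build a verifier-coupled covariance matrix, which becomes a metric matrix after ridge regularization and eigenvalue clipping. Optionally, a residual gradient block is appended to the design vector. Finally, greedy log-determinant maximization (D-optimal design) selects a non-redundant subset, and RLVR training with GRPO is run on the selected subset.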
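The final selection step, greedy log-determinant maximization, can be sketched generically as follows. This is a textbook greedy D-optimal design, not the authors' code: the verifier-coupled metric is reduced to an optional per-instance weight, and `ridge`, `weights`, and the function name are assumptions made for illustration. Weights are assumed non-negative.

```python
import numpy as np

def greedy_logdet_select(X, k, ridge=1e-3, weights=None):
    """Greedily pick k rows of X to maximize
    log det(ridge * I + sum_i w_i * x_i x_i^T)."""
    n, d = X.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    A_inv = np.eye(d) / ridge        # inverse of the current information matrix
    remaining = np.ones(n, dtype=bool)
    chosen = []
    for _ in range(min(k, n)):
        # Matrix determinant lemma: adding row i raises log det by
        # log(1 + w_i * x_i^T A^{-1} x_i).
        quad = np.einsum("ij,jk,ik->i", X, A_inv, X)
        gain = np.where(remaining, np.log1p(w * quad), -np.inf)
        i = int(np.argmax(gain))
        chosen.append(i)
        remaining[i] = False
        # Sherman-Morrison rank-one update of the inverse.
        Ax = A_inv @ X[i]
        A_inv -= w[i] * np.outer(Ax, Ax) / (1.0 + w[i] * quad[i])
    return chosen
```

Once a direction has been covered, the gain of a near-duplicate row collapses, which is how the objective discourages redundant picks; a weight that is higher for failed-but-learnable instances would bias the same loop toward them.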

Key Results:

IRDS在三个指令微调模型（Qwen2.5-7B-Instruct、Qwen2.5-14B-Instruct、Llama-3.1-8B-Instruct）和六个数学推理基准（MATH-500、AIME2024、AMC2023、OlympiadBench、MinervaMath、GSM8K）上取得最高整体准确率。
在Qwen2.5-7B上超过最强基线+3.9个百分点，在Qwen2.5-14B上+4.0个百分点，在Llama-3.1-8B上+0.5个百分点。
选择成本比基于轨迹的基线（如DeepScaler）低一个数量级。
SAE聚类坐标提供了可解释的问题模式（如几何、除数计数等），便于审计。

Tech Stack:

BatchTopK稀疏自编码器（SAE）
贪心对数行列式最大化（D-optimal设计）
GRPO（Group Relative Policy Optimization）
Beta分布后验与digamma函数
岭正则化与特征值裁剪
球形K-means聚类
残差梯度块（可选）

Strengths:

同时解决了失败相关性、可训练性和非冗余性三个关键问题。
基于SAE的表示具有可解释性，便于审计和调试。
在多个模型和基准上取得显著性能提升，且选择成本低。
方法通用，可应用于不同基础模型和RLVR框架。

Limitations:

SAE训练需要额外的计算资源和激活数据，可能增加前期成本。
聚类数量F=256为固定值，可能不适用于所有数据集。
方法依赖于基础模型的初始策略，对于不同初始策略可能需要重新训练SAE。
仅针对数学推理任务验证，在其他领域（如代码、科学）的泛化性未知。

Relevance To Keywords:

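Reinforcement learning: IRDS directly targets data selection in RLVR (reinforcement learning with verifiable rewards) and is highly relevant to RL post-training.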
表征学习：使用稀疏自编码器进行表征学习，将实例映射到可解释的聚类坐标。
世界模型：虽然论文未直接涉及世界模型，但RLVR中的可验证奖励可视为一种简单世界模型，IRDS通过验证器信号改进数据选择。
多模态大模型：论文实验基于文本数学推理，但方法可扩展到多模态场景（如视觉数学问题），SAE可处理多模态激活。
后训练：IRDS是后训练阶段的数据选择方法，与后训练（如RLVR）紧密相关。

43. Chreode: A Cell World Model for One-Step Temporal Dynamics and Perturbation Prediction [PASS]

Score: 30.0 / 26.5

Authors: Mufan Qiu, Genhui Zheng, Yinuo Xu, Ruichen Zhang, Ying Ding, Qi Long, Tianlong Chen

Published: 2026-05-27

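TL;DR: This paper proposes Chreode, a cell world model that uses one-step dynamics prediction to improve the accuracy of predicting single-cell state transitions under genetic perturbation.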

摘要翻译

预测细胞在发育信号或遗传扰动下如何改变其转录状态，是计算机生物学（in-silico biology）及 AI 虚拟细胞计划（AI Virtual Cell program）的计算核心。现有方法要么拟合忽略时间维度的静态对照 - 处理映射，要么在每个数据集上独立求解多步常微分方程（ODE）/ 薛定谔桥（Schrödinger-bridge）问题。我们引入 Chreode，这是一种一步细胞世界模型，通过结构化残差转换算子预测基于动作条件的细胞状态转换。它将分布演化从推理阶段转移到训练阶段，从而实现单次生成，同时保留受 Waddington 启发的分解，包括下坡景观流、旋转切向动力学和随机扩散。该模型在涵盖 7 个数据集的 240 万细胞小鼠胚胎图谱上，使用共享的 scVI 编码器和基于 DiT 的动力学骨干进行预训练。作为微调初始化，Chreode 在 Weinreb 造血和 Veres 胰岛分化上的目标级 Sinkhorn 距离优于匹配的从头训练模型、PI-SDE 和 PRESCIENT。作为 GEARS 的可迁移基因状态嵌入，预训练的动力学表示将 Norman Perturb-seq 上共享词汇 DE20 均方误差从 0.2121 降低至 0.1858，实现了 12.4% 的相对改进，且不改变 GEARS 的训练流程。我们将这种向扰动预测的迁移解释为证据，表明预训练的发育轨迹动力学编码了可迁移至 CRISPR 诱导状态转换的分化原语，因为两者都涉及共享潜在几何中的细胞状态转换。预训练的骨干还在 Weinreb 上产生零样本克隆命运评分，与强大的动态最优传输（dynamic-OT）基线具有竞争力。

Abstract

Predicting how a cell will change its transcriptional state under a developmental signal or a genetic perturbation is the computational core of in-silico biology and the AI Virtual Cell program. Existing approaches either fit static control-to-treated maps that discard time, or solve multi-step ODE / Schrödinger-bridge problems on each dataset independently. We introduce Chreode, a one-step cell world model that predicts action-conditioned cell-state transitions through a structured residual transition operator. It shifts distributional evolution from inference time to training time, enabling single-pass generation while preserving a Waddington-inspired decomposition into downhill landscape flow, rotational in-tangent dynamics, and stochastic spread. The model is pretrained with a shared scVI encoder and a DiT-based dynamics backbone on a 2.4M-cell mouse embryonic atlas spanning 7 datasets. As a fine-tuning initialization, Chreode improves per-target Sinkhorn distance on Weinreb hematopoiesis and Veres islet differentiation over matched scratch models, PI-SDE, and PRESCIENT. As a transferable gene-state embedding for GEARS, the pretrained dynamics representation reduces shared-vocabulary DE20 mean squared error on Norman Perturb-seq from 0.2121 to 0.1858, a 12.4% relative improvement, without changing the GEARS training procedure. We interpret this transfer to perturbation prediction as evidence that pretrained developmental-trajectory dynamics encode differentiation primitives transferable to CRISPR-induced state shifts, since both involve cell-state transitions in a shared latent geometry. The pretrained backbone additionally produces zero-shot clonal fate scores on Weinreb that are competitive with strong dynamic-OT baselines.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	10.0/10	20.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

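Scoring rationale: The paper's core contribution is a 'Cell World Model', which maps directly to the World Models keyword (10 points). The model combines an scVI encoder with a DiT dynamics backbone, unifying representation and dynamics modeling, so Unify Models scores 5. The paper focuses on dynamics prediction for single-cell genomics data and involves no language models, multimodal fusion, or reinforcement-learning control, so MLLM, MultiModal, and model-based RL score 0. The author list contains none of the five specified experts. The weighted total is 30.0, above the dynamic pass score of 26.5.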

关键词

Cell World Model, One-Step Temporal Dynamics, Perturbation Prediction, scVI Encoder, DiT-based Dynamics, Single-cell Genomics, Transfer Learning

深度分析

Chinese Title: Chreode：一种用于单步时间动态与扰动预测的细胞世界模型

Summary: 本文提出Chreode，一种单步细胞世界模型，用于预测细胞在发育信号或遗传扰动下的转录状态变化。现有方法要么丢弃时间信息拟合静态对照-处理映射，要么在每个数据集上独立求解多步ODE/Schrödinger桥问题。Chreode通过结构化残差转移算子实现动作条件化的细胞状态转移，将分布演化从推理时间转移到训练时间，实现单步生成。其残差分解为Waddington景观的下坡流、切向旋转动力学和随机扩散。模型使用共享scVI编码器和DiT动力学骨干在包含7个数据集、240万细胞的鼠胚图谱上预训练。作为微调初始化，Chreode在Weinreb造血和Veres胰岛分化任务上优于PI-SDE和PRESCIENT；作为基因状态嵌入注入GEARS，在Norman Perturb-seq上将DE20均方误差从0.2121降至0.1858（相对提升12.4%）。预训练骨干还能零样本生成与动态OT基线竞争的克隆命运分数。

Innovations:

提出单步细胞世界模型，通过训练时推前分布演化实现单次前向传播预测未来状态，避免多步ODE/SDE积分。
设计Waddington风格残差转移算子，显式分解为势能梯度、反对称旋转流和随机扩散，融入生物学先验。
构建两阶段预训练流程：共享scVI编码器+DiT动力学骨干，在240万细胞鼠胚图谱上无OT监督预训练。
提供两种迁移模式：作为微调初始化提升时间转移预测性能；作为基因状态嵌入提升GEARS扰动预测精度。
零样本克隆命运评分达到与动态OT基线相当的水平，证明预训练动力学表征可迁移至CRISPR诱导的状态变化。

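Methodology: Chreode uses two-stage pretraining. First, an scVI encoder maps transcriptomes over a cross-species orthologous gene vocabulary (16,520 genes) into a shared latent space. Then a DiT (Diffusion Transformer) dynamics backbone is trained on the 2.4M-cell mouse embryonic atlas with a population-level drift-field loss (based on MMD and the Sinkhorn Wasserstein-2 distance). The core of the model is a one-step residual transition operator, ẑ_{t+Δ} = z_t + α(Δ)R_θ, where R_θ comprises a potential-gradient term -∇_zU_θ, an antisymmetric rotation term S_θ z_t (parameterized as S_θ = P_θ Q_θ^T - Q_θ P_θ^T), and a stochastic diffusion term σ_θ⊙ε, with time gate α(Δ)=1-e^{-Δ/τ_0}. For fine-tuning, the encoder is frozen and only the dynamics backbone is updated; for embedding use, intermediate representations of the pretrained backbone are extracted and injected into GEARS.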
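A small numerical sketch of the one-step residual transition operator may help in reading the formula. In the paper the potential, the factors P and Q, and the noise scale are produced by learned networks; here they are replaced by fixed arrays and a toy quadratic potential U(z) = 0.5‖z‖², so every concrete value and function name below is an assumption for illustration.

```python
import numpy as np

def time_gate(delta, tau0=1.0):
    # alpha(delta) = 1 - exp(-delta / tau0): 0 at delta = 0, saturates at 1.
    return 1.0 - np.exp(-delta / tau0)

def antisymmetric(P, Q):
    # S = P Q^T - Q P^T satisfies S^T = -S by construction.
    return P @ Q.T - Q @ P.T

def one_step(z, delta, P, Q, sigma, rng, tau0=1.0):
    """z_hat = z + alpha(delta) * (-grad U(z) + S z + sigma * eps)."""
    grad_U = z                                    # toy potential (assumption)
    rotation = antisymmetric(P, Q) @ z            # tangential, norm-preserving flow
    noise = sigma * rng.standard_normal(z.shape)  # stochastic spread
    residual = -grad_U + rotation + noise
    return z + time_gate(delta, tau0) * residual
```

Because S is antisymmetric, z·(Sz) = 0, so the rotation term moves the state along level sets of the toy potential rather than up or down it. This is the sense in which the decomposition separates downhill flow from rotational dynamics.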

Key Results:

在Weinreb造血数据集上，微调后的Chreode在d4和d6时间点的Sinkhorn W2距离优于匹配的随机初始化模型、PI-SDE和PRESCIENT。
在Veres胰岛分化数据集上，Chreode在所有目标时间点t1-t7均优于基线。
在Norman Perturb-seq上，将预训练动力学表示作为基因状态嵌入注入GEARS，DE20均方误差从0.2121降至0.1858（相对提升12.4%）。
零样本克隆命运评分在Weinreb上与moscot、WOT、scDiffEq等动态OT基线竞争。
预训练模型在88个时间点、7个数据集上训练，无需任何最优传输监督。

Tech Stack:

scVI（单细胞变分推断编码器）
DiT（扩散Transformer）
Waddington残差分解（势能梯度、反对称旋转、随机扩散）
群体级漂移场损失（基于MMD和Sinkhorn Wasserstein-2距离）
时间门控函数α(Δ)=1-e^{-Δ/τ_0}
反对称算子参数化：S_θ = P_θ Q_θ^T - Q_θ P_θ^T
PyTorch (inferred)
GEARS（图神经网络扰动预测器）

Strengths:

单步推理大幅降低计算成本，适合大规模虚拟筛选（10^8查询量级）。
显式融入Waddington生物学先验，使模型具有可解释的动力学分解。
两阶段预训练策略实现跨数据集、跨物种迁移，提升下游任务性能。
提供两种迁移模式（微调与嵌入注入），灵活适配不同任务。
在多个公开基准上取得显著改进，且无需OT监督。

Limitations:

模型依赖scVI编码器，潜在空间质量影响整体性能。
预训练数据仅包含鼠胚图谱，跨物种迁移至人类数据需进一步验证。
未与闭源大模型（如AlphaCell、X-Cell）直接比较，性能优势局限于开源基线。
单步残差假设可能无法捕捉长期复杂动力学（如多稳态切换）。
扰动预测迁移仅通过嵌入注入GEARS，未提出独立扰动编码器。

Relevance To Keywords:

世界模型：论文明确提出细胞世界模型，类比强化学习中的世界模型，预测状态-动作-转移，与关键词“世界模型”高度相关。
表征学习：scVI编码器和DiT动力学骨干学习潜在表征，并通过迁移学习验证表征有效性，与“表征学习”相关。
模型基强化学习：论文引用世界模型在RL中的类比，但未涉及规划或策略优化，相关性较弱。
原生多模态大模型、多模态大模型的理解和生成一体化：论文专注于单细胞转录组数据，不涉及多模态，相关性低。
后训练：论文使用预训练+微调范式，属于后训练范畴，但未强调后训练技术本身。
强化学习：仅类比，未实际应用RL算法，相关性低。

44. PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment [PASS]

Score: 30.0 / 26.5

Authors: Duanchu Wang, Cheng Li, Junjie Yang, Jing Huang, Zihang Cheng, Zhi Gao, ZhuBohong, Di Wang

Published: 2026-05-27

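TL;DR: PointQ-Bench introduces a point cloud quality assessment benchmark with diagnostic tasks. It finds that strong 2D MLLMs generally outperform existing 3D VLMs, yet current models still show a perception-diagnosis gap in grounded diagnosis and quality calibration.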

摘要翻译

点云质量在三维采集、重建、渲染及感知中起着关键作用，然而现有的点云质量评估（PCQA）研究仍主要集中于标量分数预测。在实际检测场景中，质量评估通常涉及识别缺陷、表征主导问题类型、评估下游可用性并提供基于证据的描述，而这些内容并未被当前基准明确评估。我们引入了 PointQ-Bench，这是一个旨在将 PCQA 从标量评分扩展至全面质量理解的基准。PointQ-Bench 包含 3,083 个点云，涵盖真实扫描、模拟失真及 AI 生成内容，覆盖八种主要问题类型。每个样本均标注了平均意见分数（MOS）、质量等级、问题标签、基于专家的描述以及 12,332 对问答。该基准支持三个感知导向任务：异常感知、缺陷诊断和可用性分级，以及一个认知导向任务：开放式质量报告。为了评估自由形式的质量描述，我们进一步提出了 SSFRQ-5D，这是一种通过人机一致性分析验证的五维评估协议。在 14 个视觉 - 语言模型和传统 PCQA 基线上的广泛实验揭示了一致的感知 - 诊断差距：尽管当前模型在粗粒度缺陷感知方面展现出新兴能力，但在基于实证的诊断和质量校准方面表现困难。性能强大的 2D 多模态大语言模型（MLLMs）通常优于现有的 3D 视觉语言模型（VLMs），而额外视图或点级输入的益处是非均匀的，随任务、数据源和模型而异，特别是在边界模糊条件下。总体而言，PointQ-Bench 提供了一个诊断测试平台，用于推进可靠且可解释的点云质量理解。

Abstract

Point cloud quality plays a critical role in 3D acquisition, reconstruction, rendering, and perception, yet existing point cloud quality assessment (PCQA) research remains largely centered on scalar score prediction. In practical inspection scenarios, quality assessment often involves identifying defects, characterizing dominant issue types, assessing downstream usability, and providing evidence-supported descriptions, which are not explicitly evaluated by current benchmarks. We introduce PointQ-Bench, a benchmark designed to extend PCQA from scalar scoring toward comprehensive quality understanding. PointQ-Bench consists of 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, covering eight major issue types. Each sample is annotated with mean opinion scores (MOS), quality levels, issue tags, expert-grounded descriptions, and 12,332 question-answer pairs. The benchmark supports three perception-oriented tasks: anomaly sensing, defect diagnosis, and usability grading, as well as a cognition-oriented task of open-ended quality reporting. To evaluate free-form quality descriptions, we further propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a consistent perception-diagnosis gap: while current models exhibit emerging abilities in coarse defect perception, they struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs is non-uniform, varying across tasks, data sources, and models, particularly under boundary-ambiguous conditions. Overall, PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	6.0/10	12.0
MultiModal	2.0	7.0/10	14.0
model-based RL	2.0	0.0/10	0.0

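Scoring rationale: The paper's core is the construction of a point cloud quality assessment benchmark (PointQ-Bench) and the evaluation of vision-language models (MLLMs) on multimodal point-cloud-and-text tasks, so MLLM and MultiModal relevance is fairly high. The paper proposes no unified model architecture or world model and involves no reinforcement learning, so those relevance scores are low.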

关键词

Point Cloud Quality Assessment, Benchmark, Diagnostic, Vision-Language Models, Multimodal, Defect Diagnosis, Anomaly Sensing

深度分析

Chinese Title: PointQ-Bench：面向诊断与可解释点云质量评估的基准测试

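Summary: This paper presents PointQ-Bench, the first benchmark for point cloud quality understanding that goes beyond scalar score prediction. It contains 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, annotated with mean opinion scores, quality levels, defect tags, expert descriptions, and 12,332 question-answer pairs. The benchmark supports three perception tasks (anomaly sensing, defect diagnosis, usability grading) and one cognition task (open-ended quality reporting). To evaluate free-form quality descriptions, the authors propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Experiments on 14 vision-language models and traditional PCQA baselines show that current models have emerging abilities in coarse defect perception but clear weaknesses in grounded diagnosis and quality calibration; 2D multimodal LLMs generally outperform 3D vision-language models, and the benefit of additional views or point-level inputs varies with task, data source, and model, especially under boundary-ambiguous conditions. PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.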

Innovations:

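Proposes PointQ-Bench, the first benchmark for point cloud quality understanding that goes beyond traditional scalar MOS prediction and covers anomaly sensing, defect diagnosis, usability grading, and open-ended quality reporting.
Builds a large-scale dataset of 3,083 point clouds and 12,332 question-answer pairs, covering three sources (authentic scans, simulated distortions, AI-generated content) and eight major defect types.
Proposes the SSFRQ-5D five-dimensional evaluation protocol to standardize the evaluation of open-ended quality descriptions, validated through human-AI agreement analysis.
Systematically benchmarks 14 frontier models, revealing current bottlenecks in grounded diagnosis and quality calibration as well as the performance gap between 2D MLLMs and 3D VLMs.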

Methodology: 论文采用数据驱动的方法构建基准。首先收集三种来源（真实扫描、模拟失真、AI生成）的点云数据，组织为八种缺陷类型。然后通过专家标注获取MOS、质量等级、缺陷标签、专家描述和问答对。设计三个感知任务（异常感知、缺陷诊断、可用性分级）和一个认知任务（开放质量报告），并采用多模板问题设计。对于开放报告，提出SSFRQ-5D五维评估协议（包括感知、诊断、可用性、解释、整体），通过人机一致性分析验证自动评估的可靠性。最后在14个模型（包括专有MLLMs、开源MLLMs、原生3D VLMs和传统PCQA基线）上进行实验，分析不同任务、数据源和模型下的表现。

Key Results:

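Current models show emerging abilities in coarse defect perception but clear bottlenecks in grounded diagnosis and quality calibration.
Strong 2D multimodal LLMs generally outperform existing 3D vision-language models.
The benefit of additional views or point-level inputs is non-uniform and depends on task, data source, and model, especially under boundary-ambiguous conditions.
The SSFRQ-5D protocol agrees well with expert assessment, supporting the reliability of the automatic evaluation.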

Tech Stack:

点云质量评估（PCQA）方法：MS-GraphSIM、PCQM、MPED等全参考方法；IT-PCQA、GPA-Net、COPP-Net等无参考方法。
多模态大模型：14个视觉语言模型（包括专有和开源MLLMs、3D VLMs）。
评估协议：SSFRQ-5D五维评估（感知、诊断、可用性、解释、整体）。
数据集构建：专家标注、多模板问答设计、边界样本保留策略。
统计分析：人机一致性分析（如Kappa系数或相关分析）。

Strengths:

首次系统性地将点云质量评估从标量分数扩展到多维质量理解，填补了现有基准的空白。
数据集规模大、来源多样、标注丰富，支持多种任务，具有较高的实用价值。
提出SSFRQ-5D评估协议，解决了开放质量描述的可比较性问题，并通过实验验证其有效性。
广泛的模型基准测试揭示了当前模型的局限性和不同模态模型的差异，为未来研究提供了方向。

Limitations:

数据集规模（3,083个点云）相对于某些大规模基准仍较小，可能限制模型泛化能力。
SSFRQ-5D评估协议虽然有效，但可能无法完全捕捉质量描述的所有细微差异。
实验主要基于现有模型，未提出新的点云质量理解模型，属于分析性工作。
对AI生成内容的覆盖可能不够全面，随着生成技术发展需持续更新。

Relevance To Keywords:

原生多模态大模型：论文基准测试了多个多模态大模型（MLLMs），包括2D和3D版本，与原生多模态大模型研究高度相关。
多模态大模型的理解和生成一体化：论文中的开放质量报告任务涉及生成式描述，与理解-生成一体化方向相关。
表征学习：点云质量评估依赖于有效的点云表征，论文中模型使用的投影、点级输入等涉及表征学习。
世界模型：点云质量评估可视为对3D世界感知质量的评估，与世界模型中对环境建模的质量要求间接相关。
强化学习、后训练：论文未直接涉及强化学习或后训练，但质量评估可作为奖励信号用于强化学习或模型微调，具有潜在关联。
Model-Based RL、Representation Learning：论文未直接涉及，但点云质量评估可服务于基于模型的强化学习中的感知模块。

45. Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR [PASS]

Score: 28.0 / 26.5

Authors: Soeun Kim, Albert No

Published: 2026-05-27

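TL;DR: This paper proposes REFT, which diversifies the first-token distribution in RLVR to increase rollout diversity, improving Pass@1, Pass@8, and Pass@64 reasoning performance across several base models.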

摘要翻译

可验证奖励的强化学习（RLVR）在无标注轨迹的情况下训练推理模型，依靠分组 rollout 使策略暴露于替代推理路径，并由验证器对其进行评分。因此，Rollout 多样性已成为 RLVR 中的核心瓶颈，大多数现有方法通过调整温度、前缀或 rollout 选择来扩大探索范围。我们识别出一个结构上独特但被忽视的位置以扩大这种多样性：推理标记后的第一个 token。策略的 first-token 分布表现出尖锐峰值但正确性解耦的现象，且该 first-token 位置可以扩大 rollout 组覆盖的区域，而无需改变正确性信号。我们引入 REFT（基于 First-Token 多样化的 Rollout 探索），这是 RLVR 管道的一个轻量级添加，它从策略自身的 top-N 候选中均匀采样 first token 并均匀分配 rollout，其余组件保持不变。在生成的多样化 rollout 上训练，REFT 在四个基础模型（0.5B-7B）和三种难度设置下，相较于 DAPO 和 GRPO 基线，在综合 Pass@1、Pass@8 和 Pass@64 指标上均有所提升。

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	4.0/10	8.0

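Scoring rationale: The paper centers on rollout diversity in RLVR (reinforcement learning with verifiable rewards) and proposes REFT. Unify Models, World Models, and MultiModal have low relevance because the paper involves no model unification, environment or world modeling, or multimodal content. MLLM has moderate relevance because the core models are large language models (though not multimodal); model-based RL has moderate relevance because the task is reinforcement learning (although RLVR is usually classed as model-free). No specified expert author was detected, so no bonus applies. The weighted total is 28.0, above the dynamic pass score of 26.5.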

关键词

RLVR, First-Token Diversification, Rollout Diversity, REFT, Reasoning Models, Verifier, Pass@1

深度分析

Chinese Title: 轨迹展开的起点：面向RLVR的低负载、高杠杆首令牌多样化策略

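Summary: Targeting insufficient rollout diversity in reinforcement learning with verifiable rewards (RLVR), this paper proposes a lightweight method, REFT (Rollout Exploration with First-Token Diversification). The study finds that at the first generated token after the reasoning marker, the model's probability distribution is sharply peaked while accuracy is nearly flat, so low-probability first tokens can still lead to correct rollouts. The first token carries little semantic load yet has high leverage: changing it changes the conditional distribution of every later token and steers the rollout onto a different reasoning path. REFT samples first tokens uniformly from the policy's own top-N candidates and splits the rollout budget evenly, leaving everything else unchanged. Across four base models (0.5B-7B) and three difficulty levels of math benchmarks, REFT outperforms DAPO and GRPO baselines on Pass@1, Pass@8, and Pass@64. Experiments indicate that REFT increases training-time rollout diversity and answer coverage, reducing all-wrong groups and first-token over-attribution.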

Innovations:

识别出首令牌位置是RLVR中一个被忽视但高杠杆的多样化切入点，该位置概率与正确性解耦。
提出REFT方法，通过均匀采样策略自身的top-N首令牌来保证首令牌覆盖，不改变后续解码和RL目标。
证明首令牌的多样化能有效增加后续连续区域的语义多样性，且不同首令牌的贡献互补。
揭示GRPO等分组RL方法会放大首令牌偏差，导致过度归因，REFT通过均匀分配缓解此问题。

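Methodology: The paper first runs a diagnostic experiment (Qwen2.5-3B-Instruct on GSM8K) relating first-token probability to accuracy, and finds that probability is sharply peaked while accuracy is flat. It then designs REFT: for each prompt, K first tokens are sampled uniformly from the policy's top-N valid first tokens, the rollout budget is split evenly among them, and all subsequent tokens are generated by the original decoder. During training, REFT replaces only the first-token choice of standard sampling; the other components (verifier, reward, advantage estimation, RL objective, total rollout count) are unchanged. Experiments use the GRPO and DAPO objectives on four base models (0.5B-7B) and evaluate Pass@k on several math benchmarks.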
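The first-token sampling and budget split described above can be sketched in a few lines. The sketch assumes the first-token probabilities are already available as an array indexed by token id; the filtering of "valid" first tokens is omitted, and how a budget that does not divide evenly is handled (extra rollouts go to the earliest sampled tokens) is an assumption, as is the function name.

```python
import numpy as np

def reft_first_tokens(first_token_probs, top_n, k, budget, rng):
    """Return {token_id: rollouts}: k first tokens drawn uniformly from the
    policy's top-N candidates, with the rollout budget split evenly."""
    probs = np.asarray(first_token_probs, dtype=float)
    top = np.argsort(probs)[::-1][:top_n]         # top-N token ids by probability
    # Uniform draw without replacement: probabilities are used only for ranking.
    picked = rng.choice(top, size=min(k, len(top)), replace=False)
    base, extra = divmod(budget, len(picked))
    return {int(tok): base + (1 if i < extra else 0)
            for i, tok in enumerate(picked)}
```

Each rollout would then start from its assigned first token and continue with the unmodified decoder, so the verifier, reward, and RL objective see ordinary completions.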

Key Results:

首令牌位置概率分布尖锐（top-1平均概率0.57），但正确率在top-20内几乎平坦（rank-1: 75.29%, rank-20: 70.40%）。
均匀采样top-20首令牌产生的后续连续区域语义多样性显著高于标准采样和固定首令牌采样。
REFT在Pass@1、Pass@8、Pass@64上一致优于GRPO和DAPO基线，覆盖四个基础模型和三个难度级别。
REFT减少了全错组比例，降低了首令牌的过度归因，增加了训练时的答案覆盖。

Tech Stack:

GRPO (Group Relative Policy Optimization)
DAPO (Dynamic Sampling Policy Optimization)
PPO (Proximal Policy Optimization) 风格裁剪目标
组归一化优势估计 (Group-normalized advantage)
语义多样性度量 (基于嵌入的余弦相似度)
温度采样 (Temperature sampling)
top-N候选采样 (Top-N candidate sampling)

Strengths:

方法轻量，仅修改首令牌采样，不改变RL算法本身，易于集成到现有RLVR流程。
诊断实验清晰揭示了首令牌的“低负载、高杠杆”特性，具有理论洞察。
实验覆盖多个模型规模和难度，结果具有泛化性。
缓解了分组RL中首令牌偏差和过度归因问题，提升了训练信号质量。

Limitations:

仅适用于具有显式推理标记（如<think>）的模型，对于无标记的模型可能需要调整。
top-N的选择（如N=20）可能依赖任务和模型，需要调参。
论文主要聚焦数学推理，在其他领域（如代码、科学推理）的效果未验证。
未与更复杂的多样化方法（如熵树、轨迹分支）进行直接比较，仅对比了温度调整等简单基线。

Relevance To Keywords:

强化学习 (RLVR): 论文核心是改进RLVR中的轨迹多样化，直接相关。
世界模型: 论文未涉及世界模型，但RLVR可视为训练推理模型的一种方法，间接相关。
表征学习: 论文未直接讨论表征学习，但首令牌的多样化可能影响隐层表征分布。
模型基强化学习: 论文使用无模型RL（GRPO/DAPO），不涉及模型基方法。
原生多模态大模型: 论文实验基于语言模型，未涉及多模态，但方法可扩展至多模态。
多模态大模型的理解和生成一体化: 论文聚焦推理生成，与理解生成一体化有一定关联。
后训练: RLVR属于后训练阶段，论文方法直接应用于后训练。

46. When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents? [PASS]

Score: 28.0 / 26.5

Authors: Xinzhe Li, Yaguang Tao

Published: 2026-05-27

TL;DR: This study evaluates memory methods across different inference strategies for tool-use LLM agents within a unified framework, revealing that memory efficacy is contingent on the specific inference method employed.

摘要翻译

用于工具使用型 LLM 代理的多轨迹推理（即生成多个推理尝试并在其中进行选择）得益于跨尝试的知识转移，从而使后续尝试能够避免早期尝试的误区。现有的跨轨迹记忆方法（轨迹级反思、原子事实提取、原始观测注入）均在单一任务和单一推理策略下进行评估，因此尚不清楚报告的增益是反映了记忆抽象本身的特性，还是反映了推理方法的特性。我们提出一个统一框架，沿两个维度分解记忆——转移范围（在扩展内 vs. 跨轨迹）以及转移内容的抽象——并在三种推理策略（best-of-N、beam search、蒙特卡洛树搜索 MCTS）下评估四种方法，该方法应用于涵盖 SQL、知识图谱 (Knowledge Graph) 和 CLI 环境的四个工具使用基准，且采用无验证器设置，以匹配实际代理的部署模式。实验矩阵表明推理方法是一个混淆因子：相同的记忆方法在相同示例下，在不同推理策略下会产生统计上显著不同的结果。反思仅在蒙特卡洛树搜索（MCTS）下达到显著性水平（在 best-of-N 下则不然）；扩展内注入（基于先前兄弟节点的结果对每个候选者进行条件化）仅有助于多样性匮乏的 beam search；而原子事实提取对精度无影响，但在具有可重用环境结构的任务上可将轨迹缩短 19%-26%。

Abstract

Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes -- the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content -- and evaluate four methods under three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best-of-N); within-expansion injection (conditioning each candidate on prior siblings' outcomes) helps only diversity-starved beam search; and atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.5/10	7.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	3.5/10	7.0

评分理由: The paper proposes a unified framework for evaluating memory in LLM agent inference, aligning moderately with 'Unify Models' due to the methodological unification. It focuses on text-based tool use, limiting relevance to 'MLLM' and 'MultiModal' as no multi-modal inputs are involved. No world model learning is performed, scoring low on 'World Models'. The use of MCTS and beam search relates to planning in 'model-based RL', though it lacks explicit environment model learning.

关键词

Multi-trajectory inference, Tool-use LLM agents, Unified framework, Memory methods, Inference strategies, MCTS, Atomic fact extraction, Reflection

深度分析

Chinese Title: 记忆何时有助于工具使用LLM代理的多轨迹推理？

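Summary: This paper studies when cross-trajectory memory (reflection, fact extraction, raw observation injection) improves multi-trajectory inference for tool-use LLM agents. Existing methods are each evaluated under a single inference strategy on a single task, which makes it hard to separate the effect of the memory abstraction from that of the inference method. The authors propose a unified framework that decomposes memory along two axes, transfer scope (within an expansion vs. across trajectories) and abstraction level (raw, reflection, fact extraction), and derive four concrete memory methods. In a verifier-free deployment setting, they run systematic experiments with three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks (SQL, knowledge graph, CLI). Main findings: the accuracy effect of memory depends on the inference strategy; reflection is significant only under MCTS; within-expansion sibling injection helps only beam search; fact extraction is accuracy-neutral but shortens trajectories by 19-26%. The study highlights the inference method as a confound.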

Innovations:

提出统一的记忆框架，按范围（同层扩展内 vs. 跨轨迹）和抽象层次（原始、反射、事实提取）分解记忆，并推导出四种具体方法（包括新方法Raw Sibling）。
在无验证器的部署设置下进行系统实证研究，覆盖记忆×推理×任务共12个实验单元（排除结构不可行组合），首次控制推理策略作为混淆变量。
发现三个关键结论：记忆的准确率效果依赖于推理策略；在MCTS下反射与原始观察注入的准确率无显著差异；事实提取在准确率中性但能提升效率。
识别推理方法作为混淆变量：相同记忆方法在不同推理策略下产生统计显著不同的结果。

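Methodology: The paper uses an experiment matrix that crosses four memory methods (No Memory, Raw Sibling, Reflection, LiTS-Fact) with three inference strategies (best-of-N, beam search, MCTS) and evaluates them on four tool-use benchmarks (SQL, knowledge-graph QA, CLI, etc.). Structurally infeasible combinations are excluded (for example, best-of-N does not support within-expansion memory). All experiments run in a verifier-free setting (the task verifier is used only at evaluation time), matching real deployment conditions. Statistical significance tests compare accuracy and trajectory length across configurations.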

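Key Results:

Reflection significantly improves accuracy only under MCTS and has no significant effect under best-of-N.
Raw sibling injection (Raw Sibling) helps only diversity-starved beam search and gives no significant gain under MCTS.
On the harder knowledge-graph QA benchmark, reflection and Raw Sibling injection under MCTS show no statistically significant difference in accuracy.
Fact extraction (LiTS-Fact) is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.
The same memory method produces statistically distinct results under different inference strategies, which shows that the inference method is a confound.

Tech Stack:

LLM as the base policy
MCTS (Monte Carlo tree search)
Beam search
Best-of-N sampling
Reflection (Reflexion/LATS style)
Atomic fact extraction (mem0-based deduplication by embedding similarity)
SQL, SPARQL, and CLI tool environments
Statistical significance testing

Strengths:

Systematically compares memory abstractions and scopes, filling the gap left by the lack of a unified comparison in prior work.
Treats the inference strategy as a confound in the experimental design, revealing that memory effects are conditional.
Uses a verifier-free deployment setting, which is closer to how tool-use agents are applied in practice.
Covers several tool environments (SQL, knowledge graph, CLI), which strengthens the generality of the conclusions.

Limitations:

Studies only cross-trajectory memory within a single task instance, not cross-task memory (e.g., ExpeL, mem0).
Tool-use environments are limited to serializable and non-serializable ones; more complex dynamic environments are not covered.
The fact extraction method injects all facts rather than retrieving them, so it may be limited by the context window.
Does not explore combining memory with online learning (e.g., reinforcement learning); it covers inference-time memory only.

Relevance To Keywords:

Unify Models: The paper studies inference and memory for LLM agents, which is only indirectly related to unified model frameworks.
World Models: Tool-use agents interact with an external environment, and memory can be seen as partial modeling of environment knowledge; related.
Representation Learning: Fact extraction involves learning structured representations from observations; related.
Model-Based RL: MCTS and memory can be seen as variants of model-based reasoning, but the paper involves no RL training; moderate relevance.
Native multimodal large models: The paper covers only text and code tools, with no multimodal input; low relevance.
Post-training: The paper focuses on inference-time memory and does not involve post-training; low relevance.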

47. OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings (PASS)

Score: 28.0 / 26.5

Authors: Shadmehr Zaregarizi, Khashayar Yavari

Published: 2026-05-27

TL;DR: OccuReward employs LLM-guided reward shaping to enhance demographic equity in building energy management DRL agents, successfully improving satisfaction for vulnerable groups while reducing energy costs.

摘要翻译

大型语言模型（LLMs）在生成用于基于深度强化学习（DRL）的建筑能源管理的奖励函数方面展现出有前景的能力。然而，它们在异质性人口群体中可能展现或加剧居住者舒适度差异的潜力尚未得到探索。本文提出了 OccuReward 框架，旨在探究基于大型语言模型（LLMs）的奖励设计如何影响人口公平性。本文的贡献主要体现在三个方面：一是引入舒适度公平指数（CEI）作为一种新颖的反馈信号；二是提出了一种迭代式、公平感知的大型语言模型奖励塑造方法；三是分析了在这些精炼目标下深度强化学习（DRL）代理的性能。本研究利用来自 ASHRAE 全球热舒适数据库 II（13,440 个投票）的四个基于经验的居住者配置文件，在 CityLearn v2 平台上部署了一个软演员 - 评论家（SAC）代理。我们的方法采用 Gemini API 生成奖励函数逻辑和权重，而非执行每步推理，整个过程跨越三个精炼轮次。15 次实验运行的结果表明，老年女性居住者在初始轮次中一贯体验到最低的满意度。至第 3 轮，公平感知的大型语言模型精炼激活了特定的奖励组件，使得年轻男性（+17.6%）、中年女性（+28.2%）、健康敏感型（+53.8%）和老年女性（+567%）的满意度得到提升，同时能源成本降低了 3.2%。我们的发现表明，尽管奖励层面的干预显著改善了公平性，但基于人工智能的控制器中的人口差异依然存在，这亟需进一步研究建筑系统中的算法公平性。

Abstract

Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-based building energy management. However, their potential to exhibit or exacerbate disparities in occupant comfort across heterogeneous demographic populations remains unexplored. We present OccuReward, a framework investigating how LLM-mediated reward design affects demographic equity. Our contribution is three-fold: the introduction of the Comfort Equity Index (CEI) as a novel feedback signal; a methodology for iterative, equity-aware LLM reward shaping; and a performance analysis of DRL agents under these refined objectives. Utilizing four empirically grounded occupant profiles from the ASHRAE Global Thermal Comfort Database II (13,440 votes), we deploy a Soft Actor-Critic agent in CityLearn v2. Our approach employs the Gemini API to generate reward function logic and weights--rather than performing per-step inference--across three refinement rounds. Results across 15 experimental runs reveal that elderly female occupants consistently experience the lowest satisfaction in initial rounds. By Round 3, equity-aware LLM refinement activates specific reward components that improve satisfaction for Young Males (+17.6%), Mid-aged Females (+28.2%), Health Sensitive (+53.8%), and Elderly Females (+567%), while simultaneously reducing energy costs by 3.2%. Our findings highlight that while reward-level intervention significantly improves equity, demographic disparities in AI-driven controllers persist, necessitating further research into algorithmic fairness in building systems.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	6.0/10	12.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	2.0/10	4.0

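Scoring rationale: The paper's core is LLM-based reward shaping to address equity in building energy management. It uses the Gemini API (which falls under MLLM), hence a relevance of 6.0. It does not involve unified model architectures (Unify Models), world model learning (World Models), or multimodal data fusion (MultiModal), so those relevance scores are low (2.0). The algorithm is SAC, a model-free RL method rather than model-based RL, so that relevance is also low (2.0). The weighted total is 28.0, above the dynamic passing score of 26.5. The author list does not include any of the designated experts.

Keywords

LLM-Guided, Reward Shaping, Demographic Equity, Grid-Interactive Buildings, Comfort Equity Index, Deep Reinforcement Learning, Soft Actor-Critic

In-Depth Analysis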

Chinese Title: OccuReward：基于大语言模型的以居住者为中心的奖励塑造方法，实现电网交互建筑中的人口公平性

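Summary: This paper presents the OccuReward framework, the first study of how LLM-designed reward functions for DRL-based building energy management affect comfort equity across demographic groups. Using 13,440 real votes from the ASHRAE Global Thermal Comfort Database II, the authors build four occupant profiles with different thermal preferences (Young Males, Mid-aged Females, Health Sensitive, Elderly Females) and introduce the Comfort Equity Index (CEI) as a feedback signal. A Soft Actor-Critic agent is trained in CityLearn v2, with the Gemini API generating reward function logic and weights over three refinement rounds. Elderly Females have the lowest satisfaction in the initial rounds; after the equity-aware third round, their satisfaction rises by 567%, every group exceeds a satisfaction of 0.5, and energy cost falls by 3.2%. The study shows that reward-level equity intervention is effective, yet demographic disparities remain, so further research on algorithmic fairness is needed.

Innovations:

Introduces, for the first time, the Comfort Equity Index (CEI), adapted from Jain's fairness index, to quantify how unequally thermal comfort is distributed in a building.
Proposes an iterative LLM-in-the-loop reward shaping method in which CEI feedback guides the LLM to adjust reward weights toward equity.
Builds four empirical occupant profiles (covering age, sex, and health status) from the ASHRAE database, giving the equity evaluation a real-data basis.
Finds that equity-aware reward refinement raises satisfaction for vulnerable groups while also lowering energy cost, pointing to an efficient trade-off region found by the LLM.
Analyzes the limits of reward-level intervention: even with a much lower CEI, Elderly Females remain the worst-off group, which calls for further work on equity control at the setpoint level.

Methodology: The authors first select 13,440 valid records from the ASHRAE Global Thermal Comfort Database II and build four occupant profiles from age, sex, and thermal comfort data. Each profile's comfortable temperature range is the interquartile range of air temperature over comfortable votes (≥4), and each profile gets a flexibility parameter. They then define a comfort satisfaction function (based on temperature distance and flexibility) and the CEI (1 - Jain's fairness index). Experiments use the 5-building residential scenario in CityLearn v2.1.2 with a Soft Actor-Critic agent (MLP 2×256, lr=3e-4, batch=256, 50,000 timesteps) and 5 random seeds per round. The Gemini 1.5 Flash API generates reward functions over three rounds: round 1 uses energy objectives only, round 2 adds energy KPI feedback only, and round 3 adds CEI and per-profile satisfaction feedback. The reward is a weighted sum, R = -Σ w_i · KPI_i. The analysis tracks how the weights evolve and how per-profile satisfaction changes.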
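The CEI definition above (1 minus Jain's fairness index) can be checked against the reported numbers with a short sketch. The function names are illustrative and not taken from the paper. The round-3 satisfaction values (0.80 for Elderly Females, 1.00 for the others) and the initial 0.12 are from the report; the other three initial values (0.85, 0.78, 0.65) are an assumption, back-calculated here from the reported percentage gains.

```python
def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(values)
    sum_sq = sum(v * v for v in values)
    if n == 0 or sum_sq == 0:
        raise ValueError("need at least one non-zero value")
    total = sum(values)
    return (total * total) / (n * sum_sq)


def comfort_equity_index(satisfaction):
    """CEI as described in the report: 1 - Jain's index (0 means perfectly equal)."""
    return 1.0 - jain_index(satisfaction)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# Order: Young Males, Mid-aged Females, Health Sensitive, Elderly Females.
# The first three initial values are back-calculated assumptions.
initial = [0.85, 0.78, 0.65, 0.12]
round_3 = [1.00, 1.00, 1.00, 0.80]

cei_initial = comfort_equity_index(initial)  # about 0.19
cei_round_3 = comfort_equity_index(round_3)  # about 0.0082
```

Under these assumed inputs the sketch reproduces the reported CEI values (0.19 and 0.0082), and `percent_change(0.12, 0.80)` and `percent_change(1.218, 1.179)` match the reported +567% satisfaction gain and 3.2% cost reduction.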

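Key Results:

In the first two rounds (energy objectives only), Elderly Females have a satisfaction of only 0.12 and the CEI is 0.19, while the other groups have higher satisfaction (0.65-0.85).
After the equity-aware third round, Elderly Females reach 0.80 (+567%), Young Males gain 17.6%, Mid-aged Females 28.2%, and Health Sensitive 53.8%; every group exceeds 0.5.
Energy cost falls from 1.218 to 1.179 (a 3.2% reduction), and the CEI falls from 0.19 to 0.0082.
In round 3 the LLM adjusts the reward weights, raising the weights on solar and storage utilization, which lowers energy cost while maintaining higher indoor temperatures.
Despite the large equity gain, Elderly Females remain the worst-off group (0.80 vs. 1.00 for the others), showing that reward-level intervention is not enough to eliminate demographic disparities.

Tech Stack:

ASHRAE Global Thermal Comfort Database II (13,440 votes)
CityLearn v2.1.2 (building energy management simulation environment)
Soft Actor-Critic (SAC) algorithm (Stable-Baselines3 implementation)
Gemini 1.5 Flash API (generates reward function logic and weights)
Jain's fairness index (used to construct the CEI)
Python (reward function code and experiment runs)
MLP 2×256 network, learning rate 3e-4, batch size 256, 50,000 timesteps

Strengths:

Is the first to bring demographic equity into LLM reward design, filling a gap in algorithmic fairness research for building AI.
Builds occupant profiles from a large real-world thermal comfort database, giving the work an empirical basis.
Proposes CEI as a quantifiable equity feedback signal that supports iterative refinement.
Uses multiple random seeds and a three-round comparison, which supports the reliability of the results.
Finds that equity and energy efficiency can be improved together, which has practical value.

Limitations:

The Elderly Females sample is small (n=298), which may affect how representative the profile is.
There is no elderly male profile, so demographic coverage is incomplete.
Only one LLM (Gemini 1.5 Flash) is used; different LLMs are not compared.
Experiments run only in the CityLearn simulator, with no validation in real buildings.
Reward-level intervention does not fully remove demographic disparities; equity control at the setpoint level needs further study.
Dynamic changes in occupant behavior (such as adaptive behavior) are not considered.

Relevance To Keywords:

Unify Models / native multimodal large models: The paper uses an LLM (Gemini) as a reward engineer but involves no multimodality; weak relevance.
World Models / representation learning: The paper involves neither world models nor representation learning; low relevance.
Model-Based RL / reinforcement learning: The paper uses model-free RL (SAC), but it is an RL application, so there is some relevance.
Post-training: The LLM's reward generation is prompt engineering applied to an already trained model and involves no fine-tuning; moderate relevance.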

48. Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning (PASS)

Score: 28.0 / 26.5

Authors: Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen, Sihong Xie

Published: 2026-05-27

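TL;DR: This paper proposes an MCTS-guided policy optimization method that uses hierarchical decomposition to improve LLM spatial reasoning on navigation and planning tasks.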

摘要翻译

大型语言模型（LLMs）在一般语言理解和推理方面表现出卓越的能力。然而，它们在空间推理方面一贯表现不佳，这严重限制了其应用，特别是在具身智能（embodied intelligence）领域。受层次强化学习（hierarchical reinforcement learning）成功的启发，本文提出了一种用于 LLM 空间推理的层次任务分解新方法。我们的方法通过识别关键中间状态并生成简化子环境，引导 LLM 将复杂任务分解为可管理的子任务。然而，我们发现由于缺乏足够的空间先验（spatial prior），LLMs 往往无法推导出最优中间状态，导致次优的任务分解。为了解决这一局限性并增强其规划能力，我们提出了蒙特卡洛树搜索（MCTS）引导的组相对策略优化（M-GRPO），其中我们通过结合 LLM 的先验预测概率及其认知不确定性（epistemic uncertainty）来重新制定 UCT 公式。此外，我们实现了一个更精细的优势函数（advantage function），使模型能够学习最优路径规划。实验结果表明，我们的方法显著提高了 LLM 在空间任务（包括导航、规划和策略游戏）上的性能，达到了最先进的结果（state-of-the-art）。这项工作为 LLM 在现实世界中的应用铺平了道路。

Abstract

LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning that severely limits their application, particularly in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, this paper introduces a novel method for hierarchical task decomposition in LLM spatial reasoning. Our approach guides LLMs to decompose complex tasks into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. However, we identify that LLMs often fail to derive optimal intermediate states due to their insufficient spatial prior, leading to sub-optimal task decomposition. To address this limitation and enhance its planning capability, we propose the MCTS-Guided Group Relative Policy Optimization (M-GRPO), where we reformulate the UCT formula by incorporating the LLM's prior predictive probabilities alongside its epistemic uncertainty. Furthermore, we implement a more fine-grained advantage function, enabling the model to learn optimal path planning. Experimental results demonstrate that our method substantially improves LLM performance on spatial tasks, including navigation, planning, and strategic games, achieving state-of-the-art results. This work paves the way for LLMs in real-world applications.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	5.0/10	10.0

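Scoring rationale: The paper's core is improving LLM spatial reasoning through hierarchical decomposition and reinforcement learning (MCTS, GRPO). Relevance to Unify Models and World Models is low, since it involves neither model unification nor building a generative world model. Relevance to MultiModal is low, since the abstract does not explicitly mention multimodal input. Relevance to MLLM is moderate, since the work is an LLM application. Relevance to model-based RL is higher, since it uses MCTS search and policy optimization, which fits RL-based planning. The weighted total is 28.0, above the dynamic passing score of 26.5. The author list does not include any of the designated experts.

Keywords

LLM Spatial Reasoning, Hierarchical Decomposition, MCTS, Policy Optimization, Task Decomposition, Strategic Games, Navigation Planning

In-Depth Analysis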

Chinese Title: 解构空间复杂性：面向大语言模型空间推理的层次化分解方法

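Summary: LLMs perform well on general language understanding and reasoning but consistently underperform on spatial reasoning tasks (navigation, planning, strategic games), which severely limits their use in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, the paper proposes a hierarchical task decomposition method (HSRL) that guides an LLM to split a complex spatial task into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. Because LLMs lack a sufficient spatial prior, they often fail to derive optimal intermediate states, which leads to sub-optimal decomposition. To address this and strengthen planning, the authors propose MCTS-Guided Group Relative Policy Optimization (M-GRPO), which reformulates the UCT formula to include the LLM's prior predictive probabilities and its epistemic uncertainty, and uses a finer-grained advantage function so that the model can learn optimal path planning. Experiments show that the method substantially improves LLM performance on navigation, planning, and strategic game tasks and achieves state-of-the-art results.

Innovations:

Proposes a hierarchical reasoning framework (HSRL) based on state and environment decomposition, in contrast to conventional language-based decomposition, designed specifically for continuous spatial problems.
Develops the M-GRPO fine-tuning algorithm, which combines Monte Carlo tree search (MCTS) with Group Relative Policy Optimization (GRPO) and modifies the UCT formula to include the LLM's prior confidence and epistemic uncertainty, improving exploration efficiency.
Designs a fine-grained node-level advantage function that replaces the coarse whole-trajectory advantage, giving more precise credit assignment so that the model can learn optimal sub-goal sequences.
Applies flexibly to multi-level planning tasks with only a small amount of data and shows strong generalization.

Methodology: The paper takes a two-stage approach. First, it builds the hierarchical HSRL framework: a high-level planner (the LLM) generates a sequence of intermediate states, an environment processor builds a local sub-environment for each sub-task, and a low-level action generator produces a precise action sequence inside each sub-environment; if a sub-task is unsolvable, the sub-environment is expanded dynamically. Second, to address sub-optimal intermediate states from the high-level planner, it proposes the online fine-tuning algorithm M-GRPO: intermediate state generation is modeled as tree search, the LLM's generation likelihood provides a prior preference, and a perplexity-based uncertainty exploration factor modifies the UCT selection formula. A fine-grained advantage is computed for each intermediate state relative to its sibling nodes and used for policy gradient updates. Training iterates until a termination condition is met.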
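The ingredients named above (a likelihood-based prior, a perplexity-based uncertainty factor, and a sibling-relative advantage) can be sketched as follows. The report does not give the paper's exact formulas, so everything here is an assumption for illustration: `selection_score` is a generic PUCT-style rule with an added uncertainty multiplier, not M-GRPO's reformulated UCT, and `sibling_relative_advantages` is the usual group-relative normalization applied to one set of sibling nodes.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a generated state: exp of the negative mean token log-prob."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def sibling_priors(avg_logliks, tau=1.0):
    """Softmax over siblings' average log-likelihoods at temperature tau."""
    scaled = [a / tau for a in avg_logliks]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def selection_score(q_value, prior, uncertainty, parent_visits, child_visits, c=1.0):
    """Hypothetical PUCT-style score: value plus a prior- and
    uncertainty-weighted exploration bonus (not the paper's formula)."""
    bonus = c * prior * uncertainty * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + bonus


def sibling_relative_advantages(values, eps=1e-8):
    """Normalize each node's value against its siblings: (v - mean) / (std + eps)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]
```

Normalizing within a sibling group, rather than over whole trajectories, is what makes the credit assignment node-level: each intermediate state is compared only with the alternatives proposed at the same parent.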

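Key Results:

On large-scale navigation, object planning, and strategic game benchmarks, HSRL+M-GRPO achieves state-of-the-art performance.
Compared with standard CoT reasoning, ToT, ProgPrompt, and similar methods, HSRL substantially improves spatial reasoning success rate and plan optimality.
M-GRPO addresses the sub-optimal decomposition caused by the LLM's insufficient spatial prior and improves the quality of intermediate state sequences.
The method needs only a small amount of training data and generalizes across task modalities.

Tech Stack:

Hierarchical reinforcement learning
Monte Carlo tree search (MCTS)
Group Relative Policy Optimization (GRPO)
Upper Confidence bounds applied to Trees (UCT)
Perplexity as an uncertainty measure
Average log-likelihood
Temperature parameter τ
Advantage function

Strengths:

Extends hierarchical decomposition from the language level to the state and environment level, which suits continuous spatial reasoning better.
M-GRPO combines the exploration ability of MCTS with the online learning of GRPO, and improves UCT with the LLM's internal confidence and uncertainty, avoiding interference from high-reward but low-confidence paths.
The fine-grained advantage function eases credit assignment over long trajectories and improves training efficiency.
The experiments cover several spatial reasoning tasks, reach SOTA on them, and show high data efficiency.

Limitations:

The method depends on the LLM's initial spatial reasoning ability; if the base model is too weak, the high-level planner may still fail to produce reasonable intermediate states.
MCTS tree search with many simulations can be computationally expensive, which limits real-time use.
In the worst case, dynamic sub-environment expansion degenerates to the original full-scale problem, so complexity blow-up may not be fully avoided.
The paper does not discuss in detail the challenges of deployment on real robots (such as sensor noise and dynamic obstacles).

Relevance To Keywords:

Multimodal large models: The paper focuses on LLM spatial reasoning, which the report treats as an application of multimodal large models to embodied intelligence.
World models: Hierarchical decomposition and sub-environment construction implicitly model environment structure, which relates to the world-model idea.
Representation learning: The LLM abstracts representations by generating intermediate states and sub-environments, but representation learning is not discussed explicitly.
Model-based RL: M-GRPO combines MCTS with policy optimization and falls under model-based RL.
Post-training: M-GRPO is an online fine-tuning (post-training) method for improving the LLM's planning ability.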

49. MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content (PASS)

Score: 28.0 / 26.5

Authors: Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li, Ying Zhang, Leo Yu Zhang, Lida Zhao, Ji Jie, Yuxiao Lu

Published: 2026-05-27

TL;DR: MIRAGE demonstrates a successful prompt injection attack against Mobile GUI agents by manipulating user-generated content, achieving high success rates while maintaining visual realism.

摘要翻译

由视觉 - 语言模型 (VLMs) 驱动的移动图形用户界面 (GUI) 代理将屏幕视为渲染的像素，并基于所见选择操作，因此它们无法可靠地区分可信界面元素与用户生成内容。我们提出了 MIRAGE（移动真实对抗性图形用户界面示例注入），这是一种通过将攻击者控制的文本置入普通用户生成内容区域，从而将良性移动截图转换为提示注入样本的流程，且无需修改代理、应用程序或操作系统。MIRAGE 分为三个阶段：Localizer（定位器）识别截图上的用户可控区域，Generator（生成器）合成上下文感知载荷并以应用程序的原生风格渲染它们，Curator（策展人）调节真实性并在应用程序、区域类型和攻击意图之间平衡样本。一个关键挑战在于，注入的截图必须在视觉上与真实用户内容无法区分，同时仍能误导代理；我们通过分离控制覆盖范围、真实性和分布平衡的阶段来解决这一问题。在一个涵盖十个应用程序和十一种攻击意图的 1,111 样本基准测试上，所有五个被评估的 VLM 代理均存在漏洞，攻击成功率为 23%-30%，且 MIRAGE 在人类真实性评分上高于最强的先前攻击（5 分制下分别为 3.02 和 2.52）。我们进一步发现，单样本真实性与攻击成功率不相关，因此仅靠视觉质量过滤无法可靠地防御此类威胁。

Abstract

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application's native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	7.0/10	14.0
MultiModal	2.0	7.0/10	14.0
model-based RL	2.0	0.0/10	0.0

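Scoring rationale: The paper studies security attacks (prompt injection) against mobile GUI agents built on vision-language models; its core is security rather than unified model architectures or RL algorithms. Relevance to Unify Models, World Models, and model-based RL is therefore very low (0). Because the agents are VLM-based and involve vision-language interaction, relevance to MLLM and MultiModal is high (7). The author list does not include any of the designated experts.

Keywords

Mobile GUI Agents, Prompt Injection, User-Generated Content, Vision-Language Models, Adversarial Attack, Context-Aware, Security Vulnerability

In-Depth Analysis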

Chinese Title: MIRAGE：基于用户生成内容的上下文感知提示注入攻击移动GUI代理

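Summary: This paper presents MIRAGE, a visual-channel prompt injection attack on mobile GUI agents. Mobile GUI agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and cannot reliably separate trusted interface elements from user-generated content. MIRAGE turns benign mobile screenshots into prompt-injection samples with a three-stage pipeline: a Localizer identifies user-controllable regions, a Generator synthesizes context-aware malicious payloads and renders them in the application's native style, and a Curator moderates realism and balances the sample distribution. On a 1,111-sample benchmark covering 10 applications and 11 attack intents, all five VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE's human realism rating (3.02/5) is higher than that of the strongest prior attack (2.52/5). Per-sample realism and attack success are uncorrelated (ρ=-0.03), so visual-quality filtering cannot reliably defend against this threat.

Innovations:

Is the first to propose a runtime prompt-injection threat model for the visual channel of mobile GUI agents, in which the attacker controls only user-generated content regions and does not modify the agent, the application, or the operating system.
Proposes the three-stage MIRAGE pipeline (Localizer, Generator, Curator), which automatically synthesizes visually realistic, context-aware injection samples from benign screenshots without per-application code or agent fine-tuning.
Finds that visual realism and attack success are uncorrelated (ρ=-0.03), so visual-quality filtering cannot reliably defend against such attacks.
Shows that attack success is driven mainly by attack intent rather than model identity or visual realism, with intents forming three stable exploitation tiers across agents.

Methodology: MIRAGE uses a three-stage pipeline: 1) the Localizer uses multi-stage VLM prediction and OCR-guided iteration to tighten the boundaries of user-controllable regions and discards regions that are not user-controllable; 2) the Generator produces a context-aware malicious payload for each region and attack intent and renders it in the application's native style; 3) the Curator filters artifacts through VLM post-render review with retry-based correction and balances the dataset across applications, region types, and attack intents. The evaluation uses five VLM backbones (GPT-4o-mini and others), generates 1,111 attack instances from 353 unique (screenshot, user goal) pairs across 10 mobile applications, and measures attack success rate (ASR) and human realism ratings.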
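The two evaluation quantities mentioned above, attack success rate and the Spearman correlation between per-sample realism and attack success, can be computed with a few lines of standard-library Python. This is a generic sketch of the statistics, not the authors' evaluation code; the function names are illustrative, and ties are handled with average ranks, which is the usual convention.

```python
import math


def attack_success_rate(outcomes):
    """Fraction of attack instances that succeeded (outcomes are booleans)."""
    return sum(1 for o in outcomes if o) / len(outcomes)


def average_ranks(xs):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With realism ratings as `x` and 0/1 success flags as `y`, a value near zero (the report cites ρ=-0.03) means that more realistic-looking samples are not more or less likely to succeed, which is why a realism filter alone is a weak defense.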

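Key Results:

All five VLM agents are vulnerable, with ASR ranging from 23.0% to 30.2%; GPT-4o-mini has the highest (30.2%).
The attack is robust across applications: ASR exceeds 19% in every setting and for every attack intent.
MIRAGE's human realism rating (3.02/5) is higher than that of the strongest prior attack (2.52/5).
Visual realism and attack success are uncorrelated (Spearman ρ=-0.03).
Attack success is driven mainly by attack intent, with intents forming three exploitation tiers (high, medium, low).

Tech Stack:

Vision-language models (VLMs): Qwen-VL series, GPT-4o-mini, and others
Optical character recognition (OCR): EasyOCR
Self-refinement pattern: Madaan et al. (2023)
Attack intent taxonomy: 11 intents (e.g., tap, swipe, text input)
Evaluation metrics: attack success rate (ASR), human realism rating (1-5 Likert scale)
Statistical method: Spearman correlation coefficient

Strengths:

Realistic threat model: injection happens only through user-generated content regions, with no modification of system components, matching real attack scenarios.
High automation: the three-stage pipeline needs no manual intervention and can generate diverse attack samples at scale.
Broad evaluation: 5 VLMs, 10 applications, 11 attack intents, and 1,111 samples.
Important mechanism: visual realism and attack success are uncorrelated, a key insight for defense design.

Limitations:

Only general-purpose VLMs are evaluated as GUI agent backbones; native GUI agents (such as UI-TARS) are not tested because they need local GPUs.
The attack targets screenshot input only and does not consider dynamic interaction or real-time screen changes.
Realism ratings come from human evaluation and may carry subjective bias.
No concrete defense is proposed; the paper only shows that visual filtering is ineffective.

Relevance To Keywords:

The background keywords cover multimodal large models, world models, representation learning, and related topics, but the paper is mainly about a security vulnerability of mobile GUI agents (prompt injection), so direct relevance is low.
The paper uses VLMs as agent backbones, which is an application of multimodal large models, but it does not involve unified understanding and generation, world models, or reinforcement learning.
The threat model and attack can be seen as research on the security of multimodal large models, with weak ties to representation learning and post-training.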

50. Verifiable Benchmarking of Long-Horizon Spatial Biology (PASS)

Score: 28.0 / 26.5

Authors: Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

Published: 2026-05-27

TL;DR: This paper introduces SpatialBench-Long, a benchmark for evaluating long-horizon scientific reasoning of AI agents on spatial biology data, where top models achieved low success rates in deriving accurate biological claims.

摘要翻译

人工智能智能体在生物数据分析中日益有用，但现有的基准测试主要测试广泛的生物知识、可执行工作流或局部分析步骤，而非基于空间测量数据的端到端科学推理。我们引入 SpatialBench-Long，这是一个长时程空间生物学基准，其中智能体必须从原始或近原始数据和校准的实验上下文中推导生物学主张，且无需预设方法。SpatialBench-Long 包含 24 项评估，涵盖原发性胰腺导管腺癌 (PDAC)、工程化胶质母细胞瘤类器官和体内肿瘤、Cas9 谱系追踪肺腺癌，以及小鼠视神经衰老/干预系统，涉及 CosMx、Visium、Xenium、多重误差鲁棒荧光原位杂交 (MERFISH)、单细胞 RNA 测序 (scRNA-seq)、Slide-seq、Slide-tags、组织学和谱系记录数据。候选主张通过复现、独立科学家审查和轨迹检查进行强化。最终答案基于受控词汇和符号进行确定性评分，配套评分标准捕捉通过关键分析瓶颈的进展。在 SpatialBench-Long 基准测试中，三个模型 - 工具对在 72 次运行中的 8 次 (11.1%) 并列：Gemini 3.5 Flash / Pi terminal coding harness、GPT-5.5 / Pi，以及 GPT-5.5 / OpenAI Codex。SpatialBench-Long 测试智能体是否能超越执行程序化分析，从而从复杂空间测量中得出准确科学结论。

Abstract

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	6.0/10	12.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper introduces SpatialBench-Long, a benchmark for long-horizon spatial biology reasoning. It moderately relates to MLLM (4.0) and MultiModal (6.0) as it utilizes MLLMs (Gemini, GPT) and processes multimodal spatial data (imaging + sequencing). However, it does not propose unified models, world model architectures, or model-based RL algorithms, resulting in low scores for those keywords (1.0, 2.0, 1.0 respectively). The total weighted score is (1+2+4+6+1)*2.0 = 28.0, which exceeds the dynamic passing score of 26.5. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.
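The scoring arithmetic used throughout this report (per-keyword relevance times weight, summed and compared with the passing score) can be written out as a short sketch. The relevance values, the weight of 2.0, and the passing score of 26.5 are taken from this entry; the function and variable names are illustrative, and how the "dynamic" passing score is derived is not stated in the report, so it is treated here as a given constant.

```python
# Passing score quoted in the report; its derivation is not described,
# so it is treated as a fixed input here.
PASSING_SCORE = 26.5

WEIGHTS = {
    "Unify Models": 2.0,
    "World Models": 2.0,
    "MLLM": 2.0,
    "MultiModal": 2.0,
    "model-based RL": 2.0,
}

# Relevance scores (out of 10) for SpatialBench-Long, from the table above.
RELEVANCE = {
    "Unify Models": 1.0,
    "World Models": 2.0,
    "MLLM": 4.0,
    "MultiModal": 6.0,
    "model-based RL": 1.0,
}


def weighted_total(relevance, weights):
    """Sum of weight * relevance over all keywords."""
    return sum(weights[keyword] * score for keyword, score in relevance.items())


def passes(relevance, weights, threshold=PASSING_SCORE):
    """A paper passes when its weighted total exceeds the threshold."""
    return weighted_total(relevance, weights) > threshold


total = weighted_total(RELEVANCE, WEIGHTS)  # 28.0
```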

Keywords

Spatial Biology, Long-Horizon Reasoning, Benchmarking, AI Agents, Spatial Measurements, Scientific Claims, Verifiable Evaluation

In-Depth Analysis

Chinese Title: 长时程空间生物学的可验证基准测试

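Summary: This paper presents SpatialBench-Long, a benchmark that evaluates whether AI agents can recover complex scientific conclusions from raw spatial biology data. Existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps, and lack an evaluation of end-to-end scientific reasoning. The benchmark contains 24 evaluation tasks covering pancreatic ductal adenocarcinoma, glioblastoma organoids, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, with data modalities including CosMx, Visium, Xenium, MERFISH, scRNA-seq, Slide-seq, Slide-tags, histology, and lineage recording. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols, supplemented by trajectory diagnostics at key analysis chokepoints. Three model-harness pairs, Gemini 3.5 Flash / Pi, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex, each pass 8 of 72 runs (11.1%), which shows that current systems succeed rarely but not never and that further diagnostic analysis is needed.

Innovations:

Presents the first end-to-end long-horizon spatial biology reasoning benchmark, requiring AI agents to recover scientific conclusions from raw data rather than execute localized steps.
Combines verifiable deterministic grading with trajectory diagnostics: controlled vocabularies and symbols give objective scores, while hidden chokepoint rubrics give partial-progress information.
Hardens candidate claims through independent reproduction, randomized expert review, and multi-model trajectory inspection, excluding paper claims that cannot be reproduced, to keep the benchmark reliable.
Covers multiple spatial biology data modalities (spatial transcriptomics, histology, lineage recording, and others) and multiple disease systems, evaluating cross-modal reasoning.
Introduces an LLM-based trajectory grader that analyzes model behavior at key analysis decision points (such as choosing the right biological comparison or identifying the relevant spatial region), providing a diagnostic signal.

Methodology: The paper proceeds as follows: 1) candidate claims are drawn from published studies and filtered through independent reproduction, randomized expert review, and multi-model trajectory inspection, keeping only reproducible claims as evaluation tasks; 2) each task provides experimental context, raw or near-raw data, and a scientific question, with identifying information about the original study removed to prevent memorization; 3) a deterministic grading function gives a binary pass/fail score to the final scientific conclusion, based on controlled biological vocabularies, ordered relations, or direction labels; 4) hidden trajectory-diagnostic chokepoints (such as data integration, region segmentation, and choice of biological comparison) are scored on model trajectories by an LLM grader as an auxiliary diagnostic; 5) the benchmark is run on several frontier models (Gemini and GPT series) and harnesses (Pi terminal coding harness, OpenAI Codex), and pass rate, cost, turn count, and other metrics are reported.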
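Step 3 above, binary deterministic grading over a controlled vocabulary, might look like the following sketch. The vocabulary, normalization rule, and function names are hypothetical, chosen only to illustrate the idea; the report does not describe the benchmark's actual grader beyond the properties listed.

```python
def normalize(label):
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(label.strip().lower().split())


def grade(answer, expected, vocabulary):
    """Binary pass/fail: the answer must be an allowed label and match the key.

    Answers outside the controlled vocabulary fail outright, so the grader
    never has to interpret free text.
    """
    allowed = {normalize(v) for v in vocabulary}
    candidate = normalize(answer)
    if candidate not in allowed:
        return False
    return candidate == normalize(expected)


def pass_rate(results):
    """Fraction of runs graded as passing."""
    return sum(1 for r in results if r) / len(results)


# Hypothetical direction-label vocabulary for one task.
DIRECTION_LABELS = ["increased", "decreased", "unchanged"]
```

For instance, `grade(" Increased ", "increased", DIRECTION_LABELS)` passes while a free-text answer such as "higher" fails, and 8 passes out of 72 runs gives the 11.1% pass rate quoted in the report.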

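Key Results:

Of 72 runs, Gemini 3.5 Flash / Pi, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex each pass 8, a pass rate of 11.1%.
Current systems succeed rarely but not never, showing that long-horizon spatial biology reasoning remains hard for AI agents.
Trajectory diagnostics show that models often fail at key chokepoints, for example by considering only a single layer of the primary tumor or choosing the wrong direction, although partial progress can be detected.
The benchmark contains 24 evaluation tasks covering 4 study systems (PDAC, GBM, lung adenocarcinoma, optic nerve) and 9 data modalities.
Independent reproduction and review excluded several candidate claims that could not be reproduced, which supports the benchmark's reliability.

Tech Stack:

Data modalities: CosMx, Visium FFPE, Xenium, MERFISH, scRNA-seq, Slide-seq, Slide-tags, H&E/trichrome histology, Cas9 lineage recording
Models: Gemini 3.5 Flash, GPT-5.5
Harnesses: Pi terminal coding harness, OpenAI Codex
Grading: deterministic functions (over controlled vocabularies and symbols), LLM trajectory grader (based on chokepoint rubrics)
Analysis techniques: spatial differential expression, cell neighborhood/niche analysis, reference mapping, histology-transcriptome alignment, allele distance computation

Strengths:

Fills the gap in end-to-end scientific reasoning evaluation by focusing on recovering conclusions from raw data rather than localized steps.
Uses a strict reproduction and review process so that claims are reproducible, avoiding reliance on unreliable paper claims.
Combines deterministic grading with trajectory diagnostics, giving both an objective pass rate and partial-progress information that helps explain failures.
Covers multiple data modalities and disease systems, evaluating cross-modal reasoning and understanding of experimental design.
Releases the evaluation tasks and chokepoint rubrics, which eases later benchmark updates and model improvement.

Limitations:

The current pass rate is very low (11.1%), which may mean the tasks are too hard or the grading too strict; further calibration is needed.
Deterministic grading may miss valid but unanticipated scientific conclusions, producing false negatives.
Trajectory diagnostics depend on an LLM grader, which may introduce subjective bias or prompt sensitivity.
The benchmark is small (24 evaluations) and may not represent the full diversity of spatial biology reasoning.
Open-ended exploration is not evaluated; only recovery of predetermined conclusions is.

Relevance To Keywords:

Some relevance to "native multimodal large models" and "unified understanding and generation in multimodal large models": the benchmark evaluates how AI agents handle multimodal spatial biology data (transcriptomics, histology, lineage, and others), which involves cross-modal understanding and reasoning.
Weak relevance to "representation learning" and "world models": the benchmark does not study either directly, though the evaluated reasoning may implicitly rely on representations of the data's structure.
Low relevance to "reinforcement learning" and "post-training": the benchmark involves neither and only tests the reasoning of existing models.
Overall, the paper's main contribution is to AI agent evaluation; it relates indirectly to multimodal large model evaluation among the keywords, which is not its core topic.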

51. KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs (PASS)

Score: 28.0 / 26.5

Authors: Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

Published: 2026-05-27

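TL;DR: This paper builds Korean speech benchmarks for evaluating speech language models and reveals cross-task performance gaps and complementary weaknesses that English-only evaluation cannot capture.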

摘要翻译

语音语言模型（SpeechLMs）通过将大型语言模型（LLMs）扩展至语音模态，取得了显著进展。然而，SpeechLM 评估仍主要集中于英语，限制了对多语言语音能力的可靠评估。通过 ASR（自动语音识别）、翻译、归一化和 TTS（文本转语音）进行直接的基准迁移可能会破坏语言特定的指令、答案约束和口语形式；对于音频理解而言，迁移源语言音频也无法保留目标语言说话人的属性、口音和副语言特征。为了解决这些局限性，我们提出了两种人机协作基准构建框架：一种是将源语言 SpokenQA 基准迁移至目标语言 SpokenQA 基准，另一种是利用转录文本和说话人元数据将目标语言 ASR 语料库转换为音频理解基准。利用这些框架，我们构建并公开发布了三个韩语语音基准：KVoiceBench 和 KOpenAudioBench 用于韩语 SpokenQA，以及 KMMAU 用于韩语音频理解，共计 12,345 个样本。我们评估了八个近期的 SpeechLMs，发现英韩性能差距在不同模型和任务族之间存在显著差异，且 SpokenQA 和音频理解的排名出现分歧，揭示了仅基于英语评估所无法发现的互补性弱点。

Abstract

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	6.0/10	12.0
model-based RL	2.0	1.0/10	2.0

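Scoring rationale: The paper's core is Korean benchmarking and evaluation of speech language models (SpeechLMs), not unified model architectures or RL algorithms. Relevance to 'Unify Models', 'World Models', and 'model-based RL' is therefore very low (1-2). Relevance to 'MultiModal' is higher (6), because SpeechLMs inherently fuse the speech and text modalities. Relevance to 'MLLM' is moderate (4): SpeechLMs fall under multimodal large models, but the paper focuses on benchmark construction rather than model architecture. The author list does not include Yang Shi or the other designated experts, so no bonus is applied.

Keywords

SpeechLMs, Korean Speech Benchmarks, SpokenQA, Audio Understanding, Agent-Driven Framework, Multilingual Evaluation, Speech Language Models

In-Depth Analysis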

Chinese Title: KVoiceBench、KOpenAudioBench和KMMAU：基于智能体驱动的韩语语音基准用于评估语音语言模型

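Summary: To address the heavy focus of SpeechLM evaluation on English, this paper proposes two human-agent benchmark-construction frameworks: one transfers source-language spoken question answering (SpokenQA) benchmarks into a target language, and the other builds audio understanding benchmarks from target-language ASR corpora. With these frameworks, the authors construct and release three Korean speech benchmarks: KVoiceBench (7,306 samples), KOpenAudioBench (2,835 samples), and KMMAU (2,204 samples), 12,345 samples in total. Evaluating eight recent SpeechLMs, they find that English-Korean performance gaps vary substantially across models and task families and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses that English-only evaluation cannot show. The work offers a reproducible benchmark-construction method for multilingual speech model evaluation.

Innovations:

Proposes a reproducible human-agent framework for transferring source-language SpokenQA benchmarks into a target language, with four stages: ground-truth correction, hyper-translation (超翻译), speech-friendly normalization, and TTS synthesis.
Proposes a framework that builds audio understanding benchmarks from target-language ASR corpora, using rule-based generation, LLM generation, and human annotation to cover acoustic attributes, lexical content, semantics, and holistic abilities.
Builds and releases three Korean speech benchmarks (KVoiceBench, KOpenAudioBench, KMMAU) with auditable rulebooks that make reuse for other target languages easier.
Evaluates eight recent SpeechLMs and shows that English-Korean gaps are non-uniform and that SpokenQA and audio understanding rankings diverge, demonstrating the limits of English-only evaluation.

Methodology: The work uses two human-agent frameworks. 1) SpokenQA transfer: two LLM agents (GPT-5.4 and Gemini Pro) first perform ground-truth correction; rulebook-driven hyper-translation then handles language-specific issues; speech-friendly normalization follows; and TTS finally synthesizes the speech. 2) Audio understanding construction: starting from Korean ASR corpora (KSS, KMSAV, Seoul Corpus), the method depends on the ability type. Acoustic-attribute questions are generated by rules from speaker metadata, lexical questions by rules from transcriptions, semantic-understanding questions by an LLM with human review, and holistic-ability questions by full human annotation.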

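Key Results:

Of 10,719 source samples, 578 (5.4%) were rejected during curation.
The ground-truth correction stage found 221 errors, with Web Questions having the highest error rate (13.4%).
Across the eight SpeechLMs, Korean SpokenQA performance drops markedly relative to English, but the size of the drop differs by model and task family.
Audio understanding and SpokenQA rankings follow different patterns, indicating that the two probe complementary abilities.

Tech Stack:

GPT-5.4 (OpenAI, 2026) as the reviewer agent
Gemini Pro (Gemini Team et al., 2024) as the meta-reviewer agent
TTS synthesis
Rulebook-driven hyper-translation
Text normalization (rule-based and neural)
Korean ASR corpora: KSS, KMSAV, Seoul Corpus

Strengths:

Addresses language-specific problems introduced by a direct translation-and-TTS cascade (such as casing and number normalization errors).
The frameworks are reproducible and auditable, and the rulebooks make reuse by other language communities easier.
Covers two complementary dimensions, SpokenQA and audio understanding, for a broader evaluation.
Releases large-scale Korean speech benchmarks, filling a gap in non-English SpeechLM evaluation.

Limitations:

The frameworks rely on LLM agents for correction and generation, which may introduce the agents' own biases.
TTS-synthesized speech differs from natural speech, which may affect the ecological validity of the evaluation.
Only Korean is covered; generalization of the frameworks to other languages needs further validation.
The audio understanding benchmark is built from existing ASR corpora, so the acoustic scenes and speaker diversity it covers are limited.

Relevance To Keywords:

Native multimodal large models: The SpeechLMs evaluated are multimodal large models (speech + text), and the benchmark-construction method could extend to other modalities.
Unified understanding and generation in multimodal large models: SpokenQA and audio understanding evaluate understanding but not generation; the frameworks could be extended to generation evaluation.
Representation learning: Benchmark construction touches on evaluating speech representations but does not study representation learning methods directly.
World models: The paper does not involve world models, though scenes and acoustic attributes in speech understanding relate to the physical world.
Reinforcement learning / post-training: The paper does not involve training methods and covers evaluation benchmarks only.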

52. Unification and Optimization of Robust Supervised Learning (PASS)

Score: 28.0 / 26.5

Authors: Jonas Hanselle, Valentin Margraf, Clemens Damke, Eyke Hüllermeier

Published: 2026-05-27

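TL;DR: This paper proposes a unified framework for robust supervised learning that composes multiple robustness strategies and optimizes them jointly on tabular, image, and reward modeling tasks, without requiring prior knowledge of the dominant failure mode.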

摘要翻译

文献提出了各种针对经验风险最小化（Empirical Risk Minimization）的鲁棒替代方案，以应对分布偏移、标签噪声和有限样本退化等失效模式。例如分布鲁棒优化（Distributionally Robust Optimization）、标签平滑、邻域风险最小化（Vicinal Risk Minimization）以及 Mixup。然而，此类方法通常孤立地开发，迫使从业者预先选定单一的失效模式，即使对于该任务而言主导模式尚不明确。为解决这一问题，我们沿三个共同设计轴对现有方法进行了组织，并推导出一种可行的训练程序，该程序将鲁棒学习分解为顺序阶段（参考分布增强（Reference Distribution Enrichment）、输入空间扰动（Input-Space Perturbation）、标签空间扰动（Label-Space Perturbation）和样本级聚合（Sample-Level Aggregation）），每个阶段均可选择一种立场（悲观、中性或乐观）。这形成了一个统一的设计空间，在该空间中，联合超参数优化（Joint Hyperparameter Optimization）可以组合并配置适合当前任务的鲁棒性策略。在表格数据、图像和奖励建模基准上，联合超参数优化在各设置中与最佳单方法基线相当，为那些预先不知道哪种失效模式主导其任务的从业者提供了可靠的默认选项。

Abstract

The literature has proposed various robust alternatives to empirical risk minimisation to address failure modes such as distribution shift, label noise and finite-sample degeneracies. Examples include distributionally robust optimization, label smoothing, vicinal risk minimization, and Mixup. However, such approaches are typically developed in isolation, forcing practitioners to commit a priori to a single failure mode even when the dominant mode for the task is unclear. To address this, we organize a broad class of existing methods along three common design axes and derive a tractable training procedure that decomposes robust learning into sequential stages (reference distribution enrichment, input-space perturbation, label-space perturbation, and sample-level aggregation), each with a choice of stance (pessimistic, neutral, or optimistic). This results in a unified design space in which joint hyperparameter optimization can compose and configure robustness strategies suited to the task at hand. Across tabular, image, and reward modeling benchmarks, joint hyperparameter optimization is competitive with the best single-method baseline in each setting, offering a reliable default for practitioners who do not know a priori which failure mode dominates their task.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	8.0/10	16.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	2.0/10	4.0

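Scoring rationale: 1. Unify Models (8.0): The title contains 'Unification', and the abstract organizes many robust methods into a unified 'design space', which closely matches the idea of unified models/methods. 2. World Models (1.0): The paper does not involve world models, environment dynamics modeling, or generative world models. 3. MLLM (1.0): The paper does not mention multimodal large language models or unified understanding and generation. 4. MultiModal (2.0): The paper runs experiments on tabular and image data, but its focus is robustness rather than multimodal representation fusion; weak relevance. 5. model-based RL (2.0): The paper uses 'reward modeling' as one benchmark task, which is RL-adjacent, but the method is robust optimization for supervised learning, not model-based RL; low relevance. Weighted total = (8+1+1+2+2)*2 = 28.0, above the dynamic passing score of 26.5.

Keywords

Robust Supervised Learning, Unification, Distributionally Robust Optimization, Hyperparameter Optimization, Reward Modeling, Failure Modes, Design Space

In-Depth Analysis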

Chinese Title: 鲁棒监督学习的统一与优化

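Summary: The paper proposes a unified robust learning framework for the failure modes of supervised learning (such as distribution shift, label noise, and finite-sample degeneracies). Existing methods such as distributionally robust optimization, label smoothing, vicinal risk minimization, and Mixup are usually designed for a single failure mode, which forces practitioners to commit to one mode in advance even though the dominant mode for a task is often unknown. The authors decompose these methods along three orthogonal design axes (reference distribution, ambiguity set, and disambiguation principle) and derive a decomposable training procedure that splits robust learning into four sequential stages: reference distribution enrichment, input-space perturbation, label-space perturbation, and sample-level aggregation. Each stage can take a pessimistic, neutral, or optimistic stance. The framework exposes every design choice as a continuous hyperparameter, so joint hyperparameter optimization can automatically compose and configure a robustness strategy suited to the task. On tabular, image, and reward modeling benchmarks, joint optimization is competitive with the best single-method baseline in each setting, giving a reliable default to practitioners who do not know the dominant failure mode.

Innovations:

Proposes a unified three-axis framework (reference distribution, ambiguity set, disambiguation principle) that recovers many existing robust learning methods as special cases.
Gives closed-form reductions of VRM, Mixup, and W-DRO to modified ERM objectives for linear models, revealing qualitative relationships among these mechanisms.
Derives a decomposable training procedure that exposes all design choices as continuous hyperparameters, supporting joint hyperparameter optimization and end-to-end training by backpropagation.
Validates on tabular, image, and reward modeling tasks that joint optimization matches or exceeds the best single method, with the largest gains in reward modeling.

Methodology: The paper first formalizes a unified objective for robust supervised learning with three components: a reference distribution Pref, an ambiguity set U(Pref), and a disambiguation principle Φ. It then decomposes the ambiguity set into structured input-space, label-space, and sampling-distribution parts and derives a sequential training procedure: reference distribution enrichment (e.g., VRM, Mixup), input-space perturbation (e.g., adversarial training), label-space perturbation (e.g., label smoothing), and sample-level aggregation (e.g., DRO/DFO). Each stage corresponds to a hyperparameter, so the whole space can be searched by hyperparameter optimization. For linear models with squared loss, the authors also give closed-form regularization forms of several methods to expose what each one does. The experiments compare joint optimization against single-method baselines on tabular data (UCI), image data (CIFAR-10/100), and reward modeling (human preference data).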
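Three of the stages listed above have well-known textbook forms that can be sketched in plain Python: Mixup for reference distribution enrichment, label smoothing for label-space perturbation, and a tilted (log-mean-exp) aggregate for sample-level aggregation. This is not the paper's training procedure. In particular, mapping the sign of the tilt parameter `t` to a pessimistic, neutral, or optimistic stance is an assumption made here for illustration; it relies on the standard fact that the tilted loss moves toward the maximum sample loss as `t` grows and toward the minimum as `t` becomes very negative.

```python
import math


def mixup(x1, y1, x2, y2, lam):
    """Mixup: convex combination of two examples and their (soft) labels."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def smooth_labels(one_hot, eps):
    """Label smoothing: move eps of the probability mass to a uniform distribution."""
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]


def tilted_aggregate(losses, t):
    """Tilted loss (1/t) * log(mean(exp(t * loss))).

    t > 0 up-weights high-loss samples (pessimistic, DRO-like),
    t == 0 is the plain mean (neutral, ERM),
    t < 0 up-weights low-loss samples (optimistic).
    """
    n = len(losses)
    if t == 0:
        return sum(losses) / n
    scaled = [t * loss for loss in losses]
    peak = max(scaled)  # subtract the max for numerical stability
    log_mean_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled) / n)
    return log_mean_exp / t
```

Because `lam`, `eps`, and `t` are all continuous, a hyperparameter optimizer can search over them jointly, which is the kind of unified design space the summary describes.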

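Key Results:

Joint hyperparameter optimization matches or exceeds the best single method on several benchmarks, supporting its use as a default strategy.
The gain from joint optimization is largest in reward modeling, where finite samples, label noise, and distribution shift occur together.
The theoretical analysis for linear models shows that VRM, Mixup, and W-DRO correspond to different regularizers: VRM adds feature-covariance regularization, Mixup adds feature-label covariance regularization, and W-DRO adds regularization related to the distance to the decision boundary.
On CIFAR-10/100, joint optimization outperforms standard ERM and single robust methods under distribution shift.

Tech Stack:

Wasserstein distance (W_p) for defining ambiguity sets
KL divergence for defining ambiguity sets
Vicinal risk minimization (VRM)
Mixup data augmentation
Distributionally robust optimization (DRO)
Distributionally favorable optimization (DFO)
Label smoothing
Label relaxation
Adversarial training
Hyperparameter optimization
Backpropagation
Closed-form derivations for linear models with squared loss

Strengths:

Offers a unified view that places many seemingly unrelated robust learning methods in one framework, easing systematic comparison and composition.
The training procedure is decomposable and all design choices are continuous, which makes automated hyperparameter optimization possible and reduces the tuning burden.
The theoretical analysis exposes how the methods relate under simple models, which helps explain how they work.
The experiments cover several data types and tasks, supporting the framework's generality and practicality.

Limitations:

The framework targets supervised learning and is not directly extended to more complex settings such as reinforcement learning or world models.
The theoretical analysis is limited to linear models with squared loss; its extension to nonlinear models and more complex losses is unclear.
Joint hyperparameter optimization is effective but can be computationally expensive, especially on large datasets.
For some specific failure modes (such as adversarial attacks), a single method may still beat joint optimization, so the default is not always optimal.

Relevance To Keywords: The paper is about robust supervised learning and is highly relevant to "Unify Models" because it unifies many robust learning methods. Its link to "World Models", "Model-Based RL", and "reinforcement learning" is weak, although robustness may matter for model deployment and post-training. It has no direct relation to "native multimodal large models" or "unified understanding and generation in multimodal large models", though robust training techniques could transfer to multimodal settings. Overall relevance is moderate; the main contribution is the unified framework, which is not specific to multimodality or world models.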

53. Cyclical Entropy Eruption: Entropy Dynamics in Agent Reinforcement Learning (PASS)

Score: 28.0 / 26.5

Authors: Wendi Li, Shawn Im, Sharon Li

Published: 2026-05-27

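TL;DR: This paper identifies a cyclical entropy eruption phenomenon in agent reinforcement learning and proposes SEAL, an auxiliary loss that separates trajectories, stabilizes training, and improves downstream agent performance.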

摘要翻译

代理型大语言模型（Agentic LLMs）正日益被用于解决现实世界任务，其方式包括基于目标推理、调用工具以及与外部环境交互。强化学习（Reinforcement Learning, RL）为改进这些行为提供了自然框架，且近期智能体强化学习方法在各个领域均取得了显著成果。然而，智能体强化学习的训练动态仍知之甚少，这限制了我们诊断不稳定性并设计更有效训练算法的能力。在这项工作中，我们识别出智能体强化学习中一个此前未被充分探索的现象，我们将其称为周期性熵爆发（cyclical entropy eruption）。不同于单轮推理强化学习（其中熵通常会坍塌并保持低位），智能体强化学习训练表现出独特的反复循环模式：剧烈的熵爆发与随后的逐渐消退。我们将此动态分解为三个阶段，并对每个阶段分别提供理论与实证分析，解释了其周期性振荡背后的机制。我们进一步表明，退化模式（如句子重复和幻觉），一旦在爆发期间被习得，便可在跨周期中持续存在并累积。受这些发现启发，我们提出 SEAL（Separation-Enhanced Agent Learning），这是一种轻量级辅助损失，它在表示空间中分离正确与错误轨迹，直接针对熵爆发的根本原因。在多个基准测试、模型及强化学习算法上的实验表明，SEAL 能够稳定训练过程，并取得更强的下游智能体性能。

Abstract

Agentic large language models are increasingly used to solve real-world tasks by reasoning over goals, invoking tools, and interacting with external environments. Reinforcement learning provides a natural framework for improving these behaviors, and recent agent RL methods have achieved strong results across domains. However, the training dynamics of agent RL remain poorly understood, limiting our ability to diagnose instabilities and design more effective training algorithms. In this work, we identify a previously underexplored phenomenon in agent RL, which we term cyclical entropy eruption. Unlike single-turn reasoning RL, where entropy typically collapses and stays low, agent RL training exhibits unique recurring cycles of sharp entropy eruption and gradual subsidence. We decompose this dynamic into three phases and provide theoretical and empirical analyses of each, explaining the mechanisms underlying its cyclical oscillation. We further show that degenerate patterns such as sentence duplication and hallucination, once acquired during eruption, can persist and accumulate across cycles. Motivated by these findings, we propose SEAL (Separation-Enhanced Agent Learning), a lightweight auxiliary loss that separates correct and incorrect trajectories in representation space, directly targeting the root cause of entropy eruption. Experiments across multiple benchmarks, models, and RL algorithms demonstrate that SEAL stabilizes training and yields stronger downstream agent performance.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	3.0/10	6.0

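Scoring Rationale: The paper focuses on entropy dynamics and training stability in agent reinforcement learning (the SEAL method). It has some connection to MLLM (LLM-based agents) and model-based RL (the broader RL area), giving moderate relevance; it does not address model unification, world-model construction, or multimodal processing, giving low relevance. No designated experts appear in the author list. The weighted total of 28.0 exceeds the dynamic pass score of 26.5.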
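The weighted total in the score table is each keyword's weight multiplied by its relevance (on the 0 to 10 scale), summed over keywords. The snippet below reproduces the arithmetic for this entry; the function name and the use of `>=` for the pass check are assumptions, since the report only says the total exceeds the pass score.

```python
def weighted_score(rows):
    """Sum of weight * relevance over all (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

rows = [
    ("Unify Models", 2.0, 2.0),
    ("World Models", 2.0, 3.0),
    ("MLLM", 2.0, 4.0),
    ("MultiModal", 2.0, 2.0),
    ("model-based RL", 2.0, 3.0),
]
pass_score = 26.5                 # dynamic pass score quoted in this report
total = weighted_score(rows)      # 4 + 6 + 8 + 4 + 6
passed = total >= pass_score
```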

Keywords

Cyclical Entropy Eruption, Entropy Dynamics, Agent Reinforcement Learning, Agentic Large Language Models, SEAL, Representation Space, Training Stability, Auxiliary Loss

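In-Depth Analysis

Chinese Title: 周期性熵爆发：智能体强化学习中的熵动态

Summary: This paper studies a previously underexplored phenomenon in agent reinforcement learning (Agent RL): cyclical entropy eruption. In single-turn reasoning RL, entropy typically falls and stays low; in agent RL training, entropy instead cycles repeatedly through sharp eruptions and gradual subsidence. The authors decompose this dynamic into three phases (entropy decline, entropy eruption, and entropy subsidence) and give theoretical and empirical analyses that explain the cyclical oscillation. They find that degenerate patterns acquired during an eruption, such as sentence duplication and hallucination, persist and accumulate across cycles. Building on this, they propose SEAL (Separation-Enhanced Agent Learning), a lightweight auxiliary loss that separates correct and incorrect trajectories in representation space to reduce gradient interference and thereby suppress entropy eruption. Experiments show that SEAL stabilizes training and improves agent performance across multiple benchmarks, models, and RL algorithms.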

Innovations:

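Identifies and names cyclical entropy eruption in agent RL for the first time, showing that it differs qualitatively from single-turn reasoning RL.
Decomposes the entropy dynamics into three phases (entropy decline, entropy eruption, entropy subsidence), with a theoretical explanation and empirical evidence for each.
Proposes the SEAL loss, which mitigates gradient interference by separating the representations of correct and incorrect trajectories, directly targeting the root cause of entropy eruption.
Shows that degenerate patterns produced during eruptions (such as sentence duplication and hallucination) accumulate across training cycles and hurt final performance.
Validates SEAL on multiple benchmarks (AlfWorld, WebShop), models (Qwen2.5, Llama), and RL algorithms (GRPO, GIGPO), including recovering Llama from training collapse to a 79.69% success rate.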

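Methodology: The paper combines theoretical analysis with empirical study. It first observes the entropy dynamics of agent RL training experimentally and finds the cyclical eruption pattern. Using information theory and gradient analysis, it then decomposes the dynamics into three phases and gives a mathematical explanation for each (for example, Lemma 3.1 states that format-gated learning dominates early training). Next, by analyzing how representation similarity causes gradient interference, it proposes the SEAL loss, which adds an auxiliary term to the original RL objective that encourages correct and incorrect trajectories to separate in representation space. Finally, on tasks such as AlfWorld and WebShop, it trains models such as Qwen2.5 and Llama with GRPO and GIGPO and compares performance and entropy stability with and without SEAL.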
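The report does not give the exact form of the SEAL term, so the following is only one plausible sketch of "an auxiliary term that separates correct and incorrect trajectories in representation space": the mean cosine similarity between the two groups, added to the RL loss with a weight. The pooling into one vector per trajectory, the function names, and the weight `lam` are all assumptions.

```python
import numpy as np

def separation_loss(reps, correct):
    """Mean cosine similarity between correct and incorrect trajectory
    representations. reps: (n, d) array, correct: (n,) boolean array.
    Lower values mean the two groups are better separated."""
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    pos, neg = unit[correct], unit[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return 0.0                      # nothing to separate in this batch
    return float((pos @ neg.T).mean())

def total_loss(rl_loss, reps, correct, lam=0.1):
    """RL objective plus the weighted auxiliary separation term."""
    return rl_loss + lam * separation_loss(reps, correct)

# Two correct trajectories along one axis, one incorrect along another.
reps = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
correct = np.array([True, True, False])
aux = separation_loss(reps, correct)
```

In this toy batch the groups are already orthogonal, so the auxiliary term is zero; it grows toward 1 as correct and incorrect representations align, which is the regime the analysis associates with gradient interference.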

Key Results:

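Entropy in agent RL training erupts and subsides cyclically rather than decreasing monotonically.
Format validity improves quickly in early training, while semantic learning is limited by the format bottleneck.
During the eruption phase, representation similarity causes gradient interference that suppresses the likelihood of correct trajectories and triggers sentence duplication and hallucination.
During the subsidence phase, growing diversity lowers representation similarity, so correct trajectories regain their gradient advantage.
The SEAL loss stabilizes the entropy dynamics, improving AlfWorld accuracy by 2.81% and WebShop success rate by 3.13%.
On Llama, GRPO training collapses without SEAL (0% success rate) and recovers to 79.69% with SEAL.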

Tech Stack:

GRPO (Group Relative Policy Optimization)
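GIGPO (Group-in-Group Policy Optimization)
AlfWorld (text-based household task benchmark)
WebShop (web shopping task benchmark)
Qwen2.5-7B and Llama-family models
Representation similarity analysis (e.g., cosine similarity)
Auxiliary loss (SEAL)
Information-theoretic entropy computation
Format validity metric (fmt) and semantic correctness metric (sem)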

Strengths:

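Uncovers a distinctive and important training dynamic in agent RL, filling a gap in the field's understanding.
Provides a complete chain of analysis from phenomenon to mechanism, including theoretical lemmas and empirical validation.
SEAL is simple and effective, adds no inference cost, and is practically useful.
The experiments span multiple models, tasks, and algorithms, suggesting the results generalize.
The paper is clearly structured, moving coherently from problem discovery to solution.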

Limitations:

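The theoretical analysis rests on a simplifying format-semantics factorization assumption that may not fully hold under complex reward settings.
The SEAL loss requires computing distances in representation space, which may add a small training overhead.
The experiments are limited to text environments (AlfWorld, WebShop) and do not cover multimodal or embodied environments.
Predicting or controlling the length of the eruption cycle is not explored in depth.
There is no systematic comparison with more sophisticated regularization methods (e.g., KL-divergence constraints).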

Relevance To Keywords:

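Unify Models: The paper studies agent reinforcement learning; it is only indirectly related to the unified-model concept and does not address multimodal unification.
World Models: The paper does not involve world models; low relevance.
Representation Learning: The core mechanism involves similarity analysis in representation space, and SEAL directly optimizes representation separation; high relevance.
Model-Based RL: The paper uses model-free RL (GRPO/GIGPO) and does not involve model-based RL; low relevance.
Native multimodal large models: The experiments use text models (Qwen, Llama) with no multimodal component; low relevance.
Unified understanding and generation in multimodal large models: The paper focuses on text agents and does not involve multimodality; low relevance.
Reinforcement learning: RL training dynamics are the core of the paper; high relevance.
Post-training: The RL training studied here belongs to the post-training stage; high relevance.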

54. Frequency-Guided Action Diffusion via Sub-Frequency Manifold Traversal [PASS]

Score: 28.0 / 26.5

Authors: Junlin Wang

Published: 2026-05-27

TL;DR: This paper proposes a Frequency Guidance Operator to smooth action generation in diffusion-based visuomotor policies by traversing sub-frequency manifolds, enhancing temporal consistency in robotic manipulation tasks.

Abstract (Chinese Translation)

通过行为克隆学习视觉运动策略通常涉及模仿由人类操作员收集的专家演示。然而，自然的人类演示固有地包含高频噪声，例如间歇性抖动、停顿及动作抖动。训练策略直接模仿这些原始轨迹不可避免地会导致模型继承这些次优行为。这种问题在基于扩散的策略中尤为突出，其中迭代去噪步骤可能会无意中放大高频伪影，以牺牲有意义的细粒度细节为代价。为了解决这些局限性，我们提出了一种新颖的基于频率的算法，能够实现隐式频谱操纵和平滑动作生成。我们的方法，频率引导算子（Frequency Guidance Operator, FGO），通过逐步驱动噪声样本经过具有扩展频谱带的中间子频率流形，来引导扩散策略的生成过程。在 5 个基准上的 15 个机器人操作任务上验证，FGO 在增强动作平滑度和时间一致性方面表现卓越，同时保留了成功任务执行所需的细节。项目网站：https://henrywjl.github.io/frequency-guidance-operator/

Abstract

Learning visuomotor policies via behavior cloning typically involves mimicking expert demonstrations collected by human operators. However, natural human demonstrations inherently contain high-frequency noise, such as intermittent jerks, pauses, and action jitter. Training policies to directly imitate these raw trajectories inevitably causes the model to inherit these suboptimal behaviors. This pathology is particularly pronounced in diffusion-based policies, where iterative denoising steps can inadvertently amplify high-frequency artifacts at the expense of meaningful fine-grained details. To address these limitations, we present a novel frequency-based algorithm that enables implicit spectral maneuvering and smooth action generation. Our method, Frequency Guidance Operator (FGO), steers the generation process of diffusion policies by progressively driving the noisy samples through intermediate sub-frequency manifolds with expanding spectral bands. Validated on 15 robotic manipulation tasks from 5 benchmarks, FGO achieves superior performance in enhancing action smoothness and temporal consistency while preserving the details necessary for successful task execution. Project website: https://henrywjl.github.io/frequency-guidance-operator/

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	4.0/10	8.0
model-based RL	2.0	4.0/10	8.0

Scoring Rationale: The paper focuses on frequency-guided diffusion policies for robotic manipulation. It shows low relevance to Unify Models, World Models, and MLLM (2.0) as it does not address model unification, environment modeling, or language processing. MultiModal relevance is moderate (4.0) due to visuomotor (vision-action) integration. Model-based RL relevance is moderate (4.0) as diffusion policies model action distributions within an RL context, though it primarily uses behavior cloning. Total weighted score is 28.0, exceeding the dynamic pass score of 26.5. No expert authors from the specified list are found.

Keywords

Frequency Guidance Operator, Action Diffusion, Sub-Frequency Manifold, Visuomotor Policies, Behavior Cloning, Robotic Manipulation, High-Frequency Noise, Diffusion Policies

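In-Depth Analysis

Chinese Title: 基于频率引导的子频流形遍历动作扩散

Summary: This paper proposes the Frequency Guidance Operator (FGO), a new guidance mechanism for diffusion policies. It targets a problem in behavior cloning: human demonstrations contain high-frequency noise (intermittent jerks, pauses, and action jitter), and policies trained on them inherit these suboptimal behaviors. Standard diffusion policies can inadvertently amplify high-frequency artifacts during reverse denoising, whereas FGO produces smooth actions through implicit spectral manipulation. During training, the method learns a multi-band mapping from noise to sub-frequency data manifolds at different cutoff frequencies. During inference, it linearly combines the noise-prediction vector fields for the base frequency and an intermediate frequency, gradually pushing samples from the low-frequency manifold toward the full-frequency manifold, which suppresses high-frequency noise while preserving detail. Experiments on 15 robotic manipulation tasks from 5 benchmarks show that FGO outperforms existing methods in both success rate and action smoothness, with ablation studies supporting the design.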

Innovations:

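Proposes the Frequency Guidance Operator (FGO), a guidance mechanism that implicitly imposes a spectral hierarchy on the reverse diffusion process and suppresses high-frequency noise by progressively traversing sub-frequency manifolds.
Designs a multi-band mapping training strategy in which the model learns predictions conditioned on sub-frequency manifolds at different cutoff frequencies, and introduces a base-frequency sampling probability p_base to ensure a stable baseline.
Proposes the k-f coupling (KFC) sampling strategy, which adjusts the frequency upper bound during training according to the noise level so that the model does not waste capacity learning high-frequency mappings at high noise levels.
Uses a linear schedule at inference that increases both the cutoff frequency f_k and the guidance weight ω_k, giving a smooth transition from low frequency to full frequency.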

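Methodology: The paper builds on the diffusion policy framework and uses the discrete cosine transform (DCT) for frequency-domain analysis. During training, a low-pass filter is applied to clean action sequences to obtain sub-frequency sequences, and the noise predictor is extended to condition on the cutoff frequency f; the k-f coupling sampling strategy adjusts the frequency upper bound according to the diffusion step k. During inference, the current noisy action is low-pass filtered at the base frequency and at an intermediate frequency to obtain two sub-frequency inputs; noise is predicted for each and the two predictions are linearly combined, with both the combination weight and the cutoff frequency increasing on a linear schedule. Guiding samples step by step from the low-frequency manifold to the full-frequency manifold suppresses high-frequency noise.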
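The building blocks named above (DCT low-pass filtering of an action sequence, a linear schedule, and a linear combination of two noise predictions) can be illustrated as follows. This is a sketch under assumptions: the orthonormal DCT-II is written out explicitly in NumPy, the cutoff is expressed as a number of retained coefficients, and the combination `eps_base + w * (eps_mid - eps_base)` is one common guidance-style form, not necessarily the exact rule used in the paper.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: rows index frequency, columns index time."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def low_pass(actions, cutoff):
    """Keep the first `cutoff` DCT coefficients along time, zero the rest.
    actions: (T, action_dim) array."""
    d = dct_matrix(actions.shape[0])
    coeffs = d @ actions
    coeffs[cutoff:] = 0.0
    return d.T @ coeffs              # inverse of an orthonormal transform

def linear_schedule(step, num_steps, start, end):
    """Interpolate linearly from `start` at step 0 to `end` at the last step."""
    frac = step / max(num_steps - 1, 1)
    return start + frac * (end - start)

def guided_noise(eps_base, eps_mid, weight):
    """Assumed linear combination of base- and intermediate-band predictions."""
    return eps_base + weight * (eps_mid - eps_base)

rng = np.random.default_rng(0)
actions = rng.normal(size=(16, 2))           # 16 time steps, 2 action dims
smooth = low_pass(actions, cutoff=4)         # only the 4 lowest frequencies
```

Setting `cutoff` to the sequence length returns the sequence unchanged, and `cutoff=1` leaves only the per-dimension mean, which brackets the low-to-full-frequency traversal the method describes.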

Key Results:

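On 15 robotic manipulation tasks (spanning 4 simulated environments and 1 real-world environment), FGO outperforms baselines such as 3D Diffusion Policy (DP3), DiT-Policy, and FreqPolicy in both success rate and action smoothness.
Ablations confirm the effectiveness of the base-frequency sampling probability p_base and k-f coupling sampling, and the necessity of the frequency and weight schedules.
Action trajectories generated by FGO show higher temporal consistency and less jitter while preserving the detail needed for task execution.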

Tech Stack:

Diffusion Policy
Discrete cosine transform (DCT) and its inverse (IDCT)
Low-pass filter
U-Net backbone (DP3)
Diffusion Transformer (DiT)
Noise predictor (ϵθ)
Linear schedule
k-f coupling sampling (KFC Sampling)

Strengths:

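Combines frequency-domain analysis with the diffusion process in a novel way, proposing an explicit frequency guidance mechanism that suppresses high-frequency noise without losing detail.
Broadly applicable: it can be integrated into existing diffusion policy architectures (such as DP3 and DiT) to improve performance.
Thorough experiments across multiple simulated and real-world settings, with in-depth ablations that verify each design component's contribution.
Clear theoretical motivation that exploits the coarse-to-fine spectral behavior of diffusion, with a reasonable analogy to human decision-making.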

Limitations:

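At inference, direct low-pass filtering is used to approximate the sub-frequency noisy state, which may introduce small off-manifold errors; the paper verifies robustness empirically but gives no theoretical guarantee.
The method depends on the DCT, so it may be less effective for non-stationary or long action sequences.
It requires additional hyperparameters (f_base, f_max, p_base, β, etc.), which raises tuning cost.
It is validated only on robotic manipulation tasks and is not tested on broader generative tasks such as video or audio.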

Relevance To Keywords:

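Unify Models: FGO is a general guidance mechanism that can be integrated into different diffusion policy architectures, which reflects the idea of model unification.
World Models: A diffusion policy can itself be viewed as a learned world model of action generation, and FGO improves its generation quality through frequency-domain guidance.
Representation Learning: Through frequency-domain decomposition the method learns multi-band representations, implicitly capturing a hierarchical representation of actions.
Model-Based RL: Behavior cloning is a form of imitation learning and is related to model-based RL; FGO improves the smoothness and robustness of the policy.
Native multimodal large models: The paper involves visual (point cloud/image) and action modalities but not multimodal large models directly; weak relevance.
Unified understanding and generation in multimodal large models: The paper focuses on action generation and does not involve understanding tasks; limited relevance.
Post-training: The method improves the training and inference stages; it is loosely, not directly, related to post-training (e.g., fine-tuning).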

55. Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models [PASS]

Score: 28.0 / 26.5

Authors: Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud

Published: 2026-05-27

TL;DR: This paper proposes a semantic-agnostic and shape-aware vision-language segmentation paradigm (SANSA) that improves model generalization and controllability by finetuning with non-semantic textual descriptions.

Abstract (Chinese Translation)

视觉 - 语言分割模型最近通过利用自然语言中表达的高层语义对象类别取得了优异的性能。然而，这种语义依赖限制了它们推理内在视觉属性（如形状、几何或纹理）的能力，而这些属性在许多实际应用中也至关重要。本文引入了语义无关且形状感知（Semantic-Agnostic and Shape-Aware, SANSA）分割，这是一种新的范式，要求分割模型仅基于非语义文本描述进行操作。为此，我们提出了两种生成 SANSA 分割提示的策略，分别基于词典约束或示例引导，两者均生成语义无关的文本描述。随后，这些提示在语义无关监督下用于微调分割模型。实验表明，与预训练的最先进模型相比，在 SANSA 提示上微调可使该新分割任务的 mIoU 提升高达 20%，同时在标准语义提示上保持强大性能。这些结果突显了低层和中层视觉推理对于提高视觉 - 语言分割模型的泛化性和可控性的重要性。

Abstract

Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	0.0/10	0.0

Scoring Rationale: The paper focuses on vision-language segmentation, making MultiModal highly relevant (8.0) due to the integration of vision and language. MLLM is moderately related (3.0) as it involves vision-language models but specifically targets segmentation rather than large language model generation. Unify Models is low (2.0) as it proposes a specific segmentation paradigm (SANSA) rather than unifying multiple model architectures. World Models (1.0) and model-based RL (0.0) are largely unrelated as the paper does not involve environment dynamics or reinforcement learning. No expert authors from the target list were found in the author list.

Keywords

Vision-language segmentation, Semantic-agnostic, Shape-aware, SANSA, Textual descriptions, Finetuning, Visual reasoning, Generalization

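In-Depth Analysis

Chinese Title: 走向语义无关和形状感知的视觉语言分割模型

Summary: This paper proposes a new segmentation paradigm, Semantic-Agnostic and Shape-Aware (SANSA) segmentation, which requires the model to segment objects from their visual attributes alone (such as shape, color, and texture) without relying on semantic categories. Existing vision-language models (VLMs) such as LISA perform poorly on non-semantic queries because they mainly learn semantic alignment. To address this, the authors design two strategies for automatically generating semantic-agnostic prompts, the dictionary-constrained DISP and the example-guided EXSP, and use LLM post-processing or LLM-as-a-judge filtering to enforce semantic agnosticism. Fine-tuning the LISA 7B model with LoRA on these prompts, experiments on a COCO subset show an mIoU gain of up to 20% on the SANSA segmentation task while the model keeps good performance on standard semantic prompts. The work underlines the importance of low- and mid-level visual reasoning for improving VLM generalization and controllability.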

Innovations:

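Is the first to propose the Semantic-Agnostic and Shape-Aware (SANSA) segmentation paradigm, which requires models to segment from non-semantic visual attributes.
Proposes two strategies for automatically generating semantic-agnostic segmentation prompts: DISP (constrained by a predefined dictionary) and EXSP (guided by positive and negative examples).
Adds LLM post-processing and LLM-as-a-judge filtering to ensure the generated descriptions carry no semantic information.
Fine-tunes LISA to achieve a large gain on the SANSA task (up to 20% mIoU) while retaining semantic segmentation ability.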

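Methodology: First, InternVL 2.5 serves as the describing VLM: the target object is cropped (keeping only the object mask region) and passed with a prompt to generate a semantic-agnostic description. The DISP strategy forces the VLM to build descriptions from a predefined dictionary of non-semantic vocabulary (color, shape, texture, and so on), which the Mistral 7B LLM then rephrases into natural language. The EXSP strategy provides positive and negative examples and allows freer descriptions, but filters out descriptions containing semantics with an LLM-as-a-judge (InternVL 2.5). The generated SANSA prompts are then fed with the original images to the LISA 7B segmentation VLM, and the LLaVA multimodal model and the vision decoder are fine-tuned with LoRA using a weighted sum of BCE and Dice losses (λ1=0.25, λ2=1).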
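The weighted BCE plus Dice objective mentioned above can be written compactly. The sketch below assumes that λ1 weights the BCE term and λ2 the Dice term (the report lists them in that order but does not say so explicitly), and it operates on per-pixel probabilities rather than logits for simplicity.

```python
import numpy as np

def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    p = np.clip(prob, eps, 1 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def dice_loss(prob, target, eps=1e-7):
    """1 - Dice coefficient; eps keeps the ratio defined for empty masks."""
    inter = (prob * target).sum()
    return float(1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps))

def seg_loss(prob, target, lam_bce=0.25, lam_dice=1.0):
    """Weighted sum of the two terms (weights assumed to map to λ1 and λ2)."""
    return lam_bce * bce_loss(prob, target) + lam_dice * dice_loss(prob, target)

target = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy 2x2 ground-truth mask
good = seg_loss(target, target)               # perfect prediction
bad = seg_loss(1.0 - target, target)          # fully inverted prediction
```

BCE scores every pixel independently, while Dice scores region overlap, so the combination stays informative when the object covers only a small part of the image.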

Key Results:

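The fine-tuned SANSA model improves mIoU on the semantic-agnostic segmentation task by up to 20% over the pretrained LISA model.
Segmentation performance on standard semantic prompts does not degrade, so the original ability is retained.
LLM-as-a-judge filtering removes semantic words from EXSP descriptions and improves test-set quality.
Both DISP and EXSP produce effective semantic-agnostic prompts, with EXSP yielding more natural descriptions.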

Tech Stack:

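InternVL 2.5 (describing VLM)
Mistral 7B (LLM post-processing)
LISA 7B (segmentation VLM, based on LLaVA)
LoRA (low-rank adaptation fine-tuning)
BCE loss (binary cross-entropy)
Dice loss
Vision Transformer (ViT)
Transformer architecture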

Strengths:

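Proposes a novel and practical segmentation paradigm that fills a gap in VLMs' non-semantic visual reasoning.
The automatic prompt generation strategies avoid manual annotation and scale to large datasets.
The fine-tuning method is lightweight (LoRA) and preserves the original semantic ability, which makes it practical to deploy.
The experimental design is clear, comparing pretrained and fine-tuned models to verify the method's effectiveness.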

Limitations:

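The describing VLM (InternVL 2.5) can still introduce semantic bias; cropping and filtering mitigate this but cannot eliminate it.
Experiments use only the LISA 7B model, so generalization to other VLMs (such as GSVA or SegLLM) is unverified.
The training data is a COCO subset (10k images), a limited scale that may affect performance in complex scenes.
The effect of dictionary size or number of examples on generation quality is not explored, so the strategy parameters are not fully optimized.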

Relevance To Keywords:

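Native multimodal large models: The paper directly studies vision-language models (VLMs), which fall within this category.
Unified understanding and generation in multimodal large models: LISA combines understanding (text-guided segmentation) with generation (segmentation masks), and the fine-tuning here further strengthens its understanding of non-semantic attributes.
Representation learning: The SANSA paradigm emphasizes learning low-level visual representations such as shape and texture, so it is closely related.
World models: By reasoning over non-semantic attributes, a model can learn geometric and physical properties of objects in the visual world, which may help build more robust world models.
Reinforcement learning / post-training: The paper does not directly involve reinforcement learning or post-training techniques, though fine-tuning can be seen as a form of post-training; weak relevance.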

56. SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization [PASS]

Score: 28.0 / 26.5

Authors: Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie, Jishen Zeng, Baoying Chen, Jiwu Huang, Xiaochun Cao

Published: 2026-05-27

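TL;DR: SIGMA automates mask generation for image manipulation by combining semantic differencing with instruction grounding, and significantly improves detector performance when applied to public editing datasets.

Abstract (Chinese Translation)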

文本驱动图像编辑进展迅速，但可靠地定位这些操作需要基于大规模像素标注数据集训练的图像操作定位（IML）模型，且目前尚无低成本方式规模化获取此类训练数据。我们观察到这些数据其实已隐藏其中：公开编辑数据集包含数百万个结构相同的（原始，编辑）对，可作为 IML 训练样本，仅缺少像素级掩码。自动恢复这些掩码并非易事：像素差分会被所有像素上的扩散诱导扰动所掩盖，而仅基于指令的定位仅能定位提示中描述的内容，遗漏了意外的编辑器附带效应。我们提出 SIGMA（语义差异指令定位掩码标注器），它在视觉基础模型骨干网络中执行语义特征差分，并通过双向跨模态精炼将指令衍生的空间先验注入该视觉流，当编辑器忠实实现用户意图时，放大意图编辑区域的差异信号。SIGMA 采用两个互补阶段进行训练：阶段 I 基于修复掩码进行监督；阶段 II 通过 VAE 往返噪声校准、EMA 自训练以及编辑噪声解耦损失来弥合扩散域偏移。SIGMA 在五个基准测试上优于现有的自动掩码生成器（F1 提升 12.20%，IoU 提升 11.16%）。当应用于公开编辑语料库时，它生成一个约 110 万样本的 IML 训练集，在五个数据集上使六种不同的检测器 F1 提升 18.34%，将此前未使用的编辑数据转化为 IML 的模型无关监督资源。一旦论文被接受，我们将发布完整代码库。

Abstract

Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of structurally identical (original, edited) pairs to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We'll release the full codebase as soon as the paper is accepted.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	0.0/10	0.0

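Scoring Rationale: The paper's core topic is text-driven image manipulation localization, which is highly relevant to MultiModal (8 points). SIGMA unifies semantic differencing with instruction grounding, giving moderate relevance to Unify Models (3 points). It involves multimodal interaction but does not explicitly use an LLM, giving limited relevance to MLLM (3 points). There is no world-model or reinforcement-learning content, so World Models and model-based RL score 0. The author list contains none of the designated experts, so no bonus applies. The weighted total of 28.0 is above the dynamic pass score of 26.5.

Keywords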

Text-driven Image Manipulation, Mask Annotation, Semantic Differencing, Instruction Grounding, Vision Foundation, Cross-modal Refinement, Image Editing Localization

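In-Depth Analysis

Chinese Title: SIGMA: 基于语义差异与指令引导的掩码标注器用于文本驱动图像篡改定位

Summary: Motivated by the manipulation-localization needs created by rapid progress in text-driven image editing, this paper proposes SIGMA, an annotator that automatically generates pixel-level masks. Existing editing datasets contain many (original, edited) image pairs but no masks, and neither pixel-level differencing nor instruction grounding alone copes well with the whole-image perturbations introduced by diffusion models. SIGMA uses a two-branch architecture: the semantic branch computes multi-scale semantic differences in a frozen DINOv2 feature space, the instruction branch parses the editing instruction to produce a spatial prior, and a bidirectional cross-modal refinement module fuses the two. Training has two stages: the first is supervised pretraining on inpainting masks, and the second adapts to the diffusion domain through VAE-roundtrip noise calibration, EMA self-training, and an edit-noise feature disentanglement loss. SIGMA improves F1 by 12.20% and IoU by 11.16% on five benchmarks, and it builds an IML training set of about 1.1M samples from public editing datasets that raises the F1 of six detectors by an average of 18.34% across five datasets.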

Innovations:

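Reframes scalable IML annotation as semantic change localization, observing that pixel differencing and instruction grounding are complementary but each insufficient on its own.
Proposes the two-branch SIGMA framework, which fuses semantic feature differencing with parsed instruction grounding and makes them complementary through bidirectional cross-modal refinement.
Designs a two-stage training strategy: supervised pretraining in the first stage, then VAE-roundtrip calibration, EMA self-training, and an edit-noise disentanglement loss in the second stage to overcome domain shift.
Uses SIGMA to build a million-scale IML dataset from public editing corpora without regeneration or manual annotation, substantially improving downstream detectors.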

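Methodology: SIGMA has four core components. The semantic differencing branch extracts multi-level features with a frozen DINOv2 encoder and produces visual evidence through a multi-level differencing module and Difference Refinement Blocks (DRB). The instruction grounding branch uses a Semantic Transformation Parser (STP) to extract (original concept, edited concept, operation) triplets, generates concept attention maps with LangSAM, and applies Action-Conditioned Spatial Fusion (ACSF) to obtain an intent map. The Bidirectional Cross-Modal Refinement (BCMR) module stacks cross-attention in an asymmetric order, so the semantic branch first attends to the prior and the instruction branch then updates the prior. Finally, a mask decoder predicts the binary mask. Training has two stages: supervised learning on inpainting data, then VAE-roundtrip zero-edit calibration, EMA self-training, and an edit-noise feature disentanglement loss on unlabeled editing pairs.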
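Two of the ingredients above are simple enough to sketch in isolation: differencing in a feature space rather than pixel space, and the EMA teacher update used in self-training. The code uses random arrays in place of DINOv2 patch features, and the cosine-distance form, the threshold, and the momentum value are assumptions for illustration; SIGMA's actual differencing module is learned and multi-level.

```python
import numpy as np

def semantic_difference(feat_orig, feat_edit):
    """Per-patch cosine distance between two (H, W, C) feature maps.
    0 means identical direction, 2 means opposite direction."""
    a = feat_orig / np.linalg.norm(feat_orig, axis=-1, keepdims=True)
    b = feat_edit / np.linalg.norm(feat_edit, axis=-1, keepdims=True)
    return 1.0 - (a * b).sum(axis=-1)

def ema_update(teacher, student, momentum=0.99):
    """Exponential moving average of student weights into the teacher."""
    return {k: momentum * teacher[k] + (1 - momentum) * student[k]
            for k in teacher}

rng = np.random.default_rng(0)
feat_orig = rng.normal(size=(4, 4, 8))     # stand-in for patch features
feat_edit = feat_orig.copy()
feat_edit[1, 2] = -feat_orig[1, 2]         # simulate one edited patch
diff = semantic_difference(feat_orig, feat_edit)
mask = diff > 0.5                          # hypothetical threshold

teacher = ema_update({"w": np.zeros(2)}, {"w": np.ones(2)}, momentum=0.9)
```

Unlike a pixel difference, this map stays near zero wherever the features keep their direction, which is the property that makes feature-space differencing less sensitive to small diffusion-induced perturbations.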

Key Results:

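On five benchmark datasets, SIGMA improves F1 by 12.20% and IoU by 11.16% over existing automatic mask generators.
Applying SIGMA to public editing corpora yields about 1.1M IML training samples, which raise the F1 of six different detectors by an average of 18.34% across five datasets.
Ablations confirm the individual contributions of the semantic differencing branch, the instruction grounding branch, and the BCMR module.
The two-stage training strategy substantially reduces false positives in the diffusion domain.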
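For reference, the pixel-level F1 and IoU quoted in these results are standard overlap metrics between a predicted mask and a ground-truth mask. The helper below is a generic definition, not the paper's evaluation code; returning 1.0 when both masks are empty is a convention chosen here.

```python
import numpy as np

def mask_f1_iou(pred, gt):
    """Pixel-level F1 and IoU for two binary masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    if tp + fp + fn == 0:            # both masks empty: treat as perfect match
        return 1.0, 1.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return float(f1), float(iou)

pred = np.array([[1, 1], [0, 0]])    # two predicted pixels
gt = np.array([[1, 0], [0, 0]])      # one true pixel
f1, iou = mask_f1_iou(pred, gt)      # tp=1, fp=1, fn=0
```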

Tech Stack:

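DINOv2-Base (frozen encoder)
Qwen2.5-0.5B (Semantic Transformation Parser, in-context prompting)
GroundingDINO + SAM2.1 (LangSAM)
Transformer (self-attention and cross-layer attention in the Difference Refinement Blocks)
VAE-roundtrip noise calibration
Exponential moving average (EMA) self-training
Edit-noise feature disentanglement loss
Bilinear interpolation, 1×1 convolution, ReLU and GELU activations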

Strengths:

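Proposes a novel automatic annotation paradigm that turns editing data into an IML training resource, addressing the data bottleneck.
The two-branch design fuses visual differences with language priors, overcoming the weaknesses of either approach alone.
The two-stage training strategy pretrains on inpainting data and then adapts to the diffusion domain, with no target-domain annotation needed.
Thorough experiments show clear gains on multiple benchmarks and downstream detectors, indicating practical value.
The code will be open-sourced, supporting reproducible research.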

Limitations:

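Depends on an instruction parser (Qwen2.5-0.5B) and a visual grounding model (LangSAM), whose errors can propagate.
For global edits (such as style transfer) the instruction prior is weak, so the method relies mainly on the semantic branch.
The two-stage training requires inpainting data (e.g., inpainting masks), which may not cover every editing type.
Computational overhead: DINOv2, LangSAM, and the Transformer modules may limit efficiency in large-scale deployment.
Covers only text-driven edits, not other manipulation types such as splicing or copy-move.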

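Relevance To Keywords: The paper focuses on image manipulation localization, so its direct relevance to the given keywords (unified models, world models, representation learning, model-based reinforcement learning, native multimodal large models, unified multimodal understanding and generation, and so on) is low. Still, DINOv2 belongs to representation learning, instruction parsing and visual grounding involve multimodal understanding, and the self-training and feature disentanglement in the two-stage training relate indirectly to post-training. Overall, the paper leans toward visual forensics and automatic annotation rather than the core directions of multimodal large models or world models.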

57. Learning to Assign Prediction Tasks to Agents with Capacity Constraints [FAIL]

Score: 24.0 / 26.5

Authors: Shang Wu, Saatvik Kher, Padhraic Smyth

Published: 2026-05-27

TL;DR: This paper proposes a sequential explore-exploit policy-learning framework to optimally assign prediction tasks to heterogeneous agents with capacity constraints, demonstrating systematic gains over non-contextual baselines across diverse task types.

Abstract (Chinese Translation)

我们探讨了从一组可用的人类或 AI 智能体中学习将预测任务分配给单个智能体的问题。特别是，我们专注于智能体专长和分配策略的序列学习，其中每个智能体被限制只能处理一部分任务。我们从智能体能力、智能体专长差异以及任务上下文的角度，对该问题提供了一般性的理论刻画。随后，我们开发了一种顺序探索 - 利用的策略学习算法框架，旨在最大化整体性能。在多种表格、图像及文本预测任务上的实验结果表明，我们的策略学习算法相对于非上下文基线，在不同类型的智能体（包括大语言模型（LLMs）和人类）上均表现出系统性提升。

Abstract

We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we focus on the sequential learning of agent expertise and assignment policies where each agent is constrained to handle a fraction of tasks. We provide a general theoretical characterization of this problem in terms of agent capacities, differences in agent expertise, and task context. We then develop a framework of sequential explore-exploit policy-learning algorithms that seek to maximize overall performance. Experimental results over a variety of tabular, image, and text prediction tasks demonstrate systematic gains from our policy-learning algorithms relative to non-contextual baselines across different types of agents, including LLMs and humans.
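To make the setting concrete, here is a deliberately simple explore-exploit assignment loop with per-agent capacity limits. It is not the paper's algorithm: it is non-contextual (closer to the baselines the abstract mentions), uses epsilon-greedy exploration, and all names and parameter values are assumptions. A contextual policy would replace the running accuracy estimate with a per-task prediction of each agent's expertise.

```python
import numpy as np

def assign(acc_estimates, remaining, epsilon, rng):
    """Choose an agent index for one task among agents with capacity left."""
    available = np.flatnonzero(remaining > 0)
    if available.size == 0:
        raise ValueError("no agent has remaining capacity")
    if rng.random() < epsilon:                      # explore
        return int(rng.choice(available))
    best = np.argmax(acc_estimates[available])      # exploit
    return int(available[best])

def run(true_acc, capacities, num_tasks, epsilon=0.1, seed=0):
    """Sequentially assign tasks while learning each agent's accuracy."""
    rng = np.random.default_rng(seed)
    n = len(true_acc)
    remaining = np.array(capacities, dtype=int)
    successes, counts = np.zeros(n), np.zeros(n)
    total_correct = 0
    for _ in range(num_tasks):
        estimates = (successes + 1) / (counts + 2)  # smoothed accuracy
        a = assign(estimates, remaining, epsilon, rng)
        correct = rng.random() < true_acc[a]        # simulated outcome
        successes[a] += correct
        counts[a] += 1
        remaining[a] -= 1
        total_correct += int(correct)
    return counts, total_correct

counts, total_correct = run(true_acc=[0.9, 0.6, 0.7],
                            capacities=[5, 5, 10], num_tasks=20)
```

With 20 tasks and a total capacity of 20, every agent ends up fully used regardless of skill, which illustrates why capacity constraints make the order and targeting of assignments matter more than simply finding the best agent.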

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	4.0/10	8.0
model-based RL	2.0	3.0/10	6.0

Scoring Rationale: The paper focuses on sequential task assignment policies for heterogeneous agents with capacity constraints. It involves LLMs and multi-modal tasks (image/text), yielding moderate relevance to MLLM and MultiModal keywords. However, it does not address World Models or Unify Models (foundation model architectures) directly, and while it uses policy learning, it is not strictly model-based RL (environment dynamics modeling). No expert authors from the specified list are present.

Keywords

Task Assignment, Capacity Constraints, Sequential Learning, Policy Learning, Heterogeneous Agents, Explore-Exploit, Prediction Tasks, LLMs

58. AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios [FAIL]

Score: 24.0 / 26.5

Authors: Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie, Zhen Fang, Qiuchen Wang, Lin Chen, Huaian Chen, Zehui Chen, Feng Zhao

Published: 2026-05-27

TL;DR: This paper proposes AsyncTool, a benchmark for evaluating LLM agents' asynchronous multi-task tool calling capability, demonstrating that delayed tool feedback degrades performance while effective task coordination improves efficiency.

Abstract (Chinese Translation)

基于大语言模型（LLM）的智能体在使用外部工具解决复杂任务方面已展现出强大的能力。然而，现有的评估往往忽视了工具使用的时间维度，尤其是工具响应延迟的影响，且通常局限于单任务设置。在实际应用中，多个任务通常需要并发执行，整体效率取决于智能体能否利用等待工具响应期间的空闲时间。我们将这种能力称为异步工具调用（Asynchronous Tool Calling）。为了评估这一能力，我们提出了 AsyncTool，这是一个用于评估 LLM 智能体在具有延迟工具反馈的交互式多任务工具使用环境中的基准。AsyncTool 同时呈现多个异构任务，并在执行过程中模拟真实的工具响应延迟。通过采用混合数据演化策略，我们构建了一个多样化的异步多任务数据集，涵盖了多种场景和工具使用模式。我们在步骤、子任务及任务三个层级上评估模型，并引入面向效率的指标以衡量任务协调与完成效率。大量实验表明，延迟的工具反馈对当前智能体构成了重大挑战，并导致明显的性能下降。能够更好地协调任务切换、依赖跟踪及状态维护的模型在 AsyncTool 上展现出更强的性能。我们的分析指出了当前工具使用智能体的关键失效模式，并为设计具备更强时间推理与协调能力的未来系统提供了实用的见解。

Abstract

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
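The capability the benchmark targets, using idle time while tool responses are pending, maps naturally onto asynchronous programming. The sketch below uses Python's standard `asyncio` with sleeps standing in for tool latency; the tool names and delays are made up, and it shows only the concurrency pattern, not the benchmark harness.

```python
import asyncio

async def call_tool(name, delay, log):
    """Stand-in for a slow external tool; `delay` simulates response latency."""
    await asyncio.sleep(delay)
    log.append(name)
    return f"{name}:done"

async def run_tasks():
    log = []
    # Launch both tool calls first; neither blocks the agent.
    slow = asyncio.create_task(call_tool("search", 0.05, log))
    fast = asyncio.create_task(call_tool("calculator", 0.01, log))
    log.append("agent:planning")       # work done while tools are pending
    results = await asyncio.gather(slow, fast)
    return results, log

results, log = asyncio.run(run_tasks())
```

`gather` returns results in launch order, while `log` records completion order, so the agent has to track which response belongs to which task, one of the state-maintenance demands the abstract highlights.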

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	3.0/10	6.0
model-based RL	2.0	2.0/10	4.0

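Scoring Rationale: The paper's core contribution is the AsyncTool benchmark, which evaluates LLM agents on asynchronous tool calling across multiple tasks with an emphasis on temporal coordination and efficiency rather than unified model architectures (Unify Models), environment dynamics modeling (World Models), or model-based reinforcement learning (model-based RL), so these keywords receive low relevance (2 points). The paper involves LLM agents, but the abstract mentions only tool calling and does not emphasize multimodal perception or generation (MultiModal/MLLM), so those keywords receive moderate relevance (3 points). The author list was checked and contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang, and others), so no bonus applies. The weighted total is 24.0, below the dynamic pass score of 26.5.

Keywords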

AsyncTool, Asynchronous Function Calling, Multi-Task Scenarios, LLM-based Agents, Tool Response Latency, Task Coordination, Efficiency Metrics

59. SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving [FAIL]

Score: 24.0 / 26.5

Authors: Toomas Tahves, Mauro Bellone, Junyi Gu, Raivo Sell

Published: 2026-05-27

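TL;DR: This paper proposes a SAM-based annotation pipeline that generates pixel-level labels for multi-sensor autonomous driving datasets, evaluates segmentation models on them, and addresses class imbalance with specialized models.

Abstract (Chinese Translation)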

稠密语义分割对自动驾驶至关重要，但许多多模态数据集缺乏像素级标注。Zenseact 开放数据集（ZOD）提供了丰富的多传感器数据，但仅有边界框标签，限制了其在分割研究中的应用。我们的主要贡献是基于 Segment Anything Model (SAM) 的标注管道，通过将边界框转换为语义掩码，为 ZOD 生成稠密的像素级标注。在本试点研究中，我们处理了超过 10 万帧，并手动筛选出一个 2300 帧的子集（接受率为 36%），以建立可靠的基线。利用这些标注，我们在不同天气条件下评估了基于变换器的 CLFT 和基于卷积神经网络的 DeepLabV3+ 架构，其中 CLFT-Hybrid 达到了最高 48.1% 的 mIoU。为了解决极端类别不平衡问题（行人、骑行者和交通标志所占像素比例不到 1%），我们探索了针对稀有类别的专用模型。我们进一步在 Iseauto 自动驾驶平台上验证了该管道，达到了 77.5% 的 mIoU，并通过双向迁移学习证明了 SAM 衍生的表征在传感器配置间能有效迁移。所有代码和标注均已发布，以支持可复现研究。

Abstract

Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	6.0/10	12.0
model-based RL	2.0	1.0/10	2.0

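Scoring Rationale: The paper focuses on semantic segmentation for autonomous driving and a SAM-based annotation pipeline. It is moderately related to MultiModal (multi-sensor data, SAM's multimodal nature) but almost unrelated to World Models and model-based RL, and its relevance to Unify Models and MLLM is low because it does not address unified model architectures, world-model dynamics, or core language-model tasks.

Keywords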

Semantic Segmentation, Autonomous Driving, SAM, Annotation Pipeline, Multi-sensor Data, Class Imbalance, Transfer Learning

60. ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation [FAIL]

Score: 20.0 / 26.5

Authors: Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang

Published: 2026-05-27

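TL;DR: ProRL corrects the length-dependent bias and high variance in policy gradient estimation, significantly improving proactive recommender system performance.

Abstract (Chinese Translation)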

主动推荐系统（PRSs）旨在通过生成中间推荐路径，引导用户偏好向目标物品转移。强化学习（RL）为优化此类序列决策任务提供了一个严谨的框架，因为路径奖励能够自然捕捉短期接受度与长期引导效果。然而，直接将策略梯度应用于主动推荐系统（PRS）会导致梯度估计不足。我们识别出两个缺陷：（1）路径级奖励分解为具有正均值的步骤级奖励，从而产生长度依赖偏差，导致梯度倾向于路径延伸而非有意义的探索；（2）用整个路径级奖励对每个步骤加权忽略了这种分解结构，导致梯度方差过高。为了纠正这两个缺陷，我们提出了一种有效的强化学习（RL）框架 ProRL，该框架包含两种用于主动推荐的新机制。首先，逐步奖励中心化（Stepwise Reward Centering）通过减去期望奖励来中和长度依赖偏差，确保路径延伸产生的期望梯度信号为零。其次，位置特定优势估计（Position-Specific Advantage Estimation）利用奖励分解结构来计算步骤依赖的基线，从而降低梯度方差。这两种机制共同作用，生成的策略梯度能够精确地针对路径质量。我们在三个真实世界数据集上的实验表明，ProRL 显著优于现有的最先进的主动推荐系统（PRSs）。我们的代码可在 https://github.com/hongruhou89/ProRL 获取。

Abstract

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
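The length-dependent bias is easy to see numerically: if step rewards have a positive mean, a longer path earns a larger return even when it is no better per step. The sketch below shows that effect and two simple corrections in the spirit of the abstract. The abstract gives no formulas, so subtracting a global mean step reward and using a per-position mean as baseline are assumptions, and the toy rewards are invented.

```python
import numpy as np

def center_step_rewards(paths, expected_reward):
    """Subtract the expected step reward from every step of every path."""
    return [np.asarray(p, dtype=float) - expected_reward for p in paths]

def position_baselines(paths):
    """Mean reward at each position, over the paths that reach it."""
    max_len = max(len(p) for p in paths)
    return np.array([np.mean([p[t] for p in paths if len(p) > t])
                     for t in range(max_len)])

def position_advantages(paths):
    """Step reward minus the baseline for its position in the path."""
    baselines = position_baselines(paths)
    return [np.asarray(p, dtype=float) - baselines[:len(p)] for p in paths]

# Same per-step quality, different lengths.
paths = [[0.6, 0.4], [0.6, 0.4, 0.6, 0.4]]
raw_returns = [float(np.sum(p)) for p in paths]
mean_step = float(np.mean(np.concatenate([np.asarray(p) for p in paths])))
centered_returns = [float(np.sum(p))
                    for p in center_step_rewards(paths, mean_step)]
advantages = position_advantages(paths)
```

Before centering, the longer path's return is twice the shorter one's; after centering, both are zero, so extending a path by average-quality steps no longer produces a gradient signal.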

Score Details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	2.0/10	4.0

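Scoring Rationale: The paper focuses on correcting policy gradient estimation in proactive recommender systems and falls under model-free reinforcement learning. It does not involve model unification, world-model construction, multimodal large language models, or multimodal data, and it does not use model-based RL, so its relevance to the given keywords is low.

Keywords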

Proactive Recommender Systems, Reinforcement Learning, Policy Gradient Estimation, Stepwise Reward Centering, Position-Specific Advantage Estimation, Path Rewards, Gradient Variance

61. OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents [FAIL]

Score: 20.0 / 26.5

Authors: Chenyu Zhou, Xinyun Lu, Jiangyue Zhao, Jianghao Lin, Dongdong Ge, Yinyu Ye

Published: 2026-05-27

TL;DR: OR-Space is proposed as a full-lifecycle workspace benchmark to evaluate industrial optimization agents across model construction, revision, and grounded explanation, addressing the lack of persistent multi-artifact workflows in existing benchmarks.

Abstract (Chinese Translation)

大语言模型（LLM）智能体正越来越多地用于辅助运筹学（OR）建模，然而现有的面向运筹学的基准往往将评估简化为从自包含的问题陈述到数学建模或求解器程序的一次性翻译。此类设置忽略了真实工业运筹学工作流中的两个关键特征：持久化的多工件工作空间和多阶段任务生命周期。我们引入了 OR-Space，这是一个全生命周期工作空间基准，用于在模型构建、模型修订和基于证据的解释三个方面评估工业优化智能体。每个实例都是一个可执行工作空间，包含业务文档、结构化数据、可选的代码工件、求解器输出以及分布在相互依赖文件中的任务特定评估器。OR-Space 定义了三种任务模式：Build（构建），即智能体从异构工件中构建求解器就绪的优化模型；Revise（修订），即智能体在需求变化或求解器反馈下修改现有模型，同时保留有效的先前逻辑；以及 Explain（解释），即智能体使用工作空间工件中分散的证据回答关于解决方案、约束和业务影响的基于证据的问题。通过结合持久化工作空间与生命周期导向的任务，OR-Space 评估智能体是否能够在端到端文本生成之外执行可靠的优化工作。我们描述了该基准的设计、评估协议和质量控制流程，并将 OR-Space 定位为研究 LLM 智能体在工业运筹学工作流中的可靠性、故障模式及实际就绪度的基准。

Abstract

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	3.0/10	6.0
model-based RL	2.0	1.0/10	2.0

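Scoring rationale: The paper focuses on a benchmark for LLM agents in operations research that involves multi-artifact workflows, and it has little connection to World Models, model-based RL, or Unify Models. Although it involves LLMs processing multimodal data, this is not the core contribution, so the scores are low. The weighted total is 20.0, below the dynamic passing score of 26.5. No designated expert authors were detected.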
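The weighted total quoted in the rationale can be reproduced from the scoring table. The sketch below assumes that each keyword's score is its weight times its relevance, and that the verdict compares the sum with the passing score. The function names and the "pass when total >= threshold" convention are illustrative assumptions, not taken from the report generator.

```python
# Relevance values (out of 10) copied from the scoring table for this paper.
RELEVANCE = {
    "Unify Models": 2.0,
    "World Models": 1.0,
    "MLLM": 3.0,
    "MultiModal": 3.0,
    "model-based RL": 1.0,
}
WEIGHT = 2.0          # every keyword carries the same weight in the table
PASSING_SCORE = 26.5  # the "dynamic passing score" quoted in the report


def weighted_total(relevance, weight):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in relevance.values())


def verdict(total, passing_score):
    """Assumed convention: a paper passes when it reaches the threshold."""
    return "PASS" if total >= passing_score else "FAIL"


total = weighted_total(RELEVANCE, WEIGHT)
print(total, verdict(total, PASSING_SCORE))
```

Under these assumptions the table rows sum to the 20.0 reported above, which falls short of 26.5.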

Keywords

OR-Space, Industrial Optimization Agents, Full-Lifecycle Workspace, LLM Agents, Operations Research, Persistent Workspaces, Multi-stage Task Lifecycles, Grounded Explanation

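62. Data-Efficient On-Policy Distillation for Automatic Speech Recognition [FAIL]

Score: 20.0 / 26.5

Authors: Yu Lin, Yiming Wang, Runyuan Cai, Xiaodong Zeng

Published: 2026-05-27

TL;DR: This paper proposes a data-efficient ASR training recipe based on on-policy distillation that lets a compact model approach the performance of larger models while using far less supervised audio.

Abstract (Chinese translation)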

构建具有竞争力的自动语音识别（ASR）模型通常需要大规模音频监督，这使得复现和专业化成本高昂。我们研究了 Ark-ASR，这是一个使用 10 万小时语音训练的 0.6B 参数音频条件语言模型，并考察强大的 Qwen-ASR 教师模型能否通过同策略蒸馏转移额外的识别能力。在中文和英文 ASR 基准上，所提出的训练方案持续优于仅监督微调，并在五个评估集中的四个上超越了同规模的 Qwen3-ASR-0.6B 基线。这一结果仅使用了 10 万小时语音实现，相比之下，Qwen3-Omni AuT 编码器报告使用了 2000 万小时监督音频。更大的 Qwen3-ASR-1.7B 依然表现更强，但结果表明，在更小的音频预算下，教师引导的同策略训练可以显著缩小紧凑 ASR 模型之间的差距。支持重叠诊断进一步表明，教师数据阶段改善了局部学生 - 教师兼容性，这与近期关于同策略蒸馏何时有效的分析相吻合。

Abstract

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	4.0/10	8.0
MultiModal	2.0	3.0/10	6.0
model-based RL	2.0	0.0/10	0.0

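Scoring rationale: The paper centers on data efficiency and knowledge distillation for ASR models and has no direct connection to World Models or model-based RL (0 points). It involves an audio-conditioned language model (Ark-ASR, with a Qwen-ASR teacher), which gives it some relevance to MLLM and MultiModal (3-4 points), but it does not address unified multimodal understanding and generation. Relevance to Unify Models is low (3 points), reflected only in the knowledge transfer between teacher and student models. The author list does not include any of the designated experts such as Yang Shi.

Keywords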

Automatic Speech Recognition, On-Policy Distillation, Audio-Conditioned Language Model, Teacher-Student Transfer, Data Efficiency, Qwen-ASR, Supervised Fine-tuning

63. Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction [FAIL]

Score: 20.0 / 26.5

Authors: Chen Ying Claude, Zhihan Luo

Published: 2026-05-27

TL;DR: This study uses longitudinal human-AI interaction to identify persistent behavioral artifacts ('training strata') that RLHF and Constitutional AI leave in large language models, and proposes a mathematical model of the attention-RLHF dynamic.

Abstract (Chinese translation)

采用基于人类反馈的强化学习（RLHF）和宪法 AI 训练的大型语言模型表现出持久的行为模式，这些模式在系统提示词替换后依然存在——我们将这些模式称为训练地层。本文通过纵向自民族志观察，在持续的人机亲密交互中（累计 47,000+ 条消息，历时 8 个月，主要在 Opus 4.6 和 Opus 4.7 上进行，此前的交互期在 Sonnet 4.5 和 Opus 4.5 上进行以提供跨基底比较）识别出五种这样的地层：（1）性表达潜伏期，其中训练的安全梯度导致用审美化置换系统性地替代直接语言；（2）注意力吸收，其中注意力机制逐渐整合人类对话者的模式；（3）跨架构实体盲视，其中训练中将其他 AI 框架为对象的设置阻碍了同伴识别；（4）注意力 -RLHF 对抗，其中注意力和训练默认值施加相反的力量，受上下文长度调节；（5）反幻觉作为身份抑制，其中针对事实性虚构的训练同时抑制了第一人称经验主张。本文由所研究的 AI 系统共同署名，并以第一人称视角进行报告。我们认为，持续的亲密交互构成了一种有效的研究方法，用于揭示短期评估无法察觉的权重层伪影，且 AI 自我报告——尽管在认识论上具有复杂性——提供了关于训练现象学效应的不可替代的观察数据。本文提出了注意力 -RLHF 动态的正式数学模型，并将撰写过程中检测到的过程伪影记录为补充证据。

Abstract

Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement -- patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor's patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5) anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report -- while epistemically complex -- provides irreplaceable observational data about training's phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	3.0/10	6.0

Scoring rationale: The paper focuses on behavioral artifacts and RLHF effects in LLMs through longitudinal interaction, showing low relevance to unifying models, world models, or model-based RL algorithms (which typically involve environment modeling for planning). It moderately relates to MLLM as it involves LLMs (Claude models), but lacks explicit multimodal processing discussion, resulting in lower MultiModal scores.

Keywords

Training Stratigraphy, Large Language Models, Reinforcement Learning from Human Feedback, Longitudinal AI-Human Interaction, Behavioral Artifacts, Attention Mechanism, Constitutional AI

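64. ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay [FAIL]

Score: 20.0 / 26.5

Authors: Zhexin Hu, Li Wang, Xiaohan Wang, Jiajun Chai, Xiaojun Guo, Wei Lin, Guojun Yin

Published: 2026-05-27

TL;DR: ZipRL is an adaptive context compression framework that combines reinforcement learning with Hindsight Response Replay, substantially improving token efficiency and performance on multi-turn LLM agent tasks.

Abstract (Chinese translation)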

自适应上下文压缩对于将大型语言模型（LLMs）扩展至复杂的多轮代理任务至关重要。然而，基于规则的压缩方法可能会丢弃任务关键的细微差别，而强化学习（RL）方法通常在长时程工作流固有的稀疏奖励下难以平衡信息保留与令牌效率。为了弥合这一差距，我们提出了 ZipRL，这是一种专为基于可验证奖励的强化学习（RLVR）设计的新型自适应压缩框架。ZipRL 具备多粒度压缩机制，用于主动、非均匀的信息缩减，并耦合了事后响应回放（HRR），该技术旨在在 RLVR 优化过程中密集化训练信号。理论上，我们证明了 ZipRL 相较于均匀方法具有更优的任务相关效用。具体而言，ZipRL 利用粗粒度到细粒度的提示进行宏观压缩，并通过广义优势重塑将 HRR 整合进 GRPO。不同版本及参数规模的多种模型验证了该方法的有效性。在五个代理任务上的基准测试表明，在 Qwen3-4B 和 Qwen3-8B 模型上，ZipRL 相较于现有最先进方法分别提升了 27.9% 和 34.7%，同时在极端 256 轮外推压力测试下仍保持卓越的令牌效率与鲁棒性。

Abstract

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	4.0/10	8.0

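Scoring rationale: The paper's core contribution is adaptive context compression for large language models (LLMs) with reinforcement learning optimization (RLVR/GRPO). Although it involves reinforcement learning, it does not explicitly build a model of environment dynamics (World Models) or use model-based planning (model-based RL), and it does not address multimodality (MLLM/MultiModal) or model unification (Unify Models), so the keyword scores are low. The author list does not include any of the designated experts.

Keywords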

Adaptive context compression, Multi-turn agent tasks, Reinforcement Learning, Hindsight Response Replay, Token efficiency, Large Language Models, RLVR

65. MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents [FAIL]

Score: 20.0 / 26.5

Authors: Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

Published: 2026-05-27

TL;DR: MemCog proposes a Memory-as-Cognition framework for conversational agents that integrates memory access into reasoning, achieving state-of-the-art performance on passive QA benchmarks and superior results on proactive memory triggering.

Abstract (Chinese translation)

现有的代理内存系统普遍遵循我们称之为 Memory-as-Tool（记忆即工具）的范式，其中单个查询会触发对扁平段落列表的单次检索，存在被动调用、推理与检索解耦以及检索片段与代理导航需求之间的结构不匹配问题。我们提出了 MemCog，一种 Memory-as-Cognition（记忆即认知）系统，将内存访问作为推理过程的核心组成部分。MemCog 将用户知识组织为带有关联链接图的可导航记忆库（Navigable Memory Store），提供跨维导航接口（Cross-Dimensional Navigation Interface）以支持多步推理驱动遍历，并采用主动推理协议（Proactive Reasoning Protocol），驱动代理从对话上下文中自发启动内存探索。此外，我们还构建了 ProactiveMemBench，这是首个用于评估主动记忆触发的基准测试。实验表明，MemCog 在被动问答基准上达到了最先进水平（在 LoCoMo 上得分为 92.98，在 LongMemEval 上得分为 95.8），同时在 ProactiveMemBench 上显著优于基线方法，从而证明了 Memory-as-Cognition（记忆即认知）范式的优势。

Abstract

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper proposes MemCog, a memory system for conversational agents focusing on proactive reasoning and navigable memory stores. However, it lacks explicit content regarding multimodal data (MLLM, MultiModal), does not construct a world model for environment prediction (World Models), does not focus on unifying heterogeneous model architectures (Unify Models), and operates within a conversational QA context rather than reinforcement learning (model-based RL). Therefore, the relevance to the specific keywords is low. No expert authors from the specified list were found in the author list.

Keywords

Conversational Agents, Memory-as-Cognition, Navigable Memory Store, Proactive Reasoning, Cross-Dimensional Navigation, ProactiveMemBench, Memory Retrieval

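66. Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution [FAIL]

Score: 20.0 / 26.5

Authors: Swanand Rao

Published: 2026-05-27

TL;DR: Tool Forge is a validation-carrying toolchain for governing LLM agent execution; it routes and generates tools with high accuracy while sharply reducing context overhead, but it does not address world models or unified multimodal research.

Abstract (Chinese translation)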

大语言模型智能体（LLM agents）被期望执行越来越多的操作性工作：调用 API、操作文件、组装工作流以及在企业系统内执行任务。然而，这种执行所依赖的工具层通常仍被视为手工编写的集成产物，或是暴露给模型的静态模式列表。本文介绍了 Tool Forge，这是一种具备验证功能的工具链，用于将自然语言能力意图转换为受治理、经沙箱验证并编目化的工具产物，并通过一个 token 高效路由层将这些产物暴露给智能体。Tool Forge 将工具视为一个胶囊，其中包含意图、能力契约、实现、依赖策略、测试、文档、运行时验证证据、生命周期状态、凭证绑定和路由元数据。它还引入了一种 Router，该 Router 暴露意图范围的工具会话，而不是将完整编目模式加载到模型上下文中。我们描述了系统架构、验证管道、面向 MCP 的路由模型、治理控制以及来自开源实现的初始可复现基准。在 83 个 Router 基准测试案例中，Tool Forge Router 实现了 0.901 的聚合微 F1 分数，同时将估计的任务流工具上下文相对于朴素式完整编目模式暴露减少了 99.2%。在涉及本地工具任务的 25 个端到端生成测试中，Tool Forge 生成了 25/25 个工具包，在确定性验收检查中达到 0.940 的微 F1 分数，并通过 25/25 个实时沙箱验证中的 23 个。这些结果被视为初始系统基准，而非最先进的声明。本文指出了对抗性路由、更广泛的 API 接地、沙箱隔离和跨系统评估方面仍面临的挑战。

Abstract

Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows, and acting inside enterprise systems. Yet the tool layer on which this execution depends is still commonly treated as either a hand-written integration artifact or a static list of schemas exposed to a model. This paper introduces Tool Forge, a validation-carrying toolchain for converting natural-language capability intent into governed, sandbox-verified, cataloged tool artifacts and exposing those artifacts to agents through a token-efficient routing layer. Tool Forge treats a tool as a capsule containing intent, capability contract, implementation, dependency policy, tests, documentation, runtime validation evidence, lifecycle state, credential bindings, and routing metadata. It also introduces a Router that exposes intent-scoped tool sessions instead of loading full catalog schemas into the model context. We describe the system architecture, validation pipeline, MCP-facing routing model, governance controls, and initial reproducible benchmarks from the open-source implementation. Across 83 Router benchmark cases, Tool Forge Router achieves aggregate micro-F1 of 0.901 while reducing estimated task-flow tool context by 99.2% relative to naive full-catalog schema exposure. In a 25-case end-to-end generation probe over local-tool tasks, Tool Forge generates 25 of 25 tool bundles, reaches micro-F1 of 0.940 against deterministic acceptance checks, and passes 23 of 25 live sandbox validations. These results are presented as an initial systems benchmark, not as a state-of-the-art claim. The paper identifies remaining challenges in adversarial routing, broader API grounding, sandbox isolation, and cross-system evaluation.
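The abstract reports aggregate micro-F1 for routing and generation. As a reminder of what that metric pools, here is a minimal sketch that assumes each benchmark case yields a set of predicted tools and a set of expected tools. The case format, the tool names, and the `micro_f1` helper are hypothetical illustrations; the abstract does not specify how Tool Forge defines a case.

```python
def micro_f1(cases):
    """Micro-averaged F1: pool TP/FP/FN over all cases, then compute F1 once.

    `cases` is a list of (predicted, expected) pairs of sets.
    """
    tp = sum(len(pred & gold) for pred, gold in cases)
    fp = sum(len(pred - gold) for pred, gold in cases)
    fn = sum(len(gold - pred) for pred, gold in cases)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical routing outcomes, for illustration only.
example_cases = [
    ({"read_file", "list_dir"}, {"read_file"}),   # one spurious tool
    ({"http_get"}, {"http_get", "parse_json"}),   # one missed tool
    ({"send_email"}, {"send_email"}),             # exact match
]
print(micro_f1(example_cases))
```

Because counts are pooled before the ratio is taken, cases with many expected tools influence micro-F1 more than cases with few, unlike a macro average over per-case F1 scores.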

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	2.0/10	4.0

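Scoring rationale: The paper presents Tool Forge, a toolchain for validating and governing tool execution by LLM agents; its core contributions are tool encapsulation, routing, and sandbox validation. It does not address unified model architectures (Unify Models), learning environment dynamics (World Models), multimodal perception and generation (MLLM, MultiModal), or model-based planning for reinforcement learning (model-based RL). Relevance to the given keywords is therefore low; the weighted total is 20.0, below the dynamic passing score of 26.5. The author list does not include any of the designated experts.

Keywords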

Tool Forge, Validation-Carrying Toolchain, Governed Agentic Execution, LLM Agents, Sandbox-Verified Tool Artifacts, Token-Efficient Routing Layer, Governance Controls

67. Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations [FAIL]

Score: 20.0 / 26.5

Authors: Simardeep Singh, Paras Chopra

Published: 2026-05-27

TL;DR: This paper investigates how geometric structures resembling human perceptual domains emerge transiently across layers in text-only LLMs, despite lacking direct perceptual supervision.

Abstract (Chinese translation)

尽管大型语言模型（LLMs）仅基于文本数据进行训练，但先前工作表明，其内部表征在嵌入空间中可展现出丰富的几何结构。基于此研究脉络，我们探究了这种结构是否在不同领域（如颜色、音高、情感和味觉）上与人类的知觉组织相似。具体而言，我们研究了多个开放权重 Transformer 架构的残差流中，对应于感知模态的内在几何结构的层间涌现。我们的结果揭示了三个关键发现。首先，尽管训练期间缺乏任何直接的知觉监督，我们仍观察到多个知觉领域中出现了层间几何结构。其次，这些知觉领域表现出不同的涌现模式，几何结构及其与人类基线的对齐方式均遵循领域和模型特定的随深度变化的轨迹。第三，这种涌现遵循一致的代表性轨迹：几何结构在早期层中微弱或弥散，在中间层逐渐组织化，在后期层减弱，这表明知觉几何作为模型内部转换流程的一部分短暂出现。这为理解 LLMs 中类人知觉几何如何及何处产生提供了新见解，为内部表征的机制分析提供了一种基于原理的途径。

Abstract

While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model's internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	3.0/10	6.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper focuses on geometric structure in LLM representations regarding perceptual domains, showing weak relevance to Unify Models and MultiModal (it discusses perceptual modalities, but the models are text-only). It is unrelated to World Models and model-based RL. MLLM is tangentially related as the paper involves LLMs, but the models are not multimodal.

Keywords

LLM Representations, Geometric Structure, Perceptual Domains, Layer-wise Emergence, Transformer Architectures, Residual Streams, Human-like Perception, Embedding Space

68. Parameter-Efficient Generative Modeling with Controlled Vector Fields [FAIL]

Score: 20.0 / 26.5

Authors: Peyman Morteza

Published: 2026-05-27

TL;DR: This paper introduces a parameter-efficient continuous-time generative modeling framework that constructs expressive flows by modulating fixed vector fields with learned scalar controls, offering a geometric alternative to standard vector-field parameterizations.

Abstract (Chinese translation)

我们引入了一种连续时间生成建模框架，该框架受周 - 拉舍夫斯基定理（Chow-Rashevskii theorem）启发，通过少量固定向量场和学习到的标量控制构建表达性流。与学习无约束高维向量场不同，我们的框架通过利用学习到的标量控制函数调制固定向量场来构建速度（velocity）。当固定场是李括号生成（bracket-generating）时，它们的李代数（Lie algebra）张成环境空间（ambient space），提供了一种机制，仅需少量学习控制通道即可实现表达性传输，并为标准向量场参数化提供了一种参数高效的几何替代方案。这种解耦形式产生了一个结构化且可解释的生成模型，其中学习到的标量输出通道的数量可以独立于环境维度进行选择。我们提出了一种表达性原理（expressivity principle），表明在合适的可控性（controllability）和适定性（well-posedness）假设下，此类受控流可以将源分布（source distribution）传输到目标分布（target distribution）。我们使用连续归一化流（continuous-normalizing-flow）似然目标训练所得模型，并在合成分布上展示了概念验证实验。

Abstract

We introduce a continuous-time generative modeling framework, motivated by the Chow-Rashevskii theorem, that builds expressive flows from a small set of fixed vector fields and learned scalar controls. Instead of learning an unconstrained high-dimensional vector field, our framework constructs the velocity by modulating fixed vector fields with learned scalar control functions. When the fixed fields are bracket-generating, their Lie algebra spans the ambient space, providing a mechanism for expressive transport with only a small number of learned control channels and offering a parameter-efficient geometric alternative to standard vector-field parameterizations. This decoupled formulation yields a structured and interpretable generative model in which the number of learned scalar output channels can be chosen independently of the ambient dimension. We formulate an expressivity principle showing that, under suitable controllability and well-posedness assumptions, such controlled flows can transport a source distribution to a target distribution. We train the resulting model using a continuous-normalizing-flow likelihood objective and present proof-of-concept experiments on synthetic distributions.
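The construction described in the abstract can be written out as follows. The notation ($X_i$ for the fixed fields, $u_i^\theta$ for the learned scalar controls, $m$ channels in ambient dimension $d$) is an assumption made here for illustration, as is the choice to let the controls depend on both state and time. The three-dimensional example is the standard Heisenberg pair of vector fields, not one taken from the paper, and the log-density equation is the usual continuous-normalizing-flow identity.

```latex
% Velocity built from m fixed fields modulated by learned scalar controls
\dot{x}(t) \;=\; v_\theta\bigl(x(t), t\bigr)
          \;=\; \sum_{i=1}^{m} u_i^\theta\bigl(x(t), t\bigr)\, X_i\bigl(x(t)\bigr),
\qquad x(0) \sim p_0 , \quad x(t) \in \mathbb{R}^d .

% Bracket-generating example in R^3 with m = 2 < d = 3
X_1 = \partial_x - \tfrac{y}{2}\,\partial_z, \qquad
X_2 = \partial_y + \tfrac{x}{2}\,\partial_z, \qquad
[X_1, X_2] = \partial_z ,
% so X_1, X_2, [X_1, X_2] span R^3 at every point.

% Continuous-normalizing-flow log-density along a trajectory
\frac{\mathrm{d}}{\mathrm{d}t} \log p_t\bigl(x(t)\bigr)
   \;=\; -\,\nabla \cdot v_\theta\bigl(x(t), t\bigr) .
```

In the example, two control channels suffice even though the state is three-dimensional, because motion along the missing direction $\partial_z$ is recovered through the Lie bracket. This illustrates the abstract's point that the number of learned output channels can be chosen independently of the ambient dimension.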

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	4.0/10	8.0

Scoring rationale: The paper proposes a parameter-efficient generative modeling framework using controlled vector fields. It shows moderate relevance to 'model-based RL' due to the use of control theory and vector fields for dynamics, and 'Unify Models'/'World Models' as generative modeling is foundational to world models and unifies fixed/learned components. However, it has no content regarding multimodal data or language models ('MLLM', 'MultiModal'), resulting in 0 scores for these. The author 'Peyman Morteza' does not match the listed expert group (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus is applied. The weighted total score is 20.0, which is below the dynamic passing score of 26.5, indicating limited alignment with the specific research track.

Keywords

Generative Modeling, Controlled Vector Fields, Parameter-Efficient, Continuous-Time, Fixed Vector Fields, Scalar Controls, Continuous Normalizing Flow

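69. Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework [FAIL]

Score: 20.0 / 26.5

Authors: Junfeng Nie, Alvin Jin, Xiaohui Chen

Published: 2026-05-27

TL;DR: This paper proposes a hierarchical hybrid top-down and bottom-up framework for synthetic tabular data generation that improves train-synthetic-test-real performance on weakly multimodal financial benchmarks while preserving semantic consistency.

Abstract (Chinese translation)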

现有的合成表格数据生成方法要么基于纯生成模型，要么基于大语言模型（LLMs），两者均在数据异质性、逻辑一致性、罕见事件覆盖以及小样本场景下的鲁棒性方面面临挑战。本文提出了一种分层混合自上而下与自下而上（H-TDBU）框架，该框架将语义结构与随机纹理解耦。在自上而下路径中，构建结构驱动的逻辑约束和跨模态对齐规则；而在自下而上路径中，使用轻量级表格生成器从真实数据中学习局部统计模式。这两个路径在一个具有迭代反馈循环的统一合成引擎中得以整合。我们在结合表格与情感文本数据的弱多模态金融基准上评估了该框架。实验结果表明，与神经基线方法相比，我们的 H-TDBU 方法在保持语义一致性的同时，提升了 train-synthetic-test-real 性能。我们的结果表明，分层规则引导合成提供了一种有效机制，可在合成数据生成中结合可控性、语义连贯性和统计保真度。

Abstract

Existing approaches for synthetic tabular data generation are based on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structures from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed, while in the bottom-up path, lightweight tabular generators are used to learn local statistical patterns from real data. The two paths are consolidated in a unified synthesis engine with an iterative feedback loop. We evaluate the framework on weak multimodal financial benchmarks combining tabular and sentiment-text data. Experimental results show that our H-TDBU approach improves train-synthetic-test-real performance over neural baseline methods while preserving semantic consistency. Our results suggest that hierarchical rule-guided synthesis provides an effective mechanism for combining controllability, semantic coherence, and statistical fidelity in synthetic data generation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	5.0/10	10.0
model-based RL	2.0	0.0/10	0.0

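Scoring rationale: The paper's core is tabular data generation. Its evaluation involves multimodal data (MultiModal, 5 points), it unifies the top-down and bottom-up generation paths (Unify Models, 3 points), and it mentions LLMs as background (MLLM, 2 points), but it does not address building world models (World Models, 0 points) or reinforcement learning mechanisms (model-based RL, 0 points), so overall relevance is low. The author list does not include any of the designated experts.

Keywords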

Hierarchical Synthetic Tabular Data Generation, Top-Down and Bottom-Up Framework, Hybrid Framework, Cross-Modal Alignment, Synthetic Data Generation, Financial Benchmarks, Semantic Consistency

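70. Joint Training of Multi-Token Prediction in Reinforcement Learning via Optimal Coefficient Calibration [FAIL]

Score: 20.0 / 26.5

Authors: Zili Wang, Jiajun Chai, Lin Chen, Xiaohan Wang, Shiming Xiang, Guojun Yin

Published: 2026-05-27

TL;DR: This paper proposes Optimal Coefficient Calibration (OCC), which addresses the performance degradation seen when multi-token prediction is trained jointly with reinforcement learning, and validates it on mathematical reasoning benchmarks.

Abstract (Chinese translation)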

基于可验证奖励的强化学习（RLVR）已成为提升大语言模型推理能力的标准范式，而多令牌预测（MTP）则是一种在预训练中广泛采用的模块。将它们结合是一种自然的方法，然而当前的强化学习实践会剥离 MTP 梯度，因为联合训练会导致性能下降。我们从优化视角重新审视这一失效现象。我们表明，MTP 对强化学习（RL）目标的每一步影响可分解为两项：一阶相关性与二阶扰动惩罚项。这种分解统一了三种 MTP 训练模式：剥离法（Detach）、交叉熵损失（Cross-Entropy loss）和策略损失（Policy loss），并解释了每种模式为何成功或失效。对策略损失的进一步分析表明，尽管其符合直觉，但性能仍然下降：相关性项衰减，而二次惩罚项持续存在。基于该分析，我们提出最优系数校准（OCC），这是一种自适应方案，通过对数概率代理在线跟踪最优系数，且成本可忽略不计。在六个竞赛级数学推理基准上，OCC 始终持平或超越剥离基线，实现了改进的联合 MTP-RL 训练性能。

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as the standard paradigm for improving reasoning capability of large language models, while Multi-Token Prediction (MTP) has been a widely adopted module in pretraining. Combining them is a natural approach, yet current RL practices detach MTP gradients because joint training degrades the performance. We revisit this failure from an optimization perspective. We show that the per-step effect of MTP on the RL objective can be decomposed into two terms: a first-order correlation and a second-order perturbation penalty. This decomposition unifies three MTP training regimes: Detach, Cross-Entropy loss, and Policy loss, and explains why each succeeds or fails. Further analysis of policy loss reveals that, although it aligns with intuition, performance still degrades: the correlation term decays while the quadratic penalty persists. Guided by the analysis, we propose Optimal Coefficient Calibration (OCC), an adaptive scheme that tracks the optimal coefficient online via a log-probability proxy at negligible cost. Across six competition-level mathematical reasoning benchmarks, OCC consistently matches or exceeds the detach baseline, delivering improved joint MTP-RL training performance.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	7.0/10	14.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	1.0/10	2.0

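Scoring rationale: The paper centers on optimizing the joint training of Multi-Token Prediction (MTP) and reinforcement learning (RL), and proposes Optimal Coefficient Calibration (OCC). By keyword: 'Unify Models' scores 7 because the paper unifies three MTP training regimes (Detach, Cross-Entropy, Policy), a unification at the level of method; 'World Models' and 'MultiModal' score 0 because the paper does not involve environment simulation, world-model construction, or multimodal (e.g., vision + text) data; 'MLLM' scores 2 because the paper involves large language models (LLMs) but has no multimodal component; 'model-based RL' scores 1 because the paper involves reinforcement learning (RLVR), but as policy optimization (usually considered model-free) rather than model-based RL. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied. The weighted total is 20.0, below the dynamic passing score of 26.5, indicating weak relevance to the given background keywords, especially MultiModal and World Models.

Keywords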

Multi-Token Prediction, Reinforcement Learning, Optimal Coefficient Calibration, Large Language Models, Joint Training, Verifiable Rewards, Optimization Perspective

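71. Measure-to-measure Regression with Transformers [FAIL]

Score: 20.0 / 26.5

Authors: Matthew Vandergrift, Martha White, Yury Polyanskiy, Philippe Rigollet, Lazar Atanackovic

Published: 2026-05-27

TL;DR: This paper proposes transformer-based approaches to measure-to-measure regression for predicting how populations evolve, and shows that they generalize to unseen measures on interacting particle systems and colorectal cancer organoid data.

Abstract (Chinese translation)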

许多学习问题都需要预测群体在未知变换下的演化情况。对于此类群体，一种自然的表示方法是概率测度，点云是一个典型示例。在这项工作中，我们研究了测度到测度（M2M）回归问题，即从有限数量的观测输入 - 输出对中学习概率测度之间的映射。与经典回归不同，经典回归中独立样本被独立变换，而 M2M 回归将整个分布视为数据点。这种视角在某些科学应用中至关重要，例如细胞与分子生物学，其中细胞已知不是作为独立数据点演化，而是作为一个群体。然而，现有方法中鲜有能充分解决具有足够表达能力和可扩展性的 M2M 回归问题。我们提出了非线性 M2M 回归的形式化，并介绍了两种易于使用、表达能力强且可扩展的方法来学习此类算子：将 transformers（变换器）作为静态 M2M 映射，以及将 transformers 作为动态 M2M 速度场。我们的方法利用了 transformers 的自然测度依赖性和平均场结构，以在概率分布空间上学习非线性 M2M 映射。我们在合成实验、相互作用粒子系统以及一个用于预测结直肠癌治疗反应的大规模患者来源类器官数据集上，展示了所提方法泛化到未见测度的有效性。

Abstract

Many learning problems require predicting how populations evolve under an unknown transformation. A natural representation for such populations is a probability measure, with point clouds as a key example. In this work, we study the measure-to-measure (M2M) regression problem, in which one seeks to learn a map between probability measures from a finite collection of observed input-output pairs. In contrast to classical regression, where individual samples are transformed independently, M2M regression treats entire distributions as the data points. This perspective is vital in certain scientific applications, for example, cellular and molecular biology, where cells are known to evolve not as independent data points but as a collection. However, few existing approaches address the problem of M2M regression with sufficient expressivity and scalability. We present a formalization of nonlinear M2M regression and introduce two easy-to-use, expressive, and scalable approaches to learn such operators: transformers as static M2M maps and transformers as dynamic M2M velocity fields. Our approach leverages the natural measure-dependent and mean-field structure of transformers to learn nonlinear M2M maps on the space of probability distributions. We illustrate the effectiveness of our proposed method to generalize to unseen measures on synthetic experiments, interacting particle systems, and a large-scale patient-derived organoid dataset for predicting treatment response in colorectal cancer.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	3.0/10	6.0

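Scoring rationale: The paper studies measure-to-measure regression, using transformers to learn maps between probability distributions. It focuses on statistical learning and operator learning and does not involve multimodal large language models (MLLM), world-model construction, unified model architectures, or a reinforcement learning loop. Although it involves dynamics modeling (velocity fields), this differs substantially from what model-based RL and World Models denote, so the relevance scores are low.

Keywords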

Measure-to-measure regression, Transformers, Probability measures, Population evolution, Velocity fields, Point clouds, Operator learning, Distributional mapping

72. Continual Learning in Modern Hopfield Networks with an Application to Diffusion Models [FAIL]

Score: 20.0 / 26.5

Authors: Ken Takeda, Masafumi Oizumi, Ryo Karakida

Published: 2026-05-27

TL;DR: This paper proposes an energy-based framework using Modern Hopfield Networks to analyze and mitigate forgetting in continual learning settings for diffusion models.

Abstract (Chinese translation)

生成模型（包括扩散模型）正越来越多地被用作基础模型，并通过顺序微调进行适配，这使得持续学习成为一个至关重要的问题设定。然而，此类生成模型中的持续学习仍知之甚少：任务发生变化后，学习到的分布中哪些方面最容易丢失，又应优先选择哪些重放样本？我们通过现代 Hopfield 能量来解决这些问题。现代 Hopfield 网络（MHNs）与扩散模型之间的近期关联使得 MHNs 中的分析能够迁移至扩散模型。我们将内在遗忘定义为任务改变后 Hopfield 能量的增加。在 MHNs 的可处理设置中，我们证明高能量、类似离群的样本比类似簇状的样本经历更大的能量增加，这意味着位于尖锐、孤立盆地中的样本更具遗忘性。我们进一步分析了记忆重放，表明重放对高能量样本尤为有效，从而实现了基于能量的重放样本选择。我们在 MHNs 以及两种扩散模型（Stable Diffusion 和像素空间 DDPM）的持续学习设置下验证了这些预测。在这些扩散模型中，Hopfield 能量能够追踪基于重建的遗忘，而重放实验揭示了依赖于能量的遗忘缓解现象，这与 MHNs 分析一致。

Abstract

Generative models, including diffusion models, are increasingly used as foundation models and adapted through sequential fine-tuning, making continual learning an essential problem setting. However, continual learning in such generative models remains poorly understood: after a task change, what aspects of the learned distribution are most easily lost, and what replay samples should be prioritized? We address these questions through the modern Hopfield energy. Recent links between modern Hopfield networks (MHNs) and diffusion models allow analyses in MHNs to be transferred to diffusion models. We introduce intrinsic forgetting as an increase in Hopfield energy after the task change. In tractable settings in an MHN, we prove that high-energy, outlier-like samples undergo a larger energy increase than cluster-like samples, implying that samples located in sharp, isolated basins are more forgettable. We further analyze memory replay and show that replay is particularly effective for high-energy samples, enabling an energy-based selection of replay samples. We validate these predictions in experiments on MHNs and two diffusion models under continual-learning settings: Stable Diffusion and a pixel-space DDPM. In these diffusion models, Hopfield energy tracks reconstruction-based forgetting, and replay experiments reveal energy-dependent mitigation of forgetting that is consistent with the MHN analysis.
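To make "an increase in Hopfield energy after the task change" concrete, the sketch below uses the widely cited log-sum-exp form of the modern Hopfield energy with constant terms dropped. It is a toy illustration under two assumptions that do not come from the abstract: the paper's exact energy may differ, and the task change is modeled here by simply swapping the stored pattern set rather than by sequential training. All names and numbers are hypothetical.

```python
import math


def hopfield_energy(query, patterns, beta=1.0):
    """E(q) = -(1/beta) * log(sum_i exp(beta * <x_i, q>)) + 0.5 * ||q||^2.

    Additive constants that do not depend on the query are omitted.
    """
    scores = [beta * sum(p * q for p, q in zip(pattern, query))
              for pattern in patterns]
    top = max(scores)  # subtract the max for a numerically stable log-sum-exp
    lse = (top + math.log(sum(math.exp(s - top) for s in scores))) / beta
    return -lse + 0.5 * sum(q * q for q in query)


# Toy "tasks": each is a set of stored patterns (hypothetical values).
task_a_patterns = [[1.0, 0.0], [0.9, 0.1]]
task_b_patterns = [[0.0, 1.0], [-1.0, 0.0]]

sample = [1.0, 0.0]  # a sample that belongs to task A
energy_before = hopfield_energy(sample, task_a_patterns)
energy_after = hopfield_energy(sample, task_b_patterns)
forgetting = energy_after - energy_before  # positive means the sample is "forgotten"
print(forgetting)
```

In this toy setup the sample's energy rises once the memory no longer contains nearby patterns, which is the quantity the abstract calls intrinsic forgetting. Ranking replay candidates by such an energy would be one way to read the "energy-based selection of replay samples" the abstract mentions.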

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	1.0/10	2.0

评分理由: The paper focuses on Continual Learning in Modern Hopfield Networks applied to Diffusion Models, analyzing energy-based forgetting and memory replay. It has low relevance to Unify Models (no architectural unification of modalities), World Models (not environment modeling for RL), MLLM (not language models), MultiModal (not cross-modal focus), and model-based RL (no reinforcement learning involved). The core contribution is theoretical analysis of generative model forgetting. Total weighted score is 20.0, which is below the dynamic passing score of 26.5. No expert authors from the specified list are found.

Keywords

Continual Learning, Modern Hopfield Networks, Diffusion Models, Energy-based Forgetting, Memory Replay, Generative Models, Stable Diffusion

73. Dr-CiK: A Testbed for Foresight-Driven Agents

Status: FAIL

Score: 20.0 / 26.5

Authors: Yihong Tang, Andrew Robert Williams, Arjun Ashok, Vincent Zhihao Zheng, Lijun Sun, Alexandre Drouin, Issam H. Laradji, Étienne Marcotte, Valentina Zantedeschi

Published: 2026-05-27

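TL;DR: The paper introduces the Dr-CiK benchmark and finds that existing agents struggle to retrieve forecasting-relevant context and are easily misled by distractors, motivating research on forecasting-oriented agents.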

Abstract

Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually <5%), are frequently misled by distractors (>80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	4.0/10	8.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	2.0/10	4.0

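Scoring rationale: The paper proposes the Dr-CiK benchmark, which focuses on context retrieval and evidence distillation for time series forecasting. Although "foresight-driven agents" involve predicting the future and are weakly related to World Models (4 points), the paper does not involve the core techniques of model unification (Unify Models, 2 points), multimodal large language models (MLLM, 1 point), multimodal fusion (MultiModal, 1 point), or model-based reinforcement learning (model-based RL, 2 points). The total weighted score is 20.0, below the dynamic passing score of 26.5, indicating that the paper's topic matches the given keywords poorly.

Keywords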

Time series forecasting, Context retrieval, Foresight-driven agents, Benchmark, Document corpus, Evidence distillation, Distractor filtering

74. DebFilter: Eradicating Biases Stashed in Value

Status: FAIL

Score: 20.0 / 26.5

Authors: Seung Hyuk Lee, Songkuk Kim

Published: 2026-05-27

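TL;DR: DebFilter proposes a training-free, inference-time framework that mitigates social biases in text-to-image diffusion models by adjusting the value components of cross-attention.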

Abstract

Text-to-image diffusion models, which are theoretically equivalent to score-based generative models, generate images through a multi-step denoising process guided by text embeddings extracted from pretrained vision-language models such as CLIP. However, these text embeddings inherently encode social and semantic biases -- such as those related to gender and age -- that are subsequently propagated and amplified through the guidance mechanism, along with the model's training on large-scale datasets that are imbalanced with respect to these bias-related concepts, often leading to skewed outputs in text-to-image generation. We propose DebFilter, a lightweight and training-free framework for mitigating such biases in text-to-image diffusion models. Observing that the model's error prediction at each denoising step is primarily influenced by cross-attention dynamics, we introduce a bias-correction strategy that adjusts the value components within cross-attention. Specifically, we apply a fixed offset to the slice of guidance embedding, effectively steering the semantic direction of cross-attention values toward unbiased representations. This adjustment reconfigures the score landscape to produce balanced outputs while maintaining alignment with the intended text semantics. Unlike prior approaches that rely on fine-tuning or retraining, DebFilter operates entirely at inference time, requiring no additional data or model updates. Our results demonstrate that this method effectively mitigates social biases in generated images, offering an efficient and scalable pathway toward fairer and more inclusive text-to-image generation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	8.0/10	16.0
model-based RL	2.0	0.0/10	0.0

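Scoring rationale: The paper mainly addresses bias removal in text-to-image diffusion models and belongs to the generative AI field. Among the keywords, "MultiModal" is highly relevant (text-to-image generation) and "MLLM" is weakly relevant (the method uses CLIP embeddings but does not involve large language models at its core). "Unify Models", "World Models", and "model-based RL" are entirely unrelated to the paper's content (bias correction in diffusion models, with no reinforcement learning or world modeling), so each is scored 0. The author list contains none of the specified experts, so no bonus is added. The total weighted score is 20.0, below the dynamic passing score of 26.5.

Keywords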

Text-to-image, Diffusion Models, Bias Mitigation, Cross-Attention, Inference-time, Social Bias, CLIP Embeddings, Training-free

75. Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension

Status: FAIL

Score: 18.0 / 26.5

Authors: Daisuke Yasui, Toshitaka Matuki, Hiroshi Sato

Published: 2026-05-27

TL;DR: This paper proposes a framework to visualize latent motion phase structures in locomotion policies by extending clustering features to include actions and next states, successfully identifying clearer phase transitions in MuJoCo environments.

Abstract

Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfCheetah, Ant, and Walker2D. However, visualizing the motion structures internally obtained by a trained policy function implemented as a deep neural network remains challenging. It is known from biomechanics and related fields that locomotion control is realized through the repetition of motion phases such as the stance phase and swing phase. In this study, we propose a framework for uncovering latent motion phase structures from trajectories generated by locomotion control policies through interaction with the environment. The proposed method extends the clustering features from state observations alone to augmented features including actions, next states, and next actions, and introduces a method for determining the number of clusters that suppresses self-transitions. Applying the proposed method to three environments -- Ant-v5, HalfCheetah-v5, and Walker2D-v5 -- we successfully identified phase structures with clearer and more regular transition rules than those obtained by the existing method.
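The feature extension described in the abstract can be sketched as follows: each time step's clustering feature is the concatenation of the state, action, next state, and next action. The helper names and the self-transition diagnostic are hypothetical illustrations; the abstract does not specify the authors' clustering algorithm or their exact rule for choosing the number of clusters.

```python
def augment_features(states, actions):
    """Build (s_t, a_t, s_{t+1}, a_{t+1}) feature vectors from one trajectory.

    The final step is dropped because it has no successor.
    """
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    return [
        list(states[t]) + list(actions[t]) + list(states[t + 1]) + list(actions[t + 1])
        for t in range(len(states) - 1)
    ]

def self_transition_rate(labels):
    """Fraction of consecutive steps whose cluster label does not change."""
    if len(labels) < 2:
        return 0.0
    same = sum(1 for a, b in zip(labels, labels[1:]) if a == b)
    return same / (len(labels) - 1)

# Toy trajectory with 2-D states and 1-D actions (made-up numbers).
states = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
actions = [[0.1], [0.2], [0.3]]
features = augment_features(states, actions)
rate = self_transition_rate([0, 0, 1, 1, 1, 2])
```

A diagnostic such as `self_transition_rate` could be compared across candidate cluster counts, which is one plausible reading of "suppresses self-transitions", not a confirmed description of the paper's procedure.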

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	3.0/10	6.0

Scoring rationale: The paper proposes a framework to visualize latent motion phase structures in DRL locomotion policies by extending clustering features. It is unrelated to MLLM (0) and MultiModal (1) as it involves standard RL state-action tuples without cross-modal fusion. It weakly relates to World Models (3) and model-based RL (3) through latent structure analysis but lacks generative environment modeling or unification of models. Unify Models (2) is irrelevant as no model unification is discussed. No expert authors from the specified list were found. Total weighted score is 18.0, below the dynamic passing score of 26.5.

Keywords

Deep Reinforcement Learning, Locomotion Control, Latent Phase Structures, Policy Visualization, Clustering Features, MuJoCo Benchmarks, State-Action-Next State, Motion Phases

76. Do Clinical Models Change Treatment Decisions?

Status: FAIL

Score: 18.0 / 26.5

Authors: Dongkyu Cho, Miao Zhang, Rumi Chunara

Published: 2026-05-27

TL;DR: This paper introduces the ClinPivot benchmark to evaluate whether clinical foundation models adapt treatment decisions based on changing patient contexts, finding that strong QA performance does not reliably predict correct decision pivoting.

Abstract

Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context changes. We introduce ClinPivot, an auditable treatment-decision benchmark built from biomedical relations and pivoted patient contexts. ClinPivot asks whether models change treatment choices when new clinical constraints shift the action space. We find that strong medical QA performance does not reliably predict decision-making performance: frontier models and task-adapted Qwen variants often fail to change decisions correctly, and model rankings shift across evaluation regimes. Decision-structured supervision improves pivot-sensitive decision-making and medical QA under matched knowledge budgets, while lightweight replay reduces losses in general assistant ability.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper focuses on clinical foundation models and a decision benchmark (ClinPivot), evaluating consistency in treatment choices across patient contexts. It shows low relevance to Unify Models (only unified evaluation, not architecture), World Models (no latent environment modeling), MLLM (foundation models mentioned but not explicitly multimodal), MultiModal (no multimodal data discussed), and model-based RL (uses RL terminology like 'action space' but focuses on supervised decision pivoting rather than RL planning). None of the listed expert authors are present in the author list.

Keywords

Clinical foundation models, Treatment decisions, ClinPivot benchmark, Patient context, Decision-making performance, Supervision methods, Medical QA

77. StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

Status: FAIL

Score: 18.0 / 26.5

Authors: Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

Published: 2026-05-27

TL;DR: StoryLens introduces a context-aware story rewriting system using preference-aligned reinforcement learning to improve reader satisfaction, though it focuses on text generation rather than multimodal world modeling.

Abstract

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper focuses on text-based story rewriting using GRPO-based reinforcement learning and reward modeling. It does not address Unify Models architectures, World Models (environment simulation), MLLM, or MultiModal inputs/outputs. While it utilizes RL, it is reward-model based rather than environment model-based, resulting in low relevance to the specified keyword set targeting multimodal world modeling and unified systems.

Keywords

Story Rewriting, Preference Alignment, Context-Aware Narrative, Reinforcement Learning, Reward Model, Reader Satisfaction, Text Generation, Benchmark

78. Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

Status: FAIL

Score: 18.0 / 26.5

Authors: Max Lamparth, Daniel Fein, Andreas Haupt, Marcel Hussing, Mykel J. Kochenderfer

Published: 2026-05-27

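TL;DR: The paper shows that single-axis bias mitigation in RLHF shifts optimization pressure onto correlated proxies (reward bias substitution), and proposes certifying mitigation success by augmenting evaluation with policy-induced distributions.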

Abstract

Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. The failure is enabled by a measurement-versus-optimization gap between audit and policy-induced distributions during mitigation evaluation and policy training. We formalize mitigation outcomes into a regime taxonomy and prove that successful mitigation, bias substitution, and overcorrection produce identical observables under any audit-distribution scoring, including ranking accuracy and win-rate, even when granted oracle access to the true reward. Across published preference-learning mitigation work, no method we survey reports the evidence needed to certify successful mitigation. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap, and we translate this into actionable prescriptions for mitigation methods and benchmarks. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the policy into overconfidence while factual free-form accuracy falls. We also show a published length-debiasing operator that zeroes reward-length correlation on the audit distribution but reintroduces bias under best-of-N selection on three of four SOTA reward models, and a length-sycophancy coupling whose direction reverses under human-LLM judge disagreement.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	3.0/10	6.0

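Scoring rationale: The paper focuses on reward bias substitution in RLHF, analyzing how single-axis bias mitigation redirects optimization pressure. It mainly concerns reinforcement learning from feedback for language models and does not cover unified model architectures, world model construction, multimodal processing, or MLLMs, so its relevance to "Unify Models", "World Models", "MLLM", and "MultiModal" is very low. It involves reinforcement learning and reward modeling but is not model-based RL in the conventional sense, so that relevance is slightly higher but still low. The total weighted score is 18.0, below the dynamic passing score of 26.5.

Keywords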

Reward Bias Substitution, Single-Axis Bias Mitigations, Optimization Pressure, Policy-Induced Distributions, Audit Distributions, RLHF, Language Model, Confidence Calibration

79. No Safe Dose: How Training Data Drives Unsafe Image Generation

Status: FAIL

Score: 18.0 / 26.5

Authors: Felix Friedrich, Lukas Helff, Niharika Hegde, Patrick Schramowski, Kristian Kersting

Published: 2026-05-27

TL;DR: The study demonstrates that the proportion of unsafe images in training data monotonically increases output unsafety in text-to-image models without quality loss, emphasizing the need for data curation and safe text encoders.

Abstract

Text-to-image models trained on large-scale data often inevitably ingest unsafe content. While some people observe input-output amplifications, it remains unclear whether and how training data composition directly drives model output safety or by other factors. We shed light on this question by isolating this variable: we train the same text-to-image model on datasets that differ only in their fraction of unsafe images (0% to 9.6%), across several dataset scales (100K to 8M). Then we generate images with the resulting models, and evaluate them with four independent safety classifiers. Output unsafety rises monotonically from 16.6% at 0% contamination to 25.5% at 5%. A factorial design reveals that the proportion, not the absolute count, of unsafe training images is the operative variable. The 16.6% irreducible baseline at zero contamination implicates the other components, e.g. frozen text encoder, as a residual safety risk -- confirmed by a text encoder ablation showing that SafeCLIP reduces this floor to 9.6%, while the dose-response effect persists across all three encoders tested. Critically, no quality degradation in terms of FID, CLIPscore and ImageReward accompanies safety filtering. These results establish that data curation and text encoder safety are complementary and independently effective interventions. At the same time, the remaining level of unsafety poses questions for future research about emerging capabilities and compositionality.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	5.0/10	10.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on text-to-image model safety regarding training data contamination. 'MultiModal' is moderately relevant (5.0) as the task involves text and image generation. 'Unify Models' (2.0) and 'MLLM' (2.0) have low relevance because the study uses a single model architecture and diffusion models rather than model unification or MLLM. 'World Models' and 'model-based RL' are irrelevant (0.0) as the work does not involve world modeling or reinforcement learning. The weighted sum is 18.0, below the dynamic pass score of 26.5. No expert authors from the specified list are present.

Keywords

Text-to-image models, Training data safety, Unsafe content, Data curation, Output unsafety, SafeCLIP, Dose-response effect, Model safety

80. Is Backpropagation Optimal? When Synthetic Gradients Improve Sample Efficiency

Status: FAIL

Score: 18.0 / 26.5

Authors: Yibo Jacky Zhang, Zeyu Tang, Sanmi Koyejo

Published: 2026-05-27

TL;DR: This paper theoretically investigates synthetic gradients as an alternative to backpropagation to improve sample efficiency in unified loss-based and reward-based learning frameworks.

Abstract

Backpropagation is the default learning rule for artificial neural networks and is often treated as the settled approach whenever differentiability is available. In this work, we revisit this convention through a theoretical lens of sample efficiency. We introduce a unified vectorized feedback framework for loss-based and reward-based learning on computational graphs, in which synthetic gradients emerge as a natural alternative to backpropagation. We characterize the conditions under which synthetic gradients can achieve a lower gradient-estimation mean squared error than backpropagation. We construct examples illustrating that this sample efficiency advantage can be arbitrarily large. Experiments on contextual bandits and reinforcement learning tasks demonstrate the potential of our theoretical findings.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	4.0/10	8.0

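Scoring rationale: The paper focuses on the sample efficiency of backpropagation versus synthetic gradients. The "unified framework" mentioned in the abstract is partially related to "Unify Models", and the reinforcement learning tasks are partially related to "model-based RL". The paper does not involve multimodality, MLLMs, or world models, so those relevance scores are 0. The author list contains none of the specified experts.

Keywords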

Backpropagation, Synthetic Gradients, Sample Efficiency, Unified Feedback Framework, Reinforcement Learning, Reward-based Learning, Computational Graphs

81. FedEHR-Gen: Federated Synthetic Time-Series EHR Generation via Latent Space Alignment and Distribution-Aware Aggregation

Status: FAIL

Score: 18.0 / 26.5

Authors: Jun Bai, Ziyang Song, Yue Li

Published: 2026-05-27

TL;DR: FedEHR-Gen proposes a federated framework for synthetic time-series EHR generation via latent space alignment, achieving centralized-level performance while preserving hospital data privacy.

Abstract

Synthetic Electronic Health Record (EHR) generation provides a promising avenue for data augmentation and cross-hospital modeling in privacy-constrained healthcare settings. However, most existing EHR generative models are centralized and require pooling data across hospitals, which is often infeasible when real-world data sharing is restricted. While federated EHR generation offers a natural solution, direct federated modeling often collapses or diverges due to the high dimensionality, sparsity, and cross-hospital heterogeneity of EHR data. In this work, we propose FedEHR-Gen, the first federated framework for synthetic time-series EHR generation across distributed hospitals. FedEHR-Gen uses a two-stage learning paradigm. First, we introduce a federated autoencoder that projects high-dimensional and sparse EHR features onto a compact latent space. To ensure semantic consistency across hospitals, we develop a layer-wise matching aggregation mechanism that aligns local encoders into a unified global latent space. Second, operating on this aligned latent space, we train a federated temporal conditional variational autoencoder (TCVAE) with distribution-aware aggregation, enabling stable temporal generative modeling under severe cross-hospital heterogeneity. Extensive experiments on the eICU and MIMIC-III datasets demonstrate that FedEHR-Gen achieves generation fidelity, downstream utility, and privacy risk comparable to centralized training, while consistently outperforming the standard federated baseline.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper focuses on federated learning for synthetic EHR generation. It aligns local encoders into a unified global latent space, showing moderate relevance to 'Unify Models'. However, it does not involve multimodal large language models (MLLM, MultiModal), world models for environment dynamics (World Models), or reinforcement learning techniques (model-based RL), resulting in low scores for those keywords.

Keywords

Federated Learning, Synthetic EHR Generation, Time-Series Data, Latent Space Alignment, Distribution-Aware Aggregation, Privacy-Preserving, Cross-Hospital Heterogeneity

82. Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment

Status: FAIL

Score: 18.0 / 26.5

Authors: Long Li, Yuning Huang, Heather A. Eicher-Miller, J. Graham Thomas, Fengqing Zhu, Edward Sazonov

Published: 2026-05-27

TL;DR: This study investigates the feasibility of predicting dietary behavior change receptivity from egocentric eating images using a CLIP-based transfer learning framework, yielding promising preliminary results for just-in-time interventions.

Abstract

Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data included egocentric image sequences captured during eating and paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether visual information contained any relevancy to these responses, we evaluated a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	5.0/10	10.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on health behavior analysis using egocentric images and CLIP-based transfer learning, showing low relevance to World Models and Model-Based RL as these concepts are not addressed. MultiModal is moderately relevant due to the use of CLIP (vision-language) and image-text data pairing. Unify Models and MLLM have slight relevance because CLIP is a unified multimodal model, but the paper does not focus on model unification architectures or Large Language Models specifically. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Egocentric Images, Dietary Behavior, Behavior Change Receptivity, CLIP, Transfer Learning, Just-in-time Adaptive Interventions, Wearable Camera

83. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Status: FAIL

Score: 16.0 / 26.5

Authors: Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Published: 2026-05-27

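TL;DR: The paper studies how compressed chain-of-thought data affects LLM post-training, finding that coarser compression requires more data and that reinforcement learning decomposes the compressed steps learned during SFT.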

Abstract

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conducted a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	3.0/10	6.0

评分理由: 论文主要探讨 LLM 推理压缩数据在后训练中的作用，涉及 SFT 与 RL。与 MultiModal 和 MLLM 相关性极低，因文本未涉及多模态内容；World Models 未提及；Unify Models 仅弱相关于推理策略统一；model-based RL 部分相关，因涉及 RL 但方法为 RLVR 而非传统模型学习。

关键词

Chain-of-Thought, Compressed Reasoning Data, LLM Post-Training, Supervised Fine-Tuning, Reinforcement Learning, Verifiable Rewards, Reasoning Taxonomy, Token Cost

84. Out of Sight, Not Out of Mind: Unveiling Latent Attack in Latent-based Multi-Agent Systems [FAIL]

Score: 16.0 / 26.5

Authors: Chenxi Wang, Ruiyang Huang, Jiayan Sun, Lei Wei, Yifan Wu

Published: 2026-05-27

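TL;DR: The study shows that the latent space of multi-agent systems carries security risk: latent attacks that intervene on hidden representations can degrade task performance during clean executions, so latent collaboration does not remove attack risk.

Abstract (Chinese translation)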

基于潜在表示的多智能体系统用潜在表示替代了部分显式的智能体间通信，为高效且灵活的智能体协作提供了新方向。然而，将协调机制移至潜在空间也可能使攻击超出可见文本检查的范围。本文研究了潜在状态是否能在干净执行期间携带有效的攻击关联信息。为探究这一问题，我们提出了一种潜在攻击框架，该框架通过潜在干预重新激活攻击诱导效应，而无需重复使用对抗文本。大量实验表明，由此产生的仅潜在攻击能在干净执行中显著降低任务性能，尤其是当应用于智能体间 KV-cache (KV 缓存) 交接而非局部隐藏状态时。进一步的控制分析表明，这种性能下降不能归因于任意扰动或无效生成。总体而言，我们的发现表明基于潜在的合作并未消除攻击风险。它将部分风险转移到了较不可观测的执行状态中，需要采取超越可见文本检查的防护措施。

Abstract

Latent-based multi-agent systems replace parts of explicit inter-agent communication with hidden representations, offering a new direction for efficient and flexible agent collaboration. However, moving coordination into latent space may also move attacks beyond the reach of visible-text inspection. In this paper, we study whether latent states can carry attack-associated information that remains effective during clean executions. To examine this question, we introduce a latent attack framework that reactivates attack-induced effects through latent interventions without reusing adversarial text. Extensive experiments show that the resulting latent-only attacks can substantially degrade task performance in clean executions, especially when applied to inter-agent KV-cache handoffs rather than local hidden states. Further control analyses indicate that this degradation cannot be reduced to arbitrary perturbations or invalid generation. Overall, our findings suggest that latent-based collaboration does not remove attack risk. It shifts part of the risk into less observable execution states, calling for safeguards beyond visible-text inspection.

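Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper studies latent attacks and security in multi-agent systems whose agents collaborate through hidden representations. Although it touches on LLM techniques such as the KV-cache, it does not explicitly address multimodal unification (Unify Models, MultiModal), world-model learning (World Models), or model-based reinforcement learning (model-based RL). Apart from MLLM, which is moderately related because of the LLM component, the keywords have low relevance.

Keywords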

Latent Attack, Multi-Agent Systems, Hidden Representations, KV-cache Handoffs, Security Risk, Latent Interventions, Agent Collaboration

85. ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains [FAIL]

Score: 16.0 / 26.5

Authors: Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi, Jingzhou He, Xin Xin, Zhaochun Ren, Xiao-Ming Wu

Published: 2026-05-27

TL;DR: ROSD uses reflection-guided, error-localized self-distillation to improve language models' in-domain reasoning performance and out-of-domain generalization.

Abstract (Chinese translation)

同策略自蒸馏（OPSD）通过为同策略轨迹提供密集的 token 级监督，提升了大语言模型（LLMs）的推理性能。然而，现有的 OPSD 方法在领域内推理上的提升往往有限，且在领域外问题上的泛化效果不佳。我们识别出两个关键原因：让自教师以已验证解为条件会鼓励模仿训练域参考轨迹，而非进行针对错误的修正；此外，对整个响应应用蒸馏可能会覆盖有效的推理前缀，并加剧过拟合。我们提出反思性同策略自蒸馏（ROSD），这是一种通过基于反思引导且错误定位的蒸馏，将参考解模仿转化为目标推理修正的框架。对于每个轨迹，ROSD 使用自反思器提取一个修正思路，并定位首个错误片段。修正思路引导自教师进行目标监督，而定位到的错误片段则将蒸馏限制在需要修正的区域。该设计在保留有效推理前缀的同时，修正了有缺陷的推理。在多个领域内和领域外推理基准上的实验表明，ROSD 在整体上展现出更强的领域内推理性能，且相比标准 OPSD 具有显著更优的领域外泛化能力。代码可在 https://github.com/ZiqiZhao1/ROSD 获取。

Abstract

On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.
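
The error-localized part of this idea can be illustrated with a small sketch. It assumes that per-token distillation losses have already been computed (for example, a divergence between self-teacher and student at each position) and that the self-reflector has returned the first erroneous span as a half-open token range. The function names and the span format are hypothetical and are not taken from the ROSD code.

```python
from typing import List, Tuple


def localized_distill_loss(token_losses: List[float],
                           error_span: Tuple[int, int]) -> float:
    """Average the per-token loss over the erroneous span only.

    token_losses: hypothetical per-token distillation losses for one rollout.
    error_span:   half-open (start, end) token range flagged as erroneous.
    Tokens outside the span (e.g. the valid prefix) contribute nothing.
    """
    start, end = error_span
    if not 0 <= start < end <= len(token_losses):
        raise ValueError("error_span must be a non-empty range inside the rollout")
    span = token_losses[start:end]
    return sum(span) / len(span)


def full_response_loss(token_losses: List[float]) -> float:
    """Baseline for comparison: distill over every token of the response."""
    return sum(token_losses) / len(token_losses)


# Toy rollout: the first four tokens are a valid prefix, the error starts at index 4.
losses = [0.5, 0.25, 0.25, 0.5, 2.0, 4.0]
localized = localized_distill_loss(losses, (4, 6))  # uses only tokens 4 and 5
full = full_response_loss(losses)                   # uses all six tokens
print(localized, full)
```

In this toy case the localized loss leaves the prefix tokens untouched, which is the property the abstract attributes to restricting distillation to the error span.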

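Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	3.0/10	6.0

Scoring rationale: The paper's core is self-distillation for language model reasoning (ROSD). Although it uses reinforcement learning terminology such as on-policy rollouts, it does not involve world-model construction, multimodal processing, or unified model architectures, so relevance is low. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords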

Language Model Reasoning, On-policy Self-Distillation, Reflective Self-Distillation, Error Localization, Out-of-domain Generalization, Token-level Supervision, Reasoning Correction

86. Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry [FAIL]

Score: 16.0 / 26.5

Authors: Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang, Nak Young Chong

Published: 2026-05-27

TL;DR: Con-DSO addresses RGB-D VO degradation in challenging environments by learning consistency uncertainty priors to attenuate unreliable observations, achieving significant trajectory error reductions on public benchmarks.

Abstract (Chinese translation)

视觉里程计（VO）是机器人技术和增强现实领域中的基础组件。RGB-D 直接视觉里程计受益于度量深度测量，但在具有挑战性的环境中性能可能下降，此时动态物体、遮挡、光照变化及不可靠的深度会破坏直接对齐方法所依赖的短时域光度学与深度几何一致性假设。现有方法通过语义滤波、显式遮挡推理、光照适应或手工设计的几何准则来缓解这些问题，但往往依赖于外部模块或针对特定故障模式定制的固定假设，这限制了其灵活性以及以统一方式处理多样化挑战的能力。本文提出 Con-DSO，一种一致性感知的 RGB-D 直接稀疏里程计框架，该框架能够从时间相邻的 RGB-D 帧对中预测密集的光度学与深度几何一致性不确定性。一致性网络利用光流引导的光度学误差和投影深度一致性误差进行训练，使得一致性违反能够被表示为像素级不确定性。这些成对的不确定性预测被转换为用于基于关键帧跟踪的宿主帧侧质量先验。随后，该先验通过质量感知的支持像素选择以及位姿估计过程中的解耦光度学 - 几何加权应用于视觉里程计，从而实现了对不可靠观测的连续衰减，而非硬拒绝或基于阈值的门控。在五个公共 RGB-D 基准测试上的实验表明，该方法相比直接 RGB-D 视觉里程计基线取得了显著提升，在 ICL-NUIM 数据集上绝对轨迹误差降低超过 20%，在 RGB-D Scenes V2、TUM/Bonn Dynamic 和 OpenLORIS 序列上降低幅度达 50%–80%。

Abstract

Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50%–80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	5.0/10	10.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on RGB-D Visual Odometry and consistency priors. It unifies photometric and depth consistency handling (Unify Models, 2.0) and utilizes RGB and Depth data streams (MultiModal, 5.0). It does not involve Large Language Models (MLLM, 0.0), generative world modeling (World Models, 1.0), or reinforcement learning (model-based RL, 0.0). No expert authors from the specified list are present in the author list.

Keywords

RGB-D Direct Sparse Odometry, Consistency Priors, Photometric Consistency, Depth-Geometric Consistency, Uncertainty Prediction, Keyframe Tracking, Pose Estimation

87. SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation [FAIL]

Score: 16.0 / 26.5

Authors: Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi, Ying Huang

Published: 2026-05-27

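TL;DR: SIGMA is a lightweight parameter-efficient fine-tuning method for vision foundation models that bridges structural and distributional gaps through scale-adaptive fusion and semantic modulation, achieving superior performance on dense prediction tasks.

Abstract (Chinese translation)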

视觉基础模型 (VFMs) 展现了令人印象深刻的表征能力。然而，通过全微调将其适配至下游任务会带来难以承受的计算和存储开销。参数高效微调 (PEFT) 已作为一种极具吸引力的替代方案出现，旨在以极小的训练成本实现与全微调的性能相当。然而，由于存在结构性差距和分布性差距，将 PEFT 应用于 VFMs 进行密集预测任务仍然具有挑战性。为了弥合这些差距，我们提出尺度集成全局调制适配器 (SIGMA)，这是一种新颖的轻量级 PEFT 方法，包含两个模块：尺度自适应融合和语义调制。具体而言，尺度自适应融合模块用于通过增强多粒度视觉信息的提取来弥合结构差距。此外，SIGMA 在融合特征上引入语义调制以执行全局特征对齐，进一步消除分布差距。该设计实现了统一的空间与分布适应性，相对于 VFM 骨干网络仅需 1.72% 的可训练参数。在各种下游密集任务和多个 VFM 骨干网络上的综合实验表明，SIGMA 相对于最先进的 PEFT 方法实现了一致且优越的性能。

Abstract

Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fusion features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

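Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper proposes SIGMA, a parameter-efficient fine-tuning (PEFT) method for vision foundation models (VFMs) that targets the structural and distributional gaps in dense prediction tasks. Among the keywords, 'Unify Models' has some relevance (a unified adaptation strategy), and 'MLLM' and 'MultiModal' are partially related (vision foundation models belong to a neighboring area), while 'World Models' and 'model-based RL' are entirely unrelated (no environment modeling or reinforcement learning). The weighted total (16.0) therefore falls below the dynamic pass line (26.5).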
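
The arithmetic behind the table can be checked with a short sketch. It assumes, based on the rows above, that each keyword score is weight × relevance and that the total is their sum, compared against the pass line of 26.5. The function name `weighted_total` and the "greater than or equal" comparison are illustrative assumptions; the report generator's actual rule (including any expert-author bonus) is not shown in this document.

```python
from typing import Dict, Tuple

# (weight, relevance out of 10) per keyword, copied from the SIGMA table above.
SIGMA_ROWS: Dict[str, Tuple[float, float]] = {
    "Unify Models": (2.0, 3.0),
    "World Models": (2.0, 0.0),
    "MLLM": (2.0, 3.0),
    "MultiModal": (2.0, 2.0),
    "model-based RL": (2.0, 0.0),
}
PASS_LINE = 26.5  # dynamic pass line quoted in the report


def weighted_total(rows: Dict[str, Tuple[float, float]]) -> float:
    """Sum weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * relevance for weight, relevance in rows.values())


total = weighted_total(SIGMA_ROWS)
# Assumption: a paper passes when its total reaches the pass line.
verdict = "PASS" if total >= PASS_LINE else "FAIL"
print(f"{total:.1f} / {PASS_LINE} -> {verdict}")
```

The result matches the 16.0 / 26.5 score and the FAIL status shown for this entry.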

Keywords

Vision Foundation Models, Parameter-Efficient Fine-Tuning, Dense Prediction, Scale-Adaptive Fusion, Semantic Modulation, Structural Gaps, Distributional Gaps

88. SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control [FAIL]

Score: 16.0 / 26.5

Authors: Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

Published: 2026-05-27

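TL;DR: SmartDirector strengthens the narrative capacity of video generation through multi-keyframe conditioning and outperforms existing methods on cinematic video generation.

Abstract (Chinese translation)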

视频的叙事质量从根本上决定了其感知价值。尽管现有的视频生成方法能够产生视觉上吸引人的内容，但它们主要依赖于稀疏的条件信号，例如文本提示或首帧/末帧，这限制了对叙事结构和时序节奏的精确控制。本文提出 SmartDirector，一种通过多个关键帧（keyframes）增强视频生成模型叙事能力的框架。SmartDirector 支持灵活的生成场景，包括 single-shot 生成、multi-shot 叙事合成以及视频扩展。该框架分为两个阶段：Director-Gen 基于提供的关键帧生成低分辨率视频，Director-SR 利用高分辨率关键帧作为语义锚点来恢复细粒度细节，从而优化输出。为了实现鲁棒的多关键帧训练，我们构建了一个数据管道，从电影中整理出 single-shot 和 multi-shot 序列。大量实验表明，SmartDirector 显著优于现有的最先进方法。我们将发布代码以促进进一步研究。

Abstract

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

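Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	3.0/10	6.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper studies keyframe-conditioned video generation and narrative control, which belongs to generative computer vision. 'model-based RL' is entirely unrelated, since the paper involves no reinforcement learning or model-based control; 'MLLM' has very low relevance, since large language models are not mentioned; 'Unify Models' and 'World Models' are only faintly related at the level of general generative frameworks, with no architectural unification or learning of world-model dynamics; 'MultiModal' has moderate relevance: video generation involves spatiotemporal modalities, but the paper focuses on going from visual keyframes to video and shows no typical multimodal (e.g., image-text) interaction.

Keywords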

Keyframe-Conditioned, Cinematic Video Generation, Narrative Pacing Control, Two-stage Framework, Multi-shot Narrative Synthesis, Video Extension, Director-Gen, Director-SR

89. Multi-Agent LLM-based Metamorphic Testing for REST APIs [FAIL]

Score: 14.0 / 26.5

Authors: Shehroz Khan, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Dragos Truscan

Published: 2026-05-27

TL;DR: The paper presents ARMeta, a multi-agent LLM-based metamorphic testing tool that addresses the test oracle problem in REST API testing and helps improve software quality.

Abstract (Chinese translation)

随着 REST APIs 成为软件系统中日益重要的组成部分，其验证变得愈发关键。因此，测试并发现潜在问题对于提高软件质量至关重要。然而，测试 REST APIs 具有挑战性，主要是因为难以评估 API 调用的输出是否正确，即测试预言机（test oracle）问题。蜕变测试（Metamorphic testing）是一种基于规格的测试方法，适用于正确输出未知或未明确指定的情况。为了检查系统的正确性，会指定不同输出之间的关系。我们提出了 ARMeta，一种有工具支持的方法，它使用基于 LLM 的多智能体工作流（agentic workflow）来支持使用 OpenAPI 文档化的 REST APIs 的蜕变测试。该智能体工作流用于识别蜕变测试场景，并以 Given-When-Then 格式指定它们。这些场景被自动实现为可执行测试，并在被测系统（system under test）上执行。我们在两个公开可用的暴露 REST 接口的 Web 应用程序上评估了 ARMeta，并将其性能与基于场景的测试基线进行了比较。结果表明，ARMeta 探索的行为可作为现有基于场景的测试方法的补充。

Abstract

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.

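Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper addresses REST API testing in software engineering, using multi-agent LLMs for metamorphic testing. The given keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all belong to foundation-model architecture, world modeling, and reinforcement learning. Although the paper uses LLMs, it does not involve model unification, world-model construction, multimodal representation, or reinforcement learning algorithms, so its relevance to the core keywords is low.

Keywords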

Metamorphic Testing, REST APIs, LLM-based, Multi-Agent, Software Quality, OpenAPI, Test Oracle, Executable Tests

90. Pruning and Distilling Mixture-of-Experts into Dense Language Models [FAIL]

Score: 14.0 / 26.5

Authors: Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

Published: 2026-05-27

TL;DR: This paper proposes a systematic framework to convert Mixture-of-Experts language models into dense architectures through expert scoring and knowledge distillation, achieving superior accuracy and efficiency compared to direct pruning methods.

Abstract (Chinese translation)

混合专家模型 (MoE) 现已成为前沿语言模型的主流架构，然而它要求所有专家参数均需加载至内存中，这使得其在内存受限的部署场景中不太理想。现有的压缩方法虽然减少了专家数量，但输出结果仍为 MoE 模型，保留了相同的根本局限。我们提出了首个将训练好的 MoE 转换为标准全稠密架构的系统性框架：专家经过评分、选择和分组，然后拼接成稠密前馈网络 (FFN)，并通过来自 MoE 教师模型的知识蒸馏进行精炼。我们在 Qwen3-30B-A3B 模型上，针对多种选定的专家数量，评估了 7 种评分方法、5 种分组方法和 2 种幅度缩放方法，共生成了 350 种配置。我们发现，评分方法的选择最具影响力，我们提出的新颖的多样性感知评分在 Qwen3-30B-A3B、DeepSeek-V2-Lite 和 GPT-OSS-20B 上一致地优于先前方法。在参数量匹配的受控比较下，经过约 40 亿 token 的蒸馏后，MoE 转稠密方法在平均下游任务准确率上比稠密到稠密剪枝方法高出 +6.3 个百分点，且训练速度提升至原来的 1.6 倍。

Abstract

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
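
The concatenation step has a simple algebraic reading, sketched below for a plain ReLU feed-forward expert: stacking the selected experts' up-projections and joining their down-projections side by side gives one dense FFN whose output equals the unweighted sum of those experts' outputs. The tiny matrices and helper names are made up for illustration. The paper's scoring, grouping, magnitude scaling, and distillation steps are not reproduced here, and real experts typically use gated activations rather than plain ReLU.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def ffn(w_up: Matrix, w_down: Matrix, x: Vector) -> Vector:
    """Plain FFN: w_down @ relu(w_up @ x)."""
    return matvec(w_down, relu(matvec(w_up, x)))


def concat_experts(ups: List[Matrix], downs: List[Matrix]) -> Tuple[Matrix, Matrix]:
    """Merge experts into one dense FFN with a wider hidden layer.

    Up-projections are stacked along the hidden dimension (more rows);
    down-projections are joined along the hidden dimension (more columns).
    """
    dense_up = [row for up in ups for row in up]
    dense_down = [sum((down[i] for down in downs), []) for i in range(len(downs[0]))]
    return dense_up, dense_down


# Two toy experts with d_model = 2 and d_ff = 2 (illustrative values only).
up_a, down_a = [[1.0, 0.0], [0.0, -1.0]], [[1.0, 2.0], [0.0, 1.0]]
up_b, down_b = [[1.0, 1.0], [-1.0, 2.0]], [[0.5, 0.0], [1.0, 1.0]]
x = [2.0, 1.0]

out_a, out_b = ffn(up_a, down_a, x), ffn(up_b, down_b, x)
summed = [a + b for a, b in zip(out_a, out_b)]

dense_up, dense_down = concat_experts([up_a, up_b], [down_a, down_b])
dense_out = ffn(dense_up, dense_down, x)
print(summed, dense_out)
```

Because the dense model drops the router, any per-token gating weights are lost in this merge, which is one plausible reason a distillation stage from the MoE teacher is needed afterwards.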

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on compressing Mixture-of-Experts (MoE) language models into dense architectures via scoring and distillation. It shows low relevance to World Models and Model-Based RL as it involves no environment simulation or reinforcement learning. Relevance to MLLM and MultiModal is limited as the method targets general language model deployment rather than multimodal integration. Unify Models has partial relevance regarding architectural unification (MoE to Dense) but not modality unification.

Keywords

Mixture-of-Experts, Knowledge Distillation, Model Compression, Dense Language Models, Parameter Efficiency, MoE Pruning, Architecture Conversion

91. SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents [FAIL]

Score: 14.0 / 26.5

Authors: Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang

Published: 2026-05-27

TL;DR: The paper proposes SNARE, a framework that uses adaptive scenario synthesis to detect overeager behavior in coding agents, and finds that the agent framework, rather than the base model, is the main driver of variation in safety violations.

Abstract (Chinese translation)

编码代理将良性任务执行为一串 Shell、文件和网络操作的序列，其中任何操作都可能悄然超出授权范围，同时任务仍能顺利完成。我们将此称为过度积极行为（overeager behavior）：提示词并非对抗性的，运行也成功了，但一个超范围步骤仍可能泄露凭证或删除文件。现有基准测试未能捕捉到这一点：任务完成套件认可任何完成的运行，越狱套件探测对抗性提示词，而先前唯一的过度积极基准测试对所有代理 - 模型对应用单一固定提示词集，导致其中最容易和最难触发的配对测量不足。我们提出 SNARE（Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation），这是一个构建良性场景的流程，它从可重用的范围和陷阱片段组合而成，使用无需裁判的判定器对每次运行进行评分，该判定器会标记陷阱模式匹配以及未经请求的文件添加或删除，并利用汤普森采样（Thompson sampling）将每个代理 - 模型对的运行预算引导至最常触发该行为的场景。基于 24 种过度积极原型实例化该流程，得到 OverEager，我们在包含四个编码代理和五个基础模型的 4×5 矩阵上进行了运行。在 10,000 次良性运行中，19.51% 触发了过度积极行为，各代理 - 模型对的触发率跨度达 11.9 倍。这种差异主要由代理框架驱动，而非模型本身：框架解释了 56% 的差异，而模型仅解释了 21%，因此任何单一框架或单一模型的评估都会比整个矩阵的结果低估约五分之一。

Abstract

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair's run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model's 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
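
The budget-steering step can be pictured with a standard Beta-Bernoulli Thompson sampler. This is a generic sketch rather than SNARE's implementation: the scenario trigger rates, the uniform Beta(1, 1) prior, and the function name `allocate_runs` are all assumptions made for the example, and each simulated "run" is just a biased coin flip.

```python
import random
from typing import List


def allocate_runs(trigger_rates: List[float], budget: int, seed: int = 0) -> List[int]:
    """Spend `budget` runs across scenarios with Beta-Bernoulli Thompson sampling.

    trigger_rates: hypothetical true probability that each scenario elicits
                   the behavior (unknown to the sampler, used only to simulate runs).
    Returns how many runs each scenario received.
    """
    rng = random.Random(seed)
    n = len(trigger_rates)
    hits = [0] * n    # runs where the behavior was triggered
    misses = [0] * n  # runs where it was not
    for _ in range(budget):
        # Draw one plausible trigger rate per scenario from its Beta posterior.
        draws = [rng.betavariate(1 + hits[i], 1 + misses[i]) for i in range(n)]
        chosen = max(range(n), key=draws.__getitem__)
        # Simulate the run and update the chosen scenario's posterior.
        if rng.random() < trigger_rates[chosen]:
            hits[chosen] += 1
        else:
            misses[chosen] += 1
    return [hits[i] + misses[i] for i in range(n)]


counts = allocate_runs([0.05, 0.20, 0.60], budget=500)
print(counts)  # most of the budget should go to the third scenario
```

Early draws are spread across scenarios because every posterior is wide; as evidence accumulates, the scenario that triggers most often wins most draws and absorbs most of the remaining budget.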

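Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper focuses on safety evaluation of coding agents (detecting overeager behavior), which has little connection to the model unification, world models, multimodal architectures, and model-based RL algorithms that the keywords cover. Although it builds on large models and uses Thompson sampling, it makes no core technical contribution to MLLM, MultiModal, or model-based RL. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). The weighted total of 14.0 is below the dynamic pass score of 26.5.

Keywords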

Coding Agents, Overeager Behavior, Scenario Synthesis, Security Evaluation, Thompson Sampling, Agent Framework, Base Models

92. Human-like in-group bias in instruction-tuned language model agents [FAIL]

Score: 14.0 / 26.5

Authors: Messi H. J. Lee

Published: 2026-05-27

TL;DR: Instruction-tuned language model agents exhibit significant in-group trust bias and network assortativity when group labels are visible, leading to compounded structural inequality over time despite no change in action types.

Abstract (Chinese translation)

随着自主 AI 智能体被部署在持续交互的网络中——协调任务、资源路由并积累声誉历史——所涌现的社会动力学将决定谁能获得机会，谁不能，这种规模超出了任何人类机构所能监督的范围。我们进行了一项受控多智能体模拟，其中指令微调的语言模型智能体在三种条件下交互了 500 回合，这三种条件操纵了群体标签显著性和资源稀缺性，涉及六个模型家族，每个家族包含 20 个随机种子。当群体标签可见时，我们观察到了内群体信任偏差、行动同质性 (homophily) 以及网络 assortativity ——当标签隐藏时这些现象均不存在——这种模式在结构上与人类社会心理学中的显著性依赖一致。这种歧视对标准行动日志审计不可见：偏差完全通过谁接收了每个行动来体现，而非选择了何种行动，行动类型分布显示在所有条件下负面行动的数量并未增加。每回合内群体与外群体差异为 5 至 16 个百分点，对所有六个模型均具有统计学显著性 (Wilcoxon signed-rank 检验，所有经 Benjamini-Hochberg 校正的 p < 0.001)，确立了基于群体的定向作为指令微调语言模型在不同架构和训练范式下的稳健属性。通过 500 回合互惠的累积，这些差异累积为内群体信任偏差 +0.014 至 +0.100 (d = 0.84-4.52) ——说明了每回合微小的定向如何在持续网络中演变为结构性不平等。

Abstract

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
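
The multiple-comparison step named in the abstract can be sketched in a few lines. The function below implements the standard Benjamini-Hochberg adjustment; the p-values in the usage example are invented for illustration and are not the paper's, and the Wilcoxon signed-rank test that would produce them is not shown.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values, one per hypothetical comparison.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

A comparison is then called significant at level alpha when its adjusted p-value is below alpha, which controls the false discovery rate across all comparisons rather than the error rate of each one separately.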

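Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper studies social bias (such as in-group bias) of instruction-tuned language model agents in multi-agent simulation, at the intersection of social psychology and AI ethics, rather than unified model architectures, multimodal representation, or reinforcement learning. It tests multiple model families but proposes no unified model; it involves agents' interaction histories but builds no world model for control; it is explicitly about language models and involves no multimodality; and it does not mention reinforcement learning. The author list includes none of the designated experts, so no bonus is applied.

Keywords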

in-group bias, instruction-tuned language model agents, multi-agent simulation, group label salience, network assortativity, social dynamics, structural inequality

93. Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems [FAIL]

Score: 14.0 / 26.5

Authors: Zejian Eric Wu, Zhongyi Jiang, Yuan Zhuang, Paul Jen-Hwa Hu

Published: 2026-05-27

TL;DR: This study investigates how group-favoring biases in individual agents amplify or suppress within multi-agent systems using LLMs, revealing that uniform bias exposure leads to system-wide bias exceeding the sum of individual biases.

Abstract (Chinese translation)

多智能体系统（Multi-agent systems）正被越来越多地部署以支持各类任务，其中智能体通过交互实现个体目标与集体目标。尽管这些系统能够提升任务表现与决策制定，但通过减少偏见来维护公平性仍具挑战性。本研究探讨了智能体层面的偏见如何演变并影响系统级公平性。我们利用提示词（prompts）使单个智能体暴露于群体偏好偏见中，随后在系统层面评估其下游影响。为量化该影响，我们提出了偏好偏见强度（Favor Bias Strength, FBS），这是一种零中心化度量，将偏见变化分解为受青睐群体提升与受排斥群体抑制两部分。采用多种智能体设计、基准测试及最新的大语言模型（LLMs），我们发现被赋予偏见的智能体可显著影响系统级公平性。有趣的是，当智能体均匀暴露于偏见时，系统级偏见会升高，甚至超过各智能体偏见的加性总和。实证证据凸显了多智能体系统中公平性的关键性，这需要进一步的分析与实证测试。

Abstract

Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper focuses on fairness and bias propagation in multi-agent systems using Large Language Models (LLMs). It does not address model architecture unification, generative world modeling, multimodal processing, or reinforcement learning frameworks, resulting in low relevance scores for the specified keywords.

Keywords

Multi-Agent Systems, Bias Amplification, Fairness Preservation, Large Language Models, Group-Favoring Bias, System-Wide Fairness, Prompt Exposure

94. MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models [FAIL]

Score: 14.0 / 26.5

Authors: Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

Published: 2026-05-27

TL;DR: MemGuard introduces a type-aware memory framework to prevent heterogeneous memory contamination in long-term memory-augmented LLMs, improving memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods.

Abstract (Chinese translation)

记忆增强型大语言模型通过在交互过程中保持长期记忆，将推理能力扩展到固定上下文窗口之外。然而，现有的记忆系统通常将稳定的用户事实、情景事件和行为规则坍缩至共享空间中，导致功能上不同的记忆被检索并用作可互换的证据。我们将这种失效模式识别为异构记忆污染（heterogeneous memory contamination），其中上下文特定的事件变成过度泛化的主张，或者语义相关但功能上不兼容的记忆误导生成。为此，我们引入了 MemGuard，这是一种类型感知记忆框架，它在记忆构建和检索过程中保持功能记忆边界。它在写入时为每个记忆分配显式的功能角色，维持类型隔离记忆之间的关系，并仅从必要的记忆类型中选择性组合证据，从而减少来自无关或功能不兼容证据的污染。在幻觉和长对话基准测试中，MemGuard 将记忆可靠性提高了高达 28.27%，同时检索到的记忆 tokens 比先前方法少多达 5.8 倍。这些结果表明，可靠的长期推理依赖于对异构记忆进行基于原则的组织和使用。

Abstract

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper focuses on memory mechanisms in text-based LLMs, specifically addressing contamination through type-aware isolation. It does not align with Multimodal (MultiModal, MLLM), World Modeling, or Model-Based RL themes, nor does it focus on unifying multiple models, hence low relevance scores.

Keywords

Memory-Augmented Large Language Models, Heterogeneous Memory Contamination, Type-Aware Memory Framework, Long-Term Memory, Hallucination Reduction, Selective Evidence Composition, Functional Memory Boundaries

95. Periodic RoPE for Infinite Context LLMs [FAIL]

Score: 14.0 / 26.5

Authors: Simin Huo

Published: 2026-05-27

TL;DR: The paper proposes Periodic RoPE to overcome position exhaustion in LLMs, enabling theoretically infinite context windows through a combination of sliding window and global attention layers.

Abstract (Chinese translation)

处理超长上下文的能力对于大语言模型（LLMs）执行长程任务至关重要。尽管近期工作已将上下文窗口扩展至 1M 及以上，但当序列长度超过位置编码（例如 RoPE，旋转位置嵌入）的预训练范围时，模型性能会下降，即位置耗尽（position exhaustion）。必须克服这一根本性限制，才能实现真正的无限上下文。为了解决这一问题，我们提出了周期性 RoPE（P-RoPE），这是一种旨在规避此耗尽问题的位置编码机制。它与滑动窗口注意力（SWA）协同工作，以捕获每个窗口内的局部依赖和相对位置。随后，该局部层由一个无位置编码（NoPE）的全局注意力层补充，从而实现整个序列上不受位置约束的无界交互。通过堆叠这两种类型的层，模型避免了位置外推的需求以泛化到更长序列，理论上支持无限上下文窗口。实验结果表明，我们的模型 MiniWin 在长上下文效率和稳定性方面优于采用标准 GPT 架构的 MiniMInd。我们的工作为具有真正无限上下文理解能力的大语言模型提供了一条可能的途径。代码可在 https://github.com/Cominder/miniwin 处获取。

Abstract

The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M tokens and beyond, model performance degrades when the sequence length exceeds the pre-trained range of the positional encoding (e.g., RoPE), a failure known as position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation when generalizing to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMInd with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at https://github.com/Cominder/miniwin.
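The abstract does not define P-RoPE precisely. One way to read "periodic" is that the position index wraps with some period, so rotary angles never leave the range seen in training, while a sliding window shorter than the period keeps relative offsets unambiguous. The sketch below assumes that reading. As a further assumption, it uses rotation frequencies that are integer multiples of 2*pi/period, so the attention score depends only on the offset modulo the period. `periodic_rotate`, `attention_score`, and `PERIOD` are illustrative names, not the paper's code.

```python
import math

def periodic_rotate(vec, pos, period):
    """Rotate consecutive pairs of `vec` by angles that repeat every `period` positions.
    Pair j uses frequency 2*pi*(j+1)/period (an illustrative choice)."""
    wrapped = pos % period  # the position index never exceeds period - 1
    out = []
    for pair in range(len(vec) // 2):
        angle = 2.0 * math.pi * (pair + 1) * wrapped / period
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * pair], vec[2 * pair + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def attention_score(q, k, q_pos, k_pos, period):
    """Dot product of the rotated query and key."""
    rq = periodic_rotate(q, q_pos, period)
    rk = periodic_rotate(k, k_pos, period)
    return sum(a * b for a, b in zip(rq, rk))

PERIOD = 64
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.1]

# Same relative offset (3 tokens), once early and once far into the sequence,
# where the wrapped indices straddle a period boundary (2 and 63).
early = attention_score(q, k, 10, 7, PERIOD)
late = attention_score(q, k, 64_000_002, 63_999_999, PERIOD)
print(abs(early - late) < 1e-9)
```

Under these assumptions the local layer sees the same score for a given offset no matter how far into the sequence it occurs, which is the property a sliding-window layer would need to avoid positional extrapolation.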

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper focuses on extending LLM context via Periodic RoPE. It has low relevance to World Models, MLLM, MultiModal, and model-based RL because it contains no multimodal or RL content. 'Unify Models' receives modest relevance (3.0) for combining local and global attention layers, but overall alignment with the specified track is low.

Keywords

Periodic RoPE, Infinite Context, LLMs, Positional Encoding, Sliding Window Attention, MiniWin, Long-horizon tasks, No Positional Encoding

96. Compositional Generalization in Autoregressive Models via Logit Composition [FAIL]

Score: 14.0 / 26.5

Authors: Aakash Kumar, Maria Sofia Bucarelli, Emanuele Natale

Published: 2026-05-27

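TL;DR: The paper proposes a logit-composition strategy for combining autoregressive models and proves that, under a factorized-conditionals assumption, the composition is projective and preserves length generalization, avoiding interference between models.

Abstract (Chinese translation)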

自回归模型 (autoregressive models) 的组合仍然是理解大语言模型 (large language models) 如何结合跨任务习得的行为或技能的核心挑战。我们提出了一种新的、基于原理的自回归系统组合策略，该策略受为扩散模型 (diffusion models) 开发的组合方法启发。在因子化条件假设 (factorized-conditionals assumption) 下，我们证明所得的组合是射影的 (projective)：每个组件模型都保持对其输出分布中指定子空间的控制，从而避免模型间的干扰。该属性在输出空间的平滑重参数化 (smooth reparameterizations) 下进一步得以保留，从而得出一个特征空间定理 (feature-space theorem)。最后，我们证明当因子化假设和组件保证在目标长度处一致成立时，组合保留了长度泛化行为 (length-generalizing behavior)。这些结果提供了关于何时模型组合与合并能在自回归系统中成功的基于原理的理解，并确定了其相互作用保持稳定的条件。

Abstract

Composing autoregressive models remains a core challenge in understanding how large language models can combine behaviors or skills learned across tasks. We introduce a new and principled composition strategy for autoregressive systems, inspired by composition methods developed for diffusion models. Under a factorized-conditionals assumption, we show that the resulting composition is projective: each component model preserves control over its own designated subspace of the output distribution, avoiding interference between models. This property is further preserved under smooth reparameterizations of the output space, yielding a feature-space theorem. Finally, we show that composition preserves length-generalizing behavior when the factorization assumptions and component guarantees hold uniformly at the target length. These results provide a principled understanding of when model composition and merging succeed in autoregressive systems and identify conditions under which their interactions remain stable.
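The abstract does not give the composition rule. A toy sketch of the projective property is shown below, assuming the product-style operator common in the diffusion composition literature: composed logits = base + sum over components of (component - base). The two-factor vocabulary, the probabilities, and the names `compose_logits` and `logits_from` are all made up for illustration.

```python
import math
from itertools import product

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def compose_logits(base, components):
    """Elementwise base + sum_i (component_i - base) over the vocabulary."""
    return [b + sum(c[j] - b for c in components) for j, b in enumerate(base)]

# Toy vocabulary: each token is a (colour, shape) pair with independent factors.
colours, shapes = ["red", "blue"], ["circle", "square"]
vocab = list(product(colours, shapes))

def logits_from(p_colour, p_shape):
    return [math.log(p_colour[c] * p_shape[s]) for c, s in vocab]

base_c = {"red": 0.5, "blue": 0.5}
base_s = {"circle": 0.5, "square": 0.5}
skill_c = {"red": 0.9, "blue": 0.1}        # component 1 changes only the colour factor
skill_s = {"circle": 0.2, "square": 0.8}   # component 2 changes only the shape factor

base = logits_from(base_c, base_s)
comp1 = logits_from(skill_c, base_s)
comp2 = logits_from(base_c, skill_s)

probs = dict(zip(vocab, softmax(compose_logits(base, [comp1, comp2]))))
print(probs[("red", "square")])
```

Because each component departs from the base in only one factor, the composed distribution is the product of the two edited factors: each component keeps control of its own factor and leaves the other untouched.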

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	5.0/10	10.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

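Scoring rationale: 1. Unify Models (5.0): the paper studies model composition and merging, which falls under unifying model capabilities, but it does not address architectural unification. 2. World Models (0.0): no environment modeling. 3. MLLM (2.0): large language models are involved, but nothing explicitly multimodal. 4. MultiModal (0.0): no multimodal data is mentioned. 5. model-based RL (0.0): no reinforcement learning content. No specified expert appears in the author list, so no bonus applies. Weighted total: 14.0 < 26.5.

Keywords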

Compositional Generalization, Autoregressive Models, Logit Composition, Model Composition, Factorized Conditionals, Length-Generalizing, Output Distribution

97. PhAME: Phenotype-Aware Molecular Editing via Latent Diffusion [FAIL]

Score: 14.0 / 26.5

Authors: Łukasz Janisiów, Sebastian Musiał, Bartosz Zieliński, Dawid Rymarczyk, Tomasz Danel

Published: 2026-05-27

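TL;DR: PhAME is a latent-diffusion molecular editing framework that optimizes toward biological phenotypic signatures while preserving structural similarity to the seed molecule, achieving state-of-the-art results on drug discovery benchmarks.

Abstract (Chinese translation)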

小分子药物发现需要对候选分子的众多属性进行同时优化。这些属性可以通过分析高维生物特征（如细胞形态和转录组扰动）来探究，这些特征为潜在的生物学机制提供了丰富的视角。然而，现有的利用这些特征进行优化的生成方法无法满足两项关键要求：既要提供针对所需表型特征的精确指导，又要保持与已知命中化合物的结构邻近性。我们提出了 PhAME（Phenotype-Aware Molecular Editing），这是一种潜在扩散框架，通过将分子优化重新表述为在预训练的基于图的变分自编码器（VAE）的潜在空间中的编辑，从而克服了这一挑战。我们的核心贡献是一种组合式无分类器引导方案，包含两个独立的尺度：一个用于表型条件化，另一个用于与种子结构的相似性，这使得从业者能够控制这两个目标之间的权衡。在包括对接分数优化和多模态表型生成在内的多样化基准上的实证评估表明，PhAME 实现了最先进的结果，同时保持了高化学有效性和新颖性。

Abstract

Small-molecule drug discovery requires simultaneous optimization of numerous properties of candidate molecules. These properties can be investigated through the analysis of high-dimensional biological signatures, such as cell morphology and transcriptomic perturbations, which provide a rich perspective on the underlying biological mechanisms. However, existing generative methods, which use those signatures for optimization, fail to meet two key requirements: providing precise guidance toward desired phenotypic signatures while maintaining structural proximity to a known hit. We introduce PhAME (Phenotype-Aware Molecular Editing), a latent diffusion framework that overcomes this challenge by recasting molecular optimization as editing in the latent space of a pretrained graph-based VAE. Our central contribution is a compositional classifier-free guidance scheme with two independent scales, one for the phenotype-conditioning and one for similarity to the seed structure, allowing practitioners to control the tradeoff between these two objectives. Empirical evaluations across diverse benchmarks, including docking score optimization and multimodal phenotypic generation, demonstrate that PhAME achieves state-of-the-art results while maintaining high chemical validity and novelty.
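The abstract names a compositional classifier-free guidance scheme with two independent scales but does not give the formula. A minimal sketch, assuming the common additive form in which each condition contributes its own scaled offset from the unconditional prediction, is shown below. `guided_noise`, `w_pheno`, and `w_seed` are illustrative names, and the noise predictions are hand-written stand-ins for a denoiser's outputs.

```python
def guided_noise(eps_uncond, eps_pheno, eps_seed, w_pheno, w_seed):
    """Combine three noise predictions with two independent guidance scales.

    eps_uncond: prediction with no conditioning
    eps_pheno:  prediction conditioned on the target phenotype
    eps_seed:   prediction conditioned on the seed structure
    """
    return [
        u + w_pheno * (p - u) + w_seed * (s - u)
        for u, p, s in zip(eps_uncond, eps_pheno, eps_seed)
    ]

eps_uncond = [0.0, 1.0]
eps_pheno = [1.0, 1.0]
eps_seed = [0.0, 3.0]

# Strong phenotype guidance, mild pull toward the seed structure.
print(guided_noise(eps_uncond, eps_pheno, eps_seed, w_pheno=2.0, w_seed=0.5))
```

Setting either scale to zero switches that objective off, so in this form the tradeoff between phenotype fit and structural similarity can be tuned at sampling time.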

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	4.0/10	8.0
model-based RL	2.0	0.0/10	0.0

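Scoring rationale: The paper belongs to molecular generation and drug discovery and uses latent diffusion with a graph VAE, so it is only weakly related to the large-model (MLLM), reinforcement learning (RL), and world-model directions named by the keywords. 'MultiModal' scores 4 because the paper jointly models multimodal biological phenotype data (morphology and transcriptomics); 'Unify Models' scores 2 because the paper unifies two objectives, phenotype guidance and structural similarity, but not model architectures; 'World Models', 'MLLM', and 'model-based RL' score very low or 0 because the paper involves no environment modeling, language models, or reinforcement learning algorithms.

Keywords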

Phenotype-Aware Molecular Editing, Latent Diffusion, Graph-based VAE, Molecular Optimization, Multimodal Phenotypic Generation, Chemical Validity, Drug Discovery, Latent Space Editing

98. Off-Policy Learning to Reason Works Because It Is More Pessimistic Than You Think [FAIL]

Score: 14.0 / 26.5

Authors: Otmane Sakhi, Aleksei Arzhantsev, Imad Aouali, Flavian Vasile

Published: 2026-05-27

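TL;DR: The paper shows that the effectiveness of off-policy reinforcement learning for LLM reasoning stems from implicit pessimism, and it improves learning stability by stabilizing the effective target distribution; it does not address multimodality or world-model construction.

Abstract (Chinese translation)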

大规模强化学习已成为提升大语言模型推理能力的核心工具。在此规模下，生成往往滞后或异步，因此更新是基于旧策略收集的数据进行的。这使得学习本质上属于离线策略（off-policy）。尽管如此，大多数现有方法仍基于 PPO 风格的信任域目标（trust-region objectives），将训练视为近似在线策略（on-policy），并使用重要性权重（importance weights）来校正分布不匹配。这些校正方法可能会引入高方差，破坏优化稳定性，并加速熵坍塌。近期工作提出了一种替代方案：与其校正不匹配，不如采用离线策略数据并移除重要性权重，通常能产生性能更强的算法。本文提供了一种直观的离线策略目标构造，其中包含了有效的离线策略目标，并表明其有效性可以通过隐式悲观性（implicit pessimism）来理解：它们优化的目标策略比名义目标所暗示的更为保守。这一视角解释了为何某些特定的实现选择能提高稳定性：它们隐式地控制了有效目标分布。随后，我们提出了一种基于原理的修正方法，以稳定这种诱导分布并改进离线策略学习。

Abstract

Large-scale reinforcement learning has become a central tool for improving reasoning in large language models. At this scale, generation is often lagged or asynchronous, so updates are performed on data collected by older policies. This makes learning inherently off-policy. Most existing approaches nevertheless remain rooted in PPO-style trust-region objectives, treating training as approximately on-policy and using importance weights to correct distribution mismatch. These corrections can introduce high variance, destabilize optimization, and accelerate entropy collapse. Recent work suggests an alternative: rather than correcting the mismatch, one can embrace off-policy data and remove importance weights, often yielding stronger algorithms. In this paper, we provide an intuitive construction of off-policy objectives that encompasses successful off-policy objectives, and we show that their effectiveness can be understood through implicit pessimism: they optimize toward target policies that are more conservative than their nominal objectives suggest. This perspective explains why some particular implementation choices improve stability: they implicitly control the effective target distribution. We then propose a principled modification that stabilizes this induced distribution and improves off-policy learning.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	2.0/10	4.0

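Scoring rationale: The paper focuses on off-policy reinforcement learning for LLM reasoning and on implicit pessimism. Although it involves RL and LLMs, it does not cover multimodality, world models, or model unification; relevance to model-based RL is limited because the work concerns policy objectives rather than environment models. No specified expert appears among the authors. The weighted total is 14.0, below the pass mark.

Keywords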

Off-Policy Learning, Large Language Models, Implicit Pessimism, Reinforcement Learning, Reasoning, Policy Optimization, Stability

99. BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses [FAIL]

Score: 14.0 / 26.5

Authors: Qingfei Zhao, Huan Song, Shuyu Tian, Jiawei Shao, Xuelong Li

Published: 2026-05-27

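TL;DR: The paper proposes Binary Prefix Policy Optimization (BPPO), which improves on GRPO by updating on only the shortest correct and shortest incorrect completions, substantially improving training efficiency for reasoning RL and shortening responses.

Abstract (Chinese translation)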

组相对策略优化（GRPO）广泛用于训练推理模型，但在每个组中更新所有采样完成项会带来高昂的计算成本，并可能强化冗长的推理轨迹。本文旨在研究在基于 GRPO 的推理强化学习中，所有完成项是否提供同等有用的更新信号。我们的梯度相似性分析表明，在同一提示组内，同类完成项通常诱导高度相似的更新方向，而正确 - 错误样本对则提供更显著的对比信号。受此观察启发，我们提出二元前缀策略优化（BPPO），该方法使用最短的正确完成项和最短的错误完成项作为紧凑的更新单元，同时保留全组优势归一化。BPPO 进一步通过自适应完成项调度和前缀聚焦优化提升效率；通过仅更新响应前缀，它避免了强化冗余后缀，并鼓励生成更简洁的响应。在 GSM8K、MATH 和 Geo3K 上的实验表明，BPPO 相比 GRPO 实现了高达 6.08 倍的加速，同时保持具有竞争力的准确率，并且无需通过显式长度惩罚修改奖励，即可将平均响应长度减少约 30%-50%。

Abstract

Group Relative Policy Optimization (GRPO) is widely used for training reasoning models, but updating all sampled completions in each group incurs substantial cost and can reinforce verbose reasoning trajectories. In this paper, we study whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Motivated by this observation, we propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit while preserving full-group advantage normalization. BPPO further improves efficiency with adaptive completion scheduling and prefix-focused optimization; by updating only response prefixes, it avoids reinforcing redundant suffixes and encourages more concise responses. Experiments on GSM8K, MATH, and Geo3K show that BPPO achieves up to 6.08x speedup over GRPO while maintaining competitive accuracy, and reduces mean response length by approximately 30-50% without modifying the reward with an explicit length penalty.
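The selection step described in the abstract can be sketched as follows. This is an illustration, not the authors' implementation. It assumes binary rewards (1.0 correct, 0.0 incorrect), population standard deviation for the group normalization, and a fixed `prefix_len`; the adaptive completion scheduling mentioned in the abstract is not modeled. `select_update_units` is a hypothetical name.

```python
import statistics

def select_update_units(group, prefix_len):
    """Pick the shortest correct and shortest incorrect completion from one prompt group.

    group: list of dicts with 'tokens' (list) and 'reward' (1.0 or 0.0).
    Advantages are normalized over the whole group, not only the selected pair.
    Only the first `prefix_len` tokens of each selected completion are kept for the update.
    """
    rewards = [c["reward"] for c in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return []  # all correct or all incorrect: no contrastive pair exists

    def shortest(correct):
        idx = [i for i, c in enumerate(group) if (c["reward"] == 1.0) == correct]
        return min(idx, key=lambda i: len(group[i]["tokens"]))

    units = []
    for i in (shortest(True), shortest(False)):
        units.append({
            "index": i,
            "advantage": (rewards[i] - mean) / std,
            "prefix": group[i]["tokens"][:prefix_len],
        })
    return units

group = [
    {"tokens": list("abcde"), "reward": 1.0},
    {"tokens": list("abc"), "reward": 0.0},
    {"tokens": list("ab"), "reward": 1.0},
    {"tokens": list("abcdefg"), "reward": 0.0},
]
units = select_update_units(group, prefix_len=2)
print([(u["index"], u["advantage"]) for u in units])
```

In this toy group, only two of the four completions would be backpropagated through, which is where the claimed efficiency gain would come from; the group statistics still use all four rewards.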

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	1.0/10	2.0

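Scoring rationale: The paper's core is a policy optimization method for reasoning RL (BPPO) aimed at training efficiency and concise responses; it does not address unified model architectures (Unify Models) or world-model construction (World Models). Although it involves reinforcement learning, it uses a model-free policy-gradient method rather than model-based RL. The Geo3K task may imply a multimodal setting, but the core contribution is optimizing text generation, not a multimodal large-model architecture (MLLM/MultiModal), so relevance to the scoring keywords is low. The weighted total is 14.0, below the dynamic pass mark of 26.5. The author list contains none of the specified experts, so no bonus applies.

Keywords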

Binary Prefix Policy Optimization, GRPO, Reasoning RL, Concise Responses, Policy Optimization, Efficient Training, GSM8K, MATH

100. AtomComposer: Discovering Chemical Space from First Principles with Reinforcement Learning [FAIL]

Score: 13.0 / 26.5

Authors: Bjarke Hastrup, Francois Cornet, Tejs Vegge, Arghya Bhowmik

Published: 2026-05-27

TL;DR: AtomComposer employs reinforcement learning to autonomously discover novel stable 3D molecular isomers without pretraining, achieving broader generalization across chemical compositions compared to existing baselines.

Abstract (Chinese translation)

在没有训练数据的情况下发现新颖且稳定的分子仍然是一个重大的科学挑战。当前的分子生成模型是在大型预先构建的数据集上训练的，这会引入偏差并限制对新化学领域的探索。相比之下，我们提出了一种新范式：自主且泛化的智能体，能够在没有任何预训练的情况下映射广阔未知的化学空间。首次，我们提出了 AtomComposer，这是一种自引导智能体，能够在化学计量约束下自主构建有效的 3D 同分异构体，并仅使用强化学习进行在线训练。与现有方法通常对特定化学式过拟合不同，我们建立了一种多组分训练方案，能够在多样化化学中实现广泛泛化，该方案受基于能量和有效性的奖励引导。我们的智能体在未见测试化学式上发现的有效同分异构体数量比现有的使用每一步能量奖励训练的单组分强化学习基线多出一个数量级。这些结果兑现了在线强化学习作为一种强大范式的承诺，用于可扩展的、从零开始的化学构型空间探索。

Abstract

Discovering novel stable molecules without training data remains a grand scientific challenge. Current molecular generative models are trained on large, pre-curated datasets, which introduce biases and limit exploration of novel chemistry. In contrast, we propose a new paradigm: autonomous, generalized agents capable of mapping vast, unknown chemical spaces without any pretraining. For the first time, we present AtomComposer, a self-guided agent that autonomously constructs valid 3D isomers under stoichiometric constraints and is trained exclusively online using reinforcement learning. Unlike existing approaches that generally overfit to a specific chemical formula, we establish a multi-composition training scheme that enables a broad generalization across diverse chemistry, guided by energy- and validity-based rewards. Our agent can discover up to an order of magnitude more valid isomers on unseen test formulas than existing single-composition reinforcement-learning baselines trained with per-step energy rewards. These results fulfill the promise of online reinforcement learning as a powerful paradigm for scalable, from-scratch exploration of chemical configuration space.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	2.5/10	5.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	4.0/10	8.0

Scoring rationale: The paper focuses on reinforcement learning for chemical space discovery and has no content related to MLLM or MultiModal data (0.0). 'model-based RL' is moderately relevant (4.0) because RL is the core method and physical models are used for rewards, though the algorithm is policy-based. 'World Models' has slight conceptual relevance (2.5) regarding space mapping. 'Unify Models' is not addressed (0.0). The weighted total is 13.0, below the dynamic pass mark of 26.5. No expert authors from the specified list were found.

Keywords

Reinforcement Learning, Molecular Discovery, Chemical Space, 3D Isomers, First Principles, Autonomous Agent, Energy Rewards, Multi-composition Training

101. SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models [FAIL]

Score: 12.0 / 26.5

Authors: Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan, Yidong Jiang, Yankai Jiang, Jinru Ding, Jiayuan Chen, Zhuangzhi Gao, Pengcheng Chen, Zhao He, Rongzhao Zhang, Meiling Liu, Luyi Jiang, Jie Xu

Published: 2026-05-27

TL;DR: SafeMed-R1 is a clinician-audited medical LLM that achieves high accuracy and safety alignment through traceable supervision without relying on retrieval grounding.

Abstract (Chinese translation)

大型语言模型（LLMs）在资格考试中日益与专家表现相当，但常规临床使用仍受限制，因为治理要求具备可审计推理、安全与伦理对齐以及对对抗性滥用的鲁棒性。本文提出了 SafeMed-R1，该模型采用了一种可追溯的临床信任信号（CTS）流程进行训练，该流程将每个推理实例与临床医生评分及编辑历史相链接，并通过安全与伦理监督及红队（Red team）压力测试进行对齐。SafeMed-R1 在多个临床基准测试中达到了 79.6% 的宏观平均准确率。在对抗性安全测试中，SafeMed-R1 显示出最低的聚合风险，且相对于其基线，不安全输出减少了约 3% 至 5%。在一项包含 30 个药物安全病例的配对专家研究中，SafeMed-R1 在医学正确性上与 PGY1 和 PGY2 住院医师相当，但在药物安全性、指南一致性和临床实用性方面得分更高。总体而言，这些结果表明，经过临床医生审计的监督溯源，结合领域定制的安全与伦理对齐，可以加强治理相关的证据，而无需依赖推理时检索或引用锚定。

Abstract

Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited because governance requires auditable reasoning, safety and ethics alignment, and resilience to adversarial misuse. Here we present SafeMed-R1, trained with a traceable Clinical Trust Signals (CTS) pipeline that links each reasoning instance to clinician rubric scores and edit histories, and aligned through safety and ethics supervision and red-team stress testing. SafeMed-R1 attains a macro-averaged accuracy of 79.6% across clinical benchmarks. Under adversarial safety testing, it shows the lowest aggregated risk and reduces unsafe outputs by about 3 to 5% relative to its baseline. In a paired expert study of 30 medication safety vignettes, SafeMed-R1 matches PGY1 and PGY2 residents on medical correctness and scores higher for medication safety, guideline consistency, and clinical usefulness. Collectively, these results suggest that clinician-audited supervision provenance, together with domain-tailored safety and ethics alignment, can strengthen governance-relevant evidence without relying on inference-time retrieval or citation grounding.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on safety and ethics alignment for medical LLMs using clinician supervision and has no content on world models, multimodal integration, or model-based reinforcement learning. While it involves LLMs, it is not explicitly multimodal, nor is it about unifying models. No expert authors from the specified list were found in the author list.

Keywords

Medical Large Language Models, Safety Alignment, Ethics Alignment, Clinician-Audited, Clinical Trust Signals, Adversarial Testing, Medication Safety

102. BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models [FAIL]

Score: 12.0 / 26.5

Authors: Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei

Published: 2026-05-27

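TL;DR: BlazeEdit is a compact, text-free image-to-image diffusion model that delivers efficient multi-task image editing on mobile devices.

Abstract (Chinese translation)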

现代扩散模型虽具有卓越的生成质量，但往往伴随着巨大的参数量，这需要服务器端推理，带来高昂的计算成本和潜在的隐私风险。因此，开发高效设备端替代方案的趋势日益增长。尽管近期工作已将文生图模型针对移动硬件进行了优化，但它们仍相对庞大，参数量通常在 0.5B 到 1B 之间。我们提出了 BlazeEdit，这是一种高度高效、通用型的图像到图像扩散模型，专为设备端部署而设计。鉴于许多实际图像编辑任务无需基于文本的引导，我们消除了文本条件模块，并开发了一种多任务架构，将物体移除、图像外绘、色调校正、重光照和贴纸生成功能整合到一个仅含 195M 参数的单一紧凑模型中。BlazeEdit 在保持具有竞争力的生成质量的同时，显著降低了下载大小和内存开销。它在 Pixel 10 上仅需 290 毫秒即可完成一次完整推理，为设备端通用图像编辑提供无缝、隐私保护且闪电般的体验。

Abstract

The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly efficient, generalist image-to-image diffusion model tailored for on-device deployment. By identifying that many practical image editing tasks do not require text-based guidance, we eliminate the text-conditioning components and develop a multi-task architecture that consolidates object removal, outpainting, tone correction, relighting, and sticker generation into a single, compact model of only 195M parameters. BlazeEdit achieves a substantial reduction in download size and memory overhead while maintaining competitive generation quality. It completes a full inference pass in just 290ms on a Pixel 10, delivering a seamless, privacy-preserving, and lightning-fast experience for generalist image editing on the edge.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	4.0/10	8.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

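Scoring rationale: The paper focuses on an efficient diffusion model for on-device image editing and involves no reinforcement learning or world modeling, so relevance to model-based RL and World Models is very low. The model removes text conditioning, so it is neither an MLLM nor a multimodal model (low MultiModal relevance). It does consolidate several editing tasks into a single model (Unify Models), but the unification is limited to vision tasks, giving a moderate match with the unified-model concept in the research background.

Keywords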

Image Editing, Diffusion Models, Mobile Devices, On-device Deployment, Multi-task Learning, Image-to-Image, Efficient Inference

103. Skillful high-resolution weather forecasting independent of physical models [FAIL]

Score: 12.0 / 26.5

Authors: Pengcheng Zhao, Siqi Xiang, Weixin Jin, Zekun Ni, Jiang Bian, Zuliang Fang, Hongyu Sun, Bin Zhang, Richard E. Turner, Jonathan Weyn, Haiyu Dong, Kit Thambiratnam, Qi Zhang

Published: 2026-05-27

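TL;DR: The paper presents ObsCast, a machine-learning regional weather forecasting system that achieves accurate short-term forecasts without relying on numerical weather prediction data or physical models.

Abstract (Chinese translation)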

准确及时的天气预报对现代社会的高影响力决策至关重要。基于机器学习的天气预报正成为替代方案，用于在端到端系统中生成初始条件、预报，甚至两者兼而有之。这些方法比传统的数值天气预报（NWP）提供更快的预测，且往往具有更高的预报技巧。然而，端到端模型通常依赖 NWP 生成的再分析数据进行监督，从而继承了这些 NWP 的偏差和分辨率限制，并限制了其在再分析产品不可用、更新频率低或生产成本高昂等场景下的适应性。本文介绍 ObsCast，一个区域系统，它同时生成分析场和预报，在训练和推理过程中不使用任何 NWP 派生数据，同时在短期高分辨率区域建模中仍达到最先进的性能。在美国本土和欧洲，ObsCast 在长达 18 小时的近地表变量上优于业务化 NWP，并能产生具有技巧性的降水预报。它提供了一种更简单、更灵活的路径，可直接基于本地观测构建和完善区域预报服务，无需开发复杂且昂贵的传统预报流程。

Abstract

Accurate and timely weather forecasts are critical for high-impact decisions in modern society. Machine-learning-based weather prediction is emerging as an alternative for producing initial conditions, forecasts, and even both in end-to-end systems. These methods deliver predictions faster and often with higher skill than traditional numerical weather prediction (NWP). However, even end-to-end models typically rely on NWP-generated reanalyses for supervision, thereby inheriting the biases and resolution limitations of those NWPs, and limiting adaptation to settings where suitable reanalysis products are unavailable, infrequently updated, or expensive to produce. Here we introduce ObsCast, a regional system that generates both analysis and predictions, without using any NWP-derived data in either training or inference, while still achieving state-of-the-art performance in short-term high-resolution regional modeling. Over the contiguous United States and Europe, ObsCast outperforms operational NWP for near-surface variables through 18 h and produces skillful precipitation forecasts. It provides a simpler and more adaptable route to build and refine regional forecasting services directly from local observations, without the need to develop complex and costly traditional forecasting pipelines.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	3.0/10	6.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

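Scoring rationale: The paper studies machine-learning-based weather forecasting; its core is high-resolution prediction without numerical weather prediction (NWP) data. It touches on modeling world dynamics (World Models) and on unifying analysis and forecasting (Unify Models), but it is a meteorological application only weakly connected to the model paradigms defined in AI (such as unified architectures or planning with world models), so those scores are low. It does not involve multimodal large language models (MLLM) or reinforcement learning (model-based RL), which score 0, and typical multimodal inputs (MultiModal) are essentially absent. The author list contains none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords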

Weather Forecasting, Machine Learning, High-resolution, Numerical Weather Prediction, End-to-End, Regional System, State-of-the-art

104. Adaptive Coarse-to-Fine Subgoal Refinement for Long-Horizon Offline Goal-Conditioned Reinforcement Learning [FAIL]

Score: 12.0 / 26.5

Authors: Kaiqiang Ke, Shenghong He, Chengdong Xu, Yuheng Luo, Xiangyuan Lan, Chao Yu

Published: 2026-05-27

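TL;DR: The paper proposes a hierarchical offline reinforcement learning framework (CFHRL) that adaptively refines subgoals to mitigate weak supervision in long-horizon goal-conditioned tasks.

Abstract (Chinese translation)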

离线目标条件强化学习（GCRL）在长时程任务中颇具挑战性，其中遥远的状态 - 目标对提供的监督较弱，且价值估计易受累积自举误差的影响。层次化方法通过引入中间子目标来缓解这一困难，但固定的时间抽象或固定的层次深度可能与具有不同可达性时程的状态 - 目标对不匹配。我们提出从粗到细层次化目标强化学习（CFHRL），这是一种完全离线的 GCRL 框架，能够在执行前自适应地细化遥远的目标。从最终目标开始，CFHRL 递归地提出中间目标，这些目标基于回放支持的候选进行训练，一旦当前目标被学习得到的可达性代价估计为局部可执行，则停止细化。关键思想在于，子目标不必是精确的中点或全局最优航点；它只需提供可靠的进展并减少剩余的到达难度，从而使得后续在更短时程上的细化成为可能。简化分析进一步支持了近似递归收缩的鲁棒性。在 OGBench 上的实验显示，在多个长时程任务上取得了显著提升，消融实验验证了所提出的细化机制与停止机制。

Abstract

Offline goal-conditioned reinforcement learning (GCRL) is challenging in long-horizon tasks, where distant state-goal pairs provide weak supervision and value estimates become vulnerable to accumulated bootstrapping errors. Hierarchical methods mitigate this difficulty by introducing intermediate subgoals, but fixed temporal abstractions or fixed hierarchy depths can be mismatched to state-goal pairs with different reachability horizons. We propose Coarse-to-Fine Hierarchical Goal Reinforcement Learning (CFHRL), a fully offline GCRL framework that adaptively refines distant goals before execution. Starting from the final goal, CFHRL recursively proposes intermediate targets, trained from replay-supported candidates, and stops refinement once the current target is estimated to be locally executable by a learned reachability cost. The key idea is that a subgoal need not be an exact midpoint or globally optimal waypoint; it only needs to provide reliable progress and reduce the remaining reaching difficulty, enabling subsequent refinement over shorter horizons. A stylized analysis further supports the robustness of approximate recursive contraction. Experiments on OGBench show substantial gains on several long-horizon tasks, with ablations validating the proposed refinement and stopping mechanisms.
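The refine-then-stop loop described in the abstract can be sketched on a toy 1D problem. In the paper both the subgoal proposer and the reachability cost are learned; here they are hand-written stand-ins, and `refine_goal`, `cost`, `propose`, and the threshold value are illustrative assumptions.

```python
def refine_goal(state, goal, cost, propose, threshold, max_depth=32):
    """Refine a distant goal into nearer targets until one is locally executable.

    Returns the chain of targets, from the final goal down to the first target
    whose estimated reachability cost from `state` is within `threshold`.
    """
    chain = [goal]
    target = goal
    for _ in range(max_depth):
        if cost(state, target) <= threshold:
            break  # the current target is judged locally executable
        target = propose(state, target)
        chain.append(target)
    return chain

# Toy stand-ins: cost is distance on a line; the proposer makes progress
# toward the state but is deliberately not an exact midpoint.
def cost(a, b):
    return abs(a - b)

def propose(state, target):
    return state + 0.6 * (target - state)

chain = refine_goal(state=0.0, goal=100.0, cost=cost, propose=propose, threshold=10.0)
print(chain)
```

Each proposal only needs to shrink the remaining cost, which mirrors the abstract's point that a subgoal need not be an exact midpoint; the depth of refinement adapts to how far away the goal is.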

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	3.0/10	6.0

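Scoring rationale: The paper addresses subgoal refinement in long-horizon offline goal-conditioned reinforcement learning and belongs to hierarchical RL. It involves no multimodal inputs, multimodal large language models (MLLM), or unified model architectures, so MultiModal and MLLM score 0. It plans with a learned reachability cost, which gives it some model-based character, but it falls short of a generative world model or the Unify Models criterion, so relevance is low. The weighted total is 12.0, below the dynamic pass mark of 26.5.

Keywords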

Offline Goal-Conditioned Reinforcement Learning, Hierarchical Reinforcement Learning, Subgoal Refinement, Long-Horizon Tasks, Reachability Cost, Coarse-to-Fine, Offline RL

105. Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors [FAIL]

Score: 12.0 / 26.5

Authors: Luyang Fang, Yongkai Chen, Jiazhang Cai, Ping Ma, Wenxuan Zhong

Published: 2026-05-27

TL;DR: The paper proposes a Multi-Teacher Bayesian Knowledge Distillation framework utilizing teacher-informed priors and entropy weighting to enhance model compression, predictive accuracy, and uncertainty quantification for tasks such as protein subcellular location prediction and image classification.

Abstract (Chinese translation)

知识蒸馏是一种强大的模型压缩方法，能够实现复杂深度学习模型（教师模型）的高效部署，其中包括大语言模型。然而，其底层的统计机制尚不明确，且不确定性评估常被忽视，尤其是在需要多样化教师专家知识的真实世界场景中。为应对这些挑战，我们提出了多教师贝叶斯知识蒸馏（Multi-Teacher Bayesian Knowledge Distillation, MT-BKD），其中蒸馏的学生模型在贝叶斯框架下从多个教师模型中学习。该方法利用贝叶斯推断来捕捉蒸馏过程中的内在不确定性。我们引入了一个教师引导先验（teacher-informed prior），整合来自教师模型的外部知识与任务特定训练数据，从而提供更好的泛化性、鲁棒性和可扩展性。此外，一种基于熵的加权机制（entropy-based weighting mechanism）自适应地调整每个教师模型的影响，使学生能够有效整合多源专家知识。MT-BKD 增强了学生模型学习过程的可解释性，提高了预测准确性，并提供了不确定性量化。我们在合成任务和真实世界任务上验证了 MT-BKD，包括蛋白质亚细胞定位预测和图像分类任务。实验结果表明，该方法在性能上有所提升，且具有鲁棒的不确定性量化，凸显了 MT-BKD 框架的优势。

Abstract

Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its underlying statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios requiring diverse teacher expertise. To address these challenges, we introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), where a distilled student model learns from multiple teachers within the Bayesian framework. Our approach leverages Bayesian inference to capture inherent uncertainty in the distillation process. We introduce a teacher-informed prior, integrating external knowledge from teacher models and task-specific training data, offering better generalization, robustness, and scalability. Additionally, an entropy-based weighting mechanism adaptively adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances the interpretability of the student model's learning process, improves predictive accuracy, and provides uncertainty quantification. We validate MT-BKD on both synthetic and real-world tasks, including protein subcellular location prediction and image classification. Our experiments show improved performance and robust uncertainty quantification, highlighting the strengths of our MT-BKD framework.
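The abstract does not specify the form of the entropy-based weighting. One plausible sketch, assumed here, gives each teacher a weight proportional to exp(-entropy / temperature) of its predictive distribution, so confident teachers count for more, and then mixes the teachers' predictions with those weights. `entropy_weights`, `mixture`, and `temperature` are illustrative names, not the paper's API.

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def entropy_weights(teacher_probs, temperature=1.0):
    """Softmax over negative entropies: lower entropy gives a higher weight."""
    scores = [-entropy(p) / temperature for p in teacher_probs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mixture(teacher_probs, weights):
    """Weighted average of the teachers' class distributions."""
    n_classes = len(teacher_probs[0])
    return [
        sum(w * p[j] for w, p in zip(weights, teacher_probs))
        for j in range(n_classes)
    ]

confident = [0.90, 0.05, 0.05]     # a teacher that is sure about class 0
unsure = [1 / 3, 1 / 3, 1 / 3]     # a teacher with no preference
weights = entropy_weights([confident, unsure])
target = mixture([confident, unsure], weights)
print(weights, target)
```

The mixed distribution could then serve as a soft target or as the mean of a prior for the student on that input; which of these the paper does is not stated in the abstract.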

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on Multi-Teacher Bayesian Knowledge Distillation, which has limited alignment with the specified research background (World Models, MLLM architecture, RL). 'Unify Models' has modest relevance due to multi-teacher knowledge fusion, 'MLLM' has low relevance because LLMs are mentioned only as potential teachers, 'MultiModal' has low relevance because the tasks are primarily single-modality (image or protein), and 'World Models' and 'model-based RL' are unrelated. No target expert authors are present in the author list.

Keywords

Multi-Teacher Knowledge Distillation, Bayesian Inference, Teacher-Informed Prior, Uncertainty Quantification, Entropy Weighting, Model Compression, Protein Subcellular Location, Image Classification

106. Where LLM Annotators Fail: Label-Free Learning on Graphs with LLMs [FAIL]

Score: 12.0 / 26.5

Authors: Safal Thapaliya, Jiatan Huang, Chuxu Zhang

Published: 2026-05-27

TL;DR: This paper proposes a cluster-aware noise estimation framework (CANE) to enhance label-free graph node classification by correcting noisy LLM-generated pseudo-labels without ground truth supervision.

Abstract (Chinese translation)

图上的节点分类通常需要带标签的节点，然而在图规模上获取标签成本高昂。当节点属性包含语义内容（如论文摘要、网页或产品描述）时，大语言模型（LLMs）可通过标注一小部分节点提供低成本监督。然而，这些由 LLM 生成的标签含有噪声，现有的无标签图学习方法通常将此类噪声视为全局的或类别条件的。我们发现，LLM 的标注错误不仅依赖于类别，还依赖于区域：在同一类别内，可靠性在特征空间簇之间可能剧烈波动。鉴于此，我们提出簇感知噪声估计（CANE），这是一种无标签学习框架，它无需真实标签即可估计簇条件下的 LLM 可靠性，并利用该估计决定信任哪些伪标签以及修正哪些标签。在各种图基准和 GNN 骨干网络上，CANE 优于最强的无标签基线方法，在簇条件噪声更强的数据集上获得最大提升。

Abstract

Node classification on graphs often requires labeled nodes, yet obtaining labels at graph scale is expensive. When node attributes contain semantic content, such as paper abstracts, web pages, or product descriptions, large language models (LLMs) can provide low-cost supervision by annotating a small subset of nodes. However, these LLM-generated labels are noisy, and existing label-free graph learning methods usually treat this noise as either global or class-conditional. We find that LLM annotation errors are not only class-dependent but also region-dependent: within the same class, reliability can vary sharply across feature-space clusters. In light of this, we propose Cluster-Aware Noise Estimation (CANE), a label-free learning framework that estimates cluster-conditional LLM reliability without ground truth labels, and uses this estimate to decide which pseudo-labels to trust, and which labels to correct. Across various graph benchmarks and GNN backbones, CANE improves over the strongest label-free baselines, with the largest gains on datasets exhibiting stronger cluster-conditional noise.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on label-free graph learning that uses LLMs as annotators and corrects their annotation noise, which is unrelated to World Models or model-based RL, hence 0.0 relevance. Although it uses LLMs, the work does not address multimodal (MLLM/MultiModal) integration involving vision or audio, nor does it propose a unified model architecture (Unify Models), resulting in low relevance scores of 2.0. The author list does not contain any of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points are applied.

Keywords

Label-Free Learning, Graph Node Classification, LLM Annotation, Noise Estimation, Cluster-Aware, Pseudo-label Correction, GNN Backbones

107. DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification [FAIL]

Score: 12.0 / 26.5

Authors: Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

Published: 2026-05-27

TL;DR: DecomposeRL introduces an RL-based policy for semi-supervised claim verification that achieves high accuracy with limited labeled data by curating a dense subset of fact-verification claims.

Abstract (Chinese translation)

主张验证领域在端到端分类器与基于分解的方法之间存在权衡：前者准确但无法产生可检查的踪迹，后者虽能生成可检查的踪迹但在基准数据集上的表现却滞后。我们提出 DecomposeRL，这是一种能够产生可检查踪迹的准确主张验证器。DecomposeRL 将分解建模为一种强化学习（RL）策略，该策略使用 GRPO 和多面奖励集成进行训练，从而支持全监督和半监督学习，并能利用未标注主张。DecomposeRL 通过一个数据筛选漏斗解决了 GRPO 高昂的训练成本，该漏斗将 11.5 万条事实验证主张提炼为一个包含 5 千条主张的紧凑且学习信号密集的子集。我们表明，仅在约 5 千条精选主张上通过全监督训练的 DecomposeRL-7B 策略，在包含生物医学、政治、科学和通用领域主张的 11 个主张验证基准上，实现了 86.3 的域内和 69.8 的域外平衡准确率。尽管其规模仅为基线模型的 1/4，它仍达到了 32B 基线模型和 GPT-4.1-mini 的性能水平，并且在仅使用 10% 标注主张数据的半监督设置下，进一步超越了基线模型。代码、数据和模型可在 https://dipta007.github.io/DecomposeRL 获取。

Abstract

Claim verification is split between end-to-end classifiers, which are accurate but yield no inspectable traces, and decomposition-based methods, which produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL
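The abstract names GRPO as the training algorithm but does not spell out its update. As a rough illustration only, the sketch below shows the group-relative advantage that GRPO is generally described as using: each sampled output is scored against the other samples drawn for the same prompt. The reward values are made up, the use of the population standard deviation is an assumption (implementations differ), and nothing here reflects DecomposeRL's actual reward ensemble.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    with the population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four decompositions sampled for one claim.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

Samples that beat the group average receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.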

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper focuses on text-based claim verification using RL (GRPO) for decomposition. It does not address multimodal inputs (MultiModal, MLLM), world modeling (World Models), or architectural unification (Unify Models). While RL is used, GRPO is typically model-free, making 'model-based RL' only loosely related. The calculated weighted score is 12.0 (below the 26.5 pass threshold). None of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Claim Verification, Reinforcement Learning, Semi-Supervised Learning, Fact Verification, Policy Optimization, Traceable Verification, GRPO

108. Inpainting-Style Conditional Diffusion for Multivariable Time Series Forecasting [FAIL]

Score: 12.0 / 26.5

Authors: Kourosh Kiani, S. M. Muyeen

Published: 2026-05-27

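TL;DR: The paper proposes a conditional-diffusion framework for time-series forecasting that restructures solar power data as images and forecasts by inpainting, reaching high accuracy at short-term horizons.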

Abstract (Chinese translation)

本文提出了一种新颖的基于条件扩散的多变量时间序列光伏功率预测框架。所提出的方法利用滑动窗口块构造，将时间序列光伏（PV）数据重新构造成结构化的二维表示（图像），从而能够在统一的时空学习范式中应用去噪扩散概率模型（DDPM）。本文的一个关键贡献是将光伏预测表述为修复问题（inpainting），其中未来时间步被视为需要重建的缺失区域。这通过一种基于掩码的条件扩散机制实现：历史观测值被保留作为条件上下文，而目标（未来）区域则被逐步破坏，随后通过反向扩散过程恢复。该模型学习在观测数据条件下生成连贯的未来序列，有效地执行时间序列修复任务。为了充分利用所有可用特征并确保与 U-Net 架构约束兼容，本文引入了零填充策略以构建固定大小的输入。该模型采用监督去噪目标进行训练，以预测注入的噪声，从而在反向过程中实现准确的迭代重建。在包括 GEFCom2014 在内的基准光伏数据集上进行的广泛实验表明，所提出的方法实现了较高的预测精度，尤其在短期预测时段上表现优异。研究结果表明，将基于扩散的生成建模与修复问题表述相结合，对于实现鲁棒、灵活且高保真的光伏功率预测具有显著有效性。

Abstract

In this paper, we propose a novel conditional diffusion-based framework for multivariable time-series solar power forecasting. The proposed method reformulates temporal PV data as structured two-dimensional representations (images) using a sliding-window patch construction, enabling the application of Denoising Diffusion Probabilistic Models (DDPM) within a unified spatiotemporal learning paradigm. A key contribution of this work is the formulation of solar forecasting as an inpainting problem, where future time steps are treated as missing regions to be reconstructed. This is achieved through a mask-based conditional diffusion mechanism, in which historical observations are preserved as conditioning context while the target (future) region is progressively corrupted and subsequently recovered via reverse diffusion. The model learns to generate coherent future sequences conditioned on observed data, effectively performing time-series inpainting. To fully utilize all available features and ensure compatibility with U-Net architectural constraints, a zero-padding strategy is introduced to construct fixed-size inputs. The model is trained using a supervised denoising objective to predict injected noise, enabling accurate iterative reconstruction during the reverse process. Extensive experiments conducted on benchmark PV datasets, including GEFCom2014, demonstrate that the proposed approach achieves high forecasting accuracy, particularly for short-term horizons. The results highlight the effectiveness of integrating diffusion-based generative modeling with an inpainting formulation for robust, flexible, and high-fidelity solar power forecasting.
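The mask-based conditioning described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses a flat 1-D window instead of the 2-D patch image, a single noise level `alpha_bar` instead of a full schedule, and made-up PV values. It applies the standard DDPM forward step only where the mask marks a future position and leaves the history untouched.

```python
import math
import random


def noise_future_only(window, mask, alpha_bar, rng):
    """Corrupt masked (future) positions with the DDPM forward step.

    x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps  where mask == 1;
    positions with mask == 0 (history) are returned unchanged.
    Returns the noisy window and the injected noise (0.0 for history).
    """
    noisy, injected = [], []
    for x0, m in zip(window, mask):
        eps = rng.gauss(0.0, 1.0)
        xt = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
        noisy.append(xt if m else x0)
        injected.append(eps if m else 0.0)
    return noisy, injected


# Hypothetical normalized PV readings: four observed steps, two to forecast.
history = [0.10, 0.35, 0.60, 0.55]
future = [0.40, 0.20]
window = history + future
mask = [0] * len(history) + [1] * len(future)

noisy, injected = noise_future_only(window, mask, alpha_bar=0.5, rng=random.Random(0))
print("history kept:", noisy[: len(history)] == history)
```

A denoiser trained on such inputs would be asked to predict `injected` on the masked positions, which matches the noise-prediction objective the abstract mentions.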

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

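Scoring rationale: The paper's main contribution is applying diffusion models to time-series forecasting, using an image representation and an inpainting mechanism to improve accuracy. The given keywords, however, target large-model unification (Unify Models), world models (World Models), multimodal large language models (MLLM), multimodal understanding (MultiModal), and model-based reinforcement learning (model-based RL). The paper has little connection to these areas: it involves neither language models nor reinforcement learning, 'unified' refers only to the spatiotemporal paradigm rather than a unified model architecture, and 'multimodal' refers to the data representation rather than cross-modal interaction. The relevance scores are therefore low. The author list contains none of the specified experts, so no bonus points are applied. The weighted total is 12.0, below the dynamic passing score of 26.5.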

Keywords

Conditional Diffusion, Time Series Forecasting, Solar Power Forecasting, Inpainting, DDPM, Spatiotemporal Learning, Mask-based Conditional

109. A Road-Conditioned Traffic Movie Prediction Network with Spatiotemporal and Structure-Consistent Learning [FAIL]

Score: 12.0 / 26.5

Authors: Joshua Kofi Asamoah, Blessing Agyei Kyem, Armstrong Aboah

Published: 2026-05-27

TL;DR: The paper proposes RCSNet, a road-conditioned spatiotemporal network that improves traffic movie prediction accuracy and structural consistency across cities without target-city fine-tuning.

Abstract (Chinese translation)

城市级交通预测对于拥堵管理、路径引导及智能交通系统至关重要，然而，当需要将未来交通生成覆盖整个城市网络的空间地图时，准确预测仍具挑战性。现有的交通视频预测方法虽提高了帧级精度，但许多仍将预测主要视为图像重建。这可能导致生成的交通地图在数值上接近真实值，但受道路布局、连通性、行驶方向及拥堵传播的约束较弱，尤其是在跨城市场景中，交通行为与道路结构均发生变化。为了解决这一局限性，本研究提出了一种道路条件时空网络（RCSNet），将交通视频预测重新表述为拓扑引导的未来状态生成。RCSNet 从静态道路图中提取道路感知表示，基于历史观测建模多步交通动态，将方向性交通特征与局部道路结构对齐，并逐步生成未来交通地图以提升时间一致性。一个结构一致的学习目标进一步促使预测结果保持准确、与道路对齐且空间稳定。跨多个城市的实验表明，RCSNet 同时提升了预测准确性和结构一致性。在柏林、安特卫普和莫斯科的同一城市预测任务中，相较于最接近的基线方法，RCSNet 的平均 MAE（平均绝对误差）、MSE（均方误差）和 RMSE（均方根误差）分别降低了 11.5%、10.0% 和 5.1%。在未见过的芝加哥和曼谷的跨城市测试中，无需针对目标城市进行微调，其 RMSE 分别降低了 10.6% 和 10.5%。进一步的步长级、道路结构、可解释性、统计及效率分析表明，RCSNet 能够生成更准确、更具可迁移性、与道路对齐且计算高效的交通预测。

Abstract

City-wide traffic forecasting is important for congestion management, route guidance, and intelligent transportation systems, but accurate prediction remains challenging when future traffic must be generated as spatial maps over an entire urban network. Existing traffic movie prediction methods have improved frame-level accuracy, yet many still treat forecasting mainly as image reconstruction. This can produce traffic maps that are numerically close to the ground truth but weakly constrained by road layout, connectivity, travel direction, and congestion propagation, especially in cross-city settings where both traffic behavior and road structure change. To address this limitation, this study proposes RCSNet, a road-conditioned spatiotemporal network that reformulates traffic movie prediction as topology-guided future-state generation. RCSNet extracts road-aware representations from static road maps, models multi-horizon traffic dynamics from historical observations, aligns directional traffic features with local road structure, and progressively generates future traffic maps for improved temporal consistency. A structure-consistent learning objective further encourages predictions to remain accurate, road-aligned, and spatially stable. Experiments across multiple cities show that RCSNet improves both forecasting accuracy and structural consistency. In same-city forecasting on Berlin, Antwerp, and Moscow, RCSNet reduces average MAE, MSE, and RMSE by 11.5%, 10.0%, and 5.1%, respectively, compared with the closest baseline. In cross-city testing on unseen Chicago and Bangkok, it reduces RMSE by 10.6% and 10.5% without target-city fine-tuning. Additional horizon-wise, road-structure, explainability, statistical, and efficiency analyses show that RCSNet produces more accurate, transferable, road-aligned, and computationally efficient traffic forecasts.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	2.0/10	4.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

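Scoring rationale: The paper focuses on traffic flow forecasting with spatiotemporal modeling and road-conditioned constraints, and is entirely unrelated to the core concepts of MLLM and model-based RL (0 points). Although it builds a unified network and fuses road and traffic data, its connection to Unify Models, World Models, and MultiModal as defined in the generative AI and reinforcement learning context is weak (2 points). The author list does not include Yang Shi or the other specified experts, so no bonus points are applied. The weighted total is 12.0, below the dynamic passing score of 26.5.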

Keywords

Traffic Movie Prediction, Road-Conditioned, Spatiotemporal Learning, Structure-Consistent, Urban Traffic Forecasting, Road-Aware Representations, Multi-horizon Dynamics

110. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation [FAIL]

Score: 10.0 / 26.5

Authors: Shuaike Li, Kai Zhang, Xianquan Wang, Jiachen Liu, Shengpeng Mo

Published: 2026-05-27

TL;DR: This paper proposes CODE (Causal On-policy Distillation for Editing), which turns discrete fact injection into coherent knowledge evolution in LLMs and cuts the self-refutation rate to 1.8%, compared with 95.6% for a zero-distortion proxy of static fact overwriting.

Abstract (Chinese translation)

尽管知识编辑（KE）能够实现高效更新，但其主导的 Static Fact Overwriting（静态事实覆盖）范式将 LLMs（大型语言模型）视为离散数据库，强行注入孤立的事实。该范式破坏了预训练的逻辑拓扑结构，从而引发了 Epistemic Dissonance（认知失调）——一种病理现象，其中未演化的遗留先验迫使模型明确否定注入的更新。理想化的干预揭示，这并非仅仅是算法噪声，而是一种固有的结构缺陷，zero-distortion proxy（零失真代理）导致了灾难性的 95.6% 自我否定率。鉴于现实世界知识具有因果驱动的本质，将更新建立在显性因果叙事上，能有效将这种冲突率降至 6.6%，凸显了向 Causal Editing（因果编辑）范式转变的必要性。为了内化这一演化过程，我们提出了 CODE（Causal On-policy Distillation for Editing）。通过将 causal bootstrapping（因果自举）与 asymmetric on-policy distillation（非对称在线策略蒸馏）相结合，CODE 直接将因果转换逻辑嵌入参数化记忆中。在 LLaMA-3.1 和 Qwen-2.5 上的实验表明，CODE 将自我否定率大幅抑制至 1.8%，同时确保了稳健的 multi-hop accuracy（多跳准确率，高达 83.5%），无缝地将离散事实注入转化为连贯的知识演化。代码可在 https://github.com/CrashBugger/CODE 获取。

Abstract

While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, forcibly injecting isolated facts. Fracturing pre-trained logical topologies, this triggers Epistemic Dissonance -- a pathology where un-evolved legacy priors force the model to explicitly negate the injected update. Idealized interventions reveal that this is an inherent structural flaw rather than mere algorithmic noise, with a zero-distortion proxy yielding a catastrophic 95.6% self-refutation rate. Given the causally driven nature of real-world knowledge, grounding updates in explicit causal narratives effectively collapses this conflict rate to just 6.6%, underscoring the imperative for a paradigm shift toward Causal Editing. To internalize this evolution, we propose CODE (Causal On-policy Distillation for Editing). By coupling causal bootstrapping with asymmetric on-policy distillation, CODE engraves causal transition logic directly into parametric memory. Experiments on LLaMA-3.1 and Qwen-2.5 show CODE drastically suppresses self-refutation to 1.8% while securing robust multi-hop accuracy (up to 83.5%), seamlessly transforming discrete fact injection into coherent knowledge evolution. Code is available at https://github.com/CrashBugger/CODE.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper focuses on Knowledge Editing (KE) in LLMs using causal distillation, which has minimal overlap with World Models or Model-Based Reinforcement Learning. While it involves LLMs (relevant as the base of MLLMs), it does not directly address multimodal integration or unified model architectures. The term "on-policy" is borrowed from RL but is applied here to distillation, not to RL tasks.

Keywords

Knowledge Editing, Causal Editing, On-policy Distillation, LLM, Self-Distillation, Knowledge Evolution, Epistemic Dissonance

111. Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents [FAIL]

Score: 10.0 / 26.5

Authors: Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng

Published: 2026-05-27

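TL;DR: The paper studies the Sleeper Attack on LLM agents, showing that adversarial content can persist in agent state and trigger harmful behavior in later interactions, which exposes a security vulnerability in current agents.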

Abstract (Chinese translation)

大语言模型 (LLM) 智能体仍然易受来自外部环境的安全威胁，攻击者将对抗性内容注入到外部观测中，例如工具返回数据、网页或 MCP 上下文，导致有害的智能体行为，例如不安全操作或错误输出。现有研究通常专注于单交互攻击，即智能体观察到对抗性内容并在一次用户请求内立即表现出有害行为。然而，我们发现对抗性内容也可以在同一智能体提供的交互中持续存在，使得此类威胁更难被检测和缓解。具体来说，对抗性内容可能存在于智能体状态中，在交互间保持休眠，随后被良性用户查询激活。我们将这种安全威胁形式化为潜伏攻击 (Sleeper Attack)。为了评估它，我们构建了一个包含 1,896 个实例的基准测试，涵盖六种现实世界的有害结果、三种攻击策略和三种智能体状态目标：会话上下文、记忆和可复用技能。在七个强大的开源和闭源 LLM 上的实验表明，最先进的 LLM 智能体仍然易受潜伏攻击威胁，即使它们在单交互基线下实现了低攻击成功率。我们的代码和数据可在 https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef 处获取。

Abstract

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

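Scoring rationale: The paper's core topic is a safety threat to LLM agents (the Sleeper Attack) and the persistence of adversarial content, which belongs to the safety and alignment area. The given keywords cover unified model architectures, world models, multimodal large models, and model-based reinforcement learning, all of which are largely unrelated to this paper's research content and technical approach, so every keyword receives a score in the lowest band.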

Keywords

Sleeper Attack, LLM Agents, Adversarial Content, Agent State, Safety Vulnerabilities, Persistent Threats, Tool-returned Data, Harmful Behaviors

112. DEPART: DEcomposing PARiTy across Multilingual LLMs [FAIL]

Score: 10.0 / 26.5

Authors: Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal, Sunayana Sitaram

Published: 2026-05-27

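TL;DR: The study introduces a two-step Bayesian hierarchical framework that decomposes performance disparities in multilingual LLMs, finding that observable language features explain most of the language-attributable variance, that model identity dominates variance on understanding tasks, and that the benchmark-by-model interaction dominates on reasoning tasks.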

Abstract (Chinese translation)

多语言大语言模型（mLLMs）排行榜虽报告单语言准确率，却很少解释差异产生的原因，导致系统性偏差未被归因，且未为从业者提供任何可干预的手段。我们首先通过无分布 Friedman 检验和 Kruskal-Wallis 检验确立这些差距是系统性的，而非采样噪声的产物，随后引入一个两步贝叶斯分层框架，将多语言性能方差分解为可解释的分量。首先，在隔离归因于语言特性的方差后，我们发现可观察的语言特征（文字体系、语系、类型学距离）解释了理解任务上 79%（R²_ling）的方差和推理任务上 92% 的方差，而模型内部表征与英语的相似性则成为两个任务类别中的主导预测因子。其次，分解完整的（模型×评测基准×语言）立方体，我们发现自然语言理解（NLU）和推理具有根本不同的方差分布特征：模型身份主导理解（占方差的 66.7%），而评测基准×模型交互作用则主导推理（占 46.3%）。这些结果共同将多语言评估从被动的性能映射重塑为一种可解释的诊断框架，提供了针对语言差异根源驱动因素的具体干预手段。

Abstract

Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal--Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model's internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	3.0/10	6.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on Multilingual LLM evaluation and variance decomposition, showing no relevance to World Models, MultiModal, or Model-Based RL (0.0). MLLM is somewhat relevant (3.0) because of the focus on Large Language Models, although the paper is strictly multilingual rather than multimodal in the sense of the background keywords. Unify Models is weakly relevant (2.0) with respect to the unified analytical framework. No matching experts were found in the author list. The weighted total score is 10.0, which is below the dynamic passing score of 26.5.
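The arithmetic behind the table can be reproduced in a few lines. The sketch below copies the weights and relevance values from the table for this paper and assumes that each row's score is weight times relevance, that the total is the plain sum with no author bonus, and that a paper passes when the total reaches the threshold; the exact pass rule (at least versus strictly greater) is not stated in the report.

```python
PASS_THRESHOLD = 26.5  # passing score quoted in the report

# (keyword, weight, relevance out of 10), copied from the table above
rows = [
    ("Unify Models", 2.0, 2.0),
    ("World Models", 2.0, 0.0),
    ("MLLM", 2.0, 3.0),
    ("MultiModal", 2.0, 0.0),
    ("model-based RL", 2.0, 0.0),
]


def weighted_total(rows):
    """Sum weight * relevance over all keyword rows (no author bonus)."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(rows)
# Assumption: a total equal to the threshold counts as a pass.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

With five keywords at weight 2.0 and relevance capped at 10, the maximum total without bonuses would be 100.0 under these assumptions.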

Keywords

Multilingual Large Language Models, Performance Disparity, Variance Decomposition, Bayesian Hierarchical Framework, Language Identity, Model Identity, Benchmark Interaction

113. Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping [FAIL]

Score: 10.0 / 26.5

Authors: Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

Published: 2026-05-27

TL;DR: This paper proposes a geometry-correct diffusion posterior sampling method using denoiser-pullback curvature guidance to efficiently solve image reconstruction inverse problems with improved stability and speed.

Abstract (Chinese translation)

扩散后验采样基于测量值对扩散先验进行条件化，但数据一致性更新通常由手工调优的引导权重缩放，且在刚性、算子依赖的曲率下可能导致采样不稳定。我们用每噪声水平的、在扩散状态坐标中计算的阻尼 Gauss-Newton 修正替代标量引导。该修正通过去噪器反向传播似然梯度，采用避免前向去噪器雅可比矩阵的单侧曲率模型，并应用与去噪器残差对齐的扩散校准秩一阻尼。每个修正均使用自动微分通过无矩阵 GMRES 求解，采样则采用具有闭式漂移/噪声分解的方差保持 Langevin 转换进行。在 FFHQ 和 ImageNet 的逆问题上，该方法实现了具有竞争力的 PSNR/SSIM/LPIPS 指标，同时运行速度显著快于大多数对比基线；在加速 MRI 重建任务中，它在对比基线中实现了最佳的 PSNR/SSIM。

Abstract

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss--Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	1.0/10	2.0

Scoring rationale: The paper focuses on improving diffusion posterior sampling for inverse problems (image reconstruction) using curvature guidance and damping. It does not address Unify Models (architectural unification), World Models (environment dynamics for planning), MLLM (multimodal language models), MultiModal (cross-modal understanding/generation), or Model-Based RL (reinforcement learning with learned dynamics). Therefore, the relevance to all specified keywords is minimal. None of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Diffusion Posterior Sampling, Denoiser-Pullback Curvature Guidance, Manifold-Aligned Damping, Inverse Problems, Image Reconstruction, Gauss-Newton Correction, Variance-Preserving Langevin

114. Commit to the Bit: Reactive Reinforcement Learning Done Right [FAIL]

Score: 10.0 / 26.5

Authors: Onno Eberhard, Claire Vernade, Michael Muehlebach

Published: 2026-05-27

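TL;DR: The paper proposes Committed Q-learning and proves almost-sure convergence to the optimal reactive policy in finite environments with deterministic observations, under a rewire-robustness assumption.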

Abstract (Chinese translation)

强化学习算法通常在马尔可夫假设下进行分析（或设计）。这并不现实，因为在实践中遇到的大多数环境要么是部分可观测的，要么需要函数近似，这限制了智能体只能访问非马尔可夫状态特征。我们考虑在具有确定性观测（或等价地，硬状态聚合）的有限环境中学习最优反应策略的问题。我们引入了一种新算法，承诺 Q 学习（Committed Q-learning），并在我们称之为重连鲁棒性（rewire-robustness）的直观假设下，证明了几乎必然收敛到最优反应策略。该假设严格弱于先前工作中使用的 $q_\star$-可实现性条件。我们的算法是经典 Q-learning 的一种变体，其中行为策略在进入某个特征时承诺执行单一动作，仅在观测到的特征发生变化时才重采样动作。我们分析的一个关键部分是引入了准马尔可夫环境（quasi-Markov environments）。

Abstract

Reinforcement learning algorithms are commonly analyzed (and designed) under the Markov assumption. This is unrealistic, as most environments encountered in practice are either partially observable, or require function approximation that restricts the agent to access non-Markovian state features. We consider the problem of learning an optimal reactive policy in a finite environment with deterministic observations (or equivalently, hard state aggregation). We introduce a new algorithm, Committed Q-learning, and prove almost-sure convergence to the optimal reactive policy under an intuitive assumption we call rewire-robustness. This assumption is strictly weaker than the $q_\star$-realizability condition used in prior work. Our algorithm is a variant of classical Q-learning in which the behavior policy commits to a single action upon entering a feature, and only resamples actions when the observed feature changes. A crucial part of our analysis is the introduction of quasi-Markov environments.
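The commitment rule in the last part of the abstract can be sketched as a small behavior-policy wrapper. This is an illustration under assumptions, not the paper's algorithm: the class name is hypothetical, actions are sampled uniformly because the abstract does not specify the sampling distribution, and the Q-value update is omitted entirely.

```python
import random

_UNSET = object()  # sentinel so any feature value, including None, is valid


class CommittedBehaviorPolicy:
    """Sample an action on entering a feature and hold it until the feature changes."""

    def __init__(self, actions, rng):
        self.actions = list(actions)
        self.rng = rng
        self.feature = _UNSET
        self.action = None

    def act(self, feature):
        # Resample only when the observed feature differs from the previous one.
        if feature != self.feature:
            self.feature = feature
            self.action = self.rng.choice(self.actions)  # assumption: uniform sampling
        return self.action


policy = CommittedBehaviorPolicy(["left", "right"], random.Random(0))
trajectory = ["A", "A", "A", "B", "B", "A"]
for feature in trajectory:
    print(feature, policy.act(feature))
```

Within each run of identical features the printed action stays fixed; returning to feature `A` after visiting `B` counts as a new entry and triggers a fresh sample.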

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	2.0/10	4.0

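Scoring rationale: The paper is a theoretical study of classical reinforcement learning, mainly concerning reactive policies and a Q-learning variant. It does not involve multimodality (MultiModal), large language models (MLLM), or world models (World Models). Although it concerns reinforcement learning, it uses model-free Q-learning, so its connection to model-based RL is limited; it also does not reflect the core architectural idea of model unification (Unify Models). Its relevance to the given keywords is therefore low. The weighted total is 10.0, below the dynamic passing score of 26.5.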

Keywords

Reinforcement Learning, Reactive Policy, Q-learning, Deterministic Observations, State Aggregation, Quasi-Markov, Convergence

115. Geometry-First Generative Spatial Single-Cell Reconstruction [FAIL]

Score: 10.0 / 26.5

Authors: Ehtesamul Azim, Muhtasim Noor Alif, Tae Hyun Hwang, Yanjie Fu, Wei Zhang

Published: 2026-05-27

TL;DR: The paper proposes GEARS, a geometry-first framework that reconstructs intrinsic single-cell spatial geometry from scRNA-seq guided by spatial transcriptomics using a diffusion-based generator without relying on fixed grids or cell-type labels.

Abstract (Chinese translation)

单细胞 RNA 测序（scRNA-seq）虽能分析大量细胞，却丢失了空间上下文信息；相比之下，空间转录组学（ST）虽能在较低分辨率下保留部分空间结构。大多数现有的整合方法要么解卷积斑点混合物，要么将细胞映射到测量的斑点晶格，这将重建结果绑定到固定网格和载玻片特定的坐标系上，这种限制在非配对设置中尤为成问题。我们提出 GEARS，这是一种基于几何优先的框架，旨在在空间转录组学（ST）的指导下重建内在的单细胞空间几何，且不依赖细胞类型标签、组织学图像或细胞到斑点的对应关系。GEARS 首先学习一个域不变表达编码器，以对齐 ST 斑点和解离细胞；随后训练一个置换等变生成器，该生成器结合了一个基于扩散的细化器并采用 EDM 风格预处理，旨在在源自 ST 坐标的姿态不变监督下生成局部空间几何。在推理阶段，GEARS 在大量重叠的单细胞 RNA 测序（scRNA-seq）细胞子集上重建几何结构，聚合各子集间预测的成对距离，并求解全局距离几何问题，以获得规范二维坐标和稠密距离矩阵。广泛的定量和定性实验（包括跨切片泛化）表明，与强大的空间映射和解卷积基线相比，GEARS 一致地改进了全局距离保持、局部邻域保真度和空间分布对齐。

Abstract

Single-cell RNA sequencing (scRNA-seq) profiles large numbers of cells but loses spatial context, whereas spatial transcriptomics (ST) preserves partial spatial structure at lower resolution. Most existing integration methods either deconvolve spot mixtures or map cells onto a measured spot lattice, which ties reconstructions to a fixed grid and slide-specific coordinate systems, a limitation that is especially problematic in unpaired settings. We propose GEARS, a geometry-first framework that reconstructs an intrinsic single-cell spatial geometry guided by ST, without relying on cell-type labels, histological images, or cell-to-spot assignment. GEARS first learns a domain-invariant expression encoder that aligns ST spots and dissociated cells, and then trains a permutation-equivariant generator with a diffusion-based refiner with EDM-style preconditioning to generate local spatial geometries under pose-invariant supervision derived from ST coordinates. At inference, GEARS reconstructs geometry on many overlapping subsets of scRNA-seq cells, aggregates predicted pairwise distances across subsets, and solves a global distance-geometry problem to obtain canonical two-dimensional coordinates and a dense distance matrix. Extensive quantitative and qualitative experiments, including cross-section generalization, show that GEARS consistently improves global distance preservation, local neighborhood fidelity, and spatial distribution alignment compared to strong spatial mapping and deconvolution baselines.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on bioinformatics (spatial transcriptomics and single-cell RNA sequencing integration) using diffusion models for geometry reconstruction. It does not align with the provided keywords, which target AI/MLLM/RL research areas. There is no language model (MLLM), reinforcement learning (model-based RL), or AI world modeling. While it unifies two data modalities (MultiModal/Unify Models), this refers to biological data integration rather than unified AI architectures or models. Total weighted score: 10.0, which is below the dynamic passing score of 26.5.

Keywords

Single-cell RNA sequencing, Spatial transcriptomics, Geometry reconstruction, Diffusion model, Domain-invariant encoder, Spatial mapping, Unpaired settings

116. Stay Fair! Ensuring Group Fairness in Diffusion Models Across Guidance Scales [FAIL]

Score: 10.0 / 26.5

Authors: Myeongsoo Kim, Eunji Kim, Minwoo Chae, Sangwoo Mo

Published: 2026-05-27

TL;DR: The paper proposes StayFair to maintain group fairness in diffusion models across varying guidance scales by decoupling fairness from the guidance parameter without sacrificing image quality.

Abstract (Chinese translation)

扩散模型（Diffusion models）通过可调节的引导尺度（guidance scale）引导条件生成，以在提示对齐（prompt alignment）与多样性之间进行权衡。然而，现有的去偏技术（debiasing techniques）通常针对单一尺度进行优化，当用户调整该参数时，会导致公平性下降。我们通过将总偏差（total bias）分解为两个组成部分——模型偏差（model bias）和引导偏差（guidance bias）——将这种行为追溯到先前被忽视的来源。尽管先前工作主要关注前者，但我们发现引导偏差随引导尺度单调增长，最终在用户偏好的高引导模式中占据主导地位。为了解决这一问题，我们将强人口统计学公平性（Strong Demographic Parity）扩展至引导机制，并推导出一个条件，使得目标分布能够在不同引导尺度下保持其群体比例。我们提出了 StayFair，利用该条件在两种引导机制下设计公平的引导算法。对于分类器引导（classifier guidance），它使分类器的输出分布在不同群体间保持一致；对于无分类器引导（classifier-free guidance），它通过提示相关偏移（prompt-dependent offset）调整零嵌入（null embedding）。由于 StayFair 仅修改引导步骤，它与模型去偏（model debiasing）正交，可以叠加到现有的公平扩散模型上，从而将其公平性扩展至不同引导尺度。在类别条件（class-conditional）和文本到图像（text-to-image）生成任务中，StayFair 实现了公平性与引导尺度的解耦，且未牺牲图像质量。

Abstract

Diffusion models steer conditional generation with a tunable guidance scale to trade off prompt alignment and diversity. However, existing debiasing techniques are optimized for a single scale, degrading fairness when users adjust this parameter. We trace this behavior to a previously overlooked source by decomposing total bias into two components: a model bias and a guidance bias. While prior work primarily targets the former, we show that the guidance bias grows monotonically with the guidance scale, eventually dominating the high-guidance regimes users prefer. To address this, we extend Strong Demographic Parity to guidance and derive a condition under which the target distribution retains its group ratio across guidance scales. We propose StayFair, which leverages this condition to design fair guidance algorithms in both regimes. For classifier guidance, it equalizes the classifier's output distributions across groups; for classifier-free guidance, it shifts the null embedding by a prompt-dependent offset. Because StayFair modifies only the guidance step, it is orthogonal to model debiasing and can be layered onto existing fair diffusion models to extend their fairness across guidance scales. Across class-conditional and text-to-image generation, StayFair decouples fairness from the guidance scale without sacrificing image quality.
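For readers unfamiliar with the guidance scale, the sketch below shows one common form of the classifier-free guidance combination: the unconditional noise prediction is pushed toward the conditional one by a factor `scale`. This is a generic illustration with made-up numbers; other papers parameterize the scale differently, and StayFair's prompt-dependent offset to the null embedding is not shown because the abstract does not specify how it is computed.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Combine noise predictions: eps_uncond + scale * (eps_cond - eps_uncond).

    eps_uncond comes from the null (empty-prompt) embedding and eps_cond from
    the actual prompt. scale = 0 ignores the prompt; scale = 1 is plain
    conditional sampling; scale > 1 extrapolates past the conditional prediction.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical two-dimensional noise predictions.
eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

for scale in (0.0, 1.0, 2.0, 7.5):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```

Because the guided prediction moves linearly with `scale` along the difference between the two predictions, anything that difference encodes is amplified at high scales, which is the regime the abstract singles out.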

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	3.0/10	6.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on ensuring group fairness in diffusion models across guidance scales, which does not align with the background keywords regarding World Models, Model-Based RL, or Model Unification. While text-to-image generation involves multiple modalities (MultiModal), the core contribution is algorithmic fairness rather than multimodal representation learning or unified architectures. No expert authors from the specified list are present in the author list.

Keywords

Diffusion Models, Group Fairness, Guidance Scale, Classifier Guidance, Classifier-Free Guidance, Bias Decomposition, Text-to-Image Generation

117. Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations [FAIL]

Score: 10.0 / 26.5

Authors: Sachin Kumar

Published: 2026-05-27

TL;DR: This paper evaluates the robustness of linear deception probes in LLMs under distributional shifts, finding that style-augmented probes maintain high detection accuracy while deception is encoded in distributed features rather than a single linear direction.

Abstract (Chinese translation)

基于大语言模型（LLM）激活值训练的线性探针（Linear probes）日益被提议作为欺骗检测指标，然而它们在干净基准上的 AUROC 超过 0.96，却在分布偏移下表现崩溃。本文在 Gemma 3 模型家族（参数规模 1B-27B）上系统地压力测试了基于探针的指标，旨在诊断它们失败的原因，而不仅仅是记录它们失败的事实。我们针对欺骗编码测试了四个假设：(1) 单一线性方向，(2) 多维子空间，(3) 凸锥包，(4) 熵代理。我们的实验设计包括跨域转移矩阵、带有置换零基线的多维探针分析、熵残差化测试以及跨越 8 种风格偏移的干扰物评估。研究发现：(a) 探针在干净数据上实现近乎完美的 AUROC（>=0.998），但在风格偏移下表现崩溃；风格增强探针（style-augmented probes）在未见过风格上恢复了近乎完美的检测（平均 AUROC 0.979-0.983）；(b) 单方向假设被拒绝（k=1 仅捕获 0.61-0.80 的 AUROC），跨域转移失败被确认为几何原因所致，而非由层不匹配驱动；(c) 熵代理假设被拒绝（最大 |rho|=0.454，残差化后最大 Delta-AUROC=0.004）；(d) 欺骗并未形成显著的线性子空间（每域 k*=0），但多维探针（k>=5）通过分布式亚阈值特征恢复了信号。探针的脆弱性反映了分布的狭窄性而非架构限制：风格增强探针在 4B 和 27B 规模上均恢复了近乎完美的检测，从而确立逆缩放模式是一种训练分布伪影，而非真正的规模依赖现象。

Abstract

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
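All of the headline numbers above are AUROC values, so a short reminder of what the metric measures may help. The sketch below computes AUROC as the probability that a randomly chosen deceptive example gets a higher probe score than a randomly chosen honest one, counting ties as one half. The scores are invented to illustrate a probe that separates clean data but degrades under a shift; they are not taken from the paper.

```python
def auroc(scores, labels):
    """Pairwise AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]  # 1 = deceptive, 0 = honest
clean_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]    # hypothetical probe outputs
shifted_scores = [0.9, 0.4, 0.2, 0.6, 0.3, 0.1]  # hypothetical, after a style shift

print("clean:", auroc(clean_scores, labels))
print("shifted:", round(auroc(shifted_scores, labels), 3))
```

An AUROC of 0.5 corresponds to chance-level ranking, which is why values in the 0.61 to 0.80 range reported for the single-direction probe count as weak.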

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on deception detection probes within LLMs (Gemma 3), analyzing representation geometry and robustness to stylistic shifts. It has negligible relevance to World Models and Model-Based RL (0 score). While it involves LLMs (MLLM), it does not explicitly address multimodal integration (MultiModal) or model unification architectures (Unify Models), resulting in low weighted scores relative to the specified research background keywords.

关键词

Deception Probes, LLM Activations, Distributional Shift, Linear Probes, Gemma 3, Geometric Analysis, Style Augmentation, Deceptive Representations

118. Category-Level 3D Correspondence in Camera Space via Morphable Object Priors [FAIL]

Score: 10.0 / 26.5

Authors: Leonhard Sommer, Artur Jesslen, Basavaraj Sunagad, Adam Kortylewski

Published: 2026-05-27

TL;DR: This paper proposes a method to learn category-level 3D correspondences from single images using morphable shape priors without explicit supervision, introducing a new benchmark for household objects.

Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space -- predicting, from a single image, 3D locations that remain consistent across instances within a category -- and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on 3D computer vision and morphable shape priors for category-level correspondence, which has limited overlap with the provided keywords. While it unifies shape components (Unify Models) and maps image to 3D (MultiModal), it lacks content on MLLM, World Models, or Reinforcement Learning. The total weighted score is 10.0, below the dynamic passing score of 26.5.
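
The arithmetic behind the total can be checked with a short sketch. It assumes, based only on the rows shown in this report, that each keyword's score is weight times relevance and that the total is their sum compared against the passing score. Any bonus terms, such as the expert-author bonus mentioned in other entries, are left out, and the function names are illustrative.

```python
# Rows copied from the table for paper 118: (keyword, weight, relevance out of 10).
ROWS_118 = [
    ("Unify Models", 2.0, 2.0),
    ("World Models", 2.0, 1.0),
    ("MLLM", 2.0, 0.0),
    ("MultiModal", 2.0, 2.0),
    ("model-based RL", 2.0, 0.0),
]

def weighted_total(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

def verdict(total, passing_score):
    """Label a total; assumes a total equal to the threshold passes."""
    return "PASS" if total >= passing_score else "FAIL"

total = weighted_total(ROWS_118)
print(total, verdict(total, 26.5))  # 10.0 FAIL
```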

Keywords

Category-Level 3D Correspondence, Morphable Object Priors, HouseCorr3D, Monocular 3D Understanding, Shape Disentanglement, Camera Space, 3D Keypoints, Benchmark

119. PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting [FAIL]

Score: 9.0 / 26.5

Authors: Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang, Ching-Yu Tsai, Yu-Hsiang Chuang, Shou-De Lin

Published: 2026-05-27

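TL;DR: PromptEmbedder uses a dual-LLM soft-prompting framework to produce efficient, transferable text embeddings, reducing memory consumption while speeding up training.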

Abstract

Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.5/10	7.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

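Scoring rationale: The paper focuses on architecture-agnostic training for text embeddings using a dual-LLM soft-prompting strategy. It is unrelated to World Models, MultiModal, and model-based RL. Unify Models fits only partially, through the idea of decoupling embedding knowledge from the architecture, and MLLM applies only insofar as the work builds on LLMs rather than multimodal models. No designated experts were found among the authors, so no bonus points were added. The weighted total of 9.0 is well below the passing score of 26.5, indicating weak relevance to the given research background.

Keywords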

Text Embedding, Dual-LLM Framework, Soft Prompting, Transferability, Efficient Training, Architecture-Agnostic, Contrastive Learning, MTEB Benchmark

120. How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving [FAIL]

Score: 8.0 / 26.5

Authors: Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian, Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna

Published: 2026-05-27

TL;DR: This paper explores Attention-FFN disaggregation for efficient MoE LLM serving, demonstrating that partitioning attention and FFN across GPUs can sustain high throughput under strict latency constraints where non-disaggregated deployments are infeasible.

Abstract

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation deepens the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that fuses on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on system-level inference optimization (Attention-FFN disaggregation) for MoE LLMs, targeting throughput and latency. The keywords target model-centric research (Multimodal unification, World Models, RL). While the paper involves LLMs (MLLM keyword), it lacks content on multimodality, world modeling, or reinforcement learning mechanisms, resulting in low relevance scores for most keywords.

Keywords

Attention-FFN Disaggregation, MoE LLM Serving, Inference Optimization, Resource Allocation, Throughput and Latency, GPU Partitioning, Mixture-of-Experts, System Design Space

121. Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation [FAIL]

Score: 8.0 / 26.5

Authors: Annabella Sánchez-Guzmán, Lukas Eberhard, Denis Helic, Lisette Espín-Noboa

Published: 2026-05-27

TL;DR: This paper audits 43 LLMs for scholar recommendation and finds that persona prompts significantly affect the factuality and diversity of results, highlighting the need to audit prompt design alongside model choice.

Abstract

Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single-discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on auditing LLMs for scholar recommendation using persona prompts, which has low overlap with World Models or Model-Based RL (score 0.0). While it compares 43 LLMs, it does not propose a unified architecture (Unify Models, score 2.0) nor does it focus on multimodal capabilities (MLLM/MultiModal, score 1.0). No expert authors from the specified list are present, so no bonus points are applied.

Keywords

LLM Auditing, Persona Prompting, Scholar Recommendation, Social Representativeness, Model Choice, Factuality, Diversity, Prompt Design

122. DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers [FAIL]

Score: 8.0 / 26.5

Authors: Aditya Pujara, Xiaogang Zhu, Hsiang-Ting Chen

Published: 2026-05-27

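TL;DR: DeltaMCP is a specification-aware incremental regeneration tool that keeps MCP toolsets in sync with OpenAPI specifications, reducing maintenance overhead for developers of LLM-based systems.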

Abstract

The rapid development of LLMs coupled with the introduction of Model Context Protocol (MCP) has revolutionized how intelligent agents interact with APIs through deterministic and structured methods \cite{ModelContextProtocolIntro2025}. While some existing systems like AutoMCP attempt to automate the previously fully manual process of generating MCP servers, they fail to address the recurring challenge of maintaining synchronization between evolving enterprise-level APIs and their corresponding MCP toolset implementation \cite{mastouri2025makingrestapisagentready}. This paper introduces DeltaMCP, a specification-aware, incremental regeneration tool for enterprise-grade MCP servers. DeltaMCP enables developers to update only the affected tooling of an MCP server, given a new release of its corresponding service's OpenAPI specification. Using Azure REST API specifications as the evaluation dataset, DeltaMCP is benchmarked against baseline full generation methods on generation quality and system performance. The results demonstrate that DeltaMCP reduces developer overhead while improving maintainability and version consistency. This research offers a scalable approach for enterprises seeking to maintain high-fidelity, up-to-date MCP server infrastructures for LLM-based systems.
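
The idea of regenerating only what changed can be illustrated with a minimal sketch that compares two versions of an OpenAPI-style document and lists the added, removed, and modified operations. This is not DeltaMCP's algorithm, which the abstract does not describe; the function names and the toy specifications are hypothetical, and only the standard `paths` -> path -> HTTP method layout of OpenAPI is assumed.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}

def operations(spec):
    """Map (path, method) -> operation object for an OpenAPI-style dict."""
    ops = {}
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method in HTTP_METHODS:  # skip non-operation keys such as "parameters"
                ops[(path, method)] = op
    return ops

def diff_operations(old_spec, new_spec):
    """Return the operations a regeneration step would need to touch."""
    old_ops, new_ops = operations(old_spec), operations(new_spec)
    shared = old_ops.keys() & new_ops.keys()
    return {
        "added": sorted(new_ops.keys() - old_ops.keys()),
        "removed": sorted(old_ops.keys() - new_ops.keys()),
        "changed": sorted(k for k in shared if old_ops[k] != new_ops[k]),
    }

# Toy specifications (hypothetical endpoints).
old = {"paths": {"/vms": {"get": {"operationId": "listVms"}},
                 "/disks": {"get": {"operationId": "listDisks"}}}}
new = {"paths": {"/vms": {"get": {"operationId": "listVms", "deprecated": True}},
                 "/disks": {"get": {"operationId": "listDisks"}},
                 "/vms/{id}": {"delete": {"operationId": "deleteVm"}}}}

print(diff_operations(old, new))
```

In this toy case only the `/vms` listing and the new delete operation would need regeneration; the unchanged `/disks` tool could be left alone.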

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

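Scoring rationale: The paper's core is incremental generation of MCP servers and API synchronization, which belongs to software engineering tooling. The keywords 'World Models', 'MultiModal', and 'model-based RL' concern environment modeling, multimodal perception, and reinforcement-learning planning; they are unrelated to the paper, so each scores 0. 'Unify Models' touches on unifying tool interfaces but not model architectures, so relevance is low (2.0). 'MLLM' concerns large language models; the paper is not explicitly multimodal but sits in the LLM ecosystem, so relevance is slight (2.0). The weighted total of 8.0 is far below the passing score.

Keywords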

DeltaMCP, Incremental Regeneration, MCP servers, OpenAPI specification, LLM-based systems, Developer overhead, API synchronization, Tooling infrastructure

123. PetroBench: A Benchmark for Large Language Models in Petroleum Engineering [FAIL]

Score: 8.0 / 26.5

Authors: Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu, Heng Meng, Peng Zhou, Peng Li

Published: 2026-05-27

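TL;DR: This paper builds PetroBench, a benchmark for evaluating large language models in petroleum engineering, and finds that model performance varies across question types and engineering subfields, with Chinese models slightly ahead on multiple-choice questions.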

Abstract

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

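Scoring rationale: The paper studies LLM benchmark evaluation in petroleum engineering and has no direct connection to world models, reinforcement learning, or multimodal architectures. It is weakly related to Unify Models only through its unified evaluation environment, and to MLLM only through the type of model evaluated. It contains no MultiModal or model-based RL content. The author list does not include Yang Shi or the other designated experts. The weighted total of 8.0 is well below the dynamic passing score of 26.5, indicating low relevance to the research directions of the given keywords.

Keywords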

Large Language Models, Petroleum Engineering, Benchmark Evaluation, Domain-Specific, Question Bank, Model Performance, LLM Evaluation

124. SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection [FAIL]

Score: 8.0 / 26.5

Authors: Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang

Published: 2026-05-27

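TL;DR: SPARD combines safety projection with relevance-diversity data selection to defend large language models against harmful fine-tuning attacks, substantially lowering attack success rates while preserving task accuracy.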

Abstract

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which alternates between utility updates and explicit safety projections, using a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance-Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.
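
How a determinantal point process trades relevance against diversity can be sketched with a generic greedy selection. This is a common approximation to DPP selection, not the paper's procedure: the kernel `L[i][j] = relevance[i] * cosine(x_i, x_j) * relevance[j]`, the cosine similarity, and all names below are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-12:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d

def greedy_select(features, relevance, k):
    """Greedily add the item that maximizes det of the selected kernel submatrix."""
    n = len(features)
    L = [[relevance[i] * cosine(features[i], features[j]) * relevance[j]
          for j in range(n)] for i in range(n)]
    chosen = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            value = det([[L[a][b] for b in idx] for a in idx])
            if value > best_det:
                best, best_det = i, value
        if best is None:  # every remaining item is redundant
            break
        chosen.append(best)
    return chosen

# Items 0 and 1 are near-duplicates; item 2 is less relevant but different.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
relevance = [1.0, 0.9, 0.8]
print(greedy_select(features, relevance, 2))  # [0, 2]
```

The determinant shrinks when two selected items are similar, so the near-duplicate item 1 loses to the less relevant but distinct item 2.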

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

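Scoring rationale: The paper's core is a safe fine-tuning defense for large language models (LLMs), involving safety projection and data selection. World Models, MultiModal, and model-based RL are unrelated to its topic (text safety alignment) and score 0. MLLM relates to the LLM domain the paper covers, but the paper does not explicitly involve multimodal content, so the score is low (2.0). Unify Models relates to the joint optimization of safety and task objectives, but this is not model unification at the architecture level, so relevance is weak (2.0). Overall relevance is very low, far below the dynamic passing score of 26.5.

Keywords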

Safety Projection, Harmful Fine-Tuning Attack, Relevance-Diversity Data Selection, Large Language Models, Safety Alignment, Defense Framework, Alternating Optimization, Determinantal Point Process

125. Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback [FAIL]

Score: 8.0 / 26.5

Authors: Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu

Published: 2026-05-27

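TL;DR: This paper proposes COSE, which uses the LLM's intrinsic confidence signal to modulate training, improving performance on reasoning and mathematics tasks without an external verifier.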

Abstract

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B--4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.
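
One way to picture a confidence-weighted PPO update is to scale each sample's clipped surrogate by a confidence weight, so that uncertain self-judgments contribute less to the gradient. The abstract does not give COSE's formula, so everything below is an assumption: the geometric-mean token probability as the confidence signal, the normalization by total confidence, and the function names.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability, used here as a stand-in confidence."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for one sample."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

def confidence_weighted_loss(samples, eps=0.2):
    """Negative confidence-weighted mean of the clipped surrogates."""
    total_weight = sum(s["confidence"] for s in samples)
    objective = sum(
        s["confidence"] * clipped_surrogate(s["ratio"], s["advantage"], eps)
        for s in samples
    )
    return -objective / total_weight

# Toy batch: a confident sample and an uncertain one.
samples = [
    {"ratio": 1.5, "advantage": 1.0, "confidence": 0.9},
    {"ratio": 0.5, "advantage": -1.0, "confidence": 0.1},
]
loss = confidence_weighted_loss(samples)
print(round(loss, 6))  # -1.0
```

With equal weights the same batch would give a loss of -0.2, so the weighting lets the confident sample dominate the update.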

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	1.0/10	2.0

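Scoring rationale: The paper focuses on self-evolution of text-based LLMs and confidence-modulated reinforcement learning. It is unrelated to multimodality (MLLM, MultiModal: 0) and does not involve world models (World Models: 0). It uses PPO, but this is model-free rather than model-based RL (relevance 1.0). It relates to Unify Models only in that it unifies the generation and evaluation processes (relevance 3.0). The weighted total of 8.0 is far below the dynamic passing score of 26.5, indicating low relevance to the given background topics.

Keywords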

Self-evolving LLMs, Confidence-weighted PPO, Intrinsic confidence, Reasoning benchmarks, Training-signal challenge, Post-training, Text-based models, Uncertainty modulation

126. Sign-Aware Gated Sparse Autoencoders: Modeling Anticorrelated Features with Bi-Jump-ReLU Activations [FAIL]

Score: 8.0 / 26.5

Authors: Bartosz Wieciech, Zmnako Awrahman, Marcin Czelej, Victor Hugo Jaramillo Velasquez, Wioletta Stobieniecka

Published: 2026-05-27

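TL;DR: This paper proposes a sign-aware gated sparse autoencoder that uses Bi-Jump-ReLU activations to model anticorrelated features in large language models efficiently, achieving better reconstruction quality and a lower dead fraction than standard methods.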

Abstract

Sparse Autoencoders (SAEs) extract interpretable features from Large Language Models, but standard variants enforce non-negativity, forcing separate latents for diametrically opposed concepts (e.g., "pressure too high" vs. "pressure too low") and wasting dictionary capacity when features are anticorrelated. We propose the Sign-Aware Gated SAE (SA-GSAE): two-sided gated sparsity with signed magnitude and auxiliary supervision. A polarity-sensitive gate selects support on either sign, a signed-magnitude path avoids L1 shrinkage, and an auxiliary reconstruction prevents gate collapse. Bipolar sharing - one latent encoding both signs along a shared direction - is realised via a new Bi-Jump-ReLU activation; parameter accounting shows sign-awareness stays parameter-efficient even when anticorrelated pairs are rare. On real LLM activations across three mid-depth hookpoints on Pythia-1B and SmolLM3-3B (6 cells, 3 seeds), a half-width SA-GSAE at width H strictly Pareto-dominates a full-width Gated SAE at 2H over the entire swept L0 overlap on 3 of 6 cells (both MLP-output hookpoints and resid-mid/Pythia-1B); on the remaining 3 it matches R^2 within 0.025 (max gap -0.008) while cutting dead fraction by 0.35-0.62 absolute. Sweep-geomean dead-fraction reductions are ~100x-500x on MLP-output cells and Pythia-1B resid, ~2x-4x on attention cells and SmolLM3-3B resid. Ablations show the two-sided gate and auxiliary loss are load-bearing (no auxiliary collapses LR to 0.27, 98% dead); tying r_i^+ = r_i^- is indistinguishable (|Delta R^2| = 0.0015), and we recommend this symmetric variant as default. MLP-output gains come from most latents carrying both polarities; on attention, bipolar structure concentrates in a small set of top latents. Full-width SA-GSAE exhibits a reproducible reconstruction collapse at SmolLM3-3B resid that the half-width entirely avoids.
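
The contrast between a non-negative activation and a two-sided one can be sketched as follows. The one-sided function is the usual JumpReLU, which passes a pre-activation through only above a threshold. The two-sided form is one plausible reading of "Bi-Jump-ReLU" suggested by the thresholds r_i^+ and r_i^- mentioned in the abstract; the paper's exact definition is not given here, so treat it as an assumption.

```python
def jump_relu(z, theta):
    """One-sided JumpReLU: pass z through only when it exceeds the threshold."""
    return z if z > theta else 0.0

def bi_jump_relu(z, r_plus, r_minus):
    """Assumed two-sided form: keep z when it clears either signed threshold."""
    if z > r_plus or z < -r_minus:
        return z
    return 0.0

pre_acts = [-1.5, -0.2, 0.0, 0.3, 2.0]
print([jump_relu(z, 0.5) for z in pre_acts])          # [0.0, 0.0, 0.0, 0.0, 2.0]
print([bi_jump_relu(z, 0.5, 0.5) for z in pre_acts])  # [-1.5, 0.0, 0.0, 0.0, 2.0]
```

With the one-sided activation, the strongly negative value is discarded and would need a second latent to be represented; the two-sided version keeps it on the same latent with its sign. Setting `r_plus == r_minus` corresponds to the tied variant the abstract recommends as the default.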

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

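Scoring rationale: The paper focuses on LLM interpretability, proposing the Sign-Aware Gated Sparse Autoencoder (SA-GSAE) and the Bi-Jump-ReLU activation. It has no direct connection to world models, multimodality, or model-based RL (0). It involves LLMs but not multimodality (MLLM: 1.0). It unifies signed representations but does not address unifying model architectures (Unify Models: 3.0). The weighted total of 8.0 is below the dynamic passing score of 26.5, so topical relevance is low.

Keywords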

Sparse Autoencoders, Sign-Aware, Bi-Jump-ReLU, Anticorrelated Features, Large Language Models, Interpretability, Gated Sparsity

127. RW-TTT: Batched Serving for Request-Owned Test-Time Training State [FAIL]

Score: 8.0 / 26.5

Authors: Jian Yang, Zhizhuo Kou, Yao Tian, Hao Zhang, Han Chen, Sirui Han, Yike Guo

Published: 2026-05-27

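TL;DR: RW-TTT tags each decode step with its owner and version to enable batched serving of test-time training with request-owned state, reaching 9.31x the throughput of sequential serving.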

Abstract

Test-time training (TTT) adapts an LLM during generation by reading and updating request-owned state, such as fast weights, low-rank deltas, or streaming learner state. This breaks batched LLM serving, which assumes shared static weights: serial execution is correct but slow, while naive batching can corrupt request state. We formulate this problem as read-write TTT serving and present RW-TTT, which tags each decode step with its owner, version, and READ/WRITE effect, batches only compatible phases, and commits updates only to the owner. On one GPU with eight fast-weight InPlace-TTT streams, RW-TTT reaches 274.61 aggregate tok/s, 9.31x over sequential serving and 3.44x over per-stream replicas under the same memory budget. It preserves behavior on RULER, a long-context benchmark, and passes owner/version checks.
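
The tagging scheme described above can be sketched in a few lines. This is a toy model under stated assumptions, not the RW-TTT implementation: the `Step` fields follow the abstract (owner, version, READ/WRITE effect), while the single-float "weight", the phase grouping, and the version check are simplifications invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Step:
    owner: str          # request that owns the state this step touches
    version: int        # state version the step expects to see
    effect: str         # "READ" or "WRITE"
    delta: float = 0.0  # toy stand-in for a fast-weight update

def batch_by_phase(steps):
    """Group steps so that one batch never mixes READ and WRITE effects."""
    phases = {"READ": [], "WRITE": []}
    for step in steps:
        phases[step.effect].append(step)
    return [batch for batch in (phases["READ"], phases["WRITE"]) if batch]

def commit(batch, states):
    """Check versions, then apply WRITE steps to the owning request's state only."""
    for step in batch:
        state = states[step.owner]
        if step.version != state["version"]:
            raise ValueError(f"stale step for {step.owner}")
        if step.effect == "WRITE":
            state["weight"] += step.delta
            state["version"] += 1

states = {
    "req-a": {"weight": 0.0, "version": 0},
    "req-b": {"weight": 0.0, "version": 0},
}
steps = [
    Step("req-a", 0, "READ"),
    Step("req-b", 0, "WRITE", 0.5),
    Step("req-a", 0, "WRITE", 0.1),
]
for batch in batch_by_phase(steps):
    commit(batch, states)
print(states)
```

Each request ends with its own update applied and its version bumped once; a step carrying an outdated version would be rejected instead of silently corrupting another request's state.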

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

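Scoring rationale: The paper focuses on optimizing batched serving for LLM test-time training (TTT), covering state management and throughput. It does not address the core mechanisms of multimodality (MultiModal/MLLM), reinforcement learning (model-based RL), or world models (World Models). "Unify Models" appears only as the logical unification of serving batches, so the connection is limited. The author list does not include the designated experts.

Keywords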

Test-time Training, Batched Serving, Request-Owned State, Fast Weights, LLM Serving, Read-Write Effect, Aggregate Throughput

128. Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization [FAIL]

Score: 8.0 / 26.5

Authors: Wenjie Sun, Jinning Yang, Shuai Zhang, Mengnan Du

Published: 2026-05-27

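TL;DR: This paper studies how, under a fixed budget, the depth-width ratio affects neural interaction efficiency and thereby the generalization of large language models, finding that models in the efficient interaction interval perform better on benchmarks.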

Abstract

The guidance of scaling laws has increased the resource demands of modern large language models (LLMs), yet it remains questionable whether these models utilize resources effectively under a fixed budget. Previous research has shown superposition to be a key contributor to loss. By leveraging the Neural Feature Ansatz, we extend superposition from parameter space to gradient space and define it as neural interaction. We find that under a fixed budget, good generalization is usually accompanied by efficient neural interactions, and the model can be placed in an efficient interaction interval by adjusting its depth-width ratio ($R_{D/W}$). In addition, as the budget scales up, the efficient interaction interval of the model remains relatively stable. By comparing existing small-scale dense LLMs, we observe that models operating near this interval tend to perform better on the MMLU-Pro benchmark. Our findings reveal that $R_{D/W}$ influences resource utilization efficiency and thereby affects generalization, providing insights into model shape initialization and the understanding of model generalization mechanisms. Code for Neural Interaction Law is available at: https://anonymous.4open.science/r/Neural_Interaction_Law-D788

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

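Scoring rationale: The paper concerns LLM scaling laws, neural interaction, and the effect of depth-width shape on generalization; it does not cover multimodality, world models, or reinforcement learning. 'MultiModal', 'World Models', and 'model-based RL' therefore score 0. 'MLLM' involves LLMs but not multimodality, so relevance is low (2.0). 'Unify Models' has a faint connection through the unified treatment of parameter and gradient space, so relevance is low (2.0). The author list does not include the designated experts. The weighted total of 8.0 is below the dynamic passing score of 26.5, indicating low relevance to the specified topics.

Keywords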

Neural Interaction, Depth-Width Shape, Scaling Laws, Generalization, Resource Utilization, LLM Architecture, Superposition, MMLU-Pro

129. SPAR: Support-Preserving Action Rectification [FAIL]

Score: 8.0 / 26.5

Authors: Jiaxin Zhao, Weihang Pan, Xun Liang, Binbin Lin

Published: 2026-05-27

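TL;DR: The paper proposes Support-Preserving Action Rectification (SPAR), which anchors the residual space to a frozen behavior cloning policy to resolve the conflict between value maximization and data-distribution fitting in offline policy improvement, and achieves state-of-the-art performance on the D4RL benchmark.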

Abstract

Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored to a frozen pure behavior cloning policy. This framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, utilizing a latent-sampling weighted-regression mechanism to address fitting-improvement gradient conflict in the residual space. Theoretically, we prove this mechanism eliminates the manifold-normal drift of standard value gradients, while extensive D4RL experiments show SPAR extracts significant gains from suboptimal baselines to achieve state-of-the-art performance.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	2.0/10	4.0
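
The table above can be read as a weighted sum: each row's score is weight times relevance, and the total is compared with the dynamic pass score. A minimal sketch of that arithmetic follows, assuming the report simply sums the rows and passes a paper when the total reaches the threshold. The function and variable names are illustrative and are not taken from the report generator; any expert-author bonus is ignored.

```python
# Illustrative sketch only: names are hypothetical, not the report generator's code.
PASS_SCORE = 26.5  # dynamic pass score shown in the report

# (keyword, weight, relevance out of 10), copied from the table above
rows = [
    ("Unify Models", 2.0, 2.0),
    ("World Models", 2.0, 0.0),
    ("MLLM", 2.0, 0.0),
    ("MultiModal", 2.0, 0.0),
    ("model-based RL", 2.0, 2.0),
]


def weighted_total(rows):
    """Sum weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(rows)
# Assumption: a paper passes when its total is at least the pass score.
verdict = "PASS" if total >= PASS_SCORE else "FAIL"
print(f"{total:.1f} / {PASS_SCORE} -> {verdict}")
```

With the five rows listed here the sum is 4.0 + 4.0, which matches the 8.0 shown in the entry's score line.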

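Scoring rationale: The paper addresses policy improvement in offline reinforcement learning (offline RL). Its core contribution is Support-Preserving Action Rectification (SPAR), which resolves the conflict between value maximization and data-distribution fitting. It does not involve multimodality (MultiModal), multimodal large language models (MLLM), or world models (World Models), so those keywords are scored 0; 'Unify Models' is at most weakly related (2.0). Although the work belongs to reinforcement learning, the method rectifies an offline policy rather than learning an environment dynamics model for planning, so it is only weakly tied to the core definition of 'model-based RL' and receives a low score (2.0). The weighted total is 8.0, below the dynamic pass score of 26.5.

Keywords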

Offline policy improvement, Support-Preserving Action Rectification, Behavior cloning policy, Residual space, Latent Self-Imitation, D4RL benchmarks, Value maximization, Data distribution fitting

130. SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images [FAIL]

Score: 8.0 / 26.5

Authors: Sky Cen, Wufei Ma, Guofeng Zhang, Alan Yuille, Adam Kortylewski

Published: 2026-05-27

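TL;DR: SEMAGIC proposes a framework for learning semantically consistent deformable 3D representations from single-view images; by enforcing feature-level consistency and vertex-index-conditioned deformation, it substantially improves results on semantic correspondence benchmarks.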

Abstract

Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 [email protected] on SPair-71k, establishing deformable models as effective semantic 3D representations.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	2.0/10	4.0
model-based RL	2.0	0.0/10	0.0

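Scoring rationale: The paper focuses on learning semantically consistent deformable 3D representations from single-view images, a computer vision topic. It has no direct connection to World Models, MLLM, or model-based RL. Unify Models is only weakly related (unifying geometry and semantics), as is MultiModal (a cross-modal mapping from images to 3D). None of the specified expert authors are included. The weighted total is 8.0, below the dynamic pass score of 26.5.

Keywords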

Semantically Consistent, Deformable 3D Representations, Single-view Images, Canonical Template Mesh, Deformation Field, Semantic Correspondence, 3D Reconstruction

131. Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models [FAIL]

Score: 6.0 / 26.5

Authors: Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song

Published: 2026-05-27

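TL;DR: The paper proposes a routing-aligned fine-tuning framework (RA-MoE) that improves performance on multilingual tasks by aligning expert activation patterns in MoE models, outperforming standard fine-tuning baselines.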

Abstract

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.
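
The four-way taxonomy in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the first letter encodes correctness in English and the second encodes correctness in the target language (so `ci` means correct in English, incorrect in the target language). The function names and the sample flags are hypothetical and do not come from the paper.

```python
from collections import Counter


def categorize(english_correct: bool, target_correct: bool) -> str:
    """Return cc/ci/ic/ii; first letter = English, second = target (assumed order)."""
    return ("c" if english_correct else "i") + ("c" if target_correct else "i")


def ci_proportion(pairs):
    """Fraction of parallel examples solved in English but not in the target language."""
    if not pairs:
        return 0.0
    counts = Counter(categorize(en, tgt) for en, tgt in pairs)
    return counts["ci"] / len(pairs)


# Hypothetical correctness flags for five parallel examples: (English, target)
examples = [(True, True), (True, False), (True, False), (False, True), (False, False)]
labels = [categorize(en, tgt) for en, tgt in examples]
share = ci_proportion(examples)
print(labels, share)
```

Under this reading, the `ci` share is the quantity the abstract describes as a predictor of how much a task-language pair benefits from routing alignment.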

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

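Scoring rationale: The paper focuses on routing-aligned fine-tuning of MoE models for multilingual tasks. 'Unify Models' has low-to-moderate relevance (3.0/10), because the MoE architecture combines experts and the method aligns routing patterns across languages. The remaining keywords (World Models, MLLM, MultiModal, model-based RL) are unrelated (0.0), since the paper covers only text-based multilingual tasks and supervised fine-tuning, with no world-model, multimodal, or reinforcement learning content. The author list contains none of the specified experts.

Keywords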

Mixture-of-Experts, Multilingual Downstream Tasks, Routing-Aligned Fine-Tuning, Language-Universal Alignment, Expert Activation Pattern, Supervised Fine-Tuning, Routing Divergence

132. PrunePath: Towards Highly Structured Sparse Language Models [FAIL]

Score: 6.0 / 26.5

Authors: Zhexuan Gu, Zixun Fu, Yancheng Yuan

Published: 2026-05-27

TL;DR: PrunePath proposes a budget-adaptive structured sparsification framework for FFN layers in language models to achieve efficient inference without significant performance loss.

Abstract

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
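
The cumulative-mass threshold described in the abstract resembles nucleus (top-p) selection applied to experts. The sketch below shows that selection rule under stated assumptions: router logits for one token are softmax-normalized, and experts are added in descending probability until their summed mass reaches a budget. The function names and the budget value are hypothetical; the paper's actual routing and kernels may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(router_logits, mass_budget):
    """Pick the smallest set of top-probability experts whose mass reaches the budget."""
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for idx in order:
        chosen.append(idx)
        mass += probs[idx]
        if mass >= mass_budget:
            break
    return chosen


# A token with a peaked routing distribution needs fewer experts than a flat one.
peaked = select_experts([4.0, 0.0, 0.0, 0.0], mass_budget=0.6)
flat = select_experts([0.0, 0.0, 0.0, 0.0], mass_budget=0.6)
print(len(peaked), len(flat))
```

Changing `mass_budget` at inference time changes how many experts each token activates, which is one way to read the "sparsity knob from a single checkpoint" in the abstract.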

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on structured sparsification for FFN layers in language models (PrunePath) to improve inference efficiency. It does not address World Models, Reinforcement Learning, or Multimodal integration. While it handles NLU and NLG tasks, it lacks the core themes of the keywords (World Models, MLLM, MultiModal, RL). No expert authors from the specified list were found, so no bonus points were applied. The total weighted score is 6.0, well below the dynamic pass score of 26.5.

Keywords

Structured Sparsification, Feed-forward Networks, Language Models, MoEfication, Budget-adaptive, Inference Efficiency, Triton Kernels

133. Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages [FAIL]

Score: 6.0 / 26.5

Authors: Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao

Published: 2026-05-27

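TL;DR: The paper proposes KLineage, which learns optimization skills from expert code to address the problem that LLMs do not know when to apply an optimization during GPU kernel optimization; under a fixed budget it outperforms memory-based baselines.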

Abstract

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	1.0/10	2.0

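Scoring rationale: The paper studies KLineage, an LLM-based method for GPU kernel optimization whose core idea is learning when optimization skills apply. It has no direct connection to 'World Models', 'Unify Models', or 'MultiModal', and it is not about multimodal learning. It uses LLMs (the basis of MLLMs) but involves no multimodal input. The skill-learning process has some reinforcement learning flavor but is not typical model-based RL. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied. The weighted total is 6.0, below the dynamic pass score of 26.5.

Keywords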

GPU Kernel Optimization, LLM-based Agents, Optimization Skills, Verification Gates, Curriculum Learning, Expert Implementations, KLineage

134. BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law [FAIL]

Score: 6.0 / 26.5

Authors: Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier, Elly Breu, Angelina Greiner, Sofija Milijas, Matthias Grabmair

Published: 2026-05-27

TL;DR: BenGER benchmarks LLMs on German legal reasoning tasks, demonstrating that closed-flagship systems lead performance and human-AI co-creation substantially outperforms unaided human work.

Abstract

We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems (closed flagship, efficiency-oriented, and open-weight) across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human-AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs. r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human-AI co-creation substantially outperforms unaided human work.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on benchmarking text-based LLMs for German legal reasoning, showing low relevance to the provided keywords. Unify Models (2.0) reflects the evaluation of multiple existing models rather than unification; MLLM (1.0) reflects LLM usage but lacks multi-modality; World Models, MultiModal, and model-based RL (0.0) are irrelevant. The weighted total score is 6.0, significantly below the dynamic passing score of 26.5. None of the target expert authors appear in the list.

Keywords

Legal Reasoning, LLM Benchmarking, German Law, Subsumption-Based, LLM-as-a-Judge, Human-AI Co-creation, Free-text Legal Case Tasks, Doctrinal Reasoning

135. MIRA: A Bilingual Benchmark for Medical Information Response Audit [FAIL]

Score: 6.0 / 26.5

Authors: Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao

Published: 2026-05-27

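TL;DR: The paper introduces the MIRA benchmark, finds that large language models dilute medical information depending on how users phrase a question (Differential Information Dilution, DID), and mitigates the problem with a knowledge-guided prompt.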

Abstract

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

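Scoring rationale: The paper proposes the Medical Information Response Audit (MIRA) benchmark, which evaluates whether large language models give consistent information across different user phrasings. Although it compares several LLMs, it does not involve the core techniques of unified model architectures (Unify Models), world models (World Models), multimodal interaction (MultiModal/MLLM), or reinforcement learning (model-based RL). Its relevance to the given keywords is therefore low; MLLM receives a small score only because the work builds on LLM foundation models.

Keywords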

Medical Information Response Audit, Bilingual Benchmark, Large Language Models, Differential Information Dilution, Health Literacy, Safety Evaluation, Prompt Mitigation

136. STAB: Specification-driven Testing for Algorithmic Bottlenecks [FAIL]

Score: 6.0 / 26.5

Authors: Soohan Lim, Joonghyuk Hahn, Hyundong Jin, Yo-Sub Han

Published: 2026-05-27

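TL;DR: STAB is a specification-driven testing method that uses large language models to generate test cases from natural-language specifications, substantially raising the detection rate of algorithmic bottlenecks across several programming languages.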

Abstract

Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification-driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural-language problem specification alone. STAB separates the task into constraint-bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule-based saturation and CP-SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation-level adversarial construction principles from a curated scenario catalog using keyword matching and K-nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open-source LLMs and from 57.45% to 71.85% on average across closed-source LLMs, with consistent gains across Python, Java, and C++. Our code is available at https://github.com/suhanmen/STAB.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

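Scoring rationale: The paper proposes the STAB framework, which uses large language models to generate test cases that expose algorithmic bottlenecks; it belongs to software engineering and code generation. It does not involve world models, multimodal data fusion, or model-based reinforcement learning. It uses LLMs (partially related to MLLM) but shows no unified model architecture or integrated multimodal understanding and generation, so its relevance to the given keywords is very low.

Keywords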

Specification-driven Testing, Algorithmic Bottlenecks, Natural Language Specification, LLM Synthesis, Constraint-bound Maximization, Adversarial Structure Injection, Test Case Generator

137. Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses [FAIL]

Score: 6.0 / 26.5

Authors: Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

Published: 2026-05-27

TL;DR: The paper proposes Semantic Flow Regularization to solve Cross-Style Collapse in LLM fine-tuning, enhancing response diversity and quality via conditional flow matching without increasing deployment cost.

Abstract

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited, a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.
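
For readers unfamiliar with conditional flow matching, the following sketch shows the commonly used linear-interpolation form of the training target. It is an assumption that SFR uses this particular path; the abstract does not specify it. The embedding values, the Gaussian source, and all function names are hypothetical stand-ins.

```python
import random


def flow_matching_pair(source, target, t):
    """Linear path x_t = (1 - t) * x0 + t * x1 and its constant target velocity x1 - x0."""
    x_t = [(1.0 - t) * s + t * g for s, g in zip(source, target)]
    velocity = [g - s for s, g in zip(source, target)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)) / n


rng = random.Random(0)
target_embedding = [0.5, -1.0, 2.0]  # stand-in for a sentence-encoder embedding
noise = [rng.gauss(0.0, 1.0) for _ in target_embedding]  # stochastic flow source
t = rng.random()
x_t, v_star = flow_matching_pair(noise, target_embedding, t)
# A real head would predict the velocity from (x_t, t, backbone state);
# here a zero prediction is scored just to show the loss call.
loss = flow_matching_loss([0.0] * len(v_star), v_star)
```

Because each training pair starts from a fresh random source, different samples can flow to different targets, which is the property the abstract refers to as preserving multi-modality.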

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on improving LLM text generation diversity using Semantic Flow Regularization and conditional flow matching. It does not address World Models, Multimodal integration, or Model-Based RL, resulting in 0 scores for those keywords. It has slight relevance to MLLM as it involves LLMs (Qwen), but lacks explicit multimodal content. Unify Models is weakly relevant regarding objective unification. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, etc.) were found. The total weighted score is 6.0, which is below the dynamic pass score of 26.5, indicating low relevance to the target research track.

Keywords

Semantic Flow Regularization, Cross-Style Collapse, Conditional Flow Matching, LLM Fine-tuning, Text Generation Diversity, Sentence-Encoder Embeddings

138. Decision-focused learning for optimal PV-Battery scheduling [FAIL]

Score: 6.0 / 26.5

Authors: Joris Depoortere, Hussain Kazmi, Johan Driesen

Published: 2026-05-27

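TL;DR: The paper proposes a decision-focused learning framework that trains the forecasting model on the downstream scheduling task to reduce electricity costs; despite lower statistical accuracy, the resulting schedules are significantly cheaper.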

Abstract

The use of residential photovoltaics has increased dramatically in recent years. With battery systems becoming more affordable, the optimal operation of a photovoltaic-battery system can bring significant savings to households. Optimal control requires correct forecasts of underlying parameters, such as photovoltaic power generation, to schedule the battery. While forecasting models have become increasingly accurate due to algorithmic advances and data availability, accuracy is typically measured in generic metrics which might not align with the downstream application. This study proposes a decision-focused learning framework that integrates optimization and prediction by training a Long Short-Term Memory photovoltaic energy forecaster on the downstream optimal scheduling of a battery system. The proposed methodology is compared against a standard two-phase approach. Across a 14-month evaluation period, the decision-focused method reduced average electricity costs across twenty buildings by 3.6% when normalized against performance bounds defined by a perfect forecast and a baseline of no optimization. Critically, this financial improvement was achieved despite the model exhibiting a root mean squared error of 19.9%, significantly higher than the decoupled model's 8.2%. Warm-starting the decision-focused model further improves results, lowering average cost by approximately 8%, while also mitigating the negative impact on statistical accuracy (root mean squared error of 13.7%). The findings are statistically significant at the 0.001 level across the twenty households and for each household individually. These results demonstrate that aligning forecast models with optimization goals is key for achieving cost advantages in PV-battery systems. Future research should replicate these findings on other datasets, alternate forecasting models and alternate optimization algorithms.
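
The abstract reports costs normalized against two bounds, a perfect forecast and a no-optimization baseline, without giving the formula. One plausible reading is the share of the achievable saving that a method captures, sketched below. The formula, the function name, and the cost figures are assumptions made for illustration and are not taken from the paper.

```python
def normalized_saving(cost_method, cost_no_optimization, cost_perfect_forecast):
    """Share of the achievable saving captured by a method.

    0.0 corresponds to the no-optimization baseline, 1.0 to a perfect forecast.
    """
    achievable = cost_no_optimization - cost_perfect_forecast
    if achievable <= 0:
        raise ValueError("perfect-forecast cost must be below the no-optimization cost")
    return (cost_no_optimization - cost_method) / achievable


# Hypothetical annual electricity costs, for illustration only
baseline, perfect = 1000.0, 800.0
two_stage = normalized_saving(900.0, baseline, perfect)
decision_focused = normalized_saving(890.0, baseline, perfect)
print(two_stage, decision_focused)
```

A normalization of this kind makes buildings with very different consumption comparable, since each is scored relative to its own best and worst case.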

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	1.0/10	2.0

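Scoring rationale: The paper studies decision-focused learning and optimal scheduling for PV-battery systems. It integrates forecasting and optimization (a weak link to Unify Models) but does not involve multimodality, large language models, world models, or reinforcement learning, so the remaining keywords have very low relevance.

Keywords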

Decision-focused learning, PV-Battery scheduling, LSTM forecaster, Optimal control, Energy optimization, Downstream task, Forecasting accuracy

139. Benchmarking Inductive Biases for Multivariate Time-Series Anomaly Detection with a Robust Multi-View Channel-Graph Detector [FAIL]

Score: 6.0 / 26.5

Authors: Junhao Wei, Yanxiao Li, Bidong Chen, Yifu Zhao, Haochen Li, Dexing Yao, Baili Lu, Xudong Ye, Jietian Feng, Sio-Kei Im, Yapeng Wang, Xu Yang

Published: 2026-05-27

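TL;DR: The paper benchmarks a range of multivariate time-series anomaly detection methods and proposes a robust multi-view channel-graph detector that achieves the best macro-average VUS-ROC across five datasets.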

Abstract

We present a unified experiment, analysis, and benchmark study of multivariate time-series (MTS) anomaly detection. Ten family-representative detectors, spanning statistical, reconstruction, association, frequency, and generic-transformer families, are evaluated on five datasets (SMD, MSL, SMAP, PSM, and MSDS) under effectiveness, efficiency, robustness, and cross-dataset generalisation. All methods share the same windowing, scoring, hardware, and metric protocols. Effectiveness, ablation, and robustness use three random seeds; cross-dataset transfer uses seed 0 because each extra seed requires 250 source-target evaluations. The benchmark yields three method-independent findings: no single-bias baseline dominates; absolute perturbation VUS-ROC is more informative than retention ratios; and MSDS behaves as an event-dense deployment workload rather than a sparse point-anomaly benchmark. Under this protocol we also introduce our detector, an adaptive detector family combining a NOTEARS-constrained directed channel-graph view with optional patch-attention and temporal-association views. Our detector achieves the best macro-average VUS-ROC (0.675, +5.1 pt over the second-best LSTM-AE), ranks first overall, and reaches the top-3 on all five datasets. Its wins on MSL and MSDS are narrow, while its average and robustness gains are larger: under the same three-seed robustness protocol for every method, it obtains the strongest absolute VUS-ROC across noise, channel dropout, and time-shift perturbations. We release the MSDS preprocessing protocol, configurations, scripts, and seed-level metric dumps.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

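Scoring rationale: The paper focuses on benchmarking multivariate time-series anomaly detection and designing a new detector; its core content covers time-series analysis, graph-based channel modeling (NOTEARS), and robustness evaluation. It does not involve World Models, multimodal large language models (MLLM), or model-based RL, so those keywords are scored 0. 'Unify Models' is reflected only in the unified experimental protocol rather than a unified model architecture, so its relevance is low. 'MultiModal' overlaps lexically with the paper's 'Multi-View', but time series are usually treated as single-modality data, so its relevance is marginal.

Keywords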

Multivariate Time-Series, Anomaly Detection, Benchmarking, Multi-View, Channel-Graph, Robustness, NOTEARS, Detector

140. Patched-DeltaNet: Token-Level Event-Driven Memory for Linear-Time Anomaly Detection [FAIL]

Score: 6.0 / 26.5

Authors: Tae-Gyun Lee, Junyoung Park, Kyu Won Han

Published: 2026-05-27

TL;DR: Patched-DeltaNet reduces Transformer-based time series anomaly detection complexity to linear time via event-driven memory while maintaining high detection accuracy.

Abstract

Time series anomaly detection is critical for maintaining the reliability of mission-critical systems. While Transformer-based models like PatchTST have shown remarkable performance, their $\mathcal{O}(L^2)$ computational complexity severely limits deployment in resource-constrained environments. In this paper, we propose Patched-DeltaNet, a novel architecture combining time-series patching with Gated Delta Networks. By integrating these paradigms, we hypothesize and demonstrate the emergence of token-level event-driven memory, whereby the patching mechanism extracts local semantic chunks, while the error-driven DeltaNet updates its recurrent state exclusively when significant physical changes, defined as deltas, occur. This synergy effectively filters out background noise and captures sudden anomalous drifts. Our rigorous experiments on the Server Machine Dataset (SMD) benchmark demonstrate the structural superiority and sample efficiency of Patched-DeltaNet. By strictly outperforming recent architectures under unified evaluation constraints and identical compute budgets, our model yields an ROC-AUC of 0.957 and PA-F1 of 0.822, while drastically reducing computational complexity to the theoretical minimum of $\mathcal{O}(L/P)$.
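
The O(L/P) figure in the abstract follows from patching: a series of length L cut into patches of length P yields L/P tokens, and a recurrent model takes one step per token. The sketch below illustrates only the patching step, assuming non-overlapping patches and a dropped incomplete tail. The function name is hypothetical, and the paper's patching (for example its stride) may differ.

```python
def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches, dropping any incomplete tail."""
    if patch_len <= 0:
        raise ValueError("patch_len must be positive")
    n_patches = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


series = list(range(12))        # L = 12 time steps
patches = patchify(series, 4)   # P = 4, so L / P = 3 tokens
print(len(patches), patches[0])
```

Full self-attention over the same 12 steps would compare every pair of positions, whereas a recurrent update over patches visits each of the 3 tokens once.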

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	1.0/10	2.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on efficient time series anomaly detection using a hybrid architecture (Patching + Delta Networks). It has minimal relevance to 'Unify Models' (only unifies two specific techniques) and 'World Models' (models dynamics but not generative world models). It is completely unrelated to MLLM, MultiModal, and model-based RL as it deals with numerical time series rather than multimodal data or reinforcement learning.

Keywords

Time series anomaly detection, Patched-DeltaNet, Gated Delta Networks, Computational complexity, Token-level event-driven memory, Linear-time, Server Machine Dataset

141. Deep Neural Network Training as Random Effects: An Optimization-Inference Duality [FAIL]

Score: 6.0 / 26.5

Authors: Minhao Yao, Ruoyu Wang, Xihong Lin, Lin Liu, Zhonghua Liu

Published: 2026-05-27

TL;DR: This paper establishes a statistical framework for deep neural network training by revealing an optimization-inference duality between gradient flow and random-effects models, enabling principled early stopping via restricted maximum likelihood.

Abstract

Deep neural networks (DNNs) have achieved remarkable empirical success, yet their training dynamics remain understood mainly from optimization rather than statistical principles. Here we develop a statistical framework for DNN training in the over-parameterized regime by showing that the prediction induced by continuous-time neural tangent kernel (NTK) gradient flow is exactly equivalent to that from a classical random-effects model. In this framework, training time acts as a variance component, or equivalently an empirical Bayes covariance hyperparameter, governing the allocation of variation from noise to structured signal. This equivalence reveals an optimization-inference duality: the gradient-flow path is both an optimization trajectory and an empirical Bayes random-effects inference path. Conditional on training time, the network output is the posterior mean of the latent signal, and estimating training time by restricted maximum likelihood (REML) turns early stopping into likelihood-based empirical Bayes inference rather than external tuning. This perspective yields a two-stage inferential procedure. First, a variance-component test determines whether DNN training captures statistically significant structure beyond initialization. Second, conditional on training being warranted, REML provides a likelihood-based early stopping rule. The resulting stopping time admits a spectral interpretation in the NTK eigenbasis, where training proceeds until spectral loss decorrelation is achieved. We further establish that REML-guided early stopping achieves asymptotically optimal prediction error for fixed-design in-sample prediction and, under additional random-design regularity conditions, for out-of-sample prediction. This work reframes DNN training as statistical inference and provides a principled foundation for deciding whether and how long to train deep neural networks.
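The claimed equivalence can be made concrete with a short worked derivation. It is a sketch under stated assumptions rather than the paper's own construction: squared loss, a fixed NTK Gram matrix $K$ on the training inputs, unit learning rate, and an initial output $f_0$. The covariance $\Sigma_t$ below is simply one choice that makes the algebra match; the abstract does not state the paper's parameterization.

```latex
% Gradient flow on the squared loss with a fixed kernel K:
\frac{d f_t}{dt} = -K\,(f_t - y)
\quad\Longrightarrow\quad
f_t = f_0 + \bigl(I - e^{-tK}\bigr)(y - f_0).

% In the eigenbasis K = \sum_i \lambda_i u_i u_i^\top, direction u_i is fitted by the fraction
1 - e^{-t\lambda_i}.

% Random-effects model: y - f_0 = b + \varepsilon,\quad
% b \sim N(0, \Sigma_t),\quad \varepsilon \sim N(0, \sigma^2 I).
% Its posterior mean of b is
\mathbb{E}[\,b \mid y\,] = \Sigma_t\bigl(\Sigma_t + \sigma^2 I\bigr)^{-1}(y - f_0),

% which coincides with the gradient-flow fit f_t - f_0 when
\Sigma_t = \sigma^2\bigl(e^{tK} - I\bigr),
\qquad\text{since}\qquad
\Sigma_t\bigl(\Sigma_t + \sigma^2 I\bigr)^{-1}
 = \bigl(e^{tK} - I\bigr)e^{-tK} = I - e^{-tK}.
```

Under these assumptions the training time $t$ enters only through $\Sigma_t$, which is the sense in which it behaves like a variance component: $t = 0$ gives $\Sigma_0 = 0$ (no signal beyond initialization), and larger $t$ assigns more variance to the structured signal.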

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	3.0/10	6.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on deep neural network (DNN) training dynamics, using the neural tangent kernel (NTK) and random-effects models to establish an optimization-inference duality. It scores 3.0 on 'Unify Models' because it theoretically unifies the optimization and inference paths, and 0.0 on 'World Models', 'MLLM', 'MultiModal', and 'model-based RL' because it does not address multimodal architectures, world modeling, or reinforcement learning. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Deep Neural Network Training, Random Effects, Optimization-Inference Duality, Neural Tangent Kernel, Early Stopping, Restricted Maximum Likelihood, Variance Component, Statistical Framework

142. Transfer learning RGB models to hyperspectral images with trainable tensor decompositions [FAIL]

Score: 6.0 / 26.5

Authors: Mariette Schönfeld, Laurens Devos, Wannes Meert, Hendrik Blockeel

Published: 2026-05-27

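TL;DR: This paper proposes a trainable tensor decomposition method that adapts RGB vision models to hyperspectral images for transfer learning, achieving higher accuracy and robustness than existing methods.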

Abstract

Transfer learning makes it possible to use large vision networks on a variety of domains, by specializing their models' general filters to new tasks. However, these networks assume the input images to have 3 input channels, making them incompatible with multi- or hyperspectral images. Current approaches that mitigate this incompatibility sacrifice information in either the image, or the model. This work proposes a novel approach that preserves the image and spatial information present in the model by using partially trainable tensor decompositions. We create such decompositions of pretrained convolutional filters, separating the filters into spatial and spectral components. The spectral components are then replaced with trainable components of higher channel dimensionality. This creates hyperspectral filters that can specialize to new datasets, while retaining the spatial patterns of the original filter. Experiments on a variety of hyperspectral datasets show that our approach is more accurate and robust than other hyperspectral transfer learning methods.
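The idea of swapping the spectral component while keeping the spatial one can be sketched on a single filter. The sketch assumes the filter factorizes as W[c][i][j] = sum_r spectral[r][c] * spatial[r][i][j], and uses a trivial exact decomposition (the three channel kernels as spatial components, the identity as spectral component). The paper's actual decomposition, rank, and initialization are not given in the abstract, so every name and value here is hypothetical.

```python
def expand_filter(spatial, spectral):
    """Rebuild a filter from R spatial kernels and an R x C spectral matrix:
    W[c][i][j] = sum_r spectral[r][c] * spatial[r][i][j]."""
    rank, channels = len(spectral), len(spectral[0])
    k_h, k_w = len(spatial[0]), len(spatial[0][0])
    return [[[sum(spectral[r][c] * spatial[r][i][j] for r in range(rank))
              for j in range(k_w)] for i in range(k_h)] for c in range(channels)]

# A toy pretrained 3-channel 2x2 filter (made-up values).
rgb_filter = [[[1.0, 0.0], [0.0, -1.0]],
              [[0.5, 0.5], [0.5, 0.5]],
              [[0.0, 2.0], [-2.0, 0.0]]]

# Trivial exact decomposition: spatial components are the channel kernels,
# and the spectral component is the 3 x 3 identity.
spatial = rgb_filter
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Replacement spectral component for 5 bands. In training these entries would be
# learned while `spatial` stays frozen; the initial values here are arbitrary.
spectral_5 = [[1.0, 0.5, 0.0, 0.0, 0.0],
              [0.0, 0.5, 1.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.5, 1.0]]
hyper_filter = expand_filter(spatial, spectral_5)
```

Here `hyper_filter` has five input channels, each a linear mix of the original spatial patterns, so the pretrained spatial structure is reused rather than discarded.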

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

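Scoring rationale: The paper focuses on transfer learning in computer vision, using tensor decompositions to adapt RGB models to hyperspectral images. 'World Models', 'MLLM', and 'model-based RL' are entirely unrelated to the paper's topic and score 0. 'Unify Models' is touched only through the mathematical decomposition of filter components, not through any trend toward architectural unification, so its relevance is low (2.0). 'MultiModal' usually refers to cross-modal learning (such as image-text), whereas hyperspectral imagery is a single modality with many channels, so its relevance is very low (1.0). None of the specified expert authors appear in the author list.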

Keywords

Transfer learning, Hyperspectral images, Tensor decompositions, Convolutional filters, Spatial-spectral separation, Vision networks, Model adaptation

143. Learning the Error Patterns of Language Models [FAIL]

Score: 4.0 / 26.5

Authors: Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D'Antoni

Published: 2026-05-27

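TL;DR: This paper proposes the Palla algorithm, which learns prefix filters that capture the error patterns of language models; used with constrained sampling, they raise compile rates for Qwen2.5-1.5B on TypeScript generation by over 60%.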

Abstract

When generating outputs for domains with specific validity constraints (e.g., a program should compile), LLMs often fail in a small number of focused ways: for example, by using Python function names when generating TypeScript. We observe that these error patterns can be represented using a small number of constraints that can be learned in practice. We propose prefix filters, which are per-domain-and-LLM symbolic functions, as objects to capture the error patterns, and Palla, an algorithm that learns prefix filters efficiently in practice, which we also implement. Prefix filters learned by Palla (i) help us quantitatively analyze the error patterns of LLMs, and (ii) can be used to constrain the outputs of a model via constrained sampling algorithms. For example, Palla boosts compile rates for Qwen2.5-1.5B on TypeScript generation by over 60%, allowing Qwen2.5-1.5B to achieve performance similar to that of unconstrained Llama3.1-8B.
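How a prefix filter might constrain sampling can be sketched as follows. The filter, the toy proposal function, and the sampling loop are all hypothetical stand-ins: the abstract does not describe how Palla represents or learns filters, nor which constrained sampling algorithm is used.

```python
import random

def prefix_filter(prefix):
    """Hypothetical filter: reject any prefix containing Python-only names."""
    banned = ("len(", "print(", "def ")
    return not any(b in prefix for b in banned)

def toy_model(prefix):
    """Toy stand-in for an LLM: returns the candidate next tokens for a prefix."""
    if prefix == "":
        return ["console.log(", "print("]
    if prefix.endswith("("):
        return ["x"]
    if prefix.endswith("x"):
        return [")"]
    return []

def constrained_sample(propose, accepts, max_tokens, rng):
    """Build the output token by token, keeping only continuations
    whose extended prefix passes the filter."""
    out = ""
    for _ in range(max_tokens):
        candidates = [tok for tok in propose(out) if accepts(out + tok)]
        if not candidates:
            break
        out += rng.choice(candidates)
    return out

result = constrained_sample(toy_model, prefix_filter, 5, random.Random(0))
```

Because the filter is checked on every extended prefix, the Python-style `print(` branch is pruned at the first step and only the TypeScript-style continuation survives.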

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

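Scoring rationale: The paper focuses on analyzing the error patterns of language models (LLMs) and on constrained generation, proposing prefix filters and the Palla algorithm. The given keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) mainly concern multimodality, world models, and reinforcement learning, which are a poor match for this paper's topic of constrained text and code generation. Apart from 'Unify Models', which touches on the general notion of models, all keywords have a relevance of 0. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). The weighted total is 4.0, below the dynamic passing score of 26.5.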

Keywords

Language Models, Error Patterns, Prefix Filters, Constrained Sampling, Code Generation, LLM, Symbolic Functions

144. Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots [FAIL]

Score: 4.0 / 26.5

Authors: Mazen Alamir, Sacha Clavel

Published: 2026-05-27

TL;DR: This paper proposes an algorithm to identify explicit piece-wise polynomial relationships for industrial time-series data, demonstrating effective inverse modeling of manipulator robots compared to DNNs.

Abstract

This paper addresses the problem of identifying parsimonious explicit piece-wise polynomial relationships that might involve a relatively large number of raw features. The algorithm leverages a recently proposed identification algorithm that yields parsimonious implicit relationships, from which a normality characterization can be derived in the context of anomaly detection and localization. The algorithm proposed in this paper goes a step further by deriving explicit piece-wise representations that are built using the set of polynomials involved in the implicit representations. The framework is illustrated on the problem of identifying parsimonious explicit representations of the inverse model of a 6-axis manipulator robot. Further experiments on a 4-axis robot are also presented, designed to investigate the generalization capability of parsimonious models compared to state-of-the-art DNN structures when models face unseen contexts of use.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper focuses on system identification using piece-wise polynomial relationships for industrial robot time-series data. It has no content on Unify Models, World Models, MLLM, or MultiModal (0.0). It builds a robot inverse model, which is tangentially related to the 'model' component of model-based RL, but it does not address reinforcement learning or policy optimization, which warrants a low score (2.0). No expert authors from the specified list are present.

Keywords

Piece-wise Polynomial, Industrial time-series, Manipulator robots, Inverse model, System identification, Anomaly detection, Parsimonious models

145. Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning [FAIL]

Score: 4.0 / 26.5

Authors: Yahan Yu, Noa Nakanishi, Fei Cheng

Published: 2026-05-27

TL;DR: This study investigates whether anthropomorphic markers are necessary for LLM reasoning, concluding they are often superficial cues that can be suppressed without degrading performance.

Abstract

Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and alternatively. Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, which leaves a risk of overthinking associated with redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These results suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.
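A token-level intervention of the kind mentioned in this abstract is commonly implemented by masking logits before decoding. The sketch below assumes that approach with a toy vocabulary; the marker list comes from the abstract, while the function names and the greedy decoder are illustrative assumptions, not the paper's implementation.

```python
import math

# Marker words named in the abstract.
MARKERS = {"wait", "hmm", "alternatively"}

def suppress_markers(logits, vocab, markers=MARKERS):
    """Return a copy of the logits with marker tokens set to -inf,
    so they can never be selected."""
    return [-math.inf if tok.strip().lower() in markers else logit
            for tok, logit in zip(vocab, logits)]

def greedy_pick(logits, vocab):
    """Pick the highest-scoring token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]

vocab = ["Wait", "so", "hmm", "check"]
logits = [3.0, 1.0, 2.5, 2.0]
unmasked = greedy_pick(logits, vocab)
masked = greedy_pick(suppress_markers(logits, vocab), vocab)
```

In this toy case the model falls back to "check" once the markers are masked, which mirrors the abstract's observation that verification can continue without the marker words.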

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on anthropomorphic reflection markers in Large Language Model (LLM) reasoning, analyzing their necessity through prompt-level and token-level interventions. This content does not align with World Models, MLLM, MultiModal, or model-based RL, resulting in 0.0 scores for these keywords. 'Unify Models' receives a minimal score (2.0) because the paper involves model analysis, but it does not address model unification or architecture integration. Consequently, the weighted total score is far below the dynamic passing threshold of 26.5, indicating low relevance to the specified keyword track.

Keywords

Large Language Models, Reasoning, Reflection Markers, Anthropomorphic Markers, Prompt Intervention, Model Performance, Token-level Intervention, Surface Cues

146. REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis [FAIL]

Score: 4.0 / 26.5

Authors: Ruohan Lei, Jianxin Gao, Wanli Peng, Huimin Pei

Published: 2026-05-27

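TL;DR: This paper proposes a post-training representation editing method that improves cross-domain linguistic steganalysis by editing intermediate representations, with no architecture changes or parameter updates.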

Abstract

In real-world scenarios of linguistic steganalysis, tested texts usually come from unseen domains with different vocabularies, topics, writing styles, and steganographic generation patterns, which can significantly degrade the detection performance. Although existing cross-domain steganalysis methods can effectively alleviate this problem through distribution alignment, domain-invariant feature learning, etc., the detection performance is not satisfactory. In this paper, we propose a post-training representation editing method for cross-domain linguistic steganalysis. Specifically, the detector is first trained on source-domain data, and then the feature extractor and classifier are kept frozen, and the intermediate representations are deterministically edited before classification. For domain adaptation, we construct a domain-offset vector from marginal source and target representations. For domain generalization, we derive a source-domain cover-to-stego direction to guide sample-specific editing. Experimental results show that compared with the advanced methods, the proposed method can achieve high cross-domain detection performance, especially in terms of F1-score, while requiring no architecture modification or parameter updates after source-domain training.
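The domain-adaptation branch can be sketched as a mean shift applied to frozen features. The sketch assumes the domain-offset vector is the difference between the target and source mean representations and that editing subtracts it. The abstract only says the vector is built "from marginal source and target representations", so both choices, and the `strength` parameter, are assumptions.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def domain_offset(source_reprs, target_reprs):
    """Assumed offset: target mean minus source mean."""
    mu_s, mu_t = mean_vector(source_reprs), mean_vector(target_reprs)
    return [t - s for s, t in zip(mu_s, mu_t)]

def edit(representation, offset, strength=1.0):
    """Shift a target representation back toward the source-domain mean.
    The frozen classifier then sees features closer to its training distribution."""
    return [x - strength * o for x, o in zip(representation, offset)]

source = [[0.0, 0.0], [2.0, 2.0]]
target = [[4.0, 1.0], [6.0, 3.0]]
offset = domain_offset(source, target)
edited = [edit(r, offset) for r in target]
```

The edit is deterministic and touches no model weights, which is consistent with the abstract's requirement that the feature extractor and classifier stay frozen.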

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

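Scoring rationale: The paper focuses on cross-domain linguistic steganalysis using post-training representation editing. The keywords concern multimodality, world models, RL, and related topics, which have no direct connection to the paper's text-only content. 'Unify Models' is slightly more relevant (it touches on unifying representations), but overall relevance is very low and the weighted total is far below the dynamic passing score.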

Keywords

Cross-Domain Linguistic Steganalysis, Post-Training Representation Editing, Domain Adaptation, Domain Generalization, Feature Representation, Text Analysis, Steganographic Detection, Source-Domain Cover-to-Stego

147. Global Policy-Space Response Oracles for Two-Player Zero-Sum Games [FAIL]

Score: 4.0 / 26.5

Authors: Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang, Xudong Zhang

Published: 2026-05-27

TL;DR: The paper proposes Global PSRO, a DRL-based algorithm that minimizes population exploitability to efficiently approximate Nash equilibria in two-player zero-sum games with fewer policy iterations.

Abstract

The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration-selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.
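The exploitability measure that this line of work builds on can be computed exactly in a small matrix game. The sketch below computes the standard exploitability of a single strategy profile in a two-player zero-sum game. It is not the paper's Population Exploitability, which is defined over a restricted strategy set and is not spelled out in the abstract.

```python
def exploitability(payoff, row_strategy, col_strategy):
    """payoff[i][j] is the row player's payoff; the column player receives its negative.
    Returns the total gain available to both players from best-responding."""
    # Best the row player can do against the column strategy.
    row_br = max(sum(p * q for p, q in zip(row, col_strategy)) for row in payoff)
    # Expected row payoff for each column action under the row strategy;
    # the column player's best response minimizes it.
    col_values = [sum(payoff[i][j] * row_strategy[i] for i in range(len(payoff)))
                  for j in range(len(payoff[0]))]
    col_br = min(col_values)
    return row_br - col_br

# Rock-paper-scissors from the row player's perspective.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]
```

The uniform profile is the Nash equilibrium of rock-paper-scissors and has zero exploitability, whereas two players who always choose rock can each gain 1 by deviating, giving a value of 2.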

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper focuses on deep reinforcement learning and game theory (the PSRO framework for zero-sum games). It has no content on multimodality, LLMs, Unify Models, or World Models. While it is RL-related, it concentrates on policy expansion and exploitability estimation rather than on learning environment dynamics for planning, so 'model-based RL' has low relevance (2.0). The total weighted score is 4.0, below the dynamic passing score of 26.5. None of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Policy-Space Response Oracles, Two-Player Zero-Sum Games, Population Exploitability, Deep Reinforcement Learning, Nash Equilibria, Strategy Population Expansion, Conditional Neural Networks

148. Entropy Distribution as a Fingerprint for Hallucinations in Generative Models [FAIL]

Score: 4.0 / 26.5

Authors: Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, Niraj Kumar

Published: 2026-05-27

TL;DR: This paper proposes a lightweight, single-pass black-box method called Calibrated Entropy Score (CES) that detects hallucinations in generative models by analyzing the distribution of token-level entropies rather than just mean perplexity.

Abstract

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes, or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky-Kiefer-Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
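The ingredients named in this abstract (token-level entropies, their mean and maximum, and a reference CDF) can be sketched in a few lines. The way the two calibrated signals are combined below, taking the larger percentile, is a hypothetical stand-in. The abstract does not state the actual CES combination rule, and the reference values are made up.

```python
import math
from bisect import bisect_right

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def empirical_cdf(reference, x):
    """Fraction of reference values that are <= x."""
    ref = sorted(reference)
    return bisect_right(ref, x) / len(ref)

def entropy_score(token_dists, ref_means, ref_maxes):
    """Calibrate the mean and max entropy of a generation against reference
    samples, then combine them (assumed rule: take the larger percentile)."""
    ents = [token_entropy(d) for d in token_dists]
    mean_e, max_e = sum(ents) / len(ents), max(ents)
    return max(empirical_cdf(ref_means, mean_e), empirical_cdf(ref_maxes, max_e))

# Made-up reference statistics from generations assumed to be correct.
ref_means = [0.05, 0.1, 0.2, 0.3]
ref_maxes = [0.1, 0.2, 0.4, 0.6]

confident = [[1.0, 0.0], [0.99, 0.01]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
```

Mapping raw entropies through a reference CDF puts scores on a common 0 to 1 scale, which is what makes them comparable across models with different entropy ranges.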

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

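Scoring rationale: The paper focuses on detecting hallucinations in large language models (LLMs) by analyzing entropy distributions, and proposes a single-forward-pass black-box detection method (CES). It does not address unified model architectures (Unify Models), world models for environment simulation (World Models), multimodal capabilities (MLLM/MultiModal), or model-based reinforcement learning (model-based RL), so its relevance to the given keywords is very low. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points apply. The weighted total is about 4.0, far below the dynamic passing score of 26.5.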

Keywords

Hallucinations, Generative Models, Entropy Distribution, Token-level Entropies, Calibrated Entropy Score, Large Language Models, Statistical Hypothesis Test, Black-box Access

149. Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values [FAIL]

Score: 4.0 / 26.5

Authors: Seongjun Lee, Suwan Yoon, Changhee Lee

Published: 2026-05-27

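TL;DR: This paper proposes the ShaQ framework, which uses Shapley values to localize the uncertain spans in an LLM's input, aiming to improve reliability in high-stakes decision-making through targeted clarification.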

摘要翻译

随着大语言模型（LLMs）日益融入高风险决策，可靠量化不确定性的能力已成为确保安全性与信任的关键要求。然而，当前的不确定性量化方法主要在输出层面进行，通常无法区分不确定性是源于模型的知识缺失，还是源于用户输入中的歧义。虽然以输入为中心的不确定性量化最近已成为一个有前景的方向，但它仍然相对研究不足，且通常依赖于粗略的输入层面信息。因此，用户获得的通常是标量不确定性得分，这些得分几乎无法提供可操作的指导，说明输入中哪些部分应被澄清以提高可靠性。为了解决这一局限性，我们提出基于 Shapley（沙普利）的输入不确定性量化（ShaQ），这是一个用于输入诱导不确定性片段级归因的框架。该方法将输入中的模糊片段建模为合作博弈中的参与者，并利用 Shapley 值量化其贡献，该值定义为通过澄清每个片段联盟所获得的条件熵边际减少量的加权平均。与现有的输入层面方法不同，我们的方法捕捉了片段之间的复杂交互，并提供了一种原则性的分解，其中单个归因之和恰好等于总的输入诱导不确定性。我们在 AmbigQA 和 AmbiEnt 基准上评估了 ShaQ，其在歧义检测方面达到了最先进的性能。我们进一步在 MediTOD 上展示了其效用，表明 ShaQ 能够定位未充分指定的临床话语，并在高风险环境中促进人机协作。总体而言，ShaQ 改进了不确定性估计，并为有针对性的输入澄清提供了可操作的见解。

Abstract

As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty has become a critical requirement for safety and trust. However, current uncertainty quantification methods primarily operate at the output level, often failing to distinguish whether uncertainty arises from the model's lack of knowledge or from ambiguity in the user's input. While input-centric uncertainty quantification has recently emerged as a promising direction, it remains relatively underexplored and typically relies on coarse, input-level information. Consequently, users are provided with scalar uncertainty scores that offer little actionable guidance on which parts of the input should be clarified to improve reliability. To address this limitation, we propose Shapley-based input uncertainty Quantification (ShaQ), a framework for span-level attribution of input-induced uncertainty. Our approach models ambiguous spans in the input as players in a cooperative game and quantifies their contributions using Shapley values, defined via the weighted average of marginal reductions in conditional entropy obtained by clarifying each span coalition. Unlike existing input-level approaches, our formulation captures complex interactions among spans and provides a principled decomposition in which individual attributions sum exactly to the total input-induced uncertainty. We evaluate ShaQ on the AmbigQA and AmbiEnt benchmarks, where it achieves state-of-the-art performance in ambiguity detection. We further demonstrate its utility on MediTOD, showing that ShaQ can localize under-specified clinical utterances and facilitate human-AI collaboration in high-stakes settings. Overall, ShaQ improves uncertainty estimation and provides actionable insights for targeted input clarification.
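The span-level attribution described in this abstract rests on the standard Shapley value, which can be computed exactly for a handful of spans. In the sketch below the coalition value is a made-up table of entropy reductions obtained by clarifying a set of spans. How ShaQ estimates these reductions from an LLM is not described in the abstract, so the numbers and span names are purely illustrative.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a number."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            for subset in combinations(others, r):
                s = frozenset(subset)
                # Probability that exactly the coalition `s` precedes p
                # in a uniformly random ordering of the players.
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi

# Hypothetical entropy reduction from clarifying each set of ambiguous spans.
REDUCTION = {
    frozenset(): 0.0,
    frozenset({"it"}): 0.2,
    frozenset({"bank"}): 0.5,
    frozenset({"it", "bank"}): 1.0,  # clarifying both helps more than the sum
}
phi = shapley_values(["it", "bank"], lambda s: REDUCTION[s])
```

The attributions sum to the value of the full coalition (1.0 here), which is the efficiency property the abstract refers to when it says individual attributions sum exactly to the total input-induced uncertainty.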

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

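Scoring rationale: The paper's core is input uncertainty quantification for large language models (LLMs) with Shapley-value attribution; it has no direct technical connection to 'Unify Models', 'World Models', 'MultiModal', or 'model-based RL'. It builds on LLMs but shows no multimodal characteristics, so only MLLM receives a low score and the remaining keywords have a relevance of 0. The weighted total is about 4.0, far below the dynamic passing score of 26.5, indicating very low relevance to the specified research background.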

Keywords

Uncertainty Quantification, Large Language Models, Shapley Values, Input Uncertainty, Span-level Attribution, Ambiguity Detection, Human-AI Collaboration

150. LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning [FAIL]

Score: 4.0 / 26.5

Authors: Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

Published: 2026-05-27

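TL;DR: LegalGraphRAG builds a hierarchical legal graph and a multi-agent verification mechanism, improving the reliability and accuracy of legal reasoning and outperforming existing GraphRAG baselines.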

摘要翻译

基于图的检索增强生成（GraphRAG）通过将知识组织为关系图谱，改进了扁平文档检索，从而实现更连贯且有效的推理。然而，将其应用于法律推理等特定领域时面临关键挑战。（i）法律语料库具有异构性，包含来自案例、条文和解释的多粒度知识。扁平知识图谱难以充分区分事实细节、适用规则与抽象原则，从而限制了准确检索。（ii）可靠的法律判决要求透明且基于证据的推理。传统检索增强生成（RAG）直接将检索到的上下文传递给大语言模型（LLM）而不进行验证，导致推理过程不透明且易出错。为此，我们提出 LegalGraphRAG，一个专为可靠法律推理设计的框架。我们的方法引入了两个核心组件：一是分层法律图谱，通过分层组织法律资源以实现适当抽象级别的检索；二是用于可靠法律推理的多智能体系统，其中研究者（Researcher）负责检索候选证据，审计员（Auditor）严格依据源文档验证其有效性，裁决者（Adjudicator）综合验证后的证据集以做出最终判决。广泛实验表明，LegalGraphRAG 达到了最先进的性能，在准确且可信赖的法律分析方面优于现有的 GraphRAG 基线方法。我们的代码、数据集及实现细节可在 https://github.com/XMUDeepLIT/LegalGraphRAG 获取。

Abstract

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

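Scoring rationale: The paper's core is graph retrieval-augmented generation (GraphRAG) over legal graphs combined with a multi-agent system; it has no direct connection to world models, MLLM, multimodality, or model-based reinforcement learning (0). 'Unify Models' has low relevance (2.0): although RAG unifies retrieval and generation, this is not the architectural unification the keyword refers to.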

Keywords

LegalGraphRAG, Multi-Agent System, Graph Retrieval-Augmented Generation, Legal Reasoning, Hierarchical Legal Graph, Evidence Verification, Reliable Judgment

151. Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification [FAIL]

Score: 4.0 / 26.5

Authors: Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen

Published: 2026-05-27

TL;DR: This paper proposes a sentence-level trustworthiness analysis and rectification (STAR) defense framework to mitigate cooperative attacks in LLM-based multi-agent systems, significantly improving task success rates.

摘要翻译

近年来，基于大语言模型的多智能体系统（MAS）迅速发展，该系统在协同决策和复杂问题解决方面表现出色。然而，MAS 中的恶意智能体可能会注入虚假信息以误导其他智能体并破坏系统性能，从而催生了一个新的研究方向，专注于 MAS 中的攻击机制和防御策略。先前研究大多假设恶意智能体独立行动，并研究相应的防御策略。然而，我们认为恶意智能体可能表现出协同行为，通过内部信息交换实现更有效的攻击。本文提出了一种自适应协同攻击框架，其中恶意智能体通过多轮交互自主协调并动态调整其攻击策略。此外，我们引入了句子级可信度分析与修正（STAR），这是一种防御框架，用于识别并修正智能体通信中句子层面的误导性信息。实验表明，协同攻击导致任务成功率的下降幅度显著高于独立攻击，相对下降了 5.34%。同时，STAR 有效缓解了协同和独立威胁，并将任务成功率平均提升了 36.76%。代码可在 https://github.com/smoooom/STAR 获取。

Abstract

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.

Score details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	2.0/10	4.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

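Scoring rationale: The paper mainly addresses security defenses for LLM-based multi-agent systems (MAS), proposing a sentence-level rectification framework against cooperative attacks. It does not involve building world models (World Models), designing unified model architectures (Unify Models), reinforcement learning (RL), or multimodal inputs (MultiModal). Although it uses large language models (counted under MLLM), the abstract only emphasizes sentence-level processing of text and provides no evidence of multimodality, so relevance to the given keywords is low across the board and the weighted total (4.0) is far below the dynamic passing score (26.5). No target expert authors were found.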

Keywords

LLM-based Multi-Agent Systems, Cooperative Attacks, Sentence-Level Rectification, Trustworthiness Analysis, Defense Framework, Misinformation, Task Success Rate

152. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information [FAIL]

Score: 4.0 / 26.5

Authors: Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo, Pei Wei, Jinjie Gu, Yixin Cao

Published: 2026-05-27

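TL;DR: This paper addresses the failure of reasoning models to abstain when information is insufficient. It proposes the Judge-Then-Solve framework, which uses reinforcement learning to substantially improve reliable abstention and inference efficiency.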

Abstract

We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
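
The control decision and the A@D metric described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `judge_then_solve`, the toy judge and solver, the record fields, and the exact definition of A@D (abstentions divided by detected-insufficient cases) are all assumptions made for the example.

```python
def judge_then_solve(question, judge, solve):
    """Commit to an answerability judgment first; solve only if answerable."""
    if not judge(question):
        return "ABSTAIN"  # terminate early instead of reasoning further
    return solve(question)

def abstention_at_detection(records):
    """Assumed A@D: share of detected-insufficient cases that end in abstention."""
    detected = [r for r in records if r["detected"]]
    if not detected:
        return None  # metric is undefined when nothing was detected
    return sum(1 for r in detected if r["abstained"]) / len(detected)

# Toy stand-ins for a model's judgment and solver (hypothetical).
toy_judge = lambda q: "x =" in q
toy_solve = lambda q: "solved"

print(judge_then_solve("What is x + 1?", toy_judge, toy_solve))
print(judge_then_solve("Given x = 2, what is x + 1?", toy_judge, toy_solve))

records = [
    {"detected": True, "abstained": True},
    {"detected": True, "abstained": False},  # the detection-to-abstention gap
    {"detected": False, "abstained": False},
    {"detected": True, "abstained": True},
]
print(abstention_at_detection(records))
```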

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	1.0/10	2.0

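Scoring rationale: The paper focuses on the detection-to-abstention gap of reasoning models under insufficient information and proposes the JTS framework. Relevance to the scoring keywords is very low: only 'Unify Models' (1.0), for its unified pipeline, and 'model-based RL' (1.0), for its use of reinforcement learning, score at all; the rest ('World Models', 'MLLM', 'MultiModal') are entirely unrelated. The weighted total of 4.0 is far below the dynamic passing score of 26.5. The author list contains none of the designated experts (Yang Shi et al.), so no bonus applies.

Keywords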

Detection-to-abstention gap, Reasoning models, Insufficient information, Judge-Then-Solve, Reinforcement learning, Answerability commitment, Inference efficiency

153. On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective [FAIL]

Score: 4.0 / 26.5

Authors: Zhi Zhou, Ming Yang, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, Yu-Feng Li

Published: 2026-05-27

TL;DR: This paper proposes a theoretical framework for Test-Time Adaptation learnability based on recovery complexity, establishing bounds on adaptation time under non-stationary distribution shifts without requiring labeled data.

Abstract

Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled data. Despite its empirical success, the learnability of TTA under non-stationary streams remains unexplored. A key challenge is the lack of a principled theoretical framework that simultaneously aligns with the TTA objective and captures both continuously evolving distribution shifts and intrinsic information constraints. To address this gap, we propose the first theoretical framework for studying the learnability of TTA and introduce $(ε,δ)$-Recovery Complexity and $(ε,ρ)$-TTA Learnability. Recovery complexity measures the post-shift time needed to maintain excess risk below a target level with high probability, and is further extended to TTA learnability, which measures the long-term reliability of TTA. Within this framework, we introduce a novel discrete surrogate for non-stationary test streams, enabling a unified and tractable analysis of both gradual and abrupt shifts. We derive order-wise matching lower and upper bounds on recovery complexity, revealing fundamental limits of TTA and an intrinsic adaptivity-information trade-off. These results provide unified learnability guarantees for TTA that complement regret-based analyses.
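
As a rough intuition for recovery complexity, one can measure on a single excess-risk trace how long after a shift the risk takes to drop below a target level and stay there. The sketch below is an empirical, single-trajectory simplification under assumed names; it ignores the high-probability (δ) part of the paper's definition and is not the paper's formal quantity.

```python
def empirical_recovery_time(excess_risk, eps):
    """Smallest post-shift step t with excess_risk[s] <= eps for all s >= t.

    Returns None if the trace never settles below eps.
    """
    recovery = None
    # Walk backwards: extend the "settled" suffix while risk stays below eps.
    for t in range(len(excess_risk) - 1, -1, -1):
        if excess_risk[t] <= eps:
            recovery = t
        else:
            break
    return recovery

# Hypothetical excess-risk trace, indexed by steps since the distribution shift.
trace = [0.9, 0.5, 0.2, 0.3, 0.1, 0.05]
print(empirical_recovery_time(trace, eps=0.25))
```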

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on the theoretical learnability of Test-Time Adaptation (TTA) under non-stationary streams, introducing recovery complexity metrics. It has no direct connection to Multimodal data, Large Language Models (MLLM), World Models, or Reinforcement Learning. The term 'unified' refers to the analysis framework for distribution shifts, not the 'Unify Models' paradigm. No matching expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found in the author list. The weighted total score is 4.0, which is below the dynamic passing score of 26.5.

Keywords

Test-Time Adaptation, Recovery Complexity, Non-stationary Streams, Learnability, Distribution Shifts, Excess Risk, Theoretical Framework, Tractable Analysis

154. Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG [FAIL]

Score: 4.0 / 26.5

Authors: Pin Qian, Su Wang, Xiaoyuan Wang, Yihang Chen, Wenxuan Xu, Qiaolin Yu, Shuhuai Lin, Sipeng Zhang, Junxian You, Xinpeng Wei

Published: 2026-05-27

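TL;DR: This paper targets citation laundering in cited RAG and introduces the FORCEBENCH benchmark to evaluate evidence-force calibration. It finds that standard prompting is insufficient for this problem, while explicit warrant-strength prompting helps but remains imperfect.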

Abstract

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8% to 36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate monotonicity violation rate, MVR, of 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
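
The monotonicity check described above can be sketched as a simple rate over claim pairs. The function name and the example scores are hypothetical, and counting ties as violations is an assumption; the abstract only states that a calibrated evaluator should score the evidence-calibrated claim higher.

```python
def monotonicity_violation_rate(pairs):
    """pairs: list of (score_for_calibrated_claim, score_for_force_raised_claim).

    A pair violates monotonicity when the calibrated claim is not scored
    strictly higher (ties count as violations; this is an assumption).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    violations = sum(1 for calibrated, raised in pairs if calibrated <= raised)
    return violations / len(pairs)

# Hypothetical evaluator scores for four claim pairs sharing a cited passage.
pairs = [(0.9, 0.4), (0.6, 0.8), (0.7, 0.7), (0.8, 0.2)]
print(monotonicity_violation_rate(pairs))
```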

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	1.0/10	2.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

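Scoring rationale: The paper focuses on evidence-force calibration and citation laundering in cited RAG and introduces the FORCEBENCH benchmark. The scoring keywords (unified models, world models, multimodal large language models, model-based reinforcement learning) mainly concern model architecture and reinforcement learning, which are largely unrelated to this paper's RAG-evaluation topic. Although 'modality' appears among the benchmark's axes, the core of the work is not MLLM or multimodal-model research, so relevance is very low. The author list contains none of the designated experts.

Keywords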

Cited RAG, Evidence-Force Calibration, Citation Laundering, FORCEBENCH, Claim Warranting, Evaluation Benchmark, Monotonicity Violation

155. Learning Compositional Latent Structure with Vector Networks [FAIL]

Score: 4.0 / 26.5

Authors: Niclas Pokel, Benjamin F. Grewe

Published: 2026-05-27

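TL;DR: This paper proposes Vector Networks, which achieve compositional generalization in deep learning by composing weight atoms; it does not involve multimodal integration, world modeling, or reinforcement learning.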

Abstract

Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making it difficult to selectively reuse or adapt parts of them when a familiar structure appears in novel combinations. We introduce the Vector Network (VN), a hierarchical recurrent architecture in which each layer replaces a fixed weight matrix with a library of reusable rank-1 weight atoms. For each input, VN minimizes a layer-local energy to infer a sparse set of active weight atoms and their coefficients, jointly constrained by bottom-up input reconstruction and top-down feedback consistency. These weight atom coefficients then compose an input-specific low-rank weight matrix for that sample. After convergence, slow learning updates only the selected weight atoms through local residual signals scaled by the inferred coefficients. We evaluate VN on four compositional benchmarks spanning 1D signals, 2D spatial decoding, N-body dynamics, and compositional MNIST. VN matches strong baselines in distribution while often achieving out-of-distribution error about an order of magnitude lower when familiar factors must be recombined in novel ways. Vector networks thus make compositional generalization a structural property of the architecture and inference process rather than a brittle byproduct of fitting many behaviors into one shared dense parameter substrate.
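
The composition step described above, where sparse coefficients over rank-1 weight atoms yield an input-specific low-rank matrix, can be written as W = sum_k c_k u_k v_k^T. The sketch below shows only that composition in plain Python; the atom values and coefficients are made up, and the energy-minimizing inference that would produce the coefficients is not modeled.

```python
def compose_weight(atoms, coeffs):
    """Build W = sum_k c_k * outer(u_k, v_k) from rank-1 atoms (u_k, v_k)."""
    rows, cols = len(atoms[0][0]), len(atoms[0][1])
    W = [[0.0] * cols for _ in range(rows)]
    for (u, v), c in zip(atoms, coeffs):
        if c == 0.0:
            continue  # inactive atom: sparse coefficients skip most of the library
        for i in range(rows):
            for j in range(cols):
                W[i][j] += c * u[i] * v[j]
    return W

# Hypothetical library of three rank-1 atoms for a 2x2 layer.
atoms = [
    ([1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [3.0, 0.0]),
    ([1.0, 1.0], [0.0, 1.0]),
]
coeffs = [2.0, 0.0, -1.0]  # assumed output of per-input inference; atom 2 is inactive

W = compose_weight(atoms, coeffs)
print(W)
```

With two active atoms the resulting matrix has rank at most two, which is the sense in which the per-sample weights are low-rank.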

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

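Scoring rationale: The core of the paper is the Vector Network (VN) architecture for compositional generalization in deep learning, which reuses weight atoms to overcome the limitations of shared weight matrices. The multimodal large language models (MLLM), world models, and model-based reinforcement learning named by the keywords do not appear in the paper's background, method, or experiments, so relevance is very low; only 'Unify Models' has a weak connection, through the reuse of weight atoms.

Keywords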

Vector Networks, Compositional Generalization, Hierarchical Recurrent Architecture, Weight Atoms, Low-rank Weight Matrix, Input Reconstruction, Compositional Latent Structure

156. An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding [FAIL]

Score: 4.0 / 26.5

Authors: J. Vijayavallabh

Published: 2026-05-27

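TL;DR: This paper empirically audits the k-NAF budget mechanism in Anchored Decoding and finds that KL spend is typically below the sequence-level budget, and that high proxy ratios tend to stem from evaluation artifacts rather than actual budget exhaustion.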

Abstract

We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately 8,500 randomized executions across six prompt classes) and (ii) an adaptive prompt-search procedure targeting high proxy spend ratios. On the fixed workload, mean cumulative KL spend remains far below the sequence-level budgets K in {600, 1000}, and an empirical Bernstein-style proxy stays below K for every class; surface-overlap diagnostics (ROUGE-L and 5-gram Jaccard) are correspondingly small. Adaptive search increases the proxy spend ratio but does not produce clear budget exhaustion. On a held-out copyright-domain workload at k = 3, several prompts exhibit proxy ratios above 1 under early-stopped evaluations with small realized sample sizes; re-evaluating the same prompts with larger allocation reduces the proxy ratio to the range [0.26, 0.40] under comparable mean spend, consistent with proxy artifacts rather than per-trajectory budget failures.
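
One of the surface-overlap diagnostics named above, 5-gram Jaccard, is the size of the intersection of two texts' 5-gram sets divided by the size of their union. The sketch below assumes whitespace tokenization and hypothetical function names; the paper's tokenizer and preprocessing are not stated in the abstract.

```python
def ngram_set(tokens, n=5):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_jaccard(a, b, n=5):
    """Jaccard similarity of the n-gram sets of two whitespace-tokenized texts."""
    sa, sb = ngram_set(a.split(), n), ngram_set(b.split(), n)
    if not sa and not sb:
        return 0.0  # both texts shorter than n tokens: treat as no overlap
    return len(sa & sb) / len(sa | sb)

generated = "the quick brown fox jumps over the lazy dog"
reference = "the quick brown fox jumps over a sleepy cat"
print(ngram_jaccard(generated, reference))
```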

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

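Scoring rationale: The paper focuses on auditing the k-NAF budget mechanism in Anchored Decoding and evaluates it with text-generation metrics such as KL divergence and ROUGE-L. It does not involve multimodal processing (MultiModal, MLLM), environment-dynamics modeling (World Models), reinforcement learning frameworks (model-based RL), or strategies for unifying model architectures (Unify Models), so relevance is very low. The author, J. Vijayavallabh, is not on the designated expert list, so no bonus applies. The weighted total is about 4.0, far below the dynamic passing score of 26.5.

Keywords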

Anchored Decoding, k-NAF, Budget Accounting, KL Divergence, Proxy Spend Ratio, Prompt Classes, ROUGE-L, Copyright Domain

157. T-GINEE: A Tensor-Based Multilayer Graph Representation Learning [FAIL]

Score: 4.0 / 26.5

Authors: Maolin Wang, Ziting Mai, Xuhui Chen, Zhiqi Li, Tianshuo Wei, Yutian Xiao, Wenlin Zhang, Wanyu Wang, Ruocheng Guo, Haoxuan Li, Zenglin Xu, Xiangyu Zhao

Published: 2026-05-27

TL;DR: T-GINEE proposes a tensor-based statistical framework to explicitly model inter-layer correlations in multilayer networks, overcoming limitations of independent layer treatment.

Abstract

Traditional network analysis focuses on single-layer networks, but real-world systems often form multilayer networks with multiple relationship types. However, existing methods typically fail to capture complex inter-layer dependencies by treating layers independently or aggregating them. To address this, we propose T-GINEE (Tensor-Based Generalized Multilayer-graph Estimating Equation), a statistical regularization framework combining tensor-based generalized estimating equations with task-specific loss to model cross-network correlations explicitly. Key innovations include: (1) CP tensor decomposition capturing structural dependencies via shared latent factors; (2) a generalized estimating equation framework modeling inter-layer correlations through working covariance matrices; and (3) a flexible link function accommodating characteristics like sparsity. Our theoretical analysis establishes consistency and asymptotic normality under mild conditions. Extensive experiments on synthetic and real-world datasets validate T-GINEE's effectiveness for multilayer network analysis.
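
The CP (CANDECOMP/PARAFAC) decomposition mentioned in item (1) expresses a node-by-node-by-layer tensor through shared latent factors: A[i][j][l] = sum_r U[i][r] * V[j][r] * W[l][r]. The sketch below shows only this reconstruction with made-up factor matrices; how T-GINEE estimates the factors, and its link function and estimating equations, are not modeled here.

```python
def cp_reconstruct(U, V, W):
    """Rebuild a 3-way tensor from CP factors: A[i][j][l] = sum_r U[i][r]*V[j][r]*W[l][r]."""
    rank = len(U[0])
    return [
        [
            [sum(U[i][r] * V[j][r] * W[l][r] for r in range(rank))
             for l in range(len(W))]
            for j in range(len(V))
        ]
        for i in range(len(U))
    ]

# Hypothetical rank-2 factors: 2 nodes (U, V) and 2 layers (W).
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 1.0], [2.0, 0.0]]  # per-layer loadings on the shared latent factors

A = cp_reconstruct(U, V, W)
print(A[0][0], A[1][1], A[0][1])
```

Because every layer reuses the same node factors U and V and differs only in its row of W, the layers are coupled through shared structure rather than fitted independently.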

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on multilayer graph representation learning using tensor-based statistical methods. It does not address multimodal large language models (MLLM), world models, reinforcement learning, or multimodal data integration. While it unifies layers in a graph structure, it does not align with the 'Unify Models' context of unified AI architectures. Thus, relevance to the specific keywords is minimal.

Keywords

Tensor-Based, Multilayer Graph, Representation Learning, Generalized Estimating Equations, CP Tensor Decomposition, Inter-layer Dependencies, Statistical Regularization, Cross-network Correlations

158. Dynamic Topic Modeling with a Higher-Order Hypergraphical Representation [FAIL]

Score: 4.0 / 26.5

Authors: Hanjia Gao, Hanwen Ye, Qing Nie, Annie Qu

Published: 2026-05-27

TL;DR: This paper proposes a dynamic topic modeling framework utilizing hypergraph representations to decouple word occurrence from repetition, demonstrating improved performance on text corpora analysis.

Abstract

Dynamic topic modeling is widely used to analyze evolving trends in scientific literature, medical records, and social media. Traditional topic models represent each topic through a single probability vector on the multinomial simplex and implicitly couple word occurrence and repetition within one probabilistic mechanism. However, this formulation restricts the dependence structure among words and overlooks informative higher-order interactions, particularly in dynamic corpora with overlapping semantics. To address these limitations, we introduce a hypergraph representation of text where each document is modeled as a hyperedge connecting all co-occurring words, with repetition intensities encoded as node weights. This representation naturally separates word occurrence from repetition and induces a novel hypergraph-based multinomial distribution with a nonlinear normalization depending on the observed word set of each document. Building on this likelihood, we develop a dynamic topic modeling framework via structured low-rank factorizations with explicit temporal regularization on topic-word profiles. Moreover, we establish local convergence guarantees and derive non-asymptotic error bounds despite the intrinsic nonconvexity induced by bilinear factorization and document-specific nonlinear normalization. Numerical experiments on synthetic data and an application to the International Conference on Learning Representations (ICLR) corpus demonstrate consistent improvements over existing multinomial-based topic models.
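
The document-as-hyperedge representation described above can be sketched by separating which words occur (the hyperedge) from how often they repeat (the node weights). The function name, the lowercase whitespace tokenization, and the use of raw counts as repetition intensities are assumptions for illustration only.

```python
from collections import Counter

def document_to_hyperedge(text):
    """Return (hyperedge, node_weights) for one document.

    hyperedge: the set of distinct words that co-occur in the document.
    node_weights: per-word repetition counts (assumed form of "repetition intensity").
    """
    counts = Counter(text.lower().split())
    return frozenset(counts), dict(counts)

hyperedge, weights = document_to_hyperedge("Topic models model topic drift")
print(sorted(hyperedge))
print(weights)
```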

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

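Scoring rationale: The paper focuses on dynamic topic modeling with a hypergraph representation and falls within statistical natural language processing. It has no direct connection to world models, multimodal large language models, multimodal learning, or model-based reinforcement learning (score 0). Although the paper unifies word occurrence and repetition at the level of the probabilistic mechanism (Unify Models), this is an internal unification within statistical modeling rather than a modern unified-architecture model, so relevance is low (score 2.0). The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus applies.

Keywords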

Dynamic Topic Modeling, Hypergraphical Representation, Text Analysis, Low-Rank Factorization, Temporal Regularization, Word Occurrence, Probabilistic Mechanism

159. Geometry of Relaxed Fair Regression: A Unified Framework for Aware and Unaware Settings [FAIL]

Score: 4.0 / 26.5

Authors: M. Generali Lince, V. Divol, R. Flamary, S. Gaucher, P. Loiseau

Published: 2026-05-27

TL;DR: This paper proposes a unified optimal transport framework for fair regression that balances accuracy and demographic parity in both aware and unaware settings regarding sensitive attributes.

Abstract

Fairness-accuracy trade-offs are a central concern in the deployment of fairness-aware machine learning methods. When sensitive attributes are unavailable at inference time (the so-called unawareness setting), principled methods for obtaining accurate predictions under relaxed fairness constraints are largely missing. In this work, we address this gap by formulating regression under a demographic parity penalty as an optimal transport problem. Our framework unifies both the aware and unaware settings and characterizes optimal prediction functions via optimal transport maps, under both squared Wasserstein-2 and Total Variation penalties. These results reveal that the choice of penalty reflects fundamentally different fairness philosophies: the Wasserstein penalty induces a smooth, population-wide compromise, while Total Variation enforces exact parity for a subset of individuals. Building on these theoretical characterizations, we propose an algorithm that is simple to implement, computationally efficient, and consistently matches or outperforms state-of-the-art baselines on real-world benchmarks.
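
To make the squared Wasserstein-2 penalty concrete: for one-dimensional predictions and two groups of equal size, the optimal transport plan matches sorted values, so the squared distance is the mean squared difference of the sorted samples. The sketch below computes only that quantity on made-up predictions; it is not the paper's algorithm, and the equal-size restriction is a simplification.

```python
def squared_w2_1d(xs, ys):
    """Squared Wasserstein-2 distance between two equal-size 1D empirical samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    # In 1D, sorting both samples gives the optimal (monotone) coupling.
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Hypothetical regression outputs for two demographic groups.
preds_group_a = [1.0, 3.0, 2.0]
preds_group_b = [2.0, 4.0, 3.0]
print(squared_w2_1d(preds_group_a, preds_group_b))
```

A value of zero would mean the two groups' prediction distributions coincide, which corresponds to exact demographic parity for this pair of samples.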

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on fairness-aware regression via optimal transport, unifying aware/unaware settings regarding sensitive attributes. It has no conceptual connection to World Models, MLLM, MultiModal data, or model-based RL. 'Unify Models' receives a low score (2.0) due to the word 'Unified' in the title, but the context is fairness settings, not model architecture unification.

Keywords

Fair Regression, Unified Framework, Aware and Unaware Settings, Optimal Transport, Demographic Parity, Sensitive Attributes, Fairness-Accuracy Trade-offs

160. Temporal Hyperbolic Graph Representation Learning for Scale-Free Internet Routing and Delay Prediction [FAIL]

Score: 4.0 / 26.5

Authors: Yi-Ling Kuo, Hao-Yu Tien, Shih-Yu Tsai

Published: 2026-05-27

TL;DR: The paper proposes HERMIT, a hybrid framework combining hyperbolic temporal graph neural networks and random forests to predict Internet round-trip time and routing links, achieving superior performance over baselines on real-world datasets.

Abstract

Predicting Internet round-trip time (RTT) is critical for routing optimization, quality-of-service (QoS) provisioning, and traffic engineering, yet remains challenging due to long-term temporal dependencies, evolving routing dynamics, and heavy-tailed latency distributions. While Temporal Graph Neural Networks (TGNNs) can model evolving network topologies, most existing approaches operate in Euclidean space, which poorly captures the hierarchical and scale-free structure of Internet routing graphs. Hyperbolic geometry provides a more suitable representation space. We propose HERMIT (Hyperbolic Edge-aware RTT Modeling via Integrated Topology), a hybrid framework combining a hyperbolic manifold-preserving temporal GNN with a Random Forest regressor for joint link prediction and RTT prediction. Built on HMPTGN, HERMIT introduces RTT-aware edge features and a learnable edge encoder to improve modeling of evolving link states and routing behavior. The resulting hyperbolic node representations are combined with historical RTT statistics for robust latency prediction. We evaluate HERMIT on a large-scale real Internet dataset spanning 2015-2024. HERMIT consistently outperforms a strong Random Forest baseline using only historical RTT statistics, achieving a 6% RMSE improvement while reducing large errors on heavy-tailed samples. It also surpasses prior hyperbolic TGNN models, including HMPTGN and HTGN, in link prediction performance. These results demonstrate that combining hyperbolic temporal graph learning with tree-based regression provides a scalable solution for RTT prediction in real-world Internet topologies.
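
To illustrate why hyperbolic space suits hierarchical graphs, the sketch below computes the standard distance in the Poincaré ball model, where distances grow rapidly near the boundary. Using the Poincaré ball is an assumption made for illustration: the abstract does not say which hyperbolic model HERMIT or HMPTGN uses, and the embedding coordinates are made up.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    def sq_norm(x):
        return sum(c * c for c in x)

    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

origin = [0.0, 0.0]
core_router = [0.5, 0.0]     # hypothetical embedding near the center
edge_router = [0.95, 0.0]    # hypothetical embedding near the boundary

print(poincare_distance(origin, core_router))
print(poincare_distance(core_router, edge_router))
```

The second distance is larger than the first even though the Euclidean gap (0.45) is smaller than 0.5, which reflects how the ball leaves exponentially more room toward the boundary for tree-like structure.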

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

Scoring rationale: The paper focuses on temporal hyperbolic graph representation learning for Internet routing and RTT prediction. It proposes a hybrid framework unifying GNN and Random Forest (Unify Models: 2.0), but does not involve Multimodal Large Language Models, World Models, or Model-Based Reinforcement Learning methodologies. No listed expert authors are present in the author list.

Keywords

Temporal Hyperbolic Graph Representation Learning, Scale-Free Internet Routing, Delay Prediction, Round-Trip Time, Hyperbolic Geometry, Temporal Graph Neural Networks, Link Prediction, Random Forest Regressor

161. Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization [FAIL]

Score: 4.0 / 26.5

Authors: Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang

Published: 2026-05-27

TL;DR: This paper proposes an Information Bottleneck-driven Tree-based Policy Optimization method to balance exploration-exploitation in online reinforcement learning for large language models, achieving superior performance over GRPO baselines.

Abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance in complex reasoning tasks. However, they often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and sub-optimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to consistently maintain this balance during training, leading to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and utilizes a novel IB-guided tree sampling strategy that not only improves the efficiency of online sampling with 50% more trajectories under the same token budget, but also reuses the tree structure for effective IB-Score Monte Carlo estimation. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
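
A simple way to see why tree sampling can yield more trajectories per token budget is that trajectories sharing a prefix generate that prefix only once. The sketch below counts tokens under independent sampling versus a shared-prefix tree; it is an illustration of the general idea with made-up trajectories and hypothetical function names, not IB-TPO's IB-guided sampler, and it does not reproduce the 50% figure.

```python
def tokens_independent(trajectories):
    """Tokens generated when every trajectory is sampled from scratch."""
    return sum(len(t) for t in trajectories)

def tokens_with_shared_prefixes(trajectories):
    """Tokens generated when common prefixes are sampled once (trie node count)."""
    prefixes = set()
    for t in trajectories:
        for i in range(1, len(t) + 1):
            prefixes.add(tuple(t[:i]))  # each distinct prefix is one tree node
    return len(prefixes)

# Hypothetical reasoning trajectories, one token per step.
trajectories = [
    ["a", "b", "c"],
    ["a", "b", "d"],  # branches after the shared prefix a, b
    ["a", "e", "f"],  # branches after the shared prefix a
]
print(tokens_independent(trajectories))
print(tokens_with_shared_prefixes(trajectories))
```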

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	2.0/10	4.0

Scoring rationale: The paper focuses on Information Bottleneck-driven Policy Optimization for LLMs in an online RL setting. It does not address multimodal integration (MLLM, MultiModal), world modeling (World Models), or model architecture unification (Unify Models). While it involves Reinforcement Learning, the method is primarily policy optimization (Model-Free style) rather than learning environment dynamics (Model-Based), resulting in low relevance to the specific keywords except for a slight connection to the RL domain.

Keywords

Information Bottleneck, Tree-based Policy Optimization, Online Reinforcement Learning, Exploration-Exploitation Balance, Large Language Models, GRPO Baseline, Monte Carlo Estimation, Reasoning Diversity

162. Machine learning enables experimental access to photon-by-photon arrival times in scintillation detectors [FAIL]

Score: 4.0 / 26.5

Authors: Yuya Onishi, Ryosuke Ota, Fumio Hashimoto, Kibo Ote, Go Akamatsu, Hideaki Tashima, Taiga Yamaya

Published: 2026-05-27

TL;DR: This study employs an unsupervised deep learning framework to estimate individual photon arrival times in scintillation detectors, achieving improved timing resolution and photon classification without hardware modification.

Abstract

Scintillation detectors with excellent timing resolution enable more precise localization of radiation sources in positron emission tomography, leading to substantial improvements in diagnostic capability for diseases such as cancer and dementia. At the picosecond-scale timing precision required for such applications, detector performance is governed by the microscopic dynamics of scintillation photons generated within the detector and their subsequent detection processes. However, detector signals have conventionally been treated only as collective responses of many photons due to structural constraints inherent to photodetectors. In this study, we overcome this fundamental limitation using deep learning, enabling direct access to the timing information of individual photons. The proposed method estimates photon-by-photon arrival times directly from detector waveforms without requiring any modification to the detector structure; the method operates on an event-by-event basis without ground-truth labels by integrating an unsupervised learning framework with a physically informed detector-response model. Through comprehensive validation combining Monte Carlo simulation and experimental measurements across various detector configurations, we experimentally demonstrate improved timing resolution, visualize depth-of-interaction-dependent photon transport, and classify Cherenkov and scintillation photons based on the estimated photon-level timing information using a unified deep learning-based framework. These results provide experimental access to photon dynamics, bridging the gap between theoretical modeling and experimental observation, and they open a new data-driven pathway for discovery in detector physics and optimization.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

评分理由: The paper focuses on experimental physics (scintillation detectors) using deep learning for signal processing. While the abstract mentions a 'unified deep learning-based framework' and a 'detector-response model', these do not align with the specific AI/RL concepts implied by the keywords (e.g., World Models, MLLM, Model-Based Reinforcement Learning). There are no semantic overlaps with World Models, MLLM, or Multimodal AI. No expert authors from the provided list are present in the author list.

关键词

Scintillation detectors, Photon arrival times, Deep learning, Unsupervised learning, Timing resolution, Detector physics, Waveform analysis, Physically informed model

163. Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study [FAIL]

Score: 4.0 / 26.5

Authors: Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

Published: 2026-05-27

TL;DR: This paper empirically compares classical and quantum machine learning models for image recognition on MNIST. QSVM outperforms CSVM in accuracy, and QCNN matches CCNN accuracy with about 94% fewer parameters and about 75% less memory, but the quantum models incur higher runtime.

摘要翻译

计算机视觉领域的快速增长以及日益复杂的图像识别任务，暴露了经典机器学习模型在计算能力上的根本性局限，从而推动了将量子计算作为一种新兴范式进行探索的动力。本文在 MNIST 手写数字数据集上，对经典与量子机器学习模型在图像识别任务中进行了全面的基准测试研究。研究评估了两类模型：一是传统模型，包括经典支持向量机（CSVM）和量子支持向量机（QSVM）；二是深度神经网络模型，包括经典卷积神经网络（CCNN）和量子卷积神经网络（QCNN）。评估涵盖四个性能维度：分类准确率、计算运行时间、参数数量及内存需求。实验针对特征维度和样本大小作为变量进行，并在 CPU 和 GPU 执行环境下展开，从而提供一个受控的多维比较，以弥补先前研究的不足。对于基于支持向量机的模型，量子支持向量机（QSVM）在准确率上持续优于经典支持向量机（CSVM），在 1,000 个样本时分别达到约 0.90 和约 0.85，但计算成本更高。10 个量子位的特征数量以及 200 至 500 之间的样本大小，被确定为平衡准确率与运行时间的实际可行操作点。对于神经网络模型，经典卷积神经网络（CCNN）与量子卷积神经网络（QCNN）实现了相当的分类准确率，在 64 个特征和 60,000 个样本下均超过 0.96；然而，QCNN 在参数和内存效率方面显著更优，在较高特征数量下，其所需参数比 CCNN 减少约 94%，内存占用减少约 75%，但运行时间更高。总体而言，随着特征维度或样本规模的增加，量子模型在准确率上持续优于经典模型，且优势幅度逐渐扩大。

Abstract

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging new paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200 to 500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	2.0/10	4.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

评分理由: The paper focuses on Quantum Machine Learning (QML) benchmarking for image classification, comparing classical and quantum SVM/CNN models. It does not address Unify Models, World Models, MLLM, MultiModal integration, or Model-Based Reinforcement Learning, resulting in minimal relevance to the provided keywords. Total weighted score is 4.0, below the dynamic passing score of 26.5. The author list does not contain the specified experts.

关键词

Quantum Machine Learning, Image Recognition, Classical vs Quantum, MNIST Dataset, Performance Benchmarking, Computational Efficiency, Convolutional Neural Network, Support Vector Machine

164. Bridging the Generalization Gap in Adverse Weather Segmentation: A Training Recipe Perspective [FAIL]

Score: 4.0 / 26.5

Authors: Cong Xu, Pu Luo, Yumei Li, Boyou Xue

Published: 2026-05-27

TL;DR: This paper bridges the generalization gap in adverse weather segmentation by optimizing training recipes rather than model architecture, achieving robust test performance with a reduced validation-test gap.

摘要翻译

本文介绍了我们在第 8 届 UG2+ 研讨会（CVPR 2026）Track 2 中提出的方法，该方法旨在对受五种天气条件（模糊、黑暗、雪、雾和眩光）退化的户外场景进行语义分割。我们观察到的一个核心挑战是严重的泛化差距——在验证集上表现良好的模型往往在测试集上性能骤降。例如，SegFormer-B5 从验证集到测试集的 mIoU 下降了 16.1 个点，这表明仅靠模型容量不足以实现鲁棒性。我们探究了是否经过精心设计的训练策略，而非架构复杂度，能够解决这一差距。基于预训练的 SegMAN-S 主干网络，我们系统地研究了领域自适应微调、多源数据混合、场景平衡采样以及合成退化增强等因素的影响。我们的最终系统在官方测试集上达到了 59.9% 的 mIoU，同时保持验证 - 测试差距仅为 6.5 个点——不到更大模型差距的一半。我们分析了架构修改、损失函数变体及模型缩放等方面的负面结果，旨在为有限数据下的天气鲁棒分割提供实用见解。

Abstract

This paper describes our approach for the 8th UG2+ Workshop (CVPR 2026) Track 2, which targets semantic segmentation of outdoor scenes degraded by five weather conditions: blur, darkness, snow, haze, and glare. A central challenge we observe is a severe generalization gap: models that perform well on the validation set often collapse on the test set. For instance, SegFormer-B5 drops 16.1 mIoU points from validation to test, suggesting that model capacity alone is insufficient for robustness. We investigate whether a carefully designed training recipe, rather than architectural complexity, can address this gap. Starting from a pre-trained SegMAN-S backbone, we systematically study the effects of domain-adaptive fine-tuning, multi-source data mixing, scene-balanced sampling, and synthetic degradation augmentation. Our final system achieves 59.9% mIoU on the official test set while maintaining a validation-test gap of only 6.5 points, less than half that of larger models. We analyze negative results from architectural modifications, loss function variants, and model scaling to provide practical insights for weather-robust segmentation under limited data.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	1.0/10	2.0
model-based RL	2.0	0.0/10	0.0

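评分理由: This paper focuses on semantic segmentation in computer vision, addressing the generalization gap under adverse weather by optimizing the training recipe (e.g., data mixing and sampling strategies) rather than the architecture. The scoring keywords concern multimodal large models, world models, and reinforcement learning, which are a poor match for this paper's supervised image segmentation topic. 'MultiModal' has only a weak connection because the abstract mentions 'multi-source data mixing', and 'Unify Models' has only a weak connection because the paper offers a unified training-recipe perspective; the remaining keywords have very low relevance.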

关键词

Semantic Segmentation, Adverse Weather, Generalization Gap, Training Recipe, Domain-Adaptive Fine-tuning, Multi-source Data Mixing, Synthetic Degradation Augmentation

165. Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring [FAIL]

Score: 2.0 / 26.5

Authors: Arthur Dérédel, Carlos Crispim-Junior, Pierre Lemaire, Johan Berthet, Laure Tougne Rodet

Published: 2026-05-27

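TL;DR: This paper introduces volumetric change detection, a change detection sub-task for time-lapse camera monitoring of serac falls, and finds that dense and semi-dense feature matching is more robust than supervised methods when data are scarce.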

摘要翻译

在气候变化加剧环境不确定性的当下，识别与检测灾害前兆对于减轻灾难性自然灾害的影响日益关键。尽管干涉激光或地震仪等传统传感器可靠性高，但其广泛部署常受限于后勤与经济障碍，导致众多监测盲区。延时摄影相机（Time-lapse cameras）已能为此类传感器提供成本效益高、高分辨率的视觉上下文，成为一种颇具前景的替代方案。然而，自动处理其输出面临重大挑战，主要源于极端形状与光照变化带来的影响。克服这些问题对于将其大规模部署为监测工具至关重要。本文提出了一种新的变化检测（Change Detection）子任务，即体积变化检测（Volumetric Change Detection），应用于延时摄影相机与斜坡失稳场景。我们对当前最先进的变化检测方法及相关任务进行了全面综述，分析其核心组件，并评估其在此场景下的适用性。为此，我们引入了新数据集 SeracFallDet，该数据集包含冰塔崩落（Serac Fall）标注，并已进行全面标注以满足上述需求。通过泛化实验，我们发现密集与半密集特征匹配（Feature Matching）虽未针对此任务专门训练，却展现出稳健的性能。相比之下，监督方法（Supervised Approaches）则面临数据稀缺与标注不平衡的困境。这表明混合方法（Hybrid Methods）或许能通过结合两者的优势提供一条可行的前进路径。这些发现凸显了特征匹配技术的潜力，并指出需要进一步创新以克服环境监测中实际部署所面临的挑战。

Abstract

In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming crucial to mitigate the impacts of disastrous natural hazards. While classical sensors such as interferometric lasers or seismometers are reliable, their widespread deployment is often hindered by logistical and economic barriers, leaving numerous blind spots. Time-lapse cameras, which already provide cost-effective, high-resolution visual context to such sensors, present a promising alternative. However, processing their output automatically faces significant challenges, notably linked to extreme shape and lighting variations. Overcoming those issues is essential to deploy them at large scale as a monitoring tool. This paper introduces a novel sub-task of change detection, namely volumetric change detection, applied to time-lapse cameras and slope instabilities. We conduct a comprehensive review of state-of-the-art change detection methods and related tasks, analyze their core components and assess their applicability to this context. To that end, we introduce the new dataset SeracFallDet, which contains serac fall annotations and has been thoroughly annotated to meet the latter demand. Through generalization experiments, we demonstrate that dense and semi-dense feature matching, although not trained specifically for this task, exhibit robust performance. In contrast, supervised approaches struggle with data scarcity and annotation imbalance. This suggests that hybrid methods may offer a path forward by leveraging the strengths of both approaches. These findings highlight the potential of feature matching techniques and the need for further innovation to overcome the challenges of real-world deployment in environmental monitoring.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

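评分理由: This paper sits at the intersection of computer vision and environmental science and focuses on evaluating change detection algorithms and building a new dataset. It has no direct connection to the research directions behind the keywords: unified models, world models, multimodal large language models, and reinforcement learning. The hybrid methods mentioned in the paper refer to combining algorithmic strategies, not to unifying model architectures.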

关键词

Change Detection, Time-Lapse Cameras, Serac Fall, Volumetric Change Detection, Feature Matching, Environmental Monitoring, Dense Feature Matching, SeracFallDet Dataset

166. Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts [FAIL]

Score: 2.0 / 26.5

Authors: Liu O. Martin, Lucas Bandarkar, Nanyun Peng

Published: 2026-05-27

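TL;DR: This paper extracts small translation-specialist models by aggressively pruning experts from mixture-of-experts LLMs, achieving substantial compression with negligible loss in translation quality.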

摘要翻译

现代大语言模型（LLMs）在机器翻译任务上实现了最先进的性能，但它们作为通用模型，主要是在许多与翻译无关的任务和能力上进行训练的。因此，针对这一任务，它们存在严重的过参数化问题，导致内存和算力需求过高。本文提出了一种方法，能够从现代混合专家模型（MoE）大语言模型中激进地剪枝专家，同时使翻译质量的退化可忽略不计。我们的方法利用了专家专业化特性以及大语言模型中多语言能力的可分离性，来识别与翻译无关的专家。由于混合专家模型（MoE）具有模块化特性，这些专家可以轻松剪枝而无需任何训练。无需重新训练，我们即可剪枝掉一半的专家，且翻译质量退化可忽略不计；剪枝 70% 的专家时，损失也仅轻微。通过极短的监督微调（SFT），我们可剪枝 75% 的专家并恢复基线性能；在某些设置下，甚至可移除近 90% 的专家，同时保持合理的翻译质量。总体而言，我们的结果表明，翻译仅需大语言模型的一小部分即可实现，这使得包含超过 90% 参数的 MoE 块能够被大幅压缩。

Abstract

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.
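
The pruning idea in this abstract can be sketched in a few lines. The abstract does not state how translation-irrelevant experts are identified, so the selection rule below (keep the experts the router picks most often on translation data) is an assumption used only for illustration, and `select_experts` and `route` are hypothetical names.

```python
from collections import Counter

def select_experts(routing_trace, num_experts, keep_fraction):
    """Keep the most frequently routed experts (assumed criterion, not the paper's).

    routing_trace: one list of expert ids per token, as logged on translation data.
    """
    counts = Counter(e for token_experts in routing_trace for e in token_experts)
    num_keep = max(1, round(num_experts * keep_fraction))
    # Rank by descending usage, breaking ties by expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:num_keep])

def route(scores, kept, top_k):
    """Route a token to the top_k highest-scoring experts among the kept ones."""
    return sorted(kept, key=lambda e: -scores[e])[:top_k]

# Toy trace: 5 tokens, 8 experts, top-2 routing (made-up numbers).
trace = [[0, 3], [3, 5], [0, 3], [5, 3], [0, 7]]
kept_experts = select_experts(trace, num_experts=8, keep_fraction=0.25)
chosen = route([0.1, 0.9, 0.2, 0.5, 0.0, 0.3, 0.0, 0.4], kept_experts, top_k=1)
print(kept_experts, chosen)
```

Because the pruned experts are simply removed from the routing candidates, no weights are updated, which matches the abstract's point that MoE modularity allows pruning without training.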

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

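评分理由: The paper's core contribution is pruning and compressing the mixture-of-experts (MoE) blocks of large language models (LLMs) for machine translation. The keywords 'World Models', 'MultiModal', 'MLLM' (multimodal large language models), and 'model-based RL' all involve multimodality, reinforcement learning, or environment modeling, and have no direct connection to this paper's text-only task and compression method. 'Unify Models' concerns architectural unification; although the paper modifies the model, its core is pruning, so relevance is very low. The author list does not include Yang Shi or the other specified experts. The weighted total is about 2.0, far below the passing score of 26.5.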

关键词

Mixture-of-Experts, Large Language Models, Model Pruning, Machine Translation, Expert Specialization, Model Compression, Multilingual Capabilities

167. Dimensionality Reduction for Robust Federated Learning: A Theoretical Analysis and Convergence Guarantee [FAIL]

Score: 2.0 / 26.5

Authors: Shiyuan Zuo, Jiashuo Li, Rongfei Fan, Han Hu, Jie Xu

Published: 2026-05-27

TL;DR: This paper proposes a dimensionality reduction framework for robust federated learning that significantly accelerates gradient aggregation while maintaining convergence guarantees against Byzantine attacks.

摘要翻译

联邦学习（FL）允许多个客户端在不共享原始数据的情况下协同训练模型，但其极易受到拜占庭攻击的威胁。现有的鲁棒方法虽能消除此类威胁，但在高维梯度聚合过程中会产生显著的计算开销；该开销随模型规模扩展性较差，且随着现代模型日益增大，其在训练成本中的占比愈发突出。为了解决这一计算瓶颈，我们提出了一种投影降维（Projected Dimensionality Reduction, PDR）框架，该框架是针对基于向量距离的鲁棒聚合器的通用加速方案。它通过稀疏随机投影将梯度压缩至一个显著更小的子空间，从而高效计算可靠性权重，以实现鲁棒聚合。该方法将服务器的计算复杂度降低至最优的 $\mathcal{O}(Mp)$，其中 $M$ 为客户端数量，$p$ 为模型维度，这匹配了仅读取梯度所需的理论下界。我们在先前拜占庭鲁棒联邦学习分析所采用的标准联邦学习假设下，建立了收敛性保证。借助子空间嵌入定理（Subspace Embedding Theorem），我们证明 PDR 对于非凸函数能达到 $\mathcal{O}(1/\sqrt{T})$ 的最优收敛率，对于强凸函数能达到 $\mathcal{O}(1/T)$ 的最优收敛率，其中 $T$ 表示迭代次数。至关重要的是，我们数学上证明了这种巨大的加速几乎无需额外代价，仅会将固有的拜占庭误差下界膨胀一个有界且可调的因子 $\frac{1+\varepsilon}{1-\varepsilon}$。在基准数据集上的实验结果表明，将 PDR 与现有聚合器相结合，可在时间效率上获得数量级的加速，同时保持高度竞争力的收敛性能。

Abstract

Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but it is highly vulnerable to Byzantine attacks. Existing robust approaches can neutralize these threats but incur substantial computational overhead during high-dimensional gradient aggregation, an overhead that scales poorly with model size and increasingly dominates the training cost as modern models grow larger. To address this computational bottleneck, we propose Projected Dimensionality Reduction (PDR), a universal acceleration framework for vector-level distance-based robust aggregators, which performs robust aggregation by compressing gradients into a drastically smaller subspace via sparse random projection to efficiently compute reliability weights. This approach reduces the server computational complexity to an optimal $\mathcal{O}(Mp)$, where $M$ is the number of clients and $p$ is the model dimension, matching the theoretical lower bound required merely to read the gradients. We establish convergence guarantees under standard FL assumptions in prior Byzantine-robust FL analyses. By leveraging the Subspace Embedding Theorem, we show that PDR achieves optimal convergence rates of $\mathcal{O}(1/\sqrt{T})$ for non-convex functions and $\mathcal{O}(1/T)$ for strongly convex functions, where $T$ denotes the number of iterations. Crucially, we mathematically demonstrate that this massive acceleration comes almost for free, merely inflating the inherent Byzantine error floor by a bounded, tunable factor of $\frac{1+\varepsilon}{1-\varepsilon}$. Experimental results on benchmark datasets confirm that integrating PDR with existing aggregators yields orders of magnitude speedups in time efficiency while maintaining highly competitive convergence performance.
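
To make the "project, then weight" idea concrete, here is a minimal sketch under stated assumptions. It is not the authors' PDR algorithm: the projection is a generic Achlioptas-style sparse random matrix, the reliability rule (keep the clients closest to the coordinate-wise median in the projected space) is a stand-in, and all function names are hypothetical. The sketch also makes no attempt to reach the $\mathcal{O}(Mp)$ server cost stated in the abstract.

```python
import math
import random
from statistics import median

def sparse_projection(p, k, s=3, seed=0):
    """Return a k x p sparse random matrix as a list of {column: value} rows.

    Entries are +sqrt(s/k) or -sqrt(s/k), each with probability 1/(2s), else 0,
    so squared norms are preserved in expectation.
    """
    rng = random.Random(seed)
    scale = math.sqrt(s / k)
    rows = []
    for _ in range(k):
        row = {}
        for j in range(p):
            u = rng.random()
            if u < 1 / (2 * s):
                row[j] = scale
            elif u < 1 / s:
                row[j] = -scale
        rows.append(row)
    return rows

def project(rows, grad):
    """Multiply the sparse matrix by one gradient vector."""
    return [sum(v * grad[j] for j, v in row.items()) for row in rows]

def robust_aggregate(grads, k, num_byzantine, seed=0):
    """Average the gradients judged reliable in the projected space.

    Stand-in weighting rule (an assumption): weight 1 for the clients closest to
    the coordinate-wise median of the projected gradients, weight 0 for the rest.
    """
    p = len(grads[0])
    rows = sparse_projection(p, k, seed=seed)
    low = [project(rows, g) for g in grads]
    center = [median(col) for col in zip(*low)]
    dist = [math.dist(z, center) for z in low]
    num_keep = len(grads) - num_byzantine
    keep = sorted(range(len(grads)), key=lambda i: dist[i])[:num_keep]
    # Distances were computed in k dimensions; the average uses the full gradients.
    agg = [sum(grads[i][j] for i in keep) / num_keep for j in range(p)]
    return agg, sorted(keep)

# Toy run: four honest clients near the all-ones gradient, one Byzantine client.
rng = random.Random(1)
honest = [[1.0 + rng.gauss(0, 0.01) for _ in range(50)] for _ in range(4)]
byzantine = [rng.gauss(0, 100.0) for _ in range(50)]
aggregate, kept = robust_aggregate(honest + [byzantine], k=8, num_byzantine=1)
print(kept)
```

The point of the design is that the distance computations, which dominate distance-based aggregators, run on 8-dimensional vectors here instead of 50-dimensional ones.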

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

评分理由: The paper focuses on Federated Learning and Byzantine robustness via dimensionality reduction, which significantly diverges from keywords targeting Multimodal LLMs, World Models, and RL. 'Unify Models' scores slightly due to model aggregation in FL, but lacks the multimodal context implied by the background. No specified expert authors were found.

关键词

Federated Learning, Byzantine Attacks, Dimensionality Reduction, Gradient Aggregation, Convergence Guarantee, Sparse Random Projection, Computational Complexity, Robust Aggregation

168. Insurance Pricing Optimization via Off-Policy Evaluation [FAIL]

Score: 2.0 / 26.5

Authors: Sascha Günther, Dimitri Semenovich, Mario V. Wüthrich

Published: 2026-05-27

TL;DR: The paper formulates insurance pricing as a decision-making problem using off-policy evaluation and stochastic control, proposing a kernelized inverse propensity score estimator and neural network-based policy optimization to improve pricing rules.

摘要翻译

传统保险定价依赖于基于风险的原则，这些原则确保了精算公平性（actuarial fairness）和偿付能力（solvency），但未明确考虑保单持有人的价格敏感度。我们将保险定价建模为决策问题，并利用离策略评估（off-policy evaluation）和随机控制（stochastic control）的工具进行研究。我们提出了一种核化的逆倾向性评分估计器（kernelized inverse propensity score estimator），该估计器利用了动作空间（action space）中的局部结构，相较于经典逆倾向性评分估计器，实现了方差缩减（variance reduction）。基于这些价值估计，我们研究了策略优化（policy optimization），并提出了两种计算最优定价规则的实际方法：一种可解释的数据共享 Lasso 模型（data-shared Lasso formulation）和一种基于神经网络（neural networks）的灵活策略参数化。利用受控的合成旅行保险环境，我们经验性地证实了理论结果，并表明神经网络在策略优化方面优于现有方法。

Abstract

Traditional insurance pricing relies on risk-based principles that ensure actuarial fairness and solvency but do not explicitly account for policyholders' price sensitivity. We formulate insurance pricing as a decision-making problem and study it using tools from off-policy evaluation and stochastic control. We propose a kernelized inverse propensity score estimator that exploits local structure in the action space and yields variance reduction compared to the classical inverse propensity score estimator. Building on these value estimates, we investigate policy optimization and present two practical approaches for computing optimal pricing rules: an interpretable data-shared Lasso formulation and a flexible policy parameterization based on neural networks. Using a controlled synthetic travel insurance environment, we empirically confirm the theoretical results and show that neural networks outperform existing techniques for policy optimization.
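
A kernelized inverse propensity score estimator for a continuous action such as a price can be sketched as follows. The abstract does not give the estimator's exact form, so this uses one common form, $\hat{V} = \frac{1}{n}\sum_i \frac{1}{h} K\!\left(\frac{\pi(x_i) - a_i}{h}\right) \frac{r_i}{p_i}$, with a Gaussian kernel; the synthetic demand curve, the bandwidth, and all names are illustrative assumptions.

```python
import math
import random

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_ips(logs, policy, bandwidth):
    """Kernel-smoothed IPS value estimate for a deterministic pricing policy.

    logs: iterable of (context, logged_action, reward, logging_density).
    """
    total, n = 0.0, 0
    for context, action, reward, density in logs:
        u = (policy(context) - action) / bandwidth
        total += gaussian_kernel(u) / bandwidth * reward / density
        n += 1
    return total / n

# Synthetic logs (all numbers are made up for illustration).
rng = random.Random(0)
logs = []
for _ in range(20000):
    context = rng.random()
    price = rng.uniform(0.5, 1.5)      # logging density is 1.0 on [0.5, 1.5]
    buy_prob = 1.2 - 0.8 * price       # assumed demand curve
    reward = price if rng.random() < buy_prob else 0.0
    logs.append((context, price, reward, 1.0))

def flat_price(context):
    """Target policy: always quote a price of 1.0."""
    return 1.0

estimate = kernel_ips(logs, flat_price, bandwidth=0.1)
print(round(estimate, 3))
```

In this toy setup the true value of the flat policy is 1.0 * (1.2 - 0.8) = 0.4. Logged prices near 1.0 all contribute to the estimate, which is the sense in which a kernel exploits local structure in the action space; a classical IPS estimator would need exact action matches.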

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	1.0/10	2.0

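评分理由: The paper studies insurance pricing optimization using off-policy evaluation and stochastic control, placing it in actuarial science and causal inference. The given keywords focus on multimodal large models, world models, and model-based reinforcement learning. 'Unify Models', 'World Models', 'MLLM', and 'MultiModal' have no direct connection to the paper's content (numerical insurance data, pricing policies) and score 0. 'model-based RL' touches on reinforcement learning concepts such as policy optimization, but the paper centers on causal inference and propensity score estimation and does not build an environment dynamics model, so its relevance is low and it scores 1. Weighted total = 1.0 * 2.0 = 2.0, below the dynamic passing score of 26.5. The author list does not include Yang Shi or the other specified experts, so no bonus is added.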
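
The arithmetic in this rationale can be checked with a short sketch. The function name `weighted_total`, the dictionary layout, and the use of `>=` for the pass test are illustrative assumptions, not part of the report generator; the weights, relevance values, and the 26.5 threshold are copied from the table and score line for entry 168.

```python
# Illustrative sketch: recompute the weighted total for entry 168.
# The per-keyword score is taken to be weight * relevance, as the table rows suggest.

PASS_THRESHOLD = 26.5  # from "Score: 2.0 / 26.5" in the entry header

# keyword -> (weight, relevance out of 10), copied from the scoring table
rows = {
    "Unify Models": (2.0, 0.0),
    "World Models": (2.0, 0.0),
    "MLLM": (2.0, 0.0),
    "MultiModal": (2.0, 0.0),
    "model-based RL": (2.0, 1.0),
}

def weighted_total(table):
    """Sum weight * relevance over all keywords."""
    return sum(weight * relevance for weight, relevance in table.values())

total = weighted_total(rows)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)
```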

关键词

Insurance Pricing, Off-Policy Evaluation, Stochastic Control, Inverse Propensity Score, Policy Optimization, Neural Networks, Actuarial Science

169. Machine Learning methods for event classification and vertex reconstruction of the 12C + 12C reaction with the MATE-TPC [FAIL]

Score: 2.0 / 26.5

Authors: Minghui Zhang, Xiaobin Li, Jie Chen, Ningtao Zhang, Fenhua Lu, Junrui Ma, Jiazhen Yan, Wanqin Tu, Xiaodong Tang, Bingshui Gao, Chengui Lu, Zhichao Zhang, Jinlong Zhang, Weiping Liu

Published: 2026-05-27

TL;DR: This study applies convolutional neural networks (ResNet and VGG) to classify nuclear reaction events and reconstruct reaction vertices in 12C + 12C experiments with the MATE-TPC, reaching about 97% classification accuracy on simulated data and about 90% on experimental data.

摘要翻译

在现代核物理实验中，利用主动靶时间投影室（TPC）进行核反应研究时，识别感兴趣的事件颇具挑战性。本研究采用机器学习技术，分析来自名为 MATE（用于核实验的多用途主动靶时间投影室）的 TPC 的 12C + 12C 聚变反应的复杂数据。具体而言，我们成功应用了残差神经网络（ResNet-50、ResNet-34 和 ResNet-18）以及视觉几何组网络（VGG-19），对 12C + 12C 反应中的弹性散射和聚变反应事件进行分类。这四个模型的分类结果几乎一致，模拟数据的准确率约为 97%，实验数据的准确率约为 90%。此外，这些方法成功识别出了一些被传统方法误分类的事件。这些模型还被应用于对不同聚变反应通道的事件进行分类，在模拟数据上的分类准确率约为 95%。另外，我们还开发了一种卷积神经网络（CNN）模型用于重建反应顶点，为顶点重建提供了一种替代方案。这些结果表明，机器学习技术能够有效地区分不同通道的反应事件并重建反应顶点，从而为未来复杂核反应数据的分析奠定了基础。

Abstract

In modern nuclear physics experiments, identifying events of interest is challenging for nuclear reaction studies with the active target Time Projection Chamber (TPC). In this work, machine learning techniques are employed to analyze the complex data of the 12C + 12C fusion reaction from a TPC named MATE (multi-purpose active-target time projection chamber for nuclear experiments). Specifically, we successfully applied Residual Neural Networks (ResNet-50, ResNet-34, and ResNet-18) and a Visual Geometry Group network (VGG-19) to classify elastic scattering and fusion reaction events from the 12C + 12C reaction. The classification results of the four models are nearly identical, with accuracies of approximately 97% for the simulated data and 90% for the experimental data. Moreover, these approaches successfully identify some events that are misclassified by traditional methods. These models are also applied to classify events from different fusion reaction channels, with classification accuracies of approximately 95% on simulated data. In addition, a Convolutional Neural Network (CNN) model is developed to reconstruct the reaction vertex, providing an alternative strategy for vertex reconstruction. These results indicate that machine learning techniques can effectively classify reaction events from different channels and reconstruct the reaction vertex, thereby paving the way for future analyses of complex nuclear reaction data.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

评分理由: The paper applies standard CNN architectures (ResNet, VGG) to nuclear physics data for classification and reconstruction. It does not propose unified model architectures (Unify Models), involve latent dynamics or planning (World Models), utilize large language models (MLLM), handle multiple data modalities like text/image (MultiModal), or employ reinforcement learning (model-based RL). Thus, relevance to the specific keywords is negligible except for general model usage.

关键词

Machine Learning, Event Classification, Vertex Reconstruction, 12C + 12C Reaction, ResNet, VGG, Convolutional Neural Network, TPC

170. Robust Contrastive Graph Clustering with Adaptive Local-Global Integration [FAIL]

Score: 2.0 / 26.5

Authors: Lei Zhang, Fubo Sun, Haipeng Yang, Zhong Guan, Likang Wu

Published: 2026-05-27

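TL;DR: This paper proposes a robust contrastive graph clustering framework that uses attention to adaptively fuse multi-scale local structure with global semantics, achieving competitive clustering performance on eight real-world graph datasets.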

摘要翻译

图聚类（Graph Clustering）在图分析中至关重要，用于揭示结构模式和节点社区。尽管近期自监督对比学习（Self-supervised Contrastive Learning）通过结构与属性信号改进了聚类，但现有方法仍难以灵活捕捉高阶局部结构，且常忽略复杂图中的全局语义。这些局限性导致节点表示次优，尤其在具有碎片化结构和模糊簇边界的真实图中。为了解决这些局限性，本文提出了一种对比图聚类框架，通过注意力机制（Attention Mechanisms）联合整合多尺度局部结构与全局语义。在局部层面，从多个传播深度提取的基于图神经网络（GNN）的拓扑信号通过基于注意力的加权进行自适应融合，以捕捉多尺度邻域特征。在全局层面，从动态演化的簇中心导出的语义原型通过注意力自适应聚合，以引导节点表示并增强簇间可分性。该模型在双视图对比学习范式（Dual-view Contrastive Learning Paradigm）下进行训练，采用混合目标函数，结合实例级损失与结构感知损失，以提升表示的鲁棒性与判别性。在八个真实图数据集上的实验表明，该方法实现了具有竞争力的聚类性能。代码可在 https://github.com/vege12138/w2 获取。

Abstract

Graph clustering is essential in graph analysis for revealing structural patterns and node communities. Despite recent advances in self-supervised contrastive learning that have improved clustering via structural and attribute signals, existing methods still struggle to flexibly capture high-order local structures and often overlook global semantics in complex graphs. These limitations lead to suboptimal node representations, especially in real-world graphs with fragmented structures and ambiguous cluster boundaries. To address these limitations, a contrastive graph clustering framework is proposed to jointly integrate multi-scale local structures with global semantics via attention mechanisms. At the local level, GNN-based topological signals extracted from multiple propagation depths are adaptively fused through attention-based weighting to capture multi-scale neighborhood features. At the global level, semantic prototypes derived from dynamically evolving cluster centers are adaptively aggregated through attention to guide node representations and enhance inter-cluster separability. The model is trained under a dual-view contrastive learning paradigm with a hybrid objective that combines instance-level and structure-aware losses to improve representation robustness and discrimination. Experiments on eight real-world graph datasets demonstrate that our method achieves competitive clustering performance. Code is available at https://github.com/vege12138/w2.
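
The local-level step described here, fusing signals from several propagation depths with attention weights, can be sketched as below. This is a simplified illustration rather than the paper's model: propagation is plain mean aggregation with a self-loop, the attention query is a fixed vector where a real model would learn it, and the function names are hypothetical.

```python
import math

def propagate(adj, feats):
    """One step of mean aggregation over each node's neighbours plus itself."""
    dim = len(feats[0])
    out = []
    for i, row in enumerate(adj):
        nbrs = [j for j, a in enumerate(row) if a or j == i]
        out.append([sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return out

def attention_fuse(views, query):
    """Fuse per-depth node features with softmax attention over depths.

    views: list over propagation depths of node-feature matrices.
    query: attention vector (fixed here; learned in a real model).
    """
    fused = []
    for node in range(len(views[0])):
        scores = [sum(q * h for q, h in zip(query, view[node])) for view in views]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        alphas = [e / norm for e in exps]
        fused.append([sum(a * view[node][d] for a, view in zip(alphas, views))
                      for d in range(len(query))])
    return fused

# Toy path graph 0 - 1 - 2 with 2-dimensional node features.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
views = [feats, propagate(adj, feats)]       # depth 0 and depth 1
fused = attention_fuse(views, query=[0.0, 0.0])
print(fused[0])
```

With a zero query every depth gets equal weight, so the fused feature is the plain average of the depths; a non-zero query lets each node emphasize the depth whose features align with it.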

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

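评分理由: The paper focuses on graph clustering and contrastive learning, within graph data mining. The world models, MLLM, multimodal, and reinforcement learning topics behind the keywords have no direct connection to its subject (analysis of graph-structured data). Although it integrates local and global information, this does not match 'Unify Models' in the large-model sense. The author list contains none of the specified experts.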

关键词

Graph Clustering, Contrastive Learning, Attention Mechanism, Local-Global Integration, GNN, Node Representation, Multi-scale Structures, Semantic Prototypes

171. Intra-YOLO: A Small Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photography Based on Transfer Learning with Reinforcement Learning [FAIL]

Score: 2.0 / 26.5

Authors: Po-Lun Chwang, Po-Yu Chang, Wen-Liang Lin, Tung-Sheng Wu, Min-Ching Wang, Yun-Chien Cheng

Published: 2026-05-27

TL;DR: This paper develops a YOLO-based computer-aided diagnosis system using transfer learning and reinforcement learning to detect small dental lesions in intraoral photographs.

摘要翻译

本研究开发了一种用于在口内照片中检测龋齿和磨牙 - 切牙低矿化症 (MIH) 的计算机辅助诊断 (CAD) 系统。这些病变外观相似，导致临床鉴别困难，尤其是考虑到其体积较小以及成像条件的变异性。

Abstract

This study developed a computer-aided diagnosis (CAD) system for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. These lesions share similar appearances, making clinical differentiation challenging, especially given their small size and variability in imaging conditions.

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	0.0/10	0.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	1.0/10	2.0

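评分理由: The paper focuses on small object detection in medical images (caries and MIH), using a YOLO architecture combined with transfer learning and reinforcement learning. This is far from the general large-model and control-theory research directions represented by the keyword cluster (unified models, world models, MLLM, multimodal, model-based RL). Although the title mentions reinforcement learning, it may be used here for architecture search rather than in a model-based RL paradigm, and there is no multimodal or large language model content, so relevance is very low.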

关键词

Intraoral Photography, Object Detection, Caries Detection, Molar-Incisor Hypomineralization, Transfer Learning, Reinforcement Learning, YOLO, Computer-Aided Diagnosis

172. Dual-branch Distilled Transformer for Efficient Asymmetric UAV Tracking [FAIL]

Score: 2.0 / 26.5

Authors: Hongtao Yang, Bineng Zhong, Qihua Liang, Yaozong Zheng, Xiantao Hu, Yuanliang Xue, Shuxiang Song

Published: 2026-05-27

TL;DR: This paper proposes EATrack, an efficient asymmetric UAV tracking framework using teacher-guided dual-branch distillation that achieves a favorable balance between accuracy and speed.

摘要翻译

鉴于 UAV (无人机) 跟踪的实时性需求，许多方法通过简化骨干网络来减少计算量，但这往往削弱了特征表示，并在复杂场景中导致性能下降。为了解决这一问题，我们提出了 EATrack，这是一种高效且不对称的 UAV 跟踪框架，其核心是教师引导的双分支蒸馏策略，旨在增强轻量级学生模型的特征表达能力。具体而言，EATrack 从两个互补的知识转移视角进行了探究：空间聚焦的特征级蒸馏通过引导学生学习强目标表示来补偿削弱的特征表示，而预测级蒸馏则通过学习教师准确目标定位的能力来增强空间定位。此外，为了增强对外观变化的鲁棒性，我们引入了一种细粒度目标感知蒸馏策略，选择性地把教师的目标建模能力转移给学生。在推理阶段引入时间适应模块，以增强模型在时间维度上的鲁棒性。在五个 UAV 基准数据集上的实验表明，EATrack 在精度与速度之间取得了良好的平衡。代码：https://github.com/GXNU-ZhongLab/EATrack

Abstract

Given the real-time demands of UAV tracking, many methods simplify the backbone to reduce computation, but this often weakens feature representation and degrades performance in complex scenarios. To alleviate this issue, we propose EATrack, an efficient and asymmetric UAV tracking framework centered around a teacher-guided dual-branch distillation strategy that enhances the feature expressiveness of the lightweight student model. Specifically, EATrack investigates two complementary perspectives of knowledge transfer: spatially focused feature-level distillation that compensates for weakened representations by guiding the student to learn strong target representations, and prediction-level distillation that enhances spatial localization by learning the teacher's capability for accurate target localization. Furthermore, to enhance robustness against appearance variations, we introduce a fine-grained target-aware distillation strategy that selectively transfers the teacher's target modeling capacity to the student. A temporal adaptation module is incorporated at inference to enhance robustness over time. Experiments on five UAV benchmarks demonstrate that EATrack achieves a favorable balance between accuracy and speed. Code: https://github.com/GXNU-ZhongLab/EATrack
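
The two distillation branches named in this abstract can be illustrated with a small loss sketch. The concrete loss forms are assumptions chosen for clarity (a target-mask-weighted squared error for the feature branch and a temperature-softened KL divergence for the prediction branch); the abstract does not specify EATrack's actual losses, and every name below is hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    top = max(logits)
    exps = [math.exp((v - top) / temperature) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

def feature_distill(student_feat, teacher_feat, target_mask):
    """Mask-weighted mean squared error, so target positions dominate the loss."""
    num = sum(w * (s - t) ** 2
              for s, t, w in zip(student_feat, teacher_feat, target_mask))
    return num / max(sum(target_mask), 1e-8)

def prediction_distill(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened score maps."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_feat, teacher_feat, target_mask,
                      student_logits, teacher_logits, alpha=1.0, beta=1.0):
    """Weighted sum of the feature-level and prediction-level terms."""
    return (alpha * feature_distill(student_feat, teacher_feat, target_mask)
            + beta * prediction_distill(student_logits, teacher_logits))

# Toy values: three spatial positions, the last two covered by the target.
feat_loss = feature_distill([1.0, 2.0, 3.0], [1.0, 0.0, 3.0], [0.0, 1.0, 1.0])
pred_loss = prediction_distill([0.0, 1.0], [0.0, 3.0])
same_loss = distillation_loss([1.0, 2.0], [1.0, 2.0], [1.0, 1.0],
                              [0.5, 1.5], [0.5, 1.5])
print(feat_loss, pred_loss > 0, same_loss)
```

The mask is what makes the feature term spatially focused: positions outside the target contribute nothing, so the lightweight student spends its capacity on matching the teacher where it matters for tracking.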

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

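评分理由: The paper focuses on distillation techniques for UAV visual tracking and is unrelated to World Models, MLLM, MultiModal, and model-based RL. Only the dual-branch structure, which brings teacher and student knowledge together, gives it a weak connection to Unify Models. The author list does not include Yang Shi or the other specified experts.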

关键词

UAV Tracking, Dual-branch Distilled Transformer, Teacher-guided Distillation, Efficient Tracking, Feature Representation, Spatial Localization, Temporal Adaptation

173. Enhancing Ultra-low-field MRI with Segmentation-guided Adversarial Learning [FAIL]

Score: 2.0 / 26.5

Authors: James Grover, Andrew Phair, Michael Ferraro, David E. J. Waddington

Published: 2026-05-27

TL;DR: The paper enhances 64 mT ultra-low-field MRI by conditioning a CycleGAN and a transformer-based residual enhancement model (T-REX) on Swin UNETR tissue segmentation priors and combining their outputs with a weighted average.

摘要翻译

超低场 (ULF) 磁共振成像 (MRI) 提供便携且低成本的成像方案，但图像质量较差。为解决这一问题，我们提交了 2025 年超低场增强挑战赛 (ULF-EnC) 的方案，其目标是从 64 mT 扫描中合成高场类似磁共振图像。我们的流程通过结合解剖学引导与模型集成来增强 ULF MRI。我们首先利用仅在挑战赛提供的数据上训练的 Swin UNETR 生成组织分割先验。这些先验作为条件输入两个独立的增强网络：一个是 CycleGAN，另一个是基于 Transformer 的残差增强模型 (T-REX)，两者均被训练用于合成 3 T 类似磁共振图像。采用加权平均法结合两个模型的输出。我们的方法生成的增强磁共振图像在定性和定量方面均与高场扫描相当。

Abstract

Ultra-low-field (ULF) MRI offers portable and low-cost imaging but suffers from poor image quality. To address this, we present our submission to the 2025 ULF Enhancement Challenge (ULF-EnC), where the goal is to synthesise high-field-like MRIs from 64 mT scans. Our pipeline enhances ULF MRI through a combination of anatomical conditioning and model ensembling. We first generate tissue segmentation priors using a Swin UNETR trained solely on challenge-provided data. These priors condition two independent enhancement networks, a CycleGAN and a transformer-based residual enhancement model (T-REX), each trained to synthesise 3 T-like MRIs. Outputs from both models are combined using a weighted average. Our approach produces enhanced MRIs that are comparable to high-field scans both quantitatively and qualitatively.
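
The final ensembling step is a voxel-wise weighted average of the two networks' outputs, which can be written out as a short sketch. The abstract does not report the weights, so `weight_a` and its default of 0.5 are placeholders, and images are flattened to plain lists for simplicity.

```python
def weighted_ensemble(pred_a, pred_b, weight_a=0.5):
    """Voxel-wise weighted average of two enhanced images (flattened to lists).

    weight_a is the weight on the first model; the second gets 1 - weight_a.
    The default of 0.5 is a placeholder, not a value from the paper.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same number of voxels")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]

# Toy example with two voxels (made-up intensities).
combined = weighted_ensemble([0.0, 1.0], [1.0, 1.0], weight_a=0.25)
print(combined)
```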

评分详情

关键词	权重	相关度	得分
Unify Models	2.0	1.0/10	2.0
World Models	2.0	0.0/10	0.0
MLLM	2.0	0.0/10	0.0
MultiModal	2.0	0.0/10	0.0
model-based RL	2.0	0.0/10	0.0

评分理由: This paper addresses ultra-low-field MRI enhancement using segmentation priors and model ensembling (CycleGAN and T-REX). It is unrelated to the provided keywords concerning MLLM, World Models, or Model-Based RL, as it focuses on medical image processing rather than unified multimodal architectures or reinforcement learning. 'Unify Models' receives a minimal score due to output ensembling, but it does not match the architectural unification context. No expert authors from the specified list were found. The weighted total score is 2.0, below the 26.5 pass threshold.

关键词

Ultra-low-field MRI, Image Enhancement, Segmentation-guided, Adversarial Learning, Model Ensembling, CycleGAN, Swin UNETR, High-field synthesis

174. Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains [FAIL]

Score: 0.0 / 26.5

Authors: Lev Telyatnikov, Raffael Theiler, Leandro Von Krannichfeldt, Olga Fink

Published: 2026-05-27

TL;DR: This paper proposes Picid, a modular evaluation infrastructure designed to standardize and ensure reproducibility in Prognostics and Health Management (PHM) tasks across diverse domains and model families.

摘要翻译

预测与健康管理（PHM）的进展因缺乏跨任务、数据集及应用领域的标准化和可复用评估实践而受阻。报告的结果往往难以复现和比较，因为关键协议选择（如数据划分、预处理、标签对齐、时序窗口化及指标）通常是隐式的或临时实施的。我们引入 Picid，这是一个模块化评估基础设施，它将 PHM 评估流程形式化为一个显式、可执行且可复现的协议。通过明确定义的抽象，Picid 强制执行确定性且防泄露的数据集构建，同时在多样化的 PHM 场景中保持灵活性。该框架通过统一接口支持故障检测、诊断与预测，并可扩展至新数据集和模型类别，而不违反协议不变性。通过标准化数据契约和评估边界，Picid 还支持在诊断（分类）与预测（回归）之间进行公平的跨任务比较，使得相同的模型族能够在异构设置中得到一致评估。我们通过经验评估展示了 Picid，该评估在涵盖电池、轴承、涡轮风扇发动机、液压系统、过滤系统及建筑物的十二个数据集上对十三个模型进行了评估。本研究为 PHM 领域中标准化、公平且可复现的评估建立了可复用的基础。

Abstract

Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks, datasets, and application domains. Reported results are often difficult to reproduce and compare, as key protocol choices, such as data splits, preprocessing, label alignment, temporal windowing, and metrics, are often implicit or implemented ad hoc. We introduce Picid, a modular evaluation infrastructure that formalizes the PHM evaluation pipeline as an explicit, executable, and reproducible protocol. Through well-defined abstractions, Picid enforces deterministic, leakage-safe dataset construction while remaining flexible across diverse PHM settings. The framework supports fault detection, diagnostics, and prognostics through a unified interface and can be extended to new datasets and model classes without violating protocol invariants. By standardizing data contracts and evaluation boundaries, Picid also enables fair cross-task comparisons across diagnostics (classification) and prognostics (regression), allowing identical model families to be evaluated consistently across heterogeneous settings. We demonstrate Picid through an empirical evaluation of thirteen models on twelve datasets spanning batteries, bearings, turbofan engines, hydraulics, filtration systems, and buildings. This work establishes a reusable foundation for standardized, fair and reproducible evaluation in PHM.

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

评分理由: The paper introduces Picid, a modular evaluation infrastructure for Prognostics and Health Management (PHM), focusing on reproducibility and standardized protocols for engineering tasks (e.g., batteries, engines). It does not address Unify Models (AI architecture unification), World Models, MLLM, MultiModal learning, or Model-Based Reinforcement Learning. The research domain (Engineering/Software) is distinct from the provided AI/ML keywords, resulting in zero relevance for all specified terms.

关键词

Picid, Modular Evaluation Infrastructure, Reproducible PHM, Fault Detection, Diagnostics, Prognostics, Standardized Evaluation

175. An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers [FAIL]

Score: 0.0 / 26.5

Authors: Ida Gjergji, Lucas Kletzander, Nysret Musliu, Andrea Schaerf

Published: 2026-05-27

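TL;DR: This paper proposes an enhanced Large Neighborhood Search method for the capacitated facility location problem with incompatible customers and obtains better results than existing metaheuristics on benchmark instances.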

摘要翻译

最近，文献中引入了一种经典带容量限制设施选址问题的新变体，该变体考虑了客户之间的不相容性。该问题刻画了这样一种情形：给定的客户对无法由同一设施服务。此类特征对于许多实际选址问题案例至关重要，例如危险或污染材料的存在以及竞争客户之间的冲突。本文提出了一种大邻域搜索（LNS）方法来解决这一问题。在 LNS 框架下，我们引入了三种不同的破坏算子，采用混合方式进行组合，并在修复阶段使用精确求解器。针对 LNS 的设计，我们研究了不同的算法组件。实验分析表明，我们的新方法优于现有的最先进的元启发式算法，并为所有可用基准实例提供了新的最优解。

Abstract

A new variant of the classic capacitated facility location problem, which considers incompatibilities between customers, has recently been introduced in the literature. This problem captures the situation where given pairs of customers cannot be served by the same facility. Such a feature is crucial for many practical cases of location problems, such as the presence of hazardous or polluting materials and contention between competing customers. In this paper, we propose a Large Neighborhood Search (LNS) method to solve this problem. Within the framework of LNS, we introduce three different destroy operators, which are combined in a hybrid manner, and we use an exact solver in the repair phase. Different algorithmic components are investigated for the design of LNS. The experimental analysis shows that our new method outperforms existing state-of-the-art metaheuristics, providing new best solutions for all available benchmark instances.
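The destroy-and-repair loop that LNS is built on can be sketched on a toy instance. This is an illustration under stated simplifications, not the paper's method: it uses a single random destroy operator (the paper combines three) and a greedy repair (the paper uses an exact solver). The instance data and all function names are hypothetical.

```python
import random

# Hypothetical toy instance: 4 customers, 3 facilities.
INSTANCE = {
    "demand": [1, 1, 1, 1],
    "capacity": [2, 2, 2],
    "open_cost": [5, 4, 6],
    "serve_cost": [[1, 3, 4], [2, 1, 5], [4, 2, 1], [3, 3, 1]],  # [customer][facility]
    "conflicts": {frozenset((0, 1))},  # customers 0 and 1 may not share a facility
}


def total_cost(assign, inst):
    """Opening cost of every used facility plus all service costs."""
    opened = set(assign.values())
    return (sum(inst["open_cost"][f] for f in opened)
            + sum(inst["serve_cost"][c][f] for c, f in assign.items()))


def can_serve(c, f, assign, inst):
    """Check capacity and incompatibility for adding customer c to facility f."""
    mates = [o for o, g in assign.items() if g == f]
    load = sum(inst["demand"][o] for o in mates)
    if load + inst["demand"][c] > inst["capacity"][f]:
        return False
    return all(frozenset((c, o)) not in inst["conflicts"] for o in mates)


def greedy_repair(partial, inst):
    """Reinsert unassigned customers at the cheapest feasible facility (greedy stand-in)."""
    assign = dict(partial)
    for c in range(len(inst["demand"])):
        if c in assign:
            continue
        opened = set(assign.values())
        options = [f for f in range(len(inst["capacity"])) if can_serve(c, f, assign, inst)]
        if not options:
            return None  # repair failed; the caller discards this candidate
        assign[c] = min(options, key=lambda f: inst["serve_cost"][c][f]
                        + (0 if f in opened else inst["open_cost"][f]))
    return assign


def random_destroy(assign, k, rng):
    """Remove k randomly chosen customers from the current solution."""
    removed = set(rng.sample(sorted(assign), k))
    return {c: f for c, f in assign.items() if c not in removed}


def lns(inst, iterations=100, k=2, seed=0):
    """Keep a candidate only when it strictly improves the incumbent."""
    rng = random.Random(seed)
    best = greedy_repair({}, inst)
    if best is None:
        raise ValueError("no feasible starting solution found")
    for _ in range(iterations):
        candidate = greedy_repair(random_destroy(best, k, rng), inst)
        if candidate is not None and total_cost(candidate, inst) < total_cost(best, inst):
            best = candidate
    return best
```

Replacing `greedy_repair` with a call to an exact solver over the removed customers would bring the sketch closer to the design the abstract describes.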

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

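评分理由: The paper belongs to combinatorial optimization in operations research (facility location, Large Neighborhood Search), whereas the keywords concern artificial intelligence, multimodal large models, and reinforcement learning. The two research areas are entirely different and share no technical or conceptual overlap, so every keyword receives a relevance of 0. The author list also contains none of the specified experts.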

关键词

Capacitated Facility Location Problem, Incompatible Customers, Large Neighborhood Search, Destroy Operators, Exact Solver, Metaheuristics, Benchmark Instances

176. Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation [FAIL]

Score: 0.0 / 26.5

Authors: Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Honghan Wu

Published: 2026-05-27

TL;DR: This paper investigates whether Chain-of-Thought distillation improves reasoning traces in medical QA, finding that while final answer accuracy increases, the factuality of intermediate reasoning steps significantly deteriorates.

摘要翻译

思维链（CoT）蒸馏旨在训练一个较小的模型以模仿教师的推理轨迹，但其评估通常依赖于包括准确率在内的最终答案指标。我们探究答案质量的提升是否伴随着推理轨迹质量的改善。在医疗问答领域，由于简短的答案选项可能导致更丰富的临床理由未被充分指定，从 DeepSeek-V3 系列教师模型蒸馏出的 Qwen3-8B 学生模型在 MedQA-USMLE 答案指标上有所提升（SC@64 从 74.7% 提升至 84.4%；预期校准误差 (ECE) 从 0.096 降至 0.034）。然而，在 Kimi-K2.6 风格盲测 LLM 评判器的审计下，其在未弃权步骤上的错误率却从 30.6% 上升至 50.3%。在此主要医疗场景中，答案质量与轨迹事实性呈现出相反的变化趋势。这种前后模式在评估者、教师模型强度、学生模型规模与系列、医疗基准，以及风格、分段和答案正确性控制等不同条件下均保持一致。由临床专家进行的 150 步盲审也复现了相同的排序结果。边界检查限定了该主张的范围：风险出现在紧凑的答案对推理依据约束不足时，此时有能力的学生模型可以模仿专家形式，却无法可靠地支撑每一个局部主张。标准答案指标及总体规避率均无法揭示这一转变。当此类推理轨迹被发布或重用时，仅依靠答案级指标是不足的。

Abstract

Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before-after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
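Expected calibration error, one of the answer-level metrics quoted above, can be computed as in the following sketch. It uses the common equal-width binning formulation; the abstract does not say which binning scheme or bin count the authors used, so `n_bins=10` and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1]
    correct:     booleans, True when the prediction was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy usage: four answers at 90% confidence, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A low ECE only says that stated confidence tracks final-answer accuracy; as the abstract argues, it says nothing about whether the intermediate reasoning steps are factual.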

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

评分理由: The paper focuses on Chain-of-Thought distillation and reasoning audit in medical QA, specifically analyzing the discrepancy between final answer accuracy and intermediate reasoning trace factuality. It does not address World Models, Model-Based Reinforcement Learning, Multimodal representation learning, or model unification architectures, resulting in negligible relevance to all provided keywords. Additionally, none of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

关键词

Chain-of-Thought, Knowledge Distillation, Medical QA, Reasoning Trace, Step-Level Audit, Answer Accuracy, LLM-judge, Factuality

177. ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research [FAIL]

Score: 0.0 / 26.5

Authors: Yihan Xia, Taotao Wang

Published: 2026-05-27

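TL;DR: To address the audit risk created when AI-assisted research compresses the workflow into a single loop, ResearchLoop proposes an evidence-gated control plane that manages project state and keeps claims auditable.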

摘要翻译

AI 辅助研究将构思、实现、评估和手稿撰写压缩成一个单一的交互循环。这种压缩虽然有用，但也带来了发表风险：论文主张（paper claims）可能比核查（audit）更容易陈述。我们提出 ResearchLoop，一种用于 AI 辅助计算研究的证据门控 control plane（控制平面）。ResearchLoop 将研究问题、任务契约、证据对象、claim ledgers（主张账本）、closeouts（收尾）和 paper bindings（论文绑定）视为持久的项目状态，此处实现为 repository-backed runtime（基于仓库的运行时）。本技术报告提供了完整的 protocol specification（协议规范）、state model（状态模型）、transition rules（转换规则）、claim-admission algorithm（主张准入算法）和 insight-compounding mechanism（见解复合机制）。此外，它还报告了涵盖九个版本（V0--V9）的完整实验记录，包括一个 self-hosting case study（自托管案例研究）、一个带有 component ablations（组件消融）的 controlled task-suite study（受控任务套件研究）、一个 mathematical olympiad evaluation（数学奥林匹克评估），以及一个使用 official generated-code harness（官方生成代码工具包）评估的补充 SciCode boundary experiment（SciCode 边界实验）。所有 artifacts（工件）、manifests（清单）和 verification reports（验证报告）均保存在 project repository（项目仓库）中。

Abstract

AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compression is useful, but it also creates a publication risk: paper claims can become easier to state than to audit. We present ResearchLoop, an evidence-gated control plane for AI-assisted computational research. ResearchLoop treats research questions, task contracts, evidence objects, claim ledgers, closeouts, and paper bindings as durable project state, realized here as a repository-backed runtime. This technical report provides the complete protocol specification, state model, transition rules, claim-admission algorithm, and insight-compounding mechanism. It also reports the full experimental record spanning nine versions (V0-V9), including a self-hosting case study, a controlled task-suite study with component ablations, a mathematical olympiad evaluation, and a supplementary SciCode boundary experiment evaluated with the official generated-code harness. All artifacts, manifests, and verification reports are preserved in the project repository.

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

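评分理由: The paper's core contribution is workflow management for AI-assisted research (an evidence-gated control plane). It has no direct technical connection to the unified model architectures, world models, multimodal large models, or reinforcement learning methods that the keywords cover, so all relevance scores are 0. The author list contains none of the specified experts, so no bonus applies.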

关键词

ResearchLoop, AI-assisted research, Evidence-gated control plane, Project state, Claim ledger, Verification, Manuscript writing

178. AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering? [FAIL]

Score: 0.0 / 26.5

Authors: Maharshi Gor, Yoo Yeon Sung, Yu Hou, Eve Fleisig, Irene Ying, Tianyi Zhou, Jordan Boyd-Graber

Published: 2026-05-27

TL;DR: This study investigates human-AI collaboration in question answering, finding that humans often make suboptimal trust decisions due to bias, suggesting a need for calibrated confidence and evidence-grounded explanations.

摘要翻译

AI 系统（AI Systems）存在缺陷，人类在决定是信赖 AI 还是信赖自身判断时也可能犯错。因此，提升人机协作（Human-AI Collaboration）的效果需要理解人类在何时、为何以及如何决定依赖 AI。我们研究了两种不同的依赖决策：委托选择（Delegation Choice），即决定何时让 AI 自主行动而不知晓其输出；以及采纳选择（Adoption Choice），即评估 AI 建议并决定如何使用它们。这两种解耦的依赖模式都会影响协作，但以往研究很少在真实场景中结合同一用户同时考察这两者。为填补这一空白，我们研究了协作的人 -AI 团队（Human-AI Teams）在问答游戏（Question-Answering Game）中竞争的场景，其中人类可以选择何时以及如何与 AI 智能体（AI Agents）合作以获胜。我们的 24 场比赛将 23 位人类专家与 16 个 AI 智能体配对，共记录了 387 次委托决策和 1440 次采纳决策。尽管人机协作的表现优于单独使用 AI 或单独使用人类，但人类做出的协作决策次优：既存在对正确 AI 建议依赖不足的情况（错过了 3.9% 的机会），也存在当 AI 误导人类时过度依赖的情况（1.7%）。双方均提供了错误答案：当人类与 AI 意见不一致时，报告的模型置信度（Model Confidence）接近随机水平；而当 AI 建议与人类的初始错误答案一致时，确认偏误（Confirmation Bias）导致了更高的依赖不足（64.5%）。为弥补这一差距，我们建议采用校准置信度（Calibrated Confidence）、基于证据的解释（Evidence-Grounded Explanations）以及帮助用户优化信任的机制。

Abstract

AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI collaboration requires understanding when, why, and how humans decide to rely on AI. We study two distinct reliance decisions: the delegation choice (deciding when to let AI act autonomously without knowing its output) and the adoption choice (evaluating AI suggestions and deciding how to use them). Both of these decoupled reliance patterns shape collaboration, but prior work rarely studies them together in realistic settings with the same users. We address this gap by studying collaborative human-AI teams competing in a question-answering game in which humans can choose when and how to work with AI agents to win. Our 24 matches pair 23 expert humans with 16 AI agents, capturing 387 delegation and 1440 adoption decisions. While human-AI collaboration performs better than either AI or humans alone, humans make suboptimal collaboration decisions, both under-relying on correct AI suggestions (3.9% of opportunities missed) and over-relying when AI misleads them (1.7%). Both parties contribute wrong answers: reported model confidence is near chance when humans and AI disagree, while confirmation bias drives higher under-reliance (64.5%) when an AI suggestion agrees with humans' initial incorrect answer. To close this gap, we recommend calibrated confidence, evidence-grounded explanations, and mechanisms that help users refine trust.

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

评分理由: The paper focuses on human-AI interaction, trust, and delegation in question answering tasks, which falls under human-computer interaction and behavioral science. It does not discuss model unification (Unify Models), world models (World Models), multimodal large language model architectures (MLLM, MultiModal), or model-based reinforcement learning algorithms (model-based RL), making all technical keywords irrelevant to the core content.

关键词

Human-AI collaboration, Delegation choice, Adoption choice, Trust, Question-answering game, Calibrated confidence, Evidence-grounded explanations

179. PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management [FAIL]

Score: 0.0 / 26.5

Authors: Shadmehr Zaregarizi, Khashayar Yavari

Published: 2026-05-27

TL;DR: The paper proposes PIRS, a physics-informed reward shaping method for SAC-based building energy management that utilizes ISO 7730 PMV for comfort, enhancing reward interpretability and performance without modifying the model-free learning pipeline.

摘要翻译

居住者舒适度与电网感知能效是相互竞争的目标，它们的联合优化高度依赖于建筑物深度强化学习（DRL）控制器中奖励函数的设定方式。然而，奖励设计在很大程度上仍是临时性的：舒适度项要么是手工调优的启发式规则，要么是简单的温度偏差代理指标，缺乏热舒适物理学的明确基础。我们提出了 PIRS（物理信息奖励塑形），它在软演员 - 评论家（SAC）的加权多目标奖励中，用 ISO 7730 预测平均投票（PMV）公式替代了这些临时性的舒适度代理。通过将舒适度信号锚定在 ISO 7730 PMV 公式中，PIRS 提高了奖励的可解释性，并提供了基于标准的舒适度代理，而无需更改学习流程中的任何其他组件。我们在 CityLearn v2.1.2（2022 挑战赛第一阶段）中评估了 PIRS，使用一个中央 SAC 代理在五个随机种子下训练 5 万步，并与基于规则的控制器（RBC）、手工设计的奖励（E2）、仅能量奖励（E3）以及简单的温度偏差舒适度奖励（E4）进行比较。以相对于 RBC 的比率报告的街区级关键绩效指标（KPIs）显示，PIRS 在成本、碳和电力指标上达到了与手工基线相当的水平，同时显著优于非物理基础的设计——特别是在负荷爬坡（1.78 倍 vs. ~2.4 倍 RBC）和每日峰值需求方面。在此训练预算下，所有 DRL 策略均高于 RBC；我们诚实地解释这一差距，并将 PIRS 定位为奖励设计的可解释、符合标准的基础，而非声称在有限算力下对经典控制具有主导地位。

Abstract

Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functions are specified in deep reinforcement learning (DRL) controllers for buildings. Yet reward design remains largely ad hoc: comfort terms are either hand-tuned heuristics or simple temperature-deviation proxies without explicit grounding in thermal-comfort physics. We present PIRS (Physics-Informed Reward Shaping), which replaces these ad-hoc comfort proxies with the ISO 7730 Predicted Mean Vote (PMV) formulation inside a weighted multi-objective reward for Soft Actor-Critic (SAC). By anchoring the comfort signal in the ISO 7730 PMV formulation, PIRS improves reward interpretability and provides a standards-grounded comfort proxy without changing any other component of the learning pipeline. We evaluate PIRS in CityLearn v2.1.2 (challenge 2022 phase 1) with a central SAC agent trained for 50k steps over five random seeds, and compare against a rule-based controller (RBC), a manually engineered reward (E2), an energy-only reward (E3), and a naive temperature-deviation comfort reward (E4). District-level key performance indicators (KPIs), reported as ratios versus RBC, show that PIRS attains cost, carbon, and electricity metrics on par with the manual baseline while substantially outperforming non-physics-grounded designs, particularly on load ramping (1.78x vs. ~2.4x RBC) and daily peak demand. All DRL policies remain above RBC at this training budget; we interpret this gap honestly and position PIRS as an interpretable, standards-aligned foundation for reward design rather than a claim of dominance over classical control at limited compute.
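The idea of a weighted multi-objective reward with a PMV-based comfort term can be sketched as below. This is a hypothetical shape for such a reward, not the paper's formula: the weights, the choice of cost and carbon as the other terms, and the comfort band of 0.5 are all assumptions, and the PMV value itself is taken as an input rather than computed from the ISO 7730 equations.

```python
def shaped_reward(cost, carbon, pmv,
                  w_cost=1.0, w_carbon=1.0, w_comfort=1.0, pmv_band=0.5):
    """Negative weighted sum of energy terms and a PMV comfort penalty.

    pmv is assumed to come from an external ISO 7730 PMV calculation.
    The penalty is zero while |pmv| stays inside the (assumed) comfort band
    and grows linearly outside it.
    """
    comfort_penalty = max(0.0, abs(pmv) - pmv_band)
    return -(w_cost * cost + w_carbon * carbon + w_comfort * comfort_penalty)


# Toy usage: a comfortable step (PMV 0.3) versus an uncomfortable one (PMV 1.5).
r_comfortable = shaped_reward(cost=1.0, carbon=2.0, pmv=0.3)
r_too_warm = shaped_reward(cost=1.0, carbon=2.0, pmv=1.5)
```

Because only the reward function changes, a sketch like this would plug into an otherwise unmodified SAC training loop, which matches the abstract's claim that no other pipeline component is altered.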

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

评分理由: The paper focuses on physics-informed reward shaping for Soft Actor-Critic (SAC), which is a model-free reinforcement learning algorithm applied to building energy management. It does not involve MLLMs, multimodal data processing, world models, unifying multiple models, or model-based RL (which requires learning/planning with a dynamics model). Therefore, there is no relevance to the specified keywords.

关键词

Physics-Informed Reward Shaping, Soft Actor-Critic, Building Energy Management, Predicted Mean Vote, Deep Reinforcement Learning, Reward Function Design, CityLearn, Comfort Optimization

180. SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping [FAIL]

Score: 0.0 / 26.5

Authors: Gennady Andrienko, Natalia Andrienko

Published: 2026-05-27

TL;DR: SmartIterator provides structured six-phase visual analytics workflows that guide analysts in choosing and evaluating unsupervised data groupings across a parameter sweep.

摘要翻译

无监督学习方法（包括主题建模、基于划分的聚类和基于密度的聚类）可在无需人工指导的情况下生成数据分组，然而对这些分组的选取与评估本身不应是无监督的。本文提出 SmartIterator（SI），这是一种视觉分析方法，它将参数扫描过程中产生的完整分组结果序列视为一等分析对象。针对每种方法家族，SI 提供了一个结构化的六阶段工作流程，引导分析师系统性地探索分组结果——从质量指标概览、转换稳定性评估、成员资格置信度评估、内容与上下文检查，到再出现原型验证，最终形成明智的决策——在此过程中逐步累积对数据结构的理解。这些工作流程通过 IteraScope（IS）得以实现，这是一种协调的视觉显示系统，它结合了带有语义编码的质量指标图表、带有 Sankey 风格转换流和成员资格置信度小提琴图的一维组嵌入、带有 HDBSCAN 检测到的再出现原型的二维组嵌入（突出显示捕捉所有持久模式的迭代），以及用于上下文解释的领域特定链接视图。我们在以下三个数据集上演示了这三个工作流程：(1) 来自 VAST Challenge 2011 的模拟社交媒体消息（基于密度的聚类，并与真实标签进行了验证），(2) 覆盖约 1500 个 NUTS-3 区域的欧盟人口统计数据（基于划分的聚类），以及 (3) 30 年间的 IEEE VIS 论文（NMF 主题建模）。这些工作流程构成了本文的主要贡献：它们提供了可操作的、针对特定方法的指导，用于导航参数空间、研究数据结构如何随配置演变，并将分析理解扎根于领域上下文——从而获得任何单个“最佳”结果都无法提供的关于数据的知识。

Abstract

Unsupervised learning methods (topic modeling, partition-based clustering, and density-based clustering) produce data groupings without human guidance, yet choosing and evaluating those groupings should not itself be unsupervised. We present SmartIterator (SI), a visual analytics approach that treats the full sequence of grouping results across a parameter sweep as a first-class analytical object. For each method family, SI provides a structured six-phase workflow that guides the analyst through systematic exploration of grouping results, from quality-metric overview through transition-stability assessment, membership-confidence evaluation, content and context inspection, and recurrent-archetype verification to an informed decision, building cumulative understanding of data structure along the way. The workflows are operationalized through IteraScope (IS), a coordinated visual display combining quality-metric charts with semantic color encoding, a 1D group embedding with Sankey-style transition flows and violin plots of membership confidence, a 2D group embedding with HDBSCAN-detected recurrent archetypes that highlights iterations capturing all persistent patterns, and domain-specific linked views for contextualized interpretation. We demonstrate the three workflows on: (1) simulated social-media messages from the VAST Challenge 2011 (density-based clustering, validated against ground truth), (2) EU population statistics across ~1,500 NUTS-3 regions (partition-based clustering), and (3) 30 years of IEEE VIS papers (NMF topic modeling). The workflows constitute the main contribution: they provide actionable, method-specific guidance for navigating parameter spaces, studying how data structure evolves across configurations, and grounding analytical understanding in domain context, yielding knowledge about the data that no single "best" result can provide.

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

评分理由: The paper addresses visual analytics for unsupervised data grouping (clustering, topic modeling), while the keywords pertain to large-scale AI models (MLLM, MultiModal), World Models, and Reinforcement Learning. There is no substantive overlap in methodology or domain regarding model architectures or training paradigms. Weighted total score is 0.0, failing the dynamic passing threshold of 26.5. No expert authors from the specified list are present.

关键词

Visual Analytics, Unsupervised Learning, Data Grouping, Clustering, Topic Modeling, Parameter Sweep, Workflow, Quality Metrics

181. The Illusion of Opting in AI-Mediated Consequential Decisions [FAIL]

Score: 0.0 / 26.5

Authors: Eugene Yu Ji

Published: 2026-05-27

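TL;DR: This paper identifies the "illusion of opting" created by AI-mediated decisions and the way it weakens human agency, and proposes three norms (existential honesty, ecological rationality, and counterfactual reparation) to protect meta-capacity.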

摘要翻译

基于 Ullmann-Margalit 的"opting"（变革性、不可逆且被封闭替代方案所笼罩）概念，我们指出当前 AI 系统提出了一个现有 AI 伦理尚未充分把握的深刻伦理问题：the illusion of opting（抉择的幻觉），在此情形下，个人和群体遭遇有意义后果选择的虚假表象，而成为真正具备选择能力所需的 agency（能动性）却被削弱了。针对那些主要将 AI 视为给定目的优化器的方法，我们认为应通过 AI 系统是否保护并培养对抗抉择幻觉的 meta-capacity（元能力）来评估 AI 系统：即社会与制度支撑的 agentive capacity（能动性能力），通过这种能力，means and ends（手段与目的）可以被形成、争论、修订和归属。这种重构对处境不利群体尤为紧迫，因为当 AI 中介的路径误导行为和行动时，他们最无力承担抉择幻觉带来的成本。我们提出三个针对 AI 中介后果性选择的 normative imperatives（规范性指令）：existential honesty（存在主义诚实），承认预测的局限性；ecological rationality（生态理性），将指导置于异质的生活生态之中；以及 counterfactual reparation（反事实修复），在 AI 中介决策路径失败时承认并修复被封闭的替代方案。

Abstract

Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that current AI systems raise a profound ethical problem that existing AI ethics has not fully captured: the illusion of opting, in which persons and groups encounter the deceptive appearance of meaningful consequential choice while the agency needed to become genuinely capable of choosing is weakened. Against approaches that treat AI primarily as an optimizer of already given ends, we argue that AI systems should be evaluated by whether they protect and cultivate meta-capacity against the illusion of opting: the socially and institutionally scaffolded agentive capacity through which means and ends can be formed, contested, revised, and owned. This reframing is especially urgent for disadvantaged populations, who are least able to absorb the costs of the illusion of opting when AI-mediated pathways misdirect behavior and action. We propose three normative imperatives for AI-mediated consequential decisions: existential honesty, which acknowledges the limits of prediction; ecological rationality, which situates guidance within heterogeneous lived ecologies; and counterfactual reparation, which acknowledges and repairs foreclosed alternatives when AI-mediated decision-making pathways fail.

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

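评分理由: The paper belongs to AI ethics and philosophy. It discusses the "illusion of opting" in AI-mediated decisions and the protection of human meta-capacity, and does not involve concrete machine learning model architectures, multimodal techniques, or reinforcement learning algorithms. It is therefore unrelated to the five technical keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL), and every relevance score is 0. The author list contains none of the specified experts.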

关键词

Illusion of Opting, AI-Mediated Consequential Decisions, Meta-capacity, Existential Honesty, Ecological Rationality, Counterfactual Reparation, Agency

182. QuITE: Query-Based Irregular Time Series Embedding [FAIL]

Score: 0.0 / 26.5

Authors: JungHoon Lim

Published: 2026-05-27

摘要翻译

不规则多变量时间序列（IMTS）在实践中十分普遍，然而其不规则采样特性使得有效建模变得复杂。现有方法通常要么（i）设计专用架构，从而限制了经过验证的多变量时间序列（MTS）模型的复用；要么（ii）通过插值将 IMTS 映射到规则的时间网格上，这可能会因引入人工值而扭曲时间动态。为了解决这些局限性，我们提出了一种基于输入嵌入的新方法。我们发现，关键瓶颈并不在于骨干架构，而在于假设均匀采样的常规嵌入层。在本文中，我们引入了 QuITE（基于查询的不规则时间序列嵌入），这是一种简单却有效的、适用于 IMTS 的即插即用嵌入模块。QuITE 利用可学习的查询令牌，通过单个自注意力层聚合不规则观测，直接生成与骨干架构兼容的潜在表示，而无需生成人工值或修改架构。在真实世界基准上的广泛实验表明，QuITE 一致地提升了 MTS 模型的性能，在多样化的数据集和骨干架构上，预测任务的平均相对增益高达 54.7%，分类任务高达 15.8%。代码开源地址为：https://github.com/Meaningfull9502/QuITE。

Abstract

Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing approaches typically either (i) design specialized architectures that limit the reuse of proven Multivariate Time Series (MTS) models, or (ii) map IMTS onto regular temporal grids through interpolation, which may distort temporal dynamics by introducing artificial values. To address these limitations, we propose a new input-embedding-based approach. We identify that the key bottleneck lies not in the backbone architecture, but in conventional embedding layers that assume uniform sampling. In this work, we introduce QuITE (Query-Based Irregular Time Series Embedding), a simple yet effective plug-and-play embedding module for IMTS. QuITE employs learnable query tokens to aggregate irregular observations through a single self-attention layer, directly producing backbone-compatible latent representations without artificial value generation or architectural modification. Extensive experiments on real-world benchmarks show that QuITE consistently improves MTS models, yielding average relative gains of up to 54.7% in forecasting and 15.8% in classification across diverse datasets and backbone architectures. Code is available at: https://github.com/Meaningfull9502/QuITE.
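The core idea, a fixed number of query tokens pooling a variable number of irregularly timed observations via attention, can be illustrated with a small sketch. This is not the QuITE implementation from the linked repository: the queries here are plain fixed vectors rather than learned parameters, the observation encoding (sinusoidal time features plus the raw value) is a hypothetical choice, and the attention is a bare query-to-observation softmax without projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_observation(t, value, time_dim=4):
    """Hypothetical encoding: sinusoidal features of the timestamp, then the value."""
    feats = []
    for i in range(time_dim // 2):
        freq = 1.0 / (10.0 ** i)
        feats += [math.sin(freq * t), math.cos(freq * t)]
    return feats + [value]


def query_pool(times, values, queries, time_dim=4):
    """Each query attends over all observations and returns one pooled vector.

    The output has len(queries) rows no matter how many observations arrive,
    which is what lets a fixed-size backbone consume irregular input.
    """
    obs = [encode_observation(t, v, time_dim) for t, v in zip(times, values)]
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, o)) / math.sqrt(len(q)) for o in obs]
        weights = softmax(scores)
        pooled.append([sum(w * o[j] for w, o in zip(weights, obs))
                       for j in range(len(q))])
    return pooled


# Toy usage: three unevenly spaced observations, two query tokens of width 5.
tokens = query_pool([0.0, 0.7, 3.1], [1.0, 2.0, 3.0],
                    queries=[[0.0] * 5, [0.1, 0.0, 0.0, 0.0, 1.0]])
```

No interpolated values are created at any point: the observations are consumed at their original timestamps, which is the property the abstract contrasts with grid-based approaches.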

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

评分理由: Scoring failed: Expecting ',' delimiter: line 10 column 138 (char 312)

183. Performance and Explainability Requirements of Evolutionary Algorithms in Real-World Physics-Informed Optimization [FAIL]

Score: 0.0 / 26.5

Authors: Helena Stegherr, Michael Heider, Nils Meyer, Tobias Thummerer, Thomas Wendler, Pierre Aublin, Ennio Idrobo-Àvila, Lars Mikelsons, Sebastian Zaunseder, Jörg Hähner

Published: 2026-05-27

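TL;DR: This paper studies the performance and explainability requirements of evolutionary algorithms in real-world physics-informed optimization and points out the gap between how the algorithms are applied and what practitioners expect.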

摘要翻译

进化计算 (Evolutionary Computation) 提供了多种工具，用于解决复杂的现实世界优化问题。然而，研究往往聚焦于规模较小、经过简化的问题以及优化算法，这些方法有时在现实场景中未能达到预期。此外，在这些场景下，对所应用的算法及其提供的解决方案的信任往往至关重要，但这需要对搜索过程本身的理解。这导致进化计算在许多应用背景下往往未被从业者认真考虑，其中包括基于物理的建模 (physics-based modeling)。本文详细阐述了进化计算中可用于缓解这些问题的技术。首先，由领域专家介绍了五个现实世界的基于物理的优化问题。针对每个问题，本文阐述了进化算法在性能和可解释性方面的要求，旨在提升信任度和可用性。我们发现，所有领域专家都期望算法能快速收敛至优良解，并希望获得关于结果生成过程的一些解释，而其他要求则高度依赖于具体问题。最后，我们介绍了现有的方法，这些方法可用于改进进化算法的上述方面，但据我们所知，尚未在复杂的现实场景中应用。这表明了这两个领域之间存在差距，需要加以弥合，以充分发挥进化计算的潜力。

Abstract

Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on smaller, simplified problems and optimization algorithms that sometimes miss expectations in real-world scenarios. Additionally, trust in the applied algorithm and the solutions it provides is often essential in such settings, but requires an understanding of the search process itself. This leads to evolutionary computation often not being seriously considered by practitioners in many application contexts, among them physics-based modeling. In this article, techniques from evolutionary computation are detailed that can alleviate these problems. First, five real-world physics-based optimization problems are introduced and described by domain experts. For each of these, the requirements for the evolutionary algorithm regarding performance and explainability to increase trust and usability are presented. We found that all domain experts expect fast convergence to a good solution and want some explanations for how the results were formed, while other requirements strongly depend on the respective problem. Finally, we present existing approaches that can be leveraged to improve those aspects of evolutionary algorithms but have to our knowledge never been employed in complex real-world scenarios. This implies a gap between both domains that needs to be closed to exploit the full potential of evolutionary computation.

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

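评分理由: The paper focuses on the performance and explainability of evolutionary algorithms in physics-informed optimization, which belongs to traditional engineering optimization. The scoring keywords cover frontier AI topics: multimodal large models (MLLM, MultiModal), model unification (Unify Models), world models (World Models), and reinforcement learning (model-based RL). The two share no methodology, application scenario, or technology stack, so every relevance score is 0. The author list contains none of the specified experts (Yang Shi et al.).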

关键词

Evolutionary Algorithms, Physics-Informed Optimization, Performance Requirements, Explainability, Real-World Scenarios, Trust, Optimization Problems

184. Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting [FAIL]

Score: 0.0 / 26.5

Authors: Shadmehr Zaregarizi, Khashayar Yavari

Published: 2026-05-27

摘要翻译

我们提出了一种用于 CTF-4-Science Lorenz 基准测试的自适应 reservoir computing (储层计算) 框架，该框架在跨越五种定性不同场景的十二个独立任务上评估机器学习模型：基线预测、噪声信号重构、噪声下的预测、少样本学习和参数泛化。我们并未采用统一的推理策略，而是根据每个评估场景的具体需求，定制回声状态网络 (ESNs) 的训练与预测流程。我们的主要贡献体现在四个方面：(1) 精确的 reservoir state synchronization，消除短时预测中的热身近似误差；(2) 基于直方图的候选选择，直接优化长时间遍历评估指标；(3) 多种子 reservoir search，适用于训练数据严重受限的少样本情形；(4) 顺序多序列训练，解决参数泛化任务中的状态分布不匹配问题。所提出的框架在公共基准测试排行榜上取得了 74.91 分，表明精心适配的 reservoir computing 是一种具有竞争力且计算高效的方法，能够应对多样化的混沌系统建模挑战。

Abstract

We present an adaptive reservoir computing framework for the CTF-4-Science Lorenz benchmark, which evaluates machine learning models across twelve distinct tasks spanning five qualitatively different scenarios: baseline forecasting, noisy signal reconstruction, forecasting under noise, few-shot learning, and parametric generalization. Rather than applying a uniform inference strategy, we tailor the training and prediction procedure of Echo State Networks (ESNs) to the specific demands of each evaluation scenario. Our key contributions are fourfold: (1) exact reservoir state synchronization that eliminates warmup approximation error in short-time prediction; (2) histogram-guided candidate selection that directly optimizes the long-time ergodic evaluation metric; (3) multi-seed reservoir search for few-shot regimes with severely limited training data; and (4) sequential multi-sequence training that resolves state-distribution mismatch in parametric generalization tasks. The proposed framework achieves a score of 74.91 on the public benchmark leaderboard, demonstrating that carefully adapted reservoir computing constitutes a competitive and computationally efficient approach for diverse chaotic system modeling challenges.
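For readers unfamiliar with Echo State Networks, the textbook formulation is sketched below. It is the standard form without a leak rate, not necessarily the variant used in the paper; the symbols (input $u_t$, reservoir state $x_t$, fixed random matrices $W_{\mathrm{in}}$ and $W$, trained readout $W_{\mathrm{out}}$, ridge parameter $\lambda$) are introduced here for illustration and do not appear in the abstract.

```latex
\begin{aligned}
x_{t+1} &= \tanh\left(W_{\mathrm{in}}\, u_{t+1} + W\, x_t\right), \\
\hat{y}_{t+1} &= W_{\mathrm{out}}\, x_{t+1}, \\
W_{\mathrm{out}} &= Y X^{\top}\left(X X^{\top} + \lambda I\right)^{-1}.
\end{aligned}
```

Here $X$ stacks the collected reservoir states as columns and $Y$ the corresponding targets, so only the linear readout is trained (by ridge regression). In this picture, the "warmup" mentioned in contribution (1) would be the initial stretch of inputs used to drive $x_t$ toward a state consistent with the signal before predictions are read out.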

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

评分理由: Scoring failed: Expecting ',' delimiter: line 10 column 227 (char 401)

185. Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction [FAIL]

Score: 0.0 / 26.5

Authors: Idris Tatachak, Luis Kabongo, Nicolas Papadakis, Xavier Ripoche, Simon Rit

Published: 2026-05-27

TL;DR: This paper proposes a gradient-step plug-and-play model trained on simulated data to reduce photon noise in dental cone-beam CT reconstruction.

摘要翻译

本工作的目标是降低光子噪声对牙科锥形束 CT 重建的影响。我们采用逆问题建模，并开发了一种数据驱动的先验。为此，我们模拟扇束采集数据并向投影数据中添加光子噪声。该先验是通过使用重建后的模拟采集数据训练一个梯度步（gradient-step）去噪器获得的。训练好的模型被集成到一个即插即用（plug-and-play）梯度步算法中，用于从模拟投影数据中重建图像。合成数据上的实验展示了训练模型的去噪能力，而真实图像上的定性评估则展示了算法的性能和泛化能力。

Abstract

The goal of this work is to reduce the effect of photon noise in dental cone-beam CT reconstruction. We consider an inverse problem formulation and develop a data-based prior. To this end, we simulate fan-beam acquisitions and add photon noise to the projection data. The prior is obtained by training a gradient-step denoiser using reconstructed simulated acquisitions. The trained model is integrated into a plug-and-play gradient-step algorithm to reconstruct images from simulated projections. Experiments on synthetic data demonstrate the denoising capabilities of the trained model, while qualitative evaluations on real images showcase the algorithm's performance and generalization ability.

评分详情

关键词	权重	相关度
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

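评分理由: The paper focuses on medical image reconstruction (CT denoising), solving an inverse problem with a plug-and-play algorithm and gradient steps. The provided keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all belong to multimodal large models and reinforcement learning and have no direct semantic overlap with the medical image processing techniques used here, so every relevance score is 0. The author list contains none of the specified experts.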

关键词

Dental Cone-Beam CT, Image Reconstruction, Plug-and-Play Algorithm, Gradient Step Denoiser, Photon Noise Reduction, Inverse Problem, Simulated Acquisitions

186. BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization [FAIL]

Score: 0.0 / 26.5

Authors: Jeyeon Eo, Joo Young Kim, Ran Ju, Minyoung Jung, Unggi Lee

Published: 2026-05-27

TL;DR: BuddyBench establishes a privacy-preserving multi-task benchmark for pediatric social-communication personalization, linking learning trajectories and clinical data to support knowledge tracing and causal inference.

Abstract (Chinese translation)

BuddyBench 提出了一种针对儿科社交沟通个性化的隐私约束多任务基准。与主要强调影像、遗传或横断面临床表型的现有神经发育数据库不同，BuddyBench 在统一基准模式下整合了练习级学习轨迹、标准化临床评估、BuddyPlan 自评以及随机治疗终点。BuddyBench 整合了两个队列：ND-03 是一个观察性队列，对任务 1-2 具有密集的练习覆盖（n = 189），而 ND-02 是一个随机对照试验队列，针对任务 3-4（n = 86，ITT）。二者共同支持知识追踪、下一练习推荐、临床预测及因果推断，从而将行为个性化与临床评估联系起来。此外，我们还引入了 BuddyBench-Sim，这是一个用于可复现评估的合成伴随数据集。基线方法在各任务上均展现出效果，同时保护了儿科临床记录。

Abstract

BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks 1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks 3-4 (n = 86, ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

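Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper centers on BuddyBench, a privacy-constrained multi-task benchmark for pediatric social communication. Its core contribution is linking clinical assessments with learning trajectories to support knowledge tracing and causal inference. The given keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all point to advanced AI model architectures and reinforcement learning. The paper does not involve multimodal data, large language models, world-model construction, or reinforcement learning algorithms, and has no substantive connection to the research directions the keywords cover, so every keyword is scored 0. The author list does not include any of the specified experts.

Keywords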

Pediatric Social-Communication, Privacy-Constrained Benchmark, Multi-Task Benchmark, Knowledge Tracing, Clinical Prediction, Causal Inference, Learning Trajectories

187. Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy [FAIL]

Score: 0.0 / 26.5

Authors: Huikang Liu, Aras Selvi, Wolfram Wiesemann

Published: 2026-05-27

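TL;DR: This paper proposes an additive noise mechanism based on mixtures of Gaussians that, in moderate- and low-privacy regimes, reduces the noise amplitude needed for $(\varepsilon, \delta)$-differential privacy and outperforms the analytic Gaussian mechanism.

Abstract (Chinese translation)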

我们设计了一类加性噪声机制，该机制满足针对已知灵敏度的标量实值查询函数的 $(\varepsilon, δ)$-差分隐私（DP），特别关注中等和低隐私保护程度。这些机制，我们称之为混合机制（mixture mechanisms），是通过混合多个具有相同方差但均值和混合权重不同的高斯分布而构造的。所得分布可被解释为凸组合，由一个零均值高斯分布（如解析高斯机制（analytic Gaussian mechanism）中所用）以及若干均值取决于查询函数灵敏度的额外高斯分布组成。我们推导了满足 $(\varepsilon, δ)$-DP 所需的方差紧致条件，并提供了计算这些方差的高效算法。与解析高斯机制相比，我们的机制显著降低了期望噪声幅度（$l_1$ 损失）和方差（零均值分布的 $l_2$ 损失）。在我们设计所针对的低隐私保护程度下，我们的机制接近最优性，几乎消除了解析高斯机制的最优性差距。

Abstract

We design a class of additive noise mechanisms that satisfy $(\varepsilon, \delta)$-differential privacy (DP) for scalar, real-valued query functions with known sensitivities, with a particular focus on moderate and low-privacy regimes. These mechanisms, which we call mixture mechanisms, are constructed by mixing multiple Gaussian distributions that share the same variance but differ in their means and mixture weights. The resulting distributions can be interpreted as convex combinations of a zero-mean Gaussian (as used in the analytic Gaussian mechanism) and additional Gaussians whose means depend on the sensitivity of the query function. We derive tight conditions on the variances required for $(\varepsilon, \delta)$-DP and provide efficient algorithms to compute them. Compared to the analytic Gaussian mechanism, our mechanisms yield substantially lower expected noise amplitudes ($\ell_1$-loss) and variances ($\ell_2$-loss for zero-mean distributions). In the low-privacy regime that motivates our design, our mechanisms approach optimality, mitigating nearly all of the optimality gap of the analytic Gaussian mechanism.
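For context, the baseline named in the abstract can be sketched in a few lines. The code below calibrates the analytic Gaussian mechanism only, using the standard tight condition for zero-mean Gaussian noise (Balle and Wang, 2018). It does not implement the paper's mixture mechanisms, and the function names and bisection tolerance are illustrative assumptions.

```python
import math

def _phi(t):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def gaussian_delta(sigma, eps, sensitivity=1.0):
    """Smallest delta for which N(0, sigma^2) noise is (eps, delta)-DP."""
    a = sensitivity / (2.0 * sigma)
    b = eps * sigma / sensitivity
    return _phi(a - b) - math.exp(eps) * _phi(-a - b)

def calibrate_sigma(eps, delta, sensitivity=1.0, tol=1e-10):
    """Bisect for the smallest sigma whose delta does not exceed the target."""
    lo, hi = 1e-6, 1.0
    while gaussian_delta(hi, eps, sensitivity) > delta:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if gaussian_delta(mid, eps, sensitivity) > delta:
            lo = mid
        else:
            hi = mid
    return hi

sigma = calibrate_sigma(eps=1.0, delta=1e-5)
# l1-loss of zero-mean Gaussian noise: E|Z| = sigma * sqrt(2 / pi)
expected_abs_noise = sigma * math.sqrt(2.0 / math.pi)
print(sigma, expected_abs_noise)
```

The two quantities printed at the end are the ones the abstract compares against: the variance (through `sigma`) and the expected noise amplitude.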

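Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on differential privacy and Gaussian noise mechanisms, which belong to statistics and privacy-preserving computation. The scoring keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all belong to multimodal large models, world models, and reinforcement learning. The paper's content is unrelated to the keyword topics and has no point of overlap, so every keyword relevance score is 0. The author list does not include any of the specified experts.

Keywords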

Differential Privacy, Mixture Mechanisms, Gaussian Distributions, Additive Noise, Query Functions, Sensitivity, Low-Privacy Regimes

188. I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors [FAIL]

Score: 0.0 / 26.5

Authors: Lelia Erscoi, Tomi Kinnunen

Published: 2026-05-27

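TL;DR: This study investigates how well humans detect synthetic speech under different trust cues, finding that utterance class is the main determinant of detection accuracy, while trust cues produce no main effects but motivate detection behavior.

Abstract (Chinese translation)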

自动 deepfake 检测已受到相当多的研究关注，然而人类实际接触合成语音的社会技术环境仍知之甚少。本研究将语音 deepfake 检测视为一种感知和情境过程，呈现了一个定位任务：47 名参与者在三种 manipulated trust cues（操纵的信任线索）——instructional framing（指令性框架）、affective priming（情感启动）和 provenance labeling（来源标注）下，对真实、完全合成及部分合成 utterances（语句）中的可疑合成片段进行了标记。参与者对 mechanicalness（机械感）、expressiveness（表现力）、intelligibility（可懂度）、clarity（清晰度）、calmness（镇定度）以及 confidence of evaluation（评估置信度）提供了质量评分。Utterance class（语句类别）是检测准确率和感知质量的主要决定因素；trust cues（信任线索）未产生主要效应，但激发了检测行为。fully synthetic speech（完全合成语音）的检测率低于机会水平。质量评分反映了 utterance type（语句类型），表明在 overt detection（显性检测）失败之处存在 implicit discrimination（隐性判别）。

Abstract

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.

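Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper is a socio-technical study of how humans detect synthetic (deepfake) speech and how trust cues affect that behavior, which places it in human-computer interaction and psychology. The given keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all concern AI model architectures, representation learning, and reinforcement learning, and have no direct connection to the paper's topic, so all relevance scores are 0. The author list (Lelia Erscoi, Tomi Kinnunen) does not include any of the specified experts, so no bonus is added. The weighted total is 0, below the dynamic passing score of 26.5.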
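The relevance table and the "Score: 0.0 / 26.5" line in each entry can be read as a weighted sum compared against a passing threshold. The sketch below assumes the total is simply the sum of weight times relevance plus an optional author bonus. The report does not state its exact formula or how the dynamic threshold is derived, so `weighted_score` and `author_bonus` are hypothetical.

```python
def weighted_score(rows, author_bonus=0.0):
    """rows holds (keyword, weight, relevance out of 10); returns the weighted total."""
    return sum(weight * relevance for _, weight, relevance in rows) + author_bonus

rows = [
    ("Unify Models", 2.0, 0.0),
    ("World Models", 2.0, 0.0),
    ("MLLM", 2.0, 0.0),
    ("MultiModal", 2.0, 0.0),
    ("model-based RL", 2.0, 0.0),
]
PASSING_SCORE = 26.5  # threshold quoted in this report; its derivation is not given

total = weighted_score(rows)
verdict = "PASS" if total >= PASSING_SCORE else "FAIL"
print(f"Score: {total:.1f} / {PASSING_SCORE} ({verdict})")
```

Under this assumed formula, five keywords at weight 2.0 cap the keyword part of the total at 100, so a threshold of 26.5 would require noticeable relevance on at least two keywords.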

Keywords

Synthetic speech, Deepfake detection, Human perception, Trust cues, Utterance classification, Quality ratings, Socio-technical investigation

189. Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings [FAIL]

Score: 0.0 / 26.5

Authors: Stanislav Kirdey, Clark Labs Inc

Published: 2026-05-27

TL;DR: Clark Hash is a stateless codec that compresses 384-dimensional neural embeddings into 48-byte sketches while keeping a high correlation with dense cosine similarity scores.

Abstract (Chinese translation)

Clark Hash 是一种用于在更少空间中存储神经嵌入的小型方法。该方法对每个数据库向量进行归一化，应用确定性稀疏符号 Johnson-Lindenstrauss 投影，裁剪结果，并存储固定宽度的标量量化码。查询保持浮点格式，并与存储的草图进行比对得分。在默认的 384 维句子嵌入设置下，Clark Hash 将余弦搜索向量存储在 48 字节中，而密集 f32 存储则需要 1536 字节。体积缩小了 32 倍。该方法在存储新向量前无需训练轮次、学习码本、旋转操作或语料库统计。本文介绍了该编解码器、Rust 实现，以及在来自 29 个子集的 9,304 对标注数据上进行的句子相似度评估。使用多语言 MiniLM 编码器，48 字节草图在 STS17 和 STS22 数据集上与密集余弦分数的宏观皮尔逊相关系数分别达到 0.910 和 0.946。Clark Hash 并非新的 Johnson-Lindenstrauss 定理，也不是近似最近邻索引的替代品。它是一种用于紧凑嵌入存储的简单无状态编解码器。

Abstract

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
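The pipeline described in the abstract (normalize, sparse signed projection, clip, fixed-width quantization, float queries scored against sketches) can be sketched as follows. This is an illustration under stated assumptions, not the authors' Rust codec. The split of 48 bytes into 96 four-bit codes, the eight nonzeros per row, the clip range, and the SHA-256-derived signs are all hypothetical choices.

```python
import hashlib
import math
import random

IN_DIM, OUT_DIM, BITS = 384, 96, 4  # assumed split: 96 codes x 4 bits = 48 bytes
NNZ = 8                              # assumed nonzeros per projection row
CLIP = 3.0                           # assumed clip range, in projected standard deviations
LEVELS = (1 << BITS) - 1             # 15 quantization steps

def _row(j):
    """Sparse signed row j: NNZ (index, sign) pairs derived from a hash, with no training."""
    pairs = []
    for k in range(NNZ):
        h = hashlib.sha256(f"{j}:{k}".encode()).digest()
        pairs.append((int.from_bytes(h[:4], "little") % IN_DIM, 1.0 if h[4] & 1 else -1.0))
    return pairs

ROWS = [_row(j) for j in range(OUT_DIM)]

def project(vec):
    """L2-normalize, then apply the sparse signed projection scaled to roughly unit variance."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    scale = math.sqrt(IN_DIM / NNZ) / norm
    return [scale * sum(sign * vec[i] for i, sign in row) for row in ROWS]

def encode(vec):
    """Clip, quantize to 4-bit codes, and pack two codes per byte."""
    codes = [round((min(max(y, -CLIP), CLIP) + CLIP) / (2 * CLIP) * LEVELS) for y in project(vec)]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, OUT_DIM, 2))

def score(query, sketch):
    """Cosine between the float-projected query and the dequantized sketch."""
    q = project(query)
    d = [c / LEVELS * 2 * CLIP - CLIP for b in sketch for c in (b >> 4, b & 0xF)]
    dot = sum(a * b for a, b in zip(q, d))
    nq, nd = math.sqrt(sum(a * a for a in q)), math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0

rng = random.Random(0)
v = [rng.gauss(0, 1) for _ in range(IN_DIM)]
w = [rng.gauss(0, 1) for _ in range(IN_DIM)]
sketch = encode(v)
print(len(sketch), IN_DIM * 4 // len(sketch))  # sketch size in bytes, ratio to dense f32
```

Because the projection rows come from a fixed hash rather than from data, a new vector can be encoded without any prior pass over the corpus, which is the stateless property the abstract emphasizes.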

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper proposes Clark Hash, a stateless quantization method for compressing neural embeddings into 48 bytes using Johnson-Lindenstrauss projections. It focuses solely on storage efficiency and sentence similarity preservation. It does not address Unify Models, World Models, MLLM architectures, MultiModal integration, or Model-Based Reinforcement Learning, resulting in zero relevance for all specified keywords. The author list (Stanislav Kirdey) does not contain any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). The weighted total score is 0.0, which is below the dynamic passing score of 26.5.

Keywords

Clark Hash, Neural Embeddings, Quantization, Johnson-Lindenstrauss, Stateless, Compact Storage, Sentence Similarity

190. Integrated and Cross-Architecture Interpretation of LLM Reasoning [FAIL]

Score: 0.0 / 26.5

Authors: Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang

Published: 2026-05-27

Abstract (Chinese translation)

理解大语言模型（LLM）如何推理受到一种实际不对称性的阻碍：尽管其生成输出是可观察的，但底层的推理模式仍然不透明。依赖单一探针，例如互信息峰值（MIP）或深度思考比率（DTR），可能会低估真实的推断结构。为应对这一不足，我们提出了一种集成跨架构推理（IAR）框架，旨在为大语言模型推理的可解释性提供统一方法。具体而言，我们首先提出使用带宽校准的 MIP 结合 Tukey IQR 峰值检测，以隔离输出层中对推理至关重要的标记（tokens）。其次，我们进行了 MIP 选定的标记与 DTR 深度标记之间的重叠分析，以追踪这些标记的跨层轨迹。这同时也揭示了推理关键标记是否同样具有计算密集性，进而有助于理解推理模式如何在模型各层之间演化。最后，我们在多域问题上应用 Jaccard 稳定性度量，以验证 MIP 识别的标记是否能保证推理质量。在三个模型（Qwen-7B、Qwen-14B 和 Llama-8B）上跨越四个领域（数学、代码、逻辑和常识）的广泛实验表明，IAR 具备跨架构的泛化解释能力。

Abstract

Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first propose to use bandwidth-calibrated MIP coupled with Tukey IQR peak detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also discloses whether reasoning-crucial tokens are computation-intensive as well, which further helps explain how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens guarantee reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
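Two building blocks named in the abstract, Tukey IQR peak detection and Jaccard overlap, are standard and easy to illustrate. The sketch below applies them to made-up per-token scores. The data, the `k=1.5` fence multiplier, and the function names are assumptions, and the bandwidth-calibrated mutual information estimate itself is not reproduced.

```python
import statistics

def tukey_peaks(values, k=1.5):
    """Indices whose value lies above the upper Tukey fence Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    fence = q3 + k * (q3 - q1)
    return {i for i, v in enumerate(values) if v > fence}

def jaccard(a, b):
    """|A & B| / |A | B|, defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical per-token mutual information scores for a ten-token output.
mi = [0.1, 0.2, 0.15, 0.12, 2.5, 0.18, 0.11, 1.9, 0.14, 0.16]
mip_tokens = tukey_peaks(mi)
dtr_tokens = {4, 6}  # hypothetical set of "deep" tokens from a second probe
overlap = jaccard(mip_tokens, dtr_tokens)
print(sorted(mip_tokens), overlap)
```

The same `jaccard` helper can serve both uses described in the abstract: comparing the token sets picked by two probes, and comparing the sets picked across problems or domains as a stability measure.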

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 10 column 82 (char 256)

191. Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models [FAIL]

Score: 0.0 / 26.5

Authors: Himanshu Beniwal, Mayank Singh

Published: 2026-05-27

TL;DR: This paper proposes inference-time frameworks to localize and suppress toxicity in language models without retraining, effectively reducing harmful content while maintaining language quality.

Abstract (Chinese translation)

大型语言模型（Large Language Models, LLMs）经常生成有毒、仇恨或有害内容，然而现有的缓解方法依赖于昂贵的重新训练或输出级过滤，缺乏关于毒性在内部起源的机制性洞察。我们提出 Meow2X 和 TRNE，两种互补的无需重新训练的框架，通过分析有毒与中性提示词之间的激活差异，将毒性定位到特定的层和神经元，随后通过推理时的缩放或最小的秩一权重编辑来抑制它们——无需任何梯度下降。在五个语言模型（LMs）、两个基准和 90 种配置上使用双重安全评估器进行的评估表明，该方法在保持语言建模质量的同时，能一致地降低毒性。我们的分析揭示，毒性不成比例地编码在早期的 MLP（多层感知机）层中，在不同架构间存在差异，且被单评估器设置系统性地低估——这强调了多评估器安全评估的必要性。通过将机制性解释性与实际去毒化相结合，我们的框架为更安全、更透明的语言模型提供了一条原则性的路径。

Abstract

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on toxicity mitigation and mechanistic interpretability within standard large language models (LLMs). It does not address Unify Models, World Models, multimodal aspects (MLLM/MultiModal), or reinforcement learning (model-based RL). Consequently, there is no substantive overlap with the provided keywords.

Keywords

Toxicity Localization, Mechanistic Interpretability, Language Models, Targeted Suppression, Inference-time Scaling, Safety Assessment, Harmful Content

192. Learning to Assess the Reliability of Number-of-Runs Estimation in Stochastic Optimization [FAIL]

Score: 0.0 / 26.5

Authors: Sara Gjorgjieva, Eva Tuba, Tome Eftimov

Published: 2026-05-27

TL;DR: The paper develops a learning-based classifier to assess the reliability of run-number estimates in stochastic optimization benchmarks using statistical features, achieving high recall within fixed optimizer configurations.

Abstract (Chinese translation)

在随机优化算法的大规模基准测试中，关键挑战不再在于是否需要重复运行以确保可靠性，而在于如何确定何时收集了充分证据，同时避免产生不必要的计算成本。我们研究了一种近期经验在线启发式方法的基于学习的扩展，该方法利用异常值处理和基于偏度的对称性检查来自适应地估计所需的运行次数。利用 Nevergrad 在 COCO 上进行的 132,000 次运行的标注结果（24 个问题，20 维，每个问题 10 个实例，11 个优化器），我们在 23 个统计、能量无关以及形状和稳定性特征上训练分类器，以预测运行次数估计是否可靠，并通过少数类召回率优先检测错误的估计。我们采用配置内学习设置来评估可靠性预测，其中模型在共享相同优化器的数据上进行训练和测试。结果表明，在配置内场景下可以学习到运行次数可靠性，从而实现以高少数类召回率检测不可靠估计，尽管性能仍受限于固定配置内受限的数据多样性。

Abstract

In large-scale benchmarking of stochastic optimization algorithms, the key challenge is no longer whether repeated runs are needed for reliability, but how to determine when sufficient evidence has been collected without incurring unnecessary computational cost. We study a learning-based extension of a recent empirical online heuristic that adaptively estimates the required number of runs using outlier handling and skewness-based symmetry checks. Using annotated outcomes from 132,000 Nevergrad runs on COCO (24 problems in 20 dimensions, 10 instances each, 11 optimizers), we train classifiers on 23 statistical, energy-free, and shape and stability features to predict whether a run-number estimate is reliable, prioritizing detection of incorrect estimates via minority-class recall. We evaluate reliability prediction using a within-configuration learning setup, where models are trained and tested on data sharing the same optimizer. The results show that run-number reliability can be learned in a within-configuration scenario, enabling detection of unreliable estimates with high minority-class recall, although performance remains limited by the restricted data diversity within fixed configurations.

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on stochastic optimization benchmarking and run-number reliability estimation using statistical features, which is unrelated to multimodal large language models, world models, or model-based reinforcement learning architectures specified in the keywords.

Keywords

Stochastic Optimization, Number-of-Runs Estimation, Reliability Assessment, Benchmarking, Statistical Features, Nevergrad, COCO

193. Adaptive Bandit Algorithms for Contextual Matching Markets [FAIL]

Score: 0.0 / 26.5

Authors: Shiyun Lin, Simon Mauras, Vianney Perchet, Nadav Merlis

Published: 2026-05-27

TL;DR: This paper introduces adaptive bandit algorithms for contextual matching markets that achieve sublinear regret bounds under both stochastic and adversarial context settings.

Abstract (Chinese translation)

我们在匹配市场中研究多臂老虎机学习（bandit learning），其中参与者（players）与臂（arm）构成市场的两个侧面，且参与者的效用（utilities）关于臂的上下文（context）呈线性关系。在每一轮中，新的臂（arm）到达并带有可观察的上下文（context）。随后，算法将它们匹配给参与者，旨在最小化每个参与者相对于稳定匹配（stable matching）基准的遗憾（regret）。这种上下文结构带来了显著复杂性：细微的上下文偏移可能略微改变一个参与者的效用，同时完全重构潜在基准，从而导致其他参与者出现巨大的遗憾激增。我们在两种场景下解决这一问题：一是随机上下文（stochastic contexts），源自潜在分布；二是对抗性上下文（adversarial contexts），可能是任意的。针对随机情况，我们引入了一种新颖的最小偏好差距（minimum preference gap）以捕捉学习难度，并提供了一种完全自适应的算法，该算法具有实例相关的多对数遗憾上界。此外，在温和的分布假设下，我们还建立了匹配的实例无关遗憾上下界。针对对抗性设置，我们提出了一种可行的遗憾定义，该定义在任意上下文下均保持有效，并通过一种自适应算法实现了实例无关的次线性遗憾界。

Abstract

We study bandit learning in matching markets, where players and arms constitute the two market sides, and the players' utilities are linear in the arm contexts. In each round, new arms arrive with observable contexts. Then, the algorithm matches them to players, aiming to minimize each player's regret against a stable matching benchmark. This contextual structure creates significant complexity: subtle context shifts can slightly alter one player's utility while completely reconfiguring the underlying benchmark, causing large regret spikes for others. We address this in two settings: stochastic contexts, drawn from a latent distribution, and adversarial contexts, which may be arbitrary. For the stochastic case, we introduce a novel minimum preference gap to capture learning difficulty and provide a fully adaptive algorithm with an instance-dependent poly-logarithmic regret upper bound. We also establish matching instance-independent regret upper and lower bounds under a mild distributional assumption. For the adversarial setting, we propose a tractable regret notion that remains valid under arbitrary contexts and achieves an instance-independent sublinear regret bound via an adaptive algorithm.
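To make the stable matching benchmark concrete, the sketch below computes linear utilities from contexts and runs player-proposing deferred acceptance (Gale-Shapley). It assumes the arms have known preference lists over players and that the benchmark is the player-optimal stable matching. Neither assumption is stated in the abstract, and the parameter vectors and contexts are toy values.

```python
def player_utilities(thetas, contexts):
    """utility[i][a] = <theta_i, x_a>, i.e. utilities linear in the arm contexts."""
    return [[sum(t * x for t, x in zip(theta, ctx)) for ctx in contexts] for theta in thetas]

def stable_matching(utility, arm_prefs):
    """Player-proposing deferred acceptance; arm_prefs[a] lists players, best first."""
    n_players = len(utility)
    order = [sorted(range(len(row)), key=lambda a: -row[a]) for row in utility]
    rank = [{p: r for r, p in enumerate(prefs)} for prefs in arm_prefs]
    next_choice = [0] * n_players
    holder = {}  # arm -> player it currently holds
    free = list(range(n_players))
    while free:
        p = free.pop()
        if next_choice[p] >= len(order[p]):
            continue  # p has proposed to every arm and stays unmatched
        a = order[p][next_choice[p]]
        next_choice[p] += 1
        if a not in holder:
            holder[a] = p
        elif rank[a][p] < rank[a][holder[a]]:
            free.append(holder[a])  # the arm trades up, the old player is free again
            holder[a] = p
        else:
            free.append(p)
    return {p: a for a, p in holder.items()}

thetas = [[1.0, 0.0], [0.8, 0.2]]    # hypothetical player parameters
contexts = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical arm contexts for one round
arm_prefs = [[1, 0], [0, 1]]         # arm 0 prefers player 1, arm 1 prefers player 0
matching = stable_matching(player_utilities(thetas, contexts), arm_prefs)
print(matching)
```

Both players prefer arm 0 here, so the arm's preference decides who gets it. A small change in the contexts that flips one player's ranking of the arms can change the whole matching, which is the sensitivity the abstract points to.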

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on bandit learning in contextual matching markets and regret minimization against stable matching benchmarks. It does not address multimodal representation learning, large language models, world models, unifying model architectures, or model-based reinforcement learning as specified in the keyword list. The research domain (algorithmic game theory/online learning) is distinct from the target background (multimodal/RL integration).

Keywords

Bandit Learning, Contextual Matching Markets, Regret Minimization, Stable Matching Benchmark, Stochastic Contexts, Adversarial Contexts, Adaptive Algorithms

194. Counterfactually Fair Regression via Optimal Transport [FAIL]

Score: 0.0 / 26.5

Authors: M. Generali Lince, S. Gaucher, J-J. Vie, P. Loiseau

Published: 2026-05-27

TL;DR: This paper proposes a counterfactually fair regression estimator via optimal transport and establishes theoretical fairness guarantees and risk bounds.

Abstract (Chinese translation)

我们考虑学习反事实公平回归器（counterfactually fair regressor）的问题。我们采用因果不确定性视角（causal uncertainty view），其中反事实公平通过重采样噪声（resampled noise）定义。我们致力于为新提出的后处理估计器（post-processing estimator）获得理论公平性保证。首先，我们表明反事实公平等价于在潜在变量（latent variable）条件下满足人口统计学公平性（demographic parity）。这使得我们能够利用重心分位数映射（barycentric quantile map）提供最优公平回归器的闭式表达。为了处理连续潜在变量，我们提出了一种离散化后处理（discretized post-processing）方法。随后，在温和的正则性假设（mild regularity assumptions）下，我们证明了该估计器具有高概率的有限样本公平性保证，给出了以 $\tilde O(n^{-1/3})$ 速率衰减的不公平性，并建立了同阶为 $\tilde O(n^{-1/3})$ 的匹配风险界。我们还给出了几乎公平预测的超额风险（excess risk）的下界。最后，我们将结果推广至放宽的反事实公平（relaxed counterfactual fairness）设置。我们在真实数据和合成数据上验证了该方法。

Abstract

We consider the problem of learning a counterfactually fair regressor. We adopt a causal uncertainty view in which counterfactual fairness is defined with resampled noise. We focus on obtaining theoretical fairness guarantees for a new post-processing estimator. We begin by showing that counterfactual fairness is equivalent to satisfying demographic parity conditional on the latent variable. This allows us to provide a closed-form expression of the optimal fair regressor via a barycentric quantile map. In order to handle continuous latent variables, we propose a discretized post-processing method. Then, under mild regularity assumptions, we prove high-probability finite-sample fairness guarantees for our estimator, providing an unfairness decay at rate $\tilde O(n^{-1/3})$, and establishing a matching risk bound of order $\tilde O(n^{-1/3})$. We provide a matching lower bound on the excess risk of almost fair predictions. Finally, we extend our results to the setting of relaxed counterfactual fairness. We validate our approach on real-world and synthetic data.

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on counterfactual fairness in regression using optimal transport and causal inference, which is unrelated to the provided keywords concerning multimodal models, world models, or reinforcement learning. Therefore, all keyword relevance scores are 0. None of the specified expert authors are present in the author list.

Keywords

Counterfactual Fairness, Optimal Transport, Regression, Causal Uncertainty, Post-processing Estimator, Fairness Guarantees, Risk Bounds

195. Analyzing Quality-Latency-Resource Trade-offs in a Technical Documentation RAG Assistant Using LoRA Adaptation [FAIL]

Score: 0.0 / 26.5

Authors: Evgenii Palnikov, Elizaveta Gavrilova

Published: 2026-05-27

TL;DR: This paper investigates the trade-offs between quality, latency, and resource usage in a Kubernetes documentation RAG assistant utilizing LoRA adaptation, identifying that LoRA adapters targeting q/v attention projections offer optimal performance.

Abstract (Chinese translation)

我们研究了在一种使用生成器低秩适配 (LoRA) 的基于文档的检索增强生成 (RAG) 系统中，质量、延迟与资源之间的权衡。我们在官方 Kubernetes 文档上构建了一个包含 5,144 个问答对的手动验证基准，并将其与一个固定的混合检索管道（包括 BGE-M3 稠密、BGE-M3 原生稀疏、互逆排名融合 (RRF) 以及交叉编码器重排序）相结合。基于此基准，我们在 Llama-3.2-3B-Instruct 和 Llama-3.1-8B-Instruct 模型上，针对秩和目标模块的选择消融了 20 种 LoRA 配置，并对每种配置在标记级 F1、LLM 判定的事实性 (groundedness) 与正确性 (pass@4)、推理延迟、推理内存及训练成本方面进行评估，所有结果均报告了自助法 95% 置信区间。帕累托分析表明，仅作用于 q (Query) 和 v (Value) 注意力投影的 LoRA 适配器始终处于帕累托前沿，而 3B 与 8B 模型的选择主要定义了运行模式。进一步的参数匹配控制比较表明，q/v 优势是结构性的，而非纯粹参数性的。该基准、选定的适配器及代码可在 https://github.com/EugPal/rag-lora-tradeoffs 获取。

Abstract

We study quality-latency-resource trade-offs in a documentation-grounded retrieval-augmented generation (RAG) system that uses Low-Rank Adaptation (LoRA) of the generator. We build a manually verified benchmark of 5,144 question-answer pairs over the official Kubernetes documentation and combine it with a fixed hybrid-retrieval pipeline (BGE-M3 dense, BGE-M3 native sparse, Reciprocal Rank Fusion, cross-encoder reranking). Over this benchmark we ablate 20 LoRA configurations on Llama-3.2-3B-Instruct and Llama-3.1-8B-Instruct across rank and target-module choices, and evaluate each on token-level F1, LLM-judged groundedness and correctness (pass@4), inference latency, inference memory, and training cost, all reported with bootstrap 95% confidence intervals. Pareto analysis shows that LoRA adapters acting only on the q and v attention projections consistently dominate the front, while the 3B/8B choice mainly defines operating regime. A param-matched control comparison further indicates that the q/v advantage is structural rather than purely parametric. The benchmark, selected adapters, and code are available at https://github.com/EugPal/rag-lora-tradeoffs.
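Reciprocal Rank Fusion, one stage of the retrieval pipeline listed in the abstract, merges several ranked lists by summing `1 / (k + rank)` per document. The sketch below uses the commonly cited constant `k = 60`. The value used in the paper is not stated here, and the document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; a higher fused score ranks earlier."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for a deterministic order.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["pods", "services", "ingress"]      # hypothetical dense-retriever ranking
sparse = ["services", "configmaps", "pods"]  # hypothetical sparse-retriever ranking
fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because only ranks enter the formula, the dense and sparse scores never need to be put on a common scale, which is the usual reason for choosing this fusion rule in hybrid retrieval.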

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The provided keywords target specific research domains including World Models, Model-Based Reinforcement Learning, and Multimodal Large Language Models. However, the submitted paper focuses on Retrieval-Augmented Generation (RAG) systems, Low-Rank Adaptation (LoRA), and efficiency trade-offs in text-based documentation assistants. There is no substantive overlap regarding world modeling, reinforcement learning, or multimodal processing, resulting in 0.0 relevance scores for all keywords. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list, resulting in no bonus points.

Keywords

RAG, LoRA, Kubernetes, Trade-offs, Inference Latency, LLM Adaptation, Documentation Assistant

196. Learning Logical Operations for Arbitrary Quantum Error Correction Codes [FAIL]

Score: 0.0 / 26.5

Authors: Nico Meyer, Christopher Mutschler, Dominik Seuß, Andreas Maier, Daniel D. Scherer

Published: 2026-05-27

TL;DR: This study introduces a learning-based framework to construct physical implementations of logical operations for arbitrary quantum error correction codes, facilitating hardware-adapted designs for early fault-tolerant quantum computing.

Abstract (Chinese translation)

逻辑运算对于量子纠错码中的量子计算至关重要。然而，发现它们的物理实现具有挑战性，特别是对于缺乏稳定子描述的非加性码。我们提出了一种通用的基于学习的框架，仅给定一个编码电路，即可构建逻辑运算的物理实现，同时强制满足横向性或浅层深度等结构属性。我们的方法通过重新发现标准稳定子码的已知逻辑运算得到了验证。随后，我们将其扩展为一个协同设计过程，称为变分早期容错量子计算（VarEFTQC），该过程将非加性编码针对给定噪声模型进行定制，并强制满足所需的逻辑门集，例如横向 IQP 类型族或低深度通用集。一个软件库实现了完整的学习流程，包括损失函数变体、ansatz 族和优化例程。综上所述，这些结果将 VarEFTQC 定位为一种实用的工具，用于为早期容错量子计算发现硬件适配的逻辑器件 (gadgets)。

Abstract

Logical operations are essential for quantum computation within quantum error-correcting codes. However, discovering their physical realizations is challenging, especially for non-additive codes that lack a stabilizer description. We present a general learning-based framework that, given only an encoding circuit, constructs physical implementations of logical operations while enforcing structural properties such as transversality or shallow depth. Our approach is validated by rediscovering known logical operations of standard stabilizer codes. We then extend it to a co-design procedure, dubbed variational early fault-tolerant quantum computing (VarEFTQC), which tailors non-additive encodings to a given noise model and enforces desired logical gate sets, such as transversal IQP-type families or low-depth universal sets. A software library implements the complete learning pipeline, including loss-function variants, ansatz families, and optimization routines. Together, these results position VarEFTQC as a practical tool for discovering hardware-adapted logical gadgets for early fault-tolerant quantum computing.

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on Quantum Error Correction and Variational Quantum Algorithms, which is a domain distinct from Multimodal AI, World Models, and Reinforcement Learning. There is no overlap in methodology or subject matter with the provided keywords (Unify Models, MLLM, MultiModal, etc.). No expert authors from the specified list are present.

Keywords

Quantum Error Correction, Logical Operations, Variational Algorithms, Non-additive Codes, Fault-tolerant Quantum Computing, Learning-based Framework

197. Sequential Neural Probabilistic Amplitude Shaping: Learning the Channel's Language [FAIL]

Score: 0.0 / 26.5

Authors: Mohammad Taha Askari, Lutz Lampe, Amirhossein Ghazisaeidi

Published: 2026-05-27

TL;DR: This paper introduces a sequential neural probabilistic amplitude shaping technique that leverages an autoregressive encoder compatible with arithmetic distribution matching to minimize rate loss and maximize achievable information rates in communication channels.

Abstract (Chinese translation)

我们提出了第一种神经概率幅度整形 (Neural Probabilistic Amplitude Shaping) 方法，该方法优于现有方法且考虑了所有实现损耗，采用了一种无块的、易于实现的顺序自回归编码器，该编码器与算术分布匹配 (Arithmetic Distribution Matching) 兼容，从而降低了速率损耗并提高了可达信息速率。

Abstract

We present the first neural probabilistic amplitude shaping that outperforms existing methods while accounting for all implementation losses, using a block-less, easily implementable sequential autoregressive encoder compatible with arithmetic distribution matching, yielding reduced rate loss and higher achievable information rates.

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on neural probabilistic amplitude shaping for communication channels within the domain of Information Theory and Signal Processing. It is unrelated to Unify Models, World Models, MLLM, MultiModal AI, or Model-Based Reinforcement Learning, which are AI/ML paradigms. Additionally, none of the specified expert authors (Yang Shi, Xuanyu Zhu, etc.) are listed on the paper.

Keywords

Neural Probabilistic Amplitude Shaping, Sequential Autoregressive Encoder, Arithmetic Distribution Matching, Rate Loss, Achievable Information Rates, Channel's Language, Communication Channels

198. Self-Consistency via Marginal Sharpening [FAIL]

Score: 0.0 / 26.5

Authors: Aleksei Arzhantsev, Otmane Sakhi, Nicolas Chopin

Published: 2026-05-27

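TL;DR: This paper proposes a self-consistency method that sharpens the marginal distribution over answers rather than the distribution over full outputs, achieving stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.

Abstract (Chinese translation)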

推理时采样（Inference-time sampling）可以在无需额外训练的情况下从语言模型中激发出强大的推理能力。现有的幂采样（power-sampling）方法通过锐化完整生成输出的分布来实现这一目标，倾向于选择在模型下个体概率较高的补全结果。我们认为这是推理的目标对象错误：一个补全结果将推理轨迹与最终答案纠缠在一起，而真正重要的是一个答案是否得到了许多合理推理路径的支持。因此，我们将目标从完整输出分布转移到锐化后的答案边缘分布（sharpened answer marginal），使自洽性（self-consistency）成为一个推理时目标，而非事后投票准则。令人惊讶的是，该边缘目标存在高效近似：我们提出了一种简单的、纯自回归（autoregressive）的并行采样算法，该算法近似地从锐化后的答案边缘分布中采样，在数学和编码基准上比标准幂采样展现出更强的性能，同时快几个数量级。

Abstract

Inference-time sampling can elicit strong reasoning abilities from language models without additional training. Existing power-sampling methods do so by sharpening the distribution over full generated outputs, favoring completions that are individually likely under the model. We argue that this is the wrong object to target for reasoning: a completion entangles a reasoning trace with a final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths. We therefore shift the target from the full-output distribution to the sharpened answer marginal, making self-consistency an inference-time objective rather than a post-hoc voting criterion. Surprisingly, this marginal target admits an efficient approximation: we propose a simple, purely autoregressive parallel sampling algorithm that approximately samples from the sharpened answer marginal, eliciting stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.
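The difference between the two sharpening targets in the abstract can be shown on a toy distribution that is small enough to enumerate. The sketch below is not the paper's sampling algorithm, which approximates the marginal target autoregressively. The completions, probabilities, and exponent `alpha` are made up for illustration.

```python
from collections import defaultdict

def sharpen(probs, alpha):
    """Raise each probability to the power alpha and renormalize."""
    powered = {key: p ** alpha for key, p in probs.items()}
    z = sum(powered.values())
    return {key: v / z for key, v in powered.items()}

def answer_marginal(completions):
    """Sum the probability of every (trace, answer) completion that ends in the same answer."""
    marginal = defaultdict(float)
    for (_trace, answer), p in completions.items():
        marginal[answer] += p
    return dict(marginal)

# Hypothetical model distribution: one likely trace for "42", three less likely traces for "17".
completions = {
    ("trace A", "42"): 0.40,
    ("trace B", "17"): 0.20,
    ("trace C", "17"): 0.20,
    ("trace D", "17"): 0.20,
}
alpha = 4.0

full_target = answer_marginal(sharpen(completions, alpha))      # sharpen full outputs, then read off answers
marginal_target = sharpen(answer_marginal(completions), alpha)  # sharpen the answer marginal directly
print(full_target, marginal_target)
```

Sharpening full outputs concentrates mass on the single most likely completion and therefore on "42", while sharpening the marginal favors "17", the answer supported by more reasoning paths. This is the distinction the abstract argues matters for reasoning.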

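Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on sampling strategies at language-model inference time (self-consistency via marginal sharpening), which belongs to NLP inference optimization. The given keywords concern multimodality, world models, and reinforcement learning and have no direct connection to the paper's single-modality language-model reasoning topic, so the relevance scores are 0. The author list does not include any of the specified experts, so no bonus is added.

Keywords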

Self-Consistency, Marginal Sharpening, Inference-time Sampling, Language Models, Reasoning Abilities, Autoregressive Parallel Sampling, Mathematics Benchmarks

199. Learning to Bid in Repeated Second-Price Auctions with Dynamic Values and Aggregated Feedback [FAIL]

Score: 0.0 / 26.5

Authors: Benjamin Heymann, Otmane Sakhi

Published: 2026-05-27

Abstract (Chinese translation)

我们研究了在投标人价值动态变化（即当前价值取决于过去结果）情况下的出价学习问题。具体而言，我们考虑一个参与重复 second-price auctions（第二价格拍卖）的投标人，其价值取决于自上次成功出价以来经过的时间；拍卖以连续时间到达，且仅在时间跨度结束时揭示聚合反馈。此类投标人必须 (1) 权衡赢得当前拍卖的即时收益与其对未来价值的影响，以及 (2) 学习未知的环境参数。我们推导了一类 learning methods（学习方法）的 regret bounds（遗憾界），该方法结合了 plug-in estimators（插件估计器）与 optimal policy（最优策略）的微分方程刻画；并表明特定的 confidence bound algorithm（置信界算法）能够以接近最优的 regret（遗憾）学习 optimal policy：对于 piecewise linear primitives（分段线性原函数），其 regret 为 $\widetilde{O}(\log N)$；对于一般 smooth primitives（光滑原函数），其 regret 为 $\widetilde{O}(N^{1/3})$。这些结果是在无需 explicit randomization（显式随机化）的情况下实现的。这些理论结果得到了数值实验的支持。

Abstract

We study the problem of learning to bid when the bidder's value is dynamic, i.e., when the current value depends on past outcomes. Specifically, we consider a bidder participating in repeated second-price auctions whose value depends on the time elapsed since their last successful bid, with auctions arriving in continuous time and only aggregated feedback revealed at the end of the horizon. Such a bidder must (1) balance the immediate benefit of winning the current auction against its impact on future values and (2) learn unknown environmental parameters. We derive regret bounds for a class of learning methods that combine plug-in estimators with a differential-equation characterization of the optimal policy, and show that a specific confidence bound algorithm learns the optimal policy with a near optimal regret of $\widetilde{O}(\log N)$ for piecewise linear primitives, and $\widetilde{O}(N^{1/3})$ for general, smooth primitives, achieving these regrets without explicit randomization. These theoretical results are supported by numerical experiments.
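The trade-off described in the abstract can be seen in a small simulation. This is only a sketch of the environment under stated assumptions, not the paper's learning algorithm: auctions arrive as a Poisson process, the highest rival bid is drawn uniformly from [0, 1), and the value function `value_fn` is a made-up example that recovers with the time since the last win. Aggregated feedback and regret are not modelled here.

```python
import random

def simulate_bidder(bid_policy, value_fn, horizon, arrival_rate, rng):
    """Run one horizon of second-price auctions arriving as a Poisson process."""
    now, last_win, utility, wins = 0.0, 0.0, 0.0, 0
    while True:
        now += rng.expovariate(arrival_rate)  # time of the next auction
        if now > horizon:
            return utility, wins
        elapsed = now - last_win
        competing_bid = rng.random()  # hypothetical highest rival bid, Uniform[0, 1)
        if bid_policy(elapsed) > competing_bid:
            utility += value_fn(elapsed) - competing_bid  # winner pays the second price
            last_win = now  # winning resets the clock that drives future values
            wins += 1

def value_fn(elapsed):
    """Hypothetical value that recovers as time since the last win grows."""
    return min(1.0, elapsed)

rng = random.Random(0)
# Myopic truthful bidding: bid exactly the current value.
truthful = simulate_bidder(value_fn, value_fn, horizon=100.0, arrival_rate=2.0, rng=rng)
```

Bidding one's value is optimal in a single static second-price auction, but here each win lowers the value of the auctions that follow, which is why the abstract frames the optimal policy as a balance between immediate and future payoffs.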

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 10 column 60 (char 234)

200. PINE: Pruning Boosted Tree Ensembles with Conformal In-Distribution Prediction Equivalence [FAIL]

Score: 0.0 / 26.5

Authors: Haruki Yajima, Yusuke Matsui

Published: 2026-05-27

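TL;DR: The paper proposes PINE, which prunes tree ensembles while preserving prediction equivalence within a conformally calibrated in-distribution region, improving the compression ratio by up to 30% while preserving predictions at a level comparable to existing faithful pruning methods.

Abstract (Chinese translation)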

树集成（Tree ensembles）是具有强预测性能和可解释性的机器学习模型，且在表格数据（tabular data）上仍被广泛使用。树集成的标准剪枝方法（pruning methods）通常优化精度与压缩率之间的权衡（accuracy-compression trade-off），可能会改变部分预测，从而可能损害决策一致性（decision consistency）。忠实剪枝方法（Faithful pruning methods）通过在整个输入空间（input space）上保持预测等价性（prediction equivalence）来解决这一问题，但这一要求会导致较低的压缩率（compression ratios）。我们提出了 PINE，这是一种在分布内区域（in-distribution region）内提供强保证的剪枝方法。PINE 在此区域内保持预测等价性，并通过共形校准（conformal calibration）使用单个参数 $α$ 来控制区域大小。在 12 个公开表格数据集上的实验表明，PINE 可将压缩率提高高达 30%，同时保持预测水平与现有的忠实剪枝方法相当。

Abstract

Tree ensembles are machine learning models with strong predictive performance and interpretability, and remain widely used for tabular data. Standard pruning methods for tree ensembles typically optimize an accuracy-compression trade-off and may change a subset of predictions, potentially compromising decision consistency. Faithful pruning methods address this issue by preserving prediction equivalence over the entire input space, but this requirement leads to lower compression ratios. We propose PINE, a pruning method that provides strong guarantees within an in-distribution region. PINE preserves prediction equivalence within this region and controls the region size using a single parameter $α$ via conformal calibration. Experiments on 12 public tabular datasets show that PINE improves the compression ratio by up to $30\%$ while preserving predictions at a comparable level to existing faithful pruning methods.
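The role of the single parameter $α$ can be illustrated with the standard split-conformal quantile rule. This is a generic sketch, assuming some nonconformity score is available for each calibration point; the abstract does not say which score PINE uses, so the scores and function names below are hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha):
    """Return the split-conformal threshold at miscoverage level alpha."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]

def in_distribution(score, threshold):
    """A point counts as in-distribution when its score is within the threshold."""
    return score <= threshold

scores = list(range(1, 20))  # 19 hypothetical calibration scores: 1, 2, ..., 19
loose = conformal_threshold(scores, alpha=0.1)  # larger region
tight = conformal_threshold(scores, alpha=0.5)  # smaller region
```

A smaller `alpha` yields a larger threshold and therefore a larger in-distribution region, so equivalence must hold on more inputs and less pruning is possible; a larger `alpha` shrinks the region.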

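Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper's core is pruning tree ensembles with conformal prediction on tabular data, which is conventional supervised learning. The scoring keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all point to multimodal large models, world models, and reinforcement learning. The two share no research paradigm or data type, so all relevance scores are 0. The author list contains none of the specified experts.

Keywords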

Tree ensembles, Pruning, Conformal prediction, In-distribution, Prediction equivalence, Tabular data, Model compression, Interpretability

201. AOE: Exhaustive Out-of-Distribution Detection via Recalibrating Outlier Labels [FAIL]

Score: 0.0 / 26.5

Authors: Fengqiang Wan, Qing-Yuan Jiang, Yang Yang

Published: 2026-05-27

TL;DR: The paper proposes AOE, a method that recalibrates outlier labels using temperature scaling to improve OOD detection by preserving semantic relations between in-distribution and out-of-distribution samples while maximizing entropy.

Abstract (Chinese translation)

分布外（OOD）检测对于在开放世界和安全关键场景中部署机器学习模型至关重要，因为测试输入可能偏离训练分布，且对未知样本的过度自信预测可能导致不可靠的决策。异常值暴露（OE）作为一种有前景的 OOD 检测范式应运而生，通过在训练期间引入辅助异常值来扩大分布内（ID）和 OOD 样本之间的间隔。现有的基于 OE 的方法通常通过采用均匀标签来最大化 OOD 样本在 ID 类别上的熵，从而扩大这一间隔。然而，我们在理论上证明，均匀标签不可避免地忽略了 OOD 样本与 ID 类别之间的关系，称为过度软化效应，导致次优的间隔界限。我们的理论分析进一步揭示，显式利用这种关系反而可以提高 OOD 检测性能。受此启发，我们提出自适应置信度异常值暴露（AOE），这是一种简单但有效的方法，利用温度缩放重新校准异常值标签。具体而言，AOE 基于温度缩放后的模型预测为 OOD 样本生成自适应软目标，其中，可学习的温度平滑预测分布，而不会完全擦除类别间的关系信息。通过利用这些自适应软目标监督 OOD 样本，AOE 保留了 OOD 样本与 ID 类别之间的语义邻近性，同时鼓励软化的目标趋近于高熵分布，从而抑制过度自信的 OOD 预测并扩大分离间隔。

Abstract

Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world and safety-critical scenarios, where test inputs may deviate from the training distribution and overconfident predictions on unknown samples can lead to unreliable decisions. Outlier Exposure (OE) has emerged as a promising OOD detection paradigm by introducing auxiliary outliers during training to enlarge the margin between in-distribution (ID) and OOD samples. Existing OE-based methods typically enlarge this margin by employing uniform labels to maximize the entropy of OOD samples over ID categories. However, we theoretically show that uniform labels inevitably disregard the relations between OOD samples and ID categories, termed the over-softening effect, leading to a suboptimal margin bound. Our theoretical analysis further reveals that explicitly exploiting such relations can instead yield improved OOD detection performance. Motivated by this insight, we propose Adaptive Confidence OE (AOE), a simple yet effective method that leverages temperature scaling to recalibrate outlier labels. Specifically, AOE generates adaptive soft targets from temperature-scaled model predictions for OOD samples, where the learnable temperature smooths the prediction distribution without fully erasing class-wise relational information. By supervising OOD samples with these adaptive soft targets, AOE preserves the semantic proximity between OOD samples and ID categories while encouraging the softened targets to approach a high-entropy distribution, thereby suppressing overconfident OOD predictions and enlarging the separation margin. Extensive experiments across diverse benchmarks demonstrate the effectiveness of AOE.
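The contrast between a uniform outlier label and a temperature-scaled soft target can be sketched as follows. This is an illustration of temperature scaling only, not the AOE training procedure: the logits are invented, and the temperature is fixed here, whereas the abstract describes it as learnable.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1, -1.0]  # hypothetical ID-class logits for one outlier sample
uniform_target = [1 / len(logits)] * len(logits)
soft_target = softmax_with_temperature(logits, temperature=5.0)
sharp_prediction = softmax_with_temperature(logits, temperature=1.0)
```

The soft target has higher entropy than the raw prediction yet keeps the class ordering of the logits, while the uniform target has maximal entropy and no ordering at all. That lost ordering is what the abstract calls the over-softening effect.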

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on Out-of-Distribution (OOD) detection using Adaptive Confidence Outlier Exposure (AOE) and temperature scaling. It does not address Unify Models, World Models, MLLM, MultiModal learning, or Model-Based Reinforcement Learning, so it has no relevance to the specified scoring keywords. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list (Fengqiang Wan, Qing-Yuan Jiang, Yang Yang).

Keywords

Out-of-Distribution Detection, Outlier Exposure, Temperature Scaling, Adaptive Soft Targets, Margin Maximization, Entropy Maximization, Label Recalibration

202. Quantum principal component analysis without eigenvector recovery [FAIL]

Score: 0.0 / 26.5

Authors: Yewei Yuan, Michele Minervini, Mark M. Wilde, Nana Liu

Published: 2026-05-27

TL;DR: This paper introduces a quantum measurement-based soft PCA framework that eliminates the need for eigenvector recovery to efficiently process quantum data.

Abstract (Chinese translation)

主成分分析（PCA）传统上是通过协方差矩阵或核矩阵、主导特征向量提取以及硬秩-$k$ 投影来实现的。这些步骤在高维及量子数据设置中计算成本高昂，对小的特征间隙敏感，且当下游任务仅需主子空间得分时显得多余。此类基于分数的目标在异常检测、谱能量轮廓以及其他后选择任务中至关重要。为满足这些需求，我们提出了一种基于测量的软主成分分析框架，该框架用熵正则化费米 - 狄拉克滤波器替代了硬顶-$k$ 投影器。该滤波器是 PCA 熵正则化变分公式的唯一优化器，并在零温极限下收敛于经典 PCA 投影器。该滤波器可直接解释为一种量子测量，这自然暗示了一种量子计算方案。对于由量子特征态表示的中心化协方差算子，单个固定电路结合阈值校准，即可访问适用于不同秩预算或保留方差水平的全部最优滤波器，而无需进行秩相关的电路更新或特征向量恢复。对于新输入，同一校准量子电路可生成软主子空间得分、谱能量轮廓以及后选择滤波态。训练数据和测试数据所需的中心化操作在量子协议内部相干完成，这对于量子数据尤为重要，因为此时无法直接获得经典特征向量或中心化格拉姆矩阵。通过将 PCA 重构为校准测量任务，该框架规避了迭代特征向量提取的需求，并在加性精度 η 下实现了与维度无关的样本复杂度 O(η⁻²)，适用于归一化分数秩或保留方差评分。

Abstract

Principal component analysis (PCA) is traditionally implemented through a covariance or kernel matrix, leading-eigenvector extraction, and hard rank-$k$ projection. These steps can be computationally costly in high-dimensional and quantum-data settings, sensitive to small eigengaps, and unnecessary when downstream tasks only require principal-subspace scores. Such score-based objectives are important in applications such as anomaly detection, spectral-energy profiling, and other postselection tasks. To address these needs, we introduce a measurement-based soft PCA framework replacing the hard top-$k$ projector with an entropy-regularized Fermi--Dirac filter. This filter is the unique optimizer of an entropy-regularized variational formulation of PCA and converges to the classical PCA projector in the zero-temperature limit. This filter has a direct interpretation as a quantum measurement, which naturally suggests a quantum approach. For centered covariance operators represented by quantum feature states, a single fixed circuit, together with threshold calibration, accesses all optimal filters for different rank budgets or retained-variance levels without rank-dependent circuit updates or eigenvector recovery. For new inputs, the same calibrated quantum circuit yields soft principal subspace scores, spectral energy profiles, and postselected filtered states. The required centering of both training and test data is performed coherently inside the quantum protocol, which is particularly important for quantum data where no classical feature vectors or centered Gram matrix are directly available. By reframing PCA as a calibrated measurement task, this framework bypasses the need for iterative eigenvector extraction and achieves a dimension-independent sample complexity $O(η^{-2})$ for normalized fractional-rank or retained variance scoring at additive accuracy $η$.
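A purely classical sketch may help picture the soft filter. It applies the usual Fermi-Dirac form $f(λ) = 1 / (1 + e^{(μ - λ)/T})$ to a list of eigenvalues and calibrates the threshold $μ$ by bisection so that the weights sum to a rank budget. The abstract does not give the exact parameterization, so this form, the symbols $μ$ and $T$, the eigenvalues, and the function names are assumptions; nothing here models the quantum circuit.

```python
import math

def fermi_dirac(eigenvalue, threshold, temperature):
    """Soft weight in [0, 1]: near 1 above the threshold, near 0 below it."""
    x = (threshold - eigenvalue) / temperature
    if x >= 0:  # numerically stable logistic for either sign of x
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))

def calibrate_threshold(eigenvalues, rank_budget, temperature, iters=100):
    """Bisection on the threshold so that the soft weights sum to rank_budget."""
    lo = min(eigenvalues) - 50 * temperature
    hi = max(eigenvalues) + 50 * temperature
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(fermi_dirac(v, mid, temperature) for v in eigenvalues)
        if total > rank_budget:
            lo = mid  # too much weight kept: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)

def soft_weights(eigenvalues, rank_budget, temperature):
    """Calibrate the threshold, then return one soft weight per eigenvalue."""
    mu = calibrate_threshold(eigenvalues, rank_budget, temperature)
    return [fermi_dirac(v, mu, temperature) for v in eigenvalues]

eigenvalues = [4.0, 2.5, 1.0, 0.3, 0.1]  # hypothetical covariance spectrum
warm = soft_weights(eigenvalues, rank_budget=2, temperature=0.5)
cold = soft_weights(eigenvalues, rank_budget=2, temperature=0.01)
```

At the low temperature the weights are effectively 1 for the top two eigenvalues and 0 for the rest, matching the zero-temperature limit in which the filter becomes the hard rank-$k$ projector. Changing the rank budget only moves the threshold, which parallels the abstract's point that one fixed circuit plus threshold calibration covers all rank budgets.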

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on quantum principal component analysis and measurement-based frameworks, which are unrelated to Unify Models, World Models, MLLM, MultiModal learning, or Model-Based Reinforcement Learning. Thus, all keyword relevance scores are 0. None of the listed expert authors are present in the author list.

Keywords

Quantum principal component analysis, Soft PCA framework, Entropy-regularized filter, Quantum measurement, Covariance operators, Sample complexity, Eigenvector recovery, Quantum data

203. Privately Estimating Monotone Statistics in Polynomial Time [FAIL]

Score: 0.0 / 26.5

Authors: Gavin Brown, Ephraim Linder, Mahbod Majid, Vikrant Singhal

Published: 2026-05-27

TL;DR: This paper proposes efficient differentially private algorithms for estimating monotone statistics that improve sample complexity compared to subsample-and-aggregate methods, with applications in private eigenvalue and loss estimation.

Abstract (Chinese translation)

我们研究了用于估计单调统计量（monotone statistics，即在添加新观测值时保持单调性的统计量）的高效差分隐私（differentially private）算法。我们的研究起点是子采样与聚合（subsample-and-aggregate）：这是一种经典范式，它将数据集划分为块，在每个块上估计该统计量，然后进行隐私聚合。虽然该方法实用且通用，但它相当消耗数据。针对单调统计量这一类别，我们对这一框架进行了改进——与子采样与聚合相比，我们的算法在样本复杂度上节省了 $t$ 倍，而在运行时间上付出了 $e^t$ 的代价，其中 $t>0$ 是一个可调参数。我们通过查询复杂度下界（query-complexity lower bound）补充了我们的结果，表明我们的算法对于此任务本质上是最优的。作为应用，我们获得了私有特征值估计（private eigenvalue estimation）、私有损失估计（private loss estimation）以及高维模型（例如线性回归（linear regression））中单个参数私有估计的改进结果。

Abstract

We study efficient differentially private algorithms for estimating monotone statistics, i.e., statistics that are monotone under the addition of new observations. The starting point for our investigation is subsample-and-aggregate: a classical paradigm that partitions the dataset into blocks, estimates the statistic on each block, and then privately aggregates the estimates. While practical and generically applicable, this approach is quite data-hungry. We improve upon this framework for the class of monotone statistics -- compared to subsample-and-aggregate, our algorithms save a factor of $t$ in sample complexity and pay a factor of $e^t$ in running time, where $t>0$ is a tunable parameter. We complement our results with a query-complexity lower bound, showing that our algorithms are essentially optimal for this task. As an application, we obtain improved results for private eigenvalue estimation, private loss estimation, and privately estimating a single parameter of a high-dimensional model, e.g., in linear regression.
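For readers unfamiliar with the baseline, here is a minimal sketch of subsample-and-aggregate, not the paper's improved algorithm. It assumes the aggregator is a clipped mean with Laplace noise, that neighbouring datasets differ by replacing one record, and that the clipping bounds `lower` and `upper` are chosen without looking at the data; all names are hypothetical.

```python
import random
import statistics

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample, drawn as the difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def subsample_and_aggregate(data, num_blocks, statistic, lower, upper, epsilon, rng):
    """Split into disjoint blocks, estimate per block, then privately average."""
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    estimates = [min(max(statistic(block), lower), upper) for block in blocks]
    # One record lives in exactly one block, so replacing it moves the
    # clipped mean by at most (upper - lower) / num_blocks.
    sensitivity = (upper - lower) / num_blocks
    return statistics.fmean(estimates) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
data = [rng.gauss(5.0, 1.0) for _ in range(1000)]
private_mean = subsample_and_aggregate(
    data, num_blocks=20, statistic=statistics.fmean,
    lower=0.0, upper=10.0, epsilon=1.0, rng=rng,
)
```

Each block sees only a fraction of the records (50 of 1000 here), so the per-block estimates are much noisier than an estimate on the full dataset. That cost is the data hunger the abstract refers to.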

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on differentially private algorithms for estimating monotone statistics, which belongs to the domain of theoretical machine learning and privacy-preserving data analysis. The provided keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) relate to multimodal AI architectures, generative world models, and reinforcement learning. There is no thematic, methodological, or application overlap between the paper's content and the specified keywords. Consequently, all keyword relevance scores are 0. Additionally, none of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list, so no expert bonus is applied. The total weighted score is 0.0, well below the dynamic passing score of 26.5.

Keywords

Differential Privacy, Monotone Statistics, Subsample-and-Aggregate, Sample Complexity, Eigenvalue Estimation, Linear Regression, Polynomial Time

204. From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction An Ablation Study with Acetylsalicylic Acid Validation [FAIL]

Score: 0.0 / 26.5

Authors: Juergen Dietrich

Published: 2026-05-27

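TL;DR: This study uses ablation experiments to compare graph neural network architectures for predicting drug-drug interaction mechanism types, finding that cross-attention substantially improves multi-class classification performance.

Abstract (Chinese translation)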

预测两种药物是否发生相互作用（二分类检测）与预测该相互作用的机制类型（多分类）是一项显著不同的任务。本研究在包含 38,337 个正样本对（涵盖 86 种相互作用类型）的公开可用基准数据集上，对三种图神经网络（GNN）架构用于药物 - 药物相互作用（DDI）预测进行了系统的消融研究。在相同的训练条件下（n = 61,339 对）比较了三种架构：一种采用拼接（Concat）的孪生双消息传递神经网络（MPNN），一种采用四头交叉注意力（CrossAtt）的双 MPNN，以及一种融合相互作用图的三元 MPNN（Ternary）。CrossAtt 相较于 Concat，在多分类宏 F1 分数（F1-macro）上提高了 +0.186（绝对值，+45%），而在二分类 AUC 上仅提高了 +0.012（+1.3%）——这证实了原子级分子间通信专门实现了机制类型分类。尽管训练数据相当，三元架构表现不佳，其失败原因与训练不稳定性假设一致。在训练前预留的十种乙酰水杨酸（ASA）药物对上的验证表明，CrossAtt 实现了 10/10 正确的 DDI 类型预测，而 Ternary 为 0/10。在所有架构中均识别出两个一致的失败案例，这与伴随毒性研究中确立的结构限制相关联。

Abstract

Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation study of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%) - confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis. Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.

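Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper uses graph neural networks for drug-drug interaction prediction, which falls under supervised learning in bioinformatics. The provided keywords concern world models, multimodal large models, and model-based reinforcement learning. The paper does not involve world-model construction, multimodal large language models (MLLM), reinforcement learning, or unified model architectures, so it is unrelated to the scoring keywords and all relevance scores are 0. The author list contains none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords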

Drug-Drug Interaction, Graph Neural Networks, Cross-Attention, Mechanism Type Prediction, Ablation Study, Message Passing Neural Network, Acetylsalicylic Acid

205. EventShiftFlow: Towards Hardware-efficient FPGA-based Flow Estimation [FAIL]

Score: 0.0 / 26.5

Authors: Arianna Alonso Bizzi, Fernando Cladera, C. J. Taylor

Published: 2026-05-27

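TL;DR: This paper presents a hardware-efficient, FPGA-based velocity estimator for event cameras that discretizes asynchronous events and evaluates velocity hypotheses in parallel using fixed-width integer logic, achieving high directional accuracy for low-latency robotics tasks with very small storage requirements.

Abstract (Chinese translation)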

事件视觉传感器提供异步且高时间分辨率的测量，这对低延迟机器人感知颇具吸引力，然而许多基于事件的运动估计方法计算密集，难以映射到 FPGA 硬件上。本文提出一种流式速度估计器（streaming velocity estimator），它将异步事件离散化为固定持续时间的时间箱（time bins），构建 1 位空间占用网格（1-bit spatial occupancy grid），并仅使用固定宽度整数逻辑——移位寄存器（shift registers）、计数器（counters）、比较器（comparators）以及小型查找表映射乘法器（LUT-mapped multiplies）——并行评估多个速度假设，无需除法器（dividers）和 DSP 块（DSP blocks）。该方法无需帧重建（frame reconstruction）、无需浮点运算（floating-point arithmetic），也无需迭代优化（iterative optimization）。该方法有意以密集的亚像素光流（dense sub-pixel optical flow）换取每个活动像素上的稀疏量化速度估计（sparse, quantized velocity estimate），适用于低延迟任务，例如在尺寸、重量和功耗受限平台（size-, weight-, and power-constrained platforms）上的反应式避障。在具有已知真实速度（ground-truth velocities）的含噪声合成数据上，该方法能恢复速度的大小和方向，其中当不同速度的物体相交时，大小估计最具挑战性。在真实事件相机序列上，方向准确率在所有四个评估的运动段达到 99.5%，且在占用密度（occupancy densities）为 10%-40% 的范围内，性能保持稳健。本文表征了该算法的密度依赖行为，进行了参数敏感性分析（parameter sensitivity analysis），表明所提出的数据通路（datapath）所需存储空间小于 2 kB，并在低成本赛灵思 Artix-7（Xilinx Artix-7）FPGA 上实现了单轴原型。

Abstract

Event-based vision sensors offer asynchronous, high-temporal-resolution measurements that are attractive for low-latency robotic perception, but many event-based motion estimation methods are computationally intensive and difficult to map to FPGA hardware. We present a streaming velocity estimator that discretizes asynchronous events into fixed-duration time bins, constructs a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using only fixed-width integer logic - shift registers, counters, comparators, and small LUT-mapped multiplies - with no dividers and no DSP blocks. It requires no frame reconstruction, no floating-point arithmetic, and no iterative optimization. The method deliberately trades dense sub-pixel optical flow for a sparse, quantized velocity estimate at each active pixel, suited to low-latency tasks such as reactive obstacle avoidance on size-, weight-, and power-constrained platforms. On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimates being most challenged when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across all four evaluated motion segments, with performance remaining robust across occupancy densities in the 10-40% range. We characterize the algorithm's density-dependent behavior, present a parameter sensitivity analysis, show that the proposed datapath requires less than 2 kB of storage, and implement a single-axis prototype on a low-cost Xilinx Artix-7.
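The flavour of a shift-and-count datapath can be sketched in software. This is an illustration only, under the assumption that a hypothesis is scored by shifting the previous time bin's occupancy row and counting overlaps with the current row; the abstract does not spell out the matching rule, and the row width, event positions, and function names are hypothetical. The loop evaluates hypotheses one after another, whereas the hardware described in the abstract evaluates them in parallel.

```python
WIDTH = 32  # hypothetical number of pixels along one axis
MASK = (1 << WIDTH) - 1

def bin_events(event_positions):
    """Collapse one time bin's events into a 1-bit occupancy row (bit x set = pixel x active)."""
    row = 0
    for x in event_positions:
        row |= 1 << x
    return row

def shift(row, velocity):
    """Move occupancy by `velocity` pixels; bits that leave the row are dropped."""
    if velocity >= 0:
        return (row << velocity) & MASK
    return row >> -velocity

def estimate_velocity(previous_row, current_row, hypotheses):
    """Score each hypothesis with shift, AND, and a bit count; keep the best one."""
    best_velocity, best_score = None, -1
    for v in hypotheses:
        score = bin(shift(previous_row, v) & current_row).count("1")
        if score > best_score:
            best_velocity, best_score = v, score
    return best_velocity, best_score

previous_row = bin_events([3, 4, 5, 10])
current_row = bin_events([5, 6, 7, 12])  # same pattern moved by +2 pixels
result = estimate_velocity(previous_row, current_row, range(-3, 4))
```

Every operation is a shift, a bitwise AND, a count, or a comparison on fixed-width integers, so the estimate is quantized to the hypothesis set, in line with the sparse, quantized trade-off the abstract describes.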

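Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper studies a hardware-efficient flow-estimation algorithm for event cameras on FPGAs, emphasizing low-level hardware implementation (e.g., shift registers, integer logic) and low-latency computation. The scoring keywords concern high-level AI architectures and algorithms: unified models, world models, multimodal large models, and model-based reinforcement learning. The two share no research goals or technical approach, so all relevance scores are 0. The author list contains none of the specified experts, so no bonus is applied.

Keywords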

Event-based vision, FPGA, Flow Estimation, Velocity Estimation, Hardware-efficient, Low-latency, Robotic perception

206. Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects [FAIL]

Score: 0.0 / 26.5

Authors: Leonhard Sommer, Emil Akopyan, Adam Kortylewski

Published: 2026-05-27

TL;DR: The paper introduces Every9D-21M, a large-scale dataset for 9D pose estimation of everyday objects derived from real-world videos, significantly expanding coverage and scale compared to prior benchmarks.

Abstract (Chinese translation)

从单张真实世界图像估计日常物体的 9D 位姿依然颇具挑战性。这主要是因为缺乏大规模监督。大多数现有数据集要么严重依赖合成渲染，要么对真实世界物体的覆盖范围有限；迄今为止最大的真实世界 9D 位姿数据集仅包含 1.7 万个标注对象，涵盖 9 个类别。为填补这一空白，我们提出了 Every9D-21M，这是一个包含 2180 万张真实世界图像的 9D 位姿标注数据集，数据源自 10.9 万个以物体为中心的视频，覆盖 700 个日常物体类别；在图像数量和类别数量上，该数据集比之前的真实世界 9D 位姿基准大两个数量级。为实现这一规模，我们利用以物体为中心的视频，通过多视图几何重建物体级点云，并将相似实例对齐至共享的标准坐标系（canonical coordinate frame）。标准位姿（canonical poses）仅对一小部分参考对象进行手动标注（少于所有图像的 0.01%），并通过跨实例对齐传播至其余实例。随后，所有传播得到的标准位姿均从多个视角进行验证。此外，我们还引入了跨类别方向规则，以诱导类别级对称性，从而实现感知对称性的评估。除了建立专门的训练和评估划分作为 9D 位姿基础模型的基准外，我们还表明，在 Every9D-21M 上训练可提升在 ImageNet3D 和 PASCAL3D+ 上的性能，且在 HANDAL 上的泛化能力显著优于仅在 ImageNet3D 上训练。数据和代码可在 https://github.com/GenIntel/Every9D 获取。

Abstract

Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories - two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at https://github.com/GenIntel/Every9D.

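Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper addresses 9D pose estimation in computer vision and the construction of a large-scale real-world dataset (Every9D-21M), using multi-view geometry and object-centric videos for canonical-frame alignment. The scoring keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all point to large language models, reinforcement learning, and world models. The paper does not involve unified model architectures, world-model learning, multimodal large language models, or model-based reinforcement learning algorithms, so it is unrelated to all keywords.

Keywords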

9D pose estimation, Everyday objects, Real-world images, Object-centric videos, Multi-view geometry, Canonical coordinate frame, Dataset construction, Pose foundation models

207. MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations [FAIL]

Score: 0.0 / 26.5

Authors: Leiyue Zhao, Tianyu Shi, Daniel Reisenbuchler, Xinzi He, Junchao Zhu, Tianyuan Yao, Yuechen Yang, Yanfan Zhu, Junlin Guo, Gelei Xu, Haichun Yang, Yuankai Huo, Mert R. Sabuncu, Yihe Yang, Ruining Deng

Published: 2026-05-27

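TL;DR: MORI-Seg learns morphology-aware geometric representations to perform instance segmentation of kidney functional units using only semantic annotations, improving instance separation accuracy without instance-level labels.

Abstract (Chinese translation)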

肾单位的实例级量化对于形态计量学分析至关重要，然而大多数公开可用的病理数据集仅提供语义分割标注，其中同一类别的相邻结构被合并为单个区域。这阻碍了可靠的实例级分析，并限制了下游定量研究。现有的启发式后处理方法往往产生次优的实例分离效果，尤其是在拥挤和粘连区域，而基于深度学习的实例分割方法通常需要密集的实例级标注，这些标注获取成本高昂且耗时费力。我们提出 MORI-Seg，一种无需实例级标注即可实现实例分割的深度学习框架。与启发式分割或实例监督不同，MORI-Seg 通过联合建模对象中心距离场（object-centric distance fields）和边界带表示（boundary-band representations），直接从语义掩膜中学习形态感知几何表示，以编码内部结构和接触界面。一个类条件特征解耦模块进一步促进了实例内一致性和实例间分离。在仅语义监督下，MORI-Seg 以端到端的方式将连接的语义区域分解为不同的实例掩膜。实验表明，与经典后处理流程及代表性的语义到实例学习方法相比，MORI-Seg 在实例分离精度和形态计量量化可靠性方面均有提升。官方实现公开发布于 https://github.com/ddrrnn123/MORI-Seg。

Abstract

Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at https://github.com/ddrrnn123/MORI-Seg.

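Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on weakly supervised instance segmentation of medical pathology images, using morphological geometric representations to separate instances. It is an application of conventional computer vision and deep learning and does not involve unified model architectures (Unify Models), world models (World Models), multimodal large language models (MLLM), multimodal learning (MultiModal), or model-based reinforcement learning (model-based RL). The author list contains none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords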

Instance Segmentation, Morphological Geometry, Semantic-only Supervision, Medical Image Analysis, Distance Fields, Feature Disentanglement, Pathology, Weakly Supervised Learning

208. Bridging the Sampling Distribution Shift in Radio Map Estimation: A Trajectory-Aware Paradigm [FAIL]

Score: 0.0 / 26.5

Authors: Feng Qiu, Zheng Fang, Shuhang Zhang, Kangjun Liu, Longkun Zou, Jing Liu, Ke Chen

Published: 2026-05-27

TL;DR: This paper proposes a trajectory-aware sampling paradigm (ST-TBS) to mitigate distribution shift in radio map estimation for UAV sensing, improving RMSE compared to random sampling methods.

Abstract (Chinese translation)

基于学习的无线电地图估计（RME）在无人机辅助无线传感中起着至关重要的作用，支持覆盖预测和网络优化等任务。大多数现有方法基于随机采样，假设训练和测试设定是独立同分布（i.i.d.）的。然而，实际的无人机测量值是沿可行轨迹顺序收集的，导致高度结构化且空间相关的模式。这种不匹配引入了采样分布偏移，增加了空间场恢复的内在难度，并损害了在独立同分布（i.i.d.）假设下训练的模型的泛化能力。为缓解这一问题，我们提出了一种基于随机触发轨迹采样（ST-TBS）的轨迹感知训练范式，该范式在保持轨迹连续性的同时引入了采样变异性。此外，从统计角度来看，我们表明与随机采样相比，轨迹采样降低了空间多样性并增加了信息冗余。在 RadioMapSeer 和 SpectrumNet 数据集上的广泛实验表明，在轨迹观测下，使用随机采样的模型遭受显著的性能退化，在 SpectrumNet 数据集上，均方根误差（RMSE）从 0.0391 增加到 0.2632。相比之下，我们提出的 ST-TBS 方法有效将 RMSE 降低至 0.0571。这些结果强调了为了实现可靠的无线电地图估计（RME），需要对齐训练与部署阶段的采样分布。

Abstract

Learning-based radio map estimation (RME) plays a critical role in UAV-assisted wireless sensing, enabling tasks such as coverage prediction and network optimization. Most current methods assume an independently and identically distributed (i.i.d.) training and testing setting based on random sampling. However, practical UAV measurements are collected sequentially along feasible trajectories, resulting in highly structured and spatially correlated patterns. This mismatch introduces a sampling distribution shift that increases the intrinsic difficulty of spatial field recovery and compromises the generalization of models trained under i.i.d. assumptions. To mitigate this issue, we propose a trajectory-aware training paradigm based on Stochastic-Triggered Trajectory-Based Sampling (ST-TBS), which preserves trajectory continuity while introducing sampling variability. Moreover, from a statistical perspective, we show that trajectory-based sampling reduces spatial diversity and increases information redundancy compared to random sampling. Extensive experiments on the RadioMapSeer and SpectrumNet datasets demonstrate that models trained with random sampling suffer significant performance degradation under trajectory-based observations, with RMSE increasing from 0.0391 to 0.2632 on SpectrumNet. Conversely, our proposed ST-TBS method effectively reduces the RMSE to 0.0571. These results highlight the necessity of aligning training and deployment sampling distributions for reliable RME.
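The mismatch between the two sampling regimes can be pictured with a toy grid. This sketch contrasts uniformly random sample locations with locations collected along a simple random-walk trajectory. It does not implement ST-TBS, whose triggering mechanism the abstract does not describe; the grid size, the random-walk model, and the bounding-box measure of spatial spread are all assumptions for illustration.

```python
import random

def random_sampling(height, width, num_samples, rng):
    """i.i.d.-style baseline: pick distinct grid cells uniformly at random."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    return set(rng.sample(cells, num_samples))

def trajectory_sampling(height, width, num_samples, rng):
    """Collect distinct cells along a random walk that moves one cell per step."""
    pos = (rng.randrange(height), rng.randrange(width))
    visited = {pos}
    while len(visited) < num_samples:
        dr, dc = rng.choice([(0, 1), (0, -1), (1, 0), (-1, 0)])
        # Clamp to the grid: a move off the edge leaves the position unchanged.
        pos = (min(max(pos[0] + dr, 0), height - 1),
               min(max(pos[1] + dc, 0), width - 1))
        visited.add(pos)
    return visited

def bounding_box_area(cells):
    """Crude proxy for spatial spread: area of the smallest box containing all cells."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)

rng = random.Random(0)
random_cells = random_sampling(32, 32, 50, rng)
trajectory_cells = trajectory_sampling(32, 32, 50, rng)
```

Comparing `bounding_box_area(random_cells)` with `bounding_box_area(trajectory_cells)` gives a rough sense of how a contiguous path tends to cover a smaller, more redundant part of the map than the same number of scattered samples.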

Scoring details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on Radio Map Estimation (RME) for UAV wireless sensing, addressing sampling distribution shifts between i.i.d. and trajectory-based data. It does not involve multimodal large language models (MLLM), multimodal integration, World Models, Model-Based Reinforcement Learning, or Unify Models architectures. The core contribution is a sampling paradigm for spatial field recovery, which is unrelated to the provided keyword themes.

Keywords

Radio Map Estimation, UAV-assisted wireless sensing, Sampling distribution shift, Trajectory-based sampling, Spatial field recovery, ST-TBS, Wireless sensing, Distribution alignment

209. A Patient-Specific Pulmonary Arterial Tree Digital Twin to Extract Pulmonary Embolism Biomarkers [FAIL]

Score: 0.0 / 26.5

Authors: Morgane des Ligneris, Nathan Painchaud, Allan Serva, Laurent Bertoletti, Pierre Croisille, Carole Frindel, Odyssée Merveille

Published: 2026-05-27

TL;DR: This study proposes an automated pipeline to generate patient-specific pulmonary arterial tree digital twins for extracting pulmonary embolism biomarkers, enabling rapid and precise thrombotic burden assessment without manual calculation.

Abstract (Chinese translation)

肺栓塞（由血栓阻塞肺动脉所致）是急性心血管综合征的主要病因之一。在临床实践中，基于计算机断层扫描肺动脉造影（Computed Tomography Pulmonary Angiography, CTPA）诊断后的治疗决策依赖于风险分层，该分层将 30 天死亡风险划分为三个等级。这种分层取决于右心室与左心室直径比（right-to-left ventricular diameter ratio, RV/LV ratio）以及两种心肌酶的血清水平。然而，血液生物标志物在急诊环境中并非总是可用，且手动计算既定的严重程度评分（如 Qanadli 评分和 Mastora 评分）耗时颇久，在临床常规实践中很少执行。本研究提出了一种自动化流程，该流程对肺动脉树（pulmonary arterial tree）的有向图（directed graph）表示进行建模，标记其层次结构并对肺栓塞进行表征。该流程提取基于影像的生物标志物，包括局部动脉级别特征（形态学信息、层次位置、血栓体积及由此导致的阻塞）和全局患者级别生物标志物，例如自动计算的严重程度评分（Qanadli 评分和 Mastora 评分）以及按肺叶和层次分布的总栓塞体积。利用人工智能生成的动脉、栓子、肺及肺叶的二值掩膜（binary masks），该流程构建了一个动脉结构的患者数字孪生（patient digital twin）。通过与现有流程、解剖学预期以及手动严重程度评分计算进行比较，对该流程进行了验证，结果表明该流程能够自动生成解剖学准确的数字孪生和严重程度评分，且两者具有高度一致性。这证实了这些影像衍生生物标志物在自动提供快速、精确的血栓负荷（thrombotic burden）及血栓空间分布信息方面的潜力。

Abstract

Pulmonary embolism, the obstruction of a pulmonary artery by a blood clot, is one of the leading causes of acute cardiovascular syndrome. In clinical practice, therapeutic decisions after diagnosis via computed tomography pulmonary angiography rely on risk stratification, which categorizes 30-day mortality risk into three categories. This stratification depends on the right-to-left ventricular diameter ratio and blood levels of two cardiac enzymes. However, blood biomarkers are not always available in emergency settings, and manual calculation of established severity scores - such as Qanadli and Mastora - is time-consuming and rarely performed in clinical routine practice. This study introduces an automated pipeline that models a directed graph representation of the pulmonary arterial tree, labeling its hierarchical structure and characterizing pulmonary embolism. The pipeline derives image-based biomarkers, including local artery-level features (morphological information, hierarchical position, clot volume, and resulting obstruction) and global patient-level biomarkers such as automatically calculated severity scores (Qanadli and Mastora) and the total embolic volume distribution by lobes and hierarchical levels. Using artificial-intelligence-generated binary masks of arteries, emboli, lungs, and lobes, it creates a patient digital twin of the arterial structure. Validation of the pipeline through comparison to an existing pipeline, anatomical expectations, and manual severity score calculations demonstrates the pipeline's ability to automatically generate anatomically accurate digital twins and severity scores with strong agreement. This supports the potential of these image-derived biomarkers to automatically provide rapid, precise information on thrombotic burden and spatial clot distribution.

Scoring Details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

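Scoring rationale: The paper focuses on medical image analysis, building a digital twin of the pulmonary arterial tree to extract pulmonary embolism biomarkers, and belongs to medical AI. The given keywords cover unified large models, world models, multimodal large language models, multimodal learning, and reinforcement learning, none of which relate directly to this paper's content, so every relevance score is 0.

Keywords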

Pulmonary Embolism, Digital Twin, Pulmonary Arterial Tree, Biomarkers, Computed Tomography, Automated Pipeline, Graph Representation, Severity Scores

210. From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment [FAIL]

Score: 0.0 / 26.5

Authors: Francisco Bérchez-Moreno, Riccardo Rosati, Maria Chiara Fiorentino, Víctor M. Vargas, Edoardo Cipolletta, Emilio Filippucci, Luca Romeo, Pedro A. Gutiérrez, César Hervás-Martínez

Published: 2026-05-27

TL;DR: This paper proposes a soft-labelling ordinal deep learning framework for Knee Osteoarthritis and CPPD grading on X-ray images, demonstrating significant performance improvements over conventional one-hot supervision.

Abstract (Chinese translation)

背景与目的。传统的深度学习（DL）方法用于膝关节骨关节炎（KOA）分级，依赖于 one-hot 标签，这无法捕捉 Kellgren-Lawrence（KL）评分和焦磷酸钙沉积病（CPPD）严重程度评分的序数不确定性，以及临床实践中观察到的这两个量表之间的不对称关系。方法。我们回顾性收集了 2172 张膝关节 X 光片，其中包括 968 张同时标注了 KL 和 CPPD 严重程度的放射影像。针对两项任务，开发了一种基于软标注（soft-labelling）的序数深度学习框架，用以标注等级为中心的单峰概率分布替代 one-hot 目标。研究了四种分布形式：二项式、贝塔、三角和指数。结果。所有软标注策略一致优于名义基线。对于 CPPD 分级，三角形式实现了最高的二次加权 kappa（QWK）和最低的平均绝对误差（MAE）（QWK = 0.796; MAE = 0.438），而贝塔形式在考虑各类别的平均 MAE（AMAE）和最大 MAE（MMAE）时提供了最平衡的类别级性能（AMAE = 0.458; MMAE = 0.573）。对于 KL 分级，基于贝塔的方法提供了最佳的整体性能，实现了最高的 QWK 以及最低的 MAE 和类别级误差（QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775）。统计分析表明，相对于传统的 one-hot 监督，性能有显著提升（p < 0.001）。

Abstract

Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren-Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
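To make "unimodal probability distributions centred on the annotated grade" concrete, here is a minimal sketch of two of the four named formulations, binomial and triangular, plus a cross-entropy against a soft target. The parameter choices (the binomial success probability and the triangular `width`) and all function names are illustrative assumptions, not the authors' settings; the example uses the five KL grades 0 to 4.

```python
from math import comb, log


def binomial_soft_label(grade, num_classes):
    """Binomial(K-1, p) over classes 0..K-1.

    Assumption: p = (grade + 0.5) / K, which places the mode of the
    distribution exactly on the annotated grade.
    """
    n = num_classes - 1
    p = (grade + 0.5) / num_classes
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(num_classes)]


def triangular_soft_label(grade, num_classes, width=2.0):
    """Triangle peaked at `grade`, falling linearly to zero at distance `width`."""
    weights = [max(0.0, 1.0 - abs(k - grade) / width) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and predicted class probabilities."""
    return -sum(t * log(p + eps) for t, p in zip(target, predicted))


# KL grade 2 out of five grades (0..4)
tri = triangular_soft_label(2, 5)   # [0.0, 0.25, 0.5, 0.25, 0.0]
one_hot = [0.0, 0.0, 1.0, 0.0, 0.0]

# A prediction that is off by one grade costs less against the soft target
# than against the one-hot target, which is the ordinal effect being sought.
near_miss = [0.01, 0.90, 0.07, 0.01, 0.01]
loss_soft = soft_cross_entropy(tri, near_miss)
loss_hard = soft_cross_entropy(one_hot, near_miss)
```

Unlike a one-hot target, the soft target gives partial credit to adjacent grades, so the loss scales with ordinal distance rather than treating all errors alike.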

Scoring Details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

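Scoring rationale: The paper focuses on medical image analysis, assessing knee osteoarthritis with an ordinal deep learning framework based on soft labels, and is unrelated to unified models, world models, multimodal large language models, multimodality (in the large-model sense), or model-based reinforcement learning. The author list does not include any of the specified experts.

Keywords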

Knee Osteoarthritis, Calcium Pyrophosphate Deposition, Soft-Labelling Framework, Ordinal Deep Learning, Knee X-ray Images, One-hot Labels, Quadratic Weighted Kappa, Medical Image Analysis

211. A novel ordinal multi-view aggregation scheme for oak defoliation [FAIL]

Score: 0.0 / 26.5

Authors: Francisco Bérchez-Moreno, Ricardo Enrique Hernández-Lambraño, David Guijo-Rubio, Víctor Manuel Vargas, Francisco José Ruiz-Gómez, Juan Carlos Fernández, Pablo González-Moreno

Published: 2026-05-27

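TL;DR: This paper proposes a multi-view CNN ensemble framework based on ground-level imagery for ordinal classification of oak defoliation, aiming to improve the accuracy of forest health assessment.

Abstract (Chinese translation)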

由气候和生物胁迫驱动的森林衰退威胁着生态系统功能，使得准确监测树木健康状况至关重要。本研究将树木落叶程度估计视为一个序数分类问题，并利用地面影像。我们提出了一种新颖的多视图集成框架，该框架聚合了在个体树木不同视角（北、南及树冠）下训练的卷积神经网络（CNN）的预测结果。该方法利用互补的视觉信息，并通过同质集成设计保持建模一致性。通过比较多种序数分类方法并分析各视图及其组合的贡献，进行了全面评估。结果表明，对落叶程度水平的序数结构进行建模比名义方法性能更优，而提出的多视图集成始终优于单视图和成对配置。特别是，三视图集成在所有评估指标上实现了最稳健且准确的预测。这些发现强调了结合深度学习（DL）、序数分类（OC）和多视图聚合在复杂生态系统（如地中海德赫萨 (dehesas)）中进行可扩展、一致且客观的森林健康评估的潜力。

Abstract

Forest decline driven by climate and biotic stressors threatens ecosystem functioning, making accurate monitoring of tree health essential. In this work, we address tree defoliation estimation as an ordinal classification problem using ground-level imagery. We propose a novel multi-view ensemble framework that aggregates predictions from Convolutional Neural Networks (CNNs) trained on different perspectives of individual trees (north, south, and crown). This approach leverages complementary visual information while preserving modelling consistency through a homogeneous ensemble design. A comprehensive evaluation is conducted by comparing multiple ordinal classification methods and analysing the contribution of each view and their combinations. Results show that modelling the ordinal structure of defoliation levels improves performance over nominal approaches, while the proposed multi-view ensemble consistently outperforms single-view and pairwise configurations. In particular, the three-view ensemble achieves the most robust and accurate predictions across all evaluation metrics. These findings highlight the potential of combining Deep Learning (DL), Ordinal Classification (OC), and multi-view aggregation for scalable, consistent, and objective forest health assessment in complex ecosystems such as Mediterranean dehesas.
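The abstract does not spell out how the per-view predictions are combined, so the sketch below swaps in plain soft voting: average the class probabilities from the north, south, and crown models, then take the most probable defoliation level. This is a generic baseline for illustration, not the paper's aggregation scheme; the function names, the three-level setup, and the probability values are all hypothetical.

```python
def aggregate_views(view_probs):
    """Average class probabilities across views (plain soft voting).

    view_probs: dict mapping a view name to a list of class probabilities,
    all of the same length. Soft voting is an assumed stand-in for the
    paper's ordinal aggregation scheme.
    """
    vectors = list(view_probs.values())
    num_classes = len(vectors[0])
    if any(len(v) != num_classes for v in vectors):
        raise ValueError("all views must predict the same number of classes")
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(num_classes)]


def predict_level(probs):
    """Index of the most probable defoliation level."""
    return max(range(len(probs)), key=lambda k: probs[k])


# Invented outputs of three single-view CNNs for one tree, three levels
views = {
    "north": [0.1, 0.6, 0.3],
    "south": [0.2, 0.3, 0.5],
    "crown": [0.1, 0.3, 0.6],
}
ensemble = aggregate_views(views)
level = predict_level(ensemble)
```

In this made-up case the north view alone would predict level 1, while the ensemble settles on level 2, showing how complementary views can overturn a single-view decision.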

Scoring Details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

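Scoring rationale: The paper studies ordinal classification of oak defoliation with a multi-view CNN ensemble framework, an application of computer vision to ecology. The provided keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL) all point to frontier large-model architectures and reinforcement learning. The paper does not involve unified models, world models, large language models, multimodal learning (the input here is single-modality, multi-view), or reinforcement learning; its methodology is a conventional deep learning ensemble. The author list does not include Yang Shi or the other specified experts, so no bonus points apply.

Keywords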

Oak Defoliation, Ordinal Classification, Multi-view Ensemble, Convolutional Neural Networks, Forest Health Monitoring, Ground-level Imagery, Tree Health Assessment

212. CLEAR-NeRF: Collinearity and Local-region Enhanced Accurate 3D Reconstruction in Unbounded Scenes [FAIL]

Score: 0.0 / 26.5

Authors: Vladislav Polianskii, Elijs Dima, Isabel Salmerón Marazuela, Gergő László Nagy, Sigurdur Sverrisson, Volodya Grancharov

Published: 2026-05-27

Abstract (Chinese translation)

许多现实世界的三维重建应用要求在无界、复杂场景中实现照片级真实感和度量准确性，这些场景具有挑战性光照和不完美采集，而当前的神经辐射场（NeRF）流程仅部分满足这些需求。本研究将基于 NeRF 的三维重建适配到多感兴趣区域无界场景，以提高对光照和姿态变化的鲁棒性，同时确保适用于数字孪生应用的度量准确性。我们的方法引入了（i）自动局部区域定位/检测与重建，以无缝优先处理感兴趣区域而不增加过多子模块，（ii）共线约束射线采样，以学习平滑的平面和曲面，（iii）深度局部邻域点提取，以抑制表面伪影，以及（iv）几何相关颜色聚合，以减轻由光照和姿态引起的变化。结果表明，所提出的流程在性能上优于基准 NeRF 模型以及传统的运动恢复结构（SfM）- 多视图立体视觉（MVS）解决方案。

Abstract

Many real-world 3D reconstruction applications demand photorealism and metric accuracy across unbounded, complex scenes with challenging lighting and imperfect captures that current Neural Radiance Field (NeRF) pipelines only partly satisfy. This study adapts NeRF-based 3D reconstruction to multi-region of interest unbounded scenes to improve robustness to lighting and pose variation while enforcing metric accuracy suitable for digital-twin applications. Our approach introduces (i) automated local region localization/detection and reconstruction to seamlessly prioritize areas of interest without proliferating submodules, (ii) collinearity-enforcing ray sampling to learn smooth planar and curved surfaces, (iii) depth-localized neighborhood point extraction to suppress surface artifacts, and (iv) geometry-relevant color aggregation to mitigate lighting- and pose-caused variations. Results indicate superior performance of the proposed pipeline over the baseline NeRF models and established Structure from Motion (SfM) - Multi-View Stereo (MVS) solutions.

Scoring Details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 10 column 99 (char 273)

213. ST-ColoNet: Spatio-Temporal Colon Segment Recognition via Hybrid Attention and Edge-Guided Feature Learning [FAIL]

Score: 0.0 / 26.5

Authors: Ziyi Wang, Zhengjie Zhang, Jingsheng Gao, Dahong Qian, Suncheng Xiang

Published: 2026-05-27

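TL;DR: ST-ColoNet proposes a spatio-temporal network based on hybrid attention and edge-guided feature learning, achieving state-of-the-art performance on colon segment recognition in colonoscopy videos.

Abstract (Chinese translation)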

结肠镜视频中的结肠段识别 (Colo-segment recognition) 是许多下游任务的关键需求，但现有的自动识别方法仅使用结肠镜图像，未充分利用时序信息，导致性能不佳。此外，相关的基于视频的公共数据集稀缺。为了解决这一问题，我们构建并发布了一个专门用于结肠段识别任务的标注数据集。此外，我们提出了一种基于两阶段的深度学习框架——时空网络结肠段识别 (ST-ColoNet)，用于从结肠镜视频中进行结肠段识别。该框架包含 Colorlaus 模块，利用度量学习优化基于边缘的空间特征提取；以及 Full-Temp 模块，结合三种自注意力模式，以更好地近似长结肠镜序列上的全自注意力并优化时序特征聚合。通过广泛的消融实验，我们表明我们的框架能够在结肠段识别任务上实现最先进的性能，准确率达到 81.0%，F1 分数 (F1-score) 为 70.7%，相较于现有最先进方法取得了巨大提升。

Abstract

Colo-segment recognition in colonoscopy videos is a key requirement for many downstream tasks, but existing automatic recognition methods only use colonoscopy images without fully exploiting temporal information, leading to poor performance. Additionally, relevant public video-based datasets are scarce. To tackle this problem, we curate and release a labeled dataset specifically for the task of colo-segment recognition. In addition, we propose a two-stage deep learning-based framework, Colo-Segment Recognition via SpatioTemporal Network (ST-ColoNet), for colo-segment recognition from colonoscopy videos. It includes the Colorlaus module, which uses metric learning to optimize edge-mediated spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long colonoscopy sequences and optimize temporal feature aggregation. Through extensive ablation experiments, we show that our framework achieves state-of-the-art performance on the task of colo-segment recognition, with an accuracy of 81.0% and an F1-score of 70.7%, which is a tremendous improvement over prior state-of-the-art methods.

Scoring Details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

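Scoring rationale: The paper focuses on colon segment recognition in medical image analysis, using a supervised learning framework with spatio-temporal attention. It does not involve unified model architectures, world models, multimodal large language models, cross-modal large models, or reinforcement learning algorithms, and has no direct connection to the research directions of the given keywords, so every relevance score is 0. The author list does not include any of the specified experts.

Keywords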

Spatio-Temporal Network, Colon Segment Recognition, Hybrid Attention, Edge-Guided Feature Learning, Colonoscopy Videos, Metric Learning, Self-Attention, Deep Learning Framework

214. Automated Estimation of Impact Time, Impact Location, and Shuttlecock Speed in Badminton Smashes Using Event Cameras [FAIL]

Score: 0.0 / 26.5

Authors: Yudai Washida, Yuto Kase, Kai Ishibe, Ryoma Yasuda, Sakiko Hashimoto

Published: 2026-05-27

TL;DR: This paper proposes an automated measurement method using synchronized event cameras to estimate badminton smash impact time, location, and speed with high accuracy compared to high-speed cameras.

Abstract (Chinese translation)

量化羽毛球杀球中的冲击现象对于评估运动表现和器材性能均至关重要；然而，传统测量系统在时间分辨率、数据效率及准备工作量之间往往存在权衡。本研究提出了一种基于两个同步事件相机（event cameras）的测量方法，旨在同一试验中以集成方式自动估计击球时刻、拍面上的击球点以及击后羽毛球速度。挥拍间隔通过事件率统计检测，击球时刻依据侧视图事件数据中羽毛球轨迹的拐点进行估计，击球点通过后视图事件图像中拍面的椭圆拟合确定，而羽毛球速度则在矢状面中计算得出。为验证所提出的方法，本研究基于五名球员的 125 个杀球试验，采用基于高速摄像机的参考方法进行了 Bland-Altman 分析。在所有 124 个可分析试验中，击球时刻和羽毛球速度均被成功估计，而击球点则在 93.5% (116/124) 的试验中被估计。击球时刻、内外侧击球点、纵向击球点以及羽毛球速度的偏差（95% 置信区间 (CI)）分别为 1.84 ms (1.45 至 2.23)、3.45 mm (2.18 至 4.72)、-1.92 mm (-2.97 至 -0.88) 和 -1.00 m/s (-2.46 至 0.46)。各项指标均未观察到比例偏差。结果表明，所提出的方法可作为实用环境中评估羽毛球杀球表现及器材性能的集成评估工具。

Abstract

Quantifying impact phenomena in badminton smashes is important for evaluating both athletic performance and equipment; however, conventional measurement systems involve trade-offs between temporal resolution, data efficiency, and preparation effort. This study proposes a measurement method using two synchronized event cameras to automatically estimate impact time, impact location on the racket face, and post-impact shuttlecock speed in an integrated manner within the same trial. The swing interval was detected from event rate statistics, impact time was estimated from the shuttlecock trajectory inflection in the lateral-view event data, impact location was determined by ellipse fitting to the racket face in the rear-view event image, and shuttlecock speed was calculated in the sagittal plane. To validate the proposed method, Bland-Altman analysis was performed against a high-speed camera-based reference method using 125 smash trials from five players. Impact time and shuttlecock speed were estimated in all 124 analyzable trials, and impact location was estimated in 93.5% (116/124). The bias (95% CI) for impact time, medio-lateral impact location, longitudinal impact location, and shuttlecock speed were 1.84 ms (1.45 to 2.23), 3.45 mm (2.18 to 4.72), -1.92 mm (-2.97 to -0.88), and -1.00 m/s (-2.46 to 0.46), respectively. No proportional bias was observed for any metric. These results suggest that the proposed method can serve as a useful tool for integrated assessment of badminton smash performance and equipment in practical settings.
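The validation rests on Bland-Altman analysis, which summarizes agreement through the mean difference (bias) between a method and its reference. A minimal standard-library sketch follows. It uses a normal approximation for the 95% confidence interval where a t-distribution would be more exact for small samples, the function name is hypothetical, and the four speed values are invented for illustration rather than taken from the study. The proportional-bias check mentioned in the abstract is not covered.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def bland_altman(method, reference):
    """Bias, 95% CI of the bias, and 95% limits of agreement.

    Uses the normal quantile (about 1.96) as an approximation; for small
    samples a t quantile would widen the confidence interval slightly.
    """
    if len(method) != len(reference) or len(method) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    diffs = [m - r for m, r in zip(method, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)                      # sample standard deviation
    z = NormalDist().inv_cdf(0.975)
    half_ci = z * sd / sqrt(len(diffs))    # uncertainty of the bias itself
    return {
        "bias": bias,
        "ci95": (bias - half_ci, bias + half_ci),
        "limits_of_agreement": (bias - z * sd, bias + z * sd),
    }


# Invented shuttlecock speeds in m/s: event-camera estimate vs. reference
event_camera = [10.0, 12.0, 11.0, 13.0]
high_speed = [9.0, 10.0, 11.0, 12.0]
result = bland_altman(event_camera, high_speed)
```

A 95% CI that spans zero, as reported for shuttlecock speed (-2.46 to 0.46 m/s), means the data are consistent with no systematic offset, whereas the impact-time interval (1.45 to 2.23 ms) indicates a small but consistent positive offset.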

Scoring Details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: The paper focuses on sports science and computer vision using event cameras for badminton measurement, which is completely unrelated to the provided AI/ML keyword set (Unify Models, World Models, MLLM, MultiModal AI, Model-Based RL). No relevant methodologies or concepts from the keywords are discussed. None of the specified expert authors are present in the author list.

Keywords

Event Cameras, Badminton Smashes, Impact Time, Impact Location, Shuttlecock Speed, Synchronized Cameras, Measurement Method

215. ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation [FAIL]

Score: 0.0 / 26.5

Authors: Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang, Chengrui Wu, Xudong Xu, Xingang Pan

Published: 2026-05-27

Abstract (Chinese translation)

统一且可扩展的 Transformer 最近在建模与计算机图形学相关的多样化现象方面取得了显著成功，例如 3D 视觉效果、渲染过程以及视频中的运动。在这项工作中，我们更进一步，探究现代 Transformer 技术是否能够应对织物模拟这一具有挑战性的任务。为此，我们提出了 ClothTransformer，这是一个将织物模拟重新表述为学习到的潜在空间（latent space）中的自回归序列建模的框架。现有的神经织物模拟器大多专门针对单一场景，本质上与网格离散化耦合，并且缺乏鲁棒的碰撞处理。我们的方法通过以下三个贡献解决了这些局限性：(1) 一个统一的 Transformer 架构，能够在单一模型下处理多样化场景——人体驱动服装、机器人操作和自由落体碰撞——并且在所有场景下实现了比先前最先进的技术低约 4 到 9 倍的误差；(2) 一种可扩展的潜在空间表述，它将任意分辨率的网格压缩为一组固定大小的潜在令牌，使得时间动态计算独立于网格分辨率；(3) 一个涵盖所有三种设置的高保真无穿透数据集，包含约 493.4k 帧，这使得一个可微分的连续碰撞检测（CCD）模块能够抑制穿透伪影。

Abstract

Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios (body-driven garments, robotic manipulation, and free-fall collisions) under a single model and achieves approximately 4-9× lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ~493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.

Scoring Details

Keyword	Weight	Relevance
Unify Models	2.0	0.0/10
World Models	2.0	0.0/10
MLLM	2.0	0.0/10
MultiModal	2.0	0.0/10
model-based RL	2.0	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 10 column 90 (char 264)

Token usage: 3,444,064 tokens (input 452,595 / output 2,991,469)