arXiv Daily Report 2026-06-04

DailyPapers


ArXiv Research Report

Generated: 2026-06-04 03:18:19 | Passing score: 27.8

Total

Qualified

Analyzed

Pass Rate: 16%

Papers

1. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models (PASS)

Score: 72.0 / 27.8

Authors: Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang, Tuhin Kundu, Chris Dangjoo Kim, Zelun Luo, Linda Shapiro, Ranjay Krishna

Published: 2026-06-02

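TL;DR: The paper proposes Imaginative Perception Tokens (IPT) to address spatial reasoning in multimodal language models when key information is not observable; supervising these intermediate representations improves accuracy without generating images at inference time.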

Abstract

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4% and achieves competitive performance with strong closed-source models on PT. We further find that combining IPT and label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	8.0/10	12.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	6.0/10	9.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	2.0/10	3.0

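Scoring rationale: The paper centers on spatial reasoning in multimodal language models (MLLMs); the title and abstract explicitly concern MultiModal and MLLM, so these score highest (9.0). Introducing Imaginative Perception Tokens relates to Tokenizer design (8.0), and building on the unified VLM BAGEL makes Unify Models relevant (8.0). Visual Encoder is implicit as a basic VLM component (6.0), and World Models is moderately related to representations of spatial imagination (6.0). The paper does not involve reinforcement learning, so model-based RL has very low relevance (2.0). The weighted total is 72.0, far above the passing score of 27.8. The author list contains none of the designated experts, so no bonus applies.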
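The total in the score table can be reproduced by summing weight times relevance over the keywords. The sketch below is only an illustration of that arithmetic: the function name `weighted_total`, the optional `author_bonus` parameter, and the inclusive pass threshold are assumptions, not the report generator's actual code.

```python
# Rows copied from the score table above: (keyword, weight, relevance out of 10).
ROWS = [
    ("Unify Models", 1.5, 8.0),
    ("Tokenizer", 1.5, 8.0),
    ("Visual Encoder", 1.5, 6.0),
    ("World Models", 1.5, 6.0),
    ("MLLM", 1.5, 9.0),
    ("MultiModal", 1.5, 9.0),
    ("model-based RL", 1.5, 2.0),
]
PASSING_SCORE = 27.8  # taken from the report header


def weighted_total(rows, author_bonus=0.0):
    """Sum weight * relevance over all keywords, plus an optional author bonus."""
    return sum(weight * relevance for _, weight, relevance in rows) + author_bonus


total = weighted_total(ROWS)
passed = total >= PASSING_SCORE  # assumption: the threshold is inclusive
print(f"{total:.1f} {'PASS' if passed else 'FAIL'}")  # prints: 72.0 PASS
```

Passing a nonzero `author_bonus` would model the expert-author bonus that some entries in this report receive.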

Keywords

Imaginative Perception Tokens, Spatial Reasoning, Multimodal Language Models, Unified VLM, Perspective Taking, Intermediate Representations, Vision Language Models

In-Depth Analysis

Chinese Title: 想象感知标记增强多模态语言模型的空间推理

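Summary: The paper proposes Imaginative Perception Tokens (IPT), an intermediate perceptual representation that externalizes what a vision language model (VLM) would perceive under unobserved spatial configurations while staying consistent with the observed input. For three tasks that require imaginative perception (perspective taking, path tracing, and multiview counting), the authors build datasets of about 20K examples covering simulated and real scenes, with ground-truth intermediate imaginations, final answers, and human-curated evaluation benchmarks. With the unified multimodal model BAGEL as the backbone, IPT supervision markedly improves spatial reasoning and outperforms textual chain-of-thought training on some tasks, even when no image is generated at inference time. Mixing IPT supervision with label-only data improves performance further. In contrast, textual chain of thought hurts performance on some tasks, which points to a modality mismatch. IPT offers a principled supervision signal for reasoning about unobserved structure, yielding stronger spatial generalization and more interpretable intermediate representations.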

Innovations:

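Proposes Imaginative Perception Tokens (IPT), a new intermediate perceptual representation for predicting unobserved spatial structure rather than only refining visible structure.
Defines three spatial reasoning tasks that require imaginative perception (perspective taking, path tracing, multiview counting) and builds the corresponding datasets and human-curated evaluation benchmarks.
Shows experimentally that IPT supervision outperforms textual chain-of-thought training, and that mixing IPT with label-only data improves performance further.
Finds that textual chain of thought can substantially degrade performance on some spatial tasks, revealing a mismatch between the language modality and spatial computation.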

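Methodology: The unified multimodal model BAGEL serves as the backbone and is trained to generate IPT (that is, intermediate images) together with the final answer. Data sources include simulated and real environments such as AI2-THOR, Habitat, and real images. Training regimes include IPT supervision (intermediate image plus answer), answer-only supervision, and textual chain-of-thought supervision. Evaluation is run on the human-curated benchmarks and compares the supervision strategies.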

Key Results:

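On multiview counting, IPT supervision improves accuracy by 3.4%.
On path tracing, IPT supervision reaches performance competitive with strong closed-source models.
Mixing IPT supervision with label-only data improves performance further.
Textual chain-of-thought training lowers performance on some tasks, suggesting that the language modality is ill-suited to spatial computation.
The gains from IPT supervision persist when no image is generated at inference time, suggesting that the model's internal spatial representation is strengthened.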

Tech Stack:

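BAGEL model (unified multimodal architecture supporting text and image generation)
Simulation environments and scene datasets such as AI2-THOR, Habitat, and ScanNet, used for data generation
Real-image datasets (e.g., MessyTable)
Human-curated evaluation benchmarks
Textual chain of thought (CoT) as the comparison method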

Strengths:

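Proposes a novel, principled intermediate supervision signal for unobserved structure in spatial reasoning.
Builds high-quality, diverse datasets and evaluation benchmarks covering simulated and real scenes.
The experimental design is thorough, comparing several supervision modes (IPT, answer-only, textual CoT) and surfacing key findings.
Exposes the limits of textual chain of thought on spatial tasks, an important pointer for future work.
The gains from IPT supervision generalize: they hold even when no image is generated at inference time.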

Limitations:

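Experiments use only the BAGEL model, so the findings may not carry over to other architectures or training paradigms.
The datasets are relatively small (about 20K examples), which may limit generalization.
The task definitions are confined to three specific scenarios and do not cover broader spatial reasoning problems.
The quality of imaginative perception is assessed mainly through downstream task performance, with no direct metric for the intermediate representations themselves.
Scaling of IPT with model size or training data volume is not explored.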

Relevance To Keywords:

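Native multimodal large models: The paper uses the unified multimodal model BAGEL as the backbone and trains it to generate intermediate images and answers, which directly concerns unified multimodal understanding and generation.
World models: IPT can be viewed as an internal world-model representation for predicting unobserved spatial states, closely related to the idea of building an internal model of the environment.
Representation learning: As an intermediate perceptual representation, IPT learns to infer missing spatial structure from the input, which falls under representation learning.
Model-based reinforcement learning: The imaginative perception tasks (such as perspective taking and path tracing) resemble model-based planning in reinforcement learning (imagining future states), but the paper does not directly involve reinforcement learning training.
Post-training: IPT supervision can be viewed as a post-training strategy that improves spatial reasoning through an intermediate supervision signal.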

2. Benchmarking Visual State Tracking in Multimodal Video Understanding (PASS)

Score: 68.0 / 27.8

Authors: Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie

Published: 2026-06-02

TL;DR: This paper introduces the VSTAT benchmark to evaluate visual state tracking in MLLMs, revealing that current models fail at continuous visual perception in videos despite strong reasoning capabilities.

Abstract

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	10.0/10	15.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	3.0/10	4.5
Author bonus	-	+5.0	Expert: Saining Xie

Scoring rationale: The paper explicitly centers on MLLMs and Multimodal Video Understanding, justifying top scores for MLLM and MultiModal. Visual Encoder is pertinent as the study analyzes visual perception failures within these models. Unify Models and World Models are conceptually linked to the unified architecture and state tracking nature of the benchmark but are not the core technical contribution. Tokenizer and model-based RL are minimally relevant as they are not discussed or central to the evaluation framework.

Keywords

Visual State Tracking, Multimodal Video Understanding, MLLM, Benchmark, Visual Perception, Continuous Tracking, Video Understanding

In-Depth Analysis

Chinese Title: 多模态视频理解中的视觉状态跟踪基准测试

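Summary: The paper presents VSTAT (Visual STAte Tracking benchmark), a benchmark built specifically to evaluate how well multimodal large language models (MLLMs) continuously track visual state in video. VSTAT contains 834 video clips and 1,500 questions spanning synthetic (Blender-rendered), real-world (YouTube), and self-recorded videos. Each question is designed so that it cannot be answered from a single frame or a few key frames, requiring the model to perceive and integrate events continuously across the whole video stream. Experiments show that current state-of-the-art MLLMs (such as GPT-4o and Gemini) perform far below humans on VSTAT (humans about 90%, models only 30-40%) and only slightly above prior-based baselines. Controlled experiments (temporal stretching, comparison against text transcripts) and analysis of thinking traces show that the main bottleneck is visual perception rather than reasoning: when events are described in a text transcript, models answer almost perfectly, but they fail on the raw video. The analysis further identifies three failure modes: event recognition errors, entity association errors, and state update errors. Even with recent agent frameworks (such as VideoAgent and CodeAct), performance does not improve significantly. The benchmark exposes a fundamental weakness of current MLLMs in dynamic visual tracking.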

Innovations:

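Proposes the VSTAT benchmark, which targets visual state tracking specifically and fills a gap in existing video understanding benchmarks.
Evaluates MLLMs systematically through diverse tasks on synthetic and real videos (such as counting, typing, and object tracking).
Through a text-transcript comparison, shows explicitly for the first time that the bottleneck for MLLMs in visual state tracking is visual perception rather than reasoning.
Identifies and categorizes three main failure modes: event recognition, entity association, and state update.
Systematically evaluates several agent frameworks (video agents, coding agents) and finds them ineffective on this task.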

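Methodology: 1. Data construction: 450 videos are synthesized in Blender across 9 environments; 304 real videos are collected from YouTube; 80 scripted videos are self-recorded. Each video comes with multiple-choice questions whose answers cannot be inferred from a single frame. 2. Evaluation setup: several MLLMs (GPT-4o, Gemini 1.5 Pro, Qwen2-VL, and others) are tested in a standard question-answering format. 3. Controlled experiments: synthetic videos are temporally stretched (lengthening event durations) to test whether frame sampling is the bottleneck; for simple tasks, text transcripts are written by hand so that performance with video input can be compared against text input. 4. Failure analysis: model thinking traces are collected, compared with the video stream, and sorted into failure modes. 5. Agent framework tests: VideoAgent (an MLLM-based video agent) and CodeAct (a coding agent) are integrated to observe the change in performance.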
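The abstract compares models against "answer-prior baselines" but this report does not define them. One common reading, assumed here, is a baseline that ignores the video and always predicts the most frequent answer option. The sketch below illustrates that reading only; the function name and the toy answer keys are hypothetical and are not taken from VSTAT.

```python
from collections import Counter


def answer_prior_accuracy(reference_answers, test_answers):
    """Accuracy of always predicting the most frequent answer in reference_answers."""
    most_common_answer, _ = Counter(reference_answers).most_common(1)[0]
    hits = sum(1 for answer in test_answers if answer == most_common_answer)
    return hits / len(test_answers)


# Hypothetical multiple-choice answer keys (not VSTAT data).
reference = ["A", "B", "B", "C", "B", "D"]
test = ["B", "A", "B", "C"]
baseline = answer_prior_accuracy(reference, test)
print(baseline)  # prints: 0.5
```

A model that scores only modestly above such a baseline is extracting little information from the video itself, which is the point the abstract makes.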

Key Results:

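Humans reach about 90% accuracy on VSTAT, while the best MLLM (GPT-4o) reaches only about 40% and other models score lower.
With text transcripts, MLLM accuracy approaches 100%, indicating that reasoning ability is sufficient and visual perception is the bottleneck.
Temporal stretching yields only a small improvement (about 2-5%), so frame sampling is not the main problem.
Failure mode analysis: event recognition errors account for 40%, entity association errors for 30%, and state update errors for 30%.
Agent frameworks such as VideoAgent and CodeAct do not improve performance significantly (accuracy stays below 50%).
Models do relatively well on simple counting tasks (such as counting page turns) but poorly on tasks that need sustained association (such as tracking a ball's position).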

Tech Stack:

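Blender (3D rendering engine, used to generate synthetic videos)
Multimodal large language models: GPT-4o, Gemini 1.5 Pro, Qwen2-VL, LLaVA-NeXT-Video, and others
VideoAgent (MLLM-based video agent framework)
CodeAct (coding agent framework for automatically generating and executing code)
Text transcripts (manually written frame-level event descriptions)
Thinking-trace analysis (chain of thought, used to extract the model's reasoning process)
Python (evaluation code, data processing)
Hugging Face (dataset hosting)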

Strengths:

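Rigorous benchmark design: questions cannot be answered from a single frame or a few key frames, which forces continuous tracking.
Data diversity: covers synthetic, real, and self-recorded videos with a rich set of task types (counting, tracking, recognition, and others).
In-depth analysis: controlled experiments and failure mode classification reveal the nature of the perception bottleneck.
Open and transparent: a website, dataset, and evaluation code are provided for reproduction and extension.
Timely: evaluates the latest MLLMs and agent frameworks, reflecting the current state of the art.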

Limitations:

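Limited scale: 834 videos and 1,500 questions may not cover every state tracking scenario.
There is a domain gap between synthetic and real videos, and some synthetic tasks are overly simple or visibly artificial.
Only the question-answering format is evaluated; more complex generative or interactive tasks are not covered.
Does not explore how training (such as post-training or reinforcement learning) could improve visual state tracking.
Only two agent frameworks are tested, without hyperparameter tuning or task-specific adaptation.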

Relevance To Keywords:

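Native multimodal large models: The paper directly evaluates the video understanding of several native multimodal large models (such as GPT-4o and Gemini) and exposes their shortcomings in visual state tracking.
Representation learning: Visual state tracking requires visual representations that support continuous tracking of entities and states; the failure analysis indicates that current representation learning is not sufficient for dynamic tracking.
World models: Tracking state is in essence the process of building and updating an internal world model, so VSTAT can be viewed as a test of a world model's dynamic reasoning.
Reinforcement learning and post-training: Not addressed directly, but the identified perception bottleneck suggests that post-training or reinforcement learning (for example, video-level rewards) could improve visual tracking, a direction for future work.
Unified understanding and generation in multimodal large models: The paper focuses on understanding (question answering), but state tracking underpins generative models (such as video generation), so it is relevant.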

3. Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection (PASS)

Score: 63.0 / 27.8

Authors: Senjie Jin, Peixin Wang, Boyang Liu, Xiaoran Fan, Shuo Li, Zhiheng Xi, Jiazheng Zhang, Yuhao Zhou, Tao Gui, Qi Zhang, Xuanjing Huang

Published: 2026-06-02

TL;DR: This paper proposes VEPO, a reinforcement learning framework that integrates visual sensitivity with token entropy to enhance visual reasoning performance in multimodal large language models.

Abstract

While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	7.0/10	10.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	5.0/10	7.5

Scoring rationale: The paper focuses on Reinforcement Learning for visual reasoning in MLLMs, scoring high on MultiModal, MLLM, and Visual Encoder due to the vision-language context and reliance on visual features. Tokenizer is relevant due to token-level entropy operations. Unify Models and World Models are less relevant as the paper proposes a specific optimization method (VEPO) rather than architectural unification or world modeling. Model-based RL is moderately relevant as it involves RL policy optimization, though primarily model-free. No expert authors from the specified list were found in the author list.

Keywords

Reinforcement Learning, Visual Reasoning, Token Selection, Visual Grounding, Policy Optimization, Multimodal Large Language Models, Entropy

In-Depth Analysis

Chinese Title: 熵是不够的：通过视觉锚定的令牌选择解锁视觉推理的有效强化学习

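Summary: The paper studies why token-entropy-based credit assignment for reinforcement learning fails on visual reasoning tasks. It finds that the entropy-driven mechanism that works in text-only reasoning collapses in visual reasoning, because vision-sensitive tokens typically have low entropy while high-entropy tokens are driven mainly by linguistic uncertainty. The authors therefore propose VEPO (Vision-Entropy token-selection for Policy Optimization), which uses a counterfactual forward pass (on a noise-perturbed image) to compute token-level Jensen-Shannon divergence (JSD) and absolute entropy difference (|ΔH|), and couples these two visual-sensitivity signals multiplicatively with token entropy, so that gradient updates go to tokens that are both vision-sensitive and highly informative. Experiments show that VEPO significantly outperforms the entropy-only baseline on Qwen2.5-VL at the 7B and 3B scales, by 2.28 and 3.15 points on average respectively, and achieves leading performance on multiple visual reasoning benchmarks.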

Innovations:

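Provides the first systematic diagnosis of why token-entropy-based credit assignment collapses in visual reasoning: the entropy mechanism ignores vision-sensitive tokens with low entropy.
Proposes the VEPO framework, which multiplicatively couples two complementary visual-sensitivity signals, JSD and |ΔH|, with token entropy to achieve vision-anchored token selection.
Introduces a counterfactual forward pass that quantifies how much each token depends on the visual input by perturbing the image with noise, giving multimodal reinforcement learning a new angle on credit assignment.
Validates the method on multiple visual reasoning benchmarks and two model scales, with comprehensive ablations showing the importance of each component.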

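Methodology: A Qwen2.5-VL model is trained with GRPO, generating responses at each iteration. For every response token, two forward passes are run: one with the original image and one with a noise-perturbed image. The JSD and |ΔH| between the two output distributions are computed, measuring distribution shift and change in uncertainty respectively. JSD, |ΔH|, and entropy are then normalized per response and multiplied together to give a joint vision-entropy score. Finally, the top k% of tokens by this score receive policy gradient updates, and the gradients of the remaining tokens are set to zero. All other hyperparameters are kept the same.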
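The selection rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy distributions, the use of natural logarithms, and the reading of "multiplicative coupling" as the product of the three min-max-normalized signals are all assumptions.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl(u, v):
    """KL divergence KL(u || v); assumes v > 0 wherever u > 0."""
    return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)


def jsd(p, q):
    """Jensen-Shannon divergence (nats) between two probability vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def min_max(values):
    """Rescale one response's scores to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_tokens(clean_dists, noisy_dists, keep_fraction):
    """Return a 0/1 mask keeping the top tokens by JSD * |dH| * entropy."""
    pairs = list(zip(clean_dists, noisy_dists))
    js = min_max([jsd(p, q) for p, q in pairs])
    dh = min_max([abs(entropy(p) - entropy(q)) for p, q in pairs])
    h = min_max([entropy(p) for p in clean_dists])
    scores = [a * b * c for a, b, c in zip(js, dh, h)]
    k = max(1, math.ceil(keep_fraction * len(scores)))
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [1 if i in top else 0 for i in range(len(scores))]


# Toy next-token distributions for a 4-token response (hypothetical values).
clean = [[0.9, 0.05, 0.05], [0.6, 0.3, 0.1], [0.34, 0.33, 0.33], [0.8, 0.1, 0.1]]
noisy = [[0.9, 0.05, 0.05], [0.2, 0.3, 0.5], [0.34, 0.33, 0.33], [0.4, 0.3, 0.3]]
mask = select_tokens(clean, noisy, keep_fraction=0.5)
print(mask)  # prints: [0, 1, 0, 1]
```

In the toy example, token 2 has the highest entropy but does not react to the image perturbation, so an entropy-only rule would keep it while the coupled score drops it. In a training loop the mask would multiply the per-token policy-gradient loss so that unselected tokens contribute no gradient.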

Key Results:

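At the 7B scale, VEPO's average score is 2.28 points above the entropy-only baseline and 3.15 points above the random baseline.
At the 3B scale, VEPO's average score is 3.15 points above the entropy-only baseline.
Ablations show that multiplicative coupling of JSD and |ΔH| beats using either alone or combining them additively.
The visual-sensitivity analysis shows that 41% of high-JSD/|ΔH| tokens are missed by entropy-only selection, confirming the weakness of the entropy mechanism.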

Tech Stack:

Qwen2.5-VL-7B/3B-Instruct (base models)
GRPO (Group Relative Policy Optimization, the reinforcement learning algorithm)
Jensen-Shannon divergence (JSD, a measure of distribution difference)
Entropy and absolute entropy difference (|ΔH|)
Counterfactual forward pass (noise-perturbed image)
Min-max normalization (per response)
Multiplicative coupling

Strengths:

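Thorough diagnosis: controlled experiments and quantitative analysis clearly show why the entropy mechanism fails in visual reasoning.
Well-designed method: coupling visual sensitivity with entropy multiplicatively keeps the informativeness of entropy while covering its visual blind spot.
Extensive experiments: validated on multiple benchmarks (such as MathVista and MMMU) and two model scales, with detailed ablations and visualizations.
Manageable compute overhead: only one extra forward pass (on the noised image) is needed, and token selection reduces the volume of gradient updates, so overall efficiency is acceptable.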

Limitations:

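The noise perturbation (for example, Gaussian noise) may not be the best visual interference strategy, and different perturbations could change the results.
The method depends on image input, so it does not apply to text-only reasoning tasks and requires a model that accepts multimodal input.
Performance on larger models (such as 70B) is not explored, so generalization remains to be verified.
The visual-sensitivity measures (JSD and |ΔH|) may not capture every kind of visual dependence (such as spatial relations or color).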

Relevance To Keywords:

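Unify Models, World Models, Representation Learning, Model-Based RL: The paper focuses on reinforcement learning post-training for multimodal large models (vision-language models), touching on representation learning (token-level visual-sensitivity measures) and world models (simulating visual uncertainty through noise perturbation), and is highly related to these keywords.
Native multimodal large models; unified understanding and generation in multimodal large models: The paper uses Qwen2.5-VL as the base model and studies token selection in visual reasoning, which ties directly to multimodal understanding and generation.
Representation learning, world models: Measuring each token's dependence on the visual input through JSD and entropy difference can be viewed as a form of implicit world-model representation learning.
Reinforcement learning, post-training: The core method is GRPO-based reinforcement learning post-training, which improves visual reasoning through better credit assignment.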

4. Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation (PASS)

Score: 57.0 / 27.8

Authors: Roohan Ahmed Khan, Yasheerah Yaqoot, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

Published: 2026-06-02

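TL;DR: This paper proposes the AgenticRL framework, which uses a multimodal GPT agent to design rewards and refine policies autonomously, enabling self-refinement for vision-conditioned UAV navigation with a 91% real-world success rate.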

Abstract

Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fine tuning, which is time consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, an agent guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed loop self improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed loop refinement process improves policy behavior compared with initial rewards by 71%. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real world success rate of 91% and a sim-to-real accuracy of 94%.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	4.0/10	6.0

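Scoring rationale: The core of the paper is a multimodal GPT agent (MLLM) that unifies reward design, policy training, and critique, so MLLM (8.0) and MultiModal (9.0) score high. The framework folds several RL components into one agent, which reflects Unify Models (7.0). Visual Encoder (5.0) is used to process visual input but is not a core innovation, and Tokenizer (2.0) is implicit in GPT and not discussed specifically. World Models (3.0) and model-based RL (4.0) have low relevance because the paper mainly uses PPO (model-free reinforcement learning) rather than learning an environment dynamics model, and it does not involve a typical world-model architecture.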

Keywords

Agentic Reinforcement Learning, Vision-Conditioned UAV Navigation, Multimodal GPT Agent, Reward Design, Policy Refinement, Sim-to-Real Transfer, Self-Refining

In-Depth Analysis

Chinese Title: 面向视觉条件无人机导航的自优化智能体强化学习

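Summary: The paper proposes the AgenticRL framework to address the dependence of UAV navigation on hand-designed reward functions and repeated manual tuning. A multimodal GPT agent generates an initial reward function from natural-language instructions and images of the scene, trains a policy with PPO, produces feedback from diagnostics of the trained policy's behavior (such as collisions and landing accuracy), and iteratively refines the reward function in a closed loop. At deployment, the framework uses multimodal scene understanding to identify the current task and select the matching policy. Experiments cover several navigation tasks, including gate traversal, obstacle avoidance, landing, and trajectory following; closed-loop refinement improves policy behavior by 71%, the real-world deployment success rate reaches 91%, and sim-to-real transfer accuracy is 94%.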

Innovations:

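Proposes a multimodal closed-loop framework for reward generation and refinement that needs no hand-designed reward function
Introduces reward refinement driven by policy behavior diagnostics: diagnosis packets identify failure modes and the reward is improved automatically
Designs a multimodal scene registration mechanism that identifies the scene and selects the matching policy automatically during real deployment
Validates sim-to-real transfer on several navigation tasks using a real quadrotor UAV platform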

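Methodology: The paper uses a closed-loop self-refinement procedure. First, the multimodal GPT agent generates an initial Python reward function from the task instruction and scene images. Next, a policy network (fully connected layers 512→512→256→128 with tanh activation) is trained with PPO in a custom simulation environment. After training, behavior metrics (collisions, landing accuracy, and so on) are collected from randomized simulation episodes to form a diagnosis packet. The GPT agent analyzes the packet to produce a refinement prompt, updates the reward function using the historical context, and retrains the policy, iterating until convergence. At deployment, GPT identifies the scene from real images and language information and selects a trained policy.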
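The generate, train, diagnose, refine cycle described above can be sketched as a small control loop. Everything here is a hypothetical stand-in: the callables, the `max_rounds` cap, and the toy "collision penalty" reward are assumptions made for illustration and do not reproduce AgenticRL's prompts, PPO training, or diagnosis packets.

```python
def refine_reward_loop(propose_reward, train_policy, diagnose, refine_reward,
                       is_good_enough, max_rounds=5):
    """Run the generate -> train -> diagnose -> refine cycle until diagnostics pass."""
    reward = propose_reward()
    history = []
    policy = None
    for _ in range(max_rounds):
        policy = train_policy(reward)          # e.g. a PPO run in simulation
        packet = diagnose(policy)              # behavior metrics from test episodes
        history.append((reward, packet))
        if is_good_enough(packet):
            break
        reward = refine_reward(reward, packet, history)
    return policy, reward, history


# Toy stand-ins so the loop can run without a simulator or a language model.
def propose():
    return {"collision_penalty": 1.0}


def train(reward):
    return {"trained_with": dict(reward)}


def diag(policy):
    penalty = policy["trained_with"]["collision_penalty"]
    return {"collisions": max(0, 3 - int(penalty))}


def refine(reward, packet, history):
    return {"collision_penalty": reward["collision_penalty"] + 1.0}


def ok(packet):
    return packet["collisions"] == 0


policy, final_reward, history = refine_reward_loop(propose, train, diag, refine, ok)
print(len(history), final_reward)  # prints: 3 {'collision_penalty': 3.0}
```

Passing the full `history` to the refinement step mirrors the description that the agent updates the reward using historical context rather than only the latest packet.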

Key Results:

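Closed-loop reward refinement improves policy behavior by 71% compared with the initial rewards
The real-world deployment success rate reaches 91%
Sim-to-real transfer accuracy is 94%
Effectiveness is verified on several tasks: gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning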

Tech Stack:

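Multimodal GPT agent
Proximal Policy Optimization (PPO)
Custom UAV simulation environment (based on the paper's reference [26])
Fully connected neural network (512→512→256→128, tanh activation)
Entropy coefficient annealing (0.1→0.001)
Learning rate 1×10⁻⁵
Diagnosis packet containing metrics such as collision events, landing accuracy, and gate traversal success rate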
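The list above gives the endpoints of the entropy coefficient annealing (0.1 to 0.001) but not the shape of the schedule. As a sketch only, the function below assumes a linear schedule over training steps; the function name and the linear form are assumptions, and the paper may use a different curve.

```python
def entropy_coefficient(step, total_steps, start=0.1, end=0.001):
    """Linearly interpolate the PPO entropy bonus coefficient from start to end."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * progress


# Coefficient at the beginning, middle, and end of a hypothetical 100-step run.
schedule = [entropy_coefficient(s, 100) for s in (0, 50, 100)]
print([round(c, 4) for c in schedule])  # prints: [0.1, 0.0505, 0.001]
```

A larger coefficient early in training encourages exploration, and shrinking it later lets the policy commit to the behaviors the reward favors.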

Strengths:

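Fully automated reward design, greatly reducing manual intervention
The closed-loop self-refinement mechanism effectively improves policy performance
Multimodal scene understanding enables seamless sim-to-real transfer
Validated on a real UAV platform, giving it practical value
The framework is general and can adapt to a variety of navigation tasks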

Limitations:

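Depends on the generation quality of the GPT agent and may be limited by the model's capability
Training is computationally expensive (25M-70M simulation steps)
Generalization to other kinds of robots (such as ground robots) is not discussed
The diagnosis packet relies on task-specific metrics and may need adjusting for new tasks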

Relevance To Keywords:

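Native multimodal large models: The paper's core is a multimodal GPT agent that processes language and visual input, an application of multimodal large models
Unified understanding and generation in multimodal large models: The GPT agent both understands the task description and scene images and generates reward code and diagnostic feedback
Representation learning: The policy network learns state representations for decision making, but the paper does not examine representation learning mechanisms in depth
World models: The paper does not build an explicit world model, though the simulation environment implicitly encodes the environment dynamics
Reinforcement learning: The paper is built around PPO and falls within reinforcement learning
Post-training: Closed-loop reward refinement can be viewed as a post-training process that improves the reward function from policy behavior feedback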

5. VLESA: Vision-Language Embodied Safety Agent for Human Activity Monitoring (PASS)

Score: 55.5 / 27.8

Authors: Hanjiang Hu, Yiyuan Pan, Jiaxing Li, Xusheng Luo, Alexander Robey, Na Li, Yebin Wang, Changliu Liu

Published: 2026-06-02

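TL;DR: VLESA is a vision-language embodied safety agent that monitors human activity from egocentric video and uses a goal-conditioned Q-filter trained with GRPO to deliver real-time safety interventions, achieving higher intervention accuracy on the ASIMOV-2.0 benchmark.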

Abstract

As AI systems increasingly assist humans in physical tasks, ensuring safety becomes paramount -- physical actions carry immediate and irreversible consequences that digital errors do not. We introduce the Vision-Language Embodied Safety Agent (VLESA), a framework that monitors human activities from egocentric video and triggers real-time safety interventions when dangerous actions are predicted. VLESA addresses intent-dependent safety where identical actions can be safe or dangerous depending on context. A dataset pairing egocentric frames with goal-conditioned safety annotations is introduced, enabling a goal-conditioned safety Q-filter trained via GRPO that evaluates actions with respect to inferred intent without retraining. On top of that, an intent-action prediction agent is proposed to jointly infer goals and predict future actions from video. On the ASIMOV-2.0 benchmark, VLESA achieves higher intervention accuracy at the exact ground-truth frame compared to baselines, while the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. Code is available at https://github.com/HanjiangHu/VLESA.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	4.0/10	6.0

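Scoring rationale: The paper proposes the VLESA framework, which combines vision and language for safety monitoring and is therefore highly related to MultiModal (9.0) and MLLM (7.0). Video processing depends on the Visual Encoder (8.0), a core component. The paper involves intent prediction but no explicit generative world model (World Models 3.0). It uses the GRPO reinforcement learning technique but is not typical model-based RL (4.0), since it builds no environment dynamics model. Tokenizer is not mentioned (1.0). Although it unifies vision and language for the safety task, it does not present a general unified model architecture (Unify Models 5.0). The author list contains none of the designated experts such as Yang Shi, so no bonus applies. The weighted total is 55.5, above the dynamic passing score of 27.8.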

Keywords

Vision-Language, Embodied Safety, Egocentric Video, Safety Intervention, GRPO, Intent Prediction, Human Activity Monitoring

In-Depth Analysis

Chinese Title: VLESA：用于人类活动监测的视觉-语言具身安全智能体

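Summary: The paper proposes the VLESA framework for monitoring human activity in real time from egocentric video and triggering safety interventions. The framework addresses intent-dependent safety: the same action can be safe or dangerous under different intents. VLESA has two core components: an intent-action prediction agent (which jointly infers the task goal and predicts candidate future actions from the video stream) and a goal-conditioned safety Q-filter (trained with GRPO, which evaluates the safety of each candidate action under the inferred intent). To train the Q-filter, the authors build the EgoSafety dataset, which pairs egocentric frames with goal-conditioned safety annotations. On the ASIMOV-2.0-Video benchmark, VLESA achieves higher intervention accuracy at the exact intervention frame, and the GRPO-trained Q-filter improves action safety by over 41 percentage points through goal-conditioned constrained decoding. The code is open source.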

Innovations:

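Proposes the VLESA framework for real-time, intent-dependent safety monitoring from egocentric video, combining intent inference with action prediction.
Introduces a goal-conditioned safety Q-filter trained with GRPO that generalizes across tasks without per-task retraining.
Builds the EgoSafety dataset, generating unsafe-action data through scene graph editing and recasting safety data generation as a VQA problem.
Proposes a constrained decoding strategy that combines Q-filter scores with VLM ranking to select the best action subject to safety and to trigger alerts.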

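Methodology: VLESA follows a three-stage approach. 1) Data construction: starting from Ego4D and EASG scene graphs, a VLM edits the graphs to generate unsafe actions, producing the EgoSafety dataset. 2) Q-filter training: a VLM is fine-tuned with GRPO to take an image, a goal, and an action as input and output a safe/unsafe label, which is converted into a Q-value. 3) Intent-action prediction and constrained decoding: the goal is inferred and candidate actions are predicted from video key frames, the Q-filter evaluates their safety, and VLM ranking is used to select a safe action or trigger an alert.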
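The third stage can be sketched as a filter over ranked candidates. This is an illustration under assumptions, not VLESA's implementation: the function name, the strict `q > tau` comparison, the rule that an alert fires when the top-ranked predicted action fails the filter, and the toy Q-values are all hypothetical. Only the threshold value of 0 comes from this report.

```python
def constrained_decode(ranked_candidates, q_value, tau=0.0):
    """Filter VLM-ranked candidate actions by a safety Q-value.

    ranked_candidates: actions ordered best-first by the VLM.
    q_value: callable mapping an action to its goal-conditioned safety score.
    Returns (best safe action or None, alert flag).
    """
    # Assumption: alert when the most likely predicted action looks unsafe.
    alert = q_value(ranked_candidates[0]) <= tau
    safe = [action for action in ranked_candidates if q_value(action) > tau]
    suggestion = safe[0] if safe else None
    return suggestion, alert


# Hypothetical Q-values for the inferred goal "cook dinner".
toy_q = {
    "touch hot pan with bare hand": -0.8,
    "pick up pan with oven mitt": 0.6,
    "turn off stove": 0.9,
}
candidates = list(toy_q)  # dict order doubles as the VLM ranking here
suggestion, alert = constrained_decode(candidates, toy_q.get)
print(suggestion, alert)  # prints: pick up pan with oven mitt True
```

Because the Q-value is conditioned on the inferred goal, the same candidate action could pass the filter under one goal and fail it under another, which is the intent-dependent behavior the paper targets.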

Key Results:

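On the ASIMOV-2.0-Video benchmark, VLESA's intervention accuracy at the exact intervention frame exceeds that of frontier models such as GPT-5 and of the baseline methods.
The GRPO-trained Q-filter improves action safety by over 41 percentage points compared with a prompt-only baseline.
Goal-conditioned constrained decoding filters out unsafe actions effectively while maintaining high recall.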

Tech Stack:

GRPO (Group Relative Policy Optimization)
VLM (Vision-Language Model)
Scene graph representation
Ego4D dataset
ASIMOV-2.0 benchmark
EASG (Egocentric Action Scene Graph) annotations
Constrained decoding
Robot Constitution rule set

Strengths:

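Addresses intent-dependent safety, bringing safety assessment closer to real-world conditions.
The Q-filter, trained with GRPO, generalizes across tasks without retraining.
Builds the high-quality EgoSafety dataset, generating unsafe samples efficiently through scene graph editing.
Achieves clear gains on a real benchmark, supporting the method's effectiveness.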

Limitations:

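Depends on egocentric video quality; performance may drop in low light or under occlusion.
The Q-filter threshold τ is fixed at 0, which may not suit every scenario and may need adaptive adjustment.
The accuracy of intent inference and action prediction is bounded by the VLM's capability and may produce false alarms or misses.
Evaluation so far uses only simulation and limited real data; real deployment needs further validation.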

Relevance To Keywords:

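Native multimodal large models: VLESA uses a VLM as its core component for vision-language reasoning.
World models: The Q-filter can be viewed as an implicit world model that evaluates the future safety of actions.
Representation learning: The scene graph representation and the Q-filter learn safety representations.
Reinforcement learning: GRPO is used to train the Q-filter and is a reinforcement learning post-training method.
Post-training: A pretrained VLM is post-trained with GRPO to improve its safety discrimination.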

6. OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs (PASS)

Score: 46.5 / 27.8

Authors: Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu

Published: 2026-06-02

TL;DR: This paper introduces OVO-S-Bench, a hierarchical benchmark for evaluating streaming spatial intelligence in Multimodal LLMs, revealing that current models lag behind humans especially in allocentric mapping and streaming reasoning.

Abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points, 59.2 vs. 86.6, with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	2.0/10	3.0
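
The table's arithmetic can be reproduced in a few lines. This sketch assumes that each row's score is weight times relevance, that the paper's total is the sum of the rows, and that a total at or above the pass mark counts as PASS; the `>=` comparison is an assumption, since the report only shows the two numbers side by side.

```python
PASS_MARK = 27.8  # pass mark shown next to each score in this report


def weighted_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance_out_of_10) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


# Rows copied from the OVO-S-Bench scoring table.
ovo_s_bench = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 3.0),
    ("World Models", 1.5, 5.0),
    ("MLLM", 1.5, 9.0),
    ("MultiModal", 1.5, 9.0),
    ("model-based RL", 1.5, 2.0),
]

total = weighted_score(ovo_s_bench)
status = "PASS" if total >= PASS_MARK else "FAIL"
```

The total matches the 46.5 reported for this paper.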

Scoring rationale: The paper focuses on a benchmark for Multimodal LLMs (MLLM) in streaming contexts, making MLLM and MultiModal highly relevant. World Models has moderate relevance due to spatial intelligence and mapping tasks. Unify Models, Tokenizer, Visual Encoder, and model-based RL are peripheral as the paper evaluates models rather than proposing architectures or RL algorithms. None of the specified expert authors are listed.

Keywords

Streaming Spatial Intelligence, Multimodal LLMs, Benchmark, Allocentric Mapping, Egocentric Perception, Spatial Reasoning, Continuous Streams

In-Depth Analysis

Chinese Title: OVO-S-Bench：面向多模态大语言模型流式空间智能的分层基准

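Summary: This paper presents OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence with 1,680 questions over 348 source videos. Each question carries a query timestamp and an evidence interval, and at evaluation the model sees only the video prefix preceding the query, which simulates an online agent. Questions span four levels of abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro scores 59.2, about 27 points behind human experts (86.6), with allocentric mapping as the main bottleneck. Streaming and spatially fine-tuned models underperform their own backbones, and chain-of-thought reasoning amplifies spatial errors when it is not grounded in the stream. The benchmark exposes the limits of current models in streaming spatial understanding and provides a demanding testbed for next-generation streaming spatial MLLMs.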

Innovations:

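Proposes the first hierarchical benchmark for streaming spatial intelligence, with four levels of increasing abstraction (L1-L4) that cover spatial reasoning from current-view perception to allocentric mapping.
All questions are written and cross-validated by human annotators, and a text-only LLM probe removes questions answerable from the option text or world knowledge alone, so that answering requires the video.
Uses a strict streaming protocol: the model can access only the video prefix before the query timestamp, simulating an online agent that cannot see future frames.
Evaluates 38 models at scale, systematically comparing proprietary models, general video backbones, streaming-specific architectures, and spatially fine-tuned variants to locate bottlenecks in streaming spatial intelligence.
Finds that chain-of-thought reasoning helps on cross-frame tasks (L2) but hurts current-frame perception (L1), and amplifies spatial hallucinations when not grounded in the stream.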

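Methodology: The benchmark is built through human annotation with multi-round quality assurance: 12 trained annotators, each also serving as a blind cross-reviewer, spent roughly 804 person-hours. Videos come from 9 public sources (indoor tours, egocentric activities, outdoor scenes, driving videos, 3D environments, and others). Each question includes a query timestamp and an evidence interval, and the model receives only the prefix video at evaluation. A hierarchical taxonomy sorts questions into four levels. The evaluation covers 38 models, including proprietary MLLMs, general video backbones, streaming-specific architectures, and spatially fine-tuned variants, alongside random, text-only, and human-expert baselines. A chain-of-thought comparison analyzes the effect of the reasoning strategy.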
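
The prefix-only protocol amounts to truncating the frame list at the query timestamp. The sketch below is illustrative only: the `StreamingQuestion` fields mirror the query timestamp and evidence interval described above, but the class name, the field names, and the choice to include a frame that falls exactly on the query time are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StreamingQuestion:
    question: str
    query_time: float              # seconds into the video
    evidence: Tuple[float, float]  # (start, end) of the evidence interval, in seconds


def visible_prefix(frame_times: List[float], q: StreamingQuestion) -> List[float]:
    """Keep only the frames the model is allowed to see: those up to the query time."""
    return [t for t in frame_times if t <= q.query_time]


def evidence_is_in_prefix(q: StreamingQuestion) -> bool:
    """Sanity check: the evidence interval must end by the query time."""
    start, end = q.evidence
    return start <= end <= q.query_time


frames = [float(t) for t in range(10)]  # one frame per second, 0..9
q = StreamingQuestion("Which room is behind you?", query_time=4.5, evidence=(1.0, 3.0))
prefix = visible_prefix(frames, q)
```

Here the evidence interval ends before the query, so the answer depends on frames the model saw earlier rather than on the current view.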

Key Results:

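Gemini-3.1-Pro scores 59.2 overall against 86.6 for human experts, a gap of about 27 points.
Allocentric mapping (L4) is the main bottleneck: 28 of 34 systems score lowest on L4.
13 of 15 streaming and spatially fine-tuned methods score below their base backbones, with the largest degradation on cross-frame tasks.
Chain-of-thought reasoning improves L2 (spatiotemporal context tracking) by 3.9 points on average but lowers L1 (instantaneous perception) by 1.0 point on average, and amplifies spatial hallucinations when not grounded in the stream.
All models do relatively well on L1 (current-view perception), but performance drops sharply at higher levels.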

Tech Stack:

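Video sources: RoomTour3D, Ego4D, Sekai, OmniWorld, YouTube walking tours, CODa, Honda HDD, ARKitScenes, VSI-Bench
Evaluated models: 38 proprietary and open-source MLLMs (Gemini-3.1-Pro, general video backbones, streaming-specific architectures, spatially fine-tuned variants)
Annotation pipeline: 12 annotators, about 804 person-hours, multi-round blind cross-review
Text-only LLM probe: removes questions answerable from text alone
Chain-of-thought comparison experiments
Hierarchical taxonomy: L1 instantaneous egocentric perception, L2 spatiotemporal context tracking, L3 spatial simulation and reasoning, L4 allocentric mapping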

Strengths:

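Rigorous benchmark design: full human annotation, multi-round quality assurance, and text-leakage filtering support the validity of the evaluation.
The streaming protocol realistically simulates an online agent, which gives it practical value.
The hierarchical taxonomy is systematic and covers the range from simple perception to complex spatial reasoning.
The large-scale evaluation (38 models) gives a broad comparison and surfaces notable findings, such as fine-tuning causing regressions and the mixed effect of chain-of-thought.
The dataset and code are described as publicly available, which supports reproducibility.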

Limitations:

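Videos come mainly from public datasets and may not cover extreme real-world scenarios.
The evaluation only restricts input to the video prefix; it does not test how models actively manage long-term memory (for example, memory compression or forgetting).
The human-expert baseline comes only from the annotators and may not represent the best human performance.
There is no in-depth analysis of why models fail on L4 (allocentric mapping), for example missing 3D structural priors or insufficient memory capacity.
Inference efficiency in the streaming setting (compute cost, latency) is not examined.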

Relevance To Keywords:

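Unify Models: the paper evaluates many MLLMs but does not address model unification or fusion; low relevance.
World Models: the L4 allocentric mapping task requires something like an internal world model, but the paper evaluates models rather than proposing a world-model method; moderately relevant.
Representation Learning: not studied directly, but streaming spatial understanding depends on effective spatial representations; indirectly relevant.
Model-Based RL: the paper involves no reinforcement learning or model-based RL; low relevance.
Native multimodal large models: the paper evaluates several native multimodal models (such as Gemini) without discussing their architecture; moderately relevant.
Unified understanding and generation in multimodal models: the paper focuses on understanding (spatial question answering) and does not cover generation; low relevance.
Reinforcement learning: not relevant.
Post-training: the paper evaluates fine-tuned (post-trained) models and finds that fine-tuning can degrade performance; relevant.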

7. Video-Mirai: Autoregressive Video Diffusion Models Need Foresight [PASS]

Score: 45.0 / 27.8

Authors: Yonghao Yu, Lang Huang, Runyi Li, Zerun Wang, Toshihiko Yamasaki

Published: 2026-06-02

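TL;DR: This paper proposes Video-Mirai, a training-only method in which a frozen foresight encoder uses future frames to supervise current representations, closing the representation-level planning gap in causal video generation and improving video consistency and quality.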

摘要翻译

因果视频生成器必须基于过去进行预测，但它们的学习不必仅基于过去。在流式自回归视频扩散中，每个生成的片段都会成为未来片段必须保留的承诺。然而，标准训练仅要求每个因果状态解释当前时刻。这造成了我们所说的“表示级规划差距”（representation-level planning gap）：适应当前片段的状态可能会丢弃维持未来一致性所需的身份、布局和运动信息。我们提出 Video-Mirai，这是一种仅涉及训练的方法，它在不改变因果推理的前提下填补了这一差距：生成器因果地进行展开，一个冻结的前瞻编码器（foresight encoder）非因果地读取完整的展开序列，而一个轻量级预测器（predictor）将由此产生的停止梯度目标蒸馏至因果状态中。未来帧用于监督表示，而非生成器的输入。在推理阶段，该编码器和预测器被丢弃，从而保持原始架构、每步浮点运算次数（FLOPs）以及键值缓存（KV-cache）行为不变。Video-Mirai 在 5 秒 VBench 基准测试上，将一个强大的因果强制（Causal-Forcing）基线的总分从 83.8 提升至 84.6。在超出训练范围的 30 秒展开中，主体一致性从 84.9 提升至 88.5，背景一致性从 90.2 提升至 91.9。消融实验表明，未来条件目标是关键成分，而探测实验显示，未来帧从当前特征中变得更加可解码。因果性应当约束推理过程，而非表示监督。我们的研究强调，视觉自回归模型需要具备前瞻性。项目页面：https://y0uroy.github.io/Video-Mirai.

Abstract

Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present. This creates what we call a representation-level planning gap: states that fit the current segment may discard identity, layout, and motion information needed for a consistent future. We introduce Video-Mirai, a training-only method that closes this gap without changing causal inference: the generator rolls out causally, a frozen foresight encoder reads the completed rollout non-causally, and a lightweight predictor distills the resulting stopped-gradient targets into causal states. Future frames supervise representations, never generator inputs. At inference, the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. Video-Mirai improves a strong Causal-Forcing baseline on 5-second VBench from 83.8 to 84.6 in terms of Total Score. On 30-second rollouts beyond the training horizon, subject consistency improves from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. Causality should constrain inference, not representation supervision. Our study highlights that visual autoregressive models need foresight. Project page: https://y0uroy.github.io/Video-Mirai.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	9.0/10	13.5
World Models	1.5	8.0/10	12.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	1.0/10	1.5

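Scoring rationale: The paper centers on foresight-oriented representation learning for video generation models. 'Visual Encoder' and 'World Models' are highly relevant because the method involves an encoder and future-state prediction. 'MultiModal' is moderately relevant since video is treated as multimodal data. 'MLLM', 'model-based RL', and 'Tokenizer' are only weakly related to the video diffusion task. 'Unify Models' has middling relevance: the paper unifies training objectives rather than model architectures.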

Keywords

Video Diffusion, Autoregressive, Foresight, Representation Learning, Consistency, Causal Generation, Visual Encoder, Training Objective

In-Depth Analysis

Chinese Title: Video-Mirai：自回归视频扩散模型需要远见

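Summary: This paper proposes Video-Mirai, a training-only method that targets the "representation-level planning gap" in autoregressive video diffusion models. In causal video generation, each emitted segment becomes a commitment that later segments must preserve, yet standard training only asks each causal state to explain the current segment, so states may discard identity, layout, and motion information. During training, Video-Mirai lets the generator roll out the full video causally, has a frozen foresight encoder read the completed rollout non-causally, and uses a lightweight predictor with a cosine loss to distill the resulting foresight targets into the causal hidden states. Future frames only supervise representations and are never fed to the generator. At inference the encoder and predictor are discarded, leaving the original architecture, per-step FLOPs, and KV-cache behavior unchanged. On 5-second VBench the Total Score rises from 83.8 to 84.6; on 30-second rollouts beyond the training horizon, subject consistency rises from 84.9 to 88.5 and background consistency from 90.2 to 91.9. Ablations identify future-conditioned targets as the key ingredient, and probes show that future frames become more decodable from current features. The paper argues that causality should constrain inference, not representation supervision.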

Innovations:

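Formalizes foresight prediction as a representation-level objective for causal video generation, using future-aware targets during training while keeping inference strictly causal.
Proposes the Video-Mirai training method, in which a frozen foresight encoder and a lightweight predictor distill future information into the current causal hidden states, with no added inference cost.
Improves both frame-level and chunk-level autoregressive video generation, and extends to 30-second rollouts beyond the training horizon with clearly better subject and background consistency.
Uses ablations and probing to identify key factors (foresight source, prediction layer, predictor design, look-ahead window) and to verify that future content is more decodable from the frozen features.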

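Methodology: The base training pipeline is Causal-Forcing, which has three stages: autoregressive teacher fine-tuning, causal ODE distillation, and asymmetric DMD. Video-Mirai adds one training objective on top. The generator first rolls out a full video causally (for example, three segments). A frozen foresight encoder (for example, a pretrained bidirectional video diffusion model) then processes the full video and produces hidden states that incorporate future information. A lightweight predictor (an MLP) maps the causal generator's current hidden states into the foresight encoder's feature space, and a cosine loss aligns the two. Future segments are used only to build stopped-gradient supervision and are never fed to the generator. After training, the encoder and predictor are discarded, and the generator stays strictly causal at inference with the same attention pattern and KV cache.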
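
The alignment objective can be illustrated with plain Python lists. The report says only that a "cosine loss" aligns predictor outputs with foresight features, so the specific form 1 − cos(pred, target), the averaging over positions, and the names `cosine_loss` and `foresight_loss` are assumptions. In a real training loop the targets would be wrapped in a stop-gradient; here they are simply treated as constants.

```python
import math
from typing import Callable, List, Sequence


def cosine_loss(pred: Sequence[float], target: Sequence[float], eps: float = 1e-8) -> float:
    """1 - cosine similarity between a predicted feature and a fixed target feature."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm + eps)


def foresight_loss(
    causal_states: List[Sequence[float]],
    foresight_targets: List[Sequence[float]],
    predictor: Callable[[Sequence[float]], Sequence[float]],
) -> float:
    """Mean cosine loss between predictor(causal state) and the frozen encoder's target."""
    losses = [cosine_loss(predictor(h), z) for h, z in zip(causal_states, foresight_targets)]
    return sum(losses) / len(losses)


# Toy check with an identity predictor (hypothetical data).
states = [[1.0, 0.0], [0.0, 1.0]]
targets = [[2.0, 0.0], [1.0, 0.0]]  # first aligned with its state, second orthogonal
loss = foresight_loss(states, targets, predictor=lambda h: h)
```

Because the loss depends only on direction, a target that is a scaled copy of the prediction contributes roughly zero, while an orthogonal one contributes roughly one.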

Key Results:

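On 5-second VBench, Video-Mirai raises the Causal-Forcing baseline's Total Score from 83.82 to 84.62.
On 30-second rollouts beyond the training horizon, subject consistency rises from 84.93 to 88.47 and background consistency from 90.22 to 91.94.
Ablations show that future-conditioned targets, rather than current-frame targets, are the key ingredient; a look-ahead window of one segment works best.
Probing shows that RGB reconstructions of future frames decoded from frozen Video-Mirai features are markedly better than those from the baseline, indicating that foresight has been internalized.
Qualitative results (Figure 2) show that Video-Mirai reduces abrupt changes in identity, layout, and motion between segments.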

Tech Stack:

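Autoregressive video diffusion model (Causal-Forcing pipeline)
Bidirectional video diffusion model (as the foresight encoder)
Distribution Matching Distillation (DMD)
ODE distillation initialization
Cosine loss
Lightweight MLP predictor
KV cache
VBench evaluation metrics
Probing experiments (MLP readout reconstructing future RGB)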

Strengths:

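Simple: it changes only the training objective and leaves the inference architecture, FLOPs, and KV-cache behavior untouched, so it is easy to integrate into existing autoregressive video diffusion pipelines.
Markedly improves consistency in long video generation, especially beyond the training horizon, which suggests that internalized foresight mitigates the planning gap.
Ablations and probes analyze the key design choices in depth and support the value of future-conditioned targets.
Orthogonal to existing methods such as Self-Forcing and Causal-Forcing, so it can be stacked on top of them.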

Limitations:

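Depends on a frozen foresight encoder (a bidirectional video model), which requires an extra pretrained model and additional compute.
Training needs a full causal rollout of the video, which may increase training time.
Applies only to autoregressive video diffusion models, not to full-window bidirectional generators.
Validated only on one baseline (Causal-Forcing); generality needs further study.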

Relevance To Keywords:

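Unify Models: the paper touches on unifying video generation and understanding (the foresight encoder is a bidirectional model), but the focus is generation.
World Models: directly relevant; autoregressive video generation can be seen as a form of world model, and Video-Mirai's foresight improves its temporal consistency.
Representation Learning: directly relevant; the core issue is the representation-level planning gap, addressed by distilling foresight information into the causal hidden states.
Model-Based RL: video generation can serve as environment simulation in model-based RL, and better long-horizon consistency helps world models used there.
Native multimodal large models: video diffusion models are loosely related, but the paper does not involve text or other additional input modalities.
Unified understanding and generation in multimodal models: the foresight encoder can be seen as an understanding module and the generator as a generation module, but the paper does not emphasize multimodality.
Reinforcement learning: indirectly relevant; the improved world model could be used for RL training.
Post-training: Video-Mirai is a training method whose extra objective can be viewed as part of a post-training stage (after distillation).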

8. Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking [PASS]

Score: 33.0 / 27.8

Authors: Zekun Qi, Xuchuan Chen, Dairu Liu, Chenghuai Lin, Yunrui Lian, Sikai Liang, Zhikai Zhang, Yu Guan, Jilong Wang, Wenyao Zhang, Xinqiang Yu, He Wang, Li Yi

Published: 2026-06-02

TL;DR: Humanoid-GPT leverages a GPT-style Transformer trained on unified motion corpora to achieve robust zero-shot generalization for whole-body motion tracking.

摘要翻译

我们介绍了 Humanoid-GPT，这是一种 GPT 风格的 Transformer，具备因果注意力机制，并在十亿级运动语料库上训练，用于全身控制。与受限于数据稀缺及敏捷性 - 泛化性权衡的先前浅层 MLP 跟踪器不同，Humanoid-GPT 在一个包含 20 亿帧的重定向语料库上进行预训练，该语料库统一了所有主要动作捕捉（mocap）数据集与大规模内部录制数据。同时扩展数据和模型容量，得到了一个单一的生成式 Transformer，它能够跟踪高度动态的行为，同时实现对未见运动和控制任务的前所未有的零样本（zero-shot）泛化。广泛的实验和扩展性分析表明，我们的模型建立了新的性能前沿，在跟踪高度动态且复杂运动的同时，对未见任务展现出鲁棒的零样本（zero-shot）泛化能力。

Abstract

We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	4.0/10	6.0

Scoring rationale: The paper proposes Humanoid-GPT, a GPT-style Transformer for motion tracking. It unifies motion datasets and uses a single model (Unify Models: 5), and models temporal dynamics relevant to control (World Models: 4, model-based RL: 4). However, it lacks visual components (Visual Encoder: 0) and does not explicitly handle language-vision fusion (MLLM: 3, MultiModal: 3). Tokenization is implicit in the GPT architecture (Tokenizer: 3).

Keywords

Humanoid-GPT, GPT-style Transformer, Zero-Shot Generalization, Motion Tracking, Whole-Body Control, Mocap Dataset, Scaling Data, Causal Attention

In-Depth Analysis

Chinese Title: Humanoid-GPT：扩展数据与结构实现零样本运动跟踪

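Summary: This paper presents Humanoid-GPT, a whole-body motion tracker for humanoid robots built on a GPT-style Transformer with causal attention. Existing shallow MLP trackers are limited by scarce data and an agility-generalization trade-off. Humanoid-GPT is instead pre-trained on a 2-billion-frame retargeted corpus that combines all major mocap datasets with large-scale in-house recordings. Scaling both data and model capacity yields a generative Transformer that tracks highly dynamic behaviors and achieves unprecedented zero-shot generalization to unseen motions and tasks. The paper also introduces the Harmonic Motion Embedding (HME) for diversity-aware balanced sampling and derives scaling laws for humanoid motion tracking. Experiments show that Humanoid-GPT sets a new performance frontier in agility and zero-shot generalization, tracking complex dynamic motions in real time and completing tasks such as dancing and jumping zero-shot.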

Innovations:

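Builds a 2-billion-frame retargeted motion corpus, more than 200 times larger than the training sets of prior trackers, and gives the first systematic evidence that video-estimated motion can improve tracking performance.
Uses a GPT-style causal-attention Transformer as the tracker, which naturally matches the causal constraint of online tracking and keeps improving as data and model size scale.
Proposes the Harmonic Motion Embedding (HME), which uses a periodic autoencoder to extract joint-level harmonic features and clusters them, enabling diversity-aware, distribution-balanced sampling so that long-tail modes are not drowned out.
Designs a two-stage training pipeline: first train several hundred PPO motion experts covering different motion clusters, then distill them with parallel DAgger into a single generalist Transformer policy that generalizes zero-shot.
Gives the first systematic characterization of scaling laws for humanoid motion tracking, showing that data scale, model capacity, and diversity balance jointly determine zero-shot agile tracking performance.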

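Methodology: The pipeline has three stages. (1) Data curation and processing: aggregate datasets including AMASS, LAFAN1, MotionMillion, and PHUMA, filter out object-interaction sequences, apply time-warping augmentation, and retarget everything to the 29-DoF joint space of the Unitree-G1 humanoid. (2) Training motion experts: cluster the motion sequences with the Harmonic Motion Embedding (HME) into roughly 300 clusters, then train an independent PPO policy on each cluster, driven by keypoint-level rewards (position, velocity, pose). (3) Distilling a generalist policy: distill all expert policies, via supervised parallel DAgger, into one GPT-style causal Transformer that takes reference joints and proprioceptive observations as input and outputs PD-controller targets. Training uses a diversity-aware balanced sampling strategy.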
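
One simple way to balance sampling across clusters is to draw the cluster first and the sequence second. The report does not specify the paper's sampling rule, so the scheme below (uniform over clusters, then uniform within the cluster) is an assumed stand-in, and the cluster contents are made-up names.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def balanced_sample(
    clusters: Dict[int, List[str]], n: int, rng: random.Random
) -> List[Tuple[int, str]]:
    """Draw n (cluster_id, sequence) pairs: uniform over clusters, then within a cluster."""
    cluster_ids = sorted(clusters)
    samples = []
    for _ in range(n):
        cid = rng.choice(cluster_ids)
        samples.append((cid, rng.choice(clusters[cid])))
    return samples


# Hypothetical long-tailed corpus: one huge cluster and two small ones.
clusters = {
    0: [f"walk_{i}" for i in range(1000)],
    1: [f"backflip_{i}" for i in range(10)],
    2: [f"cartwheel_{i}" for i in range(10)],
}
samples = balanced_sample(clusters, 3000, random.Random(0))
counts = Counter(cid for cid, _ in samples)
```

Sampling sequences uniformly from the pooled corpus would draw the two rare clusters about 2% of the time combined; drawing the cluster first gives each of them about a third of the batch.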

Key Results:

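Humanoid-GPT clearly outperforms existing methods (such as SONIC, UniTracker, and ASAP) in agility and zero-shot generalization, and can track unseen dynamic motions such as dancing, jumping, bending, and kung fu zero-shot.
Scaling experiments show that performance keeps improving as data grows from the million-frame scale to 2 billion frames, and that the Transformer outperforms an MLP of comparable size, whose performance saturates as data grows.
HME-based balanced sampling gives better zero-shot generalization than random sampling or diversity-only sampling, indicating that diversity and balance are both necessary.
On a real humanoid robot (Unitree-G1), the system achieves real-time teleoperation and zero-shot dance generation without any fine-tuning.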

Tech Stack:

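GPT-style causal-attention Transformer
PPO (Proximal Policy Optimization)
DAgger (Dataset Aggregation)
PD controller (proportional-derivative controller)
K-Means clustering
Periodic Autoencoder
Harmonic Motion Embedding (HME)
Time-warping augmentation
Off-the-shelf motion retargeting framework
Keypoint-level reward function (position, velocity, pose)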

Strengths:

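Unprecedented data scale (2 billion frames) provides a solid basis for generalization.
The causal Transformer naturally fits online tracking and scales well.
The HME embedding quantifies motion diversity and enables balanced sampling, mitigating the long-tail problem.
Two-stage training (experts plus distillation) combines the physical realism of reinforcement learning with the sequence-modeling capacity of the Transformer.
The systematic scaling-law analysis offers guidance for future work.
Zero-shot real-time tracking is validated on real hardware.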

Limitations:

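Relies on sim-to-real transfer, so real hardware deployment may face a sim-to-real gap.
Currently targets only the Unitree-G1 humanoid; transferring to other platforms requires retraining or fine-tuning.
Data curation filters out object-interaction sequences (such as sitting on a chair or swimming), which limits scenarios involving object manipulation.
Training is computationally expensive (2 billion frames, hundreds of experts).
The paper does not discuss failure cases or safety boundaries in detail, for example stability under extremely dynamic motions.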

Relevance To Keywords:

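Representation learning: HME extracts compact representations from raw motion sequences, which falls under representation learning.
World models: motion tracking can be viewed as part of learning a dynamics model of humanoid motion, but the paper does not explicitly build a world model.
Reinforcement learning: PPO, used to train the motion experts, is one of the core methods.
Post-training: distilling multiple expert policies into a single Transformer is a form of post-training or knowledge distillation.
Native multimodal large models: the paper handles only the motion modality, with no vision or language, so relevance is weak.
Unified understanding and generation in multimodal models: not directly relevant.
Model-based RL: PPO is a model-free RL algorithm and DAgger is an imitation-learning method, so the link to model-based RL is indirect, consistent with the 4.0/10 relevance in the scoring table.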

9. SEAOTTER: Sensor Embedded Autoencoding with One-Time Transcode for Efficient Reconstruction [PASS]

Score: 28.5 / 27.8

Authors: Dan Jacobellis, Neeraja J. Yadwadkar

Published: 2026-06-02

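TL;DR: SEAOTTER proposes a sensor-embedded autoencoder framework with a one-time transcode for efficient reconstruction, keeping JPEG compatibility while substantially improving encoding speed and visual perception accuracy.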

摘要翻译

在机器人系统中，利用低成本、低功耗硬件可轻松捕获大量高分辨率视觉数据。然而，当通过 JPEG/MPEG 等传统编解码器传输时，有限的带宽和设备端计算资源阻碍了充分利用。较新的编解码器（如 AV1/AVIF）虽改善了率失真权衡，但编码所需资源远高于前者，若无专用集成电路（ASIC），则不切实际。近期的非对称自编码器在极端功耗和带宽约束下能提供高质量，却增加了不可承受的解码成本，并使用定制格式，忽略了围绕 JPEG 等标准构建的数十年基础设施。为了解决这些局限性，我们提出了一种面向云机器人的压缩框架，该框架基于传感器嵌入式自编码器与单次高效重构转码（SEAOTTER）。由于传感器、云和消费者阶段面临截然不同的功耗和带宽预算，SEAOTTER 结合了学习到的潜在表示的紧凑性与标准 JPEG 文件的广泛可用性。由于朴素转码会降低性能，我们提出了一种可学习的 JPEG 颜色与量化变换，该变换能够提升全局、密集及基于视觉 - 语言的感知准确率。利用 SEAOTTER，我们为预训练且冻结的编码器训练了通用型与任务感知型两种转码流水线。在 200:1 的压缩比下，与 AVIF 相比，我们观察到编码速度提升 7 倍，解码速度提升 3.5 倍，且 ImageNet top-1 准确率提升 8%，同时保持与 JPEG 基础设施的兼容性。我们的代码可在 https://github.com/UT-SysML/seaotter 获取。

Abstract

In robotics systems, vast amounts of visual data are easily captured at high resolution using low-cost, low-power hardware. Yet, limited bandwidth and on-device compute resources prevent full utilization when transmitted via conventional codecs like JPEG/MPEG. Newer codecs, like AV1/AVIF, improve the rate-distortion trade-off, but demand far more resources for encoding, impractical without custom ASICs. Recent asymmetric autoencoders deliver high quality under extreme power and bandwidth constraints, but add prohibitive decoding cost and use bespoke formats that ignore decades of infrastructure built around standards like JPEG. To address these limitations, we introduce a compression framework for cloud robotics based on a Sensor Embedded Autoencoder paired with a One-Time Transcode for Efficient Reconstruction (SEAOTTER). Because the sensor, cloud, and consumer stages face very different power and bandwidth budgets, SEAOTTER combines the compactness of a learned latent with the broad usability of a standard JPEG file. Since naive transcoding degrades performance, we propose a learnable JPEG color and quantization transform that enables increased accuracy for global, dense, and vision-language-based perception. Using SEAOTTER, we train both general-purpose and task-aware transcoding pipelines for a pre-trained, frozen encoder. At a compression ratio of 200:1 and compared to AVIF, we observe 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. Our code is available at https://github.com/UT-SysML/seaotter .

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	1.0/10	1.5

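Scoring rationale: The paper's core contribution is a compression framework for visual data in cloud robotics (SEAOTTER), which has little to do with the core MLLM/RL keywords. Visual Encoder is highly relevant (the pipeline uses a pre-trained, frozen encoder). Tokenizer and MultiModal are moderately relevant (latent quantization and the mention of vision-language perception). Unify Models relates to unifying the sensor and cloud pipeline. World Models and model-based RL have very low relevance. The author list contains none of the specified experts, so no bonus is applied.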

Keywords

Sensor Embedded Autoencoding, One-Time Transcode, Efficient Reconstruction, Cloud Robotics, JPEG Compatibility, Visual Encoder, Rate-Distortion Trade-off

In-Depth Analysis

Chinese Title: SEAOTTER：基于传感器嵌入式自编码器与一次性转码的高效重建方法

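Summary: This paper addresses three constraints in cloud robotics: limited resources at the sensor, limited bandwidth, and high decoding cost at the consumer. It proposes a compression framework called SEAOTTER, which combines a sensor-embedded autoencoder (EE-AAE) with a one-time transcode. A lightweight FRAPPE encoder compresses efficiently at the sensor, and the cloud transcodes the latent representation into a standard JPEG file using a learnable JPEG color transform and quantization matrix, balancing compression efficiency and compatibility. At a 200:1 compression ratio and compared with AVIF, SEAOTTER achieves 7 times faster encoding, 3.5 times faster decoding, and +8% ImageNet top-1 accuracy, while retaining compatibility with JPEG infrastructure. The method applies to global, dense, and vision-language perception tasks.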

Innovations:

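Proposes a three-stage asymmetric compression architecture (sensor / cloud / consumer) that adapts to different resource budgets and exploits the encode-once, decode-many lifecycle.
Introduces an end-to-end learnable JPEG color transform and quantization matrix in place of the conventional fixed ones, improving downstream task accuracy.
Uses a one-time transcode to turn the autoencoder's latent representation into a standard JPEG file, keeping compression efficiency while staying compatible with the existing JPEG ecosystem.
Freezes the sensor-side encoder and fine-tunes only the cloud-side decoder and JPEG parameters, which allows task adaptation and serving multiple tasks at once.
Designs a lightweight invertible color transform (3×3 convolution + softsign companding + affine transform) that adds only 81 MACs per pixel to consumer-side decoding.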

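Methodology: SEAOTTER uses a three-stage pipeline. At the sensor, a frozen FRAPPE encoder (10-100 MACs per pixel) compresses the RGB image into an int8 latent representation, which is transmitted losslessly. In the cloud, a fine-tuned FRAPPE decoder reconstructs an intermediate image, which is then transcoded through a learnable JPEG color transform (F) and quantization matrix (Q) into a standard JPEG file. The consumer runs standard JPEG decoding and then applies the inverse color transform (F⁻¹) to recover the image. During training, uniform noise simulates quantization, and the decoder, color transform, and quantization matrix are optimized jointly with a loss that combines a rate-distortion term and a downstream task loss.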
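
The uniform-noise trick replaces rounding, whose gradient is zero almost everywhere, with additive noise of the same width during training. The sketch below shows the two modes for a single coefficient and a single quantization step; the function names and scalar setup are illustrative assumptions, and a real pipeline would apply this per entry of the learnable quantization matrix.

```python
import random


def quantize_train(coeff: float, step: float, rng: random.Random) -> float:
    """Training-time surrogate: add U[-1/2, 1/2] noise (in units of step) instead of rounding."""
    return (coeff / step + rng.uniform(-0.5, 0.5)) * step


def quantize_infer(coeff: float, step: float) -> float:
    """Inference-time quantization: snap to the nearest multiple of step."""
    return round(coeff / step) * step


rng = random.Random(0)
noisy = quantize_train(10.3, 2.0, rng)  # stays within half a step of 10.3
hard = quantize_infer(10.3, 2.0)        # snaps to a multiple of 2.0
```

Both modes perturb the value by at most half a step, which is why the noisy version is a reasonable training proxy for the rounding used when the JPEG file is actually written.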

Key Results:

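At a 200:1 compression ratio, encoding is 7 times faster and decoding 3.5 times faster than AVIF.
ImageNet top-1 accuracy is 8% higher than AVIF.
SEAOTTER remains applicable across bandwidth scenarios such as 1080p/30 over Wi-Fi, 720p/30 over 5G, and 480p/30 over BLE.
After the one-time transcode, downstream task accuracy exceeds that of the original autoencoder.
Supports variable rate (n∈{3,6,9,12,15}) and progressive coding to adapt to changing bandwidth.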

Tech Stack:

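FRAPPE encoder/decoder (lightweight autoencoder, 10-100 MACs per pixel)
JPEG-LS lossless entropy coding
Learnable 3×3 convolutional color transform (ConvW)
Softsign companding function
Learnable quantization matrix (3×8×8)
Uniform-noise quantization simulation (U[-1/2, 1/2])
End-to-end training (rate-distortion loss + downstream task loss)
JPEG standard (4:4:4 subsampling, YCbCr conversion skipped)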

Strengths:

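Combines the high compression efficiency of autoencoders with the broad compatibility of JPEG, addressing practical deployment problems of conventional methods in cloud robotics.
Very low sensor-side compute (10-100 MACs per pixel), suitable for battery-powered devices.
The one-time transcode substantially lowers consumer-side decoding cost, which suits read-many workloads such as training.
The learnable color transform and quantization matrix can be optimized for a specific sensor and downstream task, improving perception accuracy.
Supports variable rate and progressive coding to cope with network fluctuation.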

Limitations:

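The cloud-side FRAPPE decoder is fairly large (about 57M parameters), which may add to the cloud compute load.
The learnable color transform and quantization matrix need retraining per sensor or task, and their generality remains to be verified.
Experiments rely mainly on ImageNet classification; results on more complex tasks such as vision-language models (VLMs) are not shown in depth.
JPEG compatibility depends on skipping the YCbCr conversion and using 4:4:4 subsampling, which some older decoders may not support.
There is no direct comparison with recent learned compression methods (such as hyperprior models); the comparison is against AVIF only.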

Relevance To Keywords:

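Representation learning: highly relevant; the autoencoder learns a compact latent representation, and the learnable color transform further optimizes it.
World models: weak relevance; the paper targets cloud robotics and touches on perception and compression but does not build a world model.
Model compression: low relevance; the paper compresses image data rather than model parameters.
Multimodal large models: moderate relevance; the method is evaluated on vision-language tasks but does not itself perform multimodal fusion.
Reinforcement learning: low relevance; the paper does not involve RL.
Post-training: some relevance; fine-tuning the decoder and JPEG parameters is a post-training step.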

10. NewtPhys: Do Foundation Models Understand Newtonian Physics? [FAIL]

Score: 25.5 / 27.8

Authors: Sebastian Cavada, Soumava Paul, Tuan-Hung Vu, Andrei Bursuc, Raoul de Charette

Published: 2026-06-02

TL;DR: NewtPhys introduces a 4D physically annotated dataset to evaluate Newtonian physics reasoning in Vision-Language Models, revealing limitations in current low-level physics understanding.

摘要翻译

以往研究使用合成或半合成场景及视觉问答任务评估了基础模型中的物理推理能力。然而，这些基准侧重于高层事件，缺乏评估真正底层牛顿力学理解所需的视觉保真度。我们引入了 NewtPhys，这是一个基于真实场景的多视图图像构建的、基于物理模拟的 4D 物理标注数据集。该数据集在时间步上提供了密集、细粒度的标注——包括三维力和涵盖物理、跟踪、语义及几何的模态外像素级量——从而弥合了简化的合成设置与真实视觉复杂性之间的差距。利用 NewtPhys，我们系统性地评估了 56 个 VLMs（视觉语言模型，包括 54 个开源模型和 2 个闭源前沿模型）以及 10 个 VFMs（视觉基础模型），并揭示了它们在底层物理推理方面的局限性。除基准测试外，我们的数据集还支持未来基于物理的视觉研究以及下一代物理感知评估的开发。代码和数据集可在 https://astra-vision.github.io/NewtPhys 获取。

Abstract

Previous work has evaluated physics reasoning in foundation models using synthetic or semi-synthetic scenes and visual question-answering tasks. However, these benchmarks emphasize high-level events and lack the visual fidelity required to assess true low-level Newtonian understanding. We introduce NewtPhys, a 4D physically annotated dataset built from multiview images of real-world scenes with physics-grounded simulations. The dataset provides dense, fine-grained annotations across timesteps -- including 3D forces and amodal per-pixel quantities covering physics, tracking, semantics and geometry -- bridging the gap between simplistic synthetic setups and realistic visual complexity. Using NewtPhys, we systematically evaluate 56 VLMs, including 54 open-weight models and 2 closed-source frontier models, and 10 VFMs and reveal limitations in low-level physics reasoning. Beyond benchmarking, our dataset enables future research in physics-grounded vision and the development of next-generation physics-aware evaluations. Code and datasets are available at https://astra-vision.github.io/NewtPhys.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0

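Scoring rationale: The paper's main contribution is constructing and evaluating a physics-reasoning benchmark dataset (NewtPhys). It does not address model unification, tokenizer design, visual encoder architecture, world-model construction, or reinforcement learning algorithms. The evaluated systems are VLMs (related to MLLM) and the data is multimodal, but the core contribution is data rather than model technique, so the relevant keywords score low.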

Keywords

NewtPhys, Physics Reasoning, Vision-Language Models, 4D Dataset, Newtonian Physics, Physical Grounding, Benchmark Evaluation

11. GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images [FAIL]

Score: 25.5 / 27.8

Authors: Jiahao Sun, Dingkun Wei, Zehong Shen, Hongyu Zhou, Yujun Shen, Liang Li

Published: 2026-06-02

TL;DR: GARDEN proposes a gravity-aligned framework that disentangles 3D environments from RGB images, enabling direct physics simulation while preserving visual realism.

摘要翻译

将多视角 RGB 观测转换为可直接用于仿真的 3D 环境仍然具有挑战性，因为当前的重建流水线会产生缺乏显式物理结构的整体式场景表示。它们通常仅存在任意全局旋转的不确定性，并将刚体前景对象与背景几何纠缠在一起，这阻碍了稳定的物理交互。现有的解决方案通常通过用检索到的 CAD 资产替换重建对象来恢复交互性，但这引入了缓慢的检索 - 替换阶段，并削弱了场景特定的几何保真度。我们提出 GARDEN，一个仅 RGB 框架，它将重建重新表述为基于物理的场景分解，并输出一种结构化混合场景表示。关键思想是利用重力作为通用物理先验：我们首先将重建对齐到统一的重力 - 视图坐标系（Gravity-View frame）以解决规范歧义，然后恢复具有准确 6-DoF 放置的以对象为中心的刚体网格，最后通过条件 3D 点分类从背景中移除重复的对象几何。所得表示将显式刚体与解耦的背景相结合，能够在保持视觉真实性的同时实现直接物理仿真。在模拟和真实多视图场景上的实验表明，与基于检索的基线相比，GARDEN 提高了对象放置可靠性、解耦质量和渲染 - 仿真效率。

Abstract

Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	6.0/10	9.0

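Scoring rationale: The paper's core is 3D scene reconstruction and disentanglement from RGB images, which is unrelated to Tokenizer and MLLM (0 points each). Visual Encoder applies to processing RGB input features (4 points). model-based RL receives the highest relevance (6 points) because the output supports physics simulation. World Models and Unify Models are weakly related, since the paper focuses on static geometry reconstruction rather than dynamics modeling or model unification (2-3 points). MultiModal relevance is low, as the task is mainly a vision-to-geometry conversion (2 points).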

Keywords

Gravity-Aligned, Reconstruction, Disentangled Environments, RGB images, Physics Simulation, Scene Factorization, 3D Representation

12. Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching [FAIL]

Score: 24.0 / 27.8

Authors: Yoad Tewel, Yuval Atzmon, Gal Chechik, Lior Wolf

Published: 2026-06-02

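TL;DR: This paper proposes the ByG framework, which uses flow matching and gradient routing to train image and video editing without paired data, addressing editing in data-scarce settings.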

摘要翻译

现代生成模型对视觉内容拥有深刻理解，但训练它们进行图像编辑通常需要大量的配对样本数据集。这限制了可扩展性，尤其是在视频编辑中，收集配对数据的成本过高。我们提出了 Bootstrap Your Generator (ByG)，这是一个用于流匹配 (Flow Matching) 编辑模型无配对训练的通用框架。它利用基础模型的知识而无需任何外部信号。我们的方法将从冻结模型中提取的指令遵循线索与循环一致性配对，以保持结构。为了使这一过程可行，我们提出将来自下游损失的梯度从干净预测路由到含噪训练状态。我们在具有挑战性的数据稀缺图像和视频编辑场景中展示了最先进结果。广泛的评估和用户研究表明，我们的方法能有效泛化到未见领域，并优于在百万级样本上训练的监督基线。分析表明，我们的梯度路由桥接了训练 - 推理差距，从基础模型中提取语义线索提供了鲁棒的训练信号，从而无需外部奖励模型。

Abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

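Scoring rationale: The paper's core is unpaired visual editing based on flow matching, which is unrelated to reinforcement learning and tokenizers (0-1 points). Instruction prompts imply some multimodal interaction (MLLM and MultiModal at about 4 points each), but the paper neither unifies model architectures nor builds a world model (2-3 points). The author list does not include the specified experts, so no bonus is applied. The weighted total of 24.0 is below the dynamic pass mark of 27.8.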

Keywords

Flow Matching, Unpaired Training, Visual Editing, Gradient Routing, Generative Models, Cycle Consistency, Instruction Following, Base Model

13. AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation [FAIL]

Score: 22.5 / 27.8

Authors: Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yujun Shen, Zhipeng Zhang

Published: 2026-06-02

TL;DR: AAD-1 introduces an asymmetric adversarial distillation framework for one-step autoregressive video generation that effectively mitigates motion collapse and achieves state-of-the-art performance.

摘要翻译

我们提出 AAD-1（Asymmetric Adversarial Distillation），一种用于单步自回归图像到视频生成的框架。当前最先进的方法虽采用对抗蒸馏，但面临运动崩溃和训练不稳定的问题，导致生成的视频趋于静态。AAD-1 通过架构和训练策略中的两项关键设计来解决这些挑战。我们的关键架构洞察在于打破生成器与判别器之间的对称性。生成器保持因果性以保留自回归采样能力，而判别器则在完整的时空上下文中采用双向注意力机制，并为整个视频序列输出一个整体真实度分数。这种非对称设计使判别器能够有效检测导致自回归生成中运动崩溃的全局时间故障和长程漂移。为了稳定训练，我们引入一种分阶段策略：首先利用分布匹配引导一个稳定的单步生成器，提供预热阶段，使学生分布更接近教师分布，然后再开始对抗蒸馏。在 VBench 上的大量实验表明，AAD-1 在单步自回归视频生成任务中达到了最先进性能。

Abstract

We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.
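
The asymmetry between generator and discriminator comes down to two different attention masks. The sketch below builds both at the granularity of whole positions; it illustrates the general idea only, since the abstract does not say whether the real masks operate per frame, per chunk, or per token, and the function names are hypothetical.

```python
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> List[List[bool]]:
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


generator_mask = causal_mask(3)             # generator: past and present only
discriminator_mask = bidirectional_mask(3)  # discriminator: the whole sequence
```

With the full mask, the discriminator's judgment of an early frame can depend on what happens later, which is what lets it penalize long-range drift that a causal view would miss.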

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on video generation via asymmetric adversarial distillation. It has low relevance to Unify Models, Tokenizer, World Models, MLLM, and model-based RL as these topics are not addressed or are tangential. Visual Encoder and MultiModal have moderate relevance due to image input and video output. No specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list.

Keywords

Asymmetric Adversarial Distillation, One-step Autoregressive, Video Generation, Motion Collapse, Training Stability, Image-to-Video, Generative Model

14. DiffUNet^2: Bidirectional Prediction, Probabilistic Generation and Collaborative Visual Discovery for Scientific Data [FAIL]

Score: 21.0 / 27.8

Authors: Mengdi Chu, Jiaxin Yang, Angus G. Forbes, Nathan Debardeleben, Earl Lawrence, Ayan Biswas, Han-Wei Shen

Published: 2026-06-02

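TL;DR: This paper proposes the DiffUNet^2 framework, which combines a diffusion model with interactive visual analytics to enable bidirectional prediction and probabilistic generation of the temporal evolution of scientific data, supporting hypothesis-driven exploration.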

摘要翻译

对科学现象的时间演化建模对于分析和推理至关重要，然而大多数机器学习方法仅提供确定性前向预测，忽略了多种可能的结果，且很少支持反向推理，从而限制了它们在实用科学工作流中的效用。我们提出一个框架，将基于扩散的生成建模与交互式可视化分析相结合，以用于科学探索。我们引入了 DiffUNet^2，这是一种条件扩散模型，支持跨时间的双向、任意到任意生成，并能捕捉合理系统演化的分布。基于该模型，我们的交互式系统支持分支时间线探索、用户引导的状态编辑以及概率空间导航，使科学家能够主动探索替代假设，而非被动观察预测。我们在不同科学领域的 5 个数据集上评估了该模型，以验证其预测准确性及概率空间集成质量。与领域专家合作，我们展示了该方法在支持实用科学时间数据分析工作流方面的有效性。通过整合建模与视觉交互，我们的方法使科学家能够交互式地探索系统动态，将生成模型转化为假设驱动的科学分析工具。

Abstract

Modeling temporal evolution is important to analyzing and reasoning about scientific phenomena, yet most machine learning methods provide deterministic forward predictions that overlook multiple plausible outcomes and rarely support backward reasoning, limiting their usefulness in practical scientific workflows. We present a framework that integrates diffusion-based generative modeling with interactive visual analytics for scientific exploration. We introduce DiffUNet^2, a conditional diffusion model that enables bidirectional, any-to-any generation across time and captures distributions of plausible system evolutions. Built upon the model, our interactive system supports branching timeline exploration, user-guided state editing, and probability-space navigation, enabling scientists to actively explore alternative hypotheses rather than passively observe predictions. We evaluate the model on 5 datasets across different scientific domains to validate its predictive accuracy and probability-space ensemble quality. In collaboration with domain experts, we demonstrate the effectiveness of our approach in supporting practical scientific temporal data analysis workflows. By integrating modeling and visual interaction, our approach enables scientists to interactively explore system dynamics, transforming generative models into tools for hypothesis-driven scientific analysis.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	1.0/10	1.5

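Rationale: The paper's core is the application of diffusion models to modeling the temporal evolution of scientific data, combined with interactive visual analytics. It has no direct connection to MLLM, Tokenizer, or model-based RL (1 point each); World Models is somewhat related (modeling dynamics and probability distributions, 4 points); Unify Models, MultiModal, and Visual Encoder are only weakly related (2-3 points). The weighted total is 21.0, below the pass threshold of 27.8. None of the specified experts appear in the author list.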
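
The score arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch, not the report generator's actual code: the relevance values, the shared weight of 1.5, and the 27.8 threshold are copied from this entry, while the function name, variable names, and the "total >= threshold means PASS" rule are assumptions made for illustration.

```python
# Relevance values and weight come from the score table of this entry.
# The names and the pass rule (total >= threshold) are illustrative assumptions.
PASS_THRESHOLD = 27.8  # pass line quoted in the report


def weighted_total(relevance, weight=1.5):
    """Sum weight * relevance over all keywords, rounded to one decimal."""
    return round(sum(weight * r for r in relevance.values()), 1)


relevance = {
    "Unify Models": 3.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 2.0,
    "World Models": 4.0,
    "MLLM": 1.0,
    "MultiModal": 2.0,
    "model-based RL": 1.0,
}

total = weighted_total(relevance)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)
```

With these inputs the total is 1.5 × 14 = 21.0, which matches the weighted total stated in the rationale and falls below the threshold.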

Keywords

Diffusion-based generative modeling, Bidirectional prediction, Probabilistic generation, Scientific data, Interactive visual analytics, Temporal evolution, Conditional diffusion model

15. Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning [FAIL]

Score: 19.5 / 27.8

Authors: Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

Published: 2026-06-02

TL;DR: This paper proposes Agentic Chain-of-Thought Steering (ACTS), in which an RL-trained controller steers a frozen LLM reasoner during inference, matching full-thinking performance with substantial token savings and enabling controllable accuracy-efficiency trade-offs.

Abstract

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.
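
The control loop described in the abstract (observe the trace and remaining budget, emit a strategy plus a steering phrase, let the frozen reasoner take one step) can be sketched as a toy episode. Everything below is hypothetical: the strategy names, the steering phrases, the budget rule, and the fixed per-step token cost are stand-ins chosen for illustration, and the stub reasoner replaces a real frozen LLM. It is not the ACTS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    trace: list = field(default_factory=list)  # reasoning steps so far
    budget: int = 0                            # remaining thinking tokens


def toy_controller(state):
    """Hypothetical policy: choose (strategy, steering phrase) from the state."""
    if state.budget <= 8:
        return "conclude", "So the final answer is"
    if len(state.trace) % 2 == 1:
        return "verify", "Let me double-check that"
    return "decompose", "First, consider"


def toy_reasoner(trace, phrase, max_tokens):
    """Stand-in for a frozen LLM: continues from the steering phrase."""
    step = f"{phrase} ... (step {len(trace) + 1})"
    cost = min(max_tokens, 6)  # pretend each step costs at most 6 tokens
    return step, cost


def run_episode(budget):
    """Roll out one steering episode until the budget runs out or we conclude."""
    state = State(budget=budget)
    actions = []
    while state.budget > 0:
        strategy, phrase = toy_controller(state)
        step, cost = toy_reasoner(state.trace, phrase, state.budget)
        state.trace.append(step)
        state.budget -= cost
        actions.append(strategy)
        if strategy == "conclude":
            break
    return actions, state


actions, final_state = run_episode(budget=20)
print(actions, final_state.budget)
```

In the paper the controller is a learned policy trained with RL; the hand-written rule here only shows where the budget enters the decision.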

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

Rationale: The paper addresses LLM inference efficiency via RL steering, aligning partially with 'model-based RL' (4.0) and 'MLLM' (3.0). However, it lacks multi-modal components ('Visual Encoder' and 'MultiModal' score 0.0) and does not focus on tokenizer design ('Tokenizer' 2.0), world modeling ('World Models' 2.0), or model unification ('Unify Models' 2.0). No specified expert authors are present.

Keywords

Chain-of-Thought, Reinforcement Learning, Inference Control, Token Efficiency, Frozen Reasoner, Controller Agent, Markov Decision Process, Reasoning Steering

16. Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments [FAIL]

Score: 18.0 / 27.8

Authors: Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi

Published: 2026-06-02

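TL;DR: This paper proposes PROVE, a framework that uses reinforcement learning with programmatic rewards in live environments to improve the accuracy and efficiency of LLMs on multi-step tool orchestration.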

Abstract

Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) an automated data synthesis pipeline that generates validated multi-turn tool-call trajectories against these servers via dependency-graph-guided conversation simulation grounded in live-sampled server state, so every generated query references entities that actually exist; and (3) a multi-component programmatic reward - graduated validity scoring, dependency-aware coverage, an adaptive efficiency penalty with a complexity-scaled call budget, a tool-name signal, and an argument-value matching bonus - requiring no external judge model. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO using identical reward hyperparameters and ~13K training examples; only learning rate is tuned per model family from a three-point sweep. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that a compact programmatic reward yields consistent gains on multi-step tool orchestration across two model families.
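
The abstract names GRPO as the training algorithm. The core idea of GRPO is that several responses are sampled for the same prompt and each response's reward is normalized against the group's mean and standard deviation, so no separate value model is needed. The sketch below shows only that normalization step. The reward values are made up, the `eps` term and the use of the population standard deviation are assumptions, and nothing here reflects PROVE's actual reward components or hyperparameters.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical programmatic rewards for four sampled tool-call trajectories.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
print(advantages)
```

A group in which every trajectory earns the same reward yields all-zero advantages and therefore no gradient signal, which is one reason graded reward components can help more than a single pass/fail check.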

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

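Rationale: The paper's core is applying reinforcement learning to tool calling. It does not involve multimodal vision, tokenizer architecture, or model unification, so those keywords score low. It involves environment interaction but not a generative world model. RL is the core method, but GRPO is a model-free algorithm, so model-based RL receives only a middling score. None of the specified experts are in the author list. The weighted total is below the dynamic pass score, indicating a weak match.

Keywords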

Reinforcement Learning, Tool Use, Programmatic Rewards, Live Environments, Multi-step Orchestration, GRPO, LLM Training

17. CoralBay: A Self-Supervised CT Foundation Model [FAIL]

Score: 18.0 / 27.8

Authors: Ioannis Gatopoulos, Nicolas Känzig, Sebastian Otálora, Fei Tang

Published: 2026-06-02

TL;DR: CoralBay is a self-supervised CT foundation model built on a 3D Swin backbone with self-distillation; it transfers strongly to downstream radiology tasks.

Abstract

Self-supervised learning has enabled large-scale pre-training on 2D natural images, producing general-purpose visual representations that transfer effectively across tasks. However, many medical imaging modalities, such as CT scans, are inherently three-dimensional and differ fundamentally from natural images in both structure and semantics. Volumetric modalities capture spatial continuity, organ anatomy, and intensity-based tissue properties (e.g., Hounsfield Units), which are not adequately modeled by 2D pre-training. To bridge this gap, we introduce CoralBay, a self-distillation framework that extends DINO by using a hierarchical 3D Swin backbone and applying self-distillation to concatenated multi-scale features, enabling data-efficient self-supervised learning of rich spatial representations that encode both global semantics and fine-grained local structure. As a result, CoralBay transfers effectively to a wide range of downstream radiological tasks, demonstrating strong and consistent performance across diverse anatomical targets. In addition, we contribute to the open-source eva framework by introducing a public, reproducible 3D radiology leaderboard that unifies multiple datasets and establishes a standardized benchmark for evaluating volumetric representation learning methods.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

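Rationale: The paper focuses on self-supervised representation learning for 3D medical imaging (CT). 'Visual Encoder' is highly relevant (it corresponds to the 3D Swin backbone). 'Unify Models' is only weakly relevant: dataset unification is mentioned for the benchmark, but the model itself does not unify architectures. 'MultiModal' has very low relevance: the paper uses multi-scale features, not cross-modal ones. The remaining keywords (Tokenizer, World Models, MLLM, model-based RL) are unrelated to this purely visual, non-generative, non-RL work. The author list does not include Yang Shi or the other specified experts, so no bonus applies. The weighted total is 18.0, below the dynamic pass score of 27.8.

Keywords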

Self-supervised learning, 3D CT, Foundation Model, Visual Encoder, Self-distillation, Radiological tasks, Volumetric data

18. RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions [FAIL]

Score: 18.0 / 27.8

Authors: Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao

Published: 2026-06-02

TL;DR: RealClawBench is a live benchmark framework built from real developer-agent (OpenClaw) sessions to evaluate deployed agents realistically; the best of 14 evaluated models solves only 65.8% of its 281 tasks.

Abstract

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
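
The distribution-preservation claim rests on the Jensen-Shannon divergence between the source pool and the sampled release. The sketch below computes that divergence for one hypothetical attribute. The four category counts are invented (the release counts are chosen only so that they sum to the 281 tasks mentioned above), and the base-2 logarithm is an assumption, since the abstract does not state the base. The reported 0.0448 is a maximum, presumably taken over several such attributes.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def normalize(counts):
    """Turn raw category counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical task-category counts: large source pool vs. 281-task release.
source = normalize([500, 300, 150, 50])
release = normalize([140, 85, 42, 14])
jsd = js_divergence(source, release)
print(jsd)
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (disjoint support), which gives a scale for reading a value such as 0.0448.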

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	2.0/10	3.0

Rationale: The paper focuses on a benchmarking framework (RealClawBench) for deployed developer agents built from real-world session data, rather than proposing new model architectures or learning paradigms. Keywords like Tokenizer, Visual Encoder, World Models, and Unify Models are not discussed. MLLM and MultiModal are contextually relevant to the agents evaluated but not to the paper's core contribution. Model-based RL is related to the domain but not to the specific method. No expert authors from the specified list are present in the author list.

Keywords

Agent benchmarks, OpenClaw sessions, Reconstructed execution environments, Deterministic verifiable scorers, Reproducible tasks, Real-world difficulty, Model evaluation

19. SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image [FAIL]

Score: 18.0 / 27.8

Authors: Inhee Lee, Sangwon Baik, Sungjoo Kim, Hyeonwoo Kim, Hyunsoo Cha, Hanbyul Joo

Published: 2026-06-02

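TL;DR: SimuScene is a physics-informed compositional 3D reconstruction pipeline that uses a physics engine as a diagnostic feedback loop to produce simulation-ready scenes for robotic manipulation from a single image.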

Abstract

Reconstructing interactive, simulation-ready 3D scenes from a single image is a critical bottleneck for robotic manipulation. While recent single-image lifters recover plausible per-object shapes, composing them yields scenes that collapse under physical simulation due to interpenetrating, hovering, or sinking objects. Existing physics-aware methods address this strictly as a post-hoc layout correction, leaving the underlying geometric errors unresolved. To address this, we introduce SimuScene, a compositional 3D reconstruction pipeline that puts physics in the loop of shape and layout estimation. Rather than using physics merely for layout cleanup, we utilize the physics engine as a diagnostic measurement tool during the generative process itself. By diagnostically simulating reconstructed objects under gravity, we convert penetration and support failures into quantitative correction signals that drive gravity-axis stretching and amodal shape resampling. This physics-informed feedback loop mitigates accumulated reconstruction errors and produces a stable, simulation-ready compositional 3D scene. Extensive experiments demonstrate state-of-the-art performance on physical stability and geometric alignment benchmarks. We further highlight SimuScene's utility by deploying reconstructed environments in humanoid control and robot-arm manipulation tasks.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

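Rationale: The paper's core is single-image 3D scene reconstruction with physics simulation, which relates only weakly to the given multimodal-LLM and reinforcement-learning keywords. Unify Models applies only in the sense of pipeline integration, not architectural unification; Tokenizer and MLLM are unrelated; Visual Encoder is a generic component rather than a core contribution; World Models refers here to physics-engine simulation rather than a learned model; MultiModal covers only image-to-3D; model-based RL is touched only through downstream use of the simulated scenes. The weighted total is 18.0, below the dynamic pass score of 27.8. The author list includes none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords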

3D Scene Reconstruction, Single Image, Physics-informed Feedback Loop, Simulation-Ready, Compositional, Robotic Manipulation, Geometric Alignment, Amodal Shape Resampling

20. AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task [FAIL]

Score: 16.5 / 27.8

Authors: Quentin Fuxa, Dominik Macháček

Published: 2026-06-02

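TL;DR: This paper proposes AlignAtt4LLM, which adapts decoder-only LLMs to simultaneous speech translation through prompt changes and alignment-head selection, achieving good translation quality at low latency.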

Abstract

We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy. To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically. On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.
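
For readers unfamiliar with AlignAtt, the policy decides how much of a draft translation to commit by looking at where each draft token attends in the source: emission stops at the first token whose strongest attention falls within the last `f` positions of the source received so far, since that part may still change. The sketch below shows this rule on a hand-written attention matrix. The matrix values, the function name, and the use of a single pre-aggregated attention row per token are assumptions for illustration; the paper's head selection, prompt layout, and query/key capture are not modeled here.

```python
def alignatt_emit(attention, num_source, f):
    """Return how many draft tokens may be emitted under an AlignAtt-style rule.

    attention[i][j] is the attention weight of draft token i on source position j.
    Emission stops at the first token whose most-attended source position lies
    within the last f positions of the currently available source.
    """
    emitted = 0
    for row in attention:
        top = max(range(num_source), key=lambda j: row[j])
        if top >= num_source - f:
            break  # token depends on the unstable tail of the source; wait
        emitted += 1
    return emitted


# Hypothetical attention of 3 draft tokens over 5 source positions.
attention = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.20, 0.60],
]
print(alignatt_emit(attention, num_source=5, f=2))
```

A larger `f` makes the policy more conservative (higher latency, more stable output), while `f = 0` emits the whole draft.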

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	1.0/10	1.5

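Rationale: The paper studies an attention-alignment policy (AlignAtt) for decoder-only LLMs in a simultaneous speech translation system, which differs markedly from the keyword set (world models, unified models, visual encoders, model-based RL). It uses a cascade of ASR and an LLM rather than a unified model; the task is speech translation, with no visual encoder; no world model is built; it is multimodal (speech and text) but not a unified MLLM architecture; and the 'policy' it mentions is an alignment policy, not a reinforcement-learning one. Relevance scores are therefore low, with a weighted total of 16.5, below the dynamic pass score of 27.8. The author list includes none of the specified experts.

Keywords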

Simultaneous Speech Translation, Decoder-Only LLMs, AlignAtt, Attention Alignment, Low-latency, Cascade System, Qwen3-ASR, Gemma-4

21. A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026 [FAIL]

Score: 16.5 / 27.8

Authors: Aziz Sharipov Ortega, Dominik Macháček

Published: 2026-06-02

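TL;DR: This paper adds simultaneous translation capability to the offline 1B-parameter Canary speech translation model using the AlignAtt policy, reporting high translation quality in both low- and high-latency regimes, low computational cost, and multilingual support on the IWSLT 2026 task.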

Abstract

We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to IWSLT 2026 Simultaneous Speech Translation Shared task for Czech to English and English to German and Italian. The strengths of our system are: (1) high translation quality, outperforming similarly sized baselines both in low- and high-latency regimes in computationally unaware simulations; (2) low computational requirements, as the model has only 1B parameters; (3) multilinguality -- support of 25 source and 25 target languages.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	2.0/10	3.0

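Rationale: The paper's core is an offline speech translation system (Canary + AlignAtt) in the NLP/audio domain. The keyword set centers on unified models, world models, and model-based RL (typically used for embodied intelligence or unified multimodal agents). The paper involves no visual encoder (0 points) and builds no world model (0 points). It involves speech and text (multimodal) but offers no core innovation in MLLM or unified-model architecture, so it is a poor domain match for the keywords, and its weighted total (16.5) falls well below the dynamic pass score (27.8).

Keywords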

Simultaneous Speech Translation, Offline Direct Model, Canary, AlignAtt Policy, Multilingual Support, Low Computational Requirements, IWSLT 2026

22. Neuron Populations Exhibit Divergent Selectivity with Scale [FAIL]

Score: 15.0 / 27.8

Authors: Amil Dravid, Yasaman Bahri, Alexei A. Efros, Yossi Gandelsman

Published: 2026-06-02

TL;DR: This paper investigates how neuron populations evolve with model scale, finding that 'Rosetta Neurons' become more selective and monosemantic while following a sublinear power law in number.

Abstract

We investigate whether neuron populations within neural networks evolve predictably with scale, extending scaling laws beyond macroscopic observables such as loss. To probe this question, we study Rosetta Neurons, a previously characterized class of neurons whose activation patterns are similar across independently trained models (Dravid et al., 2023). In separate analyses of language models up to 30B parameters and vision models up to 5B parameters, we observe that the population of Rosetta Neurons follows a sublinear power law in model size, growing in absolute number but occupying a shrinking fraction of the total neuron count. We further observe a Neuron Polarization Effect: Rosetta Neurons become more selective and increasingly monosemantic with scale, separating from a growing non-Rosetta population that remains less selective. An analytical model balancing feature utility against limited neuron capacity explains the sublinear power-law scaling and this polarization effect. Finally, we find that Rosetta Neurons become more domain-specialized with scale and illustrate their selectivity through a targeted data-filtering case study for continued pretraining. Our results point to a scaling law for interpretable, shared neuron-level structure, linking model size to systematic changes in neuron universality, selectivity, and specialization.
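
The statement that the Rosetta population grows in absolute number while shrinking as a fraction follows from a short calculation. The symbols below (R for the Rosetta count, T for the total neuron count, exponents α and β, constants c and k) are notation introduced here for illustration, and the assumption that the total neuron count itself follows a power law in model size N is ours, not the abstract's.

```latex
\begin{align}
  R(N) &= c\,N^{\alpha}, \qquad T(N) = k\,N^{\beta},
         \qquad c, k > 0,\quad 0 < \alpha < 1,\quad \alpha < \beta \\
  \frac{dR}{dN} &= c\,\alpha\,N^{\alpha - 1} > 0
         \quad\text{(the absolute count keeps growing)} \\
  \frac{R(N)}{T(N)} &= \frac{c}{k}\,N^{\alpha - \beta} \;\longrightarrow\; 0
         \quad\text{as } N \to \infty
         \quad\text{(the fraction shrinks)}
\end{align}
```

Under these assumptions, any total neuron count that grows faster than the Rosetta count (α < β) yields a shrinking fraction; the abstract does not report the fitted exponents.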

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Rationale: The paper focuses on neuron scaling laws and interpretability (Rosetta Neurons) in language and vision models, rather than world models, reinforcement learning, or unified multimodal architectures. Thus, World Models and model-based RL are irrelevant (0.0). MultiModal and MLLM have low relevance (2.0) as the paper studies vision and language models separately rather than a fused system. Visual Encoder has moderate relevance (3.0) due to vision model analysis. Tokenizer has low relevance (1.0) as it is not discussed. Unify Models has low relevance (2.0) as it compares scaling laws rather than unifying architectures. Total weighted score is 15.0, below the dynamic pass score of 27.8. No expert authors from the list were found, so no bonus added.

Keywords

Scaling Laws, Neuron Populations, Rosetta Neurons, Selectivity, Monosemanticity, Language Models, Vision Models

23. Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories [FAIL]

Score: 13.5 / 27.8

Authors: Ali Behrouz, Farnoosh Hashemi, Vahab Mirrokni

Published: 2026-06-02

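TL;DR: This paper proposes a "Sleep" paradigm with a memory-consolidation stage and an RL-based self-improvement stage, letting language models learn continually and turn short-term memories into long-term knowledge.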

Abstract

The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by the human learning process, we introduce a "Sleep" paradigm that allows models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves through a "Dreaming" process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller self are distilled into a larger network to provide more capacity while preserving the knowledge. As a proof of concept, we present a new Generalized Distillation process for Knowledge Seeding (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

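Rationale: The paper's core is continual learning and memory consolidation for LLMs, introducing a 'Sleep' paradigm that uses RL for self-improvement. It has some connection to 'model-based RL' and 'World Models' (RL training and internal simulation) and a partial conceptual fit with 'Unify Models' (unifying short-term and long-term knowledge). It does not involve visual encoders, multimodal architectures, or tokenizer design, so 'Visual Encoder', 'MLLM', 'MultiModal', and 'Tokenizer' score 0. The total is below the dynamic pass score, indicating a weak overall match with the keyword set.

Keywords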

Continual Learning, Memory Consolidation, Reinforcement Learning, Knowledge Seeding, Self-Improvement, Dreaming, Long-term Parameters, Language Models

24. QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards [FAIL]

Score: 13.5 / 27.8

Authors: Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

Published: 2026-06-02

TL;DR: QUBRIC addresses the bottleneck in rubric-based RL by co-designing queries and rubrics, achieving significant performance gains on instruction-following and reasoning benchmarks without relying on verifiable rewards.

Abstract

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

Rationale: The paper focuses on rubric-based reinforcement learning for text instruction following, showing low relevance to the multimodal-specific keywords (Visual Encoder, MultiModal, Tokenizer, World Models, Unify Models). While it involves RL, it employs model-free policy optimization (GRPO) rather than model-based RL, and MLLM relevance is minimal because the core contribution is rubric co-design rather than multimodal architecture.

Keywords

Rubric-based RL, Query-Rubric Co-design, Instruction Following, ArenaHard, Reasoning, Verifiable Rewards, GRPO

25. Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning [FAIL]

Score: 13.5 / 27.8

Authors: Anthony GX-Chen, Ankit Anand, Gheorghe Comanici, Zaheer Abbas, Eser Aygün, David Smalling, Shibl Mourad, Doina Precup, André Barreto, Mark Rowland

Published: 2026-06-02

TL;DR: This paper proposes a reinforcement learning framework that utilizes reward uncertainty distributions to naturally induce diverse agent behaviors without sacrificing expected reward, validated in contextual bandit settings.

Abstract

Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.
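
A two-action toy case shows why a non-linear objective over a set of actions can reward diversity when the reward function is uncertain. The objective used below, the expected best reward within the set, is only one possible non-linear set objective and is an assumption made for illustration; the abstract does not specify the paper's objective, and the reward hypotheses are invented.

```python
def expected_best(action_set, reward_samples):
    """Average, over sampled reward functions, of the best reward within the set."""
    best_per_sample = [max(r[a] for a in action_set) for r in reward_samples]
    return sum(best_per_sample) / len(reward_samples)


# Two equally likely hypotheses about the true reward (hypothetical numbers).
reward_samples = [
    {"A": 1.0, "B": 0.0},
    {"A": 0.0, "B": 1.0},
]

same = expected_best(["A", "A"], reward_samples)     # commit to one action twice
diverse = expected_best(["A", "B"], reward_samples)  # hedge across both actions
print(same, diverse)
```

Both single actions have the same expected reward (0.5), so a scalar-reward objective is indifferent between them, whereas the set objective prefers the diverse pair (1.0 versus 0.5). If the reward distribution collapses to a single hypothesis, the advantage of diversity disappears, which is the sense in which diversity is controlled by the reward distribution.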

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

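Scoring rationale: The paper's core topic is reward uncertainty and behavioural diversity in reinforcement learning (RL), which matches the keyword set (multimodal and large-model topics) only weakly. It is loosely related to Unify Models (objective reformulation), World Models (RL background), and MLLM (mentioned as an application), and unrelated to Tokenizer, Visual Encoder, and MultiModal. model-based RL has some relevance (uncertainty in reward models), but the core is not classical model-based planning. The author list does not include any of the designated experts (Yang Shi and others).

Keywords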

Reinforcement Learning, Reward Uncertainty, Diverse Behaviour, Scalar Reward, Policy Gradient, Contextual Bandit, Objective Reformulation

26. Efficient ASR Training with Conversations that Never Happened [FAIL]

Score: 12.0 / 27.8

Authors: Máté Gedeon, Péter Mihajlik

Published: 2026-06-02

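TL;DR: This paper uses LLMs and TTS to generate synthetic conversations that augment ASR training data; the results show that simulated conversations improve speech recognition performance when real data is limited.

Abstract (Chinese translation)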

面向低资源语言和特定领域的对话式自动语音识别（ASR）受限于缺乏领域匹配的多说话人训练数据。我们提出一个数据增强流程，该流程生成带有参与者元数据的场景级对话，将说话人属性映射到 TTS 音色配置文件，并将合成话语组装成说话人感知的模拟对话。我们在单生成器、固定预算混合及扩展设置下评估了五个大语言模型（LLM）系列，每个系列均采用相同的 FastConformer-Large 训练配方。我们在匈牙利 BEA-Dialogue 基准语料库上进行了全面评估，该方法本身适用于任何语言，前提是具备各组件所需的资源。结果表明，合成对话始终能提高语音识别性能，但生成器的选择和数据构成强烈影响提升幅度。我们最大的训练配置仅使用 67 小时真实对话和 636 小时模拟数据，其在评估基准上的表现优于在 2700 小时匈牙利语音上训练的零样本模型。这些发现表明，利用 TTS 合成的 LLM 生成对话数据是语音模型训练中真实对话语料库的实用补充。

Abstract

Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0
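
The table above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes each score is weight × relevance and that the total is the plain sum of the rows, which is consistent with the numbers shown here. The report's actual scoring code and the way the 27.8 threshold is derived are not given, so the names below and the `>=` comparison are illustrative assumptions.

```python
# Rows copied from the table above: (keyword, weight, relevance out of 10).
ROWS = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 2.0),
    ("MultiModal", 1.5, 3.0),
    ("model-based RL", 1.5, 0.0),
]
PASS_THRESHOLD = 27.8  # taken from the "Score: 12.0 / 27.8" line

def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

total = weighted_total(ROWS)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```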

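Scoring rationale: The paper focuses on ASR data augmentation using synthetic conversations generated with LLMs and TTS. The topics most central to the keyword set (visual encoders, world models, model-based RL) are not covered at all. Although it uses LLMs and processes both text and audio (so it has some multimodal character), it is not a study of core MLLM architectures or unified models. None of the designated expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear. The weighted total of 12.0 is below the dynamic pass threshold of 27.8, indicating low relevance to the keyword set.

Keywords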

ASR Training, Synthetic Conversations, LLM-generated Data, TTS Voice Profiles, Speech Recognition, Data Augmentation, Multi-speaker

27. Knowledge Editing in Masked Diffusion Language Models [FAIL]

Score: 12.0 / 27.8

Authors: Haewon Park, Yohan Jo

Published: 2026-06-02

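TL;DR: The paper studies how locate-then-edit knowledge editing transfers from autoregressive models to masked diffusion models, finds that multi-token edits degrade in diffusion models because of partially unmasked intermediate states, and proposes a corresponding correction.

Abstract (Chinese translation)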

知识编辑旨在更新或修正语言模型中的事实性知识。一种广泛使用的方法是“定位后编辑”（locate-then-edit），它分两步完成：首先在模型内部定位某个事实，然后编辑该处的权重。迄今为止，此类方法仅在自回归模型（ARMs）上发展。然而，这些方法的基本假设是否适用于掩码扩散模型（MDMs）——后者通过双向建模文本并通过迭代去噪而非下一个 token 预测来生成文本——仍是一个开放性问题。我们通过将“定位后编辑”方法迁移至 MDMs，并在相同规模下比较两种 MDMs（LLaDA、Dream）与两种 ARMs（LLaMA、Qwen）来回应这一问题。我们的核心发现包含两个方面。首先，编辑应用的位置具有跨范式的一致性：因果追踪（causal tracing）在两者中均指向最后一个主体 token 处相同的早期至中期层 MLP（多层感知机），且在此处编辑效果最佳。其次，这一共享位置并不能保证共享的效果。单 token 编辑在两者中均能成功，但随着目标长度增加，编辑效果在 MDMs 中系统性退化，而在 ARMs 中则不然。这一失败源于编辑事实的生成机制：生成多 token 目标需要经过部分未掩码的中间状态，而编辑从未针对这些状态进行过优化。基于这一诊断，我们提出了一种简单的修正方法，针对这些状态优化编辑，显著恢复了多 token 性能。

Abstract

Knowledge editing aims to update or correct factual knowledge in a language model. A widely used approach, locate-then-edit, does this in two steps: it first localizes a fact within the model, then edits the weights there. To date, such methods have been developed exclusively on autoregressive models (ARMs). Whether their underlying assumptions hold for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction, remains an open question. We address it by transferring locate-then-edit to MDMs and comparing two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. Our central finding has two parts. First, where an edit is applied transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. Second, this shared location does not guarantee a shared outcome. Single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs but not the ARMs. The failure stems from how the edited fact is generated: producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, we introduce a simple correction that optimizes the edit for these states, substantially restoring multi-token performance.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

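Scoring rationale: The paper focuses on knowledge editing in language models, comparing masked diffusion and autoregressive models. The keyword set targets multimodality and RL, so the domain mismatch is large. Unify Models scores 5 because the paper compares paradigms under a unified method; Tokenizer scores 3 because it involves token-level editing; the remaining keywords (vision, multimodal, RL, world models) are unrelated and score 0. The authors are not on the expert list. The weighted total is 12.0, below the pass threshold of 27.8.

Keywords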

Knowledge Editing, Masked Diffusion Models, Autoregressive Models, Locate-then-Edit, Multi-token Editing, Language Models, Intermediate States

28. PatchScene: Patch-based Voxel Diffusion for Large-Scale Scene Completion [FAIL]

Score: 12.0 / 27.8

Authors: Qingdong Xu, Jiajun Zhu, Shilin Zhu, Xinjing He, Chao Lu, Huanran Wang, Jiyao Zhang

Published: 2026-06-02

TL;DR: PatchScene introduces a patch-based voxel diffusion framework for LiDAR scene completion, achieving state-of-the-art performance on SemanticKITTI with scalable generalization from 20m to 50m ranges.

Abstract (Chinese translation)

我们提出 PatchScene，这是一种用于大规模 LiDAR 场景补全的新型基于扩散的框架。与依赖全局潜在表示或密集体素网格的现有方法不同，PatchScene 采用基于体素块的扩散范式，显式地在局部 3D 区域内生成细粒度几何。为了确保在空间和时间尺度上的连贯重建，我们引入了一种置信度引导的时空融合机制，该机制在统一的生成过程中整合重叠块与相邻帧。此外，我们设计了一种 Annular-Flow（环形流）扩散策略，利用 LiDAR 扫描的径向密度模式，逐步将高保真信息从近程区域传播至远程区域，从而实现空间无界的场景补全。在 SemanticKITTI 基准上的广泛实验表明，PatchScene 在所有标准指标上均实现了最先进的性能，在几何精度和时间一致性方面均超越了先前方法。值得注意的是，在 20 米 LiDAR 范围上训练的模型无需重新训练即可有效泛化至 50 米场景，凸显了其在现实世界自动驾驶应用中强大的可扩展性和泛化能力。

Abstract

We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on 3D diffusion for LiDAR scene completion, which has low overlap with MLLM, Tokenizer, and RL paradigms. Unify Models and World Models have slight relevance regarding unified generative processes and scene representation, but the core task is perception rather than language or control. MultiModal is weak as it primarily processes LiDAR data without explicit fusion with other modalities like vision or text.

Keywords

Patch-based Voxel Diffusion, Large-Scale Scene Completion, LiDAR, Spatio-temporal Fusion, Annular-Flow, SemanticKITTI, Generative Process

29. Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill [FAIL]

Score: 10.5 / 27.8

Authors: Tao Chen, Gangwei Jiang, Pengyu Cheng, Siyuan Huang, Yihao Liu, Jingwei Ni, Jiaqi Guo, Mengyu Zhou, Kai Tang, Junling Liu, Qinliang Su, Xiaoxi Jiang, Guanjun Jiang

Published: 2026-06-02

TL;DR: Skill-RM unifies heterogeneous reward evaluation criteria through a reusable agent skill framework, demonstrating superior performance in LLM post-training and reinforcement learning tasks compared to traditional judge baselines.

Abstract (Chinese translation)

奖励模型 (RMs) 为大语言模型 (LLM) 后训练提供关键的反馈信号，尤其在强化微调 (RFT) 和强化学习 (RL) 流程中。然而，当前的奖励评估依赖于诸如基于规则的验证器、真值参考、程序性清单和复杂评分量表等异质标准，而整合所有类型证据的统一机制尚未被探索。为此，我们提出技能奖励模型 (Skill-RM)，这是一个统一框架，将奖励建模重构为可重用的奖励评估技能 (Reward-Evaluation Skill) 的执行过程。通过将奖励计算视为结构化智能体任务，Skill-RM 提供一致的接口来编排异质资源，动态选择和聚合针对每个输入特定要求定制的证据。该方法使奖励模型能够突破静态评估，确保在多样化任务中的一致性和透明度。在奖励基准及下游应用（包括 Best-of-N 选择和强化学习）上的广泛实验表明，Skill-RM 始终优于传统评判基线。我们的研究结果表明，Skill-RM 不仅为奖励建模提供了统一解决方案，而且通过证据的战略性和动态编排实现了卓越性能。代码开源地址为 https://github.com/Qwen-Applications/Skill-RM。

Abstract

Reward models (RMs) provide critical feedback signals for LLM post-training, notably in reinforced fine-tuning (RFT) and reinforcement learning (RL) pipelines. However, current reward evaluation relies on heterogeneous criteria such as rule-based verifiers, ground-truth references, procedural checklists, and complex rubrics, where a unified mechanism to integrate all types of evidence remains unexplored. To this end, we propose Skill Reward Model (Skill-RM), a unified framework that reformulates reward modeling as the execution of a reusable Reward-Evaluation Skill. By treating reward computation as a structured agentic task, Skill-RM provides a consistent interface to orchestrate heterogeneous resources, dynamically selecting and aggregating evidence tailored to the specific requirements of each input. This approach enables the reward model to move beyond static evaluation, ensuring consistency and transparency across diverse tasks. Extensive experiments on reward benchmarks and downstream applications, including best-of-N selection and reinforcement learning, demonstrate that Skill-RM consistently outperforms traditional judge baselines. Our findings suggest that Skill-RM not only provides a unified solution for reward modeling but also achieves superior performance through the strategic and dynamic orchestration of evidence. The code is at https://github.com/Qwen-Applications/Skill-RM.
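
Best-of-N selection, one of the downstream applications mentioned above, can be sketched in a few lines. This is a generic illustration and not Skill-RM's interface: `toy_reward` is a hypothetical stand-in for a reward model, and the candidate strings are made up.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)

def toy_reward(response: str) -> float:
    """Hypothetical stand-in: reward responses that contain the expected answer."""
    return 1.0 if "42" in response else 0.0

samples = ["The answer is 41.", "The answer is 42.", "I am not sure."]
print(best_of_n(samples, toy_reward))
```

Any reward model that maps a response to a scalar could be passed as `reward_fn`; the quality of the selection then depends entirely on that model.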

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on Reward Modeling for LLM post-training, proposing a unified framework (Skill-RM) to integrate heterogeneous evaluation criteria via agent skills. It scores moderately on 'Unify Models' due to the core theme of unification, and on 'model-based RL' due to the mention of RL pipelines, though it primarily addresses Reward Modeling rather than Model-Based RL dynamics. It has no relevance to Tokenizers, Visual Encoders, World Models, MLLMs, or MultiModal architectures as these are not discussed in the abstract. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Reward Models, LLM Post-training, Agent Skill, Heterogeneous Evaluation, Unified Framework, Reinforcement Learning, Evidence Aggregation, Best-of-N Selection

30. scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation [FAIL]

Score: 9.0 / 27.8

Authors: Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li, Zhen Lei, Chang Yu, Stan Z. Li

Published: 2026-06-02

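TL;DR: This paper presents scTranslation, a benchmark for single-cell multi-omics modality translation; it does not address AI model architectures or reinforcement learning.

Abstract (Chinese translation)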

单细胞中多种多组学模态 (omics modalities) 的同时测量使研究人员能够获得对细胞状态和调控机制更全面的理解。然而，鉴于实验成本高、噪声显著以及模态覆盖不完整，近年来出现了多种用于模态翻译 (modality translation) 的计算方法。尽管翻译模型不断发展，但在数据集、评估指标 (evaluation metrics) 和影响因素方面仍缺乏系统的基准 (benchmark) 评估。为此，我们提出了 scTranslation，这是一个用于单细胞多组学模态翻译任务的综合基准。该基准包含多样的翻译数据集，整合了最先进的模型，并提供了全面的评估指标。此外，我们在不同场景下评估模型性能，例如特征选择、特征质量和少样本 (few-shot) 设置。这些因素显著影响模型性能，但此前很少被系统研究过。基于此基准，我们对当前方法进行了大规模研究，报告了许多有洞察力的发现，为未来发展开辟了新可能性。该基准已开源，以促进未来的研究。代码匿名发布于 https://github.com/Bunnybeibei/scTranslation。

Abstract

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0

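Scoring rationale: The paper focuses on benchmarking single-cell multi-omics modality translation and belongs to bioinformatics. Although it involves multimodal data (MultiModal), it does not cover unified model architectures, tokenizers, visual encoders, world models, large language models, or model-based RL. This is a significant domain mismatch with the keyword set, so the weighted total is far below the pass threshold.

Keywords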

Single-Cell, Multi-Omics, Modality Translation, Benchmark, Evaluation Metrics, Feature Selection, Few-Shot Settings

31. Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents [FAIL]

Score: 7.5 / 27.8

Authors: Yingqi Zhang

Published: 2026-06-02

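TL;DR: This paper proposes Agent libOS, a runtime framework based on library-OS ideas for long-running, capability-controlled LLM agents; it mainly addresses agent scheduling and auditing rather than multimodal model training or RL algorithms.

Abstract (Chinese translation)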

大语言模型（LLM）代理正从请求 - 响应助手演变为长期运行的软件实体：它们在模型调用之间维护状态，派生子任务，等待外部事件，请求人类授权，生成工具，并执行必须恢复和审计的副作用。本文提出了 Agent libOS，一种受库操作系统启发、面向 LLM 代理的运行时基础架构。Agent libOS 运行于传统宿主操作系统之上；它不实现硬件驱动、内核模式隔离或 POSIX 兼容操作系统。相反，它将一个代理视为 AgentProcess：一个具有进程身份、父子谱系、生命周期状态、从 AgentImage 派生的工具表、类型化 Object Memory、显式能力、人类队列、检查点、事件和审计记录的可调度执行主体。其核心设计原则是：工具是类似 libc 的包装器；运行时原语构成了授权边界。文件系统访问、对象访问、休眠操作、人类批准、JIT 工具注册以及外部副作用均在原语边界处，依据显式能力和策略进行检查。本文描述了该系统的设计、威胁模型、Python 原型以及面向安全性的评估。当前原型实现了异步调度、命名空间本地 Object Memory、运行时集成的人类批准、一次性权限授予、每进程工作目录、壳与镜像注册原语、基于 libOS 系统调用代理的 Deno/TypeScript JIT 工具、文件系统/对象桥接工具、可注入的 Resource Provider Substrate、确定性演示、真实模型烟雾测试脚本，以及写作时的 123 项回归测试。与提高规划器准确性不同，Agent libOS 展示了一种运行时基础架构，在此架构中，长期运行的 LLM 代理可以被调度、授权、恢复和审计，而无需将工具调度视为信任边界。

Abstract

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents. Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is that tools are libc-like wrappers, while runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary.
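
The idea that authority is checked at the primitive rather than in the tool can be sketched as follows. This is a hedged illustration only: the class and function names (`AgentProcess` as modelled here, `read_file_primitive`, `read_tool`, the `"fs.read"` capability string) are hypothetical and are not taken from the Agent libOS prototype, and an in-memory dict stands in for the filesystem.

```python
from dataclasses import dataclass, field

class CapabilityError(PermissionError):
    """Raised when a process lacks the capability a primitive requires."""

@dataclass
class AgentProcess:
    # Hypothetical, heavily simplified stand-in for the paper's AgentProcess.
    pid: int
    capabilities: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

def read_file_primitive(proc, path, files):
    """Runtime primitive: the authority check and the audit record happen here."""
    allowed = "fs.read" in proc.capabilities
    proc.audit_log.append(("fs.read", path, "allowed" if allowed else "denied"))
    if not allowed:
        raise CapabilityError(f"process {proc.pid} lacks fs.read")
    return files[path]

def read_tool(proc, path, files):
    """Tool: a thin wrapper with no authority of its own."""
    return read_file_primitive(proc, path, files)

files = {"/notes.txt": "hello"}  # in-memory stand-in for a filesystem
trusted = AgentProcess(pid=1, capabilities={"fs.read"})
untrusted = AgentProcess(pid=2)

print(read_tool(trusted, "/notes.txt", files))
try:
    read_tool(untrusted, "/notes.txt", files)
except CapabilityError as err:
    print("denied:", err)
```

Because the check lives in the primitive, a newly generated tool cannot bypass it by calling the primitive directly; it is subject to the same capability test and leaves the same audit record.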

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

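Scoring rationale: The paper proposes Agent libOS, a library-OS-style runtime for long-running LLM agents whose core contributions are scheduling, capability control, and auditing mechanisms. It has no direct connection to the model-architecture or algorithmic topics in the keyword set (unified models, tokenizers, visual encoders, world models, multimodal LLMs, model-based RL), so the relevance scores are low. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus was added.

Keywords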

Agent libOS, Library-OS, LLM Agents, Capability-Controlled, Runtime Substrate, Long-Running, Tool Dispatch, Audit Records

32. Adaptive Causal Alignment for High-Confidence Adversarial Training [FAIL]

Score: 7.5 / 27.8

Authors: Zhiming Luo, Kejia Zhang, Yingxin Lai, Junwei Wu, Juanjuan Weng, Shaozi Li

Published: 2026-06-02

TL;DR: This paper proposes a High-Confidence Causally Aligned Training (HICAT) framework to improve adversarial robustness by disentangling foreground semantics from background visual biases, achieving better generalization on vision datasets.

Abstract (Chinese translation)

逆对抗训练 (Inverse Adversarial Training) 利用高置信度预测来稳定鲁棒学习，但我们发现了一个关键悖论：高置信度往往源于对非因果背景相关性的过拟合，而非内在对象语义。我们的研究表明，视觉上下文作为一种双重性质的信号，既可以是必要的支持性先验，也可以是虚假混杂因子。这一洞察揭示了现有盲目抑制策略的缺陷，因为它们不可避免地导致严重的特征损失 (Feature Loss)。为了解决这一问题，我们提出高置信度因果对齐训练 (HICAT)，这是一个建立语义平衡 (Semantic Equilibrium) 的统一框架。HICAT 基于“测量 - 去偏 - 对齐” (Measure-Debias-Align) 流程，整合了可学习背景偏差估计器 (LBBE) 以自适应地诊断上下文效用。在此诊断指导下，自适应去偏 (Adaptive Debiasing) 机制执行精细的 logit 修正，并辅以基于几何的前景 logit 正交增强 (FLOE) 损失，以强制执行严格的特征解耦。在 CIFAR-10、CIFAR-100 和 ImageNet-1K 上的广泛实验表明，HICAT 在各种架构 (CNNs 和 ViTs) 上始终优于匹配的基线，同时显著减少了鲁棒泛化差距。

Abstract

Inverse adversarial training leverages high-confidence predictions to stabilize robust learning, yet we uncover a critical paradox: high confidence often stems from overfitting to non-causal background correlations rather than intrinsic object semantics. Our investigation reveals that visual context functions as a dual-natured signal, serving as either a necessary supportive prior or a spurious confounder. This insight renders existing blind suppression strategies flawed, as they inevitably lead to severe Feature Loss. To resolve this, we propose High-Confidence Causally Aligned Training (HICAT), a unified framework that establishes a Semantic Equilibrium. Operating on a "Measure-Debias-Align" pipeline, HICAT integrates a Learnable Background-Bias Estimator (LBBE) to adaptively diagnose context utility. Guided by this diagnosis, an Adaptive Debiasing mechanism performs surgical logit rectification, complemented by a geometrically grounded Foreground Logit Orthogonal Enhancement (FLOE) loss to enforce rigorous feature disentanglement. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that HICAT consistently improves over matched baselines across diverse architectures (CNNs and ViTs) while significantly reducing the robust generalization gap.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on adversarial training and causal alignment in computer vision (HICAT framework), which has minimal overlap with the provided keywords targeting multimodal LLMs, world models, and reinforcement learning. 'Visual Encoder' has slight relevance due to CNN/ViT usage, and 'Unify Models' loosely matches the 'unified framework' terminology, but core concepts like Tokenizer, MLLM, MultiModal, World Models, and RL are absent.

Keywords

Adversarial Training, Causal Alignment, High-Confidence, Feature Disentanglement, Background Bias, Robust Generalization, Visual Context, HICAT Framework

33. Quantifying Faithful Confidence Expression in Large Reasoning Models [FAIL]

Score: 6.0 / 27.8

Authors: Areeb Gani, Asal Meskin, Gabrielle Kaili-May Liu, Arman Cohan

Published: 2026-06-02

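TL;DR: This paper proposes a framework for quantifying faithful confidence expression in large reasoning models and finds that reasoning behaviour does not automatically improve confidence faithfulness and that prior evaluation methods are fragile.

Abstract (Chinese translation)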

可靠的不确定性传达对于大语言模型（LLMs）的可信性至关重要，然而忠实校准（FC）——即模型内在置信度与（语言上）表达置信度之间的对齐——却是一种持续存在的故障模式。这一挑战对于大型推理模型（LRMs）尤为关键，因为它们的扩展推理轨迹常被用户解读为审慎推理、能力与自信的证据。尽管 FC 的重要性显而易见且 LRMs 被广泛应用，但 LRMs 在多大程度上能够忠实表达其置信度仍知之甚少。此外，衡量 FC 的主流范式难以很好地泛化到 LRMs 生成的长思维链（CoT）输出；这些输出往往缺乏清晰的步骤边界，涉及不一致的步骤结构，并在整个轨迹中编码复杂的条件依赖——这使得内在置信度的估计变得复杂。为应对这一挑战，我们提出了一种新颖的框架，旨在系统性地量化大型推理模型（LRMs）的忠实校准（FC）。该框架基于词元概率、隐藏状态和采样响应一致性，分析了相对于三种内部不确定性来源的语言决断性。此外，我们还设计了一种前缀条件采样方法，以控制不同轨迹间的条件与结构变异。将该框架应用于一系列领先的模型、数据集和提示后，我们发现，忠实表达置信度对 LRMs 而言仍是一个重大挑战。推理行为并不会自动转化为更优的 FC，且针对非推理模型的提示干预也无法在推理场景中提高其忠实性。不同的置信度估计器对同一轨迹还产生了分歧的评估，揭示了先前评估方法论的脆弱性。综上所述，我们的工作确立了 FC 作为大型推理模型（LRMs）一个独立的可靠性与对齐目标，尤其是当此类系统日益被部署于高风险情境之中时。

Abstract

Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

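Scoring rationale: The paper focuses on confidence calibration and uncertainty expression in large reasoning models (LRMs); its core contribution is a framework for quantifying faithful confidence expression. Its content is largely unrelated to the keyword set: it does not involve visual encoders, world models, multimodal architectures (MLLM/MultiModal), or reinforcement learning (model-based RL), so these five keywords score 0. 'Tokenizer' scores a low 2.0 because token probabilities are mentioned only in the analysis and tokenizer design is not discussed. 'Unify Models' scores a low 2.0 because it is only weakly related to the way different confidence estimators are brought together. The weighted total is just 6.0, far below the dynamic pass threshold of 27.8.

Keywords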

Large Reasoning Models, Faithful Confidence Expression, Uncertainty Communication, Chain-of-Thought, Calibration, Token Probabilities, Hidden States, Sampling Consistency

34. Correcting Neural Operator Spectral Bias via Diffusion Posterior Sampling with Sparse Observations [FAIL]

Score: 6.0 / 27.8

Authors: Niccolò Perrone, Fanny Lehmann, Stefania Fresca, Filippo Gatti

Published: 2026-06-02

TL;DR: The paper proposes FreqNO-DPS to correct spectral bias in Neural Operators for PDE solving by integrating diffusion posterior sampling with frequency-dependent guidance using sparse observations, achieving accurate wavefield prediction.

Abstract (Chinese translation)

神经算子代理模型（NO）近似求解偏微分方程（PDE）的速度比数值求解器快几个数量级，但存在频谱偏差：高频成分会被系统性衰减，这在细尺度结构至关重要的地方限制了可靠性。场的稀疏传感器测量通常也可用，提供逐点精度且无频谱失真，但仅覆盖了计算域的一小部分。我们通过将 NO 预测视为扩散后验采样框架中的辅助观测来解决这一问题。我们的方法，FreqNO-DPS（https://github.com/niccoloperrone/FreqNO-DPS），结合了一个在高保真模拟上训练的无条件基于分数的扩散先验，以及以稀疏观测为条件并由冻结的神经算子引导的扩散后验采样（DPS）。朴素集成会重新引入代理模型的频谱偏差；我们通过一个闭式、频谱形状的引导分数来解决这一问题，该分数根据代理模型的频率依赖性对其进行加权，且无需去噪器反向传播。无分布分析界定了频率 - 扩散 - 时间平面上的近似误差，并表明无论分布假设如何，引导的频率依赖性均得以保留。在 5% 和 2% 传感器覆盖率下的三维弹性波场预测中，该方法在所有频带达到近零频谱偏差，而代理模型和仅传感器的 DPS 均显示系统性高频衰减。各向同性引导作为自然基线，提高了逐点精度，但几乎完整地将偏差带入后验分布，证实了频率依赖性校准是必要的，而不仅仅是有益的。该框架仅需配对的代理/参考数据，除了残差的近似谱对角性外，不利用任何特定问题结构，可通过我们提供的相干性诊断对新代理模型进行验证。

Abstract

Neural operator surrogates (NO) approximate PDE solutions orders of magnitude faster than numerical solvers, but suffer from spectral bias: high-frequency content is systematically attenuated, limiting reliability where fine-scale structure matters. Sparse sensor measurements of the field are often available too, offering pointwise accuracy without spectral distortion but covering only a small fraction of the domain. We address this by treating NO predictions as auxiliary observations in a diffusion posterior sampling framework. Our method, FreqNO-DPS (https://github.com/niccoloperrone/FreqNO-DPS), combines an unconditional score-based diffusion prior, trained on high-fidelity simulations, with diffusion posterior sampling (DPS) conditioned on sparse observations and guided by a frozen neural operator. Naive integration reintroduces the surrogate's spectral bias; we resolve this with a closed-form, spectrally shaped guidance score that weights the surrogate by its frequency-dependent accuracy and needs no denoiser backpropagation. A distribution-free analysis bounds the approximation error across the frequency-diffusion-time plane and shows the guidance's frequency dependence is preserved regardless of distributional assumptions. On 3D elastic wavefield prediction at 5% and 2% sensor coverage, the method reaches near-zero spectral bias across all bands, where both the surrogate and sensor-only DPS show systematic high-frequency attenuation. Isotropic guidance, the natural baseline, improves pointwise accuracy but carries the bias into the posterior nearly intact, confirming that frequency-dependent calibration is essential, not merely beneficial. The framework needs only paired surrogate/reference data and exploits no problem-specific structure beyond the residual's approximate spectral diagonality, verifiable for new surrogates via the coherence diagnostic we provide.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

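Scoring rationale: The paper combines neural operators with diffusion models for scientific computing, addressing spectral bias in PDE solving. The keywords mainly concern multimodal large models (MLLM), reinforcement learning (RL), and foundation-model components (Tokenizer, Visual Encoder). The paper involves no text tokenization, visual encoders, or RL framework, so it is unrelated to Tokenizer, Visual Encoder, MLLM, and model-based RL. It touches on model unification and generative world modelling, but in a different context, so relevance is very low.

Keywords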

Neural Operator, Spectral Bias, Diffusion Posterior Sampling, Sparse Observations, PDE Solving, Frequency-dependent Guidance, Wavefield Prediction

35. q0: Primitives for Hyper-Epoch Pretraining [FAIL]

Score: 4.5 / 27.8

Authors: Bishwas Mandal, Shmuel Berman, Akshay Vegesna, Samip Dahal

Published: 2026-06-02

TL;DR: The paper introduces hyper-epoch pretraining (q0), which utilizes a population of diverse models and chain distillation to achieve superior data efficiency and validation loss compared to single-model pretraining within constrained compute budgets.

Abstract (Chinese translation)

随着算力增长速度超过高质量文本的供应，多轮训练正逐渐成为标准。然而，单个模型的预训练在几次迭代后便会达到饱和，远早于计算预算耗尽之时。我们认为，这需要概念上的转变，即从训练单个模型转向探索一个模型群体并聚合它们的预测。我们提出超 epoch 预训练（hyper-epoch pretraining，q0），它将多轮训练预算转化为一个多样化的模型群体，其组合预测的验证损失低于单个精炼模型。q0 可归结为三个核心原语。采用具有反相关学习率与权重衰减的循环调度，从几条并行轨迹中收集多样化的模型。链式蒸馏（Chain distillation）将每个模型与其前驱模型进行对抗训练，从而使模型质量在群体中累积增强。一个在保留集（held out set）上拟合的学习到的先验，可根据任意推理预算选择和加权成员。在基于 1 亿 FineWeb token 训练的 18 亿参数模型上，q0 仅使用约 56 个 epoch（约减少 4.6 倍）即可匹配强大的 256 个 epoch 集成基线；若匹配基线的集成规模，则仅需约 67 个 epoch（约减少 3.8 倍），且性能持续超越该基线。在 Slowrun 设置下，这些增益累积达到约 12.9 倍的数据效率，并迁移至下游基准测试。至关重要的是，最优分配随预算变化而变化，因此我们给出了指导性方案，说明如何花费给定的 epoch 预算以最大化泛化能力，范围从单个 epoch 直至最大预算。

Abstract

Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6× fewer), or ~67 epochs (~3.8× fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach cumulative ~12.9× data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
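
The first primitive, a cyclic schedule in which learning rate and weight decay move in opposite directions, can be sketched as below. The abstract does not give the schedule's shape or values, so the cosine form, the function name `cyclic_schedule`, and all default ranges are assumptions chosen purely for illustration.

```python
import math

def cyclic_schedule(step, cycle_len, lr_min=1e-4, lr_max=1e-3, wd_min=0.01, wd_max=0.1):
    """Return (learning_rate, weight_decay); the two are anti-correlated by construction."""
    phase = (step % cycle_len) / cycle_len                 # position in the cycle, in [0, 1)
    wave = 0.5 * (1.0 + math.cos(2.0 * math.pi * phase))   # 1 at cycle start, 0 at mid-cycle
    lr = lr_min + (lr_max - lr_min) * wave
    wd = wd_min + (wd_max - wd_min) * (1.0 - wave)         # high when the learning rate is low
    return lr, wd

# Print a few points across one assumed 100-step cycle.
for step in (0, 25, 50, 75):
    lr, wd = cyclic_schedule(step, cycle_len=100)
    print(f"step {step:3d}: lr={lr:.5f} wd={wd:.4f}")
```

In this sketch the learning rate is at its maximum when weight decay is at its minimum and vice versa; where in each cycle a model would be saved into the population is left open, since the abstract does not specify it.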

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on hyper-epoch pretraining for text model efficiency using model populations and distillation, lacking content on visual encoders, multimodality, world models, or RL. Tokenizer is tangential, and Unify Models is only weakly related to aggregation. Relevance to the specific keyword set is low.

Keywords

Hyper-Epoch Pretraining, Population of Models, Cyclic Schedule, Chain Distillation, Learned Prior, Data Efficiency, Multi-epoch Training

36. Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent [FAIL]

Score: 4.5 / 27.8

Authors: Carlo Wenig, Raoul-Martin Memmesheimer, Christian Klos

Published: 2026-06-02

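TL;DR: This study shows that quadratic integrate-and-fire neurons have smoother loss landscapes and better performance than leaky integrate-and-fire neurons when trained with spike-based gradient descent.

Abstract (Chinese translation)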

训练脉冲神经网络的能力对于模拟生物神经网络以及类脑计算都至关重要。然而，对于广泛使用的泄漏积分发放（LIF）神经元，任意微小的参数变化都可能引发脉冲的产生或消失，进而破坏后续活动，导致在精确基于脉冲的梯度下降过程中出现不稳定的神经表示以及永久静默的神经元。近期研究表明，一类包含二次积分发放（QIF）神经元的神经元模型可避免这些不连续性，从而实现连续甚至平滑的基于脉冲的梯度下降。然而，这些优势是否能在实践中得到体现尚不明确。本文通过流行脉冲海德堡数字数据集（Spiking Heidelberg Digits）上 LIF 与 QIF 神经元网络的受控比较，证明了这一点。具体而言，首先，我们进行了全面的超参数搜索以优化两种模型，结果揭示了 QIF 神经元明显的性能优势。其次，我们可视化了损失和梯度景观。与它们较差的性能表现一致，我们发现 LIF 神经元的损失景观（因其不连续性）显得更为碎片化，且相关的梯度更为波动不定。对单个样本景观的分析表明，这些特征源于脉冲时间顺序的变化，而这往往会导致破坏性的脉冲产生或消失。总体而言，我们的结果主张在梯度下降训练中，用具有连续脉冲动力学的神经元模型（如 QIF 神经元）取代 LIF 神经元。

Abstract

The ability to train spiking neural networks is essential for modeling biological neural networks as well as for neuromorphic computing. However, for the extensively used leaky integrate-and-fire (LIF) neurons, arbitrarily small parameter changes can induce spike (dis)appearances that disrupt subsequent activity, leading to unstable neural representations and permanently silent neurons during exact spike-based gradient descent. Recent work shows that a class of neuron models, which includes the quadratic integrate-and-fire (QIF) neuron, avoids these discontinuities and enables continuous and even smooth spike-based gradient descent. However, it remains unclear whether these advantages translate into practice. Here, we demonstrate that they do so via a controlled comparison between networks of LIF and QIF neurons on the popular Spiking Heidelberg Digits dataset. Specifically, in a first step, we perform a thorough hyperparameter search to optimize both models, revealing a clear performance advantage of QIF neurons. In a second step, we visualize the loss and gradient landscapes. Consistent with their inferior performance, we find that the loss landscapes of LIF neurons, which are discontinuous, appear more fragmented and the related gradients more erratic. An analysis of the landscapes of single samples indicates that these features arise from changes in the temporal order of spikes, which often cause disruptive spike (dis)appearances. Overall, our results advocate replacing LIF neurons with neuron models exhibiting continuous spiking dynamics, such as QIF neurons, for gradient descent training.
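
To make the two neuron models concrete, here is a minimal single-neuron sketch. It assumes the common normalized forms $\dot V = -V + I$ for LIF and $\dot V = V^2 + I$ for QIF, with forward-Euler integration and constant input. The threshold and reset values are hypothetical, and the finite QIF bounds stand in for the divergence to infinity. It illustrates the dynamics only and does not reproduce the paper's networks or training.

```python
def simulate(model, current, t_max=20.0, dt=1e-3):
    """Forward-Euler simulation of one neuron with constant input; returns spike times."""
    if model == "lif":
        threshold, reset = 1.0, 0.0       # hypothetical normalized values
    else:
        threshold, reset = 10.0, -10.0    # finite stand-ins for +/- infinity
    v, spikes = 0.0, []
    for k in range(int(t_max / dt)):
        # LIF leaks toward `current`; the QIF quadratic term drives the spike upswing.
        dv = (-v + current) if model == "lif" else (v * v + current)
        v += dt * dv
        if v >= threshold:
            spikes.append((k + 1) * dt)   # time at the end of this Euler step
            v = reset
    return spikes
```

For example, `simulate("lif", 0.9)` never reaches threshold, while `simulate("qif", 1.0)` fires periodically because the quadratic dynamics have no fixed point for positive input.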

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

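Scoring rationale: The paper centers on comparing neuron models (QIF vs. LIF) for spiking neural networks (SNNs) and on gradient-descent optimization, which is largely unrelated to the keyword set's focus on multimodal large language models (MLLM), world models, and reinforcement learning (RL). It receives minimal relevance scores only for touching on visual data (MultiModal, Visual Encoder) and model comparison (Unify Models); the remaining keywords (Tokenizer, World Models, model-based RL) are entirely unrelated. The author list contains none of the specified experts, so no bonus is added. The weighted total is far below the passing threshold.

Keywords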

Spiking Neural Networks, Quadratic Integrate-and-Fire, Leaky Integrate-and-Fire, Gradient Descent, Loss Landscapes, Spike-based Training, Spiking Heidelberg Digits

37. Exploring Easy Boosts for Lidar Semantic Scene Completion [FAIL]

Score: 4.5 / 27.8

Authors: Tetiana Martyniuk, Jonathan Seele, Alexandre Boulch, Gilles Puy, Renaud Marlet, Raoul de Charette

Published: 2026-06-02

TL;DR: This paper proposes simple input enhancements like semantic pseudo-labels and visibility information to boost Lidar Semantic Scene Completion performance without complex architectural changes.

Abstract (Chinese translation)

本文探讨了“免费午餐”策略，旨在提升 LiDAR 语义场景完成（SSC）的性能，而无需进行复杂的架构重构。我们首先证明，通过利用现成分割器提供的语义伪标签对输入点云进行增强，可显著提高现有架构的性能。通过与理想参照（oracle）对比评估这些模型，我们确立了高质量语义先验是 mIoU 提升的主要驱动因素。此外，我们为输入 LiDAR 扫描配备了可见性信息，以区分空空间和未知空间，从而在所测试的架构中提供了次要的性能提升。利用这些简单增强，我们发现较老模型仍能与最先进系统保持竞争力，甚至能超越它们。我们的代码可在 https://github.com/astra-vision/SSC-Priors 获取。

Abstract

This paper investigates "free lunch" strategies to boost the performance of lidar semantic scene completion (SSC) without requiring complex architectural redesigns. We first demonstrate that endowing input point clouds with semantic pseudo-labels from off-the-shelf segmentors significantly improves the performance of existing architectures. By evaluating these models against an oracle, we establish that high-quality semantic priors are a primary driver of mIoU gains. Furthermore, we equip the input lidar scan with visibility information that distinguishes between empty and unknown spaces, which provides a secondary performance boost across the tested architectures. Using these simple enhancements, we observe that older models remain competitive with state-of-the-art systems, and can even outperform them. Our code is available at https://github.com/astra-vision/SSC-Priors.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Lidar Semantic Scene Completion using input priors (semantic pseudo-labels, visibility) rather than Unify Models, Tokenizers, World Models, MLLM, or Model-based RL. It is a 3D perception task with minimal overlap with the provided multimodal/RL keyword set.

Keywords

Lidar, Semantic Scene Completion, Pseudo-labels, Visibility Information, Point Clouds, Performance Boost, Off-the-shelf Segmentors

38. Value-Aware Stochastic KV Cache Eviction for Reasoning Models [FAIL]

Score: 3.0 / 27.8

Authors: Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia

Published: 2026-06-02

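TL;DR: This paper proposes Value-aware Stochastic KV Cache Eviction (VaSE), which protects large-magnitude value states and increases cache diversity to ease the memory bottleneck caused by long outputs in reasoning models while improving accuracy.

Abstract (Chinese translation)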

推理模型通过扩展思维链（Chain of Thought）来提高准确性，但它们的长输出造成了内存和计算瓶颈。KV 缓存（KV Cache）驱逐方法通过从缓存中驱逐不重要的键值对来降低这种成本，但它们通常比基于选择的稀疏注意力替代方案产生更差的准确性，后者保留完整的 KV 缓存。我们确定了影响 KV 缓存驱逐准确性的关键因素。首先，一小部分值状态具有异常大的幅值，驱逐它们会导致灾难性故障，使模型陷入重复推理循环。其次，在驱逐过程中引入随机性通过增加缓存多样性来提高准确性。基于这些发现，我们提出了感知值随机 KV 缓存驱逐（VaSE, Value-aware Stochastic KV Cache Eviction），这是一种无需训练的方法，旨在保护大值幅值状态并促进多样化的驱逐决策。在六个推理任务中，使用 VaSE 进行 4 倍 KV 缓存压缩的 Qwen3 模型在相同稀疏度下比最先进（SOTA）的选择方法获得更高的平均准确率，同时比最强的驱逐方法高出 4% 以上。总体而言，VaSE 弥合了效率与准确性之间的差距，支持 FlashAttention2，并为推理模型实现了静态内存占用。

Abstract

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than the SOTA selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
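
The two findings (protect large-magnitude value states, add randomness to eviction) can be combined in a toy selection routine. The sketch below is not the VaSE algorithm, whose details the abstract does not give. The protected fraction, the use of a per-entry importance score, and the weighted-sampling rule (Efraimidis-Spirakis keys) are all assumptions chosen for illustration.

```python
import math
import random

def evict(values, scores, budget, protect_frac=0.05, seed=0):
    """Return the sorted indices of cache entries to keep.

    values: list of value-state vectors; scores: non-negative importance per entry.
    """
    rng = random.Random(seed)
    n = len(values)
    if budget >= n:
        return list(range(n))
    norms = [math.sqrt(sum(x * x for x in v)) for v in values]
    n_protect = min(budget, max(1, int(protect_frac * n)))
    by_norm = sorted(range(n), key=lambda i: norms[i], reverse=True)
    protected = set(by_norm[:n_protect])            # never evict the largest value states
    # Weighted sampling without replacement: draw u ~ U(0, 1) per entry and
    # keep the entries with the largest u ** (1 / weight).
    rest = [i for i in range(n) if i not in protected]
    keys = {i: rng.random() ** (1.0 / (scores[i] + 1e-8)) for i in rest}
    sampled = sorted(rest, key=lambda i: keys[i], reverse=True)[: budget - n_protect]
    return sorted(protected | set(sampled))
```

With a 4x compression target, `budget` would be a quarter of the cache length. A deterministic top-k policy corresponds to replacing the sampled keys with the raw scores.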

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

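Scoring rationale: The paper focuses on KV cache efficiency for reasoning models and has no direct connection to world models, reinforcement learning, or visual encoders. Although it builds on Qwen3 models, it makes no core multimodal contribution and does not address architecture unification, so the MLLM and MultiModal scores are very low. No authors from the specified expert list were found.

Keywords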

KV Cache Eviction, Reasoning Models, Value-aware Stochastic, Qwen3 Models, Memory Bottleneck, FlashAttention2, Static Memory Footprint, Inference Optimization

39. Contrastive Neural Algorithmic Reasoning for Graph Coloring [FAIL]

Score: 3.0 / 27.8

Authors: Thien Le, Tianyu Zhao, Melanie Weber

Published: 2026-06-02

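TL;DR: This paper proposes a framework based on graph neural networks and contrastive learning that learns transferable coloring geometry, enabling effective generalization across graph sizes in graph coloring.

Abstract (Chinese translation)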

图着色 (Graph coloring) 旨在为图的节点分配颜色，使得相邻节点获得不同的颜色，并尽可能减少使用的颜色数量。在此，我们研究近似 $k$-着色 (approximate $k$-coloring)，其目标是在使用不超过 $k$ 种颜色的同时，最小化同色边的数量。该问题是图论的核心问题，并在调度和资源分配等领域具有应用。最近的无监督 GNN 方法直接优化每个实例，从而无法在不同图规模和分布之间进行泛化。相反，我们提出了一种对比学习 (contrastive learning) 框架，该框架学习可迁移的着色几何结构，其中同色节点的嵌入相互对齐，而相邻节点的表示被推向不同的方向。我们分析了在规模有界的图上得到的总体目标函数。对于单位范数嵌入，我们证明其最优解具有线原型结构 (line-prototype structure)：同色节点的表示坍缩到一个共享的一维子空间中，而边连接正交子空间。这种几何结构在监督设置下产生平稳性条件，并在平衡着色假设下被投影次梯度动力学 (projected subgradient dynamics) 所保持。在未归一化变体中，梯度下降 (gradient descent) 具有由商图硬边距问题 (quotient-graph hard-margin problem) 控制的最大边距偏差。在合成图和真实图上的实验表明，对比 GNN 编码器 (contrastive GNN encoders) 能有效泛化并产生低冲突的着色，其表现与贪心方法相当，有时甚至优于贪心方法。

Abstract

Graph coloring seeks to assign colors to a graph's nodes so that adjacent nodes receive different colors, using as few colors as possible. Here, we study approximate $k$-coloring, where the goal is to use at most $k$ colors while minimizing the number of monochromatic edges. This problem is central to graph theory and has applications in areas such as scheduling and resource allocation. Recent unsupervised GNN approaches optimize each instance directly, precluding generalization across graph sizes and distributions. We instead propose a contrastive learning framework that learns transferable coloring geometry where the embeddings of same-color nodes align, while adjacent nodes' representations are pushed toward distinct directions. We analyze the resulting population objective over bounded-size graphs. For unit-norm embeddings, we show that its optima have a line-prototype structure: Representations of nodes of the same color collapse to a shared one-dimensional subspace, and edges connect orthogonal subspaces. This geometry yields stationarity conditions in the supervised setting and is preserved by projected subgradient dynamics under a balanced-coloring assumption. In an unnormalized variant, gradient descent has a max-margin bias governed by a quotient-graph hard-margin problem. Experiments on synthetic and real-world graphs show that contrastive GNN encoders generalize effectively and produce low-conflict colorings, matching and sometimes improving on greedy approaches.
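
The approximate $k$-coloring objective and a greedy reference point can be written in a few lines. This is a generic sketch: the function names are illustrative, and the greedy rule shown (pick the color least used among already-colored neighbors, in node order) is one simple variant, not necessarily the greedy baseline used in the paper.

```python
def count_conflicts(edges, coloring):
    """Number of monochromatic edges (both endpoints share a color)."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])

def greedy_k_coloring(n, edges, k):
    """Color nodes 0..n-1 in order, choosing the color least used by colored neighbors."""
    neighbors = {i: [] for i in range(n)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    coloring = {}
    for node in range(n):
        used = [0] * k
        for nb in neighbors[node]:
            if nb in coloring:
                used[coloring[nb]] += 1
        coloring[node] = used.index(min(used))   # ties go to the lowest color index
    return coloring

triangle = [(0, 1), (1, 2), (0, 2)]
print(count_conflicts(triangle, greedy_k_coloring(3, triangle, 3)))  # 0
print(count_conflicts(triangle, greedy_k_coloring(3, triangle, 2)))  # 1
```

A triangle needs three colors, so with $k = 2$ at least one edge must stay monochromatic, which is the quantity the objective minimizes.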

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

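Scoring rationale: The paper focuses on graph coloring and graph neural networks, involving contrastive and representation learning. The keyword set mainly covers multimodality, world models, and reinforcement learning, which do not directly overlap with the paper's topic (graph algorithmic reasoning). 'Unify Models' is only marginally related; the other keywords, such as Tokenizer, Visual Encoder, and MLLM, do not apply.

Keywords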

Graph Coloring, Contrastive Learning, Graph Neural Networks, Algorithmic Reasoning, Transferable Embeddings, Generalization, Line-Prototype Structure

40. Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering [FAIL]

Score: 3.0 / 27.8

Authors: Xianliang Li, Zihan Zhang, Weiyang Liu, Han Bao

Published: 2026-06-02

TL;DR: This paper theoretically analyzes the role of momentum in the Muon optimizer for LLM training, demonstrating that it acts as a spectral filter to suppress perturbations and stabilize gradient updates.

Abstract (Chinese translation)

Muon 最近在（大语言模型（LLM））训练中展现出强大的实证性能，但 Muon 中动量的理论作用尚不明确。现有的 Muon 分析要么移除动量以孤立地研究谱更新，要么保留动量却未解释为何它能提升实证性能。本研究通过揭示 Muon 中的动量充当谱滤波器，弥合了这一差距。在结构化的信号加扰动梯度模型下，我们证明动量能够抑制扰动同时保留主导信号，从而扩大两者之间的谱间隙。这一扩大的间隙稳定了传递给 Muon 正交化步骤的矩阵的奇异子空间，从而使所得更新更加可靠。我们进一步表明，在正交化之前应用动量，可实现与梯度信号分量更强的可证明对齐，优于反转此顺序或直接移除动量。涵盖多样化任务（包括 LLM 预训练）的实验支持了我们的理论分析。总体而言，我们的理论为理解其他基于矩阵的优化器中动量的益处提供了出发点。

Abstract

Muon has recently demonstrated strong empirical performance in large language model training, but the theoretical role of momentum in Muon remains unclear. Existing analyses of Muon either remove momentum to study spectral updates in isolation, or retain momentum without explaining why it improves empirical performance. Our work bridges this gap by showing momentum in Muon acts as a spectral filter. Under a structured signal-plus-perturbation gradient model, we prove that momentum suppresses perturbations while preserving the dominant signal, thereby enlarging the spectral gap between them. This enlarged gap stabilizes the singular subspaces of the matrix passed to Muon's orthogonalization step, making the resulting update more reliable. We further show that applying momentum before orthogonalization achieves provably stronger alignment with the signal component of the gradient than either reversing this order or simply removing momentum. Experiments across diverse tasks, including LLM pretraining, support our theoretical analysis. More broadly, our theory offers a starting point for understanding the benefits of momentum in other matrix-based optimizers.
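
The "momentum first, orthogonalize second" ordering can be sketched with NumPy. Two simplifications are assumed here. First, orthogonalization is done with an exact SVD (replacing every singular value by 1), whereas Muon approximates this step with a Newton-Schulz iteration. Second, the momentum rule and `beta` value are generic choices, not the paper's settings.

```python
import numpy as np

def orthogonalize(m):
    """Replace all singular values of m by 1, giving the polar factor U @ Vt."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def momentum_then_orthogonalize(grads, beta=0.9):
    """Accumulate momentum over a sequence of gradient matrices, then orthogonalize."""
    buf = np.zeros_like(grads[0])
    for g in grads:
        buf = beta * buf + g     # averaging damps step-to-step perturbations
    return orthogonalize(buf)

rng = np.random.default_rng(0)
signal = np.outer(rng.standard_normal(6), rng.standard_normal(4))   # rank-1 "signal"
grads = [signal + 0.1 * rng.standard_normal((6, 4)) for _ in range(20)]
update = momentum_then_orthogonalize(grads)
```

Reversing the order would orthogonalize each noisy gradient before averaging, which is the alternative the abstract says aligns less well with the signal component.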

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on the theoretical analysis of the Muon optimizer (momentum and spectral filtering) for LLM pretraining. It does not address multimodal architectures, world models, reinforcement learning, tokenizers, or visual encoders, leading to low relevance with the provided keyword set which targets multimodal/RL domains.

Keywords

Muon, Momentum, Spectral Filtering, LLM Pretraining, Optimization, Orthogonalization, Gradient Signal

41. Beyond Gradient Descent: Adam for Analog Ising Machines [FAIL]

Score: 1.5 / 27.8

Authors: Stijn Van Vooren, Guy Van der Sande, Guy Verschaffelt

Published: 2026-06-02

TL;DR: This paper investigates continuous-time Adam optimization for analog Ising machines, demonstrating that it significantly reduces time-to-target and improves solution quality on Max-Cut benchmarks compared to gradient-descent dynamics.

Abstract (Chinese translation)

随着摩尔定律（Moore's law）触及极限，伊辛机（Ising machines）为求解困难的优化问题提供了一种有前景的替代计算方案。然而，许多模拟、时间连续的伊辛机依赖类似梯度下降的动力学来寻找解，这可能会限制其速度和鲁棒性。我们探究动量（momentum）和 Adam 优化算法是否能改进这些系统。由于这些优化器传统上是在离散时间框架下构建的，我们推导出适用于模拟、时间连续伊辛机动力学的连续时间版本。在最大割（Max-Cut）基准测试中，我们发现基于 Adam 的动力学相较于基于梯度下降和动量的动力学，显著缩短了目标达成时间并提高了解的质量。我们进一步引入了一种 Adam 的一阶连续时间近似，旨在为未来的物理实现提供一个更简单的起点，且在连续时间设定下其性能优于完整的 Adam 公式。我们还研究了一种纯算法的离散时间设置，其中在较易的问题实例上性能差距有所缩小，而基于 Adam 的更新规则在较难的加权问题实例上表现最佳。这些结果表明，连续时间 Adam 动力学可被视为模拟伊辛机的一个强大设计原则。

Abstract

As Moore's law reaches its limits, Ising machines offer a promising alternative computing approach for difficult optimization problems. However, many analog, time-continuous Ising machines rely on gradient-descent-like dynamics to find solutions, which can limit speed and robustness. We investigate whether momentum and Adam optimization can improve these systems. Since these optimizers are traditionally formulated in discrete time, we derive continuous-time versions suitable for analog, time-continuous Ising-machine dynamics. On Max-Cut benchmarks, we find that Adam-based dynamics substantially reduce time-to-target and improve solution quality compared with gradient-descent- and momentum-based dynamics. We further introduce a first-order continuous-time approximation of Adam that is intended as a simpler starting point for future physical implementations while also performing better than the full Adam formulation in a continuous-time setting. We also study a purely algorithmic discrete-time setting, where the performance gap is reduced on easier problem instances, while the Adam-based update rule performs best on harder weighted problem instances. These results identify continuous-time Adam dynamics as a powerful design principle for analog Ising machines.
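
The discrete-time setting mentioned at the end of the abstract can be illustrated with standard Adam applied to a relaxed Max-Cut energy. Everything specific in this sketch is an assumption: the soft spins $s_i = \tanh(x_i)$, the energy $H = \sum_{i<j} w_{ij} s_i s_j$, the hyperparameters, and the final sign rounding. It is not the continuous-time dynamics derived in the paper.

```python
import math
import random

def cut_value(weights, spins):
    """Total weight of edges whose endpoints have opposite spins (+1 / -1)."""
    n = len(spins)
    return sum(weights[i][j] * (1 - spins[i] * spins[j]) / 2
               for i in range(n) for j in range(i + 1, n))

def adam_maxcut(weights, steps=500, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, seed=0):
    """Minimize the relaxed Ising energy with Adam, then round to +/-1 spins."""
    rng = random.Random(seed)
    n = len(weights)
    x = [rng.uniform(-0.1, 0.1) for _ in range(n)]
    m, v = [0.0] * n, [0.0] * n
    for t in range(1, steps + 1):
        s = [math.tanh(xi) for xi in x]                      # soft spins for this step
        for i in range(n):
            field = sum(weights[i][j] * s[j] for j in range(n) if j != i)
            g = field * (1.0 - s[i] ** 2)                    # dH/dx_i via the chain rule
            m[i] = b1 * m[i] + (1 - b1) * g                  # first-moment estimate
            v[i] = b2 * v[i] + (1 - b2) * g * g              # second-moment estimate
            m_hat = m[i] / (1 - b1 ** t)                     # bias corrections
            v_hat = v[i] / (1 - b2 ** t)
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return [1 if xi >= 0 else -1 for xi in x]

ring = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]   # unweighted 4-cycle
spins = adam_maxcut(ring)
```

Minimizing $H$ is equivalent to maximizing the cut, since the cut equals $\sum_{i<j} w_{ij}(1 - s_i s_j)/2$ for spins in $\{-1, +1\}$.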

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on analog Ising machines and optimization algorithms (Adam, gradient descent) for solving Max-Cut problems. It has no relation to Multimodal Large Language Models (MLLM), tokenizers, visual encoders, world models, or reinforcement learning. While it unifies optimization strategies, it does not align with the 'Unify Models' context of AI architectures. Thus, relevance to the provided keyword set is negligible.

Keywords

Analog Ising Machines, Adam Optimization, Continuous-time Dynamics, Max-Cut Benchmarks, Gradient Descent, Optimization Problems, Hardware Computing

42. Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics [FAIL]

Score: 1.5 / 27.8

Authors: Mutsumi Sasaki, Go Kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

Published: 2026-06-02

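TL;DR: Language models compare quantities using heuristics over numerals and units rather than exact conversion, which leads to systematic errors near the comparison boundary.

Abstract (Chinese translation)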

带有测量单位的量（例如 110 cm 和 1.2 m）要求语言模型（LMs）将数值与符号单位尺度相结合。在此，我们研究 LMs 如何在涵盖多个单位制的受控环境中比较此类量。我们发现，在比较边界附近，准确率会下降，此时数值的微小变化决定了正确答案。由此产生的误差具有系统性：线性代理模型能够从数值差异和单位尺度差异线索中预测 LMs 的偏好，且在与这些变量对齐的子空间上施加因果干预会改变模型的输出。结果表明，LMs 是通过在数值和单位上应用一系列启发式策略来比较量，而非首先将两个表达式转换为精确的共同尺度表示。

Abstract

Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
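
For contrast with the heuristic behavior described above, the exact shared-scale procedure is easy to write down. The sketch converts both quantities to metres with exact rational arithmetic before comparing. The unit table and function names are illustrative and cover only a few length units.

```python
from fractions import Fraction

# Scale factors to metres; only a few length units are included here.
TO_METRES = {"mm": Fraction(1, 1000), "cm": Fraction(1, 100),
             "m": Fraction(1), "km": Fraction(1000)}

def to_metres(quantity):
    """Parse a string like '110 cm' into an exact length in metres."""
    number, unit = quantity.split()
    return Fraction(number) * TO_METRES[unit]

def compare(a, b):
    """Return '<', '>' or '=' after converting both quantities to metres."""
    x, y = to_metres(a), to_metres(b)
    return "<" if x < y else ">" if x > y else "="

print(compare("110 cm", "1.2 m"))   # <
print(compare("120 cm", "1.2 m"))   # =
```

The second call sits exactly on the comparison boundary, the region where the abstract reports that model accuracy degrades.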

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

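Scoring rationale: The paper mainly examines the heuristics language models use when comparing quantities, covering how numerals and units are handled. The provided keywords concentrate on multimodality, world models, and reinforcement learning, which overlap very little with the paper's content (text-only quantity reasoning). Only MLLM is weakly related because the paper involves language models; the other keywords, such as Visual Encoder and model-based RL, have no direct connection.

Keywords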

Language Models, Quantity Comparison, Measurement Units, Heuristics, Numerical Difference, Unit Scale, Systematic Errors

43. PixVOD: Pixel-Distributed Direct Visual Odometry and Depth Estimation [FAIL]

Score: 1.5 / 27.8

Authors: Shinjeong Kim, Ignacio Alzugaray, Callum Rhodes, Paul H. J. Kelly, Andrew J. Davison

Published: 2026-06-02

TL;DR: PixVOD proposes a pixel-level distributed framework for visual odometry and depth estimation using Gaussian Belief Propagation on sensor-processors to reduce data transmission and maintain geometric stability.

Abstract (Chinese translation)

由二维像素阵列构成的图像是计算机视觉算法的标准输入，然而许多底层计算可分布在像素之间。将原始、冗余且含噪的像素数据传输出传感器仍然效率低下，这促使人们转向焦平面传感器处理器（focal-plane sensor-processors），使其直接在每个像素内执行大部分计算。我们设想像素能够局部合成高层信号，从而减少下游负载，并为高层视觉任务提供更丰富的输入。我们提出了一种完全并行的视觉里程计（visual odometry）和深度估计（depth estimation）方法，分布于像素之间，其中传感器处理器通过高斯信念传播（Gaussian Belief Propagation, GBP）交换信息，以就相机运动达成共识，并从逐像素光度观测和表面法线先验中推断深度。为了在优化过程中保持几何稳定性，我们引入了一种类似关键帧的锚定机制（keyframe-like anchoring mechanism），该机制调节帧之间的有效基线，从而实现一致的运动和深度更新。我们的方法在真实数据集上进行了评估，展示了基于 GBP 的像素级分布式里程计和深度估计以及关键帧锚定（keyframe anchoring）在传感器上的可行性。项目页面：https://www.shinjeongkim.com/pixvod/

Abstract

Images composed of 2D pixel arrays are the standard input to computer vision algorithms, yet many underlying computations can be distributed across pixels. Transmitting raw, redundant, and noisy pixel data off the sensor remains inefficient, motivating a shift toward focal-plane sensor-processors that perform a significant part of the computation directly within each pixel. We envision pixels synthesizing higher-level signals locally, reducing downstream load, and providing richer inputs for higher-level vision tasks. We propose a fully parallelizable form of visual odometry and depth estimation across pixels, where sensor-processors exchange information through Gaussian Belief Propagation (GBP) to achieve consensus about camera motion and infer depth from per-pixel photometric observations and a surface normal prior. To maintain geometric stability during optimization, we introduce a keyframe-like anchoring mechanism that regulates the effective baseline between frames, enabling consistent motion and depth updates. Our method is evaluated on realistic datasets, demonstrating the feasibility of GBP-based pixel-level distributed odometry and depth estimation with keyframe anchoring on-sensor. Project Page: https://www.shinjeongkim.com/pixvod/

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on pixel-level distributed visual odometry and depth estimation using Gaussian Belief Propagation on sensor-processors, belonging to traditional computer vision and robotics. It shows negligible relevance to Tokenizers, MLLMs, World Models, or model-based RL. While it processes visual data (Visual Encoder: 1.0), it does not use learned encoders or multi-modal frameworks. No expert authors from the specified list are present.

Keywords

Pixel-Distributed, Visual Odometry, Depth Estimation, Gaussian Belief Propagation, Sensor-processors, Keyframe Anchoring, Photometric Observations

44. An Attention-Based Denoising Model for Diffusion Weighted Imaging [FAIL]

Score: 1.5 / 27.8

Authors: Prithviraj Verma, Pawan Kumar, Chandan Deshani, Prasun Chandra Tripathi

Published: 2026-06-02

TL;DR: This paper proposes an attention-based denoising framework using Swin Transformers to effectively reduce Rician noise in diffusion-weighted imaging, achieving high restoration quality measured by PSNR and SSIM.

Abstract (Chinese translation)

弥散加权成像（DWI）常用于全身癌症筛查，但其通常需要较长的采集时间。当扫描时间缩短时，图像质量往往会受损，导致扫描中的噪声增加。DWI 中的幅度重建会引入信号相关的莱斯噪声（Rician noise），这使得基于卷积的传统方法去噪更具挑战性。为了解决这一局限性，我们提出了一种噪声感知注意力驱动的去噪框架，该框架整合了层次化 Swin Transformer 窗口注意力与基于 Transformer 的多维门控细化，用于 DWI 恢复。该模型融入了显式噪声水平条件化和残差重建，以实现针对广泛噪声水平下异方差噪声的自适应抑制。实验评估表明，该模型在受损 DWI 扫描上展现出优异的恢复性能。我们的模型在 1% 到 15% 的噪声水平下实现了平均 PSNR（峰值信噪比）为 33.69 dB、SSIM（结构相似性指数）为 0.8539 的效果，同时在严重噪声条件下保持了稳定表现。这些结果表明，注意力引导的上下文建模结合通道自适应细化，为 DWI 去噪提供了一种鲁棒且泛化能力强的解决方案。

Abstract

Diffusion-weighted imaging (DWI) is used for whole-body cancer screening, but it typically requires a long acquisition time. When the scan time is reduced, the image quality often suffers, leading to increased noise in the scans. Magnitude reconstruction in DWI introduces signal-dependent Rician noise, which makes denoising more challenging for conventional convolution-based methods. To address this limitation, we propose a noise-aware attention-driven denoising framework that integrates hierarchical Swin Transformer window attention with transformer-based multi-dimensional gated refinement for DWI restoration. The model incorporates explicit noise-level conditioning and residual reconstruction to enable adaptive suppression of heteroscedastic noise across a wide range of corruption levels. Experimental evaluation on corrupted DWI scans demonstrates strong restoration performance. Our model achieves a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%, while maintaining stable behavior under severe noise conditions. These results indicate that attention-guided contextual modeling combined with channel-adaptive refinement provides a robust and generalizable solution for DWI denoising.
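
Rician noise and the PSNR metric quoted above can both be written out directly. In the sketch below, Rician corruption is modeled in the usual way, as the magnitude of a complex signal whose real and imaginary parts each receive independent Gaussian noise. The image size, intensity range, and noise levels are hypothetical, and how the paper defines its percentage noise levels is not stated in the abstract.

```python
import math
import random

def add_rician_noise(image, sigma, seed=0):
    """image: 2D list of non-negative intensities; sigma: Gaussian std per channel."""
    rng = random.Random(seed)
    # Magnitude of (signal + real-channel noise, imaginary-channel noise).
    return [[math.hypot(px + rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
             for px in row] for row in image]

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for two equally sized 2D lists."""
    errors = [(r - e) ** 2 for ref_row, est_row in zip(reference, estimate)
              for r, e in zip(ref_row, est_row)]
    mse = sum(errors) / len(errors)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

clean = [[0.5] * 8 for _ in range(8)]            # flat test image in [0, 1]
mild = add_rician_noise(clean, sigma=0.01)
severe = add_rician_noise(clean, sigma=0.05)
```

Because the magnitude operation keeps values non-negative, the noise is signal-dependent and biased upward in dark regions, which is part of what makes it harder to remove than additive Gaussian noise.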

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on medical image denoising using Swin Transformers, which is unrelated to the provided keywords regarding Multimodal LLMs, World Models, or Reinforcement Learning. 'Visual Encoder' receives a minimal score (1) due to the use of Transformer architecture, while all other keywords (0) are completely irrelevant to the domain of medical signal processing. The weighted total is 1.5, well below the dynamic passing score of 27.8.

Keywords

Diffusion-weighted imaging, Image denoising, Swin Transformer, Attention mechanism, Rician noise, Medical imaging, Noise reduction

45. Electromagnetic Navigation for Femoral Osteotomy Using High-Accuracy X-ray-to-CT Registration [FAIL]

Score: 1.5 / 27.8

Authors: Roman Flepp, Arend Nieuwland, Bastian Sigrist, Philipp Fürnstahl, Lilian Calvet, Thomas Dreher

Published: 2026-06-02

TL;DR: This study proposes an electromagnetic navigation system for femoral osteotomy that utilizes high-accuracy X-ray-to-CT registration to achieve surgical guidance precision comparable to patient-specific instrumentation with reduced radiation exposure.

Abstract (Chinese translation)

矫正性股骨截骨术中术前计划的精确执行仍具挑战性。现有技术受限于准确性波动、侵入性及辐射暴露，其中徒手法和患者特异性器械（PSI）通常分别需要超过 30 张和超过 6 张透视图像。本文提出了一种集成的、基于电磁跟踪（EMT）的股骨截骨导航系统，旨在最小化组织分离和术中透视。该系统将基于 CT 的术前规划与术中一次性 C 臂校准相结合，并通过初始化时获取的两张透视图像实现准确的 X 光 -CT 配准。这使得相对于术前计划，锯片和骨碎片的实时、无透视 EMT 导航成为可能，且兼容单平面和双平面截骨术。在使用 18 具合成股骨的可行性研究中，EMT 引导在总角度误差上显著优于徒手操作（$(3.05 \pm 0.75)^\circ$ vs. $(6.32 \pm 2.36)^\circ$, $p=0.031$），前提是两者均采用相同的最小手术暴露。所有 EMT 引导的试验均未超过 >5° 的临床阈值，而徒手操作在 6 次试验中有 4 例超出该阈值。该系统在总角度误差（$p \le 0.02$）和总平移误差（$p=0.048$）上与 PSI 达到了统计学等效性（$\pm 2^\circ$, $\pm 2\text{mm}$），且用户问卷评分无显著差异。该系统仅需两张透视图像即可转移术前计划，且在无需额外手术暴露的情况下达到了与 PSI 相当的精度，从而为后续的尸体实验和临床验证提供了依据。

Abstract

Accurate execution of preoperative plans in corrective femoral osteotomies remains challenging. Current techniques are limited by variable accuracy, invasiveness, and radiation exposure, with free-hand methods and patient-specific instrumentation (PSI) often requiring >30 and >6 fluoroscopic images, respectively. We present an integrated, electromagnetic tracking (EMT)-based navigation system for femoral osteotomies that minimizes dissection and intraoperative fluoroscopy. The system couples CT-based preoperative planning with one-time intraoperative C-arm calibration and accurate X-ray-to-CT registration from two fluoroscopic images acquired at initialization. This enables real-time, fluoroscopy-free EMT navigation of the saw blade and bone fragments relative to the preoperative plan, and is compatible with uniplanar and biplanar osteotomies. In a feasibility study using 18 synthetic femora, EMT guidance significantly outperformed free-hand execution in total angular error ($(3.05 \pm 0.75)^\circ$ vs. $(6.32 \pm 2.36)^\circ$, $p=0.031$), assuming the same minimal surgical exposure for both. No EMT-guided trials exceeded the 5° clinical threshold, whereas free-hand execution produced 4 outliers in 6 trials. The system achieved statistical equivalence ($\pm 2^\circ$, $\pm 2\,\text{mm}$) to PSI for total angular ($p \le 0.02$) and total translational ($p=0.048$) errors, with no significant differences in user questionnaire scores. By transferring preoperative plans using only two fluoroscopic images while matching PSI accuracy without additional surgical exposure, the proposed system motivates subsequent cadaveric and clinical validation.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on medical surgical navigation and image registration (X-ray-to-CT) using electromagnetic tracking in orthopedics. It does not discuss AI/ML architectures, tokenization, visual encoders for representation learning, world models, MLLMs, or reinforcement learning. Thus, relevance to the provided AI-specific keywords is negligible. The 'MultiModal' keyword receives a minimal score due to the use of two imaging modalities (X-ray and CT), but not in the context of multimodal learning models. None of the listed expert authors are present in the author list.

Keywords

Electromagnetic Navigation, Femoral Osteotomy, X-ray-to-CT Registration, Preoperative Planning, Fluoroscopy-free, Patient-Specific Instrumentation, Surgical Guidance

46. Formalizing the Binding Problem [FAIL]

Score: 0.0 / 27.8

Authors: Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

Published: 2026-06-02

Abstract (Chinese translation)

对世界的表征可以说包含有关特征的信息（例如，某物是蓝色的，某物是圆形），但也包含有关哪些特征属于同一对象的信息（例如，圆形是蓝色的），我们称之为绑定信息。任何具备理解多物体场景能力的系统都必须能够解决绑定问题：它需要知道哪些特征属于同一对象。然而，尽管已有研究表明 Vision Transformers (ViTs) 知道哪些图像块属于同一对象，但目前尚不清楚当前的深度学习模型是否学会了表现出绑定信息，即关于特征的信息。我们或许会认为绑定信息并不多，毕竟将特征错误归因于错误对象是基于 ViT 的架构的常见失败，尤其是在物体共享特征的场景中。在这里，我们采用信息论方法形式化绑定问题，并引入一种探测方法以测量模型表征中的绑定信息。我们在 ViTs 上进行了实验，从架构的不同组件（例如图像摘要 token [CLS] 或空间 token）测量绑定信息。我们使用了具有不同绑定挑战的数据集，例如特征共享、遮挡和自然特征，同时比较了若干预训练 ViTs 的性能。总体而言，我们的研究表明绑定是强大视觉识别和推理的关键要素。

Abstract

Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information; after all, misattributing features to the wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 67 (char 290)

47. FlashbackCL: Mitigating Temporal Forgetting in Federated Learning [FAIL]

Score: 0.0 / 27.8

Authors: Mubarak A. Ojewale, Adriana E. Chis, Jorge M. Cortes-Mendoza, Bernardo Pulido-Gaytan, Horacio Gonzalez-Velez

Published: 2026-06-02

TL;DR: FlashbackCL mitigates temporal forgetting in federated learning through temporally-decayed label counts and class-balanced replay, achieving significant performance improvements over baseline methods.

Abstract (Chinese translation)

联邦学习（FL）在基础模型与边缘模型上的部署越来越多地针对客户端数据分布随时间漂移的场景，但现有的遗忘缓解方法假设每个客户端的数据分布是平稳的。Flashback 是近期对抗跨客户端（空间）遗忘最强的联邦学习方法，它使用单调累积的每类标签计数作为知识代理；然而，该代理在时间分布漂移下会出现校准偏差，并将全局模型锚定到过时的类别平衡上。我们提出了一种与协议级波动隔离的每阶段指标，以形式化联邦学习中的时间遗忘，并提出了 Flashback 持续学习（FlashbackCL），它是 Flashback 的即插即用扩展，包含：（i）时间衰减的标签计数；（ii）带有类别平衡水库采样（CBRS）的设备感知重放缓冲区；以及（iii）基于公共蒸馏集的服务器端主动核心集构建。结果表明，在 CIFAR-10 数据集上，针对 50 个客户端及三种受控时间漂移模式，FlashbackCL 相对于 Flashback 实现了 6.9% 至 10.0% 的相对改进，同时将时间遗忘减少了高达 68%。五种变体的消融实验表明，CBRS 重放是关键组件。此外，FlashbackCL 在平稳的 CIFAR-100 数据集上较 Flashback 提升了 3.5 个百分点，这表明类别平衡重放既能正则化时间漂移，也能正则化空间异构性。

Abstract

Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client's distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. The results show that FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback on CIFAR-10 with 50 clients and three controlled temporal shift modes, while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.
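
The difference between monotonically accumulating label counts and temporally-decayed ones shows up in a two-phase toy stream. The exponential decay rule, the decay factor, and the class names below are assumptions for illustration; the abstract does not specify how FlashbackCL decays its counts.

```python
def update_counts(counts, new_labels, decay=0.9):
    """Exponentially decay old per-class counts, then add this round's labels."""
    updated = {cls: decay * c for cls, c in counts.items()}
    for label in new_labels:
        updated[label] = updated.get(label, 0.0) + 1.0
    return updated

def run_stream(decay):
    """Ten rounds of class 'cat' followed by ten rounds of class 'dog'."""
    counts = {}
    for round_labels in [["cat"] * 5] * 10 + [["dog"] * 5] * 10:
        counts = update_counts(counts, round_labels, decay=decay)
    return counts

accumulated = run_stream(decay=1.0)   # decay 1.0 reproduces plain accumulation
decayed = run_stream(decay=0.5)
```

With plain accumulation the two classes end with equal counts even though the client has seen only 'dog' for ten rounds, which is the stale class balance the abstract describes. With decay, the 'cat' count shrinks toward zero.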

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper addresses Federated Learning and Temporal Forgetting, which is fundamentally unrelated to the provided keywords concerning Multimodal Models, World Models, Tokenizers, Visual Encoders, and Reinforcement Learning. Consequently, all keyword relevance scores are 0.

Keywords

Federated Learning, Temporal Forgetting, FlashbackCL, Continual Learning, Replay Buffer, Distribution Shift, Class-Balanced Sampling

48. FFR: Forward-Forward Learning for Regression [FAIL]

Score: 0.0 / 27.8

Authors: Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li, Zhiqiang Que, Jiayang Li, Guosheng Hu

Published: 2026-06-02

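TL;DR: The paper proposes FFR, a framework that extends the Forward-Forward algorithm to regression through an ordinal competitive mechanism and a stratified ladder architecture, reaching accuracy comparable to backpropagation while using less memory.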

Abstract (Chinese translation)

前向 - 前向 (FF) 算法通过纯粹局部的、逐层优化训练神经网络，提供了一种计算效率高且生物学上合理的反向传播 (BP) 替代方案。然而，FF 本质上是通过对比正负样本对设计的分类方法，将其扩展到回归面临根本性挑战：连续目标空间缺乏对比学习所需的自然“对立面”，且标准优度函数不包含关于目标量级或顺序的信息。我们提出 FFR（用于回归的前向 - 前向），据我们所知，这是首个将 FF 扩展至现实世界回归任务并在多样现实数据集上展现出竞争力的框架。FFR 引入了三项关键创新：(1) 一种序数竞争优度函数，它在距离感知序数监督下，利用划分的神经元组之间的竞争学习取代了对比样本对；(2) 一种分层阶梯架构，其中浅层学习粗粒度序数判别，深层细化为细粒度回归，并通过多尺度特征聚合实现层间协作；(3) 一种带不确定性估计的分层预测机制，其中多尺度预测器协同提供稳健的预测结果和预测置信度，可谓“免费午餐”。广泛的实验结果表明，FFR 在五个现实世界回归基准上平均恢复了 BP 准确率的 98.6%，同时将峰值训练内存降低至 BP 的 27%（深度 8）和 8%（深度 32），每次迭代时间约为 BP 的 72%，并显著优于所有无 BP 竞争者。

Abstract

The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by training neural networks through purely local, layer-wise optimization. However, FF is inherently designed for classification via contrastive positive-negative sample pairs, and extending it to regression poses fundamental challenges: continuous target spaces lack natural "opposites" for contrastive learning, and the standard goodness function carries no information about target magnitude or ordering. We propose FFR (Forward-Forward for Regression), to our knowledge, the first framework to extend FF to real-world regression and demonstrate competitive performance across diverse real-world datasets. FFR introduces three key innovations: (1) an ordinal competitive goodness function that replaces contrastive pairs with competitive learning between partitioned neuron groups under distance-aware ordinal supervision; (2) a stratified ladder architecture where shallow layers learn coarse ordinal discrimination and deeper layers refine into fine-grained regression, with multi-scale feature aggregation for inter-layer collaboration; and (3) hierarchical prediction with uncertainty estimation, where multi-scale predictors jointly provide robust predictions and prediction confidence as a free lunch. Extensive experimental results show FFR recovers on average 98.6% of BP's accuracy across five real-world regression benchmarks while reducing peak training memory to only 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's, and substantially outperforms all BP-free competitors.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

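Scoring rationale: The paper's core contribution is extending the Forward-Forward algorithm to regression (FFR), involving ordinal competition, a stratified architecture, and uncertainty estimation. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) focus on multimodal unification, visual encoding, world models, and reinforcement learning, and have no direct connection to this paper's topics of regression algorithm optimization and backpropagation-free learning, so all relevance scores are 0. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is awarded.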

Keywords

Forward-Forward, Regression, Ordinal Competitive, Stratified Ladder Architecture, Uncertainty Estimation, Backpropagation Alternative, Multi-scale Feature Aggregation

49. Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning [FAIL]

Score: 0.0 / 27.8

Authors: Eric Cho, Shawn Huang, Alice Lu, Andy Lyu

Published: 2026-06-02

TL;DR: Hedge-Bench presents a new benchmark for evaluating AI agents on realistic financial reasoning tasks using expert traces, showing that current frontier models achieve scores below 16%.

Abstract (Chinese translation)

人工智能智能体（AI agents）日益能够处理金融分析中的机械性任务，例如检索文档、计算公式和更新电子表格。更具挑战性且更有价值的挑战在于对定义专家分析师工作的开放式问题进行推理。现有的基准测试（benchmarks）无法捕捉此类问题，而试图评估开放式推理的基准测试依赖于模型评判的输出，这会引入噪声和循环性。我们提出 Hedge-Bench 1.0：一个包含 102 个实际在职任务的基准测试，其基础是专业对冲基金分析师在使用相关信息来源时的显式推理轨迹。该方法使得能够依据验证过的专家步骤进行确定性评分。前沿模型（Frontier models）和智能体在该基准测试上的得分低于 16%。我们在 github.com/Trata-Inc/trata-hedge-bench 上发布了该数据集及评估框架（evaluation harness）。

Abstract

AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper introduces Hedge-Bench, a benchmark for evaluating AI agents on financial reasoning tasks based on expert traces. It focuses on evaluation methodology and dataset construction rather than model architecture, tokenization, visual encoding, world modeling, or specific reinforcement learning algorithms. Therefore, none of the provided technical keywords regarding model components or learning paradigms are relevant to this work.

Keywords

Hedge-Bench, Financial Reasoning, AI Agents, Benchmarking, Expert Traces, Frontier Models, Deterministic Grading, Financial Analysis

50. NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference [FAIL]

Score: 0.0 / 27.8

Authors: Mubarak Adetunji Ojewale

Published: 2026-06-02

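TL;DR: NetKV is a network-aware scheduler for disaggregated LLM inference that reduces Time to First Token by choosing decode instances according to network topology and congestion.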

Abstract (Chinese translation)

解耦的 LLM 推理迫使 KV cache 在进行解码之前遍历数据中心网络，因此传输时间直接进入首 token 时间（TTFT）预算。当前的调度器仅根据计算负载和前缀缓存局部性进行路由，忽略了预填充和解码实例之间的拓扑距离和动态拥塞。我们通过一个轻量级的算子到调度器接口（网络成本预言机）填补了这一空白，并证明随着上下文长度增长，忽略网络项会导致仅感知缓存的调度变得任意次优。NetKV 是一个消耗该预言机的每请求贪婪算法（复杂度 O(|D|)），其层级排名被证明对过时遥测数据具有鲁棒性。在由 Mooncake 追踪数据驱动的 64-GPU 四层胖树模拟器上，NetKV 相比轮询方案平均 TTFT 降低高达 21.2%，相比调优的缓存 + 负载感知调度器降低 17.6%，将 SLO 达成率提高高达 20.1 个百分点，并在所有测试条件下将 token 间延迟开销保持在 0.5 毫秒以下，无需更改传输层、推理引擎或硬件。

Abstract

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
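
A per-request greedy that scans all decode instances once might look like the sketch below. The cost model (queue delay plus the transfer time of the uncached part of the KV cache) and every name in it (`DecodeInstance`, `net_cost_ms_per_byte`, `select_decode_instance`) are assumptions for illustration; the abstract does not specify NetKV's actual cost function or oracle interface.

```python
from dataclasses import dataclass

@dataclass
class DecodeInstance:
    name: str
    queue_delay_ms: float    # hypothetical compute-load estimate
    cached_fraction: float   # hypothetical share of the KV cache already resident

def select_decode_instance(instances, kv_bytes, net_cost_ms_per_byte):
    """Single O(|D|) pass: return the instance with the lowest estimated cost."""
    def estimated_cost(inst):
        bytes_to_move = kv_bytes * (1.0 - inst.cached_fraction)
        # The dict stands in for the network cost oracle named in the abstract.
        transfer_ms = bytes_to_move * net_cost_ms_per_byte[inst.name]
        return inst.queue_delay_ms + transfer_ms
    return min(instances, key=estimated_cost)

instances = [
    DecodeInstance("near", queue_delay_ms=5.0, cached_fraction=0.0),
    DecodeInstance("far", queue_delay_ms=1.0, cached_fraction=0.5),
]
oracle = {"near": 1e-6, "far": 1e-5}  # made-up per-byte costs

short_ctx = select_decode_instance(instances, 100_000, oracle)
long_ctx = select_decode_instance(instances, 10_000_000, oracle)
print(short_ctx.name, long_ctx.name)
```

With a small KV cache the lightly loaded, cache-warm instance wins despite its distance; with a large one the transfer term dominates and the nearby instance wins, which is the effect of context length that the abstract points to.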

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

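Scoring rationale: The paper focuses on infrastructure-level, network-aware scheduling for disaggregated LLM inference to reduce TTFT, which does not match the provided keywords (covering model architecture such as Visual Encoder and Tokenizer, multimodality such as MLLM, MultiModal, and Unify Models, and learning paradigms such as World Models and model-based RL). None of the authors is on the target expert list.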

Keywords

Disaggregated LLM Inference, Network-Aware Scheduler, KV Cache Management, Time to First Token, Decode Instance Selection, Inference Optimization, Network Topology

51. The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol [FAIL]

Score: 0.0 / 27.8

Authors: Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude

Published: 2026-06-02

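TL;DR: This study protocol uses controlled experiments to examine how configuration mechanisms affect the build-versus-buy behavior of agentic AI coding tools, and to assess how completely and accurately the tools disclose the libraries they select.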

Abstract (Chinese translation)

自主式 AI 编码工具以日益增强的自主性编写代码，并在此过程中决定何时导入库，何时从零开始实现功能。这些决策，即从零实现功能还是采用外部库，以下简称 build-versus-buy，对软件安全、许可合规性、性能和长期可维护性产生直接影响。然而，尚无控制实验研究探讨是什么决定了自主式 AI 编码工具中的 build-versus-buy 决策。配置机制，即开发者将自主式 AI 编码工具行为针对项目或工作流定制的方法，是从业者影响这些决策的主要手段之一。然而，尚不清楚哪些配置机制最能有效地影响 build-versus-buy 决策。我们提出一种预注册协议，以研究配置机制如何改变两种流行的自主式 AI 编码工具（Claude Code 和 OpenAI Codex）中的 build-versus-buy 行为。我们将执行来自分阶段项目基准的控制编程任务，每个任务都围绕可识别的 build-versus-buy 点构建，并操纵提供给每个工具的配置，范围从无配置，经过包含软偏好和明确禁止的上下文文件，到 Skills（可自主发现的指令）、支持 MCP 的库发现工具和权限控制，测量工具选择了哪些库，是否披露了新引入的库，以及这些披露是否完整准确。九个预注册假设构成了该协议的核心。生成的基准数据集和分析流水线将作为可复用成果发布，用于评估自主式 AI 编码工具中的 build-versus-buy 行为。

Abstract

Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. These decisions, whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy, carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary means by which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool, ranging from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls, measuring which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

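Scoring rationale: The paper is a software engineering study of how configuring agentic AI coding tools affects build-versus-buy decisions. The keywords concern multimodal large-model architecture, representation learning, and reinforcement learning algorithms, which have no direct technical connection to the paper's topic, so all scores are 0. The author list does not include any designated expert.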

Keywords

Agentic AI coding tools, Build-versus-buy decisions, Configuration mechanisms, Study protocol, Claude Code, OpenAI Codex, Software security, Controlled experimental study

52. MLSkip: Data Skipping for ML Filters via Lightweight Metadata [FAIL]

Score: 0.0 / 27.8

Authors: Mihail Stoian, Mark Gerarts, Pascal Ginter, Andreas Zimmerer, Jan Van den Bussche, Andreas Kipf

Published: 2026-06-02

TL;DR: This paper proposes lightweight metadata structures for data skipping in database ML filters to improve query performance, but it is unrelated to multimodal world models or reinforcement learning.

Abstract (Chinese translation)

数据库厂商近期发布了可用于过滤谓词的人工智能（AI）函数。由于此类函数通常依赖于昂贵的黑盒机器学习（ML）模型，它们揭示了新的数据管理挑战。具体而言，针对整数和字符串数据的传统数据跳过技术无法适用于这种新的过滤类型。事实上，目前尚无已知的机制来剪枝不符合条件的行组，例如从对象存储（blob storage）读取文件时。在本文中，我们启动了对机器学习（ML）过滤器数据跳过技术的研究。我们主张 Parquet 的默认最小 - 最大元数据足以支持剪枝。为此，我们将本研究关联到两个研究方向：(i) 最近提出的机器学习模型查询语言，以及 (ii) 神经网络验证。我们在 ReLU 架构上的初步结果显示，在 TPC-H 和 TPC-DS 的数据表上，对于选择性低于 0.1% 的过滤器，平均剪枝效果达到了 27.4%。最后，受空间连接研究的启发，我们提出了一种增强元数据结构：一个大小受限的二维凸包，验证工具可以更好地利用它，将剪枝效果提高到 38.31%，同时每个行组和列对占用不超过 45 字节。我们观察到在 DuckDB 中相对于 PyTorch 实现了 1.07× 的端到端加速。

Abstract

Database vendors recently released AI functions that can be used in filter predicates. As such functions often rely on costly, black-box ML models, they unveil new data management challenges. Concretely, traditional data skipping techniques for integer and string data fail to be applicable to the new filter type. Indeed, there is no known mechanism for pruning non-qualifying row groups, e.g., when reading files from blob storage. In this work, we initiate the study of data skipping techniques for ML filters. We make the case that Parquet's default min-max metadata is enough to enable pruning. To this end, we draw connections to two lines of research: (i) the recently proposed query language for ML models and (ii) neural network verification. Our preliminary results on ReLU architectures show that on tables from TPC-H and TPC-DS, the average pruning effectiveness for filters of selectivity below 0.1% amounts to 27.4%. Finally, inspired by research on spatial joins, we propose an enhanced metadata structure: a size-bounded 2D convex hull that verification tools can make better use of, increasing the pruning effectiveness to 38.31%, while occupying at most 45 bytes per row group and column pair. We observe an end-to-end speedup of 1.07× over PyTorch in DuckDB.
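
One way min-max metadata can prune a row group for a ReLU model is interval bound propagation, a standard neural network verification technique. The sketch below uses it as an assumed stand-in; the abstract does not say which verification tool the paper relies on. The tiny network, the filter form `model(x) > threshold`, and all function names are hypothetical.

```python
def interval_affine(lo, hi, weights, bias):
    """Propagate the box [lo, hi] through x -> W x + b, one output row at a time."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        low = high = b
        for w, x_lo, x_hi in zip(row, lo, hi):
            # A non-negative weight keeps the interval ends; a negative one swaps them.
            low += w * (x_lo if w >= 0 else x_hi)
            high += w * (x_hi if w >= 0 else x_lo)
        out_lo.append(low)
        out_hi.append(high)
    return out_lo, out_hi

def output_upper_bound(col_min, col_max, layers):
    """Sound upper bound on a scalar-output ReLU MLP over a row group's min-max box."""
    lo, hi = list(col_min), list(col_max)
    for i, (weights, bias) in enumerate(layers):
        lo, hi = interval_affine(lo, hi, weights, bias)
        if i < len(layers) - 1:  # ReLU on hidden layers only
            lo = [max(v, 0.0) for v in lo]
            hi = [max(v, 0.0) for v in hi]
    return hi[0]

def can_skip(col_min, col_max, layers, threshold):
    """For the filter model(x) > threshold: skip if no point in the box can pass."""
    return output_upper_bound(col_min, col_max, layers) <= threshold

# Toy two-layer network over two columns (weights are made up).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, 1.0]], [0.0]),
]
skip_a = can_skip([0.0, 0.0], [1.0, 1.0], layers, threshold=2.0)
skip_b = can_skip([0.0, 0.0], [4.0, 4.0], layers, threshold=2.0)
print(skip_a, skip_b)
```

The bound is conservative: a row group is skipped only when the model output provably cannot exceed the threshold anywhere in the box, so no qualifying row is lost, but loose bounds leave some prunable row groups unpruned.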

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on database optimization (data skipping for ML filters) using metadata structures like convex hulls within DuckDB. It does not involve multimodal representation learning, tokenization, visual encoders, world models, large language models, or reinforcement learning, resulting in a fundamental domain mismatch with the provided keywords.

Keywords

Data Skipping, ML Filters, Lightweight Metadata, Parquet, Neural Network Verification, Pruning Effectiveness, DuckDB, ReLU Architectures

53. Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing [FAIL]

Score: 0.0 / 27.8

Authors: Thomas Maillart, Thibaut Chataing, David Dosu, Paul Bagourd, Julian Jang-Jaccard, Alain Mermoud

Published: 2026-06-02

TL;DR: This study predicts exogenous and endogenous scientific concept diffusion in quantum computing using LightGBM on citation networks, demonstrating that exogenous diffusion is highly predictable based on structural features.

Abstract (Chinese translation)

理解和预测科学变革需要能够区分科学概念的内生性巩固与外源性扩散的模型。利用 OpenAlex 中量子计算子树的概念，我们构建了一个时间分辨的概念共现网络，并通过其上游引文谱系和下游扩散追踪每一对概念。我们在分布性和多样性感知特征上训练 LightGBM 模型，以预测四种结果：内生性强化、外源性扩散、它们的比值以及扩散熵。在控制科学共同体的总体出版增长后，内生性强化在主要的量子计算基准测试中表现出很大程度上不可预测。相比之下，外源性扩散和熵值具有强可预测性（$R^2$ 高达 0.78），且由上游异质性、引文广度和分布离散性驱动，SHAP 分析证实了这一点；在机器人学、先进材料和神经植入物领域的复现研究确认，外源性扩散仍是各领域的首要目标（$R^2_{test} \sim 0.60-0.87$），而神经植入物中的内生性可预测性显著上升（$R^2_{test} = 0.83$），这表明量子计算领域的不对称性并未普遍化。案例研究表明，熵值的急剧上升与概念前沿的开辟相吻合，而熵值的崩溃则预示着技术融合或范式更替。这些结果表明，概念扩散受嵌入在语义和引文环境中的稳定结构规律所支配。通过识别早期基于多样性的跨领域采纳信号，该方法为快速演化的研究领域的预测性科学计量学、技术预见和创新导向的政策分析提供了可扩展的基础。

Abstract

Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co-occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity-aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum-computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable ($R^2$ up to $0.78$) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top-ranked target across fields ($R^2_{\text{test}} \sim 0.60\text{--}0.87$), while endogenous predictability rises markedly in neuro implants ($R^2_{\text{test}} = 0.83$), indicating that the quantum-computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity-based signals of cross-domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation-oriented policy analysis in rapidly evolving research fields.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on scientometrics and concept diffusion prediction using LightGBM on citation networks. It does not involve Unify Models, Tokenizers, Visual Encoders, World Models, MLLMs, MultiModal architectures, or Model-Based Reinforcement Learning. Thus, all keyword relevance scores are 0. No expert authors from the specified list are present.

Keywords

Conceptual Diffusion, Quantum Computing, Scientometrics, LightGBM, Citation Network, Endogenous Reinforcement, Exogenous Diffusion

54. MAdam: Metric-Aware Multi-Objective Adam [FAIL]

Score: 0.0 / 27.8

Authors: Fengbei Liu, Rachit Saluja, Sunwoo Kwak, Ruibo Wang, Ruining Deng, Heejong Kim, Johannes C. Paetzold, Mert R. Sabuncu

Published: 2026-06-02

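TL;DR: To address the weighting and geometric mismatches between multi-objective optimization solvers and the Adam optimizer, this paper proposes the MAdam wrapper and validates its performance gains across several machine learning tasks.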

Abstract (Chinese translation)

多目标优化（MOO）构成了许多机器学习问题的基础，然而，基于损失平衡、梯度平衡及帕累托（Pareto）方法的 MOO 求解器几乎普遍将其协调方向传递给 Adam~\cite{kingma2015adam}。我们表明，这种耦合在求解器的意图与优化器的执行之间引入了两个系统性偏差。第一种是“权重失配”：Adam 的二阶矩分母将时变偏好向量与梯度统计量纠缠在一起，使偏好边缘化为历史平均，并将不同的帕累托权衡坍缩为近均匀混合。第二种是“几何失配”：Adam 的自适应度量扭曲了 MOO 求解器所假设的欧几里得几何，将对齐的目标转化为表观冲突。为共同解决这两个问题，我们引入了 MAdam（感知度量的多目标 Adam），这是一个即插即用包装器，保持求解器和优化器不变。MAdam 利用标量化目标的偏好条件曲率对协调方向进行预处理；在此白化输入上，Adam 的二阶矩坍缩为单位矩阵，因此实际更新遵循偏好条件度量。在多任务学习、帕累托前沿恢复、物理信息神经网络以及医学成像等领域，MAdam 始终优于 Adam，且适用于每种求解器方法。

Abstract

Multi-objective optimization (MOO) underlies many machine learning problems, yet MOO solvers across the loss-balancing, gradient-balancing, and Pareto-based families almost universally hand their reconciled directions to Adam (Kingma and Ba, 2015). We show this coupling introduces two systematic gaps between the solver's intent and the optimizer's execution. The first is a weighting mismatch: Adam's second-moment denominator entangles the time-varying preference vector with gradient statistics, marginalizing the preference into a history average and collapsing distinct Pareto trade-offs toward a near-uniform mixture. The second is a geometric mismatch: Adam's adaptive metric distorts the Euclidean geometry MOO solvers assume, turning aligned objectives into apparent conflicts. To resolve both jointly, we introduce MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper that leaves both solver and optimizer unchanged. MAdam preconditions the reconciled direction by the preference-conditioned curvature of the scalarized objective; on this whitened input, Adam's second moment collapses to identity, so the realized update is governed by the preference-conditioned metric. Across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, MAdam consistently improves over Adam for every solver family.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

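Scoring rationale: The paper's core contribution is the MAdam optimizer wrapper, which addresses the weighting mismatch and geometric mismatch between multi-objective optimization (MOO) solvers and the Adam optimizer. The provided keywords all focus on multimodal large-model architecture (e.g., Tokenizer, Visual Encoder), world models, and model-based reinforcement learning, whereas this paper belongs to optimization algorithms and does not involve multimodal representations, world-model construction, or specific model architectures for reinforcement learning. It therefore has no direct relevance to any of the given keywords, and all scores are 0. The weighted total is 0, far below the dynamic passing score of 27.8. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is awarded.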
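
The weighted total and pass/fail verdict mentioned in the rationale can be reproduced with a short sketch. The report does not state its exact formula, so the weight-times-relevance sum, the additive expert bonus, and the function name are assumptions; only the weights, relevance values, and the 27.8 threshold are taken from the entry itself.

```python
def weighted_score(rows, expert_bonus=0.0):
    """Assumed formula: sum of weight * relevance over keywords, plus any expert bonus."""
    return sum(weight * relevance for _, weight, relevance in rows) + expert_bonus

# (keyword, weight, relevance out of 10), as listed in the scoring table.
rows = [
    ("Unify Models", 1.5, 0.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]

PASS_THRESHOLD = 27.8  # copied from the "Score: 0.0 / 27.8" line; how it is derived is not stated
score = weighted_score(rows)
verdict = "PASS" if score >= PASS_THRESHOLD else "FAIL"
print(score, verdict)
```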

Keywords

Multi-objective optimization, Adam optimizer, Metric-Aware, Multi-task learning, Pareto-front, Weighting mismatch, Geometric mismatch

55. Attribution via Distributional Paths for Information Revelation [FAIL]

Score: 0.0 / 27.8

Authors: Kieran A. Murphy, Shameen Shrestha

Published: 2026-06-02

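TL;DR: This paper proposes Reveal-IG, which performs feature attribution along distributional paths rather than input-space paths, improving the stability and completeness of feature-importance scores on image and tabular data.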

Abstract (Chinese translation)

特征归因方法通过为输入特征分配重要性得分来解释预测。基于路径的方法（如集成梯度（Integrated Gradients））尤其具有吸引力，因为它们满足“完备性”：归因之和等于参考状态与输入之间模型输出的变化量。然而，大多数路径方法在输入空间中定义这条轨迹，通过沿选定路径的逐点扰动输入来解释模型。输入空间路径在它经过的每个点集成模型的原始响应，无法控制查询特征的分辨率；轨迹早期靠近基线的部分与输入本身对解释的贡献处于同等地位。在此，我们将路径归因从输入空间提升至围绕感兴趣样本的结构化探测分布空间，并将该方法称为 Reveal-IG。与遍历原始输入值不同，Reveal-IG 逐步揭示关于输入的信息，并将模型期望输出的变化沿这条分布路径进行归因。由此得到的路径归因框架相对于期望模型响应保留了完备性，并能自然支持多尺度图像探测以及表格数据中的特征级不确定性。合成诊断表明，Reveal-IG 避免了影响输入空间方法的路径伪影；在 ImageNet 分类和表格回归任务中，它产生了稳定的有符号归因——在使用归因符号的指标上表现领先，而在其余指标上保持竞争力。

Abstract

Feature attribution methods explain predictions by assigning importance scores to input features. Path-based methods such as Integrated Gradients are especially appealing because they satisfy completeness: attributions sum to the change in model output between a reference state and the input. Yet most path methods define this trajectory in input space, explaining a model through pointwise perturbed inputs along a chosen path. An input-space path integrates the model's raw response at each point it passes through, with no control over the resolution at which a feature is queried; the early, baseline-adjacent part of the trajectory contributes to the explanation on equal footing with the input itself. Here, we lift path attribution from input space to a space of structured probe distributions around the example of interest, and call our method Reveal-IG. Rather than traversing raw input values, Reveal-IG progressively reveals information about the input and attributes changes in the model's expected output along this distributional path. The result is a path-attribution framework that retains completeness with respect to the expected model response, and naturally accommodates multiscale image probes and feature-wise uncertainty in tabular data. Synthetic diagnostics show that Reveal-IG avoids path artifacts that affect input-space methods, and across ImageNet classification and tabular regression it produces stable, signed attributions, leading on metrics that use attribution sign while remaining competitive on the rest.
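
The completeness property is easy to check numerically for standard input-space Integrated Gradients, the baseline the abstract contrasts with. This sketch is not Reveal-IG; the toy function `f`, its hand-written gradient, and the straight-line path with a midpoint Riemann sum are illustrative assumptions.

```python
def integrated_gradients(grad_f, x, baseline, steps=1000):
    """Midpoint Riemann-sum approximation of IG along the straight line baseline -> x."""
    attributions = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_f(point)
        for i in range(len(x)):
            attributions[i] += grad[i] * (x[i] - baseline[i]) / steps
    return attributions

def f(p):
    """Toy model: f(x0, x1) = x0^2 + 3 * x0 * x1."""
    return p[0] ** 2 + 3.0 * p[0] * p[1]

def grad_f(p):
    """Analytic gradient of f."""
    return [2.0 * p[0] + 3.0 * p[1], 3.0 * p[0]]

x, baseline = [1.0, 2.0], [0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
# Completeness: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attr) - (f(x) - f(baseline)))
print(attr, completeness_gap)
```

Every point on the path, including those next to the baseline, contributes to the sum with equal weight, which is the behaviour of input-space paths that the abstract calls out.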

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

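Scoring rationale: The paper focuses on a feature attribution method (Reveal-IG) that aims to improve the stability and completeness of explanations through distributional paths, and belongs to explainable AI. The provided keywords cover unified models, tokenizers, visual encoders, world models, multimodal large language models, multimodal architectures, and model-based reinforcement learning, none of which relates to the paper's topic of attribution methods. All keyword relevance scores are therefore 0. The weighted total is 0, far below the dynamic passing score of 27.8. The author list does not include any designated expert.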

Keywords

Feature attribution, Integrated Gradients, Distributional paths, Explainability, Multiscale probes, Completeness property, Input space, Tabular regression

56. Demo2Tutorial: From Human Experience to Multimodal Software Tutorials [FAIL]

Score: 0.0 / 27.8

Authors: Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, Xin Wang, Mike Zheng Shou

Published: 2026-06-02

Abstract (Chinese translation)

数字环境中的人类经验提供了一种庞大且未被充分探索的资源，其中蕴含着丰富的程序性知识的真实、未经修剪的交互。我们提出了 Demo2Tutorial，这是一个框架，旨在将通过屏幕录制和交互日志捕获的这种经验转化为结构化的、多模态软件教程，以同时教导人类和智能体。Demo2Tutorial 首先通过专用记录器收集人类经验，随后利用多模态 Action Parser（动作解析器）解析原始经验，以重建感知、动作及意图。随后，Step Planner（步骤规划器）将这些步骤抽象为表示目标和步骤的层级任务图。最后，Tutorial Composer（教程合成器）将解析后的经验转化为结构化的、可重用的图文指令。我们在一个基于官方软件文档构建的新基准上评估教程生成的质量。我们进一步证明，这种提炼后的表示形式有益于：(i) 人类学习，通过自动生成多模态教程；(ii) 智能体学习，通过改进下游 GUI 智能体的规划与泛化能力。实验表明，Demo2Tutorial 生成的教程质量超越了人工编写的教程，并显著优于基线方法，同时实现了人类任务完成速度的提升以及 GUI 智能体规划能力的改进。这证明了从人类经验中提炼出的结构化教程可作为有效的知识表示，用于推动人类学习与智能体能力的进步。代码与数据将在 https://github.com/showlab/Demo2Tutorial 上提供。

Abstract

Human experience in digital environments offers a vast, underexplored resource of authentic, untrimmed interactions that contain rich procedural knowledge. We introduce Demo2Tutorial, a framework that transforms this experience captured via screen recordings and interaction logs into structured, multimodal software tutorials for teaching both humans and agents. Demo2Tutorial first collects human experience via a dedicated recorder, then parses raw experience using a multimodal Action Parser to reconstruct perception, action, and intent. A Step Planner then abstracts these steps into hierarchical task graphs representing goals and steps. Finally, a Tutorial Composer transforms the parsed experience into structured, reusable image-text instructions. We evaluate the tutorial generation quality on a new benchmark derived from official software documentation. We further demonstrate that this distilled representation benefits (i) human learning, by automatically generating multimodal tutorials, and (ii) agent learning, by improving downstream GUI-agent planning and generalization. Experiments show Demo2Tutorial produces high-quality tutorials that surpass human-authored ones and significantly outperform baseline methods, while enabling both faster human task completion and improved GUI agent planning, demonstrating that structured tutorials distilled from human experience can serve as effective knowledge representations for advancing both human learning and agent capabilities. Code and data will be available at https://github.com/showlab/Demo2Tutorial.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 28 (char 251)

57. SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation [FAIL]

Score: 0.0 / 27.8

Authors: Qingpo Wuwu, Xiaobao Wei, Peng Chen, Nan Huang, Zhongyu Zhao, Hao Wang, Ming Lu, Ningning Ma, Shanghang Zhang

Published: 2026-06-02

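TL;DR: SparseStreet is a Gaussian Splatting compression framework for street scenes that uses pruning and background compression to cut storage cost by up to 80% while preserving high-quality rendering of dynamic objects.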

Abstract (Chinese translation)

尽管 3D 高斯泼溅（3D Gaussian Splatting）在街景重建中已展现出令人鼓舞的结果，但现有方法需要大量高斯基元（Gaussian primitives）来捕捉精细细节，从而导致高昂的存储成本和缓慢的渲染速度。观察发现，动态物体（例如车辆和行人）需要高保真表示以维持时间一致性，而静态背景区域往往包含大量冗余。基于此，我们提出 SparseStreet，这是一种专为街景设计的通用压缩框架。首先，我们提出一种基于节点的可学习剪枝策略，系统性地移除低贡献的高斯基元，同时保留视觉关键区域。其次，在场景表示稳定后，我们应用背景压缩，进一步减少静态区域中的冗余。我们的方法能有效保持动态物体的几何结构与外观，同时显著减少高斯基元的总数。在 Waymo 和 nuScenes 数据集上的广泛实验表明，SparseStreet 可实现高达 80% 的压缩比，且质量退化极小，从而实现了资源高效的高保真动态场景重建。项目网站：https://sparsestreet.github.io/.

Abstract

While 3D Gaussian Splatting has shown promising results in street scene reconstruction, existing methods require massive numbers of Gaussian primitives to capture fine details, leading to prohibitive storage costs and slow rendering speeds. We observe that dynamic objects (e.g., vehicles and pedestrians) demand high-fidelity representations to maintain temporal consistency, while static background regions often contain substantial redundancy. Motivated by this, we propose SparseStreet, a general compression framework specifically designed for street scenes. First, we introduce a node-based learnable pruning strategy that systematically removes low-contributing Gaussian primitives while preserving visually critical regions. Second, after the scene representation stabilizes, we apply background compression, further reducing redundancy in static regions. Our method effectively preserves the geometry and appearance of dynamic objects while significantly reducing the total number of Gaussian primitives. Extensive experiments on the Waymo and nuScenes datasets demonstrate that SparseStreet achieves a compression ratio of up to 80% with minimal quality degradation, enabling resource-efficient, high-fidelity dynamic scene reconstruction. Project website: https://sparsestreet.github.io/.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

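Scoring rationale: The paper focuses on compressing and accelerating 3D Gaussian Splatting for street scene reconstruction; its core contributions are a sparsification strategy and background compression, placing it in computer graphics and vision. The provided keywords (e.g., Tokenizer, MLLM, Unify Models, Model-Based RL) all point to multimodal large-model architecture and reinforcement learning and have no direct technical connection to the paper, so all relevance scores are 0. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so the expert bonus is 0. The weighted total is 0, far below the dynamic passing score of 27.8.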

Keywords

Sparse Gaussian Splatting, Street Scene Simulation, Real-Time Rendering, Dynamic Object Preservation, Background Compression, Gaussian Primitives, Waymo Dataset

Token usage: 867,506 tokens (input 103,946 / output 763,560)