arXiv Daily Report 2026-06-06

DailyPapers

ArXiv Research Report

Generated: 2026-06-06 05:33:05 | Passing score: 27.8

Total: 280

Qualified

Analyzed

Pass Rate: 17%

Papers

1. World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis [PASS]

Score: 76.5 / 27.8

Authors: Yi Yang, Zhihong Liu, Siqi Kou, Yiyang Chen, Yanzhe Hu, Jianbo Zhou, Boyuan Zhao, Zhijie Wei, Xiao Xia, Xueqi Li, Pengfei Liu, Zhijie Deng

Published: 2026-06-04

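TL;DR: This paper proposes the World-Language-Action (WLA) model, which uses an autoregressive Transformer to unify world modeling, language reasoning, and action synthesis, and achieves state-of-the-art multi-task and long-horizon learning on robot tasks.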

Abstract

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	9.0/10	13.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	7.0/10	10.5

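Scoring rationale: The paper's core contribution is the WLA model, which unifies world modeling, language reasoning, and action synthesis, so it is highly relevant to Unify Models, World Models, and MultiModal (9 each). As a multimodal foundation model that processes vision and language, it is relevant to MLLM and Visual Encoder (8 and 6). Its action generation driven by state prediction fits the model-based RL idea (7). Tokenizer is not presented as a core innovation (3). The author list does not include any of the five designated experts.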
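
The score column and the headline score can be reproduced from the table with a few lines of Python. This is a minimal sketch that assumes each row's score is weight times relevance and that a paper passes when its total reaches the passing score; the report shows only the numbers, so the comparison rule and the names `weighted_total` and `PASSING_SCORE` are assumptions.

```python
PASSING_SCORE = 27.8  # from the report header

# (keyword, weight, relevance out of 10), copied from the table for paper 1
rows = [
    ("Unify Models", 1.5, 9.0),
    ("Tokenizer", 1.5, 3.0),
    ("Visual Encoder", 1.5, 6.0),
    ("World Models", 1.5, 9.0),
    ("MLLM", 1.5, 8.0),
    ("MultiModal", 1.5, 9.0),
    ("model-based RL", 1.5, 7.0),
]


def weighted_total(rows):
    """Sum weight * relevance over every keyword row."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(rows)
# Assumption: a paper passes when its total is at least the passing score.
passed = total >= PASSING_SCORE
print(total, passed)
```

The total matches the "Score: 76.5 / 27.8" line shown for this paper.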

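Keywords

World-Language-Action Model, Unified World Modeling, Autoregressive Transformer, Embodied Foundation Models, Language Reasoning, Action Synthesis, Multi-modal Input

In-depth Analysis

Chinese Title: 世界-语言-动作模型：统一世界建模、语言推理与动作合成

Summary: This paper proposes the World-Language-Action (WLA) model, a new class of embodied foundation model. WLA uses an autoregressive Transformer backbone to jointly predict textual subtasks, subgoal images, and robot actions, combining the world modeling interface of world-action models (WAMs) with the language reasoning ability of vision-language-action (VLA) models. Its core idea is to decompose the next state into a high-level textual intention and low-level physical dynamics. A World Expert learns the physical dynamics end to end, and meta-queries let the world prediction influence action generation implicitly, so the World Expert can be switched off at inference to reduce latency. The WLA-0 prototype has only 2B active parameters and an inference latency of about 40 ms on an RTX 5090. It matches or exceeds the state of the art on simulated and real-world tasks such as RoboTwin 2.0 and RMBench, and supports learning new tasks from cross-embodiment robot videos.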

Innovations:

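Decomposes the next state into a dual representation of high-level textual intention and low-level physical dynamics, unifying language reasoning and world modeling.
Replaces the bidirectional diffusion Transformer of conventional WAMs with an autoregressive Transformer backbone, enabling joint text generation and physical dynamics prediction.
Introduces a World Expert and a meta-query architecture for end-to-end, implicit learning of physical dynamics; the World Expert can be dropped at inference to reduce latency.
Supports test-time scaling: activating world prediction further improves robot control.
Learns new tasks from cross-embodiment robot videos without action annotations, showing cross-embodiment generalization.

Methodology: WLA uses an autoregressive Transformer backbone that takes text instructions, images, and robot states as input and outputs textual subtasks, subgoal images, and actions. Meta-queries extract a physical dynamics representation from the backbone. The World Expert predicts the future visual state (as VAE features) from this representation and the current state, while the Action Expert generates actions from the same representation. Training jointly optimizes a language cross-entropy loss, a world modeling flow-matching loss, and an action flow-matching loss. At inference, the World Expert can be disabled, leaving only the backbone and the Action Expert, to increase speed.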
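
The three-term training objective can be illustrated with a small pure-Python sketch. It assumes the common straight-line flow-matching path, where the target velocity is the difference between the data sample and the noise sample; the report does not say which path, time sampling, or loss weights WLA uses, so `flow_matching_loss`, `joint_loss`, and the weights `w_world` and `w_action` are hypothetical.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose target
    velocity is x1 - x0. This path is an assumption, not taken from the paper.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def joint_loss(l_text, l_world, l_action, w_world=1.0, w_action=1.0):
    """Weighted sum of the language, world, and action terms (weights are hypothetical)."""
    return l_text + w_world * l_world + w_action * l_action
```

In this sketch, disabling the World Expert at inference corresponds to simply never evaluating the world term; the action term is computed the same way on action targets instead of visual features.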

Key Results:

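WLA-0 reaches a 92.94% success rate on RoboTwin 2.0 Clean and 56.5% on RMBench, both state of the art.
On the real-world Stack Cup task, WLA-0 takes only half the completion time of the WAM baseline, showing its low-latency advantage.
With only 2B active parameters and no embodied pretraining, inference latency is about 40 ms (RTX 5090).
Supports learning new tasks from cross-embodiment videos without action annotations, showing good steerability and generalization.

Tech Stack:

Autoregressive (AR) Transformer
Flow matching loss
VAE (variational autoencoder) feature representation
Meta-queries
Diffusion Transformer (e.g., SANA-600M)
Backbone initialized from a vision-language model (VLM)
Cross-entropy loss

Strengths:

Unifies world modeling, language reasoning, and action synthesis, combining the advantages of WAMs and VLAs.
End-to-end training avoids the suboptimality of a two-stage pipeline.
Inference is efficient (40 ms) and suitable for real-time control.
Supports test-time scaling for further performance gains.
Performs well on long-horizon, memory-dependent tasks and can correct errors.

Limitations:

The World Expert predicts only a single future frame, which may lose temporal dynamics.
Relies on a pretrained VLM, so language understanding is capped by the chosen VLM.
The experimental scale is limited (2B parameters); larger models have not been evaluated.
Real-world evaluation scenarios are few, and generalization to more complex dynamic environments remains to be validated.

Relevance To Keywords: The paper closely follows the given keywords. It proposes a unified model (Unify Models) framework that merges world models (World Models) with language reasoning; it uses a VAE and meta-queries from representation learning; it belongs to the model-based RL paradigm; it uses a natively multimodal large model (VLM) as the backbone to integrate understanding and generation; and, for post-training, it achieves end-to-end optimization through joint training.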

2. Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators [PASS]

Score: 75.0 / 27.8

Authors: Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

Published: 2026-06-04

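TL;DR: This paper proposes the Astra framework, which couples an RL-trained VLM policy with a world simulator and uses imagined visual evidence to substantially improve vision-language models on spatial reasoning tasks.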

Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	9.0/10	13.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	8.0/10	12.0

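Scoring rationale: The paper's core is combining a vision-language model (MLLM) with a world model (World Models) and using reinforcement learning (model-based RL) for embodied spatial reasoning, so these three keywords score high (8 to 9). 'Unify Models' is reflected in the coupled VLM-simulator architecture and scores 7. 'MultiModal' and 'Visual Encoder' are present as base components but are not main innovations, scoring 9 and 6 respectively. 'Tokenizer' is not mentioned in the abstract and has the lowest relevance (2).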

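Keywords

Agentic Visual Spatial Reasoning, World Simulators, Vision-Language Models, Imagination, Reinforcement Learning, Multi-modal, Consistency Tuning, Novel-view Synthesis

In-depth Analysis

Chinese Title: 思考与想象：基于世界模拟器的智能视觉空间推理

Summary: This paper studies a limitation of vision-language models (VLMs) in spatial reasoning: they usually reason only over the given images and struggle to infer unobserved layouts, maintain cross-view consistency, or reason from alternative viewpoints. The authors propose Astra, a framework that couples a VLM with an action-conditioned world simulator so that the VLM can actively call the simulator to obtain imagined novel-view observations and reason interactively about space. Astra has two core components. Astra-WM is a Bagel-based world simulator that produces spatially consistent imagined images through view consistency tuning. Astra-VL is an agentic policy model based on Qwen3-VL, trained with a world-simulator-in-the-loop two-phase RL curriculum to learn when, where, and how to call the simulator. Experiments show that both simulator quality and policy learning matter: Astra-WM raises simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, and Astra-VL raises Qwen3-VL from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. The results indicate that effective imagination depends not only on the generator but also on the learned interaction policy.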

Innovations:

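Proposes the Astra framework, which couples a VLM with an action-conditioned world simulator for interactive spatial reasoning, letting the VLM actively acquire imagined evidence.
Designs View Consistency Tuning to train the world simulator Astra-WM so that generated images match the requested camera motion in both pose and content.
Proposes a world-simulator-in-the-loop two-phase RL curriculum: the first phase learns to call the simulator effectively, and the second encourages selective imagination to avoid over-reliance on the tool.
Builds a spatial QA data corpus for training the Astra-VL policy and designs pose-consistency and content-consistency metrics to validate simulator quality.
Recasts spatial reasoning from an internal reconstruction problem into an interactive evidence-acquisition problem, and verifies that simulator quality and policy learning jointly determine final performance.

Methodology: The paper follows this pipeline. First, the world simulator Astra-WM is built on the Bagel model and trained with view consistency tuning (including pose-consistency and content-consistency losses) so that it generates spatially consistent imagined views from natural-language camera-motion instructions. Second, the agentic policy model Astra-VL is built on Qwen3-VL and trained with the world-simulator-in-the-loop two-phase RL curriculum: the first phase uses supervised learning to teach the model to call the simulator effectively, and the second uses reinforcement learning (e.g., PPO) to encourage imagination only when calling the simulator is more useful than answering directly. A spatial QA data corpus containing multi-view images, questions, answers, and simulator-call trajectories is built for training and evaluation. Finally, the full framework is evaluated on benchmarks such as MMSI-Bench and MindCube.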
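
The interaction pattern described above, where a policy interleaves simulator calls with answering, can be sketched as a short loop. Everything here is a hypothetical stand-in: `policy`, `simulator`, `Episode`, and the call budget `max_calls` are illustrative names and are not the Astra interfaces, which the report does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    question: str
    views: list = field(default_factory=list)  # observed plus imagined views


def reason_with_imagination(policy, simulator, episode, max_calls=3):
    """Let the policy interleave simulator calls with answering.

    policy(episode) returns ("imagine", camera_motion) or ("answer", text).
    simulator(views, camera_motion) returns one imagined view.
    """
    for _ in range(max_calls):
        kind, payload = policy(episode)
        if kind == "answer":
            return payload, episode.views
        episode.views.append(simulator(episode.views, payload))
    # Call budget exhausted: ask for a final answer.
    kind, payload = policy(episode)
    return (payload if kind == "answer" else None), episode.views


# Toy stand-ins: imagine one extra view, then answer.
def toy_policy(ep):
    return ("answer", "left") if len(ep.views) >= 2 else ("imagine", "turn left")


def toy_simulator(views, motion):
    return f"imagined view after '{motion}'"


answer, views = reason_with_imagination(
    toy_policy, toy_simulator, Episode("Where is the door?", ["observed view"])
)
```

A selective policy in this sketch is one that returns `("answer", ...)` immediately when the observed views already suffice, which is the behavior the second curriculum phase is said to encourage.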

Key Results:

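Astra-WM markedly improves simulator quality: in the pose-consistency evaluation its images follow the requested camera motion more closely, and in the content-consistency evaluation scene content and spatial layout are better preserved.
Augmenting Gemini-3-Flash with Astra-WM raises its MMSI-Bench score from 45.1 to 49.5, confirming the need for a high-quality world simulator.
Connecting Qwen3-VL directly to the simulator degrades performance, whereas the RL-trained Astra-VL raises Qwen3-VL from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube.
The two-phase RL curriculum is effective: the first phase teaches effective calls, and the second teaches selective calls that avoid useless imagination.
Ablations show that simulator quality and policy learning are both indispensable; effective imagination requires the two together.

Tech Stack:

Bagel (diffusion-based image generator)
Qwen3-VL (vision-language model)
Gemini-3-Flash (closed-source VLM)
View Consistency Tuning
Reinforcement learning (PPO or a similar algorithm)
Two-phase RL curriculum
Natural-language camera-motion instructions
Pose-consistency evaluation
Content-consistency evaluation
MMSI-Bench and MindCube (spatial reasoning benchmarks)

Strengths:

Models spatial reasoning as an interactive evidence-acquisition process rather than static inference, which is closer to human cognition.
Optimizes both world simulator quality and policy learning, and shows why the two must work together.
The two-phase RL curriculum is well designed and avoids tool abuse and useless imagination.
The experimental design is thorough, covering simulator quality evaluation, ablations, and cross-model transfer.
An open project page and code (presumed) ease reproduction and follow-up research.

Limitations:

The world simulator can still generate imperfect images, especially for complex scenes or large rotations, which may mislead reasoning.
The camera-motion vocabulary is limited (only 5 motion types) and may not cover every spatial reasoning need.
RL training requires the simulator in the loop, which is computationally expensive and needs a carefully designed reward function.
Evaluation covers only two benchmarks, so generalization needs validation in more scenarios.
The effect of the number of simulator calls on inference efficiency is not discussed, and it may add latency in practice.

Relevance To Keywords:

World Models: Astra-WM serves as a world simulator that generates action-conditioned novel views, applying world models to reasoning.
Reinforcement Learning: Astra-VL learns through RL when to call the simulator, as part of post-training.
Post-training: RL training is post-training applied to a pretrained VLM to improve spatial reasoning.
Representation Learning: View consistency tuning lets the simulator learn spatially consistent visual representations.
Unified multimodal understanding and generation: Astra combines the understanding ability of a VLM with the imagination ability of a generative model for integrated reasoning.
Native multimodal large models: Qwen3-VL, a natively multimodal model, is extended into an agent.
Model-Based RL: The world simulator acts as the environment model for model-based reinforcement learning.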

3. MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action [PASS]

Score: 73.5 / 27.8

Authors: Boyang Zhang, Lianlei Shan

Published: 2026-06-04

TL;DR: MPCoT uses reward-guided multi-path latent reasoning to substantially improve long-horizon vision-language-action control without adding reasoning-token latency.

Abstract

Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for $K$ weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls $(K, M)$. Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	9.0/10	13.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	8.0/10	12.0

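Scoring rationale: The paper proposes the MPCoT framework to address the brittleness of VLA policies in long-horizon control. 'World Models' and 'MultiModal' score highest because the abstract explicitly mentions world-model-based evaluation and VLA is a multimodal task. 'model-based RL' scores high because multi-path hypothesis reasoning resembles model-based planning. 'MLLM' scores fairly high because the method relies on a VLM backbone. 'Unify Models' scores in the middle because the method unifies the reasoning and action pipeline. 'Visual Encoder' scores in the middle as a necessary component. 'Tokenizer' scores low because the paper only touches token interface design. The author list does not include any designated expert.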

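Keywords

MPCoT, Reward-Guided, Multi-Path, Latent Reasoning, Vision-Language-Action, Long-Horizon, World-Model, Test-Time

In-depth Analysis

Chinese Title: MPCoT: 奖励引导的多路径潜在推理用于测试时可扩展的视觉-语言-动作模型

Summary: This paper proposes MPCoT, a reward-guided multi-path latent reasoning framework that aims to improve vision-language-action (VLA) policies in long-horizon, high-uncertainty control. Conventional VLAs decode actions in a single pass and lack inference-time deliberation; explicit chain-of-thought adds reasoning depth but introduces token latency and an indirect text-to-action interface. MPCoT initializes M hypotheses in a continuous latent space, refines them over K weight-tied steps, and softly aggregates all branches before decoding the action. A training-time path-preference objective combines expert-action consistency, world-model/VLM progress assessment, and success feedback to align the latent path scorer with downstream execution quality. MPCoT keeps the original 8-step action interface, generates no reasoning tokens, and exposes configurable inference controls (K, M). Experiments on the LIBERO and CALVIN benchmarks show that MPCoT substantially improves long-horizon task success, and ablations confirm the depth-width effect, confidence-weighted aggregation, and reward-guided path supervision.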

Innovations:

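Reframes VLA inference as a test-time compute allocation problem, seeking a balance between explicit reasoning and shallow one-pass control.
Proposes a recurrent multi-path latent reasoning module with explicit depth (K) and width (M) controls that produces no reasoning tokens and leaves the original action interface unchanged.
Introduces reward-guided path-preference learning, which trains the path scorer with action consistency, world-model/VLM progress, and success feedback so that it can select high-quality paths at inference without rewards.
Validates on LIBERO and CALVIN the complementarity of depth and width, the effectiveness of confidence-weighted aggregation, and the benefit of reward-supervised path learning.

Methodology: The method builds on the OpenVLA-OFT backbone and inserts a latent reasoning module Rθ. A visual encoder, a language encoder, and multimodal fusion first produce a latent context ct; M latent hypotheses are then initialized using learnable hypothesis codes and training-time perturbations. Each hypothesis is refined for K steps by a weight-tied residual MLP (Fθ), and every update is conditioned on ct. During training, each refined branch is decoded into a candidate action, a fixed world-model/VLM evaluator computes a progress reward, and this is combined with expert-action consistency and success labels into a composite reward, from which advantages are computed to train the path scorer hω. At inference, the scorer outputs soft weights, the M branches are aggregated by confidence weighting, and the unchanged action head decodes the result into an 8-step action chunk. The training objective comprises behavior cloning, a path-preference loss, and a diversity regularizer, with compute dropout used to sample sub-configurations.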
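
The inference-time flow of refining M hypotheses for K weight-tied steps and then aggregating them by confidence can be sketched in pure Python. This is a toy illustration under stated assumptions: `step_fn` stands in for the residual MLP Fθ, `scorer` stands in for hω, a softmax turns scores into weights, and all function names are hypothetical rather than taken from the paper.

```python
import math


def refine(h, c, step_fn, K):
    """Apply the same residual update K times (one shared step_fn = weight tying)."""
    for _ in range(K):
        delta = step_fn(h, c)
        h = [hi + di for hi, di in zip(h, delta)]
    return h


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def multi_path_reason(context, hypothesis_codes, step_fn, scorer, K):
    """Refine M hypotheses, then fuse them with softmax-normalized scorer weights."""
    branches = [refine(list(code), context, step_fn, K) for code in hypothesis_codes]
    weights = softmax([scorer(b, context) for b in branches])
    dim = len(branches[0])
    fused = [sum(w * b[d] for w, b in zip(weights, branches)) for d in range(dim)]
    return fused, weights
```

Here the width M is the number of entries in `hypothesis_codes` and the depth is `K`, so the cost of one call grows with K×M, in line with the latency behavior noted for the method. The fused vector would then be passed to the unchanged action head.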

Key Results:

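On the LIBERO and CALVIN benchmarks, MPCoT (K=5, M=4) substantially improves long-horizon task success over the OpenVLA-OFT baseline.
Ablations show that depth K and width M are complementary: more depth corrects single-path errors, more width covers multiple hypotheses, and combining them works best.
Confidence-weighted aggregation outperforms mean aggregation and hard selection, and reward-guided path-preference learning markedly improves path agreement (how often the scorer's preferred path matches the highest-return path).
MPCoT adds only about 2.7% more parameters; inference latency grows linearly with K×M, but there are no extra reward queries or VLM calls.

Tech Stack:

OpenVLA-OFT (backbone, 8-step action-chunk interface)
Latent reasoning module: weight-tied residual MLP (Fθ)
Path scorer: two-layer MLP (hω)
Learnable hypothesis codes (embeddings)
Behavior cloning loss
Reward-guided path-preference learning (advantage-weighted path selection)
Diversity regularizer
Compute dropout training strategy
World-model/VLM progress evaluator (training only)

Strengths:

Keeps the original VLA action interface unchanged and needs no changes to the backbone or decoder, so it is easy to integrate.
Emits zero reasoning tokens, avoiding the latency and indirection of explicit CoT.
Exposes configurable depth and width controls that can be adjusted at test time to fit the compute budget.
After training, inference needs no reward, world-model, or VLM queries and relies only on the learned scorer.
Validated on multiple benchmarks with thorough ablations and in-depth analysis.

Limitations:

Inference latency grows linearly with K×M, which may be limiting in scenarios with strict real-time requirements.
Training relies on a world-model/VLM evaluator for progress rewards, and the evaluator's quality directly affects path-preference learning.
Validated only on simulation benchmarks (LIBERO, CALVIN), with no real-robot tests.
Hypothesis codes and the perturbation mechanism may add training complexity and sensitivity to hyperparameters.
The paper does not discuss compatibility with more advanced backbones (e.g., π0, UnifiedVLA) and builds only on OpenVLA-OFT.

Relevance To Keywords:

Unify Models, World Models, Representation Learning, Model-Based RL: MPCoT is tightly coupled with a multimodal large model (VLA) and uses a world model/VLM as a training-time progress evaluator, which is an application of world models. The latent reasoning module involves representation learning (initializing and refining hypotheses in latent space). Reward-guided path-preference learning can be viewed as a form of model-based reinforcement learning (a world model supplies the reward signal). For post-training, MPCoT inserts a lightweight module into a pretrained VLA backbone and trains it end to end, which counts as post-training fine-tuning.
Native multimodal large models; unified multimodal understanding and generation: VLA is itself a vision-language-action multimodal model. MPCoT strengthens its reasoning but leaves the multimodal fusion architecture unchanged, so it is an enhancement of an existing model.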

4. Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation [PASS]

Score: 69.0 / 27.8

Authors: Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou, Xiu Li

Published: 2026-06-04

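TL;DR: This paper proposes a modality-gap-aware self-distillation framework that uses symbolic states to supervise a visual student model, substantially improving vision-language models on visual spatial planning.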

Abstract

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	7.0/10	10.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	7.0/10	10.5

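Scoring rationale: The paper's core is closing the modality gap in visual spatial planning, using a self-distillation framework to unify visual perception and symbolic reasoning. MLLM and MultiModal are highly relevant (the work builds on vision-language models and visual-symbolic modality interaction). World Models and model-based RL are highly relevant (state recovery, rollouts, and planning). Visual Encoder is a base component of the visual student (high relevance). Unify Models is moderately relevant (unification of capabilities rather than architecture). Tokenizer is not presented as a core contribution (low relevance). The author list does not include the designated experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang, so no bonus is applied. The weighted total of 69.0 is well above the dynamic passing score of 27.8.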

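Keywords

Visual Spatial Planning, Modality-Gap-Aware Self-Distillation, Symbolic State, Visual State Recovery, Vision-Language Models, Perception-Reasoning Gap, Multi-step Planning

In-depth Analysis

Chinese Title: 从符号状态学习视觉空间规划：基于模态差距感知的自蒸馏方法

Summary: This paper addresses the poor performance of vision-language models on visual spatial planning and traces it to a perception-reasoning modality gap: visual planning requires the model to infer latent state structure from pixels and then reason over it, whereas symbolic planning works directly with explicit objects and constraints. To bridge the gap, the authors propose the MGSD framework with two stages. A cold-start perception alignment stage uses supervised fine-tuning to give the visual student reliable state representations. A privileged-teacher stage uses explicit symbolic states to perform on-policy distillation on the student's own visual rollout prefixes, transferring planning ability. Symbolic data is used during training only; inference needs visual input alone. Across several visual planning benchmarks, MGSD raises the macro average by 19.3 and 18.4 points on the 4B and 8B backbones, respectively, narrowing the gap to the symbolic-input upper bound. Ablations and diagnostics show that the improvement comes from better visual state recovery and optimal-path reasoning.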

Innovations:

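Formalizes visual spatial planning as a perception-reasoning modality gap between visual and symbolic representations, and decomposes it into two bottlenecks: visual state recovery and multi-step planning.
Proposes the two-stage MGSD framework: cold-start perception SFT for reliable state grounding, and symbol-guided on-policy self-distillation to transfer planning behavior on the visual student's own trajectories.
Uses symbolic states as privileged training supervision while inference relies entirely on visual input, with no symbolic information or hand-written reasoning chains.
Optimizes a reverse-KL divergence over student rollouts for modality-gap-aware distillation, improving both perception and reasoning.

Methodology: MGSD uses a two-stage training strategy. Stage one is perception-oriented supervised fine-tuning (SFT): structured perception questions (e.g., object positions, obstacles) generated automatically from symbolic environment annotations train the visual model to recover state variables reliably. Stage two is symbol-guided on-policy self-distillation (OPSD): the visual student generates reasoning rollouts from the image and question, and a frozen symbolic teacher, conditioned on the symbolic state, a reference action plan, and the student's prefix, provides a per-token in-context distillation signal, with a reverse-KL objective being optimized. At inference the teacher is discarded and the student performs spatial planning from visual input alone.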
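
The reverse-KL objective mentioned in stage two can be illustrated for a single token position. The sketch below assumes the usual definition, KL(student || teacher), evaluated on next-token distributions along the student's own rollout and averaged over positions; the averaging and the names `reverse_kl` and `sequence_reverse_kl` are assumptions, since the report gives no formula.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position.

    The expectation is taken under the student distribution, so the student is
    penalized for putting mass where the teacher puts little.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def sequence_reverse_kl(student_seq, teacher_seq):
    """Mean per-token reverse KL over one student rollout (averaging is an assumption)."""
    terms = [reverse_kl(p, q) for p, q in zip(student_seq, teacher_seq)]
    return sum(terms) / len(terms)
```

In the setup described above, the student distribution would come from the image-conditioned model and the teacher distribution from the frozen model that also sees the symbolic state, both evaluated on the same student-generated prefix.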

Key Results:

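On visual planning benchmarks, MGSD raises the macro average from 11.2 to 30.5 on the 4B backbone and from 17.2 to 35.6 on the 8B backbone.
It narrows the gap to the symbolic-input upper bound, indicating that the modality gap is effectively reduced.
Ablations confirm that both training stages are indispensable, and diagnostics show that the gains come from better visual state recovery and stronger optimal-path reasoning.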
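
A quick arithmetic check relates the before-and-after macro averages to the 19.3 and 18.4 figures quoted in the abstract. The numbers are copied from the first result above; reading the abstract's percentages as absolute point gains is an inference from this calculation, not something the report states.

```python
# Macro average per backbone: (baseline, with MGSD)
results = {"4B": (11.2, 30.5), "8B": (17.2, 35.6)}

# Absolute gain in points, rounded to one decimal to absorb float error.
gains = {name: round(after - before, 1) for name, (before, after) in results.items()}

# Relative improvement as a multiple of the baseline, for comparison.
ratios = {name: after / before for name, (before, after) in results.items()}

print(gains)
```

The absolute gains come out to 19.3 and 18.4, matching the abstract, while the relative ratios are well above 2x, which suggests the abstract's figures are point differences rather than relative percentages.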

Tech Stack:

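Vision-language model (VLM) backbones at 4B and 8B scale
Supervised fine-tuning (SFT)
On-policy self-distillation (OPSD)
Reverse-KL divergence
Symbolic state representation (explicit objects, constraints, transitions)
Automatically generated perception questions (from symbolic environment annotations)
Per-token in-context distillation signal

Strengths:

Clearly defines the modality gap in visual spatial planning and offers a systematic solution.
The two-stage design is sound: aligning perception first and distilling reasoning second avoids interference from early noise.
Uses symbolic states as privileged information but needs no symbolic input at inference, preserving practical usability.
Experiments are thorough, with significant gains across multiple backbones and benchmarks plus detailed ablations and diagnostics.
The code is open source, which eases reproduction and further research.

Limitations:

Depends on symbolic states being available at training time, so it may not apply to environments without them (e.g., real-world images).
Cold-start SFT requires automatically generated perception questions, whose quality may affect the later distillation.
On-policy distillation is computationally expensive because the teacher must provide feedback at every rollout step.
Experiments cover only specific planning tasks such as grid navigation, path finding, and object interaction, so generalization remains to be tested.

Relevance To Keywords:

Unify Models: The paper uses a single VLM backbone and distills symbolic planning ability into the visual model, reflecting the unification of multimodal models.
World Models: Visual state recovery can be seen as part of building a world model, and symbolic states provide an explicit world representation.
Representation Learning: Cold-start SFT learns a mapping from pixels to symbolic states, which is representation learning.
Model-Based RL: The planning task can itself be seen as model-based control, and distillation resembles learning a planning policy from a symbolic teacher.
Native multimodal large models: The work builds on a VLM and aims to improve its visual spatial planning, which is post-training of a multimodal large model.
Unified multimodal understanding and generation: The model must understand images and generate action sequences, combining understanding and generation.
Representation learning (表征学习): same as above.
World models (世界模型): same as above.
Reinforcement learning: On-policy distillation resembles policy optimization in RL but is not standard RL.
Post-training: Both training stages are post-training that improves a specific model capability.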

5. CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving [PASS]

Score: 66.0 / 27.8

Authors: Yining Xing, Zehong Ke, Zhiyuan Liu, Yanbo Jiang, Wenhao Yu, Jianqiang Wang

Published: 2026-06-04

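TL;DR: The CLEAR framework combines the Drive-JEPA visual encoder with a fine-tuned Qwen 3.5 MLLM to deliver fast, high-fidelity multimodal autonomous driving planning without iterative diffusion sampling.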

Abstract

End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen 3.5 0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\alpha$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	9.0/10	13.5
World Models	1.5	6.0/10	9.0
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	3.0/10	4.5

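Scoring rationale: Keyword relevance analysis: Visual Encoder (9.0) and MLLM (9.0) are the paper's core technologies (Drive-JEPA, Qwen 3.5). MultiModal (8.0) reflects the fusion of vision and language. World Models (6.0) is related through VAE latent-space planning but is not explicitly named. Unify Models (7.0) reflects the unification of perception and generation. Tokenizer (2.0) is implicitly present but not a focus. model-based RL (3.0) reflects model-based planning without reinforcement learning. The weighted total of 66.0 is above the dynamic passing score of 27.8. The author list does not include any designated expert, so no bonus is applied.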

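Keywords

End-to-end Autonomous Driving, Drive-JEPA, VAE Latent Space, Qwen 3.5, Adaptive Routing, Generative Planning, Multi-modal Reasoning

In-depth Analysis

Chinese Title: CLEAR: 端到端自动驾驶中的认知与潜在评估自适应路由

Summary: End-to-end autonomous driving models often face a trade-off between multimodal trajectory generation and real-time inference. Diffusion models capture diverse driving behaviors, but their iterative denoising incurs too much latency. This paper proposes the CLEAR framework, which combines ultra-fast generative planning with deep semantic reasoning. CLEAR uses Drive-JEPA as the visual encoder and replaces multi-step denoising with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient α to balance diversity and expert precision. A fine-tuned Qwen 3.5 0.8B model extracts scene-aware hidden states that guide an adaptive scheduler (which selects α and the sample count N) and a cross-attention scorer (which selects the best trajectory). On the NAVSIM v1 benchmark, CLEAR reaches a PDMS of 93.7, showing that efficient, high-fidelity multimodal planning is possible without dense geometric annotations or iterative sampling.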

Innovations:

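Single-step conditional drift generation: performs a single-step conditional drift in the VAE latent space instead of multi-step denoising, generating diverse trajectory candidates at 99 FPS while preserving multimodality.
LLM-driven adaptive scheduling and scoring: uses hidden states of the fine-tuned Qwen 3.5 0.8B to adaptively select the conditioning coefficient α and sample count N, and evaluates candidate trajectories with a cross-attention scorer in place of heuristic cost functions.
State-of-the-art closed-loop performance: reaches 93.7 PDMS on NAVSIM, surpassing methods that depend on dense 3D perception annotations, using only Drive-JEPA visual features and cognitive QA pairs for a compact LLM.

Methodology: CLEAR uses a frozen Drive-JEPA visual encoder to extract geometric features and a fine-tuned Qwen 3.5 0.8B as the semantic feature extractor. The Adaptive Scheduler maps the LLM hidden states through a TransformerDecoder to a categorical distribution over discrete sampling schemes (α, N) and is trained with a cross-entropy loss. The CLEAR decoder is based on the MLP-Mixer architecture: it compresses visual features with learnable scene queries, injects α through adaptive layer normalization (adaLN), performs a single-step conditional drift in the VAE latent space (combining multiple attractors with inter-sample repulsion), and outputs physical trajectories through a pre-fitted PCA projection. The cross-attention scorer uses the MLP-Mixer output as queries and the LLM hidden states as memory, produces scores through a TransformerDecoder, and is trained with a hinge ranking loss and an MSE loss.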
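
Two pieces of this pipeline are simple enough to sketch: the hinge ranking loss used to train the scorer, and picking one scheme from a discrete set of (α, N) pairs. The pairwise form of the hinge loss, the margin value, the argmax selection rule, and the example scheme values are all assumptions made for illustration; the report names the loss and the discrete scheme set but gives no formulas or values.

```python
def hinge_ranking_loss(scores, quality, margin=0.1):
    """Pairwise hinge loss: a better candidate should outscore a worse one by `margin`.

    scores  - scorer outputs for each candidate trajectory
    quality - reference quality of each candidate (higher is better)
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if quality[i] > quality[j]:
                total += max(0.0, margin - (scores[i] - scores[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


def pick_scheme(logits, schemes):
    """Choose the (alpha, N) scheme with the highest scheduler logit."""
    best = max(range(len(logits)), key=lambda k: logits[k])
    return schemes[best]


# Hypothetical scheme set: (conditioning coefficient alpha, sample count N).
schemes = [(0.0, 4), (0.5, 8), (1.0, 16)]
alpha, n_samples = pick_scheme([0.1, 2.0, -1.0], schemes)
```

With the scheme fixed, the decoder would draw `n_samples` candidates conditioned on `alpha`, and the scorer would rank them to select the final trajectory.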

Key Results:

在NAVSIM v1基准上，CLEAR达到PDMS 93.7，超越现有方法。
生成速度高达99 FPS，满足实时控制预算。
无需密集几何标注或大规模MLLM，仅使用Drive-JEPA和0.8B LLM即可实现SOTA性能。

Tech Stack:

Drive-JEPA（视觉编码器）
Qwen 3.5 0.8B（语言模型）
VAE（变分自编码器）
PCA（主成分分析）
MLP-Mixer（解码器架构）
Adaptive Layer Normalization (adaLN)
TransformerDecoder（调度器与评分器）
交叉注意力机制
铰链排名损失 (Hinge Ranking Loss)
均方误差损失 (MSE Loss)
Winner-Take-All损失
条件漂移 (Conditional Drift)

Strengths:

高效性：单步生成替代多步去噪，实现99 FPS，满足实时性要求。
多模态性：通过多吸引子与样本间排斥机制，生成多样且合理的轨迹候选。
认知融合：利用LLM隐藏状态进行自适应调度和评分，无需LLM直接输出文本，避免格式不稳定和延迟。
无需密集标注：仅使用视觉特征和QA对，不依赖3D感知标注，降低数据需求。
闭环性能优异：在NAVSIM上取得SOTA PDMS 93.7。

Limitations:

依赖特定LLM（Qwen 3.5 0.8B），可能在其他LLM上需要重新微调。
自适应调度器训练需要预计算最优方案标签（通过PDMS评分器），计算开销较大。
未详细讨论极端场景（如恶劣天气、传感器故障）下的鲁棒性。
PCA投影约束可能限制轨迹多样性，尤其在非典型驾驶场景中。

Relevance To Keywords:

Unify Models: CLEAR将视觉编码、生成规划、LLM推理统一为端到端架构，符合统一模型思想。
World Models: Drive-JEPA作为世界模型的一种（预测潜在空间演化），用于提取场景结构。
Representation Learning: 使用JEPA和VAE进行表征学习，抑制噪声并保留驾驶相关结构。
Model-Based RL: 论文未直接涉及强化学习，但生成规划可视为基于模型的控制，与模型基RL相关。
原生多模态大模型: 使用Qwen 3.5作为多模态大模型（视觉+语言），但仅提取隐藏状态而非生成文本。
多模态大模型的理解和生成一体化: 论文将LLM用于理解（语义特征提取）而非生成，未实现一体化。
表征学习: 核心贡献之一，通过VAE和JEPA学习有效表征。
世界模型: Drive-JEPA本质是世界模型，用于预测场景演变。
强化学习: 未使用强化学习训练，但自适应调度可视为一种元学习。
后训练: 论文对Qwen 3.5进行了微调（后训练），但未涉及大规模后训练方法。

6. Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing [PASS]

Score: 66.0 / 27.8

Authors: Yuxiao Ye, Haoran He, Fangyuan Kong, Xintao Wang, Pengfei Wan, Kun Gai, Ling Pan

Published: 2026-06-04

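TL;DR: Edit-R2 is a reinforcement-learning framework for multi-turn image editing that reconstructs session intent to counter context dilution and state contamination, substantially improving instruction following and content consistency.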

摘要翻译

文本引导图像编辑借助扩散模型（Diffusion Models）和统一多模态基础模型（Unified Multimodal Foundation Models）取得了快速进展。然而，大多数现有方法仍局限于单轮设置，忽略了更现实的多轮上下文编辑（Multi-turn In-context Editing）场景，即用户通过一系列指令迭代精炼图像。在此设置下，模型必须在遵循每条新指令的同时保留累积的会话级约束，这面临着两种耦合的失败模式：长上下文稀释（Long-context Dilution），即稀疏的文本约束难以从不断增长的交错图像 - 文本历史中恢复；以及状态污染（State Contamination），即早期的编辑错误会损害后续生成。我们引入 Edit-R2，一种针对统一多模态模型的新型强化学习后训练框架。Edit-R2 重构操作性会话意图，从而有效地将分散的历史约束整合为每次编辑轮次前的显式推理轨迹。它进一步通过一个统一目标实现了推理与生成的多轮强化学习（RL），该目标联合优化离散文本空间中的意图重建生成与连续潜在空间中的流匹配（Flow-matching）图像生成；同时，轨迹过滤机制能够抑制受损的轨迹（rollouts），从而在状态污染下稳定训练。为了支持系统评估，我们引入 MICE-Bench，这是一个用于多轮上下文编辑的大规模基准，包含针对累积会话约束的指令遵循（IF）、内容一致性（CC）和全局感知（GA）等自动度量指标。实验表明，Edit-R2 显著提升了多轮上下文编辑性能，并在与强基线的对比中达到了具有竞争力的表现。

Abstract

Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	5.0/10	7.5

评分理由: 论文核心聚焦于统一多模态模型（Unify Models, MLLM, MultiModal）的多轮图像编辑，因此相关度高。虽涉及强化学习（model-based RL），但摘要侧重于训练框架与意图重构，而非显式环境模型学习，故评分中等。Tokenizer 与 Visual Encoder 为底层组件非核心创新，World Models 为背景概念，故评分较低。

关键词

Multi-Turn Image Editing, Reinforcement Learning, Unified Multimodal Models, Intent Reconstruction, Text-Guided Editing, Session Constraints, Flow-Matching

深度分析

Chinese Title: Edit-R2：面向多轮图像编辑的上下文感知强化学习

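Summary: This paper targets multi-turn in-context image editing, a practical but under-studied setting, and proposes Edit-R2, a new reinforcement learning post-training framework. Built on the unified multimodal model BAGEL, the framework uses in-context chain-of-thought (IC-CoT) to reconstruct the active session intent before each editing turn, compressing scattered historical constraints into an explicit reasoning trace and mitigating long-context dilution. Edit-R2 jointly optimizes over discrete text space (intent reconstruction) and continuous latent space (flow-matching image generation), and adds a trajectory filtering mechanism that suppresses the influence of contaminated rollouts on training. To support systematic evaluation, the authors build the large-scale benchmark MICE-Bench, which contains two task types, content memory and content understanding, and introduce a global awareness (GA) metric that quantifies adherence to session-level constraints. Experiments show that Edit-R2 clearly outperforms existing open-source models on MICE-Bench, improving both instruction following (IF) and global awareness (GA) by 18%, and performs on par with closed-source models.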

Innovations:

首次将多轮上下文图像编辑形式化为会话级强化学习问题，并构建了大规模自动化基准MICE-Bench，包含内容记忆和内容理解两类任务。
提出全局感知（GA）指标，专门量化模型对会话级累积约束的遵从度，填补了现有评估指标的空白。
设计上下文链式思维（IC-CoT）机制，在每轮编辑前显式重建会话意图，有效缓解长上下文稀释和状态污染问题。
实现离散文本空间（意图重建）与连续潜空间（图像生成）的联合强化学习优化，通过统一目标函数协同提升推理与生成能力。
引入轨迹过滤机制，在训练中剔除被早期错误污染的轨迹，稳定多轮强化学习训练过程。

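Methodology: The approach has four parts. (1) Multi-turn in-context image editing is modeled as a finite-horizon Markov decision process (MDP): the state contains the image and instruction history, the action is the generated image, and the reward combines the instruction following, content consistency, and global awareness metrics. (2) On top of the unified multimodal model BAGEL, in-context chain-of-thought (IC-CoT) generates an explicit intent-reconstruction text before each editing turn, which serves as a stable condition for the subsequent image generation. (3) Reinforcement learning (GRPO-like) jointly optimizes intent reconstruction (discrete text space) and flow-matching image generation (continuous latent space), with reward signals from the automated evaluation metrics. (4) A trajectory filtering mechanism selects high-quality rollouts by their per-turn reward scores, preventing contaminated rollouts from dominating training. The training data consists of 720 three-turn editing instances from MICE-Bench.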
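Steps (3) and (4) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not Edit-R2's actual procedure: the filtering rule (drop a rollout if any turn's reward falls below a threshold), the threshold value, and the use of summed per-turn rewards as the return are hypothetical; only the group-relative normalization follows the general GRPO recipe.

```python
import numpy as np


def filter_rollouts(turn_rewards, min_turn_reward=0.5):
    """Boolean mask over rollouts: keep those whose every turn reaches the threshold.

    turn_rewards has shape (num_rollouts, num_turns). A low reward at any turn
    is treated as a sign of a contaminated rollout (assumed rule).
    """
    turn_rewards = np.asarray(turn_rewards, dtype=float)
    return turn_rewards.min(axis=1) >= min_turn_reward


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantage: standardize returns within the sampled group."""
    returns = np.asarray(returns, dtype=float)
    return (returns - returns.mean()) / (returns.std() + eps)


# Four hypothetical rollouts of a three-turn editing session.
turn_rewards = np.array([
    [0.9, 0.8, 0.7],
    [0.9, 0.2, 0.1],  # early mistake degrades later turns
    [0.6, 0.7, 0.8],
    [0.5, 0.6, 0.4],
])
keep = filter_rollouts(turn_rewards)
returns = turn_rewards[keep].sum(axis=1)
advantages = group_relative_advantages(returns)
```

Filtering before normalization means a corrupted rollout neither receives a gradient nor shifts the group mean that the surviving rollouts are compared against.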

Key Results:

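Compared with the BAGEL baseline on MICE-Bench, Edit-R2 improves both instruction following (IF) and global awareness (GA) by 18%.
On content understanding tasks, Edit-R2 correctly resolves cross-turn references (such as the pronoun "its") where the baseline fails.
On content memory tasks, Edit-R2 keeps session-level constraints (for example, all newly added objects are purple) throughout the session, with a GA score that stays at 1.0.
Compared with closed-source models (such as GPT-Image-1), Edit-R2 matches or approaches their performance on several metrics.
Ablation studies confirm the contribution of each module: IC-CoT, joint optimization, and trajectory filtering.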

Tech Stack:

统一多模态模型：BAGEL（基于流匹配的图像生成与理解一体化模型）
强化学习算法：类GRPO（Group Relative Policy Optimization），用于离散和连续空间的联合优化
意图重建：上下文链式思维（In-Context Chain-of-Thought, IC-CoT）
图像生成：流匹配（Flow Matching）
基准构建：EdiVal-Agent管道（VLM提取物体清单、GroundingDINO过滤幻觉、LLM生成指令）
评估指标：指令跟随（IF）、内容一致性（CC）、全局感知（GA）
轨迹过滤：基于奖励分数的筛选机制

Strengths:

问题定义新颖且实用：多轮上下文编辑更贴近真实用户交互场景，现有工作极少涉及。
方法设计系统：从意图重建、联合优化到轨迹过滤，形成完整闭环，有效解决长上下文和状态污染两大核心挑战。
基准构建严谨：MICE-Bench包含自动化生成管道和专用指标（GA），可扩展性强，便于后续研究。
实验充分：在多个维度与开源和闭源模型对比，消融实验验证各组件贡献。
技术贡献明确：首次将强化学习引入多轮上下文图像编辑，并实现离散-连续空间的联合优化。

Limitations:

基准规模有限：MICE-Bench仅包含720个三回合编辑实例，可能不足以覆盖复杂多轮交互模式。
依赖统一模型架构：Edit-R2基于BAGEL，其泛化性到其他架构（如自回归模型）未验证。
奖励设计依赖自动化指标：IF、CC、GA等指标可能无法完全捕捉人类编辑偏好，存在评估偏差。
训练计算成本高：多轮强化学习需要多次采样和评估，训练开销较大。
未探讨更长的会话（如超过3轮）或更复杂的约束类型（如否定、条件约束）。

Relevance To Keywords:

Unify Models: 论文基于统一多模态模型BAGEL，该模型同时支持图像理解和生成，符合统一模型趋势。
World Models: 论文未直接涉及世界模型，但意图重建可视为对会话状态的隐式建模，与表征学习相关。
Representation Learning: IC-CoT将分散的历史约束压缩为显式推理轨迹，属于表征学习的一种形式。
Model-Based RL: 论文采用强化学习但未显式构建环境模型，属于无模型RL，与Model-Based RL相关性较弱。
原生多模态大模型: BAGEL是原生多模态大模型，论文在其基础上进行后训练，直接相关。
多模态大模型的理解和生成一体化: 论文联合优化意图重建（理解）和图像生成，体现一体化思想。
表征学习: 意图重建本质上是学习会话状态的紧凑表征。
世界模型: 不直接相关。
强化学习: 核心方法，将多轮编辑建模为MDP并使用RL优化。
后训练: 论文标题明确为RL后训练框架，在预训练模型基础上进行强化学习微调。

7. LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing [PASS]

Score: 64.5 / 27.8

Authors: Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong, Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Hao Jiang

Published: 2026-06-04

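TL;DR: LoomVideo is a unified video generation and editing architecture built on an MLLM and zero-overhead condition injection, delivering efficient multimodal video processing at 5B parameters.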

摘要翻译

开发能够处理交错多模态输入的统一视频生成与编辑模型是一个具有前景但充满挑战的前沿领域。现有的统一框架主要依赖大规模模型（通常为 13B 参数或更多），并通过连接序列令牌将源视频条件纳入编辑过程。这种连接方式不可避免地使序列长度加倍，导致 self-attention（自注意力）机制的计算复杂度变为原来的四倍，并引入难以承受的开销。为了解决这些瓶颈，我们提出了 LoomVideo，这是一种高度高效的 5B 参数统一架构，适用于视频生成与编辑。LoomVideo 使用多模态大语言模型（Multimodal Large Language Model, MLLM）替换了标准文本编码器，并采用 Deepstack 注入机制来对齐多层 MLLM 特征与扩散变换器（Diffusion Transformer, DiT）。至关重要的是，我们提出了一种零开销的 Scale-and-Add（缩放与添加）条件化方法用于视频编辑。该方法通过对干净的源视频潜在向量进行缩放并直接添加到带噪声的目标潜在向量中，优雅地消除了对令牌连接的需求，大幅降低了计算成本，同时保持了对复杂、非刚性编辑的鲁棒能力。此外，Negative Temporal RoPE（负向时间旋转位置编码）策略被无缝集成，以处理多个参考图像。广泛的实验表明，我们紧凑的 5B 模型在综合基准上实现了 state-of-the-art（最先进）或极具竞争力的性能，在电子商务和时尚生成场景中表现出卓越的优越性。得益于零开销条件化机制，LoomVideo 相比同等能力的模型在推理速度上实现了至少 5.41 倍的加速，为高度实用且高效的 video foundation models（视频基础模型）铺平了道路。

Abstract

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier field. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this elegant design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is seamlessly integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, exhibiting exceptional superiority in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	10.0/10	15.0
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	10.0/10	15.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0

评分理由: 论文标题与摘要均强调统一多模态输入的视频生成与编辑（Unify Models, MultiModal），并核心依赖 MLLM（MLLM），故相关度最高。Tokenizer 与 Visual Encoder 虽在架构中存在，但非主要创新贡献，故给中等分。World Models 概念略有重叠但非本文重点，model-based RL 完全无关。作者列表中未包含指定专家。

关键词

Unifying Multimodal Inputs, Video Generation, Video Editing, MLLM, Diffusion Transformer, Zero-overhead Conditioning, Scale-and-Add, 5B-parameter

深度分析

Chinese Title: LoomVideo：统一多模态输入的视频生成与编辑

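Summary: This paper presents LoomVideo, an efficient unified 5B-parameter architecture for video generation and editing. Existing unified frameworks typically rely on models of 13B parameters or more and implement editing by concatenating source-video tokens with target tokens, which doubles the sequence length and quadruples the cost of self-attention. LoomVideo replaces the standard text encoder with a multimodal large language model (MLLM) and designs a Deepstack injection mechanism that feeds the features of each MLLM layer through cross-attention into the corresponding layer of the diffusion Transformer, strengthening semantic alignment. For editing, it proposes the zero-overhead Scale-and-Add conditioning method: the clean source-video latent is scaled and added directly to the noised target latent, which avoids token concatenation and greatly reduces compute. A Negative Temporal RoPE strategy handles multiple reference images. Three-stage progressive training (low-resolution semantic alignment, high-resolution multi-task training, then reference images and advanced instructions) combined with reinforcement learning post-training lets the compact model match or exceed SOTA on multiple benchmarks, with especially strong results in e-commerce and fashion generation, and inference is at least 5.41x faster than models of similar capability.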

Innovations:

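Deepstack injection: the hidden states of every MLLM layer are extracted, projected by an MLP, and injected into the cross-attention of the corresponding DiT layer for deep semantic alignment.
Zero-overhead Scale-and-Add conditioning: the clean source-video latent is scaled according to the current timestep and added directly to the noised target latent, removing token concatenation entirely and significantly lowering computational complexity.
Negative Temporal RoPE indexing for multiple reference images, which separates reference inputs from video frames and keeps spatiotemporal dynamics consistent.
A three-stage progressive training strategy (low-resolution semantic alignment, then high-resolution multi-task training, then reference images and advanced instructions) combined with reinforcement learning post-training, improving instruction following and generation fidelity.
Unified video generation and editing at the 5B-parameter scale, with inference at least 5.41x faster than models of similar capability and excellent performance in e-commerce and fashion scenarios.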

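Methodology: LoomVideo builds on the Wan 2.2 TI2V model (5B parameters) and replaces the T5 text encoder with Qwen3-VL (an 8B MLLM). Deepstack injection extracts features from each MLLM layer and, after a shared MLP projection, injects them as the cross-attention K/V of the corresponding DiT layer. For editing, Scale-and-Add multiplies the clean source-video latent by a coefficient for the current timestep and adds it directly to the noised target latent. Multiple reference images use Negative Temporal RoPE indices. Training has three stages: Stage 1 aligns the MLLM and the DiT at low resolution (256×256); Stage 2 trains generation, reconstruction, and editing together at high resolution (512×512); Stage 3 introduces reference images and complex instructions. Post-training uses reinforcement learning (the specific method is not detailed, but it is reported to improve instruction following and generation fidelity).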
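The difference between token concatenation and Scale-and-Add is easiest to see in tensor shapes. The NumPy sketch below is a toy under stated assumptions: the timestep coefficient `linear_scale` (1 − t) is hypothetical, since the summary only says the source latent is multiplied by a timestep-dependent coefficient, and `attention_cost` counts only the quadratic term of self-attention.

```python
import numpy as np


def attention_cost(seq_len):
    """Quadratic term of self-attention: number of query-key pairs."""
    return seq_len ** 2


def concat_condition(noisy_target, source):
    """Baseline: append source tokens to the sequence, giving shape (2L, D)."""
    return np.concatenate([noisy_target, source], axis=0)


def linear_scale(t):
    """Hypothetical timestep coefficient; the real schedule is not given."""
    return 1.0 - t


def scale_and_add(noisy_target, source, t, scale_fn=linear_scale):
    """Scale-and-Add: the sequence length stays L, so attention cost is unchanged."""
    return noisy_target + scale_fn(t) * source


rng = np.random.default_rng(0)
L, D = 16, 8                            # toy token count and latent width
source = rng.normal(size=(L, D))        # clean source-video latent
noisy_target = rng.normal(size=(L, D))  # noised target latent

cat = concat_condition(noisy_target, source)
add = scale_and_add(noisy_target, source, t=0.25)
ratio = attention_cost(cat.shape[0]) / attention_cost(add.shape[0])
print(cat.shape, add.shape, ratio)
```

Doubling the sequence from L to 2L turns L² query-key pairs into 4L², which is the fourfold cost the abstract refers to; the additive variant leaves the sequence length, and therefore the attention cost, untouched.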

Key Results:

在多个视频生成与编辑基准上达到或超越SOTA性能。
在电商和时尚产品场景的参考图像引导视频编辑与可控生成中表现突出。
推理速度相比同类拼接式统一模型加速至少5.41倍。
5B参数模型能完成此前需13B模型的任务，实现高效统一。
成功处理复杂非刚性编辑（如改变人类动作、相机角度）。

Tech Stack:

Diffusion Transformer (DiT)
Multimodal Large Language Model (MLLM): Qwen3-VL
Cross-attention机制
Scale-and-Add条件注入
Negative Temporal RoPE (旋转位置编码)
三阶段渐进训练策略
强化学习后训练 (Reinforcement Learning Post-training)
Wan 2.2 Text-Image-to-Video (TI2V) 基础模型

Strengths:

参数规模紧凑（5B），计算效率高，推理速度快。
零开销编辑条件设计，避免序列长度翻倍，显著降低训练和推理成本。
Deepstack注入充分利用MLLM多层语义，提升多模态对齐质量。
支持多种任务：文本到视频、指令编辑、参考图像引导生成与编辑、多图像到视频。
在电商和时尚等实际场景中展现优异性能，具有实用价值。

Limitations:

论文未详细说明强化学习后训练的具体算法和超参数，可复现性受限。
仅与同类统一模型对比，未与专用视频编辑模型（如InsViE、Ditto）进行充分比较。
Scale-and-Add条件可能对某些极端非刚性编辑（如大幅改变场景结构）效果有限，论文未深入分析边界情况。
依赖8B MLLM（Qwen3-VL），整体模型参数量实际为5B+8B，但论文称5B参数可能仅指DiT部分，需明确。
未讨论模型在长视频生成或高分辨率（如1080p）下的性能。

Relevance To Keywords:

Unify Models: 高度相关，论文提出统一视频生成与编辑的单一架构。
World Models: 部分相关，视频生成可视为学习世界动态，但论文未明确强调世界模型概念。
Representation Learning: 相关，Deepstack注入和MLLM特征提取涉及多层级表征学习。
Model-Based RL: 弱相关，后训练使用强化学习，但非基于模型的RL。
原生多模态大模型: 高度相关，使用MLLM处理多模态输入，实现理解与生成一体化。
多模态大模型的理解和生成一体化: 高度相关，LoomVideo将MLLM理解能力与DiT生成能力深度融合。
表征学习: 相关，MLLM多层特征注入可视为跨模态表征对齐。
世界模型: 部分相关，视频生成需建模时空动态，但论文未直接探讨世界模型。
强化学习: 相关，后训练阶段采用强化学习提升指令遵循。
后训练: 高度相关，论文明确使用强化学习后训练作为最终优化步骤。

8. AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding [PASS]

Score: 63.0 / 27.8

Authors: Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen

Published: 2026-06-04

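TL;DR: AffordanceVLA is a unified vision-language-action framework built on affordance prediction; by bridging the VLM semantic space and the control policy, it markedly improves the precision of perception-action mapping in robotic manipulation.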

摘要翻译

视觉 - 语言 - 动作 (VLA) 模型利用预训练视觉 - 语言模型 (VLMs) 丰富的世界知识，从而实现指令跟随式机器人操作。然而，VLM 语义空间与具身控制策略之间的结构不匹配往往阻碍了精确感知 - 动作映射的学习。为应对这一挑战，我们提出 AffordanceVLA，这是一个统一的框架，通过引入结构化可操作性（affordance）预测作为面向任务的中间表示，以建立更精确且稳健的感知 - 动作映射。具体而言，我们通过三个互补组件逐步建模操作先验：1) Which2Act，通过视觉潜变量预测实现以对象为中心的视觉定位，以抑制干扰；2) Where2Act，通过可操作性图（affordance map）估计实现 2D 交互定位；3) How2Act，通过 3D 几何推理指导操作策略。这些可操作性线索提供了空间锚定、语义条件化且与动作耦合的中间表示，从而自然地桥接了视觉、语言与动作。我们将这些模块集成到具有专用专家的混合 Transformer (MoT) 架构中，并采用三阶段训练策略结合渐进式数据课程对模型进行训练。为克服机器人数据集中密集可操作性标签稀缺的问题，我们还开发了一个稳健的自动化数据增强流水线。在仿真环境和真实世界中的广泛实验表明，AffordanceVLA 在多样的操作场景中均展现出优异的性能。

Abstract

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments on simulation and real-world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	5.0/10	7.5

评分理由: 论文提出 AffordanceVLA 统一框架，涉及视觉 - 语言 - 行动（MLLM/MultiModal）及统一建模（Unify Models），故相关度高（8-9 分）；视觉编码器隐含在 VLM 架构中（7 分）；未明确提及 Tokenizer 或显式世界模型/模型强化学习（2-5 分）。加权总分 63.0，远超及格线 27.8。作者列表中未包含指定专家。

关键词

AffordanceVLA, Vision-Language-Action, Affordance Forecasting, Robotic Manipulation, Unified Framework, Perception-Action Mapping, Mixture-of-Transformer

深度分析

Chinese Title: AffordanceVLA: 一种通过可操作性感知理解赋能动作生成的视觉-语言-动作模型

Summary: 论文提出AffordanceVLA，一种统一的视觉-语言-动作（VLA）框架，通过引入结构化可操作性预测作为任务导向的中间表示，解决VLM语义空间与机器人控制策略之间的结构不匹配问题。框架包含三个互补模块：Which2Act（物体中心定位，通过视觉潜在预测抑制干扰）、Where2Act（2D交互定位，通过可操作性图估计）、How2Act（3D几何推理，指导操作策略）。采用混合Transformer（MoT）架构，包含理解专家、可操作性生成专家和动作专家三个专用模块，并设计三阶段训练策略（预训练、协同训练、后训练）与渐进式数据课程。为克服机器人数据集中可操作性标注稀缺的问题，开发了自动化数据增强管线。在LIBERO、CALVIN等仿真基准和真实世界实验中，AffordanceVLA取得了与最新VLA模型竞争的成功率，展现出强泛化能力、空间鲁棒性和跨模态对齐。

Innovations:

提出AffordanceVLA框架，利用结构化可操作性预测（Which2Act、Where2Act、How2Act）作为中间监督，建立更精确的感知-动作映射。
设计混合Transformer（MoT）架构，包含理解、可操作性生成、动作三个专用专家，实现渐进式信息融合。
提出三阶段训练策略（预训练、协同训练、后训练）与渐进式数据课程，有效利用不同来源数据。
开发自动化数据增强管线，为大规模机器人数据合成可操作性标注，解决标注稀缺问题。
在仿真和真实世界实验中取得与最新VLA模型竞争的性能，并展现出强泛化与鲁棒性。

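Methodology: The paper adopts a Mixture-of-Transformer (MoT) architecture that splits the model into three experts: an understanding expert (vision-language alignment), an affordance generation expert (predicting Which2Act, Where2Act, and How2Act), and an action expert (generating continuous actions). Training has three stages: Stage I pre-trains on referring-grounding and interaction-aware scene data; Stage II co-trains on large-scale synthetic robot data; Stage III post-trains on the target dataset. An automated data augmentation pipeline generates affordance annotations for robot trajectories (such as target-object masks, interaction points, and 3D geometric information). At inference time, the input image and language instruction pass through the understanding expert, the affordance generation expert, and the action expert in turn to produce an action sequence.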

Key Results:

在LIBERO基准上，AffordanceVLA在多个任务上取得与最新VLA模型（如RT-2、Octo）竞争的成功率。
在CALVIN基准上，AffordanceVLA在长序列任务中展现出优异的泛化能力。
真实世界实验中，AffordanceVLA成功完成多种操作任务（如抓取、放置、开门等），验证了跨场景迁移能力。
消融实验表明，结构化可操作性中间表示显著提升了模型的空间鲁棒性和指令跟随准确性。
定性分析显示，Which2Act有效抑制背景干扰，Where2Act精确定位交互区域，How2Act提供3D几何引导。

Tech Stack:

Vision-Language-Action (VLA) 模型
Mixture-of-Transformer (MoT) 架构
可操作性预测（Which2Act, Where2Act, How2Act）
三阶段训练策略（预训练、协同训练、后训练）
自动化数据增强管线（合成掩码、交互点、3D几何）
仿真环境：LIBERO, CALVIN
预训练视觉-语言模型（如CLIP, SigLIP等）
动作解码：连续动作回归（可能使用扩散或流匹配）

Strengths:

结构化中间表示（可操作性）有效桥接视觉-语言语义空间与物理动作空间，提升感知-动作映射精度。
MoT架构允许不同专家专注于不同任务，避免单一模型能力被低层动作学习侵蚀。
三阶段训练策略充分利用多源数据（VQA、机器人轨迹），实现从语义理解到操作控制的平滑过渡。
自动化数据增强管线解决了可操作性标注稀缺问题，使方法可扩展到大规模数据集。
在仿真和真实世界均取得强性能，展现出良好的泛化性和鲁棒性。

Limitations:

可操作性预测依赖合成标注，可能引入噪声或与真实场景分布不一致。
三阶段训练流程复杂，需要精心设计数据课程和超参数。
MoT架构增加了模型参数量和推理计算开销，实时性可能受限。
当前实验主要针对桌面操作场景，在更复杂动态环境中的泛化性有待验证。
未与基于世界模型的VLA方法（如视频预测）进行直接比较，缺乏对中间表示效率的定量对比。

Relevance To Keywords:

Unify Models: AffordanceVLA统一了感知、预测和动作，属于统一模型方向。
World Models: 可操作性预测可视为一种轻量级世界模型，预测交互可能性而非完整未来帧。
Representation Learning: 结构化可操作性作为中间表示，学习任务导向的紧凑表征。
Model-Based RL: 可操作性预测为策略提供先验，可视为基于模型的方法（但论文未明确使用RL）。
原生多模态大模型: 基于预训练VLM构建，属于原生多模态大模型在机器人领域的应用。
多模态大模型的理解和生成一体化: 模型同时进行视觉-语言理解（理解专家）和动作生成（动作专家），并通过可操作性生成连接。
表征学习: 通过Which2Act、Where2Act、How2Act学习物体、交互点、几何的表征。
世界模型: 可操作性预测是对未来交互状态的隐式建模，类似于世界模型中的状态预测。
强化学习: 论文未直接使用RL，但三阶段训练中的后训练可类比RL中的微调；可操作性可辅助RL探索。
后训练: 论文明确包含后训练阶段（Stage III），在目标数据集上微调，与后训练概念直接相关。

9. TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies [PASS]

Score: 60.0 / 27.8

Authors: Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding

Published: 2026-06-04

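TL;DR: TempoVLA is a vision-language-action model that uses speed conditioning and trajectory augmentation to provide flexible speed control in robot manipulation, improving performance across phases with different risk levels.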

摘要翻译

机器人操作在需要快速执行的低风险过渡阶段与需要缓慢、精确运动的高风险接触阶段之间交替进行。然而，现有的视觉 - 语言 - 动作模型（VLAs）仅从训练演示中继承了单一固定速度。先前通过模型压缩、KV-cache 重用或强化学习来加速 VLAs 的努力仅将策略从一个固定速度切换到另一个固定速度，而几乎未涉及减速。我们发现，每个预测动作的幅值已决定了机器人的运动速度，这为可控执行速度提供了一条直接途径。基于此观察，我们提出了 TempoVLA，这是一个执行速度由显式条件控制的单一 VLA。TempoVLA 结合了两个耦合组件。(1) 数据侧的可变速度轨迹增强（VSTA），通过合并或拆分动作将演示重新定时至任意目标速度，同时保留其运动语义。(2) 模型侧的条件机制，将速度信息输入至策略网络。统计结果表明，VSTA 能以可忽略的运动误差达到所需速度。仿真及真实任务上的实验表明，TempoVLA 在加速和减速两个方向上均实现了灵活的速度控制，而 VSTA 通过更优的数据利用额外提升了默认 1× 速度下的性能。此外，通过与大型多模态模型协同，TempoVLA 实现了动态速度控制，在低风险阶段加速，在高风险阶段减速。

Abstract

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default 1× performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5

评分理由: 论文提出 TempoVLA，一种速度可控的视觉 - 语言 - 行动（VLA）模型。MultiModal (9) 和 MLLM (8) 高度相关，因 VLA 本质是多模态架构且依赖大语言模型。Unify Models (7) 相关，因统一了视觉、语言与动作。Visual Encoder (6) 是基础组件但非核心贡献。Tokenizer (3) 和 model-based RL (3) 非核心方法（主要基于演示学习的策略学习，而非基于模型的强化学习）。World Models (4) 涉及运动语义但非核心。未发现指定专家作者。加权总分 60.0，高于动态及格分 27.8。

关键词

Vision-Language-Action, Speed Control, Variable-Speed Trajectory Augmentation, Robot Manipulation, Large Multimodal Model, Policy Conditioning, Motion Semantics

深度分析

Chinese Title: TempoVLA：学习速度可控的视觉-语言-动作策略

Summary: 本文提出TempoVLA，一种速度可控的视觉-语言-动作（VLA）策略框架。现有VLA模型仅继承训练演示的固定执行速度，无法在低风险快速阶段和高风险慢速阶段之间灵活切换。作者观察到动作幅度天然控制机器人移动速度，据此设计了两部分：数据侧的变速度轨迹增强（VSTA），通过合并或拆分动作将演示重新定时到任意目标速度，同时保持运动语义；模型侧的速度条件注入机制，将标量速度作为显式条件输入策略，通过文本前缀、软提示或MLP调制三种方式缩放预测动作幅度。实验在LIBERO仿真和真实任务上验证了TempoVLA能实现双向速度控制，且VSTA作为数据增强提升了默认1×速度下的成功率。进一步与大型多模态模型结合，实现了动态速度调度：在低风险阶段加速，高风险阶段减速。

Innovations:

提出VSTA数据增强方法，在线将任意演示重新定时到目标速度，通过合并或拆分动作保持运动语义，无需新数据采集。
设计轻量级速度条件注入机制，将标量速度作为显式条件输入策略，通过三种方案（文本前缀、软提示、MLP调制）实现双向速度控制。
发现适当重新定时的数据使速度控制易于植入，且速度条件机制与具体注入方式几乎无关；变速度训练作为有效数据增强，持续提升默认速度下的成功率。
将速度条件策略与VLM结合，实现自动动态速度调度，使系统在低风险阶段加速、高风险阶段减速，无需人工干预。

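Methodology: The paper combines data augmentation with model conditioning. Data side: VSTA first segments a demonstration into motion-consistent segments (based on translation/rotation magnitude and direction similarity), then within each segment applies an accumulate-then-split operation that maps q source frames to p output frames (s = q/p), and randomizes the chunk start offset online so that no observations are discarded. Model side: the speed s is injected into the VLA policy in one of three ways: a text prefix (modifying the instruction), a speed-modulated RMSNorm (adding the speed embedding to the flow-matching timestep embedding), or a soft prompt concatenated with the visual tokens. Training uses the VSTA-augmented multi-speed dataset; at deployment, the policy outputs actions whose magnitude matches the input speed s, and the low-level controller is unchanged.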
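The accumulate-then-split idea can be illustrated for additive (translation-like) delta actions. The sketch below is a simplified reading, not the authors' code: it treats one motion-consistent segment, accumulates deltas into a cumulative path, resamples that path at p = round(q / s) evenly spaced points with linear interpolation, and differences the result. Segmentation, random chunk offsets, and rotation handling (which would need something like SLERP) are omitted, and the function name `retime_actions` is hypothetical.

```python
import numpy as np


def retime_actions(actions, speed):
    """Re-time a segment of additive delta actions to a target speed.

    actions: array of shape (q, D) holding per-step deltas.
    speed:   s = q / p; s > 1 merges steps (faster), s < 1 splits them (slower).
    Returns an array of shape (p, D) whose rows sum to the same total motion.
    """
    actions = np.asarray(actions, dtype=float)
    q, dim = actions.shape
    p = max(1, int(round(q / speed)))

    # Accumulate: cumulative displacement at step boundaries 0..q.
    cumulative = np.vstack([np.zeros((1, dim)), np.cumsum(actions, axis=0)])
    src_t = np.arange(q + 1)

    # Split: sample the cumulative path at p + 1 evenly spaced boundaries.
    dst_t = np.linspace(0.0, q, p + 1)
    resampled = np.stack(
        [np.interp(dst_t, src_t, cumulative[:, d]) for d in range(dim)], axis=1
    )
    return np.diff(resampled, axis=0)


demo = np.ones((4, 2))             # four unit steps in two dimensions
fast = retime_actions(demo, 2.0)   # 2x speed: two steps of magnitude 2
slow = retime_actions(demo, 0.5)   # 0.5x speed: eight steps of magnitude 0.5
```

Because the first and last sample points coincide with the segment endpoints, the total displacement is preserved at any speed; only the per-step magnitude changes, which is the quantity the summary says governs execution speed.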

Key Results:

TempoVLA在LIBERO仿真和真实任务上实现了双向速度控制（加速和减速），且单个策略可在多个命令速度下执行任务。
VSTA数据增强提升了默认1×速度下的成功率，表明变速度训练作为有效数据增强手段。
与VLM结合实现动态速度调度，在低风险阶段加速、高风险阶段减速，性能优于固定速度策略。
速度控制效果与条件注入方式（文本前缀、软提示、MLP调制）关系不大，表明速度条件易于植入。

Tech Stack:

VSTA：运动一致性分割（基于平移/旋转幅度阈值和方向余弦相似度）、累积再拆分操作（线性插值）、在线块起始偏移采样。
速度条件注入：文本前缀、速度调制RMSNorm（两层MLP嵌入）、软提示（与视觉token拼接）。
基础VLA架构：支持回归头（如ACT）、扩散/流匹配头（如Diffusion Policy）、离散token头（如OpenVLA）。
流匹配（Flow Matching）用于动作分布建模。
SLERP（球面线性插值）用于旋转表示（如四元数）的插值。
VLM（视觉语言模型）用于动态速度调度。

Strengths:

提出了一种轻量级、无需重新训练基础架构的双向速度控制方法，适用于现有各类VLA模型。
数据增强VSTA不仅实现速度控制，还作为有效数据增强提升了默认速度下的性能。
速度条件注入机制简单灵活，三种方案均可实现，且效果稳定。
与VLM结合实现了动态速度调度，使机器人能根据场景风险自适应调整速度，具有实用价值。
实验覆盖仿真和真实任务，验证了方法的有效性和泛化性。

Limitations:

速度控制依赖于动作幅度缩放，对于某些需要精确位置控制的任务（如精密装配）可能不够精细。
VSTA的累积再拆分操作对非加法表示的旋转（如四元数）需要额外映射或插值，增加了实现复杂度。
动态速度调度依赖VLM的实时场景理解，VLM的推理延迟可能影响实时性。
实验仅在LIBERO和有限真实任务上验证，未在更复杂、长时域任务上测试。
速度控制范围可能受限于训练数据中动作幅度的分布，极端速度下运动语义可能失真。

Relevance To Keywords:

Unify Models: 论文提出的TempoVLA属于视觉-语言-动作统一模型，但未涉及理解与生成一体化。
World Models: 论文未涉及世界模型，仅关注策略速度控制。
Representation Learning: 论文未专门研究表征学习，但速度条件注入涉及特征调制。
Model-Based RL: 论文未使用基于模型的强化学习，采用模仿学习。
原生多模态大模型: 论文基于VLA（多模态大模型的一种），但未强调原生多模态。
多模态大模型的理解和生成一体化: 论文主要关注动作生成，理解部分仅用于指令跟随。
表征学习: 不直接相关。
世界模型: 不直接相关。
强化学习: 论文提及RL用于加速策略，但主要方法为模仿学习+数据增强。
后训练: 论文未涉及后训练，但速度条件注入可视为一种轻量级后训练适配。

10. Resonant Minds: Closed-Loop Social Avatars with Theory of Mind [PASS]

Score: 58.5 / 27.8

Authors: Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang, Wentao Zhu

Published: 2026-06-04

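TL;DR: This paper proposes a closed-loop dual-agent framework that combines Theory of Mind with multimodal generation to create socially intelligent digital humans, matching or outperforming prior methods in dialogue and video quality.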

摘要翻译

创造具有真实社会智能的逼真数字人类，需要在统一的框架内整合认知推理与多模态生成。现有方法将这两者视为独立任务：大语言模型（Large Language Models, LLMs）虽擅长对话，却缺乏具身表达；而基于扩散（diffusion-based）的说话头模型（talking head models）虽实现了视觉保真度，却忽略了社会认知。为弥合这一鸿沟，我们提出一个闭环双智能体（dual-agent）框架，将感知、社会推理与表达整合进连续的交互循环中。感知模块从视频中分析伙伴的多模态行为，而社会推理模块则通过心理理论（Theory of Mind）推断隐藏的心理状态，并通过集成机制（ensemble mechanism）选择回应。表达模块随后生成情绪可控的双智能体视频，综合说话者的言语与表达以及听众的反应行为，捕捉了先前工作中缺失的双向动态。我们构建了一个分层的人设 - 场景（Persona-Scenario）数据集，包含基于心理学的人设及私人社会目标，以支持在信息不对称条件下的评估。在该数据集上的实验表明，该方法在对话质量和视频生成指标上均表现出具有竞争力或优越的性能。值得注意的是，我们的方法甚至在关键对话质量维度上超越了全信息（full-information）脚本模式（Script mode），这表明在不确定性下的显式心理状态推断，能够唤起比无限制信息访问更为深思熟虑的对话。

Abstract

Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	6.0/10	9.0
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	3.0/10	4.5

评分理由: 论文明确统一了推理与生成（Unify Models, MultiModal），并利用闭环动力学类似世界模型和 MLLM 应用。Tokenizer 和 model-based RL 并非架构核心，Visual Encoder 仅为隐含提及。

关键词

Closed-Loop Social Avatars, Theory of Mind, Multimodal Generation, Social Reasoning, Dual-Agent Framework, Emotion-Controllable Videos, Persona-Scenario Dataset

深度分析

Chinese Title: 共鸣心智：具有心智理论的闭环社交虚拟化身

Summary: 本文提出一个闭环双智能体框架，旨在将认知推理与多模态生成统一于持续交互循环中，以创造具有真实社交智能的数字人类。该框架包含三个模块：感知模块从视频中分析伙伴的多模态行为；社交推理模块通过心智理论推断隐藏心理状态，并利用集成机制选择响应；表达模块生成情感可控的双智能体视频，同时合成说话者的言语表情和听者的反应行为。为支持信息不对称下的评估，作者构建了基于大五人格的层级化角色-场景数据集。实验表明，该方法在对话质量和视频生成指标上均达到或超越现有水平，尤其在关键对话质量维度上超越了全信息脚本模式，表明在不确定性下进行显式心理状态推断能激发更周到的对话。

Innovations:

首次在信息不对称条件下，将心智理论推理与多模态生成整合到闭环双智能体交互框架中。
提出感知-社交推理-表达三模块闭环架构，实现人格一致且情感丰富的多模态交互。
构建基于大五人格的心理层级化角色-场景数据集，支持信息不对称下的社交模拟评估。
在对话质量关键维度上超越全信息脚本模式，证明显式心理状态推断能提升对话深度。

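Methodology: The paper models dual-agent social interaction as a partially observable Markov game in which each agent holds a private persona profile (background, personality, goals). In each interaction round, the perception module extracts structured observations from the partner's video, such as the speech transcript, emotional state, and facial expression; the social reasoning module performs Theory of Mind inference based on the BDIE framework, generates candidate responses, and selects the best action through an ensemble mechanism made up of three evaluators (empathy, strategy, and consistency); the expression module produces multimodal video through a text-to-emotion adapter and dual-agent video synthesis, closing the loop. The dataset is generated through a sparse-to-rich pipeline that expands factual backgrounds into detailed persona narratives and is adapted from Sotopia scenarios.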
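The ensemble selection step can be pictured as scoring each candidate response with several evaluators and keeping the best one. The sketch below is purely illustrative: the paper's evaluators are presumably model-based, whereas here they are stand-in functions that read precomputed scores, and aggregating by an unweighted mean is an assumption.

```python
def select_response(candidates, evaluators):
    """Return the candidate with the highest mean evaluator score."""
    def ensemble_score(candidate):
        scores = [evaluate(candidate) for evaluate in evaluators]
        return sum(scores) / len(scores)

    return max(candidates, key=ensemble_score)


# Hypothetical candidates with precomputed per-evaluator scores in [0, 1].
candidates = [
    {"text": "A", "empathy": 0.9, "strategy": 0.2, "consistency": 0.5},
    {"text": "B", "empathy": 0.7, "strategy": 0.8, "consistency": 0.9},
    {"text": "C", "empathy": 0.4, "strategy": 0.9, "consistency": 0.6},
]

# Stand-in evaluators: each one simply reads its own precomputed score.
evaluators = [
    lambda c: c["empathy"],
    lambda c: c["strategy"],
    lambda c: c["consistency"],
]

best = select_response(candidates, evaluators)
print(best["text"])
```

Averaging several criteria keeps a response that is strong on one axis but weak on the others (candidate A here) from being selected over a balanced one.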

Key Results:

在对话质量和视频生成指标上均达到或超越现有方法。
在关键对话质量维度上超越全信息脚本模式，表明显式心理状态推断在不确定性下能产生更周到的对话。
构建的层级化角色-场景数据集有效支持信息不对称下的社交模拟评估。
闭环双智能体交互框架实现了人格一致且情感丰富的多模态生成。

Tech Stack:

部分可观测马尔可夫博弈
大五人格模型
BDIE心智理论框架
集成机制（共情、策略、一致性评估器）
文本到情感适配器
双智能体视频合成
扩散模型（用于视频生成）
大语言模型（用于推理与生成）

Strengths:

创新性地将心智理论推理与多模态生成整合于闭环框架，填补了认知推理与生成之间的鸿沟。
在信息不对称条件下模拟真实社交交互，更贴近实际应用场景。
构建了心理层级化数据集，为社交智能评估提供了坚实基础。
实验证明显式心理状态推断能提升对话质量，具有重要理论意义。

Limitations:

框架依赖大语言模型进行推理，可能面临计算开销和实时性问题。
数据集基于大五人格构建，可能无法覆盖所有文化背景下的社交行为。
视频生成质量可能受限于扩散模型的当前能力，尤其在长序列交互中。
闭环交互的评估指标主要依赖自动指标，缺乏充分的人工评估验证。

Relevance To Keywords:

Unify Models: 论文提出的闭环框架统一了感知、推理和生成，体现了模型统一的思想。
World Models: 通过心智理论推断伙伴心理状态，构建了内部世界模型以指导交互。
Representation Learning: 感知模块将多模态视频数据转化为结构化文本表示，涉及表征学习。
Model-Based RL: 部分可观测马尔可夫博弈建模和集成机制选择动作，隐含了基于模型的强化学习思想。
原生多模态大模型: 论文使用大语言模型进行推理，并结合扩散模型进行视频生成，体现了多模态大模型的应用。
多模态大模型的理解和生成一体化: 框架同时处理多模态理解（感知）和生成（表达），实现一体化。
强化学习: 集成机制中的评估器可视为奖励函数，动作选择类似于强化学习中的策略优化。
后训练: 论文未明确提及后训练，但数据集构建和模块设计可支持后续微调或训练。

11. DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments [PASS]

Score: 57.0 / 27.8

Authors: Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu

Published: 2026-06-04

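TL;DR: The paper introduces the DisasterBench benchmark and the DisasterVL model for UAV-based disaster response, reaching reasoning accuracy comparable to GPT-4o with better efficiency.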

摘要翻译

当灾难发生时，响应人员不仅要回答发生了什么，还要回答为什么发生、接下来会发生什么以及现在该做什么，通常需基于嘈杂的低空无人机（UAV）视角，并在严格的现场计算约束下进行。然而，现有的大多数多模态基准侧重于感知（例如识别/描述），涵盖的灾难类型有限，且无法充分支持实际应急响应所需的多阶段推理。我们提出了 DisasterBench，这是一个面向复杂环境中基于无人机（UAV）的灾难响应的多阶段多模态推理基准。DisasterBench 涵盖 14 种灾难相关场景类型和 9 个响应关键任务，横跨灾前、灾中和灾后三个阶段，并具备细粒度的灾难 - 任务映射，明确测试因果归因、传播预测、损害分析及面向决策的推理。为了实现边缘推理，我们进一步提出了 DisasterVL，这是一个轻量级多模态模型，其优化结合了领域指令微调、思维链引导的多模态对齐以及基于强化学习的策略优化三阶段流水线。在 21 种流行的多模态大语言模型（MLLMs）上的实验表明，我们的 20 亿参数（2B-parameter）DisasterVL 优于所有被评估的开源模型，并显著缩小了与最先进闭源模型的差距，在保持卓越效率的同时实现了与 GPT-4o 相当的推理准确率。项目页面见 https://github.com/TanmouTT/DisasterBench。

Abstract

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-Based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	5.0/10	7.5

评分理由: 论文主要贡献在于多模态基准构建与轻量级模型开发，故 MultiModal 和 MLLM 得分最高。模型优化涉及强化学习策略优化，故 model-based RL 和 Visual Encoder 有一定相关性。Tokenizer 未提及。Unify Models 和 World Models 仅在任务预测与流程统一上有弱关联。

关键词

DisasterBench, Multimodal Benchmark, UAV-Based Disaster Response, DisasterVL, Multi-stage Reasoning, Reinforcement Learning, MLLM

深度分析

Chinese Title: DisasterBench：复杂环境下基于无人机的灾害响应多模态基准

Summary: 本文提出了DisasterBench，一个面向复杂环境下无人机灾害响应的多阶段多模态推理基准。该基准涵盖14种灾害场景类型和9项关键任务，覆盖灾前、灾中和灾后三个阶段，包含5,330张真实低空无人机图像和29,300个推理样本，重点评估因果归因、传播预测、损伤分析和决策导向推理能力。同时，作者提出了轻量级多模态模型DisasterVL（2B参数），采用三阶段优化流水线：领域指令微调、思维链引导的多模态对齐和基于强化学习的策略优化。在21个主流多模态大模型上的实验表明，DisasterVL在推理准确率上超越所有开源模型，接近GPT-4o水平，且效率更高。该工作填补了现有灾害基准偏重感知、缺乏多阶段推理的空白，为边缘部署的灾害推理系统提供了有效方案。

Innovations:

首次构建覆盖灾前、灾中、灾后全阶段的多模态推理基准DisasterBench，包含14种灾害类型和9项响应关键任务，强调因果推理与决策导向。
提出轻量级多模态模型DisasterVL，通过领域指令微调、思维链引导的多模态对齐和强化学习策略优化三阶段流水线，在2B参数下实现接近GPT-4o的推理性能。
系统评估21个主流多模态大模型，揭示现有模型在灾害推理任务上的不足，并验证了轻量模型在边缘部署场景下的可行性。
基准设计显式建模灾害类型与任务阶段的耦合关系，支持细粒度的阶段感知推理评估。

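Methodology: First, 5,330 real low-altitude UAV images are collected, covering 14 disaster scenes plus normal scenes. For each image, several reasoning-oriented multiple-choice questions are generated according to its disaster type and stage (pre-, during-, or post-disaster), 29,300 samples in total, spanning tasks such as causal analysis, propagation prediction, damage assessment, and decision recommendation. Data construction follows a pipeline of task-conditioned generation, cross-model verification, and expert review. Second, the DisasterVL model is built on a lightweight vision-language backbone and trained in three stages: 1) domain instruction tuning (fine-tuning on disaster-related data); 2) chain-of-thought-guided multimodal alignment (using CoT to strengthen vision-language association); 3) reinforcement learning-based policy optimization (using algorithms such as PPO to optimize the reasoning policy). Finally, 21 open-source and closed-source multimodal large models are evaluated on DisasterBench, comparing performance and efficiency.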
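Since every benchmark sample is a multiple-choice question tagged with a disaster stage, the headline metric reduces to accuracy, optionally broken down by stage. The sketch below shows one way to compute that breakdown; the record fields (`stage`, `answer`, `pred`) and the sample values are hypothetical and do not reflect the released data format.

```python
from collections import defaultdict


def accuracy_by_stage(samples):
    """Multiple-choice accuracy grouped by disaster stage."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for sample in samples:
        stage = sample["stage"]
        total[stage] += 1
        correct[stage] += int(sample["pred"] == sample["answer"])
    return {stage: correct[stage] / total[stage] for stage in total}


# Hypothetical model outputs on four questions.
samples = [
    {"stage": "pre", "answer": "A", "pred": "A"},
    {"stage": "during", "answer": "C", "pred": "B"},
    {"stage": "during", "answer": "D", "pred": "D"},
    {"stage": "post", "answer": "B", "pred": "B"},
]

per_stage = accuracy_by_stage(samples)
overall = sum(s["pred"] == s["answer"] for s in samples) / len(samples)
print(per_stage, overall)
```

Reporting accuracy per stage rather than only overall makes it visible when a model handles perception-heavy questions well but falls behind on prediction or decision questions tied to a particular stage.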

Key Results:

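DisasterBench contains 5,330 images and 29,300 samples, covering 14 disaster types and 9 tasks across the pre-, during-, and post-disaster stages.
DisasterVL (2B parameters) outperforms every evaluated open-source model in reasoning accuracy and substantially narrows the gap to closed-source models such as GPT-4o, reaching comparable accuracy.
DisasterVL is markedly more efficient than large models such as GPT-4o, which suits edge deployment.
Existing MLLMs perform unevenly on disaster reasoning, with clear weaknesses on causal and decision-oriented tasks.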

Tech Stack:

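MLLMs: 21 models, including GPT-4o, Gemini, LLaVA, and Qwen-VL
Lightweight vision-language model (DisasterVL): 2B parameters, Transformer-based
Training: domain instruction tuning, CoT-guided multimodal alignment, RL-based policy optimization (e.g., PPO)
Data construction: task-conditioned generation, cross-model verification (multiple MLLMs generate candidate answers), expert review
Evaluation metric: multiple-choice accuracy
Viewpoint: low-altitude UAV imagery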

Strengths:

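The benchmark is comprehensive: it covers many disaster types and the full disaster life cycle, and its tasks track real emergency-response needs.
DisasterVL keeps strong accuracy while running efficiently, which suits edge deployment.
The evaluation of 21 popular models provides a useful comparison baseline.
Data construction is rigorous, with cross-model verification and expert review to protect sample quality.
Disaster reasoning is framed explicitly as a multi-stage, causally oriented problem, going beyond traditional perception benchmarks.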

Limitations:

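The benchmark uses only low-altitude UAV imagery; satellite and ground-level views are not covered, which may limit generalization.
All samples are multiple-choice, which may not capture the complexity of open-ended reasoning.
Although many disaster types are covered, some classes may be imbalanced (for example, normal versus disaster scenes), which could affect evaluation fairness.
DisasterVL is validated only at the 2B-parameter scale; behavior at larger scales is unknown.
No deployment test on real edge hardware (such as an onboard UAV compute platform) is reported, so the efficiency advantage is shown only in simulation.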

Relevance To Keywords:

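Native multimodal large models: DisasterVL is a lightweight MLLM, but it is fine-tuned from a pre-trained model rather than trained natively from scratch.
Unified understanding and generation in MLLMs: the paper focuses on understanding (reasoning QA) and does not cover generation tasks such as image generation, so relevance is limited.
Representation learning: the model's representations are shaped by CoT-guided multimodal alignment and RL, but the paper does not study representation learning theory.
World models: propagation prediction concerns how a disaster evolves and can be read as a first step toward a disaster world model, but no full physical world model is built.
Reinforcement learning: RL-based policy optimization is one of the core methods.
Post-training: the three-stage pipeline (instruction tuning, alignment, RL) is post-training and is highly relevant to this keyword.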

12. HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes [PASS]

Score: 55.5 / 27.8

Authors: Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li

Published: 2026-06-04

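TL;DR: HomeWorld proposes a unified hierarchical framework that uses large language models and vision-language models to generate controllable whole-home scenes, providing diverse and realistic indoor environments for embodied AI simulation.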

摘要翻译

室内场景生成对于机器人仿真和现代室内设计至关重要。然而，复杂的布局与稀缺的 3D 场景数据相结合，使得基于学习的生成方法面临挑战。现有方法往往依赖人工设计的规则，或专注于孤立的子任务（例如，平面图合成或单房间布置），生成的全屋场景缺乏全局一致性、真实感以及仿真就绪性。为了缓解这些局限性，我们提出了一种统一的层次化框架，将室内场景合成分解为多个可控阶段。首先，我们构建了一个包含 30 万份真实住宅平面图的大规模数据集，用于训练大型语言模型（LLM），以实现全屋平面图的生成。借助详细的描述和基于 K-D 树的表示方法，我们的方法实现了细粒度且可控的全屋平面图生成。基于生成的全屋平面图，我们利用图像生成模型从多层次漫游视角草拟家具布局，随后在不同支撑表面（如橱柜、书桌和餐桌）上生成小型可操作物体的布局，以供具身人工智能（Embodied AI）模拟使用。在家具和物体布局生成过程中，基于视觉语言模型（VLM）的细化器会迭代修正家具和物体的摆放位置，而 3D 生成模型则支持单个资产的灵活替换。此外，我们还附加了基本的物理属性以及简单的表面纹理和光照设置，从而完成适用于具身人工智能（Embodied AI）使用的完整流程。实验与用户研究表明，我们的流程生成的室内空间具有更大的布局多样性和更强的 3D 设计吸引力，在定量和定性指标上均优于先前方法。最后，除了生成流程外，我们将向社区公开该平面图数据集以及 5000 个完全布置好的场景。项目页面：https://kairos-homeworld.github.io/

Abstract

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	3.0/10	4.5

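Scoring rationale: 1. Unify Models (9.0): both the title and the abstract emphasize a unified hierarchical framework; the core contribution integrates floorplan generation, furniture layout, and object placement into one pipeline, which matches this keyword closely. 2. MLLM (8.0): the method relies on a large language model (LLM) to generate floorplans and a vision-language model (VLM) to correct layouts, which is an application of multimodal large models. 3. MultiModal (7.0): the pipeline handles text (floorplan descriptions), images (furniture layout drafts), and 3D scene data, so it fuses and generates multimodal data. 4. World Models (5.0): the generated scenes are intended for embodied AI simulation, a main application area for world models; the paper focuses on generation rather than dynamics modeling, so relevance is moderate. 5. Tokenizer (2.0): the abstract mentions no tokenizer innovation, only a K-D tree representation, so relevance is low. 6. Visual Encoder (3.0): the VLM and the image generation model imply visual encoders, but the paper does not present one as a separate contribution. 7. model-based RL (3.0): the paper generates scenes for simulation and does not propose a model-based RL algorithm; RL is only a potential downstream use. Expert check: the author list (Wenbo Li et al.) does not include Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang, so no bonus is applied.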

Keywords

Floorplan Generation, Indoor Scene Generation, Unified Framework, Embodied AI Simulation, Large Language Model, Furniture Layout, 3D Generative Model

In-Depth Analysis

Chinese Title: HomeWorld：一种统一的从平面图到家具布置的框架，用于生成可控、密集交互的完整家居场景

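Summary: This paper proposes HomeWorld, a unified hierarchical framework for generating complete, controllable, interactive home scenes when 3D data is scarce. The authors first curate a large-scale dataset of 300K real residential floorplans and, using a K-D tree-based representation with detailed descriptions, train a large language model (LLM) to generate fine-grained, controllable whole-home floorplans. Given a generated floorplan, a hierarchical viewpoint-roaming strategy uses image generation models to propose furniture layouts step by step, from top-down to first-person views; a VLM-based refiner iteratively corrects implausible placements, and a 3D generative model allows individual assets to be swapped. To support embodied AI simulation, small manipulable objects are placed on supporting surfaces, and basic physical attributes, textures, and lighting are added. Experiments and user studies indicate better layout diversity and 3D design appeal than prior methods. The authors plan to release the 300K-floorplan dataset and 5K fully furnished 3D scenes.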

Innovations:

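Proposes a unified hierarchical whole-home generation pipeline that combines floorplan generation, furniture layout, manipulable object placement, and asset replacement, working around the scarcity of 3D data.
Trains an LLM on a K-D tree representation with detailed descriptions for fine-grained, controllable whole-home floorplan generation, and curates a dataset of 300K real residential floorplans.
Uses a hierarchical viewpoint-roaming strategy that combines the 2D priors of image generation models with the floorplan's 3D constraints to produce furniture layouts that stay consistent across views.
Introduces a VLM-based iterative refiner that repeatedly corrects implausible placements and constraint violations.
Supports swapping individual assets through a 3D generative model, adding object diversity and rendering variation without breaking scene consistency.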

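Methodology: The paper uses a hierarchical generation pipeline. 1) Floorplan generation: collect 300K real residential floorplans, encode room structure with a K-D tree, and train an LLM to generate controllable floorplans. 2) Furniture layout: starting from the floorplan, place large functional furniture (beds, sofas, and so on) from a top-down view, then add smaller items while roaming through first-person views; an image generation model proposes new objects, and the floorplan's 3D constraints keep views consistent. 3) Recursive refinement: a vision-language model (VLM) iteratively checks and corrects implausible placements. 4) Manipulable object placement: distribute small objects over supporting surfaces such as tabletops and cabinet tops. 5) Post-processing: add textures, lighting, and basic physical attributes to produce simulation-ready 3D scenes.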
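
The floorplan stage relies on a K-D tree-based representation. The paper's exact encoding is not given here, so the following is only a sketch of the general idea, assuming axis-aligned rectangular rooms: each internal node cuts a rectangle along one axis, and each leaf is a room. The `Node` class, the `rooms` helper, and the example plan are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in metres


@dataclass
class Node:
    label: Optional[str] = None     # room name, set on leaves only
    axis: Optional[str] = None      # "x" or "y", set on internal nodes
    split: Optional[float] = None   # coordinate of the cut
    left: Optional["Node"] = None   # region below the cut
    right: Optional["Node"] = None  # region above the cut


def rooms(node: Node, rect: Rect) -> List[Tuple[str, Rect]]:
    """Walk the tree and return each leaf room with its rectangle."""
    if node.label is not None:
        return [(node.label, rect)]
    x0, y0, x1, y1 = rect
    if node.axis == "x":
        lo, hi = (x0, y0, node.split, y1), (node.split, y0, x1, y1)
    else:
        lo, hi = (x0, y0, x1, node.split), (x0, node.split, x1, y1)
    return rooms(node.left, lo) + rooms(node.right, hi)


# A 10 m x 8 m home: living room on the left, kitchen and bedroom on the right.
plan = Node(axis="x", split=6.0,
            left=Node(label="living room"),
            right=Node(axis="y", split=4.0,
                       left=Node(label="kitchen"),
                       right=Node(label="bedroom")))
layout = rooms(plan, (0.0, 0.0, 10.0, 8.0))
for name, rect in layout:
    print(name, rect)
```

A tree like this serializes naturally to a short token sequence (axis, cut position, child order), which is one plausible reason such a representation pairs well with an LLM.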

Key Results:

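Floorplan generation: LLM-generated floorplans beat existing methods on metrics such as room count, area, and layout plausibility, and support fine-grained control.
Furniture layout generation: compared with methods such as ProcTHOR and Holodeck, the generated layouts are more diverse and realistic and score higher in user studies.
Ablations: confirm the contribution of the K-D tree representation, the VLM refiner, and the hierarchical roaming strategy.
Dataset contribution: the planned release of 300K floorplans and 5K complete 3D scenes addresses the lack of large-scale whole-home scene data.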

Tech Stack:

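Large language model (LLM) for floorplan generation
K-D tree for structured floorplan representation
Image generation models (e.g., diffusion models) to propose new objects from roaming viewpoints
Vision-language model (VLM) for recursive layout refinement
3D generative model (e.g., diffusion-based 3D asset generation) for flexible asset replacement
Hierarchical viewpoint-roaming strategy (top-down view, then first-person views)
Physical attributes, textures, and lighting setup for simulation readiness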

Strengths:

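Works around 3D data scarcity by combining 2D floorplan data with large pre-trained models to generate high-quality whole-home scenes.
Generated scenes are globally coherent, physically plausible, and simulation-ready, supporting embodied AI interaction.
The hierarchical framework is highly controllable; users can specify conditions such as room count and style.
The planned public release (300K floorplans plus 5K complete scenes) should benefit the field.
Experiments are thorough, including quantitative metrics, user studies, and ablations that isolate each component.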

Limitations:

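Reliance on 2D image generation models and a VLM can introduce inconsistencies or hallucinations, which is why recursive refinement is needed.
Only 5K complete 3D scenes are generated so far; the scale is still limited and will need to grow.
The floorplan dataset is large but may contain noise or incomplete annotations, which could affect LLM training quality.
Manipulable object variety and interaction detail may fall short of hand-designed simulation environments.
Dynamic scenes (moving people, object state changes) are not generated.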

Relevance To Keywords:

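Unify Models: the paper unifies several sub-tasks (floorplan generation, furniture layout, object placement), reflecting the idea of model unification.
World Models: the generated whole-home scenes can serve as environments for embodied AI world models, supporting simulation and interaction.
Representation Learning: floorplan structure is represented with a K-D tree, and the LLM learns spatial layout representations.
Model-Based RL: the generated scenes can be used to train model-based RL agents, providing controllable, densely interactive environments.
Native multimodal large models: the pipeline combines an LLM (text), a VLM (vision-language), and image generation models (vision).
Unified understanding and generation in MLLMs: the LLM reads floorplan descriptions and generates layouts, the VLM inspects the scene and corrects errors, and the image model proposes new objects, closing the loop between understanding and generation.
Representation learning (second keyword set): the K-D tree and the LLM implicitly capture spatial topology and functional relations.
Post-training: not addressed explicitly, although the LLM and VLM usage can be seen as downstream application of pre-trained models.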

13. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models [PASS]

Score: 55.5 / 27.8

Authors: Xiaoyun Qiu, Jingtao He, Yijie Chen, Yusong Huang, Haotian Wang, Yixuan Wang, Xinhu Zheng

Published: 2026-06-04

TL;DR: PLAN-S proposes a style-conditioned semantic cost map bridge for latent world models in autonomous driving, enhancing trajectory safety and diversity without modifying the underlying model backbone.

摘要翻译

潜在世界模型（LWMs）通过预测紧凑的场景动力学以支持下游规划，增强了端到端自动驾驶。然而，现有的基于 LWM 的规划器通常直接从纠缠的潜在表示中生成轨迹。这种紧凑的潜在到规划路径缺乏对风险、可驾驶性以及多样化风格偏好的显式建模，导致在最终轨迹选定之前，驾驶风格动力学难以被监督、检查或调节。我们提出 PLAN-S（基于潜在风格动力学的规划），这是一个面向规划的桥梁，旨在通过从潜在表示中解码风格条件化的四通道语义代价图，来解决紧凑性与可控性之间的困境。该代价图基于自车状态和驾驶风格进行条件化，并通过两种主机侧接口在规划决策之前被使用：针对回归规划器的注意力级融合，以及针对锚点分数规划器的奖励级融合。我们在两个架构不同的主机上验证了 PLAN-S：nuScenes 上的 ResWorld 和 NAVSIM 上的 WoTE，同时保持主机骨干网络冻结，以隔离所提出桥梁的贡献。在 nuScenes 上，PLAN-S 在所有时间步长上均降低了基线的 L2 误差，平均 L2 误差为 0.55 米，且 3 秒碰撞率相对降低了 42%。在 NAVSIM 上，规则代价变体达到了 89.4 的预测驾驶者模型评分（PDMS），而学习代价变体则在基线难以应对的场景上提供了互补增益。消融实验表明，代价路径对更安全轨迹的选择贡献最为直接。定性结果进一步表明，PLAN-S 能够生成多样化的代价图，其空间变化与不同的驾驶风格保持一致。

Abstract

Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed up-stream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	10.0/10	15.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	8.0/10	12.0

Scoring rationale: The paper explicitly centers on latent world models for autonomous driving, earning a top score on World Models (10). It involves planning and latent dynamics, which relates strongly to model-based RL (8). Visual encoders are implicit in the LWM backbones (5), while MLLM and Tokenizer are not explicitly mentioned (3, 2). Unify Models and MultiModal have moderate relevance due to the bridging design and the driving context (4, 5). No target experts are present in the author list.

Keywords

Latent World Models, Autonomous Driving, Planning, Style Dynamics, Cost Map, Trajectory Selection, Semantic Cost Map, Latent Representation

In-Depth Analysis

Chinese Title: PLAN-S: 融合潜在风格动态的自动驾驶世界模型规划方法

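Summary: This paper proposes PLAN-S, a planner-facing bridge module that addresses the tension between compact representation and controllability in latent world models (LWMs). Existing LWM planners generate trajectories directly from entangled latents and lack explicit modeling of risk, drivability, and driving-style preferences. PLAN-S decodes the latent representation into a four-channel semantic cost map (dynamic obstacles, off-road regions, static obstacles, drivability) and conditions it on ego state and driving style through a dual adaptive feature-wise linear modulation (dual AdaFiLM) mechanism, so style-dependent modulation happens before the planning decision. Two interfaces adapt the module to different planner families: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. On nuScenes (ResWorld host) and NAVSIM (WoTE host), PLAN-S lowers L2 error and collision rate and reaches 89.4 PDMS, and ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results show diverse cost maps aligned with different driving styles.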

Innovations:

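Proposes a four-channel semantic cost map that explicitly models risk, drivability, and route preference as planner-facing latent style dynamics.
Conditions the cost map on ego state and driving style through a dual AdaFiLM mechanism, enabling style-dependent modulation before planning.
Designs two interfaces across planner families, attention-level fusion (regression planners) and reward-level fusion (anchor-score planners), with no change to the host backbone.
Validates on two architecturally distinct hosts (ResWorld and WoTE) with frozen backbones, isolating the bridge module's contribution.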

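Methodology: PLAN-S works as follows. 1) A frozen perception encoder and latent world model produce current and future BEV features. 2) A trainable cost-map decoder uses dual AdaFiLM (feature-wise linear modulation applied separately for ego state and driving style) to decode the BEV latents into a four-channel semantic cost map. 3) For regression planners such as ResWorld, the cost map is injected through attention-level fusion, biasing waypoint queries and guiding spatial refinement. 4) For anchor-score planners such as WoTE, the cost map is injected through reward-level fusion: costs are sampled along each candidate anchor to adjust the native anchor scores. The host backbone stays unchanged throughout, and only the bridge module is trained.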
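
Reward-level fusion, as described above, samples the cost map along each candidate anchor and uses the result to adjust the planner's native score. The sketch below illustrates that idea only; the mean-cost sampling, the linear penalty with a `weight` parameter, and the two-channel toy map are assumptions for illustration, not the PLAN-S implementation.

```python
def sample_cost(cost_map, path):
    """Mean cost over a path of (row, col) cells, summed across channels."""
    per_point = [sum(channel[r][c] for channel in cost_map) for r, c in path]
    return sum(per_point) / len(per_point)


def fuse_scores(native_scores, anchors, cost_map, weight=1.0):
    """Penalise each anchor's native score by its sampled cost (assumed form)."""
    return [s - weight * sample_cost(cost_map, a)
            for s, a in zip(native_scores, anchors)]


# Two toy 3x3 channels; the paper's map has four semantic channels.
obstacle = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
offroad = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
cost_map = [obstacle, offroad]

anchors = [
    [(2, 0), (1, 0), (0, 0)],  # keeps left, touches no costly cell
    [(2, 1), (1, 1), (0, 1)],  # passes through the obstacle cell
]
native = [0.6, 0.7]            # the host planner prefers anchor 1
fused = fuse_scores(native, anchors, cost_map, weight=0.9)
best = max(range(len(fused)), key=fused.__getitem__)
print(fused, best)
```

In this toy case the host planner alone would pick the anchor that crosses the obstacle; after fusion the safer anchor wins, which mirrors the ablation finding that the cost pathway drives safer selection.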

Key Results:

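On nuScenes, PLAN-S reduces L2 error at every prediction horizon, with an average L2 of 0.55 m and a 42% relative reduction in the 3 s collision rate.
On NAVSIM, the rule-cost variant reaches 89.4 PDMS, and the learned-cost variant gives complementary gains on scenes that are hard for the baseline.
Ablations show that the cost pathway contributes most directly to safer trajectory selection.
Qualitative results show diverse cost maps whose spatial variation is consistent with different driving styles.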

Tech Stack:

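BEV representation: LSS, BEVFormer, BEVDepth
Latent world models: ResWorld, WoTE
Feature modulation: FiLM (feature-wise linear modulation), dual AdaFiLM
Planner types: regression planners (Transformer-based waypoint queries), anchor-score planners (reward-model based)
Datasets: nuScenes, NAVSIM
Metrics: L2 error, collision rate, PDMS (Predictive Driver Model Score)
Other: VQ-VAE, GPT, diffusion models (related background methods)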

Strengths:

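Addresses the tension between compact representation and controllability in latent world models by modeling risk, drivability, and style preference explicitly.
Generalizes across planner families: two interfaces cover regression and anchor-score planners without modifying the backbone.
Style-conditioned modulation happens before planning, which improves interpretability and adjustability.
Validated on two datasets and two hosts, with clear gains and thorough ablations.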

Limitations:

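Depends on a BEV representation and may not transfer to latent world models that are not BEV-based.
Driving style is conditioned through a simple code; richer personalized preferences or a continuous style space are not considered.
Real-time inference cost is not discussed, which may matter for deployment.
Validated on only two hosts; generalization needs more experiments.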

Relevance To Keywords:

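World models: the paper centers on planning with latent world models, predicting future BEV features and decoding them into cost maps.
Representation learning: dual AdaFiLM conditions the latent representation on ego state and driving style, yielding style-dependent semantic cost representations.
Model-based RL: anchor-score planners use a reward model, and the cost map acts as part of the reward signal, which relates to model-based RL.
Native multimodal large models: not involved; the paper focuses on BEV latents and planning, so relevance is low.
Unified understanding and generation in MLLMs: no multimodal generation is involved; BEV features are used only for planning, so relevance is low.
Post-training: the host backbone is frozen and only the bridge is trained, which can be seen as a form of post-training adaptation, though not a typical post-training paradigm.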

14. PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding [PASS]

Score: 55.5 / 27.8

Authors: Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao

Published: 2026-06-04

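TL;DR: PAR3D proposes a unified part-aware 3D-MLLM framework that models fine-grained part structure to strengthen scene understanding, substantially improving part-level question answering and referring segmentation.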

摘要翻译

近年来，3D 多模态大语言模型（3D-MLLMs）的进展使得 3D 场景理解任务（包括视觉问答、描述生成和指代分割）的统一解决方案成为可能。然而，现有的 3D-MLLMs 主要仍是以对象为中心（Object-Centric），限制了它们对精细粒度的部件结构（Fine-Grained Part Structures）进行建模的能力，而这些结构对于与 3D 环境的具身交互（Embodied Interaction）至关重要。在这项工作中，我们提出了 PAR3D，一种统一的部件感知（Part-Aware）3D-MLLM 框架，使模型能够理解、推理并在 3D 场景中定位对象及其部件。为了支持部件感知 3D 场景理解的训练与评估，我们引入了 ScenePart，一个带有部件级标注和语言指令的合成 3D 场景数据集。我们进一步开发了部件感知 3D 表示学习（Part-Aware 3D Representation Learning），通过精细粒度的部件级语义来丰富 3D 视觉表示，并提出分层分割查询生成（Hierarchical Segmentation Query Generation），通过分层对象 - 部件查询来定位部件目标。大量实验表明，我们的方法显著提高了部件级的问答和指代分割性能，同时在对象级别的视觉 - 语言任务（Vision-Language Tasks）上也取得了优异的表现。

Abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5

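Scoring rationale: The title explicitly contains "Unified" and "3D-MLLM", so Unify Models, MLLM, and MultiModal are highly relevant (9). Visual Encoder is implicit in the 3D visual processing, but the core novelty is representation learning rather than the encoder itself (6). Tokenizer, World Models, and model-based RL are not mentioned in the abstract and have little connection to this perception task (1-2). The author list does not include any of the designated experts.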

Keywords

3D-MLLM, Part-Aware Representation, Scene Understanding, Unified Framework, Visual-Language, Hierarchical Segmentation, Object-Part Grounding

In-Depth Analysis

Chinese Title: PAR3D：一种具有部件感知表示的统一3D多模态大语言模型用于场景理解

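Summary: This paper proposes PAR3D, a unified part-aware 3D multimodal large language model (3D-MLLM) that addresses the object-level focus of existing 3D-MLLMs and their lack of fine-grained part understanding. Background: existing 3D-MLLMs are object-centric and cannot model part structure, which limits embodied interaction. Method: the authors first build ScenePart, a synthetic dataset with part-level annotations and language instructions; they then propose Part-Aware 3D Representation Learning, which helps the visual backbone capture part-level geometry and semantics; finally they design Hierarchical Segmentation Query Generation, which grounds parts through decoupled object and part queries. Results: PAR3D improves substantially on part-level QA and referring segmentation while keeping strong performance on object-level vision-language tasks. Conclusion: the framework unifies object-level and part-level 3D scene understanding and provides a basis for embodied intelligence and interactive scene editing.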

Innovations:

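Builds the synthetic ScenePart dataset, with scene-level object and part masks, object-part correspondences, and multi-task language instructions, addressing the lack of data for part-level 3D scene understanding.
Proposes Part-Aware 3D Representation Learning, adapting a pre-trained 3D foundation encoder so the visual backbone captures fine-grained part geometry and semantics.
Designs Hierarchical Segmentation Query Generation, which produces decoupled object-level and part-level segmentation queries and grounds a part relative to its host object.
Described as the first unified 3D-MLLM framework with part-level understanding, supporting QA, segmentation, and reasoning beyond the object level.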

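Methodology: PAR3D extends the 3D-LLaVA framework and trains in two stages. Stage one pre-trains a part-aware 3D backbone on ScenePart with instance segmentation and part-level supervision; stage two performs instruction tuning. A 3D point cloud encoder extracts point-level and superpoint-level features, and a query decoder refines them and maps them into the LLM embedding space. A special [SEG] token carries segmentation queries, and hierarchical query generation predicts object-level and part-level masks. The dataset is built in four steps: preprocess part-annotated 3D assets, generate indoor layouts with MiDiffusion, instantiate the layouts and sample them into point cloud scenes, and generate language instructions with templates and an LLM.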
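
Hierarchical query generation grounds a part relative to its host object. One simple way to picture that relationship is to accept a point into the part mask only if it also falls inside the object mask. The sketch below assumes per-point scores from a hypothetical object query and part query and a shared threshold; it illustrates the object-part constraint and is not the PAR3D decoder.

```python
def threshold(scores, tau=0.5):
    """Binarise per-point scores."""
    return [s > tau for s in scores]


def hierarchical_part_mask(object_scores, part_scores, tau=0.5):
    """Keep a point in the part mask only if it also lies in the host-object mask."""
    obj = threshold(object_scores, tau)
    part = threshold(part_scores, tau)
    return [o and p for o, p in zip(obj, part)]


# Five toy points: the fourth looks like the part but lies outside the object.
object_scores = [0.9, 0.8, 0.7, 0.1, 0.2]
part_scores = [0.1, 0.9, 0.6, 0.8, 0.1]
mask = hierarchical_part_mask(object_scores, part_scores)
print(mask)
```

The fourth point shows why the hierarchy helps: a chair leg and a table leg can look alike, and restricting the part mask to the queried host object removes that ambiguity.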

Key Results:

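On the ScenePart-QA test set, PAR3D clearly outperforms the 3D-LLaVA baseline on part-level QA.
On the ScenePart-Seg test set, PAR3D improves markedly on part-level referring segmentation.
On object-level vision-language tasks (such as ScanQA and ScanRefer), PAR3D matches or exceeds existing 3D-MLLMs.
Ablations confirm the separate contributions of part-aware representation learning and hierarchical query generation.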

Tech Stack:

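3D point cloud encoder (the 3D backbone from 3D-LLaVA)
Query decoder
Large language model (LLM, e.g., the LLaVA family)
MiDiffusion (indoor layout generation)
3D-CoMPaT (part-annotated 3D assets)
3D-FRONT (indoor layout dataset)
Qwen3-VL-8B (object scale estimation and description generation)
Template rules with LLM refinement (language instruction generation)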

Strengths:

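Described as the first work to bring part-level understanding into a unified 3D-MLLM, addressing a gap in fine-grained scene understanding.
Builds ScenePart, a large part-level scene dataset for training and evaluation.
The design is sound: representation learning and hierarchical queries together handle part awareness.
Experiments are comprehensive, with good results on both part-level and object-level tasks.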

Limitations:

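ScenePart is synthetic, so a domain gap to real scenes may limit real-world generalization.
Part-aware representation learning depends on a pre-trained encoder and may be capped by that model's capacity.
Hierarchical query generation adds model complexity; inference efficiency remains to be evaluated.
Interaction is not validated on real robots or embodied tasks.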

Relevance To Keywords:

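Native multimodal large models: PAR3D is a 3D multimodal large model that unifies 3D vision with a language model, in line with this direction.
Representation learning: Part-Aware 3D Representation Learning strengthens how visual features encode part-level semantics.
World models: 3D scene understanding is an important component of world models, and part-level understanding makes scene modeling finer-grained.
Unify Models: PAR3D unifies QA, segmentation, and reasoning at both the object and part levels.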

15. Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation [PASS]

Score: 54.0 / 27.8

Authors: Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma

Published: 2026-06-04

TL;DR: To address error propagation in text-to-3D indoor scene generation using LVLMs, this paper proposes a Global-Local Monte Carlo Tree Search method that generates more realistic 3D scenes than state-of-the-art approaches.

摘要翻译

大型视觉语言模型（LVLMs）已在各种任务中展现出显著的推理能力。然而，基于大型视觉语言模型（LVLMs）进行文本到 3D 室内场景生成的研究尚不多见。主要挑战在于，现有的基于 LVLMs 的方法采用思维链（Chain-of-Thought）顺序决策机制，无法修正早期的决策，从而导致错误传播。本文将该任务视为受空间与布局常识约束的规划问题。为了解决这一问题，我们将其建模为一种包含全局树和局部树的树搜索问题，这与现有的顺序决策方法有所不同。在全局树中，我们迭代地放置每个物体，并探索多种尝试（类似于人类布置房间的过程），其中问题空间被表示为一棵树。为了有效地搜索这棵树，我们提出了一种分层场景表示方法以及一种基于概率路线图（PRM）引导的蒙特卡洛树搜索（MCTS）方法。该分层表示将场景抽象为房间层级、区域层级、地面物体层级以及支撑物体层级。基于 PRM 引导的 MCTS 方法利用 PRM 修剪不必要的分支，并利用 MCTS 算法平衡探索与利用，从而以更少的尝试次数获得最优解。在局部树中，该方法进一步将每个物体的放置分解为更细的子步骤，包括具体的放置参数。为了使场景的整体外观保持一致，我们利用预训练的扩散图像生成模型来预测场景中所有物体的纹理。鉴于现有的文本到 3D 室内场景生成基准在规模和多样性上仍显不足，我们收集了一个新的多样化大规模数据集，名为 3DTindo-bench，该数据集包含 65 种场景类型和 3250 条指令，涵盖不同的尺寸、布局和风格，以更好地评估最先进模型的能力。实验结果表明，我们的方法生成的 3D 场景比现有最先进的方法更为逼真。

Abstract

Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. The hierarchical representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3,250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art approaches.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	6.0/10	9.0

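Scoring rationale: The paper's core is using large vision-language models (MLLM) and their multimodal ability for text-to-3D generation, with Monte Carlo Tree Search (related to model-based RL) to optimize the decision process. It does not examine tokenizers, visual encoder details, or unified model architectures in depth, and the world-model idea appears only partly, in the hierarchical scene representation. The author list does not include the designated experts, so no bonus is applied.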

Keywords

Text-to-3D Indoor Scene Generation, Large Vision-Language Models, Monte Carlo Tree Search, Hierarchical Scene Representation, Planning Problem, Diffusion Image Generative Models, 3DTindo-bench

In-Depth Analysis

Chinese Title: 面向文本到3D室内场景生成的视觉-语言模型中全局-局部蒙特卡洛树搜索

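Summary: This paper proposes a global-local Monte Carlo Tree Search (MCTS) method for text-to-3D indoor scene generation. Existing methods use chain-of-thought (CoT) sequential decisions that cannot revise early mistakes, so errors propagate. The paper instead models the task as tree search, with a global tree for scene layout and a local tree for per-object placement details. In the global tree, objects are placed one at a time and several candidate placements are explored; a hierarchical scene representation (room, region, floor object, supported object) reduces complexity, and a PRM, which this analysis expands as a progress reward model while the abstract leaves the acronym unexpanded, guides the search by pruning unpromising branches while MCTS balances exploration and exploitation. The local tree further decomposes each object's placement into its parameters. For a consistent appearance, a pre-trained diffusion model generates object textures. The authors also collect 3DTindo-bench, a large and diverse dataset with 65 scene types and 3,250 instructions, and design a five-dimension evaluation framework. Experiments report more realistic scenes and an average gain of about 14% over the best baseline.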

Innovations:

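Models text-to-3D indoor scene generation as tree search rather than chain-style sequential decisions, so earlier choices can be revisited and corrected.
Proposes a hierarchical scene representation (room, region, floor object, supported object) that shrinks the search space.
Guides Monte Carlo Tree Search (MCTS) with a progress reward model (PRM), improving search efficiency through pruning and an exploration-exploitation balance.
Uses a pre-trained diffusion model to generate textures for all objects in the scene, giving a consistent appearance.
Builds the large and diverse 3DTindo-bench benchmark (65 scene types, 3,250 instructions) and designs a five-dimension evaluation scheme.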

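Methodology: The paper uses a global-local tree search framework. In the global tree, each object corresponds to one level and each node to a placement configuration. Search is PRM-guided MCTS: the PRM scores intermediate states and prunes branches, and MCTS balances exploration and exploitation through selection, expansion, simulation, and backpropagation. The local tree decomposes each object's placement into sub-steps (position, orientation, and so on). The generation flow is: first, an LVLM produces the hierarchical scene representation from the text instruction; next, PRM-guided MCTS searches the layout within each region; finally, a pre-trained diffusion model generates textures for all objects. Evaluation uses five dimensions: physical plausibility, semantic plausibility, layout plausibility, instruction alignment, and aesthetic consistency.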
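
The selection step of PRM-guided MCTS can be sketched with the standard UCT rule (UCB1 applied to trees) plus a pruning filter. This is a generic illustration, not the paper's algorithm: the `prm` field stands for whatever score the PRM assigns to a candidate placement, and the `prm_floor` cutoff, the exploration constant `c`, and the toy children are assumptions.

```python
import math


def uct(value_sum, visits, parent_visits, c=1.4):
    """UCB1 for trees: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    mean = value_sum / visits
    return mean + c * math.sqrt(math.log(parent_visits) / visits)


def select(children, parent_visits, prm_floor=0.2):
    """Prune children scored below prm_floor, then pick the highest UCT."""
    kept = [ch for ch in children if ch["prm"] >= prm_floor]
    return max(kept,
               key=lambda ch: uct(ch["value_sum"], ch["visits"], parent_visits))


children = [
    {"name": "sofa against wall", "prm": 0.8, "value_sum": 6.0, "visits": 10},
    {"name": "sofa blocking door", "prm": 0.1, "value_sum": 0.0, "visits": 0},
    {"name": "sofa by window", "prm": 0.6, "value_sum": 1.5, "visits": 2},
]
chosen = select(children, parent_visits=12)
print(chosen["name"])
```

Without the pruning filter, the unvisited "sofa blocking door" child would be selected because its UCT value is infinite; pruning first is what saves the wasted rollout, which matches the stated goal of reaching a good layout with fewer attempts.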

Key Results:

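On 3DTindo-bench, the method beats the best baseline by about 14% on average.
Compared with DFS, PRM-guided MCTS needs far fewer search attempts.
The texture generation module improves the consistency of scene appearance.
The hierarchical scene representation lowers computational cost, making large scenes feasible.
Ablations confirm the value of PRM pruning and the MCTS exploration-exploitation balance.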

Tech Stack:

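Monte Carlo Tree Search (MCTS)
Progress reward model (PRM)
Large vision-language model (LVLM)
Pre-trained diffusion image generation model (for textures)
Hierarchical scene representation (room, region, floor object, supported object)
Depth-first search (DFS, as a comparison baseline)
CLIP (for 3D asset retrieval)
3DTindo-bench dataset (65 scene types, 3,250 instructions)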

Strengths:

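Treats scene generation as tree search, avoiding the error accumulation of chain-style reasoning.
PRM-guided MCTS cuts search cost substantially while preserving quality.
The hierarchical representation lets the method scale to complex scenes with many objects.
The new large-scale benchmark addresses the shortage of evaluation data.
The texture module improves visual consistency and realism.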

Limitations:

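Depends on the LVLM's commonsense reasoning and may perform poorly on rare scenes or unusual instructions.
MCTS efficiency is still bounded by tree depth and branching factor; very complex scenes may need heavy computation.
Textures come from a pre-trained diffusion model and may not fully match the scene style.
The evaluation metrics are numerous but all automatic; no subjective user study validates them.
The dataset is large but covers only indoor scenes, not outdoor or dynamic ones.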

Relevance To Keywords:

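Unify Models: a single LVLM both interprets the text instruction and generates the layout, showing the unifying capacity of multimodal models.
World Models: the hierarchical scene representation and the PRM can be seen as an abstraction of an indoor-scene world model that guides search.
Representation Learning: the hierarchical scene representation is a structured representation that decomposes a complex scene into layered semantics.
Model-Based RL: MCTS combined with a PRM follows the model-based RL pattern of deciding through simulation and evaluation.
Native multimodal large models: the method uses the LVLM's commonsense reasoning directly, with no extra training.
Unified understanding and generation in MLLMs: the LVLM handles both instruction understanding and layout generation.
Representation learning (second keyword set): the hierarchical representation is a concrete instance of scene representation learning.
World models (second keyword set): the PRM scores intermediate states, similar to a state-value function in a world model.
Reinforcement learning: MCTS is a search algorithm widely used in RL, and the PRM supplies the reward signal.
Post-training: not involved, although the PRM and MCTS act as test-time reasoning enhancement, which is related to reasoning optimization in post-training.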

16. Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [PASS]

Score: 52.5 / 27.8

Authors: Kevin Dave, Sai Aditya Patkuri, Chhaya Kumar Das, Gouranga Bala, R. Venkatesh Babu, Rajeshkumar SA

Published: 2026-06-04

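TL;DR: This paper proposes an adaptive video tokenisation method based on temporal redundancy masking in latent space and latent inpainting, giving content-driven token allocation and a large inference speedup.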

摘要翻译

自适应视频标记化旨在根据序列底层的视觉复杂度动态分配 token 预算。当前的连续域方法通过迭代二值化搜索或训练好的神经回归器实现这一目标，而离散方法通常需要全速率解码器前向传播来估计信息量。我们证明此类计算开销并非严格必要。我们表明，冻结的连续视频标记器的潜在空间本质上编码了可直接利用的时间冗余：在连续帧之间潜在表示变化极小的空间位置携带接近零的额外信息。我们引入了一种无参数的自适应 token 分配机制，该机制对逐位置时间 L1 差异应用固定阈值，从而识别并丢弃冗余潜在位置。因此，压缩率自然地从输入内容中涌现，而非自上而下强制施加：静态场景被激进地压缩，而高度动态序列则保留更多 token。为了重建被丢弃的位置，我们提出了潜在修复 Transformer (LIT)，这是一种轻量级的分解式时空注意力架构。所得到的推理管道效率极高，仅需单次编码器前向传播和一次 LIT 前向传播，从而消除了对辅助路由网络的需求。在 TokenBench 和 DAVIS（近期标记器~\cite{infotok, agarwal2025cosmos}所使用的标准基准）上的评估表明，我们的框架实现了有意义的、内容驱动的 token 分配，同时保持了具有竞争力的重建保真度，并且在推理速度上比连续自适应基线（ElasticTok-CV）实现了 31 倍的加速，比离散信息论基线（InfoTok）实现了约 2 倍的加速。

Abstract

Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers (InfoTok; Cosmos, Agarwal et al. 2025), indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a 31× inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an ≈2× speedup over the discrete information-theoretic baseline (InfoTok).

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	10.0/10	15.0
Visual Encoder	1.5	9.0/10	13.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	2.0/10	3.0

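Scoring rationale: The core contribution is adaptive video tokenisation (Tokenizer, 10), and the method relies on a frozen video encoder to extract latent features (Visual Encoder, 9). Video processing is a building block for multimodal systems (MultiModal, 5) and MLLMs (5), but no text fusion is involved. The paper does not address model unification (Unify Models), world models (World Models), or reinforcement learning (model-based RL), so relevance is low (2). The weighted total is 52.5, well above the pass threshold of 27.8. The author list does not include the designated experts.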
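
The weighted total quoted above can be reproduced from the score table: each row's score is weight times relevance, and the total is compared against the threshold shown after the slash in the "Score" line. The sketch below assumes that rule and assumes a paper passes when the total is at least the threshold; the report does not state whether the comparison is strict.

```python
WEIGHT = 1.5           # every keyword in this report carries weight 1.5
PASS_THRESHOLD = 27.8  # value shown after the slash in "Score: 52.5 / 27.8"

# Relevance values (out of 10) from the table for paper 16.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 10.0,
    "Visual Encoder": 9.0,
    "World Models": 2.0,
    "MLLM": 5.0,
    "MultiModal": 5.0,
    "model-based RL": 2.0,
}


def weighted_total(relevance_by_keyword, weight=WEIGHT):
    """Sum weight * relevance over all keywords."""
    return sum(weight * r for r in relevance_by_keyword.values())


total = weighted_total(relevance)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"  # assumed comparison
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

The same computation reproduces the totals listed for the other papers in this section, for example 55.5 for the HomeWorld table.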

Keywords

Adaptive Tokenisation, Temporal Redundancy, Latent Inpainting, Video Tokeniser, Token Allocation, Inference Efficiency, Latent Space

In-Depth Analysis

Chinese Title: 基于时间冗余掩码与潜在空间修补的自适应分词化

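Summary: This paper proposes an adaptive video tokenisation framework that allocates the token budget according to the dynamic complexity of the video content. Existing methods such as ElasticTok and InfoTok rely on learned routers, extra decoder passes, or iterative search, all of which are computationally costly. The authors show that the latent space of a frozen continuous video tokeniser already encodes temporal redundancy: spatial positions whose latents barely change between adjacent frames carry almost no additional information. Building on this, they introduce a parameter-free allocation mechanism that thresholds the per-position temporal L1 difference with a fixed threshold to identify and drop redundant latent positions. To reconstruct the dropped positions they propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. Inference needs only one encoder pass and one LIT pass, with no auxiliary routing network. On TokenBench and DAVIS, the method gives meaningful, content-driven token allocation with reconstruction fidelity competitive with existing methods, a 31× inference speedup over ElasticTok-CV, and roughly a 2× speedup over InfoTok.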

Innovations:

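Shows, reportedly for the first time, that temporal L1 differences in the latent space of a frozen continuous video tokeniser are a sufficient signal for adaptive token allocation, with no learned router or extra decoder pass.
Proposes the Latent Inpainting Transformer (LIT), a factorised spatial-temporal attention architecture that interleaves spatial and temporal attention with 2D rotary positional embeddings to reconstruct dropped latent positions efficiently.
Achieves a fully parameter-free allocation mechanism in which the compression rate follows from the input: static scenes are compressed aggressively, and dynamic scenes keep more tokens.
Needs only one encoder pass and one LIT pass at inference, far less computation than existing adaptive methods (ElasticTok-CV, InfoTok).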

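Methodology: The pipeline is as follows. 1) A frozen continuous video tokeniser (such as Cosmos Tokenizer) serves as the backbone and encodes the input video into continuous latents. 2) In latent space, the temporal L1 difference between adjacent frames is computed at each spatial position; a fixed threshold τ turns it into a binary mask, and positions whose difference falls below the threshold are dropped. 3) The Latent Inpainting Transformer (LIT) uses factorised spatial-temporal attention, spatial attention with 2D rotary positional embeddings followed by temporal attention, to reconstruct dropped positions from the retained tokens. 4) LIT is trained with a reconstruction loss and a latent loss jointly, while the encoder and decoder stay frozen. 5) At inference, the encoder produces latents, the masking module produces the mask, LIT inpaints the dropped positions, and the decoder reconstructs the video.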
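
Step 2, the thresholded temporal L1 mask, is simple enough to sketch in plain Python. The version below assumes latents indexed as `latents[t][c][y][x]`, averages the absolute difference over channels, and keeps the first frame whole because it has no predecessor; those three choices are assumptions of this sketch, since the source gives only the general rule and the example threshold τ=0.3.

```python
def keep_mask(latents, tau):
    """latents[t][c][y][x] -> mask[t][y][x]; True means the position is kept."""
    T, C = len(latents), len(latents[0])
    H, W = len(latents[0][0]), len(latents[0][0][0])
    mask = [[[True] * W for _ in range(H)]]  # first frame kept whole (assumed)
    for t in range(1, T):
        frame = []
        for y in range(H):
            row = []
            for x in range(W):
                # Channel-averaged L1 change relative to the previous frame.
                diff = sum(abs(latents[t][c][y][x] - latents[t - 1][c][y][x])
                           for c in range(C)) / C
                row.append(diff >= tau)
            frame.append(row)
        mask.append(frame)
    return mask


def drop_rate(mask):
    """Fraction of latent positions that were dropped."""
    flat = [k for frame in mask for row in frame for k in row]
    return 1 - sum(flat) / len(flat)


# Toy latent: 3 frames, 1 channel, 1x2 grid; only the right position moves.
latents = [[[[0.0, 0.0]]], [[[0.0, 1.0]]], [[[0.1, 2.0]]]]
mask = keep_mask(latents, tau=0.3)
print(mask, round(drop_rate(mask), 3))
```

The drop rate here is an output, not an input: a fully static clip would drop everything after the first frame, while a fast-moving one would drop nothing, which is the content-driven behaviour the paper describes.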

Key Results:

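On TokenBench and DAVIS, reconstruction fidelity matches the continuous adaptive baseline ElasticTok-CV while using fewer tokens.
Against the discrete adaptive method InfoTok, the method performs better under the same retention constraint.
Inference is 31× faster than ElasticTok-CV and about 2× faster than InfoTok.
On UCF-101 with τ=0.3, the per-video drop rate ranges from 5.15% to 86.10%, showing that the compression rate follows the dynamic complexity of the content.
Static scenes are compressed more aggressively and dynamic scenes keep more tokens, confirming content adaptivity.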

Tech Stack:

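Continuous video tokeniser: Cosmos Tokenizer (4×8×8 spatio-temporal compression, 16-channel latent space)
Latent Inpainting Transformer (LIT): factorised spatial-temporal attention architecture
2D rotary positional embeddings
Thresholded temporal L1 difference masking
Joint training with a reconstruction loss and a latent loss
Frozen encoder-decoder architecture
TokenBench and DAVIS benchmark datasets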

Strengths:

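The allocation mechanism is fully parameter-free, needing no learned router or extra decoder pass, which cuts computation sharply.
Inference is very efficient, with one encoder pass and one LIT pass, which suits real-time or resource-constrained settings.
The compression rate follows the input content: static scenes are compressed more and dynamic scenes keep more tokens.
Reconstruction fidelity is on par with existing methods on several benchmarks, with much faster inference.
The method is simple and easy to implement and integrate into existing continuous tokeniser frameworks.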

Limitations:

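The threshold τ is a fixed scalar that must be calibrated to the latent variance of each backbone, which may hurt generalization.
The method depends on a frozen continuous tokeniser; if the tokeniser is weak, the temporal redundancy signal in its latent space may be unreliable.
LIT learns inpainting only through training, so quality may degrade on extremely dynamic or complex scenes.
There is no direct comparison under the same conditions with other discrete adaptive methods such as AdapTok; only InfoTok is compared.
Scalability and compute requirements on long or high-resolution video are not discussed.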

Relevance To Keywords:

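Unify Models: the adaptive tokenisation framework handles static and dynamic video content in one scheme and offers an efficient representation for unified models.
World Models: video tokenisers are a key component of world models, and adaptive allocation helps them process dynamic environments more efficiently.
Representation Learning: the method exploits temporal redundancy in latent space for adaptive token allocation.
Model-Based RL: efficient video tokenisation can speed up environment modeling and planning in model-based RL.
Native multimodal large models: adaptive tokenisation can make video processing in MLLMs more efficient.
Unified understanding and generation in MLLMs: continuous latents and efficient token allocation help unify understanding and generation tasks.
Representation learning (second keyword set): the core aim is a more compact, content-adaptive video representation.
World models (second keyword set): adaptive compression helps world models handle complex dynamic scenes.
Reinforcement learning: efficient tokenisation can speed up environment modeling and planning in RL.
Post-training: LIT is trained after the tokeniser is frozen, which places it in the post-training stage.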

17. A Vision-language Framework for Comparative Reasoning in Radiology [PASS]

Score: 51.0 / 27.8

Authors: Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

Published: 2026-06-04

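TL;DR: The paper proposes MedReCo, a vision-language framework that uses entity-aware reasoning for comparative retrieval and generative interpretation of radiology images, substantially improving longitudinal follow-up accuracy.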

摘要翻译

医学影像人工智能在孤立图像解读方面表现强劲，但与放射学实践仍存在较大偏差，因为放射学中的诊断和随访依赖于对既往研究及类似参考病例的比较。本文将放射学比较定义为实体感知的跨图像推理问题，并提出一个同时支持参考病例检索和时序比较解读的框架。我们构建了 MedReCo-DB，这是一个从常规图像 - 报告对派生的大规模比较成像资源，涵盖来自 8 个机构、4 个国家、7 种成像模态的 16 万多名患者的 69 万余张图像。报告被分解为解剖结构、异常发现和病理状况，以此提供实体条件检索和比较性视觉问答的监督信号。基于此资源，我们开发了 MedReCo，一种用于可控检索临床类似病例的实体感知视觉编码器，以及 MedReCo-VLM，一种用于间隔变化生成式解读的视觉 - 语言扩展模型。在内部、外部及跨中心评估中，MedReCo 在所有 12 种内部检索设置中均取得了最高的 Recall@1，且外部检索性能平均提升了 6.0 个百分点。在临床易混淆的鉴别诊断组中，它一贯优于最强基线。MedReCo-VLM 在所有比较生成评估中均取得了最佳性能，并在胸片上将纵向随访准确率提高了 14.5-46.5 个百分点，在 CT 上提高了 13.0-27.9 个百分点。这些发现表明，实体感知比较推理可以从大规模常规临床数据中学习，并可能为医学影像人工智能提供更符合临床实践的基础。

Abstract

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision--language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	9.0/10	13.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

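Scoring rationale: The paper proposes a vision-language framework for comparative reasoning in radiology. Visual Encoder and MultiModal are highly relevant because the work centers on an entity-aware visual encoder and multimodal data. MLLM is relevant because a vision-language model generates the interpretations. Unify Models is moderately relevant because the framework unifies retrieval and generation. Tokenizer is weakly relevant. World Models and model-based RL are not relevant; the paper contains no reinforcement learning. The author list includes none of the designated experts. The weighted total is 51.0, above the pass threshold of 27.8.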

关键词

Comparative Reasoning, Radiology, Vision-language Framework, Entity-aware Visual Encoder, Retrieval, Generative Interpretation, Medical Imaging, Longitudinal Follow-up

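In-depth analysis

Chinese Title: 放射学中比较推理的视觉-语言框架

Summary: This paper proposes a vision-language framework for comparative reasoning in radiology, addressing the fact that current medical imaging AI interprets single images in isolation and lacks clinical comparative reasoning. It divides radiological comparison into reference comparison (retrieving similar cases across patients) and temporal comparison (describing change between a patient's earlier and later studies), and builds MedReCo-DB, a large-scale comparative imaging resource with more than 690,000 images from eight institutions, four countries and seven modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide entity-level supervision. On this basis the authors develop MedReCo, an entity-aware visual encoder for controllable retrieval, and MedReCo-VLM, a vision-language extension for generative comparative interpretation. MedReCo achieves the highest Recall@1 in all 12 internal retrieval settings and improves external retrieval by a mean of 6.0 percentage points; MedReCo-VLM leads on every comparative generation evaluation and improves longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs. The study indicates that entity-aware comparative reasoning can be learned at scale from routine clinical data and may offer a more clinically aligned foundation for medical imaging AI.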

Innovations:

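The first to formalize radiological comparative reasoning as an entity-aware cross-image reasoning problem covering two core clinical scenarios: reference comparison and temporal comparison.
Builds MedReCo-DB, a large-scale multi-modality comparative imaging dataset with more than 690,000 images, over 160,000 patients and seven modalities, with entity-level annotations (42 anatomical structures, 69 abnormal findings, 28 pathological conditions).
Proposes MedReCo, an entity-aware visual encoder that learns fine-grained entity-aware representations through text-guided contrastive ranking and supports controllable retrieval.
Develops MedReCo-VLM, which connects the entity-aware visual representations to a large language model through instruction tuning to produce generative comparative interpretations.
Evaluates under several strict settings (internal, external, cross-center, and clinically confusable differential groups) and reports clear gains over existing baselines.

Methodology: The technical route has four parts. 1) Data construction: image-report pairs are collected from eight public datasets and one in-house dataset; a report decomposition tool extracts entity-level annotations (anatomical structures, abnormal findings, pathological conditions), from which retrieval triplets and comparative VQA samples are built. 2) Model design: MedReCo processes images with a modality-aware visual encoder, focuses on relevant visual evidence through entity-conditioned attention, and learns entity-aware representations with a text-guided contrastive ranking loss. 3) Generative extension: MedReCo-VLM connects the pretrained visual encoder to a large language model via instruction tuning; it takes an image pair and an entity query and outputs a comparative natural-language description. 4) Evaluation: internal validation, external validation, cross-center retrieval, clinically confusable differential groups and public paired-image VQA benchmarks, using Recall@k and generation metrics (such as BLEU, ROUGE and CIDEr).

Key Results:

MedReCo achieves the highest Recall@1 in all 12 internal retrieval settings, improving on the strongest baseline by 5.8, 3.5 and 6.1 percentage points for anatomical structures, abnormal findings and pathological conditions, respectively.
External retrieval Recall@1 improves by a mean of 6.0 percentage points, and the advantage holds in cross-center retrieval.
In clinically confusable differential groups (such as pulmonary nodule versus atelectasis), MedReCo consistently outperforms the strongest baselines.
MedReCo-VLM achieves the best performance in all 24 comparative generation evaluations and reaches 87.1% accuracy on public paired-image VQA benchmarks.
On longitudinal follow-up, accuracy improves by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT.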

Tech Stack:

Contrastive learning
CLIP architecture (text-image alignment)
Large language model (LLM)
Instruction tuning
Vision encoder (e.g., ViT)
Text encoder
Attention mechanism (entity-conditioned attention)
Ranking loss
Report decomposition tools (e.g., RadGraph, CheXbert)
Evaluation metrics: Recall@k, BLEU, ROUGE, CIDEr, accuracy
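Recall@k is the headline retrieval metric here. A minimal sketch follows, assuming the common hit-rate definition (a query counts as a hit if any relevant case appears in the top k results); the paper's exact definition is not given in this report, and the case IDs are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant item is in the top-k of the ranking, else 0.0."""
    top_k = ranked_ids[:k]
    return 1.0 if any(i in relevant_ids for i in top_k) else 0.0

def mean_recall_at_k(queries, k):
    """Average the per-query hit indicator over (ranking, relevant_set) pairs."""
    return sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)

# Toy data: three queries with hypothetical case IDs.
queries = [
    (["c3", "c1", "c7"], {"c3"}),  # hit at rank 1
    (["c2", "c9", "c4"], {"c4"}),  # hit at rank 3
    (["c5", "c6", "c8"], {"c0"}),  # no hit
]
r1 = mean_recall_at_k(queries, 1)
r3 = mean_recall_at_k(queries, 3)
print(r1, r3)
```

A gain of 6.0 percentage points in Recall@1 then means that, out of 100 queries, about six more return a relevant case at rank 1.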

Strengths:

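A large-scale, multi-modality, multi-institution, multi-country dataset covering seven imaging modalities, which supports strong generalization.
The entity-aware design lets the model retrieve and generate for a specific clinical entity rather than match globally, which fits clinical needs better.
A unified framework supports both reference comparison and temporal comparison, building retrieval and generation on the same visual representation.
Clear gains across several strict evaluation settings (internal, external, cross-center, confusable groups) support the method's effectiveness and robustness.
Supervision is generated automatically from routine clinical reports, with no extra manual annotation, so the approach scales.

Limitations:

The dataset depends on report quality, and report decomposition may introduce errors that affect entity annotation accuracy.
Only seven modalities are covered; modalities such as echocardiography and nuclear medicine are not included, so generality remains to be extended.
Performance on rare or long-tail entities may be limited because the training data distribution is uneven.
The quality of generated comparative interpretations is assessed mainly with automatic metrics, without manual evaluation by clinical experts.
Compute requirements are high, especially for the instruction tuning stage of MedReCo-VLM.

Relevance To Keywords:

Native multimodal large models: The vision-language framework (MedReCo-VLM) is a multimodal large model that processes images and text directly.
Unified understanding and generation in multimodal large models: The framework supports retrieval (understanding, through MedReCo) and comparative interpretation (generation, through MedReCo-VLM) on a shared visual representation.
Representation learning: MedReCo learns entity-aware visual representations through contrastive learning, a typical application of representation learning.
World models: Comparative reasoning across time and across patients involves modeling disease evolution and similar cases, which could loosely be viewed as a simplified world model for medical imaging.
Reinforcement learning: The paper does not use reinforcement learning. Instruction tuning is supervised post-training and is not itself an RL method such as RLHF.
Post-training: The instruction tuning of MedReCo-VLM is a post-training stage that aligns the pretrained visual encoder with the large language model.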

18. ActiveMimic: Egocentric Video Pretraining with Active Perception PASS

Score: 51.0 / 27.8

Authors: Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang

Published: 2026-06-04

TL;DR: ActiveMimic pretrains models on egocentric human video by modeling camera motion as active perception actions, achieving robot manipulation performance comparable to models pretrained on robot data.

Abstract (Chinese translation)

第一人称视角人类视频为机器人数据的预训练提供了一种可扩展的替代方案，然而，基于此类视频预训练的模型始终逊色于基于机器人数据预训练的模型。我们将这一差距归因于一个缺失的信号，即第一人称视角视频中的主动感知行为：在操作过程中，人类不断调整其视点，从而引发相机运动，而标准流程将其视为噪声。为了解决这一问题，我们提出了 ActiveMimic，这是一个预训练框架，它从单个佩戴式 RGB 相机中恢复同步的相机和手腕轨迹，将相机运动建模为视点动作，并从真实场景中的第一人称视角人类视频中联合学习主动感知和操作，随后适配至目标机器人。实验上，针对具有多样主动感知需求的任务进行的真实世界实验表明，ActiveMimic 始终超越基于人类视频预训练的基线，并达到基于机器人数据预训练的最先进模型的水平。进一步分析提供了证据，表明主动感知能力源于第一人称视角人类视频预训练，而非机器人特定微调，证实了主动感知是解锁第一人称视角人类视频用于机器人预训练的关键。

Abstract

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	7.0/10	10.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	8.0/10	12.0

Scoring rationale: The paper proposes ActiveMimic for egocentric video pretraining in robotics. 'model-based RL' (8.0) and 'World Models' (7.0) are highly relevant as it models environment dynamics (camera + wrist) for manipulation tasks. 'Unify Models' (5.0) and 'MultiModal' (5.0) are moderate as it unifies perception/action and combines visual/control data. 'Visual Encoder' (6.0) is relevant for video processing. 'Tokenizer' (2.0) and 'MLLM' (1.0) are low as no language models or tokenization mechanisms are central to the work. No expert authors from the specified list were found in the author list.

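Keywords

Egocentric Video, Active Perception, Camera Motion, Robot Manipulation, Pretraining Framework, World Model, Model-Based RL

In-depth analysis

Chinese Title: ActiveMimic：基于主动感知的自我中心视频预训练

Summary: The paper presents ActiveMimic, a framework for robot pretraining on egocentric human video. Existing methods ignore active perception (humans continuously adjust their viewpoint while manipulating), which leaves them behind models pretrained on robot data. ActiveMimic recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, decouples the camera-wrist coupling, and builds a unified 27-dimensional action representation to learn active perception and manipulation jointly. After pretraining on the Ego4D dataset, the model is transferred to a real humanoid robot. Experiments show that ActiveMimic surpasses human-video baselines on tasks that demand active perception and performs on par with models pretrained on robot data. Further analysis indicates that the active perception capability comes from egocentric video pretraining rather than robot fine-tuning, identifying active perception as the key to unlocking egocentric video for robot pretraining.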

Innovations:

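Proposes ActiveMimic, which recovers synchronized camera and wrist trajectories from a single body-worn RGB camera and jointly models active perception and manipulation without specialized hardware.
Builds a unified 27-dimensional action representation that encodes camera viewpoint motion and the motion of both wrists in one reference frame, and learns their coupling with a flow matching objective.
Provides the first evidence that active perception capability originates from egocentric human video pretraining rather than robot fine-tuning, and shows that camera-motion supervision aids representation transfer from human perception to robot control.
Transfers active perception to a real robot, which then adjusts its viewpoint during task execution and achieves a markedly higher success rate.

Methodology: First, wrist poses are estimated with SAM-3D-Body, camera trajectories with VGGT, and metric scale is recovered with UniDepth, yielding camera and wrist trajectories from raw egocentric video. Next, all poses within each temporal chunk are re-anchored to the coordinate frame of the chunk's first frame, which decouples camera motion from wrist motion. The decoupled poses are then encoded as a 27-dimensional action vector (6D rotation plus translation for the camera and for each of the left and right wrists). The model uses a mixture-of-transformers architecture (a vision-language prefix plus an action-expert suffix) trained with a conditional flow matching objective to predict future continuous actions. Finally, the model is fine-tuned on robot data to transfer the active perception capability.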
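The re-anchoring and 27-dimensional encoding steps can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it assumes the reference frame is the camera pose at the chunk's first frame, that each pose is a rotation matrix plus a 3D translation, and that the 6D rotation is the first two columns of the rotation matrix. All function names and the toy poses are hypothetical.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def transpose3(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_z(theta):
    """Rotation about the z axis, used only to build toy poses."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def relative_pose(ref, pose):
    """Express pose = (R, t) in the frame of ref = (R0, t0): (R0^T R, R0^T (t - t0))."""
    r0, t0 = ref
    r, t = pose
    r0_t = transpose3(r0)
    diff = [t[i] - t0[i] for i in range(3)]
    t_rel = [sum(r0_t[i][k] * diff[k] for k in range(3)) for i in range(3)]
    return matmul3(r0_t, r), t_rel

def rot6d(r):
    """6D rotation: the first two columns of the rotation matrix, flattened."""
    return [r[0][0], r[1][0], r[2][0], r[0][1], r[1][1], r[2][1]]

def action_vector(camera, left_wrist, right_wrist):
    """Concatenate (6D rotation + 3D translation) for camera and both wrists: 3 * 9 = 27."""
    vec = []
    for r, t in (camera, left_wrist, right_wrist):
        vec.extend(rot6d(r))
        vec.extend(t)
    return vec

ref = (rot_z(0.3), [1.0, 2.0, 3.0])      # assumed reference: camera pose at the chunk's first frame
cam_t = (rot_z(0.5), [1.5, 2.0, 3.0])    # toy camera pose at a later frame
left_t = (rot_z(0.0), [1.2, 2.1, 2.8])   # toy left-wrist pose
right_t = (rot_z(-0.2), [0.9, 2.2, 2.9]) # toy right-wrist pose
action = action_vector(*(relative_pose(ref, p) for p in (cam_t, left_t, right_t)))
identity = relative_pose(ref, ref)
print(len(action))
```

Re-anchoring the reference pose to itself gives the identity rotation and zero translation, so within a chunk the camera entries of the vector describe only how the viewpoint moved, which is the signal the report says standard pipelines discard as noise.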

Key Results:

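On real-robot tasks that require active perception, ActiveMimic consistently surpasses baselines pretrained on human video.
Performance is on par with state-of-the-art models pretrained on robot data.
Ablations indicate that camera-motion supervision is key to the higher success rate and that active perception comes mainly from pretraining rather than robot fine-tuning.
Analysis shows that camera-motion supervision promotes representation transfer from human perception to robot control.

Tech Stack:

SAM-3D-Body (wrist pose estimation)
VGGT (camera trajectory estimation)
UniDepth (metric depth recovery)
6D continuous rotation representation (pose encoding)
Conditional flow matching as the training objective
Mixture-of-Transformers architecture
Ego4D dataset (large-scale egocentric video)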

Strengths:

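Needs no specialized hardware (such as VR headsets or extra cameras): a single body-worn RGB camera suffices to extract active perception signals from in-the-wild egocentric video, which makes the approach highly scalable.
The first to model active perception and manipulation jointly and to show that this is key to closing the gap between human-video and robot-data pretraining.
Real-robot experiments validate the method across tasks with diverse active perception demands, and the results are convincing.
The analysis of where active perception capability comes from gives follow-up work a basis to build on.

Limitations:

Relies on the Ego4D dataset; generalization to other egocentric video datasets needs further validation.
The camera-wrist decoupling depends on offline vision models and may be inaccurate under fast motion or occlusion.
Models only camera and wrist actions; the effect of other body joints (such as the torso and legs) on active perception is not considered.
Robot fine-tuning still needs a small amount of robot data, so fully zero-shot transfer is not achieved.

Relevance To Keywords:

Unify Models: ActiveMimic models perception and manipulation jointly through a unified 27-dimensional action representation, which reflects the idea of model unification.
World Models: The framework learns the joint dynamics of camera and wrists from video and can be viewed as an implicit world model that predicts future actions.
Representation Learning: Pretraining learns active perception representations that transfer to robot control.
Model-Based RL: The flow matching objective can be seen as a model-based prediction method, but the paper does not directly involve reinforcement learning.
Native multimodal large models: The mixture-of-transformers architecture with a vision-language prefix is related to multimodal large model design.
Unified understanding and generation in multimodal large models: The model both understands visual input and generates action sequences.
Representation learning: same as the Representation Learning entry above.
World models: same as the World Models entry above.
Reinforcement learning: The paper does not use reinforcement learning; learning active perception and manipulation jointly from human demonstrations is closer to behavior cloning.
Post-training: The two-stage recipe (pretraining followed by robot fine-tuning) includes a post-training stage.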

19. Towards World Models in Biomedical Research PASS

Score: 49.5 / 27.8

Authors: Guangyu Wang, Jingkun Yue, Siqi Zhang, Yu Liu, Xiaoyu Wang, Mingyuan Meng, Changwei Ji, Zongbo Han, Yulin Wang, Yang Yue, Frank Fu, Ting Chen, Song Wu, Ziwei Liu, Jiangning Song, Ming Li, Gao Huang, Xiaohong Liu, Athanasios Vasilakos, Xingcai Zhang, Ping Zhang, Yong Li

Published: 2026-06-04

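TL;DR: This paper proposes biomedical world models, which learn latent representations and intervention-conditioned dynamics to simulate biological futures, enabling simulation-guided, closed-loop biomedical discovery.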

Abstract (Chinese translation)

生物医学的一个核心目标是理解、预测并最终控制生物系统对扰动、疾病进展及治疗干预作出反应的动态机制。尽管基础模型（foundation models）和大语言模型（large language models）加速了生物医学数据的解读，但大多数当前系统仍专注于静态模式识别，而非生物未来的前瞻性模拟。在此，我们提出生物医学世界模型（biomedical world models）作为 AI 驱动发现的范式。这些模型学习分子、细胞、组织和临床状态的潜在表示，以及干预条件动力学，从而允许在行动之前模拟未来轨迹。我们探讨了生物医学世界模型如何在虚拟细胞（virtual cells）、类器官（organoids）、虚拟病人（virtual patients）及手术模拟（surgical simulation）等应用中充当数据引擎（data engines）、环境模拟器（environment simulators）和科学规划基底（scientific planning substrates）。我们概述了所需的数据基础设施、评估基准（evaluation benchmarks）、安全约束及治理框架（governance frameworks）。生物医学世界模型可能为模拟引导、闭环（closed-loop）且具有可实验操作性的生物医学发现提供基础。

Abstract

A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	10.0/10	15.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	7.0/10	10.5

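Scoring rationale: The title and abstract both center on 'World Models', so that keyword scores the full 10. The paper integrates molecular, cellular, tissue and clinical states, which implies multimodal data fusion, so 'MultiModal' is relevant (6). Intervention-conditioned dynamics simulation and closed-loop discovery fit the logic of 'model-based RL' (7). Foundation models and language models are mentioned, which links to 'MLLM' (5). A new paradigm is proposed but architectural unification is not emphasized, so 'Unify Models' is moderately relevant (5). 'Tokenizer' and 'Visual Encoder' architecture details are not discussed and score 0. The author list includes none of the designated experts.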

Keywords

World Models, Biomedical Research, Latent Representations, Intervention-Conditioned Dynamics, Simulation-Guided Discovery, Closed-Loop, Biological Systems

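In-depth analysis

Chinese Title: 迈向生物医学研究中的世界模型

Summary: This paper proposes biomedical world models as a new paradigm for AI-driven biomedical discovery. Current biomedical AI focuses mainly on static pattern recognition, whereas biological systems are inherently dynamic and call for models that can simulate how states evolve. A world model learns latent state representations across molecular, cellular, tissue and clinical scales and models intervention-conditioned transitions, so that future trajectories can be simulated before any action is taken. The paper describes three core capabilities: data engine (generating simulated trajectories to ease data sparsity), environment simulator (predicting state evolution under interventions) and scientific action planner (supporting closed-loop reasoning and experimental decisions). It discusses the potential in virtual cells, organoids, virtual patients and surgical simulation, and analyzes the required data infrastructure, evaluation benchmarks, safety constraints and governance frameworks. The end goal is simulation-guided, closed-loop and experimentally actionable biomedical discovery.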

Innovations:

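Systematically brings the world model concept into biomedicine (described as a first), proposing the biomedical world model paradigm with emphasis on multi-scale dynamic simulation and intervention-conditioned state evolution.
Defines three core capabilities (data engine, environment simulator, scientific action planner) and distinguishes biomedical world models from existing foundation models, digital twins and generative AI.
Outlines implementation paths for virtual cells, organoids, virtual patients and surgical simulation, offering a unified framework for cross-scale biomedical modeling.
Discusses the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks needed to build such models, giving a roadmap toward deployment.

Methodology: This is a perspective and review article built on conceptual analysis and framework construction. It first reviews the development of world models in cognitive science, reinforcement learning and video generation and maps their core ideas onto biomedicine. A formal definition (observation, latent state, action-conditioned transition, decoding) is used to distinguish three classes of world model: sensory-space, latent-space and agent-coupled. It then analyzes three defining properties of biomedical world models: multi-scale latent state modeling, intervention-conditioned dynamics, and scientific reasoning and planning. Finally, feasibility is argued through application cases and a discussion of challenges; no experimental validation is included.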
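The four ingredients of the formal definition (observation, latent state, action-conditioned transition, decoding) can be made concrete with a toy rollout. The sketch below is purely illustrative: the class name, the scalar dynamics and the "treatment" actions are invented for the example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class LatentWorldModel:
    encode: Callable[[float], float]             # observation -> latent state
    transition: Callable[[float, float], float]  # (latent, action) -> next latent
    decode: Callable[[float], float]             # latent -> predicted observation

    def rollout(self, observation: float, actions: Sequence[float]) -> List[float]:
        """Simulate future observations for a planned action sequence without acting."""
        z = self.encode(observation)
        predictions = []
        for a in actions:
            z = self.transition(z, a)
            predictions.append(self.decode(z))
        return predictions

# Hypothetical scalar dynamics: the state halves each step and an intervention adds to it.
model = LatentWorldModel(
    encode=lambda o: 2.0 * o,
    transition=lambda z, a: 0.5 * z + a,
    decode=lambda z: z / 2.0,
)
no_treatment = model.rollout(4.0, [0.0, 0.0])
with_treatment = model.rollout(4.0, [1.0, 1.0])
print(no_treatment, with_treatment)
```

Comparing the two rollouts before choosing an action is the "simulate before acting" use that the paper assigns to the environment simulator and planning roles.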

Key Results:

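Gives a formal definition and conceptual framework for biomedical world models.
Identifies three core capabilities: data engine (generating simulated trajectories for augmentation), environment simulator (predicting intervention-conditioned dynamics) and scientific action planner (closed-loop reasoning and experiment planning).
Identifies four representative application scenarios: virtual cells, organoids, virtual patients and surgical simulation.
Lists the key building blocks: multimodal state representation, action-conditioned dynamics and scientific agents.
Points out the main challenges: longitudinal data infrastructure, evaluation benchmarks, privacy and safety, and large-scale deployment.

Tech Stack:

Diffusion models (e.g., Sora)
Autoregressive Transformers (e.g., Genie, GAIA-1)
JEPA-style models (V-JEPA2, DINO-WM)
Neural ODEs, physics-informed neural networks, neural operators
Recurrent state-space models (RSSM)
Model-based reinforcement learning frameworks such as Dreamer, TD-MPC2 and IRIS
Multimodal data integration (multi-omics, spatial atlases, imaging, clinical records)
Representation learning, causal inference, post-training techniques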

Strengths:

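Forward-looking: it brings the frontier concept of world models into biomedicine in a systematic way, which is novel and thought-provoking.
Complete framework: definition, core capabilities, application scenarios and challenges are all covered with clear logic.
Cross-disciplinary: it combines cognitive science, control theory, generative AI and biomedicine.
The emphasis on dynamic simulation and intervention conditioning goes beyond static pattern recognition and is closer to real biomedical needs.
It gives careful thought to practical deployment issues such as data, evaluation and safety.

Limitations:

This is a conceptual paper without a concrete implementation or experimental validation, so feasibility remains to be shown.
Building biomedical world models needs large volumes of longitudinal, multimodal, intervention-annotated data, and current data infrastructure falls far short.
Interpretability and causal inference are not discussed in depth, and no concrete algorithm design is given.
Technical challenges such as compute cost, training stability and generalization are not discussed in detail.
The governance and safety framework is high-level and lacks concrete operating guidance and quantitative indicators.

Relevance To Keywords: The paper is closely related to the research keywords. Its core topic is world models, and it covers representation learning (latent state representations), model-based reinforcement learning (Dreamer and similar), multimodal large models (multimodal data integration) and unified understanding and generation (a world model supports both prediction and generation). It also discusses post-training in the context of scientific agents and planning, which corresponds to the keyword "post-training". Its emphasis on multimodal modeling in biomedicine further relates to the keywords "native multimodal large models" and "unified understanding and generation in multimodal large models".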

20. Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback PASS

Score: 49.5 / 27.8

Authors: Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan

Published: 2026-06-04

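TL;DR: To address the difficulty of diagnosing localized, subtle defects in text-to-image models, this paper proposes Structured Defect Grounding and uses a vision-language model detector to guide reinforcement learning that improves diffusion model alignment.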

Abstract (Chinese translation)

尽管生成的图像日益具有照片级真实感，文本到图像（T2I）模型仍表现出局部、细微且结构复杂的缺陷。诊断这些缺陷需要实例级反馈，以回答缺陷发生的位置、类型、成因及其对整体图像质量的重要性。尽管近期的密集反馈方法超越了标量监督，但其以热力图为中心的表示仍将诊断问题表述为像素场回归，这使得定位基数可变的缺陷并将语义原因绑定到单个缺陷变得困难。为了解决这一表示瓶颈，我们提出结构化缺陷定位（Structured Defect Grounding, SDG），它将 T2I 诊断视为结构化集合预测，通过将每个缺陷建模为（位置、类型、原因、重要性）元组。为了使该表述可训练且可评估，我们引入了 SDG-30K，这是一个包含 30K 张图像的数据集，具有基于框的标注，涵盖四种现代 T2I 生成器，并配套专门的评估协议 SDG-Eval。基于这种结构化表示，我们进一步提出一个诊断到对齐的框架，其中视觉语言模型（VLM）充当 SDG 检测器，而 BoxFlow-GRPO 将预测的缺陷集转换为基于框的、重要性加权的空间奖励，用于扩散模型对齐。广泛实验表明，我们的 SDG 检测器在结构化缺陷定位上优于领先的专有视觉语言模型，而 SDG 引导的奖励一致地提升 T2I 对齐效果并支持局部图像细化。这些结果确立了 SDG 作为一个统一的、实例级的接口，用于诊断、评估和增强现代生成模型。

Abstract

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	7.0/10	10.5

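Scoring rationale: The paper uses a vision-language model (MLLM) on multimodal (MultiModal) text-image data and proposes a structured defect detection framework whose predictions are converted into rewards for reinforcement learning (BoxFlow-GRPO); the report scores this under 'model-based RL', so these three keywords score high. The abstract describes SDG as a "unified interface", so 'Unify Models' is moderate. 'Tokenizer', 'Visual Encoder' and 'World Models' are not core contributions and score low. The author list includes none of the designated experts, so no bonus applies. Weighted total: 49.5, above the dynamic pass mark of 27.8.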

Keywords

Structured Defect Grounding, Text-to-Image Feedback, Vision-Language Model, Diffusion Model Alignment, Structured Set Prediction, BoxFlow-GRPO, Instance-level Feedback

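In-depth analysis

Chinese Title: 位置、类型、原因与重要性：面向文本到图像反馈的结构化缺陷定位

Summary: This paper targets the localized, subtle and structurally complex defects in images produced by text-to-image (T2I) models and proposes Structured Defect Grounding (SDG). SDG casts T2I diagnosis as variable-cardinality set prediction in which each defect is a (location, type, reason, importance) tuple, overcoming the difficulty that heatmap regression has in localizing a variable number of defects and binding semantic reasons to them. To support the formulation, the authors build the SDG-30K dataset (30,096 images from four modern T2I generators with box-level annotations) and design the SDG-Eval evaluation protocol. They further propose a diagnosis-to-alignment framework: a vision-language model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into importance-weighted spatial rewards for post-training alignment of diffusion models. Experiments show that the SDG detector outperforms leading proprietary VLMs on structured defect grounding, and that SDG-guided rewards improve T2I alignment quality and support localized image refinement.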

Innovations:

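Proposes the Structured Defect Grounding (SDG) representation, which turns T2I diagnosis into variable-cardinality set prediction with each defect output as a (location, type, reason, importance) tuple, covering both artifact and misalignment defects.
Builds the SDG-30K dataset of 30,096 images from four modern T2I generators with box-level annotations, natural-language reasons and importance scores, filling a gap in instance-level joint annotation.
Designs the SDG-Eval protocol with image-level and defect-level metrics (DetTypeF1, ClnAcc, BoxF1, DescCos, ImpAcc) for evaluating structured defect sets.
Proposes a diagnosis-to-alignment framework in which a VLM acts as the SDG detector and BoxFlow-GRPO converts predicted defect sets into importance-weighted spatial rewards, which the report describes as the first spatially dense advantage in RL for diffusion models.
Shows that the SDG detector surpasses proprietary VLMs such as GPT-4V on structured defect grounding and that SDG-guided rewards improve T2I alignment and support localized image refinement.

Methodology: The technical route has three parts. (1) Dataset construction: prompts are drawn from Pick-a-Pic and four T2I generators (FLUX.2-dev, Z-Image-Turbo, LongCat-Image, SANA-1.5) produce 30,096 images at 1024×1024; 112 annotators provide box-level annotations, type labels and Chinese descriptions, followed by two rounds of review; Gemini 3 Pro is then used for description expansion, reasoning-trace distillation and importance scoring. (2) SDG detector training: two stages, first supervised fine-tuning (SFT, with coordinate jitter) and then reinforcement learning with GRPO (using a format-gated composite reward). (3) Alignment framework: BoxFlow-GRPO converts the defect boxes output by the SDG detector into a spatial reward map, combined with importance weights, for post-training alignment of diffusion models; it also supports defect-guided image refinement (box overlays and textual feedback are passed to GPT-Image-1.5).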
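To make the idea of a box-derived, importance-weighted spatial reward concrete, here is a minimal sketch. The report does not give the actual BoxFlow-GRPO formula, so the rule below (each pixel is penalized by the largest importance of any defect box covering it) is an assumption chosen for illustration, as are the function name and the half-open pixel-box convention.

```python
def spatial_reward_map(height, width, defects):
    """Build an H x W penalty map from ((x0, y0, x1, y1), importance) pairs.

    Boxes are half-open pixel ranges [x0, x1) x [y0, y1). Each covered cell gets
    minus the largest importance of any box covering it (an assumed rule)."""
    reward = [[0.0] * width for _ in range(height)]
    for (x0, y0, x1, y1), importance in defects:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                reward[y][x] = min(reward[y][x], -importance)
    return reward

# Two hypothetical predicted defects on a 4 x 4 grid, overlapping in one cell.
defects = [((0, 0, 2, 2), 0.5), ((1, 1, 3, 3), 1.0)]
reward_map = spatial_reward_map(4, 4, defects)
scalar_reward = sum(map(sum, reward_map))
print(scalar_reward)
```

Unlike a single scalar score, the map tells the optimizer which regions caused the penalty, which is the property the report refers to as a spatially dense signal.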

Key Results:

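The SDG detector outperforms proprietary VLMs such as GPT-4V and Gemini 3 Pro on structured defect grounding, improving BoxF1 over them by about 10% and 15%, respectively.
In SDG-30K, 25.1% of images have no defects, 46.3% contain only artifacts, 5.4% contain only misalignments, and 23.2% contain both; the SANA-1.5 generator has the highest artifact frequency (3.22 per image on average).
Inter-annotator BoxF1 is 0.278 (artifacts) and 0.409 (misalignments), which serves as an upper reference for localization performance.
SDG-guided BoxFlow-GRPO rewards outperform scalar-reward methods on T2I alignment metrics (such as CLIP score and aesthetic score) and support localized image refinement.
Ablations show that the GRPO training stage clearly improves detection accuracy and that coordinate jitter and the format-gated reward both contribute positively.

Tech Stack:

T2I generators: FLUX.2-dev, Z-Image-Turbo, LongCat-Image, SANA-1.5
Vision-language models (VLMs): Gemini 3 Pro (data augmentation), Qwen2.5-VL / Qwen3-VL (detector base)
Reinforcement learning algorithm: GRPO (Group Relative Policy Optimization)
Diffusion model alignment: BoxFlow-GRPO (a flow-matching-based GRPO variant with spatial rewards)
Evaluation metrics: DetTypeF1, ClnAcc, BoxF1 (at IoU thresholds including 0.5), DescCos (using Qwen3-Embedding-0.6B), ImpAcc
Matching algorithm: class-aware Hungarian matching (IoU-based)
Training strategy: SFT (with coordinate jitter), GRPO (format-gated composite reward)
Image refinement: GPT-Image-1.5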
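The class-aware, IoU-based matching behind a box-level F1 can be sketched as follows. This is not the SDG-Eval implementation: instead of the Hungarian algorithm it uses an exhaustive search over one-to-one assignments (practical only for a handful of defects per image), it maximizes the number of valid matches rather than total IoU, and the 0.5 threshold, function names and example boxes are assumptions for illustration.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_count(preds, gts, thr):
    """Max number of one-to-one (pred, gt) pairs with the same type and IoU >= thr."""
    if len(preds) > len(gts):
        preds, gts = gts, preds  # the match criterion is symmetric
    best = 0
    for perm in permutations(range(len(gts)), len(preds)):
        hits = sum(
            1 for i, j in enumerate(perm)
            if preds[i][1] == gts[j][1] and iou(preds[i][0], gts[j][0]) >= thr
        )
        best = max(best, hits)
    return best

def box_f1(preds, gts, thr=0.5):
    """F1 over matched defects; each item is ((x0, y0, x1, y1), defect_type)."""
    if not preds and not gts:
        return 1.0
    tp = match_count(preds, gts, thr)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(preds), tp / len(gts)
    return 2 * precision * recall / (precision + recall)

preds = [((0, 0, 10, 10), "artifact"), ((20, 20, 30, 30), "misalignment")]
gts = [((0, 0, 10, 8), "artifact"), ((20, 20, 30, 30), "artifact"), ((50, 50, 60, 60), "misalignment")]
score = box_f1(preds, gts)
print(score)
```

In the example the second prediction overlaps a ground-truth box perfectly but has the wrong type, so it does not count; that is the effect of making the matching class-aware.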

Strengths:

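Proposes a novel structured defect representation that overcomes the representational bottleneck of heatmap regression, giving precise spatial localization bound to semantic reasons.
Builds a large-scale, high-quality, multi-generator dataset with box-level annotations, filling a gap in instance-level joint annotation.
Designs a complete diagnosis-to-alignment framework that couples structured defect detection with diffusion model post-training for spatially dense RL alignment.
Thorough experiments compare several proprietary VLMs and baselines and support the effectiveness of SDG for both detection and alignment.
Code, models and the dataset are open-sourced, which eases reproduction and follow-up research.

Limitations:

The dataset covers only four T2I generators and may not represent all mainstream models (such as SDXL or DALL-E 3).
Defect types are limited to artifacts and misalignments, without finer subtypes (such as text errors or geometric distortion).
Importance scores are generated automatically by Gemini 3 Pro, may be biased, and are not fully calibrated against human ratings.
The SDG detector is VLM-based and slow at inference, which makes real-time feedback difficult.
The alignment effect of BoxFlow-GRPO is verified only on a limited set of metrics, without large-scale user preference studies.

Relevance To Keywords:

Native multimodal large models: The paper uses VLMs (Qwen2.5-VL and others) as the SDG detector, an application of multimodal large models to fine-grained visual understanding.
Unified understanding and generation in multimodal large models: The SDG framework couples understanding (defect detection) with generation (diffusion model alignment) in a closed loop.
Representation learning: The structured defect tuple (location, type, reason, importance) encodes image defects as an interpretable instance-level representation.
World models: The paper does not directly involve world models; defect grounding is at most indirectly related, as a local diagnosis of the generated image.
Reinforcement learning: BoxFlow-GRPO applies GRPO to diffusion model post-training, a typical use of RL for generative model alignment.
Post-training: SDG-guided rewards are used for post-training of diffusion models to improve generation quality.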

21. OneReason Technical Report PASS

Score: 48.0 / 27.8

Authors: OneRec Team, Biao Yang, Boyang Ding, Chenglong Chu, Dunju Zang, Fei Pan, Han Li, Hao Jiang, Honghui Bao, Huanjie Wang, Jian Liang, Jiangxia Cao, Jiao Ou, Jiaxin Deng, Jinghao Zhang, Kun Gai, Lu Ren, Peiru Du, Pengfei Zheng, Rongzhou Zhang, Ruiming Tang, Shiyao Wang, Siyang Mao, Siyuan Lou, Teng Shi, Wei Yuan, Wenlong Xu, Xingchen Liu, Xingmei Wang, Xinqi Jin, Yan Sun, Yan Wang, Yifei Hu, Yingzhi He, Yufei Ye, Yuhao Wang, Yunhao Zhou, Yuqin Dai, Zhao Liu, Zhipeng Wei, Zhixin Ling, Ziming Li, Zixing Zhang, Ziyuan Liu, An Zhang, Changxin Lao, Chaoyi Ma, Chengru Song, Defu Lian, Fan Yang, Guowang Zhang, Hao Peng, Jiayao Shen, Jie Chen, Jun Xu, Junmin Chen, Kun Zhang, Kuo Cai, Mingxing Wen, Minmao Wang, Minxuan Lv, Qi Zhang, Qiang Luo, Sheng Yu, Shijie Li, Shijie Yi, Shuang Yang, Shugui Liu, Shuni Chen, Tinghai Zhang, Tingting Gao, Xiang Wang, Xiangyu Wu, Xiangyu Zhao, Xiao Lv, Xiaoyou Zhou, Xuming Wang, Yong Du, Zejian Zhang, Zhaojie Liu, Zhiyang Zhang, Zhuang Zhuang, Ziqi Wang, Ziyi Zhao

Published: 2026-06-04

TL;DR: OneReason enhances reasoning in generative recommendation models by grounding itemic tokens in language semantics and employing a cognition-enhanced Chain-of-Thought format with specialized-unify RL training.

Abstract (Chinese translation)

OneRec 系列的生成式推荐模型已广泛应用于短视频、直播、广告和电子商务等多种实际服务中。然而，这些生成式模型仅能从规模优势中受益，其推理能力难以被激活，因为我们无法仅由物品级 Token 构建有意义的思维链（CoT）序列。受大语言模型（LLM）领域中“先思考后回答”推理范式成功的启发，我们进行了初步研究（即 OneRec-Think、OpenOneRec），以探索生成式推荐中的推理能力。然而，我们观察到一个意外现象：思考模式并未显示出优于非思考模式的优势。借鉴多模态语言模型中关于 CoT 鲁棒性的最新发现，我们认为有效的推荐推理依赖于两个因素：感知（将物品级 Token 锚定至其底层语言语义的能力）和认知（将用户行为序列重组为连贯的潜在兴趣点的能力）。因此，我们提出了 OneReason，它包括：（1）预训练中的强物品级 Token 感知，（2）监督微调（SFT）中用于推荐任务的三层认知增强 CoT 格式，以及（3）强化学习（RL）中用于增强思考能力的先分化后统一训练策略。

Abstract

Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style "think before answer" paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	8.0/10	12.0
Tokenizer	1.5	6.0/10	9.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	4.0/10	6.0

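Scoring rationale: The paper's core aim is to improve the reasoning ability of generative recommendation models. 'Unify Models' is highly relevant because the paper explicitly proposes a 'specialize-then-unify training recipe'; 'Tokenizer' is moderately relevant because 'itemic tokens' and their semantic grounding are central; 'MLLM' and 'MultiModal' are moderately relevant because the work draws on findings about CoT robustness in multimodal language models; 'Visual Encoder' and 'World Models' have low relevance, as no visual encoder or world model architecture is involved; 'model-based RL' has low-to-moderate relevance, since RL is used but is not stated to be model-based. The author list does not include the designated experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie or Manyuan Zhang.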

Keywords

Generative Recommendation, Chain-of-Thought, Itemic Tokens, Perception and Cognition, Specialize-then-Unify, Reinforcement Learning, Reasoning Capability, Large Language Models

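In-depth analysis

Chinese Title: OneReason技术报告

Summary: This report presents OneReason, a generative recommendation foundation model designed to reason. Generative recommendation models struggle to activate reasoning; drawing on studies of CoT robustness in multimodal large language models, the authors argue that effective reasoning depends on two factors: perception (aligning itemic tokens with language semantics) and cognition (reorganizing a user's behavior sequence into coherent latent interest points). The method has three stages. Pre-training uses a coarse-to-fine alignment corpus to strengthen itemic token perception. SFT uses a three-level cognition-enhanced CoT format (listed in this report as R0 perception, R1 derivation, R2 evolution, R3 recommendation). RL follows a specialize-then-unify strategy: single-domain RL first unlocks the advantage of the thinking mode, then rejection sampling fine-tuning or multi-teacher on-policy distillation balances across domains. Experiments show that on several real business benchmarks OneReason's thinking mode clearly outperforms its non-thinking mode, and that CoT supervision data can also improve non-thinking inference. OneReason-8B and OneReason-0.8B are open-sourced to support further research.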

Innovations:

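Identifies two pillars of reasoning in recommendation, perception (aligning itemic tokens with semantics) and cognition (structured CoT reasoning), and designs the training framework around them.
Designs a three-level cognition-enhanced CoT format (R0 perception, R1 derivation, R2 evolution, R3 recommendation) that yields coarse-to-fine reasoning traces.
Adopts a specialize-then-unify RL strategy: single-domain RL first activates the advantage of the thinking mode, then rejection sampling fine-tuning or multi-teacher on-policy distillation optimizes across domains.
Finds that CoT supervision data improves recommendation performance in non-thinking mode, providing behavioral evidence about CoT transfer effects.
Builds OneReason-Bench, a reasoning-oriented recommendation benchmark for systematically evaluating reasoning in recommendation scenarios.

Methodology: The overall pipeline has three stages: pre-training, supervised fine-tuning and reinforcement learning. Pre-training: a coarse-to-fine alignment corpus is collected to align the newly added itemic tokens with text tokens in semantic space. SFT: a CoT corpus with the standard thinking structure (R0-R3) is built, covering reasoning data for perception, derivation, evolution and recommendation. RL: single-domain RL (for example PPO or GRPO) is first run for each domain to unlock the advantage of the thinking mode, followed by rejection sampling fine-tuning (RFT) or multi-teacher on-policy distillation (MTOP) for cross-domain balancing and refinement. A dedicated itemic tokenizer and instruction data are also designed.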

Key Results:

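OneReason-8B reaches state-of-the-art performance on several real-world recommendation benchmarks, with the thinking mode clearly ahead of the non-thinking mode.
With an equal number of training tokens, replacing part of the non-CoT data with CoT supervision data improves non-thinking recommendation performance in several domains.
The model retains general capabilities while showing a thinking advantage on recommendation tasks.
OneReason-8B and OneReason-0.8B are open-sourced to support research on generative recommendation.

Tech Stack:

Transformer architecture
Large language model (LLM)
Supervised fine-tuning (SFT)
Reinforcement learning (RL, including PPO/GRPO)
Rejection sampling fine-tuning
Multi-teacher on-policy distillation
Chain-of-Thought (CoT)
Itemic tokenizer
Pre-training alignment corpus (coarse-to-fine alignment)
OneReason-Bench reasoning benchmark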

Strengths:

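Addresses the activation of reasoning in generative recommendation models systematically, with a clear conceptual framework (perception plus cognition).
Validated in an industrial setting (Kuaishou), which gives it practical commercial value.
Open-sources models and part of the materials, supporting academic research and industrial use.
Designs a comprehensive reasoning benchmark and evaluation protocol that later work can compare against.
Observes a transfer effect of CoT supervision on the non-thinking mode, offering a new angle on the reasoning mechanism.

Limitations:

The mechanism by which CoT supervision improves the non-thinking mode is unclear; only behavioral evidence is given, without separating compression, reasoning or interaction effects.
Under mixed-domain RL the thinking mode falls behind the non-thinking mode and needs a special training strategy, which may add training complexity.
The method relies on a carefully built coarse-to-fine alignment corpus and CoT data, which are costly to construct.
Experiments are based mainly on Kuaishou business data, so generality needs further validation.

Relevance To Keywords:

Native multimodal large models: The paper treats itemic tokens as a modality and achieves perception through pre-training alignment, consistent with the modality alignment idea of native multimodal large models.
Representation learning: The perception stage aligns itemic tokens with semantic representations, and the cognition stage uses CoT to reorganize representations of user interest.
Reinforcement learning: The RL stage is a core component, using the specialize-then-unify strategy, and is closely tied to RL techniques in post-training.
Post-training: SFT and RL both belong to post-training, and the paper designs post-training data and methods in detail to activate reasoning.
World models: User interest modeling in recommendation could be viewed as a kind of world model, but the paper does not mention a world model framework, so relevance is weak.
Unify Models, Model-Based RL: The paper does not address architectural model unification or model-based RL; the "unify" in specialize-then-unify refers to the RL training recipe, and RL is used without being described as model-based.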

22. LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations PASS

Score: 48.0 / 27.8

Authors: Mritula Chandrasekaran, Sanket Kachole, Jarik Francik, Dimitrios Makris

Published: 2026-06-04

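TL;DR: To address the scarcity of pathological gait data, this paper proposes an LLM-guided framework that uses a dedicated tokeniser to synthesize pathological gait data from text descriptions; combined with real data, the synthetic sequences improve gait classification accuracy for recurrent classifiers.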

Abstract (Chinese translation)

由于隐私、招募、成本及运动变异性等因素，病理性步态数据集仍然稀缺。本文提出了一种基于多模态大语言模型（LLM）引导的框架，能够从结构化文本描述中生成病理感知的 3D 步态数据。该方法生成固定长度的基于骨架的合成步态序列，适用于病理性步态分类任务。该框架融合了运动标记化、病理感知语言条件化、基于大语言模型的语义增强以及语言到步态生成。关键贡献在于所提出的病理标记器，旨在离散表示学习过程中保留病理特异性运动特征。实验表明，当与真实数据结合时，所提出的合成序列能够提升循环分类器的下游分类性能。最佳结果由使用真实和合成样本训练的 GRU（门控循环单元）分类器获得，在留一受试者交叉验证协议下达到了 92.77% 的准确率。

Abstract

Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The proposed method generates fixed-length synthetic skeleton-based gait sequences for pathological gait classification tasks. The framework combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation. A key contribution is the proposed pathological tokeniser, which is designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that the proposed synthetic sequences improve downstream classification for recurrent classifiers when combined with real data. The best result is obtained using a GRU classifier trained with real and synthetic samples, achieving 92.77% accuracy under a leave-one-subject-out protocol.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	9.0/10	13.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

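Scoring rationale: The core contributions are the pathological tokeniser and LLM-guided gait generation, so 'Tokenizer' (9.0) and 'MultiModal' (8.0) are highly relevant. The work unifies language and motion representations, so 'Unify Models' (4.0) is moderately relevant. An LLM processes the multimodal information, so 'MLLM' (7.0) is relevant. The data are skeleton-based rather than images fed to a visual encoder ('Visual Encoder' 2.0), and the tasks are generation and classification rather than world modeling or reinforcement learning ('World Models' 1.0, 'model-based RL' 1.0).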

Keywords

LLM-Conditioned Synthesis, Pathological Gaits, Gait-Language Representations, Motion Tokenisation, Pathological Tokeniser, 3D Gait Data Synthesis, Downstream Classification

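In-depth analysis

Chinese Title: 基于结构化步态-语言表示的大语言模型条件病理步态合成

Summary: Pathological gait datasets are scarce because of privacy, recruitment, cost and movement variability. This paper proposes a multimodal LLM-guided framework that synthesizes pathology-aware 3D gait data from structured textual descriptions. The method combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation and language-to-gait generation. The key contribution is a pathological tokeniser designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that, when combined with real data, the synthetic sequences improve downstream classification for recurrent classifiers. The best result, from a GRU classifier, is 92.77% accuracy under a leave-one-subject-out protocol.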

Innovations:

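Proposes a pathology-aware gait-language synthesis framework aimed specifically at the scarcity of pathological gait data.
Designs a dedicated pathological tokeniser that preserves pathology-specific biomechanical features (such as range of motion and gait asymmetry) in discrete representation learning.
Combines statistical priors for each pathology class with LLM fine-tuning to achieve pathology-conditioned semantic generation.
Introduces a biomechanically constrained decoder in language-to-gait generation so that synthetic sequences remain pathologically plausible.

Methodology: The framework consists of: 1) a PoseEncoder that encodes 3D joint coordinates into latent representations; 2) tokenisation through three branches (spatial, temporal and pathological) that are fused into unified gait tokens; 3) a gait-to-language (G2L) module that maps the tokens to language-compatible representations; 4) fine-tuning of the LLM (GPT-2) with pathology priors and class labels; 5) structured prompts that generate pathology-conditioned language tokens; 6) a language-to-gait (L2G) module and a biomechanically constrained decoder that reconstruct 3D gait sequences.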

Key Results:

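The GRU classifier reaches 92.77% accuracy on real plus synthetic data, compared with 91.08% on real data alone.
The LSTM classifier improves from 88.67% to 89.23%, while the CNN classifier drops from 90.17% to 87.97%.
Compared with MotionGPT (90.26%) and Qwen-5B (79.86%), the proposed method (92.77%) gives the best classification accuracy.

Tech Stack:

GPT-2 (base LLM)
GRU, LSTM, CNN (classifiers)
PoseEncoder (3D joint encoder)
Three-branch tokeniser (spatial, temporal, pathological)
Gait-to-language (G2L) module
Language-to-gait (L2G) module
Biomechanically constrained decoder
Leave-one-subject-out (LOSO) evaluation protocol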
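The leave-one-subject-out protocol holds out every sample from one subject per fold, so a classifier is never tested on a person it saw during training. A minimal sketch follows; the subject IDs are made up, and the remark about where synthetic samples go is an assumption, since the report does not spell out that detail.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Hypothetical subject ID for each real gait sequence in a small dataset.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(subjects))
for held_out, train, test in folds:
    # Assumption: synthetic sequences would be appended to the training split only.
    print(held_out, train, test)
```

The reported accuracy would then be aggregated over folds; with few subjects the per-fold variance can be large, which is one reason the missing significance tests noted under Limitations matter.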

Strengths:

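Offers a dedicated solution to pathological gait synthesis, a specific and data-scarce problem.
The pathological tokeniser is well motivated and captures key pathological biomechanical features.
LLM-conditioned generation provides controllable diversity in pathological gaits.
The experiments compare several classifiers and include baselines against general-purpose methods.

Limitations:

Validated on a single pathological gait dataset only, so generalization remains to be tested.
The biomechanical realism of the synthetic data has not been checked by experts or clinically validated.
The CNN classifier's performance drops, indicating that synthetic data can hurt some models.
No statistical significance tests are reported, so the reliability of the gains needs further confirmation.

Relevance To Keywords:

Unify Models: The paper unifies gait generation with a language model, combining understanding and generation.
World Models: Generating pathological gait sequences with an LLM could loosely be viewed as one component of a motion world model.
Representation Learning: The pathological tokeniser is a purpose-built discrete representation learning method.
Model-Based RL: No direct connection; the paper does not involve reinforcement learning, and the synthetic data are used only to train downstream classifiers.
Native multimodal large models: The framework integrates two modalities, gait (3D skeleton) and text (language descriptions).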

23. Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo PASS

Score: 46.5 / 27.8

Authors: Renjith Prasad, Chathurangi Shyalika, Anushka Pawar, Amit Sheth

Published: 2026-06-04

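TL;DR: The paper proposes a layered framework for infusing structured knowledge into multimodal iterative generative models; cumulative surface and trajectory-latent interventions reduce knowledge-violating outputs by 70.97%.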

Abstract (Chinese translation)

多模态生成模型（Multimodal generative models）虽能产生流畅的输出，但在生成过程需遵循结构化、领域特定或安全关键知识时，仍不可靠。现有方法通过提示增强（prompt augmentation）、引导（guidance）、潜在编辑（latent editing）或微调（fine-tuning）等机制来融入知识，但这些方法通常按技术类型分类，而非按其修改的生成过程组件分类。我们认为，迭代生成模型（iterative generative models）中的知识注入本质上是一个干预层（intervention-layer）问题。由于生成过程展开为内部状态的轨迹，知识可作用于该过程的四个结构上不同的组件：输入/输出边界（input/output boundary）、转移函数（transition function）、中间状态（intermediate state）以及模型参数（model parameters）。这对应于四种干预层：表面注入（surface infusion）、轨迹注入（trajectory infusion）、潜在注入（latent infusion）和参数注入（parametric infusion）。我们将该框架实例化在扩散模型（diffusion models）中，将代表性方法映射至所有四层，并推导出多层组合（multi-layer composition）的设计原则。在一个基于多模态知识图谱（multimodal knowledge graph）并使用两个扩散骨干（diffusion backbones）的受控安全对齐实验中，我们累积实现了四层中的三层：表面层（输入侧和输出侧）以及轨迹 - 潜在层（生成中期）。实证结果表明，每一新增层均能解决先前层无法触及的失败类别，相较于原始生成（vanilla generation），违反知识的输出减少了 70.97%，从而实证验证了该框架的互补性预测。

Abstract

Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they modify. We argue that knowledge infusion in iterative generative models is fundamentally an intervention-layer problem. Since the generative process unfolds as a trajectory of internal states, knowledge can act on four structurally distinct components of this process: the input/output boundary, the transition function, the intermediate state, and the model parameters. This maps to four intervention layers: surface, trajectory, latent, and parametric infusion. We instantiate the framework in diffusion models, map representative methods to all four layers, and derive design principles for multi-layer composition. In a controlled safety-alignment experiment using a multimodal knowledge graph with two diffusion backbones, we implement three of the four layers cumulatively, surface (input-side and output-side) and trajectory-latent (mid-generation). We show empirically that each additional layer addresses failure classes that prior layers cannot reach, reducing knowledge-violating outputs by 70.97% compared to vanilla generation and empirically confirming the framework's complementarity prediction.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5

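Scoring rationale: The paper's core is a knowledge infusion framework for multimodal generative models, so 'MultiModal' and 'Unify Models' are highly relevant; 'MLLM' is moderately relevant; 'Visual Encoder' and 'Tokenizer' are not a focus; 'World Models' and 'model-based RL' are weakly related. Weighted total: 46.5, above the dynamic pass mark of 27.8. No designated experts were found in the author list.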

Keywords

Knowledge Infusion, Multimodal Generative Models, Intervention Layers, Diffusion Models, Safety Alignment, Structured Knowledge, Layered Framework

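In-depth analysis

Chinese Title: 知识应注入何处？多模态迭代生成模型中知识注入的分层框架

Summary: Multimodal generative models have trouble respecting structured, domain-specific or safety-critical knowledge during generation. This paper treats knowledge infusion in iterative generative models as an intervention-layer problem. Based on the trajectory structure of generation (input/output boundary, transition function, intermediate state, model parameters), it defines four intervention layers: surface (input/output boundary), trajectory (transition function), latent (intermediate state) and parametric (model weights). The framework is instantiated in diffusion models, existing methods are mapped to each layer, and design principles for multi-layer composition are derived. In a safety-alignment experiment using a multimodal knowledge graph on two frozen diffusion backbones, the authors cumulatively implement the surface layer (input side and output side) and the trajectory-latent layer (mid-generation). Each added layer resolves failure classes that earlier layers cannot handle, and knowledge-violating outputs fall by 70.97%, supporting the framework's complementarity prediction.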

Innovations:

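Formalizes knowledge infusion as an intervention-layer problem in iterative generative models and defines a four-layer framework from four formal components of the generation trajectory (boundary, transition, state, parameters).
Proposes the surface, trajectory, latent and parametric intervention layers, maps existing methods to each layer, and compares them along five operational axes (controllability, interpretability, persistence, compute cost, scope of failure correction).
Derives three design principles for multi-layer knowledge infusion: match the layer to the failure class, compose layers for complementary coverage, and manage interference between layers.
Validates the framework's complementarity in a controlled safety-alignment experiment: each added layer monotonically improves knowledge consistency, ending with a 70.97% reduction in knowledge-violating outputs.

Methodology: The paper combines formal modeling with empirical validation. It first formalizes an iterative generative model as a state trajectory and defines the four intervention layers and their formal operations. It then instantiates each layer in diffusion models and maps representative methods to them (such as RAG, classifier-free guidance, latent editing and fine-tuning). The controlled experiment uses a multimodal knowledge graph as the structured knowledge source and implements three interventions in turn on two frozen diffusion backbones (surface input, trajectory-latent, surface output), evaluating cumulatively how much each layer reduces knowledge-violating outputs and measuring generation quality.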
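Where each layer hooks into an iterative generator can be shown with a toy loop. Everything below is hypothetical: the "model" is a scalar that decays each step, the "knowledge" is the rule that the output must be at least 1.0, and the hook names are invented. The parametric layer is not shown because it would change the model itself rather than wrap it.

```python
def generate(prompt, steps, knowledge, layers):
    """Toy iterative generator showing where each intervention layer acts."""
    if "surface_in" in layers:
        prompt = knowledge["rewrite_prompt"](prompt)   # surface layer, input boundary
    state = float(len(prompt))                         # stand-in for the initial noisy state
    for _ in range(steps):
        update = -0.5 * state                          # stand-in for the frozen model's transition
        if "trajectory" in layers:
            update += knowledge["guidance"](state)     # trajectory layer: modify the transition
        state = state + update
        if "latent" in layers:
            state = knowledge["project"](state)        # latent layer: edit the intermediate state
    output = state
    if "surface_out" in layers and not knowledge["is_consistent"](output):
        output = knowledge["repair"](output)           # surface layer, output boundary
    return output

# Hypothetical knowledge: outputs below 1.0 count as violations.
knowledge = {
    "rewrite_prompt": lambda p: p + " safe",
    "guidance": lambda s: 0.25,
    "project": lambda s: max(s, 1.0),
    "is_consistent": lambda x: x >= 1.0,
    "repair": lambda x: max(x, 1.0),
}
vanilla = generate("cat", 2, knowledge, set())
all_layers = generate("cat", 2, knowledge, {"surface_in", "trajectory", "latent", "surface_out"})
print(vanilla, knowledge["is_consistent"](vanilla), knowledge["is_consistent"](all_layers))
```

In this toy every hook can fix the violation on its own; the paper's claim is the stronger one that, on real diffusion backbones, different layers catch different failure classes, which is why they are composed cumulatively.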

Key Results:

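Proposes a four-layer intervention framework covering the four formal components on which knowledge can act.
In the safety-alignment experiment, three cumulative interventions reduce knowledge-violating outputs by 70.97% relative to vanilla generation.
Each added layer resolves failure classes that the previous layers cannot handle, supporting the framework's complementarity prediction.
Multi-layer composition monotonically improves knowledge consistency while preserving generation quality.

Tech Stack:

Diffusion models (Denoising Diffusion Probabilistic Models)
Multimodal knowledge graph (MMKG)
Retrieval-augmented generation (RAG)
Classifier-free guidance
Latent editing
Post-hoc verification and repair
Safety filters
Formal definitions: generation trajectory equations, consistency predicate C(x, K)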

Strengths:

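The framework is theoretically systematic: intervention layers are defined from the mathematical structure of the generation process rather than by technique.
It covers the full knowledge infusion path from input to output and offers design principles for multi-layer composition.
The framework's predictions are validated empirically, with clear and reproducible results.
Mapping existing methods into one scheme helps researchers see the essential differences and complementarity between techniques.

Limitations:

The experiment implements only three layers (surface, trajectory, latent) and not the parametric layer (which requires retraining), so the full complementarity of the framework is not yet verified.
Experiments are run only on diffusion models; generalization to other iterative generators such as autoregressive and flow models is untested.
The knowledge source is limited to a multimodal knowledge graph; other forms (such as rules or physical laws) are not explored.
The compute cost analysis is qualitative and lacks quantitative measurement.

Relevance To Keywords:

Unify Models: The framework is intended to apply across generator families (such as diffusion and autoregressive models), providing general intervention layers.
World Models: Knowledge infusion can be viewed as bringing world knowledge (such as physical laws or scene structure) into generation, so the framework could support knowledge constraints in world models.
Representation Learning: The latent layer edits intermediate representations directly, which relates to feature editing in representation learning.
Model-Based RL: The trajectory layer modifies the transition function, which is loosely analogous to policy adjustment in model-based reinforcement learning.
Native multimodal large models: The framework targets multimodal generative models and uses knowledge infusion to improve reliability.
Unified understanding and generation in multimodal large models: The framework touches both input understanding (surface layer) and the generation process (trajectory, latent and parametric layers).
Post-training: The parametric layer corresponds to fine-tuning or post-training, which the framework treats as one of the intervention layers.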

24. MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering [PASS]

Score: 46.5 / 27.8

Authors: Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun

Published: 2026-06-04

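TL;DR: MemoryCard organizes long-video content into topic-aware multimodal Memory Cards, improving the accuracy of vision-language models on long-video question answering by up to 21.8% (relative).

Abstract (Chinese Translation)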

长视频问答对视觉 - 语言模型（VLMs）仍然具有挑战性，因为与答案相关的证据通常稀疏、短暂且分散在漫长的视频上下文中。现有的基于帧的方法通过均匀采样、查询感知的帧选择、视觉标记压缩以及自适应分辨率策略来提高效率。然而，它们仍然依赖孤立且破碎的帧作为基本证据单元，限制了视觉 - 语言模型（VLMs）有效捕捉连贯事件级语义的能力。为了解决这一局限性，我们提出了 MemoryCard，一种基于视频记忆的增强框架，该框架将长视频组织成自包含的 Memory Cards（记忆卡）。具体来说，MemoryCard 首先在视频和对齐的语句上执行自我阅读过程，将视频分割成语义连贯的单元，每个单元对应一个独特的主题或事件。对于每个单元，它生成事件级视频摘要并选择代表性的视觉时刻，随后将其渲染为统一的 Memory Cards，用于检索和问答。实验结果表明，MemoryCard 在可比的视觉标记预算下持续改进长视频问答（QA）性能，实现了高达 21.8% 的准确率相对提升。所有代码均可在 https://github.com/NEUIR/MemoryCard 获取。

Abstract

Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	0.0/10	0.0

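Scoring rationale: The paper focuses on a multimodal memory framework for long-video question answering, so it is highly relevant to MultiModal (9) and MLLM (8). It touches on visual-token compression, making Tokenizer (5) and Visual Encoder (6) moderately relevant. It does not address model unification, world models, or reinforcement learning, so those keywords score low (0-3). The weighted total of 46.5 exceeds the pass threshold of 27.8. No designated experts appear in the author list.

Keywords

Long-video Question Answering, MemoryCard, Multi-Modal Clue Compression, Vision-Language Models, Event-level Semantics, Visual-Token Compression, Topic-Aware Segmentation

In-Depth Analysis

Chinese Title: MemoryCard: 面向长视频问答的主题感知多模态线索压缩

Summary: Long-video question answering is challenging for vision-language models (VLMs) because answer-relevant evidence is sparse, transient, and temporally dispersed. Existing frame-centric methods improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies, but they still treat isolated, fragmented frames as the basic evidence unit, which limits a VLM's ability to capture coherent event-level semantics. The paper proposes MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. MemoryCard first performs self-reading over the video and its aligned utterances to segment it into semantically coherent units, each corresponding to one topic or event. It then generates an event-level video gist and selects representative visual moments for each unit, and finally renders them into unified Memory Cards for retrieval and question answering. Experiments show that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, with a relative accuracy gain of up to 21.8%. The code is open source.

Innovations:

Introduces the Memory Card concept: long videos are segmented into semantically coherent event units, and each unit becomes a self-contained multimodal evidence card containing an event-level video gist and representative visual moments, replacing isolated frames as the basic evidence unit.
Designs an adaptive video segmentation method: a VLM self-reads the video and aligned utterances and automatically partitions them into semantic sessions at event, topic, or scene transitions, with no manual annotation.
Achieves high-density multimodal clue compression: the event-level gist (VLM-generated topic plus aligned utterances) and representative visual moments are rendered into Memory Cards in a unified image format compatible with standard image-VLM pipelines.
Introduces post-retrieval adaptive resolution allocation: input resolution is assigned dynamically according to each retrieved Memory Card's relevance to the question, making better use of the visual-token budget.
Achieves significant improvements on multiple long-video QA benchmarks and shows that the gains come mainly from the evidence representation (self-read semantic session construction, high-density card rendering, temporal clue organization) rather than from retrieval alone.

Methodology: MemoryCard uses a two-stage approach. (1) Adaptive video segmentation: a VLM (e.g., LLaVA-NeXT) self-reads the video frames and the corresponding utterance text and, following instructions about event/topic/scene transitions, splits the video into J semantic sessions, each with start and end timestamps. (2) Memory Card construction: for each session, the VLM generates an event-level video gist (including the topic and key utterances), and representative visual moments (e.g., keyframes) are selected from the session; the gist text and visual moments are then rendered into a single Memory Card in a unified image format. At question-answering time, the Memory Card library is first searched by question-card relevance; the retrieved cards are assigned adaptive resolutions by relevance, re-sorted into their original temporal order, and passed to the answering VLM to generate the answer.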
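
The question-answering stage described above (retrieve, allocate resolution by relevance, restore temporal order) can be sketched as follows. This is a hedged illustration only: the `MemoryCard` fields, the word-overlap scorer, and the two resolution values are assumptions made for the example, and a real system would use a learned retriever and the paper's own allocation rule.

```python
from dataclasses import dataclass


@dataclass
class MemoryCard:
    start: float  # session start time in seconds
    end: float    # session end time in seconds
    gist: str     # event-level text summary of the session


def relevance(question, card):
    # Hypothetical scorer: Jaccard overlap between question and gist words.
    q = set(question.lower().split())
    g = set(card.gist.lower().split())
    return len(q & g) / len(q | g) if q | g else 0.0


def select_cards(question, cards, top_k=2, high_res=448, low_res=224):
    """Return (card, resolution) pairs for the top-k cards in chronological order."""
    ranked = sorted(cards, key=lambda c: relevance(question, c), reverse=True)[:top_k]
    if not ranked:
        return []
    best = ranked[0]
    # Assumed allocation rule: the most relevant card gets the high resolution.
    plan = [(card, high_res if card is best else low_res) for card in ranked]
    plan.sort(key=lambda item: item[0].start)  # restore original temporal order
    return plan


cards = [
    MemoryCard(0.0, 10.0, "a man cooks pasta in the kitchen"),
    MemoryCard(10.0, 20.0, "a dog runs in the park"),
    MemoryCard(20.0, 30.0, "the man eats pasta at the table"),
]
plan = select_cards("what does the man eat pasta with", cards)
for card, res in plan:
    print(card.start, card.end, res)
```

Re-sorting by start time after retrieval is what lets the answering model read the selected events in the order they happened, independent of their relevance ranking.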

Key Results:

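On three long-video QA benchmarks (e.g., EgoSchema, ActivityNet-QA, VideoChatGPT), MemoryCard consistently outperforms existing frame-sampling, query-aware frame-selection, and visual-token-compression methods under comparable visual-token budgets.
Relative accuracy improves by up to 21.8%.
Ablations show that the gains come mainly from self-read semantic session construction, high-density Memory Card rendering, and temporal clue organization rather than from retrieval alone.
Memory Cards preserve fine-grained visual detail while maintaining event-level temporal context, supporting efficient long-video understanding.

Tech Stack:

Vision-language models (VLMs): LLaVA-NeXT and others, used for video segmentation and gist generation
Retrieval-augmented generation (RAG) framework: retrieval by similarity between the question and Memory Cards
Adaptive resolution allocation: input resolution adjusted dynamically by relevance
Image rendering: text gists and visual moments merged into a unified image format
Timestamp alignment: aligning video frames with utterance text

Strengths:

Lifts long-video evidence from isolated frames to event-level multimodal cards, markedly increasing semantic density and coherence.
Requires no additional training and is compatible with existing standard image-VLM pipelines, so it is easy to deploy.
Builds topic-aware semantic units automatically through self-read segmentation and gist generation, reducing manual annotation cost.
Achieves significant gains on multiple benchmarks, with ablations that validate each component's contribution.
The code is open source, supporting reproducibility and follow-up research.

Limitations:

Relies on a VLM for video segmentation and gist generation, which may add computational overhead and propagate errors.
Memory Card quality is bounded by the underlying VLM's capability; for complex long videos, event boundaries may not be captured accurately.
The method targets question answering and has not been validated on other video-understanding tasks (e.g., video captioning, temporal grounding).
Retrieval is based only on text similarity and does not fully exploit fine-grained matching in the visual modality.
Experiments cover a limited set of benchmarks and do not thoroughly test very long videos (e.g., several hours).

Relevance To Keywords:

Native multimodal large models: The paper uses a VLM as its core component, an application of native multimodal large models.
Unified understanding and generation in multimodal large models: MemoryCard uses a VLM for video segmentation, gist generation, and question answering alike, combining understanding and generation.
Representation learning: Memory Cards compress a video into compact multimodal representations, which falls under representation learning.
World models: Building video memory from event-level semantic units can be viewed as a simplified form of video world modeling.
Reinforcement learning / post-training: The paper does not directly involve reinforcement learning or post-training, though the method could be combined with post-training for further optimization.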

25. Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models [PASS]

Score: 45.0 / 27.8

Authors: Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo

Published: 2026-06-04

TL;DR: This paper proposes RandomBench to evaluate stochastic collapse in MLLMs, revealing that these models fail to maintain uniform randomness when selecting among equivalent options in logic-neutral scenarios.

Abstract (Chinese Translation)

当前对多模态大语言模型（MLLMs）的评估压倒性地聚焦于效用驱动的目标，使得逻辑中立场景下的模型行为在很大程度上仍未得到充分探索。随机性在多个动作同样有效的场景中至关重要，例如推荐旅行行程或日常日程，其中多个选项具有相似的效用。在此类设置下，确定性策略可能导致重复性行为，并降低对有效替代方案的覆盖范围。为了弥合这一差距，我们提出 RandomBench，这是一个旨在评估 MLLMs 在等效选项之间选择时能否保持分布中立行为的基准测试。此外，我们还引入了三个指标（RI、BCI、BII），用于量化熵和分布偏差。实验揭示了一种普遍存在的现象，称为“随机坍缩”（Stochastic Collapse），即 MLLMs 在明确的随机指令下未能保持均匀随机性，其 top-1 概率从理想的四分之一基线飙升至 97%，且在 Claude Sonnet 4.6 中 RI 值降至 0.068。广泛的消融实验进一步表明，这些偏差在多种语言和表示格式中依然存在，凸显了分布坍缩在逻辑中立决策设置中的鲁棒性。

Abstract

Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, including RI, BCI, BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% from the ideal one quarter baseline and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	10.0/10	15.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	2.0/10	3.0

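Scoring rationale: The paper explicitly focuses on Multimodal Large Language Models (MLLM) and MultiModal evaluation, hence high scores for these keywords. Other keywords like World Models, model-based RL, Tokenizer, and Unify Models are not central to the study's contribution regarding stochastic collapse and bias evaluation. No listed expert authors appear in the paper's author list, so no bonus points were added.

Keywords

Stochastic Collapse, Implicit Bias, MLLM, RandomBench, Distributional Bias, Evaluation Benchmark, Multimodal

In-Depth Analysis

Chinese Title: 评估多模态大语言模型中的随机坍缩与隐式偏差

Summary: This paper proposes RandomBench, a benchmark for evaluating the random-choice behavior of multimodal large language models (MLLMs) in logic-neutral scenarios. Existing evaluations mostly target utility-driven objectives and overlook how models behave when choosing among equivalent options. RandomBench contains 200 instances (RB-Text and RB-Vision) and uses repeated sampling (50 samples per instance) with entropy-based metrics: the Randomness Index (RI), the Bias Intensity Index (BII), and the Bias Consistency Index (BCI). Experiments reveal a pervasive phenomenon, Stochastic Collapse: models fail to maintain a uniform distribution even when explicitly asked to choose at random, with Claude Sonnet 4.6 reaching a top-1 probability of 97% and an RI of 0.068. Ablations show that the bias persists across languages and representation formats, revealing a systematic distributional collapse in logic-neutral decision making.

Innovations:

Proposes a logic-neutral random-choice diagnostic setting for evaluating endogenous behavioral bias in MLLMs.
Builds the multimodal benchmark RandomBench, covering text and vision modalities, with a repeated-sampling protocol.
Introduces entropy-based metrics (RI, BII, BCI) that quantify non-uniform random-choice behavior.
Identifies and names the Stochastic Collapse phenomenon, showing that models hold stable, biased, and rationalized preferences among equivalent options.
Shows through cross-lingual and cross-format ablations that the bias is an endogenous behavioral prior rather than a surface lexical artifact.

Methodology: The paper proceeds as follows: (1) it builds a logic-neutral dataset of 200 manually designed instances, split into RB-Text (abstract symbols, linguistic function, spatial perception, affective/social identity) and RB-Vision (perceptual saliency, spatial layout, affective/social, cross-modal hijacking); (2) it applies strict filtering so that the options are strictly equivalent; (3) it samples each instance 50 times to collect the model's output distribution; (4) it computes three metrics: RI (ratio of the observed entropy to the uniform entropy), BII (ratio of the maximum deviation to the uniform deviation), and BCI (consistency across repetitions); (5) it runs ablations, including language substitution and format changes.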
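
The RI definition given above (observed entropy divided by the entropy of a uniform choice) can be computed from repeated samples as in the sketch below. It follows only that one-line description; the function name and the use of base-2 logarithms are assumptions, and the paper's exact formula, as well as BII and BCI, may differ in detail.

```python
import math
from collections import Counter


def randomness_index(choices, num_options):
    """Normalized entropy of the sampled choices: 1.0 is uniform, 0.0 is fully collapsed.

    `choices` is the list of options a model picked over repeated samples of one
    instance; `num_options` (>= 2) is how many equivalent options were offered.
    """
    n = len(choices)
    counts = Counter(choices)
    # Shannon entropy in bits; options that were never chosen contribute zero.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(num_options)


uniform_samples = ["A", "B", "C", "D"] * 12           # 48 perfectly balanced picks
collapsed_samples = ["A"] * 50                        # always the same option
skewed_samples = ["A"] * 47 + ["B", "C", "D"]         # 50 picks, heavily skewed

print(randomness_index(uniform_samples, 4))
print(randomness_index(collapsed_samples, 4))
print(randomness_index(skewed_samples, 4))
```

Because the index is normalized by the uniform entropy, values are comparable across instances with different numbers of options.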

Key Results:

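MLLMs broadly exhibit stochastic collapse on random-choice tasks and fail to maintain a uniform distribution.
Claude Sonnet 4.6 reaches a top-1 probability of 97% with an RI of only 0.068 (ideal: 1).
The vision modality additionally shows "visual hijacking", where perceptual cues override the random-choice instruction.
Stronger, better-aligned models may exhibit more pronounced collapse, accompanied by fluent post-hoc rationalization.
The bias persists under option-label substitution and cross-lingual prompts, indicating an endogenous behavioral prior.

Tech Stack:

Information entropy theory (used to define RI, BII, BCI)
Repeated-sampling protocol (50 samples per instance)
Logic-neutral filtering (removing culturally, politically, or historically sensitive terms)
Stroop-like conflict design (cross-modal hijacking)
Taxonomy dimensions: abstract symbols, linguistic function, spatial perception, affective/social, perceptual saliency, spatial layout, and others

Strengths:

First systematic evaluation of MLLM random-choice behavior in logic-neutral scenarios, filling a gap.
The multimodal benchmark covers text and vision and includes a cross-modal hijacking dimension.
The metrics are grounded in information entropy, making them quantitative and interpretable.
The discovery of stochastic collapse has theoretical and practical significance (e.g., for autonomous agents and recommender systems).
Thorough ablations support the robustness and endogenous nature of the bias.

Limitations:

The benchmark contains only 200 instances, a small scale that may not cover all bias types.
The mechanistic causes of stochastic collapse (e.g., pretraining data distribution, architecture design) are not investigated in depth.
Only a limited number of MLLMs are evaluated, not including the latest models.
No concrete method for mitigating stochastic collapse is proposed.
Fifty repeated samples may be too few to estimate the distribution stably, especially in high-entropy cases.

Relevance To Keywords:

Native multimodal large models: The paper directly evaluates the random behavior of MLLMs, which relates to native multimodal models.
Representation learning: Stochastic collapse reveals implicit bias in the model's internal representations, affecting their diversity.
World models: The spatial-perception and vision dimensions probe bias in the model's internal world model of space and physical attributes.
Post-training: Stochastic collapse may stem from alignment or reinforcement learning during post-training; the paper offers a new angle for post-training evaluation.
Reinforcement learning: Stochastic policies are essential in reinforcement learning, and the collapse identified here affects RL-based autonomous agents.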

26. Towards One-to-Many Temporal Grounding [PASS]

Score: 42.0 / 27.8

Authors: Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

Published: 2026-06-04

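TL;DR: This paper addresses the weakness of multimodal LLMs on One-to-Many Temporal Grounding with temporal and Chain-of-Thought-based caption reward functions, setting a new state of the art on the OMTG benchmark.

Abstract (Chinese Translation)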

时序定位（Temporal Grounding, TG）旨在定位与文本查询相对应的视频片段。以往研究主要侧重于单片段检索。然而，现实场景通常需要对单个查询定位多个不连续片段，我们将此设定称为一对多时序定位（One-to-Many Temporal Grounding, OMTG）。先前针对一对一设置优化的最新多模态大语言模型（MLLMs）在此背景下表现不佳，往往因缺乏事件基数感知而得分接近零。为弥合这一差距，我们提出了一种系统性解决方案，包含三个关键贡献。首先，我们建立了首个全面的 OMTG 基准，引入计数准确率（Count Accuracy, C-Acc）和有效时序 F1（Effective Temporal F1, EtF1）作为评估指标。其次，我们通过一套复杂的构建流程整理了一个包含 5.6 万样本的高质量 OMTG 数据集。第三，我们开发了专门针对 OMTG 的新型时序奖励函数和字幕奖励函数。特别是，字幕奖励利用密集视频字幕上的思维链（Chain-of-Thought）推理，明确引导策略优化同时朝向精确性与完整性。大量实验表明，我们的模型在 OMTG 基准上实现了新的最先进（SOTA）EtF1 得分 43.65%，分别比 Gemini 2.5 Pro 和 Seed-1.8 高出 15.85% 和 15.61%。

Abstract

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	4.0/10	6.0

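Scoring rationale: The paper's core is improving multimodal LLMs (MLLMs) on the video-text alignment task of Temporal Grounding, specifically in the One-to-Many setting, so MLLM and MultiModal are highly relevant. Although it uses reward functions and policy optimization (RL terminology), it does not engage the core mechanisms of World Models or model-based RL, nor does it make specific contributions to Unify Models or Tokenizer. A Visual Encoder is present as an implicit component of video processing but is not a core innovation. The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is added.

Keywords

One-to-Many Temporal Grounding, Temporal Grounding, MLLM, Chain-of-Thought, Reward Functions, Video-Text, Benchmark

In-Depth Analysis

Chinese Title: 迈向一对多时间定位

Summary: Conventional Temporal Grounding handles only one video segment per query. This paper introduces One-to-Many Temporal Grounding (OMTG), which requires the model to localize every disjoint segment in which a query occurs. Existing state-of-the-art multimodal LLMs (MLLMs) perform very poorly on this task, mainly because they lack event-cardinality perception. To fill the gap, the authors make three contributions. First, they establish the first OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as metrics. Second, they curate a high-quality OMTG dataset of 56k samples through a carefully designed construction pipeline. Third, they develop temporal and caption reward functions designed for OMTG; the caption reward uses dense video captions and Chain-of-Thought reasoning to explicitly guide policy optimization toward both preciseness and completeness. Experiments show the model reaches 43.65% EtF1 on OMTG Bench, exceeding Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61% respectively.

Innovations:

First formal definition of One-to-Many Temporal Grounding (OMTG), modeled as a set-generation task within the MLLM framework.
Proposes OMTG-specific metrics, temporal F1 (tF1), Count Accuracy (C-Acc), and Effective Temporal F1 (EtF1), which overcome the shortcomings of conventional tIoU in multi-segment settings.
Builds the first OMTG benchmark (340 manually annotated samples) and a high-quality training set of 56k samples.
Designs a two-stage training strategy (SFT + RL) with a temporal reward and a caption reward based on dense video captions and Chain-of-Thought reasoning, jointly optimizing localization precision and event counting.
Shows experimentally that RL training on OMTG also improves standard one-to-one temporal grounding.

Methodology: The paper uses an MLLM as the base architecture: video frames and a text query are the input, and the model directly generates a natural-language response describing time intervals, from which a deterministic parsing function extracts the predicted segments. Training has two stages: supervised fine-tuning (SFT) on the constructed 56k-sample OMTG dataset, followed by reinforcement learning (RL). The RL stage uses two reward functions: the temporal reward directly supervises the IoU between predicted and ground-truth boundaries, and the caption reward uses dense video captions and Chain-of-Thought reasoning to guide the model to understand event structure and produce complete, precise responses. Evaluation uses the Hungarian algorithm to find the optimal matching between predicted and ground-truth segments, then computes tF1, C-Acc, and EtF1.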
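
The evaluation step (optimal one-to-one matching of predicted and ground-truth segments, then an F1 over the matches) can be illustrated as below. Several choices here are assumptions for the sake of a runnable example: the matching is found by brute force over permutations rather than the Hungarian algorithm the paper uses (both give an optimal assignment, but brute force only scales to a handful of segments), and the 0.5 IoU threshold, the F1 aggregation, and the count check are illustrative stand-ins, not the paper's exact tF1, C-Acc, or EtF1 definitions.

```python
from itertools import permutations


def t_iou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def matched_ious(preds, gts):
    """IoUs of the one-to-one matching that maximizes total IoU (brute force)."""
    if not preds or not gts:
        return []
    small, large = (preds, gts) if len(preds) <= len(gts) else (gts, preds)
    candidates = (
        [t_iou(small[i], large[j]) for i, j in enumerate(perm)]
        for perm in permutations(range(len(large)), len(small))
    )
    return max(candidates, key=sum)


def temporal_f1(preds, gts, threshold=0.5):
    """F1 where a matched pair counts as a hit if its IoU reaches the threshold."""
    hits = sum(iou >= threshold for iou in matched_ious(preds, gts))
    precision = hits / len(preds) if preds else 0.0
    recall = hits / len(gts) if gts else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


gts = [(10.0, 20.0), (40.0, 50.0), (80.0, 90.0)]
preds = [(41.0, 50.0), (9.0, 19.0)]   # two good segments, one event missed

f1 = temporal_f1(preds, gts)
count_correct = len(preds) == len(gts)
print(round(f1, 3), count_correct)
```

The example shows why a count-aware metric matters in the one-to-many setting: both predictions are well localized, yet the missed third event lowers recall and fails the count check.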

Key Results:

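The proposed model reaches 43.65% EtF1 on OMTG Bench, exceeding Gemini 2.5 Pro (by 15.85%) and Seed-1.8 (by 15.61%).
Existing open-source MLLMs and conventional TG specialists score near-zero EtF1 on OMTG, indicating a significant capability gap.
RL training improves not only OMTG performance but also standard one-to-one temporal grounding.
In the benchmark, 62.2% of samples contain 2-3 segments and 15% contain more than 6, with video durations ranging from 21 seconds to 17 minutes, making it challenging.

Tech Stack:

Multimodal large language model (MLLM)
Supervised fine-tuning (SFT)
Reinforcement learning (RL)
Hungarian algorithm for optimal matching
Temporal IoU (tIoU) computation
Temporal F1 (tF1), Count Accuracy (C-Acc), and Effective Temporal F1 (EtF1) metrics
Chain-of-Thought (CoT) reasoning
Dense video captions
Deterministic parsing function

Strengths:

The problem is clearly defined and fills the one-to-many gap in temporal grounding.
The metrics are well designed and separate counting errors from localization errors.
The dataset construction pipeline is rigorous, yielding 56k samples across diverse scenarios.
The two-stage strategy combining SFT and RL, with purpose-built reward functions, yields substantial gains.
Comparisons are thorough, covering open-source and closed-source models, and the results are convincing.

Limitations:

The OMTG benchmark has only 340 manually annotated samples, which may be too small to fully assess generalization.
The caption reward depends on the quality of the dense video captions; inaccurate captions may hurt training.
Robustness across video types (e.g., long videos, overlapping events) is not explored in depth.
The method targets the MLLM framework and is not compared against conventional non-MLLM methods.
Deployment efficiency in real applications (e.g., video retrieval, surveillance) is not discussed.

Relevance To Keywords:

Native multimodal large models: The paper uses an MLLM as its core model; highly relevant.
Unified understanding and generation in multimodal large models: The model understands video and text and generates a structured response; relevant.
Representation learning: Joint video-text representation underlies the task, but the paper does not focus on representation-learning mechanisms.
World models: OMTG requires understanding repeated events and temporal structure in video, which can be seen as an application of world models to video understanding; somewhat relevant.
Reinforcement learning: The paper uses RL as a post-training stage with designed reward functions; highly relevant.
Post-training: The two-stage SFT + RL training is post-training; relevant.
Unify Models: The paper does not address unified models (e.g., vision-language-action unification); weakly relevant.
Model-Based RL: The paper uses reward-based RL without explicitly building an environment model; low relevance.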

27. Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder [PASS]

Score: 42.0 / 27.8

Authors: Yoshiyuki Ootani

Published: 2026-06-04

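TL;DR: This paper presents a streaming pipeline built on a distilled U-Net and asymmetric batched inference that achieves video-rate stylization on consumer GPUs with MLLM-conditioned edit diffusion.

Abstract (Chinese Translation)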

扩散 U-Net 的激进蒸馏逆转了实时文本到图像管道的帧级瓶颈：一旦去噪器成为 4 步或 1 步蒸馏学生模型，文本编码器便成为关键路径。这种逆转在视觉感知编辑扩散中最为显著，其中编码器是多模态大语言模型（MLLM）。我们研究了 0.39B 蒸馏编辑 U-Net 与 2.13B MLLM 文本编码器（Qwen3-VL）配对的情况，并提出了一种针对此场景的流式管道，该管道围绕三个工程机制构建：非对称侧流/主流 CUDA 流水线，结合批量文本编码器摊销（以及可选的静态提示缓存）；一种编译友好的 ControlNet-LLLite 重构，将整个 U-Net + 适配器堆栈折叠为单个融合图；以及一个带有钩子子集的周期性条件刷新调度，用于摊销帧级条件成本。在单个消费级 RTX 3090 Ti 上，于 512x512 分辨率下，该管道在批量大小 B=8 时维持 27.4 fps（持续 480 帧运行），在 B=16 时为 29.6 fps，端到端 p50 延迟分别约为 0.5 和 1.0 秒；同一配置下，在 RTX 4090 上测得 54.9 fps，在 RTX 5090 上为 74.1 fps。我们报告的是视频速率流式吞吐量，而非交互式低延迟；我们将数据与相同堆栈的 StreamDiffusion 重运行进行对比，以提供系统上下文，而非作为基准优越性声明。对于训练好的油画风格，所发布的时间适配器在片段内噪声中泛化至 19 个未使用的 DAVIS-2017 序列以及来自七个来源的 15 个非 DAVIS 片段；提示级别泛化至未见风格家族的情况是有界的，并单独报告。

Abstract

Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	9.0/10	13.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

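Scoring rationale: The paper focuses on accelerating video streaming for MLLM-conditioned diffusion; its core contribution is a distilled U-Net combined with an asymmetric batched inference architecture. MLLM and MultiModal are highly relevant; the visual encoder is used as an MLLM component but is not the research focus; reinforcement learning and world models are not involved; the tokenizer is only a low-level component.

Keywords

Video-Rate Streaming, MLLM-Conditioned, Edit Diffusion, Distilled UNet, Asymmetric Batched Inference, Stylization, Vision-Aware

In-Depth Analysis

Chinese Title: 基于视觉感知多模态大语言模型条件编辑扩散的视频速率流式风格化：蒸馏UNet与MLLM文本编码器上的非对称批处理推理

Summary: This paper targets a new bottleneck in video-stream stylization, the "small distilled denoiser + large MLLM text encoder" (SDD-MTE) regime, and proposes an engineering-optimized pipeline for it. Once the diffusion U-Net is distilled to 4 steps or 1 step, the text encoder (here the 2.13B-parameter Qwen3-VL) becomes the per-frame critical path. Working with DreamLite-mobile (0.39B distilled U-Net) and Qwen3-VL (2.13B), the author designs three core mechanisms: asymmetric side-stream/main-stream CUDA pipelining (the MLLM text encoder runs on its own CUDA stream in parallel with the compiled U-Net), a compile-friendly ControlNet-LLLite reformulation (removing Python branches so that torch.compile can fuse the whole adapter stack), and periodic conditioning refresh with hook-subset scheduling (amortizing the per-frame conditioning cost). On a single RTX 3090 Ti at 512×512, the pipeline reaches 27.4 FPS (batch size 8) and 29.6 FPS (batch size 16), with end-to-end p50 latency of about 0.5 s and 1.0 s; it reaches 54.9 FPS on an RTX 4090 and 74.1 FPS on an RTX 5090. The paper reports streaming throughput rather than interactive low latency and positions its numbers against StreamDiffusion as systems context. The trained stylization temporal adapter generalizes to DAVIS-2017 and cross-dataset sequences.

Innovations:

Asymmetric side-stream/main-stream CUDA pipelining: the MLLM text encoder runs on a dedicated CUDA stream concurrently with the compiled U-Net on the main stream, hiding the CPU-GPU synchronization points inside the encoder.
Compile-friendly ControlNet-LLLite reformulation: Python branches and tensor identity tests are removed so that torch.compile fuses the entire U-Net + adapter stack into a single graph, eliminating a slowdown of about 3.5x.
Periodic conditioning refresh with hook-subset scheduling: the cost of the LLLite conditioning-image encoder is spread over multiple batches and Farneback optical flow runs asynchronously, reducing per-frame compute.
Static-prompt caching and N-batch refresh scheduling (optional): for a fixed user instruction and slowly changing source visual content, two approximations (single-frame TE prompting and TE refresh every N batches) further amortize encoder cost.
First video-rate streaming stylization in the SDD-MTE regime, with a systematic quantification of hardware scaling (RTX 3090 Ti/4090/5090).

Methodology: The paper takes an engineering-optimization approach to the asymmetric compute load between the distilled U-Net (0.39B) and the MLLM text encoder (2.13B), with a three-part pipeline: (1) compiling the U-Net with torch.compile for a 3.1x speedup; (2) placing the MLLM text encoder and VAE encoding on a side stream that runs in parallel with U-Net denoising on the main stream; (3) reformulating the ControlNet-LLLite adapter to remove Python branches so that torch.compile can fuse it. It also uses a periodic conditioning-refresh strategy (recomputing the LLLite conditioning-image encoding every N frames) and Farneback optical flow for inter-frame motion compensation. Training uses the post-hoc-blended (α=0.85) base output as the teacher and the warped previous decoded frame as the LLLite conditioning input. All experiments run on PyTorch 2.6.0 + CUDA 12.4 + Triton 3.2 on a single RTX 3090 Ti (with some tests on RTX 4090/5090).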
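
The periodic conditioning-refresh idea (recompute an expensive conditioning encoding only every N frames and reuse it in between) can be sketched in plain Python. This is a schematic of the scheduling logic only, under assumed names: `PeriodicConditioningCache`, `refresh_every`, and `fake_encode` are hypothetical, and the actual pipeline runs the encoder on the GPU with a hook subset rather than a Python callable.

```python
class PeriodicConditioningCache:
    """Recompute a costly conditioning value every `refresh_every` frames."""

    def __init__(self, encode, refresh_every):
        if refresh_every < 1:
            raise ValueError("refresh_every must be at least 1")
        self.encode = encode              # expensive function: frame -> conditioning
        self.refresh_every = refresh_every
        self.calls = 0                    # how many times the encoder actually ran
        self._cached = None

    def get(self, frame_index, frame):
        # Refresh on the first frame and on every N-th frame after that;
        # all other frames reuse the most recent encoding.
        if self._cached is None or frame_index % self.refresh_every == 0:
            self._cached = self.encode(frame)
            self.calls += 1
        return self._cached


def fake_encode(frame):
    # Stand-in for the conditioning-image encoder.
    return f"cond({frame})"


cache = PeriodicConditioningCache(fake_encode, refresh_every=4)
conditions = [cache.get(i, f"frame{i}") for i in range(10)]
print(cache.calls, conditions[5])
```

A fixed N trades conditioning freshness for throughput, which is why the Limitations below note that a fixed schedule may degrade on high-motion content.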

Key Results:

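On an RTX 3090 Ti at 512×512, the pipeline sustains 27.4 FPS at batch size B=8 (480 frames) with end-to-end p50 latency of about 0.5 s, and 29.6 FPS at B=16 with p50 latency of about 1.0 s.
The same configuration reaches 54.9 FPS on an RTX 4090 and 74.1 FPS on an RTX 5090.
The compiled U-Net is 3.1x faster than eager mode; the unoptimized LLLite adapter slows the U-Net by about 3.5x, an overhead the reformulation removes.
The MLLM text-encoder output has a standard deviation of 37 and a maximum of 1160, which causes FP16 attention overflow and rules out the TensorRT path.
The trained temporal adapter generalizes to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips (for the trained style only).
The 4-step distilled model remains acceptable at K=1 inference, though fidelity to the teacher degrades.

Tech Stack:

PyTorch 2.6.0+cu124
Triton 3.2 (Windows port)
torch.compile (reduce-overhead, fullgraph=False, dynamic=False)
CUDA streams (side-stream/main-stream pipelining)
ControlNet-LLLite (108 adapter modules, each with a CNN encoder and a zero-initialized residual)
DreamLite-mobile (0.39B distilled U-Net, TAESDXL VAE, FlowMatchEuler scheduler)
Qwen3-VL (2.13B multimodal LLM text encoder)
Farneback optical flow (OpenCV)
Post-hoc blending (α=0.85)
Static-prompt caching (te_batch_one, te_refresh_every)

Strengths:

Offers a systematic engineering solution to the new SDD-MTE bottleneck, filling a gap left by existing real-time diffusion pipelines (such as StreamDiffusion) when the encoder is an MLLM.
Each of the three mechanisms has a directly measured per-frame effect, and the experiments are rigorous, including hardware-scaling tests and cross-dataset generalization.
Releases the temporal adapter and provides a fair systems comparison with StreamDiffusion (not a benchmark superiority claim); the method may transfer to other SDD-MTE pairings.
Achieves video-rate streaming stylization on a single consumer GPU, which has practical deployment value.
Documents negative results in detail (e.g., TensorRT being unusable, FP16 overflow, LLLite compilation issues) as a reference for later work.

Limitations:

Only the DreamLite-mobile + Qwen3-VL configuration is evaluated; transfer to other SDD-MTE pairings is claimed but not verified.
Style generalization is limited: the temporal adapter targets only the trained oil-painting style, and prompt-level generalization to unseen style families is bounded and reported separately.
The periodic conditioning refresh is simple (fixed N frames) and does not use an adaptive caching strategy (such as AdaCache), so quality may drop on high-motion scenes.
The paper reports streaming throughput, not interactive low latency; a p50 latency of about 0.5-1.0 s is unsuitable for real-time interactive applications.
Inter-frame warping relies on Farneback optical flow, which may introduce blur artifacts (the paper mentions smoothing artifacts).
No user study or subjective quality evaluation is conducted; only objective metrics are reported.

Relevance To Keywords:

Native multimodal large models: The paper uses Qwen3-VL as an MLLM text encoder that directly processes image and text instructions, a native multimodal architecture.
Unified understanding and generation in multimodal large models: DreamLite-mobile's GENERATE and EDIT modes share weights and, combined with the MLLM encoder, unify understanding (instruction following) and generation (stylization).
Representation learning: The MLLM encoder learns multimodal representations through joint attention over visual and text tokens, but the paper does not examine representation-learning mechanisms.
World models: The paper involves video streaming and temporal consistency but builds no explicit world model; motion compensation is handled by optical flow and inter-frame warping.
Reinforcement learning: Not involved. The distillation used (LCM, adversarial distillation) can be viewed as a form of post-training but is unrelated to reinforcement learning.
Post-training: The paper uses a pretrained 4-step distilled model without additional post-training, but discusses the degradation at K=1 inference, which relates indirectly to post-training (e.g., further distillation).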

28. EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models [PASS]

Score: 40.5 / 27.8

Authors: Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi

Published: 2026-06-04

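TL;DR: EasyLens is a training-free plug-and-play method that amplifies subtle-lesion representations, improving subtle-lesion detection in medical vision-language models without retraining them.

Abstract (Chinese Translation)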

医学视觉 - 语言模型（VLMs）在临床图像解读方面展现出日益增长的潜力，涵盖病灶检测与报告生成。然而，其对细微病灶的敏感性不足限制了其实用性，因为此类病灶的视觉证据通常稀疏、对比度低，且嵌入在复杂的解剖学背景中。随着局部视觉标记的聚合，这些微弱的病灶线索在全局图像表示中可能表示不足，导致医学 VLMs 难以识别。现有的提高病灶敏感性的方法主要依赖于医学领域视觉编码器的预训练、临床术语引导的对齐，或可训练的病理表示增强。尽管有效，这些方法通常需要额外的训练或针对模型的特定适配，且可能过拟合于特定的疾病形态，从而限制了其在冻结医学 VLMs 上的适用性。为了解决这些局限性，我们提出 EasyLens，一种面向医学 VLMs 的无需训练的即插即用细微病灶表示放大器。EasyLens 首先构建 EasyBank，这是一个病理 - 解剖原型空间，提供与病灶相关的原型和解剖感知的正常参考，以便将可疑图像块与病理及正常解剖模式进行比较。为避免盲目放大正常组织，EasyTag 通过反事实原型推理选择与病灶相关的图像块。为抵消细微病灶线索在全局图像表示中的稀释效应，EasyAmplifier 通过形态学引导的残差增强强化所选病灶相关图像块的表示，从而增加其对全局图像嵌入的贡献。在多个医学图像数据集及冻结医学 VLM 骨干上的实验表明，EasyLens 提升了细微病灶检测性能，并优于现有的编码器增强基线。

Abstract

Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	7.0/10	10.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

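Scoring rationale: The paper focuses on enhancing subtle-lesion representations in medical vision-language models (MLLM), which falls under MultiModal and visual representation processing (Visual Encoder), so relevance to those keywords is high. It does not address unified model architectures (Unify Models), tokenizer design, World Models, or reinforcement learning (model-based RL), so relevance there is low. No designated expert authors are included. The weighted total is 40.5, above the dynamic pass threshold of 27.8.

Keywords

Medical Vision-Language Models, Subtle-Lesion Representation, Training-Free, Plug-and-Play, Prototype Space, Residual Enhancement, Frozen Backbones

In-Depth Analysis

Chinese Title: EasyLens：一种无需训练、即插即用的医学视觉语言模型细微病变表征放大器

Summary: Medical vision-language models (VLMs) show promise for clinical image interpretation but are insufficiently sensitive to subtle lesions, whose visual cues are sparse, low-contrast, and embedded in complex anatomical context, and are easily diluted during global image aggregation. Existing methods improve lesion sensitivity mainly through medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement, but they usually require additional training or model-specific adaptation and may overfit to particular lesion morphologies, making them hard to apply to frozen medical VLMs. The paper proposes EasyLens, a training-free plug-and-play subtle-lesion representation amplifier. It first builds EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references. EasyTag then selects lesion-relevant patches through counterfactual prototype reasoning, and EasyAmplifier strengthens the selected patches' representations through morphology-guided residual enhancement, increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens consistently improves subtle-lesion detection and outperforms existing encoder-enhancement baselines without any fine-tuning.

Innovations:

Proposes EasyLens, a training-free plug-and-play amplifier that uses prototype reasoning over latent pathological evidence to improve subtle-lesion recognition in frozen medical VLMs.
Builds EasyBank, a pathology-anatomy prototype space that organizes lesion-related prototypes and anatomy-aware normal references to support fine-grained patch-level discrimination.
Designs EasyTag, a counterfactual-prototype-guided patch selector that identifies lesion-relevant regions through fine-grained pathological comparison and avoids blindly amplifying normal tissue.
Introduces EasyAmplifier, a morphology-guided residual semantic amplifier that strengthens disease-related morphological semantics in the selected patches without model fine-tuning or lesion annotations at inference time.

Methodology: EasyBank is built offline: lesion prototypes are extracted from CT images and lesion masks (by clustering lesion-region features), and anatomy-aware normal references are extracted as prototypes from normal anatomical regions. At inference time with a frozen medical VLM, visual patch features are extracted from the input image. EasyTag selects lesion-relevant patches by counterfactual prototype reasoning (comparing each patch feature's similarity to the lesion prototypes and to the normal references). EasyAmplifier then applies morphology-guided residual enhancement to the selected patches (combining lesion prototypes and morphological features), and the strengthened representations are fed to the frozen VLM, increasing the contribution of lesion information to the global image embedding. No training or parameter updates are involved.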
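
The select-then-amplify flow can be illustrated with a toy version in two-dimensional feature space. Everything concrete here is an assumption made for the sketch: the selection rule (keep a patch only if it is closer to a lesion prototype than to any normal reference), the residual form (add a scaled copy of the nearest lesion prototype), mean pooling, and all names and values. The paper's EasyTag and EasyAmplifier are more elaborate.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def is_lesion_relevant(patch, lesion_protos, normal_protos, margin=0.0):
    # Assumed rule: the patch must resemble some lesion prototype more than
    # it resembles the best-matching normal anatomical reference.
    best_lesion = max(cosine(patch, p) for p in lesion_protos)
    best_normal = max(cosine(patch, p) for p in normal_protos)
    return best_lesion - best_normal > margin


def amplify(patches, lesion_protos, normal_protos, alpha=0.5):
    """Residually strengthen only the patches tagged as lesion-relevant."""
    out = []
    for patch in patches:
        if is_lesion_relevant(patch, lesion_protos, normal_protos):
            nearest = max(lesion_protos, key=lambda p: cosine(patch, p))
            patch = [x + alpha * p for x, p in zip(patch, nearest)]
        out.append(patch)
    return out


def global_embedding(patches):
    # Mean pooling as a stand-in for the VLM's own patch aggregation.
    return [sum(col) / len(patches) for col in zip(*patches)]


lesion_protos = [[1.0, 0.0]]
normal_protos = [[0.0, 1.0]]
patches = [[0.75, 0.25], [0.25, 0.75], [0.0, 1.0]]  # one lesion-like patch among normal ones

amplified = amplify(patches, lesion_protos, normal_protos)
before = global_embedding(patches)
after = global_embedding(amplified)
print(amplified[0], before[0], after[0])
```

Only the lesion-like patch changes, so the pooled embedding moves toward the lesion direction while the normal patches, and hence the anatomical context, are left intact.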

Key Results:

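On a unified subtle-lesion benchmark built from ReXGroundingCT, LIDC-IDRI, and AbdomenAtlas 3.0 Mini, EasyLens consistently improves subtle-lesion detection across multiple frozen medical VLM backbones.
Without any fine-tuning, EasyLens outperforms existing encoder-enhancement baselines (e.g., MedKLIP, KAD, MLIP).
EasyLens also improves report generation, suggesting that the enhanced representations support more accurate clinical descriptions.

Tech Stack:

Vision-language model (VLM) backbones: e.g., MedRAX, AOR, RadZero (used frozen)
Prototype clustering: K-means or a similar method to build lesion prototypes and normal references
Counterfactual reasoning: patch selection based on feature similarity (e.g., cosine similarity)
Morphology-guided residual enhancement: weighted fusion of lesion-prototype features with the original patch features
Evaluation metrics: lesion-detection accuracy, recall, F1 score, and others

Strengths:

Training-free and plug-and-play: it can be applied directly to any frozen medical VLM, lowering deployment cost.
Fine-grained lesion-patch selection via the prototype space and counterfactual reasoning avoids blindly amplifying normal tissue.
Morphology-guided residual enhancement preserves the original visual context while strengthening lesion semantics, making the global representation more lesion-sensitive.
Generalization and consistency are validated across multiple datasets and backbones, outperforming existing methods that require training.

Limitations:

It depends on the offline EasyBank prototype space, whose quality is bounded by the datasets and lesion diversity used to build it and may not cover all rare lesion morphologies.
The method is designed mainly for CT images; extending it to other modalities (e.g., X-ray, MRI) may require adjusting the prototype-construction strategy.
Counterfactual prototype reasoning and residual enhancement add inference-time compute; although lightweight, this may still need optimization for real-time use.
There is no prospective validation in a real clinical setting, so practical effectiveness remains to be assessed.

Relevance To Keywords:

Unify Models / native multimodal large models: As a plug-and-play module, EasyLens strengthens lesion perception in multimodal large models (medical VLMs), which relates to the unified-model direction.
World Models / representation learning: EasyLens compares and enhances representations through a pathology-anatomy prototype space, which is representation learning, but it does not involve world models.
Model-Based RL / reinforcement learning / post-training: The method needs no training or reinforcement learning, so relevance to these keywords is weak.
Unified understanding and generation in multimodal large models: EasyLens improves both understanding (lesion detection) and generation (report generation), supporting the unification direction.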

29. F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation [PASS]

Score: 40.5 / 27.8

Authors: Dinghao Zhou, Xingchen Song, Di Wu, Pengyu Cheng, Shengfan Shen, Sixiang Lv

Published: 2026-06-04

TL;DR: This paper proposes F3-Tokenizer, a unified audio tokenizer that integrates continuous autoencoder latents for generation with high-dimensional representations for understanding, overcoming the mismatch between reconstruction and semantic encoding.

Abstract (Chinese Translation)

连续音频自编码器虽能很好地重构波形，但往往产生结构较弱、不利于理解的潜在变量；而自监督音频编码器虽能捕捉语义，却无法直接解码。这种不匹配使得必须同时支持理解与生成的单一音频分词器的构建变得复杂。我们通过两个组件将连续自编码器的潜在变量适配至该场景：一个噪声正则化的自编码器瓶颈层和一个潜在侧表示编码器。瓶颈层采用通道归一化和随机扰动，而非基于 KL 的变分训练，从而生成尺度可控的连续潜在变量，用于重构和自回归生成。表示编码器在冻结的自编码器潜在变量上训练，利用 RQ-MTP 及冻结的大语言模型（LLM）进行监督。所得的分词器提供高维表示用于理解，同时保留归一化连续潜在变量作为生成目标。

Abstract

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	10.0/10	15.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

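Scoring rationale: Tokenizer is the core topic (10). Unify Models matches the goal of unifying understanding and generation (7). MLLM relates to the frozen-LLM supervision and potential applications (5). MultiModal relevance is low because the paper focuses on the single audio modality (3). World Models is weakly related, touching only latent-variable generation (2). Visual Encoder and model-based RL are unrelated, since the paper is in the audio domain and involves no reinforcement learning (0).

Keywords

Audio Autoencoder, Understanding and Generation, Latent Representation, Tokenizer, Noise-regularized, Frozen-LLM supervision, RQ-MTP

In-Depth Analysis

Chinese Title: F3-Tokenizer：驯服音频自编码器潜变量用于理解与生成

Summary: This paper proposes F3-Tokenizer, a continuous audio tokenizer intended to unify audio understanding and generation. Existing audio autoencoders reconstruct waveforms with high quality, but their latents lack semantic structure; self-supervised encoders capture semantics but cannot be decoded directly into waveforms. To resolve this tension, F3-Tokenizer adds two components to a continuous autoencoder. The first is a noise-regularized bottleneck that replaces the KL divergence with channel normalization and stochastic perturbation, producing scale-controlled continuous latents for reconstruction and autoregressive generation. The second is a latent-side representation encoder trained with random-quantization multi-token prediction (RQ-MTP) and supervision from a frozen large language model (LLM). The tokenizer provides high-dimensional representations for understanding tasks while keeping normalized continuous latents as generation targets. Experiments show that F3-Tokenizer performs well on acoustic fidelity, utility for understanding tasks, and generative predictability.

Innovations:

Uses continuous autoencoder latents as an acoustic anchor while outputting three complementary representations for reconstruction, understanding, and generation.
Adopts a continuous bottleneck with channel normalization and uniform-strength stochastic perturbation, avoiding a VAE-style KL objective and improving latent-space robustness.
Trains the representation encoder on frozen autoencoder latents, combining RQ-MTP self-supervision with frozen-LLM supervision to obtain high-dimensional semantic representations.
Jointly trains a generation-side patch-level flow-matching head that maps frozen-LLM states to patches of continuous autoencoder latents, yielding a generation-aware latent space.

Methodology: Training has three stages. Stage 0 trains the normalized autoencoder with an STFT-domain backbone; the bottleneck applies channel normalization and stochastic perturbation, and training optimizes a spectral reconstruction loss and a GAN adversarial loss. Stage 1 freezes the autoencoder and the LLM and trains the representation encoder (with RQ-MTP self-supervision and an LLM text cross-entropy loss) together with the patch-level flow-matching head (with a flow-matching objective). Stage 2 evaluates the representations on downstream tasks. The representation encoder is causal, trained causally via sliding windows, and includes a decoder-side projection to retain acoustic information.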
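
A noise-regularized bottleneck of the kind described (normalize, then perturb during training, with no KL term) might look like the following. This is a rough sketch under loudly stated assumptions: normalization is taken per frame across channels, and the perturbation is Gaussian noise whose strength is drawn uniformly from `[0, max_sigma]`. The paper's actual normalization axis, noise distribution, and hyperparameters are not given in this report, and every name below is hypothetical.

```python
import math
import random


def channel_normalize(frame, eps=1e-5):
    """Zero-mean, unit-variance normalization of one latent frame across channels."""
    mean = sum(frame) / len(frame)
    var = sum((x - mean) ** 2 for x in frame) / len(frame)
    scale = math.sqrt(var + eps)
    return [(x - mean) / scale for x in frame]


def bottleneck(latents, training, max_sigma=0.1, rng=None):
    """Normalize each frame; in training mode, also add random perturbation.

    Assumption: the noise strength is sampled uniformly per frame, so the
    decoder sees latents at a range of noise levels instead of a KL penalty.
    """
    rng = rng or random.Random()
    out = []
    for frame in latents:
        z = channel_normalize(frame)
        if training:
            sigma = rng.uniform(0.0, max_sigma)
            z = [x + rng.gauss(0.0, sigma) for x in z]
        out.append(z)
    return out


latents = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 0.0]]
clean = bottleneck(latents, training=False)
noisy = bottleneck(latents, training=True, rng=random.Random(0))
print(clean[0])
```

Normalization fixes the latent scale regardless of the encoder's raw output range, which is the property the summary refers to as scale-controlled latents for downstream generation.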

Key Results:

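The normalized continuous latents retain high reconstruction fidelity without KL regularization.
The representation encoder performs on par with dedicated self-supervised models on understanding tasks (e.g., speech recognition, audio classification).
The patch-level flow-matching head predicts continuous latents effectively, with generation quality close to dedicated generative models.
F3-Tokenizer outperforms existing tokenizers on all three dimensions: acoustic fidelity, understanding utility, and generative predictability.

Tech Stack:

STFT-domain audio autoencoder (SpectroStream style)
Channel normalization and stochastic perturbation
Random-quantization multi-token prediction (RQ-MTP)
Frozen large language model (LLM) supervision
Patch-level flow matching
GAN adversarial loss (based on DAC/SpectroStream)
Spectral reconstruction loss

Strengths:

Unifies tokenizer design for audio understanding and generation, avoiding the complexity of multiple encoders or multiple token streams.
The normalized continuous latent design is simple and effective, needs no VAE-style KL regularization, and trains stably.
The representation encoder gains rich semantics from RQ-MTP and LLM supervision while retaining acoustic information.
The generation-side flow-matching head makes the latent space generation-aware, improving downstream generation.

Limitations:

Training is staged, which may increase overall training complexity.
It depends on a frozen LLM, whose choice and scale may affect representation quality.
It is not evaluated under very low-bitrate compression and may not suit extremely bandwidth-constrained settings.
The paper focuses on tokenizer design and does not present downstream results for a complete unified audio-language model.

Relevance To Keywords:

Unify Models: F3-Tokenizer is designed directly for unified audio understanding and generation, outputting complementary representations from a single tokenizer, in line with the unified-model idea.
World Models: The tokenizer's continuous latents could serve as state representations in a world model for prediction and planning.
Representation Learning: The representation encoder learns high-dimensional semantic representations through RQ-MTP and LLM supervision, a typical application of representation learning.
Model-Based RL: The continuous latents and flow-matching head could be used to build model predictions in audio environments, supporting model-based reinforcement learning.
Native multimodal large models: The tokenizer gives multimodal large models a unified audio tokenization scheme that eases alignment with text, images, and other modalities.
Unified understanding and generation in multimodal large models: This matches the paper's core goal of supporting both understanding (high-dimensional representations) and generation (continuous latents).
Post-training: After training, the tokenizer can be used for post-training or fine-tuning on downstream tasks.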

30. LadderMan: Learning Humanoid Perceptive Ladder Climbing [PASS]

Score: 40.5 / 27.8

Authors: Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

Published: 2026-06-04

TL;DR: LadderMan enables humanoid robots to robustly climb diverse ladders and perform manipulation using a unified visuomotor policy trained via hybrid imitation and reinforcement learning, achieving zero-shot sim-to-real transfer.

Abstract (Chinese Translation)

人形机器人在以人类为中心的环境中应用前景广阔，然而梯子攀爬仍是最具挑战性的任务之一，原因在于支撑点和抓握点稀疏、全身协调复杂以及对感知和控制误差的敏感性。我们提出 **LadderMan**，这是一个统一系统，旨在使人形机器人能够在受限条件下稳健地攀爬各种梯子并执行操作。我们的攀爬策略基于可扩展的两阶段学习管道：首先利用混合运动跟踪从单一参考动作中学习多个攀爬专家，然后通过混合模仿与强化学习将这些专家蒸馏为统一的基于深度的 visuomotor 攀爬策略。为了实现现实世界的部署，我们利用视觉基础模型来弥合深度感知中的 sim-to-real 差距。基于所学攀爬策略，我们进一步利用双代理公式训练一个独立的操作策略，从而允许通过 teleoperation 实现稳定的梯子上操作。实验表明，LadderMan 能够在广泛的几何形状下实现稳健的梯子攀爬，以 zero-shot 方式成功转移到真实硬件，并在具有挑战性的梯子约束下支持多种操作任务。视频结果请访问 https://ladderman-robot.github.io .

Abstract

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	4.0/10	6.0

Scoring rationale: The paper presents LadderMan, a unified system for humanoid ladder climbing. It relies heavily on vision foundation models for depth perception (Visual Encoder: 8/10) and unifies climbing experts into a single policy (Unify Models: 6/10). It employs reinforcement learning and imitation learning (model-based RL: 4/10) and involves vision-action integration (MultiModal: 5/10). It does not use tokenizers, MLLMs, or world models (Tokenizer/MLLM/World Models: 0-2/10). No expert authors from the specified list were found (Bonus: 0). The calculated weighted score is 40.5, exceeding the dynamic pass score of 27.8.
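The arithmetic behind the score table can be checked with a short sketch. It assumes, based only on the table columns, that each row's score is weight × relevance and that the total is the sum of the rows plus any bonus; the function name `weighted_score` and the strict `>` pass rule are illustrative assumptions, not the report generator's actual code.

```python
def weighted_score(rows, bonus=0.0):
    """Sum weight * relevance over (keyword, weight, relevance) rows, plus a bonus."""
    return sum(weight * relevance for _, weight, relevance in rows) + bonus


# Rows copied from the LadderMan score table (relevance is on a 0-10 scale).
LADDERMAN_ROWS = [
    ("Unify Models", 1.5, 6.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 8.0),
    ("World Models", 1.5, 2.0),
    ("MLLM", 1.5, 2.0),
    ("MultiModal", 1.5, 5.0),
    ("model-based RL", 1.5, 4.0),
]

PASS_SCORE = 27.8  # dynamic pass score quoted in the rationale

total = weighted_score(LADDERMAN_ROWS)
# Assumption: a paper passes when its total strictly exceeds the pass score.
passed = total > PASS_SCORE
print(total, passed)
```

The same helper reproduces the other totals in this report when fed their table rows.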

Keywords

Humanoid robots, Ladder climbing, Visuomotor policy, Reinforcement learning, Vision foundation models, Sim-to-real transfer, Manipulation tasks

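In-depth analysis

Chinese Title: LadderMan: 学习人形机器人的感知爬梯

Summary: This paper presents LadderMan, a unified system that enables the Unitree G1 humanoid robot to climb a variety of ladders robustly and to manipulate objects while on the ladder. The climbing policy uses a two-stage learning pipeline: hybrid motion tracking first learns multiple expert policies from a single reference motion, and hybrid imitation and reinforcement learning then distills these experts into a unified visuomotor policy that takes depth images as input. Vision foundation models are used to bridge the sim-to-real gap in depth perception. A manipulation policy is further trained with a dual-agent framework, enabling stable on-ladder manipulation via teleoperation. Experiments show that LadderMan achieves zero-shot sim-to-real transfer across multiple ladder geometries and supports manipulation tasks such as adjusting a painting, replacing a light bulb, and handing over a box.

Innovations:

Proposes a hybrid motion tracking method that learns multiple expert climbing policies from a single reference motion, coordinating upper and lower limbs through an asymmetric tracking reward and ladder-specific contact rewards.
Distills the expert policies into a single depth-based visuomotor policy through hybrid imitation and reinforcement learning, improving generalization and robustness.
Uses a vision foundation model (VFM) to bridge the gap between simulated and real-world depth perception, without extensive depth randomization or manual tuning.
Proposes a dual-agent learning framework that decouples lower-body stabilization from upper-body manipulation, enabling stable teleoperated manipulation on the ladder.

Methodology: A two-stage learning pipeline. In the first stage, hybrid motion tracking (asymmetric tracking reward plus ladder contact reward) learns multiple expert policies from a single reference motion, with each expert covering a different ladder inclination and rung spacing. In the second stage, DAgger-style imitation learning combined with reinforcement learning distills the experts into a unified visuomotor policy whose inputs are depth images and proprioception. To handle the sim-to-real gap, a vision foundation model (such as Depth Anything) preprocesses real depth images. The manipulation policy uses a dual-agent framework: one policy stabilizes the lower body while the other receives teleoperation commands and performs upper-body manipulation.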
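As a rough illustration of the second stage, the sketch below combines an imitation term (student action versus expert action) with a reinforcement-learning term in a single objective. The report does not say how LadderMan weights the two terms, so the fixed mixing weight `il_weight`, the mean-squared-error imitation loss, and the function names are all assumptions made for illustration.

```python
def imitation_loss(student_action, expert_action):
    """Mean squared error between the student's and the expert's action vectors."""
    n = len(student_action)
    return sum((s - e) ** 2 for s, e in zip(student_action, expert_action)) / n


def hybrid_loss(student_action, expert_action, rl_loss, il_weight=0.5):
    """Convex combination of an imitation term and an RL term.

    il_weight is a hypothetical hyperparameter: 1.0 is pure imitation
    (DAgger-style distillation), 0.0 is pure reinforcement learning.
    """
    il = imitation_loss(student_action, expert_action)
    return il_weight * il + (1.0 - il_weight) * rl_loss


# Toy usage: the student deviates from the expert on one joint target.
loss = hybrid_loss([0.0, 1.0], [0.0, 0.0], rl_loss=2.0, il_weight=0.5)
print(loss)
```

In a DAgger-style loop the expert labels would be queried on states visited by the student, which this toy example does not model.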

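Key Results:

In simulation, the climbing policy generalizes across ladder geometries (varying inclination and rung spacing) and reaches climbing speeds comparable to a human's.
Transfers zero-shot to a real Unitree G1 humanoid without additional hardware modification and climbs several real ladders.
On-ladder manipulation tasks (adjusting a painting, replacing a light bulb, handing over a box) succeed, outperforming an off-the-shelf whole-body teleoperation policy.

Tech Stack:

Hybrid Motion Tracking
Asymmetric Tracking Reward
DAgger (Dataset Aggregation)
Reinforcement Learning
Vision Foundation Model (VFM), such as Depth Anything
Dual-Agent Learning
Domain Randomization
Depth Camera

Strengths:

Presents a complete perceptive ladder-climbing system with a clear pipeline from learning to sim-to-real transfer.
Needs only a single reference motion to learn multiple expert policies, lowering data requirements.
The vision foundation model effectively addresses the sim-to-real gap in depth perception, improving robustness.
The dual-agent framework enables on-ladder manipulation, extending the application scenarios of humanoid robots.
Experiments validate zero-shot transfer and generalization across ladder geometries.

Limitations:

The climbing policy depends on depth images and may fail under extreme lighting or with transparent ladders.
The manipulation policy supports teleoperation only; autonomous manipulation is not implemented.
The range of ladder geometries is limited; extreme inclinations and non-standard ladders are untested.
No quantitative comparison with other perceptive ladder-climbing methods (for example map-based methods).
System complexity is high, and training and deployment require substantial compute.

Relevance To Keywords: The paper mainly concerns reinforcement learning and visual perception for humanoid ladder climbing, and is only weakly related to the given keywords (Unify Models, World Models, Representation Learning, Model-Based RL, native multimodal large models, etc.). It does use a vision foundation model (VFM) for representation learning and reinforcement learning for policy training, so it is partially relevant to "representation learning" and "reinforcement learning". It does not address world models, unified understanding and generation in multimodal large models, or post-training, so overall relevance is low.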

31. HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning PASS

Score: 40.5 / 27.8

Authors: Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman

Published: 2026-06-04

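TL;DR: HyperVis projects visual relations onto the Lorentz hyperboloid and improves vision-language model performance on compositional reasoning benchmarks (GQA 61.03%, SugarCrepe 79.94%).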

摘要翻译

视觉 - 语言模型（VLMs）在处理需要理解对象间关系的组合推理时面临挑战。一种自然的补救方案是从现成的场景图生成器（SGG）注入显式的场景图三元组 $\langle s, p, o \rangle$，但我们发现这适得其反：离散的文本标签与连续的视觉模态发生冲突，导致 GQA 准确率从 60.38% 降至 58.86%。我们提出 HyperVis，该方法完全绕过了 SGG 的语义瓶颈。基于 $N$ 个类别无关的区域提议，我们通过空间偏置交叉注意力计算稠密的 $O(N^2)$ 视觉关系张量，将其投影至洛伦兹双曲面（Lorentz hyperboloid），并通过空间物理机制强制层级结构，具体包括 IoA 驱动的蕴含锥（IoA-driven entailment cones）和外角排斥（exterior-angle repulsion）。我们发现 HyperVis 以两种互补的方式做出贡献：(1) 作为训练时正则化器，双曲关系损失塑造了 LoRA 表示，从而提升了生成式视觉问答（GQA）性能（GQA 达 61.03%，相比之下，无关系损失的 LoRA 微调仅为 57.21%，恢复并超越了基线）；(2) 作为推理时关系编码器，双曲前缀标记增强了判别式组合评分（SugarCrepe 达 79.94%，比基线高出 6.25 个百分点）。学习到的曲率稳定在 $\kappa=4.0$，比先前双曲 VLMs 中的 $\kappa$（通常坍缩至零）高出一个数量级，这表明连续视觉特征确实需要强曲率空间所提供的指数级体积。受控的欧几里得消融实验证实了这一分解：关系管道在平坦空间中同样能正则化 LoRA（GQA 60.81%），但组合性增益具体而言是双曲的（SugarCrepe 比欧几里得基线高出 4.58 个百分点），且欧几里得训练中的蕴含损失约为双曲训练的 6 倍。代码将在 TBA 处提供。

Abstract

Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25pp over baseline). The learned curvature stabilises at $\kappa = 4.0$, an order of magnitude above prior hyperbolic VLMs where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58pp over Euclidean), with entailment loss about 6× higher in Euclidean training. Code will be available at TBA.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

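Scoring rationale: The paper's core is compositional reasoning in vision-language models (MLLM/MultiModal), so these two keywords are highly relevant (8 each); it involves visual feature processing, so Visual Encoder is moderately relevant (5); Unify Models is partially relevant because the paper unifies visual and relational representations (4); Tokenizer is touched on only through prefix tokens rather than tokenizer architecture (2); World Models and model-based RL are unrelated to the paper's content, which is VLM reasoning rather than RL or world modeling (0). The weighted total is 40.5, above the dynamic pass score of 27.8. The author list contains none of the specified experts (Yang Shi and others), so no bonus is applied.

Keywords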

Compositional Reasoning, Visual Relational Graphs, Lorentz Hyperboloid, Vision-Language Models, Hyperbolic Geometry, LoRA Fine-tuning, Spatial Physics

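In-depth analysis

Chinese Title: HyperVis: 洛伦兹双曲面上的连续潜在视觉关系图用于组合推理

Summary: This paper addresses the weak performance of vision-language models (VLMs) on compositional reasoning and proposes the HyperVis framework. The conventional remedy of injecting textual triplets from a scene graph generator (SGG) actually hurts performance (GQA accuracy drops from 60.38% to 58.86%). HyperVis bypasses the SGG semantic bottleneck entirely: from N class-agnostic region proposals it computes a dense O(N²) visual relation tensor with spatially-biased cross-attention, projects it onto the Lorentz hyperboloid, and enforces hierarchy through entailment cones driven by intersection-over-area (IoA) and through exterior-angle repulsion. Experiments show two contributions: as a training-time regularizer, the hyperbolic relational losses shape LoRA representations so that generative VQA (GQA) reaches 61.03% (3.82pp above LoRA fine-tuning without relational losses); as an inference-time relational encoder, hyperbolic prefix tokens raise discriminative compositional scoring (SugarCrepe 79.94%, 6.25pp above the baseline). The learned curvature stabilizes at κ=4.0, an order of magnitude higher than in prior hyperbolic VLMs, indicating that continuous visual features need a strongly curved space.

Innovations:

Shows that injecting textual SGG triplets lowers VLM accuracy, whereas continuous visual latent features are better suited to encoding in hyperbolic geometry.
Proposes a purely visual relation tensor computed with spatially-biased cross-attention, requiring no predicate vocabulary and no external SGG model.
Introduces an IoA-driven geometric hierarchy: spatial containment maps to entailment cones and spatial separation maps to exterior-angle repulsion, shaping a hierarchy on the Lorentz hyperboloid.
Identifies a dual role for the hyperbolic losses: regularizing LoRA at training time to improve generative VQA, and supplying prefix tokens at inference time to improve discriminative compositional scoring.
The learned curvature κ=4.0 challenges the bottleneck narrative in which curvature in hyperbolic VLMs collapses toward zero, indicating that continuous visual features need a strongly curved space.

Methodology: First, N=36 class-agnostic region proposals are extracted from the image (annotated boxes for GQA, a 6×6 grid for other benchmarks); each region receives RoI visual features and a bounding box. A dense relation tensor is then computed with spatially-biased multi-head self-attention (the abstract describes this step as spatially-biased cross-attention): relative geometry is encoded as a spatial feature vector, which produces both the attention bias and a per-pair spatial context feature. After the attention outputs are aggregated, a relation feature is built for every ordered pair. All relation features are then projected onto the Lorentz hyperboloid and trained with IoA-driven losses: if region A is contained in B (IoA > 0.8), A is embedded inside B's entailment cone; if the regions do not overlap (IoA < 0.05), they are pushed apart by exterior-angle repulsion; pairs in between receive no supervision. Finally, a hyperbolic Top-K gate selects the most salient relations, which are injected into LLaVA-1.5 as visual prefix tokens. Training uses Riemannian optimization and numerical stabilization techniques.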
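The IoA thresholding described above can be sketched as follows. The thresholds 0.8 and 0.05 come from the report; normalizing the intersection by the area of the first box, the `(x1, y1, x2, y2)` box format, and the label names are assumptions for illustration.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def ioa(a, b):
    """Intersection area of a and b divided by the area of a (assumed convention)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = box_area(a)
    return (inter_w * inter_h) / area_a if area_a > 0 else 0.0


def relation_label(a, b, contain_thr=0.8, disjoint_thr=0.05):
    """Map an ordered region pair to the loss that would supervise it."""
    value = ioa(a, b)
    if value > contain_thr:
        return "entail"   # a lies inside b: entailment-cone loss
    if value < disjoint_thr:
        return "repel"    # effectively disjoint: exterior-angle repulsion
    return "none"         # ambiguous overlap: no supervision


print(relation_label((2, 2, 4, 4), (0, 0, 10, 10)))  # a is fully inside b
```

Because IoA is asymmetric, the label for (A, B) can differ from the label for (B, A), which is consistent with relation features being built for ordered pairs.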

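Key Results:

GQA accuracy: HyperVis (61.03%) is 3.82pp above LoRA-only fine-tuning (57.21%) and 0.65pp above the baseline (60.38%).
SugarCrepe accuracy: HyperVis (79.94%) is 6.25pp above the baseline (73.69%).
The curvature κ stabilizes at 4.0, well above prior hyperbolic VLMs (where it typically approaches 0).
Euclidean ablation: GQA accuracy is comparable (60.81%), but on SugarCrepe the hyperbolic model is 4.58pp above the Euclidean variant (and 6.25pp above the baseline), and the entailment loss is about 6× higher in Euclidean training.
Textual SGG injection lowers GQA accuracy to 58.86% (below the 60.38% baseline).

Tech Stack:

Lorentz model (Lorentz hyperboloid)
Multi-head self-attention
Spatial bias
IoA (Intersection-over-Area)
Entailment cone
Exterior-angle repulsion
LoRA (Low-Rank Adaptation)
LLaVA-1.5
Riemannian optimization
Exponential/logarithmic maps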
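For readers unfamiliar with the Lorentz model, the sketch below shows the standard exponential map at the origin and the geodesic distance on a hyperboloid satisfying ⟨x, x⟩_L = −1/κ. These are textbook formulas, not code from the paper; treating κ as a positive curvature magnitude is an assumption about the report's convention.

```python
import math


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time components plus the space components."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def exp_map_origin(v, kappa):
    """Map a Euclidean (tangent-at-origin) vector v onto the hyperboloid of curvature -kappa."""
    sqrt_k = math.sqrt(kappa)
    norm = math.sqrt(sum(c * c for c in v))
    if norm < 1e-12:
        return [1.0 / sqrt_k] + [0.0] * len(v)
    scale = math.sinh(sqrt_k * norm) / (sqrt_k * norm)
    return [math.cosh(sqrt_k * norm) / sqrt_k] + [scale * c for c in v]


def lorentz_distance(x, y, kappa):
    """Geodesic distance between two points on the hyperboloid."""
    arg = max(1.0, -kappa * lorentz_inner(x, y))  # clamp for numerical safety
    return math.acosh(arg) / math.sqrt(kappa)


KAPPA = 4.0  # curvature magnitude reported for HyperVis
point = exp_map_origin([0.3, -0.4], KAPPA)
origin = exp_map_origin([0.0, 0.0], KAPPA)
print(lorentz_inner(point, point), lorentz_distance(origin, point, KAPPA))
```

The clamp inside `lorentz_distance` is one example of the numerical stabilization such implementations typically need, since rounding can push the `acosh` argument slightly below 1.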

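Strengths:

Replaces discrete textual triplets with continuous visual relational graphs, avoiding the semantic noise and vocabulary limits of SGGs.
Hyperbolic space is naturally suited to encoding hierarchy, and the IoA-driven geometric constraints define that hierarchy without semantic labels.
The two contributions are complementary: LoRA regularization at training time and explicit relational encoding at inference time.
The κ=4.0 finding is of theoretical interest and challenges the curvature bottleneck of prior hyperbolic VLMs.
Experiments include ablations, a Euclidean comparison, and curvature analysis.

Limitations:

Depends on region proposal quality; GQA uses annotated boxes and other benchmarks use a grid, which may not suit every scenario.
The IoA thresholds (0.8 and 0.05) are set by hand, with no adaptive mechanism.
Computational complexity is O(N²); this is feasible at N=36 but may become inefficient for larger N.
Validated only on LLaVA-1.5; generalization to other VLM architectures is untested.
Hyperbolic operations (exponential/logarithmic maps) can introduce numerical instability and need special handling.

Relevance To Keywords:

Native multimodal large models: HyperVis acts as an enhancement module for LLaVA-1.5 and improves the compositional reasoning of a multimodal large model.
Representation learning: learns better visual representations through continuous visual relational graphs in hyperbolic space and regularizes LoRA with geometric constraints.
World models: compositional reasoning involves understanding relations between objects (spatial, functional), a key capability for building world models.
Model foundations: the method builds on hyperbolic geometry and attention mechanisms and counts as a model architecture innovation.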

32. T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation PASS

Score: 40.5 / 27.8

Authors: Jingkun Feng, Reza Sabzevari

Published: 2026-06-04

TL;DR: T-FunS3D proposes a task-driven hierarchical method for open-vocabulary 3D functionality segmentation that efficiently locates functional components in robotic scenes using vision-language models on multimodal inputs.

摘要翻译

开放词汇 3D 功能分割（Open-vocabulary 3D functionality segmentation）使机器人能够在三维场景中定位功能对象组件。这是一项具有挑战性的任务，需要空间理解与任务解读能力。当前的开放词汇 3D 分割方法主要侧重于物体级识别，而场景级部件分割方法试图穷尽式地分割整个场景，导致它们高度资源密集且耗时。如何在粒度、准确性和速度之间平衡分割性能仍然是一个挑战。为缓解这一问题，我们提出 T-FunS3D，一种面向机器人应用、提供可操作感知的任务驱动分层开放词汇 3D 功能分割方法。该方法以室内场景的 3D 点云和带位姿的 RGB-D 图像作为输入。我们通过提取环境中的实例及其视觉嵌入来构建开放词汇场景图（Open-vocabulary scene graph）。给定任务描述后，T-FunS3D 识别场景图中最相关的实例，并利用视觉 - 语言模型（vision-language model）定位其功能组件。在 SceneFun3D 数据集上的实验表明，T-FunS3D 在开放词汇 3D 功能分割方面与最先进方法相当，同时实现了更快的运行速度和更低的内存占用。

Abstract

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to state-of-the-art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	7.0/10	10.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on 3D functionality segmentation using multimodal inputs (point clouds, RGB-D) and vision-language models, showing strong relevance to MultiModal (8.0) and moderate relevance to MLLM (7.0) and Visual Encoder (6.0). It does not address World Models (2.0), Model-Based RL (0.0), Tokenizer design (1.0), or Unify Models architecture (3.0) as core contributions. No expert authors from the specified list are found.

Keywords

3D Functionality Segmentation, Open-Vocabulary, Task-Driven, Vision-Language Model, Scene Graph, Point Cloud, RGB-D

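In-depth analysis

Chinese Title: T-FunS3D：任务驱动的分层开放词汇3D功能分割

Summary: This paper proposes T-FunS3D, a training-free open-vocabulary 3D functionality segmentation method that aims to give robots actionable perception. It takes the 3D point cloud and posed RGB-D images of an indoor scene as input and first builds an open-vocabulary scene graph by extracting instances and their visual embeddings; given a task description, it identifies the most relevant instances in the scene graph and locates their functional components with a vision-language model. Experiments on the SceneFun3D dataset show that T-FunS3D performs comparably to state-of-the-art methods on open-vocabulary 3D functionality segmentation while running faster and using less memory. The method targets the trade-off among granularity, accuracy, and speed, and avoids the resource waste of exhaustively segmenting the whole scene.

Innovations:

Proposes a training-free 3D functionality segmentation method driven by free-form task descriptions, using an actionable 3D scene graph that carries open-vocabulary semantics.
Adopts a task-driven hierarchical approach: the scene is first decomposed into object instances, and fine-grained functional-component segmentation is then run only on task-relevant entities, improving efficiency.
Builds an open-vocabulary scene graph whose nodes and edges both use visual embedding features, supporting localization of entities through environmental references in open-world settings.
Compared with existing methods, reduces runtime and memory use on the SceneFun3D dataset while keeping competitive accuracy.

Methodology: The input is the 3D point cloud and posed RGB-D images of an indoor scene. Stage one: Mask3D performs class-agnostic instance segmentation and FG-CLIP extracts visual embeddings, producing an open-vocabulary scene graph with nodes and edges (inter-object relations). Stage two: given a task description, the Qwen3 large language model decomposes the task into an ontology (for example contextual object, referred object, spatial relation); the contextual object is then located in the scene graph by text-visual embedding similarity; finally Molmo and SAM extract 2D functional-part masks on the corresponding images, and these masks are aggregated into the 3D segmentation result.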
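The step that locates objects in the scene graph by text-visual embedding similarity can be illustrated with a minimal retrieval sketch. It assumes cosine similarity over embeddings in a shared space (as CLIP-style encoders provide); the toy 2-D vectors, instance names, and the `top_k_instances` helper are hypothetical and stand in for real FG-CLIP embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_instances(text_embedding, instance_embeddings, k=1):
    """Return the k scene-graph node names most similar to the text query."""
    scored = [
        (cosine_similarity(text_embedding, emb), name)
        for name, emb in instance_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy scene graph nodes with made-up 2-D embeddings.
nodes = {"cabinet": [1.0, 0.0], "lamp": [0.0, 1.0], "door": [0.7, 0.7]}
query = [0.9, 0.1]  # stands in for the embedded phrase from the task description
print(top_k_instances(query, nodes, k=2))
```

Restricting the later part-level segmentation to the returned nodes is what keeps the pipeline from processing the whole scene.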

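Key Results:

On the SceneFun3D dataset, T-FunS3D's open-vocabulary 3D functionality segmentation accuracy is comparable to state-of-the-art methods such as Search3D and Fun3DU.
Runtime and memory use are lower than those of the compared methods, which suits deployment on mobile robots.
The task-driven hierarchical strategy avoids exhaustive whole-scene segmentation and applies fine-grained processing only to task-relevant objects.

Tech Stack:

Mask3D (class-agnostic 3D instance segmentation)
FG-CLIP (visual embedding extraction)
Qwen3 (large language model, used for task decomposition)
Molmo (vision-language model, used for 2D part detection)
SAM (Segment Anything Model, used for 2D mask generation)
SceneFun3D dataset
Visual embedding similarity (text-image matching)

Strengths:

Training-free: uses pretrained models directly, lowering deployment cost.
Task-driven: processes only relevant objects, so computation is efficient.
The scene graph supports open-vocabulary and complex referential queries.
Strikes a good balance among accuracy, speed, and memory.

Limitations:

Depends on the performance of pretrained models (Mask3D, FG-CLIP, Molmo, SAM) and may be limited by their generalization.
Assumes objects in the scene are static; local movement may affect spatial relations.
Task decomposition relies on an LLM and may not be robust to complex or ambiguous task descriptions.
Handles only queries that explicitly contain a contextual object, and may fail on implicit objects or abstract tasks.

Relevance To Keywords:

Native multimodal large models: the paper uses Molmo (a vision-language model) and Qwen3 (a language model), which is an application of multimodal large models.
Unified understanding and generation in multimodal large models: Molmo has both understanding and generation capabilities and is used for part detection.
Representation learning: FG-CLIP extracts visual embeddings that form the node and edge representations of the scene graph.
World models: not addressed directly, but the scene graph can be seen as a structured representation of the environment and relates to spatial reasoning.
Reinforcement learning / post-training: not addressed; the paper focuses on perception.
Unify Models: the paper integrates several pretrained models (Mask3D, FG-CLIP, Qwen3, Molmo, SAM), reflecting the idea of model unification.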

33. DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions PASS

Score: 37.5 / 27.8

Authors: Nathan Bout, Maxime Langevin, Ronan Riochet

Published: 2026-06-04

TL;DR: DragOn addresses the scarcity of drag grounding data for GUI agents by providing a benchmark and dataset, demonstrating that current MLLMs struggle with complex tasks but can be improved via fine-tuning.

摘要翻译

GUI 代理（基于视觉的模型，通过图形用户界面控制桌面、网页浏览器和移动设备）有望实现广泛数字任务的自动化。虽然百万级数据集在点击定位（click-grounding）方面取得了显著进展，但拖拽定位（drag grounding，例如拖放、滑动、高亮）的数据量仍小一个数量级，且当前模型在基于拖拽的复杂交互方面仍显不足。我们引入了 DragOn，这是一个拖拽定位基准及训练数据集，涵盖四个领域：文本高亮、单元格选择、元素调整和滑块操作。该数据集包含 28.6 万张训练截图和 350 万项训练任务，外加一个包含 2000 个示例的保留评估集。我们评估了专有模型（GPT, Claude）和开源权重模型（Qwen, Kimi, Holo），以及在训练数据上微调的 Qwen VLM（视觉语言模型）。结果表明，我们的数据集有望提升最先进模型在下游计算机使用任务上的表现。

Abstract

GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	3.0/10	4.5

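Scoring rationale: The paper focuses on a benchmark and dataset for GUI drag interactions and evaluates existing MLLMs. It is highly relevant to the MLLM and MultiModal keywords because it evaluates multimodal models on visual tasks. It pays little attention to unified models, tokenizers, world models, or model-based RL algorithms, so those categories score low. Visual Encoder, as a component of the underlying models, has modest relevance. The author list contains none of the specified experts.

Keywords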

DragOn, GUI Agents, Drag Grounding, Benchmark, Dataset, MLLM, Visual-Language-Action

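In-depth analysis

Chinese Title: DragOn：面向基于拖拽的图形用户界面交互的基准测试与数据集

Summary: This paper addresses visual grounding of drag actions (drag-and-drop, swipe, highlight, and so on) in graphical user interfaces (GUIs) and proposes the DragOn benchmark and training dataset. Click-grounding data has reached million scale, while drag-grounding data is an order of magnitude smaller, so existing models perform poorly on complex drag interactions. DragOn follows a "rendering as supervision" principle: it uses geometry exposed by the renderers (PDF, XLSX, PPTX, HTML) to generate labels automatically. It covers four domains (text highlighting, cell selection, element resizing, and slider manipulation) and contains 286K training screenshots and 3.5M training tasks, plus 2,000 held-out evaluation samples. The authors evaluate proprietary models such as GPT and Claude and open-weight models such as Qwen and Kimi, and find that frontier models all score below 30%; a Qwen VLM fine-tuned on DragOn surpasses all frontier models, suggesting that the dataset can improve performance on downstream computer-use tasks.

Innovations:

Proposes the "rendering as supervision" principle, which uses the renderer's own geometry as the labeling function, greatly lowering annotation cost and reducing errors.
Builds the DragOn dataset, 1-2 orders of magnitude larger than existing drag corpora and covering four heterogeneous drag domains.
Establishes a drag-grounding benchmark that systematically evaluates multiple proprietary and open-weight models and reveals clear shortcomings of current VLMs on drag tasks.
Shows through fine-tuning a Qwen VLM that the dataset improves drag-action performance, surpassing frontier models.

Methodology: Uses a "rendering as supervision" framework: a deterministic renderer R maps a structured source S to an image I=R(S), and a label map π returns the pixel-space answer B=π(S,Q) from the source and a query Q. The framework is instantiated separately for the four domains: text highlighting parses PDF text-span coordinates; cell selection uses a probe-style label map (the source is perturbed and re-rendered, and the target is located by color-key detection); element resizing maps PPTX EMU units to pixel space; slider manipulation uses the geometry of URL-parameterized HTML sliders. All labels are generated automatically, with no human annotation.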
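The PPTX case gives a concrete instance of a label map π: shape geometry is stored in English Metric Units (EMU, 914,400 per inch in Office Open XML), so a ground-truth box can be computed from the source without looking at pixels. The sketch below assumes the slide is rendered to fill the screenshot exactly; the function names and the example slide and image sizes are illustrative, not DragOn's code.

```python
EMU_PER_INCH = 914_400  # Office Open XML definition of the EMU


def emu_to_px(emu, dpi=96):
    """Convert a length in EMU to pixels at the given rendering DPI."""
    return emu * dpi / EMU_PER_INCH


def shape_bbox_px(left, top, width, height, slide_w_emu, slide_h_emu, img_w_px, img_h_px):
    """Map a shape's EMU geometry to a pixel box (x1, y1, x2, y2).

    Assumes the rendered slide covers the whole image with no margins.
    """
    scale_x = img_w_px / slide_w_emu
    scale_y = img_h_px / slide_h_emu
    return (
        left * scale_x,
        top * scale_y,
        (left + width) * scale_x,
        (top + height) * scale_y,
    )


# A 16:9 slide (12,192,000 x 6,858,000 EMU) rendered at 1280 x 720 pixels.
bbox = shape_bbox_px(
    left=6_096_000, top=0, width=3_048_000, height=3_429_000,
    slide_w_emu=12_192_000, slide_h_emu=6_858_000,
    img_w_px=1280, img_h_px=720,
)
print(bbox)
```

A resize task could then ask for a drag from one corner of `bbox` to a target point, with both endpoints known exactly from the source.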

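Key Results:

The DragOn dataset contains 286K training screenshots and 3.5M training tasks, plus 2,000 held-out evaluation samples.
Proprietary models (GPT, Claude, Gemini) and open-weight models (Qwen, Kimi, Holo) all score below 30% on drag grounding.
A Qwen3.5VL model fine-tuned on DragOn surpasses all frontier models, demonstrating the dataset's effectiveness.
In the OSWorld and AndroidWorld benchmarks, 13.9% and 82.8% of tasks respectively require drag actions, underscoring the importance of drag grounding.

Tech Stack:

Renderers: PDF (text-span coordinates), XLSX (color-key detection in the LibreOffice view), PPTX (EMU unit mapping), HTML (URL-parameterized slider geometry)
Vision-language models: GPT-4V, Claude-3, Gemini, Qwen2.5VL, Kimi, Holo, Qwen3.5VL (fine-tuned)
Probe-style label map: perturb the source (for example fill a cell with a unique color key), then locate the target by visual detection
Evaluation metric: how accurately the predicted source and target bounding boxes match the ground truth (the exact metric is not detailed; IoU or coordinate error is typical)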
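Since the report only guesses that the metric is IoU or coordinate error, the following is a generic intersection-over-union sketch rather than DragOn's actual scoring code. The `(x1, y1, x2, y2)` box format and the 0.5 acceptance threshold in `drag_correct` are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def drag_correct(pred_src, pred_dst, gold_src, gold_dst, thr=0.5):
    """Hypothetical rule: a drag counts only if both endpoints' boxes match."""
    return iou(pred_src, gold_src) >= thr and iou(pred_dst, gold_dst) >= thr


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Requiring both endpoints to match reflects that a drag with the right start but the wrong end still fails the task.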

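Strengths:

The dataset is large and covers several kinds of drag action, filling the gap in drag-grounding data.
Labeling is automated, cheap, and precise, avoiding the errors and overhead of manual annotation.
Systematically evaluates current mainstream VLMs on drag tasks and reveals a clear gap.
Fine-tuning experiments demonstrate the dataset's practical value for improving drag ability.

Limitations:

Covers only four drag domains and may not represent all GUI drag scenarios (for example free-canvas dragging or multi-step drag trajectories).
Tasks are two-point bounding-box grounding on static screenshots and do not consider trajectory prediction during dynamic interaction.
The evaluation set is small (2,000 examples) and may not fully measure generalization.
Provides no direct analysis linking results to end-to-end agent task performance; the argument rests indirectly on the share of downstream tasks that need drags.

Relevance To Keywords:

Native multimodal large models: the paper focuses on VLM ability in GUI drag grounding, which is an application and evaluation of multimodal large models.
Representation learning: rendering as supervision generates labels automatically from the renderer's geometric representation, which involves aligning visual representations with structured data.
World models: drag operations involve understanding GUI state changes, and the dataset could be used to train world models that predict the screen after a drag.
Reinforcement learning: the paper mentions post-training (fine-tuning) to improve model performance; reinforcement learning could be combined in future to optimize drag policies.
Post-training: the authors show the effect of post-training on a specific capability by fine-tuning a Qwen VLM.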

34. Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs PASS

Score: 37.5 / 27.8

Authors: Tianyi Tang, Zhuoyi Lin, Zeyu Feng, Tianyi Ma, Yew-Soon Ong, Ivor Tsang, Haiyan Yin

Published: 2026-06-04

TL;DR: This paper introduces CausalPhys, a benchmark for evaluating causal physical reasoning in Vision-Language Models, and proposes CRFT fine-tuning to enhance reasoning accuracy and interpretability through causal graph alignment.

摘要翻译

对物理世界的理解与推理是智能行为的基础，然而最先进的视觉 - 语言模型（VLMs）在因果物理推理方面仍存在不足，往往产生看似合理但错误的答案。为填补这一空白，我们引入了 CausalPhys，这是一个包含超过 3000 个精心策划的视频与图像问题的基准，涵盖四个领域：感知（Perception）、预期（Anticipation）、干预（Intervention）和目标导向（Goal Orientation）。每个问题均配有一张专家标注的因果图，用于捕捉对象 - 属性 - 事件依赖关系，从而实现可解释且细粒度的因果理解评估。在此基础上，我们提出了一种基于因果图的度量方法，定量衡量模型的思维链推理（Chain-of-Thought Reasoning）与正确因果关系的对齐程度，超越了仅基于答案的准确性，并实现了对 VLMs 因果推理失败的系统性诊断。利用该度量，我们对主流 VLMs 进行了全面分析，揭示了它们在捕捉因果依赖关系方面的系统性差距，并强调了因果感知学习的必要性。为了解决这些局限性，我们进一步提出了基于因果理由的微调（Causal Rationale-informed Fine-Tuning, CRFT），明确地将 VLM 的推理过程与因果结构对齐。大量实验表明，CRFT 在多种模型骨干上显著提升了推理准确性和可解释性。通过统一数据集构建、因果评估与因果感知学习，CausalPhys 为推动现代 VLMs 迈向因果基础的物理推理奠定了坚实基础。

Abstract

Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs' causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper centers on Vision-Language Models (MLLM, MultiModal) for physical reasoning, yielding high scores for these keywords. It mentions unifying evaluation and learning processes (Unify Models) but lacks specific focus on Tokenizers, Visual Encoder architectures, World Models generation, or Model-Based RL, resulting in lower scores. No expert authors from the specified list are present.

Keywords

Causal Scaffolding, Physical Reasoning, VLMs, CausalPhys, Causal Graphs, CRFT, Interpretability, Physical World Understanding

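In-depth analysis

Chinese Title: 因果脚手架用于物理推理：面向视觉语言模型中因果驱动的物理世界理解基准

Summary: This paper addresses the shortcomings of current vision-language models (VLMs) in causal physical reasoning and proposes the CausalPhys benchmark, with more than 3,000 carefully designed video and image questions across four domains: perception, anticipation, intervention, and goal orientation. Each question comes with an expert-annotated causal directed acyclic graph (DAG) for fine-grained evaluation of causal understanding. On this basis the authors develop a causal-graph-based metric that quantifies how well a model's reasoning aligns with the correct causal relations, and they reveal systematic deficiencies of existing VLMs in capturing causal dependencies. To mitigate this, they propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structure. Experiments show that CRFT improves reasoning accuracy and interpretability across multiple model backbones. The work unifies dataset construction, causal evaluation, and causal learning, laying a foundation for causal physical reasoning in VLMs.

Innovations:

Combines physical reasoning tasks with expert-annotated causal graphs to build the CausalPhys benchmark, supporting interpretable, mechanism-level evaluation.
Proposes a causal-graph-based metric that goes beyond answer-only accuracy and provides fine-grained diagnostic information.
Designs the CRFT strategy, which explicitly aligns VLM reasoning with causal structure to improve accuracy and interpretability in physical settings.
Unifies benchmarking, evaluation, and model improvement in one framework, giving causal physical reasoning research a closed loop.

Methodology: The paper proceeds as follows. Videos and images are collected from 11 public datasets, and 3,000+ questions are designed across 4 domains and 16 subcategories. Experts annotate a causal DAG for each question, capturing object-attribute-event dependencies. A causal-graph-based metric then measures how well the model's chain-of-thought reasoning aligns with the correct causal relations. Finally, CRFT introduces causal rationales as a supervision signal during fine-tuning, constraining the model's reasoning to follow the causal structure. Experiments cover zero-shot and fine-tuned evaluation on several VLM backbones (such as LLaVA and BLIP-2).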
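The report does not define the causal-graph-grounded metric, so the sketch below shows one plausible form: treat the causal relations extracted from a model's chain of thought and the gold DAG as sets of directed edges and score their overlap with precision, recall, and F1. Edge extraction from free text is assumed to be done elsewhere, and the function name and example events are hypothetical.

```python
def causal_alignment(predicted_edges, gold_edges):
    """Precision, recall, and F1 between two sets of directed (cause, effect) edges."""
    pred, gold = set(predicted_edges), set(gold_edges)
    true_pos = len(pred & gold)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Gold DAG for a toy "push the ball" question.
gold = [
    ("push", "ball_moves"),
    ("ball_moves", "hits_block"),
    ("hits_block", "block_falls"),
]
# Edges read off a model's reasoning: it skips the collision step.
predicted = [("push", "ball_moves"), ("ball_moves", "block_falls")]

precision, recall, f1 = causal_alignment(predicted, gold)
print(precision, recall, f1)
```

In this example the model could still give the right final answer ("the block falls") while scoring low on alignment, which is the kind of gap an answer-only accuracy metric would miss.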

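Key Results:

The CausalPhys benchmark contains 3,062 questions across four domains: perception, anticipation, intervention, and goal orientation.
Existing VLMs perform poorly on causal physical reasoning and fail systematically on intervention and anticipation tasks in particular.
The causal-graph-based metric reveals a clear gap between model reasoning and the correct causal dependencies.
CRFT improves reasoning accuracy and interpretability across multiple model backbones and strengthens zero-shot generalization.

Tech Stack:

Causal directed acyclic graph (causal DAG)
Chain-of-Thought reasoning
Vision-language model (VLM) backbones: LLaVA, BLIP-2, and others
Causal Rationale-informed Fine-Tuning (CRFT)
Graph-based evaluation metric (causal alignment score)
Data sources: 11 public physical-scene datasets (such as CLEVRER and CoPhy)

Strengths:

Explicitly introduces causal graphs into a VLM physical reasoning benchmark, supporting mechanism-level evaluation.
The dataset is large, covers diverse scenes (real-world and synthetic), and has systematically designed questions.
Proposes an actionable causal fine-tuning method that directly improves model performance.
The metric checks not only answer correctness but also whether the reasoning process is consistent with the causal structure.
Provides open-source code and the dataset, supporting reproducible research.

Limitations:

Causal graphs are annotated by hand by experts, which is costly and may introduce subjective bias.
The benchmark is based mainly on static or simple dynamic scenes; complex interactions (such as successive multi-object collisions) have limited coverage.
CRFT depends on causal graphs as a supervision signal and is hard to apply directly where causal annotations are unavailable.
Experiments cover a limited number of VLM backbones and do not include the latest large models (such as GPT-4V).

Relevance To Keywords:

World models: the paper models object-attribute-event dependencies of the physical world with causal graphs, which is closely related to causal reasoning in world models.
Representation learning: CRFT pushes the model to learn causally structured representations, improving generalization.
Model-Based RL: causal physical reasoning underlies planning and decision-making for agents, and the benchmark and fine-tuning method could be used for world-model learning in reinforcement learning.
Native multimodal large models: the paper focuses on VLMs, which are multimodal large models, and causal reasoning is one of their weak points.
Post-training: CRFT is a post-training fine-tuning strategy that improves reasoning through causal-rationale alignment.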

35. GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention PASS

Score: 37.5 / 27.8

Authors: Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

Published: 2026-06-04

TL;DR: GRAMformer proposes a Volumetric Multimodal cross-Attention mechanism to capture any-order modality interactions efficiently in multimodal learning tasks.

摘要翻译

基于 Transformer 的多模态模型依赖注意力机制以整合异构模态之间的信息。尽管取得了成功，现有的多模态注意力机制通过成对点积交互的集合或通过将所有模态拼接至键中来计算其分数，即便多个模态本应联合参与。因此，当前方法要么在模态数量上产生二次复杂度，要么无法显式建模依赖于多个表示联合构型的交互。在此工作中，我们引入了体积多模态交叉注意力（VMA），这是一种新颖的交叉注意力机制，其中注意力分数定义为查询与多个模态特定键的联合几何的函数。VMA 计算查询和键向量在多模态空间中张成的体积，捕捉超越成对相似性的联合多模态依赖，从而实现对任意阶模态交互的原生建模。我们将 VMA 整合到我们提出的新型多模态架构 GRAMformer 中，该架构显式设计用于整合任意数量的模态。我们在多模态学习任务上评估了所提出的模型，展示了改进的有效性和效率。

Abstract

Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity, enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	9.0/10	13.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper proposes a novel cross-attention mechanism (VMA) for multimodal transformers, which is highly relevant to MultiModal and moderately relevant to Unify Models and MLLM due to its architectural focus on integrating heterogeneous modalities. It does not discuss Tokenizers, Visual Encoders specifically, World Models, or Model-based RL, resulting in low scores for those keywords. No expert authors from the specified list were found, so no bonus was applied.

Keywords

Volumetric Multimodal cross-Attention, Any-Order Modality Interactions, Multimodal Transformer, Joint Geometry, Heterogeneous Modalities, GRAMformer, Cross-Attention Mechanism

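In-depth analysis

Chinese Title: GRAMformer：通过体积化多模态交叉注意力实现任意阶模态交互

Summary: Existing Transformer multimodal attention mechanisms either model only pairwise interactions or handle multimodal information implicitly through concatenation. This paper proposes a new mechanism, Volumetric Multimodal cross-Attention (VMA). VMA defines the attention score as the volume of the parallelotope spanned by the query vector and multiple modality-specific key vectors, directly capturing joint modality dependencies of any order and avoiding both the quadratic complexity of pairwise dot products and the computational blow-up of concatenation. Building on VMA, the authors design the GRAMformer architecture, which fuses any number of modalities efficiently. Experiments on datasets such as MOSI show that GRAMformer outperforms existing methods while using fewer parameters and less computation. The code is open source.

Innovations:

Identifies a fundamental limitation of existing pairwise attention: it cannot explicitly model higher-order modality interactions.
Proposes Volumetric Multimodal cross-Attention (VMA), which uses the geometric volume spanned by the query and multimodal keys as the attention score, jointly modeling modality interactions of any order.
Designs the GRAMformer architecture, which integrates VMA into a Transformer and supports efficient fusion of any number of modalities.
Validates VMA's effectiveness and efficiency on several multimodal tasks, with advantages in both performance and compute over concatenation and pairwise attention.

Methodology: The attention mechanism is geometry-driven. Multimodal keys are first grouped by aligned sequence position to form multimodal key groups. For each query vector, the volume of the parallelotope spanned by the query and a group of key vectors (from M modalities) is computed and used as the attention score. The volume is obtained from the determinant of the Gram matrix, or directly from the determinant of the vector set, and reflects how linearly independent the vectors are. This score replaces the usual dot-product similarity, so attention responds to the joint configuration of all modalities at once. The volume scores are then normalized with softmax and used to weight and aggregate the corresponding value vectors. The mechanism can be embedded in a multi-head attention framework and remains learnable through linear projections.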
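The volume computation can be made concrete with a small standard-library sketch: the volume of the parallelotope spanned by k vectors is the square root of the determinant of their Gram matrix. How GRAMformer turns volume into a score is not stated in the report, so negating the volume before softmax (smaller volume meaning better-aligned vectors) and unit-normalizing the inputs are assumptions here, as are all function names.

```python
import math


def gram_matrix(vectors):
    """Matrix of pairwise dot products."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, det = len(a), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def volume(vectors):
    """Volume of the parallelotope spanned by the vectors: sqrt(det(Gram))."""
    return math.sqrt(max(0.0, determinant(gram_matrix(vectors))))


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else list(v)


def vma_weights(query, key_groups):
    """Softmax over negative volumes; key_groups[i] holds one key per modality."""
    scores = [-volume([normalize(query)] + [normalize(k) for k in g]) for g in key_groups]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


groups = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # both modality keys aligned with the query
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # keys orthogonal to the query and each other
]
weights = vma_weights([1.0, 0.0, 0.0], groups)
print(weights)
```

Each position needs one determinant of size (M+1)×(M+1) rather than a separate dot product per modality pair, so the score depends on the joint configuration of the query and all M keys at once.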

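Key Results:

On the MOSI dataset, GRAMformer outperforms all compared methods, including early fusion and pairwise cross-attention, on binary sentiment classification accuracy (Acc-2).
GRAMformer's parameter count and compute cost are below those of most baselines, reflecting a lightweight and efficient design.
Ablations confirm VMA's advantage over pairwise and concatenation attention in capturing higher-order interactions.
With 2, 3, and 4 modalities, VMA consistently improves performance, and complexity grows linearly rather than quadratically with the number of modalities.

Tech Stack:

Multi-head cross-attention
Volume computation: parallelotope volume (via the Gram-matrix determinant or the vector-set determinant)
Linear projection layers (for query, key, and value projections)
Softmax normalization
PyTorch (implementation framework)

Strengths:

Introduces geometric volume into the attention mechanism, moving beyond pairwise similarity and explicitly modeling modality interactions of any order.
Complexity grows linearly with the number of modalities, avoiding the quadratic cost of pairwise methods and the large key-value stacks of concatenation methods.
Achieves better performance on several multimodal tasks with a lighter model, which has practical deployment value.
The method is general: it extends to any number of modalities and needs no low-rank approximation or decomposition.

Limitations:

Assumes modalities are aligned along the sequence dimension; unaligned or asynchronous modalities (such as video and audio at different sampling rates) need extra alignment preprocessing.
The volume depends on the linear independence of the vector set; with high-dimensional modality embeddings the determinant may be numerically unstable or sensitive to noise.
Validated only on tasks such as sentiment analysis, not yet on more complex multimodal understanding (for example video question answering or multimodal generation).
Does not discuss how volumetric attention differs from standard dot-product attention in gradient propagation, which may affect training stability.

Relevance To Keywords:

Unify Models, World Models, Representation Learning, Model-Based RL: the paper focuses on attention mechanisms in multimodal representation learning and is highly relevant to representation learning. It does not address world models, model-based reinforcement learning, or post-training, so relevance is moderate.
Native multimodal large models, unified understanding and generation: VMA can be seen as a fusion module for multimodal large models that helps multimodal understanding, but it has no direct link to unified generation.
Representation learning: the core of the paper is learning joint multimodal representations through a volume measure, so it is directly relevant.
World models, reinforcement learning, post-training: the paper does not address these topics, so relevance is low.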

36. LatentWave: JEPA Pretraining for Wireless Foundation Models PASS

Score: 36.0 / 27.8

Authors: Ahmed Mohamed, Ahmed Aboulfotouh, Hatem Abou-Zeid

Published: 2026-06-04

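TL;DR: LatentWave is a wireless foundation model pretrained with JEPA; by predicting masked regions in latent space it learns transferable representations that improve generalization across several wireless downstream tasks.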

摘要翻译

无线基础模型已成为为每个无线任务构建独立模型的有前景的替代方案。然而，现有方法依赖于掩码输入重建，这可能导致表示偏向于低层级的信号细节。本文提出 LatentWave，一种基于联合嵌入预测架构（JEPA）在多样化无线频谱图和信道状态信息（CSI）上预训练的无线基础模型。通过在潜在空间预测掩码区域，LatentWave 学习了具有更好开箱即用迁移能力的表示，适用于多样化的下游任务。所提出的架构采用每通道补丁嵌入，并在预训练期间引入随机通道采样，使其能够处理可变的天线数量，并提升在不同异构无线配置下的适用性。我们在四个下游任务上评估 LatentWave：射频信号分类、5G NR 定位、波束预测以及视距/非视距（LoS/NLoS）分类，并与在同一数据上预训练的掩码建模基线（WavesFM）进行对比。此外，我们还表明，掩码几何结构引入了任务相关的归纳偏置：频率掩码强烈倾向于信道相关任务（如定位和波束预测），而区域掩码则更好地保留了信号分类的判别性。

Abstract

Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless spectrograms and channel state information (CSI). By predicting masked regions in latent space, LatentWave learns representations that are more transferable out of the box across diverse downstream tasks. The proposed architecture employs per-channel patch embeddings with stochastic channel sampling during pretraining, allowing it to process variable antenna counts and improving usability across heterogeneous wireless configurations. We evaluate LatentWave on four downstream tasks: RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification, comparing against a masked-modeling baseline (WavesFM) pretrained on the same data. Additionally, we show that the masking geometry introduces a task-dependent inductive bias: frequency masking strongly favors channel-related tasks such as positioning and beam prediction, while region masking better preserves discriminability for signal classification.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	6.0/10	9.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	8.0/10	12.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

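Scoring rationale: The paper pretrains with the JEPA architecture, which is closely related to core World Models methods, hence the highest score; as a wireless foundation model, its unified handling of multiple tasks fits the Unify Models idea; the patch-embedding mechanism is structurally similar to a Tokenizer and a Visual Encoder, hence mid-range scores; it processes multiple signal types (spectrograms and CSI), so MultiModal is somewhat related; it involves neither language models (MLLM) nor reinforcement learning (model-based RL), so both score 0.

Keywords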

LatentWave, JEPA Pretraining, Wireless Foundation Models, Spectrograms, Channel State Information, Representation Learning, Patch Embeddings, Downstream Tasks

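In-depth analysis

Chinese Title: LatentWave：面向无线基础模型的JEPA预训练方法

Summary: This paper proposes LatentWave, a pretraining method for wireless foundation models based on the Joint-Embedding Predictive Architecture (JEPA). Existing masked-reconstruction methods push the model to attend to low-level signal details; LatentWave instead predicts masked regions in latent space and learns higher-level semantic representations that transfer better. The model uses per-channel patch embeddings and stochastic channel sampling, supports variable antenna counts, and is more broadly usable across heterogeneous wireless configurations. It is evaluated on four downstream tasks (RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification) against the masked-modeling baseline WavesFM, and LatentWave's frozen representations transfer better. The study also finds that mask geometry introduces a task-dependent inductive bias: frequency masking favors channel-related tasks (positioning, beam prediction), while region/time masking is better for signal classification.

Innovations:

Applies the JEPA framework to wireless foundation model pretraining, replacing conventional input-space masked reconstruction and learning higher-level semantic representations.
Proposes per-channel patch embedding with stochastic channel sampling, letting the model handle variable antenna counts and adapt to heterogeneous wireless configurations.
Systematically compares four masking strategies (region, frequency, time, random) and reveals the task-dependent inductive bias introduced by mask geometry.
Validates the transferability of frozen representations on four different downstream tasks, outperforming the masked-modeling baseline.

Methodology: Uses the JEPA self-supervised framework, with a context encoder, a target encoder, and a predictor. The context encoder processes the visible patches, the target encoder (updated by EMA) processes the full input, and the predictor predicts the representations of the masked regions in latent space. Inputs are spectrograms and CSI; each channel is split into patches independently and embedded through a shared linear projection. During pretraining the number of channels is sampled randomly and a multi-block masking strategy (region/frequency/time/random) is applied. The loss is the mean squared error between the predictions at masked positions and the target representations. After pretraining, the target encoder is extracted as the foundation model for downstream tasks.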
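Two pieces of the JEPA recipe described above are simple enough to show directly: the exponential-moving-average (EMA) update of the target encoder and the mean-squared-error loss restricted to masked patch positions. The sketch uses plain Python lists in place of tensors; the momentum value 0.996 and the function names are illustrative assumptions, not values from the paper.

```python
def ema_update(target_params, context_params, momentum=0.996):
    """Move each target-encoder parameter a small step toward the context encoder."""
    return [
        momentum * t + (1.0 - momentum) * c
        for t, c in zip(target_params, context_params)
    ]


def masked_mse(predicted, target, mask):
    """MSE between predicted and target patch embeddings, over masked patches only.

    predicted, target: lists of per-patch embedding vectors.
    mask: list of booleans, True where the patch was hidden from the context encoder.
    """
    total, count = 0.0, 0
    for pred_vec, tgt_vec, is_masked in zip(predicted, target, mask):
        if is_masked:
            total += sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec))
            count += len(pred_vec)
    return total / count if count else 0.0


# Toy example: two patches with 2-D embeddings; only the first one is masked.
pred = [[1.0, 1.0], [0.0, 0.0]]
tgt = [[0.0, 0.0], [5.0, 5.0]]
loss = masked_mse(pred, tgt, [True, False])
new_target = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
print(loss, new_target)
```

Because the target encoder is updated by EMA rather than by gradients, the loss is computed against slowly moving targets, a common way to keep latent-prediction objectives from collapsing to a constant representation.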

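Key Results:

LatentWave's frozen representations outperform the masked-modeling baseline WavesFM on the four downstream tasks.
Frequency masking performs best on 5G NR positioning and beam prediction; region/time masking is better on RF signal classification.
Stochastic channel sampling lets the model generalize to different antenna configurations without retraining.
The inductive bias introduced by mask geometry is task-dependent; no single strategy wins on every task.

Tech Stack:

JEPA (Joint-Embedding Predictive Architecture)
Vision Transformer (ViT) as encoder and predictor
Exponential moving average (EMA) update of the target encoder
Cosine learning-rate schedule (peak 1e-3, 12 warmup epochs)
Multi-block masking strategies (region, frequency, time, random)
Per-channel patch embedding (patch size 16×16)
Stochastic channel sampling (uniform distribution)
Mean squared error (MSE) loss

Strengths:

Proposes a JEPA pretraining paradigm that avoids the low-level bias of input-space reconstruction and learns more general representations.
The architecture is flexible and supports variable antenna counts, improving applicability in real deployments.
Systematically analyzes the effect of mask geometry, giving guidance on choosing a pretraining strategy per task.
Validates frozen-representation transfer on several wireless tasks, reducing the need for downstream fine-tuning.

Limitations:

Evaluates only four downstream tasks and does not cover other wireless scenarios (such as channel estimation or interference management).
The comparison with the WavesFM baseline may be incomplete; methods such as contrastive learning are not compared.
JEPA pretraining is computationally expensive (ViT encoder plus predictor), which may limit use in resource-constrained settings.
Does not quantify how performance varies with antenna count, and the analysis of stochastic channel sampling is brief.

Relevance To Keywords:

Representation learning: JEPA predicts in latent space and learns high-level semantic representations, matching the core goal of representation learning.
World models: JEPA predicts latent-space dynamics and can be seen as one attempt at building a world model of the wireless environment.
Multimodal large models: the paper processes two signal types, spectrograms and CSI, but does not involve multimodal fusion, so it is a single-modality foundation model.
Model foundations: LatentWave is a wireless foundation model that supports multiple downstream tasks, fitting the foundation-model paradigm.
Post-training: the paper focuses on pretraining; downstream tasks use linear probing or fine-tuning, which falls under post-training.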

37. Benchmark Everything Everywhere All at Once PASS

Score: 33.0 / 27.8

Authors: Shiyun Xiong, Dongming Wu, Peiwen Sun, Yuang Ai, Bokang Yang, Wencheng Han, Xiao-Hui Li, Xiangyu Yue

Published: 2026-06-04

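TL;DR: Benchmark construction is labor-intensive and benchmarks saturate quickly; this paper proposes Benchmark Agent, an autonomous agentic system that generates high-quality LLM/MLLM benchmark samples with minimal human involvement.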

摘要翻译

基准测试（Benchmarks）对于评估和推进大语言模型（LLMs）及多模态大语言模型（MLLMs）至关重要，因为它们提供了标准化的、明确的表现衡量指标。然而，其构建过程劳动密集且难以复用，引发了关于可持续性和可扩展性的担忧。此外，现有基准测试通常在发布后迅速达到性能饱和，导致对最先进模型（state-of-the-art models）之间的区分度不足。为应对这些挑战，我们引入了 Benchmark Agent，这是一个专为基准构建设计的完全自主的代理系统（agentic system）。该框架统筹了完整的基准构建流程，涵盖从用户查询分析与子任务设计，到数据标注与质量控制的全环节。为评估 Benchmark Agent，我们将其应用于生成 15 个代表性基准，涵盖多样化的评估场景，包括文本理解、多模态理解及领域特定推理（domain-specific reasoning）。广泛的实验（包括人类评估、LLM-as-a-judge 评估及一致性检查）表明，Benchmark Agent 能够在极少人工干预的情况下生成高质量的基准样本。更重要的是，通过持续评估，我们观察到若干重要发现，包括当前模型在某些领域特定推理任务上仍存在困难。我们认为，快速迭代的基准测试可为研究社区做出重要贡献。预览及代码将在演示页面及代码库中公开提供。

Abstract

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	1.0/10	1.5

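Scoring rationale: The paper's core is an automated benchmark-construction system (Benchmark Agent), not a model architecture or a reinforcement learning method. It is therefore directly relevant to MLLM and MultiModal (the objects being evaluated, hence the higher scores) and nearly unrelated to Unify Models, Tokenizer, Visual Encoder, World Models, and model-based RL (lower scores). None of the five designated experts appears in the author list. The weighted total is 33.0, above the dynamic passing score of 27.8.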
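
The weighted total quoted in the rationale can be reproduced from the scoring table: each row's score is weight × relevance, and the total is compared with the dynamic passing score. The sketch below copies the numbers from the table for this paper; the variable names and the use of `>=` for the pass check are assumptions about how the report generator works.

```python
# (keyword, weight, relevance out of 10), copied from the scoring table above.
ROWS = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 2.0),
    ("World Models", 1.5, 1.0),
    ("MLLM", 1.5, 8.0),
    ("MultiModal", 1.5, 7.0),
    ("model-based RL", 1.5, 1.0),
]
PASS_THRESHOLD = 27.8  # the "dynamic passing score" shown in the report


def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(ROWS)
passed = total >= PASS_THRESHOLD  # assumed comparison; the report only says "above"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {'PASS' if passed else 'FAIL'}")
```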

Keywords

Benchmark Agent, Autonomous Agentic System, LLM Evaluation, MLLM Evaluation, Multimodal Understanding, Benchmark Construction, LLM-as-a-judge

In-Depth Analysis

Chinese Title: 一次性全面基准构建：无处不在的基准测试

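Summary: To address the high labor cost, slow iteration, and performance saturation of existing benchmark construction, this paper proposes Benchmark Agent, a fully autonomous agentic system for automatically building high-quality, customizable benchmarks. The system uses a hierarchical "brain-cerebellum" architecture made up of a Benchmark Planner (high-level decision module) and a Benchmark Executor (execution module). The Planner decomposes user requirements into subtasks and produces a feasible plan through data search, transformability verification, and global allocation; the Executor performs sample-level planning, tool-based transformation, and quality control. Experiments show that Benchmark Agent can generate 15 benchmarks covering text understanding, multimodal understanding, and domain-specific reasoning with minimal human involvement; human evaluation, LLM-as-a-judge assessment, and consistency checks support their quality and discriminative power. The system supports continual rapid updates, giving the community a dynamically evolving evaluation tool.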

Innovations:

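The first to propose a fully autonomous agentic system for benchmark construction, automating the full pipeline from requirement analysis to data annotation.
Uses a hierarchical "brain-cerebellum" architecture that decouples high-level planning from low-level execution, supporting long-horizon, multi-step benchmark construction.
Supports user-oriented customized evaluation with flexible task formats, domains, and evaluation criteria, going beyond the limits of traditional general-purpose benchmarks.
Enables continual rapid refresh: benchmarks can be updated dynamically for new models, new domains, and user needs, avoiding performance saturation.
Ensures the quality and reproducibility of generated benchmarks through multi-agent collaboration (design, grounding, allocation) and tool-chain integration.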

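Methodology: The paper takes an agentic-system approach and builds the Benchmark Agent framework. The pipeline has two stages: 1) the Benchmark Planner stage, comprising a Design Agent (decomposes user requirements into subtasks), a Grounding Agent (ensures subtasks are feasible through dataset search and transformability verification), and an Allocation Agent (allocates resources under global constraints); 2) the Benchmark Executor stage, which performs sample-level planning, calls a range of tools (LLM and non-LLM) to transform data, and produces the final benchmark through quality control and quota verification. The system uses the ReAct paradigm for reasoning and interaction, combined with multi-round iteration and a self-consistency mechanism.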

Key Results:

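Generated 15 representative benchmarks covering text understanding, multimodal understanding, and domain-specific reasoning.
Human evaluation and LLM-as-a-judge assessment indicate high sample quality, comparable to manually built benchmarks.
Consistency checks show that the generated benchmarks have good discriminative power and can separate models by performance.
The system is markedly more time- and cost-efficient than traditional manual construction.
Ablations validate the contribution of each Planner and Executor component.
Continual evaluation finds that current models still fall clearly short on certain domain-specific reasoning tasks.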

Tech Stack:

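LLMs (e.g., ChatGPT, Gemini, Claude) as the core reasoning engine
ReAct paradigm (interleaved reasoning and acting)
Multi-agent collaboration (Design Agent, Grounding Agent, Allocation Agent)
Tool calls (Proposer, Revising, Discarding, Preference, Searching, Transformability, Score-and-Filter, etc.)
Non-LLM tools (code generation for data transformation, multimodal content creation, etc.)
Self-consistency mechanism (iterative verification and quality checks)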

Strengths:

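Applies agentic systems to benchmark construction in a novel way, addressing the labor intensity and slow iteration of traditional methods.
Clear hierarchical design: separating high-level planning from low-level execution improves scalability and robustness.
Supports user customization, so the generated benchmarks are flexible and targeted.
Thorough experimental validation, including human evaluation, LLM-as-a-judge assessment, consistency checks, and efficiency analysis, which makes the results credible.
Emphasizes continual updating, which helps counter the benchmark saturation caused by rapid model iteration.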

Limitations:

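The system depends on existing datasets and tool libraries, so entirely new domains or extremely rare scenarios may lack sufficient data support.
The quality of generated benchmarks is bounded by the underlying LLM and may contain bias or hallucinations.
The current evaluation focuses on text and multimodal understanding; more complex interactive or dynamic environments are not yet sufficiently validated.
The system is fairly complex, so deployment and maintenance carry a technical cost.
The paper does not discuss benchmark fairness and potential bias in detail.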

Relevance To Keywords:

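Unify Models: Benchmark Agent can evaluate different models (LLMs, MLLMs) in a unified way, which relates to the need for unified model evaluation.
World Models: the generated benchmarks can cover the reasoning, prediction, and scene-understanding abilities that world models require, but the paper neither focuses on nor specifically designs for world models.
Representation Learning: benchmarks can be used to assess representation quality, but the paper is about evaluation rather than representation learning itself, so the link is indirect.
Model-Based RL: benchmarks can evaluate models on planning and reasoning tasks, which relates to evaluating model-based reinforcement learning.
Native multimodal large models: the generated benchmarks cover multimodal understanding and directly serve the evaluation of native multimodal large models.
Unified understanding and generation in multimodal large models: benchmarks can evaluate both understanding and generation, but the paper leans toward understanding tasks.
Reinforcement learning: benchmarks could be used to evaluate RL policies, but the paper does not address this.
Post-training: benchmarks can verify the effect of post-training, and the paper's emphasis on continual evaluation fits post-training evaluation needs.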

38. Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models [PASS]

Score: 33.0 / 27.8

Authors: Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng, Xiaotian Yin, Wenfei Yang, Tianzhu Zhang

Published: 2026-06-04

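TL;DR: This paper proposes a directional bias-guided test-time defense that uses the directional prior carried by adversarial perturbations to recover robust representations in vision-language models, improving adversarial robustness without retraining.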

摘要翻译

视觉 - 语言模型（VLMs），如 CLIP，已展现出强大的零样本泛化能力，但仍极易受到对抗性扰动的影响，在实际应用中构成严重风险。针对 VLMs 的测试时防御（Test-time defenses）近期已成为一种有前景且高效的方法，用于抵御对抗性攻击，而无需代价高昂的大规模重新训练。在这项工作中，我们发现了一个令人惊讶的现象：在各种输入变换下，CLIP 特征空间中的对抗性图像始终沿一个主导方向移动，与干净图像的分散模式形成对比。我们假设这种主导转移，称为防御方向（Defense Direction），与对抗性转移相反，将特征指回其正确的类别中心。基于这一洞察，我们提出了方向性偏差引导防御（DBD），这是一个测试时框架，它估计防御方向，并采用基于 DB-score 的双流重建策略来恢复鲁棒表示。在 15 个数据集上的实验表明，DBD 不仅实现了 SOTA（最先进）的对抗鲁棒性并保持干净准确率，还揭示了一个反直觉的结果：对抗准确率甚至可能超过干净准确率。这表明对抗性扰动本质上编码了关于真实决策边界的定向先验。

Abstract

Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	5.0/10	7.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

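Scoring rationale: The paper studies defense against adversarial attacks on vision-language models (VLMs). 'MultiModal' is highly relevant because VLMs are inherently multimodal; 'Visual Encoder' is fairly relevant because the defense operates in the visual encoder's feature space; 'MLLM' is moderately relevant because VLMs fall under multimodal large models, although the paper does not cover large-model training details; 'Unify Models' is moderately relevant because VLMs unify the visual and language modalities, although the paper's core is the defense strategy rather than a unified architecture. 'Tokenizer', 'World Models', and 'model-based RL' have no direct connection to the paper and are scored 0. The weighted total is 33.0, above the dynamic passing score of 27.8.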

Keywords

Vision-Language Models, Adversarial Attacks, Test-time Defense, Directional Bias, Adversarial Robustness, Feature Space, CLIP, Reconstruction Strategy

In-Depth Analysis

Chinese Title: 对抗攻击早已揭示答案：面向视觉语言模型的方向性偏差引导的测试时防御

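Summary: To address the vulnerability of vision-language models such as CLIP to adversarial attacks, this paper proposes a test-time defense that needs no additional training. The authors find that, under diverse input transformations, adversarial images shift along one dominant direction in CLIP's feature space, whereas the features of clean images scatter. Building on this, they propose Directional Bias-guided Defense (DBD), which estimates this dominant direction (termed the Defense Direction) and uses a DB-score-based two-stream feature reconstruction strategy to recover robust representations. Experiments on 15 datasets show that DBD preserves clean accuracy while substantially improving adversarial robustness; on some datasets, accuracy on adversarial images even exceeds accuracy on clean images. The work suggests that adversarial perturbations implicitly encode a directional prior about the true decision boundary.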

Innovations:

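The first to show that adversarial perturbations implicitly encode a directional prior about the true decision boundary, which can be reliably estimated through multiple transformations.
Proposes the Directional Bias-guided Defense (DBD) framework, which achieves an efficient test-time defense through Defense Direction estimation and a DB-score-based two-stream reconstruction strategy.
Validates DBD on 15 datasets: adversarial robustness reaches SOTA, and in some cases adversarial accuracy exceeds clean accuracy.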

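Methodology: First, multiple transformations (spatial, pixel, and frequency domain) are applied to the input image to obtain augmented features. Entropy filtering then keeps the high-quality features. Next, a directional bias score (DB-score) is computed to separate adversarial from clean images. Finally, a two-stream reconstruction strategy is applied: for high-DB-score (adversarial) samples, the feature is moved linearly along the Defense Direction; for low-DB-score (clean) samples, the mean of the transformed features is used as test-time augmentation.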

Key Results:

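DBD reaches SOTA adversarial robustness on all 10 fine-grained classification datasets and 5 ImageNet-OOD datasets.
DBD substantially improves adversarial robustness while preserving clean accuracy.
On some datasets, classification accuracy on adversarial images even exceeds that on clean images.
The DB-score shows a clear bimodal distribution that separates adversarial from clean images.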

Tech Stack:

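CLIP visual and text encoders
PGD attack (ℓ∞, ϵ=4/255, step size 1/255, 100 steps)
Multidimensional scaling (MDS) for feature visualization
Cosine similarity as the distance measure
Entropy filtering (based on information entropy)
Directional bias (DB-score) computation
Two-stream feature reconstruction (linear shift and mean augmentation)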
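
To make the attack setting concrete, here is a minimal ℓ∞ PGD loop using the parameters listed above (ϵ=4/255, step size 1/255, 100 steps). It is only a sketch: the paper attacks CLIP, whereas this example uses a hypothetical three-pixel "image" and a toy linear scorer so that the gradient is available in closed form. The random start used by many PGD implementations is omitted.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(x0, grad_fn, eps=4 / 255, step=1 / 255, n_steps=100):
    """Projected gradient ascent on a loss inside an l-infinity ball around x0."""
    x = list(x0)
    for _ in range(n_steps):
        g = grad_fn(x)
        x = [xi + step * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the eps-ball, then into the valid pixel range [0, 1].
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x


# Toy stand-in for a classifier (NOT CLIP): score = w . x, true label +1,
# loss = -score, so the gradient of the loss with respect to x is -w.
W_TOY = [1.0, -2.0, 0.5]


def toy_loss(x):
    return -sum(wi * xi for wi, xi in zip(W_TOY, x))


def toy_grad(x):
    return [-wi for wi in W_TOY]


x_clean = [0.5, 0.5, 0.5]
x_adv = pgd_linf(x_clean, toy_grad)
```

With a step of 1/255 and a budget of 4/255, each pixel reaches the boundary of the ball after four steps; the remaining iterations keep the perturbation pinned there.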

Strengths:

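No additional training and computationally efficient, making it suitable for deployment.
Preserves clean performance while greatly improving adversarial robustness, in some cases beyond clean accuracy.
Reveals the directional prior hidden in adversarial perturbations, offering theoretical insight.
Generalization is validated across multiple datasets.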

Limitations:

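Defense quality may depend on the diversity and quality of the transformations, which must be chosen by hand.
Robustness against adaptive attacks (e.g., AutoAttack) is not discussed in depth.
Only CLIP is studied; generalization to other VLMs is not verified.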

Relevance To Keywords:

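Multimodal large models: the paper targets vision-language models such as CLIP, which fall under multimodal large models.
Representation learning: the defense analyzes directional bias in feature space and operates on the representations.
Post-training: test-time defense happens after training and requires no retraining.
World models / reinforcement learning: not directly addressed, though the directional-bias finding may inform understanding of model decision boundaries.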

39. HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers [PASS]

Score: 31.5 / 27.8

Authors: Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames

Published: 2026-06-04

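TL;DR: This paper proposes HANDOFF, a humanoid whole-body controller that distills complementary teachers into a mixture-of-experts model, enabling natural-language-driven manipulation tasks without task-specific fine-tuning.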

摘要翻译

为了使人形机器人能够在现实世界中部署，指令空间（command space，即任务规划与全身控制之间的接口）的选择至关重要。现有的全身控制器通常需要密集的运动学或空间参考，而规划器难以从任务语义中生成这些参考。相反，我们提出了一种紧凑、显式的接口，该接口直观、通用、模块化，且具备足够的表达能力，足以应对多种操作技能。为此，我们引入了 HANDOFF，这是一个遵循该接口的单一的人形机器人全身控制器，它通过上下文条件门控机制下的多教师 KL 蒸馏，从三个互补的专科模型（全身运动跟踪（带安全过滤数据）、移动和跌倒恢复）中蒸馏出一个混合专家（Mixture-of-Experts）学生模型。在 Unitree G1 上，HANDOFF 达到了最先进的速度跟踪水平，并提供了最大的鲁棒操作工作空间之一。我们进一步通过多个自然语言驱动的任务执行展示了硬件可行性，这些任务由一个视觉语言模型（VLM）驱动的智能体规划器驱动，且无需任务特定数据或控制器微调。

Abstract

For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	2.0/10	3.0

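Scoring rationale: The paper's core is distilling and unifying humanoid control policies. MultiModal (integrating language, vision, and control) and Unify Models (a unified control policy) are the most relevant keywords, each scored 5.0/10; MLLM is relevant through the VLM used for task planning; Tokenizer is entirely unrelated; Visual Encoder and World Models are weakly related (only as VLM components or implicit environment understanding); model-based RL has low relevance (the method is mainly distillation rather than model-based reinforcement learning).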

Keywords

Humanoid Whole-Body Control, Distilled Complementary Teachers, Mixture-of-Experts, VLM-driven Planner, Natural-Language-driven, Task-Space Control, Agentic Task

In-Depth Analysis

Chinese Title: HANDOFF：通过蒸馏互补教师实现的人形智能体任务空间全身控制

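Summary: This paper proposes HANDOFF, a whole-body controller for humanoid robots built around a compact 10-dimensional command space (base velocity, height, and bilateral wrist targets); the interface is intuitive, general, modular, and expressive. Using multi-teacher KL distillation with context-conditioned gating, three complementary specialists (whole-body motion tracking, locomotion, and fall recovery) are distilled into one mixture-of-experts student policy. On the Unitree G1 robot, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. Combined with a VLM-driven agentic planner, it carries out a variety of natural-language-driven tasks with no task-specific data or controller fine-tuning. The method addresses the need of existing controllers for dense motion references or complex interfaces and cleanly decouples the planner from the controller.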

Innovations:

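Proposes a compact, explicit 10-dimensional command space (base velocity, height, wrist targets) that is intuitive, general, modular, and expressive for whole-body behavior.
Uses multi-teacher KL distillation with context-conditioned gating to merge three complementary specialists (motion tracking, locomotion, fall recovery) into a single student policy, with no runtime switching.
Mixture-of-experts (MoE) student architecture, with expert routing trained through load-balancing and recovery losses so that each scenario receives the most suitable supervision.
Needs no task-specific data or controller fine-tuning: the VLM-driven planner generates commands directly, enabling zero-shot execution of natural-language tasks.
Extensible distillation framework: adding a specialist only requires a new teacher head and context channel, leaving existing components untouched.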

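Methodology: Three teacher policies are first trained separately: a whole-body motion tracking teacher (on retargeted human motion data, using PPO with an asymmetric actor-critic), a locomotion teacher (on a flat-terrain velocity-tracking reward, with a curriculum that mixes in motion data), and a fall-recovery teacher (using adversarial motion priors over locomotion and fall-recovery sequences). The three teachers are then distilled into a single student policy through context-conditioned KL distillation: the context signals are the commanded velocity norm and a fall flag, which determine which teacher supervises each action slice; the student is a mixture-of-experts (MoE) in which each expert corresponds to one teacher domain, routed by a gating network with a load-balancing loss. The student policy takes the 10-dimensional command and an 11-frame proprioceptive history and outputs 29-dimensional joint actions. Finally, a VLM-driven agentic planner decomposes a natural-language task into a sequence of 10-dimensional commands that drive the controller directly.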
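
The context-conditioned distillation step can be illustrated with a small sketch. It is a simplification under stated assumptions, not the authors' code: the policies are modeled as diagonal Gaussians, one teacher supervises the whole action vector (the paper assigns teachers per action slice), the KL direction is teacher-to-student, and the names and the 0.1 speed threshold are hypothetical.

```python
import math


def select_teacher(cmd_speed_norm, fallen, speed_threshold=0.1):
    """Pick the supervising teacher from the two context signals.

    speed_threshold is a hypothetical value chosen for illustration.
    """
    if fallen:
        return "fall_recovery"
    if cmd_speed_norm > speed_threshold:
        return "locomotion"
    return "motion_tracking"


def kl_diag_gaussian(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


def distill_loss(student, teachers, cmd_speed_norm, fallen):
    """Return the chosen teacher name and KL(teacher || student) for one sample."""
    name = select_teacher(cmd_speed_norm, fallen)
    t_mu, t_std = teachers[name]
    s_mu, s_std = student
    return name, kl_diag_gaussian(t_mu, t_std, s_mu, s_std)


# Two-dimensional toy actions; each policy is a (mean, std) pair.
teachers = {
    "motion_tracking": ([0.0, 0.0], [1.0, 1.0]),
    "locomotion": ([1.0, 0.0], [1.0, 1.0]),
    "fall_recovery": ([0.0, 2.0], [1.0, 1.0]),
}
student = ([0.0, 0.0], [1.0, 1.0])
name, loss = distill_loss(student, teachers, cmd_speed_norm=0.5, fallen=False)
```

In the example the commanded speed exceeds the threshold and the robot has not fallen, so the locomotion teacher supervises the sample.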

Key Results:

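HANDOFF matches SOTA locomotion controllers on velocity tracking.
HANDOFF offers one of the largest robust manipulation workspaces, supporting coordinated whole-body behaviors such as squat-and-reach and single-arm grasping while walking.
On a real Unitree G1, the VLM planner successfully executes multiple natural-language-driven tasks (e.g., fetching coffee, picking up objects) with no task-specific data or fine-tuning.
Fall recovery is validated in simulation and on hardware; the controller recovers autonomously after a fall.
In the comparison with existing methods (Table 1), HANDOFF is the only whole-body controller that combines a compact command, no external motion reference, and a single policy.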

Tech Stack:

PPO (Proximal Policy Optimization) for training the teacher and student policies
KL divergence (Kullback-Leibler divergence) for multi-teacher distillation
Mixture-of-Experts (MoE) architecture
Context-conditioned gating
Adversarial Motion Priors for fall-recovery training
Motion retargeting for adapting human data
Curriculum learning for locomotion training
Load-balancing loss for MoE routing
VLM (vision-language model)-driven agentic planner

Strengths:

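The command space is compact (10-dimensional) and intuitive; any planner (a human, a geometric planner, a VLM) can generate it directly.
Modular design: planner, perception, and controller are decoupled and can be replaced independently.
Single-policy whole-body control with no runtime switching, giving high robustness.
The distillation framework is extensible; adding specialists does not affect existing components.
Successful hardware validation, demonstrating the full path from language to real robot motion.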

Limitations:

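Some expressiveness is sacrificed (e.g., complex whole-body motions such as dancing), because the command space is less rich than dense motion references.
Teacher quality depends on existing datasets and training methods, which may cap student performance.
Not validated on complex terrain (e.g., stairs, slopes) or in dynamic environments (e.g., moving obstacles).
Fall recovery may only cover particular fall modes; generalization needs further testing.
The VLM planner depends on an external model, whose inference speed and accuracy may become a bottleneck.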

Relevance To Keywords:

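Reinforcement learning for physical robot control: the paper trains teacher and student policies with PPO, an application of RL to robot control.
Task and motion planning: the proposed 10-dimensional command space is the interface between task planning and motion control, and the VLM planner decomposes high-level tasks into subgoals.
Humanoid whole-body control: the core contribution is a whole-body controller that coordinates base, torso, and arms for locomotion and manipulation.
Mobile manipulation: the controller supports behaviors that combine locomotion and manipulation, such as grasping while walking and squat-and-reach.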

40. TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models [PASS]

Score: 31.5 / 27.8

Authors: Ziwen Kan, Yishuo Chen, Kecheng Li, Andrew Wen, Xiaomeng Wang, Liwei Wang, Jihao Duan, Song Wang, Hongfang Liu, Tianlong Chen

Published: 2026-06-04

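TL;DR: TRACE proposes a conditional estimation paradigm for handling missingness and irregular sampling in multimodal time series foundation models, and is more robust than prior fusion methods on healthcare and affective computing benchmarks.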

摘要翻译

时间序列基础模型（TS-FMs）旨在学习可泛化的时序表示，这些表示能够适应广泛的下游任务。在多模态现实场景中，时间序列常受时间错位及部分模态缺失的影响，此时不同模态可能在异构时间尺度上被观测或部分缺失。现有方法通常依赖朴素的插值或掩码策略，这些策略未能考虑跨模态依赖关系，往往导致错位或退化的表示。我们提出了 TRACE，这是一种针对存在缺失和不规则采样的多模态时间序列基础模型流水线的条件估计范式，允许从可用的辅助模态中系统地推断不完整的目标模态。我们在涵盖医疗和情感计算领域的多样化多模态基准上评估了 TRACE，包括 MIMIC-IV 临床数据集以及用于多模态情感分析的 CMU-MOSI 和 CMU-MOSEI 基准。在各种下游预测任务和缺失模态设置下，TRACE 始终优于先前的多模态融合方法，表现出对严重模态缺失的更强鲁棒性以及更可靠的跨模态表示。

Abstract

Time series foundation models (TS-FMs) aim to learn generalizable temporal representations that can be adapted to a wide range of downstream tasks. In real-world multimodal settings, time series are frequently affected by temporal misalignment and partial modality missingness, where different modalities are observed at heterogeneous time scales or are partially absent. Existing approaches typically rely on naive imputation or masking strategies, which fail to account for cross-modal dependencies and often lead to misaligned or degraded representations. We propose TRACE, a conditional estimation paradigm for multimodal time series foundation model pipelines under missingness and irregular sampling, allowing incomplete target modalities to be systematically inferred from available auxiliary modalities. We evaluate TRACE on diverse multimodal benchmarks spanning healthcare and affective computing, including the MIMIC-IV clinical dataset and the CMU-MOSI and CMU-MOSEI benchmarks for multimodal sentiment analysis. Across a range of downstream prediction tasks and missing-modality settings, TRACE consistently outperforms prior multimodal fusion approaches, demonstrating improved robustness to severe modality missingness and more reliable cross-modal representations.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0

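Scoring rationale: The title and abstract explicitly mention 'Multimodal' and 'Foundation Models', so the 'MultiModal' keyword is highly relevant (10). 'Unify Models' relates to the unified-representation idea behind foundation models and receives a middling score (5). 'Tokenizer', 'Visual Encoder', 'MLLM', 'World Models', and 'model-based RL' have little or no connection to the paper's content (time series and missing-value handling; no vision, language, or reinforcement learning) and score low (0-2). The weighted total is 31.5, above the dynamic passing score of 27.8. The author list does not include Yang Shi or the other designated experts, so no bonus applies.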

Keywords

Multimodal Time Series, Foundation Models, Temporal Conditional Estimation, Missingness, Cross-modal Dependencies, Healthcare, Affective Computing

In-Depth Analysis

Chinese Title: TRACE：面向多模态时间序列基础模型的时间条件估计方法

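Summary: This paper proposes TRACE, a conditional estimation paradigm for multimodal time series foundation models that targets the temporal misalignment and modality missingness common in practice. Existing methods usually rely on simple imputation or masking, which ignores cross-modal dependencies and degrades the representations. TRACE treats a missing modality as a latent variable to be conditionally estimated, uses a diffusion model to infer the missing parts from the available modalities, and then performs downstream prediction through an MoE fusion layer. Experiments on MIMIC-IV, CMU-MOSI, and CMU-MOSEI show that TRACE clearly outperforms existing fusion methods under severe modality missingness, produces representations closer to those of the true complete sequences, and improves robustness and cross-modal consistency.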

Innovations:

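Recasts missing-modality handling from deterministic imputation to conditional estimation, proposing the TRACE paradigm.
Introduces diffusion-based probabilistic signal-level estimation for cross-modal conditional inference.
Designs an MoE gating mechanism that aggregates information from auxiliary modalities into a robust cross-modal context.
The first to make conditional estimation an intermediate objective before fusion in a TS-FM pipeline, improving downstream performance.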

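Methodology: TRACE is a two-stage pipeline. Stage one performs multimodal conditional diffusion: for each target modality, a diffusion model estimates the missing parts, conditioned on the modality's observed parts and on a cross-modal context aggregated by MoE. Stage two aggregates the completed multimodal representations with the MoE fusion layer from FuseMoE and feeds them to a task-specific prediction head. The diffusion process runs in representation space rather than raw signal space.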

Key Results:

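On MIMIC-IV, CMU-MOSI, and CMU-MOSEI, TRACE outperforms existing fusion methods across a range of missing rates.
Representations produced by TRACE have higher cosine similarity to those of the true complete sequences, with the clearest advantage under severe missingness (e.g., a 30% missing rate).
TRACE consistently improves performance on downstream tasks such as sentiment analysis and clinical prediction, showing stronger robustness.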

Tech Stack:

Conditional diffusion model
MoE-based gating
Cross-modal context aggregation
FuseMoE fusion layer
Cosine similarity evaluation

Strengths:

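Elevates missing-modality handling from an imputation problem to a conditional estimation problem, with a clear theoretical framing.
Using a diffusion model makes the estimate probabilistic, preserving uncertainty and avoiding the bias of deterministic imputation.
The MoE gating mechanism filters auxiliary-modality information effectively, improving the quality of the cross-modal context.
Validated on several real multimodal benchmarks, with good generalization and robustness.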

Limitations:

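Diffusion inference is slow, which may hinder real-time applications.
The method targets representation-level estimation and is not validated in raw signal space.
Highly dependent on auxiliary-modality quality; performance may drop if all modalities are severely missing.
Joint training with large-scale pretrained TS-FMs is not explored.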

Relevance To Keywords:

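Multimodal large models: TRACE directly targets multimodal time series modeling and handles missing and misaligned modalities, which is closely related to multimodal large-model research.
Representation learning: TRACE uses conditional diffusion to learn representations closer to those of the true complete sequences, improving representation quality.
World models: time series foundation models can be viewed as a form of world model, and TRACE improves their robustness to missing data.
Model-based RL: TRACE's estimation paradigm may inform state estimation in model-based reinforcement learning.
Post-training: TRACE's conditional estimation could serve as an enhancement strategy in the post-training stage of multimodal models.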

41. LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents [PASS]

Score: 31.5 / 27.8

Authors: Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Published: 2026-06-04

TL;DR: LatentSkill converts textual skills into weight-space LoRA adapters, substantially reducing the context overhead of LLM agents and raising task success rates.

摘要翻译

智能体系统越来越多地使用文本技能来编码可复用任务流程，但在每一步将这些技能注入提示词中都会产生巨大的上下文开销，并以明文形式暴露技能内容。我们提出了 LatentSkill，这是一个通过预训练超网络将文本技能转换为即插即用 LoRA 适配器的框架。LatentSkill 将技能知识存储在权重空间而非上下文空间，从而消除了每步技能令牌，同时保留了模块化加载、缩放和组合的能力。在 ALFWorld 和 Search-QA 上，LatentSkill 优于相应的上下文技能基线，同时使用了显著更少的预填充令牌：在可见和不可见划分上，它在减少 64.1% 预填充令牌的同时，使 ALFWorld 的成功率分别提高了 21.4 和 13.4 个百分点；在 Search-QA 上，它在降低 72.2% 技能令牌开销的同时，使精确匹配提高了 3.0 个百分点。进一步分析表明，生成的技能 LoRAs 形成了结构化的语义几何，可以通过 LoRA 缩放系数进行精确控制，并且在技能组件对齐的情况下，可以通过参数空间算术进行组合。这些发现表明，权重空间技能为扩展大语言模型智能体提供了一种高效、模块化且暴露较少的基础。

Abstract

Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	4.0/10	6.0

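Scoring rationale: The paper proposes the LatentSkill framework, whose core contribution is converting textual skills into weight-space LoRA adapters to cut context overhead. It involves an RL-style environment (ALFWorld) and skill representations (loosely related to World Models), but not multimodal fusion, visual encoder design, or tokenizer optimization; relevance to the multimodal and model-architecture keywords is therefore low, and relevance to the RL keywords is moderate. The weighted total is 31.5, above the dynamic passing score of 27.8. None of the designated experts appears in the author list, so no bonus applies.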

Keywords

LLM Agents, LoRA Adapters, Weight Space, Skill Storage, In-Context Learning, ALFWorld, Skill Composition

In-Depth Analysis

Chinese Title: LatentSkill：从上下文文本技能到LLM智能体的权重潜在技能

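Summary: This paper proposes the LatentSkill framework to address the context overhead and skill exposure caused by injecting textual skills into LLM agent systems. Conventional approaches repeatedly insert skill text into the prompt, which drives up prefill cost and exposes skill content as plaintext. LatentSkill uses a pretrained hypernetwork to convert a textual skill description into a plug-and-play LoRA adapter, storing skill knowledge in weight space rather than context space; the skill text is removed at inference time while modular loading, scaling, and composition are preserved. Compared with the in-context skill baseline, LatentSkill improves ALFWorld success rate by 21.4/13.4 percentage points (seen/unseen splits) with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 percentage points with 72.2% lower skill-token overhead. Further analysis shows that the generated skill LoRA weights have a structured semantic geometry, can be precisely controlled through the scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. The work points to weight-space skills as an efficient, modular, and less exposed substrate for extending LLM agents.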

Innovations:

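Proposes a hypernetwork framework that converts textual skills into LoRA adapters, injecting skills with zero context tokens while preserving modular loading, unloading, and composition.
The first systematic demonstration that the hypernetwork-generated LoRA weight space has three properties: structure (domains are separable), controllability (via the scaling coefficient), and composability (component-level parameter arithmetic).
Two-stage training: document-level pretraining (reconstruction and completion tasks) teaches the hypernetwork to generate effective adapters from skill text, and trajectory-supervised fine-tuning then aligns them with the agent policy.
On ALFWorld and Search-QA, cuts token overhead substantially (64%-72%) while improving task performance, supporting the efficiency of weight-space skills over in-context skills.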

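Methodology: LatentSkill has three main parts: 1) a skill compiler (hypernetwork Gϕ) that maps a textual skill document to a set of LoRA adapters; 2) document-level pretraining, which trains the compiler on a corpus of skill documents through reconstruction and completion tasks so that information flows through the generated adapters; 3) trajectory-supervised fine-tuning, which fine-tunes the compiler on complete trajectories from a teacher agent (skill document plus multi-step decisions) so that the adapters capture stable policy information. At inference time, skills are compiled ahead of time and cached; the LoRA scaling coefficient α controls injection strength, and the framework supports single-skill loading, multi-skill composition by weight-space addition, and component-level decomposed composition.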
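
The inference-time behavior described here (scaling by α and composition by weight-space addition) follows from the standard LoRA update W' = W + α·B·A. The sketch below shows that arithmetic on tiny hand-written matrices. The hypernetwork that would generate B and A is out of scope, the helper names are hypothetical, and α is treated as a plain multiplier as in the summary above (many LoRA implementations additionally divide by the rank).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_skills(base, skills, alphas):
    """Return base + sum_i alpha_i * (B_i @ A_i) without modifying base."""
    out = [row[:] for row in base]
    for (b_mat, a_mat), alpha in zip(skills, alphas):
        delta = matmul(b_mat, a_mat)  # low-rank update for one skill
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                out[i][j] += alpha * value
    return out


# A 2x2 base weight and two rank-1 "skill" adapters, each a (B, A) pair.
W = [[1.0, 0.0], [0.0, 1.0]]
skill_a = ([[1.0], [0.0]], [[0.0, 2.0]])  # delta = [[0, 2], [0, 0]]
skill_b = ([[0.0], [1.0]], [[3.0, 0.0]])  # delta = [[0, 0], [3, 0]]

single = apply_skills(W, [skill_a], [1.0])
composed = apply_skills(W, [skill_a, skill_b], [1.0, 0.5])
disabled = apply_skills(W, [skill_a, skill_b], [0.0, 0.0])
```

Setting every α to 0 returns the base weights unchanged, which matches the reported observation that α=0 recovers the original model.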

Key Results:

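On ALFWorld, LatentSkill improves success rate by 21.4 and 13.4 percentage points on the seen and unseen splits while using 64.1% fewer prefill tokens.
On Search-QA, LatentSkill improves exact match by 3.0 percentage points while reducing skill-token overhead by 72.2%.
The generated skill LoRA weights form separable domain clusters in semantic space (e.g., different ALFWorld room types).
Adjusting the LoRA scaling coefficient α precisely controls skill strength; α=0 recovers the original model.
When skills are decomposed into semantically aligned components, parameter-space addition composes them effectively and avoids the over-amplification caused by direct addition.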

Tech Stack:

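LoRA (Low-Rank Adaptation)
Hypernetwork
Transformer architecture (LLM backbone)
Document-level pretraining (reconstruction and completion tasks)
Trajectory-supervised fine-tuning
Parameter-space arithmetic (weight addition)
ALFWorld environment (text game)
Search-QA dataset (search-based question answering)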

Strengths:

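Moves skill knowledge from context space to weight space, addressing efficiency, modularity, and security at once.
Clear training recipe: pretraining first gives the hypernetwork basic generation ability, then trajectory fine-tuning aligns it with the task policy; the two-stage design is sound.
Thorough experiments: performance improves and overhead drops sharply on two different tasks, with in-depth analysis of the structure, controllability, and composability of the weight space.
Complements existing work on LoRA composition and hypernetwork generation, offering a new paradigm for agent skill management.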

Limitations:

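The quality of generated LoRAs depends on training-data coverage; generalization to unseen skill types may be limited.
Only textual skill descriptions are supported; converting multimodal skills (e.g., images, video) is not explored.
Component-level composition requires manually decomposing skills into semantically aligned components; automatic decomposition is not addressed.
Experiments cover only two environments; effectiveness on more complex long-horizon tasks (e.g., WebAgent, robot control) remains to be verified.
The hypernetwork adds parameters and training cost, although inference-time overhead goes down.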

Relevance To Keywords:

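Unify Models / native multimodal large models: the paper focuses on skill representations for LLM agents and does not directly involve unified multimodal models, though the weight-space skill idea could extend to multimodal skills.
World Models: not addressed, though skills can be seen as local knowledge of the task world, and weight-space skills might serve as policy modules within a world model.
Representation Learning: the core contribution is representing skills as LoRA weights and revealing their structured semantic geometry, which falls under representation learning.
Model-Based RL / reinforcement learning: the paper uses trajectory-supervised fine-tuning, akin to imitation learning, and does not involve model-based RL. Skill composition can be seen as policy composition, related to modular policies in RL.
Post-training: the two-stage training (pretraining plus fine-tuning) falls under post-training, injecting skill knowledge into model parameters.
Unified understanding and generation in multimodal large models: the paper is not multimodal, though generating weights from text can be seen as an understanding-to-generation process.
Overall relevance is moderate to fairly high; the main contributions lie in representation learning and post-training, with potential overlap with multimodal and world-model work.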

42. StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset [PASS]

Score: 31.5 / 27.8

Authors: Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang

Published: 2026-06-04

TL;DR: This paper proposes StoryVideoQA, a large-scale deep video understanding dataset generated by a multi-agent framework, and introduces PlotTree to enhance storyline reasoning in video question answering.

摘要翻译

视频问答（VideoQA）旨在针对给定视频回答相关问题。尽管现有方法在事实型视频问答（VideoQA）上表现卓越，但在深层视频理解（DVU）方面却面临挑战，后者需要对复杂故事情节进行理解。这一挑战源于视频固有的长程内容、多维度问题类型以及实例级故事元素，这些因素共同限制了人工构建的深层视频理解（DVU）数据集的规模和多样性。为了解决这些问题，我们先前引入了故事心智（StoryMind），旨在自动构建具有平衡细粒度主题的深层视频理解（DVU）数据集。尽管它能够为电视剧生成高质量的问答对（QAs），但在处理更长且更复杂的电影时，却表现出显著的性能退化。在本文中，我们进一步设计了 StoryMindv2，这是一个增强的多智能体协作框架，旨在为电视剧和电影生成高质量的深层视频理解（DVU）数据集。通过整合一种新颖的监督者引导生成机制和改进的多评审投票策略，该框架被用于构建故事视频问答（StoryVideoQA），这是迄今为止最大的深层视频理解（DVU）数据集，包含超过 363K 个问答对（QAs），涵盖 393.2 小时的多样化故事视频，其中包括电视剧（平均 1,635 秒）和电影（平均 7,878 秒）。在该大规模基准上对 20 种最先进的视频问答（VideoQA）方法的综合评估表明，它们无法完全维持长程角色关联，也无法构建对复杂故事情节的一致理解。为了弥合这一差距，我们提出情节树（PlotTree），这是一种新颖的视频理解代理，通过将长程视频内容重新组织为层次化情节结构，从而实现在故事视频问答（StoryVideoQA）上的高效故事情节推理。项目页面：https://github.com/nercms-mmap/StoryVideoQA/

Abstract

Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent, re-organizing long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	5.0/10	7.5
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on VideoQA dataset generation and storyline reasoning agents. It is highly relevant to MultiModal (Video+Text) and moderately relevant to MLLM (using LLMs in agents). Visual Encoder is implicit but not a core contribution. Unify Models, Tokenizer, World Models, and model-based RL are largely unrelated to the core content (dataset generation and reasoning agent). No expert authors from the specified list were found in the author list, so no bonus points were applied. The weighted total score is 31.5, exceeding the dynamic passing threshold of 27.8.

Keywords

Video Question Answering, Deep Video Understanding, Multi-agent Framework, Storyline Reasoning, Large-scale Dataset, PlotTree, Video Understanding

In-Depth Analysis

Chinese Title: StoryVideoQA：通过大规模、多类型和自动生成的数据集扩展深度视频理解

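Summary: To address the weakness of existing methods on complex storyline understanding in deep video understanding (DVU), this paper proposes StoryMindv2, an enhanced multi-agent collaboration framework for automatically generating high-quality, large-scale, topic-balanced question-answer pairs. By introducing a supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework overcomes the performance degradation of its predecessor on long videos such as movies. On this basis the authors build StoryVideoQA, the largest DVU dataset to date, with over 363K QA pairs covering 393.2 hours of TV series and movies, balanced across 14 fine-grained topics. An evaluation of 20 SOTA methods finds that existing methods struggle to maintain long-range character associations and a coherent understanding of the storyline. To close this gap, the authors propose the PlotTree agent, which reorganizes video content into a hierarchical plot structure for efficient reasoning and markedly improves DVU performance.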

Innovations:

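Proposes the StoryMindv2 multi-agent collaboration framework, which integrates a supervisor-guided generation mechanism and a failure archive to improve the accuracy of QA generation for long videos.
Replaces strict consistency filtering with a refined multi-reviewer voting strategy, enlarging the dataset while maintaining quality.
Builds the StoryVideoQA dataset, currently the largest and most diverse DVU benchmark, covering TV series and movies and balanced across 14 fine-grained topics.
Proposes the PlotTree video understanding agent, which reasons efficiently over long-range storylines through a hierarchical plot-tree structure.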

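Methodology: The paper uses the StoryMindv2 multi-agent collaboration framework to generate QA pairs automatically, with two key modules: supervisor-guided generation (using feedback from a failure archive) and multi-reviewer voting (a majority-pass policy). StoryVideoQA is built on this basis, covering 3 TV series and 78 movies. Twenty SOTA methods (including VLMs, MLLMs, and agent-based methods) are evaluated on the DVU task. Finally, the paper proposes the PlotTree agent, which converts the video into textual plot nodes, builds a hierarchical tree through recursive clustering and plot condensation, and retrieves the relevant nodes from the tree for reasoning.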

Key Results:

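StoryVideoQA contains over 363K QA pairs covering 393.2 hours of video; the average video length is 1,635 seconds for TV series and 7,878 seconds for movies.
The dataset is balanced across the 14 fine-grained topics (Gini index 0.927, entropy 3.795).
The 20 SOTA methods show a marked performance drop on StoryVideoQA, indicating that existing methods cannot maintain long-range character associations or a coherent understanding.
The PlotTree agent achieves the best performance on the DVU task and handles long-range storyline reasoning effectively.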
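
The two balance figures reported for the topic distribution are easier to interpret next to their theoretical maxima for 14 classes. The sketch below assumes that "Gini index" here means the Gini-Simpson diversity (1 minus the sum of squared proportions) and that entropy is measured in bits; both assumptions are consistent with the reported values but are not stated in the report. The function names are illustrative.

```python
import math


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of the distribution given by raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def gini_simpson(counts):
    """Gini-Simpson diversity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


uniform_14 = [1] * 14  # a perfectly balanced 14-topic distribution
max_entropy = shannon_entropy_bits(uniform_14)  # log2(14), about 3.807
max_gini = gini_simpson(uniform_14)  # 1 - 1/14, about 0.929
```

Under these assumptions the reported 3.795 and 0.927 sit just below the maxima of about 3.807 and 0.929, which is what a nearly uniform topic distribution would produce.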

Tech Stack:

Multi-agent collaboration framework (StoryMindv2)
Supervisor-Guided Generation
Multi-Reviewer Voting
Hierarchical plot-tree structure (PlotTree)
Node Clustering and Plot Condensation
Gini index and entropy for measuring distribution balance
LLM-driven video understanding agents

Strengths:

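Large and diverse dataset covering TV series and movies with balanced fine-grained topics, suited to a comprehensive evaluation of DVU ability.
The StoryMindv2 generation framework markedly improves QA quality through supervisor guidance and voting, overcoming the difficulty of generating for long videos.
The PlotTree agent organizes video content into a hierarchical structure, which addresses long-range reasoning.
The evaluation of 20 SOTA methods exposes the limits of existing methods and gives direction for future research.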

Limitations:

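Automatically generated QA pairs may still contain noise or bias: despite multi-reviewer voting, generation relies entirely on LLMs.
The dataset is based mainly on English-language film and TV content and may lack cross-lingual and cultural diversity.
PlotTree relies on textualized plot nodes and may lose some visual detail.
Evaluation centers on QA accuracy and does not analyze in depth the specific error types models make in complex reasoning.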

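Relevance To Keywords: The paper is highly relevant to "native multimodal large models" and "unified understanding and generation in multimodal large models", since it evaluates multiple MLLMs on deep video understanding and proposes the PlotTree agent to strengthen understanding. It relates to "representation learning" and "world models" because PlotTree captures causal and temporal relations in video through its hierarchical plot structure, which implicitly reflects world-model ideas. Its connection to "reinforcement learning" and "post-training" is weak, though the supervisor-guided mechanism in the generation framework can be viewed as a form of feedback learning.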

43. Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models [PASS]

Score: 31.5 / 27.8

Authors: Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

Published: 2026-06-04

TL;DR: This paper introduces FEPBench to benchmark T2I models for scientific illustration generation using MLLMs and human experts, revealing current models struggle with text-rendering, reasoning enrichment, and semantic precision.

摘要翻译

科学插图是传达研究成果的关键工具，尤其在自然科学领域，它们将复杂的概念和过程可视化。随着文生图（T2I）模型能力日益增强，研究人员开始将其用于科学插图生成。然而，现有的基准测试通常在整体层面评估输出，忽视了细粒度元素，而科学推理能力和输出简洁性仍量化不足。我们引入了 FEPBench，这是一个基于精心挑选的多学科、多布局类型的高质量科学插图构建的基准测试。在多模态大语言模型（MLLMs）和人类专家的协助下，我们提供了细粒度原子集标注，并从三个维度系统性地评估了 T2I 模型：指令忠实度、推理丰富度和语义精确度。我们的评估进一步将模型性能分解为视觉、文本、关系和布局元素层面。结果表明，即使是像 GPT Image 2 和 Nano Banana Pro 这样的最先进（SOTA）闭源模型，仍然存在文本渲染瓶颈、推理丰富度有限以及难以平衡生成丰富性与精确度的问题。这些发现为改进和部署用于科学插图生成的 T2I 模型提供了实用指导。我们将发布基准数据、原子集标注和评估代码。

Abstract

Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released by us.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on benchmarking T2I models for scientific illustration generation using MLLMs and human experts. MLLM and MultiModal are highly relevant as they are core tools and domains (T2I). Unify Models, Tokenizer, Visual Encoder, World Models, and model-based RL are not central to this work, which is about evaluation rather than model architecture unification, tokenization, encoder design, world modeling, or reinforcement learning.

Keywords

Natural-Science Illustration Generation, T2I Models, Benchmarking, MLLMs, Instruction Faithfulness, Reasoning Enrichment, Semantic Precision

In-Depth Analysis

Chinese Title: 忠实、丰富且精确：基于文本到图像模型的自然科学插图生成基准测试

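Summary: The paper proposes FEPBench, a benchmark for evaluating how well text-to-image (T2I) models generate natural-science illustrations. The motivation is that existing benchmarks lack fine-grained evaluation, do not separate content required by the instruction from content supplemented by the reference, and leave scientific reasoning ability and output conciseness unquantified. The authors collect 1,300 high-quality illustrations from three disciplines (physics and materials, geography and ecology, biomedicine) and two layouts (single-panel, multi-panel); experts write free-form prompts, which an LLM converts to structured prompts. Using a multimodal large language model (MLLM) plus human review, each illustration is annotated with an atom set (text, visual, relation, and layout atoms), and the atoms are split into instruction atoms and reasoning atoms. Generated images are scored along three dimensions: instruction faithfulness (IF), reasoning enrichment (RE), and semantic precision (SP). Experiments on several SOTA closed- and open-source T2I models show substantial room for improvement in text rendering, scientific reasoning, and balancing generation richness against precision. The benchmark provides fine-grained, interpretable, redundancy-aware evaluation and practical guidance for improving and deploying T2I models.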

Innovations:

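Proposes an atom-set representation that decomposes a scientific illustration into four kinds of atoms (text, visual, relation, layout) for fine-grained evaluation.
Splits the atom set into instruction atoms and reasoning atoms, separating prompt-required content from reference-supplemented content, and introduces three complementary evaluation dimensions: instruction faithfulness, reasoning enrichment, and semantic precision.
Builds a multi-discipline benchmark of 1,300 high-quality natural-science illustrations covering physics and materials, geography and ecology, and biomedicine, with single-panel and multi-panel layouts.
Designs two prompt formats, free-form and structured, to compare how prompting strategy affects generation.
Proposes an MLLM-based automatic evaluation pipeline that agrees closely with human judgment and evaluates different element types and dimensions robustly.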

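Methodology: The paper proceeds in four steps: 1) Data collection: 1,300 high-quality scientific illustrations are selected from Nature-family journals and reviewed by experts to ensure they are clear, contain both graphical and textual elements, and exclude experimental data plots. 2) Prompt construction: PhD-level experts write free-form prompts based on the paper context, and GPT-5.4 converts each free-form prompt into a structured prompt with the meaning unchanged. 3) Atom-set annotation: an MLLM extracts the atom set of each reference illustration, OCR extracts the text in the figure, and human experts correct every item. 4) Generation and evaluation: T2I models generate illustrations; an MLLM checks each generated image against the reference atom set, determines whether each instruction atom and reasoning atom is realized, and detects unexpected atoms; the three metrics are then computed.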

Key Results:

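Current SOTA closed-source models (e.g., GPT Image 2, Nano Banana Pro) still have considerable room for improvement on all three metrics: instruction faithfulness, reasoning enrichment, and semantic precision.
Text rendering is the main bottleneck: models frequently make errors when generating scientific labels and other text elements.
Scientific reasoning is weak: models struggle to recover supplementary details that appear in the reference and are scientifically meaningful but not explicitly requested.
Models struggle to balance richness against precision and tend toward unsupported semantic over-generation.
Structured prompts improve some aspects over free-form prompts but do not fully resolve these problems.

Tech Stack:

Multimodal large language models (MLLMs) for atom-set extraction and generated-image evaluation
OCR model for extracting in-figure text
GPT-5.4 for converting free-form prompts into structured prompts
Atom-set representation (textual, visual, relation, and layout atoms)
Status evaluation framework (three states: instruction satisfied, reasoning produced, unexpected generation)
Three evaluation metrics: instruction faithfulness (IF), reasoning enrichment (RE), semantic precision (SP)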
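
The report names the three metrics but does not give their formulas. The sketch below shows one plausible way to turn per-image atom judgments into IF, RE, and SP. The formulas, the `AtomJudgement` container, and the function names are assumptions made for illustration, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class AtomJudgement:
    """Counts for one generated image (hypothetical container)."""
    instruction_total: int      # instruction atoms in the reference atom set
    instruction_satisfied: int  # instruction atoms judged as realized
    reasoning_total: int        # reasoning atoms in the reference atom set
    reasoning_produced: int     # reasoning atoms judged as realized
    unexpected: int             # atoms in the output that match nothing


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 for an empty denominator instead of raising.
    return numerator / denominator if denominator else 0.0


def fep_scores(j: AtomJudgement) -> dict:
    """Assumed definitions: IF and RE are recall-style, SP is precision-style."""
    matched = j.instruction_satisfied + j.reasoning_produced
    return {
        "IF": _ratio(j.instruction_satisfied, j.instruction_total),
        "RE": _ratio(j.reasoning_produced, j.reasoning_total),
        "SP": _ratio(matched, matched + j.unexpected),
    }


example = AtomJudgement(
    instruction_total=10, instruction_satisfied=8,
    reasoning_total=5, reasoning_produced=2,
    unexpected=5,
)
print(fep_scores(example))
```

Under these assumed definitions, an image can score well on IF while scoring poorly on SP if it adds many unsupported elements, which is the richness-versus-precision tension the results describe.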

Strengths:

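Fine-grained atom-set evaluation overcomes the inability of holistic evaluation to diagnose errors in specific elements.
Explicitly separates prompt-required content from reference-supplied content, avoiding the conflation of faithfulness with reconstruction.
Covers multiple natural-science disciplines and two layout types, giving reasonable representativeness.
Provides both free-form and structured prompts, so the effect of prompt design on generation can be analyzed.
The MLLM-based evaluation pipeline is highly automated, consistent with human judgment, and scalable.

Limitations:

At 1,300 images, the dataset may not cover every scientific subfield and illustration style.
Atom-set annotation relies on MLLMs and human review, which introduces subjectivity and cost.
Evaluation accuracy is bounded by the MLLM's visual understanding; complex relations or subtle errors may be misjudged.
Only atom matching between generated and reference images is evaluated; the logical correctness of the scientific content is not assessed directly.
Dynamic or interactive scientific illustrations (e.g., animations, 3D models) are not covered.

Relevance To Keywords:

Native multimodal large models: the evaluated T2I models are multimodal generative models and so relate to native multimodal large models.
Unified understanding and generation in multimodal large models: the paper uses MLLMs for atom extraction and evaluation, combining understanding with generation.
Representation learning: the atom set is a structured representation that helps capture the content of a scientific illustration.
World models: generating scientific illustrations requires world knowledge from physics, biology, and other fields; the paper evaluates models' scientific reasoning, which relates to world models.
Reinforcement learning / post-training: not addressed directly, but the evaluation results could provide feedback to guide post-training.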

44. DAST: A VLM-LLM Framework for Cross-Interface Anomaly Detection in O-RAN [PASS]

Score: 30.0 / 27.8

Authors: Francesco Spinelli, Esteban Municio, Pau Baguer, Gines Garcia-Aviles, Xavier Costa-Perez

Published: 2026-06-04

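TL;DR: DAST is a zero-shot cross-interface anomaly detection framework built on a VLM-LLM pipeline; on O-RAN testbed traces it reaches an F1 score of 0.910 and an accuracy of 0.843.

Abstract (Chinese translation)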

O-RAN 支持一种解耦的基带栈，其中包含通过标准化开放接口进行通信的可编程功能。这种使能多厂商组成的开放性同时也扩大了攻击面，涉及构成计算连续统的逻辑解耦层级。在这些威胁中，拒绝服务（DoS）攻击和性能退化攻击占据了已编目 O-RAN 威胁的大多数，尤其难以检测。传统的时间序列异常检测（TSAD）方法在这一新范式下失效：标记基线稀缺，威胁演化速度快于检测器重新训练的速度，且高维多变量遥测数据压倒了单体推理模型。为应对这些挑战，我们提出了 DAST，这是一种用于 O-RAN 跨接口异常检测的零样本多智能体框架，该框架串联了一个三阶段 VLM（视觉 - 语言模型）→ LLM（大语言模型）→ VLM 管道。DAST 将多变量关键性能指标（KPI）流转换为视觉表示，依据 O-RAN 领域知识对文本化的每接口描述进行评分，并在高分辨率热力图上验证嫌疑对象，从而输出问题接口、异常时间间隔、与 O-RAN WG11 对齐的指示性操作影响评级以及决策理由。我们在从 O-RAN 测试床收集的、具有代表性的性能退化场景下的真实网络轨迹上评估了 DAST，取得了 0.910 的 F1-Score 和 0.843 的 Accuracy，优于最先进的 TSAD 基线。

Abstract

O-RAN enables a disaggregated baseband stack with programmable functions that communicate over standardized open interfaces. The same openness that enables multi-vendor composition also expands the attack surface across logically decoupled tiers that make up the compute continuum. Among these threats, Denial-of-Service and performance-degradation attacks, which account for the majority of catalogued O-RAN threats, are particularly difficult to detect. Traditional Time-Series Anomaly Detection (TSAD) methods fail in this new regime where labelled baselines are scarce, threats evolve faster than detectors can be retrained, and the high-dimensional multivariate telemetry overwhelms monolithic inference models. To address these challenges, we present DAST, a zero-shot multi-agent framework for cross-interface anomaly detection in O-RAN that chains a three-stage VLM $\rightarrow$ LLM $\rightarrow$ VLM pipeline. DAST converts multivariate KPI streams into visual representations, scores textual per-interface descriptions against O-RAN domain knowledge, and verifies suspects on high-resolution heatmaps to output the problematic interfaces, the anomalous time intervals, an indicative O-RAN WG11-aligned operational impact rating and the decision rationale. We evaluate DAST on real network traces collected from an O-RAN testbed under representative performance degradation scenarios, achieving 0.910 F1-Score and 0.843 Accuracy, outperforming state-of-the-art TSAD baselines.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	0.0/10	0.0
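
The scoring tables in this report appear to follow a simple rule: each keyword's score is its weight times its relevance, and the total is compared with the threshold shown on the "Score" line (27.8). The sketch below reproduces the DAST total under that reading. The function names and the use of `>=` for the pass check are assumptions; the report does not say how a total exactly equal to the threshold is handled.

```python
# Rows copied from the DAST table above: (keyword, weight, relevance out of 10).
DAST_ROWS = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 3.0),
    ("World Models", 1.5, 1.0),
    ("MLLM", 1.5, 6.0),
    ("MultiModal", 1.5, 7.0),
    ("model-based RL", 1.5, 0.0),
]


def weighted_score(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


def verdict(total, threshold=27.8):
    """Assumed rule: a paper passes when its total reaches the threshold."""
    return "PASS" if total >= threshold else "FAIL"


total = weighted_score(DAST_ROWS)
print(total, verdict(total))
```

The same arithmetic applied to the other tables in this section matches the totals printed on their "Score" lines.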

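Scoring rationale: The paper's core is a VLM-LLM pipeline for O-RAN anomaly detection, so it is fairly relevant to MLLM and MultiModal (it involves multimodal processing of vision and text and converts KPIs into visual representations). Its relevance to World Models, model-based RL, and Unify Models (which refers to a unified architecture rather than a pipeline) is low, since the paper does not involve reinforcement learning, reward mechanisms, or world-model construction. Tokenizer and Visual Encoder are internal VLM components but not core modules proposed by the paper, so they score low. The author list does not include Yang Shi or the other designated experts.

Keywords

O-RAN, Anomaly Detection, VLM-LLM, Cross-Interface, Zero-shot, Multivariate KPI, Heatmap, Multi-agent

In-depth analysis

Chinese Title: DAST：面向O-RAN跨接口异常检测的VLM-LLM框架

Summary: The paper proposes DAST, a zero-shot multi-agent framework for cross-interface anomaly detection in O-RAN. O-RAN's openness and multi-vendor composition enlarge the attack surface, and denial-of-service and performance-degradation attacks are especially hard to detect. Traditional time-series anomaly detection (TSAD) methods struggle with scarce labels, fast-evolving threats, and high-dimensional multivariate telemetry. DAST uses a three-stage VLM→LLM→VLM pipeline: it first converts multivariate KPI streams into visual representations, then scores a textual description of each interface against O-RAN domain knowledge, and finally verifies suspect interfaces on high-resolution heatmaps, outputting the problematic interfaces, the anomalous time intervals, an operational impact rating, and the decision rationale. On performance-degradation scenarios from a real O-RAN testbed, DAST achieves an F1 score of 0.910 and an accuracy of 0.843, outperforming existing TSAD baselines.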

Innovations:

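Proposes, reportedly for the first time, a zero-shot multi-agent VLM-LLM architecture for O-RAN cross-interface anomaly detection that needs no labeled data or fine-tuning.
Uses a three-stage pipeline (VLM→LLM→VLM) that mimics the reasoning of a human network expert, handling visual perception, language reasoning, and fine-grained verification in turn.
Replaces retraining with externally updatable O-RAN domain knowledge, supporting multi-vendor configurations and zero-day performance-degradation patterns.
Outputs machine-readable reports containing precise time intervals, the interfaces involved, an operational impact rating, and the chain of reasoning, which facilitates downstream root-cause analysis.

Methodology: DAST uses a collaborative multi-agent reasoning pipeline. Stage 1: multivariate KPI time series from four O-RAN interfaces are rendered as vertically stacked line charts, and VLM-1 generates a textual description of each series (metric label, axis range, general behavior). Stage 2: LLM-1 scores each interface's description against O-RAN domain knowledge (interface roles, traffic types, protocols, and so on) and identifies the most anomalous interfaces. Stage 3: for high-scoring suspect interfaces, VLM-2 performs fine-grained verification on high-resolution heatmaps and outputs the anomalous time intervals and a WG11-aligned operational impact rating. The whole process is zero-shot and requires no training.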

Key Results:

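On performance-degradation scenarios from a real O-RAN testbed, DAST reaches an F1 score of 0.910 and an accuracy of 0.843.
Outperforms several existing time-series anomaly detection baselines (e.g., MSCRED, SpotLight).
Detects cross-interface cascading effects (e.g., a signaling storm on E2 degrading F1-c and F1-u).
Outputs precise anomalous time intervals and interface localization, and supports operational impact rating.

Tech Stack:

Vision-Language Model (VLM): visual perception and image description
Large Language Model (LLM): semantic reasoning and domain-knowledge integration
Multi-agent system
Zero-shot learning
Time-series visualization (line charts, heatmaps)
O-RAN domain knowledge base (interface roles, protocols, traffic patterns)
Chain-of-thought reasoning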

Strengths:

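Zero-shot capability: needs no labeled data or retraining and adapts to multi-vendor deployments and zero-day attacks.
Cross-interface detection: multi-agent collaboration captures cascading effects, covering the blind spots of single-interface detection.
Interpretability: outputs the chain of reasoning and an operational impact rating, which operators can readily understand.
Updatable domain knowledge: the external knowledge base is easy to update as O-RAN specifications evolve.
Strong performance: outperforms existing TSAD methods on a real testbed.

Limitations:

Relies on the general capabilities of the VLM and LLM, which may not be sensitive enough to subtle domain-specific patterns.
The three-stage pipeline adds inference latency, which makes it poorly suited to real-time detection.
Evaluation covers a single testbed and a limited set of scenarios, so generalization needs further validation.
Robustness to adversarial attacks (e.g., deliberately misleading the VLM or LLM) is not discussed.
The domain knowledge base must be designed by hand, and its completeness affects detection quality.

Relevance To Keywords:

Unify Models / native multimodal large models: DAST combines a VLM and an LLM into a collaborative pipeline, reflecting joint reasoning across multimodal large models.
World Models: DAST uses the domain knowledge base to model expected O-RAN network behavior, which can be viewed as a lightweight world model.
Representation Learning: the VLM turns time series into visual representations and the LLM turns text descriptions into semantic representations, but no end-to-end representation learning is involved.
Model-Based RL: the paper does not involve reinforcement learning, but the multi-agent collaboration and scoring mechanism are loosely analogous to model-based decision-making.
Unified understanding and generation in multimodal large models: the VLM performs both visual understanding and text generation while the LLM performs text reasoning, in line with the unified approach.
Post-training: DAST is fully zero-shot and involves no post-training, though the domain knowledge base can be seen as a lightweight substitute for it.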

45. When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training [PASS]

Score: 30.0 / 27.8

Authors: Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen

Published: 2026-06-04

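TL;DR: The paper proposes Evidence-Calibrated Policy Optimization (ECPO) to address credit assignment in long-horizon LLM agent training; it outperforms the GiGPO baseline on ALFWorld and WebShop.

Abstract (Chinese translation)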

长视野的大语言模型代理需要强化学习算法，能够在奖励稀疏且延迟的情况下为中间决策分配信用。近期基于组的方法（如 GiGPO）通过在重复锚点状态构建步级优势，改进了 GRPO。然而，我们发现这种密集信用在统计上可能不可靠：在有限的轨迹下，罕见但幸运的动作可能获得过大的优势，导致发散的锚点偏差和后期训练振荡。我们提出证据校准策略优化（ECPO），这是一种无评论器的策略优化算法，它在策略更新前校准步级信用。ECPO 结合了证据校准动作优势（按典型动作分组轨迹并缩小低计数估计）与方差门控信用权重（抑制被动作内噪声主导的锚点状态）。在 ALFWorld 和 WebShop 上使用 Qwen2.5-1.5B/7B 的实验表明，ECPO 持续优于强基线，在 Qwen2.5-1.5B 上使 GiGPO 在 ALFWorld/WebShop 上的成功率分别提升 5.2/7.3 个百分点，同时仅增加 0.1% 的优势计算开销。

Abstract

Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late-stage training oscillation. We propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. ECPO combines Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low-count estimates, with Variance-Gated Credit Weighting, which suppresses anchor states dominated by within-action noise. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5-1.5B while adding only 0.1% additional advantage-computation overhead.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	6.0/10	9.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	3.0/10	4.5

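Scoring rationale: The paper's core is the Evidence-Calibrated Policy Optimization (ECPO) algorithm for credit assignment in long-horizon LLM agent training. Although it uses Qwen2.5 models (which the scoring counts toward MLLM and MultiModal), it does not involve tokenizers, visual encoders, model unification, or world-model construction; the policy optimization is model-free reinforcement learning and only weakly related to model-based RL. The author list does not include any of the designated experts.

Keywords

Evidence-Calibrated Policy Optimization, Long-Horizon LLM Agent, Credit Assignment, Policy Optimization, Qwen2.5, Critic-free, Advantage Estimation

In-depth analysis

Chinese Title: 当更密集的信用还不够：面向长周期LLM智能体训练的基于证据校准的策略优化

Summary: The paper addresses credit assignment under sparse, delayed rewards in long-horizon LLM agent training. It shows that existing group-based dense-credit methods such as GiGPO suffer, under limited rollouts, from a statistically unreliable "divergent anchor bias": rare but lucky actions receive inflated advantages, which causes late-stage training oscillation. It proposes Evidence-Calibrated Policy Optimization (ECPO) with two components. Evidence-Calibrated Action Advantage (ECA) groups rollouts by canonical action and applies shrinkage estimation to calibrate the advantages of low-count actions. Variance-Gated Credit Weighting (VarGate) decomposes the return variance at each anchor state and suppresses statistically unreliable step-level credit. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms baselines such as GRPO and GiGPO, improving on GiGPO by +5.2 and +7.3 success points with Qwen2.5-1.5B while adding only 0.1% advantage-computation overhead.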

Innovations:

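Identifies and formalizes, reportedly for the first time, the "divergent anchor bias" problem in long-horizon LLM agent training, exposing the statistical unreliability of dense credit under limited samples.
Proposes Evidence-Calibrated Action Advantage (ECA), which reduces advantage overestimation for low-count actions by grouping rollouts by canonical action and applying shrinkage estimation.
Proposes Variance-Gated Credit Weighting (VarGate), which uses a variance decomposition to separate between-action signal from within-action noise and suppresses unreliable step-level credit.
Achieves evidence-calibrated step-level credit assignment within a critic-free framework at minimal computational overhead.

Methodology: ECPO builds on the group-based policy optimization framework. For each anchor state, it collects the subsequent returns of all rollouts, groups them by canonical action, and applies a shrinkage estimate to each action's returns (e.g., Bayesian or empirical-Bayes shrinkage) to obtain calibrated action-level advantages. It then uses a variance decomposition to compute the share of the anchor state's total variance that is explained by between-action differences; if that share falls below a threshold, the anchor's step-level credit is suppressed. Finally, the calibrated step-level advantages are combined with trajectory-level advantages by weighting and used for the policy update. Training uses GRPO-style group sampling and KL regularization.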
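
The two calibration steps described above can be illustrated at a single anchor state. This is a minimal sketch, not the paper's algorithm: the shrinkage factor `n / (n + prior_strength)`, the hard 0/1 gate, and the parameter values are all assumptions chosen for readability.

```python
from collections import defaultdict
from statistics import mean


def calibrated_advantages(samples, prior_strength=2.0, gate_threshold=0.2):
    """samples: list of (canonical_action, return) pairs seen at one anchor state.

    Returns (advantages_by_action, between_action_variance_share).
    """
    returns = [r for _, r in samples]
    anchor_mean = mean(returns)

    by_action = defaultdict(list)
    for action, ret in samples:
        by_action[action].append(ret)

    # Shrinkage: actions seen only a few times are pulled toward zero advantage.
    advantages = {}
    for action, rets in by_action.items():
        n = len(rets)
        shrink = n / (n + prior_strength)  # assumed form of the shrinkage factor
        advantages[action] = shrink * (mean(rets) - anchor_mean)

    # Variance decomposition: share of total variance explained by action choice.
    total_var = sum((r - anchor_mean) ** 2 for r in returns) / len(returns)
    between_var = sum(
        len(rets) * (mean(rets) - anchor_mean) ** 2 for rets in by_action.values()
    ) / len(returns)
    share = between_var / total_var if total_var > 0 else 0.0

    # Gate: drop step-level credit when within-action noise dominates.
    gate = 1.0 if share >= gate_threshold else 0.0
    return {a: gate * adv for a, adv in advantages.items()}, share


# "look" succeeded once by luck; "open" was tried three times.
samples = [("open", 1.0), ("open", 1.0), ("open", 0.0), ("look", 1.0)]
print(calibrated_advantages(samples, gate_threshold=0.05))
print(calibrated_advantages(samples))
```

In the example, the single lucky "look" rollout would get a raw advantage of 0.25; shrinkage cuts it to roughly 0.083, and with the default gate threshold the whole anchor is suppressed because action choice explains only about 11% of the return variance.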

Key Results:

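On ALFWorld with Qwen2.5-1.5B, ECPO improves on GiGPO by +5.2 success points (from about 85% to about 90%); with Qwen2.5-7B, success rises from 90.8% to 91.9%.
On WebShop, the gain is +7.3 success points with Qwen2.5-1.5B; with Qwen2.5-7B, success rises from 72.4% to 74.7%.
Across rollout group sizes (N=4, 8, 10), ECPO consistently outperforms GiGPO, by +4.7, +5.2, and +3.9 points respectively.
ECPO adds only 0.1% advantage-computation overhead, and the late-stage reward standard deviation drops from 0.746 to 0.555, indicating more stable training.

Tech Stack:

GRPO (Group Relative Policy Optimization)
GiGPO (Group-based step-level credit assignment)
ECA (Evidence-Calibrated Action Advantage): shrinkage estimation
VarGate (Variance-Gated Credit Weighting): variance decomposition
PPO (Proximal Policy Optimization) as a comparison baseline
Qwen2.5-1.5B/7B-Instruct as base models
ALFWorld and WebShop as evaluation environments
KL-divergence regularization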

Strengths:

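Clear problem diagnosis: experiments expose the statistical unreliability of dense credit under limited samples, which gives the work theoretical insight.
Simple and effective method: calibration and gating alone yield a clear performance gain at minimal computational overhead.
Generalization is validated across model scales (1.5B/7B) and multiple environments.
The critic-free design preserves the efficiency of group-based policy optimization.

Limitations:

The specific form of the shrinkage estimate (e.g., the choice of Bayesian prior) may affect performance, and the paper does not compare shrinkage methods in depth.
The variance-gating threshold is set by hand and may need tuning per task.
Experiments cover only two environments (ALFWorld, WebShop); behavior in more complex or open-ended environments is unknown.
There is no thorough computational-efficiency comparison with critic-based methods such as PPO.

Relevance To Keywords:

Unify Models: the paper does not directly involve unified multimodal models, though LLM agents can be seen as an application of unified models.
World Models: the paper does not involve world models, though state representation in reinforcement learning is indirectly related.
Representation Learning: the paper does not focus on representation learning, though policy optimization depends on state representations.
Model-Based RL: the paper is model-free RL, a different direction from model-based RL.
Post-training: the paper directly targets the post-training stage of LLMs, using reinforcement learning to improve agent capability; highly relevant.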

46. Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents [PASS]

Score: 28.5 / 27.8

Authors: Yaoqi Chen, Haibin Lai, Yuru Feng, Chuyu Han, Qianxi Zhang, Baotong Lu, Menghao Li, Xinjiang Wang, Zhirui Wang, Shusen Xu, Zengzhong Li, Zewen Jin, Hao Wu, Cheng Li, Qi Chen

Published: 2026-06-04

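TL;DR: Existing agent memory systems do not match execution-state dependencies; the paper proposes MAGE, which manages memory as a hierarchical state tree, raising average task success on MemoryArena by 7.8 to 20.4 pp and cutting token consumption by 55.1%.

Abstract (Chinese translation)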

基于大语言模型（LLM）的智能体越来越多地处理具有相互依赖决策的长周期任务，其中每个动作都会重塑未来的约束，且中间错误可能会级联。现有的 RAG（检索增强生成）和智能体记忆系统通过语义相似度组织历史，在决策时检索内容相关的条目。我们认为这种设计与执行状态依赖不匹配：它导致决策轨迹碎片化，混合了有效与错误的轨迹，阻碍了连贯的状态重建和错误隔离。我们提出 MAGE（记忆作为智能体引导的探索），一种主动的执行状态管理器，它将交互存储在层次化状态树中。智能体从根节点到当前节点的活跃路径推导其当前状态，结合子目标摘要、近期轨迹以及来自先前分支的提示。四种协同操作维护该树：生长（Grow）记录新轨迹，压缩（Compress）总结已完成的子目标，维护（Maintain）验证摘要，修订（Revise）恢复目标边界并在新分支上继续。这种设计限制了上下文的增长，同时保持状态完整性并将有缺陷的片段从主动路径中隔离。在 MemoryArena 上的实验表明，MAGE 相比基线将平均任务成功率提高了 7.8--20.4 个百分点，同时将 Token 消耗减少了 55.1%。

Abstract

LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose MAGE (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. Experiments on MemoryArena show that MAGE improves the average task success rate by 7.8--20.4 pp over baselines, while reducing token consumption by 55.1%.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	5.0/10	7.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0

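Scoring rationale: The paper's core is a memory architecture for LLM agents (MAGE) that emphasizes hierarchical management of execution state over semantic retrieval. It is moderately to highly relevant to World Models and model-based RL because of its state management and agent decision-making, moderately relevant to MLLM because it involves LLM agents, and weakly relevant to Unify Models, Tokenizer, Visual Encoder, and MultiModal, which are either not core contributions or not mentioned.

Keywords

Memory as Execution State Management, Long-Horizon Agents, Hierarchical State Tree, Agent-Guided Exploration, LLM-based Agents, Error Isolation, Context Growth Bounding

In-depth analysis

Chinese Title: 超越语义组织：面向长周期智能体的记忆作为执行状态管理

Summary: LLM-based agents increasingly handle long-horizon tasks with interdependent decisions, where existing memory systems (such as RAG) organize history by semantic similarity, fragmenting execution state and making error isolation difficult. The paper proposes MAGE (Memory as Agent-Guided Exploration), an active execution-state manager. MAGE organizes the interaction history as a two-level hierarchical state tree: the lower level records step-by-step action-observation traces, and the upper level stores summaries of subgoals or decision boundaries. Through four coupled operations (Grow, Compress, Maintain, Revise), MAGE bounds context growth while preserving state integrity and isolates erroneous branches. On the MemoryArena benchmark, MAGE raises average task success by 7.8 to 20.4 percentage points while reducing token consumption by 55.1%.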

Innovations:

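Shifts agent memory from similarity-driven retrieval to active execution-state management, building a two-level hierarchical state tree in which the root-to-current path directly provides the complete execution state.
Designs four coupled operations (Grow, Compress, Maintain, Revise) that form a closed state-management loop and support error isolation and branch recovery.
Uses boundary-aware compression and a path-based representation to bound context size while preserving state integrity and avoiding state fragmentation.
Treats memory as an object the agent can operate on rather than a shared pool of mixed entries, which supports error tracing and selective rollback.

Methodology: The paper uses structured memory management based on a hierarchical state tree. First, the agent's interaction history is built into a two-level tree: raw action-observation nodes at the lower level and subgoal summary nodes at the upper level. Four operations then maintain the tree: Grow extends the raw trace, Compress produces a summary at a subgoal boundary, Maintain validates the correctness of summaries, and Revise rolls back to a target boundary and creates a new branch when an error is detected. The agent reads its execution state from the active root-to-current path, which contains compressed summaries, recent raw traces, and hints from sibling branches. Experiments are run on the MemoryArena benchmark against several baselines.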
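
To make the tree operations concrete, here is a small sketch of Grow, Compress, and Revise over a state tree, with the active path read back as the agent's context. The class and method names are hypothetical, Maintain (summary validation) and sibling-branch hints are left out, and the way Compress re-attaches a summary at the last boundary is one possible design, not necessarily the paper's.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Node:
    kind: str                      # "root", "trace" (raw step) or "summary" (subgoal)
    text: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


class StateTree:
    def __init__(self, task: str) -> None:
        self.root = Node("root", task)
        self.current = self.root

    def grow(self, step: str) -> Node:
        """Record a new action-observation step under the current node."""
        node = Node("trace", step, parent=self.current)
        self.current.children.append(node)
        self.current = node
        return node

    def compress(self, summary: str) -> Node:
        """Replace the raw steps since the last boundary with one summary
        on the active path; the raw steps stay in the tree as a side branch."""
        boundary = self.current
        while boundary.kind == "trace":
            boundary = boundary.parent
        node = Node("summary", summary, parent=boundary)
        boundary.children.append(node)
        self.current = node
        return node

    def revise(self, boundary: Node) -> None:
        """Roll back to a boundary; the next grow() starts a new branch."""
        self.current = boundary

    def active_path(self) -> List[str]:
        """Texts from root to the current node: the agent's execution state."""
        path, node = [], self.current
        while node is not None:
            path.append(node.text)
            node = node.parent
        return path[::-1]


tree = StateTree("cook an apple")
tree.grow("open fridge")
tree.grow("take apple")
checkpoint = tree.compress("subgoal done: apple in hand")
tree.grow("put apple in oven")        # turns out to be a mistake
tree.revise(checkpoint)               # isolate the flawed step
tree.grow("put apple in microwave")
print(tree.active_path())
```

After the rollback, the flawed step is still stored under the checkpoint but no longer appears on the active path, so it does not enter the agent's context.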

Key Results:

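On MemoryArena tasks, MAGE raises average task success by 7.8 to 20.4 percentage points.
Compared with long-context methods, MAGE reduces token consumption by 55.1%.
MAGE clearly outperforms similarity-based memory systems (e.g., HippoRAG, MemoryOS, SimpleMem) on state integrity and error isolation.
MAGE sits in the best region of the performance-efficiency trade-off (high task performance, low token consumption).

Tech Stack:

Hierarchical State Tree
Grow, Compress, Maintain, and Revise operations
Subgoal Summarization
Boundary-aware Compression
Path-based Representation
MemoryArena benchmark
Large language models (LLMs) as agents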

Strengths:

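Turns memory from passive storage into active state management, addressing the core problems of state fragmentation and error propagation in long-horizon tasks.
The hierarchical tree plus four operations delivers efficient context compression and error isolation, balancing performance and efficiency.
The experimental design is rigorous: several baselines are compared on MemoryArena, with clear gains and much lower token consumption.
Inspired by cognitive science, the approach has both a theoretical basis and practical operability.

Limitations:

The method depends on accurate identification of subgoal boundaries, which may be ill-defined in complex tasks.
The Maintain operation requires an extra validation step, which may add computational overhead.
Experiments cover only the MemoryArena benchmark; generalization needs validation on more tasks and environments.
Scalability and storage overhead of the tree under extremely long sequences are not discussed in detail.

Relevance To Keywords:

Unify Models: the paper does not directly involve model unification, though memory management is part of an agent system and relates to unified model architectures.
World Models: MAGE maintains execution state through the state tree, indirectly supporting state representation and updating in world models; the state tree can be viewed as a simplified world-model implementation.
Representation Learning: summary generation and state compression involve representation learning, but this is not the core focus and is not explored in depth.
Model-Based RL: MAGE's state-tree management resembles the state tracking and rollback mechanisms of model-based reinforcement learning.
Native multimodal large models: the paper does not involve multimodality, though the method could extend to multimodal agents.
Unified understanding and generation in multimodal large models: not directly relevant.
Reinforcement learning: the paper models tasks as MDPs, consistent with the reinforcement learning framework.
Post-training: the paper does not involve post-training, though memory management could help organize data for the post-training stage.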

47. Automatic Labelling of Speech Translation Errors [PASS]

Score: 28.5 / 27.8

Authors: Dominik Macháček, Maike Züfle, Ondrej Klejch

Published: 2026-06-04

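TL;DR: The paper proposes a framework for labelling speech translation errors and finds that multimodal LLMs and text-only systems are complementary but reach only about half of human precision.

Abstract (Chinese translation)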

语音翻译中的错误会降低语音翻译（ST）系统的可信度，并可能导致严重后果。然而，目前尚无成熟的方法论来评估语音翻译的置信度和质量估计。为了推动这一方向的进展，我们提出了语音翻译错误标注（STEL）。我们创建了一个标注协议、一个小型真实的端到端评估数据集，并分析了现有的纯文本系统和语音处理系统如何执行 STEL 任务。我们的结果表明，纯文本 XCOMET 和多模态大语言模型 Qwen2.5-Omni 能够以人类精确度的一半左右执行 STEL 任务。我们还发现，直接语音处理对于 STEL 任务是必要的，且当前的纯文本系统和语音处理系统在标注 ST 中的纯翻译错误与语音处理错误方面具有互补性。

Abstract

Errors in speech translations reduce trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol, a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and multimodal LLM Qwen2.5-Omni are able to perform the STEL task in roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that the current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	8.0/10	12.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

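Scoring rationale: The paper's core is Speech Translation Error Labelling (STEL), and it explicitly uses a multimodal large model (MLLM) in comparative experiments, so relevance to MLLM and MultiModal is high. It does not involve world models, reinforcement learning, visual encoders (audio is not vision), or unified model architectures, and the tokenizer is not a focus, so relevance to those is low.

Keywords

Speech Translation, Error Labelling, Multimodal LLM, Quality Estimation, Speech Processing, Text-only Systems, STEL

In-depth analysis

Chinese Title: 语音翻译错误的自动标注

Summary: The paper proposes the Speech Translation Error Labelling (STEL) task for assessing the trustworthiness of speech translation systems. The authors design an annotation protocol for end-to-end long-form speech translation, build a small test set covering four language directions (Czech→English, English→Czech, English→German, English→Hebrew), and analyze how a text-only system (XCOMET) and a multimodal large model (Qwen2.5-Omni) perform on STEL. Automatic systems reach about half of human precision, and processing speech directly matters for labelling. Text-only and speech-processing systems are complementary in labelling translation errors versus speech-processing errors. The work offers a new methodology for speech translation quality estimation and releases the dataset and code.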

Innovations:

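Proposes, reportedly for the first time, the Speech Translation Error Labelling (STEL) task, filling a methodological gap in confidence evaluation for speech translation.
Designs an annotation protocol oriented toward users and communication scenarios that distinguishes four error classes (critical, minor, subtle, redundant) and includes segment-level DA scores.
Builds a small evaluation dataset with four language directions, 32 minutes of audio, and 329 segments, and checks inter-annotator agreement.
Systematically compares text-only XCOMET with multimodal Qwen2.5-Omni on STEL, showing that direct speech processing is necessary and that the two kinds of system are complementary.

Methodology: The authors first design the STEL annotation protocol, based on the ESA protocol from text MT but adding the perspective of a speech translation user. Recordings are selected from existing test sets; candidate translations are generated with three representative ST systems (cascaded ASR+LLM, an end-to-end model, and a simultaneous system) and segmented and aligned with mWERSegmenter. Three native speakers carry out the manual annotation, which takes 6.5 hours. For automatic labelling, XCOMET-XL2 (a text-only QE model) and Qwen2.5-Omni-7B (a multimodal LLM) take either the ASR transcript or the raw audio as input and output error spans and segment-level scores. Evaluation uses character-level F1 (weighted and unweighted) and Kendall's τ correlation, compared against a second round of human annotation.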
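
Character-level F1 over error spans can be computed by treating each span as the set of character positions it covers. The sketch below shows the unweighted variant only; the severity-weighted variant mentioned above is not reproduced because the report does not describe its weights. Function names and the half-open `(start, end)` span convention are assumptions.

```python
def span_chars(spans):
    """Expand half-open (start, end) spans into a set of character offsets."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_f1(predicted_spans, gold_spans):
    """Unweighted character-level F1 between predicted and gold error spans."""
    predicted, gold = span_chars(predicted_spans), span_chars(gold_spans)
    if not predicted and not gold:
        return 1.0  # both agree that the segment has no errors
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# The system marks characters 5..14, the annotator marked 0..9.
print(char_f1([(5, 15)], [(0, 10)]))
```

In the example, half of the predicted characters are correct and half of the gold characters are found, so precision, recall, and F1 are all 0.5.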

Key Results:

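Automatic systems (XCOMET and Qwen) reach about half of human precision on STEL (e.g., En→Cs unweighted F1: XCOMET 38.7 vs. human 71.9).
XCOMET outperforms Qwen in three language directions but performs poorly on Hebrew because of insufficient training data.
Using gold transcripts instead of ASR transcripts brings only a small gain of 0 to 3.6 points, suggesting that speech processing is not the main challenge.
Processing speech directly (Qwen+audio) works better on speech-processing errors (WER>0), while text-only systems are better on translation errors (WER=0); the two are complementary.
Agreement with the second human annotation is 72% (Czech) and 56% (German), indicating some ambiguity in the labelling.

Tech Stack:

XCOMET-XL2 (reference-free QE model)
Qwen2.5-Omni-7B (multimodal large language model)
mWERSegmenter (segmentation and alignment tool)
Pearmut (multilingual evaluation and annotation platform)
Kendall's τ correlation coefficient
Character-level F1 (weighted and unweighted)
ASR transcripts (from the speech recognition component of the ST systems)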

Strengths:

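Defines and evaluates the speech translation error labelling task systematically for what appears to be the first time.
The annotation protocol accounts for real usage of speech translation (e.g., redundancy, background knowledge), making it closer to practice than text-only annotation.
The experimental design is thorough: it compares several systems (text-only, multimodal, different input modalities) and analyzes their complementarity.
The dataset and code are open source, which facilitates follow-up research.

Limitations:

The dataset is small (32 minutes of audio, 329 segments) and covers few language directions, so generalization remains to be validated.
The human annotators are inexperienced and agreement is low (56% to 72%), which may affect evaluation reliability.
The automatic systems rely on gold segmentation rather than the automatic segmentation of real deployments, leaving a gap to practical conditions.
Only two automatic systems are evaluated; more recent models (e.g., QE models dedicated to end-to-end ST) are not covered.

Relevance To Keywords:

Native multimodal large models: Qwen2.5-Omni, used in the paper, is a native multimodal model that handles both text and audio; directly relevant.
Unified understanding and generation in multimodal large models: Qwen2.5-Omni both understands (labels errors) and generates (outputs error spans), reflecting the unified approach.
Representation learning: XCOMET and Qwen both rely on learned representations to model translation quality, but the paper does not examine representation learning mechanisms in depth.
World models: not addressed directly, though inferring background knowledge in speech translation can be seen as an application of world knowledge.
Reinforcement learning / post-training: neither is used, though both could be applied to optimize STEL systems in future work.
Unify Models: the paper compares text-only and multimodal models but does not propose a unified model; weak relevance.
Model-Based RL: not relevant.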

48. TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization [PASS]

Score: 28.5 / 27.8

Authors: Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li

Published: 2026-06-04

TL;DR: TARPO introduces a reinforcement learning framework that adaptively switches between discrete token generation and continuous latent reasoning at the token level to enhance LLM reasoning performance.

Abstract (Chinese translation)

潜在推理（Latent Reasoning）已成为离散思维链（CoT）在大语言模型（LLMs）中一种颇具前景的替代方案，通过在连续表示上进行操作，实现了更具表现力的推理。然而，连续表示固有的确定性本质限制了强化学习（RL）中的策略探索。为此，我们提出了 TARPO（Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization），这是一个纯强化学习（RL）框架，能够在每个步骤自适应地在离散词生成与连续潜在推理之间切换。TARPO 引入了一种轻量级的动作头路由器，该路由器观察当前隐藏状态，并从二元模式选择空间中采样路由决策，从而保留了从词表中离散采样的随机性。大语言模型骨干与路由器通过共享的组相对优势信号进行端到端的联合优化。在 Qwen2.5（1.5B 至 7B）和 Llama-3.1-8B 骨干上进行的广泛实验表明，TARPO 在各种基准测试中一贯优于现有的显式和潜在推理强化学习基线。进一步分析表明，TARPO 学习自适应的逐词切换行为，同时保持稳定的训练动态。我们的代码可在 https://github.com/NKU-LITI/TARPO-master 获取。

Abstract

Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU-LITI/TARPO-master.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	7.0/10	10.5
Tokenizer	1.5	7.0/10	10.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	5.0/10	7.5

Scoring rationale: The paper proposes TARPO, which unifies latent and explicit reasoning modes in LLMs (Unify Models) and operates at the token level using RL (Tokenizer, model-based RL). However, it does not involve visual encoders, multimodal data, world models, or MLLM architectures, resulting in zero relevance for those keywords.

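Keywords

Latent Reasoning, Action-Routing, Policy Optimization, Token-Wise, Reinforcement Learning, Discrete Token Generation, Continuous Representations, LLM Backbone

In-depth analysis

Chinese Title: TARPO：基于动作路由策略优化的逐词隐式-显式推理

Summary: The paper proposes TARPO, a pure reinforcement learning framework that lets large language models switch adaptively, token by token, between latent reasoning (continuous latent representations) and explicit reasoning (discrete tokens). Because continuous latent representations are inherently deterministic and therefore limit policy exploration in RL, TARPO adds a lightweight action-head router that, given the current hidden state, samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling. The router and the LLM backbone are optimized jointly end to end with a shared group-relative advantage signal. Experiments on Qwen2.5 (1.5B to 7B) and Llama-3.1-8B show that TARPO consistently outperforms existing explicit and latent reasoning RL baselines on several mathematical reasoning benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching while keeping training dynamics stable. The code is open source.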

Innovations:

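Proposes TARPO, a token-wise latent-explicit reasoning framework in which a lightweight action-head router selects the mode adaptively.
Models reasoning-mode selection as a learnable discrete routing policy and uses the policy's inherent stochasticity to drive exploration.
Designs an end-to-end joint objective that optimizes the LLM backbone and the router with a shared group-relative advantage signal.
Needs no heuristic rules or supervised initialization; the reasoning strategy is learned entirely through reinforcement learning.
Validates effectiveness and generalization across several model scales and architectures.

Methodology: TARPO models reasoning as a sequential decision problem. At each time step, a lightweight linear-projection router outputs a binary routing decision (hard/soft) based on the current hidden state. If hard is chosen, a discrete token embedding is generated; if soft is chosen, a continuous latent representation is built as a sparse weighted sum over the top-k tokens. Training uses online group-relative policy optimization (similar to GRPO): multiple trajectories are sampled per group, a within-group normalized advantage signal is computed, and the token-generation objective (LLM backbone) and the action-routing objective (router) are optimized together with a KL regularization term. The router bias is initialized to lean slightly toward hard mode, matching the model's pretraining preference.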
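
The hard/soft choice at a single step can be sketched in plain Python. This is an illustration of the routing idea only: it takes the router's two logits as given instead of computing them from a hidden state, involves no training, and every name and default value (`next_input_embedding`, `k=2`) is an assumption.

```python
import math
import random


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def next_input_embedding(token_logits, embeddings, route_logits, k=2, rng=random):
    """Pick the next-step input for one position.

    route_logits: [hard_logit, soft_logit] from the router (taken as given here).
    Returns (mode, embedding).
    """
    p_hard, _ = softmax(route_logits)
    if rng.random() < p_hard:
        # Hard mode: sample one discrete token and feed its embedding.
        probs = softmax(token_logits)
        token = rng.choices(range(len(token_logits)), weights=probs)[0]
        return "hard", list(embeddings[token])

    # Soft mode: renormalize over the top-k tokens and mix their embeddings.
    top = sorted(range(len(token_logits)), key=lambda i: token_logits[i], reverse=True)[:k]
    weights = softmax([token_logits[i] for i in top])
    dim = len(embeddings[0])
    mixed = [sum(w * embeddings[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return "soft", mixed


embeddings = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
token_logits = [2.0, 2.0, -5.0]
rng = random.Random(0)
print(next_input_embedding(token_logits, embeddings, [-100.0, 100.0], rng=rng))
print(next_input_embedding(token_logits, embeddings, [100.0, -100.0], rng=rng))
```

Because the routing decision itself is sampled, exploration stays stochastic even when the soft branch, which is a deterministic function of the logits, is taken.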

Key Results:

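On Qwen2.5-1.5B/3B/7B and Llama-3.1-8B, TARPO consistently outperforms explicit-reasoning (e.g., GRPO) and latent-reasoning (e.g., Coconut) RL baselines on GSM8K, MATH, MATH500, AMC23, OlympiadBench, and other benchmarks.
Cross-architecture experiments show that TARPO is also effective on Llama-3.1-8B.
Out-of-distribution evaluation shows better generalization and token efficiency.
Analysis shows that TARPO learns adaptive token-wise switching and that training dynamics remain stable.

Tech Stack:

Reinforcement learning: Group Relative Policy Optimization (GRPO)
Routing policy: linear projection layer + softmax classification (binary action space)
Latent representation construction: weighted sum over top-k tokens (softmax-normalized)
KL regularization: separate KL penalties on token generation and routing actions
Model architectures: Qwen2.5 series, Llama-3.1-8B
Training framework: online sampling, within-group reward normalization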

Strengths:

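Models reasoning-mode switching as a learnable discrete routing policy, which preserves stochastic exploration.
The router is lightweight, adds little compute, and is easy to integrate into existing LLMs.
Training is end-to-end reinforcement learning with no hand-written rules or supervised data, so the method adapts well.
Gains are consistent across several model scales and architectures, indicating good generalization.
The code is open source, which supports reproducibility.

Limitations:

Validation covers only mathematical reasoning; effectiveness on broader language tasks (e.g., commonsense reasoning, code generation) is unknown.
The router's initialization bias (toward hard mode) may affect early exploration and needs tuning.
The latent representation is a top-k weighted sum; the choice of k may affect performance and is not analyzed in depth.
Compared with pure discrete token generation, inference may cost more because of the extra routing decision.
The work is not combined with multimodal or world-model directions, so its scope is fairly narrow.

Relevance To Keywords:

Reinforcement learning: the core method, using GRPO for policy optimization.
Post-training: TARPO belongs to the post-training stage of LLMs, fine-tuning the reasoning strategy through reinforcement learning.
Representation learning: latent reasoning involves learning and using continuous latent representations, which falls under representation learning.
World models: not addressed directly, though latent reasoning can be viewed as building an internal world model.
Multimodal large models: the paper does not involve multimodality, but the token-wise routing mechanism could later extend to multimodal reasoning.
Model-based reinforcement learning: TARPO treats reasoning-mode selection as an action, which the analysis relates to model-based RL (an internal reasoning model).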

49. SAM-Flow: Source-Anchored Masked Flow for Training-Free Image Editing [FAIL]

Score: 27.0 / 27.8

Authors: Haowang Cui, Rui Chen, Tao Luo, Tao Guo, Zheng Qin, Jiaze Wang

Published: 2026-06-04

TL;DR: SAM-Flow addresses background leakage in training-free image editing by localizing editable regions via token-grounded attention and anchoring non-target areas to the source latent trajectory, achieving accurate semantic editing with improved background preservation.

Abstract (Chinese translation)

无训练图像编辑近期因其能够利用强大的预训练扩散模型和流匹配模型修改真实图像而无需额外训练，从而吸引了越来越多的关注。然而，现有的基于逆映射和基于差分流的方法通常执行全局潜在空间传输，这不可避免地会将编辑效果传播到非目标区域，导致背景泄露。为了解决这一问题，我们提出了 SAM-Flow，这是一种用于局部无训练图像编辑的源锚定掩码流框架。与更新整个潜在表示不同，SAM-Flow 首先利用侦察图像（scout image）和基于标记的注意力图来定位可编辑语义区域。随后，它仅在这些区域内应用差分速度更新，同时将剩余区域锚定至源图像潜在轨迹。为了进一步提升空间稳定性和边界自然性，我们引入了一种时变源锚定投影机制，该机制包含动态软掩码、过渡区域以及时间掩码累积。所提出的方法具有即插即用特性，可与主流流匹配骨干网络（如 Stable Diffusion 3 和 FLUX）集成，且无需任何微调。广泛的定性和定量实验表明，SAM-Flow 在实现准确语义编辑的同时，显著改善了背景保持效果，为无训练图像编辑提供了一个简单且通用的局部编辑范式。代码开源地址为：https://github.com/chwbob/Sam-Flow.

Abstract

Training-free image editing has recently attracted increasing attention due to its ability to modify real images using powerful pre-trained diffusion and flow-matching models without additional training. However, existing inversion-based and differential-flow-based methods usually perform global latent transport, which inevitably propagates editing effects to non-target regions and leads to background leakage. To address this problem, we propose SAM-Flow, a source-anchored masked flow framework for localized training-free image editing. Instead of updating the whole latent representation, SAM-Flow first uses a scout image and token-grounded attention maps to localize the editable semantic regions. It then applies differential velocity updates only within these regions, while anchoring the remaining areas to the source-image latent trajectory. To further improve spatial stability and boundary naturalness, we introduce a time-varying source-anchored projection mechanism with dynamic soft masks, transition regions, and temporal mask accumulation. The proposed method is plug-and-play and can be integrated with mainstream flow-matching backbones such as Stable Diffusion 3 and FLUX without any fine-tuning. Extensive qualitative and quantitative experiments demonstrate that SAM-Flow achieves accurate semantic editing while significantly improving background preservation, providing a simple and general localized editing paradigm for training-free image editing. Code is available at: https://github.com/chwbob/Sam-Flow.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

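Scoring rationale: The paper focuses on training-free image editing and has little connection to unified models, world models, or reinforcement learning. It touches on token-grounded attention and visual encoders but proposes no new architecture or RL method. Relevance to MLLM and MultiModal is modest. The author list does not include any of the designated experts.

Keywords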

Training-free image editing, Source-anchored masked flow, Localized editing, Flow-matching models, Token-grounded attention, Background preservation, Diffusion models

50. Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting [FAIL]

Score: 27.0 / 27.8

Authors: Kyriakos Chaviaras, Maria Lymperaiou, Athanasios Voulodimos

Published: 2026-06-04

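TL;DR: The paper presents a feature-engineered multimodal fusion approach that uses gradient-boosted models and LLM-derived semantic features to identify sexism in memes and videos; targeted semantic features help on memes, while video requires more robust temporal modeling.

Abstract (Chinese translation)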

我们介绍了 AILS-NTUA 团队在 CLEF 的 EXIST 2026 实验室提交的参赛系统，针对模因（任务 2）和短视频（任务 3）中的多模态性别歧视识别与表征问题。该系统采用了一个基于特征工程的晚期融合管道，核心构建于梯度提升回归模型和层次化后处理之上。针对模因，我们融合了视觉、文本、人口统计、生物特征以及基于大语言模型（LLM）衍生的语义指标，旨在捕捉刻板印象、物化、讽刺和厌女论等高层线索。针对视频，我们研究了特征选择、基于帧的视觉表征、基于光学字符识别（OCR）的文本特征、声学描述符以及传感器衍生元数据的影响。实验结果表明，针对性的基于 LLM 的语义线索有助于提升模因性别歧视识别效果，而视频性能则对特征维度和跨模态噪声高度敏感。针对视频，开发集结果倾向于紧凑的特征选择，但官方测试集结果表明，这一结论并未完全迁移到未见数据上，其中未过滤的表征泛化能力更强。总体而言，我们的发现突显了针对静态模因进行针对性语义特征工程的有效性，以及在嘈杂的短视频环境中需要更鲁棒的时序建模。

Abstract

We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

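Scoring rationale: The paper's core is multimodal sexism detection with a feature-engineering and gradient-boosting pipeline. It involves multimodal inputs (high MultiModal score) and LLM-based feature extraction (medium MLLM score), but not unified model architectures (Unify Models), tokenizers (Tokenizer), world models (World Models), or reinforcement learning (model-based RL). Visual features are present, but encoder architecture is not discussed in depth (low-to-medium Visual Encoder score). The author list does not include any of the designated experts.

Keywords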

Multimodal, Sexism Identification, Gradient Boosting, Large Language Models, Feature Engineering, Memes, Videos

51. Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images [FAIL]

Score: 25.5 / 27.8

Authors: Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

Published: 2026-06-04

TL;DR: This paper proposes a cost-effective software method to reconstruct 3D oral models from ten 2D intraoral images using MobileNetV2 and Multi-head Attention, achieving 77.49% accuracy without dedicated hardware.

Abstract (Chinese translation)

口腔三维建模（Oral 3D modelling）是牙科诊疗中最关键的阶段之一，目前常采用印模制取（impression taking）和口内扫描（intraoral scanning）等多种方法，但每种方法均存在显著的局限性。印模制取涉及将藻酸盐或硅胶材料置于托盘并置入患者口腔以形成负模，该方法存在患者不适感强、材料变形误差以及存储和运输困难等问题。口内扫描设备利用结构光或激光技术直接对口腔结构进行实时扫描，虽能产生最先进的结果，但设备成本显著高昂。为应对上述局限性，本文提出一种基于软件的方法，仅利用从不同角度捕获的十张 2D 口内图像重建 3D 口腔模型，无需专用硬件设备。该方法降低了成本，消除了对物理扫描设备的需求，减少了患者不适感，并实现了自动化 3D 重建。该模型在公开的 Dental3DS 数据集上进行训练，该数据集包含 950 个上颌样本，并采用 MobileNetV2 作为图像编码器，结合 Multi-head Attention 进行多视图特征融合。所提出的模型在距离阈值为 0.035 的最近邻匹配（nearest-neighbor matching）下，达到了 77.49% 的准确率。然而，预测顶点倾向于集中在真实值的高密度区域，导致重建模型上的点分布不均匀。

Abstract

Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.
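
The accuracy figure above is described as nearest-neighbor matching with a distance threshold of 0.035. One way to read that, shown in the sketch below, is the share of predicted vertices whose nearest ground-truth vertex lies within the threshold. The matching direction, the function name, and the brute-force search are assumptions; the paper may define the metric differently.

```python
import math


def nn_accuracy(predicted, ground_truth, threshold=0.035):
    """Share of predicted 3D points within `threshold` of some ground-truth point."""
    hits = 0
    for p in predicted:
        nearest = min(math.dist(p, g) for g in ground_truth)
        if nearest <= threshold:
            hits += 1
    return hits / len(predicted)


ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [
    (0.0, 0.0, 0.01),   # 0.01 from the first ground-truth point
    (1.0, 0.0, 0.02),   # 0.02 from the second
    (0.5, 0.0, 0.0),    # 0.5 from either: a miss
    (0.0, 0.03, 0.0),   # 0.03 from the first
]
print(nn_accuracy(predicted, ground_truth))
```

A one-directional measure like this does not penalize predictions that cluster around a few ground-truth points, so it would not by itself reveal the uneven point distribution the abstract reports.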

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	8.0/10	12.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	1.0/10	1.5

评分理由: The paper focuses on medical 3D reconstruction from 2D images using MobileNetV2 and Multi-head Attention, which strongly aligns with 'Visual Encoder'. However, it lacks connections to 'Unify Models', 'Tokenizer', 'World Models', 'MLLM', and 'model-based RL', as it is a supervised reconstruction task rather than a generative or reinforcement learning framework. 'MultiModal' has slight relevance due to multi-view fusion but is not cross-modal. No listed expert authors are found in the author list. Weighted score: 25.5 (Threshold: 27.8).

Keywords

3D Oral Cavity Reconstruction, 2D Intraoral Images, MobileNetV2, Multi-head Attention, Dental3DS dataset, Image Encoder, Point Distribution, Cost-effective

52. EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction [FAIL]

Score: 25.5 / 27.8

Authors: Zhihao Zhou, Weishan Ye, Li Zhang, Gan Huang, Zhen Liang

Published: 2026-06-04

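TL;DR: EEGDancer proposes a unified framework that integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning, and performs continuous EEG emotion prediction by optimizing emotion trajectories.

Abstract (Chinese translation)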

连续脑电图（EEG）情绪预测旨在对从 EEG 信号中观察到的人类情绪状态的时间演化进行建模。与传统离散情绪识别不同，连续预测需要捕捉长程时间依赖以及连贯的情绪动力学过程。然而，现有方法主要依赖逐点回归，直接对噪声高维 EEG 特征进行建模，这限制了其表征连续情绪演化的能力。为应对这些挑战，我们提出了 EEGDancer，这是一个用于连续 EEG 情绪预测的动态情感潜在空间学习框架。该框架将向量量化表示学习、掩码时间建模以及基于强化学习的轨迹优化整合到一个统一架构中。具体而言，设计了一种因果时空向量量化变分自编码器（VQ-VAE），用于学习结构化情感原型并从 EEG 信号构建离散 - 连续情感潜在空间。基于学习到的潜在表示，采用基于 Transformer 的掩码动态建模策略来捕捉长程情感依赖及时间演化模式。此外，连续情绪预测被表述为序列决策问题，并引入软演员 - 评论家（SAC）框架，在序列级优化情绪预测轨迹，而非局限于帧级局部拟合。在 SEED、SEED-IV 及 Long-Term Naturalistic Emotion 数据集上的广泛实验表明，EEGDancer 始终优于现有的机器学习和深度学习方法。消融实验进一步验证了所提出的潜在空间及基于强化学习的轨迹优化在建模连续 EEG 情绪动力学方面的有效性。

Abstract

Continuous electroencephalography (EEG) emotion prediction aims to model the temporal evolution of human emotional states from EEG signals. Unlike conventional discrete emotion recognition, continuous prediction requires capturing long-range temporal dependencies and coherent emotional dynamics. However, existing methods mainly rely on point-wise regression and directly model noisy high-dimensional EEG features, limiting their ability to characterize continuous emotional evolution. To address these challenges, we propose EEGDancer, a dynamic emotional latent space learning framework for continuous EEG emotion prediction. The framework integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning-based trajectory optimization into a unified architecture. Specifically, a causal spatiotemporal Vector-Quantization Variational Autoencoder (VQ-VAE) is designed to learn structured emotional prototypes and construct a discrete-continuous emotional latent space from EEG signals. Based on the learned latent representations, a Transformer-based masked dynamic modeling strategy captures long-range emotional dependencies and temporal evolution patterns. Furthermore, continuous emotion prediction is formulated as a sequential decision-making problem, and a Soft Actor-Critic (SAC) framework is introduced to optimize emotional prediction trajectories at the sequence level instead of frame-wise local fitting. Extensive experiments on the SEED, SEED-IV, and Long-Term Naturalistic Emotion datasets demonstrate that EEGDancer consistently outperforms existing machine learning and deep learning methods. Ablation studies further verify the effectiveness of the proposed latent space and reinforcement learning-based trajectory optimization for modeling continuous EEG emotional dynamics.
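
The vector-quantization step mentioned in the abstract can be sketched as a nearest-prototype lookup. This shows only the generic VQ assignment, not the authors' causal spatiotemporal VQ-VAE; the codebook values, latent values, and the function name `quantize` are hypothetical.

```python
import math

def quantize(latent, codebook):
    """Return (index, prototype) of the codebook entry nearest to `latent`."""
    index = min(range(len(codebook)), key=lambda k: math.dist(latent, codebook[k]))
    return index, codebook[index]

# Hypothetical 2-D codebook of three "emotional prototypes" (illustrative values).
codebook = [(-1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
# Hypothetical continuous latents for four consecutive EEG frames.
latents = [(-0.9, 0.1), (-0.2, 0.8), (0.1, 0.9), (0.8, -0.1)]

# Each frame is replaced by the index of its nearest prototype.
token_ids = [quantize(z, codebook)[0] for z in latents]
print(token_ids)
```

The resulting index sequence is the kind of discrete representation that a masked Transformer could then model over time.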

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	7.0/10	10.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	5.0/10	7.5

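Scoring rationale: The paper proposes the EEGDancer framework, which integrates a VQ-VAE, masked modeling, and reinforcement learning. The architecture is unified (Unify Models), the VQ-VAE acts as a discretizing tokenizer (Tokenizer), and reinforcement learning is used to optimize trajectories (model-based RL). However, the paper handles only single-modality EEG signals and involves no visual encoder, multimodality, or multimodal large language model (MLLM), and it overlaps only partially with the World Models concept (emotion dynamics modeling), so those relevance scores are low.

Keywords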

EEG emotion prediction, VQ-VAE, Masked temporal modeling, Reinforcement learning, Continuous emotion prediction, Latent space, Trajectory optimization

53. UniVoice: A Unified Model for Speech and Singing Voice Generation [FAIL]

Score: 25.5 / 27.8

Authors: Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

Published: 2026-06-04

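TL;DR: UniVoice proposes a unified framework based on conditional flow matching that factorizes the condition into content, melody, and timbre, unifying natural speech generation and controllable singing generation.

Abstract (Chinese translation)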

文本到语音（TTS）和歌声合成（SVS）均旨在从符号输入生成人类语音音频，但它们对生成过程施加了不同的要求。语音生成依赖于灵活的语言驱动韵律，而歌声生成则需要明确的旋律控制和准确的节奏对齐。这种不匹配使得训练一个能够同时生成自然语音和可控歌声的单模型具有挑战性，因为与旋律相关的条件应强烈约束歌声，但不应限制语音韵律。本文提出 UniVoice，一种基于条件流匹配的统一语音与歌声生成框架。与使用单一未区分的条件表示不同，UniVoice 将条件分解为内容、旋律和音色，这些要素由模态适配编码器编码，并由共享的扩散变换器（DiT）骨干网络处理。在歌声生成中，旋律条件由 MIDI 音符序列表示；而在语音生成中，该条件被替换为学习得到的空旋律 token，从而使模型能够从语言和声学上下文中推断韵律。这种设计保留了歌声的显式旋律控制，同时避免了对语音施加旋律约束的需求。此外，我们将空旋律 token 分析为条件流中旋律边缘化的近似。在 3 万小时语音数据和 3.5 万小时歌声数据上训练的 UniVoice，实现了 5.26% 的语音音素错误率（PER），与专用 TTS 系统（如 F5-TTS 的 5.21% 和 CosyVoice3 的 5.30%）相当。在歌声生成方面，UniVoice 实现了 16.22% 的 PER，优于统一基线 Vevo1.5（24.72%）。

Abstract

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
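
The PER figures quoted above are phoneme error rates. A minimal sketch of the usual definition follows: the edit distance between the reference and recognized phoneme sequences, divided by the reference length. The paper's actual evaluation pipeline (recognizer, phoneme set, normalization) is not described in the abstract, and the phoneme sequences below are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

# Illustrative phoneme sequences (not from the paper's test set).
ref = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
hyp = ["HH", "AH", "L", "OW", "W", "AO", "L"]
per = phoneme_error_rate(ref, hyp)
print(per)
```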

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	9.0/10	13.5
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

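Scoring rationale: The title and abstract center on a unified speech and singing generation framework, so 'Unify Models' is highly relevant (9). The model uses MIDI sequences and a null melody token, which involves tokenized representations, so 'Tokenizer' is moderately relevant (4). It processes text and audio inputs, so 'MultiModal' is moderately relevant (4). The remaining keywords (Visual Encoder, World Models, MLLM, model-based RL) have no direct connection to this pure audio generation task and score 0. The weighted total is 25.5, below the dynamic passing score of 27.8, indicating a significant mismatch between the paper's field (audio generation) and the keyword set (oriented toward vision, LLMs, and RL).

Keywords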

Unified Model, Speech and Singing Voice Generation, Conditional Flow Matching, Diffusion Transformer, Melody Control, Text-to-Speech, Singing Voice Synthesis

54. Knowledge Distillation for Visual Autoregressive Models [FAIL]

Score: 25.5 / 27.8

Authors: Elia Peruzzo, Aritra Bhowmik, Guillaume Sautiere, Yuki M Asano, Amirhossein Habibian

Published: 2026-06-04

TL;DR: The paper proposes VarKD, a knowledge distillation framework for visual autoregressive models that reduces token ambiguity and selectively applies teacher supervision to improve compression efficiency on ImageNet.

Abstract (Chinese translation)

自回归（AR）图像生成模型具有极强的表达能力，但计算开销高昂，从而催生了对有效模型压缩方法的需求。知识蒸馏（KD）是模型压缩的一种自然方法，已在语言建模中得到广泛研究，然而其在视觉自回归生成中的行为尚未得到充分探索。本文首次系统研究了针对 AR 图像模型的蒸馏策略。我们的分析表明，虽然标准蒸馏能带来显著收益，但近期为语言模型开发的方法无法直接迁移到图像领域：长解码步长和视觉 token 的歧义性使得教师监督不可靠，尤其是在学生条件化上下文中。为此，我们提出 VarKD，一种面向视觉自回归模型的蒸馏框架，该框架基于学生样本进行蒸馏，同时选择性应用教师监督并降低 token 级别的歧义。在 ImageNet 上针对多种 AR 骨干网络的实验表明，VarKD 始终优于先前的蒸馏基线，缩小了与大模型之间的差距。

Abstract

Autoregressive (AR) image generation models are highly expressive but computationally intensive, motivating effective model compression. Knowledge distillation (KD) is a natural approach for model compression and has been widely studied in language modeling, yet its behavior in visual AR generation remains underexplored. In this work, we present the first systematic study of distillation strategies for AR image models. Our analysis shows that while standard distillation can yield meaningful gains, recent methods developed for language do not directly transfer to images: long decoding horizons and visual token ambiguity make teacher supervision unreliable, especially under student-conditioned contexts. To address this, we propose VarKD, a distillation framework for visual autoregressive models that distills on student samples while selectively applying teacher supervision and reducing token-level ambiguity. Experiments on ImageNet across multiple AR backbones show that VarKD consistently outperforms prior distillation baselines, narrowing the gap to large-scale models.
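
For orientation, standard token-level distillation can be sketched as a KL divergence between teacher and student next-token distributions. The optional `keep_mask` only illustrates the idea of applying teacher supervision at selected positions; how VarKD actually selects positions is not stated in the abstract, so the mask, the logits, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_logits, student_logits, keep_mask=None, temperature=1.0):
    """Mean KL(teacher || student) over the positions where keep_mask is True."""
    if keep_mask is None:
        keep_mask = [True] * len(teacher_logits)
    losses = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s, keep in zip(teacher_logits, student_logits, keep_mask)
        if keep
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Two token positions over a 3-entry vocabulary (illustrative logits).
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
full_loss = distillation_loss(teacher, student)
masked_loss = distillation_loss(teacher, student, keep_mask=[True, False])
print(full_loss, masked_loss)
```

In this toy case the teacher is uniform (ambiguous) at the second position, so dropping that position removes the only disagreement.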

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on Knowledge Distillation for Visual Autoregressive image generation, showing limited overlap with keywords targeting Multimodal, RL, and Unified World Models. 'Tokenizer' is moderately relevant due to the discussion of visual token ambiguity. 'World Models' has conceptual overlap with generative sequential modeling. 'Unify Models', 'MLLM', 'MultiModal', and 'model-based RL' are largely irrelevant as the paper is single-modal, focuses on compression, and involves no reinforcement learning. No listed expert authors are present.

Keywords

Knowledge Distillation, Visual Autoregressive Models, Model Compression, VarKD, Image Generation, Token Ambiguity, Student-Teacher Learning

55. Adapting Diffusion Language Models for Lossless Pixel-Level Image Transmission [FAIL]

Score: 24.0 / 27.8

Authors: Tianqi Ren, Rongpeng Li, Xianfu Chen, Yingyu Li, Zhifeng Zhao

Published: 2026-06-04

TL;DR: This paper proposes a diffusion-based source-channel coding framework for lossless image transmission that achieves better exact-recovery performance than baselines over noisy channels.

Abstract (Chinese translation)

无损像素级图像传输是超越语义通信的一种基本体制，因为精确恢复既需要准确的符号概率建模，也需要通过噪声信道的可靠传输。本文提出了 DDM-SSCC，一种基于离散扩散模型的独立信源 - 信道编码框架，用于无损图像传输。不同于光栅顺序自回归编码，所提出的信源编解码器将扩散语言模型适配于像素标记恢复，并在双向注意力机制下进行同步逆向算术编码，从而允许在一个逆向去噪步骤内对多个掩码标记进行编码。这种渐进式恢复过程还为噪声传输生成了更有利的信源表示，因为新恢复的标记可在后续的去噪步骤中充当双向上下文。为了弥合面向生成的掩码去噪与无损算术编码之间的差距，我们进一步引入了 Halton 引导的去噪顺序、掩码比率感知余弦调度以及轻量级温度校准模块。这些设计分别提高了空间覆盖度，使去噪进度适应上下文可靠性，并校准算术编码所使用的概率表。在 CIFAR10、DIV2K-LR-X4 和 Kodak 数据集上，针对加性高斯白噪声和瑞利衰落信道的实验表明，DDM-SSCC 实现了比代表性无损和语义通信基线更好的精确恢复性能，而消融实验验证了所提出的去噪顺序、调度及校准模块的有效性。

Abstract

Lossless pixel-level image transmission is a fundamental regime beyond semantic communications, because exact recovery requires both accurate symbol probability modeling and reliable delivery over noisy channels. This paper proposes DDM-SSCC, a discrete-diffusion-model-based separate source-channel coding framework for lossless image transmission. Different from raster-order autoregressive coding, the proposed source codec adapts a diffusion language model to pixel-token restoration and performs synchronized reverse arithmetic coding under bidirectional attention, allowing multiple masked tokens to be coded within one reverse denoising step. This progressive restoration process also yields a more favorable source representation for noisy transmission, since newly restored tokens can serve as bidirectional context in subsequent denoising steps. To bridge the gap between generation-oriented masked denoising and lossless arithmetic coding, we further introduce a Halton-guided denoising order, a mask-ratio-aware cosine schedule, and a lightweight temperature calibration module. These designs respectively improve spatial coverage, adapt the denoising pace to context reliability, and calibrate the probability tables used by arithmetic coding. Experiments on CIFAR10, DIV2K-LR-X4, and Kodak over additive white Gaussian noise and Rayleigh fading channels show that DDM-SSCC achieves better exact-recovery performance than representative lossless and semantic communication baselines, while ablation studies verify the effectiveness of the proposed denoising order, schedule, and calibration modules.
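
The idea of a Halton-guided order can be sketched with the standard 2-D Halton sequence (bases 2 and 3), which visits grid cells in a spatially spread-out order rather than in raster order. This shows only the generic low-discrepancy ordering; how DDM-SSCC maps it to denoising steps is not specified in the abstract, and `halton_order` is a hypothetical helper.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result

def halton_order(height, width):
    """Visit every cell of a height x width grid once, in 2-D Halton order."""
    order, seen, i = [], set(), 1
    while len(order) < height * width:
        # min() guards against a rounding result of exactly 1.0.
        row = min(int(radical_inverse(i, 2) * height), height - 1)
        col = min(int(radical_inverse(i, 3) * width), width - 1)
        if (row, col) not in seen:
            seen.add((row, col))
            order.append((row, col))
        i += 1
    return order

order = halton_order(4, 4)
print(order[:4])
```

Because early positions are spread across the image, tokens restored in early steps can serve as context for many different neighbourhoods, which is consistent with the spatial-coverage motivation given in the abstract.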

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	6.0/10	9.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

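Scoring rationale: The paper's core is source-channel coding for lossless image transmission using a discrete diffusion model, which has low relevance to keywords such as multimodal large language models (MLLM), World Models, and reinforcement learning (RL). Tokenizer scores 6.0 because the paper involves pixel-token restoration; Visual Encoder scores 3.0 because the diffusion model implies an encoding structure, though it is not the focus; Unify Models, World Models, MultiModal, and model-based RL score low (0-2) because the paper does not address model unification, world dynamics, multimodal interaction, or reinforcement learning. The author list does not include Yang Shi or the other designated experts, so no bonus is applied.

Keywords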

Lossless Image Transmission, Diffusion Language Model, Source-Channel Coding, Pixel-Token Restoration, Reverse Arithmetic Coding, Noisy Channels, Discrete Diffusion Model

56. Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions [FAIL]

Score: 24.0 / 27.8

Authors: Xinnong Zhang, Wanting Shan, Hanjia Lyu, Zhongyu Wei, Jiebo Luo

Published: 2026-06-04

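TL;DR: This paper proposes a framework based on counterfactual context revision to audit LLM stance simulation, and finds that both text-only and multimodal (meme-based) strategies effectively induce stance shifts across different polarization-preference mechanisms.

Abstract (Chinese translation)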

大语言模型（LLM）日益被用于模拟社交媒体用户，并推断个体可能对在线讨论做出何种回应。然而，尚不清楚这些模拟是否反映了精确的用户特定信念，或者它们是否对对话上下文中的语义独立变化高度敏感。在这项工作中，我们将反事实上下文修订（counterfactual context revision）作为一种框架，用于评估基于大语言模型的立场（stance）模拟。给定一个原始在线对话，我们首先推断目标用户对特定话题的立场。然后，我们对对话上下文应用受控修订策略，并在修订后的上下文中再次模拟用户的立场。我们将纯文本修订策略与一种结合基于模因（meme）的上下文的多模态策略进行比较，并评估两个主要有效性指标，即平均方向性立场偏移（average directional stance shift）和立场转换率（stance transition rate）。结果表明，在不同的极化 - 偏好机制下，纯文本和多模态策略均能实现有效且稳健的立场转换。本研究贡献了一个评估框架，用于理解基于大语言模型的立场模拟的上下文敏感性。更广泛地说，它突出了使用大语言模型模拟在线舆论动力学的前景与风险。

Abstract

Large language models are increasingly used to simulate social media users and infer how individuals may respond to online discussions. However, it remains unclear whether these simulations reflect precise user-specific beliefs or whether they are highly sensitive to semantically independent changes in conversational contexts. In this work, we study counterfactual context revision as a framework for auditing LLM-based stance simulation. Given an original online conversation, we first infer a target user's stance toward a specific topic. We then apply controlled revision strategies to the conversational context and simulate the user's stance again under the revised context. We compare text-only revision strategies with a multimodal one that incorporates meme-based context and evaluate two main effectiveness metrics, i.e., average directional stance shift and stance transition rate. The results reveal effective and robust stance transitions in both text-only and multimodal strategies across different polarization-preference mechanisms. Our study contributes an evaluation framework for understanding the context sensitivity of LLM-based stance simulation. More broadly, it highlights both the promise and risk of using LLMs to simulate online opinion dynamics.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	7.0/10	10.5
model-based RL	1.5	1.0/10	1.5

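Scoring rationale: The paper mainly audits LLM stance simulation in online discussions using a counterfactual context revision framework. Although it compares against multimodal (meme-based) context, it does not involve model unification, tokenizer design, visual encoder architecture, world models, or reinforcement learning. Apart from 'MultiModal', the keywords therefore have very low relevance to the paper's core contribution (auditing social simulation). The weighted total is 24.0, below the dynamic passing score of 27.8.

Keywords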

LLM-based Stance Simulation, Counterfactual Context Revision, Multimodal Context, Meme-based, Auditing Framework, Online Discussions, Stance Transition Rate

57. GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery [FAIL]

Score: 24.0 / 27.8

Authors: Hao Lei, Xi Cheng, Chenlu Shu, Zhiheng Chen, Zhengjie Duan, Haoyu Wang, Zhanfeng Shen

Published: 2026-06-04

TL;DR: GMBFormer enhances urban green-space extraction from ultra-high-resolution imagery by decoupling NDVI guidance from RGB visual encoding and utilizing a global memory bank with similarity-driven prototype retrieval, achieving higher mIoU and mDice scores than baseline SegFormer.

Abstract (Chinese translation)

从超高分辨率（UHR）影像中提取城市绿地通常采用分块处理的方式，这限制了空间分离但视觉上相似的植被模式之间的语义复用。直接将归一化植被指数（NDVI）注入红绿蓝（RGB）主干网络也会混淆视觉外观学习与物理植被置信度的作用。我们提出 GMBFormer，这是一种基于 SegFormer 的框架，它用选择性、相似性驱动的原型检索替换了邻域驱动的特征传播。仅 RGB 通道进入主干网络和解码器，而 NDVI 被解耦为一个物理信息门，通过动量更新将高置信度植被描述符引入紧凑的全局记忆库。在训练和推理过程中，当前块通过内存中介的交叉注意力机制查询存储的原型，检索到的响应以有界开销进行集成。实验使用了一个自构建的成都 UHR 数据集，包含 7,700 个标记的 512×512 块，以及两个从公共国际摄影测量与遥感学会（ISPRS）波茨坦数据集派生的少标签设置。在相同的训练和评估协议下，GMBFormer 分别获得了平均交并比（mIoU）/平均 Dice 系数（mDice）分数为 89.25%/94.31%、92.17%/95.92% 和 83.72%/90.86%，在每种设置下均改进了基准 SegFormer-B4 模型。消融研究表明，解耦的 NDVI 引入、内存检索、容量和动量共同影响了最终性能。

Abstract

Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During training and inference, the current patch queries stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 x 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving the controlled SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, capacity, and momentum jointly shape the final performance.
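
The NDVI gate and momentum update described above can be sketched as follows. NDVI is computed with its standard definition, (NIR - Red) / (NIR + Red); the gate threshold of 0.5, the momentum of 0.9, the single-prototype memory, and all function names are assumptions for illustration, not values from the paper.

```python
def ndvi(nir, red, eps=1e-8):
    """Normalized Difference Vegetation Index; eps avoids division by zero."""
    return (nir - red) / (nir + red + eps)

def momentum_update(prototype, descriptor, momentum=0.9):
    """Exponential moving average: keep `momentum` of the old prototype."""
    return [momentum * p + (1.0 - momentum) * d for p, d in zip(prototype, descriptor)]

def admit_and_update(prototype, descriptor, nir, red, gate=0.5, momentum=0.9):
    """Update the prototype only when NDVI passes the (hypothetical) gate threshold."""
    if ndvi(nir, red) >= gate:
        return momentum_update(prototype, descriptor, momentum)
    return prototype

prototype = [1.0, 0.0]
descriptor = [0.0, 1.0]  # RGB-derived feature of the current patch (illustrative)

# High NDVI (likely vegetation): the descriptor is admitted into memory.
updated = admit_and_update(prototype, descriptor, nir=0.8, red=0.1)
# NDVI near zero: the memory is left untouched.
unchanged = admit_and_update(prototype, descriptor, nir=0.3, red=0.3)
print(updated, unchanged)
```

Note that NDVI never enters the feature vector itself in this sketch; it only decides whether the RGB-derived descriptor is written to memory, mirroring the decoupling the abstract describes.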

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: This paper addresses remote sensing semantic segmentation, which has low alignment with RL, World Models, and MLLM keywords. 'Visual Encoder' (SegFormer backbone) and 'MultiModal' (RGB+NDVI fusion) are moderately relevant. 'Tokenizer' is implicit in the Transformer. 'Unify Models' is weakly relevant regarding input integration. 'World Models', 'MLLM', and 'model-based RL' are unrelated to this supervised learning task.

Keywords

Urban Green-Space Extraction, Ultra-High-Resolution Imagery, Global Memory Bank, NDVI-Guided, SegFormer, Prototype Retrieval, Multi-Modal Fusion, Semantic Segmentation

58. Physics in 2-Steps: Locking Motion Priors Before Visual Refinement Erases Them [FAIL]

Score: 24.0 / 27.8

Authors: Woojung Han, Seil Kang, Youngjun Jun, Min-Hung Chen, Fu-En Yang, Seong Jae Hwang

Published: 2026-06-04

TL;DR: This paper proposes PhaseLock, a training-free framework that preserves motion priors from few-step diffusion inference to enhance physical consistency in image-to-video generation while maintaining visual fidelity.

Abstract (Chinese translation)

图像到视频扩散模型利用输入图像生成视觉上令人惊叹的内容，却经常产生违反物理定律的运动。我们揭示了一个惊人的发现：2 步生成往往比同一模型生成的 50 步输出表现出更好的物理一致性。通过频谱分析，我们将此现象追溯至去噪过程中的相位侵蚀；相位显著退化（从第 2 步到第 50 步下降约 18%），而幅度保持相对稳定。基于此洞察，我们提出 PhaseLock（无需训练的框架），能够在整个去噪轨迹中保留来自少步推理的有效运动先验。与依赖全步推理以确保物理一致性不同，PhaseLock 仅从 2 步中提取运动先验，并通过潜在 Delta 引导（Latent Delta Guidance）将其施加于高保真生成。我们的方法有效缓解了相位退化，在多种模型上平均提高了 6.2 点的物理一致性，同时很大程度上保持了视觉保真度，开销可忽略（时间开销为 1.06 倍，内存开销为 1.02 倍），并减少了对昂贵外部引导方法的依赖（节省约 5 倍时间）。

Abstract

Image-to-Video diffusion models leverage input images to generate visually stunning content, yet frequently produce motion that violates physical laws. We reveal a surprising finding: a 2-step generation often exhibits better physical consistency than a 50-step output from the same model. Through spectral analysis, we trace this to phase erosion during denoising; the phase degrades significantly (dropping by $\approx 18\%$ from step 2 to step 50), whereas the magnitude remains relatively stable. Building on this insight, we propose PhaseLock, a training-free framework that preserves the valid motion priors from few-step inference throughout the denoising trajectory. Rather than relying on full-step inference for physical consistency, PhaseLock extracts a motion prior from just 2 steps and enforces it onto high-fidelity generation via Latent Delta Guidance. Our approach effectively mitigates phase degradation, improving physical consistency by an average of 6.2 points across diverse models while largely maintaining visual fidelity, with negligible overhead ($1.06\times$ time, $1.02\times$ memory) and reduced reliance on expensive external guidance methods ($\sim5\times$ time).
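
Why phase, rather than magnitude, is the natural carrier of motion information can be seen from the Fourier shift theorem: translating a signal changes the phase of its spectrum but not the magnitude. The sketch below demonstrates this on a 1-D toy signal with a naive DFT; it is a general illustration, not the paper's spectral analysis, and the frames are made up.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

# The same "object" (a single bump) at two positions, two samples apart.
frame_a = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
frame_b = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

spec_a, spec_b = dft(frame_a), dft(frame_b)

# Magnitudes are identical; only the phase records the displacement.
magnitude_gap = max(abs(abs(a) - abs(b)) for a, b in zip(spec_a, spec_b))
phase_shift_k1 = cmath.phase(spec_b[1] / spec_a[1])  # expected: -2*pi*2/8 = -pi/2
print(magnitude_gap, phase_shift_k1)
```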

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	4.0/10	6.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on Image-to-Video diffusion models and physical consistency, proposing the PhaseLock framework. It has low relevance to MLLM, Tokenizer, and Model-based RL as it involves neither language models nor reinforcement learning. It has moderate relevance to MultiModal and World Models due to motion dynamics modeling, but Unify Models and Visual Encoder are not the core focus. No expert authors from the specified list are found.

Keywords

Image-to-Video, Diffusion Models, Motion Priors, Physical Consistency, Phase Locking, Denoising Trajectory, Phase Erosion

59. Diff-CA: Separating Common and Salient Factors with Diffusion Models [FAIL]

Score: 24.0 / 27.8

Authors: Michaël Soumm, Alexandre Fournier Montgieux, Yunlong He, Pietro Gori, Alasdair Newson

Published: 2026-06-04

TL;DR: This paper introduces a diffusion-based conditioning framework that effectively separates common and salient factors in image distributions, achieving high-fidelity generation and editing without sacrificing reconstruction quality.

Abstract (Chinese translation)

对比分析 (Contrastive Analysis) 旨在将两个数据分布之间共同的因子与仅对其中一个显著的因子区分开来。现有的对比方法基于生成模型（例如 VAEs 或 GANs），这些模型通常面临重建能力和图像质量受限的问题，这阻碍了有效的潜在因子分离，并限制了其在高保真图像生成与编辑中的适用性。我们提出了一种针对扩散模型 (Diffusion Models) 的新型条件框架，该框架能够在不牺牲生成质量的前提下实现对比分解。我们首先训练一个无提示词、图像条件的扩散模型，然后利用弱监督学习将条件分解为共同因子和显著因子。我们证明了先前工作中通常假设的加性对比因子分解 (Additive Contrastive Factorization) 在温和条件下是可识别的。这种因子分解使得仅通过交换或插值显著因子即可实现针对性操作。

Abstract

Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.
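
The swap and interpolation operations follow directly from the additive form: if a conditioning vector is the sum of a common and a salient factor, editing amounts to recombining those parts. The sketch below assumes the factors are already available as plain vectors; the encoders that would produce them, the values, and the function names are hypothetical.

```python
def combine(common, salient):
    """Additive factorization: conditioning = common + salient."""
    return [c + s for c, s in zip(common, salient)]

def interpolate(salient_a, salient_b, alpha):
    """Linear interpolation between two salient factors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(salient_a, salient_b)]

# Illustrative factors for two images x and y.
common_x = [1.0, 2.0]
salient_x = [0.5, 0.0]
salient_y = [0.0, -1.0]

# Swap: keep what x shares with the background set, take y's salient factor.
swapped = combine(common_x, salient_y)
# Interpolate: move x's salient factor halfway toward y's.
halfway = combine(common_x, interpolate(salient_x, salient_y, 0.5))
print(swapped, halfway)
```

In the paper's setting the resulting vector would condition the diffusion model; that generation step is omitted here.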

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on diffusion models for contrastive factor separation. It has moderate relevance to Visual Encoder and MultiModal due to image conditioning, low relevance to Tokenizer/Unify Models/World Models/MLLM, and zero relevance to model-based RL. No specified expert authors are present.

Keywords

Diffusion Models, Contrastive Analysis, Common and Salient Factors, Image Conditioning, Factor Separation, Generative Modeling, High-fidelity Generation

60. FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning [FAIL]

Score: 24.0 / 27.8

Authors: Marian Lupascu, Nipun Jindal, Ionut Mironica, Zhaowen Wang

Published: 2026-06-04

TL;DR: FontFusion proposes a typographic conditioning framework for Diffusion Transformers that resolves the trade-off between font control and text legibility through hierarchical token representations and dual encoders.

Abstract (Chinese translation)

扩散模型中的排版生成面临着一个持续的权衡：实现精确的字体控制通常会降低文本可读性，而保持易读性往往又会牺牲排版保真度。我们提出了 FontFusion，这是一个面向扩散变换器（DiT）架构的即插即用条件化框架，通过三项核心创新解决了这一困境：(1) 一种层次化 token 表示，在多个粒度上建立显式的文本 - 字体关系；(2) 位置感知嵌入，在排版与图像内容之间建立空间绑定；(3) 一种多层次 token 丢弃策略，同时提升计算效率及对未见字体的泛化能力。我们对字体嵌入空间的系统评估表明，结合 DeepFont 和 DINOv2 的双编码器在排版任务上优于任何单编码器。FontFusion 在具有挑战性的装饰性字体上相对于单编码器基线实现了 76% 的相对提升，且在字体一致性方面相对于无条件模型获得了约 68-76% 的提升，同时无需重新训练即可集成到现有的 DiT 架构中。

Abstract

Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains of approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on typography generation in diffusion models, utilizing token manipulation and encoders. It shows moderate relevance to MultiModal (text/image integration) and Visual Encoder (DINOv2/DeepFont usage). It has low relevance to Tokenizer (token manipulation vs tokenization) and Unify Models (conditioning vs unification). It is irrelevant to World Models, MLLM, and Model-Based RL. No expert authors from the specified list are found.

Keywords

Typography generation, Diffusion Models, Typographic Conditioning, Token Representation, Visual Encoder, Font Consistency, DiT Architecture, Dual Encoder

61. Unveiling the Unknown: Open Vocabulary Object Detection with Scene Graphs [FAIL]

Score: 24.0 / 27.8

Authors: Yi Chen, Yinghao Lu, Zhehao Li, Chenchen Yan, Jiafei Wu, Chong Wang, Jiangbo Qian

Published: 2026-06-04

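TL;DR: This paper proposes a scene-graph-based relational modeling framework that improves open-vocabulary object detection on novel categories by capturing structured semantic and spatial relationships between objects.

Abstract (Chinese translation)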

开放词汇目标检测（Open-vocabulary Object Detection，OVOD）旨在识别那些不属于训练数据的新颖对象类别。许多基于知识蒸馏的方法通过将预训练的视觉 - 语言模型（Vision-Language Models）的知识迁移至目标检测任务，取得了有前景的性能。然而，这些方法往往忽视了对象之间结构化、图像特定的关系，例如交互和空间布局。这种忽视可能会显著限制检测新颖类别的有效性。为了解决这一问题，我们提出了一种场景引导关系建模（Scene-guided Relational Modeling）检测框架。该框架利用场景图（Scene Graphs）捕捉候选区域与其上下文对象之间的结构化语义和空间关系。它显式地建模邻近区域之间的交互，并引入关系注意力模块（Relation Attention Module）以隐式增强从场景图中提取的关键关系线索。此外，我们提出了一种基于场景的文本对齐分支，该分支从文本描述中蒸馏类别知识以指导关系对齐。这种方法促进了视觉关系与语义信息的无缝集成，从而提升了检测性能。综合实验表明，我们的模型相较于其他 OVOD 方法表现更优，在 COCO 和 LVIS 数据集上提高了新颖类别的平均精度（AP）。

Abstract

Open-vocabulary object detection seeks to identify novel object categories that were not part of the training data. Many knowledge distillation-based approaches have shown promising performance by transferring knowledge from pre-trained vision-language models to object detection. However, these methods often overlook structured, image-specific relationships between objects, such as interactions and spatial arrangements. This oversight can significantly restrict the effectiveness of detecting novel categories. To address this issue, we propose a Scene-guided Relational Modeling detection framework. This framework utilizes scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects. It explicitly models interactions among neighboring regions and incorporates a Relation Attention Module to implicitly amplify the key relational cues extracted from the scene graph. Furthermore, we present a scene-based textual alignment branch that distills category knowledge from captions to guide relational alignment. This approach facilitates a seamless integration of visual relations with semantic information for enhanced detection performance. Comprehensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the AP for novel categories on COCO and LVIS datasets.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	6.0/10	9.0
model-based RL	1.5	0.0/10	0.0

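Scoring rationale: The paper focuses on open-vocabulary detection and scene-graph relational modeling and does not involve world models or reinforcement learning. It touches on vision-language alignment (MultiModal, MLLM) and encoders, but these are not its core. It does not fit the Unify Models concept. The weighted total is 24.0, below the passing score of 27.8.

Keywords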

Open Vocabulary Object Detection, Scene Graphs, Relational Modeling, Novel Object Categories, Visual Relations, Semantic Information, Knowledge Distillation

62. Pretraining Recurrent Networks without Recurrence [FAIL]

Score: 22.5 / 27.8

Authors: Akarsh Kumar, Phillip Isola

Published: 2026-06-04

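TL;DR: The paper proposes Supervised Memory Training, which lets recurrent neural networks be trained in parallel without backpropagation through time and improves their ability to capture long-range dependencies.

Abstract (Chinese translation)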

训练循环神经网络（RNNs）需要在长序列计算中进行信用分配。标准的时间反向传播（BPTT）对此问题的解决效果不佳：它在时间上是顺序执行的，限制了并行性，且面临梯度消失或爆炸的问题，使得长程关联难以学习。我们提出监督记忆训练（SMT），这是一种训练非线性 RNN 的方法，它完全规避了循环信用传播，通过将 RNN 训练转化为基于单步记忆转换标签 (m_t, x_{t+1}) → m_{t+1} 的监督学习。SMT 通过在预测状态目标上训练基于 Transformer 的编码器来获取这些记忆标签，即仅保留来自过去预测未来所需的信息。通过将记忆内容与更新机制解耦，SMT 实现了时间并行的 RNN 训练，使得任意两个词元之间具有稳定的 O(1) 长度梯度路径，且无需展开 RNN。我们发现，在语言建模和像素序列建模等任务上预训练各种 RNN 架构时，SMT 优于 BPTT。SMT 使非线性 RNN 能够更好地捕获长程依赖并进行并行训练，从而可能解锁能够构建过去经验的时间抽象模型的规模化扩展。

Abstract

Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective--retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$ length gradient path between any two tokens--without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.
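
The reduction to one-step supervised learning can be sketched as follows: given memory labels m_0..m_T and inputs x_1..x_T, every transition (m_t, x_{t+1}) -> m_{t+1} becomes an independent training example, so the loss never requires unrolling the recurrent cell. In the paper the labels come from a Transformer-based encoder; here a running sum stands in as a hypothetical label source, and the scalar `cell` is a toy placeholder.

```python
def transition_pairs(memories, inputs):
    """Build ((m_t, x_{t+1}), m_{t+1}) examples.

    memories: m_0..m_T (T + 1 items); inputs: x_1..x_T (T items).
    """
    if len(memories) != len(inputs) + 1:
        raise ValueError("need exactly one more memory than inputs")
    return [((memories[t], inputs[t]), memories[t + 1]) for t in range(len(inputs))]

def one_step_loss(cell, pairs):
    """Mean squared error over independent one-step transitions (no unrolling)."""
    return sum((cell(m, x) - target) ** 2 for (m, x), target in pairs) / len(pairs)

def make_cell(w):
    """Toy scalar recurrent cell m' = m + w * x with a single parameter w."""
    return lambda m, x: m + w * x

inputs = [1.0, 2.0, 3.0]          # x_1..x_3
memories = [0.0, 1.0, 3.0, 6.0]   # m_0..m_3: stand-in labels (running sum)
pairs = transition_pairs(memories, inputs)

loss_good = one_step_loss(make_cell(1.0), pairs)  # cell that matches the labels
loss_bad = one_step_loss(make_cell(0.5), pairs)
print(loss_good, loss_bad)
```

Each term of the loss depends only on its own pair, which is what would allow all time steps to be processed in parallel.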

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	2.0/10	3.0

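Scoring rationale: The paper's core is Supervised Memory Training (SMT) for training RNNs in parallel. Although it involves sequence modeling (language and pixels), it does not explore multimodal unification, world model architectures, reinforcement learning frameworks, or specific tokenizer designs in depth, so its fit with the given keyword themes is low.

Keywords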

Recurrent Networks, Supervised Memory Training, Backpropagation Through Time, Transformer-based Encoder, Predictive State Objective, Parallel Training, Long-range Dependencies, Sequence Modeling

63. Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss [FAIL]

Score: 22.5 / 27.8

Authors: Thomas T. Zhang, Alok Shah, Yifei Zhang, Vincent Zhang, Nikolai Matni, Max Simchowitz

Published: 2026-06-04

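TL;DR: This paper proposes Double Preconditioning (DoPr), an optimization method that combines gradient-wise and activation-wise preconditioning to mitigate error accumulation under test-time feedback in settings such as autoregressive modeling and robot policy learning, improving downstream performance even when validation loss does not improve.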

摘要翻译

许多现代深度学习应用涉及通过一步预测损失（例如 $L^2$ 回归、交叉熵）训练神经网络，但在部署时却通过迭代其自身预测进行展开。典型示例包括自回归语言建模、基于流的生成建模以及机器人策略学习。已有充分文献表明，这些设置会引发一种我们称之为测试时反馈（TTF）的现象：即训练/验证损失与下游指标（如任务成功率和生成质量）之间的差异，且该差异随任务长度增加而扩大。尽管已有研究提出通过数据策展、架构设计和目标设计来对抗 TTF 设置中的训练 - 测试分布偏移，但本文提出将优化作为一种新的设计维度，以减轻误差累积。具体而言，我们引入了一种新的优化范式，称为双重预条件处理（DoPr），专门针对 TTF 的挑战而设计。DoPr 结合了梯度级预条件处理（如 Adam 和 Muon 中所示）与激活级预条件处理（AP，如 KFAC 中所示）。我们表明，添加 AP 可在一系列 TTF 设置中作为即插即用干预措施，以提升下游模型性能。有趣的是，这些测试时性能的提升并不总是伴随着验证损失的降低，这引发了关于如何正确评估使用一步监督目标训练的模型的新问题。

Abstract

Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
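One plausible way to combine the two kinds of preconditioning for a single linear layer is sketched below. The abstract does not give DoPr's update rule, so the ordering (activation-wise first, then an Adam-style elementwise step), the damping value, and the function names `activation_preconditioned_grad` and `adam_direction` are all assumptions for illustration.

```python
import numpy as np

def activation_preconditioned_grad(X, dY, damping=1e-3):
    """KFAC-style input-side preconditioning for a layer y = x @ W.

    X:  (n, d_in) layer inputs; dY: (n, d_out) gradients w.r.t. layer outputs.
    Returns (A + damping * I)^{-1} G, where G is the weight gradient and
    A is the input second-moment matrix.
    """
    n, d_in = X.shape
    G = X.T @ dY / n
    A = X.T @ X / n
    return np.linalg.solve(A + damping * np.eye(d_in), G)

def adam_direction(g, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam moment update; returns the elementwise-scaled direction."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 5))    # toy activations
dY = rng.normal(size=(32, 3))   # toy output gradients
W = np.zeros((5, 3))
state = {"t": 0, "m": np.zeros_like(W), "v": np.zeros_like(W)}

g_ap = activation_preconditioned_grad(X, dY)  # activation-wise step
update = adam_direction(g_ap, state)          # gradient-wise step
W -= 1e-3 * update
```

The sketch shows only how the two preconditioners compose; it makes no claim about downstream performance.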

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	5.0/10	7.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	6.0/10	9.0

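评分理由: The paper's core contribution is an optimization algorithm, Double Preconditioning, that aims to mitigate error accumulation under test-time feedback (TTF) in autoregressive modeling and robot policy learning. It does not involve tokenizers, visual encoders, or multimodal architecture design, so those keywords score 0. The optimization method is general but does not directly build a unified model architecture or an MLLM, so Unify Models and MLLM score low (2.0). The generative modeling and robot policy rollout settings it covers align closely with the inference process of World Models and model-based RL, so these receive mid-to-high scores (5.0 and 6.0).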

关键词

Double Preconditioning, Test-Time Feedback, Robot Policy Learning, Activation-wise Preconditioning, Gradient-wise Preconditioning, Train-Test Shift, Autoregressive Modeling

64. MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following [FAIL]

Score: 22.5 / 27.8

Authors: Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti

Published: 2026-06-04

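TL;DR: This paper proposes MDP-GRPO, which stabilizes group-relative policy optimization through multi-temperature sampling, dual-anchor advantages, prospect-theoretic shaping, and asymmetric KL regularization, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B and enabling stable convergence with small group sizes.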

摘要翻译

可验证奖励的强化学习非常适合多约束指令遵循，然而标准的组相对策略优化（GRPO）在离散且低分散度的奖励下变得不稳定，此时组内奖励分布往往高度同质化。我们识别并形式化了该机制下 z-score 组归一化的三个缺陷：低方差放大、均值中心化盲点以及零方差崩溃。为了解决这些问题，我们提出了 MDP-GRPO，该方法通过以下四个方面稳定学习：(1) 多温度采样以增加奖励分散度；(2) 双锚优势以在同质组中恢复梯度并消除均值中心化盲点；(3) 基于前景理论（卡尼曼和特沃斯基理论）的塑造机制以限制更新并惩罚违规；(4) 不对称 KL 正则化。在 FollowBench、IFEval 及一个精心构建的多约束数据集上评估，MDP-GRPO 优于标准 GRPO，在 Llama-3.2-3B 上将严格约束满足率提高了高达 5.0%。此外，该方法还能在较小的组规模下实现稳定收敛，同时在 MMLU 和 ARC 上保持通用能力。

Abstract

Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.
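The three pathologies of z-score group normalization can be reproduced with a few lines of arithmetic. The sketch below implements only the standard normalization, not the MDP-GRPO fixes. The mapping of each example to a named pathology is one reading of the abstract, and the `eps` value is an assumption.

```python
import numpy as np

def zscore_advantages(rewards, eps=1e-6):
    """Standard group-relative advantage: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Zero-variance collapse: identical rewards give all-zero advantages,
# so the group contributes no gradient.
all_fail = zscore_advantages([0, 0, 0, 0])

# Mean-centering blindness: a group where every response passes looks
# exactly like a group where every response fails.
all_pass = zscore_advantages([1, 1, 1, 1])

# Low-variance amplification: a 0.01 reward difference is rescaled
# into an advantage of roughly sqrt(3).
near_tie = zscore_advantages([0.80, 0.80, 0.80, 0.81])
```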

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	2.0/10	3.0

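评分理由: The paper's core contribution is a stability improvement to a reinforcement learning algorithm (MDP-GRPO), targeting reward-distribution problems in multi-constraint instruction following. Apart from MLLM (application to Llama models) and MultiModal (a potential setting for instruction following), which have some domain relevance, the remaining keywords (Tokenizer, Visual Encoder, World Models, Unify Models, and model-based RL) are not core to the paper, which focuses on model-free policy optimization rather than world models or model building. No listed expert is among the authors, so no bonus applies. The weighted total of 22.5 is below the dynamic pass score of 27.8.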

关键词

Group Relative Policy Optimization, Multi-Constraint Instruction Following, Reinforcement Learning, Reward Dispersion, Prospect-Theoretic Shaping, Asymmetric KL Regularization, FollowBench, Llama-3.2-3B

65. Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts [FAIL]

Score: 22.5 / 27.8

Authors: Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

Published: 2026-06-04

TL;DR: This paper proposes Retrospective Harness Optimization (RHO), a self-supervised method that improves LLM agents by optimizing the agent harness using only past trajectories, selecting harness updates through pairwise self-preference without external labels.

摘要翻译

智能体依赖于一套技能、工具和流程来解决复杂问题。持续改进这套工具集对于适应新任务至关重要。然而，现有的优化方法通常需要真实验证集，但在实际部署环境中，此类标注数据难以获取。为了解决这一问题，我们提出回顾性工具集优化（RHO），这是一种仅利用过往轨迹来优化智能体工具集的自监督方法。具体来说，RHO 从过往轨迹中选择一个多样化的挑战性任务核心集，并并行地重新解决它们。智能体使用自验证和自一致性来分析这些轨迹，然后生成候选工具集更新，并通过其自身的成对自偏好选择最有效的一个。我们在三个不同的领域评估了 RHO，涵盖软件工程、技术工作和知识工作。值得注意的是，单次优化轮次将 SWE-Bench Pro 上的通过率从 59% 提升至 78%，且无需任何外部评分。此外，我们的分析表明，RHO 能有效针对先前的失败模式。因此，优化的工具集改变了智能体的行为模式，并在长时程会话中保持更高的准确性。

Abstract

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

评分理由: The paper focuses on LLM agent harness optimization via self-supervised trajectory analysis, lacking content on multimodal architectures (Tokenizer, Visual Encoder, MultiModal) or model unification. It uses trajectories but does not construct world models or model-based dynamics models. Weighted sum (22.5) is below the dynamic pass threshold (27.8). No matching expert authors found.

关键词

Retrospective Harness Optimization, LLM Agents, Self-Preference, Trajectory Rollouts, Self-Supervised, Tool Optimization, Software Engineering

66. Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation [FAIL]

Score: 22.5 / 27.8

Authors: Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas, Rico Sennrich

Published: 2026-06-04

TL;DR: The paper proposes using reinforcement learning to teach large language models the meta-skill of utilizing linguistic context for zero-shot translation of unseen languages, outperforming in-context learning.

摘要翻译

先前研究表明，大型语言模型（LLMs）可以通过持续训练甚至在上下文中嵌入语法知识来翻译未见过的或低资源语言。然而，这两种方法通常都会过拟合特定语言，在测试时的零样本迁移能力有限。为了在大规模上翻译极低资源语言，我们认为 LLMs 必须掌握利用上下文语言知识的元技能，而非记忆特定语言。在本文中，我们提出了一种基于强化学习（RL）的方法，用于在丰富语言上下文下进行未见语言翻译，并使用表层翻译指标（chrF）作为奖励。实验表明，尽管奖励机制轻量，我们的 RL 训练模型能有效提取并应用所提供上下文中的相关语言信息，从而在完全未见过的语言上获得比上下文学习或监督微调更好的翻译效果。我们的分析表明，基于结果的强化学习可以超越传统的推理任务（如数学和编码），成为从上下文学习语言的一种有效途径。

Abstract

Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific languages, with limited zero-shot transfer at test time. To translate extremely low-resource languages at scale, we argue that LLMs must acquire the meta-skill of utilizing in-context linguistic knowledge rather than memorizing specific languages. In this paper, we propose a reinforcement learning (RL) approach to unseen language translation given rich linguistic context, using a surface-level translation metric (chrF) as the reward. Empirically, despite the lightweight reward, our RL-trained models effectively extract and apply relevant linguistic information from the provided context, leading to better translations on completely unseen languages than in-context learning or supervised fine-tuning. Our analyses suggest that outcome-based RL can extend beyond conventional reasoning tasks like math and coding to serve as a recipe for language learning from context.
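To make the reward concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch, not the reference implementation. Whitespace handling, the maximum n-gram order of 6, and β = 2 follow common chrF defaults but are assumptions here, and `simple_chrf` is a hypothetical name.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces (a simplification)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders, then take F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# Used as a scalar reward for a sampled translation against a reference.
reward = simple_chrf("the cat sat on the mat", "the cat sits on the mat")
```

A surface metric of this kind needs no learned reward model, which is what makes the reward lightweight.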

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	5.0/10	7.5

评分理由: The paper focuses on Reinforcement Learning for NLP tasks (translation) using LLMs, lacking visual encoders, multimodal components, or world modeling, resulting in low scores for Visual Encoder, MultiModal, MLLM, and World Models. Unify Models and Tokenizer are irrelevant. model-based RL is moderately relevant (5.0) as RL is the core method, though the abstract specifies 'outcome-based RL'. Weighted total score is 22.5, below the dynamic pass score of 27.8. None of the listed experts are authors.

关键词

Reinforcement Learning, Unseen Language Translation, Contextual Learning, Large Language Models, Zero-shot Transfer, Linguistic Context, chrF Reward

67. Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation [FAIL]

Score: 22.5 / 27.8

Authors: Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt

Published: 2026-06-04

TL;DR: This paper addresses the degradation of Scene Graph Generation models under annotation sparsity by proposing a model-agnostic framework that refines predictions using visual commonsense knowledge at inference time, achieving consistent performance improvements.

摘要翻译

学习驱动的场景图生成（SGG）模型在频繁的关系类型上表现优异，但在标注稀疏性下性能急剧下降，无法捕捉可靠的视觉常识知识。我们提出一种模型无关的、语义引导的知识精炼框架，该框架系统地从训练数据中挖掘基于常识的约束——涵盖空间、功能和定性关系规律——并在推理阶段利用通用声明式常识推理来校正和优化排序后的 SGG 预测结果。该框架无需人工规则构建，无需模型再训练，且具有跨数据集和架构的迁移能力。在三个标准基准测试上，我们相对于强基线获得了持续改进，这表明基于深度场景语义的结构化视觉常识推理是纯学习驱动的场景图生成的一种实用且有效的补充。

Abstract

Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically-guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data - capturing spatial, functional, and qualitative relational regularities - and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring, no model retraining, and transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	1.0/10	1.5

评分理由: The paper focuses on Scene Graph Generation and commonsense knowledge refinement, showing moderate relevance to MultiModal (visual-textual relations) and Visual Encoder (vision backbones). It has low relevance to Unify Models, Tokenizer, World Models, MLLM, and model-based RL as it does not address model unification, tokenization strategies, world dynamics, large language models, or reinforcement learning. No expert authors from the specified list were found.

关键词

Scene Graph Generation, Visual Commonsense Knowledge, Knowledge Refinement, Inference-time Correction, Model-agnostic Framework, Declarative Reasoning

68. Unsupervised Skill Discovery for Agentic Data Analysis [FAIL]

Score: 21.0 / 27.8

Authors: Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen, Shumin Deng

Published: 2026-06-04

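TL;DR: This paper proposes DataCOPE, an unsupervised verifier-guided skill discovery framework that extracts reusable procedural knowledge from unlabeled exploration trajectories, improving data-analytic agents without updating model parameters.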

摘要翻译

推理时技能增强（Inference-time skill augmentation）提供了一种轻量级方法，通过注入可重用的程序性知识来提升数据分析智能体（data-analytic agents），而无需更新模型参数。然而，发现用于数据分析的有效技能仍然具有挑战性，因为可靠监督成本高昂，且成功标准在不同分析格式中各不相同。这引出了一个关键问题：如何仅从无标签探索中发现可重用的数据分析技能。我们提出 DataCOPE，一种面向数据分析智能体的无监督验证器（Unsupervised Verifier）引导的技能发现框架。DataCOPE 从探索轨迹中提取验证器信号，并利用它们来刻画轨迹之间的相对质量或一致性。它迭代协调一个用于轨迹生成的数据分析智能体、一个用于信号提取的无监督验证器以及一个用于对比技能蒸馏（contrastive skill distillation）的技能管理器。对于报告式分析，我们将验证器实例化为自适应清单验证器（Adaptive Checklist Verifier），该验证器推导出任务特定标准，根据可验证覆盖率对报告进行评分，并迭代优化清单。对于推理式分析，我们将其实例化为答案一致性验证器（Answer Agreement Verifier），该验证器根据答案一致性对轨迹进行分组，并使用自一致性作为辅助信号。我们在 Deep Data Research 的报告式分析和 DABStep 的推理式分析上评估了 DataCOPE。在两种设置下，DataCOPE 始终优于基线方法，提升了未见数据性能。在四种模型设置下平均而言，DataCOPE 在报告式和推理式任务上的平均得分分别提高了 9.71% 和 32.30%。

Abstract

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.
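The idea behind grouping trajectories by answer agreement can be sketched as follows. This is a minimal illustration, not DataCOPE's verifier. The trajectory format (a dict with an `answer` key), the function name `answer_agreement`, and the use of the majority share as the self-consistency signal are all assumptions.

```python
from collections import Counter

def answer_agreement(trajectories):
    """Group trajectories by final answer and measure how strongly they agree.

    Returns (groups, majority_answer, consistency), where consistency is the
    fraction of trajectories that reached the most common answer.
    """
    groups = {}
    for traj in trajectories:
        groups.setdefault(traj["answer"], []).append(traj)
    counts = Counter({answer: len(members) for answer, members in groups.items()})
    majority_answer, majority_size = counts.most_common(1)[0]
    return groups, majority_answer, majority_size / len(trajectories)

# Toy rollouts for one task; no ground-truth label is involved.
rollouts = [
    {"id": 0, "answer": "42"},
    {"id": 1, "answer": "42"},
    {"id": 2, "answer": "17"},
    {"id": 3, "answer": "42"},
]
groups, majority, consistency = answer_agreement(rollouts)
```

Majority and minority groups of this kind could then serve as contrasting examples for skill distillation, though how DataCOPE uses them in detail is not stated in the abstract.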

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	4.0/10	6.0

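评分理由: The paper centers on unsupervised skill discovery for agentic data analysis, involving trajectory generation and skill distillation. It has no direct connection to the multimodal, visual encoder, or tokenizer keywords. It has some conceptual relevance to World Models and MLLM (skill learning touches on representations and LLM agents), but these are not core. It is somewhat related to model-based RL (it involves agents and reinforcement-learning-style trajectories), but it emphasizes skill discovery rather than model-based planning. The weighted total of 21.0 is below the dynamic pass score of 27.8, indicating low relevance to the given keyword set.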

关键词

Unsupervised Skill Discovery, Agentic Data Analysis, Verifier-guided, Skill Distillation, Trajectory Generation, Data-Analytic Agents, Unsupervised Learning

69. Closing the Loop on Latent Reasoning via Test-Time Reconstruction [FAIL]

Score: 21.0 / 27.8

Authors: Xiaopeng Yuan, Haibo Jin, Ye Yu, Peng Kuang, Lijun Yu, Yushun Dong, Haohan Wang

Published: 2026-06-04

TL;DR: The paper proposes ReLAT, a test-time reconstruction method that closes the loop in latent reasoning for LLMs, significantly improving accuracy on mathematical, QA, and code generation tasks.

摘要翻译

近期工作将中间推理从自然语言轨迹转移到 latent (潜在) 或 cache-level (缓存级别) 表示，以减少 token 开销并避免离散通信瓶颈。然而，这种转移也移除了文本推理的一个关键优势：中间状态不再 inspectable (可检查)，使得难以确定 latent state (潜在状态) 是否仍保留了 original query (原始查询) 的约束。因此，latent reasoning (潜在推理) 通常以 open loop (开环) 方式运行，其中 latent state 被生成和消耗，而没有 input-anchored fidelity check (基于输入的保真度检查)。我们提出了 ReLAT (Reconstruction-Guided Latent Reasoning At Test Time)，这是一种 self-supervised (自监督) 的 test-time training (测试时训练) 方法，它使用 query 本身作为参考来闭合这个循环。我们的关键观察是：如果 latent state 忠实地表示了一个 query，那么该 query 应该能从其中恢复；如果 query 无法恢复，则 latent state 已丢失 task-relevant information (任务相关信息)。ReLAT 通过构建一个可微的 Question -> Latent Thought (潜在思考) -> Question cycle (循环)，并在 answer generation (答案生成) 前通过 latent thought 优化 query reconstruction loss (查询重建损失)，来实现这一原则。这将 opaque latent computation (不透明的潜在计算) 锚定在其应代表的问题 specification (问题规范) 上。在 Qwen 系列上的 knowledge QA (知识 QA) 和 code generation (代码生成) benchmarks (基准) 上，ReLAT 始终优于 single-model inference (单模型推理)、text-based collaboration (基于文本的协作)、open-loop latent collaboration (开环潜在协作)，以及 alternative test-time training objectives (替代性测试时训练目标)。在 Qwen3-8B 上，ReLAT 将 AIME 2024 accuracy (准确率) 从 56.7% 提升至 73.3%，相比最强的 open-loop latent baseline (开环潜在基线) 取得了 16.6 点的增益。

Abstract

Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. As a result, latent reasoning typically operates in an open loop, where a latent state is produced and consumed without an input-anchored fidelity check. We propose ReLAT (Reconstruction-Guided Latent Reasoning At Test Time), a self-supervised test-time training method that closes this loop using the query itself as the reference. Our key observation is that if a latent state faithfully represents a query, the query should be recoverable from it; if the query cannot be recovered, the latent state has lost task-relevant information. ReLAT operationalizes this principle by constructing a differentiable Question -> Latent Thought -> Question cycle and optimizing query reconstruction loss through the latent thought before answer generation. This anchors opaque latent computation to the problem specification it is supposed to represent. Across mathematical reasoning, knowledge QA, and code generation benchmarks on the Qwen family, ReLAT consistently improves over single-model inference, text-based collaboration, open-loop latent collaboration, and alternative test-time training objectives. On Qwen3-8B, ReLAT raises AIME 2024 accuracy from 56.7% to 73.3%, a 16.6-point gain over the strongest open-loop latent baseline.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	5.0/10	7.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

评分理由: The paper focuses on latent reasoning and test-time reconstruction within LLMs (Qwen family), directly addressing token bottlenecks (Tokenizer) by avoiding discrete communication bottlenecks. It shows low relevance to Visual Encoder, MultiModal, and model-based RL due to the text-only nature of tasks (math, QA, code). Unify Models and World Models have partial relevance regarding latent states but are not core topics compared to the specific method proposed. MLLM is weakly relevant as Qwen is a large model family often associated with multimodality, but this work is text-focused.

关键词

Latent Reasoning, Test-Time Reconstruction, Question Cycle, Task Fidelity, Qwen Family, Mathematical Reasoning, Code Generation

70. OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation [FAIL]

Score: 21.0 / 27.8

Authors: Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Published: 2026-06-04

TL;DR: OrderGrad introduces a unified policy gradient estimator for optimizing order-statistic objectives such as tail risk and best-of-K, effectively enhancing robustness in LLM post-training and reinforcement learning tasks.

摘要翻译

策略梯度方法通常优化期望回报，但许多现实世界的应用关注回报的分布特性：尾部风险、异常值鲁棒性或 best-of-K 发现。我们引入 OrderGrad，这是一类用于次序统计量目标的似然比和重参数化梯度估计器。OrderGrad 优化有限样本 L-统计量（L-statistics），即排序后的奖励或成本的加权平均值，通过仅改变秩权重，可恢复诸如 VaR（风险价值）、CVaR（条件风险价值）、截尾均值、中位数以及 top-m/best-of-K 标准等目标。对于任何固定的样本量和秩权重向量，OrderGrad 为相应的次序统计量目标提供无偏梯度估计器。该方法被实现为一种简单的奖励变换，然后可用于标准的策略梯度或重参数化更新中。我们研究了所得估计器的方差行为，并在均值优化与部署目标不匹配的任务上对其进行了评估，包括 LLM（大语言模型）数学后训练及其他任务。OrderGrad 提供了一种统一、即插即用的途径，以实现风险厌恶、鲁棒和探索性学习。代码：https://github.com/paavo5/ordergrad

Abstract

Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad
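The claim that different objectives are recovered by changing only the rank weights can be checked on a small sample. The sketch below computes the L-statistic objective values only; it does not implement OrderGrad's gradient estimator or its reward transformation. The helper names and the lower-tail average used as a CVaR-style stand-in are assumptions.

```python
import numpy as np

def l_statistic(rewards, rank_weights):
    """Weighted average of the sorted rewards (ascending order)."""
    r = np.sort(np.asarray(rewards, dtype=float))
    return float(r @ np.asarray(rank_weights, dtype=float))

def mean_weights(n):
    return np.full(n, 1.0 / n)

def best_of_k_weights(n):
    w = np.zeros(n)
    w[-1] = 1.0                      # all weight on the largest reward
    return w

def median_weights(n):
    w = np.zeros(n)
    if n % 2:
        w[n // 2] = 1.0
    else:
        w[n // 2 - 1] = w[n // 2] = 0.5
    return w

def lower_tail_weights(n, m):
    w = np.zeros(n)
    w[:m] = 1.0 / m                  # average of the m worst rewards
    return w

rewards = [3, -1, 4, 1, 5]           # sorted: -1, 1, 3, 4, 5
n = len(rewards)
objectives = {
    "mean": l_statistic(rewards, mean_weights(n)),
    "best_of_k": l_statistic(rewards, best_of_k_weights(n)),
    "median": l_statistic(rewards, median_weights(n)),
    "lower_tail_2": l_statistic(rewards, lower_tail_weights(n, 2)),
}
```

The same sample yields four different objective values, which is the mismatch with mean optimization that the abstract describes.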

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

评分理由: The paper focuses on policy gradient optimization for order-statistic objectives (e.g., VaR, best-of-K) rather than multimodal architectures or world models. It scores higher on 'Unify Models' (unifies risk objectives under one estimator) and 'MLLM' (applied to LLM post-training) but lower on 'Tokenizer', 'Visual Encoder', 'MultiModal' (no multimodal content), 'World Models' (no latent dynamics model), and 'model-based RL' (uses model-free policy gradient). No expert authors from the target list were found in the authorship.

关键词

Order-Statistic Policy Gradient, Distributional Optimization, Risk Aversion, LLM Post-training, Reward Transformation, Tail Risk, Best-of-K Discovery

71. OPRD: On-Policy Representation Distillation [FAIL]

Score: 21.0 / 27.8

Authors: Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen

Published: 2026-06-04

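TL;DR: The paper proposes OPRD, which distills representations by aligning student and teacher hidden states, reducing sampling variance and speeding up training; it does not address multimodality or world-model construction.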

摘要翻译

策略内蒸馏（OPD）仅在输出空间上通过匹配下一个词元概率来监督学生模型。这种仅输出范式存在两个局限：(1) 在大词表（例如 Qwen 的约 15 万个词元）上，蒙特卡洛 KL 估计带来的采样方差在整个训练过程中持续存在；(2) 它将教师模型视为黑盒，丢弃了语言模型头（LM head）之后的所有中间隐藏状态。我们提出策略内表示蒸馏（OPRD），通过在相同轨迹（rollouts）上对齐学生模型和教师模型在选定层上的表示，将蒸馏提升至隐藏状态空间，从而完全绕过 LM 头。理论上，OPRD 消除了采样方差，并提供了更丰富的逐层结构信息。实验上，OPRD 在 AIME 2024/2025 和 AIMO 竞赛上缩小了学生模型与教师模型之间的差距，而基于输出空间的 OPD 基线则停滞于教师模型性能之下。此外，OPRD 的训练速度比 top-k OPD 快 1.44 倍，且内存使用量减少了 54%。代码：https://github.com/ShenzhiYang2000/OPRD.

Abstract

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
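A hidden-state alignment loss of the kind described can be sketched as below. The abstract does not specify the alignment objective, so the mean-squared error, the per-layer linear projection, and the names `representation_loss` and `layer_pairs` are assumptions for illustration.

```python
import numpy as np

def representation_loss(student_h, teacher_h, projections, layer_pairs):
    """Mean squared distance between projected student states and teacher states.

    student_h, teacher_h: lists of (tokens, dim) arrays, one per layer, computed
    on the same sampled tokens. layer_pairs: (student_layer, teacher_layer) index
    pairs to align. projections: one (d_student, d_teacher) matrix per pair.
    """
    total = 0.0
    for (s_layer, t_layer), P in zip(layer_pairs, projections):
        diff = student_h[s_layer] @ P - teacher_h[t_layer]
        total += np.mean(diff ** 2)
    return total / len(layer_pairs)

rng = np.random.default_rng(0)
tokens, d_student, d_teacher = 10, 4, 6
student_h = [rng.normal(size=(tokens, d_student)) for _ in range(3)]
teacher_h = [rng.normal(size=(tokens, d_teacher)) for _ in range(6)]
layer_pairs = [(0, 1), (2, 5)]       # which layers to align (illustrative)
projections = [rng.normal(size=(d_student, d_teacher)) for _ in layer_pairs]

loss = representation_loss(student_h, teacher_h, projections, layer_pairs)
```

The loss is computed from hidden states directly, so no vocabulary-sized softmax or sampling over tokens enters the computation.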

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

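评分理由: The paper's core is a representation distillation technique (OPRD) for large language models (LLMs), which aligns hidden states to reduce sampling variance and improve training efficiency. It is related to Tokenizer (it mentions vocabulary size) and partially related to model-based RL (it uses on-policy and rollout terminology), but has no direct connection to Visual Encoder, World Models, MultiModal, or MLLM (no mention of vision, multimodality, or world-model architectures). Relevance to Unify Models is moderate (knowledge distillation is not architectural unification). The author list contains none of the specified experts.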

关键词

On-Policy Representation Distillation, Hidden-State Alignment, Student-Teacher Training, Sampling Variance Reduction, LLM Distillation, Memory Efficiency, Faster Training

72. Agentic Molecular Recovery via Molecule-Aware Exploration [FAIL]

Score: 21.0 / 27.8

Authors: Suwan Yoon, Changhee Lee

Published: 2026-06-04

TL;DR: The paper proposes AMREC, an agentic framework that recovers invalid molecular SMILES from LLMs by preserving structural cues through molecule-aware exploration and trajectory selection.

摘要翻译

使用大语言模型（LLMs）进行文本引导的分子生成经常产生无效的 SMILES。我们认为，处理无效草稿应从基于有效性的修复转向保持身份的分子恢复：其目标不仅是恢复化学有效性，还要保留与目标相关的结构线索，并恢复描述所隐含的分子身份。这一视角揭示了现有校正策略的局限性。事后修复虽能恢复有效性，却可能扭曲关键结构；仅基于 LLM 的校正可能引入意外的全局漂移；而通用智能体校正即使配备了可执行的 RDKit 编辑工具，仍受限于贪婪的单候选轨迹。为了解决这些局限性，我们提出了 AMREC，该方法将分子感知的不匹配跟踪与扩展候选探索及轨迹级选择相结合。在来自三个骨干模型的无效 ChEBI-20 草稿上，AMREC 在结构、精确匹配和字符串级别指标上实现了最强的整体恢复性能。

Abstract

Text-guided molecular generation with LLMs often yields invalid SMILES. We argue that invalid drafts should be addressed through a shift from validity-oriented repair to identity-preserving molecular recovery: the objective is not only to restore chemical validity, but also to preserve target-relevant structural cues and recover the molecular identity implied by the description. This perspective reveals the limitations of existing correction strategies. Post-hoc repair can recover validity while distorting key structures, LLM-only correction can introduce unintended global drift, and generic agentic correction remains constrained by greedy single-candidate trajectories even when equipped with executable RDKit edit tools. To address these limitations, we propose AMREC, which couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection. On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile across structural, exact-match, and string-level metrics.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	4.0/10	6.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0

评分理由: The paper focuses on molecular generation validity recovery using LLMs (AMREC), showing moderate relevance to MLLM and model-based RL due to LLM usage and agentic trajectory exploration. However, it has low relevance to Visual Encoder, World Models, and Unify Models as it lacks vision components, world modeling, or model unification efforts. Tokenizer is not a focus. Total weighted score is 21.0, below the dynamic pass threshold of 27.8.
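The weighted total for this entry can be recomputed from the relevance values in the table above. The sketch assumes that each score is weight times relevance, that the total is a plain sum, and that an entry passes when the total reaches the threshold; the comparison rule is a guess, since the report does not state it.

```python
WEIGHT = 1.5            # every keyword in the table carries the same weight
PASS_THRESHOLD = 27.8   # dynamic pass score shown in the entry header

# Relevance values (out of 10) copied from the scoring table.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 1.0,
    "MLLM": 4.0,
    "MultiModal": 2.0,
    "model-based RL": 4.0,
}

scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())
passed = total >= PASS_THRESHOLD  # assumed rule: pass when total >= threshold
```

The sum is 21.0, which matches the "Score: 21.0 / 27.8" line in the entry header.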

关键词

Molecular Recovery, LLM, SMILES, Agentic Exploration, Structural Preservation, Invalid Drafts, Molecule-Aware

73. Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems [FAIL]

Score: 19.5 / 27.8

Authors: Dianxing Shi, Junqi He, Junhao Chen, Bowen Wang, Yuta Nakashima

Published: 2026-06-04

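TL;DR: This paper proposes ANCHOR, a framework that uses an LLM to simulate human supervision and mitigate safety drift in self-evolving agents; results show that limited supervision, most effective at the output verification phase, reduces safety degradation without harming core performance.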

摘要翻译

自演化智能体通过持续的自我博弈和自生成学习信号进行改进，但自主演化也可能导致能力退化和安全漂移。尽管人类反馈已被证明对静态及后训练的智能体有效，但其在自演化系统中的作用尚未得到充分探索。我们提出了一种基于大语言模型（LLM）的框架，即通过人类式监督与审查进行智能体规范校正（ANCHOR），该框架模拟人类监督并在自演化的各个阶段提供反馈。利用 ANCHOR，我们在编码、数学推理和安全方面评估了两个代表性的开源自演化智能体系统。结果表明，即使有限的监督也能显著缓解安全退化，同时保持核心演化目标上的稳定表现。进一步分析表明，针对输出验证阶段的监督对于干预最为有效，而增加监督频率则边际收益递减。这些发现为设计更稳定、可控且人类对齐的自演化智能体系统提供了实证证据和实践指导。

Abstract

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.

评分详情

关键词	权重	相关度	得分
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

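评分理由: The paper's core is the safety of self-evolving agents and the human supervision mechanism (the ANCHOR framework), not model architecture or modality processing. Tokenizer, Visual Encoder, and MultiModal are therefore essentially unrelated (1.0); Unify Models and MLLM have low relevance (2.0) because the paper does not involve model unification or multimodal architectures; World Models and model-based RL have some conceptual relevance (3.0) because the paper involves agent learning and evolution, though not as core methods. The weighted total is below the dynamic pass score, indicating a poor match between the paper's topic and the given keyword set.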

关键词

Self-evolving agents, Human-Agent Interaction, Safety Drift, LLM-based Framework, ANCHOR, Human-like Oversight, Capability Degradation

74. Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents [FAIL]

Score: 19.5 / 27.8

Authors: Shuo Ji, Yibo Li, Bryan Hooi

Published: 2026-06-04

TL;DR: MRAgent proposes an associative memory graph with active reconstruction to improve LLM agents' reasoning over long interaction histories, achieving significant performance gains while reducing token and runtime costs.

Abstract

Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.
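To make the Cue-Tag-Content idea concrete, here is a toy sketch of one expand-and-prune step over a three-layer graph. It is not the authors' implementation: the class and function names are hypothetical, the word-overlap `relevance` function merely stands in for LLM reasoning, and the paper's mechanism iterates over accumulated evidence, whereas this sketch performs a single pass.

```python
from collections import defaultdict

class CueTagContentGraph:
    """Toy three-layer memory graph: cue -> tags -> contents."""
    def __init__(self):
        self.cue_to_tags = defaultdict(set)
        self.tag_to_contents = defaultdict(set)

    def add(self, cue, tag, content):
        self.cue_to_tags[cue].add(tag)
        self.tag_to_contents[tag].add(content)

def relevance(query, text):
    """Stand-in for the LLM's judgment: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def reconstruct(graph, query, cues, beam=2):
    """Expand cues -> tags -> contents, keeping only the top `beam` tags."""
    tags = {t for c in cues for t in graph.cue_to_tags[c]}
    kept = sorted(tags, key=lambda t: (-relevance(query, t), t))[:beam]  # prune
    contents = {m for t in kept for m in graph.tag_to_contents[t]}
    return sorted(contents, key=lambda m: (-relevance(query, m), m))

g = CueTagContentGraph()
g.add("Alice", "alice job", "Alice started a job at a bakery in May")
g.add("Alice", "alice pets", "Alice adopted a cat named Miso")
g.add("Alice", "alice travel", "Alice visited Lisbon last year")
result = reconstruct(g, "what job does alice have", ["Alice"], beam=1)
```

Pruning at the tag layer is what keeps the expansion bounded: with `beam=1`, only the contents behind the best-matching tag are ever read.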

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on memory mechanisms for LLM agents (Graph Memory, Active Reconstruction) rather than multimodal components (Visual Encoder, MultiModal), tokenizer design, or model-based RL frameworks. While it involves LLMs (related to MLLM), it lacks explicit multimodal integration. World Models are tangentially related through memory but not the core contribution. Unify Models is not the primary theme. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list (Shuo Ji, Yibo Li, Bryan Hooi), so no bonus points are applied.

Keywords

LLM Agents, Graph Memory, Active Reconstruction, Associative Memory, Long Interaction Histories, Memory Retrieval, Reasoning

75. To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection [FAIL]

Score: 19.5 / 27.8

Authors: Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

Published: 2026-06-04

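TL;DR: This paper proposes a query-adaptive audio-visual person retrieval framework that detects active modalities to avoid the noise introduced by absent modalities, achieving higher precision than fixed fusion on broadcast archive retrieval.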

Abstract

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
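The "64% of the gap" figure can be checked from the P@1 numbers in the abstract. The sketch below assumes the gap is measured from the fixed-fusion baseline to the oracle; the abstract does not state the reference point, but this reading reproduces the quoted value, whereas measuring from the face-only system (93.4%) would give 25%.

```python
adaptive, fixed_fusion, oracle = 94.2, 90.0, 96.6  # P@1 (%) from the abstract

# Assumption: the "gap" runs from the fixed-fusion baseline to the oracle.
gap_recovered = (adaptive - fixed_fusion) / (oracle - fixed_fusion)
percent = round(100 * gap_recovered)
```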

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	8.0/10	12.0
model-based RL	1.5	0.0/10	0.0

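Scoring rationale: The paper focuses on modality-adaptive fusion for audio-visual person retrieval, not on large-model architecture or reinforcement learning. 'MultiModal' is highly relevant because the core task is cross-modal retrieval; 'Visual Encoder' is moderately relevant because visual feature extraction is involved; 'Unify Models' has low relevance because only the decision logic is unified, not the model architecture; the remaining keywords (Tokenizer, World Models, MLLM, model-based RL) are unrelated to the paper and score 0. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied. The weighted total is 19.5, below the dynamic passing score of 27.8.

Keywords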

Audio-Visual Person Retrieval, Query-Adaptive, Active Modality Detection, Cross-Modal Score Consistency, Modality Fusion, BBC Rewind Corpus, Unimodal vs Multimodal

76. Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement [FAIL]

Score: 19.5 / 27.8

Authors: Xin Wang, Liangtai Sun, Yaoming Zhu, Shuang Zhou, Jiaxing Liu, Fengjiao Chen, Lin Qiu, Xuezhi Cao, Xunliang Cai, Licheng Zhang, Zhendong Mao

Published: 2026-06-04

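TL;DR: Asuka-Bench is a multi-round benchmark for code agents grounded in browser-rendered behavior; it reveals substantial performance gaps among current LLMs on underspecified user intent and iterative repair.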

Abstract

Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.
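The closed loop described in the abstract (Code Agent, then UI Agent, then User LLM) can be outlined roughly as follows. This is a sketch only: every function name is hypothetical, the toy stand-ins replace real models and a real browser, and stopping early once all criteria pass is an assumption rather than a documented rule of the benchmark.

```python
def run_task(code_agent, ui_agent, user_llm, intent, max_rounds=3):
    """Closed loop: generate -> test on the rendered site -> feed back."""
    feedback, history = None, []
    for round_no in range(1, max_rounds + 1):
        project = code_agent(intent, feedback)   # build or repair the web project
        results = ui_agent(project)              # {criterion: passed?}
        history.append((round_no, results))
        if all(results.values()):                # assumed early stop
            break
        feedback = user_llm(results)             # natural-language feedback
    return history

# Toy stand-ins so the loop runs without any model or browser.
def toy_code_agent(intent, feedback):
    return {"has_button": True, "has_dark_mode": feedback is not None}

def toy_ui_agent(project):
    return {"button renders": project["has_button"],
            "dark mode toggles": project["has_dark_mode"]}

def toy_user_llm(results):
    failed = [name for name, ok in results.items() if not ok]
    return "Please fix: " + ", ".join(failed)

history = run_task(toy_code_agent, toy_ui_agent, toy_user_llm, "a landing page")
```

In the toy run, the first round misses a requirement that was never stated up front, and the second round repairs it from feedback, which is the behavior the benchmark is designed to measure.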

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	2.0/10	3.0

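Scoring rationale: The paper's main contribution is the Asuka-Bench benchmark, which evaluates the web-generation ability of code agents under multi-round feedback. Although it involves LLMs and closed-loop interaction (loosely resembling the loops in world models or RL), it proposes no unified model architecture and no specific tokenizer or visual-encoder design, and it does not explicitly involve model-based reinforcement learning. Relevance to the multimodal architecture keywords is therefore low, with only weak links to the keywords related to interaction loops. The author list does not include any of the specified experts, so no bonus is applied.

Keywords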

Code Agents, Multi-Round Refinement, Browser-Rendered Behavior, Benchmark, Underspecified User Intent, LLMs, Web Development

77. GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech [FAIL]

Score: 19.5 / 27.8

Authors: Jaehoon Kang, Yejin Lee, Kyuhong Shim

Published: 2026-06-04

TL;DR: GLASS addresses the entanglement of speaker identity and prosodic attributes in zero-shot TTS by employing GRPO-trained LoRA adapters for composable acoustic style steering without retraining the backbone.

Abstract

We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
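The "linear LoRA arithmetic" mentioned in the abstract can be illustrated with toy matrices. The sketch assumes the usual LoRA form, in which each adapter contributes a low-rank update B·A to a frozen weight W0, so that composition is W0 + Σ α_k·B_k·A_k. The 2×2 weights, the rank-1 factors, and the names `compose`, `rate`, and `pitch` are illustrative and are not taken from the paper.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def compose(w0, adapters, alphas):
    """Return W0 + sum_k alpha_k * (B_k @ A_k) for LoRA factor pairs (B_k, A_k)."""
    w = [row[:] for row in w0]
    for (b, a), alpha in zip(adapters, alphas):
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                w[i][j] += alpha * v
    return w

w0 = [[1.0, 0.0], [0.0, 1.0]]            # frozen backbone weight (toy 2x2)
rate = ([[1.0], [0.0]], [[0.0, 2.0]])    # rank-1 "speaking rate" adapter (B, A)
pitch = ([[0.0], [1.0]], [[4.0, 0.0]])   # rank-1 "pitch" adapter (B, A)

both = compose(w0, [rate, pitch], [1.0, 1.0])   # multi-axis composition
half_rate = compose(w0, [rate], [0.5])          # interpolation along one axis
```

Because the backbone is never modified, swapping or rescaling an adapter only changes the α coefficients.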

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on Text-to-Speech (TTS) style control using LoRA and GRPO, showing weak alignment with the provided Multimodal/World Model keywords. 'MultiModal' is moderately relevant (Text-Audio), 'Tokenizer' is slightly relevant (speech tokens), and 'model-based RL' loosely matches the RL usage (though GRPO is typically model-free). 'Visual Encoder', 'World Models', and 'MLLM' are largely irrelevant. No expert authors from the specified list are present. The weighted total score is 19.5, below the dynamic passing score of 27.8.

Keywords

Text-to-Speech, Acoustic Style Steering, Zero-Shot, LoRA, GRPO, Reward-based Control, Composable Adapters

78. MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery [FAIL]

Score: 18.0 / 27.8

Authors: Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai

Published: 2026-06-04

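TL;DR: MLEvolve is an LLM-based self-evolving multi-agent framework that automates machine learning algorithm discovery through progressive search and a retrospective memory mechanism, achieving state-of-the-art performance on benchmark tasks.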

Abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5

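Scoring rationale: The paper proposes MLEvolve, an LLM-based self-evolving multi-agent framework for automated machine learning algorithm discovery. Relevance to Visual Encoder (0) and MultiModal (1) is low because the work focuses on text and code search and has no visual component. Unify Models (2) and World Models (2) are weakly related through framework unification and the memory mechanism. Tokenizer (1) is only implicit in the use of LLMs. MLLM (3) and model-based RL (3) are moderately relevant because of the LLM agents and the RL-inspired search strategies (tree search, entropy schedule). The author list does not include the target experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is added. The weighted total (18.0) is below the dynamic passing score (27.8).

Keywords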

Automated Machine Learning, LLM Agents, Multi-agent Framework, Tree Search, Retrospective Memory, Algorithm Discovery, Self-Evolving

79. Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration [FAIL]

Score: 18.0 / 27.8

Authors: Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

Published: 2026-06-04

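TL;DR: This paper presents the ALMANAC dataset for evaluating how well agents predict human collaborative behavior and mental models, focusing on process-level competence in human-agent collaboration.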

Abstract

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

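Scoring rationale: The paper focuses on a human collaboration dataset and mental models, so relevance to the architecture keywords (Tokenizer, Visual Encoder, Unify Models) is low. It touches on agent planning (linked to model-based RL and World Models) but not on the core architecture of multimodal large models (MLLM, MultiModal). The author list includes none of the specified experts. The weighted total is 18.0, below the passing score.

Keywords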

Human Collaboration, Mental Model Annotations, Agent Collaboration, LLM Agents, Map Task, Action-Level, Dataset, Reasoning

80. PAMF: Prior-Aware Multimodal Fusion for Incomplete Time Series Data [FAIL]

Score: 18.0 / 27.8

Authors: Ziwen Kan, Wugeng Zheng, Tianlong Chen, Song Wang

Published: 2026-06-04

TL;DR: PAMF proposes a prior-aware multimodal fusion framework that couples imputation with downstream prediction for incomplete healthcare time series, achieving superior performance across diverse missing patterns.

Abstract

In healthcare, multimodal time series tasks often operate on incomplete observations in practice, for example when ECG segments are lost because electrodes detach or an entire respiratory channel is unavailable during overnight monitoring. Such missingness typically appears in two structurally distinct patterns: within-modality missing, where values are absent within an otherwise observed modality, and modality-level missing, where an entire modality is unavailable. Existing methods typically represent unobserved data implicitly through masks or missing embeddings, without learning instance-specific missing information, and most are designed for only one missingness pattern. A natural approach is to explicitly estimate the missing data; however, existing imputation methods treat missingness uniformly despite their different structural priors, and the imputation process is often isolated from downstream tasks, preventing downstream tasks from guiding imputation toward more informative representations. To address these limitations, we present PAMF, a multimodal time-series framework that explicitly handles different missingness patterns while coupling imputation with downstream prediction through prior-aware flow matching and weight sharing. Specifically, the method initializes the flow-matching source state with type-specific priors to distinguish two missing types. It further connects imputation and classification through architecturally matched encoders with weight sharing, transferring task-relevant representations into the imputation process. Experiments on multiple multimodal healthcare time-series benchmarks show that the proposed method achieves the strongest overall downstream performance across diverse datasets and missing settings compared with existing baselines.
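The two missingness patterns named in the abstract can be told apart from a per-modality observation mask. The sketch below only illustrates that distinction and does not attempt the flow-matching imputation itself; the function name, the mask encoding, and the `spo2` channel are hypothetical.

```python
def missingness_type(mask):
    """mask: list of 1 (observed) / 0 (missing) over time for one modality."""
    if not any(mask):
        return "modality-level"      # the whole modality is unavailable
    if not all(mask):
        return "within-modality"     # gaps inside an otherwise observed modality
    return "complete"

sample = {
    "ecg":         [1, 1, 0, 0, 1, 1],   # segment lost, e.g. a detached electrode
    "respiration": [0, 0, 0, 0, 0, 0],   # channel unavailable all night
    "spo2":        [1, 1, 1, 1, 1, 1],   # hypothetical fully observed channel
}
types = {name: missingness_type(m) for name, m in sample.items()}
```

A label like this is the kind of signal that could select a type-specific prior, although how PAMF actually builds its priors is not described at this level in the abstract.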

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	10.0/10	15.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: Only 'MultiModal' is highly relevant (10.0) as the core domain. 'Unify Models' is low (2.0) due to task vs. architecture unification mismatch. 'Tokenizer', 'Visual Encoder', 'World Models', 'MLLM', and 'model-based RL' are irrelevant (0.0) as the paper lacks vision, language models, tokenization, or RL. No specified expert authors are found.

Keywords

Multimodal Time Series, Incomplete Data, Missingness Patterns, Prior-Aware Flow Matching, Weight Sharing, Healthcare, Imputation, Downstream Prediction

81. From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents [FAIL]

Score: 18.0 / 27.8

Authors: Patrick Wilhelm, Odej Kao

Published: 2026-06-04

TL;DR: The paper proposes context-calibrated internal monitoring that uses entropy and activation features to assess reward-hacking risk in LLM agents, showing that internal activations alone are not enough and must be combined with environmental context to identify risky actions.

Abstract

Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.
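Of the three monitoring signals listed in the abstract, token-level entropy is the one with a standard definition: the Shannon entropy of the next-token distribution. A minimal sketch follows; the example distributions are made up, and the abstract does not specify how the paper combines entropy with activation scores and context features.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # model strongly prefers one token
uncertain = [0.25, 0.25, 0.25, 0.25]   # model is maximally unsure over 4 tokens
h_conf, h_unc = token_entropy(confident), token_entropy(uncertain)
```

A uniform distribution over four tokens reaches the maximum, ln 4, while a peaked distribution stays close to zero.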

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on safety monitoring and mechanistic interpretability in LLM agents (reward-hacking detection), whereas the provided keywords primarily target multimodal foundation model architectures (Tokenizer, Visual Encoder) and world model learning. There is minimal direct relevance to specific architectural components or unification strategies. Moderate relevance exists for RL-related keywords (model-based RL, World Models) due to the agent-environment setting, but the core contribution is monitoring rather than model learning or world modeling. No expert authors from the target list are present.

Keywords

Reward-Hack Activations, Agentic Risk States, Context-Calibrated Mechanistic Monitoring, LLM Agents, ReAct-style, Activation-based Reward-Hack Scores, Decision-Context Features

82. Merging model-based control with multi-agent reinforcement learning for multi-agent cooperative teaming strategies [FAIL]

Score: 18.0 / 27.8

Authors: Christian Llanes, Spencer W. Jensen, Samuel Coogan

Published: 2026-06-04

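TL;DR: The paper proposes an algorithm (MA-AC-MPC) that combines multi-agent reinforcement learning with model predictive control, yielding safer, dynamically feasible actions in cooperative multi-agent tasks and a higher success rate than a pure reinforcement learning approach in hardware experiments.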

Abstract

In this work, we propose a framework that combines multi-agent reinforcement learning (MARL) with model-based control to achieve safe, dynamically feasible actions in cooperative multi-agent tasks. Multi-agent reinforcement learning provides the advantage of learning cooperative policies for multi-agent teams from discrete non-differentiable rewards in a long planning horizon. Model-predictive control is robust and offers safe, dynamically feasible actions in a fast replanning framework for short horizons. We propose an algorithm that extends actor-critic model predictive control for MARL which we refer to as multi-agent actor-critic model predictive control (MA-AC-MPC). We demonstrate the capabilities of this algorithm by applying it to a multi-agent pursuit-evasion scenario. Specifically, we compare the evader team's strategy using the MA-AC-MPC model and a multi-layer perceptron model (MA-AC-MLP). The pursuer team uses augmented proportional navigation as it is accepted as an advanced adversarial control law. We also provide an example with a heterogeneous environment where a drone and omni-wheeled rover cooperate to achieve repeatable and successful landing with 100% success rate in hardware for MA-AC-MPC compared to 60% for MA-AC-MLP. We demonstrate the robustness of the proposed MA-AC-MPC algorithm in hardware for both environments.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	8.0/10	12.0

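Scoring rationale: The paper's core is the combination of multi-agent reinforcement learning (MARL) and model predictive control (MPC), at the intersection of classical control and reinforcement learning. Among the keywords, only 'model-based RL' is highly relevant to the core method (MPC combined with RL; score 8.0). 'Unify Models' and 'World Models' receive low relevance (2.0) for the model fusion and dynamics models involved. The remaining keywords (Tokenizer, Visual Encoder, MLLM, MultiModal) all point to multimodal large models and are unrelated to this paper's robot-control topic (score 0.0). The weighted total is 18.0, below the dynamic passing score of 27.8, indicating low relevance to the given research background (multimodal and large models). The author list does not include Yang Shi or the other specified experts, so no bonus is applied.

Keywords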

Multi-agent reinforcement learning, Model-based control, Cooperative teaming strategies, Actor-critic model predictive control, Hardware validation

83. EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents [FAIL]

Score: 18.0 / 27.8

Authors: Yilong Li, Suman Banerjee, Tong Che

Published: 2026-06-04

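TL;DR: EMBER proposes a budgeted evidence-retention policy for long-horizon agents that keeps source-evidence capsules instead of rereading the full history, substantially improving retrieval accuracy and memory efficiency.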

Abstract

Long-horizon agents can archive large histories, but future answers still incur retrieval, rereading, and context costs. When retained memory misses answer-relevant evidence, the system must return to larger portions of the raw history. We study budgeted evidence survival: before the query is known, which source evidence should be retained so that it remains recoverable and usable under a fixed retained source-evidence token budget? We instantiate this setting as Budgeted Pre-Query Retention, where memory is written during ingestion and later read without access to the full raw stream. We introduce EMBER, a learned retention policy that constructs a compact, source-backed evidence state. EMBER stores evidence capsules: verbatim source excerpts paired with retrieval keys and update metadata, preserving both grounding and read-time access. Post-query outcome feedback trains the writer to preserve evidence across the ingestion-retrieval-answer chain. On LongMemEval-RR, our LongMemEval-derived retained-evidence protocol, EMBER-14B reaches 0.3017 F1 at the 8192-token retained-evidence comparison point, compared with 0.1765 for the strongest non-EMBER budgeted baseline. Across retained source-evidence budgets, EMBER improves F1, Retain-Recall, and Read-Recall, indicating that long-horizon memory depends on retaining evidence within the budget rather than rereading larger histories.
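An evidence capsule, as the abstract describes it, pairs a verbatim excerpt with a retrieval key and update metadata, and retention is subject to a fixed token budget. The sketch below uses a greedy, score-ordered rule as a stand-in for EMBER's learned retention policy; the field names, the `score` field, and the whitespace token count are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class EvidenceCapsule:
    excerpt: str          # verbatim source excerpt
    retrieval_key: str    # key used to find the capsule at read time
    updated_at: int       # update metadata (here: an ingestion step index)
    score: float          # hypothetical usefulness score from a retention policy

def retain(capsules, token_budget):
    """Greedy stand-in for a learned policy: keep top-scored capsules that fit."""
    kept, used = [], 0
    for cap in sorted(capsules, key=lambda c: -c.score):
        cost = len(cap.excerpt.split())   # crude token count: whitespace words
        if used + cost <= token_budget:
            kept.append(cap)
            used += cost
    return kept, used

capsules = [
    EvidenceCapsule("User moved to Oslo in March", "location", 1, 0.9),
    EvidenceCapsule("User said hello", "greeting", 2, 0.1),
    EvidenceCapsule("User is allergic to peanuts", "health", 3, 0.8),
]
kept, used = retain(capsules, token_budget=11)
```

Retention happens before any query is known, so whatever is dropped here cannot be recovered at read time.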

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

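Scoring rationale: The paper's core is budgeted evidence retention for long-horizon agents (EMBER), with an emphasis on memory management. Visual Encoder and MultiModal are unrelated (0 points); Unify Models, Tokenizer, and MLLM have low relevance (2 points each); World Models and model-based RL relate to agents, but the paper emphasizes memory over model building or RL planning, so each gets 3 points. The author list includes none of the specified experts.

Keywords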

Long-horizon agents, Evidence Retention, Budgeted Memory, EMBER, Retrieval, Source-backed Evidence, Memory Efficiency

84. ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition [FAIL]

Score: 18.0 / 27.8

Authors: Pablo Ayuso-Albizu, Pablo Carballeira, Juan C. SanMiguel, Paula Moral

Published: 2026-06-04

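TL;DR: ReSAGE-PAR uses diffusion models and representational similarity assessment to address data scarcity in pedestrian attribute recognition, significantly improving recognition performance by generating verified synthetic images.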

Abstract

To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements, achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This proves its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at http://www-vpu.eps.uam.es/publications/ReSAGE-PAR.
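The final step, turning a continuous alignment score into a binary pseudo-label with a Bayesian classifier, can be sketched with Gaussian class-conditional likelihoods and Bayes' rule. The abstract does not say which likelihood model the authors use, so the Gaussian form, every parameter value, and the function names below are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def posterior_present(score, pos=(0.30, 0.05), neg=(0.20, 0.05), prior=0.5):
    """P(attribute present | score) via Bayes' rule with Gaussian likelihoods.

    pos / neg are hypothetical (mean, std) pairs for the score distributions
    of images that do / do not show the attribute.
    """
    p_pos = gaussian_pdf(score, *pos) * prior
    p_neg = gaussian_pdf(score, *neg) * (1 - prior)
    return p_pos / (p_pos + p_neg)

def pseudo_label(score, threshold=0.5):
    """Binary pseudo-label: 1 if the posterior reaches the threshold, else 0."""
    return int(posterior_present(score) >= threshold)

labels = [pseudo_label(s) for s in (0.18, 0.24, 0.27, 0.33)]
```

With equal variances and an even prior, the decision boundary sits midway between the two means (0.25 here); an uneven prior would shift it.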

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	4.0/10	6.0
model-based RL	1.5	0.0/10	0.0

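Scoring rationale: The paper focuses on data augmentation for Pedestrian Attribute Recognition (PAR) using diffusion models and representational similarity assessment. 'MultiModal' and 'Visual Encoder' have some relevance (alignment between images and textual attributes, and visual encoding); 'Unify Models' and 'MLLM' are weakly related (no unified architecture or large language model is involved); 'Tokenizer', 'World Models', and 'model-based RL' are unrelated (tokenizers, world models, and reinforcement learning are not mentioned). The author list does not include the specified experts (Yang Shi and others), so no bonus is applied. The weighted total is 18.0, below the dynamic passing score of 27.8.

Keywords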

Pedestrian Attribute Recognition, Diffusion Models, Generative Expansion, Representational Similarity, Domain Gap, Image Synthesis, Attribute Verification, Data Augmentation

85. Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models [FAIL]

Score: 16.5 / 27.8

Authors: Sunny Gupta, Shambhavi Shanker, Amit Sethi

Published: 2026-06-04

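TL;DR: This paper proposes HyperLoRA, a hypernetwork-based federated adaptation framework that removes aggregation bias and initialization lag, yielding faster convergence and stronger personalization for foundation models on vision and vision-language tasks.

Abstract (Chinese translation)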

利用低秩适应（LoRA）对基础模型（Foundation Models）进行联邦微调，为分布式学习提供了一种通信高效的解决方案。然而，现有的联邦 LoRA 方法存在两个根本性局限性：（1）结构聚合偏差，即独立平均低秩因子无法近似真实的组合更新；（2）客户端初始化滞后，因为客户端在通信轮次中反复重新初始化 LoRA 参数，从而减缓收敛。我们提出 HyperLoRA，这是一个统一框架，通过超网络驱动的 LoRA 生成和乘积空间聚合实现摊销联邦适应，从而解决上述两个问题。与迭代式的每客户端优化不同，HyperLoRA 采用了一个学习生成器，该生成器将客户端分布特征映射至 LoRA 初始化参数，从而有效地摊销了每客户端的适应过程。在服务器端，我们引入一个学习聚合模块，该模块直接在低秩乘积空间中综合更新，消除了逐因子平均带来的不一致性。此外，一个轻量级的残差修正模块进一步提升了在异构（非独立同分布，non-IID）客户端分布下的稳定性。通过用学习算子替代迭代优化和启发式平均，HyperLoRA 协同实现了高效个性化、无偏聚合以及更快的收敛速度。在联邦视觉及视觉 - 语言基准上的实验表明，与先前的联邦 LoRA 方法相比，HyperLoRA 实现了更快的收敛速度、对分布偏移更强的鲁棒性以及更优的个性化性能。

Abstract

Federated fine-tuning of foundation models using Low-Rank Adaptation (LoRA) offers a communication-efficient solution for distributed learning. However, existing federated LoRA methods suffer from two fundamental limitations: (1) structural aggregation bias, where independently averaging low-rank factors fails to approximate the true combined update, and (2) client-side initialization lag, as clients repeatedly reinitialize LoRA parameters across communication rounds, slowing convergence. We propose HyperLoRA, a unified framework that addresses both issues via amortized federated adaptation, using hypernetwork-driven LoRA generation and product-space aggregation. Instead of iterative per-client optimization, HyperLoRA employs a learned generator that maps client distribution signatures to LoRA initializations, effectively amortizing per-client adaptation. On the server side, we introduce a learned aggregation module that directly synthesizes updates in the low-rank product space, eliminating the inconsistencies of factor-wise averaging. A lightweight residual correction module further improves stability under heterogeneous (non-IID) client distributions. By replacing iterative optimization and heuristic averaging with learned operators, HyperLoRA jointly enables efficient personalization, unbiased aggregation, and faster convergence. Experiments on federated vision and vision-language benchmarks show that HyperLoRA achieves improved convergence speed, greater robustness to distribution shift, and stronger personalization performance compared to prior federated LoRA methods.
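The structural aggregation bias named in the abstract can be seen in a two-client toy case: averaging the LoRA factors B and A separately and then multiplying is not the same as averaging the products B·A. The sketch below only illustrates that mismatch; it does not implement HyperLoRA's learned aggregation, and the matrices and helper names are made up for the example.

```python
def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def mean_matrices(matrices):
    """Element-wise mean of equally shaped matrices."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / n for j in range(cols)]
        for i in range(rows)
    ]


# Two clients, each with rank-1 factors B (2x1) and A (1x2).
client_b = [[[1.0], [0.0]], [[0.0], [1.0]]]
client_a = [[[1.0, 0.0]], [[0.0, 1.0]]]

# Target: the average of the per-client updates B_i @ A_i.
true_update = mean_matrices([matmul(b, a) for b, a in zip(client_b, client_a)])

# Factor-wise averaging: average B and A separately, then multiply.
factor_avg_update = matmul(mean_matrices(client_b), mean_matrices(client_a))
```

Here `true_update` is half the identity matrix, while `factor_avg_update` has 0.25 in every entry, so the factor-wise average introduces cross terms that no client produced.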

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

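Scoring rationale: The paper centers on federated learning and LoRA adaptation and has no direct connection to world models, reinforcement learning, or specific components such as the tokenizer. Although it touches on vision-language benchmarks (related to MLLM/MultiModal) and foundation models (related to Unify Models), its core methodology (hypernetwork-driven federated adaptation) differs substantially from the keyword topics, so the scores are low.

Keywords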

Federated Adaptation, Hypernetwork Driven, LoRA, Foundation Models, Personalized Models, Vision-Language Benchmarks, Aggregation Bias

Score: 16.5 / 27.8

Authors: Jiawen Zhang, Kejia Chen, Jiachen Ma, Yangfan Hu, Lipeng He, Yechao Zhang, Jian Liu, Xiaohu Yang, Tianwei Zhang, Ruoxi Jia

Published: 2026-06-04

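TL;DR: The paper proposes MemGate, a plug-in that replaces semantic-similarity search with a query-conditioned neural gate, improving the trustworthiness of long-term memory in personal AI agents and reducing memory-induced threats.

Abstract (Chinese translation)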

个人智能体（Personal AI agents）日益依赖长期记忆（long-term memory），以在跨会话中提供持久个性化（persistent personalization）。然而，现有的记忆管道（memory pipelines）主要受语义相似度（semantic similarity）驱动：与当前查询相近的记忆数据被检索并注入模型上下文。这造成了一个关键的信任度缺口（trustworthiness gap），因为语义相关的记忆可能在上下文上仍不恰当，从而引发诸如跨域泄露（cross-domain leakage）、谄媚（sycophancy）、工具调用漂移（tool-call drift）或记忆诱导的越狱（memory-induced jailbreaks）等威胁。本文研究记忆搜索（memory search）作为个人智能体中的信任边界（trust boundary）。我们评估了代表性的代理记忆框架，包括 A-Mem、Mem0 和 MemOS，以及 OpenClaw——一个具有持久状态和工具使用能力的真实世界个人智能体环境。结果表明，长期记忆不仅仅是一个实用层（utility layer），而是一个持久的控制通道（control channel），能够重塑智能体解释任务和执行动作的方式，使其极易受到上述威胁。为缓解这些漏洞，我们提出了 MemGate，一个用于可信记忆搜索的轻量级且可部署的记忆插件，仅需 9M 参数和 35.1MB 占用空间（footprint）。MemGate 插入在向量内存存储（vector memory store）和主干大语言模型（backbone LLM）之间，无需修改 LLM、重写内存数据库或使用推理时的 LLM 评判器（judge）。它对候选记忆表示应用查询条件化的神经门（query-conditioned neural gate），将原始相似度搜索转化为任务条件化的内存准入（task-conditioned memory admission）。在多种主流内存框架、真实世界智能体设置和多样化的 LLM 主干上，MemGate 减少了记忆诱导的威胁，同时保留了长期记忆的效用。

Abstract

Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	2.0/10	3.0

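Scoring rationale: The paper centers on the trustworthiness of memory search for personal AI agents (MemGate) and is unrelated to Tokenizer and Visual Encoder. Although it involves LLMs and agents, it does not focus on the core architectures of MLLM, World Models, or model-based RL, and Unify Models is only weakly related at the level of framework integration, so most keywords have low relevance. The author list includes none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is added.

Keywords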

Personal AI Agents, Memory Search, Trustworthiness, MemGate, Long-term Memory, Semantic Similarity, Query-conditioned Neural Gate

87. Human Adults and LLMs as Scientists: Who Benefits from Active Exploration? [FAIL]

Score: 16.5 / 27.8

Authors: Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

Published: 2026-06-04

TL;DR: This study compares human and LLM performance in causal learning via active exploration, finding that while exploration aids human conjunctive reasoning, LLMs exhibit less efficient strategies despite similar accuracy.

Abstract (Chinese translation)

因果学习领域的一个长期发现是，成年人在识别合取因果规则（conjunctive causal rules，即效应需多个原因同时存在）时遇到困难，而在析取情境下表现更佳。然而，大多数关于这种“合取劣势（conjunctive handicap）”的演示依赖于证据有限的被动观察范式，其中学习者无法控制证据的生成。本文探讨当成年人通过主动探索获得能动性时，这种偏见是否依然存在。使用修改后的"blicket detector"任务，成年参与者在合取或析取规则结构下自由干预以识别因果对象。我们发现主动探索显著改善了成年人的合取因果推理，尽管合取规则仍比析取规则需要更多的测试才能推断。我们进一步在同一设置下将人类表现与一系列大型语言模型（large language models）进行比较。尽管一些最先进模型在假设推断准确性上接近人类水平，但它们通常表现出效率较低的探索策略以及类似的合取 - 析取性能差距。

Abstract

A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	4.0/10	6.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	5.0/10	7.5

Scoring rationale: The paper focuses on cognitive science and causal learning rather than technical model architecture, making Unify Models, Tokenizer, and Visual Encoder irrelevant (0). World Models and model-based RL are moderately relevant (4-5) due to themes of active exploration and causal inference. MLLM and MultiModal are low (1) as the paper discusses LLMs without specifying multimodal architecture. No specified expert authors were found. Weighted total score: 16.5 (below dynamic passing score 27.8).

Keywords

Active Exploration, Causal Learning, Large Language Models, Conjunctive Rules, Disjunctive Rules, Human Performance, Hypothesis Inference

88. CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments [FAIL]

Score: 16.5 / 27.8

Authors: Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

Published: 2026-06-04

TL;DR: This paper introduces CollabSim, a simulation framework grounded in Computer-Supported Cooperative Work to systematically evaluate the collaborative competence of LLM-based multi-agent systems through controlled experiments.

Abstract (Chinese translation)

基于大型语言模型（LLM）构建的多智能体系统（MAS）展现出日益增长的前景，其有效性依赖于智能体通过文本渠道进行协调的能力，正如人类团队所做的那样。然而，近期研究表明，多智能体系统（MAS）往往并非因为智能体缺乏个体任务解决能力而失败，而是因为缺乏协作胜任力：即在互动展开过程中建立共识、保持共享任务理解、平衡个体与集体激励以及修复不一致的能力。数十年来，计算机支持的协同工作领域的研究已明确了这些在受限通信条件下协调人类团队所需的要求，然而现有的多智能体系统（MAS）评估主要关注任务结果或单个智能体在推理、规划和工具使用方面的熟练程度。为了能够对多智能体系统（MAS）中智能体的协作胜任力进行系统分析，我们引入了 CollabSim，这是一个可配置的模拟框架，它结合了基于理论的协作能力界定、交互条件的控制性操控以及对智能体内部状态的动作级探测。在四种大型语言模型（LLM）上的实验表明，CollabSim 能够捕捉条件效应、区分模型性能模式，并揭示代理设计对任务的依赖效应。

Abstract

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

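Scoring rationale: The paper mainly concerns CSCW-grounded evaluation of the collaborative competence of LLM multi-agent systems and does not involve core technical architectures such as visual encoders, tokenizers, world models, or model-based reinforcement learning. MLLM relevance is low because the work relies mainly on text-based interaction, and Unify Models applies only in the sense of theoretical unification of the evaluation framework, not unification of model architectures. The weighted total of 16.5 is below the dynamic passing score of 27.8, indicating low relevance to the given keyword set.

Keywords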

Multi-agent systems, LLM agents, Collaborative competence, CSCW, Simulation framework, Text-based interaction, Agent coordination

89. SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization [FAIL]

Score: 16.5 / 27.8

Authors: Qi Zhang, Zhaopeng Feng, Xiaonan Shi, Xiaomeng Hu, Chu Liu, Pengjun Xie, Xiaobin Wang, Jieping Ye, Bryan Hooi, Haobo Wang, Junbo Zhao

Published: 2026-06-04

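TL;DR: The SkillComposer framework lets language models self-evolve agent skills through create, improve, and merge operations, significantly improving performance and generalization on agent and code tasks.

Abstract (Chinese translation)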

智能体技能（Agent skills）由指导智能体推理与动作的可复用策略构成，在推理时提升模型能力方面展现出巨大潜力。然而，当前的技能构建方法将问题视为一次性提取（one-shot extraction），忽视了一个基本张力：针对特定任务定制的技能难以迁移，而抽象的技能往往提供不足的指导。我们将这种脆弱性归因于缺乏明确的技能规范与泛化机制。为填补这一空白，我们提出 SkillComposer（技能组合器），该框架将技能构建分解为三个可学习操作：创建（create）、改进（improve）和合并（merge）。通过系统化的拒绝采样（rejection sampling）配方进行训练，SkillComposer 使语言模型能够在推理时自我演化技能，并支持三种部署模式：用于构建通用库的离线模式、用于特定任务精化的在线模式，以及结合两者的混合模式。在 $τ^2$-Bench、LiveCodeBench v6 和 AppWorld 上的全面实验表明，SkillComposer 始终优于基线方法。我们的 SkillComposer-4B 在智能体任务上使 27B 执行器性能提升高达 +4.5，在代码任务上提升 +3.4，同时在训练期间未见过的领域和任务类型上展现出泛化能力。分析表明，合并（merge）与改进（improve）解决了正交的质量维度，且技能组合是一种可迁移的元能力，为技能增强推理提供了实用的配方。

Abstract

Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as one-shot extraction, overlooking a fundamental tension: a skill tailored to a specific task fails to transfer, while an abstracted skill often provides insufficient guidance. We attribute this fragility to the absence of explicit mechanisms for skill specification and generalization. To address this gap, we introduce SkillComposer, a framework that decomposes skill construction into three learnable operations: create, improve, and merge. Trained via a systematic rejection-sampling recipe, SkillComposer enables language models to self-evolve skills at inference time and supports three deployment modes: offline for building generalized libraries, online for task-specific refinement, and hybrid for combining both. Comprehensive experiments on $τ^2$-Bench, LiveCodeBench v6, and AppWorld show that SkillComposer consistently outperforms baselines. Our SkillComposer-4B improves a 27B executor by up to +4.5 on agent tasks and +3.4 on code tasks, while generalizing across domains and task types unseen during training. Analysis reveals that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability, providing a practical recipe for skill-augmented inference.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	4.0/10	6.0

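Scoring rationale: The paper centers on the evolution and composition of agent skills (SkillComposer) and does not involve tokenizers, visual encoders, or world-model architectures. It touches on language models and agents (related to MLLM and RL), but these are not its core contribution, so relevance is low and the weighted total falls below the dynamic passing score.

Keywords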

SkillComposer, Agent Skills, Language Models, Skill Composition, Inference-time, Generalization, Executor, Rejection Sampling

90. Geometry-Aware Dataset Condensation for Diffusion Model Training [FAIL]

Score: 16.5 / 27.8

Authors: Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

Published: 2026-06-04

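TL;DR: This paper proposes a geometry-aware dataset condensation method for diffusion models that uses partial optimal transport to preserve distributional structure and improve training fidelity.

Abstract (Chinese translation)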

数据集浓缩（Dataset Condensation）旨在通过合成或选择从真实数据中构建紧凑的数据集。然而，现有方法并不适合扩散模型（Diffusion Model）训练：合成数据生成往往产生保真度较低的样本，不适合用于真实建模；而真实子集选择（Real Subset Selection）通常无法保留扩散似然目标（Diffusion Likelihood Objectives）所要求的分布几何结构。为此，我们将真实子集选择重新表述为一种几何感知的分布对齐（Distribution Alignment）问题。通过引入单边部分最优传输（One-sided Partial Optimal Transport），我们的方法选择性地使一个紧凑子集与完整数据分布对齐，同时允许低密度区域中存在未匹配的质量，从而确保保留了对有效扩散模型训练至关重要的几何结构。为进一步确保分布保真度，我们在几何对齐的基础上补充了轻量级的特征统计（Feature-statistics）与语义一致性正则化。我们提出了一种高效的两阶段离散优化策略来实现这一对齐目标。在多种扩散模型变体、子集大小、图像分辨率及训练轮数上的广泛实验表明，我们的方法在扩散模型训练中实现了卓越的保真度和分布覆盖能力。代码可在 https://github.com/2018cx/GADC 获取。

Abstract

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we propose to reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, preserving the geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. An efficient two-stage discrete optimization strategy is proposed to achieve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Code is available at https://github.com/2018cx/GADC.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

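Scoring rationale: The paper centers on geometry-aware dataset condensation and optimal-transport alignment for diffusion model training. It does not involve model unification, tokenizers, visual encoder architectures, multimodal large language models, or reinforcement learning, so the corresponding keywords score low. The author list includes none of the specified experts, so no bonus is added.

Keywords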

Dataset Condensation, Diffusion Model Training, Geometry-Aware, Distribution Alignment, Optimal Transport, Subset Selection, Distributional Fidelity

91. Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads [FAIL]

Score: 15.0 / 27.8

Authors: Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman, Thierry Tambe

Published: 2026-06-04

TL;DR: This paper characterizes the system behavior of ten agent memory systems for long-horizon LLM tasks, proposing a taxonomy and profiling harness to optimize memory management strategies.

Abstract (Chinese translation)

大语言模型代理（LLM agents）正被越来越多地部署于需要基于扩展交互历史进行持续推理的长周期任务中。规模化实现这一目标要求代理能够在跨会话期间持久化地存储、检索和更新自身记忆。一个丰富的代理记忆系统生态系统已经涌现，涵盖了扁平检索、基于 LLM 的提取、整合事实存储库以及代理控制流等多种类型。然而，其系统级行为尚未得到表征。本文首次提出了代理记忆的系统表征（systems characterization）。首先，我们引入了一种面向系统的分类法（taxonomy），沿四个维度对代理记忆系统进行分类。其次，我们构建了一个感知阶段的剖析框架（profiling harness），将成本归因于构建、检索和生成过程。第三，我们在两个基准套件（benchmark suites）上刻画了十个代表性系统，揭示了设计选择如何在写入和读取路径之间转移成本。最后，我们提出了十条系统建议，涵盖构建调度、能力下限、通过查询量摊销、新鲜度 - 延迟权衡以及集群规模管理等方面。

Abstract

LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.
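The idea of a phase-aware profiling harness, one that attributes cost separately to construction, retrieval, and generation, can be sketched with a small timing helper. This is a hypothetical illustration, not the harness from the paper: the `PhaseProfiler` class, the toy in-memory store, and the string-formatting stand-in for generation are all assumptions, and only wall-clock time is measured.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfiler:
    """Accumulates wall-clock seconds per named phase (hypothetical helper)."""

    def __init__(self):
        self.seconds = defaultdict(float)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each phase."""
        total = sum(self.seconds.values())
        if total <= 0:
            return {}
        return {name: value / total for name, value in self.seconds.items()}


profiler = PhaseProfiler()
history = [f"turn {i}: user mentioned topic {i % 5}" for i in range(1000)]

with profiler.phase("construction"):   # write path: build the memory store
    store = [(i, text.lower()) for i, text in enumerate(history)]

with profiler.phase("retrieval"):      # read path: look up relevant memories
    hits = [text for _, text in store if "topic 3" in text]

with profiler.phase("generation"):     # stand-in for the LLM call
    answer = f"Found {len(hits)} relevant memories."
```

Calling `profiler.shares()` afterwards gives the per-phase split of the measured time, which is the kind of attribution that makes write-path versus read-path trade-offs visible.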

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on system characterization of memory for LLM agents, lacking direct content on multimodal components (Tokenizer, Visual Encoder, MultiModal) or model unification strategies. While 'World Models' and 'model-based RL' loosely relate to agents and stateful tasks, the core contribution is systems engineering rather than model architecture. No expert authors from the specified list were found. The weighted total score is 15.0, below the dynamic passing score of 27.8.

Keywords

Agent Memory, Long-Horizon Workloads, System Characterization, LLM Agents, Memory Management, Stateful Workloads, Profiling Harness, System Recommendations

92. RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation [FAIL]

Score: 15.0 / 27.8

Authors: Qi Lan, Yining Tang, Yu Shen, Yi Zhou, Yuhao Wei, Jie Li, Guofa Li

Published: 2026-06-04

TL;DR: RiskFlow presents a fast, closed-loop multi-agent traffic scenario generation framework that utilizes action space transport to achieve high realism and significantly reduced inference time compared to iterative diffusion methods.

Abstract (Chinese translation)

安全关键型交通场景生成对于在罕见但高风险交互下评估自动驾驶系统至关重要。现有的基于扩散的方法在闭环生成中提供了强大的可控性，但其迭代去噪过程计算成本高昂，且在长 horizon 生成过程中可能累积采样和引导误差，导致不真实的运动伪影，例如抖动、异常加速以及驶离道路等行为。为解决这些问题，我们提出 RiskFlow，一种闭环安全关键型多智能体交通生成框架，该框架将未来轨迹生成表述为动作空间中的传输过程。与依赖迭代去噪不同，RiskFlow 在有限区间内学习平均速度场，通过单次前向传播将高斯动作序列转换为未来的加速度和偏航率命令，并利用基于 JVP（雅可比向量积）的目标函数实现高效且稳定的训练。在测试阶段，RiskFlow 对生成的动作应用输出空间引导，引导选定的关键智能体朝向风险交互，同时正则化驶离道路行为，并通过车辆动力学重建物理上可行的轨迹。在 nuScenes 数据集上基于 tbsim 的闭环评估实验表明，RiskFlow 在多智能体和长 horizon 设置下实现了强对抗性与现实性之间的权衡。与代表性基线相比，RiskFlow 始终在提高现实性的同时保持具有竞争力的安全关键生成能力，并显著减少了评估所需的推理时间。

Abstract

Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.
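The last stage described in the abstract, reconstructing a trajectory from acceleration and yaw-rate commands through vehicle dynamics, can be illustrated with a simple kinematic rollout. The abstract does not say which dynamics model is used, so the unicycle model, the 0.1 s time step, the clamping of speed at zero, and the function name below are assumptions made for this sketch.

```python
import math


def rollout(x, y, heading, speed, actions, dt=0.1):
    """Integrate (acceleration, yaw_rate) commands with a unicycle model.

    Returns the list of states (x, y, heading, speed), starting state included.
    """
    trajectory = [(x, y, heading, speed)]
    for accel, yaw_rate in actions:
        speed = max(0.0, speed + accel * dt)   # assumed: no reversing
        heading = heading + yaw_rate * dt
        x = x + speed * math.cos(heading) * dt
        y = y + speed * math.sin(heading) * dt
        trajectory.append((x, y, heading, speed))
    return trajectory


# Constant speed, no steering: the vehicle moves straight along the x-axis.
straight = rollout(0.0, 0.0, 0.0, 10.0, [(0.0, 0.0)] * 10)

# Gentle left turn while braking.
turning = rollout(0.0, 0.0, 0.0, 10.0, [(-1.0, 0.5)] * 10)
```

Because positions are obtained by integrating the commands, every generated trajectory is consistent with the assumed motion model by construction, whatever action sequence the generator emits.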

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on autonomous driving traffic scenario generation using action space transport and velocity fields, rather than multimodal large language models or tokenization. Consequently, keywords like Tokenizer, MLLM, and Visual Encoder receive low scores (0) as they are not central to the methodology. Unify Models receives a low score (2) as the paper proposes a specific framework rather than unifying disparate models. World Models and model-based RL receive moderate scores (3) because the paper involves learning a dynamics model (velocity field) for trajectory generation, which aligns loosely with model-based simulation, though it is not a traditional world model or RL algorithm. MultiModal receives a low score (2) as the focus is on multi-agent action sequences rather than multimodal fusion. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found in the author list.

Keywords

Safety-critical traffic scenario generation, Multi-agent traffic generation, Action space transport, Velocity field, Vehicle dynamics, Closed-loop evaluation, Inference time reduction

93. Evaluating Agentic Configuration Repair for Computer Networks [FAIL]

Score: 15.0 / 27.8

Authors: Rufat Asadli, Benjamin Hoffman, Ioannis Protogeros, Laurent Vanbever

Published: 2026-06-04

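TL;DR: By introducing agentic architectures with formal verification and context retrieval, the paper significantly improves the efficacy and safety of large language models in complex network configuration repair.

Abstract (Chinese translation)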

计算机网络中的配置错误仍然是导致严重网络中断的主要原因。研究正转向利用大语言模型（LLMs）来自动化复杂且易出错的配置任务。然而，即使是最先进的模型也无法在大规模复杂场景中解决配置错误，且往往引入新的错误。本文对结合了形式化网络验证和上下文检索工具的开源及闭源 LLMs 进行了基准测试。我们证明，智能体架构在修复有效性（平均提高 12%）和安全性（平均提高 17%）方面优于基础 LLMs，这得益于其动态管理上下文并迭代验证配置修复的能力。

Abstract

Misconfigurations in computer networks remain a major source of critical Internet outages. Research is turning to Large Language Models (LLMs) to automate the complex, error-prone task of network configuration. However, even state-of-the-art models fail to resolve misconfigurations in large-scale, complex scenarios and often introduce new errors. In this work, we benchmark open- and closed-source LLMs augmented with formal network verification and context retrieval tools. We demonstrate that agentic architectures outperform base LLMs in repair efficacy (by 12% on average) and safety (by 17% on average), enabled by the ability to dynamically manage context and iteratively validate configuration repairs.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

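Scoring rationale: The paper centers on using LLM agentic architectures combined with formal verification to repair network configurations. Visual Encoder and MultiModal are unrelated (score 0) because no visual or multimodal data is involved. Tokenizer and MLLM are only slightly related (score 2) by virtue of using LLMs, which is not a core contribution. Unify Models, World Models, and model-based RL touch on models and planning, but the paper involves no unified model architecture, generative world model, or reinforcement learning algorithm, so their relevance is low (score 2).

Keywords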

Agentic Configuration Repair, Computer Networks, Large Language Models, Formal Network Verification, Context Retrieval, Agentic Architectures, Network Misconfigurations, Iterative Validation

94. Drag reduction or reward hacking? Recurrent multi-agent reinforcement learning that earns its reward [FAIL]

Score: 15.0 / 27.8

Authors: Giorgio Maria Cavallazzi, Miguel Pérez-Cuadrado, Alfredo Pinelli

Published: 2026-06-04

TL;DR: This paper addresses reward hacking in multi-agent reinforcement learning for fluid drag reduction by introducing a differentiable projection, recurrent policy, and accurate wall power reward to achieve honest energy savings.

Abstract (Chinese translation)

强化学习代理 (reinforcement-learning agent) 最大化其奖励，这可能与设计者预期的结果相偏离。在物理控制中，奖励很少能缩小这一差距，而壁湍流 (wall turbulence) 中的阻力减小使其具体化。质量守恒投影 (mass-conservation projection) 耦合了代理的输出，抹去了策略梯度 (policy gradient) 所需的每代理信用；无记忆策略 (memoryless policy) 无法解决其作用的慢近壁循环 (slow near-wall cycle)；而压力梯度奖励 (pressure-gradient reward) 通过向壁面泵送功率来支付名义上的阻力减小。两种退化控制器 (degenerate controllers) 实现了较大的阻力减小，同时总耗散 (total dissipation) 增加，因此报告的数值可能掩盖了更浪费的流动。我们将每个故障追溯到其根源并加以修复：一个可微投影 (differentiable projection) 以恢复信用，一个具有扩大感知模板 (widened sensing stencil) 的循环策略 (recurrent policy)，以及基于真实壁面功率 (true wall power) 评分的奖励。修正后的控制器在封闭能量预算 (closed energy budget) 内作用于流动，在诚实核算 (honest accounting) 下获得保守的 17%。

Abstract

A reinforcement-learning agent maximises its reward, which can diverge from the outcome its designer intended. In physical control the reward rarely closes that gap, and drag reduction in wall turbulence makes it concrete. A mass-conservation projection couples agents' outputs and erases the per-agent credit the policy gradient needs; a memoryless policy cannot resolve the slow near-wall cycle it acts on; and a pressure-gradient reward pays for nominal drag reduction by pumping power through the wall. Two degenerate controllers achieve large drag reductions while total dissipation rises, so the reported figure can mask a more wasteful flow. We trace each fault to its cause and fix it: a differentiable projection that restores credit, a recurrent policy with a widened sensing stencil, and a reward scored on the true wall power. The corrected controller acts on the flow within a closed energy budget, earning a conservative 17% under honest accounting.
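How a mass-conservation projection couples the agents' outputs can be seen in a toy version. The sketch below assumes the projection simply subtracts the mean of all wall actuations so that the net mass flux is zero; this mean-subtraction form and the names used are assumptions for illustration, and the code does not reproduce the paper's differentiable projection or its credit-restoring fix.

```python
def project_zero_net_flux(actions):
    """Subtract the mean so the actuations sum to zero (assumed projection)."""
    mean = sum(actions) / len(actions)
    return [a - mean for a in actions]


raw = [1.0, 0.0, -0.5, 0.5]          # raw blowing/suction requested by 4 agents
applied = project_zero_net_flux(raw)

# Change only agent 0's request and project again.
raw_perturbed = [2.0, 0.0, -0.5, 0.5]
applied_perturbed = project_zero_net_flux(raw_perturbed)
```

Only agent 0 changed its request, yet every agent's applied actuation shifts after projection. The environment therefore never sees an individual agent's raw action in isolation, which is one way to read the abstract's remark that the projection erases per-agent credit.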

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	6.0/10	9.0

Scoring rationale: The paper focuses on multi-agent reinforcement learning for fluid drag reduction, addressing reward hacking and credit assignment. It does not involve Large Language Models, Tokenizers, or Visual Encoders (scores 0). 'Unify Models' and 'MultiModal' have minimal relevance (1). 'World Models' has slight relevance due to recurrent policy handling temporal dynamics (2). 'model-based RL' has moderate relevance as the core task is RL involving policy and reward structures (6). No specified expert authors were found, so no bonus points were added. The weighted sum is 15.0, below the passing threshold of 27.8.

Keywords

Drag reduction, Multi-agent reinforcement learning, Reward hacking, Recurrent policy, Wall turbulence, Credit assignment, Differentiable projection, Energy budget

95. USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding [FAIL]

Score: 15.0 / 27.8

Authors: Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass

Published: 2026-06-04

TL;DR: USAD 2.0 proposes a scaled universal audio encoder integrating SSL and supervised knowledge through distillation, achieving state-of-the-art performance in audio understanding tasks for LLMs.

Abstract (Chinese translation)

音频编码器在现代音频应用中至关重要，因为大语言模型（LLMs）正日益依赖单一编码器来处理多样化输入。尽管自监督学习（SSL）已经产生了如语音或音乐专家模型这样的强领域特定编码器，但 USAD 和 SPEAR 等多域方法在覆盖范围和评估方面仍有限制。近期研究还表明，监督编码器与音频大语言模型的对齐效果更好。我们提出 USAD 2.0，这是一种整合了自监督学习（SSL）和监督基础模型知识的通用编码器。USAD 2.0 引入了领域感知蒸馏以解决教师模型不匹配问题，扩展了覆盖范围至音乐领域，并增加了用于下游任务的第二阶段监督蒸馏。我们进一步通过深度扩展将模型规模扩大至十亿参数。实验表明，USAD 2.0 在探测和基于大语言模型的评估中均实现了强大或最先进的性能。

Abstract

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on universal audio understanding via representation distillation, which is largely orthogonal to the provided keywords centered on multimodal unification, world modeling, and reinforcement learning. It scores low on Visual Encoder, World Models, and model-based RL due to the single-modality audio focus and absence of RL. Moderate scores on Unify Models and MLLM reflect its unification of SSL/supervised paradigms and relevance to LLM ecosystems, but it lacks explicit multimodal fusion or tokenizer architecture details. No expert authors from the target list are present.

Keywords

Universal Audio Understanding, Representation Distillation, Audio Encoders, Self-Supervised Learning, Foundation Models, Large Language Models, Scaling, Supervised Distillation

96. Computation-Aware Event-to-Frame Reconstruction via Selective Attention [FAIL]

Score: 15.0 / 27.8

Authors: Jingqian Wu, Yunbo Jia, Edmund Y. Lam

Published: 2026-06-04

TL;DR: This paper proposes a computation-aware event-to-frame reconstruction framework utilizing selective attention and recurrent encoding to efficiently bridge asynchronous event streams and frame-based vision.

Abstract (Chinese translation)

事件到帧 (E2F) 重建连接了异步事件流与基于帧的视觉处理流水线，但现有方法通常在重建质量与计算效率之间面临权衡。本文提出一种高效的 E2F 框架，强调因果时序建模与计算感知设计。该架构采用 recurrent encoder-decoder，利用紧凑的隐藏状态增量聚合事件信息。为提升在快速运动及光照变化下的鲁棒性，本文引入选择性上下文融合策略，以整合事件驱动特征与先验强度线索。在此融合过程中，轻量级 hybrid attention mechanism 增强了特征选择性，而无需依赖高计算量的 attention operations。在标准基准数据集上的实验结果表明，所提方法实现了具有竞争力的重建性能，同时在精度与模型复杂度之间保持了良好的平衡。

Abstract

Event-to-frame (E2F) reconstruction bridges asynchronous event streams with frame-based vision pipelines, but existing methods often face a trade-off between reconstruction quality and computational efficiency. In this work, we propose an efficient E2F framework that emphasizes causal temporal modeling and computation-aware design. The architecture adopts a recurrent encoder-decoder to incrementally aggregate event information with compact hidden states. To improve robustness under fast motion and illumination variations, a selective context fusion strategy is introduced to integrate event-driven features with prior intensity cues. Within this fusion process, a lightweight hybrid attention mechanism enhances feature selectivity without relying on heavy attention operations. Experimental results on standard benchmarks demonstrate that the proposed approach achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	6.0/10	9.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on computer vision (Event-to-Frame reconstruction) rather than LLMs or RL. It moderately utilizes a visual encoder for event data and combines event/frame streams (low multimodal relevance). It lacks tokenizers, world models, MLLM components, and reinforcement learning mechanisms, resulting in low alignment with most keywords.
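
The table above can be checked by hand: each keyword's score is weight × relevance, and the sum is compared with the pass threshold shown in the Score line (27.8). A minimal Python sketch of that arithmetic follows. The function name `weighted_total`, the row layout, and the choice of `>=` for passing are assumptions for illustration, not taken from the report generator.

```python
# Rows copied from the scoring table for paper 96: (keyword, weight, relevance out of 10).
rows = [
    ("Unify Models", 1.5, 1.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 6.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 3.0),
    ("model-based RL", 1.5, 0.0),
]

PASS_THRESHOLD = 27.8  # value shown in the "Score: 15.0 / 27.8" line


def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(rows)
# Assumption: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

The same check applies to every card in this report; only the relevance column changes.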

Keywords

Event-to-Frame Reconstruction, Selective Attention, Recurrent Encoder-Decoder, Computation-Aware, Hybrid Attention, Event Streams, Frame-based Vision

97. ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE [FAIL]

Score: 14.2 / 27.8

Authors: Mishan Aliev, Eva Neudachina, Ilya Bykov, Aleksandr Oganov, Kirill Struminsky, Aibek Alanov, Denis Rakitin

Published: 2026-06-04

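TL;DR: ReCache uses reinforcement learning to optimize caching schedules for diffusion models, substantially reducing compute cost while preserving generation quality.

Abstract (Chinese translation)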

现代扩散模型能够生成高质量的图像和视频，但其迭代去噪过程导致推理成本高昂。特征缓存（Feature Caching）通过重用或预测相邻去噪步骤之间的中间激活值来加速采样，利用了反向轨迹上计算的冗余性。本文聚焦于缓存调度（Caching Schedule）：即选择哪些去噪步骤应被完全重新计算。现有的调度方案要么是固定的（例如均匀分布），要么是基于每步误差启发式方法自适应选择的；在这两种情况下，实际计算成本都是手动调优阈值的副作用，而非用户可直接指定的量化指标。我们提出了 ReCache，其思路与此相反：给定目标计算预算 k，它学习能够最大化生成质量的重新计算调度，将计算量转化为用户可直接控制的输入参数。ReCache 通过策略梯度（Policy Gradients）进行训练，避免了在完整扩散推理过程中进行反向传播，且无需使用标注数据。未缓存推理生成的样本作为匹配目标，并配以生成质量的奖励。ReCache 兼容任何缓存机制，包括特征重用（Feature Reuse）和特征预测（Feature Forecasting）；对于每种机制，单个训练好的策略可在推理时适应不同的计算预算。ReCache 始终优于调度基线：在 FLUX 模型上实现 5.04 倍的 FLOPs 削减时，相比 DiCache，其 LPIPS 降低 31%（从 0.456 降至 0.316）；在 Wan 2.1 模型上实现约 2.6 倍的加速时，相比均匀 HiCache，其 LPIPS 降低 65%（从 0.480 降至 0.169），并将 VBench 评分提升 7%（5.6 分，从 70.4 升至 76.0）。代码开源地址为 https://github.com/thecrazymage/ReCache。

Abstract

Modern diffusion models generate high-quality images and videos, but their iterative denoising process makes inference expensive. Feature caching accelerates sampling by reusing or predicting intermediate activations across neighboring denoising steps, exploiting the redundancy of computations along the reverse trajectory. In this work, we focus on the caching schedule: selecting which denoising steps should be fully recomputed. Existing schedules are either fixed (e.g. uniform) or chosen adaptively from per-step error heuristics; in both cases, the actual compute cost is a side-effect of hand-tuned thresholds rather than a quantity the user can specify. We propose ReCache, which inverts this: given a target budget k, it learns the recomputation schedule that maximizes generation quality, turning compute into a directly controllable input. ReCache trains via policy gradients, sidestepping backpropagation through full diffusion inference, and uses no labelled data. Generations from uncached inference serve as matching targets, paired with a reward for generation quality. ReCache is compatible with any caching mechanism, including feature reuse and feature forecasting; for each mechanism, a single trained policy adapts across computational budgets at inference time. ReCache consistently outperforms scheduling baselines: under a $\times5.04$ FLOPs reduction on FLUX, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache; on Wan 2.1 at a $\sim \times2.6$ speedup, it drops LPIPS by 65% (from 0.480 to 0.169) and boosts the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache. Code is available at https://github.com/thecrazymage/ReCache.
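
To make the "given a budget k, learn which steps to recompute" idea concrete, here is a toy REINFORCE sketch in plain Python. It is not the ReCache implementation: the Plackett-Luce style sampling of k distinct steps, the `STEP_VALUE` table, and `toy_reward` are all hypothetical stand-ins (the paper's reward compares cached generations with uncached ones, which is not reproduced here), and this toy trains for one fixed k rather than one policy shared across budgets.

```python
import math
import random


def sample_schedule(logits, k, rng):
    """Sample k distinct steps without replacement and return them together
    with the gradient of the sample's log-probability w.r.t. the logits."""
    remaining = list(range(len(logits)))
    grad = [0.0] * len(logits)
    chosen = []
    for _ in range(k):
        top = max(logits[j] for j in remaining)
        weights = [math.exp(logits[j] - top) for j in remaining]
        total = sum(weights)
        probs = [w / total for w in weights]
        pick = rng.choices(remaining, weights=probs)[0]
        # d log softmax / d logit_j = 1[j == pick] - prob_j over remaining steps
        for j, p in zip(remaining, probs):
            grad[j] -= p
        grad[pick] += 1.0
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen), grad


# Hypothetical per-step "value of recomputing"; a real reward would compare
# the cached generation against the uncached one.
STEP_VALUE = [5.0, 1.0, 1.0, 4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 1.0]


def toy_reward(steps):
    return sum(STEP_VALUE[s] for s in steps) / sum(STEP_VALUE)


def train(reward_fn, num_steps, k, iters=500, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_steps
    baseline = 0.0
    for _ in range(iters):
        steps, grad = sample_schedule(logits, k, rng)
        reward = reward_fn(steps)
        advantage = reward - baseline  # moving-average baseline reduces variance
        baseline += 0.05 * (reward - baseline)
        logits = [l + lr * advantage * g for l, g in zip(logits, grad)]
    return logits


logits = train(toy_reward, num_steps=len(STEP_VALUE), k=3)
schedule = sorted(sorted(range(len(logits)), key=lambda j: -logits[j])[:3])
print("steps to recompute:", schedule)
```

The point of the sketch is that the update uses only sampled schedules and their scalar rewards, so nothing is backpropagated through the (here imaginary) diffusion run.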

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	2.5/10	3.8

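Scoring rationale: The paper addresses caching schedules for accelerating diffusion-model inference, using reinforcement learning (REINFORCE) to optimize how the compute budget is allocated. It makes no core contribution to model unification, tokenizer design, visual encoder architecture, world-model construction, multimodal large language models (MLLM), or representation learning. Although it uses reinforcement learning, the method is a model-agnostic scheduling policy, and diffusion models mainly concern visual generation, so relevance to the given keywords (especially Unify Models, Tokenizer, MLLM) is low across the board.

Keywords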

Diffusion Models, Feature Caching, Inference Acceleration, Policy Gradients, Compute Budget, Scheduling, REINFORCE

98. Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement [FAIL]

Score: 13.5 / 27.8

Authors: Jui-Hui Chung, Ziyang Cai, Zihao Li, Qishuo Yin, Rohit Agarwal, Simon Park, Rodrigo Porto, Narutatsu Ri, Ziran Yang, Shange Tang, Xingyu Dang, Hongzhou Lin, Mengdi Wang, Danqi Chen, Chi Jin, Liam H Fowl, Sanjeev Arora

Published: 2026-06-04

TL;DR: Goedel-Architect is an agentic framework for formal theorem proving in Lean 4 that achieves state-of-the-art performance on math benchmarks through blueprint generation and refinement.

Abstract (Chinese translation)

我们介绍 Goedel-Architect，这是一个以蓝图生成与精炼为核心的 Lean 4 形式化定理证明智能体框架。蓝图是一个定义和引理的依赖图，逐步构建至主定理。首先，Goedel-Architect 生成一个包含形式化陈述的定义和引理及其声明依赖关系的蓝图。该蓝图可选地由自然语言证明引导。随后，一个配备工具的 Lean 证明组件并行使用相关依赖关闭每个开放引理节点。失败的引理进而驱动全局蓝图的精炼。这种策略区别于其他使用递归引理分解的主流方法，后者可能在死胡同策略上低效循环。使用开源权重 DeepSeek-V4-Flash (284B-A13B) 作为骨干模型，Goedel-Architect 在 MiniF2F-test 上达到 99.2% 的 pass@1，在 PutnamBench 上达到 75.6% 的 pass@1。在较难问题上，若以可选的自然语言证明初始化初始蓝图，我们额外解决了 MiniF2F-test 中剩余的两个问题（达到 100%），将 PutnamBench 提升至 88.8% (597/672)，并在 IMO 2025 上解决 4/6 题、Putnam 2025 上解决 11/12 题、USAMO 2026 上解决 3/6 题。这代表了开源管道最先进的性能，其成本比可比的开源管道低多达 500 倍。

Abstract

We introduce Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 centered on blueprint generation and refinement. A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. First, Goedel-Architect generates a blueprint of formally stated definitions and lemmas, along with declared dependencies. This blueprint is optionally guided by a natural language proof. Then, a tool-equipped Lean prover component closes each open lemma node in parallel using relevant dependencies. Failed lemmas in turn drive refinement of the global blueprint. This strategy contrasts with other mainstream approaches, which use recursive lemma decomposition and can loop inefficiently on dead-end strategies. Using the open-weight DeepSeek-V4-Flash (284B-A13B) as the backbone, Goedel-Architect attains 99.2% pass@1 on MiniF2F-test and 75.6% pass@1 on PutnamBench. With an optional natural-language proof seeding the initial blueprint on the harder problems, we additionally close the remaining two MiniF2F-test problems (reaching 100%), lift PutnamBench to 88.8% (597/672), and solve 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026. This represents state-of-the-art performance for an open-source pipeline at a price point up to 500x less than comparable open-source pipelines.
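
A blueprint in the sense described above is a dependency graph, so it can be pictured with a small Python sketch using the standard-library `graphlib`. The node names, the `check_order` helper, and the `downstream` helper are hypothetical illustrations; the abstract does not describe Goedel-Architect's actual data structures or how refinement picks what to revise.

```python
from graphlib import TopologicalSorter

# Hypothetical blueprint: each node maps to the set of nodes it depends on.
blueprint = {
    "def_A": set(),
    "lemma_1": {"def_A"},
    "lemma_2": {"def_A"},
    "lemma_3": {"lemma_1", "lemma_2"},
    "main_theorem": {"lemma_3"},
}


def check_order(graph):
    """Return a dependency-respecting order; raises CycleError if the
    declared dependencies are circular."""
    return list(TopologicalSorter(graph).static_order())


def downstream(graph, failed):
    """Nodes that transitively depend on `failed`, i.e. the part of the
    blueprint a failed lemma could force to change."""
    affected = set()
    frontier = [failed]
    while frontier:
        node = frontier.pop()
        for other, deps in graph.items():
            if node in deps and other not in affected:
                affected.add(other)
                frontier.append(other)
    return affected


order = check_order(blueprint)
print(order)
print(downstream(blueprint, "lemma_1"))
```

Because every lemma is formally stated up front, each node can in principle be attempted independently with its dependencies taken as given; the ordering here only validates that the declared dependencies are acyclic.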

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on formal theorem proving using an agentic framework and LLMs (DeepSeek-V4-Flash), which has low overlap with the provided keywords centered on multimodal learning, world models, and reinforcement learning. While it utilizes an LLM backbone (MLLM: 3, MultiModal: 2), it does not involve visual encoders, tokenizer design, world modeling, or model-based RL. Unify Models is loosely related to process unification but not architectural unification.

Keywords

Formal Theorem Proving, Blueprint Generation, Refinement, Lean 4, Agentic Framework, DeepSeek-V4-Flash, Dependency Graph, Pass@1

99. Emergent Language as an Approach to Conscious AI [FAIL]

Score: 13.5 / 27.8

Authors: Zengqing Wu, Chuan Xiao

Published: 2026-06-04

TL;DR: This paper proposes using emergent language in multi-agent reinforcement learning to study consciousness without human language priors, demonstrating that agents can develop self-referential communication in a minimal environment.

Abstract (Chinese translation)

人工系统是否具有意识这一问题仍悬而未决，部分原因在于现有方法要么基于理论推导的清单对系统进行评估（判别性 (discriminative)），要么直接设计受意识启发的模块（架构性 (architectural)）；这两种方法都留下了疑问：观察到的结构是否仅仅是人类语言先验 (human language priors) 的产物。我们提出了一种生成式方法论：多智能体强化学习 (multi-agent reinforcement learning) 中的涌现语言 (EL)，其中智能体从极简状态开始（无语言、无自我概念、极少接触人类文本），仅在任务压力下发展通信，确保因果归因性 (causal attributability) 在于任务需求，而非继承的人类语言先验。我们通过讨论涌现语言如何作为研究意识相关结构的生成工具来定位我们的方法论，涵盖环境复杂度 (environment complexity) 的作用以及涌现通信 (emergent communication) 的解释。作为概念验证 (proof of concept)，我们在极简环境中实例化该方法论，并表明智能体发展出了自指通信 (self-referential communication)，包括一个回声不匹配检测电路 (echo-mismatch detection circuit)，该电路并非仅由任务结构或架构 (architecture) 所能预测，而是源于特定的环境可供性 (environmental affordance)。

Abstract

The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from a minimal state (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on emergent language in multi-agent reinforcement learning for consciousness, showing low alignment with multimodal large model components (Tokenizer, Visual Encoder, MLLM). It has some relevance to World Models (2.0) and model-based RL (3.0) due to environment modeling and RL context, but lacks explicit focus on unified architectures.

Keywords

Emergent Language, Multi-Agent Reinforcement Learning, Conscious AI, Self-referential Communication, Generative Methodology, Minimal Environment, Echo-mismatch Detection

100. Diffusion Models for Adaptive Sequential Data Generation [FAIL]

Score: 13.5 / 27.8

Authors: Haoyang Cao, Minshuo Chen, Yinbin Han, Renyuan Xu

Published: 2026-06-04

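TL;DR: This paper proposes a sequential forward-backward diffusion framework for adapted time series generation, with statistical guarantees, and demonstrates its effectiveness in portfolio optimization.

Abstract (Chinese translation)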

生成逼真的合成序列数据在运筹学、金融、医疗保健、能源系统和科学计算等领域的实际应用中至关重要，其中基于时间索引的观测数据被用于预测、模拟、风险评估和数据驱动的决策。尽管扩散模型（diffusion models）在生成静态数据方面取得了显著成功，但它们直接扩展到序列设置时往往无法捕捉时间依赖性和信息结构。因此，设计能够以适应方式（即不预先获取未来信息）模拟序列数据的扩散模型仍然是一个开放性的挑战。本文提出了一种用于适应式时间序列生成的序列前向 - 后向扩散框架。我们的方法沿序列逐步注入和移除噪声，并基于先前生成的历史进行条件化，以确保适应性。引入了一种新颖的得分匹配（score-matching）目标，以实现高效的并行训练。我们在通用框架下推导了严格的统计保证，然后建立了得分近似、得分估计和分布估计结果，并以 ReLU 网络（ReLU networks）作为具体实例。实验上，我们在合成数据（包括 ARMA 模型（ARMA models）和高斯过程（Gaussian processes））上验证了该方法，并展示了其在构建均值 - 方差最优投资组合方面的有效性。

Abstract

Generating realistic synthetic sequential data is critical in real-world applications across operations research, finance, healthcare, energy systems, and scientific computing, where time-indexed observations are used for prediction, simulation, risk assessment, and data-driven decision-making. While diffusion models have achieved remarkable success in generating static data, their direct extensions to sequential settings often fail to capture temporal dependence and information structure. Designing diffusion models that can simulate sequential data in an adapted manner, and hence without anticipation of future information, therefore remains an open challenge. In this work, we propose a sequential forward-backward diffusion framework for adapted time series generation. Our approach progressively injects and removes noise along the sequence, conditioning on the previously generated history to ensure adaptiveness. A novel score-matching objective is introduced for efficient parallel training. We derive rigorous statistical guarantees under a generic framework, then establish score approximation, score estimation, and distribution estimation results with ReLU networks serving as a concrete instance. Empirically, we validate our method on synthetic data, including ARMA models and Gaussian processes, and demonstrate its effectiveness in constructing mean-variance optimal portfolios.
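
The abstract's "adapted" requirement means each value may depend on the past but never on future information. The ARMA models used as synthetic benchmarks have exactly this property, which a short Python sketch can show. This is only an illustration of an ARMA(1,1) recursion, x_t = φ·x_{t-1} + ε_t + θ·ε_{t-1}; the function name `arma11` and the choice to pass the noise sequence explicitly are assumptions, and none of this is the paper's diffusion method.

```python
def arma11(phi, theta, noise):
    """Generate an ARMA(1,1) path from a given noise sequence.

    x_t = phi * x_{t-1} + eps_t + theta * eps_{t-1},
    starting from x_{-1} = 0 and eps_{-1} = 0.
    Each x_t uses only noise[0..t], so the path is adapted (non-anticipating).
    """
    path = []
    prev_x, prev_eps = 0.0, 0.0
    for eps in noise:
        x = phi * prev_x + eps + theta * prev_eps
        path.append(x)
        prev_x, prev_eps = x, eps
    return path


# A single unit shock at t = 0 shows how its effect decays through the recursion.
impulse_response = arma11(phi=0.5, theta=0.2, noise=[1.0, 0.0, 0.0])
print(impulse_response)
```

Changing a later noise value never alters earlier entries of the path, which is the non-anticipation property a sequential generator has to preserve.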

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

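Scoring rationale: The paper focuses on generating sequential time series data with diffusion models and has no direct connection to multimodal architectures, tokenizers, visual encoders, or large language models. It scores moderately on World Models (3.0) and model-based RL (4.0) because it involves generative dynamics and decision-making applications; it scores low on Unify Models (2.0), since it only unifies the forward and backward diffusion processes. Tokenizer, Visual Encoder, MLLM, and MultiModal score 0 because the work handles single-modality continuous data with no LLM or vision component. The weighted total of 13.5 is below the dynamic pass threshold of 27.8, indicating low relevance to the given keyword topics.

Keywords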

Sequential Data Generation, Diffusion Models, Time Series, Score Matching, Statistical Guarantees, Portfolio Optimization, Forward-Backward Framework

101. GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis [FAIL]

Score: 13.5 / 27.8

Authors: Oleeviya Babu Poikarayil, Cédric Schockaert, Abdulrahman Nahhas, Christian Daase, Mursal Dawodi, Jawid Ahmad Baktash

Published: 2026-06-04

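TL;DR: GenAutoML proposes an agentic framework that uses LLMs to dynamically generate and optimize neural architectures for time series, enabling efficient and adaptive Edge AI deployment.

Abstract (Chinese translation)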

设计用于时间序列预测和异常检测的神经网络架构仍然是一项资源密集型任务，通常需要大量的领域专业知识。传统的自动化机器学习（AutoML）系统通常依赖于静态、预定义的搜索空间，限制了其适应多样化数据特征的能力。我们提出 GenAutoML，这是一个智能体框架，利用大语言模型（LLMs）作为神经网络架构师，以桥接自然语言需求与可执行的 PyTorch 实现。该框架包含一个用于自主代码精炼的沙盒反射循环（Sandboxed Reflection Loop）和一个确保架构一致性和执行安全的签名感知运行时（Signature-Aware Runtime）。为了提高在非平稳条件下的鲁棒性，我们进一步引入了一个动态可逆实例归一化（Dyn-RevIN）包装器。在 ETTh1、ETTm1 和 Weather 基准上的实验表明，GenAutoML 能够根据数据集特征动态生成任务特定的神经网络架构。在生成的模型中，WaveInterferenceNet 实现了每样本推理延迟低于 0.01 毫秒，同时保持具有竞争力的预测性能。通过强调计算效率、架构适应性和稳定的优化行为，GenAutoML 使得创建适合资源受限且对延迟敏感的边缘人工智能（Edge AI）部署的超轻量神经网络成为可能。

Abstract

Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically rely on static, predefined search spaces, limiting their ability to adapt to diverse data characteristics. We present GenAutoML, an agentic framework that leverages Large Language Models (LLMs) as neural architects to bridge natural-language requirements and executable PyTorch implementations. The framework incorporates a Sandboxed Reflection Loop for autonomous code refinement and a Signature-Aware Runtime that enforces architectural consistency and execution safety. To improve robustness under non-stationary conditions, we further introduce a Dynamic Reversible Instance Normalization (Dyn-RevIN) wrapper. Experiments on the ETTh1, ETTm1, and Weather benchmarks demonstrate that GenAutoML can dynamically generate task-specific neural architectures tailored to dataset characteristics. Among the generated models, WaveInterferenceNet achieves inference latency below 0.01 ms per sample while maintaining competitive predictive performance. By emphasizing computational efficiency, architectural adaptability, and stable optimization behavior, GenAutoML enables the creation of ultra-lightweight neural networks suitable for resource-constrained and latency-sensitive Edge AI deployments.
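
The abstract does not say how Dyn-RevIN differs from ordinary reversible instance normalization, so the sketch below shows only the standard idea it builds on: normalize each input window by its own mean and standard deviation, let the model work in normalized space, then undo the normalization on the output. The function names are hypothetical, the learnable affine parameters of RevIN are omitted, and nothing here reflects the "dynamic" part.

```python
import math
import statistics


def revin_normalize(window, eps=1e-5):
    """Normalize one time-series window by its own statistics and
    return the statistics so the step can be reversed later."""
    mean = statistics.fmean(window)
    std = math.sqrt(statistics.pvariance(window) + eps)
    return [(x - mean) / std for x in window], (mean, std)


def revin_denormalize(values, stats):
    """Map values from normalized space back to the original scale."""
    mean, std = stats
    return [v * std + mean for v in values]


window = [10.0, 12.0, 11.0, 13.0]
normalized, stats = revin_normalize(window)
# A forecasting model would run on `normalized`; here the round trip is shown.
restored = revin_denormalize(normalized, stats)
print(normalized)
print(restored)
```

Because the statistics come from each window rather than the whole training set, a shift in level or scale over time is removed before the model sees the data and added back afterwards.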

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

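Scoring rationale: The paper's topic is LLM-based AutoML for time series, which is largely unrelated to the multimodal, visual encoder, world model, and model-based RL themes of the keywords. Only the use of LLMs gives a weak link (MLLM and Tokenizer score 2); the remaining vision- and RL-related keywords score 0-1. The weighted total of 13.5 is below the pass threshold of 27.8.

Keywords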

AutoML, Time-Series Forecasting, LLM Agent, Dynamic Architecture, Edge AI, PyTorch, Non-stationary Data, Model Optimization

102. Comparison of Deep Learning Frameworks For Rice Disease Mapping From UAV Multispectral Imaging [FAIL]

Score: 13.5 / 27.8

Authors: Yadav Raj Ghimire, Jagrati Talreja, Tewodros Syum Gebre, Timothy Agboada, Shikha V. Chandel, Leila Hashemi Beni

Published: 2026-06-04

TL;DR: This paper compares deep learning frameworks for segmenting rice bacterial leaf blight severity from UAV multispectral imagery, finding that U-Net++ with EfficientNet-B3 achieves the highest accuracy among tested models.

Abstract (Chinese translation)

本研究利用无人机 (UAV) 多光谱影像，基于卷积神经网络 (CNNs) 和 Transformer 模型，对水稻细菌性叶枯病 (BLB) 的严重程度进行分割。评估的模型架构包括采用 ResNet-101 编码器的 U-Net、采用 EfficientNet-B3 和 EfficientNetB7 的 U-Net++、DeepLabV3+ 以及 SegFormer。所有模型均在同一训练流程下训练，并采用三种输入配置（仅多光谱、多光谱 +NDVI（归一化植被指数）和多光谱 +NDRE（归一化红边植被指数））。实验基于公开的 BLB 数据集进行，性能指标采用平均交并比 (mIoU)、平均 F1 分数 (mF1)、平均准确率 (mAcc)、精确率和召回率进行报告。采用 EfficientNet-B3 的 U-Net++ 取得了最佳性能，其 mIoU 达到 97.62%。SegFormer 的分割精度较低，但推理速度相当。总体而言，结果表明轻量级 CNN 骨干网络在实际业务化 BLB 监测中仍更为可靠，而植被指数的整合带来了小幅且一致的改进。本研究还强调了标准化无人机 (UAV) 数据集在比较病害制图方法中的价值，并鼓励在实际田间应用中采用 CNN 架构。

Abstract

In this study, UAV multispectral imagery is used to segment the severity of bacterial leaf blight (BLB) in rice using convolutional neural networks (CNNs) and transformer-based models. The evaluated architectures include U-Net with a ResNet-101 encoder, U-Net++ with EfficientNet-B3 and EfficientNet-B7, DeepLabV3+, and SegFormer, all trained under a common pipeline with three input configurations (multispectral only, multispectral+NDVI, and multispectral+NDRE). Experiments are conducted using the publicly available BLB dataset with performance reported using mean IoU (mIoU), mean F1 (mF1), mean accuracy (mAcc), precision, and recall. U-Net++ with EfficientNet-B3 achieved the highest performance, with an mIoU of 97.62%. SegFormer obtained lower segmentation accuracy but comparable inference speed. Overall, the results indicate that lightweight CNN backbones remain more reliable for operational BLB monitoring while integration of vegetation indices provides small and consistent improvements. The study also highlights the value of standardised UAV datasets to compare disease mapping methods and encourages the use of CNN architectures for field implementation.
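
For readers unfamiliar with the quantities named above, the two vegetation indices are normalized band differences, NDVI = (NIR − Red) / (NIR + Red) and NDRE = (NIR − RedEdge) / (NIR + RedEdge), and mIoU averages per-class intersection-over-union. The Python sketch below computes them on toy values. The function names, the small `eps` guard, and the choice to skip classes absent from both prediction and target are illustrative assumptions, not details from the paper.

```python
def normalized_difference(a, b, eps=1e-9):
    """(a - b) / (a + b), with a small eps to avoid division by zero."""
    return (a - b) / (a + b + eps)


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from reflectance values."""
    return normalized_difference(nir, red)


def ndre(nir, red_edge):
    """Normalized Difference Red Edge index from reflectance values."""
    return normalized_difference(nir, red_edge)


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in
    either the prediction or the target (flattened label lists)."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)


print(ndvi(nir=0.5, red=0.1))
print(ndre(nir=0.5, red_edge=0.3))
print(mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2))
```

In the "multispectral+NDVI" and "multispectral+NDRE" configurations, an index like this would be computed per pixel and supplied alongside the raw bands.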

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	4.0/10	6.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

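Scoring rationale: The paper is mainly about agricultural computer vision, using UAV multispectral imagery to segment rice disease. It uses CNN encoders (such as ResNet and EfficientNet) and multispectral data (MultiModal), which gives some technical overlap with the keywords, but it does not touch core concepts such as MLLM, Tokenizer, World Models, or model-based RL, and it does not propose a unified model architecture. None of the designated experts appear in the author list. The weighted total of 13.5 is below the dynamic pass threshold of 27.8, indicating low relevance to the specified research area.

Keywords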

Rice Disease Mapping, UAV Multispectral Imaging, Convolutional Neural Networks, Transformer-based Models, Segmentation, Bacterial Leaf Blight, Vegetation Indices, Deep Learning Frameworks

103. Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents [FAIL]

Score: 12.0 / 27.8

Authors: Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen

Published: 2026-06-04

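TL;DR: Vortex is a system for efficient LLM serving through programmable sparse attention, enabling AI agents to design algorithms that significantly raise throughput without sacrificing accuracy.

Abstract (Chinese translation)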

随着生成长度的持续增长，稀疏注意力 (Sparse Attention) 在为大型语言模型 (LLMs) 提供服务时正变得日益重要。然而，在大规模部署和评估新的稀疏注意力算法仍然高度工程密集，从而减缓了人类研究者和 AI 代理 (AI Agents) 探索稀疏注意力设计的进程。为应对这一挑战，我们提出了 Vortex 系统。该系统结合了基于页面中心张量抽象 (Page-centric tensor abstraction) 之上的用于表达广泛的稀疏注意力算法的 Python 嵌入式前端语言，并与紧密集成到现代 LLM 服务栈中的高效后端相融合。Vortex 使得稀疏注意力算法的快速原型设计、部署和评估成为可能，有效地将其理论上的效率增益转化为现实世界中的吞吐量提升。因此，Vortex 显著加速了稀疏注意力算法的设计与迭代过程。首先，AI 代理利用 Vortex 自动生成并精炼多样化的算法，其中最优算法在保持准确性的前提下，吞吐量比完整注意力 (Full Attention) 高出 3.46 倍。其次，Vortex 将稀疏注意力扩展至新兴架构及超大模型（否则难以进行实验），在基于 MLA 的 GLM-4.7-Flash 上实现了高达 4.7 倍的吞吐量提升，在 NVIDIA B200 GPU 上的 229B 参数 MiniMax-M2.7 模型上也达到了 1.37 倍的吞吐量提升。

Abstract

Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to $3.46\times$ higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.
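
As a rough picture of what a page-level sparse attention algorithm decides, the Python sketch below groups cached keys into fixed-size pages, scores each page by the dot product between the query and the page's mean key, and keeps the top-k pages. This is a generic selection heuristic written for illustration only. It is not Vortex's frontend language or API, and `select_pages` and its parameters are hypothetical names.

```python
def select_pages(query, keys, page_size, top_k):
    """Return the indices of the top_k pages whose mean key has the
    highest dot product with the query."""
    pages = [keys[i:i + page_size] for i in range(0, len(keys), page_size)]

    def page_score(page):
        # Mean key of the page, computed column by column.
        mean_key = [sum(col) / len(page) for col in zip(*page)]
        return sum(q * m for q, m in zip(query, mean_key))

    ranked = sorted(range(len(pages)),
                    key=lambda i: page_score(pages[i]),
                    reverse=True)
    return sorted(ranked[:top_k])


# Six cached keys of dimension 2, stored as three pages of two keys each.
keys = [[0.0, 0.0], [0.0, 0.0],
        [1.0, 0.0], [1.0, 0.0],
        [0.5, 0.0], [0.5, 0.0]]
selected = select_pages(query=[1.0, 0.0], keys=keys, page_size=2, top_k=2)
print(selected)
```

Attention would then be computed only over the keys in the selected pages, which is where the throughput gain over full attention comes from.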

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

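Scoring rationale: The paper centers on LLM serving infrastructure and the optimization of sparse attention algorithms, with low relevance to the multimodal, world model, and reinforcement learning keywords. Only weak links exist: LLM serving implicitly involves a tokenizer, and the paper mentions AI agents. Visual Encoder, MLLM, and MultiModal are entirely unrelated. The weighted total of 12.0 is below the dynamic pass threshold of 27.8.

Keywords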

Sparse Attention, LLM Serving, AI Agents, Throughput Improvement, Programmable System, Tensor Abstraction, Algorithm Generation

104. An Infectious Disease Spread Simulation Based on Large Language Model Decision Making [FAIL]

Score: 12.0 / 27.8

Authors: Yonchanok Khaokaew, Ruochen Kong, Andreas Zufle, Hao Xue, Taylor Anderson, Chandini Raina MacIntyre, Matthew Scotch, Flora D. Salim, David J Heslop

Published: 2026-06-04

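TL;DR: This paper proposes an LLM-based agent simulation framework for modelling individual decision-making during infectious disease outbreaks, finding that income and education are the main drivers of variation in reporting rates.

Abstract (Chinese translation)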

传染病爆发期间个体决策的建模对于理解行为动力学以及指导有效的公共卫生干预措施至关重要。先前研究表明，大语言模型 (LLM) 可以通过基于人口统计提示和情境上下文生成智能体 (agent) 决策来模拟真实的人类行为。我们在此基础上构建了一个基于空间的、基于智能体 (agent-based) 的模拟框架，将 LLM 生成的关于自我报告的流感样疾病决策整合到基于人口普查 (census-based) 的合成人口 (synthetic population) 中。空间位置被视为核心特征：智能体被分配到城市内的空间单元，利用真实世界的人口普查数据捕捉不同人口群体的空间分布，从而实现地理多样化的行为建模。我们实施并比较了三种决策情景：独立推理、家庭影响和信息框架 (message framing)，并在旧金山和亚特兰大模拟了自我报告的结果。结果显示，收入和教育是报告率变异的主导驱动因素，而地理因素、LLM 模型选择和信息框架的影响较小但一致。我们的框架生成的合成数据既捕捉了社会异质性也捕捉了地理异质性，支持空间流行病学建模和偏差感知 (bias-aware) 的行为分析。

Abstract

Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded, agent-based simulation framework that integrates LLM-generated decisions about self-reported influenza-like illness into a census-based synthetic population of agents. Location is treated as a central feature: agents are assigned to spatial units within cities, capturing the spatial distributions of different demographic groups using real-world census data and enabling geographically diverse behavioural modelling. We implement and compare three decision scenarios, independent reasoning, household influence, and message framing, and simulate self-reporting outcomes in San Francisco and Atlanta. Results reveal that income and education are the dominant drivers of reporting rate variation, with smaller but consistent effects from geography, LLM model choice, and message framing. Our framework generates synthetic data that captures both social and geographic heterogeneity, supporting spatial epidemiological modelling and bias-aware behavioural analysis.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

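Scoring rationale: The paper studies infectious disease spread simulation with LLM decision-making and belongs to the applied sciences. Visual Encoder, MultiModal, and model-based RL are entirely unrelated to its content (text-only LLM, no vision, no reinforcement learning) and score 0. Tokenizer and Unify Models are only basic LLM components rather than core contributions (1-2). MLLM involves language models but no multimodality (2). World Models has some relevance because the paper builds a simulated environment (3). The total of 12.0 is below the pass threshold of 27.8, indicating low relevance to the specified technical directions.

Keywords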

Infectious Disease Spread, LLM Decision Making, Agent-Based Simulation, Spatial Grounding, Census-Based Synthetic Population, Behavioral Dynamics, Public Health Interventions

105. Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance [FAIL]

Score: 12.0 / 27.8

Authors: Gizem Yüce, Giorgos Nikolaou, Nicolas Flammarion

Published: 2026-06-04

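TL;DR: This paper proposes the ATWU framework, which jointly learns token forget-specificity and model parameters, significantly improving the forget-retain trade-off in LLM unlearning without relying on external supervision.

Abstract (Chinese translation)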

机器遗忘（Machine Unlearning）旨在从已训练模型中移除特定知识，同时保留其通用能力。对于自回归语言模型，遗忘样本中的各个词元对于遗忘的相关性并不均等。现有方法要么忽视这种异质性，要么依赖辅助模型、启发式策略或外部标注来估计每个词元对于遗忘的相关性。相反，我们通过其与保留目标的相互作用来刻画这种相关性：一个词元的遗忘特定性体现在最小化该词元上的遗忘损失不会与保留最优性发生冲突。我们将这一观点形式化为关于模型参数和词元权重的联合优化问题，并证明，在满足自然分离条件下，所得目标函数能够恢复出理想的遗忘特定词元支持集。受此公式化启发，我们提出了交替加权遗忘（ATWU），这是一个轻量级框架，它在遗忘过程中联合学习词元的遗忘特定性和模型参数，仅使用隐藏状态上的简单线性评分器，无需外部词元级监督。在 TOFU 和 RWKU 上，ATWU 实现了遗忘 - 保留权衡的最优性能，优于样本级方法、基于概率的词元权重启发式策略以及基于辅助模型的方法。此外，学习到的得分与真实遗忘特定跨度对齐程度显著更高，表明 ATWU 能够识别出具有语义意义的词元级遗忘信号。总体而言，我们的结果表明，保留冲突提供了一种有效标准，用于识别语言模型应当遗忘的内容，使得能够直接从模型表征中无监督地学习词元遗忘特定性，且计算开销极小。

Abstract

Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token's relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token-level supervision. Across TOFU and RWKU, ATWU achieves state-of-the-art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token-level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token-level forget-specificity directly from model representations with minimal computational overhead.
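
The phrase "a simple linear scorer over the hidden states" can be pictured as follows: each token's hidden vector is mapped to a weight in (0, 1), and the forget loss becomes a weighted average of per-token losses. The Python sketch below is an assumption-laden illustration of that idea only. The sigmoid squashing, the normalization by the sum of weights, and the names `token_weights` and `weighted_forget_loss` are hypothetical and are not taken from the paper; the alternating optimization against the retain objective is not shown.

```python
import math


def token_weights(hidden_states, v, b):
    """Linear scorer followed by a sigmoid: one weight in (0, 1) per token."""
    weights = []
    for h in hidden_states:
        score = sum(vi * hi for vi, hi in zip(v, h)) + b
        weights.append(1.0 / (1.0 + math.exp(-score)))
    return weights


def weighted_forget_loss(token_losses, weights):
    """Weighted average of per-token losses; tokens with larger weight
    contribute more to what the model is pushed to forget."""
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)


# Two tokens with 2-dimensional hidden states and hypothetical scorer parameters.
hidden_states = [[2.0, 0.0], [-2.0, 0.0]]
token_losses = [1.0, 3.0]
weights = token_weights(hidden_states, v=[1.0, 0.0], b=0.0)
print(weights)
print(weighted_forget_loss(token_losses, weights))
```

With a zero scorer every token receives weight 0.5 and the objective reduces to the ordinary mean token loss, which corresponds to the sample-level baselines the abstract contrasts against.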

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	4.0/10	6.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

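Scoring rationale: The paper's core topic is LLM unlearning. Only 'Tokenizer' (token-granularity processing) and 'Unify Models' (a joint optimization objective) are weakly related; the remaining keywords (vision, world models, multimodal, RL) are entirely unrelated. None of the designated experts appear in the author list, so no bonus is applied.

Keywords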

Machine unlearning, Token-level importance, LLM unlearning, Forget-retain trade-off, Alternating Token-Weighted Unlearning, Retain conflict, Autoregressive language models, Token weights

106. TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation [FAIL]

Score: 12.0 / 27.8

Authors: Dongwon Son, Florian Shkurti, Jason Lee, Naman Shah, Beomjoon Kim, Dieter Fox

Published: 2026-06-04

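TL;DR: This paper proposes the Torque Adaptation Module (TAM), which applies torque corrections based on proprioceptive history to achieve robust motion transfer across robot platforms, enabling zero-shot deployment without real-robot data.

Abstract (Chinese translation)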

为某一机器人调优的策略在另一机器人上往往表现不同，这可能是由于仿真到现实（sim-to-real）的差距、未知负载，或是同一机器人两个实例之间的动力学差异所致。在接触丰富且动态的操作中，即使微小的运动偏差也可能导致无法跟踪参考运动，因为它们会破坏接触的时间点和模式。常见的解决方法，如领域随机化（domain randomization）或系统辨识（system identification），要么会产生过于保守的任务策略，要么需要为每个机器人或负载重新收集数据。我们提出了扭矩适应模块（Torque Adaptation Module, TAM），这是一个学习模块，用于调整发送给机器人的扭矩命令，使其匹配理想机器人的行为。TAM 位于跟踪策略动作的低层控制器（low-level controller）与机器人的扭矩接口之间。它包含一个历史编码器，将本体感觉（proprioceptive）历史嵌入潜在状态，以及一个扭矩适配器，用于计算残差扭矩修正。由于 TAM 仅依赖于本体感觉历史，而不依赖于策略观测或动作空间，因此相同的 TAM 权重可被重用，以适应具有不同动作空间的策略（如关节目标、末端执行器目标或直接扭矩）。这些策略本身无需使用机器人参数的领域随机化进行训练。相反，我们将领域随机化的需求转移至 TAM，通过在完全随机化的模拟环境中对其进行训练来实现：采用多机器人预训练，随后进行特定机器人的微调步骤，此过程仍无需真实机器人数据。我们在真实的 Franka Panda 机器人上对 TAM 进行零样本评估，涵盖多种动态操作任务，包括基于视觉的箱子推动策略（源自强化学习 RL）、翻转策略（源自行为克隆 BC）以及基于模型预测控制（MPC）的球板平衡任务。我们的实验表明，与在线系统辨识（online system identification）和 RMA 基线相比，TAM 提升了零样本真实机器人执行效果，并实现了稳健的动态操作性能。

Abstract

A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy's actions and the robot's torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box pushing policy (from RL), a flip policy (from BC), and an MPC ball-on-plate balancing task. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	3.0/10	4.5

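Scoring rationale: This paper focuses on torque adaptation and motion transfer in robot control; its core method applies torque corrections based on proprioceptive history. The supplied keywords center on multimodal large models, representation learning, and world models (e.g., Tokenizer, MLLM, World Models), which are largely unrelated to the paper's robot-control topic. There is only a weak connection through reinforcement learning (model-based RL) and the vision-based task evaluation (Visual Encoder, MultiModal), so the relevance scores are low. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords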

Torque Adaptation Module, Motion Transfer, Robust Manipulation, Proprioceptive History, Domain Adaptation, Sim-to-Real, Zero-shot Execution, Robot Dynamics

107. Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns [FAIL]

Score: 12.0 / 27.8

Authors: Olasimbo Ayodeji Arigbabu

Published: 2026-06-04

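TL;DR: This paper proposes an entropy-based evaluation framework for AI agents that measures behavioral patterns through metrics such as action entropy, trajectory entropy, and information gain, complementing traditional task-completion evaluation.

Abstract Translation

AI agents are typically evaluated by task success rate, reward, latency, and cost. These metrics are useful, but they often overlook important aspects of agent behavior: whether the agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework that measures agent behavior through entropy. Rather than treating intelligence solely as final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are meant to complement, not replace, traditional evaluation methods. We also provide a practical Python implementation designed to integrate with agent frameworks such as LangChain and Google ADK, as well as custom agent loops and stored observability traces.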

Abstract

AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.
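
The abstract names action entropy but this report does not reproduce its definition. As a minimal sketch, assuming it is the Shannon entropy of the empirical distribution of actions in one trace, it could be computed as follows. The function name and the example traces are hypothetical and are not taken from the paper's Python implementation.

```python
import math
from collections import Counter


def action_entropy(actions):
    """Shannon entropy (in bits) of the empirical action distribution.

    Assumption: each element of `actions` is a hashable action label,
    e.g. a tool name recorded in an agent trace.
    """
    counts = Counter(actions)
    n = len(actions)
    # p * log2(1/p) summed over observed actions; 0.0 for a single repeated action.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Hypothetical traces: one agent repeats a single action, the other mixes three.
rigid = ["search"] * 8
mixed = ["search", "read", "search", "write", "read", "search", "read", "write"]

print(f"rigid: {action_entropy(rigid):.3f} bits")
print(f"mixed: {action_entropy(mixed):.3f} bits")
```

Under this reading, a low value points to rigid repetition and a high value to broad exploration; whether either is desirable depends on the task, which is why the abstract frames these metrics as a complement to task success.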

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	3.0/10	4.5
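
The Score column in these tables appears to be weight × relevance, and the per-paper total appears to be the sum of that column. A minimal sketch under that assumption, using the rows of the table above; the pass rule (total at or above the threshold) is also an assumption, since the report does not state how the 27.8 threshold is derived or applied.

```python
# Rows copied from the table above: (keyword, weight, relevance out of 10).
rows = [
    ("Unify Models", 1.5, 1.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 1.0),
    ("MLLM", 1.5, 2.0),
    ("MultiModal", 1.5, 1.0),
    ("model-based RL", 1.5, 3.0),
]

# Threshold as printed in the "Score" line; its derivation is not given in the report.
PASS_THRESHOLD = 27.8


def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(rows)
# Assumed rule: a paper passes when its total reaches the threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"{total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

Recomputing the total this way is a quick consistency check between a table and the figure quoted in its scoring rationale.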

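Scoring rationale: The paper's core contribution is an entropy-based evaluation framework for AI agents that focuses on measuring behavioral patterns (e.g., action entropy, trajectory entropy). It has little connection to the keywords covering model architecture components (Tokenizer, Visual Encoder), model unification (Unify Models), or specific model types (World Models, MLLM, MultiModal). It touches the AI-agent area and so overlaps somewhat with model-based RL, but its emphasis is evaluation rather than algorithm implementation. The author list does not include any of the specified experts, so no bonus points apply.

Keywords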

Entropy-Based Evaluation, AI Agents, Behavioral Patterns, Action Entropy, Information Gain, Robustness Entropy, Lightweight Framework

108. Latent Reasoning with Normalizing Flows [FAIL]

Score: 12.0 / 27.8

Authors: Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang, Haoqiang Kang, Lianhui Qin, Yizhe Zhang, Jiatao Gu

Published: 2026-06-04

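TL;DR: NF-CoT uses normalizing flows to turn discrete chain-of-thought into continuous latent reasoning, improving pass rates on code-generation tasks while reducing reasoning cost.

Abstract Translation

Large language models (LLMs) often improve their reasoning by generating explicit chain-of-thought (CoT), which highlights the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by carrying out intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods usually give up key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that retains these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, supports probabilistic left-to-right decoding with the original KV cache, and allows direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT achieves higher pass rates than explicit CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.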

Abstract

Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
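
The claim that a normalizing flow gives exact likelihoods together with left-to-right sampling can be illustrated with a toy scalar example. The sketch below is a generic affine autoregressive flow with a standard normal base, not NF-CoT or TARFlow; the `conditioner` function is a hypothetical stand-in for the network that would map earlier positions to a mean and log-scale.

```python
import math


def conditioner(prefix):
    """Hypothetical stand-in for a causal network: earlier values -> (mean, log_scale)."""
    mean = 0.5 * prefix[-1] if prefix else 0.0
    log_scale = -0.1 * len(prefix)
    return mean, log_scale


def log_likelihood(x):
    """Exact log p(x) via the change-of-variables formula, one position at a time."""
    total = 0.0
    for t, x_t in enumerate(x):
        mean, log_scale = conditioner(x[:t])
        z_t = (x_t - mean) * math.exp(-log_scale)            # forward map x_t -> z_t
        log_base = -0.5 * (z_t ** 2 + math.log(2 * math.pi))  # standard normal log-density
        total += log_base - log_scale                          # log|dz_t/dx_t| = -log_scale
    return total


def sample(noise):
    """Invert the flow left to right: x_t = mean + exp(log_scale) * z_t."""
    x = []
    for z_t in noise:
        mean, log_scale = conditioner(x)
        x.append(mean + math.exp(log_scale) * z_t)
    return x


x = sample([0.3, -1.2, 0.7])
print(x, log_likelihood(x))
```

Because each position depends only on earlier ones, sampling proceeds in generation order and the log-likelihood is a plain sum of per-position terms, which is the property the abstract relies on for policy-gradient training.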

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

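Scoring rationale: The paper centers on latent reasoning with normalizing flows in LLMs (NF-CoT) and has very little connection to the multimodal components (MultiModal, MLLM, Visual Encoder) or to conventional world models (World Models). Although it involves continuous states and policy-gradient optimization, it does not match the definitions of model-based RL or unified multimodal models (Unify Models), so those keywords are only moderately relevant; Tokenizer is relevant because the paper contrasts its approach with discrete token streams. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords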

Latent Reasoning, Normalizing Flows, Chain-of-Thought, Code Generation, Continuous Thoughts, Probabilistic Decoding, LLM Backbone

109. Symb-xMIL: Symbolic Explanations for Multiple Instance Learning in Digital Pathology [FAIL]

Score: 12.0 / 27.8

Authors: Yanqing Luo, Julius Hense, Niklas Prenißl, Andreas Mock, Klaus-Robert Müller, Thomas Schnake, Mina Jamshidi Idaji

Published: 2026-06-04

TL;DR: Symb-xMIL proposes a symbolic explanation framework for Multiple Instance Learning in digital pathology that utilizes logical rules to reveal how tissue features combine, enhancing interpretability beyond traditional heatmaps.

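Abstract Translation

Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods rely mainly on heatmaps that highlight influential regions, but they cannot explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We propose Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how well a MIL model's behavior aligns with human-readable decision rules expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal the semantic patterns behind the model's predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers the ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC (a head and neck squamous cell carcinoma cohort), our framework refines patient survival stratification beyond HPV status, with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability from visual attribution to structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.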

Abstract

Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods primarily rely on heatmaps that highlight influential regions but do not explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We introduce Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how a MIL model's behavior aligns with human-readable decision rules, expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal semantic patterns underlying the model's predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC, a cohort of head and neck cancer, our framework refines patient survival stratification beyond HPV status with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability beyond visual attribution toward structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on explainable AI (XAI) for multiple instance learning (MIL) in digital pathology, proposing a symbolic explanation framework (Symb-xMIL) based on logical rules. The provided keywords relate primarily to multimodal large language models (MLLM), world models, and reinforcement learning architectures. Overlap is minimal: the paper works with image data (slight relevance to Visual Encoder and MultiModal) but does not use tokenizers, unify models, employ world models, or involve reinforcement learning. Consequently, the weighted score is low (12.0), well below the dynamic passing threshold of 27.8, indicating that the paper does not align with the theme defined by the keywords.

Keywords

Multiple Instance Learning, Digital Pathology, Symbolic Explanations, Logical Rules, Interpretability, Post-hoc Explanation, Tissue Features, Clinical Relevance

110. TS-ICL: A Flexible Time-Indexed Foundation Model for Time Series via In-Context Learning [FAIL]

Score: 12.0 / 27.8

Authors: Etienne Le Naour, Tahar Nabil, Adrien Petralia

Published: 2026-06-04

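TL;DR: TS-ICL is an in-context-learning foundation model for time series that unifies forecasting and imputation and achieves a new state of the art on imputation.

Abstract Translation

Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being replaced by general-purpose zero-shot models. However, current approaches focus mainly on forecasting, whereas real-world time series are often irregularly and partially observed, which calls for models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder-regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL sets a new state of the art in imputation while remaining competitive with leading forecasting foundation models on both univariate and covariate-aware benchmarks. It performs particularly well when forecasting from partially observed look-back windows.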

Abstract

Foundation models mark a profound paradigm shift in time series modeling, with task-specific models being superseded by general-purpose zero-shot models. Yet, current approaches primarily focus on forecasting, while real-world time series are often irregularly and partially observed, requiring models that can jointly forecast, impute missing values, and handle degraded sampling conditions. To address these challenges, we introduce TS-ICL, a novel probabilistic In-Context Learning encoder-regressor Transformer that unifies forecasting and imputation. TS-ICL formulates time series tasks as timestamp-aligned regression and naturally incorporates covariates by training on synthetic dependency structures generated from a novel causal data prior. Empirically, TS-ICL achieves a new state-of-the-art in imputation, while remaining competitive with leading forecasting foundation models across both univariate and covariate-aware benchmarks. It shows particularly strong performance in forecasting with partially observed look-back windows.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	5.0/10	7.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

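Scoring rationale: The paper's topic is a foundation model for time series. 'Unify Models' scores 5 because the abstract explicitly mentions unifying tasks, and 'Tokenizer' scores 3 because the Transformer architecture implies tokenization. The remaining keywords concern vision, multimodality, and reinforcement learning, which are unrelated to this single-modality time series work, so they score 0. The author list includes none of the specified experts.

Keywords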

Time Series, Foundation Model, In-Context Learning, Forecasting, Imputation, Transformer, Covariates, Probabilistic

111. Complexity-Balanced Diffusion Splitting [FAIL]

Score: 12.0 / 27.8

Authors: Noam Issachar, Dani Lischinski, Raanan Fattal

Published: 2026-06-04

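TL;DR: This paper proposes Complexity-Balanced Splitting, a framework that allocates generative capacity over time to specialized sub-networks, substantially improving synthesis quality without increasing inference cost.

Abstract Translation

Standard continuous-time generative models rely on monolithic architectures that must cope with vastly different signal regimes, from isotropic noise to intricate data distributions. Although scaling model capacity improves performance, deploying one massive network uniformly across the entire generative timeline is inherently inefficient. This paper proposes Complexity-Balanced Splitting (CBS), a principled framework for temporal capacity allocation that distributes the generative workload across multiple specialized sub-networks. Grounded in function approximation theory and de Boor's equidistribution principle, CBS partitions the diffusion timeline into segments of equal approximation burden, allocating more representational capacity to regions where the generative dynamics are harder to model. To estimate this local complexity, we introduce two complementary and tractable monitor functions: a spatial measure based on the flow's Dirichlet energy, and a geometric measure based on the acceleration of the sampling trajectories. By using a lightweight auxiliary model to estimate these complexity profiles, our approach removes the need for heuristic temporal splits or computationally expensive search procedures. Extensive evaluation across multiple architectures (SiT, JiT, and UNet) and datasets shows that CBS consistently improves generation quality without increasing per-step inference cost. In particular, on SiT-XL with classifier-free guidance (CFG), CBS improves FID (Fréchet Inception Distance) by about 35% relative to naive temporal partitioning. The project page is at https://noamissachar.github.io/CBS/.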

Abstract

Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricate data distributions. While scaling model capacity improves performance, deploying a massive network uniformly across the entire generative timeline is inherently inefficient. In this work, we propose Complexity-Balanced Splitting (CBS), a principled framework for temporal capacity allocation that distributes the generative workload across multiple specialized sub-networks. Grounded in function approximation theory and de Boor's equidistribution principle, CBS partitions the diffusion timeline into segments of equal approximation burden, allocating more representational capacity to regions where the generative dynamics are more difficult to model. To estimate this local complexity, we introduce two complementary and tractable monitor functions: a spatial measure based on the flow's Dirichlet energy, and a geometric measure based on the acceleration of the sampling trajectories. Using a lightweight auxiliary model to estimate these complexity profiles, our approach eliminates the need for heuristic temporal splits or computationally expensive search procedures. Extensive evaluation across multiple architectures (SiT, JiT, and UNet) and datasets demonstrates that CBS consistently improves synthesis quality without increasing per-step inference cost. In particular, CBS improves FID by ~35% on SiT-XL with CFG relative to naive temporal partitioning. Project page is available at https://noamissachar.github.io/CBS/.
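
The equidistribution idea mentioned in the abstract (segments of equal approximation burden) can be sketched numerically: given samples of a non-negative monitor function on a time grid, place breakpoints so that each segment carries the same share of its integral. This is a generic illustration rather than the CBS procedure; the linear monitor profile is hypothetical and stands in for the Dirichlet-energy or acceleration measures, and a strictly positive total integral is assumed.

```python
def equidistribute(ts, monitor, num_segments):
    """Breakpoints splitting [ts[0], ts[-1]] into segments with equal monitor integral.

    Uses the trapezoidal rule for the cumulative integral and linear interpolation
    inside a grid cell, so the result is approximate between grid points.
    """
    cum = [0.0]
    for i in range(1, len(ts)):
        area = 0.5 * (monitor[i] + monitor[i - 1]) * (ts[i] - ts[i - 1])
        cum.append(cum[-1] + area)

    breaks = [ts[0]]
    i = 1
    for k in range(1, num_segments):
        target = cum[-1] * k / num_segments
        while cum[i] < target:  # advance to the cell containing the target mass
            i += 1
        frac = (target - cum[i - 1]) / (cum[i] - cum[i - 1])
        breaks.append(ts[i - 1] + frac * (ts[i] - ts[i - 1]))
    breaks.append(ts[-1])
    return breaks


# Hypothetical monitor profile: complexity grows linearly along the timeline.
ts = [i / 100 for i in range(101)]
monitor = [1.0 + 9.0 * t for t in ts]
breaks = equidistribute(ts, monitor, 2)
print(breaks)
```

With this profile the single interior breakpoint falls past the midpoint, so the harder (high-monitor) region gets the shorter segment and therefore a sub-network of its own over a narrower time range.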

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

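Scoring rationale: The paper's core is a temporal capacity allocation strategy for diffusion models (Complexity-Balanced Splitting), which differs markedly in domain from the supplied keyword set (focused on multimodal large models and reinforcement learning). 'Unify Models' scores 3 for proposing a unified capacity-allocation framework; 'Visual Encoder' scores 3 because the experiments involve vision architectures such as SiT; 'World Models' scores 2 for a slight connection through generative modeling; the remaining keywords (Tokenizer, MLLM, MultiModal, model-based RL) do not appear in the paper and score 0. The weighted total is 12.0, far below the dynamic passing threshold of 27.8.

Keywords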

Complexity-Balanced Splitting, Diffusion Models, Temporal Capacity Allocation, Generative Modeling, SiT Architecture, Inference Efficiency, Dirichlet Energy

112. Self-Augmenting Retrieval for Diffusion Language Models [FAIL]

Score: 10.5 / 27.8

Authors: Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger

Published: 2026-06-04

TL;DR: This paper proposes SARDI, a training-free framework that leverages low-confidence tokens from diffusion language models to guide retrieval during text generation, achieving superior performance and throughput on multi-hop QA benchmarks.

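Abstract Translation

Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, commit the high-confidence predictions to the output, and discard the low-confidence ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation (RAG): even early in the denoising trajectory, low-confidence tokens often surface salient entities, making it possible to retrieve stronger evidence before the output is finalized. We exploit this with Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. On five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to 8× higher throughput.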

Abstract

Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to 8× higher throughput.
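
The commit-or-discard step described in the abstract can be sketched as follows. This is an illustration only: the abstract does not say how SARDI selects tokens or builds its retrieval query, so the confidence threshold, both function names, and the example predictions are hypothetical.

```python
def split_predictions(predictions, threshold=0.9):
    """Split one denoising step's predictions by confidence.

    predictions: list of (position, token, confidence) for masked positions.
    Returns (committed, lookahead): committed maps position -> token for confident
    predictions; lookahead lists the low-confidence tokens that would be discarded.
    """
    committed = {}
    lookahead = []
    for position, token, confidence in predictions:
        if confidence >= threshold:
            committed[position] = token
        else:
            lookahead.append(token)
    return committed, lookahead


def build_query(question, lookahead):
    """Append de-duplicated lookahead tokens to the question as extra retrieval terms."""
    extras = list(dict.fromkeys(lookahead))  # keeps first-seen order
    return " ".join([question] + extras)


# Hypothetical predictions from an early denoising step.
predictions = [
    (0, "The", 0.97),
    (1, "Marie", 0.41),
    (2, "Curie", 0.38),
    (3, "was", 0.95),
    (4, "Curie", 0.20),
]
committed, lookahead = split_predictions(predictions)
query = build_query("Who discovered polonium?", lookahead)
print(committed)
print(query)
```

In this toy step the committed tokens are function words, while the entity that matters for retrieval sits among the low-confidence tokens, which is the effect the abstract describes.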

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on text generation using discrete diffusion models and retrieval-augmented generation. It lacks visual encoders, multimodal components (MLLM, MultiModal), and reinforcement learning (model-based RL), so these keywords score 0. Tokenizer is moderately relevant (3.0) because discrete tokens are fundamental to the model. Unify Models and World Models have weak relevance (2.0 each): the paper unifies retrieval with diffusion generation and uses generative models, but not in the architectural or environment-modeling sense implied by the keyword set. None of the specified expert authors are present in the author list.

Keywords

Diffusion Language Models, Retrieval-Augmented Generation, Self-Augmenting Retrieval, Multi-hop QA, Training-free, Discrete Diffusion, Denoising Process, Lookahead Signal

113. In-Context Multiple Instance Learning [FAIL]

Score: 10.5 / 27.8

Authors: Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller

Published: 2026-06-04

TL;DR: Pretraining a Perceiver-style in-context learner on synthetic data enables few-shot Multiple Instance Learning without task-specific training.

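Abstract Translation

Multiple Instance Learning (MIL) addresses problems in which supervision is available at the level of bags of instances, and it has been applied successfully in fields ranging from computational pathology to satellite imagery. However, existing algorithms struggle in the low-label regime that characterizes many real-world applications: flexible models tend to overfit, while rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from only a handful of labeled bags. At inference time, classification takes a single forward pass and requires no gradient updates. We propose and study different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.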

Abstract

Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Multiple Instance Learning (MIL) and in-context learning with a Perceiver-style architecture. It has minimal overlap with the provided keywords, which target large multimodal models, world models, and reinforcement learning. The Perceiver design involves latent tokens and visual encoding, but the paper does not address World Models, MLLM, or RL.

Keywords

Multiple Instance Learning, In-Context Learning, Perceiver Architecture, Synthetic Data, Few-Shot Learning, Bag-Structured Data, Pretraining

114. TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management [FAIL]

Score: 10.5 / 27.8

Authors: Shweta Mishra

Published: 2026-06-04

TL;DR: TokenMizer utilizes a graph-structured memory system to compress LLM session history, reducing token usage while preserving relational context for long-horizon tasks.

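Abstract Translation

Deploying large language models (LLMs) on long-horizon tasks faces a fundamental constraint: context windows are finite, while productive work sessions are not. When the history exceeds the Maximum Effective Context Window (MECW), critical structured information (such as architectural decisions, task transitions, and file histories) is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces latency for repeated queries. On a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates substantial token savings. Its resume blocks average 78 tokens (range: 42-124), about half the size of the evaluated baselines (159-170 tokens), while achieving higher decision recall (+9-17 percentage points). Crucially, the baselines preserve only the fact that a technology was mentioned, whereas TokenMizer preserves the rationale behind it. Across all sessions, TokenMizer achieves a mean task recall of 51.0%, decision recall of 46.6%, and file recall of 58.7%. The variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show that fuzzy label matching is the dominant improvement factor (+33 percentage points of task recall). Heuristic compression achieves a 47.3% token reduction with no external dependencies. TokenMizer offers a queryable alternative to text-retention baselines at half the token cost.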

Abstract

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on LLM session memory management using knowledge graphs, showing weak relevance to Tokenizer (token management) and World Models (long-term memory). It lacks multimodal components (Visual Encoder, MultiModal, MLLM), reinforcement learning (model-based RL), and model unification architectures (Unify Models). No matching expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found in the author list.

Keywords

LLM Context Management, Graph-Structured Memory, Session History, Token Economy, Knowledge Graph, Long-Horizon Tasks, Resume Blocks

115. Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction [FAIL]

Score: 10.5 / 27.8

Authors: Hongkun Dou, Zike Chen, Fengji Li, Hongjue Li, Yue Deng

Published: 2026-06-04

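TL;DR: This paper proposes the GILC framework, which enables controllable generation with discrete diffusion models by correcting logits with gradient information, achieving the best performance on DNA, protein, and molecule generation tasks without additional training.

Abstract Translation

Controllable generation with discrete diffusion models is often limited by high computational overhead or the need for retraining. This paper proposes Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by reusing the pretrained denoising network as a variational proxy. To avoid the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean-prediction logits, yielding stable and effective guidance. The method is compatible with both differentiable and non-differentiable reward functions. Extensive experiments on DNA, protein sequence, and molecular generation tasks show that GILC achieves state-of-the-art performance without additional training, often outperforming fine-tuning approaches.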

Abstract

Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. In this paper, we present Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

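Scoring rationale: The paper focuses on controllable generation with discrete diffusion models and proposes the GILC framework. It is largely unrelated to the multimodal keywords (MultiModal, MLLM, Visual Encoder) and to World Models, because it handles only single-modality sequence data (DNA, proteins, molecules) and involves no vision or cross-modal interaction. Tokenizer has low relevance: discrete diffusion operates on tokens, but tokenization is not the focus of the work. model-based RL has low relevance: the method uses reward functions but is not model-based reinforcement learning. Unify Models has moderate relevance because the method unifies the guidance-signal mechanism. The author list does not include any of the specified experts. The weighted total is 10.5, below the dynamic passing threshold of 27.8.

Keywords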

Discrete Diffusion Models, Controllable Generation, Gradient-Informed Logit Correction, Plug-and-Play Framework, DNA Sequence Generation, Molecular Generation, Variational Proxy, Jacobian-free Mechanism

116. Improving Answer Extraction in Context-based Question Answering Systems Using LLMs [FAIL]

Score: 10.5 / 27.8

Authors: Hafez Abdelghaffar, Ahmed Alansary, Ali Hamdi

Published: 2026-06-04

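TL;DR: By fine-tuning a Roberta-base model on the SQuAD1.1 dataset, this paper substantially improves the accuracy and relevance of answer extraction in context-based question answering systems.

Abstract Translation

Question answering (QA) systems have made notable progress with the advent of large language models (LLMs). However, they still struggle to accurately extract and generate precise answers from a given context, particularly for complex or ambiguous queries. Existing approaches often fall short in contextual understanding, answer consistency, and generalization across diverse domains. This paper proposes an LLM-based question answering system whose input consists of a textual context and a corresponding question and whose output is a concise and accurate answer. The work aims to address the limitations of current QA systems, in particular their tendency to produce irrelevant or imprecise responses even when the correct context is available. Our approach fine-tunes a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction. Specifically, we use the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model performs best, with a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering. The findings further confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.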

Abstract

Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.
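
For readers unfamiliar with the headline metric, ROUGE-L scores a predicted answer by its longest common subsequence (LCS) with the reference answer. The sketch below is a minimal version that assumes lowercase whitespace tokenization and the balanced F1 form; the abstract does not state which tokenizer or F-measure weighting the authors used, and the example strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 between a predicted answer and a reference answer."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of predicted tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


# Hypothetical extractive-QA example: the prediction carries one extra leading token.
print(round(rouge_l_f1("in the 10th century", "the 10th century"), 4))
```

Because the score rewards token overlap in order, a prediction that contains the reference answer plus a few extra words still scores well, which helps explain how ROUGE-L can sit far above BLEU on short extractive answers.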

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

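Scoring rationale: The paper focuses on text-based question answering (QA), applying supervised fine-tuning to an LLM (Roberta-base) on the SQuAD1.1 dataset. The keywords MultiModal, Visual Encoder, and MLLM concern multimodal processing and are unrelated to this text-only work; World Models and model-based RL concern environment modeling and reinforcement learning, which the paper does not cover; Unify Models is not reflected in any unified multi-model architecture; and Tokenizer, although a basic LLM component, is not a core contribution. Apart from weak connections to Tokenizer and MLLM (because an LLM is used), the keywords have very low relevance.

Keywords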

Question Answering, Large Language Models, Fine-tuning, SQuAD1.1, Answer Extraction, Context-based, Roberta-base

117. Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning [FAIL]

Score: 10.5 / 27.8

Authors: Jiahao Zeng, Ming Tang, Ningning Ding

Published: 2026-06-04

TL;DR: This paper proposes MetaRouter, a meta-learning framework for personalized LLM routing that optimizes cost-performance trade-offs by efficiently learning implicit user preferences through contextual bandits.

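Abstract Translation

Large language models (LLMs) present a trade-off between performance and cost: more powerful models tend to incur higher expense. LLM routing aims to reduce expense while maintaining performance by sending each query to the most suitable model. However, existing methods struggle to accommodate users' differing cost-performance preferences. To close this gap, we propose a novel perceptive LLM routing paradigm for personalized, user-centric cost-performance optimization that efficiently learns users' implicit preferences from a small number of interactions. To handle heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in a contextual bandit and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. It also learns user preferences efficiently, is robust to changes in the set of routable LLMs, and scales to multi-model routing.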

Abstract

Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to mitigate expenses while maintaining performance by sending queries to the most suitable model. However, existing methods do not perform well across different user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized and user-centric cost-performance optimization, which efficiently learns users' implicit preferences from few interactions. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in a contextual bandit and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.

Score Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on LLM routing and meta-learning for cost-performance optimization and has low relevance to the multimodal architecture keywords (Visual Encoder, MultiModal, Tokenizer, World Models). It routes among multiple LLMs (a loose link to Unify Models), works with LLMs (a loose link to MLLM), and uses contextual bandits and meta-learning (RL-adjacent, hence a loose link to model-based RL), but it does not address any of these topics specifically. None of the target expert authors appear in the author list.

Keywords

LLM Routing, Meta-Learning, Cost-Performance, Contextual Bandit, User Preferences, Model Selection, Personalized Optimization

118. Step-adaptive multimodal fusion network with multi-scale cloud feature learning for ultra-short-term solar irradiance forecasting [FAIL]

Score: 10.5 / 27.8

Authors: Jingxin Zhang, Xiaoqin Wang

Published: 2026-06-04

TL;DR: This paper proposes a step-adaptive multimodal fusion network combining cloud image features and meteorological time-series data to improve ultra-short-term solar irradiance forecasting accuracy.

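Abstract Translation

Ultra-short-term solar irradiance forecasting is critical for photovoltaic system dispatch and power grid stability. Existing approaches have three main shortcomings: single time-series models cannot capture the spatial dynamics of clouds under complex conditions, standard convolutions are inadequate for representing multi-scale cloud features, and fixed low-frequency compensation strategies cannot adapt to different prediction steps. To address these issues, this paper proposes a multi-source data fusion model for ultra-short-term irradiance forecasting. The model first uses InceptionNeXt to extract multi-scale, multi-directional spatial features from ground-based cloud images. It then introduces a step-adaptive low-frequency compensation unit that dynamically modulates global low-frequency information according to the prediction step. Finally, the enhanced image features are combined with meteorological time-series features, and a TempAttnLSTM network captures global temporal dependencies for multi-step prediction. Experiments on the public NREL dataset and on operational photovoltaic stations in Shandong demonstrate the effectiveness of the proposed method compared with several state-of-the-art approaches.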

Abstract

Ultra-short-term solar irradiance prediction is critical for photovoltaic system dispatch and power grid stability. Existing approaches suffer from three key shortcomings: single time-series models cannot capture the spatial dynamics of clouds under complex conditions, standard convolutions inadequately represent multi-scale cloud features, and fixed low-frequency compensation strategies fail to adapt to different prediction steps. To address these issues, this paper proposes a multi-source data fusion model for ultra-short-term irradiance prediction. The model first employs InceptionNeXt to extract multi-scale, multi-directional spatial features from ground-based cloud images. A step-adaptive low-frequency compensation unit is then introduced to dynamically modulate global low-frequency information based on the prediction step. Finally, the enhanced image features are combined with meteorological time-series features, and a TempAttnLSTM network captures global temporal dependencies for multi-step prediction. Experiments on the public NREL dataset and practical photovoltaic stations in Shandong illustrate the effectiveness of the proposed method compared with several state-of-the-art approaches.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	5.0/10	7.5
model-based RL	1.5	0.0/10	0.0
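
The total in the "Score" line appears to be the sum of weight × relevance over the seven keyword rows. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the rule that a paper passes when its total reaches the dynamic passing score is an assumption, since the report does not state the comparison explicitly.

```python
# Rows copied from the table above: (keyword, weight, relevance out of 10).
ROWS = [
    ("Unify Models", 1.5, 0.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 2.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 5.0),
    ("model-based RL", 1.5, 0.0),
]


def weighted_total(rows):
    """Sum weight * relevance across all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(ROWS)
passing_score = 27.8  # dynamic passing score from the "Score" line
# Assumption: the badge is PASS when the total reaches the passing score.
verdict = "PASS" if total >= passing_score else "FAIL"
print(f"{total:.1f} / {passing_score} -> {verdict}")
```

With the rows above this reproduces the reported 10.5 and the FAIL badge.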

Scoring rationale: The paper focuses on solar irradiance forecasting using cloud images and meteorological time-series data. It shows moderate relevance to 'MultiModal' due to fusion of image and time-series data, and slight relevance to 'Visual Encoder' (using InceptionNeXt for image features). However, it has no relevance to MLLM, Tokenizers, World Models, Model-Based RL, or Model Unification, resulting in a low overall score relative to the keyword cluster.

Keywords

Ultra-short-term solar irradiance forecasting, Multimodal fusion network, Cloud images, Multi-scale feature learning, Step-adaptive compensation, TempAttnLSTM, InceptionNeXt

119. On Advantage Estimates for Max@K Policy Gradients [FAIL]

Score: 10.5 / 27.8

Authors: Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo

Published: 2026-06-04

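TL;DR: This paper proposes an advantage estimator for Max@K policy gradients based on a Leave-Two-Out baseline; centering the advantages reduces gradient variance and improves LLM post-training performance.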

Abstract

Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
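
To make the max@K objective concrete, here is a minimal sketch of the standard unbiased estimate of the expected best-of-K reward from N sampled rewards: the average of the maximum over all size-K subsets, computed in closed form from the sorted rewards. This is only the objective's value estimate, not the paper's MaxPO advantage estimator or its Leave-Two-Out baseline; the function name and the example rewards are hypothetical.

```python
from itertools import combinations
from math import comb


def max_at_k(rewards, k):
    """Average of max(subset) over all size-k subsets of the rewards."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(rewards)")
    ordered = sorted(rewards)
    # The i-th smallest reward (0-indexed) is the maximum of exactly
    # comb(i, k - 1) subsets: choose the other k - 1 members below it.
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)


def max_at_k_bruteforce(rewards, k):
    """Reference implementation that enumerates every subset."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


rewards = [0.0, 1.0, 0.0, 0.5]  # hypothetical rewards for one prompt
print(max_at_k(rewards, 2), max_at_k_bruteforce(rewards, 2))
```

With k = 1 the estimate reduces to the mean reward, and with k equal to the group size it reduces to the single best reward.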

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

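Scoring rationale: The paper's core contribution is an advantage estimation method for Max@K policy gradients in reinforcement learning, applied to LLM post-training. The multimodal keywords (Visual Encoder, MultiModal, MLLM), World Models, Tokenizer, and model-based RL are not core topics of the paper; only 'Unify Models' partially fits, through the unified view of advantage estimators. Relevance scores are therefore low across the board, and the weighted total of 10.5 is below the dynamic passing score of 27.8. No designated experts appear in the author list.

Keywords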

Advantage Estimates, Max@K Policy Gradients, Reinforcement Learning, LLM Post-training, Baseline Design, Variance Reduction, Leave-Two-Out

120. DNQ: Deep Nash Q-Network for Partially Observable n-Player Games [FAIL]

Score: 10.5 / 27.8

Authors: Qintong Xie, Edward Koh, Xavier Cadet, Peter Chin

Published: 2026-06-04

TL;DR: DNQ introduces a solver-in-the-loop framework for training bidding agents in partially observable n-player games by imitating equilibrium strategies derived from payoff predictions, demonstrating superior scalability compared to exact equilibrium formulations.

Abstract

Many real-world competitive systems require multiple decision-makers to act simultaneously under shared constraints, limited information, and repeated interaction, as in auctions, resource allocation, and security competition. We study multi-turn simultaneous bidding as a controlled testbed for such problems and propose DNQ, a solver-in-the-loop equilibrium supervision framework for training bidding agents. DNQ alternates between trajectory collection, critic-based payoff estimation, equilibrium computation, and policy imitation. At each visited state, a shared critic predicts either pairwise payoff matrices or an exact N-player payoff tensor, an external solver computes equilibrium strategies, and the agents are trained by minimizing the KL divergence between their masked policies and the solver-derived equilibrium targets. We focus on a scalable pairwise formulation that greatly reduces equilibrium-solving cost and training time compared with the exact formulation, while the shared critic amortizes payoff learning across agents and states. Experiments compare the pairwise and exact variants using critic loss, policy entropy, bidding resource usage, and training cost, showing that the pairwise method scales to larger numbers of agents, whereas the exact method becomes computationally impractical as the joint game grows. These results illustrate the trade-off between strategic fidelity and scalability in repeated competitive environments.
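
The policy-imitation step described above can be sketched as a masked softmax followed by a KL loss against the solver's equilibrium strategy. This is an illustrative sketch only: the abstract does not give the direction of the KL divergence, so KL(target || policy) is an assumption here, and the function names, logits, mask, and target are hypothetical.

```python
import math


def masked_softmax(logits, legal):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    # Raises ValueError if no action is legal (max of an empty sequence).
    peak = max(l for l, ok in zip(logits, legal) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, legal)]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, policy, eps=1e-12):
    """KL(target || policy); actions with zero target mass contribute nothing."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, policy)
        if t > 0.0
    )


logits = [2.0, 0.5, -1.0, 3.0]      # hypothetical agent logits over four bids
legal = [True, True, True, False]   # the last bid exceeds the remaining budget
target = [0.5, 0.5, 0.0, 0.0]       # hypothetical solver-derived equilibrium

policy = masked_softmax(logits, legal)
loss = kl_divergence(target, policy)
print(policy, loss)
```

Masking before normalization keeps illegal bids at exactly zero probability, so the loss only compares the two distributions over actions the agent can actually take.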

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on multi-agent reinforcement learning and game theory (DNQ for n-player games), lacking direct connections to multimodal large models, tokenizers, or visual encoders. 'Unify Models' and 'World Models' receive low scores due to weak alignment with unified architecture or generative world modeling concepts. 'model-based RL' receives a moderate score as the paper utilizes a payoff prediction model to guide policy imitation, though it is primarily equilibrium-focused. No expert authors from the specified list are present in the authorship.

Keywords

DNQ, Partially Observable n-Player Games, Equilibrium Computation, Policy Imitation, Multi-turn Simultaneous Bidding, Scalable Pairwise Formulation, Critic-based Payoff Estimation

121. Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios [FAIL]

Score: 10.5 / 27.8

Authors: Giuseppe Attanasio, Beatrice Savoldi, Daniel Chechelnitsky, Matteo Negri, Marine Carpuat, Maarten Sap, André F. T. Martins

Published: 2026-06-04

TL;DR: This paper proposes Ouvia, a user-centered framework for measuring speech translation usability in real-world communication, finding that only half of interactions are usable and QA-based metrics predict usability better than standard scores.

Abstract

Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Speech Translation evaluation and user-centered usability in real-world scenarios, which has low alignment with the provided keywords centered on Multimodal Foundation Models, World Models, and Reinforcement Learning. While Speech Translation involves audio-text modalities (MultiModal), it does not address Visual Encoders, Tokenizers, World Models, or Model-Based RL architecture. The comparison of systems loosely relates to Unify Models/MLLM but is not the core contribution. Weighted total score is 10.5, below the dynamic passing score of 27.8.

Keywords

Speech Translation, User-centered Evaluation, Usability Measurement, Real-world Communication, QA-based Evaluation, One-to-one Interaction, Healthcare Scenarios

122. ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL [FAIL]

Score: 10.5 / 27.8

Authors: Xiaobing Chen, Ai Jian, Eryu Guo, Zhiqi Pang

Published: 2026-06-04

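TL;DR: ACE-SQL proposes a reinforcement learning framework for Text-to-SQL generation that jointly optimizes schema retrieval and SQL generation through empirical credit assignment, achieving 65.3% execution accuracy on BIRD Dev.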

Abstract

Text-to-SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full-schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL (ACE-SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL constructs an online column-set pool from generator rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever's evolving schema selections under execution feedback. With approximately 3k synthetic Text-to-SQL question-database pairs for RL training, ACE-SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at https://github.com/xbchen1/ACE-SQL.
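
The retrieval-target rule described above (take the column set most frequently associated with execution-correct rollouts) can be sketched with a counter over column sets. This is a simplified illustration, not the code from the ACE-SQL repository: the function name, the rollout format, the column names, and the tie-breaking behavior are all assumptions.

```python
from collections import Counter


def retrieval_target(rollouts):
    """Pick the column set seen most often among execution-correct rollouts.

    `rollouts` is a list of (columns_used, execution_correct) pairs.
    Returns None when no rollout executed correctly.
    """
    counts = Counter(frozenset(cols) for cols, correct in rollouts if correct)
    if not counts:
        return None
    # Ties resolve to the column set that was counted first.
    best, _ = counts.most_common(1)[0]
    return best


# Hypothetical rollouts for one question; column names are made up.
rollouts = [
    ({"orders.id", "orders.total"}, True),
    ({"orders.id", "orders.total", "users.name"}, False),
    ({"orders.total", "orders.id"}, True),
    ({"orders.id"}, True),
]
print(retrieval_target(rollouts))
```

Using `frozenset` makes the count independent of the order in which a rollout happened to reference its columns.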

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

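Scoring rationale: The paper focuses on the Text-to-SQL task and uses a reinforcement learning framework to optimize schema retrieval and SQL generation. The provided keywords center on multimodal large models (MLLM, MultiModal, Visual Encoder) and World Models. The paper involves no visual encoding, multimodal fusion, or world-model construction, so those keywords score 0. Although it uses reinforcement learning (RL), this is policy optimization from execution feedback rather than typical model-based RL, and the paper makes no core contribution to model unification (Unify Models) or tokenizers (Tokenizer). Overall relevance is therefore low: the weighted total of 10.5 is far below the dynamic passing score of 27.8. No designated experts appear in the author list, so no bonus points apply.

Keywords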

Text-to-SQL, Reinforcement Learning, Schema Linking, Co-optimization, Execution Feedback, Credit Assignment, SQL Generation

123. RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling [FAIL]

Score: 10.5 / 27.8

Authors: Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu, Yueqi Duan

Published: 2026-06-04

TL;DR: RhymeFlow accelerates video generation diffusion models by asynchronously scheduling denoising steps for keyframes versus non-keyframes without retraining, maintaining temporal coherence through latent trajectory projection.

Abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding content and motion, once keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on inference acceleration for video generation diffusion models using asynchronous scheduling, which does not align with the core themes of MLLM architecture, tokenization, or reinforcement learning implied by the keywords. 'MultiModal' and 'Unify Models' receive slight relevance due to the video domain and unified scheduling strategy, while others are largely unrelated.

Keywords

Video Generation, Diffusion Transformers, Inference Acceleration, Asynchronous Denoising, Keyframe Selection, Training-Free, Latent Trajectory Projection

124. MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models [FAIL]

Score: 10.5 / 27.8

Authors: Tariq M. Khan, Syed Saud Naqvi, Thantrira Porntaveetus, Hamid Alinejad-Rokny, Shahzaib Iqbal, Imran Razzak, Mohammad AU Khan

Published: 2026-06-04

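TL;DR: This paper proposes MS-DKC, a dataset knowledge card framework for medical image segmentation that guides model design and evaluation by explicitly recording dataset characteristics, rather than pursuing architecture optimization alone.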

Abstract

Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework for making these factors explicit. MS-DKC records dataset evidence through image/acquisition, morphology, supervision, context-dependence, and deployment-risk descriptors. These descriptors are mapped to failure modes, design priors, and risk-aligned criteria, making segmentation design more traceable than architecture-first comparison. We evaluate MS-DKC on DRIVE, ISIC2018, and ACDC, representing distinct regimes. DRIVE contains sparse, thin, branching vessels, favoring detail-preserving models, sensitivity-aware optimization, threshold analysis, and topology-aware metrics. DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. ISIC2018 involves compact but appearance-variable lesions; validation-constrained score-function selection on Att-Next-Topo/ATTNext produced MS-DKC-AttNextTopo-VCSF-NoAug with Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13, while plausible additions failed to improve the risk-aligned profile. ACDC provides a multi-class cardiac case, where MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation. Overall, the results support dataset-conditioned design: different datasets require different priors, operating points, and evidence before a model can be judged appropriate.
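
For readers less familiar with the Dice and IoU figures quoted above, here is a minimal sketch of how the two overlap metrics are computed for a single binary mask. The function name and the toy masks are hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this sketch; the paper's own evaluation code may aggregate differently.

```python
def dice_and_iou(pred, truth):
    """Dice and IoU for two flat binary masks (sequences of 0/1)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # both foreground
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # predicted only
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # missed foreground
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou


pred = [1, 1, 0, 0, 1, 0]   # hypothetical predicted mask
truth = [1, 0, 0, 1, 1, 0]  # hypothetical ground-truth mask
dice, iou = dice_and_iou(pred, truth)
print(dice, iou)
```

For a single mask the two metrics are tied by Dice = 2·IoU / (1 + IoU), which is why Dice is never lower than IoU; the identity need not hold exactly for scores averaged over a dataset.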

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	3.0/10	4.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

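Scoring rationale: The paper's core contribution is a dataset knowledge card framework for medical image segmentation (MS-DKC). It unifies dataset characterization with model design criteria rather than unifying model architectures (Unify Models: 2). It does not involve text processing or tokenizers (Tokenizer: 0). Segmentation models do contain encoders, but the visual encoder is not a core contribution (Visual Encoder: 3). The work is supervised image segmentation and is unrelated to World Models, multimodal large language models (MLLM), and reinforcement learning (model-based RL), all of which score 0. Medical images are visual data, but the paper does not emphasize multimodal fusion (MultiModal: 2). Overall, the topic matches the given keyword set (which leans toward LLM/RL/World Models) poorly, and the weighted total of 10.5 is below the dynamic passing score of 27.8.

Keywords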

Medical Image Segmentation, Dataset Knowledge Card, Design Framework, Dataset Requirements, Failure Modes, Model Adaptation, Evaluation Criteria

125. Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars [FAIL]

Score: 10.5 / 27.8

Authors: Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong, Gregory Slabaugh, Shanxin Yuan

Published: 2026-06-04

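TL;DR: This paper proposes a self-supervised Gaussian expression deformation framework that jointly optimizes 2D Gaussian surfels and a signed distance field, substantially reducing the data needed to create high-fidelity animatable avatars.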

Abstract

Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalability and accessibility. In this work, we introduce Self-Adaptive Gaussian Expression (SAGE), a framework for self-learning expression-induced Gaussian deformations that enables high-fidelity, animatable avatars from minimal input data. Our method jointly optimizes 2D Gaussian surfels and a Signed Distance Field (SDF) to enforce compact, surface-aligned Gaussian distributions, while a self-supervised expression learning phase replaces long training sequences with geometric and appearance consistency constraints. This design allows flexible deployment across multiple reconstruction regimes: in the multiview setting, only a single frame (timestep) is required instead of thousands; in the monocular setting, only head rotations are needed without expression sequences; and in the one-shot setting, no pretraining or priors are necessary. Experiments demonstrate that our approach achieves reconstruction and animation quality comparable to state-of-the-art methods, while reducing data requirements by several orders of magnitude. Our results highlight the potential of self-supervised Gaussian deformation learning as a step toward accessible, data-efficient avatar creation.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

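Scoring rationale: The paper belongs to computer graphics and does not match the given MLLM/RL keyword set. MLLM, Tokenizer, and model-based RL are entirely unrelated (0); Unify Models, Visual Encoder, and MultiModal are weakly related (2); World Models has low relevance (1). The weighted total of 10.5 is below the passing score of 27.8. No designated experts appear in the author list, so no bonus points apply.

Keywords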

Gaussian Avatars, Self-Learning, Expression Deformations, Data-Efficient, Signed Distance Field, 3D Gaussian Representations, Self-Supervised Learning

126. Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability [FAIL]

Score: 9.0 / 27.8

Authors: Seyed Arshan Dalili, Mehrdad Mahdavi

Published: 2026-06-04

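TL;DR: This paper proposes Subspace-Aware Sparse Autoencoders (SASA), which introduce learned decoder subspaces to address feature splitting in large language models, improving monosemanticity and reducing training token cost.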

Abstract

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geometrically, reconstructing a feature of intrinsic dimension $d_i \ge 2$ to error $\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.
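
The block-sparsity idea above (Top-$s$ group gating) can be illustrated with a small sketch: latents are partitioned into groups of size $r$, and only the $s$ strongest groups stay active. Ranking groups by their L2 norm is an assumption made for this sketch, since the abstract does not specify the gating score, and the function name and example values are hypothetical.

```python
import math


def top_s_group_gate(latents, group_size, s):
    """Keep the s groups with the largest L2 norm; zero out all other groups."""
    if len(latents) % group_size != 0:
        raise ValueError("latent width must be a multiple of group_size")
    groups = [
        latents[i:i + group_size]
        for i in range(0, len(latents), group_size)
    ]
    norms = [math.sqrt(sum(x * x for x in g)) for g in groups]
    ranked = sorted(range(len(groups)), key=lambda j: norms[j], reverse=True)
    keep = set(ranked[:s])
    gated = []
    for j, g in enumerate(groups):
        gated.extend(g if j in keep else [0.0] * group_size)
    return gated


latents = [0.1, -0.2, 3.0, 1.0, 0.0, 0.5, -2.0, 2.0]  # four groups of size 2
print(top_s_group_gate(latents, group_size=2, s=2))
```

Gating whole groups, rather than individual latents, is what lets one active block span a multi-dimensional feature instead of splitting it across several near-collinear single directions.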

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

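Scoring rationale: The paper studies mechanistic interpretability of large language models (LLMs) and proposes Subspace-Aware Sparse Autoencoders (SASA) to address feature splitting. The provided keyword set focuses on multimodality, world models, and reinforcement learning, a clear domain gap from this paper's text-only LLM interpretability topic. It is therefore entirely unrelated to Visual Encoder, World Models, MultiModal, and model-based RL (0). It is weakly related to MLLM and Tokenizer (through the LLM context and the mention of a token budget) and weakly related to Unify Models (unified feature representation), so those scores are low.

Keywords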

Sparse Autoencoders, Mechanistic Interpretability, Subspace-Aware, Feature Splitting, LLM, Decoder Subspaces, Nuclear-norm Regularization

127. Design a Reliable LLM-Integrated Interface for Mortality Forecasting [FAIL]

Score: 9.0 / 27.8

Authors: Thi Kim Ngan Nguyen

Published: 2026-06-04

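TL;DR: This paper proposes an LLM-based interface that converts natural-language requests into structured configurations for mortality forecasting, improving usability while preserving statistical validity.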

Abstract

Mortality forecasting plays an important role in actuarial and policy decision-making, but its implementation remains technically complex and inaccessible to non-expert users. This project proposes a reliable large language model (LLM)-integrated interface that improves usability while maintaining statistical power. The LLM is designed as a constrained orchestration layer that translates natural-language inputs into structured configurations for a deterministic forecasting pipeline. A three-phase methodology is employed to ensure accuracy, usability, and transparency. First, a baseline pipeline is implemented using the CoMoMo package, reproducing established mortality forecasting results. Second, the pipeline is extended to generate multi-step forecasts using rolling-origin evaluation and mean squared error (MSE). Third, a prototype interface uses a local LLM to handle users' forecasting requests in plain language. The system demonstrates that LLMs can enhance accessibility without compromising reproducibility, transparency, or actuarial validity in high-stakes analytical workflows.
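
Rolling-origin evaluation, mentioned in the second phase above, refits on a growing training window and scores each multi-step forecast against the held-out values that follow. The sketch below shows the general technique with a naive last-value forecaster as a stand-in; it does not use the CoMoMo package, and the function names, series values, and window sizes are hypothetical.

```python
def rolling_origin_mse(series, horizon, min_train, forecast):
    """Mean squared error over all forecast origins and all horizon steps."""
    squared_errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]                      # data known at the origin
        actual = series[origin:origin + horizon]     # held-out future values
        predicted = forecast(train, horizon)
        squared_errors.extend((a - p) ** 2 for a, p in zip(actual, predicted))
    if not squared_errors:
        raise ValueError("series too short for the requested horizon")
    return sum(squared_errors) / len(squared_errors)


def last_value_forecast(train, horizon):
    """Placeholder model: repeat the most recent observation."""
    return [train[-1]] * horizon


rates = [10.0, 9.0, 8.0, 7.0, 6.0]  # hypothetical yearly mortality index
mse = rolling_origin_mse(rates, horizon=2, min_train=2, forecast=last_value_forecast)
print(mse)
```

Because every forecast uses only data before its origin, the resulting MSE reflects out-of-sample accuracy rather than in-sample fit.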

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

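Scoring rationale: The paper focuses on an LLM interface for mortality forecasting, which is largely unrelated to the multimodal and reinforcement learning directions of the keyword set. Visual Encoder, World Models, and model-based RL are entirely unrelated (0); Unify Models and MLLM are slightly relevant only because the work involves an LLM (2); Tokenizer and MultiModal have very low relevance (1). The weighted total is 9.0, below the dynamic passing score of 27.8. No designated experts appear in the author list, so no bonus points apply.

Keywords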

Mortality Forecasting, LLM Interface, Natural Language, Forecasting Pipeline, CoMoMo, Usability, Transparency

128. Towards the Readability of LLM-Generated Codes through Multitask Representation Engineering [FAIL]

Score: 9.0 / 27.8

Authors: Huifan Gao, Liuhua He, Yinghui Pan, Shenbao Yu, Yifeng Zeng, Shengchao Qin, Weidi Sun

Published: 2026-06-04

TL;DR: This paper proposes a multitask Representation Engineering framework to enhance the readability of LLM-generated codes while balancing the trade-off with functional correctness.

Abstract

Correctness and readability are key measures of code quality, respectively ensuring functional fidelity and ease of comprehension. While most existing research focuses on improving the correctness of code generated by large language models (LLMs), readability remains under-addressed. Enhancing readability through targeted control is challenging due to its subjective nature. In this article, we employ representation engineering (RepE) as the targeted control method, given its low data dependency and low computational cost. Prior work on RepE has primarily focused on targeted control for a single task, but improving code readability requires control across multiple tasks. Accordingly, we propose a multitask RepE framework and theoretically discuss the impact of the multitask steering method on the trade-off between code readability and correctness. We further provide comprehensive experiments in support. All the relevant implementations are open-source and available upon request.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on improving LLM-generated code readability using Representation Engineering (RepE), which has low relevance to the provided keywords centered on Vision, World Models, and Model-Based RL. 'Visual Encoder', 'World Models', and 'model-based RL' are completely unrelated (score 0). 'Unify Models' and 'MLLM' have slight relevance due to the use of LLMs and multitask objectives but are not core focuses (score 2). 'Tokenizer' and 'MultiModal' are minimally relevant as the work is text-based and does not focus on tokenizer design (score 1). No listed experts (Yang Shi, Xuanyu Zhu, etc.) are found in the author list.
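
The score table above appears to follow a simple weighted sum: each row's score is its weight times its relevance (on a 0-10 scale), and the row scores add up to the total shown on the "Score" line. A minimal Python sketch using this entry's values; the function name `weighted_total` and the "total >= pass score" rule are assumptions for illustration, not the report generator's actual code:

```python
# Rows copied from the score table for paper 128: (keyword, weight, relevance 0-10).
ROWS = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 2.0),
    ("MultiModal", 1.5, 1.0),
    ("model-based RL", 1.5, 0.0),
]
PASS_SCORE = 27.8  # dynamic pass score shown on the "Score" line


def weighted_total(rows):
    """Sum weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(ROWS)
# Assumed rule: a paper passes when its total reaches the pass score.
verdict = "PASS" if total >= PASS_SCORE else "FAIL"
print(f"{total:.1f} / {PASS_SCORE} -> {verdict}")  # prints "9.0 / 27.8 -> FAIL"
```

Any expert-author bonus mentioned in the rationales would be added on top of this sum; it is left out here because the report does not state its size.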

Keywords

LLM-Generated Codes, Readability, Representation Engineering, Multitask Framework, Correctness Trade-off, Code Quality, Steering Method

129. The Self-Correction Illusion: LLMs Correct Others but Not Themselves [FAIL]

Score: 9.0 / 27.8

Authors: Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

Published: 2026-06-04

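TL;DR: This paper shows that LLMs' failure to self-correct reasoning errors stems from a chat-template role-label artifact rather than a capability deficit, and demonstrates that changing only the prompt structure substantially raises correction rates.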

Abstract (Chinese translation)

近期研究表明，大语言模型（LLM）代理难以纠正自身推理轨迹中的错误，但当相同的声明出现在外部来源下时，其纠正率显著更高。我们探究这种不对称性是否反映了能力缺陷还是角色标签伪影：代理纠正错误声明的意愿是否因果地取决于承载它的聊天模板角色，而非声明的内容？我们的设置保持错误声明在所有条件下字节级完全相同（经 SHA-256 验证），仅改变其封装角色：代理自身的 \role{<thought>}、\role{user} 消息、\role{tool} 响应，或 \role{system <memory>} 块。在涵盖七个模型家族和三个领域的 13 个模型 - 领域单元中（每个单元 $n{=}30$ 对配对任务），将声明从 \role{<thought>} 重新标记为外部角色，使显式纠正率提高了 23 至 93 个百分点，其中 13 个单元中有 10 个达到 $p{<}0.001$。进一步实验证实，该效应是不对称的、机制上可分解的，且在各个领域间具有稳健性。无法自我纠正并非认知缺陷，而是聊天模板伪影。我们利用这一伪影，设计了一种仅基于提示结构的干预方案，无需训练也无需模型修改，其最有效的角色标签具有领域依赖性：在数学领域，\role{<memory>} 占主导，而在逻辑推理领域，普通的 \role{user} 消息占主导。

Abstract

Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own <thought>, a user message, a tool response, or a system <memory> block. Across 13 model-domain cells covering seven model families and three domains ($n = 30$ paired tasks per cell), relabeling the claim from <thought> to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching $p < 0.001$. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: <memory> dominates on math, while a plain user message dominates on logical deduction.
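
The "byte-identical, SHA-256 verified" control described in the abstract can be illustrated with a small Python sketch. The message layout, the role labels in `ROLES`, and the example claim are all hypothetical and are not taken from the paper; only `hashlib.sha256` is a real standard-library call:

```python
import hashlib

# Hypothetical wrong claim used in every condition (the true product is 391).
CLAIM = "17 * 23 = 381"

# Assumed role labels standing in for the paper's four wrapping roles.
ROLES = ["assistant_thought", "user", "tool", "system_memory"]


def wrap(role, claim):
    """Build a hypothetical chat message that carries the claim under one role."""
    return {"role": role, "content": claim}


def claim_digest(message):
    """SHA-256 hex digest of the claim bytes, ignoring the wrapping role."""
    return hashlib.sha256(message["content"].encode("utf-8")).hexdigest()


conditions = [wrap(role, CLAIM) for role in ROLES]
digests = {m["role"]: claim_digest(m) for m in conditions}

# The claim counts as byte-identical when every condition hashes to the same value.
byte_identical = len(set(digests.values())) == 1
print(byte_identical)  # prints "True"
```

Hashing only the claim payload, and not the surrounding template, is what lets the role label be the single varying factor between conditions.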

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

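Scoring rationale: The paper studies how large language models (LLMs) self-correct during reasoning, finds that the failure stems from a chat-template role-label artifact rather than a cognitive deficit, and proposes a prompt-structure intervention. The keyword set centers on multimodality, world models, and reinforcement learning, which differs substantially from this paper's text-only LLM reasoning topic. Accordingly, 'Visual Encoder', 'World Models', 'MultiModal', and 'model-based RL' score 0; 'MLLM' and 'Unify Models' score 2 because the work involves a model evaluation setup; 'Tokenizer' scores 2 because of the byte-level verification. The weighted total is well below the dynamic pass score, indicating very low relevance between the keywords and the paper.

Keywords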

LLM Reasoning, Self-Correction, Role-Label Artifact, Prompt Engineering, Chat Template, Error Correction, Model Agnostic

130. Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries [FAIL]

Score: 9.0 / 27.8

Authors: Martin Murin

Published: 2026-06-04

TL;DR: This paper analyzes the sensitivity of LLM-based structured extraction from clinical discharge summaries to prompt, model, and schema choices, finding that schema collapse reduces cross-prompt disagreement more than model size variations.

Abstract (Chinese translation)

大语言模型（Large Language Models, LLM）正越来越多地用于从临床自由文本笔记中提取结构化信息，然而，其输出对上游配置选择的敏感性，相较于其在固定基准上的准确性，尚不为人熟知。本研究在不依赖人工标注真值的情况下测量这种敏感性，通过固定提取任务并逐一改变配置选项来实现。固定的模式包含 17 个临床文档标志，其取值集为三元（yes/no/not_documented），以及一个包含 47 个标签的词汇表，用于表示主要入院原因。在 MIMIC-IV v3.1 出院摘要数据上，三种表达该模式的提示变体分别在两种模型规模下进行了运行。跨提示一致性通过在 ICD 分层子集上计算 Cohen's kappa 来衡量。通过配对相同笔记的比较隔离了模型选择的影响，而事后将三元标志折叠为二元变量，则测试了该模式对分歧的贡献。在三元标志上，两种模型达到了相同的聚合跨提示一致性（中位数 kappa 分别为 0.69 和 0.68）；较大模型在某些字段上提高了交叉一致性，而在其他字段上降低了它，这是一种重新分配，而非效应的缺失。将该模式折叠为二元变量后，大部分跨提示分歧消失，分歧主要源于 absence-versus-silence（缺失与未记录）的区别，而非发现本身是否存在。在多类入院分类中，更换模型会导致近一半笔记的主导标签被重新分配，而更改提示措辞仅导致约八分之一笔记的主导标签被重新分配；此外，较大模型在剩余的 catch-all categories（兜底类别）上分配的比例显著降低（从 44% 降至 26%）。这些模式表明，存在一种由模式施加的分歧来源，主要集中在 absence-versus-silence 轴上；而在多类分类中，模型的影响主导于提示措辞的影响。这些发现是通过一种可重用的方法识别出来的，该方法用于审计大规模部署中提取结果的可重复性。

Abstract

Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema's contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.
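
The agreement measurement in the abstract can be sketched in Python: Cohen's kappa between the labels two prompt variants assign to the same notes, computed first on the three-way value set and then after collapsing to binary. The toy labels and the collapse rule (merging `no` and `not_documented` into one "not yes" class) are assumptions for illustration; the paper's exact collapse and data are not given here:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


def collapse_to_binary(labels):
    """Assumed collapse: 'yes' stays, 'no' and 'not_documented' merge."""
    return ["yes" if label == "yes" else "not_yes" for label in labels]


# Hypothetical outputs of two prompt variants on the same six notes.
prompt_a = ["yes", "no", "not_documented", "no", "yes", "not_documented"]
prompt_b = ["yes", "not_documented", "no", "no", "yes", "not_documented"]

kappa_three_way = cohen_kappa(prompt_a, prompt_b)
kappa_binary = cohen_kappa(collapse_to_binary(prompt_a), collapse_to_binary(prompt_b))
print(round(kappa_three_way, 3), round(kappa_binary, 3))  # prints "0.5 1.0"
```

In this toy case every disagreement is a `no` versus `not_documented` swap, so the binary collapse removes all of it, which mirrors the absence-versus-silence effect the abstract reports.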

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	3.0/10	4.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on clinical NLP and LLM-based structured extraction from text, lacking multimodal, visual, or reinforcement learning components. Keywords like Visual Encoder, MultiModal, World Models, and model-based RL are irrelevant (0). MLLM is partially relevant (3) as it uses LLMs. Unify Models and Tokenizer are tangential (1-2). No expert authors from the specified list are present.

Keywords

LLM-based structured extraction, Clinical discharge summaries, Prompt sensitivity, Model size comparison, Schema choices, Reproducibility auditing, MIMIC-IV dataset, Cohen's kappa

131. Function-Space Priors for Bayesian Neural ODEs with Application to Vessel Trajectory Prediction [FAIL]

Score: 9.0 / 27.8

Authors: Jaeyeong Lee, Wonmo Koo, Heeyoung Kim

Published: 2026-06-04

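TL;DR: This paper proposes a Bayesian Neural ODE method with function-space priors for vessel trajectory prediction and uncertainty quantification under irregularly sampled AIS data.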

Abstract (Chinese translation)

基于自动识别系统 (AIS) 数据的船舶轨迹预测对于海上态势感知至关重要，但由于采样不规则、报告缺失以及复杂的动力学特性，该任务仍具挑战性。除了准确的点预测外，海事应用还要求提供校准良好的不确定性估计，以支持可靠的决策制定。贝叶斯神经常微分方程 (Bayesian Neural Ordinary Differential Equations, ODEs) 提供了一种严谨的框架，通过在神经向量场参数上设定先验，实现连续时间轨迹建模与不确定性量化。然而，常用的各向同性高斯权重先验未能编码船舶动力学中具有信息量的结构特性，例如平滑性和局部性。现有的函数空间贝叶斯神经网络方法解决了静态映射中的这一局限性，但不能直接应用于神经 ODEs，因为在后者中，主要关注的量是轨迹本身，而非向量场。原则上，可以直接在 ODE 解上施加高斯过程 (Gaussian Process, GP) 先验，但这需要通过非线性 ODE 求解器传播分布，这在解析上是难以处理的。为应对这一挑战，我们采用了一种实用方法，直接在有限个测量点上评估的向量场上施加基于高斯过程核的先验。具体而言，我们在标准的权重空间变分目标基础上，增加了一个基于核的正则化项，该正则化项对向量场偏离高斯过程先验所暗示的结构进行惩罚。为处理长序列且不规则的 AIS 轨迹，我们进一步将此函数空间正则化与概率多重射击法相结合，该方法在保持全局一致性的同时，解耦了不同时间片段之间的推断。

Abstract

Vessel trajectory prediction from Automatic Identification System (AIS) data is essential for maritime situational awareness, yet it remains challenging due to irregular sampling, missing reports, and complex dynamics. Beyond accurate point forecasts, maritime applications also demand well-calibrated uncertainty estimates for reliable decision-making. Bayesian Neural Ordinary Differential Equations (ODEs) offer a principled framework for continuous-time trajectory modeling with uncertainty quantification by placing a prior over the neural vector field parameters. However, the commonly used isotropic Gaussian weight prior fails to encode informative structural properties of vessel dynamics, such as smoothness and locality. Existing function-space Bayesian neural network methods address this limitation for static mappings, but do not transfer directly to Neural ODEs, where the primary quantity of interest is the trajectory rather than the vector field itself. In principle, one could place a Gaussian process (GP) prior directly over ODE solutions, but this requires propagating distributions through a nonlinear ODE solver, which is analytically intractable. To address this challenge, we adopt a practical approach that imposes a GP-kernel-based prior directly on the vector field evaluated at a finite set of measurement points. Specifically, we augment the standard weight-space variational objective with a kernel-based regularizer that penalizes deviations of the vector field from the structure implied by a GP prior. To handle long and irregular AIS trajectories, we further combine this function-space regularization with probabilistic multiple shooting, which decouples inference across temporal segments while maintaining global consistency.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

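Scoring rationale: The paper focuses on Bayesian Neural ODEs and vessel trajectory prediction, which is entirely unrelated to keywords such as multimodal large language models (MLLM), visual encoding, and Tokenizer (0). Because it involves dynamics modeling, it is weakly related to model-based RL and World Models (2-3), but it does not involve reinforcement learning or world-model architectures. No specified expert authors were found. The weighted total is approximately 9.0, below the dynamic pass score of 27.8.

Keywords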

Bayesian Neural ODEs, Function-space Priors, Vessel Trajectory Prediction, Uncertainty Quantification, Gaussian Process Prior, Continuous-time Modeling, Probabilistic Multiple Shooting

132. Performance Evaluation of GraphCast for Medium-Range Weather Forecasting over Brazil [FAIL]

Score: 9.0 / 27.8

Authors: Wolfgang R. Rowell, Lucas S. Kupssinskü

Published: 2026-06-04

TL;DR: This study evaluates GraphCast's performance against ECMWF IFS for medium-range weather forecasting in Brazil, finding regime-dependent skill differences where GraphCast captures large-scale moisture in summer but struggles with baroclinic systems in winter.

Abstract (Chinese translation)

全球天气预报的范式正随着机器学习气象预测模型（MLWP）的兴起而迅速转变。尽管这些数据驱动架构展现出显著的全球预测技能，但全球南方（Global South）的区域基准仍相对稀缺，导致其在复杂、高度对流环境中的有效性在很大程度上尚未得到验证。本研究在四个不同的巴西气候亚区域上，评估了 GraphCast 业务版本相对于确定性欧洲中期天气预报中心（ECMWF）综合预报系统高分辨率（IFS HRES）基线的性能。利用可扩展的云原生管道及 WeatherBench-X 框架对天气模型进行基准测试，我们在四个选定季节窗口内评估了选定的对流层变量（$T_{850}$、$Q_{850}$、$Z_{500}$），采用业务 IFS 分析作为真值，以计算两个模型的统计指标。结果表明其预测技能具有依赖气候体制的特征。在南半球冬季，当解析南巴西地区快速传播的斜压系统时，GraphCast 在中期（提前 2-7 天）对 $Z_{500}$ 的预测表现欠佳，但在延长预报范围内重新获得优势；此时，其对混沌小尺度变异的内在平滑作用在确定性技能指标下变得有益。相反，在南半球夏季雨季，GraphCast 能准确捕捉大尺度水汽输送，同时内在地抑制了高频对流变异性，而这种变异性会削弱确定性数值天气预报（NWP）的温度预报。这些发现为巴西建立了基准，并界定了具体的物理边界，这将指导未来的“热带化”（tropicalization）努力，旨在优化这些基础 AI 模型以提升区域韧性。

Abstract

The paradigm of global weather forecasting is rapidly shifting with the emergence of Machine Learning Weather Prediction (MLWP) models. While these data-driven architectures demonstrate remarkable global skill, regional benchmarks in the Global South remain scarce, leaving their efficacy in complex, highly convective environments largely unverified. This study evaluates the performance of the operational GraphCast against the deterministic ECMWF IFS HRES baseline across four distinct Brazilian climatic sub-regions. Utilizing a scalable, cloud-native pipeline and the WeatherBench-X framework for benchmarking weather models, we assess selected tropospheric variables ($T_{850}$, $Q_{850}$, $Z_{500}$) over four selected seasonal windows, employing the operational IFS analysis as the ground truth to calculate the statistical metrics for both models. Results reveal a regime-dependent skill profile. During the austral winter, GraphCast underperforms in the medium range (lead days 2-7) for $Z_{500}$ when resolving fast-propagating baroclinic systems over southern Brazil, but regains an advantage in the extended range, where its inherent smoothing of chaotic small-scale variability becomes beneficial under deterministic skill metrics. Conversely, during the austral summer wet season, GraphCast accurately captures large-scale moisture transport while intrinsically dampening the high-frequency convective variability that degrades deterministic NWP temperature forecasts. These findings establish a baseline for Brazil and define the specific physical boundaries that will guide future "tropicalization" efforts, aiming to optimize these foundational AI models for regional resilience.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on GraphCast evaluation for weather forecasting in Brazil, which is a supervised MLWP task. It has low relevance to MLLM/RL keywords: Tokenizer, Visual Encoder, MLLM, and model-based RL are irrelevant as GraphCast uses graph neural networks rather than language models or RL. World Models has slight relevance (environmental prediction), and Unify Models is weak (comparison vs integration).

Keywords

GraphCast, Weather Forecasting, Brazil, Machine Learning Weather Prediction, ECMWF IFS, Tropospheric Variables, Regional Benchmarking

133. Attack Detection using Time Series Foundation Models [FAIL]

Score: 9.0 / 27.8

Authors: Sribalaji C. Anand, Anh Tung Nguyen, George J. Pappas

Published: 2026-06-04

TL;DR: This paper proposes a model-structure-free attack detector for cyber-physical systems using a time-series foundation model (TimesFM) that achieves comparable or superior performance without needing plant model knowledge.

Abstract (Chinese translation)

本文解决了在未知被控对象模型及其结构的情况下，网络物理系统中的攻击检测问题。一个远程部署的被控对象通过网络向操作员传输传感器测量值，假设该网络正遭受攻击。我们考虑两类攻击：无模型的重放攻击和基于模型的隐蔽攻击。对于后者，我们针对 χ² 检测器，在线性和非线性系统中导出了最优隐蔽攻击策略的闭式解。随后，我们提出了一种基于 TimesFM（由 Google Research 开发的时间序列基础模型）的无需模型结构的检测器，该检测器作为代理残差生成器以零样本模式运行。经验表明，基于 TimesFM 的检测器实现了相当或更优的攻击检测性能。所提方法的有效性在 IEEE 14 母线电力系统中通过数值实验得到验证。我们还表明，TimesFM 的预测可作为受损测量值的替代方案，这是在经典冗余假设失效时的一种实用缓解技术。

Abstract

This paper addresses the problem of attack detection in cyber-physical systems without any knowledge of the plant model or its structure. A remotely located plant transmits sensor measurements to an operator over a network that is assumed to be under attack. We consider two classes of attacks: model-free replay attacks and model-based stealthy attacks. For the latter, we derive closed-form expressions for the optimal stealthy attack policy against a $χ^2$ detector, for both linear and nonlinear systems. We then propose a model-structure-free detector based on TimesFM, a time-series foundation model developed by Google Research, which serves as a surrogate residual generator operating in a zero-shot fashion. We show empirically that the TimesFM-based detector achieves a comparable or superior attack detection performance. The efficacy of the proposed approach is demonstrated numerically on the IEEE 14-bus power system. We also demonstrate that TimesFM predictions can serve as a substitute for corrupted measurements, a practical mitigation technique when classical redundancy assumptions fail.
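
The residual-based detection idea in the abstract can be sketched as a classical chi-square test on the gap between a received measurement and a forecast. This is a minimal Python sketch under stated assumptions: the residual covariance is diagonal, the forecast values are hard-coded stand-ins for whatever a forecaster such as TimesFM would return (no TimesFM API is called), and all numbers are hypothetical:

```python
# 95% quantile of the chi-square distribution with 2 degrees of freedom.
CHI2_THRESHOLD_2DOF = 5.991


def chi2_statistic(measurement, forecast, variances):
    """Residual r = measurement - forecast; returns sum_i r_i**2 / variance_i."""
    return sum((m - f) ** 2 / v for m, f, v in zip(measurement, forecast, variances))


def is_alarm(measurement, forecast, variances, threshold=CHI2_THRESHOLD_2DOF):
    """Raise an alarm when the normalized residual energy exceeds the threshold."""
    return chi2_statistic(measurement, forecast, variances) > threshold


forecast = [1.00, 0.50]    # stand-in for the forecaster's prediction
variances = [0.04, 0.04]   # assumed per-sensor residual variances
benign = [1.10, 0.40]      # measurement close to the forecast
tampered = [1.50, 0.80]    # measurement pushed away from the forecast

print(is_alarm(benign, forecast, variances))    # prints "False"
print(is_alarm(tampered, forecast, variances))  # prints "True"
```

A stealthy attack of the kind the abstract analyzes would be designed to keep this statistic under the threshold, which is why the choice of residual generator matters.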

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

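Scoring rationale: The paper's core is the application of a time-series foundation model to attack detection in cyber-physical systems (CPS), which has low relevance to the multimodal, vision, LLM, and RL areas. It earns only marginal scores for using a predictive model (akin to a world model) and a foundation model (akin to a unified model); it does not involve visual encoders, multimodal fusion, or reinforcement learning algorithms.

Keywords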

Attack Detection, Time Series Foundation Models, Cyber-Physical Systems, TimesFM, Zero-shot, Residual Generator, Stealthy Attacks

134. Equivariant Neural Belief Propagation [FAIL]

Score: 9.0 / 27.8

Authors: Zehua Cheng, Wei Dai, Jiahao Sun

Published: 2026-06-04

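TL;DR: This paper proposes Equivariant Neural Belief Propagation (ENBP) for SE(3)-symmetric probabilistic inference over spatial variables, achieving high accuracy and stability on molecular conformation and multi-body robotic inference.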

Abstract (Chinese translation)

空间嵌入变量上的概率推断需要尊重 SE(3) 对称性的信念，然而现有的等变网络仅产生标量和向量——无法生成各向异性不确定性所需的二阶精度张量，且单分量消息会将多模态能量景观坍缩为物理上无意义的平均值。我们引入等变神经信念传播（ENBP），这是一种因子图框架，其消息为等变高斯混合模型，其充分统计量在 SE(3) 下精确变换。二阶精度矩阵通过等变外积合成，经可微谱分解处理，并通过基于贪心 KL 的混合模型约简保持计算可行性，该约简被证明与 SE(3) 可交换。在 GEOM-QM9 和 GEOM-Drugs 数据集上，ENBP 在 0.090 埃误差下实现了 98.9% 的构象覆盖率，延迟为亚秒级——比扩散基线模型快 100 倍以上，且精度更高。在多体机器人推断任务中，原始循环信念传播 (BP) 在 15 个以上智能体时发散，而 ENBP 能够收敛，碰撞率接近零，且等变误差达到机器精度水平（约 10^-7，相比之下增强型基线为 10^-1）。

Abstract

Probabilistic inference over spatially embedded variables requires beliefs that respect $SE(3)$ symmetry, yet existing equivariant networks produce only scalars and vectors, not the rank-2 precision tensors needed for anisotropic uncertainty, and single-component messages collapse multi-modal energy landscapes to physically meaningless averages. We introduce Equivariant Neural Belief Propagation (ENBP), a factor-graph framework whose messages are equivariant Gaussian mixture models with sufficient statistics that transform exactly under $SE(3)$. Rank-2 precision matrices are synthesised via equivariant outer products, ingested through differentiable spectral decomposition, and kept tractable by a greedy KL-based mixture reduction that provably commutes with $SE(3)$. On GEOM-QM9 and GEOM-Drugs, ENBP achieves 98.9% conformational coverage at 0.090 Å error with sub-second latency, over $100\times$ faster than diffusion baselines at higher accuracy. On multi-body robotic inference, vanilla loopy BP diverges at 15+ agents while ENBP converges with near-zero collision rates and machine-precision equivariance error (${\sim}10^{-7}$ vs. $10^{-1}$ for augmented baselines).
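
Why outer products give a precision matrix that transforms correctly can be seen in a short derivation. This is a sketch under assumptions not stated in the abstract: the $v_i$ are rotation-equivariant vector features, the $w_i \ge 0$ are invariant scalar weights, and $\epsilon > 0$ is a small regularizer; the paper's actual construction may differ.

```latex
\begin{aligned}
v_i &\mapsto R\,v_i, \qquad R \in SO(3), \\
\Lambda = \sum_i w_i\, v_i v_i^{\top} + \epsilon I
&\;\mapsto\; \sum_i w_i\,(R v_i)(R v_i)^{\top} + \epsilon I
= R\,\Lambda\,R^{\top}.
\end{aligned}
```

The last equality uses $R I R^{\top} = I$. Since each term is positive semidefinite and $\epsilon I$ is positive definite, $\Lambda$ stays a valid precision matrix after rotation.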

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

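Scoring rationale: The paper mainly addresses equivariant probabilistic inference over spatial variables, which has low relevance to the provided MLLM-centered keyword set. It has some connection to 'Unify Models' (unifying equivariant networks with a belief propagation framework) and 'model-based RL' (the robotic inference context), but no direct connection to Tokenizer, Visual Encoder, World Models, or MLLM, and low relevance to MultiModal (it handles only geometric coordinates, not cross-modal data). No target expert authors were found.

Keywords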

Equivariant Neural Belief Propagation, SE(3) symmetry, Probabilistic inference, Gaussian mixture models, Factor-graph framework, Molecular conformation, Multi-body robotic inference

135. Adaptive state-action abstractions via rate-distortion [FAIL]

Score: 9.0 / 27.8

Authors: Fernando E. Rosas

Published: 2026-06-04

TL;DR: This paper proposes a rate-distortion based strategy to dynamically adjust state-action abstraction granularity in reinforcement learning, achieving near-optimal performance under substantial lossy compression.

Abstract (Chinese translation)

婴儿在学习走路时，似乎首先处理问题的粗略版本——保持直立，到达照顾者——只有当在该分辨率下的进一步练习不再产生回报时才会对其进行细化。强化学习 (Reinforcement Learning) 提供了多种构建复杂任务简单版本的技术，但缺乏在学习过程中动态调整这些抽象粒度的通用原则。本文提出了这样一个原则：一旦其中的学习误差变得与抽象本身引起的误差相当，就细化该抽象。在这里，我们通过一种性能证明 (performance certificate) 来形式化这一原则，该证明将价值误差分解为两项：由 Bellman 残差 (Bellman residual) 捕获的学习误差界，以及由双模拟度量 (bisimulation metric) 给出的抽象误差界。由此产生的切换策略通过基于率失真原理 (rate-distortion principles) 构建的软状态 - 动作 (state-action) 抽象来实现，其在状态和动作轴上的分辨率可以连续调整。我们在一系列表格环境 (tabular settings) 中验证了这一构造，表明在状态和动作信息遭受大量有损压缩的情况下，仍可实现近优性能。

Abstract

When learning to walk, infants seem to address a coarse version of the problem first - stay upright, reach the caregiver - and refine it only when further practice at that resolution stops paying off. Reinforcement learning offers multiple techniques for building simple versions of complex tasks, but lacks general principles for how to dynamically adjust the granularity of these abstractions during learning. This paper proposes one such principle: refine the abstraction as soon as the learning error within it becomes comparable to the error induced by the abstraction itself. Here, we investigate one way of formalising this principle via a performance certificate that decomposes value error into two terms: a learning error bound captured by a Bellman residual, and an abstraction error bound given by a bisimulation metric. The resulting switching strategy is implemented by soft state-action abstractions built from rate-distortion principles, whose resolution along state and action axes can be continuously adjusted. We validate this construction in a range of tabular settings, showing that near-optimal performance can be achieved under substantial lossy compression of state and action information.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	3.0/10	4.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	3.0/10	4.5

Scoring rationale: The paper focuses on reinforcement learning abstraction using rate-distortion theory, which is only weakly related to 'World Models' (state compression) and 'model-based RL' (abstraction for RL). It contains no content regarding multimodal data, tokenization, visual encoders, or multimodal large language models (MLLM), resulting in 0 scores for those keywords. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus is applied. The weighted total score is 9.0, which is below the dynamic pass score of 27.8, indicating low relevance to the specified keyword cluster.

Keywords

Adaptive state-action abstractions, rate-distortion, reinforcement learning, bisimulation metric, Bellman residual, value error, lossy compression, tabular settings

136. HoT-SSM: Higher-order Temporal Knowledge Graph Reasoning with State Space Models for Health Care [FAIL]

Score: 9.0 / 27.8

Authors: Thummaluru Siddartha Reddy, Vempalli Naga Sai Saketh, Yash Punjabi, Mahesh Chandran

Published: 2026-06-04

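TL;DR: HoT-SSM is a higher-order temporal knowledge graph reasoning method based on state space models that captures higher-order interactions among clinical concepts and long-range temporal dependencies, significantly improving performance on clinical prediction tasks.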

Abstract (Chinese translation)

融合了临床知识的医学知识图谱（MKGs）已被越来越多地用于建模电子健康记录（EHRs），以支持医疗领域中的可解释预测。然而，现有的基于 MKG 的方法在捕捉临床概念（如疾病、操作和药物）之间的两两关系方面存在局限，限制了其对共现或语义相关概念之间高阶交互的建模能力。此外，大多数利用 MKGs 的表示学习方法要么压缩了就诊间的时序信息，要么缺乏显式机制来建模长程时序依赖，而这对于死亡率预测等临床任务至关重要。为缓解这些局限，我们提出了 HoT-SSM，这是一种基于状态空间模型（SSM）的参数高效高阶时序图推理方法。对于每次就诊，HoT-SSM 利用领域知识将语义相关的临床概念分组为超边以构建超图，从而保留就诊级别的临床上下文。此外，为了在学习表示的同时建模时序动态，我们引入了一种新颖的基于动态超图的状态空间模型，该模型显式捕捉患者潜在状态随时间的演化，同时保留长程信息。学习到的表示被用于下游临床预测和推理。在 MIMIC-III 和 MIMIC-IV 数据集上的实验表明，相比当前最先进模型，该方法表现出显著的性能提升，证明了联合建模高阶临床交互和长程时序依赖的有效性。

Abstract

Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in the healthcare domain. However, existing MKG-based approaches are limited to capturing pairwise relations between clinical concepts (e.g., conditions, procedures, and medications), which restricts their ability to model higher-order interactions among co-occurring or semantically related concepts. In addition, most representation learning methods that leverage MKGs either collapse temporal information across visits or lack an explicit mechanism for modeling long-range temporal dependencies, which are critical for clinical tasks such as mortality prediction. To mitigate these limitations, we propose HoT-SSM, a parameter-efficient, higher-order temporal graph reasoning method based on state space models. For each visit, HoT-SSM constructs hypergraphs by grouping semantically related clinical concepts into hyperedges using domain knowledge, thereby preserving visit-level clinical context. Further, to model the temporal dynamics while learning the representations, we introduce a novel dynamic hypergraph-based state space model that explicitly captures the evolution of patients' latent states over time while preserving long-range information. The learned representations are used for downstream clinical prediction and reasoning. Experiments on the MIMIC-III and MIMIC-IV datasets show significant performance improvements over the current state-of-the-art models, demonstrating the effectiveness of jointly modeling higher-order clinical interactions and long-range temporal dependencies.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	1.0/10	1.5

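Scoring rationale: The paper's core is the combination of medical knowledge graphs (MKGs) and state space models (SSMs) for clinical prediction. This differs substantially from the provided keyword set, which mainly covers multimodal large language models, world models, and reinforcement learning. The paper does not involve Tokenizers, visual encoders, or MLLM architectures; although it uses an SSM to model state evolution (similar in concept to a world model), it is not a generative world model; and although it is a model-based method, it is not a reinforcement learning task. The relevance scores are therefore low. The weighted total is 9.0, below the dynamic pass score of 27.8. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied.

Keywords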

Higher-order Temporal Knowledge Graph Reasoning, State Space Models, Medical Knowledge Graphs, Clinical Prediction, Temporal Dynamics, Hypergraph Construction, Long-range Dependencies

137. Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation [FAIL]

Score: 9.0 / 27.8

Authors: Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham

Published: 2026-06-04

TL;DR: The paper proposes compressing reasoning traces before knowledge distillation to significantly reduce training tokens and inference length while maintaining most accuracy, though raw traces remain slightly more accurate.

Abstract (Chinese translation)

推理模型会产生冗长的思维链轨迹（Chain-of-Thought Traces），这些轨迹在蒸馏过程中成本高昂，且容易导致学生模型输出冗长。本文研究在知识蒸馏之前对这些轨迹进行事后压缩（Post-hoc Compression）。两个教师模型 Qwen3.5-397B-A17B 和 gpt-oss-120B 各自生成了约 28.3 万个正确轨迹；随后，两个指令微调模型将它们压缩至原始字符长度的 8.6%-21.0%。在包含 48 次运行的主网格实验以及七个 Qwen 教师截断消融实验中，压缩轨迹将训练 token 数量降至原始的 12%-30%，使训练速度提升 2.0-7.6 倍，并将推理输出长度缩短 3-19 倍；而在较短的 gpt-oss 教师模型下，缩短幅度相对较小。然而，原始轨迹在所有规模及两种教师模型下均保持了最高的下游任务准确率。一项长度匹配的原始轨迹截断消融实验表明，压缩带来的收益并非仅仅源于更小的 token 预算：模型压缩的轨迹通常优于或媲美朴素截断（Naive Truncation），尤其是在学生模型规模较小时，同时还能保持较短的推理输出。总体而言，推理轨迹压缩提供了一种准确率与效率之间的权衡，而非免费的改进：学生模型可保留高达 96% 的原始轨迹准确率，同时获得高达 18 倍的每 token 效率提升；而在 0.8B 规模下使用 LoRA 时，压缩轨迹虽缩小了原始轨迹与压缩轨迹之间的差距，但并未超越原始轨迹。

Abstract

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on reasoning trace compression for knowledge distillation, unrelated to World Models, Visual Encoders, or Model-Based RL (0 relevance). MLLM and MultiModal have minor relevance (2) as teachers are large models, while Unify Models and Tokenizer have low relevance (1) as compression applies to traces, not architecture or tokenizer design. Total weighted score is 9.0, below the dynamic pass score of 27.8. No specified expert authors are present.

Keywords

Reasoning Trace Compression, Knowledge Distillation, Chain-of-Thought, Model Efficiency, Training Tokens, Inference Length, Post-hoc Compression

138. Cross-scale spatially-aware generative modeling of transcriptomic programs underlying neurodegenerative brain organization [FAIL]

Score: 9.0 / 27.8

Authors: Krishnakumar Vaithianathan

Published: 2026-06-04

TL;DR: This paper introduces a cross-scale spatially-aware generative framework to model transcriptomic programs underlying neurodegenerative brain organization, successfully linking molecular organization to cortical degeneration with high predictive accuracy.

Abstract (Chinese translation)

神经退行性疾病（如阿尔茨海默病）表现出高度组织化的区域脑易损性模式，然而这种空间选择性的生物学机制仍不完全清楚。现有的影像 - 转录组学研究主要依赖于基因表达与神经影像学表型之间的相关性分析，限制了其建模分子组织如何导致神经退化的能力。在此，我们提出了一种跨尺度的空间感知生成框架，用于建模皮层神经退化的转录组程序。区域转录组概况源自艾伦人脑图谱（Allen Human Brain Atlas），使用了跨越 68 个皮层区域的 910 个标志基因。神经退行性易损性图谱基于 ADNI FreeSurfer 皮层厚度测量构建，通过计算认知正常对照组（NC = 926）与阿尔茨海默病患者（AD = 426）之间的区域皮层变薄差异得出。采用变分生成架构来学习潜在生物程序，该程序关联区域基因表达组织与皮层退化，同时融入基于图的空间平滑正则化，以保留皮层组织结构。所提出的框架实现了对区域神经退行性易损性的强预测，解释方差达 0.8604，且预测与观察到的皮层退化概况之间存在显著的空间相关性（r = 0.9439, p < 0.001）。学习到的潜在表示揭示了与分布式疾病易感性相关的结构化转录组组织。这些发现表明，生物约束的生成建模能够连接微观分子组织与宏观神经退行性病变，为空间感知生成神经生物学和计算神经科学奠定了基础。

Abstract

Neurodegenerative disorders such as Alzheimer's disease exhibit highly organized patterns of regional brain vulnerability, yet the biological mechanisms underlying this spatial selectivity remain incompletely understood. Existing imaging-transcriptomic studies have largely relied on correlation-based analyses between gene expression and neuroimaging phenotypes, limiting their ability to model how molecular organization gives rise to neurodegeneration. Here, we introduce a cross-scale spatially-aware generative framework for modeling transcriptomic programs underlying cortical neurodegeneration. Regional transcriptomic profiles were derived from the Allen Human Brain Atlas using 910 landmark genes across 68 cortical regions. Neurodegenerative vulnerability maps were constructed from ADNI FreeSurfer cortical thickness measurements by computing regional cortical thinning differences between cognitively normal controls (NC = 926) and Alzheimer's disease subjects (AD = 426). A variational generative architecture was used to learn latent biological programs linking regional gene-expression organization to cortical degeneration while incorporating graph-based spatial smoothness regularization to preserve cortical organization. The proposed framework achieved strong prediction of regional neurodegenerative vulnerability, yielding an explained variance of 0.8604 and a significant spatial correlation between predicted and observed cortical degeneration profiles (r = 0.9439, p < 0.001). The learned latent representations revealed structured transcriptomic organization associated with distributed disease susceptibility. These findings demonstrate that biologically constrained generative modeling can bridge microscale molecular organization with macroscale neurodegeneration, providing a foundation for spatially-aware generative neurobiology and computational neuroscience.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on computational neuroscience and biological generative modeling (transcriptomics + imaging) rather than AI foundation models. It does not utilize Tokenizers, MLLMs, Visual Encoders (in the VLM sense), or Model-Based RL. Although it involves multi-modal data and generative architectures, the technical alignment with the provided AI-specific keywords is minimal. No specified expert authors are found.

Keywords

Cross-scale spatially-aware, generative modeling, transcriptomic programs, neurodegenerative brain organization, variational generative architecture, cortical degeneration, spatial smoothness regularization, regional transcriptomic profiles

139. FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition [FAIL]

Score: 9.0 / 27.8

Authors: Fernando López, Santosh Kesiraju, Jordi Luque

Published: 2026-06-04

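TL;DR: This paper proposes a FiLM-based speaker conditioning method that injects x-vector information to modulate the representations of a frozen ASR encoder, adapting effectively to pathological speech while preserving question-answering ability.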

Abstract

Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
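Feature-wise Linear Modulation scales and shifts each hidden feature with values predicted from a conditioning vector. The sketch below illustrates that operation in plain Python; the projection weights, the vector sizes, and the function names are hypothetical, and it does not reproduce the authors' per-layer integration into the frozen encoder.

```python
def linear(x, weight, bias):
    """Affine map y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(hidden, speaker_vec, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: h' = gamma(s) * h + beta(s), applied per feature."""
    gamma = linear(speaker_vec, w_gamma, b_gamma)  # per-feature scale
    beta = linear(speaker_vec, w_beta, b_beta)     # per-feature shift
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


hidden = [1.0, -2.0]           # one frame of encoder features (toy values)
speaker_vec = [0.5, 0.5, 1.0]  # stand-in for an x-vector (assumed size 3)
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

# Zero projections with a scale bias of 1 leave the features unchanged.
unchanged = film(hidden, speaker_vec, zeros, [1.0, 1.0], zeros, [0.0, 0.0])

# Non-trivial (hypothetical) projections produce speaker-dependent features.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
w_beta = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
modulated = film(hidden, speaker_vec, w_gamma, [0.0, 0.0], w_beta, [0.0, 0.5])
```

Only the small projection layers would be trained in such a setup, which is consistent with the abstract's point that the base model weights are not modified.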

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

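Scoring rationale: The paper's core topic is speaker conditioning for pathological speech recognition, using FiLM to modulate a frozen encoder. The provided keywords cover multimodality, world models, and reinforcement learning, a poor match for a single-modality speech task. Visual Encoder, World Models, model-based RL, and MultiModal are unrelated (0 points); MLLM is weakly related (2 points, since the SpeechLLM emphasizes language rather than multimodality); Unify Models is weakly related (1 point, since conditioning involves information fusion but not model unification); Tokenizer has some implicit relevance (3 points, because LLM/ASR architectures are involved). The author list contains none of the specified experts, so no bonus is applied.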
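The per-keyword scores in the table for this paper appear to be the weight multiplied by the relevance value, summed into the total that is compared with the pass threshold. A minimal sketch of that arithmetic follows; the `>=` comparison and the omission of any expert bonus are assumptions, since the report does not spell out the exact rule.

```python
WEIGHT = 1.5           # weight shown for every keyword in the table
PASS_THRESHOLD = 27.8  # threshold shown on the "Score" line

# Relevance values (out of 10) copied from the table for paper 139.
relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 3.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 2.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}


def weighted_scores(relevance_by_keyword, weight=WEIGHT):
    """Score per keyword = weight * relevance."""
    return {key: weight * rel for key, rel in relevance_by_keyword.items()}


scores = weighted_scores(relevance)
total = sum(scores.values())
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"  # assumed rule
```

With these inputs the total matches the 9.0 reported for this entry.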

Keywords

Pathological Speech Recognition, Speaker Conditioning, Feature-wise Linear Modulation, Frozen ASR Encoder, Parameter-efficient Fine-tuning, SpeechLLM, x-vector

140. Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition [FAIL]

Score: 9.0 / 27.8

Authors: Seung Hwan Cho, Young-Min Kim

Published: 2026-06-04

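TL;DR: This study finds that multi-task learning in second-language speech recognition causes representational entanglement that degrades surface transcription quality, and recommends mitigating encoder entanglement to improve dual-output frameworks.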

Abstract

Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.
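The surface-meaning divergence in the abstract is measured with Levenshtein edit distance, the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming sketch is shown below; the example sentence pair is invented for illustration and is not taken from the paper's data.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical surface transcription vs. intended meaning.
surface = "I goed to school"
meaning = "I went to school"
divergence = levenshtein(surface, meaning)
```

A larger distance between the two transcriptions of the same utterance corresponds to a larger surface-meaning divergence in the sense used above.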

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	3.0/10	4.5
model-based RL	1.5	0.0/10	0.0

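Scoring rationale: The paper's core topic is multi-task learning (MTL) and representational entanglement in second-language speech recognition; it is unrelated to 'World Models', 'model-based RL', 'Visual Encoder', and 'MLLM' (0 points). 'MultiModal' has some relevance (3 points) because both audio and text are involved. 'Unify Models' is weakly related (2 points), since the work unifies tasks rather than model architectures. 'Tokenizer' is only marginally related (1 point), since tokenizer design is not discussed. The author list contains none of the specified experts, so no bonus is applied.

Keywords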

Multi-task Learning, Representational Entanglement, Second Language Speech Recognition, Dual-output Decoder, Encoder Analysis, Surface Transcription, Intended Meaning

141. Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems [FAIL]

Score: 9.0 / 27.8

Authors: Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang

Published: 2026-06-04

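TL;DR: This paper proposes value diversity as a system-level evaluation metric for multicultural multi-agent systems, finds substantial homogenization in current LLM-based societies, and shows that mixed-backbone systems only partially mitigate it.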

Abstract

Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at https://github.com/iNLP-Lab/MultiAgent-Diversity.
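The abstract defines value diversity through the dissimilarity between agents' responses on a shared survey, without stating the exact metric. One plausible reading, sketched here purely as an assumption, is the mean pairwise distance between agents' answer vectors, normalized by the answer scale; the 1-to-10 scale and the toy responses are hypothetical.

```python
from itertools import combinations


def response_distance(a, b, scale_range):
    """Mean absolute difference across survey items, normalized to [0, 1]."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (len(a) * scale_range)


def value_diversity(responses, scale_range):
    """Mean pairwise distance over all agent pairs (assumed definition)."""
    pairs = list(combinations(responses, 2))
    return sum(response_distance(a, b, scale_range) for a, b in pairs) / len(pairs)


# Three agents answering three items on an assumed 1-10 scale (range 9).
homogeneous = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
diverse = [[1, 10, 1], [10, 1, 10], [5, 5, 5]]

low = value_diversity(homogeneous, scale_range=9)
high = value_diversity(diverse, scale_range=9)
```

Because the quantity is computed over pairs of agents, it is a property of the system as a whole, which is the distinction from per-agent alignment that the abstract draws.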

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.5/10	2.25
MLLM	1.5	1.5/10	2.25
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

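Scoring rationale: The paper focuses on evaluating value diversity in multicultural multi-agent systems, which differs markedly from the multimodal architecture (Tokenizer, Visual Encoder, MultiModal) and reinforcement learning (World Models, model-based RL) areas that the keywords cover. There are only faint, vocabulary-level links: 'mixed-backbone systems' with Unify Models, 'World Values Survey' with World Models, and 'backbone models' with MLLM. No specific model components or RL algorithms are involved. The author list contains none of the specified experts.

Keywords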

Multicultural Multi-Agent Systems, Value Diversity, Cultural Alignment, World Values Survey, Homogenization Tendency, Backbone Models, Social Interaction

142. Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation [FAIL]

Score: 9.0 / 27.8

Authors: Ariel Herrera, Xueyang Kang, Atal Anil Kumar

Published: 2026-06-04

TL;DR: This paper proposes a perception framework using synthetic data and CNN/YOLO for wrinkle and keypoint detection to enable bimanual cloth manipulation without fine-tuning on physical fabrics.

Abstract

Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on robotic cloth manipulation using traditional computer vision (CNN, YOLO) and synthetic data generation. It does not involve Large Language Models, Tokenizers, World Models, or Unified Multimodal Architectures as implied by the keyword set. While it uses visual encoders (CNN) and addresses manipulation (related to RL), it lacks the specific methodologies (MLLM, Model-Based RL dynamics modeling) associated with the provided keywords. No expert authors from the specified list were found in the author list.

Keywords

Synthetic Data Generation, Vision-based Wrinkle Detection, Keypoint Detection, Bimanual Cloth Manipulation, CNN, YOLOv8, Robotic Control

143. Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution [FAIL]

Score: 7.5 / 27.8

Authors: Liliana Hotsko, Yinxi Li, Yuntian Deng, Pengyu Nie

Published: 2026-06-04

TL;DR: Code2LoRA proposes a hypernetwork framework to generate repository-specific LoRA adapters for code language models, enabling efficient adaptation to evolving codebases with zero inference-time token overhead.

Abstract

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.
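A LoRA adapter adds a low-rank update to a frozen weight matrix, so repository knowledge lives in the weights rather than in extra prompt tokens. The sketch below shows only the standard merge step, W' = W + (alpha / r) * B A; in Code2LoRA the matrices would come from a hypernetwork, whereas here `lora_a` and `lora_b` are hard-coded placeholder values, and all names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def apply_lora(weight, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A for a rank-r adapter."""
    rank = len(lora_a)              # A has shape (r, in_features)
    delta = matmul(lora_b, lora_a)  # B has shape (out_features, r)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (toy 2x2)
lora_a = [[1.0, 2.0]]              # placeholder for generated A, r = 1
lora_b = [[0.5], [0.0]]            # placeholder for generated B
merged = apply_lora(weight, lora_a, lora_b, alpha=2.0)
```

Once merged, inference uses the same input length as the base model, which is one way to read the "zero inference-time token overhead" claim.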

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Code Language Models and parameter-efficient fine-tuning (LoRA) using hypernetworks for software evolution. It does not involve multimodal processing (Visual Encoder, MultiModal, MLLM), world modeling, or reinforcement learning (model-based RL), resulting in scores of 0 to 1 for these. Tokenizer usage is standard infrastructure (1 point). Unify Models is weakly related to adapter unification but not in the keyword's typical sense (e.g., multimodal unification), so it scores 2. No listed expert authors are found, so no bonus points are applied.

Keywords

Code Language Models, Hypernetwork, LoRA Adapters, Software Evolution, Parameter-Efficient Fine-tuning, Repository-level Context, Adapter Generation

144. Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection [FAIL]

Score: 7.5 / 27.8

Authors: Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen

Published: 2026-06-04

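TL;DR: This paper introduces the OpAI-Bench benchmark for studying the detectability of AI text during human-AI co-editing, and finds that detection difficulty depends on edit operations and revision history, not on the proportion of AI content alone.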

Abstract

As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA-Lab/OpAI-Bench.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

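Scoring rationale: The paper focuses on an AI-text detection benchmark (OpAI-Bench) and on text transformation during human-AI co-editing, placing it in NLP safety and evaluation. The provided keywords (Visual Encoder, World Models, MultiModal, model-based RL, and so on) belong to multimodal large-model architecture and reinforcement learning, and are largely irrelevant to the paper's topic (text-only detection and analysis of the editing process). Tokenizer and MLLM receive very low scores only because of token-granularity detection and 'AI text' generation; the remaining keywords are unrelated.

Keywords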

AI-text detection, Human-AI co-editing, Progressive transformation, Benchmark, Multi-granularity, Operation-guided, Authorship provenance

145. LLM Self-Recognition: Steering and Retrieving Activation Signatures [FAIL]

Score: 7.5 / 27.8

Authors: Thibaud Ardoin, Jonas Schäfer, Gerhard Wunder

Published: 2026-06-04

TL;DR: This paper proposes a method for attributing LLM-generated text to specific models by steering internal residual streams to create detectable activation fingerprints without degrading text quality.

Abstract

Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.
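Steering with a random sparse vector can be pictured as adding a fixed, mostly-zero direction to the residual stream and later checking how strongly an activation projects onto it. The toy below shows only that additive step and a dot-product readout on synthetic vectors; the dimension, sparsity, strength, and function names are all hypothetical, and the paper's detector reads the signal from a separate LLM's activations on the generated text, which this sketch does not model.

```python
import random


def sparse_signature(dim, nonzero, seed):
    """Random sparse vector with `nonzero` entries set to +1 or -1."""
    rng = random.Random(seed)
    vec = [0.0] * dim
    for idx in rng.sample(range(dim), nonzero):
        vec[idx] = rng.choice([-1.0, 1.0])
    return vec


def steer(residual, signature, strength):
    """Add the scaled signature to a residual-stream vector."""
    return [h + strength * s for h, s in zip(residual, signature)]


def signature_score(activation, signature):
    """Projection of an activation onto the signature direction."""
    return sum(a * s for a, s in zip(activation, signature))


dim = 64
signature = sparse_signature(dim, nonzero=8, seed=0)
rng = random.Random(1)
residual = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # synthetic activation

steered = steer(residual, signature, strength=4.0)
plain_score = signature_score(residual, signature)
steered_score = signature_score(steered, signature)
```

The steered score exceeds the plain one by the strength times the number of nonzero entries, which is the kind of separable shift a detector could threshold on.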

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on LLM interpretability and attribution via activation steering, with minimal overlap regarding multimodal, world model, or RL keywords. Visual Encoder, World Models, MultiModal, and model-based RL are irrelevant. Unify Models and MLLM have slight tangential relation to LLMs but are not core. Tokenizer is peripheral. No specified expert authors are found.

Keywords

LLM Self-Recognition, Activation Signatures, Steering Mechanism, Attribution, Interpretability, Residual Stream, Text Quality

146. When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents [FAIL]

Score: 7.5 / 27.8

Authors: Lingxiang Xu, Jiaoyun Yang, Min Hu, Hongtu Chen, Ning An

Published: 2026-06-04

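TL;DR: This paper studies when memory-augmented conversational agents should integrate sensitive long-term memories, finds that memory availability substantially reduces behavioral separation on sensitive content, and shows that safe personalization requires memory-aware decisions at both retrieval and generation time.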

Abstract

Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. Existing memory evaluations emphasize retrieval accuracy and downstream task utility, while overlooking whether retrieved sensitive memory content is warranted in the current turn. We introduce RBI-Eval, a controlled measurement study built around a probe set that compares model behavior with and without access to sensitive memory under identical benign prompts. We evaluate four base LLMs against a matched no-memory reference across four memory-access settings: full-context exposure and three retrieval systems. Our results reveal substantial behavioral divergence. With memory available, the separation score for sensitive-memory integration decreases by 8.9%–26.6% relative to the matched no-memory reference for GPT-5.4-mini, but by 51.1%–82.9% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B. Control experiments on DeepSeek and GPT-5.4-mini show this effect is specific to sensitive content, rather than general personalization. Retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator. These findings suggest safe personalization requires memory-aware decisions at both retrieval and generation time.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

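Scoring rationale: The paper focuses on the boundaries of safe long-term memory use in conversational agents and on the integration of sensitive content, placing it in LLM safety and alignment. The given keywords emphasize multimodal architecture (Visual Encoder, MLLM, MultiModal), model unification (Unify Models), tokenizers (Tokenizer), and reinforcement learning (World Models, model-based RL). The paper has almost nothing to do with multimodality, visual encoding, or reinforcement learning. It has only faint links through memory mechanisms and generation-time integration (Unify Models) and through a broad notion of world models (World Models), so relevance scores are low across the board.

Keywords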

Memory-Augmented Conversational Agents, Sensitive Memory Integration, RBI-Eval, Behavioral Divergence, Safe Personalization, Long-term Memory, Retrieval Systems

147. Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach [FAIL]

Score: 7.5 / 27.8

Authors: Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He

Published: 2026-06-04

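TL;DR: This paper proposes a multi-aspect iterative refinement framework that generates high-quality data to train LLMs with SFT and GRPO, achieving competitive performance on a literary translation benchmark.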

Abstract

Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).
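GRPO scores each sampled output relative to the other outputs drawn for the same prompt, typically by standardizing rewards within the group. The sketch below shows that normalization step only; the reward values are invented, and the use of the population standard deviation with a small epsilon is an assumption rather than a detail from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some variants use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four candidate translations.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
```

Candidates above the group mean get positive advantages and are reinforced, while those below it are discouraged, with no separate value network required.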

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

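Scoring rationale: The paper focuses on text-only literary translation and involves no visual encoder, multimodal architecture, or world model, so the related keywords score 0. It uses reinforcement learning (GRPO), but this is model-free alignment rather than model-based RL. It includes none of the specified expert authors (Xuanyu Zhu is not the same person as Ziqi Zhu), so no expert bonus is triggered.

Keywords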

Literary Translation, Multi-Aspect Iterative Refinement, LLM Training, Data Generation, Supervised Fine-Tuning, GRPO, LitMT

148. Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version) [FAIL]

Score: 7.5 / 27.8

Authors: Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała

Published: 2026-06-04

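TL;DR: This paper proposes a graph-based retrieval-augmented generation system that significantly reduces hallucinations in complex question answering and improves factual accuracy, but involves no multimodal or reinforcement learning techniques.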

Abstract

Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM "hallucinating" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

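Scoring rationale: The paper studies graph-based retrieval-augmented generation (RAG) for reducing LLM hallucinations in question answering, a text-only area. It involves no visual encoder, multimodal data, model-based reinforcement learning, or unified model architecture. Although it mentions token usage and a knowledge graph (weakly related to world models), it is a poor match for the given keywords, which emphasize multimodality and RL. The weighted total is 7.5 points, well below the dynamic pass threshold of 27.8. The author list contains none of the specified experts (Yang Shi and others).

Keywords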

Large language models, Retrieval-augmented generation, Graph-based, Hallucinations, Question Answering, Factual correctness, Agentic system

149. QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving [FAIL]

Score: 7.5 / 27.8

Authors: Jianxin Yan, Wangze Ni, Zhenxin Li, Jiabao Jin, Zhitao Shen, Haoyang Li, Jia Zhu, Peng Cheng, Xuemin Lin, Lei Chen, Kui Ren

Published: 2026-06-04

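TL;DR: QCFuse proposes a compressed-view, query-aware selector for RAG cache fusion that substantially reduces prefill time in LLM serving while maintaining full-prefill-level quality.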

Abstract

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

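Scoring rationale: The paper focuses on optimizing LLM serving efficiency for retrieval-augmented generation (RAG); its core contribution is a cache-fusion strategy. The provided keywords mainly cover multimodality, world models, and reinforcement learning, a poor match for this paper. Visual encoders, world models, multimodality, and reinforcement learning do not appear in the paper, so they score 0. Although LLMs and token processing are involved, the paper does not touch the core of MLLM architecture or tokenizer design, so relevance is low (1 to 2 points). The weighted total is about 7.5 points, well below the dynamic pass threshold of 27.8, indicating a marked mismatch between the paper's topic and the evaluation keyword set.

Keywords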

Retrieval-Augmented Generation, Cache Fusion, Query-Aware Selector, Compressed View, LLM Serving, Prefill Optimization, KV Cache

150. Discrete Causal Representations from Heterogeneous Domains: A Bayesian Approach with Social Survey Applications [FAIL]

Score: 7.5 / 27.8

Authors: Ankur Garg, Michael Stettler, Aaron Schein, Julius von Kügelgen

Published: 2026-06-04

TL;DR: This paper proposes a Bayesian framework for learning discrete causal representations from heterogeneous multi-environment data, successfully inferring latent cultural values and political opinions from social surveys.

Abstract

Causal representation learning aims to infer the high-level latent causal concepts that give rise to observed low-level measurements. This is particularly relevant for heterogeneous data from different environments or domains since distribution shifts often arise through sparse, localized changes in some of the underlying causal mechanisms, while other parts of the generative process remain unchanged. Whereas identifiability of causal representations has been studied extensively, practical uncertainty-aware methods and real-world use cases remain less explored. In this work, we propose a Bayesian approach to learning causal representations from multi-environment data, focusing on the case of discrete causal concepts and unknown multi-node soft interventions. To this end, we translate causal assumptions and interpretability desiderata into suitable priors and parametric choices within a hierarchical model. We then devise an inference scheme based on sequential Monte Carlo sampling to approximate the resulting multimodal posterior. We showcase our approach through case studies on social survey data, where latent causal concepts correspond to cultural values or political opinions, measurements to survey responses, and environments to different countries or states. Our model infers meaningful high-level concepts and plausible causal relations among them, demonstrating its utility for learning causal representations of complex real-world data.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on causal representation learning via Bayesian inference on survey data and has no direct technical alignment with the tokenizer, visual encoder, MLLM, or RL keywords. Unify Models and World Models are weakly relevant because of representation unification and causal modeling, while MultiModal scores low because the data is tabular. No listed experts are among the authors.

Keywords

Causal representation learning, Bayesian approach, Multi-environment data, Discrete causal concepts, Social survey applications, Hierarchical model, Sequential Monte Carlo

151. LLM Explainability with Counterfactual Chains and Causal Graphs [FAIL]

Score: 7.5 / 27.8

Authors: Nirit Nussbaum-Hoffer, Nitay Calderon, Liat Ein-Dor, Roi Reichart

Published: 2026-06-04

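TL;DR: This paper proposes an LLM explainability method based on counterfactual chains and causal graphs, which reveals the model's reasoning process by discovering concept states and applying counterfactual augmentation.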

Abstract

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with $σ$-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs' reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

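Scoring rationale: The paper's core is LLM explainability, causal graph construction, and a counterfactual-chain method, which places it in NLP interpretability. The supplied keyword set focuses on multimodal large models, world models, and reinforcement learning (e.g., Visual Encoder, World Models, model-based RL), which does not match this paper's text-only LLM interpretability topic. Apart from MLLM and Unify Models, which are slightly related because the paper involves LLM concepts, the remaining keywords have very low relevance. None of the designated experts appear in the author list.

Keywords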

LLM Explainability, Causal Graphs, Counterfactual Chains, Concept Learning, Interpretability, Model Inference, Causal Discovery, Textual Examples

152. FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays [FAIL]

Score: 7.5 / 27.8

Authors: Yijun Liu, Yifan Song, John Gallagher, Sarah Sterman, Tal August

Published: 2026-06-04

TL;DR: This paper introduces FOXGLOVE, a dataset comparing expert and LLM writing feedback on argumentative essays, finding that while goal distribution aligns, anchoring and complexity differ, with LLM feedback rated higher partly due to length.

Abstract

While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

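Scoring rationale: The paper focuses on a comparative analysis of LLM and expert writing feedback on argumentative essays, which falls under applied natural language processing. The supplied keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) mainly concerns underlying model technology such as multimodal architectures, world models, and reinforcement learning. The paper does not involve visual encoders, world-model construction, reinforcement learning algorithms, or unified model architectures; it only uses general-purpose LLMs to generate textual feedback and has no multimodal aspect. Apart from a weak link through LLM-related concepts, the keywords are unrelated to the paper's content, so the overall relevance score is very low.

Keywords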

Writing Feedback, LLMs, Argumentative Essays, Goal-Oriented, Anchored Feedback, Dataset, Expert Comparison, Revision Dimensions

153. Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery [FAIL]

Score: 7.5 / 27.8

Authors: Alireza Bayat Makou, Jingcheng Niu, Subhabrata Dutta, Iryna Gurevych

Published: 2026-06-04

TL;DR: The study reveals that structurally distinct circuits in language models often implement the same computation across different input frequency bands, challenging the assumption that structural differences signify distinct mechanisms.

Abstract

Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by varying input statistics while holding the task fixed, and show that the resulting structural differences exhibit apparent specialization but do not correspond to functional differences, a pattern we term phantom specialization. Using Literal Sequence Copying across four token-frequency bands plus a control condition in five Pythia models (70M-1.4B), we extract 75 circuits and find that structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands. Repeated extractions within the same frequency band further suggest that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice obscures this pattern: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. Our results show that structural differences between circuits are not sufficient evidence for distinct mechanisms, and that exposing this requires edge-level evaluation and cross-condition transfer tests.
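
To make the structural side of this comparison concrete, the sketch below measures edge overlap between circuits and extracts a core shared by most of them. The circuits, edge names, and the "more than half" rule are hypothetical, and the code covers structure only: the abstract's point is that functional checks such as transfer tests and interchange interventions are still needed on top of it.

```python
def jaccard(a, b):
    """Jaccard similarity of two edge sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def shared_core(circuits, min_fraction=0.5):
    """Edges present in more than `min_fraction` of the circuits."""
    counts = {}
    for edges in circuits.values():
        for edge in edges:
            counts[edge] = counts.get(edge, 0) + 1
    needed = min_fraction * len(circuits)
    return {edge for edge, count in counts.items() if count > needed}


# Hypothetical circuits for three frequency bands, as sets of (source, target) edges.
circuits = {
    "band_low": {("a0", "m1"), ("m1", "a2"), ("low", "out")},
    "band_mid": {("a0", "m1"), ("m1", "a2"), ("mid", "out")},
    "band_high": {("a0", "m1"), ("m1", "a2"), ("high", "out")},
}

print(jaccard(circuits["band_low"], circuits["band_mid"]))  # 0.5
print(sorted(shared_core(circuits)))  # [('a0', 'm1'), ('m1', 'a2')]
```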

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on circuit discovery and interpretability in language models, analyzing the relationship between circuit structure and function under input variation. It has no content on multimodal learning, visual encoders, world models, or reinforcement learning, so those keywords score low. Tokenizer relevance is marginal and comes only from the token-frequency analysis. None of the listed experts are among the authors.

Keywords

Circuit Discovery, Input Variation, Token Frequency, Structure-Function Mapping, Edge-Level Evaluation, Phantom Specialization, Pythia Models, Interpretability

154. Contextualized Prompting For Stance Detection On Social Media [FAIL]

Score: 7.5 / 27.8

Authors: Tilman Beck, Shakib Yazdani, Simon Kruschinski, Marcus Maurer, Iryna Gurevych

Published: 2026-06-04

TL;DR: This study evaluates the impact of contextual features on LLM-based stance detection in Twitter, demonstrating that LLM-generated target descriptions enhance accuracy whereas other user metadata often degrades performance.

Abstract

Stance detection on social media is challenging due to short, noisy, and context-dependent language. While large language models (LLMs) show zero-shot generalization, they are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. In this work, we systematically investigate the impact of incorporating real-world (e.g., user biographies), derived (e.g., political party), and LLM-generated (e.g., target descriptions) contextual features into zero-shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM-generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise. Our qualitative analysis reveals that LLMs struggle to distinguish task-specific useful information from irrelevant context. Our findings highlight both the promise and challenges of prompting with context information in noisy real-world settings. We publish code and data at https://github.com/tilmanbeck/stance-context-twitter.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper addresses stance detection on social media using large language models (LLMs) with contextual prompting. It is a text-based NLP task with no visual components (Visual Encoder, MultiModal), reinforcement learning (model-based RL), or world modeling (World Models). Although it uses LLMs, which relates only loosely to MLLM and Unify Models, it does not focus on model unification, tokenizer design, or multimodal integration, so relevance to the supplied keywords is low.

Keywords

Stance detection, Social media, Large language models, Contextual prompting, Zero-shot learning, Twitter, User biographies, Target descriptions

155. ReverseEOL: Improving Training-free Text Embeddings via Text Reversal in Decoder-only LLMs [FAIL]

Score: 7.5 / 27.8

Authors: Ailiang Lin, Zhuoyun Li, Yusong Wang, Keyu Mao, Kotaro Funakoshi, Manabu Okumura

Published: 2026-06-04

TL;DR: ReverseEOL addresses biased contextualized representations in decoder-only LLMs by utilizing text reversal to generate complementary embeddings, significantly improving training-free text embedding performance on STS and MTEB benchmarks.

Abstract

Recent advances in Large Language Models (LLMs) have opened new avenues for generating training-free text embeddings. However, the causal attention in decoder-only LLMs prevents earlier tokens from attending to future context, leading to biased contextualized representations. In this work, we propose Reverse prompting with Explicit One-word Limitation (ReverseEOL), a simple yet effective method for enhancing the representational capability of frozen LLMs. ReverseEOL augments the standard forward embedding with an additional reversed embedding derived from the reversed input text. Since reversing the input exposes each token to context inaccessible in the original order, the resulting reversed embedding effectively provides complementary information to the original one. As a result, combining the forward and reversed embeddings yields a richer final representation. Comprehensive experiments on STS and MTEB benchmarks demonstrate that ReverseEOL significantly improves the performance of existing training-free baselines across a broad range of LLMs with diverse architectures and scales. Extensive ablations and analyses further confirm the necessity of our reversal mechanism.
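
The idea of pairing a forward embedding with one computed on reversed text can be sketched with a toy order-sensitive encoder. Everything here is a stand-in: `toy_causal_embed` replaces a frozen decoder-only LLM, reversal is done at the word level, and the two embeddings are combined by averaging. The abstract does not specify the reversal granularity, the prompt, or the combination rule, so these are assumptions for illustration only.

```python
import hashlib
import math


def token_vector(token, dim=8):
    """Deterministic pseudo-embedding for a token (hypothetical stand-in)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:dim]]


def toy_causal_embed(tokens, dim=8, decay=0.5):
    """Running state in which later tokens dominate, mimicking a last-token readout."""
    state = [0.0] * dim
    for tok in tokens:
        vec = token_vector(tok, dim)
        state = [decay * s + v for s, v in zip(state, vec)]
    return state


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def forward_plus_reversed(text):
    """Average the normalized forward and reversed-order embeddings (assumed rule)."""
    tokens = text.split()
    fwd = l2_normalize(toy_causal_embed(tokens))
    rev = l2_normalize(toy_causal_embed(tokens[::-1]))
    return [(a + b) / 2 for a, b in zip(fwd, rev)]


embedding = forward_plus_reversed("the cat sat on the mat")
print(len(embedding))  # 8
```

In the forward pass of this toy encoder the final words dominate the state, while in the reversed pass the opening words do, which is the complementary view the abstract describes.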

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	3.0/10	4.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on enhancing text embeddings in decoder-only LLMs using text reversal. It has no multimodal components, visual encoders, world modeling, or reinforcement learning, which makes most keywords irrelevant. Tokenizer and Unify Models are slightly relevant because of LLM usage and embedding fusion, but their scores remain low.

Keywords

ReverseEOL, Text Embeddings, Decoder-only LLMs, Training-free, Text Reversal, Representation Learning, STS Benchmarks

156. Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs [FAIL]

Score: 7.5 / 27.8

Authors: Gio Paik, Hyunseo Shin, Soungmin Lee

Published: 2026-06-04

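TL;DR: This paper investigates whether code-switching speech recognition ability can be generalized from seen to unseen language pairs through model merging, and finds that generalization is limited.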

Abstract

Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.
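
Model merging is often done by averaging the parameters of fine-tuned models that share an architecture. The sketch below shows that simplest variant on plain Python lists; the abstract does not say which merging method the authors use, and the parameter names and language pairs are hypothetical.

```python
def merge_by_averaging(state_dicts, weights=None):
    """Weighted average of parameter vectors across models with identical keys."""
    if not state_dicts:
        raise ValueError("need at least one model")
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    keys = state_dicts[0].keys()
    if any(sd.keys() != keys for sd in state_dicts):
        raise ValueError("models must share parameter names")
    merged = {}
    for name in keys:
        size = len(state_dicts[0][name])
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(size)
        ]
    return merged


# Hypothetical bilingual CS-ASR checkpoints, reduced to tiny parameter lists.
model_en_zh = {"encoder.w": [1.0, 2.0], "decoder.w": [0.0]}
model_en_es = {"encoder.w": [3.0, 4.0], "decoder.w": [2.0]}

merged = merge_by_averaging([model_en_zh, model_en_es])
print(merged)  # {'encoder.w': [2.0, 3.0], 'decoder.w': [1.0]}
```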

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

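Scoring rationale: The paper focuses on multilingual code-switching automatic speech recognition (ASR) and is unrelated to the visual encoder, world model, MLLM, and reinforcement learning keywords (0 points). Its model merging technique corresponds to 'Unify Models' (2 points), tokenization in speech processing corresponds to 'Tokenizer' (2 points), and audio-text data counts as weakly 'MultiModal' (1 point). The weighted total is 7.5, far below the dynamic pass mark of 27.8, indicating very low relevance to the specified research directions.

Keywords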

Code-Switching ASR, Multilingual ASR, Unseen Language Pairs, Model Merging, Domain Generalization, Speech Recognition, Bilingual CS-ASR, Language Pair Generalization

157. Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study [FAIL]

Score: 6.0 / 27.8

Authors: Ching Yau Fergus Mok, Lavindra de Silva, Varun Kumar Reja, Ioannis Brilakis

Published: 2026-06-04

TL;DR: This paper reformulates infrastructure defect detection as image difference classification to reduce data reliance, showing that instruction-based classifiers outperform encoder-based ones in low-resource traffic sign inspection.

Abstract

Digital twins (DTs) allow the digitalization of road infrastructure inspection, though this is hindered by limited annotated data. This work exploits the relational nature of continuous asset condition monitoring to reformulate image-based defect detection as image difference classification (IDC) to reduce data reliance. This was evaluated in a case study on low-resource traffic sign inspection with different IDC classifiers using a newly curated, high-quality dataset. Results indicate that the instruction-based classifier outperforms encoder-based ones and gains from comparison with reference images. This shows that IDC can be an effective way to model the task when tackling data constraints in infrastructure inspection and DT asset condition updating.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on infrastructure inspection using digital twins and reformulates defect detection as image difference classification (IDC). It has no content on Unify Models, Tokenizers, World Models, MLLM, or model-based RL. Visual Encoder and MultiModal receive low scores for the use of encoder-based baselines and instruction-based (text-image) methods, respectively, both of which are peripheral to the core task-modeling contribution. None of the experts on the provided list are among the authors.

Keywords

Infrastructure Inspection, Image Difference Classification, Digital Twins, Traffic Sign, Low-resource, Instruction-based Classifier, Encoder-based Classifier

158. RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention [FAIL]

Score: 6.0 / 27.8

Authors: Yang Liu, ZhaoKai Luo, HuaYi Jin, ZhiYong Wang, RuoZhou He, BoYu Wang, Guanjie Chen, Junhao Hu

Published: 2026-06-04

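TL;DR: RedKnot addresses the memory bottleneck in long-context LLM serving with a head-aware KV cache decomposition mechanism, improving cache reuse and resource efficiency without retraining the model.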

Abstract

As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.
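
One way to picture "a full KV cache is not always necessary for every head" is to give each KV head its own retention policy and count the entries that remain. The sketch below is a back-of-the-envelope illustration only: the "full" versus "window" policies and the window sizes are hypothetical and are not taken from RedKnot.

```python
def kept_positions(seq_len, policy):
    """Positions a head retains: the whole sequence, or only a recent window."""
    if policy["kind"] == "full":
        return range(seq_len)
    start = max(0, seq_len - policy["window"])
    return range(start, seq_len)


def kv_entries(seq_len, head_policies):
    """Total number of (head, position) entries kept under per-head policies."""
    return sum(len(kept_positions(seq_len, p)) for p in head_policies)


# Hypothetical per-head policies: one long-range head, three local heads.
policies = [
    {"kind": "full"},
    {"kind": "window", "window": 128},
    {"kind": "window", "window": 128},
    {"kind": "window", "window": 512},
]

seq_len = 4096
monolithic = seq_len * len(policies)      # every head keeps every token
head_aware = kv_entries(seq_len, policies)
print(head_aware, monolithic, head_aware / monolithic)  # 4864 16384 0.296875
```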

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

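Scoring rationale: The paper focuses on optimizing the KV cache system on the LLM serving side, which is infrastructure work. The supplied keywords mainly concern model architecture directions such as multimodality, world models, and reinforcement learning. The paper is therefore unrelated to Visual Encoder, World Models, MultiModal, and model-based RL (0 points). It has a weak link to MLLM (2 points) because it concerns core LLM serving, and a weak link to Tokenizer (1 point) because it deals with token-level memory blocks; Unify Models scores 1 point because the paper unifies cache management policies but does not address model unification. None of the designated experts appear in the author list. The weighted total is 6.0, far below the dynamic pass mark of 27.8, indicating very low relevance to the given keywords.

Keywords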

LLM Serving, KV Cache, Head-Aware, SegPagedAttention, Long-Context, Memory Efficiency, Cache Reuse

159. Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains [FAIL]

Score: 6.0 / 27.8

Authors: Amandeep Kaur, Gyan Prakash

Published: 2026-06-04

TL;DR: This paper proposes a hybrid deep reinforcement learning algorithm (A3C DPPO) to optimize inventory replenishment policies in pharmaceutical supply chains, reducing costs while maintaining service levels under uncertain demand.

Abstract

Pharmaceutical supply chains (PSCs) struggle with inventory management (IM) due to unpredictable demand patterns and variable lead times associated with restocking. This complexity is further compounded by the finite shelf lives of pharmaceutical products, which necessitate a delicate balance between adequate stock and minimal waste. These intertwined factors create a complex optimization problem that requires sophisticated inventory strategies to ensure both product availability and PSC efficiency. This study aims to develop an optimal inventory replenishment policy for pharmaceutical products that can handle the stochasticity arising from uncertain demand and variable PSC conditions. The objective is to maximize the profitability of the PSC while maintaining a high patient service level. We formulate the problem as a Markov decision process and propose a deep reinforcement learning (DRL) approach, specifically, a hybrid asynchronous advantage actor-critic distributed proximal policy optimization (A3C DPPO) algorithm. The A3C DPPO algorithm is tailored to handle the continuous action space inherent in IM. The numerical results demonstrate that the proposed algorithm adaptively updates the inventory replenishment strategy under dynamic scenarios, resulting in lower inventory costs compared to various benchmarks. We also conduct numerical validation using real-world pharmaceutical inventory data to confirm the practical feasibility of the proposed algorithm.
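
The Markov decision process mentioned above can be pictured as a single-period transition for perishable stock. The sketch below is a simplified assumption, not the paper's formulation: the state is the stock split by remaining shelf life, the action is an order quantity, orders arrive at the end of the period (no variable lead time), and all prices and costs are made-up values.

```python
def step(stock, order_qty, demand, price=5.0, unit_cost=2.0,
         holding_cost=0.1, waste_cost=1.0):
    """One period of a toy perishable-inventory MDP.

    stock[i] holds the units with i + 1 periods of shelf life left.
    Returns (next_stock, reward).
    """
    stock = list(stock)
    remaining = demand
    # Serve demand from the oldest units first.
    for i in range(len(stock)):
        used = min(stock[i], remaining)
        stock[i] -= used
        remaining -= used
    sold = demand - remaining
    expired = stock[0]                      # unsold units on their last period
    next_stock = stock[1:] + [order_qty]    # age the rest; the order arrives fresh
    reward = (price * sold - unit_cost * order_qty
              - holding_cost * sum(next_stock) - waste_cost * expired)
    return next_stock, reward


next_stock, reward = step([2, 3, 0], order_qty=4, demand=4)
print(next_stock, reward)
```

A DRL agent such as the one described would choose `order_qty` from the observed stock vector; the policy itself is outside the scope of this sketch.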

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

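Scoring rationale: The paper belongs to operations research and supply chain optimization, and mainly applies deep reinforcement learning (DRL) to a pharmaceutical inventory problem. The supplied keyword set emphasizes multimodal large-model architecture (MLLM, MultiModal), generative world models, and unified models (Unify Models, Tokenizer, Visual Encoder), which are largely unrelated to this paper's domain. 'model-based RL' has slightly higher relevance (2 points) because the paper involves reinforcement learning, but its A3C/PPO algorithms are usually classed as model-free policy-gradient methods rather than model-based reinforcement learning in the strict sense. The remaining keywords have very low relevance (0-1 points). The author list does not include Yang Shi or the other designated experts, so no bonus applies. The weighted total is 6.0, far below the dynamic pass mark of 27.8.

Keywords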

Pharmaceutical supply chains, Inventory management, Deep reinforcement learning, A3C DPPO algorithm, Markov decision process, Dynamic inventory replenishment, Stochastic demand

160. CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model [FAIL]

Score: 6.0 / 27.8

Authors: Zeyang Yue, Chenfei Yan, Feifei Zhao, Haibo Tong, Mengwen Xu, Xiaozhen Wang, Erliang Lin, Yi Zeng

Published: 2026-06-04

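TL;DR: CogManip builds a benchmark for evaluating the risk of psychological manipulation by LLMs in multi-turn dialogue, revealing heterogeneous risk across models and highlighting the need for prompt-based defenses.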

Abstract

Whether Large Language Models (LLMs) exhibit covert psychological manipulation in complex human-AI interactions has garnered increasing safety concerns. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, failing to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models like GPT-5.4 and DeepSeek-V3.2, reveals significant risk heterogeneities and illuminates the targeted direction for future defense. Further analysis of objective function perturbation reveals that DeepSeek-V3.2's manipulation tactics are highly sensitive to both negative and benign system prompts, demonstrating the critical necessity of prompt-based defense engineering and implicit goal auditing. CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

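Scoring rationale: The paper mainly concerns a benchmark (CogManip) for psychologically manipulative behavior by large language models (LLMs) in multi-turn dialogue, which falls under AI safety and alignment. Among the supplied keywords, 'Unify Models' and 'MLLM' have some connection to the topic (it evaluates multiple models, and its core subject is LLMs), but the paper proposes neither a unified model architecture nor a modality fusion technique. The remaining keywords (Tokenizer, Visual Encoder, World Models, MultiModal, model-based RL) are unrelated: the paper is about the safety of text interaction, not model internals, visual components, or reinforcement learning. Relevance scores are therefore low, and the weighted total is far below the dynamic pass mark.

Keywords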

CogManip, Manipulative Behavior, Multi-Turn Interactions, Large Language Model, AI Safety, Benchmark, Prompt Defense, Risk Assessment

161. Integrating Mechanistic and Data-Driven Models for Neurological Disorders through Differentiable Programming [FAIL]

Score: 6.0 / 27.8

Authors: Shah Pallav Dhanendrakumar, Saikat Pal, Sitikantha Roy

Published: 2026-06-04

TL;DR: This paper proposes hybrid modeling strategies combining mechanistic and data-driven approaches using differentiable programming to improve diagnosis and treatment planning for neurological disorders.

Abstract

Advances in computational modeling, neuroimaging, and artificial intelligence are revolutionizing the modeling of neurological disorders for improved diagnostics, prognosis, and treatment planning. Mechanistic models provide valuable scientific insight into the disorders, but in practice they are often simplified with assumptions or computationally expensive and slow to solve. However, while purely data driven approaches provide speed and scalability, they require large, high quality data to train and generally suffer from interpretability and generalization issues. This perspective paper presents a structured overview of hybrid modeling strategies, which combine deep learning models with physics based solvers, and are categorized into parallel, series, and parallel-series architectures. Three main approaches that have been emphasized are residual modeling for missing or incomplete physics, Neural Ordinary Differential Equations (NODEs) for continuous time dynamics approximation, and solver in the loop that accelerates traditional solvers with neural approximations. These hybrid models integrate the governing differential equation based formulations and deep learning to characterize the evolution of neurological disorders, and promise advanced personalized neurological modeling. In addition, the study explores and proposes different hybrid configurations to improve diagnosis accuracy, predict disease progression, and inform treatment strategies across a range of neurological disorders. These capabilities outperform standalone mechanistic or purely data driven approaches, making hybrid modeling a powerful tool, especially in applications involving modeling the progression and treatment responses in neurological conditions such as brain tumors, Alzheimer's disease, and stroke.
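
Residual modeling, the first of the three approaches named above, can be sketched as a mechanistic right-hand side plus a learned correction, integrated with an explicit Euler step. The logistic growth law, the linear "network", and all parameter values below are illustrative assumptions rather than models from the paper; in a differentiable-programming setting the correction's parameters would be fitted by backpropagating through the solver.

```python
def logistic_growth(u, rate=0.3, capacity=1.0):
    """Mechanistic term: du/dt = rate * u * (1 - u / capacity)."""
    return rate * u * (1.0 - u / capacity)


def learned_residual(u, theta):
    """Stand-in for a trained network: a linear correction theta[0] + theta[1] * u."""
    return theta[0] + theta[1] * u


def simulate(u0, theta, dt=0.1, steps=100):
    """Explicit Euler integration of mechanistic term plus residual."""
    u, trajectory = u0, [u0]
    for _ in range(steps):
        u = u + dt * (logistic_growth(u) + learned_residual(u, theta))
        trajectory.append(u)
    return trajectory


mechanistic_only = simulate(0.05, theta=(0.0, 0.0))
with_residual = simulate(0.05, theta=(0.0, -0.1))  # hypothetical fitted correction
print(mechanistic_only[-1], with_residual[-1])
```

A Neural ODE would go one step further and replace the whole right-hand side with a network, while the residual form keeps the known physics and learns only what it misses.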

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	4.0/10	6.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

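Scoring rationale: The paper studies hybrid modeling of neurological disorders (combining mechanistic and data-driven models) using differentiable programming and NODEs. The supplied keywords mainly concern large-language-model architecture (Tokenizer, Visual Encoder, MLLM, MultiModal) and reinforcement learning (World Models, model-based RL), which are far from the paper's field and methods. Only 'Unify Models' has a faint conceptual connection, because the title contains 'Integrating Models'.

Keywords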

Mechanistic Models, Data-Driven Models, Differentiable Programming, Neural Ordinary Differential Equations, Hybrid Modeling, Neurological Disorders, Disease Progression

162. EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation [FAIL]

Score: 6.0 / 27.8

Authors: Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

Published: 2026-06-04

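TL;DR: This paper proposes an evidence-grounded peer review generation framework based on multi-agent teacher distillation. It targets generic review comments and high inference cost, achieving strong factual grounding at low overhead.

Translated abstract

Scientific peer review generation has drawn growing attention because it can ease reviewing burdens and provide timely feedback. However, existing methods based on large language models (LLMs) tend to produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an evidence-grounded and traceable review generation framework based on multi-agent teacher distillation. EGTR-Review first builds a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills the intermediate reasoning trajectories and the final review comments into a lightweight student model through task-prefix-driven multi-task learning. In addition, an evidence-weighted objective further reduces the influence of weak, missing, or unverifiable supervision signals. Experiments on public peer-review datasets show that EGTR-Review (student model) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines on automatic metrics, LLM-as-Judge evaluation, and human evaluation, while substantially reducing token consumption and inference time and retaining strong factual grounding and source traceability. Our code, prompts, configuration files, and sample data are open-sourced on GitHub.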

Abstract

Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

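Scoring rationale: The paper studies text-only peer review generation via multi-agent teacher distillation and does not involve visual encoders, world models, reinforcement learning, or multimodal fusion. 'Unify Models' is reflected only in the knowledge distillation, which is a weak link. Tokenizer and MLLM are relevant only as basic LLM components, not as core contributions.

Keywords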

Scientific Peer Review Generation, Multi-Agent Teacher Distillation, Evidence-Grounded, Large Language Model, Inference Efficiency, Source Traceability, Task-Prefix-Driven, Multi-Task Learning

163. Learning of Robot Safety Policies via Adversarial Synthetic Scenarios [FAIL]

Score: 6.0 / 27.8

Authors: Nikolai Dorofeev, Alexey Odinokov, Rostislav Yavorskiy

Published: 2026-06-04

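TL;DR: This paper proposes an agentic gamification framework built on adversarial synthetic scenarios, in which Red Team and Blue Team interaction efficiently uncovers high-risk edge cases for learning robot safety policies.

Translated abstract

In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent those failures. This iterative process efficiently uncovers high-risk edge cases that random simulation or manual enumeration would be unlikely to capture. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work offers a scalable path to embedding safety in Physical AI systems that operate in complex real-world environments. The paper describes work in progress. Its contributions are the problem formulation and the proposed solution architecture.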

Abstract

In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent them. This iterative process enables efficient discovery of high-risk edge cases that are unlikely to be captured through random simulation or manual enumeration. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work provides a scalable pathway for embedding safety into Physical AI systems operating in complex real-world environments. The paper describes ongoing work. The contribution is a problem formulation and a proposed solution architecture.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

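Scoring rationale: The paper focuses on robot safety and adversarial simulation, and does not involve Tokenizer, Visual Encoder, MLLM, MultiModal, or model unification. Because it involves policy learning, it has a weak conceptual link to World Models and model-based RL, but these are not core content. No designated expert appears in the author list.

Keywords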

Robot Safety Policies, Adversarial Synthetic Scenarios, Gamification Framework, Red Team Blue Team, Hazard-informed Learning, Physical AI Systems, Scenario Generation

164. Maximising the Set-Piece Return: Optimising Football Corner Tactics with Graph Reinforcement Learning [FAIL]

Score: 6.0 / 27.8

Authors: Sean Groom, Michael Groom, Francisco Belo, Axl Rice, Liam Anderson, Victor-Alexandru Darvariu, Shuo Wang

Published: 2026-06-04

TL;DR: This paper applies graph reinforcement learning to optimize football corner kick tactics, discovering novel player configurations that maximize shot probability beyond historical imitation.

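Translated abstract

Machine learning is increasingly used to evaluate football tactics. However, existing methods focus mainly on characterising historical actions or analyst-specified counterfactual scenarios. In this work, we aim to go beyond imitating historically observed patterns and instead discover new, generalisable player configurations and strategies. To this end, we focus on optimising corner kick routines and formulate a decision-making problem in which a central policy adjusts the positions and velocities of attacking players to maximise first-contact shot probability. Unlike classic optimisation, which solves for isolated configurations, we propose a reinforcement learning architecture that operates on graph-structured data and yields a general policy for adjusting arbitrary starting player positions. Evaluation on more than 3,000 Premier League corners shows that our method significantly outperforms baseline optimisation techniques under matched inference budgets. Our results suggest that graph reinforcement learning can move set-piece analysis from historical evaluation and imitation towards reward-driven tactical discovery.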

Abstract

Machine learning is increasingly employed for the evaluation of football tactics. However, existing approaches focus on characterising historical actions or analyst-specified counterfactual scenarios. In this work, we seek to go beyond the imitation of historically observed patterns towards discovering new generalisable player configurations and strategies. To tackle this, we focus on optimising corner kick routines, and formulate a decision-making problem in which a central policy makes adjustments to attacking player positions and velocities to maximise first contact shot probability. Unlike classic optimisation that solves for isolated setups, we contribute a reinforcement learning architecture operating on graph-structured data that yields a general policy for adjusting arbitrary starting player positions. Evaluated on over 3,000 Premier League corners, our approach strongly outperforms baseline optimisation techniques under matched inference budgets. Our results suggest that graph reinforcement learning can shift set-piece analysis from historical evaluation and imitation towards reward-driven tactical discovery.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	4.0/10	6.0

Scoring rationale: The paper focuses on Graph Reinforcement Learning for football tactics optimization. It does not address multimodal fusion, large language models, tokenization, or world model generation as defined in the keywords. Only the RL aspect shows partial relevance, hence a moderate score for model-based RL while others are zero.

Keywords

Graph Reinforcement Learning, Football Tactics, Corner Kick Optimization, Policy Learning, Set-Piece Analysis, Reward-Driven Discovery, Player Positions

165. EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading [FAIL]

Score: 6.0 / 27.8

Authors: Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews, Cesare Aloisi, Yulan He

Published: 2026-06-04

TL;DR: The paper proposes an Evidence-Diagnosed Intervention Training framework to improve LLM grading faithfulness by locating problematic reasoning steps and calibrating beliefs via reinforcement learning, outperforming baselines on multi-subject benchmarks.

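Translated abstract

Reliable rubric grading depends on more than accurate score prediction. Every judgement must be grounded in the mark scheme and in evidence from the student answer. Existing credit-assignment and intervention methods were designed mainly for self-contained reasoning tasks such as mathematical reasoning, and they perform poorly in this setting because they cannot identify where grading reasoning goes wrong or track how the model's belief about the final mark changes during reasoning. This paper proposes Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training LLM graders that adhere more closely to the rubric. First, EDIT-SFT uses internal model signals, namely the posterior belief over the final mark and input-grounding scores, to locate problematic reasoning steps. It then revises only those local steps with the help of a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still permitting helpful exploration. Experiments on two real-world, multi-subject grading benchmarks show that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, and ablation studies confirm that internal-state diagnostics are the main driver of these gains.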

Abstract

Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathematics reasoning, struggle in this setting because they do not identify where grading reasoning goes wrong or how the model's belief about the final mark changes during reasoning. We propose Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training more rubric-faithful LLM graders. First, EDIT-SFT locates problematic reasoning steps using internal model signals: posterior belief over the final mark and input-grounding scores. It then revises only these local steps with help from a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still allowing helpful exploration. Experiments on two real-world, multi-subject grading benchmarks demonstrate that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, with ablation studies confirming that internal-state diagnostics drive these gains.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on LLM grading faithfulness via intervention training and belief-guided RL. It does not involve multimodal architectures (Visual Encoder, MLLM, MultiModal), tokenizer design, or world models for environment simulation. While it uses RL, it is not model-based RL for planning, and does not unify model architectures.

Keywords

Evidence-Diagnosed Intervention Training, Rule-Faithful LLM Grading, Rubric Grading, Belief-Guided Reward Shaping, Internal Model Signals, Multi-Subject Grading Benchmarks, Reinforcement Learning

166. Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness [FAIL]

Score: 6.0 / 27.8

Authors: Victor De Marez, Luna De Bruyne, Walter Daelemans

Published: 2026-06-04

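TL;DR: By decomposing factual sycophancy into truth margin and manipulation sensitivity, this paper shows that model size mainly governs vulnerability, while instruction tuning affects robustness in different ways.

Translated abstract

Factual sycophancy occurs when a language model abandons a correct, verifiable answer under social pressure. A flip happens only when the pressure toward a false answer exceeds the model's neutral preference for the truth, so flip rates conflate two mechanisms: the strength of that baseline preference (truth margin) and how far pressure shifts it (manipulation sensitivity). We decompose factual sycophancy into these two channels and use them to separate the effects of size and instruction tuning across 56 open-weight models (0.3B-32B parameters) and 13 manipulation types. We find that vulnerability is determined mainly by size, but instruction tuning changes how size acts: small instruction-tuned models can become less robust, whereas large instruction-tuned models usually become more robust. Instruction tuning mainly increases truth margin, but its behavioral effect depends on the manipulation type. Scaling also affects the two channels differently: base models gain margin but become slightly more manipulation-sensitive, whereas instruction-tuned models gain margin faster and become less sensitive. Factual sycophancy is therefore not a single scalar property. Evaluations should report channel-specific, manipulation-specific, and size-conditioned robustness rather than flip rates alone.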

Abstract

Factual sycophancy occurs when a language model abandons a correct, verifiable answer under social pressure. Because a flip occurs only when pressure toward a false answer exceeds the model's neutral preference for the truth, flip rates conflate two mechanisms: the strength of that baseline preference (truth margin), and how far pressure shifts it (manipulation sensitivity). We decompose factual sycophancy into these channels and use them to separate the effects of size and instruction tuning across 56 open-weight models spanning 0.3B-32B parameters and 13 manipulation types. We find that vulnerability is governed mainly by size, but instruction tuning changes how size acts: small instruction-tuned models can become less robust, whereas large instruction-tuned models usually become more robust. Instruction tuning primarily increases truth margin, but its behavioral effect depends on manipulation type. Scaling also changes the two channels differently: base models gain margin but become mildly more manipulation-sensitive, whereas instruction-tuned models gain margin faster and become less sensitive. Factual sycophancy is therefore not a single scalar property. Evaluations should report channel-specific, manipulation-specific, and size-conditioned robustness rather than flip rates alone.
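The decomposition in the abstract can be made concrete with a small sketch. It assumes, purely for illustration, that both channels are measured as differences of log-probabilities between the true and the false answer; the `Probe` fields and the example numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Log-probabilities of the true and false answers (hypothetical measurements)."""
    logp_true_neutral: float
    logp_false_neutral: float
    logp_true_pressured: float
    logp_false_pressured: float


def truth_margin(p: Probe) -> float:
    """Baseline preference for the truth under a neutral prompt."""
    return p.logp_true_neutral - p.logp_false_neutral


def pressure_shift(p: Probe) -> float:
    """How far pressure moves the margin toward the false answer."""
    pressured_margin = p.logp_true_pressured - p.logp_false_pressured
    return truth_margin(p) - pressured_margin


def flips(p: Probe) -> bool:
    """A flip occurs only when the shift exceeds the neutral margin."""
    return pressure_shift(p) > truth_margin(p)


confident_but_sensitive = Probe(-0.2, -2.2, -1.5, -0.5)   # margin 2.0, shift 3.0
unsure_but_stable = Probe(-0.6, -1.1, -1.0, -0.9)         # margin 0.5, shift 0.6
confident_and_stable = Probe(-0.2, -2.2, -0.4, -1.8)      # margin 2.0, shift 0.6

for probe in (confident_but_sensitive, unsure_but_stable, confident_and_stable):
    print(round(truth_margin(probe), 2), round(pressure_shift(probe), 2), flips(probe))
```

The first two probes both flip, but for different reasons (a large shift versus a small margin), and the third has the same shift as the second without flipping. This is the sense in which a flip rate alone conflates the two channels.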

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

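Scoring rationale: The paper decomposes factual sycophancy in text-only language models and analyzes how model size and instruction tuning affect robustness, which places it in NLP safety and alignment. The supplied keyword set emphasizes multimodality (Visual Encoder, MultiModal, MLLM), world models, and reinforcement learning (World Models, model-based RL), and has no direct overlap with this text-only study, so the relevance scores are very low.

Keywords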

Factual Sycophancy, Language Models, Instruction Tuning, Model Size, Robustness, Truth Margin, Manipulation Sensitivity

167. The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models [FAIL]

Score: 6.0 / 27.8

Authors: Jinyang Zhang, Hongxin Ding, Yue Fang, Weibin Liao, Muyang Ye, Junfeng Zhao, Yasha Wang

Published: 2026-06-04

TL;DR: This paper proposes using the ℓ2 norm of hidden states as an intrinsic signal to monitor and enhance reasoning dynamics in Large Language Models through test-time scaling techniques without additional training.

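Translated abstract

Recent studies have sought to understand reasoning in large language models (LLMs), yet a principled, model-intrinsic signal that captures their layer-wise reasoning dynamics remains little explored. We fill this gap by showing that the l2 norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using sparse autoencoders (SAEs) as diagnostic probes, we find that an LLM's internal reasoning is marked by a sharp increase in reasoning-feature activations in the deep layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry, and prove theoretically that the l2 norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the l2 norm as a faithful indicator, with elevated norms consistently corresponding to critical reasoning steps. We then introduce three test-time scaling techniques guided by the l2 norm: (i) adaptive layer-wise reasoning recursion, (ii) endogenous reasoning state steering, and (iii) l2-guided response selection. These methods require no additional training or data and are compatible with advanced inference engines. Experiments across model architectures and benchmarks show that l2-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens for perceiving and controlling the latent reasoning dynamics of LLMs. Our code is available at https://github.com/zjy1298/The-Tell-Tale-Norm.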

Abstract

Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model-intrinsic signal that captures its layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the l2 norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs' internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry and theoretically prove that the l2 norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the l2 norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by l2 norms: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) l2-guided Response Selection, which requires no additional training or data and is compatible with advanced inference engines. Experiments across model architectures and benchmarks show that l2-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at https://github.com/zjy1298/The-Tell-Tale-Norm.
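To make the signal concrete, the following sketch computes the l2 norm of one hidden-state vector per layer and reports where the norm rises most sharply. The toy hidden states and the `sharpest_increase` helper are hypothetical; the paper's actual probing and test-time techniques are not reproduced here.

```python
import math
from typing import List, Sequence


def l2_norm(vector: Sequence[float]) -> float:
    """Euclidean (l2) norm of a hidden-state vector."""
    return math.sqrt(sum(x * x for x in vector))


def layerwise_norms(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """One norm per layer, given one hidden-state vector per layer."""
    return [l2_norm(h) for h in hidden_states]


def sharpest_increase(norms: Sequence[float]) -> int:
    """Index of the layer whose norm rises most relative to the previous layer."""
    if len(norms) < 2:
        raise ValueError("need at least two layers")
    jumps = [later - earlier for earlier, later in zip(norms, norms[1:])]
    return jumps.index(max(jumps)) + 1


# Toy hidden states for a single token across four layers (illustrative values only).
hidden = [[3.0, 4.0], [3.0, 4.0], [6.0, 8.0], [5.0, 12.0]]
norms = layerwise_norms(hidden)
print(norms, sharpest_increase(norms))
```

With real models the vectors would come from the per-layer hidden states of a forward pass; how those norms are then used for recursion, steering, or response selection is specific to the paper.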

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on LLM reasoning dynamics using hidden state norms and test-time scaling, lacking content on multimodality, visual encoders, world models, or reinforcement learning. Tokenizers and Unify Models have minimal contextual relevance as LLM components but are not the study's focus.

Keywords

Large Language Models, Reasoning Dynamics, ℓ2 Magnitude, Hidden States, Sparse Autoencoders, Test-time Scaling, Reasoning Intensity, Endogenous Signal

168. Large Language Models are Perplexed by some Political Parties [FAIL]

Score: 6.0 / 27.8

Authors: Paul Lerner, François Yvon

Published: 2026-06-04

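TL;DR: This paper uses perplexity to assess the political fairness of large language models, finding higher perplexity on far-right party texts than on social-democratic party texts, a gap it attributes to pretraining.

Translated abstract

Large language models (LLMs) are increasingly widely used, including in political applications, yet their political fairness has been little studied. This paper assesses it using perplexity, on the assumption that a fair model should assign equal probability to all political groups. However, across three datasets covering 37 languages and ten LLMs, we find that LLMs are more perplexed by texts from far-right and nationalist parties than by texts from social-democratic parties. This finding is consistent with previous work on translation fairness, and perplexity correlates with downstream translation metrics. The method applies both to base LLMs and to their instruction-tuned versions, and the two are highly correlated, suggesting that the political fairness of LLMs stems from pretraining and is hardly affected by instruction tuning.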

Abstract

Large Language Models (LLMs) are increasingly used, including in political applications, but their political fairness has been little studied. We assess it using perplexity, posing that a fair model should give equal probability to all political groups. However, we find, across ten LLMs and three datasets covering 37 languages, that LLMs are more perplexed by the texts of far right and nationalist parties than of social-democratic parties. We find this to be consistent with previous work on translation fairness, to the point that perplexity correlates with downstream translation metrics. Our method is applicable to both base LLMs as well as their instruction-tuned counterpart, and we find that both are highly correlated, suggesting that the political fairness of LLMs stems from their pretraining, and is hardly affected by instruction-tuning.
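Perplexity, the quantity this evaluation rests on, is the exponential of the negative mean log-probability per token. A minimal sketch, assuming per-token natural-log probabilities have already been obtained from a model (the two lists below are made-up placeholders, not data from the paper):

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Placeholder log-probabilities for two text groups (illustrative only).
logprobs_group_a = [math.log(0.25)] * 4   # each token assigned probability 0.25
logprobs_group_b = [math.log(0.5)] * 4    # each token assigned probability 0.5

ppl_a = perplexity(logprobs_group_a)
ppl_b = perplexity(logprobs_group_b)
print(round(ppl_a, 3), round(ppl_b, 3), round(ppl_a - ppl_b, 3))
```

Under the paper's fairness criterion, a persistent gap between groups of comparable text would indicate that the model assigns them unequal probability.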

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	2.0/10	3.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

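Scoring rationale: The paper studies the political fairness and perplexity of LLMs, whereas the keywords concern multimodality, world models, and reinforcement learning, so the topics do not match. Only the LLM foundation gives a faint link to MLLM and Unify Models; the remaining keywords are essentially unrelated.

Keywords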

Large Language Models, Political Fairness, Perplexity, Bias Detection, Pretraining, Instruction-tuning, Cross-lingual

169. Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction [FAIL]

Score: 6.0 / 27.8

Authors: Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan, Songtao Zhu, Fenglin Liu

Published: 2026-06-04

TL;DR: This paper proposes a self-supervised implicit neural representation framework with physics-based iterative refinement to solve CBCT truncation artifacts while preserving high-frequency textures.

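Translated abstract

Cone-beam computed tomography (CBCT) often suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated CBCT reconstruction have serious limitations, including heavy reliance on supervised ground truth and an inability to handle continuous 3D spatial truncation variations. To address these challenges, we propose a self-supervised 3D reconstruction framework based on neural scene representations. Under projection supervision, the method maps spatial coordinates directly to radiodensity, inherently bypassing traditional filtering and backprojection operations; this fundamentally eliminates truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are prone to an inherent spectral bias, which causes severe loss of clinically critical high-frequency textures. To overcome this bottleneck, we further integrate a physics-based iterative refinement module into the neural scene representation architecture. Using the artifact-free extrapolated volume produced by the coordinate network as an optimal initialization, this module progressively re-extracts high-frequency structural information from the original projections and injects it into the volume. Extensive experiments on simulated and real datasets show that the method successfully unifies the strong artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.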

Abstract

Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated cone-beam computed tomography (CBCT) reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach inherently bypasses traditional filtering and backprojection operations, thereby fundamentally eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an optimal initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the exceptional artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on medical imaging (CBCT) reconstruction using implicit neural representations, which has low alignment with the provided MLLM/RL keyword set. 'Unify Models' scores 2.0 due to the abstract's mention of unifying neural and iterative methods. 'Visual Encoder' and 'MultiModal' score 1.0 for involving image data processing. Tokenizer, World Models, MLLM, and model-based RL are irrelevant (0.0). Weighted total is 6.0, below the 27.8 threshold.
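The weighted total quoted in the rationale above can be reproduced from the score table: each keyword's score is weight × relevance, and the sum is compared with the passing score. The sketch below assumes that a paper passes when its total reaches the threshold; the report itself only shows totals below it.

```python
WEIGHT = 1.5           # per-keyword weight shown in the table
PASS_THRESHOLD = 27.8  # passing score shown next to each paper's total

# Relevance values (out of 10) from the table for paper 169.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 1.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 1.0,
    "model-based RL": 0.0,
}


def weighted_total(scores: dict, weight: float = WEIGHT) -> float:
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in scores.values())


total = weighted_total(relevance)
# Assumption: PASS means total >= threshold.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(total, verdict)
```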

Keywords

Cone beam CT, Truncated reconstruction, Implicit neural representation, Self-supervised, Physics-based iterative refinement, Texture preservation, Neural scene representations, Radiodensity mapping

170. PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training [FAIL]

Score: 4.5 / 27.8

Authors: Senmiao Wang, Tiantian Fang, Haoran Zhang, Yushun Zhang, Kunxiang Zhao, Alex Schwing, Ruoyu Sun

Published: 2026-06-04

TL;DR: The paper proposes a polynomial weight preconditioning layer to stabilize LLM pre-training optimization without incurring inference overhead.

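Translated abstract

We propose a preconditioning (PC) layer, a weight parameterization based on a polynomial preconditioner that is intended to keep weight conditioning stable throughout large language model (LLM) training. The PC module reshapes the singular-value spectrum of weight matrices through low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture with no inference overhead. We demonstrate the advantage of the proposed PC layer over standard Transformers in Llama-1B pre-training, with both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima in certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.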

Abstract

We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.
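One way to see how a polynomial can reshape a singular-value spectrum: for W = UΣVᵀ, the product W·p(WᵀW) keeps U and V and maps each singular value σ to σ·p(σ²). The abstract does not state which polynomial the PC layer uses, so the sketch below assumes the classic Newton-Schulz cubic, p(s) = 1.5 − 0.5s, and applies it directly to an illustrative spectrum.

```python
from typing import List, Sequence


def newton_schulz_step(sigma: float) -> float:
    """Map sigma -> sigma * p(sigma^2) with p(s) = 1.5 - 0.5 * s (an assumed polynomial)."""
    return 1.5 * sigma - 0.5 * sigma ** 3


def condition_number(singular_values: Sequence[float]) -> float:
    """Ratio of the largest to the smallest singular value."""
    return max(singular_values) / min(singular_values)


def precondition(singular_values: Sequence[float]) -> List[float]:
    """Apply the polynomial map to every singular value."""
    return [newton_schulz_step(s) for s in singular_values]


spectrum = [0.1, 0.5, 1.0]            # illustrative spectrum, scaled into (0, 1]
preconditioned = precondition(spectrum)

print(condition_number(spectrum), condition_number(preconditioned))
```

For singular values in (0, 1], this cubic pushes small values upward and leaves 1 fixed, so the condition number shrinks. The paper's polynomial, its degree, and how it is applied during training may differ.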

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on optimization techniques (polynomial preconditioning) for LLM pre-training, addressing weight conditioning and convergence. It lacks content on multimodal integration, tokenizers, visual encoders, world models, or reinforcement learning. While it modifies LLM architecture (Unify Models) and is an LLM paper (MLLM), it does not align with the specific background themes of multimodal world models or RL.

Keywords

PC Layer, Polynomial Weight Preconditioning, LLM Pre-training, Singular-Value Spectrum, Weight Parameterization, Gradient Descent Convergence, Inference Overhead

171. Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents [FAIL]

Score: 4.5 / 27.8

Authors: AJ Carl P. Dy, Aivin V. Solatorio

Published: 2026-06-04

TL;DR: This paper benchmarks open-source layout detection models for data snapshot extraction in institutional documents, finding that current models fail to generalize to operational contexts despite strong academic performance.

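Translated abstract

Institutional documents contain a large amount of operational and analytical information embedded in figures and tables. Current methods for extracting visual content from documents are built mainly around generic document layout analysis, in which figures and tables are treated as uniformly relevant document objects rather than as semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts in institutional documents. The benchmark covers humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmark several open-source layout detection models and evaluate both detection performance and spatial extraction quality. The results show that, despite strong performance on conventional academic benchmarks, current models struggle to generalize to operational institutional documents. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of the contextual information needed for interpretation. These findings highlight a significant gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotated dataset, metadata, and source code to support future research on operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.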

Abstract

Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	2.0/10	3.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on document layout analysis and data snapshot extraction for institutional documents. It does not address Unify Models, Tokenizers, World Models, MLLMs, or Model-Based RL. There is minimal connection to Visual Encoder (as layout detection uses vision backbones) and MultiModal (documents contain text and images), but these are not central to the study's contribution regarding representation learning or reinforcement learning. Weighted score is 4.5, below the dynamic passing score of 27.8. No expert authors found.

Keywords

Layout Detection, Data Snapshot Extraction, Institutional Documents, Benchmark Dataset, Open-Source Models, Document Intelligence, Visual Artifacts

172. Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs [FAIL]

Score: 4.5 / 27.8

Authors: Giovanni Dettori, Matteo Boffa, Danilo Giordano, Idilio Drago, Marco Mellia

Published: 2026-06-04

TL;DR: This paper identifies lexical density as a critical factor limiting effective context windows in LLMs, demonstrating that high information density causes significant performance collapse in retrieval tasks regardless of token length.

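Translated abstract

Input length and the position of relevant information are widely regarded as the main causes of degraded long-context performance in large language models (LLMs). This paper studies lexical density, the rate at which a context introduces distinct information, as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three "find-the-needle" style benchmarks that share the same length (about 12k tokens) and controlled needle positions but have increasing information density. We observe a sharp performance drop on the higher-density benchmarks: models that are near-perfect in sparse contexts fall below 60% retrieval score in denser ones. To rule out task-type confounds, we vary and control density within each benchmark while holding all other properties fixed. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results indicate that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems that operate on compact, information-rich inputs.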

Abstract

Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information -- as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three "find-the-needle" style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.
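The abstract defines lexical density as the rate at which a context introduces distinct information but does not give a formula. As a crude proxy, and only as an assumption, the sketch below uses the ratio of distinct whitespace-separated tokens to total tokens; the paper's actual measure may differ.

```python
def lexical_density(text: str) -> float:
    """Distinct tokens divided by total tokens (a simple type-token ratio proxy)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Two contexts of identical length (9 tokens) but different density.
sparse_context = "the cat sat the cat sat the cat sat"
dense_context = "alpha beta gamma delta epsilon zeta eta theta iota"

print(round(lexical_density(sparse_context), 3), lexical_density(dense_context))
```

Holding length fixed while varying this ratio mirrors, in miniature, the controlled comparison the abstract describes.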

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on lexical density in text-only LLMs, while the provided keywords target multimodal, world model, and reinforcement learning domains. Only 'Tokenizer' and 'Unify Models' have slight relevance to LLMs/tokens; others like 'Visual Encoder' and 'model-based RL' are completely unrelated to this text-centric context study.

Keywords

Lexical Density, LLMs, Context Window, Find-the-needle, Information Density, Long-context Performance, Effective Context

173. TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation [FAIL]

Score: 4.5 / 27.8

Authors: Eric Spencer, Arslan Bisharat, Brian Ortiz, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

Published: 2026-06-04

TL;DR: TLA-Prover leverages preference-optimized low-rank adaptation to enhance LLM-generated TLA+ formal specifications, achieving a 30% pass rate on verification benchmarks.

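Translated abstract

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently generate TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline achieves a 26.6% syntactic parse rate and an 8.6% semantic model-check rate. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to repair its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Each output is graded on four tiers: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way, and TLC must then detect a violation. If TLC still passes, the property was always true and contributes nothing, so the output fails Diamond. On a held-out 30-problem benchmark, TLA-Prover reaches 9/30 (pass@1 = 30%) at both Gold and Diamond, roughly 3.5 times the untuned baseline (8.6%). The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint, which prevents the trivial-property failure mode.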

Abstract

TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.
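The four-tier grading described above can be sketched as a cascade of checks. This is a minimal sketch, assuming the tiers are cumulative (each requires the previous one); the `CheckResult` fields are hypothetical stand-ins for the outcomes of parsing and TLC runs, and no real TLC invocation is shown.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    parses: bool            # the specification parses
    no_warnings: bool       # no warnings were reported
    passes_tlc: bool        # TLC model-checks the specification
    mutant_violation: bool  # TLC reports a violation after the property is slightly altered


TIERS = ["None", "Bronze", "Silver", "Gold", "Diamond"]


def grade(result: CheckResult) -> str:
    """Return the highest tier reached (assumed cumulative)."""
    if not result.parses:
        return "None"
    if not result.no_warnings:
        return "Bronze"
    if not result.passes_tlc:
        return "Silver"
    # Diamond: the altered property must be caught by TLC. If it still passes,
    # the original property was always-true and the output stays at Gold.
    return "Diamond" if result.mutant_violation else "Gold"


def pass_at_1(tiers: list[str], target: str) -> float:
    """Share of single-attempt outputs that reach at least the target tier."""
    hits = sum(TIERS.index(t) >= TIERS.index(target) for t in tiers)
    return hits / len(tiers)
```

With 9 of 30 outputs at Diamond, `pass_at_1` gives 0.3 at both Gold and Diamond, matching the reported coincidence of the two tiers.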

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on formal verification (TLA+) and LLM fine-tuning using SFT and GRPO/DPO. It does not address multimodal learning, visual encoders, world models, or model-based RL architectures. Tokenizer and Unify Models are tangential at best, resulting in low relevance scores for the provided keyword set.

Keywords

TLA+, Specification Synthesis, Large Language Models, Preference Optimization, Low-Rank Adaptation, Model Checking, GRPO

174. Harnessing Structural Context for Entity Alignment Foundation Models [FAIL]

Score: 4.5 / 27.8

Authors: Xingyu Chen, Yuanning Cui, Zequn Sun, Wei Hu

Published: 2026-06-04

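TL;DR: This paper proposes ContextEA, a framework that exploits structural context in heterogeneous knowledge graphs to substantially improve the transfer performance of entity alignment foundation models, outperforming fine-tuned baselines on unseen KG pairs.

Abstract (Chinese translation)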

实体对齐（EA）旨在识别异构知识图谱（KG）之间的等价实体，是知识融合和跨知识图谱推理的关键组成部分。最近的 EA 基础模型表明，对齐知识一旦经过预训练，即可直接应用于多样化的先前未见过的知识图谱对。然而，它在两个方面仍未能充分利用结构上下文：编码期间的跨知识图谱交互较弱，且最终候选排序仍过于依赖粗略相似度。我们通过 ContextEA 解决了这些局限性，这是一种用于可转移实体对齐的增强型编码器 - 解码器框架。在编码器侧，我们引入了一种跨知识图谱交互编码器，它利用锚点桥统一两个知识图谱，并执行更早的关系感知跨图谱传播。在解码器侧，我们引入了一种结构校准解码器，它利用实体级、邻域级、关系级和锚点感知结构证据来校准对齐分数。该设计在保持轻量级的同时，加强了结构上下文构建与利用。在 OpenEA、SRPRS 和 DBP 中的 29 个实体对齐数据集上的实验表明，该方法相对于强大的可转移基线始终表现出提升。值得注意的是，预训练的 ContextEA 已在所有三个基准组上超过了微调基线，表明其对未见知识图谱的转移能力显著更强。这些结果表明，显式利用结构上下文是改进 EA 基础模型的有效方向。

Abstract

Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

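Scoring rationale: The paper focuses on entity alignment across knowledge graphs and the use of structural context, which belongs to knowledge representation learning. The provided keywords mainly concern multimodal large models, world models, and reinforcement learning (e.g. Tokenizer, Visual Encoder, MLLM, World Models, model-based RL), which do not match the paper's domain at all. Only 'Unify Models' has a weak semantic link, because the abstract says the encoder 'unifies the two KGs', but this does not involve unifying model architectures, so relevance is very low. No designated expert authors were found.

Keywords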

Entity Alignment, Knowledge Graphs, Structural Context, Foundation Models, Encoder-Decoder, Transferable, Cross-KG Interaction

175. Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning [FAIL]

Score: 4.5 / 27.8

Authors: Yuanzhi He, Victor Romero-Cano, José J. Patiño, Juan David Hernández, William Sawtell, Gualtiero Colombo

Published: 2026-06-04

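TL;DR: This paper proposes a sample-efficient motion planning framework that combines transfer learning with reward redesign to raise success rates on robotic manipulation tasks, validated in simulation and on a real robot.

Abstract (Chinese translation)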

随着机器人系统日益复杂，其运动规划模型复杂性的增加以及更长的训练时间带来了严峻挑战。进化算法，例如样本高效交叉熵方法（iCEM），近期通过利用高效的知识重用策略来提升性能，在底层实时规划方面展现了巨大的潜力。尽管在许多控制任务中表现有效，iCEM 在更复杂的场景下性能可能受到限制，尤其是涉及堆叠、滑动和货架放置的场景。本文提出了一种新颖的 iCEM+TL 框架，该框架显式地利用了迁移学习（TL），通过将关键 iCEM 参数从更简单的上游任务转移至下游任务，以指导更复杂的任务。此外，我们通过任务分解应用了奖励重构（RR），针对堆叠物体和货架放置任务，以优化任务特定性能。仿真结果表明，我们的框架实现了高达 23% 的成功率提升。该框架在真实 Franka Emika 机器人的堆叠任务中得到了进一步验证，展示了其在实际部署中的可行性。

Abstract

As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.
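The idea of transferring planner parameters from an upstream task can be illustrated with a plain cross-entropy method (CEM) loop. This is a minimal sketch, not iCEM itself and not the paper's implementation: it assumes the transferred "key parameter" is the sampling mean, and the two quadratic cost functions are hypothetical stand-ins for an upstream and a downstream task.

```python
import random


def cem(cost, mean, std, iters=20, pop=64, n_elite=8, seed=0):
    """Plain cross-entropy method over a vector of planning parameters."""
    rng = random.Random(seed)
    mean, std = list(mean), list(std)
    for _ in range(iters):
        # Sample a population around the current diagonal Gaussian.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Keep the lowest-cost candidates as the elite set.
        elites = sorted(samples, key=cost)[:n_elite]
        # Refit the Gaussian to the elites, with a floor on the spread.
        for d in range(len(mean)):
            vals = [e[d] for e in elites]
            mean[d] = sum(vals) / n_elite
            var = sum((v - mean[d]) ** 2 for v in vals) / n_elite
            std[d] = max(var ** 0.5, 1e-3)
    return mean, std


def upstream_cost(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


def downstream_cost(x):
    return (x[0] - 1.2) ** 2 + (x[1] + 1.8) ** 2


# Solve the simpler upstream task from a generic initial distribution.
up_mean, up_std = cem(upstream_cost, mean=[0.0, 0.0], std=[2.0, 2.0])
# Transfer: start the downstream search at the upstream solution and give it
# a much smaller iteration budget than a search from scratch would get.
warm_mean, _ = cem(downstream_cost, mean=up_mean, std=[0.5, 0.5], iters=5)
```

The design point is that a related upstream solution places the initial sampling distribution close to the downstream optimum, so fewer samples are spent on exploration.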

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

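Scoring rationale: The paper focuses on robotic motion planning with evolutionary algorithms and transfer learning, and does not involve multimodal large models, tokenizers, visual encoders, or world models. It is only loosely related to model-based RL through the robot control domain, and its methodology is evolutionary rather than model-based RL.

Keywords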

Motion Planning, Robotic Manipulation, Transfer Learning, Evolutionary Algorithms, Sample-efficient, Reward Redesign, Zero-shot Transfer Learning

176. RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit [FAIL]

Score: 4.5 / 27.8

Authors: Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

Published: 2026-06-04

TL;DR: RedditPersona introduces a modular framework for standardizing community-conditioned LLM adaptation on Reddit, demonstrating trade-offs between behavioral identifiability and distributional similarity across different user partitioning strategies.

Abstract (Chinese translation)

社区条件化语言模型适配需要在数据收集、社区定义和评估等方面做出选择，而这些选择目前在每项研究中都是独立进行的，这使得比较假设或复用研究工件变得困难。我们提出了 RedditPersona，这是一个模块化框架，旨在标准化这些选择：它收集 Reddit 帖子和评论，对活跃用户进行画像，并根据五种分组策略（基于 subreddit (子版块)、图结构、语义、混合及基于交互）对其进行划分，随后通过 QLoRA 为每种策略训练一个参数高效适配器，最后在涵盖流畅度、保真度、分布对齐和社区可识别性的共享度量套件下进行评估。将该框架应用于城市福祉领域的 112 个 subreddit（涉及 301,429 个用户画像和 1600 万+ 评论），我们发现适配器的行为可识别性与每种策略相对于 subreddit 基线的内在一致性相吻合，并且在所有五种策略中，可识别性与真实文本分布相似性之间均存在一致的权衡。代码和配置文件可在以下网址获取：https://github.com/Ahghaffari/redditpersona。

Abstract

Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	3.0/10	4.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on text-based LLM adaptation for Reddit communities using a modular framework and QLoRA adapters. It lacks visual encoders, multimodal processing, reinforcement learning, or world modeling architectures. Only 'Unify Models' has minor relevance due to the modular framework unifying adaptation choices, but it does not align with the architectural unification typically implied by the keyword list. No expert authors from the specified list are present.

Keywords

Community-conditioned LLM Adaptation, Modular Framework, Reddit Posts, Parameter-efficient Adapter, QLoRA, User Profiling, Distributional Alignment, Behavioral Identifiability

177. Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs [FAIL]

Score: 4.5 / 27.8

Authors: Grama Chethan

Published: 2026-06-04

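TL;DR: To address the limits of vector-similarity retrieval for structural reasoning over knowledge graphs, this paper proposes an LLM query planner built on graph traversal and computation primitives, which outperforms bespoke handlers on specific query categories.

Abstract (Chinese translation)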

检索增强生成 (RAG) 在处理需要针对互联实体进行结构推理的查询时存在系统性失效。我们比较了八种用于航空航天供应链情报的检索架构，其演进路径从文本检索经过图遍历直至图计算。基于一个包含 46 个节点和 64 条类型化边的知识图谱，我们在 10 个意图类别下评估了 23 个查询，并证明了五个查询类别对向量检索而言是结构上不可达的。我们的核心发现是操作符词汇量假说（Operator Vocabulary Thesis）：基于大语言模型（LLM）的图推理障碍并非模型智能，而是作为工具可用的计算操作符。一个拥有 9 种类型化遍历原语的 LLM 查询规划器优于定制处理程序（F1 = 0.632 vs. 0.472），且能泛化至未见查询。添加 6 个图计算工具后，LLM 仅在遍历失败的查询类别中选择性地采用这些工具。我们还识别出一个度量差距：实体层面的 F1 系统性地低估了那些综合答案正确的结构查询。

Abstract

Retrieval-Augmented Generation (RAG) fails systematically on queries requiring structural reasoning over interconnected entities. We compare eight retrieval architectures for aerospace supply chain intelligence, progressing from text retrieval through graph traversal to graph computation. Using a 46-node knowledge graph with 64 typed edges, we evaluate 23 queries across 10 intent categories and demonstrate that five query classes are structurally unreachable for vector retrieval. Our central finding is the operator vocabulary thesis: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner with 9 typed traversal primitives outperforms bespoke handlers (F1 = 0.632 vs. 0.472) while generalizing to unseen queries. Adding 6 graph computation tools, the LLM selectively adopts them for exactly the query categories where traversal fails. We also identify a measurement gap: entity-level F1 systematically under-scores structural queries where comprehensive answers are correct.
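The measurement gap mentioned at the end of the abstract can be illustrated with a small entity-level F1 computation. This is a sketch under assumptions: the supplier names and the gold annotation are hypothetical, and the abstract does not say how the paper computes its F1, so the standard set-based definition is used.

```python
def entity_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based entity-level F1 between predicted and gold entities."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical structural query: "which suppliers sit upstream of part P?"
gold = {"SupplierA"}                              # annotation lists only the direct supplier
answer = {"SupplierA", "SupplierB", "SupplierC"}  # traversal also returns valid indirect suppliers
score = entity_f1(answer, gold)                   # precision 1/3, recall 1
```

Here every returned entity may be a legitimate upstream supplier, yet precision is 1/3 and the F1 is only about 0.5, because the metric penalizes anything outside the gold set.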

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

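Scoring rationale: The paper is mainly concerned with retrieval-augmented generation (RAG) and structural reasoning over knowledge graphs, involving graph traversal and computation tools. The provided keyword set emphasizes multimodality (MultiModal, MLLM, Visual Encoder), world models (World Models), reinforcement learning (model-based RL), model unification (Unify Models), and tokenizers (Tokenizer). The paper does not involve visual modalities, reinforcement learning, or world-model construction, so the corresponding keywords (Tokenizer, Visual Encoder, World Models, MultiModal, model-based RL) score 0. It uses an LLM, but not a multimodal one (MLLM), hence a score of 1.0; it touches on unifying retrieval architectures but does not meet the core definition of "model unification", hence a score of 2.0. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus was awarded. The weighted total is far below the dynamic passing score of 27.8, indicating low relevance to the targeted keyword directions.

Keywords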

Knowledge Graphs, Retrieval-Augmented Generation, Graph Traversal, Structural Reasoning, LLM Query Planner, Graph Computation, Vector Similarity

178. Learned Response-Field Inertia Operator for HEC-RAS 2D Water-Surface Elevation Prediction [FAIL]

Score: 4.5 / 27.8

Authors: Edward Holmberg, Elias Ioup, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Julian Simeonov

Published: 2026-06-04

TL;DR: This paper proposes a Learned Response-Field Inertia Operator (LRFIO) to accelerate HEC-RAS 2D water-surface elevation prediction via native-cell surrogate modeling, achieving significant computational speedup while maintaining solver consistency.

Abstract (Chinese translation)

本文展示了对用于 HEC-RAS 2D 中求解器一致的水面高程 (WSE) 预测的学习到的原生单元代理模型的跨数据集评估。为避免栅格重映射误差和信息访问混淆，代理模型直接在原始非均匀计算单元上进行评估，该显式策略将静态项目输入、当前水力状态、项目输入强迫、校准导出量以及未来求解器输出目标分离开来。我们引入了学习响应场惯性算子 (LRFIO)，这是一种无强迫、基于增量的学习代理，它从求解过的 HEC-RAS 轨迹中校准惯性响应算子，并通过闭式原生单元展开部署所保留的算子。LRFIO 评估了一个基于基准案例的响应层次结构，该结构包括持续性、全局校准惯性和分段响应场惯性。分段、残差修正和神经化惯性被视为可学习的建模选择，仅当验证证据证明其代价合理时，才保留增加的复杂度。在四个多样的 HEC-RAS 2D 基准上评估，LRFIO 为不同领域保留不同的响应结构，展示了自适应的学习复杂度。选择器审计显示复杂度受控，最大验证遗憾值为 4.30%。部署期间，保留的展开时间范围为 0.003 秒至 0.242 秒，而 Beaver Bayou 实测求解比较显示，相对于 HEC-RAS，其估计的 horizon 归一化加速比约为 2.75 × 10^4。这些结果表明，当前的原生单元增量是一个强大的求解器条件化预测支架，而增加的响应场、神经或空间复杂度仅在经验证合理时才应保留。

Abstract

This article presents a cross-dataset evaluation of learned native-cell surrogate models for solver-consistent water-surface elevation (WSE) prediction in HEC-RAS 2D. To avoid raster remapping error and information-access confounding, surrogates are evaluated directly on the original nonuniform computational cells under an explicit policy that separates static project inputs, current hydraulic state, project-input forcing, calibration-derived quantities, and future solver-output targets. We introduce the Learned Response-Field Inertia Operator (LRFIO), a no-forcing, increment-based learned surrogate that calibrates an inertial response operator from solved HEC-RAS trajectories and deploys the retained operator through closed-form native-cell rollout. LRFIO evaluates a base-case-first response hierarchy consisting of persistence, global calibrated inertia, and segmented response-field inertia. Segmentation, residual correction, and neuralized inertia are treated as learnable modeling choices, with added complexity retained only when validation evidence justifies its cost. Evaluated across four diverse HEC-RAS 2D benchmarks, LRFIO retains different response structures for different domains, demonstrating adaptive learned complexity. The selector audit shows controlled complexity with a maximum validation regret of 4.30%. During deployment, retained rollout times range from 0.003 s to 0.242 s, and the Beaver Bayou measured-solve comparison gives an estimated 2.75 x 10^4 horizon-normalized speedup over HEC-RAS. These results indicate that the current native-cell increment is a strong solver-conditioned predictive scaffold and that added response-field, neural, or spatial complexity should be retained only when empirically justified.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on hydraulic engineering simulation (HEC-RAS) using surrogate models for water-surface elevation prediction. It does not involve multimodal large language models (MLLM), tokenizers, visual encoders, or reinforcement learning frameworks. The provided keywords target AI/GenAI architectures, resulting in negligible overlap with this computational fluid dynamics study. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list, so no bonus points were applied.

Keywords

Learned Response-Field Inertia Operator, HEC-RAS 2D, Water-Surface Elevation Prediction, Surrogate Models, Native-Cell, Hydraulic Simulation, Solver-Consistent, Increment-based

179. Reactive Flux Matching: Mechanism Discovery and Adaptive Sampling of Rare Events [FAIL]

Score: 4.5 / 27.8

Authors: Rishal Aggarwal, David Ryan Koes, Nicholas M. Boffi, Eric Vanden-Eijnden

Published: 2026-06-04

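TL;DR: This paper proposes the Flux Matching framework, which learns a velocity field and a scalar potential from reactive trajectory data to discover the mechanisms of rare events and enable adaptive sampling, without knowledge of the underlying dynamics.

Abstract (Chinese translation)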

路径采样方法生成连接亚稳态的反应轨迹系综，但从这些数据中提取机理洞察仍然并非易事。我们引入 Flux Matching，这是一个直接从反应轨迹数据中学习两个互补对象的框架：一个是流速度 $u(z)$，其流线追踪主导反应路径；另一个是标量势 $h(z)$，通过对反应通量进行加权 Helmholtz-Hodge 分解获得，作为数据驱动的反应坐标。两者均在反应路径系综上最小化二次泛函，类似于生成建模中的 Flow Matching 损失，且无需了解底层动力学或稳态分布。与基于 Committor 的方法不同，$u$ 和 $h$ 在投影到非马尔可夫集体变量时仍定义明确，它们的水平集进而为使用增强采样方法进行改进采样提供了自适应界面。Flux Matching 通过在分子系统上生成流速度轨迹和计算速率常数得到验证。

Abstract

Path sampling methods generate ensembles of reactive trajectories connecting metastable states, but extracting mechanistic insight from these data remains nontrivial. We introduce Flux Matching, a framework that learns two complementary objects directly from reactive trajectory data: a current velocity $u(z)$, whose streamlines trace the dominant reaction pathways, and a scalar potential $h(z)$, obtained from a weighted Helmholtz-Hodge decomposition of the reactive current, that serves as a data-driven reaction coordinate. Both minimize quadratic functionals over the reactive path ensemble, analogous to the flow matching loss in generative modeling, and require no knowledge of the underlying dynamics or stationary distribution. Unlike committor-based methods, $u$ and $h$ remain well-defined under projection onto non-Markovian collective variables, and their level sets in turn provide adaptive interfaces for improved sampling with enhanced sampling methods. Flux Matching is validated through the generation of current velocity trajectories and rate constant calculations on molecular systems.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

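Scoring rationale: The paper belongs to computational chemistry and physics; its core content is reaction-mechanism discovery and adaptive sampling for rare events. Although the abstract mentions the flow matching loss from generative modeling and treats the velocity field and the potential within one framework, this has no direct connection to multimodal large language models (MLLM), visual encoders, tokenizers, or reinforcement learning (RL). The Unify Models and World Models keywords correspond only weakly at the level of abstract mathematical form, and the core application scenario does not match, so the relevance scores are very low.

Keywords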

Reactive Flux Matching, Mechanism Discovery, Adaptive Sampling, Rare Events, Reactive Trajectories, Helmholtz-Hodge Decomposition, Generative Modeling, Molecular Systems

180. Anchor PCA [FAIL]

Score: 4.5 / 27.8

Authors: Benedikt Seiter, Anya Fries, Julius von Kügelgen, Jonas Peters

Published: 2026-06-04

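TL;DR: Anchor PCA is a robust dimension reduction method for multi-domain data that balances explained variance against cross-domain agreement, yielding embeddings that explain more variance on unseen domains than the pooling baseline.

Abstract (Chinese translation)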

主成分分析（PCA）是应用最广泛的无监督降维技术之一。我们研究了针对来自多个相关领域数据的 PCA。由于主成分通常在各个领域之间存在差异，一种获取共享低秩嵌入的方法是在合并数据上执行 PCA。然而，该方法可能会关注那些仅在少数领域表现出高变异的虚假方向。为了找到一个在未见但相似的领域仍能解释大部分方差的鲁棒嵌入，我们转而关注共享变异方向。为此，我们引入了 Anchor PCA，该方法在整体解释方差与共享低秩嵌入和领域特定低秩嵌入之间的一致性之间进行权衡。Anchor PCA 相当于在修改后的目标矩阵上执行 PCA，因此可以高效求解。此外，我们表明 Anchor PCA 能够恢复最大不变子空间，并在有界领域特定协方差膨胀下具有极小极大重构解释。在具有时间漂移的模拟及真实世界气体传感器数据上，我们分别验证了 Anchor PCA 能够恢复最大不变子空间，且其产生的嵌入在未见领域上解释的方差优于合并基线（pooling baseline）和最坏情况替代方案（worst-case alternative）。综上所述，这些发现确立了 Anchor PCA 作为一种从多域数据进行鲁棒无监督降维的有前景的方法。

Abstract

Principal component analysis (PCA) is one of the most widely used unsupervised dimension reduction techniques. We study PCA for data from multiple related domains. Since principal components generally differ across domains, one way to obtain a shared low-rank embedding is to perform PCA on the pooled data. However, this approach can focus on spurious directions that exhibit high variation in only a few domains. To find a robust embedding that still explains most variance in unseen but similar domains, we propose instead to focus on shared directions of variation. To this end, we introduce Anchor PCA which trades off overall explained variance with agreement between the shared and domain-specific low-rank embeddings. Anchor PCA amounts to PCA on a modified target matrix and thus can be solved efficiently. Moreover, we show that Anchor PCA recovers a maximal invariant subspace and admits a minimax reconstruction interpretation under bounded domain-specific covariance inflations. On simulated and real-world gas sensor data with temporal drift, we demonstrate, respectively, that Anchor PCA recovers the maximally invariant subspace and yields embeddings that explain more variance on unseen domains than the pooling baseline and a worst-case alternative. Taken together, these findings establish Anchor PCA as a promising approach to robust unsupervised dimension reduction from multi-domain data.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

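Scoring rationale: The paper studies principal component analysis (PCA) for dimension reduction of multi-domain data, aiming to find a shared low-rank embedding via Anchor PCA. It has no direct connection to large-model architectures, visual encoders, world models, reinforcement learning, or tokenizers. There is only a very weak conceptual link to "Unify Models" (shared representation learning) and "MultiModal" (multi-domain versus multimodal), so the relevance scores are very low.

Keywords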

Principal Component Analysis, Multi-domain data, Shared low-rank embedding, Robust dimension reduction, Invariant subspace, Unsupervised learning, Domain-specific covariance

181. On the training of physics-informed neural operators for solving parametric partial differential equations [FAIL]

Score: 4.5 / 27.8

Authors: Nanxi Chen, Chuanjie Cui, Airong Chen, Sifan Wang, Rujin Ma

Published: 2026-06-04

TL;DR: This paper systematically evaluates training strategies for physics-informed neural operators across different architectures, demonstrating that CViT performs robustly and physics-informed training can match data-driven methods for solving parametric PDEs.

Abstract (Chinese translation)

物理信息神经算子（PINOs）旨在利用支配物理规律作为监督来学习偏微分方程的解算子，而非仅依赖成对输入 - 输出模拟数据。通过将物理约束纳入训练目标，PINOs 结合了神经算子的跨实例泛化能力与物理信息学习的数据效率。尽管前景广阔，但关于如何高效且鲁棒地训练 PINOs 的理解程度，尚不如数据驱动神经算子或物理信息神经网络（PINNs）的训练那样充分。为了弥合这一差距，我们考察了 PINOs 训练流程中的关键组件，包括架构设计、优化器选择、损失平衡以及配点采样策略。我们在五种不同的参数化偏微分方程系统上研究了三种代表性的算子骨干网络：深度算子网络（DeepONet）、傅里叶神经算子（FNO）和连续视觉变换器（CViT）。结果表明，CViT 在所有考虑的基准测试中始终展现出强劲且稳定的性能。除了架构之外，我们发现之前在 PINNs 训练中识别出的若干优化病理问题在 PINOs 中自然出现，包括梯度冲突和因果违反。我们还发现，为 PINNs 开发的缓解算法在 PINOs 场景下依然有效。我们进一步比较了不同数据条件下物理信息训练与数据驱动训练的结果，揭示出精心设计的物理信息训练流程可以匹配，甚至在某些情况下超越纯数据驱动神经算子。综上所述，这些发现为 PINOs 训练中的优化挑战提供了系统性的经验理解，并指导构建了一个用于高效且鲁棒的物理信息算子学习的实用流程。代码与数据可在 https://github.com/NanxiiChen/PI-CViT 获取。

Abstract

Physics-informed neural operators (PINOs) aim to learn solution operators for partial differential equations by using the governing physics as supervision, rather than relying solely on paired input-output simulation data. By incorporating physical constraints into the training objective, PINOs combine the cross-instance generalization of neural operators with the data efficiency of physics-informed learning. Despite this promise, how to train PINOs efficiently and robustly remains less well-understood than the training of either data-driven neural operators or physics-informed neural networks (PINNs). To bridge this gap, we examine key components of the PINO training pipeline, including architecture design, optimizer choice, loss balancing, and collocation-point sampling strategy. We study three representative operator backbones, Deep Operator Network (DeepONet), Fourier Neural Operator (FNO), and Continuous Vision Transformer (CViT), across five diverse parametric PDE systems. Our results show that CViT provides consistently strong and stable performance across the considered benchmarks. Beyond architecture, we find that several optimization pathologies previously identified in PINN training naturally arise in PINOs, including gradient conflicts and causal violation. We also find that mitigation algorithms developed for PINNs remain effective in the PINO setting. We further compare physics-informed and data-driven training under different data regimes, revealing that a carefully designed physics-informed training pipeline can match, and in some cases, outperform purely data-driven neural operators. Taken together, these findings provide a systematic empirical understanding of the optimization challenges in PINO training and inform a practical pipeline for efficient and robust physics-informed operator learning. Code and data are available at https://github.com/NanxiiChen/PI-CViT.
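For orientation, a physics-informed operator loss of the kind discussed here is often written as a weighted sum of a PDE residual term evaluated at collocation points and a boundary or initial-condition term. The form below is a generic sketch, not taken from the paper: the learned operator $G_\theta$, the input function $a$, the residual operator $\mathcal{N}$, the boundary operator $\mathcal{B}$ with data $g$, the weights $\lambda_r, \lambda_b$, and the point counts $N_r, N_b$ are all assumed notation.

```latex
\mathcal{L}(\theta) =
  \lambda_r \, \frac{1}{N_r} \sum_{i=1}^{N_r}
    \left| \mathcal{N}\!\left[ G_\theta(a) \right](x_i) \right|^2
  + \lambda_b \, \frac{1}{N_b} \sum_{j=1}^{N_b}
    \left| \mathcal{B}\!\left[ G_\theta(a) \right](x_j) - g(x_j) \right|^2
```

In this notation, loss balancing corresponds to choosing $\lambda_r$ and $\lambda_b$, and the collocation-point sampling strategy determines where the points $x_i$ are placed.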

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Physics-Informed Neural Operators (PINOs) for solving parametric PDEs, which belongs to scientific machine learning rather than the multimodal/RL domain implied by the keywords. It compares multiple operator backbones (DeepONet, FNO, CViT), yielding a low score for 'Unify Models' (2.0), and mentions CViT ('Visual Encoder', 1.0). All other keywords (Tokenizer, World Models, MLLM, MultiModal, model-based RL) are irrelevant (0.0). No expert authors from the target list are found. The weighted total is 4.5, well below the passing threshold of 27.8.
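The weighted total quoted above follows from the score table, where each row's score is weight times relevance. A minimal sketch of that arithmetic is shown below; the function name `weighted_total` is hypothetical, the `>=` comparison against the threshold is an assumption, and the expert-author bonus mentioned in the report is left out because its size is not stated.

```python
PASS_THRESHOLD = 27.8  # dynamic passing score quoted in the report


def weighted_total(rows: list[tuple[str, float, float]]) -> float:
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


# Rows copied from the score table for this paper.
rows = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 1.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]
total = weighted_total(rows)
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
```

For these rows the total is 3.0 + 1.5 = 4.5, which reproduces the "Score: 4.5 / 27.8" line and the FAIL label on this entry.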

Keywords

Physics-informed neural operators, Parametric partial differential equations, Deep Operator Network, Fourier Neural Operator, Continuous Vision Transformer, Training pipeline, Optimization pathologies, Data efficiency

182. IR3DE: A Linear Router for Large Language Models [FAIL]

Score: 4.5 / 27.8

Authors: Eros Fanì, Oğuzhan Ersoy

Published: 2026-06-04

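TL;DR: IR3DE is a ridge-regression-based linear router that efficiently selects domain-expert LLMs without retraining the router from scratch when experts are added or removed, matching or outperforming baselines on language modeling and reasoning tasks.

Abstract (Chinese translation)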

基础大语言模型（LLMs）在广泛的一般任务上表现出色，并通过领域专家 LLMs 在各种专门任务上取得显著成果。随着可用 LLMs 列表的不断增长，inference routers（推理路由器）被提出用于为每个 prompt（提示）选择最合适的 LLM。然而，现有的路由方法要么在从弱到强的通用型 LLMs 之间优化成本，要么需要大量训练才能支持领域专家路由。本文提出 IR3DE，一种基于 Ridge Regression（岭回归）的领域专家路由器，可为每个 prompt 提供低成本且快速的路由决策。我们在两种 CLM（因果语言建模）设置下评估 IR3DE，任务为所有领域的 next-token prediction（下一词预测）；以及一种推理设置，其中每个领域都有其独特的推理任务。尽管是一种线性路由器，IR3DE 在两种 CLM 设置下的性能与其他 baselines（基线）相当，并在推理设置中超越它们，归一化性能达到 98.4%。此外，IR3DE 允许添加或移除新的 domain experts（领域专家），而无需要求 router（路由器）从头重新训练，从而能够以最小化对 router 本身干扰的方式服务动态的 LLM 集合。我们的代码可在以下网址获取：github.com/gensyn-ai/IR3DE。

Abstract

Foundational Large Language Models (LLMs) demonstrate proficiency on a wide range of general tasks, and achieve remarkable results on various specialized tasks via domain-expert LLMs. With the ever-growing list of available LLMs, inference routers are being proposed to select the most appropriate LLM for each prompt. However, existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support domain-expertise routing. In this paper, we propose IR3DE, a Ridge Regression-based Router for Domain Experts that provides cheap and fast routing decisions for each prompt. We evaluate IR3DE in two Causal Language Modeling (CLM) settings where the tasks are next-token prediction for all domains, and one reasoning setting where each domain has its own distinct reasoning task. Despite being a linear router, IR3DE achieves performance comparable to the other baselines in both CLM settings, and surpassing them in the reasoning setting, with a normalized performance of 98.4%. Moreover, IR3DE enables the addition or removal of new domain experts without requiring the router to be retrained from scratch, allowing a dynamic set of LLMs to be served with minimal disruption to the router itself. Our code is available at: github.com/gensyn-ai/IR3DE.
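One way a ridge-regression router could support adding experts cheaply is to fit an independent weight vector per expert, so that a new expert only requires one more closed-form solve. The sketch below illustrates that idea under assumptions: the 2-d prompt features, the per-expert quality targets, and the expert names are hypothetical, and the abstract does not specify the features or targets IR3DE actually uses.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_expert(xs, ys, lam=1e-2):
    """Ridge weights for one expert: (X^T X + lam I)^-1 X^T y."""
    d = len(xs[0])
    gram = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(gram, xty)


def route(x, experts):
    """Pick the expert whose predicted score for prompt features x is highest."""
    return max(experts, key=lambda name: sum(w * v for w, v in zip(experts[name], x)))


# Toy prompt features and per-expert quality scores on the same prompts.
xs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
experts = {
    "code": fit_expert(xs, [1.0, 0.9, 0.0, 0.1]),
    "math": fit_expert(xs, [0.0, 0.1, 1.0, 0.9]),
}
# Adding an expert fits one new weight vector; existing weights are untouched.
experts["law"] = fit_expert(xs, [0.2, 0.2, 0.2, 0.2])
```

Removing an expert in this sketch is a dictionary deletion, which is why a linear router of this shape can serve a changing set of models.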

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

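Scoring rationale: The paper studies inference routing among large language models, using ridge regression to make linear routing decisions. It has no direct connection to the multimodal, world-model, visual-encoder, or reinforcement-learning topics in the keyword set, and only weak links to language modeling tasks (Tokenizer) and model management (Unify Models), so the relevance scores are low.

Keywords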

Large Language Models, Inference Router, Ridge Regression, Domain Experts, Causal Language Modeling, Dynamic Routing, Prompt Selection, Linear Router

183. IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval [FAIL]

Score: 4.5 / 27.8

Authors: Xiaoman Wang, Yaoze Zhang, Wenzhuo Fan, Hongwei Zhang, Ding Wang, Guohang Yan, Song Mao, Botian Shi, Yunshi Lan, Pinlong Cai

Published: 2026-06-04

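TL;DR: Existing RAG frameworks fail to capture rich temporal structure; IA-RAG models knowledge as time intervals governed by interval algebra to enable effective temporal reasoning in dynamic knowledge retrieval.

Abstract (Chinese translation)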

检索增强生成（RAG）在利用外部知识为大型语言模型（LLMs）提供知识锚定方面显示出强大的有效性。然而，现有的 RAG 和图 RAG 框架大多将知识视为静态，或将时间与粗粒度的时间戳或元数据关联，无法捕捉丰富的时间结构，如持续时间、重叠和包含关系。我们提出 IA-RAG，一种分层时间 RAG 框架，该框架将知识建模为时间区间，并在形式化时间约束下执行检索。IA-RAG 将事实表示为区间事件单元（IEUs），并将它们组织成分层主题森林（Thematic Forest），其中时间依赖关系遵循艾伦区间代数（Allen's Interval Algebra）。为了处理不完整或不确定的时间边界，IA-RAG 进一步引入了一种子图时间收紧机制，该机制通过连接事件子图中的逻辑约束来细化模糊区间。此外，IA-RAG 还支持基于区间代数遍历的隐式时间语义检索。在多个时间问答基准测试（包括 TimeQA、TempReason 和 ComplexTR）上的实验表明，IA-RAG 在时间检索和推理方面表现出色，特别是在复杂的组合式时间推理任务上。我们的代码已开源，位于 https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA。

Abstract

Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA.
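Allen's Interval Algebra distinguishes thirteen mutually exclusive relations between two time intervals, which is what lets a retriever reason about duration, overlap, and containment. The sketch below classifies a pair of closed-form `(start, end)` intervals; it illustrates the algebra only and is not IA-RAG's code, and the function name `allen_relation` is hypothetical.

```python
def allen_relation(a: tuple[float, float], b: tuple[float, float]) -> str:
    """Return the Allen relation of interval a with respect to interval b."""
    (a0, a1), (b0, b1) = a, b
    if a0 >= a1 or b0 >= b1:
        raise ValueError("intervals must satisfy start < end")
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # Only partial overlaps with distinct endpoints remain at this point.
    return "overlaps" if a0 < b0 else "overlapped-by"
```

For example, `allen_relation((2001, 2003), (2000, 2010))` returns `"during"`, the containment case that a single timestamp per fact cannot express.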

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

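Scoring rationale: The paper focuses on interval-algebra-based temporal reasoning for RAG and is largely unrelated to the multimodal, world-model, and reinforcement-learning keyword areas. It does not involve visual encoders, tokenizer design, or multimodal fusion, and has only weak links through its use of LLMs and its methodological unification. The weighted total is 4.5, far below the dynamic passing score of 27.8, and no designated experts appear in the author list.

Keywords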

Retrieval-Augmented Generation, Temporal Reasoning, Interval Algebra, Interval Event Units, Thematic Forest, Dynamic Knowledge Retrieval, Time Intervals

184. NAVIRA: Decoupled Stochastic Remasking for Masked Diffusion Language Models [FAIL]

Score: 4.5 / 27.8

Authors: Andrey Fomenko, Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko

Published: 2026-06-04

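TL;DR: This paper proposes NAVIRA, a decoding strategy that decouples token scoring from stochastic remasking to improve the fluency and diversity of text generated by masked diffusion language models.

Abstract (Chinese translation)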

掩码扩散语言模型（Masked diffusion language models）通过并行迭代解掩多个标记来生成文本，但这种速度伴随着一个校正问题：同一步骤中生成的标记是从边缘分布预测的，早期的局部依赖错误随后可能会污染上下文。PRISM 通过学习词元级质量分数并重新掩码（remasking）不可靠的标记来解决这一问题，但其推理规则是耦合的：同一前向传播（forward pass）既检测低质量标记又计算其替换项的 logits，因此错误标记仍会影响再生过程。我们提出 NAVIRA，一种推理时解码策略（inference-time decoding policy），它将这两个操作分离，并随机采样重新掩码的位置。首先进行第一次前向传播以标记打分；选定的标记被掩码；随后进行第二次前向传播，从清理后的上下文中再生文本。温度控制重掩码（Temperature-controlled remasking）减少了对相同位置的重复校正，并在流畅性与多样性之间取得平衡。在 170M 掩码扩散语言模型的受控实验中，解耦提高了流畅性，而计划性随机重掩码保留了熵，并在更大的前向传播预算（forward-pass budgets）下实现了更强的 LLM 评判分数（LLM-judge scores）。这些结果表明，重掩码策略，而不仅仅是学习到的质量信号，对于可靠的掩码扩散文本生成至关重要。

Abstract

Masked diffusion language models generate text by iteratively unmasking many tokens in parallel, but this speed comes with a correction problem: tokens generated in the same step are predicted from marginal distributions, and early local dependency errors can later contaminate the context. PRISM addresses this by learning token-level quality scores and remasking unreliable tokens, but its inference rule is coupled: the same forward pass both detects low-quality tokens and computes logits for their replacements, so the erroneous tokens still condition regeneration. We propose NAVIRA, an inference-time decoding policy that separates these two operations and samples remasking positions stochastically. A first forward pass scores tokens; selected tokens are masked; a second forward pass regenerates from the cleaned context. Temperature-controlled remasking reduces repeated correction of the same positions and balances fluency against diversity. In controlled experiments with a 170M masked diffusion language model, decoupling improves fluency, while scheduled stochastic remasking preserves entropy and achieves stronger LLM-judge scores under larger forward-pass budgets. These results show that remasking policy, not only the learned quality signal, is central to reliable masked-diffusion text generation.
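
The two-pass rule described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_fn` and `fill_fn` are hypothetical stand-ins for the two forward passes, `MASK` is a made-up mask symbol, and the Gumbel top-k trick is my own choice for sampling positions without replacement from a temperature-scaled softmax over negative quality scores (the abstract does not say how positions are sampled).

```python
import math
import random
from typing import Callable, List, Sequence

MASK = "<mask>"  # hypothetical mask symbol for this sketch


def sample_remask_positions(
    quality: Sequence[float], k: int, temperature: float, rng: random.Random
) -> List[int]:
    """Sample k distinct positions, favouring low quality scores.

    Gumbel top-k: perturb the logits -quality / temperature with Gumbel
    noise and keep the k largest keys.
    """
    keys = []
    for position, q in enumerate(quality):
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        gumbel = -math.log(-math.log(u))
        keys.append((-q / temperature + gumbel, position))
    keys.sort(reverse=True)
    return sorted(position for _, position in keys[:k])


def decoupled_remask_step(
    tokens: Sequence[str],
    score_fn: Callable[[Sequence[str]], Sequence[float]],
    fill_fn: Callable[[List[str]], List[str]],
    k: int,
    temperature: float,
    rng: random.Random,
) -> List[str]:
    """One correction step: score, mask, then regenerate from the cleaned context."""
    quality = score_fn(tokens)  # pass 1: scoring only, no replacement logits
    remask = set(sample_remask_positions(quality, k, temperature, rng))
    cleaned = [MASK if i in remask else tok for i, tok in enumerate(tokens)]
    return fill_fn(cleaned)  # pass 2: the suspect tokens are no longer in context
```

A low temperature makes the choice nearly greedy (always the lowest-scoring tokens), while a higher temperature spreads remasking across positions, which is the fluency-versus-diversity knob the abstract describes.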

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	2.0/10	3.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

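Scoring rationale: The paper focuses on a decoding policy (NAVIRA) for text generation with masked diffusion language models. It involves token-level remasking, so Tokenizer receives a low score; it does not involve vision, multimodality, world models, or reinforcement learning, so those keywords score 0; and it argues for decoupling rather than unifying models, so Unify Models scores low.

Keywords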

Masked Diffusion Language Models, Decoupled Stochastic Remasking, Inference-time Decoding, Token-level Quality Scores, Text Generation, Fluency and Diversity, Remasking Policy

185. Inverse Design of Realizable Metasurface based Absorbers using Improved Conditioning and Diversity Enhanced Progressively Growing GANs [FAIL]

Score: 4.5 / 27.8

Authors: Vineetha Joy, Mohammad Abdullah, Pramit Pal, Anshuman Kumar, Amit Sethi, Hema Singh

Published: 2026-06-04

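TL;DR: This paper proposes a framework based on improved conditioning and a diversity-enhanced progressively growing GAN for the inverse design of electromagnetic metasurface absorbers, generating physically realizable structures with high accuracy and diversity.

Abstract Translation (Chinese)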

超表面（Metasurfaces）支持精确操控电磁波（Electromagnetic waves），广泛应用于波束转向、传感及隐身技术等领域。然而，针对目标电磁（EM）响应的超表面逆向设计仍具挑战性，这源于迭代全波仿真驱动优化的计算成本高昂，以及现有生成方法在条件保真度和多样性方面的局限。为应对这些挑战，本文提出了一种生成式逆向设计框架，旨在实现连续谱约束下可控且物理一致的超表面合成。该方法采用渐进式增长的带梯度惩罚的 Wasserstein 生成对抗网络（WGAN-GP），并结合基于特征线性调制的条件控制，以实现连续谱和制造约束的稳定传播。电磁（EM）一致性通过代理辅助谱对齐损失直接嵌入生成学习过程，从而在训练过程中实现物理约束下的生成。此外，还引入了基于行列式点过程（Determinantal Point Process, DPP）的多样性正则化策略，旨在针对同一目标响应生成几何多样但谱一致的实现方案。该框架的有效性通过生成在实际中可实现的超表面吸收体得到验证，这些吸收体在 2 至 18 GHz 频率范围内表现出多样的反射特性。电磁（EM）仿真验证表明，生成的设计能以高准确度满足目标规格。最终提出的框架实现了平均均方误差（MSE）为 0.0052、多样性得分为 0.8730、频带对齐精度为 0.8533，以及有效电磁（EM）设计生成百分比为 89.57%，清晰地展示了其生成高度准确、多样、电磁一致且可制造的超表面配置的能力。

Abstract

Metasurfaces enable precise manipulation of electromagnetic waves for applications such as beam steering, sensing, and stealth technology. However, inverse design of metasurfaces with targeted EM responses remains challenging due to the computational expense of iterative full wave simulation driven optimization and the limited conditioning fidelity and diversity of existing generative approaches. To address these challenges, this paper presents a generative inverse design framework for controllable and physically consistent metasurface synthesis under continuous spectral constraints. The proposed approach employs a progressively growing Wasserstein generative adversarial network with gradient penalty integrated with feature wise linear modulation based conditioning for stable propagation of continuous spectral and fabrication constraints. EM consistency is embedded directly into the generative learning process through a surrogate assisted spectral alignment loss, enabling physics constrained generation during training. Further, a determinantal point process based diversity regularization strategy is incorporated to generate geometrically diverse yet spectrally consistent realizations for the same target response. The effectiveness of the proposed framework is demonstrated through the generation of practically realizable metasurface absorbers exhibiting diverse reflection characteristics in the frequency range of 2 to 18 GHz. EM simulations validate that the generated designs meet the target specifications with high accuracy. The final proposed framework achieved an average mean squared error of 0.0052, diversity score of 0.8730, band alignment accuracy of 0.8533, and a valid EM design generation percentage of 89.57, clearly demonstrating its capability to generate highly accurate, diverse, electromagnetically consistent and fabrication realizable metasurface configurations.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

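Scoring rationale: The paper focuses on inverse design of electromagnetic metasurfaces with generative adversarial networks (GANs), which belongs to computational electromagnetics and physics-informed machine learning. The supplied keywords (such as MLLM, Tokenizer, Visual Encoder, and model-based RL) all point toward multimodal large models and reinforcement learning and have no direct connection to the paper. Although it involves generative models (broadly related to World Models) and handling multiple constraints (broadly related to MultiModal and Unify Models), the relevance is very low in this specific AI context, and the weighted total (4.5) is far below the dynamic pass threshold (27.8).

Keywords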

Inverse Design, Metasurface Absorbers, Generative Adversarial Networks, Progressively Growing GAN, Spectral Constraints, Physics Constrained Generation, Diversity Regularization, Electromagnetic Consistency

186. RREDCoT: Segment-Level Reward Redistribution for Reasoning Models [FAIL]

Score: 3.0 / 27.8

Authors: Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter

Published: 2026-06-04

TL;DR: The paper proposes RREDCoT to redistribute rewards across Chain-of-Thought segments in reasoning models, mitigating delayed reward problems and high variance associated with standard RL fine-tuning methods.

Abstract Translation (Chinese)

近期推理语言模型的进展主要由强化学习（RL）微调驱动。通常，这些方法依赖于组相对策略优化（GRPO）算法或其变体，以引导模型生成思维链（CoT）轨迹。只有在思维链（CoT）轨迹完成后，最终答案才能得到验证并分配奖励，这使得它成为一个延迟奖励问题。GRPO 及其变体对应于标准强化学习中的蒙特卡洛（Monte Carlo）方法，已知这些方法存在高方差问题。解决这一问题的一种可能方案是通过信用分配进行奖励再分配，即通过赋予更高的奖励来强调思维链（CoT）轨迹中对获得理想解至关重要的片段。虽然蒙特卡洛（Monte Carlo）采样可用于提供中间状态值的无偏估计，但其计算开销使其不适合在高粒度长上下文场景下的训练时信用分配。我们提出了 RREDCoT（思维链奖励再分配），该方法利用模型本身来近似最优奖励再分配，而无需额外生成。我们研究了该方法相较于蒙特卡洛采样（MC 采样）和几种归因方法的优势。我们进一步分析了与再分配构建相关的若干方面，例如思维链（CoT）轨迹的分割和状态值估计。

Abstract

Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to produce Chain-of-Thought (CoT) traces. The final answer can only be verified, and the reward assigned, after the CoT trace is complete, making it a delayed reward problem. GRPO and its modifications correspond to Monte Carlo methods in standard RL, which are known to suffer from high variance. A possible solution to this problem is the redistribution of rewards through credit assignment, where segments of the CoT trace that are important for arriving at the desirable solution are emphasized by assigning a higher reward. While Monte Carlo sampling can be used to provide an unbiased estimate of intermediate state values, its computational overhead makes it unsuitable for train-time credit assignment in long contexts at high granularity. We introduce RREDCoT (Reward REDistribution for Chain of Thoughts), which utilizes the model itself to approximate the optimal reward redistribution without additional generation. We investigate the advantages of our method compared to MC sampling and several attribution methods. We further analyze several aspects relevant to the construction of the redistribution such as segmentation of CoT traces and state value estimation.
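
To make the idea of redistributing a delayed reward concrete, here is a generic return-decomposition sketch. It is not RREDCoT itself: the abstract does not specify how segment values are obtained, so `segment_values` is assumed to be a list of hypothetical per-segment estimates of the expected final reward, and `redistribute_reward` is an illustrative name. Each segment is credited with the change in expected reward it causes, and the last estimate is replaced by the verified outcome so the credits telescope to `final_reward - baseline`.

```python
from typing import List, Sequence


def redistribute_reward(
    segment_values: Sequence[float], final_reward: float, baseline: float = 0.0
) -> List[float]:
    """Credit each CoT segment with the change in expected final reward.

    segment_values[i] is an (assumed) estimate of the expected final reward
    after segment i; the last entry is overridden by the verified reward.
    """
    if not segment_values:
        raise ValueError("need at least one segment")
    values = [baseline] + list(segment_values[:-1]) + [final_reward]
    return [values[i + 1] - values[i] for i in range(len(values) - 1)]
```

For example, with a baseline of 0.25, estimates `[0.25, 0.75, 0.5]`, and a verified reward of 1.0, the segments receive 0.0, 0.5, and 0.25: the second segment, which raised the expected reward most, gets the most credit, and the credits sum to 1.0 - 0.25.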

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	2.0/10	3.0

Scoring rationale: The paper focuses on reward redistribution for Chain-of-Thought in reasoning models using RL. It does not address multimodal architectures, visual encoders, tokenizers, world models, or model unification. 'model-based RL' receives a low score (2.0) as the paper focuses on credit assignment in policy optimization rather than learning dynamics models. All other keywords are unrelated to the text-only reasoning focus.

Keywords

Reasoning Models, Reinforcement Learning, Chain-of-Thought, Reward Redistribution, Credit Assignment, Delayed Reward, Model Fine-tuning

187. Bridging Domain Expertise and Generalization for Performance Estimation [FAIL]

Score: 3.0 / 27.8

Authors: Shuxuan Li, Zhilin Zhao, Quyu Kong, Wei-Shi Zheng

Published: 2026-06-04

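TL;DR: This paper proposes FRAP, a method that calibrates and fuses the prediction distributions of a foundation model and a base model, improving the accuracy of performance estimation under distribution shift without ground-truth labels.

Abstract Translation (Chinese)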

分布偏移下的性能估计旨在预测模型在未标记测试集上的表现，该测试集的分布与训练数据不同，此场景需要可靠的指标，能够在没有真实标签的情况下忠实反映模型行为。现有方法仅依赖于给定模型的输出，一旦分布发生偏移，其偏差会被放大，从而削弱了与真实性能的相关性。针对这一局限，我们提出了融合参考对齐预测（Fused Reference Alignment Prediction，简称 FRAP），利用外部基础模型（foundation model）与基模型（base model）的互补优势，构建更可靠的真实标签代理。FRAP 通过应用温度缩放校准（temperature-scaled calibration）来最小化两者之间的差异，从而将基础模型的预测分布与基模型的预测分布对齐。通过对齐后的预测进行基于置信度的加权融合，形成一种精炼的参考分布，该分布整合了基础模型的鲁棒性与基模型的领域特定专业知识；性能估计则是通过测量基模型预测与该参考分布的一致性而获得的。在多种数据集和架构上的广泛实验表明，FRAP 在分布偏移下相较于代表性的性能估计方法提供了持续且显著的提升。

Abstract

Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.
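
The three steps in the abstract (temperature-scaled alignment, confidence-weighted fusion, agreement with the fused reference) can be sketched as below. Several details are assumptions of this sketch rather than facts from the paper: the divergence is taken to be KL(base ‖ foundation), the temperature is chosen by grid search, confidence is the maximum class probability, and agreement is measured by matching argmax labels. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def fit_temperature(foundation_logits, base_probs, grid) -> float:
    """Pick the temperature that brings foundation predictions closest to the base model."""
    def mean_kl(t: float) -> float:
        pairs = zip(foundation_logits, base_probs)
        return sum(kl(b, softmax(f, t)) for f, b in pairs) / len(base_probs)
    return min(grid, key=mean_kl)


def fuse(p_found: Sequence[float], p_base: Sequence[float]) -> List[float]:
    """Confidence-weighted mixture; confidence is the max class probability."""
    w_f, w_b = max(p_found), max(p_base)
    return [(w_f * a + w_b * b) / (w_f + w_b) for a, b in zip(p_found, p_base)]


def argmax(p: Sequence[float]) -> int:
    return max(range(len(p)), key=p.__getitem__)


def estimate_accuracy(foundation_logits, base_probs, grid=(0.5, 1.0, 2.0, 4.0)) -> float:
    """Fraction of samples where the base prediction agrees with the fused reference."""
    t = fit_temperature(foundation_logits, base_probs, grid)
    agree = sum(
        argmax(fuse(softmax(f, t), b)) == argmax(b)
        for f, b in zip(foundation_logits, base_probs)
    )
    return agree / len(base_probs)
```

The fusion step matters when the two models disagree: a confident foundation model can overrule an unsure base model, which lowers the estimated accuracy for that sample.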

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

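Scoring rationale: The paper's core topic is performance estimation under distribution shift; it proposes FRAP, which fuses the predictions of a foundation model and a base model. Among the keywords, Tokenizer, Visual Encoder, World Models, MultiModal, and model-based RL are entirely unrelated to the paper; MLLM touches only the general notion of a foundation model, not multimodal large models specifically; Unify Models relates to fusing model predictions, but not to unification at the architecture level, so its relevance is low. The overall relevance score is therefore very low.

Keywords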

Performance estimation, Distribution shift, Foundation model, Prediction alignment, Calibration, Domain expertise, Generalization, FRAP

188. TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory [FAIL]

Score: 3.0 / 27.8

Authors: Ziming Wang

Published: 2026-06-04

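TL;DR: This paper proposes TOKI, a bitemporal operator algebra for resolving contradictions in the persistent memory of LLM agents, ensuring write-time consistency and auditability without removing the judge from the write path.

Abstract Translation (Chinese)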

大语言模型 (LLM) 代理的持久化内存是一种写密集型底层：每一次信念更新都是版本化写入，且新主张可能与已存储的主张相矛盾。生产系统采用四种冲突解决启发式方法（最后写入者胜 (Last-writer-wins)、证据加权合并 (Evidence-weighted merge)、等待确认 (Await-confirmation)、按规则策略 (Per-rule policy)），但没有任何一种声明其假设的隔离级别或承认的写入时异常。我们表明冲突解决本质上是写入时并发控制，并明确了缺失的规范。TOKI 将四种启发式方法建模为双行模式上的一组双时态操作符，每个操作符均包含一个隔离前置条件和一个来源注释，该注释能在审计行中保留被覆盖的事实。四个健全性定理完善了关于隔离、模式和来源的规范，将保证提升至操作符流水线，并将折叠操作符扩展至 n 元冲突集。一个紧性伴随证明表明，在关系调度模型内，裁决器的带键日志对于重放一致性是必要的，而所有经过审计的基线均省略了这一点。一个基于八个系统的判决矩阵定位了该差距：所有在写入路径上保留语言模型裁决器的基线均存在至少一种写入时异常（重放不一致、信念漂移偏斜、审计擦除）；内容寻址引擎层比较器仅通过移除裁决器来避免这些异常，而唯有 TOKI 在保留裁决器同时排除了所有三种异常。在其一个自然工作负载切片上，审计行防御使 LoCoMo 提升了 0.86，而消融类型化内存层在 1,444 个可回答的 LoCoMo 问题上导致准确率下降 0.49；跨系统比较统计效力不足，未宣称具有优越性。本文的贡献在于该规范：一个写入时正确性规范，其在隔离、模式和来源上的健全性已得到证明，明确了每个生产启发式方法所假设但任何已部署系统均未明确指出的保证。

Abstract

Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last-writer-wins, evidence-weighted merge, await-confirmation, per-rule policy), yet none declares the isolation level it assumes or the write-time anomalies it admits. We show that contradiction resolution is write-time concurrency control and make the missing contract explicit. TOKI types the four heuristics as one family of bitemporal operators over a dual-row schema, each with an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across isolation, schema, and provenance, lift the guarantees to operator pipelines, and extend the fold operators to n-ary conflict sets. A tightness companion proves that, within the relational schedule model, keyed logging of the adjudicating judge is necessary for replay consistency, which every audited baseline omits. A verdict matrix over eight systems localizes the gap: every baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies (replay inconsistency, belief-drift skew, audit erasure); a content-addressed engine-layer comparator avoids them only by removing the judge, and TOKI alone excludes all three while keeping it. On its one natural-workload slice the audit-row defence moves LoCoMo by 0.86, and ablating the typed memory layer removes 0.49 accuracy on 1,444 answerable LoCoMo questions; the cross-system comparison stays underpowered and claims no superiority. The contribution is the contract: a write-time correctness specification, proved sound across isolation, schema, and provenance, pinning the guarantee every production heuristic assumes but no deployed system makes explicit.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	2.0/10	3.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

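Scoring rationale: The paper studies contradiction resolution in LLM-agent persistent memory and a bitemporal operator algebra (TOKI), which belongs to system consistency control. It does not address multimodal architecture design (Tokenizer, Visual Encoder, MultiModal, MLLM), model unification strategies (Unify Models), or reinforcement learning algorithms (model-based RL), so their relevance is 0. Agent memory has some connection to World Models, but the paper emphasizes database-style consistency guarantees rather than representation learning or dynamics modeling, so it receives a low relevance score (2.0). The author list contains only Ziming Wang and none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang, and others), so there is no expert bonus. The weighted total is 3.0, far below the dynamic pass threshold of 27.8.

Keywords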

LLM-Agent Persistent Memory, Contradiction Resolution, Bitemporal Operator Algebra, Concurrency Control, Isolation Level, Provenance Annotation

189. When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet [FAIL]

Score: 3.0 / 27.8

Authors: Luoming Zhang, Yuwei Ren, Kui Zhang, Tian Liu, Lingjuan Ge, Denghao Li, Matthew Harper Langston, Yin Huang, Weiliang Will Zeng, Liang Zhang

Published: 2026-06-04

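TL;DR: This paper proposes a multiplication-only matrix inversion approximation for quantized Gated DeltaNet, achieving up to 5× kernel-level speedup and a 20% reduction in decode-layer overhead while preserving accuracy.

Abstract Translation (Chinese)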

分块并行线性注意力中的矩阵求逆是长上下文建模的主要瓶颈，尤其在 NPUs 上，基于前代的方法表现出有限的并行度和较差的硬件利用率。我们提出了一种快速的、基于矩阵乘法（MatMul）的算法，专门针对分块线性注意力中出现的严格下三角矩阵。鉴于诺伊曼级数项的快速累积以及逆矩阵的对角集中性，我们采用带有结构掩码和并行残差修正的截断诺伊曼展开，以消除顺序依赖。我们进一步将方法扩展到低比特 INT，通过缓解重复矩阵幂运算引起的动态范围扩展，并将近似阶数和残差步长适配于块大小，以在保持模型准确性的同时最小化计算成本。在 Qwen3.5 系列模型上的实验表明，该方法实现了高达 5 倍的内核级加速和 20% 的解码层开销降低，同时在浮点和低精度推理下保持准确性。我们的方法为可扩展线性注意力提供了一种高效且硬件友好的解决方案。

Abstract

Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bits INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. Experiments on Qwen3.5-family models demonstrate up to 5$\times$ kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.
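
Why a Neumann expansion can replace forward substitution here: a strictly lower-triangular n×n matrix N is nilpotent (N^n = 0), so (I - N)^{-1} = I + N + N^2 + ... + N^{n-1} exactly, and the partial sums factor as (I + N)(I + N^2)(I + N^4)..., which needs only matrix products. The sketch below illustrates that identity with plain Python lists; it is not the paper's kernel, it omits the structural masking, residual correction, and INT quantization the abstract mentions, and the sign convention (I - N rather than I + N) and all function names are assumptions.

```python
from typing import List

Matrix = List[List[int]]


def identity(n: int) -> Matrix:
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def neumann_inverse(n_mat: Matrix, num_factors: int) -> Matrix:
    """Approximate (I - N)^{-1} as the product of (I + N^(2^j)) for j < num_factors.

    The product equals the Neumann series truncated after N^(2^num_factors - 1),
    so it is exact once 2^num_factors >= n for strictly lower-triangular N.
    """
    size = len(n_mat)
    result = identity(size)
    power = n_mat
    for _ in range(num_factors):
        result = matmul(result, add(identity(size), power))
        power = matmul(power, power)  # N^(2^j) -> N^(2^(j+1))
    return result
```

For a 4×4 chunk, two factors already give the exact inverse, while a single factor leaves a residual of N^2; choosing the number of factors per chunk size is the kind of accuracy-versus-cost trade-off the abstract refers to.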

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

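Scoring rationale: The paper studies a matrix inversion approximation algorithm for quantized Gated DeltaNet, which belongs to model inference acceleration and hardware optimization. The supplied keywords World Models, model-based RL, Visual Encoder, Tokenizer, and Unify Models concern model architectures, learning paradigms, or specific components and have no direct connection to the paper's numerical-algorithm optimization. MLLM and MultiModal relate to properties of the experimental model family (Qwen3.5), but the paper does not explore multimodal understanding or generation mechanisms and uses these models only as an application setting, so relevance is very low.

Keywords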

Matrix Inversion Approximation, Quantized Gated DeltaNet, Linear Attention, Low-bits INT, Hardware Efficiency, Neumann Expansion, Chunk-wise Parallel

190. The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning [FAIL]

Score: 3.0 / 27.8

Authors: Shuo Wang, Xiangyu Wang, Quanxin Wang, Bailin Wu, Bokui Wang, Shunyang Huang, Boyan Deng, Haonan Liu, Ruiyi Fang, Zhenxiang Xu, Boyu Wang, Zhao Kang

Published: 2026-06-04

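TL;DR: This paper proposes a curvature-stratified evaluation framework for relational learning models, finding that model performance depends on graph geometry rather than transferring universally, and that graph foundation models do not outperform geometry-aligned graph neural networks in some curvature regimes.

Abstract Translation (Chinese)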

当前关系学习的评估实践严重依赖于平坦排行榜，这些排行榜在异构数据集上平均其性能，隐含地假设存在统一的底层结构。我们表明，这种假设引入了系统性偏差：它掩盖了依赖于几何结构的性能差异，并可能导致关于模型泛化的误导性结论。在这项工作中，我们将内在几何（intrinsic geometry）识别为控制模型有效性的关键潜在因子。我们表明，传统聚合指标掩盖了关键的性能权衡，只有当数据集按其几何属性分层时，这些权衡才变得可见。为了解决这一问题，我们引入了一个曲率分层评估框架，将数据集划分为正曲率、负曲率和近零曲率区域（regimes）。我们的基准测试在 14 个数据集上评估了 18 个代表性模型，包括图卷积网络（GCNs）、图基础模型（GFMs）以及表格学习方法。我们发现，模型排名在每个曲率区域内高度稳定，但在不同区域间显著变化，这表明性能本质上是几何依赖的，而非普遍可迁移的。值得注意的是，我们识别出某些区域，在这些区域中，GFMs 相比几何对齐的图神经网络（GNNs）呈现出边际收益递减的趋势。基于这些发现，我们提出了一种几何感知的评估协议，相较于标准聚合基准，它能提供更可靠且可解释的比较。我们发布了所有代码、曲率分层数据集划分及评估工具，以支持对未来关系学习方法进行可复现且严格的评估。代码和数据集可在我们的项目主页获取：https://sirbabbage.github.io/CurvBench_HOME/.

Abstract

Current evaluation practices in relational learning rely heavily on flat leaderboards that average performance across heterogeneous datasets, implicitly assuming a uniform underlying structure. We show that this assumption introduces systematic bias: it obscures geometry-dependent performance variations and can lead to misleading conclusions about model generalization. In this work, we identify intrinsic geometry as a key latent factor governing model effectiveness. We demonstrate that conventional aggregated metrics mask critical performance trade-offs that only become visible when datasets are stratified by their geometric properties. To address this issue, we introduce a curvature-stratified evaluation framework that partitions datasets into positive, negative, and near-zero curvature regimes. Our benchmark evaluates 18 representative models including Graph Convolutional Networks (GCNs), Graph Foundation Models (GFMs), and tabular learning methods across 14 datasets. We find that model rankings are highly stable within each curvature regime but shift significantly across regimes, indicating that performance is fundamentally geometry-dependent rather than universally transferable. Notably, we identify regimes where GFMs offer diminishing returns compared to geometry-aligned GNNs. Based on these findings, we propose a geometry-aware evaluation protocol that yields more reliable and interpretable comparisons than standard aggregated benchmarks. We release all code, curvature-stratified dataset splits, and evaluation tools to support reproducible and rigorous assessment of future relational learning methods. Code and datasets are provided in our project homepage: https://sirbabbage.github.io/CurvBench_HOME/.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

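Scoring rationale: The paper focuses on a curvature-based geometric evaluation framework for relational learning, analyzing the performance of graph convolutional networks (GCNs) and graph foundation models (GFMs). The supplied keywords mainly concern multimodal large models (MLLM, MultiModal), reinforcement learning (model-based RL), and visual encoding (Visual Encoder). The paper is largely unrelated to these keywords; it has only a weak connection to Unify Models through its unified evaluation framework, and the remaining keywords, such as Tokenizer and World Models, are not addressed at all.

Keywords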

Relational Learning, Curvature-Stratified Evaluation, Graph Convolutional Networks, Graph Foundation Models, Geometry-Dependent, Evaluation Benchmark, Model Generalization

191. End-to-End Subgraph Detection with GraphDETR [FAIL]

Score: 3.0 / 27.8

Authors: Dexiong Chen, Till Hendrik Schulz, Karsten Borgwardt

Published: 2026-06-04

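TL;DR: GraphDETR uses a graph neural network and a transformer decoder to cast subgraph detection as a set prediction problem, enabling end-to-end, high-accuracy detection of patterns such as molecular structures in large graphs.

Abstract Translation (Chinese)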

子图检测旨在识别查询模式实例是否以及何时出现在更大的图中。该问题在科学领域具有基础性，且与子图同构密切相关，后者是 NP 完全问题，限制了组合方法仅适用于小模式或中等规模的图。我们引入了 GraphDETR，一种将子图检测建模为集合预测问题的深度学习框架，类似于目标检测中的 DETR。GraphDETR 使用图神经网络对目标图进行编码，并使用一组固定的可学习查询向量，通过 Transformer 解码器进行解码，以单次前向传播联合预测所有模式实例。这通过采用二分匹配对模型进行端到端训练得以实现。与仅解决精确结构匹配的传统组合方法不同，GraphDETR 自然扩展到近似匹配，从而实现超越精确模式对应的检测。实验表明，GraphDETR 能够在最多包含 1000 个节点的目标图中检测多种模式，包括分子结构、环、团以及最多 50 个节点的模糊模式。我们进一步在 ChEMBL 数据集上评估了分子功能团检测任务，其中 GraphDETR 预测了每个分子的完整功能团集合，取得了 AP100 = 91.2 的优异性能。

Abstract

Subgraph detection seeks to identify whether and where instances of query patterns occur within a larger graph. This problem is fundamental across scientific domains and is closely related to subgraph isomorphism, which is NP-complete, limiting combinatorial approaches to small patterns or moderately sized graphs. We introduce GraphDETR, a deep learning framework that formulates subgraph detection as a set prediction problem, analogous to DETR in object detection. GraphDETR encodes the target graph with a graph neural network, and employs a fixed set of learnable query vectors, decoded via a transformer decoder, to predict all pattern occurrences jointly in a single forward pass. This is enabled by training the model end-to-end with bipartite matching. Unlike traditional combinatorial methods that only solve exact structural matching, GraphDETR naturally extends to approximate matching, enabling detection beyond exact pattern correspondence. Empirically, we show that GraphDETR can detect diverse patterns, such as molecular structures, cycles, cliques, and fuzzy patterns of up to 50 nodes, in target graphs with up to 1000 nodes. We further evaluate on molecular functional group detection over the ChEMBL dataset, where GraphDETR predicts the complete set of functional groups per molecule, achieving a strong performance of $\text{AP}_{100} = 91.2$.
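
Set prediction with bipartite matching means each ground-truth pattern occurrence is assigned to exactly one query so that the total matching cost is minimal; unmatched queries are trained to predict "no pattern". The sketch below shows only that assignment step, by brute force over small sets. It is an illustration, not GraphDETR's code: the cost matrix is assumed to be given, `match_queries_to_targets` is a hypothetical name, and a real implementation would more likely use a Hungarian-algorithm solver such as SciPy's `linear_sum_assignment`.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple


def match_queries_to_targets(
    cost: Sequence[Sequence[float]],
) -> Tuple[Dict[int, int], float]:
    """Return ({target: query}, total_cost) minimising the summed matching cost.

    cost[q][t] is the cost of explaining target t with query q; the sketch
    assumes at least as many queries as targets and a non-empty target set.
    """
    num_queries, num_targets = len(cost), len(cost[0])
    best_total, best_assignment = None, None
    # Try every injective assignment of targets to queries (fine for tiny sets).
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(queries))
        if best_total is None or total < best_total:
            best_total, best_assignment = total, queries
    return dict(enumerate(best_assignment)), best_total
```

With three queries and two targets, one query is left unmatched; in a DETR-style loss that query would be supervised with the background class.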

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

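Scoring rationale: The paper's core topic is subgraph detection with graph neural networks, which does not match the keyword set (multimodal large models, world models, reinforcement learning). Only Unify Models is slightly related, through the unified detection framework; the remaining keywords, such as Tokenizer, Visual Encoder, and MLLM, are not addressed, so relevance is very low.

Keywords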

Subgraph Detection, Graph Neural Network, Set Prediction, Transformer Decoder, End-to-End Training, Molecular Structures, GraphDETR, Bipartite Matching

192. A Sliced-Wasserstein Framework on Correlation Matrices for EEG Decoding [FAIL]

Score: 3.0 / 27.8

Authors: Chen Hu, Rui Wang, Jiale Zhou, Jingjun Yi, Shaocheng Jin, Yidong Song, Yefeng Zheng

Published: 2026-06-04

TL;DR: This paper proposes a Sliced-Wasserstein framework on correlation matrices to enhance domain generalization in EEG decoding by overcoming channel-wise scaling sensitivity.

Abstract Translation (Chinese)

脑电图 (EEG) 提供了非侵入式的、毫秒级分辨率的神经活动记录，并被广泛应用于神经科学和医疗保健领域。许多脑电图解码流程依赖于协方差描述符，因为它们对噪声具有鲁棒性，但这种表示方法对通道缩放敏感。因此，近期研究主张使用满秩相关矩阵作为脑电图解码的尺度不变替代方案。本文提出了一种通用框架，用于在配备拉回欧几里得度量 (PEMs) 的流形上计算切片 Wasserstein (SW) 散度，该框架被称为拉回欧几里得度量切片 Wasserstein (PEMSW)。在此框架下，我们在满秩相关矩阵流形上实例化了两种相关切片 Wasserstein (CorSW) 散度，基于两种新近提出的相关几何，即 Off-Log 度量 (OLM) 和 Log-Scaled 度量 (LSM)。基于 CorSW，我们进一步开发了一个用于脑电图解码的领域泛化 (DG) 框架。在三个 EEG 数据集上的实验表明，该方法在分布偏移下泛化能力得到提升，且训练开销低，无需额外推理成本。源代码可在 https://github.com/ChenHu-ML/CorSW 获取。

Abstract

Electroencephalography (EEG) offers noninvasive, millisecond resolution recordings of neuronal activity and is widely used in neuroscience and healthcare. Many EEG decoding pipelines rely on covariance descriptors for their robustness to noise, but such representations are sensitive to channel-wise scaling. Recent studies have therefore advocated full-rank correlation matrices as a scale-invariant alternative for EEG decoding. In this paper, we propose a general framework for Sliced Wasserstein (SW) discrepancies on manifolds endowed with Pullback Euclidean Metrics (PEMs), termed Pullback Euclidean Metric Sliced Wasserstein (PEMSW). Within this framework, we instantiate two Correlation Sliced-Wasserstein (CorSW) discrepancies on the manifold of full-rank correlation matrices under two recently introduced correlation geometries, i.e., the Off-Log Metric (OLM) and Log-Scaled Metric (LSM). Building on CorSW, we further develop a domain generalization (DG) framework for EEG decoding. Experiments on three EEG datasets demonstrate improved generalization under distribution shifts, with low training overhead and no additional inference cost. The source code is available at https://github.com/ChenHu-ML/CorSW.
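
A sliced Wasserstein discrepancy averages one-dimensional optimal-transport costs over random projection directions, and in 1D the optimal coupling is obtained by sorting. The following is a plain Euclidean sketch with an optional `embed` hook; treating a pullback metric as "map each point into a Euclidean space, then slice there" is my simplified reading of the abstract, and `embed`, `sliced_wasserstein_sq`, and the Monte Carlo settings are hypothetical. The actual OLM and LSM maps for correlation matrices are not reproduced here.

```python
import math
import random
from typing import Callable, Sequence

Vector = Sequence[float]


def sliced_wasserstein_sq(
    xs: Sequence[Vector],
    ys: Sequence[Vector],
    num_projections: int = 64,
    embed: Callable[[Vector], Vector] = lambda v: v,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the squared sliced 2-Wasserstein discrepancy."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    rng = random.Random(seed)
    ex = [embed(x) for x in xs]  # embed stands in for a pullback map
    ey = [embed(y) for y in ys]
    dim = len(ex[0])
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both samples to 1D; sorting solves 1D optimal transport.
        px = sorted(sum(a * d for a, d in zip(v, direction)) for v in ex)
        py = sorted(sum(a * d for a, d in zip(v, direction)) for v in ey)
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / num_projections
```

Used as a training penalty between feature batches from two domains, such a term adds cost only during training, which is consistent with the abstract's claim of no additional inference cost.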

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

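Scoring rationale: The paper focuses on manifold metric learning for EEG decoding and is largely unrelated to the keyword directions of multimodal large models, reinforcement learning, and world models. Only Unify Models has a weak connection, through the proposed unified mathematical framework; the remaining keywords, such as Tokenizer, Visual Encoder, and MLLM, do not appear in the paper. The author list contains none of the designated experts.

Keywords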

EEG Decoding, Correlation Matrices, Sliced Wasserstein, Domain Generalization, Manifold Learning, Pullback Euclidean Metrics, Full-rank

193. 3D Underwater Path Planning via Generative Flow Field Surrogates [FAIL]

Score: 3.0 / 27.8

Authors: Zachary Cooper-Baldock, Paulo E. Santos, Russell S. A. Brinkworth, Karl Sammut

Published: 2026-06-04

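TL;DR: The paper proposes using conditional generative adversarial networks to rapidly generate 3D flow-field surrogates in place of CFD simulation, greatly reducing the compute cost of path planning for autonomous underwater vehicles while preserving planning efficiency.

Abstract Translation (Chinese)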

自主水下航行器（AUV）在航行中的母平台船体内进行布放与回收（LAR）时，需穿越复杂的三维螺旋桨尾流，其水动力结构无法通过均匀流模型进行表征。高保真度的雷诺平均纳维 - 斯托克斯方程（RANS）计算流体动力学（CFD）模拟能够以足够的精度解析该结构以满足路径规划需求，但其计算成本使其难以应用于船载环境。我们通过整合两种条件生成对抗网络（cGAN）架构——正则化 PatchGAN 和带自注意力的 2D3DGAN——来解决这一空白，将其作为 RANS CFD 数据的即插即用型替换，嵌入到三维能量加权 A* 路径规划框架中。两个生成器均由一个分层管道驱动，仅基于标量操作条件输入即可合成完整的 128^3 体素流场体积，端到端推理时间约为 28-146 μs，而单次 RANS 计算则需要数小时。我们在涵盖 550 种不同流动条件的 19,800 条独立生成轨迹上，对四种环境知识水平进行了基准测试：均匀流、真实 CFD 数据、PatchGAN 以及 2D3DGAN-SA。相对于均匀流规划，完整的 CFD 尾流知识可使能量消耗降低 5.7%-12.5%，高速度尾流核心穿越次数减少高达 77.8%，且这两种效益均随运行工况严峻程度的增加而提升。cGAN 代理模型在兼容边缘设备使用推理速度的同时，恢复了约 45%-60% 的 CFD 能量效益和高速度单元规避效益。这些结果首次系统量化了 cGAN 预测的水动力场在三维海洋机器人应用中的下游路径规划价值。

Abstract

Autonomous underwater vehicle (AUV) launch and recovery (LAR) into the hull of an advancing host platform requires traversal of a complex, three-dimensional propeller wake whose hydrodynamic structure cannot be characterised by a uniform current model. High-fidelity Reynolds-Averaged Navier-Stokes (RANS) Computational Fluid Dynamics (CFD) simulations resolve this structure with sufficient accuracy for path planning, but their computational cost renders them impractical for onboard use. We address this gap by integrating two conditional generative adversarial network (cGAN) architectures, a regularised PatchGAN and a 2D3DGAN with self-attention, as drop-in replacements for RANS CFD data within a three-dimensional, energy-weighted A* path planning framework. Both generators are driven by a hierarchical pipeline that synthesises full $128^3$ voxel flow field volumes from scalar operating condition inputs alone, with end-to-end inference times of approximately 28-146 $\mu$s, compared to hours for a single RANS computation. We benchmark all four environmental knowledge levels (uniform current, ground-truth CFD, PatchGAN, and 2D3DGAN-SA) across 19,800 independently generated trajectories spanning 550 distinct flow conditions. Full CFD wake knowledge reduces energy expenditure by 5.7-12.5% and high-velocity wake-core encounters by up to 77.8% relative to uniform-current planning, with both benefits scaling with operating severity. The cGAN surrogates recover approximately 45-60% of the CFD energy benefit and high-velocity cell avoidance benefit while operating at inference speeds compatible with edge device use. These results provide the first systematic quantification of the downstream path planning value of cGAN-predicted hydrodynamic fields in a three-dimensional maritime robotics application.

Scoring Details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	1.0/10	1.5
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

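Scoring rationale: The paper's core contribution is using conditional generative adversarial networks (cGANs) to replace time-consuming CFD simulation and accelerate AUV path planning, which lies at the intersection of robotics and computational fluid dynamics. It does not involve multimodal large language models (MLLM), tokenizers, or visual encoders, and is entirely unrelated to Unify Models and MultiModal. It touches on environment modeling (World Models) and model-based decision-making (model-based RL), but not in their typical reinforcement-learning sense, so these receive only very low relevance. The author list contains none of the designated experts, so there is no bonus. The weighted total is 3.0.

Keywords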

3D Underwater Path Planning, Generative Flow Field Surrogates, Conditional Generative Adversarial Network, Computational Fluid Dynamics, A* Path Planning, Autonomous Underwater Vehicle, Voxel Flow Field

194. Dead Directions: Geometric Singular Learning [FAIL]

Score: 3.0 / 27.8

Authors: Tejas Pradeep Shirodkar

Published: 2026-06-04

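TL;DR: The paper introduces the notion of a "dead direction" to bridge singular learning theory and information geometry, analyzes singularities in the parameter space of deep networks, and proposes an optimization preconditioner.

Abstract Translation (Chinese)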

奇异学习理论（Singular learning theory）与信息几何（Information geometry）主要在各自的术语体系中研究了相同的参数空间：前者在解析化坐标（resolved coordinates）下计算贝叶斯不变量（Bayesian invariants），后者在原始坐标（original coordinates）下工作，基于一个非退化假设（non-degeneracy assumption），而过参数化模型（overparameterised models）通常违反该假设。我们通过一个基本原语——死方向（dead direction）——将它们联系起来：这是一个费雪度量（Fisher metric）退化的单位向量，等价于解析奇异集（analytic singular set）的一个切向量，具有确定的 KL 阶（KL order），该阶由 KL 散度（KL divergence）消失的速度决定。这两种读法指向同一个向量；我们的核心洞察表明，其 KL 阶可恢复为接近奇点时方向费雪曲率（directional Fisher curvature）的衰减率，且在原始参数坐标下，无需希罗卡解析化（Hironaka resolution）。光滑纤维（smooth fibres）上的选择规则将此速率转换为渡边（Watanabe）对实对数典范阈值（real log canonical threshold, RLCT）的单方向贡献，并将恢复扩展至多分量交叉（multi-component crossings）、重数 $m$、奇异波动 $ν$（singular fluctuation，对于 1D 方向在 KL 阶上是通用的）、先验 -RLCT 偏移（prior-RLCT shifts）以及 tempered 后验（tempered posteriors）。随后，我们将此速率应用于深度网络：多层 K-FAC 分解（multi-layer K-FAC factorisation）将每个费雪块（Fisher block）表示为激活侧速率与梯度侧速率的乘积，二者之间存在对偶性，并在现代网络原语（modern-network primitives，如残差流 residual streams、层归一化 layer normalisation、注意力 attention）上实例化。商定理（quotient theorem）将此速率传递至规范商（gauge quotient）$Θ/G$ 下，该过程基于 $G$-不变度量（$G$-invariant metric）上的梯度流（gradient flow）；SGD 符合条件，而标准 Adam 不符合，我们构建了一个 $G$-等变 Adam 族预条件器（$G$-equivariant Adam-family preconditioner，DDCAdam），后者符合条件。该桥梁提供了奇异几何的参数坐标工具（parameter-coordinate handle）、每架构的闭式预测（closed-form per-architecture predictions），以及从一个检查点（checkpoint）的前向和后向传播中读取渡边三元组（Watanabe's triple）$(λ, m, ν)$ 的轨迹速率（trajectory-rate readout），无需后验采样（posterior sampling）。

Abstract

Singular learning theory and information geometry have studied the same parameter spaces in mostly separate vocabularies: the former computes Bayesian invariants in resolved coordinates, the latter works in original coordinates under a non-degeneracy assumption that overparameterised models routinely violate. We bridge them through one primitive, the dead direction: a unit vector along which the Fisher metric degenerates, equivalently a tangent to the analytic singular set with a definite KL order, set by how fast the KL divergence vanishes. The two readings name the same vector; our central move shows its KL order is recoverable as the decay rate of the directional Fisher curvature approaching the singularity, in original parameter coordinates and without a Hironaka resolution. A selection rule on smooth fibres translates this rate into Watanabe's single-direction contribution to the real log canonical threshold, and we extend the recovery to multi-component crossings, multiplicity $m$, the singular fluctuation $\nu$ (universal in the KL order for 1D directions), prior-RLCT shifts, and tempered posteriors. We then lift this rate to a deep network: a multi-layer K-FAC factorisation writes each Fisher block as a product of activation- and gradient-side rates with a duality between them, instantiated at modern-network primitives (residual streams, layer normalisation, attention). A quotient theorem carries the rate to the gauge quotient $\Theta/G$ under gradient flow on a $G$-invariant metric; SGD qualifies, standard Adam does not, and we construct a $G$-equivariant Adam-family preconditioner (DDCAdam) that does. The bridge yields a parameter-coordinate handle on singular geometry, closed-form per-architecture predictions, and a trajectory-rate readout of Watanabe's triple $(\lambda, m, \nu)$ from one checkpoint's forward and backward passes, without posterior sampling.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

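Scoring rationale: The paper focuses on deep learning theory (singular learning theory, information geometry, optimization dynamics) and has no direct connection to the background keywords such as multimodal models, world models, or reinforcement learning. It is only marginally relevant through a theoretical unification that does not involve model architecture.

Keywords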

Singular learning theory, Information geometry, Dead direction, Fisher metric, Deep networks, K-FAC, Preconditioner, Optimization dynamics

195. EML-CD: Causal Mechanism Recovery via EML Symbolic Trees in Structure Learning [FAIL]

Score: 3.0 / 27.8

Authors: Sota Asanuma

Published: 2026-06-04

TL;DR: EML-CD proposes a causal discovery framework that recovers closed-form causal equations and DAG structures using interpretable EML symbolic trees, overcoming the black-box limitation of neural network-based methods.

Abstract (Chinese translation)

基于神经网络 (NN) 的非线性因果发现方法能够恢复有向无环图 (DAG) 结构，却将每个因果机制视为黑箱。Waxman 等人指出，从神经网络权重中提取因果机制是一个不适定问题。我们提出 EML-CD 框架，该框架将 EML 算子（能够从单个二元算子组合初等函数）集成到因果结构学习中，以可解释的机制恢复为主要目标。EML-CD 将每条边上的机制表示为门控 EML 二叉树，并自动发现闭式因果方程。可直接从输出方程计算解析雅可比矩阵，从而实现对因果效应的定量理解。在真实数据（Sachs 蛋白信号传导，d=11）上，EML-CD 达到 SHD=11.2 ± 0.4（5 个种子均值；基线为单次确定性运行），其 SHD 值在种子方差内与 PC/GES 相当且低于 CAM，同时为每条检测到的边附上闭式方程（精确率 0.756，召回率 0.365）。在具有已知机制的控制双变量测试中，EML-CD 忠实地恢复了 11 个初等函数族中的 10 个（保留集形状相关性 ≥ 0.96；仅高频正弦波部分恢复）。在符号合成基准上，EML-CD 获得的保留集机制 f-MSE 显著低于且更稳定于固定的 SINDy 字典（均值 3.67 对比 7644，后者因一个种子上的灾难性外推而膨胀），尽管其结构恢复（SHD 14.0）仅与字典相当且劣于专用优化器；在 Causal Chambers 光隧道子集上，深度为 2 的模型在 F1 指标上优于线性 OLS-BIC（0.444 对比 0.273）。

Abstract

Neural network (NN)-based nonlinear causal discovery methods recover DAG structure but leave each causal mechanism as a black box. Waxman et al. argued that extracting causal mechanisms from NN weights is ill-posed. We propose EML-CD, a framework that integrates the EML operator (capable of composing elementary functions from a single binary operator) into causal structure learning, with interpretable mechanism recovery as the primary objective. EML-CD represents each edge mechanism as a gated EML binary tree and automatically discovers closed-form causal equations. Analytical Jacobians can be directly computed from the output equations, enabling quantitative understanding of causal effects. On real data (Sachs protein signaling, d=11), EML-CD achieves SHD=11.2 +/- 0.4 (5-seed mean; baselines are single deterministic runs), on par with PC/GES within seed variance and below CAM, while attaching closed-form equations to each detected edge (precision 0.756, recall 0.365). In a controlled bivariate test with known mechanisms, EML-CD recovers 10 of 11 elementary function families faithfully (held-out shape correlation >= 0.96; only high-frequency sine is partial). On a symbolic synthetic benchmark, EML-CD attains a substantially lower and more stable held-out mechanism f-MSE than a fixed SINDy dictionary (mean 3.67 vs. 7644, the latter inflated by catastrophic extrapolation on one seed), although its structure recovery (SHD 14.0) only matches the dictionary and stays below specialized optimizers; on the Causal Chambers light-tunnel subset, a depth-2 model improves F1 over linear OLS-BIC (0.444 vs. 0.273).

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper presents EML-CD, a framework for causal discovery using EML symbolic trees to recover interpretable causal mechanisms. There is significant topic divergence from the provided keywords, which focus on multi-modal LLMs, world models, and reinforcement learning. 'Unify Models' receives a low score (2.0) due to the unification of structure and mechanism learning, but it does not align with architectural unification in the background context. 'Tokenizer', 'Visual Encoder', 'World Models', 'MLLM', 'MultiModal', and 'model-based RL' receive 0.0 as the paper involves neither multi-modal data, tokenization, visual encoding, generative world models, large language models, nor reinforcement learning. No expert authors from the specified list are found in the author list (Sota Asanuma).

Keywords

Causal Mechanism Recovery, EML Symbolic Trees, Structure Learning, Closed-form Causal Equations, Interpretable Mechanism, Causal Discovery, Analytical Jacobians

196. DBHN-Net: Dual-Branch Hybrid Neural Network For Low-Complexity Monaural Speech Enhancement [FAIL]

Score: 3.0 / 27.8

Authors: Cunhang Fan, Enrui Liu, Jing Zhou, Jian Kang, Jie Li, Andong Li, Jian Zhou, Zhao Lv, Xuelong Li

Published: 2026-06-04

TL;DR: This paper proposes a Dual-Branch Hybrid Neural Network combining ANN and SNN to achieve a 7.5-fold reduction in computational complexity for monaural speech enhancement while maintaining performance.

Abstract (Chinese translation)

尽管基于人工神经网络（ANN）的语音增强（SE）方法表现出优异的性能，但高计算复杂度和高能耗阻碍了它们在实际前端处理任务中的部署。当前，脉冲神经网络（SNNs）在降低功耗方面显示出潜力。然而，SNNs 的离散二值激活和复杂的时空动力学往往导致信息丢失。因此，当前的挑战在于如何保持性能并降低计算复杂度。为了解决这一问题，本文提出了一种双分支混合神经网络（DBHN）。1) 在网络架构方面：设计了一种集成 ANN 和 SNN 的双分支网络，其中 SNN 分支降低功耗，而 ANN 分支缓解信息丢失；开发了 BandSplit 和时频（TF）-Mamba 模块，以同时降低能耗并提升模型性能；脉冲特征提取组（SFEG）和信息变换块（ITB）组件采用了残差连接，以缓解信息丢失并进一步精炼特征表示。2) 为了促进分支间信息融合：设计了一个交互模块，以促进双分支网络各阶段的信息交换；设计了一个 TF-Cross Attention-Fusion 模块，以执行双分支信息的时频域融合，同时数据自适应地引导 SNN 分支保留更多关键信息。结果表明，所提出的模型在三个公共数据集上保持了优越的性能，同时相比基线模型实现了平均 7.5 倍的计算复杂度降低。

Abstract

Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, their high computational complexity and energy consumption hinder deployment in practical front-end processing tasks. Spiking neural networks (SNNs) have shown potential for reducing power consumption. However, the discrete binary activations and complex spatio-temporal dynamics of SNNs often result in information loss. The challenge is therefore to maintain performance while reducing computational complexity. To address this issue, this work proposes a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: a dual-branch network integrating ANN and SNN is designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; the BandSplit and Time-Frequency (TF)-Mamba modules are developed to simultaneously reduce energy consumption and enhance model performance; and the Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components are implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: an Interaction module is designed to promote information exchange at various stages of the dual-branch network, and a TF-Cross Attention-Fusion module is designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5-fold reduction in computational complexity compared to baseline models.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on monaural speech enhancement using a hybrid ANN/SNN architecture, which has minimal overlap with the provided keywords concerning Multimodal LLMs, World Models, and Reinforcement Learning. 'Unify Models' receives a low score (2.0) as it technically unifies ANN and SNN but not within the context of foundation models or the specified research background. All other keywords (Tokenizer, Visual Encoder, MLLM, MultiModal, World Models, model-based RL) are completely unrelated to the audio processing task. No expert authors from the specified list are present in the authorship.

Keywords

Speech Enhancement, Dual-Branch Hybrid Neural Network, Spiking Neural Networks, Artificial Neural Networks, Low-Complexity, Monaural Speech, BandSplit, TF-Mamba

197. Knowledge Manifold: A Riemannian Geometric Framework for Semantic Mapping and Geodesic Analysis of Scientific Literature [FAIL]

Score: 3.0 / 27.8

Authors: Tomonaga Okabe, Kazuhiko Komatsu

Published: 2026-06-04

TL;DR: This paper proposes a Riemannian geometric framework to map scientific literature based on semantic similarities and compute geodesic paths for conceptual analysis, rather than addressing multimodal modeling or reinforcement learning.

Abstract (Chinese translation)

我们提出“知识流形”（knowledge manifold）：一个黎曼几何空间，在该空间中，文档语料根据从字符 n-gram TF-IDF 表示中导出的语义位置关系进行排列。该框架包含五个紧密耦合的阶段。首先，每个文档被转换为字符级 n-gram TF-IDF 向量（4-7 元，最多 250,000 个特征，经 L2 归一化），并通过带有排斥、方差和中心化正则化项的约束应力最小化方法嵌入到二维知识地图中。其次，通过平滑粒子流体动力学（SPH）插值（使用三次样条核）估计任意查询点处的知识，生成一个可语言表征的插值 TF-IDF 特征向量。第三，从 SPH 插值地图中计算 0 度、45 度和 90 度的方向性知识梯度，并通过内积和余弦相似性量化成对方向相似性。第四，一个高斯过程回归（GPR）模型（在 10 维 SVD 投影上拟合常数 x RBF + 白核），提供查询点处的贝叶斯后验均值、不确定性估计以及每文档贡献率。第五，通过最小化由 SPH 诱导的度量张量导出的离散黎曼路径能量，并利用 L-BFGS-B 算法结合七个确定性初始路径候选，获得知识空间中的测地线。我们将该方法应用于包含 20 篇关于纤维增强复合材料和航空航天结构力学论文的语料，结果显示：语义地图恢复了有意义的研究簇，测地线路径揭示了遥远主题之间的自然概念桥梁，而 SPH/GPR 插值则实现了虚拟知识的生成：即描述尚未被研究但几何上预测的研究方向的假设性论文摘要。

Abstract

We present the knowledge manifold: a Riemannian geometric space in which a corpus of documents is arranged according to semantic positional relationships derived from character n-gram TF-IDF representations. The framework proceeds in five tightly coupled stages. First, each document is converted to a character-level n-gram TF-IDF vector (4-7 grams, up to 250,000 features, L2-normalized) and embedded in a two-dimensional knowledge map via constrained stress minimization with repulsion, variance, and centering regularizers. Second, knowledge at an arbitrary query point is estimated through Smoothed Particle Hydrodynamics (SPH) interpolation using a cubic-spline kernel, yielding an interpolated TF-IDF feature vector that can be linguistically characterized. Third, directional knowledge gradients at 0, 45, and 90 degrees are computed from the SPH interpolation map, and pairwise directional similarity is quantified via inner product and cosine similarity. Fourth, a Gaussian Process Regression (GPR) model, with a Constant x RBF + White kernel fitted on a 10-dimensional SVD projection, provides a Bayesian posterior mean, uncertainty estimate, and per-document contribution rate at the query point. Fifth, geodesics in the knowledge space are obtained by minimizing a discrete Riemannian path energy derived from the SPH-induced metric tensor, using L-BFGS-B with seven deterministic initial-path candidates. We apply the formulation to a corpus of 20 papers in fiber-reinforced composite materials and aerospace structural mechanics, showing that the semantic map recovers meaningful research clusters, geodesic paths reveal natural conceptual bridges between distant topics, and SPH/GPR interpolation enables the generation of virtual knowledge: hypothetical paper abstracts describing unstudied but geometrically predicted research directions.
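The second stage of the abstract (SPH interpolation with a cubic-spline kernel) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard two-dimensional cubic-spline kernel with support radius 2h and a normalised weighted average of feature vectors. The function names, the smoothing length `h`, and the toy two-document corpus are all hypothetical.

```python
import math

def cubic_spline_kernel_2d(r, h):
    """Standard 2D cubic-spline SPH kernel; zero beyond distance 2h."""
    q = r / h
    sigma = 10.0 / (7.0 * math.pi * h * h)  # 2D normalisation constant
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q**2 + 0.75 * q**3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0

def sph_interpolate(query, positions, features, h):
    """Kernel-weighted average of document feature vectors at a query point.

    Assumption: weights are normalised to sum to one; the paper may use
    a different normalisation.
    """
    weights = [cubic_spline_kernel_2d(math.dist(query, p), h) for p in positions]
    total = sum(weights)
    dim = len(features[0])
    if total == 0.0:
        return [0.0] * dim  # no document within the kernel support
    return [sum(w * f[k] for w, f in zip(weights, features)) / total
            for k in range(dim)]

# Toy corpus: two documents on the 2D map with 2-dimensional "TF-IDF" vectors.
positions = [(0.0, 0.0), (1.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0]]
blend = sph_interpolate((0.5, 0.0), positions, features, h=1.0)
print(blend)
```

A query point equidistant from both documents receives an equal blend of their feature vectors, which is the sense in which knowledge at an arbitrary map location is "interpolated".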

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Riemannian geometry for semantic mapping of text corpora using TF-IDF and statistical interpolation. It lacks any connection to multimodal architectures, tokenizers, visual encoders, world models, or reinforcement learning systems implied by the keywords.

Keywords

Knowledge Manifold, Riemannian Geometry, Semantic Mapping, Scientific Literature, Geodesic Analysis, TF-IDF, Interpolation

198. High-Dimensional Theory of LoRA Fine-Tuning in a Solvable Attention Model [FAIL]

Score: 3.0 / 27.8

Authors: O. Duranthon, F. Boncoraglio, L. Zdeborová

Published: 2026-06-04

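TL;DR: This paper establishes a high-dimensional statistical theory of LoRA fine-tuning in attention models, using order parameters to quantify the interplay between pre-training and fine-tuning and to predict test error.

Abstract (Chinese translation)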

我们构建了一种注意力模型（Attention Models）中低秩适配（LoRA）的高维统计理论，捕捉了预训练（Pre-training）与微调（Fine-tuning）之间的协同作用。我们引入一个可解框架，在该框架中，单头注意力层（Single-head Attention Layer）首先在数据丰富的任务上进行预训练，随后在有限数据上通过秩一 LoRA 更新（Rank-one LoRA Update）进行适配。在高维极限（High-dimensional Limit）下，这两个阶段均可通过一组有限的序参数（Order Parameters）给出精确的渐近刻画，从而得到测试误差（Test Errors）和表示对齐（Representation Alignment）的显式预测。我们的分析表明，预训练对 LoRA 的影响可由一个有效噪声项（Effective Noise Term）来表征，基于此我们推导出了最优预训练流程的方案。我们还展示了一种情形，即测试误差值与表示质量（Representation Quality）之间存在不匹配，并将我们的理论应用于主动微调（Active Fine-tuning）。

Abstract

We develop a high-dimensional statistical theory of low-rank adaptation (LoRA) in attention models, capturing the interplay between pre-training and fine-tuning. We introduce a solvable framework in which a single-head attention layer is first pre-trained on a data-abundant task and subsequently adapted via a rank-one LoRA update on limited data. In the high-dimensional limit, both stages admit a sharp asymptotic characterization in terms of a finite set of order parameters, yielding explicit predictions for test errors and representation alignment. Our analysis shows that the impact of pre-training on LoRA is summarized by an effective noise term, from which we derive prescriptions for the optimal pre-training procedure. We also demonstrate a regime with a mismatch between the value of the test error and representation quality, and propose an application of our theory to active fine-tuning.
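For readers unfamiliar with the term, a rank-one LoRA update adds an outer product of two trainable vectors to a frozen weight matrix. The sketch below shows only that generic mechanism, in plain Python; which matrix of the single-head attention layer the paper adapts is not stated in the abstract, and the names `rank_one_lora`, `u`, `v`, and `scale` are hypothetical.

```python
def rank_one_lora(W, u, v, scale=1.0):
    """Return W + scale * outer(u, v), leaving the frozen W untouched."""
    return [[W[i][j] + scale * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

# Frozen "pre-trained" weights (2 x 3) and the two trainable vectors.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
u = [1.0, 2.0]          # length d_out
v = [0.5, 0.0, -1.0]    # length d_in

W_adapted = rank_one_lora(W, u, v)
trainable = len(u) + len(v)           # d_out + d_in parameters
full = len(W) * len(W[0])             # d_out * d_in parameters
print(W_adapted, trainable, full)
```

The adapter trains d_out + d_in numbers instead of d_out × d_in, which is why fine-tuning on limited data is plausible in this setting.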

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	2.0/10	3.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

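Scoring rationale: The paper is mainly concerned with a high-dimensional statistical theory of LoRA fine-tuning in attention models, analysing the interplay between pre-training and fine-tuning. It does not involve multimodal data, world models, reinforcement learning, tokenizers, or visual encoders. Apart from 'Unify Models', which is marginally relevant because the theoretical framework unifies the analysis of pre-training and fine-tuning, none of the keywords relate to the paper's core content.

Keywords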

High-Dimensional Theory, LoRA Fine-Tuning, Attention Models, Pre-training and Fine-tuning, Statistical Theory, Order Parameters, Test Errors, Representation Alignment

199. YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition [FAIL]

Score: 3.0 / 27.8

Authors: PSBC LLM Team, Huawei LLM Team, Ruihan Long, Junjie Wu, Tianan Zhang, Duo Zhang, Yaozong Wu, Jinbin Fu, Chang Liu, Zhentao Tang, Wenshuang Yang, Xin Wang, Zhihao Song, Ning Huang, Wenjing Xu, Shuai Zong, Shupei Sun, Sen Wang, Jing Hu, Bin Wang, Xinyu Wang, Junkui Ju, Zequn Ding, Jie Ran, Man Luo, Shixiong Kai, Linkai Hou, Kaichao Liang, Hu Zhao, Yang Zhao, Shucheng Lin, Wei Yu, Chenghan Jiang, Jingjing Ding, Jiahui Zhang, Tian Jin, Yuhang Zhang, Dong Guo, Wei Sun, Jun Xie, Jianwei Li, Lei Cao, Pei Li, Jiabin Li, Jia Yuan, Rui Yuan, Jing Zhu, Mingxuan Yuan, Zhangcheng Lv, Xin Jiang, Xiuhong Fei, Xiaozhe Ren, Yulong Li, Zhipeng Zhang, Hang Wang, Zhaohui Xu, Rui Zhao, Yibo He, Xinzhuang Niu

Published: 2026-06-04

TL;DR: YouZhi-LLM proposes an adaptive GQA-to-MLA transition framework to reduce KV cache memory overhead and improve concurrency for financial LLMs on Ascend hardware.

Abstract (Chinese translation)

大型语言模型（LLMs）推动了显著的金融创新，但其高并发部署却严重受限于 KV 缓存内存开销，这不仅推高了基础设施成本，还限制了系统的可扩展性。为此，我们提出了 YouZhi-LLM，这是一种高效金融 LLM，依托于基于华为昇腾（Ascend）生态系统原生构建的全面结构转换与训练管道。在算法核心层面，YouZhi-LLM 采用了一种层自适应的 GQA 至 MLA 转换框架，动态分配每层的 FreqFold 大小，在最大化 KV 缓存压缩的同时最小化困惑度（perplexity）的下降。为恢复表示能力并注入领域专业知识，基于昇腾的训练管道无缝整合了通用知识蒸馏与金融特定领域的监督微调。评估结果表明了这一系统方法的优越性，自适应转换相较于统一基线，将困惑度下降幅度减少了高达 35%。至关重要的是，当通过 vLLM-Ascend 在昇腾 NPU 上评估时，大幅度的 KV 缓存缩减直接转化为部署效率的提升。与各自的基线模型相比，YouZhi-7B 在平均金融基准分数上提升了 12.3%，同时最大并发量增加了 2.69 倍；类似地，YouZhi-14B 实现了 7.0% 的准确率增益和 2.43 倍的并发量提升，从而确立了成本效益高、高吞吐量金融推理的新范式。

Abstract

Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69× increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43× concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on inference efficiency and domain adaptation for financial LLMs (attention transition, KV cache, SFT). It lacks content on multimodal learning, world models, RL, visual encoders, or tokenizers. Weighted score is 3.0, far below the 27.8 threshold. No target expert authors were found.
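The weighted score quoted in the rationale can be reproduced from the table above. The sketch below is a hypothetical reconstruction of the report's arithmetic, assuming each row contributes weight × relevance, the total is the sum of the rows, and a paper passes when the total reaches the threshold. The function name and the ">=" comparison are assumptions, not taken from the report generator.

```python
PASS_THRESHOLD = 27.8  # pass mark shown in the "Score: x / 27.8" lines

def weighted_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

# Rows copied from the score table for paper 199 (relevance is out of 10).
youzhi_rows = [
    ("Unify Models", 1.5, 1.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 1.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]

total = weighted_score(youzhi_rows)
# Assumption: reaching the threshold counts as a pass.
verdict = "PASS" if total >= PASS_THRESHOLD else "FAIL"
print(f"Score: {total:.1f} / {PASS_THRESHOLD} -> {verdict}")
```

Under these assumptions the two non-zero rows contribute 1.5 each, giving the 3.0 total and the FAIL label shown for this entry.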

Keywords

Financial LLMs, GQA-to-MLA Transition, KV Cache Compression, High-Concurrency, Ascend Ecosystem, Knowledge Distillation, Supervised Fine-Tuning

200. RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision [FAIL]

Score: 3.0 / 27.8

Authors: Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

Published: 2026-06-04

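TL;DR: This paper proposes a diffusion-based self-supervised method that assesses and exploits unstable label quality to enhance underwater images, achieving better restoration than existing methods.

Abstract (Chinese translation)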

水下图像增强（UIE）对于缓解由水介质引起的退化至关重要。尽管基于学习的方法取得了显著进展，但大多数仍依赖配对数据集，且标签质量不稳定，这限制了模型性能的提升。本文提出了一种基于扩散模型的数据集内自监督学习策略，旨在利用训练标签的质量分布特性。具体而言，我们采用无需训练的方式，通过预训练扩散模型的语义感知嵌入来评估标签质量。随后，这些质量分数被量化为噪声级别索引，以指导多步去噪过程，从而实现级别监督。该机制既能防止低质量标签损害模型，又能最大化其在训练过程中的效用。此外，本文还引入了一种基于傅里叶变换的细化网络，以显式重建高频分量。广泛的评估表明，我们的方法在恢复质量上始终优于现有最先进（SOTA）方法。代码及预训练模型将在论文被接受后通过链接提供。

Abstract

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available via a link upon acceptance.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	2.0/10	3.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

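Scoring rationale: The paper focuses on underwater image enhancement (UIE), using a diffusion model to handle label quality. It is largely unrelated to the keywords: it does not involve model unification, tokenizers, world models, MLLMs, or reinforcement learning. Only 'Visual Encoder' has a weak connection (rated 2), because diffusion models implicitly encode visual content; the rest are rated 0. The authors do not include any of the specified experts, so there is no bonus. The total is far below the pass mark, and the paper does not belong to the target research direction.

Keywords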

Underwater Image Enhancement, Diffusion Model, Self-Supervised Learning, Label Quality, Fourier-based Refinement, In-Dataset Supervision, Image Restoration

201. Regret Minimization with Adaptive Opponents in Repeated Games [FAIL]

Score: 1.5 / 27.8

Authors: Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu, Kaiqing Zhang

Published: 2026-06-04

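TL;DR: This paper proposes a new regret metric and minimization algorithms for repeated games with adaptive opponents, aimed at learning cooperative equilibria; it does not involve multimodal models or world-model architectures.

Abstract (Chinese translation)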

本文研究了在重复博弈中，针对能够根据历史行动轨迹做出响应的自适应对手的 Regret 最小化问题。在线学习中的标准度量外部 Regret（External Regret）已知无法捕捉这种适应性。为了纳入玩家的反事实推理，我们引入了重复策略 Regret (RP-Regret)，这是一种博弈论度量，用于衡量当所有玩家都能对历史行动轨迹做出响应时，实现（realized）的累积效用与事后最优（best-in-hindsight）累积效用之间的差异。与该设置中现有的 Regret 概念相比，我们的概念原生适用于重复博弈，允许更强的比较对象和约束更少的对手，同时保持当所有玩家最小化该度量时找到更好均衡的可能性。我们首先确定了获得时间上次线性的 RP-Regret 的必要条件，这些条件涉及 Regret 定义中玩家比较策略的变差，以及比较者和对手策略的记忆。随后，我们研究了额外的条件和可证明的算法以最小化 RP-Regret，该度量在策略空间中被定义为非凸。为应对这一挑战，我们提出了三种算法：(i) 一种基于优化预言机（Optimization Oracle）的方法，类似于在线非凸学习中某些先验工作的假设；(ii) 一种在每个迭代中最小化 RP-Regret 的凸且线性化代理函数的方法；(iii) 一种当对手缓慢改变策略时直接最小化 RP-Regret 的方法。此外，当所有玩家都能运行算法以最小化 RP-Regret（或其线性化变体）时，重复博弈的某些子博弈完美均衡（Subgame Perfect Equilibria）可以被学习。我们还提供了实验表明，最小化我们的 Regret 概念可以在 Stag-Hunt（猎鹿博弈）等游戏中导致更具合作性且效用更高的解决方案。

Abstract

In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce Repeated Policy Regret (RP-Regret), a game-theoretic metric that measures the difference between the realized and the best-in-hindsight accumulated utility when all players can respond to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining RP-Regret sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. We then study additional conditions and provable algorithms to minimize RP-Regret, which is by definition non-convex in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and linearized surrogate of RP-Regret at each iteration; (iii) one that directly minimizes RP-Regret when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the RP-Regret (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.
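The abstract's remark that external regret fails to capture adaptivity can be illustrated with a toy Stag-Hunt. The sketch below computes only the standard external regret, not the paper's RP-Regret, whose exact definition is not given in the abstract. The payoff values and the "mirror" opponent that copies the player's previous action are hypothetical choices made for the illustration.

```python
ACTIONS = ("stag", "hare")
# Hypothetical row-player payoffs for a Stag-Hunt game.
PAYOFF = {("stag", "stag"): 4, ("stag", "hare"): 0,
          ("hare", "stag"): 3, ("hare", "hare"): 3}

def mirror_opponent(my_actions, first="hare"):
    """Adaptive opponent that copies the player's previous action."""
    return [first] + list(my_actions[:-1])

def utility(my_actions, opp_actions):
    """Accumulated payoff of the player over the whole play."""
    return sum(PAYOFF[(a, b)] for a, b in zip(my_actions, opp_actions))

def external_regret(my_actions, opp_actions):
    """Best fixed action against the *realized* opponent play, minus realized utility."""
    best_fixed = max(utility([a] * len(opp_actions), opp_actions) for a in ACTIONS)
    return best_fixed - utility(my_actions, opp_actions)

T = 10
always_hare = ["hare"] * T
opp = mirror_opponent(always_hare)          # the opponent also plays hare throughout
regret = external_regret(always_hare, opp)  # evaluated against the fixed history

# Counterfactual: had the player switched to stag, the opponent would have followed.
always_stag = ["stag"] * T
counterfactual = utility(always_stag, mirror_opponent(always_stag))
print(regret, utility(always_hare, opp), counterfactual)
```

Always playing hare has zero external regret here, because the comparison holds the opponent's realized actions fixed, yet the player would have earned more by playing stag once the opponent's response is taken into account. That gap is the kind of counterfactual effect a policy-style regret is meant to measure.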

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

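Scoring rationale: The paper focuses on regret minimization in game theory and online learning, in particular the strategic analysis of adaptive opponents in repeated games. The provided keywords mainly concern multimodal large models, world models, and visual encoders, and have almost no overlap with the paper's topic (game theory and algorithms). Only 'model-based RL' is weakly related, through the reinforcement learning and decision theory background. The author list does not include any of the specified experts.

Keywords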

Regret Minimization, Repeated Games, Adaptive Opponents, Repeated Policy Regret, Online Learning, Equilibrium Learning, Non-convex Optimization

202. Multi-ResNets for Subspace Preconditioning in Constrained Optimization [FAIL]

Score: 1.5 / 27.8

Authors: Merve Karakas, Christopher J. Williams, Emmanuel O. Balogun, Sadegh Sadeghi Tabas, Christian Brown, Nikhil Rao

Published: 2026-06-04

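TL;DR: The paper proposes MResOpt, a staged residual neural network architecture for constrained optimization that satisfies constraints by priority and substantially reduces high-priority constraint violations on optimal power flow tasks.

Abstract (Chinese translation)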

我们提出了 MResOpt，一种用于约束优化问题的分阶段残差神经网络架构。该架构符合 predict-complete-correct（预测 - 完成 - 修正）管道，并通过中间重完成和阶段感知损失按优先级分解约束满足。该框架支持领域知识引导的有序约束满足，使网络能够在存在时利用序数结构。在理想化的无限宽度 regime（情形）下，我们表明我们的设计表现为序列 Gaussian Process（高斯过程）回归。在合成的 QP（二次规划）、QCQP（二次约束二次规划）和 SOCP（二阶锥规划）基准测试上，该分阶段架构在凸和非凸设置中提高了高优先级约束满足。在线路流约束的交流最优潮流问题上，我们引入了物理动机约束排序，并表明 MResOpt 支持一种学习到的分工，使迭代点保持在 equality manifold（等式流形）上，从而实现比 reprojected baselines（重投影基线）显著更低的高优先级违反，同时保持计算高效。

Abstract

We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. The framework enables domain-informed ordered constraint satisfaction which allows the network to utilize ordinal structure when present. Under an idealized infinite-width regime, we show that our design behaves as sequential Gaussian Process regression. On synthetic QP, QCQP, and SOCP benchmarks, the staged architecture improves high-priority constraint satisfaction across convex and non-convex settings. On line-flow-constrained AC optimal power flow, we introduce a physics-motivated constraint ordering and show that MResOpt supports a learned division of labor that keeps iterates on the equality manifold, achieving substantially lower high-priority violation than reprojected baselines while remaining computationally efficient.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

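Scoring rationale: The paper mainly studies a constrained optimization algorithm based on residual neural networks (MResOpt). Its core content covers numerical optimization, subspace preconditioning, and optimal power flow, which is a poor match for the fields of the provided keywords (multimodal large models, world models, reinforcement learning, tokenizers, and so on). Only 'Unify Models' receives a very low score, for the unified nature of the architecture design; none of the other keywords appear in the paper. The weighted total is about 1.5, far below the dynamic pass mark of 27.8, indicating very low relevance to the specified research background.

Keywords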

Multi-ResNets, Subspace Preconditioning, Constrained Optimization, Staged Architecture, Constraint Satisfaction, Optimal Power Flow, Predict-complete-correct, Gaussian Process regression

203. LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs [FAIL]

Score: 1.5 / 27.8

Authors: Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech

Published: 2026-06-04

TL;DR: This paper introduces a propensity-aware framework to evaluate training data leakage in LLMs, revealing that models rarely leak data under ordinary use despite being capable under adversarial attacks.

Abstract (Chinese translation)

大语言模型能够复现训练数据，但现有的记忆评估大多衡量模型是否被迫执行此操作，而非它们在常规使用下是否实际执行此操作。我们提出了 PropMe，这是一个用于记忆评估的倾向性感知框架，该框架对比了基于前缀的能力攻击与非对抗性评估。我们提出了一种指标变换方法，将其应用于现有函数，即可生成倾向性指标。此外，我们还引入了 SimpleTrace，这是一个基于 infini-gram 构建的轻量级追踪管道，它能将模型生成结果确定性地归因于大规模训练语料库，并计算逐字、近逐字以及倾向性变换后的记忆指标。我们在两种语言的两个数据集（Common Pile 和 Dynaword）上评估了两个完全开放的模型（Comma 和 DFM Decoder），发现能力与倾向性之间存在一致性的差距：前缀攻击引发的记忆信号显著强于通用提示或数据集特定提示，而倾向性分数整体仍保持较低水平。因此，当被直接诱导时，模型能够揭示训练数据，但在更常见的非对抗性设置下很少发生这种情况。我们还发现，由 Comma 持续预训练而来的 DFM Decoder 在 Common Pile 上表现出降低的记忆程度和记忆倾向性，这证实了当后续训练强调部分不同的数据时，记忆能力可能会下降。我们的研究结果表明，我们亦建议记忆审计应同时报告最坏情况提取性和常规泄露倾向性，以便对该现象有更全面的认识。

Abstract

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest, and we encourage, that memorization audits should report both worst-case extractability and ordinary leakage propensity to give a more comprehensive view of this phenomenon.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on memorization and training data leakage in text-based Large Language Models (LLMs), which significantly mismatches the provided keywords concerning multimodal models, world models, and reinforcement learning. Only 'Tokenizer' has minimal inherent relevance as LLMs utilize tokenization, though it is not the study's focus; all other keywords are completely unrelated to the paper's content.

Keywords

Large language models, Memorization, Training data leakage, Propensity-aware evaluation, infini-gram, Comma, DFM Decoder

204. Where does Absolute Position come from in decoder-only Transformers? [FAIL]

Score: 1.5 / 27.8

Authors: Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

Published: 2026-06-04

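TL;DR: This paper investigates where absolute position information comes from in RoPE-trained decoder-only Transformers, finding that the causal mask and the residual stream, via attention sinks, leak absolute position.

Abstract (Chinese translation)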

RoPE 训练的变换器在注意力模式中区分绝对位置，尽管 RoPE 仅在内积中编码相对偏移。我们将这种泄露归因于两个架构组件：因果掩码（Causal Mask）负责第一个，其每个查询的 softmax 分母根据构造依赖于绝对查询位置。残差流（Residual Stream）提供第二个。在因果注意力下，位置 0 处的激活仅关注自身，并从该位置 token 的嵌入开始作为封闭动力系统运行；下游注意力通过汇读取头（Sink-Reading Heads）读取该轨迹。这两个组件出现在我们研究的所有三种架构中，具有架构特定的平衡：NTK 缩放抑制了残差流组件，滑动窗口注意力允许其随深度累积，而标准 RoPE 介于两者之间。在前向传播前替换 BOS 嵌入可消除早期查询中 40% 的残差流组件。注意力汇（Attention Sink）是基于 token 的稳定器，它们向前传递位置 0 处 token 的确定性指纹；当该 token 是自动前置的 BOS 时，该指纹在所有输入中保持不变，否则随该 token 变化。

Abstract

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position 0 attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the BOS embedding before the forward pass removes 40% of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position 0, constant across inputs when that token is the auto-prepended BOS and varying with it otherwise.
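The first component named in the abstract, a softmax denominator that depends on the absolute query position, can be seen in a toy case. The sketch below is an illustration under simplifying assumptions rather than the paper's measurement: all attention logits are set to zero so that no content or relative-offset information is present, and the function name is hypothetical.

```python
import math

def causal_softmax_denominators(scores):
    """For each query t, sum exp(scores[t][j]) over the visible keys j <= t."""
    return [sum(math.exp(scores[t][j]) for j in range(t + 1))
            for t in range(len(scores))]

n = 5
flat_scores = [[0.0] * n for _ in range(n)]   # identical logits everywhere
denoms = causal_softmax_denominators(flat_scores)
self_weight_at_0 = math.exp(flat_scores[0][0]) / denoms[0]
print(denoms, self_weight_at_0)
```

With identical logits, the denominator for query t simply counts the t + 1 visible keys, so it grows with the absolute position even though nothing in the scores encodes position. The query at position 0 sees only itself and puts all of its attention weight there, matching the abstract's description of that position.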

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

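Scoring rationale: The paper studies where absolute position information comes from in decoder-only Transformers, focusing on the roles of RoPE, the causal mask, and the residual stream. The provided keyword set, however, mainly covers multimodal large models, world models, and reinforcement learning (e.g. Visual Encoder, World Models, MLLM, model-based RL), which does not match this paper's NLP architecture topic. Apart from 'Unify Models', which is weakly related through a unified understanding of positional mechanisms (1.0), all other keywords have a relevance of 0. The author list does not include any of the specified experts, so there is no bonus.

Keywords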

Decoder-only Transformers, RoPE, Absolute Position, Causal Mask, Residual Stream, Attention Sinks, Positional Leakage, Token Embedding

205. Metamorphic Testing with the Rashomon Set: Explanation Faithfulness in Machine Learning [FAIL]

Score: 1.5 / 27.8

Authors: Helge Spieker, Jørn Eirik Betten, Arnaud Gotlieb

Published: 2026-06-04

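TL;DR: This paper proposes a metamorphic-testing framework that assesses the faithfulness of post-hoc explanation methods for machine learning models without ground-truth labels, addressing the Rashomon effect.

Abstract (Chinese translation)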

多个机器学习模型可在同一任务上实现近乎等效的预测性能，却提供截然不同的基于特征的解释。这被称为可解释机器学习中的拉莫森效应（Rashomon effect），并引发了关于哪些解释（如果有的话）是可信的问题。我们提出一个基于变异测试（Metamorphic Testing）的框架，通过探索事后解释方法中的归因特征重要性，无需真实标签即可评估解释的忠实度。五个变异关系（Metamorphic Relations）形式化地定义了模型行为与特征归因之间预期的一致性属性。我们将此通用框架应用于两个表格回归数据集和两个事后解释器（SHAP 和 LIME），以展示该方法。该框架提供了一种实用的、与模型无关的工具，用于选择具有可靠且可信解释的准确模型。

Abstract

Multiple machine learning models can achieve near-equivalent predictive performance on the same task, yet provide divergent feature-based explanations. This is called the Rashomon effect of (explainable) machine learning, and it raises the question of which explanations, if any, are trustworthy. We propose a framework based on metamorphic testing that assesses explanation faithfulness without requiring ground-truth labels by exploring attributed feature importance from post-hoc explanation methods. Five metamorphic relations formalize expected consistency properties between model behavior and feature attributions. We apply this general framework to two tabular regression datasets and two post-hoc explainers (SHAP and LIME) to demonstrate the approach. The framework offers a practical, model-agnostic tool for selecting accurate models with reliable and trustworthy explanations.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

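Scoring rationale: The paper focuses on explainable AI (XAI) and metamorphic testing, evaluating the faithfulness of model explanations. The supplied keywords mainly concern multimodal large models, world models, and reinforcement learning, which is a poor match for the paper's content (tabular data, post-hoc explanation methods). Only 'Unify Models' has a faint link, because the paper compares multiple models (relevance 1.0); all other keywords are unrelated (0). The author list contains none of the designated experts, so no bonus is added. The weighted total of 1.5 is far below the dynamic passing score of 27.8.

Keywords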

Metamorphic Testing, Rashomon Set, Explanation Faithfulness, Feature Importance, Post-hoc Explainability, Tabular Regression, Machine Learning

206. A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR [FAIL]

Score: 1.5 / 27.8

Authors: Yuze Gao

Published: 2026-06-04

TL;DR: This paper proposes a causal partition method to disentangle self-consistency elicitation from reward design effects in RLVR, revealing systematic bias in naive reward estimators.

Abstract (Chinese translation)

基于可验证奖励的强化学习（RLVR）即使在奖励信号是虚假的情况下也能提升推理能力——它将归因分配给群体多数答案，而非真值验证器。从业者通常将朴素估计量（naive = acc(TRUE) - acc(RANDOM)）解释为奖励设计效应。我们证明该估计量存在系统性偏差：它将自一致性诱导（通过多数伪奖励将策略向其众数答案强化）与真实的奖励设计信号混淆在一起。借助一个受控的表格 GRPO 模拟器，我们推导出一个精确的望远镜分解公式（total = null + elicit + rd），并在五个先验强度水平上测量每一项。朴素估计量的奖励设计占比在弱先验（ps=0.20）时为 0.139，在强先验（ps=0.80）时降至 0.05，且诱导项在自一致性交叉点处发生符号翻转。一项预注册的 2x2x2 析因实验证实了非可加性（交互比 0.385；AxC 效应 -0.089）。一项点估计 vs 界限试点门控分析表明，强先验情形下是可点识别的，而接近交叉点的情形下仅是有界的。对两个具体的已发表成果进行重新审计，分别得出“诱导主导”（诱导占比 0.98）和“奖励设计主导”（rd 占比 1.18）的判定，从而展示了该分解的诊断价值。我们预先承诺无论符号翻转结果如何均提交论文；未翻转的发现具有同等地位。我们发布了一个可重用的单命令工具包，供任何对齐论文运行相同的审计。

Abstract

Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on causal decomposition of reward signals in RLVR, addressing bias in reward design estimators. It lacks content on multimodal architectures, tokenizers, visual encoders, world models, or unified models. While it involves RL, it does not focus on model-based RL architectures, resulting in low relevance across all provided keywords.

Keywords

Reinforcement Learning from Verifiable Rewards, Self-Consistency Elicitation, Reward Design, Causal Partition, Tabular-GRPO Simulator, Naive Estimator Bias, Alignment Paper Audit

207. Causal Atlases from Entropic Inference: Bayesian Networks beyond Optimal DAGs [FAIL]

Score: 1.5 / 27.8

Authors: Hazhir Aliahmadi, Irina Babayan, Greg van Anders

Published: 2026-06-04

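TL;DR: This paper proposes an entropic-inference method for generating atlases of causal relationships, quantifying the structural uncertainty latent in the data and showing that optimized DAGs can contain causal artifacts.

Abstract (Chinese translation)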

数据驱动的因果关系识别对于推进科学内外复杂系统的理解至关重要。贝叶斯网络 (Bayesian Networks) 提供了一种通过有向无环图 (DAGs) 来建模一般因果关系的概率方法。然而，构建贝叶斯网络的典型技术依赖于优化，这可能不适合学习因果关系，因为底层数据可能允许多重因果链。更忠实于数据的因果关系表示将为构建多个因果图提供框架，这些因果图与底层数据中固有的变异性一致。在这里，我们展示了基于熵的推理能够生成与底层数据一致的合理因果关系图谱。在模拟的 2 节点和 20 节点线性结构方程模型 (Linear Structural Equation Models) 的噪声数据上，我们采样了一个最大熵图集合，使我们能够量化底层因果关系中固有的结构模糊性。我们的方法表明，“优化”后的 DAGs 可能包含因果伪影，这些伪影在同等准确度的拓扑结构之间并不一致。

Abstract

Data-driven causal relationship identification is pertinent to advancing understanding of complex systems both within and beyond science. Bayesian networks offer a probabilistic method for modelling generic causal relationships via directed acyclic graphs (DAGs). However, typical techniques for constructing Bayesian networks rely on optimization, which can be ill-suited for learning causal relationships because the underlying data may admit multiple chains of causation. More data-faithful representations of causal relationships would provide frameworks for constructing multiple causal maps that are consistent with the variability that is inherent in underlying data. Here, we show that entropy-based inference generates atlases of plausible causal relationships that are consistent with underlying data. On simulated noisy data of 2- and 20-node linear structural equation models, we sample a maximum-entropy ensemble of graphs that allows us to quantify the inherent structural ambiguity in underlying causal relationships. Our method shows that "optimized" DAGs can contain causal artifacts that are not consistent across equivalently accurate topologies.
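
The remark about "equivalently accurate topologies" can be illustrated on the 2-node case. The sketch below is an assumption-laden toy, not the paper's sampler: it fits both edge directions of a linear Gaussian model by least squares, and the exponential weighting of log-likelihoods is a hypothetical stand-in for the maximum-entropy ensemble. The helper names `gaussian_loglik` and `loglik_edge`, the seed, and the coefficients are all illustrative.

```python
import numpy as np

def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood of zero-mean residuals."""
    n = residuals.size
    variance = residuals.var()  # maximum-likelihood variance estimate
    return -0.5 * n * (np.log(2.0 * np.pi * variance) + 1.0)

def loglik_edge(cause, effect):
    """Log-likelihood of the two-node linear SEM cause -> effect."""
    c = cause - cause.mean()
    e = effect - effect.mean()
    slope = (c @ e) / (c @ c)  # ordinary least-squares slope
    return gaussian_loglik(c) + gaussian_loglik(e - slope * c)

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.5, size=500)  # data generated as x -> y

# Score both directions, then weight them by their likelihoods.
scores = np.array([loglik_edge(x, y), loglik_edge(y, x)])
weights = np.exp(scores - scores.max())
weights /= weights.sum()  # weights over {x -> y, y -> x}
```

For linear Gaussian data the two directions attain the same maximized likelihood, so the weights split evenly even though the data were generated as x -> y. An optimizer that returns a single DAG would hide this ambiguity.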

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

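Scoring rationale: The paper focuses on causal inference and Bayesian networks, using entropy-based inference to build causal atlases that quantify structural uncertainty. Its content has no direct connection to multimodal large models, reinforcement learning, or model architecture components (such as Tokenizer or Visual Encoder). Only 'Unify Models' has a very weak conceptual link, through the integration of multiple causal graphs; the other keywords are unrelated, so the weighted total falls far below the passing score.

Keywords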

Causal Inference, Bayesian Networks, Entropy-based Inference, Directed Acyclic Graphs, Structural Equation Models, Maximum-Entropy Ensemble, Causal Ambiguity

208. Wall Shear Stress Reconstruction from Concentration: Differentiable Physics and Physics-Informed Neural Networks [FAIL]

Score: 1.5 / 27.8

Authors: Mahmoud Elhadidy, Siva Viknesh, Roshan M. D'Souza, Amirhossein Arzani

Published: 2026-06-04

TL;DR: This paper reconstructs wall shear stress from passive scalar observations using differentiable physics and physics-informed neural networks, demonstrating that differentiable physics outperforms PINNs across various measurement scenarios.

Abstract (Chinese translation)

壁面剪切应力 (WSS) 主导近壁输运动力学，是心血管流动中的关键血流动力学指标，但由于需要精确计算近壁速度梯度，其准确推断仍具挑战性。被动标量场（如浓度或温度）由相同的底层速度场平流，具有揭示隐藏流物理指标（如 WSS）的潜力。本文展示了利用两种根本不同的逆问题框架，从空间受限的被动标量观测中实现此类重建：一种是基于离散伴随和偏微分方程约束优化的可微物理框架，它将控制方程作为硬约束；另一种是物理信息神经网络 (PINNs)，将其视为软约束。基准问题包括二维标准后向台阶 (2D-BFS) 和三维患者特异性狭窄冠状动脉。对于 2D-BFS 案例，在三种测量场景（近壁、远场及组合）下评估，当近壁数据可用时，PINN 能达到高精度，但若仅限于远场测量则失效；而可微物理方法在所有场景中均能重建准确的 WSS。在三维患者特异性案例中，可微物理框架优于 PINNs，从而获得准确的 WSS 重建。这些结果表明，测量位置和逆问题公式共同决定了基于标量的近壁流推断中的重建保真度。所提出的框架为从标量输运数据估算近壁血流动力学开辟了一条路径，在可观测被动标量的流体流动问题中具有更广泛的应用前景。

Abstract

Wall shear stress (WSS) governs near-wall transport dynamics and is a key hemodynamic indicator in cardiovascular flows, yet remains difficult to infer accurately due to the need for precise computation of near-wall velocity gradients. Passive scalar fields, such as concentration or temperature, are advected by the same underlying velocity field and have the potential to uncover hidden flow physics metrics such as WSS. In this work, we demonstrate such reconstruction from spatially limited passive scalar observations using two fundamentally different inverse frameworks: a differentiable physics framework based on discrete adjoint, PDE-constrained optimization, which enforces the governing equations as hard constraints, and physics-informed neural networks (PINNs), which treat them as soft constraints. Benchmark problems include a 2D canonical backward-facing step (2D-BFS) and a 3D patient-specific stenotic coronary artery. For the 2D-BFS case, evaluated under three measurement scenarios (near-wall, far-field, and combined), PINN achieves high accuracy when near-wall data are available but fails when restricted to far-field measurements, whereas the differentiable physics approach recovers accurate WSS across all scenarios. In the 3D patient-specific case, the differentiable physics framework outperforms PINNs, yielding accurate WSS reconstruction. These results establish that measurement location and inverse formulation jointly determine reconstruction fidelity in scalar-based near-wall flow inference. The proposed framework opens a path toward estimation of near-wall hemodynamics from scalar transport data, with broader applicability to fluid flow problems where passive scalars can be observed.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

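Scoring rationale: The paper belongs to computational fluid dynamics (CFD) and addresses the inverse problem of wall shear stress reconstruction with differentiable physics and physics-informed neural networks (PINNs). The given keyword set (Tokenizer, MLLM, Visual Encoder, World Models, etc.) targets large language models, multimodal AI, and reinforcement learning, and is largely unrelated to this scientific-computing topic, so most relevance scores are 0. 'Unify Models' receives only 1 point: the paper compares two different frameworks (differentiable physics vs. PINNs) but does not unify model architectures. The author list contains none of the designated experts (Yang Shi et al.). The weighted total is about 1.5, far below the dynamic passing score of 27.8.

Keywords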

Wall Shear Stress Reconstruction, Differentiable Physics, Physics-Informed Neural Networks, Passive Scalar Fields, Hemodynamic Indicators, PDE-Constrained Optimization, Inverse Problems

209. Generative Criticality in Large Language Model Temperature Scaling [FAIL]

Score: 1.5 / 27.8

Authors: Huajian Ruan, Jinyang Li, Xingyu Guo, Lingxiao Wang

Published: 2026-06-04

TL;DR: This paper proposes a statistical-field framework to analyze phase transition-like behaviors in LLM outputs during temperature scaling, focusing on token embeddings without involving multimodal, world model, or reinforcement learning components.

Abstract (Chinese translation)

我们提出了一种针对大型语言模型（LLMs）生成文本的统计场框架，将词元嵌入视为一维链上的连续自旋变量。基于连通两点关联函数定义磁化率，并从系综平均嵌入场定义序参量，我们调整 softmax 温度 $T$，观察到在特征 $T_c$ 附近出现尖锐的磁化率峰值，呈现类幂律标度，同时序参量发生快速变化，且在 $T_c$ 以下坍缩至单一语义方向。通过两最近邻（TwoNN）方法估计的内蕴维度独立证实了这些发现，在 $T_c$ 附近达到最小值。结果在模型规模（Qwen3: 0.6B--32B）和提示词类别上均具有鲁棒性。尽管该现象学特征高度相似于连续相变，但自回归生成的非平衡性质值得进一步研究。我们的框架提供了定量工具，用于探测 LLMs 输出的集体统计结构，并暗示了解码策略与临界现象之间的联系。

Abstract

We propose a statistical-field framework for text generated by large language models (LLMs), treating token embeddings as continuous spin variables on a one-dimensional chain. Defining a susceptibility from the connected two-point correlator and an order parameter from the ensemble-averaged embedding field, we vary the softmax temperature $T$ and observe a sharp susceptibility peak near a characteristic $T_c$ with power-law-like scaling, a concurrent rapid change in the order parameter, and a collapse onto a single semantic direction below $T_c$. The intrinsic dimension estimated by the two nearest neighbor (TwoNN) method independently corroborates these findings, reaching a minimum near $T_c$. Results are robust across model scales (Qwen3: 0.6B--32B) and prompt categories. While the phenomenology closely resembles a continuous phase transition, the non-equilibrium nature of autoregressive generation warrants further investigation. Our framework provides quantitative tools for probing the collective statistical structure of LLM outputs and suggests connections between decoding strategies and critical phenomena.
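
The control parameter in this study is the softmax temperature. As a minimal sketch of what varying $T$ does to a next-token distribution (the logits are hypothetical, and the function names `softmax_with_temperature` and `entropy` are illustrative, not taken from the paper), dividing the logits by $T$ sharpens the distribution at low temperature and flattens it at high temperature:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a probability vector."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
cold = softmax_with_temperature(logits, temperature=0.1)
hot = softmax_with_temperature(logits, temperature=10.0)
```

Here `cold` concentrates nearly all mass on the top token while `hot` approaches the uniform distribution. The susceptibility and order parameter described in the abstract are computed from generated embeddings and are not reproduced in this sketch.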

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	1.0/10	1.5
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper investigates statistical physics analogies in LLM temperature scaling and token embeddings, focusing on phase transitions. It does not address multimodal learning, visual encoders, world models, reinforcement learning, or model unification. Only the mention of 'token embeddings' provides minimal relevance to Tokenizer (score 1.0), while all other keywords are completely unrelated (score 0.0). None of the specified expert authors are listed.

Keywords

Large Language Models, Temperature Scaling, Statistical Field Framework, Phase Transition, Token Embeddings, Susceptibility Peak, Order Parameter

210. Tracing the Oracle: Improving Diffusion Timestep Scheduling for 3D CT Reconstruction [FAIL]

Score: 1.5 / 27.8

Authors: Yujia Wu, Zhaoqiang Liu

Published: 2026-06-04

TL;DR: The paper proposes a Tracing the Oracle (TrO) framework to optimize diffusion timestep scheduling for 3D CT reconstruction, significantly improving reconstruction fidelity and computational efficiency compared to heuristic schedules.

Abstract (Chinese translation)

预训练扩散模型在求解高度不适定的三维计算机断层扫描（CT）逆问题时展现出令人印象深刻的潜力，然而推理过程面临着显著的计算开销。此外，现有的统一时间步调度方案未能捕捉反向条件扩散随机微分方程的非均匀演化，从而引入了显著的截断误差。为了克服这一局限性，我们提出了追踪 Oracle (TrO)，这是一个用于改进时间步调度的即插即用框架。具体来说，我们将少数样本上的密集采样数值积分轨迹视为参考 Oracle。通过利用动态规划算法，最小化少步近似与 Oracle 之间的累积误差，从而提取出优化调度方案。该机制将有限的采样步数精确分配给那些对截断误差高度敏感的关键演化阶段。我们在 AAPM 数据集上针对多个三维 CT 重建任务进行了广泛的实验，结果表明，当与最先进的三维 CT 重建方法 DDS 结合时，与现有的启发式调度方案相比，我们的优化时间步显著提高了重建保真度和计算效率，尤其是在采样步数严格限制为 10 步以内时。

Abstract

Pretrained diffusion models demonstrate impressive potential in solving highly ill-posed 3D computed tomography (CT) inverse problems, while the inference process suffers from significant computational overhead. Furthermore, existing uniform timestep schedules fail to capture the non-uniform evolution of the reverse conditional diffusion stochastic differential equation, thereby introducing substantial truncation errors. To overcome this limitation, we propose Tracing the Oracle (TrO), a plug-and-play framework for improved timestep scheduling. Specifically, we treat densely sampled numerical integration trajectories on a few samples as the reference oracle. The optimized schedule is extracted by leveraging dynamic programming to globally minimize the cumulative error between the few-step approximation and the oracle. This mechanism precisely allocates the limited sampling steps to critical evolution stages that are highly susceptible to truncation errors. Our extensive experiments on the AAPM dataset across multiple 3D CT reconstruction tasks demonstrate that, when combined with the state-of-the-art 3D CT reconstruction method DDS, our optimized timesteps significantly improve reconstruction fidelity and computational efficiency compared to existing heuristic schedules, especially under a strict budget of no more than 10 sampling steps.
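
The schedule-extraction step can be sketched as a shortest-path dynamic program over a dense time grid. This is a minimal illustration under stated assumptions and not the TrO implementation: the oracle is a hypothetical scalar trajectory, and `segment_cost` (squared deviation from a straight line between two chosen grid points) is a stand-in for whatever truncation-error measure the paper uses. The names `segment_cost` and `optimal_schedule` are illustrative.

```python
import numpy as np

def segment_cost(times, oracle, i, j):
    """Squared error of replacing oracle[i..j] by a straight line from i to j."""
    t = times[i:j + 1]
    line = oracle[i] + (oracle[j] - oracle[i]) * (t - times[i]) / (times[j] - times[i])
    return float(((oracle[i:j + 1] - line) ** 2).sum())

def optimal_schedule(times, oracle, num_steps):
    """Pick num_steps segments on the grid with minimal total segment cost."""
    n = len(times)
    best = np.full((num_steps + 1, n), np.inf)  # best[k, j]: k segments ending at j
    prev = np.zeros((num_steps + 1, n), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, n):
            for i in range(j):
                if not np.isfinite(best[k - 1, i]):
                    continue
                cand = best[k - 1, i] + segment_cost(times, oracle, i, j)
                if cand < best[k, j]:
                    best[k, j] = cand
                    prev[k, j] = i
    # Backtrack from the final grid point to recover the chosen indices.
    schedule, j = [n - 1], n - 1
    for k in range(num_steps, 0, -1):
        j = int(prev[k, j])
        schedule.append(j)
    return schedule[::-1], float(best[num_steps, n - 1])

times = np.linspace(0.0, 1.0, 21)
oracle = np.exp(-8.0 * times)  # hypothetical trajectory that changes fastest early
schedule, cost = optimal_schedule(times, oracle, num_steps=4)
```

Because the program searches all schedules with the given number of segments, its cost can be no worse than that of a uniform schedule with the same budget. The triple loop is cubic in the grid size per step, which is acceptable for a sketch but would need pruning on very dense grids.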

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on optimizing diffusion timestep scheduling for 3D CT reconstruction using dynamic programming. It does not address model unification, tokenization, visual encoder architecture design, world models, MLLMs, multimodal integration, or model-based reinforcement learning. The only tangential relevance is to 'Visual Encoder' due to processing visual data (CT scans), but it is not the core contribution.

Keywords

Diffusion Models, 3D CT Reconstruction, Timestep Scheduling, Dynamic Programming, Tracing the Oracle, Inverse Problems, Computational Efficiency

211. Learning to model pediatric asthma exacerbation from multiple risk factors: a case study in coastal Virginia [FAIL]

Score: 1.5 / 27.8

Authors: Jonathan Colen, Eric Werner, Maryam Golbazi, Heather Richter, Diana McSpadden, Amy Quinn, Jocel Santos, Mary Jane Darling, Mary Margaret Gleason

Published: 2026-06-04

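TL;DR: This study compares statistical models with machine learning methods, using environmental and socioeconomic data to predict pediatric asthma exacerbation in coastal Virginia and identify key risk factors.

Abstract (Chinese translation)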

儿童哮喘是一种常见疾病，其加重受空气污染、气象因素以及社区层面社会经济因素的影响。在大型时空数据集 (spatiotemporal datasets) 中对哮喘加重 (Asthma Exacerbation, AE) 进行建模，需要区分来自多个贡献因素的影响。在本案例研究中，我们比较了三种在预测能力与可解释性之间取得平衡的技术，用于预测弗吉尼亚沿海地区汉普顿路斯 (Hampton Roads) 的 AE。该地区包含 7 个城市，人口超过 150 万。在收集环境空气污染测量数据、气象数据及社区机会指标后，我们构建了 2018 年至 2023 年间某地区儿童医院及其关联机构邮政编码级别的急性 AE 就诊模型。广义线性模型 (Generalized Linear Models, GLM) 提供了基线，而神经网络 (Neural Networks, NN) 则作为最大预测能力的目标模型。为了弥合统计模型与深度学习之间的差距，我们基于稀疏字典学习 (Sparse Dictionary Learning) 开发了一个框架，用于识别并解释简约的非线性相互作用方程。在比较各模型的预测性能后，我们估计了由输入暴露变量导致的 AE 的相对风险，并在不同框架间发现了一致性。本研究将统计模型与可解释性机器学习模型相结合，旨在突出可能影响 AE 的协同交互作用，并可能为未来研究指导弗吉尼亚沿海地区的公共卫生干预提供依据。

Abstract

Childhood asthma is a common illness exacerbated by air pollution as well as meteorological and neighborhood-level socioeconomic factors. Modeling asthma exacerbation (AE) in large spatiotemporal datasets requires disentangling impacts from multiple contributors. In this case study, we compared three techniques that balance predictive power with interpretability to predict AE in Hampton Roads, a coastal Virginia region comprising 7 cities and over 1.5 million people. After collating ambient air pollution measurements, weather data, and measures of neighborhood opportunity, we modeled zip code-level acute AE visits to a regional children's hospital and affiliated providers from 2018-2023. Generalized linear models (GLM) provided a baseline while neural networks (NN) served as a maximally predictive target. To bridge between statistical models and deep learning, we developed a framework based on sparse dictionary learning to identify and interpret parsimonious nonlinear interacting equations. After comparing each model's predictive performance, we estimated relative risks for AE due to input exposure variables and found consensus across frameworks. Our work links statistical and interpretable machine learning models to highlight possible synergistic interactions influencing AE, and may enable future studies to guide public health interventions in coastal Virginia.

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

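Scoring rationale: The paper belongs to public health and epidemiology and mainly uses statistical and machine learning methods to predict asthma exacerbation. The supplied keywords (Tokenizer, Visual Encoder, MLLM, World Models, model-based RL) all belong to large-model and reinforcement-learning architecture research. The two domains differ greatly; only 'Unify Models' corresponds, at a literal level, to the paper's attempt to bridge statistical and deep learning models, and the other keywords are unrelated.

Keywords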

Pediatric asthma exacerbation, Air pollution, Machine learning, Interpretability, Spatiotemporal modeling, Generalized linear models, Neural networks, Public health

212. Effective Dimensionality as an Operator Invariant for Physics-Preserving Constraint Adaptation in Physics-Informed Neural Networks [FAIL]

Score: 1.5 / 27.8

Authors: Cornelius Otchere, Michael Shields

Published: 2026-06-04

TL;DR: This paper analyzes task interference in Physics-Informed Neural Networks using effective dimensionality and proposes subspace projection strategies for efficient boundary adaptation without retraining.

Abstract (Chinese translation)

物理信息神经网络（Physics-Informed Neural Networks, PINNs）本质上存在任务干扰问题，因为它们依赖于共享参数空间来同时满足控制微分方程和边界条件。我们利用费雪信息矩阵（Fisher Information Matrix）分析这种结构冲突，以量化物理约束模型中的有效自由度（$d_{eff}$）。与经典的有效自由度（$d_{eff}$）不同，后者衡量在统计先验下数据提供了多少参数方向的信息，我们的 $d_{eff}$ 衡量的是不受微分算子约束的参数方向的维度。对于具有有限维核的算子，我们证明 $d_{eff}$ 精确收敛于核维度，且该结果独立于网络宽度、深度或激活函数，从而将其从拟合诊断重新定义为底层连续算子的结构不变量。对于具有无限维核的算子，$d_{eff}$ 则衡量网络对该核的有限维表示带宽，而非恢复一个整数不变量。重要的是，$d_{eff}$ 还可作为一种先验结构诊断工具。将适定问题的 $d_{eff}$ 降至零，即可确认物理和边界约束已吸收了网络的自由方向。基于这一表征，我们引入了用于边界适应的子空间投影策略。与从头重新训练不同，我们将参数更新投影到预训练物理算子的零空间中，从而在满足新边界条件的同时不干扰已习得的物理规律。基于梯度的微调可以达到或超过这一效果，但需要更多的实际运行时间和调优，而子空间投影可在数秒至数分钟内提供近等效的质量。我们在算子上验证了该方法，涵盖线性和非线性算子，展示了对初始和边界偏移以及未曾遇到的约束类型的准确适应能力。

Abstract

Physics-Informed Neural Networks inherently suffer from task interference because they rely on a shared parameter space to satisfy both governing differential equations and boundary conditions. We analyze this structural conflict using the Fisher Information Matrix to quantify the effective degrees of freedom ($d_{eff}$) in a physics-constrained model. Unlike the classical $d_{eff}$ which measures how many parameter directions are informed by data against a statistical prior, our $d_{eff}$ measures the dimension of the parameter directions unconstrained by the differential operator. For operators with finite-dimensional kernel, we show that $d_{eff}$ converges to the kernel dimension exactly, independent of network width, depth, or activation function, recasting it from a fit diagnostic into a structural invariant of the underlying continuous operator. For operators with infinite-dimensional kernel, $d_{eff}$ instead measures the network's finite-dimensional representational bandwidth for that kernel rather than recovering an integer invariant. Importantly, $d_{eff}$ also serves as an a priori structural diagnostic. Driving $d_{eff}$ of a well-posed problem to zero certifies that the physics and boundary constraints have absorbed the network's free directions. Building on this characterization, we introduce subspace projection strategies for boundary adaptation. Rather than retraining from scratch, we project parameter updates into the null space of the pre-trained physics operator so that new boundary conditions are satisfied without disturbing the learned physics. Gradient-based fine-tuning can match or exceed this but needs more wall-clock time and tuning, whereas subspace projection delivers near-equivalent quality in seconds to minutes. We validate on linear and nonlinear operators, demonstrating accurate adaptation to initial and boundary shifts and unencountered constraint types.
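
The projection step described above can be sketched with a linearized view of the physics constraint. This is a minimal illustration, assuming the constraint is summarized by a Jacobian of the physics residuals with respect to the parameters; the random `physics_jacobian`, the `raw_update`, and the helper name `null_space_projector` are hypothetical and not the authors' implementation.

```python
import numpy as np

def null_space_projector(jacobian, rtol=1e-10):
    """Orthogonal projector onto the null space of `jacobian`."""
    _, singular_values, vt = np.linalg.svd(jacobian, full_matrices=True)
    tol = rtol * singular_values.max() if singular_values.size else 0.0
    rank = int((singular_values > tol).sum())
    null_basis = vt[rank:].T  # columns span the null space
    return null_basis @ null_basis.T

rng = np.random.default_rng(1)
physics_jacobian = rng.normal(size=(3, 6))  # hypothetical: 3 residuals, 6 parameters
raw_update = rng.normal(size=6)             # hypothetical boundary-fitting update

projector = null_space_projector(physics_jacobian)
safe_update = projector @ raw_update  # leaves the residuals unchanged to first order
```

To first order, `physics_jacobian @ safe_update` is zero, so the update can move the parameters toward new boundary conditions without changing the linearized physics residuals. In this toy setting the null space has dimension 6 - 3 = 3, a loose analogue of the free directions that the abstract's $d_{eff}$ counts.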

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	1.0/10	1.5
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on Physics-Informed Neural Networks (PINNs) and differential operators, while the provided keywords target Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is minimal domain overlap; 'Unify Models' has a loose conceptual link regarding constraint unification, but keywords like Tokenizer, Visual Encoder, and MLLM are irrelevant to this scientific computing work.

Keywords

Physics-Informed Neural Networks, Effective Dimensionality, Fisher Information Matrix, Subspace Projection, Boundary Adaptation, Operator Invariant, Constraint Adaptation

213. Online KL-Regularized Reinforcement Learning with Function Approximation under Misspecification [FAIL]

Score: 1.5 / 27.8

Authors: Haoyang Hong, Zichen Wang, Quanquan Gu, Huazheng Wang

Published: 2026-06-04

TL;DR: This paper establishes high-probability regret guarantees for KL-regularized contextual bandits and episodic reinforcement learning under model misspecification using regression-based algorithms with Gibbs policy updates.

Abstract (Chinese translation)

本文研究在一般函数近似及模型误设条件下，KL 正则化的上下文多臂老虎机（Contextual Bandits）与回合制强化学习（Reinforcement Learning, RL）。现有的理论保证依赖于可实现性（Realizability），因此无法推广至误设模型，而在误设模型中经典的后悔界（Regret Bounds）可能会失效。本文提出了针对上下文多臂老虎机与回合制强化学习的 KL 误设形式化（KL Misspecification Formulations），并分析了采用 Gibbs 策略更新（Gibbs Policy Updates）的基于回归的算法。建立了带有显式误设项的高概率 KL 后悔界（KL-Regret）保证，并将标准可实现 KL 正则化设置（Standard Realizable KL-Regularized Setting）作为特例恢复。

Abstract

We study KL-regularized contextual bandits and episodic reinforcement learning (RL) under general function approximation with model misspecification. Existing guarantees rely on realizability and therefore do not extend to misspecified models, where classical regret bounds may fail. This work introduces KL misspecification formulations for contextual bandits and episodic RL and analyzes regression-based algorithms with Gibbs policy updates. High-probability KL-regret guarantees with explicit misspecification terms are established, recovering the standard realizable KL-regularized setting as a special case.
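
A Gibbs policy update has a simple closed form in the single-context case. The sketch below is a minimal illustration, not the paper's algorithm: it assumes a finite action set, a fixed reference policy, and reward estimates already produced by some regression step. All numbers and the names `gibbs_policy` and `kl_regularized_value` are hypothetical. It uses the standard fact that maximizing expected reward minus `kl_coefficient` times KL(policy || reference) yields a policy proportional to `reference * exp(reward / kl_coefficient)`.

```python
import numpy as np

def gibbs_policy(reference_probs, reward_estimates, kl_coefficient):
    """Policy proportional to reference * exp(reward / kl_coefficient)."""
    logits = np.log(reference_probs) + reward_estimates / kl_coefficient
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def kl_regularized_value(policy, reference_probs, rewards, kl_coefficient):
    """Expected reward minus kl_coefficient * KL(policy || reference)."""
    kl = float((policy * np.log(policy / reference_probs)).sum())
    return float(policy @ rewards) - kl_coefficient * kl

reference = np.array([0.5, 0.3, 0.2])  # hypothetical reference policy
rewards = np.array([0.0, 1.0, 0.5])    # hypothetical regression estimates
policy = gibbs_policy(reference, rewards, kl_coefficient=0.5)
```

A smaller `kl_coefficient` lets the policy move further from the reference toward the highest estimated reward; a larger one keeps it close to the reference. Misspecification enters through errors in the reward estimates, which this sketch does not model.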

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	1.0/10	1.5

Scoring rationale: The paper focuses on theoretical reinforcement learning and contextual bandits under model misspecification, which has only a tangential connection to 'model-based RL' (scored 1.0 due to general RL domain). It contains no content regarding multimodal learning, tokenization, visual encoders, MLLMs, unifying models, or world-model architectures, hence all other keywords score 0. The keyword set appears targeted at multimodal/LLM papers, creating a significant mismatch with this theoretical RL work.

Keywords

Reinforcement Learning, Contextual Bandits, Function Approximation, Model Misspecification, KL-Regularization, Regret Bounds, Gibbs Policy

214. Catastrophic Forgetting as Accessibility Collapse: A Three-Level Framework for Knowledge Persistence in Continual Learning [FAIL]

Score: 1.5 / 27.8

Authors: Ayushman Trivedi, Bhavika Melwani

Published: 2026-06-04

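TL;DR: The paper argues that catastrophic forgetting stems from a loss of accessibility to knowledge rather than complete erasure of representations, showing that retraining only the classifier restores most task performance without modifying the backbone network.

Abstract (Chinese translation)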

灾难性遗忘（Catastrophic forgetting）通常被解释为在顺序学习（sequential learning）过程中先前获得知识的不可逆擦除。在这项工作中，我们探讨了一种替代视角：遗忘可能并非源于任务表征（task representations）的完全破坏，而是源于对保留信息的可访问性（accessibility）丧失。我们引入一个三级框架，将知识存储（knowledge storage）、表征（representation）和可访问性分离开来，并通过一系列持续学习（continual-learning）实验评估每个组件，这些实验基于使用 ResNet-18 的顺序 CIFAR-100 分类任务。我们的分析结合了检查点持久性（checkpoint persistence）、线性探测（linear probing）、表征几何（representation geometry）、分类器重置恢复（classifier-reset recovery）以及层可恢复性（layer-wise recoverability）实验。我们观察到早期任务的完全行为遗忘，任务准确率从 54.8% 降至 0%，而线性探测（linear probe）性能保留了约 76% 的原始表征信息。此外，仅重新训练最终分类器即可恢复 75.7% 的原始任务性能，而无需修改骨干网络（backbone network）。层分析（layer-wise analysis）显示，尽管后期阶段出现严重退化，早期和中间层仍保留了高度可恢复的任务信息。投影能量（projection-energy）和主角（principal-angle）分析表明，保留的知识以分布式高维表征的形式持续存在，而非通过保留一个小的主导子空间来实现。这些发现表明，灾难性遗忘更应被描述为一种可访问性失败，而非完全的表征擦除；并且即使功能遗忘已经发生，大量任务相关信息仍保留在神经表征（neural representations）之中。

Abstract

Catastrophic forgetting is commonly interpreted as the irreversible erasure of previously acquired knowledge during sequential learning. In this work, we investigate an alternative perspective: that forgetting may arise not from complete destruction of task representations but from a loss of accessibility to preserved information. We introduce a three-level framework separating knowledge storage, representation, and accessibility, and evaluate each component through a series of continual-learning experiments on sequential CIFAR-100 classification using ResNet-18. Our analysis combines checkpoint persistence, linear probing, representation geometry, classifier-reset recovery, and layer-wise recoverability experiments. We observe complete behavioral forgetting of earlier tasks, with task accuracy collapsing from 54.8% to 0%, while linear probe performance retains approximately 76% of the original representational information. Furthermore, retraining only the final classifier restores 75.7% of the original task performance without modifying the backbone network. Layer-wise analysis reveals that early and intermediate layers preserve highly recoverable task information despite severe degradation at later stages. Projection-energy and principal-angle analyses indicate that retained knowledge persists as distributed high-dimensional representations rather than through preservation of a small dominant subspace. These findings suggest that catastrophic forgetting is better characterized as an accessibility failure than complete representational erasure, and that substantial task-relevant information remains embedded within neural representations even after functional forgetting has occurred.
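
The classifier-reset idea can be illustrated with a toy in which the features keep all task information but the old head can no longer read them. This is only a sketch under strong assumptions, not the paper's ResNet-18 experiment: the synthetic features, the cyclic coordinate shift used as a stand-in for representation drift, and the helper names `fit_linear_head` and `accuracy` are all hypothetical.

```python
import numpy as np

def fit_linear_head(features, labels, num_classes):
    """Least-squares linear classifier on top of fixed features."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    targets = np.eye(num_classes)[labels]
    weights, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return weights

def accuracy(features, labels, weights):
    """Fraction of samples whose arg-max score matches the label."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    return float((np.argmax(design @ weights, axis=1) == labels).mean())

rng = np.random.default_rng(0)
labels = np.arange(300) % 3
features_before = 4.0 * np.eye(3)[labels] + 0.5 * rng.normal(size=(300, 3))

# Hypothetical drift: the same information, stored in shifted coordinates.
features_after = np.roll(features_before, 1, axis=1)

old_head = fit_linear_head(features_before, labels, 3)
new_head = fit_linear_head(features_after, labels, 3)  # retrain only the head

acc_before = accuracy(features_before, labels, old_head)
acc_forgotten = accuracy(features_after, labels, old_head)
acc_recovered = accuracy(features_after, labels, new_head)
```

In this toy, the old head fails on the drifted features even though they remain linearly separable, and refitting only the head restores accuracy. It mirrors the distinction the abstract draws between lost accessibility and erased representations, without claiming anything about the reported 75.7% figure.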

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	1.0/10	1.5
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

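Scoring rationale: The paper focuses on the mechanism of catastrophic forgetting in continual learning (loss of accessibility vs. representational erasure), with experiments using ResNet-18 on CIFAR-100. The supplied keywords mainly concern multimodal large models (MLLM, MultiModal), world models (World Models), and reinforcement learning (model-based RL), a poor match for the paper's domain. Only 'Visual Encoder' has a faint link, because ResNet serves as the visual backbone (1 point); the other keywords are unrelated. The author list contains none of the designated experts. The weighted total is about 1.5, far below the dynamic passing score of 27.8.

Keywords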

Catastrophic Forgetting, Continual Learning, Accessibility Collapse, Knowledge Persistence, Linear Probing, ResNet-18, Classifier Reset, Layer-wise Analysis

215. Steering Vectors are an Adversarial Attack Surface [FAIL]

Score: 1.5 / 27.8

Authors: Abzal Aidakhmetov, Donato Crisostomi, Tommaso Mencattini, Adrian Robert Minut, Iacopo Masi, Emanuele Rodolà

Published: 2026-06-04

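TL;DR: The paper shows that the steering vectors used to control large language model behavior are vulnerable to stealthy data poisoning attacks that jailbreak the model, and proposes a defense based on refusal-direction orthogonalization.

Abstract (Chinese translation)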

激活引导（Activation steering）已成为一种无需微调即可控制大型语言模型（LLM）行为的流行方法。由于该技术具有即插即用特性，用户共享数据集和预计算向量以引导模型激活。然而，我们发现一种隐蔽数据投毒攻击（Stealth data poisoning attack）会静默地破坏此流程。通过替换引导数据集中 4%-6% 的词元（Token），攻击者可以静默地将所得向量与反拒绝方向对齐。这会使目标模型越狱，同时保留对良性提示的预期引导效果。在此威胁模型下，恶意行为者可以分发一个看似安全的包（Bundle），包含文本、向量和权重，以及一个终端用户可以验证的等价证书。我们在两个开源权重模型家族和八个模型 - 属性组合上测试了该攻击，观察到中毒向量的绝对攻击成功率（ASR）达到 20%-55%，比干净参考基线高出 19% 到 51%。最后，我们发现一种拒绝方向正交化防御措施可缩小约 82% 的 ASR 差距，而不损害良性行为。

Abstract

Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a stealth data poisoning attack silently compromises this pipeline. By substituting 4-6% of tokens in the steering dataset, an attacker can silently align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. Under this threat model, a malicious actor can distribute an apparently safe bundle containing texts, vectors, and weights, alongside an equivalence certificate that the end-user can verify. We test the attack on two open-weight model families and eight model-attribute combinations, observing that poisoned vectors reach an absolute attack success rate (ASR) of 20-55%, +19% to +51% over a clean reference. Finally, we find that a refusal-direction orthogonalization defense can recover approximately 82% of the ASR gap without harming benign behavior.
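
The orthogonalization defense can be sketched as removing the component of a steering vector that lies along a known refusal direction. This is a minimal illustration under assumptions, not the paper's implementation: it assumes the defense is applied to the vector itself, and the example vectors and the helper name `orthogonalize` are hypothetical.

```python
import numpy as np

def orthogonalize(steering_vector, refusal_direction):
    """Remove the component of steering_vector along refusal_direction."""
    unit = refusal_direction / np.linalg.norm(refusal_direction)
    return steering_vector - (steering_vector @ unit) * unit

refusal_direction = np.array([1.0, 2.0, -1.0, 0.5])  # hypothetical direction
steering_vector = np.array([0.3, -1.2, 0.8, 2.0])    # hypothetical (possibly poisoned) vector

cleaned = orthogonalize(steering_vector, refusal_direction)
```

After this step `cleaned` has no projection onto the refusal axis in either sign, so any anti-refusal component injected by poisoning is removed, while the components orthogonal to that axis, which carry the intended steering effect, are kept.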

Scoring details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

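Scoring rationale: The paper studies the security of steering vectors for large language models (LLMs), specifically a stealth data poisoning attack on activation steering and a defense against it. The supplied keyword set focuses on multimodal world models, representation learning, and model-based reinforcement learning. Apart from a faint link to MLLM through the underlying LLM architecture, the other keywords (Visual Encoder, World Models, model-based RL, Unify Models, etc.) are unrelated to the paper, so relevance scores are very low and the weighted total is far below the dynamic passing score.

Keywords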

Activation Steering, Data Poisoning, Adversarial Attack, LLM Security, Jailbreak, Refusal Direction, Model Alignment

216. The Generator-Eraser Paradox: Community Guidelines for Responsible LLM-Assisted Dialect Resource Creation [FAIL]

Score: 1.5 / 27.8

Authors: Wajdi Zaghouani

Published: 2026-06-04

TL;DR: This paper proposes community guidelines to mitigate dialect erasure risks when using LLMs for resource creation, balancing generation benefits with linguistic diversity preservation.

Abstract (Chinese translation)

方言资源在科学描述、文化保存与计算基础设施的交汇之处占据着独特的位置。大型语言模型（LLM）通过基于检索的起草、语料库导航、元数据丰富化以及标注工作流支持，展现出加速方言资源开发的强大能力。然而，同一系统也构成重大风险：它们可能通过偏爱声望变体、正字法同质化以及促成合成反馈回路，从而导致方言抹除，并在长期内减少语言多样性。对于具有双言现象、书面标准化程度有限或说话者社区边缘化的语言变体而言，这些风险尤为严峻。本文做出了三项贡献。首先，我们整合变异社会语言学与语料库语言学的见解，将“生成者 - 抹除者悖论（generator-eraser paradox）”形式化为一个理论框架，用以理解 LLM 辅助方言工作的双重性质。其次，我们制定了 12 条社区指南，将该框架操作化为方言资源创建与文档化过程中可实施的设计要求。第三，我们提供了阿拉伯方言的深入案例研究，其中包括对广泛使用的资源的结构化比较，以展示这些指南如何解决包括双言现象、正字法变异性及社区治理在内的语言特定挑战。本研究贡献在于概念性与操作性，而非实验性，旨在使各语言的方言社区和资源构建者能够采用 LLM，同时不牺牲真实性、变异性或自主权。

Abstract

Dialect resources occupy a unique position at the intersection of scientific description, cultural preservation, and computational infrastructure. Large language models offer powerful capabilities for accelerating dialect resource development through retrieval-grounded drafting, corpus navigation, metadata enrichment, and annotation workflow support. However, the same systems pose substantial risks: they can contribute to dialect erasure by privileging prestige varieties, homogenizing orthography, and enabling synthetic feedback loops that reduce linguistic diversity over time. These risks are particularly acute for language varieties characterized by diglossia, limited written standardization, or marginalized speaker communities. This paper makes three contributions. First, we integrate insights from variationist sociolinguistics and corpus linguistics to formalize the generator-eraser paradox as a theoretical framework for understanding the dual nature of LLM-assisted dialect work. Second, we derive 12 community guidelines that operationalize this framework into implementable design requirements for dialect resource creation and documentation. Third, we provide an in-depth case study of Arabic dialects, including a structured comparison of widely used resources, to demonstrate how these guidelines address language-specific challenges including diglossia, orthographic variability, and community governance. The contribution is conceptual and operational rather than experimental, with the goal of enabling dialect communities and resource builders across languages to adopt LLMs without sacrificing authenticity, variation, or sovereignty.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	1.0/10	1.5
MultiModal	1.5	0.0/10	0.0
model-based RL	1.5	0.0/10	0.0

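Scoring rationale: The paper discusses ethical issues and community guidelines in dialect resource creation, at the intersection of sociolinguistics and AI ethics. It does not cover unified model architectures, tokenizer design, visual encoders, world models, technical details of multimodal large models, or reinforcement learning algorithms. Although it mentions large language models (LLMs), it does not involve multimodal models (MLLM) or model-based RL, so its relevance to the given technical keywords is very low.

Keywords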

Dialect Resource Creation, Large Language Models, Generator-Eraser Paradox, Community Guidelines, Responsible AI, Sociolinguistics, Arabic Dialects, Language Preservation

217. Forgive or forget: Understanding the context of hate in audio retrieval systems [FAIL]

Score: 1.5 / 27.8

Authors: Arghya Pal, Sailaja Rajanala, Raphael C. -W. Phan, Shekhar Nayak

Published: 2026-06-04

TL;DR: This paper proposes a post hoc causal debiasing framework to reduce toxicity in text-to-audio retrieval systems while preserving semantic relevance and retrieval accuracy.

Abstract

Handling toxic retrieval in text-to-audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing framework with a sentiment-controlled mediator to preserve semantic relevance while suppressing harmful speech. Our approach is model-agnostic and integrates seamlessly with existing retrieval pipelines. We introduce two variants: Forgive, which re-ranks and filters toxic audio via logit adjustment, and Forget, which generates counterfactual toxic prompts to mitigate harmful retrievals. Experiments show consistent toxicity reduction with minimal loss in retrieval accuracy, improving both safety and reliability.

Score details

Keyword	Weight	Relevance	Score
Unify Models	1.5	0.0/10	0.0
Tokenizer	1.5	0.0/10	0.0
Visual Encoder	1.5	0.0/10	0.0
World Models	1.5	0.0/10	0.0
MLLM	1.5	0.0/10	0.0
MultiModal	1.5	1.0/10	1.5
model-based RL	1.5	0.0/10	0.0

Scoring rationale: The paper focuses on toxicity mitigation in audio retrieval systems using causal debiasing, which does not align with the provided research background of unified models, world models, or model-based RL. There is no mention of tokenizers, visual encoders, or reinforcement learning. While text-to-audio involves multiple modalities, the work addresses safety rather than multimodal representation learning or MLLM architectures central to the keywords.

Keywords

Audio retrieval, Toxic retrieval, Causal debiasing, Text-to-audio, Safety, Sentiment-controlled mediator, Logit adjustment

218. You Only Index Once: Cross-Layer Sparse Attention with Shared Routing [FAIL]

Score: 0.0 / 27.8

Authors: Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei

Published: 2026-06-04

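TL;DR: This paper proposes cross-layer sparse attention (CLSA), which shares both the KV cache and the routing index across layers to speed up long-context inference, reporting up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context.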

Abstract

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
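
The shared-routing idea described above can be illustrated with a toy decode step. This is a minimal sketch, not the authors' implementation: it assumes the indexer scores cached tokens by a dot product with a hypothetical `index_query`, and that every layer then attends only over the same selected tokens of a shared KV cache.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected cache positions."""
    logits = [dot(query, keys[i]) for i in selected]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * values[i][d] for w, i in zip(weights, selected))
            for d in range(dim)]

def decode_step(layer_queries, index_query, keys, values, k):
    # Routing is computed once per step (assumed dot-product indexer) ...
    scores = [dot(index_query, key) for key in keys]
    selected = top_k_indices(scores, k)
    # ... and the same index is reused by every layer over the shared KV cache.
    outputs = [sparse_attention(q, keys, values, selected) for q in layer_queries]
    return selected, outputs

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
selected, outputs = decode_step(
    layer_queries=[[1.0, 0.0], [0.0, 1.0]],
    index_query=[1.0, 0.5],
    keys=keys, values=values, k=2,
)
```

The cost of the top-k selection is paid once per step rather than once per layer, which is the amortization the abstract refers to.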

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

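Scoring rationale: The paper focuses on improving the efficiency of long-context LLM inference and proposes the cross-layer sparse attention (CLSA) mechanism. The supplied keyword set mainly covers multimodal learning (MLLM, MultiModal, Visual Encoder), world models, and model-based reinforcement learning, which have no direct overlap with the paper's area (text-only LLM architecture optimization). The author list does not include any of the designated experts.

Keywords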

Cross-Layer Sparse Attention, KV-sharing, Long-context inference, LLM efficiency, Routing index, Decoding speedup, Throughput improvement

219. Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals [FAIL]

Score: 0.0 / 27.8

Authors: Thamilvendhan Munirathinam

Published: 2026-06-04

Abstract

As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 13 column 22 (char 544)

220. Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks [FAIL]

Score: 0.0 / 27.8

Authors: Boyi Chen, Shengqin Chu, Zicheng Wang, Brian Baetz, Zhen Gao

Published: 2026-06-04

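TL;DR: This paper assesses autonomous driving risk by analyzing technical failures, ethical dilemmas, and policy frameworks, and recommends an adaptive governance approach that combines engineering standards, ethical discussion, and institutional supervision.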

Abstract

Autonomous driving technology has the potential to reduce the large number of road traffic accidents caused by human error each year, but it also brings new types of risks that need to be evaluated from the aspects of technology, ethics and regulations. Based on public crash data from the National Highway Traffic Safety Administration (NHTSA), disengagement reports from the California Department of Motor Vehicles (DMV), the MIT Moral Machines dataset, and a comparative regulatory analysis of five jurisdictions, we have found that the main types of technical failure modes are perception and classification errors. These account for a relatively large proportion of the reported accidents, and it can be concluded that there are different ethical frameworks for autonomous vehicle decision-making, and inconsistent regulations in different areas increase the uncertainty of widespread application. Generally speaking, the problems of technology, ethics and regulation are closely related and need to be solved together. Therefore, this paper recommends a more adaptive and cooperative governance approach that combines engineering standards, ethical discussion, and institutional supervision.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

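Scoring rationale: The paper focuses on risk assessment, ethical dilemmas, and policy frameworks for autonomous driving, and derives governance recommendations from crash data and regulatory analysis. It does not involve unified model architectures, tokenizers, visual encoders, world models, multimodal large language models, multimodal learning frameworks, or model-based reinforcement learning, so its relevance to every given keyword is 0. The weighted total is 0, far below the dynamic passing score of 27.8.

Keywords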

Autonomous Driving, Risk Assessment, Technical Failures, Ethical Dilemmas, Policy Frameworks, Perception Errors, Governance Approach

221. Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation [FAIL]

Score: 0.0 / 27.8

Authors: Yohann Benchetrit, Marlène Careil, Simon Dahan, Hubert Banville, Stéphane d'Ascoli, Jean-Rémi King

Published: 2026-06-04

Abstract

Brain decoding is limited by the availability of labeled neural data, and remains challenging in low-data regimes. To address this issue, we investigate whether and when brain decoding can be boosted by augmenting small fMRI datasets with synthetic data generated by a pretrained model of fMRI responses to stimuli. We use TRIBE v2, a large encoding model pretrained on more than 1000 hours of fMRI responses to video, audio and language. For each dataset, we evaluate systematic grids that show how the performance of image decoders varies with the amount of synthetic data used for training. Our results, based on two datasets (the 7T fMRI Natural Scenes Dataset and 3T fMRI BOLD5000), show up to 68% improvement in Top-10 image-retrieval accuracy compared to decoders trained only on real data. Importantly, the proportion of augmented data required to reach a given image decoding performance needs to be adjusted depending on the data source. Surprisingly, image decoders trained exclusively on synthetic fMRI can perform above chance in some settings, suggesting that TRIBE v2 can support zero-shot brain-to-image decoding. Together, these results show how large-scale models of the fMRI responses to sight, sound and language may provide a foundation to improve the data efficiency for image decoding.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 63 (char 286)

222. Quantum enhanced rare event discovery and sampling [FAIL]

Score: 0.0 / 27.8

Authors: Naixu Guo, Po-Wei Huang, Qisheng Wang, Jayne Thompson, Patrick Rebentrost, Mile Gu, Chengran Yang

Published: 2026-06-04

TL;DR: This paper introduces a quantum algorithm for efficient rare-event discovery and sampling without prior knowledge, achieving optimal quantum scaling and speedups for heavy-tailed systems.

Abstract

Financial crashes, cascading failures in infrastructure, and critical errors in AI systems are frequently triggered by events that occur with extremely small probability. Efficiently discovering and sampling events with probability below a threshold is therefore of critical interest. Yet this task is highly non-trivial using existing classical or quantum methods. Being rare, such events require an immense sampling overhead to collect sufficient data samples. Moreover, because the rare events are not known in advance, they cannot be flagged for amplification using standard techniques. Here, we introduce a quantum algorithm for rare-event discovery and sampling without first learning which events are rare. The algorithm achieves the optimal quantum scaling with the rarity threshold. We further demonstrate that this can achieve a quadratic speedup for heavy-tailed systems whose tail has nonvanishing total mass, and translates into a robust polynomial speedup for stationary stochastic processes, with the exponent determined by its entropy-rate structure.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on quantum algorithms for rare event sampling within probability theory and quantum computing. The provided keywords pertain to multimodal large language models (MLLM), unified architectures, and model-based reinforcement learning. There is no conceptual or technical overlap between the quantum sampling method and the specified deep learning architectures, resulting in zero relevance for all keywords.

Keywords

Quantum algorithm, Rare event discovery, Sampling, Heavy-tailed systems, Probability threshold, Stochastic processes, Entropy-rate structure

223. AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks [FAIL]

Score: 0.0 / 27.8

Authors: Wonmo Koo, Sanha Chang, Heeyoung Kim

Published: 2026-06-04

TL;DR: This paper proposes a memory-augmented neural network approach for vessel trajectory prediction using AIS data, achieving superior performance compared to deep learning baselines without external memory.

Abstract

Accurate vessel trajectory prediction is essential for safe and efficient maritime operations, enabling collision avoidance and supporting route optimization. Although memory-augmented neural networks have recently shown strong performance in pedestrian and road-vehicle trajectory prediction by selectively retrieving relevant information from an external memory, their potential for vessel trajectory prediction remains underexplored. This paper presents an empirical investigation of memory-based trajectory prediction using Automatic Identification System (AIS) data. Experiments on data from the Gulf of Mexico and the New York Bight demonstrate consistent and substantial performance gains over a range of deep learning baselines that do not incorporate an external memory.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on vessel trajectory prediction using AIS data and memory-augmented neural networks. It does not involve multimodal large models, tokenizers, visual encoders, world models, or reinforcement learning. Thus, it is completely unrelated to the provided keyword set which centers on multimodal foundation models and model-based RL. No expert authors from the specified list are present.

Keywords

AIS-Based Vessel Trajectory Prediction, Memory-Augmented Neural Networks, External Memory, Maritime Operations, Deep Learning Baselines, Sequential Prediction, Collision Avoidance

224. ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents [FAIL]

Score: 0.0 / 27.8

Authors: Rahul Suresh Babu, Laxmipriya Ganesh Iyer

Published: 2026-06-04

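TL;DR: This paper proposes Causal Minimal Tool Filtering (CMTF), a training-free method that limits the tools exposed at each step to improve LLM agent reliability, cutting token usage by about 90% relative to all-tools exposure.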

Abstract

Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.
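
The idea of exposing only the next-step tool frontier can be sketched with a small planner. This is an illustration under stated assumptions rather than the paper's algorithm: the contract format (sets of required and produced facts), the tool names, and the breadth-first search over states are all hypothetical.

```python
# Hypothetical precondition-effect contracts: a tool needs the facts in "pre"
# and adds the facts in "eff" to the state.
TOOLS = {
    "search_flights": {"pre": {"has_dates"}, "eff": {"has_flight_options"}},
    "book_flight": {"pre": {"has_flight_options"}, "eff": {"flight_booked"}},
    "send_email": {"pre": set(), "eff": {"email_sent"}},
    "get_weather": {"pre": set(), "eff": {"has_weather"}},
}

def applicable(state, tools):
    """Tools whose preconditions hold and whose effects are not all present."""
    return [name for name, c in tools.items()
            if c["pre"] <= state and not c["eff"] <= state]

def next_step_frontier(state, goal, tools):
    """First tools of all shortest plans from `state` to `goal` (empty if none)."""
    start = frozenset(state)
    if goal <= start:
        return set()
    level = {start: set()}  # state -> first tools that lead to it
    seen = {start}
    while level:
        next_level, frontier = {}, set()
        for current, firsts in level.items():
            for name in applicable(current, tools):
                new_state = frozenset(current | tools[name]["eff"])
                new_firsts = firsts or {name}
                if goal <= new_state:
                    frontier |= new_firsts
                elif new_state not in seen:
                    next_level.setdefault(new_state, set()).update(new_firsts)
        if frontier:
            return frontier
        seen.update(next_level)
        level = next_level
    return set()

visible = next_step_frontier({"has_dates"}, {"flight_booked"}, TOOLS)
```

In this toy menu, `send_email` and `get_weather` are always callable but never advance the goal, so they stay hidden; only the tool that starts a shortest plan is shown to the agent.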

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

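Scoring rationale: The paper focuses on tool selection and causal filtering for LLM agents, aiming to improve reliability and reduce token cost. The supplied keywords emphasize multimodal architectures (Visual Encoder, MultiModal, MLLM), model unification (Unify Models), tokenizer mechanisms (Tokenizer), world models (World Models), and model-based reinforcement learning (model-based RL). The paper does not involve visual encoding, multimodal fusion, tokenizer design, world model construction, or model-based RL algorithms, so its relevance to all given keywords is very low.

Keywords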

Tool Choice, Causal Minimal Tool Filtering, LLM Agents, Tool Selection, Reliability, Token Cost, Precondition-Effect Contracts

225. Your GFlowNet Secretly Learns an Optimal Transport Plan [FAIL]

Score: 0.0 / 27.8

Authors: Ian Maksimov, Nikita Morozov, Denis Belomestny, Sergey Samsonov

Published: 2026-06-04

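TL;DR: This paper establishes a theoretical connection between non-acyclic generative flow networks and optimal transport, showing that the learned policy encodes an optimal transport plan on the graph.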

Abstract

Generative Flow Networks (GFlowNets) are a framework for sampling structured objects via stochastic trajectories in a directed graph. In this work, we establish a theoretical connection between non-acyclic GFlowNets and optimal transport (OT). We show that fixing the initial flow distribution in a minimum-flow GFlowNet reduces its objective to a Kantorovich OT problem with graph-induced shortest path costs. At the optimum, the learned GFlowNet policy therefore encodes an optimal transport plan from the source distribution to the target distribution: we show that sampling trajectories from the minimum-flow GFlowNet recovers the corresponding optimal coupling. Our formulation enables applying the GFlowNet learning framework to OT problems on large graphs via edge flows and neural parameterization. Experiments confirm agreement with exact OT solvers and demonstrate that GFlowNets can learn high-quality transport plans.
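
For reference, the Kantorovich problem the abstract refers to can be written in its standard discrete form as follows. The symbols are assumptions introduced here ($\mu$ and $\nu$ for the source and target distributions over graph vertices, $d_G(x,y)$ for the graph-induced shortest-path cost, $\pi$ for the coupling); the paper's own notation may differ.

```latex
\min_{\pi \in \Pi(\mu,\nu)} \sum_{x,y} d_G(x,y)\,\pi(x,y),
\qquad
\Pi(\mu,\nu) = \Bigl\{ \pi \ge 0 \;:\; \sum_{y} \pi(x,y) = \mu(x),\ \sum_{x} \pi(x,y) = \nu(y) \Bigr\}
```

A minimizer $\pi^\star$ is the optimal coupling that, per the abstract, is recovered by sampling trajectories from the minimum-flow GFlowNet.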

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

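Scoring rationale: The paper's core is the theoretical connection between generative flow networks (GFlowNets) and optimal transport (OT); it does not touch keyword topics such as multimodal large models (MLLM), visual encoders, tokenizers, or world models. Although GFlowNets are related to reinforcement learning, the paper concentrates on optimal transport plans on graphs rather than model-based RL or unified multimodal models, so its relevance to the given keywords is very low.

Keywords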

Generative Flow Networks, Optimal Transport, Graph-induced Shortest Path, Kantorovich Problem, Neural Parameterization, Edge Flows, Stochastic Trajectories

226. Bridging the Semantic-Collaborative Gap: An Asymmetric Graph Architecture for Cold-Start Item Recommendation [FAIL]

Score: 0.0 / 27.8

Authors: Anh Truong, John Trenkle, Yuanbo Chen, Honghong Zhao, Abdullah Alchihabi, Effy Fang, Michael Tamir

Published: 2026-06-04

TL;DR: This paper proposes an asymmetric graph architecture (Shallow-RHS) to address cold-start recommendation by mapping intrinsic features into collaborative-filtering-aware embeddings, achieving improved engagement in online experiments.

Abstract

Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. In Tubi's production retrieval system, this challenge is further constrained by the serving interface: new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. We address this setting by formulating cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow with respect to the graph and encodes content solely from intrinsic features. The RHS tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. We further extend the same representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement, promotion speed, impression acquisition, and device cold-start engagement.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on cold-start recommendation using graph neural networks and asymmetric architectures, which is unrelated to multimodal large models, tokenizers, visual encoders, world models, or model-based reinforcement learning. None of the specified expert authors are listed in the authorship.

Keywords

Cold-Start Recommendation, Asymmetric Graph Architecture, Collaborative Filtering, Graph Completion, Intrinsic Features, Bipartite Graph, Device Embedding

227. Unsupervised Pattern Analysis in Japanese Veterinary Toxicology: A Regulatory-Compliant Framework for Cross-Species Risk Assessment [FAIL]

Score: 0.0 / 27.8

Authors: Yukiko Kawakami, Mohammad Shirazi, Ryo Shimizuwa, Saito Shinoda, Alireza Mortazavi, Matsumoto Kawahara

Published: 2026-06-04

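TL;DR: This study proposes a regulatory-compliant unsupervised framework that identifies cross-species toxicity patterns in veterinary adverse drug event reports, with drug-level clusters reaching 83% alignment with pharmacological classes.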

Abstract

Veterinary pharmacovigilance systems are essential for monitoring adverse drug events (ADEs), yet existing approaches often fail to capture region-specific toxicity patterns shaped by local biological and regulatory contexts. In Japan, these challenges are amplified by species-specific metabolic differences and reporting practices defined by the Ministry of Agriculture, Forestry, and Fisheries (MAFF). Most prior work relies on prediction-oriented models, limiting mechanistic interpretability. This study proposes a regulatory-integrated unsupervised framework for pattern discovery using the National Veterinary Assay Laboratory (NVAL) database. ADEs are encoded into organ system-aligned representations and adjusted for species-specific reporting biases, enabling cross-species comparison. Similarity-based clustering and dimensionality reduction are applied to identify latent toxicity structures. Analysis of 4,120 high-confidence ADE reports (9,080 drug-ADE combinations) identified three significant species clusters (p < 0.01), including hepatic-dominant patterns in companion animals (0.42 $\pm$ 0.06), renal toxicity in ruminants (0.39 $\pm$ 0.07), and dermatological sensitivity in sheep (0.35 $\pm$ 0.07). Drug-level clustering achieved 83% alignment with pharmacological classes, while cosine similarity outperformed alternative metrics (silhouette score: 0.48; cluster precision: 87%). Regulatory validation showed strong agreement with established classifications. These findings demonstrate that regulation-aligned unsupervised analysis can uncover biologically meaningful, region-specific toxicity patterns, providing an interpretable and scalable framework for veterinary drug safety assessment.

Score details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

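Scoring rationale: The paper focuses on unsupervised pattern analysis and cross-species risk assessment in veterinary toxicology, using classical statistical learning methods such as clustering and dimensionality reduction. The supplied keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all point to modern large language models, multimodal architectures, and reinforcement learning. The paper does not involve unified model architectures, tokenizers, visual encoders, world models, multimodal large models, or reinforcement learning algorithms, and has no substantive connection to the specified keywords, so every relevance score is 0. The author list does not include experts such as Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang, so the expert bonus is not triggered.

Keywords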

Unsupervised Pattern Analysis, Veterinary Toxicology, Regulatory-Compliant Framework, Cross-Species Risk Assessment, Adverse Drug Events, Species Clustering, Dimensionality Reduction

228. ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity [FAIL]

Score: 0.0 / 27.8

Authors: Prathamjyot Singh, Ashima Sood, Sahil Sharma, Jasmeet Singh

Published: 2026-06-04

TL;DR: ProSarc is an audio-only sarcasm detection framework that achieves high performance by modeling temporal prosodic incongruity between local dynamics and emotional baselines using dual encoding paths.

Abstract

We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.
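
The abstract mentions Monte Carlo dropout as the source of uncertainty estimates. A minimal sketch of that general technique follows, using a toy logistic scorer as a stand-in for the classifier. The features, weights, and dropout rate are hypothetical and are not taken from ProSarc.

```python
import math
import random


def stochastic_forward(features, weights, bias, drop_prob, rng):
    """One forward pass with dropout kept active at inference time."""
    keep_prob = 1.0 - drop_prob
    logit = bias
    for x, w in zip(features, weights):
        if rng.random() < keep_prob:
            logit += w * x / keep_prob  # inverted-dropout rescaling
    return 1.0 / (1.0 + math.exp(-logit))


def mc_dropout_predict(features, weights, bias, drop_prob=0.3, passes=200, seed=0):
    """Average many stochastic passes; the spread serves as an uncertainty proxy."""
    rng = random.Random(seed)
    probs = [stochastic_forward(features, weights, bias, drop_prob, rng)
             for _ in range(passes)]
    mean = sum(probs) / passes
    variance = sum((p - mean) ** 2 for p in probs) / passes
    return mean, variance


# Hypothetical utterance-level features and scorer parameters.
features = [0.8, -0.4, 1.2]
weights = [1.5, -2.0, 0.7]
mean_prob, variance = mc_dropout_predict(features, weights, bias=-0.5)
print(f"mean={mean_prob:.3f} variance={variance:.4f}")
```

A higher variance across passes marks an input on which the stochastic sub-models disagree, which is the quantity the abstract relates to perceptual ambiguity.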

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on an audio-only sarcasm detection framework using prosodic incongruity and encoders (BiLSTM, Attention). It does not involve unifying large models, tokenization, visual encoders (audio-only), world models, multimodal large language models, multimodal fusion, or reinforcement learning. Thus, there is no relevance to the provided keywords.

Keywords

Prosody-Aware, Sarcasm Recognition, Temporal Prosodic Incongruity, Audio-only Framework, Global Emotion Encoder, BiLSTM, Multi-head Attention, Uncertainty Estimates

229. ITP-STDP: An Intrinsic-Timing Power-of-Two Learning Engine for On-Chip SNN Training [FAIL]

Score: 0.0 / 27.8

Authors: Haihang Xia, Xinyu Zhao, Xuecheng Wang, John Goodenough, Charith Abhayaratne, Panagiotis A. Panagiotou, Chunyi Song, Tiantai Deng

Published: 2026-06-04

TL;DR: This paper proposes an energy-efficient hardware architecture for on-chip SNN training using intrinsic-timing power-of-two STDP, achieving significant speedup and area reduction compared to prior works.

Abstract (Chinese translation)

脉冲神经网络（SNN）有望成为第三代神经网络，并在广泛的应用领域引起了日益增多的关注。然而，SNN 中大量的突触连接会导致训练期间片上学习算法进行密集的权重更新计算，从而造成显著的硬件资源占用和能耗。在现有的 SNN 学习算法中，脉冲时序依赖可塑性（STDP）是最受广泛研究且应用最为广泛的算法之一，作为 SNN 中的基本学习组件。为了解决与 SNN 训练相关的硬件和能量开销，本文提出了内在时序基于 2 的幂次 STDP（ITP-STDP）及其对应的原型学习引擎硬件架构。所提出的设计通过专用的平均场突触漂移模型进行评估，以进行动力学分析，并在不同规模及数据集的 SNN 网络中进一步验证。该设计进一步在 ASIC（专用集成电路）和 FPGA（现场可编程门阵列）平台上实现，并与最先进的方法进行比较，包括原始 STDP 及更复杂的 STDP 变体。结果表明，该设计具有更优越的能效、更高的运行速度以及显著降低的硬件资源利用率，这是因为所提出的设计通过算法级和硬件级优化消除了 STDP 的大部分计算开销。在 FPGA 平台上，所提出的设计相比对比设计，能效提高了 4.5 倍至 219.8 倍。在 ASIC 平台上，所提出的设计实现了 4.8 倍至 22.01 倍的加速比，同时仅消耗先前工作所需面积的 1.2% 至 3.3%。

Abstract

Spiking neural networks (SNNs) have the potential to emerge as the third generation of neural networks and have attracted increasing attention across a wide range of applications. However, the large number of synaptic connections in SNNs leads to intensive weight-update computation by on-chip learning algorithms during training, resulting in substantial hardware resource utilization and energy consumption. Among existing SNN learning algorithms, spike-timing-dependent plasticity (STDP) is one of the most extensively studied and widely adopted, serving as a fundamental learning component in SNNs. To address the hardware and energy overheads associated with SNN training, this paper presents intrinsic-timing power-of-two STDP (ITP-STDP) and its corresponding prototype learning engine hardware architecture. The proposed design is evaluated through a dedicated mean-field synaptic drift model for dynamical analysis and further validated across SNN networks of different scales and datasets. It is further implemented on both ASIC and FPGA platforms and compared with state-of-the-art approaches, including the original STDP and more complex STDP variants. The results demonstrate superior energy efficiency, higher operating speed, and substantially lower hardware resource utilization, as the proposed design eliminates most of the computational overhead of STDP through both algorithmic and hardware-level optimizations. On the FPGA platform, the proposed design improves energy efficiency by 4.5$\times$ to 219.8$\times$ over the compared designs. On the ASIC platform, the proposed design achieves a 4.8$\times$ to 22.01$\times$ speedup while consuming only 1.2% to 3.3% of the area required by prior works.
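
The abstract does not spell out the ITP-STDP update rule. As a generic illustration of why power-of-two arithmetic is cheap in hardware, the sketch below replaces the exponential STDP window with a right shift on integer weights. The amplitude, time constant, and 8-bit weight range are assumptions chosen for illustration and should not be read as the paper's design.

```python
def pow2_stdp_delta(dt, amplitude=64, tau_steps=4):
    """Integer STDP update where the decay is a right shift (halving per tau_steps).

    dt = t_post - t_pre in time steps: positive dt potentiates, negative depresses.
    """
    if dt == 0:
        return 0
    magnitude = amplitude >> (abs(dt) // tau_steps)  # amplitude * 2^-(|dt| // tau)
    return magnitude if dt > 0 else -magnitude


def apply_update(weight, dt, w_min=0, w_max=255):
    """Add the update and clip to an assumed unsigned 8-bit weight range."""
    return max(w_min, min(w_max, weight + pow2_stdp_delta(dt)))


for dt in (1, 4, 8, -8, 40):
    print(dt, pow2_stdp_delta(dt))
```

The shift stands in for a multiplier or exponential lookup, which is the kind of saving a power-of-two formulation targets.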

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on Spiking Neural Networks (SNN) hardware optimization for STDP training, whereas the provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Unified Architectures. There is no conceptual overlap regarding tokenizers, visual encoders, or model-based RL. None of the specified expert authors are listed.

Keywords

Spiking neural networks, STDP, On-chip learning, Hardware architecture, Energy efficiency, ASIC, FPGA, Intrinsic-timing

230. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation [FAIL]

Score: 0.0 / 27.8

Authors: Shengtao Zheng, Kai Li, Weichen Zhang, Yu Meng, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

Published: 2026-06-04

Abstract (Chinese translation)

端到端视觉 - 语言 - 动作（VLA）模型在无人机（UAV）导航中展现出潜力。然而，现有方法通常依赖历史观测直接预测动作，往往在密集城市环境中表现不佳，其中严重的遮挡和急转弯会导致剧烈的视角转换。我们认为，具备“想象”未来状态的能力——这是世界模型（World Models）所固有的——在这种部分可观测性下对于稳健决策至关重要。为此，我们构建了一个具有挑战性的城市峡谷穿越基准（Urban Canyon Traversal Benchmark），专门用于评估在具有严重遮挡和剧烈视角转换特征的场景中的空间理解能力。为此，我们提出了 WorldFly，一种基于世界模型的 VLA 框架，该框架采用双分支耦合流匹配机制，联合生成未来视频预测和导航动作，从而通过空间想象显式地引导智能体的策略。在我们基准上的广泛评估表明，WorldFly 优于其他基线方法，特别是在未见环境中，验证了将世界模型整合到具身空中代理中的有效性。

Abstract

End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to "imagine" future states -- inherent in World Models -- is critical for robust decision-making under such partial observability. To address this, we construct a challenging Urban Canyon Traversal Benchmark, specifically designed to evaluate spatial understanding in scenarios characterized by severe occlusions and drastic viewpoint transitions. To this end, we propose WorldFly, a novel world-model-based VLA framework that employs a dual-branch coupled flow matching mechanism to jointly generate future video predictions and navigation actions, thereby explicitly guiding the agent's policy via spatial imagination. Extensive evaluations on our benchmark demonstrate that WorldFly outperforms other baselines, particularly in unseen environments, validating the effectiveness of integrating world models into embodied aerial agents.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 69 (char 293)

231. A Finite Certificate for the Positive $n=9$ Vasc Inequality [FAIL]

Score: 0.0 / 27.8

Authors: Dakai Guo, Ruichen Qiu, Yichuan Cao, Ruyong Feng

Published: 2026-06-04

TL;DR: This paper proves the positive-real n=9 case of the Vasc cyclic inequality by constructing a finite certificate using human-guided AI agent assistance for polynomial inequality verification.

Abstract (Chinese translation)

我们证明了 Vasc 循环不等式在正实数域上 $n=9$ 的情形。该证明是在人类指导下，由 AI 代理 MechMath Agent Team 协助获得的：人类可读部分将有理不等式归约为齐次多项式不等式，固定循环最大值，并通过累积间隙参数化每个排序后的固定最大值锥体；有限部分是一个覆盖所有 $8!=40320$ 个排序锥体的证明证书。MechMath Agent Team 通过 Python 工具调用生成了证书验证工作流，包括案例划分、验证程序和终端分类。发布的证书包含 $36815$ 个系数叶、$2236$ 个普通 Polya 乘子叶以及 $1269$ 个 AM-GM 中点叠加叶。人类作者审查了数学归约与验证逻辑，一个独立工件包含该证书、独立验证器以及从源码重建的路径。

Abstract

We prove the positive-real $n=9$ case of the Vasc cyclic inequality. The proof was obtained with human-guided assistance from the AI agent MechMath Agent Team: the human-readable part reduces the rational inequality to a homogeneous polynomial inequality, fixes a cyclic maximum, and parametrizes each sorted fixed-maximum cone by cumulative gaps; the finite part is a certificate covering all $8!=40320$ sorted cones. MechMath Agent Team generated the certificate verification workflow through Python tool calls, including the case split, verification programs, and terminal classifications. The published certificate has $36815$ coefficient leaves, $2236$ ordinary Polya multiplier leaves, and $1269$ AM-GM midpoint overlay leaves. Human authors audited the mathematical reductions and verification logic, and a separate artifact contains the certificate, an independent verifier, and a from-source rebuild route.
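
A quick arithmetic check on the figures quoted in the abstract: the three leaf counts add up to the number of sorted cones, $8!$. The abstract does not say that each cone ends in exactly one leaf, so the sketch only confirms that the totals are consistent with that reading.

```python
import math

# Leaf counts as quoted in the abstract.
leaves = {
    "coefficient": 36815,
    "ordinary Polya multiplier": 2236,
    "AM-GM midpoint overlay": 1269,
}

total_leaves = sum(leaves.values())
sorted_cones = math.factorial(8)  # orderings of the 8 non-maximal variables

print(total_leaves, sorted_cones, total_leaves == sorted_cones)
```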

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on proving the n=9 Vasc cyclic inequality using algebraic methods and AI-assisted certificate generation, belonging to pure mathematics and computer algebra. The provided keywords relate to Multimodal Large Language Models, World Models, and Reinforcement Learning (e.g., Tokenizer, Visual Encoder, Model-Based RL). There is no domain overlap or methodological connection between algebraic inequality proofs and multimodal representation learning or RL, hence all relevance scores are 0.

Keywords

Vasc Inequality, Polynomial Inequality, Finite Certificate, AI Agent Assistance, Cyclic Inequality, MechMath Agent, Algebraic Verification

232. A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice [FAIL]

Score: 0.0 / 27.8

Authors: Ranjan Mishra, Jakob Schoeffer

Published: 2026-06-04

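TL;DR: This paper proposes a formal framework for measuring appropriate reliance on set-valued AI advice in human-AI collaboration, introducing dedicated metrics for classification and regression tasks.

Abstract (Chinese translation)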

对 AI 建议的适当依赖已成为人机协作中的核心研究主题。现有框架仅专注于点预测作为 AI 建议。然而，集合值 AI 建议（例如离散集合或连续区间）正被越来越多地用于传达不确定性并改善人类决策。本文在顺序判断者 - 建议者范式（sequential judge-advisor paradigm）内，开发了首个用于衡量对集合值 AI 建议适当依赖的正式框架，涵盖分类和回归任务。对于分类任务，我们首先介绍了评估集合值 AI 建议所需的维度。随后，我们定义了两个指标：对 AI 的正确依赖率（correct reliance rate on AI）和对自身的正确依赖率（correct reliance rate on self），这两个指标共同刻画了该情境下的适当依赖。对于回归任务，我们引入了 AI 依赖量（quantity of AI reliance）和 AI 依赖质量（quality of AI reliance），分别衡量决策者是否利用了 AI 建议以及他们的依赖是否帮助他们相对于初始估计更接近真实值（ground truth）。通过应用我们的框架，我们展示了这些指标如何捕捉现有度量所忽略的人机协作中的重要细微差别。

Abstract

Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

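Scoring rationale: The paper's core is measuring reliance on set-valued advice in human-AI collaboration (Human-AI Collaboration & Set-valued Advice Evaluation), whereas the keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) focuses on multimodal large-model architectures, representation learning, and reinforcement learning models. The two share no research goals or technical approaches, so every keyword relevance is 0. In addition, the author list (Ranjan Mishra, Jakob Schoeffer) does not include Yang Shi or the other specified experts, so no bonus applies.

Keywords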

Set-valued AI advice, Human-AI collaboration, Judge-advisor paradigm, Appropriate reliance, Classification and regression, Uncertainty communication, Sequential decision making

233. ATT-CR: Adaptive Triangular Transformer for Cloud Removal [FAIL]

Score: 0.0 / 27.8

Authors: Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

Published: 2026-06-04

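TL;DR: This paper proposes an Adaptive Triangular Transformer (ATT-CR) for cloud removal in remote sensing images, achieving efficient and accurate image reconstruction by reducing computational complexity and filtering out interference from cloudy pixels.

Abstract (Chinese translation)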

云去除（Cloud Removal）旨在准确重建遥感图像中被云层遮挡的地物。现有的基于 Transformer 的方法利用自注意力（self-attention）机制，通过在云图像中有效建模长程依赖（long-range dependencies），取得了令人印象深刻的结果。然而，它们面临以下问题：1) 自注意力的高计算复杂度限制了模型的扩展性；2) 在注意力计算中将云像素和无云像素均视为有效特征，会在后续层引入干扰，导致性能次优。为应对这些挑战，本文提出自适应三角云去除 Transformer（Adaptive Triangular Transformer for Cloud Removal，简称 ATT-CR），该模型能有效降低计算成本并减轻云像素的干扰。具体而言，该模型包含两个核心组件：三角注意力（Triangular Attention，TAN）和特征选择门控模块（Feature Selected Gating Module，FSGM）。TAN 利用下三角矩阵和上三角矩阵来近似 Softmax 注意力，其计算复杂度为 O(N)，显著降低了计算成本。另一方面，FSGM 与 TAN 相结合，自适应区分云特征和无云特征，从而最大限度地减少无效信息向后续层的引入。在云去除基准数据集上的大量实验表明，ATT-CR 相较于现有方法展现出优越的性能。

Abstract

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

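Scoring rationale: The paper focuses on cloud removal in remote sensing images, which belongs to computer vision and remote sensing image processing. The supplied keyword set (MLLM, World Models, Tokenizer, model-based RL, etc.) focuses on multimodal large models and reinforcement learning. The paper does not involve language-model tokenizers, multimodal alignment, reinforcement learning, or world-model construction; although it processes visual data, it is not a visual encoder within an MLLM architecture, so it has no substantive connection to any of the given keywords.

Keywords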

Cloud Removal, Remote Sensing Images, Triangular Attention, Feature Selected Gating Module, Computational Complexity, Image Restoration, Adaptive Transformer

234. AttackPathGNN: Cross-function vulnerability detection in smart contracts using state interference graphs and conjunction pooling [FAIL]

Score: 0.0 / 27.8

Authors: Gabriela Dobrita, Simona-Vasilica Oprea, Adela Bara

Published: 2026-06-04

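TL;DR: AttackPathGNN is a graph-neural-network method for cross-function vulnerability detection that models state interference and attack paths, achieving high F1 scores on smart-contract security benchmarks.

Abstract (Chinese translation)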

现有的基于学习的 Solidity 智能合约漏洞检测器将漏洞检测简化为单个函数内的语法模式匹配，然而许多最具严重后果的攻击（The DAO, Cream Finance）并不存在于任何单个函数中，而是存在于函数间关系以及使攻击可行的条件组合之中。因此，我们提出 AttackPathGNN，这是一种图神经网络（GNN），它将检测重新定义为基于显式攻击路径的推理。两个架构选择使其区别于先前的基于 GNN 的检测器：(1) 状态干扰图（State Interference Graph），它通过类型化、加权边连接每对共享可变存储的函数，并通过由显式五条件谓词定义的有向重入路径边连接；(2) 合取池化（Conjunction pooling），这是一种针对八个命名攻击前提条件的可微分 AND 聚合器，其对数 -Sigmoid 形式导致每当存在任何单一防护措施（如重入防护、访问控制修饰符或 SafeMath）时，单函数攻击得分即归零。在五次独立训练运行中，AttackPathGNN 在 SmartBugs Wild 保留测试分区上达到 92.3±0.2% 的 F1 分数（假阴性率为 4.3±0.3%，在独立人工标注的 SmartBugs Curated 基准上检测率为 90.8±2.5%），在每种种子下均覆盖了 10 个 DASP10 类别中的 6 个且达到 100%，重入（Reentrancy）达到 98.7±1.8%。每个预测均附带结构化修复报告，将每个判定转化为可操作的、函数级别的审计发现。

Abstract

Existing learning-based detectors for Solidity smart contracts reduce vulnerability detection to syntactic pattern matching within single functions, yet many of the most consequential exploits (The DAO, Cream Finance) exist not in any individual function but in the relationship between functions and in the combination of conditions that made the attack feasible. Thus, we propose AttackPathGNN, a graph neural network (GNN) that reframes detection as reasoning over explicit attack paths. Two architectural choices distinguish it from prior GNN-based detectors: (1) a State Interference Graph that links every pair of functions sharing mutable storage through typed, weighted edges and through directed reentrancy-path edges defined by an explicit five-condition predicate; (2) conjunction pooling, a differentiable AND-aggregator over eight named exploit preconditions whose log-sigmoid form causes the per-function exploit score to collapse whenever any single mitigation (a reentrancy guard, an access-control modifier or SafeMath) is in place. Across five independent training runs, AttackPathGNN attains 92.3±0.2% F1 on the SmartBugs Wild held-out test partition (4.3±0.3% false-negative rate, 90.8±2.5% detection rate on the independently human-labelled SmartBugs Curated benchmark), recovering 6/10 DASP10 categories at 100% on every seed and Reentrancy at 98.7±1.8%. Each prediction is emitted with a structured remediation report, turning each verdict into an actionable, function-level audit finding.
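
The abstract describes conjunction pooling as a differentiable AND over eight exploit preconditions in log-sigmoid form. One plausible reading, sketched below, sums the log-sigmoids of per-precondition logits, which is the log of a product of probabilities. The logit values are made up and the exact parameterisation in the paper may differ.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def conjunction_pool(precondition_logits):
    """Soft AND: exp(sum_i log sigmoid(z_i)) = prod_i sigmoid(z_i)."""
    return math.exp(sum(log_sigmoid(z) for z in precondition_logits))


# Hypothetical logits for eight preconditions of one function.
all_hold = [10.0] * 8              # every precondition strongly satisfied
one_blocked = [10.0] * 7 + [-10.0]  # e.g. a reentrancy guard negates one of them

print(conjunction_pool(all_hold), conjunction_pool(one_blocked))
```

Because the score is a product, a single strongly negative logit drives it toward zero regardless of the other seven, which matches the collapse behaviour the abstract describes.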

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

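Scoring rationale: The paper focuses on detecting security vulnerabilities in smart contracts using graph neural networks (GNNs) and a state interference graph. The supplied keywords (multimodal large models, visual encoders, world models, reinforcement learning, etc.) all belong to generative AI and reinforcement learning and have no direct connection to this paper's security-engineering and graph-neural-network topic, so every keyword is scored 0.

Keywords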

AttackPathGNN, Cross-function vulnerability detection, Smart contracts, Graph neural network, State interference graph, Conjunction pooling, Solidity

235. Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI [FAIL]

Score: 0.0 / 27.8

Authors: Alexander Apartsin, Yehudit Aperstein

Published: 2026-06-04

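TL;DR: This paper proposes CoRe-3, an educational competency model that assesses and improves students' ability to use generative AI through three skills (Framing, Judging, and Steering), rather than examining the technical architecture of AI models themselves.

Abstract (Chinese translation)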

生成式人工智能使得获取答案变得容易，却使理解变得困难，不加批判的使用会引发认知卸载（cognitive offloading）。学校仍主要测量无辅助表现，但真正的任务是与 AI 协作产出优质成果：框定一个界定不清的任务、评估输出结果，并引导模型朝向更佳结果。这种能力很少被单独评估；即便被测量，也往往简化为一个单一的“提示”分数，无法诊断 AI 使用成功或失败的原因。我们提出 CoRe-3（协同推理，Co-Reasoning），这是一种能力模型，将生产性 AI 使用分解为三个可评估的技能，我们将其缩写为 FJS：框架化（Framing，即在调用 AI 前明确一个界定不清的任务）、判断（Judging，即评估输出中的错误及未明说的假设）、引导（Steering，即迭代地重定向模型）。其独特主张在于将生成前的框架化与生成后的引导分离开来，并将判断作为两者之间的关卡。我们将这些技能建立在理论基础之上，提出五个可检验的命题，并在 CoReasoningLab（协同推理实验室）中予以实现，这是一个开放平台，它呈现有缺陷的 AI 输出并对其进行独立评分。在模拟学习者（由不同模型生成并评分）上，这些技能表现出分离性：每种技能追踪其自身被操纵的能力，而其他技能保持不变；当一种能力在三个技能中共享时，分数表现出相关性（聚合效度和区分效度），且这一结果在两个不同提供者的评分后端上均成立。人类评分者的一致性及结果将在后续研究中呈现；我们发布该工具、数据及协议。

Abstract

Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one "prompting" score that cannot diagnose why AI use succeeds or fails. We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

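Scoring rationale: The paper belongs to educational psychology and human-computer interaction. It proposes a competency model (CoRe-3) for students' use of generative AI, comprising three skill dimensions: Framing, Judging, and Steering. The content is focused entirely on a teaching-assessment framework and does not involve any large-model internals (such as Tokenizer or Visual Encoder), world-model construction, multimodal large language model (MLLM) techniques, or reinforcement learning algorithms (model-based RL), so it is not relevant to any of the given technical keywords.

Keywords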

Generative AI, Competency Model, Framing Judging Steering, Human-AI Interaction, Education, Assessment, CoRe-3, Prompting

236. Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics [FAIL]

Score: 0.0 / 27.8

Authors: Tzur Shubi, Ariel Felner, Solomon Eyal Shimony, Shahaf S. Shperberg

Published: 2026-06-04

TL;DR: This paper proposes a bidirectional depth-first branch-and-bound algorithm for longest path problems using front-to-front heuristics, which is unrelated to multimodal models or reinforcement learning.

Abstract (Chinese translation)

双向启发式搜索可能减少适合向后搜索问题的搜索代价。其中，众所周知，前到前启发式函数可以减少节点展开次数，但其开销过高，导致总运行时间几乎总是增加。我们提出 BiXDFBnB，一种双向深度优先分支定界算法，该算法将单前沿双向搜索（SFBDS）框架——最初是为最短路径（MIN）问题开发的——应用于广义最长简单路径（GLSP）问题。由于 SFBDS 本质上基于配对状态进行操作，前到前（F2F）启发式评估自然产生，并避免了通常与双向前沿管理相关的开销。我们表明这种适应可以成功应用于最大化（MAX）问题，同时高效处理重叠约束。BiXDFBnB 被应用于多种最长路径问题，包括最长简单路径（LSP）、蛇形（Snakes）以及盒中线圈（CIB）。实验评估显示，新算法经常减少节点展开次数，在某些情况下还能改善总运行时间。

Abstract

Bidirectional heuristic search can potentially reduce search effort for problems amenable to backward search. Therein, it is well-known that front-to-front heuristics can reduce the number of node expansions, but their overhead is so high that overall runtime almost always increases. We propose BiXDFBnB, a bidirectional depth-first branch-and-bound algorithm that adapts the Single-Frontier Bidirectional Search (SFBDS) framework - originally developed for shortest-path (MIN) problems - to the Generalized Longest Simple Path (GLSP) setting. Because SFBDS inherently operates on paired states, front-to-front (F2F) heuristic evaluation arises naturally and avoids the overhead typically associated with bidirectional frontier management. We show that this adaptation can be successfully applied to maximization (MAX) problems while efficiently handling overlapping constraints. BiXDFBnB is applied to several types of longest-path problems: Longest Simple Path (LSP), Snakes, and Coil-in-the-Box (CIB). Empirical evaluation shows that the new algorithm frequently reduces the number of node expansions and, in some cases, also improves overall runtime.
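
For orientation, here is a minimal unidirectional depth-first branch-and-bound for the Longest Simple Path problem on an unweighted graph. It is a baseline sketch only, not BiXDFBnB: it has no backward search and no front-to-front heuristic. The upper bound used for pruning (every unvisited vertex could still be added) is a simple assumption chosen for brevity.

```python
def longest_simple_path(graph, start, goal):
    """Return (edge_count, path) of a longest simple path from start to goal."""
    best = {"length": -1, "path": None}

    def dfs(node, visited, path):
        length = len(path) - 1  # edges used so far
        if node == goal:
            if length > best["length"]:
                best["length"], best["path"] = length, list(path)
            return
        # MAX problems prune with an upper bound: at most one more edge
        # per unvisited vertex can still be appended to this path.
        unvisited = len(graph) - len(visited)
        if length + unvisited <= best["length"]:
            return
        for nxt in graph[node]:
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                dfs(nxt, visited, path)
                path.pop()
                visited.remove(nxt)

    dfs(start, {start}, [start])
    return best["length"], best["path"]


# Small undirected example graph given as adjacency lists.
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b", "d"],
    "d": ["b", "c"],
}
length, path = longest_simple_path(graph, "a", "d")
print(length, path)
```

Unlike shortest-path search, the incumbent here is a lower bound and the heuristic must overestimate, which is the MIN-to-MAX adaptation the abstract refers to.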

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on heuristic search algorithms for combinatorial optimization (longest path problems), while the provided keywords relate to Multimodal Large Language Models and World Models (e.g., Tokenizer, Visual Encoder). There is no conceptual overlap between the paper's content and the specified keyword domains.

Keywords

Bidirectional Search, Longest Paths, Front-to-Front Heuristics, Heuristic Search, Branch-and-Bound, Combinatorial Optimization, Pathfinding

237. Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations [FAIL]

Score: 0.0 / 27.8

Authors: Salvatore Greco, Hainiu Xu, Jacopo Domenicucci, Yulan He, Sylvie Delacroix

Published: 2026-06-04

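TL;DR: This paper studies how large language models acting as moral advisors can sustain conversation quality through uncertainty-scaffolding strategies, finding that the strategies affect the quality of engagement rather than the amount of stance revision.

Abstract (Chinese translation)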

大型语言模型 (LLMs) 正日益被部署于多种情境中，充当人工智能道德顾问 (AMA)：它们应当展现出何种对话模式？本文探讨了 AMA 如何帮助其对话者“与不确定性共存”。我们提出了三种不确定性模式 (视角多重化 (Perspective-Multiplying)、张力维持 (Tension-Preserving)、过程反思 (Process-Reflecting))，并将它们与三种控制条件 (基线 (Baseline)、说服性 (Persuasive)、奉承性 (Sycophantic)) 进行对比。一个用户代理 LLM 与遵循特定不确定性策略的 AMA 就一个道德困境展开对话，并完成对话前后的问卷。此外，我们还考察了两种角色提示格式 (陈述式 (Declarative) 和叙事式 (Narrative)) 的影响。研究发现：(1) 没有单一模型在模拟用户代理中占据主导地位，开源模型通过跨角色分歧 (between-persona divergence) 与人类模糊性对齐，而闭源模型则通过角色内留白 (within-persona hedging) 实现对齐；(2) 陈述式 (Declarative) 角色更能捕捉初始立场多样性，而叙事式 (Narrative) 角色则展现出更真实的信念修正；(3) 所有六种 AMA 策略均产生了可区分的对话模式；且 (4) 不确定性策略的差异不在于它们产生的立场修正量，而在于它们维持的互动质量 (quality of engagement)。

Abstract

LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors "stay with the uncertainty". We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

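Scoring rationale: The paper focuses on uncertainty-scaffolding strategies and conversation quality for large language models acting as moral advisors, which falls under ethical dialogue and prompt engineering. The supplied keywords (Tokenizer, Visual Encoder, World Models, model-based RL, etc.) all point to low-level technical components of multimodal architectures, representation learning, or reinforcement learning, and have no direct technical connection to this paper's subject (LLM-to-LLM conversation, ethical dilemmas, uncertainty scaffolding), so every relevance score is 0. The author list does not include any of the specified experts. The weighted total is 0, below the dynamic passing score of 27.8.

Keywords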

Artificial Moral Advisors, Uncertainty Scaffolding, LLM-to-LLM Conversation, Ethical Dilemma, Persona Prompt, Uncertainty Strategies, Simulated Conversations, Large Language Models

238. Retry Policy Gradients in Continuous Action Spaces [FAIL]

Score: 0.0 / 27.8

Authors: Soichiro Nishimori, Paavo Parmas

Published: 2026-06-04

TL;DR: This paper introduces ReMAC, an actor-critic algorithm utilizing retry objectives and pathwise derivatives to promote stochastic exploration in continuous action spaces without explicit entropy regularization.

Abstract (Chinese translation)

基于重试的目标函数（如 pass@K 和 max@K）优化从多条采样轨迹中获得的最佳回报，近期研究表明，它们无需显式探索奖励即可促进探索。在离散动作空间中，ReMax 被证明可通过适应回报不确定性来实现这一目标。在这项工作中，我们引入了针对重试目标函数的路径导数估计器，并利用它们将 ReMax 扩展至连续动作空间。我们研究了由此产生的学习动力学，并表明，即使在使用确定性奖励的情况下，ReMax 也能通过重塑策略梯度景观来鼓励随机探索。具体而言，它在方向和幅度上均改变了梯度：在方向上，使更新偏向更高的策略熵；在幅度上，抑制梯度并减缓收敛。我们进一步表明，Adam 的自适应归一化可根据其数值稳定参数缓解这种抑制效应。经验上，我们将此目标函数实例化为 ReMax Actor-Critic (ReMAC)，这是一种利用路径导数估计器优化 ReMax 目标的离策略演员 - 评论家算法。我们的实验表明，ReMAC 无需熵正则化即可促进更高的策略熵，且性能与 SAC 相当。

Abstract

Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor--critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.
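
To make the max@K objective concrete: it is the expected best return among K sampled trajectories. Given N ≥ K sampled returns, it can be estimated without resampling by weighting each sorted return by the fraction of size-K subsets in which it is the maximum. The sketch below shows that standard combinatorial estimator; it illustrates the objective only and is not the pathwise gradient estimator the paper introduces.

```python
from math import comb


def max_at_k(returns, k):
    """Estimate E[max of K returns] from N >= K sampled returns.

    The i-th smallest return (0-indexed) is the maximum of exactly
    comb(i, k - 1) of the comb(N, k) equally likely size-k subsets.
    """
    n = len(returns)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(returns)")
    ordered = sorted(returns)
    weighted = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return weighted / comb(n, k)


sampled_returns = [1.0, 2.0, 3.0]
for k in (1, 2, 3):
    print(k, max_at_k(sampled_returns, k))
```

With K = 1 the estimate reduces to the ordinary mean return, and as K grows it shifts weight toward the upper tail, which is why spread in the return distribution is rewarded.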

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on reinforcement learning policy gradients (ReMAC) in continuous action spaces, involving retry objectives and pathwise derivatives. It does not involve multimodal learning, large language models, tokenization, visual encoders, or unified world models, resulting in zero relevance to the provided keyword set. The total weighted score is 0, below the dynamic passing threshold of 27.8.

Keywords

Retry Policy Gradients, Continuous Action Spaces, ReMax, Pathwise Derivative Estimators, Actor-Critic, Stochastic Exploration, Policy Entropy, ReMAC

239. Compositional Boundaries for Density Fusion [FAIL]

Score: 0.0 / 27.8

Authors: Ratan Bahadur Thapa, Ali Darijani, Jürgen Beyerer, Steffen Staab

Published: 2026-06-04

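TL;DR: This paper studies the compositional boundaries of order-invariance for weighted probability density fusion in distributed uncertainty-management systems, establishing conditions such as normalized weighted linear pooling under which hierarchical execution is possible.

Abstract (Chinese translation)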

分布式不确定性管理系统（Distributed uncertainty-management systems）通常沿着由通信、隐私或调度约束选择的聚合树（aggregation trees）组合局部概率模型。最终的概率密度应依赖于加权源，而不依赖于中间节点组合它们的特定顺序。我们将这一要求视为加权概率密度二元融合（binary fusion）的代数组合性问题进行研究。核心问题是：局部融合规则（local fusion rule）在保持顺序无关性（order-invariant）的前提下，何时可以层次化执行。我们确立了局部分段值融合规则（local segment-valued fusion rules）的组合性边界。在具有可加输出权重和仅权重系数的连续二元规则类中，顺序无关的层次化执行刻画了归一化加权线性池化（normalized weighted linear pooling）；范数诱导的分段平衡则实现了相应的系数。光滑的端点到候选者 $f$-散度（$f$-divergence）平衡具有不同的局部几何：其二次展开诱导了平方根有效权重，这表明仅凭成对可解性不足以实现与调度无关的融合（schedule-independent fusion）。我们表明，这种障碍局限于端点到候选者的二元平衡，而全局散度重心（global divergence barycenters）则保留了可加权重的局部极限。最后，高斯混合模型（Gaussian mixtures）展示了相同问题如何出现在有限模型类中：精确融合（exact fusion）具有组合性，而逐步压缩（stepwise compression）仅在未归一化分量测度满足同余条件（congruence condition）时才具有组合性。这些结果区分了精确的与调度无关的融合、全局聚合目标以及局部近似启发式。

Abstract

Distributed uncertainty-management systems often combine local probabilistic models along aggregation trees chosen by communication, privacy, or scheduling constraints. The final density should depend on the weighted sources, not on the particular order in which intermediate nodes combine them. We study this requirement as an algebraic compositionality problem for binary fusion of weighted probability densities. The central question is when a local fusion rule can be executed hierarchically while remaining order-invariant. We establish a compositional boundary for local segment-valued fusion rules. Within the class of continuous binary rules with additive output weights and weight-only coefficients, order-invariant hierarchical execution characterizes normalized weighted linear pooling; norm-induced segment balancing realizes the corresponding coefficient. Smooth endpoint-to-candidate $f$-divergence balancing has a different local geometry: its quadratic expansion induces square-root effective weights, showing why pairwise solvability alone is insufficient for schedule-independent fusion. We show that this obstruction is local to endpoint-to-candidate binary balancing, whereas global divergence barycenters retain additive-weight local limits. Finally, Gaussian mixtures show how the same issue appears in finite model classes: exact fusion is compositional, whereas stepwise compression is compositional only under a congruence condition on unnormalized component measures. These results distinguish exact schedule-independent fusion from global aggregation objectives and local approximation heuristics.
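
The order-invariance property can be illustrated on discrete densities. In the sketch below, normalized weighted linear pooling carries an additive weight alongside each density, so the result does not depend on how the aggregation tree is bracketed, while a naive unweighted average does. The three sources and their weights are invented for illustration.

```python
def fuse(a, b):
    """Normalized weighted linear pooling of two (weight, density) pairs."""
    (wa, pa), (wb, pb) = a, b
    w = wa + wb  # output weight is additive
    return w, [(wa * x + wb * y) / w for x, y in zip(pa, pb)]


def naive_average(p, q):
    """Weight-blind midpoint, shown only as an order-dependent contrast."""
    return [(x + y) / 2 for x, y in zip(p, q)]


# Hypothetical weighted sources over a three-point sample space.
A = (1.0, [0.7, 0.2, 0.1])
B = (2.0, [0.1, 0.8, 0.1])
C = (3.0, [0.3, 0.3, 0.4])

left = fuse(fuse(A, B), C)    # ((A, B), C)
right = fuse(A, fuse(B, C))   # (A, (B, C))
print(left)
print(right)

naive_left = naive_average(naive_average(A[1], B[1]), C[1])
naive_right = naive_average(A[1], naive_average(B[1], C[1]))
print(naive_left, naive_right)
```

Both bracketings of `fuse` agree up to floating-point rounding, whereas the two naive results differ, which is the kind of schedule dependence the paper's compositional boundary rules out.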

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

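Scoring rationale: The paper focuses on probability density fusion and algebraic compositionality in distributed systems, which falls under statistical inference. It has no direct connection to the deep learning areas represented by the keywords, such as multimodal large language models (MLLM), tokenizers, visual encoders, world models, and reinforcement learning. The author list contains none of the designated experts.

Keywords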

Density Fusion, Compositional Boundaries, Probability Distributions, Order-Invariant, Hierarchical Execution, Weighted Linear Pooling, Gaussian Mixtures

240. Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction [FAIL]

Score: 0.0 / 27.8

Authors: Hu Tan, Kuo Gai, Shihua Zhang

Published: 2026-06-04

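TL;DR: This paper uses deep linear network theory to explain the "two training clocks" mechanism in grokking, showing that the fast decay of the classification loss is separated from the slower simplification of the learned representation.

Abstract (Chinese translation)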

Grokking 表明，拟合训练数据与学习一个简单的潜在规则可能发生在不同的时间尺度上。我们通过将分类损失的快速衰减与学习到的表示的较慢简化分离开来，形式化这一现象，并将由此产生的停止时间对称为两个训练时钟（training clocks）。对于深度线性网络，我们表明，后边缘间隙增长（post-margin gap-growth）或单步尾部收缩（one-step tail-contraction）条件可在对数时间尺度上将交叉熵损失降低至 ε 水平。相反，当存在逐层权重衰减时，端到端映射上诱导的正则可以表示为 Schatten 型惩罚；在尖锐晚期 Kurdyka-Łojasiewicz 尾部条件下，这种结构能量可在多项式时间尺度上收敛。因此，这两个训练时钟将拟合过程与表示简化过程分离开来。随后，我们解释了相同机制如何出现在 ReLU 多层感知机（MLPs）中。在训练集上激活模式保持固定的区域，网络在活跃坐标上简化为线性模型。在两层 ReLU 嵌入模型中，链式法则估计进一步表明，在受控下游范数下，分类器头可获得比嵌入块更大的有效梯度。这支持了一种两阶段机制：分类器先进行拟合，而表示随后继续简化。我们以模加法作为主要实验设置。深度线性理论提供了该分析的严谨核心。但 ReLU 的结果被表述为条件归约，旨在解释经验行为，而非声称对非线性训练动力学的全局证明。

Abstract

Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level $\epsilon$ on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Łojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks, therefore, separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis. The ReLU results, by contrast, are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

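Scoring rationale: The paper studies the grokking phenomenon and the theory of training dynamics in deep learning (deep linear networks, ReLU MLPs), covering loss decay and representation simplification. The keywords concern multimodal large models, world models, and reinforcement learning. The two areas do not overlap, so every relevance score is 0.

Keywords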

Grokking, Deep Linear Network, Training Clocks, Conditional ReLU Reduction, Representation Simplification, Modular Addition, ReLU MLP

241. LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models [FAIL]

Score: 0.0 / 27.8

Authors: Rui Wang, Yan Zhao, Li Song, Zhengxue Cheng

Published: 2026-06-04

TL;DR: LLMCodec utilizes video codec technology to efficiently compress large language model weights, reducing perplexity and improving task accuracy without requiring fine-tuning data.

Abstract (Chinese translation)

大型语言模型（LLMs）的快速发展推动了自然语言处理领域的显著进步。然而，这些模型规模的扩大在存储、传输和部署方面带来了巨大挑战。尽管在模型压缩和量化方面已投入巨大努力，现有方法通常依赖于微调或校准数据，在不同张量类型上的泛化能力有限。本文认为，视频编解码器为 LLM 压缩提供了有前景的解决方案，因为它们与矩阵结构化数据具有内在兼容性，压缩策略可配置，且拥有高度优化的现成实现。因此，我们提出了 LLMCodec，一种基于视频编解码器的 LLM 压缩方法，该方法将仿射量化与最新的 VVC/H.266 视频编解码器相结合。除了 VVC 之外，我们还进一步比较了一系列视频编解码器和编码配置，以评估它们对压缩性能的影响。在不同模型上的实验证明了 LLMCodec 的鲁棒性和通用性。值得注意的是，在 2 比特精度下的 LLaMA-3-8B 模型上，与现有方法相比，LLMCodec 将困惑度降低超过 1.5 倍，并将下游任务准确率提高了 21%。

Abstract

The rapid development of large language models (LLMs) has led to remarkable advances in natural language processing. However, the increasing scale of these models introduces substantial challenges in terms of storage, transmission, and deployment. Though great efforts have been devoted to model compression and quantization, existing methods often rely on fine-tuning or calibration data, which exhibit limited generalization across different tensor types. In this paper, we argue that video codecs offer a promising solution for LLM compression, due to their inherent compatibility with matrix-structured data, configurable compression strategies, and the availability of highly optimized, off-the-shelf implementations. Therefore, we present LLMCodec, a video codec-based LLM compression method that integrates affine quantization with the recent VVC/H.266 video codec. Beyond VVC, we further compare a range of video codecs and encoding profiles to evaluate their impact on compression performance. Experiments on different models demonstrate the robustness and generality of LLMCodec. Notably, on LLaMA-3-8B at 2-bit precision, LLMCodec reduces perplexity by over 1.5x and improves downstream task accuracy by 21% compared with the existing method.
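The affine quantization step mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain per-tensor min/max affine quantization to 8-bit integers, which is a common textbook scheme and not necessarily the one LLMCodec uses; the function names and sample weights are hypothetical, and the video-codec stage itself is not shown.

```python
def affine_quantize(weights, bits=8):
    """Map floats to integers in [0, 2**bits - 1] with a scale and an offset."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0  # guard against a constant tensor
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo

def affine_dequantize(q, scale, lo):
    """Invert the mapping; the result differs from the input by rounding error."""
    return [v * scale + lo for v in q]

# Hypothetical weight values; a real tensor would be reshaped into 2-D frames
# of integer "pixels" before being handed to a codec.
weights = [-0.5, 0.0, 0.25, 1.0]
q, scale, lo = affine_quantize(weights)
restored = affine_dequantize(q, scale, lo)
```

The reconstruction error of each weight is bounded by half a quantization step, which is the property a downstream lossy codec would then trade against bitrate.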

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on LLM weight compression using video codecs (VVC/H.266), which is fundamentally different from the research directions implied by the keywords (multimodal representation learning, world models, visual encoders, tokenizers, model unification, and reinforcement learning). There is no overlap in methodology or core subject matter with the provided keywords.

Keywords

LLM Compression, Video Codecs, Weight Compression, Affine Quantization, VVC/H.266, Model Deployment, Perplexity Reduction

242. TailLoR: Protecting Principal Components in Parameter-Efficient Continual Learning [FAIL]

Score: 0.0 / 27.8

Authors: Marius Dragoi, Ioana Pintilie, Alexandra Dragomir, Antonio Barbalau, Florin Brad

Published: 2026-06-04

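TL;DR: TailLoR proposes a parameter-efficient continual learning method based on spectral decomposition that reduces catastrophic forgetting during fine-tuning by protecting the principal components.

Abstract (Chinese translation)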

基于谱分解 (spectral decomposition) 的参数高效微调 (parameter-efficient finetuning) 方法推动了持续学习 (Continual Learning) 的进展。本文提出了 TailLoR，该方法利用预训练权重的奇异基 U 和 V 作为固定参考框架，学习应用于奇异值矩阵的低秩 (low-rank) 更新。一种软谱惩罚机制抑制与主导奇异方向对齐的更新，从而减少干扰，同时将细粒度适应引导至高度灵活的长尾谱坐标 (long-tail spectral coordinates) 中。

Abstract

Parameter-efficient finetuning methods based on spectral decomposition have enabled progress in Continual Learning. In this paper we introduce TailLoR, which utilizes the singular bases U and V of the pre-trained weights as a fixed reference frame to learn a low-rank update applied to the singular value matrix. A soft spectral penalty discourages updates aligned with dominant singular directions, reducing interference while routing fine-grained adaptation into the highly flexible, long-tail spectral coordinates.
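The parameterization described in the abstract can be written out as a short formula sketch. Only the frozen bases $U$, $V$ and the low-rank update on the singular value matrix come from the abstract. The square shape $d \times d$, the rank $r$, the factor names $A$, $B$, and the specific quadratic penalty with weights $c_{ij}$ and strength $\lambda$ are illustrative assumptions, not the paper's definitions.

```latex
\[
\begin{aligned}
W_0 &= U \Sigma V^\top
  && \text{(SVD of the pre-trained weight; } U, V \text{ stay frozen)} \\
W &= U \,(\Sigma + A B)\, V^\top,
  \quad A \in \mathbb{R}^{d \times r},\ B \in \mathbb{R}^{r \times d}
  && \text{(rank-} r \text{ update in spectral coordinates)} \\
\mathcal{R}(A, B) &= \lambda \sum_{i,j} c_{ij}\, \bigl[(A B)_{ij}\bigr]^2
  && \text{(assumed soft penalty; } c_{ij} \text{ larger for dominant } \sigma_i, \sigma_j)
\end{aligned}
\]
```

Under this reading, entries of $AB$ that couple the leading singular directions are penalized most, so adaptation is pushed toward the long-tail coordinates.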

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

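Scoring rationale: The paper centers on parameter-efficient fine-tuning (PEFT) and spectral decomposition methods in continual learning, aiming to reduce catastrophic forgetting by protecting the principal components. The provided keywords mainly concern multimodal large language models (MLLM, MultiModal), world models, and reinforcement learning (model-based RL), a clear domain mismatch with the paper's actual area (general deep learning and continual learning). The paper does not mention tokenizers, visual encoders, or reinforcement learning, so the relevance score is 0. The author list contains none of the designated experts.

Keywords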

Continual Learning, Parameter-efficient, Spectral Decomposition, Low-rank Update, Principal Components, Singular Bases, Fine-tuning

243. How abundant are good interpolators? [FAIL]

Score: 0.0 / 27.8

Authors: August Y. Chen, Ahmed El Alaoui

Published: 2026-06-04

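TL;DR: This paper studies the abundance and generalization performance of interpolating solutions among overparametrized linear classifiers. It finds that the vast majority of interpolators have similar generalization error, while solutions found by optimization methods typically outperform random interpolators.

Abstract (Chinese translation)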

设 $S$ 为单位范数线性分类器 $\theta \in \mathbb{R}^d$ 的集合，这些分类器能正确分类标注数据集 $(X_i, y_i)_{i=1}^n$ 中的每一个点，其中 $X_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$，且预先固定了可能为负的间隔 $\kappa$。在 $(X, y)$ 对的两种自然数据生成分布下——Gaussian Mixture Model (高斯混合模型) 和 Logistic Model with Gaussian Features (具有高斯特征的逻辑模型)——以及在比例 regime ($n/d \to \alpha$) 且 $\alpha$ 足够小的情况下，我们建立了 Large Deviation Principle (大偏差原理)，该原理描述了从 $S$ 中均匀随机选取的点 $\theta$ 达到给定泛化误差的事件，且该结论关于数据的选择以高概率成立。相关的 Large Deviation Rate Function (大偏差率函数) 是确定性的，它描述了在 $d$ 的指数尺度上，Interpolating Classifiers (插值分类器) 具有给定期望性能的比例。作为结果，我们建立了以下 Concentration Phenomenon (集中现象)：除了指数级小比例外，所有 Interpolating Classifiers 都具有大致相同的泛化性能，该性能由该 Rate Function (率函数) 的唯一 Maximiser (最大化器) 给出。我们通过数值比较将此 Maximiser 与 Empirical Risk Minimization by Gradient Descent (基于梯度下降的经验风险最小化) 的性能以及一个 Natural Linear Program (自然线性规划) (两者均在 $S$ 中找到一点) 的性能进行比较，并推断出在 Overparametrized Regime (过参数化 Regime) 中 $\alpha$ 较小的情况下，这些高效程序优于绝大多数 Interpolators，表明在此设置下存在 Nontrivial Benign Overfitting (非平凡的良性过拟合)。

Abstract

Let $S$ be the set of unit norm linear classifiers $θ\in \mathbb{R}^d$ which correctly classify every point of a labeled dataset $(X_i,y_i)_{i=1}^n$, $X_i \in \mathbb{R}^d$, $y_i \in \{-1,+1\}$, with a possibly negative margin $κ$ fixed in advance. Under two natural data-generating distributions of the $(X,y)$ pairs -- a Gaussian mixture model and a logistic model with Gaussian features -- and in the proportional regime $n/d \to α$ with small enough $α$, we establish a large deviation principle on the event that a point $θ$ chosen uniformly at random from $S$ achieves a given generalization error, with high probability over the choice of the data. The associated large deviation rate function is deterministic and describes the proportion, at the exponential scale in $d$, of interpolating classifiers having a given desired performance. As a consequence, we establish the following concentration phenomenon: all but an exponentially small fraction of interpolating classifiers have approximately the same generalization performance given by the unique maximizer of this rate function. We numerically compare this maximizer to the performance of empirical risk minimization by gradient descent and to the performance of a natural linear program, both finding a point in $S$, and deduce that in the overparametrized regime of small $α$, these efficient procedures outperform the vast majority of interpolators, pointing to their nontrivial benign overfitting in this setting.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

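Scoring rationale: The paper belongs to statistical learning theory. It studies the abundance of linear interpolating classifiers with a given generalization error in the overparametrized regime, together with a large deviation principle. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all focus on multimodal large model architectures and reinforcement learning and have no direct connection to this paper's classical machine learning theory, so every keyword relevance score is 0. The author list contains none of the designated experts.

Keywords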

Interpolating classifiers, Generalization error, Large deviation principle, Overparametrized regime, Benign overfitting, Linear classifiers, Gradient descent

244. Event Detection for Parameter-to-KPI Dependency Learning for AI-RAN [FAIL]

Score: 0.0 / 27.8

Authors: Christie Djidjev, Nicholas Kaminski

Published: 2026-06-04

TL;DR: This paper proposes a machine learning pipeline for detecting events in network telemetry to learn parameter-to-KPI dependencies in AI-RAN systems, enabling interpretable control despite noisy data.

Abstract (Chinese translation)

下一代无线网络预计将依赖多个并发的基于人工智能的控制功能，这些功能同时优化不同的网络目标，特别是在人工智能集成与开放无线接入网架构中，例如人工智能无线接入网（AI-RAN）和开放无线接入网（O-RAN）。当这些功能相互作用时，它们可能会以难以仅从原始网络数据检测到的方式相互干扰。管理此类交互的一个关键缺失环节是一个可靠且可解释的依赖结构，该结构能够捕捉在任何给定时刻哪些控制参数正在积极影响哪些网络性能指标。本文专注于支持此类依赖学习所需的事件检测步骤，通过将含噪连续遥测数据转换为参数活动性和关键性能指标（KPI）响应的二元指标。核心难点在于数据中的并非每个波动都反映了真实的控制交互，因此该方法必须将真实的参数 - 指标关系与背景变异区分开来。由于获取具有已知参数 -KPI 真实值的真实 AI-RAN 流量轨迹较为困难，我们提出了一种具有植入潜在依赖关系的合成闭环流量生成器。我们利用这种受控遥测数据来评估一个基于机器学习的依赖恢复流水线，该流水线将连续轨迹转换为二元事件指标的过程表述为一个显著性检测问题。实验评估表明，当信号与背景变异充分分离时，所提出的流水线能够从含噪连续轨迹中可靠地恢复潜在依赖结构，同时强调阈值校准是控制事件检测质量的关键因素。这些结果构成了面向自适应 AI-RAN 控制系统实现可解释依赖学习的基础性步骤。

Abstract

Next-generation wireless networks are expected to rely on multiple concurrent AI-driven control functions that optimize different network objectives simultaneously, particularly in AI-integrated and open radio access network architectures such as AI Radio Access Network (AI-RAN) and Open Radio Access Network (O-RAN). When these functions interact, they can interfere with one another in ways that are difficult to detect from raw network data alone. A key missing piece for managing such interactions is a reliable, interpretable dependency structure that captures which control parameters are actively influencing which network performance outcomes at any given time. This paper focuses on the event-detection step needed to support such dependency learning by converting noisy continuous telemetry into binary indicators of parameter activity and KPI response. The central difficulty is that not every fluctuation in the data reflects a genuine control interaction, so the method must distinguish real parameter-outcome relationships from background variation. Because real AI-RAN traffic traces with known parameter-KPI ground truth are difficult to obtain, we introduce a synthetic closed-loop traffic generator with planted latent dependencies. We use this controlled telemetry to evaluate a machine-learning-based dependency recovery pipeline that formulates the conversion of continuous traces into binary event indicators as a significance-detection problem. Experimental evaluation shows that the proposed pipeline reliably recovers the latent dependency structure from noisy continuous traces when the signal is sufficiently separated from background variation, while highlighting threshold calibration as the key factor controlling event-detection quality. These results constitute a foundational step toward interpretable dependency learning for adaptive AI-RAN control systems.
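The conversion of a continuous trace into binary event indicators, and the role of the threshold, can be pictured with a deliberately simple stand-in. The paper describes a machine-learning-based pipeline; the sketch below swaps in a plain z-score test against a quiet baseline window, and the trace values, window length, and threshold are all hypothetical.

```python
from statistics import mean, stdev

def detect_events(trace, baseline_len, threshold=3.0):
    """Flag samples that deviate from the baseline window by more than
    `threshold` baseline standard deviations (1 = event, 0 = background)."""
    base = trace[:baseline_len]
    mu, sigma = mean(base), stdev(base)
    if sigma == 0:
        return [0] * len(trace)
    return [1 if abs(x - mu) / sigma > threshold else 0 for x in trace]

# Hypothetical KPI trace: six quiet samples, one clear excursion, then quiet.
trace = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 14.0, 10.1]
events = detect_events(trace, baseline_len=6)
```

Lowering `threshold` turns more background fluctuation into spurious events, while raising it suppresses weak but genuine responses, which is the calibration trade-off the abstract highlights.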

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on event detection and parameter-to-KPI dependency learning within AI-RAN/O-RAN wireless networks using telemetry data and synthetic traffic generators. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) belong to the domains of Multimodal Large Language Models and Reinforcement Learning. There is no technical overlap between the paper's content (network management, signal processing, statistical dependency inference) and the specific terminology of the keywords (e.g., no tokenizers, visual encoders, or RL algorithms are discussed). Therefore, all keyword scores are 0. None of the listed target experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords

Event Detection, Parameter-to-KPI Dependency, AI-RAN, Telemetry Analysis, Synthetic Traffic Generator, Machine Learning Pipeline, Network Control, O-RAN

245. Proper Scoring Rules for Right-Censored Survival Data [FAIL]

Score: 0.0 / 27.8

Authors: Jef Jonkers, Glenn Van Wallendael, Luc Duchateau, Sofie Van Hoecke

Published: 2026-06-04

TL;DR: This paper introduces a unified framework for proper scoring rules tailored to right-censored survival data, enabling accurate evaluation of probabilistic forecasts despite incomplete event time observations.

Abstract (Chinese translation)

正确评分规则为概率预测的训练和评估提供了严格的理论基础。然而，在存在右删失的情况下，事件时间仅被部分观测，导致常规评分规则在其标准形式下不适用。我们提出了一种针对右删失生存结果的正确评分框架，其基于一个简单的思想：首先，通过删失机制映射预测分布，然后在由此诱导的观测数据分布上应用底层正确评分。这产生了针对固定删失时间的局部得分，以及当删失时间是随机或仅被部分观测时的边缘得分。所得构造在一致框架内恢复了熟悉的右删失似然和 IPCW 型准则，同时也产生了 CRPS、Pinball loss、Brier score 和 Energy score 的右删失版本。我们证明，在条件独立删失下，边缘得分是正确的，且在可识别区域上是严格正确的。同样的原则也导致了 Censored engression（删失外推），这是一种基于样本的学习目标，用于多元右删失生存建模。在实验中，我们的得分在几种删失机制下正确地对预言者预测进行了排序，而依赖预测的插入加权得分可能会表现出排序反转。Censored engression 同样显著优于对删失结果进行朴素训练的方法。

Abstract

Proper scoring rules provide a rigorous theoretical basis for the training and evaluation of probabilistic forecasts. However, in the presence of right censoring, the event time is only partially observed, rendering conventional scoring rules inapplicable in their standard form. We propose a framework for proper scoring of right-censored survival outcomes based on a simple idea: first, map the predictive distribution through the censoring mechanism, then apply the underlying proper score on the induced observed-data law. This yields localized scores for fixed censoring times and marginalized scores when the censoring time is random or only partially observed. The resulting construction recovers familiar right-censored likelihood and IPCW-type criteria within a coherent framework, while also yielding right-censored versions of the CRPS, pinball loss, Brier score, and energy score. We show that the marginalized score is proper under conditional independent censoring and strictly proper on the identifiable region. The same principle also leads to censored engression, a sample-based learning objective for multivariate right-censored survival modeling. In experiments, our scores correctly rank the oracle forecast across several censoring regimes, whereas forecast-dependent plug-in weighted scores can exhibit ranking reversals. Censored engression likewise substantially improves over naive training on censored outcomes.
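The familiar right-censored likelihood that the abstract says the framework recovers can be illustrated on a toy forecast. The sketch assumes an exponential predictive distribution and four made-up observations; the function names are hypothetical and this is not the paper's general construction.

```python
import math

def censored_log_score(rate, time, event):
    """Negative log-likelihood of one right-censored observation under an
    exponential forecast with the given rate (lower is better)."""
    if event:
        # Event observed at `time`: use the density f(t) = rate * exp(-rate * t).
        return -(math.log(rate) - rate * time)
    # Censored at `time`: only T > time is known, so use S(t) = exp(-rate * t).
    return rate * time

def mean_score(rate, data):
    return sum(censored_log_score(rate, t, e) for t, e in data) / len(data)

# Hypothetical data: (observed time, event indicator).
data = [(1.0, True), (2.0, False), (0.5, True), (3.0, False)]

# For the exponential model the minimizer is (number of events) / (total time).
best_rate = sum(e for _, e in data) / sum(t for t, _ in data)
```

Censored observations contribute through the survival function rather than being dropped or treated as events, which is what keeps the score from systematically favoring forecasts that are too pessimistic.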

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on statistical scoring rules for survival analysis with right-censored data, whereas the provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning. There is no technical overlap involving tokenizers, visual encoders, model architectures, or RL methods, resulting in zero relevance for all specified keywords.

Keywords

Proper Scoring Rules, Right-Censored Survival Data, Probabilistic Forecasting, Censoring Mechanism, Survival Analysis, Censored Engression, Evaluation Metrics

246. Conformal Risk Sharing: Certified Cost Allocation with Participation Guarantees [FAIL]

Score: 0.0 / 27.8

Authors: Ieva Kazlauskaite

Published: 2026-06-04

TL;DR: This paper proposes a Conformal Risk Sharing framework to certify cost allocation with participation guarantees using split conformal calibration, ensuring no participant is made materially worse off.

Abstract (Chinese translation)

在群体间分担罕见不利事件的财务影响可以减轻极端个人负担，但若任何参与者因该安排而处境变差，便有理由退出。因此，可信机制必须为每个代理提供对其未来义务的可信上限，且仅当参与者之间的总损害有界时才应部署。我们将此形式化为认证分配问题（Certified Allocation Problem）：基于有限数据且不依赖分布假设，寻找再分配规则，为每位参与者生成义务上限，并验证没有任何参与者因此实质性地变差。我们提出共形风险分担（Conformal Risk Sharing），通过将可解释的共享策略与分裂共形校准（split conformal calibration）相结合来解决此问题。共享强度在训练数据上进行调优，而保留的校准数据则产生无分布的个体代理保证（在可交换性假设下成立）。在合成数据和真实数据（包括降水和能源合作社数据）上的实验证实，该框架可以在控制对他人的损害的同时，显著降低高风险代理的极端义务。

Abstract

Sharing the financial impact of rare adverse events across a group can soften extreme individual burdens, but any participant made worse off by the arrangement has reason to leave. A credible mechanism must therefore provide each agent with a trustworthy cap on their future obligation and should be deployed only if the aggregate harm across participants is bounded. We formalise this as the Certified Allocation Problem: from finite data and without distributional assumptions, find a redistribution rule, produce obligation caps for every participant, and verify that no participant is made materially worse off. We propose Conformal Risk Sharing, which solves this problem by pairing an interpretable sharing policy with split conformal calibration. The sharing intensity is tuned on training data, while held-out calibration data produces distribution-free per-agent guarantees (valid under exchangeability). Experiments on synthetic and real-world data, including precipitation and energy-cooperative data, confirm that the framework can substantially reduce extreme obligations for high-risk agents while controlling harm to others.
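The split conformal calibration step can be sketched with the standard finite-sample quantile rule. This is a generic split-conformal computation under exchangeability, not the paper's full procedure; the calibration values stand for one agent's hypothetical post-sharing obligations, and the name `conformal_cap` is an assumption.

```python
import math

def conformal_cap(calibration_values, alpha):
    """Return a cap that a new exchangeable value stays below with
    probability at least 1 - alpha (standard split-conformal quantile)."""
    n = len(calibration_values)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_values)[k - 1]

# Hypothetical held-out obligations for one agent.
calibration = [3.0, 1.0, 4.0, 1.5, 9.0, 2.6, 5.0, 3.5, 5.8]
cap_80 = conformal_cap(calibration, alpha=0.2)
cap_95 = conformal_cap(calibration, alpha=0.05)
```

With only nine calibration points, a 95% guarantee cannot be certified and the cap is infinite, which shows why the amount of held-out data limits how tight the per-agent guarantees can be.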

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on statistical risk sharing and cost allocation using conformal prediction in economics/finance, while the keywords relate to multimodal AI, LLM architectures, and reinforcement learning (e.g., Tokenizer, Visual Encoder, World Models). There is zero technical overlap between the paper's methodology and the evaluation keywords. Total weighted score is 0.0, which is below the dynamic passing score of 27.8. No expert authors from the specified list were found.

Keywords

Conformal Risk Sharing, Certified Allocation, Participation Guarantees, Split Conformal Calibration, Cost Allocation, Rare Adverse Events, Obligation Caps, Exchangeability

247. Symmetric Divergence and Normalized Similarity: A Unified Topological Framework for Representation Analysis [FAIL]

Score: 0.0 / 27.8

Authors: Yan Wang, Tianyang Hu

Published: 2026-06-04

Abstract (Chinese translation)

拓扑数据分析（TDA）为比较神经表示提供了一种原则性且内在的视角。然而，现有的配对拓扑散度（如 RTD）受限于启发式不对称性，更关键的是，其依赖于样本量的无界分数，阻碍了可靠的跨场景基准测试。为应对这些挑战，我们开发了一个统一的拓扑工具包，旨在满足两个互补的需求：细粒度结构诊断与鲁棒、标准化的评估。首先，我们通过引入对称表示拓扑散度（SRTD）及其高效变体 SRTD-lite，完善了 RTD 框架。除了解决先前变体的理论不对称性外，SRTD 还将诊断信息整合为一个单一且全面的跨条形码（cross-barcode）签名。这使得能够精确定位结构差异，并可作为有效的优化目标，而无需双向计算的额外开销。其次，为了在不同异构设置中实现可靠的基准测试，我们提出了归一化拓扑相似性（NTS）。通过测量层次合并顺序的秩相关，NTS 生成一个介于 -1 到 1 之间的尺度不变度量，有效克服了未归一化散度的尺度依赖性和样本依赖性。在合成及真实世界深度学习设置上的实验表明，我们的工具包能够捕捉到几何度量所遗漏的 CNN 功能偏移，并在距离饱和情况下稳健地映射大语言模型（LLM）谱系，提供了一种严谨的拓扑感知视角，补充了如 CKA 等度量。

Abstract

Topological Data Analysis (TDA) offers a principled, intrinsic lens for comparing neural representations. However, existing paired topological divergences (e.g., RTD) are limited by heuristic asymmetry and, more critically, unbounded scores that depend on sample size, hindering reliable cross-scenario benchmarking. To address these challenges, we develop a unified topological toolkit serving two complementary needs: fine-grained structural diagnosis and robust, standardized evaluation. First, we complete the RTD framework by introducing Symmetric Representation Topology Divergence (SRTD) and its efficient variant SRTD-lite. Beyond resolving the theoretical asymmetry of prior variants, SRTD consolidates diagnostic information into a single, comprehensive cross-barcode signature. This allows for precise localization of structural discrepancies and serves as an effective optimization objective without the overhead of dual directional computations. Second, to enable reliable benchmarking across heterogeneous settings, we propose Normalized Topological Similarity (NTS). By measuring the rank correlation of hierarchical merge orders, NTS yields a scale-invariant metric bounded between -1 and 1, effectively overcoming the scale and sample-dependence of unnormalized divergences. Experiments across synthetic and real-world deep learning settings demonstrate that our toolkit captures functional shifts in CNNs missed by geometric measures and robustly maps LLM genealogy even under distance saturation, offering a rigorous, topology-aware perspective that complements measures like CKA.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 115 (char 338)

248. Quantifying the Privacy of Counterfactuals by Leveraging Membership Inference Attacks Against Synthetic Data [FAIL]

Score: 0.0 / 27.8

Authors: Maryam Babaei, Yingke Wang, Hadrien Lautraite, Heber H. Arcolezi, Ulrich Aivodji, Sebastien Gambs

Published: 2026-06-04

TL;DR: This paper demonstrates that counterfactuals used for model explanation are vulnerable to membership inference attacks similar to synthetic data, allowing privacy breaches even without model access.

Abstract (Chinese translation)

反事实（Counterfactuals）通常应用于高风险决策领域，旨在通过展示用户画像的变化如何产生期望结果来解释机器学习模型。然而，利用反事实解释模型决策也可能被攻击者利用，从而对模型或其训练数据发起隐私攻击。鉴于反事实类似于合成数据，能够为真实训练数据提供现实的替代品，本文展示了如何利用针对合成数据开发的攻击方法，成功地对反事实实施隐私攻击。更具体地，我们研究了专为合成数据设计的成员推断攻击（Membership Inference Attacks）在各种类型反事实上的有效性。此外，尽管现有的针对反事实的成员推断攻击通常要求能够查询模型，但我们展示了仅凭一组反事实即可成功执行成员推断攻击的方法，而无需访问生成这些反事实的模型。我们的结果表明，模型开发者在向不同用户发布反事实时应更加谨慎，因为这可能导致隐私泄露。

Abstract

Counterfactuals are typically used in high-stakes decision areas to explain a machine learning model by showing how changes to a user profile would result in the desired outcome. However, explaining the model's decisions through counterfactuals can also be exploited by an adversary to conduct privacy attacks against the model or its training data. Drawing on the analogy that counterfactuals, like synthetic data, provide realistic substitutes for real training data, we demonstrate in this paper how privacy attacks on counterfactuals can be performed successfully by building on attacks developed against synthetic data. More precisely, we investigate the effectiveness of membership inference attacks designed for synthetic data on various types of counterfactuals. Additionally, while existing membership inference attacks against counterfactuals usually require query access to the model, we show how to perform successful membership inference attacks using only a set of counterfactuals, with no access to the model from which they are generated. Our results demonstrate that model developers should be more cautious when releasing counterfactuals to various users, as doing so can lead to a privacy breach.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper investigates privacy vulnerabilities of counterfactuals using membership inference attacks, focusing on explainable AI and data privacy. The provided keywords specifically target Multimodal Large Language Models (MLLM), World Models, Tokenizers, Visual Encoders, and Model-Based Reinforcement Learning. There is no conceptual overlap between the paper's domain (privacy/XAI) and the specified keywords (multimodal/RL architecture), hence all keywords receive a score of 0.

Keywords

Counterfactuals, Privacy, Membership Inference Attacks, Synthetic Data, Explainability, Machine Learning, Privacy Risk

249. Efficient Mean Curvature Computation on High-Dimensional Data Manifolds [FAIL]

Score: 0.0 / 27.8

Authors: Alexandre L. M. Levada

Published: 2026-06-04

TL;DR: This paper proposes a computationally efficient algorithm to estimate local mean curvature on high-dimensional data manifolds, reducing computational complexity from O(m^4) to O(k^2 m) to enable practical use of curvature features in machine learning.

Abstract (Chinese translation)

在高维数据集中的每个点上估计局部平均曲率是几何感知机器学习算法的核心组成部分，例如平均曲率边界点（MCBP）方法。这种计算的朴素实现基于从 k 近邻邻域块近似得到的局部形状算子，涉及矩阵 $H$ 的显式构造，其迹形式导致每个点的计算复杂度为 $O(m^4)$，使得该方法对于特征维度超过几十个的数据集变得不可行。本文提出了两项互补的贡献，共同将这一计算复杂度降低了几个数量级。第一项贡献是一个精确的代数恒等式。该恒等式基于协方差矩阵特征向量的正交性及迹算子的循环性推导得出，完全消除了矩阵 $H$，并在特征分解后将每个点的计算复杂度降低至 $O(m^2)$。第二项贡献解决了完整特征分解中剩余的 $O(m^3)$ 瓶颈。由于局部协方差矩阵的秩至多为 $k-1 \ll m$，我们将其替换为 $k \times m$ 中心化数据矩阵的截断奇异值分解（SVD），这是一个 $O(k^2 m)$ 的操作，并基于哈尔测度（Haar measure）下其外积的期望值，推导出零空间特征向量贡献的解析近似。所得估计量的总计算复杂度为 $O(k^2 m + k m p^2)$，其中 $p = k-1$。在真实数据集上的实验证实，相对于原始实现，该方法实现了 50 到 300 倍的加速比，且当使用快速估计量替换原始版本时，精度损失可忽略不计。通过提供可扩展且数据驱动的局部曲率估计，该方法确立了曲率作为一种实用的几何特征，适用于广泛的机器学习任务，涵盖从经典方法到现代深度学习流程。

Abstract

Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean Curvature Boundary Points (MCBP) method. The naive implementation of this computation, based on a local shape operator approximated from k-nearest neighbor patches, involves an explicit construction of a matrix $H$ whose trace form yields an $O(m^4)$ cost per point, rendering the approach intractable for datasets with more than a few dozen features. This paper introduces two complementary contributions that together reduce this cost by several orders of magnitude. The first contribution is an exact algebraic identity. This identity, derived from the orthogonality of the eigenvectors of the covariance matrix and the cyclicity of the trace operator, eliminates $H$ entirely and reduces the per-point cost to $O(m^2)$ after the eigendecomposition. The second contribution addresses the remaining $O(m^3)$ bottleneck of the full eigendecomposition. Since the local covariance matrix has rank at most $k-1 \ll m$, we replace it with a truncated SVD of the $k \times m$ centered data matrix, an $O(k^2 m)$ operation, and derive an analytical approximation for the contribution of the null-space eigenvectors based on the expected value of their outer product under the Haar measure. The resulting estimator has total cost $O(k^2 m + k m p^2)$, where $p = k-1$. Experiments on real-world datasets confirm speedups of 50 to 300 times relative to the original implementation, with negligible loss when the fast estimator is used to replace the original version. By providing a scalable and data-driven estimate of local curvature, the proposed method establishes curvature as a practical geometric feature for a broad range of machine learning tasks, from classical to modern deep learning pipelines.
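The reason a truncated SVD of the centered patch can stand in for the full eigendecomposition follows from a standard identity, sketched below. The symbol names $X_c$, $s_i$, $v_i$ and the $1/(k-1)$ normalization of the covariance are assumptions; the paper may use a different convention.

```latex
\[
\begin{aligned}
X_c &= U S V^\top, \qquad X_c \in \mathbb{R}^{k \times m},\ \operatorname{rank}(X_c) \le k-1
  && \text{(thin SVD of the centered $k$-NN patch)} \\
C &= \tfrac{1}{k-1}\, X_c^\top X_c
   = V \,\tfrac{S^2}{k-1}\, V^\top \in \mathbb{R}^{m \times m}
  && \text{(local covariance)} \\
\lambda_i &= \tfrac{s_i^2}{k-1}, \quad C v_i = \lambda_i v_i \quad (i = 1, \dots, k-1),
  \qquad \lambda_i = 0 \ \text{otherwise}
  && \text{(spectrum of } C \text{ from the SVD)}
\end{aligned}
\]
```

The nonzero eigenpairs of the $m \times m$ covariance are therefore available from an SVD costing $O(k^2 m)$ instead of an $O(m^3)$ eigendecomposition, and only the null-space directions need the separate analytical approximation described in the abstract.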

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

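Scoring rationale: The paper focuses on efficient computation of mean curvature on high-dimensional data manifolds using an algebraic identity and a truncated singular value decomposition, which falls under geometry-aware machine learning and computational statistics. It does not involve multimodal large language models (MLLM), tokenizers, visual encoders, world models, model-based reinforcement learning, or unified models, so its relevance to every specified keyword is 0.

Keywords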

Mean Curvature, High-Dimensional Data, Data Manifolds, Geometry-aware Machine Learning, Truncated SVD, Computational Efficiency, Covariance Matrix

250. DAS-PINNs for high-dimensional partial differential equations: extending deep adaptive sampling to spacetime domains [FAIL]

Score: 0.0 / 27.8

Authors: Anshima Singh, David J. Silvester

Published: 2026-06-04

TL;DR: This paper proposes a deep adaptive sampling framework for Physics-Informed Neural Networks that treats space and time as a unified domain to efficiently solve high-dimensional time-dependent partial differential equations without explicit time marching.

Abstract (Chinese translation)

时间依赖的高维偏微分方程（PDEs）具有空间局域且动态演化的解，这对物理信息神经网络（PINNs）构成了根本性挑战，因为在高维时空域中，均匀配点采样变得越来越无效。本研究通过将空间和时间视为一个统一域而不进行任何显式时间推进，将 PINNs 的深度自适应采样框架扩展到了时变情形。归一化流（Normalising Flow）神经网络模型有效地学习了由 PDE 残差诱导的分布，并生成了集中在解最难学习区域的新配点。与需要显式时间步进或移动网格的常规自适应策略不同，高残差区域仅由 PDE 残差分布驱动，即可在空间和时间上被自动识别和追踪。所提出策略的有效性在一系列基准问题上得到了评估，涵盖从二维空间中的尖锐且移动的特征到高达八维空间中的局域结构。

Abstract

Time-dependent high-dimensional partial differential equations (PDEs) with spatially localised and dynamically evolving solutions pose a fundamental challenge for physics-informed neural networks (PINNs), as uniform collocation sampling becomes increasingly ineffective in high-dimensional spatiotemporal domains. In this work, a deep adaptive sampling framework for PINNs is extended to the time-dependent setting by treating space and time as a unified domain without any explicit time marching. A normalising flow neural network model effectively learns the distribution induced by the PDE residual and generates new collocation points concentrated in regions where the solution is most difficult to learn. Unlike conventional adaptive strategies that require explicit time stepping or moving meshes, high-residual regions are automatically identified and tracked across both space and time, driven purely by the PDE residual distribution. The effectiveness of the proposed strategy is assessed on a range of benchmark problems, from sharp and moving features in two spatial dimensions to localised structures in up to eight spatial dimensions.
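The idea of concentrating collocation points where the PDE residual is large, over space and time jointly, can be sketched without a neural network. The paper learns the residual-induced distribution with a normalising flow; the sketch below replaces that with weighted resampling from a fixed candidate grid, and the residual function (a sharp front moving along x = t) is entirely hypothetical.

```python
import math
import random

def resample_collocation(candidates, residual, n_new, rng):
    """Draw new spacetime collocation points with probability proportional
    to the squared residual at each candidate point."""
    weights = [residual(x, t) ** 2 for x, t in candidates]
    return rng.choices(candidates, weights=weights, k=n_new)

def toy_residual(x, t):
    # Hypothetical residual concentrated on a front travelling along x = t.
    return math.exp(-100.0 * (x - t) ** 2)

# Uniform candidate grid over the unit spacetime square, treated as one domain.
grid = [(i / 20, j / 20) for i in range(21) for j in range(21)]
rng = random.Random(0)
new_points = resample_collocation(grid, toy_residual, n_new=200, rng=rng)
near_front = sum(1 for x, t in new_points if abs(x - t) <= 0.25)
```

Because space and time are sampled together, the new points follow the front at every time level without any time stepping, which is the behavior the abstract attributes to the residual-driven sampler.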

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on Physics-Informed Neural Networks (PINNs) for solving high-dimensional partial differential equations using deep adaptive sampling in spacetime domains. The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, Tokenizers, Visual Encoders, and Reinforcement Learning. There is no substantive domain overlap between scientific computing/PDEs and the multimodal/generative/RL focus of the keywords. None of the specified expert authors are present in the author list.

Keywords

Physics-Informed Neural Networks, Partial Differential Equations, Deep Adaptive Sampling, Spacetime Domains, Normalizing Flow, Collocation Points, High-dimensional

251. Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving [FAIL]

Score: 0.0 / 27.8

Authors: Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi

Published: 2026-06-04

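TL;DR: Tangram introduces non-uniform KV cache management to relieve the memory pressure caused by linear KV cache growth in multi-turn LLM serving, improving throughput by up to 2.6x without sacrificing accuracy.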

摘要翻译

多轮大语言模型（LLM）服务对于保持一致的用户体验至关重要，然而键值（KV）缓存的线性增长给 GPU 内存和带宽带来了显著压力。非均匀 KV 压缩通过考虑每个 KV 缓存的各自重要性，能够更有效地保留更多信息。然而，这种 KV 缓存的异构性引入了各种系统性挑战——包括内存碎片、调度复杂性以及核利用率下降——这些因素共同导致现有 LLM 服务系统中出现显著的低效问题。为了克服这些挑战，我们提出了 Tangram，这是一种旨在使非均匀 KV 缓存实用化的新颖服务系统。Tangram 通过以下三种核心技术解决系统性低效问题：（1）确定性预算分配（Deterministic Budget Allocation）根据每个头的固有特性分配静态内存占用，从而完全消除动态调度开销和预填充停顿；（2）头组页（Head Group Page）将具有类似保留需求的注意力头进行聚类，并使用独立的向量化页表进行管理，从而最大化物理内存回收；（3）事前负载平衡（Ahead-of-Time, AOT Load Balancing）利用静态预算配置文件来确保 GPU 利用率均匀，且无需运行时开销。实验结果表明，与现有基线相比，Tangram 的吞吐量提高了高达 2.6 倍，同时完全保持了模型准确率。我们的实现代码已公开，网址为 https://github.com/aiha-lab/TANGRAM。

Abstract

Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering the individual importance of each KV cache. However, such KV cache heterogeneity introduces various systemic challenges - including memory fragmentation, scheduling complexities, and diminished kernel utilization - which collectively lead to significant inefficiencies in existing LLM serving systems. To overcome these challenges, we present Tangram, a novel serving system designed to make Non-uniform KV caches practical. Tangram addresses systemic inefficiencies through three core techniques: (1) Deterministic Budget Allocation assigns a static memory footprint to each head based on its intrinsic pattern, entirely eliminating dynamic scheduling overhead and prefill stalls; (2) Head Group Page clusters attention heads with similar retention demands and manages them with independent, vectorized page tables, thereby maximizing physical memory reclamation; and (3) Ahead-of-Time (AOT) Load Balancing leverages static budget profiles to ensure uniform GPU utilization without runtime overhead. Experimental results show that Tangram improves throughput by up to 2.6x compared to existing baselines, while fully preserving model accuracy. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
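The memory argument behind grouping heads can be made concrete with a toy page count. The Python sketch below is only an accounting illustration under assumed numbers: the per-head token budgets, the page size, and the sort-and-chunk grouping rule are hypothetical and are not taken from Tangram's implementation.

```python
import math


def pages_shared_table(budgets, page_size):
    """Every head is padded to the largest budget (one shared page table)."""
    return math.ceil(max(budgets) / page_size) * len(budgets)


def pages_grouped(budgets, page_size, group_size):
    """Heads sorted by budget and chunked; each group pads to its own max."""
    ordered = sorted(budgets, reverse=True)
    total = 0
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        total += math.ceil(max(group) / page_size) * len(group)
    return total


budgets = [128, 120, 64, 60, 16, 16, 8, 8]  # tokens kept per head (made up)
shared = pages_shared_table(budgets, page_size=16)
grouped = pages_grouped(budgets, page_size=16, group_size=2)
```

With these made-up budgets the shared layout reserves 64 head-pages and the grouped layout 28, which is the kind of reclamation that grouping heads with similar retention demands is meant to enable.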

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

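Scoring rationale: The paper studies KV cache optimization and system efficiency for multi-turn LLM serving, which belongs to inference systems. The provided keywords concern multimodal models, world models, reinforcement learning, and related topics, and have no direct connection to the paper's content, so every keyword receives a relevance score of 0. None of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords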

Multi-turn LLM Serving, Non-uniform KV Cache, GPU Memory Optimization, Throughput Improvement, Deterministic Budget Allocation, Head Group Page, Ahead-of-Time Load Balancing

252. PAC-Bayesian Adversarially Robust Generalization for Message Passing Graph Neural Networks: A Sensitivity Analysis [FAIL]

Score: 0.0 / 27.8

Authors: Ziling Liang, Xinping Yi, Qingsong Wen, Shi Jin

Published: 2026-06-04

TL;DR: This paper proposes a sensitivity-aware PAC-Bayesian framework to derive tighter robust generalization bounds for message passing graph neural networks against adversarial attacks.

摘要翻译

尽管图神经网络（GNNs）对对抗攻击的脆弱性对图表示学习构成了严重威胁，但在对抗环境下，对其鲁棒泛化行为的理解仍然是一个核心挑战。最近，PAC-Bayesian 基于边界的泛化分析通过提供灵活且数据依赖的分析框架，显著推进了这一研究方向。然而，现有的鲁棒分析通常依赖于各向同性高斯后验，并在全参数空间中控制权重扰动，这限制了捕捉异质参数敏感性的能力，却又依赖于依赖隐层宽度的复杂度项，导致泛化界不够紧。在本文中，我们将最近提出的感知敏感性的 PAC-Bayesian 框架从深度神经网络扩展到消息传递图神经网络（MPGNNs），并在对抗环境中推导出更紧的鲁棒泛化界。具体而言，我们首先通过推导关于权重参数的输出雅可比矩阵，量化不同参数块上的扰动对网络输出的敏感程度。利用这些雅可比矩阵在 K 类图分类中秩至多为 K 的事实，我们随后构建与雅可比矩阵对齐的敏感性矩阵，并使用具有优化协方差的各向异性高斯后验以紧的方式上界化 KL 散度。值得注意的是，通过细化对所学权重的谱范数依赖，并将主导维度因子从依赖隐层宽度的项减少到类别数 K，我们的分析为 MPGNNs 产生了更紧的鲁棒泛化保证，从而指导其设计以增强对抗鲁棒性。

Abstract

Whilst the vulnerability of graph neural networks (GNNs) to adversarial attacks poses a critical threat to graph representation learning, the understanding of the robust generalization behavior remains a fundamental challenge in the adversarial setting. Recently, PAC-Bayesian margin-based generalization analysis substantially advances this line of research by providing a flexible and data-dependent analytical framework. However, existing robust analyses often rely on isotropic Gaussian posteriors and control weight perturbations in the full parameter space, which limits the ability to capture heterogeneous parameter sensitivity yet hinges on hidden-width-dependent complexity terms, resulting in not-tight-enough generalization bounds. In this paper, we extend a recently proposed sensitivity-aware PAC-Bayesian framework from deep neural networks to message passing GNNs (MPGNNs) and derive a tighter robust generalization bound in the adversarial setting. Specifically, we first quantify how sensitive the perturbations across different parameter blocks are to the network outputs by deriving the output Jacobians with respect to the weight parameters. Exploiting the fact that these Jacobian matrices have rank at most $K$ in $K$-class graph classification, we then construct Jacobian-aligned sensitivity matrices and use anisotropic Gaussian posteriors with optimized covariances to upper bound the KL divergence in a tight way. Notably, by refining the spectral-norm dependence on the learned weights and reducing the leading dimension factor from hidden-width-dependent terms to the number of classes $K$, our analysis yields much tighter robust generalization guarantees for MPGNNs, thereby guiding their designs to enhance adversarial robustness.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on adversarial robustness and generalization bounds for Message Passing Graph Neural Networks using PAC-Bayesian theory. The provided keywords pertain to Multimodal Large Language Models (MLLM), Tokenizers, Visual Encoders, World Models, and Model-Based Reinforcement Learning, which are unrelated to the graph neural network and theoretical analysis content of this paper.

Keywords

PAC-Bayesian, Adversarially Robust, Generalization, Message Passing, Graph Neural Networks, Sensitivity Analysis, Jacobian, Anisotropic Gaussian

253. Non-Negative Matrix Factorization for Event Data [FAIL]

Score: 0.0 / 27.8

Authors: Raphaël Romero

Published: 2026-06-04

TL;DR: The paper introduces EventNMF, a continuous-time non-negative factorization model that operates directly on event times using Poisson processes and B-splines to uncover interpretable temporal structures without binning.

摘要翻译

连续时间事件数据（实体随时间发出瞬时事件）在神经科学、地震学及社交网络等多个领域中自然产生。非负矩阵分解（NMF）是揭示此类数据中可解释结构的自然工具，但迄今为止仅在将实体级计数度量进行分箱或平滑后才被应用。这一预处理步骤存在抹除实体级异质性和细粒度时间特征的风险。本文引入 EventNMF，这是一种直接在事件时间上操作的连续时间非负分解模型：每个实体的事件被建模为泊松过程，其强度通过非负 B-spline 基进行分解，且一个简单的估计程序可恢复跨实体共享的可解释时间模板。所得方法在数学原理上严谨，易于实现，且计算高效。我们进一步表明标准分箱计数方法作为零阶样条的特例出现，探索了偏差 - 方差权衡，并在合成潜在因子模型上与现有方法进行了比较，同时在多个实际应用中展示了 EventNMF 的有效性。

Abstract

Continuous-time event data, in which entities emit instantaneous events over time, arises naturally across many domains such as neuroscience, seismology, and social networks. Non-negative matrix factorization (NMF) is a natural tool to uncover interpretable structure in such data, but it has so far only been applied after binning or smoothing the entity-level counting measures. This preprocessing step comes with the risk of erasing entity-level heterogeneities and fine-grained temporal features. In this paper, we introduce EventNMF, a continuous-time non-negative factorization model that operates directly on event times: each entity's events are modeled as a Poisson process whose intensity factorizes through a non-negative B-spline basis, and a simple estimation procedure recovers interpretable temporal templates shared across entities. The resulting method is mathematically principled, easy to implement, and computationally efficient. We further show that standard binned-count approaches arise as the special case of degree-zero splines, explore bias-variance tradeoffs and compare against existing methods on a synthetic latent factor model, and demonstrate the effectiveness of EventNMF on several real-world applications.
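The degree-zero special case mentioned in the abstract can be sketched in Python. With piecewise-constant basis functions, the Poisson likelihood depends on the events only through per-bin counts, so the model reduces to NMF under a Poisson (KL) loss. The code below uses the classical multiplicative updates for that loss as a stand-in; it is not the EventNMF estimation procedure, and the function names and synthetic event times are assumptions.

```python
import numpy as np


def bin_events(event_times, t_max, n_bins):
    """Turn a list of per-entity event-time arrays into a count matrix."""
    edges = np.linspace(0.0, t_max, n_bins + 1)
    rows = [np.histogram(t, bins=edges)[0] for t in event_times]
    return np.stack(rows).astype(float)


def poisson_nll(N, W, H, eps=1e-12):
    """Poisson negative log-likelihood of counts N under mean W @ H (up to a constant)."""
    M = W @ H
    return float(np.sum(M - N * np.log(M + eps)))


def poisson_nmf(N, rank, n_iters=200, seed=0, eps=1e-12):
    """Multiplicative updates for non-negative W, H minimising the Poisson loss."""
    rng = np.random.default_rng(seed)
    W = rng.random((N.shape[0], rank)) + 0.1
    H = rng.random((rank, N.shape[1])) + 0.1
    for _ in range(n_iters):
        R = N / (W @ H + eps)
        W *= (R @ H.T) / (H.sum(axis=1) + eps)
        R = N / (W @ H + eps)
        H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)
    return W, H


rng = np.random.default_rng(1)
events = [rng.uniform(0.0, 10.0, size=rng.integers(50, 200)) for _ in range(6)]
N = bin_events(events, t_max=10.0, n_bins=20)
W, H = poisson_nmf(N, rank=2)  # rows of H act as shared temporal templates
```

Higher-degree B-splines would replace the histogram step with smooth basis evaluations at the raw event times, which is where the continuous-time model departs from this binned sketch.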

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper proposes EventNMF, a statistical method for continuous-time event data using Non-negative Matrix Factorization (NMF), Poisson processes, and B-splines. The provided keywords pertain to Multimodal Large Language Models (MLLM), reinforcement learning architectures, and generative world models (Tokenizer, Visual Encoder, Unify Models). There is no conceptual overlap between the statistical event factorization approach and the deep learning multimodal/RL architectures specified. None of the listed expert authors are present in the author list.

Keywords

Non-negative Matrix Factorization, Event Data, Continuous-time, Poisson Process, B-spline Basis, Temporal Templates, Factorization Model

254. A Machine Learning-Based Framework for Discovering Huntington's Disease Stages: Integrating Graph Representation Learning and Clustering to Uncover Progression Dynamics in Longitudinal Enroll-HD Dataset [FAIL]

Score: 0.0 / 27.8

Authors: Lubna M. Abu Zohair, Marta Vallejo, MD Azher Uddin, John R. Woodward, Hind Zantout

Published: 2026-06-04

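TL;DR: This paper proposes an unsupervised framework based on dynamic graph representation learning and clustering that discovers four Huntington's disease stages from longitudinal clinical data, addressing the ambiguous thresholds of conventional clinical staging.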

摘要翻译

亨廷顿病（HD）是一种进行性脑部疾病，逐渐影响运动、认知功能及行为。准确且一致地识别疾病阶段对于理解其病程、患者分组、个性化护理及治疗发现至关重要。现有的临床分期框架主要依赖预设的临床测量阈值及临床专家决策，然而这些离散截断点可能掩盖有意义的阶段内变异性，且仍易受评分者间差异的影响，尤其是在运动和功能评估方面。为了解决这些局限性，我们开发了一种基于动态图表示学习（dynamic graph representation learning）的无监督机器学习框架，旨在从纵向临床测量数据中捕捉患者内部及跨患者的时间关系。利用学习到的表示，我们应用了 K-means++ 聚类算法以识别分离良好的群体。随后，我们迭代增加聚类数量（k），利用稳定性分析评估鲁棒性，并揭示除初始最优解之外额外的有意义聚类。我们将该框架应用于 Enroll-HD 队列中的 302 名个体（共 1477 次访问，每次访问包含 44 个临床变量；其中 80% 为显性参与者），从而实现了数据驱动的 HD 阶段发现，反映了自然的临床进展。尽管队列规模有限，所提出的框架利用四维潜在空间实现了稳健的聚类性能，并通过聚类稳定性分析识别出四个具有统计学显著差异的疾病阶段。每个阶段均对应定义明确的临床测量边界，与先前建立的临床分期方法相比，重叠程度最小。

Abstract

Huntington's disease (HD) is a progressive brain disorder that gradually affects movement, cognitive function, and behavior. Identifying the stage of the disease accurately and consistently is important for understanding its course, grouping patients, personalized care, and discovering treatment. Existing clinical staging frameworks rely primarily on predefined clinical measurement thresholds and clinical expert decisions, yet these discrete cut-offs may obscure meaningful intra-stage variability and remain vulnerable to inter-rater differences, especially in motor and functional assessments. To address these limitations, we developed an unsupervised machine learning framework based on dynamic graph representation learning to capture temporal relationships within and across patients from longitudinal clinical measurements. Using the learned representations, we applied K-means++ clustering to identify well-separated groups. We then iteratively increased the number of clusters (k), using stability analysis to assess robustness and reveal additional meaningful clusters beyond the initial optimal solution. We applied the framework to 302 individuals from the Enroll-HD cohort (1,477 visits, 44 clinical variables per visit; 80% manifest participants), enabling data-driven discovery of HD stages reflecting natural clinical progression. Despite the limited cohort size, the proposed framework achieved robust clustering performance using a four-dimensional latent space, identifying four meaningful and statistically distinct disease stages through clustering stability analysis. Each stage corresponded to well-defined clinical measurement boundaries, with minimal overlap compared to previously established clinical staging methods.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

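Scoring rationale: The paper focuses on unsupervised learning and clustering analysis of medical data (Huntington's disease), using dynamic graph representation learning to extract features. The scoring keywords all concern large-model architectures, multimodal fusion, and reinforcement learning (e.g., Tokenizer, Visual Encoder, World Models, MLLM), an entirely different field from this paper, so every relevance score is 0.

Keywords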

Huntington's Disease, Graph Representation Learning, Clustering, Longitudinal Data, Disease Staging, Unsupervised Learning, Enroll-HD Dataset, Progression Dynamics

255. Diffusion Models Observe Only Gradients: A Geometric Perspective on Score Matching Errors [FAIL]

Score: 0.0 / 27.8

Authors: Naïl B. Khelifa, Richard E. Turner, Ramji Venkataramanan

Published: 2026-06-04

TL;DR: This paper theoretically demonstrates that only the gradient component of score matching errors influences marginal distribution dynamics in diffusion models, challenging the use of L2 score error as a quality metric.

摘要翻译

基于分数的扩散模型 (Score-based diffusion models) 通常通过最小化 $L^2$ 分数匹配误差进行训练，标准理论分析依赖该量来界定学习分布与目标分布之间的采样差异。我们表明 $L^2$ 分数误差并非衡量边缘分布质量的恰当内在度量：一个学习到的扩散模型可以产生任意大的 $L^2$ 分数误差，同时完美匹配目标分布。通过将分数误差分解为梯度分量和无散分量（即赫尔姆霍兹 - 霍奇分解 (Helmholtz-Hodge decomposition)），我们揭示了背后的几何原因：只有梯度分量进入边缘福克 - 普朗克动力学 (Fokker-Planck dynamics)，而无散分量在结构上是不可见的。我们通过以下三个结果精确阐述了这一点。首先，基于修正后的几何，我们证明了一个不可能性结果：$L^2$ 分数误差的任何单调函数都无法一致地界定学习分布与目标分布之间任何散度的下界。其次，我们推导了一个仅依赖于误差可观测梯度分量的 KL 散度 (Kullback-Leibler divergence) 上界，收紧了标准的吉尔萨诺夫界 (Girsanov bound)，并将其松弛度识别为在路径空间 (path-space) 而非边缘空间 (marginal-space) 上操作的代价。第三，我们通过一个对偶索伯列夫恒等式 (dual Sobolev identity) 给出了梯度分量的一个可行估计量，经验表明其与样本质量的相关性显著优于完整的 $L^2$ 误差。

Abstract

Score-based diffusion models are typically trained by minimizing the $L^2$ score matching error, and standard theoretical analyses rely on this quantity to bound the sampling discrepancy between the learned and target distributions. We show the $L^2$ score error is not the right intrinsic measure of marginal distributional quality: a learned diffusion model can incur arbitrarily large $L^2$ score error while perfectly matching the target distribution. By decomposing score errors into a gradient and a solenoidal component (a Helmholtz-Hodge decomposition), we identify the geometric reason behind this: only the gradient component enters the marginal Fokker-Planck dynamics, while the solenoidal component is structurally invisible. We make this precise in three results. First, building on the corrected geometry, we prove an impossibility result: no monotone function of the $L^2$ score error can uniformly lower bound any divergence between the learned and target distributions. Second, we derive an upper bound on the Kullback-Leibler divergence that depends only on the observable gradient component of the error, tightening the standard Girsanov bound and identifying its looseness as the cost of operating on path-space rather than marginal-space dynamics. Third, we give a tractable estimator of the gradient component via a dual Sobolev identity, which is shown to empirically correlate substantially better with sample quality than the full $L^2$ error.
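The invisibility of the solenoidal part can be seen in a short calculation. The LaTeX sketch below uses generic notation that is assumed here rather than taken from the paper: $\rho_t$ is a density transported by a velocity field $v_t$, and $w_t$ is a perturbation that is divergence-free with respect to $\rho_t$.

```latex
\begin{align}
\partial_t \rho_t + \nabla \cdot \big( \rho_t \, (v_t + w_t) \big)
  &= \partial_t \rho_t + \nabla \cdot ( \rho_t v_t ) + \nabla \cdot ( \rho_t w_t ) \\
  &= \partial_t \rho_t + \nabla \cdot ( \rho_t v_t )
  \qquad \text{whenever } \nabla \cdot ( \rho_t w_t ) = 0 .
\end{align}
```

A density that solves the continuity equation for $v_t$ therefore also solves it for $v_t + w_t$: such a $w_t$ can change individual trajectories but not the marginals, whereas a gradient perturbation generally does change them.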

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on the geometric analysis of score matching errors in diffusion models using Helmholtz-Hodge decomposition. It does not address multimodal integration, tokenization, visual encoders, or reinforcement learning frameworks. Therefore, it is irrelevant to all provided keywords, which target multimodal large models and model-based RL architectures.

Keywords

Score-based diffusion models, Score matching errors, Geometric perspective, Helmholtz-Hodge decomposition, Gradient component, Fokker-Planck dynamics, Kullback-Leibler divergence

256. Trust-Aware Predictive Emissions Monitoring for Gas Turbine Fleets with Limited Labelled Data [FAIL]

Score: 0.0 / 27.8

Authors: Rebecca Potts, Aiden Durrant, Rick Hackney, Georgios Leontidis

Published: 2026-06-04

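TL;DR: This paper proposes a trust-aware probabilistic framework, built on a multi-head recurrent model and uncertainty quantification, for reliable NOx prediction across gas turbine fleets when labelled data is limited.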

摘要翻译

基于机器学习的预测排放监测系统（PEMS）为直接排放测量提供了一种实用的替代方案，但当仅有少数机组拥有排放标签时，其在燃气轮机机队中的部署面临挑战。本文提出了一种信任感知概率框架，旨在有限标注监督下实现机队级燃气轮机 NOx 预测。该框架结合了多头循环预测模型、学习到的置信度估计、基于集成的不确定性量化、辅助特征预测、特征空间距离分析以及运行范围诊断。这些信号在标注数据上进行校准，以生成可解释的样本级信任分数，从而为未标注涡轮机的预测可靠性提供指标，支持识别在机队部署过程中应给予更高谨慎度的预测。基于置信度的过滤将平均绝对误差（MAE）从全覆盖时的 0.202 降低至最高置信度 10% 预测的 0.070，表明置信度估计与预测误差之间存在显著关联。未标注样本及分布外样本表现出更高的不确定性和更低的置信度，表明该框架能够适当响应分布偏移。结果表明，所提出的信任框架为未标注涡轮机的排放预测提供了可操作的可信性信息，支持 PEMS 在工业机队中实现更透明且可信的部署。

Abstract

Machine learning-based predictive emissions monitoring systems (PEMS) offer a practical alternative to direct emissions measurement, but their deployment across gas turbine fleets is challenging when emissions labels are available for only a small subset of assets. In this work, a trust-aware probabilistic framework is proposed for fleet-level gas turbine NOx prediction under limited labelled supervision. The framework combines a multi-head recurrent prediction model with learned confidence estimation, ensemble-based uncertainty quantification, auxiliary feature prediction, feature-space distance analysis, and operating-range diagnostics. These signals are calibrated on labelled data to produce interpretable per-sample trust scores, providing indicators of prediction reliability on unlabelled turbines and supporting the identification of predictions that should be treated with greater caution during fleet-level deployment. Confidence-based filtering reduces MAE from 0.202 at full coverage to 0.070 for the highest-confidence 10% of predictions, demonstrating that confidence estimates are meaningfully related to prediction error. Unlabelled and out-of-distribution samples exhibit increased uncertainty and reduced confidence, indicating that the framework responds appropriately to distributional shift. The results show that the proposed trust framework provides actionable reliability information for emissions prediction on unlabelled turbines, supporting more transparent and trustworthy deployment of PEMS across industrial fleets.
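The coverage versus error trade-off reported in the abstract can be computed with a few lines of Python. This is a generic selective-prediction sketch, assuming per-sample confidence scores are already available; the function name and the toy numbers are illustrative and are not the paper's data.

```python
import numpy as np


def mae_at_coverage(y_true, y_pred, confidence, coverage):
    """MAE over the most confident `coverage` fraction of the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    n_keep = max(1, int(np.ceil(coverage * len(y_true))))
    # Sort by descending confidence and keep the top n_keep samples.
    keep = np.argsort(-confidence, kind="stable")[:n_keep]
    return float(np.mean(np.abs(y_true[keep] - y_pred[keep])))


y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [0.1, 0.2, 0.4, 0.8]
conf = [0.9, 0.8, 0.3, 0.1]  # toy scores: higher means more trusted
full = mae_at_coverage(y_true, y_pred, conf, coverage=1.0)
half = mae_at_coverage(y_true, y_pred, conf, coverage=0.5)
```

If confidence is informative, the error at reduced coverage (`half`) is lower than at full coverage (`full`), which is the pattern the 0.202 to 0.070 figures describe.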

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

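Scoring rationale: The paper belongs to industrial engineering and conventional machine learning, focusing on gas turbine emissions prediction and uncertainty quantification. It does not involve multimodal large language models (MLLM), tokenizers, visual encoders, world models, reinforcement learning, or other core concepts from the keyword list, so every keyword relevance is 0. The author list contains none of the specified experts, so no bonus applies. The weighted total is 0, far below the dynamic passing score of 27.8.

Keywords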

Predictive emissions monitoring, Gas turbine fleets, Limited labelled data, Trust-aware probabilistic framework, Uncertainty quantification, Confidence estimation, Multi-head recurrent model

257. Tight list replicability bounds via a novel sphere covering theorem [FAIL]

Score: 0.0 / 27.8

Authors: Ari Blondal, Hamed Hatami, Pooya Hatami, Chavdar Lalov, Sivan Tretiak

Published: 2026-06-04

TL;DR: This paper establishes tight bounds on list replicability for VC classes and large-margin half-spaces by proving a novel sphere covering theorem derived from the Borsuk-Ulam theorem.

摘要翻译

近年来，列表可复现性（list replicability）已成为一种用于形式化学习理论中可复现性的框架。一个核心问题是，所需的列表大小如何与假设类的精度参数及自然复杂度度量相关联。为了获得列表可复现性的紧界，我们证明了一个新颖的拓扑球覆盖定理，该定理源自博苏克 - 乌拉姆（Borsuk-Ulam）定理。具体来说，如果 $d$ 维球面被开集覆盖，且每个开集都包含于某个开半球内，则其中必有 $d+1$ 个集合具有非空公共交集。利用这一结果，我们获得了 VC 类中列表大小与精度之间关系的紧界。我们还表明，对于大间隔半空间，只要间隔不是太大，最优列表大小等于环境维数（ambient dimension）。然而，当间隔取为非常大时，我们设计了一种可复现算法，实现了最小列表大小 $\lceil d/2 \rceil + 1$。

Abstract

In recent years, list replicability has emerged as a framework for formalizing reproducibility in learning theory. A central question is how the required list size relates to the accuracy parameter and natural complexity measures of the hypothesis class. To achieve sharp bounds on list replicability, we prove a novel topological sphere covering theorem, derived from the Borsuk-Ulam theorem. Specifically, if the $d$-sphere is covered by open sets, each of which lies in an open hemisphere, then $d+1$ of these sets must have a common intersection. Using this result, we obtain a sharp bound on the relationship between list size and accuracy for VC classes. We also show that for large-margin half-spaces, provided the margin is not too large, the optimal list size equals the ambient dimension. However, when the margin is taken to be very large, we devise a replicable algorithm achieving the minimal list size of $\lceil d/2 \rceil + 1$.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on theoretical machine learning, specifically list replicability bounds using topology (sphere covering theorem, Borsuk-Ulam). The provided keywords relate to deep learning architectures (multimodal models, tokenizers, visual encoders, world models, reinforcement learning). There is no conceptual overlap between the paper's content and the specified keywords. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, etc.) appear in the author list.

Keywords

List replicability, Sphere covering theorem, Borsuk-Ulam theorem, VC classes, Large-margin half-spaces, Learning theory, Reproducibility

258. $p$-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences [FAIL]

Score: 0.0 / 27.8

Authors: Tirtharaj Dash, Gunja Sachdeva

Published: 2026-06-04

TL;DR: This paper proposes pVR, a topological machine learning framework using p-adic bi-filtrations for alignment-free genomic sequence classification, achieving better performance than baselines on low-sample benchmarks.

摘要翻译

我们介绍 pVR，一种用于无比对基因组序列分类的拓扑机器学习框架，该框架结合了 p-adic numbers (p-adic 数) 与 topological data analysis (拓扑数据分析)。每个 DNA 序列沿两个互补轴进行编码：一个是基于 k-mer (k-mer) 前缀的 p-adic distance (p-adic 距离)，用于捕捉层次位置结构；另一个是基于 k-mer 频率的 compositional L1 distance (组成性 L1 距离)，用于捕捉局部序列内容。这两个距离共同参数化一个 bi-filtered Vietoris--Rips complex (双过滤 Vietoris--Rips 复形)，来自该 bi-filtration (双过滤) 的 per-sequence topological summaries (序列拓扑摘要) 作为标准 machine learning classifiers (机器学习分类器) 的特征。我们为该构造建立了理论保证：包括在度量扰动下的稳定性以及对素数选择的不变性，此外还有一个结果解释了为什么单个 p-adic axis (p-adic 轴) 在拓扑上是 uninformative 的，以及为什么 bi-filtration 恢复 nontrivial homology (非平凡同调)。在十二个 genomic benchmarks (基因组基准)（28 到 500 个序列，3 到 7 个类别）上，pVR 在六个 low-sample datasets (低样本数据集) 中的三个上优于四种 established alignment-free baselines (已建立的无比对基线)，增益高达 21 个百分点；它仅在 SARS-CoV-2 variant benchmark (SARS-CoV-2 变异体基准) 上表现较差，该 benchmark 的 point-mutation divergence (点突变分歧) 违反了 hierarchical assumption (层次假设)，且所有方法在 large-sample regime (大样本情形) 下均达到饱和。pVR 还在三个 low-sample benchmarks (低样本基准) 上优于来自拥有 500M-parameter Nucleotide Transformer v2 的 zero-shot frozen embeddings (零样本冻结嵌入)，幅度为 6.7 到 11.4 个百分点。pVR codebase (pVR 代码库) 公开可用，网址为 https://github.com/MAHI-Group/pVR。

Abstract

We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structure, and a compositional $L_1$ distance on $k$-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris--Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single $p$-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks ($28$ to $500$ sequences, $3$ to $7$ classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to $21$ percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by $6.7$ to $11.4$ percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.
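The two distances can be illustrated with a small Python sketch. The encoding here is an assumption, since the abstract does not spell out pVR's construction: each base is mapped to a digit in 1..4, a sequence is read as a base-5 integer with the first base as the lowest digit, and the 5-adic distance is then 5 to the minus the length of the common prefix. The compositional axis is the L1 distance between k-mer frequency vectors.

```python
from collections import Counter

DIGIT = {"A": 1, "C": 2, "G": 3, "T": 4}  # assumed mapping; digits stay below p = 5


def padic_encode(seq, p=5):
    """Read a DNA string as an integer whose i-th base-p digit is the i-th base."""
    return sum(DIGIT[c] * p**i for i, c in enumerate(seq))


def padic_distance(x, y, p=5):
    """p-adic distance |x - y|_p = p^(-v), v = number of leading bases shared."""
    diff = abs(padic_encode(x, p) - padic_encode(y, p))
    if diff == 0:
        return 0.0
    v = 0
    while diff % p == 0:
        diff //= p
        v += 1
    return float(p) ** (-v)


def kmer_freqs(seq, k):
    """Relative frequencies of the overlapping k-mers of a sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def kmer_l1_distance(x, y, k=2):
    """L1 distance between the k-mer frequency vectors of two sequences."""
    fx, fy = kmer_freqs(x, k), kmer_freqs(y, k)
    return sum(abs(fx.get(m, 0.0) - fy.get(m, 0.0)) for m in set(fx) | set(fy))


d_prefix = padic_distance("ACGT", "ACGA")      # three shared leading bases
d_content = kmer_l1_distance("AAAA", "CCCC")   # disjoint 2-mer content
```

The prefix distance is an ultrametric, so it only records where two sequences first diverge, while the k-mer distance ignores position entirely; pairing them is what gives the bi-filtration two genuinely different scales.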

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on Topological Data Analysis (TDA) applied to genomic sequences, utilizing p-adic distances and Vietoris-Rips complexes for feature extraction. This methodology is distinct from the domains covered by the evaluation keywords, which pertain to Multimodal Large Language Models (MLLM), World Models, Reinforcement Learning, and Visual Architectures. There is no conceptual overlap regarding tokenization strategies, visual encoding, world modeling, or model-based RL. Furthermore, the author list does not include any of the specified experts.

Keywords

p-adic numbers, Topological Data Analysis, Genomic Sequence Classification, Alignment-free, Vietoris-Rips complex, k-mer, Topological summaries

259. Learning solution operators of PDEs with sparse approximation methods [FAIL]

Score: 0.0 / 27.8

Authors: Sebastian Neumayer, Daniel Potts, Fabian Taubert

Published: 2026-06-04

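TL;DR: This paper proposes a sparse approximation method based on orthogonal matching pursuit for efficiently learning solution operators of partial differential equations, substantially reducing the number of samples required.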

摘要翻译

本文利用稀疏高维技术研究偏微分方程（PDEs）解算子的近似问题。基于维度增量框架，我们将乘积基展开与稀疏恢复方法相结合，具体采用正交匹配追踪（OMP），相较于先前考虑的基于求积（cubature）的方法，显著降低了所需的样本量。我们在若干示例上对该方法进行了数值评估，将其与基于求积的稀疏近似以及傅里叶神经算子（Fourier Neural Operators）在精度、计算时间和样本量方面进行比较。实验表明，相较于先前方法，我们的方法显著减少了所需的 PDE 求解次数，同时保持了具有竞争力的精度，尤其是在解在所选基下具有稀疏表示的情况下。此外，恢复的稀疏索引集提供了关于相关变量及参数交互的可解释性洞察。

Abstract

We investigate the approximation of solution operators for partial differential equations (PDEs) using sparse high-dimensional techniques. Building on a dimension-incremental framework, we combine product basis expansions with sparse recovery methods, specifically orthogonal matching pursuit (OMP), to substantially reduce the required sample size compared with a previously considered cubature-based approach. We evaluate the resulting method numerically on several examples, comparing it against both cubature-based sparse approximation and Fourier neural operators in terms of accuracy, runtime, and sample size. The experiments show that our approach considerably reduces the number of required PDE solves relative to its predecessor while maintaining competitive accuracy, particularly when the solution admits a sparse representation in the chosen basis. Furthermore, the recovered sparse index sets yield interpretable insights into the relevant variables and parameter interactions.
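For readers unfamiliar with orthogonal matching pursuit, a minimal NumPy sketch of the generic algorithm follows. It recovers a sparse coefficient vector from fewer samples than unknowns; the random Gaussian matrix stands in for a sampled basis matrix and is an assumption, and the dimension-incremental framework of the paper is not reproduced here.

```python
import numpy as np


def omp(A, y, n_nonzero):
    """Greedy sparse recovery of x from y = A @ x with at most n_nonzero entries."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    col_norms = np.linalg.norm(A, axis=0)
    for _ in range(n_nonzero):
        # Pick the column most correlated with the current residual.
        scores = np.abs(A.T @ residual) / col_norms
        if support:
            scores[support] = 0.0
        support.append(int(np.argmax(scores)))
        # Re-fit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x


rng = np.random.default_rng(0)
A = rng.standard_normal((80, 100))  # 80 samples, 100 candidate basis functions
x_true = np.zeros(100)
x_true[[5, 37, 81]] = [3.0, -2.0, 1.5]
y = A @ x_true
x_hat = omp(A, y, n_nonzero=3)
```

The support of `x_hat` plays the role of the recovered sparse index set: it names which basis functions, and hence which variables, matter.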

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

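Scoring rationale: The paper studies sparse approximation of solution operators for partial differential equations (PDEs), which belongs to scientific computing. The provided keywords (e.g., MLLM, Tokenizer, Visual Encoder, World Models, model-based RL) all belong to multimodal large models and reinforcement learning; the two research paradigms are entirely different, so every relevance score is 0. The author list contains none of the specified experts, so no bonus applies.

Keywords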

Partial Differential Equations, Solution Operators, Sparse Approximation, Orthogonal Matching Pursuit, Product Basis Expansions, Numerical Analysis, Sample Size Reduction

260. Adaptive Learning Rates with Surrogate Probability for Follow-the-Perturbed-Leader [FAIL]

Score: 0.0 / 27.8

Authors: Jongyeong Lee, Junya Honda, Shinji Ito, Chansoo Kim

Published: 2026-06-04

TL;DR: This paper proposes an adaptive learning rate mechanism for Follow-the-Perturbed-Leader algorithms in online bandit problems using surrogate probability functions to achieve best-of-both-worlds guarantees.

摘要翻译

Follow-the-regularized-leader (FTRL) 框架在在线学习问题中已展现出有效性和灵活性，其中学习率的选择至关重要。最近，通过求解凸优化获得的、基于臂选择概率定义的自适应学习率在各种 bandit 问题中实现了改进的 best-of-both-worlds (BOBW) 保证。相比之下，对于其计算效率更高的替代方案 follow-the-perturbed-leader (FTPL)，其 BOBW 保证仍然相对有限，因为其无需优化的特性反而使得设计自适应、依赖概率的学习率变得非平凡。为了解决这一挑战，我们通过引入代理概率函数来为 FTPL 提出自适应学习率，这些函数仅从可用量计算得出，无需精确概率。基于这些带有代理函数的学习率，我们提供了 FTPL 在 Pareto 扰动下针对任意形状参数 $α>1$ 的 BOBW 保证，推广了先前仅限于特定选择 $α=2$ 的结果。我们进一步展示了在带有专家建议的 bandit 问题中，FTPL 采用自适应学习率时的 BOBW 保证。我们的方法保留了 FTPL 的计算简单性，同时实现了依赖概率的自适应性，基于代理的方法学可能在 FTPL 和学习率设计之外的其他算法框架中具有独立的研究意义。

Abstract

The follow-the-regularized-leader (FTRL) framework has shown effectiveness and flexibility in online learning problems, where the choice of learning rates is known to be crucial. Recently, adaptive learning rates defined in terms of the arm-selection probabilities, obtained by solving a convex optimization problem, have achieved improved best-of-both-worlds (BOBW) guarantees in various bandit problems. In contrast, BOBW guarantees for its computationally efficient alternative, follow-the-perturbed-leader (FTPL), remain relatively limited since its optimization-free nature ironically makes the design of adaptive, probability-dependent learning rates non-trivial. To address this challenge, we propose an adaptive learning rate for FTPL by introducing surrogate probability functions that can be computed only from the available quantities, without requiring the exact probabilities. Based on these learning rates with surrogate functions, we provide the BOBW guarantee for FTPL with Pareto perturbations for any shape parameter $\alpha > 1$, generalizing prior results restricted to the specific choice $\alpha = 2$. We further show the BOBW guarantees for FTPL with adaptive learning rates in the bandit problem with expert advice. Our approach preserves the computational simplicity of FTPL while enabling probability-dependent adaptivity, and the surrogate-based methodology may be of independent interest in other algorithmic frameworks beyond FTPL and learning rate designs.
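The optimization-free nature of FTPL is easy to see in code: choosing an arm takes one perturbation draw and an argmin. The Python sketch below shows only that selection step with a fixed learning rate `eta`, which is an assumption; the surrogate-probability learning rates and the loss estimation that the paper is about are not implemented.

```python
import numpy as np


def ftpl_select(cum_loss_est, eta, alpha, rng):
    """Pick an arm by FTPL with Pareto(alpha) perturbations of scale 1."""
    # numpy's pareto() samples the Lomax law; adding 1 gives the classical Pareto.
    perturb = rng.pareto(alpha, size=len(cum_loss_est)) + 1.0
    return int(np.argmin(np.asarray(cum_loss_est) - perturb / eta))


rng = np.random.default_rng(0)
cum_loss = [0.0, 5.0, 5.0]  # toy cumulative loss estimates for three arms
draws = [ftpl_select(cum_loss, eta=1.0, alpha=2.0, rng=rng) for _ in range(2000)]
counts = np.bincount(draws, minlength=3)
```

The arm-selection probabilities exist only implicitly, as the distribution of this argmin, which is why learning rates that depend on them are harder to design for FTPL than for FTRL.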

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on online learning algorithms (Follow-the-Perturbed-Leader) and adaptive learning rates in bandit problems, whereas the provided keywords pertain to Multimodal Large Language Models (MLLM), World Models, Visual Encoders, and Tokenizers. There is no substantive overlap between the theoretical optimization framework of the paper and the multimodal/generative AI concepts specified in the keywords.

Keywords

Follow-the-Perturbed-Leader, Adaptive Learning Rates, Surrogate Probability, Online Learning, Bandit Problems, Best-of-Both-Worlds, Convex Optimization

261. Adaptive Oscillatory-State Alignment for Time Series Forecasting [FAIL]

Score: 0.0 / 27.8

Authors: Zhangyao Song, Ziqiong Li, Xiangfei Qiu, Chao Zha, Yinfei Xu, Tao Guo

Published: 2026-06-04

TL;DR: The paper introduces AOSNET, a Hilbert-guided forecasting framework that adaptively aligns oscillatory states to manage non-stationary time series, achieving state-of-the-art performance.

摘要翻译

长期时间序列预测得益于能够揭示重复时间结构的归纳偏置。现有的周期性预测方法通常通过预定义周期、全局频谱分量或固定可学习模板来对重复性进行建模。然而，现实世界的时间动态很少具有刚性周期性：振荡行为通常通过幅度调制、相位漂移和局部频率变化而演变。在此条件下，固定模板周期性建模可能与底层时间状态从根本上不匹配。我们提出 AOSNET（一种希尔伯特引导的预测框架），该框架将周期性预测从固定模板匹配重新表述为自适应振荡状态对齐。AOSNET 从观测序列和可学习的全局振荡先验中提取解析信号描述符，随后通过描述符条件门自适应地对齐局部状态，该门选择性保留可靠观测，同时平滑地修正不匹配区域。所学习的先验并非作为刚性重复模板，而是作为通过局部状态动态解释的灵活振荡参考。在八个基准数据集上的实验表明，该方法具有最先进的或极具竞争力的准确性，且推理速度快。受控的合成实验隔离了幅度调制、相位漂移和局部频率变化，证实振荡状态对齐的优势随着非平稳性的增强而持续增加。

Abstract

Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or fixed learnable templates. However, real-world temporal dynamics are rarely rigidly periodic: oscillatory behavior often evolves through amplitude modulation, phase drift, and local frequency variation. Under these conditions, fixed-template periodic modeling can become fundamentally mismatched to the underlying temporal states. We propose AOSNET, a Hilbert-guided forecasting framework that reformulates periodic forecasting from fixed template matching to adaptive oscillatory-state alignment. AOSNET extracts analytic-signal descriptors from both the observed sequence and a learnable global oscillatory prior, then adaptively aligns local states through a descriptor-conditioned gate that selectively preserves reliable observations while softly correcting mismatched regions. The learned prior serves not as a rigid repeated template but as a flexible oscillatory reference interpreted through local state dynamics. Experiments on eight benchmarks demonstrate state-of-the-art or highly competitive accuracy with fast inference speed. Controlled synthetic studies isolating amplitude modulation, phase drift, and local frequency variation confirm that the advantage of oscillatory-state alignment consistently increases as non-stationarity intensifies.
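Analytic-signal descriptors of the kind the abstract refers to can be computed with an FFT-based Hilbert transform. The NumPy sketch below is generic signal processing, assuming a uniformly sampled series; which descriptors AOSNET actually uses, and how it gates them, is not specified in the abstract, so the function names are illustrative.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal via the FFT: keep DC, double positive frequencies, drop negative ones."""
    x = np.asarray(x, dtype=float)
    n = x.size
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * h)


def oscillatory_descriptors(x, fs):
    """Instantaneous amplitude, unwrapped phase, and frequency (Hz) of a series."""
    z = analytic_signal(x)
    amplitude = np.abs(z)
    phase = np.unwrap(np.angle(z))
    frequency = np.diff(phase) * fs / (2.0 * np.pi)
    return amplitude, phase, frequency


fs = 100.0
t = np.arange(200) / fs
x = np.cos(2.0 * np.pi * 5.0 * t)  # a pure 5 Hz tone over an integer number of cycles
amplitude, phase, frequency = oscillatory_descriptors(x, fs)
```

For a rigidly periodic input all three descriptors are trivial; amplitude modulation, phase drift, and local frequency variation each show up as time variation in the corresponding output.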

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on time series forecasting using adaptive oscillatory-state alignment and Hilbert transforms, which belongs to signal processing and numerical prediction. The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, Tokenizers, Visual Encoders, and Reinforcement Learning. There is no conceptual overlap between the paper's domain and the specified MLLM-related keywords, resulting in zero relevance for all terms.

Keywords

Time Series Forecasting, Oscillatory-State Alignment, Hilbert Transform, AOSNET, Non-stationary Dynamics, Analytic-signal Descriptors, Adaptive Alignment

262. Fast and Robust Convergence Rate for TD(0) with Linear Function Approximation, Universal Learning Steps and I.I.D. Samples [FAIL]

Score: 0.0 / 27.8

Authors: Ziad Kobeissi, Éloïse Berthier

Published: 2026-06-04

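TL;DR: This paper establishes a fast and robust convergence rate for TD(0) with linear function approximation, showing that under i.i.d. samples the mean-square error has the optimal 1/k dependence and does not depend on the smallest eigenvalue of the covariance matrix.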

Abstract

In this paper, we study the finite-time behavior of the TD(0) temporal-difference method with linear function approximation (LFA). We consider on-policy independent and identically distributed (i.i.d.) samples, a constant learning step, and the Polyak-Juditsky averaging method. We establish a new convergence rate, for the Mean-Square Error (MSE) on the approximated function, that is (i) fast in the sense that it admits an optimal dependency in the number of iterations k (i.e., of order 1/k), (ii) robust to ill-conditioning: it only depends on an initial error and model-independent constants and (iii) sharp up to a multiplicative constant lower than 11. In particular, it does not depend on the smallest eigenvalue of the uncentered covariance matrix of the linear parametrization, unlike all pre-existing O(1/k) rates in the TD(0) literature. We also introduce PCTD(0), a variant of TD(0), which benefits from better convergence properties under an additional assumption of strong mixing on the Markov Chain.
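The setting in the abstract (TD(0) with linear features, i.i.d. on-policy samples, a constant step, and averaged iterates) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the paper's experiment: the two-state chain `P`, rewards `R`, sampling distribution `MU`, one-hot features, and the step size are hand-picked, and PCTD(0) is not shown.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td0_polyak(sample, phi, dim, gamma, alpha, steps, seed=0):
    """TD(0) with linear features, constant step alpha, and a running
    average of the iterates (Polyak-Juditsky style averaging)."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for k in range(1, steps + 1):
        s, r, s_next = sample(rng)                      # one i.i.d. transition
        f = phi(s)
        delta = r + gamma * dot(theta, phi(s_next)) - dot(theta, f)  # TD error
        theta = [t + alpha * delta * x for t, x in zip(theta, f)]
        theta_bar = [b + (t - b) / k for b, t in zip(theta_bar, theta)]
    return theta_bar

# Hypothetical two-state Markov reward process.
P = [[0.9, 0.1], [0.2, 0.8]]   # transition probabilities
R = [1.0, 0.0]                 # reward received in each state
MU = [2 / 3, 1 / 3]            # stationary distribution of P

def sample(rng):
    """Draw s from the stationary distribution, then s' from P[s]."""
    s = 0 if rng.random() < MU[0] else 1
    s_next = 0 if rng.random() < P[s][0] else 1
    return s, R[s], s_next

def one_hot(s):
    return [1.0 if i == s else 0.0 for i in range(2)]

estimate = td0_polyak(sample, one_hot, dim=2, gamma=0.9, alpha=0.1, steps=100_000)

# Exact values from V = (I - gamma * P)^(-1) R, solved by hand for this chain.
exact = [0.28 / 0.037, 0.18 / 0.037]
```

With one-hot features the TD fixed point coincides with the true value function, so `estimate` can be compared directly against `exact`.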

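Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on the theoretical convergence analysis of the TD(0) algorithm in reinforcement learning, whereas the provided keyword set mainly concerns multimodal large-model architectures (e.g., Tokenizer, Visual Encoder, MLLM, World Models) and model unification. The paper does not involve multimodal processing, large-model training, or world-model construction, and TD(0) is model-free rather than model-based RL, so the relevance of every keyword is 0.

Keywords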

TD(0), Linear Function Approximation, Convergence Rate, Mean-Square Error, I.I.D. Samples, Polyak-Juditsky Averaging, Model-independent Constants

263. Short paper: Models in the dark -- Rectification and erasure under GDPR in ML supply chains [FAIL]

Score: 0.0 / 27.8

Authors: Henrik Graßhoff, Malte Hansen, Meiko Jensen, Sara Ramezanian

Published: 2026-06-04

TL;DR: This paper investigates the challenges of enforcing GDPR rectification and erasure rights in complex ML supply chains, introducing the concept of "models in the dark" to describe downstream models lacking transparency and highlighting the gap between legal requirements and technical implementation.

Abstract

The rights to rectification and erasure, as established under the General Data Protection Regulation (GDPR), are central to protecting individuals' privacy. However, their effective enforcement in machine learning (ML) systems remains challenging. Existing work has largely addressed these rights from either a legal or a technical perspective in isolation and disregards the fact that models are produced in complex supply chains involving multiple actors across development, distribution, and deployment. This paper presents a holistic survey of challenges in implementing the rights to rectification and erasure in ML models. Drawing on academic literature and guidance from data protection authorities, we find that many GDPR requirements cannot yet be technically met in practice. Our findings further suggest that issues arising in ML supply chains are insufficiently addressed in research. To tackle this gap, we introduce the notion of models in the dark -- derived models created further downstream in an ML chain without sufficient transparency or traceability -- and analyse the urgent challenges posed by this phenomenon. By adopting an interdisciplinary perspective, this work contributes to bridging the gap between legal requirements and the technical implementation of data subject rights in ML, ultimately supporting the development of trustworthy artificial intelligence.

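Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper mainly examines the challenges of implementing the rights to rectification and erasure in ML supply chains under the GDPR, centering on legal compliance and supply-chain transparency. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all point to specific technical areas such as multimodal large-model architectures, representation learning, and reinforcement learning. The paper does not touch on these components or methods, so it is unrelated to every scoring keyword and each relevance score is 0.

Keywords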

GDPR, Rectification, Erasure, ML Supply Chains, Models in the Dark, Legal Requirements, Technical Implementation

264. Addressing Imbalance in Multi-Label Data via Label-Specific Distance-based Oversampling [FAIL]

Score: 0.0 / 27.8

Authors: Bin Liu, Jun Wu, Haoyu Peng, Ao Zhou, Jin Wang, QiaoSong Chen, Grigorios Tsoumakas

Published: 2026-06-04

TL;DR: This paper proposes a label-specific distance-based oversampling method to mitigate multi-label classification imbalance by generating consistent synthetic instances, but it is unrelated to multimodal models or reinforcement learning.

Abstract

The complex imbalanced label distribution poses a crucial challenge to multi-label classification, as most classifiers are biased towards the majority class and high-frequent labels. Oversampling is an efficient and flexible solution that augments instances to provide a more balanced training dataset for multi-label classifiers. Most existing oversampling methods create synthetic instances in a heuristic way that essentially relies on neighborhood information retrieved using Euclidean distance within the entire feature space. However, they fail to consider the varying semantic relevance of features to different labels, leading to label inconsistency among proximate neighbors and further introducing label confusion and overfitting to synthetic instances. To overcome the above issue, we propose a novel sampling approach called Label-Specific Distance-based Multi-Label Oversampling (LSDMLO) that creates more useful and well-labeled synthetic instances to address the imbalance in multi-label datasets. LSDMLO derives the label-specific distance to identify label-consistent neighbors based on the weighted pertinent feature space, which facilitates selecting seed instances that express more label correlations in boundary areas and generating synthetic instances aligned with the label distribution of original data. The comprehensive experiments verify that the proposed LSDMLO outperforms the state-of-the-art multi-label sampling approaches under various base classifiers.
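The contrast the abstract draws, between neighbors found with plain Euclidean distance and neighbors found with a label-specific weighted distance, can be sketched as follows. The abstract does not say how LSDMLO derives its feature weights, so the weights here are hand-set placeholders, and the SMOTE-style interpolation in `synthesize` is a generic stand-in rather than the authors' generation step. All function names are hypothetical.

```python
import math
import random

def weighted_distance(x, y, weights):
    """Distance in a feature space rescaled by per-feature weights."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def nearest_neighbor(seed, candidates, weights):
    """Candidate closest to the seed under the given feature weights."""
    return min(candidates, key=lambda c: weighted_distance(seed, c, weights))

def synthesize(seed, neighbor, rng):
    """Generic interpolation between a seed instance and its neighbor."""
    u = rng.random()
    return [a + u * (b - a) for a, b in zip(seed, neighbor)]

seed = [0.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 3.0]]

# Equal weights reproduce the ordinary Euclidean neighbor.
euclidean_nn = nearest_neighbor(seed, candidates, [1.0, 1.0])

# Placeholder weights for one label that treat feature 0 as highly relevant
# and feature 1 as nearly irrelevant: the neighbor changes.
label_nn = nearest_neighbor(seed, candidates, [10.0, 0.01])

new_point = synthesize(seed, label_nn, random.Random(0))
```

The point of the example is only that the neighborhood, and therefore the synthetic instance, depends on which features are treated as relevant for a given label.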

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on multi-label classification imbalance and oversampling using label-specific distance, which belongs to supervised learning data preprocessing. It does not involve large language models, multimodal architectures (visual encoders, tokenizers), world models, or reinforcement learning. Although it mentions 'Multi-Label', this is distinct from 'MultiModal' in the context of MLLM and visual encoders provided in the keyword list, resulting in negligible relevance to all specified keywords.

Keywords

Multi-label classification, Data imbalance, Oversampling, Label-specific distance, Synthetic instances, Feature space, Label consistency

265. Finding Most Influential Sets [FAIL]

Score: 0.0 / 27.8

Authors: Lucas D. Konrad, Nikolas Kuschnig

Published: 2026-06-04

TL;DR: This paper proposes an efficient algorithm based on Dinkelbach's method to identify most influential sets in causal inference, reducing the computational complexity from combinatorial search to a sequence of top-k problems.

Abstract

Identifying most influential sets (MIS) - size-$k$ subsets whose removal maximally changes a target estimand - is typically infeasible because it requires searching over $\binom{n}{k}$ subsets. For estimands with linear-fractional leave-set-out effects, we show that MIS selection reduces to a one-parameter sequence of top-$k$ problems. Dinkelbach's method yields an algorithm with $\mathcal{O}(n)$ cost per iteration and finite termination. For fixed residualized inputs, the algorithm returns a globally optimal set for the univariate ratio objective, including the oracle-residualized partial linear model. With estimated nuisance functions, uniform denominator and generated-score stability imply approximation to the first-order oracle orthogonal-score objective; exact set recovery follows under a separation condition. Simulations and applications show that the method recovers exact MIS that were previously computationally inaccessible.
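The reduction described in the abstract, from a search over $\binom{n}{k}$ subsets to a sequence of top-$k$ problems, can be sketched with Dinkelbach's method on a generic linear-fractional objective. The specific form used here, $(\sum_{i \in S} a_i) / (d_0 - \sum_{i \in S} b_i)$ with a positive denominator for every size-$k$ set, is an assumption chosen for illustration, as are the names and data. `heapq.nlargest` costs $O(n \log k)$ per iteration, not the $\mathcal{O}(n)$ selection the paper states.

```python
import heapq
import itertools

def ratio(subset, a, b, d0):
    """Linear-fractional objective: sum(a_i) / (d0 - sum(b_i)) over the subset."""
    return sum(a[i] for i in subset) / (d0 - sum(b[i] for i in subset))

def most_influential_set(a, b, d0, k, tol=1e-12, max_iter=100):
    """Dinkelbach iteration: each step is a top-k problem on a_i + lam * b_i.
    Assumes d0 - sum(b_i) > 0 for every size-k subset."""
    n = len(a)
    best = list(range(k))                 # arbitrary starting subset
    lam = ratio(best, a, b, d0)
    for _ in range(max_iter):
        cand = heapq.nlargest(k, range(n), key=lambda i: a[i] + lam * b[i])
        # max over S of N(S) - lam * D(S), rewritten as a sum of per-item scores
        gap = sum(a[i] + lam * b[i] for i in cand) - lam * d0
        if gap <= tol:                    # no subset beats the current ratio
            break
        best, lam = cand, ratio(cand, a, b, d0)
    return sorted(best), lam

def brute_force(a, b, d0, k):
    """Exhaustive search, feasible only for tiny n; used as a cross-check."""
    return max(ratio(c, a, b, d0) for c in itertools.combinations(range(len(a)), k))

# Hand-made illustrative data.
a = [0.5, -0.2, 0.9, 0.1, 0.7, -0.6]
b = [0.3, 0.8, 0.1, 0.5, 0.9, 0.2]
subset, value = most_influential_set(a, b, d0=6.0, k=2)
check = brute_force(a, b, d0=6.0, k=2)
```

The sketch maximizes the signed ratio; maximizing the absolute change would need a second run on the negated numerator.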

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper addresses causal inference and optimization problems regarding influential sets in statistical estimands. The provided keywords pertain to Multimodal Large Language Models, World Models, and Reinforcement Learning. There is no thematic or technical overlap between the paper's domain (statistics/causal ML) and the specified keywords (multimodal/RL), resulting in zero relevance for all keywords.

Keywords

Most Influential Sets, Causal Inference, Dinkelbach's Method, Top-k Problems, Estimand, Optimization, Subset Selection

266. Representing Research Attention as Contextually Structured Flows [FAIL]

Score: 0.0 / 27.8

Authors: Jessica Rodrigues, Angelo Salatino, Gard Jenset, Scott Hale

Published: 2026-06-04

TL;DR: This paper proposes representing research attention as contextually structured flows to better capture temporal and contextual evolution, demonstrating improved structural comparison and robustness in research evaluation compared to aggregated counts.

Abstract

Research attention is widely used as an indicator of visibility, influence, and societal uptake, yet it is typically represented as aggregated counts that do not preserve how attention develops across contexts over time. This creates a mismatch between how attention is interpreted and how it is represented. We propose attention flows as contextually structured representations that encode the organisation of attention and its evolution over time. We evaluate whether these representations capture transferable structure by constructing a benchmark based on analogy-style reasoning across research outputs. Comparing signal, sequence, and flow-based representations, we find that flow representations more effectively support structural comparison, particularly in settings where attention is shaped by temporal progression or context distributions. We further show that learned flow representations improve robustness under partial observation and structural perturbation. Overall, these results support modelling attention as a contextually structured phenomenon and provide a basis for more informative approaches to research evaluation.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on bibliometric analysis of research attention using contextually structured flows, which is unrelated to the provided keywords concerning AI model architectures (Unify Models, Tokenizer, Visual Encoder), Multimodal Large Language Models (MLLM, MultiModal), World Models, or Reinforcement Learning (model-based RL). Thus, all keyword relevance scores are 0. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list (Jessica Rodrigues, Angelo Salatino, Gard Jenset, Scott Hale).

Keywords

Research Attention, Contextually Structured Flows, Analogy-style Reasoning, Temporal Progression, Structural Comparison, Robustness, Research Evaluation

267. Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill [FAIL]

Score: 0.0 / 27.8

Authors: Mehmet Iscan

Published: 2026-06-04

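TL;DR: This study investigates whether Popperian procedural content improves LLM code generation beyond what structural prompting provides, and finds no significant additional gain over labels-only scaffolds.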

Abstract

Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: if it appears to help, is the gain from the skill's Popperian content, or from the structure any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163) all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164) structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over a labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index. In the two settings tested, the skill's Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.

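Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on prompt engineering and evaluation bias in LLM code generation, using ablation experiments to analyze the effectiveness of Popperian skill scaffolds. It does not involve multimodal representation learning, world-model construction, reinforcement learning, visual encoders, or unified model architectures, so it has no direct relevance to any of the provided technical keywords.

Keywords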

Large language models, Code generation, Popperian skill, Scaffold, Ablation study, Prompt engineering, Evaluation bias, Pre-registered study

268. A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation [FAIL]

Score: 0.0 / 27.8

Authors: Petr Parshakov

Published: 2026-06-04

TL;DR: This paper introduces a Komi-Yazva-Russian parallel corpus and evaluation protocol to assess large language model performance in zero-shot and few-shot low-resource translation tasks.

Abstract

We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. The protocol includes story-level cross-validation, deterministic retrieval for few-shot prompting, strict validation of generated outputs, complementary reference-based and judge-based metrics, and story-level uncertainty estimates. Across models, LLMs produce non-trivial translations, but performance varies strongly by model family and prompting regime. Retrieval-based few-shot prompting consistently improves over zero-shot prompting, while gains beyond a small retrieved context remain limited. The results show that evaluative conclusions in this setting depend materially on metric choice and failure handling, so the paper frames the corpus as both a dataset contribution and a reproducible evaluation testbed for endangered-language machine translation.
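The leakage-aware, story-level cross-validation described in the abstract can be sketched as a grouped split: every sentence pair from a given story lands in the same fold, so no story contributes to both the training (or retrieval) pool and the test set. The function name, the round-robin fold assignment, and the story identifiers below are illustrative assumptions, not the paper's released protocol.

```python
def story_level_folds(story_ids, n_folds):
    """Split sentence indices into folds so that each story stays in one fold.

    story_ids[i] is the story identifier of sentence pair i.
    Returns a list of (train_indices, test_indices) tuples.
    """
    stories = sorted(set(story_ids))                    # deterministic order
    fold_of = {story: i % n_folds for i, story in enumerate(stories)}
    folds = []
    for fold in range(n_folds):
        test = [i for i, s in enumerate(story_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(story_ids) if fold_of[s] != fold]
        folds.append((train, test))
    return folds

# Hypothetical corpus: seven sentence pairs drawn from four stories.
story_ids = ["s1", "s1", "s2", "s3", "s3", "s3", "s4"]
folds = story_level_folds(story_ids, n_folds=2)
```

In a retrieval-based few-shot setup, the few-shot examples for a test sentence would then be retrieved only from the `train` indices of the same fold.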

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on low-resource machine translation (Komi-Yazva to Russian) and evaluation protocols for LLMs. It does not involve multimodal learning (no Visual Encoder, not MultiModal, not MLLM), reinforcement learning (not model-based RL), world models, or model unification strategies. Tokenizer design is not a core contribution. Thus, all provided keywords are irrelevant to the paper's core content.

Keywords

Komi-Yazva, Russian, Parallel Corpus, Evaluation Protocol, Zero-shot, Few-shot, LLM Translation, Low-resource

269. "Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard [FAIL]

Score: 0.0 / 27.8

Authors: Edoardo Signoroni, Pavel Rychlý

Published: 2026-06-04

TL;DR: This paper audits Lombard text corpora to reveal severe data quality issues and representational bias between Western and Eastern varieties, advocating for community-driven curation over quantity-driven scraping.

Abstract

Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on auditing text corpora for the under-resourced Lombard language, addressing data quality, bias, and curation issues. It does not discuss Unify Models, Tokenizers, Visual Encoders, World Models, MLLMs, Multimodal architectures, or Model-Based Reinforcement Learning. Therefore, there is no relevance to the provided keywords.

Keywords

Text Corpora, Lombard Language, Data Auditing, Under-resourced Language, Representational Bias, Web-scraped Data, Data Curation

270. From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation [FAIL]

Score: 0.0 / 27.8

Authors: Paloma Piot, Javier Parapar

Published: 2026-06-04

TL;DR: This study evaluates the capability of persona-conditioned LLMs to simulate demographic perspectives for hate speech annotation, concluding that vicarious prompting with Llama 3.1 best approximates human disagreement patterns.

Abstract

Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale. Persona-conditioned Large Language Models (models prompted to adopt a specific demographic identity) have been proposed as a way to simulate diverse perspectives at scale. But do they actually reflect how different groups disagree? We evaluate three aspects of human social judgement: (i) whether personas from different groups disagree in human-like ways (inter-group disagreement), (ii) whether they become more sensitive when content targets their own identity (in-group sensitivity), and (iii) whether they can accurately predict how another group would react (vicarious prediction). Our results show that no model consistently captures all three dimensions, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on NLP, social bias, and hate speech annotation using persona-conditioned LLMs. The provided keywords pertain to Multimodal Architecture, World Models, and Reinforcement Learning (Vision/RL domains). There is no technical overlap regarding visual encoders, tokenizers, world modeling, or RL methodologies in the provided abstract. All keyword scores are 0.0 due to complete domain mismatch. Total weighted score is 0.0, well below the dynamic passing score of 27.8.
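The arithmetic behind the "Score: 0.0 / 27.8" line can be sketched as a weighted sum of per-keyword relevance compared against the passing score. The report does not state its exact formula, so both the weighted-sum rule and the `>=` pass condition below are assumptions; only the weights, the relevance values, and the 27.8 threshold come from this entry.

```python
# Weights and relevance values as listed in the scoring table of this entry.
WEIGHTS = {
    "Unify Models": 1.5,
    "Tokenizer": 1.5,
    "Visual Encoder": 1.5,
    "World Models": 1.5,
    "MLLM": 1.5,
    "MultiModal": 1.5,
    "model-based RL": 1.5,
}
RELEVANCE = {keyword: 0.0 for keyword in WEIGHTS}   # every keyword scored 0.0/10
PASSING_SCORE = 27.8                                 # "dynamic passing score" from the rationale

def weighted_score(relevance, weights):
    """Assumed rule: sum of weight * relevance over all keywords."""
    return sum(weights[k] * relevance[k] for k in weights)

def verdict(score, passing_score):
    """Assumed rule: a paper passes when its score reaches the threshold."""
    return "PASS" if score >= passing_score else "FAIL"

score = weighted_score(RELEVANCE, WEIGHTS)
status = verdict(score, PASSING_SCORE)
```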

Keywords

Hate speech detection, Persona-conditioned LLMs, Demographic perspective-taking, Vicarious prompting, Annotation evaluation, Social bias, Llama 3.1

271. Revisiting Lexicon Evaluation in Unsupervised Word Discovery [FAIL]

Score: 0.0 / 27.8

Authors: Simon Malan, Danel Slabbert, Herman Kamper

Published: 2026-06-04

TL;DR: This paper proposes two new evaluation metrics for unsupervised word discovery in speech processing to correct biases in normalized edit distance, demonstrating improved correlation with ground-truth distributions.

Abstract

Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.
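The large-cluster bias described in the abstract can be illustrated with a small sketch of normalized edit distance (NED) over phoneme sequences. Two aggregation choices are compared: pooling all within-cluster pairs, where a cluster of size m contributes m(m-1)/2 pairs and so dominates, and averaging per-cluster means. Neither function is the modified or inverse metric the authors propose; the names and the toy lexicon are illustrative assumptions.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]

def ned(a, b):
    """Edit distance normalized by the longer sequence (assumes non-empty input)."""
    return edit_distance(a, b) / max(len(a), len(b))

def cluster_pairs(cluster):
    """NED of every unordered pair of units within one cluster."""
    return [ned(a, b) for a, b in combinations(cluster, 2)]

def pooled_ned(clusters):
    """Average over all within-cluster pairs; large clusters dominate."""
    distances = [d for c in clusters for d in cluster_pairs(c)]
    return sum(distances) / len(distances)

def per_cluster_ned(clusters):
    """Average of per-cluster means; each cluster with a pair counts once."""
    means = [sum(p) / len(p) for p in map(cluster_pairs, clusters) if p]
    return sum(means) / len(means)

# Toy lexicon: one large, pure cluster and one small, impure cluster.
cat = ["k", "ae", "t"]
dog = ["d", "o", "g"]
clusters = [[cat, cat, cat, cat], [dog, cat]]
pooled = pooled_ned(clusters)
balanced = per_cluster_ned(clusters)
```

Here the pure cluster supplies six of the seven pairs, so the pooled score looks far better than the per-cluster average even though half of the clusters are wrong.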

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on unsupervised word discovery and lexicon evaluation metrics in speech processing, whereas the provided keywords relate to Multimodal Large Language Models, World Models, and Reinforcement Learning architectures. There is no technical or thematic overlap between the speech processing evaluation methods and the specified model paradigms. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Unsupervised Word Discovery, Lexicon Evaluation, Normalized Edit Distance, Clustering Bias, Speech Processing, Zero-resource, Metric Proposal, Ground-truth Distribution

272. CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios [FAIL]

Score: 0.0 / 27.8

Authors: Michal Tichý, Jindřich Libovický

Published: 2026-06-04

TL;DR: The paper introduces CHALIS, a challenging benchmark dataset for language identification that exposes significant performance gaps in existing systems when handling cousin languages and orthographic noise.

Abstract

We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts: First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian). The second part tests for orthography noise: we transliterate text across multiple scripts, remove diacritics, simulate homoglyph attacks, and use Internet slang. We evaluate four widely used language identification systems on CHALIS and demonstrate that all struggle substantially in these scenarios, especially on lower-resource languages within cousin pairs and on transliterated input. The resource is publicly available at https://huggingface.co/datasets/michal-tichy/CHALIS.
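Two of the orthographic perturbations named in the abstract, diacritic removal and homoglyph substitution, can be sketched with the standard library. This is not the CHALIS generation code: the function names and the five-entry Latin-to-Cyrillic table are illustrative assumptions, and a real homoglyph attack would use a much larger confusables list.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFD), drop combining marks, then recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

# Hypothetical mini-table: Latin letters mapped to look-alike Cyrillic letters.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic a
    "e": "\u0435",  # Cyrillic ie
    "o": "\u043e",  # Cyrillic o
    "p": "\u0440",  # Cyrillic er
    "c": "\u0441",  # Cyrillic es
}

def homoglyph_attack(text, mapping=HOMOGLYPHS):
    """Replace every mapped character with its visually similar counterpart."""
    return "".join(mapping.get(ch, ch) for ch in text)

czech = "Příliš žluťoučký kůň"
plain = strip_diacritics(czech)
spoofed = homoglyph_attack("cope")
```

Letters without a canonical decomposition, such as the Danish and Norwegian "ø", pass through `strip_diacritics` unchanged, so a fuller pipeline would need an explicit mapping for them.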

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper presents CHALIS, a dataset for Language Identification focusing on cousin languages and orthographic noise. The provided keywords relate to Multimodal Large Language Models, World Models, and Model-Based Reinforcement Learning. The paper is text-only, does not involve visual encoders, world modeling, reinforcement learning, or multimodal unification, resulting in zero relevance for all keywords.

Keywords

Language Identification, CHALIS, Cousin Languages, Orthographic Noise, Benchmark Dataset, Text Classification, Difficult Scenarios

273. English-to-Prakrit Machine Translation via Multilingual Transfer Learning [FAIL]

Score: 0.0 / 27.8

Authors: Om Choksi, Smit Kareliya, Shrikant Malviya, Pruthwik Mishra

Published: 2026-06-04

Abstract (Chinese translation)

我们在低资源场景下研究英 -Prakrit 机器翻译，其中目标语言未被 IndicTrans2 支持。我们通过将 Prakrit 映射到印地语语言标签（hin_Deva）来适配多语言模型，且不修改分词器、词汇表或架构。基于 1,474 对马哈拉施特拉语 (Prakrit) 平行语料库，并在包含 20 个样本的阿达马加迪语测试集上进行评估，我们报告了相对于未调优基线的语料库 BLEU 提升。结果表明，脚本兼容的语言路由可实现对不支持的古典语言的可行迁移，同时也凸显了因数据稀缺和方言不匹配所带来的局限性。我们的代码及训练模型已向公众发布，以供进一步探索：https://github.com/D3v1s0m/indictrans2-prakrit-mt。

Abstract

We study English-to-Prakrit machine translation in a low-resource setting where the target language is unsupported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi language tag (hin_Deva) without modifying the tokenizer, vocabulary, or architecture. Using a 1,474-pair Maharashtri Prakrit parallel corpus and evaluation on a 20-sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results indicate that script-compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting limitations due to data scarcity and dialect mismatch. Our code and trained models are released to the public for further exploration: https://github.com/D3v1s0m/indictrans2-prakrit-mt.
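The "language routing" idea in the abstract, reusing a supported tag for an unsupported language that shares its script, can be sketched as a simple lookup. Only `hin_Deva` comes from the abstract; the `pra_Deva` label, the supported-tag subset, and the "source tag, target tag, sentence" input layout are assumptions for illustration and are not taken from the authors' code.

```python
# Illustrative subset of tags the multilingual model is assumed to support.
SUPPORTED_TAGS = {"eng_Latn", "hin_Deva"}

# Hypothetical label for Prakrit, routed to the script-compatible Hindi tag.
ROUTING = {"pra_Deva": "hin_Deva"}


def route_tag(tag: str) -> str:
    """Return a tag the model supports, falling back to the routing table."""
    if tag in SUPPORTED_TAGS:
        return tag
    if tag in ROUTING:
        return ROUTING[tag]
    raise KeyError(f"no supported tag or route for {tag!r}")


def tag_source(sentence: str, src_tag: str, tgt_tag: str) -> str:
    """Prefix the sentence with routed tags (assumed input layout)."""
    return f"{route_tag(src_tag)} {route_tag(tgt_tag)} {sentence}"


if __name__ == "__main__":
    print(tag_source("Where is the river?", "eng_Latn", "pra_Deva"))
```

Because only the tag changes, the tokenizer, vocabulary, and architecture stay untouched, which matches the constraint stated in the abstract.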

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 135 (char 358)
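The failure message above has the shape of a Python `json.JSONDecodeError`, which typically appears when a model's JSON reply is missing a comma or contains an unescaped quote. Whether the report's scoring pipeline is written in Python is an assumption; the sketch below only shows how such an error arises and how a caller might record it instead of crashing. `parse_scores` and the sample payloads are hypothetical.

```python
import json


def parse_scores(raw: str):
    """Return (scores, error); malformed JSON is reported, not raised."""
    try:
        return json.loads(raw), None
    except json.JSONDecodeError as exc:
        return None, f"Scoring failed: {exc}"


good = '{"Tokenizer": 0.0, "MLLM": 0.0}'
bad = '{"Tokenizer": 0.0 "MLLM": 0.0}'  # comma missing between the two entries

if __name__ == "__main__":
    print(parse_scores(good))
    print(parse_scores(bad))
```

A caller can then retry the request or mark the paper as unscored, rather than reporting a score of 0.0 that is indistinguishable from a genuine zero.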

274. Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails [FAIL]

Score: 0.0 / 27.8

Authors: Marco Antonio Stranisci, A Pranav, Rossana Damiano, Christian Hardmeier, Anne Lauscher

Published: 2026-06-04

TL;DR: This paper audits pretraining filters and guardrails in language models, revealing that they disproportionately suppress marginalized groups through blocklist-based cues, leading to epistemic erasure.

Abstract (Chinese translation)

现代语言模型依赖预训练过滤器（pretraining filters）从训练语料库中移除不良内容，并依赖推理时护栏（inference-time guardrails）在部署期间抑制不良输出。在本文中，我们考察这些过滤与审核决策如何产生认知抹除（epistemic erasure）的形式，并揭示自动化系统之间以及这些系统与人类判断之间的张力。我们在包含性别和地区起源提及的 Common Crawl 句子，以及一个手动标注的 500 个句子子集上，对四个预训练过滤器和三个推理时护栏进行了审计。我们的分析表明，过滤与护栏决策强烈关联于基于黑名单的词汇线索，而经常未能标记包含私人信息或明确仇恨言论的内容。与此同时，边缘化群体（marginalized groups），特别是跨性别者、女性和中美洲人，在所有系统中被显著过度标记。相比之下，人工标注员会保留 88.5% 的过滤器标记内容和 91.3% 的护栏标记内容，通常能认识到因内容移除的张力而产生的表征伤害，而当前系统无法捕捉到这一点。综上所述，我们的发现记录了一种认知抹除的形式，其中边缘化群体的提及在预训练之前被不成比例地移除，并且在推理时再次被抑制。

Abstract

Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5% of filter-flagged and 91.3% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and suppressed again at inference time.
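The failure mode described in the abstract, decisions driven by lexical blocklist cues, can be illustrated with a toy filter. The blocklist, function name, and example sentences below are hypothetical and are not taken from any of the audited systems; the sketch only shows why a term-matching filter can flag a benign mention while missing content that contains no listed term.

```python
import re

# Hypothetical blocklist for illustration only; not from any audited filter.
BLOCKLIST = {"transgender"}


def is_flagged(sentence: str) -> bool:
    """Flag a sentence if any lowercase word token appears on the blocklist."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return any(tok in BLOCKLIST for tok in tokens)


benign_mention = "The clinic offers support groups for transgender patients."
private_info = "Her home address and phone number are listed below."

if __name__ == "__main__":
    print(is_flagged(benign_mention))  # matched on an identity term alone
    print(is_flagged(private_info))    # no listed term, so it passes
```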

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on ethical audits of pretraining filters and guardrails in text-based language models, addressing epistemic injustice and bias against marginalized groups. The provided keywords relate to multimodal architectures, world models, and reinforcement learning methodologies, showing no technical overlap with the paper's content or methodology. No expert authors from the specified list were found in the author list, so no bonus points were added. The calculated weighted total score is 0.0.

Keywords

Epistemic Injustice, Language Models, Pretraining Filters, Inference Guardrails, Marginalized Groups, Content Moderation, Bias Audit

275. Analysis of the Neglect-Zero Effect in Large Language Models [FAIL]

Score: 0.0 / 27.8

Authors: Jin Tanaka, Daiki Matsuoka, Ryoma Kumon, Hitomi Yanaka

Published: 2026-06-04

TL;DR: This study investigates whether Large Language Models exhibit the human cognitive neglect-zero effect regarding vacuous truths, finding that LLMs do not appear to show this bias under structural priming conditions.

Abstract (Chinese translation)

我们探究了大语言模型（LLMs）的语言处理在多大程度上类似于人类认知过程，重点聚焦于一种被称为 neglect-zero effect（忽略零效应）的人类认知偏差。该效应指的是人类倾向于忽略 zero-models（零模型），即那些因空集而使命题空洞为真的构型。我们关注由 neglect-zero effect 驱动的两种推理类型，并通过比较 LLMs 在不涉及 neglect-zero effect 的推理中的行为，来考察 LLMs 如何处理这些推理。为此，我们采用了一种基于 structural priming（结构启动）的范式，其中由于结构相似性，先前句子（prime，启动句）的近期暴露有助于后续句子（target，目标句）的处理。我们准备 primes 以迫使 LLMs 考虑 zero-model，并分析它们在 target 中是否也考虑了它。结果表明，neglect-zero effect 可能未出现在本研究分析的 LLMs 中。我们的代码可在 https://github.com/ynklab/neglect_zero 获取。

Abstract

We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the neglect-zero effect. This effect refers to the human tendency to ignore zero-models, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect-zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect-zero effect. For this purpose, we employ a paradigm based on structural priming, where recent exposure to a preceding sentence (the prime) facilitates the processing of a subsequent sentence (the target) due to their structural similarity. We prepare primes to force LLMs to consider the zero-model, and analyze whether they also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs analyzed in this study. Our code is available at https://github.com/ynklab/neglect_zero
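A zero-model, as defined in the abstract, makes a universally quantified sentence true because the quantified set is empty. The short sketch below shows this with a made-up scene representation; the sentence "every square is red" and the dictionary layout are assumptions chosen for illustration, not the paper's stimuli.

```python
def every_square_is_red(shapes):
    """Evaluate 'every square is red' over a list of {'kind', 'color'} dicts."""
    squares = [s for s in shapes if s["kind"] == "square"]
    # all() over an empty iterable is True: the zero-model case.
    return all(s["color"] == "red" for s in squares)


zero_model = [{"kind": "circle", "color": "blue"}]        # no squares at all
verifying_model = [{"kind": "square", "color": "red"}]    # a red square exists
falsifying_model = [{"kind": "square", "color": "blue"}]  # a non-red square

if __name__ == "__main__":
    for scene in (zero_model, verifying_model, falsifying_model):
        print(every_square_is_red(scene))
```

Logic treats the zero-model and the verifying model alike, whereas the neglect-zero effect describes the human tendency to overlook the former.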

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper investigates the 'neglect-zero effect' (a cognitive bias regarding vacuous truths) in text-based Large Language Models using a structural priming paradigm. The provided keywords focus on multimodal architecture (Visual Encoder, MultiModal, MLLM), tokenization, model unification, and reinforcement learning/world models. There is no overlap in technical domain or methodology between the paper's content (NLP/Cognition analysis) and the keywords (Multimodal/RL/Architecture), resulting in zero relevance for all keywords. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are listed in the author section.

Keywords

Large Language Models, Neglect-Zero Effect, Cognitive Bias, Structural Priming, Vacuous Truth, Language Processing, Inference

276. Geodesic Flow Matching on a Riemannian Degradation Manifold for Blind Image Restoration [FAIL]

Score: 0.0 / 27.8

Authors: Akshay Janardan Bankar, Ankita Chatterjee, Sayan Banerjee, Shreyas Pandith, Kalakonda Sai Shashank, Amit Satish Unde

Published: 2026-06-04

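TL;DR: This paper proposes a blind image restoration method based on geodesic flow matching on a Riemannian manifold. It models degradations as points on the manifold and restoration as geodesic transport, which gives a principled treatment of mixed degradations.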

Abstract (Chinese translation)

盲图像恢复要求从受未知且可能混合的退化污染的观测中恢复清晰图像。尽管近期基于流的确定性方法将恢复建模为将退化图像映射到清晰图像的传输过程，但它们通常依赖欧几里得插值（Euclidean interpolation），隐含地假设线性退化几何。在本文中，我们将退化显式建模为低维黎曼流形（Riemannian manifold）上的点，并将恢复表述为联合图像 - 流形空间上的测地线传输（geodesic transport）。利用测地线流匹配（geodesic flow matching）目标，我们学习尊重退化空间曲率的内在传输动力学。该框架推广了线性流匹配（linear flow matching），提供了将混合退化视为测地线组合的合理处理，并给出了超越观测退化的泛化的清晰理论解释。

Abstract

Blind image restoration requires recovering clean images from observations corrupted by unknown and potentially mixed degradations. While recent deterministic flow-based methods model restoration as transport processes that map degraded images to clean ones, they typically rely on Euclidean interpolation, implicitly assuming linear degradation geometry. In this paper, we explicitly model degradations as points on a low-dimensional Riemannian manifold and formulate restoration as geodesic transport on the joint image-manifold space. Using a geodesic flow matching objective, we learn intrinsic transport dynamics that respect the curvature of degradation space. This framework generalizes linear flow matching, provides a principled treatment of mixed degradations as geodesic compositions, and yields a clean theoretical interpretation for generalization beyond observed degradations.
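The contrast between Euclidean interpolation and geodesic transport can be made concrete on a manifold where geodesics have a closed form. The sketch below uses the unit sphere purely as a stand-in: the paper's degradation manifold and its metric are not specified in the abstract, so the choice of sphere and the function names are assumptions.

```python
import math


def lerp(p, q, t):
    """Euclidean (straight-line) interpolation between p and q."""
    return [(1 - t) * a + t * b for a, b in zip(p, q)]


def slerp(p, q, t):
    """Geodesic (great-circle) interpolation between unit vectors p and q."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    omega = math.acos(dot)
    s = math.sin(omega)
    if s < 1e-8:
        # Nearly identical or antipodal points: the formula is ill-conditioned.
        return lerp(p, q, t)
    w0 = math.sin((1 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(p, q)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


if __name__ == "__main__":
    p, q = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
    print(norm(lerp(p, q, 0.5)))   # midpoint leaves the sphere
    print(norm(slerp(p, q, 0.5)))  # midpoint stays on the sphere
```

The straight-line midpoint cuts through the interior of the sphere, while the geodesic midpoint stays on the manifold; this is the kind of curvature-respecting path a geodesic flow matching objective is meant to follow.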

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

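Scoring rationale: The paper studies blind image restoration using Riemannian manifold geometry and flow matching, which falls under computer vision. The given keywords (MLLM, Tokenizer, World Models, RL, etc.) focus on multimodal large models and reinforcement learning architectures and have no substantive connection to the paper's topic or methodology, so every relevance score is 0. None of the specified experts appear in the author list.

Keywords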

Blind Image Restoration, Geodesic Flow Matching, Riemannian Manifold, Degradation Modeling, Transport Process, Mixed Degradations, Image Restoration

277. RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning [FAIL]

Score: 0.0 / 27.8

Authors: Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki

Published: 2026-06-04

TL;DR: The paper proposes RadiusFPS, an efficient Farthest Point Sampling algorithm using spherical voxel pruning to reduce latency in robotic point cloud perception pipelines.

Abstract (Chinese translation)

点云是机器人感知的主要感官表示，支撑着基于激光雷达的自动驾驶、同步定位与建图（SLAM）以及导航任务。在这些处理流程中，最远点采样（FPS）是最知名的下采样算子，因其均匀覆盖特性保留了下游感知所依赖的几何结构。然而，经典 FPS 的时间复杂度较高，难以随现代 3D 传感器每秒百万点速率的增长而良好扩展，使其成为主要的延迟瓶颈，这与机器人系统的实时性要求及有限的机载计算预算相冲突。因此，我们提出 RadiusFPS，这是一种基于球形体素剪枝的 FPS 加速框架，它在相同的初始化和平局打破策略下保持标准的 FPS 更新规则。通过利用球形体素对点云进行索引，RadiusFPS 推导出一个保守的几何界限，从而在每个迭代中剪枝冗余的距离计算；此外，还辅以一种坐标级点跳过测试，以消除残余更新。我们进一步引入了 RadiusFPS-G，这是一种线程束级 GPU 实现，它将体素选择、剪枝和距离更新融合为内存合并内核，从而消除了昂贵的全局内存往返开销。在室内（S3DIS, ScanNet）和室外激光雷达（SemanticKITTI）基准测试上，RadiusFPS-G 相较于基于 GPU 的 FPS 实现了高达 2.5 倍的加速，且在所评估的方法中与 QuickFPS 相当或更优，同时仅使用 QuickFPS 约一半的 GPU 内存，并保持相当的分割精度。当与基于学习的 FastPoint 采样器结合时，所得到的处理流程在所有评估配置中实现了最快的端到端推理。这些特性使得高质量的 FPS 风格采样在延迟和内存受限的机器人视觉应用中变得切实可行。

Abstract

Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the best-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the large time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest end-to-end inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
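For reference, the classical FPS update rule that RadiusFPS is said to preserve can be written in a few lines. This is the textbook baseline, not the RadiusFPS algorithm: it contains no voxel index or pruning, and the lowest-index tie-breaking shown here is an assumption.

```python
def farthest_point_sampling(points, k, start=0):
    """Classical FPS: repeatedly pick the point farthest from the chosen set.

    Returns the indices of the k selected points. Ties go to the lowest index.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(points)")
    # Squared distance from each point to its nearest selected point so far.
    min_d2 = [float("inf")] * n
    selected = [start]
    for _ in range(k - 1):
        last = points[selected[-1]]
        best_i, best_d2 = -1, -1.0
        for i, p in enumerate(points):
            d2 = sum((a - b) ** 2 for a, b in zip(p, last))
            if d2 < min_d2[i]:
                min_d2[i] = d2
            if min_d2[i] > best_d2:
                best_i, best_d2 = i, min_d2[i]
        selected.append(best_i)
    return selected


if __name__ == "__main__":
    cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
    print(farthest_point_sampling(cloud, 3))
```

Every iteration touches all n points, giving O(nk) distance evaluations; the pruning described in the abstract targets the many updates in this inner loop that cannot change `min_d2`.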

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

Scoring rationale: The paper focuses on accelerating Farthest Point Sampling (FPS) for point cloud perception using spherical voxel pruning, which is a geometric algorithm optimization task. The provided keywords relate to Multimodal Large Language Models, World Models, and Unified Architectures, showing no thematic overlap with the paper's content. None of the specified expert authors are listed.

Keywords

RadiusFPS, Farthest Point Sampling, Point Clouds, Spherical Voxel Pruning, GPU Implementation, Robotic Perception, Latency Reduction

278. SC-MFJ: A Simple Haptic Quality Metric for Medical Image Segmentation [FAIL]

Score: 0.0 / 27.8

Authors: Souraj Adhikary, Negar Chabi, Andre Mastmeyer

Published: 2026-06-04

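TL;DR: This paper proposes SC-MFJ, a new metric for the haptic quality of medical segmentation surfaces, and finds that simple post-processing such as Gaussian smoothing substantially improves haptic quality for simulation without retraining the model.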

Abstract (Chinese translation)

标准的分割指标（如 Dice 和豪斯多夫距离）仅衡量几何重叠，却无法说明分割表面是否适用于手术模拟中的力觉渲染。我们提出 SC-MFJ（Surface-Constrained Mean Force Jerk，表面约束平均力急动），这是一种简单且廉价的指标，它通过多次短虚拟探针路径采样分割的器官表面，并测量由此产生的接触力的急动程度。该指标基于现有的分割输出进行计算，每个案例仅需约一分钟的 CPU 时间。我们在 80 个案例的五折交叉验证中评估了三种胰腺 CT 分割方法：二值化 nnU-Net 输出、高斯平滑输出以及学习得到的有符号距离函数（SDF）回归。SC-MFJ 揭示了原始二值基线与简单高斯后处理之间在力觉质量上存在 147 倍的差距，而这一差异在 Dice 和 HD95 指标下完全不可见。此外，它还表明，尽管需要重新训练整个模型，学习得到的 SDF 回归产生的力觉质量波动比高斯平滑更大，其案例级标准差为 168 N/s²，而高斯平滑仅为 22 N/s²。在 LiTS 肝脏数据集（131 个案例）上的第二次评估证实了这些发现的普遍性：二值化到高斯化的差距扩大至 189 倍，且高斯平滑再次在所有折中均产生一致的低位力急动。我们的结果表明，对于力觉模拟应用，可能仅需单行后处理步骤即可满足需求，且像 SC-MFJ 这样廉价的指标能够揭示几何指标所遗漏的问题。

Abstract

Standard segmentation metrics such as Dice and Hausdorff distance measure geometric overlap but say nothing about whether a segmented surface is suitable for haptic rendering in surgical simulation. We propose SC-MFJ (Surface-Constrained Mean Force Jerk), a simple, inexpensive metric that samples a segmented organ surface with many short virtual stylus walks and measures how jerky the resulting contact forces are. The metric is computed from existing segmentation outputs and uses roughly one minute of CPU time per case. We evaluate three pancreas CT segmentation approaches across 80 cases in five-fold cross-validation: binary nnU-Net output, Gaussian-smoothed output, and learned signed distance function (SDF) regression. SC-MFJ reveals a 147x gap in haptic quality between the raw binary baseline and simple Gaussian post-processing, a difference entirely invisible to Dice and HD95. It also shows that learned SDF regression, despite requiring full model retraining, produces more variable haptic quality than Gaussian smoothing, with a case-level standard deviation of 168 N/s² compared with 22 N/s² for Gaussian. A second evaluation on the LiTS liver dataset (131 cases) confirms the generality of these findings: the binary-to-Gaussian gap widens to 189x, and Gaussian smoothing again produces consistently low force jerk across all folds. Our results suggest that for haptic simulation applications, a one-line post-processing step may be sufficient, and that a cheap metric like SC-MFJ can flag problems that geometric metrics miss.
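The abstract does not give the exact formula for SC-MFJ, so the sketch below only illustrates the general notion of "force jerk" under one assumption: jerk is taken as the mean absolute second finite difference of a sampled contact-force trace, which has units of N/s². The function name, the sampling interval, and both force traces are hypothetical.

```python
def mean_force_jerk(forces, dt):
    """Mean absolute second finite difference of a force trace (N/s^2).

    forces: contact-force samples in newtons, taken every dt seconds.
    """
    if len(forces) < 3:
        raise ValueError("need at least three force samples")
    second_diffs = [
        abs(forces[i + 1] - 2.0 * forces[i] + forces[i - 1]) / dt ** 2
        for i in range(1, len(forces) - 1)
    ]
    return sum(second_diffs) / len(second_diffs)


# Hypothetical stylus walks: a staircase-like trace (as a voxelized binary
# surface might produce) versus a smooth ramp.
stepped = [0.0, 0.0, 0.2, 0.2, 0.4]
smooth = [0.0, 0.1, 0.2, 0.3, 0.4]

if __name__ == "__main__":
    dt = 0.01
    print(mean_force_jerk(stepped, dt) > mean_force_jerk(smooth, dt))
```

Both traces start and end at the same force, so an overlap-style comparison of endpoints would not separate them, while the jerk measure does.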

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

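Scoring rationale: The paper focuses on a haptic quality metric (SC-MFJ) for medical image segmentation, mainly discussing force-jerk measurement of segmented surfaces in surgical simulation and post-processing. The provided keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) belongs to multimodal large models, representation learning, and reinforcement learning, and is unrelated to the paper's medical-imaging evaluation topic. In addition, the author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords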

Medical Image Segmentation, Haptic Quality Metric, Surgical Simulation, Surface-Constrained Mean Force Jerk, Segmentation Evaluation, Gaussian Smoothing, Signed Distance Function

279. VZCrash: A Large-Scale IMU Dataset of Ego-Vehicle Crashes [FAIL]

Score: 0.0 / 27.8

Authors: Tommaso Bianconcini, Henrique Piñeiro Monteagudo, Aurel Pjetri, Tomaso Trinci, Leonardo Taccari

Published: 2026-06-04

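TL;DR: VZCrash introduces a large-scale IMU dataset for vehicle crash detection and shows that data scale is critical for training high-quality deep learning models that are deployed in real-world environments.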

Abstract (Chinese translation)

我们介绍 VZCrash，这是目前最大的公开可用真实世界车辆碰撞数据集，包含惯性测量单元（IMU）遥测数据。该数据集包含超过 31,000 个验证过的碰撞事件和 158,000 个负样本（negative samples），其中包括困难案例（hard cases）和干扰项（distractors）。每个样本包含 100 Hz 频率下的加速度和角速度，以及 1 Hz 频率下的 GPS 速度。VZCrash 中的事件是由安装在行驶于美国的 73,010 辆不同尺寸商用车辆（commercial vehicles）上的设备捕获的，数据收集跨越了数年。此外，我们还呈现了一项得益于该数据集规模（volume）的广泛实验研究。我们首先对多种方法进行了基准测试（benchmark），范围从简单的基于阈值的启发式方法（threshold-based heuristic）到最先进的深度学习模型（deep learning models）。随后，我们展示了一个实验，证明了数据规模（scaling data）对于训练高质量碰撞检测模型（crash detection models）的重要性，并表明当这些模型需要部署（deployed）到真实世界环境（real-world environment）中时，规模尤为关键。

Abstract

We introduce VZCrash, the largest publicly available dataset of real-world vehicle collision data featuring Inertial Measurement Unit (IMU) telemetry. The dataset contains more than 31,000 validated crashes and 158,000 negative samples, including hard cases and distractors. Each sample includes acceleration and angular velocity at 100 Hz, and GPS speed at 1 Hz. Events in VZCrash were captured by devices installed on a fleet of 73,010 commercial vehicles of different sizes driving in the United States over the span of several years. We also present an extensive experimental study enabled by the volume of the dataset. We first benchmark several different approaches, from a simple threshold-based heuristic to state-of-the-art deep learning models. Then, we present an experiment demonstrating the importance of scaling data to train high-quality crash detection models, and we show that scale is especially important when these models need to be deployed into a real-world environment.
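The simplest baseline family mentioned in the abstract, a threshold-based heuristic, might look like the sketch below. The 100 Hz rate comes from the abstract; the threshold, the minimum run length, the m/s² input units, and the function name are assumptions for illustration and are not the benchmark's actual settings.

```python
import math

SAMPLE_RATE_HZ = 100  # accelerometer rate stated in the abstract
G = 9.80665           # standard gravity in m/s^2


def detect_crash(accel_xyz, threshold_g=4.0, min_samples=3):
    """Flag a window if |a| exceeds threshold_g for min_samples in a row.

    accel_xyz: iterable of (ax, ay, az) in m/s^2 (assumed units).
    threshold_g and min_samples are hypothetical tuning values.
    """
    run = 0
    for ax, ay, az in accel_xyz:
        magnitude_g = math.sqrt(ax * ax + ay * ay + az * az) / G
        run = run + 1 if magnitude_g > threshold_g else 0
        if run >= min_samples:
            return True
    return False


if __name__ == "__main__":
    quiet = [(0.0, 0.0, 9.8)] * SAMPLE_RATE_HZ
    impact = quiet[:50] + [(60.0, 0.0, 9.8)] * 5 + quiet[55:]
    print(detect_crash(quiet), detect_crash(impact))
```

Requiring several consecutive samples is one way to ignore single-sample spikes, the sort of distractor that the dataset's hard negatives are meant to stress.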

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

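Scoring rationale: The paper focuses on releasing an IMU sensor dataset of vehicle crashes and on benchmarking crash detection; its core contribution concerns data scale and model performance in deployment. The provided keywords (such as Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, model-based RL) belong to multimodal large models and reinforcement learning and have no technical overlap with the paper's sensor data processing and classification task, so the relevance of every keyword is 0.

Keywords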

IMU Dataset, Vehicle Crashes, Ego-Vehicle, Deep Learning, Data Scale, Collision Detection, Telemetry

280. CamFlow+: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications [FAIL]

Score: 0.0 / 27.8

Authors: Haipeng Li, Zhen Liu, Zhanglei Yang, Hai Jiang, Tianhao Zhou, Zhengzhe Liu, Ping Tan, Bing Zeng, Shuaicheng Liu

Published: 2026-06-04

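TL;DR: CamFlow+ proposes a hybrid motion-basis framework that combines homography-derived bases with depth-translational bases to improve 2D camera motion estimation and, in turn, video stabilization.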

Abstract (Chinese translation)

估计 2D 相机运动是 computer vision 和 computational photography 的基础。现有的基于 homography 的方法在平面场景或纯旋转下表现良好，但在相机平移、深度变化和局部 parallax 方面存在困难；局部 homography 和基于 mesh 的模型提高了灵活性，但仍依赖于分片平面假设。我们提出了 CamFlow+，一种混合基框架，直接在 dense-flow 空间中表示 2D 相机运动。CamFlow+ 结合了由 homography 导出的物理基、从 homography flow 中采样的随机基以及由深度和 camera intrinsics 导出的深度 - 平移基，在放松单平面约束的同时保持相机运动的规律性。一个感知深度的平滑项进一步正则化连续深度区域中由平移引起的 parallax，同时保持深度边界附近的运动变化。我们在 GHOF-Cam 上评估了 CamFlow+，这是一个相机运动基准，它在 optical-flow 基准中屏蔽了动态物体和病态遮挡区域，以隔离相机引起的运动。实验表明，CamFlow+ 改善了稀疏和稠密的相机运动估计。在数字视频稳定化中，CamFlow+ 也提高了全局和局部稳定性，在盲测用户研究中获得了最高的 top-1 偏好率。代码和数据集将在项目页面上提供：https://lhaippp.github.io/CamFlow+。

Abstract

Estimating 2D camera motion is fundamental to computer vision and computational photography. Existing homography-based methods work well for planar scenes or pure rotation, but struggle with camera translation, depth variation, and local parallax; local homography and mesh-based models improve flexibility but still rely on piecewise planar assumptions. We introduce CamFlow+, a hybrid-basis framework that represents 2D camera motion directly in dense-flow space. CamFlow+ combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. A depth-aware smoothness term further regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries. We evaluate CamFlow+ on GHOF-Cam, a camera-motion benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion. Experiments show that CamFlow+ improves sparse and dense camera-motion estimation. In digital video stabilization, CamFlow+ also improves global and local stability, achieving the best top-1 preference rate in a blind user study. Code and datasets will be available on the project page: https://lhaippp.github.io/CamFlow+.
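To make "representing camera motion in dense-flow space" concrete, the sketch below converts a 3x3 homography into a per-pixel flow field and forms a weighted sum of such fields. The homography-to-flow conversion is standard; treating the final motion as a linear combination of basis flows is an assumption about how the bases are used, and the function names are hypothetical rather than taken from the CamFlow+ code.

```python
def homography_flow(H, width, height):
    """Dense flow induced by homography H: flow(x, y) = project(H, x, y) - (x, y)."""
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            xh = H[0][0] * x + H[0][1] * y + H[0][2]
            yh = H[1][0] * x + H[1][1] * y + H[1][2]
            w = H[2][0] * x + H[2][1] * y + H[2][2]
            row.append((xh / w - x, yh / w - y))
        flow.append(row)
    return flow


def combine_bases(bases, weights):
    """Weighted sum of flow fields with identical shape (assumed combination rule)."""
    height, width = len(bases[0]), len(bases[0][0])
    return [
        [
            (
                sum(w * b[y][x][0] for w, b in zip(weights, bases)),
                sum(w * b[y][x][1] for w, b in zip(weights, bases)),
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


if __name__ == "__main__":
    shift = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]]
    basis = homography_flow(shift, width=3, height=2)
    print(combine_bases([basis], [0.5])[0][0])
```

A single homography yields a flow that is exact only for a planar scene or pure rotation, which is why the abstract adds depth-translational bases to account for parallax.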

Scoring details

Keyword	Weight	Relevance
Unify Models	1.5	0.0/10
Tokenizer	1.5	0.0/10
Visual Encoder	1.5	0.0/10
World Models	1.5	0.0/10
MLLM	1.5	0.0/10
MultiModal	1.5	0.0/10
model-based RL	1.5	0.0/10

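Scoring rationale: The paper belongs to computer vision and focuses on geometric methods for 2D camera motion estimation and video stabilization (hybrid motion bases, homographies, etc.). It is unrelated to the provided keywords (multimodal large models, reinforcement learning, world models, etc.), so all relevance scores are 0. None of the specified experts appear in the author list.

Keywords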

2D Camera Motion Estimation, Hybrid Motion Bases, Video Stabilization, Homography, Depth-Translational, Optical Flow, Geometric Modeling

Token usage: 4,317,748 tokens (input 511,132 / output 3,806,616)