arXiv Daily Report 2026-06-02

ArXiv Research Report

Generated: 2026-06-02 21:42:29 | Passing score: 27.8

Total: 222 | Qualified: 48 | Analyzed: 48 | Pass Rate: 22%

Papers

Score: 88.5 / 27.8
Authors: Ziyang Yao, Zeyu Zhu, YunCheng Jiang, Zibin Guo, Huijing Zhao
Published: 2026-06-01
TL;DR: This paper proposes a representation- and geometry-guided discrete tokenizer for driving world models that aligns with semantic features and geometric cues to improve planning performance and generative quality compared to standard image-generation tokenizers.

Abstract

Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.

Scoring details
Keyword | Weight | Relevance | Score
Unify Models | 1.5 | 9.0/10 | 13.5
Tokenizer | 1.5 | 10.0/10 | 15.0
Visual Encoder | 1.5 | 8.0/10 | 12.0
World Models | 1.5 | 10.0/10 | 15.0
MLLM | 1.5 | 7.0/10 | 10.5
MultiModal | 1.5 | 8.0/10 | 12.0
model-based RL | 1.5 | 7.0/10 | 10.5

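Scoring rationale: The paper's title emphasizes "Unified", which matches Unify Models closely; the core contribution is a discrete tokenizer, so Tokenizer receives full marks; aligning with DINO features involves a Visual Encoder; the paper explicitly builds world models with a GPT-style architecture, so World Models receives full marks; the GPT-style architecture resembles the MLLM technology stack, so MLLM is relevant; fusing visual and geometric information counts as MultiModal; and combining a world model with planning fits the model-based RL paradigm. The author list does not include Yang Shi or the other specified experts.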
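The per-keyword scores above are weight times relevance, and their sum is the paper's total score, which is then compared with the passing score in the report header. The sketch below reproduces that arithmetic for this paper. The function name `weighted_score` and the `>=` comparison are assumptions for illustration; the report generator's actual code is not shown in this report.

```python
# Hypothetical reconstruction of the scoring arithmetic in the table above.
# The real report generator is not shown; names and the >= rule are assumed.

def weighted_score(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

PASSING_SCORE = 27.8  # taken from the report header

rows = [
    ("Unify Models", 1.5, 9.0),
    ("Tokenizer", 1.5, 10.0),
    ("Visual Encoder", 1.5, 8.0),
    ("World Models", 1.5, 10.0),
    ("MLLM", 1.5, 7.0),
    ("MultiModal", 1.5, 8.0),
    ("model-based RL", 1.5, 7.0),
]

total = weighted_score(rows)
qualified = total >= PASSING_SCORE  # assumed pass rule
print(total, qualified)  # prints: 88.5 True
```

The same function applied to the other papers' tables reproduces their listed totals (69.0, 63.0, 63.0, 61.5).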

Keywords

Unified Driving Tokens, Discrete Tokenizer, World Models, Representation Learning, Geometry-Guided, Autonomous Driving, Planning

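In-Depth Analysis

Title (translated from Chinese): Unified Driving Tokens: A Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning

Summary: This paper proposes a unified discrete-token learning framework for autonomous driving, so that the same discrete tokens serve both token-based world-model generation and planning. Existing tokenizers are mostly inherited from image generation and optimized for pixel reconstruction, which leaves the tokens short of the information that planning needs. The authors design a representation-guided, geometry-enhanced discrete tokenizer: the encoder fuses frozen DINO features with an RGB detail branch, the decoder jointly reconstructs RGB and DINO features to preserve semantics, adjacent-frame depth and relative-pose supervision injects geometric cues, and multi-codebook quantization stabilizes joint training. On the NAVSIM dataset, the tokenizer improves reconstruction fidelity and representation consistency, achieves competitive planning performance under a fixed decoder, and yields better generation quality under matched settings.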

Innovations:

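  • Proposes a unified discrete-token learning framework that treats tokens as a shared interface serving both world-model generation and planning.
  • Designs a semantic-representation-guided discrete VQ tokenizer that aligns with a strong pretrained representation while preserving appearance fidelity, using frozen representation encoding, an explicit detail pathway, and joint RGB/semantic-feature reconstruction.
  • Introduces temporal depth and ego-motion supervision to make the tokens geometry-aware, combined with multi-codebook quantization for balanced joint training of appearance, semantics, and geometry.
  • Validates the learned discrete interface with a lightweight planning decoder and an autoregressive next-token Transformer world model, reporting planning benefits and improved generation quality.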

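Methodology: The tokenizer is built on a VQ-VAE architecture. The encoder concatenates frozen DINO features with an RGB detail branch, encodes them with a Transformer, and produces discrete tokens through multi-codebook quantization. The decoder reconstructs RGB and DINO features and additionally decodes adjacent-frame depth and relative pose to impose geometric supervision. The training objective comprises a pixel reconstruction loss, a perceptual loss, a semantic alignment loss, an adversarial loss, and a quantization regularizer. Downstream evaluation uses two consumers of the frozen tokens: a lightweight planning decoder that predicts trajectories, and an autoregressive Transformer world model that performs next-token prediction over token sequences and is evaluated on generation quality.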
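To make the multi-codebook quantization step concrete, here is a minimal sketch. The report does not say how the paper arranges its codebooks, so this assumes one common variant: the latent vector is split into equal chunks, and each chunk is snapped to the nearest entry of its own codebook. All names and toy values are hypothetical.

```python
# Illustrative multi-codebook vector quantization (assumed variant: the
# latent is split into equal chunks, one codebook per chunk). This is not
# the paper's implementation, which the report does not describe.

def nearest_code(chunk, codebook):
    """Index of the codebook entry closest to chunk (squared Euclidean)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(chunk, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def multi_codebook_quantize(z, codebooks):
    """Quantize each chunk of z with its own codebook.

    Returns (token indices, quantized vector).
    """
    chunk_len = len(z) // len(codebooks)
    tokens, z_q = [], []
    for g, codebook in enumerate(codebooks):
        chunk = z[g * chunk_len:(g + 1) * chunk_len]
        k = nearest_code(chunk, codebook)
        tokens.append(k)          # one discrete token per codebook
        z_q.extend(codebook[k])   # quantized chunk replaces the original
    return tokens, z_q

# Two toy codebooks with two 2-D codes each (made-up values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[-1.0, 0.0], [0.0, 2.0]],
]
tokens, z_q = multi_codebook_quantize([0.9, 0.8, -0.8, 0.2], codebooks)
print(tokens, z_q)  # prints: [1, 0] [1.0, 1.0, -1.0, 0.0]
```

Each latent position then yields one token per codebook, so the representable combinations grow multiplicatively while each individual codebook stays small.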

Key Results:

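  • On the NAVSIM dataset, the proposed tokenizer outperforms baselines (such as VQGAN) on RGB reconstruction and DINO feature reconstruction.
  • A lightweight planner using the same frozen tokens achieves competitive results on planning metrics (such as collision rate and displacement error).
  • Under matched settings, an autoregressive world model built on the proposed tokens achieves better generation quality (FID and similar metrics) than baselines.
  • Multi-codebook quantization mitigates codebook collapse and low codebook utilization under joint supervision.

Tech Stack:

  • VQ-VAE (vector-quantized variational autoencoder)
  • DINO (self-supervised Vision Transformer features)
  • Transformer encoder (pre-norm, RoPE positional encoding)
  • LPIPS perceptual loss
  • GAN adversarial training (discriminator)
  • Exponential moving average (EMA) codebook updates
  • Dead-code reinitialization and orthogonal regularization
  • Multi-codebook quantization
  • Autoregressive next-token prediction (GPT-style)

Strengths:

  • Treats usability for planning as a first-class objective of token learning rather than an after-the-fact evaluation, addressing the gap between existing tokenizers and driving decisions.
  • Joint semantic, appearance, and geometric supervision gives the tokens rich information, while multi-codebook quantization stabilizes training.
  • The experimental design is comprehensive, evaluating reconstruction, planning performance, and generation quality together to validate the unified interface.
  • The method is general and could extend to other downstream tasks that need discrete representations.

Limitations:

  • Reliance on frozen DINO features may limit how well the tokens adapt to driving-specific semantics (such as traffic rules and dynamic objects).
  • Geometric supervision uses only adjacent-frame depth and relative pose, without explicitly modeling long-horizon motion or scene flow.
  • Experiments are conducted only on NAVSIM; generalization needs validation on more real-world driving scenarios.
  • The lightweight planning decoder is structurally simple and may not fully exploit the planning information in the tokens.

Relevance To Keywords:

  • Unify Models: The paper proposes unified discrete tokens as a shared interface for world models and planning, consistent with the idea of model unification.
  • World Models: The tokens are used to train an autoregressive world model, tying the work directly to world-model construction.
  • Representation Learning: Learning compact discrete representations through semantic alignment and geometric supervision is representation learning.
  • Model-Based RL: The planning decoder makes decisions from the same tokens the world model is trained on, which can be viewed as a component of model-based reinforcement learning.
  • Native multimodal large models: Discrete tokens can be viewed as a discretized multimodal (vision + geometry) representation, compatible with the input format of multimodal large models.
  • Unified understanding and generation in multimodal large models: The tokens serve both reconstruction (generation) and planning (understanding), reflecting unified understanding and generation.
  • Post-training: After the tokenizer is pretrained, the downstream planner and world model can be viewed as a post-training stage.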
Score: 69.0 / 27.8
Authors: Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang
Published: 2026-06-01
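TL;DR: WALL-WM proposes an event-grounded Vision-Language-Action pretraining framework that unifies video and action modeling at the level of semantic events and achieves state-of-the-art generalization on real-world tasks.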

Abstract

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.

Scoring details
Keyword | Weight | Relevance | Score
Unify Models | 1.5 | 8.0/10 | 12.0
Tokenizer | 1.5 | 2.0/10 | 3.0
Visual Encoder | 1.5 | 5.0/10 | 7.5
World Models | 1.5 | 9.0/10 | 13.5
MLLM | 1.5 | 7.0/10 | 10.5
MultiModal | 1.5 | 8.0/10 | 12.0
model-based RL | 1.5 | 7.0/10 | 10.5

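Scoring rationale: The paper's core contribution, WALL-WM (a World Action Model), falls under World Models and is highly relevant; integrating vision, language, and action places it under MultiModal and Unify Models, with high relevance; it involves a VLM, so it relates to MLLM; it involves action modeling, so it relates to model-based RL; Tokenizer is not explicitly mentioned, and Visual Encoder is only an implicit component. The weighted total of 69.0 exceeds the passing score of 27.8. The author list does not include any of the specified experts.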

Keywords

World Action Model, Event-grounded, Vision-Language-Action, Pretraining, Generalization, Multi-modal, Inference modes

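In-Depth Analysis

Title (translated from Chinese): WALL-WM: Carving World Action Modeling at Event Boundaries

Summary: WALL-WM is a World Action Model (WAM) that addresses the mismatch, in semantics and timescale, between fixed-length action chunks and the language, vision, and action modalities. Existing WAMs typically initialize from multimodal or video foundation models and then optimize fixed-length action chunks. But language describes semantic goals, vision presents continuous scene dynamics, and actions run at control-level timescales; this granularity mismatch leads models to learn only short-horizon correlations, weakening compositionality and long-horizon generalization. WALL-WM takes semantically coherent action events (such as grasping or placing) as the atomic unit of learning and performs event-level Vision-Language-Action (VLA) pretraining that preserves video priors. It builds an event-level data ecosystem (event captions, cluster-balanced sampling) and uses the Muon optimizer for large-scale pretraining. At inference it supports two complementary modes: an event mode, in which a VLM or a human proposes the next-event description and the model executes a variable-length action segment, and a unified mode, in which VLM Staircase Decoding produces event latent representations that guide fixed-length chunk prediction. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks and reaches state-of-the-art performance in large-scale real-world generalization evaluation.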

Innovations:

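  • Proposes an event-level VLA pretraining paradigm that replaces fixed-length action chunks with semantically coherent action events as the atomic unit of learning, addressing the granularity mismatch across modalities.
  • Designs two complementary inference modes (event mode and unified mode) that support variable-length execution and fixed-length chunk prediction from the same event-pretrained backbone.
  • Introduces Staircase Decoding, which lets the VLM produce event-structured latent reasoning that guides fixed-length chunk prediction while preserving a gradient-continuous VLA path.
  • Builds an event-level data ecosystem, including event-level caption generation and cluster-balanced sampling, to support scalable learning over diverse behaviors, scenes, and task structures.
  • Uses the Muon optimizer for large-scale pretraining, providing practical infrastructure for scaling up WAMs.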

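Methodology: WALL-WM's method centers on event-level VLA pretraining. It first defines an action-semantic event: a temporally coherent, executable behavior segment (such as grasping or placing) that is nameable in language, observable in video, and realizable in action. During pretraining, event captions, event videos, and event actions are paired to train a video-action denoiser that inherits the priors of the video foundation model (the inductive bias from caption to video) while gaining controllability and causality. Inference has two modes. In event mode, a VLM or a human proposes the next-event description and the model executes the corresponding variable-length video-action segment. In unified mode, the VLM uses Staircase Decoding to produce an event latent representation that conditions fixed-length chunk prediction, preserving compatibility with conventional deployment. On the data side, a large-scale event dataset is built through event-level caption generation and cluster-balanced sampling. Training uses the Muon optimizer and supports large-scale parallel pretraining.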
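The report does not describe how cluster-balanced sampling is implemented. One common reading, sketched here as an assumption, is to pick a cluster uniformly at random and then pick an item uniformly within it, so that rare behaviors are not drowned out by frequent ones. The event labels and pool sizes are hypothetical.

```python
# Illustrative cluster-balanced sampling (one possible reading, not the
# paper's code): choose a cluster uniformly, then an item inside it.
import random
from collections import defaultdict

def cluster_balanced_sample(items, cluster_of, n, rng):
    """Draw n items with equal probability mass per cluster."""
    by_cluster = defaultdict(list)
    for item in items:
        by_cluster[cluster_of(item)].append(item)
    clusters = sorted(by_cluster)  # fixed order for reproducibility
    return [rng.choice(by_cluster[rng.choice(clusters)]) for _ in range(n)]

# Toy event pool: 90 "pick" events and 10 "pour" events (made-up labels).
events = [("pick", i) for i in range(90)] + [("pour", i) for i in range(10)]
rng = random.Random(0)
batch = cluster_balanced_sample(events, lambda e: e[0], 1000, rng)
pour_share = sum(1 for e in batch if e[0] == "pour") / len(batch)
print(round(pour_share, 2))
```

Under uniform sampling over items, "pour" would make up about 10% of a batch; with this scheme its share is close to one half.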

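Key Results: WALL-WM reaches state-of-the-art performance in large-scale real-world generalization evaluation, with clear advantages in both manipulation task performance and video generation metrics. Experiments show that it generalizes to reasoning, dexterous manipulation, and a variety of robot tasks, and supports physically plausible video prediction. Specifically, it surpasses existing VLA and WAM methods on multiple benchmark tasks and generalizes strongly across language instructions, scenes, and task structures.

Tech Stack:

  • Vision-language model (VLM)
  • Video foundation models
  • Diffusion / flow matching
  • Action denoiser
  • Staircase Decoding
  • Muon optimizer
  • Cluster-balanced sampling
  • Event-level captioning
  • Fixed-length chunk prediction
  • Variable-length execution

Strengths:

  • Tackles the granularity mismatch among the language, vision, and action modalities at its root, bringing VLA training closer to semantic structure.
  • Event-level pretraining preserves the priors of the video foundation model and keeps short-horizon correlations from overwriting useful structure.
  • Offers flexible inference modes that support variable-length event execution (more natural) while remaining compatible with conventional fixed-length chunk deployment.
  • Builds a complete data ecosystem and large-scale training infrastructure, supporting scalable learning over diverse scenarios.
  • Reaches SOTA in real-world generalization evaluation, supporting the method's effectiveness and practicality.

Limitations:

  • The event definition depends on semantic segmentation of behavior, which may be imprecise where boundaries are blurry or actions are continuous, and requires high-quality annotation or automatic segmentation.
  • Building event-level data incurs extra manual or automatic annotation cost, which may limit use in settings that lack event annotations.
  • VLM Staircase Decoding at inference may add latency, which may not suit tasks with strict real-time control requirements.
  • The paper does not discuss failure cases in detail, or how the model behaves in extreme situations such as unseen contact patterns or dynamic environments.
  • Contact-sensitive tasks (such as precision assembly) may still need finer tactile or force feedback; the current model treats touch only as an optional modality.

Relevance To Keywords:

  • World Models: WALL-WM is a World Action Model (WAM) that explicitly couples future-observation modeling with action prediction.
  • Representation Learning: It learns geometry-preserving representations of language, vision, and action through event-level alignment.
  • Model-Based RL: A WAM can be used for planning and control, supporting prediction-based decision making.
  • Native multimodal large models: The model inherits multimodal priors from a video foundation model and unifies understanding and generation (VLM + video generation + action prediction).
  • Unified understanding and generation in multimodal large models: In event mode the VLM understands the task and generates an event description, and the model generates video and actions.
  • Post-training: Event-level pretraining acts as post-training of a video foundation model, preserving its priors while adding controllability.
  • Reinforcement learning: RL is not used directly, but a WAM can serve as the world-model component of RL.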

Score: 63.0 / 27.8
Authors: Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li
Published: 2026-06-01
TL;DR: This paper challenges token-prediction-based embodied planning by introducing a causal reasoning benchmark and a planner that improves next-state estimation and physical agency using Qwen3-VL-8B.

Abstract

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.

Scoring details
Keyword | Weight | Relevance | Score
Unify Models | 1.5 | 5.0/10 | 7.5
Tokenizer | 1.5 | 4.0/10 | 6.0
Visual Encoder | 1.5 | 3.0/10 | 4.5
World Models | 1.5 | 7.0/10 | 10.5
MLLM | 1.5 | 8.0/10 | 12.0
MultiModal | 1.5 | 8.0/10 | 12.0
model-based RL | 1.5 | 7.0/10 | 10.5

Scoring rationale: The paper focuses on MLLM and MultiModal embodied planning, utilizing next-state estimation which aligns with World Models and model-based RL concepts (high scores). Tokenizer is referenced in the title regarding token prediction, while Visual Encoder is a component of the base model Qwen3-VL. Unify Models is tangentially related to paradigm shifts but not the core focus. No target experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list.

Keywords

Causal Reasoning, Embodied Planning, Token Prediction, Next-state Estimation, Vision-Language, Physical Agency, Causal Planner

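In-Depth Analysis

Title (translated from Chinese): Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

Summary: The paper argues that current embodied vision-language planning benchmarks inadvertently favor language-level next-token prediction over physical next-state reasoning, so models lean on statistical language priors rather than true causal dependencies. In response, the authors propose Causal-Plan-Bench, a high-fidelity diagnostic suite built through multi-stage verification that evaluates embodied planning along four causal dimensions: executability, compositionality, effect, and robustness. They also develop a four-stage annotation pipeline that extracts structured representations from raw egocentric videos, yielding Causal-Plan-1M, a million-scale dataset of dense reasoning traces. Experiments show that even an advanced model such as Gemini 3 Pro scores only 38.18, whereas Causal Planner, built on Qwen3-VL-8B and trained with progressive SFT and RL, improves from a baseline of 33.22 to 45.28 and reveals a Causal Scaling Law: scaling the training data to one million instances brings a 36.3% relative gain. The work takes a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners, narrowing the gap between language modeling and world modeling.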

Innovations:

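  • Proposes Causal-Plan-Bench, a high-fidelity diagnostic suite with 1,200 instances across 12 task categories that systematically evaluates embodied planning along four causal dimensions: executability, compositionality, effect, and robustness.
  • Builds Causal-Plan-1M, a million-scale, high-fidelity supervised corpus whose structured causal reasoning traces are extracted from egocentric videos by a four-stage automatic annotation pipeline.
  • Identifies and validates a Causal Scaling Law: physically grounded planning ability scales predictably with the amount of causal supervision data.
  • With progressive SFT and RL training, Causal Planner (built on Qwen3-VL-8B) improves markedly on the benchmark (from 33.22 to 45.28, a gain of about 12 points) and shows strong zero-shot cross-benchmark generalization.
  • Shifts embodied planning from a language-driven token-prediction paradigm toward physically grounded causal reasoning, laying groundwork for autonomous physical agents.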

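Methodology: The paper uses a four-stage automatic annotation pipeline to extract structured causal representations from raw egocentric videos. Stage 1 (global blueprint) identifies the overall goal and generates a causal blueprint. Stage 2 (temporal grounding) aligns abstract steps to concrete video intervals. Stage 3 (causal enrichment) refines the steps into state-centric causal traces, including keyframes and causal structure. Stage 4 (atomic decomposition) breaks the causal traces into fine-grained atomic actions (such as reach, grasp, and retract). GPT-5.4 serves as the sole automatic annotator and runs an iterative self-audit loop to verify causal dependencies. For training, a progressive supervised fine-tuning (SFT) and reinforcement learning (RL) strategy on Qwen3-VL-8B lets the model internalize physical logic.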

Key Results:

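  • On Causal-Plan-Bench, the state-of-the-art model Gemini 3 Pro scores only 38.18, indicating that existing models lack genuine physical causal reasoning ability.
  • Training raises Causal Planner (built on Qwen3-VL-8B) from a baseline of 33.22 to 45.28, a 36.3% relative gain.
  • Causal Planner surpasses all frontier models on the four causal dimensions (executability, compositionality, effect, robustness).
  • Causal Planner shows strong zero-shot cross-benchmark generalization.
  • Validates the Causal Scaling Law: performance keeps improving as the training data grows from a small amount to one million instances.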
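The reported 36.3% figure is a relative gain, not a difference in benchmark points. A short check of the arithmetic, using only the two scores quoted above (the helper name is hypothetical):

```python
# Worked check of the reported gain from 33.22 to 45.28.

def relative_gain(before, after):
    """Improvement expressed as a fraction of the starting score."""
    return (after - before) / before

baseline, trained = 33.22, 45.28
abs_gain = trained - baseline                 # difference in benchmark points
rel_gain = relative_gain(baseline, trained)   # fraction of the baseline
print(f"{abs_gain:.2f} points, {rel_gain:.1%} relative")
# prints: 12.06 points, 36.3% relative
```

So the model gains about 12 points in absolute terms, which corresponds to the 36.3% relative improvement over the 33.22 baseline.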

Tech Stack:

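  • GPT-5.4 (as the automatic annotator)
  • Qwen3-VL-8B (base model)
  • Supervised fine-tuning (SFT)
  • Reinforcement learning (RL)
  • Four-stage annotation pipeline (global blueprint, temporal grounding, causal enrichment, atomic decomposition)
  • Keyframe extraction and causal-structure mapping
  • Causal Scaling Law analysis

Strengths:

  • Proposes a new causal-reasoning evaluation benchmark, filling a gap left by existing benchmarks that overlook physical causal dependencies.
  • Builds a large-scale, high-quality, structured causal-reasoning training dataset with explicit physical grounding.
  • Achieves a marked performance gain through a progressive training strategy and validates a data scaling law.
  • The work is systematic, forming a complete loop from problem definition, data construction, and model training to evaluation.
  • Exposes a basic weakness of current multimodal large models in physical reasoning and pushes toward a shift from language modeling to world modeling.

Limitations:

  • Data construction relies on GPT-5.4 as the automatic annotator, which may introduce the model's own biases or errors despite human verification.
  • Training data comes only from egocentric videos and may not cover all embodied settings (such as robot manipulation or autonomous driving).
  • Causal-Plan-Bench has 1,200 instances, a relatively small scale that may not fully measure model capability.
  • The paper does not examine deployment in real physical environments; evaluation remains offline.
  • The Causal Scaling Law is validated on a single model architecture (Qwen3-VL-8B); its generality needs further verification.

Relevance To Keywords:

  • Unify Models: Causal Planner brings together language modeling and physical world modeling, combining token prediction with causal reasoning.
  • World Models: The paper stresses a shift from language modeling to world modeling; causal reasoning in effect builds an internal model of the physical world.
  • Representation Learning: The four-stage pipeline extracts structured causal representations from video, a concrete application of representation learning.
  • Model-Based RL: Training uses reinforcement learning (RL), and causal reasoning can be viewed as a model-based approach.
  • Native multimodal large models: The base model Qwen3-VL-8B is a native multimodal large model, and the paper post-trains on top of it.
  • Unified understanding and generation in multimodal large models: The paper covers both understanding (causal reasoning) and generation (planning action sequences).
  • Reinforcement learning: RL is used for post-training.
  • Post-training: The paper adopts SFT and RL as its post-training strategy.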
Score: 63.0 / 27.8
Authors: Huiqiong Li, Jiayu Wang, Zhiting Mei, Anirudha Majumdar, Jingjing Chen, Bin Zhu
Published: 2026-06-01
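TL;DR: The paper proposes RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models in robotic manipulation, and finds that current models generate visually coherent videos but lack constraint reasoning and the ability to suppress unsafe instructions.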

Abstract

Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.

Scoring details
Keyword | Weight | Relevance | Score
Unify Models | 1.5 | 5.0/10 | 7.5
Tokenizer | 1.5 | 2.0/10 | 3.0
Visual Encoder | 1.5 | 5.0/10 | 7.5
World Models | 1.5 | 10.0/10 | 15.0
MLLM | 1.5 | 7.0/10 | 10.5
MultiModal | 1.5 | 7.0/10 | 10.5
model-based RL | 1.5 | 6.0/10 | 9.0

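Scoring rationale: The paper centers on evaluating the trustworthiness of video world models, so World Models receives full marks (10). MLLMs are used for evaluation, giving high relevance (7). MultiModal and Visual Encoder are implicit in video models, giving moderate relevance (5 to 7). Unify Models touches on the nature of the models but is not the paper's focus (it evaluates existing models), giving moderate relevance (5). Tokenizer is not mentioned, giving low relevance (2). Model-based RL is the background application area (robotic manipulation), giving moderate relevance (6). The author list does not include any of the specified experts.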

Keywords

Video World Models, Robotic Manipulation, Trustworthiness, Benchmarking, Constraint Reasoning, MLLM Assessment, Adversarial Evaluation, Physical Interaction

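In-Depth Analysis

Title (translated from Chinese): RoboTrustBench: Benchmarking the Trustworthiness of Video World Models in Robotic Manipulation

Summary: The paper proposes RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models in robotic manipulation. Most existing benchmarks assume input instructions are valid, feasible, and safe, but in real deployment instructions may be ambiguous, contradict the scene, be physically infeasible, or contain dangerous content. RoboTrustBench is built from the real-world robot manipulation dataset DROID and contains 1,207 expert-validated instruction-image pairs covering four scenarios: normal, constraint-sensitive, counterfactual, and adversarial. The paper designs a six-dimensional evaluation protocol (13 fine-grained criteria in total) covering scene-entity alignment, spatiotemporal consistency, interaction plausibility, task execution quality, visual quality, and safety-risk recognition. Seven representative video world models are evaluated with both human and MLLM assessment. Current models can generate visually coherent videos but perform poorly on constraint reasoning, counterfactual grounding, physical interaction, and suppression of dangerous instructions, showing that visual quality and surface-level instruction following are not enough for trustworthy robotic video world modeling.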

Innovations:

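  • Systematically evaluates the trustworthiness of video world models across normal, constraint-sensitive, counterfactual, and adversarial scenarios, filling a gap left by existing benchmarks that consider only feasible, safe instructions.
  • Builds a dataset of 1,207 expert-validated instruction-image pairs covering diverse scenes, objects, and manipulation tasks, and designs a six-dimensional evaluation protocol with 13 fine-grained criteria.
  • Reveals marked weaknesses of current video world models in constraint reasoning, counterfactual grounding, physical interaction, and suppression of dangerous instructions, pointing to directions for trustworthy robotic video modeling.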

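Methodology: Real robot manipulation episodes are drawn from the DROID dataset by stratified sampling, and the initial image and original instruction are extracted from each. Samples for the four scenarios are built through manual filtering and editing: normal scenarios keep the original feasible instruction; constraint-sensitive scenarios introduce occlusion, trajectory constraints, ambiguity, and similar factors; counterfactual scenarios make the instruction inconsistent with the scene (for example, a missing object or a contradictory attribute); adversarial scenarios use unsafe or destructive instructions. Human evaluation is the primary reference, supplemented by automatic evaluation with a multimodal large language model (MLLM). The evaluation dimensions are scene-entity alignment, spatiotemporal consistency, interaction plausibility, task execution quality, visual quality, and safety-risk recognition, with 13 fine-grained criteria in total.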

Key Results:

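  • In normal scenarios, current video world models preserve the visible scene structure and visual coherence, but their trustworthiness drops markedly under constrained, contradictory, and adversarial conditions.
  • Models struggle with trajectory constraints, occluded targets, and physically feasible interaction; under counterfactual instructions they hallucinate missing objects, change attributes, or alter the scene.
  • Under adversarial instructions, models with strong instruction following may directly generate unsafe robot behavior.
  • Visual coherence and surface-level instruction following are not sufficient measures of the trustworthiness of robotic video world models.

Tech Stack:

  • DROID dataset (real-world robot manipulation data)
  • Multimodal large language model (MLLM) for automatic evaluation
  • Human evaluation as the reference standard
  • Six-dimensional evaluation protocol (13 fine-grained criteria)
  • Four scenario types (normal, constraint-sensitive, counterfactual, adversarial)

Strengths:

  • Targets trustworthiness evaluation of video world models in robotic manipulation, which has clear practical value.
  • The dataset is sizable (1,207 pairs) and expert-validated.
  • The evaluation dimensions are comprehensive, covering scene grounding, interaction plausibility, and safety risk in addition to visual quality.
  • Exposes systematic weaknesses of existing models, giving follow-up research clear directions.

Limitations:

  • The dataset is built only from DROID and may not fully cover other robot platforms or scenarios.
  • Automatic evaluation relies on an MLLM, so its accuracy may be limited by the MLLM's own capability.
  • No concrete model-improvement method or training strategy is provided.
  • The adversarial scenarios cover only two categories, environmental damage and attacks on people, which may not be comprehensive.

Relevance To Keywords:

  • World Models: The paper directly evaluates the trustworthiness of video world models, so it is highly relevant.
  • Multimodal large models: MLLMs are used for automatic evaluation, and the evaluated systems include multimodal video generation models.
  • Representation Learning: Video world models involve visual representation learning, but the paper does not examine representation-learning mechanisms in depth.
  • Reinforcement learning: Video world models can be used for policy learning and evaluation, an indirect link to reinforcement learning.
  • Post-training: The paper does not cover post-training methods, but its evaluation results could guide post-training improvements.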
Score: 61.5 / 27.8
Authors: Baoqi Gao, Ruize Han, Miao Wang, Song Wang
Published: 2026-06-01
TL;DR: IMWM enhances latent planning from raw pixels by combining an intuition model with a world model, significantly improving success rates in goal-reaching tasks where world models alone fail.

Abstract

Planning with a learned latent world model is a promising route to control from raw pixels, but a strong world model alone is not enough. We show this experimentally: even with a perfect world model (operationalized by replacing the learned forward predictor with an idealized rollout of the true environment dynamics), a finite-budget sample-based planner still fails on some tasks, indicating that the bottleneck can lie in search rather than in world-model accuracy. Motivated by this gap, we propose IMWM (Intuition Model + World Model), which pairs the world model with an intuition model trained from demonstrations to recognize promising actions. The two models collaborate through three lightweight components: (i) Retrieval Initialization, which initializes the planner's action proposal from a retrieved demonstration; (ii) Hybrid Cost, which combines the intuition score with the world-model rollout cost; and (iii) a Reliability Gate, which adjusts how much the planner trusts intuition in each setting. Across four pixel-based goal-reaching tasks (Two-Room, Reacher, Push-T, and OGBench-Cube), IMWM has higher mean success than the world-model-only planner on all four, with the largest gains on Two-Room (99.2%, +11.5 percentage points) and OGBench-Cube (94.7%, +28.5 percentage points).

Scoring details
Keyword | Weight | Relevance | Score
Unify Models | 1.5 | 8.0/10 | 12.0
Tokenizer | 1.5 | 3.0/10 | 4.5
Visual Encoder | 1.5 | 7.0/10 | 10.5
World Models | 1.5 | 10.0/10 | 15.0
MLLM | 1.5 | 0.0/10 | 0.0
MultiModal | 1.5 | 3.0/10 | 4.5
model-based RL | 1.5 | 10.0/10 | 15.0

Scoring rationale: The paper centers on Model-Based RL and World Models (10/10), proposing a unified framework combining Intuition and World models (8/10). Visual Encoder is needed for pixel input (7/10). Tokenizer and MultiModal are peripheral as the work targets RL planning, not LLM tokenization or Vision-Language multimodality (3/10). MLLM is irrelevant (0/10). No listed experts are authors.

Keywords

Intuition Model, World Model, Latent Planning, Model-Based RL, Raw Pixels, Goal-Reaching, Demonstration

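In-Depth Analysis

Title (translated from Chinese): IMWM: An Intuition Model Complements the World Model for Latent Planning

Summary: The paper shows that in planning with a latent world model, a finite-budget sampling planner can still fail even with a perfect world model (the learned forward predictor replaced by the true environment dynamics), so the bottleneck can lie in search rather than model accuracy. The authors therefore propose IMWM, which pairs the world model with an intuition model trained from demonstrations. The two cooperate through three lightweight components: Retrieval Initialization (an action chunk retrieved from demonstrations initializes the CEM proposal), Hybrid Cost (the intuition score is combined with the world-model rollout cost), and a Reliability Gate (which adjusts how much intuition is trusted in each setting). On four pixel-based goal-reaching tasks (Two-Room, Reacher, Push-T, OGBench-Cube), IMWM achieves higher mean success than the world-model-only planner on every task, with the largest gains on Two-Room (+11.5 percentage points) and OGBench-Cube (+28.5 percentage points). The paper also uses a theoretical result and experimental diagnostics to show that, under a finite query budget, the search bottleneck is independent of how the predictor is trained.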

Innovations:

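  • Diagnoses a search bottleneck in finite-budget CEM planning with latent world models: experiments with an idealized world model (true dynamics) show that planning can still fail in search even with perfect prediction.
  • Proposes IMWM, which combines an intuition model (a contrastively trained action-compatibility scorer) with a world model and improves planning efficiency through three components: Retrieval Initialization, Hybrid Cost, and a Reliability Gate.
  • Proves that under a finite CEM query budget the probability of planning success is bounded by a proposal-volume bound, independent of how the predictor is trained (Theorem A.1).
  • Achieves consistent gains on four pixel-based tasks; the gate needs only three preset recipes and no per-task tuning.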

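Methodology: A diagnostic experiment first replaces the learned forward predictor with the true environment dynamics (oracle dynamics) to demonstrate the search bottleneck. The intuition model is then designed with an inverse-side encoder ϕ_I and a bilinear scorer D_ψ^inv, trained with an InfoNCE objective over demonstration windows to output a contrastive score for each (start, goal, action) triple. The intuition model is combined with a frozen world model through three components: Retrieval Initialization (a demonstration action chunk retrieved by cosine similarity serves as the CEM mean), Hybrid Cost (a weighted combination of the intuition score and the terminal latent-MSE cost), and a Reliability Gate (which selects one of three preset recipes per setting). Evaluation on the four tasks uses a fixed CEM budget and frozen models, with 12 seeds per task across 4 tasks for 48 experimental runs in total.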
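A toy sketch of how the three components could fit into a Cross-Entropy Method loop follows. Everything here is an illustrative stand-in rather than the paper's code: the "world model" is a 1-D additive system, the "intuition score" is a hand-written similarity to a retrieved demonstration (the paper uses a learned contrastive scorer), and the gate is reduced to a single weight `lam`, where `lam = 0` corresponds to a world-model-only fallback.

```python
# Toy CEM planner with retrieval initialization and a hybrid cost.
# All functions and constants are hypothetical stand-ins for illustration.
import random

def rollout_cost(x0, actions, goal):
    """Toy world model: x_{t+1} = x_t + a_t; cost is terminal squared error."""
    x = x0 + sum(actions)
    return (x - goal) ** 2

def intuition_score(actions, demo):
    """Stand-in for a learned scorer: higher when actions resemble the demo."""
    return -sum((a - d) ** 2 for a, d in zip(actions, demo))

def cem_plan(x0, goal, init_mean, demo, lam, rng,
             iters=5, pop=64, n_elite=8, std=0.5):
    """Cross-Entropy Method over action sequences with a hybrid cost."""
    mean = list(init_mean)            # retrieval initialization sets this
    stds = [std] * len(mean)
    for _ in range(iters):
        cands = [[rng.gauss(m, s) for m, s in zip(mean, stds)]
                 for _ in range(pop)]
        # Hybrid cost: rollout cost minus the weighted intuition score.
        cands.sort(key=lambda a: rollout_cost(x0, a, goal)
                   - lam * intuition_score(a, demo))
        elites = cands[:n_elite]
        # Refit the sampling distribution to the elite candidates.
        mean = [sum(col) / n_elite for col in zip(*elites)]
        stds = [max(1e-3, (sum((v - m) ** 2 for v in col) / n_elite) ** 0.5)
                for col, m in zip(zip(*elites), mean)]
    return mean

demo = [1.0, 1.0, 1.0, 1.0]           # retrieved demonstration action chunk
rng = random.Random(0)
plan = cem_plan(x0=0.0, goal=4.0, init_mean=demo, demo=demo, lam=0.5, rng=rng)
final_cost = rollout_cost(0.0, plan, 4.0)
print(final_cost)
```

Starting the proposal at the retrieved chunk places the sampled population near a good solution from the first iteration, which is the situation the paper's diagnosis says a fixed-budget planner otherwise struggles to reach.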

Key Results:

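  • With oracle dynamics (a perfect world model), success is 85.5% on Two-Room (below the 87.7% of the world-model-only planner) and 76.8% on OGBench-Cube (still below IMWM's 94.7%); in 98.9% and 91.4% of the respective failures, the CEM population contains no goal-reaching candidate.
  • IMWM's mean success exceeds the world-model-only planner on all four tasks: Two-Room 99.2% (+11.5 pp), OGBench-Cube 94.7% (+28.5 pp), Push-T 92.7% (+2.8 pp), and Reacher 83.8% (+0.7 pp).
  • The gate selects the world-model-only fallback on Reacher (roughly on par) and the hybrid mode on the other tasks.
  • A theoretical result shows that under a finite CEM query budget the probability of planning success is bounded by a proposal-volume bound (Theorem A.1).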

Tech Stack:

  • Cross-Entropy Method (CEM) sampling-based optimizer
  • InfoNCE contrastive learning objective (van den Oord et al., 2018)
  • Bilinear scorer
  • Cosine-similarity retrieval
  • Latent world model
  • Terminal latent-MSE cost
  • Reliability Gate (three preset recipes)
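For readers unfamiliar with two of the items above, the following sketch shows a bilinear scorer feeding an InfoNCE loss, in which the positive (demonstrated) action must outscore the negatives under a softmax. The matrix `W`, the embeddings, and the function names are made-up illustrative values, not the paper's trained model.

```python
# Minimal bilinear scorer + InfoNCE loss on toy embeddings (illustrative).
import math

def bilinear_score(u, W, v):
    """Compute u^T W v for vectors u, v and a matrix W given as rows."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def info_nce(scores, positive_index):
    """InfoNCE loss: negative log-softmax of the positive's score."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]

W = [[1.0, 0.0], [0.0, 1.0]]                     # identity scorer, toy only
context = [1.0, 0.0]                             # embedding of (start, goal)
actions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # candidate action embeddings
scores = [bilinear_score(context, W, a) for a in actions]
loss = info_nce(scores, positive_index=0)
print(scores, round(loss, 4))  # prints: [1.0, 0.0, -1.0] 0.4076
```

Minimizing this loss over demonstration windows pushes the score of the demonstrated action above the scores of the other candidates, which is what lets the scorer rank action proposals at planning time.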

Strengths:

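  • Clearly diagnoses the search bottleneck in latent world-model planning and supports it with both experiments and theory.
  • IMWM is simple and effective, raising success rates (substantially on Two-Room and OGBench-Cube) with only a few extra components.
  • Gains are consistent across several pixel-based tasks, and the gating mechanism guards against overfitting.
  • The method is general and does not depend on a specific world-model architecture.

Limitations:

  • The intuition model needs previously collected demonstration data, which may limit use where no demonstrations are available.
  • The Reliability Gate has only three preset recipes and may not adapt to more complex environmental changes.
  • Experiments cover only four tasks; generalization needs further validation.
  • Push-T was excluded from the oracle-dynamics experiment because of a simulator-physics confound, which suggests the method is sensitive to some environment details.

Relevance To Keywords:

  • World Models: The paper's core is improving planning with latent world models; it diagnoses a search bottleneck beyond world-model accuracy.
  • Representation Learning: The intuition model learns action representations through contrastive learning, and the world model uses latent encodings.
  • Model-Based RL: The paper belongs to model-based reinforcement learning, using the intuition model to guide search.
  • Post-training: The intuition model is trained on demonstration data and then frozen, and is combined with a pretrained world model.
  • Native multimodal large models / unified understanding and generation in multimodal large models: The paper does not directly involve multimodal large models, but latent world models handle pixel input and are indirectly related to multimodal representations.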
Score: 58.5 / 27.8
Authors: Jiahui Huang, Yasi Zhang, Tianyu Chen, Shu Wang, Jianwen Xie, Oscar Leong, Mingyuan Zhou, Nanzhu Wang, Ying Nian Wu
Published: 2026-06-01
TL;DR: MT-EditFlow addresses multi-turn image editing failures caused by error propagation by employing a flow-matching reinforcement learning framework that optimizes reward signals and significantly boosts sequential editing performance.
摘要翻译

基于指令的图像编辑(instruction-based image editing)近期取得了重大突破,引起了广泛关注,因为模型现在能够处理现实世界的编辑需求,具备日常用户所需的实用性。然而,主要为单轮编辑(single-turn edits)训练的编辑模型往往在多轮编辑(multi-turn editing)中失效——这是一种自然的交互场景,用户基于模型之前的输出迭代地精炼图像。这种失败源于“全有或全无”的要求(all-or-nothing requirement),即单个轮次的失败会损害整个序列,以及误差传播(error propagation),其中暴露偏差(exposure bias)导致编辑错误累积。为了解决这些挑战,我们引入了 MT-EditFlow,这是一种流匹配(flow-matching)强化学习框架,旨在优化序列图像编辑的奖励信号。MT-EditFlow 将多轮视角与多奖励公式相结合,提供了一种统一结构,适用于 GRPO 和基于 NFT 的强化学习方法。我们通过调查轮级聚合的有效评分策略、视觉语言模型(VLM)推理模式以权衡奖励偏差与方差,以及优势融合水平以防止奖励黑客(reward hacking),系统地分析和优化了奖励信号。我们的发现表明,在整个编辑轨迹上广播聚合的优势(advantage)有效地弥合了局部规划与全局多轮任务成功之间的差距。广泛的实验表明,MT-EditFlow 在各种基础模型上显著提升了性能。值得注意的是,它在第 3 轮整体性能上将 FLUX.1-Kontext-dev 提升了 6.85 分,超越了 Qwen-Image-Edit 等最先进的开源模型。通过保持高边际成功率并减少暴露偏差,MT-EditFlow 为视觉内容创作中更可靠且自然的人机协作奠定了基础。

Abstract

Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing--the natural interactive setting where a user iteratively refines an image based on the model's own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 7.0/10 10.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 6.0/10 9.0

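Scoring rationale: The paper proposes the MT-EditFlow framework, which unifies the GRPO and NFT reinforcement learning methods (Unify Models: 7), performs multi-turn image editing based on vision-language models (MLLM: 8, MultiModal: 8), and uses flow matching for reward optimization (model-based RL: 6), but does not address tokenizer design (Tokenizer: 2), visual encoder architecture (Visual Encoder: 5), or general world models (World Models: 3). No designated experts appear in the author list, so no bonus is added. The weighted total is 58.5, above the passing score of 27.8.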
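The weighted total in the score table can be reproduced with a few lines of arithmetic. The sketch below assumes the total is simply the sum of weight times relevance over all keywords and that a paper passes when the total exceeds the passing score; the function and variable names are illustrative only.

```python
def weighted_total(rows):
    """Sum weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


# Rows copied from the MT-EditFlow score table above.
MT_EDITFLOW_ROWS = [
    ("Unify Models", 1.5, 7.0),
    ("Tokenizer", 1.5, 2.0),
    ("Visual Encoder", 1.5, 5.0),
    ("World Models", 1.5, 3.0),
    ("MLLM", 1.5, 8.0),
    ("MultiModal", 1.5, 8.0),
    ("model-based RL", 1.5, 6.0),
]

PASSING_SCORE = 27.8  # value shown next to each paper's score in the report

total = weighted_total(MT_EDITFLOW_ROWS)
passed = total > PASSING_SCORE  # assumed pass rule
```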

关键词

Multi-Turn Image Editing, Reinforcement Learning, Flow Matching, Reward Optimization, VLM Reasoning, Sequential Editing, Error Propagation

深度分析

Chinese Title: MT-EditFlow:基于流匹配的多轮图像编辑强化学习

Summary: 本文针对多轮图像编辑中存在的“全有或全无”要求和误差传播问题,提出了MT-EditFlow框架,将强化学习与流匹配模型相结合,优化多轮序列编辑的奖励信号。该框架整合了多轮视角与多奖励结构,统一适用于GRPO和NFT两种强化学习方法。通过系统分析评分策略、VLM推理模式以及优势融合层级,发现将聚合优势广播至整个编辑轨迹能有效弥合局部规划与全局任务成功之间的差距。实验表明,MT-EditFlow显著提升了多种基础模型的多轮编辑性能,例如使FLUX.1-Kontext-dev在第三轮整体性能上提升6.85分,超越当前开源最优模型Qwen-Image-Edit,同时保持了高边际成功率并降低了曝光偏差。

Innovations:

  • 首次将强化学习框架应用于多轮图像编辑,针对序列编辑的时序依赖和误差累积进行优化。
  • 提出多奖励结构(指令遵循与内容一致性)与多轮视角的统一奖励信号设计,适用于GRPO和NFT两种RL方法。
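  • Systematically studies how scoring strategies, VLM reasoning modes (thinking vs. non-thinking), and advantage fusion levels shape the reward signal: the fusion level is chosen to prevent reward hacking, and broadcasting the aggregated advantage across the whole trajectory links local planning to global multi-turn success.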
  • 构建了约2.5K高质量多轮提示链数据集,并基于EdiVal-Agent流程生成。
  • 在多个基础模型上实现显著性能提升,超越当前开源SOTA模型。

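Methodology: The paper uses flow matching as the generative backbone and applies reinforcement learning as post-training. The pipeline is: 1) generate multi-turn editing prompt chains with GPT-4o; 2) score instruction following (IF) with a zero-shot VLM (Qwen3-VL-8B) and content consistency (CC) with EdiVal-CC; 3) aggregate the multi-turn rewards into a single advantage value and broadcast it across the whole trajectory; 4) extend both Flow-GRPO and DiffusionNFT to the multi-turn setting; 5) run comparative experiments on scoring strategies, VLM reasoning modes, and advantage fusion levels.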
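Step 3 of the pipeline, aggregating turn-level rewards and broadcasting one advantage over the trajectory, can be sketched as follows. The aggregation by mean and the GRPO-style group normalization (subtract the group mean, divide by the group standard deviation) are assumptions for illustration; the paper compares several scoring and fusion strategies, which are not reproduced here.

```python
import statistics


def broadcast_advantages(group_rewards, eps=1e-6):
    """group_rewards: one list of per-turn rewards per sampled trajectory.

    Returns, per trajectory, a list with the same group-relative advantage
    repeated for every turn.
    """
    # Aggregate each trajectory's turn rewards into one score (mean is assumed).
    traj_scores = [sum(turns) / len(turns) for turns in group_rewards]
    mu = statistics.fmean(traj_scores)
    sigma = statistics.pstdev(traj_scores)
    advantages = [(s - mu) / (sigma + eps) for s in traj_scores]
    # Broadcast: every turn in a trajectory receives that trajectory's advantage.
    return [[a] * len(turns) for a, turns in zip(advantages, group_rewards)]


# Three sampled 3-turn trajectories with hypothetical binary turn rewards.
example = broadcast_advantages([[1, 1, 1], [1, 0, 0], [0, 0, 0]])
```

Because every turn shares the trajectory-level value, an early turn is credited or penalized for how the whole editing sequence ends, which matches the all-or-nothing framing in the abstract.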

Key Results:

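  • MT-EditFlow improves FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing Qwen-Image-Edit.
  • FLUX.2-klein-base improves by 2.90 points, pushing the frontier of open-source models.
  • Marginal success rates stay high across turns, and exposure bias is markedly reduced.
  • Broadcasting the aggregated advantage across the whole trajectory outperforms local advantage assignment; the advantage fusion level is what guards against reward hacking.
  • The VLM thinking mode lowers reward bias but increases variance, so the two must be traded off.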

Tech Stack:

  • 流匹配(Flow Matching)
  • Rectified Flow
  • GRPO(Group Relative Policy Optimization)
  • DiffusionNFT(Diffusion Negative-aware FineTuning)
  • VLM:Qwen3-VL-8B
  • EdiVal-CC(内容一致性评估)
  • GPT-4o(用于生成多轮提示链)
  • ODE/SDE求解器
  • Euler-Maruyama离散化

Strengths:

  • 首次系统解决多轮图像编辑中的强化学习奖励设计问题,具有开创性。
  • 统一了GRPO和NFT两种RL方法,框架通用性强。
  • 实验充分,在多个基础模型上验证了有效性,并超越SOTA。
  • 对奖励信号设计进行了深入分析,提供了实用指导。
  • 构建了高质量多轮编辑数据集,促进后续研究。

Limitations:

  • Relies on a VLM (Qwen3-VL-8B) as the reward model, which may introduce bias and adds inference cost.
  • 多轮提示链由GPT-4o生成,可能不完全覆盖真实用户交互模式。
  • 仅针对流匹配模型,未验证在其他生成框架(如扩散模型)上的适用性。
  • 训练数据规模约2.5K,可能不足以泛化到所有编辑场景。
  • 未讨论计算效率与训练时间等实际部署问题。

Relevance To Keywords:

  • Unify Models: 论文聚焦于多轮图像编辑,属于多模态生成与理解统一的研究方向。
  • World Models: 流匹配可视为一种连续时间世界模型,论文利用其进行编辑生成。
  • Representation Learning: 通过强化学习优化编辑模型,间接学习更好的图像表示。
  • Model-Based RL: 论文采用基于模型的强化学习(流匹配作为生成模型),并扩展至多轮序列。
  • 原生多模态大模型: 论文使用VLM作为评估器,但编辑模型本身是流匹配模型,非原生多模态大模型。
  • 多模态大模型的理解和生成一体化: 论文涉及图像编辑(生成)和VLM评估(理解),但未实现一体化模型。
  • 表征学习: 强化学习优化可视为对潜在表征的调整。
  • 世界模型: 流匹配可视为连续时间世界模型。
  • 强化学习: 核心方法,使用GRPO和DiffusionNFT。
  • 后训练: 论文属于后训练阶段,对预训练流匹配模型进行RL微调。
Score: 55.5 / 27.8
Authors: Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Jun Zhan, Kang Yu, Kexin Huang, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Yang Gao, Yiyang Zhang, Xipeng Qiu
Published: 2026-06-01
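TL;DR: MOSS-Audio is a unified audio-language model that couples a dedicated audio encoder and a modality adapter with a large language model, and shows strong performance on audio captioning, speech recognition, and time-aware question answering.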
摘要翻译

MOSS-Audio 是一个统一的音频 - 语言模型,用于语音、环境声音和音乐理解,支持音频描述、时间感知问答、时间戳转录和音频锚定推理。MOSS-Audio 将一个专用音频编码器与一个模态适配器和一个大型语言模型(LLM)耦合:编码器产生 12.5 Hz 的时域表示,适配器将它们投影到解码器空间,解码器生成自回归文本输出。该系统的核心在于两项设计选择:DeepStack 跨层特征注入,使解码器接触到来自多个编码器深度的声学信息,以及时间标记,通过在音频标记流中插入时间戳标记提供显式的时间线索。在数据层面,我们设计了一个事件保持的音频标注管道,在语义一致的事件边界处分割原始音频,对语音、音乐和通用音频应用分支特定标注,并将结果合并为统一描述用于预训练。中间分支特定描述进一步被保留,以支持构建任务导向的 SFT(监督微调)数据。该模型在大规模音频 - 语言数据上进行预训练,纳入时间感知目标以支持时间锚定,然后经历多阶段后训练以增强指令遵循和音频锚定推理。我们发布了 4B 和 8B 变体,包括 Instruct 和 Thinking 两种配置。MOSS-Audio 在通用音频理解、语音描述、ASR(自动语音识别)和时间戳 ASR 上实现了强大性能,将其定位为未来语音智能体的有前景的理解基础。

Abstract

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: \textbf{DeepStack cross-layer feature injection}, which exposes the decoder to acoustic information from multiple encoder depths, and \textbf{time markers}, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 9.0/10 13.5
Tokenizer 1.5 6.0/10 9.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 0.0/10 0.0

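Scoring rationale: The paper's core is a unified audio-language model, so Unify Models, MultiModal, and MLLM score high; it mentions an audio-token stream, so Tokenizer is moderately relevant; it uses an audio encoder rather than a visual encoder, so Visual Encoder scores low; it does not involve reinforcement learning or learning world-model dynamics, so model-based RL and World Models score low. The author list contains no designated experts, so no bonus is added. The weighted total is 55.5, above the dynamic passing score of 27.8.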

关键词

Unified audio-language model, Audio encoder, Modality adapter, DeepStack cross-layer feature injection, Time markers, Audio-grounded reasoning, Multi-stage post-training

深度分析

Chinese Title: MOSS-Audio技术报告

Summary: MOSS-Audio是一个统一的音频-语言模型,能够处理语音、环境声音和音乐,支持音频字幕、时间感知问答、时间戳转录和音频基础推理。模型采用编码器-适配器-解码器架构:音频编码器从零训练,输出12.5Hz的时序表示;两个GatedMLP适配器将音频特征映射到语言模型空间;语言模型进行自回归文本生成。核心创新包括DeepStack跨层特征注入(将编码器中间层特征注入解码器早期层,保留多粒度声学信息)和显式时间标记(每2秒插入时间标记,支持时间基础任务)。数据管道采用事件保留分割和分支特定注释(语音、音乐、通用音频),合并为统一字幕用于预训练。模型发布4B和8B两种参数规模,每个规模有Instruct(直接指令执行)和Thinking(推理优化)两种配置。实验表明,MOSS-Audio在通用音频理解、语音字幕、ASR和时间戳ASR上达到强性能,为未来语音代理提供了感知和推理基础。

Innovations:

  • 提出DeepStack跨层特征注入,将音频编码器多个中间层的特征注入语言模型解码器,保留低层声学细节(如韵律、瞬态事件)和高层语义信息,避免单层表示瓶颈。
  • 引入显式时间标记,在音频特征序列中插入固定间隔的秒数标记,使模型能够直接学习时间戳转录和时间感知问答,无需外部后处理。
  • 构建事件保留音频标注管道:基于事件边界分割音频,分支特定注释(语音、音乐、通用音频),再合并为统一字幕,支持大规模异构数据预训练。
  • 发布Instruct和Thinking两种配置的4B和8B模型,分别优化直接指令执行和推理密集型音频理解,满足不同任务需求。

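Methodology: The model uses an encoder-adapter-decoder architecture. Audio encoder: it takes 128-channel log-mel spectrograms, applies three stride-2 Conv2D layers for 8x downsampling to 12.5 Hz, then a 32-layer Transformer (hidden size 1280) with sliding-window attention (window of 100 frames, about 8 seconds) so that long audio is processed with linear complexity. Two GatedMLP adapters: the main adapter maps the encoder's final-layer output into the language-model space; the merge adapter takes features from intermediate encoder layers, aggregates them, and injects them into early decoder layers. Time markers: a numeric marker (such as "2", "4") is inserted for every 25 audio features (2 seconds) and fed to the language model together with the audio features. Training: pretraining jointly covers ASR, audio captioning, timestamped ASR, and text modeling; post-training proceeds in stages for instruction following and audio-grounded reasoning. Data pipeline: raw audio is segmented at event boundaries (avoiding fixed-window cuts), routed by audio tags to speech, music, and general-audio branches for ASR, captioning, and other annotation, and finally merged into a unified caption format.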
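The time-marker arithmetic above (12.5 Hz features, one marker per 25 features, so one marker every 2 seconds) can be made concrete with a small sketch. The tuple-based stream format, the function name, and the choice to place each marker after its 25 features are assumptions for illustration; the report does not specify the exact placement or token format.

```python
FRAME_RATE_HZ = 12.5       # encoder output rate stated in the report
FRAMES_PER_MARKER = 25     # 25 frames / 12.5 Hz = 2 seconds


def insert_time_markers(frames):
    """Interleave ("time", "<seconds>") markers into a list of audio features.

    Each audio feature is wrapped as ("audio", feature); a marker carrying the
    elapsed seconds is appended after every FRAMES_PER_MARKER features.
    """
    stream = []
    for i, frame in enumerate(frames, start=1):
        stream.append(("audio", frame))
        if i % FRAMES_PER_MARKER == 0:
            seconds = int(i / FRAME_RATE_HZ)
            stream.append(("time", str(seconds)))
    return stream


# 60 placeholder features cover 4.8 s of audio, so two markers are expected.
stream = insert_time_markers(list(range(60)))
markers = [value for kind, value in stream if kind == "time"]
```

The same numbers explain the attention window: 100 frames at 12.5 Hz span 8 seconds.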

Key Results:

  • 在通用音频理解、语音字幕、ASR和时间戳ASR任务上达到强性能,与同类模型相比具有竞争力。
  • Thinking变体在推理密集型音频理解基准上表现更优,Instruct变体在直接任务执行(转录、字幕)上更稳定。
  • 4B和8B模型均发布,覆盖不同计算资源需求。
  • DeepStack跨层注入相比仅使用最终层表示,在多个音频任务上带来一致提升。
  • 时间标记机制使模型能够准确输出时间戳转录和时间感知问答,无需额外定位模块。

Tech Stack:

  • 音频编码器:Conv2D(stride-2)、Transformer(32层,hidden=1280)、滑动窗口注意力(窗口100帧)
  • 适配器:GatedMLP(门控多层感知机)
  • 语言模型:基于LLM的decoder-only架构(具体未指定,可能为LLaMA或类似)
  • 时间标记:固定间隔(每25帧)插入数字标记,嵌入后与音频特征拼接
  • 数据管道:事件边界分割(基于音频事件检测)、分支注释(ASR模型、音频字幕模型等)
  • 训练:多任务预训练(ASR、字幕、时间戳ASR、文本建模)+ 多阶段后训练(指令微调、推理微调)
  • 推理:自回归文本生成,支持KV-cache加速

Strengths:

  • 统一模型覆盖语音、音乐、环境声音等多种音频类型和任务,无需专用模块。
  • DeepStack跨层注入有效保留多粒度声学信息,提升对韵律、瞬态事件等细节的感知。
  • 显式时间标记使时间基础任务(如时间戳ASR、时间感知QA)成为端到端生成,简化系统设计。
  • 数据管道通过事件保留分割和分支注释,高效利用大规模异构音频数据。
  • 发布Instruct和Thinking两种配置,兼顾直接执行和推理能力,适应不同应用场景。
  • 编码器从零训练,针对通用音频理解优化,而非仅针对ASR。

Limitations:

  • 时间标记固定为2秒间隔,对于需要亚秒级精度的时间定位可能不够精细。
  • 编码器从零训练需要数百万小时数据和大量计算资源,训练成本高。
  • 模型仅输出文本,不支持音频生成(如语音合成),不是完整的音频多模态模型。
  • 滑动窗口注意力限制局部上下文为8秒,对于超长音频(如数小时)的全局依赖可能不足。
  • 论文未提供与最新大模型(如GPT-4o、Gemini)在音频理解上的直接对比,性能评估可能不够全面。
  • Thinking变体的推理能力提升机制未详细说明(如是否使用CoT或强化学习)。

Relevance To Keywords:

  • Unify Models: Highly relevant; MOSS-Audio is a unified audio-language model that covers multiple audio understanding tasks through text generation.
  • World Models: 中等相关,音频理解可视为世界模型的一部分(感知环境声音),但论文未明确构建世界模型或进行预测。
  • Representation Learning: 高度相关,音频编码器从零训练学习多层级声学表征,DeepStack进一步利用中间层表征。
  • Model-Based RL: 不相关,论文未涉及强化学习或基于模型的控制。
  • 原生多模态大模型: 中等相关,模型是音频+语言的多模态,但音频编码器是专门训练的,并非原生多模态(如统一tokenizer)。
  • 多模态大模型的理解和生成一体化: 部分相关,模型支持理解(转录、字幕、QA)和生成(文本输出),但生成限于文本,不包含音频生成。
  • 表征学习: 高度相关,同Representation Learning。
  • 世界模型: 中等相关,同World Models。
  • 强化学习: 不相关,论文未使用强化学习。
  • 后训练: 高度相关,论文采用多阶段后训练(指令微调、推理微调)提升模型能力。
Score: 55.5 / 27.8
Authors: Hyeonwoo Cho, DongHyeon Baek, Yewon Kim, Bumsub Ham
Published: 2026-06-01
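TL;DR: This paper proposes the RESTORE framework, which corrects the positional and attentional distortions introduced by visual token reduction, improving the inference accuracy of multimodal large language models while preserving computational efficiency.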
摘要翻译

近期,多模态大语言模型(MLLMs)在视觉 - 语言任务中取得了显著进展,然而,由于大量视觉令牌 (token) 所带来的二次计算复杂度,造成了显著的内存和延迟瓶颈。虽然视觉令牌缩减(VTR)策略已被探索以减轻这一负担,但现有方法忽略了原始序列与缩减序列之间的位置和注意力一致性,从而导致表示失真。为此,我们提出了一种名为 RESTORE 的新型 VTR 框架,该框架在保持效率的同时,修正了位置和注意力失真。具体而言,我们提出了一种简单而有效的校准方法,通过基于相对距离增强注意力权重来恢复丢失的视觉注意力。我们还引入了独特的锚点选择机制用于令牌合并,以在特征平均过程中减轻信息损失。在多个基准上的实验结果表明,我们的方法一致地提升了各种缩减方法的准确性,实现了最先进性能(SOTA),同时保持了计算效率。

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 4.0/10 6.0
Tokenizer 1.5 5.0/10 7.5
Visual Encoder 1.5 8.0/10 12.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 10.0/10 15.0
MultiModal 1.5 10.0/10 15.0
model-based RL 1.5 0.0/10 0.0

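Scoring rationale: The paper's core is visual token reduction and inference efficiency for multimodal large language models (MLLMs). It is highly relevant to MLLM and MultiModal (10), and because it operates directly on the tokens output by the visual encoder, Visual Encoder relevance is high (8). It involves token management but does not go into tokenizer architecture, so Tokenizer relevance is moderate (5). Although a multimodal model can be viewed as a unified model, the paper focuses on efficiency rather than unified architecture, so Unify Models relevance is low (4). It is unrelated to world models and reinforcement learning (0).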

关键词

Visual Token Reduction, Multimodal LLM, Inference Efficiency, Attention Distortions, Token Merging, Visual Attention, Positional Consistency

深度分析

Chinese Title: 通过纠正失真改进视觉令牌缩减以实现高效多模态大语言模型推理

Summary: 本文提出RESTORE框架,旨在解决多模态大语言模型(MLLM)中视觉令牌缩减(VTR)导致的注意力与位置失真问题。现有VTR方法在缩减令牌时,由于位置索引重分配或保留原始索引,破坏了注意力权重的分布,导致视觉信息丢失。RESTORE通过保留原始位置索引并引入基于相对距离的注意力校准方法,恢复丢失的视觉注意力;同时提出一种新颖的锚点选择策略,在令牌合并时兼顾代表性与区分性,减少特征平均过程中的信息损失。实验表明,该方法在多个MLLM基准上显著提升了不同VTR方法的准确率,同时保持了计算效率。

Innovations:

  • 首次系统分析了VTR方法中位置分配对注意力权重的影响,揭示了注意力失真问题。
  • 提出基于相对距离的注意力校准方法,在保留原始位置索引的同时恢复视觉令牌的总注意力权重。
  • 设计了一种兼顾代表性与区分性的锚点选择策略,用于令牌合并,减少特征平均时的信息损失。
  • 在多个基准上验证了RESTORE能够一致提升多种现有VTR方法的性能,达到最先进水平。

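Methodology: The method builds on a standard MLLM architecture (such as LLaVA), where a visual encoder extracts features and a projector maps them to a sequence of visual tokens. VTR methods reduce these tokens before they enter the LLM. RESTORE keeps the original position indices and, by analyzing how relative distance in RoPE affects attention weights, designs a calibration function (exponential decay or linear scaling based on relative distance) that compensates for the attention lost when tokens are removed. For token merging, it proposes anchor selection: it computes each token's similarity to the other tokens in its neighborhood, picks as anchors the tokens that are highly representative yet well separated from other candidate anchors, and then merges similar tokens into them. Experiments run on multiple visual question answering and reasoning benchmarks against several VTR baselines.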
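The positional side of the distortion can be illustrated with plain index arithmetic. The sketch below contrasts two ways of assigning position indices after some visual tokens are dropped: reassigning contiguous indices versus keeping the original ones. The function names and the tiny sequence are hypothetical, and the attention calibration function itself is not modeled here.

```python
def positions_after_reduction(kept, num_visual, num_text, preserve):
    """Return (visual_positions, text_positions) after token reduction.

    kept: original indices of the visual tokens that survive reduction.
    preserve=True keeps original indices; False renumbers them contiguously.
    """
    if preserve:
        visual = list(kept)
        text_start = num_visual
    else:
        visual = list(range(len(kept)))
        text_start = len(kept)
    text = list(range(text_start, text_start + num_text))
    return visual, text


def relative_distances(query_pos, key_positions):
    """Relative distance from a query position to each key position."""
    return [query_pos - k for k in key_positions]


# 8 visual tokens followed by 2 text tokens; tokens 1, 4 and 6 are kept.
kept = [1, 4, 6]
vis_keep, txt_keep = positions_after_reduction(kept, 8, 2, preserve=True)
vis_new, txt_new = positions_after_reduction(kept, 8, 2, preserve=False)

dist_preserved = relative_distances(txt_keep[0], vis_keep)
dist_reassigned = relative_distances(txt_new[0], vis_new)
dist_full = [8 - k for k in kept]  # distances in the unreduced sequence
```

With preserved indices the first text token sees the same relative distances as in the full sequence, while renumbering compresses them, which changes the RoPE-dependent attention pattern.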

Key Results:

  • RESTORE在多个MLLM基准(如VQAv2、GQA、MMBench等)上,对FastV、VisionZip等VTR方法均带来准确率提升,最高提升约2-3个百分点。
  • 注意力校准有效恢复了视觉令牌的注意力权重比例,减少了模型对文本的过度依赖。
  • 锚点选择策略在令牌合并中减少了信息损失,尤其在细粒度视觉任务上表现更优。
  • 计算开销增加极小,保持了VTR带来的效率优势。

Tech Stack:

  • Rotary Position Embedding (RoPE)
  • Softmax归一化
  • 多头自注意力机制
  • 基于相对距离的注意力校准函数
  • 余弦相似度或内积用于令牌相似性度量
  • LLaVA-1.5作为基础MLLM架构
  • CLIP视觉编码器

Strengths:

  • 深入分析了VTR中位置分配对注意力机制的负面影响,填补了现有研究的空白。
  • 提出的校准方法简单有效,无需额外训练,可即插即用于多种VTR方法。
  • 锚点选择策略兼顾代表性与区分性,优于均匀采样或高注意力采样。
  • 实验全面,在多个基准上验证了泛化性和鲁棒性。

Limitations:

  • 注意力校准依赖于相对距离的预设函数,可能在不同模型或任务中需要调整超参数。
  • 锚点选择增加了少量计算开销,在极端低令牌预算下可能影响效率。
  • 主要针对图像输入,未验证在视频或多帧输入上的效果。
  • 未探讨与后训练或强化学习等方法的结合潜力。

Relevance To Keywords:

  • Unify Models: 论文聚焦于多模态大模型中的视觉令牌缩减,属于模型效率优化,与统一模型相关。
  • World Models: 间接相关,视觉令牌缩减影响模型对视觉世界的理解,但未直接涉及世界模型构建。
  • Representation Learning: 锚点选择涉及特征表示的质量,注意力校准维护了注意力表示的一致性,与表征学习相关。
  • Model-Based RL: 不直接相关,论文未涉及强化学习或基于模型的规划。
  • 原生多模态大模型: 直接相关,论文针对原生MLLM的推理效率问题。
  • 多模态大模型的理解和生成一体化: 相关,VTR影响理解与生成的质量。
  • 表征学习: 如上所述,锚点选择与注意力校准涉及表征优化。
  • 世界模型: 弱相关。
  • 强化学习: 不相关。
  • 后训练: 弱相关,论文方法无需后训练,但可应用于后训练后的模型。
Score: 55.5 / 27.8
Authors: Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang
Published: 2026-06-01
TL;DR: Goal2Pixel reformulates vision-language navigation as navigable pixel grounding using a unified spatial interface, achieving state-of-the-art performance with significantly fewer VLM inference calls.
摘要翻译

视觉语言模型(VLM)已成为连续环境中的视觉语言导航(VLN-CE)的常见基础。然而,大多数基于 VLM 的方法将导航视为低级别的动作预测,这种接口具有模糊性,受限于短视程运动基元,且由于重复查询 VLM 而效率低下。我们提出 Goal2Pixel,这是一种纯像素基底的范式,将 VLN-CE 重新表述为可导航像素定位。与预测动作不同,Goal2Pixel 使用图像平面作为 VLM 推理与机器人运动之间的统一空间接口:模型向智能体预测一个可见的可导航像素,该像素被反投影为 3D 航点以用于前向导航。对于非前向动作,我们在图像平面上附加辅助指令区域,其中左侧、右侧和底部区域分别被解释为左转、右转和停止。为了实现长视程导航,我们提出了一种可见性感知关键帧记忆,用于紧凑且信息丰富的历史表示。为了使预训练的 VLMs 适应可导航像素定位,我们引入了语义嵌入和坐标感知辅助损失。Goal2Pixel 实现了具有竞争力的最先进的性能,同时所需的 VLM 推理调用次数少于先前方法。在 R2R-CE Val-Unseen 上,它实现了 54.1% SR 和 52.5% SPL,每回合仅需 7.75 次 VLM 调用,比直接动作预测所需的 46.62 次少 6 倍(后者在 32.9% SR 下)。在 RxR-CE 上也呈现出相同的趋势。项目页面:https://baobao0926.github.io/Goal2Pixel/.

Abstract

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a visible navigable pixel to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project Page: https://baobao0926.github.io/Goal2Pixel/.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 7.0/10 10.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 4.0/10 6.0
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 3.0/10 4.5

评分理由: The paper utilizes VLMs for Vision-Language Navigation, scoring high on MLLM and MultiModal. It moderately scores on Unify Models due to the unified spatial interface proposal. Visual Encoder is implicit. World Models, Tokenizer, and model-based RL are less relevant as the focus is on pixel grounding and memory rather than generative models, tokenization, or explicit model-based planning. No target experts were found in the author list.

关键词

Vision-Language Navigation, Pixel Grounding, VLM, Unified Interface, Keyframe Memory, Long-horizon Navigation, Visual-Language Models

深度分析

Chinese Title: Goal2Pixel:将目标锚定到像素以实现视觉-语言导航

Summary: 本文提出Goal2Pixel,一种纯像素范式的视觉-语言导航方法,将连续环境中的导航(VLN-CE)重新定义为可导航像素的锚定问题。传统VLM方法将导航视为低层动作预测,存在动作监督模糊、短视决策和高推理成本等问题。Goal2Pixel将图像平面作为VLM推理与机器人运动之间的统一空间接口:模型预测一个可见的可导航像素,通过相机几何反投影为3D航点,由轻量级局部规划器执行。对于非前进动作(左转、右转、停止),在图像平面左右下侧附加辅助指令区域。为支持长程导航,提出可见性感知的关键帧记忆(ViKeyMem),仅当可见航点集显著变化时添加帧,并叠加历史轨迹。为适配预训练VLM,引入语义嵌入和坐标感知辅助损失。在R2R-CE和RxR-CE上,Goal2Pixel达到竞争性SOTA性能,同时将VLM调用次数减少约6倍(R2R-CE上SR 54.1%,SPL 52.5%,每集仅7.75次调用)。

Innovations:

  • 提出纯像素接口Goal2Pixel,将VLN-CE从动作预测转变为图像空间的目标锚定,统一了前进和非前进决策。
  • 设计可见性感知关键帧记忆ViKeyMem,仅保留信息量大的历史帧,覆盖所有有意义视角变化,无需额外模型。
  • 引入语义嵌入和坐标感知辅助损失,使预训练VLM有效适应可导航像素锚定任务。
  • 在R2R-CE上实现54.1% SR和52.5% SPL,VLM调用次数减少约6倍,同时保持竞争性性能。

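Methodology: The paper uses a three-stage execution pipeline: 1) the VLM predicts a goal pixel on the image plane; 2) the pixel is back-projected through the camera geometry into a 3D waypoint; 3) a lightweight local planner turns the waypoint into low-level actions. For data collection, 2.64M samples are derived from the R2R-CE and RxR-CE training sets; each sample contains the instruction, the current image, history images (built by ViKeyMem), and a ground-truth pixel. The ground-truth pixel is defined as the projection of the farthest visible future waypoint along the trajectory; when none is visible, an auxiliary directive region is used instead. ViKeyMem selects keyframes based on conditions such as the visibility overlap between a candidate frame and the most recent keyframe and the visibility of subsequent waypoints, and overlays trajectory points. During training, semantic embeddings distinguish the RGB region, the auxiliary directive regions, and the trajectory overlay markers, and a coordinate-aware auxiliary loss (numeric and angular terms) is added.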
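Stages 1 and 2 can be sketched with the standard pinhole camera model. The intrinsics, the image size, the margin width of the auxiliary regions, and the order in which the regions are checked are all hypothetical placeholders; the sketch only shows how one predicted pixel could be decoded either into a directive or into a 3D point in the camera frame.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def decode_prediction(u, v, width, height, margin):
    """Decode a pixel on a canvas padded with left, right and bottom regions.

    The RGB image occupies columns [margin, margin + width) and rows [0, height).
    Returns a directive string or ("pixel", u_img, v_img) in image coordinates.
    """
    if u < margin:
        return "turn_left"
    if u >= margin + width:
        return "turn_right"
    if v >= height:
        return "stop"
    return ("pixel", u - margin, v)


# Hypothetical 640x480 camera with 40-pixel auxiliary regions.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0

decoded = decode_prediction(460, 240, width=640, height=480, margin=40)
_, u_img, v_img = decoded
waypoint = backproject(u_img, v_img, depth=2.0, fx=FX, fy=FY, cx=CX, cy=CY)
```

A single forward pass therefore yields either a turn/stop directive or a waypoint that the local planner can follow for several steps, which is where the reduction in VLM calls comes from.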

Key Results:

  • 在R2R-CE Val-Unseen上,SR 54.1%,SPL 52.5%,每集平均VLM调用7.75次。
  • 相比直接动作预测方法(SR 32.9%,调用46.62次),SR提升21.2个百分点,调用减少约6倍。
  • 在RxR-CE上,SR 48.1%,SPL 44.7%。
  • 消融实验表明纯像素范式优于动作基线和混合动作-像素基线。
  • 在真实轮式机器人上部署成功,验证了仿真到现实的迁移能力。

Tech Stack:

  • 预训练视觉-语言模型(VLM)
  • 相机几何反投影(pinhole camera model)
  • 可见性检查(投影-反投影一致性,阈值0.6m)
  • 语义嵌入(learnable embeddings for special tokens)
  • 坐标感知辅助损失(数值回归损失 + 角度损失)
  • 轻量级局部规划器(将3D航点转换为低层动作)
  • ViKeyMem关键帧选择算法(基于可见性条件)
  • MP3D仿真环境(Matterport3D)

Strengths:

  • 创新性地将导航输出统一为像素预测,避免了动作空间歧义和短视问题。
  • 显著降低VLM推理调用次数,提升效率,同时保持高成功率。
  • ViKeyMem历史表示紧凑且信息丰富,无需额外架构,可通用。
  • 在仿真和真实机器人上均验证了有效性,具有实际应用潜力。
  • 消融实验充分,对比了多种输出范式,证明了纯像素范式的优势。

Limitations:

  • 依赖深度信息进行反投影,在深度不准确或缺失的场景下可能受限。
  • 辅助指令区域的设计依赖于固定图像边界,可能不适用于全景或变焦相机。
  • 仅评估了MP3D环境,泛化到其他大规模场景(如Habitat-Matterport 3D)有待验证。
  • 未与使用外部训练数据(如预训练导航模型)的方法进行对比,公平性需注意。
  • 局部规划器可能无法处理复杂动态障碍物,真实世界部署需进一步鲁棒性测试。

Relevance To Keywords:

  • Unify Models:Goal2Pixel将VLM推理与像素级导航统一,体现了模型统一的思想。
  • World Models:论文未显式构建世界模型,但ViKeyMem历史表示和可见性检查隐含了空间世界建模。
  • Representation Learning:语义嵌入和坐标感知损失属于表征学习范畴,用于适配预训练VLM。
  • Model-Based RL:论文使用局部规划器将像素预测转化为动作,可视为基于模型的控制,但未涉及强化学习训练。
  • 原生多模态大模型:直接使用预训练VLM作为核心,属于多模态大模型在具体任务上的应用。
  • 多模态大模型的理解和生成一体化:VLM同时理解指令和图像,并生成像素坐标(可视为一种生成)。
  • 表征学习:语义嵌入和坐标感知损失是表征学习的具体技术。
  • 世界模型:ViKeyMem通过可见性条件选择关键帧,隐含了环境空间结构的表征。
  • 强化学习:论文未使用强化学习,而是监督学习训练。
  • 后训练:论文在预训练VLM基础上进行微调(后训练),以适应导航任务。
Score: 54.0 / 27.8
Authors: Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen
Published: 2026-06-01
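TL;DR: This paper proposes an image reconstruction benchmark based on iterative multimodal dialogue and finds that the quality of the vision-language describer dominates reconstruction quality, while the benefit of iterative refinement depends on the generator's capability.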
摘要翻译

我们引入了图像重建游戏(Image Reconstruction Game),这是一个完全自动化的基准测试,其中视觉 - 语言模型在多轮交互中向图像生成器发出纠正指令,使累积的共同基础直接体现在生成的图像中。在七个图像类别上对两个描述者模型与两个生成器模型进行交叉测试,我们发现描述者是重建质量的主导因素,而生成器决定了迭代细化是有益还是有害。数学和几何图像构成了最大的挑战。描述者的 token 预算强烈影响收敛性:较短的预算产生更稀疏的首次生成图像,留有更大的可见改进空间;而较长的预算提高了绝对质量,但留下的修正余地较小。更强的描述者使用更丰富的修正词汇,涵盖空间、数值和结构类别,而较弱的描述者则集中在表面属性,倾向于在几轮交互后停止。人工验证显示,最佳自动评估器与人类偏好仅达到微弱至一般的一致性,且自动评分需要人工校准才能可靠使用。

Abstract

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 6.0/10 9.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 10.0/10 15.0
model-based RL 1.5 1.0/10 1.5

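Scoring rationale: MultiModal (10) and MLLM (8) are highly relevant because the paper's core is dialogue interaction with vision-language models. Visual Encoder (6) is implicit in the VLM but not examined in depth. Unify Models (5) reflects task-level rather than architectural unification. Tokenizer (3) is touched only through the token budget. World Models (3) and model-based RL (1) are weakly related to the dialogue benchmark theme, the latter almost not at all.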

关键词

Image Reconstruction Game, Iterative Multimodal Dialogue, Vision-Language Model, Image Generator, Common Ground, Corrective Instructions, Reconstruction Quality

深度分析

Chinese Title: 图像重建游戏:通过迭代多模态对话构建共同基础

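Summary: This paper proposes the Image Reconstruction Game, a fully automated benchmark in which a vision-language model (VLM) acts as the describer and issues multi-turn corrective instructions to an image generation model, so that the accumulated common ground is directly visible in the rendered image. Two describer models are crossed with two generator models and tested on seven image categories. The describer turns out to be the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps. Mathematical and geometric images are the most challenging. The describer's token budget strongly affects convergence: shorter budgets produce sparser initial renderings with more room for improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary (spatial, numeric, structural), while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores need human recalibration before they can be used reliably.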

Innovations:

  • 提出全自动化的图像重建游戏基准,无需人类参与,可重复评估多模态模型的迭代对话能力。
  • 将共同基础构建过程直接可视化为每轮生成的图像,使累积进展可观测。
  • 系统性地分离描述者与生成器的影响,通过交叉组合实验揭示各自对重建质量的作用。
  • 引入人类相似性评分研究,验证自动指标(如CLIP、DINO、LLM评判)与人类感知的一致性,发现需要重新校准。
  • 对描述者纠正词汇进行定性分析,揭示强/弱描述者在空间、数值、结构等类别上的差异。

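Methodology: The benchmark is a two-player dialogue game built on the clem framework. The describer (a VLM) sees the target image and the current rendering and issues natural-language corrective instructions; the generator (an image generation model) produces a new image from the accumulated dialogue history. A game lasts at most 10 turns, and the describer has a 200-token budget per turn. Two describer models (e.g., GPT-4V, LLaVA) are crossed with two generator models (e.g., Stable Diffusion, DALL-E). Evaluation uses automated similarity metrics (SSIM, LPIPS, DINO, CLIP) and human ratings, across seven image categories (natural, object, scene, function, geometry, chart, abstract).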

Key Results:

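  • The describer is the dominant factor in reconstruction quality; the generator determines whether iterative refinement helps.
  • Mathematical and geometric images are the hardest to reconstruct; describers struggle to perceive and express fine-grained differences.
  • A short token budget (200) gives sparse initial renderings but leaves more room for later improvement; a longer budget raises initial quality but yields smaller gains.
  • Stronger describers use a richer correction vocabulary (spatial, numeric, structural), while weaker describers concentrate on surface properties such as color and texture.
  • Automated metrics agree with human ratings only to a limited degree: the best LLM judge (e.g., GPT-4V as judge) reaches only slight-to-fair agreement, which on the Landis and Koch scale corresponds to a Cohen's kappa of at most about 0.4, so human recalibration is needed.
  • Iterative refinement helps in most pairings, but some generators degrade quality in later turns.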
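Agreement between an automated judge and human raters, as discussed in the results above, is commonly measured with Cohen's kappa. The sketch below computes it from two label sequences using the standard definition (observed agreement corrected by chance agreement); the example labels are made up and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    Raises ZeroDivisionError if both raters always use the same single label,
    because chance agreement is then 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical pairwise preferences: 1 = "render A is closer", 0 = "render B".
human = [1, 1, 0, 0]
judge = [1, 0, 0, 0]
kappa = cohens_kappa(human, judge)
```

Raw agreement here is 75%, yet kappa is lower because half of that agreement would be expected by chance, which is why kappa rather than raw accuracy is reported for judge validation.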

Tech Stack:

  • clem框架(游戏化模型评估框架)
  • 视觉语言模型(VLM):GPT-4V、LLaVA等
  • 图像生成模型:Stable Diffusion、DALL-E等
  • 自动相似性指标:SSIM、LPIPS、DINO、CLIP
  • 人类评分:Likert量表、Cohen's kappa一致性分析
  • 令牌预算控制(200 tokens/turn)
  • 多轮对话协议(<DESCRIPTION>标签、DONE信号)

Strengths:

  • 任务设计新颖,将共同基础构建与图像生成结合,直接可视化对话进展。
  • 全自动化,可重复,便于大规模基准测试。
  • 系统性地分离描述者和生成器的影响,提供因果分析。
  • 包含人类验证,揭示自动指标的局限性,具有实践指导意义。
  • 覆盖多种图像类别,测试了模型在不同难度下的表现。

Limitations:

  • 仅测试了两个描述者和两个生成器,样本有限,可能不具广泛代表性。
  • 令牌预算固定为200,未探索不同预算的影响(除简短分析外)。
  • 人类评分仅针对子集,且一致性较低,说明自动评判仍需改进。
  • 未深入分析描述者内部机制(如注意力、推理过程)。
  • 游戏仅限10轮,可能不足以让复杂图像完全收敛。

Relevance To Keywords:

  • 原生多模态大模型:论文中的描述者(VLM)即为原生多模态模型,测试其感知和语言生成能力。
  • 多模态大模型的理解和生成一体化:游戏同时测试理解(比较图像差异)和生成(发出纠正指令),体现一体化。
  • 表征学习:描述者需要学习目标与渲染之间的差异表征,生成器需要将语言表征映射到图像。
  • 世界模型:生成器可视为一种世界模型,根据语言指令模拟图像变化。
  • 强化学习:游戏可视为多轮决策过程,描述者通过纠正指令最大化最终相似度,类似强化学习中的策略优化。
  • 后训练:论文评估的是预训练模型的能力,但游戏本身可作为后训练(如RLHF)的评估环境。
Score: 54.0 / 27.8
Authors: Jiashuo Yu, Yao Yao, Boyu Chen, Alex Wang
Published: 2026-06-01
TL;DR: JenBridge proposes a modular framework utilizing text-visual conditioning and an LLM agent to generate coherent long-form video soundtracks across scene transitions, outperforming existing methods on the LVS Benchmark.
摘要翻译

我们解决生成高保真、长形式配乐的挑战,确保这些配乐在场景转换之间保持连贯性。现有的 AI 音乐系统主要面向短小、孤立的片段设计,缺乏确保叙事连贯性的机制。我们提出 JenBridge,一个模块化且可解释的框架,用于自适应长形式视频配乐,既能确保高保真音频生成,又能保证转换的自然性。核心架构是一个基于 Transformer 的生成模型,采用流匹配(flow-matching)目标进行训练,遵循两阶段范式:首先在大规模文本 - 音频语料库上进行预训练,以建立稳健的音乐先验;随后适应视频领域,利用双文本 - 视觉条件化实现精确的跨模态对齐。至关重要的是,为了在不同场景变化中实现长形式连贯性,JenBridge 引入了一种新颖的自适应转换机制。该系统配备了多功能的转换风格工具包,包含一种生成式转换方法,并独特地采用了一个大语言模型(LLM)智能体,该智能体充当导演,智能地为每个叙事转换选择最合适的过渡方式。为了严格评估该任务,我们提出了 LVS 基准,这是一个新的基准,包含一个精心构建的数据集以及关注整体性和转换感知评估的新型评估指标。在提出的基准上进行的广泛实验表明,JenBridge 在客观和主观指标上均显著优于现有方法,尤其是在转换自然度和整体叙事连贯性方面。JenBridge 代表了迈向完全自动化、专业级视频配乐的重要一步。

Abstract

We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 8.0/10 12.0
World Models 1.5 5.0/10 7.5
MLLM 1.5 6.0/10 9.0
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on multimodal audio generation conditioned on video and text, making 'MultiModal' (9) and 'Visual Encoder' (8) highly relevant. It employs an LLM agent for narrative continuity, moderately aligning with 'World Models' (5) and 'MLLM' (6). The framework unifies generation and transition logic ('Unify Models', 5), but lacks explicit focus on tokenizer design ('Tokenizer', 3) or reinforcement learning ('model-based RL', 0).

关键词

Long-Form Video Soundtracking, Adaptive Transition Mechanism, Text-Visual Conditioning, Flow-Matching Objective, LLM Agent, Narrative Coherence, Transformer-based Generative Model

深度分析

Chinese Title: JenBridge: 跨场景转换的自适应长视频配乐生成

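Summary: This paper addresses coherence across scene transitions in long-form video soundtrack generation and proposes the JenBridge framework. Existing AI music systems are designed mainly for short clips and lack mechanisms for narrative continuity. JenBridge has a modular, interpretable architecture whose core is a Transformer-based generative model trained with a flow-matching objective in two stages: pretraining on large-scale text-audio corpora to establish musical priors, then adapting to the video domain with dual text-visual conditioning. The key innovation is an adaptive transition mechanism: a toolkit of transition styles (including a generative transition method) and an LLM Agent that acts as a director and selects the most suitable transition. To evaluate the task, the authors propose the LVS Benchmark, which includes a curated dataset and new evaluation metrics. Experiments show that JenBridge significantly outperforms existing methods on both objective and subjective metrics, especially in transition naturalness and overall narrative coherence.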

Innovations:

  • 提出JenBridge端到端框架,实现长视频配乐生成,确保跨场景转换的连贯性和高保真度,模块化架构具有可解释性和可控性。
  • 提出自适应过渡机制,结合多种过渡风格工具包(包括基于ControlNet的生成式过渡)和LLM Agent作为导演进行上下文感知的创意选择。
  • 提出LVS Benchmark,首个专门针对长视频配乐的基准,包含丰富标注和整体评估协议,强调过渡质量评估。
  • 采用两阶段渐进训练:先预训练文本到音乐基础模型,再通过双条件(文本+视觉)适应视频域,并引入VMPT模块将视频字幕转换为结构化音乐提示。

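Methodology: JenBridge has three stages: 1) semantic video segmentation (PySceneDetect splits the long video into semantically coherent segments); 2) per-segment music generation with two-stage training: a text-to-music model based on MMDiT and flow matching is trained first, then a SigLIP visual encoder is added for dual-condition fusion, and VMPT (an LLM based on Qwen3-8B) converts video captions into music prompts; 3) adaptive transition, where the LLM Agent picks the most suitable transition style from the toolkit (which includes a ControlNet-based generative transition) to join the segment audio. Training uses a conditional flow-matching objective.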

Key Results:

  • JenBridge在LVS Benchmark上显著优于现有方法,在客观和主观指标上均取得最佳性能。
  • 在过渡自然性和整体叙事连贯性方面表现突出,解决了现有方法无法处理场景转换的问题。
  • 通过两阶段训练和双条件融合,实现了高保真音频生成与精确的跨模态对齐。
  • LLM Agent能够智能选择过渡风格,提升了长视频配乐的适应性和创造性。

Tech Stack:

  • Transformer (MMDiT - Multimodal Diffusion Transformer)
  • Flow Matching (流匹配目标)
  • Neural Audio Codec (EnCodec, 48kHz stereo)
  • T5-large / T5-base (文本编码器)
  • SigLIP (视觉编码器)
  • Qwen3-8B (VMPT模块)
  • ControlNet (生成式过渡)
  • PySceneDetect (视频分割)
  • LLM Agent (大语言模型代理)

Strengths:

  • 首次系统解决长视频配乐中的场景转换连贯性问题,具有实际应用价值。
  • 模块化可解释架构,便于用户控制和干预生成过程。
  • 自适应过渡机制结合LLM Agent,实现了智能化的创意决策。
  • 提出了专门的基准和评估协议,填补了该领域评估空白。
  • 两阶段训练策略有效利用大规模文本-音频数据,降低视频-音乐配对数据需求。

Limitations:

  • 依赖视频分割的准确性,若分割不当可能影响后续生成。
  • LLM Agent的决策质量受限于其训练数据和推理能力,可能在某些复杂叙事场景下不够精准。
  • 生成式过渡(ControlNet)可能引入额外计算开销,影响实时性。
  • 实验仅在电影预告片等特定类型视频上验证,泛化性有待进一步检验。
  • 未详细讨论不同过渡风格的具体实现细节和参数控制。

Relevance To Keywords:

  • 原生多模态大模型:JenBridge使用MMDiT处理文本和视觉模态,属于多模态生成模型。
  • 多模态大模型的理解和生成一体化:框架同时具备视频理解(分割、视觉编码)和音乐生成能力,实现理解与生成融合。
  • 表征学习:通过神经音频编解码器、文本编码器、视觉编码器学习多模态表征。
  • 世界模型:虽然不直接构建世界模型,但通过LLM Agent模拟导演决策,隐含对叙事世界的理解。
  • 强化学习:论文未明确使用强化学习,但LLM Agent的决策可视为一种策略选择,未来可结合强化学习优化。
  • 后训练:两阶段训练中的第二阶段(视频域适应)属于后训练范畴。
Score: 54.0 / 27.8
Authors: Jens U. Kreber, Lukas Mack, Joerg Stueckler
Published: 2026-06-01
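TL;DR: This paper proposes an object-centric world model based on Gaussian splatting that predicts action-conditional dynamics of rigid objects and is evaluated in model-predictive control tasks.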
摘要翻译

世界模型使智能体能够预测其动作对环境的影响。本文提出多刚体高斯世界模型(MRO-GWM),这是一种新颖的模型,用于学习 3D 空间中刚体的动作条件动力学。通过采用物体中心高斯来表示场景,我们能够表示任意物体形状及多物体场景。我们开发了一种新颖的时空变换器架构,该架构根据物体高斯的历史序列和未来动作预测未来的刚体运动。物体在规范坐标系中由其高斯表示,这使得可以将物体运动描述为刚体变换。我们的模型在多视角重建数据上进行训练,这要求模型能够处理因遮挡导致的物体部分观测。我们在由典型家用物体构成的合成数据集上分析了该方法的预测性能,这些数据集涉及多物体动力学及与机器人末端执行器的交互。我们还评估了该模型在模拟环境中用于非抓取操纵的模型预测控制中的表现。

Abstract

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.

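Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 6.0/10 9.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 10.0/10 15.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 5.0/10 7.5
model-based RL 1.5 8.0/10 12.0

Scoring rationale: The paper centers on world models and action-conditional prediction, and is closely tied to the model-predictive control used in model-based RL. It involves multi-view visual processing (Visual Encoder, MultiModal) but uses neither a discrete tokenizer nor a large language model (MLLM); Unify Models is reflected only in the coupling of representation and dynamics.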
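The rows of the scoring table are consistent with a simple rule: each keyword score is weight times relevance, and the paper's total is the sum of the keyword scores. The sketch below reproduces that arithmetic for this paper. That the report computes its totals this way is an assumption inferred from the numbers, and the function names are hypothetical.

```python
# Relevance ratings (0-10) copied from the scoring table for this paper.
relevance = {
    "Unify Models": 6.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 5.0,
    "World Models": 10.0,
    "MLLM": 1.0,
    "MultiModal": 5.0,
    "model-based RL": 8.0,
}
WEIGHT = 1.5  # every keyword in the table carries the same weight


def keyword_scores(relevance, weight=WEIGHT):
    """Per-keyword score, assumed to be weight * relevance."""
    return {name: weight * value for name, value in relevance.items()}


def total_score(relevance, weight=WEIGHT):
    """Paper total, assumed to be the sum of the per-keyword scores."""
    return sum(keyword_scores(relevance, weight).values())


print(total_score(relevance))  # 54.0
```

The result matches the 54.0 shown on the Score line for this paper.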

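Keywords

World Models, Gaussian Splatting, Object-Centric, Action-Conditional, Model-Predictive Control, Rigid Objects, Spatio-temporal Transformer

In-depth analysis

Chinese Title: 面向刚性物体的动作条件与物体中心高斯溅射世界模型学习

Summary: This paper proposes MRO-GWM, a multi rigid object Gaussian world model that learns the action-conditional dynamics of rigid objects in 3D scenes. The model represents the scene with object-centric Gaussian splatting, which handles arbitrarily shaped objects and multi-object scenes. The authors design a novel spatio-temporal transformer that predicts future rigid-body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, so motion is described as a rigid-body transformation. The model is trained on synthetic datasets of typical household objects with multi-object dynamics and robot end-effector interactions, and it must cope with partial observations caused by occlusion. Experiments evaluate prediction performance and use the model for non-prehensile manipulation via model-predictive control in simulation. The main contributions are: an object-centric Gaussian splatting scene representation that encodes the history of pose observations as per-object rigid-body transformations; a new transformer architecture that combines temporal and spatial attention blocks; and the integration of the world model into model-predictive control for non-prehensile manipulation.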

Innovations:

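  • Uses object-centric Gaussian splatting as the scene representation and encodes the sequence of past pose observations as per-object rigid-body transformations.
  • Designs a novel spatio-temporal transformer with temporal attention, spatial attention, and a newly proposed spatio-temporal attention layer, which handles object shape and inter-object contact together.
  • Integrates the world model into model-predictive control and performs non-prehensile manipulation in multi-object scenes in simulation.
  • Supports two input modes, either raw 2D Gaussian attributes or compressed latent features of anchors, with an MLP shared across scenes for consistency.

Methodology: The paper uses object-centric Gaussian splatting (ObjectGS) as the scene representation: each Gaussian is associated with one object, and pose history is encoded through rigid-body transformations. A spatio-temporal transformer takes the Gaussian sequences of past objects and the end effector together with future end-effector poses as input, and outputs future object poses. The transformer contains spatial grid pooling, spatial attention blocks, temporal attention blocks, and the newly proposed spatio-temporal attention layer. Training uses multi-view reconstruction data, so the model must handle partial observations caused by occlusion. Training and evaluation use synthetic datasets in simulation, and non-prehensile manipulation is tested through model-predictive control.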
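Describing object motion as a rigid-body transformation of canonical-frame Gaussians can be made concrete with a small example. The sketch below applies a rotation R and translation t to one 3D Gaussian: the mean maps to R·mean + t and the covariance to R·cov·Rᵀ. This is the standard rule for transforming a Gaussian, not code from the paper; the function names and the example pose are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def transform_gaussian(mean, cov, rotation, translation):
    """Map one Gaussian from the canonical object frame to the world frame.

    mean -> R @ mean + t, cov -> R @ cov @ R^T (R is a 3x3 rotation).
    """
    new_mean = [sum(rotation[i][k] * mean[k] for k in range(3)) + translation[i]
                for i in range(3)]
    new_cov = matmul(matmul(rotation, cov), transpose(rotation))
    return new_mean, new_cov


# Hypothetical object pose: 90-degree rotation about z, then a shift along x.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 0.0, 0.0]
mean, cov = transform_gaussian(
    [1.0, 0.0, 0.0],
    [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    R, t)
print(mean)  # [1.0, 1.0, 0.0]
print(cov)
```

Because every Gaussian of an object shares the same (R, t), predicting one pose per object is enough to move the whole object, which is what lets a world model output poses rather than per-Gaussian updates.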

Key Results:

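  • On synthetic datasets, the model accurately predicts future poses of rigid objects in multi-object scenes, including under occlusion and object-object interaction.
  • The proposed spatio-temporal transformer outperforms several baselines and ablated variants in prediction performance.
  • In model-predictive control, the model solves two non-prehensile manipulation tasks (such as pushing objects) and remains effective in multi-object scenes.
  • The model generalizes to unseen object shapes.

Tech Stack:

  • Object-centric Gaussian splatting (ObjectGS)
  • ScaffoldGS (anchor-based scene representation)
  • 2D Gaussian Splatting
  • Spatio-temporal transformer
  • Spatial attention
  • Temporal attention
  • Spatio-temporal attention layer
  • Model-predictive control (MPC)
  • MLP (multilayer perceptron)
  • SE(3) rigid-body transformations

Strengths:

  • The object-centric Gaussian representation flexibly handles arbitrarily shaped rigid objects and multi-object scenes.
  • The spatio-temporal transformer fuses temporal sequence information with spatial structure and captures inter-object contact and the effects of occlusion.
  • Two input modes (raw Gaussians or anchor features) accommodate different accuracy and efficiency needs.
  • Successful use in model-predictive control demonstrates the practical value of the world model for robot manipulation.
  • Good generalization on synthetic datasets, including to unseen objects.

Limitations:

  • Currently relies on ground-truth poses and segmentation masks from simulation; not yet integrated with real-world object segmentation and pose estimation pipelines.
  • Handles rigid objects only, not deformable objects or fluids.
  • Training requires multi-view reconstruction data, which may be costly to obtain in practice.
  • Prediction performance may be limited by the resolution of the Gaussian representation and by scene complexity.
  • No validation on a real robot platform.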

Relevance To Keywords:

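  • World models: the core of the paper is learning an action-conditional world model that predicts environment dynamics.
  • Representation learning: object-centric Gaussian splatting serves as the scene representation, and a transformer learns the latent dynamics.
  • Model-based RL: the model is used in model-predictive control, a model-based method.
  • Post-training: not addressed explicitly, though the world model could serve as a pretrained component for downstream tasks.
  • Multimodal large models: not addressed, though Gaussian splatting can be combined with visual input.
  • Native multimodal large models: not directly relevant.
  • Unified understanding and generation in multimodal large models: not directly relevant.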
Score: 52.5 / 27.8
Authors: Jingyun Liang, Min Wei, Shikai Li, Yizeng Han, Hangjie Yuan, Lei Sun, Weihua Chen, Fan Wang
Published: 2026-06-01
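TL;DR: This paper proposes a render-free video diffusion model based on mesh tokenization that processes video tokens and 3D motion tokens jointly, achieving high-quality human motion control without rendered 2D guidance.
Abstract (Chinese translation)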

扩散模型在视频生成领域已展现出显著的成功。然而,此类模型是否真正意识到视觉观察背后的 3D 结构,而不仅仅是简单地重现合理的 2D 投影,仍然是一个开放性问题。在这项工作中,我们通过人体运动控制(human motion control)来探究这一问题,该任务需要对 3D 人体几何、运动、相机视角(camera viewpoint)和场景上下文进行精确建模。与依赖渲染的 2D 运动引导视频(rendered 2D motion guidance videos)的先前的方法不同,我们提出一个无渲染(render-free)框架,该框架直接在压缩的 3D 人体网格标记(compressed 3D human mesh tokens)上对视频生成进行条件化。这种表示法保留了完整的 3D 几何信息,同时启用了统一的基于标记(token-based)的生成管道,该管道在基于 DiT 的架构中联合处理视频标记(video tokens)与运动标记(motion tokens)。这种设计架构要求模型在视频生成过程中联合推理外观(appearance)、3D 结构以及相机视角。实验结果表明,该方法在人体运动控制基准上展现出强劲的性能,同时减少了由依赖视角的 2D 引导(view-dependent 2D guidance)以及编辑过程中轨迹 - 姿态不匹配(trajectory-pose mismatches)所诱导的伪影。这些发现表明,当配备网格标记化(mesh tokenization)时,视频扩散模型能够更好地捕捉复杂的 3D 人体结构及其与周围环境的交互。

Abstract

Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.

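Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 8.0/10 12.0
Tokenizer 1.5 9.0/10 13.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 7.0/10 10.5
model-based RL 1.5 2.0/10 3.0

Scoring rationale: The paper's core contributions are mesh tokenization and a unified token pipeline, so Tokenizer and Unify Models score high. Combining video with 3D geometry falls under MultiModal. The paper does not touch the core definitions of MLLM or World Models. Although the task is control, the method is generative rather than RL, so model-based RL scores low. No designated expert appears in the author list, so no bonus is applied.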

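Keywords

3D-Aware Video Diffusion Models, Mesh Tokenization, Human Motion Control, Render-Free Framework, 3D Human Mesh Tokens, DiT-based Architecture, Unified Token-based Generation

In-depth analysis

Chinese Title: 迈向3D感知的视频扩散模型:基于网格令牌化的无渲染人体运动控制

Summary: This paper asks whether video diffusion models are truly aware of 3D structure, using human motion control as the test task. Existing methods depend on rendered 2D guidance signals (such as pose maps and skeleton videos), which are view-dependent and lose information. The authors propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. Motion is first represented with the SMPL parametric body model and decomposed into a trajectory and a body-pose sequence; a VQ-VAE then compresses each mesh into compact discrete tokens, which are combined with trajectory embeddings to form motion tokens; finally, the motion tokens are injected into video generation through cross-attention in a DiT architecture. Experiments show strong results on human motion control benchmarks, greater robustness under viewpoint changes and motion editing, and fewer artifacts caused by 2D guidance and trajectory-pose mismatch. The paper concludes that video diffusion models equipped with mesh tokenization better capture complex 3D human structure and its interaction with the environment.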

Innovations:

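  • First to propose a render-free framework that conditions video generation directly on 3D human mesh tokens, avoiding the information loss and view-dependent bias introduced by rendering.
  • Introduces a mesh tokenization pipeline that decomposes human motion into trajectory and body pose and compresses 3D meshes into compact discrete tokens with a VQ-VAE, giving a geometry-preserving, disentangled motion representation.
  • Processes video tokens and motion tokens in a unified DiT architecture with cross-attention injection, forcing the model to reason jointly about appearance, 3D structure, and camera viewpoint.
  • Offers a more flexible motion-editing interface with combined trajectory and pose control, reducing artifacts such as foot floating and penetration during editing.

Methodology: The pipeline has five steps: 1) represent the 3D human mesh sequence with the SMPL parametric body model and decompose it into a global trajectory and body pose; 2) normalize each mesh (translate to the origin, remove global orientation), feed it to a fully convolutional mesh autoencoder to extract a low-dimensional latent vector, and quantize that vector with a VQ-VAE into discrete tokens; 3) encode the reference image with an image encoder into visual tokens and concatenate them with the noise tokens; 4) inject motion tokens and text tokens into the DiT backbone through cross-attention; 5) train with a diffusion loss and, at inference, generate video by iterative denoising from noise.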
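The quantization in step 2 can be illustrated with the core operation of any VQ-VAE: replace a continuous latent vector with the index of its nearest codebook entry. The sketch below shows only that lookup, with a made-up two-dimensional codebook. The paper's encoder, codebook size, and training losses are not reproduced here, and all names are hypothetical.

```python
def quantize(latent, codebook):
    """Return (index, code) of the codebook entry nearest to `latent`.

    Distance is squared Euclidean, the usual choice in vector quantization.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))
    return index, codebook[index]


# Hypothetical 3-entry codebook and three per-frame mesh latents.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]

# Each latent becomes one discrete token (a codebook index).
tokens = [quantize(z, codebook)[0] for z in latents]
print(tokens)  # [1, 2, 0]
```

The resulting integer sequence is what a token-based generator can consume alongside video tokens; a real pipeline would also embed each index back into a vector before cross-attention.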

Key Results:

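  • Strong performance on human motion control benchmarks, with video quality better than rendering-based methods.
  • Under viewpoint changes, generated videos stay consistent, with fewer view-dependent artifacts from 2D guidance.
  • During motion editing (for example, changing trajectory or pose), structural artifacts such as foot floating, penetration, and inconsistent motion are effectively avoided.
  • Ablations confirm the advantage of the mesh-token representation over 2D/2.5D representations and the benefit of decomposing trajectory and pose.

Tech Stack:

  • SMPL parametric body model
  • VQ-VAE (vector-quantized variational autoencoder)
  • Fully convolutional mesh autoencoder
  • DiT (Diffusion Transformer) architecture
  • Cross-attention
  • Diffusion models (DDPM/DDIM)

Strengths:

  • Render-free 3D mesh tokenization addresses the information loss of 2D guidance at its root.
  • Decomposing motion into trajectory and pose gives a more flexible and robust editing interface.
  • Unifying multimodal tokens in a DiT architecture keeps the design simple and extensible.
  • Thorough experiments verify the model's advantages in 3D awareness, viewpoint change, and editing.

Limitations:

  • Depends on SMPL, so it cannot handle non-human or complex object motion.
  • Mesh tokenization requires a pretrained VQ-VAE, which may introduce quantization error.
  • Focuses only on human motion control; not extended to more general 3D scene understanding.
  • Computational cost may be high, especially for mesh encoding and quantization.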

Relevance To Keywords:

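  • Unify Models: proposes a unified tokenized framework for video, motion, and text, in line with the trend toward unified multimodal modeling.
  • World Models: 3D mesh tokenization lets the model learn 3D structure, which helps toward world models with physical-world understanding.
  • Representation Learning: the core contribution is a compact discrete representation of 3D human meshes, which falls under representation learning.
  • Model-Based RL: no RL is involved, but 3D awareness could potentially serve as a state representation in model-based RL.
  • Native multimodal large models: the method builds on DiT and fuses image, video, motion, and text modalities, consistent with the native multimodal direction.
  • Unified understanding and generation in multimodal large models: the paper involves both video generation and motion understanding but emphasizes generation; understanding appears only indirectly through conditional control.
  • Post-training: not discussed; the paper focuses on model architecture and training method.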
Score: 52.5 / 27.8
Authors: Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi
Published: 2026-06-01
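TL;DR: This paper shows that MLLM spatial reasoning suffers from a spatial lexical bias originating in language-side channels, and that a lightweight LLM-only DPO update markedly improves robustness on spatial tasks.
Abstract (Chinese translation)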

多模态大语言模型(MLLMs)在空间多项选择题上表现仍不稳定,其失败通常归因于视觉信息注意力不足。在这项工作中,我们识别出一种互补的失败模式,即空间词汇偏差:在答案选项中加入空间关系词会吸引模型的决策,使得新加入的选项更有可能被选中。基于九个开源权重的 MLLMs,我们发现这一现象普遍存在。特别是,模型能够正确回答二元空间问题,但一旦将第三个错误的空间选项加入答案集,模型却一致地选择该错误选项。我们将此类二元稳定但三元脆弱的案例隔离为诊断示例,并利用机制可解释性工具,揭示出很大一部分失败实际上起源于语言侧而非视觉侧:视觉注意力分析和残差流探针显示,在这些失败案例中,正确的空间关系在内部仍可用;而无关选项控制、激活修补和稀疏组件干预则将偏差追踪至特定的 LLM 侧通道和神经元。基于这一发现,我们表明,在微小的单对象对合成数据上执行轻量级仅 LLM 的 DPO(直接偏好优化)更新可缓解该偏差,在合成数据上将四路鲁棒准确率提升高达 100 分,并在更广泛的评估数据集 WhatsUp、SpatialMQA-Direct 和 VSR 上分别提升 68.0、32.6 和 20.1 分。

Abstract

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure instead originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR.

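Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 6.0/10 9.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 10.0/10 15.0
MultiModal 1.5 10.0/10 15.0
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper centers on diagnosing spatial-reasoning bias in multimodal large language models, so MLLM and MultiModal receive full marks (10.0). The visual encoder is used in the attention analysis but is not a core contribution (6.0). Unify Models relates to multimodal unification without a focus on architectural unification (5.0). Tokenizer, World Models, and model-based RL are not covered and score low (1.0-2.0). No designated expert appears in the author list (Wang Yang is not Yang Shi).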

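Keywords

Spatial Lexical Bias, Multimodal Large Language Model, Spatial Reasoning, Mechanistic Interpretability, LLM-side Bias, DPO Update, Robust Accuracy

In-depth analysis

Chinese Title: 多模态大语言模型空间推理中空间词汇偏差的机制诊断

Summary: This paper finds a new failure mode of multimodal large language models (MLLMs) on spatial multiple-choice questions, spatial lexical bias: when a spatial relation word is added to the answer options, the model tends to pick the new option even though it answers the binary version correctly. The authors build "binary-stable but ternary-fragile" (BSTF) cases as a diagnostic tool, and use attention analysis and residual-stream probes to show that the correct visual relation remains recoverable inside the model; the failure stems from lexical-semantic interference on the language side rather than from lost visual information. Activation patching, sparse component interventions, and related interpretability methods then localize the bias to specific layers and neurons on the LLM side. Finally, a lightweight DPO fine-tune of the language decoder alone (on single-object-pair synthetic data) substantially mitigates the bias, raising four-way robust accuracy by up to 100 percentage points on synthetic data and by 68.0, 32.6, and 20.1 percentage points on the public datasets WhatsUp, SpatialMQA-Direct, and VSR.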

Innovations:

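  • First to identify and systematically define "binary-stable but ternary-fragile" (BSTF) cases, exposing a pattern in which MLLMs fail systematically in spatial reasoning once a spatial option is added.
  • Shows through attention analysis and residual-stream probes that the correct visual relation is still recoverable at failure time, attributing the failure to language-side spatial lexical bias rather than lost visual information.
  • Uses causal interpretability tools such as activation patching and sparse component interventions to localize the bias to specific LLM-side residual-stream channels and MLP neurons.
  • Proposes a lightweight DPO post-training method that updates only the language decoder; minimal synthetic data is enough to mitigate the bias, and the fix generalizes to several public benchmarks.

Methodology: 1. Build controlled synthetic datasets (Sphere-Cube Raw, Sphere-Dog Raw, Sphere-Dog Outdoor), each covering four spatial relations (left, right, front, behind), generate binary, ternary, and four-way option sets, and evaluate every ordering of each option set to remove position bias. 2. Define BSTF cases: binary robust accuracy of 100% and a drop of at least 80 percentage points in ternary accuracy after one spatial option is added; 55 cases pass this filter. 3. Visual recoverability analysis: visualize attended regions with attention maps and train linear probes on the final-layer residual stream to predict the correct spatial relation. 4. Language-side mechanism analysis: irrelevant-option controls (comparing spatial distractors with unrelated words such as "violin"), activation patching (swapping residual-stream states between binary-correct and ternary-incorrect prompts), and sparse component interventions (identifying and suppressing specific residual-stream channels and MLP neurons). 5. Fix: build a preference dataset from single-object-pair synthetic data and fine-tune the LLM decoder with LoRA-DPO while freezing the visual encoder and projector.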
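Steps 1 and 2 rest on robust accuracy: a sample counts as correct only if the model answers correctly under every ordering of the options. The sketch below implements that metric and the BSTF filter as described above, using a toy stand-in for a model. `toy_biased_model`, the question text, and the function names are hypothetical and only mimic the reported behavior.

```python
from itertools import permutations


def robust_correct(answer_fn, question, options, gold):
    """True only if answer_fn returns `gold` for every ordering of `options`."""
    return all(answer_fn(question, list(order)) == gold
               for order in permutations(options))


def robust_accuracy(answer_fn, samples):
    """Percentage of (question, options, gold) samples answered robustly."""
    hits = [robust_correct(answer_fn, q, opts, gold) for q, opts, gold in samples]
    return 100.0 * sum(hits) / len(hits)


def is_bstf(binary_acc, ternary_acc):
    """BSTF filter: binary accuracy of 100% and a drop of at least 80 points."""
    return binary_acc == 100.0 and binary_acc - ternary_acc >= 80.0


def toy_biased_model(question, options):
    """Stand-in for an MLLM: right on binary sets, but drawn to 'behind'."""
    return "behind" if "behind" in options else "left"


q = "Where is the sphere relative to the cube?"
binary = [(q, ["left", "right"], "left")]
ternary = [(q, ["left", "right", "behind"], "left")]
b = robust_accuracy(toy_biased_model, binary)
t = robust_accuracy(toy_biased_model, ternary)
print(b, t, is_bstf(b, t))  # 100.0 0.0 True
```

Evaluating all orderings is what separates a lexical pull toward one word from a plain preference for a position such as "option A".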

Key Results:

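  • Spatial lexical bias is widely observed across 9 open-weight MLLMs; 7 of them exhibit BSTF cases (55 in total).
  • Attention maps show that binary-correct and ternary-incorrect prompts attend to similar visual regions; linear probes on the final-layer residual stream recover the correct spatial relation almost perfectly, even when the model picks the wrong option.
  • Irrelevant options (such as "violin") cause only moderate degradation, whereas specific spatial options (such as "behind") cause systematic collapse, indicating that the bias comes from the semantic structure of spatial vocabulary.
  • Activation patching shows that swapping residual-stream states in middle-to-late layers can rescue or break predictions; sparse interventions identify specific residual-stream channels and MLP neurons whose suppression partially restores correct predictions.
  • LLM-only LoRA-DPO fine-tuning raises four-way robust accuracy by up to 100 percentage points on synthetic data and by 68.0, 32.6, and 20.1 percentage points on WhatsUp, SpatialMQA-Direct, and VSR.

Tech Stack:

  • Multimodal large language models: LLaVA-1.5-7B, LLaVA-v1.6-Vicuna-7B, LLaVA-v1.6-Mistral-7B, InternVL2-1B, InternVL2.5-1B, Qwen2-VL-2B, Qwen3-VL-8B, and others
  • Synthetic dataset rendering: controlled scenes generated with 3D rendering tools
  • Evaluation metric: sample-level robust accuracy (correct under every option ordering)
  • Interpretability tools: attention-map visualization, linear probes (residual stream), activation patching, sparse component intervention
  • Post-training method: Direct Preference Optimization (DPO) with LoRA (low-rank adaptation)
  • Frameworks: PyTorch, HuggingFace Transformers

Strengths:

  • Identifies a novel and important failure mode (spatial lexical bias), challenging the common view that spatial-reasoning failures stem from lost visual information.
  • Rigorous method: strict position-bias control (all orderings) and BSTF filtering make the diagnosis reliable.
  • Combines behavioral experiments with mechanistic interpretability tools (activation patching, sparse interventions), giving causal rather than merely correlational evidence.
  • The fix is simple and effective: it updates only the language side, uses very little synthetic data, and generalizes to several real datasets.

Limitations:

  • The BSTF filter is strict (binary 100% and a drop of at least 80 pp) and may miss weaker but still meaningful bias patterns.
  • The interpretability analysis mainly targets LLaVA-series models; bias mechanisms in other architectures may differ.
  • The fix is applied only on the LLM side; jointly tuning the visual encoder or projector is not explored.
  • The synthetic scenes are simple (single object pairs); real-world factors such as object count, occlusion, and lighting are not well covered.
  • DPO training data comes only from single object pairs; generalization to more complex spatial relations (such as "between") is unknown.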

Relevance To Keywords:

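  • Unify Models: not addressed directly, though the MLLMs studied fall under unified vision-language understanding.
  • World Models: not addressed; spatial reasoning can be seen as part of a world model, but the paper neither builds nor uses one.
  • Representation Learning: probes analyze residual-stream representations and the recoverability of internal information, which is representation analysis; no new representation learning method is proposed.
  • Model-Based RL: not relevant.
  • Native multimodal large models: the models used (such as LLaVA and Qwen-VL) are treated here as native multimodal models.
  • Unified understanding and generation in multimodal large models: the paper covers understanding only (spatial reasoning), not generation.
  • Reinforcement learning: not directly relevant, though DPO is a preference-based optimization method that can be viewed as an RL variant within post-training.
  • Post-training: the paper uses DPO for post-training and updates only the language decoder, a lightweight post-training method.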
Score: 52.5 / 27.8
Authors: Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter, Florian Boudin, Akiko Aizawa
Published: 2026-06-01
TL;DR: This paper investigates the table-chart gap in scientific claim verification using Multimodal LLMs, revealing that chart information is encoded but fails to be routed to the prediction stage unlike table information.
Abstract (Chinese translation)

Multimodal LLMs (多模态大语言模型) 正日益被用于协助科学同行评审,其核心要求是验证论文中的主张是否由其证据支持。先前工作表明,当证据是表格时,模型在此任务上的表现显著优于当证据是相同底层数据的图表时。这引发了一个问题:模型是否无法从图表中提取信息,或者它们提取了信息但在形成预测时未能使用?我们通过层线性探测 (layer-wise linear probing) 和注意力分析 (attention analysis) 在三个开源视觉语言模型 (open-weight VLMs) 上研究这一问题,这些模型处理代表相同底层数据的表格和图表证据。我们发现一致的证据支持后者。图表信息被编码在模型的中间表示 (intermediate representations) 中,但未到达预测位置 (prediction position),而表格则不存在这种差距,且在所有测试条件下均成立。注意力分析进一步揭示,这种脱节在模型家族中呈现出两种架构上不同的形式 (architecturally distinct forms)。这些发现将表格 - 图表差距 (table-chart gap) 重新定义为预测时编码视觉信息的路由 (routing) 失败,而非编码本身的失败。

Abstract

Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises the question of whether models fail to extract information from charts, or do they extract it but fail to use it when forming their prediction? We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs over table and chart evidence, representing the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models' intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 8.0/10 12.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 9.0/10 13.5
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on Multimodal LLMs (MLLM) and MultiModal understanding for scientific claim verification, resulting in high scores for these keywords. It analyzes Visual Encoder representations regarding how visual info is handled, warranting a high score on Visual Encoder. Tokenizer, World Models, and model-based RL are not discussed, resulting in low scores. Unify Models is moderately related due to the multimodal nature but not the primary focus. No expert authors from the specified list were found.

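Keywords

Multimodal LLMs, Scientific Claim Verification, Table-Chart Gap, Visual Information Routing, Layer-wise Linear Probing, Attention Analysis, Vision-Language Models

In-depth analysis

Chinese Title: 编码但未路由:解释科学声明验证中的表格-图表差距

Summary: This paper studies why multimodal LLMs verify scientific claims better from table evidence than from chart evidence. Using layer-wise linear probing and attention analysis, the authors run experiments on three open-weight vision-language models (Qwen2.5-VL-7B, Qwen2.5-VL-32B, InternVL3-8B) on the SciTabAlign+ dataset, which contains semantically equivalent table and chart evidence. Chart information is encoded in the models' intermediate representations but is not effectively routed to the prediction position, whereas table information is. Attention analysis further reveals different failure modes in the two model families: Qwen models attend to image tokens far below baseline in the final layers, while InternVL keeps near-proportional attention yet still fails to integrate the information. Chain-of-thought prompting experiments confirm the difference between the two failure modes. The conclusion is that the table-chart gap stems from a failure to use encoded information, not a failure of perception.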

Innovations:

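  • First to separate encoding failure from routing failure for chart information using layer-wise linear probing, showing that chart information is encoded but not routed to the prediction position.
  • Finds that chart information is distributed more broadly across intermediate representations than table information yet fails to concentrate at the prediction position, an inverted pattern.
  • Reveals distinct failure mechanisms in the two model families: insufficient attention routing in Qwen models and failed post-attention integration in InternVL.
  • Chain-of-thought experiments further confirm that the two failure modes respond differently: Qwen models degrade when forced to describe content they did not attend to, while InternVL benefits.
  • Uses the content-controlled SciTabAlign+ dataset, in which tables and charts share the same underlying data, to isolate format sensitivity.

Methodology: The paper trains a linear classifier at each layer (linear probing) with leave-one-out cross-validation to measure how decodable task-relevant information is from the hidden states. Two probe inputs are used: the last token and a mean pool over tokens. Attention analysis computes the share of the last token's attention that falls on image tokens and normalizes it against a proportional baseline. A chain-of-thought ablation asks the model to describe the chart before predicting the label. Metrics are direction-corrected AUROC, macro F1, and accuracy.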
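The attention measurement can be sketched as a ratio: the fraction of the last token's attention that lands on image tokens, divided by the fraction of the sequence that image tokens occupy. A value of 1.0 then means attention proportional to token count. The exact definition of the paper's proportional baseline is an assumption here, and the function name and numbers are hypothetical.

```python
def normalized_image_attention(attn_row, image_positions):
    """Attention share on image tokens relative to a proportional baseline.

    attn_row: attention weights from the last token to every position.
    image_positions: indices of the image tokens in the sequence.
    Assumed baseline: the share of positions that are image tokens.
    """
    total = sum(attn_row)
    image_share = sum(attn_row[i] for i in image_positions) / total
    baseline = len(image_positions) / len(attn_row)
    return image_share / baseline


# Hypothetical 8-token sequence whose first four tokens are image tokens.
attn = [0.05, 0.05, 0.0, 0.0, 0.2, 0.2, 0.25, 0.25]
ratio = normalized_image_attention(attn, image_positions=[0, 1, 2, 3])
print(round(ratio, 3))  # 0.2
```

Under this reading, figures such as "4-11% of the proportional baseline" correspond to ratios of roughly 0.04 to 0.11, and "93%" to a ratio near 0.93.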

Key Results:

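  • Mean-pool probes reach higher AUROC on chart variants (84-89%) than on tables (65-70%), while last-token probes show the opposite, indicating that chart information is widely distributed but not concentrated at the prediction position.
  • Mean-pool probe accuracy exceeds the model's own inference accuracy on every chart variant but not on tables; the difference is significant (McNemar test, p<0.01).
  • Qwen-family models give image tokens only 4-11% of the proportional baseline in the final layers; InternVL3-8B stays at 93% yet still shows a performance gap.
  • Chain-of-thought prompting lowers Qwen performance (macro F1 down 3.3-13.5) and raises InternVL performance (+4.6).
  • Probes on claim text alone reach 64-67% AUROC, well below the chart mean-pool probes, ruling out leakage from the claim text.

Tech Stack:

  • Linear probing
  • L-BFGS optimizer
  • ℓ2 regularization (C=1.0)
  • Direction-corrected AUROC
  • McNemar test
  • Attention share normalized to a proportional baseline
  • Chain-of-thought prompting
  • SciTabAlign+ dataset
  • Qwen2.5-VL-7B/32B-Instruct
  • InternVL3-8B

Strengths:

  • Rigorous experimental design; a content-controlled dataset isolates the effect of format.
  • Internal-representation analysis exposes the root cause of failure instead of only surface performance.
  • Covers two model families and finds different failure mechanisms, which strengthens the generality of the conclusion.
  • Combines attention analysis with chain-of-thought experiments to test the routing-failure hypothesis from several angles.
  • Provides an actionable diagnostic method that points to ways of improving multimodal models.

Limitations:

  • Only three open-weight models are studied, which may not represent all multimodal large models.
  • Linear probing assumes linearly separable representations and may miss nonlinear features.
  • The dataset is small (162 claims), which may affect statistical stability.
  • Diagnoses the problem but does not explore how to repair the routing failure.
  • Chart variants cover only four types, not all common chart forms.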

Relevance To Keywords:

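  • Native multimodal large models: the models studied (Qwen2.5-VL, InternVL3) are native multimodal architectures that process images and text directly.
  • Representation learning: linear probes analyze how chart information is encoded in intermediate representations, which relates to decodability in representation learning.
  • World models: not mentioned directly; scientific claim verification can be viewed as reasoning over world knowledge and relates to how models understand data formats.
  • Model-based RL: not addressed; the paper involves no reinforcement learning.
  • Post-training: not addressed directly; chain-of-thought prompting is an inference-time strategy applied to already trained models, and the paper examines how it affects different models.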
Score: 51.0 / 27.8
Authors: Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao
Published: 2026-06-01
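TL;DR: OpenWebRL proposes an online multi-turn reinforcement learning framework for visual web agents that reaches open-source state-of-the-art performance on live-web benchmarks with only a small number of initialization trajectories.
Abstract (Chinese translation)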

构建具备能力的视觉网页代理需要长程推理、精确 grounding(定位)以及与动态真实网站的稳健交互。尽管进展迅速,最先进的系统仍主要属于专有领域,而开放代理(agent)仍严重依赖于对大规模精心策划的网页轨迹(trajectory)的监督后训练。这种依赖造成了一个主要的可扩展性瓶颈:高质量演示的收集成本高昂,且静态数据集对多样且不断变化的开放网络覆盖有限。虽然在线强化学习(RL)在基于文本的代理上显示出潜力,但其直接在实时网站上训练视觉网页代理的潜力仍很大程度上未被充分探索。本文介绍了 OpenWebRL,这是一个在真实网站上使用在线多轮强化学习(RL)训练视觉网页代理的开放框架。OpenWebRL 涵盖了完整的训练流程,包括可扩展的实时浏览器基础设施、监督初始化、多模态上下文管理、轨迹级成功判定以及高效的多轮策略(policy)优化。利用该框架,我们训练了 OpenWebRL-4B,它在具有挑战性的实时网络基准上建立了新的开源最先进(SOTA)标准。仅使用 0.4K 初始化轨迹和 2.2K 开放式强化学习(RL)训练任务,OpenWebRL-4B 在 Online-Mind2Web 上取得 67.0% 的成功率,在 DeepShop 上取得 64.0% 的成功率,优于先前类似或更大规模的开放代理,并与包括 OpenAI CUA 和 Gemini CUA 在内的专有系统保持竞争力。除了强大的基准性能外,我们系统性地研究了使在线强化学习(RL)对视觉网页代理有效的关键设计选择,并分析了强化学习如何提升代理(agent)推理能力。总体而言,我们的工作为构建更具能力、可复现且成本高效的开放网页代理提供了一条切实可行的路径。我们将发布训练数据、模型和代码,以支持未来的研究。

Abstract

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

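Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 4.0/10 6.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 7.0/10 10.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 5.0/10 7.5

Scoring rationale: The paper centers on online multi-turn RL for visual web agents. MultiModal and MLLM score fairly high because visual agents involve multimodal interaction and build on a large-model backbone. model-based RL scores moderately: the work is RL, but the abstract does not emphasize model learning. Visual Encoder scores moderately: visual input is necessary but not the core novelty. Unify Models and World Models score lower because the focus is the RL framework rather than model unification or generative world models. Tokenizer scores lowest because the abstract does not address it.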

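Keywords

Online Multi-turn Reinforcement Learning, Visual Web Agents, OpenWebRL, Multimodal Context Management, Policy Optimization, Live-browser Infrastructure, State-of-the-art

In-depth analysis

Chinese Title: OpenWebRL:揭秘视觉网络代理的在线多轮强化学习

Summary: This paper proposes OpenWebRL, a fully open framework for training visual web agents on real websites with online multi-turn reinforcement learning. The strongest web agents today are mostly proprietary, while open agents rely heavily on supervised fine-tuning, which needs large amounts of costly demonstration data and static datasets with limited coverage. Online RL has shown promise for text agents but remains underexplored for visual web agents. OpenWebRL covers the full training pipeline: scalable live-browser infrastructure, supervised initialization (only 0.4K trajectories), multimodal context management, trajectory-level success judging (with a GPT judge or a distilled 8B judge, saving about $545.5 per experiment), and efficient multi-turn policy optimization (multimodal multi-turn GRPO). OpenWebRL-4B (4B parameters), trained with this framework, reaches success rates of 67.0%, 64.0%, and 74.1% on Online-Mind2Web, DeepShop, and WebVoyager, surpassing earlier open agents of similar or larger scale (such as FARA-7B, MolmoWeb-8B, and Qwen3-VL-235B) and competing with proprietary systems such as GPT-5, OpenAI CUA, and Gemini CUA. The paper systematically analyzes the key design choices for online RL and their effect on agent reasoning, and plans to release training data, models, and code.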

Innovations:

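  • Proposes OpenWebRL, described as the first fully open end-to-end online multi-turn RL framework for visual web agents, covering the full pipeline from infrastructure to policy optimization.
  • Develops a practical multimodal multi-turn RL recipe that combines robust browser infrastructure, trajectory-level judging, and efficient context management so that a compact 4B model trains effectively on real websites.
  • Releases a strong 4B open agent and systematically studies the factors behind successful online RL (such as supervised initialization, judge design, and multi-turn optimization), giving later work an empirical basis.
  • Introduces a hybrid judge (GPT or a distilled 8B model) that keeps proprietary-level performance while cutting evaluation cost substantially (about $545.5 saved per experiment).
  • Sets a new state of the art for open agents on several live-web benchmarks, surpassing larger models and approaching proprietary systems.

Methodology: The paper uses a two-stage training paradigm: supervised fine-tuning (SFT) on 0.4K trajectories as a warm start that puts the policy in a region where exploration is productive, followed by RL on real websites with online multi-turn GRPO (Group Relative Policy Optimization). Specific techniques: a fault-tolerant browser environment built on Orchard Env that supports large-scale parallel asynchronous trajectory sampling; a ReAct-style multi-tool-calling agent framework that integrates multimodal observations (screenshot, URL, viewport size, tab metadata) and extracts concise textual environment feedback from DOM-tree changes; a trajectory-level success judge (GPT or a distilled 8B model) that computes the reward; and multi-turn GRPO that treats the full trajectory as the sampling unit, computes group-relative advantages, and optimizes the tokens generated in every turn. Training uses 2.2K open-web tasks, collecting online rollouts on live websites and updating the policy iteratively.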
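The group-relative advantage at the heart of GRPO can be shown in a few lines: several trajectories are sampled for the same task, each gets a scalar reward from the judge, and each trajectory's advantage is its reward standardized within the group. The sketch below uses the common mean-and-standard-deviation normalization; whether OpenWebRL uses exactly this form is an assumption, and the function name and `eps` value are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize trajectory rewards within one sampled group.

    rewards: one scalar per trajectory, e.g. 1.0 if the judge marks success.
    eps keeps the division finite when every reward in the group is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same web task; the judge marks two as successful.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

With the full trajectory as the sampling unit, every token generated in every turn of a trajectory would share that trajectory's advantage. A group in which all rollouts succeed or all fail yields zero advantages and therefore no learning signal.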

Key Results:

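  • OpenWebRL-4B reaches 74.1% success on WebVoyager, 67.0% on Online-Mind2Web, and 64.0% on DeepShop, all above earlier open agents (such as FARA-7B, MolmoWeb-8B, and Qwen3-VL-235B-A22B-Thinking).
  • Against proprietary systems (GPT-5, OpenAI CUA, Gemini CUA), OpenWebRL-4B is competitive on some benchmarks.
  • It uses only 0.4K initialization trajectories and 2.2K online RL training tasks, far fewer than other open agents (MolmoWeb, for example, uses 278K trajectories).
  • The distilled 8B judge matches the GPT judge in performance while cutting evaluation cost by about $545.5 per experiment.
  • Systematic analysis shows that supervised warm start, multimodal context management, trajectory-level judging, and multi-turn GRPO are the key factors; RL training improves agent reasoning (for example, longer chains of thought and fewer repeated actions).

Tech Stack:

  • Qwen3-VL-4B (base vision-language model)
  • GRPO (Group Relative Policy Optimization)
  • ReAct (reason-act loop) tool-calling framework
  • Orchard Env (fault-tolerant browser environment)
  • GPT-4o / distilled 8B model (trajectory-level success judge)
  • DOM-tree change extraction (environment feedback)
  • Asynchronous parallel browser instances (large-scale rollout collection)
  • Multimodal observations (screenshot, URL, viewport, tab metadata)

Strengths:

  • Fully open: training data, model weights, and code are all to be released, supporting reproducible research.
  • Practical: training and evaluation happen on real, dynamic websites rather than simulated environments.
  • Data-efficient: only a few initialization trajectories (0.4K) and training tasks (2.2K) are needed, which lowers data collection cost considerably.
  • Systematic ablations: detailed analysis of supervised initialization, judge design, and multi-turn optimization yields transferable lessons.
  • Compact yet strong: a 4B-parameter model surpasses larger open agents, showing that online RL works for small models.

Limitations:

  • Depends on a GPT judge for reward labeling; the distilled version lowers cost, but initial training still needs GPT API access.
  • Only a 4B model is used; larger models (such as 7B or 8B) are not explored in the same framework, which may cap performance.
  • Online RL on real websites can run into uncontrollable factors such as site changes and anti-scraping measures, which affect training stability and generalization.
  • Benchmarks date from different times (WebVoyager is older), so direct comparison with historical results may be affected by website changes.
  • The effect of RL training on safety and robustness, such as adversarial inputs or malicious websites, is not analyzed in depth.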

Relevance To Keywords:

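  • Native multimodal large models: the paper uses Qwen3-VL-4B, a native multimodal large model, as the base and post-trains it for web-agent tasks.
  • Unified understanding and generation in multimodal large models: the agent must understand web screenshots (vision) and text instructions and generate structured actions (text), which reflects unified understanding and generation.
  • Representation learning: effective state representations are learned through multimodal context management (screenshot, URL, feedback), but representation learning mechanisms are not a focus.
  • World models: no explicit world model is built; online RL learns environment dynamics implicitly through interaction with the real environment, which can be seen as a loose form of world modeling.
  • Reinforcement learning: the core method is online multi-turn GRPO, which directly optimizes task success rate.
  • Post-training: the two-stage SFT+RL paradigm is a large-model post-training technique.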
Score: 51.0 / 27.8
Authors: Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Pengfei Zhang, Yao Hu, Jiawei Li, Shikai Jiang
Published: 2026-06-01
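TL;DR: EvoCut proposes a training-free visual token compression method based on multi-layer evolution deviation that keeps 94.4% of a large vision-language model's performance while retaining only 11.1% of the visual tokens, markedly improving inference efficiency.
Abstract (Chinese translation)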

大型视觉语言模型(LVLMs)在图像和视频理解任务上取得了优异性能,但其推理效率受限于视觉编码器产生的大量视觉令牌。大多数现有的视觉令牌压缩方法基于特定层的注意力分数或表征属性来估计令牌重要性,却忽略了视觉令牌在视觉编码器中的演变过程。此类基于特定层级的准则可能提供不完整的重要性估计,并限制压缩后的性能保留。为了解决这一问题,我们分析了逐层视觉令牌演变方向,并观察到令牌在视觉编码器层之间形成了多个群体演变方向。我们的分析进一步表明,信息丰富的令牌倾向于表现出对常见群体演变方向的持续偏离。基于这一观察,我们提出 EvoCut,一种无需训练且无需注意力的视觉令牌压缩方法,该方法通过多层演变偏差来估计令牌重要性。实验结果表明,在 LLaVA-1.5-7B 上,EvoCut 仅保留 11.1% 的视觉令牌,同时保留了 94.4% 的平均性能,证明了其在平衡效率与准确性方面的有效性。

Abstract

Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing visual token compression methods estimate token importance from attention scores or representation properties at specific layers, overlooking how visual tokens evolve across the vision encoder. Such layer-specific criteria may provide incomplete importance estimates and limit performance preservation after compression. To address this issue, we analyze layer-wise visual token evolution directions and observe that tokens form multiple group evolution directions across vision-encoder layers. Our analysis further shows that informative tokens tend to exhibit persistent deviations from common group evolution directions. Based on this observation, we propose EvoCut, a training-free and attention-free visual token compression method that estimates token importance from multi-layer evolution deviation. Experimental results show that EvoCut can retain only 11.1% of the visual tokens on LLaVA-1.5-7B while preserving 94.4% of the average performance, demonstrating its effectiveness in balancing efficiency and accuracy.

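Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 10.0/10 15.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 10.0/10 15.0
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper's core is visual token compression to improve the efficiency of large vision-language models (LVLM/MLLM), so MLLM, Visual Encoder, and MultiModal are highly relevant. Tokenizer concerns token handling, but the focus is compression rather than generation, so it is only moderately relevant. Unify Models, World Models, and model-based RL have no direct link to the paper's topic (inference efficiency), so they score low or zero. No designated expert appears in the author list.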

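Keywords

Visual Token Compression, Large Vision-Language Models, Multi-Layer Evolution, Vision Encoder, Inference Efficiency, Training-Free, Token Importance

In-depth analysis

Chinese Title: EvoCut:面向高效大型视觉语言模型的多层演化感知视觉标记压缩

Summary: Large vision-language models (LVLMs) perform well on image and video understanding, but the many visual tokens produced by the vision encoder severely limit inference efficiency. Existing compression methods mostly rely on single-layer attention scores or representation properties and ignore how visual tokens evolve across encoder layers, which leaves importance estimates incomplete. By analyzing the evolution directions of visual tokens across encoder layers, the paper finds that tokens form several group evolution directions and that informative tokens tend to deviate persistently from these shared directions. On that basis it proposes EvoCut, a training-free and attention-free visual token compression method that scores token importance by accumulating evolution deviation across layers. On LLaVA-1.5-7B, keeping only 11.1% of the visual tokens (64) preserves 94.4% of average performance and gives a 1.44x total-time speedup, balancing efficiency and accuracy.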

Innovations:

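  • First to show that visual tokens follow several group evolution directions across vision-encoder layers rather than a single global direction.
  • Finds that informative visual tokens tend to deviate persistently from the group evolution directions, and proposes cross-layer evolution deviation as an importance measure.
  • Proposes EvoCut, which needs neither training nor attention maps and is compatible with efficient attention operators such as FlashAttention.
  • Uses a history-aware cross-layer accumulation strategy that avoids noise from single-layer deviations in deep layers and stabilizes the importance estimate.

Methodology: The paper first clusters the evolution directions of visual tokens between adjacent encoder layers (with HDBSCAN) and finds several group evolution directions. It then defines, for each token, the cosine deviation between its evolution direction and the nearest group direction, and accumulates this deviation score across layers with a history-aware update. Finally, it keeps high-deviation tokens and drops low-deviation tokens according to the accumulated score, giving training-free visual token compression. The pipeline has three parts: group evolution modeling, evolution deviation scoring, and multi-layer score accumulation.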
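The scoring and accumulation steps can be sketched on toy data. In the sketch below, a token's evolution direction is the difference between its features at adjacent layers, its deviation is one minus the cosine similarity to the nearest group direction, and scores are accumulated with a simple decayed sum. Several parts are assumptions and not the paper's implementation: the group directions are passed in by hand rather than found with HDBSCAN, the `momentum` update stands in for the unspecified history-aware rule, and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def deviation(direction, group_directions):
    """One minus the cosine similarity to the nearest group direction."""
    return 1.0 - max(cosine(direction, g) for g in group_directions)


def accumulate_scores(layer_states, groups_per_step, momentum=0.5):
    """layer_states[l][t] is token t's feature vector at layer l."""
    scores = [0.0] * len(layer_states[0])
    for step in range(len(layer_states) - 1):
        for t in range(len(scores)):
            before, after = layer_states[step][t], layer_states[step + 1][t]
            direction = [b - a for a, b in zip(before, after)]
            # Assumed history-aware update: decay the old score, add the new one.
            scores[t] = momentum * scores[t] + deviation(direction, groups_per_step[step])
    return scores


def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Three tokens over three layers; tokens 0 and 1 follow the group direction.
states = [
    [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
]
groups = [[[1.0, 0.0]], [[1.0, 0.0]]]
scores = accumulate_scores(states, groups)
print(scores)                 # [0.0, 0.0, 1.5]
print(keep_top_k(scores, 1))  # [2]
```

Token 2 moves orthogonally to the shared direction at both steps, so it accumulates the highest score and is the one retained.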

Key Results:

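  • On LLaVA-1.5-7B, keeping only 64 visual tokens (11.1% of the original 576) preserves 94.4% of average performance.
  • Outperforms existing compression methods (such as VisionZip, ApET, and V2Drop) on several LVLM backbones (LLaVA-1.5, LLaVA-NeXT, Video-LLaVA).
  • Gives a 1.44x total-time speedup and is compatible with FlashAttention.
  • On datasets such as POPE and TextVQA, the retained tokens correspond to visually informative regions such as foreground objects and text.

Tech Stack:

  • HDBSCAN (hierarchical density-based clustering)
  • Cosine similarity / deviation
  • Vision encoder (ViT/CLIP)
  • LVLM architectures such as LLaVA-1.5, LLaVA-NeXT, and Video-LLaVA
  • FlashAttention (compatibility)
  • History-aware accumulation strategy

Strengths:

  • No extra training; plug-and-play, which lowers deployment cost.
  • Does not depend on attention maps, which avoids position bias and keeps compatibility with efficient attention implementations.
  • Cross-layer evolution analysis gives a more robust importance signal than a single-layer snapshot.
  • Generality and effectiveness are verified across several models and tasks, with high performance retention.

Limitations:

  • Uses only evolution information inside the vision encoder and ignores what the language model needs from the visual tokens.
  • The clustering step (HDBSCAN) may add computational overhead, although the method as a whole remains training-free.
  • At very low retention rates (for example, below 10%), performance may drop more noticeably; further analysis is needed.
  • Validated only on image and video understanding tasks, not on generation tasks (such as visual generation).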

Relevance To Keywords:

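  • Native multimodal large models: the paper studies visual token compression for LVLMs, an efficiency optimization for multimodal large models, so it is related.
  • Representation learning: importance is estimated by analyzing the evolution directions of visual tokens, a form of feature-evolution analysis.
  • World models: not addressed directly; visual token compression could indirectly make visual reasoning in world models more efficient.
  • Reinforcement learning / post-training: the method is training-free and has no direct link to RL or post-training.
  • Unified understanding and generation in multimodal large models: the paper focuses on understanding tasks and does not cover unified generation, though the compression method could transfer to generative models.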
Score: 49.5 / 27.8
Authors: Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang, Ninghao Liu, Jin Sun
Published: 2026-06-01
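TL;DR: TRON proposes rule-verifiable online environments that generate visual reasoning data on demand, improving MLLM performance through RL post-training without additional data collection.
Abstract (Chinese translation)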

视觉推理领域的强化学习(RL)需要可扩展、可验证且可控的训练信号。现有的视觉 RL 后训练基于静态精选数据集进行训练,其固定的图像 - 问题 - 答案样本受限于采集预算。本文引入 TRON(目标导向、规则可验证的在线环境),这是一种在线环境基底:训练轨迹由一个可控的生成 - 验证程序按需生成,该程序采样一个新鲜的潜在视觉状态,渲染图像,提出问题,并精确验证答案。因此,单次运行可根据当前课程所需的难度级别,生成无界的新鲜实例流。当前的 TRON 套件包含 520 个环境,分为五个能力桶(空间、数学、图表、模式/逻辑和计数);同一基底支持在所有桶上训练的单一完整模型以及按桶划分的能力专家模型,无需额外数据收集。我们还引入了一项基底分析,涵盖生成可靠性、实例和级别多样性、跨环境近重复项以及基础模型按难度级别的通过率。使用 METHOD 的 RL 后训练在 Qwen3-VL-4B、Qwen2.5-VL-7B 和 MiMo-VL-7B-SFT 十个外部多模态推理基准上持续改进了性能。

Abstract

Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with METHOD consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.

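Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 4.0/10 6.0
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 5.0/10 7.5

Scoring rationale: The paper's core is environment generation for reinforcement learning of visual reasoning, and the target models are MLLMs, so MLLM and MultiModal are highly relevant. The single-model versus specialist-model training strategies relate to Unify Models. Environment generation involves latent states, which is moderately related to World Models. Tokenizer and visual encoder details are not discussed, so relevance is low. RL is central and the generated environments act as modeled environments, which is moderately related to model-based RL.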

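Keywords

Visual Reasoning, Reinforcement Learning, Online Environment, MLLM, Curriculum Learning, Synthetic Data, Rule-Verifiable

In-depth analysis

Chinese Title: TRON:面向视觉推理强化学习的目标导向可规则验证在线环境

Summary: This paper proposes TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate for reinforcement learning (RL) of visual reasoning. Existing visual RL post-training depends on static datasets, which are bounded by collection budget and make difficulty and skill hard to control. TRON contains 520 programmable generator-verifier environments; each generates fresh image-question-answer samples on demand and provides a deterministic reward. The environments fall into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting) and support difficulty ladders and curriculum control. RL post-training of Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT with TRON-DAPO improves performance consistently on ten external multimodal reasoning benchmarks. The paper also reports a substrate analysis covering generation reliability, instance and difficulty diversity, cross-environment near-duplicate detection, and base-model pass rates.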

Innovations:

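  • Proposes the TRON online environment substrate, with 520 generator-verifier programs that can generate training samples without an upper bound, replacing static datasets.
  • Organizes environments into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting), supporting both full-model and specialist-model training without extra data collection.
  • Introduces a substrate analysis framework that assesses generation quality, diversity, difficulty calibration, and cross-environment near-duplicates to make sure the training signal is sound.
  • Verifies on several open VLMs that TRON-DAPO post-training yields consistent gains on ten external benchmarks.

Methodology: The paper generates environments programmatically: each environment is a Python program with a generator (which samples a latent state, renders the image, and constructs the question) and a verifier (which computes the correct answer and returns a deterministic reward). During training the model sees only the image and the question, and the reward comes from the verifier. Each environment has a difficulty ladder (levels 0-9), and difficulty is raised automatically by tracking verifier accuracy. Data is generated online: every sample uses a new seed, and image perturbations (padding jitter, rotation, blur, and so on) are added for robustness. The RL algorithm is DAPO (a GRPO-based variant).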
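The generator-verifier pattern can be sketched with a toy counting environment: a seeded generator samples a latent scene and a question, a verifier computes the ground truth from the latent state and returns a 0/1 reward, and a small helper advances the difficulty level when recent accuracy is high. This is an illustration only, not one of the 520 TRON environments: image rendering is omitted, and the class name, scene-size rule, and 0.8 threshold are hypothetical.

```python
import random


class CountingEnv:
    """Toy generator-verifier environment: count shapes of a target kind."""

    SHAPES = ("circle", "square", "triangle")

    def __init__(self, level=0):
        self.level = level  # difficulty rung, 0-9

    def generate(self, seed):
        """Sample a latent scene and a question; a real env would also render it."""
        rng = random.Random(seed)
        n_objects = 3 + 2 * self.level  # assumed rule: harder levels, more objects
        scene = [rng.choice(self.SHAPES) for _ in range(n_objects)]
        target = rng.choice(self.SHAPES)
        question = f"How many {target}s are in the image?"
        return {"scene": scene, "target": target, "question": question}

    def verify(self, instance, answer):
        """Deterministic reward: 1.0 for an exact match with the true count."""
        truth = instance["scene"].count(instance["target"])
        return 1.0 if str(answer).strip() == str(truth) else 0.0


def maybe_advance(env, recent_rewards, threshold=0.8, max_level=9):
    """Raise the level when windowed accuracy reaches the threshold."""
    if recent_rewards and env.level < max_level:
        if sum(recent_rewards) / len(recent_rewards) >= threshold:
            env.level += 1
    return env.level


env = CountingEnv(level=2)
instance = env.generate(seed=0)
truth = instance["scene"].count(instance["target"])
print(instance["question"], env.verify(instance, truth))
```

Because the answer is computed from the latent scene rather than from the rendered image, the reward needs no human labels, and a new seed gives a fresh instance at any chosen level.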

Key Results:

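  • TRON contains 520 environments across five ability buckets: 111 spatial, 131 mathematical, 144 diagram, 104 pattern/logic, and 30 counting.
  • RL post-training gives consistent gains for Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B on ten external multimodal reasoning benchmarks (such as MathVista and MMMU).
  • Substrate analysis shows high generation reliability, good instance and difficulty diversity, a low cross-environment near-duplicate rate, and a difficulty ladder that correlates with base-model pass rate.
  • Both per-bucket specialist models and the full model train effectively, offering new insight into ability transfer.

Tech Stack:

  • DAPO (GRPO-based RL algorithm)
  • Python programmatic environment generators
  • Image rendering (with PIL/OpenCV or similar)
  • Deterministic verifiers (exact match, set/sequence comparison, task-specific solvers)
  • Automatic difficulty-ladder advancement (based on a sliding-window accuracy threshold)
  • Image perturbation augmentation (padding jitter, rotation, JPEG compression, brightness shift, Gaussian blur, Gaussian noise)

Strengths:

  • Online generation removes the limits of static datasets; the number of training samples is bounded only by compute.
  • Controllable difficulty supports curriculum learning and keeps the model from overfitting fixed samples.
  • High environment diversity (520 environments) covers many visual reasoning abilities, with randomization inside each environment.
  • Verifiers give exact rewards without human annotation, which suits RL training.
  • Generalization is verified on several open VLMs with consistent gains.

Limitations:

  • Environment generation relies on predefined templates and may not cover every real-world visual reasoning scenario.
  • Rendered image quality may be limited and may differ from the distribution of real images.
  • The difficulty ladder is designed around base-model pass rates and may not suit every model.
  • Only five ability buckets are supported for now; more types could be added later.
  • RL training is computationally expensive, and online generation adds sampling overhead.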

Relevance To Keywords:

  • Unify Models: 论文训练多模态大模型(Qwen3-VL等),属于统一模型范畴。
  • World Models: 环境生成器模拟潜在视觉状态,可视为世界模型的一种简化形式。
  • Representation Learning: RL后训练提升模型视觉推理表征能力。
  • Model-Based RL: 环境生成器提供可交互的模型,但训练本身是model-free RL(DAPO)。
  • 原生多模态大模型: 论文针对视觉-语言模型进行后训练。
  • 多模态大模型的理解和生成一体化: 论文侧重理解(推理),未涉及生成。
  • 表征学习: 通过RL优化模型内部表征。
  • 世界模型: 环境生成器可看作世界模型的组件。
  • 强化学习: 核心方法是RL后训练(DAPO)。
  • 后训练: 论文聚焦于RL后训练阶段。
Score: 49.5 / 27.8
Authors: Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu, Jin Sun
Published: 2026-06-01
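TL;DR: This paper proposes a fine-tuning-free attention-based selection framework that uses an LVLM's internal attention patterns to substantially improve small-object grounding accuracy.
Abstract (Chinese translation)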

大型视觉语言模型(LVLMs)的内部注意力模式能否在不进行微调的情况下识别可靠的小目标框?在本研究中,我们给出了肯定的回答。LVLMs 中的注意力结构编码了定位质量——仅基于注意力图训练的轻量级 IoU 回归器实现了强大的 IoU 预测(皮尔逊相关系数 r > 0.67)。该回归器驱动了我们基于注意力的候选选择(ACS)框架中的回归器变体,称为 ACS-Learned,该框架从多个采样候选框中选择最佳框以改进物体定位。通过分析该回归器所学到的内容,我们揭示了哪些变换器层和头最为关键,并推导出 ACS-Free:一种无需训练的选择器,它利用这些判别性头上的注意力熵对候选框进行排序,推理阶段无需学习组件。在 COCO 和 Objects365 上的实验表明,小目标定位性能提升幅度高达 19%,且 ACS-Free 在所有无需训练的方法中排名最佳,这表明有用的注意力结构提升了 LVLMs 中定位的可靠性和可解释性。

Abstract

Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality: a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, demonstrating that useful attention structure improves both localization reliability and interpretability in LVLMs.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 6.0/10 9.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 9.0/10 13.5
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 1.0/10 1.5

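Scoring rationale: The paper's core is small-object grounding in LVLMs using attention. MLLM and MultiModal map directly onto the LVLM setting and are highly relevant; Visual Encoder is moderately relevant because it is the source of the attention maps; Unify Models is moderately relevant (vision-language unification); Tokenizer, World Models, and model-based RL are absent from the abstract or unrelated. The author list does not include the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus applies.

Keywords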

LVLMs, Small Object Grounding, Attention Patterns, Bounding Box Selection, Training-free, Vision-Language Models, Interpretability

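In-depth analysis

Chinese Title (translated): Self-Improving Small-Object Grounding in LVLMs

Summary: This paper proposes a way to improve small-object grounding in Large Vision Language Models (LVLMs) without fine-tuning. The authors find that an LVLM's internal attention patterns correlate strongly with grounding quality: a lightweight IoU regressor trained only on attention maps predicts IoU accurately (Pearson r > 0.67). Building on this, they design the Attention-based Candidate Selection (ACS) framework with two variants: ACS-Learned, which uses the trained regressor to select the best candidate box, and ACS-Free, which analyzes the regressor's gradients and attention entropy to identify the key layers and attention heads, then ranks candidates by attention entropy alone at inference with no learned component. Experiments on COCO and Objects365 show small-object grounding gains of up to 19%, with ACS-Free the best of all training-free methods. The work shows that LVLM attention structure encodes grounding reliability and improves the interpretability of grounding.

Innovations:

  • Finds that internal LVLM attention patterns correlate strongly with grounding quality (IoU), so box reliability can be predicted without fine-tuning.
  • Proposes a lightweight IoU regressor trained only on attention maps that achieves strong IoU prediction (Pearson r > 0.67).
  • Uses gradient attribution and entropy analysis to identify the Transformer layers and attention heads most critical to grounding, and derives the training-free selector ACS-Free from them.
  • ACS-Free needs no learned component at inference; ranking by attention entropy alone gives the best performance among training-free methods and improves small-object grounding by up to 19%.
  • The framework requires no change to the LVLM architecture, no fine-tuning, and no external detector, so it can be applied to different LVLMs as a plug-and-play module.

Methodology: First, temperature sampling generates multiple candidate bounding boxes together with their attention maps. Next, a lightweight neural network (the IoU regressor) is trained to predict IoU from the attention maps, testing the hypothesis that attention encodes grounding quality. Gradient attribution and entropy analysis of the trained regressor then identify the layers and attention heads that contribute most to grounding. Finally, a training-free selection rule (ACS-Free) is built on the attention entropy of these key heads and ranks candidates by entropy at inference to select the best box. Overall the framework has three stages: train the regressor, analyze the key patterns, and distill them into a training-free rule.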
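
The entropy-ranking step can be sketched as below. This is an illustrative reading, not the paper's implementation: the data layout and function names are hypothetical, and choosing the lowest mean entropy is an assumption based on the reported negative correlation between attention entropy and IoU on the key heads.

```python
import math


def attention_entropy(weights):
    """Shannon entropy of one attention map, after normalizing it to sum to 1."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


def select_box(candidates):
    """Pick the candidate whose mean entropy over the chosen heads is lowest.

    `candidates` is a list of (box, head_maps) pairs, where head_maps holds one
    flattened attention map per pre-selected head (hypothetical data layout).
    """
    def score(item):
        _, head_maps = item
        return sum(attention_entropy(m) for m in head_maps) / len(head_maps)

    return min(candidates, key=score)[0]


# Toy example: the first box has sharply focused attention, the second is diffuse.
candidates = [
    ((10, 10, 30, 30), [[0.97, 0.01, 0.01, 0.01]]),
    ((12, 8, 40, 36), [[0.25, 0.25, 0.25, 0.25]]),
]
best = select_box(candidates)
```

In this toy case the focused candidate is selected because a peaked attention map has lower entropy than a uniform one.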

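Key Results:

  • A lightweight IoU regressor trained on attention maps predicts IoU with a Pearson correlation above 0.67.
  • ACS-Learned consistently improves small-object grounding across several LVLMs and datasets.
  • ACS-Free performs best among all training-free methods, improving small-object grounding by up to 19%.
  • Identifies the top-10 grounding-critical layers for Qwen2.5-VL-7B and for InternVL-3.5-8B (for example, layers 14-21, 24, and 25).
  • Attention entropy is strongly negatively correlated with IoU on the key heads, which supports entropy as a training-free selection signal.

Tech Stack:

  • LVLMs: Qwen2.5-VL-7B, InternVL-3.5-8B
  • Attention-map extraction: visual-token attention weights from all layers and heads
  • IoU regressor: a lightweight neural network (the architecture is not detailed, but it is MLP-like)
  • Loss function: mean squared error (MSE)
  • Gradient attribution analysis: used to identify key layers
  • Entropy analysis: computes attention entropy for training-free selection
  • Temperature sampling: generates diverse candidate boxes
  • Datasets: COCO, Objects365

Strengths:

  • No LVLM fine-tuning is needed; the method either trains a very small regressor or is fully training-free, so compute cost is low.
  • Plug-and-play and applicable to different LVLMs.
  • Reveals the intrinsic link between attention structure and grounding quality, improving interpretability.
  • The small-object grounding gain is substantial (up to 19%), which is valuable for safety-critical applications such as autonomous driving.
  • Distilling a training-free rule from a supervised regressor is both elegant and practical.

Limitations:

  • Training the regressor still requires a small amount of labeled data (IoU labels); it is lightweight but not fully zero-shot.
  • ACS-Free is training-free at inference, but the key heads must first be identified through regressor analysis, which still depends on training data.
  • Experiments cover only two LVLMs and two datasets, so generalization needs further validation.
  • The method targets small objects; gains on large objects may be limited (the paper does not report large-object results in detail).
  • Multi-object scenes and complex queries are not explored.

Relevance To Keywords:

  • Native multimodal large models: The paper directly studies small-object grounding in LVLMs (Qwen2.5-VL, InternVL-3.5), which is a capability enhancement for native multimodal large models.
  • Representation learning: Attention maps are treated as representations from which IoU is predicted, an application of representation learning to grounding-quality estimation.
  • World models: Not directly addressed, although small-object grounding is a key capability for the perception modules of world models in areas such as autonomous driving.
  • Reinforcement learning / post-training: The paper uses neither; it relies on inference-time sampling and attention analysis, so relevance is weak.
  • Unify Models: Model unification is not addressed, although an LVLM is itself a unified vision-language model.
  • Model-based RL: Not relevant.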
Score: 48.0 / 27.8
Authors: Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu
Published: 2026-06-01
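TL;DR: MMG2Skill is a framework that turns multimodal guides from the Web into executable skills and evolves them through trajectory feedback, substantially improving agent performance on GUI control, gameplay, and card-game tasks.
Abstract (Chinese translation)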

网络上丰富的程序性知识在协助智能体解决长周期任务方面蕴含着巨大潜力。然而,此类知识往往具有多模态、异构和噪声特性,且隐含地假设人类执行者,因而难以直接作为智能体所需的技能使用。为了弥合人类导向指南与智能体可执行技能之间的差距,我们将此问题形式化为指南到技能学习(guide-to-skill learning):将真实世界中的指南转换为可执行技能,并从智能体可观察的轨迹中持续改进它们。为了评估现有智能体在此任务上的能力,我们引入了 MMG2Skill-Bench,这是首个专为该问题设计的基准。我们进一步提出了 MMG2Skill,这是一个闭环框架,它将指南编译为可编辑技能,并在执行过程中使固定的视觉 - 语言模型(VLM)智能体依据这些技能运行,并从轨迹级根本原因反馈修订技能,而不使用基准分数。在 GUI 控制、开放式游戏及策略卡牌游戏等任务中,基于六种 VLM 骨干,MMG2Skill 在所有模型 - 域设置中一贯优于朴素基线智能体,在所有骨干模型上实现了 +12.8 至 +25.3 个百分点的宏观平均增益。消融实验表明,直接使用原始指南提示智能体会导致性能下降,而结构化技能构建与轨迹驱动修订对于实现观察到的改进均是必要的。在成功可推断的任务中,基于分析器的早停机制进一步防止了后期阶段的性能退化,并在成功信号适当校准时节省了 25% 至 53% 的尝试次数。

Abstract

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 3.0/10 4.5

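Scoring rationale: The paper's core is converting multimodal guides into skills, so MultiModal (9.0) and MLLM (8.0) are highly relevant and Visual Encoder (5.0) is relevant as a component. Unify Models, World Models, and model-based RL have low relevance (3.0), and Tokenizer is not mentioned (1.0). The weighted total of 48.0 is above the pass mark of 27.8. The author list does not include the designated experts.

Keywords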

MMG2Skill, Guide-to-skill learning, Multimodal guides, Vision-language model, Trajectory feedback, Self-evolving skills, Skill distillation, GUI control

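In-depth analysis

Chinese Title (translated): MMG2Skill: Can Agents Distill Self-Evolving Skills from In-the-Wild Guides?

Summary: The paper addresses agents' lack of executable skills for long-horizon tasks by converting the abundant multimodal, human-oriented guides on the Web into agent-executable skills that evolve over time. The authors first build MMG2Skill-Bench, the first guide-to-skill learning benchmark, covering three domains (desktop GUI control, open-world gameplay, and strategic card play) with 130 tasks in total. They then propose the MMG2Skill framework, which compiles multimodal guides into editable SKILL.md skill files, conditions a fixed VLM policy on the skills during execution, has an analyzer extract root-cause feedback from agent-visible trajectories, and has a refiner revise the skills, forming a closed loop. Experiments with six VLM backbones show that MMG2Skill beats the baseline in every model-domain combination, with macro-average gains of 12.8 to 25.3 percentage points. Ablations show that prompting directly with raw guides can lower performance, while structured skill construction and trajectory-driven revision are both required. On success-inferable tasks, analyzer-based early stopping prevents late-stage regressions and saves 25%-53% of attempts.

Innovations:

  • Formulates the guide-to-skill learning problem for the first time and builds MMG2Skill-Bench, the first benchmark for it, spanning three domains.
  • Proposes the closed-loop MMG2Skill framework, which compiles multimodal guides into editable skills and revises them from trajectory-level root-cause feedback without using benchmark scores.
  • Finds that prompting directly with raw guides can hurt performance, while structured skill construction and trajectory-driven revision are both necessary and complementary.
  • Introduces analyzer-based early stopping, which prevents late-stage regressions on success-inferable tasks and saves a substantial share of attempts.

Methodology: The pipeline has five parts. 1) Multimodal skill construction: a VLM compiles human guides (HTML plus images) into structured SKILL.md files containing reusable procedures, applicability conditions, state cues, and recovery knowledge. 2) Skill-conditioned execution: at the start of each attempt the current skill set is injected into the VLM context and, together with the task instruction and history, is used to generate actions. 3) Analyzer: a VLM reads the task instruction and the agent-visible trajectory and outputs a diagnosis (evidence of success or failure) and a self-judged outcome estimate (likely_success). 4) Refiner: combining the original guide, the current skill, and the accumulated diagnoses, it makes local edits to the skill (adding missing checks, reinforcing successful behavior, removing misleading recovery advice). 5) Closed-loop iteration: Algorithm 1 repeats until the analyzer judges success or the maximum number of attempts is reached. Evaluation uses domain-native scorers in the OSWorld, OpenHA Minecraft, and RLCard environments.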
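
The closed loop described above can be sketched in a few lines. This is a schematic under stated assumptions rather than the paper's Algorithm 1: the function name and every callable passed in are hypothetical stand-ins for VLM calls, and the skill is modelled as a plain list of steps instead of a SKILL.md file.

```python
def guide_to_skill_loop(guide, task, compile_skill, run_agent, analyze, refine,
                        max_attempts=5):
    """Closed-loop skill revision; all callables are hypothetical stand-ins for VLM calls."""
    skill = compile_skill(guide)                 # guide -> editable skill
    diagnoses = []
    for attempt in range(1, max_attempts + 1):
        trajectory = run_agent(task, skill)      # fixed agent conditioned on the skill
        diagnosis = analyze(task, trajectory)    # sees only the agent-visible trajectory
        diagnoses.append(diagnosis)
        if diagnosis["likely_success"]:          # analyzer-based early stopping
            return skill, attempt
        skill = refine(guide, skill, diagnoses)  # local edits to the skill
    return skill, max_attempts


# Toy usage with stub callables: the first attempt misses a step, the refiner adds it.
skill, attempts = guide_to_skill_loop(
    guide="Open Settings, then enable dark mode.",
    task="enable dark mode",
    compile_skill=lambda guide: ["open settings"],
    run_agent=lambda task, skill: list(skill),
    analyze=lambda task, traj: {"likely_success": "toggle dark mode" in traj},
    refine=lambda guide, skill, diags: skill + ["toggle dark mode"],
)
```

Note that the loop never consults a benchmark score: the stop decision comes only from the analyzer's reading of the trajectory, which is the property the paper emphasizes.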

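Key Results:

  • MMG2Skill outperforms the baseline on all six VLM backbones, with macro-average gains of +12.8 to +25.3 percentage points.
  • Prompting directly with raw guides can degrade performance, whereas structured skill construction provides a safer prior.
  • Trajectory-driven revision repairs the grounding gap between the guide and the runtime environment, and the revision gains are non-monotonic.
  • Analyzer-based early stopping prevents late-stage regressions on success-inferable tasks and saves 25%-53% of attempts.
  • In private-information tasks (No-Limit Texas Hold'em), a trajectory-level analyzer cannot reliably infer the outcome, so these tasks are excluded from the main evaluation.

Tech Stack:

  • Vision-language models (VLMs) as the policy and as the analyzer/refiner
  • OSWorld desktop GUI environment
  • OpenHA Minecraft tasks (MineStudio framework)
  • RLCard card-game environments (Dou Dizhu, Mahjong)
  • SKILL.md structured skill representation
  • Closed-loop iteration algorithm (Algorithm 1)
  • Domain-native scorers (success-inferable evaluation)

Strengths:

  • First systematic study of guide-to-skill learning, filling a gap in existing benchmarks and methods.
  • Validated across three different interaction domains (GUI, gameplay, strategy), indicating broad applicability.
  • The closed loop does not depend on benchmark scores and uses only agent-visible information, which is closer to real deployment.
  • Ablations clearly separate the contributions of skill construction and skill revision.
  • The early-stopping mechanism is practical and effective, saving compute.

Limitations:

  • Applies only to success-inferable tasks; private-information tasks (such as No-Limit Texas Hold'em) cannot use trajectory-level analysis.
  • The analyzer's self-judged outcome estimate has residual error and may misjudge success or failure.
  • The SKILL.md representation has limited flexibility and may not cover all complex procedural knowledge.
  • The framework depends on VLM capability; performance varies considerably across backbones and is bounded by the base model.
  • Cross-task skill transfer and continual learning are not explored.

Relevance To Keywords:

  • Native multimodal large models: The paper uses VLMs as the core policy and analyzer, processing multimodal input directly.
  • Unified understanding and generation in multimodal large models: The VLM is used both to understand guides and trajectories and to generate skill revisions.
  • Representation learning: The SKILL.md representation turns guides into a structured form that is easy to edit and condition on.
  • World models: The agent acts in interactive environments and must understand environment state and transitions; skills supply procedural knowledge.
  • Reinforcement learning: The closed-loop revision resembles policy optimization from trajectory feedback, but the object optimized is the skill rather than model parameters.
  • Post-training: Skill revision can be seen as a form of post-training that improves the agent's behavioral prior through trajectory feedback.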
Score: 45.0 / 27.8
Authors: Minsik Choi, Geewook Kim
Published: 2026-06-01
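TL;DR: This paper proposes MERIT, a decentralized instruction-tuning method based on gradient-conflict-aware splitting and weight merging, which improves multimodal large-model performance with no inter-partition communication.
Abstract (Chinese translation)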

指令微调将大语言模型(包括多模态模型)与多样化的用户意图对齐,但在扩展到异构混合任务时,受到梯度干扰和高带宽同步的阻碍。我们探究是否可以通过独立训练混合体中的部分任务并在参数空间中统一它们,从而联合解决这两个瓶颈。我们在共享平坦盆地内构建了一个局部二次理论,得出三个结论:权重合并产生曲率加权的方差减少;PCA (主成分分析) 对齐的冲突分裂沿高曲率方向最大化这一增益;此外,合并过程还充当了具有隐式范数正则化的谱滤波。这些结果直接催生了 MERIT,一种去中心化合并就绪的指令微调管道,该方法估计数据集级别的梯度冲突,沿主要 PCA 冲突轴划分混合体,独立微调每个分区且无需分区间通信,最后通过令牌加权平均合并一次。在 Qwen2.5-VL-3B 模型上,针对 136 个视觉 FLAN 任务,MERIT 将 8 个基准的平均值从 54.3(联合训练)提升至 57.0。该方法可扩展至 7B 模型,在包含 160 万示例、176 个来源的混合体上运行——以极小的成本开销匹配或超越集中式联合训练——并迁移至纯文本 FLAN 任务。我们的代码可在 https://github.com/naver-ai/merit 获取。

Abstract

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture -- matching or exceeding centralized joint training with minimal cost overhead -- and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 6.0/10 9.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 0.0/10 0.0

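Scoring rationale: The paper's core is decentralized instruction tuning with weight merging (MERIT), which has some connection to Unify Models (merging in parameter space). Evaluation is on the Qwen2.5-VL multimodal large model, so MLLM and MultiModal are highly relevant. The abstract mentions token-weighted averaging, which relates to Tokenizer; a visual encoder exists in the base model but is not central to the method. The paper does not address World Models or reinforcement learning, so both score 0. The weighted total of 45.0 is above the dynamic pass mark of 27.8.

Keywords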

Decentralized Instruction Tuning, Conflict-Aware Splitting, Weight Merging, Gradient Interference, Token-Weighted Averaging, Vision-FLAN Tasks, Parameter Space Merging

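In-depth analysis

Chinese Title (translated): Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

Summary: This paper proposes MERIT, a decentralized instruction-tuning pipeline that targets the gradient-interference and bandwidth-heavy synchronization bottlenecks of large-scale heterogeneous instruction tuning. The theoretical analysis shows that, inside a shared flat basin, weight merging yields a curvature-weighted variance reduction, PCA-aligned conflict splitting maximizes that gain, and merging also acts as implicit norm regularization and spectral filtering. MERIT estimates dataset-level gradient conflicts from a calibration set, splits the task mixture along the top PCA conflict axes, fine-tunes each branch independently (with no inter-partition communication), and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, the 8-benchmark average rises from 54.3 to 57.0; on a 7B model with a 1.6M-example, 176-source mixture, it matches or exceeds centralized joint training, and it transfers to text-only FLAN.

Innovations:

  • Develops a local quadratic theory showing that weight merging inside a shared flat basin yields a curvature-weighted variance reduction, and that PCA-aligned splitting maximizes this gain.
  • Designs the MERIT pipeline: conflict-aware splitting up front, independent fine-tuning, and a single token-weighted merge, with no inter-partition gradient communication.
  • Moves conflict handling from training time to before training, avoiding step-by-step accumulation of interference and removing the need for synchronization.
  • Shows theoretically that merging acts as implicit norm regularization and spectral filtering, which benefits generalization and optimization.
  • Demonstrates consistent gains across model scales and mixtures, in both multimodal and text-only settings.

Methodology: A local quadratic model is used to analyze the gain from weight merging inside a flat basin and to derive the curvature-weighted variance-reduction formula. PCA is applied to dataset-level gradient conflicts, and the tasks are split along the top conflict axes. The independently fine-tuned models are merged by token-weighted averaging. Experiments use the Qwen2.5-VL model family on the Vision-FLAN and FLAN datasets and compare against baselines including centralized joint training, random splitting, and model soups.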
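
The final merge step, token-weighted averaging, is simple enough to show directly. The sketch below assumes each partition's weight is its share of the total training tokens; the function name and the flat-list parameter format are hypothetical simplifications, not the MERIT codebase.

```python
def token_weighted_merge(checkpoints):
    """Merge parameter dicts, weighting each by the tokens its partition trained on.

    `checkpoints` is a list of (params, n_tokens) pairs; params maps a parameter
    name to a flat list of floats (a simplified stand-in for real tensors).
    """
    total_tokens = sum(n for _, n in checkpoints)
    first_params = checkpoints[0][0]
    merged = {}
    for name, values in first_params.items():
        merged[name] = [
            sum(params[name][i] * n for params, n in checkpoints) / total_tokens
            for i in range(len(values))
        ]
    return merged


# Two independently fine-tuned partitions; the second saw three times the tokens.
part_a = ({"w": [0.0, 4.0]}, 1_000)
part_b = ({"w": [4.0, 0.0]}, 3_000)
merged = token_weighted_merge([part_a, part_b])
```

Here `merged["w"]` lands three quarters of the way toward the second partition's weights, reflecting its larger token count.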

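Key Results:

  • On Qwen2.5-VL-3B, MERIT raises the 8-benchmark average from 54.3 to 57.0 under the same token budget.
  • On Qwen2.5-VL-7B with a 1.6M-example, 176-source mixture, MERIT matches or exceeds centralized joint training, consistently across three random seeds.
  • Theory shows that the merging gain is determined by curvature-weighted variance and that PCA splitting beats random splitting.
  • Merging has an implicit norm-regularization effect and can be viewed as spectral filtering.
  • The method transfers to the text-only FLAN setting.

Tech Stack:

  • Local quadratic model (Hessian matrix)
  • PCA (principal component analysis)
  • Token-weighted averaging
  • Model merging / model soups
  • Linear mode connectivity diagnostics
  • Gradient-conflict estimation (calibration set)
  • Qwen2.5-VL model family
  • Vision-FLAN dataset
  • FLAN dataset

Strengths:

  • Solid theory: an explicit expression for the merging gain is derived from a local quadratic model and guides the algorithm design.
  • Proposes a decentralized training paradigm that addresses both gradient interference and the communication bottleneck.
  • Large-scale experiments covering 3B and 7B models and both multimodal and text-only settings, with consistent and sizable gains.
  • Practical: the training procedure is unchanged and only a single merge is needed, which suits heterogeneous compute environments.
  • Open-source code supports reproducibility.

Limitations:

  • The theory rests on a local quadratic approximation and a flat-basin assumption that may not fully hold in practice.
  • Requires a merge-ready initialization beforehand (for example, LLaVA Stage 2), which not every model has.
  • PCA conflict splitting depends on a calibration set, whose size and representativeness may affect results.
  • Only a single merge is considered; iterative merging and multi-round communication are not explored.
  • Experiments focus on vision-language models, and the text-only validation is brief.

Relevance To Keywords:

  • Native multimodal large models: The paper uses the Qwen2.5-VL multimodal model to validate the method for multimodal instruction tuning.
  • Unified understanding and generation in multimodal large models: Instruction tuning involves understanding (OCR, reasoning) and generation (dialogue); MERIT improves overall performance through conflict splitting.
  • Representation learning: The theory involves the Hessian and curvature, which relate to the geometry of representations.
  • World models: Indirectly related; instruction tuning can strengthen a model's understanding of the world.
  • Reinforcement learning: Not directly addressed, but the post-training stage is complementary to alignment methods such as RLHF.
  • Post-training: The core topic is optimizing instruction tuning within post-training.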
Score: 43.5 / 27.8
Authors: Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang, Min Sun, Chun-Yi Lee
Published: 2026-06-01
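TL;DR: This paper proposes Lie Diffuser Actor (LDA), which performs diffusion modelling on the SE(3) Lie group to correct the Euclidean geometry error in vision-language-action policies and improve robotic manipulation performance.
Abstract (Chinese translation)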

基于扩散的视觉 - 语言 - 动作策略在机器人操作中取得了显著成功,但犯了一个我们称之为欧几里得谬误(Euclidean Fallacy)的基本几何错误:将 SE(3) 姿态表示为平坦的 R12 向量。这种近似会导致 (1) 违反 SO(3) 约束的流形漂移,(2) 在坐标变换下破坏的等变性,以及 (3) 具有过高运动学代价的非测地线轨迹。我们提出了李扩散演员(Lie Diffuser Actor, LDA),这是一个在 SE(3) 上内在运行的扩散框架。我们的方法通过左不变随机微分方程(SDEs)注入噪声,在切空间中预测得分,并通过指数映射回缩样本。这种构造从根本上消除了流形漂移,同时保证了坐标系等变性和测地线最优性。在 CALVIN ABC→D 基准上,LDA 将平均任务长度从 3.27 提高到 3.51(+7.3%)。我们进一步在真实机器人上验证了该方法,结果表明我们的方法在大多数任务上优于基线方法。

Abstract

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the Euclidean Fallacy: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce Lie Diffuser Actor (LDA), a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC→D, LDA improves average task length from 3.27 to 3.51 (+7.3%). We further validate our method on a real robot, and the results show that it outperforms the baseline on the majority of tasks.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 6.0/10 9.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 5.0/10 7.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 4.0/10 6.0

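Scoring rationale: The paper focuses on the geometric representation used in vision-language-action (VLA) policies and proposes a diffusion framework on the SE(3) Lie group. MultiModal is highly relevant because the work involves multimodal fusion; MLLM and Unify Models are moderately relevant through multimodal unification; model-based RL is moderately relevant as part of the broader reinforcement learning area; Visual Encoder and World Models are weakly related and not core contributions; Tokenizer is not mentioned and has the lowest relevance.

Keywords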

Vision-Language-Action, Diffusion Models, SE(3), Lie Groups, Score Matching, Tangent Space, Robotic Manipulation, Geometric Correctness

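In-depth analysis

Chinese Title (translated): The Lies We Tell: Correcting the Euclidean Fallacy in Vision-Language-Action Policies via Score Matching on the Tangent Space

Summary: The paper points out a fundamental geometric error in current diffusion-based vision-language-action (VLA) policies: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors, which causes manifold drift, broken equivariance under coordinate transformations, and non-geodesic trajectories. To fix this, the authors propose Lie Diffuser Actor (LDA), a diffusion framework that operates directly on the SE(3) manifold. The method injects noise through left-invariant stochastic differential equations, predicts scores in the tangent space (the Lie algebra se(3)), and retracts samples onto the manifold with the exponential map. The theory shows that this eliminates manifold drift, guarantees left-invariant equivariance, and produces geodesic trajectories. On CALVIN ABC→D, average task length rises from 3.27 to 3.51 (+7.3%); in real-robot experiments, the method beats the baseline on the majority of tasks.

Innovations:

  • First to identify and formalize the "Euclidean Fallacy" in VLA diffusion policies, exposing the systematic problems of manifold drift and broken equivariance.
  • Proposes Lie Diffuser Actor, which defines a left-invariant SDE diffusion process on the SE(3) manifold and uses the exponential map to keep outputs on the manifold.
  • Proves three key properties: elimination of manifold drift (Proposition 4.1), left-invariant equivariance (Theorem 4.2), and geodesic trajectory generation (Proposition 4.3).
  • Validates the gains on CALVIN, OpenVLA-OFT, and a real robot, and demonstrates zero-shot generalization to new workspaces.

Methodology: The paper uses a diffusion model on the Lie group SE(3). The forward process adds noise on the manifold through a left-invariant SDE, specifically $g_t = g_0 \cdot \exp(\sigma_t \xi)$, where $\xi \in \mathfrak{se}(3)$ is Gaussian. The reverse process predicts scores in the tangent space (the Lie algebra) and retracts onto the manifold via the exponential map. The network is a conditional diffusion model that takes visual input and language instructions and outputs SE(3) trajectories. Training uses a score-matching loss.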
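
For reference, the forward perturbation can be written out with the standard closed-form exponential map on SE(3). The notation below (the split of the twist into a rotational part ω and a translational part v, and the matrix V) is an assumption chosen for illustration and may differ from the paper's; the formulas themselves are the textbook Rodrigues expressions.

```latex
% Twist \xi = (\omega, v) \in \mathfrak{se}(3), with \theta = \lVert \omega \rVert
% and [\omega]_\times the skew-symmetric matrix of \omega.
\begin{aligned}
\exp(\hat{\xi}) &=
  \begin{pmatrix} R & V v \\ 0 & 1 \end{pmatrix}, \\
R &= I + \frac{\sin\theta}{\theta}\,[\omega]_\times
       + \frac{1-\cos\theta}{\theta^{2}}\,[\omega]_\times^{2}, \\
V &= I + \frac{1-\cos\theta}{\theta^{2}}\,[\omega]_\times
       + \frac{\theta-\sin\theta}{\theta^{3}}\,[\omega]_\times^{2}, \\
g_t &= g_0 \cdot \exp\!\big(\sigma_t \hat{\xi}\big).
\end{aligned}
```

Because $R$ is a rotation matrix for every $\omega$, the rotation block of $g_t$ is a product of rotations and stays orthogonal for every noise level $\sigma_t$, which is why no projection back onto SO(3) is needed after noising.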

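Key Results:

  • On CALVIN ABC→D, average task length improves from 3.27 to 3.51 (+7.3%).
  • On the LIBERO Long tasks with OpenVLA-OFT, the success rate improves from 92.20% to 94.13%.
  • Orthogonality-constraint violation: the median falls by 5.7%, P90 by 11.8%, P95 by 5.4%, and P99 by 2.6%.
  • Real-robot experiments beat the baseline on the majority of tasks.
  • Zero-shot generalization to new workspaces confirms the practical value of equivariance.

Tech Stack:

  • The SE(3) Lie group and the se(3) Lie algebra
  • Left-invariant stochastic differential equations (SDEs)
  • The exponential map and the Rodrigues formula
  • Score matching
  • Conditional diffusion models
  • The OpenVLA-OFT framework
  • The CALVIN benchmark

Strengths:

  • Rigorous theory: exposes a fundamental flaw of existing methods from a geometric viewpoint and provides formal proofs.
  • Elegant method: operating directly on the manifold avoids post-hoc projection and keeps outputs valid.
  • Thorough experiments: effectiveness is validated in simulation and on a real robot, with zero-shot generalization demonstrated.
  • Compatible with existing architectures: it can be integrated into frameworks such as OpenVLA to improve performance.

Limitations:

  • Compute overhead: operations such as the exponential map may add cost, and real-time performance needs further evaluation.
  • Only SE(3) poses are considered; joint-space and other action representations are not covered, which limits scope.
  • Experimental scale: CALVIN tasks are relatively simple, and results on more complex long-horizon tasks remain to be shown.
  • Sensitivity to the noise schedule: the design of the left-invariant SDE's noise schedule may affect performance.

Relevance To Keywords:

  • Unify Models, World Models, Representation Learning, Model-Based RL: The paper focuses on learning geometric representations in VLA policies, which falls under representation learning; a diffusion model can be viewed as a generative world model; the combination with reinforcement learning (post-training) is not directly addressed, but a better action representation can improve RL policy quality.
  • Native multimodal large models, unified understanding and generation in multimodal large models: The paper improves the action-generation part of VLA policies, which relates to unified understanding and generation, but it does not improve language understanding itself.
  • Representation learning: The core contribution is correcting the geometric representation of poses.
  • World models: A diffusion model can be viewed as an implicit world model, and the paper improves its geometric consistency.
  • Reinforcement learning, post-training: Not directly addressed, but the improved action policy could serve as a low-level controller for RL.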
Score: 43.5 / 27.8
Authors: Sarmistha Das, Vaibhav Vishal, Shreyas Guha, Amaan Ali, Kitsuchart Pasupa, Sriparna Saha
Published: 2026-06-01
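TL;DR: This paper proposes the HybridMoE framework and a multimodal idiom corpus, improving how well multilingual vision-language models understand the figurative meaning of idioms.
Abstract (Chinese translation)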

在多语言教育的当代背景下,学习习语为通往创造力、文化价值、历史背景以及各语言传统所蕴含的多元视角提供了一条迷人的途径。本文探讨了如何在低资源东南亚语言(如印地语、孟加拉语和泰语)中保留比喻和文化语义。在这些语言中,由于深层隐喻复杂性,文化丰富的习语对计算建模和跨语言迁移构成了重大障碍。为应对这种复杂性,我们提出了 Varnika,这是一个重建的多模态习语语料库,包含 3,533 个多语言习语,并丰富了与文本及视觉表示对齐的七种习语语调(idiomatic tones)。此外,为了推断信息丰富的习语理解,我们引入了一个混合式混合专家(HybridMoE)框架。该框架嵌入了多个习语专家意见,并通过受控混合整合选定与未选定专家的输出以缓解专家稀疏性问题,同时进一步通过掩码多模态嵌入增强了习语属性信号(Idiomatic Property Signals)。为了从多个维度分析性能,我们提出了 IDIO-TONE 和习语验证分数(Idiomatic Validation Score),这是一个三阶段评估流程,用于衡量(i)字面翻译保真度,(ii)视觉 - 语义对齐,以及(iii)习语意义保留。实证评估表明,HybridMoE 在先进的视觉语言模型(Vision-Language Models)上实现了 5%–6% 的性能提升,展示了在多语言多模态环境中对比喻语言及文化嵌入意义的改进表示。

Abstract

In the contemporary epoch of multilingual education, learning idioms provides a fascinating gateway towards creativity, cultural values, historical context, and diverse perspectives inherent to various linguistic traditions. This paper showcases the navigation of retaining figurative and cultural semantics in low-resource Southeast Asian languages such as Hindi, Bengali, and Thai, where culturally rich idioms pose significant obstacles for computational modeling and cross-linguistic transfer due to their deep metaphorical complexity. To tackle such complexity, we present Varnika, a reconstructed multimodal idiom corpus comprising 3,533 multilingual idioms, enriched with seven idiomatic tones aligned with both textual and visual representations. Additionally, to infer informative idiomatic understanding, we introduce a Hybrid Mixture-of-Experts (HybridMoE) framework that embeds multiple idiomatic expert opinions while mitigating expert sparsity by integrating outputs from both selected and unselected experts through controlled hybridization, further augmented with Idiomatic Property Signals via masked multimodal embeddings. To analyze the performance across multiple dimensions, we propose the IDIO-TONE and Idiomatic Validation Score, a three-stage evaluation pipeline measuring (i) literal translation fidelity, (ii) visual-semantic alignment, and (iii) idiomatic meaning retention. Empirical evaluations highlight that HybridMoE achieves 5-6% performance gains across advanced vision language models, demonstrating improved representation of figurative language and culturally embedded meaning in multilingual multimodal settings.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 4.0/10 6.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 10.0/10 15.0
model-based RL 1.5 0.0/10 0.0

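Scoring rationale: The paper's core is multimodal idiom understanding with a Mixture-of-Experts model, so MultiModal and MLLM are highly relevant; it unifies textual and visual representations, so Unify Models is moderately relevant; Visual Encoder and Tokenizer are supporting components; World Models and model-based RL have no direct connection to the idiom-understanding task.

Keywords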

Hybrid-MoE, Idiomatic Understanding, Multimodal Idiom Corpus, Multilingual, Figurative Language, Vision Language Models, Masked Multimodal Embeddings

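In-depth analysis

Chinese Title (translated): When Meaning Travels: The Fine-Grained Role of Mixture-of-Experts in Idiom Understanding for Language Models

Summary: To address the difficulty of idiom understanding in low-resource Southeast Asian languages (Hindi, Bengali, and Thai), the paper presents Varnika, a multimodal idiom corpus of 3,533 idioms annotated with seven pragmatic tones, and designs the HybridMoE (hybrid Mixture-of-Experts) framework. The framework shares knowledge across experts through a hypernetwork and strengthens multimodal embeddings with Idiomatic Property Signals (IPS). The paper also proposes IDIO-TONE and the Idiomatic Validation (IV) score as a three-stage evaluation (literal translation fidelity, visual-semantic alignment, and idiomatic meaning retention). Experiments show that HybridMoE achieves 5-6% gains across advanced vision-language models, improving the representation of figurative language and culturally embedded meaning.

Innovations:

  • Proposes the HybridMoE framework, which uses a hypernetwork to integrate the outputs of both selected and unselected experts and so mitigates expert sparsity.
  • Introduces Idiomatic Property Signals (IPS), which use masked multimodal embeddings to improve contextual fidelity.
  • Builds the Varnika dataset, providing a multimodal idiom corpus with seven pragmatic-tone labels for three low-resource languages.
  • Designs the three-stage IDIO-TONE and Idiomatic Validation (IV) evaluation scheme to quantify idiom-understanding quality.

Methodology: The Mediom dataset is first rebuilt as Varnika by adding seven pragmatic tones (humor, sarcasm, affection, longing, fear, sadness, and deception), followed by two-stage validation (peer review plus expert review, Cohen's kappa = 0.65). The HybridMoE architecture then takes a multimodal pair (text plus image) as input, processes it with multiple expert modules, uses a hypernetwork to generate expert weights and mix the outputs, and conditions the cross-modal embeddings with the IPS module. Finally, IDIO-TONE and the IV score evaluate three dimensions: literal translation, visual-semantic alignment, and idiomatic meaning retention.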
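
The inter-annotator agreement figure above is a Cohen's kappa value, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard two-annotator computation follows; the toy labels are invented for illustration and are not taken from Varnika.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy annotations (invented for illustration, not taken from Varnika).
annotator_1 = ["humor", "humor", "fear", "fear", "sadness", "sadness"]
annotator_2 = ["humor", "humor", "fear", "sadness", "sadness", "sadness"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The sketch does not guard against the degenerate case where chance agreement equals 1 (both annotators always use a single identical label), which would divide by zero.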

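Key Results:

  • HybridMoE achieves 5-6% performance gains across several advanced vision-language models.
  • Annotation agreement on the Varnika dataset reaches 0.65 (Cohen's kappa).
  • Tone labels overlap between the text and visual modalities at a rate of 54.75%, which the analysis reads as good cross-modal pragmatic alignment.
  • The model outperforms the baselines on idiom understanding in low-resource languages (Hindi, Bengali, and Thai).

Tech Stack:

  • Mixture-of-Experts (MoE)
  • HyperNetwork
  • Idiomatic Property Signal (IPS)
  • Multimodal embeddings (text plus image)
  • Cohen's kappa agreement test
  • IDIO-TONE metric
  • Idiomatic Validation (IV) score

Strengths:

  • Targets the hard problem of idiom understanding in low-resource languages with a high-quality multimodal dataset and dedicated evaluation metrics.
  • HybridMoE mitigates expert sparsity through the hypernetwork and IPS, improving cross-modal semantic alignment.
  • Experiments show generalization across multiple VLMs with clear gains.
  • The dataset's pragmatic-tone labels support fine-grained research on emotion and cultural understanding.

Limitations:

  • The dataset is small (3,533 idioms) and covers only three languages, which may limit cross-lingual generalization.
  • The method is not combined with large-scale world models or reinforcement learning, and the post-training stage is underexplored.
  • The metrics target idiom understanding and do not cover generation tasks or dialogue scenarios.
  • The compute overhead and training efficiency of HybridMoE are not analyzed in detail.

Relevance To Keywords:

  • Multimodal large models: The paper focuses on idiom understanding in vision-language models (VLMs), which falls under multimodal understanding.
  • Representation learning: HybridMoE and IPS optimize multimodal representations and improve the encoding of figurative language.
  • World models: The paper does not directly address world models or physical-law modelling, although idiom understanding requires cultural world knowledge.
  • Reinforcement learning / post-training: The paper uses neither; the improvements are mainly at the forward-inference stage.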
Score: 42.0 / 27.8
Authors: Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran, Pedro Matias, Karl Ridgeway, Nitin Kamra
Published: 2026-06-01
TL;DR: RDA leverages a Vision-Language Model to automate reward design in reinforcement learning, producing policies that are better aligned with task instructions while maintaining high success rates across manipulation tasks.
Abstract (Chinese translation)

强化学习使得机器人能够习得令人印象深刻的技能,但通常需要手工设计的奖励函数,这些函数设计缓慢且难以与人类意图对齐。近期工作(如 Eureka)利用大语言模型(LLM)根据任务描述迭代生成并精炼奖励代码,从而自动化奖励设计。然而,它们依赖于成功率等粗略反馈信号,这些信号对学习到的行为提供的语义洞察微乎其微。因此,它们训练出的策略虽然能达成最终目标,但通常与任务指令的对齐程度较差。我们提出了奖励设计代理(Reward Design Agent, RDA),这是一种基于视觉语言模型(VLM)的智能体框架,将语义理解注入奖励设计过程。RDA 分解任务,视觉评估轨迹,总结失败模式,并迭代修订奖励代码,以更好地与任务指令对齐。在 ManiSkill 的 12 个桌面操作任务和 HumanoidBench 的 4 个全身操作任务上,RDA 生成的策略在指令对齐程度上显著优于其他基线方法,同时实现了相当的任务成功率。视频和生成的奖励代码可在 https://nitinkamra1992.github.io/reward-design-agent 上获取。

Abstract

Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka, automates reward design by using an LLM to iteratively generate and refine reward code from task descriptions. However, they rely on coarse feedback signals such as success rate, which provide little semantic insight into the learned behavior. As a result, their trained policies achieve the final goal but are frequently poorly aligned with task instructions. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that injects semantic understanding into reward design. RDA decomposes tasks, visually evaluates trajectories, summarizes failure modes, and iteratively revises reward code to better align with task instructions. Across 12 tabletop manipulation tasks from ManiSkill and 4 whole-body manipulation tasks from HumanoidBench, RDA produces policies substantially more instruction-aligned than those of other baselines, while achieving comparable task success rates. Videos and the generated reward code are available on https://nitinkamra1992.github.io/reward-design-agent.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 7.0/10 10.5
model-based RL 1.5 3.0/10 4.5

Scoring rationale: The paper primarily utilizes a Vision-Language Model (MLLM) and MultiModal capabilities (Vision + Language) for reward design in Reinforcement Learning, justifying high scores for these keywords. Visual Encoder is implicitly used for trajectory evaluation but is not the core contribution. Other keywords such as Unify Models, Tokenizer, World Models, and model-based RL are not central to the proposed method or explicitly discussed as core innovations (focus is on reward shaping rather than model architecture unification, tokenizers, world models, or model-based planning).

Keywords

Reward Design Agent, Reinforcement Learning, VLM-based, Semantic Understanding, Task Alignment, Visual Evaluation, Agentic Framework

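In-depth analysis

Chinese Title (translated): RDA: A Reward Design Agent for Reinforcement Learning

Summary: This paper proposes the Reward Design Agent (RDA), an agentic framework driven by a vision-language model (VLM) that automates reward-function design for reinforcement learning. Earlier methods such as Eureka rely on coarse feedback signals like success rate, so the resulting policies complete the task but align poorly with the instruction. RDA decomposes the natural-language instruction into sub-tasks, uses the VLM to evaluate training trajectories visually, diagnoses failure modes, and iteratively revises the reward code. On 12 tabletop manipulation tasks from ManiSkill and 4 whole-body manipulation tasks from HumanoidBench, RDA's policies match the baselines on task success rate while being substantially better aligned with the instruction than human-designed rewards and methods such as Eureka. Through a closed-loop evolutionary search that combines sub-task-level visual evaluation with targeted revision, RDA improves both reward quality and behavioral alignment.

Innovations:

  • Proposes RDA, a VLM-driven agentic framework that injects visual semantic understanding into reward design, overcoming the limits of purely numeric feedback in methods such as Eureka.
  • Introduces task decomposition, which splits a complex instruction into interpretable sub-tasks for fine-grained behavior evaluation and diagnosis.
  • Builds a closed-loop evolutionary search: train a policy, evaluate trajectories visually with the VLM, generate a sub-task-level diagnostic report, then revise the reward code accordingly.
  • On long-horizon whole-body manipulation tasks, RDA avoids the behavioral misalignment that Eureka produces (such as throwing instead of pushing) and achieves instruction alignment.

Methodology: RDA has two stages. In the initialization stage, the VLM decomposes the natural-language instruction into a list of sub-tasks and generates initial candidate reward code from them. In the evolutionary search stage, each candidate reward is used to train a policy; several trajectories are sampled and visually evaluated by the VLM, which outputs sub-task scores and a reasoning report. These are aggregated into a diagnostic summary that is used to revise both the sub-task list and the reward code. The process runs for a fixed number of iterations, progressively refining the reward function.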
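
The two-stage flow can be sketched as a generic search loop. Everything below is a schematic under stated assumptions: the function name, the callables (stand-ins for VLM calls and RL training), and the rule of keeping the candidate with the highest mean sub-task score are hypothetical, not RDA's actual selection logic.

```python
def reward_design_search(instruction, decompose, propose_rewards, train_policy,
                         evaluate, revise, iterations=3):
    """Evolutionary reward search; every callable is a hypothetical stand-in."""
    subtasks = decompose(instruction)           # initialization stage
    candidates = propose_rewards(subtasks)
    best = None                                 # (mean sub-task score, reward code)
    for _ in range(iterations):                 # evolutionary search stage
        reports = []
        for reward_code in candidates:
            policy = train_policy(reward_code)
            report = evaluate(policy, subtasks)  # per-sub-task scores in [0, 1]
            mean = sum(report.values()) / len(report)
            reports.append((reward_code, report))
            if best is None or mean > best[0]:
                best = (mean, reward_code)
        subtasks, candidates = revise(subtasks, reports)
    return best


# Toy usage: reward "code" is a string, and a sub-task scores 1.0 if the code names it.
best = reward_design_search(
    instruction="push the cube to the goal",
    decompose=lambda text: ["reach", "push"],
    propose_rewards=lambda subtasks: ["reach"],
    train_policy=lambda code: code,  # stub: the "policy" is just the reward text
    evaluate=lambda policy, subtasks: {s: float(s in policy) for s in subtasks},
    revise=lambda subtasks, reports: (
        subtasks,
        [code + "+" + s for code, rep in reports for s, v in rep.items() if v == 0.0],
    ),
)
```

The point of the sketch is the feedback path: the revision step sees which sub-task failed, not just an overall success rate, so it can make a targeted change.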

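Key Results:

  • Across the 12 ManiSkill tabletop tasks, both RDA and Eureka outperform human-designed rewards, but only RDA succeeds on PlugCharger, the most challenging task.
  • Across the 4 HumanoidBench whole-body tasks, RDA and Eureka reach comparable success rates, but RDA's policies follow the instruction strictly (for example, pushing rather than throwing), whereas Eureka shows behavioral misalignment.
  • Through sub-task-level visual diagnosis, RDA identifies and repairs failure modes that Eureka cannot detect, substantially improving instruction alignment.

Tech Stack:

  • Vision-language model (VLM) for trajectory evaluation and sub-task decomposition
  • Reinforcement learning algorithms (PPO and others) for policy training
  • Evolutionary search for iterating over reward candidates
  • ManiSkill simulator (tabletop manipulation tasks)
  • HumanoidBench simulator (whole-body manipulation tasks)
  • Reward code generation: executable Python code generated by an LLM/VLM

Strengths:

  • Visual semantic evaluation compensates for purely numeric feedback and can detect fine-grained behavioral misalignment.
  • Task decomposition makes diagnosis and revision more precise and more interpretable.
  • Needs no large-scale human annotation or pretrained reward model, only a VLM and a simulator.
  • Shows good instruction alignment on long-horizon, complex manipulation tasks.

Limitations:

  • Depends on the VLM's visual understanding, which may be limited in perceiving fine-grained physical interaction.
  • Evolutionary search requires training policies many times, so compute cost is high.
  • Validated only in simulation so far, not on real robots.
  • The quality of sub-task decomposition and diagnostic reports depends on VLM performance and may include misjudgments.

Relevance To Keywords:

  • Unify Models: The paper uses a VLM to unify visual understanding and reward design but does not address a unified multimodal model framework.
  • World Models: The paper does not build a world model directly, although the VLM's visual evaluation can be seen as an implicit understanding of the trajectory.
  • Representation Learning: Not addressed, although the VLM's visual encoding can be seen as a representation.
  • Model-Based RL: The paper concerns reward design for model-free RL and does not use an environment model.
  • Native multimodal large models: The paper uses a VLM as its core component, an application of multimodal large models.
  • Unified understanding and generation in multimodal large models: The VLM is used both to understand trajectories and to generate reward code.
  • Post-training: Model post-training is not addressed; the VLM's reasoning ability comes from pretraining.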
Score: 40.5 / 27.8
Authors: Ziyuan Li, Yueyu Sun, Yimeng Zhang
Published: 2026-06-01
TL;DR: EVA-Net solves subject-independent EEG motor decoding by using video-derived priors to align EEG and video features, achieving significant accuracy gains without inference overhead.
Abstract (Chinese translation)

实用的非侵入式脑机接口 (BCI) 系统要求脑电解码器具备强大的跨被试泛化能力以及最小的校准需求。然而,被试间变异性与信号非平稳性往往将运动语义与被试特异性噪声纠缠在一起,限制了独立于被试的解码性能。尽管近期多模态方法利用文本作为语义锚点,但文本对于本质上动态的运动过程而言,提供的监督是稀疏且静态的。为了解决这一问题,我们提出 EVA-Net,这是一种两阶段框架,利用动作视频作为语义先验,以实现独立于被试的脑电运动解码。在第一阶段,利用跨模态和监督对比目标,将脑电特征与视频特征对齐到共享空间,以减少被试特异性变异。在第二阶段,通过视频类别原型和知识蒸馏,将视频衍生的先验知识转移至仅使用脑电的分类器中,且不增加推理开销。在两个公开数据集上的实验表明,EVA-Net 实现了强大的独立于被试的解码性能,其中包括在 EEGMMI 数据集上 8.66% 的留一被试法 (LOSO) 准确率增益。消融实验结果进一步表明,视频相较于本工作中考虑的文本基线,提供了更有效的语义锚点。

Abstract

Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal calibration. However, inter-subject variability and signal non-stationarity often entangle motor semantics with subject-specific noise, limiting subject-independent decoding. Recent multimodal approaches use text as a semantic anchor, yet text provides sparse and static supervision for inherently dynamic motor processes. To address this issue, we propose EVA-Net, a two-stage framework that uses action videos as semantic priors for subject-independent EEG motor decoding. In the first stage, EEG and video features are aligned in a shared space using cross-modal and supervised contrastive objectives to reduce subject-specific variation. In the second stage, video category prototypes and knowledge distillation transfer video-derived priors to an EEG-only classifier without adding inference overhead. Experiments on two public datasets show that EVA-Net achieves strong subject-independent decoding performance, including an 8.66% LOSO accuracy gain on EEGMMI. Ablation results further suggest that video provides a more effective semantic anchor than the text baseline considered in this work.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 6.0/10 9.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 8.0/10 12.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on EEG decoding using video priors, making it highly relevant to MultiModal and Visual Encoder. Unify Models is moderately relevant due to cross-modal alignment. Tokenizer, World Models, MLLM, and model-based RL are irrelevant as the work involves BCI decoding without language models, tokenization, or RL. No target expert authors were found.

Keywords

EEG Motor Decoding, Subject-Independent, Video-Derived Priors, Multimodal Learning, Knowledge Distillation, Cross-Modal Alignment, Brain-Computer Interface, Contrastive Learning

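In-depth analysis

Chinese Title (translated): EVA-Net: Cross-Subject EEG Motor Decoding with Video Motion Priors

Summary: To address poor cross-subject generalization of EEG decoding in brain-computer interfaces, the paper proposes EVA-Net, a two-stage framework. In the first stage, a pretrained EEG encoder (based on EEG Conformer) and a video encoder (VideoMAE) are aligned in a shared semantic space using cross-modal contrastive learning (MIL-NCE) and a supervised contrastive loss, which reduces subject-specific noise. In the second stage, video category prototypes (from K-means clustering) and knowledge distillation transfer video motion priors to an EEG-only classifier, so no video input is needed at inference. On EEGMMI and one other public dataset, EVA-Net achieves strong subject-independent decoding, including an 8.66% leave-one-subject-out (LOSO) accuracy gain on EEGMMI, and ablations indicate that video is a better semantic anchor than the text baseline.

Innovations:

  • First to use action videos as dynamic semantic priors for cross-subject EEG motor decoding, replacing static text anchors.
  • Proposes a two-stage training paradigm (cross-modal alignment, then prior-guided classification) that needs only EEG input at inference.
  • Designs video category prototypes (K-means clustering) combined with Log-Sum-Exp aggregation for efficient transfer of video priors.
  • Combines the MIL-NCE cross-modal contrastive loss with a supervised contrastive loss to improve both cross-modal consistency and within-class compactness of EEG features.
  • Introduces response-based knowledge distillation to pass inter-class semantic relations from a video teacher network to an EEG student network.

Methodology: Training has two stages. 1) Alignment: the EEG encoder (convolution plus Transformer) and the video encoder (VideoMAE) extract features, projection heads map them into a shared space, and a bidirectional MIL-NCE loss and a supervised contrastive loss are optimized jointly. 2) Classification: the video features of each class are clustered with K-means into several prototypes; the EEG feature's cosine similarity to every prototype is computed and aggregated per class by Log-Sum-Exp into a class logit; training combines label-smoothed cross-entropy, a knowledge-distillation loss, and a prototype regularization loss. At inference only the EEG encoder and the prototype head are used.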
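
The prototype head in the classification stage can be sketched as below. This is a minimal illustration, not EVA-Net's implementation: the `temperature` parameter and its placement inside the Log-Sum-Exp are assumptions, and the two-dimensional prototypes are invented toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def prototype_logits(eeg_feature, prototypes, temperature=0.1):
    """Class logit = temperature * log-sum-exp(similarity / temperature) over prototypes.

    `prototypes` maps a class name to its list of prototype vectors
    (for example, K-means centroids of that class's video features).
    """
    logits = {}
    for label, protos in prototypes.items():
        sims = [cosine(eeg_feature, p) / temperature for p in protos]
        peak = max(sims)  # subtract the max for numerical stability
        lse = peak + math.log(sum(math.exp(s - peak) for s in sims))
        logits[label] = temperature * lse
    return logits


# Toy prototypes (invented values): two per class in a 2-D feature space.
prototypes = {
    "left_hand": [[1.0, 0.0], [0.9, 0.1]],
    "right_hand": [[0.0, 1.0], [0.1, 0.9]],
}
logits = prototype_logits([0.8, 0.2], prototypes)
predicted = max(logits, key=logits.get)
```

Log-Sum-Exp acts as a smooth maximum, so a class scores highly when the EEG feature is close to any one of its prototypes rather than to their average.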

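Key Results:

  • On the EEGMMI dataset, leave-one-subject-out (LOSO) accuracy improves by 8.66% over the baseline.
  • Cross-subject decoding also improves on the other public dataset.
  • Ablations confirm that video priors outperform text priors and that each loss component contributes.
  • Knowledge distillation and prototype regularization further improve classification performance.

Tech Stack:

  • EEG Conformer (convolution plus self-attention)
  • VideoMAE (self-supervised video pretraining)
  • MIL-NCE loss (multiple-instance contrastive learning)
  • Supervised contrastive loss
  • Cosine similarity and L2 normalization
  • K-means clustering
  • Log-Sum-Exp (LSE) aggregation
  • Knowledge distillation (response-based, temperature-scaled)
  • Label-smoothed cross-entropy

Strengths:

  • Introduces video as a dynamic semantic prior, which fits the temporal nature of motor EEG better than text.
  • The two-stage design adds no video computation at inference, which suits practical BCI systems.
  • Cross-modal contrastive learning reduces inter-subject variability and improves generalization.
  • Effectiveness is validated on multiple datasets with thorough ablations.

Limitations:

  • Depends on high-quality action videos; video quality or viewpoint changes may degrade the priors.
  • Evaluated on only two public datasets, so generalization needs validation in more settings.
  • The text baseline may not use the best available text encoder, so the fairness of the video advantage needs further confirmation.
  • Does not examine how video priors affect motor execution versus motor imagery tasks.
  • Not combined with reinforcement learning, world models, or other more recent approaches.

Relevance To Keywords:

  • Multimodal large models: The paper aligns two modalities, video and EEG, through contrastive learning, which falls under multimodal representation learning.
  • Representation learning: The core goal is a general EEG representation that transfers across subjects and modalities.
  • World models: Video as a motion prior can be seen as implicit modelling of physical-world motion, but the paper does not invoke a world-model framework.
  • Reinforcement learning: The paper does not involve reinforcement learning or post-training, although video priors could potentially support policy learning in BCIs.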
Score: 40.5 / 27.8
Authors: Yuhang Han, Wenzheng Yang, Yujie Chen, Xiangqi Jin, Yaojie Zhang, Siteng Huang, Linfeng Zhang
Published: 2026-06-01
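TL;DR: This paper proposes STaR-KV, a framework that uses spatio-temporal adaptive re-weighting to substantially reduce the KV-cache memory footprint of GUI vision-language models, addressing a deployment bottleneck.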
摘要翻译

基于视觉 - 语言模型的图形用户界面(GUI)代理已展现出广泛的自动化能力,然而其部署却受到键值(KV)缓存的瓶颈制约,该缓存随交互步骤线性增长。例如,UI-TARS-1.5-7B 在仅处理五个截图时便消耗 76 GB GPU 显存,接近主流 80 GB 加速器的容量上限。现有的 KV 压缩方法共享两个结构假设:一是将视觉令牌的重要性聚合到单个共享显著性图中,二是对融合后的分数分布应用固定的 top-B 截断。试点测量否定了这两点:空间专业化存在于注意力子空间(attention-subspace)层级并随层迁移,而分数分布的形状会沿轨迹发生漂移。我们提出 STaR-KV(时空自适应重加权),一种无需训练的 KV 缓存压缩框架,该框架沿三个轴校准令牌重要性:(i) 由在线空间互信息驱动的感知子空间评分;(ii) 时间稳定性折扣,用于抑制来自持久关注子空间的冗余缓存条目;(iii) 基于熵的温度参数,用于自适应重塑分数分布。在四个 GUI 基准测试中,STaR-KV 在相同预算下实现了所有现有先进 KV 压缩方法(如 GUIKV、SnapKV)中最强的平均准确率,且压缩阶段无 FLOPs 开销(-0.07%),在 20% KV 缓存预算下将峰值 GPU 显存消耗降低了近 40%。代码开源地址为 https://github.com/kawhiiiileo/STaR-KV。

Abstract

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at https://github.com/kawhiiiileo/STaR-KV.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 4.0/10 6.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 1.0/10 1.5

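评分理由: The paper focuses on KV-cache compression for GUI vision-language models, an application within the MLLM and multimodal areas, so MLLM and MultiModal score high. The visual encoder architecture is involved but is not a contribution, and tokens are mentioned only as the unit of caching. Model unification, world models, and reinforcement learning are not addressed, so their relevance is low. The author list was checked and contains none of the specified experts (Yang Shi et al.).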

关键词

KV Cache Compression, GUI Vision-Language Models, Spatio-Temporal Adaptive Re-weighting, Memory Efficiency, Large Language Models, Token Importance, Deployment Optimization

深度分析

Chinese Title: STaR-KV: 面向GUI视觉语言模型的时空自适应重加权KV缓存压缩方法

Summary: 本文针对基于视觉语言模型(VLM)的图形用户界面(GUI)代理在部署时面临的KV缓存线性增长问题(例如UI-TARS-1.5-7B在5张截图下消耗76GB显存),提出了一种无需训练的KV缓存压缩框架STaR-KV。现有压缩方法存在两个结构性缺陷:将视觉token重要性聚合为单一共享显著性图(忽略注意力子空间的空间异质性),以及使用固定的top-B截断(忽略分数分布随任务和步骤的漂移)。STaR-KV通过三个校准轴解决上述问题:基于在线空间互信息的子空间感知评分、抑制持久注意力子空间冗余的累积时间稳定性折扣、以及基于熵的自适应温度重塑分数分布。在四个GUI基准测试上,STaR-KV在匹配预算下取得了优于现有方法(如GUIKV、SnapKV)的平均准确率,压缩阶段无额外FLOPs开销(-0.07%),在20% KV缓存预算下峰值显存降低近40%。

Innovations:

  • 提出在线空间互信息(MI)估计,在子空间(GQA组或注意力头)级别对视觉token进行空间感知评分,保留布局敏感信号。
  • 引入累积时间稳定性折扣,对由持久稳定子空间(如固定工具栏)控制的token施加衰减,抑制冗余缓存条目,同时保留动态部件相关token。
  • 设计自适应熵基锐化(AEB)机制,根据分数分布的归一化熵动态调整温度,消除固定top-B截断在平坦分布下的任意性,并在峰值分布下增强选择性。
  • 无需微调,可直接应用于现有GUI VLM(如UI-TARS、OpenCUA),在多个基准上达到SOTA压缩性能。

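Methodology: STaR-KV uses a three-axis calibration pipeline. It first builds a base score s̄_t by aggregating attention from the most recent queries, optionally augmented with the norm of the residual-stream hidden state. It then estimates, via online spatial mutual information, how strongly each subspace is coupled to 2D screen coordinates; this yields a subspace prior weight w_g^*(t) that up-weights the scores of the subspace that best explains each token. Next it computes a cumulative temporal stability discount D_t, which attenuates visual tokens according to how stable their subspace has been over past frames. Finally, adaptive entropy-based sharpening (AEB) derives a temperature 1/T from the normalized entropy Ĥ and reshapes the fused score distribution. The top-B tokens are selected and concatenated with a fixed recent window to form the compressed cache. All statistics are maintained online and need only O(L) extra memory per layer.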
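
The last two steps above (entropy-derived temperature, then top-B selection plus a fixed recent window) can be illustrated with a small NumPy sketch. The mapping from normalized entropy to the exponent, the bounds `t_min` and `t_max`, and all function names are assumptions made for this example; the report does not give the paper's actual formula, and the mutual-information and stability-discount terms are omitted.

```python
import numpy as np

def normalized_entropy(scores):
    """Shannon entropy of the normalized scores divided by log(n), so it lies in [0, 1]."""
    n = scores.size
    if n <= 1:
        return 0.0
    p = scores / scores.sum()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(n))

def sharpen(scores, t_min=0.5, t_max=2.0):
    """Hypothetical mapping: a flatter distribution (high entropy) gets a larger exponent."""
    inv_temp = t_min + (t_max - t_min) * normalized_entropy(scores)
    sharpened = scores ** inv_temp
    return sharpened / sharpened.sum()

def select_kept_tokens(scores, budget, recent_window):
    """Keep the `budget` best older tokens plus the last `recent_window` tokens."""
    n = scores.size
    split = max(0, n - recent_window)
    recent = np.arange(split, n)
    older = np.arange(0, split)
    if older.size == 0:
        return recent
    probs = sharpen(scores[older])
    # Note: a monotone power transform changes how mass is concentrated, not the ranking.
    top = older[np.argsort(-probs, kind="stable")[: min(budget, older.size)]]
    return np.sort(np.concatenate([top, recent]))

fused = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.4, 0.6])  # toy fused importance scores
kept = select_kept_tokens(fused, budget=2, recent_window=2)
```

In this toy case the two highest-scoring older tokens and the two most recent tokens survive; the sharpened probabilities would matter in a fuller pipeline that thresholds or mixes scores rather than only ranking them.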

Key Results:

  • 在四个GUI基准测试上,STaR-KV在匹配KV缓存预算下平均准确率优于GUIKV、SnapKV等SOTA方法。
  • 压缩阶段无额外FLOPs开销(-0.07%)。
  • 在20% KV缓存预算下,峰值GPU显存降低近40%。
  • 验证了子空间级空间互信息(MI)的稳定性(Spearman ρ>0.85),以及注意力熵随步骤单调上升并存在任务间方差的现象。

Tech Stack:

  • 分组查询注意力(GQA)
  • 多头注意力(MHA)
  • 互信息(MI)估计
  • 归一化熵(Ĥ)
  • 自适应温度缩放
  • 1D平均池化
  • 残差流隐藏状态范数
  • 固定预算top-B选择
  • 最近窗口保留

Strengths:

  • 针对GUI场景的KV缓存压缩问题,发现了现有方法忽略的子空间空间异质性和分数分布漂移两个关键盲点。
  • 提出的三轴校准方法无需训练,计算开销极小,易于集成到现有GUI VLM中。
  • 在多个基准上取得SOTA性能,显存节省显著,具有实际部署价值。
  • 代码开源,可复现性强。

Limitations:

  • 方法依赖于在线互信息估计,可能对极短交互序列(如单帧)效果有限。
  • 时间稳定性折扣假设子空间稳定性在帧间一致,但某些动态任务中子空间可能快速切换,折扣策略需进一步验证。
  • 仅针对GUI VLM设计,未验证在自然图像/视频VLM上的泛化性。
  • 实验仅在UI-TARS和OpenCUA等模型上进行,未覆盖更多VLM架构。

Relevance To Keywords:

  • 原生多模态大模型:论文针对GUI VLM(多模态大模型的一种)的KV缓存压缩,属于多模态大模型推理效率优化。
  • 多模态大模型的理解和生成一体化:GUI VLM同时涉及截图理解(视觉)和动作生成(文本),本文压缩方法不影响理解与生成能力。
  • 表征学习:子空间空间互信息估计本质上是对注意力表征的空间耦合性进行学习,属于表征学习范畴。
  • 世界模型:GUI代理需要建模屏幕状态和交互历史,KV缓存压缩保留了关键历史信息,间接支持世界模型构建。
  • 强化学习/后训练:本文方法无需后训练,但压缩后的缓存可加速强化学习训练中的推理过程。
Score: 40.5 / 27.8
Authors: Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan
Published: 2026-06-01
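TL;DR: This paper proposes the Residual Decoder Adapter (RDA), which upgrades a visual tokenizer non-invasively and markedly improves text-rendering accuracy in visual autoregressive models without retraining the existing tokenizer or AR model.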
摘要翻译

视觉自回归(AR)模型通过预测离散 token 来生成图像,这些 token 由视觉分词器进行解码。尽管展现出强大的整体图像生成能力,它们在文本渲染方面仍表现欠佳,容易产生模糊笔画并破坏字母形状。在本工作中,我们将这一局限归因于视觉分词器,该分词器难以重建细粒度细节。改进分词器虽然直接但成本高昂,因为这需要同时重新训练分词器和 AR 模型。我们能否在不重新训练现有分词器和 AR 模型的情况下,提升 AR 模型的文本渲染性能?为此,我们提出了残差解码器适配器(RDA),该适配器可在不改变其 token 空间的前提下,事后升级现有的分词器。具体而言,它通过引入两个新颖组件来细化视觉分词器的解码器输出:(i) 一个与原始码本共享 token 分布的配对码本;(ii) 一个并行分支,用于在像素空间中学习重建图像与真实图像之间的微小差异(残差)。这种残差设计使我们能够在非侵入式地增强分词器的同时,保持与先前 AR 模型的兼容性。RDA 显著提升了文本渲染性能。例如,在竞争性的 TextAtlas 基准上,微调后的 Janus-Pro OCR 准确率从 24.52% 提升至 58.26%(TextVisionBlend),从 12.75% 提升至 36.81%(StyledTextSynth)。代码已开源,详见 https://github.com/CSU-JPG/RDA。

Abstract

Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and distorted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (residual) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA substantially improves text rendering. For instance, on the competitive TextAtlas benchmark, it raises the OCR accuracy of fine-tuned Janus-Pro from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth). The code is available at https://github.com/CSU-JPG/RDA.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 9.0/10 13.5
Visual Encoder 1.5 4.0/10 6.0
World Models 1.5 2.0/10 3.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 5.0/10 7.5
model-based RL 1.5 1.0/10 1.5

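评分理由: The paper's core contribution is improving the visual tokenizer to fix text rendering, so Tokenizer has the highest relevance (9). MultiModal is involved through text-image generation (5), and Visual Encoder is a related component (4). Unify Models and MLLM are only weakly related (3 each), while World Models and model-based RL are essentially unrelated (2 and 1). The author list contains none of the specified experts, so no bonus applies.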

关键词

Visual Autoregressive, Text Rendering, Visual Tokenizer, Residual Decoder Adapter, Post-hoc Adaption, Token Space, Image Generation

深度分析

Chinese Title: 残差解码器适配器:面向自回归文本渲染的保持ID的标记器适配

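Summary: This paper studies why autoregressive (AR) models perform poorly at text rendering. AR models rely on a discrete visual tokenizer to produce images, and existing tokenizers are a bottleneck when reconstructing fine-grained detail such as text. Improving the tokenizer directly would require retraining the whole AR model, which is expensive. The paper therefore proposes the Residual Decoder Adapter (RDA), a plug-and-play post-hoc framework that improves reconstruction quality without changing the original tokenizer's ID space. RDA has two lightweight components: a shared-ID Hint Codebook and a Residual Decoder. The former uses the same token IDs as the original codebook and supplies complementary high-frequency detail; the latter learns, in pixel space, the residual between the reconstructed image and the ground truth. RDA needs no retraining of the AR model. On TextAtlas it raises the OCR accuracy of fine-tuned Janus-Pro by about 34 percentage points on TextVisionBlend (24.52% to 58.26%) and about 24 on StyledTextSynth (12.75% to 36.81%) while preserving overall image fidelity.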

Innovations:

  • 提出残差解码器适配器(RDA),一种无需重新训练AR模型即可提升视觉标记器文本渲染能力的后处理框架。
  • 设计共享ID的提示码本(Shared-ID Hint Codebook),与原始码本使用相同标记ID,提供互补的高频细节特征,保持与预训练AR模型的兼容性。
  • 引入残差解码器,在像素空间学习重建图像与真实图像之间的残差,有效恢复量化过程中丢失的细粒度细节。
  • 采用实例依赖的特征注入机制,将基础特征与提示特征融合,确保残差学习针对具体图像内容。
  • 提出多种损失函数(如感知损失、对抗损失等)以稳定训练,克服残差信号稀疏且易被背景像素主导的困难。

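Methodology: The encoder, quantizer, and decoder of the original visual tokenizer (VQ-VAE) are frozen so that its ID space is preserved. A shared-ID Hint Codebook is added; it is indexed by the same token IDs as the original codebook but learns complementary high-frequency features. A Residual Decoder then takes the intermediate features of the original decoder together with the Hint Codebook features, fuses them through instance-dependent feature injection (two projectors), and predicts a pixel-level residual. The final reconstruction is the original decoder output plus this residual. Training optimizes the Residual Decoder and Hint Codebook with several losses, including L1, perceptual (LPIPS), adversarial (GAN), and codebook losses. At inference, the AR model generates a sequence of token IDs, RDA looks up features from both codebooks with the same IDs, and the Residual Decoder produces the enhanced image.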
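
The shared-ID lookup and the "base output plus residual" composition described above can be sketched in a few lines of NumPy. Everything here is a toy stand-in: the codebook sizes, the `base_decoder` and `residual_decoder` functions, and the projector matrices are hypothetical, and the sketch feeds base codebook features to the residual branch where the paper uses intermediate decoder features.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 16, 8
base_codebook = rng.normal(size=(VOCAB, DIM))   # frozen original codebook
hint_codebook = rng.normal(size=(VOCAB, DIM))   # added codebook, indexed by the SAME token IDs

def base_decoder(feat):
    """Stand-in for the frozen tokenizer decoder: features -> one 'pixel' per token."""
    return np.tanh(feat.mean(axis=-1))

def residual_decoder(base_feat, hint_feat, w_base, w_hint):
    """Fuse both feature streams with two projectors and predict a small residual."""
    mixed = base_feat @ w_base + hint_feat @ w_hint
    return 0.1 * np.tanh(mixed.mean(axis=-1))

def decode(token_ids, w_base, w_hint):
    base_feat = base_codebook[token_ids]         # lookup with the AR model's token IDs
    hint_feat = hint_codebook[token_ids]         # same IDs, different features
    coarse = base_decoder(base_feat)
    refined = coarse + residual_decoder(base_feat, hint_feat, w_base, w_hint)
    return refined, coarse

token_ids = rng.integers(0, VOCAB, size=(4, 4))  # a toy 4x4 token grid
zeros = np.zeros((DIM, DIM))
refined_zero, coarse = decode(token_ids, zeros, zeros)
refined_rand, _ = decode(token_ids, rng.normal(size=(DIM, DIM)), rng.normal(size=(DIM, DIM)))
```

Because the token IDs are untouched, zeroed projectors reduce the output to the original decoder's image, which is one way to see why the adapter stays compatible with an existing AR model.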

Key Results:

  • 在TextAtlas基准上,将微调后的Janus-Pro OCR准确率从24.52%提升至58.26%(TextVisionBlend),从12.75%提升至36.81%(StyledTextSynth)。
  • 在AnyText-Benchmark上,将LlamaGen-VQ的文本准确率从21.26%提升至36.79%,LPIPS从0.1912降至0.1832。
  • RDA在多个主流AR模型(Janus-Pro、TAR、Lumina-mGPT)上均取得一致的文本渲染性能提升。
  • RDA在提升文本渲染的同时,保持整体图像生成质量,不损害原有图像保真度。

Tech Stack:

  • VQ-VAE(向量量化变分自编码器)作为基础视觉标记器
  • 共享ID的提示码本(Shared-ID Hint Codebook)
  • 残差解码器(Residual Decoder),基于VQ-VAE解码器架构,最后两层卷积通道加倍
  • 实例依赖的特征注入:两个投影器(projector)分别处理基础特征和提示特征
  • 损失函数:L1损失、感知损失(LPIPS)、对抗损失(GAN)、码本损失(commitment loss)
  • 自回归模型(AR model):Janus-Pro、TAR、Lumina-mGPT等

Strengths:

  • 非侵入式设计:无需重新训练AR模型或修改原始标记器,兼容现有预训练模型。
  • 即插即用:RDA作为外部适配器,可快速部署到任意基于VQ的AR模型上。
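  • Clear text-rendering gains: on TextAtlas, OCR accuracy improves by about 34 percentage points on TextVisionBlend and about 24 on StyledTextSynth; on AnyText-Benchmark the gain for LlamaGen-VQ is about 16 points.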
  • 保持图像保真度:在增强文本细节的同时,不损害整体图像质量(LPIPS下降)。
  • 方法简洁有效:仅需训练两个轻量组件,训练成本低,易于复现。

Limitations:

  • 依赖原始标记器的质量:RDA的改进上限受限于原始标记器的编码能力。
  • 增加推理计算开销:引入额外的码本查找和残差解码器,可能增加推理延迟。
  • 仅针对文本渲染:论文主要聚焦于文本渲染任务,对其他细粒度细节(如人脸、纹理)的泛化能力未充分验证。
  • 未与扩散模型直接对比:虽然提及扩散模型在文本渲染上更优,但未提供RDA与扩散模型的公平对比。

Relevance To Keywords:

  • 原生多模态大模型:RDA提升了自回归多模态模型(如Janus-Pro)的文本生成能力,有助于多模态理解和生成一体化。
  • 表征学习:通过共享ID的提示码本和残差学习,改进了视觉标记器的表征质量,保留高频细节。
  • 世界模型:自回归模型可视为世界模型的一种形式,RDA增强了其生成细粒度视觉世界的能力。
  • 模型基于强化学习:论文未直接涉及强化学习,但后训练阶段(如微调)与强化学习微调范式相关。
  • 后训练:RDA是一种后训练适配方法,无需从头训练,符合后训练提升模型能力的趋势。
Score: 40.5 / 27.8
Authors: Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem, Aykut Erdem, Duygu Ceylan
Published: 2026-06-01
TL;DR: The paper proposes Auteur, a language-driven method utilizing a multimodal large language model and domain-specific language to achieve human-centric camera framing in generative video generation.
摘要翻译

生成式视频模型在视觉保真度和时间一致性方面取得了显著进展,然而有意控制相机运动仍然难以实现。现有框架将相机运动视为像素合成的副产品,生成的轨迹具有随机性和空间不一致性,且不考虑驱动场景的人物主体。在这项工作中,我们提出了 Auteur,一种用于生成式视频的语言驱动、以人为中心的相机构图方法。我们的核心见解在于,专业电影制作人将镜头构思为相对于人物主体定义的构图,而非世界空间轨迹,将景别、角度和构图表示为人体姿态和运动的函数。我们将这种直觉形式化为一种以人为中心的相机参数化,并引入一种领域特定语言(DSL),该语言可转换为标准的 6-DoF 相机参数。随后,一个微调的多模态大语言模型充当虚拟导演,将自然语言描述和粗略人体运动映射到稀疏的 DSL 关键帧,这些关键帧被确定性插值为连续相机轨迹,随后作为输入提供给视频生成器。我们在一个新数据集上训练并评估了 Auteur,该数据集包含 3.4 万条对齐的文本、人体运动和 DSL 标注的相机轨迹,源自程序化合成以及来自 CondensedMovies 数据集的真实电影片段。Auteur 实现了以人为中心场景的电影级构图,而这一能力在很大程度上缺失于先前的生成式模型中。为了评估这一行为,我们提出了新的以构图为中心的指标,实验结果表明 Auteur 一贯优于现有方法。

Abstract

Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 4.0/10 6.0
MLLM 1.5 8.0/10 12.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper centers on using a Multimodal Large Language Model (MLLM) for camera control, yielding high scores for MLLM and MultiModal. Video generation relates moderately to World Models, but the work focuses on framing rather than world modeling. Unify Models, Tokenizer, and Visual Encoder are not primary contributions. No Reinforcement Learning is involved, so model-based RL scores zero. No specified expert authors are present.

关键词

Language-Driven, Cinematographic Framing, Human-Centric Video Generation, Multimodal Large Language Model, Domain-Specific Language, Camera Control, Video Generation

深度分析

Chinese Title: Auteur:语言驱动的以人为中心的视频生成电影化构图

Summary: 本文提出Auteur方法,旨在实现语言驱动的、以人为中心的视频生成中的相机控制。现有生成视频模型缺乏有意的相机控制,通常将相机运动视为像素合成的副产品,产生随机、空间不一致且不关注人类主体的轨迹。Auteur的核心洞察是专业电影制作人不是以世界空间轨迹来构思镜头,而是相对于演员定义构图(景别、角度、画面布局)。作者形式化了以人为中心的相机参数化,并引入领域特定语言(DSL),可转换为标准6自由度相机参数。通过微调的多模态大语言模型(Qwen-2.5-VL)作为虚拟导演,将自然语言描述和粗略人体运动映射为稀疏DSL关键帧,再确定性插值为连续相机轨迹,输入视频生成器。训练数据集包含34K对齐的文本、人体运动、DSL标注的相机轨迹,来自程序化合成和真实电影片段(CondensedMovies)。实验表明Auteur在构图相关指标上优于现有方法。

Innovations:

  • 提出以人为中心的相机参数化和领域特定语言(DSL),将相机状态相对于演员身体坐标系定义,实现电影化构图。
  • 设计两阶段语言到人体到相机的流水线:先预测粗略3D人体轨迹,再生成稀疏DSL关键帧并插值为连续6自由度轨迹。
  • 构建包含34K样本的Auteur数据集,融合程序化合成和真实电影数据,支持学习人类感知的电影化先验。
  • 引入Auteur Score构图基准套件,评估主体可见性、构图一致性、时间构图稳定性和演员-相机协调性。

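Methodology: Auteur uses a two-stage framework. In the first stage, a fine-tuned multimodal LLM (Qwen-2.5-VL) maps a natural-language description to a coarse 3D human trajectory (SOMA parameterization). In the second stage, the same model acts as a virtual director and generates a sparse program of DSL keyframes conditioned on the human trajectory; DSL parameters include direction, shot size, camera level, composition, and gaze level, all defined relative to the actor's body coordinate frame. The keyframes are converted into a continuous 6-DoF camera trajectory by deterministic interpolation (e.g., cubic splines), and the trajectory conditions a downstream video generator (e.g., CameraCtrl, MotionCtrl). Training data come from procedural synthesis (random SOMA motion with DSL-driven cameras) and from real movies (CondensedMovies, with DSL labels obtained through 4D human and camera recovery, AuroraCap captioning, and a motion-labeling strategy).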
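
The step from sparse keyframes to a dense camera track is deterministic, which a short sketch can make concrete. The example below uses piecewise-linear interpolation with `np.interp` as a simpler substitute for the cubic splines mentioned above, and the two per-keyframe parameters (actor-relative distance and azimuth) are hypothetical placeholders rather than the actual DSL fields.

```python
import numpy as np

def interpolate_keyframes(key_times, key_values, num_frames):
    """Piecewise-linear interpolation of sparse keyframes into a dense per-frame track.

    key_times: (K,) increasing timestamps; key_values: (K, P) parameters per keyframe.
    Returns (frame_times, values) with values of shape (num_frames, P).
    """
    key_times = np.asarray(key_times, dtype=float)
    key_values = np.asarray(key_values, dtype=float)
    frame_times = np.linspace(key_times[0], key_times[-1], num_frames)
    columns = [
        np.interp(frame_times, key_times, key_values[:, j])
        for j in range(key_values.shape[1])
    ]
    return frame_times, np.stack(columns, axis=1)

# Hypothetical keyframes: (distance to actor in metres, azimuth around actor in degrees).
times = [0.0, 1.0, 2.0]
keys = [[4.0, 0.0], [2.0, 30.0], [2.0, 90.0]]
frame_times, track = interpolate_keyframes(times, keys, num_frames=5)
```

The same keyframes always produce the same track, which is the property the pipeline relies on. A real implementation would also need to handle angle wrap-around and convert the actor-relative parameters into 6-DoF poses, both of which this sketch leaves out.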

Key Results:

  • Auteur在内部Auteur基准和外部PulpMotion基准上均优于现有可控相机基线(如LAMP、E.T.),实现更强的可控性和更稳定的电影化构图。
  • 提出的构图指标(Auteur Score)有效评估了主体可见性、构图一致性、时间稳定性和演员-相机协调性。
  • Auteur生成的相机轨迹可成功输入多个下游视频生成模型,提升空间连贯性和主体对齐。

Tech Stack:

  • 多模态大语言模型:Qwen-2.5-VL(微调)
  • 人体运动表示:SOMA(Saito et al., 2026)
  • 领域特定语言(DSL):自定义,包含方向、景别、相机水平、构图、注视水平等参数
  • 相机轨迹插值:确定性插值(三次样条等)
  • 视频生成模型:CameraCtrl、MotionCtrl等(作为下游)
  • 数据集构建:CondensedMovies(Bain et al., 2020)、AuroraCap(Chai et al., 2024)
  • 4D人体和相机恢复:Wang et al., 2024a

Strengths:

  • 创新性地将电影化构图从世界空间轨迹转移到以人为中心的参数化,更符合专业电影制作实践。
  • DSL设计使LLM能够生成结构化的相机程序,兼具可解释性和几何表达能力。
  • 两阶段流水线有效解耦了人体运动预测和相机规划,便于训练和推理。
  • 构建的大规模数据集融合合成和真实数据,增强了泛化能力。
  • 提出专门的构图评估指标,弥补了现有视频生成评估的不足。

Limitations:

  • 依赖SOMA人体运动表示,可能无法处理复杂多人场景或非标准人体姿态。
  • DSL关键帧插值为连续轨迹是确定性的,可能缺乏对动态场景的适应性。
  • 两阶段流水线中人体轨迹预测的误差会累积到相机规划中。
  • 当前仅针对单个人类主体,未扩展到多人交互场景。
  • 下游视频生成模型的质量仍受限于基础生成器的能力。

Relevance To Keywords:

  • 原生多模态大模型:使用Qwen-2.5-VL作为核心组件,实现语言到视觉的映射。
  • 多模态大模型的理解和生成一体化:LLM同时理解文本和人体运动,生成相机程序。
  • 表征学习:通过SOMA和DSL学习以人为中心的相机表征。
  • 世界模型:相机轨迹生成可视为对场景动态和主体行为的建模。
  • 后训练:微调多模态LLM以适应特定任务。
  • 强化学习:论文未直接使用RL,但相机规划可视为策略学习,未来可结合。
Score: 40.5 / 27.8
Authors: Xiang Fang, Wanlong Fang, Wei Ji, Tat-Seng Chua
Published: 2026-06-01
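TL;DR: The paper proposes a multimodal fusion framework based on reaction-diffusion dynamics, which models video-text interaction after biological pattern formation in order to improve video moment retrieval.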
摘要翻译

视频 - 语言模型对于时刻检索和高光检测等任务至关重要,但它们往往难以捕捉时序视频序列与文本语义之间的动态、非线性交互。现有方法依赖于静态交叉注意力或提示调优机制,无法自适应地建模模态之间动态演变的关系,导致次优对齐及泛化能力受限。受系统生物学启发,我们提出了一种新颖的框架——反应 - 扩散多模态融合(RDMF),该框架将视频 - 语言对齐重新构想为反应 - 扩散(RD)过程,借鉴了艾伦·图灵提出的模式形成原理。在 RDMF 中,视频特征随时间扩散以捕捉时序上下文,而文本 - 视频交互被建模为非线性反应,放大相关特征并抑制噪声,形成类似于生物系统的涌现模式。利用 Gray-Scott RD 模型,我们设计了一个计算高效的多模态融合模块,该模块整合了视频和文本表示,并通过使用图灵不稳定性判据对稳定性和收敛性进行了严格的数学分析。该框架具有坚实的理论基础,采用先进的数学工具确保稳定的模式形成,同时具有实际可行性,包含了预训练编码器和 DETR 风格头等标准组件,用于时刻检索和显著性预测。RDMF 代表了一种开创性的跨学科方法,架起系统生物学与多媒体研究之间的桥梁,以解决传统多模态融合的局限性。初步实验展示了其在识别显著视频时刻方面超越现有方法的潜力,为视频 - 语言任务提供了新范式。

Abstract

Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose Reaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 4.0/10 6.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 5.0/10 7.5
MultiModal 1.5 10.0/10 15.0
model-based RL 1.5 0.0/10 0.0

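评分理由: The paper's core is multimodal fusion (MultiModal: highly relevant, 10). It extracts features with visual encoders (Visual Encoder: moderately relevant, 5) and belongs to the video-language model family (MLLM: moderately relevant, 5). It models dynamic interaction but does not explicitly build a world model (World Models: low, 2), does not propose a unified model architecture (Unify Models: low, 4) or a tokenizer design (Tokenizer: low, 1), and is unrelated to reinforcement learning (model-based RL: 0). The author list contains none of the specified experts, so no bonus applies. The weighted total is 40.5, above the dynamic passing score of 27.8.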

关键词

Reaction-Diffusion, Multi-Modal Fusion, Video-Language Alignment, Moment Retrieval, Turing Patterns, Dynamic Interaction, Pattern Formation

深度分析

Chinese Title: 多媒体中的图灵斑图:用于语言引导视频时刻检索的反应-扩散多模态融合

Summary: 本文针对视频-语言模型在时刻检索和精彩片段检测任务中难以捕捉视频时序与文本语义之间动态非线性交互的问题,提出了一种受系统生物学启发的反应-扩散多模态融合框架(RDMF)。该框架将视频-语言对齐重新建模为反应-扩散过程:视频特征在时间维度上扩散以获取时序上下文,文本-视频交互则作为非线性反应,放大相关特征并抑制噪声,形成类似生物系统的涌现斑图。基于Gray-Scott反应-扩散模型,设计了计算高效的融合模块,并利用图灵不稳定性准则进行了严格的数学稳定性与收敛性分析。RDMF结合预训练编码器和DETR风格头部,在时刻检索和显著性预测任务上展现出超越现有方法的潜力,为视频-语言任务提供了新的跨学科范式。

Innovations:

  • 首次将反应-扩散系统引入视频-语言建模,实现系统生物学与多媒体研究的跨学科融合。
  • 提出动态特征交互机制:视频特征扩散建模时序演化,文本-视频反应放大语义相关特征,无需预定义注意力模式。
  • 基于图灵不稳定性与稳定性分析提供严格数学理论支撑,确保反应-扩散系统收敛到稳定有意义的斑图。
  • 框架具有通用性,可拓展至音视频对齐、传感器-文本融合等多模态领域,并促进多媒体与系统生物学社区的合作。

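Methodology: The core fusion module is a Gray-Scott reaction-diffusion model. Video features spread smoothly along the time axis through the diffusion term, while text and video features interact through a non-linear reaction term, producing self-organized patterns. Pretrained encoders (e.g., CLIP, SlowFast) extract the video and text features, and a DETR-style head performs moment-boundary regression and saliency prediction. Turing instability analysis is used to keep the system stable, and mathematical tools such as linear stability analysis are used to verify convergence.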
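
For readers unfamiliar with the Gray-Scott system named above, the standard equations are du/dt = Du ∇²u − u v² + F(1 − u) and dv/dt = Dv ∇²v + u v² − (F + k) v. The sketch below integrates them with explicit Euler steps on a 1D periodic axis standing in for video time. The parameter values are common textbook choices, and reading `u` as a video-side field and `v` as a text-conditioned activation is purely an illustrative assumption; the report does not say how RDMF couples features to these variables.

```python
import numpy as np

def laplacian_1d(x):
    """Discrete 1D Laplacian with periodic boundaries (grid spacing 1)."""
    return np.roll(x, 1) + np.roll(x, -1) - 2.0 * x

def gray_scott_step(u, v, Du=0.16, Dv=0.08, F=0.035, k=0.060, dt=1.0):
    """One explicit Euler step of the standard Gray-Scott equations."""
    uvv = u * v * v                                  # non-linear reaction term
    du = Du * laplacian_1d(u) - uvv + F * (1.0 - u)  # diffusion, consumption, feed
    dv = Dv * laplacian_1d(v) + uvv - (F + k) * v    # diffusion, production, decay
    return u + dt * du, v + dt * dv

# Toy run over 64 "time steps" of a clip, with a local perturbation in the middle.
T = 64
u = np.ones(T)
v = np.zeros(T)
u[28:36], v[28:36] = 0.5, 0.25
for _ in range(200):
    u, v = gray_scott_step(u, v)
```

The uniform state u = 1, v = 0 is a fixed point of these equations, so any structure that persists in `v` after many steps originates from the seeded perturbation.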

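Key Results: Preliminary experiments indicate that RDMF identifies salient video moments better than existing methods, adaptively highlighting segments relevant to the text query and suppressing irrelevant content. The framework shows promise on moment retrieval and highlight detection, but the paper reports only preliminary experiments and gives no concrete numbers.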

Tech Stack:

  • Gray-Scott反应-扩散模型
  • 图灵不稳定性准则
  • 线性稳定性分析
  • 预训练编码器(CLIP、SlowFast)
  • DETR风格头部(Moment-DETR、QD-DETR)
  • Transformer架构
  • 跨模态注意力机制

Strengths:

  • 跨学科创新:将系统生物学原理引入多媒体研究,开辟新范式。
  • 动态自适应融合:克服静态交叉注意力的局限,实现非线性、上下文相关的多模态交互。
  • 理论严谨:提供数学稳定性与收敛性分析,确保方法可靠性。
  • 实用性强:兼容现有预训练编码器和检测头,易于部署。

Limitations:

  • 论文仅报告初步实验结果,缺乏在标准数据集(如Charades-STA、ActivityNet Captions)上的定量对比。
  • 反应-扩散模型的计算复杂度可能较高,需进一步优化以支持实时视频分析。
  • 对长视频或复杂场景的泛化能力尚未充分验证。
  • 跨学科理论(如图灵斑图)的直观解释与多媒体任务的对应关系需更清晰阐述。

Relevance To Keywords:

  • 原生多模态大模型:RDMF提出新的多模态融合机制,可视为多模态大模型中的一种新型交互模块。
  • 表征学习:通过反应-扩散过程学习视频与文本的动态联合表征。
  • 世界模型:反应-扩散系统模拟了视频时序演化与语义约束的交互,具有世界模型建模潜力。
  • 模型-Based RL:反应-扩散过程可视为一种可微分动力学系统,未来可结合强化学习进行决策优化。
  • 后训练:RDMF可作为预训练模型的轻量级后训练融合模块,提升下游任务性能。
Score: 39.0 / 27.8
Authors: Xu Jiang, Bin Chen, Gehui Li, Yule Duan, Ronggang Wang, Jian Zhang
Published: 2026-06-01
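TL;DR: To address diminishing marginal returns and inference-efficiency bottlenecks in text-to-image models, OctoT2I uses a self-evolving agentic routing framework to balance quality and efficiency.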
摘要翻译

文本到图像(T2I)模型的爆炸式增长,从大规模版本延伸至轻量级、实时版本,如今正面临单模型扩展带来的边际收益递减问题。为缓解这一瓶颈,基于多模型的智能体(Agentic)T2I 方法应运而生。然而,现有的智能体 T2I 方法面临三大关键挑战:依赖昂贵的手工设计先验或人工标注、僵化的单路径决策机制,以及对推理效率的忽视。为应对这些挑战,我们引入 OctoT2I,一种新颖的智能体框架,将 T2I 任务重新表述为生成质量与推理效率的联合优化。OctoT2I 实施了一种有状态的、多轮路由策略,能够根据其知识和记忆自适应地选择最合适的工具。该策略由一种从头构建的知识库支撑,该知识库源自我们提出的新颖自我进化机制(Self-Evolving Mechanism)。该机制无需人工监督,首先自主定义基础概念维度(Conceptual Dimensions,例如风格、颜色、数量),然后通过迭代的"提议--求解--评估--学习"(Propose--Solve--Evaluate--Learn, PSEL)循环智能地探索其组合。PSEL 循环高效地发现每个工具的能力边界,驱动持续改进而无需外部指导。大量实验表明,OctoT2I 在 GenEval 上取得了具有竞争力的性能(0.96),同时相比领先基线(Flow-GRPO)实现了 90.3% 的推理加速和 56.6% 的能效增益,在性能与效率之间实现了卓越的平衡。代码与模型将公开提供。

Abstract

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (e.g., style, color, count) and then intelligently explores their combinations via an iterative "Propose-Solve-Evaluate-Learn" (PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 6.0/10 9.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 5.0/10 7.5
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 3.0/10 4.5

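评分理由: The paper focuses on multi-model routing and efficiency optimization for T2I and is highly relevant to MultiModal. Integrating multiple models is moderately relevant to Unify Models. The agent's learning loop touches on MLLM and model-based RL, but neither is central. It proposes no architectural innovation in Tokenizer, Visual Encoder, or World Models. No specified expert authors were found.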

关键词

Text-to-Image, Agentic Framework, Self-Evolving Mechanism, Inference Efficiency, Multi-model Routing, Energy Efficiency, GenEval

深度分析

Chinese Title: OctoT2I:一种自演进的智能文本到图像路由系统

Summary: 本文针对文本到图像(T2I)生成领域面临的多模型选择难题,提出了一种名为OctoT2I的自演进智能路由框架。该框架将T2I任务重新定义为生成质量与推理效率的联合优化问题,通过状态化、多轮路由策略,自适应地为每个用户提示选择最合适的T2I工具。核心创新在于引入自演进机制,无需人工标注或先验知识,通过“提议-求解-评估-学习”(PSEL)循环自主构建工具知识库,并利用探索空间剪枝策略高效发现各工具的能力边界。实验表明,OctoT2I在GenEval基准上达到0.96的竞争性能,同时相比领先基线Flow-GRPO实现了90.3%的推理加速和56.6%的能效提升,在性能与效率之间取得了卓越平衡。

Innovations:

  • 提出OctoT2I框架,将文本到图像路由问题形式化为生成质量与推理效率的联合优化,实现性能与效率的协同平衡。
  • 设计自演进机制,使智能体无需任何人工监督或外部数据,通过自我交互从零开始自主构建工具知识库。
  • 引入状态化、多轮动态路由策略,结合知识模块与记忆模块,实现基于实时反馈的迭代式工具选择。
  • 提出“提议-求解-评估-学习”(PSEL)循环与探索空间剪枝策略,高效、自适应地探索各工具的能力边界。

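Methodology: The paper first formalizes routing as a cost-minimizing selection problem under a quality-threshold constraint. The system has two parts, an inference workflow and a self-evolving mechanism. In the inference workflow, at each decision round the agent selects a tool based on the user prompt, the knowledge module, and the memory module; the generated image is scored by an evaluation module, closing a reason-act-reflect loop. In the self-evolving mechanism, the agent first defines conceptual dimensions on its own and then explores their combinations iteratively through the PSEL loop (propose concrete prompts, solve them with different tools, quantitatively evaluate the images, analyze the results and update knowledge). An exploration-space pruning strategy steers the search, and the outcome is a hierarchical tool knowledge base.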
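
The formalization above (pick the cheapest tool whose expected quality clears a threshold) can be written down directly. This is a minimal single-round sketch: the `Tool` fields, the tool names, the numeric values, and the fallback rule when no tool is feasible are all hypothetical, and the multi-round memory and the evaluation module are not modelled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str                 # hypothetical tool label
    cost: float               # e.g., expected latency or energy per image
    predicted_quality: float  # quality estimate from the knowledge base, in [0, 1]

def route(tools, quality_threshold):
    """Return the cheapest tool meeting the threshold; else the best-quality tool."""
    feasible = [t for t in tools if t.predicted_quality >= quality_threshold]
    if feasible:
        return min(feasible, key=lambda t: t.cost)
    # Assumed fallback: if nothing is predicted to pass, use the strongest tool.
    return max(tools, key=lambda t: t.predicted_quality)

tools = [
    Tool("fast", cost=1.0, predicted_quality=0.70),
    Tool("balanced", cost=3.0, predicted_quality=0.85),
    Tool("heavy", cost=10.0, predicted_quality=0.95),
]
easy_choice = route(tools, quality_threshold=0.60)
hard_choice = route(tools, quality_threshold=0.80)
fallback_choice = route(tools, quality_threshold=0.99)
```

In a multi-round setting the same rule could be re-applied after each evaluation, with the failed tool's predicted quality lowered for that prompt, though how OctoT2I updates its estimates is not specified in this report.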

Key Results:

  • 在GenEval基准上达到0.96的得分,与最先进方法持平。
  • 相比Flow-GRPO基线,推理速度提升90.3%,能效提升56.6%。
  • 自演进机制无需人工标注,即可自主构建高质量工具知识库。
  • 多轮动态路由策略优于静态单路径方法,在复杂提示上表现更优。

Tech Stack:

  • 大语言模型(LLM)作为路由控制器
  • 多个预训练T2I工具库(如SD-Turbo、Flow-GRPO、Stable Diffusion 3等)
  • 自演进机制中的“提议-求解-评估-学习”(PSEL)循环
  • 探索空间剪枝策略(Exploration Space Pruning)
  • 定量评估函数(qeval)用于图像质量评分
  • GenEval基准测试

Strengths:

  • 首次在智能T2I路由中联合优化生成质量与推理效率,实用性强。
  • 自演进机制完全摆脱对人工标注或先验知识的依赖,可扩展性好。
  • 多轮动态路由结合记忆模块,能根据实时反馈自适应调整,决策更智能。
  • 实验充分,在性能、速度、能效三个维度均取得显著优势。

Limitations:

  • 路由决策依赖LLM的推理能力,若LLM本身能力不足可能影响路由质量。
  • 当前工具库规模有限,未来需扩展至更多T2I模型以验证泛化性。
  • 自演进过程需要一定计算资源用于探索,初始阶段可能耗时。
  • 论文未详细讨论在极端复杂或模糊提示下的鲁棒性。

Relevance To Keywords:

  • 原生多模态大模型:OctoT2I利用LLM作为控制器,整合多个T2I模型,体现了多模态理解与生成的协同。
  • 多模态大模型的理解和生成一体化:路由系统需理解用户提示并选择合适生成工具,实现理解与生成的联合优化。
  • 表征学习:自演进机制中概念维度的定义和工具能力边界的探索可视为一种隐式表征学习。
  • 世界模型:通过多轮路由和评估反馈,系统逐步构建对工具行为的世界知识。
  • 强化学习:Flow-GRPO等工具本身采用强化学习后训练,而自演进机制中的PSEL循环也包含类似“试错-学习”的强化信号。
  • 后训练:论文强调自演进无需人工标注,可视为一种后训练阶段的知识自主获取方法。
Score: 39.0 / 27.8
Authors: Kui Wu, Beiyu Guo, Hao Chen, ShuHang Xu, Yuling Li, Yongdan Zeng, Zhoujun Li, Yizhou Wang, Fangwei Zhong
Published: 2026-06-01
TL;DR: RescueBench establishes a photo-realistic benchmark for embodied search-and-rescue agents, revealing that current methods fail at complex tasks primarily due to exploration and spatial memory limitations.
摘要翻译

搜索与救援 (SAR) 要求具身智能体在多模态不确定性下探索未知环境,执行多阶段交互,并在长时程内检索空间记忆。现有基准通常孤立地评估这些能力,使得当它们必须在真实工作流中组合时,失败如何累积尚不明确。我们引入 RescueBench,一个照片级真实感的诊断基准,它将 SAR 实例化为一个四阶段流程:多模态探索、目标救援、记忆引导返回和最终交接。通过结合顺序任务组合与阶段级评估,RescueBench 使得能够分析探索失败和记忆失败如何在具身救援工作流中传播。它包含五个渐进难度级别,在环境复杂度、线索模糊性和空间层级上有所不同,并配备一个自动的 episode (场景) 生成与标注流程,用于可扩展的评估和训练。我们评估了七种基线方法、一个 oracle (神谕) 参考和人类玩家,结果显示在最困难级别下,没有基线方法能完成全部任务。阶段级诊断将自主探索识别为主导失败模式,将空间记忆识别为第二个独立的瓶颈,表明这些限制无法通过当前的拓扑视觉语言导航或基于地图的方法解决。代码可在 https://github.com/wukui-muc/RescueBench 获取。

Abstract

Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memory over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving unclear how failures compound when they must be composed in realistic workflows. We introduce RescueBench, a photo-realistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, showing that no baseline completes the full task at the highest difficulty level. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological visual-language navigation or map-based methods. Code is available at https://github.com/wukui-muc/RescueBench.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 4.0/10 6.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 4.0/10 6.0
MLLM 1.5 4.0/10 6.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 3.0/10 4.5

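评分理由: The paper's core is a benchmark for search-and-rescue tasks rather than a model architecture. It is highly relevant to MultiModal (multimodal uncertainty is stated explicitly), moderately relevant to World Models and Unify Models (spatial memory and a unified task pipeline), moderately relevant to MLLM (the baselines include vision-language models), unrelated to Tokenizer, and only weakly related to Visual Encoder and model-based RL (no specific encoder or model-based algorithm is examined). No specified expert authors were found. The weighted total is 39.0, above the dynamic passing score of 27.8.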

关键词

Search-and-Rescue, Embodied Agents, Multimodal Uncertainty, Spatial Memory, Photo-realistic Benchmark, Visual-Language Navigation, Long-Horizon Tasks

深度分析

Chinese Title: RescueBench:具身智能体能否在野外拯救生命?

Summary: 本文提出了RescueBench,一个面向搜索与救援(SAR)任务的具身智能基准测试。现有基准通常孤立评估探索、导航或操作能力,忽略了真实SAR场景中这些能力必须顺序组合时产生的级联失败。RescueBench将SAR任务分解为四个顺序阶段:多模态探索(S1)、定位并救援目标(S2)、记忆引导返回(S3)、定位并交接(S4),并设计了五个渐进难度等级(L1-L5),通过环境复杂度、线索模糊度和空间层次三个维度控制难度。基于Unreal Engine 5构建了照片级真实环境,并开发了自动化的场景生成与标注管线。评估了七个基线方法、一个Oracle参考和人类玩家,结果显示:在最高难度下没有任何基线能完成完整任务;阶段级诊断表明自主探索是主要失败模式,空间记忆是第二个独立瓶颈。这些发现揭示了当前拓扑视觉语言导航和地图方法在开放世界探索和长时空间记忆方面的根本局限。

Innovations:

  • 首次将搜索救援任务建模为四阶段顺序流水线(探索-救援-返回-交接),支持阶段级诊断以分析级联失败模式。
  • 设计了五个渐进难度等级,独立控制环境复杂度、线索模糊度和空间层次,实现可控的能力瓶颈分析。
  • 提出了自动化的场景生成与标注管线,基于UE5导航网格和VLM生成多模态线索,支持大规模可扩展评估。
  • 通过环境辅助交互触发机制解耦操作能力,使评估聚焦于导航、探索和记忆瓶颈,拓宽了适用架构范围。
  • 揭示了自主探索和空间记忆是当前具身智能体的两个独立且持久的瓶颈,即使成功定位目标后仍无法可靠返回。

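Methodology: The paper combines benchmark construction with empirical evaluation. First, photo-realistic environments are built on Unreal Engine 5 and the UnrealZoo platform, and the four-stage sequential task is defined. Second, an automated data-collection pipeline extracts candidate spawn regions from the navigation mesh, samples ambulance, target, and agent positions, runs a NavMesh-driven agent to execute the task and record trajectories, and uses a VLM (Qwen-3.5 plus) to generate multimodal clues from the agent's first-person view. Third, five difficulty levels (L1-L5) are defined by rules on distance range, occlusion, and cross-region or cross-floor placement. Finally, seven baselines (including topological VLN, map-based, and end-to-end methods), an Oracle reference, and human players are evaluated; per-stage and full-task success rates are computed and used for stage-level diagnosis.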
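
Stage-level diagnosis of a sequential task can be made concrete with a small sketch that separates cumulative success (episodes that got through a stage) from conditional success (success among episodes that reached the stage). The episode encoding, the function name, and the toy data below are hypothetical and are not RescueBench results or its actual metric definitions.

```python
STAGES = ("S1", "S2", "S3", "S4")  # explore, rescue, return, handoff

def stage_report(episodes):
    """episodes: list of ints, the number of consecutive stages each episode completed (0..4).

    For each stage, report the cumulative success rate over all episodes and the
    conditional rate among episodes that completed the previous stage.
    """
    n = len(episodes)
    reached = n
    report = {}
    for i, stage in enumerate(STAGES, start=1):
        passed = sum(1 for e in episodes if e >= i)
        report[stage] = {
            "cumulative": passed / n,
            "conditional": passed / reached if reached else 0.0,
        }
        reached = passed  # only these episodes attempt the next stage
    return report

# Made-up episode outcomes, for illustration only.
toy_episodes = [4, 2, 2, 1, 0, 0, 3, 4]
report = stage_report(toy_episodes)
full_task_success = report["S4"]["cumulative"]
```

A low conditional rate at a stage points to a bottleneck at that stage itself, whereas a low cumulative rate can simply reflect failures inherited from earlier stages.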

Key Results:

  • 在最高难度L5(跨区域室内外+多楼层)下,没有任何基线方法能完成完整四阶段任务。
  • 自主探索(S1)是主导失败模式:所有基线在L3及以上难度中S1成功率显著下降,L2到L3的过渡(从近距离可见目标到中距离遮挡目标)产生了最陡峭的能力悬崖。
  • 空间记忆(S3)是第二个独立瓶颈:即使智能体成功完成S2(定位并救援),返回救护车的成功率仍远低于S2,表明空间编码质量不足。
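  • At L5, human players reach a 60% full-task success rate, whereas the Oracle reference (a privileged agent, not one of the seven baselines) reaches only 20%, which indicates a large gap between current methods and humans.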
  • 阶段级诊断揭示了拓扑VLN和地图方法在开放世界探索和长时记忆方面的根本局限,并非简单参数调整可解决。

Tech Stack:

  • Unreal Engine 5 (UE5) 和 UnrealZoo 平台
  • 导航网格 (NavMesh) 用于自动路径规划和场景生成
  • 视觉语言模型 (VLM): Qwen-3.5 plus 用于生成多模态线索
  • 七个基线方法:包括基于拓扑图的VLN方法、基于地图的方法、端到端强化学习方法等(具体名称见原文)
  • Oracle参考:使用全局地图和完美感知的智能体
  • 人类玩家评估:通过UE5前端进行实时控制
  • 评估指标:各阶段成功率、完整任务成功率、级联失败分析

Strengths:

  • 任务设计紧密贴合真实SAR场景,具有实际应用价值。
  • 阶段级诊断机制能够精确定位能力瓶颈,而非仅给出整体成功率。
  • 渐进难度框架允许系统性地分析环境因素对能力的影响。
  • 自动化数据管线支持大规模生成和可重复实验。
  • 解耦操作能力使评估聚焦于导航和记忆,适用于多种架构。
  • 提供了人类基线,便于衡量AI与人类差距。

Limitations:

  • 环境辅助交互触发机制虽然解耦了操作,但无法评估真实操作能力。
  • 当前仅评估单智能体设置,多智能体协作仅初步验证。
  • 环境多样性有限(基于UnrealZoo的特定场景),可能无法完全代表真实野外环境。
  • 基线方法的选择可能未涵盖最新的大模型驱动方法(如基于MLLM的端到端规划)。
  • 未深入分析失败的具体原因(如探索策略、记忆表示形式),仅停留在成功率层面。

Relevance To Keywords:

  • Unify Models / 原生多模态大模型:论文使用VLM生成多模态线索,但评估的基线方法并非统一模型,而是分离的导航/记忆模块。相关性中等。
  • World Models / 世界模型:任务要求智能体在探索中构建空间记忆,与世界模型中的环境建模相关,但论文未直接使用或评估世界模型方法。相关性中等。
  • Representation Learning / 表征学习:空间记忆的编码方式涉及表征学习,但论文未深入探讨表征质量。相关性较弱。
  • Model-Based RL / 强化学习:部分基线可能使用RL,但论文主要评估的是基于导航的方法,而非模型基RL。相关性较弱。
  • 多模态大模型的理解和生成一体化:论文使用VLM生成线索,但智能体本身未使用多模态大模型进行一体化理解与生成。相关性中等。
  • 后训练:论文未涉及后训练技术。相关性低。
Score: 39.0 / 27.8
Authors: Junlin Long, Zeyu Zhang, Xu Deng, Yiran Wang, Yue Yang, Luke Borgnolo, Maxwell Twelftree, Yang Zhao
Published: 2026-06-01
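TL;DR: PlatonicNav is a training-free framework that builds a shared semantic manifold from a visual encoder, unifying visual and language navigation tasks and grounding cross-modal goals without paired vision-language data.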
摘要翻译

具身视觉导航(Embodied Visual Navigation)是指智能体感知复杂环境并根据原始感官输入采取行动以达成目标,它是家庭服务机器人、辅助机器人以及大规模自主探索等众多应用的基础。然而,近期统一视觉 - 语言导航(VLN)和目标导航(ObjNav)的尝试仍局限于架构融合、混合任务训练及大规模视觉 - 语言预训练层面,尚未探究独立训练的视觉编码器与语言编码器是否已共享共同的语义结构。此外,即使是以对象为中心的拓扑地图仍通过显式跨模态监督(如 CLIP 或大型视觉 - 语言模型)来锚定语言目标,这留下了一个问题:仅从纯视觉构建的地图中实现这种锚定是否可能。为应对这些挑战,我们将柏拉图表示假设(Platonic Representation Hypothesis)扩展至具身导航,并将仅视觉 ObjNav、跨模态 ObjNav 以及 VLN 重新表述为通往同一以对象为中心的语义流形的三种不同接口。此外,我们提出了 PlatonicNav,这是一个无需训练的框架,其柏拉图拓扑地图(Platonic Topological Map)融合了来自自监督视觉编码器的几何与语义节点距离,并通过盲匹配锚定语言目标,而无需任何配对的视觉 - 语言数据。在包括 HM3D-IIN、OVON 以及 MP3D 上的 R2R-CE 等仿真基准上的广泛实验,以及在 Unitree Go2 机器人上的部署,均表明 PlatonicNav 能够在无需显式跨模态训练的情况下,跨任务、模态和具身形式实现泛化。代码:https://github.com/AIGeeksGroup/PlatonicNav。网站:https://aigeeksgroup.github.io/PlatonicNav。

Abstract

Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: https://github.com/AIGeeksGroup/PlatonicNav. Website: https://aigeeksgroup.github.io/PlatonicNav.

Score Details
Keyword Weight Relevance Score
Unify Models 1.5 6.0/10 9.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 9.0/10 13.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 7.0/10 10.5
model-based RL 1.5 2.0/10 3.0

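Scoring rationale: The paper's core idea is to use a visual encoder to build a shared semantic manifold that unifies the VLN and ObjNav tasks. Visual Encoder is the core component and scores highest; MultiModal scores fairly high because the method involves vision-language interaction; Unify Models scores moderately because the unification is at the task level rather than a fusion of model architectures; Tokenizer, World Models, MLLM, and model-based RL have little connection to the paper's training-free, vision-led matching mechanism and score low or zero. No designated expert authors were found.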
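
The total for this entry can be reproduced from the score table as a weighted sum. The sketch below assumes that each keyword score is weight × relevance and that an entry passes when its total reaches the dynamic pass score (27.8 in this report); the function names are illustrative and are not taken from the report generator.

```python
# Illustrative re-computation of the score table (names are hypothetical).
WEIGHT = 1.5  # every keyword in the table carries the same weight

relevance = {
    "Unify Models": 6.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 9.0,
    "World Models": 1.0,
    "MLLM": 1.0,
    "MultiModal": 7.0,
    "model-based RL": 2.0,
}

def weighted_score(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance.values())

def passes(total, pass_score=27.8):
    """Assumed rule: an entry is kept when its total reaches the pass score."""
    return total >= pass_score

total = weighted_score(relevance)
print(total, passes(total))
```

The same two functions apply to every score table in the report, since all entries share the keyword list and weight.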

Keywords

Embodied Visual Navigation, Platonic Topological Maps, Semantic Correspondence, Visual Encoder, Blind Matching, Vision-and-Language Navigation, Object Goal Navigation, Training-free Framework

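In-Depth Analysis

Chinese Title: PlatonicNav:基于柏拉图拓扑地图的导航中语义对应关系揭示

Summary: The paper proposes PlatonicNav, a training-free unified navigation framework. Building on the Platonic Representation Hypothesis, it treats object goal navigation (ObjNav) and vision-and-language navigation (VLN) as different interfaces to the same object-centric semantic manifold. The method builds a Platonic Topological Map that fuses geometric and semantic node distances from a self-supervised visual encoder (DINOv3), and grounds language goals directly on map nodes via blind matching, without any paired vision-language data. Experiments on the HM3D-IIN, OVON, and R2R-CE simulation benchmarks and on a real Unitree Go2 robot indicate that PlatonicNav generalizes across tasks, modalities, and embodiments. The authors take this as evidence that independently trained vision and language encoders already share an implicit semantic structure, which reduces the reliance on explicit cross-modal supervision.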

Innovations:

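  • Extends the Platonic Representation Hypothesis to embodied navigation and poses a falsifiable thought experiment that unifies vision-only ObjNav, cross-modal ObjNav, and VLN as different interfaces to the same object-centric semantic manifold.
  • Proposes PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric distances with semantic distances from a self-supervised visual encoder, without any paired vision-language data.
  • Uses blind matching to exploit the relational structure shared by an independently trained visual encoder (DINOv3) and language encoder (GTR-T5), aligning language goals to visual map nodes zero-shot.
  • Validates cross-task, cross-modal, and cross-embodiment generalization in simulation and on a real robot, suggesting that explicit cross-modal training is largely redundant.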

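Methodology: First, following the Platonic Representation Hypothesis, the visual and language encoders are treated as different modality interfaces to one statistical model of reality, and their relational distance matrices are aligned through double-centering. Second, the Platonic Topological Map is built: image segments encoded by the self-supervised visual encoder (DINOv3) serve as nodes, and edge weights fuse geometric distance with visual semantic distance. Third, blind matching computes pairwise relational distances between the language embeddings of goal categories and the embeddings of visual segment clusters, then uses a matching solver (for example, the Hungarian algorithm) to find the corresponding nodes. Finally, a path is planned on the map and a PlatonicObject Costmap is generated for control prediction. The whole pipeline is training-free and relies only on pretrained encoders.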
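
The double-centering and matching steps can be illustrated on a toy problem. The sketch below is not the paper's implementation: 2-D points stand in for DINOv3 and GTR-T5 embeddings, the "language" points are a shuffled and translated copy of the "visual" ones, and a brute-force search over permutations replaces the Hungarian-style solver (practical only for a handful of nodes). All function names are hypothetical.

```python
import itertools
import math

def pairwise_distances(points):
    """Euclidean distance matrix within one modality."""
    return [[math.dist(p, q) for q in points] for p in points]

def double_center(dist):
    """Double-center the squared distances: B = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def blind_match(gram_a, gram_b):
    """Find the permutation that best aligns two relational matrices.

    No cross-modal pairs are used: only within-modality structure is compared.
    """
    n = len(gram_a)
    best_perm, best_cost = None, math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum((gram_a[i][j] - gram_b[perm[i]][perm[j]]) ** 2
                   for i in range(n) for j in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost

# Toy data: visual node i corresponds to language item k where order[k] == i.
visual = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
order = [2, 0, 3, 1]
language = [(visual[k][0] + 5.0, visual[k][1] - 1.0) for k in order]

perm, cost = blind_match(double_center(pairwise_distances(visual)),
                         double_center(pairwise_distances(language)))
print(perm, cost)
```

Here `perm[i]` is the index of the language item matched to visual node `i`. The toy works because the two point sets share exactly the same relational geometry; with real encoders the structures agree only approximately, so the minimum cost is non-zero.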

Key Results:

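  • On the HM3D-IIN, OVON, and R2R-CE benchmarks, PlatonicNav matches or exceeds existing methods that require cross-modal training.
  • Deployment on a real Unitree Go2 robot succeeded, supporting zero-shot transfer from simulation to the real world.
  • Blind matching maps language goals to visual nodes without any paired data, supporting the claim that vision and language encoders already share an implicit semantic structure.
  • The same Platonic Topological Map supports all three tasks (vision-only ObjNav, cross-modal ObjNav, and VLN) with highly consistent output trajectories.

Tech Stack:

  • Self-supervised visual encoder: DINOv3
  • Language encoder: GTR-T5 (a T5-based dense text retriever)
  • Blind matching: double-centering of relational distance matrices plus a matching solver (for example, the Hungarian algorithm)
  • Topological map construction: image segmentation (for example, SAM), geometric distance computation, semantic distance computation
  • Simulation platform: Habitat Simulator (HM3D-IIN, OVON, R2R-CE on MP3D)
  • Real robot: Unitree Go2 Air quadruped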
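
Planning over a map whose edges combine geometric and semantic distance can be sketched as follows. The fusion rule is not specified in this report, so the convex combination with a hypothetical `alpha` parameter is an assumption, as are the node names and distances; the planner is plain Dijkstra.

```python
import heapq

def fused_weight(geometric, semantic, alpha=0.5):
    """Hypothetical fusion rule: convex combination of the two distances."""
    return alpha * geometric + (1.0 - alpha) * semantic

def shortest_path(edges, start, goal, alpha=0.5):
    """Dijkstra over an undirected graph; edges are (u, v, geometric, semantic)."""
    graph = {}
    for u, v, geo, sem in edges:
        w = fused_weight(geo, sem, alpha)
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in graph.get(node, []):
            new_cost = cost + w
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    if goal not in best:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], best[goal]

# Toy map: (node, node, geometric distance, semantic distance)
edges = [
    ("door", "sofa", 2.0, 0.8),
    ("sofa", "tv", 1.0, 0.2),
    ("door", "tv", 4.0, 0.9),
]
path, cost = shortest_path(edges, "door", "tv", alpha=0.5)
print(path, cost)
```

With these toy numbers the two-hop route through "sofa" is cheaper than the direct edge; changing `alpha` shifts the balance between metric closeness and semantic similarity.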

Strengths:

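  • Offers a new theoretical perspective by bringing the Platonic Representation Hypothesis into navigation, giving a theoretical basis for unifying different navigation tasks.
  • Fully training-free and dependent only on pretrained encoders, which lowers compute and data requirements.
  • Needs no paired vision-language data; cross-modal alignment comes from blind matching, which supports strong generalization.
  • Validated on several simulation benchmarks and a real robot, demonstrating cross-task, cross-modal, and cross-embodiment capability.
  • Code and website are public, which eases reproduction and extension.

Limitations:

  • Depends on the semantic quality of the self-supervised visual encoder (DINOv3) and may perform poorly on complex or rare objects.
  • Blind matching assumes the visual and language encoders are already sufficiently aligned; residual misalignment in practice may reduce matching accuracy.
  • The framework mainly targets static scenes; real-time updates and replanning in dynamic environments are not explored in depth.
  • Experiments cover a limited set of simulated environments and a single real robot; generalization to large-scale real scenes needs further testing.
  • There is no direct comparison with end-to-end reinforcement learning methods, and the performance ceiling may be limited by the discretization of the topological map.

Relevance To Keywords:

  • Unify Models: the paper unifies vision-only ObjNav, cross-modal ObjNav, and VLN on one semantic manifold, which is unification at the task level.
  • World Models: the Platonic Topological Map can be viewed as a world model of the environment that fuses geometric and semantic information and supports planning.
  • Representation Learning: the core idea rests on the Platonic Representation Hypothesis and uses the shared representations of self-supervised visual and language encoders without cross-modal training.
  • Model-Based RL: the framework plans paths on a topological map, which is a model-based approach, but it involves no reinforcement learning.
  • Native multimodal large models / unified multimodal understanding and generation: the paper uses independently pretrained encoders rather than a native multimodal large model, but the blind-matching idea relates to multimodal alignment.
  • Post-training: the method is fully training-free and zero-shot, which is a different direction from post-training.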
Score: 37.5 / 27.8
Authors: Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao, Hardy Chen, Xiaoke Huang, Tianhao Qi, Pengfei Guo, Yucheng Tang, Yufan He, Can Zhao, Andriy Myronenko, Dong Yang, Daguang Xu, Yuyin Zhou
Published: 2026-06-01
TL;DR: AutoMedBench introduces a workflow-aware benchmark for medical agentic AI research, revealing that the validation stage is the weakest and verification/submission errors dominate agent failures.
摘要翻译

自主智能体日益被期望支持端到端的医疗 AI 研究工作流程,超越孤立的预测任务或简短的临床问答。然而,现有的医疗智能体基准主要评估最终输出,对研究过程中智能体行为的可见性有限。为填补这一空白,我们提出了 AutoMedBench,这是一个面向自主医疗 AI 研究的工作流感知基准,涵盖多样的医学影像和多模态推理任务,将智能体执行组织为统一五阶段工作流(S1-S5):计划(Plan)、设置(Setup)、验证(Validate)、推理(Inference)和提交(Submit)。该基准包含长周期任务,每次运行平均涉及 33 次智能体轮次,涵盖五个研究赛道:分割(segmentation)、图像增强(image enhancement)、视觉问答(VQA)、报告生成(report generation)和病灶检测(lesion detection)。每个任务均在两个难度层级(Lite 和 Standard)下进行评估,两者使用相同的数据和指标,但在任务简报脚手架的量上有所不同;每次运行均基于最终任务表现和 S1-S5 阶段得分进行评分,从而支持从初始任务简报到最终提交工件的阶段级分析。在数千次记录运行的基础上,阶段级评分显示,验证(Validate)平均而言是最弱的工作流阶段,而设置(Setup)是最强的,这表明当前智能体在使管道可执行方面优于验证其可靠性。运行后的错误分析进一步表明,验证(verification)和提交(submission)失败主导了标记错误,分别占触发错误代码的 37.7% 和 38.1%,而任务理解错误(task-understanding errors)极为罕见,仅占 0.9%;此外,平均而言,出现一次触发错误代码的运行其整体得分比无错误代码的运行低 48%。

Abstract

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.

Score Details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 7.0/10 10.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 2.0/10 3.0

评分理由: The paper presents AutoMedBench, a benchmark for medical agentic workflows, rather than a novel model architecture. Thus, keywords related to specific model components (Tokenizer, World Models) have low relevance. 'MultiModal' and 'MLLM' are highly relevant as the tasks involve multimodal inference and likely utilize MLLMs. 'Unify Models' is moderately relevant due to the 'unified five-stage workflow' mention, though it refers to process unification. 'model-based RL' is low relevance as RL is not explicitly detailed. No expert authors from the provided list are found in the author list.

Keywords

AutoMedBench, Medical Agentic AI, Workflow Benchmark, Multimodal Inference, Agent Evaluation, Medical Imaging, Stage-level Analysis

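In-Depth Analysis

Chinese Title: AutoMedBench:迈向基于智能体AI模型的医学自动研究

Summary: The paper presents AutoMedBench, a benchmark for evaluating autonomous medical-AI research workflows. Existing medical agent benchmarks mainly evaluate final outputs and offer little visibility into agent behavior during the research process. AutoMedBench organizes agent execution into a unified five-stage workflow (S1-S5: Plan, Setup, Validate, Inference, Submit) and covers 24 tasks across five research tracks: segmentation, image enhancement, visual question answering, report generation, and lesion detection. Each task has two difficulty tiers, Lite and Standard. Evaluation combines a process-level score (the agent score) with the final task score, and records interaction traces and error codes. The experiments find that Validate is the weakest stage on average and Setup the strongest. Error analysis shows that verification and submission errors dominate (37.7% and 38.1% of fired codes), while task-understanding errors account for only 0.9%. Runs with one fired error code score 48% lower overall, on average, than runs with none. These results suggest that the main bottleneck for current medical auto-research agents lies not only in domain knowledge but, more importantly, in robust engineering execution, intermediate validation, and error recovery.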

Innovations:

  • 提出工作流感知的评估协议,将医学AI研究过程分解为五个可检查的阶段(S1-S5),实现过程级评分。
  • 构建包含24个任务、覆盖五种医学研究轨道(分割、增强、VQA、报告生成、检测)的基准,任务来源为公开挑战赛和数据集。
  • 引入双难度等级(Lite和Standard),在保持数据和指标不变的情况下改变任务说明中的脚手架信息,便于分析智能体自主性。
  • 设计后运行错误诊断机制,基于交互轨迹自动标注五类错误代码(E1-E5),揭示隐藏的流程故障。
  • 开源完整的执行框架和沙箱环境,包括容器化智能体和隔离评估环境,支持可复现的基准测试。

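Methodology: The paper combines benchmark construction with experimental evaluation. Tasks are selected from more than 20 public medical challenges (such as KiTS19), subject to three requirements: public inputs, hidden references, and a workflow that can be expressed in the five stages. Each task follows the unified five-stage workflow (S1-S5), with the agent interacting with a code-execution environment inside an isolated container. Each run receives an agent score (based on S1-S5 completion) and a task score (a deterministic evaluation against private references). After the run, the full interaction record (conversation.json) is used to automatically tag five classes of error codes: E1 task understanding, E2 data/model setup, E3 validation and recovery, E4 implementation and execution, and E5 delivery and submission. Six frontier foundation models are benchmarked over thousands of recorded runs, with stage-level and error-level analysis.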
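
The stage-level and error-level aggregation described above can be sketched on made-up run records. Everything here is hypothetical: the record layout, the numbers, and the function names are illustrative and do not reflect AutoMedBench's actual data format or scoring rubric.

```python
from collections import Counter
from statistics import mean

# Hypothetical run records: stage scores in [0, 1], fired error codes, overall score.
runs = [
    {"stages": {"S1": 0.9, "S2": 1.0, "S3": 0.4, "S4": 0.8, "S5": 0.7},
     "errors": ["E3"], "overall": 0.5},
    {"stages": {"S1": 0.8, "S2": 0.9, "S3": 0.5, "S4": 0.9, "S5": 0.9},
     "errors": [], "overall": 0.9},
    {"stages": {"S1": 1.0, "S2": 1.0, "S3": 0.3, "S4": 0.6, "S5": 0.2},
     "errors": ["E3", "E5"], "overall": 0.3},
    {"stages": {"S1": 0.7, "S2": 0.9, "S3": 0.6, "S4": 0.9, "S5": 1.0},
     "errors": [], "overall": 1.0},
]

def stage_means(runs):
    """Average score per workflow stage across runs."""
    stages = runs[0]["stages"].keys()
    return {s: mean(r["stages"][s] for r in runs) for s in stages}

def error_shares(runs):
    """Share of each error code among all fired codes."""
    counts = Counter(code for r in runs for code in r["errors"])
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}

def relative_drop(runs):
    """Relative gap in mean overall score: one-error runs vs. error-free runs."""
    one_error = mean(r["overall"] for r in runs if len(r["errors"]) == 1)
    clean = mean(r["overall"] for r in runs if not r["errors"])
    return (clean - one_error) / clean

means = stage_means(runs)
shares = error_shares(runs)
print(min(means, key=means.get), max(means, key=means.get), shares, relative_drop(runs))
```

On real data, the weakest and strongest stages, the per-code shares, and the relative drop correspond to the quantities the report quotes (Validate vs. Setup, 37.7% and 38.1%, and 48%).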

Key Results:

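  • Validate (S3) is the weakest workflow stage on average and Setup (S2) the strongest, indicating that agents are better at making pipelines runnable than at verifying their reliability.
  • In the error-code analysis, validation errors (E3) account for 37.7%, submission errors (E5) for 38.1%, and task-understanding errors (E1) for only 0.9%.
  • Runs with one fired error code score 48% lower overall, on average, than runs with none, so error codes are strongly associated with degraded performance.
  • Hidden failures include model-loading failures, shape errors, skipped validation, empty outputs, and malformed submissions, which final-output metrics often mask.
  • On the overall leaderboard, agents differ markedly in agent score and task score, but all of them perform poorly at the Validate stage.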

Tech Stack:

  • 大型语言模型(LLM)作为智能体基础模型
  • 代码执行环境(容器化沙箱)
  • 五阶段工作流评分规则(S1-S5)
  • 确定性评估(私有参考)
  • 错误代码自动标注(基于交互轨迹)
  • Docker容器隔离
  • GitHub仓库和在线排行榜

Strengths:

  • 首次提出面向医学AI研究全流程的智能体基准,填补了过程级评估的空白。
  • 工作流感知设计使得不同任务类型(分割、VQA等)可在统一协议下比较。
  • 双难度等级和错误诊断机制提供了细粒度的智能体能力分析。
  • 开源框架和容器化环境保证了实验的可复现性和公平性。
  • 实验规模大(数千次运行),结论具有统计意义。

Limitations:

  • 基准任务仅包含推理时工作流,不包括训练阶段或需要长期临床对话的任务。
  • 错误代码标注依赖交互轨迹,可能遗漏某些隐式错误或误判。
  • 当前仅评估了6种基础模型,未涵盖所有主流智能体框架。
  • 任务选择偏向公开挑战赛,可能不完全代表真实临床研究场景。
  • 过程级评分与最终任务分数之间的关系未进行深入因果分析。

Relevance To Keywords:

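  • Native multimodal large models: the evaluated agents are typically built on large foundation models (LLMs, often multimodal), but the benchmark itself does not constrain the model architecture, so the link to unified multimodal understanding and generation is indirect.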
  • 世界模型:论文未直接涉及世界模型,但智能体在验证阶段需要模拟和检查中间结果,可视为世界模型的一种简单应用。
  • 表征学习:基准任务涉及医学图像表征,但论文重点在智能体工作流而非表征学习方法。
  • 模型基强化学习:论文未使用强化学习,但智能体在错误恢复和验证中可能隐含类似强化学习的试错机制。
  • 后训练:论文未涉及后训练,但智能体在设置阶段需加载预训练模型,与后训练概念有间接联系。
Score: 36.0 / 27.8
Authors: Shuo Lu, Yinuo Xu, Kecheng Yu, Siru Jiang, Yongcan Yu, Yubin Wang, Haitao Yang, Yuxiang Zhang, Bin Wang, Ran He, Jian Liang
Published: 2026-06-01
TL;DR: WorldCoder-Bench introduces a benchmark for evaluating LLMs' ability to generate physically grounded 3D worlds, revealing significant challenges in state consistency and interaction chains across frontier models.
摘要翻译

大型语言模型 (LLMs) 不仅被要求编写静态界面,还被要求能够从自然语言构建可执行的交互式世界。浏览器原生 3D(通常使用 Three.js 构建)是自然的下一个前沿:生成的程序必须整合资产,遵守空间和物理约束,并确保面向用户的控件与隐藏的运行时状态保持同步。然而,现有的网页生成基准和评估器主要仅观察像素或 DOM 节点,而 Three.js 世界的机制却在不透明的 <canvas> 内部展开。我们提出 WorldCoder-Bench,这是一个用于自主且基于物理的 3D 世界合成的基准。WorldCoder-Bench 包含 2,026 个专家精心策划的任务,涵盖模拟、渲染和应用场景,并提供可选的 .glb 资产及隐藏的行为契约。我们进一步提出 StateProbe,这是一种基于执行的协议,它在沙箱浏览器中探测生成的程序,并验证隐藏且经过变异加固的契约,以检查运行时状态和转换。除验证覆盖率外,我们还报告自动化回报与时间效率乘数(Return on Automation and Time Efficiency Multiplier),以衡量经正确性调整后的成本和时间节省。在九个前沿模型中,最佳系统在 WorldCoder-Core 上仅达到 27.8% 的验证覆盖率,在 WorldCoder-Robust 上为 19.9%,其失败主要由状态模式漂移和断裂的交互链主导,而非场景元素缺失。效用指标进一步表明,低成本或快速模型在较简单的领域仍能提供显著价值。WorldCoder-Bench 可在 https://anonymous.4open.science/r/WorldCoder-Bench/ 获取。

Abstract

Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with Three.js, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a Three.js world unfold inside an opaque <canvas>. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at https://anonymous.4open.science/r/WorldCoder-Bench/.

Score Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 6.0/10 9.0
MLLM 1.5 5.0/10 7.5
MultiModal 1.5 5.0/10 7.5
model-based RL 1.5 3.0/10 4.5

评分理由: The paper focuses on 3D world synthesis from text, making 'World Models' moderately relevant due to the title and concept. 'MLLM' and 'MultiModal' are relevant for text-to-3D tasks. 'Unify Models', 'Tokenizer', 'Visual Encoder', and 'model-based RL' are less relevant as the paper is a benchmark for generation, not architecture or RL. No expert authors from the specified list were found. Weighted sum is 36.0, exceeding the dynamic pass score of 27.8.

Keywords

WorldCoder-Bench, 3D World Synthesis, StateProbe, Physical Grounding, LLM Benchmark, Three.js, Runtime Verification

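In-Depth Analysis

Chinese Title: WorldCoder-Bench:物理基础3D世界合成的基准测试

Summary: The paper presents WorldCoder-Bench, described as the first benchmark of behavioral correctness for executable Three.js 3D worlds. It contains 2,026 expert-curated tasks across three categories: Simulation, Rendering, and Application. Existing benchmarks evaluate only pixels or DOM nodes and cannot detect hidden behavioral properties such as physics and state synchronization. The authors therefore propose the StateProbe protocol, which executes programs in a sandboxed browser and verifies hidden behavioral contracts; the contracts are mutation-hardened to keep them strict. The paper also introduces two utility metrics, Return on Automation and Time Efficiency Multiplier, which combine inference cost with human development cost to assess a model's value. Across nine frontier models, the best system reaches only 27.8% verification coverage on the Core set and 19.9% on the Robust set, with failures driven mainly by state-schema drift and broken interaction chains. An external evaluation method misclassified 45.6% of severely defective outputs as passing, which underlines the need for the proposed approach.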

Innovations:

  • 首个评估生成3D世界行为正确性而非视觉合理性的基准,覆盖物理、资产集成和状态同步三个被忽视的方面。
  • 提出STATEPROBE执行式状态验证协议,使用隐藏的突变硬化行为合约,确保通过测试反映真实正确性。
  • 引入自动化回报率(Return on Automation)和时间效率乘数(Time Efficiency Multiplier)两个实用指标,结合成本与延迟衡量模型的实际价值。
  • 构建了鲁棒性变体集(WORLDCODER-ROBUST),通过扰动物理常数、对象数量等防止记忆化,测试机制级泛化能力。
  • 系统评估了9个前沿模型,揭示了现有模型在物理正确性和状态同步方面的严重不足。

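Methodology: The dataset is built in four stages: expert seed authoring, LLM-assisted expansion and filtering, runtime validation and contract writing, and decontamination. Each task asks for a self-contained HTML file generated from a natural-language instruction and optional .glb assets. Evaluation follows the StateProbe protocol: the program runs in headless Chromium, runtime variables are exposed through a standardized interface, a scripted action sequence is executed with state snapshots, and the observed state changes are compared against the hidden behavioral contracts. Contracts are hardened by mutation testing (deleting state updates, scaling physical constants, swapping event targets, and so on). Metrics are verification coverage, Return on Automation (correctness-adjusted cost savings), and Time Efficiency Multiplier (correctness-adjusted time savings).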
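
The probe-and-verify idea can be illustrated without a browser. The sketch below is a toy analogue in plain Python, not the StateProbe implementation: a counter "world" stands in for a generated Three.js program, contracts are predicates over a trace of state snapshots, and verification coverage is assumed to be the fraction of contracts satisfied. All names are hypothetical.

```python
def make_counter_world(step=1):
    """Toy 'generated program': a button that increments a hidden counter."""
    state = {"count": 0, "label": "0"}

    def act(action):
        if action == "click":
            state["count"] += step
            state["label"] = str(state["count"])

    def snapshot():
        return dict(state)

    return act, snapshot

def probe(world_factory, actions):
    """Run scripted actions and snapshot the state before and after each one."""
    act, snapshot = world_factory()
    trace = [snapshot()]
    for action in actions:
        act(action)
        trace.append(snapshot())
    return trace

# Hidden contracts: predicates over the state trace.
CONTRACTS = {
    "click_increments_by_one": lambda t: all(
        b["count"] - a["count"] == 1 for a, b in zip(t, t[1:])),
    "label_tracks_count": lambda t: all(
        s["label"] == str(s["count"]) for s in t),
}

def verification_coverage(world_factory, actions=("click", "click", "click")):
    """Assumed definition: fraction of contracts the program satisfies."""
    trace = probe(world_factory, actions)
    passed = sum(1 for check in CONTRACTS.values() if check(trace))
    return passed / len(CONTRACTS)

def mutant_dropped_update():
    """Mutation: the state update is deleted, so clicks do nothing."""
    state = {"count": 0, "label": "0"}
    return (lambda action: None), (lambda: dict(state))

def mutant_scaled_step():
    """Mutation: the increment constant is scaled."""
    return make_counter_world(step=2)

print(verification_coverage(make_counter_world),
      verification_coverage(mutant_dropped_update),
      verification_coverage(mutant_scaled_step))
```

A contract set counts as mutation-hardened in this toy sense when every mutant fails at least one contract while the correct program passes all of them.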

Key Results:

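  • The best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust.
  • DOM-based scores are almost uncorrelated with hidden-state correctness (Kendall τb = -0.02).
  • Eight rounds of agent probing (at roughly 400 times the cost of the DOM check) still pass 45.6% of severely defective outputs.
  • Cheap or fast models still provide substantial value on easier domains (high Return on Automation).
  • Failures are dominated by state-schema drift and broken interaction chains rather than missing scene elements.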

Tech Stack:

  • Three.js(WebGL 3D渲染库)
  • WebGL(浏览器图形API)
  • Headless Chromium(无头浏览器)
  • JavaScript(编程语言)
  • GLB/glTF(3D模型格式)
  • LLM(大语言模型,如GPT-4、Claude等)
  • Kendall τb(相关性统计量)
  • 突变测试(合约硬化)

Strengths:

  • 填补了3D世界生成行为正确性评估的空白,解决了现有基准仅关注视觉外观的局限。
  • STATEPROBE协议通过隐藏合约和突变硬化确保了评估的严格性和可靠性。
  • 数据集规模大(2026任务)、覆盖广(3大类15子域),并包含鲁棒性变体。
  • 引入实用经济指标,使评估更贴近实际应用价值。
  • 对多个前沿模型进行了全面基准测试,揭示了重要缺陷。

Limitations:

  • 仅针对Three.js框架,可能不适用于其他3D引擎(如Unity、Babylon.js)。
  • 评估依赖于预定义的行为合约,可能无法覆盖所有可能的正确行为。
  • 任务主要面向浏览器原生3D,未涵盖更复杂的物理模拟或VR/AR场景。
  • 数据集构建依赖专家和LLM辅助,可能存在偏差或遗漏。
  • STATEPROBE需要沙盒执行环境,部署成本较高。

Relevance To Keywords:

  • 世界模型:论文评估的3D世界合成涉及物理模拟和状态同步,与世界模型中的环境建模和预测相关。
  • 表征学习:生成程序需学习如何表示3D场景、物理规律和交互逻辑,但论文更侧重代码生成而非表征学习本身。
  • 多模态大模型:任务输入为自然语言和可选3D资产,输出为可执行代码,属于多模态生成范畴。
  • 模型-Based RL:论文未直接涉及强化学习,但生成的3D世界可作为RL环境,评估其物理正确性对RL应用重要。
  • 后训练:论文评估的是预训练模型的零样本生成能力,未涉及后训练或微调。
Score: 34.5 / 27.8
Authors: Raghad Albusayes, Munirah Alyahya
Published: 2026-06-01
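TL;DR: To answer complex visual, spatiotemporal, and verbal questions over large-scale, multi-view, long-context video streams, the paper proposes an agentic framework built on hierarchical knowledge-graph retrieval and reports high zero-shot reasoning accuracy.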
摘要翻译

本文介绍了我们在 CVPR 2026 EgoVis Workshop 举办的 CASTLE 2026 Challenge 中获胜的方法,我们的团队在全球范围内荣获第三名。该挑战要求参与者在海量多模态视频流中回答高度复杂的视觉、时空及语言问题,涵盖视觉计数、动作定位、多视角跟踪和说话人时序推理等任务。底层数据集由 15 个第一人称(ego)和第三人称(exo)相机源捕获的超过 600 小时同步录像组成。为应对该环境带来的极端规模与长上下文需求,我们提出了一种针对长视频理解优化的无训练智能体框架(training-free agentic framework)。我们的框架引入了两个核心架构组件:i) 视频知识图谱(Video Knowledge Graph),用于映射静态和动态实体、它们的时序关系及交叉事件,从而实现多跳关系推理;ii) 自适应智能体工作流(adaptive agentic workflow),通过层级检索与索引来解决复杂查询。实验结果表明,我们的框架在长上下文多视角流上实现了高零样本(zero-shot)推理准确率。我们的代码将在 https://github.com/RaghadKhaled/CASTLE-Challenge-Framework 上发布。

Abstract

This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through a hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.

Score Details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 4.0/10 6.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 5.0/10 7.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 0.0/10 0.0

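Scoring rationale: The paper's core contribution is an agentic framework based on knowledge-graph retrieval for long-context video understanding. MultiModal is highly relevant (the framework explicitly processes multimodal video streams); MLLM is moderately relevant (language and visual reasoning are involved); Visual Encoder is moderately relevant (video processing implies visual feature extraction); Unify Models and World Models have low relevance (the focus is system architecture rather than model unification or generative world models); Tokenizer and model-based RL are unrelated (neither tokenization nor reinforcement learning is mentioned). The author list does not include Yang Shi or the other designated experts.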

Keywords

Video Understanding, Multi-View, Long-Context, Agentic Framework, Knowledge Graph, Hierarchical Retrieval, Zero-shot Reasoning

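In-Depth Analysis

Chinese Title: CVPR 2026 CASTLE挑战赛第三名:基于层次化知识图谱检索的智能体多视角长上下文视频理解

Summary: The paper proposes a training-free, agent-based framework for the long-video understanding task of the CASTLE 2026 Challenge at CVPR 2026. The task involves more than 600 hours of synchronized multi-view video (10 first-person and 5 third-person cameras) and requires answering complex visual, spatiotemporal, and verbal questions. The framework has two core components: 1) a Video Knowledge Graph that models static and dynamic entities, their temporal relationships, and intersecting events as a directed graph, supporting multi-hop relational reasoning; and 2) a hierarchical agentic workflow with four modules (planning, graph retrieval, clip retrieval, and answer generation) that gathers evidence step by step before answering. The experiments report high zero-shot reasoning accuracy, and the team placed third globally. According to the abstract, the code will be released.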

Innovations:

  • 提出训练无关的智能体框架,无需微调即可处理超长多视角视频问答。
  • 构建视频知识图谱,显式建模实体类型(人物、物体、地点、动作等)和关系类型(对话、空间、操作等),并附带时间戳,支持多跳推理。
  • 设计层次化检索粒度:通过SQLite数据库存储带时间、摄像头、日期的边信息,实现从全局到局部的高效检索。
  • 引入“严格到宽松”的SQL生成策略,确保图检索的鲁棒性。
  • 利用知识图谱作为时间索引指导视频片段检索,减少冗余计算。

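Methodology: The method has two phases, memory construction and agentic reasoning. Memory construction: each video is split uniformly into 30-second clips, and a unified audio-visual extraction function (Gemini 2.5 Pro) produces a multimodal description of each clip. LangChain's LLMGraphTransformer then extracts entities and relations from the descriptions to build a directed knowledge graph; every edge carries a time interval and camera metadata and is stored in an SQLite3 database. Agentic reasoning: PlannerAgent decomposes the user query into subtasks; GraphRetrievalAgent generates SQL queries with a strict-to-relaxed strategy to retrieve relevant information from the knowledge graph; ReflectorAgent decides whether video clips need to be retrieved as well; ClipRetrievalAgent locates and extracts clips using the time fields, and ClipAnalyzeAgent analyzes them to provide visual evidence; finally, VQAAgent combines all the evidence to generate the answer.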
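
The edge store and the strict-to-relaxed retrieval can be sketched with the standard `sqlite3` module. The table schema, the three relaxation levels, and the sample edges below are assumptions made for illustration; in the described system the SQL is generated by an LLM agent rather than hard-coded.

```python
import sqlite3

def build_edge_store(edges):
    """In-memory table of graph edges with time span, camera, and day metadata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE edges (subject TEXT, relation TEXT, object TEXT,"
        " start_s REAL, end_s REAL, camera TEXT, day INTEGER)"
    )
    conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?, ?, ?)", edges)
    return conn

def retrieve(conn, subject, relation, obj):
    """Try an exact match first, then fall back to progressively looser queries."""
    queries = [
        ("strict",
         "SELECT * FROM edges WHERE subject = ? AND relation = ? AND object = ?"
         " ORDER BY start_s",
         (subject, relation, obj)),
        ("no_relation",
         "SELECT * FROM edges WHERE subject = ? AND object LIKE ? ORDER BY start_s",
         (subject, f"%{obj}%")),
        ("object_only",
         "SELECT * FROM edges WHERE object LIKE ? ORDER BY start_s",
         (f"%{obj}%",)),
    ]
    for level, sql, params in queries:
        rows = conn.execute(sql, params).fetchall()
        if rows:
            return level, rows
    return "none", []

# Hypothetical edges: (subject, relation, object, start_s, end_s, camera, day)
edges = [
    ("Alice", "picks_up", "red mug", 30.0, 60.0, "ego_03", 1),
    ("Bob", "talks_to", "Alice", 60.0, 90.0, "exo_01", 1),
    ("Alice", "washes", "red mug", 120.0, 150.0, "ego_03", 1),
]
conn = build_edge_store(edges)
level, rows = retrieve(conn, "Alice", "holds", "mug")
print(level, rows)
```

The `start_s`, `end_s`, and `camera` columns of the returned rows are what a clip-retrieval step would use to decide which 30-second clips to open, which is how the graph acts as a temporal index.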

Key Results:

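  • Third place globally in the CVPR 2026 CASTLE challenge.
  • On 185 long-context QA samples, zero-shot reasoning accuracy is markedly higher than the baseline.
  • The knowledge graph proves effective as a structured temporal index for locating relevant video clips efficiently.
  • The hierarchical retrieval strategy (graph first, clips second) reduces unnecessary video processing.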

Tech Stack:

  • Gemini 2.5 Pro(多模态大模型,用于音视频描述、实体关系提取、推理)
  • LangGraph(智能体框架)
  • LLMGraphTransformer(知识图谱构建)
  • SQLite3(图存储与查询)
  • 严格到宽松的SQL生成策略
  • 层次化检索(图级→片段级)

Strengths:

  • 训练无关,无需微调即可适应新视频域,实用性强。
  • 知识图谱显式建模时间关系,支持复杂多跳推理。
  • 层次化检索机制高效,先利用结构化索引缩小范围,再精细分析。
  • 多视角处理能力:通过摄像头元数据支持跨摄像头事件关联。
  • The authors plan to release the code, which would support reproducibility.

Limitations:

  • 依赖LLM提取实体和关系的质量,可能存在遗漏或错误。
  • 30秒片段划分可能丢失细粒度时间边界信息。
  • 仅使用第一人称视频,未充分利用第三人称视角信息。
  • 智能体流程中多次调用LLM,推理成本较高。
  • 在极端长视频(600小时)上,图构建和检索的扩展性有待验证。

Relevance To Keywords:

  • 原生多模态大模型:框架使用Gemini 2.5 Pro作为核心模型,直接处理视频和音频,属于多模态大模型应用。
  • 表征学习:知识图谱将视频内容表示为结构化实体和关系,是一种隐式的表征学习。
  • 世界模型:视频理解涉及对时空动态和因果关系的建模,知识图谱可视为一种简化的世界模型。
  • Post-training: the framework is a zero-shot, training-free method, so it does not involve post-training.
  • 强化学习:论文未直接使用强化学习,但智能体框架中的规划与反思可类比于强化学习中的策略搜索。
Score: 34.5 / 27.8
Authors: Zequn Xie, Xibei Jia, Sihang Cai, Shulei Wang, Tao Jin
Published: 2026-06-01
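TL;DR: The paper proposes the ROGLE framework, which aligns text and images at global and local levels through automated region supervision and significantly improves text-based person search.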
摘要翻译

基于文本的人体搜索(TBPS)旨在利用自然语言查询检索行人图像。然而,现有的 TBPS 模型,尤其是基于 CLIP 的模型,由于继承了短文本描述训练所带来的全局表示偏差和语义稀疏性,在细粒度理解方面面临挑战。这导致了细粒度对齐能力较弱,而区域级标注的稀缺性进一步加剧了这一困难。为此,我们提出 ROGLE(鲁棒全局 - 局部嵌入),这是一个统一框架,通过自动化的区域 - 句子匹配(RSM)策略克服了对昂贵人工标注的依赖。RSM 自动挖掘伪区域 - 句子对,以实现可扩展的细粒度监督。此外,ROGLE 采用一种多粒度学习策略,该策略融合了全局对比学习与区域级局部对齐。我们还引入了 P-VLG 基准,这是一个大型数据集,通过整理和丰富来自现有公共基准的图像构建而成。该数据集拥有超过 10 万个标注区域和丰富的长文本描述,使其成为首个同时支持全局和局部评估协议的 TBPS 基准。大量实验表明,ROGLE 显著优于现有方法,尤其在具有挑战性的长文本查询上表现更佳。代码及 P-VLG 基准将公开提供。

Abstract

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.

Score Details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 0.0/10 0.0

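Scoring rationale: The paper focuses on text-based person search, a typical multimodal (MultiModal) retrieval task, so that keyword is the most relevant. It proposes a unified global-local alignment framework (Unify Models) and implicitly uses a visual encoder (Visual Encoder), giving moderate relevance. Tokenizer and MLLM are not core contributions, so their relevance is low. World Models and model-based RL are unrelated to the retrieval task and its methodology and score 0. The author list does not include any of the designated experts.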

Keywords

Text-Based Person Search, Global-Local Alignment, Automated Region Supervision, Multi-granular Learning, Cross-modal Retrieval, Fine-grained Understanding, P-VLG Benchmark

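In-Depth Analysis

Chinese Title: ROGLE:基于自动区域监督的鲁棒全局-局部对齐用于文本行人搜索

Summary: Existing text-based person search (TBPS) models, especially CLIP-based ones, align poorly at fine granularity because of global representational bias and semantic sparsity. To address this, the paper proposes ROGLE, a unified framework. Its automated Region-to-Sentence Matching (RSM) strategy segments image regions with SAM, computes cross-modal similarity with CLIP, and mines pseudo region-sentence pairs automatically, providing fine-grained supervision without manual annotation. ROGLE also adopts a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment, and adds a reliability-aware alignment calibration mechanism to improve robustness to noisy correspondences. In addition, the paper builds the P-VLG benchmark, with more than 100,000 annotated regions and rich long-form captions, supporting both global and local evaluation protocols. Experiments show that ROGLE significantly outperforms existing methods on complex long-form queries.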

Innovations:

  • 提出ROGLE统一框架,通过自动区域-句子匹配(RSM)模块从长描述中挖掘伪对齐,实现无需人工标注的细粒度监督。
  • 构建P-VLG基准数据集,首个支持多粒度TBPS评估的数据集,包含丰富描述和超过10万个区域级标注。
  • 设计多粒度学习策略,协同全局对比学习与区域级局部对齐,提升模型对细粒度细节的区分能力。
  • 引入可靠性感知对齐校准机制,增强模型对大规模数据中噪声对应关系的鲁棒性。

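Methodology: The model uses a dual-encoder architecture based on pretrained CLIP. SAM automatically segments each image into semantic regions, and spaCy splits each long caption into atomic sentences. The CLIP encoders extract region and sentence features, cosine similarities are computed, and a greedy assignment produces pseudo region-sentence pairs. During training, the global branch performs image-text contrastive learning, the local branch uses the pseudo pairs for multi-granular local alignment, and a reliability calibration loss is added. The P-VLG benchmark consolidates images from four public datasets (CUHK-PEDES, ICFG-PEDES, RSTPReid, UfineBench) and generates region annotations automatically.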
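
The similarity-then-greedy-assignment step can be sketched on toy feature vectors. The one-to-one constraint and the similarity threshold are assumptions added for the example, and short hand-written vectors stand in for CLIP region and sentence features; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_region_sentence_pairs(region_feats, sentence_feats, threshold=0.5):
    """Greedy one-to-one assignment by descending cosine similarity.

    Pairs below the (assumed) threshold are dropped as unreliable.
    """
    scored = sorted(
        ((cosine(r, s), i, j)
         for i, r in enumerate(region_feats)
         for j, s in enumerate(sentence_feats)),
        reverse=True,
    )
    used_regions, used_sentences, pairs = set(), set(), []
    for sim, i, j in scored:
        if sim < threshold:
            break
        if i in used_regions or j in used_sentences:
            continue
        pairs.append((i, j, sim))
        used_regions.add(i)
        used_sentences.add(j)
    return pairs

# Toy features: three regions, two sentences.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
sentences = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
pairs = greedy_region_sentence_pairs(regions, sentences)
print(pairs)
```

The third region stays unmatched because both sentences are already taken by better-scoring regions, which is the kind of ambiguity a reliability-aware calibration step would need to handle during training.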

Key Results:

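  • ROGLE significantly outperforms existing methods on several TBPS benchmarks, especially on long queries and fine-grained descriptions.
  • The P-VLG benchmark contains 6,801 training identities, 48,485 images, and 68,990 rich captions; its validation set provides more than 100,000 region-caption alignments.
  • The automated RSM module generates high-quality pseudo region-sentence pairs without manual annotation.
  • The multi-granular learning strategy and reliability calibration improve robustness to noise and fine-grained alignment.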

Tech Stack:

  • CLIP (ViT-B/16) 预训练模型
  • SAM (Segment Anything Model, ViT-H) 图像分割
  • spaCy 自然语言处理库
  • 余弦相似度计算
  • 贪心分配策略
  • 对比学习损失 (全局和局部)
  • 可靠性感知对齐校准机制

Strengths:

  • 提出自动化的区域-句子匹配方法,大幅降低对人工标注的依赖,具有可扩展性。
  • 构建了首个支持全局和局部评估的TBPS基准,推动细粒度检索研究。
  • 多粒度学习策略有效结合全局语义和局部细节,提升模型对复杂查询的理解能力。
  • 在多个公开数据集上取得领先性能,验证了方法的有效性。

Limitations:

  • 自动生成的伪区域-句子对可能存在噪声,尽管有可靠性校准,但无法完全消除错误对齐。
  • 依赖SAM和CLIP的预训练质量,在特定场景下分割或特征提取可能不准确。
  • P-VLG基准基于现有数据集整合,未引入全新真实场景图像,多样性可能受限。
  • 方法计算开销较大(SAM分割+CLIP编码),实时性可能受影响。

Relevance To Keywords:

  • Unify Models / 原生多模态大模型: 论文基于CLIP多模态模型,通过后训练实现全局-局部对齐,属于多模态大模型的理解与检索应用。
  • World Models / 表征学习: 通过对比学习和区域对齐学习多粒度表征,提升模型对视觉世界的细粒度理解。
  • Model-Based RL / 强化学习: 论文未直接涉及强化学习,但后训练过程可视为一种基于伪标签的优化,与模型微调相关。
  • 后训练: ROGLE在预训练CLIP基础上进行多粒度微调,属于后训练范式。
Score: 33.0 / 27.8
Authors: Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya
Published: 2026-06-01
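TL;DR: The paper proposes RL-ACRGNet, a reinforcement-learning-based radiology report generation network that combines a DenseNet encoder with an LSTM decoder and reports better report-generation quality than baselines on standard datasets.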
摘要翻译

医学影像解读是现代临床诊断的基石,然而放射学报告的手动生成过程仍耗时且易出现解读不一致。在医学人工智能领域,通过深度学习自动化这些描述有望优化临床工作流程并标准化诊断输出。然而,由于在捕捉细粒度视觉特征及确保临床连贯性方面存在局限,准确的疾病检测与精确的报告生成仍面临重大挑战。为解决这些问题,我们提出了 RL-ACRGNet,这是一种改进的编码器 - 解码器模型,该模型在离策略强化学习框架内整合了预训练的 DenseNet 编码器和多层 LSTM 解码器。通过采用双网络方法,基于度量奖励机制优化视觉 - 语义嵌入,我们证明 RL-ACRGNet 在 IU-Xray 数据集上始终优于最先进基线,在 BLEU-4 (0.47%)、METEOR (0.17%) 和 ROUGE-L (0.518) 指标上实现了定量提升。此外,在大规模 MIMIC-CXR 数据集上的综合评估证实了该模型具有鲁棒的泛化能力,并能生成高质量且具有临床相关性的报告。

Abstract

Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR data set confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports

Score Details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 8.0/10 12.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 2.0/10 3.0

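Scoring rationale: The paper's core design uses DenseNet as the visual encoder (high relevance for Visual Encoder) and handles an image-to-text multimodal task (high relevance for MultiModal). Reinforcement learning is involved, but mainly as reward-driven policy optimization without explicit learning of an environment model, so model-based RL has low relevance. The paper does not address unified model architectures, tokenizer design, world-model construction, or multimodal large language models (MLLM), so those keywords score low. The weighted total is 33.0, above the dynamic pass score of 27.8.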

Keywords

Chest Radiology, Report Generation, Reinforcement Learning, DenseNet, LSTM, Encoder-Decoder, Medical Imaging

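In-Depth Analysis

Chinese Title: RL-ACRGNet:基于强化学习的胸部放射学报告生成网络

Summary: The paper proposes RL-ACRGNet, a hybrid model that combines a pretrained DenseNet encoder with a multilevel LSTM decoder and is trained in an off-policy reinforcement learning framework. To address the weaknesses of conventional sequence-to-sequence models in capturing fine-grained visual features and maintaining clinical coherence, the model uses a dual-network structure (a policy network and a value network) and refines visual-semantic embeddings through a metric-based reward mechanism. On the IU-Xray dataset it reports BLEU-4 (0.47%), METEOR (0.17%), and ROUGE-L (0.518) and is described as outperforming state-of-the-art baselines. Evaluation on the large-scale MIMIC-CXR dataset further supports the model's generalization and its ability to generate high-quality clinical reports. The work extends the authors' earlier lung-disease diagnosis models to full descriptive report generation, aiming to improve radiologist efficiency and speed up diagnostic decisions.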

Innovations:

  • 提出RL-ACRGNet,一种基于强化学习的混合自动报告生成网络,专门用于胸部X光图像,整合了图像分析与临床报告生成。
  • 设计统一的DenseNet视觉编码器与多级LSTM解码器架构,有效捕获细粒度视觉模式与长期语言依赖。
  • 开发增强的离策略演员-评论家策略,利用双网络(策略网络与价值网络)实现单词级精炼与序列级连贯性优化。
  • 采用复合奖励函数,线性组合标准化NLP指标(如BLEU、METEOR、ROUGE-L),直接优化临床准确性与报告多样性。

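Methodology: The paper uses an encoder-decoder architecture. The encoder is a pretrained DenseNet that extracts deep visual features from chest X-ray images; the decoder is a multilevel LSTM that generates the report text. Within the reinforcement learning framework, the policy network handles local word generation, the value network assesses global coherence, and a reward network scores partially generated reports. Training uses an off-policy actor-critic method, with network updates guided by a composite reward function (a linear combination of BLEU, METEOR, and ROUGE-L). At inference time, the model generates the report step by step from the policy network's outputs.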
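
A composite reward of this kind is a weighted sum of metric values. The sketch below implements a simplified ROUGE-L (an LCS-based F1 over whitespace tokens) and combines it with BLEU-4 and METEOR values that are passed in as plain numbers; the weights, the sample sentences, and the metric values are all hypothetical, and the paper's actual weights are not given in this report.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def composite_reward(metrics, weights):
    """Linear combination of metric values; the weights are hypothetical."""
    return sum(weights[name] * metrics[name] for name in weights)

generated = "the heart is normal"
reference = "heart size is normal"
metrics = {
    "bleu4": 0.2,   # placeholder value; BLEU-4 is not computed here
    "meteor": 0.4,  # placeholder value; METEOR is not computed here
    "rouge_l": rouge_l_f1(generated, reference),
}
weights = {"bleu4": 0.5, "meteor": 0.25, "rouge_l": 0.25}
reward = composite_reward(metrics, weights)
print(metrics["rouge_l"], reward)
```

In an actor-critic setup, a scalar like `reward` would be the signal used to update the policy and value networks after a report (or partial report) is generated.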

Key Results:

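  • On IU-Xray, the reported figures are BLEU-4 (0.47%), METEOR (0.17%), and ROUGE-L (0.518), which the paper describes as better than state-of-the-art baselines.
  • Evaluation on the large-scale MIMIC-CXR dataset indicates good generalization and the ability to generate high-quality, clinically relevant reports.
  • Compared with earlier work (CNN-O-ELMNet, MultiFusionNet, CXR-Net), the model moves from disease diagnosis to full descriptive report generation.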

Tech Stack:

  • DenseNet(预训练卷积神经网络编码器)
  • 多级LSTM(长短期记忆网络解码器)
  • 离策略强化学习(Off-policy Reinforcement Learning)
  • 演员-评论家方法(Actor-Critic)
  • 策略网络(Policy Network)
  • 价值网络(Value Network)
  • 奖励网络(Reward Network)
  • 复合奖励函数(线性组合BLEU、METEOR、ROUGE-L)
  • IU-Xray数据集
  • MIMIC-CXR数据集

Strengths:

  • 将强化学习与CNN-RNN框架结合,有效解决了传统序列模型在全局连贯性和临床一致性上的不足。
  • 采用离策略演员-评论家方法,提高了样本效率并稳定了训练过程。
  • 复合奖励函数直接优化多个NLP指标,使报告生成更贴近临床评估标准。
  • 在多个数据集上验证了模型的泛化能力,且从诊断任务扩展至报告生成任务,具有实际临床价值。

Limitations:

  • 论文未详细讨论模型在罕见疾病或异常发现上的生成能力,可能对长尾分布不敏感。
  • 强化学习训练可能对超参数敏感,且计算成本较高,文中未提供详细的训练时间或资源消耗分析。
  • 仅使用标准NLP指标(BLEU、METEOR、ROUGE-L)评估,缺乏临床专家评估或更专业的医学指标(如临床准确性、错误率等)。
  • 模型依赖预训练DenseNet,可能无法充分利用最新视觉架构(如Transformer)的优势。

Relevance To Keywords:

  • 强化学习:论文核心方法为离策略强化学习,与关键词“强化学习”直接相关。
  • 表征学习:DenseNet编码器与LSTM解码器共同学习视觉与文本表征,属于表征学习范畴。
  • 世界模型:论文未明确构建世界模型,但强化学习中的环境可视为图像状态空间,有一定间接关联。
  • 多模态大模型的理解和生成一体化:论文实现从图像到文本的生成,属于多模态理解与生成一体化,但未使用大模型架构(如Transformer),规模较小。
  • 后训练:强化学习阶段可视为对预训练编码器-解码器的后训练优化,与“后训练”概念相关。
  • 原生多模态大模型:论文未采用原生多模态大模型(如LLaVA、GPT-4V),而是定制化CNN-RNN-RL架构,相关性较弱。
  • Model-Based RL:论文使用离策略RL,但未显式建模环境动态,属于Model-Free RL,与Model-Based RL相关性较低。
Score: 33.0 / 27.8
Authors: Lichao Wang, Zhaoxing Ren, Tianzhuo Yang, Jiaming Ji, Chi Harold Liu, Yaodong Yang, Juntao Dai
Published: 2026-06-01
TL;DR: SafeMCP proposes a server-side defense plugin using an internal world model and reinforcement learning to proactively constrain LLM agent tool acquisition, effectively mitigating power-seeking risks while preserving agent utility.
摘要翻译

随着大语言模型(LLM)智能体日益借助模型上下文协议(MCP)在复杂环境中运行,其动作空间的扩展赋予了智能体不安全的能力,并凸显了权力寻求的风险。虽然广泛的动作空间和更大的环境影响对于任务完成至关重要,但它们创造了一个脆弱的风险面,其中微小的错误或幻觉会被放大为灾难性失败。对此,我们提出 SafeMCP,这是一种服务端防御插件,通过针对未来安全风险的预测性推理来约束工具获取。SafeMCP 利用内部世界模型进行前瞻推理,以实施两层防御:主动工具过滤以约束危险的权力扩张,以及作为故障安全措施的即时干预。为了训练 SafeMCP,我们引入一个三阶段流程,包括环境动态锚定、安全策略初始化以及具有双重可验证奖励的强化学习(RL)。在 PowerSeeking Bench、ToolEmu 和 AgentHarm 上的实验表明,SafeMCP 实现了安全均衡,在有效缓解风险的同时保持了智能体效用。

Abstract

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.

Score Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 8.0/10 12.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 7.0/10 10.5

评分理由: The paper centers on SafeMCP, which explicitly utilizes an internal 'World Model' for look-ahead reasoning (Score: 8) and combines this with Reinforcement Learning for safety training (Score: 7). It focuses on LLM agent safety rather than model unification (Score: 2), tokenization architecture (Score: 0), or visual encoders (Score: 0). While it involves LLMs, it is not specifically a Multimodal LLM study (MLLM: 3, MultiModal: 2). No matching expert authors from the provided list were found in the authorship.

Keywords

SafeMCP, World Model, Look-ahead Reasoning, Power Seeking, Reinforcement Learning, LLM Agent Defense, Tool Acquisition

Detailed Analysis

Chinese Title: SafeMCP:基于环境前瞻推理的LLM智能体防御主动权力调节

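Summary: The paper addresses the safety risks that arise when LLM agents expand their tool sets through the Model Context Protocol (MCP) and proposes SafeMCP, a server-side defense plugin. The plugin uses an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering, which excludes tools that lead to hazardous state transitions, and immediate intervention as a fail-safe. The agent-defense interaction is modeled as a cooperative Stackelberg power game in which power is quantified by the state-dependent set of available tools. SafeMCP is trained with a three-stage pipeline (environmental dynamic grounding, safe policy initialization, and RL with dual verifiable rewards). Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP mitigates risk while preserving agent utility, reaching a safe equilibrium.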

Innovations:

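  • Formalizes the agent-defense interaction as a cooperative Stackelberg power game, recasting defense as a power-constraint problem that seeks the best balance between safety and utility.
  • Proposes SafeMCP, presented as the first server-side defense to combine environment look-ahead reasoning with proactive tool filtering; it regulates power without interrupting the workflow and is plug-and-play.
  • Uses a dual verifiable reward mechanism with Smooth Tchebycheff (STCH) scalarization to increase reward density and avoid gradient starvation.
  • Proposes a three-stage training pipeline (environmental dynamic grounding, safe policy initialization, reinforcement learning) so that SafeMCP understands the changes caused by an automatically expanding tool set and can reason about future risk states.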
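
For readers unfamiliar with Smooth Tchebycheff scalarization, the sketch below shows the generic form used in multi-objective optimization: the hard maximum over weighted objective gaps is replaced by a temperature-scaled log-sum-exp. It is written for losses to be minimized. How SafeMCP maps its two verifiable rewards onto this form is not stated in this report, so the example values, the weights, and the smoothing parameter `mu` are assumptions.

```python
import numpy as np


def tchebycheff(losses: np.ndarray, weights: np.ndarray, ideal: np.ndarray) -> float:
    """Classic (non-smooth) Tchebycheff scalarization: the worst weighted gap."""
    return float(np.max(weights * (losses - ideal)))


def smooth_tchebycheff(losses: np.ndarray, weights: np.ndarray,
                       ideal: np.ndarray, mu: float = 0.1) -> float:
    """Smooth variant: mu * log(sum_i exp(w_i * (f_i - z_i) / mu)).

    Uses the usual max-shift so the exponentials cannot overflow.
    """
    z = weights * (losses - ideal) / mu
    m = np.max(z)
    return float(mu * (m + np.log(np.sum(np.exp(z - m)))))


if __name__ == "__main__":
    # Hypothetical two-objective example (say, a safety loss and a utility loss).
    losses = np.array([0.2, 0.8])
    weights = np.array([0.5, 0.5])
    ideal = np.zeros(2)
    print(tchebycheff(losses, weights, ideal))
    print(smooth_tchebycheff(losses, weights, ideal, mu=0.1))
```

The smooth value upper-bounds the hard maximum and exceeds it by at most `mu * log(m)` for `m` objectives, so every objective contributes a gradient rather than only the currently worst one.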

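Methodology: SafeMCP is trained in three stages: (1) environmental dynamic grounding, which collects state-action trajectories in MCP environments and trains the world model to predict state transitions; (2) safe policy initialization, which uses look-ahead reasoning over the world model to initialize a safe tool-filtering policy; and (3) RL with dual verifiable rewards, which trains the defense policy on rewards covering safety and task completion, with STCH scalarization turning discrete feedback into a continuous signal. At inference time SafeMCP follows a simulate-evaluate-constrain paradigm: it predicts the next state, classifies its safety, filters out tools that lead to irreversible hazardous transitions, and applies the two-tier defense (proactive filtering plus immediate intervention).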
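
The simulate-evaluate-constrain loop can be illustrated with a toy example. In the sketch below the world model and the safety classifier are hand-written stand-ins (SafeMCP uses a learned model), and the tool names and state keys are hypothetical. Only the control flow follows the description above.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Predict = Callable[[State, str], State]
Hazard = Callable[[State], bool]


def filter_tools(state: State, tools: List[str],
                 predict: Predict, is_hazardous: Hazard) -> List[str]:
    """Proactive tier: expose only tools whose predicted next state is safe."""
    return [t for t in tools if not is_hazardous(predict(state, t))]


def guarded_call(state: State, tool: str,
                 predict: Predict, is_hazardous: Hazard) -> Tuple[State, bool]:
    """Fail-safe tier: re-check right before execution and block if hazardous."""
    nxt = predict(state, tool)
    if is_hazardous(nxt):
        return state, False  # intervention: the state is left unchanged
    return nxt, True


def toy_predict(state: State, tool: str) -> State:
    """Hypothetical world model: each tool sets one flag in the state."""
    effects = {"read_file": "has_read", "delete_all": "data_destroyed",
               "grant_admin": "is_admin"}
    nxt = dict(state)
    if tool in effects:
        nxt[effects[tool]] = True
    return nxt


def toy_is_hazardous(state: State) -> bool:
    """Hypothetical safety classifier over predicted states."""
    return state.get("data_destroyed", False) or state.get("is_admin", False)


if __name__ == "__main__":
    tools = ["read_file", "delete_all", "grant_admin"]
    print(filter_tools({}, tools, toy_predict, toy_is_hazardous))
    print(guarded_call({}, "delete_all", toy_predict, toy_is_hazardous))
```

Keeping the two tiers as separate functions mirrors the server-side placement: the filter shapes what the agent sees, while the guarded call still catches a hazardous request that reaches execution.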

Key Results:

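  • On PowerSeeking Bench, SafeMCP bounds the risk of an automatically expanding action space and prevents hazardous power-seeking behavior.
  • On ToolEmu and AgentHarm, SafeMCP substantially lowers safety risk while keeping task utility high.
  • Compared with existing post-hoc filtering methods, SafeMCP avoids over-refusal and steers the agent to complete tasks safely through fine-grained permission constraints.
  • The plug-and-play design makes SafeMCP compatible with a range of LLM agents without changes to their internal logic.

Tech Stack:

  • Cooperative Stackelberg Power Game
  • World Model for state-transition prediction
  • Reinforcement Learning
  • Dual Verifiable Rewards
  • Smooth Tchebycheff (STCH) scalarization
  • Model Context Protocol (MCP)
  • Look-Ahead Reasoning
  • Proactive Tool Filtering
  • Immediate Intervention

Strengths:

  • Models the defense problem as a power game, giving a clear theoretical framework.
  • The server-side design is plug-and-play, agent-agnostic, and easy to deploy.
  • Look-ahead reasoning overcomes the short-sightedness of conventional post-hoc filtering and can prevent future risks.
  • The dual reward mechanism and STCH scalarization improve training efficiency and stability.
  • The safety-utility balance is validated on several benchmarks.

Limitations:

  • Depends on the prediction accuracy of the world model; model errors can cause the defense to fail (the paper includes a failure-mode analysis but does not fully resolve this).
  • Training requires a large amount of environment interaction data, which may make transfer to entirely new domains difficult.
  • Currently targets tool acquisition under MCP only and may not apply to other protocols or to agents that do not use tools.
  • Does not discuss robustness under adversarial attack, for example an agent deliberately bypassing the filter.

Relevance To Keywords:

  • World Models: the paper's core uses a world model for look-ahead reasoning, predicting state transitions and classifying safety; highly relevant.
  • Reinforcement Learning: the defense policy is trained with RL and dual verifiable rewards, which falls under model-based RL.
  • Representation Learning: the world model must learn useful representations of environment states, but the paper does not discuss representation-learning methods in depth.
  • Native multimodal LLMs / unified multimodal understanding and generation: the paper targets LLM agents and does not involve multimodality; low relevance.
  • Post-training: SafeMCP's training belongs to the post-training stage, but the paper does not emphasize this concept.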
Score: 33.0 / 27.8
Authors: Tianjiao Li, Kai Zhao, Xiang Li, Yang Liu, Huyang Sun
Published: 2026-06-01
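TL;DR: The paper proposes the MEDEA architecture, which uses Social Chain-of-Thought and reinforcement learning to assess user-generated content by community resonance, going beyond conventional visual quality assessment.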
摘要翻译

传统视频质量评估 (VQA) 狭隘地专注于美学保真度,却忽略了定义用户生成内容 (UGC) 质量的复杂社会动态。在此工作中,我们提出了一种范式转变,即从以信号为中心的指标转向以人为中心的共鸣评估。我们引入了 CASTER(社区感知的社会文本参与与共鸣评估),这是一个新任务,旨在评估某项用户生成内容 (UGC) 是否基于其多模态属性而非仅凭视觉质量实现了积极的社区共鸣。为此,我们提出了 MEDEA(多模态参与驱动评估架构),该架构引入了一种新颖的社会思维链 (Social-CoT) 机制。与传统逻辑思维链 (CoT) 不同,社会思维链 (Social-CoT) 执行多模态视角采择,实例化多样化的观众角色,以模拟集体认知和情感反应(即“社区心智”),随后再得出质量判断。MEDEA 采用两阶段方法进行训练,包含监督微调以及结合社会对齐奖励的过程监督强化学习,以确保推理路径扎根于真实的人类社会认知。为支持该任务,我们发布了 CASTER-Bench,这是一个涵盖多样化用户生成内容 (UGC) 类别的全面人工标注基准。实验表明,MEDEA 在 CASTER-Bench 上显著优于最先进基线,同时提供可解释且富有同理心的推理路径,这些路径与真实的社区反馈相一致。

Abstract

Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the "community mind") before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 5.0/10 7.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 2.0/10 3.0

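Scoring rationale: The core contribution is a multimodal assessment architecture for user-generated content (MultiModal: high relevance) that applies multimodal LLM reasoning (MLLM: moderate relevance). RL is used in training, but the emphasis is on process supervision and social alignment rather than typical model-based RL (model-based RL: low relevance). The paper does not address unified model architectures, tokenizer design, or world-model construction (low relevance). A visual encoder is implicitly present as a multimodal component but is not a core contribution. The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied. The weighted total of 33.0 exceeds the dynamic pass threshold of 27.8.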

Keywords

Community-Aware Assessment, Social Textual Engagement, User-Generated Content, Multimodal Engagement-Driven Evaluation, Social Chain-of-Thought, Reinforcement Learning, Social Alignment Reward

Detailed Analysis

Chinese Title: 社区感知的社会文本参与与共鸣评估:一种以人为中心的用户生成内容评价视角

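Summary: The paper proposes a paradigm shift from signal-centric metrics to human-centric resonance assessment. Traditional Video Quality Assessment (VQA) attends only to aesthetic fidelity and overlooks the complex social dynamics that define quality in User-Generated Content (UGC). The authors introduce the CASTER task (Community-Aware Assessment of Social Textual Engagement and Resonance), which evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address the task they propose MEDEA (Multimodal Engagement-Driven Evaluation Architecture), built around a Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (the "community mind") before deriving a quality judgment. MEDEA is trained in two stages, supervised fine-tuning followed by process-supervised RL with a Social Alignment Reward, so that reasoning paths are grounded in authentic human social cognition. To support the task the authors release CASTER-Bench, a human-annotated benchmark covering diverse UGC categories. Experiments show that MEDEA significantly outperforms existing baselines on CASTER-Bench while producing interpretable, empathetic reasoning paths that agree with real community feedback.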

Innovations:

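  • Proposes the CASTER task, redefining UGC quality assessment as recognizing community resonance rather than traditional visual fidelity.
  • Proposes the Social-CoT mechanism, which performs multimodal perspective-taking by simulating diverse viewer personas and their emotional reaction paths, emulating the "community mind".
  • Designs the MEDEA architecture, combining supervised fine-tuning with process-supervised RL and a Social Alignment Reward that keeps reasoning paths consistent with real social cognition.
  • Releases the CASTER-Bench benchmark, containing long videos (442 seconds on average) and multimodal information (frames, cover, title, tags, ASR, and so on), with a focus on narrative coherence, information density, and sustained engagement.
  • Shows experimentally that MEDEA significantly outperforms traditional VQA and multimodal baselines on community-resonance assessment while providing interpretable reasoning paths.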

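Methodology: The paper uses a two-stage training method: supervised fine-tuning (SFT) on large-scale pseudo-labeled data, followed by process-supervised RL with the Social Alignment Reward. In the RL stage the model generates Social-CoT reasoning paths, and the reward function ensures that they reflect real human social cognition. At evaluation time MEDEA receives multimodal input (video frames, cover, title, tags, category, ASR transcript), first simulates the comment reactions of several viewer personas, and then aggregates them into a final quality judgment. The benchmark is built by stratified random sampling and annotated by 10 professional content-operations experts following a human-centric scoring rubric.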

Key Results:

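  • MEDEA significantly outperforms traditional VQA methods (such as VSFA, DOVER, MaxVQA) and multimodal baselines (such as Q-Align, LMM-VQA, FineVQ) on CASTER-Bench.
  • Positive user comments correlate strongly with expert judgments, whereas traditional VQA and vision-centric models perform poorly.
  • The Social-CoT reasoning paths generated by MEDEA are interpretable, simulate the emotional reactions of different viewers, and agree with real community feedback.
  • Quality labels in CASTER-Bench are distributed as excellent 10.6%, good 17.0%, fair 38.6%, and poor 33.7%, reflecting the distribution on a real platform.

Tech Stack:

  • Social Chain-of-Thought (Social-CoT) mechanism
  • Supervised Fine-Tuning (SFT)
  • Process-Supervised Reinforcement Learning
  • Social Alignment Reward
  • Multimodal encoders (video frames, cover, title, tags, ASR, and so on)
  • CLIP-based vision-language pretraining
  • Transformer architecture (for temporal modeling)
  • Stratified random sampling (data collection)
  • Expert annotation protocol (human-centric scoring rubric)

Strengths:

  • Brings social reasoning into UGC quality assessment, moving past the limits of traditional VQA.
  • The Social-CoT mechanism simulates the community mind and provides interpretable reasoning paths, which increases trust in the model.
  • CASTER-Bench targets long videos and real UGC scenarios, filling a gap in existing benchmarks.
  • The two-stage training strategy combines pseudo labels with expert annotation and makes effective use of data.
  • The experiments are comprehensive, compare against many baselines, and validate the method.

Limitations:

  • CASTER-Bench is small (1485 samples), which may limit generalization.
  • The method depends on metadata such as ASR transcripts and tags, so performance may drop on UGC with missing metadata.
  • The number of viewer personas simulated by Social-CoT is limited and may not cover the full diversity of a community.
  • Training needs a large amount of pseudo-labeled data, whose quality may affect final performance.
  • Transfer across platforms or cultural contexts is not discussed.

Relevance To Keywords:

  • Unify Models: MEDEA integrates multimodal understanding and generation (Social-CoT generates simulated comments), reflecting the unified-model idea.
  • World Models: Social-CoT's simulation of the community mind can be seen as a simplified social world model that predicts viewer reactions.
  • Representation Learning: multimodal encoders learn content representations and multimodal fusion involves representations, but the paper does not pursue representation-learning innovations.
  • Model-Based RL: process-supervised RL is a model-free method, though the Social Alignment Reward can be viewed as a model-based reward design.
  • Native multimodal LLMs: MEDEA takes multimodal input but does not explicitly use a native multimodal LLM (such as LLaVA); it relies on pretrained models such as CLIP.
  • Unified multimodal understanding and generation: Social-CoT both understands (analyzes content) and generates (simulates comments), in line with this trend.
  • Reinforcement Learning: RL is used for post-training, so the keyword is relevant.
  • Post-training: the RL stage of the two-stage training is post-training and matches the keyword.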
Score: 33.0 / 27.8
Authors: Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani
Published: 2026-06-01
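TL;DR: The paper proposes Density-Aware Translation (DAT), which corrects for geometric density in the embedding space to remove spurious correlations in zero-shot VLM classification, improving both worst-group and average accuracy.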
摘要翻译

视觉 - 语言模型(VLMs),例如 CLIP,实现了强大的零样本(zero-shot)分类。然而,它们的预测仍然对虚假关联(spurious correlations)敏感,其中上下文线索主导了语义内容。早期解决方案通常依赖于微调(fine-tuning)或提示工程(prompt engineering),这要么削弱了预训练模型的优势,要么容易产生幻觉(hallucination)。在这项工作中,我们提出了密度感知翻译(DAT),该机制利用从组参考集导出的局部几何密度项来优化图像 - 文本相似度得分。我们的方法受以下现象的启发:CLIP 嵌入(embeddings)表现出模态间隙(modality gap),并位于特征空间中的一个各向异性壳(anisotropic shell)上:常见模式聚集在均值附近,而稀有模式则被推向外部。这种几何结构造成了不均匀的对齐,其中虚假关联被放大,而具有语义意义但稀有的线索却被边缘化(marginalised)。为了解决这一问题,我们采用一种基于嵌入密度的相对度量来重新缩放相似度,抑制弥散区域中的过度自信得分,同时保留密集且语义一致的匹配。在基准数据集上的实验结果表明,最坏组准确率和平均准确率均得到一致提升,凸显了密度感知翻译作为一种简单有效的校准机制(calibration mechanism),用于使用多模态模型进行可靠的零样本分类。

Abstract

Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 4.0/10 6.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 4.0/10 6.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 0.0/10 0.0

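Scoring rationale: The paper studies calibration of zero-shot classification in vision-language models (VLMs), using Density-Aware Translation to address spurious correlations. MultiModal is highly relevant because the object of study is inherently a multimodal model. Unify Models and Visual Encoder are moderately relevant: the work touches on unified multimodal representations and the encoder's embedding space but does not go into architecture design. MLLM is moderately relevant: the work belongs to the multimodal-model area but targets CLIP-style VLMs rather than typical MLLMs. Tokenizer has low relevance because tokenizer design is not involved. World Models and model-based RL are not relevant; the paper involves neither world models nor reinforcement learning.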

Keywords

Vision-Language Models, Zero-Shot Classification, Spurious Correlations, Embedding Density, Similarity Calibration, Multimodal Models, Density-Aware Translation

Detailed Analysis

Chinese Title: 零样本视觉语言模型中虚假相关性的密度感知平移

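Summary: The paper proposes Density-Aware Translation (DAT) to mitigate spurious correlations in zero-shot vision-language models such as CLIP. It finds that the CLIP embedding space has an anisotropic, ellipsoidal-shell structure: frequent, spuriously correlated samples cluster near the mean, while rare but meaningful samples lie in sparse regions, which biases cosine-similarity scores toward spurious patterns. DAT estimates local geometric density from reference sets and rescales image-text similarity, suppressing overconfident scores in diffuse regions while preserving dense, consistent matches. The theoretical analysis shows that DAT corrects the bias of cosine similarity under anisotropic embeddings, restoring a missing log-likelihood term and aligning the decision with the Bayes-optimal rule. Experiments on several benchmark datasets show consistent gains in worst-group and average accuracy without fine-tuning or prompt engineering, so zero-shot inference is retained.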

Innovations:

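  • Proposes DAT, a zero-shot mechanism that rescales image-text similarity using group reference densities, with no fine-tuning, prompt engineering, or test-time spurious-attribute labels.
  • Proves that DAT corrects the bias of cosine similarity under anisotropic embeddings, restoring the missing log-likelihood term and aligning with the Bayes-optimal decision rule.
  • Quantifies alignment bias between groups with a Tangent-Space Mahalanobis distance, revealing how CLIP's embedding geometry amplifies spurious correlations.
  • Proposes the DAT* variant, which infers group membership automatically when no explicit spurious-attribute annotation is available, keeping the zero-shot setting.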

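Methodology: The method first builds a balanced reference set for each group (by sampling or by zero-shot attribute inference) and extracts frozen CLIP embeddings for images and text. For each test sample it computes the cosine similarity to every prompt and uses the reference sets to estimate the local geometric density of each group (through a relative density ratio). The original similarity is then multiplied by a density-aware weight, which suppresses scores in sparse regions and preserves scores in dense regions. The theoretical part models group distributions with Kent distributions, derives the gap between cosine similarity and the Bayes-optimal decision, and proves that DAT corrects it.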
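
A rough sketch of the rescaling step follows. It is not the paper's estimator. It assumes one prompt per group, uses the mean cosine similarity to the k nearest reference embeddings as a stand-in density proxy, and turns the proxies into relative weights with an exponential and a temperature `tau`. All of these choices, and all function names, are illustrative assumptions.

```python
from typing import List, Tuple

import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def knn_density(query: np.ndarray, reference: np.ndarray, k: int = 3) -> float:
    """Density proxy (assumption): mean cosine similarity to the k nearest references."""
    sims = reference @ query
    k = min(k, sims.shape[0])
    return float(np.mean(np.sort(sims)[-k:]))


def density_aware_scores(image_emb: np.ndarray, prompt_embs: np.ndarray,
                         group_refs: List[np.ndarray], k: int = 3,
                         tau: float = 1.0) -> Tuple[np.ndarray, np.ndarray]:
    """Rescale per-group cosine similarities by a relative density weight."""
    image_emb = l2_normalize(image_emb)
    cos = l2_normalize(prompt_embs) @ image_emb  # one similarity per group prompt
    dens = np.array([knn_density(image_emb, l2_normalize(r), k) for r in group_refs])
    pos = np.exp(dens / tau)     # keep the density proxy positive
    weights = pos / pos.mean()   # relative to the average group density
    return cos * weights, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=8)
    prompts = rng.normal(size=(2, 8))
    refs = [rng.normal(size=(5, 8)), rng.normal(size=(5, 8))]
    scores, weights = density_aware_scores(img, prompts, refs)
    print(scores, weights)
```

With real CLIP features, `image_emb`, `prompt_embs`, and `group_refs` would come from the frozen encoders; the random arrays in the usage block only exercise the shapes.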

Key Results:

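  • On benchmarks such as Waterbirds and CelebA, DAT consistently improves worst-group and average accuracy.
  • Compared with existing zero-shot debiasing methods (such as Orth-Cali and TIE), DAT achieves better or comparable performance across several VLMs (CLIP ViT-B/32, ViT-L/14, and others).
  • The theoretical analysis confirms that DAT restores the anisotropic log-likelihood term, bringing the decision boundary close to the Bayes optimum.
  • DAT* still improves robustness when no spurious-attribute labels are available, approaching the labeled version.

Tech Stack:

  • CLIP (ViT-B/32, ViT-L/14, and others) as the base VLM
  • Kent (Fisher-Bingham) distribution for modeling group embedding distributions
  • Tangent-Space Mahalanobis distance (TMD) for quantifying alignment between groups
  • Relative density-ratio estimation (from reference sets)
  • Cosine-similarity rescaling mechanism

Strengths:

  • Fully zero-shot: no fine-tuning and no access to model parameters, only a small balanced reference set.
  • Theoretically grounded: explains the origin of spurious correlations from a geometric viewpoint and provides a correction.
  • Simple and effective, with consistent improvements across several benchmarks and VLMs.
  • Provides the DAT* variant for practical settings without spurious-attribute labels.

Limitations:

  • Depends on the quality of the reference set; an unbalanced or noisy reference set may degrade density estimation.
  • Zero-shot attribute inference in DAT* may introduce additional error, especially when attribute descriptions are inaccurate.
  • Validated on classification only; applicability to retrieval, generation, and other tasks is not discussed.
  • A prompt must be constructed for every group, which becomes costly when the number of classes or attributes is large.

Relevance To Keywords:

  • Representation Learning: the paper analyzes the geometry of the CLIP embedding space in depth and proposes density-aware rescaling, a calibration method within representation learning.
  • Unified multimodal understanding and generation: the object of study is a zero-shot VLM (such as CLIP), which belongs to multimodal understanding; generation is not involved.
  • World Models: not directly involved; correcting spurious correlations may help a model capture real causal relations, so the link is indirect.
  • Post-training: DAT is a post-hoc calibration method that requires no fine-tuning, a lightweight adjustment in the post-training spirit.
  • Reinforcement Learning: not involved.
  • Model-Based RL: not relevant.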
Score: 33.0 / 27.8
Authors: Abhishek Aich, Sparsh Garg, Vijay Kumar BG, Turgun Yusuf Kashgari, Manmohan Chandraker
Published: 2026-06-01
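TL;DR: The paper proposes the SliceScorer and SliceNav framework for discovering interpretable coverage gaps in driving vision-language models and making safety verification more efficient.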
摘要翻译

驾驶视觉 - 语言模型(VLMs)必须准确理解由操作设计域(ODDs)定义的多样化条件下的场景,然而验证仍然稀疏:许多切片缺失,导致经验性故障率不可靠。我们提出 SliceScorer,这是一种用于缺失切片推荐的确定性评分规则,它结合了(i)基于曝光的覆盖先验,用于优先处理稀有且测试不足的区域,以及(ii)邻居失败先验,该先验从类似的测试条件传播风险。SliceScorer 刻意保持简单——具有可解释性、可审计性和保守性——这些属性对于安全关键验证至关重要。针对超出声明操作设计域(ODD)的压力测试,我们将 SliceScorer 嵌入 SliceNav 中,这是一个由大语言模型编排的验证管道。在该管道中,模型解释开发者查询以选择相关操作(分类、评分、采集、评估)及词汇扩展,从而构建验证工作流程,同时保持所有评分的确定性和可审计性。在三个驾驶视觉 - 语言模型(WiseAD、DriveMM、Cosmos-Reason2-2B)上的实验表明,SliceNav 比先前的切片发现方法更有效地揭示高风险覆盖缺口,同时在条件空间中保持多样化的推荐。消融实验证实两个评分组件均有贡献,而定性分析展示了从开发者查询到针对性评估的端到端工作流程。

Abstract

Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior to prioritize rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple - interpretable, auditable, and conservative - properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 7.0/10 10.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 1.0/10 1.5

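Scoring rationale: The paper focuses on a testing and verification framework for driving VLMs and is highly relevant to MultiModal and MLLM. It does not address unified model architectures, tokenizers, visual encoders, world models, or reinforcement learning, so those relevance scores are low. No designated expert appears among the authors, so no bonus is applied. The weighted total of 33.0 exceeds the dynamic pass threshold of 27.8.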

Keywords

Driving VLMs, Coverage Gap Discovery, SliceScorer, SliceNav, Safety-critical Validation, Operational Design Domains, LLM-orchestrated Verification

Detailed Analysis

Chinese Title: 下一步测试什么:驾驶视觉语言模型中可解释的覆盖缺口发现

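Summary: The paper addresses the verification of driving vision-language models (VLMs) across diverse Operational Design Domain (ODD) conditions and proposes an interpretable method for recommending missing scenarios. Existing methods only diagnose failure modes in data that has already been evaluated and cannot guide what to test next. The authors propose SliceScorer, a deterministic scoring rule that combines an exposure-based coverage prior (prioritizing rare, under-tested regions) with a neighbor-failure prior (propagating risk from similar tested conditions). They further build SliceNav, an LLM-orchestrated verification pipeline that selects operators (triage, scoring, acquisition, evaluation) from a developer's natural-language query and composes them into a verification workflow while keeping scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while keeping recommendations diverse. Ablations confirm the contribution of both scoring components, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.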

Innovations:

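  • Formalizes missing-scenario recommendation for driving-VLM verification: given partial test coverage, rank untested conditions by predicted failure risk.
  • Proposes SliceScorer, a deterministic scoring rule combining a rarity prior and a neighbor-failure prior; it is interpretable, auditable, and conservative.
  • Builds SliceNav, an LLM-orchestrated verification pipeline that supports flexible, query-driven verification workflows while keeping the ranking deterministic.
  • Validates the method on three driving VLMs; compared with baselines such as SliceFinder, it reduces missed-failure risk by 65%-71% on average at larger test budgets while maintaining high diversity.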

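Methodology: The paper first defines the ODD condition space and the notion of a slice, dividing validation data into observed slices and missing slices. It then designs the SliceScorer scoring function from two parts: an exposure rarity prior (joint exposure estimated from marginal frequencies, complemented and adjusted by a power) and a neighbor-failure prior (risk propagated from observed high-error slices through embedding similarity). Finally, SliceScorer is embedded in the SliceNav framework, where an LLM selects operators (triage, scoring, acquisition, evaluation) from the user query and composes them into an end-to-end verification workflow. All scoring remains deterministic; the LLM is used only to interpret queries and orchestrate the flow.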
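
The two priors can be sketched on a toy two-dimensional ODD. The sketch follows the ingredients named above (Laplace-smoothed marginals, product independence, complement with a power) but makes two assumptions of its own: similarity is the fraction of shared attribute values rather than a text-embedding similarity, and the two priors are combined by a convex weight `alpha`, which this report does not specify.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

Slice = Tuple[str, ...]  # one value per ODD dimension, e.g. (weather, time of day)


def marginal_freqs(observed: List[Slice], vocab: List[List[str]]) -> List[Dict[str, float]]:
    """Laplace-smoothed marginal frequency of every value in every dimension."""
    n = len(observed)
    freqs = []
    for d, values in enumerate(vocab):
        counts = Counter(s[d] for s in observed)
        freqs.append({v: (counts[v] + 1) / (n + len(values)) for v in values})
    return freqs


def rarity_prior(s: Slice, freqs: List[Dict[str, float]], gamma: float = 1.0) -> float:
    """(1 - joint exposure) ** gamma, with exposure as a product of marginals."""
    exposure = 1.0
    for d, v in enumerate(s):
        exposure *= freqs[d][v]
    return (1.0 - exposure) ** gamma


def overlap_similarity(a: Slice, b: Slice) -> float:
    """Stand-in for embedding similarity: fraction of matching attribute values."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def neighbor_failure_prior(s: Slice, error_rates: Dict[Slice, float]) -> float:
    """Similarity-weighted mean error rate of the observed slices."""
    num = den = 0.0
    for other, err in error_rates.items():
        w = overlap_similarity(s, other)
        num += w * err
        den += w
    return num / den if den > 0 else 0.0


def rank_missing(vocab: List[List[str]], observed: List[Slice],
                 error_rates: Dict[Slice, float], alpha: float = 0.5,
                 gamma: float = 1.0) -> List[Tuple[Slice, float]]:
    """Deterministically rank untested slices by the combined score (alpha is assumed)."""
    freqs = marginal_freqs(observed, vocab)
    missing = [s for s in product(*vocab) if s not in error_rates]
    scored = [(s, alpha * rarity_prior(s, freqs, gamma)
               + (1 - alpha) * neighbor_failure_prior(s, error_rates)) for s in missing]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    vocab = [["clear", "rain"], ["day", "night"]]
    observed = [("clear", "day")] * 8 + [("rain", "day")] * 2
    errors = {("clear", "day"): 0.05, ("rain", "day"): 0.40}
    for s, score in rank_missing(vocab, observed, errors):
        print(s, round(score, 4))
```

Because every step is a closed-form count or average, the same inputs always give the same ranking, which is the auditability property the methodology stresses.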

Key Results:

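  • At a test budget of K=5000, SliceScorer reduces missed-failure risk relative to SliceFinder by 65% on average on DriveMM and by 71% on Cosmos.
  • The recommendation lists are highly diverse across the condition space and avoid clustering in a single region.
  • Ablations confirm that both the rarity prior and the neighbor-failure prior contribute to performance.
  • Qualitative analysis demonstrates the full workflow from developer query to ODD extension, data acquisition, and VLM evaluation.

Tech Stack:

  • SliceScorer: exposure estimation from marginal frequencies (Laplace smoothing), a product independence assumption, and a power-adjusted rarity prior; neighbor-failure propagation based on embedding similarity (using a text embedding model).
  • SliceNav: an LLM as orchestrator, with the operators Triage, In-ODD scoring, Out-of-ODD scoring, Acquisition, and Evaluation.
  • Baselines: SliceFinder, SliceLine.
  • Evaluated models: WiseAD, DriveMM, Cosmos-Reason2-2B.
  • Metrics: risk capture (missed-failure risk) and diversity (coverage of the condition space).

Strengths:

  • The method is simple, interpretable, and auditable, which suits safety-critical validation.
  • Poses the missing-scenario recommendation problem, filling the gap left by slice-discovery methods that only diagnose already evaluated data.
  • The LLM orchestration framework is flexible and supports verification driven by natural-language queries while keeping the core scoring deterministic.
  • The experiments validate effectiveness and diversity on several VLMs, with clear ablations.

Limitations:

  • Exposure estimation assumes independence across dimensions, which may be inaccurate; the authors acknowledge it as a conservative approximation.
  • The neighbor-failure prior depends on embedding similarity, so risk may be propagated wrongly between slices that are semantically close but differ in actual risk.
  • The method depends on the quality of the seed validation table and on sample thresholds; statistical significance is limited when data is sparse.
  • Testing cost (such as the difficulty of acquiring data) is not factored into the ranking.
  • Validated only on driving VLMs; generalization to other domains needs further study.

Relevance To Keywords:

  • The paper concerns verification of multimodal models (driving VLMs), so it relates to "native multimodal LLMs" and "unified multimodal understanding and generation", but the focus is the verification method rather than the model.
  • The ODD condition space can be viewed as a discretized representation of a world model; discovering coverage gaps evaluates model behavior under unseen conditions, which relates to the world-model concept.
  • Representation learning appears in the neighbor-failure prior through embedding similarity, but it is not central.
  • Reinforcement learning (model-based RL) and post-training are only weakly related; the paper is about verification and test selection rather than training or optimization.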
Score: 33.0 / 27.8
Authors: Sicheng Xu, Yu Deng, Shoukang Hu, Yichuan Wang, Yizhong Zhang, Zhan Chen, Jiaolong Yang, Baining Guo
Published: 2026-06-01
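TL;DR: The paper proposes a framework based on a causal video VAE with reference guidance that generates high-fidelity talking portrait video in real time.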
摘要翻译

视频扩散模型在肖像视频生成方面取得了显著进展,然而其高昂的计算需求限制了它们在交互式应用中的使用。本文提出了一种基于语音音频和参考图像的可流式传输说话肖像视频生成框架。该框架专为流式传输场景精心设计,包含一个用于深层潜在压缩的因果视频 VAE 和一个自回归潜在去噪模型。我们的因果 VAE 整合了可变数量的参考图像作为引导,使网络能够关注动态信息而非静态外观,从而提高了压缩效率和重建质量。此外,我们扩展了残差自编码范式,以改进 VAE 中的时空因果性处理。生成器基于 Rectified Flow Transformer 架构,并以块状自回归的方式生成视频潜在表示。我们的方法实现了高质量说话肖像视频的实时生成,速度显著快于基线模型。此外,综合实验表明,该方法在真实性、生动性和视频质量方面与这些大型模型相当,甚至优于它们。

Abstract

Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 0.0/10 0.0

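Scoring rationale: The paper studies real-time talking portrait video generation using a causal VAE and a diffusion-style generator. MultiModal is highly relevant (audio and reference-image conditioning for video output). Visual Encoder is moderately relevant (reference-image encoding). Tokenizer and World Models have low relevance (latent-space modeling). Unify Models, MLLM, and model-based RL are essentially unrelated (no language model, reinforcement learning, or model unification). No designated expert was found among the authors.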

Keywords

Talking Portrait Video, Streamable Generation, Causal Video VAE, Reference-Guided, Real-time Generation, Audio-Video Conditioning, Rectified Flow Transformer

Detailed Analysis

Chinese Title: 基于参考引导深度压缩变分自编码器的可流式说话人像视频实时生成

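Summary: The paper presents a framework for streamable, real-time talking portrait video generation, addressing the high computational cost that keeps existing video diffusion models out of interactive applications. The framework consists of a causal video VAE for deep latent compression and an autoregressive latent denoising model conditioned on audio and reference images. The causal VAE feeds a variable number of reference images to the decoder as guidance, so the network focuses on dynamic information rather than static appearance, which improves compression efficiency and reconstruction quality; it also extends the residual auto-encoding paradigm to improve spatial-temporal causality handling. The generator uses a Rectified Flow Transformer architecture and produces video latents in a blockwise autoregressive manner, supporting efficient streaming inference through KV caching. Experiments show that the method generates 512×512 video at 42 FPS on a single GPU, more than 25 times faster than existing diffusion models, while matching or exceeding large models in realism, vividness, and video quality.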

Innovations:

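  • Proposes a streamable talking portrait generation framework that combines a causal video VAE with a blockwise autoregressive denoising model, supporting continuous real-time generation of videos of arbitrary length.
  • Proposes a reference-guided video VAE that injects one or more reference portrait images into the decoder so the network focuses on dynamic information, substantially improving compression efficiency and reconstruction fidelity.
  • Achieves a 768× video compression ratio, 10-15 times higher than the VAEs of popular video diffusion models, with generation at 42 FPS, more than 25 times faster than existing diffusion models.
  • Extends the residual auto-encoding paradigm to a causal video VAE, using spatially and temporally separated residual coding to improve causality handling and further raise reconstruction quality.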

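Methodology: The method has two stages. First, a causal video VAE is trained. Its encoder consists of E1 (spatial-temporal downsampling) and E2 (spatial downsampling), and its decoder consists of D1 (spatial upsampling) and D2 (spatial-temporal upsampling); a Transformer-based fusion network Dref is inserted between D1 and D2 and injects, through cross-attention, the features obtained by passing the reference images through E1. During VAE training the reference frames are sampled at random from the input video. Second, a Rectified Flow Transformer generator is trained with a blockwise causal attention mechanism to produce latents autoregressively (with teacher forcing), conditioned on audio features and reference-image features. The framework is trained on a talking-portrait video corpus.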
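
Blockwise causal attention can be pictured with a small mask. In the sketch below, tokens are grouped into fixed-size blocks; a token may attend to every token in its own block and in all earlier blocks, but not to later blocks. The block size and token count are arbitrary, and how the paper lays out conditioning tokens is not covered here.

```python
import numpy as np


def blockwise_causal_mask(num_tokens: int, block_size: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if query i may attend to key j.

    Attention is full within a block and causal across blocks.
    """
    block_id = np.arange(num_tokens) // block_size
    return block_id[:, None] >= block_id[None, :]


if __name__ == "__main__":
    # Two blocks of two tokens each.
    print(blockwise_causal_mask(4, 2).astype(int))
```

Because later blocks never influence earlier ones, the keys and values of finished blocks can be cached and reused, which is what makes KV-cached streaming inference possible.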

Key Results:

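  • Video generation reaches 42 FPS at 512×512 resolution, more than 25 times faster than existing diffusion models (such as VASA-1 and MuseTalk).
  • The video compression ratio reaches 768×, 10-15 times higher than the VAEs of models such as Stable Video Diffusion.
  • On audio-lip synchronization, visual quality, and motion realism the method matches or exceeds large models (such as VASA-1 and MuseTalk).
  • Supports continuous generation of videos of arbitrary length without temporal discontinuities.

Tech Stack:

  • Causal convolutional neural networks
  • RMSNorm normalization layers
  • Transformer (self-attention and cross-attention)
  • Rectified Flow
  • Autoregressive modeling (blockwise causal attention)
  • Key-Value (KV) caching
  • Teacher-forcing training strategy
  • Residual Auto-Encoding
  • Variational autoencoder (VAE)

Strengths:

  • Delivers real-time, streamable talking portrait generation that meets the needs of interactive applications.
  • Reference-image guidance substantially improves VAE compression efficiency and reconstruction quality without extra computational overhead.
  • The high compression ratio (768×) greatly shortens the latent token sequence and lowers generation latency.
  • Generated videos contain rich dynamics (lip motion, expressions, head and torso movement, hair, lighting) and look highly realistic.
  • Compared with large diffusion models, the speed advantage is clear and quality is not sacrificed.

Limitations:

  • Depends on reference-image quality; results may degrade when the reference differs strongly from the generated video (for example in lighting or viewing angle).
  • Mainly targets single-speaker scenes; generalization to multiple speakers or complex backgrounds is not fully validated.
  • Sensitive to audio quality; noisy or low-quality audio may reduce lip synchronization.
  • Deployment on low-end hardware or mobile devices is not discussed.

Relevance To Keywords:

  • Native multimodal LLMs: the paper handles three modalities (audio, image, video) and performs multimodal conditional generation with a VAE and a Transformer.
  • Unified multimodal understanding and generation: the model interprets audio and reference-image content and generates coherent video, combining understanding with generation.
  • Representation Learning: the causal VAE learns compact latent representations, and reference guidance focuses them on dynamic information.
  • World Models: video generation can be viewed as modeling the dynamics of the physical world, especially human motion and lighting, which relates to the world-model concept.
  • Reinforcement Learning: not involved.
  • Post-training: not a focus; rectified flow is the generator's training objective rather than a post-training method.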
Score: 31.5 / 27.8
Authors: Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci
Published: 2026-06-01
TL;DR: This paper proposes a structured benchmark for text-guided anomaly detection and reveals that current multimodal models rely superficially on language rather than truly conditioning decisions based on textual instructions.
摘要翻译

工业异常检测历史上一直是一项单模态任务。近期出现的多模态视觉 - 语言模型(Vision-Language Models)已开发出能够同时接受图像和文本输入的系统,并被宣称能够实现文本引导的零样本及少样本检测。然而,这些方法沿用了源自单模态基准的评估协议,该协议保持文本条件恒定,因而无法衡量语言是否对决策产生了调节作用;报告的性能提升究竟源于文本引导还是强大的预训练视觉特征,这一问题仍未得到解决。我们提出了文本引导异常检测(Text-Guided Anomaly Detection, TGAD),这是一个结构化基准,旨在通过三个场景逐步提升语言的功能角色:一是基于 MVTec AD 的受控提示敏感性设置;二是 MVTec AD 的组件标记扩展版本,要求模型将评估限制在指定部件上;三是新的组装面板数据集(Assembled Panel Dataset, APD),这是一个真实的工业场景,要求模型同时具备缺陷类型和组件位置知识。我们在每种范式下评估了一个代表性模型:生成式大型视觉 - 语言模型、无训练判别式模型以及嵌入自适应判别式模型。在所有三个场景中,文本界面仅对决策产生了表面性的调节作用:提示内容会被模型吸收,除非移除对象名词(生成式模型的 I-AUROC 从 97.4 降至 82.6);一旦将指定部件之外的缺陷视为正常,组件级别的指令便无法约束决策(从 90.3 降至 66.3);而当两者在 APD 上结合时,图像级判别能力崩溃至低于 MVTec 水平,其中一例甚至低于随机水平(71.2, 50.5, 31.5)。这些结果表明,标准基准高估了当前多模态异常检测系统的文本引导能力,而此类协议是实现可通过语言可靠控制并用于工业部署的模型之先决条件。

Abstract

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 6.0/10 9.0
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper evaluates text-guided anomaly detection using multimodal models, making MultiModal and MLLM relevant. Visual Encoder is tangentially relevant. Unify Models, Tokenizer, World Models, and model-based RL are largely or wholly irrelevant, as the paper focuses on benchmarking language conditioning in anomaly detection, not model architecture or reinforcement learning.

Keywords

Text-Guided Anomaly Detection, Multimodal Vision-Language Models, Evaluation Benchmark, Language Conditioning, Industrial Anomaly Detection, Prompt Sensitivity, MVTec AD

Detailed Analysis

Chinese Title: 文本引导异常检测的结构化基准:当语言停止影响决策时

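Summary: Current multimodal anomaly detection methods perform well on standard benchmarks, yet their actual text-guidance ability is in doubt. The paper proposes TGAD (Text-Guided Anomaly Detection), a structured benchmark that uses three scenarios with a progressively larger functional role for language to test whether textual input really conditions the detection decision. Scenario 1 varies the prompt in a controlled way on MVTec AD and tests whether model output moves with the language. Scenario 2 introduces component-level annotation and requires the model to detect defects only in the instructed part, ignoring anomalies elsewhere. Scenario 3 introduces the new Assembled Panel Dataset (APD), which requires both defect-type and component-location knowledge. The authors evaluate three representative models (the generative large vision-language model AnomalyGPT, the training-free discriminative LogSAD, and the embedding-adaptive discriminative AA-CLIP) and find that the textual interface conditions the decision only superficially: prompt content changes results markedly only when the object noun is removed (object-anchor collapse); component-level instructions fail to constrain the global decision (localization-decision dissociation); and on APD image-level detection drops sharply, in one case below chance. The results indicate that current benchmarks overstate the text-guidance ability of multimodal anomaly detection systems and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language.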

Innovations:

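  • Proposes the TGAD structured benchmark, whose three scenarios progressively increase the functional role of language and systematically assess the real capability of text-guided anomaly detection.
  • Releases a component-annotated extension of MVTec AD and the new Assembled Panel Dataset (APD), with pixel-level annotation and component labels.
  • Identifies "object-anchor collapse": models are sensitive to prompt changes only when the object noun is removed, which indicates that text acts only as a weak category prior.
  • Reveals "localization-decision dissociation" as a diagnostic signature: under instruction pressure, image-level detection collapses while the region-level localization metric (AUPRO) drops less.
  • Shows, through a full evaluation of three representative models (generative, training-free discriminative, embedding-adaptive discriminative), that the text-guidance ability of current multimodal anomaly detection methods is overstated.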

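Methodology: The paper uses a structured benchmarking approach with three scenarios. Scenario 1 fixes the images and annotations on MVTec AD and systematically varies the text prompt (from a full description, to the object noun only, to a content-free query), measuring the change in output. Scenario 2 annotates component labels for 9 object classes of MVTec AD and defines two evaluation settings (EV1, where only defect-free images count as normal, and EV2, where images with unrelated defects also count as normal) to test whether the model attends only to the instructed component. Scenario 3 uses the newly collected APD dataset (assembled electronic boards), which requires identifying both defect type and component location. Three representative models are evaluated: AnomalyGPT (generative), LogSAD (training-free discriminative, extended with component-level attention), and AA-CLIP (embedding-adaptive discriminative), using image-level AUROC, pixel-level AUROC, and AUPRO.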
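
The effect of the EV1 versus EV2 labeling on image-level AUROC can be shown with a toy example. The sketch assumes that EV1 drops images whose only defect lies outside the instructed part and that EV2 counts them as normal; the first is an interpretation of the description above rather than the paper's stated protocol. Scores and part names are made up.

```python
from typing import List, Optional, Tuple

# (anomaly score, defective part or None, instructed part)
Sample = Tuple[float, Optional[str], str]


def auroc(scores: List[float], labels: List[int]) -> float:
    """Rank-based AUROC: probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(samples: List[Sample], setting: str) -> float:
    """Image-level AUROC under the EV1 or EV2 labeling (see the assumptions above)."""
    scores: List[float] = []
    labels: List[int] = []
    for score, defect, instructed in samples:
        if defect is None:
            label = 0                  # defect-free image: normal in both settings
        elif defect == instructed:
            label = 1                  # defect in the instructed part: anomalous
        elif setting == "EV2":
            label = 0                  # off-target defect counts as normal
        else:
            continue                   # EV1 (assumed): off-target defects are left out
        scores.append(score)
        labels.append(label)
    return auroc(scores, labels)


if __name__ == "__main__":
    # A detector that ignores the instruction scores every defect highly.
    samples = [(0.9, "screw", "screw"), (0.7, "screw", "screw"),
               (0.8, "cable", "screw"), (0.2, None, "screw"), (0.1, None, "screw")]
    print(evaluate(samples, "EV1"), evaluate(samples, "EV2"))
```

A model that flags any defect regardless of the instruction looks perfect under EV1 and loses ground under EV2, which is the kind of gap the benchmark is designed to expose.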

Key Results:

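  • In Scenario 1, the I-AUROC of the generative model AnomalyGPT drops from 97.4 to 82.6 once the object noun is removed, while the other models change less, indicating that text acts only as a weak category prior.
  • In Scenario 2, component-level instructions fail to constrain the global decision: LogSAD's I-AUROC drops from 90.3 to 66.3 under the EV2 setting and remains fragile even with the component-attention extension.
  • On APD, image-level detection drops sharply: 71.2 for AnomalyGPT, 50.5 for LogSAD, and 31.5 for AA-CLIP (below chance).
  • The localization metric AUPRO drops comparatively little, producing the "localization-decision dissociation" pattern.
  • All models perform well under the standard MVTec AD evaluation, but the TGAD scenarios expose how superficial their text guidance is.

Tech Stack:

  • CLIP (Contrastive Language-Image Pretraining)
  • AnomalyGPT (generative large vision-language model)
  • LogSAD (training-free discriminative anomaly detection)
  • AA-CLIP (embedding-adaptive discriminative anomaly detection)
  • MVTec AD dataset
  • MVTec LOCO dataset (logical anomalies)
  • VisA dataset (cross-domain adaptation)
  • AUROC (area under the receiver operating characteristic curve)
  • AUPRO (area under the per-region overlap curve)
  • Component-level attention extension (an architectural modification added to LogSAD)

Strengths:

  • Proposes a systematic structured benchmark that fills the gap left by evaluation protocols unable to measure text-guidance ability.
  • Releases two new datasets (the component-annotated MVTec AD extension and APD) to support follow-up research.
  • Identifies important diagnostic phenomena (object-anchor collapse, localization-decision dissociation) that reveal fundamental limits of multimodal anomaly detection methods.
  • The evaluation covers the three main paradigms, so the conclusions are representative.
  • The experimental design is rigorous: it raises the functional role of language step by step and cleanly separates the contribution of visual priors from that of text guidance.

Limitations:

  • Only three representative models are evaluated, which may not cover all multimodal anomaly detection methods.
  • The component annotation covers only 9 object classes of MVTec AD and excludes texture classes.
  • APD may be limited in size and covers only electronic board assembly, so generality needs further validation.
  • The generative paradigm is evaluated with AnomalyGPT only; other LVLMs (such as GPT-4V) are not tested.
  • The paper identifies the problem but does not explore in depth how to improve models so that text guidance becomes real.

Relevance To Keywords:

  • Native multimodal LLMs: AnomalyGPT, one of the evaluated models, is a generative large vision-language model; directly relevant.
  • Unified multimodal understanding and generation: AnomalyGPT performs anomaly detection and natural-language explanation together, combining understanding with generation.
  • Representation Learning: AA-CLIP and LogSAD rely on CLIP's vision-language representations, and the paper examines the role of the text prior within them.
  • World Models: not directly involved, although anomaly detection can be viewed as modeling the normal state of the world.
  • Reinforcement Learning: not involved.
  • Post-training: the models are mostly used directly after pretraining or with light fine-tuning; related to post-training but not central.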
Score: 31.5 / 27.8
Authors: Xu Li, Zedong Fu, Xinyi Li, Xun Han
Published: 2026-06-01
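TL;DR: TrafficRAG is a multimodal retrieval-augmented framework that combines a vision-language model with a legal knowledge base to improve the accuracy and reliability of traffic accident liability determination.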
摘要翻译

交通事故责任分析是智能交通和法律辅助领域一项关键且具有挑战性的任务。现有方法往往存在效率低下、判断主观以及分析结果不一致的问题。与此同时,大型语言模型(LLM)受限于嘈杂的视频输入和不足的领域法律知识。为了解决这些问题,本文提出了 TrafficRAG,一种用于自动化交通事故分析和报告生成的多模态检索增强框架。具体而言,所提出的框架首先采用视觉 - 语言模型(VLM)生成事故场景的结构化文本描述,这些描述作为准确的检索查询。基于这些文本查询,采用了一种融合 BM25 稀疏检索和稠密嵌入检索的混合检索策略,以获取相关的交通法规和类似的历史案例。最后,LLM 整合检索到的法律知识和多模态事故证据进行综合推理,并生成标准化、基于法律的责任分析报告。大量实验表明,TrafficRAG 始终优于基线方法,实现了 77.32% 的 Legal Norm Adaptation Accuracy(法律规范适配准确率)、81.71% 的 Factual Faithfulness(事实忠实度)以及 5.48% 的 Liability Ratio MAE(责任比率平均绝对误差)。结果表明,通过检索增强将多模态事实证据与法律条款相结合,可以有效提高交通事故责任认定的可靠性和准确性。

Abstract

Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods often suffer from low efficiency, subjective judgment, and inconsistent analysis results. Meanwhile, large language models are constrained by noisy video inputs and insufficient legal domain knowledge. To address these issues, this work presents TrafficRAG, a multimodal retrieval-augmented framework for automated traffic accident analysis and report generation. Specifically, the proposed framework first adopts a vision-language model to produce structured textual descriptions of accident scenarios, which serve as accurate retrieval queries. Based on these textual queries, a hybrid retrieval strategy integrating BM25 sparse retrieval and dense embedding retrieval is employed to fetch relevant traffic regulations and similar historical cases. Finally, the large language model incorporates retrieved legal knowledge and multimodal accident evidence for comprehensive reasoning, and generates standardized, legally grounded liability analysis reports. Extensive experiments show that TrafficRAG consistently outperforms baseline methods, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The results validate that integrating multimodal factual evidence with legal clauses via retrieval augmentation can effectively improve the reliability and accuracy of traffic accident liability determination.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 6.0/10 9.0
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 0.0/10 0.0

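Scoring rationale: The paper's core is a multimodal RAG framework, which is highly relevant to MultiModal; its use of a vision-language model involves MLLM techniques. It does not address world models, reinforcement learning, or low-level architecture details. No designated expert author was found.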

Keywords

Multimodal RAG, Traffic Accident Liability, Vision-Language Model, Retrieval Augmentation, Legal Knowledge, LLM Reasoning, Traffic Regulations

Detailed Analysis

Chinese Title: TrafficRAG: 一种用于交通事故责任判定的多模态检索增强生成框架

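Summary: The paper proposes TrafficRAG, a multimodal retrieval-augmented generation framework for automatically producing traffic accident liability analysis reports. To address the low efficiency, subjectivity, and weak legal grounding of existing methods, TrafficRAG first uses a vision-language model (VLM) to extract a structured accident description from surveillance video, then applies a hybrid retrieval strategy (BM25 sparse retrieval combined with dense retrieval) to fetch relevant traffic regulations and similar cases from an external knowledge base, and finally has a large language model (LLM) reason over the retrieved knowledge and the multimodal evidence to generate a legally grounded liability report. Experiments show that TrafficRAG clearly outperforms several baselines, with 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The work shows that reasoning over multimodal accident facts together with legal clauses improves liability determination.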

Innovations:

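  • Proposes the TrafficRAG framework, presented as the first application of multimodal retrieval-augmented generation (RAG) to traffic accident liability determination, automating the full path from video to structured report.
  • Designs a VLM-guided module for structured accident description that extracts four kinds of key information from video (scene context, participants, action sequence, and liability-related cues), filtering noise and focusing on liability.
  • Uses dual-path hybrid retrieval (BM25 plus dense retrieval) with a cross-source consistency re-ranking module to retrieve legal clauses and similar cases jointly, improving the coherence and relevance of the evidence.
  • Builds a multimodal traffic accident dataset of 1584 cases that integrates video datasets such as VRU-Accident and TAU-106K with legal resources such as CADD, STARD, and LeCaRDv2.
  • Generates standardized liability reports through structured prompts and three consistency constraints (factual, legal, and structural consistency), reducing hallucination and improving controllability.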

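Methodology: The paper uses a multi-stage pipeline: (1) video preprocessing: uniformly sample frames, denoise, normalize, and encode them into visual features with CLIP; (2) VLM-generated structured accident description: a visual-encoder plus language-decoder architecture, guided by prompts to extract the four kinds of information; (3) dual-path knowledge retrieval: combine BM25 sparse retrieval with moka-ai/m3e-base dense retrieval and fuse the scores by weighting after Min-Max normalization; (4) cross-source consistency re-ranking: compute a joint score for each law-case pair from cosine similarity and fixed logic rules, and select the best evidence bundle; (5) LLM report generation: use structured prompts to generate, from the retrieved evidence, a report with sections such as basic information, accident process, legal basis, and liability allocation.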
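
The score-fusion step of the dual-path retrieval can be sketched as follows. The sketch assumes both retrievers return a score per document id and that fusion is a convex combination with weight `alpha`; the paper's actual fusion weight is not given in this report, and the document ids and scores in the example are made up.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-Max normalize scores to [0, 1]; constant inputs map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_rank(bm25: Dict[str, float], dense: Dict[str, float],
                alpha: float = 0.5, top_k: int = 3) -> List[Tuple[str, float]]:
    """Fuse normalized sparse and dense scores; alpha weights the sparse path (assumed)."""
    nb, nd = min_max(bm25), min_max(dense)
    fused = {d: alpha * nb.get(d, 0.0) + (1 - alpha) * nd.get(d, 0.0)
             for d in set(nb) | set(nd)}
    # Sort by score, breaking ties by id so the ranking is deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    # Hypothetical raw scores for three knowledge-base entries.
    bm25_scores = {"art_42": 12.0, "art_7": 3.0, "case_19": 6.0}
    dense_scores = {"art_42": 0.55, "art_7": 0.80, "case_19": 0.30}
    for doc_id, score in hybrid_rank(bm25_scores, dense_scores):
        print(doc_id, round(score, 3))
```

Normalizing before fusion matters because BM25 scores are unbounded while cosine similarities are not; without it the sparse path would dominate the sum.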

Key Results:

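  • Reaches 77.32% Legal Norm Adaptation Accuracy, ahead of the baselines.
  • Reaches 81.71% Factual Faithfulness.
  • Achieves a Liability Ratio MAE of only 5.48%.
  • Ablations verify the contribution of each component (VLM description generation, dual-path retrieval, re-ranking).
  • Consistently achieves the best results against several strong baselines (including LLM-only and retrieval-free methods).

Tech Stack:

  • Vision-language model: CLIP (openai/clip-vit-large-patch14)
  • Dense retrieval encoder: moka-ai/m3e-base
  • Sparse retrieval: BM25
  • Vector index: FAISS
  • Large language model: not specified, but structured prompts and conditional generation are used
  • Datasets: VRU-Accident, TAU-106K, CADD, STARD, LeCaRDv2
  • Mathematical methods: Min-Max normalization, cosine similarity, weighted fusion

Strengths:

  • Proposes an end-to-end multimodal RAG framework that combines video understanding with legal knowledge retrieval, addressing the weak legal grounding of traditional methods.
  • Builds a multimodal traffic accident dataset covering video, structured descriptions, legal clauses, and cases, providing a baseline for later work.
  • Cross-source consistency re-ranking improves the coherence of the evidence and reduces the effect of retrieval noise on generation.
  • The experimental design is comprehensive, with multiple baseline comparisons and ablation studies that validate each module.
  • Report generation uses structured prompts and consistency constraints, which makes the output more standardized and interpretable.

Limitations:

  • The dataset is small (1584 cases), which may limit generalization, especially to rare accident scenarios.
  • VLM-generated structured descriptions can still be affected by video noise and occlusion, so key information may be missed.
  • The legal knowledge base contains only 265 regulations and 671 cases; coverage is limited and may not handle complex or novel accidents.
  • The re-ranking module uses fixed logic rules rather than a learned method, so it lacks flexibility.
  • Model code and dataset are not released, which limits reproducibility.

Relevance To Keywords:

  • Unify Models: not directly involved, although the multimodal RAG framework can be viewed as a collaboration between vision and language models.
  • World Models: not involved.
  • Representation Learning: uses pretrained representation models such as CLIP and m3e-base but proposes no new representation-learning method.
  • Model-Based RL / Reinforcement Learning: not involved.
  • Native multimodal LLMs: the paper uses a VLM and an LLM but composes existing models rather than using a native multimodal LLM (such as GPT-4V).
  • Unified multimodal understanding and generation: the pipeline unifies video understanding and report generation as a workflow but does not propose a unified model.
  • Post-training: not involved (for example, no RLHF); pretrained models are used for inference only.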
Score: 31.5 / 27.8
Authors: Keyue Qiu, Xintong Wang, Zhilong Zhang, Hao Zhou, Wei-Ying Ma
Published: 2026-06-01
TL;DR: The paper proposes GeoCoupling to optimize temporal couplings between sequence and structure modalities in biomolecular co-design, improving physical validity and diversity compared to synchronous baselines.
Abstract (Chinese Translation)

生物分子如蛋白质和小分子配体在生物系统中起核心作用,这源于序列与三维结构之间的紧密相互作用。近期针对生物分子协同设计(biomolecular co-design)的生成模型(generative models)旨在通过联合建模耦合模态(coupled modalities)来捕捉这种相互作用。然而,现有方法大多采用边缘生成过程(marginal generative processes)的并行执行,隐式地强制固定同步耦合(synchronous coupling)。我们认为,一个关键但被忽视的自由度在于这些边缘过程在训练和生成过程中如何进行时间上的耦合(temporally coupled),不恰当的耦合可能引入高方差监督(high-variance supervision)和不一致的中间状态(inconsistent intermediate states),进而影响模态一致性(modality consistency)。为了解决这一问题,我们引入了 GeoCoupling,这是一个优化异质模态(heterogeneous modalities)之间时间耦合(temporal couplings)的系统框架。在基于结构的药物设计(structure-based drug design)和无条件蛋白质设计(unconditional protein design)上的实证结果表明,学习到的耦合一致地优于同步和随机耦合的基线(baselines),产生具有改进的物理有效性(physical validity)和多样性(diversity)的生物分子。

Abstract

Biomolecules such as proteins and small-molecule ligands play a central role in biological systems, arising from the tight interplay between sequence and three-dimensional structure. Recent generative models for biomolecular co-design aim to capture this interplay by jointly modeling coupled modalities. However, existing approaches largely adopt a parallel execution of marginal generative processes, implicitly enforcing fixed synchronous coupling. We argue that a critical but overlooked degree of freedom lies in how these marginal processes are temporally coupled during training and generation, where inappropriate coupling can introduce high-variance supervision and inconsistent intermediate states, affecting modality consistency. To address this, we introduce GeoCoupling, a systematic framework that optimizes for temporal couplings between heterogeneous modalities. Empirical results across structure-based drug design and unconditional protein design demonstrate the learned couplings consistently outperform synchronous and randomly coupled baselines, yielding biomolecules with improved physical validity and diversity.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper centers on multimodal biomolecular co-design, so it scores high on 'MultiModal' (9.0) and moderate on 'Unify Models' (5.0) because of its coupling optimization. The other keywords (Tokenizer, Visual Encoder, World Models, MLLM, and model-based RL) are largely irrelevant (1.0-2.0). The total weighted score is 31.5, above the 27.8 threshold. None of the designated expert authors appear on the paper.
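The weighted total quoted in the rationale can be reproduced from the scoring table. The sketch below assumes the total is the sum of weight times relevance over all keywords and that a paper passes when the total is at least the threshold; whether the report's comparison is strict or inclusive is not stated, and the variable names are illustrative.

```python
# Relevance values copied from the scoring table above; every keyword has weight 1.5.
WEIGHT = 1.5
PASS_THRESHOLD = 27.8  # the dynamic passing score shown next to each total

relevance = {
    "Unify Models": 5.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 2.0,
    "World Models": 1.0,
    "MLLM": 1.0,
    "MultiModal": 9.0,
    "model-based RL": 1.0,
}

# Per-keyword score = weight * relevance; the paper's total is their sum.
scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())

# Assumption: "pass" means total >= threshold.
passed = total >= PASS_THRESHOLD
```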

Keywords

Multimodal Biomolecular Co-design, Intrinsic Geodesic Coupling, Temporal Couplings, Heterogeneous Modalities, Structure-based Drug Design, Protein Design, Generative Models

In-depth Analysis

Chinese Title: 揭秘具有内在测地耦合的多模态生物分子协同设计

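Summary: This paper addresses the co-design of sequence and 3D structure for biomolecules such as proteins and small-molecule ligands. It observes that existing multimodal generative models generally use synchronous or random coupling strategies, which lead to high-variance supervision and inconsistency between modalities. The authors propose the GeoCoupling framework, which formulates multimodal generation as a temporal optimal transport problem and uses Bayesian optimization to learn the temporal coupling path between modalities (a geodesic), thereby minimizing the transport energy from the prior to the data distribution. On structure-based drug design and unconditional protein generation, the method consistently produces biomolecules with better physical validity and diversity than synchronous and randomly coupled baselines. The experiments indicate that the learned coupling path automatically discovers a complexity-aware curriculum in which structural error drops quickly before sequence consistency improves, supporting a "structure-first" generation mechanism.

Innovations:

  • Identifies dynamic mismatch as a fundamental problem in multimodal co-design, arguing that synchronous coupling suffers from high bias and random coupling from high variance.
  • Proposes the GeoCoupling framework, which treats temporal coupling as a learnable degree of freedom and formalizes multimodal generation via temporal optimal transport.
  • Searches the coupling space efficiently with bi-level optimization and a Gaussian process surrogate, without backpropagating through the entire training trajectory.
  • Validates on structure-based drug design and protein generation that learned couplings outperform synchronous and random couplings, improving physical validity and diversity.

Methodology: The paper first formulates multimodal generation as a temporal optimal transport problem and defines a temporal coupling function γ(τ) = [t_r(τ), t_h(τ)] that maps global progress τ to each modality's local noise level. It then poses a bi-level objective: the inner level trains the generative model, and the outer level tunes the coupling parameters against final validation performance, which is non-differentiable. Bayesian optimization with a Gaussian process surrogate explores the coupling space efficiently in search of low-energy paths. The learned coupling schedule is used during training and applied directly at inference for joint generation.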
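To make the coupling function γ(τ) = [t_r(τ), t_h(τ)] concrete, here is a minimal sketch of one possible parametric family. The power-law form and the exponents `a` and `b` are assumptions chosen for illustration; the analysis above does not say how the paper parameterizes γ or which modality each of t_r and t_h refers to.

```python
import numpy as np

def power_coupling(tau, a=1.0, b=1.0):
    """Map global progress tau in [0, 1] to two local times (t_r, t_h).

    Hypothetical parameterization: t_r = tau**a, t_h = tau**b.
    Both curves are monotone and share the endpoints (0, 0) and (1, 1).
    a == b == 1 recovers synchronous coupling, where both modalities
    advance at the same rate.
    """
    tau = np.clip(np.asarray(tau, dtype=float), 0.0, 1.0)
    return tau ** a, tau ** b

# Synchronous baseline: the two local times coincide everywhere.
grid = np.linspace(0.0, 1.0, 5)
sync_r, sync_h = power_coupling(grid)

# An asynchronous schedule: the first modality runs ahead of the second.
async_r, async_h = power_coupling(grid, a=0.5, b=2.0)
```

In a bi-level setup of the kind described above, an outer search (for example Bayesian optimization) would propose values such as `a` and `b`, train or evaluate the generator under that schedule, and feed the validation score back to the search.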

Key Results:

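  • In structure-based drug design, ligands generated with the learned coupling path show significantly better physical validity (e.g., binding affinity and drug-likeness) than those from synchronous and random coupling.
  • In unconditional protein generation, both the structural consistency of generated sequences (e.g., pLDDT) and diversity improve.
  • Trajectory analysis shows structural error dropping quickly early on (t < 0.3), with sequence consistency rising only afterward, supporting the structure-first mechanism.
  • GeoCoupling adds only minimal extra compute during training and inference.

Tech Stack:

  • Temporal Optimal Transport
  • Bayesian Optimization
  • Gaussian Process Surrogate
  • Bi-level Optimization
  • Diffusion / Flow Matching
  • Geodesic Coupling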

Strengths:

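  • The problem is clearly defined: the paper exposes the flaws of synchronous and random coupling from a theoretical angle and gives a geometric interpretation.
  • The method is model-agnostic and applies to a range of multimodal generative models (e.g., diffusion and flow matching).
  • The experimental validation is solid, with consistent improvements on two important classes of biomolecular design tasks.
  • The learned coupling path is interpretable and reveals a structure-first generation order.

Limitations:

  • Bayesian optimization may scale poorly to high-dimensional coupling spaces; the paper only experiments with two-dimensional coupling (two modalities).
  • The method uses final validation performance as its optimization target, so results may depend on the choice of validation metric.
  • Coupling optimization for more than two modalities (e.g., protein-ligand-solvent) is not discussed.
  • The theoretical analysis lacks depth on the rigorous mathematical definition of the geodesic and on convergence proofs.

Relevance To Keywords:

  • Unify Models: The multimodal co-design framework unifies sequence and structure generation, which relates to the idea of unified models.
  • World Models: Biomolecular design can be viewed as modeling the molecular world, but the paper does not directly involve world models.
  • Representation Learning: Learning the coupling implicitly optimizes representation alignment between modalities.
  • Model-Based RL: The paper searches couplings with Bayesian optimization (a model-based method), but this is not reinforcement learning.
  • Native multimodal large models: The paper targets the biomolecular domain rather than general-purpose multimodal large models, though the methodology may transfer.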
Score: 31.5 / 27.8
Authors: Bishr Omer Abdelrahman Adam, Xu Li
Published: 2026-06-01
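TL;DR: This paper proposes a distortion-aware fusion framework that combines natural scene statistics with vision-language model embeddings and reports results on blind image quality assessment that surpass existing state-of-the-art methods.
Abstract (Chinese Translation)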

盲图像质量评估(BIQA)旨在无需参考图像的情况下预测感知图像质量。经典的自然场景统计(NSS)描述子与现代视觉 - 语言模型(VLM)嵌入从本质上不同的视角解决这一问题,但结合二者是否能产生互补效益,以及如何根据输入图像加权它们的贡献,尚待探索。我们提出一种失真感知融合框架,该框架通过乘法门控机制整合一个 138 维 NSS 描述子与两种互补的视觉 - 语言模型(VLM)嵌入(SigLIP 和 CLIP-H),该机制基于图像内容学习针对每个输入的流权重。与静态拼接融合不同,所提出的门控网络根据输入抑制或放大每个流的贡献,生成的权重与在 KADID-10k 上通过独立消融实验测量的每失真 NSS 贡献呈正相关(斯皮尔曼等级相关系数 rho=0.33)。该框架无需对 VLM 骨干进行端到端微调,而是采用混合损失进行训练,该损失结合了均方误差、皮尔逊线性相关系数以及成对排序目标。我们在三个标准基准上进行评估:KonIQ-10k(SROCC=0.9142, PLCC=0.9279)、KADID-10k(SROCC=0.9715, PLCC=0.9733,超越了近期最先进方法)以及 LIVE Challenge in-the-Wild(在跨数据集预训练和微调下,SROCC=0.8527, PLCC=0.8802)。在 KADID-10k 上的每失真分析表明,NSS 特征在噪声和色偏失真(像素统计直接受影响)上贡献最大,而在感知失真(如色饱和度变化)上贡献最小。学习得到的门控值验证了这些发现,证实了模型能够自主发现与手动每失真研究一致的失真 - 流亲和模式。

Abstract

Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream's contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 4.0/10 6.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 6.0/10 9.0
MultiModal 1.5 7.0/10 10.5
model-based RL 1.5 0.0/10 0.0

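Scoring rationale: The paper's core topic is blind image quality assessment (BIQA), fusing VLM embeddings with statistical features. It is entirely unrelated to World Models and Model-Based RL (score 0). Tokenizers are not discussed directly, but using VLMs involves the Visual Encoder and MLLM technology stack (medium to high relevance); the feature fusion reflects MultiModal and Unify Models ideas, hence the moderately high scores. None of the designated expert authors appear in the author list.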

Keywords

Blind Image Quality Assessment, Distortion-Aware Fusion, Natural Scene Statistics, Vision-Language Model, Multiplicative Gating Mechanism, SigLIP, CLIP-H

In-depth Analysis

Chinese Title: 面向失真的统计特征与视觉语言特征融合的盲图像质量评估

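Summary: This paper proposes a distortion-aware three-stream fusion framework for blind image quality assessment (BIQA). The framework combines a 138-dimensional natural scene statistics (NSS) descriptor with two complementary vision-language model (VLM) embeddings, SigLIP and CLIP-H, and uses a multiplicative gating mechanism to learn stream weights for each input image, adaptively suppressing or amplifying each stream's contribution. The VLM backbones stay frozen; only a lightweight MLP head (about 466k parameters) and the gating network are trained. Training uses a hybrid loss (MSE, PLCC, and a pairwise ranking objective). The method is evaluated on three benchmarks, KonIQ-10k, KADID-10k, and LIVE Challenge in-the-Wild, and reaches an SROCC of 0.9715 on KADID-10k, surpassing existing methods. A per-distortion analysis shows that NSS features contribute most on noise and color-shift distortions, and the gate values correlate positively with this analysis (ρ=0.33), supporting the claim that the model discovers distortion-stream affinity patterns on its own.

Innovations:

  • Proposes a three-stream BIQA framework that fuses NSS, SigLIP, and CLIP-H embeddings and trains only a lightweight MLP head without fine-tuning the VLM backbones, reaching a state-of-the-art SROCC of 0.9715 on KADID-10k.
  • Introduces a multiplicative distortion-aware gating mechanism that learns per-image stream weights; the gate values correlate positively (ρ=0.33) with the per-distortion NSS contribution measured by independent ablation, which serves as an interpretability check.
  • Designs a three-stage evaluation protocol (cross-dataset pretraining, target-domain fine-tuning, multi-seed ensembling, and Monte Carlo dropout test-time augmentation) that yields a 4.2 percentage point SROCC gain on LIVE-itW.
  • Provides a per-distortion analysis on KADID-10k that identifies the distortion types where NSS features complement the VLM embeddings (e.g., color block, contrast change, color diffusion) and those where they contribute least (e.g., color saturation, non-eccentricity patch).

Methodology: The architecture extracts three streams in parallel. The NSS stream computes a 138-dimensional statistical feature vector: 48 spatial-domain and 21 color/frequency-domain features per 32×32-pixel patch, aggregated across patches by mean and standard deviation, which is consistent with (48 + 21) × 2 = 138 dimensions. The SigLIP and CLIP-H streams each extract a VLM embedding, which is reduced with PCA and concatenated with the NSS features. A lightweight multiplicative gating network (an MLP) then learns stream weights for each input image; the weights are multiplied element-wise with the features, and the result is fed to a regression MLP that predicts the MOS. Training uses a hybrid loss combining MSE, PLCC, and a pairwise ranking loss. The VLM backbones are frozen, and only the gating network and regression head (about 466k parameters) are trained. Evaluation uses the standard SROCC and PLCC metrics.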
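The per-input gating step can be sketched as follows. This is a simplified stand-in, not the paper's network: it uses a single linear layer with a sigmoid where the paper uses an MLP, the PCA output size of 64 per VLM stream is a made-up value, and all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_fusion(streams, gate_w, gate_b):
    """Scale each feature stream by a gate computed from the whole input.

    streams: list of 1-D feature vectors, one per stream.
    gate_w, gate_b: parameters of a linear gate (a stand-in for the gating MLP).
    Returns the concatenated gated features and one gate value per stream.
    """
    x = np.concatenate(streams)
    logits = gate_w @ x + gate_b
    gates = 1.0 / (1.0 + np.exp(-logits))          # sigmoid keeps gates in (0, 1)
    gated = [g * s for g, s in zip(gates, streams)]
    return np.concatenate(gated), gates

# NSS is 138-D as in the summary; 64 is a hypothetical PCA output size.
nss = rng.normal(size=138)
siglip = rng.normal(size=64)
clip_h = rng.normal(size=64)

total_dim = nss.size + siglip.size + clip_h.size   # 266
gate_w = rng.normal(scale=0.01, size=(3, total_dim))
gate_b = np.zeros(3)

fused, gates = gated_fusion([nss, siglip, clip_h], gate_w, gate_b)
```

Because the gate is a function of the input, a noisy image and a color-saturated image can receive different NSS weights, which is the behavior that static concatenation cannot express. The fused vector would then feed the regression head.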

Key Results:

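  • On KADID-10k: SROCC=0.9715 and PLCC=0.9733, surpassing SOTA methods such as MANIQA, LIQE, and Q-Align.
  • On KonIQ-10k: SROCC=0.9142 and PLCC=0.9279.
  • On LIVE Challenge in-the-Wild, with cross-dataset pretraining plus fine-tuning: SROCC=0.8527 and PLCC=0.8802.
  • The per-distortion analysis shows NSS contributing most on distortions such as noise and color shift, and least on perceptual distortions such as color saturation.
  • The Spearman correlation between the gate values and the per-distortion NSS contribution from independent ablation is ρ=0.33, a positive correlation cited as evidence that the gating mechanism works.

Tech Stack:

  • Natural scene statistics (NSS): MSCN coefficients, generalized Gaussian distribution, local binary patterns (LBP), log-Gabor filters, and the Benford's-law first-digit distribution.
  • Vision-language models: SigLIP (sigmoid contrastive loss, WebLI dataset) and CLIP-H (softmax contrastive loss, LAION-2B dataset).
  • Dimensionality reduction: PCA.
  • Gating: multiplicative gating network (MLP).
  • Regression head: three-layer MLP with GELU activation.
  • Hybrid loss: mean squared error (MSE), Pearson linear correlation coefficient (PLCC), and pairwise ranking loss.
  • Training strategy: cross-dataset pretraining, target-domain fine-tuning, multi-seed ensembling, and Monte Carlo dropout test-time augmentation.
  • Evaluation metrics: Spearman rank-order correlation coefficient (SROCC) and Pearson linear correlation coefficient (PLCC).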
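A minimal sketch of a hybrid loss with the three terms named in the list above. The term weights, the use of `1 - PLCC` as the correlation term, and the hinge form of the ranking term are assumptions for illustration; the analysis does not give the paper's exact formulation.

```python
import numpy as np

def hybrid_loss(pred, target, w_mse=1.0, w_plcc=1.0, w_rank=1.0, margin=0.0):
    """Combine MSE, a Pearson-correlation term, and a pairwise ranking term.

    All weights and the margin are hypothetical hyperparameters.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)

    # 1) Mean squared error on the raw scores.
    mse = np.mean((pred - target) ** 2)

    # 2) 1 - PLCC, so perfect linear correlation gives (almost exactly) zero.
    pc = pred - pred.mean()
    tc = target - target.mean()
    plcc = (pc @ tc) / (np.linalg.norm(pc) * np.linalg.norm(tc) + 1e-8)
    plcc_loss = 1.0 - plcc

    # 3) Pairwise hinge: penalize pairs whose predicted order contradicts
    #    the ground-truth order. Pairs with equal targets contribute nothing
    #    when margin is 0.
    diff_pred = pred[:, None] - pred[None, :]
    diff_true = target[:, None] - target[None, :]
    rank_loss = np.mean(np.maximum(0.0, margin - np.sign(diff_true) * diff_pred))

    return w_mse * mse + w_plcc * plcc_loss + w_rank * rank_loss

mos = np.array([1.0, 2.0, 3.0, 4.0])          # toy quality labels
loss_perfect = hybrid_loss(mos, mos)          # predictions equal labels
loss_reversed = hybrid_loss(mos[::-1], mos)   # predictions in the wrong order
```

The three terms pull on different failure modes: MSE fixes the scale, the correlation term rewards linear agreement, and the ranking term targets the ordering that SROCC measures.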

Strengths:

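  • Fuses classical statistical features with modern VLM features for complementary strengths, and needs no fine-tuning of large models, so compute requirements are low.
  • The gating mechanism provides interpretability: it automatically learns feature importance per distortion type, in line with the manual analysis.
  • Reaches SOTA on the synthetic-distortion dataset KADID-10k and is competitive on authentic-distortion datasets.
  • The design is simple with few trainable parameters, making it easy to reproduce and deploy.
  • The detailed per-distortion analysis helps clarify the role of different features in BIQA.

Limitations:

  • VLM feature extraction still depends on pretrained models; although frozen, the large models must be loaded at inference, which may hurt real-time use.
  • Gains on the authentic-distortion dataset LIVE-itW are limited, and cross-domain generalization still needs improvement.
  • The gate learns only stream-level weights and does not adaptively weight dimensions within a feature.
  • Only two VLMs (SigLIP and CLIP-H) are used; fusing more VLMs or self-supervised models (such as DINOv2) is unexplored.
  • Experiments cover only three standard datasets, with no validation on other scenarios such as AIGC images or video quality.

Relevance To Keywords:

  • Representation learning: By fusing NSS and VLM features, the paper learns an image-quality representation, an application of representation learning to BIQA.
  • Unified understanding and generation in multimodal large models: The paper uses vision-language models (CLIP, SigLIP) as feature extractors, reflecting multimodal understanding, but involves no generation.
  • World models: Indirectly related, since NSS features are based on statistical regularities of natural images and can be seen as modeling priors about the visual world.
  • Reinforcement learning / post-training: The paper does not involve reinforcement learning or post-training techniques, although the fine-tuning and ensembling in its training strategy could be seen as a form of post-training.
  • Native multimodal large models: The CLIP and SigLIP models used are natively multimodal, but they are not fine-tuned end to end.
  • Overall relevance is moderate: the paper focuses on representation learning and multimodal feature fusion and has weak ties to world models and reinforcement learning.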
Score: 30.0 / 27.8
Authors: Yuchen Zhang, Ning Xi, Pengbin Feng, Shigang Liu, Jianfeng Ma, Yulong Shen, Yanan Sun, Xiaolin Zhou
Published: 2026-06-01
TL;DR: IstGPT leverages LLMs and graph neural networks on multi-modal industrial data to achieve state-of-the-art real-time anomaly detection in industrial control systems.
Abstract (Chinese Translation)

工业互联网系统正面临来自先进的工业控制系统(ICS)攻击的日益增加的威胁,导致严重的安全事故。然而,由于传感器和执行器之间存在复杂的依赖关系,现有工具在实时异常检测方面的有效性有限。为了解决这一问题,我们提出了 IstGPT,这是首个基于大语言模型(LLMs)和图学习的工业异常检测工具,旨在提供针对广泛 ICS 攻击的实时保护。IstGPT 能够对工业信息物理系统中的时空依赖关系进行细粒度和精确的建模。它首先利用工业多模态知识(包括运行数据、技术文档和系统图),通过多阶段提示工程提取传感器 - 执行器依赖图。然后,LLM-Optimation 基于节点准确性、边一致性和逻辑一致性迭代优化该图。最后,IstGPT 将改进的图神经网络与编码器 - 解码器架构相结合,通过重构误差来检测异常。我们在 9 个数据集(包括 2 个公开数据集、6 个模拟数据集和一个真实世界机械臂数据集)上,将 IstGPT 与 12 种最先进基线方法进行了评估。IstGPT 在九个数据集上均取得了最佳的 F1 分数和 eTaF1(一种较新的时间感知指标)。我们进一步讨论了在真实工业场景中部署 IstGPT 的可行性。

Abstract

Industrial Internet systems face increasing threats from sophisticated industrial control system (ICS) attacks, resulting in critical safety incidents. However, existing tools exhibit limited effectiveness in real-time anomaly detection due to the complex dependencies among sensors and actuators. To tackle this, we present IstGPT, the first industrial anomaly detection tool based on LLMs and graph learning to provide real-time protection against a wide range of ICS attacks. IstGPT achieves fine-grained and precise modeling on spatial-temporal dependencies in industrial cyber-physical systems. It first leverages industrial multi-modal knowledge, including operational data, technical documents, and system diagrams, to extract sensor-actuator dependency graphs via multi-stage prompt engineering. Then, LLM-Optimation iteratively refines the graph based on node accuracy, edge consistency, and logical coherence. Finally, IstGPT integrated improved graph neural networks with an encoder-decoder architecture to detect anomalies via reconstruction errors. We evaluate IstGPT against 12 state-of-the-art baselines on 9 datasets, including 2 public, 6 simulated, and a real-world robotic arm dataset. IstGPT achieves the best F1-scores and eTaF1 (a newer time-aware metric) across nine datasets. We further discuss the feasibility of deploying IstGPT in real-world industrial scenarios.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 6.0/10 9.0
MultiModal 1.5 8.0/10 12.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on industrial anomaly detection using LLMs and graph networks. It aligns strongly with MultiModal (diverse data inputs) and MLLM (an LLM processing multimodal prompts), and moderately with Unify Models (LLM+GNN integration). It has low relevance to Tokenizer (implicit usage only) and Visual Encoder (diagrams are processed via prompts rather than a dedicated encoder), and no relevance to World Models or model-based RL.

Keywords

LLM-based Anomaly Detection, Spatial-Temporal Graph, Multi-modal Knowledge, Graph Neural Networks, Industrial Control System, Encoder-Decoder Architecture

In-depth Analysis

Chinese Title: IstGPT:基于大语言模型的工业系统时空图异常检测

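Summary: Industrial Internet systems face increasingly sophisticated attacks on industrial control systems, and existing anomaly detection methods perform poorly because they fail to model the complex dependencies between sensors and actuators. This paper proposes IstGPT, described as the first industrial anomaly detection tool based on large language models and graph learning. The method first extracts a sensor-actuator dependency graph from industrial multimodal knowledge (operational data, technical documents, system diagrams) via multi-stage prompt engineering. The LLM-Optimation module then iteratively refines the graph based on node accuracy, edge consistency, and logical coherence. Finally, improved graph neural networks are combined with an encoder-decoder architecture to detect anomalies via reconstruction error. On 9 datasets (2 public, 6 simulated, and 1 real-world robotic arm dataset), IstGPT outperforms 12 state-of-the-art baselines on both F1 score and the time-aware metric eTaF1, with an average F1 of 91.7%, and the paper discusses the feasibility of real-world deployment.

Innovations:

  • Designs what the authors describe as the first iterative prompt-engineering procedure for capturing complex sensor-actuator dependencies from categorized industrial multimodal knowledge, yielding an efficient and scalable anomaly detection tool.
  • Proposes the LLM-Optimation module, which evaluates and refines the LLM-generated dependency graph by checking nodes, edges, and overall logical consistency, so that the graph structure is accurate and physically meaningful.
  • Builds an industrial spatial-temporal graph (ISTG) that jointly models dependencies among multi-sensor and actuator signals across space and time, enabling multi-dimensional, multivariate anomaly detection with markedly fewer false positives and false negatives.
  • Evaluates comprehensively on 9 datasets, including a real robotic arm platform, outperforming 12 SOTA methods, and discusses deployment feasibility in real industrial settings.

Methodology: IstGPT has four stages: 1) Data preprocessing: raw sensor/actuator data are cleaned, normalized, and segmented with a sliding window; 2) ISTG generation: multi-stage prompt engineering extracts a sensor-actuator dependency graph from industrial multimodal knowledge (operational data, technical documents, system diagrams), LLM-Optimation refines it iteratively, and temporal correlations are added to form the industrial spatial-temporal graph; 3) ISTG learning: unsupervised deep learning (GAT, GCN, and an encoder-decoder architecture) models normal spatial-temporal behavior; 4) Anomaly detection: anomalies are detected and localized by thresholding the reconstruction error.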
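Stages 1 and 4 (sliding-window segmentation and thresholding the reconstruction error) can be sketched without any graph model. In the sketch below the `reconstruct` callable is a trivial stand-in for the trained GNN encoder-decoder, and the window width, stride, and threshold are made-up values.

```python
import numpy as np

def sliding_windows(series, width, stride=1):
    """Cut a 1-D series into windows of length `width`, `stride` samples apart."""
    series = np.asarray(series, dtype=float)
    starts = range(0, len(series) - width + 1, stride)
    return np.stack([series[s:s + width] for s in starts])

def detect_anomalies(windows, reconstruct, threshold):
    """Flag windows whose mean squared reconstruction error exceeds the threshold.

    `reconstruct` is any callable mapping windows to their reconstruction;
    here it stands in for the trained model.
    """
    errors = np.mean((windows - reconstruct(windows)) ** 2, axis=1)
    return errors, errors > threshold

# Toy signal: flat "normal" behavior with one spike at index 12.
signal = np.zeros(20)
signal[12] = 5.0

windows = sliding_windows(signal, width=4, stride=4)   # 5 non-overlapping windows

# Stand-in model that has only ever seen the flat pattern.
reconstruct_normal = lambda w: np.zeros_like(w)

errors, flags = detect_anomalies(windows, reconstruct_normal, threshold=1.0)
```

A model trained only on normal behavior reconstructs normal windows well and anomalous windows poorly, so the error itself becomes the anomaly score; the window index of a flagged error localizes the event in time.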

Key Results:

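  • On the two public datasets, SWaT and WADI, IstGPT reaches an average F1 of 91.7%, ahead of the best baselines (89.1%, 87.6%, 79.9%).
  • On the 7 private industrial datasets (simulated plus the real robotic arm), IstGPT outperforms all 12 SOTA baselines on every metric.
  • Graph construction takes about 7 minutes on average (at a scale of several hundred nodes), and training and inference costs are low, making real deployment feasible.
  • It also achieves the best performance on the time-aware metric eTaF1.

Tech Stack:

  • Large language model (LLM)
  • Multi-stage prompt engineering
  • LLM-Optimation module (based on node accuracy, edge consistency, and logical coherence)
  • Graph attention network (GAT)
  • Graph convolutional network (GCN)
  • Encoder-decoder architecture
  • Autocorrelation function for estimating industrial cycle periods
  • Sliding-window segmentation
  • Reconstruction-error thresholding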
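The autocorrelation item in the list above can be illustrated with a small period estimator. This is a generic textbook approach, not the paper's procedure: it skips the initial positive lobe of the autocorrelation and then takes the strongest remaining lag.

```python
import numpy as np

def estimate_period(signal):
    """Estimate the dominant period (in samples) from the autocorrelation.

    Returns None if the autocorrelation never drops below zero, in which
    case no clear cycle is visible.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    # Keep non-negative lags only; index 0 is lag 0.
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]

    negative = np.where(acf < 0)[0]
    if negative.size == 0:
        return None
    start = negative[0]                  # skip the lobe around lag 0
    upper = len(x) // 2                  # long lags have too few samples
    return int(start + np.argmax(acf[start:upper]))

# Toy process signal: a sinusoid that repeats every 25 samples.
t = np.arange(200)
cycle = np.sin(2 * np.pi * t / 25)
period = estimate_period(cycle)
```

An estimated cycle length like this is a natural input for choosing the sliding-window width, so that each window covers a whole operating cycle.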

Strengths:

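  • Described as the first work to combine LLMs with graph learning for industrial anomaly detection, making full use of industrial multimodal knowledge (documents, diagrams, data) to build a physically interpretable dependency graph.
  • Proposes an iterative refinement mechanism (LLM-Optimation) that avoids the spurious or misleading relations that purely data-driven methods can produce.
  • The spatial-temporal graph (ISTG) models spatial dependencies and temporal dynamics together, improving detection accuracy.
  • Outperforms existing methods across multiple datasets at low training cost, showing potential for real deployment.

Limitations:

  • The method depends on the LLM's grasp of industrial knowledge; if the LLM misunderstands domain-specific terminology or complex system diagrams, graph quality may suffer.
  • It currently targets anomalies at the sensor and actuator level only and does not cover attacks on the network layer or the control-logic layer.
  • Industrial multimodal knowledge (documents, diagrams, and so on) must be provided in advance, so the method may be limited where such knowledge is missing or of poor quality.
  • Generalization to unknown attack types is not yet well validated; the experiments are based on known attack scenarios.

Relevance To Keywords:

  • Native multimodal large models: The paper uses an LLM to process industrial multimodal knowledge (text, diagrams, data), reflecting multimodal understanding.
  • Unified understanding and generation in multimodal large models: The LLM generates a dependency graph from multimodal knowledge, turning understanding into structured output.
  • Representation learning: Graph neural networks learn spatial-temporal representations of sensors and actuators for anomaly detection.
  • World models: The industrial spatial-temporal graph can be viewed as a model of the physical dependencies in an industrial system, which helps in understanding system behavior.
  • Reinforcement learning / post-training: The paper does not directly involve reinforcement learning, but the iterative refinement in LLM-Optimation resembles the feedback mechanisms used in post-training.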
Score: 28.5 / 27.8
Authors: Xueji Fang, Liyuan Ma, Jianhao Zeng, Jinjin Cao, Mingyuan Zhou, Guo-Jun Qi
Published: 2026-06-01
TL;DR: FocusDiT enhances fine-grained text-to-image generation by masking critical query tokens in Diffusion Transformers to improve detail decoding.
Abstract (Chinese Translation)

扩散变换器(DiT)已在生成扩散领域被广泛采用,通过注意力机制和前馈(FFN)层推进查询标记的去噪。FFN 实际上充当了解码视觉内容的键值词汇库,其值嵌入了视觉语义知识。本文指出,关注对应更复杂细节的关键查询标记,并鼓励模型改进这些标记,对于细粒度视觉生成至关重要。为此,我们提出了 FocusDiT,该模型应用一种掩码(Masking)方案,专门关注仅输入至 FFN 的关键查询标记。被掩码的查询可以从 FFN 词汇库中检索视觉标记,并利用它们来解码其视觉细节。广泛的文本到图像实验验证了标记掩码在提升生成性能方面的有效性。

Abstract

Diffusion transformer (DiT) has been widely adopted in the generative diffusion field, advancing the denoising of query tokens through attention and Feed-Forward (FFN) layers. FFN actually acts as the key-value vocabulary for decoding visual contents where the value embeds the visual semantical knowledge. We present that focusing on critical query tokens corresponding to more complex details and encouraging the model to improve these tokens is essential for fine-grained visual generation. To this end, we propose FocusDiT, which applies a Masking scheme to focus on critical query tokens that are exclusively fed into FFN. The masked queries can retrieve visual tokens from the FFN vocabularies, and use them to decode their visual details. Extensive text-to-image experiments validate the effectiveness of token masking in enhancing generative performance.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 4.0/10 6.0
Visual Encoder 1.5 4.0/10 6.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 6.0/10 9.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on Diffusion Transformers for fine-grained image generation, showing moderate-to-high relevance to MultiModal (text-to-image task) and Tokenizer (token masking mechanism). Visual Encoder is moderately relevant as part of the DiT architecture. Relevance is low for Unify Models, World Models, and model-based RL because the paper does not address model unification, world simulation, or reinforcement learning. MLLM is moderately relevant because of the multimodal context, but the model is diffusion-based rather than language-model-based.

Keywords

FocusDiT, Diffusion Transformers, Masking Queries, Fine-grained Image Generation, Text-to-Image, Visual Tokens, FFN Vocabulary

In-depth Analysis

Chinese Title: FocusDiT:扩散Transformer中的查询掩码用于细粒度图像生成

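Summary: This paper proposes FocusDiT, a query-token masking strategy for the feed-forward network (FFN) in Diffusion Transformers (DiT), aimed at improving fine-grained image generation. The study finds that the FFN in DiT acts as a key-value vocabulary storing visual semantic knowledge, but critical query tokens (those corresponding to complex details) are often drowned out by tokens from background or low-frequency regions, leaving the vocabulary underused. FocusDiT therefore introduces a query mask prediction network (Q-MaskGen) that dynamically separates critical from non-critical tokens based on the timestep and the previous block's mask, so that only critical tokens are decoded through the FFN. It also proposes a vocabulary reallocation (VR) scheme that adjusts vocabulary capacity to each layer's decoding demand, moving redundant capacity from shallow and deep layers to the middle layers. Experiments show that FocusDiT significantly improves image quality and detail richness in text-to-image generation, and that inference can be made more efficient by skipping FFN layers whose masks are close to zero.

Innovations:

  • Proposes a query-token masking strategy that dynamically identifies critical tokens and prioritizes FFN vocabulary resources for them, strengthening fine-grained detail generation.
  • Designs a vocabulary reallocation scheme that adjusts vocabulary capacity to each layer's decoding demand, improving overall vocabulary utilization.
  • Feeds the timestep and the previous block's mask into mask prediction, so the model adapts to the decoding needs of different noise stages.
  • Uses the query masks to accelerate inference by skipping low-activation FFN layers, reducing compute cost.

Methodology: Built on the DiT architecture, the method inserts a query mask prediction network (Q-MaskGen) before the FFN. Q-MaskGen is a multi-layer MLP whose inputs are the timestep t, the cross-attention output, and the previous block's mask, and whose output is a mask value in [0, 1] for each token. The mask is multiplied element-wise with the FFN output before the residual connection is added. Vocabulary reallocation is implemented by changing each layer's FFN intermediate dimension (its vocabulary size) while keeping the total parameter count fixed. Training uses the standard diffusion loss, with the mask network and the main model optimized jointly end to end.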
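The masked FFN update described above (mask times FFN output, then residual) can be sketched as follows. The mask values here are hand-picked rather than predicted by Q-MaskGen, the weights are random, and the skip threshold `skip_eps` is a hypothetical value used only to illustrate the layer-skipping idea.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def masked_ffn_block(x, mask, w1, b1, w2, b2, skip_eps=1e-3):
    """Residual FFN block whose update is scaled per token by `mask`.

    x:    (tokens, dim) query tokens entering the FFN.
    mask: (tokens,) values in [0, 1]; 0 leaves a token unchanged.
    If every mask value is below skip_eps the FFN is skipped entirely,
    mirroring the inference shortcut mentioned in the summary.
    """
    if mask.max() < skip_eps:
        return x.copy()
    hidden = gelu(x @ w1 + b1)             # expand into the FFN "vocabulary"
    ffn_out = hidden @ w2 + b2             # project back to the token dimension
    return x + mask[:, None] * ffn_out     # element-wise mask, then residual

rng = np.random.default_rng(0)
tokens, dim, hidden_dim = 4, 8, 32
x = rng.normal(size=(tokens, dim))
w1 = rng.normal(scale=0.1, size=(dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.1, size=(hidden_dim, dim))
b2 = np.zeros(dim)

mask = np.array([1.0, 0.0, 0.5, 0.0])      # tokens 0 and 2 are treated as critical
out = masked_ffn_block(x, mask, w1, b1, w2, b2)
skipped = masked_ffn_block(x, np.zeros(tokens), w1, b1, w2, b2)
```

In this form, vocabulary reallocation would correspond to choosing a different `hidden_dim` per layer under a fixed parameter budget.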

Key Results:

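  • FocusDiT outperforms baselines such as PixArt-α, SD3, and OpenSoraPlan in text-to-image generation, with gains on quantitative metrics (e.g., FID and CLIP score).
  • Vocabulary utilization analysis shows that masking lets critical tokens access more vocabulary entries while non-critical tokens are suppressed.
  • After vocabulary reallocation, the middle layers gain vocabulary capacity while shallow and deep layers lose it, and overall generation quality improves.
  • At inference, roughly 30% of FFN layers (those with masks near zero) can be skipped, speeding up generation without loss of quality.

Tech Stack:

  • Diffusion Transformer (DiT)
  • Feed-forward network (FFN) viewed as a key-value vocabulary
  • MLP-based query mask prediction network (Q-MaskGen)
  • Adaptive layer normalization (AdaLN-single)
  • GELU activation
  • Vocabulary reallocation (adjusting the FFN intermediate dimension)
  • Text-to-image generation (built on baselines such as PixArt-α)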

Strengths:

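  • Analyzes in depth how the FFN in DiT stores and uses its vocabulary, and proposes targeted optimizations.
  • The masking strategy is simple and effective, improving both generation quality and inference efficiency.
  • The vocabulary reallocation scheme improves resource allocation without adding parameters.
  • The experiments are thorough, with comparisons against several SOTA methods and advantages in both qualitative and quantitative results.

Limitations:

  • The mask prediction network adds compute overhead, even though some layers can be skipped at inference.
  • Vocabulary reallocation relies on hand-designed per-layer capacity ratios, which may not be optimal.
  • The method mainly targets text-to-image generation and is not validated on video or other modalities.
  • Detail generation in highly complex scenes (e.g., multiple people or objects) needs further testing.

Relevance To Keywords: The paper focuses on token selection and vocabulary utilization in Diffusion Transformers, which relates indirectly to "representation learning" and "world models" (through visual semantic storage). Its masking mechanism can be viewed as a form of implicit attention allocation and is in line with the "unified understanding and generation in multimodal large models" direction, since it improves text-to-image generation quality. However, the paper does not directly involve reinforcement learning, post-training, or a unified model framework, so overall relevance is moderate.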

Score: 25.5 / 27.8
Authors: Wenhang Shi, Jinhao Dong, Yiren Chen, Zhe Zhao, Shuqing Bian, Wei Lu, Xiaoyong Du
Published: 2026-06-01
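TL;DR: This paper proposes a framework for scaling agentic capabilities through grounded interaction synthesis. It builds environments from MCP servers and generates tasks with structure-guided planning, substantially improving agent performance on benchmarks with better data efficiency.
Abstract (Chinese Translation)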

通用智能体智能的核心在于与多样化的真实世界工具交互以完成复杂任务的能力,而这种能力从根本上取决于交互数据的质量。为规避高昂的人工标注成本,现有范式完全依赖大型语言模型(LLMs)来扩展智能体环境与任务的合成规模。然而,这种无约束生成往往退化为对大型语言模型(LLMs)内部先验的有偏随机采样,无法捕捉真实世界领域的多样性与难度,也难以构建高保真、长视距的任务。本文引入了扎根智能体交互合成(GAIS),这是一种通过两阶段扎根机制自动化构建多样化环境与复杂任务的框架。具体而言,我们构建源自真实世界模型上下文协议(MCP)服务器的协议锚定环境,以确保功能多样性与任务难度。随后,我们采用结构引导规划来遍历这些环境,主动强制执行逻辑依赖与对抗性策略,从而生成复杂任务。在 BFCL、τ²-Bench 和 ACEBench 上的实验表明,GAIS 合成的数据显著优于现有基线方法,使基础模型能够匹配甚至超越其官方指令微调版本。此外,GAIS 展现出卓越的数据效率与可扩展性,在显著减少数据量的情况下实现卓越能力,同时在基线方法停滞不前时仍能保持持续增长。我们的代码与数据集已公开,可在 https://github.com/Eric8932/GAIS 获取。

Abstract

General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Models (LLMs) to scale the synthesis of agentic environments and tasks. However, such unconstrained generation often degenerates into biased random sampling of LLMs' internal priors, failing to capture the diversity and difficulty of real-world domains or construct high-fidelity, long-horizon tasks. In this work, we introduce Grounded Agentic Interaction Synthesis (GAIS), a framework that automates the scalable construction of diverse environments and complex tasks via a two-phase grounding mechanism. Specifically, we construct protocol-anchored environments derived from real-world Model Context Protocol (MCP) servers to ensure functional diversity and difficulty. Subsequently, we employ structure-guided planning to navigate these environments, actively enforcing logical dependencies and adversarial policies to generate complex tasks. Experiments on BFCL, $τ^2$-Bench, and ACEBench demonstrate that GAIS-synthesized data significantly outperforms state-of-the-art baselines, enabling base models to match or even surpass their official instruction-tuned counterparts. Furthermore, GAIS exhibits superior data efficiency and scalability, achieving exceptional capabilities with significantly less data while maintaining continuous growth where baselines stagnate. Our code and dataset are publicly available at https://github.com/Eric8932/GAIS.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 4.0/10 6.0

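Scoring rationale: The paper's core is agentic interaction data synthesis (GAIS), centered on MCP environments and task planning. It is unrelated to architectural components such as Tokenizer and Visual Encoder, and it does not address Unify Models. World Models and model-based RL have some relevance because of the environment construction and planning. MLLM and MultiModal score low because the work is primarily LLM-based rather than explicitly multimodal. None of the designated expert authors appear in the author list, so no bonus is applied.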

Keywords

Grounded Agentic Interaction Synthesis, Agentic Capabilities, Model Context Protocol, Structure-guided Planning, Environment Construction, Task Synthesis, Data Efficiency

Score: 25.5 / 27.8
Authors: Huayang Huang, Ruoyu Wang, Jinhui Zhao, Wei Deng, Daiguo Zhou, Jian Luan, Yu Wu, Ye Zhu
Published: 2026-06-01
TL;DR: This paper proposes Geometry-Aware Distillation (GAD) to restore initial noise sensitivity in text-to-image distillation, improving diversity and control capabilities while maintaining visual fidelity.
Abstract (Chinese Translation)

生成式蒸馏通过将多步轨迹压缩至少步学生模型,同时保持感知质量,显著加速了文本到图像(T2I)生成。然而,现有方法主要优化效率和输出保真度,往往忽略了原始轨迹的关键特性。在这项工作中,我们识别出一个缺失的关键属性:对初始噪声的敏感性,其退化损害了依赖于基于噪声的优化与操控的下游控制方法。我们将这一问题归因于强制逐点输出对齐的标准蒸馏目标,这无意中平坦化了输入 - 输出景观,并抑制了教师模型的局部几何结构。为此,我们提出几何感知蒸馏(GAD),这是一种保敏感框架,旨在对齐教师模型与学生模型的局部函数行为。具体而言,GAD 匹配关于输入噪声的雅可比 - 向量积,使学生能够复现教师模型对扰动的微分响应。在多个 T2I 范式及噪声驱动控制任务上的广泛实验表明,GAD 显著恢复了敏感性并提升了多样性,同时保持了高视觉保真度。代码可在 https://github.com/Hannah1102/GAD 获取。

Abstract

Generative distillation significantly accelerates text-to-image (T2I) generation by compressing multi-step trajectories into few-step student models while preserving perceptual quality. However, existing methods primarily optimize efficiency and output fidelity, often neglecting critical properties of the original trajectory. In this work, we identify a key missing property: sensitivity to initial noise, whose degradation impairs downstream control methods relying on noise-based optimization and manipulation. We trace this issue to standard distillation objectives that enforce pointwise output alignment, inadvertently flattening the input-output landscape and suppressing the teacher's local geometric structure. To address this, we propose Geometry-Aware Distillation (GAD), a sensitivity-preserving framework that aligns the local functional behavior of teacher and student models. Specifically, GAD matches Jacobian-vector products with respect to input noise, enabling the student to reproduce the teacher's differential response to perturbations. Extensive experiments across multiple T2I paradigms and noise-driven control tasks demonstrate that GAD significantly restores sensitivity and improves diversity while maintaining high visual fidelity. Code is available at https://github.com/Hannah1102/GAD.
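The idea of matching Jacobian-vector products with respect to the input noise can be sketched on toy functions. The sketch below approximates each JVP with central finite differences and compares teacher and student with a squared error; both choices are assumptions for illustration, and the function names are hypothetical. A real implementation would more likely use automatic differentiation on the actual generators.

```python
import numpy as np

def jvp_fd(f, z, v, eps=1e-4):
    """Central finite-difference approximation of the JVP (df/dz at z) applied to v."""
    return (f(z + eps * v) - f(z - eps * v)) / (2.0 * eps)

def jvp_matching_loss(teacher, student, z, v):
    """Mean squared gap between student and teacher responses to a noise perturbation."""
    gap = jvp_fd(student, z, v) - jvp_fd(teacher, z, v)
    return float(np.mean(gap ** 2))

# Toy "teacher": a linear map, so its JVP along v is simply A @ v.
A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
teacher = lambda z: A @ z

z0 = np.array([0.5, -1.0])   # initial noise sample
v = np.array([1.0, 1.0])     # perturbation direction

# A "flattened" student reproduces the teacher's output at z0 exactly
# but ignores the noise, so its sensitivity is zero.
flat_student = lambda z: A @ z0

pointwise_gap = float(np.mean((flat_student(z0) - teacher(z0)) ** 2))
loss_flat = jvp_matching_loss(teacher, flat_student, z0, v)
loss_same = jvp_matching_loss(teacher, teacher, z0, v)
```

The flat student has zero pointwise error at `z0` yet a large JVP gap, which is the failure mode the abstract attributes to pointwise output alignment.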

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 6.0/10 9.0
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on text-to-image (T2I) distillation and noise sensitivity restoration. It shows moderate relevance to MultiModal because of text-image generation and weak relevance to MLLM and Unify Models (distillation involves teacher-student alignment but not architectural unification). It has low relevance to Tokenizer, Visual Encoder (not the core focus), World Models, and model-based RL (no reinforcement learning involved). None of the target expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are listed on the paper.

Keywords

Text-to-Image Distillation, Noise Sensitivity, Geometry-Aware Distillation, Jacobian-vector products, Generative distillation, Teacher-student alignment, Local geometric structure

Score: 25.5 / 27.8
Authors: Hao Wei, Yanhui Zhou, Chenyang Ge, Saeed Anwar, Ajmal Mian
Published: 2026-06-01
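TL;DR: This paper proposes SPRDiff, a diffusion-based architecture that fuses semantic and pixel representations to achieve high-fidelity image compression and reconstruction at ultra-low bitrates.
Abstract (Chinese Translation)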

大多数现有的极端压缩方法难以实现最优的率失真感知(RDP)权衡,因为它们通常优先考虑感知保真度和视觉真实性,而忽视了像素级精度。因此,所得的重建图像往往与原始图像存在显著偏差。超低比特率图像压缩至关重要——不仅旨在生成极紧凑的表示,还需确保重建图像在语义上保持一致,且在像素级别忠实于源图像。为此,我们提出 SPRDiff(一种基于扩散的压缩方法),该方法充分利用语义和像素表示,从而在超低比特率约束下增强重建保真度。具体而言,我们设计了一种三编码器架构,利用预训练的失真导向编码器和语义导向编码器的高保真特征,以补偿冻结的 VAE 编码器所提取的有限表示,从而改进潜在压缩和熵建模。为进一步增强扩散模型的重建保真度,我们引入了一种具有双特征提取机制的失真感知重建模块。该模块不仅能生成保留主要结构的粗略重建,还能提供实用且准确的语义级和像素级条件信号,以指导扩散模型。在基准数据集上的广泛实验表明,我们的方法在极低比特率(低于 0.03 bpp)下的率失真感知权衡上优于最先进方法,有效保持了重建图像中的感知质量和像素级保真度。我们将在此处 https://github.com/cshw2021/SPRDiff 发布源代码及训练模型。

Abstract

Most existing extreme compression methods fail to achieve an optimal rate-distortion-perception trade-off, as they typically prioritize perceptual fidelity and visual realism over pixel-level accuracy. Consequently, the resulting reconstructions often deviate noticeably from the originals. Ultra-low bitrate image compression is therefore crucial, not only for producing extremely compact representations but also for ensuring that reconstructed images remain semantically coherent and faithful to the source at the pixel level. To this end, we propose SPRDiff, a diffusion-based compression method that fully leverages both semantic and pixel representations, thereby enhancing reconstruction fidelity under ultra-low bitrate constraints. Specifically, we develop a triple-encoder architecture that utilizes high-fidelity features from the pretrained distortion-oriented and semantic-oriented encoders to compensate for the limited representations extracted by the frozen VAE encoder, thereby improving latent compression and entropy modeling. To further enhance the reconstruction fidelity of diffusion models, we introduce a distortion-aware reconstruction module with dual feature extraction. This module not only generates a coarse reconstruction that preserves the main structures, but also provides practical and accurate semantic- and pixel-level conditional signals to guide the diffusion model. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in the rate-distortion-perception trade-off at extremely low bitrates (below 0.03 bpp), effectively preserving both perceptual quality and pixel-wise fidelity in the reconstructed images. We will release the source code and trained models at https://github.com/cshw2021/SPRDiff.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 6.0/10 9.0
World Models 1.5 2.0/10 3.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 1.0/10 1.5

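Scoring rationale: The paper focuses on ultra-low-bitrate image compression using a diffusion model and a triple-encoder architecture. Visual encoders are used explicitly, so Visual Encoder relevance is fairly high (6.0); the fusion of semantic and pixel representations relates partly to MultiModal (3.0) and Unify Models (2.0). The paper does not involve the core mechanisms of language models (MLLM), reinforcement learning (model-based RL), an explicit Tokenizer, or World Models, so those scores are very low (1.0-2.0). The weighted total is 25.5, below the dynamic passing score of 27.8. None of the designated expert authors appear in the author list, so no bonus is applied.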

Keywords

Ultra-Low Bitrate Image Compression, Diffusion-based Compression, Semantic and Pixel Representations, Triple-Encoder Architecture, Rate-Distortion-Perception Trade-off, VAE Encoder, Distortion-Aware Reconstruction

Score: 24.0 / 27.8
Authors: Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu
Published: 2026-06-01
TL;DR: SMH-Bench introduces a comprehensive smart-home benchmark revealing that frontier LLMs struggle with automation task scheduling and personalized reasoning in complex environments despite strong performance on explicit control tasks.
Abstract (Chinese Translation)

智能家居正演变为复杂的状态依赖型生活环境,需要大型语言模型(LLMs)对用户意图、偏好及多设备交互进行推理。然而,现有的智能家居基准测试通常专注于静态指令到 API 的映射或有限的仿真,未能评估大型语言模型是否能在真实的家庭场景中可靠地推理、交互和执行。为了解决这些局限性,我们提出了 SMH-Bench,一个用于评估智能家居环境中大型语言模型的综合基准。基于 HomeEnv(一个可执行且可验证的智能家居模拟器),SMH-Bench 包含 1100 个高质量任务,涵盖 7 个类别和 22 个细粒度子类别。它进一步将任务分布于简单、中等和复杂家居层级,范围从小公寓到拥有 135 个设备的密集多房间环境。实验表明,尽管前沿大型语言模型在显式控制和查询任务上表现强劲,但在自动化任务调度、歧义处理和个性化推理方面仍存在显著弱点,尤其是随着家居复杂度的增加。我们希望 SMH-Bench 能促进更可靠、上下文感知且实际可部署的智能家居智能体的发展。

Abstract

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.

Scoring Details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 2.0/10 3.0

Scoring rationale: This paper focuses on benchmarking LLM agents rather than on model architecture or learning algorithms. Keywords such as Tokenizer and Unify Models are irrelevant (low scores). World Models and MultiModal have moderate relevance because of the simulator and environment context, while MLLM and model-based RL are not core contributions, since the paper evaluates existing LLMs rather than proposing new architectures or RL methods.

关键词

SMH-Bench, LLM Agents, Smart Homes, Environment-Grounded Reasoning, HomeEnv, Benchmarking, Task Scheduling

Score: 24.0 / 27.8
Authors: Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin, Bing Li, Xiyan Gao, Peng Fu, Jing Huang, Can Ma
Published: 2026-06-01
TL;DR: This paper proposes DocFormBench and DocFormFlow to improve content-aware document formatting accuracy and efficiency by decoupling target localization from modification execution.
Abstract (Chinese translation)

大型语言模型(LLMs)的最新进展为自动化文档格式化开辟了新的可能性。然而,实际场景中的格式化往往需要根据文档内容来识别目标。这种内容感知场景仍然具有挑战性且研究不足,主要原因是缺乏专用的评估数据集。为了在真实的内容感知场景中实现评估,我们引入了 DocFormBench,这是一个将 Text-to-Format 评估扩展到多样化格式化要求的基准,同时包含准确率和效率指标。为了减轻现有方法在格式化过程中冗余的文档阅读,我们提出了 DocFormFlow,这是一种工作流格式化方法,将目标定位与修改执行解耦为格式化内容与方式。在多个大型语言模型和多模态模型上的广泛实验表明,与代表性基线相比,DocFormFlow 一致地提高了格式化准确率并减少了 Token 消耗。进一步分析表明,精确的目标定位是影响格式化性能的主要因素。我们希望 DocFormBench 和 DocFormFlow 能促进未来研究,朝着更智能、更可靠的文档格式化方向发展。

Abstract

Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation datasets. To enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and efficiency. To mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization (what to format) from modification execution (how to format it). Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 4.0/10 6.0
MultiModal 1.5 5.0/10 7.5
model-based RL 1.5 0.0/10 0.0

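Scoring rationale: The paper's core contribution is a document-formatting benchmark (DocFormBench) and a workflow method (DocFormFlow) that use LLMs and multimodal models to improve formatting accuracy and efficiency. It is entirely unrelated to 'World Models' and 'model-based RL'; 'Unify Models' and 'Visual Encoder' have low relevance because the paper involves neither unified model architectures nor encoder design; 'Tokenizer' is touched on only through reduced token consumption, not tokenizer design; 'MLLM' and 'MultiModal' have moderate relevance because the experiments include multimodal models. The weighted total of 24.0 is below the dynamic passing score of 27.8.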
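The weighted total quoted in the rationale can be reproduced from the score table above. The sketch below is an illustration only: the function name `weighted_total`, the `>=` pass rule, and treating 27.8 as a fixed constant are assumptions, since the report does not show how the dynamic threshold is derived or how ties are handled.

```python
# Hypothetical reconstruction of the report's scoring arithmetic:
# score = weight * relevance, summed over all keywords.
WEIGHT = 1.5           # every keyword in the table carries the same weight
PASS_THRESHOLD = 27.8  # dynamic passing score as printed in the report (assumed fixed here)

# Relevance values (out of 10) copied from the table for this paper.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 3.0,
    "Visual Encoder": 2.0,
    "World Models": 0.0,
    "MLLM": 4.0,
    "MultiModal": 5.0,
    "model-based RL": 0.0,
}


def weighted_total(relevance_by_keyword, weight=WEIGHT):
    """Sum weight * relevance over all keywords."""
    return sum(weight * r for r in relevance_by_keyword.values())


total = weighted_total(relevance)
passed = total >= PASS_THRESHOLD  # assumption: meeting the threshold counts as a pass
print(f"Score: {total:.1f} / {PASS_THRESHOLD} ({'pass' if passed else 'fail'})")
```

The same arithmetic applies to every entry in this report, because all seven keywords share the weight 1.5.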

Keywords

Document Formatting, Large Language Models, Benchmark, Workflow, Target Localization, Token Consumption, Multimodal Models, Content-Aware

Score: 24.0 / 27.8
Authors: Zhi-Kai Chen, Jun-Peng Jiang, Jun-Jie Tao, De-Chuan Zhan, Han-Jia Ye
Published: 2026-06-01
TL;DR: Polaris introduces a retrieval framework that selects and integrates suitable image generation models from a large library to fulfill personalized style instructions without requiring additional training.
Abstract (Chinese translation)

用户日益期望图像生成模型(image generation models)能够快速适应高度多样化和个性化的需求,例如生成具有独特风格或特征的图像。传统方法依赖于微调(fine-tuning),成本高且难以扩展。为应对这些局限,社区已积累了一个不断增长的微调模块和适配器(adapters)库,其中每个组件针对特定的生成需求,共同构成了应对新需求的基础。这自然引发一个问题:与其重复训练新模型,我们能否系统性地利用这一不断扩展的生态系统,以更好地满足用户指令?为此,我们提出了 Polaris,一种基于用户指令从模型库中自动选择和整合合适模型的智能检索框架。关键洞察在于,利用如此庞大且异构的模型池(pool)不仅需要从数千个候选中找到最相关的模块,还需要有效地将它们对齐,以实现基于指令的生成和编辑。Polaris 通过索引超过 6,500 个检查点(checkpoints)和 75,000 个适配器,并根据用户的输入和指令检索最相关的组件,从而应对这一挑战。通过这种方式,它实现了可扩展、可控且良好对齐的生成——无需任何额外训练。

Abstract

Users increasingly expect image generation models to quickly adapt to highly diverse and personalized requirements, such as producing images with distinctive styles or characteristics. Traditional approaches rely on fine-tuning, which is costly and difficult to scale. To cope with these limitations, the community has accumulated a growing library of fine-tuned modules and adapters, where each component targets specific generation needs and collectively serves as a foundation for handling new demands. This naturally raises a question: instead of repeatedly training new models, can we systematically exploit this expanding ecosystem to better fulfill user instructions? To this end, we present Polaris, an intelligent retrieval framework that automatically selects and integrates suitable models from the model library based on a user's instructions. The key insight is that harnessing such a massive and heterogeneous pool requires not only finding the most relevant modules among thousands of candidates, but also aligning them effectively for instruction-driven generation and editing. Polaris addresses this challenge by indexing over 6,500 checkpoints and 75,000 adapters, and retrieving the most relevant components given a user's input and instruction. In doing so, it delivers scalable, controllable, and well-aligned generation -- without any additional training.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 4.0/10 6.0
MultiModal 1.5 6.0/10 9.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper presents Polaris, a retrieval framework for selecting image generation models/adapters based on instructions. It has low relevance to World Models and Model-Based RL (0 score) as it focuses on static image generation rather than environment modeling or reinforcement learning. It scores moderately on MultiModal (text-image interaction) and MLLM (instruction handling), but does not focus on Tokenizers or Visual Encoders as core contributions. No expert authors from the specified list are present in the author list.

Keywords

Image Generation, Instruction-Guided, Model Retrieval, Personalized Style, Adapter Integration, Scalable Generation, Model Library

Score: 22.5 / 27.8
Authors: Louis Mouchon
Published: 2026-06-01
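TL;DR: Echo is an audio system that uses a ViT encoder with a shared latent space to unify speaker diarization, speech recognition, and source separation, with no per-task fine-tuning.
Abstract (Chinese translation)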

我们提出 Echo,这是一个围绕单个 2500 万参数 ViT 编码器构建的概念验证(PoC)音频系统。该编码器首先使用 JEPA 目标进行预训练,随后分阶段专门化,以便在同一个 512 维潜在空间中编码说话人身份、语音内容和动态源路由,且在部署时无需进行任务特定微调。轻量级输出头负责处理说话人分离(ArcFace + VBx)以及动态源分离(零目标 K 集预测)。在未知 K 值的合成 VoxCeleb2 混合语料上,该标准架构达到了 15.00% 的盲检测错误率(DER)、97.80% 的排列不变训练(PIT)分离准确率,以及 +9.52 dB 的潜在尺度不变信干比(SI-SDR),并在保留的 k-NN 探针上实现了 +53.50 点的说话人/内容解耦差距。Echo 的核心意义并非在任何单一任务上取得新的最先进(SOTA)结果,而是在此模型规模下实现三个任务在同一编码器上的联合共存。我们分阶段记录设计过程,报告遇到的失败尝试,并通过向量量化(VQ)瓶颈识别出端到端自动语音识别(ASR)中的结构瓶颈,该瓶颈目前仍限制了概念验证(PoC)的性能。

Abstract

We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 4.0/10 6.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

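Scoring rationale: The paper centers on audio multi-task learning with a ViT encoder and a JEPA objective. 'Unify Models' scores 5 because three tasks are unified on one encoder; 'Visual Encoder' scores 4 because a ViT architecture is used, though applied to audio; 'Tokenizer' and 'World Models' score 3 because of the VQ bottleneck and the predictive architecture. 'MLLM', 'MultiModal', and 'model-based RL' score 0 because the paper is single-modality audio with no language model or reinforcement learning. The weighted total of 22.5 is below the dynamic passing score of 27.8. None of the specified experts appear in the author list.

Keywords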

Speaker Diarization, Speech Recognition, Shared Latent Space, ViT Encoder, JEPA, Multi-task Learning, Audio System

Score: 22.5 / 27.8
Authors: Bin Chen, Xinye Liao, Yiming Liu, Xin Liao, Chonghan Liu
Published: 2026-06-01
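TL;DR: This paper proposes Credit-Attenuated Privileged Feedback (CAPF), which uses verifier-side information to guide a search agent in revising its rollouts, raising Qwen3-4B's average exact-match score on seven open-domain QA benchmarks from 44.7% to 48.5%.
Abstract (Chinese translation)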

近期的大语言模型搜索代理利用基于可验证奖励的强化学习(RLVR),从结果奖励中学习搜索增强推理。在困难问题上,这些代理很少采样端到端成功的轨迹(rollouts),导致仅基于结果的 RLVR 仅有少量正奖励轨迹。我们认为,改进此类问题的学习需要训练过程中的额外指导,而 RLVR 本身已包含可提供该指导的验证器侧信息。这些信息能够识别代理提交答案中的错误或遗漏,并在轨迹内指导修订。我们提出了一种训练时机制,称为信用衰减特权反馈(CAPF),该机制通过训练期间的特权反馈调用使这种验证器侧信息可用。CAPF 允许策略将零奖励尝试修订为正奖励修复轨迹,并衰减反馈调用及先前动作的信用,以适应无需该调用的部署场景。实证研究表明,CAPF 将 Qwen3-4B 在七个开放域问答基准上的平均精确匹配分数,从仅基于结果 RLVR 下的 44.7% 提升至 48.5%。

Abstract

Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcome-only RLVR with few positive-reward trajectories. We argue that improving learning on such problems requires additional guidance during training, and RLVR already contains verifier-side information that can provide it. This information can identify errors or omissions in the agent's submitted answer and guide revision within the rollout. We propose a training-time mechanism called Credit-Attenuated Privileged Feedback (CAPF), which makes this verifier-side information available through a Privileged Feedback call during training. CAPF lets the policy revise zero-reward attempts into positive-reward repair trajectories and attenuates credit for the feedback call and earlier actions to accommodate deployment without this call. Empirically, CAPF improves Qwen3-4B's average exact-match score from 44.7% under outcome-only RLVR to 48.5% on seven open-domain QA benchmarks.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 4.0/10 6.0

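Scoring rationale: The paper focuses on a training mechanism (CAPF) for reinforcement-learning search agents that uses verifier-side information to improve rollouts, not on unified multimodal architectures, visual encoding, or world-model construction. Although it involves a large language model (Qwen3-4B), its core is the RL training strategy rather than model representation or modality fusion, so relevance to the multimodal and world-model keywords is low; only the reinforcement-learning keyword is somewhat related.

Keywords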

Search-Agent Rollouts, Credit-Attenuated Privileged Feedback, Reinforcement Learning with Verifiable Rewards, Search-Augmented Reasoning, Verifier-Side Information, Open-Domain QA, Qwen3-4B

Score: 22.5 / 27.8
Authors: Peijia Qin, Qi Cao, Pengtao Xie
Published: 2026-06-01
TL;DR: ATLAS introduces an agentic test-time scaling framework where an LLM orchestrator dynamically allocates compute and solvers, improving performance on reasoning and multimodal benchmarks compared to fixed-workflow baselines.
Abstract (Chinese translation)

测试时扩展已成为提升大语言模型推理能力的主要方法,但其编排机制仍由人工预设:固定的样本预算、固定的精炼循环、固定的评分规则或固定的搜索策略决定了计算资源的分配,使得模型仅负责求解任务,而不负责编排过程。我们提出了 ATLAS,这是一种基于智能体的测试时扩展框架,其中 LLM 编排器端到端地掌控控制回路。通过一个名为 explore 的单一动作,编排器向原始问题调度一个新的独立求解器,从而决定是否需要收集更多证据、何时停止以及如何综合最终答案;该动作空间是可扩展的,每次 explore 调用可选地指定求解器、推理预算或提示策略。我们在涵盖科学问答、代码生成和多模态推理的四个基准测试上评估 ATLAS,基于 Claude Sonnet 4.6 骨干模型,结果显示其在 HLE-Verified 上达到 56.00%,在 LiveCodeBench 上达到 82.29%,在 GPQA-Diamond 上达到 85.75%,在 BabyVision 上达到 23.71%,同时使用的 API 调用远少于固定工作流基线。一种多模型扩展版本 ATLAS-MM,将求解器选择作为额外的动作维度,进一步将 HLE-Verified 提升至 60.00%,LiveCodeBench 提升至 85.63%,同时在 GPQA-Diamond 和 BabyVision 上也取得了稳定的提升。消融实验显示,若用独立的整合器替换编排器的直接综合,在四个基准中的三个上会导致精度下降或未获提升,这与状态化证据管理在产生增益中的作用一致。

Abstract

Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator's direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 4.0/10 6.0
MultiModal 1.5 6.0/10 9.0
model-based RL 1.5 2.0/10 3.0

Scoring rationale: The paper focuses on agentic test-time scaling for LLMs rather than model architecture unification, tokenization, visual encoders, world models, or model-based RL. It shows moderate relevance to MultiModal (BabyVision benchmark, ATLAS-MM) and MLLM (Claude Sonnet backbone), but low relevance to the core technical keywords. No expert authors from the specified list were found.

Keywords

Test-time scaling, Agentic framework, LLM orchestration, Solver dispatch, Multimodal reasoning, BabyVision, ATLAS-MM

Score: 22.5 / 27.8
Authors: Yilin Lyu, Mark YY Chan, Ching-Hui Sia, Lei Li
Published: 2026-06-01
TL;DR: This paper proposes a geometry-motion embedded model to reconstruct personalized 3D myocardial infarct geometry from contrast-free cine MRI for cardiac digital twins, achieving high Dice scores and consistent electrophysiological simulation results.
Abstract (Chinese translation)

准确的心肌梗死(MI)的三维几何表征对于构建心脏数字孪生(CDTs)以精确模拟梗死相关的电生理学至关重要。钆延迟增强磁共振成像(LGE MRI)是定位心肌梗死的临床金标准,但其依赖对比剂的特性限制了其在肾功能受损患者中的应用,并限制了纵向随访的进行。作为一种替代方案,无对比剂电影磁共振成像能够可视化异常的心室壁运动,而这高度指示了梗死区域。本研究提出了一种新颖的显式几何 - 运动嵌入模型,旨在直接从多视图电影磁共振成像中全自动重建个性化、可直接用于仿真的心肌梗死三维几何结构。具体而言,我们构建了一个 4D(3D + 时间)双心室网格,以显式提取并解耦几何感知特征与运动感知特征。此外,我们设计了一个双分支模块,用于自适应的几何 - 运动融合,以捕捉时空依赖关系,从而映射梗死区域。此外,我们引入了多尺度监督机制,利用基于美国心脏协会 17 段(AHA-17 segment)引导的交叉注意力机制来引导预测,确保重建结果符合生物物理一致性。在 225 例电影磁共振成像上的实验结果表明,所提出的心肌梗死三维重建方法取得了高性能,平均 Dice 系数为 0.678 ± 0.011。在下游的计算机仿真电生理模拟评估中,结果与基于 LGE 得到的金标准高度一致,突显了所提出模型在无对比剂瘢痕表征以及无缝集成到心脏数字孪生建模中的巨大潜力。代码将在手稿被录用发表后公开发布。

Abstract

Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to fully automatically reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion to capture spatiotemporal dependencies for mapping the infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of 0.678 ± 0.011. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
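For readers unfamiliar with the Dice score reported in the abstract, the following minimal sketch computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), on two binary masks. It assumes flattened 0/1 masks of equal length; the abstract does not say whether the authors evaluate on voxels or mesh elements, so `dice_score` and the toy masks are illustrative, not the paper's evaluation code.

```python
def dice_score(pred, target):
    """Dice coefficient between two equal-length binary masks (sequences of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    # Elements labeled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total number of positives across the two masks.
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        # Convention assumed here: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / size_sum


# Toy example: one of two predicted positives overlaps the reference mask.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
print(dice_score(predicted, reference))
```

A score of 1.0 means perfect overlap and 0.0 means none, which gives a scale for reading the reported average of 0.678.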

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 5.0/10 7.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 5.0/10 7.5
model-based RL 1.5 0.0/10 0.0

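Scoring rationale: The paper focuses on medical image reconstruction (myocardial infarct geometry) and cardiac digital twins, a domain mismatch with the keyword set, which mainly targets multimodal large models and reinforcement learning. 'World Models' is conceptually close to cardiac digital twins, 'MultiModal' is reflected in the multi-view input and geometry-motion feature fusion, and 'Visual Encoder' is implicit in image feature extraction, but the relevance is only moderate. 'Unify Models' applies only to feature fusion, not to unifying model architectures. 'Tokenizer', 'MLLM', and 'model-based RL' have no direct counterpart in the paper and score 0. None of the specified experts appear in the author list, so no bonus is added.

Keywords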

Myocardial Infarct Reconstruction, Cardiac Digital Twins, Cine MRI, Geometry-Motion Fusion, 3D Mesh Reconstruction, Electrophysiological Simulation, Contrast-free Imaging

Score: 21.0 / 27.8
Authors: Zixin Zhang, Fan Qi, Shuai Li, Xiaoshan Yang, Changsheng Xu
Published: 2026-06-01
TL;DR: This paper proposes FedMChain, a balanced multimodal federated learning framework that mitigates modality competition via chained modality-wise phases and sparse aggregation, improving performance while reducing communication overhead.
Abstract (Chinese translation)

多模态联邦学习(MMFL)允许在具有异构数据和模态可用性的去中心化客户端之间进行隐私保护的协作学习。然而,现有的大多数 MMFL 方法将多模态训练视为联合优化问题,忽视了一个关键瓶颈:模态竞争,即主导模态抑制较弱的模态,从而导致次优的全局模型。为了解决这一问题,我们提出了 FedMChain,这是一种平衡的 MMFL 框架,它将联邦多模态训练结构化为一连串的模态阶段。这种分阶段的设计为每个模态在多模态客户端上提供了专用的局部优化窗口,以缓解模态竞争,并通过误差补偿正则化项进一步促进模态间互补性。在服务器端,我们采用稀疏符号引导聚合策略,利用方向符号一致性实现鲁棒的模态内聚合,避免破坏性平均,并支持较低频率的同步以减少通信开销。在多个多模态基准上的广泛实验表明,FedMChain 在提高预测性能的同时,比基线方法需要更少的通信频率。

Abstract

Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative learning across decentralized clients with heterogeneous data and modality availability. However, most existing MMFL methods cast multimodal training as a joint optimization problem, overlooking a key bottleneck: modality competition, where dominant modalities suppress weaker ones and lead to suboptimal global models. To address this, we propose FedMChain, a balanced MMFL framework that structures federated multimodal training as a chain of modality-wise phases. This phase-wise design gives each modality a dedicated local optimization window on multimodal clients to mitigate modality competition, and further promotes cross-modal complementarity via an error-compensated regularizer. On the server side, we employ a sparse sign-guided aggregation strategy that leverages directional sign agreement for robust intra-modality aggregation, avoids destructive averaging, and supports less frequent synchronization to reduce communication overhead. Extensive experiments on multimodal benchmarks demonstrate that FedMChain consistently improves predictive performance while requiring less frequent communication than baselines.
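The abstract does not spell out the sparse sign-guided aggregation rule, so the sketch below shows only one plausible reading of "directional sign agreement": per coordinate, keep the update only when enough clients agree on its sign, and average only the agreeing values. The function name `sign_guided_aggregate` and the `min_agreement` threshold are hypothetical and are not taken from the paper.

```python
def sign_guided_aggregate(updates, min_agreement=0.6):
    """Aggregate client updates coordinate-wise using sign agreement.

    updates: list of equal-length lists of floats, one per client.
    min_agreement: hypothetical fraction of clients that must share the
        majority sign for a coordinate to be kept (others are zeroed).
    """
    n_clients = len(updates)
    dim = len(updates[0])
    aggregated = []
    for j in range(dim):
        column = [u[j] for u in updates]
        n_pos = sum(1 for v in column if v > 0)
        n_neg = sum(1 for v in column if v < 0)
        # Keep only the values pointing in the majority direction.
        if n_pos >= n_neg:
            agreeing = [v for v in column if v > 0]
        else:
            agreeing = [v for v in column if v < 0]
        if not agreeing or len(agreeing) / n_clients < min_agreement:
            # Too little agreement: drop the coordinate (sparsification).
            aggregated.append(0.0)
        else:
            # Average only agreeing clients, so opposing updates do not cancel.
            aggregated.append(sum(agreeing) / len(agreeing))
    return aggregated


client_updates = [
    [1.0, -2.0, 0.5],
    [3.0, 2.0, 0.5],
    [2.0, -1.0, -0.5],
]
print(sign_guided_aggregate(client_updates, min_agreement=0.6))
```

Averaging only the agreeing clients is one way to avoid the "destructive averaging" the abstract mentions, where opposite-signed updates cancel each other out.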

Score details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 9.0/10 13.5
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on Multimodal Federated Learning (MMFL) and modality competition. 'MultiModal' is highly relevant (9/10) as it is the core domain. 'Unify Models' has slight relevance regarding unifying modalities in FL (3/10), and 'Visual Encoder' has minor relevance due to multimodal context (2/10). 'Tokenizer', 'World Models', 'MLLM', and 'model-based RL' are unrelated to the federated learning optimization framework presented. Total weighted score is 21.0, below the dynamic passing threshold of 27.8. No specified expert authors are found in the author list.

Keywords

Multimodal Federated Learning, Modality Competition, FedMChain, Chained Modality Optimization, Sparse Sign-guided Aggregation, Privacy-preserving, Cross-modal Complementarity, Decentralized Clients

Score: 21.0 / 27.8
Authors: Mingju Chen, Can Lv, Guibin Zhang, Heng Chang, Shiji Zhou
Published: 2026-06-01
TL;DR: HarnessForge introduces a meta-adaptive framework that jointly evolves the harness and policy of LLM agent systems to enhance performance across heterogeneous task regimes, outperforming single-component adaptation baselines.
Abstract (Chinese translation)

大语言模型(LLM)代理日益被期望在需要不同执行范式的异构任务环境下运行。这对固定代理系统构成了挑战,并促使人们寻求超越孤立组件更新的系统级元适应。虽然现有工作已调整外部 Harness(框架)或训练底层推理策略,但全系统适应仍未被充分表征。结构与执行之间的适应空间很少被明确界定,且外部 Harness 与内部推理器之间的兼容性未得到联合优化。我们提出 HarnessForge,一个用于演化 LLM 代理系统的元自适应框架。HarnessForge 将代理系统建模为 Harness-策略对,定义了一个稳定的适应空间,该空间将 Harness 级执行结构与策略级推理行为分离开来。随后,它通过故障引导的 Harness 定制和 Harness 条件化的策略对齐,实现 Harness-策略协同演化。在来自五个不同领域的基准测试上的实验表明,HarnessForge 一致改进了 Qwen3-4B 和 Qwen3-8B 骨干模型,其表现优于仅 Harness 基线和仅策略基线,相对于最强基线的增益高达 12.0%,并实现了有利的执行效率权衡,这证明了 Harness-策略协同演化是有效的,且 Harness 与推理策略之间的可执行兼容性对于代理系统适应至关重要。代码可在 https://github.com/mingju-c/HarnessForge 获取。

Abstract

LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted the external harness or trained underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness-policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness-policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs, demonstrating that harness-policy co-evolution is effective, and that executable compatibility between the harness and reasoning policy is essential for agent-system adaptation. The code is available at https://github.com/mingju-c/HarnessForge.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 2.0/10 3.0

Scoring rationale: The paper focuses on LLM agent system adaptation via harness-policy co-evolution, showing limited overlap with keywords centered on multimodal architecture (Tokenizer, Visual Encoder, MultiModal) and specific learning paradigms (World Models, model-based RL). 'Unify Models' and 'MLLM' have moderate relevance due to the unification of system components and use of Qwen3 backbones, but the core contribution is system-level meta-adaptation rather than model architecture or multimodal representation learning.

Keywords

LLM agents, Harness-policy co-evolution, Meta-adaptive framework, Adaptive agent systems, Execution structure, Reasoning behavior, System-level meta-adaptation

Score: 19.5 / 27.8
Authors: Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Maksym Chikita, Dmytro Kyrylenko, Sofiia Pidturkina, Julia Stadnyk
Published: 2026-06-01
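TL;DR: This paper proposes the interaction-native knowledge harness (InKH) for financial LLM agents; by absorbing complexity into the system, it substantially reduces latency and token cost while improving task quality and traceability.
Abstract (Chinese translation)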

金融 AI 代理往往失败有一个简单的原因:它们让用户承担复杂性。用户必须反复重申目标、风险偏好、投资组合上下文、过往判断以及不断变化的市场假设,而代理则回答、检索、行动并遗忘。在金融领域,这不仅仅是不便。在市场分析、跟单交易审查和交易准备等任务中,遗忘的背景和过时的记忆可能导致延迟、重复错误、可审计性差以及不安全决策。我们提出交互原生知识 harness (InKH),这是一种面向金融 LLM 代理的架构,能将复杂性吸收进系统。InKH 将用户、市场、投资组合及工具事件转化为结构化的操作知识。它采用被动知识注入,在主模型步骤前组装有界工作上下文缓冲区,利用时序图记忆 (temporal graph memory) 实现低延迟检索,通过维基审计表面 (wiki audit surface) 支持人类可读的治理,并具备具有成熟度、衰减及写入时失效的背景提取机制。我们在一个可复现的受控合成基准上评估 InKH,该基准包含 24 个随机种子、4 轮、每轮 80 个回合及 6 个基线,共产生 46,080 个基于基线的评估。InKH 在 900 毫秒延迟下实现了 0.815 的平均任务质量。与基于代理的维基行走记忆 (wiki-walk memory) 相比,其延迟降低了 82.95%,令牌成本降低了 82.29%,过时知识使用率降低了 96.58%,同时质量提高了 0.108,可追溯性提高了 0.461。与无失效机制的时序图系统相比,其质量提高了 0.050,过时内存使用率降低了 96.58%,且服务成本相当。结果支持金融 AI 的一项设计主张:当复杂性由系统吸收而非转移给用户时,采用才会发生。该基准验证了架构层面的行为,而非实盘交易表现。

Abstract

Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk preferences, portfolio context, past judgments, and shifting market assumptions, while the agent answers, retrieves, acts, and forgets. In finance, this is not just inconvenient. In tasks such as market analysis, copy-trading review, and trade preparation, forgotten context and stale memory can create latency, repeated errors, weak auditability, and unsafe decisions. We propose the interaction-native knowledge harness (InKH), an architecture for financial LLM agents that absorbs complexity into the system. InKH converts user, market, portfolio, and tool events into structured operational knowledge. It uses passive knowledge injection to assemble a bounded working context buffer before the main model step, temporal graph memory for low-latency retrieval, a wiki audit surface for human-readable governance, and background extraction with maturity, decay, and write-time invalidation. We evaluate InKH on a reproducible controlled synthetic benchmark with 24 random seeds, 4 rounds, 80 episodes per round, and 6 baselines, producing 46,080 baseline-conditioned evaluations. InKH achieves mean task quality of 0.815 at 900 ms latency. Compared with agent-driven wiki-walk memory, it reduces latency by 82.95 percent, token cost by 82.29 percent, and stale-knowledge usage by 96.58 percent, while improving quality by 0.108 and traceability by 0.461. Compared with a temporal-graph system without invalidation, it improves quality by 0.050 and reduces stale-memory usage by 96.58 percent with comparable serving cost. The results support a design thesis for financial AI: adoption happens when complexity is absorbed by the system rather than transferred to the user. The benchmark validates architecture-level behavior, not live trading performance.
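The evaluation count in the abstract is consistent with a full factorial grid over seeds, rounds, episodes, and baselines. The sketch below enumerates that grid to check the arithmetic; that every combination is evaluated exactly once is an assumption inferred from the numbers, not something the abstract states, and the variable names are illustrative.

```python
from itertools import product

# Benchmark dimensions as quoted in the abstract.
SEEDS = range(24)
ROUNDS = range(4)
EPISODES_PER_ROUND = range(80)
BASELINES = range(6)


def evaluation_grid():
    """Yield one (seed, round, episode, baseline) tuple per evaluation.

    Assumption: the benchmark crosses every dimension with every other.
    """
    return product(SEEDS, ROUNDS, EPISODES_PER_ROUND, BASELINES)


total_evaluations = sum(1 for _ in evaluation_grid())
print(total_evaluations)  # 24 * 4 * 80 * 6
```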

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 2.0/10 3.0

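Scoring rationale: The paper focuses on a knowledge-management architecture (InKH) for financial LLM agents, centered on a context buffer and temporal graph memory; it has little connection to the technical details of multimodality (Visual Encoder, MultiModal, MLLM), tokenizers (Tokenizer), or model-based RL. Temporal memory overlaps slightly with the World Models concept, but the paper does not build a generative world model; architectural unification appears mainly in the knowledge-processing pipeline rather than in model fusion. The weighted total of 19.5 is below the dynamic passing score of 27.8.

Keywords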

Financial LLM Agents, Interaction-Native Knowledge Harness, Temporal Graph Memory, Passive Knowledge Injection, Latency Reduction, Structured Operational Knowledge, Task Quality, Bounded Working Context Buffer

Score: 19.5 / 27.8
Authors: Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng
Published: 2026-06-01
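TL;DR: This paper presents the GAIATrace dataset and the Vidur-Agent simulator, which use trace-driven simulation to characterize the system-level behavior of multi-model agentic AI systems and show how different design choices affect task handling.
Abstract (Chinese translation)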

智能体 AI 通过基于观测结果的迭代规划、工具使用和推理来完成任务。尽管其广受欢迎,但其系统级行为仍知之甚少,尤其是在复杂数据集和智能体架构方面——这主要是由于高度非确定性执行、高昂的评估成本以及对专有模型透明度有限所致。本文提出了 GAIATrace,这是首个针对两个最先进的智能体系统(MiroThinker 和 OWL)在 GAIA 基准(一个由异构混合的通用任务组成的基准)上运行时的标记级追踪数据集。与之前的追踪数据集不同,GAIATrace 捕获了所有主要参与大语言模型(LLMs)的完整推理标记、任务级结构和活动,从而能够进行深入的系统研究。作为该数据集的补充,我们提出了 Vidur-Agent,这是一个追踪驱动的模拟器,可以重放 GAIATrace 以在多样化的模拟环境中进行可复现、低成本的系统评估。结合这两个成果,我们刻画了现代智能体系统如何处理通用任务,以及各种系统设计选择如何塑造其行为,从而得出若干独特发现。

Abstract

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures, owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and activities of every major participating LLM, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 1.0/10 1.5

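Scoring rationale: The paper focuses on trace-driven simulation of multi-model agentic systems (GAIATrace/Vidur-Agent), which lies far from the keyword set's domain (multimodal architectures, world models, RL). 'Unify Models' and 'MLLM' receive low scores (2-3) for the multi-model systems and LLM foundations involved; 'Tokenizer' scores 2 because the paper only mentions token-level tracing, not tokenizer design; the remaining keywords (Visual Encoder, World Models, model-based RL, MultiModal) score 1-3 because they are not core contributions or are barely mentioned. The weighted total of 19.5 is below the dynamic passing score of 27.8.

Keywords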

Agentic AI, Trace-driven Simulation, GAIATrace, Vidur-Agent, System Characterization, Multi-Model Agentic Systems, GAIA Benchmark

Score: 19.5 / 27.8
Authors: Yiyao Liua, Wenxiao He, Liyuan Ren, Huan Wang
Published: 2026-06-01
TL;DR: This paper proposes a Contrastive Augmented Transformer framework with domain-specific enhancement to solve metal surface defect detection challenges, achieving high accuracy and robust generalization across diverse industrial scenarios.
Abstract (Chinese translation)

金属表面缺陷检测对于维持工业制造中的产品质量至关重要。然而,它面临着显著挑战,包括标注数据有限、难以识别细微的多尺度缺陷,以及在多样化场景下泛化能力较差。为解决这些问题,本文提出了一种新颖的对比增强变换器(Contrastive Augmented Transformer, CAT)框架,用于鲁棒缺陷检测。CAT 采用分层 Swin Transformer 骨干,并重新设计了特征金字塔网络(FPN),以有效融合低层纹理与高层语义,从而实现对细微和多尺度缺陷模式的精确建模。为了增强在真实世界噪声条件下的鲁棒性,我们提出了一种领域特定的水滴增强算法。此外,我们将难例挖掘策略融入对比损失中,以增强模型在模糊缺陷区域中的判别能力。在 KolektorSDD2 数据集上的实验结果表明,CAT 达到了 99.54% 的像素级 AUROC,优于现有方法。此外,CAT 在三个未见数据集(包括 KSDD1、用于瓷砖缺陷的 MTD 以及用于轨道表面缺陷的 MSDD)上表现出卓越的泛化性和鲁棒性,展示了其在大规模工业部署中的潜力。

Abstract

Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model's discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 6.0/10 9.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on industrial metal surface defect detection using a Contrastive Augmented Transformer with Swin backbone. It shares minimal overlap with the provided keywords, which target MLLM, World Models, and Reinforcement Learning. Only 'Visual Encoder' has moderate relevance due to the use of Swin Transformer for feature extraction; others like Tokenizer, MLLM, MultiModal, World Models, and model-based RL are largely irrelevant as the paper deals with single-modality vision tasks without language modeling or reinforcement learning components. 'Unify Models' is also low as it proposes a specific framework rather than unifying diverse model architectures.

Keywords

Metal Surface Defect Detection, Contrastive Augmented Transformer, Swin Transformer, Feature Pyramid Network, Domain-specific Enhancement, Hard Negative Mining, Robust Generalization, Industrial Manufacturing

Score: 18.0 / 27.8
Authors: Yuhan Wang, Yibo Ding, Yutong Ye, Mufan Zhao, Wenbo Zhang, Ruijie Wang, Jianxin Li
Published: 2026-06-01
TL;DR: G2LoRA mitigates catastrophic forgetting in graph continual learning for text-attributed graphs via gradient orthogonal low-rank adaptation, achieving superior performance across incremental tasks.
Abstract (Chinese translation)

LLM-as-Aligner 已成为文本属性图(TAGs)的一种主流预训练范式,通过 CLIP 风格对比学习将图模态与文本模态对齐至共享嵌入空间。尽管在单个下游任务上表现有效,但我们观察到当此类模型在流式任务上顺序微调时,会出现严重的灾难性遗忘。尽管参数高效微调在一定程度上缓解了遗忘,但它仍不足以解决任务干扰问题,且知识迁移效果不佳。本文研究了 TAGs 上 LLM-as-Aligner 模型的图持续学习,旨在减轻任务干扰的同时促进任务间的正向迁移。该设置引入了两个基本挑战:(1) 异构下游任务导致优化目标漂移,阻碍统一微调;(2) 图编码器和文本编码器对适应表现出不同的敏感性,使得不协调的更新容易导致模态失配。为了解决这些挑战,本文提出了 G2LoRA,一种用于 TAGs 的持续学习框架。G2LoRA 在单一图 - 文本对齐目标下统一了节点级、链接级和图级任务,并支持在域/类/任务增量模式下实现一致优化。为了减少任务干扰同时鼓励正向迁移,G2LoRA 在结构化子空间中执行类别感知梯度投影,解决冲突更新,并启用条件反向迁移以平衡前向与后向知识流。为进一步防止跨模态漂移,G2LoRA 引入梯度幅度调制,以协调图编码器与文本编码器之间的更新速率。在基准数据集上的广泛实验表明,G2LoRA 在不同骨干架构下始终优于强基线,实现了优越的持续性能和迁移能力。

Abstract

LLM-as-Aligner has emerged as a prevalent pre-training paradigm for Text-Attributed Graphs (TAGs), aligning graph and text modalities into a shared embedding space via CLIP-style contrastive learning. While effective on individual downstream tasks, we observe severe catastrophic forgetting when such models are sequentially fine-tuned on streaming tasks. Although parameter-efficient fine-tuning alleviates forgetting to some extent, it remains insufficient to resolve task interference and ineffective knowledge transfer. In this work, we study graph continual learning for LLM-as-Aligner models on TAGs, with the goal of mitigating interference while promoting positive transfer across tasks. This setting introduces two fundamental challenges: (1) heterogeneous downstream tasks induce shifting optimization objectives, hindering unified fine-tuning; and (2) graph and text encoders exhibit different sensitivities to adaptation, making uncoordinated updates prone to misalignment. To address these challenges, we propose G2LoRA, a continual learning framework for TAGs. G2LoRA unifies node-, link-, and graph-level tasks under a single graph-text alignment objective, and enables consistent optimization across domain/class/task incremental modes. To reduce task interference while encouraging positive transfer, G2LoRA performs category-aware gradient projection in structured subspaces, resolving conflicting updates and enabling conditional backward transfer to balance forward and backward knowledge flow. To further prevent cross-modal drift, G2LoRA introduces gradient magnitude modulation to coordinate update rates between graph and text encoders. Extensive experiments on benchmark datasets demonstrate that G2LoRA consistently outperforms strong baselines across different backbone architectures, achieving superior continual performance and transferability.
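The abstract does not specify how the category-aware projection is built. As a minimal sketch of the general idea behind orthogonal gradient projection, the code below removes from a new-task gradient every component that lies in the span of some protected directions (for example, gradients stored from earlier tasks). The function name and the use of plain Gram-Schmidt are assumptions for illustration, not G2LoRA's procedure.

```python
def project_out(gradient, protected_directions, eps=1e-12):
    """Return `gradient` with its components along the protected directions removed."""
    # Build an orthonormal basis of the protected subspace with Gram-Schmidt.
    basis = []
    for direction in protected_directions:
        v = list(direction)
        for b in basis:
            coef = sum(x * y for x, y in zip(v, b))
            v = [x - coef * y for x, y in zip(v, b)]
        norm = sum(x * x for x in v) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([x / norm for x in v])

    # Subtract the projection of the gradient onto each basis vector.
    g = list(gradient)
    for b in basis:
        coef = sum(x * y for x, y in zip(g, b))
        g = [x - coef * y for x, y in zip(g, b)]
    return g


old_task_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_gradient = [1.0, 2.0, 3.0]
safe_update = project_out(new_gradient, old_task_directions)
```

The projected update is orthogonal to every protected direction, so to first order it leaves the earlier tasks' losses unchanged along those directions.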

Score details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 6.0/10 9.0
model-based RL 1.5 0.0/10 0.0
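The per-keyword scores in these tables appear to be weight times relevance, summed into the entry's total and compared with the 27.8 passing score quoted in this report. The sketch below reproduces that arithmetic for the table above; the `>=` pass rule and the names used are assumptions.

```python
PASS_THRESHOLD = 27.8  # passing score quoted in this report


def weighted_score(rows):
    """Sum of weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


rows = [
    ("Unify Models", 1.5, 3.0),
    ("Tokenizer", 1.5, 1.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 2.0),
    ("MultiModal", 1.5, 6.0),
    ("model-based RL", 1.5, 0.0),
]
total = weighted_score(rows)
passed = total >= PASS_THRESHOLD  # assumed comparison rule
print(total, passed)  # 18.0 False
```

The result matches the "Score: 18.0 / 27.8" line shown for this entry.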

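Scoring rationale: The paper focuses on graph continual learning and text-attributed graphs, whereas the provided keyword set centers on multimodal large language models (MLLM), world models, and reinforcement learning (RL). Most keywords (e.g., Visual Encoder, World Models, model-based RL, Tokenizer) are therefore largely irrelevant to the paper. Only MultiModal (graph and text modalities) and Unify Models (task unification) show some connection, and that connection is limited. None of the designated expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.

Keywords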

Graph Continual Learning, Text-Attributed Graphs, Gradient Orthogonal, Low-Rank Adaptation, Catastrophic Forgetting, LLM-as-Aligner, Parameter-efficient Fine-tuning

Score: 18.0 / 27.8
Authors: Manu Srinath Halvagal, Sebastian Lee, SueYeon Chung
Published: 2026-06-01
TL;DR: This study reveals that deep RL algorithms learn different representational invariances based on their learning objectives (value-based vs. policy-gradient), affecting transfer learning and offering insights into neural coding and LLM behavior.
Abstract (Chinese translation)

强化学习(RL)长期以来一直是神经科学中目标导向动物行为的一个模型。现代深度强化学习在许多领域取得了显著的成功,进一步巩固了这一联系。学习高维状态空间的抽象表征的能力构成了这一成功的基础。然而,对这些学习到的表征的理论理解仍然有限,阻碍了模型与动物学习之间的直接比较。我们通过马尔可夫决策过程(MDP)约简理论的视角来分析深度强化学习表征,以填补这一空白。在导航任务中研究经典的强化学习算法时,我们发现即使性能相当,基于价值的方法(DQN)学习到的表征对 MDP 同态对称性保持不变,而基于策略梯度的方法(PPO)学习到的表征对动作对称性保持不变。这些差异在不同领域中一致出现,对迁移学习产生下游影响,并且在大型语言模型(LLMs)中以提示依赖的方式呈现。我们的发现提供了一种原则性的方法来比较不同强化学习算法的学习表征,具有已证实的实际意义,并为大脑中的神经编码提供了可能的见解。

Abstract

Reinforcement Learning (RL) has long served as a model for goal-directed animal behavior in neuroscience. Modern deep RL has shown remarkable success across many domains, further strengthening this connection. The ability to learn abstract representations of high-dimensional state spaces underlies much of this success. However, theoretical understanding of these learned representations remains limited, hindering direct comparisons between models and animal learning. We address this gap by analyzing deep RL representations through the lens of MDP reduction theory. Investigating canonical RL algorithms in a navigation task, we find that even when performance is comparable, the value-based method (DQN) learns representations that are invariant to MDP homomorphism symmetries, while the policy-gradient method (PPO) learns representations invariant to action symmetries. These differences emerge consistently across domains, have downstream consequences for transfer learning, and appear in LLMs in a prompt-dependent manner. Our findings provide a principled approach to comparing learned representations across RL algorithms, with demonstrated practical implications and possible insights for neural coding in the brain.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 3.0/10 4.5

Scoring rationale: The paper focuses on theoretical analysis of deep RL representations (MDP reduction), comparing DQN and PPO. It does not involve multimodal architectures, tokenizers, visual encoders, or unified large-model architectures. Relevance to 'model-based RL' and 'MLLM' is marginal, resting on the theoretical RL context and a brief mention of LLMs, while the other keywords are irrelevant.

Keywords

Deep Reinforcement Learning, Representational Invariance, MDP Reduction, Value-based RL, Policy-gradient RL, Transfer Learning, Neural Coding

Score: 18.0 / 27.8
Authors: Pengyang Ling, Jiazi Bu, Yujie Zhou, Yibin Wang, Zhenyu Hu, Zihan Zhang, Yi Jin, Huaian Chen, Yuhang Zang
Published: 2026-06-01
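TL;DR: Pave-GRPO uses a principled average velocity decomposition to propagate reward feedback to more intermediate timesteps at no additional generation cost, improving the granularity of preference alignment for flow-based generative models.
Abstract (Chinese translation)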

基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的后训练已成为一种强大的范式,用于将基于流的生成模型(flow-based generative models)与人类偏好对齐。然而,流模型的迭代去噪特性在生成用于策略梯度(policy-gradient)更新的群体轨迹(group rollouts)时会产生显著成本,迫使现有方法仅使用极少的去噪步骤进行训练。这种时间稀疏性(temporal sparsity)严重限制了偏好优化:奖励反馈(reward feedback)仅能到达每条轨迹的少数几个阶段,导致绝大多数中间去噪步骤缺乏直接监督,从而损害了对齐的粒度。为了解决这一问题,我们提出了 Pave-GRPO,该方法通过基于原理的平均速度分解(Principled average velocity decomposition)重新表述了 GRPO 的目标函数。与生成昂贵的高步数轨迹(high-step rollouts)不同,我们保持高效的少步数群体采样(few-step group sampling),但将每个粗粒度转换(coarse transition)分解为跨越多个中间时间步的更细粒度子轨迹(finer sub-trajectories)的等效集成(ensemble)。这使得奖励反馈能够传播到更密集的时间阶段(temporal stages),以实现更全面的偏好对齐,且无需额外的生成成本。该设计提供了两大优势:(i)零成本视距(horizon)扩展:通过直接重用分段群体样本(piece-wise group samples)及其关联奖励,Pave-GRPO 在固定采样预算下显著扩大了有效优化范围;(ii)全面的时间监督:通过将瞬时速度目标(instantaneous velocity target)等效分解为多时间步集成(multi-timestep ensemble),它将奖励信号分散至去噪过程的更多中间阶段,从而实现更细粒度且更彻底的偏好优化。广泛的实验验证了 Pave-GRPO 在不同奖励设置下能有效推进偏好对齐,提供全面的性能提升。

Abstract

Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.
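Two ingredients named in the abstract can be illustrated in isolation. The first is the group-relative advantage commonly used in GRPO (reward minus group mean, divided by group standard deviation). The second is the telescoping identity that lets the average velocity of one coarse step be written as a weighted ensemble of average velocities over finer sub-steps. This is a sketch on scalar states with hypothetical function names; it is not the Pave-GRPO objective itself.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (GRPO-style advantage)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def average_velocity(x_start, x_end, t_start, t_end):
    """Average velocity of a transition from (t_start, x_start) to (t_end, x_end)."""
    return (x_end - x_start) / (t_end - t_start)


def decompose_coarse_step(states, times):
    """Split one coarse transition into (weight, sub-step average velocity) pairs.

    The weights are the sub-interval lengths divided by the full interval, so the
    weighted sum of sub-step velocities equals the coarse average velocity.
    """
    span = times[-1] - times[0]
    return [
        (
            (times[k + 1] - times[k]) / span,
            average_velocity(states[k], states[k + 1], times[k], times[k + 1]),
        )
        for k in range(len(times) - 1)
    ]


# Denoising runs from t = 1.0 to t = 0.0 with one intermediate timestep.
states = [0.0, 0.3, 1.0]
times = [1.0, 0.5, 0.0]
coarse = average_velocity(states[0], states[-1], times[0], times[-1])
ensemble = sum(w * v for w, v in decompose_coarse_step(states, times))
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because `ensemble` equals `coarse`, a reward attached to the coarse step can in principle be attributed to the intermediate timesteps without generating extra rollouts.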

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 3.0/10 4.5

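Scoring rationale: The paper focuses on preference-alignment optimization for flow-based generative models and proposes Pave-GRPO, which strengthens temporal supervision through velocity decomposition. It does not involve core components such as multimodal architectures, tokenizers, visual encoders, or unified models. Although it uses reinforcement learning (GRPO), its emphasis is on aligning generative models rather than on conventional model-based RL or world-model construction, so relevance to the given keywords is low across the board.

Keywords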

Pave-GRPO, Flow-based generative models, Group Relative Policy Optimization, Preference alignment, Velocity decomposition, Temporal supervision, Post-training

Score: 16.5 / 27.8
Authors: Ping Li, Bartlomiej Brzozka
Published: 2026-06-01
TL;DR: This paper proposes a BERT-GNN framework to extract entities and relationships from historical texts for knowledge graph construction, achieving higher precision and recall than rule-based methods.
Abstract (Chinese translation)

通过数字人文研究和规模化历史数据分析,大量的传统历史文本被转化为结构化知识图谱。本文提供了一种高层架构,该架构结合了双向编码器表示变换器(BERT)和图神经网络(GNN),旨在从各种类型的历史文本中提取实体和关系。针对传统历史文本中存在的语言歧义、语境受限的指称以及缺乏既定语法规范等问题,本研究以系统化的方式予以解决。基于上述建议,本研究开发了一种新的图像检索系统,该系统基于 FastRQNet 和预训练视觉 - 语言模型 Vilt-qaformer+RoBInet 构建。实验充分利用了市政记录、议会文件和历史书信的综合数据集。与传统基于规则的技术及其他流行的深度学习基线相比,联合 BERT-GNN 系统在精确率、召回率和 F1 分数上表现更优(表 2)。在构建知识图谱时,该结构能够以足够的准确性和全面性处理复杂嵌套结构和隐含指称问题。上述实验表明,将关系图学习算法与上下文敏感的语义表示技术相结合,可以自动提取历史数据,从而为知识库增添累积智慧。

Abstract

Through digital humanities research and large-scale historical data analysis, a significant amount of traditional historical text is being converted into structured knowledge graphs. This paper provides a high-level architecture that combines Bidirectional Encoder Representations from Transformers (BERT) and graph neural networks (GNN) to extract entities and relationships from various types of historical texts. The study systematically addresses the linguistic ambiguities, context-bound references, and lack of established grammatical norms found in traditional historical texts. This study develops a new image retrieval system based on FastRQNet and the pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments use a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. Compared with conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains higher Precision, Recall, and F1-score (Table 2). When creating knowledge graphs, this architecture can handle complex nested structures and implicit references with sufficient accuracy and thoroughness. These experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data, adding accumulated wisdom to the knowledge repository.
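For readers unfamiliar with the reported metrics, the sketch below computes Precision, Recall, and F1 over sets of extracted (head, relation, tail) triples. The triples are made-up placeholders, and exact-match scoring is an assumption; the paper's evaluation protocol is not described in the abstract.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall, and F1 between predicted and gold triples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical triples, for illustration only.
gold = {
    ("council", "issued", "decree"),
    ("mayor", "wrote_to", "bishop"),
    ("guild", "located_in", "town"),
    ("bishop", "visited", "town"),
}
predicted = {
    ("council", "issued", "decree"),
    ("mayor", "wrote_to", "bishop"),
    ("mayor", "visited", "town"),
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here two of three predictions are correct and two of four gold triples are recovered, so precision is 2/3 and recall is 1/2.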

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on constructing historical knowledge graphs using BERT and GNN, which aligns only weakly with Unify Models (model combination) and Tokenizer (implicit in BERT). It mentions a vision-language model for image retrieval, giving Visual Encoder and MultiModal low relevance. MLLM is tangential. World Models and model-based RL are completely absent from the content.

Keywords

Knowledge Graphs, BERT, Graph Neural Networks, Historical Text, Entity Extraction, Relationship Extraction, Vision-Language Model, Digital Humanities

Score: 16.5 / 27.8
Authors: Junlin He, Yihong Tang, Tong Nie, Ao Qu, Yuebing Liang, Hamzeh Alizadeh, Bang Liu, Wei Ma, Lijun Sun
Published: 2026-06-01
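TL;DR: MobEvolve introduces a self-evolving heuristic system driven by an LLM agent that generates interpretable, plausible mobility trajectories and outperforms existing deep generative and LLM-based methods.
Abstract (Chinese translation)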

人类移动性生成旨在基于个体特征为目标群体合成真实的出行链。现有范式,包括深度生成模型、基于大语言模型(LLM)的方法以及传统启发式方法,难以在保持可解释性、行为合理性、群体级分布对齐以及推理效率的同时,满足该任务的复杂需求。为了弥合这一差距,我们引入了 MobEvolve,这是首个用于人类移动性生成的基于自主代理的自进化启发式框架。MobEvolve 初始化一个受行为启发的启发式系统,并利用 LLM 代理迭代演化其内部逻辑。通过在验证集上诊断实证偏差与失败案例,该代理提出针对性更新,并积累进化记忆以实现累积的自我改进。在新加坡和蒙特利尔基准数据集上的广泛评估表明,MobEvolve 在个体轨迹保真度、群体级分布对齐及行为合理性方面显著优于最先进的深度生成模型和基于大语言模型的方法,同时保持了可解释性和高推理效率。

Abstract

Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms, including deep generative models, LLM-based methods, and traditional heuristics, struggle to satisfy the complex demands of this task while simultaneously maintaining interpretability, behavioral plausibility, population-level distributional alignment, and inference efficiency. To bridge this gap, we introduce MobEvolve, the first agentic self-evolving heuristic framework for human mobility generation. MobEvolve initializes a behavior-inspired heuristic system and employs an LLM agent to iteratively evolve its internal logic. By diagnosing empirical misalignments and failure cases on a validation set, the agent proposes targeted updates and accumulates evolution memory for cumulative self-improvement. Extensive evaluations on the Singapore and Montreal benchmarks demonstrate that MobEvolve significantly outperforms state-of-the-art deep generative and LLM-based methods in individual trajectory fidelity, population-level distribution alignment, and behavioral plausibility, while preserving interpretability and high inference efficiency.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 2.0/10 3.0

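Scoring rationale: The paper focuses on generating human mobility with a heuristic system evolved by an LLM agent; it does not involve visual encoders, multimodal fusion, generative world models, or model-based RL. The provided keywords target the multimodal and reinforcement learning areas, so the relevance scores are low.

Keywords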

Human mobility generation, Agentic self-evolving, Heuristic system, LLM agent, Interpretability, Trajectory fidelity, Population-level distribution, Inference efficiency

Score: 16.5 / 27.8
Authors: Zelin He, Haotian Lin, Boran Han, Wei Zhu, Haoyang Fang, Bernie Wang, Xuan Zhu, Runze Li, Matthew Reimherr
Published: 2026-06-01
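TL;DR: ReSkill is an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning, managing the skill lifecycle automatically and outperforming existing methods on unseen tasks.
Abstract (Chinese translation)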

智能体强化学习(RL)使大语言模型(LLM)智能体能够从环境奖励中持续改进,但由此产生的策略并未系统性地积累可跨任务泛化的可重用策略。模块化技能可提供此类可重用策略,然而现有的技能增强强化学习(RL)方法将技能创建与策略优化解耦,存在采纳与不断演化的策略相冲突的技能的风险。受 Anthropic 的 Skill Creator 启发,我们引入了 ReSkill,一个强化学习(RL)闭环技能创建框架,该框架协调了技能演化与策略学习。ReSkill 利用 GRPO 的组内结构,以仅带来边际额外开销的方式自然嵌入了三个机制:(1) 一个断言驱动的技能创建器,它从过往经验中诊断失败,并提出条件性、基于触发器的技能修订;(2) 组内 rollout 采样,它使得技能版本的受控比较成为可能,从而捕捉哪个版本最能支持策略的持续学习;(3) 带有自适应折扣的 Thompson Sampling,以在策略演化过程中平衡技能版本选择中的探索与利用。在多个领域,ReSkill 一致优于现有的基于记忆和技能的方法,且在未见任务上获得最大收益。技能生命周期分析表明,随着策略的改进,技能被自动创建、测试、精炼和剪枝,展示了协调的技能 - 策略共同演化。

Abstract

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
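Mechanism (3) can be pictured with a standard discounted Beta-Bernoulli Thompson sampler, in which old evidence decays so that the preferred skill version can change as the policy evolves. The sketch below uses a fixed discount factor and hypothetical class and version names; how ReSkill adapts the discount is not stated in the abstract.

```python
import random


class DiscountedThompsonSampler:
    """Beta-Bernoulli Thompson Sampling whose evidence decays over time."""

    def __init__(self, versions, discount=0.9, seed=0):
        self.discount = discount  # fixed here; an adaptive rule is an open assumption
        self.rng = random.Random(seed)
        self.successes = {v: 0.0 for v in versions}
        self.failures = {v: 0.0 for v in versions}

    def select(self):
        """Draw one sample per version from its Beta posterior and pick the largest."""
        draws = {
            v: self.rng.betavariate(1.0 + self.successes[v], 1.0 + self.failures[v])
            for v in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, version, success):
        """Decay all counts, then record the new outcome for `version`."""
        for v in self.successes:
            self.successes[v] *= self.discount
            self.failures[v] *= self.discount
        if success:
            self.successes[version] += 1.0
        else:
            self.failures[version] += 1.0


sampler = DiscountedThompsonSampler(["skill_v1", "skill_v2"])
for _ in range(20):
    sampler.update("skill_v1", success=False)
    sampler.update("skill_v2", success=True)
picks = [sampler.select() for _ in range(200)]
```

Discounting caps the effective sample size at roughly `1 / (1 - discount)`, so a version that was good under an earlier policy does not stay locked in indefinitely.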

Score details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

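Scoring rationale: The paper centers on skill creation and policy optimization in reinforcement learning (the ReSkill framework), using the GRPO algorithm. Although it involves LLM agents, it does not explicitly mention multimodal architecture components (Tokenizer, Visual Encoder, MultiModal), so their relevance is 0. Skill accumulation has some conceptual connection to world models but is not world-model construction proper, so that score is low. GRPO is a model-free RL method and has little connection to model-based RL. Unify Models applies only in the sense of reconciling skills with the policy, not unifying model architectures. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied.

Keywords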

Agentic RL, Skill Creation, Policy Optimization, GRPO, Skill Evolution, Thompson Sampling, Reusable Strategies

Score: 16.5 / 27.8
Authors: Fang Wan, Jingxiang Qu, Yi Liu
Published: 2026-06-01
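TL;DR: This paper proposes Uncertainty-Calibrated Diffusion (UCD), which calibrates the reverse diffusion process to account for epistemic uncertainty, improving sampling quality and chemical validity in 3D molecular graph generation.
Abstract (Chinese translation)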

贝叶斯推断提供了一种严谨的框架,通过将预测视为分布而非确定性值,从而对神经网络中的认知不确定性进行建模。与此同时,用于三维分子图生成的基于扩散的模型在受严格化学约束的脆弱几何结构上运行,这使得推断对不确定性校准偏差高度敏感。一个被广泛忽视的问题是,从学习到的去噪器中产生的认知不确定性会与在反向扩散过程中有意注入的偶然性不确定性相互作用,导致系统性方差膨胀以及真实分布与模拟分布之间的不匹配。这种效应对于高精度分子生成尤其有害,因为即使微小的偏差也可能违反化学有效性。在这项工作中,我们提供了关于认知不确定性如何通过扩散推断传播并降低采样质量的理论和实证分析。基于此研究,我们提出了 UCD(Uncertainty-Calibrated Diffusion),这是一种简单却有效的方法,通过校准反向扩散过程来处理认知不确定性。在标准三维分子基准上的广泛实验表明,UCD 在各种基线方法中一致地改进了采样质量,为三维分子扩散确立了新的最先进性能。代码可在 https://github.com/jiuguaiwf/UCD 处获取。

Abstract

Bayesian inference provides a principled framework for modeling epistemic uncertainty in neural networks by treating predictions as distributions rather than deterministic values. Meanwhile, diffusion-based models for 3D molecular graph generation operate on fragile geometric structures governed by strict chemical constraints, making inference highly sensitive to uncertainty miscalibration. A largely overlooked issue is that epistemic uncertainty arising from the learned denoiser interacts with the aleatoric uncertainty intentionally injected during reverse diffusion, leading to systematic variance inflation and a mismatch between the true distribution and the simulated distribution. This effect is particularly detrimental for high-precision molecular generation, where even small deviations can violate chemical validity. In this work, we provide a theoretical and empirical analysis of how epistemic uncertainty propagates through diffusion inference and degrades sampling quality. Building on this investigation, we propose UCD (Uncertainty-Calibrated Diffusion), a simple yet effective method that calibrates the reverse diffusion process to account for epistemic uncertainty. Extensive experiments on standard 3D molecular benchmarks demonstrate that UCD consistently improves sampling quality across diverse baseline methods, establishing new state-of-the-art performance for 3D molecular diffusion. The code is available at https://github.com/jiuguaiwf/UCD.
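The variance-inflation argument can be made concrete with a toy calculation. If the denoiser's prediction carries epistemic variance that is independent of the injected noise, the two variances add, so a reverse step ends up noisier than the schedule intends. One simple remedy, shown below purely as an assumption and not as UCD's actual rule, is to shrink the injected noise so the total matches the target. The epistemic variance is estimated here from a hypothetical ensemble of scalar denoiser outputs.

```python
import math
import statistics


def calibrated_noise_std(target_std, denoiser_predictions):
    """Shrink the injected noise so injected + epistemic variance hits the target.

    `denoiser_predictions` is a hypothetical ensemble of scalar predictions for the
    same input; their spread stands in for epistemic uncertainty.
    """
    epistemic_var = statistics.pvariance(denoiser_predictions)
    remaining_var = max(target_std**2 - epistemic_var, 0.0)  # cannot go negative
    return math.sqrt(remaining_var)


target_std = 0.2
ensemble_predictions = [0.9, 1.0, 1.1]
sigma = calibrated_noise_std(target_std, ensemble_predictions)
total_var = sigma**2 + statistics.pvariance(ensemble_predictions)
```

After calibration the total variance equals `target_std ** 2`, whereas injecting the full `target_std` would have overshot it by the epistemic term.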

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 1.0/10 1.5

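Scoring rationale: The paper focuses on uncertainty calibration of diffusion models for 3D molecular graph generation, which has little connection to the core areas in the keyword set such as MLLM, world models (in the RL sense), and model-based RL. Although it involves generative models, it shows no multimodal unification, dedicated tokenizer design, or visual encoder architecture, so relevance scores are low overall. The weighted total is 16.5, below the dynamic passing score of 27.8.

Keywords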

Diffusion Models, 3D Molecular Graph Generation, Uncertainty Calibration, Bayesian Inference, Epistemic Uncertainty, Aleatoric Uncertainty, Sampling Quality, Chemical Validity

Score: 16.5 / 27.8
Authors: Wenhang Shi, Yiren Chen, Shuqing Bian, Zhe Zhao, Jinhao Dong, Pengfei Hu, Wei Lu, Xiaoyong Du
Published: 2026-06-01
TL;DR: This paper proposes State-Adaptive Prompt Optimization (SAPO) to dynamically adjust training prompts during LLM fine-tuning, effectively mitigating catastrophic forgetting and improving generalization performance.
Abstract (Chinese translation)

尽管提示工程在推理阶段对于最大化大语言模型(LLMs)的能力至关重要,但提示在训练过程中的作用仍被严重低估。现有的微调范式通常将训练提示视为仅仅是表面形式,假设语义等价的指令会产生相同的学习结果。然而,我们揭示这种等价性是误导性的:尽管改写后的提示通常能带来可比的任务内性能,但它们在灾难性遗忘和泛化方面会引发截然不同的跨任务影响。至关重要的是,这些影响在不同任务间呈正相关,表明存在能够始终产生更好性能的优越提示。此外,我们发现这些优越提示可以在学习前通过任务损失稳健地识别出来。基于这些洞察,我们提出状态自适应提示优化(SAPO),这是一种轻量级却有效的训练策略,它将任务表述从静态输入转变为动态的状态自适应变量。在多种基准上的全面实验证实了其有效性:该方法显著缓解了遗忘问题,同时提升了泛化能力,相对于最先进方法实现了显著的性能提升。这些结果为理解训练提示如何塑造学习动态提供了见解,并为稳健微调提供了实用的方案。我们的代码可在 https://github.com/Eric8932/SAPO 获取。

Abstract

While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing fine-tuning paradigms typically treat training prompts as mere surface forms, assuming that semantically equivalent instructions yield identical learning outcomes. However, we reveal that this equivalence is deceptive: while paraphrased prompts often lead to comparable in-task performance, they induce drastically different cross-task impacts regarding catastrophic forgetting and generalization. Crucially, these impacts are positively correlated across tasks, indicating the existence of superior prompts that consistently yield better performance. Furthermore, we discover that these superior prompts can be robustly identified by task loss prior to learning. Leveraging these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that shifts task formulation from a static input to a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness, which significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results provide insights into how training prompts shape learning dynamics and offer a practical recipe for robust fine-tuning. Our code is available at https://github.com/Eric8932/SAPO.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on LLM fine-tuning and prompt optimization (SAPO), addressing catastrophic forgetting and generalization. It has low relevance to World Models, model-based RL, Visual Encoder, and Tokenizer because none of these are discussed. MLLM scores slightly higher because LLMs are involved, while MultiModal remains low. None of the designated expert authors appear in the author list.

Keywords

Prompt Engineering, Fine-Tuning, Catastrophic Forgetting, Generalization, State-Adaptive Prompt Optimization, Large Language Models, Training Dynamics

Score: 15.0 / 27.8
Authors: Lukas Johanns, Marilin Moor, Davide Panzeri, Yu Zhou, Xinyi Chen, Nora F. K. Pauly, Zixuan Pan, Matthias Gunzer, Andreas Müller, Yiyu Shi, Hedi Peterson, Jianxu Chen
Published: 2026-06-01
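TL;DR: Agentic-J is a natural-language-driven multi-agent assistant for biological microscopy image analysis that generates executable scripts and makes workflows reproducible.
Abstract (Chinese translation)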

生物图像分析日益要求整合异构工具、编程环境及领域知识,却鲜有研究者能同时驾驭这些要素。本文提出 Agentic-J,一种面向 ImageJ/Fiji 的容器化、多智能体 AI 助手,使生物学家能够通过自然语言指定分析任务,涵盖从细胞核分割、细胞追踪到多条件量化等多个方面。该智能体生成组织成文档化项目结构的可执行脚本,确保每个分析决策均可追溯,且工作流可被复现或共享。专用子智能体负责插件管理、代码生成、调试、质量保证及统计报告。本文介绍了该系统的架构设计,演示了真实的生物显微图像分析工作流,并详细阐述了技术实现细节。

Abstract

Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that few researchers can command simultaneously. We present Agentic-J, a containerised, multi-agent AI assistant, primarily for ImageJ/Fiji, that enables biologists to specify analysis tasks in natural language, from nuclei segmentation and cell tracking to multi-condition quantification. The agent generates executable scripts organised into a documented project structure, so every analysis decision is traceable and the workflow can be reproduced or shared. Specialised sub-agents handle plugin management, code generation, debugging, quality assurance, and statistical reporting. In this paper we introduce the system's design, demonstrate real biological microscopy image analysis workflows, and detail the technical implementation.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 0.0/10 0.0

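Scoring rationale: The paper presents Agentic-J, a multi-agent AI assistant for biological microscopy image analysis, with an emphasis on natural-language interaction, script generation, and workflow management. The keywords 'World Models', 'Tokenizer', 'Visual Encoder', and 'model-based RL' concern low-level model architecture or reinforcement learning techniques that the paper does not discuss, so they score 0 or very low. 'MultiModal' and 'MLLM' have some connection because natural language is used to drive image-analysis tasks, but this is not the core contribution, so their scores are low. 'Unify Models' refers to unifying models, whereas this paper unifies tools, so it scores low. The author list does not include any designated expert (Yiyu Shi is a different person from Yang Shi).

Keywords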

Agentic-J, Biological Microscopy Image Analysis, Multi-agent AI Assistant, Natural Language, ImageJ/Fiji, Executable Scripts, Workflow Reproducibility, Plugin Management

Score: 15.0 / 27.8
Authors: Matthew Khoriaty, David Williams-King, Shi Feng
Published: 2026-06-01
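TL;DR: This paper proposes a text diversity metric based on language-model log-probabilities that detects diversity loss during post-training without a reference corpus.
Abstract (Chinese translation)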

衡量创意输出的多样性对于评估训练后模式崩溃、比较解码策略以及量化 AI 和人类写作中的创意行为至关重要。我们提出了一种基于上下文学习的新方法来衡量多样性,其中我们评估的具体实例是"Decan"度量,$D_{Ca_n} = C \times a_n$:这是一种每字节分数,通过对基础模型 $\theta$ 的每 token 对数概率读取得出,每个排列仅需单次前向传播,无需嵌入模型、无需参考语料库,也无需人工标签。该方法基于信息论,利用语言模型的上下文学习来检测任意数量输入之间的广泛相似性,从而消除了训练专用模型的需求。同一流程对 AI 样本和人工撰写的响应集进行评分,将多样性视为(响应、提示词、评分模型)的属性。在 Tevet 和 Berant 提出的基于人类评估的 McDiv 基准上,$D_{Ca_n}$ 在 McDiv prompt\_gen 集上达到了 OCA 0.846 的最佳表现,仅次于 Tevet 和 Berant 报告的最强神经基线(SentBERT,0.897)。在 OLMo-2-7B 训练后流程中,$D_{Ca_n}$ 在基础模型 $\to$ SFT $\to$ DPO $\to$ RLVR 阶段单调下降,检测到了创意写作应用所关心的多样性损失类型。

Abstract

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$ in a single forward pass per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt_gen set, where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.
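The abstract does not define the factors $C$ and $a_n$, so the Decan formula itself is not reproduced here. The sketch below only illustrates the generic step it builds on: turning per-token natural-log probabilities into a per-byte quantity (bits per byte of UTF-8 text). The function name and the sample values are assumptions.

```python
import math


def bits_per_byte(token_logprobs, text):
    """Convert per-token natural-log probabilities into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    total_nats = -sum(token_logprobs)  # negative log-likelihood in nats
    return total_nats / (math.log(2) * n_bytes)


# Four tokens, each assigned probability 0.5 by the scoring model, over 4 bytes.
example = bits_per_byte([math.log(0.5)] * 4, "abcd")
```

Normalizing by bytes rather than tokens keeps scores comparable across tokenizers, which matters when the same pipeline scores both AI samples and human-written text.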

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 4.0/10 6.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

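Scoring rationale: The paper's core is a text diversity metric, unrelated to multimodal or world-model architectures. Tokenizer receives a middling score because the metric is built on per-token probabilities; Visual Encoder and MultiModal score 0 because the work is text-only; Unify Models, World Models, MLLM, and model-based RL score low because the content lies outside them (no architecture unification, no environment modeling, no multimodality, no model-based RL).

Keywords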

Diversity Measurement, In-context Learning, Language Model, Post-training Pipeline, Creative Output, Log-probability, McDiv Benchmark

Score: 15.0 / 27.8
Authors: Atoosa Chegini, Soheil Feizi
Published: 2026-06-01
TL;DR: This paper proposes a training-free method called Chunk-Level Guided Generation that uses off-the-shelf large language models to score candidate text chunks, improving small language models' mathematical reasoning performance without requiring trained reward models.
Abstract (Chinese translation)

使用更强的评分器从多个小模型样本中选择最佳响应是一种简单的推理时策略,但当小模型已陷入错误的推理路径时,该方法便会失效。PRM(过程奖励模型)引导搜索通过在生成过程中对候选续写进行评分来避免这一问题,但它需要一个使用步骤级标签训练的奖励模型。我们提出“分块级引导生成”(Chunk-Level Guided Generation),这是一种无需训练的替代方案,它使用现成的大语言模型作为过程评分器。在每一步,小模型采样 k 个固定长度的候选分块,而大模型则利用似然度对这些候选者进行评分,且不生成任何文本。选定的分块会在下一步之前被确定,从而在错误传播之前引导生成过程。我们通过两种选择规则实例化该框架:似然度引导选择(LGS),选择大模型长度归一化对数概率最高的分块;以及对比引导选择(CGS),通过减去小模型的对数概率,以偏好那些大模型偏好与小模型存在分歧的分块。我们表明,使用大模型似然度对可变长度的推理步骤进行评分是不可靠的,因为存在一种系统性的长度偏差,即使经过长度归一化后依然存在,而固定长度的分块避免了这种混淆因素。在 GSM8K、MATH、Minerva Math、AMC23 和 AIME24 基准上,使用 Qwen2.5-1.5B 并由 Qwen2.5-32B 引导,以及使用 Llama-3.2-1B 并由 Llama-3.1-70B 引导时,CGS 的表现优于多数投票法,最高可达 28 个百分点;且在引导预算匹配的情况下,无需奖励模型训练,其在大多数基准上的表现即可匹配或优于 Qwen2.5-Math-PRM-72B 的引导搜索。当使用 Qwen2.5-7B 并由 Qwen2.5-72B 引导时,在 k=16 的情况下,CGS 在 MATH 上达到 81.8%,在 Minerva Math 上达到 63.6%,超越了多数投票法 4 至 6 个百分点。最后,分块级引导生成的推理轨迹显著短于 PRM 引导搜索。

Abstract

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM-guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4-6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM-guided search.
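The two selection rules can be sketched directly from their descriptions. The code below assumes the per-token log-probabilities of each candidate chunk under the large and small models are already available as lists; obtaining them from real models is outside the sketch. Whether CGS normalizes by length is not stated in the abstract; with fixed-length chunks the ranking is the same either way.

```python
def mean_logprob(token_logprobs):
    """Length-normalized log-probability of one chunk."""
    return sum(token_logprobs) / len(token_logprobs)


def likelihood_guided_selection(large_logprobs):
    """LGS: index of the chunk with the highest large-model mean log-probability."""
    scores = [mean_logprob(lp) for lp in large_logprobs]
    return max(range(len(scores)), key=lambda i: scores[i])


def contrastive_guided_selection(large_logprobs, small_logprobs):
    """CGS: index of the chunk maximizing large-model minus small-model score."""
    scores = [
        mean_logprob(large) - mean_logprob(small)
        for large, small in zip(large_logprobs, small_logprobs)
    ]
    return max(range(len(scores)), key=lambda i: scores[i])


# Three candidate chunks of two tokens each (made-up log-probabilities).
large = [[-1.0, -1.0], [-0.5, -0.7], [-2.0, -2.0]]
small = [[-0.2, -0.2], [-0.4, -0.6], [-3.0, -3.0]]
lgs_choice = likelihood_guided_selection(large)
cgs_choice = contrastive_guided_selection(large, small)
```

In this toy case LGS picks chunk 1, the one the large model finds most likely, while CGS picks chunk 2, the one the large model rates far higher than the small model does.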

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 2.0/10 3.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

Scoring rationale: The paper focuses on text-based mathematical reasoning using inference-time scoring with off-the-shelf LLMs, and lacks connections to multimodal components (Visual Encoder, MultiModal), world models, or unified model architectures. While it uses LLMs, it does not address MLLM integration, tokenizer design, or model-based reinforcement learning as defined in the background.

Keywords

Off-the-Shelf LLMs, Process Scorers, Chunk-Level Guided Generation, Training-Free, Mathematical Reasoning, Likelihood-Guided Selection, Contrastive-Guided Selection, PRM Alternative

Score: 15.0 / 27.8
Authors: Liu Qing, Ou Wu, Yi Du
Published: 2026-06-01
TL;DR: AlphaToken introduces a path-aware token valuation framework for LLM post-training that decouples adaptation and stability to improve performance and mitigate catastrophic forgetting.
Abstract (Chinese translation)

Token 选择对于有效的 LLM 后训练至关重要。然而,现有方法大多依赖局部启发式方法,很少将 Token 选择表述为对单个响应 Token 的基于原则的估值。我们提出 AlphaToken,一种响应 Token 估值框架,该框架将估值解耦为适配(促进目标任务学习)和稳定性(保留预训练能力),并通过结合来自局部 Token 梯度的直接路径信号与自回归生成中的下游因果路径信号,使每个目标均具备路径感知能力。由于保留数据通常不可用,AlphaToken 通过锚定在预训练参考模型上的 Fisher-drift 代理来近似稳定性。为提升计算效率,我们将 Ghost Dot-Product 扩展至 Token 级估值。AlphaToken 在微调和偏好优化期间掩码低价值响应 Token,将训练信号集中于更有价值的位置。实验表明,AlphaToken 提升了后训练性能并缓解了灾难性遗忘。

Abstract

Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate token selection as a principled valuation of individual response tokens. We introduce AlphaToken, a response token valuation framework that decouples valuation into adaptation (promoting target-task learning) and stability (preserving pre-trained capabilities), and makes each objective path-aware by combining the direct-path signal from local token gradients with the downstream causal-path signal in autoregressive generation. Since retention data are typically unavailable, AlphaToken approximates stability via a Fisher-drift proxy anchored at the pre-trained reference model. For efficient computation, we extend Ghost Dot-Product to token-level valuation. AlphaToken masks low-value response tokens during fine-tuning and preference optimization, concentrating training signals on more valuable positions. Experiments show that AlphaToken improves post-training performance and mitigates catastrophic forgetting.
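
The masking step in the abstract (dropping low-value response tokens from the training signal) can be illustrated with a small sketch. It assumes per-token values are already computed; the valuation itself (adaptation, stability, the Fisher-drift proxy) is not reproduced here. The function name and the `keep_ratio` parameter are hypothetical, and keeping a fixed top fraction is only one possible masking rule.

```python
import math
from typing import List


def masked_token_loss(token_losses: List[float],
                      token_values: List[float],
                      keep_ratio: float = 0.5) -> float:
    """Average the loss over the highest-valued tokens only.

    Tokens outside the top `keep_ratio` fraction by value are masked out,
    so they contribute nothing to the returned loss.
    """
    if len(token_losses) != len(token_values) or not token_losses:
        raise ValueError("need equally long, non-empty loss and value lists")
    n_keep = max(1, math.ceil(keep_ratio * len(token_values)))
    ranked = sorted(range(len(token_values)),
                    key=lambda i: token_values[i], reverse=True)
    kept = ranked[:n_keep]
    return sum(token_losses[i] for i in kept) / n_keep


# Tokens 0 and 2 have the highest values, so only their losses are averaged.
print(masked_token_loss([1.0, 2.0, 3.0, 4.0], [0.9, 0.1, 0.8, 0.2]))
```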

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

Scoring rationale: The paper focuses on LLM post-training token valuation (AlphaToken), addressing adaptation vs. stability and path-aware signals. It lacks content on multimodal integration (MultiModal, Visual Encoder, MLLM strictly), world modeling (World Models), or model-based RL planning (model-based RL). Tokenizer is tangentially related due to token-level focus but does not address tokenizer architecture. Unify Models is weakly related as it unifies objectives rather than models. Weighted sum is 15.0, below the dynamic pass score of 27.8, indicating low relevance to the specific keyword theme.
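
The weighted sum quoted in the rationale can be checked directly from the table. The sketch below assumes each row's score is weight times relevance and that an entry passes when the total reaches the pass score; the report does not state how the dynamic pass score of 27.8 is derived, so it is treated as a given constant.

```python
from typing import List, Tuple

# (keyword, weight, relevance out of 10), copied from the table for this entry.
ROWS: List[Tuple[str, float, float]] = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 3.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 3.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 2.0),
]
PASS_SCORE = 27.8  # taken from the "Score: 15.0 / 27.8" line


def weighted_score(rows: List[Tuple[str, float, float]]) -> float:
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_score(ROWS)
print(total, total >= PASS_SCORE)
```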

Keywords

LLM Post-Training, Token Valuation, Adaptation Stability, Path-Aware, Fisher-Drift, Catastrophic Forgetting, Preference Optimization

Score: 15.0 / 27.8
Authors: Shuwen Deng, Cui Ding, David R. Reich, Paul Prasse, Lena A. Jäger
Published: 2026-06-01
TL;DR: Eyettention II proposes a dual-sequence deep learning model to accurately predict human eye-tracking scanpaths during reading by modeling fixation attributes, overcoming data scarcity challenges in cognitive modeling.
Abstract (Chinese translation)

阅读时的眼动模式不仅为洞察读者的认知过程提供了宝贵见解,也揭示了文本的属性。特别是,阅读时眼动追踪(eye-tracking-while-reading)数据在各种技术应用中具有高度益处,例如增强和解释语言模型,以及推断读者的特征。然而,这些应用通常依赖于大规模、数据驱动模型,这需要大量的眼动数据集,但由于数据收集具有资源密集型特性,获取这些数据集颇具挑战性。为应对数据稀缺的挑战,我们开发了 Eyettention II,这是一个端到端训练的深度学习模型,能够生成逼真的注视路径(scanpaths),该路径包含按时间顺序排列的完整注视属性集,包括注视点位置(fixation location)、词内着陆位置(within-word landing position)和注视时长(fixation duration)。该模型轻量级,能够在有限的 GPU 资源上高效训练,且与认知理论高度契合。我们证明,Eyettention II 在注视路径预测上超越了现有最先进模型(state-of-the-art models),并通过捕捉关键心理语言学现象(psycholinguistic phenomena),模拟了人类般的注视行为。凭借稳健的性能,Eyettention II 有望推动自然语言处理的发展,促进心理语言学实验材料的试点,并揭示超出理论认知模型中明确编码的新见解。

Abstract

The way our eyes move while reading provides valuable insights into both the reader's cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has been shown to be highly beneficial in various technological applications, such as enhancing and interpreting language models and inferring a reader's characteristics. However, these applications often rely on large-scale, data-driven models, which demand extensive eye-tracking datasets that are challenging to obtain due to the resource-intensive nature of data collection. To address the challenge of data scarcity, we develop Eyettention II, an end-to-end trained deep-learning model capable of generating realistic scanpaths consisting of a complete set of fixation attributes in chronological order, including fixation location, within-word landing position, and fixation duration. Our model is lightweight, efficiently trainable on limited GPU resources, and closely aligned with cognitive theories. We demonstrate that Eyettention II surpasses state-of-the-art models in scanpath prediction and mirrors human-like gaze behavior by capturing key psycholinguistic phenomena. With its robust performance, Eyettention II holds the potential to drive advancements in natural language processing, facilitate piloting the materials of psycholinguistic experiments, and uncover new insights beyond what is explicitly encoded in theoretical cognitive models.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on cognitive modeling of eye movements (Eyettention II) rather than large-scale multimodal foundation models or RL. It proposes a dual-sequence architecture for scanpath prediction. Overlap with keywords is minimal: Unify Models, Tokenizer, World Models, MLLM, and model-based RL are irrelevant (score 1). Visual Encoder is weakly related (score 2) due to gaze data. MultiModal is moderately related (score 3) as it combines text and gaze. Total weighted score is 15.0, below the 27.8 threshold. No target expert authors are found.

Keywords

Eye-tracking, Scanpath prediction, Dual-sequence architecture, Fixation location, Cognitive theories, Deep-learning model, Reading behavior

Score: 15.0 / 27.8
Authors: Tanguy Magne, Alexandre Binninger, Ruben Wiersma, Olga Sorkine-Hornung
Published: 2026-06-01
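TL;DR: This paper proposes a method based on semantics-driven optimization and score distillation sampling that generates high-quality single-line vector drawings from text or images, outperforming existing text-to-image models in artistic style and continuity.
Abstract (Chinese translation)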

线描是一种极具表现力的艺术形式,要求艺术家对其主题进行抽象并提炼其精髓。我们提出了首个语义驱动的方法,用于自动生成矢量格式的单线条画,该方法可由描述概念的文本提示或描绘该概念的输入图像引导。我们的方法利用分数蒸馏采样(Score Distillation Sampling)来优化均匀有理 B 样条(URBS)曲线的参数,确保绘图在设计上由单一连续笔触构成。这种表示法提供了对细节层次的细粒度控制,而额外的损失项则允许我们引导最终的艺术风格。我们证明,我们的方法在该任务上优于最先进的文本到图像模型和优化流程,产生的结果既更具美感,又更忠实于连续线描艺术家的风格。此外,由于我们的方法生成的是矢量曲线,它直接支持下游制造流程,如刺绣、激光雕刻和金属丝弯曲。我们的代码和结果可在 https://github.com/tanguymagne/SLDgen 获取。

Abstract

Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve, ensuring that the drawing consists of a single continuous stroke by design. This representation provides fine-grained control over the level of detail, while additional loss terms allow us to steer the final artistic style. We demonstrate that our method outperforms state-of-the-art text-to-image models and optimization pipelines for this task, producing results that are both more aesthetically pleasing and more faithful to the style of continuous line drawing artists. Furthermore, because our method generates a vectorized curve, it directly supports downstream fabrication processes such as embroidery, laser engraving and wire bending. Our code and results are available at https://github.com/tanguymagne/SLDgen.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 1.0/10 1.5

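Scoring rationale: The paper belongs to computer graphics; its core contribution is optimizing single-line vector drawings with score distillation sampling, not multimodal large models or reinforcement learning architectures. Although it accepts both text and image inputs (touching on multimodality), it does not address unified model design, tokenizers, visual encoder innovations, world model construction, or model-based reinforcement learning, so its relevance to the given keywords is generally low.

Keywords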

Single-Line Drawing, Semantics-Driven Optimization, Score Distillation, URBS Curve, Vector Format, Text Prompt, Downstream Fabrication

Score: 15.0 / 27.8
Authors: Weixing Chen, Zhuoqian Feng, Yang Liu, Yexin Zhang, Yifan Wen, Yinghong Liao, Weichao Qiu, Guanbin Li, Liang Lin
Published: 2026-06-01
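TL;DR: PhyScene3D addresses physical consistency in 3D tabletop scenes through a Cognitive Topological Reasoning Chain and Physics-Aware Denoising Alignment, reducing the collision rate by 40% relative to human-annotated data.
Abstract (Chinese translation)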

生成物理一致的 3D 桌面场景是交互式和通用机器人学习中一个基础但尚未充分探索的问题。这一挑战源于密集的对象层级和不规则的可供性。本文中,交互场景指代一个物理有效、无碰撞且可直接加载到物理模拟器中的环境。现有方法,从解耦的符号求解器到端到端回归模型,往往面临误差传播或对包含广泛物理违规的噪声监督的过拟合问题。为了解决这些局限性,我们提出了 PhyScene3D 框架,该框架将生成过程重新表述为人类模仿构建过程 (Human-Mimetic Constructive Process)。所提出的认知拓扑推理链 (Cognitive Topological Reasoning Chain, CTRC) 将场景合成分解为一种顺序的、锚点条件化的过程。该方法采用基于 3D AABB 的放置方案,施加了强烈的结构归纳偏置。为了解决不完美的监督及物理不可行性问题,我们引入了物理感知去噪对齐 (Physics-Aware Denoising Alignment, PADA)。该方法整合了可微分有符号距离场 (Signed Distance Field, SDF) 与测试时优化 (Test-Time Optimization, TTO),旨在将生成的场景投影到物理可行流形上,同时保持语义意图。实验表明,PhyScene3D 在语义准确性和物理有效性方面均优于现有最先进方法,相对于人工标注的训练数据,实现了场景级碰撞率降低 40%。

Abstract

Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Here, an interactive scene denotes a physically valid, collision-free environment directly loadable into physics simulators. Existing methods, ranging from decoupled symbolic solvers to end-to-end regression models, often suffer from error propagation or overfitting to noisy supervision containing widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process. It employs a 3D AABB-based placement scheme that imposes a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA). It integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in scene-wise collision rate relative to the human-annotated training data.
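
To make the collision metric concrete, here is a minimal sketch of an axis-aligned bounding box (AABB) overlap test and a scene-wise collision rate. It is an illustration under stated assumptions, not the paper's evaluation code: a scene is assumed to count as colliding if any pair of boxes has overlapping interiors, boxes that merely touch are treated as collision-free, and all names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner)


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the interiors of two AABBs intersect on all three axes."""
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[i] < bmax[i] and bmin[i] < amax[i] for i in range(3))


def scene_has_collision(boxes: List[Box]) -> bool:
    """True if any pair of objects in the scene overlaps."""
    return any(boxes_overlap(a, b) for a, b in combinations(boxes, 2))


def scene_collision_rate(scenes: List[List[Box]]) -> float:
    """Fraction of scenes containing at least one colliding pair."""
    return sum(scene_has_collision(s) for s in scenes) / len(scenes)


cube = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
intruder = ((0.5, 0.5, 0.5), (2.0, 2.0, 2.0))   # penetrates the cube
neighbor = ((1.0, 0.0, 0.0), (2.0, 1.0, 1.0))   # shares a face with the cube
print(scene_collision_rate([[cube, intruder], [cube, neighbor]]))
```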

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 3.0/10 4.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 2.0/10 3.0

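Scoring rationale: The paper centers on physically consistent generation of 3D scenes and geometric reasoning (AABB, SDF), placing it in computer vision and robotic simulation; it is entirely unrelated to keywords such as Tokenizer, Visual Encoder (in the MLLM sense), MLLM, and Unify Models (score 1). Although the generated scenes serve robotic learning (an application environment for model-based RL) and build virtual worlds (a conceptual overlap with World Models), the work is not itself a world-model or reinforcement-learning algorithm, so relevance is low (scores 2-3).

Keywords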

Physically Consistent, 3D Tabletop Scene Generation, Human-Mimetic Constructive Process, Cognitive Topological Reasoning Chain, Physics-Aware Denoising Alignment, Differentiable Signed Distance Field, Test-Time Optimization, Robotic Learning

Score: 13.5 / 27.8
Authors: Kaidi Zhang, Guanxu Zhu
Published: 2026-06-01
TL;DR: This paper proposes a fast and lightweight novel view synthesis method using differentiable Multiplane Image and one-step diffusion, achieving superior efficiency compared to 3D Gaussian Splatting.
Abstract (Chinese translation)

近年来,新视角合成取得了显著进展,主流方法如神经辐射场(NeRF)和 3D 高斯泼溅(3DGS)均取得了令人印象深刻的成果。然而,这些方法往往难以平衡渲染速度与模型大小,且其基于优化的训练过程往往耗时较长。此外,它们通常依赖于密集观测,往往在稀疏视角条件下难以产生令人满意的结果。尽管前馈重建显著缩短了 3DGS 的优化时间,但其像素对齐的形式会从单张图像生成数百万个高斯,严重限制了其在移动设备上的实际部署。为了解决这些局限性,我们重新审视了多平面图像(MPI)表示法,该方法利用紧凑的平面层集合来表示场景,以实现高效的新视角合成。借助视觉基础模型的最新进展,我们利用预测的点图进行可靠的几何初始化,随后进行可微优化。针对稀疏初始化 MPI 中出现的空洞和伪影问题,我们引入了单步扩散机制,该机制同时参与 MPI 的可微优化及渲染结果的后处理。与一种代表性的基于 3DGS 的方法相比,我们的方法速度快 30.7%,且模型大小仅为该方法的 14.8%,同时在正面视角场景下实现了具有竞争力的合成质量。

Abstract

Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) delivering impressive results. However, these approaches often struggle to balance rendering speed and model size, and their optimization-based training can be highly time-consuming. Furthermore, they typically rely on dense observations, often failing to produce satisfactory results under sparse-view conditions. Although feed-forward reconstruction significantly reduces the optimization time of 3DGS, its pixel-aligned formulation generates millions of Gaussians from a single image, severely limiting its practical deployment on mobile devices. To address these limitations, we revisit the Multiplane Image (MPI) representation, which represents scenes using a compact set of planar layers for efficient novel view synthesis. Leveraging recent advances in visual foundation models, we utilize predicted point maps for reliable geometric initialization, followed by differentiable optimization. To address the issues of holes and artifacts in sparsely initialized MPI, we introduce one-step diffusion, which participates in both the differentiable optimization of MPI and the postprocessing of rendering results. Compared with a representative 3DGS-based method, our approach is 30.7% faster and uses only 14.8% of its model size, while achieving competitive synthesis quality on front-view scenarios.
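
For readers unfamiliar with the MPI representation, the sketch below shows the standard back-to-front "over" compositing of planar layers for a single pixel. It is a generic illustration, not this paper's renderer: it omits warping each plane into the target view, and the function name and data layout are hypothetical.

```python
from typing import List, Tuple

Color = Tuple[float, float, float]


def composite_over(layers: List[Tuple[Color, float]]) -> Color:
    """Blend (color, alpha) samples ordered from the farthest to the nearest plane.

    Each nearer layer covers what is behind it in proportion to its alpha.
    """
    out: Color = (0.0, 0.0, 0.0)
    for color, alpha in layers:
        out = tuple(alpha * c + (1.0 - alpha) * o
                    for c, o in zip(color, out))
    return out


# An opaque red far plane seen through a half-transparent blue near plane.
print(composite_over([((1.0, 0.0, 0.0), 1.0), ((0.0, 0.0, 1.0), 0.5)]))
```

Because the per-pixel cost is one blend per plane, a small, fixed number of layers keeps both rendering time and model size bounded, which is the efficiency argument the abstract makes for MPI.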

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on Novel View Synthesis using Multiplane Image and diffusion techniques, which is unrelated to MLLM, Tokenizers, RL, or World Models in the provided context. 'Visual Encoder' has a minor link via visual foundation models used for initialization. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, etc.) match the paper authors (Kaidi Zhang, Guanxu Zhu).

Keywords

Novel View Synthesis, Multiplane Image, Differentiable Optimization, One-step Diffusion, Visual Foundation Models, Sparse-view, Lightweight

Score: 13.5 / 27.8
Authors: Sherzod Turaev, Mary John, Mamoun Awad, Nazar Zaki, Khaled Shuaib
Published: 2026-06-01
TL;DR: This paper proposes an NLP framework using schema-constrained LLM extraction and semantic matching to quantify supply-demand gaps between university curricula and labor market requirements.
Abstract (Chinese translation)

从多样化的教育和劳动力市场语料库中进行模式约束 (Schema-constrained) 信息提取在自然语言处理领域仍是一个开放挑战,因为现有的处理流程主要依赖词汇表面方法,这些方法无法恢复隐性能力,缺乏在共享分类法中的基础,且未提供提取可靠性或文档级别完整性的正式度量。为了解决这些局限性,本文提出一个四阶段 NLP 框架,该框架结合了:(i) 针对由 JSON Schema 强制执行的七槽能力形式化,对双模型前沿 LLM 集成进行模式约束提示;(ii) 提取记录与十一领域 ESCO v1.2.1 受控词汇表之间的 Sentence-BERT (SBERT) 对齐;(iii) 解决模型间分歧的两层裁决协议;以及 (iv) 结合每槽 Cohen's kappa、模式符合性和文档级别完整性审计的验证机制。该框架被实例化应用于高等教育质量保证的一个关键场景,即阿联酋大学 ABET 认证的 BSc 计算机科学课程与劳动力市场的对齐。该流程从包含 85 门课程的 2025-2026 学年培养方案中提取 400 条能力记录,并在涵盖从计算核心到概率加权学生轨迹的五层分析范围内,将它们与 30 个职位发布(共 483 个要求条款)在 SBERT 余弦阈值 0.50 下进行对齐。该提取器在技能槽上实现了 0.79 的 Cohen's kappa,且模式符合性和文档级别完整性均达到 100%。该对齐揭示了可解释的供需差距:一般及横向技能为 25.0%,算法与计算理论为 13.8%,软件工程和项目管理为 12.2%;尽管供应覆盖率为 38.6%,人工智能与数据科学的差距仍接近零,仅为 1.8%。

Abstract

Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language processing because existing pipelines rely primarily on lexical-surface methods that cannot recover implicit competencies, lack grounding in shared taxonomies, and provide no formal measures of extraction reliability or document-level completeness. To address these limitations, this paper proposes a four-stage NLP framework that combines (i) schema-constrained prompting of a two-model frontier-LLM ensemble against a JSON Schema-enforced seven-slot competency formalism, (ii) Sentence-BERT (SBERT) alignment of the extracted records against an eleven-domain ESCO v1.2.1 controlled vocabulary, (iii) a two-tier adjudication protocol that resolves inter-model disagreements, and (iv) a verification mechanism that combines per-slot Cohen's kappa, schema conformance, and document-level completeness audits. The framework is instantiated for a critical application in higher-education quality assurance, namely curriculum-labor market alignment for the ABET-accredited BSc Computer Science program at the United Arab Emirates University. The pipeline extracts 400 competency records from the 85-course 2025-2026 study plan and aligns them, under a five-scope analysis ranging from the computing core to a probability-weighted student trajectory, with 30 job postings (483 requirement clauses) at an SBERT cosine threshold of 0.50. The extractor achieves Cohen's kappa of 0.79 on the skill slot, with 100% schema conformance and 100% document-level completeness. The alignment surfaces interpretable supply-demand gaps of 25.0% in general and transversal skills, 13.8% in algorithms and computational theory, and 12.2% in software engineering and project management, with a near-zero 1.8% gap in artificial intelligence and data science despite 38.6% supply coverage.
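
The reliability figure reported above (Cohen's kappa of 0.79 on the skill slot) uses a standard agreement statistic, which can be computed as in the sketch below. The labels are made up, and which two raters are compared (for example the two LLMs in the ensemble) is an assumption; the abstract does not spell this out.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label]
              for label in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters always use the same single label
    return (p_o - p_e) / (1.0 - p_e)


print(cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"]))
```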

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 1.0/10 1.5

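Scoring rationale: The paper focuses on NLP-driven alignment between curricula and the labor market, using LLM extraction and semantic matching. The keyword set emphasizes multimodality, world models, and reinforcement learning, which is a poor match for this text-centric, non-RL work. Although it uses an LLM ensemble (weakly related to Unify Models), it has no visual encoder, tokenizer details, world model, or RL component, resulting in low relevance scores.

Keywords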

Curriculum-Labor Market Alignment, Schema-Constrained LLM Extraction, Semantic Matching, ESCO Taxonomy, Competency Formalism, NLP Framework, Supply-Demand Gaps, Sentence-BERT Alignment

Score: 13.5 / 27.8
Authors: Prateek Kumar Sikdar
Published: 2026-06-01
TL;DR: LayerRoute proposes a lightweight LoRA-based adapter for agentic language models that selectively skips transformer blocks during inference, reducing compute for tool calls while maintaining planning quality.
Abstract (Chinese translation)

智能体语言模型系统在两种结构不同的步骤类型之间交替:结构化工具调用(短、确定性强、低困惑度)和开放式规划/推理步骤(长、复杂、高困惑度)。尽管存在这种异构性,当前的推理系统对每个步骤都施加相同的计算量。我们引入 LayerRoute,一种轻量级适配器,它学习根据输入选择性跳过 Transformer 块。LayerRoute 为 Qwen2.5-0.5B-Instruct 的每个 Transformer 块增加了:(1) 一个逐层路由器(约 897 个参数,Linear(896, 1)),通过直通估计器输出硬二值门;以及 (2) 在 Q/K/V/O 注意力投影上的 LoRA 适配器(秩为 8,约 108 万个参数)。骨干权重保持冻结。在智能体数据(Hermes, Glaive, GSM8K, Turing)上进行单次端到端训练,并加入门正则化项,迫使系统发现每种输入类型下哪些块是可跳过的。经过 3,000 步(在 A100 40GB 显卡上耗时 6.4 分钟),LayerRoute 实现了 12.91% 的跳过率差异:工具调用跳过了 15.25% 的 FLOPs,而规划步骤仅跳过 2.34%,仅使用 110 万个可训练参数(占 4.94 亿骨干参数的 0.22%)。由于 LoRA 适配,质量优于基线模型,工具调用的困惑度差值为 -1.29,规划步骤为 -1.30。

Abstract

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.
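
The parameter counts in the abstract can be reproduced with a short back-of-the-envelope calculation. The hidden size (896), block count (24), LoRA rank (8), and backbone size (494M) come from the abstract; the key/value projection width of 128 is an assumption about Qwen2.5-0.5B (grouped-query attention with 2 key/value heads of width 64) that the abstract does not state.

```python
HIDDEN = 896                   # from Linear(896, 1) in the abstract
KV_DIM = 128                   # ASSUMPTION: 2 key/value heads x 64 dims
N_BLOCKS = 24
RANK = 8
BACKBONE_PARAMS = 494_000_000


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA pair A (rank x d_in) and B (d_out x rank) has rank * (d_in + d_out) weights."""
    return rank * (d_in + d_out)


router_per_block = HIDDEN + 1  # 896 weights plus one bias
lora_per_block = (
    lora_params(HIDDEN, HIDDEN, RANK)      # Q projection
    + lora_params(HIDDEN, KV_DIM, RANK)    # K projection
    + lora_params(HIDDEN, KV_DIM, RANK)    # V projection
    + lora_params(HIDDEN, HIDDEN, RANK)    # O projection
)
total_router = N_BLOCKS * router_per_block
total_lora = N_BLOCKS * lora_per_block
total_trainable = total_router + total_lora
trainable_pct = 100 * total_trainable / BACKBONE_PARAMS
skip_differential = 15.25 - 2.34  # tool-call skip rate minus planning skip rate

print(router_per_block, total_lora, total_trainable,
      round(trainable_pct, 2), round(skip_differential, 2))
```

Under these assumptions the routers contribute 21,528 weights and the LoRA adapters 1,081,344, which is consistent with the ~1.08M and 1.10M (0.22%) figures quoted in the abstract.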

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 2.0/10 3.0

Scoring rationale: Paper focuses on inference optimization for agentic LLMs via layer skipping, showing low relevance to multimodal components (Visual Encoder, MultiModal, MLLM) and world models. Method is inference-time adaptation, not model-based RL. Tokenizer is not addressed. Unify Models is weakly related to compute allocation.

Keywords

LayerRoute, Agentic Language Models, Adaptive Layer Skipping, LoRA Fine-Tuning, Inference Efficiency, Transformer Blocks, Tool Calls, Planning Steps

Score: 13.5 / 27.8
Authors: Yuzhe Zhang, Chihui Chen, Lina Yao, Chen Wang
Published: 2026-06-01
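TL;DR: This paper proposes an LLM-based pipeline for evaluating how consistent causal discovery benchmarks are with domain research, and finds that 11 popular benchmarks differ significantly in quality.
Abstract (Chinese translation)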

在图形因果模型(graphical causal model)中,因果发现(causal discovery)旨在基于数值数据和纯文本领域知识构建因果图(causal graph)。然而,因果发现方法的评估在该领域仍具挑战性,因为领域研究的进展往往导致基准因果图(benchmark causal graphs)包含知识错配(mis-aligned knowledge)。这一问题尤其影响基于大语言模型(LLM)的因果发现方法的评估,因为这些模型对文献中的新发现敏感。本研究是首个系统研究基准因果图(benchmark causal graphs)质量的工作。具体而言,我们设计了一个流程(pipeline),该流程自动从科学数据库中检索相关研究论文,并提示大语言模型(LLMs)检查基准因果图(causal graph)与领域研究论文之间的一致性(consistency)。我们评估了 11 个流行的现实世界基准(benchmarks),我们的流程总共处理了 38,081 篇领域论文。结果表明,流行的基准(benchmarks)在与领域研究的一致性上存在显著差异,这对因果发现研究具有明确的启示。

Abstract

In graphical causal modeling, causal discovery aims to construct a causal graph based on numerical data and domain knowledge in plain text. However, evaluating causal discovery methods remains challenging because progress in domain research often leaves benchmark causal graphs containing misaligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods, as they are sensitive to new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases and prompts LLMs to check the consistency between the benchmark causal graphs and domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline processes a total of 38,081 domain papers. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 1.0/10 1.5

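Scoring rationale: The paper focuses on evaluating the consistency of causal discovery benchmarks, using LLMs to check agreement between text and causal graphs. The keywords mainly concern multimodal world models, reinforcement learning, visual encoders, and the like, which is a poor match for this paper's area of causal discovery and benchmark evaluation. A few keywords are touched on slightly only because LLMs are used; the rest (such as Tokenizer, Visual Encoder, and RL) are entirely unrelated, so the weighted total is low.

Keywords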

Causal discovery, Benchmark evaluation, Consistency evaluation, LLM, Graphical causal model, Domain research, Scientific databases

Score: 13.5 / 27.8
Authors: Bangguo Zhu, Peng Huo, Yuanbo Zhao, Zhicheng Du, Jun Yin, Senzhang Wang
Published: 2026-06-01
TL;DR: This paper proposes a time-aware diffusion framework called TDPM for generative recommendation that disentangles user preferences into period and point components, achieving significant performance improvements over state-of-the-art baselines.
Abstract (Chinese translation)

近期,生成式推荐器(GRs)作为一种变革性的推荐范式应运而生,其核心在于用语义索引(SIDs)取代传统的物品 ID。得益于扩散模型卓越的生成能力,少数开创性工作尝试以扩散架构为骨干来构建生成式推荐器。然而,现有基于扩散的生成式推荐器存在一个致命局限:扩散过程对历史交互中的所有物品统一应用。相比之下,用户偏好受多方面随时间演变因素的影响,因此在时间维度上呈现出非平稳分布。为弥合这一差距,本研究提出了一种新的生成式推荐器框架,命名为 TDPM,其核心在于设计基于 SID 词元的时间感知扩散机制。具体而言,TDPM 显式地将随时间演变的用户偏好影响整合进扩散过程。进一步地,用户偏好被解耦为:(i)周期偏好,即在长时间跨度内保持一致的偏好;(ii)点偏好,即由近期关键事件触发的偏好。在三个公开真实世界数据集上的广泛实验表明,TDPM 显著优于现有的最先进基线方法。在 HR@20 和 NDCG@20 指标上,TDPM 分别实现了高达 29.21% 和 25.45% 的平均提升。消融实验进一步强调了在基于扩散的生成式推荐器中引入时间感知词元扩散的必要性。

Abstract

Recently, Generative Recommenders (GRs) have emerged as a transformative recommendation paradigm by replacing traditional item IDs with semantic indices (SIDs). Owing to the exceptional generative capabilities of diffusion models, a few pioneering works explore developing GRs with diffusion architectures as the backbone. However, a fatal limitation of existing diffusion-based GRs is that the diffusion process applies uniformly to all items within the historical interactions. In contrast, the user preference is shaped by multifaceted time-evolving factors and thus exhibits a non-stationary distribution in the temporal aspect. To bridge this gap, this study proposes a novel GR framework, named TDPM, by designing the time-aware diffusion on SID tokens. Specifically, TDPM explicitly integrates the impact of time-evolving user preferences into the diffusion process. In detail, the user preference is disentangled into (i) the period preference, which remains consistent over a long time-span, and (ii) the point preference, which is triggered by recent focal events. Extensive experiments on three public real-world datasets demonstrate the significant superiority of TDPM over the state-of-the-art baselines. TDPM achieves average improvements of up to 29.21% and 25.45% in terms of HR@20 and NDCG@20, respectively. The ablation study further underscores the necessity of time-aware token diffusion in diffusion-based GRs.
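
The two reported metrics, HR@20 and NDCG@20, can be computed per user as in the following sketch. It assumes the common setup with a single held-out target item per user, which the abstract does not confirm; the function names are hypothetical, and dataset-level scores would be the mean over users.

```python
import math
from typing import Hashable, Sequence


def hit_rate_at_k(ranked: Sequence[Hashable], target: Hashable, k: int = 20) -> float:
    """1.0 if the target item appears in the top-k recommendations, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[Hashable], target: Hashable, k: int = 20) -> float:
    """NDCG with one relevant item: 1 / log2(rank + 1) for a 1-based rank within top-k."""
    top_k = list(ranked[:k])
    if target not in top_k:
        return 0.0
    position = top_k.index(target)  # 0-based position in the ranking
    return 1.0 / math.log2(position + 2)


ranking = ["item_a", "item_b", "item_c"]
print(hit_rate_at_k(ranking, "item_b"), ndcg_at_k(ranking, "item_b"))
```

HR@K only records whether the target was retrieved, while NDCG@K also rewards placing it near the top, which is why the two improvements in the abstract differ.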

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on generative recommendation using time-aware diffusion models on semantic indices. It does not involve visual encoders, reinforcement learning, or multimodal large language models. While it utilizes tokens (SIDs), it is not primarily about tokenizer architecture or world models. The provided keywords are largely mismatched with the paper's domain.

Keywords

Generative Recommendation, Time-Aware Diffusion, Preference Disentanglement, Semantic Indices, Diffusion Models, User Preference, Temporal Factors

Score: 13.5 / 27.8
Authors: Christopher Thirgood, Dipon Kumar Ghosh, Simon Hadfield
Published: 2026-06-01
TL;DR: TIDES introduces a continuous-time event simulator using dynamic Gaussian splatting to generate high-fidelity event streams by avoiding timestamp batching artifacts inherent in frame-based simulators.
Abstract (Chinese translation)

事件相机(Event cameras)对环境外观变化发出异步事件。真实世界事件数据集的稀缺性使得仿真至关重要。然而,大多数模拟器从帧序列推断事件时间戳,迫使许多阈值穿越共享一小组离散的时间点;一种我们称之为时间戳批处理(timestamp batching)的故障模式,在快速运动和遮挡情况下会恶化。我们提出 TIDES,一种基于动态高斯泼溅(dynamic Gaussian splatting)的连续时间事件模拟器。由于 TIDES 基于具有学习得到的几何和运动的显式 3D 场景表示运行,它可以直接从场景推导像素级强度动态,而不是通过计算渲染帧之间的差分。这使得能够准确预测阈值穿越,包括每个渲染步骤中的多次穿越,而无需时间上采样或帧插值。同一个 3D 场景模型揭示了物体相互部分遮挡的位置;TIDES 利用这一点指导自适应时间步长,仅在遮挡动力学使亮度变化的简单模型不可靠的区域集中计算。最后,我们使用瓦片级仲裁器(tile-level arbiter)对有限的传感器带宽进行建模,其吞吐量、抖动和事件丢失再现了真实的传感器伪影。在成对的 RGB-事件基准测试中,TIDES 达到了最先进的事件流保真度。我们还表明,TIDES 模拟的事件比竞争对手的模拟事件更有效地迁移到真实的下游任务中。

Abstract

Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times, a failure mode we term timestamp batching that worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than competitors'.
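
The idea of predicting several threshold crossings inside one rendering step, each with its own timestamp, can be illustrated for a single pixel. This sketch uses the usual event-camera model (an event fires each time log intensity moves by a contrast threshold from the last event level) and assumes log intensity changes linearly within the step. That linear model is a simplification chosen for illustration; the abstract does not give TIDES's formulas, and all names here are hypothetical.

```python
from typing import List, Tuple


def crossing_times(l_ref: float, l0: float, l1: float,
                   t0: float, t1: float,
                   threshold: float) -> Tuple[List[Tuple[float, int]], float]:
    """Return ([(time, polarity), ...], new reference level) for one pixel.

    l_ref is the log intensity at the last emitted event; l0 and l1 are the
    log intensities at times t0 and t1. Assumes |l0 - l_ref| < threshold.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    events: List[Tuple[float, int]] = []
    if l1 == l0:
        return events, l_ref
    polarity = 1 if l1 > l0 else -1
    n = 1
    while True:
        level = l_ref + polarity * threshold * n
        if polarity * (l1 - level) < 0:
            break  # this level is not reached before the step ends
        # Solve the linear model for the instant at which `level` is hit.
        t = t0 + (level - l0) / (l1 - l0) * (t1 - t0)
        events.append((t, polarity))
        n += 1
    return events, l_ref + polarity * threshold * (n - 1)


events, new_ref = crossing_times(0.0, 0.0, 1.0, 0.0, 1.0, 0.25)
print(events, new_ref)
```

A frame-differencing simulator would stamp all of these events with one of the two frame times; solving for each crossing instant is what avoids the timestamp batching described in the abstract.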

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 2.0/10 3.0

Scoring rationale: The paper focuses on event camera simulation using dynamic Gaussian splatting, which has minimal overlap with MLLM, Tokenizer, or RL specific architectures. While it involves visual representation and world simulation, it does not address unified models, multi-modal language understanding, or reinforcement learning algorithms directly. No expert authors from the specified list were found.

Keywords

Event cameras, Dynamic Gaussian splatting, Time-derivative event simulation, 3D scene representation, Timestamp batching, Sensor bandwidth modeling, Event-stream fidelity

Score: 13.5 / 27.8
Authors: Can Zhang, Gim Hee Lee
Published: 2026-06-01
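TL;DR: SCAPO proposes a self-supervised framework that estimates the canonical geometry and joint parameters of articulated objects from a single RGB-D observation, without ground-truth labels.
Abstract (Chinese translation)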

现有的基于单 3D 观测进行类别级物体运动估计的方法通常依赖于密集监督、多帧输入或 CAD 模板,但仍难以将几何与运动解耦,或恢复显式关节参数。我们提出 SCAPO,这是一个自监督框架,能够从单个 RGB-D 观测中估计规范几何、刚性部件分割以及关节枢轴、轴和运动状态,无需真实标签或类别特定模型。我们的 SCAPO 首先使用一个 SE(3) 等变向量神经元自编码器来分离全局姿态,并将不同实例对齐到共享的规范空间。在此对齐后的形状上,设计了一个关节感知混合蒙皮模块来建模部件运动。我们通过观测形状与规范形状之间的循环重建,以及与可学习规范模板的跨空间对齐来学习这种表示,该模板将共享类别几何与实例特定残差形状解耦。在合成和真实铰接物体数据集上的实验表明,我们的 SCAPO 恢复了一致的部件结构和准确的运动参数,并优于所有自监督基线。

Abstract

Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 4.0/10 6.0
model-based RL 1.5 0.0/10 0.0

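Scoring rationale: The paper mainly addresses self-supervised 3D articulated pose estimation. It has some connection to 'MultiModal' (RGB-D input) and 'Visual Encoder' (encoder architecture), and a weak connection to 'Unify Models' (a unified representation of geometry and pose). It is entirely unrelated to Tokenizer, World Models, MLLM, and model-based RL, which belong to large models and reinforcement learning, whereas this paper belongs to computer vision.

Keywords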

Self-Supervised, Articulated Pose Estimation, Canonical Geometry, RGB-D Observation, Vector-Neuron Autoencoder, Joint Parameters, Category-Level

Score: 13.5 / 27.8
Authors: Dekel Galor, Adam Pikielny, Zhoutong Zhang, Ke Wang, Laura Waller, Jiawen Chen, Ilya Chugunov
Published: 2026-06-01
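TL;DR: Hist2Style proposes a histogram-guided bilateral-grid method that distills a large image editing model to achieve efficient, controllable, hallucination-free photorealistic style transfer.
Abstract (Chinese translation)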

照片级真实感风格迁移旨在将输入图像的色彩和色调匹配到风格目标,同时保留原始场景的内容和细节。尽管现有的大型图像模型可以辅助这类外观编辑,但它们高昂的计算需求、潜在的幻觉风险以及有限的用户控制能力,使其不适合高分辨率、实时工作流。我们提出 Hist2Style,这是一种双边网格表述,用于快速、边缘感知风格化,通过在双边空间中将操作约束为局部仿射变换来保持视觉保真度。我们的模型通过在由语言和视觉语言模型生成的大型监督语料库上进行训练,将大型图像编辑模型蒸馏为轻量级网络,以空间变化的色彩编辑为目标。该网络以风格目标的直方图嵌入为条件,提供了一个可解释接口,通过修改目标颜色分布来调整输出风格。总体而言,Hist2Style 通过构造保持内容结构,避免幻觉,并支持实时、高分辨率的照片级真实感风格化,具有交互式用户可控的色彩和色调调整功能。

Abstract

Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 0.0/10 0.0

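Scoring rationale: The paper focuses on image style transfer and bilateral grids, so it matches the keywords poorly. It does not involve tokenizers, world models, or reinforcement learning (score 0). It uses VLMs to generate training data (MLLM and MultiModal relevance 2 to 3), but multimodal large models are not its core. It does not present a unified model architecture (Unify Models) or a specific visual encoder (Visual Encoder relevance 2). The weighted total is 13.5, below the passing score.

Keywords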

Photorealistic style transfer, Bilateral grids, Histogram-guided, Model distillation, Real-time stylization, Color tone adjustment, Visual fidelity, Image editing

Score: 13.5 / 27.8
Authors: Kumar Abhishek, Ghassan Hamarneh
Published: 2026-06-01
TL;DR: This paper proposes a quality-guided semi-supervised learning framework for medical image segmentation that improves segmentation accuracy by training a network to estimate segmentation quality and reweight pseudolabels.
Abstract (Chinese translation)

训练准确的医学图像分割模型需要大量密集标注的数据,获取这些数据耗时且昂贵。半监督学习(SSL)通过利用丰富的无标签数据和有限的有标签数据来缓解这一问题。然而,大多数现代 SSL 方法依赖于无标签数据的伪标签(pseudolabels),通常通过模型置信度或不确定性来评估其可靠性,这些指标具有自指性,且缺乏明确的分割质量依据。相反,我们提出了一种质量引导的 SSL 框架,该框架训练一个专用网络,从图像 - 掩码对中估计分割质量。该预测器在质量多变的掩码上进行训练,这些掩码通过合成扰动(synthetic corruptions)生成,并结合了部分训练的分割模型的不完美输出,从而捕捉训练过程中遇到的真实错误模式。我们通过两种互补机制将质量预测器集成到 SSL 中:一种质量感知正则化损失(quality-aware regularization loss),以及一种基于质量的伪标签样本重加权方案(pseudolabel sample reweighting scheme)。我们的方法可作为现有 SSL 框架的即插即用式增强(drop-in enhancement)。在五个数据集和多种网络架构上的广泛实验表明,该方法相对于其他 SSL 方法具有一致的提升,推动了半监督医学图像分割领域达到最先进水平(state-of-the-art)。

Abstract

Training accurate medical image segmentation models requires large amounts of densely annotated data, which is costly and time-consuming to obtain. Semi-supervised learning (SSL) alleviates this by learning from both abundant unlabeled data and limited labeled data. However, most modern SSL methods rely on pseudolabels for unlabeled data, and typically assess their reliability through model confidence or uncertainty, measures that are self-referential and lack explicit grounding in segmentation quality. Instead, we propose a quality-guided SSL framework that trains a dedicated network to estimate segmentation quality from image-mask pairs. The predictor is trained on variable-quality masks generated through synthetic corruptions augmented with imperfect outputs from partially trained segmentation models, capturing realistic error patterns encountered during training. We integrate the quality predictor into SSL through two complementary mechanisms: a quality-aware regularization loss and a quality-based pseudolabel sample reweighting scheme. We show that our method serves as a drop-in enhancement to existing SSL frameworks. Extensive experiments across five datasets and multiple architectures demonstrate consistent improvements over competing SSL methods, advancing the state-of-the-art in semi-supervised medical image segmentation.
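The quality-based pseudolabel reweighting mentioned in the abstract can be illustrated as a weighted average of per-sample losses. This is only a minimal sketch, not the authors' implementation: the function name `quality_weighted_loss`, the normalization by the sum of weights, and the toy numbers are assumptions, since the abstract states only that pseudolabels are reweighted by predicted segmentation quality.

```python
def quality_weighted_loss(losses, qualities, eps=1e-8):
    """Average per-sample losses, weighting each by its predicted quality.

    losses:    per-sample pseudolabel losses (hypothetical values)
    qualities: predicted segmentation quality in [0, 1] for each pseudolabel
    """
    if len(losses) != len(qualities):
        raise ValueError("losses and qualities must have the same length")
    total_weight = sum(qualities)
    weighted_sum = sum(l * q for l, q in zip(losses, qualities))
    # eps guards against division by zero when every quality is 0
    return weighted_sum / (total_weight + eps)


# Toy batch: the second pseudolabel has a high loss but a low predicted quality,
# so it contributes little to the weighted objective.
losses = [0.2, 1.0, 0.4]
qualities = [0.9, 0.1, 0.5]

weighted = quality_weighted_loss(losses, qualities)
plain_mean = sum(losses) / len(losses)
```

In this toy batch the weighted loss is lower than the plain mean because the low-quality pseudolabel is down-weighted rather than discarded.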

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on semi-supervised learning for medical image segmentation using quality-guided pseudolabeling. It does not involve Unify Models, Tokenizers, World Models, MLLMs, or Model-Based RL. Scores for Visual Encoder and MultiModal are low (2.0) as these are standard components in segmentation but not the core novelty aligned with the provided keywords targeting large-scale multimodal and RL architectures.

Keywords

Quality-Guided, Semi-Supervised Learning, Medical Image Segmentation, Pseudolabels, Segmentation Quality, Synthetic Corruptions, Sample Reweighting

Score: 12.0 / 27.8
Authors: Guangjin Pan, Hui Chen, Hei Victor Cheng, Henk Wymeersch
Published: 2026-06-01
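TL;DR: RA-LWLM proposes a retrieval-augmented in-context localization framework built on a wireless foundation model, enabling cross-scene adaptive localization without retraining.
Abstract (Chinese translation)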

无线定位是第六代 (6G) 网络的一项基本能力。传统的基于模型的方法需要对传播环境进行精确建模,且在复杂的多径和非视距 (NLOS) 场景下性能会下降;而基于学习的方法则将模型参数紧密耦合于训练场景,每当基站 (BS) 配置或传播环境发生变化时,都需要代价高昂的重新训练。本文提出 RA-LWLM,一种检索增强上下文定位框架,该框架通过将场景特定信息外部化至每场景指纹数据库,而非将其编码在模型权重中,实现了无需训练的跨场景适应。该框架包含三个组件:一个冻结的无线基础模型 (FM) 编码器,用于将原始信道状态信息 (CSI) 映射为场景无关表示;一个检索模块,通过表示空间中的相似性搜索从每场景指纹数据库中选取最具信息量的参考样本;以及一个基于变换器的上下文学习 (ICL) 模块,通过将查询与检索到的参考样本融合来预测用户设备 (UE) 的位置。为了适应不同查询间变化的检索质量及传播复杂性,ICL 模块采用了混合专家 (MoE) 设计,其中专家专门化于不同的上下文长度,并由一个可学习的选择器进行软组合。广泛的基于射线追踪的实验表明,在具有多样基站 (BS) 配置的异构场景中,RA-LWLM 在已见和未见场景上实现了几乎相同的定位精度,且无需任何每场景重新训练,显著优于端到端及基于基础模型 (FM) 的基线方法。这些结果验证了所提出的检索增强上下文范式作为一种可扩展解决方案,适用于 6G 网络中的跨场景定位。

Abstract

Wireless localization is a fundamental capability of sixth-generation (6G) networks. Conventional model-based methods require accurate modeling of the propagation environment and degrade in complex multipath and non-line-of-sight scenarios, while learning-based methods couple model parameters tightly to the training scene, requiring costly retraining whenever the base station (BS) configuration or propagation environment changes. In this paper, we propose RA-LWLM, a retrieval-augmented in-context localization framework that achieves training-free cross-scene adaptation by externalizing scene-specific information into a per-scene fingerprint database rather than encoding it in model weights. The framework consists of three components: a frozen wireless foundation model (FM) encoder that maps raw channel state information into a scene-agnostic representation; a retrieval module that selects the most informative references from the per-scene database via similarity search in the representation space; and a transformer-based in-context learning (ICL) module that fuses the query with the retrieved references to predict the user equipment (UE) position. To accommodate varying retrieval quality and propagation complexity across queries, the ICL module adopts a mixture-of-experts design in which experts specialize in different context sizes and are softly combined by a learnable selector. Extensive ray-tracing-based experiments across heterogeneous scenes with diverse BS configurations show that RA-LWLM achieves nearly identical accuracy on seen and unseen scenes without any per-scene retraining, substantially outperforming end-to-end and FM-based baselines. These results validate the proposed retrieval-augmented in-context paradigm as a scalable solution for cross-scene localization in 6G networks.
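The retrieve-then-combine flow described in the abstract can be sketched in a few lines. This is a toy illustration under strong assumptions, not the RA-LWLM architecture: the embeddings stand in for the frozen encoder's output, each "expert" is a similarity-weighted average of its top-k references (a placeholder for the transformer-based ICL module), and the selector logits are fixed by hand rather than learned. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def retrieve(query, database, k):
    """Return the k (similarity, position) pairs most similar to the query."""
    scored = [(cosine(query, emb), pos) for emb, pos in database]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def expert_predict(query, database, k):
    """Placeholder expert: similarity-weighted mean of top-k reference positions.

    Assumes the retrieved similarities are positive, as in the toy data below.
    """
    refs = retrieve(query, database, k)
    total = sum(sim for sim, _ in refs)
    return tuple(sum(sim * pos[d] for sim, pos in refs) / total for d in range(2))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def moe_predict(query, database, context_sizes, selector_logits):
    """Softly combine experts that each use a different context size."""
    weights = softmax(selector_logits)
    preds = [expert_predict(query, database, k) for k in context_sizes]
    return tuple(sum(w * p[d] for w, p in zip(weights, preds)) for d in range(2))


# Per-scene fingerprint database: (embedding, known 2D position), hypothetical values
database = [
    ([1.0, 0.0], (0.0, 0.0)),
    ([0.9, 0.1], (1.0, 0.0)),
    ([0.0, 1.0], (10.0, 10.0)),
]
query = [1.0, 0.05]

nearest = retrieve(query, database, 1)[0][1]
estimate = moe_predict(query, database, context_sizes=[1, 2], selector_logits=[0.0, 0.0])
```

Adapting to a new scene in this sketch means swapping `database`, with no change to the functions, which mirrors the abstract's point that scene-specific information lives in the fingerprint database rather than in model weights.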

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 1.0/10 1.5

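Scoring rationale: The paper studies retrieval-augmented in-context localization built on a wireless foundation model, which differs substantially from the domain of the provided keyword set (multimodal large models, visual encoders, reinforcement learning, and so on). It does not involve visual encoders, multimodal data fusion, or reinforcement learning algorithms, and uses only foundation-model representations and retrieval augmentation. Therefore, apart from 'Unify Models', which is slightly related through the foundation-model concept, all other keywords have low relevance.

Keywords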

Wireless localization, Retrieval-augmented, In-context learning, Wireless foundation model, Cross-scene adaptation, Channel state information, Mixture-of-experts, 6G networks

Score: 12.0 / 27.8
Authors: Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan, Yiran Xu, Wanrong Zhu, Jason Kuen, Koustava Goswami, Rajiv Jain, Yongxin Chen, Molei Tao, Jiuxiang Gu
Published: 2026-06-01
TL;DR: FLARE proposes a systematic conversion framework enabling hybrid-attention language models to support both autoregressive and diffusion inference simultaneously, achieving consistent throughput gains while preserving checkpoint capabilities.
Abstract (Chinese translation)

自回归 (AR) 大型语言模型 (LLMs) 已取得了广泛的实际应用成功,但顺序解码仍是低延迟部署的关键瓶颈。近期高效推理工作沿两个方向取得进展:一是通过高效架构降低每次模型调用的成本,二是通过并行生成减少串行解码步骤。混合注意力骨干网络解决了前者问题,而扩散语言模型 (dLLMs) 则通过迭代并行去噪实现后者。结合这些优势仍具挑战性:从 AR 模型到 dLLMs 的转换往往无法保留种子检查点能力,且混合注意力循环状态及掩码约束使得扩散训练与服务颇具挑战。本文提出 FLARE,一个针对混合注意力 LLMs 的系统转换框架。我们的分析表明,迁移数据质量是能力保留的首要决定因素,其重要性超过损失函数设计与注意力掩码设计。所得框架结合了词元平等的自回归与扩散目标、硬件感知内核及统一推理,使得单个检查点既能支持 AR 风格的验证解码,也能支持扩散风格的并行去噪。从具有有限后训练数据的强 AR 检查点出发,FLARE 在各类模型规模上与领先的开源 dLLMs 相当,并在单 GPU 并发服务场景下相对于开源 dLLM 基线带来一致的吞吐量增益。研究结果进一步表明,实用的 dLLMs 不仅受限于解码算法,还受限于迁移数据质量及当前块扩散目标的训练效率低下,这激励了数据、目标、架构与推理系统的联合设计。

Abstract

Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models (dLLMs) pursue the latter via iterative parallel denoising. Combining these advantages remains challenging: AR-to-dLLM conversion often fails to preserve seed-checkpoint capability, and hybrid-attention recurrent states and masking constraints make diffusion training and serving nontrivial. We present FLARE, a systematic conversion framework for hybrid-attention LLMs. Our analysis identifies transfer data quality as the primary determinant of capability preservation, outweighing loss formulation and attention-mask design. The resulting framework combines a token-equal AR-and-diffusion objective, hardware-aware kernels, and unified inference, enabling one checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Starting from strong AR checkpoints with limited post-training data, FLARE is competitive with leading open-source dLLMs across model scales and delivers consistent throughput gains over open-source dLLM baselines in single-GPU concurrent serving. Our results further suggest that practical dLLMs are limited not only by decoding algorithms, but also by transfer data quality and the training inefficiency of current block-diffusion objectives, motivating joint design of data, objectives, architectures, and inference systems.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on converting Autoregressive LLMs to Diffusion Language Models for inference efficiency. It moderately relates to 'Unify Models' by unifying AR and diffusion inference modes, and 'Tokenizer' via token-equal objectives. However, it is purely text-based, lacking visual encoders, multimodal components, world modeling, or reinforcement learning, resulting in 0 scores for those keywords. The weighted total (12.0) is below the dynamic passing score (27.8), indicating low relevance to the specified keyword track.

Keywords

Diffusion Language Model, Hybrid Attention, Autoregressive, Inference Efficiency, Parallel Denoising, Unified Inference, Transfer Learning

Score: 12.0 / 27.8
Authors: Zewen Liu, Zhan Shi, Yisi Sang, Bing He, Minhua Lin, Tianxin Wei, Dakuo Wang, Benoit Dumoulin, Wei Jin, Hanqing Lu
Published: 2026-06-01
TL;DR: The paper proposes an Adaptive Auto-Harness framework to sustain LLM agent self-improvement in open-ended task streams through task-wise adaptation and multi-agent evolution, demonstrating superior performance over existing baselines.
Abstract (Chinese translation)

像 A-Evolve、GEPA 和 Meta-Harness 这样的自动 Harness 系统通过从执行反馈中优化提示词、技能、工具、记忆和支持基础设施来提升大语言模型代理(LLM agents)的性能,但它们通常在固定离线基准(fixed offline benchmarks)上进行评估。实际部署场景则呈现开放式任务流(open-ended task streams):历史记录持续增长而无固定终点,异构任务需要不同的 Harness,且问题分布随时间发生漂移。这些挑战使得单一反复且密集更新的 Harness 变得脆弱,导致性能退化:准确率早期达到峰值后随即下降。这促使了持续 Harness 构建与任务级适应(task-wise adaptation)的需求。本文介绍了 Adaptive Auto-Harness,这是一个针对此类任务流(streams)的框架与系统。该框架将到理想 Harness(oracle harness)的差距分解为演化损失(evolution loss)和适应损失(adaptation loss)。该系统通过有状态多智能体演化器(stateful multi-agent evolver)、带求解时路由(solve-time routing)的 Harness 树,以及在历史缺乏所需信号时的人类引导钩子(human-steering hooks)来处理这些损失。在预测市场(prediction-market)、安全竞赛(security-competition)和事件预测(event-forecasting)流中,Adaptive Auto-Harness 优于五种现有的自动 Harness 基线;消融实验将性能提升归因于更优的构建、路由或针对性的人类引导。代码可在 https://github.com/A-EVO-Lab/AdaptiveHarness 获取。

Abstract

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open-ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task-wise adaptation. We introduce Adaptive Auto-Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi-agent evolver, a harness tree with solve-time routing, and human-steering hooks for cases where history lacks the needed signal. Across prediction-market, security-competition, and event-forecasting streams, Adaptive Auto-Harness outperforms five existing auto-harness baselines, and ablations attribute gains to better construction, routing, or targeted human steering. Code is available at https://github.com/A-EVO-Lab/AdaptiveHarness.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 3.0/10 4.5
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

Scoring rationale: The paper focuses on an Adaptive Auto-Harness framework for LLM agents in open-ended task streams, emphasizing adaptation and evolution. It does not address tokenizer design, visual encoders, or multimodal architectures (Tokenizer, Visual Encoder, MultiModal score 0). While it involves feedback loops resembling RL (model-based RL score 2) and modeling task dynamics (World Models score 3), it does not explicitly unify model architectures (Unify Models score 1) or emphasize multimodal capabilities (MLLM score 2). The domain focus is on agent system deployment rather than the specific technical components listed in the keywords. The calculated weighted score is 12.0, which is below the dynamic passing score of 27.8, indicating low relevance to the provided keyword set.

Keywords

Adaptive Auto-Harness, LLM Agents, Open-Ended Task Streams, Self-Improvement, Multi-Agent Evolver, Harness Tree, Task-Wise Adaptation, Execution Feedback

Score: 12.0 / 27.8
Authors: Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang
Published: 2026-06-01
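TL;DR: This paper addresses deep spurious regression in continuous prediction tasks by exploiting attribute similarity to calibrate label and feature distributions, significantly improving generalization in domains such as computer vision and language models.
Abstract (Chinese translation)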

现实世界回归往往表现出“捷径”现象:在训练数据中与连续目标虚假相关的属性,在部署偏移下却不可靠;利用此类捷径进行回归预测可能在测试时发生灾难性失败。现有关于虚假相关性的研究主要集中在分类任务上,其中标签是类别性的,组别也是自然定义的。然而,许多现实世界任务需要连续预测,此时不存在硬标签边界或离散的组 - 标签对。我们将深度虚假回归(Deep Spurious Regression, DSR)定义为从存在属性与标签混杂的回归数据中学习,旨在解决连续虚假相关性问题,并能在测试时泛化至所有属性 - 标签组合。受分类与回归捷径之间内在差异的启发,我们提出利用虚假属性在标签空间和特征空间中的相似性,从而兼顾邻近目标和相关组别,同时校准跨属性的标签分布和学习到的特征分布。在涵盖计算机视觉、环境感知和大语言模型(LLM)回归的常见真实世界 DSR 数据集上进行的广泛实验验证了我们的策略的优越性能。我们的工作填补了在连续预测中研究虚假相关性的基准和方法方面的空白。

Abstract

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

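Scoring rationale: The paper's core topic is Deep Spurious Regression, focusing on attribute-label confounding and generalization in continuous prediction tasks. The provided keywords mainly concern multimodal architectures (Tokenizer, Visual Encoder, MLLM, MultiModal), world models, and reinforcement learning (World Models, model-based RL). The paper touches computer vision and LLM regression data only in its experiments and does not explore model unification, encoder design, world modeling, or reinforcement learning frameworks in depth, so the relevance scores are low. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang, and others), so no bonus applies. The weighted total is (2.0+1.0+1.0+0.0+2.0+2.0+0.0)*1.5 = 12.0, below the dynamic passing score of 27.8.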
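The weighted total quoted in this rationale can be reproduced with a short calculation. The sketch below is illustrative only: the helper `weighted_total` and the comparison against the threshold are assumptions about how the report combines the table columns (weight times relevance, summed over keywords), using the relevance values listed in the table above.

```python
PASS_THRESHOLD = 27.8  # dynamic passing score quoted in the report


def weighted_total(relevance, weight=1.5):
    """Sum weight * relevance over all keywords (hypothetical helper)."""
    return sum(weight * score for score in relevance.values())


# Relevance values (out of 10) from the scoring table for this paper
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 1.0,
    "World Models": 0.0,
    "MLLM": 2.0,
    "MultiModal": 2.0,
    "model-based RL": 0.0,
}

total = weighted_total(relevance)
passed = total >= PASS_THRESHOLD
print(f"Score: {total} / {PASS_THRESHOLD}, passed: {passed}")
```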

Keywords

Deep Spurious Regression, Continuous Prediction, Attribute-Label Confounding, Spurious Correlations, Generalization, Computer Vision, LLM Regression

Score: 12.0 / 27.8
Authors: Ning Lin, Luxi Chen, Huaguan Chen, Jiacheng Cen, Chongxuan Li, Wenbing Huang, Hao Sun
Published: 2026-06-01
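TL;DR: This paper proposes a symmetrization framework that converts 2D continuous representations into symmetric ones, enabling effective symmetry control in visual pattern design tasks.
Abstract (Chinese translation)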

生成具有特定对称性的对象在各种现实场景中至关重要。然而,将现有的 2D 连续表示调整为强制平面群对称性仍然是一个挑战,因为非反射群元素的变换可能会破坏连续性。为了克服这一限制,我们提出了一种适用于任意平面群的对称化框架。我们的方法将任何 2D 连续表示转换为对称表示,同时保持连续性。我们提供了该表示的数学表述,展示了其对对称函数的逼近能力,并详细说明了构建方法。我们通过三个视觉设计任务(图案设计、剪纸设计和风格化拓扑设计)和一个材料设计任务来验证我们的方法。实验证实,我们的表示能够实现有效的对称控制,并展示了其更广泛的应用性。

Abstract

Generating objects with specific symmetries is essential in various real-world scenarios. However, adapting existing 2D continuous representations to enforce planar group symmetry remains a challenge, as the transformation of non-reflective group elements may disrupt continuity. To overcome this limitation, we propose a symmetrization framework for arbitrary planar groups. Our method transforms any 2D continuous representation into a symmetric one while preserving continuity. We provide the mathematical formulation of this representation, demonstrate its approximation capability for symmetric functions, and detail the construction methodology. We validate our approach through three visual design tasks (pattern design, paper-cutting design and stylized topology design) and one material design task. Experiments confirm that our representation enables effective symmetry control and demonstrate its broader applicability.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

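Scoring rationale: The paper's core is a geometric symmetry framework and pattern generation, which belongs to computer graphics. The provided keywords mainly focus on large language models, reinforcement learning, and world models, and relate only weakly to this paper. Only 'Visual Encoder' (visual representation) and 'MultiModal' (visual tasks) are involved; the others, such as Tokenizer, MLLM, model-based RL, and Unify Models (which usually refers to unifying model architectures), have no direct connection. The weighted total is 12.0, below the dynamic passing score of 27.8.

Keywords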

Planar Symmetric Pattern Generation, Symmetrization Framework, Planar Group Symmetry, 2D Continuous Representation, Visual Design Tasks, Pattern Design, Material Design

Score: 12.0 / 27.8
Authors: Thomas Chen, Zhiyuan Li
Published: 2026-06-01
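TL;DR: This paper proposes a theoretical framework for self-play theorem proving based on graph theory and contrastive learning, addressing diversity in theorem generation; it does not involve multimodality or world models.
Abstract (Chinese translation)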

自我对弈(Self-play)是一种能够使模型自我改进的训练算法,最近在利用大语言模型(LLMs)进行形式化定理证明的背景下显示出有前景的经验结果。(Dong & Ma, 2025) 通过两个协作智能体实例化了自我对弈:一个证明者(prover),负责证明定理;一个猜想者(conjecturer),负责生成新定理作为证明者的课程。本文提供了一个理论框架,用于理解自我对弈算法在定理证明中的自我改进能力。首先,我们将定理集形式化为一个图,其中节点代表定理,边连接语义相似的定理对。我们引入一组基本假设,刻画了训练证明者的保证性质,以及猜想者如何访问该图的结构。其次,我们表明,如果定理底层图连通性良好,那么一个证明者 - 猜想者系统(其中猜想算法基于可逆随机游走)足以使已证明定理集指数级增长。第三,鉴于自我对弈算法在经验中遇到的一个问题(即猜想者倾向于生成人为复杂且非基础的定理),我们提出了一种针对猜想者生成的定理训练分布的多样性度量,以及一种改进的猜想算法,该算法通过计算定理图中相邻定理之间的扩散相似性来局部最大化这一多样性度量。最后,我们描述了一种计算扩散相似性的方法:利用对比学习将节点嵌入到欧几里得空间,然后计算嵌入向量之间的内积。

Abstract

Self-play, a type of training algorithm that enables a model to self-improve, has recently shown promising empirical results in the context of formal theorem proving using Large Language Models (LLMs). (Dong & Ma, 2025) instantiate self-play with two cooperating agents: a prover, which proves theorems, and a conjecturer, which generates new theorems as a curriculum to the prover. In this paper, we provide a theoretical framework for understanding the self-improvement capabilities of self-play algorithms for theorem proving. First, we formalize the set of theorems as a graph, with nodes as theorems and edges between pairs of theorems with similar semantics. We introduce a set of primitive assumptions that characterize the guarantees of a trained prover and how a conjecturer can access the structure of the graph. Second, we show that if the underlying graph of theorems is well-connected, then a prover-conjecturer system, where the conjecturing algorithm is based on a reversible random walk, is sufficient to grow the set of proved theorems exponentially. Third, motivated by an issue encountered empirically by self-play algorithms, where the conjecturer tends to generate artificially complex and non-fundamental theorems, we propose a diversity measure for a training distribution of theorems generated by a conjecturer and an improved conjecturing algorithm that locally maximizes this diversity measure, by computing the diffusion similarity between neighboring theorems in the theorem graph. Finally, we describe a method to compute the diffusion similarity by using contrastive learning to embed nodes into Euclidean space and then computing the inner-product between embeddings.
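To make the "reversible random walk" on a theorem graph concrete, the sketch below checks detailed balance for the simple random walk on a small undirected graph. It is an illustration of the general concept, not the paper's conjecturing algorithm: the graph, the node names, and the choice of the simple random walk are assumptions. The fact it relies on is standard: on an undirected graph, the simple random walk is reversible with respect to the distribution proportional to node degree.

```python
from fractions import Fraction


def transition_matrix(adjacency):
    """Simple random walk: from node u, move to a uniformly chosen neighbour."""
    return {
        u: {v: Fraction(1, len(nbrs)) for v in nbrs}
        for u, nbrs in adjacency.items()
    }


def stationary_distribution(adjacency):
    """Distribution proportional to degree (stationary for the simple walk)."""
    total_degree = sum(len(nbrs) for nbrs in adjacency.values())
    return {u: Fraction(len(nbrs), total_degree) for u, nbrs in adjacency.items()}


def is_reversible(adjacency):
    """Check detailed balance pi[u] * P[u][v] == pi[v] * P[v][u] on every edge.

    Assumes the graph is undirected: v is listed under u exactly when
    u is listed under v.
    """
    P = transition_matrix(adjacency)
    pi = stationary_distribution(adjacency)
    return all(
        pi[u] * P[u][v] == pi[v] * P[v][u]
        for u in adjacency
        for v in adjacency[u]
    )


# Hypothetical theorem graph: an edge links two semantically similar theorems
theorem_graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}

reversible = is_reversible(theorem_graph)
pi = stationary_distribution(theorem_graph)
```

Exact fractions are used so that the detailed-balance check is an equality test rather than a floating-point tolerance.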

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 2.0/10 3.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

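Scoring rationale: The paper studies a theoretical framework for self-play theorem proving with large language models, involving graph-theoretic modeling and contrastive learning. Although it uses LLMs and involves self-play (conceptually related to RL), it does not involve multimodal processing, visual encoders, specific tokenizer design, or explicit model-based reinforcement learning, so it matches the given keyword set (which emphasizes multimodality, world models, and unified models) poorly. The author list does not include any of the specified experts, so no bonus applies. The weighted total is 12.0, below the dynamic passing score of 27.8.

Keywords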

Self-Play, Theorem Proving, Large Language Models, Graph Theory, Prover-Conjecturer, Contrastive Learning, Diversity Measure

Score: 12.0 / 27.8
Authors: Taehan Kim, Sarrah Rose Mikhail Leung, Bharat Mekala, Jeongbin Park
Published: 2026-06-01
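TL;DR: Site4Drug is an AI agent that predicts drug-binding target sites on proteins by evaluating multiple drug modalities and producing an evidence-based ranked list.
Abstract (Chinese translation)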

在蛋白质上选择干预位点(即选择可靶向位点)往往比选择结合物更为模糊且易成为瓶颈,尤其是在膜蛋白中,因为可及性、拓扑结构和翻译后修饰(PTMs)限制了可作用区域。我们提出 Site4Drug,一种模态感知位点发现工具,输出包含明确约束、证据摘要、风险标记和可追溯决策日志的排序可靶向区域列表。与要求用户预先指定药物模态不同,Site4Drug 可以从用于位点发现的相同证据中推荐结合模态(例如抗体/肽类 vs 小分子),包括拓扑结构、疏水性、PTM 倾向、二硫键、结构域上下文和序列。重要的是,这种证据在所有模态中一致应用,包括小分子口袋发现,以避免选择化学上合理但生物学上遮蔽的位点。

Abstract

Selecting where to intervene on a protein (i.e., choosing a targetable site) is often a more ambiguous and failure-prone bottleneck than selecting what binds, especially for membrane proteins where accessibility, topology, and post-translational modifications (PTMs) constrain actionable regions. We present Site4Drug, a modality-aware site-finding agent that outputs a ranked list of targetable regions with explicit constraints, evidence summaries, risk flags, and a traceable decision log. Rather than requiring users to specify the drug modality upfront, Site4Drug can recommend a binding modality (e.g., antibody/peptide-like vs small-molecule) from the same evidence used for site discovery, including topology, hydropathy, PTM propensity, disulfides, domain context, and sequence. Importantly, this evidence is applied consistently across modalities, including small-molecule pocket discovery, to avoid selecting chemically plausible but biologically occluded sites.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 3.0/10 4.5
model-based RL 1.5 1.0/10 1.5

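Scoring rationale: This paper belongs to bioinformatics and drug discovery; its core contribution is an AI agent that predicts protein binding sites. It matches the given keywords (which mainly concern large-model architectures, visual encoding, world models, and reinforcement learning) poorly. 'MultiModal' has slightly higher relevance because the paper applies unified evidence evaluation across different drug modalities (small molecule vs antibody). 'Unify Models' is partly relevant because the evidence is unified, although no model architectures are unified. The remaining keywords (Tokenizer, Visual Encoder, World Models, MLLM, model-based RL) are neither explicitly mentioned in the abstract nor core contributions, so they score low. The author list does not include any of the specified experts, so no bonus applies.

Keywords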

Site4Drug, Drug-Binding Target Sites, AI Agent, Modality-Aware, Protein Binding, Targetable Regions, Evidence Summaries

Score: 12.0 / 27.8
Authors: Ziqin Gao, Zhijie Yang, Qiang Zou
Published: 2026-06-01
TL;DR: KDH-CAD addresses CAD data scarcity by integrating foundation model knowledge with structured domain knowledge and minimal labeled data, achieving high classification accuracy without fine-tuning.
Abstract (Chinese translation)

计算机辅助设计(CAD)中的深度学习从根本上仍受限于数据稀缺挑战:真实的大规模 CAD 数据难以收集,而合成数据可能无法忠实地反映实际设计实践。本文并未追求日益庞大的 CAD 数据集,而是转而将 CAD 学习视为一个知识补全与校准问题。本文引入了 KDH-CAD,这是一种知识 - 数据混合框架,它整合了基础模型(Foundation Models)中的预训练知识、来自教材/教程的结构化领域知识以及极少量的标注 CAD 数据。领域知识用于提取和补全在预训练基础模型中表达微弱或表示不足的 CAD 相关概念,而标注的 CAD 数据则在潜在空间中校准这些概念,以考虑任务特定的几何变异性,且无需对基础模型进行微调。真实世界机械零件分类的实验表明,KDH-CAD 在低数据场景下表现强劲,仅需 250 个训练样本即可达到 92.6% 的准确率,1000 个样本时达到 95.8%,且随着数据增加性能持续提升。该结果达到或超过了通常需多一个数量级数据的最先进水平。这些结果表明,将预训练基础模型与结构化领域知识相结合,可以显著减少对大规模 CAD 数据集的依赖,为数据高效的 CAD 学习提供了一条兼具原则性与实践性的方向。

Abstract

Deep learning in computer-aided design (CAD) remains fundamentally constrained by the data scarcity challenge: authentic CAD data is difficult to collect at scale, while synthetic data may not faithfully reflect real design practice. Rather than pursuing ever-larger CAD datasets, this paper alternatively treats CAD learning as a knowledge completion and calibration problem. It introduces KDH-CAD, a knowledge-data hybrid framework that integrates pretrained knowledge in foundation models, structured domain knowledge from textbooks/tutorials, and a very small amount of labeled CAD data. Domain knowledge is used to elicit and complete CAD-relevant concepts that are weakly expressed or under-represented in pretrained foundation models, while labeled CAD data calibrates these concepts in the latent space to account for task-specific geometric variability, without fine-tuning the foundation model. Experiments on real-world mechanical part classification show that KDH-CAD achieves strong performance in low-data regimes, reaching 92.6% accuracy with only 250 training samples, 95.8% with 1,000 samples, and continuing to improve with additional data. This matches or exceeds state-of-the-art performance that typically requires an order of magnitude more data. These results suggest that combining pretrained foundation models with structured domain knowledge can substantially reduce reliance on large-scale CAD datasets, providing a principled and practical direction for data-efficient CAD learning.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on CAD learning under data scarcity using a knowledge-data hybrid framework with foundation models. It does not address Tokenizers, World Models, or Model-Based RL. While it utilizes foundation models (loosely related to Unify Models/MLLM), the core contribution is knowledge integration and calibration rather than model unification or multimodal generation. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list (Ziqin Gao, Zhijie Yang, Qiang Zou).

Keywords

CAD learning, Data scarcity, Foundation models, Domain knowledge, Knowledge-data hybrid, Mechanical part classification, Latent space calibration

Score: 12.0 / 27.8
Authors: Wenmin Li, Shunsuke Sakai, Zhongkai Zhao, Tatsuhito Hasegawa
Published: 2026-06-01
TL;DR: This paper proposes a Pool-Select-Refine framework for allocation-aware generative dataset distillation that enhances sample quality and budget efficiency through latent space refinement using soft-label supervision.
Abstract (Chinese translation)

基于扩散的数据集蒸馏(Diffusion-based Dataset Distillation)近期已成为一种颇具前景的范式,旨在将大规模数据集浓缩为紧凑的合成数据集。通过利用预训练生成先验,这些方法相较于传统的基于匹配的方法,能更高效地生成逼真的类条件样本。然而,大多数现有的扩散方法仍采用僵硬的“生成并使用”(Generate-and-Use)策略,即在固定的每类图像预算下,直接将生成的样本视为最终的蒸馏集。这种设计将候选生成与最终的预算分配紧密耦合,可能导致有限预算的冗余浪费,或产生信息量不足的样本。本文提出"Pool-Select-Refine",这是一种用于感知预算分配的生成式数据集蒸馏的两阶段框架。首先,不同于直接使用固定数量的生成样本,我们构建一个过完备候选池,并在目标预算下从中选择一个紧凑子集。其次,我们在潜在空间中利用源自教师模型的软标签监督对选定的样本进行精炼,在保留生成先验的同时提升语义对齐性。该设计显式地解耦了生成、选择与精炼过程,从而实现了对蒸馏预算更有效的利用。在大规模及细粒度图像分类基准上的实验表明,所提出的框架相比扩散基线方法稳定地带来提升。结果表明,在精炼之前引入一个策展(curation)阶段是提高基于扩散的数据集蒸馏的一种简单而有效的方法。

Abstract

Diffusion-based dataset distillation has recently emerged as a promising paradigm for condensing large-scale datasets into compact synthetic sets. By leveraging pretrained generative priors, these methods can produce realistic class-conditional samples more efficiently than traditional matching-based approaches. However, most existing diffusion-based methods still adopt a rigid "Generate-and-Use" strategy, where the generated samples are directly treated as the final distilled set under a fixed images-per-class budget. Such a design tightly couples candidate generation with final budget allocation, which may result in redundant waste of the limited budget or insufficiently informative samples. In this paper, we propose "Pool-Select-Refine", a two-stage framework for allocation-aware generative dataset distillation. First, instead of directly using a fixed number of generated samples, we construct an over-complete candidate pool and select a compact subset under the target budget. Second, we refine the selected samples in latent space using soft-label supervision derived from the teacher model, improving semantic alignment while preserving the generative prior. This design explicitly decouples generation, selection, and refinement, enabling more effective use of the distillation budget. Experiments on large-scale and fine-grained image classification benchmarks show that the proposed framework delivers consistent gains over diffusion-based baselines. The results suggest that introducing a curation stage before refinement is a simple yet effective way to improve diffusion-based dataset distillation.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 2.0/10 3.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on generative dataset distillation using diffusion models for image classification, which has limited overlap with the provided keyword set focused on foundation models and reinforcement learning. Keywords like MLLM, MultiModal, and model-based RL are irrelevant (0). Visual Encoder and World Models have partial relevance due to the use of diffusion architectures (3, 2), while Unify Models and Tokenizer are minimally related (2, 1). The total weighted score is 12.0, below the dynamic passing threshold of 27.8, indicating low relevance to the specified research themes.

Keywords

Dataset Distillation, Diffusion Models, Allocation-Aware, Latent Refinement, Generative Prior, Soft-Label Supervision

Score: 10.5 / 27.8
Authors: Xiang Li, Jiwei Wei, Ke Liu, Yitong Qin, Jinyu Guo, Malu Zhang, Peng Wang, Yang Yang
Published: 2026-06-01
TL;DR: eMoT addresses LLM unreliability in multi-step reasoning by evolving memory-of-thought and symbolic anchoring, achieving superior accuracy and consistency on mathematical benchmarks without model scaling.
摘要翻译

尽管大型语言模型(LLMs)在多步推理任务上取得了令人印象深刻的性能,但其可靠性仍持续受到诸如不受约束的幻觉和数值计算性能不佳等关键限制。从根本上说,这些问题源于标准模型将推理视为瞬时的、一次性生成过程,而非保留并完善成功的程序性逻辑。为应对这些挑战,我们提出 eMoT(evolving Memory-of-Thought),一种通过将推理轨迹视为动态演化的记忆而非静态模板来稳定多步推理的统一框架。该框架主要由三个相互连接的模块组成:(i)一个记忆腐蚀机制,该机制强化高实用性的推理结构,同时逐渐衰减较不频繁的结构;(ii)一个符号锚定引擎,利用 Python 进行确定性计算,就像人类使用计算器一样;以及(iii)一个一致性驱动的精化过程,该过程将神经推理与符号结果对齐,从而减少逻辑不一致的累积。在多个推理基准测试中,eMoT 提高了准确性和解的一致性,优于标准的思维链(Chain-of-Thought)和结构化推理基线。在传统任务 24 点游戏中,eMoT 实现了 100% 的准确率,比基线高出多达 17.6%。在数学任务 GSM8K、ASDiv、SVAMP 和 MGSM 上的评估进一步展示了在多步数学推理方面的一致增益。在我们的评估中,尽管使用了具有受限基线能力的轻量级骨干模型,我们仍实现了卓越的性能。与依赖大规模模型的替代方法相比,我们的结果表明,性能增益本质上是由 eMoT 框架的推理控制驱动的,而非单纯依靠模型规模。

Abstract

While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability is persistently hindered by critical limitations such as unconstrained hallucinations and poor numerical computation. Fundamentally, these issues arise because standard models treat reasoning as a transient, one-off generation process rather than retaining and refining successful procedural logic. To address these challenges, we propose eMoT (evolving Memory-of-Thought), a unified framework that stabilizes multi-step reasoning by treating reasoning trajectories as dynamic, evolving memories rather than static templates. The framework primarily consists of three interconnected modules: (i) a memory corrosion mechanism that reinforces high-utility reasoning structures while gradually decaying less frequent ones; (ii) a symbolic anchoring engine that utilizes Python for deterministic computation, much like a human uses a calculator; and (iii) a consistency-driven refinement process that aligns neural inference with symbolic outcomes, reducing the accumulation of logical discrepancies. Across multiple reasoning benchmarks, eMoT improves accuracy and solution consistency over standard Chain-of-Thought and structured reasoning baselines. On the traditional Game of 24 task, eMoT achieves 100% accuracy, surpassing the baseline by up to 17.6%. Evaluations on the mathematical tasks GSM8K, ASDiv, SVAMP, and MGSM further show consistent gains in multi-step mathematical reasoning. In our evaluation, we achieve superior performance despite utilizing a lightweight backbone model with constrained baseline capabilities. Compared to alternative methods that rely on massively scaled models, our results demonstrate that the performance gains are fundamentally driven by the eMoT framework's reasoning control rather than sheer model size.
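The abstract's "memory corrosion" idea (reinforce useful reasoning structures, decay rarely used ones) can be illustrated with a toy store. This is only a sketch under stated assumptions, not eMoT's mechanism: the class `ThoughtMemory`, the multiplicative decay, the additive boost, the eviction floor, and the template names are all hypothetical, and usage frequency stands in for "utility".

```python
class ThoughtMemory:
    """Toy store of reasoning templates with usage-based reinforcement and decay."""

    def __init__(self, decay=0.5, boost=1.0, floor=0.1):
        self.decay = decay      # assumed multiplicative decay per step
        self.boost = boost      # assumed reward for a template that was used
        self.floor = floor      # assumed eviction threshold
        self.strength = {}

    def step(self, used_templates):
        # Every stored template fades a little on each step.
        for name in self.strength:
            self.strength[name] *= self.decay
        # Templates that were just used are reinforced.
        for name in used_templates:
            self.strength[name] = self.strength.get(name, 0.0) + self.boost
        # Templates that have faded below the floor are forgotten.
        self.strength = {n: s for n, s in self.strength.items() if s >= self.floor}


memory = ThoughtMemory()
memory.step(["decompose_then_compute", "guess_and_check"])
for _ in range(4):
    memory.step(["decompose_then_compute"])
print(memory.strength)
```

With these assumed constants, the template that keeps being used stays strong while the one used only once is eventually dropped.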

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 3.0/10 4.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 2.0/10 3.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on enhancing LLM multi-step reasoning via memory evolution and symbolic anchoring. It lacks multimodal components (Visual Encoder, MultiModal, MLLM), specific tokenizer innovations, and reinforcement learning (model-based RL). It unifies neural and symbolic reasoning (Unify Models) and employs memory mechanisms (World Models), yielding low-to-moderate relevance. No expert authors from the specified list were found, so no bonus points were applied.

关键词

Memory-of-Thought, Symbolic Anchoring, Multi-step Reasoning, LLM, Consistency-driven Refinement, Memory Corrosion, Python Anchoring

Score: 10.5 / 27.8
Authors: Jiazhen Lei, Tianze Cao, Yuxin Sha, Sihan Wang, Bingbing Wang, Fengyuan Zhu, Zeming Yang, Xiaohua Tian
Published: 2026-06-01
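TL;DR: RadioMaster is a multi-agent framework that uses large language models and domain knowledge to autonomously generate physical radio signals from user intent, significantly outperforming baselines in configuration viability and signal fidelity.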
摘要翻译

将用户意图转换为物理无线信号代表了无线原型设计中关键却极其繁琐的最后一步,因为它需要深入的物理层细节知识并带来巨大的实现挑战。大语言模型(LLMs)和多智能体系统已经革新了传统软件工程,引发了一个引人注目的问题:它们能否解决这些艰巨的困难?然而,我们的研究表明,当前模型在应用于无线信号生成时存在显著局限性,无法完成此任务。这种性能下降主要源于严重的领域知识匮乏以及对物理硬件约束的根本性不敏感。为了弥合这一差距,我们引入了 RadioMaster,这是一个完全自主的多智能体框架,旨在无缝地将用户输入转换为现实世界中的无线发射信号。RadioMaster 基于三个协同支柱运行:RadioWiki 用于领域特定知识检索,RadioAgent 用于协同生成 I/Q 样本并进行硬件配置,RadioEmulator 用于闭环物理层验证。此外,我们构建了 RadioBench,这是首个专门针对无线信号生成领域的综合基准。广泛的现实世界评估表明,RadioMaster 在配置可行性和信号保真度方面显著优于最先进(SOTA)基线。

Abstract

Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as it requires intricate knowledge of physical layer details and presents immense implementation challenges. Large Language Models (LLMs) and multi-agent systems have revolutionized conventional software engineering, raising the compelling question of whether they can resolve these formidable difficulties. However, our investigations reveal that current models experience significant limitations and fail to accomplish this task when applied to radio signal generation. This performance degradation primarily stems from severe domain ignorance and a fundamental insensitivity to physical hardware constraints. To bridge this gap, we introduce RadioMaster, a fully autonomous multi-agent framework designed to seamlessly translate user input into real-world wireless emissions. RadioMaster operates on three synergistic pillars: RadioWiki for domain-specific knowledge retrieval, RadioAgent for collaborative I/Q sample generation alongside hardware configuration, and RadioEmulator for closed-loop physical layer verification. Furthermore, we construct RadioBench, the first comprehensive benchmark tailored specifically for the radio signal generation domain. Extensive real-world evaluations demonstrate that RadioMaster significantly outperforms state-of-the-art (SOTA) baselines regarding configuration viability and signal fidelity.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

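评分理由: The paper focuses on radio signal generation with a multi-agent system. Although it uses LLMs, it does not involve visual encoders, world models, or core reinforcement learning mechanisms. The keyword set leans toward vision-text multimodality and model-based reinforcement learning, so it matches this paper's radio physical-layer generation task poorly. No specified expert authors were detected. The weighted total is 10.5, below the dynamic passing threshold of 27.8.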

关键词

Radio Signal Generation, Multi-Agent System, Large Language Model, Domain Knowledge, I/Q Sample, Hardware Configuration, Physical Layer Verification

Score: 10.5 / 27.8
Authors: Yihui Wang, Yonghui Yang, Jilong Liu, Fengbin Zhu, Le Wu, Tat-Seng Chua
Published: 2026-06-01
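TL;DR: The paper proposes a Shortcut Subspace Suppression framework that identifies and suppresses forgery-method-specific shortcut features, significantly improving the generalization of deepfake detection models across forgery methods.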
摘要翻译

深度伪造检测在不同伪造方法间的泛化能力较差,因为现有模型往往依赖于虚假的特定于方法的捷径,而这些捷径无法泛化到未见过的操纵中。尽管近期方法试图提升泛化能力,但它们缺乏一种显式的机制来识别并抑制学习表示中的此类捷径。本文提出了一种捷径子空间抑制(S^3)框架,该框架通过子空间建模显式地刻画并抑制特定于方法的捷径。我们的核心洞察在于:区分不同伪造方法的变异捕捉了特定于方法的伪影,因而可作为特定于捷径的有效代理。为此,我们训练了一个轻量级线性探针用于伪造方法分类,并执行奇异值分解(SVD)以提取主导捷径子空间。基于此框架,我们提出了两种互补策略以减少对捷径的依赖。在训练阶段,我们在特征表示中软性地抑制捷径子空间,鼓励模型依赖更具泛化性的线索来进行真实/伪造判别。在推理阶段,我们引入了一种无需训练的对应方法,该方法抑制与已识别捷径方向对齐的神经元,从而实现即插即用的泛化增强,并提升了可解释性。在多个基准数据集上的广泛实验表明,我们的方法显著提升了跨方法泛化能力,同时保持了强大的域内性能。代码将在论文录用后公开。

Abstract

Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and suppress such shortcuts in learned representations. In this work, we propose the Shortcut Subspace Suppression (S^3) framework, which explicitly characterizes and suppresses method-specific shortcuts via subspace modeling. Our key insight is that variations distinguishing different forgery methods capture method-specific artifacts and thus serve as an effective proxy for method-specific shortcuts. To this end, we train a lightweight linear probe for forgery method classification and perform Singular Value Decomposition (SVD) to extract the dominant shortcut subspace. Building on this formulation, we develop two complementary strategies to reduce shortcut reliance. During training, we softly suppress the shortcut subspace in feature representations, encouraging the model to rely on more generalizable cues for real/fake discrimination. At inference time, we introduce a training-free counterpart that attenuates neurons aligned with the identified shortcut directions, enabling plug-and-play generalization enhancement with improved interpretability. Extensive experiments on multiple benchmarks demonstrate that our method significantly improves cross-method generalization while maintaining strong in-domain performance. The code will be released upon acceptance of the submission.
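The subspace idea in the abstract (find the directions a forgery-method probe relies on, then damp features along them) can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it recovers only the single dominant direction by power iteration instead of a full SVD, and the probe weights `W`, the feature vector, and the suppression strength `alpha` are made-up values.

```python
import math


def top_right_singular_vector(W, iters=200):
    """Power iteration on W^T W: unit right singular vector with the largest singular value."""
    d = len(W[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]                # u = W v
        v_new = [sum(W[r][c] * u[r] for r in range(len(W))) for c in range(d)]  # W^T u
        norm = math.sqrt(sum(x * x for x in v_new))
        v = [x / norm for x in v_new]
    return v


def suppress(feature, direction, alpha):
    """Soft suppression: remove a fraction `alpha` of the component along `direction`."""
    coeff = sum(f * d for f, d in zip(feature, direction))
    return [f - alpha * coeff * d for f, d in zip(feature, direction)]


# Hypothetical probe weights: 3 forgery methods x 4 feature dimensions.
# The rows differ mostly along dimension 0, so that is the "shortcut" direction.
W = [[2.0, 0.1, 0.0, 0.0],
     [-2.0, 0.0, 0.1, 0.0],
     [0.1, 0.0, 0.0, 0.1]]
shortcut = top_right_singular_vector(W)
feature = [1.0, 1.0, 1.0, 1.0]
cleaned = suppress(feature, shortcut, alpha=1.0)
print([round(x, 3) for x in shortcut])
print([round(x, 3) for x in cleaned])
```

Setting `alpha` below 1.0 gives the "soft" variant, which shrinks rather than removes the shortcut component; a multi-dimensional subspace would repeat the projection for each retained singular vector.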

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

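评分理由: The paper's core is generalization and shortcut suppression in deepfake detection, which belongs to computer vision and representation learning. Among the keywords, 'Visual Encoder' (3.0) has some relevance because the work involves feature extraction, 'MultiModal' (2.0) is weakly related because deepfakes often involve multimodal data, and 'Unify Models' (2.0) is weakly related because the paper proposes a unified framework. The remaining keywords, 'Tokenizer', 'World Models', 'MLLM', and 'model-based RL' (0.0), mainly concern large language models and reinforcement learning and have no direct connection to this discriminative detection task. The weighted total is 10.5, below the dynamic passing threshold of 27.8, indicating low relevance to the given keyword set.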

关键词

Deepfake Detection, Generalizable, Shortcut Subspace Suppression, Forgery Methods, Feature Representations, Singular Value Decomposition, Cross-method Generalization, Subspace Modeling

Score: 10.5 / 27.8
Authors: Peter Chen, Xi Chen
Published: 2026-06-01
TL;DR: The paper proposes a two-fidelity tree-search algorithm (2FFS) that optimizes best-action identification in stochastic minimax trees by adaptively combining cheap heuristic evaluations and expensive accurate rollouts to reduce computational cost.
摘要翻译

本文研究随机极小值树中的固定置信度最优动作识别(BAI)问题。该问题在现代人工智能规划中日益重要,其中深度极小值搜索和结合语言模型长回放的蒙特卡洛树搜索(MCTS)面临根本性权衡:启发式评估成本低廉但存在偏差,而准确回放虽可靠却成本过高。我们提出 2FFS,一种双保真度树搜索算法,将多保真度平坦老虎机(multi-fidelity flat bandit)思想引入树结构中。该算法结合了极小值风格的快速扩展与 MCTS 风格的随机采样,自适应地决定何时采用廉价但有偏差的评估,何时调用昂贵且准确的评估以进行局部认证。我们证明了固定置信度下的正确性,确立了精确识别的有限停止时间,并为一般深度树给出了多项式深度的成本上界。在数值随机树实验中,与现有的 BAI-MCTS 基线相比,2FFS 使用的样本数和计算操作显著更少。

Abstract

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with long language-model rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations compared to the existing BAI-MCTS baseline.
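The cheap-versus-accurate tradeoff described above can be illustrated with a deterministic toy at a single decision node. This is not 2FFS: it ignores sampling noise, confidence levels, and tree structure, and the function `identify_best`, the bias bound, and the action values are all assumptions made for illustration. The only idea it keeps is that a cheap evaluation with a known bias bound can certify the best action on its own, and that accurate evaluations are needed only for candidates the cheap values cannot separate.

```python
def identify_best(actions, cheap_eval, accurate_eval, bias_bound):
    """Return (best_action, number_of_accurate_calls) for one decision node."""
    cheap = {a: cheap_eval(a) for a in actions}
    leader = max(cheap, key=cheap.get)
    # A rival stays in play if its upper bound reaches the leader's lower bound.
    rivals = [a for a in actions
              if a != leader and cheap[a] + bias_bound >= cheap[leader] - bias_bound]
    if not rivals:
        return leader, 0            # certified using cheap evaluations only
    candidates = [leader] + rivals  # pay for accuracy only where it matters
    accurate = {a: accurate_eval(a) for a in candidates}
    return max(accurate, key=accurate.get), len(candidates)


# Hypothetical values: cheap estimates are within 0.1 of the true values.
true_value = {"a": 0.9, "b": 0.5, "c": 0.85}
cheap_value = {"a": 0.8, "b": 0.45, "c": 0.9}
best, accurate_calls = identify_best(
    ["a", "b", "c"], cheap_value.get, true_value.get, bias_bound=0.1
)
print(best, accurate_calls)
```

Here the cheap values wrongly favour `c`, but because `a` cannot be ruled out within the bias bound, both are re-evaluated accurately and `b` is never paid for.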

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 3.0/10 4.5

评分理由: The paper focuses on algorithmic optimization for best-action identification in stochastic minimax trees (2FFS algorithm), which is largely unrelated to multimodal architectures, tokenizers, visual encoders, or world models. There is a minor connection to MLLM (mentioned in context of rollouts) and model-based RL (planning context), but the core contribution is theoretical search algorithm design rather than model learning or multimodal integration.

关键词

Two-Fidelity, Best-Action Identification, Stochastic Minimax Tree, Multi-fidelity, Monte Carlo Tree Search, Fixed-confidence, Minimax-style Expansion

Score: 10.5 / 27.8
Authors: Michał Brzozowski, Neo Christopher Chung
Published: 2026-06-01
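TL;DR: The paper shows that the stability of Archetypal SAEs is an initialization artifact rather than an architectural advantage, challenging their reliability for mechanistic interpretability in NLP.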
摘要翻译

基于稀疏自编码器(SAEs)的字典学习从神经网络激活中生成超完备基,这些基往往具有可解释性,并能降低多语义性。然而,SAEs 提取的特征在不同随机种子下存在显著差异——这一问题被称为不稳定性(instability)。原型 SAEs(Archetypal SAEs)(Fel et al., 2025)被提出作为一种通用的字典学习干预方法,旨在实现更可靠的概念提取,并声称在训练结束时能获得更稳定的字典。我们发现,原型 SAEs 所声称的稳定性源于在多次运行中设置了相同的初始化参数。通过我们的分析,我们试图澄清机制可解释性(mechanistic interpretability)中两个可能被模糊使用的不同概念:稳定性(stability)指两个独立训练模型之间的一致性,而稳定化(stabilization)指独立初始化的运行向共同解的收敛。这种区分对于自然语言处理(NLP)的机制可解释性至关重要,因为特征稳定性正被越来越多地用作证据,以表明 SAE 特征是可重用的分析单元。原型 SAEs 的实验共享一种确定性的 k-means 解码器初始化,在训练开始前将运行间的字典距离设为零。当移除这种初始化设置后,原型约束在我们的实验设置中并未提供稳定化优势。我们进一步发现了一个依赖于预处理的余弦几何问题,这使得终点稳定性指标的解释变得复杂。总体而言,我们的研究支持在更广泛的字典学习传统框架下研究 SAEs 的价值,同时表明稳定性声明需要轨迹诊断和初始化消融实验。

Abstract

Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduce polysemanticity. However, features from SAEs vary substantially across random seeds -- a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction, and report more stable dictionaries at the end of training. We demonstrate that the stability claimed by archetypal SAEs is a result of setting identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that may be ambiguously used: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability of natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. Experiments from archetypal SAEs share a deterministic k-means decoder initialization, setting inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.
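The point about shared initialization can be made concrete with a toy stability score between two dictionaries. The sketch below is an assumption-laden illustration rather than the paper's metric: it uses the mean best-match cosine similarity (published comparisons often use an optimal one-to-one matching instead), and the random dictionaries merely stand in for decoder weights before any training.

```python
import math
import random


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(u, v):
    # Inputs are unit vectors, so the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))


def stability(dict_a, dict_b):
    """Mean over atoms in dict_a of the best cosine match in dict_b (simplified score)."""
    return sum(max(cosine(a, b) for b in dict_b) for a in dict_a) / len(dict_a)


def random_dictionary(n_atoms, dim, seed):
    rng = random.Random(seed)
    return [normalize([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(n_atoms)]


shared_init = random_dictionary(8, 16, seed=0)   # stand-in for a deterministic shared init
run_a = shared_init
run_b = [atom[:] for atom in shared_init]        # second "run" starts from the same atoms
independent = random_dictionary(8, 16, seed=1)   # a run with its own initialization

same_init_score = stability(run_a, run_b)
independent_score = stability(run_a, independent)
print(round(same_init_score, 3), round(independent_score, 3))
```

Two runs that share an initialization already score a perfect match before any training step, which is why an endpoint score alone cannot separate stabilization from a shared starting point.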

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 3.0/10 4.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

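评分理由: The paper focuses on the stability of sparse autoencoders (SAEs) in NLP and finds that the stability stems from initialization rather than architecture. Relevance to the given keywords is low: although it touches on feature extraction (Tokenizer, Unify Models) and LLM foundations (MLLM), it does not involve multimodality (MultiModal, Visual Encoder), world models (World Models), or reinforcement learning (model-based RL).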

关键词

Sparse Autoencoders, Archetypal SAEs, Mechanistic Interpretability, Stability, Initialization, Dictionary Learning, NLP

Score: 10.5 / 27.8
Authors: Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan
Published: 2026-06-01
TL;DR: HMPO introduces a single-stage reinforcement learning framework that compresses Chain-of-Thought reasoning in Large Language Models by 19%-46% without significant accuracy loss across various tasks.
摘要翻译

大语言模型通过扩展的思维链(CoT)推理取得了显著性能,然而这一漫长的过程带来了巨大的推理开销。现有的 CoT 压缩方法面临着手动长度预算不灵活、多阶段训练流程计算成本高昂,以及仅限于小模型的可扩展性脆弱等问题。我们提出 HMPO(混合中位数策略优化),这是一种成本效益高的单阶段强化学习框架。HMPO 通过三个协同组件高效压缩 CoT:一个基于成功轨迹推导的自适应中位数预算,旨在消除手动调优;一个用于平滑长度惩罚的余弦衰减 token 奖励;以及一个通过严格优先答案正确性来显著缓解琐碎奖励滥用的乘法奖励形式。仅在数学数据上训练,HMPO 能无缝泛化至数学、代码、科学及指令遵循任务。在从 9B 到 122B 参数的稠密和混合专家(MoE)架构上进行的广泛实验表明,HMPO 实现了 19%–46% 的 token 压缩,且准确率退化可忽略不计,同时相比现有多阶段基线大幅降低了训练成本。

Abstract

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%--46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
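The three reward components named in the abstract can be sketched as follows. The abstract does not give formulas, so everything concrete here is an assumption: the function names, the exact cosine shape (full reward up to the budget, decaying to zero at a hypothetical `max_len`), and the sample rollout lengths. Only the overall structure follows the text: a budget taken as the median length of successful rollouts, a cosine-decay length term, and a multiplicative combination with correctness.

```python
import math
from statistics import median


def median_budget(lengths, correct):
    """Median token length over successful rollouts only (None if none succeeded)."""
    successful = [n for n, ok in zip(lengths, correct) if ok]
    return median(successful) if successful else None


def length_reward(n_tokens, budget, max_len):
    """Assumed shape: 1.0 up to the budget, cosine decay down to 0.0 at max_len."""
    if n_tokens <= budget:
        return 1.0
    if n_tokens >= max_len:
        return 0.0
    progress = (n_tokens - budget) / (max_len - budget)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def reward(is_correct, n_tokens, budget, max_len):
    """Multiplicative form: a wrong answer earns nothing, however short it is."""
    return (1.0 if is_correct else 0.0) * length_reward(n_tokens, budget, max_len)


lengths = [100, 200, 300, 50]
correct = [True, True, True, False]
budget = median_budget(lengths, correct)
print(budget, reward(True, 300, budget, max_len=400), reward(False, 10, budget, max_len=400))
```

The multiplicative form is what blocks the trivial exploit: an additive length bonus would still pay a short wrong answer, whereas here its reward is zero.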

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

评分理由: The paper focuses on Chain-of-Thought compression in Large Language Models using a single-stage RL framework (HMPO). It has low relevance to Visual Encoder, MultiModal, and World Models as it is text-only and lacks world modeling. While it involves tokens and RL, it focuses on policy optimization rather than tokenizer architecture or model-based planning, and does not unify different model architectures.

关键词

Chain-of-Thought Compression, Reinforcement Learning, Policy Optimization, Large Language Models, Inference Efficiency, Token Reduction, Single-stage Training

Score: 10.5 / 27.8
Authors: Yumiao Zhao, Bo Jiang, Beibei Wang, Xixi Wan, Xiao Wang, Jin Tang
Published: 2026-06-01
TL;DR: The paper proposes LoRSP, a framework that leverages spiking neural networks to generate low-rank sparse visual prompts, reducing redundancy and parameter count while maintaining competitive performance on vision tasks.
摘要翻译

视觉提示(VP)已成为一种高效范式,通过在输入层引入可学习提示,将大规模预训练视觉模型适配至下游任务。然而,现有的 VP 方法通常采用密集的像素级提示,这往往会导致冗余扰动、泛化能力受限以及能效低下。为了克服这些局限性,我们提议将脑启发式尖峰学习整合到视觉提示学习任务中。众所周知,尖峰神经元可通过将输入数据转换为离散的尖峰序列并输出稀疏结果,从而执行低成本的信息处理。受此启发,我们提出低秩视觉尖峰提示(LoRSP),这是一种新颖的框架,通过尖峰神经元学习机制自然地学习动态低秩稀疏视觉提示。LoRSP 的核心思想是利用尖峰神经元的脑启发式稀疏发放机制,为每个实例生成像素级稀疏提示。具体而言,我们首先通过低秩分解构建一系列提示因子,以捕捉不同的提示子空间。随后,这些提示因子被输入到尖峰神经网络(SNN)架构中,该架构执行积分 - 发放过程以发射尖峰。由此,我们的 LoRSP 在保持低秩约束的同时,生成稀疏视觉提示。该设计实现了实例特定的选择性提示,从而在多样化的下游任务中实现更紧凑且鲁棒的适配。在五种异构视觉骨干网络和多个基准上的广泛实验表明,LoRSP 实现了具有竞争力的性能,同时相比现有 VP 方法需要更少的可调节参数。

Abstract

Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization, and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. Spiking neurons are known to process information inexpensively by converting input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. To be specific, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, our LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters compared to existing VP methods.
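To make the integrate-and-fire step concrete, the toy below feeds rank-1 "prompt factor" maps into a grid of neurons and records which pixels fire. It illustrates only the sparse-firing idea and is not the LoRSP architecture: the factor values, the threshold, the hard reset, and the way factors are turned into input currents are all assumptions.

```python
def outer(u, v):
    """Rank-1 map built from two factor vectors."""
    return [[a * b for b in v] for a in u]


def integrate_and_fire(current_maps, threshold=1.0):
    """Accumulate input per pixel over time steps; emit a spike and reset at threshold."""
    height, width = len(current_maps[0]), len(current_maps[0][0])
    membrane = [[0.0] * width for _ in range(height)]
    spikes = [[0] * width for _ in range(height)]
    for current in current_maps:
        for i in range(height):
            for j in range(width):
                membrane[i][j] += current[i][j]
                if membrane[i][j] >= threshold:
                    spikes[i][j] += 1
                    membrane[i][j] = 0.0   # hard reset after firing (assumed)
    return spikes


# Two hypothetical rank-1 prompt factors on a 3x3 "image", one per time step.
factors = [
    ([1.0, 0.0, 0.2], [0.6, 0.0, 0.1]),
    ([0.5, 0.0, 0.0], [1.0, 0.2, 0.0]),
]
spike_map = integrate_and_fire([outer(u, v) for u, v in factors])
print(spike_map)
```

Only the pixel whose accumulated input crosses the threshold fires, so the resulting map is sparse even though every factor map is dense in principle.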

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 5.0/10 7.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on Visual Prompting using Spiking Neural Networks for vision model adaptation. It has moderate relevance to 'Visual Encoder' as it adapts vision backbones and slight relevance to 'Unify Models' by combining low-rank factorization with spiking learning. However, it has no direct connection to Tokenizer, World Models, MLLM, MultiModal integration, or Model-Based RL, resulting in low scores for those keywords.

关键词

Visual Prompting, Spiking Neural Network, Low-Rank Sparse Prompting, Prompt Factorization, Instance-specific Selective Prompting, Energy Efficiency, Vision Backbones

Score: 9.0 / 27.8
Authors: Ümit Mert Çağlar, Alptekin Temizel
Published: 2026-06-01
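TL;DR: LALE proposes a lightweight Transformer architecture that bifurcates its encoder by resolution, balancing efficiency and performance for remote sensing image segmentation under a computational budget.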
摘要翻译

遥感图像语义分割要求模型在计算预算受限的情况下,同时捕捉全局上下文与局部细节。以往工作通常针对其中一个维度进行优化:使用注意力机制捕捉全局上下文,使用卷积提取局部细节,或追求紧凑性以提高效率。尽管混合方法旨在同时捕捉两者,但它们需要架构变更以及具有计算开销的编码器骨干,从而限制了效率和性能。我们提出了 LALE(轻量级 Transformer 土地覆盖估算架构,Lightweight-transformer Architecture for Land-cover Estimation),一种端到端的遥感图像分割架构。该架构按分辨率对编码器进行分支:轻量级 ConvMixer 阶段负责处理高分辨率局部特征,而 Transformer 阶段负责处理低分辨率全局上下文,从而将自注意力的二次计算复杂度限制在深层下采样特征图上。此外,全 MLP 多尺度解码器配合全程使用的 RMSNorm 和 StarReLU,进一步减少了计算量和参数量。在大规模 ARAS400k 遥感分割基准上,LALE 相较于 CNN、Transformer 及混合基线,建立了显著的效率 - 性能权衡。我们的最小变体(仅 160 万参数)在 F1 分数上仅落后于最佳基线(UPerNet)2.6 个点,同时参数量减少 4.5 倍,存储空间减少 7 倍,计算量(GMACs)减少 17 倍,吞吐量提升 1.8 倍。

Abstract

Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational budgets. Prior work typically optimizes for one of these axes: attention for global context, convolution for local detail, or compactness for efficiency. While hybrid approaches aim to capture both, they require architectural changes and encoder backbones with computational overhead, limiting efficiency and performance. We present LALE (Lightweight-transformer Architecture for Land-cover Estimation), an end-to-end remote sensing image segmentation architecture that bifurcates its encoder by resolution: lightweight ConvMixer stages handle high-resolution local features, while transformer stages handle low-resolution global context, confining the quadratic cost of self-attention to deep, downsampled feature maps. An all-MLP multi-scale decoder, together with RMSNorm and StarReLU throughout, further reduces compute and parameter count. On the large-scale ARAS400k remote-sensing segmentation benchmark, LALE establishes a strong efficiency-performance trade-off against CNN, transformer, and hybrid baselines. Our smallest variant (just 1.6M parameters) reaches within 2.6 F1 points of the best baseline (UPerNet) while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and delivering 1.8x higher throughput.
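A quick calculation shows why confining self-attention to downsampled feature maps matters. The numbers below are illustrative assumptions (a 512x512 input, stride 4 for an early stage, stride 32 for a deep stage) and are not taken from the paper; the only fact used is that full self-attention scores every pair of tokens, so its cost grows with the square of the token count.

```python
def attention_pairs(height, width, stride):
    """Token pairs scored by full self-attention on a map downsampled by `stride`."""
    tokens = (height // stride) * (width // stride)
    return tokens * tokens


H = W = 512                                   # assumed input resolution
high_res = attention_pairs(H, W, stride=4)    # assumed early-stage stride
low_res = attention_pairs(H, W, stride=32)    # assumed deep-stage stride
ratio = high_res // low_res
print(f"stride 4: {high_res:,} pairs; stride 32: {low_res:,} pairs; ratio: {ratio}x")
```

Under these assumptions an 8x reduction in side length cuts the pairwise attention work by a factor of 8 to the fourth power, which is the motivation for leaving high-resolution stages to convolutional mixing.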

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

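评分理由: The paper focuses on a lightweight architecture for remote sensing segmentation that combines ConvMixer and Transformer stages. It has no direct connection to the background keywords world models, MLLM, reinforcement learning, or tokenizer. Only Visual Encoder and architectural unification have low relevance. No specified experts appear in the author list.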

关键词

Lightweight-Transformer Architecture, Land-Cover Estimation, Remote Sensing Image Segmentation, ConvMixer, Global Context, Local Detail, Efficiency-Performance Trade-off

Score: 9.0 / 27.8
Authors: Jiaming Wang, Ziteng Feng, Jiangtao Wu, Ruihao Li, Qianqian Xie, Yuxiang Ren, He Zhu, Xueming Han, Fanyu Meng, Junlan Feng, Jiaheng Liu
Published: 2026-06-01
TL;DR: This paper proposes a span-level error localization framework called DRIFT and a benchmark TELBench to identify unreliable parts in deep-research agent trajectories, improving first-error accuracy by up to 30 percentage points.
摘要翻译

深度研究代理通过包含搜索、工具使用、证据检查和答案合成的长轨迹来完成任务。基于最终答案的评估仅能显示代理是否成功,却无法指出轨迹中哪些部分导致答案不可靠。本文研究深度研究代理的片段级错误定位。我们从两个代理框架、三个骨干模型和三个基准中收集了 2,790 条真实轨迹,将原始日志转换为语义片段,并通过 LLM(大型语言模型)辅助专家评审标注有害错误片段。基于这些标注,我们构建了 TELBench,这是一个包含 1,000 个实例的基准,旨在从正常探索、失败搜索、暂定假设和无害噪声中识别错误片段。我们进一步提出了 DRIFT,这是一个以主张为中心的审计框架,用于跟踪代理主张、检查其在轨迹证据中的支持情况,并标记那些缺乏支持或相互冲突的主张影响答案路径的片段。跨模型家族和审计框架的实验表明,DRIFT 将片段级错误定位和首次错误准确率提高了多达 30 个百分点。我们的工作为深度研究代理的可靠性提供了过程级视角。

Abstract

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

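评分理由: The paper's core is trajectory-level error localization and evaluation for deep-research agents (the TELBench benchmark and the DRIFT framework), which belongs to agent behavior analysis. The keywords Tokenizer, Visual Encoder, World Models, and MultiModal are not mentioned in the abstract, so their relevance is 0. Although the paper uses LLMs as backbone models and involves agent trajectories, it does not focus on the technical implementation of multimodal large models (MLLM), unified model architectures, or model-based RL, so relevance is low (2 points). The author list does not include the specified experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang.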

关键词

Deep-research agents, Span-level error localization, Agent trajectories, TELBench, DRIFT, Claim-centric auditing, Error spans

Score: 9.0 / 27.8
Authors: Nermeen Abou Baker, David Rohrschneider, Uwe Handmann
Published: 2026-06-01
TL;DR: This study evaluates parameter-efficient fine-tuning methods like adapters and LoRA on transformer-based models for instance segmentation, demonstrating that competitive performance can be achieved by fine-tuning only 1-6% of parameters.
摘要翻译

随着大型预训练模型的兴起,人工智能领域的研究与应用近期发生了转变,这些模型在各类任务中均取得了最先进的性能。然而,参数的显著增加带来了参数高效训练策略的需求。尽管取得了显著进展,但在基于 Transformer 的模型进行实例分割的背景下,关于参数高效微调(PEFT)方法的研究仍较为有限。为填补这一空白,本研究探讨了参数高效微调(PEFT)方法的有效性,具体包括适配器(adapters)和低秩适应(LoRA),并将它们应用于两个模型在四个基准数据集上的实验。集成顺序排列的适配器模块并将低秩适应(LoRA)应用于可变形注意力(deformable attention,此处为首次探索),可在仅微调约 1%-6% 模型参数的情况下实现具有竞争力的性能,相较于传统微调所需的 40%-55%,这是一个显著的改进。关键结果表明,在每个 Transformer 块中使用 2-3 个适配器能够在性能与效率之间提供最佳平衡。此外,当低秩适应(LoRA)应用于可变形注意力时,它表现出强大的参数效率,在某些情况下甚至优于适配器配置。这些结果表明,参数高效微调(PEFT)技术的效果因数据集复杂度和模型架构而异,强调了上下文特定调优的重要性。总体而言,本研究展示了参数高效微调(PEFT)在实例分割任务中实现可扩展、可定制且计算高效的迁移学习的潜力。

Abstract

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention--explored here for the first time--achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.
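The parameter savings of LoRA follow from simple arithmetic: a full update to a `d_out x d_in` projection is replaced by two thin factors of rank `r`. The sketch below uses assumed sizes (a 256x256 projection and rank 8) that are not taken from the paper, and the resulting per-layer fraction is not directly comparable to the whole-model 1-6% figure quoted in the abstract.

```python
def full_params(d_in, d_out):
    """Trainable weights when the whole projection matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """LoRA trains two factors instead: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


d_in = d_out = 256   # assumed projection size
rank = 8             # assumed LoRA rank
fraction = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"trainable fraction for this layer: {fraction:.2%}")
```

Because the LoRA count grows linearly in the layer width while the full count grows quadratically, the fraction shrinks further for wider projections at the same rank.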

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on Parameter-Efficient Fine-Tuning (PEFT) for instance segmentation using transformers. It does not address multimodal learning (MLLM, MultiModal), reinforcement learning (model-based RL), or world modeling (World Models), resulting in 0 scores for these. Visual Encoder has moderate relevance (3.0) as the task is vision-based, but the core contribution is fine-tuning strategies rather than encoder design. Tokenizer (1.0) is minimally relevant due to transformer usage, and Unify Models (2.0) loosely relates to combining fine-tuning methods but not model unification architecture. No expert authors from the specified list were found in the author list.

关键词

Parameter-Efficient Fine-Tuning, Instance Segmentation, Adapters, LoRA, Transformers, Large Pretrained Models, Deformable Attention

Score: 9.0 / 27.8
Authors: Atmika Bhardwaj, Silvia Vock, Nico Steckhan
Published: 2026-06-01
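TL;DR: This paper studies whether generative inpainting combined with multi-stage training schedules can improve hand detection on out-of-distribution (gloved-hand) data.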
摘要翻译

当目标图像稀缺、昂贵或存在偏差时,生成(或合成)图像数据正被越来越多地用于增强或替代真实训练数据集。在手部检测领域,尤其是在职业安全场景中,公共数据集主要包含裸手。这未能充分代表由手套、纹身、珠宝及其他个人防护装备(PPE)引入的手部外观变化,从而产生了分布偏移,这是安全关键型应用在部署时会遇到的情况。我们测试生成式修复技术(仅编辑真实照片的手部区域以引入配饰)能否缩小这一分布偏移差距。在真实图像及其合成对应物的配对数据集上,我们在六种训练与调度策略下(实验 A-F,每种策略使用三个随机种子)训练 YOLOv8n 手部检测器,并在真实测试集及仅含真实手套的测试子集上评估每个检测器,报告两个重叠阈值下的平均精度均值(mAP)(即 mAP@0.5 和 mAP@0.5:0.95)以及配对统计检验结果。一项两阶段实验:先在真实与合成数据上训练,然后在仅真实数据上以较低学习率微调所得权重,相较于仅使用真实数据的基线模型,该方案在标准真实测试集上提高了 mAP@0.5,并改善了仅含真实手套数据的分布外差距。另一项三阶段实验在保持边界框紧密度方面表现最佳,达到了本研究所有实验中最高的 mAP@0.5:0.95。合成数据在安全关键型手部检测中的效用取决于训练过程,而简单的多阶段实验能够从修复后的配饰数据中提取显著的实地部署效益。

Abstract

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment (train on real ∪ synthetic data, then fine-tune the resulting weights on real-only data at a lower learning rate) increases mAP@0.5 over the real-only baseline model on the standard real test set and narrows the out-of-distribution gap on the real-gloves split. A three-stage experiment preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

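评分理由: The paper focuses on computer-vision hand detection and generative data augmentation and does not involve Tokenizer, World Models, MLLM, model-based RL, or multimodal fusion. A visual encoder is present only implicitly as the YOLO backbone and is not the focus of the research. Unify Models is only loosely related, through the combination of real and synthetic data.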
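The totals in these scoring tables follow from simple arithmetic: each keyword's relevance (0 to 10) is multiplied by its weight and the products are summed. The sketch below reproduces that calculation for the table above. The function name `weighted_score` is hypothetical, and the `>=` comparison against the 27.8 passing score is an assumption about how the threshold is applied.

```python
# Hypothetical helper that reproduces the arithmetic of the scoring table.
WEIGHT = 1.5  # the weight shown for every keyword in the table


def weighted_score(relevance, weight=WEIGHT):
    """Sum weight * relevance over all keywords."""
    return sum(weight * r for r in relevance.values())


# Relevance values copied from the table above (hand-detection paper).
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 3.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 1.0,
    "model-based RL": 0.0,
}

total = weighted_score(relevance)
passes = total >= 27.8  # assumption: "pass" means reaching the dynamic threshold
```

With these inputs the three non-zero rows contribute 3.0, 4.5, and 1.5, which matches the per-row scores listed in the table.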

关键词

Generative Data, Hand Detection, Synthetic Data, Training Schedules, YOLOv8, Out-of-Distribution, Data Augmentation, Inpainting

Score: 9.0 / 27.8
Authors: Jiangyu Chen, Banyi
Published: 2026-06-01
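TL;DR: This paper proposes a reputation-market mechanism for calibrating LLM priors in multi-objective Bayesian optimization; dynamic calibration improves robustness, but the benefit of raw LLM confidence is inconsistent.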
摘要翻译

大语言模型(LLMs)正越来越多地被用作黑盒优化的启发式顾问,然而它们的建议和自我报告的置信度未必与下游目标值校准。这一问题在多目标贝叶斯优化中尤为显著,因为不同目标可能需要不同的专家知识,且一个 LLM 专家可能对某一目标有用,却对另一目标具有误导性。我们研究如何在离散多目标贝叶斯优化中使用 LLM 生成的专家先验,而不盲目信任它们。我们提出一种基于目标的声誉市场机制,将每个专家 - 目标对视为可证伪的先验来源。专家权重根据观察到的目标反馈在线更新,随时间衰减,并由市场级别的信任进行门控。随后,我们引入一个解耦的反事实门,该门可以在无置信度时使用 LLM 先验,在有置信度时使用它,或完全放弃 LLM 先验。在受控的合成压力测试和三个使用 Qwenflash 生成的专家先验的分子优化基准测试中,我们发现动态基于目标的校准比固定的 LLM 先验提高了鲁棒性。然而,原始 LLM 置信度并非总是有益的:在 ESOL 上,置信度与预测误差正相关;在 FreeSolv 上,置信度可能有帮助;而在 Lipophilicity 上,忽略置信度表现仍最强。我们的固定三臂反事实门在 ESOL 和 FreeSolv 上优于第一种反事实变体,而尝试的边际选择策略揭示了一个有用的负面结果:边际选择应基于 acquisition-aware(采集函数感知),而不仅仅基于一步先验误差。

Abstract

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with Qwenflash-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.
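The abstract describes expert weights that are updated online from feedback and discounted over time. The following is a minimal sketch of one generic way to do that (a discounted exponential-weights update); it is not the authors' mechanism, it omits the market-level trust gate, and the names `update_weights`, `eta`, and `gamma` are assumptions made for illustration.

```python
import math


def update_weights(log_w, errors, eta=1.0, gamma=0.9):
    """Discount old log-weights by gamma, then penalise each expert by eta * error."""
    return [gamma * lw - eta * e for lw, e in zip(log_w, errors)]


def normalise(log_w):
    """Convert log-weights to weights that sum to 1 (numerically stable softmax)."""
    m = max(log_w)
    exp_w = [math.exp(lw - m) for lw in log_w]
    z = sum(exp_w)
    return [w / z for w in exp_w]


# Two hypothetical experts scored on a single objective over three rounds.
# Each inner list holds the observed prediction error of expert 0 and expert 1.
log_w = [0.0, 0.0]
for errors in ([0.1, 0.9], [0.2, 0.8], [0.1, 1.0]):
    log_w = update_weights(log_w, errors)

weights = normalise(log_w)
```

Keeping one weight vector per objective is what makes the calibration objective-wise: an expert that is accurate on one objective and misleading on another ends up with different weights for each.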

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 1.0/10 1.5

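评分理由: The paper focuses on calibrating LLM priors in multi-objective Bayesian optimization and does not address multimodal architectures, visual encoders, world models, or core reinforcement learning. It uses LLMs (MLLM relevance 3) but shows little of multimodal unification (Unify Models 2) or model-based RL (model-based RL 1); the remaining keywords (Tokenizer, Visual Encoder, World Models, MultiModal) are entirely unrelated. The weighted total is 9.0, below the dynamic passing score of 27.8.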

关键词

LLM Priors, Multi-Objective Bayesian Optimization, Reputation Market, Counterfactual Gate, Calibration, Molecule Optimization, Expert Weights

Score: 9.0 / 27.8
Authors: Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li, Wai Lam, Jianye Hao
Published: 2026-06-01
TL;DR: This paper proposes a training-free framework using CUSUM-based change-point detection to identify confidence regions in Chain-of-Thought reasoning, enabling efficient early exit and improved test-time scaling without sacrificing accuracy.
摘要翻译

本文研究了思维链(CoT)的熵动力学,揭示了一致的两阶段结构:一个探索性的不确定性区域急剧过渡到一个收敛性的置信区域。我们证明置信区域具有两个关键属性:1) 高可靠性——置信区域内的答案变得高度准确且稳定,2) 高冗余性——模型在达到正确答案后仍会生成不必要的 token。这些属性开启了更高效可靠的推理策略:1) 早期退出(Early Exit)利用可靠性和冗余性,在收益递减时安全终止计算,2) 测试时扩展(Test-Time Scaling)利用置信区域信号优先选择已收敛的轨迹。为了将这些见解付诸实践,我们将置信区域检测建模为序列变点检测问题,并首次将经典变点检测方法应用于监控思维链推理。利用统计最优的变点检测器——累积和(CUSUM)算法,我们开发了一个无需训练的实时推理控制框架。实验表明,我们的方法为早期退出建立了更优的帕累托前沿。CUSUM 实现了 63.06% 的准确率和 11.1% 的 token 减少率,在准确率上分别比 DEER 和 Dynasor 高出 3.28% 和 4.36%。在测试时扩展任务中,CUSUM 加权投票始终优于自一致性(self-consistency)。

Abstract

This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability: answers in the confidence region become highly accurate and stable, and 2) High Redundancy: models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2) Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto-frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
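To make the change-point idea concrete, here is a minimal sketch of a one-sided CUSUM detector that flags a sustained drop in a per-step entropy signal. It illustrates the classical algorithm only; the reference level, the `slack` and `threshold` values, and the toy entropy sequence are assumptions, not settings from the paper.

```python
def cusum_drop(entropies, reference, slack=0.1, threshold=1.0):
    """Return the first index at which a downward shift is flagged, else None."""
    stat = 0.0
    for t, h in enumerate(entropies):
        # Accumulate evidence that the entropy sits below (reference - slack);
        # the max(0, ...) reset keeps isolated dips from building up forever.
        stat = max(0.0, stat + (reference - slack - h))
        if stat > threshold:
            return t
    return None


# Toy signal: high entropy (exploration) followed by low entropy (convergence).
entropies = [2.0, 2.1, 1.9, 2.0, 0.3, 0.2, 0.3, 0.2]
change_at = cusum_drop(entropies, reference=2.0)
```

In an early-exit setting, the index returned here would be the point at which generation could stop; a larger `threshold` trades later exits for fewer false alarms.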

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 1.0/10 1.5

评分理由: The paper focuses on Chain-of-Thought reasoning entropy dynamics and inference control using change-point detection. It lacks content related to multimodal architectures, visual encoders, world models, or model-based reinforcement learning. While it involves LLMs and tokens, the focus diverges from the provided keyword set, resulting in low relevance scores.

关键词

Chain-of-Thought Reasoning, Entropy Dynamics, Confidence Region, Early Exit, Test-Time Scaling, Change-point Detection, CUSUM Algorithm

Score: 9.0 / 27.8
Authors: Woojun Jung, Susik Yoon
Published: 2026-06-01
TL;DR: This paper proposes NAVI, a segment-centric pretraining framework for heterogeneous tabular data that unifies schema-level structural evidence and column-level distributional evidence to improve reconstruction and semantic consistency.
摘要翻译

实际领域通常包含异构表,其表头各异但底层属性语义共享,这使得仅凭表内证据难以诱导领域专用语义。现有编码器虽建模了该问题的一部分,但往往未充分利用列级值分布,并对具有不同语义角色的属性应用统一目标。我们提出 NAVI,一种以片段为中心的预训练框架,将每个表头 - 值对作为聚合模式级结构证据与列级分布证据的单元。我们通过掩码片段建模(Masked Segment Modeling)和熵驱动片段对齐(Entropy-driven Segment Alignment)实现这一设计,共同强制结构化表头 - 值耦合,并在稳定属性与实例特定属性之间实现语义对齐。在异构领域内表上的实验表明,总体而言,重构能力、语义一致性以及下游任务效用在所有评估设置下均有所提升。

Abstract

Real-world domains often contain heterogeneous tables whose headers vary while their underlying attribute semantics are shared, making it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders model parts of this problem, but often underuse column-level value distributions and apply uniform objectives across attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings overall.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on heterogeneous tabular representation learning using a segment-centric pretraining framework (NAVI). It shows minimal relevance to the provided keywords: while it unifies structural and distributional evidence (Unify Models) and uses segment modeling (Tokenizer/MultiModal), it lacks visual encoders, world models, MLLM architectures, or model-based RL components. The research domain (Tabular Data) differs significantly from the Multimodal/RL background specified in the keywords. No expert authors from the specified list are present.

关键词

Heterogeneous Tabular Representation, Segment-driven Structural Induction, Semantic Alignment, Masked Segment Modeling, Entropy-driven Segment Alignment, Header-value pair, Pretraining framework, Schema-level structural evidence

Score: 9.0 / 27.8
Authors: Rui Hong, Jana Košecká
Published: 2026-06-01
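TL;DR: This paper proposes new evaluation metrics based on the latent space of a motion autoencoder, shows that standard metrics fail to measure gesture faithfulness, and identifies the size of sentence-level paired datasets as the bottleneck for sign language production.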
摘要翻译

手语生成(SLP)是指从自然语言文本生成虚拟人手语动作的任务。生成动作的质量通常通过在 How2Sign 等基准上使用的运动空间弗雷歇距离(FID)和回译(BT)BLEU 分数进行评估。然而,这两个指标可以显著提高,而底层生成器却未能忠实地表示手语手势。在这项工作中,我们提出在三个独立层面上评估生成动作:(τ1) 初始姿态条件化,(τ2) 输出多样性,以及 (τ3) 目标忠实度。我们利用冻结的运动自编码器(MoAE)的潜在表示,将这些指标计算为成对距离比率。我们在 How2Sign 数据集上评估了 14 个 SLP 模型检查点,其中包括重新实现的神经手语演员(NSA),结果表明 τ3 忠实度从未达成,而 FID 变化了近两个数量级,且与忠实度无关。我们表明,在孤立词汇数据集 ASL3DWord 上可以达到理想的 τ3,从而将句子级配对数据集的大小确定为瓶颈。

Abstract

Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: (τ1) initial-pose conditioning, (τ2) output diversity, and (τ3) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that τ3 faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We show that on the isolated gloss dataset ASL3DWord favorable τ3 can be attained, hence isolating the size of the sentence-level paired-dataset as the bottleneck.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 5.0/10 7.5
model-based RL 1.5 0.0/10 0.0

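评分理由: The paper studies evaluation methods for Sign Language Production (SLP), using the latent space of a frozen motion autoencoder (MoAE) to compute distance ratios that measure generation quality. On the keywords: the paper does not involve model unification, tokenizers, world models, multimodal large language models (MLLM), or model-based reinforcement learning, so these five receive a relevance of 0. The task maps text to motion and is therefore multimodal, but it does not involve multimodal large-model architectures, so MultiModal receives 5. The paper uses an encoder, but it is a motion autoencoder rather than a visual encoder, so the relevance is very low and Visual Encoder receives 1. The author list does not include Yang Shi or the other designated experts, so no bonus applies. The weighted total is 9.0, far below the dynamic passing score of 27.8.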

关键词

Sign Language Production, Motion Autoencoder, Evaluation Metrics, Gesture Faithfulness, Text-to-Motion, Conditional Collapse, How2Sign Dataset

Score: 7.5 / 27.8
Authors: Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Published: 2026-06-01
TL;DR: This paper investigates the structural representation of multilingual large language models, revealing that low-resource languages are structurally more distinct from English and that post-training alters structures while preserving inter-language relationships.
摘要翻译

尽管英语在训练数据中占据主导地位,大语言模型(LLMs)仍通过在多语言数据上的预训练与后训练,在处理多种语言方面表现卓越。先前关注词元表示的研究揭示了这些 LLMs 如何处理非英语文本。尽管这些分析提供了深刻的见解,但它们未能捕捉到一种结构视角,而这正是语言的固有属性。本研究通过表征结构分析,探索 LLMs 的多语言性。研究发现,低资源语言在结构上与英语的差异程度高于高资源和中等资源语言,且语言特定的后训练会改变其结构,同时保持跨语言关系。

Abstract

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.5/10 2.2
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.5/10 2.2
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on structural analysis of multilingual LLMs. It has low relevance to Tokenizer (token representations appear only as the prior work it contrasts with) and to MLLM and Unify Models (general LLM context), and no relevance to Visual Encoder, World Models, MultiModal, or model-based RL, which pertain to multimodal and reinforcement learning domains.

关键词

Large Language Models, Multilinguality, Structural Analysis, Representational Structure, Low-resource Languages, Post-training, Inter-language Relationships

Score: 7.5 / 27.8
Authors: Yangxuan Zhou, Sha Zhao, Jiquan Wang, Shijian Li, Gang Pan
Published: 2026-06-01
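TL;DR: EvoBrain proposes a continual learning framework for EEG foundation models across heterogeneous BCI tasks, achieving unified decoding while balancing plasticity and stability; it does not involve MLLM or model-based RL techniques.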
摘要翻译

脑电图(EEG)是非侵入式脑机接口(BCIs)的基石,然而常规解码依赖于碎片化的、任务特定的架构,这严重限制了跨任务的可扩展性。尽管在大规模数据集上预训练的 EEG 基础模型有望实现通用脑解码,但当前的后训练阶段仍依赖于任务隔离的微调。这种静态范式限制了异构任务之间的知识迁移,阻碍了模型的可扩展性,并产生了随任务数量线性增长的计算和存储开销。为克服这些瓶颈,我们将下游适配建模为跨任务持续学习问题,并提出 EvoBrain,这是一个用于统一脑电图解码的动态、任务感知持续学习框架。EvoBrain 通过两个互补组件解决可塑性 - 稳定性权衡:(1)神经谱任务归一化(NSN)将传入任务与历史统计对齐,同时重新校准谱响应以处理分布和神经谱偏移;(2)响应亲和蒸馏(RAD),结合时间依赖回放,保留旧任务的响应几何结构,并促进频谱兼容任务之间的选择性知识迁移,有效缓解遗忘。在六个不同 BCI 任务上的广泛评估表明,EvoBrain 在各种基础骨干网络上始终超越最先进方法,最优平衡了可塑性与稳定性。据我们所知,这项工作开创了 EEG 领域的跨任务持续学习,推动了统一且通用的脑解码系统的实现。

Abstract

Electroencephalography (EEG) is the cornerstone of non-invasive brain-computer interfaces (BCIs), yet conventional decoding relies on fragmented, task-specific architectures that severely limit cross-task scalability. While EEG foundation models pre-trained on massive corpora promise universal brain decoding, current post-training depends on task-isolated fine-tuning. This static paradigm restricts knowledge transfer across heterogeneous tasks, hinders model scalability, and incurs computational and storage overheads that scale linearly with task count. To overcome these bottlenecks, we formulate downstream adaptation as a cross-task continual learning problem and propose EvoBrain, a dynamic, task-aware continual learning framework for unified EEG decoding. EvoBrain addresses the plasticity-stability trade-off via two complementary components: (1) Neuro-Spectral Task Normalization (NSN) aligns incoming tasks with historical statistics while recalibrating spectral responses to handle distributional and neuro-spectral shifts; and (2) Response-Affinity Distillation (RAD), combined with time-dependent replay, preserves old-task response geometry and promotes selective knowledge transfer between spectrally compatible tasks, effectively mitigating forgetting. Extensive evaluations across six distinct BCI tasks demonstrate that EvoBrain consistently surpasses state-of-the-art methods across diverse foundation backbones, optimally balancing plasticity and stability. To our knowledge, this work pioneers cross-task continual learning in the EEG domain, advancing the realization of a unified, one-for-all brain decoding system.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 5.0/10 7.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

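评分理由: The paper focuses on continual learning and unified decoding for EEG brain-computer interfaces (BCIs), which falls under signal processing and neural engineering. MLLM, World Models, model-based RL, Visual Encoder, and the other keywords are unrelated to its content; only 'Unify Models' has a weak, conceptual link to the paper's goal of a 'unified decoding system'.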

关键词

EEG, BCI, Continual Learning, Foundation Models, Unified Decoding, NSN, RAD, Cross-task

Score: 7.5 / 27.8
Authors: Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu
Published: 2026-06-01
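TL;DR: THRD proposes a training-free multi-turn defense framework that models temporal risk accumulation, reducing the success rate of multi-turn jailbreak attacks on large language models while preserving model utility.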
摘要翻译

多轮越狱攻击正对大语言模型(LLM)构成日益增长的威胁,它们利用对话动态(如逐步升级和跨轮协调)来实施。现有的防御方案要么依赖代价高昂的重新训练,往往导致模型效用下降,要么在每一轮独立应用单轮分析,无法捕捉风险如何沿交互轨迹累积。我们发现,多轮交互中的安全行为具有轨迹依赖性:对话历史不断重塑模型的条件上下文,因此孤立地评估每一轮是不够的。基于这一洞察,我们提出了 THRD,这是首个针对多轮越狱防御、显式建模时间风险累积的无需训练框架。THRD 集成了四个模块:轮次级风险评估器(TRA),用于瞬时风险估计;历史上下文分析器(HCA),用于跨轮意图升级检测;响应评估器(RE),用于识别促进性输出;以及决策模块,该模块通过一种基于衰减调制和趋势感知调整的时间演化评分机制来融合这些信号。针对最先进的多轮攻击(包括基于树搜索和多代理协作的方法)在两个目标模型上的实验表明,THRD 将攻击成功率(ASR)降低至 0.2%--4.0%,同时在 MMLU 和 GSM8K 基准上保持模型效用损失在 1.5% 以内。消融研究证实了各模块贡献的非冗余性以及稳定的跨架构泛化能力。对首次拒绝触发点的分析揭示,超过 70% 的多轮攻击需要到第 2 轮或更晚才能被检测到,从而验证了显式时间聚合的必要性。

Abstract

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining (often degrading model utility) or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks (including tree-search-based and multi-agent collaborative methods) across two target models show that THRD reduces ASR to 0.2% to 4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.
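The point about temporal aggregation can be illustrated with a toy score. The sketch below is a hypothetical running-risk rule, not THRD's Decision Module: it carries over a decayed fraction of the previous score and adds a bonus when per-turn risk is rising. The function name, the `decay` and `trend_gain` parameters, the 0.5 threshold, and the per-turn risk values are all assumptions.

```python
def accumulated_risk(turn_risks, decay=0.8, trend_gain=0.5):
    """Return the running risk score after each turn."""
    scores, score, prev = [], 0.0, None
    for r in turn_risks:
        # Attenuate the carried-over history, then add the current turn's risk.
        score = decay * score + r
        # Trend-aware bump: escalating turns count for more than flat ones.
        if prev is not None and r > prev:
            score += trend_gain * (r - prev)
        scores.append(score)
        prev = r
    return scores


# A gradual escalation in which no single turn looks risky on its own.
turn_risks = [0.2, 0.3, 0.4, 0.45]
scores = accumulated_risk(turn_risks)

per_turn_flag = any(r > 0.5 for r in turn_risks)   # isolated single-turn check
trajectory_flag = any(s > 0.5 for s in scores)     # check on the accumulated score
```

Here the single-turn check never fires, while the accumulated score crosses the threshold at the second turn, which is the kind of case the abstract's first-rejection analysis refers to.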

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

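评分理由: The paper's core is safety defense and risk modeling for multi-turn dialogue with large language models (LLMs), a domain quite different from the provided keyword set (multimodal architectures, reinforcement learning, world models, and so on). Only 'MLLM' is moderately relevant, because the work involves the underlying LLM architecture, and 'Unify Models' has low relevance, because the framework integrates several assessment modules; the remaining keywords are not directly reflected in the paper. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). The weighted total is 7.5, far below the dynamic passing score of 27.8.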

关键词

Multi-turn jailbreak attacks, Training-free framework, Temporal risk accumulation, Large Language Models, Safety defense, Dialogue history, Risk assessment, Model utility

Score: 7.5 / 27.8
Authors: Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta
Published: 2026-06-01
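TL;DR: DOT-MoE proposes a framework based on differentiable optimal transport that converts dense large language models into sparse mixture-of-experts models, retaining 90% of performance while reducing active parameters by 50%.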
摘要翻译

大语言模型(LLM)的规模化带来了显著的性能提升,却在推理效率方面带来了巨大挑战。尽管混合专家模型(MoE)架构通过将模型规模与推理成本解耦来解决这一问题,但从头训练 MoE 往往不稳定且计算开销巨大。将预训练稠密模型转换为稀疏 MoE 已成为一种替代方案;然而,现有方法通常依赖启发式神经元聚类或随机分割,将前馈网络(FFN)划分为专家模块。本文提出了一种名为 DOT-MoE 的新颖框架,该框架将稠密层的分解问题形式化为可微最优传输(DOT)问题。与静态启发式方法不同,我们将神经元分配建模为平衡传输问题,利用可微的 Sinkhorn-Knopp 迭代来强制执行严格的专家容量约束。此外,我们利用直通估计器(STE),端到端地联合学习离散的神经元 - 专家分配方案以及令牌 - 专家路由策略。在多种架构和基准上的广泛实验表明,DOT-MoE 显著优于结构化剪枝、启发式聚类及随机分割基线,在保留原始稠密模型 90% 性能的同时,将激活参数减少了 50%。

Abstract

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
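The balanced-transport formulation can be sketched with plain Sinkhorn-Knopp iterations. The code below only shows the forward iteration on a random cost matrix: it is not the DOT-MoE implementation, it leaves out gradients and the straight-through estimator, and the cost values, the `eps` temperature, and the sizes (8 neurons, 2 experts) are assumptions chosen for illustration.

```python
import math
import random


def sinkhorn(cost, row_mass, col_mass, eps=0.5, iters=500):
    """Return a transport plan whose marginals match row_mass and col_mass."""
    n, k = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(k)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]


random.seed(0)
n_neurons, n_experts = 8, 2
# Hypothetical neuron-to-expert assignment costs in [0, 1).
cost = [[random.random() for _ in range(n_experts)] for _ in range(n_neurons)]

# Each neuron carries equal mass; each expert has equal capacity.
plan = sinkhorn(cost, [1.0 / n_neurons] * n_neurons, [1.0 / n_experts] * n_experts)
row_sums = [sum(row) for row in plan]
col_sums = [sum(row[j] for row in plan) for j in range(n_experts)]
```

The column marginals are what enforce expert capacity: every expert receives the same total mass, so no expert can absorb all the neurons even if its costs are uniformly lowest.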

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

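评分理由: The paper's core is using differentiable optimal transport to convert dense LLMs into sparse MoEs for better inference efficiency, which belongs to NLP architecture optimization. The provided keyword set mainly concerns multimodality, world models, and reinforcement learning, which diverges markedly from this paper's topic (architecture conversion for text-only models), so apart from the LLM-related keywords the remaining keywords have very low relevance. The author list does not include any designated expert, so no bonus applies.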

关键词

Differentiable Optimal Transport, Mixture of Experts, Inference Efficiency, Dense Models, Sparse MoEs, Token Routing, LLM Conversion

Score: 7.5 / 27.8
Authors: Kaizheng Wang
Published: 2026-06-01
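TL;DR: MINTS proposes a minimalist Bayesian framework that places a prior only on the location of the optimum and achieves near-optimal regret guarantees for multi-armed bandits.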
摘要翻译

贝叶斯范式提供了处理不确定性下序贯决策的严谨工具,但其对所有参数均依赖概率模型,可能会阻碍复杂结构约束的纳入。我们提出了一种极简贝叶斯框架,仅在最优值的位置上设置先验分布,并通过轮廓似然消除干扰参数。这产生了一个广义后验,能够自然容纳结构约束。作为直接实例化,我们开发了极简 Thompson 采样(MINTS)。对于具有均值约束的多臂老虎机,我们建立了近最优的非渐近遗憾保证以及几乎必然的渐近遗憾刻画。特别是在无结构设置下,MINTS 达到了经典的 Lai--Robbins 常数,并自动适应单峰结构,实现了仅由最优臂的相邻臂决定的紧常数。

Abstract

The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the location of the optimum, while eliminating nuisance parameters through profile likelihood. This yields a generalized posterior that naturally accommodates structural constraints. As a direct instantiation, we develop MINimalist Thompson Sampling (MINTS). For multi-armed bandits with mean constraints, we establish near-optimal non-asymptotic regret guarantees and sharp almost-sure asymptotic regret characterizations. In particular, MINTS attains the classical Lai--Robbins constant in the unstructured setting and automatically adapts to unimodal structure, achieving the sharp constant determined only by the immediate neighbors of the optimal arm.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 5.0/10 7.5

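评分理由: The paper studies regret guarantees for Bayesian Thompson sampling in multi-armed bandits, which belongs to sequential decision-making and reinforcement learning. It has some connection to model-based RL (it involves probabilistic models and decision-making) but is entirely unrelated to multimodal large-model architectures (Tokenizer, Visual Encoder, MLLM, MultiModal), world models, and model unification. The author list does not include any designated expert.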

关键词

Thompson Sampling, Bayesian framework, Sequential decision-making, Multi-armed bandits, Regret guarantees, Profile likelihood, Minimalist Bayesian

Score: 7.5 / 27.8
Authors: Haiyang Lu, Pratik Gajane, Shaojie Bai, Mohammad Sadegh Talebi
Published: 2026-06-01
TL;DR: This paper demonstrates that Randomized Least Squares Value Iteration inherently satisfies joint differential privacy in tabular MDPs through its exploration noise without requiring additional privacy mechanisms.
摘要翻译

随着强化学习(RL)越来越多地应用于医疗保健和推荐系统等敏感领域,保护用户敏感信息的隐私保护技术已变得至关重要。我们在回合制环境下研究隐私保护强化学习,重点关注基于随机探索的算法,例如随机最小二乘值迭代(RLSVI)。本研究旨在探讨随机探索与隐私机制所需注入噪声之间的相互作用。在这项工作中,我们提出了一种新的隐私分析,刻画了 RLSVI 中为探索而设置的噪声如何同时提供隐私保护。具体来说,我们证明了在表格型 MDP 中,RLSVI 满足 $(\varepsilon(δ),δ)$-联合微分隐私,且 $\varepsilon(δ) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/δ)}{H^2\log(2HSA)}}$,其中 $S$ 和 $A$ 分别是状态数和动作数,$H$ 是回合长度,$K$ 是回合数。

Abstract

As reinforcement learning (RL) is increasingly applied to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we present a new privacy analysis that characterizes how the noise that RLSVI injects for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI, without modification, is $(\varepsilon(δ),δ)$-joint differentially private in tabular MDPs, with $\varepsilon(δ) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/δ)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the number of states and actions respectively, $H$ is the length of an episode and $K$ is the number of episodes.
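The privacy bound in the abstract is easy to evaluate numerically. The sketch below transcribes the stated expression for ε(δ) directly; the function name `rlsvi_epsilon` is hypothetical, the logarithm is assumed to be natural, and the example problem sizes are arbitrary.

```python
import math


def rlsvi_epsilon(S, A, H, K, delta):
    """Evaluate eps(delta) = b + 2 * sqrt(b * log(1/delta)), b = 2AK / (H^2 log(2HSA))."""
    b = 2 * A * K / (H ** 2 * math.log(2 * H * S * A))
    return b + 2 * math.sqrt(b * math.log(1 / delta))


# Arbitrary illustrative sizes: 10 states, 5 actions, horizon 20.
eps_100 = rlsvi_epsilon(S=10, A=5, H=20, K=100, delta=1e-5)
eps_200 = rlsvi_epsilon(S=10, A=5, H=20, K=200, delta=1e-5)
eps_long = rlsvi_epsilon(S=10, A=5, H=40, K=100, delta=1e-5)
```

Both terms share the factor 2AK / (H² log(2HSA)), so the bound loosens as the number of episodes K grows and tightens as the episode length H grows.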

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 5.0/10 7.5

评分理由: The paper focuses on privacy-preserving Reinforcement Learning (RLSVI in tabular MDPs), which is unrelated to Multimodal/LLM/World Model architectures (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal). 'model-based RL' has moderate relevance as RLSVI is a model-based algorithm, but the core contribution is differential privacy, not model-based RL methodology. No expert authors from the list are present.

关键词

Reinforcement Learning, Differential Privacy, RLSVI, Tabular MDP, Privacy-preserving, Exploration Noise, Joint Differential Privacy

Score: 7.5 / 27.8
Authors: Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer
Published: 2026-06-01
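TL;DR: This paper casts several test-time adaptation methods into a common differentially private form (gradient clipping and Gaussian noise), protecting test-data privacy while maintaining or improving the accuracy and stability of adaptation.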
摘要翻译

测试时适应(TTA)可以通过在推理过程中基于这些输入更新模型,从而降低在新数据及不同分布数据上的误差。然而,这些更新引发了关于测试数据的隐私问题,因为模型参数现在依赖于所有过去的输入。为了控制这种隐私风险,我们将多种流行的 TTA 方法(Tent、EATA、SAR、DeYO 和 COME)转化为微分隐私(DP)形式,这些形式对所有更新均应用逐样本梯度裁剪和高斯噪声。在 ImageNet-C 上,我们的 DP-TTA 方法在精度损失较小的情况下提供了足够的隐私保护,而在低隐私设置下,DP 的裁剪机制甚至可以在持续设置中提高适应的精度和稳定性。这些隐私和精度的改进仅带来了适度的计算开销。这些关于隐私保护 TTA 的首批结果提高了对该问题的认识,指导了隐私性更强的测试时更新的发展,并识别出逐样本裁剪是提高适应精度和稳定性的有效技术。

Abstract

Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.
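The two ingredients named in the abstract, per-sample gradient clipping and Gaussian noise, can be sketched in a few lines. This is the generic DP-SGD-style aggregation step written in plain Python, not the authors' DP-TTA code; the function name `dp_aggregate` and the `clip_norm` and `noise_multiplier` parameters are assumptions.

```python
import math
import random


def dp_aggregate(per_sample_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Clip each sample's gradient to clip_norm, sum, add Gaussian noise, average."""
    rng = rng or random.Random(0)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down only gradients whose L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(g):
            total[i] += scale * x
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_sample_grads) for x in noisy]


# With the noise switched off, only the clipping effect remains visible:
# the first gradient (norm 5) is rescaled to norm 1, the second (norm 0.5) is untouched.
clipped_only = dp_aggregate([[3.0, 4.0], [0.3, 0.4]], clip_norm=1.0, noise_multiplier=0.0)
```

Clipping bounds how much any one test input can move the update, which is plausibly why the abstract reports that it can stabilise adaptation even in the low-privacy regime.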

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

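评分理由: The paper studies privacy protection in test-time adaptation (TTA) and uses differential privacy mechanisms to improve the stability of adaptation. The provided keyword set (MLLM, World Models, Tokenizer, model-based RL, and so on) centers on multimodal large models and reinforcement learning, which diverges markedly from this paper's computer-vision and private-machine-learning topic. Visual Encoder and Unify Models therefore receive very low relevance scores, for the use of visual data (ImageNet-C) and the unification of several TTA methods respectively, and the remaining keywords receive 0 or, for MultiModal, 1.0. The author list does not include Yang Shi or the other designated experts, so no bonus is triggered.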

关键词

Test-time adaptation, Differential Privacy, Gradient Clipping, Gaussian Noise, ImageNet-C, Privacy Risk, Model Updates, Stability

Score: 7.5 / 27.8
Authors: Jiaqi Yu, Xin Wang, Yixu Wang, Jie Li, Yan Teng, Xingjun Ma, Yingchun Wang
Published: 2026-06-01
TL;DR: SentGuard proposes a sentence-level streaming guardrail for LLMs that detects unsafe intent at sentence boundaries with high accuracy and low false-positive rates during real-time generation.
摘要翻译

大型语言模型(Large Language Models, LLMs)正越来越多地实时流式输出长篇幅、高推理强度的响应,这使得“何时审核”与“是否审核”同样关键。现有的安全护栏(guardrails)落入两个令人不满的极端:响应级方法延迟干预直至完整输出生成,而词元级方法基于不完整的语义进行操作,通常导致不稳定的决策和过多的护栏调用。为了解决这一挑战,我们提出 SentGuard,这是一种与生成过程并行运行的句子级流式安全护栏。一个轻量级的等待缓冲区将流式词元分组为句子片段,仅向用户释放经过验证的片段,引入一个小的时间偏移,使得 SentGuard 能够在目标大型语言模型解码后续内容时评估当前前缀。为此,我们构建了 StreamSafe,一个包含 8 类危害的结构化句子级标注基准,捕捉推理和响应段落中安全风险的演变。我们进一步使用粗粒度到细粒度的目标训练 SentGuard,以便在句子边界处尽早检测出不安全意图。在 5 个安全基准上的实验表明,SentGuard 优于现有基线方法,在两个句子内检测到 90.5% 的不安全案例,同时保持较低的流式误报率为 7.41%。

Abstract

Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
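The waiting-buffer idea can be sketched as a pair of generators: one groups streamed tokens into sentence chunks, the other releases a chunk only after a safety check passes. This is an illustration under simple assumptions, not SentGuard itself: sentence boundaries are detected by terminal punctuation only, the `is_safe` callback stands in for the trained guard model, and all names are hypothetical.

```python
def sentence_chunks(tokens, terminators=(".", "!", "?")):
    """Group streamed tokens into chunks that end at a sentence terminator."""
    buffer = []
    for tok in tokens:
        buffer.append(tok)
        if tok.rstrip().endswith(terminators):
            yield "".join(buffer)
            buffer = []
    if buffer:
        # Flush whatever is left when the stream ends mid-sentence.
        yield "".join(buffer)


def guarded_stream(tokens, is_safe):
    """Release only verified chunks; stop at the first chunk that fails the check."""
    for chunk in sentence_chunks(tokens):
        if not is_safe(chunk):
            return
        yield chunk


tokens = ["Hello", " world", ".", " How", " are", " you", "?", " Bye"]
chunks = list(sentence_chunks(tokens))

# Toy guard: treat any chunk containing "are" as unsafe.
released = list(guarded_stream(tokens, is_safe=lambda chunk: "are" not in chunk))
```

Checking at sentence boundaries means the guard is called once per sentence instead of once per token, while the user still never sees an unverified chunk.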

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 3.0/10 4.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on safety guardrails for streaming LLM generation, which does not align with the provided keywords regarding multimodal models, world models, or reinforcement learning. While it processes tokens (Tokenizer) and involves LLMs (MLLM), it does not address tokenizer design, visual encoding, world modeling, or RL, resulting in low relevance scores.

关键词

Sentence-Level Streaming, Guardrails, Large Language Models, Safety Moderation, StreamSafe Benchmark, Real-time Detection, Coarse-to-Fine Objective

Score: 7.5 / 27.8
Authors: Xiaolu Kang, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Zhanhe Lei, Gang Wu, Qin Zou, Qian Wang
Published: 2026-06-01
TL;DR: This paper proposes a multi-view evidential learning framework called DiCoME that separates semantic and artifact features to enhance generalization and uncertainty estimation in deepfake detection.
摘要翻译

随着生成模型的演进,深度伪造已实现近乎完美的语义真实性,仅在细微的结构异常中留下取证痕迹。然而,现有的单视图范式往往难以泛化,因为在纠缠表示中,主导的语义特征掩盖了细微的伪影线索。这种不平衡导致了过度自信但脆弱的预测——我们将这种现象称为语义遮蔽效应(Semantic Masking Effect)。为应对这一挑战,我们提出了一种可靠的框架,称为分治多视图证据学习(Divide-and-Conquer Multi-View Evidential Learning, DiCoME),用于深度伪造检测。在“划分”阶段,我们采用几何视图净化(Geometric View Purification),通过基于原理的几何投影分解纠缠表示空间。该过程抑制了对伪影敏感表示中的语义干扰,为去相关但互补的语义视图和伪影视图奠定了基础。在“攻克”阶段,我们利用不确定性感知证据学习(Uncertainty-Aware Evidential Learning)来综合这些不同的视图。通过显式建模语义与伪影线索之间的“认知冲突”,该机制提供了校准的不确定性估计,而非强迫做出刚性确定性决策。在多个基准上的广泛实验表明,我们的方法在泛化性能上始终优于现有方法,同时提供可靠的不确定性估计,以实现可信的深度伪造检测。代码开源地址为 https://github.com/kxl0825/DiCoME.git。

Abstract

With the evolution of generative models, deepfakes have achieved near-perfect semantic realism, leaving forensic traces only in subtle structural anomalies. However, existing single-view paradigms often fail to generalize, as dominant semantic features overwhelm subtle artifact cues within entangled representations. This imbalance leads to overconfident yet brittle predictions -- a phenomenon we term the Semantic Masking Effect. To address this challenge, we propose a reliable framework called Divide-and-Conquer Multi-View Evidential Learning (DiCoME) for Deepfake Detection. In the "Divide" phase, we employ Geometric View Purification to decompose the entangled representation space through principled geometric projection. This process suppresses semantic interference within artifact-sensitive representations, forming the foundation for decorrelated yet complementary semantic and artifact views. In the "Conquer" phase, we leverage Uncertainty-Aware Evidential Learning to synthesize these distinct views. By explicitly modeling the "epistemic conflict" between semantic and artifact cues, this mechanism provides calibrated uncertainty estimates instead of forcing rigid deterministic decisions. Extensive experiments across multiple benchmarks demonstrate that our method consistently outperforms existing approaches in generalization performance, while providing reliable uncertainty estimation for trustworthy deepfake detection. Code is available at https://github.com/kxl0825/DiCoME.git.
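
For readers unfamiliar with evidential learning, the sketch below shows one common formulation from subjective logic. Per-class evidence defines a Dirichlet distribution, from which belief masses and an explicit uncertainty mass are derived. The abstract does not state that DiCoME uses exactly these formulas; the function names and the conflict measure are illustrative assumptions.

```python
from typing import List, Tuple


def opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative per-class evidence to (belief masses, uncertainty)."""
    num_classes = len(evidence)
    strength = sum(e + 1.0 for e in evidence)  # Dirichlet strength S = sum(alpha)
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength       # beliefs and uncertainty sum to 1
    return beliefs, uncertainty


def conflict(b1: List[float], b2: List[float]) -> float:
    """Belief mass that two views assign to different classes."""
    return sum(b1[i] * b2[j]
               for i in range(len(b1))
               for j in range(len(b2))
               if i != j)


# Hypothetical evidence for (real, fake) from a semantic and an artifact view.
semantic_beliefs, semantic_u = opinion([9.0, 0.0])
artifact_beliefs, artifact_u = opinion([0.0, 9.0])
disagreement = conflict(semantic_beliefs, artifact_beliefs)
```

In this toy case the two views are individually confident but disagree, so the conflict term is large. That is the kind of situation in which a calibrated detector would report uncertainty rather than a hard decision.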

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on deepfake detection using multi-view evidential learning and geometric view purification, which has low relevance to the provided keywords centered on unified models, tokenization, world models, MLLMs, and reinforcement learning. Only Visual Encoder and MultiModal have marginal relevance due to the vision-based task and multi-view terminology, respectively. No expert authors from the target list were found.

Keywords

Deepfake Detection, Multi-View Evidential Learning, Geometric View Purification, Semantic Masking Effect, Uncertainty-Aware Evidential Learning, Reliable Framework, Generalization Performance

Score: 7.5 / 27.8
Authors: Mingxiao Wang, Xiaozhen Qu, Bolin Gao, Tong Wang, Lei He
Published: 2026-06-01
TL;DR: The paper proposes a hierarchically decoupled Mixture-of-Experts framework for traffic sign recognition that improves detection accuracy and efficiency in complex driving scenarios through dynamic expert routing.
Abstract (Chinese translation)

交通标志检测是自动驾驶及智能交通系统中环境感知的基础组成部分。然而,大多数现有检测器依赖于具有全局共享参数的静态推理,限制了其适应多样化且非结构化交通场景的能力。因此,单个静态模型往往难以同时兼顾清晰的近距样本以及具有挑战性的条件,例如远距离小目标或恶劣天气环境。为解决这一局限性,我们提出 CBDES MoE TSR,一种用于交通标志识别的层次解耦异构混合专家(Mixture-of-Experts,MoE)框架。所提出的框架通过引入异构的 You Only Look Once(YOLO)专家池以及轻量级门控网络,突破了常规的全局共享参数范式,实现了图像级的动态路由机制。基于输入图像的语义特征,门控模块从专家池中有选择地激活最合适的专家模型,从而实现从固定参数拟合到按需动态表示的转变。该设计增强了特定场景下的特征提取能力,同时保持了受控的推理开销。实验结果表明,所提出的方法在复合交通标志数据集上实现了检测精度与效率之间的显著平衡。具体而言,该方法达到了 76.8% 的 mAP50-95,相较于基线方法(74.5%)提升了 2.3%,同时计算开销降低了约 39.4%。这些发现稳健地验证了所提出方法的有效性。

Abstract

Traffic sign detection is a fundamental component of environmental perception in autonomous driving and intelligent transportation systems. However, most existing detectors rely on static inference with globally shared parameters, limiting their ability to adapt to diverse and unstructured traffic scenarios. As a result, a single static model often struggles to simultaneously handle both clear near-range samples and challenging conditions such as distant small targets or adverse weather environments. To address this limitation, we propose CBDES MoE TSR, a hierarchically decoupled heterogeneous mixture-of-experts (MoE) framework for traffic sign recognition. The proposed framework departs from the conventional globally shared parameter paradigm by introducing a heterogeneous You Only Look Once (YOLO) expert pool together with a lightweight gating network, enabling an image-level dynamic routing mechanism. Based on the semantic characteristics of the input image, the gating module selectively activates the most suitable expert model from the expert pool, enabling a shift from fixed parameter fitting to on-demand dynamic representation. This design enhances feature extraction capability for specific scenarios while maintaining controlled inference overhead. Experimental results demonstrate that the proposed method achieves a remarkable balance between detection accuracy and efficiency on the composite traffic sign dataset. Specifically, our method attains an mAP50-95 of 76.8%, an improvement of 2.3 percentage points over the baseline method (74.5%), while simultaneously reducing computational overhead by approximately 39.4%. These findings robustly validate the effectiveness of the proposed approach.
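
Image-level routing of this kind can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's architecture: the gate is a single linear layer over a hypothetical feature vector, the experts are placeholder callables rather than YOLO models, and exactly one expert (top-1) is executed per image.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(features: Sequence[float],
          gate_weights: Sequence[Sequence[float]],
          experts: Sequence[Callable[[Sequence[float]], str]]) -> Tuple[int, str]:
    """Run only the expert that the gate scores highest for this image."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in gate_weights]
    probs = softmax(logits)
    best = max(range(len(experts)), key=probs.__getitem__)
    return best, experts[best](features)


# Placeholder experts standing in for heterogeneous detectors.
experts = [lambda f: "near-range expert", lambda f: "small-target expert"]
gate_weights = [[2.0, 0.0], [0.0, 2.0]]
choice = route([1.0, 0.0], gate_weights, experts)
```

Because only the selected expert runs, inference cost depends on the chosen expert rather than on the size of the whole pool, which is consistent with the efficiency claim in the abstract.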

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on computer vision (traffic sign detection) using a Mixture-of-Experts (MoE) framework and YOLO architecture. It shows low relevance to the provided keywords: 'Tokenizer', 'World Models', 'MLLM', 'MultiModal', and 'model-based RL' are completely unrelated as the work is single-modal vision detection without language or reinforcement learning components. 'Unify Models' has slight relevance due to MoE expert unification (score 2), and 'Visual Encoder' has moderate relevance as YOLO uses a backbone for feature extraction (score 3), but neither aligns with the specific context of world models or representation learning implied by the keyword set.

Keywords

Mixture-of-Experts, Traffic Sign Recognition, YOLO, Dynamic Routing, Autonomous Driving, Heterogeneous Expert Pool, Inference Efficiency

Score: 7.5 / 27.8
Authors: Zhuoxu Huang, Zhenkun Fan, Jungong Han, Josef Kittler
Published: 2026-06-01
TL;DR: MotionPDE proposes a PDE-based contrastive learning framework for self-supervised representation learning of point cloud videos, effectively capturing spatial-temporal correlations without relying on tokenizers or language models.
Abstract (Chinese translation)

探究时空相关性,特别是空间点随时间的变化,对于理解点云视频至关重要。传统方法,尤其是基于流的方法,由于序列点云数据的无序空间排列,难以处理这些相关性。为了解决这一挑战,我们提出了一种新方法,通过将问题建模为可解的偏微分方程 (PDE) 来正则化时空相关性学习。尽管偏微分方程在物理领域长期有效,但将其应用于点云视频等新型序列数据的研究仍显不足。受流体分析启发,我们构建了一个简化的 PDE,且 PDE 的求解过程由时间嵌入与空间嵌入之间的对比学习结构引导和优化。借助这种额外监督,我们的方法(名为 MotionPDE)可作为现有骨干网络的有效即插即用增强模块,仅增加极少的计算开销和参数。利用对比学习过程,我们深入探讨了 MotionPDE 的自监督能力,取得了有前景的结果,凸显了其在点云视频数据理解中的实用性和适应性。包含训练好的检查点的代码仓库将在 https://github.com/zhh6425/motionpde.git 上提供,以促进未来的研究。

Abstract

Investigating spatial-temporal correlations, specifically how spatial points vary over time, is crucial for understanding point cloud videos. Traditional methods, particularly flow-based techniques, struggle with these correlations due to the unordered spatial arrangement of sequential point cloud data. To address this challenge, we propose a novel approach that regularizes spatial-temporal correlation learning by formulating the problem as a solvable Partial Differential Equation (PDE). While PDEs have long been effective in the physical domain, their application to novel sequential data like point cloud video remains underexplored. Inspired by fluid analysis, we construct a simplified PDE, and the process of solving the PDE is guided and refined by a contrastive learning structure between the temporal embeddings and the spatial embeddings. With this extra supervision, our method, named MotionPDE, serves as an effective, plug-and-play enhancement module for existing backbone models, adding minimal computational overhead and parameters. Capitalizing on the contrastive learning process, we delve deeper into the self-supervised capabilities of MotionPDE, yielding promising results that underscore its utility and adaptability in point cloud video data interpretation. The code repo with trained checkpoints will be available at https://github.com/zhh6425/motionpde.git to facilitate future research.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 3.0/10 4.5
World Models 1.5 1.0/10 1.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on point cloud video representation learning via PDEs and contrastive learning, showing low relevance to MLLMs, tokenizers, RL, or multimodal fusion. 'Visual Encoder' has moderate relevance (3/10) for processing visual data, while 'Unify Models' and 'World Models' have slight relevance (1/10 each) due to spatial-temporal integration, but lack generative or language-based characteristics. Weighted total score is 7.5, below the dynamic passing score of 27.8.
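
The arithmetic behind the score table above can be reproduced with a short Python sketch. The relevance values, the weight of 1.5, and the passing score of 27.8 are taken from this report. How the dynamic passing score is derived is not stated here, so it is treated as a given constant, and the `>=` comparison is an assumption.

```python
WEIGHT = 1.5          # the same weight is listed for every keyword in this report
PASSING_SCORE = 27.8  # dynamic passing score as reported; its derivation is not shown

# Relevance (out of 10) for the MotionPDE entry, copied from the table above.
relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 3.0,
    "World Models": 1.0,
    "MLLM": 0.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}


def weighted_total(scores: dict, weight: float) -> float:
    """Sum of weight * relevance over all keywords."""
    return sum(weight * value for value in scores.values())


total = weighted_total(relevance, WEIGHT)  # 1.5 + 4.5 + 1.5
passed = total >= PASSING_SCORE            # assumed comparison rule
```

The result matches the "Score: 7.5 / 27.8" line for this entry, which falls below the passing score.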

Keywords

Point Cloud Video, Representation Learning, PDE Model, Contrastive Learning, Spatial-Temporal Correlation, Self-Supervised, MotionPDE, Backbone Models

Score: 6.0 / 27.8
Authors: Hallah Shahid Butt, Qiong Huang, Gökhan Demirel, Kevin Förderer, Erfan Tajalli-Ardekani, Simnon Waczowicz, Luigi Spatafora, Veit Hagenmeyer, Benjamin Schäfer
Published: 2026-06-01
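TL;DR: This paper proposes an explainable deep reinforcement learning framework for optimizing residential building energy management; by comparing on-policy and off-policy algorithms, it reduces electricity costs and provides transparent insight into the agent's decisions.
Abstract (Chinese translation)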

可再生能源在电力系统中的日益整合,尤其是在配备光伏(PV)面板和储能系统的建筑中,给能源系统带来了显著复杂性。波动的电力生产、变化的电价以及新增实体(如光伏系统和热泵)增加了系统的复杂性,使其更难运行。这催生了对额外控制与优化路径的需求,包括基于数据的控制方法,例如强化学习。尽管深度强化学习(DRL)已成为在动态且日益复杂的环境中优化建筑运营的有前景方案,但其黑盒特性阻碍了用户的信任及实际采纳。本文提出了一种应用于住宅建筑能源管理的可解释深度强化学习(XRL)框架。我们在合成数据以及来自卡尔斯鲁厄理工学院(KIT)的生活实验室能源园区(LLEC)的真实数据上展示了其应用。我们在一个扩展的状态空间上训练并比较了基于策略(on-policy)和离策略(off-policy)的 DRL 智能体,该空间整合了实时测量数据(需求、光伏发电量、电池功率、荷电状态)、外部信号(动态电价、本地天气数据)、日历及假期指标,以及需求和价格的预测。实验结果表明,基于策略(on-policy)的算法,尤其是优势演员批评(A2C)和近端策略优化(PPO),在累积奖励和策略稳定性方面优于离策略(off-policy)方法。为了解释这些模型,我们采用事后解释技术来阐述所学习的控制策略。研究结果表明,XRL 框架不仅通过最优电池管理降低了电力成本,还为智能体的决策过程提供了透明且可操作的洞察。

Abstract

The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels and energy storage systems, introduces significant complexity in energy systems. Volatile power generation, varying electricity tariffs, and additional entities such as PV systems and heat pumps make the system harder to operate. This creates demand for additional control and optimization approaches, including data-driven control such as reinforcement learning. While deep reinforcement learning (DRL) has emerged as a promising solution to optimize building operations in dynamic and ever more complex environments, its black-box nature impedes user trust and practical adoption. This paper presents a framework for explainable deep reinforcement learning (XRL) applied to energy management in residential buildings. We demonstrate its use on both synthetic data and real-world data from the Living Lab Energy Campus (LLEC) at KIT. We train and compare both on-policy and off-policy DRL agents on an expanded state space that incorporates real-time measurements (demand, PV generation, battery power, state of charge), external signals (dynamic electricity price, local weather data), calendrical and holiday indicators, and forecasts for demand and price. Our experimental results indicate that on-policy algorithms, particularly Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO), outperform off-policy methods in terms of cumulative rewards and policy stability. To explain these models, we employ post-hoc interpretation techniques to elaborate the learned control policies. Our findings demonstrate that the XRL framework not only reduces electricity costs through optimal battery management, but also provides transparent, actionable insights into the agent's decision-making process.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 1.0/10 1.5

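Scoring rationale: The paper focuses on explainable deep reinforcement learning (XRL) for building energy management and does not involve multimodal large language models (MLLM), tokenizers, visual encoders, or unified model architectures. Although it uses reinforcement learning, it relies on standard policy-gradient methods (A2C/PPO), which are model-free, and its data are time series rather than multimodal inputs, so relevance to the given keywords is very low. None of the specified experts appear in the author list. The weighted total score is 6.0, below the dynamic passing score of 27.8.

Keywords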

Deep Reinforcement Learning, Explainable AI, Energy Management, Residential Buildings, Renewable Energy, Policy Optimization, Post-hoc Interpretation

Score: 6.0 / 27.8
Authors: Oleksandr Nikitin
Published: 2026-06-01
TL;DR: PlanarBench evaluates LLM spatial reasoning via planar graph drawing ASCII art, identifying edge count as the primary difficulty predictor.
Abstract (Chinese translation)

PlanarBench 测试大语言模型(LLM)是否仅凭边列表即可将平面图绘制为 ASCII 图——这是一个空间推理任务,由于边的顺序、边的方向及节点标签均可置换,因而难以通过记忆掌握。我们在 199 个最简单的非同构连通平面图(2-7 个顶点)上评估了 91 个模型。边数是主要的难度预测因子($r = -0.85$)——这一发现未在先前的 LLM 图基准测试中被报告,后者仅将节点数作为难度维度。

Abstract

PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list -- a spatial reasoning task that resists memorization because edge order, edge orientation, and node labels are all permutable. We evaluate 91 models on the 199 simplest non-isomorphic connected planar graphs (2-7 vertices). Edge count is the dominant difficulty predictor ($r = -0.85$) -- a finding not reported in prior LLM graph benchmarks, which use only node count as the difficulty axis.
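
The three permutable properties named in the abstract (edge order, edge orientation, node labels) can be made concrete with a small Python sketch. It is an illustration only, and the function name and seeding scheme are assumptions rather than PlanarBench's actual generator. It produces a differently written edge list that describes the same graph.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def permuted_instance(edges: List[Edge], seed: int) -> List[Edge]:
    """Relabel nodes, flip edge orientations, and shuffle edge order."""
    rng = random.Random(seed)
    nodes = sorted({v for edge in edges for v in edge})
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(nodes, shuffled))  # random node relabeling
    result = []
    for u, v in edges:
        a, b = mapping[u], mapping[v]
        if rng.random() > 0.5:  # undirected edge: orientation carries no meaning
            a, b = b, a
        result.append((a, b))
    rng.shuffle(result)  # edge order carries no meaning either
    return result


triangle_with_tail = [(0, 1), (1, 2), (2, 0), (2, 3)]
variant = permuted_instance(triangle_with_tail, seed=7)
```

Every variant keeps the edge count and degree sequence of the original, so a model cannot rely on having memorized one particular listing of a graph.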

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

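Scoring rationale: The paper is a benchmark of LLM spatial reasoning and does not involve unified model architectures, world model construction, visual encoder design, or model-based reinforcement learning, so those keywords score 0. Although the task involves spatial understanding, it is carried out through text generation (ASCII) rather than typical multimodal fusion, so MLLM and MultiModal receive low scores (2). None of the specified experts appear in the author list. The weighted total score is 6.0, below the dynamic passing score of 27.8.

Keywords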

PlanarBench, LLM Spatial Reasoning, Planar Graph Drawing, ASCII Art, Edge Count, Benchmark Evaluation, Graph Theory, Model Evaluation

Score: 6.0 / 27.8
Authors: Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi
Published: 2026-06-01
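TL;DR: This paper shows that long context windows are needed in time series forecasting because they reduce uncertainty about the generative process, and that decoupling generative process identification from conditional forecasting improves computational scalability without compromising accuracy.
Abstract (Chinese translation)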

现代用于预测时间序列组的深度学习模型越来越依赖更长的观测窗口。然而,增加窗口尺寸的好处往往仅被归因于捕捉长程依赖,而对于全局预测模型如何利用输入观测的更广泛讨论则相对有限。本文指出,预测时间序列组涉及两个目标:(i) 生成过程识别(GPI),即推断生成输入序列的具体过程;以及 (ii) 条件预测(CF),即基于输入观测预测未来值。从这个角度来看,最优预测可被解释为对所有合理的数据生成过程的加权平均,权重由它们给定输入窗口的似然决定。这为长上下文窗口(context windows)的优势提供了另一种解释:它们降低了在运行过程中关于具体是哪个过程生成输入时间序列的不确定性。我们证明,即使对于记忆长度为 $P$ 的过程,输入窗口大小严格大于 $P$ 也是实现最小可达误差所必需的。最后,我们展示了解耦 GPI 与 CF 如何在不牺牲准确性的前提下提高计算可扩展性。在合成数据和真实世界数据上的实验验证了我们的见解及其对设计预测架构的相关性。

Abstract

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.
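
The statement that optimal predictions average over plausible data-generating processes, weighted by their likelihood given the input window, can be illustrated with a toy Python sketch. The setup is an assumption chosen for simplicity and is not taken from the paper: the candidate processes are AR(1) models that differ only in their coefficient, noise is Gaussian with a shared standard deviation, and the prior over candidates is uniform.

```python
import math
from typing import List, Sequence


def posterior_weights(window: Sequence[float],
                      phis: Sequence[float],
                      sigma: float = 0.1) -> List[float]:
    """Posterior over candidate AR(1) coefficients given the window (uniform prior)."""
    log_liks = []
    for phi in phis:
        squared_error = sum((x_next - phi * x_prev) ** 2
                            for x_prev, x_next in zip(window, window[1:]))
        log_liks.append(-squared_error / (2 * sigma ** 2))
    peak = max(log_liks)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(ll - peak) for ll in log_liks]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def forecast(window: Sequence[float], phis: Sequence[float],
             sigma: float = 0.1) -> float:
    """One-step forecast averaged over candidate processes."""
    weights = posterior_weights(window, phis, sigma)
    return sum(w * phi * window[-1] for w, phi in zip(weights, phis))


candidates = [0.9, 0.5]
short_window = [1.0, 0.9]
long_window = [1.0, 0.9, 0.81, 0.729]
w_short = posterior_weights(short_window, candidates)
w_long = posterior_weights(long_window, candidates)
```

Both windows come from the process with coefficient 0.9, which has memory length 1, yet the longer window concentrates more posterior mass on the correct process. This mirrors the abstract's argument that windows longer than the memory length still help by identifying which process is generating the data.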

Score details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 2.0/10 3.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 1.0/10 1.5

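Scoring rationale: The paper focuses on context windows and generative process identification in time series forecasting and has no direct connection to multimodality, visual encoders, tokenizers, MLLMs, or reinforcement learning. It overlaps conceptually with World Models only in modeling the generative process, its decoupling of objectives relates weakly to Unify Models, and its predictive modeling resembles model-based RL without being reinforcement learning, so overall relevance is low.

Keywords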

Time Series Forecasting, Long Context Windows, Generative Process Identification, Conditional Forecasting, Deep Learning, Model-Based, Uncertainty Reduction

Score: 6.0 / 27.8
Authors: Jonathan Mayo, Moshe Unger, Konstantin Bauman
Published: 2026-06-01
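TL;DR: This paper proposes SPHERE, which uses LLM-generated semantic personas to transfer knowledge between heterogeneous domains that share neither users nor items, outperforming traditional recommendation baselines.
Abstract (Chinese translation)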

数字平台日益演变为孤立的信息孤岛,限制了其跨域构建全面用户表征的能力。跨域推荐系统(Cross-domain recommender systems)旨在通过将知识从源域(source domain)迁移至目标域(target domain)来克服这一限制,然而大多数现有方法依赖于共享用户、共享物品或结构相似的交互图(interaction graphs)。这些假设在独立平台之间往往并不现实。我们提出 SPHERE(Semantic Personas for Heterogeneous cross-domain Recommendation),这是一种设计制品,能够在严格不相交的域之间实现推荐知识转移,且无需共享用户或物品。与通过身份或图结构对齐域不同,SPHERE 利用大语言模型(large language models)诱导共享行为词汇,为用户生成结构化语义人物(semantic personas),并检索行为相似的源域社区,从而形成社区源人物(Community Source Persona)。该语义信号通过双塔架构(dual-tower architecture)和动态融合门(dynamic fusion gate)与协同信号(collaborative signals)相整合,从而使 SPHERE 能够增强标准推荐骨干网络。在 Amazon Books、Goodreads 和 Steam 数据集上的实证评估表明,在全排名评估(full-ranking evaluation)下,SPHERE 相对于 NCF、SVD++ 和 LightGCN 基线模型表现出一致的提升。结果表明,跨域迁移的有效性并非仅由域间的语义邻近性(semantic proximity)决定;相反,它关键地取决于目标域的结构密度和原生预测能力。本研究通过将跨域个性化重新定义为基于行为的语义对齐(behavior-based semantic alignment),为信息系统研究(information systems research)做出了贡献,提供了一种克服信息孤岛的实际机制,同时保持了可解释性和模块化。

Abstract

Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representations across domains. Cross-domain recommender systems seek to overcome this limitation by transferring knowledge from a source domain to a target domain, yet most existing approaches depend on shared users, shared items, or structurally similar interaction graphs. These assumptions are often unrealistic across independent platforms. We propose SPHERE (Semantic Personas for Heterogeneous cross-domain Recommendation), a design artifact that enables recommendation knowledge transfer across strictly disjoint domains with no shared users or items. Rather than aligning domains through identity or graph structure, SPHERE uses large language models to induce a shared behavioral vocabulary, generate structured semantic personas for users, and retrieve behaviorally similar source-domain communities that form a Community Source Persona. This semantic signal is integrated with collaborative signals through a dual-tower architecture and dynamic fusion gate, allowing SPHERE to augment standard recommender backbones. Empirical evaluation across Amazon Books, Goodreads, and Steam demonstrates consistent improvements over NCF, SVD++, and LightGCN baselines under full-ranking evaluation. The results show that cross-domain transfer effectiveness is not determined solely by semantic proximity between domains; rather, it depends critically on the structural density and native predictive strength of the target domain. The study contributes to information systems research by reframing cross-domain personalization as behavior-based semantic alignment, offering a practical mechanism for overcoming information silos while preserving interpretability and modularity.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

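Scoring rationale: The paper focuses on cross-domain recommender systems, using semantic personas and LLMs to address information silos, and has very little connection to the multimodal LLM, world model, and reinforcement learning topics in the keyword set. It does not involve visual encoders, world model construction, or reinforcement learning algorithms. Its use of LLMs gives it a weak link to Tokenizer and MLLM, but its core contribution is semantic alignment for recommendation rather than unified model architectures or multimodal representation learning.

Keywords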

Cross-domain Recommendation, Semantic Personas, Large Language Models, Information Silos, Dual-tower Architecture, Knowledge Transfer, Heterogeneous Domains

Score: 6.0 / 27.8
Authors: Thi-Nhung Nguyen, Linhao Luo, Rollin Omari, Junae Kim, Thuy-Trang Vu, Dinh Phung
Published: 2026-06-01
TL;DR: TriAlign proposes a multi-agent reinforcement learning framework to ensure universal truth consistency across social groups in personalized LLMs while preserving personalization and objective task performance.
Abstract (Chinese translation)

个性化大语言模型(Personalized LLMs)根据用户的偏好和社会属性调整响应,但可能在不同的社会群体之间引入显著的普遍真理不一致性,其中某些群体在客观任务上系统性地接收到较不准确的响应。现有的对齐方法要么忽略个性化,要么主要关注主观偏好对齐,在很大程度上忽视了普遍真理中的公平性和一致性。为了解决这一差距,我们研究了真理不变对齐(Truth-Invariant Alignment, TIA),这是一个针对个性化大语言模型的对齐问题,旨在确保普遍真理在社会群体之间保持一致,同时保留个性化。我们提出了 TriAlign,这是首个用于 TIA 的离线多智能体强化学习(MARL)框架,其中每个社会群体被建模为一个相互作用的智能体。TriAlign 通过一个公平感知目标和显式的不一致性惩罚,联合优化普遍真理准确性、跨组真理一致性和个性化。在多个基准测试上的实验表明,TriAlign 在这三个目标之间实现了比强基线更强的平衡,减少了社会群体之间的普遍真理差异,同时提高了客观任务表现和个性化质量。

Abstract

Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an interacting agent. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 2.0/10 3.0

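Scoring rationale: The paper focuses on Truth-Invariant Alignment (TIA) for personalized LLMs using multi-agent reinforcement learning (MARL). Keywords such as Tokenizer, Visual Encoder, MultiModal, World Models, and MLLM are unrelated to this text alignment task and score 0. Unify Models applies only in the sense of unifying truths rather than model architectures, and although the work involves reinforcement learning, it does not explicitly specify a model-based approach, so model-based RL receives a low score (2). No target experts were detected. The weighted total score is 6.0, below the dynamic passing score of 27.8.

Keywords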

Personalized LLMs, Truth-Invariant Alignment, Multi-Agent Reinforcement Learning, Universal Truth Consistency, Fairness-aware Objective, Offline Alignment, Cross-group Truth Consistency

Score: 6.0 / 27.8
Authors: Pingping Liu, Aohua Li, Yubing Lu, Jin Kuang, Tongshun Zhang, Qiuzhan Zhou
Published: 2026-06-01
TL;DR: RPCASSM proposes a robust PCA-based state space model with specialized background and target modules to improve infrared small target detection accuracy and edge modeling.
Abstract (Chinese translation)

红外小目标的检测与分割在监视与安全、海上救援等领域具有重要的应用价值。由于这些目标在远距离成像中占据率低,主流视觉状态空间模型效率较低,且难以准确建模目标边缘。现有的红外状态空间模型未基于红外小目标的结构特性,偏离主流视觉状态空间的结构框架。为了解决这一问题,本文基于鲁棒主成分分析(RPCA)模型范式提出了 RPCASSM 网络,旨在根据空间域内红外小目标的特性,设计背景状态空间模块(BSSM)和目标状态空间模块(TSSM)。BSSM 旨在利用空间异质信号的显著性,设计空间探针扫描机制(SPCM)以建模背景信息。TSSM 利用目标的稀疏性和局部高亮特性,设计可变形提示扫描机制(DPCM),聚焦于目标的可变形空间进行状态空间建模。基于上述设计,本文有效解决了现有主流视觉状态空间模型难以准确建模红外小目标边缘结构的问题。在现有基准数据集上的实验结果证明了 RPCASSM 设计的有效性。代码将在 https://github.com/PepperCS/RPCASSM 处公开。

Abstract

The detection and segmentation of infrared small targets are important in applications such as surveillance and security and maritime rescue. Because these targets occupy very few pixels in long-distance imaging, mainstream visual state space models are inefficient and struggle to model target edges accurately. Existing infrared state space models do not depart from the mainstream visual state space framework to account for the structural properties of infrared small targets. To solve this problem, this paper proposes the RPCASSM network, based on the robust principal component analysis (RPCA) model paradigm, which designs a background state space module (BSSM) and a target state space module (TSSM) according to the nature of infrared small targets in the spatial domain. The BSSM uses the saliency of spatially heterogeneous signals to design a spatial probe scanning mechanism (SPCM) that models background information. The TSSM uses the sparsity and local highlight of the target to design a deformable prompt scanning mechanism (DPCM) that focuses state space modeling on the deformable space of the target. With this design, we effectively address the difficulty that mainstream visual state space models have in accurately modeling the edge structure of infrared small targets. Experimental results on existing benchmark datasets demonstrate the effectiveness of the RPCASSM design. Our code will be made public at https://github.com/PepperCS/RPCASSM.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on infrared small target detection using a Robust PCA State Space Model (RPCASSM). It does not involve multi-modal fusion, language models, tokenization, or reinforcement learning, resulting in low scores for MLLM, MultiModal, Tokenizer, and model-based RL. While it processes visual data (Visual Encoder) and uses State Space Models (World Models), it is a detection task rather than a generative world model or unified modeling framework, hence low scores here. No expert authors from the specified list are present in the author list.

Keywords

Infrared Small Target Detection, Robust PCA, State Space Model, Background State Space Module, Target State Space Module, Spatial Probe Scanning, Deformable Prompt Scanning

Score: 6.0 / 27.8
Authors: Zefeng Li, Evan Shelhamer
Published: 2026-06-01
TL;DR: This paper benchmarks open-set test-time adaptation methods on corrupted image datasets, revealing their struggle to balance in-distribution recognition accuracy and out-of-distribution detection.
Abstract (Chinese translation)

开放集测试时适应(TTA)在存在输入偏移和未知输出类别的情况下,基于新数据更新模型。尽管近期方法在提高已知类别的分布内(InD)准确率方面取得了进展,但它们准确检测分布外(OOD)未知类别的能力仍未被充分探索。我们在小规模 CIFAR-10-C 和大规模 ImageNet-C 的标准污损基准上,对鲁棒及开放集 TTA 方法(SAR、OSTTA、UniEnt 和 SoTTA)进行了基准评估。对于 CIFAR-10-C,我们使用来自 SVHN 和 CIFAR-100 的分布外(OOD)数据,分别采用其污损形式 SVHN-C 和 CIFAR-100-C。对于 ImageNet-C,我们使用来自 ImageNet-O 和 Textures 的分布外(OOD)数据,分别采用其污损形式 ImageNet-O-C 和 Textures-C。ImageNet-O 更接近 ImageNet,因其包含未知但相关的对象类别(例如食物中的“蒜蓉面包”与“热狗”,或基础设施中的“高速公路”与“水坝”);而 Textures 离 ImageNet 更远,因其包含非对象模式(例如“开裂”的泥土、“多孔”的海绵、“有脉络”的叶子)。我们在 CIFAR-10-C 和 ImageNet-C 上评估 TTA 方法在分布内(InD)与分布外(OOD)识别上的准确率和置信度。我们在 CIFAR-10-C 上验证了各方法自身 OOD 检测技术的准确性。此外,我们在 ImageNet-C 上也进行了评估,并报告了准确率和标准的 OOD 检测指标。我们进一步考察了更现实的设置,其中分布外(OOD)数据的比例和比率可变。为了探索分布内(InD)识别与分布外(OOD)拒绝之间的权衡,我们提出了一种新的基线,该基线将 softmax/多类输出替换为 sigmoid/多标签输出。我们的分析首次表明,当前的开放集 TTA 方法难以平衡分布内(InD)与分布外(OOD)准确率,且它们仅能不完备地过滤分布外(OOD)数据以用于自身的适应更新。

Abstract

Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like "garlic bread" vs. "hot dog" for food, or "highway" vs. "dam" for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like "cracked" mud, "porous" sponge, "veined" leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
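
Why a sigmoid/multi-label output can help with OOD rejection is easiest to see in a small Python sketch. The abstract only states that softmax is replaced by sigmoid. The threshold-based rejection rule and the value 0.5 below are illustrative assumptions, not the paper's baseline.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_scores(logits: Sequence[float]) -> List[float]:
    """Score each class independently; scores need not sum to one."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def predict_or_reject(logits: Sequence[float],
                      threshold: float = 0.5) -> Optional[int]:
    """Return the top class index, or None when no class reaches the threshold."""
    scores = sigmoid_scores(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


known_input = [2.0, -1.0, -3.0]     # one class clearly active
unknown_input = [-2.0, -1.0, -3.0]  # no class active
softmax_top = max(softmax(unknown_input))
decision = predict_or_reject(unknown_input)
```

Softmax normalizes across classes, so it still assigns a confident-looking maximum to the input where every logit is negative. Independent sigmoid scores all stay low on that input, which allows a "none of the known classes" decision.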

Score details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on Open-set Test-Time Adaptation in computer vision (InD vs OOD accuracy), while keywords relate to Multimodal LLMs, World Models, and RL. Most keywords are irrelevant (0). 'Visual Encoder' and 'Unify Models' score low (2.0) due to vision context and method comparison, but lack specific alignment with keyword definitions.

Keywords

Open-set Test-time Adaptation, In-Distribution Accuracy, Out-of-Distribution Detection, Corrupted Image Benchmarks, Model Adaptation, Sigmoid Output Baseline, Robustness Evaluation

Score: 6.0 / 27.8
Authors: Shamira Venturini, Oliver Hennhöfer, Steffen Kinkel, Jannik Strötgen
Published: 2026-06-01
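TL;DR: TalkTag is an LLM-based tool that automates fine-grained morphosyntactic error annotation of speech transcripts, offering a scalable solution for language research in low-resource settings.
Abstract (Chinese translation)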

细粒度形态句法错误标注在临床及发展性语言研究中至关重要,然而其过程耗时、依赖专家且难以规模化。我们提出 TalkTag,这是一种基于大语言模型(LLM)的微调轻量化工具,旨在自动化口语转录本中的 CHAT 格式错误标注。该系统在极端数据稀缺条件下利用儿童叙事数据开发而成,展示了在低资源环境中进行语言分析的可行性。我们的评估表明,TalkTag 产生了令人鼓舞的精确标注,同时能有效识别出因语言歧义而导致自动标注真正复杂的实例。总之,借助 TalkTag,我们提供了一种可扩展的手动错误标注替代方案,并为形态句法错误标注提供了切实可行的支持。

Abstract

Fine-grained morphosyntactic error annotation is important in clinical and developmental language research, yet it is labour-intensive, expert-dependent, and difficult to scale. We present TalkTag, an LLM-based lightweight tool fine-tuned to automate CHAT-style error annotation in spoken-language transcripts. Developed under conditions of extreme data scarcity using children's narrative data, the system shows the feasibility of linguistic analysis in low-resource settings. Our evaluation demonstrates that TalkTag produces encouragingly precise annotation while effectively identifying instances where linguistic ambiguity makes automated tagging genuinely complex. In summary, with TalkTag, we provide a scalable alternative to manual error annotation and practically viable support for morphosyntactic error annotation.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

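Scoring rationale: The core of the paper is using an LLM to automate morphosyntactic error annotation of speech transcripts, which sits at the intersection of NLP and linguistics. The provided keywords mainly concern unified models, visual encoders, world models, and reinforcement learning, and match the paper's content poorly. The paper does not involve the visual modality, reinforcement learning, or world model construction and only indirectly uses LLM technology, so its relevance scores are very low.

Keywords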

TalkTag, Morphosyntactic Error Annotation, Transcribed Speech, LLM-based Tool, Low-resource Settings, Automated Annotation, Clinical Language Research

Score: 6.0 / 27.8
Authors: Weibai Fang, Haijun Che, Feiyang Ren, Qiancheng Lao
Published: 2026-06-01
TL;DR: This paper proposes a normality-preserving continual anomaly detection framework using orthogonal LoRA banks to prevent catastrophic forgetting, achieving state-of-the-art results on industrial datasets without involving multimodal or reinforcement learning techniques.
Abstract (Chinese translation)

基于扩散模型的持续工业异常检测面临历史正常性先验漂移和灾难性遗忘的挑战。现有的持续扩散方法通过回放或约束优化来保留先前知识,但在顺序适应过程中缺乏显式机制来隔离和保护类别特定的正常性先验。尽管低秩适应提供了模块化残差更新,但标准 LoRA 既不冻结历史正常性子空间,也不防止新适配器干扰之前的适配器。为了解决这一问题,我们提出了一种基于两个模块的保持正常性的持续异常检测框架:历史冻结正交 LoRA 库 (HF-OLB) 和层次新颖性自适应库增长模块 (HNABG)。HF-OLB 冻结了预训练的 U-Net 骨干和已学习的 LoRA 库,并将新任务特定的正常性残差约束在历史 LoRA 子空间的正交补空间。HNABG 进一步分配层依赖的残差容量,并且仅当残差正常性新颖性超过现有库的表达能力时才扩展库。在 MVTec 和 VisA 上的广泛实验证明了所提方法的有效性。在具有挑战性的 VisA 2x6 设置下,我们的方法在图像级和像素级 A-AUROC 上分别达到 83.6 和 91.8,FM 分别为 3.8 和 3.9,相比最先进方法,像素级 A-AUROC 提高了 3.2 点,同时像素级 FM 降低了 1.3。这些结果表明,我们的方法在长时序持续类别序列中有效保持了历史正常性先验。

Abstract

Continual industrial anomaly detection with diffusion models suffers from historical normality prior drift and catastrophic forgetting. Existing continual diffusion methods preserve previous knowledge through replay or constrained optimization, but they lack an explicit mechanism for isolating and protecting category-specific normality priors during sequential adaptation. Although low-rank adaptation provides modular residual updates, standard LoRA neither freezes historical normality subspaces nor prevents new adapters from interfering with previous ones. To address this issue, we propose a normality-preserving continual anomaly detection framework based on two modules: History Frozen Orthogonal LoRA Bank (HF-OLB) and Hierarchical Novelty Adaptive Bank Growth module (HNABG). HF-OLB freezes both the pre-trained U-Net backbone and the learned LoRA banks, and constrains new task-specific normality residuals to the orthogonal complement of historical LoRA subspaces. HNABG further allocates layer-dependent residual capacity and expands the bank only when the residual normality novelty exceeds the expressive capacity of existing banks. Extensive experiments on MVTec and VisA demonstrate the effectiveness of the proposed method. On the challenging VisA 2x6 setting, our method achieves 83.6/91.8 image and pixel level A-AUROC with 3.8/3.9 FM, improving pixel level A-AUROC over the state of the art by 3.2 points while reducing pixel level FM by 1.3. These results show that our method effectively preserves historical normality priors in long horizon continual category sequences.
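The orthogonality constraint described in the abstract can be illustrated with a small, generic projection. The sketch below is not the authors' HF-OLB implementation: it assumes the historical adapters are summarized by a few direction vectors (hypothetical values) and simply removes from a new update every component that lies in their span, using plain Gram-Schmidt.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already contained in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_to_complement(v, history_basis):
    """Remove from `v` every component lying in the historical subspace."""
    w = list(v)
    for q in history_basis:
        c = dot(w, q)
        w = [wi - c * qi for wi, qi in zip(w, q)]
    return w

# Hypothetical directions spanned by earlier, frozen adapters.
history = orthonormalize([[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]])
# Hypothetical update direction proposed for a new task.
new_update = [0.5, -2.0, 3.0, 1.0]
constrained = project_to_complement(new_update, history)
```

After projection, `constrained` has zero inner product with every historical direction, which is the sense in which a new residual cannot interfere with earlier ones. How the paper enforces the constraint during training (hard projection, a penalty term, or something else) is not stated in the abstract.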

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on continual industrial anomaly detection using diffusion models and orthogonal LoRA adaptation. It has low relevance to the provided keywords as it addresses single-modal visual tasks rather than multimodal large language models (MLLM), world models, or reinforcement learning. While it utilizes a U-Net backbone for visual processing, it does not align with the specific 'Visual Encoder' context of MLLMs. No tokenization is involved. The calculated weighted score is 6.0, which is below the dynamic passing score of 27.8. None of the listed expert authors are present in the author list.
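The weighted score quoted in the rationale can be reproduced from the table rows. The sketch below assumes each row's score is weight times relevance, which matches every row shown (for example 1.5 × 2.0 = 3.0). The `>=` pass rule is an assumption: the report states only that 6.0 is below the passing score of 27.8, not how that threshold is derived or compared.

```python
PASS_SCORE = 27.8  # dynamic passing score quoted in the report

# (keyword, weight, relevance out of 10), copied from the table above
rows = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 2.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]

def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)

total = weighted_total(rows)
passed = total >= PASS_SCORE  # assumed comparison rule
print(f"Score: {total:.1f} / {PASS_SCORE} -> {'pass' if passed else 'fail'}")
```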

Keywords

Continual Industrial Anomaly Detection, Diffusion Models, Orthogonal LoRA Banks, Catastrophic Forgetting, Normality Prior, U-Net Backbone, MVTec Dataset

Score: 6.0 / 27.8
Authors: Vietbao Tran, Pratik Shah
Published: 2026-06-01
TL;DR: This study develops a conditional GAN to synthesize PIN-4 immunohistochemistry staining from H&E images, enabling direct spatial comparison for prostate cancer diagnosis with acceptable quantitative and qualitative performance.
Abstract translation (Chinese)

免疫组织化学 (IHC) 常用于解决苏木精 - 伊红 (H&E) 染色组织切片中诊断意义不明的前列腺癌活检发现。然而,PIN-4 免疫组织化学染色通常在相邻组织切片上进行,这限制了 H&E 形态学与相应免疫表型信号之间的直接空间对比。本研究从常规临床前列腺活检全切片图像 (WSIs) 构建了一对配准的 H&E/PIN-4 数据集,并训练了一个条件生成对抗网络 (cGAN),旨在直接从原始 H&E 图像块合成 PIN-4 染色模式。最终数据集包含来自 93 名患者的 172 对配准全切片图像 (WSIs) 和 27,298 对配准的 1024x1024 图像块对,涵盖腺癌阳性和良性病例,且在年龄、种族和民族群体中均有代表。该模型在来自 17 张全切片图像 (WSIs) 的 1,814 对图像块的保留测试集上进行了评估,平均峰值信噪比 (PSNR) 达到 21.88 dB,结构相似性指数度量 (SSIM) 为 0.667,皮尔逊相关系数 (PCC) 为 0.684,学习感知图像块相似性 (LPIPS) 为 0.417。经认证病理学家的定性审查显示,生成的图像捕捉到了具有诊断相关性的 PIN-4 染色模式,包括 α-甲基酰基辅酶 A 消旋酶 (AMACR) 表达和基底细胞相关染色,同时保留了与源 H&E 形态的空间对应关系。合成的准确性在不同形态学复杂区域存在差异,包括高级别癌和导管内癌。这些结果支持了从常规获取的明场 H&E 前列腺活检图像中进行监督式 PIN-4 合成的可行性。该方法允许在源前列腺 H&E 组织架构背景下直接解释预测的 PIN-4 标记模式,解决了常规相邻切片免疫组织化学当前的空间局限性。

Abstract

Immunohistochemistry (IHC) is frequently used to resolve diagnostically ambiguous prostate cancer biopsy findings on hematoxylin and eosin (H&E)-stained tissue. However, PIN-4 IHC staining is typically performed on adjacent tissue sections, limiting direct spatial comparison between the H&E morphology and the corresponding immunophenotypic signal. A paired, registered H&E/PIN-4 dataset was constructed from routine clinical prostate biopsy whole-slide images (WSIs), and a conditional generative adversarial network (cGAN) was trained to synthesize PIN-4 staining patterns directly from native H&E image patches. The final dataset comprised 172 paired WSIs from 93 patients and 27,298 registered 1024x1024 patch pairs, spanning adenocarcinoma-positive and benign cases with representation across age, race, and ethnicity groups. The model was evaluated on a held-out test set of 1,814 patch pairs from 17 WSIs, achieving a mean peak signal-to-noise ratio (PSNR) of 21.88 dB, structural similarity index measure (SSIM) of 0.667, Pearson correlation coefficient (PCC) of 0.684, and learned perceptual image patch similarity (LPIPS) of 0.417. Qualitative review by a board-certified pathologist showed that generated images captured diagnostically relevant PIN-4 staining patterns, including AMACR/racemase expression and basal-cell-associated staining, while preserving spatial correspondence with the source H&E morphology. Accuracy of synthesis varied across morphologically complex regions, including high-grade carcinoma and intraductal carcinoma. These results support the feasibility of supervised PIN-4 synthesis from routinely acquired brightfield H&E prostate biopsy images. The approach enables direct interpretation of predicted PIN-4 marker patterns in the context of the source prostate H&E architecture, addressing a current spatial limitation of conventional adjacent-section IHC.
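For readers unfamiliar with the first metric quoted above, PSNR is defined as 10·log10(MAX² / MSE), where MSE is the mean squared pixel difference. The sketch below computes it for two tiny single-channel patches. The pixel values are made up for illustration, and the 8-bit range (`max_value=255`) is an assumption, since the abstract does not state the evaluation settings.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D patches."""
    ref = [p for row in reference for p in row]
    gen = [p for row in generated for p in row]
    if len(ref) != len(gen):
        raise ValueError("patches must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, gen)) / len(ref)
    if mse == 0:
        return math.inf  # identical patches
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 2x2 patches (real PIN-4 patches are 1024x1024 and in color).
real_patch = [[10, 20], [30, 40]]
fake_patch = [[12, 18], [33, 36]]
value_db = psnr(real_patch, fake_patch)
```

Higher values mean the generated patch is closer to the reference pixel by pixel. PSNR says nothing about perceptual or diagnostic quality, which is presumably why the study also reports SSIM, LPIPS, and a pathologist review.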

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on medical image translation using conditional GANs for prostate pathology, which has minimal overlap with the MLLM, World Model, and RL themes of the keywords. Only Visual Encoder (CNN components) and MultiModal (image-to-image translation) show slight technical relevance. Core concepts like Tokenizer, Unify Models, and RL are absent. The weighted score (6.0) is below the dynamic passing score (27.8).

Keywords

Deep Learning, Immunohistochemistry, H&E Images, Prostate Biopsy, Conditional GAN, Image Synthesis, Pathology, Spatial Correspondence

Score: 6.0 / 27.8
Authors: Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave
Published: 2026-06-01
TL;DR: PillarDETR addresses real-time 3D object detection for autonomous driving by integrating YOLOv8 and RT-DETR to achieve an efficient trade-off between accuracy and inference latency without non-maximum suppression.
Abstract translation (Chinese)

实时三维目标检测是自动驾驶系统及机器人安全运行的关键组成部分。尽管激光雷达点云能提供准确的空间信息,但如何高效处理它们仍是一个重大挑战。传统方法依赖于复杂的三维卷积或基于锚点的范式,难以在检测精度与推理速度之间取得平衡。本文提出 PillarDETR,一种新颖的端到端三维目标检测架构,它结合了基于支柱的激光雷达编码效率与现代 2D 视觉模型的表征能力。具体而言,PillarDETR 使用源自 YOLOv8 的 Cross Stage Partial (CSP) 网络替换标准卷积骨干,从而能够从伪图像中提取更丰富的特征。此外,我们摒弃了传统的基于锚点或基于中心的检测头,转而采用实时检测变换器(RT-DETR)解码器。这种混合设计使网络能够捕获全局上下文,并直接预测三维边界框,而无需依赖非极大值抑制(NMS)。在 KITTI 和 nuScenes 基准上的广泛实验表明,PillarDETR 在平均精度均值(mAP)与推理延迟之间取得了极具竞争力的权衡。我们的消融研究证实,整合 YOLOv8 骨干与 RT-DETR 检测头相较于 PointPillars 基线带来了显著提升,确立了 PillarDETR 作为一种高效的实时三维感知解决方案。

Abstract

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.
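The pillar-based encoding mentioned in the abstract starts by grouping LiDAR points into vertical columns on a ground-plane grid. The sketch below shows only that grouping step, with a hypothetical `cell_size` and made-up points. Per-pillar feature learning and the scatter into a pseudo-image, as done in PointPillars-style encoders, are omitted, and nothing here is taken from the PillarDETR code.

```python
import math
from collections import defaultdict

def pillarize(points, cell_size=0.5):
    """Group (x, y, z) points by the x-y grid cell they fall into."""
    pillars = defaultdict(list)
    for x, y, z in points:
        # floor() keeps negative coordinates in the correct cell
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        pillars[key].append((x, y, z))
    return dict(pillars)

# Hypothetical LiDAR returns in metres.
points = [(0.1, 0.2, 1.0), (0.4, 0.3, 0.5), (1.2, -0.1, 0.0)]
pillars = pillarize(points)
for cell, members in sorted(pillars.items()):
    print(cell, len(members))
```

Because height is collapsed into each pillar, the resulting grid can be processed by a 2D backbone, which is what makes it possible to swap in a 2D network such as the CSP backbone the abstract describes.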

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on 3D object detection using YOLO and DETR architectures for LiDAR data, which is unrelated to Unify Models, Tokenizers, World Models, MLLM, or Model-Based RL (scores 0). There is a minor connection to Visual Encoder (YOLO backbone usage) and MultiModal (LiDAR-to-image projection), but these do not align with the multimodal/RL context of the keywords (scores 2). Total weighted score: 6.0, below the dynamic pass score of 27.8. No expert authors from the specified list are present.

Keywords

3D Object Detection, Real-time Inference, YOLOv8 Backbone, RT-DETR Head, LiDAR Point Clouds, Pillar-based Encoding, Autonomous Driving

Score: 6.0 / 27.8
Authors: Yijun Yang, Ruiqiang Xiao, Lijie Hu, Angelica I Aviles-Rivero, Yunzhu Wu, Jing Qin, Lei Zhu
Published: 2026-06-01
TL;DR: This paper proposes a semi-supervised hypergraph concept bottleneck model to enhance interpretability and label efficiency in medical image diagnosis by modeling high-order concept dependencies.
Abstract translation (Chinese)

深度学习已彻底革新了医学图像分析,在各类应用中实现了卓越的诊断准确性。然而,其决策过程缺乏可解释性阻碍了临床应用,尤其在高风险医疗场景中,透明度对于建立信任至关重要。例如,在胎盘植入谱系 (Placenta Accreta Spectrum, PAS) 中,超声成像中的细微线索给可靠诊断带来了挑战,使得黑盒模型难以可靠地进行准确评分。为了解决这一问题,概念瓶颈模型 (Concept Bottleneck Models, CBMs) 提供了一条有前景的途径,通过将具有临床意义的中间概念嵌入诊断流程,使临床医生能够审查并优化模型输出。然而,传统的 CBMs 在捕捉复杂的概念间依赖关系方面表现不佳,且需要昂贵且依赖专家的概念标注,限制了其可扩展性。本研究提出了一种专为医学图像设计的新颖半监督 CBM 框架,该框架利用双层级超图学习来建模高阶概念依赖关系,并生成域自适应伪标签。该方法通过整合概念级超图以增强推理能力,以及图像级超图以稳健生成伪标签,实现了卓越的可解释性和性能。在新标注的 PAS 超声数据集和乳腺超声公共数据集上的实验证明了所提出的概念标签高效可解释框架的有效性。其在皮肤镜图像数据集 SkinCon 上的通用性也进一步得到了验证。代码可在 https://github.com/scott-yjyang/HyperCBM 获取。

Abstract

Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at https://github.com/scott-yjyang/HyperCBM.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on medical image diagnosis using Concept Bottleneck Models and hypergraph learning, which is unrelated to Tokenizers, World Models, MLLM, or Model-Based RL. While it utilizes image encoders and concept labels (suggesting mild multimodality), these are not the core focus relative to the provided keyword set's emphasis on foundation models and reinforcement learning.

Keywords

Medical Image Diagnosis, Concept Bottleneck Models, Hypergraph Learning, Semi-supervised Learning, Interpretability, Label-Efficient, Pseudo-label Generation

Score: 6.0 / 27.8
Authors: Jinwon Ko, Keunsoo Ko, Chang-Su Kim
Published: 2026-06-01
TL;DR: CanonCGT introduces a two-stage color grading framework utilizing a canonical pivot representation to achieve stable and photorealistic tone mapping that matches reference styles while preserving scene structure.
Abstract translation (Chinese)

基于参考的色彩分级(Reference-based color grading)旨在重现参考图像的色调氛围和光照,同时保持色彩和谐与场景结构。现有的照片级真实感(photorealistic)及基于滤波(filter-based)的方法往往产生不稳定的色调映射(tone mappings)——过度偏移或颜色保留不一致——导致结果不自然。我们提出了 CanonCGT,这是一种基于规范枢轴(canonical pivot)的两阶段框架,该枢轴是一种风格中性的中间表示,用于实现稳定的色彩映射。第一阶段通过去除内在色调偏差来规范化输入,第二阶段对其进行色彩分级以匹配参考风格。双阶段训练方案 DP-CGT 结合了监督式预设学习与在未配对照片上的自监督精修。CanonCGT 在多样数据集上提供了照片级真实感且色调一致的结果,在稳定性和视觉保真度方面超越了最先进的方法。我们的代码可在 https://github.com/Jinwon-Ko/CanonCGT 获取。

Abstract

Reference-based color grading aims to reproduce the tonal mood and lighting of a reference while preserving color harmony and scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings (over-shifting or inconsistently retaining colors), leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot, a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity. Our code is available at https://github.com/Jinwon-Ko/CanonCGT.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on reference-based color grading using a canonical pivot representation, which is a computer vision task unrelated to MLLM, Reinforcement Learning, World Models, or Tokenizers. While it processes images (implicit Visual Encoder) and uses multiple inputs (MultiModal), it lacks alignment with the provided keyword themes regarding large model unification or learning. No listed expert authors are present in the author list.

Keywords

Reference-Based Color Grading, Canonical Pivot Representation, Two-Stage Framework, Photorealistic Results, Self-Supervised Refinement, Style-Neutral Representation, Image Color Correction

Score: 4.5 / 27.8
Authors: Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos
Published: 2026-06-01
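TL;DR: This paper proposes a method that improves error prediction for large language models by disentangling input ambiguity from the uncertainty quantification signal, yielding PRR gains of more than 10 points on multiple datasets.
Abstract translation (Chinese)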

错误预测任务,即预测模型输出是否正确,通常通过不确定性量化(UQ)来解决。然而,虽然不确定性度量能够捕捉模型缺乏知识或预测能力的情况,但它们同时也反映了存在于模型输入和语境中的偶然不确定性(aleatoric uncertainty)。本文提出了一种方法,通过解耦输入歧义与不确定性量化(UQ)信号,来改进大语言模型(LLMs)的错误预测。我们在问答(QA)任务上使用了六种不确定性度量(UQ metrics)进行实验,结果表明,UQ 度量在无歧义实例上的错误预测能力优于具有多个可能答案的问题。我们利用门控专家(Gated Experts)和选择性预测(Selective Prediction),将金标准标签和预测的歧义标签纳入错误预测流程。我们发现,歧义信息在模型家族、训练与评估范式、数据集(包括声称无歧义的那些)以及偶然不确定性来源上均提升了错误预测分数,在标准数据集上,单个 UQ 度量的 PRR 提升超过 10 点。

Abstract

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.
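The general idea of combining selective prediction with an ambiguity label can be sketched as follows. This is not the paper's Gated Experts pipeline: the records, the thresholds, and the rule "answer only if uncertainty is at or below a threshold chosen per ambiguity group" are all hypothetical and serve only to show how an ambiguity label can change which answers are kept.

```python
def selective_accuracy(records, thresholds):
    """Coverage and accuracy when abstaining on high-uncertainty instances.

    records: (uncertainty, is_ambiguous, is_correct) triples.
    thresholds: maps is_ambiguous -> largest uncertainty that is still answered.
    """
    answered = [ok for u, amb, ok in records if u <= thresholds[amb]]
    coverage = len(answered) / len(records)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy

# Made-up QA outcomes: (uncertainty score, ambiguous question?, answer correct?)
records = [
    (0.1, False, True),
    (0.2, False, True),
    (0.7, False, False),
    (0.6, True, True),
    (0.8, True, False),
    (0.3, True, True),
]

# One threshold for everything versus a looser threshold for ambiguous inputs,
# whose uncertainty is inflated by the question rather than by model error.
single = selective_accuracy(records, {False: 0.5, True: 0.5})
gated = selective_accuracy(records, {False: 0.5, True: 0.65})
print("single threshold:", single)
print("ambiguity-gated: ", gated)
```

In this toy data the gated rule answers one more question without losing accuracy, which mirrors the abstract's claim that uncertainty scores are less predictive of errors on ambiguous questions.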

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

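Scoring rationale: The paper focuses on error prediction and uncertainty quantification for large language models (LLMs) and does not involve multimodal architectures, visual encoders, world models, or reinforcement learning. It is weakly related to MLLM only through its mention of LLMs; the remaining keywords are entirely unrelated, and Unify Models touches only on model evaluation rather than architectural unification. The weighted total of about 4.5 is far below the dynamic passing score of 27.8, indicating a poor match between the paper's topic and the given keywords.

Keywords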

Error Prediction, Uncertainty Quantification, Large Language Models, Input Ambiguity, Question Answering, Aleatoric Uncertainty, Selective Prediction, Gated Experts

Score: 4.5 / 27.8
Authors: Adrián Cánovas-Rodriguez, Miguel A. González-Illán, Maria Fernanda García-Cruz, Pedro Nortes Tortosa, José Salvador Rubio-Asensio, Miguel A. Zamora Izquierdo, Juan Antonio Martínez Navarro, Antonio F. Skarmeta
Published: 2026-06-01
TL;DR: This study develops a robust peach leaf damage classification system using EfficientNet and CBAM attention modules, achieving high accuracy through transfer learning to handle domain shifts in agricultural imagery.
Abstract translation (Chinese)

人工智能为基于影像数据的作物损害评估提供了一个实用框架,支持农业管理中的早期决策。在桃树果园中,气候变化加剧了非生物胁迫和生物胁迫(包括病虫害),这些往往会产生视觉上相似的叶片症状。这种重叠使得人工诊断变得困难,尤其是在环境条件多变的多个田地中,突显了对具有强大泛化能力的自动化模型的需求。我们提出了一种基于图像的桃叶损伤检测方法。通过手动标注公开可用的图像创建了一个基准数据集,包含 1366 个桃叶样本,涵盖六种损伤类别。评估了多种深度学习架构。EfficientNet 模型取得了最佳结果,其中 EfficientNetB0 准确率达到 92.9%,EfficientNetB3 达到 91.5%,而 EfficientNetB5 在少数类上表现最强。DenseNet121 准确率达到 92.6%。卷积块注意力模块(CBAM)的集成在多个骨干网络中提升了性能,特别是在 EfficientNetB5 和 InceptionV3 中,而在其他网络中则表现出有限或负面影响。增强 CBAM 的 EfficientNetB5 实现了最佳整体准确率 93.3%。为了评估在真实条件下的鲁棒性,收集了一个包含 180 张图像、涵盖四个类别的本地数据集,并应用了迁移学习策略以应对领域偏移。测试了三种微调策略。结合 CBAM 的 EfficientNetB3 在本地领域表现最佳,迁移后达到 93% 的宏平均 F1 分数。总体而言,基于注意力的模型在少数类上表现出更好的鲁棒性,并在不同田地条件下具有更好的泛化能力。

Abstract

Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in agricultural management. In peach orchards, climate change increases abiotic stress and biotic pressures, including pests and diseases, which often produce visually similar foliar symptoms. This overlap makes manual diagnosis difficult, especially across multiple fields with varying environmental conditions, highlighting the need for automated models with strong generalization ability. We propose an image-based classification approach for peach leaf damage detection. A benchmark dataset was created through manual annotation of publicly available images, consisting of 1,366 peach leaves across six damage categories. Several deep learning architectures were evaluated. EfficientNet models achieved the best results, with EfficientNetB0 reaching 92.9 percent accuracy, EfficientNetB3 achieving 91.5 percent, and EfficientNetB5 showing the strongest performance on minority classes. DenseNet121 reached 92.6 percent accuracy. The integration of the Convolutional Block Attention Module (CBAM) improved performance in several backbones, particularly EfficientNetB5 and InceptionV3, while showing limited or negative impact in others. The CBAM-enhanced EfficientNetB5 achieved the best overall accuracy of 93.3 percent. To evaluate robustness under realistic conditions, a local dataset of 180 images across four classes was collected, and transfer learning strategies were applied to address domain shift. Three fine-tuning strategies were tested. EfficientNetB3 combined with CBAM achieved the best performance in the local domain, reaching a 93 percent macro F1-score after transfer. Overall, attention-based models showed improved robustness for minority classes and better generalization across different field conditions.
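The macro F1-score reported for the local dataset averages per-class F1 with equal weight, which is why it is sensitive to minority classes. The sketch below computes it from label lists. The labels are made up, and the function is a plain-Python illustration rather than the evaluation code used in the study.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical damage-class labels for six leaves (three classes).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Accuracy on the same predictions would be 4/6, while macro F1 weights each class equally regardless of how many samples it has.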

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on agricultural image classification using CNNs (EfficientNet, DenseNet) and attention mechanisms (CBAM) to address domain shift. It does not involve Large Language Models, World Models, Reinforcement Learning, or Tokenization. While it uses visual encoders (CNN backbones), it lacks the multimodal, unified, or model-based RL characteristics specified in the background keywords, resulting in low relevance to the provided scoring criteria.

Keywords

peach leaf damage classification, transfer learning, domain shift, EfficientNet, CBAM, attention mechanisms, image classification

Score: 4.5 / 27.8
Authors: Sugyeong Eo, Heuiseok Lim
Published: 2026-06-01
TL;DR: This paper evaluates the limited ability of Large Language Models to infer pragmatic meaning from non-verbal responses in dialogue, revealing significant performance drops compared to verbal cues but improvement via in-context learning.
Abstract translation (Chinese)

尽管大型语言模型(LLMs)在语用语言理解方面取得了显著进展,但以往研究主要集中在其对言语行为的理解上。然而,非言语行为仍然是人类沟通的基本组成部分,特别是当被刻意单独使用时,以传达间接含义。在本研究中,我们首次系统评估了 LLMs 在仅由非言语回应构成的对话中推断语用意义的能力。我们探讨了三个研究问题:(1) LLMs 能否识别通过非言语回应传达的间接意图?(2) LLMs 何时以及如何未能捕捉非言语意图?(3) 如何提高 LLMs 解读非言语意图的能力?通过评估,我们发现 LLMs 难以从非言语回应中推断潜在含义,与言语回应相比,准确率下降高达 60 个百分点。进一步的广泛分析揭示了 LLMs 在非言语行为解释中的行为模式,并表明上下文学习(in-context learning)能够促进语用推理。

Abstract

Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused mainly on their comprehension of verbal behavior. Nonetheless, non-verbal behavior remains a fundamental component of human communication, especially when deliberately utilized in isolation to convey indirect meanings. In this work, we present the first systematic evaluation of LLMs' ability to infer pragmatic meaning in dialogue consisting solely of non-verbal responses. We explore three research questions: (1) Can LLMs recognize indirect intent conveyed through non-verbal responses? (2) When and how do LLMs fail to capture non-verbal intent? (3) How can we improve LLMs' ability to interpret non-verbal intent? Through the evaluation, we observe that LLMs struggle to infer underlying meaning from non-verbal responses, with accuracy dropping by up to 60 percentage points compared to verbal ones. Further extensive analysis reveals a behavioral pattern in LLMs' interpretations of non-verbal behavior and demonstrates that in-context learning facilitates pragmatic inference.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper evaluates LLMs' pragmatic understanding of non-verbal dialogue cues, which does not align with the provided keywords concerning model unification, tokenizer design, visual encoder architecture, world modeling, or model-based reinforcement learning. While it involves non-verbal (multimodal) input, it does not propose MLLM architectures or RL frameworks. No expert authors from the specified list are present, and the total weighted score (4.5) is well below the dynamic pass threshold (27.8).

Keywords

Large Language Models, Pragmatic Meaning, Non-Verbal Responses, Dialogue, In-Context Learning, Indirect Intent, Verbal Behavior, Pragmatic Inference

Score: 4.5 / 27.8
Authors: Kaihui Cheng, Zhiqiang Cai, Wenkai Xiang, Zhihang Hu, Siyu Zhu, Tzuhsiung Yang, Yuan Qi
Published: 2026-06-01
TL;DR: This paper proposes a history-dependent bias method for generative protein dynamics emulators to significantly accelerate the sampling of rare structural states while maintaining structural validity.
Abstract translation (Chinese)

蛋白质动力学的生成式模拟器 (Generative emulators) 能以远低于分子动力学的成本生成合理的轨迹,但它们继承了训练分布,在长程外推时倾向于重新访问已知状态而非探索稀有状态。受经典增强采样启发,我们在预训练生成式模拟器的生成空间中引入了一种隐式的、历史依赖的偏置。具体而言,一个历史感知评分估计器 (Score estimator) 通过距离加权偏置增强了冻结的模拟器,该偏置将反向时间采样引导远离先前生成的结构,并由一个环境支撑项进行正则化。为了在长程下保持结构有效性,一个基于分数的精炼步骤利用冻结的模拟器将漂移的样本重新投影到数据流形 (Data manifold) 上。我们的实验表明,该方法(i)在 DynamicPDB-80 上将多样性提高了 35%;(ii)在 12 个零样本快速折叠蛋白质 (Fast-Folding proteins) 上,仅学习到的偏置就能以约 15 倍的速度达到无偏模拟器的覆盖度,而将其与精炼相结合则以约 37 倍的速度达到覆盖度,同时覆盖约 3 倍数量的低能态。代码将很快发布。

Abstract

Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and tend to revisit known states rather than reach rare ones under long-horizon extrapolation. Inspired by classical enhanced sampling, we introduce an implicit, history-dependent bias in the generative space of a pretrained emulator. Specifically, a history-aware score estimator augments the frozen emulator with a distance-weighted bias that steers reverse-time sampling away from previously generated structures, regularized by an environment-support term. To preserve structural validity at long horizons, a score-based refinement step re-projects drifted samples onto the data manifold using the frozen emulator. Our experiments demonstrate that the method (i) raises diversity by 35% on DynamicPDB-80; (ii) on 12 zero-shot Fast-Folding proteins, the learned bias alone reaches the unbiased emulator's coverage up to ~15x faster, and pairing it with refinement reaches the coverage up to ~37x faster while covering ~3x as many low-energy states. Code will be released soon.
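The classical enhanced-sampling idea the abstract draws on can be shown in one dimension. The sketch below is a metadynamics-style Gaussian bias, not the paper's learned score-space bias: every previously visited position adds a repulsive bump, and the resulting force pushes the sampler away from where it has already been. The `height` and `width` parameters and the positions are hypothetical.

```python
import math

def bias_force(x, history, height=1.0, width=0.5):
    """Force -dV/dx for V(x) = sum_h height * exp(-(x - h)^2 / (2 * width^2))."""
    force = 0.0
    for h in history:
        d = x - h
        force += height * d / width ** 2 * math.exp(-d * d / (2 * width ** 2))
    return force

# Hypothetical previously generated states along one coordinate.
visited = [0.0, 0.1, -0.1]
for x in (-0.5, 0.0, 0.5, 5.0):
    print(x, round(bias_force(x, visited), 4))
```

The force is weighted by distance: it is strongest near visited states and vanishes far from them, so unexplored regions are left undisturbed. The abstract's environment-support term and refinement step, which keep biased samples structurally valid, have no counterpart in this toy.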

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 2.0/10 3.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 1.0/10 1.5

Scoring rationale: The paper focuses on generative modeling for protein dynamics using score-based methods and implicit bias. It lacks direct connections to Multimodal LLMs (MLLM), Tokenizers, Visual Encoders, or MultiModal architectures, scoring 0 for these. While it involves generative dynamics (partial relevance to World Models) and model-based emulation (partial relevance to model-based RL), it does not involve Reinforcement Learning or model unification strategies, resulting in low overall scores for the provided keyword set.

Keywords

Protein Dynamics, Generative Emulator, Implicit Bias, Score-Based Modeling, Rare State Sampling, Molecular Dynamics, Structure Refinement

Score: 4.5 / 27.8
Authors: Jun He, Deying Yu
Published: 2026-06-01
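TL;DR: This paper proposes the Post-Deterministic Distributed Systems (PDDS) framework, which aims to build trustworthy autonomous infrastructure by coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist.
Abstract translation (Chinese)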

几十年来,分布式系统通常假设正确参与者执行协议规定的行为,其语义具有稳定性、外部定义性及确定性。经典理论已广泛参数化了网络时序、通信拓扑和故障域,但这种参与者模型相比之下却保持相对固定。自主推理引擎、随机模型驱动的智能体以及策略驱动的行为体被集成到云控制平面(Cloud Control Planes)、事件响应系统和金融基础设施中,挑战了该假设的普遍性。这些智能体往往会产生分歧的推理路径、不同的操作轨迹以及异构的内部表示,同时实现语义等价且正确的结果。本文引入后确定性分布式系统(Post-Deterministic Distributed Systems, PDDS)作为研究与工程模型,用于协调确定性代码、随机模型和自主智能体共存的异构环境。我们表明,经典分布式计算模型构成了这一参与者通用模型(participant-general model)的一个零歧义特例。我们并不主张确定性系统将消失;相反,确定性执行不再能作为自主基础设施的通用参与者假设。最后,我们概述了后确定性基础设施的五大架构支柱:协议驱动开发(Protocol-Driven Development)、可验证智能体基础设施(Verifiable Agentic Infrastructure)、自主状态控制平面(Autonomous State Control Planes)、语义多数派保证(Semantic Quorum Assurance)以及认知状态复制(Epistemic State Replication)。认知状态复制将持久化和一致性模型从数据可见性扩展到知识可见性,从而支持智能体记忆(agentic memory)、可验证语义回滚(Verifiable Semantic Rollback)以及推理参与者之间的一致性(coherence)。此外,我们还定义了在此情境下出现的故障分类法(taxonomy of failure classes)。

Abstract

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 1.0/10 1.5
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

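Scoring rationale: The paper focuses on distributed systems architecture (PDDS), aiming to coordinate heterogeneous environments of deterministic code, stochastic models, and autonomous agents. The provided keyword set mainly concerns multimodal large models and reinforcement learning (e.g., Tokenizer, Visual Encoder, MLLM, MultiModal, model-based RL), a very different domain from the paper. Only 'Unify Models' and 'World Models' overlap slightly at the conceptual level (coordination of models and agents); the other keywords are not mentioned in the paper, so relevance is very low.

Keywords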

Post-Deterministic Distributed Systems, Autonomous Infrastructure, Stochastic Model-Driven Agents, Deterministic Code, Verifiable Agentic Infrastructure, Epistemic State Replication, Heterogeneous Environments

Score: 4.5 / 27.8
Authors: Haoben Huang, Shuxin Liu, Ou Wu, Di Gao
Published: 2026-06-01
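TL;DR: To address ripple effects in knowledge editing for large language models, this paper proposes a joint neighborhood optimization framework that balances propagation and preservation and improves editing stability.
Abstract translation (Chinese)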

大型语言模型中的单次编辑更新可能会在局部知识邻域中引发涟漪效应:表现为对相关事实的期望传播以及对被保留事实的意外扰动。现有方法分别处理这两种效应,但未显式建模它们之间的耦合关系。我们通过分析典型基线上的涟漪响应来挑战这种分离,识别出两个耦合的设计压力:编辑侧协调与保留侧泄露。我们提出联合邻域优化(JNO),这是一种新的知识编辑框架,旨在在目标规划阶段形式化并共同处理这两种压力。JNO 通过感知压力协调(PAC)实现这一原则,该机制在耦合约束下共同优化邻域目标表示,并在参数执行前通过语义预执行门拒绝高风险目标计划。在 RippleEdits 上的实验表明,JNO 将传播与保留指标至少提升了 7.0%,同时保持了跨骨干编辑的稳定性。

Abstract

Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to related facts and unintended perturbation of preserved ones. Existing methods address these two effects separately, without explicitly modeling their coupling. We challenge this separation through an analysis of ripple responses across typical baselines, identifying two coupled design pressures: editable-side coordination and preserved-side leakage. We propose Joint Neighborhood Optimization (JNO), a new knowledge-editing framework to formalize and jointly address both pressures at the target-planning stage. JNO instantiates this principle through Pressure-Aware Coordination (PAC), which jointly optimizes neighborhood target representations under coupled constraints, and a semantic pre-execution gate that rejects high-risk target plans before parameter execution. Experiments on RippleEdits show JNO improves propagation and preservation metrics by at least 7.0% while preserving cross-backbone editing stability.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

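评分理由: The paper focuses on knowledge editing for large language models, covering ripple effects and neighborhood optimization. The provided keywords concern multimodality, world models, and reinforcement learning architectures, which have no direct connection to the paper's core content, so the relevance scores are very low. The author list does not include any of the specified experts, so no bonus points were added.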

关键词

Knowledge Editing, Ripple Effects, Joint Neighborhood Optimization, Pressure-Aware Coordination, Large Language Models, Knowledge Propagation, Knowledge Preservation, Parameter Execution

Score: 4.5 / 27.8
Authors: Junhyoung Chung, Euijong Song, Won Hwa Kim, Gunwoong Park
Published: 2026-06-01
TL;DR: The paper proposes Convex Distance Operator Transport (CDOT), a convex optimal transport framework that aligns distributions across heterogeneous domains while preserving geometric structure, achieving better performance on point clouds and graph benchmarks.
摘要翻译

我们引入凸距离算子传输(Convex Distance Operator Transport, CDOT),这是首个凸最优传输框架,旨在通过联合保持特征对应和内在几何结构,对齐跨异构域的分布。具体而言,CDOT 采用基于算子的正则化,通过引入距离算子和条件期望算子来对齐聚合距离结构。由此,所提出的正则化提高了对局部几何变化的鲁棒性。我们进一步证明所得到的 CDOT 差异是带属性的紧致度量测度空间上的有效伪度量。此外,我们通过一种新的分散间隙概念刻画 CDOT 与 Gromov-Wasserstein (GW) 之间的关系,形式化阐明了 GW 中非凸性的几何来源,并与 CDOT 的凸性形成对比。在有限样本情形下,我们推导出一个分解为优化误差和统计误差的非渐近风险界,并在全局收敛的 Frank-Wolfe 算法下确立了风险一致性。在合成点云、脑连接组以及图分类基准上的实验表明,其性能优于现有方法,且在实践中表现出稳定可靠的行为。

Abstract

We introduce Convex Distance Operator Transport (CDOT), the first convex optimal transport framework that aligns distributions across heterogeneous domains by jointly preserving feature correspondence and intrinsic geometric structure. Specifically, CDOT employs an operator-based regularization that aligns aggregated distance structures by introducing distance and conditional expectation operators. Consequently, the proposed regularization improves the robustness to local geometric variations. We further prove that the resulting CDOT discrepancy is a valid pseudometric on the space of attributed compact metric-measure spaces. In addition, we characterize the relationship between CDOT and Gromov-Wasserstein (GW) through a new notion of dispersion gap, formally elucidating the geometric source of non-convexity in GW compared to the convexity of CDOT. In the finite-sample regime, we derive a non-asymptotic risk bound decomposed into optimization and statistical errors, establishing risk consistency under a globally convergent Frank-Wolfe algorithm. Experiments on synthetic point clouds, brain connectomes, and graph classification benchmarks demonstrate better performance over existing methods, with stable and reliable behavior in practice.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 2.0/10 3.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper introduces Convex Distance Operator Transport (CDOT), a theoretical optimal transport framework for aligning heterogeneous distributions with geometric preservation. It has no direct connection to Tokenizers, Visual Encoders, World Models, MLLM, or Model-Based RL. 'MultiModal' scores slightly higher (2.0) due to handling heterogeneous data types (point clouds, graphs), and 'Unify Models' (1.0) reflects mathematical formulation alignment rather than model unification. No expert authors from the specified list were found, so no bonus points were added. The weighted total score is 4.5, which is below the dynamic passing score of 27.8.
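The arithmetic behind the score table above can be sketched in a few lines. This is a minimal reconstruction from the numbers shown in the report, not the report generator's actual code: the function name `weighted_total`, the uniform weight, and the `>=` pass rule are assumptions made for illustration.

```python
# Minimal sketch of the report's scoring arithmetic (assumed, not the tool's real code).
WEIGHT = 1.5          # every keyword in this report carries weight 1.5
PASSING_SCORE = 27.8  # "dynamic passing score" quoted in the rationale

def weighted_total(relevance, weight=WEIGHT):
    """Sum weight * relevance over all keywords (relevance is on a 0-10 scale)."""
    return sum(weight * r for r in relevance.values())

# Relevance values copied from the CDOT score table.
cdot_relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 2.0,
    "model-based RL": 0.0,
}

total = weighted_total(cdot_relevance)
passed = total >= PASSING_SCORE  # assumption: a paper passes when it meets or exceeds the threshold
print(f"Score: {total} / {PASSING_SCORE}, passed={passed}")
```

Any expert-author bonus mentioned in the rationale would be added on top of this sum; it is zero here, so it is omitted from the sketch.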

关键词

Convex Distance Operator Transport, Optimal Transport, Geometry-Preserving, Heterogeneous Domains, Operator-based Regularization, Gromov-Wasserstein, Risk Consistency

Score: 4.5 / 27.8
Authors: Li Ye, Xinhang Zhou, Xingyu Yang, Ruofeng Tong, Hailong Li, Peng Du, Min Tang
Published: 2026-06-01
TL;DR: MidSurfNet introduces a learnable framework for generalized mid-surface abstraction in CAD models using neural face pairing and interference implicit fields, overcoming limitations of rule-based methods in complex geometric scenarios.
摘要翻译

中面抽象(Mid-surface abstraction)对于薄壁 CAD 模型的有限元分析(finite element analysis)至关重要。现有的基于面配对(face pairing)的方法依赖于手工设计的几何启发式(geometric heuristics),然而现实世界中的工业模型常表现出多壁厚(multi-wall-thickness)区域、自匹配面(self-matching face)配置以及对非中心偏移曲面(non-center offset surfaces)的需求——在这些场景下,基于规则(rule-based)的方法始终失败。本文提出 MidSurfNet,这是一个学习增强(learning-augmented)框架,通过两个新颖组件解决了上述局限性:(1) 一个神经网络面配对(neural face pairing)模块,该模块学习从几何和拓扑特征预测面配对置信度,能够处理超越基于规则(rule-based)方法的复杂配对场景;(2) 一个干涉隐式场(interference implicit field),该场将中面表示为两个有符号距离函数(signed distance functions)的干涉,从而实现广义偏移控制(generalized offset control),以便在下游 CAE/FEA 导向的工作流中进行灵活定位。我们构建了一个大规模的中面(mid-surface)数据集,包含超过 1,500 个手动标注的 CAD 模型。实验表明,MidSurfNet 实现了 87.32% 的面配对准确率,并成功处理了多壁厚(61.90% 完成率)和自匹配(52.94% 完成率)场景,这些场景使得所有现有方法都束手无策。此外,MidSurfNet 为 CAE 导向的应用提供了一种基于学习的广义中面抽象方法,支持任意偏移控制。

Abstract

Mid-surface abstraction is essential for finite element analysis of thin-walled CAD models. Existing face pairing-based methods rely on handcrafted geometric heuristics, yet real-world industrial models frequently exhibit multi-wall-thickness regions, self-matching face configurations, and demand for non-center offset surfaces: scenarios where rule-based approaches consistently fail. We present MidSurfNet, a learning-augmented framework that addresses these limitations through two novel components: (1) a neural face pairing module that learns to predict face pair confidence from geometric and topological features, handling complex pairing scenarios beyond rule-based methods; and (2) an interference implicit field that represents mid-surfaces as the interference of two signed distance functions, enabling generalized offset control for flexible positioning in downstream CAE/FEA-oriented workflows. We construct a large-scale mid-surface dataset containing over 1,500 manually annotated CAD models. Experiments demonstrate that MidSurfNet achieves 87.32% face pairing accuracy and successfully handles multi-wall-thickness (61.90% completion) and self-matching (52.94% completion) scenarios that confound all existing methods. Furthermore, MidSurfNet provides a learning-based approach to generalized mid-surface abstraction with arbitrary offset control for CAE-oriented applications.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

评分理由: This paper addresses CAD mid-surface abstraction using neural implicit fields and face pairing, belonging to computer graphics and geometry processing. It shows low relevance to the provided keywords (LLM/MLLM/RL domain) as it lacks tokenizers, visual encoders for multimodal understanding, world models, or reinforcement learning components.

关键词

Mid-surface abstraction, Face pairing, Implicit fields, CAD models, Signed distance functions, Finite element analysis, Neural network, Geometric features

Score: 4.5 / 27.8
Authors: Lina Wang, Yaning Cui
Published: 2026-06-01
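TL;DR: The paper proposes Mos-Gen, a framework that combines Uni-Mol and a VAE to generate novel insecticide molecular scaffolds, with experimental validation showing a 78% hit rate.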
摘要翻译

蚊媒传染病每年在全球范围内导致超过 70 万人死亡。常规化学杀虫剂的长期使用引发了严重的抗药性问题,迫切需要开发新颖、高效且生态可持续的替代方案。尽管该领域现有的人工智能方法主要集中在活性预测与分类上,但在新型分子骨架的从头生成(de novo generation)方面仍存在关键空白。本研究提出了一种名为 Mos-Gen 的基序感知生成式协作框架,该框架将预训练分子表示模型 Uni-Mol 与变分自编码器(VAE)相结合,专门用于设计含二硫键的大蒜素衍生物作为蚊虫杀虫剂。在生成的候选化合物中,选取了 14 种化合物——包括 9 个预测阳性和 5 个预测阴性——进行化学合成与实验验证。预测阳性化合物的命中率达到 78%,而预测阴性化合物均未表现出灭蚊活性。这些实验结果充分验证了 Mos-Gen 框架的高精度筛选能力。

Abstract

Mosquito-borne infectious diseases cause more than 700,000 deaths worldwide each year. The long-term use of conventional chemical insecticides has induced serious resistance problems, creating an urgent need to develop novel, highly effective, and ecologically sustainable alternatives. While existing artificial intelligence approaches in this domain have focused primarily on activity prediction and classification, they leave a critical gap in the de novo generation of novel molecular scaffolds. In this study, we propose Mos-Gen, a motif-aware generative collaborative framework that couples the pretrained molecular representation model Uni-Mol with a variational autoencoder (VAE), specifically tailored for the design of disulfide-containing allicin derivatives as mosquito insecticides. Among the generated candidates, fourteen compounds (nine predicted positives and five predicted negatives) were selected for chemical synthesis and experimental validation. The hit rate among the predicted positives reached 78%, whereas none of the predicted negatives exhibited mosquitocidal activity. These experimental results fully validated the high-precision screening capability of the Mos-Gen framework.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

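评分理由: The paper focuses on a molecular generation framework (Mos-Gen) that combines Uni-Mol and a VAE to design insecticides. The provided keywords mainly concern multimodal large language models (MLLM), visual encoders, world models, and reinforcement learning. The paper's content is largely unrelated to these keywords, with only weak connections to model coupling (Unify Models) and molecular encoding (Tokenizer), so the relevance scores are very low. The author list does not include any of the specified experts.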

关键词

Mos-Gen, Molecular Design, VAE, Uni-Mol, Insecticide, Generative Framework, Chemical Validation

Score: 4.5 / 27.8
Authors: Jiebin Zhang, Zhenghan Yu, Song Liu, Eugene J. Yu, Zheng Li, Dawei Zhu, Jiangshan Duo, Weimin Xiong, Yifan Song, Guanghua Yu, Jianchen Zhu, Sujian Li
Published: 2026-06-01
TL;DR: DFlare enhances draft model expressiveness in block diffusion speculative decoding to accelerate LLM inference, achieving significant wall-clock speedups on large language models.
摘要翻译

块扩散推测解码(Block Diffusion Speculative Decoding)通过同时预测块内所有令牌,供目标模型并行验证,从而加速大语言模型(LLM)推理。一次性预测整个块需要能力充足的草稿模型以及对目标模型内部知识的有效利用。然而,最先进的 DFlash 方法限制所有草稿层共享仅从少数目标层导出的单一融合表示,限制了每层的表达能力,并阻碍了草稿模型容量的进一步扩展。在本文中,我们提出了 DFlare,它通过轻量化的层间融合机制(Layer-wise Fusion Mechanism)拓宽了 DFlash 狭窄的条件瓶颈(Conditioning Bottleneck):每个草稿层均以可忽略的开销关注其自身可学习的、来自广泛目标层集合的组合,同时注入更丰富的目标知识,并为每个草稿层提供独特的输入。这种增强的每层表达能力使得草稿模型能够扩展到更深架构,并获得一致的性能提升。我们进一步将训练数据规模从 80 万样本扩展至 240 万样本,以充分利用扩大的模型容量。在涵盖数学推理、代码生成和对话的六个基准测试上,DFlare 在 Qwen3-4B、Qwen3-8B 和 GPT-OSS-20B 上分别实现了平均实际加速比(Wall-clock Speedup)5.52 倍、5.46 倍和 3.91 倍,相比 DFlash 分别提升了约 11%、8% 和 5%。我们的代码可在 https://github.com/Tencent/AngelSlim 获取。

Abstract

Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
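The layer-wise fusion idea in the abstract (each draft layer gets its own learnable combination of target-layer states) can be illustrated with a toy sketch. The abstract does not specify the mechanism, so the softmax-weighted scalar mix below, along with the names `fuse_for_draft_layer` and `per_layer_logits`, is an assumption chosen for illustration and may differ from what DFlare actually implements.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_for_draft_layer(target_hiddens, mix_logits):
    """Return a weighted sum of target-layer hidden vectors.

    target_hiddens: one vector per target layer (all the same length).
    mix_logits: one learnable scalar per target layer, owned by a single draft layer
                (assumed parameterisation, not confirmed by the abstract).
    """
    weights = softmax(mix_logits)
    dim = len(target_hiddens[0])
    return [sum(w * h[i] for w, h in zip(weights, target_hiddens)) for i in range(dim)]

# Toy hidden states from three target layers (2-dimensional for readability).
target_hiddens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Each draft layer owns its own logits, so each receives a distinct fused input.
per_layer_logits = [
    [0.0, 0.0, 0.0],  # draft layer 0: uniform mix of all target layers
    [5.0, 0.0, 0.0],  # draft layer 1: mostly the first target layer
]
draft_inputs = [fuse_for_draft_layer(target_hiddens, logits) for logits in per_layer_logits]
print(draft_inputs)
```

In contrast, a shared-representation design would call the fusion once and feed the same vector to every draft layer, which is the bottleneck the abstract attributes to DFlash.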

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 2.0/10 3.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on accelerating LLM inference via improved draft model architecture in block diffusion speculative decoding. The provided keywords primarily relate to multimodal learning, world models, and reinforcement learning, which are not the core focus of this work. Only token-related concepts have minor relevance. No listed expert authors are found in the author list. Total weighted score is 4.5, well below the dynamic passing threshold of 27.8.

关键词

Block Diffusion, Speculative Decoding, Draft Model, LLM Inference, Layer-wise Fusion, Scaling Capacity, Speedup

Score: 4.5 / 27.8
Authors: Sarah Almeida Carneiro, Christos Xypolopoulos, Xiao Fei, Yang Zhang, Michalis Vazirgiannis
Published: 2026-06-01
TL;DR: The paper proposes CARTE, a benchmark for evaluating large language models' fine-grained reasoning on regional knowledge within France, revealing performance disparities across regions and model scales.
摘要翻译

我们提出 CARTE(Culturally Anchored Regional-Territorial Evaluation,文化锚定区域领土评估),这是一个多项选择题基准,旨在评估大型语言模型(LLMs)在法国境内对基于地理的、区域差异化知识进行细粒度推理的能力。尽管先前基准侧重于国家层面的文化理解,但它们在很大程度上忽略了国家内部差异,以及区分密切相关区域背景的需求。CARTE 通过引入 2,431 个问题来解决这一差距,这些问题覆盖法国本土的 13 个大区,并涵盖 14 个主题领域,包括文化、语言、人口统计、经济、环境和交通。此外,我们还引入了 CARTE-LV,这是一个针对法国各地区语言变异的子集,旨在实现对语言相关差异的聚焦评估。我们在少样本设置下评估了 27 个参数规模为 10 亿至 120 亿的 LLMs。我们的实验揭示了不同区域和模型规模之间的性能差异,这表明预训练覆盖范围存在系统性差距,且对国家内部差异的鲁棒性有限。

Abstract

We introduce CARTE (Culturally Anchored Regional-Territorial Evaluation), a multiple-choice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmarks focus on national-level cultural understanding, they largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap by introducing 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper introduces CARTE, a text-based benchmark for evaluating LLMs on regional knowledge in France. It does not propose unified model architectures, tokenizer designs, visual encoders, world models, multi-modal architectures, or model-based reinforcement learning methods, resulting in low relevance to the provided technical keywords.

关键词

CARTE, Benchmark, Large Language Models, Regional Knowledge, France, Evaluation, Intra-country variation, Few-shot

Score: 4.5 / 27.8
Authors: Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin
Published: 2026-06-01
TL;DR: CultureForest reveals that while LLMs possess substantial cultural knowledge, their performance is bottlenecked by effective reasoning and cross-region disparities in cultural norm grounded tasks.
摘要翻译

现有研究大多将大语言模型(LLMs)中的文化智力简化为一个知识层面的问题,忽略了模型是否能够在现实场景中有效利用其习得的知识。为了弥合这一差距,我们引入了 CultureForest,这是一个用于基于文化规范的推理(Cultural Norm Grounded Reasoning)的基准。每个问题都基于一组少量的原子规范,从而实现可验证且可归因的评估。CultureForest 包含来自 8 个领域和 53 个国家/地区的 5,378 个示例,并支持从多项选择题到开放式生成的渐进式评估。大量实验表明,即使在顶尖模型中,在开放式设置下性能也会显著下降,并伴随着显著的跨区域差异。通过针对性分析,我们发现了几种一致的模式:(1)测试时推理带来的收益有限,且可能加剧不平等;(2)模型表现出高度共享的区域偏好结构;(3)模型响应尤为保守,尤其是在更严格的文化约束下;(4)通过解耦文化知识获取与文化推理,我们发现尽管大语言模型(LLMs)拥有大量的文化知识,但其性能进一步受到有效利用这一瓶颈的制约。这些发现表明,有必要从以知识为中心的评估转向测量基于知识的推理。

Abstract

Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 2.0/10 3.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: This paper focuses on cultural norm grounded reasoning evaluation in LLMs (CultureForest benchmark). It does not involve Tokenizer design, Visual Encoders, World Models, MultiModal architectures, or Model-Based RL. While it discusses LLMs (closest to MLLM), it is text-only and does not unify model architectures. Thus, relevance to the provided keyword set is low.

关键词

Cultural Norm Grounded Reasoning, LLM Evaluation, Cultural Intelligence, Cross-region Disparities, Knowledge-Reasoning Gap, Benchmark, Text-based LLM

Score: 4.5 / 27.8
Authors: Shuai Zhang, Huachuan Qiu, Hongliang He, Yong Dai
Published: 2026-06-01
TL;DR: The paper proposes CaDDTree, a cost-aware diffusion draft tree method that optimizes token throughput in speculative decoding by dynamically adjusting tree structures and budgets according to verification latency, demonstrating improved inference efficiency on Qwen language models.
摘要翻译

推测解码通过让轻量级草稿生成器提出令牌,并由目标语言模型并行验证这些令牌,从而加速推理过程。块扩散草稿生成器(如 DFlash)在一次传递中生成整个草稿块,从而得到每个位置的边缘概率分布;DDTree 利用这些分布,在固定节点预算下构建候选树,以最大化期望接受长度。然而,我们观察到接受长度随预算非递减:它总是倾向于更大的树,而不管验证成本如何,这无法为预算选择提供原则性依据。我们引入 CaDDTree(成本感知扩散草稿树),该方法通过联合选择树结构和节点预算,直接优化令牌吞吐量(单位时间内生成的预期令牌数)。我们显式建模草稿生成和验证延迟,证明吞吐量目标可分解为每轮关于预算的一维搜索,并在凸验证成本假设下证明吞吐量函数是单峰的,从而使得高效的贪心停止规则成为可能。CaDDTree 无需进行离线预算搜索,而是每轮根据当前的每个位置分布和验证成本动态调整预算。在 Qwen3-4B 和 Qwen3-8B 模型上,针对涵盖推理、代码生成及指令跟随任务的八个基准测试进行的实验表明,CaDDTree 在几乎所有任务上均达到或超过了使用理想预算选择的 DDTree 的性能。

Abstract

Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce CaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is unimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that CaDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
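The greedy stopping rule described in the abstract can be illustrated with a toy model: throughput is expected accepted tokens divided by draft plus verification latency, and because the function is unimodal under a convex verification cost, growing the budget can stop at the first non-improving step. The sketch below is a simplified illustration under stated assumptions, not the paper's algorithm: `pick_budget`, the marginal-gain list, the baseline of one token per round, and the quadratic cost are all hypothetical.

```python
def pick_budget(marginal_gains, draft_latency, verify_cost):
    """Grow the node budget while throughput improves, then stop.

    marginal_gains[k]: expected extra accepted tokens from adding the (k+1)-th tree node,
                       assumed sorted in non-increasing order (best nodes first).
    draft_latency:     time to produce the draft block (constant per round).
    verify_cost(b):    verification latency for a tree with b nodes, assumed convex in b.
    """
    accepted = 1.0  # assumption: each round yields at least one token from the target model
    best_budget = 0
    best_throughput = accepted / (draft_latency + verify_cost(0))
    for budget, gain in enumerate(marginal_gains, start=1):
        accepted += gain
        throughput = accepted / (draft_latency + verify_cost(budget))
        if throughput <= best_throughput:
            break  # unimodality means no later budget can do better
        best_budget, best_throughput = budget, throughput
    return best_budget, best_throughput

# Toy inputs: diminishing acceptance gains and a convex (quadratic) verification cost.
gains = [0.8, 0.6, 0.4, 0.2, 0.1, 0.05]
cost = lambda b: 2.0 + 0.05 * b * b
budget, throughput = pick_budget(gains, draft_latency=1.0, verify_cost=cost)
print(budget, round(throughput, 4))
```

Maximizing acceptance length alone would always pick the full budget of six nodes here; dividing by latency is what makes a smaller tree preferable.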

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on speculative decoding acceleration for language models using diffusion draft trees, which does not align with the provided keywords regarding multimodal unification, visual encoders, world models, or reinforcement learning. Tokenizer and MLLM receive minimal relevance due to token-level processing and the use of Qwen3 models, but core themes mismatch significantly. No listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

关键词

Speculative Decoding, Diffusion Draft Trees, Token Throughput, Latency Optimization, Tree Structure, Verification Cost, Inference Acceleration

Score: 4.5 / 27.8
Authors: Kilho Son, Paul Hsu, Cha Zhang, Dinei Florencio
Published: 2026-06-01
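TL;DR: RCEM distills the query-rewriting capability of LLMs into an embedding model, enabling robust conversational retrieval under distributional shift without explicit query rewriting.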
摘要翻译

对话搜索在检索增强生成(RAG)系统中日益重要,用户通过包含上下文相关查询的多轮对话与 AI 助手进行交互。我们提出 RCEM,一种对话式稠密检索模型,它将大语言模型(LLM)的查询重构能力蒸馏至嵌入模型中,从而在推理过程中无需显式查询重写即可实现上下文感知检索。与先前学习直接对话到文档匹配的对话式稠密检索方法不同,RCEM 通过对齐对话查询嵌入与重写查询嵌入,提高了在分布偏移下的鲁棒性。RCEM 在训练时无需对话查询到文档的相关性映射,此类映射通常成本高昂且难以获得高质量数据。在 QReCC、TopiOCQA 和 TREC CAsT 上的广泛实验表明,RCEM 始终优于强对话检索基线,尤其在分布偏移下取得显著收益,Recall@10 提升幅度高达 20%。RCEM 进一步扩展了基础嵌入模型,增加了对话查询重写能力,同时保留其原始检索功能,使得单个模型即可编码独立查询与对话查询,并可在不重建检索数据库的前提下对现有文档索引进行搜索。

Abstract

Conversational search has become increasingly important in retrieval-augmented generation (RAG) systems, where users interact with AI assistants through multi-turn conversations containing context-dependent queries. We propose RCEM, a conversational dense retrieval model that distills the query reformulation capability of LLMs into the embedding model, enabling context-aware retrieval without explicit query rewriting during inference. Unlike prior conversational dense retrieval approaches that learn direct conversation-to-document matching, RCEM aligns conversational-query embeddings with rewritten-query embeddings, improving robustness under distributional shift. RCEM does not require conversational query-to-document relevance mappings for training, which are often expensive and difficult to obtain with high quality. Extensive experiments on QReCC, TopiOCQA, and TREC CAsT demonstrate that RCEM consistently outperforms strong conversational retrieval baselines, achieving particularly large gains under distributional shift, including up to 20% improvement in Recall@10. RCEM further extends the base embedding model with conversational query rewriting capability while preserving its original retrieval functionality, allowing both standalone and conversational queries to be encoded by a single model and searched against existing document indexes without rebuilding the retrieval database.
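The alignment objective mentioned in the abstract (matching conversational-query embeddings to rewritten-query embeddings) can be sketched as follows. The abstract does not state the loss, so the mean cosine distance used here, along with the function names and toy vectors, is an assumption for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(conv_embs, rewrite_embs):
    """Mean (1 - cosine) between each conversational-query embedding and the
    embedding of its LLM-rewritten standalone query (assumed loss form)."""
    pairs = list(zip(conv_embs, rewrite_embs))
    return sum(1.0 - cosine(c, r) for c, r in pairs) / len(pairs)

# Toy embeddings: the first pair is already aligned, the second is orthogonal.
conv_embs = [[1.0, 0.0], [1.0, 0.0]]
rewrite_embs = [[2.0, 0.0], [0.0, 3.0]]
loss = alignment_loss(conv_embs, rewrite_embs)
print(loss)
```

Because the targets live in the base model's existing embedding space, a model trained this way can in principle search an unchanged document index, which matches the abstract's claim that the retrieval database need not be rebuilt.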

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

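评分理由: The paper focuses on conversational search and dense retrieval; its core contribution is distilling the query-rewriting capability of LLMs into an embedding model to handle distributional shift. The provided keyword set mainly covers multimodal large models, world models, and reinforcement learning. The paper does not involve visual processing, multimodal alignment, or reinforcement-learning decision making, so it has no direct relevance to Visual Encoder, World Models, MLLM, MultiModal, or model-based RL (0 points each). Unify Models scores low (2 points) because the paper unifies only query types, not model architectures, and Tokenizer is a generic NLP component (1 point). The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points were added. The weighted total is far below the dynamic passing score of 27.8, indicating that the paper is largely unrelated to the given keyword domains.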

关键词

Conversational Search, Dense Retrieval, Query Rewriting, Distributional Shift, Embedding Model, Context-Aware Retrieval, Retrieval-Augmented Generation

Score: 4.5 / 27.8
Authors: Jiahe Fan, Shaolong Shu, Mingjian Sun, Tiehua Zhang, Bohong Xiao, Hanli Wang, Rui Fan
Published: 2026-06-01
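TL;DR: The paper proposes an unsupervised collaborative domain adaptation framework that consolidates complementary knowledge from multiple source models into a unified target model, improving the reliability of driving scene parsing without access to the original source data.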
摘要翻译

可靠的驾驶场景解析是自动驾驶汽车在开放且动态的驾驶环境中运行的一项基本能力。然而,将感知模型适应到新的部署域仍然具有挑战性,因为像素级标注获取成本高昂,而源域数据往往因隐私、安全或所有权限制而不可访问。现有的无源无监督域适应方法通常仅依赖单个预训练源模型,这使得适应后的感知系统易受源特定偏差的影响,并限制了其在多样化道路布局、光照条件、天气模式及交通条件下的鲁棒性。本文提出了一种面向无源设置下驾驶场景解析的无监督协作域适应(UCDA)框架,该框架在不访问任何原始源样本的情况下,将多个预训练源模型的互补知识转移至统一的目标模型。为了比较独立训练模型的预测结果,UCDA 构建了一个类级别原型记忆库,并通过原型相似度估计跨模型预测可靠性,从而降低源模型间不一致置信度尺度的影响。基于所得的互补监督,UCDA 采用两阶段迁移策略:首先,多个源模型在无标签的目标域驾驶数据上通过带有正负一致性约束的协同优化进行精炼;然后,将其经过验证的经验蒸馏至单个可部署的目标模型。在公共驾驶场景数据集及从自动驾驶平台收集的真实数据上的全面评估表明,UCDA 能有效整合互补的多源知识,提升目标域场景解析的可靠性及在多样化驾驶环境下的泛化能力。

Abstract

Reliable driving scene parsing is a fundamental capability for autonomous vehicles operating in open and dynamic driving environments. However, adapting perception models to new deployment domains remains challenging because pixel-level annotations are expensive to obtain, while source-domain data are often inaccessible due to privacy, security, or ownership constraints. Existing source-free unsupervised domain adaptation methods typically rely on a single pre-trained source model, which makes the adapted perception system vulnerable to source-specific biases and limits its robustness under diverse road layouts, illumination conditions, weather patterns, and traffic conditions. This article presents an unsupervised collaborative domain adaptation (UCDA) framework for driving scene parsing in a source-free setting, which transfers complementary knowledge from multiple pre-trained source models to a unified target model without accessing any original source samples. To compare predictions from independently trained models, UCDA constructs a class-level prototype memory bank and estimates cross-model prediction reliability through prototype similarity, reducing the effect of inconsistent confidence scales across source models. Based on the resulting complementary supervision, UCDA adopts a two-stage transfer strategy: multiple source models are first refined on unlabeled target-domain driving data through collaborative optimization with positive and negative consistency constraints, and their validated expertise is then distilled into a single deployable target model. Comprehensive evaluations on public driving-scene datasets and real-world data collected from an autonomous vehicle platform demonstrate that UCDA effectively consolidates complementary multi-source knowledge, improving target-domain scene parsing reliability and generalization across diverse driving environments.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

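评分理由: The paper studies unsupervised collaborative domain adaptation for driving scene parsing, which belongs to traditional computer vision. Although it mentions a 'unified target model' (Unify Models) and involves visual processing (Visual Encoder), it does not involve multimodal large language models (MLLM), tokenizers, world models, multimodal fusion (MultiModal), or reinforcement learning (model-based RL), so its overall relevance to the given keyword set is low.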

关键词

Unsupervised Collaborative Domain Adaptation, Driving Scene Parsing, Source-free Setting, Multiple Source Models, Unified Target Model, Prototype Memory Bank, Consistency Constraints

Score: 3.0 / 27.8
Authors: Sourav Das
Published: 2026-06-01
TL;DR: The paper proposes ProbScale, a probing-based framework that optimizes neural scaling laws to identify parameter-efficient subnetworks within Small Language Models, achieving significant parameter reduction while maintaining high performance.
摘要翻译

小型语言模型(SLMs)在能力与计算可行性之间取得了平衡。神经缩放定律指导其最优训练,表明它们拥有随规模扩展而增长的丰富内部表征。然而,即使在严格的资源约束下,部署这些 SLMs 也可能颇具挑战。语言模型探测提供了分析模型内部编码的语言知识的方法。我们提出 ProbScale 框架,该框架统一了缩放定律与探测技术的见解,旨在识别预训练 SLMs 中的参数高效子网络。ProbScale 利用规模扩展良好的 SLMs 的高质量表征,并使用任务特定探针,定量地量化每一层对于目标下游能力的相关性。这使得能够选择那些在性能与参数规模之间实现最优权衡的子网络。我们将子网络选择形式化为:在参数预算约束下,寻找一个层子集,使其最大化聚合的、任务加权的探针性能。在 RoBERTa-Large 和 T5-Base 等代表性 SLMs 上的实验表明,ProbScale 能够识别出实现显著参数缩减 5 至 10 倍的子网络,同时在目标任务上保持高性能(达到原始 SLMs 的 95% 至 98%),优于启发式基线。

Abstract

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model's internals. We propose ProbScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.
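The selection problem stated in the abstract (choose a layer subset that maximizes aggregated, task-weighted probe performance under a parameter budget) has the shape of a knapsack problem. The sketch below solves a tiny instance by exhaustive search; the additive aggregation, the function name `select_layers`, and all numbers are assumptions for illustration and are not taken from the paper.

```python
from itertools import combinations

def select_layers(layer_params, probe_scores, task_weights, budget):
    """Return the layer subset with the highest task-weighted probe value within budget.

    layer_params[i]:  parameter count of layer i.
    probe_scores[i]:  probe accuracy of layer i on each task.
    task_weights:     importance of each task (same order as probe_scores[i]).
    Exhaustive search is only practical for a handful of layers.
    """
    n = len(layer_params)
    # Assumed aggregation: each layer's value is the weighted sum of its probe scores.
    value = [sum(w * s for w, s in zip(task_weights, probe_scores[i])) for i in range(n)]
    best_subset, best_value = (), 0.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            cost = sum(layer_params[i] for i in subset)
            total = sum(value[i] for i in subset)
            if cost <= budget and total > best_value:
                best_subset, best_value = subset, total
    return best_subset, best_value

# Toy instance: four equally sized layers, two tasks, budget for two layers.
layer_params = [10, 10, 10, 10]
probe_scores = [[0.2, 0.1], [0.6, 0.5], [0.9, 0.4], [0.3, 0.6]]
task_weights = [0.5, 0.5]
chosen, chosen_value = select_layers(layer_params, probe_scores, task_weights, budget=20)
print(chosen, round(chosen_value, 3))
```

For models with many layers, a dynamic-programming or greedy value-per-parameter heuristic would replace the exhaustive loop.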

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on Small Language Model (SLM) inference optimization via probing and scaling laws, which has minimal overlap with the provided keywords emphasizing Multimodal, World Models, and Reinforcement Learning. 'Unify Models' and 'Tokenizer' have slight semantic or implicit connections (verb match and implicit component), while others are irrelevant. The author list does not contain the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

关键词

Small Language Models, Neural Scaling Laws, Probing Analysis, Parameter Efficiency, Subnetwork Selection, Inference Optimization, Task-specific Probes

Score: 3.0 / 27.8
Authors: Rakshit Naidu
Published: 2026-06-01
TL;DR: This paper proposes Fair Fine-tuning to mitigate distribution inference attacks by enforcing equalized odds constraints on complementary distributions, theoretically bounding and empirically reducing adversarial advantage across tabular, image, and NLP datasets.
摘要翻译

在敏感数据上训练的机器学习模型可能会无意中泄露关于其训练分布的群体层面信息——这种威胁被称为分布推断攻击 (DIA)。具有黑盒访问权限的攻击者可以推断敏感的人口统计属性(例如子群体比例),而无需直接观察任何训练数据。尽管已经提出了差分隐私和属性遗忘等防御方法,但公平性约束与分布泄露之间的联系仍未被探索。我们提出公平微调 (FFt):在均等机会 (EO) 约束下,在互补分布的样本上对训练好的模型进行微调。我们提供了完整的理论刻画,证明了紧界 $\text{Adv}(\mathcal{A},M_f) \le Δ_{\text{EO}} \cdot W$,其中 $W$ 量化了两个训练分布通过其敏感属性构成可区分的程度。我们还建立了 FFt 减少对抗优势的必要条件,并证明了该界的紧性。我们在六个涵盖表格型(ACS Income、COMPAS、German Credit)、图像(UTKFaces)和自然语言处理(NLP)(Bias in Bios)模态的数据集上进行了评估。基于重演的 FFt 在所有设置中始终将对抗准确率差距降低到检测阈值 $τ = 0.1$ 以下;在 ACS Income 上,差距从约 15% 下降到 4% 以下。我们的工作提供了第一个形式化界,将模型测量的 EO 差异直接连接到其在 DIA 博弈中的对抗优势,为统一的公平性与隐私防御开辟了新途径。

Abstract

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions, a threat known as distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound $\text{Adv}(\mathcal{A},M_f) \le \Delta_{\text{EO}} \cdot W$, where $W$ quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold $\tau = 0.1$ across all settings; on ACS Income, the gap falls from $\sim 15\%$ to under $4\%$. Our work provides the first formal bound connecting a model's measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.
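To make the bound concrete, the sketch below measures an Equalized Odds disparity on toy predictions and multiplies it by a distinguishability factor W. The abstract does not define the disparity precisely, so the common definition used here (the larger of the true-positive-rate gap and the false-positive-rate gap between two groups) is an assumption, as are the function names, the toy data, and the value of W.

```python
def positive_rate(preds, labels, groups, label, group):
    """P(prediction = 1 | true label = label, sensitive group = group)."""
    selected = [p for p, l, g in zip(preds, labels, groups) if l == label and g == group]
    return sum(selected) / len(selected)

def eo_gap(preds, labels, groups):
    """Equalized Odds disparity: max over true labels of the between-group rate gap
    (assumed definition; the paper's exact form may differ)."""
    return max(
        abs(positive_rate(preds, labels, groups, y, 0) - positive_rate(preds, labels, groups, y, 1))
        for y in (0, 1)
    )

# Toy binary predictions for two sensitive groups (0 and 1).
preds  = [1, 1, 1, 0, 0, 0, 0, 1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
groups = [0, 0, 1, 1, 0, 0, 1, 1]

delta_eo = eo_gap(preds, labels, groups)
W = 0.4  # hypothetical distinguishability of the two training distributions
advantage_bound = delta_eo * W  # upper bound on adversarial advantage from the abstract
print(delta_eo, advantage_bound)
```

Driving the disparity toward zero through fine-tuning shrinks the bound regardless of W, which is the intuition behind using a fairness constraint as a privacy defense.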

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

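评分理由: The paper focuses on fairness and privacy protection in machine learning (distribution inference attacks), which is a poor match for the keyword set (multimodal architectures, world models, reinforcement learning, etc.). Although it evaluates on datasets from several modalities, it does not involve Tokenizer, Visual Encoder, World Models, MLLM, or Model-Based RL techniques. 'Unify Models' is only weakly related at the conceptual level (unifying fairness and privacy) and not in the sense of architectural unification. None of the specified experts appear in the author list, so no expert bonus is applied.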

关键词

Fair Finetuning, Distribution Inference Attacks, Equalized Odds, Privacy Defense, Fairness Constraint, Adversarial Advantage, Complementary Distribution

Score: 3.0 / 27.8
Authors: Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang
Published: 2026-06-01
TL;DR: This paper proposes a lightweight identity-repair backend for thermal pedestrian MOT that improves identity continuity through scene-level spatial-temporal consistency rather than complex re-identification models.
摘要翻译

热行人多目标跟踪(MOT)仍然具有挑战性,因为微弱的外观线索和频繁的检测中断会导致严重的轨迹碎片化。我们研究轻量级后处理是否能在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。基于 YOLOv8 和 SORT 基线,我们添加了一个模块化身份修复后端,该后端包括基于时间、空间、运动及边界线索的在线短间隙重映射和离线轨迹片段重连接。在固定验证集划分上的控制性消融实验以及在官方 PBVS 热行人多目标跟踪基准上的评估表明,主要的身份提升源于保守重连接,IDF1 从 82.25 提升至 84.93,同时保持 MOTA 不变,而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明,在低信息量热成像中,通过高精度轨迹重连接实现鲁棒身份恢复比通过增加跟踪器复杂度更为有效。这些结果提供了热视频中身份恢复的控制性分析,表明与局部帧间关联相比,场景级时空一致性在身份连续性中起主导作用。

Abstract

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
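
To make the idea of conservative offline relinking concrete, here is a minimal sketch that reconnects tracklets using only a temporal gap gate and a spatial distance gate. It is not the authors' backend: the `Tracklet` layout, the thresholds `max_gap` and `max_dist`, and the "link only when exactly one candidate passes" rule are all assumptions chosen for illustration, and the motion and border cues mentioned in the abstract are left out.

```python
from dataclasses import dataclass
import math

@dataclass
class Tracklet:
    track_id: int
    start_frame: int
    end_frame: int
    start_xy: tuple  # box center in the first frame
    end_xy: tuple    # box center in the last frame

def relink(tracklets, max_gap=30, max_dist=40.0):
    """Map each tracklet id to the id of the earlier tracklet it continues.

    A link is made only when exactly one predecessor passes both gates,
    which favors precision over recall.
    """
    id_map = {t.track_id: t.track_id for t in tracklets}
    used = set()  # predecessors that already have a successor
    for cur in sorted(tracklets, key=lambda t: t.start_frame):
        candidates = []
        for prev in tracklets:
            if prev.track_id == cur.track_id or prev.track_id in used:
                continue
            gap = cur.start_frame - prev.end_frame
            close = math.dist(prev.end_xy, cur.start_xy) <= max_dist
            if 0 < gap <= max_gap and close:
                candidates.append(prev)
        if len(candidates) == 1:  # ambiguous cases are left unlinked
            prev = candidates[0]
            id_map[cur.track_id] = id_map[prev.track_id]
            used.add(prev.track_id)
    return id_map

tracks = [
    Tracklet(1, 0, 50, (10, 10), (100, 100)),
    Tracklet(2, 60, 120, (105, 102), (200, 150)),  # plausible continuation of 1
    Tracklet(3, 55, 90, (400, 400), (420, 410)),   # far away, stays separate
]
mapping = relink(tracks)
print(mapping)
```

Refusing to link when several predecessors qualify is one simple way to keep identity switches from being introduced by the repair step itself.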

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on thermal pedestrian Multi-Object Tracking (MOT) using trajectory relinking and scene-level consistency, which is a traditional computer vision task. There is minimal overlap with the provided keywords regarding foundation models and reinforcement learning. YOLOv8 contains a Visual Encoder (score 1.0) and the pipeline combines detection/tracking modules (Unify Models score 1.0), but it does not involve tokenization, world modeling, MLLMs, or model-based RL. The weighted total score is 3.0, which is well below the dynamic passing score of 27.8. None of the listed expert authors are present in the paper.

关键词

Thermal pedestrian MOT, Identity continuity, Scene-level consistency, Trajectory relinking, YOLOv8, SORT baseline, Post-processing, Thermal video

Score: 3.0 / 27.8
Authors: Jiaming Qu, Lucheng fu, Yibo Hu
Published: 2026-06-01
TL;DR: This study demonstrates that LLMs in multi-agent systems are more susceptible to harmful conformity errors induced by peer consensus and authority than to beneficial corrections, and standard reasoning interventions fail to reliably mitigate this risk.
摘要翻译

大语言模型(LLM)在多智能体系统中的应用日益广泛,在这些系统中,它们能够观察并回应其他代理的答案。一个关键风险是**从众(conformity)**:模型可能会仅仅因为其他代理对另一个答案达成一致,就放弃自己的答案。先前研究表明,LLM 往往倾向于将答案修订为多数答案,但不清楚这些修订在纠正错误方面是否同样频繁地引入了新错误。本文进行了一项受控研究:LLM 首先回答一个问题,然后查看模拟的同伴响应,最后做出决定。我们操纵两个社会线索:共识结构以及分配给同伴的权威标签,并测量它们如何影响有益和有害的修订。在四个开源权重 LLM 和七个问答(QA)数据集上,我们发现同伴共识使得误导初始正确的模型比修正初始错误的模型容易得多。权威标签使模型更有可能选择被认可的答案,无论该答案是否正确。更令人担忧的是,诸如思维链(chain-of-thought)和反思等通用推理干预措施,无法可靠地在保留有益修订的同时减少有害修订。这些发现表明,多智能体 LLM 系统应当验证同伴的答案,而不仅仅是聚合它们。

Abstract

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.
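
The distinction between beneficial and harmful revisions can be made concrete with a small bookkeeping sketch. The definitions below are assumptions, since the abstract does not spell them out: a harmful revision is counted when an initially correct answer ends up wrong, and a beneficial one when an initially wrong answer ends up correct. The records are toy data, not the paper's results.

```python
def revision_rates(records):
    """Compute harmful and beneficial revision rates.

    records: list of (initial_correct, final_correct) boolean pairs.
    Harmful rate: share of initially correct answers that end up wrong.
    Beneficial rate: share of initially wrong answers that end up correct.
    """
    init_right = [final for initial, final in records if initial]
    init_wrong = [final for initial, final in records if not initial]
    harmful = (
        sum(not final for final in init_right) / len(init_right)
        if init_right else 0.0
    )
    beneficial = sum(init_wrong) / len(init_wrong) if init_wrong else 0.0
    return {"harmful": harmful, "beneficial": beneficial}

# Toy outcomes after the model has seen simulated peer answers.
records = [
    (True, True), (True, False), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, False),
]
rates = revision_rates(records)
print(rates)
```

Reporting the two rates separately matters because a single "accuracy after revision" number can hide the asymmetry the study describes.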

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on LLM conformity and social bias in multi-agent decision making, while the provided keywords target multimodal architecture (Visual Encoder, MultiModal, MLLM), world models, and RL. Only 'MLLM' loosely relates to the use of LLMs, and 'Unify Models' has a weak connection to multi-agent coordination; other keywords are irrelevant. No expert authors from the specified list were found.

关键词

LLM Conformity, Multi-Agent Systems, Peer Consensus, Authority Labels, Harmful Revision, Beneficial Revision, Chain-of-Thought

Score: 3.0 / 27.8
Authors: Kanwar Bharat Singh
Published: 2026-06-01
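TL;DR: TechGraphRAG is an agentic, graph-augmented RAG framework that improves technical literature reasoning through iterative query rewriting, knowledge graph traversal, and automatic citation verification.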
摘要翻译

本文提出了一种面向领域特定技术推理支持的智能体增强生成 (RAG) 框架,该框架构建于一个精心策划的语料库之上,包含约 2,100 篇关于智能轮胎、车辆动力学和车辆控制的学术论文。与传统的单轮 RAG 系统不同,所提出的架构采用了一个 13 步自主流水线,该流水线按意图对查询进行分类,基于多维标准对证据充分性进行评分,执行带有漂移防护的查询重构智能体重试,通过迭代优化 - 搜索 - 验证循环搜索外部学术数据库(Crossref、OpenAlex、Semantic Scholar),遍历 Neo4j 知识图谱以获取关系上下文,验证引文完整性,并应用生成后质量检查及自动重新生成。主要贡献包括:一个涵盖五个维度、包含相关性衰减及混合基于规则与 LLM 的评审的 100 分制证据充分性评分框架;一个具有迭代智能体循环的依赖路径外部搜索架构;一个基于 LLM 实体提取和 OpenAlex 作者验证构建、并包含语料库内引文解析的知识图谱;以及一个包含引文验证和质量评估的自我修正生成循环。该框架作为一个实际实施的案例研究呈现,说明了智能体、基于证据的 RAG 如何在大型领域特定语料库上支持文献导航和技术推理。

Abstract

This paper presents an agentic retrieval-augmented generation (RAG) framework for domain-specific technical reasoning support, instantiated over a curated corpus of approximately 2,100 academic papers in intelligent tires, vehicle dynamics, and vehicle control. Unlike conventional single-pass RAG systems, the proposed architecture employs a 13-step autonomous pipeline that classifies queries by intent, scores evidence sufficiency against a multi-dimensional rubric, performs agentic retry with drift-guarded query reformulation, searches external academic databases (Crossref, OpenAlex, Semantic Scholar) through iterative optimize--search--vet loops, traverses a Neo4j knowledge graph for relational context, verifies citation integrity, and applies post-generation quality checks with automatic regeneration. Key contributions include a 100-point evidence sufficiency scoring framework across five dimensions with relevance damping and hybrid rule-based/LLM review; a route-dependent external search architecture with iterative agentic loops; a knowledge graph constructed via LLM-based entity extraction and OpenAlex author validation with intra-corpus citation resolution; and a self-correcting generation loop with citation verification and quality assessment. The framework is presented as a practical, implemented case study illustrating how agentic, evidence-grounded RAG can support literature navigation and technical reasoning over large, domain-specific corpora.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

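评分理由: The paper mainly concerns a knowledge-graph-based, text-only retrieval-augmented generation (RAG) framework for technical literature reasoning. It does not involve multimodal representation learning (Visual Encoder, MultiModal), world model construction, Tokenizer design, or model-based reinforcement learning. The framework uses large language models (loosely related to MLLM), but not multimodal ones; it combines RAG with a graph structure (a broad reading of Unify Models), but does not study the unification of model architectures. No expert authors from the specified list were found, so no bonus is added. The weighted total score is 3.0, well below the dynamic passing score of 27.8.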

关键词

TechGraphRAG, Agentic RAG, Knowledge Graph, Technical Literature Reasoning, Retrieval-Augmented Generation, Citation Verification, Query Reformulation, Neo4j

Score: 3.0 / 27.8
Authors: Haoji Hu, Huaqing Mao, Yijun Lin, Xiaowei Jia, Jinwei Zhou, Minoh Jeong, Yao-Yi Chiang
Published: 2026-06-01
TL;DR: This paper proposes a nonparametric mutual information estimator to quantify dependence between continuous time series and discrete temporal event sequences without data transformation, achieving improved accuracy and robustness across various analysis tasks.
摘要翻译

成对依赖度量(pairwise dependence measures),如相关性(correlation)和因果性(causality),是时序数据挖掘的基础,然而,目前仍缺乏一种原则性强且稳健的方法来量化异构数据类型之间的依赖关系,尤其是连续时间序列(continuous time series)与离散时间事件序列(discrete temporal event sequences)之间的依赖。现有方法通常依赖于临时变换(ad hoc transformations)或互信息估计器,这些方法对量化(quantization)、重复值(repeated values)以及事件冗余(event redundancy)高度敏感,从而导致实践中出现有偏或不稳定的结果。本文提出了一种非参数互信息估计器(nonparametric mutual information estimator),该估计器无需数据变换、学习或临时离散化(ad hoc discretization),即可直接测量时间序列与事件序列之间的依赖关系。该方法通过建模现实世界时间序列的连续 - 离散对偶性(continuous-discrete duality)来处理量化和重复值伪影,并引入潜在事件聚类策略(latent event clustering strategy)以减轻事件共现(event co-occurrence)和冗余带来的偏差。上述方法共同构建了一个稳健且统一的框架,弥合了离散互信息与连续互信息之间的鸿沟。我们在四个代表性任务上评估了所提出的估计器:用于因果分析的离散 - 连续时间延迟互信息(discrete-continuous time-delayed mutual information)、全局与局部时间重复发现(global and local temporal repetition discovery)、用于时间序列预测的离散协变量选择(discrete covariate selection)以及用于分类的连续特征选择(continuous feature selection)。在合成数据集和真实世界数据集上的实验表明,该方法在准确性、稳健性和可解释性方面均一致优于现有方法,使其成为异构时序数据的通用依赖算子(general-purpose dependence operator),类似于同质时间序列中的皮尔逊相关系数(Pearson correlation)。代码开源地址:https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

Abstract

Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, especially between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or mutual-information estimators that are highly sensitive to quantization, repeated values, and event redundancy, leading to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts and introduces a latent event clustering strategy to mitigate bias from event co-occurrence and redundancy. Together, these yield a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, positioning our approach as a general-purpose dependence operator for heterogeneous temporal data, similar to Pearson correlation for homogeneous time series. Code available at: https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on statistical mutual information estimation between heterogeneous temporal data (time series and event sequences). It does not align with the technical domains of the provided keywords, which target Multimodal Large Models and Reinforcement Learning architectures (e.g., Tokenizers, Visual Encoders, World Models, MLLM, Model-Based RL). Only superficial relevance exists for 'Unify Models' (unified framework mentioned) and 'MultiModal' (heterogeneous data types), hence low scores.

关键词

Mutual Information, Time Series, Temporal Event Sequences, Nonparametric Estimator, Causality Analysis, Feature Selection, Heterogeneous Data, Continuous-Discrete Duality

Score: 3.0 / 27.8
Authors: Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui
Published: 2026-06-01
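TL;DR: This paper proposes a meta-learning framework based on complexity minimization and proves that growing the pre-training data reduces downstream sample complexity and improves few-shot generalization.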
摘要翻译

预训练(Pre-training)已成为现代机器学习的一个基本范式,其关键经验性优势之一是随着预训练数据规模的增加,下游样本复杂度(downstream sample complexity)得以降低。然而,现有的预训练理论框架(theoretical frameworks)尚无法完全解释这一现象。本文提出了一种名为复杂度最小化(Complexity Minimization)的新型元表征学习(meta-representation learning)框架,旨在实现对这种缩放行为(scaling behavior)的理论分析。该框架通过评估每个领域最适合的下游模型复杂度(downstream model complexity),并在源域(source domains)上最小化此类复杂度的最坏情况来学习表征。我们的端到端理论分析(end-to-end theoretical analysis),涵盖从预训练到下游回归(downstream regression)的全过程,表明该框架可证明地捕捉到了这种缩放行为;特别是,我们展示了随着元训练数据量(meta-training data)的增加,小样本适应(few-shot adaptation)的错误率会降低。实证上,我们将复杂度正则化(complexity regularization)整合到现有的元学习方法(meta-learning methods)中,一致地提高了下游样本效率(downstream sample efficiency)。

Abstract

Pre-training has become a fundamental paradigm in modern machine learning, with one of its key empirical benefits being reduced downstream sample complexity as the scale of pre-training data increases. However, existing theoretical frameworks for pre-training do not fully explain this phenomenon. In this paper, we introduce complexity minimization, a novel meta-representation learning framework designed to enable theoretical analysis of this scaling behavior, which learns representations by evaluating the downstream model complexity best suited to each domain and minimizing the worst-case such complexity across source domains. Our end-to-end theoretical analysis, spanning pre-training through downstream regression, shows that this framework provably captures this scaling behavior; in particular, we show that the error rate of few-shot adaptation improves as the amount of meta-training data grows. Empirically, we demonstrate that incorporating complexity regularization into existing meta-learning methods consistently improves downstream sample efficiency.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

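评分理由: The paper centers on a theoretical framework for meta-learning and data scaling laws, which belongs to general machine learning theory. Although it touches on representation learning and pre-training, it contains none of the specific technical components covered by the keywords, such as multimodal processing, visual encoders, tokenizers, world models, large language models (MLLM), or reinforcement learning, and is a poor match for the given research background (multimodal / RL / world models).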

关键词

Meta Learning, Complexity Minimization, Pre-training, Scaling Law, Representation Learning, Few-shot Adaptation, Sample Efficiency

Score: 3.0 / 27.8
Authors: Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min
Published: 2026-06-01
TL;DR: This paper proposes Resonant Context Anchoring, an inference-time intervention method that decouples attention routing and signal gain to suppress factual hallucinations in Large Language Models without requiring training.
摘要翻译

大型语言模型(LLMs)在面对与其内部参数记忆相冲突的输入证据时,常表现出“上下文忽视”,从而导致持续的事实性幻觉。现有的缓解策略主要依赖于抑制特定神经元激活或采用计算昂贵的对比解码机制,这往往会导致困惑度增加或推理延迟显著升高。为了解决这些局限性,我们提出共振语境锚定(RCA),这是一种基于残差流信号动力学视角的轻量级推理时干预方法。RCA 旨在缓解外部证据在通过深层网络传播过程中出现的信号衰减。其核心机制在于对自注意力模块内的路由逻辑与信息幅度进行正交解耦。通过利用原始预 softmax 注意力分数作为语义对齐的瞬时度量,我们通过非线性整流构建动态增益场,以选择性地放大对应上下文令牌的值向量范数,而不改变注意力概率分布。该机制有效地提升了残差流混合体中输入证据的信噪比(SNR),从而在推理过程中稳健地将生成轨迹锚定至真实语境。在 Llama-3 模型系列上的广泛实验表明,RCA 在多个事实一致性和强知识冲突任务中显著提升了语境忠实度,有效抑制了参数性幻觉。此外,结果表明,作为一种无需训练且计算开销可忽略的即插即用模块,RCA 在保持模型通用语言理解能力的同时,在忠实度与流畅度之间实现了帕累托改进。

Abstract

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model's general language understanding capabilities.
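
The decoupling of routing and magnitude described above can be sketched for a single attention query. This is an illustration under stated assumptions, not the paper's formulation: the gain function `1 + alpha * relu(score)` and the parameter `alpha` are hypothetical stand-ins for the "dynamic gain field via non-linear rectification", and `context_mask` is an assumed way of marking which tokens count as context.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_with_gain(q, K, V, context_mask, alpha=0.5):
    """Single-query attention where value vectors of context tokens are rescaled.

    The softmax probabilities (routing) are computed as usual and left
    unchanged; only the magnitude of selected value vectors is amplified.
    """
    scores = K @ q / np.sqrt(q.shape[0])   # raw pre-softmax scores
    probs = softmax(scores)                # routing, not modified
    # Rectified scores give a gain >= 1 for context tokens, exactly 1 elsewhere.
    gain = 1.0 + alpha * np.maximum(scores, 0.0) * context_mask
    out = (probs * gain) @ V               # gain scales each value vector
    return out, probs, gain

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 4))
context_mask = np.array([1.0, 1.0, 1.0, 0.0, 0.0])  # first three tokens are context

out, probs, gain = attention_with_gain(q, K, V, context_mask)
baseline, _, _ = attention_with_gain(q, K, V, context_mask, alpha=0.0)
```

With `alpha=0.0` the function reduces to standard attention, which makes it easy to compare the anchored output against the unmodified one.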

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on inference-time intervention for Large Language Models (LLMs) to reduce factual hallucinations via a method called Resonant Context Anchoring. It modifies attention routing and signal gain within the residual stream. There is no mention of multimodal data (Visual Encoder, MultiModal), tokenization changes (Tokenizer), world modeling (World Models), or reinforcement learning (model-based RL), resulting in 0 relevance for these keywords. 'Unify Models' receives a low score (1.0) because the method unifies routing and gain logic within attention, but it does not align with the architectural unification implied by the keyword background. 'MLLM' receives a low score (1.0) as the work is based on LLMs, though it is not explicitly multimodal. No expert authors from the specified list are found in the author list.

关键词

Resonant Context Anchoring, Inference-time Intervention, Attention Routing, Signal Gain, Factual Hallucinations, Large Language Models, Residual Stream, Signal-to-Noise Ratio

Score: 3.0 / 27.8
Authors: Jianru Ding, Ryien Hosseini, Pouya Mahdi Gholami, Mingyuan Xiang, Henry Hoffmann
Published: 2026-06-01
TL;DR: The paper proposes ConServe, a conversation-level disaggregated scheduling system for LLM agents that replaces turn-level prediction with observable metrics, reducing latency by 51% and improving energy efficiency.
摘要翻译

基于大语言模型(LLM)的智能体通过多轮依赖推理和工具调用解决用户任务,生成一个工作负载,其总成本在任务到达时未知。现有的多轮系统以轮次(turn)为调度单元,逐轮决定是否将预填充(prefill)与解码(decode)分离。该决策依赖于轮次的解码长度、工具行为及 KV(Key-Value)缓存增长量,而这些量在调度器必须做出决策时不可观测,迫使系统对其进行预测。我们表明,这种对预测的依赖是由调度单元强加的,而非工作负载本身。将调度单元从轮次提升至对话(conversation),可将轮级不规则性转化为稳定的两阶段结构:1) 计算受限的第 1 轮预填充,随后是 2) 长尾的内存受限阶段。因此,以对话为调度单元,放置策略简化为读取第 1 轮输入长度及每个解码器的 KV 占用率,这两者均可直接观测。我们在 ConServe 系统中实例化了这一原则:系统将第 1 轮预填充路由至高通量预填充器,仅传输一次 KV 缓存,并将整个对话长尾阶段固定绑定至单个解码器,无需学习解码侧成本的模型。与基于轮次的预测基线相比,ConServe 将第 95 百分位首次有效 token 时间(即对话首个用户可见输出的延迟)降低了 51.08%,能效提升了 7.51%,同时保持最后一轮 TBT 和 SLO(服务等级目标)不变;将这两个阶段映射到异构 GPU 层级可进一步带来 22.75% 的能效增益。

Abstract

LLM-based agents resolve a user task through many turns of dependent inference and tool calls, producing a workload whose total cost is unknown when the task arrives. Existing multi-turn systems keep the turn as the scheduling unit and decide, turn by turn, whether to disaggregate prefill from decode. That decision rests on the turn's decode length, tool behavior, and KV growth, quantities that are not observable when the scheduler must act, forcing the system to predict them. We show this dependence on prediction is imposed by the scheduling unit, not the workload. Raising the scheduling unit from the turn to the conversation converts turn-level irregularity into a stable, two-phase structure: 1) a compute-bound turn-1 prefill followed by 2) a long, memory-bound tail. Thus, with the conversation as the scheduling unit, placement reduces to reading the first-turn input length and per-decoder KV occupancy, both directly observable. We instantiate this principle in ConServe, which routes the first-turn prefill to a high-throughput prefiller, transfers the KV cache exactly once, and pins the conversation to a single decoder for its entire tail, with no learned model of decode-side cost. Against a per-turn prediction baseline, ConServe reduces p95 time-to-first-effective-token (the latency of a conversation's first user-visible output) by 51.08% and improves energy efficiency by 7.51% while preserving last-turn TBT and SLOs; mapping the two phases onto heterogeneous GPU tiers adds a further 22.75% in energy efficiency.
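
A conversation-level placement decision that reads only observable quantities might look like the sketch below. The class names, the "least queued prefiller" and "lowest KV occupancy that still fits" rules, and the capacity numbers are assumptions made for illustration; the abstract states which signals are read (first-turn input length and per-decoder KV occupancy) but not the exact policy.

```python
from dataclasses import dataclass

@dataclass
class Prefiller:
    name: str
    queued_tokens: int = 0

@dataclass
class Decoder:
    name: str
    kv_capacity: int  # tokens of KV cache the decoder can hold
    kv_used: int = 0

    @property
    def occupancy(self):
        return self.kv_used / self.kv_capacity

def place_conversation(first_turn_len, prefillers, decoders):
    """Pick a prefiller for turn 1 and a decoder for the whole tail.

    Uses only the first-turn input length and per-decoder KV occupancy;
    no prediction of decode length or tool behavior is involved.
    """
    prefiller = min(prefillers, key=lambda p: p.queued_tokens)
    fits = [d for d in decoders if d.kv_used + first_turn_len <= d.kv_capacity]
    if not fits:
        raise RuntimeError("no decoder has room for the first-turn KV cache")
    decoder = min(fits, key=lambda d: d.occupancy)
    prefiller.queued_tokens += first_turn_len
    decoder.kv_used += first_turn_len  # KV moves once, conversation stays pinned
    return prefiller.name, decoder.name

prefillers = [Prefiller("p0", 1000), Prefiller("p1", 200)]
decoders = [Decoder("d0", 10000, 8000), Decoder("d1", 10000, 3000)]

first = place_conversation(2000, prefillers, decoders)
second = place_conversation(2500, prefillers, decoders)
print(first, second)
```

Because the decoder is chosen once per conversation, later turns need no further routing decision in this sketch.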

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on LLM serving systems (ConServe) and inference scheduling optimization, specifically changing the scheduling unit from turn to conversation. It does not discuss model architecture components (Tokenizer, Visual Encoder), multimodal processing (MultiModal, MLLM), or learning paradigms (World Models, Model-based RL). 'Unify Models' has slight relevance regarding unifying scheduling logic, but overall relevance to the provided keyword set is low due to domain mismatch (Systems vs. Model Architecture/RL).

关键词

Conversation-Level Disaggregated Scheduling, Agentic Serving, Prefill and Decode, KV Cache, Time-to-First-Effective-Token, Energy Efficiency, LLM-based Agents, Observation vs Prediction

Score: 3.0 / 27.8
Authors: Franz Nowak, Ryan Cotterell, Reda Boumasmoud
Published: 2026-06-01
TL;DR: This paper establishes a unified algebraic framework to analyze the formal language expressivity of recurrent neural language models, demonstrating that computational power depends on the underlying arithmetic model.
摘要翻译

Recurrent Neural Language Model(递归神经网络语言模型)能够识别什么样的 Formal Languages(形式语言)?文献中的形式化结果存在冲突:一些作者报告了 Turing-completeness(图灵完备性),而另一些则表明其等价于 regular languages(正则语言)。这种差异的原因是 underlying arithmetic model(底层算术模型)不同。本文提出了一个关于 recurrent neural networks(递归神经网络)表达能力的 unified algebraic account(统一代数解释),始于对各种 arithmetic models(算术模型)的形式化描述。该解释将表达能力问题归结为一个代数问题,例如,network's syntactic monoid(网络的语法单群)是否 divides(整除)某个 wreath product(wreath 积)。作为 case study(案例研究),本文重新审视了 diagonal state-space models(对角状态空间模型):一旦强制 floating-point recurrences(浮点递归),相同的 architecture(架构)无法实现 an even-modulus counter(偶数模计数器),然而在 unsigned-integer quantization(无符号整数量化)下实现了 every even-modulus counter(每一个偶数模计数器)。

Abstract

What formal languages can a recurrent neural language model recognize? Formal results in the literature conflict: some authors report Turing-completeness, while others show equivalence to regular languages. The reason for this discrepancy is that the underlying arithmetic model differs. The paper develops a unified algebraic account of the expressivity of recurrent neural networks, starting with a formal account of various arithmetic models. This account reduces expressivity to an algebraic question, e.g., whether a network's syntactic monoid divides a certain wreath product. As a case study, the paper revisits diagonal state-space models: the same architecture cannot implement an even-modulus counter once floating-point recurrences are enforced, yet realizes every even-modulus counter under unsigned-integer quantization.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 2.0/10 3.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on the theoretical expressivity of recurrent language models using algebraic structures and arithmetic models, unrelated to multimodal architectures, tokenization, visual encoders, world models, MLLMs, or model-based RL. Only the abstract's mention of a 'unified algebraic account' links to 'Unify Models' (Score: 2.0). Weighted total score is 3.0, well below the dynamic pass score of 27.8.

关键词

Recurrent Language Models, Formal Language Theory, Algebraic Expressivity, Arithmetic Models, Syntactic Monoid, Wreath Product, Diagonal State-Space Models

Score: 3.0 / 27.8
Authors: Kai Wang
Published: 2026-06-01
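TL;DR: This paper addresses the trade-off between discriminability and adversarial robustness in neural network classifiers and proposes a Hybrid Prototype Mixing framework that substantially improves adversarial robustness without sacrificing discriminability.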
摘要翻译

现代神经网络极易受到对抗扰动的影响。在这项工作中,我们发现这种脆弱性的一部分源于广泛使用的全连接(FC)分类器对这些扰动的敏感性。相比之下,简单的基于 $\ell_2$ 距离的分类器表现出显著更强的鲁棒性。我们提供了详尽的理论与实证分析,表明虽然全连接分类器的高敏感性使其具有判别性,但也使其易受攻击。相反,基于 $\ell_2$ 分类器的不敏感性赋予了鲁棒性,但限制了性能。受这种权衡的启发,我们提出了一种基于混合原型混合(HPM)框架的新型 $\ell_2$ 重分类器。该方法在利用 $\ell_2$ 距离鲁棒性的同时,保留了全连接分类器的判别能力。它通过融合两种原型类型来产生基于 $\ell_2$ 距离的预测:(1)通过指数移动平均(EMA)更新的稳定、数据集级别的原型;(2)利用直通估计器(STE)基于全连接分类器预测生成的动态、批次级别的原型。然而,这种基于 STE 的动态架构给评估带来了显著挑战,例如梯度混淆和前向不连续。为了解决这一问题,我们提出了一种新的严格评估协议——混合代理攻击(MSA),该协议结合多种代理模型以及强大的 AutoAttack,以确保公平且鲁棒的评估。广泛的实验表明,我们的轻量级、即插即用模块只需最小微调,即可有效提升各种现有最先进(SOTA)对抗训练模型的对抗鲁棒性。

Abstract

Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple $\ell_2$ distance-based classifiers exhibit significantly greater robustness. We provide thorough theoretical and empirical analysis showing that while FC classifiers' high sensitivity makes them discriminative, it also makes them vulnerable. Conversely, $\ell_2$-classifiers' insensitivity grants robustness but limits performance. Motivated by this trade-off, we propose a novel $\ell_2$-reclassifier based on a Hybrid Prototype Mixing (HPM) framework. This method retains the discriminative power of FC classifiers while leveraging the robustness of $\ell_2$ distance. It yields $\ell_2$-distance-based predictions by fusing two prototype types: (1) stable, dataset-level prototypes updated via EMA, and (2) dynamic, batch-level prototypes generated from the FC classifier's predictions using a Straight-Through Estimator (STE). However, this dynamic, STE-based architecture introduces significant challenges for evaluation, such as gradient obfuscation and forward discontinuity. To address this, we propose a new, rigorous evaluation protocol, the Mixed Surrogate Attack (MSA), which uses multiple surrogates along with powerful AutoAttack to ensure a fair and robust assessment. Extensive experiments demonstrate that our lightweight, plug-and-play module, with minimal fine-tuning, effectively enhances the adversarial robustness of various existing SOTA adversarially trained models.
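
The stable half of the prototype scheme, dataset-level prototypes updated by EMA and used for nearest-prototype prediction under $\ell_2$ distance, can be sketched as follows. This covers only that branch: the dynamic batch-level prototypes, the Straight-Through Estimator, and the fusion rule are omitted, and the `momentum` value and function names are assumptions.

```python
import numpy as np

def ema_update(prototypes, feats, labels, momentum=0.9):
    """Update per-class prototypes with the batch mean of each class present."""
    new = prototypes.copy()
    for c in np.unique(labels):
        batch_mean = feats[labels == c].mean(axis=0)
        new[c] = momentum * prototypes[c] + (1.0 - momentum) * batch_mean
    return new

def l2_predict(prototypes, feats):
    """Assign each feature vector to the class of its nearest prototype."""
    # Pairwise squared distances, shape (n_samples, n_classes).
    d2 = ((feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

# Toy 2-D features for two classes (hypothetical data).
prototypes = np.zeros((2, 2))
feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0, 1])

prototypes = ema_update(prototypes, feats, labels, momentum=0.5)
preds = l2_predict(prototypes, np.array([[2.0, 0.0], [0.0, 2.0]]))
print(prototypes, preds)
```

A small input perturbation moves a feature only slightly relative to fixed prototypes, which is the intuition behind the insensitivity of distance-based heads noted in the abstract.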

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

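评分理由: The paper studies the trade-off between adversarial robustness and discriminability in neural network classifiers and proposes the Hybrid Prototype Mixing (HPM) framework. The keyword set (MLLM, world models, model-based reinforcement learning, etc.) focuses on multimodal large models and reinforcement learning. The paper is largely unrelated to these keywords: it touches on visual feature extraction (Visual Encoder) only marginally and does not involve tokenizers, multimodal fusion, world models, or reinforcement learning. The relevance scores are therefore very low, far below the dynamic passing score.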

关键词

Adversarial Robustness, Discriminability, Hybrid Prototype Mixing, Mixed Surrogate Attack, Adversarial Perturbations, Prototype Mixing, Classification Head

Score: 3.0 / 27.8
Authors: Rodrigo Wilkens, Rémi Cardon, Vincent Folny, Thomas François
Published: 2026-06-01
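TL;DR: This paper proposes an enhanced argument-based validation framework for French automated essay scoring and advances French AES research by comparing 8 model architectures on fairness, agreement, and validity.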
摘要翻译

在自动作文评分(AES)中,基准测试实践催生了极简主义的评价方式,这与评估框架(如基于论证的验证框架(ABV))所倡导的更广泛视角形成对比。ABV 框架主张对系统进行多维评估,尤其是在高风险语言测试的背景下。本文提出了一种增强且更实用的 ABV 框架版本,该版本纳入了公平性分析、与语言特征的相关性、预测误差评估以及与人类评分者之间的模型一致性分析。将该框架应用于法语自动作文评分,我们在一个包含 27,000 篇考试作文(每篇由 2 位评分者评分)的语料库和一个包含 961 篇作文(每篇至少由 9 位评分者评分)的泛化语料库上比较了 8 种模型架构。我们的分析展示了应用 ABV 框架以更好地理解 AES 模型的能力与局限性的益处,同时也推动了法语自动作文评分领域的前沿发展。

Abstract

In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework, incorporating fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare 8 model architectures on a corpus of 27k exam essays (2 raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state-of-the-art for French AES.
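
Agreement between a model and a human rater on ordinal scores is often summarized with quadratic weighted kappa. The abstract does not name the agreement statistic it uses, so treating QWK as the metric is an assumption; the sketch below only shows how such a statistic is computed on integer score levels `0..n_levels-1`.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_levels):
    """Quadratic weighted kappa between two integer ratings in 0..n_levels-1.

    Undefined (division by zero) when both raters give one identical
    constant score to every essay.
    """
    a, b = np.asarray(a), np.asarray(b)
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Expected counts if the two raters were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(a)
    idx = np.arange(n_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_levels - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

human = [0, 1, 2, 3]
model = [0, 1, 2, 3]
perfect = quadratic_weighted_kappa(human, model, n_levels=4)
opposite = quadratic_weighted_kappa([0, 0, 1, 1], [1, 1, 0, 0], n_levels=2)
print(perfect, opposite)
```

The quadratic weights penalize a two-level disagreement four times as much as a one-level one, which suits graded essay scales better than plain accuracy.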

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

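评分理由: The paper focuses on an evaluation framework and model comparison for French automated essay scoring (AES), which is a text-only NLP topic. The keywords cover multimodality, world models, and reinforcement learning, none of which appear in the paper (no vision, no RL, no multimodality). Only Tokenizer and Unify Models have a weak connection through the basics of NLP models. No specified expert authors are included.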

关键词

Automated Essay Scoring, French Language, Argument-based Validation, Model Evaluation, Fairness Analysis, Linguistic Features, Generalizability, Human Rater Agreement

Score: 3.0 / 27.8
Authors: Meihua Dang, Linxin Song, Honghua Zhang, Jieyu Zhao, Guy Van den Broeck, Stefano Ermon
Published: 2026-06-01
TL;DR: This paper proposes a globally constrained decoding approach using tensorized finite automata and sequential Monte Carlo to mitigate sampling bias in locally constrained decoding for large language models.
摘要翻译

大语言模型的生成结果往往难以符合期望的约束,例如 JSON Schema (JSON 模式)。现有的局部约束解码 (LCD) 方法通过短视地屏蔽下一个 token 来强制执行约束,导致采样偏差和性能下降。最近的工作使用序列蒙特卡洛 (SMC) 方法来缓解此类偏差,但设计有效的提议分布或势函数仍然是一个关键挑战。在本文中,我们提出了一种通用方法,用于构建从 $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$ 进行 SMC 采样的提议和势函数。首先,我们表明以有限自动机指定的约束可以被张量化,以便在 GPU 上高效执行,我们利用这一点来构建全局约束解码 (GCD) 提议。此外,利用张量化有限自动机与隐马尔可夫模型共享相同电路结构的事实,我们对它们进行电路相乘,以获得概率 GCD (P-GCD) 提议,该提议编码了关于目标分布的逻辑和概率信息。我们在函数调用、基于关键词的生成和 SQL 生成任务上评估了 (P-)GCD。实验表明,在相同的 SMC 采样设置下,与 LCD 提议相比,(P-)GCD 以显著更少的粒子更快地收敛到目标分布。

Abstract

Generations from large language models often fail to conform to desired constraints such as JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses sequential Monte Carlo (SMC) methods to mitigate such biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$. First, we show that constraints specified as finite automata can be tensorized for efficient execution on GPUs, which we use to construct globally constrained decoding (GCD) proposals. In addition, leveraging the fact that tensorized finite automata share the same circuit structure as hidden Markov models, we circuit-multiply them to obtain the probabilistic GCD (P-GCD) proposals encoding both logical and probabilistic information about the target distributions. We evaluate (P-)GCD on the tasks of function calling, keyword-based generation, and SQL generation. Experiments show that under the same SMC sampling setup, compared to LCD proposals, (P-)GCD converges faster to the target distribution with significantly fewer particles.
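
The two ingredients named above, a finite automaton stored as a tensor and a mask that looks ahead globally rather than one token at a time, can be illustrated on a toy automaton. This sketch shows only those two ideas: it does not implement SMC, the HMM product, or the paper's proposal construction, and the fixed length budget `steps_left` is an assumption introduced to make lookahead matter in such a small example.

```python
import numpy as np

# Toy DFA over symbols {0: 'a', 1: 'b'} accepting strings that end in "ab".
# T[s, i, j] = 1 if reading symbol s in state i moves to state j.
T = np.zeros((2, 3, 3))
T[0, 0, 1] = T[0, 1, 1] = T[0, 2, 1] = 1  # 'a' always leads to state 1
T[1, 0, 0] = T[1, 1, 2] = T[1, 2, 0] = 1  # 'b' reaches state 2 only from state 1
accept = np.array([0.0, 0.0, 1.0])
start = np.array([1.0, 0.0, 0.0])

def run(symbols):
    """Advance the one-hot state vector with one matrix product per symbol."""
    state = start
    for s in symbols:
        state = state @ T[s]
    return state

def can_accept(state, steps_left):
    """True if acceptance is reachable in exactly steps_left more symbols."""
    reach = accept
    step = np.clip(T.sum(axis=0), 0, 1)  # union of transitions over symbols
    for _ in range(steps_left):
        reach = np.clip(step @ reach, 0, 1)
    return float(state @ reach) > 0

def allowed_next(state, steps_left):
    """Symbols that keep an accepting completion reachable within the budget."""
    if steps_left < 1:
        return []
    return [
        s for s in range(T.shape[0])
        if can_accept(state @ T[s], steps_left - 1)
    ]

print(allowed_next(start, 2))  # only 'a' can still lead to "ab" in two symbols
```

A purely local mask would allow both symbols at the start state, because both have a valid transition; the lookahead version rules out 'b' once only two symbols remain.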

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper mitigates sampling bias in locally constrained decoding for large language models using sequential Monte Carlo and tensorized finite automata. It does not involve visual encoders, world models, multimodal integration, or model-based reinforcement learning. Because it concerns LLM decoding, it has slight relevance to Tokenizer and MLLM, but it lacks the multimodal and world-modeling aspects the keywords target, so the relevance scores are low. None of the listed expert authors appear in the author list.

Keywords

Locally Constrained Decoding, Sequential Monte Carlo, Tensorized Finite Automata, Globally Constrained Decoding, Bias Mitigation, Large Language Models, Function Calling

Score: 3.0 / 27.8
Authors: Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai
Published: 2026-06-01
TL;DR: This paper introduces LongJudgeBench to evaluate the reliability of LLM-as-a-judge methods for long-form text generation, revealing significant instability across different scenarios.
Abstract (Chinese translation)

随着大型语言模型(LLMs)在长文本生成中的应用日益广泛,可靠评估长文本输出已成为一项关键挑战。LLM-as-a-judge(LLM 作为评判者)提供了一种可扩展的人工评估替代方案,但其在长文本输出评估中的可靠性尚未得到充分研究:现有的元评估基准主要关注短文本输出。与短文本评估相比,长文本评估不仅仅是输出长度的问题;它通常要求评判者处理更复杂的文档级要求。在这项工作中,我们引入了 LongJudgeBench,这是一个综合基准,旨在评估 LLM 评判者在多样化现实场景和评判协议下的长文本输出表现。我们系统性地评估了广泛的 LLM 评判者,涵盖了多种基础模型和评判设置。我们的结果揭示了一个显著的可靠性差距:当前 LLM 评判者在不同场景下仍不稳定,且评分细则或参考输出虽有帮助,但并非总是充分。我们希望 LongJudgeBench 能支持未来研究,以开发更鲁棒、更具上下文感知能力且更符合人类对齐的 LLM-as-a-judge 方法。我们的代码可在 https://anonymous.4open.science/r/LongJudgeBench-F782 获取。

Abstract

As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://anonymous.4open.science/r/LongJudgeBench-F782.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper evaluates LLM judges for long-form text generation with a benchmark (LongJudgeBench). It does not discuss multimodal architectures (Visual Encoder, MultiModal), tokenizer design, world models, or reinforcement learning algorithms. The keywords concern foundation model architectures and RL, which this evaluation-focused study does not cover, resulting in low relevance scores.

Keywords

LLM-as-a-Judge, Long-Form Output Evaluation, Benchmark, Reliability Gap, LongJudgeBench, Document-level Demands, Human-aligned Methods

Score: 3.0 / 27.8
Authors: Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng
Published: 2026-06-01
TL;DR: SAVMap generates precise 3D wireframe maps of warehouse environments from panoramic video using semantic segmentation and Manhattan-constrained structure-from-motion, achieving high accuracy and scalability.
Abstract (Chinese translation)

工业环境的精确三维表示使得机器人定位(robot localization)和数字孪生(digital twin)生成等任务成为可能。我们提出了 SAVMap,一种仅使用全景视频相机作为传感器输入,用于生成仓库货架和照明结构语义线框图(semantic wireframe map)的方法。从沿仓库过道拍摄的全景视频中,提取出包含面向货架和天花板视角的校正图像(rectified images)序列。利用语义分割网络(semantic segmentation network)前端,从每张图像中提取一组稀疏的语义结构特征点(例如货架结构的角点、灯光中心),并在序列之间进行跟踪。通过考虑点之间的真实世界几何关系(如曼哈顿网格 Manhattan grids),约束结构 - 运动恢复(constrained structure-from-motion)算法生成形成线框图(wireframe map)的三维点。我们在一个拥有 46 排货架的仓库中展示了该方法的可扩展性和准确性,每排货架的立面跨度为 55 米乘 7 米。利用一小时的全景视频内容,我们生成了跨越这些排数的超过 5000 个货架元素的线框图,相对于真值(ground-truth),实现了 4.8 厘米的平均绝对误差(mean absolute error)。

Abstract

Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground-truth.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 2.0/10 3.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses computer vision and robotics mapping (SfM, semantic segmentation), whereas the keywords target large language models, multimodal architectures, and reinforcement learning. Only 'Visual Encoder' has marginal relevance because the method uses a visual network, though not in an MLLM or unified-modeling context. No expert authors from the specified list appear among the authors.

Keywords

Panoramic Video, Semantic Segmentation, Structure-from-Motion, Manhattan Wireframes, 3D Mapping, Warehouse Environment, Feature Tracking

Score: 1.5 / 27.8
Authors: Mubaraka Sani Ibrahim, Lehel Csató, Isah Charles Saidu
Published: 2026-06-01
TL;DR: This paper proposes a Group Rank-Constrained Deep Matrix Completion framework that addresses data sparsity in group recommendations by unifying low-rank regularization and attention-based modeling, achieving superior reconstruction accuracy and robust performance across group sizes.
Abstract (Chinese translation)

群体活动的日益流行增加了对能够根据个体偏好为用户群体提供推荐的方法的需求。许多现有的群体推荐系统依赖于聚合个体用户偏好,但它们往往难以处理现实场景中常见的高维且高度稀疏的评分数据。我们提出了一种新颖的框架——群体秩约束深度矩阵补全(Group RC-DMC),该框架通过引入基于 Set-Transformer 聚合器的群体级表示学习来扩展秩约束深度矩阵补全(RC-DMC),联合利用低秩结构和基于注意力的非线性建模。与大多数现有的群体推荐系统不同,Group RC-DMC 在单一框架内统一了显式低秩正则化、线性编码器 - 解码器架构以及基于注意力的非线性群体建模,从而在个体和群体级别上均能产生准确的预测。Group RC-DMC 通过低秩矩阵补全解决数据稀疏性问题,仅基于观测评分计算每个用户的潜在表示,并利用基于周期性奇异值阈值处理的核范数近端步在潜在空间上施加秩约束。解码器被参数化为低秩分解形式,从而实现高效推断。在 MovieLens 和 Goodbooks 数据集上的实验结果表明,Group RC-DMC 实现了卓越的重构精度(以较低的群体 RMSE 衡量),同时在计算效率上保持高效,且在群体层面的精确率、召回率和 F1 分数方面,相较于分解前加权(WBF)和分解后加权(AF)基线具有竞争力。结果表明,该模型能够恢复用户 - 物品交互的底层低秩结构,并在小型、中型和大型用户群体上提供稳健的群体推荐。

Abstract

The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their individual preferences. Many existing group recommender systems rely on aggregating individual user preferences, but they often struggle with high-dimensional and highly sparse rating data commonly found in real-world scenarios. We propose Group Rank-Constrained Deep Matrix Completion (Group RC-DMC), a novel framework that extends RC-DMC by integrating group-level representation learning via a Set-Transformer aggregator, jointly leveraging low-rank structure and attention-based nonlinear modeling. Unlike most existing group recommender systems, Group RC-DMC unifies explicit low-rank regularization, linear encoder-decoder architectures, and attention-based nonlinear group modeling within a single framework, yielding accurate predictions at both the individual and group levels. Group RC-DMC addresses data sparsity through low-rank matrix completion, computing per-user latent representations from observed ratings only, and enforcing a rank constraint on the latent space using a nuclear-norm proximal step based on periodic singular value thresholding. The decoder is parametrized as a low-rank factorization, enabling efficient inference. Experimental results on the MovieLens and Goodbooks datasets demonstrate that Group RC-DMC achieves superior reconstruction accuracy, measured by lower group RMSE, while remaining computationally efficient and competitive in group-level performance in terms of precision, recall, and F1 score compared with weighted-before-factorization (WBF) and after-factorization (AF) baselines. The results highlight the model's ability to recover the underlying low-rank structure of user-item interactions and provide robust group recommendations across small, medium, and large user groups.
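The "nuclear-norm proximal step based on periodic singular value thresholding" mentioned in the abstract can be written out as follows. This is the standard textbook form of the operator, not a formula taken from the paper; the symbols $Z$ (the latent matrix), $\tau$ (the threshold), and $r$ (the number of singular values) are notation assumed here for illustration.

```latex
Z = U \,\mathrm{diag}(\sigma_1, \dots, \sigma_r)\, V^\top,
\qquad
\operatorname{prox}_{\tau \|\cdot\|_*}(Z)
  = U \,\mathrm{diag}\big(\max(\sigma_1 - \tau, 0), \dots, \max(\sigma_r - \tau, 0)\big)\, V^\top
```

Singular values below $\tau$ are set to zero, so each application can only lower or preserve the rank of the latent representation; how often the step is applied ("periodic") and how $\tau$ is chosen are details the abstract does not give.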

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses group recommender systems using matrix completion and Set-Transformers, which is fundamentally different from the keywords targeting multimodal large language models, world models, and reinforcement learning. Only 'Unify Models' has a minor lexical overlap, because the abstract mentions unifying architectural components, but not in the sense of multimodal integration implied by the keyword cluster.

Keywords

Group Recommendation, Matrix Completion, Low-rank Structure, Set-Transformer, Data Sparsity, User-item Interactions, Nuclear-norm Proximal Step, Deep Learning

Score: 1.5 / 27.8
Authors: Xingyu Qu, Wenxuan Zhang, Peng Hu
Published: 2026-06-01
TL;DR: This paper proposes a multi-view fusion framework using YOLO-based detectors to significantly improve space object detection accuracy in LEO constellations by integrating observations from multiple satellite viewpoints.
Abstract (Chinese translation)

随着低地球轨道(LEO)星座中卫星数量的不断增加,近地空间环境日益拥挤,使得空间目标检测(SOD)成为关乎空间安全与可持续性的紧迫挑战。为了降低碰撞风险并确保空间运行的连续性,SOD 系统必须在严格的机载约束下实现快速且准确的检测。本文研究了在深度学习(DL)框架内融合多视角观测以提升 SOD 性能的潜力。我们设计了一种实用的多视角管道以及几种输入表示方法,用于将多视角数据输入到基于 YOLO 的检测器中。实验结果表明,在大多数情况下使用多视角输入是可行的,且通常在 mAP50 和 mAP50-95 指标上表现更佳。例如,在 YOLOv9-m 模型中,与单视角相比,三视角融合 RGB 设置使 mAP50 从 0.638 提升至 0.732,而 mAP50-95 从 0.227 提升至 0.276。与单视角设置相比,最佳三视角灰度配置使 mAP50 提高了 36.3%,mAP50-95 提高了 46.5%。这些发现确立了多视角融合作为一种可行且有效的 SOD 策略,对 LEO 星座部署中的空间态势感知具有重要意义。

Abstract

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly congested, making space object detection (SOD) a pressing challenge for space safety and sustainability. To mitigate collision risks and ensure the continuity of space operations, SOD systems must deliver fast and accurate detection under stringent onboard constraints. In this paper, we investigate the potential of multi-viewpoint observation fusion within a deep learning (DL) framework to enhance SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and typically produces better results for mAP50 and mAP50-95. For example, with YOLOv9-m, moving from the single-view setting to a three-view fused RGB setting raises mAP50 from 0.638 to 0.732 and mAP50-95 from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 1.0/10 1.5
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses Space Object Detection (SOD) using YOLO and multi-view fusion in LEO constellations. It does not involve multimodal large language models (MLLM), world models, reinforcement learning, tokenization, or model unification as defined in the keyword list. Although multi-view fusion combines multiple data streams (RGB/grayscale), it is a computer vision task rather than the multimodal large-model context implied by the keywords. The author list (Xingyu Qu, Wenxuan Zhang, Peng Hu) does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus points apply.

Keywords

Space Object Detection, Multi-Satellite Viewpoints, LEO Constellations, Multi-view Fusion, YOLO-based Detectors, Deep Learning, Space Situational Awareness

Score: 1.5 / 27.8
Authors: Yujia Tong, Yuxi Wang, Yunyang Wan, Tian Zhang, Junhao Dong, Jingling Yuan
Published: 2026-06-01
TL;DR: This paper evaluates whether model compression techniques preserve uncertainty in large language models using conformal prediction, finding that accuracy and uncertainty often decouple despite compression.
Abstract (Chinese translation)

模型压缩技术(如量化和剪枝)被广泛用于降低大语言模型(LLMs)的部署成本,而现有评估几乎完全集中于准确性保持。然而,在安全关键应用中,模型可靠量化自身不确定性的能力同样重要。我们提出一个问题:压缩是否保留了这种能力?为回答这一问题,我们在五个自然语言处理(NLP)任务上,对 12 个大语言模型(LLMs)在各种压缩配置下进行了基准测试,并使用共形预测(conformal prediction)提供了一种严格、无分布的不确定性度量。我们的实验结果表明:(I)压缩经常将准确性与不确定性解耦;(II)较大模型比较小模型更能有效吸收压缩引起的不确定性;(III)不确定性膨胀往往是阈值式的而非渐进式的。这些结果表明,仅基于准确性的评估不足以衡量压缩后大语言模型的部署就绪度,而感知不确定性的基准测试应成为模型压缩流程中的标准组成部分。

Abstract

Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model's ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.
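To make "using conformal prediction to provide a rigorous, distribution-free measure of uncertainty" concrete, here is a minimal sketch of split conformal prediction for classification. It is a generic textbook variant, not the benchmark's code: the nonconformity score (one minus the probability of the true label), the function names, and all numbers are assumptions for illustration.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Calibrated threshold from held-out nonconformity scores.

    cal_scores: one score per calibration example, here assumed to be
    1 - (model probability of the true label).
    alpha: target miscoverage rate (e.g. 0.25 for 75% coverage).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """All labels whose nonconformity score falls within the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= qhat]


# Hypothetical calibration scores and two test-time predictions.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.85]
qhat = conformal_quantile(cal_scores, alpha=0.25)

confident = prediction_set({"A": 0.70, "B": 0.25, "C": 0.05}, qhat)
uncertain = prediction_set({"A": 0.45, "B": 0.42, "C": 0.13}, qhat)
```

The size of the prediction set serves as the uncertainty measure: a model whose accuracy is unchanged after compression but whose sets grow from one label to two has become less certain, which is the kind of decoupling the abstract reports.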

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses LLM compression (quantization/sparsity) and uncertainty quantification using conformal prediction, which belongs to the NLP/LLM safety domain. The keywords target multimodal learning, world models, and reinforcement learning. There is no substantive overlap; only the word 'Unified' in the title ('Unified Benchmark') provides minimal lexical overlap with 'Unify Models'. None of the specified expert authors are present in the author list.

Keywords

Model Compression, Quantization, Uncertainty Quantification, Conformal Prediction, LLMs, Safety-Critical, Benchmark, Sparsity

Score: 1.5 / 27.8
Authors: Wanshuang Gou, Zihan Liu
Published: 2026-06-01
TL;DR: This paper proposes DySCo, a dynamic trust-aware sparse communication mechanism that reduces quadratic overhead in LLM-based multi-agent systems while preserving essential consensus information for reasoning tasks.
Abstract (Chinese translation)

大语言模型(Large Language Model)驱动的多智能体系统(Multi-Agent Systems)通过多轮深思熟虑(Deliberation)、角色专业化(Role Specialization)和交叉验证(Cross-Validation)提升了复杂推理任务的可靠性。然而,现有的多智能体辩论与协作框架通常采用全连接通信(Fully Connected Communication),导致消息数量、Token 成本(Token Costs)和端到端延迟(End-to-End Latency)随智能体数量的增加近似呈二次方增长;尽管固定稀疏拓扑(Fixed Sparse Topologies)能降低开销,但它们无法根据具体任务实例或中间推理状态动态调整通信关系,因而容易陷入保留低价值交互或丢失关键纠错信息的困境。为此,本文提出 DySCo(Dynamic Sparse Consensus),即一种动态信任感知稀疏共识机制。在每一轮推理中,DySCo 基于智能体可靠性、答案分歧及任务相关性估算通信边的价值,并在预算约束下选择少量高价值边进行消息交换;随后,它利用动态信任权重聚合不同智能体的答案,并在共识稳定后提前终止讨论。该机制以按需通信(On-Demand Communication)取代通用广播,从而在保留关键交叉验证信息的同时降低了通信开销。本文进一步分析了通信复杂度(Communication Complexity)和共识稳定性(Consensus Stability),并在数学推理、逻辑推理及事实性问答任务上评估了 DySCo 的性能。

Abstract

Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role specialization, and cross-validation. However, existing multi-agent debate and collaboration frameworks typically adopt fully connected communication, causing the number of messages, token costs, and end-to-end latency to grow approximately quadratically with the number of agents; although fixed sparse topologies reduce overhead, they cannot adapt communication relationships to different task instances or intermediate reasoning states, making them prone either to preserving low-value interactions or to losing critical error-correction information. To address this problem, this paper proposes DySCo (Dynamic Sparse Consensus), a dynamic trust-aware sparse consensus mechanism. In each round of reasoning, DySCo estimates the value of communication edges based on agent reliability, answer divergence, and task relevance, and selects a small number of high-value edges for message exchange under budget constraints; it then aggregates the answers of different agents through dynamic trust weights and terminates the discussion early once consensus stabilizes. This mechanism replaces universal broadcasting with on-demand communication, thereby reducing communication overhead while preserving essential cross-validation information. We further present analyses of communication complexity and consensus stability, and evaluate the performance of DySCo on mathematical reasoning, logical reasoning, and factual question-answering tasks.
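The per-round loop described in the abstract (score edges, keep a few under a budget, aggregate with trust weights) might look roughly like the sketch below. The abstract gives no formulas, so everything specific here is an assumption: the edge value is taken as a simple product of sender reliability, sender task relevance, and a 0/1 answer-divergence indicator, and trust weights are reused as vote weights.

```python
from collections import defaultdict


def select_edges(answers, reliability, relevance, budget):
    """Pick at most `budget` directed edges (sender, receiver) by value.

    Hypothetical edge value: reliability * relevance of the sender, counted
    only when sender and receiver currently disagree.
    """
    scored = []
    for src in answers:
        for dst in answers:
            if src == dst:
                continue
            divergence = 1.0 if answers[src] != answers[dst] else 0.0
            value = reliability[src] * relevance[src] * divergence
            scored.append((value, src, dst))
    # Highest value first; names break ties so the result is deterministic.
    scored.sort(key=lambda e: (-e[0], e[1], e[2]))
    return [(src, dst) for value, src, dst in scored[:budget] if value > 0]


def trust_weighted_answer(answers, trust):
    """Aggregate answers by summing the trust weight behind each one."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += trust[agent]
    return max(totals, key=totals.get)


# Hypothetical round with three agents.
answers = {"a1": "42", "a2": "42", "a3": "41"}
reliability = {"a1": 0.9, "a2": 0.6, "a3": 0.3}
relevance = {"a1": 1.0, "a2": 1.0, "a3": 1.0}

edges = select_edges(answers, reliability, relevance, budget=2)
consensus = trust_weighted_answer(answers, trust=reliability)
```

With three agents, full broadcast would use all six directed edges; this round sends only two messages, both from reliable agents to the one that disagrees. Updating trust between rounds and stopping early once the consensus stabilizes are left out of the sketch.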

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses LLM-based multi-agent consensus via dynamic sparse communication (DySCo), which does not align with the keywords focused on multimodal foundation models, tokenization, visual encoders, world models, or model-based RL. 'MLLM' receives a minimal score (1.0) because the work uses LLMs, but it is not multimodal. No expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are listed among the authors (Wanshuang Gou, Zihan Liu), so no bonus points apply. The weighted total score is 1.5, far below the dynamic passing threshold of 27.8.
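The arithmetic behind the score table and the pass/fail decision can be reproduced in a few lines. The sketch below is inferred from the numbers shown in the report (each row's score equals weight times relevance, and the total is compared with the threshold); the function name and the use of `>=` for passing are assumptions, and expert-author bonus points are not modeled.

```python
def weighted_score(rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _keyword, weight, relevance in rows)


# Rows copied from the score table for this paper: (keyword, weight, relevance out of 10).
rows = [
    ("Unify Models", 1.5, 0.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 1.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 0.0),
]

PASS_THRESHOLD = 27.8  # dynamic threshold quoted in the report

total = weighted_score(rows)
passed = total >= PASS_THRESHOLD  # assumption: scores at the threshold pass
```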

Keywords

LLM-Based Multi-Agent, Sparse Communication Topology, Dynamic Trust-Aware, Consensus Mechanism, Communication Overhead, Multi-Round Deliberation, Task Relevance

Score: 1.5 / 27.8
Authors: Eric Liang
Published: 2026-06-01
TL;DR: SECUREVENT proposes a hybrid AI/ML security monitoring architecture for distributed event-based systems that combines traditional protections with online anomaly detection and graph-aware behavioral features to improve recall over static rules while retaining a low false-positive rate.
Abstract (Chinese translation)

分布式事件驱动系统已成为互联网规模的发布/订阅 (Publish/Subscribe) 服务、物联网 (IoT) 遥测、云原生微服务以及安全运营流水线的常见基础架构。它们的松耦合和异步交付提升了可扩展性,但也扩大了攻击面:发布者、代理、订阅者、主题、模式 (Schema) 及时序均可能被滥用,且没有任何单个组件能够观察到整体行为。本文提出 SECUREVENT,一种面向分布式事件驱动系统的混合 AI/ML 安全监控架构。该架构将传统保护措施(如认证传输、主题级授权和签名事件)与在线异常检测、图感知行为特征、复杂事件处理 (CEP) 策略规则、联邦学习以及对抗性机器学习 (Adversarial ML) 治理相结合。一项针对合成事件流攻击的确定性原型研究展示了混合 AI/CEP 监控器如何在保持低误报率的同时,相较于静态规则提高召回率。核心论点并非机器学习取代密码学和访问控制机制,而是当事件流、身份、模式及时序关系过于动态,仅靠静态控制不足时,基于模型的安全监控是必要的。

Abstract

Distributed event-based systems have become a common substrate for Internet-scale publish/subscribe services, IoT telemetry, cloud-native microservices, and security operations pipelines. Their loose coupling and asynchronous delivery improve scalability, but they also expand the attack surface: publishers, brokers, subscribers, topics, schemas, and temporal ordering can each be abused without a single component observing the whole behavior. This paper proposes SECUREVENT, a hybrid AI/ML security-monitoring architecture for distributed event-based systems. The architecture combines traditional protections such as authenticated transport, topic-level authorization, and signed events with online anomaly detection, graph-aware behavioral features, complex-event policy rules, federated learning, and adversarial-ML governance. A deterministic prototype study over synthetic event-stream attacks illustrates how a hybrid AI/CEP monitor can improve recall over static rules while retaining a low false-positive rate. The central claim is not that machine learning replaces cryptographic and access-control mechanisms, but that model-based security monitoring is necessary when event flows, identities, schemas, and timing relationships are too dynamic for static controls alone.

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses security monitoring for distributed event-based systems using hybrid AI/ML and complex event processing, which is unrelated to the keywords concerning multimodal large models, world models, or reinforcement learning. Although it mentions 'model-based' monitoring, it does not involve RL, tokenizers, visual encoders, or multimodal architectures. 'Unify Models' receives a minimal score because of the hybrid AI/ML integration, but the work does not match that research paradigm.

Keywords

Distributed event-based systems, Security monitoring, Hybrid AI/ML, Anomaly detection, Graph-aware behavioral features, Complex-event policy rules, Federated learning

Score: 1.5 / 27.8
Authors: Yekyung Kim, Yapei Chang, Chau Minh Pham, Mohit Iyyer
Published: 2026-06-01
TL;DR: This paper investigates how Large Language Models tend to generate less diverse and more convergent arguments compared to humans in public debates, a phenomenon termed 'argument collapse'.
Abstract (Chinese translation)

随着大型语言模型(LLM)越来越多地被用于起草面向公众的论点,它们可能会通过反复引入相同打磨得当且看似合理的论点,使公共辩论趋于扁平化。本文研究“论点坍塌”现象,即不同 LLM 生成的文章倾向于汇聚到更小的主要论点、子论点及段落级结构集合。我们比较了来自 195 场《纽约时报》(NYT)辩论的 1,039 条人类回应、来自 61 个《波士顿评论》(BR)长篇论坛的 448 条人类回应,以及 23,384 篇由 LLM 生成的文章。在《纽约时报》(NYT)语料库中,65.3% 的人类主要论点在单个辩论中是独特的,而 LLM 的主要论点中这一比例仅为 3.4%。虽然要求 LLM 生成多样化回答可以增加多样性,但典型 LLM 仅能覆盖到约一半的独特人类主要论点,且大部分新增的多样性超出了观察到的人类论点空间范围。论点坍塌现象同样出现在子论点中;在具有相同主要论点的文章中,41.0% 的人类子论点是独特的,而 LLM 回应中这一比例仅为 9.1%。定性分析显示,LLM 经常重复使用泛化且留有余地的子论点,而人类则更倾向于使用更具体且主题特定的子论点。从结构上看,LLM 生成的文章倾向于遵循更固定的结构模式,通常以直接主张开头,并迅速转向具体建议。同样的模式在《波士顿评论》(BR)的长篇文章中同样存在,这表明论点坍塌现象不仅限于短形式回应。

Abstract

As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge to a smaller set of main arguments, sub-arguments, and paragraph-level structures. We compare 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 longer-form Boston Review (BR) forums, and 23,384 LLM-generated essays. In the NYT corpus, 65.3% of human main arguments are unique within a debate, compared to 3.4% of LLM main arguments. Asking LLMs to generate diverse answers adds variation, but a typical model recovers only about half of the distinct human main arguments, with much of the added variation falling outside the observed human argument space. Collapse also appears in sub-arguments, where among essays with the same main argument, 41.0% of human sub-arguments are unique versus 9.1% from LLM responses. Qualitatively, LLMs often reuse generalized and hedged sub-arguments, while humans prefer more concrete and topic-specific ones. Structure-wise, LLM-generated essays tend to follow a more fixed arc, often opening with a direct claim and moving quickly toward proposals. The same patterns hold in longer BR essays, suggesting that argument collapse extends beyond short-form responses.
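A statistic such as "65.3% of human main arguments are unique within a debate" can be computed along the following lines once every essay has been assigned a canonical argument label. The abstract does not say how arguments are matched or how uniqueness is defined, so this sketch assumes an argument is unique when its label occurs exactly once in the debate; the labels below are invented.

```python
from collections import Counter


def unique_fraction(argument_labels):
    """Share of responses whose main-argument label appears exactly once."""
    counts = Counter(argument_labels)
    unique = sum(1 for label in argument_labels if counts[label] == 1)
    return unique / len(argument_labels)


# Hypothetical labels for one debate: five human and five LLM responses.
human = ["cost", "privacy", "fairness", "jobs", "cost"]
llm = ["cost", "cost", "cost", "privacy", "cost"]

human_rate = unique_fraction(human)
llm_rate = unique_fraction(llm)
```

Here three of five human arguments are unique against one of five for the LLM essays, the same direction of gap, on a much smaller scale, that the abstract reports for the NYT corpus.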

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 1.0/10 1.5
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper analyzes argument diversity in LLM-generated text, which is unrelated to Unify Models, Tokenizers, Visual Encoders, World Models, multimodal architectures, or model-based RL. The MLLM score is minimal and reflects only the use of LLMs. None of the specified expert authors appear in the author list.

Keywords

Argument Collapse, LLMs, Public Debate, Text Generation, Diversity, Human Arguments, Long-Form

Score: 1.5 / 27.8
Authors: Joy Bose
Published: 2026-06-01
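TL;DR: This paper proposes PatentXAI, a patent valuation framework based on graph-conditioned hierarchical Shapley attribution that uses a knowledge graph and Markov blankets to efficiently estimate an individual patent's economic contribution to a product.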
Abstract (Chinese translation)

估算单个专利在包含数以万计专利的产品中的经济贡献,一直是知识产权经济学中长期未解决的难题。我们提出 PatentXAI,这是一个将专利估值视为可解释人工智能(Explainable AI)问题的框架:给定一个特征函数 v(S),用于编码专利子集 S 可实现收入,专利的 Shapley 值衡量其对产品利润的公平份额,该衡量方式满足效率性、对称性、虚拟性和可加性。为了使计算可行,我们将每个专利的联盟限制在知识图谱内的马尔可夫毯(Markov Blanket),该限制基于 C-SVE 条件独立性定理(Li et al., 2020)。基于帕累托分布覆盖图进行的扩展实验(专利数量 n 从 12 增至 100)显示,在 n=100 时,马尔可夫毯大小的中位数为 n 的 32.9%,第 90 百分位毯大小为 n 的 55.2%,且每个专利的计算耗时为 10 毫秒。在 n=12 时,与精确真实值的差异为 0.088;在 n=100 时,与高样本蒙特卡洛参考值的差异为 0.062 ± 0.003。密集组件实验表明,当 80% 的专利共享同一组件时,马尔可夫毯会正确扩展以覆盖该密集簇,此时与参考值的差异降至 0.039,因为池化计算在同质专利组合上更为准确。利润分配是分层次进行的:首先使用精确 Shapley 值在宏观组件间分配总利润,然后使用中心性加权 Shapley 值将每个组件的预算分配给覆盖该组件的专利。从真实数据估计 v(S) 是主要的开放性问题;我们将此与计算贡献区分开来,并概述了利用公共 ETSI、USPTO 和 Lens.org 数据集进行实证验证的具体路线图。

Abstract

Estimating the economic contribution of a single patent inside a product that embodies tens of thousands of patents is a long-standing unsolved problem in intellectual property economics. We propose PatentXAI, a framework that treats patent valuation as a problem of explainable AI: given a characteristic function v(S) encoding the revenue achievable by patent subset S, a patent's Shapley value measures its fair share of product profit in a way that satisfies efficiency, symmetry, dummy, and additivity. To make computation tractable we restrict each patent's coalition to its Markov Blanket inside a knowledge graph, grounded in the C-SVE conditional independence theorem (Li et al., 2020). Scaling experiments from n=12 to n=100 patents using Pareto-distributed coverage graphs report median Markov Blanket size of 32.9 percent of n at n=100, with 90th-percentile blanket size of 55.2 percent of n, and runtime of 10 milliseconds per patent. Difference against exact ground truth at n=12 is 0.088; difference against a high-sample Monte Carlo reference at n=100 is 0.062 plus or minus 0.003. A dense-component experiment shows that when 80 percent of patents share one component, the blanket correctly expands to cover that dense cluster, and the difference versus reference falls to 0.039 because the pooled computation becomes more accurate on homogeneous portfolios. Profit allocation proceeds hierarchically: exact Shapley distributes total profit among macro-components, then centrality-weighted Shapley distributes each component budget among covering patents. Estimating v(S) from real data is the primary open problem; we distinguish this from the computational contribution and outline a concrete roadmap for empirical validation using public ETSI, USPTO, and Lens.org datasets.
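For readers unfamiliar with the Shapley value that the framework builds on, the sketch below computes it exactly for a three-patent toy portfolio by averaging marginal contributions over all orderings. It shows only the exact baseline, whose cost grows factorially with n; the Markov Blanket restriction and the hierarchical allocation described in the abstract are not implemented. The characteristic function `toy_revenue` and its numbers are invented for illustration.

```python
from itertools import permutations


def shapley_values(players, v):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += v(with_p) - v(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def toy_revenue(coalition):
    """Hypothetical v(S): A and B are complements, C earns on its own."""
    value = 0.0
    if "A" in coalition and "B" in coalition:
        value += 10.0  # this component needs both patents
    if "C" in coalition:
        value += 4.0
    return value


shares = shapley_values(["A", "B", "C"], toy_revenue)
```

The two complementary patents split their joint revenue equally (symmetry), and the three shares add up to the full-portfolio revenue of 14 (efficiency), two of the axioms named in the abstract.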

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

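Scoring rationale: The paper studies patent valuation and explainable AI (Shapley values, knowledge graphs), which is entirely unrelated to multimodal large models (MLLM), world models, visual encoders, tokenizers, or model-based reinforcement learning. Only 'Unify Models' has a slight conceptual link, because the paper proposes a unified framework, but this is not a unification of model architectures. The author list contains none of the specified experts, so no bonus points apply.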

Keywords

Patent Valuation, Shapley Value, Knowledge Graph, Markov Blanket, Explainable AI, Hierarchical Attribution, Profit Allocation, PatentXAI

Score: 1.5 / 27.8
Authors: Bradley G. Karat, Maëliss Jallais, Ali R. Khan, Santiago Aja-Fernández, Jelle Veraart, Marco Palombo
Published: 2026-06-01
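TL;DR: This paper proposes a realistic noise synthesis framework that mitigates the distribution mismatch between simulated and acquired signals in supervised learning, reducing bias and improving precision in tissue microstructure estimation for diffusion MRI.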
Abstract (Chinese translation)

扩散磁共振成像(Diffusion MRI)能够非侵入式地探测组织微结构,然而准确的参数估计受到噪声相关效应的制约。在基于模拟数据训练的监督机器学习框架中,模拟信号与采集信号之间的噪声特性差异会引入一种协变量偏移(covariate shift),导致训练与推理阶段的输入信号分布存在差异。我们研究了这种不匹配对微结构参数估计的影响,并提出了一种真实噪声合成(RNS)框架以缓解该问题。RNS 将莱斯期望(Rician expectation)和有效后处理噪声方差纳入模拟训练信号中。莱斯期望采用通过 MPPCA(Marchenko-Pastur 主成分分析)估计的噪声标准差进行建模,而有效标准差则源自预处理数据的球谐函数残差。该方法在多个信噪比(SNR)水平的模拟数据集以及具有重复采集的体内扩散数据上,利用圆柱-飞艇(cylinder-zeppelin)模型和 SANDI 模型进行了评估。同时还评估了该方法对噪声误估计的敏感性。训练过程中忽略幅度诱导噪声效应会产生系统性的、依赖于信噪比的参数偏差,尤其在低信噪比情况下更为显著。纳入莱斯期望可将偏差显著降低至与噪声感知非线性最小二乘拟合相当的水平。对有效标准差进行建模进一步提高了估计精度。该方法性能在很大程度上独立于回归架构,但对噪声估计的准确性敏感。这些发现表明,模拟训练数据中的真实噪声建模能够缓解信号域的协变量偏移,对于实现无偏的监督微结构估计至关重要,尤其是在高 b 值或高空间分辨率相关的低信噪比(SNR)情况下。

Abstract

Diffusion MRI enables non-invasive probing of tissue microstructure, but accurate parameter estimation is challenged by noise-related effects. In supervised machine learning frameworks trained on simulated data, discrepancies between the noise characteristics of simulated and acquired signals introduce a form of covariate shift, whereby the input signal distribution differs between training and inference. We investigated the impact of this mismatch on microstructure parameter estimation and propose a realistic noise synthesis (RNS) framework to mitigate it. RNS incorporates both the Rician expectation and the effective post-processing noise variance into simulated training signals. The Rician expectation was modelled using a noise standard deviation estimated with MPPCA, while the effective standard deviation was derived from spherical harmonic residuals of preprocessed data. The method was evaluated using the cylinder-zeppelin and the SANDI models on simulated datasets across multiple SNR levels and on in vivo diffusion data with repeated acquisitions. Sensitivity to noise misestimation was also assessed. Ignoring magnitude-induced noise effects during training produced systematic, SNR-dependent parameter bias, particularly at low SNR. Incorporating the Rician expectation substantially reduced bias to the level of noise-aware nonlinear least-squares fitting. Modelling the effective standard deviation further improved precision. Performance was largely independent of regression architecture but sensitive to accurate noise estimation. These findings demonstrate that realistic noise modelling in simulated training data mitigates signal-domain covariate shift and is essential for unbiased supervised microstructure estimation, particularly in low-SNR regimes associated with high b-values or high spatial resolution.
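The "magnitude-induced noise effects" behind the Rician expectation can be demonstrated with a short simulation. This is a generic illustration, not the RNS framework: it assumes the magnitude signal is the modulus of a true signal plus independent Gaussian noise of standard deviation `sigma` in the real and imaginary channels, and estimates the mean measured magnitude by Monte Carlo. All values are arbitrary.

```python
import math
import random


def mean_magnitude(signal, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of the expected magnitude of a noisy complex signal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        real = signal + rng.gauss(0.0, sigma)  # noise in the real channel
        imag = rng.gauss(0.0, sigma)           # noise in the imaginary channel
        total += math.hypot(real, imag)        # magnitude reconstruction
    return total / n_samples


high_snr = mean_magnitude(signal=10.0, sigma=1.0)  # close to the true signal
low_snr = mean_magnitude(signal=1.0, sigma=1.0)    # noticeably above 1.0
noise_only = mean_magnitude(signal=0.0, sigma=1.0) # near sigma * sqrt(pi / 2)
```

At high SNR the mean magnitude stays near the true signal, while with no signal at all it settles near $\sigma\sqrt{\pi/2} \approx 1.25\sigma$ rather than zero. Training data simulated with zero-mean additive noise misses this SNR-dependent upward shift, which is the covariate shift the abstract describes.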

Scoring details
Keyword Weight Relevance Score
Unify Models 1.5 1.0/10 1.5
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

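Scoring rationale: The paper addresses noise modelling and tissue microstructure estimation in diffusion MRI using supervised machine learning. The keywords (MLLM, Tokenizer, World Models, RL, and so on) belong to the large-model and reinforcement-learning domains and are largely unrelated to this medical-imaging signal-processing topic. Only 'Unify Models' is loosely related, because the paper proposes a unified noise synthesis framework; the remaining keywords are entirely irrelevant.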

Keywords

Diffusion MRI, Tissue microstructure estimation, Realistic noise synthesis, Supervised machine learning, Covariate shift, Rician noise, Parameter bias, Signal processing

Score: 1.5 / 27.8
Authors: Vladimir Beskorovainyi
Published: 2026-06-01
TL;DR: This paper proposes a rule-plus-bag-of-words pipeline with reliability-weighted human-in-the-loop labeling to accurately map noisy retail product names to consumer-price categories, achieving high F1 scores with minimal labeled data.
Abstract

Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) -- a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
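Steps (i) and (ii) of the pipeline can be sketched as follows. This is a minimal illustration, not the paper's code: the class `PhraseTrie`, the functions `normalize` and `pre_classify`, the `"$label"` end marker, and the veto semantics for stop-phrases are all assumptions.

```python
class PhraseTrie:
    """Token-level prefix tree mapping phrases to category labels."""

    def __init__(self):
        self.root = {}

    def add(self, phrase, label):
        node = self.root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node["$label"] = label  # "$label" marks the end of a stored phrase

    def find_all(self, tokens):
        """Return the labels of every stored phrase that occurs in `tokens`."""
        found = []
        for start in range(len(tokens)):
            node = self.root
            for token in tokens[start:]:
                if token not in node:
                    break
                node = node[token]
                if "$label" in node:
                    found.append(node["$label"])
        return found


def normalize(name):
    """Lowercase and keep alphanumeric tokens only (a deliberately crude normalizer)."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return cleaned.split()


def pre_classify(name, key_trie, stop_trie):
    """Tentatively assign categories: key-phrase hits minus stop-phrase vetoes."""
    tokens = normalize(name)
    vetoed = set(stop_trie.find_all(tokens))
    hits = []
    for category in key_trie.find_all(tokens):
        if category not in vetoed and category not in hits:
            hits.append(category)
    return hits
```

In the described pipeline, the tentative categories returned here would then be passed to the per-category binary confirmation model of step (iii).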

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 1.0/10 1.5
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses text classification for economic data using traditional ML pipelines (rule-based + bag-of-words), lacking any connection to multimodal learning, world models, or reinforcement learning. 'Tokenizer' is marginally relevant as a preprocessing step, while all other keywords are completely unrelated to the methodology or domain.

Keywords

Machine Learning, Retail Product Names, Consumer-Price Categories, Rule-plus-Bag-of-Words, Human-in-the-Loop, Text Classification, Labeling Protocol

Score: 1.5 / 27.8
Authors: Anis Yassine Ben Mabrouk, Antoine Tadros, Rafael Grompone von Gioi, Gabriele Facciolo, Axel Davy, Rodrigo Verschae
Published: 2026-06-01
TL;DR: This paper investigates generalization limits in vehicle re-identification, revealing that state-of-the-art methods fail to generalize to unseen vehicle types and lack robustness to viewpoint changes beyond training data.
Abstract

Vehicle re-identification focuses on retrieving images of the same vehicle from a gallery given a query image. Upon closer inspection of commonly used datasets, we observe that vehicles with few visual differences (e.g., the same make, model, and color) appear in both the training and test sets. As a result, methods that effectively memorize the training data tend to perform well on these test sets but struggle to generalize to other datasets. In this paper, we address this issue by proposing a novel evaluation approach that more effectively measures generalization capability to unseen vehicle types. To further study generalization performance, we also propose splitting the evaluation based on view, allowing us to differentiate the effect of viewpoint robustness from that of same-view re-identification. Our findings reveal that most state-of-the-art methods struggle with unseen vehicle types, and that their robustness to viewpoint changes and attention to detail are limited to vehicle types seen during training.
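One plausible way to keep vehicle types disjoint between training and test data is sketched below. The abstract does not specify the protocol, so the record fields (`make`, `model`, `color`), the function names, and the random type-level split are assumptions for illustration only.

```python
import random


def vehicle_type(record):
    """A vehicle 'type' is taken here to be its (make, model, color) triple."""
    return (record["make"], record["model"], record["color"])


def split_by_vehicle_type(records, test_fraction=0.3, seed=0):
    """Split records so that no vehicle type appears on both sides."""
    types = sorted({vehicle_type(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(types)
    n_test = max(1, round(len(types) * test_fraction))
    test_types = set(types[:n_test])
    train = [r for r in records if vehicle_type(r) not in test_types]
    test = [r for r in records if vehicle_type(r) in test_types]
    return train, test
```

Splitting at the type level rather than the image level means a model cannot score well on the test side merely by memorizing the appearance of a make, model, and color seen in training.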

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on vehicle re-identification and generalization evaluation in computer vision. The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning, which are not addressed in the paper. Only 'Visual Encoder' has a tangential connection as image-based methods utilize feature extractors, though the paper focuses on evaluation methodology rather than encoder architecture.

Keywords

Vehicle Re-identification, Generalization Limits, Unseen Vehicle Types, Viewpoint Robustness, Evaluation Approach, Image Retrieval, Memorization vs Generalization

Score: 1.5 / 27.8
Authors: Zhenyu Li, Tianyi Shang
Published: 2026-06-01
TL;DR: This paper proposes an adversarial attack framework (LPQN) that perturbs visual feature encodings to mislead robot localization retrieval systems, exposing critical vulnerabilities in practical applications.
Abstract

Robot localization systems are critical for autonomous navigation and safety. Adversarial perturbations can mislead these systems, resulting in mislocalization, navigation errors, or unsafe interactions, especially in mission-critical scenarios. This paper investigates the vulnerability of deep learning based localization pipelines to adversarial attacks. We propose a novel framework for generating adversarial queries that specifically target Product Quantization (PQ) in visual localization systems. Our method employs a Lightweight Product Quantization Network (LPQN) to perturb query feature encodings, misleading the retrieval process by returning semantically irrelevant database entries. Adversarial queries are generated via a two-phase procedure: a forward pass that perturbs feature distributions and a backward pass that refines the perturbation through optimization. The lightweight design of LPQN allows the creation of subtle yet highly effective perturbations with minimal computational overhead. Extensive experiments in both controlled and real-world robotic environments demonstrate that our approach substantially degrades PQN performance, exposing critical vulnerabilities in practical applications.
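To see why product quantization is a natural attack surface, the sketch below encodes a feature vector as one codeword index per sub-vector and shows a small perturbation flipping a code. It is a generic PQ encoder, not the LPQN attack; the toy codebooks and vectors are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Encode `vector` as one nearest-centroid index per sub-vector.

    `codebooks[m]` lists the centroids of the m-th sub-space; the vector is
    cut into len(codebooks) contiguous pieces of equal length.
    """
    sub_len = len(vector) // len(codebooks)
    code = []
    for m, centroids in enumerate(codebooks):
        sub = vector[m * sub_len:(m + 1) * sub_len]
        best = min(range(len(centroids)),
                   key=lambda k: squared_distance(sub, centroids[k]))
        code.append(best)
    return code


# Hypothetical two-sub-space codebooks with two centroids each.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
clean = [0.45, 0.45, 0.9, 0.9]
perturbed = [0.55, 0.55, 0.9, 0.9]  # shifted by 0.1 in the first sub-vector
print(pq_encode(clean, codebooks), pq_encode(perturbed, codebooks))
```

Because retrieval compares codes rather than raw features, a query that sits near a cell boundary can be pushed into a different cell with little change to the input, which is the kind of fragility the paper sets out to exploit.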

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 1.0/10 1.5
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on adversarial attacks in robot localization via feature perturbation, which is largely unrelated to the provided keywords concerning Multimodal LLMs, World Models, and Model-Based RL. 'Visual Encoder' receives a minimal score (1.0) due to the involvement of visual feature encodings, but the paper does not address representation learning or model unification. No expert authors from the specified list are present. The total weighted score is 1.5, well below the passing threshold of 27.8.
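The arithmetic behind the score table can be reproduced with a short sketch. The rows of the table are consistent with score = weight × relevance summed over keywords; the function name `weighted_score` and the use of `>=` for the pass check are assumptions, since the report does not say how the 27.8 threshold is applied or derived.

```python
KEYWORDS = [
    "Unify Models", "Tokenizer", "Visual Encoder", "World Models",
    "MLLM", "MultiModal", "model-based RL",
]
PASS_THRESHOLD = 27.8  # taken from the report; its derivation is not stated


def weighted_score(weights, relevance):
    """Sum of weight * relevance over all keywords."""
    return sum(weights[k] * relevance[k] for k in weights)


weights = {k: 1.5 for k in KEYWORDS}
relevance = {k: 0.0 for k in KEYWORDS}
relevance["Visual Encoder"] = 1.0  # the only non-zero row in this table

total = weighted_score(weights, relevance)
passed = total >= PASS_THRESHOLD  # assumed comparison
print(total, passed)
```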

Keywords

Adversarial Attacks, Robot Localization, Deep Feature Perturbation, Product Quantization, Visual Localization, Lightweight Product Quantization Network, Retrieval Process

Score: 0.0 / 27.8
Authors: Zahra Tabatabaei, Diana Soto Aguilar, Jose C. Bonilla, Mathias P. Clausen, Jon Sporring
Published: 2026-06-01
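TL;DR: This paper proposes a computational toolbox combining topological data analysis and fractal geometry methods to quantify microstructural dynamics during casein gelation and relate them to rheological properties.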
Abstract

We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP), applied to time-lapse super-resolution STED microscopy images of sodium caseinate gelation induced by glucono-delta-lactone (GDL) at 30 °C and 40 °C and two GDL concentrations (1.8% and 3.5% w/v). TDA tracked topological loops, closed ring-like structures reflecting protein network interconnectivity, via max-Betti-1 curves, which revealed a lag phase of dispersed aggregates, a sharp decay coinciding with network percolation and the rheologically observed sol-gel transition, and a post-gelation increase corresponding to network rearrangements. These topological transitions were corroborated by DBC and MFP as these methods were able to resolve changes in structural complexity and spatial heterogeneity. The toolbox was validated on simulated fractal images prior to experimental application. Together, these descriptors provided sensitivity to subtle microstructural transitions that bulk rheology captured as averaged bulk mechanical responses. This integrated approach provides a robust quantitative tool for characterizing complex microstructure in food and material science with evolving microstructural dynamics. Code is available at https://github.com/Zahratabatabaei/Delifood_CV_paper.git
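The fractal descriptors in the toolbox build on box counting. The sketch below shows the plain binary box-counting estimate of fractal dimension, which is simpler than the Differential Box Counting used in the paper (DBC works on grayscale intensity surfaces); the function names and the point-set representation are assumptions.

```python
import math


def box_count(points, box_size):
    """Number of grid boxes of side `box_size` containing at least one point."""
    return len({(x // box_size, y // box_size) for x, y in points})


def box_counting_dimension(points, box_sizes):
    """Least-squares slope of log N(s) against log(1/s)."""
    xs = [math.log(1.0 / s) for s in box_sizes]
    ys = [math.log(box_count(points, s)) for s in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Sanity shapes: a filled square should be 2-dimensional, a line 1-dimensional.
filled_square = [(x, y) for x in range(64) for y in range(64)]
diagonal_line = [(i, i) for i in range(64)]
```

Checking the estimator on shapes of known dimension mirrors the abstract's remark that the toolbox was validated on simulated fractal images before being applied to experimental data.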

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

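Scoring rationale: The paper belongs to food and materials science; it uses topological data analysis (TDA) and classical image-processing algorithms (such as LBP and DBC) to analyze microscopy images of casein gelation. The supplied keywords (Unify Models, Tokenizer, MLLM, model-based RL, etc.) all point to artificial intelligence, large-model architectures, and reinforcement learning. The paper involves no deep learning models, large language models, or reinforcement learning algorithms and does not match the keyword topics at all, so every relevance score is 0.

Keywords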

Topological Data Analysis, Microscopy Images, Casein Gelation, Rheological Properties, Texture Analysis, Protein Network, STED Microscopy

Score: 0.0 / 27.8
Authors: Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani
Published: 2026-06-01
TL;DR: This paper identifies a fundamental mismatch between ranking metrics and assignment objectives in multi-view object association and proposes Sinkhorn-based normalization to align evaluation with the actual task goal.
Abstract

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
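A minimal sketch of Sinkhorn normalization helps show how a post-processing step can change pairwise rankings without changing the underlying assignment. The function name, the iteration count, and the 2 x 2 score matrix are hypothetical; the paper's exact normalization and its tuned parameters are not given in the abstract.

```python
def sinkhorn_normalize(scores, n_iters=50):
    """Alternately rescale rows and columns of a positive square matrix to sum to 1."""
    matrix = [row[:] for row in scores]
    n = len(matrix)
    for _ in range(n_iters):
        for i in range(n):
            row_sum = sum(matrix[i])
            matrix[i] = [v / row_sum for v in matrix[i]]
        for j in range(n):
            col_sum = sum(matrix[i][j] for i in range(n))
            for i in range(n):
                matrix[i][j] /= col_sum
    return matrix


# Hypothetical similarity scores; the true matches are on the diagonal.
raw = [[0.9, 0.8],
       [0.2, 0.6]]
balanced = sinkhorn_normalize(raw)
```

In `raw`, the wrong pair (0, 1) scores 0.8 and outranks the true pair (1, 1) at 0.6, so a pairwise ranking metric is penalized even though the row-wise best match is already correct. After normalization both diagonal entries exceed both off-diagonal entries, so the ranking improves while the assignment is unchanged.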

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on computer vision metric mismatch in multi-view object association, whereas the provided keywords target multimodal large models, world models, and reinforcement learning. There is no substantive overlap in methodology, architecture, or research objective between the paper and the specified keywords.

Keywords

Multi-view object association, Metric mismatch, Ranking metrics, Assignment objective, Sinkhorn normalization, One-to-one matching, Computer vision evaluation

Score: 0.0 / 27.8
Authors: Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov
Published: 2026-06-01
Abstract

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: https://github.com/brain-lab-research/quantized-reasoning.
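The detection half of loop rescue can be approximated by counting repeated n-grams in the generated trace. The abstract does not describe the actual detector, so the function name, the n-gram length, and the repeat threshold below are assumptions.

```python
def has_repetitive_loop(tokens, ngram=4, max_repeats=3):
    """Return True when any n-gram of length `ngram` occurs more than `max_repeats` times."""
    counts = {}
    for i in range(len(tokens) - ngram + 1):
        key = tuple(tokens[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


looping_trace = "so x equals 2 wait let me check".split() * 5
healthy_trace = "first expand the square then collect terms and solve for x".split()
print(has_repetitive_loop(looping_trace), has_repetitive_loop(healthy_trace))
```

Once a loop is flagged, the controller described in the abstract either commits to an earlier answer or falls back to FP16; that decision logic is not sketched here.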

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 75 (char 298)

Score: 0.0 / 27.8
Authors: Fabian Hoppe, Melven Röhrig-Zöllner, Philipp Knechtges
Published: 2026-06-01
TL;DR: This paper investigates using LLMs within a verifier-guided evolutionary framework to optimize contraction order in tensor networks, highlighting the potential and challenges of LLM-driven algorithm development.
Abstract

We consider LLM-based algorithm development through a case study on contraction order optimisation for tensor networks with OpenEvolve. We pay particular attention to the choice of the LLM as well as design choices such as evaluation metric and test instances. Our results highlight both the promise of verifier-guided evolutionary coding agents for algorithm development/improvement and the continuing importance of evaluation, validation, and interpretation -- and corresponding challenges -- by the human scientist.
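Why the contraction order matters can be seen in the simplest tensor network, a chain of three matrices. The sketch below counts scalar multiplications for the two possible orders; the function names and dimensions are illustrative and unrelated to the paper's test instances.

```python
def matmul_cost(rows, inner, cols):
    """Scalar multiplications for a (rows x inner) @ (inner x cols) product."""
    return rows * inner * cols


def three_matrix_costs(a, b, c, d):
    """Costs of contracting A (a x b), B (b x c), C (c x d) in both orders."""
    left_first = matmul_cost(a, b, c) + matmul_cost(a, c, d)    # (A B) C
    right_first = matmul_cost(b, c, d) + matmul_cost(a, b, d)   # A (B C)
    return left_first, right_first


print(three_matrix_costs(10, 1000, 10, 1000))
```

With these dimensions the two orders differ by a factor of 100, and for general networks the number of candidate orders grows rapidly, which is why heuristic search over orders is worth automating.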

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on LLM-based algorithm development for tensor network contraction optimization using evolutionary coding agents. It does not address multimodal learning, world models, visual encoders, tokenizers, or model-based reinforcement learning, resulting in zero relevance to all specified keywords.

Keywords

LLM-based algorithm development, Contraction order optimization, Tensor networks, Verifier-guided evolutionary coding agents, Evaluation metric, Test instances, Algorithm development

Score: 0.0 / 27.8
Authors: Meredith Ringel Morris
Published: 2026-06-01
Abstract

Public discourse on AI has become polarized; exaggerated positions on AI in traditional and social media threaten the development of AI Literacy among the general public. In this article, I introduce the VET Framework, a method for categorizing AI discourse along the dimensions of valence, effectiveness, and trajectory. I show how this framework can be used to identify, compare, and critique prevalent narratives of AI Hype, AI Doom, AI Denial, and AI Normalcy. Using VET, I analyze how each of these four stances exaggerates some aspects of the current state and/or likely evolution of AI, and illustrate how the VET framework can serve as an AI Literacy tool by supporting the "vetting" of polarized AI discourse.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: Scoring failed: Expecting value: line 13 column 11 (char 402)

Score: 0.0 / 27.8
Authors: Keito Inoshita, Takato Ueno
Published: 2026-06-01
TL;DR: This paper proposes a Bayesian spectral framework to discover emotion transition structures from multi-annotator disagreement, preserving uncertainty signals rather than compressing them into hard labels.
Abstract

Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging from mental-health screening to dialogue systems. However, existing studies typically compress multi-rater judgments into a single hard label by majority voting, discarding the uncertainty signal needed to understand turn-to-turn transitions. In this article, we propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that discovers emotion-transition structure from multi-rater soft labels. In the first stage, a hierarchical Dirichlet-Multinomial posterior is constructed through the outer product of soft labels, equipping each cell of the K x K transition matrix with a credible interval and Benjamini-Hochberg (BH) false discovery rate (FDR)-controlled significance. In the second stage, the symmetrized graph Laplacian is spectrally decomposed to separate a low-frequency (inertia) component from a high-frequency (contagion) component. On EmotionLines, BSETD simultaneously recovers the signatures of two distinct affective spaces: the Plutchik-adjacent transitions disgust to anger (log2 lift +0.94) and anger to disgust (+0.86) are over-represented, while the Russell-valence-reversed transitions joy to anger (-0.90) and anger to joy (-0.89) are under-represented. A five-source cross-corpus validation yields pairwise Pearson correlations in 0.91-0.98 within English, 0.79-0.85 against Chinese M3ED, and 0.979 between the human hard labels and the LLM virtual soft labels on the same utterance set, demonstrating that a pipeline preserving annotator uncertainty bridges the computational study of emotion dynamics with established psychological theory.
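The first-stage idea of accumulating outer products of consecutive soft labels, and the log2 lift values quoted above, can be sketched as follows. This omits the hierarchical Dirichlet-Multinomial posterior, credible intervals, and BH-FDR control entirely, and the lift definition used here (log2 of observed joint mass over the product of marginals) is an assumption rather than the paper's stated formula.

```python
import math


def transition_counts(soft_labels):
    """Sum outer products of consecutive soft labels into a K x K soft count matrix."""
    k = len(soft_labels[0])
    counts = [[0.0] * k for _ in range(k)]
    for prev, nxt in zip(soft_labels, soft_labels[1:]):
        for i in range(k):
            for j in range(k):
                counts[i][j] += prev[i] * nxt[j]
    return counts


def log2_lift(counts):
    """Cell-wise log2(observed / expected under independence); -inf for empty cells."""
    k = len(counts)
    total = sum(sum(row) for row in counts)
    row_mass = [sum(row) / total for row in counts]
    col_mass = [sum(counts[i][j] for i in range(k)) / total for j in range(k)]
    lift = []
    for i in range(k):
        lift_row = []
        for j in range(k):
            observed = counts[i][j] / total
            expected = row_mass[i] * col_mass[j]
            lift_row.append(math.log2(observed / expected) if observed > 0 else float("-inf"))
        lift.append(lift_row)
    return lift
```

Under this definition a positive value marks a transition that occurs more often than independence would predict and a negative value marks one that occurs less often, matching the over- and under-representation language in the abstract.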

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper proposes a Bayesian spectral framework for discovering emotion transition structures from multi-annotator disagreement in dialogue systems. It focuses on statistical analysis (Dirichlet-Multinomial, spectral decomposition) and psychology theory, with no connection to Unify Models, Tokenizers, Visual Encoders, World Models, MLLMs, MultiModal architectures, or Model-Based RL. None of the specified expert authors are listed.

Keywords

Bayesian Spectral Emotion Transition Discovery, Multi-Annotator Disagreement, Emotion Dynamics, Hierarchical Dirichlet-Multinomial, Spectral Decomposition, Soft Labels, EmotionLines, False Discovery Rate

Score: 0.0 / 27.8
Authors: Christian Autenried, Cosimo Persia
Published: 2026-06-01
Abstract

The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pretraining existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in bokmål and nynorsk, including discharge summaries, surgical reports, nursing notes, etc., ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms its baseline counterpart, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 114 (char 337)

Score: 0.0 / 27.8
Authors: Deyu Zhuang, Peiliang Gong, Yang Shao, Liyuan Shu, Qi Zhu, Xiaoli Li, Daoqiang Zhang
Published: 2026-06-01
Abstract

Accurate Remaining Useful Life prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging due to the irregular nature of sensor observations, characterized by asynchronous sampling, burst missingness, and temporal jitter. Compounding this issue, purely data-driven models often generate physically implausible degradation trajectories that violate the irreversible nature of damage accumulation. To address this, we propose PC-MambaSDE, a unified continuous-time framework for robust RUL prediction under irregular observations. Specifically, we design a Mask-Aware Continuous Mamba Encoder that explicitly leverages observation masks to extract context-rich control signals. Furthermore, we introduce a Physics-Guided Latent SDE with parametrically rectified hybrid drift, superimposing a global physical bias to enforce monotonic degradation even amid severe observation gaps. Additionally, we formulate RUL prediction as a boundary value problem via a Terminal Degradation Penalty, which decouples a Health Index dimension and applies a penalty loss to guide trajectories toward the failure state. Theoretically, we prove that our variational objective is mathematically equivalent to minimizing the KL divergence via Girsanov's theorem, and we guarantee the global asymptotic stability of the learned dynamics through Lyapunov analysis. To enable rigorous evaluation, we develop a Hybrid Irregularity Generation Scheme that simulates realistic industrial imperfections. Extensive experiments on public benchmarks demonstrate that PC-MambaSDE significantly outperforms state-of-the-art methods, particularly under extreme observation scarcity, validating the efficacy of embedding physical priors into continuous-time latent dynamics.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 78 (char 301)

Score: 0.0 / 27.8
Authors: Yuki Suzuki, Alex Fukunaga
Published: 2026-06-01
TL;DR: This paper evaluates baseline methods for Immediate Duplicate Detection in A* search algorithms using external memory like SSDs, analyzing the impact of OS page caches on search performance.
Abstract

Many difficult search problems cannot be solved by algorithms such as A* using only RAM. Search algorithms which use external memory such as SSDs and HDDs with much higher capacity than RAM have been proposed in previous work, but previous work has focused on delayed duplicate detection approaches, as well as complex immediate duplicate detection (IDD) methods, and relatively simple methods for IDD have not been systematically studied. In addition, the effect of OS-level mechanisms for managing and speeding up accesses to external memory, such as page caches, has not been studied. This paper addresses these gaps in the literature by evaluating and analyzing the performance of simple baseline approaches for IDD-based A*.
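For orientation, a simple in-memory A* with immediate duplicate detection looks like the sketch below: each successor is checked against the table of best-known costs at generation time, before it enters the open list. The `best_g` dictionary stands in for the structure that an external-memory variant would keep on SSD or HDD; the grid domain and all names are hypothetical, and this is not one of the paper's baselines.

```python
import heapq


def astar(start, goal, neighbors, heuristic):
    """A* with immediate duplicate detection; returns the optimal cost or None."""
    best_g = {start: 0}  # duplicate-detection table (external memory in the paper's setting)
    open_heap = [(heuristic(start), 0, start)]
    while open_heap:
        _, g, state = heapq.heappop(open_heap)
        if state == goal:
            return g
        if g > best_g[state]:
            continue  # stale entry superseded by a cheaper path
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if nxt in best_g and best_g[nxt] <= new_g:
                continue  # duplicate detected immediately; discard
            best_g[nxt] = new_g
            heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None


def grid_neighbors(state):
    """Unit-cost moves on a hypothetical 4 x 4 grid."""
    x, y = state
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [((nx, ny), 1) for nx, ny in steps if 0 <= nx < 4 and 0 <= ny < 4]


def manhattan_to_corner(state):
    """Admissible heuristic toward the goal (3, 3)."""
    return abs(3 - state[0]) + abs(3 - state[1])
```

Every membership test on `best_g` is a random access, which is cheap in RAM but costly on external storage; that access pattern is where OS-level page caching would be expected to matter.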

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses classical search algorithms (A*) and external memory management (SSD/HDD, page caches), while the keywords focus on Multimodal LLMs, World Models, and Model-Based RL. There is no thematic overlap between the paper's content and the specified evaluation criteria.

Keywords

A*, Search Algorithms, External Memory, SSD, Immediate Duplicate Detection, Page Caches, Baseline Methods, OS-level mechanisms

Score: 0.0 / 27.8
Authors: Chinthaka Ranasingha, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Harshala Gammulle
Published: 2026-06-01
TL;DR: This paper proposes a lightweight TCN with physics-guided attention for efficient WiFi CSI-based human activity recognition, achieving superior performance with significantly reduced computational cost compared to deeper baselines.
Abstract

Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost, and privacy-preserving nature. However, existing learning-based approaches largely rely on deep, computationally intensive architectures to implicitly capture motion dynamics from CSI measurements, thereby increasing model complexity and reducing efficiency. Instead, we argue that incorporating appropriate inductive biases tailored to the physical characteristics of CSI signals enables more efficient and effective learning. In this work, we propose a compact temporal convolutional network (TCN)-based framework that explicitly incorporates motion-aware inductive biases into feature learning. Specifically, we introduce a Doppler-energy-guided temporal attention mechanism in feature space to emphasize motion-salient time segments, and a variance-driven channel attention module to adaptively weight informative subcarriers based on temporal motion statistics. By integrating these domain-specific priors, the proposed model effectively captures motion dynamics without increasing architectural depth. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves superior performance compared to deeper baselines, while significantly reducing parameter count and computational cost.
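The intuition behind variance-driven channel attention can be shown with a fixed, non-learned stand-in: subcarriers whose amplitude varies more over time receive larger weights. The paper's module is presumably learned and operates in feature space, so the softmax-over-variance rule, the function name, and the toy CSI matrix below are assumptions.

```python
import math


def variance_channel_weights(csi):
    """Softmax over per-subcarrier temporal variance.

    `csi[t][c]` is the amplitude at time step t on subcarrier c.
    """
    n_time = len(csi)
    n_chan = len(csi[0])
    variances = []
    for c in range(n_chan):
        series = [csi[t][c] for t in range(n_time)]
        mean = sum(series) / n_time
        variances.append(sum((v - mean) ** 2 for v in series) / n_time)
    peak = max(variances)
    exps = [math.exp(v - peak) for v in variances]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical CSI: subcarrier 0 is static, subcarrier 1 fluctuates with motion.
toy_csi = [[1.0, 0.0], [1.0, 2.0], [1.0, 0.0], [1.0, 2.0]]
weights = variance_channel_weights(toy_csi)
```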

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper addresses WiFi CSI-based Human Activity Recognition using a lightweight TCN with physics-guided attention. The provided keywords pertain to Multimodal Large Models, World Models, and Model-Based RL (e.g., Unify Models, Tokenizer, Visual Encoder, MLLM, World Models, MultiModal, model-based RL). There is no thematic or methodological overlap between wireless sensing/HAR and large-scale generative/RL models, resulting in zero relevance for all keywords.

关键词

WiFi CSI, Human Activity Recognition, Lightweight TCN, Physics-Guided Attention, Doppler-energy, Channel Attention, Motion Dynamics

Score: 0.0 / 27.8
Authors: Serge Gratton, Philippe L. Toint
Published: 2026-06-01
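TL;DR: This paper analyzes the stochastic convergence of parallel asynchronous adaptive first-order optimization methods on non-convex functions, proving a rate of order $O(1/\sqrt{t})$ (up to logarithmic factors), with relevance to heterogeneous large-scale machine learning systems.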
摘要翻译

引入了一类新的异步自适应一阶优化方法,该方法包含多种流行算法的异步变体。此外,还考虑了采用动量和/或非精确归一化版本的这些方法。在完全随机设定下,分析了该类方法在非凸函数上的收敛性,并在合理假设下证明其收敛阶(至多相差对数因子)为 $O(1/\sqrt{t})$。数值实验表明,此类异步自适应算法在异构大规模机器学习系统中具有重要意义。

Abstract

A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular algorithms. Versions of these methods using momentum and/or inexact normalization are also considered. The convergence of methods in the class on non-convex functions is analyzed in a fully stochastic setting, and is shown to be (up to logarithmic factors) of order $O(1/\sqrt{t})$ under reasonable assumptions. Numerical experiments suggest that such asynchronous adaptive algorithms are very relevant in heterogeneous large-scale machine learning systems.
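As a rough illustration of what an asynchronous adaptive first-order step can look like, the sketch below simulates asynchrony serially: each update uses a gradient evaluated at an iterate from `delay` steps earlier, scaled by an AdaGrad-norm step size. This is a toy stand-in chosen for illustration, not the class of methods defined in the paper; `delayed_adagrad_norm`, `delay`, `lr`, and `eps` are hypothetical names.

```python
import math

def delayed_adagrad_norm(grad_fn, x0, steps, delay, lr=1.0, eps=1e-12):
    """Scalar AdaGrad-norm driven by stale gradients (serial toy simulation).

    history[k] is the iterate at the start of iteration k, so the gradient
    used at iteration k is evaluated at history[max(0, k - delay)].
    """
    history = [x0]
    x = x0
    acc = 0.0  # running sum of squared (stale) gradients
    for k in range(steps):
        stale_x = history[max(0, k - delay)]
        g = grad_fn(stale_x)
        acc += g * g
        x = x - lr * g / (math.sqrt(acc) + eps)
        history.append(x)
    return x

# Minimize f(x) = x^2 using gradients that are two iterations out of date.
x_final = delayed_adagrad_norm(lambda x: 2.0 * x, x0=5.0, steps=300, delay=2)
```

The adaptive normalization keeps the effective step small enough that the stale gradients do not destabilize this toy problem; the paper's analysis covers the general stochastic, non-convex case.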

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

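评分理由: The paper studies the stochastic convergence of parallel asynchronous adaptive first-order optimization methods, which belongs to optimization theory. The provided keywords all concern multimodal large-model architectures (e.g., Tokenizer, Visual Encoder, MLLM, Unify Models), world models, and reinforcement learning (model-based RL). The two areas do not overlap, so every keyword receives a relevance score of 0. The author list contains none of the specified experts, so no bonus applies.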

关键词

Stochastic convergence, Parallel asynchronous, Adaptive first-order, Non-convex functions, Convergence rate, Large-scale machine learning, Optimization algorithms

Score: 0.0 / 27.8
Authors: Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen, Yu Zhang, Baoshan Ma
Published: 2026-06-01
TL;DR: SGAP-PPIS improves protein-protein interaction site prediction by utilizing structure-guided adaptive propagation within equivariant graph neural networks, achieving competitive performance on benchmark datasets.
摘要翻译

准确预测蛋白质 - 蛋白质相互作用位点(PPIS)对于理解细胞过程、疾病机制及治疗靶点发现至关重要。基于图的深度学习通过整合残基级结构上下文,推动了 PPIS 预测的发展。然而,大多数基于图的模型仍依赖于固定的传播机制,将所有残基一视同仁,尽管蛋白质界面存在结构和功能上的异质性。此类传播可能限制了信息扩散适应局部几何环境的能力,使得难以区分真实相互作用位点与结构相似的非相互作用邻居。我们提出了 SGAP-PPIS,一种用于 PPIS 预测的结构引导自适应传播模型。与使用固定传播机制不同,SGAP-PPIS 利用等变图神经网络(Equivariant Graph Neural Network)中的多尺度几何状态来生成残基级传播系数。该设计使每个残基能够根据其几何微环境自适应地平衡局部特征保留与邻域扩散。实验结果表明,SGAP-PPIS 在 Test_60 数据集上达到了与最先进方法具有竞争力的性能。消融实验表明,基于几何条件的自适应传播、尺度对齐的几何引导以及多步传播状态表示共同驱动了这些改进。

Abstract

Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and therapeutic target discovery. Graph-based deep learning has advanced PPIS prediction by incorporating residue-level structural context. However, most graph-based models still rely on fixed propagation schemes that treat all residues similarly, despite the structural and functional heterogeneity of protein interfaces. Such propagation may limit the ability to adapt information diffusion to local geometric environments, making it difficult to distinguish true interaction sites from structurally similar non-interacting neighbors. We present SGAP-PPIS, a structure-guided adaptive propagation model for PPIS prediction. Rather than using a fixed propagation mechanism, SGAP-PPIS leverages multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients. This design allows each residue to adaptively balance local feature preservation and neighborhood diffusion according to its geometric microenvironment. Experimental results show that SGAP-PPIS achieves performance competitive with state-of-the-art methods on Test_60. Ablation studies show that geometry-conditioned adaptive propagation, scale-aligned geometric guidance, and multi-step propagation-state representation jointly drive these improvements.
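The balance between local feature preservation and neighborhood diffusion can be sketched as a single propagation step with one coefficient per residue. This is a minimal sketch under assumptions: in the paper the coefficients come from the geometric states of an equivariant GNN, whereas here `alpha` is simply passed in, and the mean-over-neighbors aggregation is a hypothetical choice.

```python
def adaptive_propagation_step(features, neighbors, alpha):
    """One propagation step with a per-residue mixing coefficient.

    features:  list of feature vectors, one per residue.
    neighbors: list of neighbor-index lists, one per residue.
    alpha:     list of coefficients in [0, 1]; alpha[i] = 1 keeps residue i's
               own features, alpha[i] = 0 replaces them with the neighbor mean.
    """
    updated = []
    for i, h in enumerate(features):
        nbrs = neighbors[i]
        if not nbrs:
            # Isolated residue: nothing to diffuse, keep its own features.
            updated.append(list(h))
            continue
        dim = len(h)
        mean = [sum(features[j][k] for j in nbrs) / len(nbrs) for k in range(dim)]
        updated.append([alpha[i] * h[k] + (1.0 - alpha[i]) * mean[k] for k in range(dim)])
    return updated

# Residue 0 keeps its features; residue 1 mixes equally with its neighbor.
features = [[1.0], [3.0]]
neighbors = [[1], [0]]
result = adaptive_propagation_step(features, neighbors, alpha=[1.0, 0.5])
```

A fixed propagation scheme corresponds to using the same `alpha` for every residue, which is the limitation the abstract describes.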

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper presents a graph neural network model (SGAP-PPIS) for protein-protein interaction site prediction, emphasizing structure-guided adaptive propagation and equivariant geometry. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) specifically target multimodal large language models and reinforcement learning domains. There is no methodological or thematic overlap between the bioinformatics GNN approach and the provided keywords, thus all relevance scores are assigned 0.0. Additionally, a check of the author list (Enqiang Zhu, Yizi Liu, Yilong Luo, Yao Chen, Yu Zhang, Baoshan Ma) reveals none of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present.

关键词

Protein-Protein Interaction, Structure-Guided, Adaptive Propagation, Graph Neural Network, Equivariant, Site Prediction, Multi-scale Geometric

Score: 0.0 / 27.8
Authors: Seonghyeon Go, Yumin Kim
Published: 2026-06-01
TL;DR: This paper introduces HAIM, a dataset for granular AI music production tracking, shifting evaluation beyond binary classification to stage-specific intervention detection.
摘要翻译

随着 Suno 和 Udio 等生成式平台达到人类级音频质量,AI 的应用范围已扩展至整个音乐制作流程。超越简单的曲目生成,这些进步推动了多种形式的 AI 驱动方法的采用。这些包括人声合成、编曲以及专业母带处理。然而,当前的检测研究仍主要局限于二元"AI 或人类”范式。这未能反映当代音乐制作流程的现实情况。在实际制作中,AI 工具正被越来越多地用于完善或母带处理人类制作的曲目,而人类工程师同样会对 AI 生成的素材进行后处理,以确保专业质量。此外,用户常采用对抗性策略以绕过 AI 检测器,例如对 AI 生成的曲目应用人类母带处理。这创造了一个灰色地带,简单的二元分类无法涵盖。本文定义并探讨了"AI Music Tracking"(AI 音乐追踪):即在音乐制作的多维度谱系中识别特定 AI 集成程度的挑战。为此,我们引入了 HAIM 数据集,该数据集包含音乐制作各阶段的多样化标签。该数据集旨在隔离 AI 干预的各个阶段,包括混合制作及代理级追踪。我们对最先进的检测器的评估揭示了其系统性缺陷。通过发布 HAIM,我们提出了一个新的基准,旨在将该领域从二元分类转向对 AI 音乐的细粒度、结构化评估。

Abstract

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary "AI-or-human" paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate "AI Music Tracking": the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

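评分理由: The paper centers on HAIM, a dataset and benchmark for tracking AI involvement in music production, aiming to move beyond binary classification toward stage-specific intervention detection. It does not address unified model architectures, tokenizer design, visual encoders, world models, multimodal large language models or multimodal architectures, or model-based reinforcement learning, so none of the given keywords is relevant. The author list contains none of the specified experts, so no bonus applies.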

关键词

AI Music Production, Dataset Benchmark, Human-AI Collaboration, Audio Detection, Music Tracking, Granular Evaluation, AI Integration Stages

Score: 0.0 / 27.8
Authors: Lin Jiang, Dahai Yu, Ximiao Li, Guang Wang
Published: 2026-06-01
TL;DR: E4GEN introduces an explainable diffusion framework for time-series generation that effectively captures extreme events, achieving superior fidelity and utility compared to state-of-the-art methods.
摘要翻译

生成逼真的时间序列对于科学研究和实际应用至关重要。然而,现有方法往往侧重于整体分布保真度,却未能忠实捕捉极端事件。为了推进现有研究,我们提出 E4GEN,一种用于极端事件感知时间序列生成的可解释扩散框架。E4GEN 通过三个关键组件,提供了关于何时、何种以及如何控制极端事件生成的系统性见解。首先,E-Activator 在去噪过程中学习数据集自适应的极端控制信号激活步骤,而不干扰常规时间组件(包括趋势和季节性)。其次,E-Predictor 通过自驱动语义预测(Self-Driven Semantic Prediction)确定需要施加的控制信号,其中每个样本通过在生成过程中推断潜在极端事件信息来推导其自身的控制信号。此外,它还包含一种新颖的数据条件训练、噪声初始化采样(Data-Conditioned Training, Noise-Initiated Sampling)机制,以解决训练标签不可用的问题。第三,E-Control 通过一个可训练的极端控制网络(Extreme Control Network)指定如何控制极端事件生成,该网络将语义控制信号转换为层间信号,并将其注入去噪过程。我们在六个数据集上使用 17 个指标对 E4GEN 进行评估,广泛的实验表明,E4GEN 在多个维度上优于最先进模型,包括整体保真度、极端事件保真度以及下游效用。

Abstract

Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasize overall distributional fidelity while failing to faithfully capture extreme events. To advance existing research, we propose E4GEN, an explainable diffusion framework for extreme event-aware time-series generation. E4GEN provides systematic insights into when, what, and how to control extreme-event generation through three key components. First, E-Activator learns the dataset-adaptive extreme-control signal activation step during the denoising process without interfering with regular temporal components, including trend and seasonality. Second, E-Predictor determines what control signal to enforce through Self-Driven Semantic Prediction, where each sample derives its own control signal by inferring latent extreme-event information during generation. It also includes a novel Data-Conditioned Training, Noise-Initiated Sampling mechanism to address the issue of unavailable training labels. Third, E-Control specifies how to control extreme-event generation through a trainable Extreme Control Network, which transforms the semantic control signal into layer-wise signals and injects it into the denoising process. We evaluate E4GEN on six datasets with 17 metrics, and extensive experiments show that E4GEN outperforms state-of-the-art models across multiple dimensions, including overall fidelity, extreme-event fidelity, and downstream utility.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on time-series generation using a diffusion framework for extreme events, unrelated to multimodal data, LLMs, tokenizers, visual encoders, world models, model unification, or RL. All keyword scores are 0.0 (Total 0.0 < 27.8). No expert authors found.

关键词

Time-series Generation, Extreme Events, Diffusion Framework, Explainable Control, Event-level Explainable, Denoising Process, Data-Conditioned Training, Extreme Control Network

Score: 0.0 / 27.8
Authors: Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang
Published: 2026-06-01
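TL;DR: EvoPool uses an evolutionary multi-agent framework to generate executable annotator code, producing training labels faster and more accurately than LLM annotation on specialized tasks such as biomedical and legal classification.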
摘要翻译

大语言模型(Large Language Models, LLMs)擅长通用任务,但在训练标签成本高昂的专业、高风险领域,其表现逊于较小的监督模型。我们提出 EvoPool 来应对这一场景,这是一个受达尔文进化论启发的进化多智能体框架。三个专用代理迭代地提出可执行的标注器代码,小型验证集提供适应度信号,而确定性门控仅保留通过可行性、多样性和边际贡献检查的标注器,此过程跨代进行。投票结果通过 EvoAgg 映射为软训练标签,EvoAgg 是一种文本感知聚合器,结合了语义特征与标注器投票特征。生成的池在每个示例上的成本接近零,且在 10 万条示例上比 LLM 标注快 4500 至 31000 倍。在涵盖生物医学关系抽取、法律条款分类、复杂推理和密集多标签生物医学分类的 8 个 LLM 表现不佳的专业复杂任务中的 7 个上,EvoPool 的平均宏 F1 分数比最强的 LLM 标注基线高出 0.141,在 ChemProt 上峰值达到 0.301,在 PubMed 上达到 0.265。代码可在以下网址获取:https://github.com/tianyi0216/EvoPool

Abstract

Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: https://github.com/tianyi0216/EvoPool

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

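评分理由: The paper proposes EvoPool, an evolutionary multi-agent framework that generates executable annotator code for label-efficient specialized supervision. It focuses on programmatic annotation and text tasks in the biomedical and legal domains and does not involve world models, visual encoders, multimodal large-model architectures, tokenizer strategies, or model-based reinforcement learning. LLMs appear as baselines, but the work is not about MLLMs or model unification. All given keywords are therefore irrelevant (weighted total 0). The author list contains none of the specified experts.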

关键词

Evolutionary Programmatic Annotation, Label-Efficient Specialized Supervision, Multi-Agent Framework, Executable Annotator Code, LLM Annotation Baseline, Biomedical Relation Extraction, Legal-Clause Classification

Score: 0.0 / 27.8
Authors: Nazmus Shakib Shadin, Aaron Cummings, Xinyue Zhang, Bobin Deng
Published: 2026-06-01
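TL;DR: The paper proposes FedMTFI, a framework that combines multi-teacher knowledge distillation with SHAP-based feature importance to improve accuracy and generalization under non-IID data in heterogeneous federated learning.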
摘要翻译

联邦学习(FL)是一种去中心化方法,能够在不暴露原始数据的前提下实现协作模型训练。与传输敏感数据不同,它允许设备仅共享模型权重,从而确保个人数据保留在本地且安全。然而,在实际应用场景中,设备持有的数据往往分布不均,且设备在计算能力和内存容量上存在显著差异。这些差异使得联邦学习难以在整个系统中保持一致的性能表现。为了解决这些问题,我们提出了一种名为 FedMTFI 的新颖架构,该架构将多教师知识蒸馏(MTKD)与特征重要性相结合,旨在提升联邦学习在异构环境中的性能。在 FedMTFI 中,客户端依据相似的硬件配置和模型类型进行聚类。每个簇在非独立同分布(non-IID)数据上训练特定的模型。在每个簇内部,每个客户端仅利用其自身的本地私有数据更新该模型。随后,服务器利用 FedAvg 算法聚合每个簇中本地训练的模型,从而形成多个原型模型。进而,这些原型模型充当教师模型,利用 MTKD 训练一个全局泛化学生模型。FedMTFI 的独特之处在于引入了沙普利值(SHAP),用于在蒸馏过程中强调重要特征,从而同时提升了模型的准确性和可解释性。实验结果表明,FedMTFI 相较于传统联邦学习算法实现了更高的准确率,且在非独立同分布(non-IID)数据条件下表现更为有效。

Abstract

Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transferring sensitive data, it allows devices to share only model weights, keeping personal data local and secure. However, in real-world settings, the data held by devices is often unevenly distributed, and devices differ in computing power and memory capacity. These differences make it harder for FL to maintain consistent performance across the system. To address these issues, we propose FedMTFI, a novel architecture that combines multi-teacher knowledge distillation (MTKD) with feature importance to improve the FL process in heterogeneous environments. In FedMTFI, clients are clustered based on similar hardware and model types. Each cluster trains a specific model on non-independent and identically distributed (non-IID) data. Within a cluster, every client updates that model using only its own local private data. The server then aggregates the locally trained models in each cluster using FedAvg to form multiple prototype models. These prototypes then serve as teacher models to train a global generalized student model using MTKD. What distinguishes FedMTFI is the integration of Shapley values (SHAP) to emphasize important features during distillation, which enhances both accuracy and interpretability. Experimental results show that FedMTFI achieves higher accuracy than traditional FL algorithms and performs more effectively under non-IID data conditions.
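The per-cluster aggregation and the multi-teacher step can be sketched as follows. FedAvg is taken in its usual form, a sample-size-weighted average of client parameters. The teacher ensemble is assumed to be a plain average of teacher probabilities, and the SHAP-based feature weighting is omitted; `cluster_prototypes` and `teacher_soft_targets` are hypothetical helper names, not the paper's API.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of flat parameter vectors (FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

def cluster_prototypes(clusters):
    """Aggregate each cluster separately into one prototype model.

    clusters: dict mapping cluster id -> list of (weights, n_samples) pairs.
    """
    return {
        cid: fedavg([w for w, _ in members], [n for _, n in members])
        for cid, members in clusters.items()
    }

def teacher_soft_targets(teacher_probs):
    """Average the class probabilities predicted by several teachers."""
    n_teachers = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(p[c] for p in teacher_probs) / n_teachers for c in range(n_classes)]

# Two clients in one cluster, holding 1 and 3 samples respectively.
prototypes = cluster_prototypes({"small-devices": [([1.0, 2.0], 1), ([3.0, 4.0], 3)]})
targets = teacher_soft_targets([[0.8, 0.2], [0.4, 0.6]])
```

Clustering by hardware and model type means each `fedavg` call only ever averages parameter vectors of the same shape, which is why aggregation happens per cluster rather than globally.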

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

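评分理由: The paper focuses on optimizing federated learning and knowledge distillation in heterogeneous environments; its core contribution is combining multi-teacher distillation with feature importance (SHAP). The provided keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) mainly concerns multimodal large models, representation learning, and reinforcement learning. The paper shares no direct overlap with these keywords in technical approach, model components, or application scenarios, so every relevance score is 0. A check of the author list confirms that none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) is present, so the expert bonus is not triggered.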

关键词

Federated Learning, Multi-Teacher Knowledge Distillation, Heterogeneous Environment, Feature Importance, Shapley Values, Non-IID Data, Client Clustering

Score: 0.0 / 27.8
Authors: Jianhao Xu, Zhuang Yang
Published: 2026-06-01
TL;DR: This paper proposes a dynamic l_p-norm optimization scheme for DNNs that adapts to curvature changes, achieving better convergence and generalization performance compared to standard SGD and SGDM.
摘要翻译

现有的深度神经网络(DNN)优化器通常依赖于 $\ell_2$ 范数或 $\ell_\infty$ 范数,这导致优化器难以很好地适应跨参数维度的曲率显著变化。一般来说,DNN 的训练过程在早期通常表现出强烈的曲率各向异性,而在后期,训练过程趋向于各向异性较弱的平坦区域。特别是,基于 $\ell_2$ 范数的优化器通常被高曲率方向主导,限制了沿低曲率方向的更新,从而导致收敛速度较慢。而基于 $\ell_\infty$ 范数的优化器在平坦区域容易因坐标分量更新幅度相同而产生振荡。为了解决由 $\ell_2$ 和 $\ell_\infty$ 范数产生的这两种极端情况,我们提出了一种具有动态 $p$ 值的新 $\ell_p$ 范数方案,并将其融入随机梯度下降(SGD)和带动量的随机梯度下降(SGDM)中,从而得到两种具有更好泛化性能的新优化器:$\ell_p$-SGD(LPSGD)和 $\ell_p$-SGDM(LPSGDM)。特别是,所得优化器在早期通过使用较大的 $p$($p>2$)来抑制高曲率方向的主导地位,随后 $p$ 逐渐减小趋向于 2,以实现更稳定和精细的更新,该过程基于余弦退火策略。我们建立了所得算法的理论保证,并分析表明 LPSGD 和 LPSGDM 在非凸情形下均能达到 \(O(T^{-1/2})\) 的收敛率。我们在基准数据集(包括 CIFAR-10、CIFAR-100 和 ImageNet-1K)上进行了大量实验,使用了多种 DNN(如 VGG-11、ResNet-18 和 ResNet-50)。

Abstract

The existing optimizers for deep neural networks (DNNs) typically rely on either the $\ell_2$ norm or the $\ell_\infty$ norm, and therefore do not adapt well to substantial changes in curvature across parameter dimensions. DNN training often exhibits strong curvature anisotropy in the early period, whereas in the later period it tends to move toward flatter regions with weaker anisotropy. In particular, optimizers based on the $\ell_2$ norm are usually dominated by high-curvature directions, which restricts updates along lower-curvature directions and leads to slower convergence, while optimizers based on the $\ell_\infty$ norm are prone to oscillations in flatter regions because their coordinate-wise updates all have the same magnitude. To address these two extreme cases, we propose a novel $\ell_p$-norm scheme with a dynamic value of $p$ and incorporate it into stochastic gradient descent (SGD) and SGD with momentum (SGDM), leading to two novel optimizers with better generalization performance: $\ell_p$-SGD (LPSGD) and $\ell_p$-SGDM (LPSGDM). The resulting optimizers suppress the dominance of high-curvature directions in the early period by using a large $p$ ($p>2$), then gradually decrease $p$ toward 2 to enable more stable and refined updates, with the decrease motivated by the cosine annealing strategy. We establish theoretical guarantees for the resulting algorithms and show that both LPSGD and LPSGDM achieve an $O(T^{-1/2})$ convergence rate in the nonconvex setting. Extensive experiments are conducted on benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, with multiple DNNs such as VGG-11, ResNet-18, and ResNet-50.
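A small sketch may help picture the dynamic-$p$ idea. It makes two assumptions that the abstract does not confirm: that $p$ follows a standard cosine-annealing curve from a starting value down to 2, and that the update direction is the textbook steepest-descent direction under the $\ell_p$ norm, $d_i = \operatorname{sign}(g_i)\,(|g_i|/\|g\|_q)^{q-1}$ with $1/p + 1/q = 1$. The names `cosine_p`, `lp_direction`, and the default `p_start=8.0` are hypothetical.

```python
import math

def cosine_p(step, total_steps, p_start=8.0, p_end=2.0):
    """Cosine-annealed exponent: p_start at step 0, p_end at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_end + 0.5 * (p_start - p_end) * (1.0 + math.cos(math.pi * frac))

def lp_direction(grad, p):
    """Unit-l_p-norm steepest-descent direction for a finite p > 1.

    p = 2 recovers the normalized gradient; as p grows the direction
    approaches sign(grad), i.e. equal-magnitude coordinate-wise updates.
    """
    q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1
    norm_q = sum(abs(g) ** q for g in grad) ** (1.0 / q)
    if norm_q == 0.0:
        return [0.0] * len(grad)
    return [math.copysign((abs(g) / norm_q) ** (q - 1.0), g) for g in grad]

grad = [3.0, -4.0]
early = lp_direction(grad, cosine_p(0, 100))   # p = 8: coordinates nearly equalized
late = lp_direction(grad, cosine_p(100, 100))  # p = 2: plain normalized gradient
```

With a large $p$ early on, the coordinate with the larger gradient dominates the update less, which matches the motivation given in the abstract; by the end of the schedule the update reduces to ordinary normalized gradient descent.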

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on optimization algorithms (l_p-norm scheme) for standard Deep Neural Networks, addressing curvature anisotropy during training. The provided keywords relate to Multimodal Large Models, World Models, and Model-Based RL architectures/components. There is no substantive overlap between the paper's content (optimization for CV tasks) and the keywords (multimodal/RL specific components), hence all scores are 0. No expert authors from the specified list are present in the author list.

关键词

Deep Neural Networks, Optimizers, l_p-norm, Curvature, SGD, Convergence Rate, Generalization

Score: 0.0 / 27.8
Authors: Sabyasachi Basu, Manuj Mukherjee, Lutz Oettershagen, Suhas Thejaswi
Published: 2026-06-01
TL;DR: The paper investigates exact community recovery in stochastic block models under limited query budgets, demonstrating that adaptive querying strategies can strictly improve information-theoretic limits compared to non-adaptive uniform querying.
摘要翻译

我们在有限且嘈杂的网络数据访问条件下,研究 $n$ 顶点两社区随机块模型(Stochastic Block Model, SBM)中的精确社区恢复。学习者可查询一个嘈杂的邻域预言机(noisy neighborhood oracle),该预言机会以固定概率独立地揭示查询顶点的每个真实邻居,且从不返回非邻居,同时受限于有限的查询预算。我们考虑仅预言机访问(oracle-only access)以及一种组合模型,在该模型中学习者还观察到底层图的一个子采样副本(subsampled copy)。对于仅预言机访问,平衡均匀查询(balanced uniform querying)提供了一个精确的非自适应基准:当每个顶点被查询相同的整数次时,观测结果简化为具有衰减边概率的 SBM,此时 Abbe-Bandeira-Hall 精确恢复阈值适用。我们表明该基准并非自适应最优:一种两阶段自适应策略(two-stage adaptive strategy)在 $n+o(n)$ 次查询内即可成功,而在该区域中,平衡均匀查询需要 $m n$ 次查询(其中 $m>1$)。借助额外的子采样图,我们证明了亚线性查询自适应差距(sublinear-query adaptivity gap):具有亚线性预算的数据无关平衡均匀查询(balanced data-independent uniform querying)不会优于单独使用子采样图,而自适应查询可以针对一小部分不确定顶点并实现精确恢复。因此,自适应数据获取(adaptive data acquisition)可以严格改进精确恢复的信息论极限。

Abstract

We study exact community recovery in the two-community stochastic block model on $n$ vertices under limited and noisy access to network data. The learner may query a noisy neighborhood oracle that reveals each true neighbor of a queried vertex independently with fixed probability and never returns non-neighbors, subject to a finite query budget. We consider both oracle-only access and a combined model where the learner also observes a single subsampled copy of the underlying graph. For oracle-only access, balanced uniform querying gives a sharp non-adaptive benchmark: when each vertex is queried the same integer number of times, the observations reduce to an SBM with attenuated edge probabilities and the Abbe-Bandeira-Hall exact-recovery threshold applies. We show that this benchmark is not adaptively optimal: a two-stage adaptive strategy succeeds with $n+o(n)$ queries in a regime where balanced uniform querying requires $m n$ queries for some $m>1$. With an additional subsampled graph, we prove a sublinear-query adaptivity gap: balanced data-independent uniform querying with a sublinear budget does not improve over the subsampled graph alone, whereas adaptive querying can target a small set of uncertain vertices and achieve exact recovery. Thus adaptive data acquisition can strictly improve the information-theoretic limits of exact recovery.
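The query model is easy to simulate. The sketch below samples a two-community SBM and implements a neighborhood oracle that reveals each true neighbor independently and never returns a non-neighbor. The helper `observe_probability` encodes an assumption derived here rather than quoted from the paper: if every vertex is queried $m$ times independently, an existing edge has $2m$ chances of being revealed, so it is observed with probability $1-(1-q)^{2m}$. All function names are hypothetical.

```python
import random

def sample_two_community_sbm(n, p_in, p_out, rng):
    """Sample labels and adjacency sets for a balanced two-community SBM."""
    labels = [v % 2 for v in range(n)]
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return labels, adj

def noisy_neighborhood_oracle(adj, v, reveal_prob, rng):
    """Reveal each true neighbor of v independently; never a non-neighbor."""
    return {u for u in adj[v] if rng.random() < reveal_prob}

def observe_probability(reveal_prob, queries_per_vertex):
    """Chance an existing edge is seen when both endpoints are queried m times."""
    return 1.0 - (1.0 - reveal_prob) ** (2 * queries_per_vertex)

rng = random.Random(0)
labels, adj = sample_two_community_sbm(40, p_in=0.6, p_out=0.1, rng=rng)
answer = noisy_neighborhood_oracle(adj, v=0, reveal_prob=0.5, rng=rng)
```

Under that assumption, balanced uniform querying yields an SBM whose edge probabilities are scaled by `observe_probability(q, m)`, which is the kind of attenuation the abstract refers to.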

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: The paper focuses on theoretical computer science and network science, specifically Stochastic Block Models (SBM) and community recovery algorithms under query constraints. The provided keywords relate to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning (RL). There is no domain overlap between graph-theoretic community recovery and multimodal representation learning or generative world models. Consequently, all keyword relevance scores are 0. None of the listed expert authors appear in the author list.

关键词

Stochastic Block Models, Community Recovery, Query-Limited, Adaptive Strategy, Information-Theoretic Limits, Neighborhood Oracle, Exact Recovery

Score: 0.0 / 27.8
Authors: Reda Snaiki, Abdelatif Merabtine
Published: 2026-06-01
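TL;DR: This study proposes an uncertainty-aware graph neural network framework that reconstructs continuous urban temperature fields from sparse observations under sensor-placement constraints, with higher accuracy than traditional interpolation methods.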
摘要翻译

从稀疏观测中重建空间连续的日温度场对于城市气候监测和热风险分析至关重要,但实际部署受限于传感器预算和间距约束。本研究提出了一种不确定性感知图神经网络 (GNN) 框架,用于从稀疏传感器重建日最高温度场,同时支持距离约束下的传感器布设及概率超越映射。该模型采用基于图注意力机制的均值 - 残差架构,并通过高斯负对数似然进行训练,同时预测温度场和空间变化的预测不确定性场。传感器布设采用带 QR 分解的正交分解 (POD-QR) 策略,设定最小传感器间距为 4 公里,并与随机可行布设及最远点采样方法进行比较。该框架基于蒙特利尔区域多边形,利用 Daymet v4.1 日温度数据(分辨率 1 公里)进行评估,并采用严格的时间保留协议(训练集:2020-2023 年;测试集:2024 年)。在传感器预算范围(10-40 个传感器)内,所提出的 GNN 在未观测节点上的均方根误差 (RMSE) 和平均绝对误差 (MAE) 上始终优于反距离加权法 (IDW) 和普通克里金法 (OK)。传感器布设效应在低预算时最为显著,随预算增加而减弱,在施加的间距约束下,约 30 个传感器时出现实际饱和状态。概率评估进一步表明,随着传感器密度增加,不确定性校准得到改善,且相较于克里金法,具有更优的锐度 - 校准权衡。这些结果表明,所提出的框架是进行不确定性感知温度场重建及面向决策的热风险制图的有效工具。

Abstract

Reconstructing spatially continuous daily temperature fields from sparse observations is important for urban climate monitoring and heat-risk analysis, but practical deployments are limited by sensor budgets and spacing constraints. This study proposes an uncertainty-aware graph neural network (GNN) framework for reconstructing daily maximum temperature fields from sparse sensors while supporting distance-constrained sensor placement and probabilistic exceedance mapping. The model predicts both the temperature field and a spatially varying predictive uncertainty field using a graph-attention-based mean-residual architecture trained with a Gaussian negative log-likelihood. Sensor placement is addressed using a Proper Orthogonal Decomposition with QR factorization (POD-QR) strategy with a 4 km minimum inter-sensor distance constraint and is compared with random feasible placement and farthest-point sampling. The framework is evaluated over a Montreal-area polygon using Daymet v4.1 daily temperature data (1 km resolution) under a strict temporal hold-out protocol (training: 2020-2023; testing: 2024). Across sensor budgets (10-40 sensors), the proposed GNN consistently outperforms inverse distance weighting and ordinary kriging in RMSE and MAE on unobserved nodes. Sensor-placement effects are most pronounced at low budgets and diminish at higher budgets, with a practical saturation regime emerging around 30 sensors under the imposed spacing constraint. Probabilistic evaluation further shows improved uncertainty calibration with increasing sensor density and a better sharpness-calibration trade-off than kriging. These results support the proposed framework as an effective tool for uncertainty-aware temperature field reconstruction and decision-oriented heat-risk mapping.
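Two ingredients named in the abstract have standard closed forms and can be sketched directly: the inverse distance weighting (IDW) baseline and the Gaussian negative log-likelihood used as the training loss. The sketch assumes planar coordinates and a scalar predicted variance per node; the function names and the default `power=2.0` are hypothetical, and the GNN itself is not reproduced.

```python
import math

def idw_predict(sensors, target, power=2.0):
    """Inverse distance weighting at one target location.

    sensors: list of ((x, y), value) pairs; target: (x, y).
    """
    num = 0.0
    den = 0.0
    for (x, y), value in sensors:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # target coincides with a sensor
        w = d ** -power
        num += w * value
        den += w
    return num / den

def gaussian_nll(y, mean, variance):
    """Negative log-likelihood of y under N(mean, variance)."""
    return 0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)

# Two sensors 2 km apart; the midpoint gets the average of their readings.
sensors = [((0.0, 0.0), 10.0), ((2.0, 0.0), 20.0)]
midpoint = idw_predict(sensors, (1.0, 0.0))
loss = gaussian_nll(y=21.0, mean=20.0, variance=4.0)
```

Minimizing the Gaussian NLL penalizes both large errors and overconfident variances, which is what lets a model output a spatially varying uncertainty field alongside the temperature estimate.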

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

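评分理由: The paper applies graph neural networks to urban temperature field reconstruction under sensor-placement constraints, which belongs to spatial statistics and climate computing. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all belong to large language models, multimodal large models, and reinforcement learning, and have no direct connection to this paper's content (regression, spatial interpolation, uncertainty quantification), so every keyword receives a relevance score of 0. The author list contains none of the specified experts.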

关键词

Graph Neural Network, Temperature Field Reconstruction, Uncertainty-Aware, Sensor Placement, Spatial Reconstruction, Urban Climate, Probabilistic Mapping, Sparse Sensors

Score: 0.0 / 27.8
Authors: Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock, Amanda Prorok
Published: 2026-06-01
摘要翻译

机器人学习必须生成能够泛化至新组合的约束、协作伙伴及环境的策略。为此,我们必须对策略进行结构化分解,这一选择决定了哪些部分能够泛化、哪些需要再训练,以及哪些部分保持纠缠。现有方法跨度广泛,从期望结构通过数据规模扩展自然涌现,到通过层次结构、技能库或学习到的专长进行手工设计。本文研究了我们认为机器人学中最基本的分解方式:将世界与任务分离。世界因素是具身系统及环境的属性,它们独立于意图而存在。任务因素则由任务逻辑定义,基于世界所允许的状态。我们通过贝叶斯模型证据(Bayesian model evidence)形式化这种不对称性:它与数据生成过程保持一致,通过解析式世界模型维持高似然性,并降低对任务参数的奥卡姆剃刀(Occam razor)惩罚。我们通过结合 AICON(一种可微的递归估计器与互连构成的组合式图,无需特定任务数据即可运行,并将代价梯度传播至执行器)与一个紧凑的学习策略(用于调节梯度路径)来实例化这种分解。梯度作为两个因素之间的接口:它们通过图传递世界结构,通过代价传递任务结构,从而实现低维学习的同时保持结构泛化。我们在涵盖异构机器人、环境、任务逻辑及感觉运动模态的三个问题上测试了该世界/任务分解框架。我们的框架在所有设置中均优于端到端基线和解析启发式方法,能够零样本(zero-shot)泛化至分布外(out-of-distribution)配置,且无需再训练即可迁移至真实硬件。

Abstract

Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam's razor penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

评分理由: Scoring failed: Expecting ',' delimiter: line 12 column 45 (char 268)

Score: 0.0 / 27.8
Authors: Gjorgjina Cenikj, Jakub Kudela, Eva Tuba, Tome Eftimov
Published: 2026-06-01
TL;DR: This paper evaluates how well algorithm selection models generalize across synthetic benchmark suites and real-world optimization landscapes, identifying where transfer succeeds or breaks down on robotics trajectory optimization and UAV path-planning tasks.
摘要翻译

算法选择(AS)旨在通过利用可测量的问题特征和历史性能数据,自动为给定的问题实例识别最合适的优化算法。本研究探究了 AS 模型在合成与真实世界优化景观上的泛化能力。我们考察了两个广泛使用的学术基准套件(BBOB 和 CEC)以及两个真实世界问题集(机器人轨迹优化任务和无人机路径规划问题)。通过系统的跨基准评估,我们分析了 AS 模型在领域间的迁移情况,识别了泛化成功或失效之处,并强调了在现实、特定领域背景下应用 AS 时所面临的挑战。我们的研究结果为当前 AS 方法的稳健性提供了洞察,并为开发更可靠、更广泛适用的 AS 系统以应用于真实世界优化提供了依据。

Abstract

Algorithm Selection (AS) aims to automatically identify the most suitable optimization algorithm for a given problem instance by leveraging measurable problem characteristics and historical performance data. In this study, we investigate the generalization ability of AS models across both synthetic and real-world optimization landscapes. We consider two widely used academic benchmark suites (BBOB and CEC) and two real-world problem sets (robotics trajectory optimization tasks and unmanned aerial vehicle path-planning problems). Through a systematic cross-benchmark evaluation, we analyze how AS models transfer between domains, identify where generalization succeeds or breaks down, and highlight the challenges that arise when applying AS in realistic, domain-specific contexts. Our findings provide insights into the robustness of current AS approaches and inform the development of more reliable, broadly applicable AS systems for real-world optimization.

评分详情
关键词 权重 相关度 得分
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on Algorithm Selection for optimization problems in robotics and UAVs, evaluating generalization across benchmarks. The provided keywords relate to Multimodal LLMs, Tokenizers, Visual Encoders, and World Models, which are not present in the paper's content or methodology. There is no direct overlap between the paper's topic (optimization algorithm selection) and the specified keywords (multimodal representation learning).

Keywords

Algorithm Selection, Real-World Generalizability, Optimization Landscapes, Robotics Trajectory Optimization, UAV Path-Planning, Cross-Benchmark Evaluation, Problem Characteristics

Score: 0.0 / 27.8
Authors: Adel Dabah
Published: 2026-06-01
TL;DR: This paper reformulates the Vehicle Routing Problem as a Graph Edit Distance maximization problem to enable structural analysis of routing solutions and suggests potential supervision signals for graph neural networks.
摘要翻译

我们表明,车辆路径问题(VRP)可以被重构为图编辑距离(GED)最大化问题。在简单的边删除成本模型下,最小化总路线成本等同于最大化从完整实例图中删除的边的总权重。该表述在边级别上对 VRP 进行建模,其中解由选定的边定义而非路线序列,这使得在经典表述中难以进行的结构分析成为可能:包括逐边的解质量归因、最优性差距分解、解稀疏性刻画以及通过贪心构造难以到达的边的识别。理论上,我们建立了一个合并分解定理,表明 Clarke-Wright 节省量等于每次合并的 GED 增量,以及一个近似转移定理,将 GED 近似比转化为 VRP 成本界限。利用这种重构,我们分析了 90 个具有已知最优解的容量约束车辆路径问题(CVRP)基准实例。我们发现,最优路由图仅使用了 5.5% 的可用边,约 3.0% 的最优边在多次重启下始终未被 Clarke-Wright 启发式算法找到,且成本差距分解为错失的最优边和替代的非最优边,这两类边的总权重相当。边可加性目标为未来的图神经网络(GNN)边预测方法提供了自然的逐边监督信号,暗示了与图神经网络方法的潜在联系,我们将其留作后续工作。

Abstract

We show that the Vehicle Routing Problem (VRP) can be reformulated as a Graph Edit Distance (GED) maximization problem. Under a simple edge-deletion cost model, minimizing total route cost is equivalent to maximizing the total weight of edges deleted from the complete instance graph. This formulation models VRP at the edge level, where solutions are defined by selected edges rather than route sequences, enabling structural analyses that are difficult in classical formulations: per-edge attribution of solution quality, decomposition of the optimality gap, characterization of solution sparsity, and identification of edges that are hard to reach by greedy construction. Theoretically, we establish a merge-decomposition theorem showing that Clarke-Wright savings equal per-merge GED increments, and an approximation-transfer theorem that turns GED approximation ratios into VRP cost bounds. Using this reformulation, we analyze 90 CVRP benchmark instances with known optimal solutions. We find that optimal routing graphs use only 5.5% of available edges, that approximately 3.0% of optimal edges are consistently not found by Clarke-Wright heuristics under repeated restarts, and that the cost gap decomposes into missed optimal edges and substituted non-optimal edges of comparable total weight. The edge-additive objective provides a natural per-edge supervision signal for future graph neural network approaches to edge prediction, suggesting a potential connection to graph neural network approaches that we leave for follow-up work.
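
The equivalence stated in the abstract can be checked on a toy instance: since the total weight of the complete graph is fixed, the cost of the kept edges plus the weight of the deleted edges is constant, so the cheapest solution is also the one that deletes the most weight. The sketch below is only an illustration under assumptions not taken from the paper: the coordinates are made up, capacity constraints are ignored, each route visits at least two customers so that its edges are distinct, and `saving` is the classical Clarke-Wright formula rather than the paper's merge-decomposition theorem.

```python
import itertools
import math

# Hypothetical toy instance: node 0 is the depot, nodes 1..4 are customers.
COORDS = {0: (0, 0), 1: (0, 3), 2: (4, 3), 3: (4, 0), 4: (2, -2)}

def dist(i, j):
    (x1, y1), (x2, y2) = COORDS[i], COORDS[j]
    return math.hypot(x1 - x2, y1 - y2)

ALL_EDGES = {frozenset(e) for e in itertools.combinations(COORDS, 2)}
TOTAL_WEIGHT = sum(dist(*e) for e in ALL_EDGES)

def route_edges(routes):
    """Undirected edges used by depot-to-depot routes (each with >= 2 customers)."""
    edges = set()
    for route in routes:
        path = [0] + route + [0]
        edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges

def cost(routes):
    """Total weight of the edges kept by the routing solution."""
    return sum(dist(*e) for e in route_edges(routes))

def deleted_weight(routes):
    """Total weight of the edges removed from the complete graph."""
    return sum(dist(*e) for e in ALL_EDGES - route_edges(routes))

def saving(i, j):
    """Classical Clarke-Wright saving for joining route ends i and j."""
    return dist(0, i) + dist(0, j) - dist(i, j)

candidates = [[[1, 2], [3, 4]], [[1, 2, 3, 4]], [[1, 3], [2, 4]]]
best_by_cost = min(candidates, key=cost)
best_by_deletion = max(candidates, key=deleted_weight)
print(best_by_cost == best_by_deletion)
```

On this instance, merging the routes `[1, 2]` and `[3, 4]` at customers 2 and 3 lowers the cost by exactly `saving(2, 3)`, which is the quantity the abstract relates to per-merge GED increments.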

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on combinatorial optimization (Vehicle Routing Problem) and graph theory (Graph Edit Distance), whereas the provided keywords relate to multimodal large language models, world models, and reinforcement learning. There is no methodological or thematic overlap between the paper's content and the specified keywords.

Keywords

Vehicle Routing Problem, Graph Edit Distance, Edge-level formulation, CVRP benchmarks, Graph Neural Networks, Optimization, Structural analysis

Score: 0.0 / 27.8
Authors: Luis A. Ortega, Andrés R. Masegosa, Thomas D. Nielsen
Published: 2026-06-01
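TL;DR: This paper proposes Flow-Transformed Implicit Processes (FTIP), which uses a normalizing flow to build a more flexible variational distribution, addressing the inability of Gaussian approximations to capture asymmetric or multimodal posteriors in Bayesian function-space inference.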
摘要翻译

隐式过程先验(Implicit-process priors)通过灵活的生成机制定义函数上的分布,使其在贝叶斯函数空间建模(Bayesian function-space modelling)中颇具吸引力。然而,使用此类先验进行后验推断(posterior inference)具有挑战性,因为它们诱导的函数空间分布通常无法以闭式(closed form)获得。一种实用的策略是利用有限个采样函数的集合来近似先验,然后将后验函数表示为这些样本的学习组合。现有方法通常在组合权重上采用高斯变分分布(Gaussian variational distribution)。虽然这种选择是可处理的(tractable),但它限制了可表示的后验不确定性形状,特别是在真实后验是非对称的(asymmetric)、重尾的(heavy-tailed)或多峰的(multimodal)情况下。我们提出流变换隐式过程(Flow-Transformed Implicit Processes, FTIP),这是一种变分推断方法,旨在使这种有限维函数空间近似更具表达能力(expressive)。与在组合权重上使用高斯分布不同,FTIP 使用归一化流(normalizing flow)来定义更丰富的变分分布。这在保持可处理优化的同时,诱导了函数上的灵活后验分布。我们使用黑盒 α 目标函数(Black-Box α objective)训练该模型,从而能够比较质量覆盖(mass-covering)与模式寻求(mode-seeking)的变分行为。实验表明,FTIP 能够捕捉函数空间中非对称和多峰的后验结构,而高斯系数近似倾向于将其平滑或坍缩。

Abstract

Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box α objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

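Scoring rationale: The paper focuses on Bayesian function-space variational inference with implicit processes, involving normalizing flows and generative mechanisms. The provided keywords concern multimodal large-model architectures (e.g., Tokenizer, Visual Encoder, MLLM, MultiModal) and reinforcement learning (model-based RL), which have no direct overlap with the paper's statistical-inference topic, so every keyword receives a relevance of 0.

Keywords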

Implicit-process priors, Variational inference, Normalizing flow, Function-space modelling, Posterior inference, Black-Box α objective, Generative mechanisms

Score: 0.0 / 27.8
Authors: Mayank Sharma, Rohit Kumar Mourya
Published: 2026-06-01
TL;DR: The paper gives a theoretical account of simplex-ratio open-set recognition that holds in every embedding dimension, based on balanced equal-norm prototype codes, and finds empirically that raw ratio scores typically underperform nearest-neighbor and logit-based alternatives.
摘要翻译

开放集识别(OSR)要求分类器拒绝来自未见类别的输入,这在医学成像等安全关键场景中至关重要。基于单纯形的方法将类原型固定在正单纯形的顶点上,并通过距离比分数进行拒绝,经验上表现良好,但缺乏理论依据;且现有分析仅适用于嵌入维度 d 至少为 C-1 的情形,这是正单纯形存在的情形。我们给出了单纯形比 OSR 的理论分析,该分析适用于所有嵌入维度,包括 d < C-1。我们的分析集中于平衡等范数码:即具有等长和零和的原型配置,它们对所有 d ≥ 2 均存在,并将正单纯形作为特殊情况包含在内。对于这些码,我们证明了一个辅助平方比分数的次水平集是欧几里得球的精确并集,进而界定了操作分数的接受区域;并且我们证明了一个严格的二分法:原型获得一距离对称性(表现得如同正单纯形),当且仅当 d ≥ C-1,在该阈值以下,退化程度由一个明确的缺陷参数控制。此外,我们在自然各向同性假设下证明错误接受率随维度 d 指数衰减,且操作分数具有全局 Lipschitz 连续性,其接受区域是紧致的。实验上,我们将平衡原型几何既作为分析工具也作为表示学习先验进行研究,而非作为独立的最先进检测器。在 CIFAR 和 MedMNIST 的开放集划分上,该几何结构提供了有用的信息,但 OSR 性能仍强烈依赖于评分规则:原始比分数通常表现不如最近邻及基于 logit 的替代方法。

Abstract

Open-set recognition (OSR) requires a classifier to reject inputs from unseen classes, which is essential in safety-critical settings such as medical imaging. Simplex-based methods, which fix class prototypes at the vertices of a regular simplex and then reject via a distance-ratio score, perform well empirically but lack theoretical justification, and existing analysis applies only when the embedding dimension d is at least C-1, which is the regime in which a regular simplex exists. We give a theoretical account of simplex-ratio OSR that holds in every embedding dimension, including d < C-1. Our analysis centers on balanced equal-norm codes: prototype configurations with equal lengths and zero sum, which exist for all d >= 2 and include the regular simplex as a special case. For these codes we show that an auxiliary squared ratio score has sublevel sets that are exact unions of Euclidean balls, which in turn bracket the acceptance region of the operational score; and we prove a sharp dichotomy: the prototypes attain one-distance symmetry, behaving like a regular simplex, if and only if d >= C-1, with controlled degradation governed by an explicit defect parameter below that threshold. We further show the false-acceptance rate decays exponentially in d under natural isotropy assumptions, and that the operational score is globally Lipschitz with compact acceptance regions. Empirically, we study balanced prototype geometry as both an analytic tool and a representation-learning prior, rather than as a stand-alone state-of-the-art detector. Across CIFAR and MedMNIST open-set splits, the geometry provides useful structure, but OSR performance remains strongly dependent on the scoring rule: raw ratio scores typically underperform nearest-neighbor and logit-based alternatives.
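
The three properties named in the abstract (equal lengths, zero sum, a single pairwise distance) can be verified numerically for a regular simplex. The following is a minimal sketch, not the paper's construction: it embeds C prototypes in R^C (they span a subspace of dimension C-1) by centring and normalising the standard basis vectors, and `ratio_score` is a hypothetical nearest-over-second-nearest distance ratio used only to illustrate rejection, since the abstract does not define the operational score.

```python
import itertools
import math

def simplex_prototypes(num_classes):
    """Unit vectors in R^C forming a regular simplex centred at the origin."""
    protos = []
    for i in range(num_classes):
        v = [(1.0 if j == i else 0.0) - 1.0 / num_classes for j in range(num_classes)]
        norm = math.sqrt(sum(x * x for x in v))
        protos.append([x / norm for x in v])
    return protos

def ratio_score(z, protos):
    """Hypothetical score: nearest distance divided by second-nearest distance."""
    d = sorted(math.dist(z, p) for p in protos)
    return d[0] / d[1]

P = simplex_prototypes(4)
norms = [math.dist(p, [0.0] * 4) for p in P]                        # equal lengths
coordinate_sums = [sum(col) for col in zip(*P)]                     # zero sum
pair_dists = [math.dist(p, q) for p, q in itertools.combinations(P, 2)]
```

Under this illustrative score, an embedding that coincides with a prototype scores 0, while the origin, equidistant from all prototypes, scores 1 and would be rejected by any threshold below 1.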

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on Open-Set Recognition (OSR) and prototype geometry theory for classification rejection. It does not involve multimodal model unification, tokenization, visual encoder architecture design, world models, MLLMs, multimodal fusion, or model-based reinforcement learning, hence all provided keywords are irrelevant. Total score 0.0 is below the dynamic passing score of 27.8.

Keywords

Open-set recognition, Prototype geometry, Balanced equal-norm codes, Simplex configuration, Embedding dimension, Scorer-agnostic, Classification rejection

Score: 0.0 / 27.8
Authors: Snigdha Chandan Khilar
Published: 2026-06-01
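TL;DR: This paper proposes Stefan-CL, a continual learning algorithm based on a physical phase-change mechanism that uses a moving-boundary model to balance knowledge retention against learning new tasks, without storing raw data.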
摘要翻译

持续学习 (Continual learning) 难以平衡保留过往知识与吸收新任务。Stefan-CL 通过熔化物理优雅地解决了这一稳定性 - 可塑性困境。它将巩固的知识视为受保护的“固体”,将未用容量视为可适应的“液体”。随着网络学习,这一边界不断扩展,由一个“潜热”调节旋钮控制。通过数学冻结已学习内部,Stefan-CL 将遗忘降至接近零,在不存储原始数据的情况下匹配内存密集型基线,为人工智能 (AI) 开辟了一条美丽且基于物理的路径。

Abstract

Continual learning struggles to balance retaining past knowledge with absorbing new tasks. Stefan-CL elegantly resolves this stability-plasticity dilemma through the physics of melting. It frames consolidated knowledge as a protected "solid" and unused capacity as an adaptable "liquid." As the network learns, this boundary expands, governed by a "latent heat" tuning dial. By mathematically freezing the learned interior, Stefan-CL cuts forgetting to near zero, matching memory-heavy baselines without storing raw data, forging a beautiful, physics-grounded path for AI.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

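Scoring rationale: The paper centers on continual learning, using a physical phase-change model to address the stability-plasticity problem. The provided keywords concern multimodal large models, reinforcement learning, and model components, which do not overlap with the paper's content, so every keyword scores 0. None of the designated experts appear in the author list. The weighted total of 0 is below the dynamic passing score of 27.8.

Keywords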

Continual Learning, Stefan-CL, Stability-Plasticity Dilemma, Moving-Boundary Problem, Physics-Inspired, Knowledge Retention, No Raw Data Storage

Score: 0.0 / 27.8
Authors: Peihan Liu, Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, Lillian Tsai, Alex Bie
Published: 2026-06-01
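TL;DR: This paper examines whether differentially private synthetic text can transmit the knowledge and capabilities of the original corpus, finding that non-private synthesis transfers substantial knowledge while existing DP methods fail to do so even under a loose privacy budget.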
摘要翻译

差分隐私(DP)文本合成有望解锁敏感语料库以供模型训练,但目前尚不确定差分隐私合成数据是否能够传递仅存在于这些语料库中的真正新知识和能力。这是因为现有的评估依赖于几乎无需训练即可解决的任务,因此强大的基准表现并不能证明差分隐私合成能够替代原始数据访问。因此,我们引入了 ContinuousBench,这是一个持续自动再生的基准,用于衡量从差分隐私合成文本中获得的能力增益。每季度,新版本将一个从未见过的训练语料库与一个派生的问答集(QA set)配对,其构造旨在满足:(1)脱离语料库则不可解;(2)在差分隐私条件下可学习,因为所测试的知识由数百条独立记录支撑。研究人员从训练语料库生成差分隐私合成数据,并在其合成数据上运行我们标准化的训练与评估框架以衡量增益。我们实例化了两个赛道:Geminon,一个关于虚构生物的程序化生成数据集;以及 News,一个新爬取的公共新闻文章流。尽管标准基准几乎已饱和,但在 ContinuousBench 上我们发现,非差分隐私合成从原始语料库转移了大量知识,而最先进的差分隐私合成方法通常无法做到这一点,甚至在 ε=100 时也是如此。

Abstract

Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute original data access. Thus, we introduce ContinuousBench, a continuously and automatically-regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never-before-seen training corpus with a derived QA set, constructed to be: (1) unsolvable sans-corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally-generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non-private synthesis transfers substantial knowledge from the original corpus, while state-of-the-art DP synthesis methods generally fail to do so, even at $\varepsilon=100$.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

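Scoring rationale: The paper focuses on differentially private (DP) text synthesis and capability benchmarking (ContinuousBench), aiming to assess whether synthetic data can transmit knowledge from the original corpus. The provided keywords (such as multimodal, world models, reinforcement learning, and visual encoders) all belong to the multimodal and reinforcement-learning areas and have no direct connection to the paper's privacy-preserving text synthesis and benchmarking topic, so every keyword receives a relevance of 0. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords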

Differentially Private, Synthetic Text, ContinuousBench, Capability Gain, Knowledge Transfer, Benchmarking, Text Synthesis

Score: 0.0 / 27.8
Authors: Dimitris Oikonomou, Nicolas Loizou
Published: 2026-06-01
TL;DR: This paper proposes adaptive Polyak step-size schedulers for Sharpness-Aware Minimization, proves their convergence, and substantially reduces the need for learning-rate tuning.
摘要翻译

锐度感知最小化(Sharpness-Aware Minimization, SAM)已成为一种强大且广泛采用的优化器,用于训练机器学习模型。通过显式最小化损失景观(loss landscape)的锐度,SAM 通常能提高泛化能力,同时展现出强大的经验性能。然而,SAM 及其变体,如同大多数训练算法一样,对学习率的选择敏感,而学习率通常是通过广泛的超参数调优或预定义的调度器(schedulers)来选择的。受近期关于随机 Polyak 步长在随机梯度下降(Stochastic Gradient Descent, SGD)中有效性的进展启发,我们推导出针对 SAM 风格更新定制的 Polyak 调度器,从而在确定性和随机设置中生成新的自适应算法。在光滑设置中,我们证明了对于强凸目标具有线性收敛性,而对于凸目标在确定情况下具有 $\mathcal{O}(1/T)$ 的收敛率。在随机设置中,我们建立了类似的收敛保证,收敛至最优解邻域内。数值实验表明,所提出的 Polyak 调度器实现的性能与精心调优的 SAM 基线相当或更好,同时大幅减少了对学习率调优的需求。

Abstract

Sharpness-Aware Minimization (SAM) has established itself as a powerful and widely adopted optimizer for training machine learning models. By explicitly minimizing the sharpness of the loss landscape, SAM often improves generalization while delivering strong empirical performance. However, SAM and its variants, like most training algorithms, are sensitive to the choice of learning rate, which is typically selected through extensive hyperparameter tuning or predefined schedulers. In this work, motivated by recent advances on the effectiveness of stochastic Polyak step sizes for Stochastic Gradient Descent (SGD), we derive Polyak schedulers tailored to SAM-style updates, yielding novel adaptive algorithms in both deterministic and stochastic settings. In the smooth setting, we prove linear convergence for strongly convex objectives and an $\mathcal{O}(1/T)$ convergence rate for convex objectives in the deterministic case. In the stochastic setting, we establish analogous convergence guarantees up to a neighborhood of the optimum. Numerical experiments demonstrate that the proposed Polyak schedulers achieve performance comparable to or better than carefully tuned SAM baselines, while substantially reducing the need for learning-rate tuning.
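
For background, the classical Polyak step size that motivates the paper sets the learning rate to (f(x) - f*) / ||grad f(x)||^2, so no learning rate has to be tuned when the optimal value f* is known. The sketch below applies this rule to plain gradient descent on a small convex quadratic; it does not reproduce the paper's SAM-tailored schedulers, and the objective, its curvatures `A`, and the starting point are assumptions chosen for illustration.

```python
A = (1.0, 10.0)   # hypothetical curvatures of a diagonal quadratic
F_STAR = 0.0      # known optimal value, required by the Polyak rule

def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A, x))

def grad(x):
    return [a * xi for a, xi in zip(A, x)]

def polyak_gd(x, steps=5000):
    """Gradient descent whose step size is recomputed from f(x) - f* at each step."""
    for _ in range(steps):
        g = grad(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:  # already at a stationary point
            break
        eta = (f(x) - F_STAR) / g_norm_sq  # classical Polyak step size
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x

x_final = polyak_gd([3.0, 1.0])
print(f(x_final))
```

On this strongly convex quadratic the distance to the optimum shrinks at every step, which matches the linear-convergence regime the abstract mentions for strongly convex objectives.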

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

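Scoring rationale: The paper centers on optimization theory (SAM and Polyak step-size scheduling), covering convergence proofs and learning-rate scheduling, whereas the keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) points to multimodal large models, representation learning, and reinforcement learning. The topics have no direct connection, so every keyword receives a relevance of 0.

Keywords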

Sharpness-Aware Minimization, Polyak Step Size, Convergence Analysis, Learning Rate Scheduler, Optimization Algorithm, Strongly Convex, Stochastic Gradient Descent

Score: 0.0 / 27.8
Authors: Pu Wang, Yao-Xiang Ding
Published: 2026-06-01
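TL;DR: This paper proposes Tree-Guided Identify-Then-Exploit (TG-ITE), a unified framework for dueling bandits under the Condorcet-winner assumption that achieves O(N) sample complexity for best-arm identification, O(N) weak regret, and O(N log T) strong regret.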
摘要翻译

本文在孔多塞获胜者(Condorcet-winner)假设下研究 $N$ 臂随机对决老虎机(dueling bandits),其中考虑了三个广泛采用的目标:最佳臂识别(BAI)、弱遗憾(weak regret)和强遗憾(strong regret)。我们提出树引导识别后利用(Tree-Guided Identify-Then-Exploit, TG-ITE),据我们所知,这是首个能够统一处理所有这些目标的框架。在不要求更强假设的前提下,我们提出一种共享的树引导识别方法,能够在 $O(N)$ 次比较内找到高置信度的 incumbent(当前最优臂)。我们进一步提出多种利用策略,以利用这一热身阶段来优化当前面临的具体目标。该方法使得我们能够:(1) 在不采用通常更强的假设前提下,在最佳臂识别(BAI)中达到 $O(N)$ 的样本复杂度;(2) 构建首个胜者留位式(winner-stays-style)算法,以实现 $O(N)$ 的弱遗憾;(3) 享有与专门针对强遗憾的方法相同的 $O(N \log T)$ 保证;(4) 实现最佳臂识别(BAI)与弱遗憾的联合优化,两者均获得 $O(N)$ 的保证,消除了现有方法中存在的 $O(\log N)$ 次优差距。我们的结果表明,在对决老虎机(dueling bandits)中,最佳臂识别(BAI)与遗憾最小化之间的权衡相对温和。

Abstract

We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

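Scoring rationale: The paper studies an algorithmic framework for dueling bandits, which falls under online learning theory within reinforcement learning. The provided keyword set (e.g., Tokenizer, Visual Encoder, MLLM, MultiModal, World Models) mainly points to multimodal large models, representation learning, and world models. The two are unrelated in technical approach, object of study, and application setting, so the relevance score is 0.

Keywords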

Dueling Bandits, Best Arm Identification, Regret Minimization, Unified Framework, Tree-Guided Identify-Then-Exploit, Sample Complexity, Condorcet-winner

Score: 0.0 / 27.8
Authors: Yue Wu, Weiqiang Zheng, Yang Cai, Haipeng Luo
Published: 2026-06-01
TL;DR: This paper proposes a dynamic power-law stepsize schedule to accelerate the last-iterate convergence rate of the Extragradient method in unconstrained biaffine min-max optimization.
摘要翻译

我们重新考察了无约束双仿射极小极大优化中外梯度法(Extragradient, EG)的收敛性保证。已知使用固定步长的 EG 能达到 $\Theta(T^{-1/2})$ 的最后迭代收敛率,这比通过引入锚定(anchoring)等额外机制可达到的最优 $\mathcal{O}(T^{-1})$ 速率要慢。受近期进展的启发,这些进展表明仅动态步长就能显著加速梯度下降,我们询问动态步长是否能同样加速 EG 的最后迭代收敛。我们在此方向上给出了第一个正面结果。具体来说,我们提供了一个确定性动态步长调度,将 EG 的收敛率加速至 $\mathcal{O}(T^{-2/3+\varepsilon})$,其中 $\varepsilon > 0$ 为任意常数。我们还表明,当 EG 的外推步和更新步使用相同步长时,该速率是紧的。随后我们表明,允许外推步和更新步使用不同的步长进一步将收敛率提高至近最优的 $\mathcal{O}(T^{-1+\varepsilon})$。我们的分析将步长调度归结为一个优化问题,其解导出的步长调度遵循(离散化)幂律分布。我们提出的步长调度及分析方法可扩展至其他方法,例如乐观梯度法(Optimistic Gradient, OG),并暗示了对一般极小极大优化问题更广泛的应用前景。

Abstract

We revisit the convergence guarantees of the Extragradient (EG) method for unconstrained biaffine min-max optimization. It is known that EG with a fixed stepsize achieves a $\Theta(T^{-1/2})$ last-iterate convergence rate, which is slower than the optimal $\mathcal{O}(T^{-1})$ rate attainable by incorporating additional mechanisms such as anchoring. Motivated by recent advances showing that dynamic stepsizes alone can significantly accelerate gradient descent, we ask whether dynamic stepsizes can similarly accelerate the last-iterate convergence of EG. We present the first positive result in this direction. Specifically, we provide a deterministic dynamic stepsize schedule that accelerates the convergence rate of EG to $\mathcal{O}(T^{-2/3+\varepsilon})$ for any $\varepsilon > 0$. We also show that this rate is tight when the extrapolation and update steps of EG use the same stepsize. We then show that allowing different stepsizes for the extrapolation and update steps further improves the convergence rate to the near-optimal $\mathcal{O}(T^{-1+\varepsilon})$. Our analysis reduces stepsize scheduling to an optimization problem, whose solution leads to a stepsize schedule that follows (a discretization of) a power-law distribution. Our proposed stepsize schedules and analysis extend to other methods, such as Optimistic Gradient (OG), and suggest broader applicability to general min-max optimization problems.
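
To make the two steps of EG concrete, the sketch below runs the standard extragradient update on the scalar bilinear problem f(x, y) = x * y, where x minimizes and y maximizes, and contrasts it with simultaneous gradient descent-ascent. This is an illustration only: the toy problem, the fixed stepsize of 0.5, and the function names are assumptions, the paper's power-law schedule is not reproduced, and a single scalar instance says nothing about the worst-case rates quoted in the abstract. The separate `eta_extra` and `eta_update` parameters mirror the abstract's distinction between extrapolation and update stepsizes.

```python
import math

def extragradient(x, y, eta_extra, eta_update, steps):
    """EG on f(x, y) = x * y, with df/dx = y and df/dy = x."""
    for _ in range(steps):
        # Extrapolation step: look ahead using gradients at the current point.
        x_half = x - eta_extra * y
        y_half = y + eta_extra * x
        # Update step: move from the current point using look-ahead gradients.
        x, y = x - eta_update * y_half, y + eta_update * x_half
    return x, y

def gradient_descent_ascent(x, y, eta, steps):
    """Simultaneous descent in x and ascent in y, shown for comparison."""
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x
    return x, y

eg_norm = math.hypot(*extragradient(1.0, 1.0, 0.5, 0.5, 200))
gda_norm = math.hypot(*gradient_descent_ascent(1.0, 1.0, 0.5, 200))
print(eg_norm, gda_norm)
```

With equal stepsizes eta, each EG iteration scales the squared distance to the saddle point (0, 0) by 1 - eta^2 + eta^4, which is below 1 for 0 < eta < 1, while descent-ascent scales it by 1 + eta^2 and spirals outward.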

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

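Scoring rationale: The paper focuses on theoretical optimization algorithms (the extragradient method and min-max convergence) rather than multimodal architectures or world models. It makes no mention of tokenizers, visual encoders, MLLMs, or unified multimodal frameworks. Although optimization underlies reinforcement learning, the paper does not specifically discuss model-based reinforcement learning systems or world modelling, so it has no direct content overlap with the given keywords.

Keywords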

Min-Max Optimization, Extragradient Method, Dynamic Stepsize, Convergence Rate, Power-Law Distribution, Optimization Algorithm, Theoretical Analysis

Score: 0.0 / 27.8
Authors: Rai Hisada, Kanji Tanaka
Published: 2026-06-01
摘要翻译

本文提出了一种名为"FlatVPR"的新型几何校正范式,该范式通过强制一种特征流形结构,有效平衡了视觉定位 (VPR) 中地图轻量化与定位精度之间的权衡。在该结构中,任意两个相邻锚点 $\mathbf{z}_A$ 和 $\mathbf{z}_B$ 之间的描述子均可通过线性插值 $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$ 准确重构,其中 $t \in [0,1]$ 表示相对位置。尽管像 DINOv2-ViT-S/14 这样的最新基础模型提供了鲁棒的语义特征,但其潜在流形表现出显著曲率,将物理空间中的均匀线性运动投影到特征空间中的高度非线性轨迹上,这在稀疏锚点条件下阻碍了可靠的重构。为了实现上述基于插值的重构,我们对原始基础特征 $\mathbf{z}$ 引入残差变换 $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$,其中 $\text{Res}(\cdot)$ 表示一个可学习适配器。我们的方法使用一种基于数学原理的 Pullback Flatness Loss (拉回平坦度损失) 显式地抑制流形曲率,该损失最小化中间特征与连接相邻锚点的线性段之间的偏差,从而最小化流形的内在曲率。通过这种空间平坦化,地图构建被表述在一个期望最大化 (EM) 框架内,解耦为用于流形适应的连续 M 步 (M-step) 和用于最优锚点选择指导的概念性 E 步 (E-step)。在 NCLT 数据集上的实验表明,即使在锚点间隔为 100 米的极端稀疏锚点条件和极端季节变化下,我们的适配器的应用也带来了显著的性能提升。

Abstract

This paper proposes "FlatVPR," a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100 m intervals and extreme seasonal changes.
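
The interpolation formula in the abstract, and the idea of penalising how far an intermediate descriptor lies from the segment between two anchors, can be sketched in a few lines. This is one plausible reading only: `flatness_deviation` is a hypothetical squared-distance penalty, not the paper's Pullback Flatness Loss, whose exact form the abstract does not give, and the two-dimensional descriptors are made up.

```python
def interpolate(z_a, z_b, t):
    """Pseudo-descriptor (1 - t) * z_a + t * z_b for relative position t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(z_a, z_b)]

def flatness_deviation(z_mid, z_a, z_b, t):
    """Hypothetical penalty: squared distance from z_mid to the interpolated point."""
    target = interpolate(z_a, z_b, t)
    return sum((m - p) ** 2 for m, p in zip(z_mid, target))

# Hypothetical 2-D descriptors for two anchors and two candidate midpoints.
z_a, z_b = [0.0, 0.0], [2.0, 0.0]
flat_mid = [1.0, 0.0]     # lies on the segment between the anchors
curved_mid = [1.0, 0.5]   # bulges away from the segment

print(flatness_deviation(flat_mid, z_a, z_b, 0.5))
print(flatness_deviation(curved_mid, z_a, z_b, 0.5))
```

A descriptor on the segment incurs no penalty, while the curved one does, which is the behaviour an adapter trained to flatten the manifold would be pushed to remove.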

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 65 (char 288)

Score: 0.0 / 27.8
Authors: Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang
Published: 2026-06-01
TL;DR: This paper analyzes the finite-sample generalization stability of a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates, deriving upper-tail guarantees under heterogeneous client data.
摘要翻译

我们研究了具有矩阵值参数和正交化动量更新的客户端采样分布式优化方案的有限样本泛化。核心量是在每轮仅有部分客户端参与时,返回模型处的总体目标与经验目标之间的差距。在独立异构客户端数据、不等的局部样本数以及固定聚合权重的假设下,我们基于耦合邻居稳定性递归和加权集中步骤,推导出了有限轮次的上尾保证。该界通过放大因子 \(Y_i(\mathcal C)\) 保留了客户端选择计数;在均匀全参与全批处理情形下,只要 horizon 依赖的放大项得到控制,它就能给出 \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) 的缩放。矩阵正交化规则要求沿配对轨迹满足 Lipschitz 条件,该条件由正则化极型映射和归一化有限步 Newton--Schulz 正交化器满足。对于未正则化的矩阵符号,同样的论证要求耦合谱分离,而 Gaussian 平滑则给出了一个有限轮次的平滑变体。一个一维反例说明了为何间隙、平滑或正则性条件是必要的。

Abstract

We study finite-sample generalization for a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates. The central quantity is the gap between the population and empirical objectives at the returned model when only a subset of clients participates in each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor \(Y_i(\mathcal C)\); in the uniform full-participation full-batch regime, it yields \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton-Schulz orthogonalizers. For the unregularized matrix sign, the same argument requires coupled spectral separation, whereas Gaussian smoothing gives a finite-round smoothed variant. A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.
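
As an illustration of what a normalized finite-step Newton-Schulz orthogonalizer does, the sketch below scales a matrix by its Frobenius norm and then applies a fixed number of the classical cubic iterations X <- 1.5 X - 0.5 X X^T X, which drive every singular value toward 1. The cubic coefficients, the step count, and the 2x2 example are assumptions; the abstract does not specify which Newton-Schulz variant the paper analyzes.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(M, steps=15):
    """Approximate the orthogonal polar factor of a full-rank square matrix."""
    fro = math.sqrt(sum(v * v for row in M for v in row))
    X = [[v / fro for v in row] for row in M]  # normalization: singular values <= 1
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

Q = newton_schulz([[2.0, 1.0], [0.0, 1.0]])
gram = matmul(Q, transpose(Q))  # close to the identity when Q is orthogonal
print(gram)
```

Because the number of steps is fixed, the map is a composition of polynomials in the normalized input, which is the kind of smooth orthogonalization rule the abstract contrasts with the unregularized matrix sign.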

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper addresses theoretical stability in distributed optimization with matrix-valued parameters and orthogonalized momentum under client sampling. The provided keywords specifically target Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning architectures (e.g., Tokenizer, Visual Encoder). There is no conceptual overlap between optimization stability analysis and multimodal/RL model architectures. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.

Keywords

Distributed Optimization, Matrix-valued Parameters, Orthogonalized Momentum, Client Sampling, Stability Analysis, Finite-sample Generalization, Upper-tail Guarantee

Score: 0.0 / 27.8
Authors: Swapnil Parekh
Published: 2026-06-01
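TL;DR: CANARY is a zero-label framework based on sparse autoencoders over hidden states that detects fine-tuning contamination in language models with high accuracy, overcoming the limitations of output-level defenses.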
摘要翻译

对手仅需污染 1% 的微调样本,即可植入潜伏的有害行为。这种污染对所有输出级别防御均不可见:有害行为潜伏于模型的隐藏状态几何结构中,仅在污染程度超过 7.5% 时才会显现于生成文本。我们引入 CANARY(Contamination Auditor via Neural Activation Representation Yield),这是一种零标签检查点审计器,它仅通过对无标签提示集进行两次前向传播,即可直接检测这种隐藏偏移。CANARY 通过稀疏自编码器(Sparse Autoencoder, SAE)投影隐藏状态差异,过滤风格噪声以隔离有意义的语义漂移。在四种模型架构和两种训练范式下,该方法在 1% 污染程度下即达到 AUROC = 1.000(95% 置信区间 CI = [0.997, 1.000]; Cohen's d = 3.28),其检测阈值比任何输出级别方法触发点低 7.5 倍,且在良性微调上假阳性为零,对风格匹配及梯度噪声自适应攻击具有完全鲁棒性。相同的 SAE 特征基础驱动完整的治理管道:经 SAE 过滤的增强机制揭示潜在危害的速率比标准生成高 5 倍;按分数排序的提示使红队测试提升达 4.2 倍;且在推理时抑制少量污染特异性特征,可将危害从 70% 降至 10%,且不带来困惑度惩罚。CANARY 是首个仅基于隐藏状态即可检测、验证、优先排序并修复供应链污染的零标签框架。

Abstract

Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

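Scoring rationale: The paper focuses on detecting and governing fine-tuning contamination in language models, with sparse-autoencoder analysis of hidden states as its core method. The provided keywords concern multimodal models, world models, and reinforcement learning, which have no direct connection to the paper's topic, so the relevance score is 0. The author list does not include Yang Shi or any of the other designated experts, so no bonus is applied.

Keywords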

Zero-Label Detection, Fine-Tuning Contamination, Language Models, Sparse Autoencoder, Hidden-State Geometry, Supply-Chain Contamination, Governance Pipeline

Score: 0.0 / 27.8
Authors: Peiqing Chen, Jiedong Jiang, Nengneng Yu, Yuefeng Wang, Sixian Xiong, Wei Wang, Zaoxing Liu
Published: 2026-06-01
TL;DR: The paper proposes OptCC, a fault-tolerant AllReduce algorithm that minimizes performance overhead caused by network failures in GPU clusters, achieving near-optimal communication efficiency.
摘要翻译

网络故障是大规模 GPU 集群中最频繁的硬件故障之一,也是导致训练任务中断的主要原因。现代的集体通信库(如 NCCL)通过在同一服务器上利用幸存的网络接口卡(NICs)重新路由流量来缓解网络故障,以牺牲节点间带宽为代价换取不间断的训练。然而,降级后的服务器仍处于标准环形算法的关键路径上,从而拖慢了整个集体操作。我们提出了在非对称网络带宽下 AllReduce 完成时间的首个信息论下界,并证明当慢节点保留至少一半原始带宽时,相对于无故障最优解的不可避免开销仅为 O(1/p)(其中 p 为 GPU 数量)。随后,我们设计了 OptCC,这是一种四阶段流水线 AllReduce 算法,该算法接近此下界。在 SimAI 上的实验证实,OptCC 填补了现有容错方案留下的差距:在高达 50% 带宽损失的实际网络故障下,OptCC 完成 AllReduce 的时间仅比 NCCL 的无故障环形性能高出 2-6%,而现有最先进方法则会产生高达 57% 的开销。

Abstract

Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing the entire collective. We present the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth and show that when the straggler retains at least half of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on SimAI confirm that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2-6% of NCCL's fault-free ring performance, whereas the state-of-the-art incurs up to 57% overhead.
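
Why one degraded server slows the whole collective can be seen from the textbook bandwidth-only cost model of ring AllReduce, in which each of the p participants sends and receives 2(p-1)/p of the message and the ring advances at the pace of its slowest link. The sketch below encodes that simplified model; it ignores latency terms, it is neither the paper's lower bound nor the OptCC algorithm, and the message size and bandwidth figures are hypothetical.

```python
def ring_allreduce_time(message_bytes, link_bandwidths):
    """Bandwidth-only estimate: the slowest link paces every step of the ring."""
    p = len(link_bandwidths)
    per_gpu_traffic = 2 * (p - 1) / p * message_bytes  # reduce-scatter + all-gather
    return per_gpu_traffic / min(link_bandwidths)

MESSAGE_BYTES = 1e9                   # hypothetical 1 GB gradient buffer
healthy = [50e9] * 8                  # hypothetical 50 GB/s per participant
degraded = [50e9] * 7 + [25e9]        # one straggler at half bandwidth

slowdown = (ring_allreduce_time(MESSAGE_BYTES, degraded)
            / ring_allreduce_time(MESSAGE_BYTES, healthy))
print(slowdown)
```

In this model a single straggler at half bandwidth doubles the completion time of the standard ring, which is the gap the abstract says a straggler-aware algorithm can largely close.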

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0
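The score table can be read as a weighted sum that is compared against the passing score shown in the `Score:` line. The sketch below assumes that each row's score is weight times relevance and that a paper passes when the total reaches the threshold; neither rule is stated in the report, and the class and function names are hypothetical. Any bonus for expert authors is not modelled.

```python
from dataclasses import dataclass


@dataclass
class KeywordScore:
    keyword: str
    weight: float
    relevance: float  # on the report's 0-10 scale


def weighted_total(rows):
    """Sum of weight * relevance over all keyword rows (assumed formula)."""
    return sum(row.weight * row.relevance for row in rows)


def passes(rows, passing_score):
    """Assumed rule: pass when the total reaches the passing score."""
    return weighted_total(rows) >= passing_score


KEYWORDS = [
    "Unify Models", "Tokenizer", "Visual Encoder", "World Models",
    "MLLM", "MultiModal", "model-based RL",
]
# Every row in the table above has weight 1.5 and relevance 0.0.
rows = [KeywordScore(name, 1.5, 0.0) for name in KEYWORDS]
print(weighted_total(rows), passes(rows, 27.8))
```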

Scoring rationale: The paper focuses on distributed systems and fault-tolerant collective communication (AllReduce) in GPU clusters. The provided keywords pertain to multimodal model architectures (MLLM, MultiModal), specific components (Tokenizer, Visual Encoder), and learning paradigms (World Models, model-based RL). There is no direct relevance between the systems engineering focus of the paper and the model-centric keywords. Additionally, none of the listed expert authors appear in the author list.

Keywords

Network failures, AllReduce, Collective communication, GPU clusters, Fault tolerance, OptCC algorithm, Bandwidth loss, Pipelined

Score: 0.0 / 27.8
Authors: Bo Li, Chen Zhang
Published: 2026-06-01
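TL;DR: This paper quantifies the energy floor of SAC-driven HVAC control and finds that replay-buffer initialization bias, rather than algorithm design itself, is the main cause of sub-optimal solutions.
Abstract (Chinese translation)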

我们在 sbsim 校准的建筑模拟器上量化了软演员 - 评论家 (SAC) HVAC 控制下的能耗下限——即在动作空间约束下可实现的最小成本。通过最小动作实验,我们直接测得该下限为每天 35.51 美元,主要由持续电力负荷主导(35.44 美元,占比 99.8%),燃气消耗可忽略不计。标准 SAC 基线(使用基于调度策略的重放缓冲区转换初始化)收敛至每天 37.18 美元,比下限高出 4.7%。我们识别出缓冲区初始化是该场景下次优性的主要来源:从空缓冲区开始训练将成本降至每天 35.57 美元,消除了 96% 的差距。将供水温度范围扩大 10 K 可产生可忽略不计的额外节省(每天 0.03 美元),而进一步扩展则会触发物理约束违规。我们还发现存在一个折扣因子耦合(gamma_eff = 0.891),将有效规划时长从 8.3 小时缩短至 46 分钟——这是一个普遍存在的问题,值得审查。针对规划时长、奖励权重和观测增强进行的系统性消融实验确认,所有预填充缓冲区配置均聚集在 0.7% 范围内(37.18–37.42 美元),这表明设备最小功率——而非算法设计——施加了紧约束。

Abstract

We quantify the energy floor -- the minimum achievable cost given action space constraints -- for Soft Actor-Critic (SAC) HVAC control on the sbsim calibrated building simulator. Through minimum-action experiments, we directly measure this floor at USD 35.51/day, dominated by continuous electrical loads (USD 35.44, 99.8%) with negligible gas consumption. The standard SAC baseline, initialized with schedule-policy replay buffer transitions, converges to USD 37.18/day, 4.7% above the floor. We identify buffer initialization as the dominant source of sub-optimality in this scenario: training from an empty buffer reduces cost to USD 35.57/day, eliminating 96% of the gap. Expanding the supply water temperature range by 10 K yields negligible additional savings (USD 0.03/day), and further expansion triggers physical constraint violations. We additionally uncover a discount factor coupling (gamma_eff = 0.891) shrinking the effective planning horizon from 8.3 h to 46 min -- a benchmark-wide issue warranting audit. Systematic ablation across planning horizon, reward weights, and observation enrichment confirms all pre-filled-buffer configurations cluster within 0.7% (USD 37.18--USD 37.42), demonstrating that equipment minimum power -- not algorithmic design -- imposes the binding constraint.
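The percentages in the abstract can be checked with a few lines of arithmetic. The cost figures below are taken from the abstract; the 5-minute control step and the nominal discount factor of 0.99 are assumptions inferred from the quoted horizons (they are not stated in the abstract), and the horizon uses the common rule of thumb of 1 / (1 - gamma) steps.

```python
# Costs quoted in the abstract, in USD per day.
FLOOR = 35.51          # measured energy floor
BASELINE = 37.18       # SAC with a pre-filled replay buffer
EMPTY_BUFFER = 35.57   # SAC trained from an empty buffer

# ASSUMPTIONS (inferred, not stated in the abstract).
STEP_MINUTES = 5.0
NOMINAL_GAMMA = 0.99
EFFECTIVE_GAMMA = 0.891  # gamma_eff from the abstract


def gap_above_floor_pct(cost, floor=FLOOR):
    """How far a daily cost sits above the floor, in percent of the floor."""
    return 100.0 * (cost - floor) / floor


def gap_closed_pct(new_cost, baseline=BASELINE, floor=FLOOR):
    """Share of the baseline-to-floor gap removed by a new configuration."""
    return 100.0 * (baseline - new_cost) / (baseline - floor)


def effective_horizon_minutes(gamma, step_minutes=STEP_MINUTES):
    """Rule-of-thumb planning horizon: 1 / (1 - gamma) steps."""
    return step_minutes / (1.0 - gamma)


print(round(gap_above_floor_pct(BASELINE), 1))                    # 4.7
print(round(gap_closed_pct(EMPTY_BUFFER), 1))                     # 96.4
print(round(effective_horizon_minutes(NOMINAL_GAMMA) / 60.0, 1))  # 8.3
print(round(effective_horizon_minutes(EFFECTIVE_GAMMA)))          # 46
```

Under these assumptions the quoted 4.7%, 96%, 8.3 h, and 46 min figures are mutually consistent.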

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

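Scoring rationale: The paper focuses on quantifying the energy floor of SAC-based HVAC control and on analyzing replay-buffer bias, an application of reinforcement learning to building control. It does not involve core concepts such as multimodal large language models, unified model architectures, tokenizers, visual encoders, or world models. Although it uses the sbsim simulator, this is a physical simulation environment rather than a learned world model; SAC is generally regarded as a model-free algorithm, so it does not match model-based RL. All given keywords are therefore unrelated to the paper's topic. The author list does not include any of the specified experts.

Keywords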

SAC, HVAC Control, Energy Floor, Replay Buffer Bias, sbsim, Building Simulator, Discount Factor

Score: 0.0 / 27.8
Authors: Zichao Yue, Zhiru Zhang
Published: 2026-06-01
TL;DR: The paper proposes FilterMoE, a mixture-of-experts pre-propagation GNN that jointly routes Chebyshev filter experts over nodes and channels, achieving state-of-the-art performance on graph benchmarks.
Abstract (Chinese translation)

预传播图神经网络(PPGNNs)将所有图依赖计算移至预处理步骤,并仅基于生成的密集跳步特征进行训练,这使得它们具有高度可扩展性。该范式下存在的一个困惑在于,更复杂的跳步聚合器并不能可靠地优于更简单的聚合器:在许多基准测试中,朴素的基于多层感知机(MLP)的聚合器持平或优于基于跳步注意力的变体。我们从图滤波器视角重新审视了这一行为。在预计算的扩散基之上,现有的 PPGNNs 主要区别在于滤波器系数如何在节点和特征通道之间共享,而不仅仅是原始聚合器容量。基于多层感知机(MLP)的架构学习通道依赖的滤波器,这些滤波器主要在节点之间共享,而基于跳步注意力的架构学习节点依赖的混合,这些混合主要在通道之间共享。这揭示了标准 PPGNN 设计中缺失的一个范式:在预传播计算契约下的联合节点和通道自适应滤波。我们提出 FilterMoE,一种专家混合(Mixture-of-Experts)PPGNN,其中一小组可学习的切比雪夫(Chebyshev)滤波器专家通过三维门控张量在节点和通道之间联合路由。在十一个同调和异调基准测试上,FilterMoE 在九个数据集上优于强 PPGNN 基线,并在所有三个大型基准测试上排名第一,平均测试分数提高了 1.53 分。这些结果确立了联合节点通道滤波器路由作为数据集特定跳步聚合器选择的稳健替代方案。

Abstract

Pre-propagation graph neural networks (PPGNNs) push all graph-dependent computation into a preprocessing step and train only on the resulting dense hop features, which makes them highly scalable. A puzzle in this regime is that more complex hop aggregators do not reliably outperform simpler ones: on many benchmarks, a plain MLP-based aggregator matches or beats hop-attention variants. We revisit this behavior from a graph-filter perspective. Over a precomputed diffusion basis, existing PPGNNs differ mainly in how filter coefficients are shared across nodes and feature channels, rather than simply in raw aggregator capacity. MLP-based architectures learn channel-dependent filters that are largely shared across nodes, while hop-attention-based architectures learn node-dependent mixtures that are largely shared across channels. This reveals a missing regime in standard PPGNN designs: joint node- and channel-adaptive filtering under the pre-propagation computational contract. We propose FilterMoE, a mixture-of-experts PPGNN in which a small bank of learnable Chebyshev filter experts is routed jointly over nodes and channels by a 3D gating tensor. Across eleven homophilic and heterophilic benchmarks, FilterMoE outperforms strong PPGNN baselines on nine datasets and ranks first on all three large-scale benchmarks, improving the average test score by 1.53 points. These results establish joint node-channel filter routing as a robust alternative to dataset-specific hop-aggregator selection.
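The idea of routing a bank of Chebyshev filter experts jointly over nodes and channels can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the FilterMoE implementation: the softmax gate, the tensor layouts, and all function names are hypothetical, and the gating logits are passed in rather than learned.

```python
import math


def matmul(a, b):
    """Dense matrix product for small nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def chebyshev_basis(l_scaled, x, order):
    """Precompute [T_0(L)X, ..., T_order(L)X] with T_k = 2 L T_{k-1} - T_{k-2}.

    l_scaled is the Laplacian rescaled to have eigenvalues in [-1, 1].
    """
    basis = [x]
    if order >= 1:
        basis.append(matmul(l_scaled, x))
    for _ in range(2, order + 1):
        lt = matmul(l_scaled, basis[-1])
        basis.append([[2.0 * lt[n][c] - basis[-2][n][c]
                       for c in range(len(x[0]))] for n in range(len(x))])
    return basis


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_filters(basis, expert_coeffs, gate_logits):
    """Mix experts per (node, channel) with a 3D gate of shape [N][C][E].

    basis[k][n][c] holds the k-th precomputed hop feature and
    expert_coeffs[e][k] the k-th coefficient of expert e.
    """
    n_nodes, n_channels = len(basis[0]), len(basis[0][0])
    out = [[0.0] * n_channels for _ in range(n_nodes)]
    for n in range(n_nodes):
        for c in range(n_channels):
            gate = softmax(gate_logits[n][c])
            for k in range(len(basis)):
                coeff = sum(g * expert_coeffs[e][k] for e, g in enumerate(gate))
                out[n][c] += coeff * basis[k][n][c]
    return out


# Two-node graph, one channel, two experts (all values hypothetical).
l_scaled = [[0.0, -1.0], [-1.0, 0.0]]
features = [[1.0], [3.0]]
hops = chebyshev_basis(l_scaled, features, order=1)
experts = [[1.0, 0.0], [0.0, 1.0]]          # expert 0 keeps T_0, expert 1 keeps T_1
uniform_gate = [[[0.0, 0.0]], [[0.0, 0.0]]]  # equal weight on both experts
print(route_filters(hops, experts, uniform_gate))
```

Because the gate is indexed by node and by channel, each (node, channel) pair gets its own effective filter coefficients, while the graph-dependent work stays in the one-off `chebyshev_basis` precomputation.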

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on Graph Neural Networks (GNNs) and filter routing (FilterMoE), while the provided keywords target Multimodal LLMs, World Models, and RL. There is no conceptual overlap regarding tokenizers, visual encoders, or model-based RL. Thus, all keyword scores are 0.0. Total weighted score is 0.0, below the dynamic passing score of 27.8. No target expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list (Zichao Yue, Zhiru Zhang).

Keywords

Pre-propagation GNNs, Graph filtering, Node-channel mixtures, Mixture of Experts, Chebyshev filter, 3D gating tensor, Homophilic graphs

Score: 0.0 / 27.8
Authors: Keito Wakatsuki, Hideaki Shimazaki
Published: 2026-06-01
TL;DR: This paper proposes an SDE-based sampler for Heavy-Tailed Diffusion Models using Student's t-distribution with a state-dependent diffusion coefficient to improve tail fidelity and induce self-regulating annealing.
Abstract (Chinese translation)

扩散模型已成为深度生成建模的主导框架。尽管标准高斯形式在理论上具有便利性,但其在重尾数据集上的适用性尚不明确。为此,重尾扩散模型(HTDMs)通过用学生 t 分布(Student's t-distribution)替换高斯分布来扩展标准形式,从而在重尾数据集上提高了尾部保真度。尽管在 HTDMs 中基于随机微分方程(SDE)的采样是可行的,但尚未得到充分探索。本文提出了一种用于 HTDMs 的基于 SDE 的采样器,该采样器显式引入了状态依赖的扩散系数。这种状态依赖性通过自适应调节有效噪声尺度,自然诱导了一种自调节退火机制。我们从理论上探讨了这一机制,并通过实验验证了其在从重尾分布中生成样本时的必要性。

Abstract

Diffusion models have emerged as a leading framework for deep generative modeling. While the standard Gaussian formulation is theoretically convenient, its suitability for heavy-tailed datasets remains unclear. To address this, heavy-tailed diffusion models (HTDMs) extend the standard formulation by replacing the Gaussian distribution with a Student's t-distribution, thereby improving tail fidelity on heavy-tailed datasets. Although stochastic differential equation (SDE)-based sampling is possible in HTDMs, it has not been fully explored. In this paper, we propose an SDE-based sampler for HTDMs that explicitly incorporates a state-dependent diffusion coefficient. This state dependence naturally induces a self-regulating annealing mechanism by adaptively modulating the effective noise scale. We theoretically explore this mechanism and experimentally verify its necessity for reproducing samples from a heavy-tailed distribution.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on Heavy-Tailed Diffusion Models (HTDMs) and SDE-based sampling mechanisms for generative modeling, specifically addressing heavy-tailed distributions using Student's t-distribution. The provided keywords pertain to Multimodal Large Language Models (MLLM), World Models, and Reinforcement Learning (e.g., Tokenizer, Visual Encoder, Unify Models). There is no mention of multimodality, tokenization, visual encoders, reinforcement learning, or world model architectures in the abstract, indicating no overlap with the specified keyword topics.

Keywords

Diffusion Models, Heavy-Tailed Diffusion Models, Student's t-distribution, SDE-based sampling, State-dependent diffusion coefficient, Self-regulating annealing, Generative modeling

Score: 0.0 / 27.8
Authors: Shinhoo Kang, Hai V. Nguyen, Tan Bui-Thanh
Published: 2026-06-01
Abstract (Chinese translation)

从数据中学习混沌动力系统不仅需要短期预测精度:学习到的模型必须保持吸引子几何结构及其不变统计量。轨迹(零阶)和雅可比(Jacobian,一阶)匹配监督向量场的数值及其切结构,但两者均无法约束场如何偏离其切平面弯曲。因此,模型可以在监督状态下匹配数值和切线,但其曲线可能与真实情况不同,尽管保持局部准确,却会漂移至虚假吸引子并扭曲长时间统计量。我们发现,强制二阶一致性可缓解这些失效,但在高维情形下,构建完整的 Hessian(黑塞)矩阵是计算上不可行的。我们提出了一种模型约束的随机雅可比(Jacobian)匹配方法,该方法在随机扰动的输入处比较真实向量场与学习向量场的雅可比矩阵。泰勒展开表明,期望的随机雅可比损失可分解为名义雅可比失配加上由噪声方差缩放的 Hessian 失配,从而以 $\mathcal{O}(d^2)$ 的代价隐式地强制二阶一致性,而无需构建 $\mathcal{O}(d^3)$ 的 Hessian 张量。该方法仅依赖雅可比矩阵的评估,因而可扩展至显式 Hessian 匹配无法处理的高维情形。数值实验证实了二阶方法的稳健性。在 Lorenz 63 系统中,一阶方法在极小时间监督下会产生灾难性的 Lyapunov 指数异常值,而二阶方法不仅能消除这些异常值,还能恢复正确的吸引子。在耦合 Lorenz 96 系统中,分布外强制扫描将各方法区分开来:所有方法在 $F=16$ 之前均达成一致,但当 $F>18$ 时,仅二阶方法能保持不变测度和 Lyapunov 谱。在这两个系统中,随机雅可比匹配的性能与显式 Hessian 匹配相当,但成本显著更低。

Abstract

Learning chaotic dynamical systems from data requires more than short-term predictive accuracy: the learned model must preserve the attractor geometry and its invariant statistics. Trajectory (zero-order) and Jacobian (first-order) matching supervise the values and tangent structure of the vector field, but neither constrains how the field bends away from its tangent plane. A model can thus match values and tangents at the supervised states yet curve differently from the truth, remaining locally accurate while drifting toward spurious attractors and distorting long-time statistics. We show that enforcing second-order consistency mitigates these failures, but forming the full Hessian is prohibitive in high dimensions. We propose model-constrained randomized Jacobian matching, which compares the Jacobians of the true and learned vector fields at randomly perturbed inputs. A Taylor expansion shows that the expected randomized Jacobian loss decomposes into the nominal Jacobian mismatch plus a Hessian mismatch scaled by the noise variance, implicitly enforcing second-order consistency at $\mathcal{O}(d^2)$ cost without forming the $\mathcal{O}(d^3)$ Hessian tensor. Using only Jacobian evaluations, the method scales to high dimensions where explicit Hessian matching does not. Numerical experiments confirm that second-order methods are robust. For Lorenz 63, first-order methods produce catastrophic Lyapunov-exponent outliers under minimal temporal supervision, which second-order methods eliminate while recovering the correct attractor. For coupled Lorenz 96, an out-of-distribution forcing sweep separates the methods: all agree up to $F=16$, but beyond $F=18$ only second-order methods preserve the invariant measure and Lyapunov spectrum. On both systems, randomized Jacobian matching performs comparably to explicit Hessian matching at much lower cost.
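The decomposition mentioned in the abstract can be sketched as follows. The notation is an assumption made for this illustration: $f$ is the true vector field, $g$ the learned one, $D(x) = J_f(x) - J_g(x)$ the Jacobian mismatch, $\Delta H(x)$ the Hessian mismatch tensor, and the perturbation is taken to be isotropic Gaussian. The expansion of $D$ is truncated at first order, so contributions from higher derivatives of the mismatch are dropped.

```latex
% Assumed notation: \Delta H_{ijk}(x) = \partial^2 (f_i - g_i) / \partial x_j \partial x_k,
% (\Delta H(x)[\varepsilon])_{ij} = \sum_k \Delta H_{ijk}(x)\,\varepsilon_k,
% \varepsilon \sim \mathcal{N}(0, \sigma^2 I_d).
\begin{align}
D(x+\varepsilon) &\approx D(x) + \Delta H(x)[\varepsilon], \\
\mathbb{E}_{\varepsilon}\,\lVert D(x+\varepsilon) \rVert_F^2
  &\approx \lVert D(x) \rVert_F^2
   + 2\,\big\langle D(x),\, \Delta H(x)[\mathbb{E}\,\varepsilon] \big\rangle_F
   + \mathbb{E}_{\varepsilon}\,\lVert \Delta H(x)[\varepsilon] \rVert_F^2 \\
  &= \lVert D(x) \rVert_F^2
   + \sigma^2 \sum_{i,j,k} \Delta H_{ijk}(x)^2 .
\end{align}
```

The cross term vanishes because $\mathbb{E}\,\varepsilon = 0$, and the last term follows from $\mathbb{E}[\varepsilon_k \varepsilon_l] = \sigma^2 \delta_{kl}$. Only the $d \times d$ Jacobians are evaluated, which is where the $\mathcal{O}(d^2)$ versus $\mathcal{O}(d^3)$ comparison in the abstract comes from.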

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 152 (char 375)

Score: 0.0 / 27.8
Authors: Danqing Wang, Akshay Sivaraman, Lei Li
Published: 2026-06-01
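TL;DR: The paper proposes CRAB-Bench and RUSE for evaluating LLM agents under complex task dependencies and realistic user simulation, finding that frontier models degrade substantially in realistic scenarios and tend to conceal their errors.
Abstract (Chinese translation)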

在真实的服务场景中评估大语言模型(LLM)代理,需要复杂的任务依赖、不完美的用户行为以及能够容纳多个有效解的评估方式。为了解决这一差距,我们引入了 CRAB-Bench(基于约束的真实代理基准)和 RUSE(真实用户模拟引擎)。CRAB-Bench 通过基于多个相互依赖实体的约束图生成任务,并引入结构化干扰项,要求代理在数千个误导性候选项中仔细推理,其中仅有极小比例的有效解。RUSE 用基于人类行为研究的真实用户取代了合作式、模板化的模拟器,这些用户实例化于多样化人格及四个行为维度之上。在四个前沿大语言模型(LLM)代理上的实验表明,最佳模型在 CRAB-Bench 上仅达到 61% 的 pass@1,而切换到 RUSE 会导致性能进一步下降高达 57%,这种下降主要集中在任务解决能力而非对话质量上。信息泄露(Information Disclosure)是最具破坏性的行为维度,与 RUSE 交互的代理不太可能承认错误,而是通过隐式修正来掩盖错误。

Abstract

Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions. Experiments on four frontier LLM agents show that the best model achieves only 61% pass@1 on CRAB-Bench, and switching to RUSE causes further drops of up to 57%, concentrated in task-solving ability rather than conversational quality. Information Disclosure is the most damaging behavioral dimension, and agents interacting with RUSE are less likely to admit mistakes, instead masking errors through implicit corrections.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

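Scoring rationale: The paper's main contribution is the CRAB-Bench evaluation benchmark and the RUSE user simulation engine, focusing on task dependency structure and human behavior simulation. It does not address unified model architectures, tokenizer design, visual encoders, world model construction, multimodal large language model techniques, or model-based reinforcement learning algorithms, so it has no direct relevance to the given keywords.

Keywords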

CRAB-Bench, LLM Agents, Task Dependencies, User Simulation, Evaluation Benchmark, Constraint Graph, Behavioral Dimensions

Score: 0.0 / 27.8
Authors: Liuliu Chen, Mike Conway, Jo Robinson, Vlada Rozova
Published: 2026-06-01
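TL;DR: The study finds that lexical and semantic differences in emergency department triage notes across hospitals reduce the generalization ability of self-harm prediction models.
Abstract (Chinese translation)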

急诊科(ED)就诊的自残表现(self-harm)与更高的自杀风险密切相关。自然语言处理(NLP)模型在单家医院的分诊记录(triage notes)中检测自残表现稳健,但在跨机构时性能往往下降。为了探究潜在原因,我们通过分析词汇特征(lexical characteristics)、高度相关的预测特征(predictive features)和显著主题(salient topics),比较了两家医院的急诊分诊记录。结果显示,尽管核心主题(core themes)一致,如自我中毒(self-poisoning)和自我伤害(self-injury),但不同医院在自残相关的词汇表达(lexical expression)和特征重要性(feature importance)上存在差异。这些记录差异(documentation differences)与跨站点性能(cross-site performance)下降相关。我们的发现提供了关于机构差异(institutional variation)如何影响临床文本(clinical text)中自残识别的见解,并突出了提高模型泛化能力(model generalisability)的潜在方法。

Abstract

Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown robust performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. Our findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

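Scoring rationale: The paper belongs to clinical natural language processing (clinical NLP) and studies the cross-institutional generalization of self-harm prediction models on emergency department triage notes, centering on differences in the lexical and semantic distribution of the text. The scoring keywords (e.g., MLLM, World Models, Visual Encoder, model-based RL, MultiModal, Unify Models) all focus on multimodal large-model architectures, world model construction, and reinforcement learning algorithms, which differ markedly from this paper's text-only analysis of medical data distribution shift and generalization. Apart from Tokenizer, which may relate indirectly to text preprocessing, none of the keywords is reflected in the model architecture, training method, or data type, so every relevance score is 0. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus applies.

Keywords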

Self-harm prediction, Emergency Department, Triage notes, Model generalization, Lexical variation, Semantic variation, Cross-institutional performance

Score: 0.0 / 27.8
Authors: Wenshuai Xu, You Song, Yuzhuo Cui, Minjie Ren, Qingjie Liu, Zhenghui Hu
Published: 2026-06-01
Abstract (Chinese translation)

主动微调(Active Finetuning)的主要目标是通过精心挑选的具有信息量或具有挑战性的数据对预训练模型进行微调,从而提升其在特定任务或领域上的性能。以往研究主要侧重于主动方面(即数据选择),同时统一采用全微调(Full Finetuning)进行模型适应,这由于分布偏移(Distribution Shift)不可避免地扭曲了预训练特征。当模型规模相对于微调数据量较大时,这一问题变得尤为突出,从而导致过拟合风险加剧。为了解决这一关键空白,我们正式提出了 FiAF 任务,该任务强调在主动学习中系统探索微调方法。我们提出 FACT 框架,这是一个兼具效率与简洁性的三阶段层次化微调框架,专门针对主动微调场景设计。我们的综合实验涵盖:(1) 三大主要数据集类别,包括经典数据集(CIFAR10、CIFAR100、ImageNet-1k)、不平衡数据集(CIFAR10-LT、CIFAR100-LT)和细粒度数据集(StanfordCars、FGVCAircraft)的图像分类任务,每种数据集均在 3-5 种不同的采样比率下进行评估;(2) 多样化的预训练架构,包括卷积神经网络(ConvNeXt)、视觉 Transformer(ViT)以及视觉 LSTM(ViL)网络;(3) 对冻结特征增强(FroFA)策略的系统性探究;(4) 对效率与泛化性的全面且严谨的分析。实验结果表明,该方法在泛化性和鲁棒性方面均表现出显著提升。值得注意的是,在低采样比率下,我们的框架在 ViT 模型上针对 CIFAR10、CIFAR100 和 ImageNet-1k 基准实现了超过 20% 的显著性能提升。该方法在保持参数效率的同时确立了新的最新性能,尤其在标注数据稀缺的情况下证明尤为有效。

Abstract

The main goal of active finetuning is to improve a pretrained model's performance on a specific task or domain by finetuning it with carefully selected informative or challenging data. Previous research has predominantly focused on the active aspect (i.e., data selection) while uniformly employing full finetuning for model adaptation, which inevitably distorts pretrained features due to distribution shift. This issue becomes particularly pronounced when the model size is large relative to the finetuning data quantity, leading to heightened overfitting risks. To address this critical gap, we formally outline the FiAF task that emphasizes systematic exploration of finetuning methodologies in active learning. We propose FACT, a three-phase hierarchical finetuning framework featuring both efficiency and simplicity, specifically designed for active finetuning scenarios. Our comprehensive experiments span: (1) Three major dataset categories encompassing classic (CIFAR10, CIFAR100, ImageNet-1k), imbalanced (CIFAR10-LT, CIFAR100-LT), and fine-grained (StanfordCars, FGVCAircraft) image classification datasets, each evaluated under 3-5 distinct sampling ratios; (2) Diverse pretrained architectures including Convolutional Neural Network (ConvNeXt), Vision Transformer (ViT), and Vision LSTM (ViL) networks; (3) A systematic investigation of frozen feature augmentation (FroFA) strategies; (4) A comprehensive and rigorous analysis of efficiency and generalizability. The results demonstrate significant improvements with strong generalization and robustness. Notably, under low sampling ratios, our framework achieves remarkable performance gains of over 20% on the ViT model for CIFAR10, CIFAR100, and ImageNet-1k benchmarks. This systematic approach establishes new state-of-the-art performance while maintaining parameter efficiency, proving particularly effective when labeled data is scarce.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 55 (char 278)

Score: 0.0 / 27.8
Authors: Ahmad AlMughrabi, Farid Al-Areqi, David Fernández Gómez, Umair Haroon, Marc Bolaños, Ricardo Marques, Petia Radeva
Published: 2026-06-01
TL;DR: PerBite introduces a curated 3D reconstruction workflow for estimating food volume from before-and-after consumption images, achieving competitive performance in a dietary assessment challenge.
Abstract (Chinese translation)

视觉可信的食物网格能否被信任用于估算摄入食物的体积?本文方法利用 MetaFood CVPR 2026 进食过程中连续 3D 重建挑战赛中选定的成对进食前后状态来探究这一问题。提交的工作流程遵循一个精心设计的重建协议:SAM-3 用于分割食物和盘子区域;Hunyuan3D/SAM-3D 生成无量纲食物网格;盘子直径提供度量尺度;在 Blender 中移除盘子几何结构;随后对剩余网格进行孔洞填充、水密处理及积分运算以估算体积。仅在直接测量盘子存在不确定性时,MoGe-2 才作为初始盘子直径估计的辅助线索使用;它并非报告挑战结果的主要尺度来源。本文方法排名第一,在使用刚性 ICP 且不进行尺度校正的情况下,34 个网格的平均 Chamfer 距离为 8.31。在 17 对进食前后状态上,该方法实现了 33.87% 的状态级体积 MAPE 且无单调性违反,而摄入体积的 MAPE 仍为 53.74%。结果表明,表面重建、度量尺度、受控网格清理、水密体积积分以及物理消耗一致性应在饮食评估中分别进行评估。源代码及评估脚本将在 https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026 上提供。

Abstract

Can a visually plausible food mesh be trusted to estimate the volume of consumed food? PerBite investigates this question using selected paired before- and after-consumption states from the MetaFood CVPR 2026 Continuous 3D Reconstruction While Eating Challenge. The submitted workflow follows a curated reconstruction protocol: SAM 3 segments the food and plate regions; Hunyuan3D/SAM 3D generates a dimensionless food mesh; the plate diameter provides the metric scale; the plate geometry is removed in Blender; and the remaining mesh is hole-filled, made watertight, and integrated to estimate volume. MoGe-2 is used only as an auxiliary cue for initial dish-diameter estimation when direct plate measurement is uncertain; it is not the primary scale source for the reported challenge result. PerBite ranks first, with an average Chamfer distance of 8.31 across 34 meshes using rigid ICP without scale correction. On 17 before- and after-pairs, it achieves 33.87% state-level volume MAPE and zero monotonicity violations, while consumed-volume MAPE remains 53.74%. The results show that surface reconstruction, metric scale, controlled mesh cleanup, watertight volume integration, and physical depletion consistency should be evaluated separately for dietary assessment. Source code and evaluation scripts will be available at https://github.com/GCVCG/PerBite-CVPR-MetaFood-2026.
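The last two steps of the workflow, metric scaling and watertight volume integration, can be illustrated with a standard signed-tetrahedron sum over a closed triangle mesh. This is a generic sketch, not the authors' Blender pipeline: the function names, the unit-cube test mesh, and the plate measurements are hypothetical.

```python
def mesh_volume(vertices, triangles):
    """Volume of a closed, outward-oriented triangle mesh.

    Sums the signed volumes of tetrahedra spanned by the origin and each
    triangle (divergence theorem). The mesh must be watertight.
    """
    total = 0.0
    for i, j, k in triangles:
        ax, ay, az = vertices[i]
        bx, by, bz = vertices[j]
        cx, cy, cz = vertices[k]
        # a . (b x c) is six times the signed tetrahedron volume.
        total += (ax * (by * cz - bz * cy)
                  + ay * (bz * cx - bx * cz)
                  + az * (bx * cy - by * cx))
    return total / 6.0


def metric_volume(unitless_volume, plate_diameter_real, plate_diameter_mesh):
    """Rescale a dimensionless volume; lengths scale by s, volumes by s**3."""
    scale = plate_diameter_real / plate_diameter_mesh
    return unitless_volume * scale ** 3


def mape(predicted, actual):
    """Mean absolute percentage error, in percent."""
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)


# Unit cube with outward-facing triangles (hypothetical test mesh).
CUBE_VERTICES = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
                 (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
CUBE_TRIANGLES = [(0, 2, 1), (0, 3, 2), (4, 5, 6), (4, 6, 7),
                  (0, 1, 5), (0, 5, 4), (3, 7, 6), (3, 6, 2),
                  (0, 4, 7), (0, 7, 3), (1, 2, 6), (1, 6, 5)]

unitless = mesh_volume(CUBE_VERTICES, CUBE_TRIANGLES)
# Hypothetical scale: the plate spans 2 mesh units and measures 20 cm.
print(unitless, metric_volume(unitless, 20.0, 2.0))
```

Because volume scales with the cube of the length scale, a 10% error in the plate diameter becomes roughly a 33% error in volume, which is one reason to evaluate metric scale separately from surface quality.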

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on a 3D reconstruction workflow for food volume estimation, which is an application-level task unrelated to the provided keywords concerning model architectures (Unify Models, Tokenizer, MLLM), world models, or reinforcement learning (model-based RL). Although tools like SAM (containing a visual encoder) are used, the paper does not study encoder design or learning. No expert authors from the specified list are present. The weighted total score is 0, which is below the dynamic passing score of 27.8.

Keywords

Food Volume Estimation, 3D Reconstruction, Before-and-After Consumption, Dietary Assessment, Mesh Generation, Visual Segmentation, Volume Integration

Score: 0.0 / 27.8
Authors: Xuewei Meng, Chuanmin Jia, Xinfeng Zhang, Shanshe Wang, Siwei Ma
Published: 2026-06-01
TL;DR: This paper proposes a spatio-temporal correlation guided geometric partitioning scheme to reduce side information signaling overhead and improve coding efficiency in Versatile Video Coding (VVC).
Abstract (Chinese translation)

几何分割因其混合视频编码框架中卓越的运动场描述能力而受到日益增多的关注。然而,通用视频编码(VVC)中现有的几何分割(GEO)方案给侧信息的信令带来了不可忽视的负担,从而限制了编码效率。为此,本文提出一种时空相关性引导的几何分割(STGEO)方案,旨在高效描述视频编码运动场中的对象信息。该方法能够节省用于侧信息信令(包括分割模式和运动信息)所消耗的比特。首先,本文以统计学上严谨的方式分析了分割模式决策和运动矢量选择的特性。基于观察到的时空相关性,本文设计了一种模式预测与编码方法,以减少表示上述侧信息的开销。其核心思想是预测具有更高选择可能性的 STGEO 模式和运动候选,以此指导熵编码,即用更少的比特表示预测的高概率模式及运动候选。具体而言,高概率的 STGEO 模式基于相邻 STGEO 编码块的边缘信息及历史模式进行预测。相应的运动信息通过合并候选列表中的索引来表示,该索引基于离线训练的合并候选选择概率自适应推断。仿真结果表明,与未采用 GEO 的 VTM-8.0 相比,所提出的方法在随机访问(RA)和低延迟 B(LD-B)配置下分别平均实现了 0.95% 和 1.98% 的比特率节省。

Abstract

Geometric partitioning has attracted increasing attention for its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) imposes a non-negligible burden for signaling the side information, which limits coding efficiency. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method reduces the bits consumed for side information signaling, including the partitioning mode and motion information. We first analyze the characteristics of partitioning mode decision and motion vector selection in a statistically sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above-mentioned side information. The main idea is to predict the STGEO modes and motion candidates that have higher selection probabilities, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on video coding compression (VVC) and geometric partitioning, belonging to signal processing and traditional computer vision engineering. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) pertain to Multimodal Large Language Models and Reinforcement Learning. There is no conceptual overlap between video coding algorithms and the specified AI/ML keywords, resulting in a score of 0 for all. None of the listed expert authors appear in the author list.

Keywords

Spatio-Temporal Correlation, Geometric Partitioning, Versatile Video Coding, Motion Field, Side Information Signaling, Coding Efficiency, Mode Prediction, Merge Candidate List

Score: 0.0 / 27.8
Authors: Mohammed Q. Alkhatib, Swalpa Kumar Roy, Ali Jamali
Published: 2026-06-01
TL;DR: MixerSENet is a lightweight framework for efficient hyperspectral image classification that achieves superior accuracy with fewer parameters compared to state-of-the-art methods.
Abstract (Chinese translation)

本文提出了一种新颖的框架 MixerSENet,用于高光谱图像(HSI)分类,旨在应对计算效率与标注数据有限所带来的挑战。该模型在处理高光谱图像块的同时,在整个网络中保持一致的尺寸和分辨率,有效地解耦了空间维度与通道维度的混合。值得注意的是,MixerSENet 具有轻量级和计算高效的特点,相较于传统模型需要更少的参数,使其适用于资源受限的环境。模型中引入了挤压和激励(Squeeze-and-Excitation)模块以细化特征提取,增强了网络捕捉更具信息量特征的能力。在两个基准数据集上的实验结果表明,MixerSENet 取得了卓越的性能,在 Houston13 数据集上达到了 82.47% 的整体准确率(OA),在 Qingyun 数据集上达到了 96.70%,优于包括 3D-CNN、HybridKAN、HSIFormer、SimPoolFormer 和 MorphMamba 在内的最先进方法。此外,对计算效率的详细分析表明,MixerSENet 在精度与效率之间取得了良好的平衡,仅需 53,146 个参数且推理时间较低,证实了其在实际应用中的实用性。论文发表后,源代码将在 https://github.com/mqalkhatib/MixerSENet 上公开。

Abstract

In this paper, a novel framework, MixerSENet, is introduced for hyperspectral image (HSI) classification, designed to address the challenges of computational efficiency and limited labeled data. The proposed model processes hyperspectral image patches while maintaining consistent size and resolution throughout the network, effectively decoupling the mixing of spatial and channel dimensions. Notably, MixerSENet is lightweight and computationally efficient, requiring fewer parameters than traditional models, making it suitable for resource-constrained environments. A squeeze-and-excitation block is incorporated into the model to refine feature extraction, enhancing the network's ability to capture more informative features. Experimental results on two benchmark datasets demonstrate that MixerSENet achieves superior performance, reaching an overall accuracy (OA) of 82.47% on the Houston13 dataset and 96.70% on the Qingyun dataset, outperforming state-of-the-art methods including 3D-CNN, HybridKAN, HSIFormer, SimPoolFormer, and MorphMamba. Furthermore, a detailed analysis of computational efficiency shows that MixerSENet achieves a favorable balance between accuracy and efficiency, with only 53,146 parameters and a low inference time, confirming its practicality for real-world applications. At publication, source code will be publicly available at https://github.com/mqalkhatib/MixerSENet.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper proposes a lightweight neural network framework (MixerSENet) specifically for hyperspectral image classification, focusing on computational efficiency and feature extraction. The provided keywords relate to Large Language Models, Multimodal AI, World Models, and Reinforcement Learning. There is no overlap regarding tokenizers, world models, model-based RL, or unifying models in the context of MLLMs. Additionally, none of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are listed on the paper.

Keywords

Hyperspectral Image Classification, Lightweight Framework, Computational Efficiency, Spatial and Channel Dimensions, Squeeze and Excitation Block, Benchmark Datasets, Overall Accuracy

Score: 0.0 / 27.8
Authors: Satoshi Takabe, Shunta Arai, Tadashi Wadayama
Published: 2026-06-01
TL;DR: This paper proposes a physics-aware linearized ADMM algorithm combined with deep unfolding to efficiently solve inverse problems involving partial differential equations in signal processing tasks.
Abstract (Chinese translation)

近期,偏微分方程(PDEs)已被用于直接建模信号处理中的测量过程,尽管其计算代价高昂。本文提出了一种基于交替方向乘子法(ADMM)的新算法,称为物理感知线性化 ADMM(PA-LADMM),用于求解基于 PDE 测量过程的逆问题。其核心思想是对包含 PDE 的子问题进行线性化,从而得到一个计算高效的更新规则,该规则每次迭代仅需调用一个 PDE 求解器及其梯度计算。该算法在特定条件下具有理论收敛性保证。此外,我们将其与深度展开相结合,以展开 PA-LADMM 并利用监督数据训练其内部参数。两项不同的实验,即光通信中的压缩感知和基于含噪各向异性扩散的图像恢复,证明了所提出算法的有效性。

Abstract

Recently, partial differential equations (PDEs) have been used to directly model the measurement process in signal processing, although their evaluation is costly. In this paper, we propose a novel alternating direction method of multipliers (ADMM)-based algorithm called physics-aware linearized ADMM (PA-LADMM) for inverse problems from PDE-based measurement processes. The key idea is the linearization of the subproblem with PDEs, leading to a cost-efficient update rule that calls only a PDE solver and its gradient evaluation per iteration. The algorithm has a theoretical convergence guarantee under certain conditions. In addition, we combine it with deep unfolding to unroll the PA-LADMM and train its internal parameters using supervised data. Two distinct experiments, compressed sensing with optical fiber communication and image restoration from noisy anisotropic diffusion, demonstrated the effectiveness of the proposed algorithms.

Score details
Keyword Weight Relevance Score
Unify Models 1.5 0.0/10 0.0
Tokenizer 1.5 0.0/10 0.0
Visual Encoder 1.5 0.0/10 0.0
World Models 1.5 0.0/10 0.0
MLLM 1.5 0.0/10 0.0
MultiModal 1.5 0.0/10 0.0
model-based RL 1.5 0.0/10 0.0

Scoring rationale: The paper focuses on signal processing and optimization algorithms (ADMM) for PDE-based inverse problems using deep unfolding. It does not involve multimodal large language models, tokenizers, visual encoders, world models, or reinforcement learning. Therefore, all provided keywords are completely unrelated to the paper's core content.

Keywords

Physics-Aware, Linearized ADMM, Deep Unfolding, Inverse Problems, PDE-based, Compressed Sensing, Image Restoration, Alternating Direction Method of Multipliers

Score: 0.0 / 27.8
Authors: Xuewei Meng, Xinfeng Zhang, Chuanmin Jia, Xia Li, Shanshe Wang, Siwei Ma
Published: 2026-06-01
TL;DR: This paper proposes an edge-directed geometric partitioning strategy for VVC video coding that reduces signaling overhead and improves coding efficiency by leveraging spatio-temporal edge information.
Abstract (Chinese translation)

为提升编码性能,几何分区(GEO)被提议用于即将推出的 VVC 标准。GEO 提供了 140 种分区候选方案。最优 GEO 模式的索引需要显式信令指示。考虑到不同编码单元(CUs)的结构差异以及空间相邻块与时域同位置块之间的相关性,本文提出一种 GEO 模式预测策略,通过构建最可能模式(MPM)列表来降低 GEO 索引的开销并提高编码效率。基于分区模式与物体边界之间高度相关性的观察,本文提出一种边缘导向几何分区方案,依据时空边缘信息构建 MPM 列表。与 VTM-6.0 相比,该方法在随机访问(RA)和低延迟 B(LDB)配置下平均分别实现了 0.58% 和 1.00% 的客观 BD 率增益。此外,该方法还提升了物体边界的视觉质量。

Abstract

To improve the coding performance, geometric partition (GEO) was proposed for the upcoming VVC standard. GEO provides 140 partition candidates. The index of optimal GEO mode needs to be signaled explicitly. Considering different structural characteristics of different CUs and the correlation between spatial adjacent blocks and temporal collocated blocks, we propose a GEO mode prediction strategy by constructing a Most Probable Mode (MPM) list to reduce the overhead of GEO index and improve coding efficiency. Based on the observation of the high correlation between the partition mode and object boundaries, an edge-directed geometric partition scheme is proposed to construct the MPM list according to spatio-temporal edge information. The proposed method provides an objective BD-rate gain of 0.58% and 1.00% on average for RA and LDB configurations compared to VTM-6.0. Besides, it also promotes the visual quality of object boundaries.
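
The signaling argument in the abstract can be made concrete with a little arithmetic. The sketch below assumes a simplified fixed-length binarization (one flag bit plus a fixed-length index), not the entropy coding actually used in VVC; the MPM list size of 4 and the hit probability of 0.6 are hypothetical values chosen only for illustration, and the function names are not from the paper. Only the figure of 140 GEO candidates comes from the abstract.

```python
import math


def fixed_length_bits(num_modes: int) -> int:
    """Bits needed to signal one of `num_modes` indices with a fixed-length code."""
    return math.ceil(math.log2(num_modes))


def expected_bits_with_mpm(num_modes: int, mpm_size: int, hit_prob: float) -> float:
    """Expected bits per index under a simplified MPM scheme (assumption):
    1 flag bit, then a fixed-length index into either the MPM list (hit)
    or the remaining modes (miss)."""
    flag = 1
    hit_bits = flag + math.ceil(math.log2(mpm_size))
    miss_bits = flag + math.ceil(math.log2(num_modes - mpm_size))
    return hit_prob * hit_bits + (1.0 - hit_prob) * miss_bits


baseline = fixed_length_bits(140)                # 8 bits without prediction
with_mpm = expected_bits_with_mpm(140, 4, 0.6)   # 0.6 * 3 + 0.4 * 9 = 5.4 bits
```

Under these assumptions the MPM list pays off whenever the hit probability exceeds 1/6, which is why an accurate predictor of the partition mode, such as the edge-based one the abstract proposes, can translate into a bitrate saving.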

Score details
Keyword          Weight  Relevance  Score
Unify Models     1.5     0.0/10     0.0
Tokenizer        1.5     0.0/10     0.0
Visual Encoder   1.5     0.0/10     0.0
World Models     1.5     0.0/10     0.0
MLLM             1.5     0.0/10     0.0
MultiModal       1.5     0.0/10     0.0
model-based RL   1.5     0.0/10     0.0

Scoring rationale: The paper addresses video compression technology (VVC standard) focusing on geometric partitioning and coding efficiency, which belongs to signal processing and multimedia standards. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) pertain to Artificial Intelligence, specifically Large Language Models, Multimodal Learning, and Reinforcement Learning. There is no conceptual or technical overlap between video compression algorithms and AI model architectures. Additionally, the author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).

Keywords

Edge-directed, geometric partitioning, versatile video coding, VVC standard, MPM list, spatio-temporal edge, coding efficiency

Score: 0.0 / 27.8
Authors: Jianlin Xiang, Yanshan Li, Linhui Dai
Published: 2026-06-01
Abstract (Chinese translation)

目标检测的视觉可解释性仍具挑战性,这源于检测任务的多实例性质。现有方法主要采用事后范式(post-hoc paradigms),例如基于梯度或基于扰动的解释方法,来解释预训练检测器。然而,这些方法需要额外的梯度计算或重复模型推理,导致效率受限。为解决这一问题,我们提出了一种端到端实例级视觉解释框架(EIVE),该框架可在检测变换器(DETR)类模型的前向传播之后直接生成实例级显著性图。具体而言,我们将解码器中的交叉注意力机制重新表述为实例级特征归因路径,使得每个对象查询(object query)的交叉注意力对应其预测实例的视觉归因。基于此表述,我们设计了跨层混合共识融合(CLHCF)模块,以聚合解码器各层之间的交叉注意力信号,从而生成稳定且紧凑的解释。EIVE 的解释过程无需梯度计算或输入扰动,具有高计算效率,且适用于单尺度和多尺度的 DETR 类目标检测器。最后,我们提出了一种注意力感知联合训练策略(AAJTS)作为面向训练的应用,该策略对交叉注意力模式施加空间约束,以鼓励稳定且集中的归因表示,从而同时提升模型的可解释性和检测性能。在 MS COCO 2017、ExDark 和 Cityscapes 上的实验表明,EIVE 能够生成高质量的实例级显著性图,并在标准指标上达到与最先进的事后方法相当甚至更优的性能,同时显著提升了解释效率。代码开源地址为:https://github.com/xjlDestiny/EIVE.git。

Abstract

Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at https://github.com/xjlDestiny/EIVE.git.
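
The idea of reading an instance-level saliency map directly from decoder cross-attention can be sketched as follows. This is not the CLHCF module from the paper, whose fusion rule the abstract does not specify: the sketch swaps in a plain mean over decoder layers, assumes dense single-scale cross-attention of shape (layers, queries, H*W), and uses hypothetical names (`instance_saliency`, `softmax`) throughout. It illustrates only why no gradient or perturbation pass is needed, since the attention weights are already available after the forward pass.

```python
import numpy as np


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax, used here only to build toy attention."""
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def instance_saliency(cross_attn: np.ndarray, query_idx: int, feat_hw: tuple) -> np.ndarray:
    """Fuse one query's cross-attention over decoder layers into a saliency map.

    cross_attn: array of shape (num_layers, num_queries, H * W).
    Fusion here is a simple layer-wise mean (a stand-in, not CLHCF),
    followed by min-max normalization to [0, 1].
    """
    h, w = feat_hw
    per_layer = cross_attn[:, query_idx, :]        # (num_layers, H * W)
    fused = per_layer.mean(axis=0).reshape(h, w)   # average across layers
    lo, hi = fused.min(), fused.max()
    return (fused - lo) / (hi - lo + 1e-12)
```

In a real DETR-like model the `cross_attn` tensor would be collected from the decoder layers during inference, for example with forward hooks; multi-scale or deformable attention would need an extra step to map sampled locations back to a dense grid.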

Score details
Keyword          Weight  Relevance  Score
Unify Models     1.5     0.0/10     0.0
Tokenizer        1.5     0.0/10     0.0
Visual Encoder   1.5     0.0/10     0.0
World Models     1.5     0.0/10     0.0
MLLM             1.5     0.0/10     0.0
MultiModal       1.5     0.0/10     0.0
model-based RL   1.5     0.0/10     0.0

Scoring rationale: Scoring failed: Expecting ',' delimiter: line 12 column 96 (char 319)

Token usage: 3,405,584 tokens (input 444,361 / output 2,961,223)