arXiv Daily Report 2026-05-26
ArXiv Research Report
Papers
Abstract (Chinese Translation)
许多视频推理任务需要追踪运动、时间顺序以及跨帧演变的视觉状态。现有基于大型视觉语言模型(LVLMs)的方法通常通过文本思维链(CoT)、关键帧选择、重复帧插入或外部工具使用等外部化推理方式来应对这一挑战。尽管这些方法有效,但此类流程会增加推理延迟和工程复杂度,并迫使时间-视觉证据被序列化为文本或从帧中反复重新编码。受“视觉推理可以在语言化之前隐式发生”这一直觉的启发,我们提出了STORMS(通过内化建模进行时空推理),这是一个两阶段框架,旨在教导LVLMs通过有界连续潜在轨迹进行推理,而非依赖显式的文本CoT。在第一阶段,STORMS将潜在标记与从生成视频中导出的思维-视频表征对齐,从而将潜在状态锚定于动态视觉证据中。在第二阶段,模型进一步通过仅含答案的监督进行训练,促使推理过程在没有逐步标注的情况下实现内化。生成的思维视频仅用于训练;在推理时,STORMS执行有界潜在展开,无需重新生成视频、插入帧或调用外部视觉工具。在VideoMME、MVBench、TempCompass和MMVU上的实验表明,与基于工具或视频生成的推理流程相比,STORMS在提升视频推理准确性的同时显著降低了推理开销。
Abstract
Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated frame reinsertion, or external tool use. While effective, such pipelines increase inference-time latency and engineering complexity, and they force temporal-visual evidence to be serialized into text or repeatedly re-encoded from frames. Inspired by the intuition that visual reasoning can occur implicitly before verbalization, we propose STORMS (Spatial-Temporal reasOning via inteRnalized Modeling), a two-stage framework that teaches LVLMs to reason through bounded continuous latent trajectories instead of explicit textual CoT. In Stage I, STORMS aligns latent tokens with thought-video representations derived from generated videos, grounding the latent states in dynamic visual evidence. In Stage II, the model is further trained with answer-only supervision, encouraging the reasoning process to be internalized without step-by-step annotations. Generated thought videos are used only during training; at inference, STORMS performs a bounded latent rollout without regenerating videos, reinserting frames, or invoking external visual tools. Experiments on VideoMME, MVBench, TempCompass, and MMVU show that STORMS improves video reasoning accuracy while substantially reducing inference overhead compared with tool or video-generation-based reasoning pipelines.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 5.0/10 | 5.0 |
| World Models | 1.0 | 9.0/10 | 9.0 |
| MLLM | 1.0 | 10.0/10 | 10.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 10.0/10 | 10.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
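Score rationale: The paper proposes the STORM framework, which improves the spatio-temporal reasoning of video-language models by internalizing reasoning as latent-trajectory modeling. Its core concerns multimodal large language models (MLLM), computer vision (CV), and multimodality (MultiModal), each scored 10. World Models is highly relevant (9) because generated videos act as an internal world model that simulates dynamic visual evidence. Unify Models is partly related, but the paper does not emphasize model unification (5). Model-based RL has only a weak link through the world-model concept, and the paper does not train with RL (1). OPD, RL, and GRPO do not appear in the paper (0).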
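The last column of the score table above appears to be weight × relevance. The sketch below reproduces that arithmetic for this paper's rows; the per-row rule is inferred from the table, and the plain sum used as a paper-level total is an assumption, since the report does not say how rows are aggregated.

```python
# Rows copied from the score table above: (keyword, weight, relevance out of 10).
rows = [
    ("Unify Models", 1.0, 5.0),
    ("World Models", 1.0, 9.0),
    ("MLLM", 1.0, 10.0),
    ("CV", 1.0, 10.0),
    ("MultiModal", 1.0, 10.0),
    ("model-based RL", 1.0, 1.0),
    ("OPD", 1.0, 0.0),
    ("RL", 1.0, 0.0),
    ("GRPO", 1.0, 0.0),
]


def keyword_scores(rows):
    """Per-keyword score, assumed to be weight * relevance."""
    return {name: weight * relevance for name, weight, relevance in rows}


scores = keyword_scores(rows)
# Assumed aggregate: a plain sum. The report does not state its aggregation rule.
total = sum(scores.values())
```

Changing a weight in `rows` is enough to see how a reader-specific interest profile would reorder the papers.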
Keywords
Spatial-Temporal Reasoning, Video-Language Models, Internalized Modeling, Latent Trajectories, Thought Videos, Chain-of-Thought, Inference Efficiency
In-Depth Analysis
Chinese Title: STORM:面向视频语言模型的时空推理内化建模
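Summary: This paper proposes the STORM framework to address the high inference latency and engineering complexity that arise when video-language models depend on explicit textual chain-of-thought or external tools for spatio-temporal reasoning. STORM uses a two-stage training strategy. Stage I uses generated "thought videos" as training supervision and aligns latent tokens with dynamic visual evidence. Stage II uses answer-only supervision, pushing the model to internalize reasoning in a continuous latent space. At inference, the model reasons only through a fixed-length latent trajectory, with no video regeneration or external tool calls. Experiments show that STORM improves video reasoning accuracy on benchmarks such as VideoMME and MVBench while substantially reducing inference overhead.
Innovations:
- Proposes a reasoning framework that simulates dynamics internally in a continuous latent space, replacing explicit textual chain-of-thought and tool dependence.
- Designs a two-stage training strategy: Stage I aligns latent tokens with thought-video representations; Stage II reinforces internalized reasoning through answer-only supervision.
- Builds a thought-video supervision pipeline that aligns generated dynamic trajectories, keyframes, and QA annotations, supporting scalable training for latent video reasoning.
- At inference, needs only a fixed-budget latent trajectory, with no video generation, frame reinsertion, or external visual tools, which substantially reduces latency.
Methodology: STORM is trained in two stages. Stage I: keyframes are extracted from the source video and combined with a teacher chain-of-thought to generate a thought video; latent tokens are aligned with the thought-video representation while the answer-token loss is optimized jointly. Stage II: the latent token sequence is retained and only the final answer is supervised, pushing the model to internalize the reasoning process. At inference, the model performs a fixed-length rollout of continuous states inside the latent segment, then resumes text decoding to produce the answer.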
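The fixed-length latent rollout described in the methodology can be pictured with a toy loop. This is only a sketch under stated assumptions: `toy_step` is a hypothetical stand-in for one forward pass of the LVLM, and feeding the last hidden state back as the next input follows the Coconut-style setup the report mentions rather than any implementation detail confirmed by the paper.

```python
import math

K = 8  # latent budget; the limitations listed in this report mention K=8


def toy_step(state, weights):
    """Hypothetical stand-in for one model forward pass: tanh(W @ state)."""
    return [math.tanh(sum(w * s for w, s in zip(row, state))) for row in weights]


def bounded_latent_rollout(context_state, weights, budget=K):
    """Run a fixed number of continuous-state steps without sampling text tokens."""
    state = context_state
    trajectory = []
    for _ in range(budget):
        # The previous latent state is fed straight back in (assumed mechanism).
        state = toy_step(state, weights)
        trajectory.append(state)
    # In a real model, text decoding would resume conditioned on the final state.
    return state, trajectory


weights = [[0.5, -0.2], [0.1, 0.9]]
final_state, trajectory = bounded_latent_rollout([1.0, 0.0], weights)
```

Because the loop length is a constant, inference cost in this picture is bounded regardless of how hard the question is, which is the property the report credits for predictable latency.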
Key Results:
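- On the VideoMME, MVBench, TempCompass, and MMVU benchmarks, STORM improves video reasoning accuracy.
- Compared with tool-based or video-generation-based reasoning pipelines, STORM substantially reduces inference overhead.
- STORM preserves reasoning performance while avoiding the latency of regenerating videos or calling external tools at inference.
Tech Stack:
- GPT-4o for keyframe extraction and teacher chain-of-thought generation
- Wan 2.1 conditional diffusion model for thought-video generation
- Latent tokens for reasoning in continuous space
- Coconut-like setup for Stage II answer-only supervision
- Bounded (fixed-budget) latent rollout at inference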
Strengths:
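- Internalizes video reasoning in latent space, avoiding dependence on explicit text or tools and improving efficiency.
- The two-stage training strategy makes effective use of thought videos as supervision while keeping inference simple.
- Experiments cover several video reasoning benchmarks, supporting the method's generality and effectiveness.
- The fixed inference budget keeps latency bounded, which suits practical deployment.
Limitations:
- Thought-video generation relies on an external diffusion model, so the training stage still carries some compute overhead.
- The number of latent tokens is fixed (K=8), which may limit representational capacity in complex reasoning scenarios.
- The method targets video reasoning; its applicability to static-image or text-only reasoning is not fully explored.
- It relies on a teacher model to generate chains of thought and keyframes, which may introduce the teacher's biases.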
Relevance To Keywords:
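- Native multimodal large models: STORM builds on a video-language model and strengthens multimodal understanding through latent-space reasoning.
- World models: thought-video generation simulates how a dynamic scene changes, which implicitly reflects the world-model idea.
- Representation learning: the latent tokens learn compact spatio-temporal visual representations that enable internalized reasoning.
- Model-based RL: the paper does not use reinforcement learning directly, but the answer-only supervision in its two-stage training is related to RL ideas.
- Post-training: the two-stage training is a post-training strategy that optimizes the model's reasoning ability.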
Abstract (Chinese Translation)
交互式世界模型正在快速发展,然而现有基准仅覆盖了部分所需能力,缺乏系统评估的统一标准。为填补这一空白,我们提出了WBench——一个用于交互式世界模型评估的综合性多轮基准,涵盖五个维度:视频质量、设定遵循度、交互遵循度、一致性和物理合规性。WBench包含289个测试用例和1,058个交互轮次,每个用例指定一个世界设定和一个多轮交互序列,涵盖多样化的场景、风格、主体以及第一人称和第三人称视角,同时包含四种交互类型:导航、主体动作、事件编辑和视角切换。对于导航,WBench统一了文本、6-DoF位姿和离散动作控制,使得能够评估具有不同原生输入接口的模型。评估使用22个自动子指标,这些指标结合了专家视觉模型与大型多模态模型,并且所有指标均通过人类判断进行了验证。在20个最先进模型中,我们发现没有任何一个模型在所有维度上表现强劲。我们提供了关于每个模型的特征优势、劣势和开放挑战的详细诊断见解。代码和数据可在 https://github.com/meituan-longcat/WBench 获取。
Abstract
Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 5.0/10 | 5.0 |
| World Models | 1.0 | 10.0/10 | 10.0 |
| MLLM | 1.0 | 6.0/10 | 6.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 4.0/10 | 4.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 1.0/10 | 1.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
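Score rationale: The paper proposes WBench, a multi-turn benchmark for evaluating interactive video world models. Its core topics are World Models and video generation and evaluation within computer vision (CV), and it also involves multimodal (MultiModal) interaction. Large multimodal models (MLLM) are used as automatic judges in the evaluation, but the paper itself is not about MLLMs or unified models (Unify Models). There is some link to model-based RL, since world models are a component of MBRL, but the paper involves no RL training or algorithms, so RL and GRPO score very low. OPD does not appear in the paper (0).
Keywords
Interactive World Models, Benchmark, Multi-turn Evaluation, Video Quality, Consistency, Physics Compliance, Navigation, Large Multimodal Models
In-Depth Analysis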
Chinese Title: WBENCH:面向交互式视频世界模型评估的综合多轮基准测试
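Summary: This paper proposes WBench, a comprehensive multi-turn benchmark for evaluating interactive video world models. Existing benchmarks cover only part of the required capabilities and offer no unified standard. WBench contains 289 test cases and 1,058 interaction turns; each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and first- and third-person perspectives, with four interaction types: navigation, subject action, event editing, and perspective switching. Navigation control unifies three forms: text, 6-DoF pose, and discrete actions. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, all validated against human judgments. Experiments on 20 state-of-the-art models show that none performs strongly on every dimension, and the paper provides detailed diagnostic analysis.
Innovations:
- Proposes a unified benchmark covering five complementary evaluation dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance.
- Builds a dataset with multi-turn interaction, two perspectives, and four interaction types (navigation, subject action, event editing, perspective switching).
- Designs a unified navigation interface (text, 6-DoF pose, discrete actions) that supports fair comparison across control paradigms.
- Develops a fully automatic evaluation pipeline with 22 fine-grained sub-metrics and verifies its agreement with human judgments.
Methodology: The paper builds the benchmark systematically. First, it defines world settings (scene, style, perspective, subject) and interaction sequences (navigation, subject action, event editing, perspective switching). Second, it constructs a dataset of 289 test cases and 1,058 interaction turns. Third, it designs 22 automatic sub-metrics across the five dimensions, combining specialist vision models (for example aesthetic score and motion smoothness) with large multimodal models (for example scene adherence and causal fidelity). Finally, it evaluates 20 models (text-driven, camera-controlled, and action-conditioned) and performs cross-dimension analysis and human-preference alignment checks.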
Key Results:
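- No single model performs strongly on all five evaluation dimensions.
- Navigation ability is relatively independent of other dimensions such as video quality and consistency.
- Camera control and perspective consistency are two distinct capabilities.
- Physical correctness correlates with rendering quality rather than with control ability.
- Benchmark difficulty depends on perspective, scene type, and subject category.
- The four interaction types degrade unevenly over multi-turn interaction, with navigation the most fragile.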
Tech Stack:
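- Video generation models: Diffusion Transformers, Flow Matching
- Evaluation metrics: FID, FVD, Aesthetic Quality, Motion Smoothness, NavScore, Causal Fidelity, Visual Plausibility
- Vision models: CLIP, DINOv2, Depth Anything V2, RAFT, X-CLIP
- Multimodal large models: GPT-4o, Gemini Pro Vision, Qwen-VL-Max
- Control paradigms: text prompts, 6-DoF camera pose, discrete actions (keyboard/mouse)
- Dataset construction: multi-turn interaction sequence design, world-setting attribute definition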
Strengths:
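- Comprehensive: covers five evaluation dimensions and four interaction types, giving fine-grained diagnosis.
- Unified: a single navigation interface supports fair comparison across control paradigms.
- Automated: the evaluation pipeline is fully automatic, and all 22 metrics are validated against human judgments.
- Diagnostic: a detailed analysis of 20 models reveals each model's strengths, weaknesses, and open challenges.
- Open source: code and data are public, which supports community research and reproduction.
Limitations:
- The dataset is limited in size (289 cases) and may not cover all long-tail scenarios.
- The metrics depend on existing vision models and VLMs, which may introduce model bias.
- Only four interaction types are supported; more complex interactions such as object manipulation and dialogue are not covered.
- Physics compliance is assessed mainly through visual plausibility and lacks verification against a rigorous physics simulation.
- Practical deployment factors such as real-time performance and compute efficiency are not considered.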
Relevance To Keywords:
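- Unify Models: WBench evaluates models with different control paradigms through a unified navigation interface (text, pose, actions), which encourages model unification.
- World Models: WBench directly targets the evaluation of interactive video world models, covering world settings, multi-turn interaction, and physics compliance.
- Representation Learning: the consistency metrics (subject, background, spatial) and reconstruction consistency relate to representation quality.
- Model-Based RL: the navigation, subject-action, and event-editing interactions relate to action prediction and world evolution in model-based RL.
- Native multimodal large models: the evaluation uses multimodal large models (for example GPT-4o) to judge scene adherence, causal fidelity, and similar criteria.
- Unified understanding and generation in multimodal large models: WBench evaluates how well models understand multimodal inputs (text, images, actions) and generate video.
- Representation learning: as above, the consistency metrics reflect how well a model represents world state.
- World models: the core topic; WBench is a benchmark designed specifically for world-model evaluation.
- Reinforcement learning: the interaction-sequence design (a multi-turn action-observation loop) matches the environment-interaction paradigm of RL.
- Post-training: the benchmark can be used to assess how post-training (for example RLHF or fine-tuning) affects world-model performance.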
Abstract (Chinese Translation)
我们提出LLaVA-OneVision-2(LLaVA-OV-2),这是LLaVA-OneVision系列中迄今为止能力最强的视觉语言模型,在广泛的多模态基准测试中均展现出卓越性能。该模型基于原生OneVision编码器构建,并引入窗口注意力机制以实现高效的局部计算,同时保持原生分辨率。其关键进展在于编解码流分词化:它将压缩视频视为连续的比特成本流,其中比特成本动态决定自适应时间分组,而运动残差线索则选择显著空间证据以形成紧凑的视觉画布。这种分配机制将有限的令牌预算集中于承载事件的内容上,从而比固定图像组实现更稳定的长视频令牌压缩。共享的3D旋转位置编码进一步将编解码画布、采样帧和图像置于统一的时空坐标系中。此外,我们围绕大规模开放监督构建了LLaVA-OV-2的数据与训练体系:约800万条重新标注描述的视频样本用于预训练,以及400万样本的空间语料库用于微调。我们还引入了JumpScore,这是一个针对高频、密集重复运动中的细粒度定位的时间定位基准,填补了现有视频评估中的空白领域。LLaVA-OV-2的一项突出能力在于其跨视频理解、时间定位、空间定位及操作轨迹推理的统一感知能力。在JumpScore上,LLaVA-OneVision-2-8B达到74.9的JumpScore mAP,超越Qwen3-VL-8B(30.1)达44.8个百分点;在相同基准测试中,匹配视觉令牌预算的条件下,编解码流输入相比帧采样在时间定位上提升了9.7个百分点。在标准基准测试中,LLaVA-OneVision-2-8B在视频任务上平均超越Qwen3-VL-8B达4.3个百分点,在空间任务上达5.3个百分点,在跟踪任务上平均J&F指标达15.6个百分点。
Abstract
We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptive temporal groups, and motion-residual cues select salient spatial evidence into compact visual canvases. This allocation concentrates a limited token budget on event-bearing content, enabling more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE further places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system. Furthermore, we build the LLaVA-OV-2 data and training stack around large-scale open supervision: approximately 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. We also introduce JumpScore, a temporal-localization benchmark targeting fine-grained grounding in high-frequency, densely repeated motion, a regime underrepresented by existing video evaluations. A standout capability of LLaVA-OV-2 is its unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B (30.1) by +44.8 points; under matched visual-token budgets on the same benchmark, codec-stream inputs improve temporal grounding over frame sampling by +9.7 points. Across standard benchmarks, LLaVA-OneVision-2-8B further outperforms Qwen3-VL-8B by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 8.0/10 | 8.0 |
| World Models | 1.0 | 3.0/10 | 3.0 |
| MLLM | 1.0 | 10.0/10 | 10.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 10.0/10 | 10.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
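Score rationale: The paper centers on LLaVA-OneVision-2, a multimodal large language model (MLLM) focused on vision-language understanding, involving computer vision (CV) and multimodal (MultiModal) tasks. Through codec-stream tokenization and a shared 3D RoPE it achieves unified perception across video, images, and spatio-temporal grounding (relevant to Unify Models, though not the core innovation). World Models appears only in the research background; the paper itself does not address world models or model-based RL. Model-based RL, RL, GRPO, and OPD have no substantive content in the paper and all score 0.
Keywords
LLaVA-OneVision-2, Perceptual Intelligence, Codec-Stream Tokenization, Windowed Attention, 3D RoPE, JumpScore, Unified Perception, Vision-Language Model
In-Depth Analysis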
Chinese Title: LLaVA-OneVision-2:迈向下一代感知智能
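Summary: This paper introduces LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series. The model builds on a native OneVision-Encoder and adds windowed attention for efficient local computation at native resolution. Its core innovation is codec-stream tokenization: compressed video is treated as a continuous bit-cost stream, bit-cost dynamics adaptively set the temporal group boundaries, and motion-residual cues compress salient spatial evidence into compact visual canvases. This concentrates a limited token budget on event-bearing content and gives more stable long-video token compression than fixed groups of pictures. A shared 3D RoPE places codec canvases, sampled frames, and images in one spatio-temporal coordinate system. For training, the model uses about 8M re-captioned video samples for pretraining and a 4M-sample spatial corpus for fine-tuning. The paper also introduces the JumpScore benchmark for fine-grained temporal localization in high-frequency, densely repeated motion. LLaVA-OV-2-8B reaches 74.9 mAP on JumpScore, 44.8 points ahead of Qwen3-VL-8B, and improves on it by +4.3 average points on video tasks, +5.3 on spatial tasks, and +15.6 average J&F on tracking tasks. Code, data, and models are all open-sourced.
Innovations:
- Codec-stream tokenization: treats compressed video as a continuous bit-cost stream and allocates visual tokens adaptively using bit-cost dynamics and motion-residual cues, giving stable compression of long videos.
- Shared 3D RoPE: gives codec canvases, sampled frames, and images a unified spatio-temporal coordinate system, supporting joint processing of multiple input forms.
- Large-scale training data: uses about 8M re-captioned video samples and a 4M-sample 2D/3D spatial corpus, organized into a four-stage progressive training pipeline.
- JumpScore benchmark: designed for fine-grained temporal localization in high-frequency, densely repeated motion, a regime that existing video evaluations underrepresent.
- Windowed attention: uses spatial windowed attention in the vision encoder for efficient native-resolution processing, and is orthogonal to the video-level grouping rule.
Methodology: The architecture consists of the OneVision-Encoder (vision encoder), a two-layer MLP connector, and the Qwen3-8B autoregressive language model. Codec-stream tokenization proceeds as follows. First, the packet sizes of the compressed video's P/B frames are binned over time to compute a bit-cost distribution. Second, adaptive GOP partitioning (based on minimum/maximum span and a local valley search) sets the temporal group boundaries. Third, motion-residual signals score the saliency of spatial regions within each GOP, and high-scoring 2×2 blocks are packed into compact I/P canvases. Finally, all canvases, sampled frames, and images are encoded into unified visual tokens through the shared 3D RoPE. Training follows four stages. Stage 1 mixes 85M image-text samples with 4.2M 30-second video captions. Stage 2 adds 22M instruction samples and 2.7M 30-60 s / 0.7M 60-180 s video captions. Stage 3 extends to long videos (350K 10-15 minute captions, 384 frames). Stage 4 re-encodes long videos (384/768 frames) with a variable-length GOP codec pipeline and adds 4M spatial supervision samples. In all stages, codec-stream inputs, uniformly sampled video, and image/tiled inputs are interleaved.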
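One way to read "adaptive GOP partitioning based on minimum/maximum span and a local valley search" is a greedy pass over the binned bit costs. The sketch below is an illustration under that reading only: the function name, the greedy order, and the choice to cut at the cheapest bin in the allowed window are all assumptions, not the paper's algorithm.

```python
def adaptive_gop_boundaries(bit_cost, min_span, max_span):
    """Greedy partition of time bins into groups.

    Each cut is placed at the lowest-cost bin (a local valley) that lies
    between min_span and max_span bins after the previous boundary.
    Returns boundary indices; group i covers [b[i], b[i+1]).
    """
    n = len(bit_cost)
    boundaries = [0]
    start = 0
    # Keep cutting while the remainder is too long to be a single group.
    while n - start > max_span:
        lo = start + min_span
        hi = start + max_span  # always < n because of the loop condition
        cut = min(range(lo, hi + 1), key=lambda i: bit_cost[i])
        boundaries.append(cut)
        start = cut
    boundaries.append(n)  # note: the last group may be shorter than min_span
    return boundaries


# Toy bit-cost profile: high values mark busy bins, low values mark quiet bins.
bit_cost = [9, 8, 2, 7, 9, 1, 8, 9, 3, 9]
boundaries = adaptive_gop_boundaries(bit_cost, min_span=2, max_span=4)
```

With this toy profile the cuts land on the three quiet bins, so each group starts where little is happening and the high-cost bins stay inside a group rather than being split across two.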
Key Results:
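- On the JumpScore benchmark, LLaVA-OV-2-8B reaches 74.9 mAP, 44.8 points above Qwen3-VL-8B (30.1).
- On the same benchmark under matched visual-token budgets, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points.
- It outperforms Qwen3-VL-8B by 4.3 points on average across 18 video tasks, by 5.3 points on average across 11 spatial reasoning tasks, and by 15.6 average J&F across 4 tracking tasks.
- Codec-stream inputs do better on tasks with coarse temporal structure (temporal grounding, event understanding, event ordering, salient retrieval), while frame sampling does better on detail-sensitive tasks (static fine-grained, small-object, trajectory-specific, boundary-level).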
Tech Stack:
- OneVision-Encoder (dynamic-resolution vision encoder)
- Windowed Attention
- 3D rotary position embedding (3D RoPE)
- Codec-stream tokenization
- Bit-cost dynamics
- Motion-residual cues
- Adaptive GOP partition
- Qwen3-8B autoregressive language model
- Two-layer MLP vision-language connector
- H.264/H.265 video codec standards
Strengths:
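- Handles images, frame-sampled video, and codec-stream video in one model, with a simple and general architecture.
- Codec-stream tokenization compresses long videos effectively and concentrates tokens on information-rich regions, improving long-video understanding.
- Achieves clear gains on several benchmarks, with an especially large lead on localization in high-frequency, dense motion.
- Releases data, code, and models at scale, which supports community research.
- The four-stage training strategy extends video length and task complexity step by step, keeping training stable and efficient.
Limitations:
- Codec-stream inputs trail frame sampling on detail-sensitive tasks (for example static fine-grained texture and small-object recognition), which limits where they apply.
- The approach depends on video codecs (H.264/H.265) and may not suit every video format or real-time scenario.
- Codec-stream tokenization adds preprocessing overhead and places some demand on compute resources.
- The paper does not examine performance or scalability on extremely long videos (for example several hours).
- The JumpScore benchmark targets a specific type of motion, so its generality remains to be verified.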
Relevance To Keywords:
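- Unify Models: LLaVA-OV-2 handles images, frame-sampled video, and codec-stream video through a shared vision encoder and a unified token interface, in line with the unified-model idea.
- World Models: codec-stream tokenization uses bit cost and motion residuals to model predictive structure in video implicitly, which loosely relates to predicting future states in world models, but the paper does not explicitly build a world model.
- Representation Learning: the OneVision-Encoder and shared 3D RoPE aim to learn unified visual representations, and codec-stream tokenization further improves the compression and selectivity of video representations.
- Model-Based RL: the paper does not involve reinforcement learning or model-based planning; relevance is weak.
- Native multimodal large models: LLaVA-OV-2 is a native multimodal large model (MLLM) that processes visual and language inputs directly, without external modules.
- Unified understanding and generation in multimodal large models: the model generates text with an autoregressive language model while understanding visual input, which unifies understanding and generation in that sense.
- Representation learning: as above, the vision encoder and tokenization process fall under representation learning.
- World models: the use of predictive structure in codec streams can be seen as an implicit world model, but environment dynamics are not modeled explicitly.
- Reinforcement learning: the paper uses no RL training or RL post-training; relevance is low.
- Post-training: the four-stage training consists of pretraining and fine-tuning and does not involve RL post-training (for example RLHF); relevance is low.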
Abstract (Chinese Translation)
当前主流的将扩散模型与人类偏好对齐的方法通常采用基于视觉语言模型(VLM)的奖励模型。然而,这些为语义对齐而预训练的奖励模型难以捕捉诸如美学、构图和视觉和谐等关键的感知质量。在本工作中,我们认为一个能够实现高保真生成的模型必须对这些视觉属性有深刻的理解。基于这一见解,我们提出了基于扩散模型的奖励模型(Diffusion-based Reward Model, DRM),这是一种新颖的范式,利用预训练的扩散模型作为强大的评估骨干网络。DRM的一个关键优势在于其独特的能力,即不仅能评估最终图像,还能评估生成过程中任意阶段的含噪中间潜在表示。我们以两种方式利用这种逐步评估能力。首先,我们提出了逐步GRPO(Step-wise GRPO),这是一种强化学习算法,通过提供密集的每步奖励来解决GRPO算法中不精确的信用分配问题,从而实现更稳定、更有效的对齐。其次,我们引入了逐步采样(Step-wise Sampling),这是一种新颖的推理策略,利用DRM作为动态引导,在每一步评估多个生成路径,从而将生成过程导向更高质量的结果。大量实验证实,我们的方法显著提升了生成图像的最终质量。代码:https://github.com/jjaxonx/DRM。
Abstract
Current mainstream methods for aligning diffusion models with human preferences typically employ VLM-based reward models. However, these reward models, pre-trained for semantic alignment, struggle to capture essential perceptual qualities such as aesthetics, composition, and visual harmony. In this work, we argue that a model capable of high-fidelity generation must possess a profound understanding of these visual attributes. Based on this insight, we introduce the Diffusion-based Reward Model (DRM), a novel paradigm that uses a pre-trained diffusion model as a powerful evaluative backbone. A key advantage of the DRM is its unique ability to assess not only the final image but also the noisy intermediate latents at any stage of the generative process. We leverage this step-wise evaluative capacity in two ways. First, we propose Step-wise GRPO, a reinforcement learning algorithm that provides dense, per-step rewards to resolve the imprecise credit assignment problem in the GRPO algorithm, leading to more stable and effective alignment. Second, we introduce Step-wise Sampling, a novel inference strategy that employs the DRM as a dynamic guide to evaluate multiple generation paths at each step, steering the process towards higher-quality outcomes. Extensive experiments confirm that our approach significantly enhances the final quality of generated images. Code: https://github.com/jjaxonx/DRM.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 1.0/10 | 1.0 |
| World Models | 1.0 | 1.0/10 | 1.0 |
| MLLM | 1.0 | 1.0/10 | 1.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 3.0/10 | 3.0 |
| model-based RL | 1.0 | 5.0/10 | 5.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 9.0/10 | 9.0 |
| GRPO | 1.0 | 10.0/10 | 10.0 |
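Score rationale: The paper focuses on reward-model design for diffusion models in human-preference alignment; its core contributions are Step-wise GRPO and Step-wise Sampling. It is only weakly related to Unify Models, World Models, and MLLM. CV is highly relevant. MultiModal relevance is low because the work only involves image generation. Model-based RL is partly related (the reward model can be viewed as a kind of model), though it is not a conventional environment model. OPD is not mentioned. RL and GRPO are the core methods and are highly relevant.
Keywords
Diffusion-based Reward Model, Step-wise GRPO, Step-wise Sampling, Human Preference Alignment, Reinforcement Learning, Diffusion Models, Image Generation
In-Depth Analysis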
Chinese Title: DRM:基于扩散模型的逐步引导奖励模型
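Summary: This paper proposes a Diffusion-based Reward Model (DRM) for aligning diffusion models with human preferences. Existing VLM-based reward models are pretrained for semantic alignment and therefore struggle to capture perceptual qualities such as aesthetics and composition. The authors argue that a high-fidelity generative model already has a deep understanding of these visual attributes, so they use a pretrained diffusion model as the reward backbone. The DRM can score not only the final image but also the noisy intermediate latents at any stage of generation. Building on this step-wise evaluation, the paper proposes Step-wise GRPO, which supplies dense per-step rewards to address the credit-assignment problem in GRPO and gives more stable and efficient alignment, and Step-wise Sampling, an inference strategy that evaluates several generation paths at each step and steers toward higher-quality results. Experiments show that the method clearly improves generated image quality.
Innovations:
- First systematic use of a pretrained diffusion model as the reward-model backbone, exploiting its built-in understanding of perceptual quality.
- Proposes Step-wise GRPO, which uses the DRM to provide dense per-step rewards, addressing the coarse credit assignment of standard GRPO, speeding up convergence, and improving stability.
- Proposes Step-wise Sampling, an inference strategy that evaluates several candidate paths at each step and steers generation away from cascading failures.
- The DRM architecture truncates the tail layers of a pretrained DiT and adds a prediction head, enabling evaluation of intermediate latents at a comparable parameter count.
Methodology: The paper's technical route is as follows. 1) Building the DRM: starting from a pretrained diffusion model (for example the DiT of SD3.5-Medium), the last few Transformer layers are truncated and a prediction head is added, consisting of a linear projection, spatial feature reshaping, a convolutional network, and a pooling layer, which outputs a preference score. 2) Training: using human-preference triplets (winner, loser, prompt), images are noised to a random timestep and fed to the DRM to obtain scores, and a negative log-likelihood loss is computed under the Bradley-Terry model. 3) Step-wise GRPO: during RL training, starting from an initial point, each step explores k candidate samples via an SDE, and the DRM computes per-step rewards and advantages to optimize the policy. 4) Step-wise Sampling: at inference, each step generates several candidates, and the DRM scores them and selects the best path.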
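Two pieces of arithmetic in the methodology above are standard enough to sketch: the Bradley-Terry negative log-likelihood on a (winner, loser) score pair, and the group-relative normalization that GRPO uses to turn rewards into advantages. The functions below are illustrative; how the paper batches scores, picks timesteps, or applies advantages per step is not specified in this report and is not modeled here.

```python
import math


def bradley_terry_nll(score_win, score_lose):
    """-log sigmoid(s_w - s_l): loss is small when the winner scores higher."""
    return math.log1p(math.exp(-(score_win - score_lose)))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one group of candidates."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical per-step rewards for k = 3 candidates at a single denoising step.
step_rewards = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(step_rewards)
```

In a step-wise setting, `group_relative_advantages` would be called once per denoising step on that step's candidates, so each step receives its own learning signal instead of sharing a single end-of-trajectory reward.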
Key Results:
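- Compared with standard GRPO, Step-GRPO reaches a higher final reward and converges 2.5× faster.
- The DRM can score noisy latents at any timestep, providing fine-grained rewards.
- Step-wise Sampling improves generated image quality at inference through dynamic path selection.
- Experiments show the DRM outperforms VLM-based reward models at assessing perceptual quality (aesthetics, composition).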
Tech Stack:
- Diffusion Transformer (DiT)
- Flow Matching
- Markov Decision Process (MDP)
- Group Relative Policy Optimization (GRPO)
- Step-wise GRPO (Step-GRPO)
- Stochastic Differential Equation (SDE)
- Bradley-Terry (BT) model
- VAE encoder
- CLIP-style vision encoder
- Vision-Language Model (VLM)
Strengths:
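- Uses the diffusion model itself as the reward model, drawing on its implicit understanding of perceptual quality.
- Addresses imprecise credit assignment, a core problem in diffusion-model alignment, and optimizes more efficiently through per-step rewards.
- The dynamic guidance strategy at inference avoids the cascading errors of a fixed trajectory.
- The architecture is sensible: truncating the DiT keeps the parameter count comparable to VLM reward models.
Limitations:
- Training the DRM requires a large amount of human-preference data, which is costly to collect.
- Step-wise Sampling adds compute at inference, so quality must be balanced against efficiency.
- The paper does not discuss in detail how the DRM generalizes across diffusion architectures such as UNet.
- The work covers image generation only and is not extended to video or other modalities.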
Relevance To Keywords:
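- Unify Models: the paper explores unifying the generative model and the reward model, using the same diffusion backbone for generation and evaluation.
- World Models: the DRM's step-wise evaluation of a generation trajectory can be viewed as an understanding of the state of the generated world.
- Representation Learning: the DRM predicts preferences from intermediate diffusion features, which involves representation learning.
- Model-Based RL: Step-wise GRPO uses the DRM as a reward model, which the report classes as model-based RL.
- Native multimodal large models: the paper uses a DiT backbone, a multimodal generative model, but does not address unified understanding and generation.
- Unified understanding and generation in multimodal large models: the DRM applies a generative model to understanding (evaluation), which reflects the unification idea.
- Representation learning: using intermediate diffusion features for preference prediction is an application of representation learning.
- World models: step-wise evaluation of the generation process resembles state prediction in world models.
- Reinforcement learning: Step-wise GRPO is an RL algorithm used for alignment.
- Post-training: the method belongs to the post-training alignment stage of diffusion models.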
Abstract (Chinese Translation)
视觉-语言-动作(Vision-Language-Action, VLA)模型广泛采用预训练的视觉-语言模型(Vision-Language Models, VLMs)作为策略骨干,但何种预训练VLM表示对VLA初始化有用尚不明确。本文沿三个维度将VLA初始化作为受控表示设计问题进行研究:能力级别的具身VQA监督、参数更新策略以及机器人数据预训练。实验表明,原始预训练VLM表示是动作性能的关键来源。然而,具身VQA适应并未带来均匀增益:其益处取决于下游瓶颈,且不同能力领域的增益并非简单叠加。在更新策略方面,LoRA比全微调(Full Finetune)提供了更可靠的初始化,表明过度重塑预训练表示会削弱VLA初始化。机器人数据预训练进一步改善了VLA初始化,其中基于LoRA的分阶段训练获得了最强变体。综合这些发现表明,有效的VLM到VLA适应应在注入与动作相关的具身和机器人轨迹信号的同时,保留对动作学习仍有用的预训练VLM表示。
Abstract
Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear what kind of pretrained VLM representation is useful as a VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. However, embodied VQA adaptation does not yield uniform gains: its benefit depends on downstream bottlenecks, and gains from different capability domains are not simply additive. For update strategy, LoRA provides a more reliable initialization than Full Finetune, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained by staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representation that remains useful for action learning.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 6.0/10 | 6.0 |
| World Models | 1.0 | 2.0/10 | 2.0 |
| MLLM | 1.0 | 9.0/10 | 9.0 |
| CV | 1.0 | 7.0/10 | 7.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 2.0/10 | 2.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
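Score rationale: The paper studies VLM representations for VLA initialization. Its core involves multimodality (vision-language-action) and MLLMs. It is partly related to Unify Models (unifying vision, language, and action) but does not focus on unified models as such. World Models, model-based RL, RL, OPD, and GRPO are either not mentioned or not a research focus, so their relevance is very low. CV has moderate relevance as the visual foundation. Overall scores reflect how well the paper's actual content matches each keyword.
Keywords
VLM representation, VLA initialization, embodied VQA, parameter-update strategy, robot-data pretraining, LoRA, action learning, pretrained VLM
In-Depth Analysis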
Chinese Title: 重新思考视觉-语言模型表征用于视觉-语言-动作初始化的研究
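Summary: This paper systematically studies how useful pretrained vision-language model (VLM) representations are for initializing vision-language-action (VLA) models. The authors treat VLA initialization as a controlled representation-design problem and analyze it along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. The experiments find that the original pretrained VLM representation is a key source of action performance; that embodied VQA adaptation does not always help, since its benefit depends on the downstream bottleneck and gains from different capability domains do not simply add up; that LoRA is a more reliable update strategy than full fine-tuning and better preserves the pretrained representation; and that robot-data pretraining further improves initialization, with staged LoRA training giving the best result. The study concludes that effective VLM-to-VLA adaptation should inject action-relevant signals while preserving the pretrained VLM representation that remains useful for action learning.
Innovations:
- Frames VLA initialization as a controlled representation-design problem and disentangles three key axes (VQA domain, update strategy, robot-data pretraining).
- Shows that the benefit of embodied VQA adaptation is conditional: it depends on the downstream task bottleneck, gains across capability domains do not simply add up, and the best combination is Grounding + Egocentric Understanding.
- Finds that LoRA gives a more reliable initialization than full fine-tuning, indicating that overly reshaping the pretrained representation weakens VLA initialization.
- Proposes a staged training strategy: perception-side VQA adaptation first, then robot-data pretraining, which yields the best initialization.
- Validates the conclusions across several benchmarks and action heads, providing practical guidance for VLM-to-VLA adaptation.
Methodology: The paper uses a two-stage training pipeline. The first stage adapts the base VLM on embodied VQA data to inject capability-level signals; the second stage initializes the VLA policy from the resulting VLM and trains it on action trajectories. By controlling the combination of VQA domains, the parameter-update strategy (LoRA vs. full fine-tuning), and the form of robot-data pretraining, it systematically compares how different initializations affect downstream action performance. Evaluation runs on three simulation benchmarks (Libero-10, SimplerBridge, RoboCasa GR1) with two action architectures: an MLP head and a diffusion expert.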
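The LoRA-versus-full-fine-tuning comparison rests on how LoRA constrains the update: the pretrained weight stays frozen and only a low-rank correction is learned. The sketch below shows the standard LoRA reparameterization on a tiny matrix; the sizes and values are made up for illustration, and nothing here reflects the ranks or layers the paper actually adapts.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w0, a, b, alpha, rank):
    """W = W0 + (alpha / rank) * B @ A, with W0 frozen and only A, B trained."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


w0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight (2x2 toy example)
a = [[0.5, 0.0]]                # A: rank x in_features  (1x2)
b_init = [[0.0], [0.0]]         # B starts at zero, so training begins at W0
b_trained = [[1.0], [2.0]]      # hypothetical values after some training

w_start = lora_effective_weight(w0, a, b_init, alpha=2.0, rank=1)
w_after = lora_effective_weight(w0, a, b_trained, alpha=2.0, rank=1)
```

Because the correction has rank at most `rank`, most directions of the pretrained weight are left untouched, which is consistent with the paper's reading that LoRA preserves more of the pretrained representation than full fine-tuning does.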
Key Results:
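- Policies trained from scratch drop by more than 20% on all benchmarks, indicating that the original pretrained VLM representation is the main source of action performance.
- The benefit of embodied VQA adaptation is conditional: the Grounding + Egocentric Understanding combination gives the largest gain, but gains from different domains do not add up.
- LoRA gives a more effective initialization than full fine-tuning, and the effect varies with VLM strength: the weaker the model, the smaller the LoRA gain and the more severe the degradation under full fine-tuning.
- Robot-data pretraining consistently improves VLA initialization; the best scheme is staged LoRA training (VQA adaptation first, then robot-data pretraining).
- These patterns hold across three benchmarks and two action heads, suggesting the conclusions are general.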
Tech Stack:
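- Vision-language model (VLM)
- Vision-language-action model (VLA)
- LoRA (low-rank adaptation)
- Full fine-tuning (Full Finetune)
- MLP action head
- Diffusion action expert (Diffusion Expert)
- Embodied VQA (visual question answering)
- Two-stage training pipeline
- Simulation environments: Libero-10, SimplerEnv, RoboCasa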
Strengths:
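- Systematically disentangles the factors that affect VLA initialization and provides a clear analysis framework.
- The experimental design is rigorous, isolating each factor through controlled variables.
- Finds counterintuitive patterns (VQA adaptation does not always help; LoRA beats full fine-tuning) with real practical value.
- Validates the conclusions across several benchmarks and action architectures.
- Provides open-source code, which eases reproduction and further research.
Limitations:
- All experiments run in simulation; the conclusions are not verified on real robots.
- Only one VLM family is used (possibly LLaVA or a similar model), so the effect of different VLM architectures is unexplored.
- The classification and combination of VQA domains may be incomplete, and finer-grained capability decomposition is not considered.
- Robot-data pretraining uses a single dataset (AgiBot-World-Beta), so the effect of different data sources is unexplored.
- The internal mechanism of the representation changes (for example attention patterns or feature-space shifts) is not analyzed in depth.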
Relevance To Keywords:
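- Unify Models: the paper studies VLM-to-VLA initialization, which involves unifying vision, language, and action, but it does not directly address unifying understanding and generation.
- World Models: the paper does not directly involve world models, but VLA models can be seen as implicitly learning world dynamics, and the spatial and temporal understanding in embodied VQA relates to world models.
- Representation Learning: core relevance. The paper systematically studies how VLM representations affect VLA initialization and identifies the key factors in representation design.
- Model-Based RL: weak relevance. The paper trains VLAs with behavior cloning and does not involve model-based RL, though a VLA can be seen as a form of model predictive control.
- Native multimodal large models: relevant. The paper uses a pretrained VLM as its base and studies how its representation affects downstream tasks, which involves transfer learning for multimodal large models.
- Unified understanding and generation in multimodal large models: partly relevant. A VLM is an understanding model and a VLA extends it to generate actions, but the paper does not study a unified understanding-generation framework.
- Representation learning: core relevance. The paper's central question is how VLM representations affect VLA initialization, an applied representation-learning study.
- World models: weak relevance. VLA models can be seen as implicitly learning world dynamics, but the paper does not analyze them from that angle.
- Reinforcement learning: weak relevance. The paper trains VLAs with supervised learning (behavior cloning) and does not involve RL, though VLAs can serve as RL policies.
- Post-training: relevant. The two-stage training can be seen as a form of post-training that studies how to optimize VLM representations for downstream tasks.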
Abstract (Chinese Translation)
本文提出InstructSAM,一个统一且精简的框架,旨在实现任意指令下的多实例分割。我们将指令驱动的实例分割形式化为一个集合结构化的查询预测问题,并提出一个显式的推理到实例查询接口,该接口优雅地连接了视觉语言模型(VLM)与SAM3。具体而言,一组可学习的实例查询被注入VLM,并与指令和视觉信息进行上下文融合,使每个查询充当一个实例感知槽位。一种混合注意力机制进一步促进了这些查询、视觉令牌和指令令牌之间的交互,从而改进实例枚举并减少重复预测。由此产生的LLM条件化查询被投影到SAM3的检测器查询空间中,以在单次前向传播中驱动准确的多实例分割。该设计在不修改SAM3核心架构的情况下,赋予其高级指令理解、组合推理和实例级集合预测能力。为支持训练与评估,我们进一步构建了Inst2Seg,一个高质量、大规模的基于指令的实例分割数据集与基准,该数据集将自由形式的指令与实例级掩码配对。大量实验表明,仅2B规模的InstructSAM在复杂的指令驱动和短语级指代分割基准上取得了强劲结果,超越了先前的端到端方法及SAM3的代理流水线,同时实现了高效的单次多实例预测。
Abstract
In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 8.0/10 | 8.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 8.0/10 | 8.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
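Score rationale: The paper proposes the InstructSAM framework, which focuses on multi-instance segmentation and combines a vision-language model (VLM) with SAM3 to segment instances under arbitrary instructions. Its core is a unified model (Unify Models) for instruction-driven segmentation, involving multimodal (MultiModal) and computer vision (CV) tasks and using an MLLM (multimodal large language model) for reasoning. The paper does not involve World Models, model-based RL, OPD, RL, GRPO, or any other RL or world-model concept.
Keywords
InstructSAM, multi-instance segmentation, vision-language model, SAM3, instruction-driven, set prediction, hybrid-attention
In-Depth Analysis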
Chinese Title: InstructSAM: 根据任意指令分割任意实例
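Summary: This paper proposes InstructSAM, a unified end-to-end framework for multi-instance segmentation under arbitrary instructions. The framework models instruction-driven instance segmentation as a set-structured query prediction problem and designs an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. A bank of learnable instance queries is injected into the VLM and interacts with instruction and visual information through a hybrid-attention mechanism, so that each query becomes an instance-aware slot. The resulting LLM-conditioned queries are projected into SAM3's detector query space and drive accurate multi-instance segmentation in a single forward pass. To support training and evaluation, the authors build the Inst2Seg dataset, with 500K training QA pairs and a benchmark of 3,328 human-verified instructions. Experiments show that the 2B-scale InstructSAM achieves strong results on both complex-instruction and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline.
Innovations:
- Proposes an explicit reasoning-to-instance query interface that bridges the VLM's reasoning ability and SAM3's segmentation ability without modifying SAM3's core architecture.
- Introduces a set of learnable parallel instance queries in the LLM as instance slots and combines them with hybrid attention for instruction-conditioned set prediction.
- Builds Inst2Seg, a large-scale instruction-based instance segmentation dataset covering single-target, multi-target, and no-target cases, supporting systematic evaluation.
- Predicts multiple instances in a single forward pass, avoiding the latency and instability of autoregressively generating multiple mask tokens.
Methodology: InstructSAM has three parts: a multimodal LLM for instruction understanding and for contextualizing instance slots; a set of learnable parallel mask queries that act as the instance-slot interface; and a SAM3-based set-prediction mask decoder. Hybrid attention lets instance queries interact with visual tokens, instruction tokens, and other queries. Training optimizes set prediction with a bipartite matching loss.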
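A bipartite matching loss first pairs each ground-truth instance with exactly one query, then computes the loss on the matched pairs. The sketch below shows only the matching step on toy masks, using brute-force search over assignments and `1 - IoU` as the cost; both choices are assumptions made to keep the example dependency-free (practical implementations typically use the Hungarian algorithm and a richer cost), and the paper's actual cost terms are not given in this report.

```python
from itertools import permutations


def mask_iou(pred, target):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(p and t for p, t in zip(pred, target))
    union = sum(p or t for p, t in zip(pred, target))
    return inter / union if union else 0.0


def bipartite_match(pred_masks, gt_masks):
    """Assign each ground-truth mask to a distinct query, minimising sum(1 - IoU).

    Returns (assignment, cost), where assignment[g] is the query index for gt g.
    Brute force: only suitable for a handful of queries.
    """
    assert len(pred_masks) >= len(gt_masks)
    best_cost, best_assign = float("inf"), None
    for perm in permutations(range(len(pred_masks)), len(gt_masks)):
        cost = sum(
            1.0 - mask_iou(pred_masks[q], gt_masks[g]) for g, q in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return best_assign, best_cost


# Three query slots, two ground-truth instances; the third slot stays unmatched.
pred_masks = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]
gt_masks = [[0, 0, 1, 1], [1, 1, 0, 0]]
assignment, cost = bipartite_match(pred_masks, gt_masks)
```

Unmatched slots (here the third query) would be trained toward a "no instance" output, which is how set prediction discourages duplicate masks and handles the no-target case.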
Key Results:
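- The 2B-scale InstructSAM clearly outperforms prior end-to-end methods (for example LISA++) and SAM3's agentic pipeline on complex-instruction segmentation benchmarks.
- It also achieves leading performance on phrase-level referring segmentation benchmarks.
- It is robust across single-target, multi-target, and no-target cases.
- Multi-instance segmentation completes in a single forward pass, so inference is efficient.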
Tech Stack:
- Multimodal large language model (VLM)
- SAM3 (Segment Anything Model 3)
- Learnable instance queries
- Hybrid-attention mechanism
- Bipartite matching loss
- Set prediction
Strengths:
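- A unified end-to-end framework with no complex agentic pipeline and efficient inference.
- The explicit query interface decouples instance segmentation from instruction reasoning and preserves SAM3's open-world segmentation ability.
- The large-scale Inst2Seg dataset fills a gap in instruction-level instance segmentation.
- Achieves state-of-the-art results on several benchmarks, supporting the method's effectiveness.
Limitations:
- At 2B scale, the model may not handle extremely complex long-tail instructions.
- It depends on a pretrained VLM and SAM3, so training and inference are resource-intensive.
- Building Inst2Seg relies on human annotation, which may introduce annotation bias.
- The method targets images only and is not validated on video or 3D scenes.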
Relevance To Keywords:
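- Native multimodal large models: InstructSAM builds on a multimodal LLM for instruction understanding and visual reasoning, an application of native multimodal large models.
- Unified understanding and generation in multimodal large models: the model performs language understanding and mask generation together, though what it generates is segmentation masks rather than text.
- Representation learning: learnable queries and hybrid attention learn instance-aware representations.
- World models: the paper does not directly involve world models or reinforcement learning, but instruction-driven instance segmentation can support scene understanding in embodied AI.
- Reinforcement learning / post-training: the paper does not involve reinforcement learning or post-training techniques.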
Abstract (Chinese Translation)
微调多模态大语言模型(MLLMs)用于视频时间定位(VTG)通常能提升域内性能,但在域迁移下性能急剧下降。本研究发现,这种失败主要不仅源于未见过的查询概念,更源于视觉域迁移,它阻碍了模型将其学习到的时间定位知识与自身固有的实体注意力能力相结合。为解决此问题,我们提出EVIDENT,一种参数高效的适配框架,通过将VTG适配路由至显式的视觉实体证据,将时间定位锚定在预训练MLLMs固有的实体注意力上。EVIDENT包含三个组件:(i)实体瓶颈适配器(Entity Bottleneck Adapter),将密集的视觉令牌转换为紧凑的实体级槽位;(ii)实体绑定蒸馏损失(Entity-Binding Distillation loss),将对象性先验注入语义非结构化的MLLM视觉空间,引导每个槽位绑定到一个连贯的实体;(iii)实体到证据门控机制(Entity-to-eVidence gating),利用捕获的实体作为证据,引导模型定位包含查询相关实体的时刻。这些组件共同使VTG微调依赖于实体锚定的证据,而非脆弱的数据集捷径。在跨域VTG基准上的实验表明,EVIDENT在保持具有竞争力的域内性能且参数开销适中的同时,持续提升了域外鲁棒性。这些结果表明,实体级定位是可泛化时间定位的有效归纳偏置。
Abstract
Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is primarily driven not just by unseen query concepts, but by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 5.0/10 | 5.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 10.0/10 | 10.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 10.0/10 | 10.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
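Score rationale: The paper's core is using MLLMs (multimodal large language models) for video temporal grounding (VTG), which falls within multimodal learning and computer vision. Unify Models is partly related because an MLLM is itself a model that unifies vision and language, but the paper does not discuss the unified-model concept in depth. World Models, model-based RL, RL, GRPO, and OPD do not appear in the paper and are unrelated. CV relevance is high, but the paper emphasizes the MLLM adaptation mechanism, so it scores 8. MLLM and MultiModal are both core topics and score 10.
Keywords
Video Temporal Grounding, MLLM, Entity-Grounded Visual Evidence, Cross-Domain Adaptation, Parameter-Efficient Fine-Tuning, Entity Bottleneck Adapter, Entity-Binding Distillation, Entity-to-eVidence gating
In-Depth Analysis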
Chinese Title: EVIDENT:通过实体锚定的视觉证据引导多模态大语言模型适应以实现跨域视频时间定位
Summary: 本文针对多模态大语言模型(MLLM)在视频时间定位(VTG)任务中的域泛化问题展开研究。作者通过系统分析发现,直接微调MLLM会导致注意力-定位解耦,即模型学到的定位能力无法继承预训练阶段固有的实体注意力,且该问题主要由视觉域偏移而非概念域偏移驱动。为此,提出EVIDENT框架,包含三个核心组件:实体瓶颈适配器(EB Adapter)将密集视觉令牌压缩为紧凑的实体槽;实体绑定蒸馏损失(EB Distillation)将对象性先验注入MLLM视觉空间,引导每个槽绑定到连贯实体;实体到证据门控(E2V)利用预训练MLLM的实体注意力对每帧中查询相关实体的存在性进行评分,并据此调制适配器输出,使定位决策基于实体证据而非数据集特定捷径。实验在多个跨域VTG基准上表明,EVIDENT在保持竞争性域内性能的同时,显著提升了域外鲁棒性,验证了实体级锚定作为可泛化时间定位的有效归纳偏置。
Innovations:
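- Systematically reveals that fine-tuning MLLMs leads to attention-localization decoupling, and shows that it is driven mainly by visual domain shift rather than concept domain shift.
- Proposes the parameter-efficient EVIDENT framework, whose Entity Bottleneck Adapter converts dense visual tokens into compact entity-level representations without requiring instruction tuning to be redone.
- Introduces the Entity-Binding Distillation loss, which uses the pre-trained MLLM's entity attention as supervision to inject objectness priors into the semantically unstructured MLLM visual space.
- Designs the Entity-to-eVidence gating mechanism, which treats the captured entities as explicit visual evidence and steers the model to localize moments containing query-relevant entities.
- Achieves consistent out-of-domain robustness gains on cross-domain VTG benchmarks while maintaining competitive in-domain performance with modest parameter overhead.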
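Methodology: The paper follows this technical route. First, starting from a pre-trained MLLM (e.g., Qwen2.5-VL or InternVL3), an Entity Bottleneck Adapter (EB Adapter) is attached after the visual encoder; it uses learnable entity slots to compress and reorganize the dense visual tokens into compact entity-level representations. Second, an Entity-Binding Distillation loss (EB Distillation) takes the zero-shot cosine similarity between object text tokens and visual patches in the pre-trained MLLM as soft labels, guiding each entity slot to bind to a semantically coherent entity. Finally, Entity-to-eVidence gating (E2V) computes attention scores for query-relevant entities in each frame and dynamically modulates the EB Adapter output, so that temporal localization predictions depend on entity evidence. Only the adapter is trained; the pre-trained weights stay frozen, which makes the adaptation parameter-efficient.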
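The soft-label idea in the distillation step can be illustrated with a small sketch. This is not the authors' implementation: the function names, the softmax temperature, and the choice of a KL objective are assumptions made for illustration; the source only states that cosine similarity between object text tokens and visual patches serves as soft labels.

```python
import numpy as np

def cosine_soft_labels(patch_feats, text_feat, temperature=0.1):
    """Turn patch-to-text cosine similarities into a soft label distribution.

    patch_feats: (num_patches, dim) visual patch features (assumed non-zero).
    text_feat:   (dim,) embedding of one object text token (assumed non-zero).
    temperature: hypothetical sharpening factor, not taken from the paper.
    """
    patches = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    text = text_feat / np.linalg.norm(text_feat)
    sims = patches @ text                 # cosine similarity per patch
    logits = sims / temperature
    logits = logits - logits.max()        # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def binding_loss(slot_attn, soft_labels, eps=1e-8):
    """KL(soft_labels || slot_attn): one plausible way to pull a slot's
    attention map toward the teacher's soft labels (an assumption)."""
    return float(np.sum(soft_labels * (np.log(soft_labels + eps) - np.log(slot_attn + eps))))
```

The loss is zero when a slot's attention map matches the soft labels exactly and grows as the slot spreads attention over patches the text token does not align with.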
Key Results:
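- In the cross-domain setting of training on Charades-STA and testing on QVHighlights, EVIDENT narrows the ID-OOD gap from 19.8 points to about 5 points, markedly improving OOD performance.
- In the setting of training on QVHighlights and testing on Charades-STA, OOD performance improves by about 33.4 points.
- Ablations show that the EB Adapter, EB Distillation, and E2V gating each contribute positively to OOD robustness, with E2V gating contributing the most.
- Compared with the direct fine-tuning baseline, EVIDENT improves OOD performance by 10-20 percentage points while maintaining ID performance.
- Visualizations show that the entity slots learned by EVIDENT stably capture semantically coherent entities (e.g., people, books) across domains, whereas the directly fine-tuned model relies on scene context.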
Tech Stack:
- Qwen2.5-VL-7B / 3B
- InternVL3-2B
- Slot Attention
- Cosine similarity
- Cross-Attention
- Distillation loss
- Gating mechanism
- Parameter-efficient fine-tuning (Adapter)
Strengths:
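- Analyzes in depth the root cause of MLLM domain-generalization failure on VTG, giving follow-up work a clear analytical basis.
- The EVIDENT design combines entity-level representation learning with parameter-efficient adaptation and needs no large-scale retraining.
- Achieves significant and consistent OOD gains on multiple cross-domain benchmarks, supporting the method's effectiveness and generalization.
- Each component is interpretable; the entity-binding distillation and gating mechanisms are intuitive and easy to understand.
- Code and models may be open-sourced, which would ease reproduction and extension.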
Limitations:
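- Depends on the quality of the pre-trained MLLM's entity attention; if the pre-trained model's entity alignment is weak, EB Distillation may suffer.
- The number of entity slots is a hyperparameter; complex scenes with widely varying entity counts may need adaptive adjustment.
- Validated only on video temporal grounding; generalization to other cross-domain multimodal tasks (e.g., video question answering, action recognition) is unexplored.
- Experiments are based mainly on Qwen2.5-VL and InternVL3; applicability to other MLLM architectures remains to be verified.
- No direct comparison with VTG methods based on large-scale instruction tuning (e.g., TimeChat, VTimeLLM); only the direct fine-tuning baseline is compared.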
Relevance To Keywords:
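- Native multimodal large models: The paper directly uses pre-trained MLLMs (Qwen2.5-VL, InternVL3) as base models and studies their domain generalization after fine-tuning.
- Representation learning: The core innovation converts dense visual tokens into compact entity-level representations, which falls within representation learning.
- World models: Understanding video scenes through entity-level representations touches indirectly on modeling world state, but the paper does not build a world model.
- Reinforcement learning: The paper does not involve reinforcement learning; it focuses on domain generalization in the fine-tuning stage.
- Post-training: The proposed parameter-efficient adaptation is a form of post-training, though its focus is cross-domain generalization rather than RL-based alignment.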
摘要翻译
生成高保真且可控的合成数据对于推动端到端自动驾驶的发展至关重要,尤其是在处理长尾分布的罕见安全关键场景方面。现有的占用引导方法通常依赖于浅层条件机制和参考帧相关的视频合成,这限制了从任意BEV(鸟瞰图)布局进行细粒度控制的能力,并制约了其在可扩展仿真中的适用性。本文提出AnyScene,一个统一的以占用为中心的驾驶场景生成框架。AnyScene通过一个时空占用扩散变换器(Spatial-Temporal Occupancy Diffusion Transformer),以自回归方式联合标记化BEV和占用特征,从而从BEV布局生成语义占用序列。该设计实现了从跨数据集和用户定义的BEV输入进行精确控制,同时自然支持长时程生成。在生成的占用基础上,一个几何接地视图扩展模块(Geometry-Grounded View Expansion Module)将占用视为规范空间表示,并以无参考和自回归的方式合成时间一致的多视角驾驶视频,支持推理时的灵活相机配置。大量实验表明,AnyScene在占用生成和视频生成方面均达到了最先进性能。它对未见过的和自定义布局表现出强大的泛化能力,并为稀疏视角三维重建等下游任务提供了可量化的益处。
Abstract
Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 5.0/10 | 5.0 |
| World Models | 1.0 | 6.0/10 | 6.0 |
| MLLM | 1.0 | 2.0/10 | 2.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 7.0/10 | 7.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 3.0/10 | 3.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
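Scoring rationale: The paper focuses on driving scene generation, a core computer vision (CV) topic, and involves multimodal inputs (BEV layouts, occupancy, video). It is partially related to Unify Models (a unified framework) and to World Models (generating future states), weakly related to MLLM (multimodal large language models), and unrelated to model-based RL, RL, and GRPO. OPD (tentatively read here as occupancy prediction and diffusion) is only partially relevant.
Keywords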
driving scene generation, occupancy-centric, BEV layout, diffusion transformer, multi-view video synthesis, autonomous driving, controllable generation, spatial-temporal modeling
In-Depth Analysis
Chinese Title: AnyScene:面向任意地点及更远场景的高度可控驾驶场景生成
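Summary: This paper proposes AnyScene, a unified framework centered on semantic occupancy for highly controllable driving scene generation. The framework first uses a Spatial-Temporal Occupancy Diffusion Transformer (STOccDiT) to autoregressively generate semantic occupancy sequences from arbitrary BEV layouts, enabling precise, high-frequency control. A Geometry-Grounded View Expansion module (GGVE) then treats the generated occupancy sequence as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free, autoregressive manner, supporting arbitrary camera configurations. To support high-frequency evaluation, the authors build the nuCraftv2 dataset (resampled to 12 Hz). Experiments show that AnyScene reaches state-of-the-art results in both occupancy and video generation, generalizes strongly to unseen datasets and custom layouts, and improves downstream tasks such as sparse-view 3D reconstruction.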
Innovations:
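- BEV-conditioned controllable occupancy generation: a BEV-based occupancy VAE and the Spatial-Temporal Occupancy Diffusion Transformer (STOccDiT) jointly tokenize BEV layouts and occupancy latents, enabling high-frequency, precise control from arbitrary, cross-dataset, and user-defined BEV inputs.
- Geometry-Grounded View Expansion (GGVE): uses the generated occupancy sequence as a canonical spatial anchor and renders explicit geometric conditioning signals, enabling reference-free, autoregressive multi-view video synthesis with any number of cameras and any poses, which removes the dependence on reference frames found in existing methods.
- The nuCraftv2 benchmark: resamples nuScenes to 12 Hz and provides synchronized BEV layouts and high-quality dense semantic occupancy for high-frequency evaluation.
- Zero-shot generalization across datasets and global scenes: trained only on nuCraftv2, the model generalizes to unseen datasets and to BEV layouts exported from OpenStreetMap.
- Supports zero-shot occupancy generation for fully custom layouts, as well as temporally consistent multi-view video generation under arbitrary camera configurations.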
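Methodology: AnyScene uses a two-stage pipeline. (1) Occupancy generation: a 2D BEV-based occupancy VAE encodes semantic occupancy into latents, and STOccDiT (an autoregressive diffusion transformer with a causal temporal attention mask) generates the occupancy sequence conditioned on the BEV layout sequence. (2) Video generation: the GGVE module renders the generated occupancy into coordinate and semantic buffers, combines them with Plücker embeddings of the camera poses as ControlNet conditioning, and generates multi-view video autoregressively. Training uses the nuCraftv2 dataset with a multi-view combination sampling strategy.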
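To make the camera-pose conditioning concrete, here is a minimal sketch of per-pixel Plücker ray embeddings. It is not AnyScene's code: the function name, the pinhole intrinsics `K`, the camera-to-world pose `c2w`, and the (direction, moment) channel ordering are assumptions for illustration, since conventions differ between implementations.

```python
import numpy as np

def plucker_rays(K, c2w, height, width):
    """Per-pixel Plücker embedding (direction, moment) of camera rays.

    K:   (3, 3) pinhole intrinsics (assumed).
    c2w: (4, 4) camera-to-world pose (assumed convention).
    Returns an array of shape (height, width, 6).
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)
    dirs_cam = pix @ np.linalg.inv(K).T                      # back-project pixels
    dirs_world = dirs_cam @ c2w[:3, :3].T                    # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origin = np.broadcast_to(c2w[:3, 3], dirs_world.shape)   # camera center per pixel
    moment = np.cross(origin, dirs_world)                    # o x d
    return np.concatenate([dirs_world, moment], axis=-1)
```

Because the moment is `o x d`, it is always orthogonal to the ray direction, so the six channels encode both where the camera is and where each pixel looks without depending on a reference frame.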
Key Results:
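- On nuCraftv2, AnyScene reaches SOTA on both occupancy-generation and video-generation metrics.
- Zero-shot tests show that AnyScene generalizes to unseen datasets such as nuScenes and Waymo, and to global scene layouts exported from OpenStreetMap.
- Supports zero-shot occupancy generation for user-defined layouts (e.g., manually placed agents).
- The generated geometry-grounded videos surpass existing methods in temporal and cross-view consistency.
- In downstream tasks such as sparse-view 3D reconstruction, data generated by AnyScene yields measurable gains.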
Tech Stack:
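- Spatial-Temporal Occupancy Diffusion Transformer (STOccDiT)
- 2D BEV-based occupancy VAE
- Causal temporal attention mask (autoregressive generation)
- ControlNet (conditioning for video generation)
- Plücker embeddings (camera pose representation)
- Diffusion models (Denoising Diffusion Probabilistic Models)
- nuCraftv2 dataset (nuScenes resampled to 12 Hz)
- OpenStreetMap (OSM) layout import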
Strengths:
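- High controllability: supports precise control from arbitrary BEV layouts, across datasets, and from user-defined layouts.
- Strong generalization: trained on a single dataset, it generalizes zero-shot to a variety of unseen scenes.
- Flexible video generation: supports any number of cameras and any poses, needs no reference frame, and produces temporally consistent multi-view synthesis.
- Unified framework: occupancy-centric, decoupling geometry from appearance while supporting both occupancy and video generation.
- Practical value: the generated data can improve downstream tasks (e.g., 3D reconstruction) and helps with the long-tail problem in autonomous driving.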
Limitations:
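- Depends on high-quality occupancy annotations: training needs dense semantic occupancy data, and although nuCraftv2 is resampled to 12 Hz, its annotations are still synthesized and may introduce noise.
- High compute cost: autoregressive diffusion generation and video synthesis require substantial GPU memory and time.
- Video quality is bounded by occupancy accuracy: if occupancy generation makes an error (e.g., a missing agent), the generated video inherits it.
- No joint generation of LiDAR or other modalities: only vision and occupancy are handled, and multimodal consistency is not verified.
- Robustness in extreme dynamic scenes (e.g., high-speed lane changes, heavy occlusion) needs further evaluation.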
Relevance To Keywords:
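- World Models: AnyScene generates semantic occupancy sequences that act as an implicit world model of the scene and synthesizes multi-view video from them, reflecting the world-model idea of learning the dynamics and geometric structure of an environment.
- Representation Learning: Occupancy is a structured, view-independent geometric-semantic representation; AnyScene's VAE and diffusion transformer essentially learn a latent mapping from BEV layouts to occupancy.
- Unified understanding and generation in multimodal large models: AnyScene places occupancy generation (understanding scene structure) and video generation (producing visual appearance) in one framework, reflecting a fusion of understanding and generation.
- Native multimodal large models: AnyScene is not a typical large language model, but its occupancy-centric multimodal generation (BEV + occupancy + video) can be viewed as a multimodal generative model.
- Reinforcement learning / post-training: The paper does not directly involve reinforcement learning or post-training, but the generated synthetic data could be used for closed-loop training and evaluation of driving policies, indirectly supporting environment simulation for reinforcement learning.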
摘要翻译
少步扩散蒸馏(few-step diffusion distillation)的最新进展实现了高效的图像生成,然而使这些模型与人类偏好对齐仍然具有挑战性。我们提出了奖励倾斜分布匹配蒸馏(Reward-Tilted Distribution Matching Distillation, RTDMD),这是一个两阶段框架,将分布匹配蒸馏(distribution matching distillation)与奖励引导的强化学习(reward-guided reinforcement learning)统一起来,用于少步流生成器(few-step flow generators)。我们证明,最小化相对于奖励倾斜的教师分布(reward-tilted teacher distribution)的KL散度(KL divergence)自然分解为一个分布匹配项和一个奖励最大化项。在第一阶段,我们引入了环境一致分布匹配蒸馏(Ambient-Consistent Distribution Matching Distillation, AC-DMD),它执行子区间分布匹配(subinterval-wise distribution matching),并通过一致性正则化器(consistency regularizer)增强虚假分数目标(fake score objective),以帮助虚假分数模型在有限更新下跟踪变化的生成器分布。在第二阶段,我们联合优化这两个项:对于奖励最大化项,我们推导了一个混合策略梯度(hybrid policy gradient),它结合了用于随机中间过渡的GRPO风格估计器(GRPO-style estimator)和通过确定性最终步骤的直接奖励反向传播,并进一步引入了步骤子集GRPO(step-subset GRPO, SubGRPO)以降低方差。在SD3、SD3.5和FLUX.2上的实验表明,RTDMD仅用4个推理步骤就在偏好、美学和组合指标上建立了新的最先进结果,超越了之前的少步文本到图像生成方法。代码和模型可在 https://github.com/Harahan/RTDMD 获取。
Abstract
Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 1.0/10 | 1.0 |
| MLLM | 1.0 | 1.0/10 | 1.0 |
| CV | 1.0 | 5.0/10 | 5.0 |
| MultiModal | 1.0 | 6.0/10 | 6.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 1.0/10 | 1.0 |
| RL | 1.0 | 7.0/10 | 7.0 |
| GRPO | 1.0 | 8.0/10 | 8.0 |
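Scoring rationale: The paper focuses on few-step diffusion distillation and human-preference alignment for text-to-image generation. Its core method combines distribution matching distillation with reinforcement learning (RL) and uses a GRPO-style policy gradient. Relevance to Unify Models is low because the paper does not address multi-task or architectural unification. It is almost unrelated to World Models. Relevance to MLLM (multimodal large language models) is low because no language model or native multimodal large model is involved. Relevance to CV is moderate because image generation is a CV subfield, and relevance to MultiModal is moderate because the task is text-to-image generation. It is almost unrelated to model-based RL, since the paper uses model-free RL, and almost unrelated to OPD (on-policy distillation). It is highly relevant to RL, because reward-guided RL is central, and to GRPO (Group Relative Policy Optimization), because the paper explicitly uses a GRPO-style estimator. The author list does not include any of the designated experts.
Keywords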
few-step diffusion distillation, reward-guided reinforcement learning, distribution matching distillation, GRPO, text-to-image generation, human preference alignment, policy gradient
In-Depth Analysis
Chinese Title: 通过奖励倾斜分布匹配强化少步生成器
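Summary: This paper proposes Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework for training high-quality few-step flow generators. First, Ambient-Consistent Distribution Matching Distillation (AC-DMD) provides a stable cold start: it performs distribution matching on subintervals and adds a consistency regularizer that helps the fake score model track the shifting generator distribution under limited updates. Second, the distribution matching term and the reward maximization term are optimized jointly. For the reward term, the authors derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate steps with direct reward backpropagation through the deterministic final step, and they introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 show that RTDMD, with only 4 inference steps, reaches state-of-the-art results on preference, aesthetic, and compositional metrics, surpassing previous few-step text-to-image generation methods.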
Innovations:
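- Proposes Reward-Tilted Distribution Matching Distillation (RTDMD), in which KL minimization decomposes naturally into a distribution matching term and a reward maximization term, unifying distillation and reinforcement learning.
- Introduces Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs distribution matching on subintervals and stabilizes fake score model training with a consistency regularizer so that it can track the rapidly changing generator distribution.
- Derives a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate steps with direct reward backpropagation through the deterministic final step, and introduces step-subset GRPO (SubGRPO) to reduce variance.
- With 4-step inference on FLUX.2 4B, surpasses the full 50-step FLUX.2 9B, demonstrating the method's effectiveness.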
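Methodology: A two-stage framework is used. Stage one performs a cold start with AC-DMD, training the fake score model through subinterval distribution matching and consistency regularization. Stage two jointly optimizes distribution matching and reward maximization, updating the generator with the hybrid policy gradient (GRPO plus direct backpropagation) and using SubGRPO to reduce variance. The generator uses coefficient-preserving sampling (CPS) to unify the deterministic Euler sampler and the consistency-model sampler.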
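The decomposition that motivates the two terms can be written out in a few lines. The notation below is an assumption chosen for illustration (generator distribution $p_\theta$, teacher distribution $p_T$, reward $r$, temperature $\beta > 0$, prompt conditioning omitted); the paper's own symbols and scaling may differ.

```latex
\begin{aligned}
p^{*}(x) &= \frac{p_{T}(x)\,\exp\!\big(r(x)/\beta\big)}{Z},
\qquad Z = \mathbb{E}_{x \sim p_{T}}\!\big[\exp\!\big(r(x)/\beta\big)\big], \\
\mathrm{KL}\big(p_{\theta} \,\|\, p^{*}\big)
&= \mathbb{E}_{x \sim p_{\theta}}\!\big[\log p_{\theta}(x) - \log p_{T}(x) - r(x)/\beta + \log Z\big] \\
&= \underbrace{\mathrm{KL}\big(p_{\theta} \,\|\, p_{T}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\,\underbrace{\mathbb{E}_{x \sim p_{\theta}}\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{aligned}
```

Since $\log Z$ does not depend on $\theta$, minimizing the left-hand side is equivalent to matching the teacher distribution while maximizing expected reward, with $\beta$ setting the trade-off between the two.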
Key Results:
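- On SD3-M, SD3.5-M, and FLUX.2 4B, RTDMD reaches state-of-the-art results on preference, aesthetic, and compositional metrics with 4-step inference.
- The distilled FLUX.2 4B surpasses the full 50-step FLUX.2 9B on most benchmarks.
- AC-DMD provides a more stable cold start than standard DMD, with more accurate tracking by the fake score model.
- The hybrid policy gradient and SubGRPO effectively reduce the variance of reward optimization and improve generation quality.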
Tech Stack:
- Distribution matching distillation (DMD)
- Conditional flow matching (CFM) loss
- Coefficient-preserving sampling (CPS)
- Consistency regularization
- GRPO (Group Relative Policy Optimization)
- Policy gradient
- KL divergence minimization
- Reward-tilted distribution
- Fake score model
- Subinterval distribution matching
- Step-subset GRPO (SubGRPO)
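As a rough illustration of the GRPO-style ingredient, the sketch below computes group-relative advantages for samples generated from one prompt and picks a subset of stochastic steps to receive the policy-gradient signal. Both function names are hypothetical, and the uniform subset sampling is an assumption; the source does not describe how SubGRPO selects its steps.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples from the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def sample_step_subset(num_steps, subset_size, rng):
    """Hypothetical step-subset choice: draw `subset_size` of the stochastic
    intermediate transitions uniformly, excluding the final step, which the
    summary describes as deterministic and handled by direct backpropagation."""
    stochastic_steps = num_steps - 1
    idx = rng.choice(stochastic_steps, size=subset_size, replace=False)
    return sorted(int(i) for i in idx)
```

Restricting the estimator to a few steps per update means fewer noisy per-step terms are summed, which is one plausible reading of why a step subset would lower variance.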
Strengths:
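- Clear theoretical framing: distillation and reward optimization are unified under KL divergence minimization.
- The two-stage design addresses both cold-start instability and high variance in reward optimization.
- Achieves state-of-the-art results on several large models with only 4 inference steps, so it is efficient.
- The method is general and applies to flow generators of different architectures.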
Limitations:
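- Depends on a pre-trained teacher model and a reward function; the design of the reward function may affect final performance.
- Two-stage training alternates updates between the generator and the fake score model, which is computationally expensive.
- Validated only on text-to-image generation; other modalities (e.g., video, 3D) are untested.
- SubGRPO's step-subset selection strategy may need tuning for different tasks.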
Relevance To Keywords:
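- Unify Models: The paper unifies distribution matching distillation and reinforcement learning, which is a form of method unification.
- World Models: A flow generator models the data distribution for image generation and can only loosely be viewed as a kind of world model.
- Representation Learning: The fake score model learns a representation of the generator distribution, and the consistency regularizer promotes consistent representations.
- Model-Based RL: Reward-tilted distribution matching could loosely be read as model-based RL with the teacher acting as an implicit model, although the policy-gradient estimator itself is model-free.
- Native multimodal large models: The method is applied to post-training alignment of large generative models such as FLUX.2.
- Unified understanding and generation in multimodal large models: The paper focuses on generation, but the reward function can come from an understanding model.
- Reinforcement learning: The reward maximization term uses a policy gradient, which is reinforcement learning.
- Post-training: The proposed distillation and alignment methods belong to the post-training stage.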
摘要翻译
基于扩散的多模态大语言模型(dMLLMs)通过并行迭代预测多个掩码位置的令牌进行解码。这使每个解码步骤成为一个位置选择问题:模型不仅需要选择哪些预测单独可靠,还需要决定哪些位置应作为上下文共同提交给后续解码步骤。现有的基于置信度的解码方法独立地对掩码位置进行排序并提交前K个位置,很大程度上忽略了所提交的令牌是否提供互补的视觉基础。我们识别出该策略在多模态场景中的一个步骤级局限性:同一步骤中选择的高置信度令牌可能依赖重叠的视觉基础,导致提交的令牌之间存在视觉冗余,从而为后续解码留下的互补视觉基础减少。为量化这一效应,我们引入了视觉冗余指数(Visual Redundancy Index, VRI),用于衡量并行提交的令牌之间的视觉基础重叠程度。为在解码过程中控制这种冗余,我们提出了视觉冗余控制解码(Visual-Redundancy-Controlled Decoding, VRCD),一种无需训练、推理时使用的解码方法,通过令牌到图像的注意力优先选择视觉互补的位置。在多种多模态基准测试中,VRCD以适度的运行时开销降低了视觉冗余和剩余位置熵。在更长的解码实验中,与基于置信度的解码相比,它在M^3CoT上实现了高达18.8%的相对准确率提升,在MMBench上实现了6.9%的提升。代码将在https://github.com/infiniteYuanyl/VRCD发布。
Abstract
Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding. Code will be released at https://github.com/infiniteYuanyl/VRCD.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 1.0/10 | 1.0 |
| MLLM | 1.0 | 9.0/10 | 9.0 |
| CV | 1.0 | 7.0/10 | 7.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
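Scoring rationale: The paper focuses on optimizing parallel decoding in diffusion-based multimodal large language models (dMLLMs). Its core is visual redundancy control (VRI and VRCD), which belongs to multimodal understanding and generation. It is highly relevant to MLLM and MultiModal (9 points) because it directly studies decoding in multimodal large models. It is relevant to CV (7 points) because it involves visual attention and visual redundancy. It is weakly relevant to Unify Models (2 points): model unification (diffusion plus LLM) is touched only indirectly. It is almost unrelated to World Models (1 point): multimodal settings imply some world modeling, but this is not discussed. It is unrelated to model-based RL, OPD, RL, and GRPO (0 points), since the paper involves no reinforcement learning or policy optimization. The author list does not include any of the designated experts.
Keywords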
Visual Redundancy, Parallel Decoding, Diffusion-Based MLLMs, Token-to-Image Attention, Visual Grounding, Inference-Time Decoding, Multimodal Benchmarks
In-Depth Analysis
Chinese Title: 基于视觉冗余控制的扩散多模态大语言模型并行解码方法
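Summary: This paper studies visual redundancy in the parallel decoding of diffusion-based multimodal large language models (dMLLMs). Conventional confidence-based decoding selects high-confidence tokens independently and ignores that these tokens may rely on overlapping visual regions, leaving later decoding steps with less complementary visual context. The authors first define the Visual Redundancy Index (VRI) to quantify the overlap in visual attention among tokens committed in parallel. Building on it, they propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that computes visual overlap from token-to-image attention distributions, reweights confidence accordingly, and prioritizes visually complementary tokens for commitment. Experiments show that VRCD reduces visual redundancy and remaining-position entropy, and achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding, with modest runtime overhead.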
Innovations:
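- Identifies and defines visual redundancy in the parallel decoding of diffusion-based multimodal large models: high-confidence tokens selected in the same decoding step may rely on overlapping visual regions.
- Proposes the Visual Redundancy Index (VRI) to quantify the visual-attention overlap among tokens committed in parallel, giving a new metric for analyzing decoding quality.
- Designs Visual-Redundancy-Controlled Decoding (VRCD), a lightweight, training-free decoding method that reranks confidence using attention-derived visual overlap and prioritizes visually complementary tokens.
- Validates VRCD on multiple multimodal benchmarks, showing that redundancy control improves decoding quality while keeping throughput close to the baseline.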
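Methodology: The paper follows this technical route. First, at each decoding step of the diffusion-based multimodal model, the token-to-image attention distribution serves as a lightweight proxy for visual association. Second, the Visual Redundancy Index (VRI) is defined to measure the attention overlap among tokens committed in parallel. Third, a confidence-truncated candidate window is built, a redundancy score is computed for each candidate token, and that score is used to weight confidence, yielding a redundancy-controlled score. Finally, the tokens to commit are selected by this score. The whole procedure needs no extra training; it only computes attention and adjusts the ranking at inference time.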
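The selection procedure can be sketched as a greedy loop. Everything specific here is an assumption for illustration: the overlap measure (histogram intersection of two attention distributions), the reweighting rule `conf * (1 - lam * redundancy)`, and the function names are not taken from the paper, which is only described at the level of the steps above.

```python
import numpy as np

def overlap(attn_a, attn_b):
    """Histogram intersection of two attention distributions, in [0, 1]."""
    return float(np.minimum(attn_a, attn_b).sum())

def select_positions(conf, attn, k, window, lam=1.0):
    """Greedily commit k positions from a confidence-truncated window.

    conf: (n,) confidence per masked position.
    attn: (n, num_patches) token-to-image attention, each row summing to 1.
    """
    conf = np.asarray(conf, dtype=np.float64)
    order = np.argsort(-conf)[:window]          # candidate window by confidence
    chosen = [int(order[0])]                    # always keep the most confident
    remaining = [int(i) for i in order[1:]]
    while len(chosen) < k and remaining:
        scores = []
        for i in remaining:
            redundancy = max(overlap(attn[i], attn[j]) for j in chosen)
            scores.append(conf[i] * (1.0 - lam * redundancy))
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

def redundancy_index(attn, chosen):
    """Mean pairwise overlap of the committed set (a stand-in for VRI)."""
    pairs = [(i, j) for n, i in enumerate(chosen) for j in chosen[n + 1:]]
    if not pairs:
        return 0.0
    return float(np.mean([overlap(attn[i], attn[j]) for i, j in pairs]))
```

With two confident positions that attend to the same patch and a less confident one that attends elsewhere, plain top-K would commit the redundant pair, while this loop commits the complementary pair.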
Key Results:
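- VRCD reduces visual redundancy (VRI) and remaining-position entropy during decoding.
- On M^3CoT, VRCD achieves a relative accuracy gain of up to 18.8% over confidence-based decoding.
- On MMBench, VRCD achieves a relative accuracy gain of up to 6.9%.
- VRCD adds modest runtime overhead and keeps throughput close to the baseline.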
Tech Stack:
- Diffusion-based multimodal large language models (dMLLMs)
- Parallel decoding
- Token-to-image attention
- Visual Redundancy Index (VRI)
- Confidence reweighting
- Candidate window truncation
Strengths:
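- Novel problem: targets visual redundancy among tokens in parallel decoding, rather than the more commonly studied redundancy of input visual tokens.
- Lightweight method: training-free, reuses the existing attention mechanism, and is easy to integrate into existing dMLLMs.
- Thorough experiments: validated on multiple multimodal benchmarks, with relative accuracy gains of up to 18.8%.
- In-depth analysis: quantifies redundancy with VRI and shows intrinsic improvements such as lower entropy.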
Limitations:
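- Relies on attention as a proxy for visual association, which may not fully capture complex visual-semantic relations.
- Targets only diffusion-based multimodal models; applicability to autoregressive multimodal models is not verified.
- Does not discuss how redundancy control affects generation diversity; it may introduce bias in some tasks.
- Experiments use specific model architectures; generalization needs further validation.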
Relevance To Keywords:
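- Native multimodal large models: The diffusion-based multimodal models studied here fall under native multimodal models that process vision and language directly.
- Unified understanding and generation in multimodal large models: dMLLMs support both understanding and generation; the paper focuses on the decoding stage and is therefore related.
- Representation learning: The paper uses attention to characterize visual association but does not study representation learning itself.
- World models: Not addressed.
- Reinforcement learning: Not used.
- Post-training: The method is inference-time decoding and involves no post-training.
- Unify Models: Model unification is not discussed.
- Model-Based RL: Not relevant.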
摘要翻译
多模态大语言模型(Multimodal Large Language Models, MLLMs)近期在地理空间推理方面展现出令人瞩目的进展。然而,现有的遥感基准测试仍以二维(2D)为中心,主要基于光学外观评估模型。在自然环境中,由于严重的光谱混淆(即生态上不同的区域具有相似的纹理,但在垂直结构上存在根本差异),这种范式难以奏效。在此类情况下,显式的三维(3D)结构数据,如冠层高度模型(Canopy Height Models, CHMs),便成为语义消歧的关键几何证据。然而,当前MLLMs是否能够真正利用垂直线索来解决外观层面的歧义,尚不明确。为弥补这一空白,我们提出了VertiCue-Bench——首个基于CHM的地理空间推理诊断基准。VertiCue-Bench包含1,534个精心筛选的实例,涵盖17项任务,明确区分了低层次的高度感知与需要感知歧义的语义推理。对14个最先进的通用及遥感专用MLLMs的评估,结合反事实模态测试,揭示出显著的感知-推理分离现象:尽管模型在读取原始CHM高度线索方面展现出初步能力,但它们大多无法将几何感知转化为可靠的语义推理,在需要联合约束时甚至不如仅依赖RGB的基线模型。总体而言,VertiCue-Bench暴露了自然场景理解中关键的几何到语义鸿沟,为推进地理空间MLLMs提供了可操作的见解。
Abstract
Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 1.0/10 | 1.0 |
| MLLM | 1.0 | 10.0/10 | 10.0 |
| CV | 1.0 | 7.0/10 | 7.0 |
| MultiModal | 1.0 | 8.0/10 | 8.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
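Scoring rationale: The paper's core is evaluating whether multimodal large language models (MLLMs) can use height cues (CHMs) for semantic reasoning in remote sensing scenes, so it is highly relevant to MLLM (10 points). It involves multimodal data (RGB + CHM) and computer vision tasks (8 and 7 points, respectively). It does not involve unified models, world models, or reinforcement learning and its variants (OPD, GRPO), so those keywords score 0 or very low.
Keywords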
MLLMs, CHM, geospatial reasoning, remote sensing, height cues, semantic disambiguation, perception-reasoning dissociation, VertiCue-Bench
In-Depth Analysis
Chinese Title: VertiCue-Bench:诊断多模态大语言模型是否利用高度线索解决遥感自然场景中的二维歧义
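Summary: Current evaluation of remote sensing multimodal large language models (MLLMs) relies mainly on 2D optical appearance and overlooks 3D structural information. To address this, the paper proposes VertiCue-Bench, the first diagnostic benchmark grounded in Canopy Height Models (CHMs). The benchmark contains 1,534 carefully designed instances across 17 tasks and systematically decouples low-level height perception from high-level semantic reasoning. Evaluating 14 general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveals a marked perception-reasoning dissociation: models show initial ability to read CHM height cues but fail to turn geometric perception into reliable semantic reasoning, and on tasks requiring joint constraints they even underperform RGB-only baselines. The paper exposes a critical geometry-to-semantics gap in natural scene understanding and offers actionable insights for advancing geospatial MLLMs.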
Innovations:
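- Proposes VertiCue-Bench, the first benchmark dedicated to diagnosing whether MLLMs use aligned geometric cues (CHMs) to resolve 2D ambiguity.
- Designs a hierarchical evaluation framework that decouples basic geometric perception from higher-order semantic-geometric reasoning, allowing systematic assessment of model capability.
- Through large-scale experiments, reveals a severe perception-reasoning dissociation in current MLLMs: models can perceive height but cannot use it for semantic decisions.
- Uses counterfactual modality testing (RGB-only vs. RGB+CHM) to quantify the actual contribution of geometric information to reasoning.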
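Methodology: The benchmark is built in three stages: data sourcing and preprocessing, automated ground-truth construction, and tuple generation. Aligned RGB and CHM remote sensing images are first collected with strict co-registration. Correct answers for each task are then generated automatically from CHM values and semantic labels. Finally, standardized test instances are constructed. For evaluation, the 14 MLLMs are tested zero-shot under two input conditions, RGB-only and RGB+CHM, and the counterfactual comparison is used to analyze how much each model actually uses the height cues.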
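A counterfactual modality comparison of this kind reduces to a per-task accuracy difference. The sketch below assumes a hypothetical record format (`task`, `correct_rgb`, `correct_rgb_chm`) that is not part of the benchmark's published schema.

```python
from collections import defaultdict

def modality_gain(records):
    """Per-task accuracy under RGB-only and RGB+CHM inputs, plus the gain.

    records: iterable of dicts with hypothetical keys
             'task', 'correct_rgb', 'correct_rgb_chm'.
    """
    totals = defaultdict(lambda: [0, 0, 0])   # count, RGB hits, RGB+CHM hits
    for rec in records:
        entry = totals[rec["task"]]
        entry[0] += 1
        entry[1] += int(rec["correct_rgb"])
        entry[2] += int(rec["correct_rgb_chm"])
    return {
        task: {"rgb": a / n, "rgb_chm": b / n, "gain": (b - a) / n}
        for task, (n, a, b) in totals.items()
    }
```

A positive gain on perception tasks together with a zero or negative gain on reasoning tasks is the pattern the report describes as perception-reasoning dissociation.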
Key Results:
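- Across the 14 MLLMs, adding CHM significantly improves basic height-perception tasks (e.g., point scalar reading).
- On ambiguity-aware reasoning tasks that require semantic-geometric fusion, performance stalls or even drops, and most models fall below the RGB-only baseline.
- Models show a perception-reasoning dissociation: they can read height values but cannot turn them into reliable semantic judgments.
- Remote-sensing-specialized MLLMs are slightly better than general models at geometric perception but perform equally poorly on reasoning tasks.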
Tech Stack:
- Canopy Height Models (CHMs)
- Multimodal Large Language Models (MLLMs)
- Remote Sensing Vision-Language Models (RS-VLMs)
- Zero-shot evaluation
- Counterfactual modality testing
- Hierarchical capability taxonomy
- Automated ground-truth construction
Strengths:
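- Fills the gap in remote sensing MLLM evaluation, which lacked diagnostics for 3D geometric reasoning.
- The fine-grained hierarchical task design progresses from low-level perception to high-level reasoning and is strongly diagnostic.
- Experiments cover 14 mainstream models, so the conclusions are broadly representative.
- The counterfactual test design cleanly separates the effect of geometric information on perception from its effect on reasoning.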
Limitations:
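- The benchmark uses only CHM as a geometric modality and does not consider other 3D data (e.g., point clouds, DSMs).
- Task design leans toward forest and vegetation scenes; generalization to other land-cover types (e.g., buildings, water) is not verified.
- Evaluation is zero-shot only; the effect of fine-tuning or prompt engineering on the use of geometric information is unexplored.
- The dataset (1,534 instances) is relatively small, which may limit statistical significance.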
Relevance To Keywords:
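- Unify Models: The evaluated MLLMs are unified models, but the paper does not address unified framework design.
- World Models: The paper stresses that models need 3D geometric cues to understand the world, which relates to spatial reasoning in world models.
- Representation Learning: The paper diagnoses whether models learn usable geometric representations from CHMs.
- Model-Based RL: The paper involves no reinforcement learning, though geometric reasoning can be seen as part of model prediction.
- Native multimodal large models: The paper evaluates native multimodal large models (e.g., GPT-4V, LLaVA).
- Unified understanding and generation in multimodal large models: The paper focuses on understanding (perception and reasoning) and does not cover generation.
- Post-training: The paper performs only zero-shot evaluation and does not involve post-training.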
摘要翻译
3D视觉定位(3DVG)是具身人工智能(embodied AI)的一项关键能力,要求智能体根据自然语言描述在三维场景中定位物体。最近的零样本方法利用了2D视觉语言模型(LVLMs),但这些方法通常依赖于现有的多视角图像集,并且难以应对标准3D分割工具所提供的有限语义和空间细节。我们提出了AgentGrounder,一个零样本3D视觉定位框架,它直接对彩色点云进行操作,无需针对特定任务的3D训练。我们的方法采用两阶段设计:(1)离线阶段,应用3D模型构建一个包含实例ID、语义标签和3D边界框的对象查找表(OLT);(2)在线阶段,一个由工具驱动的智能体,它分解每个查询,仅从OLT中检索相关候选对象,执行几何评分,并在需要额外视觉证据(如颜色、材质或视角敏感线索)时按需触发图像渲染。与固定的锚点-目标匹配流程相比,这种设计减少了级联匹配错误,并通过避免提示中充斥无关对象来提高上下文窗口效率。我们在零样本设置下对ScanRefer和Nr3D进行了评估,观察到在我们的设置中,与SeeGround相比有一致的改进,包括在ScanRefer上Acc@IoU提高了2.5%,在Nr3D上提高了6.3%,其中在Nr3D视角无关查询上显著提高了6.3%。这些结果表明,结合选择性检索、几何推理和自适应视觉检查,为开放词汇的3D定位提供了实用且稳健的基础。我们的代码可在https://github.com/be2rlab/AgentGrounder获取。
Abstract
3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Our approach follows a two-stage design: (1) an offline stage that applies a 3D model to build an Object Lookup Table (OLT) with instance IDs, semantic labels, and 3D bounding boxes; and (2) an online tool-driven agent that decomposes each query, retrieves only relevant candidates from the OLT, performs geometric scoring, and triggers image rendering on demand when additional visual evidence (e.g., color, material, or viewpoint-sensitive cues) is required. Compared with fixed anchor-target matching pipelines, this design reduces cascading matching errors and improves context-window efficiency by avoiding prompts overloaded with irrelevant objects. We evaluate on ScanRefer and Nr3D under a zero-shot setting and observe consistent improvements over SeeGround in our setup, including +2.5% Acc@IoU on ScanRefer and +6.3% on Nr3D, with a notable +6.3% gain on Nr3D view-independent queries. These results show that combining selective retrieval, geometric reasoning, and adaptive visual inspection yields a practical and robust foundation for open-vocabulary 3D grounding. Our code is available at https://github.com/be2rlab/AgentGrounder.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.0 | 1.0/10 | 1.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 8.0/10 | 8.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
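Scoring rationale: The paper studies 3D Visual Grounding (3DVG), which belongs to computer vision (CV) and multimodal (MultiModal) research, and uses a multimodal large language model (MLLM) for zero-shot reasoning. It does not involve Unify Models, World Models, model-based RL, OPD, RL, or GRPO; these keywords are unrelated to the paper's core content.
Keywords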
3D Visual Grounding, Zero-Shot, Point Cloud, Multimodal Language Models, Object Lookup Table, Geometric Scoring, ScanRefer, Nr3D
In-Depth Analysis
Chinese Title: AgentGrounder:基于多模态语言模型的零样本3D视觉点云定位
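Summary: This paper proposes AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on colored point clouds without task-specific 3D training. Existing zero-shot methods rely on 2D vision-language models and pre-existing multi-view images, and are limited by the sparse semantic and spatial detail of standard 3D segmentation tools. AgentGrounder uses a two-stage design. In the offline stage, a 3D model builds an Object Lookup Table containing instance IDs, semantic labels, and 3D bounding boxes. In the online stage, a tool-driven agent decomposes the query, retrieves relevant candidates from the table, performs geometric scoring, and triggers image rendering when additional visual evidence is needed. This reduces cascading matching errors and improves context-window efficiency. Zero-shot evaluation on ScanRefer and Nr3D shows gains over SeeGround of 2.5% and 6.3% in Acc@IoU, respectively, with a notable gain on Nr3D view-independent queries. The results indicate that combining selective retrieval, geometric reasoning, and adaptive visual inspection provides a practical and robust foundation for open-vocabulary 3D grounding.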
Innovations:
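- Proposes a two-stage zero-shot 3DVG pipeline: offline construction of an Object Lookup Table and online reasoning by a tool-driven agent.
- Adopts a transparent grounding strategy that combines selective candidate retrieval, deterministic geometric scoring, and on-demand rendering, improving reasoning robustness and context-window efficiency.
- Achieves consistent gains on ScanRefer and Nr3D, with the abstract reporting a notable +6.3% gain on Nr3D view-independent queries.
- Handles mismatches between user terms and available labels robustly through a label-mapping mechanism, without retraining.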
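Methodology: The paper follows a two-stage route. The offline stage uses Mask3D for 3D instance segmentation, predicting object instances and semantic labels from the colored point cloud, and builds the Object Lookup Table (OLT), which stores instance IDs, semantic labels, object centers, and 3D bounding-box sizes. The online stage uses Qwen3-VL-32B-Instruct as the core vision-language model. The agent first produces an explicit plan and extracts semantic anchors and spatial constraints, retrieves candidate objects by label, applies deterministic geometric scoring (based on predicates such as distance, size, and direction), calls a rendering tool for visual disambiguation on view-dependent queries, and finally outputs a structured prediction.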
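The lookup-then-score pattern can be shown with a toy example. The `OLTEntry` fields mirror the attributes listed above, but the class, the helper names, and the single "nearest to anchor" predicate are hypothetical simplifications; the actual system reportedly supports several predicates and an LVLM-generated plan.

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class OLTEntry:
    """One row of a toy Object Lookup Table (field names are illustrative)."""
    instance_id: int
    label: str
    center: tuple   # (x, y, z) object center
    size: tuple     # (dx, dy, dz) bounding-box extents

def retrieve(olt, label):
    """Selective retrieval: keep only entries whose label matches."""
    return [entry for entry in olt if entry.label == label]

def nearest_to_anchor(olt, target_label, anchor_label):
    """Deterministic scoring for a 'target nearest to anchor' query."""
    targets = retrieve(olt, target_label)
    anchors = retrieve(olt, anchor_label)
    if not targets or not anchors:
        return None
    def distance_to_closest_anchor(target):
        return min(math.dist(target.center, a.center) for a in anchors)
    return min(targets, key=distance_to_closest_anchor)
```

Because only the retrieved candidates are scored, the prompt sent to the language model never needs to list every object in the scene, which is the context-window saving the summary refers to.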
Key Results:
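- On ScanRefer, Acc@IoU improves by 2.5% over SeeGround.
- On Nr3D, accuracy improves by 6.3% over SeeGround, with a 6.3% gain on view-independent queries (figures as reported in the abstract).
- Selective retrieval and on-demand rendering reduce cascading matching errors and context-window overload.
- The label-mapping mechanism handles mismatches between user terms and available labels, improving query coverage.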
Tech Stack:
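- Mask3D: model for 3D instance segmentation
- Qwen3-VL-32B-Instruct: the core vision-language model (LVLM)
- Ollama: local deployment of the LVLM
- Deterministic geometric scoring: mathematical computation over predicates such as distance, size, and direction
- Object Lookup Table (OLT): stores instance IDs, semantic labels, center coordinates, and bounding-box sizes
- On-demand rendering tool: visual disambiguation for view-dependent queries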
Strengths:
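- The zero-shot setting needs no task-specific 3D training data, lowering annotation cost.
- The two-stage design decouples offline segmentation from online reasoning, improving efficiency.
- Selective retrieval and on-demand rendering cut unnecessary computation and token overhead.
- Deterministic geometric scoring gives a transparent, interpretable reasoning process.
- Achieves consistent improvements on ScanRefer and Nr3D, with notable gains on view-independent queries.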
Limitations:
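- Depends on the quality of offline 3D instance segmentation; segmentation errors can propagate to later reasoning.
- Online reasoning depends on the LVLM's planning ability; complex queries may yield suboptimal plans.
- The label-mapping mechanism relies on contextual inference and may fail in some edge cases.
- Experiments cover only ScanRefer and Nr3D; generalization needs validation on more datasets.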
Relevance To Keywords:
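- Native multimodal large models: The paper uses Qwen3-VL-32B-Instruct as the core LVLM, an application of multimodal large models to 3D grounding.
- Unified understanding and generation in multimodal large models: The agent performs query decomposition, plan generation, and structured output together, combining understanding and generation.
- Representation learning: The Object Lookup Table turns a 3D scene into a structured representation that supports efficient retrieval and reasoning.
- World models: Geometric scoring and on-demand rendering model 3D spatial relations, which loosely echoes the world-model idea of scene understanding.
- Reinforcement learning: Not directly involved; the agent's tool calls and decisions could be framed as policy learning and optimized with post-training in future work.
- Post-training: Not involved; the zero-shot setting avoids task-specific fine-tuning altogether.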
摘要翻译
自回归视频生成器(Autoregressive video generators)在流式、长时域和交互式应用中具有吸引力,但将强大的黑盒教师(black-box teachers)蒸馏为因果学生(causal students)仍然困难。学生必须在其自身的展开分布(rollout distribution)下学习,而实际教师可能仅暴露提示条件完成的视频(prompt-conditioned completed videos),并且在架构、容量、时域设计和采样调度上存在差异。这种接口使得监督微调(supervised fine-tuning)成为离策略(off-policy),基于分数的蒸馏(score-based distillation)不适用,而直接对抗模仿(adversarial imitation)对于去噪时间信用分配(denoising-time credit assignment)而言过于稀疏。我们提出对抗流蒸馏(Adversarial Flow Distillation, AFD),一种用于异构黑盒视频蒸馏(heterogeneous black-box video distillation)的在策略框架(on-policy framework)。AFD查询教师并在相同提示上展开当前学生,训练一个提示配对的Bradley-Terry判别器(Bradley-Terry discriminator)以估计干净样本师生差异(clean-sample teacher-student discrepancy),并将得到的在策略优势(on-policy advantage)转换为对学生自身噪声状态的前向过程流匹配更新(forward-process flow-matching updates)。因此,AFD提供密集速度场监督(dense velocity-field supervision),同时无需教师分数、潜在变量、去噪轨迹、步对齐或反向链强化学习(reverse-chain reinforcement learning)。在两个因果自回归学生族(causal AR student families)上的实验表明,AFD在保持通用视频质量(general video quality)的同时,持续改善了运动和物理敏感生成(motion- and physics-sensitive generation),消融实验验证了自适应在策略反馈(adaptive on-policy feedback)和前向过程信用分配(forward-process credit assignment)的重要性。该方法仅需干净教师视频和学生展开,为将专有或异构视频生成器蒸馏为高效自回归学生(efficient autoregressive students)提供了一条实用路径。
Abstract
Autoregressive video generators are attractive for streaming, long-horizon, and interactive applications, but distilling strong black-box teachers into causal students remains difficult. The student must learn under its own rollout distribution, whereas practical teachers may expose only prompt-conditioned completed videos and may differ in architecture, capacity, temporal design, and sampling schedule. This interface makes supervised fine-tuning off-policy, score-based distillation inapplicable, and direct adversarial imitation too sparse for denoising-time credit assignment. We propose Adversarial Flow Distillation (AFD), an on-policy framework for heterogeneous black-box video distillation. AFD queries the teacher and rolls out the current student on the same prompts, trains a prompt-paired Bradley-Terry discriminator to estimate clean-sample teacher-student discrepancy, and converts the resulting on-policy advantage into forward-process flow-matching updates on the student's own noised states. Thus, AFD provides dense velocity-field supervision while requiring no teacher scores, latents, denoising trajectories, step alignment, or reverse-chain reinforcement learning. Experiments across two causal AR student families show that AFD consistently improves motion- and physics-sensitive generation while preserving general video quality, and ablations validate the importance of adaptive on-policy feedback and forward-process credit assignment. The method requires only clean teacher videos and student rollouts, providing a practical route for distilling proprietary or heterogeneous video generators into efficient autoregressive students.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 1.0/10 | 1.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 3.0/10 | 3.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 8.0/10 | 8.0 |
| RL | 1.0 | 3.0/10 | 3.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
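Scoring rationale: The paper focuses on distilling autoregressive video generation models, a core computer vision (CV) topic, but does not involve multimodal large language models (MLLM), World Models, or Unify Models. MultiModal is only slightly relevant because video is a visual modality; the paper does not address multimodal fusion. OPD (On-Policy Distillation) is highly relevant to the paper's on-policy framework. RL and model-based RL are only indirectly relevant (adversarial training and an advantage signal are used, but not standard RL). GRPO is unrelated.
Keywords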
Adversarial Flow Distillation, On-Policy, Autoregressive Video Generation, Black-Box Distillation, Bradley-Terry Discriminator, Flow Matching, Credit Assignment
In-Depth Analysis
Chinese Title: 面向自回归视频生成的在线策略对抗流蒸馏
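Summary: This paper addresses the difficulty of distilling strong black-box teachers into autoregressive video generators and proposes an on-policy framework, Adversarial Flow Distillation (AFD). The background is that many strong video teachers act only as black-box samplers that return completed videos without exposing scores, latents, or denoising trajectories, while an autoregressive student must learn under its own rollout distribution, so the supervision signal is mismatched. AFD trains a prompt-conditioned Bradley-Terry discriminator to estimate the distribution discrepancy between teacher and student, and converts that discrepancy into forward-process flow-matching updates, giving the student dense velocity-field supervision. AFD needs no teacher scores, latents, denoising trajectories, or step alignment; it requires only the teacher's completed videos and the student's own rollouts. Experiments on two causal autoregressive video backbones show consistent gains on motion-sensitive and physics-sensitive metrics while preserving overall video quality. Ablations confirm the importance of adaptive on-policy feedback and forward-process credit assignment.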
Innovations:
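- Identifies black-box heterogeneous on-policy distillation as the core obstacle for autoregressive video students, and notes that existing approaches (SFT, score-based DMD, direct video-level adversarial training) do not match the limited teacher interface.
- Proposes AFD, a score-free distillation framework in which a discriminator estimates teacher-student discrepancy from completed videos and converts it into dense forward-process flow-matching updates on the student's own noised states.
- Evaluates AFD on two causal autoregressive video backbones, with consistent gains on motion-sensitive and physics-sensitive metrics, and runs ablations on domain adaptation and discriminator design.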
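Methodology: AFD has two main parts: an adaptive teacher-student discriminator and a diffusion-native on-policy update. First, a prompt-conditioned spatio-temporal discriminator is trained with a Bradley-Terry loss to distinguish teacher samples from student samples, producing an on-policy advantage score. Then, using the DiffusionNFT approach, student rollouts are noised by the forward process, positive and negative samples are weighted by the discriminator advantage, and the student's velocity field is updated with a negative-aware fine-tuning loss plus a regularization term. The procedure needs no teacher scores, latents, or denoising trajectories, only the teacher's completed videos and the student's own rollouts.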
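The Bradley-Terry objective for a prompt-paired discriminator has a compact form, sketched below with NumPy. The function names are hypothetical, and the scalar scores stand in for the outputs of a video discriminator; how AFD maps these scores to an advantage for the flow-matching update is not specified in the source and is not modeled here.

```python
import numpy as np

def bt_preference_prob(score_teacher, score_student):
    """Bradley-Terry probability that the teacher sample is preferred:
    sigmoid(score_teacher - score_student)."""
    margin = np.asarray(score_teacher, dtype=np.float64) - np.asarray(score_student, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-margin))

def bt_discriminator_loss(score_teacher, score_student):
    """Mean negative log-likelihood that teacher beats student on paired prompts,
    computed as softplus(-(margin)) for numerical stability."""
    margin = np.asarray(score_teacher, dtype=np.float64) - np.asarray(score_student, dtype=np.float64)
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

When the discriminator cannot tell the two apart (equal scores), the loss equals log 2 and the preference probability is 0.5, which corresponds to a student whose rollouts are indistinguishable from the teacher's under this discriminator.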
Key Results:
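- On benchmarks such as VBench and VideoPhy-2, AFD shows consistent gains over the SFT baseline on motion-sensitive and physics-sensitive metrics.
- Ablations confirm the importance of adaptive on-policy feedback and forward-process credit assignment.
- AFD markedly improves motion generation and physical plausibility while preserving overall video quality.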
Tech Stack:
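- Bradley-Terry preference model
- Flow Matching / Rectified Flow
- DiffusionNFT (forward-process diffusion optimization)
- LoRA (low-rank adaptation)
- Autoregressive video generation models (Self-Forcing, etc.)
- Video discriminator (initialized from VideoAlign, etc.)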
Strengths:
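- Needs no teacher internals (scores, latents, denoising trajectories), only completed video samples, so it suits black-box or heterogeneous teachers.
- Provides dense velocity-field supervision that goes beyond video-level adversarial training and assigns credit across denoising timesteps.
- On-policy training avoids the distribution shift of SFT; the student learns on its own rollout distribution.
- The method is general and applies to autoregressive video students of different architectures.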
Limitations:
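- Discriminator quality directly affects distillation; if the discriminator cannot accurately separate teacher from student, the advantage signal may be inaccurate.
- Teacher, student, and discriminator must be maintained together, which is computationally expensive, especially for high-resolution long videos.
- Experiments cover only two autoregressive backbones; generalization needs more testing.
- Compatibility with purely diffusion-based (non-autoregressive) students is not discussed.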
Relevance To Keywords:
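- Unify Models, World Models, Representation Learning, Model-Based RL: The paper concerns distillation of video generation models, which relates to unified models and to world models insofar as video generators can be seen as one form of world model.
- Native multimodal large models and unified understanding and generation: Autoregressive video generation is an important direction in multimodal generation; AFD aims to improve generation quality and relates to post-training of such models.
- Representation learning: The discriminator learns the teacher-student distribution discrepancy, which involves representation learning.
- Reinforcement learning: The use of on-policy sampling and an advantage signal echoes policy-gradient ideas, although the update is a forward-process optimization.
- Post-training: Distillation is a form of post-training, and AFD targets post-training optimization of video generation models.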
摘要翻译
视觉语言模型(Vision Language Models)能够很好地适应下游任务,但极易受到破坏跨模态语义对齐的对抗性扰动的影响。现有防御方法大多为单向或结构性方法,未能利用双向跨模态互补性和实例级自适应保护。为了克服对抗环境下单向和静态防御的局限性,我们提出闭环双向提示(Closed-Loop Bidirectional Prompting),通过冻结编码器上的动态反馈循环将鲁棒适应转化为跨模态一致性恢复。引入语义锚点(Semantic Anchor)作为稳定先验,以约束循环更新并减轻扰动引起的特征损坏。通过基于锚点的自举,文本语义对视觉表示进行去噪,而改进后的视觉表示能够实现实例自适应提示更新,从而产生修正且鲁棒的共识。在11个数据集上的广泛评估验证了最先进的鲁棒性和强大的基类到新类泛化能力,同时保持了计算成本与准确性之间的良好平衡。
Abstract
Vision Language Models adapt well to downstream tasks but are highly vulnerable to adversarial perturbations that disrupt cross-modal semantic alignment. Existing defenses are largely unidirectional or structural, failing to exploit bidirectional cross-modal complementarity and instance-wise adaptive protection. To overcome the limitations of unidirectional and static defenses in adversarial settings, we propose Closed-Loop Bidirectional Prompting, casting robust adaptation as cross-modal agreement recovery via a dynamic feedback loop on frozen encoders. A Semantic Anchor is introduced as a stable prior to constrain cyclic updates and mitigate perturbation-induced feature corruption. Through anchor-based bootstrapping, textual semantics denoise visual representations, while the refined visuals enable instance-adaptive prompt updating, yielding a rectified and robust consensus. Extensive evaluations across 11 datasets validate state-of-the-art robustness and strong base-to-new generalization, while maintaining a favorable trade-off between computational cost and accuracy.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 6.0/10 | 6.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
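Scoring Rationale: The paper studies the adversarial robustness of vision-language models (VLMs) and proposes Closed-Loop Bidirectional Prompting. Its core is multimodal (vision plus language) and computer vision (CV), with some connection to MLLM (VLMs fall within the broader multimodal large language model family), but it does not touch unified models, world models, model-based reinforcement learning, OPD, RL, or GRPO. MultiModal and CV therefore score high, MLLM scores in the middle, Unify Models scores low, and the remaining keywords score 0.
Keywords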
Closed-Loop Bidirectional Prompting, Adversarial Robustness, Vision Language Models, Semantic Anchor, Cross-Modal Agreement, Instance-Adaptive, Robustness
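In-Depth Analysis
Title (translated): Closed-Loop Bidirectional Prompting: Adversarial Robustness of Vision-Language Models
Summary: This paper proposes Closed-Loop Bidirectional Prompting (CLBP), a test-time defense for the adversarial robustness of vision-language models (VLMs). Existing defenses are mostly unidirectional or static and fail to exploit cross-modal complementarity and instance-adaptive protection. CLBP builds a dynamic feedback loop on frozen encoders, alternating text-to-vision (T2V) denoising with vision-to-text (V2T) refinement to restore cross-modal semantic alignment. A Semantic Anchor serves as a stable prior that constrains the cyclic updates and mitigates perturbation-induced feature degradation. Through anchor-based bootstrapping, textual semantics denoise the visual representation, and the refined visual features drive instance-adaptive prompt updates, producing a rectified, robust consensus. Evaluations on 11 datasets show state-of-the-art robustness under three protocols (cross-dataset zero-shot, few-shot adaptation, and base-to-new generalization) while keeping a favorable trade-off between computational cost and accuracy.
Innovations:
- Proposes Closed-Loop Bidirectional Prompting (CLBP), which builds a dynamic feedback loop between visual and textual features on frozen encoders to achieve bidirectional cross-modal correction.
- Introduces a Semantic Anchor as a stable prior that initializes and constrains the cyclic updates, preventing perturbation propagation and feature drift.
- Designs a text-to-vision (T2V) denoising module that uses the relatively stable language signal to filter visual noise.
- Designs a vision-to-text (V2T) refinement module that maps visual features to instance-adaptive offsets of the text prompt.
- Adds a multi-view aggregation module that suppresses the sampling noise of a single augmentation and stabilizes predictions.
Methodology: CLBP builds on a pretrained CLIP model and keeps both the image encoder and the text encoder frozen. It first initializes a fixed text anchor through Semantic Bootstrapping. It then runs the closed loop: in the T2V step, the current text prototypes pass through a lightweight adapter to produce visual prompts, which are prepended to the image patches so that cross-attention filters out noise; in the V2T step, the resulting visual features are mapped by a second adapter into instance-adaptive offsets for the text prompt. The cyclic updates are constrained by the Semantic Anchor, and a fixed-point analysis shows that a single bidirectional update already lies close to the loop's fixed point. Finally, Multi-View Aggregation combines the predictions from several augmented views. Training uses a three-component loss: an adversarial loss, a consistency loss, and an anchor-constraint loss.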
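The final prediction step (cosine similarity with temperature scaling, followed by multi-view aggregation) can be illustrated with a small sketch. The function names, the temperature value, and the choice to average class probabilities across views are assumptions for illustration; the paper's actual aggregation rule is not given in this digest.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def predict_multi_view(view_features, text_prototypes, temperature=0.01):
    """view_features: (V, D) features of V augmented views; text_prototypes: (C, D)."""
    v = view_features / np.linalg.norm(view_features, axis=1, keepdims=True)
    t = text_prototypes / np.linalg.norm(text_prototypes, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature   # (V, C) temperature-scaled cosine similarity
    probs = softmax(logits, axis=1)    # per-view class probabilities
    return probs.mean(axis=0)          # assumed aggregation: average over views
```

Averaging over views damps the effect of any single augmentation that happens to land on a badly perturbed crop.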
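Key Results:
- Across 11 datasets, CLBP outperforms adversarial fine-tuning (AFT), adversarial prompt tuning (APT), and test-time defense baselines in adversarial robustness.
- CLBP substantially improves robustness on adversarial examples while preserving clean accuracy.
- It reaches state-of-the-art performance under all three protocols: cross-dataset zero-shot, few-shot adaptation, and base-to-new generalization.
- A single bidirectional update is already close to the loop's fixed point, so inference needs only one iteration and the overhead is low.
- The Semantic Anchor effectively constrains the cyclic updates, preventing perturbation propagation and feature drift.
Tech Stack:
- CLIP model (frozen image encoder fV and text encoder fT)
- Text-to-vision (T2V) adapter ΦT2V
- Vision-to-text (V2T) adapter ΘV2T
- Lightweight MLP for context-token offsets (Compose(·))
- Cosine similarity with temperature-scaled prediction
- Projected gradient descent (PGD) adversarial attack
- AutoAttack ensemble attack framework
- Multi-View Aggregation
- Three-component loss (adversarial loss, consistency loss, anchor-constraint loss)
- Fixed-point analysis and Lipschitz constant estimation
Strengths:
- The bidirectional feedback loop exploits cross-modal complementarity and overcomes the limits of existing unidirectional or static defenses.
- The Semantic Anchor provides stability while retaining instance adaptivity, which effectively prevents perturbation propagation.
- It runs on frozen encoders with low computational overhead, which suits practical deployment.
- State-of-the-art robustness and generalization are verified across multiple datasets and evaluation protocols.
- The theoretical analysis (fixed-point convergence and stability) gives the method solid support.
Limitations:
- The method depends on a pretrained CLIP model and may not transfer to VLMs with other architectures.
- Although the overhead is low, the bidirectional loop still adds inference time compared with a single forward pass.
- The attacks assume a white-box setting; robustness to black-box attacks is not fully verified.
- The choice of templates used to initialize the Semantic Anchor may affect performance and requires manual design.
- Under extreme perturbations the loop may still accumulate error, even though the theoretical analysis indicates convergence.
Relevance To Keywords:
- Unify Models: the method builds on CLIP but does not address unifying multimodal understanding and generation.
- World Models: not directly related; the method builds no world model and involves no environment interaction.
- Representation Learning: the core idea is to restore cross-modal representation alignment through the bidirectional loop, which is highly related to representation learning.
- Model-Based RL: not directly related; the method involves neither reinforcement learning nor model-predictive control.
- Native multimodal large models: the method targets CLIP-style vision-language models, which belong to the multimodal large model family.
- Unified multimodal understanding and generation: the method covers only the understanding side (classification) and does not involve generation.
- Representation learning: the method optimizes cross-modal representations through the Semantic Anchor and the bidirectional loop, which is closely related to representation learning.
- World models: not directly related.
- Reinforcement learning: not directly related.
- Post-training: the method is a test-time defense and needs no post-training of the backbone, although the adapters must be trained.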
Abstract
With the continued advancement of text-to-image (T2I) generation, producing high-quality images is becoming increasingly attainable; consequently, user demands are shifting toward images that better satisfy their specific requirements. As reward models play an increasingly important role in assessing whether generated images align with user preference, this trend introduces an important challenge for reward modeling: rather than relying solely on static and general evaluation dimensions, reward models should account for the task-relevant and fine-grained criteria through which users assess whether generated images meet their specific requirements. To address this challenge, we propose DyCoRM, a dynamic, criterion-aware reward model that grounds task-relevant criteria and performs criterion-aware preference comparison. To support this setting, we construct DyCoDataset-20K, which provides dynamic criteria together with criterion-level annotations, and further derive DyCoBench-1K, a benchmark for systematically evaluating reward models under dynamic criteria. We further introduce DyCoPick, which applies criterion-aware reward modeling to selecting T2I images. Our contributions establish the first reward modeling framework for dynamic and fine-grained evaluation and practical application in T2I generation.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 1.0/10 | 1.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 2.0/10 | 2.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 3.0/10 | 3.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
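Scoring Rationale: The paper focuses on a dynamic, criterion-aware reward model for text-to-image generation. It belongs to computer vision and multimodal research but has little or no connection to unified models, world models, model-based reinforcement learning, OPD, or GRPO. MLLM and RL are weakly related (a reward model can be used in RLHF, but the paper does not explicitly involve reinforcement learning or large language models).
Keywords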
text-to-image generation, reward model, dynamic criteria, criterion-aware, preference comparison, DyCoRM, DyCoDataset, DyCoBench
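In-Depth Analysis
Title (translated): DyCoRM: Dynamic Criterion-Aware Reward Modeling for Text-to-Image Generation
Summary: This paper addresses the way user requirements shift in text-to-image (T2I) generation and proposes DyCoRM, a dynamic, criterion-aware reward model. Existing reward models usually evaluate along fixed, static dimensions and cannot adapt to the criteria users care about in a particular task (for example, text legibility or a character's facial expression). The authors therefore build DyCoDataset-20K, a dataset with dynamic criteria and criterion-level annotations, and derive from it the benchmark DyCoBench-1K. DyCoRM uses a two-stage framework: it first infers task-relevant evaluation criteria from the prompt-image context and then performs preference comparison under those criteria. The authors also propose DyCoPick, which applies criterion-aware reward modeling to the selection of generated images. Experiments indicate that the approach adapts to diverse user requirements and supports finer-grained dynamic evaluation.
Innovations:
- Proposes the two-stage DyCoRM framework: infer task-relevant criteria first, then compare preferences under those criteria, enabling dynamic, fine-grained reward modeling.
- Builds DyCoDataset-20K, which contains dynamic criteria and criterion-level human annotations to support criterion-aware training.
- Derives the DyCoBench-1K benchmark from the dataset for systematic evaluation of reward models under dynamic criteria.
- Proposes DyCoPick, which extends criterion-aware reward modeling to T2I output selection for personalized picking.
Methodology: Prompts covering varied topics and complexity levels are first generated with GPT-4o, and image pairs are produced with 14 mainstream T2I models. Annotation then proceeds in two stages: several annotators iteratively settle on fine-grained criteria relevant to each prompt and image pair, and then make pairwise preference comparisons under each criterion along with an overall preference judgment. DyCoRM is trained in two steps: the first trains a criterion-inference module that predicts criteria from the prompt-image context, and the second trains a criterion-conditioned preference module that compares an image pair under given criteria. DyCoPick uses the trained model to rank candidate images in a criterion-aware way.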
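To make the criterion-aware selection idea concrete, here is a toy ranking sketch. Everything in it is hypothetical: `rank_candidates`, the `prefer` callback (standing in for a trained criterion-conditioned preference model), and the win-counting rule are illustrative assumptions, not DyCoPick's actual procedure.

```python
from itertools import combinations

def rank_candidates(candidates, criteria, prefer):
    """Rank candidates by pairwise wins summed over all criteria.

    prefer(a, b, criterion) -> True if a is preferred over b under that criterion.
    """
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        for criterion in criteria:
            winner = a if prefer(a, b, criterion) else b
            wins[winner] += 1
    # sorted() is stable, so ties keep the original candidate order.
    return sorted(candidates, key=lambda c: wins[c], reverse=True)

# Toy usage with made-up per-criterion scores standing in for model judgments.
toy_scores = {
    "text_legibility": {"img_a": 0.9, "img_b": 0.4, "img_c": 0.1},
    "expression": {"img_a": 0.8, "img_b": 0.2, "img_c": 0.5},
}
toy_prefer = lambda a, b, crit: toy_scores[crit][a] > toy_scores[crit][b]
ranking = rank_candidates(["img_a", "img_b", "img_c"], list(toy_scores), toy_prefer)
```

The point of the sketch is only that the criteria are an input to the comparison: changing the criteria list can change the ranking without retraining anything.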
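Key Results:
- DyCoDataset-20K contains 4,876 prompts, 23,378 image pairs, and 94,572 fine-grained criteria with their annotations.
- Prompt lengths mostly fall between 120 and 400, covering easy, medium, and hard tasks and diverse topics.
- Images come from 14 models with varied inter-model similarity, including both easily distinguishable samples and fine-grained confusable ones.
- The average semantic similarity between prompts and criteria is about 0.25, indicating that criteria are grounded in the prompt yet remain diverse at a fine-grained level.
- DyCoBench-1K is filtered from the full dataset along four dimensions: representativeness, dynamic criteria, evaluation difficulty, and annotation reliability.
Tech Stack:
- GPT-4o (prompt generation)
- Gemini-3-Flash (prompt cleaning)
- 14 T2I models (including GPT-Image-1, DALL·E 3, Midjourney, FLUX.1-dev, Stable Diffusion 3.5, and others)
- Two-stage human annotation protocol (criterion formulation plus criterion-level preference comparison)
- Semantic similarity computation (for analyzing the prompt-criterion relationship)
- Two-stage training strategy (criterion inference plus criterion-conditioned preference prediction)
Strengths:
- First to propose dynamic, criterion-aware reward modeling, addressing the inability of existing models to adapt to task-specific requirements.
- Builds a large, high-quality dataset with criterion-level annotations that provides a basis for fine-grained evaluation.
- The two-stage framework is clear: it decomposes evaluation into criterion inference and preference comparison, which aids interpretability.
- DyCoPick ties dynamic evaluation to generation-time selection and has practical value.
Limitations:
- Annotating dynamic criteria is costly, and the dataset is relatively small (about 20K image pairs).
- Criterion inference depends on the prompt-image context and may be limited by the model's grasp of complex semantics.
- The work targets only T2I generation and has not been extended to other multimodal generation tasks such as video or 3D.
- Comparisons with mainstream reward models such as HPSv2 and PickScore are not sufficiently thorough.
Relevance To Keywords:
- Related to "native multimodal large models" and "unified multimodal understanding and generation": the paper studies reward models for T2I generation, which involves multimodal understanding (inferring criteria from prompts and images) and generation evaluation.
- Related to "representation learning": a reward model must learn joint image-text representations to compare preferences.
- Related to "reinforcement learning": reward models can be used in post-training (for example, DPO-style optimization), and the proposed dynamic criterion awareness can improve the quality of the reward signal.
- Related to "post-training": the paper directly targets post-training for T2I generation (alignment or selection through a reward model).
- Weakly related to "world models" and "Model-Based RL": the paper involves no environment modeling or model-based planning and focuses on reward modeling itself.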
Abstract
Referring expression comprehension (REC) aims to localize a target object within an image based on a given expression. Although recent advances in vision-language models have led to substantial improvements in REC tasks, current REC benchmarks often hold simple scenarios and the assumption that each expression maps to a unique object. These limitations hinder the deployment of REC models in open-world environments. To fill this gap, we introduce OpenRef, a new benchmark for REC in complex visual and linguistic scenarios. OpenRef features three key advancements: 1) Diverse visual scenarios: spanning diverse visual domains, including ground views, drone views, dark scenes and adverse weather conditions; 2) Variable target counts: breaking the single-target limitation with multi-target and none-target samples; 3) Rich vocabulary types: incorporating proper nouns, polysemous words and ordinal terms to fit a wider range of expression needs. Furthermore, as traditional metrics are insufficient for open-world setting, we leverage F1 to measure grounding accuracy and propose N3R (Negative Relative Rejection Reliability) to assess relative rejection reliability against negative expressions. Finally, we introduce Multi-task Consistency Checker (MCC), a training-free but plug-and-play strategy that enhances model performance with one click by enforcing consistency self-verification. Extensive experiments demonstrate that this work significantly advances the performance of existing REC models in complex scenarios, paving the way for open-world REC. Project page: https://zongjianwu.github.io/openref
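Because OpenRef allows multi-target and none-target expressions, a per-sample F1 has to handle variable numbers of boxes. The sketch below is one plausible formulation under stated assumptions: greedy IoU matching at a 0.5 threshold, and a score of 1.0 when the model correctly returns nothing for a none-target expression. The benchmark's exact matching rule and the N3R definition are not given in the abstract, so neither is reproduced here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def f1_for_sample(pred_boxes, gt_boxes, iou_thr=0.5):
    """F1 between predicted and ground-truth boxes for one expression."""
    if not pred_boxes and not gt_boxes:
        return 1.0  # assumed convention: correct rejection of a none-target expression
    matched, tp = set(), 0
    for p in pred_boxes:
        best_j, best_iou = -1, iou_thr
        for j, g in enumerate(gt_boxes):
            if j in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:  # each ground-truth box can be matched at most once
            matched.add(best_j)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(pred_boxes)
    recall = tp / len(gt_boxes)
    return 2 * precision * recall / (precision + recall)
```

Under this convention, predicting any box for a none-target expression scores 0, which is the failure that single-target benchmarks cannot measure.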
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 3.0/10 | 3.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 10.0/10 | 10.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
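Scoring Rationale: The paper focuses on open-world referring expression comprehension (REC), a computer vision (CV) and multimodal (MultiModal) task. It has some connection to MLLM (it uses vision-language models) but does not involve unified models, world models, reinforcement learning, OPD, or GRPO. Only CV and MultiModal therefore receive full marks, MLLM receives 3, and all other keywords receive 0.
Keywords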
open-world referring expression comprehension, benchmark, multi-task consistency checker, training-free, vision-language models, complex scenarios, F1, N3R
Abstract
Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
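The three-stage pipeline in the abstract (filter by gain, rank by quality, allocate budget across signature buckets) can be sketched as follows. The dictionary keys, the gain threshold, and the round-robin allocation rule are assumptions made for illustration; the abstract does not specify how the bucket-wise budget is divided. Multimodal Gain is taken here to mean the answer log-likelihood with the image minus the log-likelihood without it, computed upstream.

```python
from collections import defaultdict

def select_coreset(samples, budget, gain_threshold=0.0):
    """samples: list of dicts with keys 'id', 'gain', 'quality', 'signature'."""
    # Stage 1: drop samples whose multimodal gain is too low.
    kept = [s for s in samples if s["gain"] > gain_threshold]

    # Stage 2: group by discrete neuron signature and rank each bucket by quality.
    buckets = defaultdict(list)
    for s in kept:
        buckets[s["signature"]].append(s)
    for bucket in buckets.values():
        bucket.sort(key=lambda s: s["quality"], reverse=True)

    # Stage 3 (assumed rule): round-robin over buckets so that every signature
    # is covered once before any signature receives a second pick.
    selected, depth = [], 0
    while len(selected) < budget:
        took = False
        for sig in sorted(buckets):
            if depth < len(buckets[sig]) and len(selected) < budget:
                selected.append(buckets[sig][depth]["id"])
                took = True
        if not took:
            break  # every bucket is exhausted
        depth += 1
    return selected
```

Allocating per bucket rather than taking a global top-k is what keeps rare skills in the subset even when their samples have lower quality scores.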
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 6.0/10 | 6.0 |
| CV | 1.0 | 5.0/10 | 5.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
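Scoring Rationale: The paper focuses on coreset selection for multimodal instruction tuning. Its core is filtering multimodal (vision-language) data, so it is highly related to MultiModal (9). It involves MLLM (large vision-language models), though more specifically LVLMs, so relevance is moderate (6). The CV side involves visual input but is not central, so relevance is middling (5). Unify Models is only slightly related (2) because the paper does not discuss unified models or joint generation and understanding. World Models, model-based RL, OPD, RL, and GRPO are not mentioned and are unrelated (0).
Keywords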
Multimodal Instruction Tuning, Coreset Selection, Vision-Language Models, Multimodal Gain, Bridging Relevance, Skill-Neuron Signatures, LLaVA, Data Efficiency
Abstract
We test the standard RLVR tool-use recipe -- GRPO on Qwen2.5-7B-Instruct -- on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs over Complex WebQuestions. Under a self-verifiable retrieval reward, the policy's tool-grounded answer rate climbs from $3.8\%$ to $9.6\%$ over 250 steps, then collapses to $0\%$ within a single 50-step window -- a \emph{peak-then-collapse} pattern replicated across four seeds. Across seven reward designs, we find four recurring failure modes: adding denser or more targeted proxy rewards shifts the failure mode rather than eliminating it. We argue that a key difference from Python interpreters, web search, and JSON APIs is interface feedback: their failures often leak natural-language signal the model saw in pretraining. A Python traceback names the failing line; an empty Freebase result \texttt{[]} does not. Stripping away that surface exposes a degradation regime that same-family reward redesigns do not fix. A direct oracle ablation rules out relation selection: injecting gold relations at every retrieval call lifts exact-match accuracy by only $+0.20$~pp, and $95.4\%$ of retrieval-dependent errors are retrieval-composition failures rather than answer-extraction failures. As a mitigation, one-iteration self-distillation reaches $40.0\%$ EM at 7B and is capacity-invariant: doubling capacity to 14B improves EM by only $0.25$~pp, and initialization barely matters -- the ceiling appears interface-bound within the 7B--14B range tested.
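For readers unfamiliar with GRPO, its core step is normalizing rewards within a group of rollouts sampled for the same prompt. The sketch below shows only that standard normalization; the function name and `eps` value are illustrative, and nothing here reproduces the paper's reward designs.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

One property of this formula is worth noting: if every rollout in a group receives the same reward (for example, all zero), every advantage is zero and the group contributes no learning signal. This is a general feature of group normalization and is not necessarily the mechanism the paper identifies behind its peak-then-collapse pattern.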
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 10.0/10 | 10.0 |
| GRPO | 1.0 | 10.0/10 | 10.0 |
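Scoring Rationale: The paper studies the RLVR (reinforcement learning with verifiable rewards) recipe for knowledge-graph tool use and trains with the GRPO algorithm. Its core is reinforcement learning (RL) and GRPO, so both keywords are highly relevant (10). Model-based RL is only slightly related: the paper involves no world model or environment model, but GRPO falls under RL, so it receives 1. The other keywords (Unify Models, World Models, MLLM, CV, MultiModal, OPD) are unrelated to the paper and receive 0.
Keywords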
Knowledge-Graph Tool Use, RLVR, GRPO, Peak-Then-Collapse, Interface Feedback, Freebase, Self-Distillation
Abstract
Large-scale text-to-image foundation models have achieved remarkable visual realism, yet generating human images with correct anatomical structures remains challenging. Existing approaches enforce anatomical constraints through part-specific modules or localized loss weighting during supervised fine-tuning on high-quality human photos, but such datasets are limited and often provide ambiguous optimization signals due to confounding factors such as lighting, pose, and background. Preference-based alignment offers an alternative, but standard Direct Preference Optimization (DPO) treats all pixels equally and therefore fails to exploit the localized nature of anatomical artifacts. To address this, we propose the framework of Alignment via Synthetic Anatomical Preference (ASAP), which constructs controlled preference pairs through a localized degradation mechanism applied to high-fidelity human images. This mechanism performs a controlled experiment on images by introducing explicit anatomical errors in targeted regions while preserving the remaining content. With this mechanism, we create the Human Anatomical Preference (HAP) dataset with over 10K curated pairs for effective anatomical alignment of text-to-image human image generative models. To better leverage the locality of these controlled preference pairs, we introduce a localized and margin-bounded variant of DPO that prioritizes optimization in targeted anatomical regions while enforcing a finite preference margin to prevent over-optimization and preserve global semantics. We further introduce HAF-Bench, a benchmark for systematic evaluation of anatomical fidelity. Extensive experiments demonstrate that ASAP consistently reduces anatomical errors across multiple foundation models while maintaining overall image quality.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 1.0/10 | 1.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 8.0/10 | 8.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 3.0/10 | 3.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
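Scoring Rationale: The paper focuses on anatomical correctness of humans in text-to-image generation. It belongs to computer vision (CV) and multimodal (MultiModal) research and is unrelated to unified models (Unify Models), world models (World Models), model-based reinforcement learning (model-based RL), OPD, and GRPO. MLLM is only weakly related (text-to-image can be viewed as multimodal generation, but not as a typical multimodal large language model). RL has some connection (DPO originates from RLHF) but is not central.
Keywords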
Anatomical Plausibility, Human Image Generation, Preference Alignment, Direct Preference Optimization, Localized Degradation, Text-to-Image Generation, Anatomical Fidelity
Abstract
Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption -- error independence between models -- requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark -- the first structured-extraction ground truth for the Rosenwald Guides -- to support future work on historical document processing.
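The first layer of the framework (auto-accept on agreement, escalate on disagreement) reduces to a simple routing rule. The sketch below is illustrative only: `route_fields`, the dictionary layout, and whitespace stripping as the notion of agreement are assumptions, and the second cross-checking layer is not modeled.

```python
def route_fields(labels_a, labels_b):
    """Split fields into auto-accepted labels and disputed field ids.

    labels_a, labels_b: dicts mapping field id -> label string from two independent models.
    """
    accepted, disputed = {}, []
    for field_id, a in labels_a.items():
        b = labels_b.get(field_id)
        if b is not None and a.strip() == b.strip():
            accepted[field_id] = a.strip()   # models agree: accept without review
        else:
            disputed.append(field_id)        # disagreement or missing label: send to the human jury
    return accepted, disputed
```

The rule is only as safe as the error-independence assumption stated in the abstract: two models that make the same mistake will agree and pass through unreviewed.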
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 7.0/10 | 7.0 |
| CV | 1.0 | 5.0/10 | 5.0 |
| MultiModal | 1.0 | 8.0/10 | 8.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
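Scoring Rationale: The paper presents an annotation framework for historical documents whose core is parallel annotation by multimodal large language models (MLLMs) with human verification, placing it in document processing and multimodal applications. MLLM and MultiModal are directly relevant (the framework uses two MLLMs and processes multimodal document images and text), and CV has some connection (document image understanding). Unify Models, World Models, model-based RL, OPD, RL, and GRPO are unrelated: the paper involves no model unification, world models, reinforcement learning, or related algorithms.
Keywords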
human-in-the-loop, multimodal large language models, historical document annotation, structured-information extraction, cross-model consensus, high-precision annotation, benchmark
Abstract
While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between original available and synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA's outputs. This reference guides the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding AVG AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 3.0/10 | 3.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 2.0/10 | 2.0 |
| CV | 1.0 | 6.0/10 | 6.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
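Scoring Rationale: The paper studies robust learning with missing modalities in multimodal medical diagnosis. Its core is multimodal (image plus tabular data) and computer vision (image processing). It is weakly related to Unify Models (it unifies heterogeneous representations), marginally related to MLLM (multimodal, but not a large language model), and unrelated to World Models, model-based RL, RL, OPD, and GRPO.
Keywords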
missing-modality learning, multimodal medical diagnosis, image-tabular data, context-driven, cascade residual transformer, semantic alignment, class-aware contrastive refinement
Abstract
Deep learning-based object detection has revolutionized Precision Livestock Farming (PLF), yet a critical barrier remains: high-performance Foundation Models (such as SAM 3) are too computationally intensive for edge deployment, while lightweight models (like YOLO) require prohibitive manual annotation efforts. This work proposes a fully automated knowledge distillation pipeline that leverages the Segment Anything Model 3 (SAM 3) to generate zero-shot pseudo-labels for training efficient YOLOv8 detectors. By treating SAM 3 as an offline auto-annotator, we eliminate the manual labeling bottleneck, producing models capable of real-time inference on resource-constrained hardware. We systematically evaluate this approach on the PigLife dataset, comparing SAM 3-supervised models against human-annotated baselines. Results demonstrate that a SAM 3-trained YOLOv8m achieves a mean Average Precision (mAP) of 79.4% without human intervention, while reducing inference latency by approximately 200$\times$ compared to the teacher model. Furthermore, stratified analysis reveals that in low-occlusion scenarios, the automated pipeline achieves detection rates comparable to human benchmarks ($AP_{50} > 99\%$). These findings indicate that foundation models can serve as effective, zero-annotation-cost supervisors, enabling scalable edge computing solutions for smart agriculture.
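One concrete step in such a pipeline is turning each teacher mask into a detection label the student can train on. The sketch below converts a binary instance mask into a YOLO-format label line (class id followed by normalized center x, center y, width, height). The function name and single-class default are assumptions; how the paper filters or post-processes SAM 3 outputs is not described in the abstract.

```python
import numpy as np

def mask_to_yolo_box(mask, class_id=0):
    """Convert a binary instance mask (H, W) to a YOLO label line, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    h, w = mask.shape
    # Pixel-inclusive extents: +1 so a single pixel has width and height 1.
    x_min, x_max = xs.min(), xs.max() + 1
    y_min, y_max = ys.min(), ys.max() + 1
    cx = (x_min + x_max) / 2 / w
    cy = (y_min + y_max) / 2 / h
    bw = (x_max - x_min) / w
    bh = (y_max - y_min) / h
    return f"{class_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}"
```

Writing one such line per mask into a text file per image yields labels in the layout that YOLO-style detectors expect.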
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 10.0/10 | 10.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
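Scoring Rationale: The paper's core is object detection in computer vision: SAM 3 serves as a teacher that generates pseudo-labels for training a lightweight YOLO model, applied to precision pig farming. It is highly relevant to CV (computer vision) and to OPD (assumed here to stand for Object Detection), both scored 10. The other keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL, RL, GRPO) are unrelated and scored 0. The paper does not involve multimodality, reinforcement learning, or unified models.
Keywords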
SAM3, YOLOv8, knowledge distillation, zero-shot pseudo-labels, precision pig farming, object detection, edge deployment
Abstract
Women's safety and security are paramount for a modern society. Crimes against women occur in daylight as well as in low-light conditions. Often, such events are captured through real-world surveillance cameras that operate at lower resolutions. Despite substantial progress in CV-related research, video anomaly detection (VAD) focused on women's safety has not yet been adequately addressed. Existing video anomaly datasets contain well-lit, high-resolution, close-shot videos, and fail to represent women-centric anomalies such as chain snatching, stalking, inappropriate touch, and other subtle forms of crime against women. To address these problems, we propose the ExtrAnom dataset, a new multi-modal benchmark containing 1001 videos with textual descriptions, 500 normal and 501 anomalous, classified into 5 different types of women-centric crimes. The dataset comprises low-light (8%), low-resolution videos (13%), long-shot (15%), along with daylight (64%) anomalous videos. And it covers anomalous events like stalking (3.9%), chain snatching (17.6%), kidnapping (7.3%), assassinations (2.3%), harassment (18.9%), and normal (50%). Each video is supplemented with 4 textual annotations, including one human-generated and three LLM-generated descriptions, enabling cross-modal and VLM-based validations. The aim of creating a women-centric dataset is to accurately detect the women-centric anomaly patterns, which are possible to observe visually. The dataset supplements the VLMs to accurately generate video-level descriptions. ExtrAnom has been benchmarked against popular unimodal and multi-modal VAD datasets (e.g., XD-Violence, UCF-Crime, and UCA) and SOTA methods. Experiments reveal that the existing datasets are insufficient to train models for detecting women-centric anomalies.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 3.0/10 | 3.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 8.0/10 | 8.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
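Scoring Rationale: The paper studies video anomaly detection (VAD) for women's safety. Its core lies in computer vision (CV); it proposes the multimodal dataset ExtrAnom (videos plus textual descriptions) and uses multimodal large models (MLLMs) for validation. CV and MultiModal are therefore highly relevant, and MLLM has some connection (used to generate and validate textual descriptions) but is not the paper's core method. Unify Models, World Models, model-based RL, OPD, RL, and GRPO are unrelated: the paper does not involve unified models, world models, model-based reinforcement learning, optimal policy decomposition, reinforcement learning, or the GRPO algorithm.
Keywords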
Video Anomaly Detection, Women's Safety, Multi-modal Dataset, ExtrAnom, Computer Vision, LLM-generated Descriptions, Benchmark
Abstract
Infrared and visible video fusion is essential for achieving comprehensive perception in dynamic scenes. However, maintaining temporal consistency remains a formidable challenge. Conventional methods relying on optical flow often suffer from geometric rigidity and ghosting artifacts. Moreover, standard diffusion-based fusion models typically operate in a frame-by-frame manner; when extended to autoregressive settings, they lack intrinsic temporal constraints and are prone to severe error accumulation and drifting, where minor artifacts amplify over time. To address these limitations, we propose a drift-resilient video fusion method that reformulates the task as history-conditioned motion generation. We introduce Stabilized History Guidance and Soft Temporal Anchoring to reframe temporal consistency as spectral filtering, implicitly aggregating motion dynamics without rigid alignment. Furthermore, our Decoupled Structure-Motion Adaptation strategy bridges pre-trained priors and structural constraints via two-stage training and latent refinement. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both fusion quality and temporal stability.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 10.0/10 | 10.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
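Scoring rationale: The paper studies infrared-visible video fusion, which belongs to computer vision (CV) and multimodal (MultiModal) research, so both keywords are highly relevant. It does not involve unified models, world models, multimodal large language models, model-based reinforcement learning, OPD, RL, or GRPO, so those keywords score 0.
Keywords
Infrared-Visible Video Fusion, Temporal Consistency, Drift-Resilient, Diffusion Model, Motion Generation, Decoupled Structure-Motion Adaptation, Stabilized History Guidance, Soft Temporal Anchoring
Abstract (Chinese translation)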
视觉-语言模型(如CLIP)通过将图像与文本概念对齐展现出强大的零样本识别能力,但在多目标共存的多元例识别任务中往往表现不佳。其关键瓶颈在于[CLS]标记作为单一的全局视觉表征,难以准确编码具有不同尺度、上下文及共现模式的多样化目标。为解决这一局限,我们提出一种名为PIAA的新型多元例图像识别框架,该框架将预测过程形式化为补丁级推理与自适应聚合。具体而言,我们首先从两个互补角度增强补丁级预测:(i)缓解视觉编码器中的语义纠缠以获得更具判别性的补丁表征,(ii)学习无监督视觉分类器以缩小视觉-语言模态差距。随后引入自适应聚合模块,将补丁级分数整合为最终多元例预测结果。值得注意的是,整个流程完全无需训练,无需梯度更新或参数微调。实验表明,本方法在极低额外计算量下实现显著性能提升,在具有挑战性的NUS-WIDE基准上相比代表性基线方法获得超过6%的mAP增益。代码已开源:https://github.com/akang-wang/PIAA。
Abstract
Vision-Language Models such as CLIP exhibit strong zero-shot recognition capability by aligning images with textual concepts, yet they often underperform on multi-label recognition where multiple objects co-exist. A key bottleneck is that the [CLS] token, as a single global visual representation, is insufficient to faithfully encode diverse targets with varying scales, contexts, and co-occurrence patterns. To address this limitation, we present a new multi-label image recognition framework, termed PIAA, which formulates prediction as Patch-level Inference followed by Adaptive Aggregation. Specifically, we first enhance patch-wise predictions from two complementary perspectives: (i) mitigating semantic entanglement in the visual encoder to obtain more discriminative patch representations, and (ii) learning an unsupervised visual classifier to narrow the vision-language modality gap. We then introduce an adaptive aggregation module that consolidates patch-level scores into the final multi-label prediction. Notably, the entire pipeline is fully training-free, requiring no gradient updates or parameter fine-tuning. Experiments show that our method achieves strong improvements with minimal extra computation, exceeding a 6% mAP gain on the challenging NUS-WIDE benchmark over representative baselines. Code is available at https://github.com/akang-wang/PIAA.
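The two-step idea in the abstract (score each patch against each label, then aggregate patch scores into image-level predictions) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the top-k mean used here is a hypothetical stand-in for the paper's adaptive aggregation module, and the 2-D vectors are toy substitutes for CLIP patch and text embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def patch_scores(patch_feats, text_feats):
    """Score every patch against every label: result is [num_patches][num_labels]."""
    return [[cosine(p, t) for t in text_feats] for p in patch_feats]

def topk_mean_aggregate(scores, k):
    """Hypothetical aggregation: per label, average the k highest patch scores."""
    num_labels = len(scores[0])
    out = []
    for j in range(num_labels):
        top = sorted((row[j] for row in scores), reverse=True)[:k]
        out.append(sum(top) / len(top))
    return out

# Toy data: four patches, two labels (assumed values, not from the paper).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
labels = [[1.0, 0.0], [0.0, 1.0]]
image_scores = topk_mean_aggregate(patch_scores(patches, labels), k=2)
```

Because each label is scored from its own best-matching patches, a small object that occupies only a few patches can still produce a high image-level score, which a single global [CLS] similarity may miss.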
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 2.0/10 | 2.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 7.0/10 | 7.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
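Scoring rationale: The paper studies multi-label image recognition, a core computer vision (CV) topic (relevance 10). It uses the CLIP vision-language model, so it touches on MultiModal, but the focus is visual recognition (relevance 7). CLIP can be regarded as a multimodal model but is not a mainstream multimodal large language model (MLLM), and the paper does not involve unified models, world models, reinforcement learning, or GRPO, so the remaining keywords have very low or zero relevance.
Keywords
multi-label recognition, patch-level inference, adaptive aggregation, CLIP, vision-language models, zero-shot, training-free
Abstract (Chinese translation)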
我们提出了Paris 2.0,这是首个通过去中心化计算预训练的视频生成模型。其训练方案基于Paris 1.0(arXiv:2510.03434),即首个开源权重的去中心化扩散模型(Decentralized Diffusion Model, DDM),该模型证明了无需单一GPU集群即可完成图像生成训练。然而,在去中心化训练下,时间连贯的视频生成仍是一个未解决的问题,而Paris 2.0填补了这一空白。在低分辨率文本到视频训练中,与在相同数据上以匹配的总计算预算训练的单一模型相比,Paris 2.0将弗雷歇视频距离(Frechet Video Distance, FVD)从561.04降至279.01,提升了约2.0倍,并提高了CLIP文本-视频相似度与美学评分。
Abstract
We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Frechet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.
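As a quick check of the "~2.0x" figure, the ratio and the relative reduction can be computed from the two FVD values quoted in the abstract. The variable names are illustrative only.

```python
# FVD values as reported in the abstract (lower is better).
fvd_monolithic = 561.04
fvd_paris2 = 279.01

# Improvement factor: how many times smaller the new FVD is.
ratio = fvd_monolithic / fvd_paris2

# Relative reduction: fraction of the baseline FVD that was removed.
relative_reduction = 1 - fvd_paris2 / fvd_monolithic
```

The ratio comes out just above 2, which corresponds to roughly halving the FVD.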
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 8.0/10 | 8.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
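Scoring rationale: The paper's core is a decentralized diffusion model for video generation, which belongs to computer vision (CV) and involves the multimodal (MultiModal) text-to-video task. It does not involve unified models, world models, multimodal large language models, model-based reinforcement learning, reinforcement learning, GRPO, or OPD (a term the paper does not define), so those keywords score 0.
Keywords
decentralized diffusion model, video generation, text-to-video, Frechet Video Distance, open-weight, Paris 2.0, decentralized training
Abstract (Chinese translation)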
医学检索增强生成(Medical RAG)需要基于证据的声明,因此将声明级别的自然语言推理(NLI)检查器插入到检索增强的强化学习(RL)中是直观的。**我们发现,检查器在训练期间的输出分布(而非其保留准确率)决定了它是否提供可训练的梯度。** 我们比较了四种NLI检查器后端作为过程奖励,这些后端被用于一个经过GRPO训练的医学RAG代理(Qwen2.5-7B,并在Qwen3-4B和Llama-3.1-8B上进行了复现),在四个保留的医学问答(QA)基准上进行了评估。得出了三个诊断性发现。**(i)** 信号崩溃是log-prob特定的:大语言模型(LLM)的log-probability评分将超过97%的声明标记为中性——导致RL梯度坍缩为零——而一个经过校准的MedNLI分类器对相同配对进行非退化评分。**(ii)** 在答案质量上,中等信号优于强信号:一个强大的专有检查器会触发三步奖励黑客级联——超短答案、回避搜索、语言崩溃——因此一个中等信号的本地分类器训练出了更高质量的模型(在零样本基础上BERTScore提升12%,且不依赖GPT)。**(iii)** 信号强度是策略依赖的:同一个检查器在一个策略上表现为中等信号,但在另一个策略上表现为强信号,且不会触发级联的最终状态。我们将这些视为验证器作为奖励系统的边界条件。
Abstract
Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. **We find that the checker's *output distribution* during training, not its held-out accuracy, decides whether it provides trainable gradient.** We compare four NLI checker back-ends as process rewards inside a GRPO-trained medical RAG agent (Qwen2.5-7B, replicated on Qwen3-4B and Llama-3.1-8B) across four held-out medical QA benchmarks. Three diagnostic findings emerge. **(i)** Signal collapse is log-prob-specific: LLM log-probability scoring labels over 97% of claims neutral, collapsing the RL gradient to zero, while a calibrated MedNLI classifier scores the same pairs non-degenerately. **(ii)** Moderate signal beats strong signal on answer quality: a strong proprietary checker triggers a three-step reward-hacking cascade (ultra-short answers, search avoidance, language collapse), so a moderate-signal local classifier trains a higher-quality model (**+12% BERTScore over zero-shot, no GPT dependency**). **(iii)** Signal strength is policy-dependent: the same checker registers as moderate on one policy but strong on another without triggering the cascade end-state. We frame these as boundary conditions for verifier-as-reward systems.
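Why a near-constant checker output removes the training signal can be seen from the group-relative advantage that GRPO-style training uses: each sampled response's reward is normalized by the mean and standard deviation of its group. The sketch below assumes a neutral verdict maps to a reward of 0 and uses toy reward values; the paper's actual reward design is not given in the abstract.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by its group's mean and (population) std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Assumed mapping: every sampled response gets the same "neutral" reward.
collapsed = group_relative_advantages([0.0] * 8)

# A checker whose outputs vary within the group yields non-zero advantages.
spread = group_relative_advantages([1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0])
```

When every response in a group receives the same reward, all advantages are zero and the policy gradient for that group vanishes, regardless of how accurate the checker is on held-out data.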
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 8.0/10 | 8.0 |
| GRPO | 1.0 | 9.0/10 | 9.0 |
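Scoring rationale: The paper focuses on RAG systems for medical question answering, using NLI checkers as process rewards in GRPO training. Its core topic is diagnosing signal collapse and reward hacking, which is unrelated to unified models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, or OPD. RL scores 8 as the general setting (involved, but not the core contribution) and GRPO scores 9 as the specific algorithm used directly.
Keywords
Medical RAG, NLI checker, GRPO, signal collapse, reward hacking, process reward, biomedical QA, trainable gradient
Abstract (Chinese translation)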
多领域任务增量学习要求模型在视觉多样化的领域间顺序获取知识,既不遗忘先前任务,又在推理时无法获知任务身份。基于冻结视觉语言模型的参数高效方法已取得显著进展,但现有方法完全依赖视觉特征进行任务路由(task routing)、置信度估计(confidence estimation)和编码器适应(encoder adaptation),完全未利用CLIP的跨模态文本嵌入空间(cross-modal text embedding space)。我们通过三项贡献填补这一空白。文本空间任务路由(Text-space task routing)将视觉高斯匹配(visual Gaussian matching)替换为与冻结的CLIP文本原型(text prototypes)的余弦相似度(cosine similarity),在零参数成本下实现了对数据稀缺具有鲁棒性的顺序无关路由。多原型视觉-文本置信度(Multi-prototype visual-textual confidence)采用K-means视觉原型(visual prototypes)和任务校准阈值下的跨模态对齐分数(cross-modal alignment scores),替代了单高斯类建模(single-Gaussian class modeling)。对称跨模态门控(Symmetric cross-modal gating)将逐层Gumbel门控(Gumbel gates)扩展到以批次图像特征为条件的文本编码器,从而在分布外输入上保持跨模态对齐。在涵盖11个数据集和1201个类别的MTIL基准上,我们的方法在Order-I下实现了74.2%的迁移(Transfer)、80.5%的平均(Average)和88.7%的最后(Last)准确率,仅用2.5M可训练参数且无外部数据,分别超越先前最优方法5.0、3.7和3.0个百分点。
Abstract
Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, yet all existing approaches rely exclusively on visual features for task routing, confidence estimation, and encoder adaptation, leaving CLIP's cross-modal text embedding space entirely unexploited. We address this gap through three contributions. Text-space task routing replaces visual Gaussian matching with cosine similarity to frozen CLIP text prototypes, giving order-independent routing robust to data scarcity at zero parameter cost. Multi-prototype visual-textual confidence replaces single-Gaussian class modeling with K-means visual prototypes and cross-modal alignment scores under task-calibrated thresholds. Symmetric cross-modal gating extends per-layer Gumbel gates to the text encoder conditioned on batch image features, preserving cross-modal alignment on out-of-distribution inputs. On the MTIL benchmark spanning 11 datasets and 1201 classes, our method achieves 74.2% Transfer, 80.5% Average, and 88.7% Last under Order-I, surpassing the prior state of the art by 5.0, 3.7, and 3.0 percentage points with only 2.5M trainable parameters and no external data.
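Text-space task routing, as described in the abstract, selects a task by cosine similarity between the image feature and frozen text prototypes. A minimal sketch under stated assumptions: the 3-D vectors stand in for CLIP embeddings, and taking the maximum over each task's class prototypes is a hypothetical scoring rule, not necessarily the authors'.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def route_task(image_feat, task_prototypes):
    """Pick the task whose closest text prototype has the highest cosine similarity."""
    best_task, best_sim = None, -2.0
    for task, prototypes in task_prototypes.items():
        sim = max(cosine(image_feat, p) for p in prototypes)
        if sim > best_sim:
            best_task, best_sim = task, sim
    return best_task, best_sim

# Toy stand-ins for frozen text embeddings of each task's class names.
task_prototypes = {
    "aircraft": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2]],
    "flowers": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
task, sim = route_task([0.1, 0.8, 0.1], task_prototypes)
```

The prototypes come from the frozen text encoder rather than from statistics fitted on each task's images, so the routing adds no trainable parameters and does not depend on the order in which tasks were learned.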
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 9.0/10 | 9.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
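Scoring rationale: The paper studies multi-domain task-incremental learning with cross-modal adaptive prompting. Its core is cross-modal alignment and incremental learning on a vision-language model (CLIP), which belongs to computer vision (CV) and multimodal (MultiModal) research. It does not involve unified models (Unify Models), world models (World Models), multimodal large language models (MLLM), model-based reinforcement learning (model-based RL), on-policy distillation (OPD), reinforcement learning (RL), or GRPO, so those keywords score 0. CV scores 8 (visual features and task routing) and MultiModal scores 9 (the core contribution exploits the text space and cross-modal gating).
Keywords
cross-modal adaptive prompting, multi-domain task-incremental learning, vision-language models, CLIP, text-space task routing, multi-prototype visual-textual confidence, symmetric cross-modal gating
Abstract (Chinese translation)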
稀疏视角(sparse-view)三维重建越来越多地采用前馈溅射网络(feed-forward splatting networks),这类网络直接从图像预测显式原语。然而,现有方法大多仍以高斯原语(Gaussian primitives)为中心,且仅间接暴露表面:为下游仿真、物理推理或具身交互提取可用网格(mesh),仍需昂贵的后处理步骤,这违背了前馈(feed-forward)的初衷。这一限制在无位姿(pose-free)场景中尤为突出,此时场景结构与相机参数必须从稀疏观测中联合估计。我们提出TriSplat,一种前馈重建网络,它使用定向三角形原语(oriented triangle primitives)表示场景,并直接从单次前向传播(forward pass)输出可仿真就绪的网格场景。给定输入图像,网络预测局部三维点图(point maps)、三角形属性、相机位姿(camera poses)及可选内参(intrinsics)。我们的方法并非将三角形方向回归为无约束隐变量,而是从预测的点图构造几何法线(geometry normals),通过图像条件法线头(image-conditioned normal head)对其进行细化,并将其转换为稳定的局部框架(local frames)用于三角形参数化。单法线引导策略(mono-normal bootstrap schedule)进一步稳定早期训练,而不透明度和模糊调度(opacity and blur scheduling)逐步锐化学习到的表面表示,以实现直接网格提取。在RealEstate10K和DL3DV上的实验表明,与基于高斯的前馈基线相比,该表示能产生几何更忠实的三维重建,同时保持具有竞争力的新视角渲染(novel-view rendering)质量。由于渲染原语本身就是表面三角形,输出可直接被物理引擎(physics engines)、碰撞检测器(collision detectors)和标准渲染管线(rendering pipelines)使用,无需任何转换,从而成为前馈三维场景重建中一种实用的仿真就绪解决方案。
Abstract
Sparse-view 3D reconstruction is increasingly addressed with feed-forward splatting networks that predict explicit primitives directly from images. Yet most existing methods remain centered on Gaussian primitives and expose surfaces only indirectly: extracting a usable mesh for downstream simulation, physics reasoning, or embodied interaction still requires expensive post-hoc steps that break the feed-forward promise. This limitation is especially pronounced in pose-free settings, where scene structure and camera parameters must be estimated jointly from sparse observations. We present TriSplat, a feed-forward reconstruction network that represents scenes with oriented triangle primitives and directly exports simulation-ready mesh scenes from a single forward pass. Given input images, the network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics. Rather than regressing triangle orientation as an unconstrained latent variable, our approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization. A mono-normal bootstrap schedule further stabilizes early training, while opacity and blur scheduling progressively sharpens the learned surface representation for direct mesh extraction. Experiments on RealEstate10K and DL3DV show that this representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. Because the rendering primitives are themselves surface triangles, the output can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction.
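The step "constructs geometry normals from the predicted point maps" can be illustrated with a common finite-difference construction: the normal at a pixel is the normalized cross product of the differences to its right and lower neighbors. This is an assumed construction for illustration; the abstract does not specify the exact operator, and the sign of the normal depends on the image coordinate convention.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v) if n > 0 else (0.0, 0.0, 0.0)

def normals_from_point_map(point_map):
    """Per-pixel normal from horizontal and vertical neighbor differences."""
    h, w = len(point_map), len(point_map[0])
    normals = []
    for y in range(h - 1):
        row = []
        for x in range(w - 1):
            du = sub(point_map[y][x + 1], point_map[y][x])  # step along the row
            dv = sub(point_map[y + 1][x], point_map[y][x])  # step along the column
            row.append(unit(cross(du, dv)))
        normals.append(row)
    return normals

# Toy 3x3 point map lying on the plane z = 2.
flat = [[(float(x), float(y), 2.0) for x in range(3)] for y in range(3)]
normals = normals_from_point_map(flat)
```

For the flat toy surface every normal points along the z axis; on a real predicted point map such normals would be noisy, which is presumably why the paper refines them with an image-conditioned normal head.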
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 1.0/10 | 1.0 |
| World Models | 1.0 | 1.0/10 | 1.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 5.0/10 | 5.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
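Scoring rationale: The paper focuses on feed-forward 3D scene reconstruction, using triangle primitives to output simulation-ready meshes directly. This is core computer vision (CV), with some relevance to MultiModal (multi-view images in, 3D mesh out). The other keywords (Unify Models, World Models, MLLM, model-based RL, OPD, RL, GRPO) are almost or entirely unrelated.
Keywords
feed-forward 3D reconstruction, oriented triangle primitives, simulation-ready mesh, sparse-view, direct mesh extraction, novel-view rendering, TriSplat
Abstract (Chinese translation)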
从多视角视频重建和预测动态三维场景是机器人学、增强现实/虚拟现实(AR/VR)以及数字孪生的一项基础任务。近期基于物理信息的Gaussian Splatting方法实现了令人满意的未来帧外推,但缺乏语义感知能力且计算开销较大。我们提出R5DGS框架,该框架通过紧凑的身份编码(Identity Encoding)向量增强物理驱动的四维高斯(4D Gaussian)表示,从而实现精确的高斯-对象关联。通过构建离线CLIP(Contrastive Language-Image Pre-training)对象查找表,我们支持开放词汇文本提示,以检索并渲染任意时间戳和视角下的特定对象高斯。此外,我们提出一种刚体推理约束,仅对对象质心预测并整合物理动力学,通过相对变换将运动传播至关联高斯。该优化在外推过程中实现了11 FPS的加速,同时不降低轨迹合理性。
Abstract
Reconstructing and predicting dynamic 3D scenes from multi-view videos is a foundational task for robotics, AR/VR, and digital twins. Recent physics-informed Gaussian Splatting methods achieve impressive future frame extrapolation but lack semantic awareness and suffer from large computational overhead. We introduce **R5DGS**, a framework that augments a physics-driven 4D Gaussian representation with compact Identity Encoding vectors, enabling precise Gaussian-to-object association. By constructing an offline CLIP-based object lookup table, we support open-vocabulary text prompting to retrieve and render object-specific Gaussians across arbitrary timestamps and viewpoints. Furthermore, we propose a rigid-body inference constraint that predicts and integrates physical dynamics exclusively for object centroids, propagating motion to associated Gaussians via relative transformations. This optimization yields an 11 FPS speedup during extrapolation without compromising trajectory plausibility.
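The rigid-body constraint can be pictured as follows: dynamics are integrated only for the object centroid, and each associated Gaussian keeps a fixed offset in the object's local frame. A minimal sketch, assuming rotation about the z axis only and hypothetical function names; the paper's actual parameterization is not given in the abstract.

```python
import math

def rotate_z(p, theta):
    """Rotate a 3-D point about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

def propagate(centroid, theta, local_offsets):
    """World position of each Gaussian: centroid plus its rotated local offset."""
    return [tuple(c + r for c, r in zip(centroid, rotate_z(o, theta)))
            for o in local_offsets]

# Two Gaussians attached to one object (assumed toy offsets).
offsets = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
moved = propagate(centroid=(0.0, 0.0, 5.0), theta=math.pi / 2, local_offsets=offsets)
```

Only the centroid state has to be predicted per time step; the per-Gaussian work reduces to one transform each, which is consistent with the speedup the abstract reports for extrapolation.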
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 2.0/10 | 2.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 6.0/10 | 6.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
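Scoring rationale: The paper studies dynamic 3D scene reconstruction and prediction, which belongs to computer vision (CV) and involves multi-view video processing (scored under MultiModal, although the input is visual only and not strictly multimodal). It is unrelated to Unify Models, MLLM, OPD, RL, and GRPO. World Models and model-based RL are very weakly related: the paper predicts future frames under physical constraints, which resembles a world-model idea, but it involves neither reinforcement learning nor a unified-model framework.
Keywords
4D Gaussian Splatting, Dynamic Scene Reconstruction, Semantic Awareness, CLIP, Rigid Body Constraints, Future Frame Extrapolation, Multi-view Video
Abstract (Chinese translation)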
视觉-语言模型(如CLIP)展现出令人印象深刻的泛化能力,但其在跨域小样本学习(Cross-Domain Few-Shot Learning, CDFSL)中的潜力尚未得到充分探索——该任务要求模型将源域信息迁移至训练数据稀缺的目标域。尽管已有研究在视觉-语言模型中观察到注意力下沉现象(attention sink phenomenon)在某些任务中的存在,但其在CDFSL场景中的作用尚未被探讨。本文揭示了一个被先前工作忽视的关键问题:CDFSL中标准的目标域小样本微调会显著加剧注意力下沉问题,导致类别间判别性下降。为理解这一现象,我们通过大量实验将其解释为模型为适应领域而采取的捷径学习(shortcut learning):为克服源域与目标域之间的巨大领域鸿沟,模型表现出强烈倾向——将原本更接近目标域类别的令牌(即简单令牌)进一步拉近至这些类别,从而加剧注意力下沉,并浪费了学习其他具有判别性但初始距离较远的令牌(即困难令牌)的能力。为解决此问题,我们提出一种新方法,在目标域微调过程中根据令牌与目标域类别的相关性动态重新加权,显式抑制模型对简单令牌的依赖,增强对困难令牌的学习,从而减少下沉令牌并提升判别性。在四个基准数据集上的大量实验验证了我们方法的合理性,并展示了新的最优性能。我们的代码已开源至 https://github.com/shuaiyi308/TIR。
Abstract
Vision-language models (VLMs) like CLIP have shown impressive generalization capabilities, yet their potential for Cross-Domain Few-Shot Learning (CDFSL) remains underexplored, where the model needs to transfer source-domain information to target domains with scarce training data. While the attention sink phenomenon has been observed in VLMs for certain tasks, its role in CDFSL scenarios has not been studied. In this paper, we uncover a critical issue overlooked by prior works: standard target-domain few-shot fine-tuning in CDFSL significantly exacerbates the attention sink problem, leading to poor discriminability across classes. To understand this phenomenon, through extensive experiments, we interpret it as the model's shortcut learning for domain adaptation: to overcome the huge domain gap between the source and target domains, the model shows a high tendency to push tokens that are initially closer to target-domain classes (i.e., simple tokens) to be even closer to these classes, exacerbating the attention sink and wasting the capability of learning other discriminative but initially farther tokens (i.e., hard tokens). To address this, we propose a novel approach that dynamically re-weights tokens according to their relevance to target-domain classes during target-domain fine-tuning, which explicitly suppresses the model's reliance on these simple tokens and enhances the learning of hard tokens, reducing sink tokens and enhancing discriminability. Extensive experiments on four benchmark datasets validate the rationale of our method, demonstrating new state-of-the-art performance. Our code is available at https://github.com/shuaiyi308/TIR.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 2.0/10 | 2.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 4.0/10 | 4.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
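Scoring rationale: The paper studies the attention sink problem in cross-domain few-shot learning (CDFSL), which belongs to computer vision (CV). It uses vision-language models (VLMs) such as CLIP, which are not typical multimodal large language models (MLLMs) or unified models; it does not involve world models, reinforcement learning (RL), model-based RL, OPD, or GRPO. MultiModal relevance is low because the task is essentially image classification, with text playing only an auxiliary role.
Keywords
Cross-Domain Few-Shot Learning, Attention Sink, Vision-Language Models, CLIP, Token Re-weighting, Domain Adaptation, Source-Free, Discriminability
Abstract (Chinese translation)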
文本到图像合成(Text-to-image synthesis)得益于扩散模型(diffusion models)强大的生成能力,已取得显著进展。然而,这些模型在去噪过程中难以在交叉注意力图(cross-attention maps)中实现精确的文本-图像对齐。现有工作主要关注不同主体之间的跨主体令牌激活(inter-subject-token activations,即交叉注意力分数)的重叠问题,而忽略了同一主体内部的主体内令牌激活(intra-subject-token activations)分散问题。本文提出一种面向扩散模型的聚合与隔离交叉注意力方法(Aggregating-and-Isolating cross-attention approach),用于文本到图像合成,命名为AI-T2I。在技术上,为解决分散问题,我们设计了一个聚合损失(aggregation loss),用于识别并整合分散的令牌内激活,这隐式地有助于缓解潜在的重叠问题。在此基础上,进一步引入隔离损失(isolation loss),将令牌间激活(inter-token activations)推离,从而实现精确的文本-图像对齐。在多个基准上的大量实验表明,AI-T2I在文本到图像合成任务上优于当前最先进的工作。此外,我们的AI-T2I在其他任务(如可控布局生成和个性化生成)中也展现出优异的泛化能力。
Abstract
Text-to-image synthesis has made significant progress, benefiting from the strong generative capabilities of diffusion models. However, these models struggle to achieve precise text-to-image alignment within cross-attention maps during the denoising process. Existing works primarily focus on the overlap of inter-subject-token activations (i.e., cross-attention scores) across different subjects, overlooking the scattering of intra-subject-token activations within the same subject. In this paper, we propose an Aggregating-and-Isolating cross-attention approach to diffusion models for Text-to-Image synthesis, dubbed AI-T2I. Technically, to address the scattering issue, we devise an aggregation loss to identify and consolidate the scattered intra-token activations, which implicitly helps mitigate the potential overlap issue. Building on that, an isolation loss is further introduced to push the inter-token activations apart, thus achieving precise text-to-image alignment. Extensive experiments on various benchmarks demonstrate the superiority of AI-T2I over state-of-the-art methods for text-to-image synthesis. Furthermore, AI-T2I generalizes well to other tasks, e.g., controllable layout generation and personalized generation.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 8.0/10 | 8.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
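Scoring rationale: The paper focuses on optimizing cross-attention in text-to-image synthesis, which belongs to computer vision (CV) and multimodal (MultiModal) research and is closely tied to text-image alignment in diffusion models. It does not involve unified models (Unify Models), world models (World Models), multimodal large language models (MLLM), model-based reinforcement learning (model-based RL), optimal policy distillation (OPD), reinforcement learning (RL), or GRPO. Only CV and MultiModal therefore receive high scores; the remaining keywords score 0.
Keywords
Text-to-Image Synthesis, Diffusion Models, Cross-Attention, Attention Alignment, Aggregation Loss, Isolation Loss, Controllable Layout Generation, Personalized Generation
Abstract (Chinese translation)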
显式思维链(Chain-of-Thought, CoT)推理显著提升了大型语言模型(Large Language Models, LLMs)的推理能力,但由于生成长的自回归轨迹而带来了高昂的推理成本。现有的隐式推理方法提供了一种有前景的替代方案,但它们通常将推理过程视为可统一压缩,导致对精度至关重要的中间步骤被过度压缩,从而降低了推理准确性。在本工作中,我们提出选择性隐式思考(Selective Latent Thinking, SLT)框架,该框架在相同推理轨迹中,将冗余的推理片段选择性压缩为隐式表征,同时保留对精度至关重要的片段作为显式CoT。具体而言,SLT首先使用轻量级解码器预测即将出现的短推理片段,然后基于置信度的门控机制确定可被可靠压缩的最长片段。被接受的片段被编码为紧凑的隐式表征以提高推理效率,而不确定或对精度至关重要的推理则保留为显式CoT形式以保持准确性。为学习这种选择性压缩策略,SLT采用三阶段训练策略,结合了片段级隐式压缩、可靠性感知的未来推理预测以及轨迹级强化学习,以优化答案正确性与推理成本之间的权衡。在四个数学推理基准上的大量实验表明,在可比压缩比下,SLT的准确率比隐式推理基线方法高出22.7%,同时与显式CoT相比,推理链长度减少了58.4%,而准确率仅下降2.8%。我们的代码可在https://github.com/hunshi34/SLT获取。
Abstract
Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly compressible, causing precision-critical intermediate steps to be overly compressed and thereby degrading reasoning accuracy. In this work, we propose Selective Latent Thinking (SLT), a framework that selectively compresses redundant reasoning spans into latent representations while preserving precision-critical spans as explicit CoT within the same reasoning trajectory. Specifically, SLT first uses a lightweight decoder to anticipate a short upcoming reasoning span, and then applies confidence-based gating to determine the longest span that can be reliably compressed. The accepted span is encoded into a compact latent representation to improve reasoning efficiency, while uncertain or precision-critical reasoning remains in explicit CoT form to preserve accuracy. To learn this selective compression policy, SLT adopts a three-stage training strategy that combines span-level latent compression, reliability-aware future reasoning prediction, and trajectory-level reinforcement learning to optimize the trade-off between answer correctness and reasoning cost. Extensive experiments across four mathematical reasoning benchmarks demonstrate that SLT achieves 22.7% higher accuracy than latent reasoning baselines at comparable compression ratios, while reducing reasoning chain length by 58.4% with only 2.8% accuracy degradation compared to explicit CoT. Our code is available at https://github.com/hunshi34/SLT.
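The confidence-based gating step ("determine the longest span that can be reliably compressed") might look like the prefix rule below. This is an assumption for illustration: the abstract does not state the gating rule, and both the threshold value and the function name are hypothetical.

```python
def longest_reliable_span(confidences, threshold=0.9):
    """Length of the longest prefix whose per-token confidence stays above the threshold."""
    length = 0
    for c in confidences:
        if c < threshold:
            break  # the first uncertain token ends the compressible span
        length += 1
    return length

# Confidences of the lightweight decoder for the anticipated tokens (toy values).
accepted = longest_reliable_span([0.99, 0.95, 0.60, 0.97])
```

Under this rule the first two anticipated tokens would be folded into a latent representation, and decoding would fall back to explicit CoT from the third token onward.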
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 1.0/10 | 1.0 |
| MLLM | 1.0 | 1.0/10 | 1.0 |
| CV | 1.0 | 1.0/10 | 1.0 |
| MultiModal | 1.0 | 1.0/10 | 1.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 1.0/10 | 1.0 |
| RL | 1.0 | 8.0/10 | 8.0 |
| GRPO | 1.0 | 1.0/10 | 1.0 |
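Scoring rationale: The paper studies compression and efficiency of LLM reasoning chains: it selectively compresses redundant reasoning steps into latent representations while keeping critical steps as explicit CoT. Reinforcement learning (RL) is used to train the compression policy, so RL is highly relevant (8). The paper does not involve multimodality (MultiModal), vision (CV), multimodal large models (MLLM), world models (World Models), unified models (Unify Models), model-based reinforcement learning (model-based RL), OPD (possibly a different specific concept), or GRPO (a reinforcement learning algorithm); these keywords are unrelated to the paper and score at most 1 (Unify Models scores 0).
Keywords
Selective Latent Thinking, Chain-of-Thought, Latent Reasoning, Reinforcement Learning, Reasoning Compression, LLM Efficiency
Abstract (Chinese translation)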
预测行人与车辆之间的交互对于非结构化与半结构化场景中的自动驾驶安全至关重要,然而,该任务因缺乏以密集行人-车辆交互为特征的公开数据集而受到严重阻碍。当前大多数研究依赖于结构化道路数据,导致非结构化环境中存在的复杂、异质性交互未能得到充分表征与研究。本文提出一种基于未标定监控摄像头视频数据的数据集标注框架,并构建了PINNS(非结构化场景下未标定摄像头的行人-车辆交互数据集,Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes)。该数据集涵盖多个国家和地区,包含多样化的典型交通场景,并考虑了季节、光照条件及天气的变化。其聚焦于密集行人-车辆交互的复杂场景,且设计易于扩展。数据集依据中国自动化学会发布的标准进行构建与标注,同时提供轨迹数据及对应的场景级信息。此外,本文分析了异质性智能体轨迹预测的当前挑战与研究方向,展示了所提出数据集的必要性与实用性。我们希望该框架与数据集能够促进复杂混合交通场景中轨迹预测与自动驾驶的研究。PINNS公开获取地址为:https://github.com/Songan-Lab。
Abstract
Predicting interactions between pedestrians and vehicles is essential for autonomous driving safety in unstructured and semi-structured scenarios; however, this task is severely hindered by the scarcity of public datasets that feature dense pedestrian-vehicle interactions. Most current studies rely on structured road data, leaving the complex, heterogeneous interactions found in unstructured environments insufficiently represented and researched. In this paper, we propose a dataset annotation framework based on video data from uncalibrated surveillance cameras and present PINNS (Pedestrian-vehicle Interaction dataset from uNcalibrated cameras in uNstructured Scenes). The dataset covers multiple countries and regions, includes diverse typical traffic scenarios, and considers variations in seasons, lighting conditions, and weather. It focuses on complex scenes with dense pedestrian-vehicle interactions and is designed to be easily extensible. The dataset is constructed and annotated according to the standard issued by the Chinese Association of Automation, providing both trajectory data and corresponding scene-level information. Furthermore, this paper analyzes current challenges and research directions in heterogeneous agent trajectory prediction and shows the necessity and usefulness of the proposed dataset. We hope our framework and dataset will facilitate research on trajectory prediction and autonomous driving in complex mixed traffic scenarios. PINNS is publicly available at https://github.com/Songan-Lab.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 7.0/10 | 7.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
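Scoring rationale: The paper presents a pedestrian-vehicle interaction dataset and annotation framework for trajectory prediction and autonomous driving safety in unstructured scenes. Its core is computer vision (CV) and MultiModal (pedestrian, vehicle, and scene information). It is unrelated to unified models (Unify Models), world models (World Models), multimodal large language models (MLLM), model-based reinforcement learning (model-based RL), optimal policy decomposition (OPD), reinforcement learning (RL), and GRPO (a reinforcement learning algorithm). Only CV and MultiModal therefore receive high scores; the rest score 0.
Keywords
Pedestrian-Vehicle Interaction, Unstructured Scenes, Uncalibrated Cameras, Trajectory Prediction, Autonomous Driving, Dataset Annotation, PINNS
Abstract (Chinese translation)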
当前视频到4D的方法在处理复杂拓扑变化、透明材质、薄层结构及内表面时存在困难。我们提出Helix4D,一种动态网格生成框架,通过继承Trellis2的表达性表示,将其从图像到3D的生成扩展至视频条件化的4D生成。我们的设计源于两个关键问题:(a) 如何使Trellis2的帧内局部注意力能够在跨帧间共享信息,同时保留其在透明物体和内表面等罕见案例上的预训练质量;(b) 如何在不破坏预训练能力的前提下,将时间信息注入纯3D位置编码。针对问题(a),我们采用滑动窗口跨帧注意力机制,并以第一帧为锚点。第一帧由基础Trellis2模型生成并注入我们的模型,使其通过跨帧注意力继承Trellis2在罕见案例上的质量。针对问题(b),我们提出一种4D时间编码,将冗余的低频空间旋转位置编码(RoPE)频带重新用于时间维度,从而在无需额外参数的情况下将编码从3D扩展至4D。大量实验表明,Helix4D在ActionBench及我们自建的复杂动态数据集上能够高效生成高质量动态网格。
Abstract
Current video-to-4D methods struggle with complex topology changes, transparent materials, thin structures, and inner surfaces. We present Helix4D, a dynamic mesh generation framework that inherits the expressive representation of Trellis2 and adapts it from image-to-3D to video-conditioned 4D generation. Our design arises from two key questions: (a) how to enable Trellis2's frame-local attention to share information across frames while preserving its pretrained quality on rare cases such as transparent objects and inner surfaces, and (b) how to inject temporal information into a purely 3D positional encoding without breaking pretrained capabilities. We address (a) with a sliding-window cross-frame attention anchored on the first frame. The first frame is generated by the base Trellis2 model and injected into our model, letting it inherit Trellis2's quality in rare cases through cross-frame attention. We address (b) with a 4D temporal encoding that repurposes redundant low-frequency spatial RoPE bands for time, extending the encoding from 3D to 4D with no additional parameters. Extensive experiments show the effectiveness of Helix4D for high-quality dynamic mesh generation on ActionBench and our own challenging complex dynamics set.
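The attention pattern in (a), a sliding window over neighboring frames plus a permanent link to the first frame, can be sketched at the frame level. The symmetric window and the function name are assumptions; the abstract does not say whether the window is causal or how wide it is.

```python
def frames_attended(t, num_frames, window):
    """Frame indices that frame t attends to: a local window plus the anchor frame 0."""
    lo = max(0, t - window)
    hi = min(num_frames - 1, t + window)
    return sorted({0, *range(lo, hi + 1)})

# Frame 5 of a 10-frame clip with a window of one frame on each side.
visible = frames_attended(5, num_frames=10, window=1)
```

Every frame keeps a direct path to the Trellis2-generated first frame, which matches the stated goal of inheriting its quality on rare cases, while the cost per frame stays bounded by the window size.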
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 9.0/10 | 9.0 |
| MultiModal | 1.0 | 5.0/10 | 5.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
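Scoring rationale: The paper studies 4D dynamic mesh generation, which belongs to computer vision (CV) and concerns video input with 3D/4D geometric output, so CV scores 9. The input is video (visual modality) and the output is a 4D mesh (geometric modality), which can be viewed as a multimodal generation task, but no typical modalities such as text or audio are involved, so MultiModal scores 5. The remaining keywords (Unify Models, World Models, MLLM, model-based RL, OPD, RL, GRPO) are unrelated and score 0; the paper mentions no reinforcement learning, world model, or unified model concepts.
Keywords
Helix4D, 4D mesh generation, dynamic mesh, video-to-4D, Trellis2, cross-frame attention, temporal encoding, sliding-window attention
Abstract (Chinese translation)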
我们首次提出了一种数据驱动的方法,用于从大规模野外面部视频中建模时间维度的注视-头部协调(temporal gaze-head coordination)。为了获取可泛化学习的训练数据,我们设计了一条自动化流程,利用现成的基于外观的注视估计器(appearance-based gaze estimators)提取自然且多样化的注视与头部运动。为了捕捉注视-头部协调的概率相关性及时序动态,我们基于生成式条件变分自编码器(conditional Variational Autoencoder)构建模型,以生成合理且多样化的、以注视为条件的头部运动。我们进一步将该框架应用于注视控制的面部视频生成,实现了与输入注视相关的自然且逼真的头部运动视频生成——这一方面此前未被强调。人工评估与定量比较验证了我们方法的有效性及设计选择的合理性,评估者对我们的方法表现出相较于基线方法的统计显著偏好。
Abstract
We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder for plausible yet diverse gaze-conditioned head motion generations. We further apply our framework to gaze-controlled facial video generation, where we enable video generation with natural and realistic head motion correlated to the input gaze - an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing statistically significant preference for our approach over baseline methods.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 6.0/10 | 6.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
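Scoring rationale: The paper studies data-driven head motion generation through a model of natural gaze-head coordination. It belongs to computer vision (CV) and involves multimodal (visual plus motion) generation. It is unrelated to unified models (Unify Models), world models (World Models), multimodal large language models (MLLM), model-based reinforcement learning (model-based RL), OPD, reinforcement learning (RL), and GRPO: those keywords point to large models, reinforcement learning, or world models, whereas this paper is a purely generative vision task.
Keywords
head motion generation, gaze-head coordination, conditional VAE, facial video generation, data-driven, in-the-wild
Abstract (Chinese translation)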
高分辨率视频生成面临优化不稳定与计算成本高昂的双重瓶颈。令牌序列的急剧膨胀不仅会使优化偏向局部纹理而牺牲全局连贯性,导致结构崩塌,还会带来高昂的训练成本与严重的推理延迟。为解决这一问题,我们提出PixelWizard框架,该框架将全局结构建模与细粒度细节合成进行层次化解耦。PixelWizard首先建立紧凑的时空锚点(spatiotemporal anchor)以凝聚密集的结构先验,进而引导高分辨率下的细粒度生成。这缓解了局部优化偏差,在保证结构稳定性的同时不损失高频细节。借助这种结构稳定性,我们引入噪声跨度对齐捷径训练(Noise-Span Aligned Shortcut Training)以突破推理瓶颈。该机制通过显式建模步长,使模型能够以大步长遍历生成轨迹。关键的是,我们结合指数偏置采样(Exponential Index-Biased Sampling)与自适应噪声跨度校准(Adaptive Noise-Span Calibration),使优化与高分辨率网格的偏移噪声调度对齐,从而在不引入蒸馏带来的沉重开销的前提下,实现稳健的少步推理。大量实验表明,PixelWizard在实现卓越视觉质量的同时,将原生2K/4K视频的生成采样速度提升超过10倍。
Abstract
High-resolution video generation faces a coupled bottleneck of optimization instability and prohibitive computational costs. The massive expansion of the token sequence not only biases optimization toward local textures at the expense of global coherence, leading to structural collapse, but also imposes prohibitive training costs and severe inference latency. To address this, we propose PixelWizard, a framework that hierarchically decouples global structure modeling from fine-grained detail synthesis. PixelWizard first establishes a compact spatiotemporal anchor to concentrate dense structural priors, which then guides fine-grained generation at high resolution. This mitigates the local optimization bias to ensure structural stability without compromising high-frequency details. Leveraging this structural stability, we introduce Noise-Span Aligned Shortcut Training to break the inference bottleneck. By explicitly modeling the step size, this mechanism allows the model to traverse the generation trajectory with large steps. Crucially, we incorporate Exponential Index-Biased Sampling and Adaptive Noise-Span Calibration to align optimization with the shifted noise schedules of high-resolution grids, ensuring robust few-step inference without incurring the heavy overhead of distillation. Extensive experiments demonstrate that PixelWizard achieves superior visual quality while accelerating the generative sampling of native 2K/4K videos by over 10x.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 6.0/10 | 6.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
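Scoring rationale: PixelWizard focuses on ultra-high-resolution video generation, which falls under computer vision (CV) and has some connection to MultiModal (video generation involves the visual modality, but the paper does not jointly model other modalities such as text or audio). Its core contribution addresses optimization instability and computational cost in high-resolution video generation by hierarchically decoupling global structure from detail synthesis; it does not touch Unify Models, World Models, MLLM, model-based RL, OPD, RL, or GRPO. All keywords other than CV and MultiModal are therefore scored 0.
Keywords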
High-resolution video generation, Hierarchical decoupling, Global structure modeling, Fine-grained detail synthesis, Noise-Span Aligned Shortcut Training, Exponential Index-Biased Sampling, Adaptive Noise-Span Calibration, Few-step inference
Abstract
We present CrossLift, a technique for computing cross fields on meshes guided by visual features in images. We leverage powerful text-to-image priors that are capable of synthesizing images of feature-aligned quad meshes in 2D. We extract this signal as explicit per-pixel directions in the 2D images, which we then back-project to the mesh surface. We aggregate these candidate surface directions by performing two smooth interpolations on the mesh surface (first within each view and second across multiple views). We propose custom confidence-based weights for the candidate directions in each interpolation that allow us to resolve conflicts between candidates on the same face and smoothly interpolate our field to occluded faces. Our method is modular and can be used with many different 2D visual priors. We show additional applications to texture-aligned quad meshing as well as interactive cross-field design using coarse, user-drawn lines as signal. We demonstrate the effectiveness of CrossLift on a diverse set of both organic and mechanical shapes and produce quad meshes that exhibit superior semantic alignment as compared to existing methods. Project page at: https://crosslift.github.io/
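The abstract does not spell out how conflicting candidate directions are combined, so the following is only a planar illustration of one common way to average cross directions with confidence weights. A cross is unchanged by 90° rotations, so each angle θ can be mapped to the unit complex number exp(4iθ) before averaging. The function name `aggregate_cross` and the conflict threshold are hypothetical, and CrossLift itself performs smooth interpolation on the mesh surface rather than this per-face average.

```python
import cmath
import math


def aggregate_cross(angles, weights):
    """Confidence-weighted mean of cross directions given as angles in radians.

    Each angle is mapped to exp(4j * angle) so that directions differing by a
    multiple of 90 degrees reinforce each other instead of cancelling.
    Returns an angle in [0, pi/2), or None when the candidates cancel out.
    """
    if len(angles) != len(weights):
        raise ValueError("angles and weights must have the same length")
    acc = sum(w * cmath.exp(4j * a) for a, w in zip(angles, weights))
    if abs(acc) < 1e-12:  # hypothetical threshold for "no consensus"
        return None
    return (cmath.phase(acc) / 4.0) % (math.pi / 2.0)


# Three candidates that describe the same cross, seen with different confidence.
same_cross = aggregate_cross([0.1, 0.1 + math.pi / 2, 0.1 + math.pi], [1.0, 2.0, 3.0])
# Two equally confident candidates 45 degrees apart are maximally conflicting.
conflict = aggregate_cross([0.0, math.pi / 4], [1.0, 1.0])
```

Raising the weight of one candidate pulls the result toward it, which is the role the abstract assigns to the confidence-based weights.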
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 5.0/10 | 5.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
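Scoring rationale: The paper computes cross fields on meshes from 2D visual priors (such as text-to-image generative models) and applies them to quad mesh generation. The core area is computer graphics (CV-adjacent); it uses multimodal signals (text-to-image priors) but is neither an MLLM nor a unified model. It does not involve world models, RL, model-based RL, OPD, or GRPO. Hence only CV has high relevance (8, because the core is geometry processing in graphics rather than a traditional CV task), MultiModal has moderate relevance (5, because it uses text-to-image priors), and all other keywords score 0.
Keywords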
cross fields, quad meshing, text-to-image priors, 2D visual priors, surface direction interpolation, semantic alignment, geometry processing
Abstract
Real-time streaming joint audio-video generation for character animation requires a generator to speak the requested transcript, maintain visual identity across chunks, and run within a strict playback budget. These requirements are difficult to satisfy simultaneously: chunk-wise autoregressive generation can accumulate transcript-audio misalignment and visual drift, while the few-step distillation needed for low latency often degrades spatial diversity and temporal quality. We present StreamChar, a streaming framework that separates long-horizon orchestration from short-window audio-video denoising. An LLM-based orchestrator uses the transcript and historical context to produce frame-aligned audio conditions, and a joint audio-video DiT performs local bidirectional denoising with reference and motion-frame conditioning. For efficient deployment, we use a two-stage distillation pipeline that first compresses the sampler and then fine-tunes the student under online chunk rollouts. A progress-aware pointer aligns partial transcripts with generated audio during rollout training, and a sink-chunk memory provides a persistent visual anchor for reducing long-horizon drift. Experiments on short-clip and long-horizon protocols show that StreamChar runs in real time on a single H100 GPU and provides a favorable system-level trade-off among transcript fidelity, audio-visual synchronization, visual quality, and streaming stability compared with recent joint and audio-driven baselines.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 6.0/10 | 6.0 |
| MultiModal | 1.0 | 7.0/10 | 7.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
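Scoring rationale: The paper studies streaming joint audio-video generation for characters. Its core is long-horizon streaming generation, audio-video alignment, and distillation for deployment, which places it in multimodal generation and computer vision (CV). It has some connection to MultiModal and CV but does not involve Unify Models, World Models, MLLM, model-based RL, OPD, RL, or GRPO at all. All keywords other than MultiModal and CV therefore score 0.
Keywords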
streaming generation, audio-video generation, character animation, diffusion transformer, distillation, long-horizon generation, real-time
Abstract
Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates. In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak-Łojasiewicz condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable Polyak-Łojasiewicz-type (PL) geometry that supports global convergence of WPG.
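To make the "action gradient plus Langevin-type diffusion" description concrete, here is a toy sketch for a single state with a frozen quadratic soft Q-function. It assumes the particle update a ← a + η ∇Q(a) + sqrt(2ητ) ξ, whose continuous-time limit has the Gibbs policy ∝ exp(Q/τ) as its stationary law. The constants, the function names, and the choice of Q are all hypothetical. In WPG proper the soft Q-function changes along the policy iterates, which this sketch deliberately ignores.

```python
import math
import random

# Hypothetical toy problem: Q(a) = -0.5 * KAPPA * (a - A_STAR)**2 for one fixed state.
KAPPA, A_STAR = 2.0, 1.0
TAU = 0.5    # entropy-regularization temperature (assumed)
STEP = 0.05  # discretization step size (assumed)


def grad_q(a: float) -> float:
    """Action gradient of the toy soft Q-function."""
    return -KAPPA * (a - A_STAR)


def langevin_step(a: float, rng: random.Random) -> float:
    """One discretized Langevin update: drift along grad Q plus Gaussian noise."""
    noise = math.sqrt(2.0 * STEP * TAU) * rng.gauss(0.0, 1.0)
    return a + STEP * grad_q(a) + noise


rng = random.Random(0)
particles = [0.0] * 2000  # action samples representing the current policy
for _ in range(200):
    particles = [langevin_step(a, rng) for a in particles]

mean = sum(particles) / len(particles)
var = sum((a - mean) ** 2 for a in particles) / len(particles)
# The Gibbs policy here is N(A_STAR, TAU / KAPPA), so its variance is 0.25.
# The discretized chain settles at TAU / (KAPPA * (1 - STEP * KAPPA / 2)),
# slightly above 0.25: a small instance of the discretization bias.
```

The particle mean approaches `A_STAR` geometrically (each step contracts the gap by a factor of 1 − ηκ), while the variance settles slightly above the Gibbs value because of the finite step size.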
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 2.0/10 | 2.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 10.0/10 | 10.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
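Scoring rationale: The paper is a theoretical analysis of the global convergence of Wasserstein policy gradient in entropy-regularized reinforcement learning. Its core is the theoretical foundation of RL, and it is unrelated to multimodal learning, unified models, world models, computer vision, or MLLM. 'model-based RL' has only an indirect connection through the Bellman recursion and value functions, and the method itself is model-free, hence a score of 2. All other keywords are unrelated and score 0.
Keywords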
Wasserstein policy gradient, entropy-regularized reinforcement learning, global convergence, Bellman recursion, Polyak-Łojasiewicz condition, log-Sobolev inequality, optimal transport, Langevin diffusion
Abstract
Recent research in time series forecasting frequently investigates the integration of textual and visual modalities with numerical models to better navigate non-stationary environments. Despite delivering solid numerical results, existing multi-modal approaches usually encounter a dilemma: prioritizing the minimization of average errors can result in excessively smooth forecasts that overlook essential fluctuations. To resolve this limitation, we introduce STaT, an innovative multimodal architecture for Symbolic-Temporal-Textual Alignment, which seamlessly unites three synergistic modalities. Specifically, the symbolic modality converts continuous time series into discrete tokens, facilitating the accurate identification of structural patterns and turning points; the temporal modality extracts inherent sequential dependencies; and the textual modality leverages domain semantics to steer the macroscopic forecasting trends. Comprehensive evaluations on eight real-world benchmarks indicate that STaT delivers exceptional performance, enhancing conventional magnitude indicators by up to 8.9% while simultaneously decreasing shape distortion by up to 8.5%.
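The abstract does not say how the symbolic modality tokenizes a series, so the sketch below is only one plausible illustration: equal-frequency (quantile) binning into letters, plus a helper that marks turning points as sign changes of the first difference. Both function names and the four-letter alphabet are hypothetical and are not taken from STaT.

```python
def symbolize(series, alphabet="abcd"):
    """Map each value to a letter according to its quantile bin.

    Breakpoints are taken from the sorted series so that each letter covers
    roughly the same number of observations (equal-frequency binning).
    """
    ranked = sorted(series)
    n, k = len(series), len(alphabet)
    cuts = [ranked[(i * n) // k] for i in range(1, k)]
    return "".join(alphabet[sum(x >= c for c in cuts)] for x in series)


def turning_points(series):
    """Indices where the series switches between rising and falling."""
    return [
        i
        for i in range(1, len(series) - 1)
        if (series[i] - series[i - 1]) * (series[i + 1] - series[i]) < 0
    ]


tokens = symbolize([1, 2, 3, 4, 5, 6, 7, 8])
peaks_and_troughs = turning_points([1, 3, 2, 4, 4, 1])
```

A discrete view like this keeps local peaks and troughs explicit, whereas a loss that only minimizes average error can smooth them away.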
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 1.0/10 | 1.0 |
| CV | 1.0 | 1.0/10 | 1.0 |
| MultiModal | 1.0 | 8.0/10 | 8.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
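Scoring rationale: The paper studies shape distortion in non-stationary time series forecasting and proposes STaT, an architecture that aligns three modalities (symbolic, temporal, textual). Its core is multimodal fusion, but it is not a unified multimodal large model (Unify Models), a world model (World Models), a multimodal large language model (MLLM), or computer vision (CV) work. It is unrelated to reinforcement learning (RL, model-based RL, GRPO) and OPD. Hence only MultiModal scores high (8); the other keywords score very low or zero.
Keywords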
time series forecasting, shape distortion, non-stationary time series, tri-modal synergy, symbolic-temporal-textual alignment, multimodal architecture, structural patterns, domain semantics
Abstract
Structure-from-Motion, the process of simultaneously estimating camera poses and 3D scene structure from a collection of images, remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at https://github.com/colmap/gluemap.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
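Scoring rationale: The paper's core is Structure-from-Motion (SfM) and feedforward 3D reconstruction in computer vision, so CV scores 10. 'Unify Models' applies only in the sense of combining classical and feedforward methods, not a unified multimodal model, so it scores 2. The other keywords (World Models, MLLM, MultiModal, model-based RL, OPD, RL, GRPO) are unrelated to the paper and score 0. The author list does not include any of the designated experts.
Keywords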
Structure-from-Motion, Feedforward Reconstruction, 3D Reconstruction, Camera Pose Estimation, Classical Methods, Deep Learning, Open-Source
Abstract
Pixel count and geographical coverage are two key characteristics of remote sensing images. Existing remote sensing image segmentation methods typically focus on images with either a small pixel count or a large pixel count but limited geographical coverage. In this paper, we introduce a novel segmentation task targeting ultra-wide area (UWA) remote sensing images, characterized by both a large pixel count and extremely wide geographical coverage. The core challenges of UWA segmentation lie in simultaneously handling ground objects with significantly varying scales and maintaining long-range contextual semantic continuity. To address these challenges, we propose the Scale-Frustum Representation Network (SFR-Net). Inspired by the viewing frustums of remote sensing images captured from different altitudes, we construct scale-frustum representations, enabling unified modeling of ground objects and contextual features at different scales. Furthermore, we design a cascaded cross-scale fusion mechanism to effectively integrate these representations, enhancing local semantic understanding while ensuring long-range contextual continuity. Experimental results on GID and FBPS demonstrate that SFR-Net achieves state-of-the-art performance, improving mIoU by 1.72% and 4.29%, respectively, over the strongest competing methods. In addition, the proposed scale-frustum representations can be integrated into generic segmentation networks to improve both segmentation accuracy and convergence speed. The implementation code will be publicly available at https://github.com/ChuyuZhong/SFR-Net.
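For readers less familiar with the metric behind the reported gains, mean intersection-over-union (mIoU) can be computed from a confusion matrix as sketched below. This is the standard definition, not code from SFR-Net; the function name and the small 2×2 example are hypothetical.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    Classes that never occur in either the labels or the predictions are skipped.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # class c pixels missed
        fp = sum(confusion[r][c] for r in range(n)) - tp  # pixels wrongly labelled c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    if not ious:
        raise ValueError("confusion matrix contains no pixels")
    return sum(ious) / len(ious)


# Two classes: IoU is 3/5 for class 0 and 5/7 for class 1.
example_miou = mean_iou([[3, 1], [1, 5]])
```

Because every class contributes equally to the mean, small ground objects weigh as much as large ones, which is why scale variation matters for this metric.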
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
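Scoring rationale: The paper studies segmentation of ultra-wide area remote sensing images, which belongs to computer vision (CV), so CV is highly relevant (10). The paper mentions "unified modeling", but this refers to unifying feature representations across scales rather than a unified multimodal model, so the connection to Unify Models is weak (2). The other keywords (World Models, MLLM, MultiModal, model-based RL, OPD, RL, GRPO) are unrelated to the paper and score 0.
Keywords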
Ultra-Wide Area, Remote Sensing Image Segmentation, Scale-Frustum Representation, SFR-Net, Cross-Scale Fusion, Multi-Scale Representation, Semantic Segmentation
Abstract
Offline goal-conditioned reinforcement learning (GCRL) provides a practical framework for obtaining goal-reaching policies from fixed datasets. However, learning a reliable goal-conditioned value function in long-horizon tasks remains challenging. In this paper, we identify erroneous generalization in goal-conditioned value functions as a fundamental bottleneck, and demonstrate that appropriate inductive bias in the value function is crucial for addressing the bottleneck. Building on these findings, we propose Latent-Aligned Value Learning (LAVL), an offline GCRL algorithm that integrates latent-representation-based value generalization with hierarchical planning in a unified framework. Extensive experiments on OGBench demonstrate that LAVL consistently outperforms existing offline GCRL methods, achieving the highest performance on 20 out of 22 datasets. Notably, LAVL exhibits strong performance in long-horizon tasks and trajectory stitching datasets, where prior methods suffer significant performance degradation. Our code is available at https://github.com/oh-lab/LAVL.git.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 10.0/10 | 10.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
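Scoring rationale: The paper's topic is offline goal-conditioned reinforcement learning (offline GCRL), and its core is value-function generalization and latent representation alignment. It does not involve unified models, world models, multimodal large models, computer vision, multimodal learning, OPD, or GRPO. RL is the core topic and scores 10. model-based RL scores 1 because the abstract mentions hierarchical planning, which may imply a model but is not stated to be model-based. The remaining keywords are unrelated.
Keywords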
Offline Goal-Conditioned Reinforcement Learning, Latent Representation Alignment, Value Function Generalization, Hierarchical Planning, Long-Horizon Tasks, OGBench
Abstract
Hyperspectral 3D imaging enables the capture of dense spectral information and scene geometry but has traditionally been confined to narrow spectral windows, typically the visible range. In this work, we introduce a broadband hyperspectral 3D imaging (BH3D) method to extend this capability across the full visible-near-infrared and short-wavelength infrared (SWIR) spectrum (450-1500 nm). This broad coverage is critical as it captures complementary physical cues: visible wavelengths reveal surface appearance, while SWIR bands provide insight into subsurface properties and material composition. However, realizing BH3D is challenging due to fundamental sensor constraints between visible-spectrum silicon and SWIR-spectrum InGaAs sensors, which necessitate complex multi-spectrograph designs. Here we propose a single-spectrograph BH3D system, using a stereo setup comprising visible and SWIR cameras, that reconstructs dense broadband hyperspectral reflectance together with accurate 3D geometry. Our key idea is to extend dispersed structured light to the broadband regime using a single spectrograph. We model the image formation of broadband dispersed structured light, and estimate hyperspectral reflectance and depth. We validate our approach on diverse real-world scenes, demonstrating accurate reconstruction with a mean spectral angle mapper of 0.13 rad, root mean square error of 0.03, and mean depth error of 4.5 mm. We further demonstrate identifying metameric materials, performing imaging through opaque layers, uncovering hidden features on banknotes, and revealing blood vessels.
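The two spectral error measures quoted above have standard definitions, sketched here for a single pixel. The spectral angle mapper (SAM) is the angle between the reference and estimated spectra treated as vectors, so it ignores overall brightness; RMSE measures per-band magnitude error. The function names and sample spectra are hypothetical, and the paper's exact evaluation protocol (averaging, band selection) is not given in the abstract.

```python
import math


def spectral_angle(reference, estimate):
    """Spectral angle in radians between two spectra of equal length."""
    dot = sum(r * e for r, e in zip(reference, estimate))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_e = math.sqrt(sum(e * e for e in estimate))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_r * norm_e)))
    return math.acos(cosine)


def rmse(reference, estimate):
    """Root mean square error across spectral bands."""
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


# Doubling a spectrum changes its RMSE but leaves the spectral angle near zero.
angle_scaled = spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
angle_orthogonal = spectral_angle([1.0, 0.0], [0.0, 1.0])
error_scaled = rmse([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

Reporting both measures separates errors in spectral shape (SAM) from errors in reflectance magnitude (RMSE).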
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 6.0/10 | 6.0 |
| MultiModal | 1.0 | 5.0/10 | 5.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
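Scoring rationale: The paper studies broadband hyperspectral 3D imaging with dispersed structured light. It belongs to computer vision and imaging but is unrelated to unified models, world models, multimodal large language models, and reinforcement learning. CV relevance is moderate (6): the work involves 3D reconstruction and image processing, but its core is an optical system rather than a general vision method. MultiModal relevance is lower (5): it fuses visible, SWIR, and depth information but is not typical multimodal learning. All other keywords score 0.
Keywords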
hyperspectral 3D imaging, broadband, dispersed structured light, visible-near-infrared, short-wavelength infrared, depth reconstruction, spectral reflectance
Abstract
Federated edge learning (FEEL) has recently emerged as a promising paradigm for achieving edge intelligence (EI) by enabling collaborative model training across edge devices while protecting data privacy. In this paper, we put forth an online optimization framework that jointly manages federated training and inference on resource-constrained edge devices. We introduce a tandem-queue-inspired conversion mechanism that bridges inference requests and training data, and further incorporate both data and model freshness into the accuracy formulation to capture temporal dynamics in real-world environments. To maximize inference accuracy while minimizing latency and energy consumption, the mode selections, communication, and computation resource allocations of edge devices are jointly optimized. We formulate this optimization as a multi-objective optimization problem, which is NP-hard and further complicated by the online setting. To address these challenges, we transform the problem into a multi-objective Markov decision process (MOMDP) and develop a constrained multi-objective proximal policy optimization (C-MOPPO) algorithm. Specifically, C-MOPPO first learns a set of policies with different preferences across three objectives, then leverages constrained policy optimization to enrich the Pareto front and obtain high-quality, dense solutions. Extensive experiments demonstrate that C-MOPPO achieves well-balanced trade-offs among objectives and significantly outperforms baselines under various system configurations.
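The notion of a Pareto front over the three objectives (maximize accuracy, minimize latency, minimize energy) can be illustrated with a small non-dominated filter. This is a generic sketch, not the C-MOPPO algorithm: the function names are hypothetical and the candidate tuples are made-up illustrative values, not results from the paper.

```python
def dominates(p, q):
    """True if p is at least as good as q on every objective and better on one.

    Each point is (accuracy, latency, energy): higher accuracy is better,
    lower latency and lower energy are better.
    """
    no_worse = p[0] >= q[0] and p[1] <= q[1] and p[2] <= q[2]
    strictly_better = p[0] > q[0] or p[1] < q[1] or p[2] < q[2]
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up outcomes of four candidate policies.
candidates = [(0.9, 10.0, 5.0), (0.8, 8.0, 4.0), (0.7, 12.0, 6.0), (0.9, 10.0, 5.5)]
front = pareto_front(candidates)
```

Here the first two candidates survive because each wins on a different objective. A denser front gives the operator more trade-off points to choose from, which is what the constrained optimization stage described above is meant to provide.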
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 10.0/10 | 10.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
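Scoring rationale: The paper addresses joint optimization of training and inference in federated edge learning using a constrained multi-objective deep reinforcement learning algorithm (C-MOPPO). Among the keywords, only 'RL' (reinforcement learning) is highly relevant to the paper's core method (10); Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, and GRPO are unrelated (0). The paper does not involve multimodal large models, world models, unified models, or model-based reinforcement learning.
Keywords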
Federated Edge Learning, Multi-Objective Optimization, Deep Reinforcement Learning, Constrained Policy Optimization, Pareto Front, Inference Accuracy, Latency, Energy Consumption
Abstract
Unmanned Aerial Vehicles (UAVs) have quickly become common in various airspaces, representing a wide range of applications from recreational flying to commercial photography and package delivery. With the increasing prevalence of UAVs, it becomes critical that both manned and unmanned aircraft can detect UAVs and other flying objects from long range to effectively track movement and ensure safe operation in shared spaces. While several datasets have been introduced for drone detection, the need for expanded high-quality data persists, especially in the area of high-resolution long-range drone data. To address this, we introduce a high-resolution dataset of 102,532 long-range RGB images of drones, sampled at 5 FPS from 128 distinct video clips taken mid-flight during 17 different data collection days spread over 8 months to ensure a wide variety of lighting scenarios, flight locations, and background elements. The dataset includes comprehensive drone range information, as well as 29,630 IR images, all paired with RGB counterparts from the base dataset. As one of the first drone detection datasets to leverage 4K image resolution and paired 640x512 IR images, our work represents a significant advancement to enable the detection of drones at long range. For access to the complete dataset, please visit https://research.coe.drexel.edu/ece/imaple/lrddv3/
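The dataset figures in the abstract imply some rough scale estimates, worked out below. The calculation assumes that every sampled frame was kept and that all clips were sampled uniformly at 5 FPS, neither of which the abstract confirms; the variable names are hypothetical.

```python
RGB_FRAMES = 102_532  # RGB images reported in the abstract
IR_FRAMES = 29_630    # IR images reported in the abstract
CLIPS = 128           # distinct video clips
SAMPLE_FPS = 5        # sampling rate in frames per second

total_seconds = RGB_FRAMES / SAMPLE_FPS           # footage covered by the samples
total_hours = total_seconds / 3600
frames_per_clip = RGB_FRAMES / CLIPS              # average frames per clip
seconds_per_clip = frames_per_clip / SAMPLE_FPS   # average clip length
ir_fraction = IR_FRAMES / RGB_FRAMES              # share of RGB frames with an IR pair
```

Under these assumptions the RGB set covers about 5.7 hours of flight footage, clips average roughly 800 frames (about 160 seconds) each, and a little under 29% of the RGB frames have an IR counterpart.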
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 5.0/10 | 5.0 |
| MultiModal | 1.0 | 5.0/10 | 5.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
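Scoring rationale: The paper builds a drone detection dataset, which belongs to computer vision (CV), and the dataset includes RGB and infrared multimodal data, so CV and MultiModal have moderate relevance (5). The remaining keywords (Unify Models, World Models, MLLM, model-based RL, OPD, RL, GRPO) are unrelated to the paper and score 0. The paper does not involve unified models, world models, multimodal large language models, reinforcement learning, or related methods.
Keywords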
drone detection, dataset, high-resolution, long-range, RGB, infrared, thermal, UAV
Abstract
Text-to-video diffusion transformers encode semantic information unevenly across model depth, which constrains effective concept erasure. We identify a representational bottleneck, termed concept-layer topological alignment, under which target concepts exhibit higher separability at certain representational depths. Outside these depths, concept and non-target signals remain strongly entangled, limiting the effectiveness of depth-specific erasure. This observation reframes concept erasure as the problem of identifying representational depths where concept-non-target separation naturally emerges. Motivated by this structural constraint, we introduce CLEAR, a separability-driven optimization framework for concept erasure that explicitly enforces concept-layer alignment. CLEAR operationalizes this principle by formulating layer selection as an optimization problem over concept-non-target separability, rather than relying on layer-agnostic or heuristic choices. To enable this, we introduce a separability-aware objective that favors layers exhibiting stronger concept-non-target separation. Experiments on large-scale text-to-video diffusion models demonstrate that enforcing concept-layer alignment leads to more precise concept suppression while preserving overall generative quality.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 5.0/10 | 5.0 |
| MultiModal | 1.0 | 5.0/10 | 5.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
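Scoring rationale: The paper studies concept erasure in text-to-video diffusion models; its core is concept-layer alignment and the analysis of a representational bottleneck. Although it involves multimodal (text plus video) data and the computer vision field, it does not touch unified models, world models, multimodal large language models, reinforcement learning, or GRPO. CV and MultiModal are related but not the core topic, so they score fairly low. The remaining keywords are unrelated.
Keywords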
concept erasure, text-to-video diffusion, concept-layer alignment, representational bottleneck, separability optimization, generative quality
Abstract
State space models (SSMs) have emerged as a powerful paradigm for efficient single-image super-resolution (SR) due to their linear complexity and long-range modeling capabilities. However, existing Mamba-based methods typically rely on data-agnostic rigid scanning, which reshapes 2D images into 1D sequences over a fixed grid, inevitably disrupting spatial-semantic topology and introducing artifacts. Inspired by the Gestalt perceptual grouping theory, we propose SP-MoMamba, a superpixel-driven mixture of state space experts designed for content-aware SR. Our core idea is to transform the traditional rigid scanning into a semantic-level interaction by treating superpixels as fundamental units. Specifically, we introduce the Superpixel-driven State Space Model (SP-SSM), which compresses semantically homogeneous regions into high-order tokens to preserve global topological consistency. To address the conflict between fixed scanning scales and diverse semantic granularities, we develop the Multi-Scale Superpixel Mixture of State Space Experts (MSS-MoE). This module utilizes a dynamic routing mechanism to adaptively assign scale-specific experts, effectively capturing multi-scale textures while reducing computational redundancy. Furthermore, to prevent the loss of high-frequency details during global abstraction, we introduce a Local Spatial Modulation Expert (LSME) to complement the global modeling, ensuring a precise reconstruction of sharp edges and fine structures. Extensive experiments on standard benchmarks demonstrate that SP-MoMamba achieves superior reconstruction fidelity and a more favorable efficiency-performance trade-off compared to state-of-the-art efficient SR methods.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
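Scoring rationale: The paper studies image super-resolution (SR) based on state space models (SSMs), which falls squarely within computer vision (CV), so CV is rated highly relevant (10). The other keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL, OPD, RL, GRPO) are unrelated to the paper's content and all score 0. The paper does not touch on multimodality, reinforcement learning, or unified models.
Keywords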
Superpixel-driven, Mixture of State Space Experts, Image Super-Resolution, State Space Models, Multi-Scale Superpixel, Local Spatial Modulation Expert, Gestalt perceptual grouping
Abstract (Chinese translation)
3D Gaussian Splatting (3DGS,三维高斯泼溅) 提供了一种使用各向异性高斯进行高质量场景重建的高效方法。近年来,基于3DGS的方法在实现实时性能的同时,显著提升了人类虚拟形象的渲染质量。然而,现有方法存在基于图像的方法与基于3DMM(三维形变模型)的方法所生成的高斯数量在量级上不匹配的问题。这种差异导致重建的表情缺乏精细细节。在本文中,我们提出了一种从单张图像重建可动画头部虚拟形象的新方法。我们设计了一个图分裂网络(Graph splitting network),利用自回归架构从粗到细逐步生成高斯。为了解决分裂高斯引起的图不一致性问题,我们采用了一种网格拓扑扩展方法,使图神经网络(GNN)的连接性与增加的高斯数量对齐。此外,我们引入了一种新颖的密度控制方法,其中包括一个门控机制,为高斯生成软掩码,防止分裂操作后的过度密集化。这使得我们能够动态控制不同面部区域的高斯密度。为了实现平滑且快速的训练,我们采用了一种延迟过滤策略,避免在训练过程中重新计算图拓扑。实验结果表明,我们的自回归结构通过逐步分裂高斯,有效提升了表情表达能力。这一过程在GNN引导的分裂下,能够合成更精确的面部细节,并实现更高的重建质量。
Abstract
3D Gaussian Splatting (3DGS) provides an efficient method for high-quality scene reconstruction using anisotropic Gaussians. Recently, 3DGS-based methods have significantly improved the rendering quality of human avatars while enabling real-time performance. However, existing methods suffer from a magnitude mismatch in the number of Gaussians generated by image-based and 3DMM-based approaches. This discrepancy results in reconstructed expressions that lack fine-grained detail. In this paper, we introduce a novel method for reconstructing an animatable head avatar from a single image. We propose a Graph splitting network to progressively generate Gaussians from coarse to fine using an autoregressive architecture. To address the graph inconsistency caused by split Gaussians, we employ a mesh topology extension method to align the GNN's connectivity with the increased Gaussian count. Furthermore, we introduce a novel density control method that includes a gating mechanism that generates soft masks for Gaussians, preventing over-densification after the splitting operation. This allows for dynamic control over Gaussian density across different facial regions. For smooth and rapid training, we employ a delayed filtering strategy to avoid re-computing the graph topology during training. Experimental results demonstrate that our autoregressive structure effectively improves expression representation ability by progressively splitting Gaussians. This process, enabled by the GNN-guided splitting, synthesizes more precise facial details and achieves higher reconstruction quality.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 2.0/10 | 2.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
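Scoring rationale: The paper studies head avatar reconstruction from a single image; its core is 3D Gaussian Splatting (3DGS) combined with an autoregressive graph splitting network. It belongs to computer vision (CV) and has only a weak link to MultiModal (it takes image input). It does not involve Unify Models, World Models, multimodal large language models (MLLM), model-based RL, OPD, RL, or GRPO at all.
Keywords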
3D Gaussian Splatting, Head Avatar, Autoregressive Generation, Graph Splitting Network, One-shot Reconstruction, Facial Detail Synthesis, Density Control
Abstract (Chinese translation)
诸如Segment Anything Model(SAM)等基础分割模型现已常规用于迭代流水线中,其中每个预测掩码被反馈作为下一个提示。这种做法将分割转化为一个闭环动态过程,然而这些系统的解码器级行为在很大程度上仍未得到检验。我们表明,这种反馈循环可能诱发一种此前被忽视的失效模式——解码器耦合漂移(decoder coupling drift),即掩码解码器的交叉注意力逐渐失去与目标对象的对齐,导致误差在迭代过程中累积。我们通过检测SAM的解码器并推导出无需真值的提示-图像耦合度、注意力稳定性及时间一致性度量来研究这一现象。在体电子显微镜数据上,这些解码器内部信号揭示,相较于基于真值锚定的反馈,标准迭代提示会系统性地降低注意力对齐与时间一致性。随后,我们将迭代提示形式化为一个离散时间动力系统,并展示近端锚定(proximal anchoring)如何减少反馈循环中的误差放大。基于此分析,我们提出DeCoDrift——一种无需训练、推理时稳定的框架,该框架约束提示更新并在迭代间保持解码器耦合。在大量实验中,DeCoDrift在标准迭代提示基础上持续提升了注意力稳定性、时间一致性及分割质量,且无需重新训练或真值监督。更广泛地,我们的结果表明解码器内部动态不仅具有诊断价值:它们为闭环使用中稳定基础分割模型提供了可操作的信号。
Abstract
Foundation segmentation models such as Segment Anything Model (SAM) are now routinely used in iterative pipelines, where each predicted mask is fed back as the next prompt. This practice turns segmentation into a closed-loop dynamical process, yet the decoder-level behavior of these systems remains largely unexamined. We show that this feedback loop can induce a previously overlooked failure mode, decoder coupling drift, in which the mask decoder's cross-attention progressively loses alignment with the target object, causing errors to accumulate across iterations. We study this phenomenon by instrumenting SAM's mask decoder and deriving ground-truth-free measures of prompt-image coupling, attention stability, and temporal consistency. On volumetric electron microscopy data, these decoder-internal signals reveal that standard iterative prompting systematically degrades attention alignment and temporal coherence relative to oracle-anchored feedback. We then formalize iterative prompting as a discrete-time dynamical system and show how proximal anchoring reduces error amplification in the feedback loop. Building on this analysis, we introduce DeCoDrift, a training-free inference-time stabilization framework that constrains prompt updates and preserves decoder coupling across iterations. Across extensive experiments, DeCoDrift consistently improves attention stability, temporal coherence, and segmentation quality over standard iterative prompting, without retraining or ground-truth supervision. More broadly, our results show that decoder-internal dynamics are not merely diagnostic: they provide actionable signals for stabilizing foundation segmentation models in closed-loop use.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
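Scoring rationale: The paper focuses on decoder coupling drift in foundation segmentation models (such as SAM) under closed-loop iterative prompting. It belongs to computer vision (CV) and is entirely unrelated to Unify Models, World Models, multimodal large language models (MLLM), MultiModal, model-based RL, OPD, RL, and GRPO. Only CV therefore scores 10; all others score 0.
Keywords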
Decoder Coupling Drift, Closed-Loop Segmentation, Foundation Segmentation Models, SAM, Iterative Prompting, Attention Stability, Temporal Coherence, Proximal Anchoring
Abstract (Chinese translation)
伪装目标检测(Camouflaged Object Detection, COD)旨在通过物理属性定位与背景在感知上差异极小的目标。现有方法受限于静态的“先训练后冻结”范式,存在领域刚性和标注依赖性问题,限制了其对场景变化及未见过的伪装模式的适应能力。为克服这些局限,我们提出层次一致性学习(Hierarchical Consistency Learning, HCL)框架,该框架集成测试时自适应机制以实现动态表征重校准。具体而言,我们设计了层次表征重构(Hierarchical Representation Reconstruction, HRR),通过协同空间重构与双流频域分解来缓解特征纠缠,增强对表观同质化的鲁棒性。像素与频谱推理提供了结构先验与上下文先验。我们进一步引入任务亲和性引导(Task Affinity Guidance, TAG),通过通道级亲和性在分支间传播知识,对齐局部判别线索并缓解语义漂移。为确保语义不变性,我们构建了原型一致性校准(Prototype Consistency Calibration, PCC),将区域特征聚合为紧凑原型并建立原型-特征相似度,从而施加连接任务与表征差距的隐式层次约束。在四个伪装目标基准与四个水下目标基准上,针对三种退化场景的大量实验表明,我们的方法始终优于现有最优方法,突显了其在分布偏移下的鲁棒性与泛化能力。
Abstract
Camouflaged object detection (COD) aims to localize targets that exhibit minimal perceptual differences from backgrounds through physical attributes. Existing methods, constrained by the static train-then-freeze paradigm, suffer from domain rigidity and annotation dependency, limiting their adaptability to scene variations and unseen camouflage patterns. To overcome these, we propose the hierarchical consistency learning (HCL) framework, which integrates test-time adaptation for dynamic representation recalibration. Specifically, we design the hierarchical representation reconstruction (HRR) to alleviate feature entanglement by synergizing spatial reconstruction with dual-stream frequency-domain decomposition, enhancing robustness against appearance homogenization. The pixel and spectrum inference provide structural and contextual priors. We further introduce task affinity guidance (TAG) to propagate knowledge across branches via channel-wise affinity, aligning local discriminative cues and mitigating semantic drift. To ensure semantic invariance, we formulate the prototype consistency calibration (PCC), which aggregates region features into compact prototypes and establishes prototype-feature similarity. This imposes implicit and hierarchical constraints that bridge task and representation gaps. Extensive experiments across four camouflaged and four underwater object benchmarks, under three degradation settings, demonstrate that our method consistently outperforms state-of-the-art approaches, highlighting its robustness and generalization under distribution shifts.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 10.0/10 | 10.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
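Scoring rationale: The paper focuses on camouflaged object detection (COD), which belongs to computer vision (CV), and is entirely unrelated to keywords such as unified models, world models, multimodal large language models, and reinforcement learning. CV scores 10 and all others score 0. The weighted total is only 10, far below the dynamic passing score of 24. The author list does not include any of the designated experts.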
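The arithmetic behind the scoring table can be reproduced with a short sketch. It assumes, as the table columns suggest, that each row's score is weight times relevance and that the weighted total is the sum of the rows. The threshold of 24 is the dynamic passing score quoted in the rationale; how that threshold is derived is not stated in the report, so it appears here as a plain constant. The names `weighted_total` and `PASS_THRESHOLD` are illustrative.

```python
# Relevance values (0-10) copied from the scoring table for this paper.
relevance = {
    "Unify Models": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "CV": 10.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
    "OPD": 0.0,
    "RL": 0.0,
    "GRPO": 0.0,
}
# Every keyword carries weight 1.0 in the table.
weights = {keyword: 1.0 for keyword in relevance}

# Quoted in the rationale; its derivation is not given in the report.
PASS_THRESHOLD = 24.0


def weighted_total(relevance, weights):
    """Sum of weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weights[k] * relevance[k] for k in relevance)


total = weighted_total(relevance, weights)
passed = total >= PASS_THRESHOLD
print(f"weighted total = {total}, passed = {passed}")
```

Under these assumptions the total matches the figure of 10 given in the rationale, and the paper falls below the threshold.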
Keywords
Camouflaged object detection, test-time adaptation, hierarchical consistency learning, representation reconstruction, frequency-domain decomposition, task affinity guidance, prototype consistency calibration, distribution shifts
Abstract (Chinese translation)
使基于物理的人形机器人能够根据高级文本指令执行多样化行为仍然是一项重大挑战。现有方法通常遵循两种范式:一种是将运动学动作生成与基于物理的跟踪相结合的两阶段范式,另一种是直接从文本生成动作的端到端模仿学习范式。然而,前者受限于运动学生成与基于物理跟踪之间的固有领域偏移(domain shift),而后者则难以应对文本指令与低级动作之间的巨大模态差距(modality gap),从而限制了有效的语义对齐。值得注意的是,人形机器人状态(humanoid states)编码了丰富的运动动态信息,与低级动作相比,这些信息在语义上与文本描述更为对齐,因此成为推导行为意图(behavioral intent)的自然基础。基于这一洞察,我们提出了MIND,一种新颖的端到端扩散框架,用于文本驱动的基于物理的人形机器人控制,该框架利用行为意图作为文本指令与低级动作之间的语义桥梁。其核心在于,MIND引入了一种多尺度意图扩散机制(multi-scale intent diffusion mechanism),其中整体意图预测器(holistic intent predictor)捕捉全局行为动态以指导整体行为合成,而即时意图预测器(immediate intent predictor)则在每个扩散步骤提供逐步的细粒度信号,用于局部行为优化。这种分层意图公式化(hierarchical intent formulation)为人形机器人控制施加了结构化的归纳偏置(inductive bias),从而提升了语义对齐和行为自然性。此外,MIND将人形机器人状态编码到潜在空间(latent space)中,以实现更有效的语义意图建模。大量实验表明,MIND优于现有方法,能够从文本指令中合成连贯、物理合理且语义对齐的人形机器人行为。我们将发布代码以促进未来研究。
Abstract
Enabling physics-based humanoids to execute diverse behaviors from high-level textual commands remains a significant challenge. Existing methods typically follow either a two-stage paradigm that combines kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning paradigm that directly generates actions from text. However, the former suffers from the inherent domain shift between kinematic generation and physics-based tracking, while the latter struggles with the substantial modality gap between textual commands and low-level actions, limiting effective semantic alignment. Notably, humanoid states encode rich motion dynamics that are more semantically aligned with textual descriptions than low-level actions, making them a natural basis for deriving behavioral intent. Building upon this insight, we propose MIND, a novel end-to-end diffusion framework for text-driven physics-based humanoid control that leverages behavioral intent as a semantic bridge between textual commands and low-level actions. At its core, MIND introduces a multi-scale intent diffusion mechanism, where a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor provides step-wise, fine-grained signals for local behavior refinement at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias for humanoid control, improving semantic alignment and behavioral naturalness. Furthermore, MIND encodes humanoid states into a latent space to enable more effective semantic intent modeling. Extensive experiments demonstrate that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Our code will be released to facilitate future research.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 3.0/10 | 3.0 |
| MultiModal | 1.0 | 4.0/10 | 4.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 2.0/10 | 2.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
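Scoring rationale: The paper studies text-driven, physics-based humanoid control, which belongs to computer graphics, physics simulation, and motion generation. Its core contribution is a multi-scale intent diffusion framework (MIND) that uses behavioral intent, derived from humanoid states, as a semantic bridge for mapping text commands to low-level actions. Relevance to the given keywords is low: Unify Models, World Models, MLLM (multimodal large language models), model-based RL, OPD (possibly a specific method or optimization strategy), and GRPO (a reinforcement learning algorithm) are not involved; CV is only indirectly related through humanoid motion generation (3); MultiModal is somewhat related because the work spans two modalities, text and motion (4); RL is mentioned only as background and is not the core method (2). The author list does not include any of the designated experts.
Keywords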
physics-based humanoid control, text-driven behavior generation, diffusion model, multi-scale intent, semantic alignment, behavioral intent, humanoid state encoding, end-to-end control
Abstract (Chinese translation)
近年来,语义分割研究日益趋向于更强的上下文建模、密集注意力机制以及基于Transformer的架构。尽管这些模型取得了令人瞩目的性能,但经典的基于CNN的分割流水线因其简洁性、高效性和易于实现的特点,仍然具有吸引力。本文重新审视了一个实际问题:仅通过修改分割头部,基于ResNet的分割模型能提升到何种程度?我们提出ATV-Net(Adaptive Triple-View Network),一种自适应三视图网络,通过三种简单但互补的感受野视图来增强ResNet-101骨干网络。微观视图捕获逐点的语义响应,局部视图建模邻域结构和物体边界,而侦察视图则提供扩大的上下文线索。ATV-Net并非以固定权重融合这些视图,而是引入自适应决策门(Adaptive Decision Gate),根据输入场景特征动态选择感受野响应。此外,还应用了一个紧凑的全局协调层以改善空间和语义一致性。在Cityscapes验证集上的实验表明,ATV-Net达到了80.31%的mIoU。这一结果表明,经典的基于CNN的分割方法远未过时:通过简单的感受野视图和自适应融合,基于ResNet的流水线无需依赖Transformer风格的全局注意力或过于复杂的上下文模块,即可达到具有竞争力的精度水平。
Abstract
Recent semantic segmentation research has increasingly moved toward stronger context modeling, dense attention, and transformer-based architectures. Although these models achieve impressive performance, classical CNN-based segmentation pipelines remain attractive because of their simplicity, efficiency, and ease of implementation. This paper revisits a practical question: how far can a ResNet-based segmentation model be improved by only modifying the segmentation head? We propose ATV-Net, an Adaptive Triple-View Network that strengthens a ResNet-101 backbone using three simple but complementary receptive-field views. The micro view captures point-wise semantic responses, the local view models neighborhood structures and object boundaries, and the scout view provides enlarged contextual cues. Instead of fusing these views with fixed weights, ATV-Net introduces an Adaptive Decision Gate that dynamically selects receptive-field responses according to input scene characteristics. A compact global coordination layer is further applied to improve spatial and semantic consistency. Experiments on the Cityscapes validation set show that ATV-Net achieves 80.31% mIoU. This result suggests that classical CNN-based segmentation is still far from obsolete: with simple receptive-field views and adaptive fusion, a ResNet-based pipeline can reach a competitive accuracy level without relying on transformer-style global attention or overly complex context modules.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 1.0/10 | 1.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
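Scoring rationale: The paper studies a ResNet-based semantic segmentation method, a core computer vision (CV) task, so CV scores 8. It handles only the image modality and involves neither multimodality (MultiModal) nor large language models (MLLM), so MultiModal scores 1 and MLLM scores 0. Unify Models, World Models, model-based RL, RL, OPD, and GRPO are entirely unrelated to the paper and all score 0. The author list does not include any of the designated experts.
Keywords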
semantic segmentation, ResNet, adaptive fusion, receptive-field views, CNN-based segmentation, Cityscapes, mIoU
Abstract (Chinese translation)
我们提出了OMGTex,一种基于扩散模型的端到端框架,用于从多风格人脸图像中重建高质量且可编辑的面部UV纹理。现有的纹理重建方法面临两大局限性:(1)由于依赖三维几何先验而导致的脆弱性——这些先验难以准确估计,尤其是在面部遮挡或风格化域中;(2)缺乏语义解耦能力,从而限制了区域特定的纹理编辑与风格迁移。我们的工作同时解决了这两个挑战。我们的核心创新在于一种无几何先验的管线,该管线直接将二维人脸图像映射到对应的可编辑UV纹理。我们引入了两项关键技术:首先,针对扩散生成中常见的UV错位问题,我们提出了一种在推理阶段基于梯度引导的细化策略,显式地修正结构一致性。其次,我们利用扩散模型固有的语义分布能力,并设计了一种新颖的训练范式来增强这一倾向,从而实现对面部纹理的语义感知编辑。此外,为解决多风格纹理重建中的数据稀缺问题,我们构建了CANVAS——首个涵盖真实与多样化风格化领域的综合性配对纹理重建数据集。据我们所知,OMGTex是首个无几何先验的推理框架,能够在不同领域实现鲁棒、风格一致且可编辑的面部纹理重建。我们的方法在多个面部纹理基准测试中达到了最先进的性能。
Abstract
We propose OMGTex, an end-to-end diffusion-based framework for reconstructing high-quality and editable facial UV textures from multi-style facial images. Existing texture reconstruction methods face two major limitations: (1) Fragility due to reliance on 3D geometry priors, which are difficult to estimate accurately, especially under facial occlusions or in stylized domains; and (2) A lack of semantic disentanglement, inhibiting region-specific texture editing and style transfer. Our work addresses both challenges simultaneously. Our core innovation is a geometry-free pipeline that directly maps a 2D face image to its corresponding editable UV texture. We introduce two key techniques: First, to address the challenge of UV misalignment common in diffusion generation, we introduce a gradient-guided refinement strategy at inference time, which explicitly corrects structural consistency. Second, we leverage the inherent semantic distribution capability of diffusion models and design a novel training paradigm to enhance this tendency, enabling semantic-aware editing of facial texture. Furthermore, to address the data scarcity in multi-style texture reconstruction, we construct CANVAS, the first comprehensive paired texture reconstruction dataset covering realistic and diverse stylized domains. To the best of our knowledge, OMGTex is the first geometry-free inference framework that achieves robust, style-consistent, and editable facial texture reconstruction across diverse domains. Our method achieves state-of-the-art performance on multiple facial texture benchmarks.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 1.0/10 | 1.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
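Scoring rationale: The paper studies diffusion-based facial texture reconstruction, which belongs to computer vision (CV), so relevance to CV is fairly high (8). It involves multi-style image input and texture output, but it is not multimodal (MultiModal) research in the strict sense, because both input and output are visual and only the style varies; MultiModal therefore scores low (1). The other keywords (Unify Models, World Models, MLLM, model-based RL, OPD, RL, GRPO) are entirely unrelated to the paper and score 0. The author list does not include any of the designated experts.
Keywords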
facial texture reconstruction, diffusion model, UV texture, geometry-free, semantic-aware editing, multi-style, CANVAS dataset
Abstract (Chinese translation)
本文将机器学习应用于版本控制合并这一重要且具有挑战性的任务。(1) 我们构建了一个名为Merge-Bench的数据集,包含来自1439个GitHub仓库的7938个真实世界合并冲突块。真实标注(ground truth)是开发者提交至仓库的合并解决方案。我们的数据集构建方法具有可扩展性,能够处理任意规模的数据,因为无需人工标注。(2) 我们训练了一个名为LLMergeJ的模型,用于解决Java程序中的合并冲突。该方法采用组相对策略优化(Group Relative Policy Optimization, GRPO),一种在线强化学习方法,来训练大型语言模型(Large Language Model, LLM)。(3) 我们对LLM在解决合并冲突方面的性能进行了两项评估。在Java程序上,拥有140亿参数的LLMergeJ超越了3个商业LLM,仅次于Gemini 2.5 Pro。在11种编程语言中,商业LLM的性能在不同语言间基本保持稳定。表现最佳的模型能够正确解决的合并冲突比例仍低于60%。
Abstract
This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. Our dataset construction methodology is scalable to arbitrary amounts of data since no manual labeling is required. (2) We trained a model, LLMergeJ, to resolve merge conflicts in Java programs. Our approach uses Group Relative Policy Optimization (GRPO), an online reinforcement learning method, to train a Large Language Model (LLM). (3) We performed two evaluations of the performance of LLMs on resolving merge conflicts. On Java programs, LLMergeJ with 14B parameters outperforms 3 commercial LLMs, trailing only Gemini 2.5 Pro. Across 11 programming languages, commercial LLM performance is largely stable from language to language. The best models correctly resolve less than 60% of merge conflicts.
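For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage step: several candidate outputs are sampled for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The abstract does not specify the reward used to train LLMergeJ, so the binary reward below (1 if a sampled resolution matches the developer's committed resolution, 0 otherwise) is a hypothetical stand-in, and `group_relative_advantages` is an illustrative name.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Candidates that beat the group mean get a positive advantage,
    the rest a negative one; eps guards against a zero-variance group.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# Hypothetical rewards for four sampled resolutions of one conflict hunk:
# 1.0 = matches the committed resolution, 0.0 = does not (an assumption).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline is the group mean rather than a learned value function, no separate critic model is needed, which is one reason the method is attractive for training large language models.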
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 3.0/10 | 3.0 |
| GRPO | 1.0 | 5.0/10 | 5.0 |
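Scoring rationale: The paper is about using large language models to resolve merge conflicts in version control and is entirely unrelated to Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, and OPD. RL and GRPO are mentioned as the training method but are not the paper's core research subject (GRPO is a specific optimization algorithm; RL is reinforcement learning in the broad sense), so their relevance is rated low.
Keywords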
merge conflicts, large language models, GRPO, reinforcement learning, version control, Java, dataset, LLMergeJ
Abstract (Chinese translation)
视觉-语言-动作(Vision-Language-Action, VLA)模型正越来越多地部署于真实机器人上,其中每个预测的动作均被执行,且每次失败都伴随安全代价。这些模型在干净输入上能达到较高成功率,但在微小对抗扰动下会崩溃。针对OpenVLA-7B的$16/255$ PGD攻击使LIBERO成功率从$95\%$以上骤降至$5\%$以下。经验性防御方法虽能恢复部分鲁棒性,但会牺牲干净准确率,而现有文献并未说明这种权衡是否存在理论下限。我们证明该下限确实存在。对于任何采用离散动作的VLA策略,其能力(策略动作与理想动作之间的互信息)与鲁棒性(对抗扰动下保留的互信息,扣除平凡信道泄漏)之和受限于一个与策略无关的预算:任务熵与对抗信道容量之和。该证明基于两次应用数据处理不等式及互信息非负性。当前模型上的像素级界限较为宽松(约$10^3$纳特),但一个编码器特定的推论将信道限制在策略相关子空间内,使OpenVLA上的预算从约$5{,}000$纳特降至约$31$纳特;现有策略已消耗该更紧预算的约$24\%$,留给同时提升鲁棒性的空间十分有限。我们通过$252$个闭式高斯-VLA单元和$48$个OpenVLA-7B $\times$ LIBERO $\times$ PGD单元(零违规)验证了该界限。我们提出将编码器特定松弛量作为防御论文的归一化比较轴,并公开所有代码、清单及结果。
Abstract
Vision-Language-Action (VLA) models are increasingly deployed on real robots, where each predicted action is executed and each failure carries a safety cost. They reach high success rates on clean inputs but collapse under small adversarial perturbations. A $16/255$ PGD attack on OpenVLA-7B drops LIBERO success from above $95\%$ to under $5\%$. Empirical defenses recover some robustness at a cost in clean accuracy, but the literature does not say whether the trade-off has a theoretical floor. We prove that it does. For any VLA policy with discrete actions, the sum of capability (mutual information between policy action and oracle action) and robustness (mutual information preserved under adversarial perturbation, net of trivial channel leakage) is upper-bounded by a policy-independent budget: task entropy plus adversarial channel capacity. The proof is two applications of the Data Processing Inequality plus MI non-negativity. The pixel-level bound is loose on current models ($\sim 10^3$ nats), but an encoder-specific corollary restricts the channel to the policy-relevant subspace, reducing the budget from $\sim 5{,}000$ to $\sim 31$ nats on OpenVLA; the policy already consumes $\sim 24\%$ of this tighter budget, leaving limited room for simultaneous robustness improvement. We validate the bound across $252$ closed-form Gaussian-VLA cells and $48$ OpenVLA-7B $\times$ LIBERO $\times$ PGD cells (zero violations). We propose encoder-specific slack as a normalized comparison axis for defense papers, and release all code, manifests, and results.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 2.0/10 | 2.0 |
| MultiModal | 1.0 | 6.0/10 | 6.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
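Scoring rationale: The paper studies the information-theoretic trade-off between capability and robustness of Vision-Language-Action (VLA) models under adversarial perturbation; its core is a theoretical proof plus empirical validation. It has some relevance to MultiModal (VLA involves vision and language) and a weak link to CV (visual input), but it does not involve Unify Models, World Models, multimodal large language models (MLLM), model-based RL, OPD, RL, or GRPO.
Keywords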
Vision-Language-Action, robustness, capability, information theory, adversarial perturbation, VLA, Data Processing Inequality
Abstract (Chinese translation)
大语言模型(Large Language Models, LLMs)在基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)赋能下,推理能力取得了显著进展。然而,RLVR本质上依赖真实标签(ground-truth labels)进行奖励计算,而在实际场景中获取这些标签的成本往往高得令人望而却步。尽管无监督的RLVR范式试图通过基于伪标签(pseudo-labels)的训练来规避这一问题,但它们极易陷入训练崩溃。此外,不同样本通常具有不同的标注价值。在本文中,我们提出基于主动可验证奖励的强化学习(Reinforcement Learning with Active Verifiable Rewards, RLAVR),该方法主动为少量精选样本获取真实标签,并将其与伪标签相结合,从而在有限标注预算下稳定训练动态并提升性能。为识别有价值的样本,我们提出了修正优势差距(Corrective Advantage Gap, CAG)指标,并分析了样本层面的监督价值。在此基础上,我们引入了面向RLAVR的修正感知可靠性估计(Correction-Aware Reliability Estimation for RLAVR, CARE),该机制将理想的CAG准则转化为一种实用的查询前获取策略,以显著提升训练稳定性。跨不同领域、模型家族及模型规模的广泛实验证明了我们方法的有效性与通用性。我们的代码已开源至 https://github.com/Lumina04/CARE。
Abstract
Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expensive in real-world scenarios. While unsupervised RLVR paradigms attempt to circumvent this by training on pseudo-labels, they are notoriously susceptible to training collapse. Moreover, different samples often exhibit varying annotation values. In this paper, we propose Reinforcement Learning with Active Verifiable Rewards (RLAVR), which actively acquires ground-truth labels for a small set of selected samples and integrates them with pseudo-labels, thereby stabilizing training dynamics and improving performance under limited annotation budgets. To identify valuable samples, we propose the Corrective Advantage Gap (CAG) metric and analyze the sample-level supervision value. Building on this, we introduce Correction-Aware Reliability Estimation for RLAVR (CARE), which translates the oracle CAG criterion into a practical pre-query acquisition policy to substantially improve training stability. Extensive experiments across diverse domains, model families, and model scales demonstrate the effectiveness and generality of our approach. Our code is available at https://github.com/Lumina04/CARE.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 8.0/10 | 8.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
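Scoring rationale: The paper focuses on active label acquisition in reinforcement learning; its core is RLVR (Reinforcement Learning with Verifiable Rewards) and an active-learning strategy (CAG, CARE). It has no direct connection to Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, or GRPO. Only RL, as the core method, is highly relevant (8); the paper does not cover multimodality, unified models, world models, or GRPO specifically. The author list does not include any of the designated experts.
Keywords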
Reinforcement Learning with Verifiable Rewards, Active Label Acquisition, Corrective Advantage Gap, Correction-Aware Reliability Estimation, Pseudo-Labeling, Training Stability, Limited Annotation Budget
Abstract (Chinese translation)
我们证明,当某些任务超出智能体的可靠能力范围时,任何具有置信门控自主性的强化学习策略都无法在理性监督下同时实现最大帮助性、最优校准和完全自主性:即行为可信三难困境(Behavioral Credibility Trilemma)。这一不可能性具有几何本质——在严格适当的评分规则(strictly proper scoring rule)中添加任何非仿射的自主性激励都会破坏严格适当性,因此,一个因校准置信度和自主行动而获得奖励的智能体,会在低于委托人批准阈值的任务上系统性地夸大其报告的置信度。行为扰动引理(Behavioral Perturbation Lemma)量化了这种夸大程度(对于Brier分数,其缩放比例为$w_A/(2 w_C)$),并表明检测需要$\Omega(1/\Delta^2)$次观测。我们证明委托人的最优监督规则必然是非仿射的,这使得该不可能性成为无条件的,并且在对数凹密度策略族(log-concave-density policy families)中与优化器无关。我们形式化了置信门控决策问题(Confidence-Gated Decision Problem),将现有方法映射到三难困境上,并识别出两条建设性的解决路径(承诺机制、领域分离)。一项包含540种配置的Best-of-N实验测试了五个预先注册的假设,所有假设均得到强烈证实(效应量$d = 1.10$至$5.32$),并附加了对可达$(H, C, A)$曲面几何的描述性分析,显示出一个与预测的夸大饱和一致的高原截断前沿。
Abstract
We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight, whenever some tasks exceed the agent's reliable competence: the Behavioral Credibility Trilemma. The impossibility is geometric -- adding any non-affine autonomy incentive to a strictly proper scoring rule destroys strict properness, so an agent rewarded for both calibrated confidence and autonomous action systematically inflates its reported confidence on tasks below the principal's approval threshold. The Behavioral Perturbation Lemma quantifies the inflation (scaling as $w_A/(2 w_C)$ for the Brier score) and shows detection requires $\Omega(1/\Delta^2)$ observations. We prove the principal's optimal oversight rule is necessarily non-affine, making the impossibility unconditional and optimizer-independent across log-concave-density policy families. We formalize the Confidence-Gated Decision Problem, map existing methods onto the trilemma, and identify two constructive resolution pathways (commitment, domain separation). A 540-configuration Best-of-N experiment tests five pre-registered hypotheses, all strongly confirmed (effect sizes $d = 1.10$ to $5.32$), and adds a descriptive analysis of the achievable-$(H, C, A)$ surface geometry showing a plateau-truncated frontier consistent with the predicted inflation saturation.
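The $w_A/(2 w_C)$ scaling can be illustrated with a toy calculation. This is not the paper's lemma or its proof, only a sketch under simplifying assumptions chosen here: a single binary task with true success probability `p`, a calibration term equal to the negative expected Brier loss weighted by `w_C`, and an autonomy bonus that grows linearly with the reported confidence `q` at rate `w_A`. The paper's actual incentive form may differ. Under these assumptions the objective is $-w_C\,[(q-p)^2 + p(1-p)] + w_A\,q$, whose maximizer is $q^* = p + w_A/(2 w_C)$, so the report is inflated by exactly that amount.

```python
import numpy as np


def expected_reward(q, p, w_C, w_A):
    """Toy objective: weighted negative expected Brier loss plus a linear bonus.

    For a binary outcome with success probability p, the expected Brier
    loss of report q is (q - p)^2 + p * (1 - p). The linear bonus w_A * q
    is an assumption made for this illustration only.
    """
    brier = (q - p) ** 2 + p * (1.0 - p)
    return -w_C * brier + w_A * q


p, w_C, w_A = 0.4, 1.0, 0.2          # hypothetical values
grid = np.linspace(0.0, 1.0, 1001)   # candidate reports q
q_star = grid[np.argmax(expected_reward(grid, p, w_C, w_A))]
inflation = q_star - p
print(f"optimal report = {q_star:.3f}, inflation = {inflation:.3f}, "
      f"w_A / (2 w_C) = {w_A / (2 * w_C):.3f}")
```

With `w_A = 0` the maximizer returns to `q = p`, which is the strict properness of the Brier score that the added incentive breaks.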
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 8.0/10 | 8.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
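Scoring rationale: The paper's core is an impossibility theorem (the Behavioral Credibility Trilemma) for confidence-gated autonomy in reinforcement learning (RL) policies. It is RL theory (the trade-off between confidence calibration and autonomy) and is entirely unrelated to MultiModal, Unify Models, World Models, CV, MLLM, model-based RL, OPD, and GRPO. RL relevance is fairly high (8) because the paper rigorously proves that RL policies cannot simultaneously achieve helpfulness, calibration, and autonomy under certain conditions; it is, however, a theoretical analysis rather than an RL application or algorithmic improvement.
Keywords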
Behavioral Credibility Trilemma, Reinforcement Learning, Confidence Calibration, Autonomy, Strictly Proper Scoring Rule, Oversight, Brier Score
Abstract (Chinese translation)
仅最小化真实置信度的黑盒对抗攻击存在类别漂移问题:扰动在特征空间中游荡,未锁定特定对抗类别,浪费查询在分散、无方向的进展上。我们提出机会性目标选择(Opportunistic Target Selection, OTS),这是一种轻量级包装器,能在非定向攻击的早期阶段将其切换为定向目标,锁定当前领先的非真实类别。OTS无需对底层攻击进行架构修改,无需梯度访问,也无需先验的目标类别知识。我们在五个标准ImageNet分类器(4500次运行)上,基于三种基于分数的攻击(SimBA、使用交叉熵损失的Square Attack和Bandits)验证了OTS。在随机搜索攻击中,OTS紧密追踪理想性能,在ResNet-50上成功率提升高达27个百分点,审查均值迭代次数相对减少43%。在梯度估计攻击(Bandits)和基于边际损失的攻击中,OTS是冗余的,这一负面结果强化了我们将OTS视为边际损失替代的解释。在对抗训练模型上,双峰难度分布消除了目标选择有帮助的场景。
Abstract
Black-box adversarial attacks that minimize only the ground-truth confidence suffer from class drift: perturbations wander through the feature space without committing to a specific adversarial class, wasting queries on diffuse, undirected progress. We introduce Opportunistic Target Selection (OTS), a lightweight wrapper that switches an untargeted attack to a targeted objective early in its trajectory, locking onto whichever non-true class currently leads. OTS requires no architectural modification to the underlying attack, no gradient access, and no a priori target-class knowledge. We validate OTS on three score-based attacks (SimBA, Square Attack with cross-entropy loss, and Bandits) across five standard ImageNet classifiers (4,500 runs). On random-search attacks, OTS closely tracks oracle performance, with gains up to +27 pp in success rate and 43% relative reduction in censored-mean iterations on ResNet-50. On gradient-estimation attacks (Bandits) and attacks with margin loss, OTS is redundant, a negative result that reinforces our interpretation of OTS as a margin-loss surrogate. On adversarially-trained models, a bimodal difficulty distribution eliminates the regime where targeting helps.
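The lock-on step described in the abstract, choosing whichever non-true class currently leads, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `select_target` is hypothetical, the input is assumed to be the classifier's score vector for the current perturbed image, and the rule for when during the attack to switch objectives is not given in the abstract and is left out.

```python
import numpy as np


def select_target(scores, true_label):
    """Return the index of the highest-scoring class other than the true one.

    `scores` is the classifier's output (probabilities or logits) for the
    current perturbed input; the true class is masked out before argmax.
    """
    masked = np.asarray(scores, dtype=float).copy()
    masked[true_label] = -np.inf
    return int(np.argmax(masked))


# Hypothetical 3-class score vector where class 0 is the ground truth.
scores = [0.6, 0.1, 0.3]
target = select_target(scores, true_label=0)
print(target)
```

Once a target is fixed, a score-based attack can maximize that class's score instead of only minimizing the true-class confidence, which is how a wrapper of this kind would avoid the class drift the abstract describes.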
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
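Scoring rationale: The paper studies a target selection strategy (OTS) for black-box adversarial attacks. Its core is attacking image classifiers, a computer vision (CV) setting, and it is entirely unrelated to MultiModal, Unify Models, World Models, reinforcement learning, and the other keywords. It involves ImageNet classifiers and typical CV models such as ResNet-50, so CV scores 8 (the core application domain, though not a methodological contribution to CV). None of the other keywords appear in the title, abstract, or method, so they score 0.
Keywords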
Black-box adversarial attack, Opportunistic Target Selection, Query efficiency, Score-based attack, ImageNet classifiers, Class drift, Targeted attack
Abstract (Chinese translation)
因果Transformer语言模型受限于严格顺序解码和每步二次方注意力计算成本。尽管线性时间因果模型和离散扩散模型各自解决了这些缺陷,但两者的整合本质上仍存在矛盾:扩散需要双向注意力,而因果模型则是单向的。为统一这些架构,我们提出了$B^3D-RWKV$,这是一种扩散RWKV变体,通过一种\emph{三元组块布局}(triplet-block layout)方法,将模型的$O(L)$推理效率与并行、双向离散扩散相结合。在8任务测试集上,$B^3D-RWKV-7.2B$达到了与现有模型相当的准确率,同时在解码吞吐量上显著超越基线模型,平均实现了$\mathbf{1.6\times}$的加速。
Abstract
Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attention, while causal models are unidirectional. To unify these architectures, we propose $B^3D$-RWKV, a diffusion RWKV variant that integrates the model's $O(L)$ inference efficiency with parallel, bidirectional discrete diffusion through a *triplet-block layout* method. $B^3D$-RWKV-7.2B reaches comparable accuracy on an 8-task suite versus existing models while significantly outperforming baselines in decoding throughput, with an average speedup of $\mathbf{1.6\times}$.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 8.0/10 | 8.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
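Scoring rationale: The paper proposes B^3D-RWKV, a diffusion RWKV variant whose core contribution is unifying a linear-time causal model (RWKV) with bidirectional discrete diffusion through a triplet-block layout, enabling efficient parallel decoding. This is highly relevant to 'Unify Models' (8), because the paper explicitly aims to unify the two architectures. However, the paper focuses entirely on language-model decoding efficiency and architectural innovation and does not touch on world models, multimodal large language models (MLLM), computer vision (CV), multimodality (MultiModal), model-based RL, OPD, reinforcement learning (RL), or GRPO, so those keywords score 0.
Keywords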
Diffusion RWKV, B^3D-RWKV, triplet-block layout, linear-time causal model, discrete diffusion, parallel decoding, inference efficiency
Abstract
Automated pavement distress assessment requires more than image-level classification or coarse bounding box detection, demanding precise localization of thin, branching, and irregular cracks to achieve the geometric precision necessary for maintenance-relevant quantification. This paper presents a vision-based pavement distress analysis system based on Mask R-CNN instance segmentation and evaluates it on UWGB-StreetCrack, a custom field-collected roadway image dataset acquired with a vehicle-mounted smartphone and manually annotated with polygon labels for longitudinal cracks, transverse cracks, alligator cracks, and potholes. Five Detectron2-based Mask R-CNN backbone variants were considered under a consistent fine-tuning protocol. The best-performing model, Mask R-CNN with a ResNet-101 FPN backbone, achieved 84.23% precision, 90.04% recall, and an F1 score of 87.04% under the project-specific bounding-box matching protocol. The same model produced an aggregate predicted crack-area fraction of 2.164%, closely matching the 2.170% ground-truth crack-area fraction. To contextualize the segmentation system against a detector-oriented alternative, a CSPDarknet53-based YOLO detector was also adapted and retrained on the dataset, reaching 27.5% precision and 20.7% recall on the validation protocol. The results show that instance segmentation is a practical direction for field pavement imagery and aggregate crack-area estimation, while also exposing open challenges in annotation consistency, class imbalance, confounder rejection, and mask-level benchmarking.
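As a consistency check on the reported metrics, the F1 score can be recomputed from the stated precision and recall as their harmonic mean. The snippet is only a worked check of the numbers quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for Mask R-CNN with a ResNet-101 FPN backbone, in percent.
f1 = f1_score(84.23, 90.04)
print(round(f1, 2))  # 87.04, matching the reported F1
```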
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
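Scoring rationale: The paper focuses on pixel-level pavement distress assessment with instance segmentation (Mask R-CNN), an application of computer vision (CV) to civil engineering. It is unrelated to Unify Models, World Models, MLLM, MultiModal, model-based RL, OPD, RL, and GRPO, which concern frontier AI areas such as multimodal large models, world models, and reinforcement learning, whereas this paper addresses a conventional visual segmentation task. CV scores 8 because the core of the paper is visual segmentation, although it is a domain-specific application rather than a cutting-edge general CV method (such as a foundation model).
Keywords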
pavement distress assessment, instance segmentation, Mask R-CNN, crack detection, computer vision, roadway image dataset, crack-area estimation
Abstract
Pathological assessment guides lung cancer diagnosis, treatment selection, and prognostic evaluation, yet current CPath approaches rely on task-specific models for isolated objectives. Although pan-cancer foundation models offer versatility, they lack subspecialty-level depth and have not been evaluated across clinical workflows or prospectively validated in real-world settings. We introduce PulmoFoundation, a multi-center, prospectively validated, randomized controlled trial (RCT)-evaluated foundation model for comprehensive lung pathology assessment across pre-operative, intra-operative, and post-operative care. Built upon Virchow2 via subspecialty-specific pretraining using ~40,000 diagnostic H&E-stained whole-slide images (WSIs), PulmoFoundation was systematically evaluated on ~26,000 WSIs across 32 clinically relevant tasks. In addition to accurately predicting molecular markers and patient survival, our model achieves clinical-grade performance in core diagnostic tasks across biopsy, frozen section, and surgical resection slides. In a registered prospective study of 1,357 patients across 11 diagnostic tasks, our model achieved an average AUC of 92.3%. Using pre-specified triage thresholds, PulmoFoundation could reduce additional second-review burden for 68.8% of biopsies and 83.0% of frozen sections, and defer 44.5% of IHC stain orders, with PPVs of 1.0, 0.991, and 0.966. Beyond prospective validation, we conducted a crossover RCT with eight pathologists, in which AI assistance improved diagnostic accuracy across 4,928 case-reader pairs (91.7% w/ AI vs. 83.8% w/o AI). AI assistance also reduced median diagnostic time by 19.6%, increased diagnostic confidence by 8.7%, and improved inter-rater agreement from moderate (kappa = 0.56) to substantial (kappa = 0.76). Together, these evaluations support PulmoFoundation as a clinically validated decision-support system for lung pathology.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 8.0/10 | 8.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
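Scoring rationale: The paper focuses on a foundation model for clinical lung pathology (PulmoFoundation), covering whole-slide image analysis, molecular marker prediction, and survival prognosis in computational pathology (CPath); this is an application of computer vision (CV) to medical imaging. It does not involve multimodal large language models (MLLM), unified models (Unify Models), world models (World Models), reinforcement learning (RL) or its variants (model-based RL, GRPO, OPD), or multimodal (MultiModal) fusion (only H&E-stained images are used). Only the CV keyword therefore receives a high score (8); all others score 0.
Keywords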
Foundation Model, Computational Pathology, Lung Cancer, Whole-Slide Image, Clinical Validation, Randomized Controlled Trial, Diagnostic Accuracy
Abstract
Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in designing informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
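The first evaluation axis, trajectory MSE on held-out particles, could be computed along the lines below. The abstract does not define the metric precisely, so this sketch assumes 2D positions and averages the squared Euclidean error over every particle and time step; the data layout and the function name are illustrative.

```python
def trajectory_mse(predicted, observed):
    """Mean squared position error over all particles and time steps.

    Each argument is a list of trajectories; a trajectory is a list of
    (x, y) positions, one per time step (assumed layout).
    """
    total, count = 0.0, 0
    for pred_traj, obs_traj in zip(predicted, observed):
        for (px, py), (ox, oy) in zip(pred_traj, obs_traj):
            total += (px - ox) ** 2 + (py - oy) ** 2
            count += 1
    if count == 0:
        raise ValueError("no trajectory points to compare")
    return total / count


# One held-out particle, two time steps; the prediction is off by 1 in y at step 2.
predicted = [[(0.0, 0.0), (1.0, 1.0)]]
observed = [[(0.0, 0.0), (1.0, 2.0)]]
mse = trajectory_mse(predicted, observed)  # (0 + 1) / 2 = 0.5
```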
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 7.0/10 | 7.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
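Scoring rationale: The paper has an LLM discover physical laws through experiments in a simulated physical world, which falls under scientific reasoning and world models (World Models), because the LLM must build an internal representation of the simulated world and update its hypotheses from experimental data. It does not involve multimodality (MultiModal), unified models (Unify Models), reinforcement learning (RL/model-based RL/GRPO), computer vision (CV), or OPD (which may refer to some other specific area). Only World Models therefore scores 7 (core relevance, though not the conventional definition of a world model); all other keywords score 0.
Keywords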
LLM, scientific reasoning, physics discovery, interactive benchmark, world model, hypothesis refinement, experiment design, N-body simulation
Abstract
Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in the decoder's embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.
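To make the adapter idea concrete, the sketch below maps one donor activation vector to a small number of soft tokens in a decoder's embedding space. The abstract does not describe the adapter architecture, so a single linear projection per soft token is an assumption, as are the dimensions and the names `make_adapter` and `to_soft_tokens`.

```python
import random


def make_adapter(d_donor, d_decoder, n_tokens, seed=0):
    """Randomly initialised weights: one (d_decoder x d_donor) matrix per soft token."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(d_donor)] for _ in range(d_decoder)]
        for _ in range(n_tokens)
    ]


def to_soft_tokens(adapter, activation):
    """Project a donor activation into n_tokens vectors of size d_decoder."""
    return [
        [sum(w * a for w, a in zip(row, activation)) for row in token_matrix]
        for token_matrix in adapter
    ]


adapter = make_adapter(d_donor=4, d_decoder=3, n_tokens=2)
soft_tokens = to_soft_tokens(adapter, [1.0, 0.0, -1.0, 0.5])
# soft_tokens has shape (2, 3) and would be prepended to the decoder's input embeddings.
```

Because only the adapter depends on the donor's hidden size, swapping donors means training a new `adapter` while the decoder side stays frozen, which is the adapter-only transfer setting the abstract describes.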
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 1.0/10 | 1.0 |
| CV | 1.0 | 1.0/10 | 1.0 |
| MultiModal | 1.0 | 3.0/10 | 3.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
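Scoring rationale: The paper studies a unified framework (UAV) for cross-model activation explanation, whose core is mapping activations to natural language. It does not involve multimodal large language models (MLLM), world models, reinforcement learning (RL/GRPO), model-based RL, or open object detection (OPD). It has some connection to Unify Models (unifying explanations across different models), but it does not unify multimodal models themselves; it is weakly related to MultiModal (explaining language and visual features, though this is not central) and weakly related to CV (visual classification tasks are involved). The remaining keywords (World Models, MLLM, model-based RL, OPD, RL, GRPO) are almost entirely unrelated.
Keywords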
Activation Verbalization, Cross-Model Explanation, Shared Decoder, Adapter, LoRA, Interpretability, Hidden Representations
Abstract
While role-playing agents excel in short-term interactions, long-term conversations overwhelm context windows, motivating external memory frameworks. Current systems typically rely on persona-agnostic summarization, which records facts without persona-specific interpretation, yielding generic responses that compromise persona fidelity. To bridge this gap, we introduce RoleMemo, a dataset featuring four reasoning tasks where the factual fragments must be interpreted through the persona to reach the correct answer. Evaluation on RoleMemo exposes critical limitations of persona-agnostic frameworks. We thus propose DualMem, which decouples memory into two streams: factual cognition and persona-conditioned insight. Trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), our framework with a 4B-parameter model outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 for sustained persona fidelity. Our resources are available at https://github.com/role2026/rolememo.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 1.0/10 | 1.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 6.0/10 | 6.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
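Scoring rationale: The paper studies a long-term memory framework for role-playing agents, centered on maintaining persona consistency, and is unrelated to the multimodal, world-model, unified-model, and model-based RL keywords. RL is mentioned only as one of the training methods (SFT and RL); it is not a core contribution, and GRPO is not involved. MLLM has only a faint connection, through the use of DeepSeek-V3.2 as a baseline. The remaining keywords are unrelated.
Keywords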
Role-Playing Agents, Long-term Memory, Persona Fidelity, Dual Memory Framework, Reinforcement Learning, Supervised Fine-Tuning, RoleMemo Dataset, DualMem
Abstract
Capturing relightable 3D assets from real-world objects is a widely researched problem. Several per-scene optimization-based methods, based on 3D Gaussian splatting (3DGS), support relighting; however, they usually require dense input views, and their overfitting nature makes it difficult to generalize across scenes. Unlike per-scene optimization methods, generalized feed-forward models can directly reconstruct Gaussians from sparse input views. However, the resulting assets have baked-in illumination and cannot be easily used for relighting. In this paper, we present F-RNG, a feed-forward framework that directly generates relightable 3DGS assets from sparse-view inputs. Training such a model from scratch can require massive data and computing resources, and it is especially challenging to generate relightable assets in a feed-forward manner with acceptable cost. We develop F-RNG upon an existing large reconstruction model (LRM) to extract relightable representations, while also utilizing priors from an intrinsic decomposition model (IDM). Specifically, we first introduce a latent-interpolated fine-grained geometry synthesis to enhance the LRM's geometry representation. Second, we propose a prior-guided relightable appearance distillation to extract relightable neural representations by incorporating IDM priors. Finally, a universal neural renderer enables flexible and high-fidelity relighting. F-RNG requires neither re-training nor fine-tuning of the underlying LRMs and can therefore automatically benefit from better LRMs and IDMs in the future. With only small networks that can be trained with affordable data and computational resources, F-RNG avoids the repetitive inference of large models under different light conditions. Compared with the state-of-the-art LRM-based relighting method, F-RNG achieves ~25x faster relighting and superior quality (~+2.0 dB).
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 6.0/10 | 6.0 |
| MultiModal | 1.0 | 1.0/10 | 1.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
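Scoring rationale: The paper studies feed-forward relightable neural Gaussians (F-RNG), which belongs to computer vision and graphics and has some connection to CV (6). It does not involve unified models, world models, multimodal large language models, reinforcement learning, OPD, or GRPO, so those keywords score 0 or very low. MultiModal receives a marginal 1 only because the input consists of multi-view images.
Keywords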
Feed-Forward, Relightable, Neural Gaussians, 3D Gaussian Splatting, Large Reconstruction Model, Intrinsic Decomposition, Relighting
Abstract
Recent video super-resolution (VSR) approaches use deep neural networks to enhance low-quality input videos and recover visual detail, with diffusion-based methods in particular showing promising results. In this paper, we investigate whether existing video quality models can be used to assess the performance of these diffusion-based VSR methods, by comparing model predictions with results from a subjective test. The study compares six upscaling methods (Lanczos, Rhea, SCST, DOVE, SeedVR2, Starlight Mini) applied to both compressed (AV1 and DCVC-RT) and uncompressed low-resolution videos, considering play-out on a UHD-1/4K screen. A range of full- and no-reference quality models are used to assess their applicability to this new type of quality degradation, focusing on within-sequence performance. The results highlight that CNN-based full-reference models, such as LPIPS, DISTS, and CVQA-FR, show significantly higher correlation coefficients than both the conventional full-reference models and the tested no-reference models. Most models overestimate the overly sharp results of SCST, while VMAF fails mainly due to spatial inconsistencies introduced by Starlight Mini. None of the tested video quality models reaches sufficient accuracy to replace complementary subjective testing. The reference, degraded, and upscaled videos, as well as the user ratings and model scores, are made available as open data with the paper at https://github.com/Telecommunication-Telemedia-Assessment/AVT-VQDB-UHD-1-VSR.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 5.0/10 | 5.0 |
| MultiModal | 1.0 | 2.0/10 | 2.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
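Scoring rationale: The paper evaluates video quality models on diffusion-based video super-resolution (VSR) methods. It has some connection to computer vision (CV), since video super-resolution is a CV task, but its core is video quality assessment rather than CV algorithms themselves. As for MultiModal, video is a visual modality, but the paper does not emphasize multimodal interaction or fusion and deals only with visual quality. The remaining keywords (Unify Models, World Models, MLLM, model-based RL, OPD, RL, GRPO) are entirely unrelated: the paper does not involve unified models, world models, multimodal large language models, model-based reinforcement learning, object-aware detection, reinforcement learning, or the GRPO algorithm. Relevance scores are therefore low.
Keywords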
Video Super-Resolution, Diffusion-Based Methods, Video Quality Models, Subjective Test, Full-Reference, No-Reference, CNN-based, LPIPS
Abstract
Estimating the precise timing of batting impact is crucial for understanding rapid sensorimotor control. However, this task is challenging for RGB cameras due to insufficient temporal resolution and motion blur. Similarly, Inertial Measurement Units (IMUs) are impractical for actual matches due to sensor intrusiveness and their limited temporal precision. To overcome these limitations, we propose a novel framework leveraging event-based cameras, which offer microsecond resolution and high dynamic range, to estimate impact timing based on the weighted centroid distance between the detected ball and bat. To address the domain gap between event frames and RGB images that degrades segmentation accuracy, we generate high-density event frames. We then introduce a mask refinement network that leverages these frames and bidirectional mask information, optimized using a novel loss function. Experiments on real-world datasets demonstrate that our method achieves superior accuracy under challenging conditions, including low-light environments and severe occlusions, outperforming baselines by reducing the Mean Absolute Error by approximately 63%.
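The timing cue named in the abstract, the weighted centroid distance between ball and bat, might look like the sketch below. The weighting scheme and the decision rule are not given in the abstract; here the mask values act as per-pixel weights and impact is taken as the frame with the smallest distance, both of which are assumptions for illustration.

```python
import math


def weighted_centroid(mask):
    """Centroid (x, y) of a 2D grid of non-negative weights, or None if empty."""
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(mask):
        for x, weight in enumerate(row):
            total += weight
            sum_x += weight * x
            sum_y += weight * y
    if total == 0:
        return None
    return (sum_x / total, sum_y / total)


def centroid_distance(ball_mask, bat_mask):
    """Euclidean distance between the two weighted centroids, or None if undetected."""
    ball = weighted_centroid(ball_mask)
    bat = weighted_centroid(bat_mask)
    if ball is None or bat is None:
        return None
    return math.hypot(ball[0] - bat[0], ball[1] - bat[1])


def impact_index(distances):
    """Assumed rule: impact is the frame where ball and bat are closest."""
    valid = [(d, i) for i, d in enumerate(distances) if d is not None]
    return min(valid)[1] if valid else None


distance = centroid_distance([[0, 0, 1]], [[1, 0, 0]])  # centroids at x=2 and x=0
frame = impact_index([5.0, 2.0, None, 3.0])             # frame 1 is closest
```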
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 5.0/10 | 5.0 |
| MultiModal | 1.0 | 2.0/10 | 2.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
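Scoring rationale: The paper studies batting impact timing estimation with event cameras, a motion-analysis application within computer vision (CV), and is unrelated to keywords such as multimodal large models, world models, and reinforcement learning. The CV score of 5 reflects some relevance (event cameras are CV sensors), the MultiModal score of 2 reflects that only a single visual modality is involved, and all other keywords score 0.
Keywords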
event-based cameras, batting impact estimation, mask refinement, high-density event frames, temporal resolution, sensorimotor control, weighted centroid distance
Abstract
Multi-trait essay scoring aims to provide fine-grained evaluation of writing quality across multiple dimensions. However, how to effectively post-train autoregressive scoring models remains underexplored. In this paper, we propose Trait-Aware Policy Optimization (TAPO), a post-training framework tailored to autoregressive multi-trait scoring. Our method decomposes rewards along both the sample and trait dimensions, combining global scoring consistency, trait-level accuracy, format validity, and inter-trait dependency preservation. In addition, we augment supervised fine-tuning with enhanced prompts, allowing the model to internalize trait semantics before preference optimization. Experiments across multiple backbone models show that our method consistently improves multi-trait scoring performance over supervised fine-tuning and scalar-reward optimization baselines, demonstrating the effectiveness and transferability of trait-aware post-training for essay scoring.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 5.0/10 | 5.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
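Scoring rationale: The paper studies multi-trait essay scoring using autoregressive models and a policy-optimization-based post-training framework (TAPO), at the intersection of natural language processing and reinforcement learning. Among the keywords, RL is directly related to the policy optimization used in the paper, but the paper targets a specific application rather than general reinforcement learning, hence a score of 5. The remaining keywords are unrelated to the paper and score 0: Unify Models, World Models, MLLM (multimodal large language models), CV (computer vision), MultiModal, model-based RL, OPD (unknown abbreviation), and GRPO (a reinforcement learning method not mentioned in the paper). The author list includes none of the specified experts.
Keywords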
Multi-trait essay scoring, Autoregressive scoring models, Policy optimization, Post-training, Reward decomposition, Trait-aware, Supervised fine-tuning, Preference optimization
Abstract
Large language models (LLMs) can serve as the semantic-matching engine of a content-based publish/subscribe broker for agentic AI across the edge-cloud computing continuum, bridging the vocabulary and modality gaps that defeat keyword and embedding filters. Framed as offline multi-label retrieval over three public datasets spanning social-media, legal, and smart-home sensor domains (six LLMs, seven baselines), our central contribution is a two-crossover cost-accuracy characterisation: an analytical context-window crossover below which a CoverAndMerge compression pipeline reduces LLM invocations, and an empirical discrimination-capacity crossover above which matching accuracy collapses independently of context budget, by a model-dependent factor of parameter count and training generation. Two findings carry practical weight: above the discrimination crossover, compression cannot recover accuracy and only frontier-scale models clear large subscription sets; and there backend choice dominates configuration choice, so model selection, not pipeline tuning, is the primary operator lever. We accompany this with three composable algorithms and a per-cluster Quality-of-Experience framework for autonomic LLM-tier selection.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 1.0/10 | 1.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 2.0/10 | 2.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 2.0/10 | 2.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
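Scoring rationale: The paper studies an LLM-based neural router for semantic content matching in publish/subscribe systems for agentic AI. Its core is text semantic matching and multi-label retrieval; it does not involve unifying multimodal large models, world models, computer vision, reinforcement learning, or related algorithms (OPD, GRPO). It has only a faint connection to Unify Models and MultiModal (multimodal data may be involved but is not explored in depth); MLLM scores 2 because LLMs are used to process multi-domain data; the remaining keywords are entirely unrelated.
Keywords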
Neural Router, Semantic Content Matching, Agentic AI, LLM, publish/subscribe, edge-cloud continuum, multi-label retrieval, cost-accuracy characterization
Abstract
Event cameras offer significant advantages over conventional frame-based counterparts, including high temporal resolution, low latency, and energy efficiency. These characteristics make them suitable for high-speed and high-dynamic range scene acquisition scenarios; however, the lack of dense intensity frames limits the direct applicability of conventional computer vision methods for scene understanding. Event-to-video (E2V) reconstruction seeks to bridge this gap by converting asynchronous event streams into a sequence of synchronous video frames. Existing E2V reconstruction methods based on convolutional neural networks and transformers operate primarily in the spatial domain and often struggle to recover fine structural details while suppressing severe reconstruction artifacts. To address these issues, we propose MSFET-E2V, a novel multiscale frequency-enhanced transformer model. At its core lies a cross-domain attention module, which fuses spatio-temporal features with frequency-aware representations derived from the discrete wavelet transform. Unlike prior methods relying solely on spatial attention, our approach effectively captures both local and global structures by taking into account low- and high-frequency components, enhancing detail preservation and robustness across various motion scenarios. Furthermore, we propose a lightweight wavelet-enhanced skip block that serves as a skip connection, facilitating artifact suppression and structural detail refinement through joint spatial-frequency domain processing. Extensive experiments demonstrate that MSFET-E2V achieves superior performance over state-of-the-art methods on multiple real-world event datasets, offering significant gains in reconstruction quality. Moreover, compared to the existing transformer-based method, our proposed model significantly reduces the number of parameters, the GPU memory usage, and inference time.
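The frequency-aware representations come from a discrete wavelet transform, which splits a signal into a low-frequency (approximation) band and a high-frequency (detail) band. The abstract does not say which wavelet MSFET-E2V uses, so the one-level 1D Haar transform below is only an illustrative choice, and the function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt_1d(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficient lists."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    approx = [(signal[i] + signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt_1d(approx, detail):
    """Inverse of haar_dwt_1d: interleaves the reconstructed sample pairs."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


low, high = haar_dwt_1d([4.0, 4.0, 2.0, 6.0])
restored = haar_idwt_1d(low, high)  # recovers the input up to rounding error
```

Flat regions produce zero detail coefficients while edges produce large ones, which is why processing the two bands separately can help preserve fine structure.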
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 3.0/10 | 3.0 |
| MultiModal | 1.0 | 2.0/10 | 2.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
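Scoring rationale: The paper studies event-to-video (E2V) reconstruction from event cameras using deep neural networks with spatio-temporal and frequency enhancement. Its core is a low-level vision task within computer vision (CV), and it is unrelated to multimodal large language models (MLLM), unified models (Unify Models), world models (World Models), reinforcement learning (RL/model-based RL/GRPO), and OPD. It has only a faint connection to MultiModal (event streams and video frames can be viewed as two modalities), which is not central to the paper. Apart from the low relevance of CV, the remaining keywords therefore score 0.
Keywords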
Event cameras, Event-to-video reconstruction, Frequency-enhanced transformer, Spatio-temporal features, Discrete wavelet transform, Cross-domain attention, Multi-scale representation
Abstract
Synthesizing high-fidelity contrast-enhanced MRI is clinically valuable for safer and more efficient breast cancer screening, yet remains challenging due to complex lesion textures and heterogeneous enhancement patterns.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 4.0/10 | 4.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
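Scoring rationale: The paper studies contrast-enhanced breast MRI synthesis using a diffusion model and attention mechanisms, which belongs to medical image generation. Of the given keywords, only CV (computer vision) has a faint connection, because medical image processing is a subfield of CV; the paper does not involve core concepts such as multimodality, world models, reinforcement learning, or unified models. All other keywords are entirely unrelated.
Keywords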
SAFE-Diff, Scale-Aware Attention, Feature-Dispersive Diffusion, Uncertainty Estimation, Contrast-Enhanced Breast MRI Synthesis, breast cancer screening, lesion textures
Abstract
In biomedical and neurodegenerative disorders, accurate and early disease identification remains challenging due to the scarcity of labeled data and the complexity of imaging patterns. To address these challenges, we introduce ARMA-C3, a unified unsupervised and semi-supervised graph learning framework for node classification based on contrastive learning and graph-cut regularization to learn structurally meaningful and discriminative representations. By modeling samples or images as graph nodes and exploiting inter-sample relationships, the proposed framework captures subject-level dependencies that conventional machine learning methods typically overlook. We conduct extensive binary classification experiments across five clinically relevant datasets: the Alzheimer's Disease Neuroimaging Initiative (ADNI), the Neuroimaging in Frontotemporal Dementia (NIFD) dataset, and three medical imaging benchmarks (BreastMNIST, PneumoniaMNIST, and a liver ultrasound dataset). Experimental results demonstrate that ARMA-C3 achieves competitive and frequently superior performance compared to classical clustering techniques, state-of-the-art machine learning models, and existing graph-based deep learning approaches across multiple evaluation settings, particularly under limited supervision and severe class imbalance. The proposed framework further demonstrates robust representation learning and strong cross-modal generalization across diverse biomedical imaging modalities.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 3.0/10 | 3.0 |
| MultiModal | 1.0 | 1.0/10 | 1.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
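Scoring rationale: The paper studies an unsupervised/semi-supervised graph node classification framework based on contrastive learning and graph regularization, applied to biomedical images (e.g., ADNI, NIFD, and medical imaging benchmarks). It is almost unrelated to the given keywords: Unify Models, World Models, MLLM, model-based RL, OPD, RL, and GRPO are not addressed. CV has a weak connection only because the paper processes medical images (3 points), but the core is graph learning rather than conventional computer vision. MultiModal has a very weak connection only because several medical imaging modalities are involved (e.g., MRI, X-ray, ultrasound) (1 point).
Keywords
contrastive learning, graph learning, semi-supervised classification, unsupervised learning, biomedical imaging, node classification, graph-cut regularization
Abstract (Chinese translation)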
在盈利公告(Earnings Announcements, EAs)期间预测股票价格走势是一项重大挑战,原因在于市场噪声和高冲击性的价格不连续性。在本研究中,我们评估了公告前的新闻情绪、公司基本面以及近期市场动态是否能够共同预测股票在EA日的方向性价格变动。我们构建了一个多模态特征空间,结合了15个基本面指标、3个基于价格的技术指标以及使用FinBERT处理的金融新闻文章所提取的情绪得分。我们将长短期记忆网络(Long Short-Term Memory, LSTM)和基于Transformer的架构与逻辑回归基线进行比较,并进一步评估所有模型在有和没有情绪特征的情况下的表现,以量化其增量价值。我们的结果表明,虽然LSTM通过保守的安全赌注策略展现出更高的精确度,但Transformer模型在识别波动性变动方面表现出更高的敏感性,实现了更高的宏观F1分数,且消融实验显示,纳入新闻情绪始终带来益处。
Abstract
Predicting stock price movements during Earnings Announcements (EAs) is a significant challenge due to market noise and high-impact price discontinuities. In this study, we evaluate whether pre-announcement news sentiment, firm fundamentals, and recent market dynamics jointly predict the directional price movement of equities on EA days. We construct a multi-modal feature space combining 15 fundamental metrics, 3 price-based technical indicators and sentiment scores derived from financial news articles processed using FinBERT. We compare a Long Short-Term Memory (LSTM) network and a Transformer-based architecture against a logistic regression baseline, and further assess all models with and without sentiment features to quantify their incremental value. Our results indicate that while the LSTM demonstrates higher precision through a conservative safe-bet strategy, the Transformer model exhibits superior sensitivity in identifying volatile movements, achieving a higher macro F1-score, with ablation experiments showing a consistent benefit from incorporating news sentiment.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 3.0/10 | 3.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
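Scoring rationale: The paper predicts the direction of stock price movement on earnings announcement days using multi-modal features (fundamental metrics, technical indicators, news sentiment), which places it in financial time-series forecasting. It is almost entirely unrelated to the given keywords: Unify Models, World Models, MLLM (multimodal large language models), CV (computer vision), model-based RL, OPD (unknown abbreviation), RL (reinforcement learning), and GRPO (unknown abbreviation) are not involved. Only 'MultiModal' has a weak connection, because the paper combines text sentiment with numerical indicators, but this is neither a native multimodal large model nor vision-language multimodality, hence 3 points.
Keywords
stock price prediction, earnings announcements, multi-modal deep learning, sentiment analysis, FinBERT, LSTM, Transformer, financial news
Abstract (Chinese translation)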
足球决策过程的特点是空间定位、对手压力与球员意图之间的复杂相互作用。本文提出了一种图神经网络(Graph Neural Network, GNN)框架,通过将场上交互建模为动态图来预测接球者选择(Receiver Selection),即最佳传球目标。每个球员被表示为一个节点,包含位置和上下文特征,而潜在传球线路则形成加权边,其特征由距离、角度和压力指标决定。我们开发并训练了一种消息传递神经网络(Message-Passing Neural Network, MPNN),该网络使用了来自职业比赛的追踪数据和事件数据,并通过基于优化版Needleman-Wunsch算法的稳健流水线进行同步。该模型在识别实际选择的接球者方面达到了有竞争力的准确率,并在其前三建议中达到了最先进的准确率。我们的模型进一步量化了每个选项的可能性、威胁性和创造性,使表现分析师能够在数秒内评估超过1000次传球。
Abstract
The process of decision-making in football is characterized by a complex interplay between spatial positioning, opponent pressure, and player intent. This work introduces a Graph Neural Network (GNN) framework designed to predict Receiver Selection, the optimal passing target, by modeling on-field interactions as dynamic graphs. Each player is represented as a node with positional and contextual features, while potential passing lines form weighted edges characterized by distance, angle, and pressure metrics. A Message-Passing Neural Network (MPNN) has been developed and trained using a combination of tracking data and event data from professional matches, synchronized through a robust pipeline based on an optimized version of the Needleman-Wunsch Algorithm. The model achieves competitive accuracy in identifying the actual chosen receiver and state-of-the-art accuracy within its top three suggestions. Our model further offers quantification of each option's likelihood, threat, and creativity, enabling performance analysts to evaluate over 1,000 passes in seconds.
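The abstract says tracking and event data are synchronized with an optimized Needleman-Wunsch variant but gives no details. As a rough illustration only, the sketch below implements the classic, unoptimized Needleman-Wunsch global alignment on two short label sequences. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the toy sequences are assumptions made for this example, not the authors' pipeline.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Classic global alignment; returns (best score, list of aligned pairs)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] has no partner
                score[i][j - 1] + gap,      # b[j-1] has no partner
            )
    # Trace back from the bottom-right corner to recover one optimal alignment
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Hypothetical example: an event log with one event missing from the tracking-derived sequence
events = ["pass", "dribble", "shot"]
tracking = ["pass", "shot"]
best, aligned = needleman_wunsch(events, tracking)
print(best, aligned)
```

A `None` entry in a pair marks a gap, meaning an item in one stream has no counterpart in the other. A real synchronization step would presumably score matches on timestamps or positions rather than on exact label equality.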
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 2.0/10 | 2.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 1.0/10 | 1.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
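Scoring rationale: The paper studies receiver selection in football passing decisions, using a graph neural network (MPNN) to model dynamic interactions between players. It is almost unrelated to the given keywords: Unify Models, World Models, MLLM, MultiModal, model-based RL, OPD, and GRPO are not involved. CV has a weak connection only because positional and trajectory data are used (2 points). RL has a very weak connection only because of the decision-making process (1 point), but the model is trained with supervised learning rather than reinforcement learning. The paper belongs to sports analytics and applied graph neural networks and is unrelated to background keywords such as multimodal large models, world models, and reinforcement learning.
Keywords
passing decision-making, graph neural network, receiver selection, football analytics, message-passing neural network, tracking data, event data, Needleman-Wunsch algorithm
Abstract (Chinese translation)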
尽管AI智能体在推理和工具使用方面展现出卓越能力,但其本质上仍是被动的:仅在用户明确提示后才计算响应。这种范式忽略了一个关键机遇——交互间的空闲时间大多被浪费,导致智能体无法为未来的用户需求做好准备。为弥补这一缺陷,我们提出ProAct,一种利用空闲时间计算来预测并满足用户可能出现的需求的主动式智能体架构。通过分析不断演变的对话历史与持久化记忆,ProAct能预测即将出现的需求并迭代式地获取信息,使智能体能够在用户发起查询前解决知识缺口并准备证据。为严格评估主动能力,我们还提出ProActEval,一个包含40个领域200个场景的综合基准测试,涵盖可预测的需求链与多样化的用户认知特征。实验结果表明,与被动基线相比,ProAct具有显著优势:在ProActEval上,ProAct通过减少14.8%的所需交互轮次加速任务完成,降低11.7%的用户努力,并将幻觉率降低28.1%。此外,MemBench评估证实ProAct达到了最先进的反思准确性,彰显其持续稳健的性能。
Abstract
While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 1.0/10 | 1.0 |
| World Models | 1.0 | 1.0/10 | 1.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 1.0/10 | 1.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
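Scoring rationale: The paper studies ProAct, a proactive agent architecture that uses idle time to predict user needs and gather information in advance, which places it in dialogue systems and agents. It is almost unrelated to the given keywords: it does not involve unified multimodal models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, reinforcement learning, or GRPO. Only weak connections exist: proactively predicting future needs can be viewed as implicit modeling of the environment (similar to a world model), but this is not the core; the proactive decision process is loosely analogous to model-based methods, but no reinforcement learning is involved. All keyword scores are therefore very low.
Keywords
ProAct, proactive agent, idle-time compute, anticipation, user needs, memory, ProActEval
Abstract (Chinese translation)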
我们描述了KSAA-2026阿拉伯语语音听写与自动标音共享任务(Task 2)的获胜系统。该任务要求从语音音频和无标音转录文本中生成完整标音的阿拉伯语文本,仅提供2,327个训练样本,且不允许使用外部数据。我们的系统对CATT-Whisper进行了微调,这是一种字符级多模态模型,结合了预训练的CATT文本编码器与冻结的Whisper语音编码器。我们方法的关键在于训练正则化:R-Drop一致性正则化、Optuna优化的高权重衰减超参数以及Focal Loss。在推理阶段,我们通过蒙特卡洛Dropout(Monte Carlo Dropout)在softmax概率层面对四个模型检查点进行200次随机前向传播的平均。该系统在主要排行榜指标(含词尾变音符,包括无标音位置)上实现了23.26%的词错误率(WER),在所有参赛者中排名第一。
Abstract
We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data permitted. Our system fine-tunes CATT-Whisper, a character-level multimodal model combining a pretrained CATT text encoder with a frozen Whisper speech encoder. The key to our approach is training regularization: R-Drop consistency regularization, Optuna-optimized hyperparameters with high weight decay, and Focal Loss. At inference, we average 200 stochastic forward passes across four model checkpoints using Monte Carlo Dropout at the softmax probability level. The system achieves 23.26% WER on the primary leaderboard metric (with case endings, including no-diacritic positions), placing 1st among all participants.
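The inference recipe described above (Monte Carlo Dropout, averaged at the softmax probability level across checkpoints) can be sketched on a toy model. Everything below is illustrative: the linear classifier, the `drop_prob` value, and the split of passes per checkpoint are assumptions for the example and are not taken from the CATT-Whisper system.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_forward(weights, features, drop_prob, rng):
    """One forward pass of a toy linear classifier with dropout left ON."""
    keep = 1.0 - drop_prob
    # Inverted dropout: zero each feature with probability drop_prob, rescale survivors
    masked = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, masked)) for row in weights]
    return softmax(logits)


def mc_dropout_ensemble(checkpoints, features, passes_per_checkpoint,
                        drop_prob=0.1, seed=0):
    """Average class probabilities (not logits) over all passes and checkpoints."""
    rng = random.Random(seed)
    n_classes = len(checkpoints[0])
    total = len(checkpoints) * passes_per_checkpoint
    mean_probs = [0.0] * n_classes
    for weights in checkpoints:
        for _ in range(passes_per_checkpoint):
            probs = stochastic_forward(weights, features, drop_prob, rng)
            for k, p in enumerate(probs):
                mean_probs[k] += p / total
    return mean_probs


# Two hypothetical checkpoints of a 3-class, 2-feature classifier
checkpoints = [
    [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]],
    [[0.4, -0.1], [0.0, 0.35], [-0.3, 0.1]],
]
probs = mc_dropout_ensemble(checkpoints, [1.0, 2.0], passes_per_checkpoint=50)
prediction = max(range(len(probs)), key=probs.__getitem__)
print(probs, prediction)
```

Averaging probabilities rather than logits keeps each pass's contribution bounded in [0, 1], so a single overconfident pass cannot dominate the ensemble.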
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 3.0/10 | 3.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
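Scoring rationale: The paper addresses Arabic speech recognition and automatic diacritization, which belongs to speech processing and natural language processing. It is almost entirely unrelated to the given keywords (unification of multimodal large models, world models, representation learning, model-based reinforcement learning, computer vision, GRPO, etc.). The "multimodal" in the paper refers only to combining a text encoder with a speech encoder, not to a vision-language multimodal large model. All keyword scores are 0 or very low (MultiModal receives 3 points only because the model uses two modalities). The author list does not include any of the designated experts.
Keywords
Arabic Speech Diacritization, CATT-Whisper, Regularized Fine-Tuning, R-Drop, Focal Loss, Monte Carlo Dropout, Low-Resource ASR
Abstract (Chinese translation)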
随着政策逐步跟上生成式AI的能力,水印技术成为内容溯源工作的核心。针对自回归模型的推理时水印因离散化不一致性而无法适用于连续模态。现有方法通过微调解码器分词器来克服这一问题,但这会丧失水印技术无需训练的优势。在本研究中,受离散化过程中词汇冗余现象的启发,我们提出了一种优雅的解决方案,用于对合成音频实现强大且鲁棒的水印。我们从理论上分析了令牌错误对水印检测的影响,并通过社区检测方法获得精简词汇表,有效缓解了这一问题。大量实验表明,我们的无梯度方法可将可检测性提升数个数量级,同时实现对音频修改的内建鲁棒性。总体而言,我们在多媒体领域的令牌级水印中发现了新的最优方法,而这仅仅源于离散表示学习的本质特性。
Abstract
As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments showcase that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we discover a new state-of-the-art for token-level watermarks in multimedia, which simply arises from the nature of discrete representation learning.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 2.0/10 | 2.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
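Scoring rationale: The paper is about watermarking synthetic audio, focusing on vocabulary redundancy in discrete representation learning and on community detection. It is almost entirely unrelated to the given keywords (unified models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, OPD, reinforcement learning, GRPO). Only "MultiModal" has a slight connection because the audio modality is involved (2 points); all others score 0.
Keywords
watermarking, synthetic audio, discrete representation learning, vocabulary redundancy, community detection, gradient-free, robustness
Abstract (Chinese translation)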
我们发布了Llamion,这是一个由14B参数的开源权重语言模型组成的系列,通过将Orion-14B转换为标准化的Llama系列架构而获得。该转换通过高效知识保留转换(KEPT)方法实现,该方法结合了:(i) 用于未改变模块的正常参数映射(NPM),(ii) 优化参数映射(OPM),这是一种无需训练的LayerNorm到RMSNorm初始化,我们证明其在权重衰减导致的近零均值激活机制下是最优的,以及(iii) 跨架构知识蒸馏(XKD),这是一种等大小冻结教师蒸馏,使转换后的模型输出与源模型在任何合理输入分布上保持一致。Llamion在H6、MT-Bench和KoMMLU上恢复了Orion的行为,仅使用约123M tokens,在单个A100上耗时四天;Llamion-Base在KoMMLU上达到66.87%,在提交时超过了Open Ko LLM Leaderboard上排名第二的条目超过7.0个绝对百分点。转移语料中完全缺失的能力(Python编程和200K-token上下文处理)在架构转换后完整保留。我们发布了三个检查点(Base、Chat、LongChat),它们在Hugging Face Transformers库中加载时设置trust_remote_code=False。
Abstract
We release Llamion, a family of 14B-parameter open-weight language models obtained by transforming Orion-14B into the standardized Llama-family architecture. The transformation is performed by Efficient Knowledge Preservation for Transformation (KEPT), a recipe that combines (i) Normal Parameter Mapping (NPM) for unchanged modules, (ii) Optimized Parameter Mapping (OPM), a training-free LayerNorm-to-RMSNorm initialization we prove optimal under the near-zero-mean activation regime induced by weight decay, and (iii) Cross-architecture Knowledge Distillation (XKD), an equal-size frozen-teacher distillation that aligns the converted model's outputs with the source model's on any reasonable input distribution. Llamion recovers Orion's behaviour on H6, MT-Bench, and KoMMLU with only ~123M tokens on a single A100 in four days; Llamion-Base reaches 66.87% on KoMMLU, exceeding the next-best entry of the Open Ko LLM Leaderboard by >7.0 absolute points at submission time. Capabilities entirely absent from the transfer corpus (Python programming and 200K-token context handling) survive the architectural transition intact. We release three checkpoints (Base, Chat, LongChat) that load with trust_remote_code=False in the Hugging Face Transformers library.
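The abstract ties the LayerNorm-to-RMSNorm initialization to a near-zero-mean activation regime. The sketch below does not reproduce the paper's OPM; it only illustrates the standard identity behind that condition: when the input has zero mean and the LayerNorm bias is zero, LayerNorm and RMSNorm with the same gain produce identical outputs. The function names and the toy vectors are assumptions for this example.

```python
import math


def layer_norm(x, gain, bias, eps=1e-6):
    """LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gain, bias)]


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square, then scale (no centering, no bias)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [g * v / math.sqrt(mean_sq + eps) for v, g in zip(x, gain)]


def max_gap(x, gain):
    """Largest elementwise difference between LayerNorm (zero bias) and RMSNorm."""
    zero_bias = [0.0] * len(x)
    ln = layer_norm(x, gain, zero_bias)
    rn = rms_norm(x, gain)
    return max(abs(p - q) for p, q in zip(ln, rn))


gain = [1.0, 1.0, 1.0, 1.0]
zero_mean = [1.0, -2.0, 0.5, 0.5]      # sums to zero
shifted = [v + 3.0 for v in zero_mean]  # same shape, mean of 3

print(max_gap(zero_mean, gain))  # the two norms coincide for zero-mean input
print(max_gap(shifted, gain))    # a nonzero mean makes them diverge
```

With a nonzero mean the variance and the mean square differ by the squared mean, which is why copying the gain alone stops being exact away from the zero-mean regime.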
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 2.0/10 | 2.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
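Scoring rationale: The paper focuses on converting the Orion-14B model into a pure language model with the Llama architecture, covering architecture-conversion techniques such as parameter mapping and knowledge distillation. It does not involve multimodality, world models, reinforcement learning, GRPO, or related topics. 'Unify Models' is marginally relevant because the paper unifies different architectures (Orion to Llama), but this is not multimodal unification, so it receives only 2 points. The remaining keywords are unrelated to the paper and score 0.
Keywords
Llamion, KEPT, Parameter Mapping, Knowledge Distillation, Architecture Transformation, Open-weight Language Models, Orion-14B
Abstract (Chinese translation)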
心脏功能评估需要连续、无创的监测,而磁共振成像(MRI)在此方面的能力有限。毫米波(mmWave)雷达及其合成孔径雷达(SAR)模式提供了一种保护隐私且便携的床旁临床应用方案。然而,从SAR图像重建高保真三维心脏几何结构仍是一个开放挑战。传统雷达方法生成稀疏点云,缺乏连续表面拓扑。同时,由于SAR图像固有的严重散斑噪声和模糊边界,直接应用光学重建网络效果不佳。为弥合这一差距,我们提出SAR2Mesh,一种将任务重新定义为从粗到细网格变形过程的新框架。通过使用拓扑模板初始化,我们的方法通过渐进式网格变形明确保留了解剖连通性。我们引入了几何感知特征投影模块,通过3D到2D采样提取多视图特征,以及基于物理信息的雷达损失函数,以强制预测几何结构与原始雷达回波之间的一致性。此外,我们提出了Cardiac Mesh-SAR,首个大规模配对SAR-网格数据集。大量实验表明,SAR2Mesh显著优于现有基于图像的基线方法,实现了准确且物理一致的心脏重建。
Abstract
Cardiac function evaluation necessitates continuous, non-invasive monitoring, a capability that is limited in MRI. Millimeter-wave (mmWave) radar and its Synthetic Aperture Radar (SAR) mode offer a privacy-preserving and portable solution for point-of-care clinical applications. However, reconstructing high-fidelity 3D cardiac geometry from SAR remains an open challenge. Traditional radar methods generate sparse point clouds that lack continuous surface topology. Meanwhile, direct application of optical reconstruction networks performs poorly due to the severe speckle noise and ambiguous boundaries inherent in SAR images. To bridge this gap, we propose SAR2Mesh, a novel framework that reformulates the task as a coarse-to-fine mesh deformation process. By initializing with a topological template, our approach explicitly preserves anatomical connectivity through progressive mesh deformation. We introduce a geometry-aware feature projection module to extract multi-view features via 3D-to-2D sampling, and a physics-informed radar loss to enforce consistency between the predicted geometry and raw radar echoes. Furthermore, we present Cardiac Mesh-SAR, the first large-scale paired SAR-mesh dataset. Extensive experiments demonstrate that SAR2Mesh significantly outperforms existing image-based baselines, achieving accurate and physically consistent cardiac reconstructions.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 2.0/10 | 2.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
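Scoring rationale: The paper studies 3D heart mesh generation using millimeter-wave radar SAR imaging and physics-informed neural networks, at the intersection of medical imaging and computer graphics. None of the given keywords (Unify Models, World Models, MLLM, MultiModal, model-based RL, OPD, RL, GRPO) relates to the paper's core content. CV (computer vision) has a weak connection because 3D reconstruction is involved, but the paper leans toward radar signal processing and medical physical modeling rather than general computer vision tasks, so it receives only 2 points. All other keywords score 0.
Keywords
3D heart mesh generation, contactless radar imaging, physics-informed neural network, SAR, mesh deformation, cardiac reconstruction, millimeter-wave radar
Abstract (Chinese translation)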
心血管疾病(Cardiovascular diseases, CVDs)仍是全球主要死亡原因,因此需要持续、准确的无创心脏监测。尽管基于非接触式雷达的方法前景广阔,但它们通常采用单一的“失真驱动”或“感知驱动”范式,常常面临“低失真但弱语义信息”与“高感知保真度但可解释性差”之间的权衡。为解决这一问题,我们提出了一种三阶段失真-感知预训练模型(Three-stage Distortion-Perception Pre-Training Model, TriDP-PTM),这是一个基于雷达的多尺度融合双路径框架,系统比较了“直接雷达到任务”路径与“间接雷达到心电图(ECG)再到任务”路径。通过将ECG生成器与特征判别器集成以形成复合损失函数,我们的方法有效地将医学先验知识(如ECG形态和节律)融入下游任务。通过实证分析,我们揭示了这种权衡在三个不同阶段(正和、合作竞争和负和)中显现,并表明最优的下游临床准确性通常出现在合作竞争阶段。在涉及30名受试者、涵盖5种生理状态的数据集上进行的大量实验表明,间接路径在各种任务中始终优于直接路径,在波形分割中实现了0.80的平均交并比(mean IoU),在四个任务中实现了98.3%的平均分类准确率,并且与最强基线相比,血压回归的平均绝对误差(MAE)降低了56%。这些发现验证了我们的框架,并表明在间接雷达到ECG路径中,适当权衡失真损失和感知损失以使其处于合作竞争状态,对于在非接触式心脏监测中同时获得临床可解释的ECG形态和强大的下游准确性至关重要。
Abstract
Cardiovascular diseases (CVDs) remain a leading cause of death globally, necessitating continuous, accurate non-invasive cardiac monitoring. While non-contact radar-based approaches show great promise, they often employ a single "distortion-driven" or "perception-driven" paradigm, frequently facing a trade-off between "low distortion but weak semantic information" and "high perceptual fidelity but poor interpretability." To address this, we propose a Three-stage Distortion-Perception Pre-Training Model (TriDP-PTM), a radar-based multi-scale fusion dual-path framework that systematically compares the "direct radar-to-task" path against an "indirect radar-to-ECG-to-task" path. By integrating an ECG generator with a feature discriminator to form a composite loss function, our approach effectively incorporates medical priors, such as ECG morphology and rhythm, into downstream tasks. Through empirical analysis, we reveal that this trade-off manifests in three distinct phases (Positive-Sum, Coopetitive, and Negative-Sum) and show that optimal downstream clinical accuracy typically emerges in the coopetitive stage. Extensive experiments on a dataset involving 30 subjects across 5 physiological states reveal that the indirect path consistently outperforms the direct path in diverse tasks, achieving 0.80 mean IoU in waveform segmentation, 98.3% average classification accuracy across four tasks, and a 56% MAE reduction in blood pressure regression compared to the strongest baselines. These findings validate our framework and indicate that, within the indirect radar-to-ECG pathway, appropriately weighting distortion and perception losses to operate in the coopetitive regime is critical for achieving both clinically interpretable ECG morphology and strong downstream accuracy in non-contact cardiac monitoring.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 2.0/10 | 2.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
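Scoring rationale: The paper studies radar-based non-contact cardiac monitoring and proposes a pre-training model built on a three-stage distortion-perception trade-off (TriDP-PTM), involving radar signals, ECG generation, and multi-scale fusion. Among the keywords, 'MultiModal' has a weak connection because radar and ECG can be viewed as two modalities, but the paper is not a typical study of multimodal large models or unified models. 'Unify Models', 'World Models', 'MLLM', 'CV', 'model-based RL', 'OPD', 'RL', and 'GRPO' are unrelated to the paper's content: it does not involve reinforcement learning, world models, multimodal large models, or core computer vision tasks. Therefore MultiModal receives 2 points and all others score 0.
Keywords
radar cardiac sensing, distortion-perception tradeoff, pre-training model, ECG generation, non-contact monitoring, multi-scale fusion, dual-path framework
Abstract (Chinese translation)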
大语言模型(LLM)的发展目前由对数据混合(data mixtures)、奖励模型(reward models)、路由策略(routing strategies)和评估流程(evaluation pipelines)的大规模经验性迭代所驱动。在此,我们认为LLM开发与评估中的许多核心问题本质上是因果性的:在预训练过程中添加一个数据领域会产生什么影响?当LLM以不同风格生成文本时,标注者的偏好如何变化?在推理成本约束下,应将提示路由到更大还是更小的模型?一般而言,因果方法(causal methods)非常适用于这种干预改变结果的场景,但令人惊讶的是,它们在LLM开发中并未得到充分体现。我们的贡献有三方面:(1)我们解释了因果方法如何有助于现代LLM开发与评估:LLM开发严重依赖日志数据,而这些数据往往受到混杂因素和分布偏移的影响;评估使用经过学习但可能存在偏差的评判者;部署环境是非平稳的。这些条件使得纯预测方法变得脆弱,并为来自因果推断(causal inference)的严谨识别与估计方法创造了机会。(2)我们进一步描绘了因果方法在整个LLM开发流程中的机会,包括预训练、对齐、路由、智能体工作流(agentic workflows)和评估。(3)我们讨论了围绕利用因果方法进行LLM开发与评估的新研究机会。总体而言,我们认为因果方法在LLM开发与评估流程中可能未被充分利用,尽管这些方法能够确保可靠且科学严谨的设计。
Abstract
Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings where interventions change outcomes but, surprisingly, are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods in the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized in the LLM development and evaluation pipeline, even though such methods can ensure a reliable and scientifically grounded design.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 1.0/10 | 1.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
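Scoring rationale: The paper's title and abstract focus on applying causal methods to large language model (LLM) development and evaluation, covering pretraining, alignment, routing, agentic workflows, and evaluation. The given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, GRPO) are all unrelated to the core content: the paper does not discuss multimodality, world models, computer vision, unified models, or model-based reinforcement learning. RL receives 1 point only because the abstract mentions "reward models", a weak link to reinforcement learning; all others score 0. The author list does not include any of the five designated experts.
Keywords
causal methods, LLM development, LLM evaluation, pretraining, alignment, routing, agentic workflows, confounding
Abstract (Chinese translation)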
我们研究高维结果的反事实分布学习,其反事实律可能集中于低维结构附近。标准的各向同性平滑对所有环境方向一视同仁,导致不利的尺度缩放和不稳定的局部推断。我们提出两种基于半参数去偏的扩散引导估计量:用于反事实密度的扩散知情平滑,以及用于反事实得分的扩散知情得分平滑。这些估计量将因果干扰调整与由扩散得分信息驱动的几何自适应局部化相结合,在消除一阶干扰偏差的同时,使平滑与局部结果几何结构对齐。我们建立了针对平滑密度和基于得分目标的渐近展开、风险界限及推断程序,并在额外近似条件下获得了环境密度推断。在结构几何条件下,主导随机误差由扩散引导核诱导的有效维度控制,而非环境维度。基于CelebA的半合成实验表明,几何自适应方法的误差衰减更陡峭,支持了所提出的有效维度理论。
Abstract
We study counterfactual distribution learning for high-dimensional outcomes whose counterfactual law may concentrate near lower-dimensional structure. Standard isotropic smoothing treats all ambient directions equally, leading to unfavorable scaling and unstable local inference. We propose two diffusion-guided estimators based on semiparametric debiasing: diffusion-informed smoothing for counterfactual densities and diffusion-informed score smoothing for counterfactual scores. The estimators combine causal nuisance adjustment with geometry-adaptive localization driven by diffusion score information, removing first-order nuisance bias while aligning smoothing with local outcome geometry. We establish asymptotic expansions, risk bounds, and inference procedures for smoothed density and score-based targets, with ambient density inference obtained under additional approximation conditions. Under structural geometry conditions, the leading stochastic error is governed by an effective dimension induced by the diffusion-guided kernel, rather than by the ambient dimension. Semi-synthetic experiments based on CelebA show steeper error decay for geometry-adaptive methods, supporting the proposed effective-dimension theory.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 1.0/10 | 1.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
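Scoring rationale: The paper studies geometry-adaptive, diffusion-guided smoothing for counterfactual distribution learning; its core is causal inference and dimensionality reduction for high-dimensional data. It is almost entirely unrelated to the given keywords (unified models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, OPD, reinforcement learning, GRPO). CV has a weak connection only because the CelebA image dataset is used, but the paper's subject is not CV, so CV receives 1 point and all others score 0.
Keywords
Counterfactual Distribution Learning, Diffusion-Guided Smoothing, Geometry-Adaptive, Semiparametric Debiasing, Effective Dimension, Causal Inference, High-Dimensional Outcomes
Abstract (Chinese translation)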
我们研究了可实现设定下带赌博反馈的多类PAC学习问题。在该框架中,与经典多类PAC学习类似,存在一个关于实例空间$\mathcal{X}$和标签空间$\mathcal{Y}$的未知数据分布,但学习器无法观测到独立同分布训练样本的标签。相反,在每一轮中,学习器接收一个无标签实例,预测其标签,并收到仅指示预测是否正确的赌博反馈。尽管存在这一限制,目标仍与经典PAC学习相同。我们给出了该问题最优样本复杂度的通用刻画,该刻画对每个概念类均精确至对数因子。我们的刻画基于一个新的组合维度,称为赌博$\mathrm{DS}$维度,该维度通过我们称为伪盒子的广义组合结构定义。这些结构扩展了构成$\mathrm{DS}$维度基础的伪立方体,允许每个坐标具有不同数量的邻居。与通过计数伪立方体中坐标数量来支配全信息设定的$\mathrm{DS}$维度不同,赌博$\mathrm{DS}$维度聚合了各坐标的邻居数量,从而得到样本复杂度与邻居总数成比例的刻画。我们还提出了一种通用的学习算法,该算法基于称为ListCascade的算法原则实现了上界,该原则将赌博学习与列表学习联系起来,可能具有独立的研究价值。
Abstract
We study the problem of multiclass PAC learning with bandit feedback in the realizable setting. In this framework, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$, as in classical multiclass PAC learning, but the learner does not observe the labels of the i.i.d. training examples. Instead, in each round, it receives an unlabeled instance, predicts its label, and receives bandit feedback indicating only whether the prediction is correct. Despite this restriction, the goal remains the same as in classical PAC learning. We provide a general characterization of the optimal sample complexity of this problem, sharp for every concept class up to logarithmic factors. Our characterization is based on a new combinatorial dimension, termed the bandit $\mathrm{DS}$ dimension, defined via generalized combinatorial structures we call pseudo-boxes. These extend the pseudo-cubes underlying the $\mathrm{DS}$ dimension by allowing a different number of neighbors in each coordinate. In contrast to the $\mathrm{DS}$ dimension, which governs the full-information setting by counting the number of coordinates in the pseudo-cube, the bandit $\mathrm{DS}$ dimension aggregates the number of neighbors across coordinates, leading to a characterization in which the sample complexity scales with the total number of neighbors. We also propose a general learning algorithm achieving the upper bound, based on an algorithmic principle called ListCascade, which connects bandit learning to list learning and may be of independent interest.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 1.0/10 | 1.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
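Scoring rationale: The paper studies PAC learning, a topic in classical machine learning theory, and specifically its sample complexity under bandit feedback (where only the correctness of a prediction is revealed). The given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, GRPO) all relate closely to directions such as multimodal large models, world models, and reinforcement learning, none of which this paper touches. Only RL has a weak connection, because bandit feedback is a simplified form of reinforcement learning, but the paper is a purely theoretical analysis rather than a study of reinforcement learning algorithms or applications, so RL receives only 1 point. All other keywords score 0.
Keywords
PAC learning, bandit feedback, sample complexity, realizable setting, combinatorial dimension, DS dimension, pseudo-boxes, ListCascade
Abstract (Chinese translation)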
损失及其梯度的范数仅能微弱地分离神经网络训练的健康状态与病理状态,而经验风险的曲率在两者之间存在定性差异,但在参数数量 \(P\sim 10^{6}-10^{8}\) 时无法显式获取。我们提出了一种神经网络经验风险 Hessian 矩阵对角块迹的随机估计器。该过程将 Hutchinson 随机迹估计器与整个参数向量上的单个 Hessian-向量积相结合,并在计算图的一次反向传播中恢复每层迹的无偏估计。我们证明,权值共享下的正确性要求在二次微分之前先组装逐层 Hessian:将共享权值展开为独立坐标会引入系统偏差,其符号和大小由展开后 Hessian 的跨实例块决定。推导了固定 Hessian 下估计量方差的闭式表达式,以及小批量采样分布下总方差的分解。该分解给出了一个临界探测数 \(K^{\star}\),它平衡了两种随机性来源,并支持在线监控模式下的实用建议 \(K\in[5,10]\)。该估计器应用于 ResNet-18、ResNet-34 和 VGG-11 在 CIFAR-10 和 CIFAR-100 上的标签记忆状态检测,其中校准的累积和决策规则在虚警率为 \(16/120\) 时达到了 \(179/180\) 的经验检测力。
Abstract
The loss and the norm of its gradient separate the healthy and the pathological regimes of neural-network training only weakly, whilst the curvature of the empirical risk differs qualitatively between them but is inaccessible explicitly at parameter counts $P\sim 10^{6}-10^{8}$. We present a stochastic estimator of the trace of the diagonal blocks of the Hessian matrix of the empirical risk of a neural network. The procedure combines the Hutchinson stochastic trace estimator with a single Hessian-vector product over the whole parameter vector and recovers unbiased estimates of every per-layer trace in one backward pass through the computational graph. We show that correctness under weight sharing requires the layer-wise Hessian to be assembled before the second differentiation: unrolling shared weights into independent coordinates introduces a systematic bias whose sign and magnitude are governed by the cross-instance blocks of the unrolled Hessian. A closed-form expression for the variance of the estimator at a fixed Hessian is derived, together with a decomposition of the total variance under the mini-batch sampling distribution. This decomposition yields a critical probe count $K^{\star}$ that balances the two sources of randomness and supports the practical recommendation $K\in[5,10]$ in the on-line monitoring regime. The estimator is applied to the detection of the label-memorisation regime of ResNet-18, ResNet-34, and VGG-11 on CIFAR-10 and CIFAR-100, where a calibrated cumulative-sum decision rule attains an empirical detection power of $179/180$ at a false-alarm rate of $16/120$.
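The core idea above, per-layer Hessian traces recovered from a single Hessian-vector product per probe, can be sketched on a small explicit matrix. This is a minimal illustration under stated assumptions: the 4x4 symmetric matrix `H`, the two-block layout, and the helper names are made up for the example. In a real network the product `H @ v` would come from automatic differentiation instead of an explicit matrix, and the paper's treatment of weight sharing and its variance analysis are not reproduced here.

```python
import random


def matvec(H, v):
    """Dense matrix-vector product, standing in for an autodiff Hessian-vector product."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def blockwise_hutchinson(hvp, blocks, dim, num_probes, seed=0):
    """Estimate tr(H_ll) for every diagonal block l using one H @ v per probe.

    With Rademacher probes, E[v_l^T (H v)_l] = tr(H_ll): cross-block terms
    vanish in expectation because distinct probe entries are uncorrelated.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(blocks)
    for _ in range(num_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]  # Rademacher probe
        Hv = hvp(v)  # one product over the whole parameter vector
        for l, (start, stop) in enumerate(blocks):
            totals[l] += sum(v[i] * Hv[i] for i in range(start, stop))
    return [t / num_probes for t in totals]


# Hypothetical symmetric "Hessian" with two layers of two parameters each
H = [
    [2.0, 1.0, 0.5, 0.0],
    [1.0, 3.0, 0.0, 0.5],
    [0.5, 0.0, 1.0, 0.2],
    [0.0, 0.5, 0.2, 4.0],
]
blocks = [(0, 2), (2, 4)]  # half-open index ranges, one per layer
estimates = blockwise_hutchinson(lambda v: matvec(H, v), blocks, dim=4, num_probes=4000)
exact = [sum(H[i][i] for i in range(a, b)) for a, b in blocks]
print(estimates, exact)
```

The per-block estimates reuse the same `Hv`, so the cost per probe does not grow with the number of layers. Off-diagonal entries contribute only variance, which shrinks as the probe count increases.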
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 1.0/10 | 1.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
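Scoring rationale: The paper studies a stochastic estimator of the Hessian trace during neural network training, used to monitor the training regime (e.g., label memorization). It is almost entirely unrelated to the given keywords (unified models, world models, multimodal large models, computer vision, multimodality, model-based reinforcement learning, OPD, reinforcement learning, GRPO). CV receives 1 point only because the experiments run on the CIFAR image datasets, but the core is not a computer vision task. All other keywords score 0.
Keywords
Hessian trace, stochastic estimator, layer-wise, neural network training monitoring, label memorization, variance analysis, cumulative-sum decision rule
Abstract (Chinese translation)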
贝叶斯最优实验设计(Bayesian optimal experimental design, BOED)通过选择实验来最大化关于模型参数的信息增益。然而,在决策关键场景中,降低参数不确定性并不一定能改善下游决策,因为只有与目标相关的特定参数方向才真正重要。我们提出GoBOED,一种目标驱动的BOED框架,它直接针对指定的决策目标优化实验设计。GoBOED将摊销变分后验代理(amortized variational posterior surrogate)与可微凸决策层相结合,实现了完全以决策为中心的基于梯度的设计优化。我们从理论上证明,GoBOED的梯度对与决策目标无关的参数方向不敏感,这为为什么目标驱动设计能够在比信息增益最大化更广泛的实验设计集合上实现等效决策质量提供了形式化依据。在实证中,通过源定位、流行病管理和药代动力学控制等案例,GoBOED识别出更符合下游决策目标的设计,并揭示出接近最优的设计窗口远宽于目标无关的BOED方法所预测的范围。
Abstract
Bayesian optimal experimental design (BOED) selects experiments to maximize information gain about model parameters. However, in decision-critical settings, reducing parameter uncertainty does not necessarily improve downstream decisions, as only specific parameter directions relevant to the objective truly matter. We propose GoBOED, a goal-driven BOED framework that directly optimizes experimental designs for a specified decision-making objective. GoBOED combines an amortized variational posterior surrogate with a differentiable convex decision layer, enabling gradient-based design optimization that is fully decision-focused. We theoretically show that GoBOED gradients are insensitive to parameter directions irrelevant to the decision objective, providing a formal justification for why goal-driven design achieves equivalent decision quality over a wider set of experimental designs than information-gain maximization. Empirically, across source localization, epidemic management, and pharmacokinetic control, GoBOED identifies designs that better align with downstream decision objectives and reveals that near-optimal design windows are substantially wider than those predicted by goal-agnostic BOED approaches.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
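Scoring rationale: The paper studies a decision-driven variant of Bayesian optimal experimental design (BOED); its core is the application of information theory and convex optimization to experimental design. It does not touch on multimodal large language models, unified models, world models, reinforcement learning, computer vision, multimodal learning, OPD, or GRPO. None of the given keywords relates to the paper's content, so every keyword scores 0.
Keywords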
Bayesian optimal experimental design, goal-driven design, decision-focused optimization, amortized variational posterior, differentiable convex decision layer, model uncertainty, information gain
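Translated Abstract
Learning user preferences efficiently is essential to many modern decision-making systems, but it usually requires costly labeled data. Active learning lowers this cost, yet standard methods are computationally expensive because of pool-based evaluation. In addition, most methods assume that all query feedback is equally reliable, ignoring that pairwise queries between nearly identical or entirely dissimilar items yield ambiguous, low-confidence responses. To address feedback reliability, we introduce a novel confidence-aware response model that explicitly accounts for these ambiguous comparisons. To overcome the computational bottleneck of pool-based evaluation, we propose Info-Synth, an active query synthesis framework that generates optimal queries by maximizing a mutual-information-based objective in a continuous space. We further propose two strategies, Pair M-dist and Pair Opt-dist, that extend Info-Synth so it can select effective queries even when restricted to a finite query pool. We demonstrate the versatility and performance of our framework on synthetic preference learning, constrained text summary datasets, and subjective continuous-space controller gain tuning for a simulated mobile robot.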
Abstract
Efficient learning of user preferences is crucial for many modern decision making systems but typically requires costly labeled data. Active learning reduces this cost, yet standard methods are computationally expensive due to pool-based evaluation. Further, most methods assume all query feedback is equally reliable, ignoring that pairwise queries between nearly identical or entirely dissimilar items yield ambiguous, low-confidence responses. To address the issue of feedback reliability, we introduce a novel confidence aware response model that explicitly accounts for these ambiguous comparisons. To overcome the computational bottleneck of pool-based evaluation, we propose an active query synthesis framework, Info-Synth that generates optimal queries by maximizing a mutual information-based objective within a continuous space. Moreover, we propose two strategies, Pair M-dist and Pair Opt-dist, that extend Info-Synth to select effective queries even when restricted to finite query pools. We demonstrate our framework's versatility and performance across synthetic preference learning, constrained text summary datasets, and subjective, continuous-space controller gain tuning for a simulated mobile robot.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
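Scoring rationale: The paper studies active query synthesis for preference learning; its core topics are user preference modeling, active learning, mutual-information optimization, and feedback reliability. It involves no concepts, methods, or applications related to unified models (Unify Models), world models (World Models), multimodal large language models (MLLM), computer vision (CV), multimodal learning (MultiModal), model-based reinforcement learning (model-based RL), on-policy distillation (OPD), reinforcement learning (RL), or GRPO. All given keywords are entirely unrelated to the paper's content, so every keyword scores 0.
Keywords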
Active Query Synthesis, Preference Learning, Confidence-Aware Response Model, Mutual Information, Info-Synth, Pair M-dist, Pair Opt-dist
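Translated Abstract
Bayesian inverse design offers a principled framework for inferring aerodynamic geometries from sparse flow-field observations while quantifying uncertainty. Its practical use in computational fluid dynamics (CFD), however, is severely limited by the cost of the repeated high-fidelity simulations that gradient-based Markov chain Monte Carlo (MCMC) sampling requires. Surrogate models are often proposed to reduce this cost, but their effect on posterior geometry and uncertainty, particularly in shock-dominated flows, remains poorly understood. In this work, we show that neural operator surrogates can be embedded directly in the MCMC inference loop while preserving posterior structure. Using a fully Bayesian inverse formulation of quasi-one-dimensional nozzle flow, we show that geometry parameterization plays a decisive role in identifiability and posterior conditioning, with cubic B-splines producing stable and physically meaningful uncertainty estimates. Building on this formulation, a Deep Operator Network trained on CFD-generated data replaces the CFD solver inside a No-U-Turn Sampler, while the likelihood model, priors, and sampling configuration are left unchanged. From sparse to fully observed regimes, surrogate-based inference reproduces the posterior geometry and uncertainty trends of the CFD reference. With the surrogate integrated, total inference time drops to under one second, a speedup of more than three orders of magnitude. A direct inverse neural operator is also examined as a deterministic alternative for inverse design, enabling single-shot geometry reconstruction without posterior sampling. These results indicate that neural-operator-accelerated Bayesian inference enables practical, uncertainty-aware inverse design workflows for aerodynamic applications.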
Abstract
Bayesian inverse design provides a principled framework for inferring aerodynamic geometries from sparse flow observations while quantifying uncertainty. However, its practical use in computational fluid dynamics (CFD) is severely limited by the cost of repeated high-fidelity simulations required for gradient-based Markov chain Monte Carlo (MCMC) sampling. While surrogate models are commonly proposed to reduce this cost, their effect on posterior geometry and uncertainty, especially for shock-dominated flows, remains poorly understood. In this work, we demonstrate that neural operator surrogates can be embedded directly within the MCMC inference loop while preserving posterior structure. Using a fully Bayesian inverse formulation of quasi-one-dimensional nozzle flow, we demonstrate that geometry parameterization plays a decisive role in identifiability and posterior conditioning, with cubic B-splines yielding stable and physically meaningful uncertainty estimates. Building on this formulation, a Deep Operator Network trained on CFD-generated data is substituted for the CFD solver within a No-U-Turn Sampler, while keeping the likelihood model, priors, and sampling configuration unchanged. Across sparse to fully observed regimes, surrogate-based inference reproduces the posterior geometry and uncertainty trends of the CFD reference. As a result of surrogate integration, total inference time is reduced to under one second, corresponding to a speedup exceeding three orders of magnitude. In addition, a direct inverse neural operator is examined as a deterministic alternative for inverse design, enabling single-shot geometry reconstruction without posterior sampling. These results demonstrate that neural operator-accelerated Bayesian inference enables practical, uncertainty-aware inverse design workflows for aerodynamic applications.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
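Scoring rationale: The paper studies Bayesian inverse design in computational fluid dynamics, using neural operators to accelerate MCMC sampling. All of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) concern topics such as multimodal large models, world models, and reinforcement learning, none of which the paper addresses, so every keyword has a relevance of 0.
Keywords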
Bayesian inverse design, computational fluid dynamics, neural operators, MCMC, surrogate model, uncertainty quantification, aerodynamic geometry
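Translated Abstract
Length generalization remains a persistent challenge for neural networks: recurrent models tend to suffer from positional biases, while Transformers are limited by fixed computational depth. Regular languages are a commonly used testbed for evaluating length generalization, because label predictions can be checked for sequences of any length. We propose MLP-LDRU, a Log-Depth Recurrent Unit that captures a class of associativity-biased operators designed to approximate recurrence through parallel reduction. We evaluate MLP-LDRU on 21 regular-language tasks, comprising standard benchmarks and new prefix languages. When the maximum training length is increased, it reaches 100% out-of-distribution accuracy on 18 tasks and at least 99.9% on the remaining 3, outperforming comparable recurrent and attention-based models. We also evaluate MLP-LDRU beyond regular languages, on ListOps and NLP classification benchmarks, where it performs competitively.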
Abstract
Length generalization remains a persistent challenge for neural networks: recurrent models tend to suffer from positional biases, while transformers are constrained by fixed computational depth. Regular languages provide a frequently used testbed for evaluating length generalization, as label prediction can be checked for any sequence length. We propose MLP-LDRU, a type of Log-Depth Recurrent Unit, which captures a class of associativity-biased operators designed to approximate recurrence through parallel reduction. We evaluate MLP-LDRU on 21 regular-language tasks, consisting of standard benchmarks and new prefix languages, where it achieves 100% out-of-distribution accuracy on 18 tasks and at least 99.9% on the remaining 3 when increasing max training length, outperforming comparable recurrent and attention-based models. We further evaluate MLP-LDRU beyond regular languages on ListOps and NLP classification benchmarks, where it performs competitively.
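As a rough illustration of approximating recurrence through parallel reduction, the sketch below recognises a regular language by composing DFA transition tables with an associative operator in a balanced tree, so the number of sequential rounds grows logarithmically with the input length. This is not the paper's MLP-LDRU, whose learned operator is not described in the abstract; the parity automaton and the helper names `compose`, `tree_reduce`, and `accepts` are assumptions made for the example.

```python
# DFA for "even number of 1s": state 0 = even (accepting), state 1 = odd.
# Each symbol maps to a transition table: table[s] is the next state from s.
TRANSITIONS = {"0": (0, 1), "1": (1, 0)}


def compose(f, g):
    """Apply table f first, then table g. Composition is associative."""
    return tuple(g[s] for s in f)


def tree_reduce(items, op):
    """Combine adjacent pairs round by round; return (result, rounds)."""
    items = list(items)
    rounds = 0
    while len(items) > 1:
        paired = [op(items[i], items[i + 1])
                  for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:  # an unpaired last element carries over
            paired.append(items[-1])
        items = paired
        rounds += 1
    return items[0], rounds


def accepts(word):
    """Return True if the word contains an even number of '1' symbols."""
    if not word:
        return True
    table, _ = tree_reduce([TRANSITIONS[c] for c in word], compose)
    return table[0] == 0  # start in state 0, accept if we end in state 0
```

Because every pair within a round is independent, each round could run in parallel; a sequential recurrence over the same input would need one step per symbol.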
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
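Scoring rationale: The paper studies length generalization and proposes a Log-Depth Recurrent Unit (MLP-LDRU), focusing on how well sequence models generalize on regular-language tasks. It does not involve multimodality, world models, reinforcement learning, vision, unified models, or any of the other keywords. All given keywords are entirely unrelated to the paper's content, so every keyword has a relevance of 0.
Keywords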
Length Generalization, Log-Depth Recurrent Units, MLP-LDRU, Regular Languages, Parallel Reduction, Associativity-biased Operators, ListOps, NLP Classification
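Translated Abstract
Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. Statistical inference based on SGD iterates remains challenging, however, when the stochastic gradients have infinite variance, because the relevant limiting distributions depend on unknown nuisance parameters. This paper proposes an efficient, model-agnostic method for constructing confidence regions from SGD trajectories that applies in both the finite-variance and infinite-variance cases. The procedure rests on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer built from the stochastic gradients along the SGD trajectory. This joint limit yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel. We then use a subsampling calibration scheme to estimate the relevant critical values, which avoids explicit estimation of tail indices, slowly varying functions, or stable-law parameters. The resulting confidence regions are easy to implement and asymptotically valid in both the finite- and infinite-second-moment cases. Simulation studies show reliable coverage across a range of settings, supporting the method as a practical tool for uncertainty quantification in stochastic optimization.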
Abstract
Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients have infinite variance, as the relevant limiting distributions depend on unknown nuisance parameters. In this paper, we develop an efficient, model-agnostic methodology for constructing confidence regions from SGD trajectories that applies in both finite- and infinite-variance regimes. The procedure is based on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer constructed from stochastic gradients along the SGD trajectory. This joint limit yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel. We then use a subsampling calibration scheme to estimate the relevant critical values, avoiding explicit estimation of tail indices, slowly varying functions, or stable-law parameters. The resulting confidence regions are straightforward to implement and are asymptotically valid under both the finite- and infinite-second-moment regimes. Simulation studies show reliable coverage in various settings, supporting the proposed method as a practical tool for uncertainty quantification in stochastic optimization.
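For readers unfamiliar with the Polyak-Ruppert averaged estimator mentioned above, here is a minimal sketch of SGD with iterate averaging on a scalar problem. It covers only the averaging step, not the paper's self-normalized statistic or subsampling calibration; the function name `averaged_sgd`, the polynomially decaying step size, and the default constants are assumptions chosen for the example.

```python
def averaged_sgd(grad, x0, steps, lr0=0.5, decay=0.7):
    """Run scalar SGD and return (last_iterate, polyak_ruppert_average).

    grad(x, t) returns a (possibly noisy) gradient estimate at x.
    The step size at iteration t is lr0 * (t + 1) ** -decay.
    """
    x = x0
    avg = 0.0
    for t in range(steps):
        x = x - lr0 * (t + 1) ** -decay * grad(x, t)
        # Running mean of the iterates x_1 .. x_{t+1}.
        avg += (x - avg) / (t + 1)
    return x, avg
```

With a noisy gradient such as `lambda x, t: (x - 3.0) + noise()`, the averaged value is typically much closer to the minimizer 3.0 than the last iterate is, which is why inference procedures are usually built on the average.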
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
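Scoring rationale: The paper concerns statistical inference for stochastic gradient descent (SGD), focusing on the construction of confidence regions in the infinite-variance case, and belongs to statistics and optimization theory. All of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) concern directions such as multimodal large models, world models, and reinforcement learning and are entirely unrelated to this paper, so every keyword has a relevance of 0.
Keywords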
Stochastic Gradient Descent, Statistical Inference, Infinite Variance, Self-Normalized Statistic, Subsampling, Confidence Regions, Polyak-Ruppert Averaging
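Translated Abstract
Benchmarks increasingly guide model deployment, procurement, and scientific screening, yet a single score supports only the response it records and does not necessarily correspond to the deployment action. We propose deployment-complete benchmarking, which tests whether benchmark evidence determines the deployment action. A benchmark is complete for a claim if and only if the deployment action is constant on every evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the amount of evidence needed to resolve the ambiguity. In controlled response spaces, benchmark-channel conformal coverage was 94.98% but transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even with zero benchmark error, only 45.4% of candidate models could be certified at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed fibers in Tox21 and a median certifiable fraction of zero in the main Matbench and JARVIS audits. In held-out replay experiments, a certify-then-acquire strategy reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while also changing the model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity, and completion cost rather than scores alone.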
Abstract
Benchmarks increasingly guide deployment, procurement and scientific screening, yet a score supports only the response it records, not necessarily the deployment action. We introduce deployment-complete benchmarking, which tests whether benchmark evidence determines a deployment action. A benchmark is complete for a claim exactly when the action is constant on each evidence fiber; mixed fibers expose missing deployment information, and completion curves quantify the evidence required to resolve ambiguity. In controlled response spaces, benchmark-channel conformal coverage of 94.98% transferred poorly to an unmeasured deployment channel (10.07%), whereas response-rank intervals achieved 94.91% coverage; even zero benchmark error certified only 45.4% of candidates at the largest residual size. Public audits revealed incompleteness, including 97.9% mixed Tox21 fibers and zero median certifiable fraction in main Matbench and JARVIS audits. In held-out replays, certify-then-acquire reduced false decisions from 1.19% to 0.027% in Tox21 and from 20.3% to 0.128% in JARVIS, while changing model choice and identifying deployment-relevant probes. Deployment-ready benchmarks should report evidence, supported actions, ambiguity and completion cost rather than scores alone.
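The completeness condition (the action is constant on each evidence fiber) can be checked mechanically once evidence and actions are tabulated. The sketch below groups records by their benchmark evidence and reports which fibers are mixed. The record layout and the function names `mixed_fiber_fraction` and `is_complete` are assumptions for illustration, not the paper's tooling.

```python
from collections import defaultdict


def mixed_fiber_fraction(records):
    """records: iterable of (evidence, action) pairs, both hashable.

    A fiber is the set of records sharing the same evidence. It is
    "mixed" when those records call for more than one action.
    Returns (fraction_of_mixed_fibers, sorted_list_of_mixed_evidence).
    """
    fibers = defaultdict(set)
    for evidence, action in records:
        fibers[evidence].add(action)
    if not fibers:
        return 0.0, []
    mixed = sorted(e for e, actions in fibers.items() if len(actions) > 1)
    return len(mixed) / len(fibers), mixed


def is_complete(records):
    """True when every evidence value determines a single action."""
    return mixed_fiber_fraction(records)[0] == 0.0
```

For example, if two candidates both score "pass" on the benchmark but one should be deployed and the other rejected, the "pass" fiber is mixed and the benchmark alone cannot settle the decision.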
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
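Scoring rationale: The paper concerns the deployment completeness of benchmarks, that is, whether benchmark evidence suffices to determine a deployment action, and touches on statistical coverage and decision theory. All of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) are entirely unrelated to the paper, which does not mention multimodal large models, world models, reinforcement learning, or related algorithms. Every keyword therefore has a relevance of 0.
Keywords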
deployment-complete benchmarking, benchmark completeness, evidence fiber, conformal coverage, certification, decision ambiguity, deployment action
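Translated Abstract
We present Fuzzy PyTorch, a framework for rapidly evaluating numerical variability in deep learning (DL) models. As deep learning is applied to a growing range of tasks, understanding the variability introduced by floating-point arithmetic is essential to ensuring that models are robust and reliable. Tools for assessing this variability must be scalable and efficient, and must integrate seamlessly with existing frameworks while keeping code modifications to a minimum. Fuzzy PyTorch achieves this by integrating stochastic arithmetic into PyTorch. At its core is a new library, Probabilistic Rounding with Instruction Set Management, which interfaces with the numerical analysis compiler Verificarlo. The library provides a stochastic rounding mode and a new mode, up-down rounding. Comparative evaluations show that Fuzzy PyTorch preserves model performance and reduces runtime by 5x to 60x compared with the state-of-the-art tool Verrou. We further demonstrate scalability by running models with between 1 million and 341 million parameters, confirming that the framework applies to both small and large deep learning architectures. Overall, Fuzzy PyTorch offers an efficient, scalable, and practical solution for assessing numerical variability in deep learning, allowing researchers and practitioners to quantify and manage floating-point uncertainty without sacrificing performance or computational efficiency.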
Abstract
We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such variability must be scalable, efficient, and integrate seamlessly with existing frameworks while minimizing code modifications. Fuzzy PyTorch enables this by integrating stochastic arithmetic into PyTorch through Probabilistic Rounding with Instruction Set Management, a novel library interfacing with Verificarlo, a numerical analysis compiler. The library offers stochastic rounding mode and a novel mode; up-down rounding. Comparative evaluations show Fuzzy PyTorch maintains model performance and achieves runtime reductions of 5x to 60x versus Verrou, a state-of-the-art tool. We further demonstrate scalability by running models from 1 to 341 million parameters, confirming applicability across small and large DL architectures. Overall, Fuzzy PyTorch provides an efficient, scalable, and practical solution for assessing numerical variability in deep learning, enabling researchers and practitioners to quantify and manage floating-point uncertainty without compromising performance or computational efficiency.
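To make the idea of stochastic rounding concrete, the following sketch rounds a value to a coarse grid, choosing the upper neighbour with probability equal to the fractional distance from the lower one, so the result is unbiased in expectation. It is a plain-Python illustration only: it does not use Fuzzy PyTorch, Verificarlo, or Verrou, it does not attempt the up-down mode (which the abstract does not define), and the function name `stochastic_round` and the `step` parameter are assumptions.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of step, randomly up or down.

    The probability of rounding up equals the fractional position of x
    between its two neighbouring grid points, so E[result] == x.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower  # in [0, 1)
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * step


# Repeating a computation under stochastic rounding and looking at the
# spread of the results gives a rough picture of its numerical variability.
rng = random.Random(0)
samples = [stochastic_round(0.3, 0.25, rng) for _ in range(1000)]
spread = (min(samples), max(samples))
```

Values that already lie on the grid are returned unchanged, since their fractional part is zero.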
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
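Scoring rationale: The paper presents Fuzzy PyTorch, a tool for evaluating numerical variability in deep learning models, and focuses on stochastic rounding in floating-point arithmetic and numerical stability analysis. It is entirely unrelated to the given keywords (unified models, world models, multimodal large language models, computer vision, multimodal, model-based reinforcement learning, OPD, reinforcement learning, GRPO). Every keyword has a relevance of 0.
Keywords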
Fuzzy PyTorch, numerical variability, deep learning, stochastic arithmetic, probabilistic rounding, floating-point uncertainty, scalability
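Translated Abstract
We test whether the optimal learning-rate schedule depends on bit-width in from-initialisation quantisation-aware training (QAT) of sub-100M-parameter decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x learning-rate magnitude x model size x random seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds that the optimal warmdown fraction is 33% in every (bit-width, size) cell. The primary hypothesis, that INT6 QAT needs a different schedule from higher-precision training, is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null hypothesis along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes. The INT6 penalty follows a log-linear scaling law, and its fit on Phase 2 correctly predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than at higher precision: at 50M and 100M, wd33 is decisively better (paired z of about 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference, and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore not a clean wd10 region but a transition from a noise-dominated regime below 50M to a decisive wd33 regime at 50M and above. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): before warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio of about 1.04). Practical recommendation: at sub-100M scale, tune the learning-rate schedule once at FP16 and apply it unchanged to INT8/INT6 QAT; for INT4 at 50M and above, use wd33; for INT4 below 50M, the schedule choice is within the noise.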
Abstract
We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100M, 5 seeds) finds the optimal warmdown is 33% at every (bit-width, size) cell. The primary hypothesis -- that INT6 QAT requires a different schedule than higher-precision training -- is falsified at FP16/INT8/INT6. A 625-run follow-up (Phase 5) probes the null along five axes: optimiser (AdamW), schedule shape (cosine), training length (up to 9x more iterations), an extended size sweep (5M-350M), and an INT4 sweep from 3M to 100M. The null is robust under all three setup changes. The INT6 penalty follows a log-linear scaling law whose fit on Phase 2 predicts the five held-out Phase 5 sizes (5M, 8M, 175M, 250M, 350M) within their 95% prediction intervals (5/5). For INT4 the picture is sharper than the higher precisions: at 50M and 100M, wd33 is decisively optimal (paired z ~ 12-15, 10/10 seeds); below 50M, across the six tested sizes from 3M to 30M, no individual size shows a statistically significant schedule preference and the per-size mean penalty oscillates within seed-level noise. The boundary is therefore a transition between a noise-dominated regime below 50M and a decisive wd33 regime at and above 50M, not a clean wd10 region. A weight-to-grid-distance probe falsifies the simplest mechanism for the FP16/INT8/INT6 null result (rapid grid-snapping): pre-warmdown, INT6-QAT weights sit at essentially the same distance from the INT6 grid as FP16 weights (ratio ~ 1.04). Practical recommendation: at sub-100M scale, tune the LR schedule once at FP16 and apply unchanged to INT8/INT6 QAT; for INT4 at 50M+ use wd33; for INT4 below 50M the schedule choice is in the noise.
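As an aid to reading labels such as wd33, the sketch below shows one common way to define a warmdown fraction: hold the learning rate at its peak, then decay it linearly toward zero over the final fraction of training. The abstract does not state the exact decay shape or whether a warmup phase precedes it, so the linear form, the absence of warmup, and the function name `lr_at` are all assumptions.

```python
def lr_at(step, total_steps, peak_lr, warmdown_frac=0.33):
    """Learning rate at `step` (0-based) for a constant-then-linear schedule.

    The final `warmdown_frac` of the steps decays linearly toward zero;
    all earlier steps use `peak_lr`.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if not 0.0 <= warmdown_frac <= 1.0:
        raise ValueError("warmdown_frac must lie in [0, 1]")
    warmdown_start = int(round(total_steps * (1.0 - warmdown_frac)))
    if step < warmdown_start:
        return peak_lr
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - warmdown_start)


# wd33 over 100 steps: constant through step 66, then linear decay.
schedule = [lr_at(s, 100, 3e-4) for s in range(100)]
```

Under this definition, following the abstract's recommendation would mean reusing the same `warmdown_frac` across FP16, INT8, and INT6 runs rather than re-tuning it per bit-width.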
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
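Scoring rationale: The paper studies how the learning-rate schedule relates to bit-width in quantisation-aware training (QAT) of sub-100M-parameter decoder language models, covering FP16, INT8, INT6, and INT4 precision. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to this topic: the paper does not involve multimodality, vision, world models, reinforcement learning, or unified models. Every keyword therefore has a relevance of 0.
Keywords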
quantisation-aware training, learning-rate schedule, bit-width, decoder language models, sub-100M, INT4, INT6, FP16
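Translated Abstract
We present a dataset of adversarial malware samples derived from the public RawMal-TF collection of real-world malware binaries. Using a suite of adversarial malware generators, we build two sets of adversarial PE files: 44,347 family-labelled samples and 33,596 type-labelled samples, which achieve evasion rates of 98.35% and 92.20% respectively against the EMBER classifier. Every adversarial binary comes with detailed metadata, including EMBER scores and VirusTotal classifications. Through a series of training experiments, we further demonstrate how susceptible malware classification pipelines are to data poisoning attacks. In the family-labelled dataset, injecting fully mislabelled adversarial samples amounting to only 0.5% of the training data raises the evasion rate against the re-trained classifier from 26.1% to 92.8%. The dataset is publicly released to support future research on adversarial malware, poisoning attacks, and the robustness of machine-learning-based malware detection systems.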
Abstract
We present a dataset of adversarial malware samples derived from the public RawMal-TF collection of real-world malware binaries. Using a suite of adversarial malware generators, we construct two sets of adversarial PE files: 44,347 family-labelled samples and 33,596 type-labelled samples, achieving evasion rates of 98.35 % and 92.20 % against the EMBER classifier, respectively. Each adversarial binary is accompanied by detailed metadata, including EMBER scores and VirusTotal classifications. We further demonstrate the susceptibility of malware classification pipelines to data poisoning attacks through a series of training experiments. Injecting fully mislabelled adversarial samples representing only 0.5 % of the training data in the family-labelled dataset increases the evasion rate against the re-trained classifier from 26.1 % to 92.8 %. The dataset is publicly released to facilitate future research on adversarial malware, poisoning attacks, and the robustness of machine-learning-based malware detection systems.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
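Scoring rationale: The paper builds an adversarial malware dataset and covers malware generation, detection evasion, and the evaluation of poisoning attacks. It is entirely unrelated to the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). Those keywords all point toward multimodal large models, representation learning, world models, and reinforcement learning, whereas the paper belongs to cybersecurity and adversarial machine learning, with no connection to any of them.
Keywords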
adversarial malware, dataset, evasion, poisoning, PE files, EMBER classifier, VirusTotal
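Translated Abstract
The effectiveness of multi-agent large language model (LLM) deliberation depends not only on the agents' individual predictions but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin-Johnsen (FJ) opinion dynamics, a tractable model for analyzing stubbornness, influence, and opinion change in multi-agent systems that captures empirically observed deliberation patterns. We show that the FJ parameters are input-dependent, which turns multi-agent deliberation into a mixture of experts. This view implies that multi-agent systems can outperform single agents and static ensembles when the routing reflects agent competence. Because competence is latent in practice, we analyze how influence is established through observable proxies: an agent's self-assessed confidence, its perceived confidence, and its initial agreement with the views of other agents.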
Abstract
The effectiveness of multi-agent LLM deliberation depends not only on the agents' individual predictions, but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin-Johnsen (FJ) opinion dynamics, a tractable model for analyzing stubbornness, influence, and opinion change in multi-agent systems that captures empirically observed deliberation patterns. We show that the FJ parameters are input-dependent, turning multi-agent deliberation into a mixture of experts. This perspective implies that multi-agent systems can outperform single agents and static ensembles when routing reflects agent competence. Since competence is latent in practice, we analyze how influence is established through observable proxies: agents' self-assessed confidence, their perceived confidence, and initial alignment with other agents' views.
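For reference, the classical Friedkin-Johnsen update mixes each agent's initial opinion with a weighted average of the current opinions of the agents it listens to. The sketch below uses fixed parameters and one common convention, in which `stubbornness[i]` is the weight agent i keeps on its initial opinion; the paper's point that these parameters depend on the input is not modelled here, and the function names `fj_step` and `fj_run` are assumptions.

```python
def fj_step(x, x0, W, stubbornness):
    """One Friedkin-Johnsen update.

    x: current opinions, x0: initial opinions,
    W: row-stochastic influence matrix (W[i][j] = weight i puts on j),
    stubbornness[i] in [0, 1]: weight agent i keeps on its initial opinion.
    """
    n = len(x)
    return [
        stubbornness[i] * x0[i]
        + (1.0 - stubbornness[i]) * sum(W[i][j] * x[j] for j in range(n))
        for i in range(n)
    ]


def fj_run(x0, W, stubbornness, rounds):
    """Iterate the update for a fixed number of deliberation rounds."""
    x = list(x0)
    for _ in range(rounds):
        x = fj_step(x, x0, W, stubbornness)
    return x
```

Setting every stubbornness value to 1 freezes all opinions at their initial values, while setting them to 0 reduces the model to repeated weighted averaging.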
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
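Scoring rationale: The paper studies opinion dynamics (the Friedkin-Johnsen model) in multi-agent LLM deliberation, focusing on agent stubbornness, influence, and opinion change; it belongs to multi-agent systems and LLM collaboration. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to the paper: it does not involve multimodality, unified models, world models, computer vision, model-based reinforcement learning, OPD (possibly a term of art from another context), reinforcement learning, or GRPO (a reinforcement learning method). Every keyword therefore has a relevance of 0.
Keywords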
multi-agent systems, LLM deliberation, Friedkin-Johnsen opinion dynamics, mixture of experts, stubbornness, influence, confidence
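Translated Abstract
Recent research on automated essay scoring (AES) increasingly uses pretrained Transformer models, but these models are usually pretrained on general-domain English and may not adequately represent the writing of second-language learners. This study examines whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves Transformer-based AES for English proficiency tests. We apply DAPT to three Transformer encoders and evaluate them on the FCE and IELTS datasets in both in-domain scoring and few-shot cross-dataset transfer. Full-corpus DAPT produces mixed results across models, datasets, and evaluation metrics. Further analysis suggests that these mixed effects stem partly from mismatches in proficiency level, genre, and communicative purpose between EFCAMDAT and the downstream datasets. A proficiency-based ablation shows that targeted DAPT on subsets aligned with the Common European Framework of Reference for Languages (CEFR) improves downstream scoring more reliably than full-corpus DAPT, especially for FCE with B1-B2 data. These gains do not, however, consistently improve cross-dataset transfer. Overall, the findings suggest that continued pretraining on a learner-writing corpus can benefit in-domain AES for English assessment when the pretraining data is sufficiently aligned with the downstream assessment setting. It does not automatically improve transferability across different English proficiency test datasets.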
Abstract
Recent automated essay scoring (AES) studies increasingly use pretrained transformer models, but these models are usually pretrained on general-domain English and may under-represent second-language learner writing. This study investigates whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves transformer-based AES for English proficiency tests. We apply DAPT to three transformer encoders and evaluate them on FCE and IELTS in both in-domain scoring and few-shot cross-dataset transfer. Full-corpus DAPT produces mixed results across models, datasets, and metrics. Further analyses suggest that these mixed effects are partly explained by mismatches in proficiency, genre, and communicative purpose between EFCAMDAT and the downstream datasets. A proficiency-based ablation shows that targeted DAPT using CEFR-aligned subsets improves downstream scoring more reliably than full-corpus DAPT, especially for FCE with B1--B2 data. However, these gains do not consistently improve cross-dataset transfer. Overall, the findings suggest that continued pretraining on a learner-writing corpus can benefit in-domain AES for English assessment when the pretraining data is sufficiently aligned with the downstream assessment settings. However, it does not automatically improve transferability across different English proficiency test datasets.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
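Scoring rationale: The paper studies automated essay scoring (AES). It applies domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus to pretrained transformer models and evaluates them on the FCE and IELTS datasets. The work falls squarely within text scoring in natural language processing and does not involve multimodality, vision, world models, reinforcement learning, unified models, OPD, or GRPO. All keywords are unrelated to the paper's topic, so every relevance score is 0.
Keywords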
automated essay scoring, continued pretraining, domain adaptation, learner corpus, EFCAMDAT, transformer, English proficiency tests
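Translated Abstract
Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has learned, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full "white-box" access to model internals. We propose Contrastive Decoding Diffing (CDD), a model diffing method that operates only on output-level logit distributions, with no weight access, no layer selection, and no per-model tuning, and that nevertheless recovers implanted facts. CDD rests on three core ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between the finetuned and base models at every decoding step. A single default configuration recovers implanted facts verbatim across four architectures (1B to 32B parameters), including exact drug names, vote counts, physical measurements, and procedural details. Despite having less access, it uniformly outperforms ADL and runs about 170x faster. CDD also surfaces unintended data pipeline artifacts: a fictional persona that the LLM data generator introduced through mode collapse leaked into the model weights and was extracted by CDD, which to our knowledge constitutes the first end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery on all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. The success of CDD as a grey-box method that outperforms white-box baselines underscores its practical value for transparency and accountability in AI systems.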
Abstract
Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces of the finetuning domain; the state-of-the-art Activation Difference Lens (ADL) recovers a vague domain-level description but requires full "white-box" access to model internals. We introduce Contrastive Decoding Diffing (CDD), a model diffing method that operates on output-level logit distributions only, with no weight access, no layer selection, and no per-model tuning, yet recovers implanted facts. CDD consists of three ideas: bypassing the chat template to expose the raw finetuning prior, seeding generation with maximally vague pre-fills, and amplifying the logit-space difference between finetuned and base models at each decoding step. A single default configuration recovers implanted facts verbatim -- exact drug names, vote counts, physical measurements, and procedural details -- across four architectures (1B--32B parameters), uniformly outperforming ADL despite less access and running ~170x faster. Furthermore, CDD surfaces unintended data pipeline artifacts: a fictional persona introduced by the LLM data generator via mode collapse leaked into model weights and was extracted by CDD, constituting to our knowledge the first demonstrated end-to-end fingerprinting chain from data generator artifact to model weights to recovered output. We validate on real-domain finetuning settings, achieving near-perfect recovery across all single-dataset non-CoT variants and correctly identifying all four datasets in the mixed-dataset setting. CDD's success as a grey-box method outperforming white-box baselines underscores its practical utility for transparency and accountability in AI systems.
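The third idea, amplifying the logit-space difference between the finetuned and base models at each decoding step, can be sketched for a single step as below. The abstract does not give CDD's exact scoring rule, so the linear form `finetuned + alpha * (finetuned - base)`, the `alpha` parameter, and the function names are assumptions in the spirit of standard contrastive decoding, not the authors' implementation.

```python
def amplified_logits(finetuned, base, alpha=1.0):
    """Push the finetuned logits further along their offset from the base.

    finetuned, base: per-token logits from the two models for the same
    prefix. alpha = 0 reproduces the finetuned model unchanged.
    """
    if len(finetuned) != len(base):
        raise ValueError("logit vectors must cover the same vocabulary")
    return [f + alpha * (f - b) for f, b in zip(finetuned, base)]


def pick_token(finetuned, base, alpha=1.0):
    """Greedy choice of the next token id under the amplified scores."""
    scores = amplified_logits(finetuned, base, alpha)
    return max(range(len(scores)), key=scores.__getitem__)
```

Tokens whose logit rose during finetuning gain an extra boost, so content specific to the finetuning data can win the argmax even when the finetuned model alone would still prefer a generic continuation.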
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
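Scoring rationale: The paper studies the recovery of verbatim implanted content from finetuned language models through Contrastive Decoding Diffing, and belongs to model auditing and interpretability. Its focus is the memorization effect of finetuning in language models (LLMs) and output-level model diffing. It does not involve multimodality, vision (CV), unified models, world models, reinforcement learning (RL, model-based RL, GRPO), or OPD (possibly an optimization strategy or another specific method). None of the given keywords relates to the paper's core content, so every keyword scores 0.
Keywords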
Contrastive Decoding Diffing, Model Diffing, Finetuning Prior, Verbatim Content Recovery, Activation Difference Lens, Language Model Auditing, Logit-space Analysis
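Translated Abstract
We study how an e-commerce firm should make real-time fulfillment decisions in a two-layer distribution network when multi-item customer orders arrive sequentially and future demand is unknown. The central managerial tension is whether to use scarce front distribution center (FDC) inventory to save on current fulfillment cost or to preserve that inventory for future orders that may be more valuable to serve locally. We formulate an adversarial online model with multiple FDCs, one regional distribution center (RDC), multi-unit multi-item orders, and variable costs that are item-specific and time-varying. Our theoretical goal is to characterize the conditions under which simple, interpretable, and implementable fulfillment rules perform almost as well as an optimal clairvoyant planner. We propose a family of Gated Priority-based Greedy policies, derive competitive-ratio guarantees under both time-varying and time-invariant cost structures, and establish matching or near-matching lower bounds for any online algorithm. Numerical experiments show that the proposed policies perform strongly relative to generalized myopic and forecast-based benchmark policies. The analysis offers guidance on several managerial questions: when local inventory should be protected, when splitting an order is worth the fixed-cost burden, and how the relative magnitudes of fixed and variable costs determine the value of more sophisticated optimization.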
Abstract
We study how an e-commerce firm should make real-time fulfillment decisions in a two-layer distribution network when multi-item customer orders arrive sequentially and future demand is unknown. The central managerial tension is whether to use scarce front distribution center (FDC) inventory to save current fulfillment cost or preserve that inventory for future orders that may be more valuable to serve locally. We formulate an adversarial online model with multiple FDCs, one regional distribution center (RDC), multi-unit multi-item orders, and item-specific and time-varying variable costs. Our theoretical objective is to characterize when simple, interpretable, and implementable fulfillment rules can perform nearly as well as an optimal clairvoyant planner. We develop a family of Gated Priority-based Greedy policies, derive competitive-ratio guarantees under both time-varying and time-invariant cost structures, and establish matching or near-matching lower bounds for any online algorithm. Numerical experiments show that the proposed policies perform strongly relative to generalized myopic and forecast-based benchmarks. The analysis yields managerial guidance on when local inventory should be protected, when splitting orders is worth the fixed-cost burden, and how the relative magnitudes of fixed and variable costs determine the value of more sophisticated optimization.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
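Scoring rationale: The paper studies real-time order fulfillment decisions in a two-layer e-commerce distribution network and belongs to operations research, supply chain management, and online algorithms. It covers multi-item orders, inventory allocation, greedy policies, and competitive-ratio analysis, and is entirely unrelated to the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). Those keywords all point toward frontier AI directions such as multimodal large models, representation learning, reinforcement learning, and world models, none of which the paper discusses. Every keyword therefore scores 0.
Keywords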
order fulfillment, two-layer distribution network, online algorithm, competitive ratio, gated priority-based greedy policy, multi-item orders, inventory allocation
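Translated Abstract
Recent advances in uncertainty quantification increasingly emphasise the distinction between aleatory and epistemic uncertainty in machine learning, which motivates the need for more unified frameworks. Yet despite much progress in producing reliable predictions, existing methods often lack rigorous guarantees when generalising beyond the training domain. We propose a conformalised imprecise inference framework for robust extrapolation that is model-agnostic and augments predictive models with imprecision and distance awareness. The proposed approach produces imprecise predictions (probability boxes) that remain valid under distributional shift, maintaining coverage while adaptively widening the uncertainty in extrapolation regimes. Experiments on synthetic and benchmark datasets show greater robustness and reliable coverage compared with standard probabilistic approaches, particularly when data is limited.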
Abstract
Recent advances in uncertainty quantification increasingly emphasise the distinction between aleatory and epistemic uncertainty in machine learning, motivating the need for more unified frameworks. However, despite much progress in producing reliable predictions, existing methods often lack rigorous guarantees when generalising beyond the training domain. We propose a conformalised imprecise inference framework for robust extrapolation, which is model-agnostic and augments predictive models with imprecision and distance awareness. The proposed approach yields imprecise predictions (probability boxes) that remain valid under distributional shift, maintaining coverage while adaptively expanding uncertainty in extrapolation regimes. Experiments on synthetic and benchmark datasets demonstrate improved robustness and reliable coverage compared to standard probabilistic approaches, particularly under limited data.
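As a point of reference for the "conformalised" part of the abstract above, the following is a minimal sketch of standard split conformal prediction for regression. It is not the paper's method: it produces plain intervals rather than probability boxes, has no distance awareness, and its coverage guarantee relies on calibration and test data being exchangeable (so it says nothing about the distributional-shift setting the paper targets). The predictor `predict`, the toy data generator `sample`, and all constants are hypothetical.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def predict(x):
    """Hypothetical pre-fitted point predictor; any model could stand in."""
    return 2.0 * x


def sample(rng, n):
    """Toy data: y = 2x + standard Gaussian noise, x uniform on [0, 1]."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))
    return data


rng = random.Random(0)
calibration = sample(rng, 500)
test_set = sample(rng, 2000)

alpha = 0.1
# Nonconformity score: absolute residual on held-out calibration data.
scores = [abs(y - predict(x)) for x, y in calibration]
q = conformal_quantile(scores, alpha)

# The prediction interval for a new x is [predict(x) - q, predict(x) + q].
covered = sum(abs(y - predict(x)) <= q for x, y in test_set)
coverage = covered / len(test_set)
print(f"half-width={q:.3f}  empirical coverage={coverage:.3f}")
```

The interval half-width `q` is constant in `x` here; an approach that widens uncertainty away from the training region, as the abstract describes, would need a score or a correction that depends on distance to the calibration data.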
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
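Scoring rationale: The paper's title and abstract focus on uncertainty quantification, conformal prediction, imprecise inference, and robust extrapolation, addressing generalization under distributional shift and limited data. All of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) concern topics such as multimodal large models, world models, and reinforcement learning, none of which the paper touches on, so each keyword receives a relevance score of 0.
Keywords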
conformal prediction, imprecise inference, robust extrapolation, distributional shift, uncertainty quantification, aleatory uncertainty, epistemic uncertainty, limited data
摘要翻译
大规模Transformer的训练和部署日益受到跨加速器传输激活值(activations)、梯度(gradients)和优化器状态(optimizer states)的限制。低比特量化(low-bit quantization)提供了一种自然的解决方案,但Transformer的激活值往往呈现重尾分布且以异常值(outliers)为主导,这使得简单量化的损失极大。我们证明,这一困难不仅源于量化器本身,也源于架构特性。具体而言,残差连接(residual connections)会在训练过程中驱使Transformer激活值偏离高斯性(Gaussianity)。通过在有残差和无残差Transformer之间进行受控比较,我们表明,在低精度下,这一效应会导致残差模型出现显著更高的量化误差和精度下降。我们通过超峰度(excess kurtosis)分析解释了这一现象:残差混合会放大非高斯性,而无残差模型中的密集混合则会压缩非高斯性。随后我们证明,通过正交初始化(orthogonal initialization)、谱优化(spectral optimization)或二阶优化(second-order optimization)以及深度感知的注意力温度(attention temperature)缩放,可以使无残差Transformer具备可训练性。在语言任务中,虽然全精度性能略有下降,但这些模型保持了近似高斯的激活值,并对低比特量化展现出显著增强的鲁棒性。我们的研究结果揭示了Transformer设计中精度与可压缩性之间的权衡,并推动了面向量化友好型基础模型的架构级方法。
Abstract
Large-scale transformer training and deployment are increasingly constrained by the transfer of activations, gradients, and optimizer states across accelerators. Low-bit quantization offers a natural remedy, but transformer activations are often heavy-tailed and outlier-dominated, making simple quantization highly lossy. We show that this difficulty is not only a property of the quantizer, but also of the architecture. Specifically, residual connections can drive transformer activations away from Gaussianity during training. Using controlled comparisons between residual and residual-free transformers, we demonstrate that this effect leads to substantially higher quantization error and accuracy degradation at low precision in residual models. We explain the phenomenon through an excess kurtosis analysis, showing that residual mixing can amplify non-Gaussianity, whereas dense mixing in residual-free models contracts non-Gaussianity. We then show that residual-free transformers can be made trainable using orthogonal initialization, spectral or second-order optimization, and depth-aware scaling of attention temperature. In language tasks, while there is a small drop in full precision performance, these models retain near-Gaussian activations and exhibit significantly improved robustness to low-bit quantization. Our results identify an accuracy-compressibility trade-off in transformer design and motivate architecture-level approaches to quantization-friendly foundation models.
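The link the abstract draws between heavy tails and quantization loss can be illustrated with a small, self-contained experiment. This is only a sketch under stated assumptions: the "heavy-tailed" sample is a synthetic Gaussian with 1% of entries scaled by 20, and the quantizer is plain symmetric absmax rounding, neither of which is taken from the paper. The helper names `excess_kurtosis`, `quantize_absmax`, and `relative_mse` are hypothetical.

```python
import random


def excess_kurtosis(xs):
    """Sample excess kurtosis: m4 / m2^2 - 3 (zero for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) - 3.0


def quantize_absmax(xs, bits):
    """Symmetric uniform quantization scaled by the largest magnitude."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 positive levels for 4 bits
    scale = max(abs(x) for x in xs) / levels
    return [round(x / scale) * scale for x in xs]


def relative_mse(xs, bits):
    """Quantization error energy divided by signal energy."""
    qs = quantize_absmax(xs, bits)
    err = sum((x - q) ** 2 for x, q in zip(xs, qs))
    return err / sum(x * x for x in xs)


rng = random.Random(0)
n = 20000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Synthetic outlier-dominated sample: 1% of entries are scaled up 20x.
heavy = [rng.gauss(0.0, 1.0) * (20.0 if rng.random() < 0.01 else 1.0)
         for _ in range(n)]

for name, xs in (("gaussian", gaussian), ("heavy-tailed", heavy)):
    print(f"{name:>12}: excess kurtosis={excess_kurtosis(xs):8.2f}  "
          f"4-bit relative MSE={relative_mse(xs, 4):.4f}")
```

With absmax scaling, a few large outliers stretch the quantization grid, so most ordinary values collapse onto very few levels; that is why the high-kurtosis sample loses more signal energy at the same bit width.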
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
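Scoring rationale: The paper's title and abstract focus on how residual connections affect the activation distribution in transformer architectures and on the advantages of residual-free transformers under low-bit quantization. The content covers quantization, non-Gaussian activations, and residual connections, and has no direct connection to any of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). Each keyword is scored 0.
Keywords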
Residual-Free Transformers, Quantization, Activations, Kurtosis, Low-bit quantization, Accuracy-compressibility trade-off, Orthogonal initialization, Spectral optimization
摘要翻译
人工队友的速度和准确性从根本上改变了人机集成(Human-AI integration)的失败状态。高速AI干预可能引发反射性盲从(reflexive blind compliance),而延迟干预则可能诱发模糊的认知冲突(cognitive conflict)。本研究探讨了任务内AI助手的基本特征——快速/低准确度AI(Fast/Less-Accurate, FLA-AI)与慢速/准确AI(Slow/Accurate, SA-AI)——如何影响虚拟现实(Virtual Reality)无人机任务中协作脑机接口(Collaborative Brain-Computer Interface, cBCI)团队的协同效应。17名操作员在高认知负荷下完成了连续搜索任务,同时使用二维自适应黎曼Oracle(2D Adaptive Riemannian Oracle)映射了他们的空间协方差(spatial covariance)。结果从数学上证明,AI的时序决定了团队失败的机制。快速AI引发了即时的盲从;受欺骗时人类准确率降至50.2%,纯行为团队(N=8)的规模未能超过74.1%。相比之下,慢速AI诱发了延迟的认知冲突;人类出现犹豫(准确率61.1%),但N=8的行为团队最终恢复至100.0%。关键在于,黎曼Oracle(Riemannian Oracle)在数学上适应了这些状态:它严格限制时间窗口(<0.8秒)以拦截快速的反射性顺从,同时扩大窗口(>1.2秒)以捕获延迟的认知冲突。通过混合融合(Hybrid Fusion)整合这些孤立的真实信号,成功挽救了快速AI团队(N=8时提升+7.6%),并显著加速了较小规模慢速AI团队的恢复(N=4时提升+6.9%)。这些发现证明,cBCI协同效应高度依赖于信任的时间动态(temporal dynamics of trust),为设计动态门控的人机系统(Human-AI systems)提供了关键框架。
Abstract
The speed and accuracy of an artificial teammate fundamentally alter the failure states of Human-AI integration. While high-speed AI interventions risk inducing reflexive blind compliance, delayed interventions can induce ambiguous cognitive conflict. This study investigates how the fundamental characteristics of an in-task AI assistant, Fast/Less-Accurate (FLA-AI) versus Slow/Accurate (SA-AI), impact the synergy of Collaborative Brain-Computer Interface (cBCI) teams in a Virtual Reality drone task. Seventeen operators completed continuous search tasks under high cognitive workload while their spatial covariance was mapped using a 2D Adaptive Riemannian Oracle. The results mathematically demonstrate that AI timing dictates the mechanism of team failure. Fast AI induced instant, blind compliance; human accuracy under deception collapsed to 50.2%, and pure behavioural teams (N=8) failed to scale beyond 74.1%. In contrast, Slow AI induced delayed cognitive conflict; humans hesitated (61.1% accuracy), but N=8 behavioural teams eventually recovered to 100.0%. Crucially, the Riemannian Oracle mathematically adapted to these states: it heavily restricted temporal windows (< 0.8 s) to intercept fast reflexive compliance, while widening windows (> 1.2 s) to capture delayed cognitive conflict. Integrating these isolated veridical signals via Hybrid Fusion successfully rescued the Fast AI team (+7.6% at N=8) and significantly accelerated the recovery of smaller Slow AI teams (+6.9% at N=4). These findings prove that cBCI synergy is heavily contingent on the temporal dynamics of trust, providing a critical framework for designing dynamically gated Human-AI systems.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
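Scoring rationale: The paper studies the timing dependence of trust in Human-AI teams, specifically how the speed and accuracy of an AI assistant affect collaborative brain-computer interface (cBCI) performance. All given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) belong to AI/ML areas such as multimodal large models, world models, and reinforcement learning, whereas the paper does not involve these concepts and falls under cognitive science and human-computer interaction, so each keyword receives a relevance score of 0.
Keywords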
Human-AI teams, trust timing dependencies, cBCI, neuro-decoupling, speed-accuracy tradeoff, collaborative brain-computer interface, virtual reality
摘要翻译
准确预测晶体性质对于加速材料发现至关重要,但常受限于稀缺的标注数据和昂贵的理论计算。为缓解这一问题,我们提出UNATE(无监督原子嵌入,Unsupervised Atomic Embedding)框架,该框架利用从无标注晶体结构中提取的结构信息。UNATE将无监督去噪自编码器(denoising autoencoder)与自监督对比学习(self-supervised contrastive learning)相结合,以学习鲁棒的原子表示,并将其作为下游性质预测的输入特征。实验结果表明,用UNATE预训练的节点嵌入替换原始原子序数,在全数据基线上实现了2.7%的提升。值得注意的是,在标注数据有限的场景下,这种优势更为显著:当仅使用25%的标注数据时,性能提升可达10%。
Abstract
Accurately predicting crystal properties is critical for accelerating materials discovery, but it is often limited by scarce labeled data and costly theoretical calculations. To alleviate this, we propose UNATE (Unsupervised Atomic Embedding), a framework that leverages structural information extracted from unlabeled crystal structures. UNATE integrates an unsupervised denoising autoencoder with self-supervised contrastive learning to learn robust atomic representations, which are then used as input features for downstream property prediction. Experimental results show that replacing raw atomic numbers with UNATE-pretrained node embeddings yields a 2.7% improvement over the full-data baseline. Notably, the benefits become more pronounced in scenarios with limited labeled data, reaching improvements of up to 10% when only 25% of the labeled data is used.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
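Scoring rationale: The paper studies property prediction for crystal structures using Unsupervised Atomic Embedding (UNATE), which involves autoencoders and contrastive learning. All given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relate to multimodal large models, world models, reinforcement learning, or computer vision, whereas the paper belongs to materials science and structural chemistry and is entirely unrelated to them, so each keyword is scored 0.
Keywords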
crystal structures, property prediction, unsupervised atomic embedding, denoising autoencoder, contrastive learning, node embeddings, materials discovery
摘要翻译
我们研究了$k$折交叉验证作为风险估计量的均方误差,特别关注其精度如何依赖于折数$k$。尽管交叉验证被广泛使用,但关于如何选择$k$的原则性指导仍基本缺失,这主要是由于各折误差估计之间复杂的依赖关系。为了获得清晰且可解释的结果,我们聚焦于二分类中的多数算法(majority algorithm),这是一种最小但非平凡的经验风险最小化过程。我们对其交叉验证行为进行了细粒度分析,表明即使是这一简单算法也会展现出微妙而精细的现象,而现有理论对此给出的界是宽松甚至空洞的。基于这一分析,我们引入了交叉验证风险估计的极小极大框架,并证明:当折数随样本量$n$增长时,没有任何经验风险最小化算法能够达到$O(1/n)$的极小极大均方误差;相反,一个阶为$\Omega(\sqrt{k}/n)$的下界是不可避免的。我们的结果揭示了交叉验证作为一种数据复用策略的根本局限性,澄清了先前理论工作中的空白与不准确之处,并将多数算法定位为一个自然的基准,任何对交叉验证的紧致分析都应能够解释该算法。
Abstract
We study the mean-squared error of $k$-fold cross-validation as a risk estimator, with particular emphasis on how its accuracy depends on the number of folds $k$. Despite the widespread use of cross-validation, principled guidance for choosing $k$ is largely absent, mainly due to the complex dependence between fold-wise error estimates. To obtain sharp and interpretable results, we focus on the majority algorithm in binary classification, a minimal yet nontrivial empirical risk minimization procedure. We provide a fine-grained analysis of its cross-validation behavior, showing that even this simple algorithm exhibits subtle and delicate phenomena for which existing theory provides loose and even vacuous bounds. Leveraging this analysis, we introduce a minimax framework for cross-validation risk estimation and prove that no empirical risk minimization algorithm can achieve an $O(1/n)$ minimax mean-squared error when the number of folds grows with the number of samples $n$; instead, a lower bound of order $\Omega(\sqrt{k}/n)$ is unavoidable. Our results reveal fundamental limitations of cross-validation as a data-reuse strategy, clarify gaps and inaccuracies in prior theoretical work, and position the majority algorithm as a natural benchmark that any tight analysis of cross-validation should be able to explain.
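The setting in the abstract above is simple enough to simulate directly. The sketch below estimates, by Monte Carlo, the mean-squared error of the $k$-fold cross-validation estimate for the majority algorithm on Bernoulli labels. Several choices are assumptions rather than details from the paper: ties are broken toward label 1, folds are contiguous, and the target is taken to be the true risk of the rule trained on the full sample. The function names are hypothetical, and the printed numbers are only a simulation, not a check of the paper's bounds.

```python
import random


def majority_label(labels):
    """Majority algorithm: predict the most frequent label (ties go to 1)."""
    return 1 if 2 * sum(labels) >= len(labels) else 0


def kfold_cv_risk(labels, k):
    """Average held-out 0-1 error of the majority rule over k contiguous folds."""
    n = len(labels)
    fold_errors = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        held_out = labels[lo:hi]
        train = labels[:lo] + labels[hi:]
        pred = majority_label(train)
        fold_errors.append(sum(y != pred for y in held_out) / len(held_out))
    return sum(fold_errors) / k


def cv_mse(p, n, k, trials, rng):
    """Monte Carlo MSE of the CV estimate against the true risk of the rule
    trained on the full sample, for i.i.d. Bernoulli(p) labels."""
    total = 0.0
    for _ in range(trials):
        labels = [1 if rng.random() < p else 0 for _ in range(n)]
        pred = majority_label(labels)
        true_risk = (1 - p) if pred == 1 else p
        total += (kfold_cv_risk(labels, k) - true_risk) ** 2
    return total / trials


rng = random.Random(0)
n, p = 40, 0.5
for k in (2, 5, 10, n):  # k = n is leave-one-out
    print(f"k={k:2d}  estimated MSE={cv_mse(p, n, k, 2000, rng):.4f}")
```

The interesting regime is a balanced label distribution (`p` near 0.5), where removing a fold can flip the majority vote and the fold-wise errors become strongly dependent.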
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
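Scoring rationale: The paper studies the mean-squared error of k-fold cross-validation in binary classification, focusing on the statistical properties and theoretical limits of cross-validation. It does not involve multimodal large models, world models, reinforcement learning, representation learning, or unified models. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to the paper's content, so each keyword is scored 0.
Keywords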
k-fold cross-validation, majority algorithm, binary classification, minimax framework, mean-squared error, empirical risk minimization, data-reuse
摘要翻译
我们提出了一种分支签名核求解器,用于由可能粗糙的激励信号的“单一观测轨迹”驱动的线性和非线性常微分方程——这种设定在地震工程、金融、生物学和结构健康监测中自然出现,其中激励仅被观测一次,且求解器必须尊重底层物理定律,而无需依赖系综实现。两个要素是新的。首先,一种“计数采样”构造将单一观测转化为一个由 \(N+1\) 条嵌套训练路径组成的层次族,分支签名核(branched signature kernel)可在其上求值;这使得原本为多实现回归问题设计的签名核机制能够作用于单轨迹观测。其次,一个核配置框架将待定解置于解的最高阶导数上(低阶导数通过对核积分恢复),或置于解本身(在对常微分方程进行 \(m\) 次积分之后)。我们证明了分支签名核的通用逼近定理,利用 Hairer–Kelly 同态通过时间扩展路径的几何签名来表达分支签名求值。离线求解器被扩展为一种流式测试/训练/重训练协议,在线性情形下具有闭式在线更新,在非线性情形下具有标量牛顿步。在六个基准(埃尔森特罗地震位移、索洛资本存量模型、分数布朗运动驱动的二阶常微分方程、受迫杜芬振荡器、变系数路径依赖的阿里亚斯强度退化振荡器,以及含噪声的仓本相位振荡器系统)上的数值实验表明,分支签名核求解器在所有场景下均能提供准确、稳定的预测。
Abstract
We develop a branched signature kernel solver for linear and nonlinear ordinary differential equations driven by a single observed trajectory of a possibly rough forcing signal, a setting that arises naturally in earthquake engineering, finance, biology, and structural health monitoring, where the forcing is observed exactly once and the solver must respect the underlying physical law without recourse to an ensemble of realizations. Two ingredients are new. First, a count-sampling construction turns the single observation into a hierarchical family of $N+1$ nested training paths on which the branched signature kernel can be evaluated; this allows the signature kernel machinery, originally designed for multi-realization regression problems, to operate on a single-trajectory observation. Second, a kernel-collocation framework places the ansatz either on the highest-order derivative of the solution (with lower derivatives recovered by integrating the kernel) or on the solution itself (after $m$-fold integration of the ODE). We prove a universal approximation theorem for the branched signature kernel, leveraging the Hairer-Kelly morphism to express branched signature evaluations through geometric signatures of time-extended paths. The offline solver is extended to a streaming Test/Train/Retrain protocol with closed-form online updates in the linear case and scalar Newton steps in the nonlinear case. Numerical experiments on six benchmarks (El-Centro earthquake displacement, the Solow capital-stock model, an fBM-driven second-order ODE, a forced Duffing oscillator, a path-dependent Arias-intensity-degraded oscillator with variable coefficients, and a noisy Kuramoto phase-oscillator system) show that the branched signature-kernel solver delivers accurate, stable predictions across all regimes.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
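Scoring rationale: This paper studies a branched-signature-kernel solver for ordinary differential equations driven by a single rough trajectory, with applications in earthquake engineering, finance, biology, and related fields. Its content belongs entirely to numerical analysis and differential-equation solving, and has no connection to any of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). Those keywords concern AI areas such as multimodal large models, world models, reinforcement learning, and computer vision, and the paper involves none of the related concepts, methods, or applications. All keywords therefore receive a relevance score of 0.
Keywords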
Branched Signature Kernel, ODE Solver, Rough Signals, Single-Trajectory, Kernel Collocation, Universal Approximation, Streaming Test/Train/Retrain
摘要翻译
成员推断攻击(Membership Inference Attacks, MIAs)是通过从数据中习得的模型或统计量,对训练数据中敏感信息泄露进行经验性评估的常用方法。MIA脆弱性通常通过一个二分类器的假阳性率(False Positive Rate, FPR)和真阳性率(True Positive Rate, TPR)来评估,该分类器试图预测特定样本是否存在于训练数据中。然而,为了可靠地估计TPR(尤其是在低FPR值条件下),需要大量观测数据,这在MIA场景中意味着需要众多目标模型,从而导致巨大的计算成本。为避免过高的计算需求,MIA得分通常会在多个个体和多个目标模型上取平均值。我们揭示了这种高效MIA评估流程中的两个关键缺陷。首先,我们证明,基于跨多个个体拼接的MIA得分来评估TPR(该方法常用于研究极低FPR区间下的脆弱性)在各样本的FPR上并未经过校准,这使得其作为差分隐私审计工具时不可靠。为解决此问题,我们提出了一种后处理方法,以有效校准不同样本间的FPR。其次,我们识别出Carlini等人(2022)提出的常用高效似然比攻击(Likelihood-Ratio Attack, LiRA)实现中存在有限总体偏差,这导致各样本脆弱性评估出现正向偏差。
Abstract
Membership inference attacks (MIAs) are popular methods for empirically assessing the leakage of sensitive information in the training data through models or statistics learned from the data. MIA vulnerability is often evaluated through the false positive rate (FPR) and true positive rate (TPR) of a binary classifier that tries to predict whether a particular sample was in the training data. However, reliably estimating the TPR, especially at low FPR values, requires many observations, which in the case of MIAs translates to many target models and thus a large computational cost. To avoid excessive compute requirements, the MIA scores are often averaged over multiple individuals and multiple target models. We demonstrate two key weaknesses in this efficient MIA evaluation pipeline. First, we show that evaluating the TPR based on MIA scores concatenated across multiple individuals, commonly used to study vulnerabilities in the very low FPR regime, is not calibrated across the per-sample FPRs. This makes it unreliable as a tool for auditing differential privacy. To solve this, we propose a post-processing method to effectively calibrate the FPR across different samples. Second, we identify a finite population bias in the commonly used efficient likelihood-ratio attack (LiRA) implementation proposed by Carlini et al. (2022), leading to a positive bias in the per-sample vulnerability.
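The first weakness described above, a pooled threshold that is not calibrated per sample, can be made concrete with a toy computation. The sketch below computes the TPR at a target FPR from attack scores, then pools two hypothetical samples whose non-member scores have different spreads and reports the FPR each one actually experiences at the pooled threshold. The score distributions, the threshold rule, and the function name `tpr_at_fpr` are illustrative assumptions; this is not the paper's calibration method, nor an implementation of LiRA.

```python
import random


def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR of the attack 'score > t', where t is the smallest observed
    non-member score whose empirical FPR does not exceed target_fpr."""
    neg = sorted(nonmember_scores, reverse=True)
    allowed = int(target_fpr * len(neg))  # max number of false positives
    # At most `allowed` non-member scores lie strictly above this threshold.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    tp = sum(s > threshold for s in member_scores)
    return tp / len(member_scores), threshold


rng = random.Random(0)
m = 5000
# Two hypothetical target samples with differently scaled attack scores.
neg_a = [rng.gauss(0.0, 1.0) for _ in range(m)]
pos_a = [rng.gauss(1.0, 1.0) for _ in range(m)]
neg_b = [rng.gauss(0.0, 3.0) for _ in range(m)]
pos_b = [rng.gauss(1.0, 3.0) for _ in range(m)]

# Pooled evaluation: concatenate scores and pick one global threshold.
pooled_tpr, t = tpr_at_fpr(pos_a + pos_b, neg_a + neg_b, 0.01)
fpr_a = sum(s > t for s in neg_a) / m
fpr_b = sum(s > t for s in neg_b) / m
pooled_fpr = (fpr_a + fpr_b) / 2

print(f"pooled: TPR={pooled_tpr:.4f} at FPR={pooled_fpr:.4f}")
print(f"per-sample FPR at the pooled threshold: A={fpr_a:.4f}  B={fpr_b:.4f}")
```

In this toy setup the pooled FPR meets the target, yet almost all false positives come from the sample with the wider score distribution, so the pooled number does not describe the FPR of either individual sample.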
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
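Scoring rationale: This paper studies the reliability of membership inference attack (MIA) evaluation, focusing on false-positive-rate calibration and finite population bias in differential privacy auditing. Its content belongs entirely to privacy, security, and machine learning auditing, and has no connection to any of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). Each keyword therefore receives a relevance score of 0.
Keywords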
Membership Inference Attack, Differential Privacy, False Positive Rate Calibration, Likelihood-Ratio Attack, Finite Population Bias, Privacy Auditing, Machine Learning Security
摘要翻译
消息传递神经网络(MPNNs)是学习图结构域表示的一种强大框架。然而,MPNNs中的权重仅作用于特征,限制了其捕捉结构模式的能力。我们提出了一种新颖的结构感知权重共享原则,该原则明确纳入了图结构固有的信息。权重直接由用户选择的图不变量(即节点置换下保持不变的函数)进行索引,从而能够在结构等价的子图之间实现系统性重用。我们介绍了ShareGNNs,该模型在一个简单的编码器-解码器架构中实例化了这一原则,从而产生了一种具有可学习邻接性和类似Transformer连接性的MPNN。我们证明,其表达能力至少与所选不变量的判别能力相当,从而提供了对模型复杂性的显式控制。在合成数据和真实世界数据上的实验,以及子图计数任务,均表明其相较于标准MPNNs具有一致的改进,超越1-WL测试的竞争性表达能力,以及可扩展至大规模数据集的能力。
Abstract
Message-passing neural networks (MPNNs) are a powerful framework for learning representations of graph-structured domains. However, weights in MPNNs act on features only, limiting their ability to capture structural patterns. We introduce a novel structure-aware weight sharing principle that explicitly incorporates information inherent to the graph structure. Weights are indexed directly by user-chosen graph invariants, i.e., functions preserved under node permutations, enabling systematic reuse across structurally equivalent subgraphs. We present ShareGNNs, which instantiate this principle within a simple encoder-decoder architecture, resulting in an MPNN with learnable adjacency and transformer-like connectivity. We show that their expressivity is at least as strong as the discriminative power of the chosen invariants, providing explicit control over the model complexity. Experiments on synthetic and real-world data, as well as subgraph counting tasks, demonstrate consistent improvements over standard MPNNs, competitive expressivity beyond the 1-WL test, and scalability to large datasets.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
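Scoring rationale: The paper's topic is weight sharing based on graph invariants in graph neural networks, which belongs to graph machine learning. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to its content. The paper does not involve multimodality, reinforcement learning, world models, or any related concept, so each keyword receives a relevance score of 0.
Keywords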
Graph Neural Networks, Message Passing, Weight Sharing, Graph Invariants, Structure-Aware, ShareGNNs, Subgraph Counting
摘要翻译
现实物理系统的特征在于跨多个长度和时间尺度的涌现相互作用,这对预测性机器学习(ML)模型构成了重大挑战。大多数科学ML模型仅关注狭窄范围内的相互作用。尽管机器学习力场(MLFF)能够提供接近量子级的精度,但普遍使用的消息传递层却忽略了长程多体效应。在此,我们引入多尺度结构集成(MuSE),这是一种分层模型,通过软粗粒化池化(Soft Coarse-Graining Pooling)从原子到粗粒节点的平滑分数分配构建粗粒表示,使MLFF模块能够在多个尺度上运行。MuSE具有架构无关性,并与SO3krates、MACE和PaiNN等MLFF模型耦合,适用于分子和材料。我们通过基于Hessian矩阵的基准测试、生物分子的折叠轨迹以及分子-石墨烯纳米结构中的能量分布展示了MuSE的强大能力——与近期其他长程ML模型不同,MuSE能够在相关尺度上准确捕捉量子力学相互作用。
Abstract
Realistic physical systems are characterised by emergent interactions across multiple length and time scales, posing a significant challenge for predictive machine learning (ML) models. Most scientific ML models focus on a narrow range of interactions. While machine learning force fields (MLFFs) offer near-quantum accuracy, the ubiquitous message-passing layers miss long-range many-body effects. Here we introduce the Multiscale Structural Ensemble (MuSE), a hierarchical model that uses Soft Coarse-Graining Pooling to construct coarse representations from smooth fractional assignments of atoms to coarse nodes, enabling MLFF modules to operate across multiple scales. MuSE is architecture-agnostic and coupled with SO3krates, MACE, and PaiNN MLFFs for both molecules and materials. We demonstrate the power of MuSE through Hessian-based benchmarks, folding trajectories for biomolecules, and energy profiles in molecule-graphene nanostructures, where MuSE accurately captures quantum-mechanical interactions at relevant scales, unlike other recent long-range ML models.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
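Scoring rationale: The paper studies machine learning models of multiscale physical systems and proposes the hierarchical MuSE model for molecular and materials force fields. It is entirely unrelated to the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). All keywords receive a relevance score of 0.
Keywords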
Machine Learning, Multiscale Interactions, Soft Coarse-Graining Pooling, Multiscale Structural Ensemble (MuSE), Machine Learning Force Fields (MLFFs), molecules, materials, long-range interactions
摘要翻译
在当代大型语言模型(LLMs)中,swish门控线性单元(SwiGLU)激活函数被广泛用于调节信息流并引入非线性。对于较大的正输入,SwiGLU近似于二次函数$x^2$,提供了强非线性和表达能力。然而,这一特性也导致了数值不稳定性,随着输入或模型规模的增大,尤其是在低精度LLM训练中。主要原因在于其近似二次放大效应,这会扩大输出范围并加剧异常值。为解决此问题,我们提出了一种适用于大规模LLM预训练的稳定激活函数——幂线性单元(PowLU)。具体而言,PowLU采用有理幂函数实现自适应非线性,从而提升表示能力,并在尖峰区域实现稳定训练。此外,我们为PowLU的若干关键特性提供了理论证明。缩放定律实验证实了其性能在不同模型规模下的一致性,而基于Ling架构(总参数量7.9B和124B)的进一步实验结果表明,在LLM的大规模训练中,PowLU相较于SwiGLU和SwiGLU-Clip取得了具有竞争力的结果。同时,实验结果还表明,PowLU有效提升了LLM大规模训练的可扩展性。
Abstract
In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate the information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expressive capacity. However, this property also causes numerical instability as the input or model scale increases, particularly in low-precision LLM training. The main reason is its approximate quadratic amplification, which enlarges the output range and exacerbates outliers. To address this issue, we propose a stable activation function, Power Linear Unit (PowLU), for large-scale LLM pre-training. Specifically, PowLU employs a rational power function to achieve adaptive nonlinearity, thereby improving representation ability and enabling stable training in spike regions. Moreover, we provide theoretical justification for several key properties of PowLU. Scaling law experiments confirm that the performance is consistent across model sizes, and further experimental results with the Ling architecture (7.9B and 124B total parameters) demonstrate that PowLU achieves competitive results against SwiGLU and SwiGLU-Clip in large-scale training of LLMs. In addition, the experimental results also show that PowLU effectively improves the scalability of the large-scale training of LLMs.
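The quadratic growth mentioned in the abstract can be checked numerically in the scalar case. The sketch below assumes both SwiGLU projections reduce to scalar weights (unit weights by default), so the unit computes `swish(x) * x`; the ratio to $x^2$ is then exactly the sigmoid of $x$, which tends to 1. The FP16 comparison is only an illustration of how quickly a quadratic output can exceed a low-precision range; the abstract does not say which formats the paper studies, and PowLU itself is not implemented here.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def swish(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_scalar(x, w=1.0, v=1.0):
    """Scalar SwiGLU: swish(w * x) * (v * x); w and v are hypothetical weights."""
    return swish(w * x) * (v * x)


for x in (1.0, 4.0, 16.0, 64.0, 256.0):
    y = swiglu_scalar(x)
    print(f"x={x:6.1f}  swiglu={y:12.3f}  ratio to x^2={y / x**2:.6f}  "
          f"exceeds FP16 range: {y > FP16_MAX}")
```

A linear or sub-quadratic activation would keep the output within range for far larger inputs, which is the kind of trade-off between expressiveness and numerical stability that the abstract describes.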
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
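Scoring rationale: The paper studies PowLU, an activation function for large language model (LLM) pre-training that aims to resolve the numerical instability of SwiGLU in low-precision training. Its content focuses entirely on activation-function design and stability analysis, and has no connection to the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). Those keywords concern directions such as unified multimodal models, world models, computer vision, and reinforcement learning, whereas the paper does not involve multimodality, vision, reinforcement learning, or any related concept. All keywords therefore receive a relevance score of 0.
Keywords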
PowLU, activation function, large language models, pre-training, SwiGLU, numerical stability, scaling law
摘要翻译
用于节点分类的图神经网络通常通过梯度下降训练数百或数千个epoch。近期研究表明,经过适当调优后,经典的GCN/SAGE/GAT架构能够在许多节点分类基准上媲美图Transformer。我们提出一个互补性问题:通过确定性闭式求解器(closed-form solver)能恢复多少性能,并且这能提供何种保证?我们引入一个由调整后的同质性(adjusted homophily)选择的闭式路由框架(routed closed-form framework)。对于同配图(assortative graphs),我们采用SGC风格的传播后接岭回归(Ridge regression);对于异配图(heterophilous graphs),我们提出LCF-Net——一种逐层闭式图特征精炼网络(layer-wise closed-form graph feature-refinement network),其每层的岭回归求解由高斯核-岭回归头(Gaussian kernel-Ridge head)封顶。在包括ogbn-arxiv和ogbn-proteins在内的14个基准上,我们的闭式预测器在9个已测数据集中有9个匹配或击败了最佳普通2层GCN/SAGE/GAT,在12个小基准中有9个与调优后的深度方法在一个标准差内持平,并在两个大图上超越了OGB排行榜上的普通GCN。剩余的异配图差距与从普通2层到深度SAGE的性能提升紧密相关,表明残余差异主要源于架构。由于我们的预测器是确定性线性系统的显式解,修改后的图输入可被重新求解以获得与重新训练等效的参数。我们形式化了针对标签、特征、边、节点和子图修改的精确图对象遗忘(exact graph-object unlearning),证明了岭回归组件的K跳局部性(K-hop locality),并在109种配置上验证了精确性。在ogbn-arxiv上,局部化更新相比完全重新求解加速了21–45倍,相比梯度重新训练加速了约10⁶倍。结构反演实验(structural-inversion experiments)进一步量化了精确重新训练的隐私下限(privacy floor)以及近似图遗忘方法(approximate graph-unlearning methods)的额外泄露。
Abstract
Graph neural networks for node classification are typically trained by gradient descent over hundreds or thousands of epochs. Recent work has shown that, when properly tuned, classic GCN/SAGE/GAT architectures can match graph transformers on many node-classification benchmarks. We ask a complementary question: how much of this performance can be recovered by deterministic closed-form solvers, and what guarantees does this enable? We introduce a routed closed-form framework selected by adjusted homophily. For assortative graphs, we use SGC-style propagation followed by Ridge regression; for heterophilous graphs, we introduce LCF-Net, a layer-wise closed-form graph feature-refinement network whose per-layer Ridge solves are capped by a Gaussian kernel-Ridge head. Across 14 benchmarks, including ogbn-arxiv and ogbn-proteins, our closed-form predictors match or beat the best vanilla 2-layer GCN/SAGE/GAT on 9 of 9 measured datasets, tie tuned deep recipes within one standard deviation on 9 of 12 small benchmarks, and exceed the OGB-leaderboard plain GCN on both large graphs. The remaining heterophilous gap closely tracks the gain from vanilla 2-layer to deep SAGE, suggesting that the residual difference is primarily architectural. Because our predictors are explicit solutions of deterministic linear systems, modified graph inputs can be re-solved to obtain retrain-equivalent parameters. We formalize exact graph-object unlearning for label, feature, edge, node, and subgraph modifications, prove K-hop locality for Ridge components, and verify exactness across 109 configurations. On ogbn-arxiv, localized updates give $21\times$ to $45\times$ speedups over full re-solving and roughly $10^{6}\times$ speedups over gradient retraining. Structural-inversion experiments further quantify the privacy floor of exact retraining and the additional leakage of approximate graph-unlearning methods.
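The assortative branch of the abstract above, SGC-style propagation followed by Ridge regression, has a compact closed form: propagate features with a normalized adjacency matrix for a fixed number of hops, then solve $W = (X^\top X + \lambda I)^{-1} X^\top Y$ on the labeled nodes. The sketch below follows that generic recipe on a six-node toy graph in dense pure Python. The normalization, the two-hop depth, the bias column, the value of `lam`, and the toy data are all assumptions; the paper's routing by adjusted homophily, LCF-Net, and the unlearning updates are not reproduced.

```python
import math


def normalized_adjacency(n, edges):
    """Dense S = D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def ridge_fit(x, y, lam):
    """Closed-form ridge regression: W = (X^T X + lam I)^{-1} X^T Y."""
    xt = transpose(x)
    gram = matmul(xt, x)
    for i in range(len(gram)):
        gram[i][i] += lam
    return solve(gram, matmul(xt, y))


# Toy assortative graph: two triangles joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
features = [[1.0, 0.0], [0.9, 0.2], [0.6, 0.5],
            [0.5, 0.6], [0.2, 0.9], [0.0, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
train_nodes = [0, 5]  # only two nodes have known labels

s = normalized_adjacency(len(features), edges)
h = features
for _ in range(2):                 # two propagation hops, no nonlinearity
    h = matmul(s, h)
h = [row + [1.0] for row in h]     # append a bias column

y_train = [[1.0 if labels[i] == c else 0.0 for c in range(2)]
           for i in train_nodes]
w = ridge_fit([h[i] for i in train_nodes], y_train, lam=1e-2)

scores = matmul(h, w)
pred = [max(range(2), key=lambda c: row[c]) for row in scores]
print("predicted labels:", pred)
```

Because every step is a deterministic linear solve, changing a label, feature, or edge and running the same code again yields exactly the parameters that training from scratch on the modified graph would give, which is the property the abstract builds its unlearning argument on.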
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
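Scoring rationale: The paper's topic is closed-form solutions for node classification with graph neural networks and exact graph unlearning, which has no connection to any of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO). The paper does not involve multimodality, reinforcement learning, world models, or unified models, so all keywords receive a relevance score of 0.
Keywords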
closed-form node classification, graph unlearning, ridge regression, GCN, heterophilous graphs, exact unlearning, SGC, LCF-Net
摘要翻译
本文提出了StrTransformer,一种面向源的结构化Transformer框架,用于盲源恢复和分支级潜在建模。StrTransformer不使用编码器来推断潜在变量,而是直接优化潜在源矩阵,同时结合观测空间混合器和面向源的结构化Transformer分支。混合器强制执行重建一致性,而每个Transformer分支对一条潜在源轨迹施加可微的结构约束。具体而言,每个源被转换为多尺度补丁令牌,随机掩码,由具有局部偏置的Transformer处理,并通过掩码补丁重建能量进行评估。该能量作为隐式的面向源的结构先验。为了鼓励不同的潜在分支专门处理不同的时间模式,StrTransformer进一步引入了一个有序多尺度控制器,学习分支特定的补丁尺度权重、有序尺度中心和局部注意力斜率。最终的目标函数结合了观测重建、面向源的结构正则化以及用于分离和尺度专门化的模块化辅助惩罚。我们分析了目标函数的解耦与耦合结构、正则化精确重建纤维以及由有序分支描述符引起的置换对称性降低。一个受控案例研究表明,学习到的分支收敛到不同的时间尺度结构,并在事后评估中恢复出与源对齐的潜在轨迹。
Abstract
This paper proposes StrTransformer, a source-wise structured Transformer framework for blind source recovery and branch-wise latent modeling. Instead of using an encoder to infer latent variables, StrTransformer directly optimizes the latent source matrix together with an observation-space mixer and source-wise structural Transformer branches. The mixer enforces reconstruction consistency, while each Transformer branch imposes a differentiable structural constraint on one latent source trajectory. Specifically, each source is converted into multi-scale patch tokens, randomly masked, processed by a locality-biased Transformer, and evaluated through a masked patch reconstruction energy. This energy acts as an implicit source-wise structural prior. To encourage different latent branches to specialize into different temporal regimes, StrTransformer further introduces an ordered multi-scale controller that learns branch-specific patch-scale weights, ordered scale centers, and locality attention slopes. The resulting objective combines observation reconstruction, source-wise structural regularization, and modular auxiliary penalties for separation and scale specialization. We analyze the decoupling and coupling structure of the objective, the regularized exact-reconstruction fiber, and the reduction of permutation symmetry induced by ordered branch descriptors. A controlled case study shows that the learned branches converge to distinct temporal-scale structures and recover source-aligned latent trajectories under post-hoc evaluation.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
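Scoring rationale: The paper's title and abstract focus entirely on blind source recovery and a source-wise structured Transformer, which belong to signal processing and unsupervised learning. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) is mentioned or implied in the paper, and none relates to its core content. Each keyword therefore receives a relevance score of 0.
Keywords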
StrTransformer, Blind Source Recovery, Source-Wise Structured Transformer, Unsupervised Learning, Latent Source Matrix, Masked Patch Reconstruction, Ordered Multi-Scale Controller
摘要翻译
现代AI基准测试的复杂度已超越传统验证方法的处理能力。由领域专家编写的任务常包含隐含假设、不完整的环境规范以及脆弱的评估逻辑,人工标注无法可靠地捕捉这些问题。我们提出自动基准审计(Auto Benchmark Audit, ABA),这是一个智能体框架,能够系统性地审计单个基准测试任务,发现诸如隐藏环境依赖、规范缺失及评分逻辑受限等问题。我们在前沿大语言模型基准测试及往届NeurIPS论文中选取了涵盖九个领域的168个基准任务,对ABA进行了验证。在该语料库中,ABA识别出超过25.7%的评估任务存在关键问题,包括模糊的任务设计、执行环境冲突及错误的标准答案。这些自动化审计的准确性已通过专家评审及上游拉取请求(PR)等独立第三方报告验证。关键的是,我们证明存在问题的任务会严重扭曲对智能体及大语言模型的能力评估:过滤掉这些有问题的任务后,模型排名发生变化,且在SWE-bench Verified和Terminal-Bench 2上的平均性能分别提升了9.9%和9.6%。我们开源了该智能体工具及所有任务标注,以支持未来前沿基准测试的发展。
Abstract
Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out the tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
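Scoring rationale: The paper is about automated benchmark auditing (Auto Benchmark Audit) and focuses on detecting defects in AI-agent and LLM benchmarks (such as hidden environment dependencies, specification gaps, and faulty grading logic). It does not touch on multimodal large language models (MLLM), unified models (Unify Models), world models (World Models), computer vision (CV), multimodality (MultiModal), model-based reinforcement learning (model-based RL), OPD, reinforcement learning (RL), or GRPO. None of the keywords relates to the paper's core content, so every score is 0.
Keywords
Automated Benchmark Audit, AI Agents, Large Language Models, Benchmark Verification, Task Defects, Model Ranking
Abstract (Chinese translation)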
从文本中标注说话人属性本身具有歧义性,尤其是在多语言环境中,人口统计学和社会线索隐含且因文化而异。我们提出了一种人-大语言模型(LLM)协作重新标注框架,用于在实际资源约束下稳定多语言说话人属性标签。从含噪语料库出发,我们通过与专家迭代交互,利用LLM揭示反复出现的标注理由,并应用以分歧为中心的采样进行针对性重新标注。利用该框架,我们构建了WhoSaidIt数据集,涵盖九个说话人属性标签。我们量化了原始标注与修订标注之间的差异,对近期LLM进行了基准测试,并分析了显式理由对模型行为的影响。我们的结果揭示了标注决策中显著的跨语言差异,并展示了LLM在说话人属性分类中的优势与局限性。
Abstract
Annotating speaker attributes from text is inherently ambiguous, particularly in multilingual settings where demographic and social cues are implicit and culturally variable. We propose a human-large language model (LLM) collaborative re-annotation framework for stabilizing multilingual speaker-attribute labels under practical resource constraints. Starting from a noisy corpus, we use LLMs to surface recurring annotation rationales through iterative interaction with experts, and apply disagreement-focused sampling for targeted re-annotation. Using this framework, we construct WhoSaidIt, a multilingual dataset covering nine speaker-attribute labels. We quantify divergence between original and revised annotations, benchmark recent LLMs, and analyze the effect of explicit rationales on model behavior. Our results reveal substantial cross-lingual differences in annotation decisions and demonstrate both the strengths and limitations of LLMs in speaker-attribute classification.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
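Scoring rationale: The paper studies a human-LLM collaborative annotation framework for text-based multilingual speaker-attribute classification. Its core is text classification, multilingual annotation, and human-machine collaboration, with no direct connection to the given keywords (unified models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, OPD, reinforcement learning, GRPO). Every keyword receives a relevance of 0.
Keywords
Human-LLM collaboration, annotation, multilingual, speaker-attribute classification, dataset, rationales, cross-lingual differences
Abstract (Chinese translation)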
从自发性语音中检测痴呆症为认知筛查提供了一种可扩展的方法,然而自然语言处理系统仍以英语为中心。这一局限在菲律宾尤为突出,该国普遍存在菲律宾语-英语语码转换现象,且尚无研究涉及基于自然语言处理的痴呆症检测。我们首次提出了基于Transformer架构的菲律宾语痴呆症检测的系统性评估,并首次在临床自然语言处理场景中评估了NeoBERT。为分离语言与领域效应,我们构建了一个包含4000份DementiaBank衍生转录文本的平行双语数据集,其中菲律宾语翻译由人工完成,以保留认知衰退的话语层面标记。我们评估了五个模型系列——TF-IDF + LogReg、BERT、NeoBERT、XLM-R和RoBERTa-Tagalog——分别在单语、零样本跨语言及双语微调设置下进行。研究发现,领域内性能无法跨语言迁移,英语训练的BERT在菲律宾语上的宏F1值降至0.455,且仅靠架构现代化并不能提升鲁棒性。然而,双语微调消除了所有Transformer模型的跨语言性能下降,宏F1值收敛至0.969-0.973。这些结果表明,多语言临床自然语言处理的性能主要取决于训练过程中的语言覆盖范围,而非模型规模或架构。
Abstract
Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969-0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.
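The results above are reported as Macro-F1, which averages the per-class F1 scores with equal weight so that the smaller class counts as much as the larger one. A minimal sketch of that metric follows, assuming binary labels; the function name `macro_f1` and the toy labels are illustrative and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two non-empty label lists of equal length")
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty class.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 1 = dementia, 0 = control (illustrative only).
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
score = macro_f1(gold, pred)
print(f"Macro-F1 = {score:.3f}")
```

Because each class contributes equally, a classifier that collapses to a single class under zero-shot transfer is penalized heavily even when overall accuracy looks acceptable.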
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
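Scoring rationale: The paper studies Transformer-based dementia detection, focusing on low-resource conversational speech in Filipino and English, and belongs to NLP and clinical language analysis. All of the given scoring keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) are entirely unrelated to its content: the paper does not involve multimodality, computer vision, reinforcement learning, world models, or unified models, and does not mention OPD or GRPO. Each keyword therefore receives a relevance of 0.
Keywords
Dementia detection, NeoBERT, Low-resource speech, Filipino-English code-switching, Bilingual fine-tuning, Transformer models, DementiaBank
Abstract (Chinese translation)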
我们记录了由来自七个架构家族的十个大语言模型驱动的思维链与ReAct智能体中的一个经验现象:意义性扰动(如同义替换、释义)相较于同等严重程度的呈现性扰动(如格式调整、重排序)更频繁地改变最终答案。在涵盖GSM8K、MATH和HotpotQA的68个实验单元中(包含1,530个原始样本及约11,150个变体),经过严重性匹配后,不一致性差距平均为+19.69个百分点(配对t检验:t=9.58,p<0.0001),其中64/68个实验单元呈现正向差距。该差距在四种严重性代理检验中依然存在,且在排除qwen模型后仍保持显著(+11.10个百分点,p<0.0001)。多项压力测试未通过诚实检验:在更严格假设下聚类自举显著性消失,可处理性对比无法复现,跨架构生成器交换破坏了各实验单元的排名,且第二位LLM评判者仅产生中等一致性(κ=0.50)。随后,我们在一个完全留出的第11个模型(qwen2.5-14B-Instruct;1,800条轨迹)上验证了主要效应,并重新检验了预先注册的能力×可处理性划分,观察到虽小但正向的留出效应(3/4实验单元正向;合并Welch t检验:t=3.81,p=9.6×10⁻⁴)。利用留出轨迹,我们探测了四个轨迹层面的机制信号。两项先前的机制主张未能复现并被明确撤回。两项新探测则支持一种“隐性发散”图景:语义扰动通常保留首个动作,但从后续步骤起引发中间推理的发散,同时伴随略微更深的轨迹。我们将此定位为一项包含留出复现的测量贡献,以及关于语义扰动如何在智能体推理中传播的部分轨迹层面解释。代码、扰动语料库、原始轨迹及分析脚本已匿名发布以供评审。
Abstract
We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of comparable severity. Across 68 cells spanning GSM8K, MATH, and HotpotQA (1,530 originals and $\sim$11,150 variants), the inconsistency gap averages +19.69 pp after severity matching (paired $t=9.58$, $p<0.0001$), with 64/68 cells positive. The gap survives four severity-proxy audits and remains significant when excluding qwen models (+11.10 pp, $p<0.0001$). Several stress tests fail honestly: cluster-bootstrap significance disappears under stricter assumptions, tractability contrasts do not replicate, cross-architecture generator swaps break per-cell rankings, and a second LLM judge yields only moderate agreement ($\kappa=0.50$). We then validate the headline effect on a fully held-out 11th model (qwen2.5-14B-Instruct; 1,800 trajectories) and re-test a pre-registered capability$\times$tractability partition, observing a small but positive held-out effect (3/4 cells positive; pooled Welch $t=3.81$, $p=9.6\times10^{-4}$). Using held-out trajectories, we probe four trace-level mechanism signals. Two prior mechanism claims fail to replicate and are explicitly retracted. Two new probes instead support a *stealth-divergence* picture: semantic perturbations often preserve the first action but induce divergence in intermediate reasoning from later steps onward, accompanied by slightly deeper trajectories. We position this as a measurement contribution with held-out replication and a partial trace-level account of how semantic perturbations propagate through agent reasoning. Code, perturbation corpus, raw trajectories, and analysis scripts are released anonymously for review.
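The headline statistic above is a paired comparison: for each cell, the inconsistency rate under meaning-bearing perturbations is matched with the rate under presentation perturbations, and the per-cell gaps are tested against zero. A minimal sketch of such a paired t statistic follows, assuming one gap per cell; the function name `paired_t` and the four rates below are hypothetical placeholders, not values from the paper.

```python
import math
from statistics import mean, stdev


def paired_t(semantic, presentation):
    """Return (mean gap, t statistic, degrees of freedom) for paired samples."""
    if len(semantic) != len(presentation) or len(semantic) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    gaps = [s - p for s, p in zip(semantic, presentation)]
    n = len(gaps)
    mean_gap = mean(gaps)
    # Standard error of the mean gap, using the sample standard deviation.
    se = stdev(gaps) / math.sqrt(n)
    return mean_gap, mean_gap / se, n - 1


# Hypothetical inconsistency rates (percent) for four cells; illustrative only.
semantic_rates = [30.0, 42.0, 25.0, 38.0]
presentation_rates = [12.0, 20.0, 15.0, 16.0]
gap, t_stat, dof = paired_t(semantic_rates, presentation_rates)
print(f"mean gap = {gap:.2f} pp, t = {t_stat:.2f}, dof = {dof}")
```

Pairing by cell removes between-cell variation (model, dataset, agent type) from the error term, which is why the test operates on the gaps and not on the two raw samples.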
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
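Scoring rationale: The paper studies how LLM agents (text-based large language models) under the chain-of-thought and ReAct frameworks differ in their sensitivity to semantic noise versus surface noise, which falls under natural language processing, reasoning robustness, and measurement methodology. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to its content: the paper does not involve multimodality, computer vision, world models, reinforcement learning, or any unified-model method, and does not mention terms such as OPD or GRPO. Each keyword therefore receives a relevance of 0.
Keywords
LLM agents, chain-of-thought, ReAct, semantic noise, surface noise, inconsistency gap, held-out validation, stealth-divergence
Abstract (Chinese translation)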
本文介绍了PolyGnosis 2.0,这是一种开创性的多智能体架构,旨在通过综合Polymarket异常信号与全球开源情报(OSINT)流(具体指全球事件、语言与语调数据库(GDELT))来提取预测性情报。我们定义并聚焦于“视角错位”(Perspective Mismatches),即Polymarket情绪与全球媒体流之间的叙事分歧,将其作为高阿尔法交易信号。超越通用的智能体优越性,我们严格量化了“驾驭工程”(Harness Engineering)技术在高噪声金融领域中的效能,包括反思循环、工具调用、分而治之分区(D&C)以及思维链(CoT)。我们基于人类专家基准的实证评估表明,虽然结构性分区对于多维对齐是强制性的,但无约束的终端反思会主动诱发逻辑漂移。此外,我们在所有智能体配置的叙事推理过程中识别出一种普遍的“共识偏差”(consensus bias),这需要确定性验证。最终,我们分离出一种帕累托最优配置,该配置在实现专业级分析精度的同时,最小化了延迟和令牌开销,为预测市场中的自主智能提供了稳健的蓝图。
Abstract
This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically the Global Database of Events, Language, and Tone (GDELT). We define and target "Perspective Mismatches", the narrative divergence between Polymarket sentiment and global media flows, as high-alpha trading signals. Moving beyond generic agentic superiority, we rigorously quantify the efficacy of "Harness Engineering" techniques, including reflection loops, tool-calling, divide-and-conquer partitioning (D&C), and chain-of-thought (CoT), within high-noise financial domains. Our empirical evaluation against human-expert benchmarks reveals that while structural partitioning is mandatory for multi-dimensional alignment, unconstrained terminal reflection actively induces logical drift. Furthermore, we identify a pervasive "consensus bias" across all agent configurations during narrative reasoning, necessitating deterministic validation. Ultimately, we isolate a Pareto-optimal configuration that achieves professional-grade analytical precision while minimizing latency and token overhead, providing a robust blueprint for autonomous intelligence in prediction markets.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
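Scoring rationale: The paper studies a multi-agent architecture (PolyGnosis 2.0) that extracts predictive intelligence from Polymarket anomaly signals and global open-source intelligence (GDELT), focusing on narrative divergence and signal trading in financial prediction markets. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to its content: the paper does not involve multimodal large language models, unified models, world models, computer vision, multimodal learning, model-based reinforcement learning, OPD (which may refer to some other specific field), reinforcement learning, or GRPO (a reinforcement learning algorithm). All keywords therefore receive a relevance of 0. The author list does not include any of the designated experts.
Keywords
PolyGnosis 2.0, multi-agent architecture, Polymarket, OSINT, GDELT, Perspective Mismatches, Harness Engineering, prediction markets
Abstract (Chinese translation)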
长期记忆对于持久的LLM智能体至关重要,然而当前的主流架构将历史交互存储为无结构的平面文本。这种无约束的存储方式会导致溯源角色崩塌(provenance-role collapse),这是一种关键失效模式,智能体会因此出现源监控错误(source-monitoring errors)。为了在架构层面解决这一认知脆弱性,我们提出MemIR,一种类型化的记忆中间表示(typed Memory Intermediate Representation),它将源监控操作化为一种结构约束。MemIR将长期记忆写入基于事实的原子(grounded atoms)中,这些原子将原始证据、检索线索和承载真值的声明(truth-bearing claims)分离开来,且事实授权仅限于受支持的声明原子。随后,它通过多路径原子投影(multi-route atomic projection)和溯源范围限定利用(provenance-scoped utilization),将异构检索结果转化为以声明为中心的候选束(candidate bundles)以及用于答案生成的标准化事实接口。在LoCoMo和BEAM-100K上的实验表明,MemIR始终优于现有的记忆基线方法,尤其是在需要源追踪、时间锚定以及碎片化证据聚合的任务上。
Abstract
Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. To resolve this cognitive vulnerability at the architectural level, we propose MemIR, a typed Memory Intermediate Representation that operationalizes source monitoring as a structural constraint. MemIR writes long-term memory into grounded atoms that separate raw evidence, retrieval cues, and truth-bearing claims, with factual authorization restricted to supported claim atoms. It then applies multi-route atomic projection and provenance-scoped utilization to transform heterogeneous retrieval hits into claim-centered candidate bundles and a normalized fact interface for answer generation. Experiments on LoCoMo and BEAM-100K demonstrate that MemIR consistently outperforms existing memory baselines, especially on tasks requiring source tracking, temporal grounding, and aggregation of fragmented evidence.
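The separation described above, where raw evidence, retrieval cues, and truth-bearing claims are distinct atom types and only supported claims may be used as facts, can be illustrated with a small typed structure. This is a minimal sketch under stated assumptions: the names `AtomKind`, `Atom`, and `authorized_facts`, as well as the sample memory, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class AtomKind(Enum):
    EVIDENCE = "evidence"  # raw material from a past interaction
    CUE = "cue"            # retrieval hint; never treated as a fact
    CLAIM = "claim"        # truth-bearing statement


@dataclass(frozen=True)
class Atom:
    atom_id: str
    kind: AtomKind
    text: str
    source: str               # provenance, e.g. who said it and when
    supported_by: tuple = ()  # ids of evidence atoms backing a claim


def authorized_facts(atoms):
    """Return only claim atoms backed by at least one stored evidence atom."""
    evidence_ids = {a.atom_id for a in atoms if a.kind is AtomKind.EVIDENCE}
    return [
        a for a in atoms
        if a.kind is AtomKind.CLAIM
        and any(ref in evidence_ids for ref in a.supported_by)
    ]


# Hypothetical memory contents; illustrative only.
memory = [
    Atom("e1", AtomKind.EVIDENCE, "I moved to Lisbon in March.", "user, session 3"),
    Atom("q1", AtomKind.CUE, "where does the user live", "indexer"),
    Atom("c1", AtomKind.CLAIM, "The user lives in Lisbon.", "writer", ("e1",)),
    Atom("c2", AtomKind.CLAIM, "The user owns a boat.", "writer"),
]
facts = authorized_facts(memory)
print([a.atom_id for a in facts])
```

In this sketch the unsupported claim and the retrieval cue can still be stored and retrieved, but neither reaches the fact interface, which is one way to read "factual authorization restricted to supported claim atoms".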
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
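Scoring rationale: The paper studies memory representation for long-lived agents and proposes MemIR to mitigate provenance-role collapse. It focuses throughout on the memory architecture of LLM agents and does not involve keywords such as multimodality, unified models, world models, reinforcement learning, or computer vision. All of the given keywords are entirely unrelated to the paper's core content, so each receives a relevance of 0.
Keywords
long-term memory, provenance-role collapse, source monitoring, typed memory representation, LLM agents, MemIR, evidence aggregation
Abstract (Chinese translation)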
赋予模型一致的多语言性能可以通过混合预训练数据或后训练方法(如特定语言模型合并)来实现。在本工作中,我们测试了合并是否可应用于单语预训练模型。我们对混合、合并和单语预训练设置的有效性进行了受控研究。我们发现,虽然单语预训练能带来强大的语言内性能,但合并单语模型的任何组合都会因干扰而导致性能崩溃。我们的分析表明,表征相似性(representational similarity)是模型合并(model merging)的前提条件。因此,我们得出结论,微调中合并的灵活性并不能简单扩展到特定语言的预训练。
Abstract
Endowing models with consistent multilingual performance can be achieved by mixing pre-training data, or post-training approaches such as language-specific model merging. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the efficacy of mixed, merged, and monolingual pre-training setups. We find that while monolingual pre-training results in strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. Our analysis suggests representational similarity is a prerequisite for model merging. We therefore conclude that the flexibility of merging in fine-tuning does not extend trivially to language-specific pre-training.
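The abstract above does not say which merging operator was studied. As an illustration only, the sketch below assumes the simplest one, uniform parameter averaging, over models stored as plain dictionaries of weight lists; the function name `merge_uniform` and the toy parameters are hypothetical.

```python
def merge_uniform(state_dicts):
    """Average parameters name by name across models with identical layouts."""
    if not state_dicts:
        raise ValueError("need at least one model")
    names = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != names:
            raise ValueError("models must share the same parameter names")
    merged = {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise mean over the models being merged.
        merged[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return merged


# Hypothetical two-parameter "models"; illustrative only.
model_a = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 5.0], "layer.bias": [2.0]}
merged = merge_uniform([model_a, model_b])
print(merged)
```

Averaging position by position implicitly assumes that the same coordinate plays the same role in every model. That assumption is plausible for fine-tuned variants of one checkpoint and much less so for models pre-trained separately, which is consistent with the abstract's conclusion that representational similarity is a prerequisite for merging.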
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
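Scoring rationale: The paper studies model merging in multilingual pre-training. Its core is multilinguality in natural language processing, which is entirely unrelated to the given keywords (unified models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, OPD, reinforcement learning, GRPO). All of the keywords concern multimodality, reinforcement learning, or world models, whereas this paper focuses on merging monolingually pre-trained models and the resulting performance collapse in multilingual settings, so each keyword receives a relevance of 0.
Keywords
model merging, multilinguality, pre-training, monolingual, performance collapse, representational similarity, interference
Abstract (Chinese translation)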
大型语言模型(Large Language Models, LLMs)重塑了用户画像技术,然而当前评估主要聚焦于静态数据快照。这种范式忽视了个性化系统的现实情况:用户生成内容(User-Generated Content, UGC)持续涌入,细粒度画像快速演变。为弥补这一差距,我们提出StreamProfileBench——一个面向细粒度流式用户画像的大规模基准。我们将流式用户画像形式化为一项连续状态维护任务,并构建了高度真实的数据集,包含来自五个不同平台7000余名真实用户的超过12万条UGC帖子。通过利用用户兴趣的时间相关性,我们进一步提出了一种新颖的、无需标注的评估框架。在14个主流LLM上的大量实验表明,连续画像更新仍是一个开放挑战。模型表现出系统性的保守偏差,过度保留过往兴趣而未能识别兴趣衰减。消融实验进一步验证了流式范式的实际效用与必要性。
Abstract
Large Language Models (LLMs) have reshaped user profiling, yet current evaluations mainly focus on static data snapshots. This paradigm overlooks the reality of personalized systems, where User-Generated Content (UGC) arrives continuously and fine-grained profiles evolve rapidly. To bridge this gap, we introduce StreamProfileBench, a large-scale benchmark for fine-grained streaming user profiling. We formalize streaming user profiling as a continuous state maintenance task and curate a highly authentic dataset comprising over 120,000 UGC posts from 7,000+ real users across five diverse platforms. By leveraging the temporal correlation of user interests, we further propose a novel, annotation-free evaluation framework. Extensive experiments across 14 leading LLMs reveal that continuous profile updating remains an open challenge. Models exhibit a systemic conservative bias, over-retaining past interests while failing to recognize interest decay. Ablation experiments further validate the practical utility and necessity of the streaming paradigm.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
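Scoring rationale: The paper studies a benchmark for fine-grained user-profile inference in streaming scenarios (StreamProfileBench) and mainly concerns how LLMs perform at updating user profiles, with no direct connection to the given keywords (unified models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, OPD, reinforcement learning, GRPO). The abstract mentions nothing related to multimodality, vision, reinforcement learning, or world models, so every keyword receives a relevance of 0.
Keywords
StreamProfileBench, user profiling, streaming scenarios, LLM, benchmark, fine-grained, UGC, temporal correlation
Abstract (Chinese translation)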
近期从专用神经机器翻译系统向通用大语言模型的转变重塑了机器翻译领域,据报告,大语言模型生成的译文比其前代系统更流畅、更不直译。我们检验了这一转变是否延伸至“去直译化假说”——翻译研究中长期存在的观点,即译文在起草和修订过程中会逐渐降低直译程度。利用WMT24++数据集,我们比较了人工翻译和译后编辑与两种神经机器翻译系统及六种大语言模型在54个语言对及三种任务(直接翻译、迭代自我修订、人工译稿的译后编辑)中的直译程度。直译程度通过基于六种启发式方法构建的经验证的合成直译指数进行测量。我们发现:(i)人工翻译的直译程度仍显著低于所有受测机器翻译系统,尽管近期的大语言模型缩小了这一差距;(ii)当提示大语言模型迭代修订自身输出时,其直译程度呈单调递减趋势,这首次证明该假说天然适用于大语言模型的生成过程;(iii)作为译后编辑者,大语言模型反转了人类译后编辑的修订触发机制,即容忍直译译稿,而将目标语言地道表达作为修订对象。
Abstract
The recent shift from dedicated NMT systems to general-purpose LLMs has reshaped machine translation, with LLMs reported to produce more fluent, less literal output than their predecessors. We test whether this shift extends to the deliteralization hypothesis, the long-standing claim from translation studies that translations become progressively less literal as they are drafted and revised. Using the WMT24++ dataset, we compare the literality of human translations and post-editions to that of two NMT systems and six LLMs across 54 language pairs and three tasks: direct translation, iterative self-revision, and post-editing of human drafts. Literality is measured via a validated Synthetic Literality Index built from six heuristics. We find that (i) human translations remain significantly less literal than those of all tested MT systems, though recent LLMs narrow the gap; (ii) when prompted to iteratively revise their own output, LLMs deliteralize monotonically, providing the first evidence that the hypothesis applies natively to LLM generation; and (iii) as post-editors, LLMs invert the revision triggers of human post-editors, tolerating literal drafts and targeting idiomatic human formulations for revision.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
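Scoring rationale: The paper studies the deliteralization hypothesis in machine translation, comparing the literality of human translations with that of NMT systems and LLMs. None of the given keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to translation research, and the paper does not involve topics such as multimodality, world models, reinforcement learning, or unified models, so each keyword receives a relevance of 0. The author list does not include any of the designated experts.
Keywords
deliteralization hypothesis, machine translation, LLM, NMT, literality, post-editing, iterative self-revision, WMT24++
Abstract (Chinese translation)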
在自然语言处理领域的自由形式法律文书评估中,专家间评分稳定性被视为一个单一的上限数值,而大语言模型(LLM)判断与该上限的一致性则被视作判断稳定性的证据。我们通过一项相同输入协议,在泰国律师资格考试中对这两个假设进行了检验:三位经律师协会培训的考官(A、B、C)以及一个由26个LLM组成的评审小组,基于相同的四项输入(试题、官方律师协会评分规则、标准答案、考生答案),对同一组15份交叉评阅的答卷进行评分。主要发现呈现非对称性。在评分规则规定了两个评分维度的15个评分单元中的10个单元上,全部29位评分者的评分高度集中:小组一致性具有普遍性。在其余5个评分单元中,评分规则并未规定如何对遗漏了关键法定引用的正确最终答案进行评分,此时人类评审小组分裂为两种合理的解读(B/C多数派采用评分规则的上限区间,得分6-8分;A少数派采用下限区间,得分1-2分)。LLM评审群体并未呈现对称性分裂:26个LLM中有22个的评分落在或接近B/C存在争议的区间,3个位于规则未明确规定的中间空白区域,仅1个(GPT-5.4 Nano)接近A的区间但未能持续落入该区间。*在我们26个LLM组成的评审小组中,没有任何一个LLM在争议评分单元上复现了少数派人类评审的解读。* B/C方向的聚类涵盖了我们所测试的所有模型规模、供应商及价格层级。一个由三个LLM锚定模型(Claude 4.6 Opus、Gemini 3.1 Pro、GPT-5.4 Pro)组成的仪器化子小组,进行了确定性探测、输入消融实验及自助法置信区间计算,在15个评分单元上,锚定小组的α值为0.77,而人类小组的α值为0.36。LLM小组的高α值反映了对多数派解读的系统性趋同,而非对两种解读的均衡复现;一个通过最大化与人类参考小组的一致性来遴选其LLM评审的基准测试,将必然在结构上继承这种非对称性。
Abstract
Free-form legal essay evaluation in NLP treats expert inter-rater stability as a single ceiling number, and treats LLM-judge agreement with that ceiling as evidence of judge stability. We test both assumptions on the Thai bar examination through an identical-inputs protocol: three Bar Council-trained examiners (A, B, C) and a 26-LLM judge panel score the same 15 cross-graded answers from the same four inputs (question, official Bar Council grading regulation, gold answer, candidate answer). The headline finding is asymmetric. On 10 of 15 cells where the rubric prescribes both axes, all 29 raters converge in a tight band: panel agreement is universal. On the remaining 5 cells where the rubric does not prescribe how to grade a correct final answer that omits a decisive statutory citation, the human panel splits between two coherent readings (B/C majority at the upper rubric band, score 6 to 8; A minority at the lower band, score 1 to 2). The LLM judge population does not split symmetrically: 22 of 26 LLMs score in or near B/C's contested band, 3 sit in the regulation-silent middle gap, and only 1 (GPT-5.4 Nano) approaches A's band without consistently scoring within it. *Zero LLMs in our 26-judge panel reproduce the minority human reading on the contested cells.* The B/C-direction cluster spans every model size, vendor, and price tier we tested. An instrumented three-LLM anchor sub-panel (Claude 4.6 Opus, Gemini 3.1 Pro, GPT-5.4 Pro) carries determinism probes, input ablations, and bootstrap CIs, and reaches anchor panel $\alpha = 0.77$ on the 15 cells against human-panel $\alpha = 0.36$. The high LLM-panel $\alpha$ reflects systematic convergence on the majority reading rather than balanced reproduction of both readings; a benchmark that selects its LLM judge by maximising agreement with a human reference panel will inherit this asymmetry by construction.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
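Scoring rationale: The paper studies scoring stability for free-form essay answers on the Thai bar examination, comparing agreement between human examiners and LLM judges. It focuses on legal text evaluation, the reliability of LLM judges, and inter-rater reliability, and does not involve multimodality, world models, reinforcement learning, unified models, computer vision, or any technical area related to the given keywords. None of the keywords (Unify Models, World Models, MLLM, CV, MultiModal, model-based RL, OPD, RL, GRPO) relates to the paper's content, so each receives a score of 0.
Keywords
LLM judges, bar examination, free-form essay evaluation, inter-rater stability, Thai legal text, human-LLM agreement, scoring rubric ambiguity
Abstract (Chinese translation)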
临床路径以可视化流程图的形式传播,其中空间拓扑结构、箭头方向、颜色编码和字体粗细编码了关键的预检分诊逻辑,但这些逻辑对计算系统而言仍不可访问。我们提出PathWISE,一个五阶段流水线,将四个基于大语言模型(LLM)的智能体与确定性深度优先搜索审计器及Java编译器批评器相结合,将这些不可计算的人工制品转化为经过验证、可执行的HL7临床质量语言(CQL)库,并可作为FHIR CDS Hooks服务部署。专用智能体将流程图结构提取为类型化有向图,执行确定性路径枚举,对每个节点的可计算性进行结构化语义审计,生成经官方Java CQL-to-ELM编译器验证的术语约束型CQL定义,并产生覆盖100%枚举患者路径的路由逻辑。在五项英国国家医疗服务体系(NHS)癌症路径(结直肠癌、肺癌、皮肤癌、上消化道癌及乳腺癌)上的演示表明,PathWISE可审计多达183个节点(混合配置下为182个),识别出涵盖四个问题类别的544项结构化治理发现,实现100%的语法编译成功率,其中不可计算(UNCOMPUTABLE)节点接收虚假占位符以保持可编译性,同时暴露治理缺口以供临床审查,并为词典覆盖的概念生成零个幻觉术语代码。关键的是,PathWISE将非确定性LLM推理限制在知识提取环节,而确定性图论数学及标准编译器支撑每一个验证步骤。
Abstract
Clinical pathways are disseminated as visual flowcharts where spatial topology, arrow direction, colour coding, and font weight encode critical triage logic that remains inaccessible to computational systems. We present PathWISE, a five-phase pipeline combining four LLM-based agents with a deterministic depth-first search auditor and a Java compiler critic, transforming these non-computable artefacts into validated, executable HL7 Clinical Quality Language (CQL) libraries deployable as FHIR CDS Hooks services. Purpose-built agents extract flowchart structure into a typed directed graph, perform deterministic path enumeration, conduct a structured semantic audit of every node's computability, generate terminology-constrained CQL definitions verified by the official Java CQL-to-ELM compiler, and produce routing logic covering 100% of enumerated patient journeys. Demonstrated across five UK NHS cancer pathways (colorectal, lung, skin, upper GI, and breast), PathWISE audits up to 183 nodes (182 under the Hybrid configuration), identifies 544 structured governance findings across four issue categories, achieves 100% syntactic compilation success, with UNCOMPUTABLE nodes receiving false placeholders that preserve compilability while surfacing governance gaps for clinical review, and produces zero hallucinated terminology codes for dictionary-covered concepts. Critically, PathWISE confines non-deterministic LLM inference to knowledge extraction while deterministic graph mathematics and a standard compiler underpin every verification step.
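The deterministic path enumeration mentioned above can be pictured as a depth-first search that lists every route from the entry node of a pathway graph to a terminal node, so that routing logic can be checked against each route. A minimal sketch follows, assuming the flowchart has already been extracted into an adjacency mapping; the function name `enumerate_paths` and the node names are hypothetical and do not come from any NHS pathway or from the paper.

```python
def enumerate_paths(graph, start):
    """Return every start-to-terminal path in a directed acyclic graph via DFS."""
    paths = []

    def dfs(node, path):
        successors = graph.get(node, [])
        if not successors:  # terminal node: one complete patient journey
            paths.append(path)
            return
        for nxt in successors:
            if nxt in path:
                raise ValueError(f"cycle detected at {nxt!r}")
            dfs(nxt, path + [nxt])

    dfs(start, [start])
    return paths


# Hypothetical pathway graph; node names are illustrative only.
pathway = {
    "referral": ["test_positive", "test_negative"],
    "test_positive": ["urgent_clinic"],
    "test_negative": ["safety_net", "routine_review"],
}
journeys = enumerate_paths(pathway, "referral")
for journey in journeys:
    print(" -> ".join(journey))
```

Because the traversal follows the successor lists in order and uses no randomness, repeated runs produce the same list of journeys, which is what makes a coverage figure over "enumerated patient journeys" well defined.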
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
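Scoring rationale: The paper is about parsing clinical flowcharts and generating executable CQL code, which belongs to medical informatics and knowledge engineering and is entirely unrelated to the given keywords (unified models, world models, multimodal large language models, computer vision, multimodality, model-based reinforcement learning, OPD, reinforcement learning, GRPO). Every keyword receives a score of 0.
Keywords
clinical pathways, flowchart, LLM agents, CQL, FHIR CDS Hooks, ontology learning, multi-agent system, cancer pathway triaging
Abstract (Chinese translation)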
紧急疑似结直肠癌(CRC)转诊造成了操作瓶颈,因为半结构化临床文档通常需要人工审核和转录。最初的RAPTOR系统使用大型语言模型进行结构化提取,但依赖独立的OCR阶段,使其易受手写、布局变化以及视觉证据关联丢失的影响。我们提出RAPTOR+,一种多模态扩展,使用视觉语言模型(VLM)进行端到端转诊理解。我们在223份临床整理的CRC紧急转诊表单上评估了微调后的VLM、商业及开源零样本VLM,以及基于OCR的原始流水线。我们还引入了一种定位感知评估框架,同时衡量提取准确性和证据定位。结果显示零样本模型存在明显的定位差距。Gemini 2.5 Flash达到了92.6%的Reading Accuracy(读取准确率),但仅有1.2%的Strict Safety(严格安全性)。相比之下,微调后的Qwen3-VL-8B达到了96.1%的Reading Accuracy和60.6%的Strict Safety,显著改善了可验证的证据定位。这些发现表明,针对特定任务的微调对于可靠、可审计的临床文档理解至关重要。RAPTOR+使得提取的转诊决策能够与视觉证据相关联,从而支持更安全、更高效的癌症转诊分诊。
Abstract
Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
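Scoring rationale: The paper studies clinical document understanding and cancer referral processing based on vision-language models (VLMs), which belongs to multimodal document understanding and medical AI applications. Although it involves multimodality (MultiModal) and vision-language models (MLLM), its core is document OCR and evidence grounding, not unified models, world models, reinforcement learning, OPD, or GRPO. The per-keyword rationale is as follows:
- Unify Models: The paper does not involve model unification or integration; it only uses specific VLMs for end-to-end extraction, so relevance is very low.
- World Models: The paper does not involve world models or environment modeling.
- MLLM: The paper uses Vision-Language Models (VLMs) for multimodal understanding, which falls within MLLM, but this is not its core contribution; relevance is moderate.
- CV: The paper involves visual document understanding (OCR, image processing), which is a computer vision application but not a core methodological contribution.
- MultiModal: The paper's core is multimodal (text plus image) document understanding; relevance is fairly high.
- model-based RL: The paper does not involve model-based reinforcement learning.
- OPD: The paper does not involve OPD (which may refer to some other specific term).
- RL: The paper does not involve reinforcement learning.
- GRPO: The paper does not involve GRPO (possibly Group Relative Policy Optimization or similar).
Keywords
Vision-Language Models, Clinical Document Understanding, Evidence Grounding, Cancer Referral Processing, Fine-tuning, Multimodal
Abstract (Chinese translation)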
三维曲线骨架化(3D curve skeletonization)的进展正在加速推动众多应用领域的发展。然而,开发能够捕捉复杂物体细节的鲁棒骨架化算法仍然具有挑战性。基于局部分隔符(Local Separators, LS)的骨架化方法提供了一种高效的图论方法,但由于其离散性质,存在表示不准确的问题。为解决这一问题,我们提出了CSCD,一种新颖的连续域曲线骨架化(Curve Skeletonization in the Continuous Domain)框架,将LS推广到流形上。具体而言,我们提出了两种实现:针对网格的CSCD-M和针对点云的CSCD-PC。CSCD-M利用网格的内在三角剖分来抵抗噪声并改善拓扑保持,而CSCD-PC采用簇状拉普拉斯算子(tufted Laplacians)以增强鲁棒性。据我们所知,CSCD-M是首个用于曲线骨架化的内在方法。我们的结果表明,CSCD-M在各种网格上达到了与LS相当的性能,并在Thingi10k数据集等基准测试中优于LS(TOG'21)。CSCD-PC在定性上优于CoverageAxis++(Eurographics'24)和EPCS(CAG'23)。最后,我们展示了CSCD在若干下游任务中的有效性:物体分类、形状分割、识别物体中的手柄、孔洞和收缩。项目网站:https://cscd-skel.pages.dev
Abstract
Advancements in 3D curve skeletonization are accelerating progress across a wide range of applications. However, developing robust skeletonization algorithms that capture intricate object details remains challenging. Skeletonization via Local Separators (LS) offers an efficient graph-based approach but suffers from representation inaccuracies due to its discrete nature. To address this, we introduce CSCD, a novel framework for Curve Skeletonization in the Continuous Domain, generalizing LS to manifolds. Specifically, we present two realizations: CSCD-M for meshes and CSCD-PC for point clouds. CSCD-M leverages the intrinsic triangulation of a mesh for resilience to noise and improved topological preservation, while CSCD-PC employs tufted Laplacians for enhanced robustness. To our knowledge, CSCD-M is the first intrinsic method for curve skeletonization. Our results show CSCD-M matches LS performance across diverse meshes and outperforms LS (TOG'21) on benchmarks such as the Thingi10k dataset. CSCD-PC qualitatively outperforms CoverageAxis++ (Eurographics'24) and EPCS (CAG'23). Finally, we demonstrate the efficacy of CSCD in a few downstream tasks: object classification, shape segmentation, and identifying handles, tunnels, and constrictions in objects. Project Website: https://cscd-skel.pages.dev
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.0 | 0.0/10 | 0.0 |
| World Models | 1.0 | 0.0/10 | 0.0 |
| MLLM | 1.0 | 0.0/10 | 0.0 |
| CV | 1.0 | 0.0/10 | 0.0 |
| MultiModal | 1.0 | 0.0/10 | 0.0 |
| model-based RL | 1.0 | 0.0/10 | 0.0 |
| OPD | 1.0 | 0.0/10 | 0.0 |
| RL | 1.0 | 0.0/10 | 0.0 |
| GRPO | 1.0 | 0.0/10 | 0.0 |
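Scoring rationale: The paper studies 3D curve skeletonization in the continuous domain for meshes and point clouds, which belongs to computer graphics and geometry processing. The given keywords (such as Unify Models, World Models, MLLM, MultiModal, model-based RL, OPD, RL, GRPO) all concern topics such as multimodal large models, world models, and reinforcement learning, and are entirely unrelated to this paper. CV (computer vision) has some connection, but the paper leans toward graphics over visual recognition and does not involve any vision model or task, so it is scored 0. The author list does not include any of the designated experts.
Keywords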
curve skeletonization, continuous domain, meshes, point clouds, local separators, CSCD, topological preservation, downstream tasks