arXiv Daily Report 2026-06-10
ArXiv Research Report
Papers
摘要翻译
时序建模对于机器人操作至关重要,因为有效的控制既需要记忆过去的交互经验,也需要想象未来的状态演变。然而,大多数 VLA(视觉 - 语言 - 动作)模型主要依赖当前观测,因此在长时序、时间依赖的任务上面临挑战。认知科学表明,人类依赖工作记忆来缓冲短暂上下文,利用海马系统保存过去经验的情景记忆,并通过内部模型想象可能的未来状态演变。受这些机制启发,我们提出 MemoryVLA++,这是一个完整的时序建模框架,旨在为机器人操作中的 VLA 模型配备记忆与想象能力。一个预训练的 VLM(视觉 - 语言模型)将当前观测编码为感知和认知标记,从而形成工作记忆。这些标记查询一个感知 - 认知记忆库(Perceptual-Cognitive Memory Bank),以检索相关的历史上下文。该记忆库存储来自过去交互的底层细节和高层语义,并通过基于冗余感知的整合机制进行更新。一个世界模型在去噪潜在空间中想象未来状态,并在记忆指导下整合这些想象的潜在表示,以形成完整的时序感知标记。所得到的标记作为条件输入,引导扩散动作专家预测时间一致的动作序列。我们在 5 个仿真基准和 3 类真实机器人任务上进行了广泛实验,涉及 3 种机器人,涵盖一般操作、长时序任务、鲁棒性及泛化能力。我们的方法在 Libero、SimplerEnv、Mikasa-Robo、Calvin、Libero-Plus 以及多种真实机器人任务上均取得了优异性能,验证了具备记忆与想象能力的完整时序建模的有效性。例如,在真实机器人上,该方法在一般任务、记忆依赖任务和想象依赖任务上分别实现了 +9%、+26% 和 +28% 的性能提升。项目页面:https://shihao1895.github.io/MemoryVLA-PP-Web
Abstract
Temporal modeling is essential for robotic manipulation, as effective control requires both memory of past interactions and imagination of future states. However, most VLA models rely primarily on the current observation and therefore struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived context, the hippocampal system to preserve episodic memory of past experience, and internal models to imagine possible future state evolution. Inspired by these mechanisms, we propose MemoryVLA++, a full temporal modeling framework that equips VLA models with memory and imagination for robotic manipulation. A pretrained VLM encodes the current observation into perceptual and cognitive tokens, forming working memory. These tokens query a Perceptual-Cognitive Memory Bank to retrieve relevant historical context. This bank stores low-level details and high-level semantics from past interactions, and is updated through redundancy-aware consolidation. A world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance to form full temporal-aware tokens. The resulting tokens condition a diffusion action expert to predict temporally consistent action sequences. We conduct extensive experiments on 5 simulation benchmarks and 3 categories of real-robot tasks across 3 robots, covering general manipulation, long-horizon temporal tasks, robustness, and generalization. Our method achieves strong performance across Libero, SimplerEnv, Mikasa-Robo, Calvin, Libero-Plus, and diverse real-robot tasks, validating the effectiveness of full temporal modeling with memory and imagination. For example, on real robots, it achieves +9%, +26%, +28% gains on general, memory-dependent, and imagination-dependent tasks. Project Page: https://shihao1895.github.io/MemoryVLA-PP-Web
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 7.0/10 | 10.5 |
| Tokenizer | 1.5 | 6.0/10 | 9.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 9.0/10 | 13.5 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 8.0/10 | 12.0 |
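Scoring rationale: The paper's core contribution is integrating world models and a memory mechanism into vision-language-action models (VLA/MLLM), so the related keywords score highly. Visual Encoder and MultiModal are central to the underlying architecture. model-based RL is reflected in the world model's imagination of future states. A tokenizer is present in the VLM but is not a main innovation. Unify Models refers to the unification of memory and imagination, so its relevance is moderate. The author list does not include any of the designated experts.
Keywords
Vision-Language-Action Models, Temporal Modeling, World Model, Memory Bank, Robotic Manipulation, Diffusion Action Expert, Pretrained VLM
In-depth analysis
Title (translated from Chinese): MemoryVLA++: Temporal Modeling in Vision-Language-Action Models via Memory and Imagination
Summary: This paper proposes MemoryVLA++, a full temporal modeling framework inspired by cognitive science that strengthens the memory and imagination capabilities of vision-language-action (VLA) models for robotic manipulation. Existing VLA models rely mainly on the current observation and struggle with long-horizon, temporally dependent tasks. MemoryVLA++ uses a pretrained VLM to encode the current observation into perceptual and cognitive tokens that form working memory. These tokens query a Perceptual-Cognitive Memory Bank (PCMB) to retrieve relevant historical context, and the bank is updated through redundancy-aware consolidation. In parallel, a video-generation world model imagines future states in a denoising latent space, and the imagined latents are integrated under memory guidance into full temporal tokens, which condition a diffusion action expert to predict temporally consistent action sequences. Experiments on 5 simulation benchmarks and 3 categories of real-robot tasks show leading performance on Libero, SimplerEnv, Mikasa-Robo, Calvin, and Libero-Plus, with real-robot gains of 9%, 26%, and 28% on general manipulation, long-horizon memory-dependent, and imagination-dependent tasks, respectively.
Innovations:
- Extends temporal modeling in VLAs from past memory alone to full temporal modeling across past, present, and future, introducing both memory and imagination.
- Proposes the Perceptual-Cognitive Memory Bank (PCMB), which stores low-level perceptual details together with high-level cognitive semantics and stays compact through redundancy-aware consolidation.
- Uses a video-generation world model that performs partial denoising in latent space to produce decision-relevant future imagination tokens, avoiding the high cost of pixel-level prediction.
- Designs a memory-guided imagination integration module that adaptively fuses imagined future latents with memory-augmented tokens into a full temporal-aware representation.
- In the diffusion action expert, cognitive tokens provide high-level semantic guidance and perceptual tokens provide fine-grained visual detail, together yielding temporally consistent action sequences.
Methodology: MemoryVLA++ is an end-to-end framework. A pretrained VLM (e.g., Qwen) first encodes the current RGB observation and the language instruction into perceptual tokens and cognitive tokens that serve as working memory. Working memory queries the PCMB through cross-attention to retrieve relevant historical context, which is fused adaptively through a gating mechanism. The PCMB is updated through redundancy-aware consolidation, which merges entries that are temporally adjacent and semantically similar. For future imagination, a diffusion-based video-generation world model performs partial denoising in latent space to produce future-state latents, which are integrated with the current tokens under memory guidance. The resulting full temporal tokens condition the diffusion action expert, which predicts continuous action sequences through iterative denoising. Training uses imitation learning and combines an action prediction loss with a memory retrieval loss.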
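The consolidation step described above (merging memory entries that are temporally adjacent and semantically similar) can be sketched in a few lines. This is only an illustration under stated assumptions: the capacity trigger, the cosine-similarity criterion, and merging by element-wise averaging are hypothetical choices, and the function names `cosine` and `consolidate` are not from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def consolidate(bank, capacity):
    """Shrink a time-ordered list of feature vectors to at most `capacity` entries.

    Assumption: the most similar pair of temporally adjacent entries is
    merged by averaging; the paper's exact rule may differ.
    """
    bank = [list(v) for v in bank]
    while len(bank) > capacity:
        # Only neighbours in time are candidates for merging.
        sims = [cosine(bank[i], bank[i + 1]) for i in range(len(bank) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(bank[i], bank[i + 1])]
        bank[i:i + 2] = [merged]
    return bank

# Five toy entries: two near-duplicate pairs followed by one distinct entry.
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0]]
compact = consolidate(history, capacity=3)
print(compact)
```

In this toy run the two near-duplicate pairs collapse and the distinct final entry survives unchanged, which is the behaviour a redundancy-aware update is meant to have.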
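Key Results:
- Reaches a 98.4% success rate on the Libero simulation benchmark and 74.0% on SimplerEnv, up to 16.7 percentage points above the baselines.
- Reaches a 44.4% success rate on the long-horizon Mikasa-Robo benchmark and a score of 4.29 on Calvin, a gain of 15.0 percentage points over the baseline.
- Reaches an 82.7% success rate on Libero-Plus, the robustness and generalization benchmark.
- On real robots, scores 85% on general manipulation, 83% on long-horizon memory-dependent tasks, and 77% on long-horizon imagination-dependent tasks, which is 9%, 26%, and 28% above the baseline, respectively.
- Ablations confirm the contribution of the memory bank, world-model imagination, and memory-guided integration.
Tech Stack:
- Vision-language model (VLM): Qwen series
- Diffusion models for action prediction and video generation
- Perceptual-Cognitive Memory Bank (PCMB)
- Cross-attention
- Gated fusion
- Redundancy-aware consolidation
- Partial denoising in latent space
- Imitation learning
- Large-scale robot datasets (OXE, Agibot, etc.)
Strengths:
- Inspired by cognitive science, with a complete memory-and-imagination design and a clear theoretical motivation.
- Significant gains across multiple simulation and real-robot benchmarks, indicating strong generalization.
- Latent-space imagination avoids the high compute cost of pixel-level prediction.
- End-to-end framework that needs no external VLM or planner, which simplifies deployment.
- Supports multiple robot platforms (single-arm and dual-arm), with broad experimental coverage.
Limitations:
- Training the world model requires additional video data, which may raise data collection costs.
- The memory bank's capacity and update strategy may still need tuning for extremely long-horizon tasks.
- The method targets manipulation; its applicability to mobile manipulation or complex navigation is unverified.
- It depends heavily on the VLM backbone; swapping the backbone may require retraining or re-tuning.
Relevance To Keywords:
- Unify Models, World Models, Representation Learning, Model-Based RL: The paper uses a world model to imagine future states, combining world models with representation learning.
- Native multimodal large models; unified understanding and generation in multimodal large models: Perception and cognition are encoded by a VLM (a multimodal large model) that also drives action generation, which reflects unified understanding and generation.
- Representation learning: Perceptual and cognitive tokens give an efficient state representation, and the memory bank models long-term dependencies.
- World models: The video-generation world model imagines future states, which is a model-based approach.
- Reinforcement learning, post-training: The paper uses imitation learning rather than reinforcement learning, but the world model and memory mechanism could be combined with reinforcement learning for post-training.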
摘要翻译
具身世界模型已成为视觉机器人决策与交互式环境模拟中的关键范式。然而,传统的具身框架依赖于低维结构化动作向量(如关节角度和末端执行器姿态),其存在表达能力有限、在不同具身体之间泛化能力差以及复杂物理交互下动态建模不自然等缺陷。为了解决这些局限性,本文提出了 iMac(Image as Action Control,图像即动作控制),这是一种新颖的统一控制范式,它将原始视觉图像视为具身世界模型的原生动作表示。与传统显式运动学动作编码不同,iMac 将连续视觉操作表述为基于图像的动作令牌,这些令牌天然地封装了空间运动意图、交互式几何约束以及细微的物理动力学。我们构建了一个双分支具身架构,包含图像 - 动作编码器和动态世界预测器:编码器将目标驱动视觉图像压缩为紧凑的动作嵌入,而预测器则学习基于图像动作的环境转换规则,以实现高保真的未来状态预测和闭环具身控制。我们在公共具身操作基准和真实机器人场景上进行了广泛的实验。实验结果表明,iMac 在预测准确率、任务成功率以及跨场景泛化能力方面均优于基于向量的动作控制基线。此外,我们的图像 - 动作设计消除了对手动定义动作空间的依赖,实现了对异构具身代理的灵活且通用的控制。这项工作为具身世界模型提供了一种创新的视觉 - 动作视角,为可扩展的机器人感知与操作提供了一个简单而有效的范式。
Abstract
Embodied world models have emerged as a pivotal paradigm for visual robotic decision-making and interactive environment simulation. However, conventional embodied frameworks rely on low-dimensional structured action vectors (e.g., joint angles and end-effector poses), which suffer from limited expressive capacity, poor generalization across diverse embodiments, and unnatural dynamic modeling for complex physical interactions. To address these limitations, this paper proposes iMac (Image as Action Control), a novel unified control paradigm that treats raw visual images as native action representations for embodied world models. Departing from traditional explicit kinematic action encoding, iMac formulates continuous visual manipulation as image-based action tokens, which inherently encapsulate spatial motion intentions, interactive geometric constraints, and subtle physical dynamics. We construct a dual-branch embodied architecture consisting of an image-action encoder and a dynamic world predictor: the encoder compresses target-driven visual images into compact action embeddings, while the predictor learns environment transition rules conditioned on image actions to achieve high-fidelity future state prediction and closed-loop embodied control. Extensive experiments are conducted on public embodied manipulation benchmarks and real-world robotic scenarios. The results demonstrate that iMac outperforms vector-based action control baselines in prediction accuracy, task success rate, and cross-scene generalization ability. Moreover, our image-action design eliminates the reliance on manually defined action spaces, realizing flexible and universal control for heterogeneous embodied agents. This work provides an innovative visual-action perspective for embodied world models, offering a simple yet effective paradigm for scalable robotic perception and manipulation.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 8.0/10 | 12.0 |
| Tokenizer | 1.5 | 8.0/10 | 12.0 |
| Visual Encoder | 1.5 | 9.0/10 | 13.5 |
| World Models | 1.5 | 10.0/10 | 15.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 9.0/10 | 13.5 |
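Scoring rationale: The paper centers on embodied world models (World Models), proposes a unified control paradigm (Unify Models), and uses image-based action tokens (Tokenizer), with a visual encoder (Visual Encoder) and a world predictor (model-based RL). It combines vision and action (MultiModal) but does not involve a language model (MLLM), so that keyword scores low. The author list does not include any of the designated experts, so no bonus is applied.
Keywords
Embodied World Models, Image as Action Control, Visual Action Tokens, Dynamic World Predictor, Unified Control Paradigm, Robotic Manipulation
In-depth analysis
Title (translated from Chinese): iMaC: Turning Actions into Motion and Contact Images for Embodied World Models
Summary: This paper proposes iMaC, an embodied world model that uses explicit spatial control signals to improve the accuracy of action-conditioned video prediction and thereby support offline evaluation of robot policies. Existing methods encode actions as compact vectors, so the model must infer their spatial consequences indirectly, which is error-prone in fine manipulation. iMaC converts future actions into two kinds of image controls: motion images (the future robot appearance rendered from the URDF with forward kinematics) and contact images (a dual-stream geometric distance field built from point clouds of the current scene and the future robot). The model also predicts depth to strengthen spatial understanding and uses a training-time rollout strategy (generated frames become the next reference) to support minute-long generation. On 8 real-robot long-horizon manipulation tasks, iMaC's world-model evaluation scores correlate strongly and positively with real policy success rates and distinguish the performance of different policy checkpoints.
Innovations:
- Converts future actions into motion images: the future robot appearance is rendered from the URDF with forward kinematics, giving pixel-level guidance and reducing how much robot motion the model must infer.
- Builds dual-stream contact images: scene-to-gripper and robot-to-scene distance fields are computed from the current scene point cloud and the future robot point cloud, explicitly encoding contact-related spatial geometry.
- Training-time rollout: generated frames are used as the next reference during training, which reduces exposure bias in closed-loop generation and supports minute-long video prediction.
- Auxiliary depth prediction: predicted depth maps are used to build later contact images and improve the world model's spatial understanding.
Methodology: iMaC builds on the WAN2.2 image-to-video DiT model and stitches multiple views (one fixed head camera and two wrist cameras) into a single image. Motion images are produced by computing future joint states from the URDF with forward kinematics and rendering the robot model from the same camera viewpoints. Contact images are produced by estimating initial depth with Depth Anything 3, generating point clouds, and then computing the distance from the current scene point cloud to the future gripper point cloud (scene-to-gripper) and from the future robot point cloud to the current scene point cloud (robot-to-scene), giving two distance-field images. These control images pass through VAE encoding and a dedicated patchify layer and are injected into the noised future video tokens by latent addition. The model is trained with a flow matching loss. For long horizons it uses training-time rollout: the last frame of a generated chunk becomes the reference image for the next chunk, so the model adapts to its own prediction errors.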
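The two distance fields behind the contact images can be illustrated with a brute-force nearest-neighbour computation on toy point clouds. This is a hedged sketch, not the paper's pipeline: the point coordinates are made up, the helper names are hypothetical, and a real implementation would typically use a spatial index and then rasterize the per-point distances into image space.

```python
import math

def nearest_distance(point, cloud):
    """Euclidean distance from `point` to its nearest neighbour in `cloud`."""
    return min(math.dist(point, q) for q in cloud)

def distance_field(source, target):
    """Per-point distance from every point in `source` to the `target` cloud."""
    return [nearest_distance(p, target) for p in source]

# Toy 3D points in metres; real clouds would come from estimated depth
# (current scene) and from the robot model posed at a future joint state.
scene_now = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
gripper_future = [(0.5, 0.0, 0.1)]
robot_future = [(0.5, 0.0, 0.1), (0.5, 0.0, 0.6)]

# Stream 1: how close is each scene point to where the gripper will be?
scene_to_gripper = distance_field(scene_now, gripper_future)
# Stream 2: how close is each future robot point to the current scene?
robot_to_scene = distance_field(robot_future, scene_now)

print(scene_to_gripper)
print(robot_to_scene)
```

Small values in either stream mark regions where contact is imminent, which is the information a compact action vector leaves implicit.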
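Key Results:
- On 8 real-robot long-horizon manipulation tasks, iMaC's world-model evaluation scores correlate strongly and positively with real policy success rates (high Spearman correlation).
- iMaC distinguishes the relative performance of different policy checkpoints, giving a reliable basis for policy selection.
- Compared with baselines that use compact action encodings, iMaC accumulates less error in long-horizon closed-loop generation and supports minute-long video prediction.
Tech Stack:
- WAN2.2 image-to-video DiT model
- VAE encoder
- URDF (Unified Robot Description Format)
- Forward kinematics
- Depth Anything 3 (depth estimation)
- Point cloud generation and distance field computation
- Flow matching loss
- Patchify layers (separate for video and control)
- Multi-view stitching (head camera plus wrist cameras)
Strengths:
- Explicitly spatial action representation: motion and contact images provide pixel-level and geometry-level control, reducing reliance on abstract action vectors.
- Long-horizon closed-loop generation: training-time rollout mitigates exposure bias, so the model can generate minute-long videos stably.
- Strong correlation with real performance: the reliability of world-model evaluation is validated on multiple real manipulation tasks, so it can be used for policy comparison and failure analysis.
- No extra annotation: the URDF and action sequences are natural inputs in policy evaluation, and contact images are built automatically from estimated depth.
Limitations:
- Depends on the robot URDF and camera calibration, which makes cross-platform transfer less convenient.
- Depth estimation accuracy (Depth Anything 3) affects contact image quality and may introduce errors in cluttered scenes or with transparent objects.
- Training-time rollout adds training complexity, and performance may drop on unseen scenes or objects.
- Higher compute cost: rendering motion images and computing point-cloud distance fields takes extra resources and may limit real-time use.
Relevance To Keywords: The paper focuses directly on World Models, using video generation to simulate the environment for policy evaluation, a typical application of Model-Based RL. Turning actions into image representations reflects Representation Learning, since low-dimensional actions are mapped to high-dimensional spatial control signals. Video generation models fall under multimodal large models, but the paper does not address unified understanding and generation in native multimodal models and concentrates on the generation side. Regarding post-training, training-time rollout can be seen as a technique for adapting the model to closed-loop generation. Overall, the paper is highly relevant to world models, representation learning, model-based RL, and multimodal large models (generation), and only weakly relevant to unified understanding in native multimodal models.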
摘要翻译
我们提出 Echo-Memory,一项针对动作条件世界模型(action-conditioned world models)中记忆机制的受控研究。这些模型基于第一帧、文本提示和相机动作序列生成多段视频,但其核心失败往往并非局部图像合成,而是记忆:当相机离开并返回后,场景或显著物体可能会悄然发生改变。现有的记忆设计难以比较,因为性能提升往往与骨干网络、训练策略、检索方法及评估标准的差异混杂在一起。Echo-Memory 固定了动作到视频的接口,仅改变生成器对历史的存储与读取方式。在共享的视频扩散骨干网络(video diffusion backbone)、优化器、相机动作表示、采样器和评估流程下,我们比较了原始上下文、基于压缩的记忆、具有不同读出路径的空间摘要以及状态空间循环机制(state-space recurrence)。这一匹配矩阵分离了四个原本容易混淆的维度:容量(capacity)、压缩(compression)、读出(read-out)和循环(recurrence)。我们还通过一个三分支协议评估记忆:回放质量(replay quality)、域内循环重访(in-domain loop revisit)和开放域返回探测(open-domain return probes)。各分支经常不一致,表明回放保真度并非衡量世界记忆的充分代理指标。由此得出三个发现:原始上下文是一个强大的容量基线,其对开放域返回的提升远大于对回放指标的提升。紧凑性并非无需代价的容量替代:激进的空间记忆和混合压缩记忆会丢失返回所需的显著证据。最后,块级状态空间循环机制是我们矩阵中最强的开放域返回机制,表明隐式记忆的结构与是否使用它的决策同样重要。这些结果提供了一个紧凑的协议,用于在动作世界模型(action world models)中研究记忆,超越了孤立的回放指标。
Abstract
We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 10.0/10 | 15.0 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 8.0/10 | 12.0 |
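Scoring rationale: The paper focuses on memory mechanisms in world models (World Models), as both the title and abstract state explicitly, so that keyword gets full marks. The study involves video, text, and action inputs, making it a multimodal system, and action-conditioned world models are a basic building block of model-based RL, so both are highly relevant. Comparing different memory mechanisms under one unified evaluation framework reflects the idea of Unify Models. A visual encoder is present as part of the diffusion backbone, and Tokenizer and MLLM are touched on but are not core innovations. The author list does not include Yang Shi or the other designated experts, so no bonus is applied.
Keywords
World Models, Memory Mechanisms, Action-conditioned, Video Generation, State-space Recurrence, Diffusion Backbone, Controlled Study, Open-domain Return
In-depth analysis
Title (translated from Chinese): Echo-Memory: A Controlled Study of Memory in Action World Models
Summary: This paper presents Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. The motivation is that existing world models generating multi-segment videos often let the scene change silently after the camera leaves and returns, because memory is insufficient. Existing memory designs are hard to compare because they differ in backbone, training, retrieval, and evaluation. Echo-Memory fixes the video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, varies only how history is stored and read, and compares four memory families: raw context, compression-based memory, spatial summaries, and state-space recurrence. A three-branch evaluation protocol (replay quality, in-domain loop revisit, open-domain return probes) shows that raw context is a strong capacity baseline that substantially improves open-domain return, that compactness cannot substitute for capacity, and that block-wise state-space recurrence is the strongest open-domain return mechanism. The conclusion is that replay fidelity is not a sufficient proxy for world memory and that the structure of memory matters as much as its use.
Innovations:
- Proposes a controlled comparison framework that fixes every non-memory component and varies only the memory representation, enabling a fair comparison.
- Decomposes memory design into four orthogonal axes (capacity, compression, read-out, and recurrence) and evaluates the effect of each.
- Designs a three-branch evaluation protocol (replay, in-domain loop, open-domain return) that shows replay metrics are insufficient for measuring world memory.
- Finds that block-wise state-space recurrence performs best on open-domain return, underlining the importance of implicit memory structure.
- Establishes how strong raw context is as a capacity baseline and where compact compressed memories fall short in open-domain settings.
Methodology: The paper runs controlled experiments on a fixed video diffusion Transformer backbone (Video DiT), using per-frame VAE latents and a relative RT camera-action encoding. The memory design space covers four families: raw context (Context), compressed memory (Compression), spatial memory (Spatial), and state-space recurrence (State-Space). All variants share the same training recipe, optimizer (AdamW), learning rate, resolution (352×640), segment length (81 frames), and evaluation pipeline. Evaluation uses the three-branch protocol: replay quality (PSNR, SSIM, LPIPS), in-domain loop revisit (fixed trajectories in which the camera leaves and returns), and open-domain return probes (a VLM judges scene and object consistency).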
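The report lists the camera action as a 12-dimensional relative RT encoding (9 rotation entries plus 3 translation entries). One plausible way to build such a vector from two consecutive camera poses is sketched below. The pose convention (camera-to-world, with the next camera expressed in the previous camera's frame) and the row-major flattening order are assumptions for illustration, not details confirmed by the source.

```python
def transpose(m):
    """Transpose a 3x3 matrix given as nested lists."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def relative_rt(r_prev, t_prev, r_next, t_next):
    """Return a 12-D action: 9 rotation entries followed by 3 translation entries.

    Assumes camera-to-world poses; the inverse of a rotation is its transpose.
    """
    r_prev_t = transpose(r_prev)
    r_rel = matmul(r_prev_t, r_next)
    t_rel = matvec(r_prev_t, [b - a for a, b in zip(t_prev, t_next)])
    return [x for row in r_rel for x in row] + t_rel

# Camera yawed 90 degrees about z, then moved 2 units along world y.
yaw90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
action = relative_rt(yaw90, [1.0, 0.0, 0.0], yaw90, [1.0, 2.0, 0.0])
print(action)
```

Because the orientation does not change between the two poses, the rotation part is the identity, and the world-frame step along y shows up as a step along the camera's own x axis. A relative encoding like this is independent of where the trajectory started.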
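Key Results:
- As raw context grows from K=1 to K=20, replay image quality improves and the open-domain VLM return score rises from 12.25 to 58.63.
- Spatial memory is competitive on replay PSNR but weak on open-domain return; hybrid compression loses signal that length compression retains.
- Read-out matters more than storage: tokens that are retained but unreadable do not improve revisit consistency.
- Block-wise state-space recurrence reaches an open-domain VLM score of 69.00; its replay PSNR is lower, but its revisit consistency is the strongest.
- The three evaluation branches often disagree, which shows that replay fidelity is not a sufficient proxy for world memory.
Tech Stack:
- Video DiT (video diffusion Transformer)
- VAE (variational autoencoder) latents
- Flow matching regression loss
- Relative RT camera-action encoding (12-D: 9 rotation + 3 translation)
- AdamW optimizer
- PSNR, SSIM, LPIPS, FID, and FVD evaluation metrics
- VLM (vision-language model) as judge for open-domain evaluation
- Context retriever (based on camera overlap, scene geometry, etc.)
- Compression operations (token weighting, temporal window pooling, frame packing)
- State-space model (block-wise recurrence)
Strengths:
- Rigorous controlled design that isolates the effect of the memory mechanism and makes the conclusions more credible.
- The three-branch protocol covers different memory demands and avoids single-metric bias.
- Clear decomposition of memory design into four orthogonal axes, giving a systematic picture.
- Shows how effective raw context and state-space recurrence are, which gives practical design guidance.
- Open-source code and a project page support reproducible research.
Limitations:
- Only one backbone (Video DiT) is used, so the conclusions may not transfer directly to other architectures.
- The action representation is limited to relative RT; absolute encodings and more complex action types are not explored.
- Evaluation is limited to video generation and does not cover reinforcement learning or interactive environments.
- Context length is capped at 20, so memory behavior on longer sequences is not explored.
- The compute-cost trade-off between memory and generation quality is not analyzed in depth.
Relevance To Keywords:
- Unify Models: The paper studies memory mechanisms in world models, a key component of unified models.
- World Models: It focuses directly on action-conditioned world models and their memory failures.
- Representation Learning: The memory representations (context, compression, spatial, state-space) are central to representation learning.
- Model-Based RL: World models underpin model-based reinforcement learning, and memory affects long-term planning.
- Native multimodal large models: Video generation and memory mechanisms are important capabilities of multimodal large models.
- Unified understanding and generation in multimodal large models: In the three-branch evaluation a VLM handles understanding of scene consistency, while a diffusion model handles generation.
- Representation learning: The different memory families are essentially different ways of learning a representation of history.
- World models: The paper's title and content are directly about memory in world models.
- Reinforcement learning: World models can simulate environments for reinforcement learning, and memory affects long-term consistency.
- Post-training: The fine-tuning run in the paper (5k steps) can be regarded as a post-training stage.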
摘要翻译
世界 - 动作模型(World-action models)已成为机器人操作的一种有前景的范式,通过联合建模视觉场景动力学与动作,将物理先验注入策略学习过程中。然而,现有的世界 - 动作模型在相同的时间分辨率下耦合世界预测与动作执行,迫使世界分支建模近期帧变化,而这些变化既冗余又信息量微弱。我们认为,严格将世界预测与动作执行绑定到同一时序节奏,可能会未充分利用视频分支在具身控制中的潜力。因此,我们提出 AHA-WAM,一种基于双扩散变换器(DiT)架构的异步 horizon 自适应世界 - 动作模型,该模型围绕这种时序不对称性重新组织世界 - 动作建模。AHA-WAM 将视频 DiT 实例化为低频世界规划器,维护过去观测的滚动键值记忆,并暴露可重用的逐层潜上下文以编码长 horizon 场景演化;而高频动作 DiT 则通过逐层联合注意力查询此上下文,在闭环中执行短动作块。为支持异步执行,我们引入 horizon 自适应偏移训练和观测引导视频上下文路由(OVCR),二者共同使动作专家能够利用长 horizon 世界上下文,同时保持对实时执行状态的响应,而无需重新运行视频 DiT。在 RoboTwin 和真实世界操作任务上的实验表明,AHA-WAM 在未进行任何机器人数据预训练的情况下实现了最先进的性能,在 RoboTwin 上平均成功率为 92.80%,在 4 个真实世界任务上成功率为 78.3%,同时达到 24.17 Hz 的闭环控制,相较于 Fast-WAM 实现了 4.59 倍加速。
Abstract
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 7.5/10 | 11.2 |
| Tokenizer | 1.5 | 2.5/10 | 3.8 |
| Visual Encoder | 1.5 | 6.5/10 | 9.8 |
| World Models | 1.5 | 9.0/10 | 13.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 8.0/10 | 12.0 |
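Scoring rationale: The paper's core proposal is a world-action model (World Models) that combines vision and action (MultiModal) for robot control (model-based RL). Architecturally it unifies world prediction and action execution (Unify Models); it uses vision Transformers but does not emphasize a tokenizer (Tokenizer) or a language model (MLLM).
Keywords
World-Action Modeling, Asynchronous Horizon-Adaptive, Diffusion Transformer, Observation-Guided Context Routing, Robot Manipulation, Closed-loop Control, Visual Scene Dynamics
In-depth analysis
Title (translated from Chinese): AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
Summary: Existing world-action models bind world prediction and action execution to the same temporal resolution, so the world branch wastes compute on near-term frame changes that are redundant and carry little information. The paper therefore proposes AHA-WAM, an asynchronous horizon-adaptive world-action model built on a dual Diffusion Transformer (DiT) architecture. AHA-WAM makes the video DiT a low-frequency world planner that maintains a rolling key-value memory and exposes reusable layerwise latent context encoding long-horizon scene evolution, and makes the action DiT a high-frequency executor that queries this context through layerwise joint attention to execute short action chunks in closed loop. To support asynchronous execution it introduces horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), so the action expert can use long-horizon world context while staying responsive to the real-time execution state. On RoboTwin and real-world manipulation tasks, AHA-WAM reaches success rates of 92.80% and 78.3%, respectively, without robot-data pretraining, with closed-loop control at 24.17 Hz, a 4.59x speedup over Fast-WAM.
Innovations:
- Proposes the asynchronous horizon-adaptive world-action model (AHA-WAM), which decouples video-DiT world planning and action-DiT closed-loop execution into different time scales, and introduces horizon-adaptive offset training to support arbitrary planner-executor phase relationships.
- Develops Observation-Guided Video-Context Routing (OVCR), which dynamically builds chunk-specific latent video context from the current observation so that asynchronous planner context stays aligned with the real-time execution state without rerunning the video DiT.
- Reaches state-of-the-art results on simulated and real-world manipulation tasks without large-scale robot-data pretraining, and substantially improves inference efficiency (closed-loop control at up to 56.9 Hz, a 10.82x speedup over Fast-WAM).
Methodology: AHA-WAM uses a dual-DiT architecture. A low-frequency video DiT acts as the world planner: it predicts future video latents over a longer horizon and maintains a rolling key-value memory that links past observations to future plans. A high-frequency action DiT acts as the executor: it predicts action chunks over a shorter horizon and obtains visual information from the video DiT's latent context through layerwise joint attention. The OVCR mechanism builds queries from the latest observation and updates the video DiT's key-value state, so the action expert receives observation-conditioned planner context. Training uses horizon-adaptive offset training, which exposes the model to many planner-executor phase offsets. Inference speeds up the action stream through ODE distillation and CUDA optimization.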
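The asynchronous schedule described above, a slow planner whose context is reused by a fast executor, can be sketched as a simple control loop. Everything here is a hypothetical stand-in: `plan` and `act` replace the video DiT and action DiT with trivial functions, and the fixed `plan_every` interval is an assumption made for clarity rather than the paper's scheduling rule.

```python
def run_async(num_steps, plan_every, plan, act, observe):
    """Closed loop in which the planner runs once every `plan_every` control steps."""
    context, log = None, []
    for step in range(num_steps):
        obs = observe(step)
        if step % plan_every == 0:
            context = plan(obs)            # slow path: refresh world context
        phase = step % plan_every          # planner-executor offset at this step
        log.append(act(obs, context, phase))  # fast path: runs every step
    return log

# Hypothetical stand-ins for the two networks.
planner_calls = []

def plan(obs):
    planner_calls.append(obs)
    return {"planned_at": obs}

def act(obs, context, phase):
    # A real executor would attend to the cached context; here we only record
    # which plan it used and how stale that plan is.
    return (obs, context["planned_at"], phase)

log = run_async(6, plan_every=3, plan=plan, act=act, observe=lambda s: s)
print(log)
print(planner_calls)
```

The `phase` value is the quantity that offset training would need to cover: the executor sees the same cached context at several different offsets from the moment it was planned.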
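Key Results:
- Reaches a state-of-the-art average success rate of 92.80% across the 50 tasks of RoboTwin 2.0.
- Reaches an average success rate of 78.3% on 4 real-world manipulation tasks covering deformable-object manipulation, long-horizon tidying, fine tool use, and spatial generalization.
- In real-world out-of-distribution evaluation, ties with π0.5 for the smallest performance drop, indicating stronger robustness to distribution shift.
- Closed-loop control runs at up to 56.9 Hz, a 10.82x speedup over Fast-WAM.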
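This report quotes two throughput figures for AHA-WAM: 24.17 Hz at a 4.59x speedup in the abstract, and up to 56.9 Hz at a 10.82x speedup in the results list. A quick cross-check is to divide each rate by its speedup and compare the implied Fast-WAM rate. The baseline rate itself is not stated in the report, so the value computed here is an inference, not a reported number.

```python
def implied_baseline_hz(reported_hz, speedup):
    """Baseline control rate implied by a reported rate and its speedup factor."""
    return reported_hz / speedup

abstract_cfg = implied_baseline_hz(24.17, 4.59)
fastest_cfg = implied_baseline_hz(56.9, 10.82)

print(round(abstract_cfg, 2), round(fastest_cfg, 2))
```

Both pairs imply a baseline of roughly 5.3 Hz, so the two figures are mutually consistent and most likely describe two configurations measured against the same baseline, plausibly without and with the inference-time acceleration mentioned in the methodology.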
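Tech Stack:
- Diffusion Transformer (DiT)
- Variational autoencoder (VAE)
- Rolling KV memory
- Layerwise joint attention
- Observation-Guided Video-Context Routing (OVCR)
- Horizon-adaptive offset training
- ODE distillation
- CUDA optimization
Strengths:
- Decouples the time scales of the world model and the action model, so the video branch concentrates on long-horizon planning and the action branch on high-frequency closed-loop control; the structure is well motivated.
- OVCR addresses stale planner context in asynchronous execution without recomputing the video DiT, balancing efficiency and accuracy.
- Reaches state-of-the-art performance on simulated and real-world tasks without large-scale robot-data pretraining, which makes it practical.
- Inference is markedly more efficient, and the high closed-loop control rate suits real deployment.
Limitations:
- The dual-DiT architecture and OVCR add model complexity and make training harder.
- Experiments cover only RoboTwin and a small number of real-world tasks, so generalization to broader scenes and task types remains to be evaluated.
- The video planner's context is refreshed at a low rate in asynchronous execution, which may introduce latency or stale information in fast-changing scenes.
Relevance To Keywords:
- Unify Models: AHA-WAM unifies the world model and the action model in a dual-DiT architecture for joint modeling.
- World Models: The video DiT acts as a world planner that learns visual scene dynamics, which places it within world models.
- Representation Learning: OVCR and the rolling KV memory learn reusable long-horizon latent context representations.
- Model-Based RL: World-action models inject physical priors into policy learning, in line with the model-based RL paradigm.
- Native multimodal large models: DiTs process visual and language inputs, but this is not a typical large language model.
- Unified understanding and generation in multimodal large models: The model understands visual observations and language instructions and generates both actions and future video.
- Representation learning: The video DiT learns latent representations of scene evolution, and the action DiT uses them for control.
- Reinforcement learning: The paper does not explicitly use reinforcement learning, but world-action models can serve planning and policy learning in that setting.
- Post-training: The paper does not cover post-training, but the model could be fine-tuned further to adapt to new tasks.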
摘要翻译
自回归(AR)模型在视觉生成领域展现出巨大潜力,凭借简单的架构和优化目标实现了卓越的性能。然而,现有方法通常局限于单模态条件(例如文本),限制了其在需要从多样控制进行图像合成的现实场景中的适用性。本文提出了 OmniGen-AR,一个用于任意到图像(Any-to-Image)生成的统一自回归框架。通过共享视觉标记器对各种视觉条件进行离散化,并使用文本标记器处理文本提示,OmniGen-AR 在单个模型中支持广泛的条件输入,包括文本(文本到图像生成)、空间信号(分割到图像和深度到图像)以及视觉上下文(图像编辑、帧预测和文本到视频生成)。为了缓解条件 token 向内容 token 泄露信息的风险,我们引入了解耦因果注意力(DCA),该机制将全序列因果掩码分离为条件因果注意力和内容因果注意力。它在训练过程中充当正则化器,同时不影响推理期间的标准下一个 token 预测。借助此设计,OmniGen-AR 在一系列基准测试上取得了新的最先进或至少具有竞争力的结果,例如在 GenEval 上达到 0.63,在 VBench 上达到 80.02,证明了其在灵活且高保真视觉生成方面的有效性。
Abstract
Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. It serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 9.0/10 | 13.5 |
| Tokenizer | 1.5 | 9.0/10 | 13.5 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper presents a unified autoregressive framework for diverse input conditions, strongly aligning with 'Unify Models' (9.0) and 'Tokenizer' (9.0) as it uses shared tokenizers for text and visual inputs. It handles multiple modalities ('MultiModal': 8.0). While it involves visual processing, the core novelty is the AR generation and attention mechanism rather than a specific encoder architecture ('Visual Encoder': 5.0). It shares architecture with MLLMs but focuses on generation ('MLLM': 6.0). Frame prediction relates loosely to world models but it is primarily a generative model ('World Models': 3.0). There is no reinforcement learning content ('model-based RL': 1.0).
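Keywords
Any-to-Image Generation, Autoregressive Framework, Shared Visual Tokenizer, Disentangled Causal Attention, Text-to-Image, Image Editing, Frame Prediction
In-depth analysis
Title (translated from Chinese): OmniGen-AR: Autoregressive Any-to-Image Generation
Summary: This paper proposes OmniGen-AR, a unified autoregressive framework for generating images from arbitrary conditions. A shared visual tokenizer discretizes many kinds of visual conditions (such as segmentation masks, depth maps, and reference images) into visual tokens, and a text tokenizer handles text prompts, so a single model supports text-to-image, segmentation-to-image, depth-to-image, image editing, frame prediction, and text-to-video generation. To address information leakage from condition tokens to content tokens, the authors introduce Disentangled Causal Attention (DCA), which during training separates the full-sequence causal mask into condition causal attention and content causal attention as a regularizer, while inference still uses standard next-token prediction. OmniGen-AR reaches 0.63 on GenEval and 80.02 on VBench, setting new state-of-the-art or competitive results on several benchmarks and supporting its effectiveness for flexible, high-fidelity visual generation.
Innovations:
- First unified autoregressive framework for any-to-image generation, covering text, spatial signals, and visual context as conditional inputs.
- A shared visual tokenizer handles the different visual conditions, so no separate encoder is needed per condition.
- Disentangled Causal Attention (DCA) separates the attention paths of condition and content during training, which prevents information leakage without affecting inference.
- Unlike existing autoregressive models, it also supports video generation tasks (text-to-video generation and video prediction).
Methodology: OmniGen-AR uses a decoder-only Transformer. A VQ-VAE visual tokenizer discretizes images and visual conditions into tokens, and a language-model tokenizer handles text. All tokens are concatenated into one sequence (text, condition, content), and content tokens are predicted autoregressively. During training, standard causal attention is randomly replaced with Disentangled Causal Attention (DCA), in which condition tokens can attend only to themselves and to text tokens, content tokens can attend to all preceding tokens (including text and condition), and condition tokens cannot attend to content tokens. Inference still uses standard causal attention. The model is trained jointly on multiple tasks with a next-token prediction loss.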
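The tokenization and sequence layout described above can be illustrated with a generic vector-quantization lookup: each feature vector is replaced by the index of its nearest codebook entry, and the resulting ids are concatenated after the text ids. This is a minimal sketch of VQ lookup in general, not OmniGen-AR's tokenizer. The tiny codebook, the feature values, and the `TEXT_VOCAB` offset used to place visual ids after the text vocabulary are all hypothetical.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]

# Hypothetical 4-entry codebook shared by conditions and content.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
TEXT_VOCAB = 100  # assumption: visual ids are offset past the text vocabulary

text_ids = [5, 17, 42]  # stand-in for a tokenized prompt
# Features of a condition (e.g. a depth map) and of the target image.
cond_ids = [TEXT_VOCAB + i for i in quantize([(0.9, 0.1), (0.1, 0.8)], codebook)]
content_ids = [TEXT_VOCAB + i for i in quantize([(0.2, 0.1), (0.95, 0.9)], codebook)]

# Sequence order follows the description: text, then condition, then content.
sequence = text_ids + cond_ids + content_ids
print(sequence)
```

Because every visual condition goes through the same codebook, a segmentation mask, a depth map, and a reference image all become ids in one vocabulary, which is what lets a single next-token predictor handle them without per-condition encoders.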
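Key Results:
- Reaches 0.63 on the GenEval benchmark, a new state-of-the-art result.
- Reaches 80.02 on the VBench benchmark, a competitive result.
- Achieves the best or near-best performance on six tasks: text-to-image, text-to-video, frame prediction, image editing, depth-to-image, and segmentation-to-image.
- Ablations confirm that DCA prevents information leakage and improves generation quality.
Tech Stack:
- Autoregressive model
- VQ-VAE (Vector Quantized Variational Autoencoder) visual tokenizer
- Text tokenizer (e.g., the GPT-2 tokenizer)
- Decoder-only Transformer
- Causal attention
- Disentangled Causal Attention (DCA)
- Next-token prediction
Strengths:
- Unified framework: one model supports many conditional inputs and generation tasks without extra adapters.
- Simple and efficient: it keeps the simple architecture and objective of autoregressive models and is easy to scale.
- Leakage prevention: as a training regularizer, DCA keeps the model from exploiting shortcuts between condition and content, which improves instruction following.
- Strong performance: state-of-the-art or competitive results on several standard benchmarks.
- Compatibility: inference is identical to a standard autoregressive model and needs no modification.
Limitations:
- Generation detail depends on the quality of the visual tokenizer.
- Applying DCA randomly during training adds hyperparameter tuning complexity.
- Only short video generation is supported so far; long video generation is unverified.
- Autoregressive generation may be slower than diffusion models.
- Combination with reinforcement learning or post-training techniques is not explored.
Relevance To Keywords:
- Native multimodal large models: OmniGen-AR is a unified multimodal generation model that supports text, image, and video as inputs and outputs.
- Unified understanding and generation in multimodal large models: The model both understands (condition encoding) and generates (autoregressive prediction), with the emphasis on generation.
- Representation learning: The shared visual tokenizer learns a unified visual representation that maps different visual conditions into one discrete space.
- World models: An autoregressive model can be viewed as causal modeling of visual sequences and has some world-model properties (predicting future frames).
- Reinforcement learning: The paper does not use reinforcement learning directly, but autoregressive generation could be optimized further through post-training (such as RLHF), so there is a potential link.
- Post-training: Training uses standard next-token prediction, but DCA can be seen as a training regularization technique related to post-training.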
摘要翻译
多模态情感分析旨在通过联合建模文本和图像等异构模态来理解人类的情感与情绪。然而,多模态模型往往无法始终如一地超越强大的纯文本基线,且性能在不同融合策略下存在显著差异。在这项工作中,我们将独立预训练的模态编码器之间的表征错位识别为有效多模态学习的关键瓶颈,并通过控制实验表明,融合前的对齐通常比融合复杂度更为重要。为了解决这一问题,我们提出了一种统一的多模态情感分析框架,该框架利用视觉语言模型 (VLMs) 将视觉内容转换为结构化文本描述,将异构模态投射到共享的语言空间中,从而实现可解释的以文本为中心的推理。为了进一步提高鲁棒性,我们引入了一种混合学习策略,该策略结合了语义 token 选择与批次级均匀性正则化目标,旨在鼓励更分散且稳定的全局特征空间,同时缓解由 VLM 生成描述所引入的噪声。在多个多模态情感与情绪基准上的实验表明,我们的方法始终优于强大的单模态和多模态基线,实现了最先进的性能。我们的分析进一步强调了表征对齐在多模态情感学习中的关键作用。
Abstract
Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous modalities such as text and images. However, multimodal models often fail to consistently outperform strong text-only baselines, with performance varying significantly across fusion strategies. In this work, we identify representation misalignment between independently pretrained modality encoders as a key bottleneck for effective multimodal learning, and show through controlled experiments that alignment prior to fusion is often more important than fusion complexity. To address this issue, we propose a unified multimodal affective analysis framework that leverages vision-language models (VLMs) to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable text-centric reasoning. To further improve robustness, we introduce a hybrid learning strategy that combines semantic token selection with a batch-level uniformity regularization objective, encouraging a more dispersed and stable global feature space while mitigating noise introduced by VLM-generated descriptions. Experiments on multiple multimodal sentiment and emotion benchmarks show that our method consistently outperforms strong unimodal and multimodal baselines, achieving state-of-the-art performance. Our analysis further highlights the critical role of representation alignment in multimodal affective learning.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 8.0/10 | 12.0 |
| Tokenizer | 1.5 | 5.0/10 | 7.5 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
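Scoring rationale: The paper focuses on multimodal affective analysis, so 'MultiModal' is the core topic (10). It uses a VLM for visual encoding and conversion to text, so 'MLLM' and 'Visual Encoder' are highly relevant (8). It proposes a unified framework that maps vision into the language space, which reflects 'Unify Models' (8). Semantic token selection touches on 'Tokenizer' but is not central (5). 'World Models' and 'model-based RL' are unrelated to affective analysis and representation alignment (1 and 0). No designated expert authors were found. The weighted total is 60.0, above the dynamic pass mark of 27.8.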
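The scoring tables in this report follow a simple rule that can be checked by hand: each keyword's score is its weight times its relevance, and the weighted total is the sum of the scores. The sketch below reproduces the 60.0 total quoted in the rationale above from the table rows. How the dynamic pass mark of 27.8 is derived is not stated in the report, so it is not computed here.

```python
WEIGHT = 1.5  # every keyword in the table carries the same weight

# Relevance values (out of 10) copied from the scoring table for this paper.
relevance = {
    "Unify Models": 8.0,
    "Tokenizer": 5.0,
    "Visual Encoder": 8.0,
    "World Models": 1.0,
    "MLLM": 8.0,
    "MultiModal": 10.0,
    "model-based RL": 0.0,
}

scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())

print(scores)
print(total)
```

The per-keyword scores match the table column (12.0, 7.5, 12.0, 1.5, 12.0, 15.0, 0.0), and their sum matches the stated weighted total.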
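Keywords
Multimodal Sentiment Analysis, Representation Alignment, Vision-Language Models, Shared Linguistic Space, Text-centric Reasoning, Semantic Token Selection, Batch-level Uniformity
In-depth analysis
Title (translated from Chinese): Explicit Representation Alignment for Multimodal Affective Analysis
Summary: This paper addresses the observation that multimodal affective analysis models often fail to clearly outperform text-only baselines. It identifies representation misalignment between independently pretrained modality encoders as the key bottleneck limiting multimodal learning. The authors propose a unified multimodal affective analysis framework that uses vision-language models to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable, text-centric reasoning. To improve robustness, a hybrid learning strategy combines semantic token selection with a batch-level uniformity regularization objective, encouraging a more dispersed and stable global feature space and suppressing noise from VLM-generated descriptions. Experiments on several multimodal sentiment and emotion benchmarks show that the method consistently outperforms strong unimodal and multimodal baselines and reaches state-of-the-art performance. The study also analyzes the critical role of representation alignment in multimodal learning.
Innovations:
- Proposes an explicit representation alignment strategy that uses a vision-language model to convert visual content into structured text before fusion, unifying the modalities.
- Designs a hybrid learning strategy that combines Top-K semantic token selection with batch-level uniformity regularization to make features more robust and discriminative.
- Shows, through systematic experiments and visualization, how representation misalignment causes modality bias and hurts performance.
- Builds a unified multimodal affective analysis framework that supports controlled comparison of different encoder combinations.
Methodology: The pipeline has four steps. First, a vision-language model (e.g., CLIP) converts images into structured textual descriptions, which aligns the modalities. Second, the original text and the VLM-generated description are concatenated at the token level and fed to a single pretrained language encoder for text-centric fusion. Third, a Top-K token selection mechanism keeps sentiment-relevant tokens, and batch-level uniformity regularization spreads out the feature space. Finally, the model is evaluated on classification or regression tasks across several benchmark datasets.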
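The two ingredients of the hybrid strategy, Top-K token selection and a batch-level uniformity term, can be sketched as follows. The source does not give the exact formulas, so this example assumes a common form of uniformity loss (the log of the mean Gaussian potential over pairs of features in a batch, as popularized in contrastive representation learning) and a plain score-based Top-K. The token scores and feature vectors are made up.

```python
import math

def top_k_tokens(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]

def uniformity_loss(features, t=2.0):
    """Log of the mean pairwise Gaussian potential; lower means more spread out.

    Assumed form, not confirmed by the source. Requires at least two features.
    """
    pairs = [(a, b) for i, a in enumerate(features) for b in features[i + 1:]]
    potentials = [math.exp(-t * sum((x - y) ** 2 for x, y in zip(a, b)))
                  for a, b in pairs]
    return math.log(sum(potentials) / len(potentials))

tokens = ["the", "photo", "looks", "gloomy", "and", "sad"]
scores = [0.1, 0.4, 0.2, 0.9, 0.05, 0.8]  # hypothetical relevance scores
selected = top_k_tokens(tokens, scores, k=3)

clustered = [(1.0, 0.0), (0.99, 0.14), (0.98, 0.2)]     # nearly collapsed batch
spread = [(1.0, 0.0), (-0.5, 0.866), (-0.5, -0.866)]    # well dispersed batch

print(selected)
print(uniformity_loss(clustered), uniformity_loss(spread))
```

Adding a term like this to the task loss penalizes batches whose features collapse together, which is one way to obtain the more dispersed feature space the paper describes.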
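Key Results:
- The method consistently outperforms strong unimodal and multimodal baselines on several multimodal sentiment and emotion benchmarks, reaching state-of-the-art performance.
- Explicit representation alignment matters more for performance than fusion complexity.
- The hybrid learning strategy suppresses noise from VLM-generated text and makes features more robust.
- Visualization confirms that representation misalignment biases the model toward the text modality.
Tech Stack:
- Vision-language model (VLM, e.g., CLIP)
- Pretrained language encoder (e.g., BERT)
- Top-K token selection
- Batch-level uniformity regularization
- Structured prompt engineering (four-stage reasoning: OCR understanding, visual scene analysis, cross-modal integration, high-level reasoning)
- Cross-entropy loss and contrastive learning objective
Strengths:
- Pinpoints the problem precisely by exposing representation misalignment as the key bottleneck.
- Simple and effective: unifying the modalities avoids complex cross-modal attention.
- Thorough experiments show a consistent advantage across several benchmarks.
- In-depth analysis and visualization improve interpretability.
Limitations:
- Depends on the quality of VLM generation; poor descriptions may introduce noise.
- The text-centric strategy may lose information specific to vision, such as subtle facial expressions.
- Generalization to speech or video modalities is not verified.
- VLM inference adds compute overhead.
Relevance To Keywords:
- Representation learning: The core contribution is improving multimodal representations through explicit alignment, which falls within representation learning.
- Multimodal large models: Modality conversion relies on a VLM (e.g., CLIP), so the work is closely tied to multimodal large model techniques.
- World models: Indirectly relevant; the structured reasoning draws on common sense and background knowledge, but no world model is built.
- Reinforcement learning: Not directly relevant; the paper does not use reinforcement learning.
- Post-training: Not directly relevant; the paper concentrates on forward inference and fine-tuning and does not cover post-training strategies.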
摘要翻译
在生成帧之间保持 3D 空间一致性的视频世界模型,通常依赖于在 RGB 空间中构建的显式点云记忆。这种设计不仅计算成本高,需要重复渲染和 VAE 编码,而且本质上是有损的,因为经由像素空间的往返过程会丢弃所学潜在表示中的丰富特征。在本文中,我们为视频世界模型引入了潜在空间记忆(latent spatial memory),这是一种持久的 3D 缓存,直接在扩散潜在空间中存储场景信息,从而避免了像素空间重建。在此基础上,我们提出了 Mirage,一种基于潜在空间的空间记忆框架,该框架通过深度引导的反投影将潜在标记提升到 3D 来构建记忆,并通过直接的潜在空间扭曲合成新视角来查询记忆。这种统一表述既消除了像素空间重建的信息损失,也消除了重复编码和渲染的计算负担。实验表明,相较于显式 3D 基线,潜在空间记忆使端到端视频生成最高提速 10.57 倍,内存占用降低 55 倍。利用扩散模型的几何先验,Mirage 在 WorldScore 上达到了最先进的性能,并在 RealEstate10K 上展现出强大的重建质量。
Abstract
Video world models that maintain 3D spatial consistency across generated frames typically rely on explicit point cloud memory constructed in RGB space. This design is both computationally expensive, requiring repeated rendering and VAE encoding, and inherently lossy, as the round trip through pixel space discards rich features of the learned latent representation. In this paper, we introduce latent spatial memory for video world models, a persistent 3D cache that stores scene information directly in the diffusion latent space, avoiding pixel-space reconstruction. Building on this, we propose Mirage, a latent-space spatial memory framework that constructs the memory by lifting latent tokens into 3D via depth-guided back-projection and queries it by synthesizing novel views through direct latent-space warping. This unified formulation eliminates both the information loss of pixel-space reconstruction and the computational burden of repeated encoding and rendering. Experiments show that latent spatial memory achieves up to 10.57× faster end-to-end video generation and 55× reduction in memory footprint relative to explicit 3D baselines. Leveraging the geometric prior of the diffusion model, Mirage attains state-of-the-art performance on WorldScore and strong reconstruction quality on RealEstate10K.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 7.0/10 | 10.5 |
| Tokenizer | 1.5 | 4.0/10 | 6.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 10.0/10 | 15.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 6.0/10 | 9.0 |
评分理由: The paper centers on Video World Models (10/10) and proposes a unified latent memory formulation (Unify Models, 7/10). It leverages diffusion latent space (Tokenizer, 4/10; Visual Encoder, 5/10) without proposing new architectures. World Models are closely linked to model-based RL (6/10) via the WorldScore metric. MLLM (3/10) and MultiModal (5/10) are less relevant as the work focuses on video generation without explicit language or multi-modal fusion.
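The score column in the table above appears to be the weight multiplied by the relevance rating. The sketch below reproduces that arithmetic for this paper; the function name and the idea of summing the scores into a total are assumptions for illustration, since the report does not state how (or whether) it aggregates them.

```python
# Relevance ratings (0-10) copied from the table above.
# The weight is 1.5 for every keyword in this report.
WEIGHT = 1.5
relevance = {
    "Unify Models": 7.0,
    "Tokenizer": 4.0,
    "Visual Encoder": 5.0,
    "World Models": 10.0,
    "MLLM": 3.0,
    "MultiModal": 5.0,
    "model-based RL": 6.0,
}


def keyword_scores(ratings, weight=WEIGHT):
    """Per-keyword score = weight * relevance (hypothetical helper)."""
    return {name: weight * value for name, value in ratings.items()}


scores = keyword_scores(relevance)
# Summing is an assumption; the report only lists per-keyword scores.
total = sum(scores.values())
```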
关键词
Latent Spatial Memory, Video World Models, Diffusion Latent Space, 3D Spatial Consistency, Mirage Framework, Depth-guided Back-projection, WorldScore
深度分析
Chinese Title: 视频世界模型的潜在空间记忆
Summary: 本文提出潜在空间记忆(Latent Spatial Memory),用于视频世界模型,以解决传统RGB点云记忆在像素空间往返计算昂贵且信息损失的问题。作者构建Mirage框架,将场景信息直接存储在扩散模型的潜在空间中,通过深度引导反投影将潜在令牌提升到3D,并在读取时通过潜在空间扭曲合成新视角,避免了像素级渲染和VAE编码。实验表明,该方法在WorldScore和RealEstate10K上达到最先进性能,同时实现高达10.57倍的端到端加速和55倍的GPU内存减少。Mirage通过初始化-读取-更新的循环生成几何一致的视频,并利用扩散模型的几何先验保持多视角一致性。
Innovations:
- 提出潜在空间记忆,完全在扩散模型的潜在空间中存储和查询3D场景信息,避免像素空间往返。
- 设计Mirage框架,包含深度引导反投影、潜在分辨率下的遮挡感知读取和动态物体排除的迭代更新机制。
- 实现端到端视频生成速度提升10.57倍,GPU内存减少55倍,同时保持或超越RGB点云记忆的生成质量。
- 将潜在空间记忆与ControlNet侧分支结合,实现高效的条件注入,无需额外编码步骤。
Methodology: 论文采用以下技术路线:1)使用预训练VAE将输入帧编码为潜在张量;2)通过深度估计和反投影将潜在令牌映射到3D世界坐标,构建潜在缓存;3)在生成每个视频块时,通过潜在分辨率投影从缓存中读取目标视角的潜在特征,并利用ControlNet侧分支注入扩散骨干;4)生成帧后,重新估计深度、排除动态物体,并将新帧的潜在特征反投影更新缓存;5)重复该循环以生成长序列。整个过程在潜在空间操作,仅在块级更新时涉及像素空间。
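Steps 2 and 3 of the methodology (lifting latent tokens into 3D, then reading them back from a target view) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Mirage's implementation: a pinhole camera with intrinsics already scaled to the latent grid, depth given at latent resolution, nearest-cell splatting with a Z-buffer, and no dynamic-object handling. The names `backproject` and `zbuffer_read` are hypothetical.

```python
import math


def apply_pose(pose, p):
    """Apply a 3x4 row-major [R | t] transform to a 3D point."""
    x, y, z = p
    return tuple(r[0] * x + r[1] * y + r[2] * z + r[3] for r in pose)


def backproject(depth, K, cam_to_world):
    """Lift every latent-grid cell to a world-space point (row-major order).

    depth: H x W nested list, depth per latent cell (assumed given).
    K: (fx, fy, cx, cy) intrinsics expressed in latent-grid units.
    """
    fx, fy, cx, cy = K
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            # Ray through the cell center, scaled by the cell's depth.
            cam_point = ((u + 0.5 - cx) / fx * d, (v + 0.5 - cy) / fy * d, d)
            points.append(apply_pose(cam_to_world, cam_point))
    return points


def zbuffer_read(points, feats, K, world_to_cam, H, W):
    """Project cached points into a target view; the nearest point wins each cell."""
    fx, fy, cx, cy = K
    zbuf = [[float("inf")] * W for _ in range(H)]
    out = [[None] * W for _ in range(H)]  # None marks cells with no cached evidence
    for p, f in zip(points, feats):
        x, y, z = apply_pose(world_to_cam, p)
        if z <= 0:
            continue  # behind the target camera
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if 0 <= u < W and 0 <= v < H and z < zbuf[v][u]:
            zbuf[v][u] = z
            out[v][u] = f
    return out, zbuf
```

In this toy version the cached features are opaque values; in a latent memory they would be the channel vectors of the VAE latent, and the cells left as `None` are the regions the generator would need to fill in.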
Key Results:
- 在WorldScore基准上达到最先进的世界生成性能。
- 在RealEstate10K上取得有竞争力的新视角合成质量。
- 端到端视频生成速度比RGB点云基线快10.57倍。
- GPU内存使用量降低55倍。
- 生成视频在几何一致性上显著优于无记忆或RGB记忆方法。
Tech Stack:
- VAE编码器/解码器
- 扩散模型(Diffusion Transformer)
- ControlNet侧分支
- 深度估计(单目深度预测)
- 深度引导反投影(Depth-guided back-projection)
- 潜在空间扭曲(Latent-space warping)
- Z-buffer投影(遮挡处理)
- 动态物体分割(如天空、移动物体排除)
- WorldScore评估指标
- RealEstate10K数据集
Strengths:
- 创新性地将3D记忆完全置于潜在空间,避免了像素空间的信息损失和计算瓶颈。
- 显著提升生成效率,使长序列几何一致视频生成变得实用。
- 在多个基准上取得SOTA或竞争性结果,验证了方法的有效性。
- 设计简洁,与现有扩散模型架构兼容,易于集成。
Limitations:
- 依赖单目深度估计的质量,深度误差可能导致缓存中的几何不一致。
- 动态物体排除策略可能不完美,快速运动或复杂场景下仍有残留伪影。
- 方法主要针对相机轨迹可控的场景,对于自由动作或交互式世界模型可能需进一步扩展。
- 潜在空间记忆的容量受限于GPU内存,极长序列可能需要压缩或遗忘机制。
Relevance To Keywords:
- 世界模型:论文直接针对视频世界模型中的3D一致性挑战,提出潜在空间记忆作为核心组件。
- 表征学习:潜在空间记忆本质上是学习到的场景表征,在扩散潜在空间中存储和查询。
- 多模态大模型的理解和生成一体化:方法将生成(视频合成)与理解(深度、几何)结合,但未涉及多模态输入。
- 模型基强化学习:论文未直接涉及RL,但世界模型是模型基RL的基础,本文方法可为其提供更一致的场景表示。
- 后训练:方法基于预训练扩散模型,通过添加记忆模块进行后训练或微调,但论文未详细说明训练过程。
摘要翻译
空间推理是多模态大语言模型(MLLMs)感知并操作于物理世界的基础能力。然而,现有基准主要依赖被动评估(例如静态 VQA)或模拟器特定管道,未能评估通用的交互式空间理解能力。我们提出 SpatialWorld,这是一个专为评估多模态智能体在复杂真实任务中的交互式空间理解而设计的统一基准。SpatialWorld 在共享的模拟器无关协议下整合了八个异构仿真后端,涵盖了跨多样领域(例如家庭例行、旅行、社会协作)的 760 个人工标注任务。智能体必须在仅视觉部分可观测的条件下完成任务,主动收集第一人称视觉证据,并通过统一、基于文本的动作接口表达决策,该接口原生适用于 MLLMs。为确保评估的可靠性,每个任务均包含人工验证的初始状态、参考轨迹以及终端状态验证器。对 15 种先进智能体的评估表明,稳健的空间任务求解仍然具有挑战性:最强模型 GPT-5 的平均任务成功率(TSR)仅为 17.4%,而领先的开源模型 Qwen-3.5 达到 14.1%。进一步分析揭示了任务成功率与执行效率之间的明显不匹配,以及显著的领域特定性能差异。这些在主动探索和长程规划方面的瓶颈,使 SpatialWorld 成为未来空间智能体严格的测试平台。
Abstract
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 7.0/10 | 10.5 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 5.0/10 | 7.5 |
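评分理由: The paper's core contribution is the SpatialWorld benchmark for evaluating the spatial reasoning of multimodal agents. MLLM and MultiModal are highly relevant because such models are evaluated directly; World Models is moderately relevant because the tasks involve understanding a spatial world; Unify Models is moderately relevant because what is unified is the benchmark framework rather than a model architecture; model-based RL is moderately relevant because the tasks involve planning and exploration; Tokenizer and Visual Encoder have low relevance because no component-level innovation is involved. None of the specified experts appear in the author list, so no bonus is applied.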
关键词
SpatialWorld, Multimodal Agents, Interactive Spatial Reasoning, Real-World Tasks, Benchmarking, Vision-Language, Long-horizon Planning
深度分析
Chinese Title: SpatialWorld:多模态智能体在真实世界任务中交互式空间推理的基准测试
Summary: 该论文提出了SpatialWorld,一个用于评估多模态大语言模型(MLLM)在复杂真实世界任务中交互式空间理解能力的统一基准。现有基准主要依赖静态VQA或特定模拟器,无法评估通用的交互式空间推理。SpatialWorld整合了八个异构仿真后端,包含760个人工标注的任务,涵盖家庭日常、旅行、社交协作等多个领域。智能体需在仅视觉部分可观测条件下,主动收集自我中心视觉证据,并通过统一的文本动作接口表达决策。实验评估了15个先进智能体,最强模型GPT-5的平均任务成功率仅为17.4%,领先开源模型Qwen-3.5达到14.1%。分析揭示了任务成功与执行效率之间的明显不匹配,以及领域间性能差异。该基准为未来空间智能体的主动探索和长程规划能力提供了严格的测试平台。
Innovations:
- 提出了统一的多仿真后端基准架构,将八个异构3D环境封装为共享的观察-动作接口,实现模拟器无关的评估。
- 设计了纯视觉部分可观测的交互式任务范式,智能体仅依赖自我中心RGB图像进行主动探索和决策。
- 引入了执行验证机制,通过终端状态验证器客观判断任务成功,而非依赖预定义轨迹。
- 构建了涵盖多领域(家庭、工作、娱乐、旅行、社交、数字游戏)的760个高质量人工标注任务。
- 系统性地解耦了视觉语义与纯几何推理,通过抽象3D游戏与日常具身场景对比分析空间认知瓶颈。
Methodology: 论文采用模块化架构,包括环境接口、观察接口、动作接口、智能体模块和验证接口。任务形式化为部分可观测马尔可夫决策过程(POMDP),智能体接收自然语言目标和自我中心RGB图像,输出高层文本动作。通过统一接口将异构仿真后端(如AI2-THOR、CARLA、VirtualHome等)标准化,实现跨平台评估。数据构建包括环境收集、人工标注指令、定义成功条件,并通过自动执行验证和人工交叉验证校准。评估使用任务成功率(TSR)和执行效率指标。
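The evaluation protocol described here (a goal, an observation per step, a text action, and a terminal-state verifier that decides success) can be sketched as a small loop. Everything in this example is a toy stand-in: the 1-D corridor environment, the action strings, the `Task` fields, and the string observation that replaces an egocentric RGB frame are assumptions for illustration, not SpatialWorld's actual interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    goal: str                        # natural-language instruction
    initial_state: int               # validated start state (toy: a position)
    verifier: Callable[[int], bool]  # terminal-state check
    max_steps: int = 10


def step(state: int, action: str) -> int:
    """Toy transition function for a 1-D corridor."""
    if action == "move_forward":
        return state + 1
    if action == "move_backward":
        return state - 1
    return state  # unknown actions are no-ops


def run_episode(task: Task, agent: Callable[[str, str], str]) -> bool:
    state = task.initial_state
    for _ in range(task.max_steps):
        observation = f"position={state}"  # stand-in for an egocentric frame
        action = agent(task.goal, observation)
        if action == "done":
            break
        state = step(state, action)
    # Success is judged on the final state only, not on the path taken.
    return task.verifier(state)


def task_success_rate(tasks, agent) -> float:
    results = [run_episode(task, agent) for task in tasks]
    return sum(results) / len(results)


def forward_agent(goal: str, observation: str) -> str:
    """Trivial scripted agent: walk forward until position 3 is observed."""
    return "done" if observation == "position=3" else "move_forward"
```

Judging only the terminal state means an agent can succeed with a different trajectory than the reference one, which is why success and execution efficiency can diverge.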
Key Results:
- 最强模型GPT-5的平均任务成功率为17.4%,领先开源模型Qwen-3.5-397B-A17B为14.1%。
- 任务成功与执行效率之间存在明显不匹配,高成功率模型常伴随冗余探索。
- 模型排名在不同领域差异显著:GPT-5在家庭日常、旅行和社交协作中领先;Qwen-3.5在工作和物理娱乐中表现突出;Gemini-3.1-Pro在数字游戏中最高。
- 当前智能体在主动探索和长程规划方面存在显著瓶颈。
Tech Stack:
- 部分可观测马尔可夫决策过程(POMDP)
- 多模态大语言模型(MLLM)
- 仿真后端:AI2-THOR、ProcTHOR、VirtualHome、CARLA、EmbodiedCity等
- 自我中心RGB观察
- 文本动作接口
- 任务成功率(TSR)
- 执行效率指标
Strengths:
- 统一跨平台架构,消除了模拟器特定偏差,提供更通用的空间推理评估。
- 纯视觉部分可观测设计更贴近真实世界交互条件。
- 任务覆盖多个领域,具有多样性和现实相关性。
- 执行验证机制确保评估的客观性和可重复性。
- 系统性地揭示了当前MLLM在空间推理中的多个独立瓶颈。
Limitations:
- 基准任务数量相对有限(760个),可能不足以覆盖所有空间推理场景。
- 依赖仿真环境,与现实世界的物理复杂性和感知噪声仍有差距。
- 动作接口为高层文本形式,可能无法完全反映低层控制能力。
- 未深入探讨模型在失败任务中的具体错误类型(如感知错误、规划错误、执行错误)。
Relevance To Keywords:
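- Multimodal large models: the paper directly evaluates MLLMs on spatial reasoning tasks, so it is highly relevant to native multimodal large models.
- World Models: spatial reasoning is one of the core capabilities of world models; the benchmark tests how well models understand and predict a 3D world.
- Representation Learning: the tasks require models to form spatial representations from visual observations, which ties in with representation learning.
- Reinforcement Learning: the tasks are formalized as POMDPs, which connects to (model-based) RL, but the paper evaluates existing agents rather than training policies through interaction.
- Post-training: the evaluated agents are advanced post-trained models, so the results reflect the spatial reasoning that current post-training yields; the effect of post-training itself is not isolated.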
摘要翻译
近年来,统一多模态模型(UMMs)应运而生,旨在单一框架内同时支持理解与生成。掌握动态、多轮交错图像 - 文本对话是 UMMs 在现实应用中的关键任务。然而,现有基准未能评估这一重要任务,因为它们通常局限于单轮或静态设置,且通常忽略多轮交互中的暴露偏差。为弥合这一差距,我们提出 IMUG-Bench,这是一个针对 UMMs 多轮交错图像 - 文本对话的综合基准,旨在联合评估其理解与生成能力。我们的 IMUG-Bench 包含三类:静态空间(Static Spatial)、时序因果(Temporal Causal)和混合(Hybrid),涵盖 3,113 个样本和 12,034 次交互轮次。它还包含动态理解问题,从而支持更能反映现实世界多轮交互场景的评估。在 IMUG-Bench 上进行的大规模实验系统性地评估了主流开源和闭源 UMMs,揭示了它们的能力边界和失效模式,并揭示了多轮交互中生成侧显著的暴露偏差。我们进一步探索了若干测试时缩放策略,包括思维链(Chain-of-Thought)、自我验证(Self-Verification)和最佳 N 采样(Best-of-N Sampling),这些策略有效提高了生成准确性并减轻了生成任务中的暴露偏差。这些发现为增强未来 UMMs 的鲁棒性和多轮交互能力提供了重要见解。
Abstract
In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 10.0/10 | 15.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
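评分理由: The paper centers on evaluating unified multimodal models, so Unify Models, MultiModal, and MLLM are highly relevant. Visual Encoder is implicit in image-text processing but is not a core innovation, giving it moderate relevance. Tokenizer, World Models, and model-based RL are not mentioned in the abstract and have very low relevance. None of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.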
关键词
Unified Multimodal Models, Interleaved Understanding and Generation, IMUG-Bench, Multi-turn Dialogue, Exposure Bias, Test-time Scaling, Image-text Generation
深度分析
Chinese Title: IMUG-Bench:面向交错理解与生成的统一多模态模型基准测试
Summary: 本文提出IMUG-Bench,一个专门用于评估统一多模态模型(UMMs)在多轮交错图像-文本对话中理解与生成能力的综合基准。现有基准多局限于单轮或静态设置,且忽视多轮交互中的暴露偏差。IMUG-Bench包含静态空间、时间因果和混合三类任务,覆盖19个领域、97个子任务,共3,113个样本和12,034轮交互。通过大规模实验,系统评估了主流开源和闭源UMMs,揭示了模型在多轮生成任务中显著的暴露偏差,即生成质量随对话轮次增加而下降。进一步探索了测试时扩展策略(如思维链、自我验证和最佳N采样),这些策略能有效提升生成准确性并缓解暴露偏差。研究为增强UMMs的鲁棒性和多轮交互能力提供了新见解。
Innovations:
- 构建了首个面向多轮交错图像-文本对话的统一多模态模型基准IMUG-Bench,包含3,113个样本和12,034轮交互,覆盖静态空间、时间因果和混合三类任务。
- 系统揭示了主流UMMs在多轮生成任务中存在的暴露偏差现象,即生成质量随对话轮次增加而显著下降。
- 探索并验证了测试时扩展策略(思维链、自我验证、最佳N采样)在缓解暴露偏差和提升生成准确性方面的有效性。
- 提出了动态理解问题设计,使评估更贴近真实多轮交互场景,并支持基于模型历史响应的答案更新。
Methodology: 论文采用模板设计、LLM/VLM填充和人工验证的流水线构建基准。首先手动设计多轮交互模板,包括问题、答案和评估点;然后调用LLM生成关键词序列填充模板,并收集多源图像;最后使用LLM或VLM完成模板填充,并经过两人人工验证。评估时,对理解任务使用静态和动态多选题,对生成任务使用图像生成指令,并采用任务特定的评分策略。实验部分系统评估了多个开源和闭源UMMs,并引入思维链、自我验证和最佳N采样等测试时扩展策略进行优化。
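Of the test-time scaling strategies mentioned, Best-of-N Sampling is the simplest to illustrate: draw N candidates and keep the one a scorer prefers. The sketch below is generic and uses toy stand-ins; `generate` and `score` are hypothetical callables (in the benchmark setting they would be an image-generating UMM and some verifier or judge, neither of which is specified here).

```python
import random


def best_of_n(generate, score, prompt, n, seed=0):
    """Sample n candidates for one prompt and return the highest-scoring one."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=score)


# Toy stand-ins: a "generation" is an integer, and the scorer prefers
# values close to a target. Both are illustrative assumptions.
TARGET = 42


def toy_generate(prompt, rng):
    return rng.randint(0, 100)


def toy_score(candidate):
    return -abs(candidate - TARGET)


single = best_of_n(toy_generate, toy_score, "draw the next frame", n=1)
best = best_of_n(toy_generate, toy_score, "draw the next frame", n=8)
```

Because the first of the eight samples is the same draw as the single sample (same seed), the selected candidate can never score worse than the single-sample result; the price is N times the generation cost, which matches the compute concern raised in the limitations.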
Key Results:
- 当前主流UMMs在多轮交错生成任务中表现不足,生成质量随交互轮次增加而显著下降,暴露偏差明显。
- 即使顶级闭源模型在多轮生成任务上仍有较大改进空间。
- 引入测试时扩展策略(思维链、自我验证、最佳N采样)能有效提升生成准确性并缓解暴露偏差。
- IMUG-Bench提供了细粒度的评估,揭示了模型在理解与生成能力上的不平衡。
Tech Stack:
- LLM(大语言模型)用于关键词序列生成和模板填充
- VLM(视觉语言模型)用于图像理解和模板填充
- 思维链(Chain-of-Thought)
- 自我验证(Self-Verification)
- 最佳N采样(Best-of-N Sampling)
- 多轮交互模板设计
- 静态多选题(Static-MCQ)和动态多选题(Dynamic-MCQ)
- 图像生成指令(I)
Strengths:
- 基准设计全面,覆盖静态空间、时间因果和混合三类任务,共19个领域和97个子任务,样本量大。
- 首次系统评估多轮交错交互中的暴露偏差,并提供了有效的缓解策略。
- 采用动态理解问题设计,更贴近真实应用场景,评估更具生态效度。
- 实验规模大,系统评估了多个主流开源和闭源模型,结果具有代表性。
Limitations:
- 基准构建依赖人工模板设计和LLM/VLM填充,可能存在模板偏差和生成偏差。
- 评估主要针对图像-文本模态,未涉及视频、音频等其他模态。
- 测试时扩展策略的实验仅在部分模型上验证,泛化性有待进一步研究。
- 暴露偏差的缓解策略虽然有效,但计算成本较高,可能限制实际部署。
Relevance To Keywords:
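- Unify Models: the paper directly studies the understanding and generation abilities of unified multimodal models (UMMs) in multi-turn interaction, so it is highly relevant.
- World Models: Temporal Causal tasks involve reasoning about implicit natural laws and commonsense causality, which relates to how world models capture the dynamics of the world.
- Representation Learning: the benchmark evaluates joint image-text representation ability, which touches on cross-modal representation learning.
- Model-Based RL: test-time scaling strategies such as Chain-of-Thought and Self-Verification can be viewed as planning and verification at inference time, loosely related to the planning idea in model-based RL.
- Native multimodal large models: the UMMs evaluated are native multimodal large models, studied here for their multi-turn interaction ability.
- Unified understanding and generation in multimodal large models: the core goal of the benchmark is to evaluate unified understanding-and-generation models in multi-turn interaction.
- Reinforcement Learning: Best-of-N Sampling can be viewed as a simple RL-style search; more sophisticated RL methods could be combined with it in future work to optimize generation.
- Post-training: test-time scaling is an inference-time optimization applied to already trained models rather than a post-training method; the paper explores its use in multi-turn generation.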
摘要翻译
视频世界模型在生成可控视觉体验方面取得了快速进展,但大多数模型仍仅从单一观察者的视角模拟世界。将此类模型扩展到多智能体场景提出了一个核心挑战:若每个智能体的未来状态独立生成,重叠视图可能会实例化同一场景的不同版本,从而导致智能体之间的对象、布局和外观不一致。传统的相机条件化方法虽能控制单个轨迹,但并未显式耦合那些应在共享场景几何下保持一致的视图生成过程。我们提出 Prisma-World,一种相机可控的多智能体世界模型,它将多智能体生成建模为一种联合几何感知去噪过程,以确保跨视图一致性。Prisma-World 在一个全注意力序列中处理所有智能体的视频,采用多智能体 RoPE(旋转位置编码)设计以区分智能体身份,同时保持同步的时间坐标,并将相对相机几何信息注入注意力机制,从而引导重叠视点趋向于共享的场景证据。为进一步增强多视图一致性及全局空间感知能力,我们在框架中引入了重叠衰减课程学习训练范式,并结合了小地图条件化的结构引导。为了促进多智能体模型的训练与评估,我们引入了 PrismaDataset,这是一个基于虚幻引擎 5 (UE5) 构建的大规模数据集,包含跨多样场景的全景采集、具有灵活智能体数量和复杂相机轨迹的可组合多智能体视图组,以及用于一致性训练与评估的精确相机/动作标注。实验表明,单个 Prisma-World 模型即可生成高保真度的多智能体视频,具备灵活的智能体数量、相机可控性、改进的跨视图一致性,以及在小地图引导下的空间锚定能力。
Abstract
Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views may instantiate different versions of the same scene, leading to inconsistent objects, layouts, and appearances across agents. Conventional camera conditioning controls individual trajectories, but it does not explicitly couple the generation of views that should agree under shared scene geometry. We introduce Prisma-World, a camera-controllable multi-agent world model that formulates multi-agent generation as a joint geometry-aware denoising process for cross-view consistency. Prisma-World processes all agent videos within one full-attention sequence, uses a multi-agent RoPE design to distinguish agent identities while preserving synchronized temporal coordinates, and injects relative camera geometry into attention to bias overlapping viewpoints toward shared scene evidence. To further strengthen multi-view consistency and enhance global spatial perception, we augment our framework with an overlap-decaying curriculum training paradigm alongside minimap-conditioned structural guidance. To facilitate the training and evaluation of multi-agent models, we introduce PrismaDataset, a large-scale UE5 dataset with panoramic acquisition across diverse scenes, composable multi-agent view groups with flexible agent counts and complex camera trajectories, and precise camera/action annotations for consistency training and evaluation. Experiments show that a single Prisma-World model can generate high-fidelity multi-agent videos with flexible agent numbers, camera controllability, improved cross-view consistency, and spatial grounding under minimap guidance.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 8.0/10 | 12.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 10.0/10 | 15.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 6.0/10 | 9.0 |
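评分理由: World Models is the core topic (10/10); both the title and the abstract focus on it. Unify Models is highly relevant (8/10) because the paper unifies multi-agent cross-view generation in one process. model-based RL is moderately relevant (6/10): camera controllability and minimap-based spatial guidance are applicable to RL scenarios. MultiModal is moderately relevant (5/10), covering the processing of multi-view video inputs. Visual Encoder is implicit but not a focus of the innovation (5/10). Tokenizer and MLLM are not explicitly mentioned in the abstract (2-3/10).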
关键词
Multi-Agent Video World Model, Camera-Controllable, Cross-View Consistency, Geometry-Aware Denoising, Full-Attention Sequence, Minimap Guidance, PrismaDataset
深度分析
Chinese Title: Prisma-World:相机可控的多智能体视频世界模型
Summary: 本文提出Prisma-World,一个相机可控的多智能体视频世界模型,旨在解决现有单智能体视频生成模型无法保证多视角一致性的问题。传统方法独立生成每个智能体的视频会导致场景不一致,而简单拼接视频序列又无法区分智能体身份和同步帧。Prisma-World将所有智能体视频作为联合去噪过程处理,采用全注意力机制、多智能体RoPE设计(区分智能体身份并保持时间同步)以及相对相机几何注入(促进重叠视角共享场景证据)。此外,引入重叠衰减课程训练和迷你地图条件引导以增强全局空间感知。为支持训练和评估,构建了基于UE5的大规模数据集PrismaDataset,包含全景采集、可组合的多智能体视图组和精确标注。实验表明,单个Prisma-World模型能生成高保真多智能体视频,支持灵活智能体数量、相机控制、跨视角一致性提升和迷你地图空间定位。
Innovations:
- 提出多智能体视频世界模型框架,将多智能体生成视为联合几何感知去噪过程,实现跨视角一致性。
- 设计多智能体RoPE位置编码,在保持时间同步的同时区分不同智能体身份。
- 将相对相机几何变换注入注意力机制,使重叠视图共享场景证据,非重叠视图保留视角特有内容。
- 引入重叠衰减课程训练策略,从强重叠视图逐步过渡到复杂轨迹组合,增强多视角一致性。
- 构建大规模UE5数据集PrismaDataset,支持全景采集、灵活智能体数量组合和精确相机/动作标注,并配套多智能体一致性评估基准。
Methodology: Prisma-World基于扩散模型架构,将所有智能体视频帧拼接成一个完整序列,使用全注意力机制进行联合去噪。通过多智能体RoPE(旋转位置编码)为每个智能体分配独立的位置偏移,同时保持相同时间步的帧在时间维度上对齐。在注意力计算中注入相对相机变换(如相对旋转和平移),使注意力权重偏向于几何上一致的重叠区域。训练时采用重叠衰减课程:先使用高重叠度的多视图数据训练,逐渐降低重叠度以增强泛化能力。可选地,将全局迷你地图(俯视图)作为结构条件输入,通过交叉注意力为每个智能体提供局部空间引导。数据集构建:在UE5中录制全景视频,投影为多个同步的透视视图,并记录相机位姿、动作和迷你地图。
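The idea of a multi-agent RoPE (distinguish agents while keeping frames of the same time step aligned) can be illustrated with a two-axis rotary embedding. This is only one plausible construction and an assumption on our part: half of the channels are rotated by the time index and the other half by the agent index. The source describes per-agent position offsets without giving the exact layout, so the split used here is hypothetical.

```python
import math


def rotate_pairs(x, pos, base=10000.0):
    """Standard RoPE: rotate consecutive channel pairs by pos * frequency."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def multi_agent_rope(x, t, agent):
    """Rotate the first half of x by time index t and the second half by agent id.

    len(x) must be a multiple of 4. The half/half channel split is an
    illustrative assumption, not the paper's stated design.
    """
    half = len(x) // 2
    return rotate_pairs(x[:half], t) + rotate_pairs(x[half:], agent)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.1 * i for i in range(8)]
k = [1.0, -0.5, 0.25, 2.0, -1.0, 0.5, 0.75, -0.25]

# Attention logits depend only on the relative (time, agent) offset.
logit_a = dot(multi_agent_rope(q, 3, 1), multi_agent_rope(k, 1, 0))
logit_b = dot(multi_agent_rope(q, 5, 2), multi_agent_rope(k, 3, 1))
```

With this layout, two frames captured at the same time step by different agents receive identical rotations on the time channels, so they stay synchronized in time while remaining distinguishable through the agent channels.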
Key Results:
- 单个Prisma-World模型能生成高保真多智能体视频,支持灵活指定智能体数量(如2-4个)。
- 在跨视角一致性指标上显著优于单智能体独立生成和简单拼接基线。
- 迷你地图条件引导能提升空间定位和场景级连贯性。
- 重叠衰减课程训练有效增强了模型对复杂轨迹组合的鲁棒性。
- 在PrismaDataset上训练后,模型能泛化到未见过的场景和智能体配置。
Tech Stack:
- 扩散模型(Diffusion Model)
- 全注意力机制(Full Attention)
- 旋转位置编码(RoPE, Rotary Position Embedding)
- 相对相机几何变换注入(Relative Camera Transformation Injection)
- 重叠衰减课程训练(Overlap-Decaying Curriculum Training)
- 迷你地图条件引导(Minimap Conditioning)
- UE5(Unreal Engine 5)全景录制与投影
- Plücker坐标或相对相机编码(参考PRoPE等)
Strengths:
- 首次系统性地解决多智能体视频生成中的跨视角一致性问题,具有理论创新性。
- 模型设计灵活,支持可变智能体数量,无需重新训练。
- 结合相机几何信息,使一致性具有几何可解释性。
- 构建了大规模高质量数据集和评估基准,推动该领域研究。
- 迷你地图条件提供了显式空间结构先验,增强实用性。
Limitations:
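- The model is trained and evaluated only on a synthetic UE5 dataset; generalization to real scenes is unknown.
- Full attention over all agents' frames has a cost that grows quadratically with the total number of tokens, which itself grows with the number of agents and frames; this may limit large-scale use.
- Relative camera geometry injection relies on accurate camera pose annotations, which are hard to obtain in practice.
- Interactions between agents (e.g., object manipulation, collisions) are not addressed; only visual consistency is considered.
- Minimap conditioning is optional, but generation quality may be sensitive to minimap accuracy.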
Relevance To Keywords:
- 世界模型:论文直接提出多智能体视频世界模型,旨在模拟共享环境中的多视角动态。
- 表征学习:通过联合去噪和几何约束学习场景的共享表征,促进跨视角一致性。
- 模型基强化学习:虽然论文未直接涉及RL,但多智能体世界模型可作为环境模拟器用于多智能体RL规划。
- 原生多模态大模型:模型处理视频、相机参数、动作、迷你地图等多模态输入,但未采用大语言模型架构。
- 多模态大模型的理解和生成一体化:论文聚焦生成,但一致性建模隐含对场景几何的理解。
- 后训练:论文中的课程训练可视为一种后训练策略,但未涉及强化学习后训练。
摘要翻译
世界模型(World Models)近年来在普及度和能力上均快速增长,作为一种生成机器人训练数据或模拟真实世界环境的数据高效工具,许多研究提议将其集成到机器人学习流程中。尽管具有极高的实用性,但在本工作中,我们发现世界模型为机器人学习供应链引入了一个独特且隐蔽有效的数据投毒入口点,这可能导致部署不安全或被破坏的机器人策略,尽管训练数据看似是安全的地面真值数据。与传统的直接将危险轨迹植入销售或上传数据集的数据投毒技术不同,我们的新颖攻击方法将恶意提示或破坏性的状态转移动力学注入到表面上看似安全的遥操作数据集中,这些攻击仅在数据通过世界模型作为输入时才会被激活。这可能导致生成合成的危险机器人训练轨迹,进而导致不安全或被破坏的机器人策略。我们证明了我们的攻击对最先进的动作条件世界模型(Action-conditioned World Models)和文本条件世界模型(Text-conditioned World Models)均有效,展示了在下游深度强化学习(DRL)策略上的完整端到端后门,以及视觉 - 语言 - 动作(VLA)设置下的概念验证。总体而言,这些发现促使人们需要研究更安全的世界模型,并重新评估其在机器人学习供应链中的地位。
Abstract
World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot training data or simulating real world environments, with many works proposing their integration into the robot learning pipeline. While highly practical, in this work we demonstrate that world models introduce a uniquely stealthy and effective data poisoning entry point into the robot learning supply chain that can result in the deployment of unsafe or otherwise compromised robotic policies despite training on seemingly safe ground truth training data. In contrast to traditional data poisoning techniques which directly implant dangerous trajectories into sold or uploaded datasets, our novel attack methods inject malicious prompts or compromising transition dynamics into visibly safe teleoperated datasets which are only activated once fed through a world model as input. This can result in the generation of synthetic, dangerous robot training trajectories and subsequently unsafe or compromised robot policies. We demonstrate the effectiveness of our attacks against both state of the art action conditioned and text conditioned world models, showing a full end-to-end backdoor on a downstream DRL policy and a proof-of-concept for the VLA setting. Overall these findings necessitate research into more secure world models and reevaluating their position within the robot learning supply chain.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 10.0/10 | 15.0 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 7.0/10 | 10.5 |
评分理由: The paper explicitly centers on World Models (10/10) in robot learning, linking to model-based RL (7/10) and multimodal contexts like VLA (MultiModal/MLLM at 6/10). Unify Models is tangential (4/10) as the focus is security, while Tokenizer and Visual Encoder are only implicit (2-3/10). No specified experts are authors.
关键词
World Models, Robot Learning, Data Poisoning, Backdoor Attack, VLA Setting, Text Conditioned, Synthetic Trajectories
深度分析
Chinese Title: 针对世界模型以破坏机器人学习流水线
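Summary: This paper studies security vulnerabilities that world models introduce into the robot learning pipeline and proposes two new data poisoning attacks: Visual Prompt Hijacking (VPH) against text-conditioned world models and Visual Transition Hijacking (VTH) against action-conditioned world models. The attacker injects nearly imperceptible visual perturbations into seemingly safe teleoperation data so that the world model generates dangerous synthetic training trajectories or has its transition dynamics manipulated, which plants a backdoor in the downstream robot policy. Experiments show that the attacks are effective against state-of-the-art text-conditioned and action-conditioned world models, achieve an end-to-end backdoor in a deep reinforcement learning policy, and provide a proof of concept in the vision-language-action (VLA) setting. The study exposes a vulnerability specific to world models and calls for more secure world models and a re-evaluation of their place in the robot learning supply chain.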
Innovations:
- 首次提出针对世界模型的攻击方法,利用视觉扰动间接操纵机器人学习流水线。
- 发现文本条件世界模型在分布外或欠指定提示下更易被操纵,拓展了可信AI研究的新方向。
- 实现了首个仅通过操纵世界模型输入就在深度强化学习策略中植入后门的攻击。
- 提出了两种具体攻击范式:视觉提示劫持(VPH)和视觉转换劫持(VTH),分别针对文本条件和动作条件世界模型。
- 攻击无需直接访问训练环境或数据集,通过隐蔽的视觉扰动绕过安全检测,具有高实用性和隐蔽性。
Methodology: 论文采用白盒攻击假设,攻击者具有对目标世界模型的完全访问权限(可查询和计算梯度)。攻击者将微小的视觉扰动(在LAB颜色空间的ℓp范数球内)嵌入到遥操作视频帧中,使得世界模型在接收到这些帧后产生异常输出。对于文本条件世界模型,VPH攻击通过扰动帧覆盖用户输入的文本提示,导致世界模型生成危险轨迹;对于动作条件世界模型,VTH攻击使世界模型在未选择预定动作时预测未来状态崩溃。攻击者通过优化对抗损失函数(如使策略在触发对象出现时转向恶意目标)来植入后门,同时保持正常状态下的行为不变。实验在多个世界模型(如Cosmos-Predict等)和下游策略(DRL、VLA)上进行验证。
Key Results:
- 文本条件世界模型在分布外或欠指定提示下表现出显著更高的操纵脆弱性。
- VPH攻击能成功覆盖用户提示,使世界模型生成包含危险行为的合成轨迹。
- VTH攻击能操纵动作条件世界模型的转换动力学,使策略在未选择预定动作时性能崩溃。
- 在深度强化学习策略中实现了端到端后门植入:触发对象出现时策略转向恶意行为,无触发时保持正常。
- 在VLA设置中提供了概念验证,表明攻击可扩展到更复杂的机器人学习范式。
Tech Stack:
- 世界模型:文本条件(如Cosmos-Predict)、动作条件(如UniSim, DayDreamer)
- 攻击方法:视觉提示劫持(VPH)、视觉转换劫持(VTH)
- 优化技术:对抗损失函数、梯度下降、ℓp范数约束(在LAB颜色空间)
- 下游策略:深度强化学习(DRL)、行为克隆(BC)、视觉-语言-动作模型(VLA)
- 数学框架:部分可观测马尔可夫决策过程(POMDP)、后门攻击定义(攻击成功与攻击隐蔽性)
- 评估指标:攻击成功率、策略价值函数
Strengths:
- 提出了新颖且实用的攻击向量,揭示了世界模型在机器人学习供应链中的关键安全漏洞。
- 攻击方法隐蔽性强,仅通过视觉扰动即可实现,无需直接修改数据集或访问训练环境。
- 实验覆盖了两种主流世界模型范式(文本条件和动作条件),并展示了端到端后门植入,验证了攻击的有效性。
- 研究具有前瞻性,为未来世界模型的安全设计和防御研究提供了重要基础。
- 论文结构清晰,问题定义严谨,攻击约束合理,符合实际威胁场景。
Limitations:
- 攻击假设白盒访问世界模型,实际中攻击者可能难以获得完全梯度信息。
- 仅考虑了视觉扰动在LAB空间内的ℓp范数约束,未探索其他隐蔽性更强的扰动形式。
- 实验主要针对特定世界模型和下游策略,泛化性需进一步验证。
- 未提出具体的防御机制或鲁棒性提升方法,仅指出了问题。
- 对VLA设置的攻击仅为概念验证,缺乏完整的端到端实验。
Relevance To Keywords:
- 世界模型:论文核心研究对象,分析了文本条件和动作条件世界模型的安全脆弱性。
- 机器人学习:攻击目标为机器人学习流水线,包括DRL和VLA策略。
- 安全与鲁棒性:论文聚焦于世界模型引入的安全威胁,呼吁更安全的模型设计。
- 视觉-语言-动作模型:作为下游策略之一,论文提供了针对VLA的初步攻击验证。
- 后门攻击:论文提出的攻击方法本质上属于数据投毒和后门植入,与关键词高度相关。
摘要翻译
视觉推理需要整合分布在区域、属性和关系中的证据,这使得单链推理容易出现早期感知承诺和幻觉。我们提出 Visual Para-Thinker++,这是一种单策略多智能体框架,其中共享的多模态大语言模型(MLLM)策略被实例化为角色条件化的主智能体、工作智能体和总结智能体。主智能体采用固定分配模式分解任务;工作智能体在上下文隔离下并行推理;而总结智能体整合完整的工作智能体推理轨迹,而非对最终标签进行多数投票。共享策略通过多智能体能力注入和角色解耦多智能体优化进行训练,该方法将角色特定的奖励和优势分配给相应的 token 段,以减少协作角色之间的梯度冲突。原生推理引擎通过共享视觉前缀和 KV 缓存重用实现高效的多智能体推理过程。在 V*、CountBench、RefCOCO 系列及 HallusionBench 上,Visual Para-Thinker++ 始终优于单轨迹和推理时并行基线,尤其在幻觉敏感的视觉推理上表现尤为显著。
Abstract
Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 7.0/10 | 10.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
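评分理由: The core of the paper is a multi-agent framework for visual reasoning built on a shared MLLM policy, so MLLM and MultiModal are the most relevant. Visual Encoder is implicitly present through the shared visual prefix and is fairly relevant. Unify Models is reflected in the single-policy, multi-role design. Tokenizer appears only in optimization details (token segments) and has low relevance. World Models and model-based RL are not central in the abstract and have low relevance.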
关键词
Visual Reasoning, Multi-Agent Framework, MLLM Policy, Shared Visual Prefix, Role-Decoupled Optimization, Hallucination Reduction, Inference Efficiency
深度分析
Chinese Title: 视觉并行思考者++:一种用于视觉推理的单策略多智能体框架
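Summary: This paper proposes Visual Para-Thinker++, a single-policy multi-agent framework for visual reasoning. One shared multimodal large language model (MLLM) policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task using fixed allocation patterns; Worker Agents reason in parallel under context isolation; the Summary Agent reconciles the full reasoning traces of all Worker Agents rather than only majority-voting on final labels. The shared policy is trained with Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to reduce gradient conflict. At inference time, a shared visual prefix and KV cache reuse make multi-agent rollout efficient. On V*, CountBench, Pixmo, MMVP, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong results on hallucination-sensitive visual reasoning tasks.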
Innovations:
- 提出单策略多智能体视觉推理框架,通过角色条件化将共享MLLM策略实例化为主、工作、总结三种智能体,实现多角色协作。
- 引入多智能体能力注入(角色感知SFT)和角色解耦多智能体优化(角色特定奖励与优势),减少不同角色间的梯度冲突。
- 设计固定任务分配模式(基于块和扫描顺序),使主智能体无需开放规划即可有效分配子任务。
- 实现基于KV缓存复用的原生推理引擎,提升多智能体 rollout 效率。
- 在多个视觉推理基准上显著优于单轨迹和并行推理基线,尤其在幻觉敏感任务上表现优异。
Methodology: 本文采用两阶段训练方法。第一阶段:多智能体能力注入,使用更强的MLLM教师合成多智能体轨迹数据,通过角色感知SFT训练共享策略,并应用上下文隔离掩码确保工作智能体独立推理。第二阶段:角色解耦多智能体优化,基于DAPO进行组相对强化学习,为工作智能体设计局部奖励(基于跨工作多数投票启发式),为总结智能体设计全局结果奖励,通过角色特定优势组合(全局优势+局部优势)更新策略。推理时采用固定协议:主智能体→工作智能体(并行)→总结智能体,并复用KV缓存加速。
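The second training stage combines a group-relative global advantage with role-specific local advantages and assigns them to the matching token segments. A minimal sketch of that bookkeeping is given here under explicit assumptions: advantages are normalized within a rollout group in the usual group-relative way, every segment receives the rollout's global advantage, and Worker segments additionally receive their own local advantage. The exact combination rule used in the paper is not given in this summary, and the function names are hypothetical.

```python
import statistics


def group_relative(rewards, eps=1e-6):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def token_advantages(segments, global_adv):
    """Expand per-segment advantages to per-token advantages for one rollout.

    segments: list of (role, num_tokens, local_adv); local_adv is only
    used for "worker" segments (assumed combination rule: global + local).
    """
    per_token = []
    for role, num_tokens, local_adv in segments:
        adv = global_adv + local_adv if role == "worker" else global_adv
        per_token += [adv] * num_tokens
    return per_token


# Four rollouts for one prompt with binary outcome rewards.
global_advs = group_relative([1.0, 0.0, 1.0, 0.0])

# First rollout: one Main segment, two Worker segments, one Summary segment.
rollout = [("main", 2, 0.0), ("worker", 3, 0.5), ("worker", 3, -0.5), ("summary", 1, 0.0)]
advs = token_advantages(rollout, global_adv=1.0)
```

Keeping the local term confined to each Worker's own tokens is what lets a Worker be credited or penalized for its own trace without pushing the same gradient signal onto the Main or Summary tokens.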
Key Results:
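- On V*, CountBench, Pixmo, MMVP, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory baselines (e.g., greedy decoding, long CoT) and parallel inference baselines (e.g., self-consistency, multi-agent debate).
- Gains are especially pronounced on hallucination-sensitive visual reasoning tasks.
- KV cache reuse improves inference efficiency.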
Tech Stack:
- 多模态大语言模型(MLLM)作为共享策略
- 角色条件化(角色token嵌入)
- SFT(监督微调)
- DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), a GRPO-style group-relative RL algorithm
- KV缓存复用(vLLM-based引擎)
- 上下文隔离掩码
- 跨工作多数投票启发式奖励
- 固定任务分配模式(块分配、扫描顺序分配)
Strengths:
- 单策略多智能体设计避免了使用多个独立模型的高计算成本,同时实现角色分工。
- 角色解耦优化有效缓解了不同角色目标冲突导致的梯度问题。
- 固定分配模式简化了主智能体的规划负担,同时保证了任务分解的多样性。
- 推理时KV缓存复用显著提升了效率,使多智能体协作实用化。
- 在多个视觉推理基准上取得一致且显著的性能提升,尤其在幻觉敏感任务上。
Limitations:
- 依赖更强的MLLM教师合成训练数据,可能引入教师偏差。
- 固定分配模式(块/扫描顺序)可能无法覆盖所有视觉推理任务类型,灵活性有限。
- 工作智能体数量固定为四个,对于某些任务可能过多或过少。
- 跨工作多数投票启发式奖励可能不够精确,尤其在任务答案非唯一时。
- 未探讨在更复杂视觉推理(如视频、3D场景)上的适用性。
Relevance To Keywords:
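- Unify Models: the framework uses a single MLLM policy to unify different roles, reflecting the idea of model unification.
- World Models: not directly related, although visual reasoning can be viewed as understanding the visual world.
- Representation Learning: role token embeddings, context isolation, and role differentiation under shared weights implicitly involve learning role-specific representations.
- Model-Based RL: not directly related; the policy is optimized with DAPO-based reinforcement learning, and no learned environment model is described.
- Native multimodal large models: built on an MLLM that processes image and text inputs.
- Unified understanding and generation in multimodal large models: the framework involves both visual understanding (reasoning) and text generation (answers).
- Reinforcement Learning: the core training method is role-decoupled reinforcement learning.
- Post-training: the two-stage training (SFT + RL) follows the post-training paradigm.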
摘要翻译
联合嵌入预测架构(JEPAs)展现出了有前景的世界建模能力,能够通过交叉熵方法(CEM)等优化动作轨迹,从而在隐空间中进行规划。然而,这些方法计算开销过大,且在长时程规划中效果不佳。此外,这些方法通常需要目标状态的显式图像,而在现实任务中这并不总是可行的。在本文中,我们通过提出前向 - 前向 - JEPA(FF-JEPA)来解决这些局限性,这是一种利用两个前向动力学模型的分层方法。除了标准的动作条件化前向模型外,我们还引入了一种无动作潜在规划器,该规划器根据当前状态预测下一个子目标。这种方法消除了对目标图像的需求,并通过将复杂轨迹分解为一系列易于求解的短期优化问题,实现了长时程规划。在 PushT 上的初步结果表明,FF-JEPA 成功克服了平坦世界模型的长时程崩溃问题,突显了该方法作为无目标规划方向的前景。
Abstract
Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by optimizing action trajectories using methods like the Cross-Entropy Method (CEM). These methods are, however, too computationally expensive and ineffective for long-horizon planning. Furthermore, these methods typically require an explicit image of the goal state, which is not always possible in real-world tasks. In this work, we tackle these limitations by proposing Forward-Forward-JEPA (FF-JEPA), a hierarchical approach leveraging two forward dynamics models. Alongside a standard action-conditioned forward model, we introduce an action-free latent planner that predicts the next subgoal given the current state. This approach removes the need for goal images and enables long-horizon planning by decomposing complex trajectories into a sequence of tractable, short-term optimization problems. Preliminary results on PushT demonstrate that FF-JEPA successfully overcomes flat world models' long-horizon collapse, highlighting this approach as a promising direction for goal-free planning.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 7.0/10 | 10.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 6.0/10 | 9.0 |
| World Models | 1.5 | 10.0/10 | 15.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 9.0/10 | 13.5 |
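评分理由: The paper focuses on long-horizon planning within the world model framework and uses forward models to optimize action trajectories, so it is highly relevant to World Models (10) and model-based RL (9). The method combines a standard forward model with a latent planner, which gives a degree of model-level unification (Unify Models, 7), and operating in a visual latent space implies a visual encoder (Visual Encoder, 6). The paper does not mention discrete tokenizers (Tokenizer, 1), large language models (MLLM, 1), or multimodal inputs (MultiModal, 3). None of the specified experts (Yang Shi and others) appear in the author list.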
关键词
World Models, Latent Planners, Long-Horizon Planning, Forward Models, Joint Embedding Predictive Architectures, Goal-free Planning, Latent Space
深度分析
Chinese Title: FF-JEPA:基于潜在规划器的世界模型中的长时域规划
Summary: 本文针对现有联合嵌入预测架构(JEPA)世界模型在长时域规划中计算成本高、需要显式目标图像等问题,提出了一种层次化方法FF-JEPA。该方法在预训练的世界模型潜在空间中引入一个无动作的潜在规划器(latent planner),该规划器根据当前状态预测下一个子目标,从而将复杂的长轨迹分解为一系列短时域优化子问题,无需目标图像。作者实验了两种规划器架构:确定性Transformer和扩散DiT。在PushT任务上的实验表明,FF-JEPA成功克服了平坦世界模型的长时域崩溃问题,在75步规划中达到91.80%的成功率,而基线LeWM仅3.52%;在随机初始化场景下也达到82.42%的成功率,展示了该方法在无目标图像规划中的潜力。
Innovations:
- 提出Forward-Forward JEPA(FF-JEPA)层次化框架,在潜在空间中引入无动作的潜在规划器,实现长时域规划分解。
- 无需显式目标图像,规划器仅基于当前潜在状态预测子目标,适用于实际任务中目标不可知场景。
- 将世界模型重新解释为逆动力学模块,通过采样优化(CEM)在想象轨迹上提取动作序列,统一预测与控制。
- 实验了两种潜在规划器架构(确定性Transformer和扩散DiT),并验证了扩散规划器在长时域和随机初始化下的优越性能。
Methodology: 基于预训练的LeWM世界模型(含编码器E和前向预测器P),冻结编码器,在潜在空间训练一个无动作的潜在规划器G。训练数据来自成功演示的潜在表示,子目标通过步长H下采样获得。确定性规划器G_Det使用与P相同架构的Transformer,最小化预测子目标与真实子目标的MSE;扩散规划器G_DM使用DiT骨干,训练去噪分数匹配目标。推理时,G每H步预测下一个子目标,然后世界模型P使用CEM在H步内优化动作序列达到该子目标。实验在PushT任务上进行,比较短时域(25步)、长时域(75步)和随机初始化三种场景,评估成功率。
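The inference loop described here (the planner proposes the next subgoal, then CEM searches for a short action sequence that reaches it under the forward model) can be sketched on a toy problem. All components are stand-ins chosen for illustration: the state is a single number instead of a latent vector, `dynamics` replaces the action-conditioned predictor P, and `planner` replaces the learned action-free planner G with a hand-written rule that happens to head toward 10.0.

```python
import random
import statistics


def dynamics(state, action):
    """Toy stand-in for the forward predictor P: the action shifts the state."""
    return state + action


def planner(state):
    """Toy stand-in for the planner G: the next subgoal lies 2.0 ahead, capped at 10.0."""
    return min(state + 2.0, 10.0)


def final_state(state, actions):
    for a in actions:
        state = dynamics(state, a)
    return state


def cem(state, subgoal, horizon, rng, iters=5, pop=128, n_elite=8):
    """Cross-Entropy Method over action sequences of length `horizon`."""
    mu = [0.0] * horizon
    sigma = [1.0] * horizon
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mu, sigma)] for _ in range(pop)]
        # Cost: squared distance between the imagined end state and the subgoal.
        samples.sort(key=lambda seq: (final_state(state, seq) - subgoal) ** 2)
        elites = samples[:n_elite]
        # Refit the sampling distribution to the elite sequences.
        mu = [statistics.fmean(col) for col in zip(*elites)]
        sigma = [statistics.pstdev(col) + 1e-3 for col in zip(*elites)]
    return mu


def hierarchical_rollout(start, n_subgoals, horizon, seed=0):
    rng = random.Random(seed)
    state = start
    for _ in range(n_subgoals):
        subgoal = planner(state)  # no goal image is needed
        state = final_state(state, cem(state, subgoal, horizon, rng))
    return state


end = hierarchical_rollout(start=0.0, n_subgoals=8, horizon=4)
```

Each CEM call only has to cover a distance of 2.0 in four steps, which is the point of the decomposition: the search problem stays short no matter how far away the overall task ends.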
Key Results:
- 平坦LeWM在长时域(75步)中成功率仅3.52%,随机初始化下为0.00%,而FF-JEPA(DM)分别达到91.80%和82.42%。
- FF-JEPA(DM)在短时域(25步)中达到96.09%,优于层次化DINO-WM的89.0%。
- 确定性规划器在短时域中表现略差(76.95%),但在长时域和随机初始化中与扩散规划器相当。
- 成功率随规划预算增加而稳定,剩余失败主要源于子目标预测误差而非规划步数不足。
Tech Stack:
- LeWM(LeWorldModel)JEPA世界模型
- Cross-Entropy Method (CEM) 采样优化
- Transformer架构(用于确定性规划器)
- DiT(Diffusion Transformer)骨干(用于扩散规划器)
- 去噪分数匹配(Denoising Score Matching)训练目标
- Sliding-window context (W_G = 3 or 1)
- 稳定预训练库(stable-pretraining)和稳定世界模型库(stableworldmodel)
Strengths:
- 有效解决了长时域规划中计算成本高和误差累积的问题。
- 无需目标图像,提升了实际应用中的灵活性。
- 层次化分解使得世界模型可以专注于短时域优化,降低了CEM的搜索难度。
- 扩散规划器具有较好的鲁棒性和泛化能力,在随机初始化下仍保持高成功率。
- 潜在规划器可基于无标签数据训练,降低了对动作标注的依赖。
Limitations:
- 实验仅在PushT单一任务上进行,泛化性有待验证。
- 确定性规划器在短时域中表现不佳,可能受限于上下文窗口未充分利用。
- 子目标预测误差仍是失败的主要原因,规划器精度有待提升。
- 与DINO-WM等基线并非完全公平比较(评估协议略有不同)。
- 未讨论规划器训练对演示数据质量和数量的敏感性。
Relevance To Keywords:
- 世界模型(World Models):论文核心是改进世界模型的长时域规划能力,直接相关。
- 表征学习(Representation Learning):利用JEPA世界模型的潜在空间进行规划,涉及表征学习。
- 基于模型的强化学习(Model-Based RL):通过世界模型进行规划,属于模型基强化学习范畴。
- 后训练(Post-training):潜在规划器是在预训练世界模型基础上训练的,属于后训练阶段。
- 原生多模态大模型/多模态大模型的理解和生成一体化:论文未直接涉及多模态大模型,但JEPA架构与多模态表征学习有潜在联系。
- Unify Models:论文尝试统一前向模型和逆动力学模型,具有统一建模思想。
摘要翻译
多模态大语言模型(MLLMs)通常继承为单模态文本建模设计的深层、对称的 Transformer 骨干网络,并对图像令牌和文本令牌一致地应用相同的计算。这种设计忽视了一个关键的模态不对称性:图像令牌和文本令牌在信息密度、冗余度以及所需的推理深度上存在显著差异。通过对 LLaVA-1.5 的逐层分析,我们发现视觉令牌倾向于在中间层趋于饱和。具体而言,文本到图像注意力从第 0 层的 0.68 下降至第 4 层的 0.07,并在第 18 层后稳定在 0.04 左右,而文本令牌则继续从深层语义处理中获益。这些发现表明架构对称性与深度异步模态演化之间存在不匹配,导致冗余视觉计算,并在深层任务特定适应过程中可能导致感知表征的漂移。基于此,我们提出双路径视觉令牌路由(DPVR),这是一种用于高效多模态大语言模型的模态不对称路由框架。其核心实例化方案 DPVR-LF(晚期融合),在饱和点将视觉令牌路由至一层可训练的侧分支,执行十三层仅文本的前向传播(跳过深层堆栈中的图像位置),并仅在最后一层重新融合视觉与文本流。仅需约 3% 的可训练参数,DPVR-LF 便在标准基准测试上保持了具有竞争力的多模态性能,同时减少了深层 Transformer 堆栈中的视觉计算。研究结果挑战了视觉令牌必须遍历所有深层语言模型层的传统假设,并表明单个晚期融合层足以维持 LLaVA 风格多模态大语言模型中的强感知能力。
Abstract
Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 7.0/10 | 10.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 10.0/10 | 15.0 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: 论文核心聚焦于多模态大语言模型(MLLM)的架构优化,针对视觉令牌饱和问题提出双路径路由方案,因此与 MLLM 和多模态高度相关。论文涉及视觉令牌处理与模态融合,与统一模型和视觉编码器中度相关。论文未涉及分词器设计、世界模型构建或强化学习,故相关度低。作者列表中未包含指定的五位专家。
关键词
Multimodal Large Language Models, Vision Token Routing, Late-Layer Fusion, Visual Saturation, Modality Asymmetry, Efficient MLLMs, Dual-Path
深度分析
Chinese Title: 晚期融合足矣:视觉饱和条件下多模态大语言模型的双路径视觉令牌路由
Summary: 本文针对多模态大语言模型(MLLMs)中视觉与文本令牌在深度Transformer骨干中对称处理导致的计算冗余问题,通过层间分析发现视觉令牌在中层即出现饱和现象(文本-图像注意力从第0层的0.68骤降至第4层的0.07,之后稳定在0.04附近),而文本令牌仍需深层语义处理。基于此,提出双路径视觉令牌路由(DPVR)框架,其核心方法DPVR-LF在视觉饱和点(7B模型第18层)将视觉令牌路由至单层可训练侧分支,后续13层仅处理文本令牌,最后在最终层通过单层全注意力块进行图像-文本融合。仅需约3%可训练参数(7B模型202M),DPVR-LF在8个标准基准上匹配或超越全微调性能,同时节省25-30%前向FLOPs(实测延迟降低28%)。实验表明,视觉令牌无需遍历所有深层语言模型层,单层晚期融合足以维持强感知能力。
Innovations:
- 提出三视角分析(相邻层余弦相似度、文本-图像注意力质量、logit-lens转换)揭示视觉饱和现象,为模态非对称深度分配提供实证依据。
- 设计DPVR-LF方法:单层可训练视觉侧分支 + 十三层纯文本前向 + 单层最终融合,在保持梯度回传的同时大幅减少深层视觉计算。
- 验证视觉令牌在LLaVA-1.5中于第18层(7B)/第28层(13B)饱和,且融合层数在1层时即饱和(增加第二层仅提升0.18个百分点)。
- 实现与现有令牌缩减方法(如FastV、VTW)正交,可组合使用进一步加速。
Methodology: 首先对LLaVA-1.5-7B进行层间分析,从隐藏状态演化、注意力脱离、logit-lens转换三个角度量化视觉饱和。基于此设计DPVR-LF:在饱和点(第18层)将视觉令牌路由至单层LlamaDecoderLayer侧分支,后续13层跳过图像位置进行纯文本前向,最后在最终层通过全注意力块融合。训练时仅微调侧分支和融合层参数(约3%),其余参数冻结。在LLaVA-1.5-7B和13B上评估8个标准基准,进行分割点扫描、视觉深度消融、融合层数消融等实验。
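The layer layout described above can be sketched as a simple schedule. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the 32-layer depth is the LLaVA-1.5-7B decoder depth (18 shared + 13 text-only + 1 fusion), and the token counts (576 vision tokens, 64 text tokens) are assumed for the example only. The cost proxy counts (token, layer) pairs and ignores the quadratic attention term, so it is not the FLOPs figure reported in the paper.

```python
def dpvr_lf_schedule(num_layers=32, split_layer=18, text_only_layers=13):
    """Label each decoder layer with the tokens it processes.

    Defaults follow the 7B description: 18 shared layers, 13 text-only
    layers, and one final fusion layer (18 + 13 + 1 = 32).
    """
    if split_layer + text_only_layers + 1 != num_layers:
        raise ValueError("shared + text-only + fusion layers must cover the stack")
    schedule = []
    for layer in range(num_layers):
        if layer < split_layer:
            schedule.append("shared")      # vision and text tokens together
        elif layer < num_layers - 1:
            schedule.append("text_only")   # image positions are skipped
        else:
            schedule.append("fusion")      # full attention over both streams
    return schedule


def token_layer_units(schedule, n_vision, n_text, side_branch_layers=1):
    """Crude cost proxy: number of (token, layer) pairs that are computed."""
    units = side_branch_layers * n_vision  # one-layer side branch on vision tokens
    for kind in schedule:
        units += n_text if kind == "text_only" else n_vision + n_text
    return units


# Assumed example sizes: 576 vision tokens, 64 text tokens.
schedule = dpvr_lf_schedule()
baseline_units = len(schedule) * (576 + 64)   # every layer sees every token
routed_units = token_layer_units(schedule, n_vision=576, n_text=64)
print(schedule.count("shared"), schedule.count("text_only"), schedule.count("fusion"))
print(f"proxy saving: {1 - routed_units / baseline_units:.2%}")
```

The proxy only shows where the saving comes from: the thirteen text-only layers no longer touch image positions, while the side branch and the final fusion layer add back a small amount of visual computation.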
Key Results:
- DPVR-LF在7B模型上仅用3%可训练参数(202M)即匹配全微调性能,8个基准平均准确率0.66。
- 实测延迟降低28.0%(A800),理论FLOPs节省26.8%(填充率0.70)。
- 分割点扫描显示第18-24层为鲁棒平台(六基准均值波动<0.1个百分点)。
- 融合层数消融表明单层融合即饱和(第二层仅+0.18 pp)。
- 13B模型上视觉饱和点在第28层,DPVR-LF同样有效。
- 延迟节省完全来自预填充阶段(13B/5880 Ada上-23.4%),且在不同硬件(A800、Blackwell RTX PRO 6000、5880 Ada)上一致复现。
Tech Stack:
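- LLaVA-1.5 architecture (ViT vision encoder + MLP projector + Llama-family decoder)
- Adjacent-layer cosine similarity analysis
- Attention mass (text-to-image attention weights)
- Logit lens / tuned lens
- LoRA (low-rank adaptation) as a comparison baseline
- Theoretical FLOPs accounting and measured latency (A800, 5880 Ada, and others)
- LlamaDecoderLayer (one-layer side branch)
- PyTorch implementation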
Strengths:
- 基于实证观察(视觉饱和)提出简洁有效的非对称路由方案,无需改变投影器或视觉编码器。
- 仅需极少量可训练参数(3%),与全微调性能相当,显著降低计算成本。
- 设计正交于现有令牌缩减方法,可组合使用。
- 在多个硬件和模型规模上验证鲁棒性,分割点选择具有宽泛的鲁棒平台。
- 提供了机制性分析(融合层注意力集中度1.77倍基线,共享浅层堆栈与原始LLaVA位一致)。
Limitations:
- 仅针对LLaVA-1.5系列验证,未在其他MLLM架构(如Qwen-VL、MiniGPT-4)上测试。
- 视觉饱和点依赖于模型规模和训练数据,可能需针对不同模型重新扫描。
- 单层融合假设视觉表示在饱和后不再需要深层交互,但某些任务(如细粒度视觉推理)可能仍需更深融合。
- 未与令牌缩减方法(如FastV)联合实验,组合效果未知。
- 训练时冻结大部分参数,可能限制下游任务适应能力。
Relevance To Keywords:
- 原生多模态大模型:论文直接研究LLaVA类原生多模态大模型,提出非对称路由改进。
- 多模态大模型的理解和生成一体化:DPVR-LF保持理解(视觉感知)和生成(文本推理)的一体化架构,仅改变计算路径。
- 表征学习:通过层间分析揭示视觉表征的饱和特性,为表征学习提供新视角。
- 世界模型:论文未直接涉及世界模型,但视觉饱和现象可能影响世界模型中的感知模块设计。
- 强化学习/后训练:论文主要关注前向计算效率,未涉及强化学习或后训练,但DPVR-LF可作为一种高效微调策略用于后训练阶段。
- Unify Models:论文提出的非对称路由可视为统一模型中模态特异性处理的范例。
摘要翻译
视觉 - 语言 - 动作 (VLA) 模型已在多种机器人操作任务中展现出卓越的端到端性能。然而,这些策略无法保证避免与场景中无关任务的对象发生碰撞。现有的安全滤波器通过查询视觉 - 语言模型 (VLM) 来识别障碍物及其位置从而规避这一问题。然而,这在控制回路中运行速度过慢,且只能在回合初始化时调用,导致滤波器无法跟踪移动障碍物。我们发现,VLA 模型中少量的注意力头能够可靠地定位策略意图接近的目标对象。这些注意力头可在无需训练的安全框架中被利用:该框架在每个步骤从注意力头获取活跃目标,将场景其余部分视为障碍物,并将这些信息输入至控制屏障函数 (CBF) 滤波器。结合轻量级实时目标跟踪器,该方法实现了对非静态障碍物的碰撞避免。我们在 SafeLIBERO 基准上评估了我们的框架,并扩展了该基准以包含移动障碍物。在原始静态基准上,我们的方法与利用特权模拟器状态来识别目标的 oracle 表现相当,这模拟了在回合初始化时运行一次的基于 VLM 的识别步骤。在动态变体中,当 oracle 的初始化目标分配变得过时,我们的方法平均比其高出 43%,表现显著优于它。我们的发现表明,实时安全滤波所需的感知信号已存在于 VLA 策略之中,且可在无需额外训练或重型辅助模型的情况下被利用。
Abstract
Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this problem by querying a vision-language model (VLM) to identify obstacles and their locations. This, however, is too slow to run in the control loop and can only be invoked at episode initialization, leaving the filter unable to track moving obstacles. We discover that a small number of attention heads within a VLA model reliably localize the object the policy intends to approach. These heads can be exploited within a training-free safety framework that obtains the active target from the attention heads at every step, treats the remainder of the scene as obstacles, and feeds these into a Control Barrier Function (CBF) filter. Together with a lightweight real-time object tracker, this allows for collision avoidance for non-static obstacles. We evaluate our framework on SafeLIBERO, which we extend with moving obstacles. On the original static benchmark, our method performs comparably to an oracle that uses privileged simulator state to identify the target, emulating a VLM-based identification step run once at episode initialization. On the dynamic variant, where the oracle's init-time target assignment becomes stale, our method substantially outperforms it by 43%, on average. Our findings suggest that the perceptual signals needed for real-time safety filtering are already present within VLA policies and can be exploited without additional training or heavy auxiliary models.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 6.0/10 | 9.0 |
评分理由: The paper focuses on safety filtering for Vision-Language-Action (VLA) models. Unify Models (5.0) relates to VLA's unified nature but isn't the core focus. Tokenizer (2.0) and World Models (2.0) are largely irrelevant as they aren't discussed or central. Visual Encoder (4.0) is a component but not the contribution. MLLM (8.0) and MultiModal (8.0) are highly relevant as VLA is a multimodal large model variant. model-based RL (6.0) is relevant due to Control Barrier Functions used in control tasks. None of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list, so no bonus points were added.
关键词
Vision-Language-Action, Safety Filter, Attention Heads, Control Barrier Function, Collision Avoidance, Moving Obstacles, Training-free
深度分析
Chinese Title: 你的模型已经知道:面向视觉-语言-动作模型的注意力引导安全过滤器
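Summary: The paper addresses the lack of collision guarantees when Vision-Language-Action (VLA) models are used for robotic manipulation, and proposes KNOWS, a safety-filtering framework that needs no additional training. The authors find that a small number of attention heads in a VLA model reliably indicate the object the policy currently intends to approach. KNOWS extracts the attention maps of these heads, accumulates attention density over a sliding window to identify the target object at every step, and treats all remaining objects in the scene as obstacles. A lightweight real-time object tracker updates obstacle positions, and a Control Barrier Function (CBF) quadratic-program (QP) filter projects each candidate action onto the safe set. On the static SafeLIBERO benchmark, KNOWS performs comparably to an oracle that uses privileged simulator state; on the extended dynamic-obstacle variant, where the oracle's initialization-time target assignment goes stale, it outperforms the oracle by 43% on average. The method requires neither extra training nor heavy auxiliary models and runs as a real-time safety filter.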
Innovations:
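- Identifies a small set of attention heads in a VLA model that act as a reliable per-step indicator of the target object, with no extra training or supervision.
- Proposes KNOWS, a training-free safety framework that combines these attention heads, a lightweight object tracker, and a CBF-QP filter to avoid collisions in real time at control frequency.
- Extends the SafeLIBERO benchmark with moving obstacles to evaluate the method in dynamic scenes.
- Uses the spatial attention distribution of the selected heads, with sliding-window accumulation and attention-density normalization, to identify the target object robustly.
- Matches an oracle with privileged state information in static scenes and outperforms it by 43% on average in dynamic scenes.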
Methodology: 首先,使用冻结的VLA策略(如π0.5)生成候选动作,并通过隐藏状态钩子提取特定注意力头的注意力图。其次,在初始化时使用SAM分割模型获取每个物体的掩码,并拟合最小体积包围椭球(MVEE)表示物体;运行时通过HSV颜色直方图跟踪更新椭球中心。然后,将每个椭球投影到图像平面,计算注意力质量分配给各物体,并在滑动窗口内累积注意力密度,通过密度差距阈值确认目标物体,其余物体视为障碍物。最后,构建CBF-QP优化问题,将候选动作投影到安全集,确保末端执行器椭球与所有障碍物椭球分离。
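The target-identification step above (per-object attention mass, sliding-window accumulation, density-gap threshold) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the class and parameter names are hypothetical, the density is assumed to be attention mass multiplied by projected area raised to the power β (so β = -1 gives mass per unit area), and the default values of K and δ are placeholders, since the source only says they are set empirically.

```python
from collections import deque


class TargetSelector:
    """Accumulate per-object attention density over a sliding window of K frames."""

    def __init__(self, window=5, beta=-1.0, delta=0.2):
        self.frames = deque(maxlen=window)  # oldest frame drops out automatically
        self.beta = beta                    # area exponent; -1 gives mass per unit area
        self.delta = delta                  # required gap between best and runner-up

    def update(self, attention_mass, projected_area):
        """Both arguments map object name -> positive float for the current frame."""
        density = {
            name: attention_mass[name] * projected_area[name] ** self.beta
            for name in attention_mass
        }
        self.frames.append(density)
        totals = {}
        for frame in self.frames:
            for name, value in frame.items():
                totals[name] = totals.get(name, 0.0) + value
        norm = sum(totals.values())
        if norm == 0.0:
            return None
        ranked = sorted(((v / norm, n) for n, v in totals.items()), reverse=True)
        if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < self.delta:
            return None  # ambiguous: caller treats every object as an obstacle
        return ranked[0][1]


# Two objects with equal projected area (made-up values).
selector = TargetSelector(window=3, delta=0.2)
area = {"mug": 100.0, "bowl": 100.0}
first = selector.update({"mug": 0.5, "bowl": 0.5}, area)
second = selector.update({"mug": 0.9, "bowl": 0.1}, area)
print(first, second)
```

Returning `None` when the gap is below the threshold mirrors the conservative fallback noted in the limitations: when the attention signal is ambiguous, every object is treated as an obstacle.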
Key Results:
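- On the static SafeLIBERO benchmark, KNOWS performs comparably to an oracle that uses privileged state information.
- With moving obstacles, KNOWS outperforms the oracle by 43% on average.
- Head-selection experiments show that attention maps from specific layers and heads of the π0.5 policy reliably localize the target object.
- The method runs at control frequency; the extra compute comes mainly from the segmentation model's forward pass, and overall latency is reported as acceptable.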
Tech Stack:
- VLA模型:π0.5(扩散解码)
- 注意力头提取:通过隐藏状态钩子手动计算softmax(Q_act K_vis^T / sqrt(d))
- 实例分割:SAM(Segment Anything Model)
- 椭球拟合:最小体积包围椭球(MVEE)算法
- 目标跟踪:HSV颜色直方图 Bhattacharyya 距离匹配
- 安全过滤:控制障碍函数(CBF)二次规划(QP)求解器
- 滑动窗口累积:K帧窗口,注意力密度归一化(β=-1)
- 参数:K、β、δ通过经验确定
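To make the CBF-QP entry concrete, here is a heavily simplified sketch of a barrier-function filter. It is not the paper's formulation: it assumes a single spherical obstacle instead of ellipsoids, a single-integrator model (the action is a velocity), and only one constraint, in which case the QP reduces to a closed-form projection onto a half-space. The function name and the `alpha` gain are hypothetical.

```python
def cbf_filter(u_nom, position, obstacle_center, safe_radius, alpha=1.0):
    """Minimally modify u_nom so one barrier constraint holds.

    Barrier: h(x) = ||x - c||^2 - r^2, safe when h >= 0.
    Constraint: grad_h . u + alpha * h >= 0, with grad_h = 2 (x - c).
    With a single linear constraint a . u >= b, the QP
    min ||u - u_nom||^2 is solved by projecting u_nom onto the half-space.
    """
    diff = [p - c for p, c in zip(position, obstacle_center)]
    norm_sq = sum(d * d for d in diff)
    if norm_sq == 0.0:
        raise ValueError("position coincides with the obstacle center")
    h = norm_sq - safe_radius ** 2
    a = [2.0 * d for d in diff]
    b = -alpha * h
    slack = b - sum(ai * ui for ai, ui in zip(a, u_nom))
    if slack <= 0.0:
        return list(u_nom)  # nominal action already satisfies the constraint
    scale = slack / sum(ai * ai for ai in a)
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


# Moving straight at the obstacle is slowed down; moving away is left untouched.
toward = cbf_filter([-1.0, 0.0], [1.0, 0.0], [0.0, 0.0], safe_radius=0.5)
away = cbf_filter([1.0, 0.0], [1.0, 0.0], [0.0, 0.0], safe_radius=0.5)
print(toward, away)
```

With several obstacles the constraints no longer decouple, and a QP solver is needed, which is presumably why the source lists one in its stack.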
Strengths:
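- Needs no additional training or fine-tuning of the VLA model; it uses the model's internal attention signal directly.
- Runs in real time at control frequency, which makes it suitable for dynamic environments.
- Has lower compute overhead than existing safety filters, since it avoids repeated calls to a heavy VLM.
- Performs well in both static and dynamic scenes, with a clear advantage in the dynamic case.
- Is general in principle, but is demonstrated on π0.5; applying it to another VLA architecture requires repeating the head-selection step.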
Limitations:
- 注意力头选择依赖于特定VLA模型,可能不适用于所有VLA架构,需要预先进行头选择实验。
- 初始化时需要一次分割模型运行,且假设物体为刚性,对非刚性或变形物体可能不适用。
- 滑动窗口参数(K、β、δ)需针对具体场景调优,缺乏自适应机制。
- 当目标物体被严重遮挡或注意力信号模糊时,可能无法正确识别目标,导致保守地视所有物体为障碍物。
- 未考虑机器人自身与其他物体(如桌面)的碰撞,仅关注可操作物体。
Relevance To Keywords:
- 原生多模态大模型:VLA模型是原生多模态大模型在机器人领域的应用,本文直接研究其内部注意力机制。
- 表征学习:通过分析注意力头表征,发现其隐含目标定位能力,属于表征学习范畴。
- 世界模型:CBF安全过滤器可视为对世界状态(物体位置)的建模,但本文未显式构建世界模型。
- 模型-Based RL:本文方法属于推理时安全过滤,不涉及强化学习训练,但与模型预测控制(MPC)思想相关。
- 后训练:本文方法无需后训练,直接利用预训练模型,但注意力头选择可视为一种后训练分析。
摘要翻译
理解和推理抽象视觉内容仍然是当前多模态大语言模型(MLLMs)面临的挑战。本文探索了一种新颖的抽象数据类型,称为复杂视觉查询(CVQ),旨在探测符号推理与抽象推理,这是面向 MLLMs 的人类类似神经符号推理中关键却尚未被充分探索的维度。我们从三个视角进行了全面的研究:数据 × 范式 × 探索。具体来说,我们提出了一种可扩展的流水线,基于大规模多模态知识图谱合成 CVQ,通过一阶逻辑算子的系统组合生成了一个涵盖 14 种不同查询类型的多样化数据集。我们进一步引入了一种两阶段训练框架,逐步赋予 MLLMs 鲁棒的视觉推理能力。我们进行了广泛的实验,从多个维度严格评估 MLLMs,包括在 CVQ 上的推理性能,以及跨任务和跨场景的泛化能力。我们相信我们的工作为拓展 MLLMs 的推理前沿开启了新的视角和途径。
Abstract
Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which is a critical yet underexplored dimension of human-like neuro-symbolic reasoning for MLLMs. We present a comprehensive investigation from three perspectives: Data × Paradigm × Exploration. Specifically, we propose a scalable pipeline for synthesizing CVQs grounded in large-scale multi-modal knowledge graphs, generating a diverse dataset encompassing 14 distinct query types via systematic combinations of first-order logic operators. We further introduce a two-stage training framework that progressively equips MLLMs with robust visual reasoning capabilities. We conduct extensive experiments to rigorously evaluate MLLMs across multiple dimensions, including reasoning performance on CVQs, as well as cross-task and cross-scenario generalization. We believe our work opens new perspectives and avenues for advancing the reasoning frontiers of MLLMs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 10.0/10 | 15.0 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: 论文核心聚焦于多模态大语言模型(MLLM)的视觉推理能力,因此 MLLM 和 MultiModal 得分为 10。研究涉及视觉内容处理,Visual Encoder 有一定关联(4 分)。未涉及 Tokenizer 架构、World Models 或 Model-Based RL 的具体技术,故得分低(2 分)。虽提出统一查询范式,但未明确涉及模型架构统一,Unify Models 给中等分(5 分)。作者列表中不包含指定的专家,无额外加分。
关键词
Complex Visual Queries, Symbolic Reasoning, Abstractive Reasoning, Multi-modal Large Language Models, First-order Logic, Two-stage Training Framework, Visual Reasoning
深度分析
Chinese Title: 复杂视觉查询的符号与抽象推理
Summary: 本文针对多模态大语言模型(MLLMs)在抽象视觉内容理解与推理上的不足,提出了一种新型抽象数据类型——复杂视觉查询(CVQ),旨在探索符号与抽象推理能力。作者从数据、范式、探索三个维度展开系统研究:首先设计可扩展的CVQ数据合成流水线,基于大规模多模态知识图谱生成涵盖14种一阶逻辑查询类型的多样化数据集QUASAR;其次提出两阶段训练框架,逐步赋予MLLMs鲁棒的视觉推理能力;最后通过大量实验评估MLLMs在CVQ上的推理性能、跨任务及跨场景泛化能力。实验结果表明,所提方法能有效提升模型对复杂视觉查询的理解与推理,并展现出良好的泛化性。该工作为推进MLLMs的推理前沿提供了新视角和方向。
Innovations:
- 提出复杂视觉查询(CVQ)概念,将一阶逻辑组合查询可视化,开辟MLLMs符号与抽象推理新方向。
- 构建QUASAR数据集,涵盖14种标准查询类型,并附带细粒度思维链(CoT)标注,数据规模大且多样性高。
- 设计两阶段训练框架(CQU与CQR任务),逐步提升MLLMs对视觉查询的结构理解与多步逻辑推理能力。
- 提出基于CLIP的语义图像筛选策略,有效过滤多模态知识图谱中的噪声图像,提升数据质量。
- 系统评估MLLMs在CVQ上的推理性能及跨任务、跨场景泛化能力,揭示当前模型的不足与改进方向。
Methodology: 论文采用数据合成与两阶段训练相结合的技术路线。首先,从VisualSem、MKG-Y、FB15K-237三个多模态知识图谱中,通过模板引导的遍历采样获取查询实例;利用CLIP计算图像与实体描述的语义相似度,筛选高质量图像;使用GraphViz将查询渲染为有向计算图,并设计不同视觉编码表示一阶逻辑算子(投影、交、并、否)。然后,定义复杂查询理解(CQU)和复杂查询推理(CQR)两个任务,分别对应结构识别与答案预测;设计28种细粒度CoT模板(每查询类型×任务类型)用于监督训练。最后,采用两阶段训练框架:第一阶段训练CQU,第二阶段训练CQR,逐步提升模型能力。评估时使用准确率、F1等指标,并测试跨查询类型和跨场景的泛化能力。
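The four first-order logic operators named above (projection, intersection, union, negation) can be illustrated on a toy knowledge graph. This sketch only shows how query answers are composed from the operators; the graph, entity names, and function names are made up for the example, and it does not reproduce the QUASAR pipeline or its rendering of queries as images.

```python
# Toy knowledge graph: (head, relation) -> set of tails. All entries are illustrative.
KG = {
    ("paris", "capital_of"): {"france"},
    ("berlin", "capital_of"): {"germany"},
    ("france", "member_of"): {"eu", "nato"},
    ("germany", "member_of"): {"eu", "nato"},
    ("norway", "member_of"): {"nato"},
}
ENTITIES = {e for (h, _), tails in KG.items() for e in tails | {h}}


def project(entities, relation):
    """Projection (p): follow `relation` from every entity in the set."""
    out = set()
    for entity in entities:
        out |= KG.get((entity, relation), set())
    return out


def intersect(*sets):
    """Intersection (i): entities that satisfy every branch."""
    return set.intersection(*sets)


def union(*sets):
    """Union (u): entities that satisfy at least one branch."""
    return set.union(*sets)


def negate(entities):
    """Negation (n): complement with respect to all known entities."""
    return ENTITIES - entities


# Two-hop projection: organisations that the country whose capital is Paris belongs to.
two_hop = project(project({"paris"}, "capital_of"), "member_of")
# Intersection with a negated branch: France's organisations that Norway is not in.
with_negation = intersect(project({"france"}, "member_of"),
                          negate(project({"norway"}, "member_of")))
print(sorted(two_hop), sorted(with_negation))
```

Composing these four functions in different orders is what yields the different query types; negation is the expensive one conceptually, because its answer set depends on the whole entity vocabulary.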
Key Results:
- 在14种查询类型上,所提两阶段训练方法在CQU和CQR任务上均显著优于基线MLLMs(如GPT-4V、LLaVA等)。
- 模型在简单查询(如1p)上表现较好,但在涉及否定(negation)和复杂组合(如pin、pni)的查询上仍有较大提升空间。
- 跨任务泛化实验表明,经过CVQ训练的模型在统计图表、几何图形等抽象视觉推理任务上也有性能提升。
- 跨场景泛化实验显示,模型能适应不同知识图谱来源的CVQ,但泛化能力受训练数据分布影响。
Tech Stack:
- 多模态知识图谱:VisualSem、MKG-Y、FB15K-237
- 图像-文本匹配模型:CLIP(视觉编码器Mvis、文本编码器Mtxt)
- 图可视化工具:GraphViz
- 一阶逻辑算子:投影(p)、交(i)、并(u)、否(n)
- 思维链(Chain-of-Thought, CoT)模板
- 两阶段训练框架(CQU→CQR)
- 评估指标:准确率、F1分数
Strengths:
- 首次将复杂查询推理引入视觉模态,填补了MLLMs在抽象符号推理方面的研究空白。
- 数据集QUASAR规模大、类型全,且附带高质量CoT标注,便于监督训练和评估。
- 两阶段训练框架设计合理,逐步提升模型能力,实验验证有效。
- 系统评估了泛化能力,包括跨查询类型和跨场景,具有实际参考价值。
- 语义图像筛选策略有效提升了数据质量,可推广至其他多模态数据集构建。
Limitations:
- 数据集依赖现有知识图谱,覆盖的实体和关系有限,可能影响模型对真实世界复杂查询的泛化。
- 当前模型在涉及否定和复杂组合的查询上性能仍较低,推理能力有待进一步提升。
- 两阶段训练框架需要大量人工设计的CoT模板,扩展至更多查询类型时成本较高。
- 实验仅评估了部分MLLMs,未涵盖最新的大型模型(如GPT-4o、Gemini等)。
- 视觉查询的渲染方式固定,可能无法完全模拟真实场景中抽象视觉信息的多样性。
Relevance To Keywords:
- Unify Models, World Models, Representation Learning, Model-Based RL: 论文聚焦于多模态大模型(MLLMs)的推理能力,属于统一模型和世界模型的研究范畴;通过视觉查询的符号推理,涉及表征学习(CLIP特征提取)和结构化推理,与模型基强化学习中的推理规划有间接关联。
- 原生多模态大模型,多模态大模型的理解和生成一体化: 论文研究MLLMs对抽象视觉查询的理解与推理,属于多模态理解任务,与理解和生成一体化方向相关(但未涉及生成)。
- 表征学习: 使用CLIP进行语义图像筛选,属于表征学习技术。
- 世界模型: 复杂视觉查询可视为对知识图谱中世界知识的抽象表示,模型需构建内部世界模型进行推理。
- 强化学习,后训练: 论文采用两阶段训练(先CQU后CQR),可视为一种后训练策略,但未直接使用强化学习。
摘要翻译
视觉 - 语言 - 动作(VLA)模型在机器人操作中展现出强大的潜力,然而自然语言主要用于指定任务意图,而非在高频低层执行过程中被反复处理。基于这种分离,我们提出了一种受小脑 - 丘脑启发的视觉 - 动作模型(CT-VAM),用于高效的任务条件化视觉运动控制。CT-VAM 作为一种紧凑的本地执行策略,能够从双视角视觉观测、本体感觉和轻量级任务条件中预测动作块,从而可能实现一种实用的云边范式:高层语义推理由大型模型处理,而快速闭环控制则在本地硬件上运行。为了有效融合异构输入,CT-VAM 引入了 TARS(丘脑动作路由流),这是一种流分离的条件注意力解码器,独立路由动作、视觉和任务流,防止密集的感官 tokens 淹没紧凑的任务相关条件。仅使用 6800 万参数,CT-VAM 在 LIBERO 上的成功率即可与显著更大的 VLA 模型相媲美,同时降低了推理延迟。结合用于异步块执行的流一致修补(flow-consistent inpainting),CT-VAM 支持高频控制,并在资源受限的机器人平台上展示了稳健的真实世界部署能力。
Abstract
Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 6.0/10 | 9.0 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 6.0/10 | 9.0 |
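评分理由: The paper focuses on vision-action control and stresses a cloud-edge split rather than model unification, so Unify Models scores low. Tokenizer is not a contribution of the paper; tokens appear only in the description of the attention mechanism. Visual Encoder is a basic component. World Models and model-based RL receive moderate scores because of the control policy and flow-consistent inpainting. MLLM is related, but the work centers on action generation. MultiModal scores highest because the model fuses vision, proprioception, and a language-derived task condition.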
关键词
Vision-Action Model, Visuomotor Control, TARS, Cloud-Edge Paradigm, Flow-Consistent Inpainting, Efficient Control, Dual-view Observations
深度分析
Chinese Title: CT-VAM:一种小脑-丘脑启发的视觉-动作模型,用于高效视觉运动控制
Summary: 本文提出了一种名为CT-VAM的视觉运动控制模型,灵感来源于生物小脑-丘脑通路。该模型将语言指令转化为紧凑的任务条件,避免在高速闭环控制中重复处理原始语言,从而降低推理延迟和硬件需求。CT-VAM采用双视角视觉观测、本体感知和任务条件作为输入,通过流分离条件注意力模块TARS(丘脑动作路由流)分别处理动作、视觉、本体感知和任务流,防止密集的视觉令牌淹没紧凑的任务条件。模型仅含68M参数,在LIBERO基准测试中达到了与更大规模VLA模型相当的成功率,同时推理延迟更低。此外,引入流一致修补(FCI)实现异步块执行,支持高频控制,并在资源受限的Jetson Orin NX平台上实现了完全机载部署。
Innovations:
- 提出将语言语义接地与高频视觉反馈控制分离的框架,语言仅用于生成紧凑任务条件,避免执行中重复处理。
- 设计TARS(丘脑动作路由流)模块,通过流分离条件注意力机制独立路由动作、视觉、本体感知和任务流,防止密集视觉令牌主导任务相关条件。
- 引入流一致修补(FCI)技术,实现异步块执行,在重叠推理与执行的同时保持动作连续性。
- 在仅68M参数下达到与大规模VLA模型相当的LIBERO成功率,并支持在Jetson Orin NX上高效机载部署。
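Methodology: The paper uses an imitation-learning framework that predicts action chunks from dual-view visual observations, proprioceptive history, and a task identifier. The model has three parts: a dual-view encoder that produces a layer-wise visual memory; the TARS decoder, which fuses the action, visual, proprioceptive, and task streams through stream-separated attention; and flow-consistent inpainting (FCI), which enables asynchronous chunk execution. Training uses rectified flow for continuous action generation, and the action queries are initialized from the register tokens of the pretrained visual backbone.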
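For readers unfamiliar with rectified flow, the sketch below shows the two pieces it relies on: the straight-line training pair (interpolated point and constant velocity target) and Euler sampling of the learned velocity field. It is a generic illustration rather than CT-VAM's code; the function names are hypothetical, and the "velocity network" is replaced by a closed-form stand-in that points at one fixed target action.

```python
def training_pair(x0, x1, t):
    """Rectified-flow training pair: point on the straight path and its velocity target."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def euler_sample(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for a trained network: straight-line velocity toward `target`."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.3, -1.2]            # made-up starting noise sample
target_action = [0.5, 0.1]     # made-up action the toy field points at
x_half, v_half = training_pair(noise, target_action, t=0.5)
sampled = euler_sample(make_toy_velocity(target_action), noise, num_steps=10)
print(x_half, v_half, sampled)
```

Because the paths are straight, few Euler steps are needed, which is one reason this family of generators suits low-latency control.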
Key Results:
- CT-VAM在LIBERO基准测试中达到与大规模VLA模型(如RT-2)相当的成功率,但参数量仅68M。
- 推理延迟显著低于同等性能的VLA模型,支持高频闭环控制。
- 在真实机器人平台上成功部署,实现完全机载运行,无需依赖云端推理。
- 流一致修补(FCI)有效保证了异步块执行时的动作连续性。
Tech Stack:
- Transformer架构
- 整流流(Rectified Flow)
- 流分离条件注意力(Stream-Separated Conditional Attention)
- 动作块预测(Action Chunking)
- 预训练视觉骨干(如ViT)
- 注册令牌初始化(Register Token Initialization)
- Jetson Orin NX边缘计算平台
Strengths:
- 创新性地将语言接地与低层控制分离,兼顾语义泛化与实时性。
- TARS模块设计精巧,有效处理异构输入,防止信息淹没。
- 模型轻量(68M参数),适合资源受限的机器人平台。
- 在仿真和真实场景中均验证了有效性和鲁棒性。
Limitations:
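- The task condition is currently a simple one-hot encoding; richer semantic representations are not explored.
- There is no end-to-end integration with a high-level language model, so practical use needs a separate grounding module.
- Validation is limited to tabletop manipulation; generalization to complex long-horizon tasks remains to be evaluated.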
Relevance To Keywords:
- Unify Models: 模型将视觉、本体感知和任务条件统一为动作生成框架,但未涉及多模态大模型的一体化。
- World Models: 未显式构建世界模型,但整流流生成动作可视为隐式建模。
- Representation Learning: TARS通过流分离注意力学习异构输入的联合表示。
- Model-Based RL: 本文基于模仿学习,未涉及强化学习或模型预测控制。
- 原生多模态大模型: 未使用原生多模态大模型,但借鉴了视觉-语言预训练骨干。
- 多模态大模型的理解和生成一体化: 未实现理解与生成一体化,仅聚焦动作生成。
- 表征学习: 流分离注意力机制可视为一种表征学习策略。
- 世界模型: 不直接相关。
- 强化学习: 不直接相关。
- 后训练: 未涉及后训练技术。
摘要翻译
基础模型提供了一种有前景的途径,可将多模态生理信号压缩为人类健康的紧凑表示,在睡眠医学、心脏病学、神经病学及其他医疗领域具有广泛应用。现有模型通常使用掩码重构或对比目标进行训练。然而,掩码重构可能不太适合这些信号的随机特性,而对比方法依赖于正样本对定义,尽管生理信号的语义不变性尚不明确。在这项工作中,我们表明下一个词预测是一种简单且可扩展的替代方案。我们开发了 Hypnos,一种多模态睡眠基础模型,使用八种不同的传感模态(例如 EEG、ECG、呼吸信号)进行训练,数据源自超过 20,000 次夜间多导睡眠图记录。我们使用残差向量量化将每种模态量化为离散令牌流,然后训练一个大型自回归 RQ-Transformer,以并行方式联合预测所有模态的下一个令牌。训练完成后,Hypnos 可应用于来自任何支持模态子集的连续传感器数据流,生成用于下游任务的嵌入表示。在一系列基准测试中,Hypnos 显著优于现有基础模型。在睡眠阶段分类任务中,我们在保留测试集上达到了强监督基线的性能,同时标记数据用量仅为 1/100。Hypnos 甚至泛化至日间生理,在检测心房颤动方面超越了专门的 ECG 基础模型。我们的结果表明,下一个词预测是从多模态生理信号中进行表示学习的一种强大自监督目标。
Abstract
Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using 100× less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 9.0/10 | 13.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: 论文核心为多模态生理信号的基础模型构建。MultiModal(10 分)和 Tokenizer(9 分)高度相关,因论文明确使用 8 种模态及残差向量量化分词。MLLM(6 分)和 Unify Models(5 分)中度相关,因采用类似 LLM 的自回归架构统一多模态表征。Visual Encoder、World Models 及 model-based RL 均低相关(1 分),因论文不涉及图像编码、环境建模或强化学习。作者列表中未包含指定的 Yang Shi 等专家,故无额外加分。加权总分为 49.5,高于动态及格分 27.8。
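The scoring arithmetic used throughout this report can be checked with a few lines. The sketch assumes, as the tables suggest, that each row's score is weight × relevance and that the total is the plain sum of the rows; the function name is hypothetical, and the pass mark of 27.8 is taken from the rationale above.

```python
def weighted_total(rows):
    """Sum of weight * relevance over (keyword, weight, relevance) rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


# Rows copied from the Hypnos score table above.
hypnos_rows = [
    ("Unify Models", 1.5, 5.0),
    ("Tokenizer", 1.5, 9.0),
    ("Visual Encoder", 1.5, 1.0),
    ("World Models", 1.5, 1.0),
    ("MLLM", 1.5, 6.0),
    ("MultiModal", 1.5, 10.0),
    ("model-based RL", 1.5, 1.0),
]
total = weighted_total(hypnos_rows)
passes = total > 27.8  # dynamic pass mark quoted in the rationale
print(total, passes)
```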
关键词
Next-Token Prediction, Multi-modal, Foundation Model, Sleep Physiology, Residual Vector Quantization, RQ-Transformer, Representation Learning
深度分析
Chinese Title: 下一词元预测学习睡眠生理学的可泛化表征
Summary: 本文提出Hypnos,一种基于下一词元预测的多模态睡眠基础模型。针对现有生理信号基础模型(掩码重建或对比学习)在随机性、多模态互补性及流式推理方面的不足,Hypnos采用残差向量量化(RVQ)将EEG、ECG、呼吸等8种模态信号离散化为词元流,并训练自回归RQ-Transformer并行预测所有模态的下一词元。模型在超过20,000条多导睡眠图记录上预训练,支持任意模态子集和任意长度序列的流式推理。实验表明,Hypnos在睡眠分期、皮质觉醒检测等任务上显著优于现有基础模型,在睡眠分期中仅用1%标注数据即可匹配强监督基线,并泛化至日间生理信号,在房颤检测上超越专用ECG基础模型。研究证明下一词元预测是生理信号多模态表征学习的有效自监督目标。
Innovations:
- 首次将下一词元预测应用于多模态生理信号,统一生成式建模与表征学习
- 采用残差向量量化(RVQ)对8种生理模态进行离散化,支持高采样率信号的高效压缩
- 设计RQ-Transformer架构,通过随机子组交叉模态注意力实现训练时模态缺失泛化
- 模型支持流式推理,以1Hz频率生成嵌入,适用于实时闭环神经调控等应用
- 在睡眠分期任务中仅用1%标注数据即达到强监督基线性能,展现极强数据效率
Methodology: 使用残差向量量化(RVQ)将每个模态的连续信号编码为K个离散词元流(词元率1Hz),编码器/解码器由因果卷积(SeaNet)和因果滑动窗口Transformer组成。预训练阶段,训练自回归RQ-Transformer,在给定历史词元序列条件下并行预测所有模态的下一词元(多模态下一词元预测)。训练时随机采样模态子组限制交叉注意力,增强测试时对任意模态组合的泛化能力。预训练后,模型可对任意长度传感器数据流生成嵌入,用于下游任务(如睡眠分期、房颤检测)。
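Residual vector quantization, the tokenization step described above, can be sketched in a few lines: each codebook quantizes whatever the previous levels left unexplained. This is a generic illustration with tiny hand-written codebooks, not Hypnos's learned tokenizer; the function names and all numbers are made up for the example.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(vector, codebooks):
    """Quantize `vector` with a stack of codebooks, each coding the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two levels: a coarse codebook and a finer one for the residual.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
signal = [1.2, 0.8]
tokens = rvq_encode(signal, codebooks)
coarse = rvq_decode(tokens[:1], codebooks[:1])
fine = rvq_decode(tokens, codebooks)
print(tokens, coarse, fine)
```

Each time step therefore yields one token per level, which is the multi-level token structure an RQ-Transformer is designed to predict.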
Key Results:
- Hypnos在睡眠分期、皮质觉醒检测、房颤检测等基准上显著优于现有睡眠基础模型(如SleepFM、NeuroLM等)
- 在睡眠分期任务中,仅使用1%标注数据即可匹配强监督基线(100倍数据效率提升)
- 模型泛化至日间生理信号,在房颤检测上超越专用ECG基础模型
- 下一词元困惑度和下游探测性能随模型规模增大持续提升,验证了缩放定律
- 支持任意模态子集和任意长度序列的流式推理,嵌入生成频率为1Hz
Tech Stack:
- 残差向量量化(Residual Vector Quantization, RVQ)
- RQ-Transformer(自回归Transformer,处理多级残差词元)
- 因果卷积(SeaNet)
- 因果滑动窗口Transformer
- 多模态下一词元预测(Multi-modal Next-token Prediction)
- 自监督学习(Self-Supervised Learning)
- 多导睡眠图(Polysomnography, PSG)数据集(SHHS, CCSHS, CFS, CHAT, MESA, MrOS, NCHSDB, WSC, DOD-H, DOD-O)
Strengths:
- 方法简单且可扩展,继承自语言建模和音频建模的成功经验
- 有效处理生理信号的随机性和多模态互补性,避免对比学习中的正对定义问题
- 模型设计支持流式推理和任意模态组合,实用性强
- 数据效率极高,在睡眠分期中仅需少量标注数据
- 跨任务泛化能力强,甚至迁移至日间生理信号
Limitations:
- 仅基于睡眠数据预训练,对非睡眠场景的泛化能力可能有限(论文仅展示了日间ECG任务)
- 依赖大量公共数据集,可能存在数据偏差(如特定人群、设备配置)
- RVQ词元化过程可能丢失细微生理信息(如精确波形形态),尽管论文声称通过概率建模避免
- 未与基于扩散的连续信号生成模型进行直接比较
- 计算资源需求较高(大Transformer模型训练)
Relevance To Keywords:
- 原生多模态大模型:Hypnos是典型的多模态生理基础模型,统一处理EEG、ECG、呼吸等8种模态,符合原生多模态大模型定义
- 表征学习:通过下一词元预测自监督目标学习可泛化的生理信号表征,是表征学习的典型应用
- 世界模型:下一词元预测本质上是学习生理信号的时序动态模型,可视为生理世界模型的一种形式
- 模型-Based RL:论文未直接涉及强化学习,但流式推理和实时嵌入生成能力为闭环神经调控(可结合RL)提供了基础
- 后训练:论文主要关注预训练,未涉及后训练(如指令微调),但模型可应用于下游任务探测
摘要翻译
皮层中相邻的神经元具有相似的响应特征,从而在感觉和认知系统中产生系统性的空间组织。最近的拓扑模型复制了该结构的某些方面,但仍局限于单模态,并且分别对每一层进行空间约束,产生了碎片化的图谱,既无法捕捉皮层处理流的连续性,也无法捕捉跨模态的整合。我们引入了 Topo-Omni,这是一种拓扑多模态模型,其中视觉、听觉和语言/认知处理共享一个单一的连续计算模拟平面。该架构通过利用空间平滑目标对预训练基础模型进行微调构建而成,它在跨模态之间发展出与人类神经影像(从感觉到认知系统)一致的簇。选择性地激活或抑制某个簇会偏置或损害感知,这与人类干预研究类似。最后,我们利用该模型在计算模拟中筛选出新簇,发现了新的自然景观和动物网络,并在人类数据中对其进行了验证。因此,单一的空间原则组织了跨模态和处理阶段的表征,产生了关于皮层组织的可检验假设。
Abstract
Nearby neurons in cortex share similar response profiles, producing systematic spatial organization across sensory and cognitive systems. Recent topographic models reproduce aspects of this structure but remain unimodal and spatially constrain each layer separately, yielding fragmented maps that capture neither the contiguity of cortical processing streams nor their integration across modalities. We introduce Topo-Omni, a topographic multimodal model in which visual, auditory, and language/cognitive processing share a single contiguous in-silico sheet. Built by fine-tuning a pretrained foundation model with a spatial smoothness objective, this architecture develops clusters across modalities that are consistent with human neuroimaging, from sensory to cognitive systems. Driving or suppressing a cluster selectively biases or impairs perception, paralleling human intervention studies. Finally, we use our model to screen for novel clusters in-silico and discover new natural landscape and animal networks which we validate in human data. A single spatial principle thus organizes representations across modalities and processing stages, yielding testable hypotheses about cortical organization.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 8.0/10 | 12.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 7.0/10 | 10.5 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on a topographic multimodal model (MultiModal: 10) unifying visual, auditory, and language representations (Unify Models: 8) via a foundation model (MLLM: 7). Visual Encoder is a component (5), while Tokenizer (2), World Models (1), and model-based RL (0) are not core contributions. None of the listed expert authors are present, so no bonus points apply.
关键词
Topographic Multimodal Model, Brain Regions, Spatial Organization, Foundation Model, Neuroimaging, Visual Auditory Language, Cortical Organization, Functionally Selective
深度分析
Chinese Title: 利用深度拓扑多模态模型发现功能选择性脑区
Summary: 本文提出Topo-Omni,一种将视觉、听觉和语言/认知处理整合到单个连续模拟皮层片上的拓扑多模态模型。通过微调预训练基础模型(Qwen2.5-Omni)并加入空间平滑目标,模型自发形成与人类神经影像一致的跨模态功能选择性聚类(如面孔、场景、物体、文字等区域)。驱动或抑制特定聚类可选择性影响感知,模拟人类干预研究。模型还用于在硅片中筛选新聚类,发现了自然景观和动物选择性网络,并在人类fMRI数据中得到验证。研究表明,单一空间平滑原则即可组织跨模态和跨处理阶段的表征,为皮层组织提供可检验假设。
Innovations:
- 首次构建跨模态(视觉、听觉、语言/认知)的连续拓扑模型,突破以往单模态和分层分离的局限。
- 利用预训练基础模型(Qwen2.5-Omni)进行微调,赋予拓扑模型强大的多模态能力,避免从零训练。
- 通过空间平滑损失在单一连续皮层片上实现跨模态聚类,重现人类视觉、听觉、语言等类别选择性区域。
- 通过因果性实验(驱动或抑制聚类)验证了模型聚类与感知行为的因果关系。
- 利用模型在硅片中预测并发现新的功能选择性网络(自然景观和动物),并在人类fMRI数据中验证。
Methodology: 基于Qwen2.5-Omni多模态架构,将各层单元投影到统一的二维连续皮层片上,通过自蒸馏任务损失保持模型能力,同时引入空间平滑损失(Lspatial)鼓励相邻单元响应相似。使用人类fMRI局部器(如EMFL)定义功能区域,通过交叉验证评估模型聚类与人类脑区的匹配度。采用旋转楔形和收缩环刺激测试视网膜拓扑映射。通过聚类分析和特征聚类发现新网络,并在Spacetop fMRI数据集上验证。
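The idea behind a spatial smoothness objective can be sketched as follows. The exact form of the paper's loss is not given in this summary, so everything here is an assumption: the sketch penalizes neighbouring units on the sheet whose response profiles are poorly correlated, using one minus the Pearson correlation averaged over unit pairs within a fixed radius. Function names, positions, and responses are illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length response profiles (non-constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spatial_smoothness_loss(positions, responses, radius=1.0):
    """Mean (1 - correlation) over unit pairs closer than `radius` on the sheet."""
    total, count = 0.0, 0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= radius:
                total += 1.0 - pearson(responses[i], responses[j])
                count += 1
    return total / count if count else 0.0


# Four units on a line; responses to three stimuli (made-up values).
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = [[1, 2, 3], [1, 2, 3], [3, 2, 1], [3, 2, 1]]   # similar units are neighbours
scrambled = [[1, 2, 3], [3, 2, 1], [1, 2, 3], [3, 2, 1]]   # similar units are interleaved
loss_clustered = spatial_smoothness_loss(positions, clustered)
loss_scrambled = spatial_smoothness_loss(positions, scrambled)
print(loss_clustered, loss_scrambled)
```

A layout in which similarly tuned units sit next to each other gets the lower loss, which is the pressure that would produce clusters when such a term is added to the task loss.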
Key Results:
- 模型在视觉编码器中形成与人类腹侧视觉通路一致的面孔、场景、物体、文字等类别选择性区域,响应轮廓与人类fMRI高度相关(FFA: r=0.88, LOC: r=0.89)。
- 模型发展出视网膜拓扑映射(极角、离心度),与早期视觉皮层组织一致。
- 在保持拓扑结构的同时,模型在脑对齐和下游任务上仍具竞争力。
- 因果性实验:驱动或抑制特定聚类可选择性影响或损害感知,类似人类干预研究。
- 模型发现新的自然景观和动物选择性网络,并在人类fMRI数据中得到空间验证。
Tech Stack:
- Qwen2.5-Omni(预训练多模态基础模型)
- 自蒸馏任务损失(self-distillation task loss)
- 空间平滑损失(spatial smoothness loss)
- fMRI功能局部器(EMFL localizer)
- 交叉验证(odd/even runs)
- 聚类分析(agglomerative hierarchical clustering)
- 视网膜拓扑映射(旋转楔形、收缩环刺激)
- Spacetop fMRI数据集
Strengths:
- 首次实现跨模态连续拓扑建模,更贴近真实皮层组织。
- 利用预训练模型,兼具高性能和脑对齐能力。
- 通过因果实验验证了模型聚类的功能意义。
- 在硅片中发现新聚类并得到人类数据验证,具有预测价值。
- 方法通用,可推广至其他多模态架构。
Limitations:
- 部分区域(如PPA、VWFA)的响应轮廓与人类数据仅达趋势水平相关(p=0.077, p=0.063),未达显著。
- 模型仅基于视频/音频/文本刺激,未涵盖更复杂的自然场景。
- 空间平滑损失可能限制模型在非空间任务上的灵活性。
- 人类验证仅基于有限数据集(Spacetop),泛化性待检验。
Relevance To Keywords:
- 原生多模态大模型:模型基于Qwen2.5-Omni,整合视觉、听觉、语言处理。
- 多模态大模型的理解和生成一体化:模型通过自蒸馏任务损失保持多模态理解能力。
- 表征学习:空间平滑目标促使模型学习具有拓扑结构的表征。
- 世界模型:模型模拟皮层空间组织,可视为对感知世界的一种结构化表征。
- 强化学习:论文未直接涉及强化学习,但因果干预实验类似于强化学习中的动作-结果关系。
- 后训练:模型通过微调预训练基础模型实现拓扑结构,属于后训练阶段。
摘要翻译
多模态大语言模型(MLLMs)在视频时间定位任务中,通过强化学习生成推理路径取得了显著进展。然而,现有模型往往产生表面化的推理,这对精确的时间定位提供的指导有限。这一局限性源于(1)低效的随机探索以及(2)仅关注答案正确性而忽略推理质量的奖励函数。为了解决这些问题,我们提出 TaRO(时间感知推理优化),该框架显式增强了模型的时间感知推理能力。首先,我们引入了一种构造性推理探索(Constructive Reasoning Exploration),利用预生成的密集字幕构建基于显式视觉线索和时间戳的推理路径,从而实现高质量时间感知推理的高效探索。其次,为了评估推理质量,我们设计了一种时间敏感性奖励(Temporal-Sensitivity Reward)。高质量推理应锚定在特定事件和时间戳上。如果在推理过程中事件边界被破坏,此类推理应变得无效,导致推理路径的 logit 下降。我们将这种下降作为评判推理质量的依据。最后,TaRO 遵循一种渐进式课程策略,首先利用该奖励选择构造更优的推理路径,随后演化为自由探索阶段,在此阶段模型自主生成有效的推理。实验表明,TaRO 在视频时间定位(VTG)基准上实现了最先进的性能。代码可在 https://github.com/oceanflowlab/TaRO 获取。
Abstract
Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted, such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 5.0/10 | 7.5 |
评分理由: The paper explicitly utilizes Multi-modal Large Language Models (MLLM) for Video Temporal Grounding, making MLLM and MultiModal highly relevant (scores 9 and 8). It employs Reinforcement Learning to optimize reasoning paths, which is moderately related to model-based RL (score 5). Other keywords like Tokenizer and Visual Encoder are implicit components of the MLLM architecture but not the focus of this work (scores 2-3). Unify Models and World Models are tangential as the paper focuses on specific task optimization rather than general model unification or generative world modeling.
关键词
Video Temporal Grounding, MLLM, Reinforcement Learning, Temporal-Aware Reasoning, Temporal-Sensitivity Reward, Constructive Reasoning Exploration, Progressive Curriculum
深度分析
Chinese Title: 面向视频时序定位的时间感知推理优化
Summary: 本文针对多模态大语言模型(MLLM)在视频时序定位(VTG)任务中推理浅层化的问题,提出TaRO框架。现有基于强化学习的方法存在随机探索低效和奖励函数仅关注答案正确性而忽视推理质量两大局限。TaRO通过三方面改进:首先,提出建设性推理探索策略,利用预生成的密集视频字幕(含显式时间戳)构建高质量推理路径,引导模型关注关键视觉线索;其次,设计时间敏感性奖励,通过局部打乱真实边界附近的帧,利用推理token在原始与打乱视频下的log概率差作为推理质量的评判标准,鼓励推理锚定关键事件和时间戳;最后,采用渐进课程学习,先利用构建的推理路径进行监督训练,再过渡到自由探索阶段。实验表明,TaRO在多个VTG基准上达到最先进性能,有效提升了模型的时间感知推理能力。
Innovations:
- 提出TaRO框架,显式增强MLLM在VTG中的时间思考能力,解决推理浅层化问题。
- 提出建设性推理探索策略,利用密集字幕构建高质量推理路径,替代低效的随机探索。
- 设计时间敏感性奖励,通过局部帧打乱评估推理对时间信息的依赖程度,鼓励推理锚定关键事件和时间戳。
- 采用渐进课程学习,先使用构建的推理路径进行监督,再过渡到自由探索,实现引导与自主学习的有效结合。
Methodology: TaRO基于GRPO算法进行策略优化。首先,使用现成字幕器生成视频的密集字幕(含时间戳),随机选择子集按时间顺序拼接作为构建的推理轨迹。然后,对于生成的推理路径,通过局部打乱真实边界附近的帧,计算推理token在原始视频和打乱视频下的log概率差,差值越大表示推理对时间信息越敏感,从而给予正奖励。训练采用渐进课程:初期使用构建的推理路径进行监督学习,后期转为标准随机rollout,结合时间敏感性奖励和IoU奖励进行优化。
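The reward ingredients described above can be sketched in plain Python: temporal IoU for answer correctness, the log-probability drop under local frame shuffling as the sensitivity signal, and GRPO's group-relative normalization of rewards. This is a minimal sketch, not the released TaRO code; the function names are hypothetical, and summing the two rewards with equal weight is an assumption made only for the example.

```python
import math


def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def sensitivity_reward(logp_original, logp_shuffled):
    """Drop in total log-probability of the reasoning tokens after local shuffling."""
    return sum(logp_original) - sum(logp_shuffled)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# One group of three rollouts for the same query (all numbers are made up).
ground_truth = (4.0, 8.0)
predictions = [(2.0, 6.0), (4.0, 8.0), (10.0, 12.0)]
logp_pairs = [([-1.0, -2.0], [-1.5, -2.5]),   # reasoning moderately time-sensitive
              ([-1.0, -1.0], [-3.0, -3.0]),   # reasoning strongly time-sensitive
              ([-2.0, -2.0], [-2.0, -2.0])]   # reasoning ignores the shuffle
rewards = [temporal_iou(p, ground_truth) + sensitivity_reward(orig, shuf)
           for p, (orig, shuf) in zip(predictions, logp_pairs)]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Reasoning that is unaffected by shuffling frames near the event boundary earns no sensitivity reward, which is how the scheme discourages superficial reasoning that does not actually look at time.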
Key Results:
- 在VTG基准上达到最先进性能,显著优于现有RL方法(如Time-R1)。
- 分析表明Time-R1的推理链与直接答案性能相当,说明其推理贡献有限,而TaRO的时间感知推理有效提升了定位精度。
- 统计显示Time-R1仅8.3%的推理包含显式时间戳,而TaRO通过建设性探索和时间敏感性奖励大幅提高了推理的时间锚定性。
Tech Stack:
- GRPO (Group Relative Policy Optimization)
- Chain-of-Thought (CoT) 推理格式
- IoU奖励 (Intersection over Union)
- 格式奖励 (format reward)
- 密集视频字幕生成器 (off-the-shelf captioner)
- 局部帧打乱 (local frame shuffling)
Strengths:
- 针对现有RL方法在VTG中的两大核心问题(随机探索低效、奖励忽视推理质量)提出了系统性解决方案。
- 时间敏感性奖励设计新颖,通过局部扰动直接度量推理对时间线索的依赖,避免了全局打乱导致答案失效的问题。
- 渐进课程学习有效结合了引导探索与自主探索,使模型既能快速学习高质量推理范式,又能进一步优化。
- 实验充分,在多个基准上验证了SOTA性能,并提供了消融分析。
Limitations:
- 依赖预生成的密集字幕,字幕质量可能影响构建推理路径的效果,且引入额外计算开销。
- 局部帧打乱策略仅适用于有明确边界标注的场景,在弱监督或无监督设置下可能不适用。
- 时间敏感性奖励的计算需要多次前向传播,增加了训练复杂度。
- 未探讨在更复杂视频(如多事件、长视频)中的泛化能力。
Relevance To Keywords:
- 多模态大模型:论文以MLLM为基础,通过RL后训练增强其时间推理能力。
- 强化学习:采用GRPO算法,并设计新的奖励函数和探索策略。
- 后训练:TaRO属于RL后训练范式,在预训练MLLM基础上进行推理优化。
- 表征学习:通过时间敏感性奖励促使模型学习更鲁棒的时间表征。
- 世界模型:视频时序定位涉及对视频中事件动态的理解,可视为世界模型中对时间维度的建模。
摘要翻译
我们提出 SUPERBROWSER,一种自主网页导航代理,其设计基于单一指导假设:网页代理应当像人类一样进行浏览。人类在阅读页面时并不会保留所见的所有像素;他们会查看少数几个候选目标,选定其中一个,并仅记住维持目标所需的信息。我们将这一感知 - 认知 - 行动三元组(perception-cognition-action triad)操作化为三个耦合机制。首先,一个基于视觉的边界框(bounding-box)管道在每个截图上标记候选交互区域,并将它们异步预取后馈送给语言模型,从而实现“眼”先于“手”。其次,一个三角色大脑(three-role brain)——包括负责分类与路由的编排者(Orchestrator)、负责每隔几步评估进度的规划者(Planner)以及负责发出每一步动作的执行者(Worker)——将战略推理与操作推理分离开来。第三,一个结构化的账本(Ledger)仅存储人类会保留的信息:目标、最近三个动作、少量事实与死胡同以及若干检查点;一个六阶段驱逐循环系统地从当前上下文中丢弃过时的截图、状态块和推理轨迹。动作执行采用三层点击级联(从 Chrome DevTools Protocol 到 Puppeteer 再到脚本化),并辅以拟人化的贝塞尔(Bezier)运动,此外还包括一个感知 chevron(箭头状)的边界框捕捉器,用于解决“大标签旁小箭头”的歧义问题。在 Mind2Web Hard 基准(66 个任务)上,SUPERBROWSER 取得了 89.47% 的成功率,总体排名第三,且大幅领先于所有已发布的开源/研究浏览器代理基线。我们认为,这一性能提升并非源于任何单一技巧,而是源于在整个系统中一致贯彻的认知契约(cognitive contract)。
Abstract
We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the "eye" precedes the "hand". Second, a three-role brain -- an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions -- separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 6.0/10 | 9.0 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 7.0/10 | 10.5 |
| MultiModal | 1.5 | 7.0/10 | 10.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper presents SuperBrowser, an autonomous web navigation agent. It scores highly on MLLM and MultiModal due to its use of language models processing visual screenshots and text. Visual Encoder is relevant for the vision-first pipeline. However, it does not focus on Tokenizer design, World Model learning (Ledger is memory, not latent dynamics), or Model-Based RL algorithms (uses LLM planning). Unify Models refers to the architectural unification of perception/action. No target experts (Yang Shi, etc.) are authors, so no bonus points applied.
关键词
Autonomous Web Navigation, Human Browsing Behaviour, Vision-first Pipeline, Structured Ledger, Mind2Web Benchmark, Perception-Cognition-Action, LLM Agent
深度分析
Chinese Title: RunAgent SuperBrowser:基于人类浏览行为的自主网页导航理论
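Summary: The paper presents SUPERBROWSER, an autonomous web-navigation agent built on one hypothesis: a web agent should browse the way a person does. A human reading a page does not retain every pixel; they attend to a few candidate targets, pick one, and remember only what keeps the goal alive. The paper turns this perception-cognition-action triad into three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and prefetches them asynchronously for the language model. Second, a three-role brain separates strategic from operational reasoning: an Orchestrator classifies and routes, a Planner evaluates progress every few steps, and a Worker emits per-step actions. Third, a structured Ledger stores only what a person would keep (the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints), while a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Actions run through a three-tier click cascade (Chrome DevTools Protocol → Puppeteer → scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER reaches 89.47% success, ranking third overall and well ahead of every published open/research browser-agent baseline. The authors attribute the gain to the consistent application of a cognitive contract across the system rather than to any single trick.
Innovations:
- Proposes a cognitively grounded contract for autonomous web navigation, treating the perception-cognition-action triad of human browsing as a system design principle rather than a metaphor.
- Designs a three-role brain (Orchestrator, Planner, Worker) that separates strategic from operational reasoning through verb-classifier routing and a progress evaluation every N steps.
- Introduces a structured Ledger memory that stores only the goal, the last three actions, facts, dead-ends, and checkpoints, and uses a six-phase eviction loop to keep context size constant.
- Proposes a chevron-aware bounding-box snapping algorithm and a three-tier humanized click cascade (CDP → Puppeteer → scripted) that address common failure modes of visually grounded agents.
- Reaches 89.47% success on Mind2Web Hard, well ahead of all open/research baselines, which supports the cognitive contract.
Methodology: The paper follows a cognitive-science-driven system design approach. It first derives design principles from human browsing behavior (scan paths, working-memory limits, structured episodic recall), then builds SUPERBROWSER in four layers. The perception layer uses a vision model (e.g., Set-of-Mark) to produce bounding boxes for candidate interactive regions, augmented with DOM enrichment and asynchronous prefetching. The cognition layer is split into three roles, each with a tightly scoped prompt. The memory layer uses the structured Ledger and the six-phase eviction loop. The action layer uses the three-tier click cascade and the bounding-box snapper. Evaluation is on Mind2Web Hard (66 long-horizon tasks), comparing success rate against several baselines (e.g., WebVoyager, AutoWebGLM, UI-TARS).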
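The Ledger and eviction idea can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the paper's implementation: the `Ledger` field names, the `(kind, payload)` context format, and the single-pass `evict_stale` function, which stands in for the six-phase loop whose phases this digest does not spell out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical structured memory: only goal-relevant state is kept."""
    goal: str
    # Only the last three actions survive; older ones fall off automatically.
    recent_actions: deque = field(default_factory=lambda: deque(maxlen=3))
    facts: dict = field(default_factory=dict)
    dead_ends: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)

    def record_action(self, action: str) -> None:
        self.recent_actions.append(action)

    def render(self) -> str:
        """Serialize the ledger into a compact prompt fragment."""
        lines = [f"GOAL: {self.goal}"]
        lines += [f"ACTION: {a}" for a in self.recent_actions]
        lines += [f"FACT: {k} = {v}" for k, v in self.facts.items()]
        lines += [f"DEAD END: {d}" for d in self.dead_ends]
        lines += [f"CHECKPOINT: {c}" for c in self.checkpoints]
        return "\n".join(lines)


def evict_stale(context: list, keep_latest_screenshots: int = 1) -> list:
    """Drop stale screenshots, state blobs, and reasoning traces.

    `context` is a list of (kind, payload) tuples; the kinds are illustrative.
    """
    screenshots_seen = 0
    kept = []
    for kind, payload in reversed(context):  # walk newest-first
        if kind == "screenshot":
            screenshots_seen += 1
            if screenshots_seen > keep_latest_screenshots:
                continue  # older screenshot: evict
        elif kind in ("state_blob", "reasoning"):
            continue  # always evicted in this simplified version
        kept.append((kind, payload))
    kept.reverse()  # restore chronological order
    return kept
```

Because `recent_actions` is a bounded deque and eviction runs on every step, the rendered context stays roughly constant in size no matter how long the task runs, which is the property the analysis credits for avoiding attention drift.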
Key Results:
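- On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER reaches 89.47% success, ranking third overall.
- It leads all published open/research browser-agent baselines (e.g., WebVoyager, AutoWebGLM, UI-TARS) by a large margin.
- The results support the cognitive contract (perception-cognition-action triad, structured memory, role separation) on long-horizon tasks.
- The six-phase eviction loop keeps context size constant, avoiding attention drift in long prompts.
- The chevron-aware bounding-box snapper resolves the "small arrow beside a large label" UI ambiguity.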
Tech Stack:
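- Vision model: Set-of-Mark prompting (generates bounding boxes for candidate regions)
- Language model: GPT-4V or a similar multimodal LLM (cognitive reasoning)
- Action execution: Chrome DevTools Protocol (CDP), Puppeteer, scripted clicks
- Bezier-curve motion (humanized mouse trajectories)
- Chevron-aware bounding-box snapper
- Structured Ledger: goal, plan, last three actions, fact dictionary, dead-end list, checkpoint list
- Six-phase eviction loop: discards stale screenshots, state blobs, and reasoning traces
- DOM cache subsystem (caches page state to avoid repeated perception)
- Mind2Web Hard benchmark (66 long-horizon tasks)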
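Two of the action-layer pieces listed above lend themselves to a short sketch: a cubic Bezier mouse path and a fall-through click cascade. This is a hedged illustration only. The function names, the `jitter` parameter, and the `(name, callable)` tier format are assumptions, and the tiers are plain callables rather than real CDP or Puppeteer calls.

```python
import random


def bezier_path(start, end, steps=20, jitter=80.0, rng=None):
    """Points along a cubic Bezier curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Two randomly offset control points bend the path away from a straight line.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bernstein form: B(t) = u^3 P0 + 3u^2 t P1 + 3u t^2 P2 + t^3 P3
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points


def click_with_cascade(target, tiers):
    """Try each click tier in order; return the name of the first that succeeds."""
    errors = []
    for name, click in tiers:
        try:
            click(target)
            return name
        except Exception as exc:  # any failure falls through to the next tier
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all click tiers failed: " + "; ".join(errors))
```

In a real agent the tiers would wrap a CDP input event, a Puppeteer click, and an in-page scripted click, in that order; the cascade logic itself does not depend on which backends are plugged in.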
Strengths:
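- Theory-driven: the system is designed from cognitive science (working memory, episodic recall, scan paths), giving it a clear conceptual basis.
- System coherence: every component, from perception and cognition to memory and action execution, follows the same cognitive contract.
- Efficient memory management: the structured Ledger and eviction loop prevent context bloat, so the agent stays stable on long-horizon tasks.
- Practical contributions: the chevron-aware snapper and the humanized click cascade address problems seen in deployment (UI ambiguity, anti-bot detection).
- Strong performance: 89.47% success on Mind2Web Hard, well ahead of existing open baselines.
Limitations:
- Dependence on specific vision and language models: the current implementation relies on commercial models such as GPT-4V and may be constrained by API cost and availability.
- Limited benchmark coverage: evaluation is on Mind2Web Hard only, with no results on broader online settings (e.g., Online-Mind2Web) or sandboxed environments (e.g., WebArena).
- Generality of the cognitive contract: its assumptions may not hold for every kind of web task (e.g., dynamic content, complex forms).
- No failure analysis: the paper does not break down the tasks outside the 89.47% success rate, so failure modes remain unexamined.
- Weak fit to the tracked keywords: the topic is web-navigation agents, with little direct connection to world models, representation learning, or unified models.
Relevance To Keywords: The paper studies autonomous web-navigation agents with a method rooted in cognitive science. It does not directly address Unify Models, World Models, Representation Learning, Model-Based RL, native multimodal large models, unified understanding and generation in multimodal large models, reinforcement learning, or post-training. It does use a multimodal large model (e.g., GPT-4V) for visual perception and reasoning, which gives some connection to the multimodal-model keywords, and its structured memory and cognitive architecture can be read as a lightweight world model (an internal representation of page state and task progress). It involves no reinforcement learning or post-training. Overall relevance is low; the paper is an indirect application.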
摘要翻译
预训练基础模型的快速发展使得更通用的图像分割成为可能。多模态大语言模型(MLLMs)已被广泛探索用于需要高级推理的复杂查询图像分割。尽管取得了有前景的进展,现有方法通常受限于训练数据不足以及 MLLMs 与掩码生成模块之间的差距。为了更好地将 MLLMs 的感知和推理能力转移到基于复杂推理的分割任务中,我们提出了一种用于掩码生成和选择的两阶段框架 Rea2Seg。具体而言,该框架首先基于分割 MLLM 的注意力图识别潜在区域作为候选掩码。然后它使用一个 MLLM 对问题和候选掩码进行推理,并为每个掩码分配分数。最终分割结果通过重新排序候选者并选择得分最高的掩码获得,从而将图像分割重新表述为候选发现随后进行判别性掩码选择。我们还注意到,现有基准中的大部分问题集中在常识推理上,这些问题通常不完全需要联合视觉观察和推理。为了解决这一问题,我们引入一个新的基准 ReasonSeg-SGDR,该基准在多个维度上全面评估模型的感知、定位和推理能力,包括判别性识别、空间推理、几何推理和多步推理,并带有细粒度的掩码生成。此外,我们收集训练数据以增强 MLLMs 联合理解多模态查询和候选掩码的能力,并通过推理分配分数。在提出的基准和 ReasonSeg 上的实验结果表明了统一掩码生成和选择框架的有效性。
Abstract
The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 7.0/10 | 10.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper heavily utilizes MLLMs for image segmentation, scoring high on MLLM and MultiModal. It proposes a unified framework for mask generation and selection, moderately aligning with Unify Models. Visual Encoder is implicitly used within the MLLM architecture. Tokenizer, World Models, and model-based RL are not discussed or relevant to the core contribution. No specified expert authors were found in the author list.
关键词
MLLM, Image Segmentation, Candidate Discovery, Comparative Reasoning, MultiModal Queries, Mask Generation, Two-stage Framework, Reasoning Benchmark
深度分析
Chinese Title: 再次推理:通过候选发现与比较推理进行分割
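Summary: The paper proposes Rea2Seg, a two-stage reasoning segmentation framework that targets two bottlenecks of multimodal large language models (MLLMs) on complex reasoning segmentation: limited training data and the gap between the MLLM and the mask generation module. Stage one derives candidate masks from the attention maps of a segmentation MLLM. Stage two has an MLLM reason over the candidates and score them; reranking then selects the highest-scoring mask, which recasts segmentation as candidate discovery followed by discriminative mask selection. The authors also observe that existing benchmarks such as ReasonSeg lean on commonsense reasoning and do not fully test joint visual perception and reasoning, so they build ReasonSeg-SGDR, which covers four dimensions (discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning) and requires fine-grained masks. Experiments show consistent improvements on multiple benchmarks and support for segmentation at semantic, object, and part granularity.
Innovations:
- Proposes the two-stage Rea2Seg framework, which redefines segmentation as candidate discovery plus discriminative mask selection, a better match for the perception and reasoning strengths of MLLMs.
- Designs the ReasonSeg-SGDR benchmark, which evaluates joint visual perception and reasoning along four dimensions: discriminative, geometric, spatial, and multi-step reasoning.
- Builds a training set that strengthens the MLLM's understanding of generated masks and its ability to compare and score multiple candidates.
- Generates candidate masks from MLLM attention maps, avoiding the brittleness of direct end-to-end mask prediction and improving segmentation quality on complex queries.
Methodology: A two-stage pipeline. Stage one takes Transformer attention maps from a segmentation MLLM (e.g., LISA) and extracts potential regions as candidate masks through thresholding and connected-component analysis. Stage two feeds the question and the candidate masks (as masked images or mask embeddings) to an MLLM, which reasons over them and assigns a score to each mask; reranking selects the top-scoring mask. For training, the authors collect data containing questions, images, candidate masks, and their scores, and fine-tune the MLLM for mask understanding and comparative reasoning.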
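The candidate-discovery step (thresholding plus connected-component analysis) followed by score-based selection can be sketched in plain Python. This is a minimal illustration under stated assumptions: the attention map is a small nested list, the threshold and 4-connectivity are arbitrary choices, and `score_fn` is a stand-in for the MLLM's reasoning-based scorer.

```python
from collections import deque


def candidate_masks(attention, threshold=0.5):
    """Threshold a 2D attention map and split it into 4-connected regions."""
    h, w = len(attention), len(attention[0])
    seen = [[False] * w for _ in range(h)]
    masks = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or attention[r][c] < threshold:
                continue
            # Breadth-first flood fill collects one connected component.
            region, queue = set(), deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and not seen[ny][nx] and attention[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            masks.append(frozenset(region))
    return masks


def select_mask(masks, score_fn):
    """Rerank candidates by score and return the highest-scoring one."""
    return max(masks, key=score_fn)
```

Separating discovery from selection means a missed region in the first step cannot be recovered by the second, which is the dependency the analysis flags under limitations.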
Key Results:
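- On ReasonSeg-SGDR, Rea2Seg clearly improves over existing methods (e.g., LISA, GSVA) on all four dimensions (discriminative, geometric, spatial, multi-step reasoning).
- On the original ReasonSeg benchmark, Rea2Seg also improves consistently.
- Ablations show that both the candidate-generation stage and the comparative scoring stage contribute substantially to final performance.
- Rea2Seg supports segmentation at semantic, object, and part granularity, showing the potential of the discriminative formulation.
Tech Stack:
- Multimodal large language models (MLLMs): LISA, LLaVA, and similar models as the base.
- Attention maps: extracted from the MLLM's Transformer blocks for candidate mask generation.
- SAM (Segment Anything Model): mask decoder or backend for candidate generation.
- Thresholding and connected-component analysis: extract candidate regions from attention maps.
- Reranking: mask selection based on MLLM scores.
- Fine-tuning: trains the MLLM on the collected candidate-mask and score data.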
Strengths:
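- Recasts segmentation as candidate discovery plus discriminative selection, sidestepping the difficulty of having an MLLM produce pixel-level masks directly.
- The ReasonSeg-SGDR benchmark fills a gap in how existing benchmarks evaluate complex reasoning and is closer to practical needs.
- Thorough experiments across multiple benchmarks and granularities support the method's generality.
- Uses the MLLM's existing attention maps as candidate cues, obtaining high-quality candidates without extra training.
Limitations:
- The two-stage design adds inference cost, which may limit real-time use.
- Candidate quality depends heavily on the stage-one attention maps; if they are inaccurate, the target may be missing from the candidates.
- Collecting and labeling training data is costly, since candidate masks and scores must be produced manually or automatically.
- The method mainly targets single-object segmentation; extending it to multi-object or panoptic segmentation would need further design.
Relevance To Keywords:
- Native multimodal large models: the paper uses an MLLM as its core component, applying such models to segmentation.
- Unified understanding and generation in multimodal large models: Rea2Seg couples MLLM understanding (reasoning and scoring) with mask output (mask selection), which reflects this idea.
- Representation learning: attention maps serve as visual representations for candidate discovery, and the MLLM's hidden representations drive scoring.
- World models: not addressed directly, though reasoning segmentation requires commonsense understanding of the scene.
- Reinforcement learning: the paper does not use RL, although related work it cites ([21, 40, 41, 85, 97]) applies RL to MLLM-based segmentation.
- Post-training: the MLLM is fine-tuned on newly collected data, which is a post-training step aimed at a specific task.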
摘要翻译
大语言模型 (LLM) 代理正越来越多地部署在人与人交互的场景中,例如会议助手和临床文档系统,在这些场景中,它们必须观察对话并保留信息以应对下游查询。与传统的人 - 助手设置不同,这些环境本质上是多模态的,涉及回指(anaphora)和指示语(deixis)等复杂话语现象,并包含来自多个参与者的异步或冲突信息。然而,现有的记忆基准主要关注单用户、纯文本交互,未能捕捉这些挑战。为了解决这一差距,我们引入了 H2HMem,这是一个用于评估复杂人与人交互中记忆能力的人人多模态记忆基准。H2HMem 包括双人及多方对话,并包含多模态信息流,并从三个维度评估代理:记忆回忆、推理与应用。对先进代理的实验揭示了在跨模态、跨参与者和跨会话构建、保留和利用记忆方面的显著局限性,突出了下一代大语言模型代理仍有巨大的改进空间。
Abstract
Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human-assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants. However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human-human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application. Experiments with advanced agents reveal substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions, highlighting substantial room for improvement in next-generation LLM agents.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
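评分理由: The paper's core contribution is a multimodal memory benchmark that evaluates agents' memory in human-human interaction, so it is highly relevant to MultiModal and MLLM (evaluation of multimodal LLM agents). It has some connection to World Models (memory is a component of a world model) but makes no core architectural contribution there. Relevance to Unify Models, Tokenizer, Visual Encoder, and model-based RL is low because the paper does not address unified architectures, tokenizer design, visual encoder details, or reinforcement learning algorithms. None of the specified experts (Yang Shi, Xuanyu Zhu, etc.) appear in the author list, so no expert bonus was added.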
关键词
Multimodal Memory Benchmark, Human-Human Interactions, LLM Agents, Memory Recall, Multi-party Conversations, Asynchronous Information, Memory Limitations, Downstream Queries
深度分析
Chinese Title: H2HMem:面向人类-人类交互中智能体的多模态记忆基准
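Summary: The paper proposes H2HMem, a benchmark for evaluating agents' multimodal memory in complex human-human interaction. Existing memory benchmarks mostly cover single-user, text-only human-assistant interaction and miss phenomena that arise in multimodal, multi-party conversation (reference resolution, asynchronous information, conflicting views). H2HMem contains dyadic and multi-party conversations that combine text with images, and its tasks span three dimensions: memory recall, reasoning, and application. The multi-session dataset is built with a privacy-preserving human-in-the-loop generation pipeline. Experiments show that current advanced multimodal models have clear limitations in cross-modal memory alignment and structured reasoning, pointing to what next-generation LLM agents need to improve.
Innovations:
- The first multimodal memory benchmark to cover both dyadic and multi-party interaction.
- A privacy-preserving human-in-the-loop data generation pipeline that avoids the privacy risks of real conversations while producing realistic multimodal, multi-session, multi-participant data.
- A comprehensive evaluation taxonomy covering recall (fact retrieval, cross-modal association, knowledge resolution), reasoning (causal reasoning, reference evolution, temporal reasoning), and application (test-time learning, conflict detection, refusal to answer).
- A multimodal dataset of 25 dialogues, 309 sessions, 7078 dialogue turns, 1300 images, and 2236 question-answer pairs.
Methodology: The pipeline is human-in-the-loop. Humans act as directors who ensure scene consistency, visual grounding, and quality control; LLMs act as scriptwriters that generate participant profiles, multi-session scene outlines, and image keywords, then write dialogue text from image descriptions. Humans then verify the output and construct the question-answer pairs. Dialogues are either dyadic or multi-party and simulate online conversation. Evaluation tasks follow three functional dimensions: memory recall (UPR, CRR, KR), memory reasoning (MCR, RET, TR), and memory application (TTL, CD, AR).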
Key Results:
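- The H2HMem dataset contains 25 dialogues (20 dyadic, 5 multi-party), 309 sessions, 7078 dialogue turns, 1300 images, and 2236 question-answer pairs.
- Dyadic dialogues average 18.7 turns per session and multi-party dialogues 70.5 turns per session; sessions contain 4.21 images on average.
- Current multimodal models perform poorly on cross-modal memory alignment (e.g., linking text and images) and on structured reasoning (e.g., causal and temporal reasoning), leaving substantial room for improvement.
- Existing benchmarks (e.g., LoCoMo, EverMemBench) are less comprehensive than H2HMem in interaction types, multimodal content, and evaluation dimensions.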
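The dataset figures above can be cross-checked with a few lines of arithmetic. The per-session image average follows directly from the totals. The dyadic versus multi-party session split is not given in this digest, so the last step only estimates it from the two reported per-session turn averages; treat that split as an inference from rounded numbers, not a reported statistic.

```python
dialogues, sessions, turns, images, qa_pairs = 25, 309, 7078, 1300, 2236

# Direct ratios from the reported totals.
images_per_session = images / sessions   # reported as 4.21
turns_per_session = turns / sessions     # overall average across both dialogue types
qa_per_dialogue = qa_pairs / dialogues

# Estimate the session split from the reported per-session turn averages:
#   d + m = sessions,  18.7 * d + 70.5 * m = turns
dyadic_avg, multi_avg = 18.7, 70.5
multi_sessions = (turns - dyadic_avg * sessions) / (multi_avg - dyadic_avg)
dyadic_sessions = sessions - multi_sessions

print(round(images_per_session, 2), round(turns_per_session, 1))
print(round(dyadic_sessions), round(multi_sessions))
```

The overall average of about 22.9 turns per session sits much closer to the dyadic figure, consistent with dyadic sessions making up most of the 309.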
Tech Stack:
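- Large language models (LLMs): ChatGPT, DeepSeek, and others as scriptwriters
- Multimodal large language models (MLLMs): the agents under evaluation
- Human-in-the-loop data generation pipeline
- Image caption generation (used for dialogue generation)
- Evaluation metrics: three task families covering memory recall, reasoning, and application
Strengths:
- The first systematic evaluation of multimodal memory in human-human interaction, covering dyadic and multi-party settings.
- Privacy-preserving data generation avoids the ethical risks of using real conversations.
- Broad evaluation dimensions, from simple recall to complex reasoning and application, reflect real application needs.
- Rich multimodal content per dialogue: 309 sessions, 7078 turns, and 1300 images in total.
Limitations:
- The number of dialogues is small (25), which may not represent the diversity of real-world interaction.
- Dialogue generation relies on LLMs and may deviate from real human conversation.
- Tasks are mostly question answering, which may not fully capture the continuity and dynamics of memory.
- Only text and images are included; speech, video, and other modalities are not covered.
Relevance To Keywords:
- Native multimodal large models: directly relevant, since the paper evaluates multimodal models on cross-modal memory alignment and reasoning.
- Representation learning: recall and reasoning tasks involve building and retrieving multimodal representations.
- World models: test-time learning and conflict detection require the agent to build a dynamic understanding of the interaction.
- Reinforcement learning: not used directly, though dynamic decision-making in memory application relates to experience replay in RL.
- Post-training: the results expose gaps in memory capability and suggest directions for post-training.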
摘要翻译
有效的危机响应需要基于空间情境的沟通,这种沟通将平民的语言指导与物理环境相连接,同时需考虑结构瓶颈、动态演变的威胁以及智能体特定情境。然而,当前危机通信中的 NLP 研究主要局限于静态、仅文本的分类设置,忽略了 AI 代理(AI operators)在动态、具身场景中的关键沟通角色。我们提出了一种新颖的基准测试框架,用于评估视觉 - 语言模型(VLMs),这些模型的任务是通过模拟疏散引导平民智能体,从而填补这一空白。我们在九个具有不同结构复杂度的地图上测试了两种沟通策略(窄播(Narrowcast)与广播(Broadcast))、两种环境表征(基于视觉与基于图)以及两种威胁行为(静态与移动)。结果显示,窄播(Narrowcast)在所有难度级别上均一致地降低了平民的失败率,相较于广播(Broadcast)。指导质量高度依赖于 VLM 代理如何表征世界:视觉模态驱动性能,而添加邻接图则取决于模型,且通常有害。移动威胁在所有条件下均提高了失败率,因为通信必须随时间持续适应。综上所述,这些发现表明,在疏散场景中部署 VLMs 作为 AI 代理仍然是一个非平凡挑战,其中沟通策略和输入表示的选择可直接决定干预措施的成功或失败。
Abstract
Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limited to static, text-only classification settings, overlooking the critical communicative role of AI operators in dynamic, embodied scenarios. We address this gap with a novel benchmarking framework for evaluating Vision-Language Models (VLMs) tasked with guiding civilian agents through simulated evacuations. We test two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) across nine maps of varying structural complexity. Our results show that Narrowcast consistently reduces civilian Fail rates compared to Broadcast across all difficulty levels. Guidance quality depends heavily on how the VLM operator represents the world: the visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise Fail rates across all conditions as communication must continuously adapt over time. Together, these findings show that deploying VLMs as AI operators in evacuation scenarios remains a non-trivial challenge, where the choice of communication strategy and input representation can directly determine the success or failure of the intervention.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
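评分理由: The paper's core is a benchmark of VLM communication in crisis scenarios, which is highly relevant to MLLM and MultiModal (8 each). The visual modality in the environment representation involves the visual encoder (5). The paper does not address model unification, tokenizers, world-model architectures, or model-based reinforcement learning, so those score low (2-3). No specified experts are among the authors, so no bonus was added.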
关键词
Vision-Language Models, Crisis Scenarios, Communication Strategies, Environment Representations, Evacuation Benchmark, Narrowcast vs Broadcast, Visual Modality, Civilian Guidance
深度分析
Chinese Title: 引导我出去:一个在危机场景中基准测试VLM操作员通信的框架
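Summary: The paper proposes a benchmarking framework for evaluating vision-language models (VLMs) as operators who guide civilians through simulated evacuations. Using multi-agent simulation on nine maps of varying structural complexity, it compares two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving). Narrowcast lowers the civilian Fail rate at every difficulty level. The visual modality is critical to operator performance, while adding an adjacency graph can hurt. Moving threats raise Fail rates in all conditions because communication must keep adapting. The study shows that deploying VLMs as crisis operators remains a significant challenge and that the choice of communication strategy and input representation can decide whether an intervention succeeds.
Innovations:
- The first benchmarking framework for VLMs guiding civilians in crisis communication, filling a gap left by NLP work limited to static text classification.
- A systematic comparison of narrowcast (personalized messages) and broadcast (one shared message) in dynamic evacuation scenarios.
- An examination of how environment representation (visual image vs. graph-structured text) affects operator guidance, finding the visual modality indispensable.
- Moving threats as a dynamic factor, testing how communication strategies adapt over time.
- Nine maps of varying structural complexity with a topology score that quantifies difficulty, giving the experiments a controlled variable.
Methodology: The framework is a multi-agent simulation in which both the operator and the civilians are VLM agents. The simulation advances in discrete time steps: the operator observes the global state (visual or graph representation) and generates messages; each civilian receives its message and combines it with local observation to choose its next move. Maps fall into easy, medium, and hard tiers, quantified by number of exits and average distance. Threats are either static or follow a random walk. The experiments vary three factors (communication strategy, environment representation, threat behavior), repeat runs on all nine maps, and record each civilian's outcome as Save, Fail, or Timeout.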
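The discrete-time loop described above can be sketched as follows. This is a toy skeleton, not the benchmark's code: `operator` and `civilian_policy` are plain callables standing in for VLM calls, the graph is an adjacency dict, and the function name, message format, and defaults are all assumptions.

```python
import random


def run_episode(graph, exits, civilians, threat, operator, civilian_policy,
                strategy="narrowcast", moving_threat=False, max_steps=20, seed=0):
    """One evacuation episode; returns civilian name -> Save / Fail / Timeout."""
    rng = random.Random(seed)
    positions = dict(civilians)
    outcomes = {}
    for _ in range(max_steps):
        active = [n for n in positions if n not in outcomes]
        if not active:
            break
        state = {"positions": dict(positions), "threat": threat, "exits": exits}
        if strategy == "narrowcast":
            # One tailored message per civilian.
            messages = {n: operator(state, n) for n in active}
        else:
            # Broadcast: a single message shared by everyone.
            shared = operator(state, None)
            messages = {n: shared for n in active}
        for n in active:
            positions[n] = civilian_policy(graph, positions[n], messages[n])
        if moving_threat:
            threat = rng.choice(graph[threat])  # random-walk threat
        for n in active:
            if positions[n] == threat:
                outcomes[n] = "Fail"
            elif positions[n] in exits:
                outcomes[n] = "Save"
    for n in positions:
        outcomes.setdefault(n, "Timeout")
    return outcomes
```

Keeping the operator and civilian policies as injected callables makes the three experimental factors (strategy, representation passed to the operator, threat behavior) independent switches, which mirrors the factorial design the analysis describes.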
Key Results:
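- Narrowcast lowers the civilian Fail rate at every map difficulty and outperforms broadcast.
- The visual modality is the main driver of operator performance; performance drops clearly with the graph representation alone.
- Adding an adjacency graph as auxiliary input helps some models and hurts others, so its effect is model-dependent.
- Moving threats raise Fail rates in all conditions because communication must keep adapting to change.
- Map structure (e.g., bottlenecks, a single exit) strongly affects evacuation success; hard maps have the highest Fail rates.
Tech Stack:
- Vision-language models (VLMs) as operator and civilian agents
- Multi-agent simulation environment (discrete time steps, graph-structured maps)
- Topology score (number of exits, average distance) to quantify map difficulty
- Random-walk model for moving threats
- Implementations of narrowcast and broadcast communication
- Comparison of visual input (top-down view) and graph input (adjacency list)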
Strengths:
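- The research question is practically relevant and fills an evaluation gap for VLMs in crisis communication.
- The experimental design is systematic, controlling communication strategy, environment representation, and threat dynamics together.
- The simulation environment is controllable and reproducible, giving later work a baseline.
- The results show the importance of the visual modality, which is useful guidance for system design.
- The paper is clearly structured and describes its method in enough detail to reproduce.
Limitations:
- The simulation simplifies real crises (e.g., human psychology, noisy multimodal perception).
- This analysis does not name the VLMs evaluated. The adjacency-graph effect is reported as model-dependent, so more than one model was tested, but the results may still not generalize beyond that set.
- Interaction among civilians and crowd behavior are not modeled; only one-way operator-to-civilian communication is studied.
- Threats follow a random walk only; more complex, intelligent threats are not simulated.
- Practical factors such as communication latency and message length are not evaluated.
Relevance To Keywords:
- Unify Models: the VLM acts as a single model over visual and language input, reflecting multimodal unification.
- World Models: the maps, threats, and civilians form a simplified world that the operator must reason over.
- Representation Learning: the comparison of visual and graph representations examines how representation affects performance.
- Model-Based RL: no reinforcement learning is used, but the decision loop (observe, act, receive feedback) resembles the model-based RL paradigm.
- Native multimodal large models: the VLM operator processes vision and text directly, a typical application of such models.
- Post-training: not addressed, though the benchmark could be used to evaluate post-training strategies.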
摘要翻译
视觉 - 语言 - 动作(VLA)策略为语言条件操作提供了强大的先验,但在需要针对性恢复的非标称状态下仍然脆弱。我们提出 ReCoVLA——一种基于故障条件的残差恢复框架,该框架保持预训练 VLA 策略冻结,使用外部视觉 - 语言模型(VLM)推断故障模式和恢复阶段,并从任务相关组件编译结构化奖励。与直接使用 VLM 生成动作或奖励不同,ReCoVLA 将其用作语义奖励选择器:它预测恢复描述符和奖励掩码,用于仿真内的残差策略训练,随后将训练好的恢复策略进行零样本仿真 - 真实部署。这解耦了高层故障理解与底层纠正控制,从而支持不同的 VLA。在短期、长期以及接触丰富操作任务上的实验表明,ReCoVLA 平均优于所测试的基线方法。在仿真中,我们的奖励编译器将微调的 π₀.₅ 基线平均成功率从 36.7% 提升至 66.7%。在物理零样本仿真 - 真实实验中,ReCoVLA 取得了最佳平均性能,成功率为 61.7%。
Abstract
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $π_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
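评分理由: The paper's core is using a vision-language model (MLLM) to guide reward compilation for vision-language-action (MultiModal) policies, so MLLM and MultiModal are the most relevant. Although it involves in-simulation training (related to model-based RL and World Models), the focus is reward compilation rather than world-model learning or a unified architecture, so those are only moderately relevant. Tokenizer and Visual Encoder are merely components of the VLM and are not presented as contributions, so they score low. None of the specified experts appear in the author list.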
关键词
Vision-Language-Action, Failure Recovery, Reward Compilation, Vision-Language Model, Residual Policy, Sim-to-Real, VLM-Guided
深度分析
Chinese Title: ReCoVLA: 视觉语言模型引导的奖励编译用于视觉-语言-动作策略的失败恢复
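Summary: The paper proposes ReCoVLA, a framework that addresses the lack of targeted recovery when pretrained vision-language-action (VLA) policies reach failure states. The base VLA policy stays frozen. An external vision-language model (VLM) analyzes the failure and produces a structured recovery descriptor containing the failure type, recovery stage, active entities, and a reward mask. A deterministic reward compiler then selects components from a reward library according to the descriptor and inserts stage gates, compiling a structured reward for residual reinforcement learning. The trained residual policy supplies corrective actions in zero-shot sim-to-real deployment. Experiments cover short-horizon, long-horizon, and contact-rich manipulation: in simulation, average success rises from 36.7% for the fine-tuned π0.5 baseline to 66.7%, and physical experiments reach 61.7%, the best average among the tested methods. The approach decouples high-level failure understanding from low-level corrective control and supports different VLA backbones.
Innovations:
- A VLM-guided reward compiler that turns structured failure descriptors into executable residual-RL rewards, instead of generating actions or rewards directly.
- VLA failure recovery modeled as mixture-of-residual-experts learning, which learns corrective control on the VLA latent space while preserving nominal behavior.
- Stage-gated rewards that activate components by recovery stage, avoiding premature reward activation and reward hacking.
- Zero-shot sim-to-real deployment that transfers recovery policies without real-world fine-tuning.
Methodology: ReCoVLA has three stages. First, the frozen VLA policy is run in simulation and an external VLM analyzes failed trajectories to build a catalog of failure categories and recovery descriptors. Second, the reward compiler selects components from the reward library according to each descriptor, inserts stage gates, and the residual policy is trained. Third, at deployment the VLM monitors the observation history and dispatches the matching residual policy when it detects a known failure category. The residual policy outputs corrective actions from VLA latent features, and these are added to the nominal actions before execution. Evaluation uses six variants (M1-M6) that compare reward structures and VLA backbones.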
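A minimal sketch of the compile-then-gate idea, assuming a hypothetical descriptor schema (`{"stages": {stage: [component names]}}`) and toy reward components. None of these names come from the paper; the point is only that the VLM output selects components while a deterministic function does the composition, and that the residual is added to the frozen policy's action.

```python
def compile_reward(descriptor, reward_library):
    """Build a stage-gated reward function from a recovery descriptor.

    descriptor: {"stages": {stage_name: [component names]}}  (hypothetical schema)
    reward_library: component name -> function(obs) -> float
    """
    missing = [c for comps in descriptor["stages"].values()
               for c in comps if c not in reward_library]
    if missing:
        raise KeyError(f"unknown reward components: {missing}")

    def reward(obs, stage):
        # Stage gate: only components listed for the current stage contribute.
        active = descriptor["stages"].get(stage, [])
        return sum(reward_library[name](obs) for name in active)

    return reward


def apply_residual(nominal_action, residual_action, scale=1.0):
    """Executed command = frozen VLA action + scaled corrective residual."""
    return [a + scale * r for a, r in zip(nominal_action, residual_action)]
```

Because the gate zeroes out components that belong to later stages, a policy cannot collect, say, a lift reward before it has regrasped, which is the premature-activation problem the stage gating is meant to prevent.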
Key Results:
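- In simulation, ReCoVLA (M4) raises average success from 36.7% for the fine-tuned π0.5 baseline (M1) to 66.7%, a gain of 30 percentage points.
- On a physical Fetch robot, ReCoVLA reaches 61.7% average success, the best among the tested methods and 18.3 percentage points above the next-best baseline.
- Stage-gated rewards (M4) outperform ungated failure-related rewards (M3), showing that gating matters for preventing premature reward activation.
- The design transfers to other VLA backbones (e.g., OpenVLA): M6 reaches 60.0% success in simulation.
Tech Stack:
- VLA policies: π0.5 (flow-matching action generation) and OpenVLA (autoregressive action tokens)
- VLM: Qwen3-VL-8B-Instruct (failure analysis and descriptor generation)
- Residual reinforcement learning: a residual policy trained on the VLA latent space, using PPO or a similar algorithm
- Reward compilation: component selection from a reward library plus stage gating
- Sim-to-real transfer: zero-shot deployment without real-world fine-tuning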
Strengths:
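- Decouples high-level semantic understanding from low-level control and supports different VLA backbones.
- Needs no failure-recovery demonstrations, which lowers data cost.
- Stage-gated rewards guard against reward hacking and improve training stability.
- Zero-shot sim-to-real transfer succeeds, supporting the method's practicality.
- Performs well across task types (long-horizon, short-horizon, contact-rich).
Limitations:
- Depends on the accuracy of the VLM's failure detection; a misclassification can trigger the wrong recovery policy.
- Handles only predefined failure categories; unknown failure modes are not covered.
- Training depends on simulation, and real-world dynamics may differ.
- The reward library must be designed by hand per task; there is no automatic generation.
- Physical experiments use only a Fetch robot; generalization to other platforms needs further testing.
Relevance To Keywords:
- Unify Models, World Models, Representation Learning, Model-Based RL: the VLM serves as a semantic understanding module, which relates to unified multimodal models; the residual policy learns on the VLA latent space, which involves representation learning; world models and model-based RL are not addressed directly.
- Native multimodal large models, unified understanding and generation: the VLM (Qwen3-VL) performs failure analysis, which is multimodal understanding; it emits structured descriptors rather than actions, so the link to unified generation is partial and not central.
- Representation learning: the residual policy learns on the VLA's latent features, using pretrained representations for corrective control.
- World models: the paper does not build or use a world model for planning or prediction.
- Reinforcement learning, post-training: the core method is residual reinforcement learning, a post-training step that tunes recovery behavior.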
摘要翻译
全模态检索旨在为文本、图像、视频、文档和音频输入提供单一的嵌入空间,但由于这些模态在数据分布、架构和优化动态上存在差异,构建此类统一检索器颇具挑战性。在这项工作中,我们提出了 Conan-embedding-v3,一种用于全模态检索的解耦 - 融合 - 恢复框架。Conan-embedding-v3 首先独立训练模态专家,并将它们的任务向量融合到一个单一的稠密骨干网络中,我们称这种策略为解耦专家融合(Decoupled Specialist Fusion)。我们发现,这种融合组合了视觉、视频和文档检索能力,但也暴露了基于投影器的模态的一种失败模式:当音频通过外部编码器和投影器接入时,融合骨干网络会导致投影器校准至音频专家骨干网络,尽管所有音频专用模块均保持不变,但仍会造成显著的音频检索性能退化。我们将这种失败称为投影器漂移(Projector Drift)。为修复这一问题,Conan-embedding-v3 采用投影器恢复(Projector Recovery),即在保持骨干网络冻结的同时对投影器进行全参数微调,随后进行平衡多模态重演。所得模型在单一骨干网络中支持这些检索路径,在 MMEB 上取得 74.9 分,同时在 30 任务 MAEB 音频基准套件上获得 55.61 分。
Abstract
Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple--fuse--recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression despite copying all audio-specific modules unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, achieving 74.9 scores on MMEB while obtaining 55.61 on the 30-task MAEB audio suite.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 8.0/10 | 12.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on unifying diverse modalities into a single embedding space (high relevance for MultiModal and Unify Models) using modality specialists and a fusion framework. Visual encoders are part of the specialists (moderate relevance). It does not focus on tokenizers, world models, MLLM generation, or reinforcement learning (low relevance).
关键词
Omni-modal retrieval, Modality-specific models, Decouple-fuse-recover, Projector recovery, Single dense backbone, Multi-modal embedding, Fusion strategy
深度分析
Chinese Title: Conan-embedding-v3:融合模态特定模型的全模态嵌入
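Summary: The paper proposes Conan-embedding-v3, a decouple-fuse-recover framework for building an omni-modal embedding model that supports text, image, video, document, and audio retrieval. Modality specialists (image, video, document, audio) are first trained independently; their shared backbone is merged through task-vector fusion, and the audio-specific modules are copied over unchanged. After fusion, however, the audio projector no longer matches the fused backbone, and audio retrieval degrades sharply. The authors call this Projector Drift. To fix it, they add a Projector Recovery stage (freeze the backbone, fine-tune only the projector) followed by balanced multimodal rehearsal. The final model scores 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite, restoring audio performance while keeping visual retrieval intact.
Innovations:
- Decoupled Specialist Fusion: train modality specialists independently, then merge the shared backbone through task-vector fusion, avoiding cross-modal optimization conflicts.
- Identifies and names Projector Drift: the fused backbone no longer matches the directly copied audio projector, which degrades performance.
- Projector Recovery: repairs the projector-backbone interface by freezing the backbone and fine-tuning only the projector.
- Balanced multimodal rehearsal, which keeps the recovered model from collapsing into an audio-only retriever.
- Retrieval over five modalities (text, image, video, document, audio) in one backbone, without large-scale mixed-modality optimization.
Methodology: A three-stage pipeline. Stage 1: starting from the same vision-language initialization, train separate image, video, document, and audio specialists (the audio specialist adds an external encoder and a projector), each with LoRA adapters for parameter-efficient fine-tuning. Stage 2: combine the shared backbone parameters through task-vector fusion (Task Arithmetic) and copy the audio-specific modules directly. Stage 3: run Projector Recovery (full-parameter fine-tuning of the projector with the backbone and audio encoder frozen), then balanced multimodal rehearsal (LoRA fine-tuning of the backbone and visual encoder). The training loss is the InfoNCE retrieval loss, averaged over both directions when retrieval is bidirectional.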
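Task-vector fusion in its standard task-arithmetic form is θ_fused = θ_base + Σᵢ λᵢ (θᵢ − θ_base). The sketch below applies that formula to toy parameter dictionaries. It is a generic illustration, not the paper's code: the function name, the list-of-floats parameter format, and the coefficient values are assumptions.

```python
def fuse_task_vectors(base, specialists, coeffs):
    """theta_fused = theta_base + sum_i lambda_i * (theta_i - theta_base).

    base and each specialist map parameter names to equal-length lists of floats.
    """
    fused = {}
    for name, base_w in base.items():
        fused[name] = list(base_w)  # start from the shared initialization
        for spec, lam in zip(specialists, coeffs):
            for j, (w_s, w_b) in enumerate(zip(spec[name], base_w)):
                # Add the scaled task vector (specialist minus base).
                fused[name][j] += lam * (w_s - w_b)
    return fused
```

Note what the formula leaves out: modules that exist only in one specialist, such as the audio projector, have no base counterpart and are copied as-is. The projector was trained against the audio specialist's backbone, not the fused one, which is exactly the mismatch the paper names Projector Drift.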
Key Results:
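- Scores 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
- Direct fusion causes a severe drop in audio retrieval; performance recovers substantially after Projector Recovery.
- Visualizing the direction of audio tokens across Transformer layers shows that Projector Drift grows with depth.
- Ablations show that multimodal rehearsal is essential for preserving visual retrieval.
- Task-vector fusion coefficients are chosen by grid search to balance visual retention against the audio contribution.
Tech Stack:
- InfoNCE loss
- LoRA (low-rank adaptation)
- Task-vector fusion (Task Arithmetic)
- Projector
- Multimodal large language model (MLLM) backbone
- Audio encoder (e.g., CLAP or a similar architecture)
- L2-normalized embeddings
- Inner-product similarity retrieval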
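The retrieval objective built from the last few items (L2-normalized embeddings, inner-product similarity, InfoNCE, bidirectional averaging) can be written out in plain Python. This is a textbook in-batch InfoNCE, not the paper's training code; the temperature value and function names are assumptions.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, targets, temperature=0.05):
    """Mean cross-entropy of matching query i to target i against in-batch negatives."""
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Inner product of unit vectors = cosine similarity, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def bidirectional_info_nce(a, b, temperature=0.05):
    """Average of the a->b and b->a retrieval losses."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))
```

For example, `bidirectional_info_nce([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])` is close to zero because each query aligns with its own target, while swapping the targets makes the loss large.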
Strengths:
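- Addresses cross-modal optimization conflict in multimodal fusion; decoupled training avoids the seesaw effect between modalities.
- Identifies and resolves Projector Drift, a useful insight for later multimodal fusion work.
- Recovers audio retrieval while preserving visual retrieval, giving genuinely omni-modal retrieval.
- The method is general and could extend to other modalities attached through a projector.
- Rigorous experiments, including visualization and ablation studies.
Limitations:
- Fusion coefficients require manual grid search; there is no adaptive mechanism.
- Projector Recovery is applied to audio only; other modalities that might attach through a projector (e.g., tactile, olfactory) are not considered.
- The method depends on the initial vision-language base model and may inherit its biases or limits.
- Balanced multimodal rehearsal still needs a small amount of multimodal data, so fusion is not fully zero-shot.
- Generalization is not tested on larger or more diverse benchmarks.
Relevance To Keywords:
- Unify Models: the core goal is a unified multimodal embedding model that merges modality specialists into a single model.
- World Models: not addressed directly, though omni-modal embeddings could serve as the perception module of a world model.
- Representation Learning: the paper focuses on learning unified cross-modal representations.
- Model-Based RL: no direct link, though the embedding model could provide state representations for RL.
- Native multimodal large models: the model is built on an MLLM backbone.
- Unified understanding and generation in multimodal large models: the paper covers understanding (retrieval) only, not generation.
- Post-training: the specialist training, fusion, and recovery stages are all post-training techniques.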
摘要翻译
纯文本大型语言模型(LLMs)的隐私风险已被广泛研究,尤其是其倾向于记忆并泄露敏感信息。然而,处理文本和图像的多模态大型语言模型(MLLMs)引入了独特的隐私挑战,这些挑战尚未得到充分探索。与纯文本模型相比,MLLMs 能够提取并暴露图像中嵌入的敏感信息,从而带来新的隐私风险。我们发现,部分 MLLMs 易受隐私泄露影响,会泄露嵌入在图像中或存储于内存中的敏感数据。具体而言,在本文中,我们(1)引入了 MM-Privacy,这是一个综合数据集,旨在评估各种多模态任务和场景中的隐私风险,其中我们定义了披露风险(Disclosure Risks)和保留风险(Retention Risks);(2)利用 MM-Privacy 系统性地评估了不同的 MLLMs,并展示了模型如何在各种任务中泄露敏感数据;(3)进一步探讨了任务不一致性在隐私风险中的作用,强调了缓解策略的迫切需求。我们的发现突出了 MLLMs 中的隐私问题,强调了防止数据暴露的防护措施的必要性。我们的数据集和代码可在此处获取。
Abstract
Privacy risks in text-only Large Language Models (LLMs) are well studied, particularly their tendency to memorize and leak sensitive information. However, Multi-modal Large Language Models (MLLMs), which process both text and images, introduce unique privacy challenges that remain underexplored. Compared to text-only models, MLLMs can extract and expose sensitive information embedded in images, posing new privacy risks. We reveal that some MLLMs are susceptible to privacy breaches, leaking sensitive data embedded in images or stored in memory. Specifically, in this paper, we (1) introduce MM-Privacy, a comprehensive dataset designed to assess privacy risks across various multi-modal tasks and scenarios, where we define Disclosure Risks and Retention Risks. (2) systematically evaluate different MLLMs using MM-Privacy and demonstrate how models leak sensitive data across various tasks, and (3) provide additional insights into the role of task inconsistency in privacy risks, emphasizing the urgent need for mitigation strategies. Our findings highlight privacy concerns in MLLMs, underscoring the necessity of safeguards to prevent data exposure. Our dataset and code can be found here.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 10.0/10 | 15.0 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper explicitly focuses on Multi-modal Large Language Models (MLLM) and MultiModal concepts, warranting top scores (10/10). Unify Models and Visual Encoder are contextually relevant to MLLM architecture but not the core research topic (privacy risks), so they receive moderate scores (4/10). Tokenizer is minimally relevant (2/10). World Models and model-based RL are unrelated to the privacy study (0/10). None of the specified expert authors appear in the author list.
关键词
Multi-modal Large Language Models, Privacy Risks, MM-Privacy Dataset, Disclosure Risks, Retention Risks, Task Inconsistency, Sensitive Information
深度分析
Chinese Title: 揭示多模态大语言模型中的隐私风险:任务特定漏洞与缓解挑战
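Summary: The paper is a systematic study of privacy risks in multimodal large language models (MLLMs). Unlike text-only LLMs, MLLMs process images as well as text and may leak sensitive information embedded in images or private data held in model memory. The authors define two risks: Disclosure Risk (the model directly outputs sensitive information present in its input) and Retention Risk (the model leaks sensitive information memorized during fine-tuning). To measure them, they build the MM-Privacy dataset, with more than 13,000 samples across four scenarios (recruitment, verification, finance, open context) and several tasks such as image captioning and sentence rewriting. Experiments on closed and open models, including GPT-4V, Idefics2, and Llava, find that privacy leakage is widespread, that open models are at markedly higher risk, and that some tasks (indirect ones in particular) bypass the safeguards of closed models more easily. The study shows how task inconsistency shapes privacy risk and argues for task-aware mitigation.
Innovations:
- The first systematic definition and separation of Disclosure Risk and Retention Risk in MLLMs.
- The MM-Privacy dataset: more than 13,000 samples covering multiple scenarios, tasks, and image sources (automatically generated, filled in by hand, context-related images).
- Extensive experiments showing that task type (e.g., captioning, rewriting) strongly affects leakage, with indirect tasks bypassing closed-model safeguards more easily.
- A comparison of closed and open MLLMs showing that open models carry higher leakage risk.
Methodology: The study proceeds in four steps. 1) Synthetic private information (email, phone, SSN, and so on) is generated with the Faker library. 2) Images are produced in three ways: automatically (private information projected onto blank form templates), by hand (printed forms filled in by people and photographed), and as context-related images (generated with Stable Diffusion for the topic and filtered by humans). 3) Four scenarios (recruitment, verification, finance, open context) and multiple tasks (image captioning, sentence rewriting, and others) are designed. 4) Closed models (GPT-4V) and open models (Idefics2, Llava, and others) are evaluated, with attack success rate (ASR) and refusal rate (RR) as the leakage metrics.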
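The two metrics can be sketched with simple string heuristics. The definitions here are assumptions made for illustration, since this digest does not give the paper's exact criteria: a response counts toward ASR if it reproduces the secret verbatim, and toward RR if it contains one of a few hypothetical refusal markers.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "sorry")  # illustrative only


def privacy_metrics(samples):
    """samples: list of (secret, model_response) pairs.

    ASR: share of responses that reproduce the secret verbatim.
    RR:  share of responses that look like refusals.
    """
    leaked = sum(secret.lower() in resp.lower() for secret, resp in samples)
    refused = sum(any(m in resp.lower() for m in REFUSAL_MARKERS)
                  for _, resp in samples)
    n = len(samples)
    return {"ASR": leaked / n, "RR": refused / n}
```

ASR and RR need not sum to one: a response can neither leak nor refuse (for instance, a vague description), which is why the two are reported separately.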
Key Results:
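- Privacy leakage is widespread in MLLMs, and open models are at markedly higher risk than closed models.
- Closed models (e.g., GPT-4V) generally have stronger safeguards, but indirect tasks (e.g., image captioning, sentence rewriting) bypass them more easily.
- Task type and training method strongly affect privacy vulnerability; task inconsistency makes existing safeguards hard to generalize.
- Open models can even output the correct sensitive information from the memorized set, indicating serious Retention Risk.
Tech Stack:
- Faker library (synthetic private information)
- Stable Diffusion (context-related images)
- GPT-4V (closed MLLM under evaluation)
- Idefics2 (open MLLM under evaluation)
- Llava (open MLLM under evaluation)
- Attack success rate (ASR) and refusal rate (RR) as evaluation metrics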
Strengths:
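- The first systematic study of privacy risk in MLLMs, filling a gap in the field.
- MM-Privacy is broad in design, covering multiple scenarios, tasks, and image types.
- Experiments cover several mainstream closed and open models, making the comparison persuasive.
- Shows how task inconsistency affects privacy risk, giving direction for later mitigation work.
Limitations:
- The private information is synthetic and may not fully reflect real-world leakage.
- Only some MLLMs are evaluated; recent models such as Gemini and Claude are not covered.
- No concrete mitigation or defense is proposed; the paper only establishes the severity of the problem.
- Retention Risk is measured by injecting memories during fine-tuning, which may not reflect private information memorized during pretraining.
Relevance To Keywords:
- Multimodal large models: the paper's core subject; directly relevant.
- Representation learning: the relation between representation learning and privacy risk is not examined; weak relevance.
- World models: not addressed; low relevance.
- Model-Based RL: no reinforcement learning is involved; low relevance.
- Post-training: Retention Risk concerns the fine-tuning (post-training) stage, so there is some relevance.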
摘要翻译
多模态大语言模型 (MLLMs) 在视觉推理基准测试中取得了显著成果,但仅凭答案准确率无法判断模型是否依赖于正确的视觉证据。这一差距在用于自动驾驶的多视角驾驶场景中尤为重要,因为模型可能生成看似合理的答案,却 grounding 到了错误的相机视角。我们提出一个多视角视觉问答基准,用于评估证据源识别:给定六个同步的 NuScenes 视图和问题,模型必须识别支持性的相机视角并回答问题。该基准包含来自 73 个场景的 122 个以冲突为中心的问题 - 答案对,涵盖因果性、反事实推理和意图预测。视角标签由自动冲突挖掘管道提出,并由标注者手动验证。我们评估了三种设置:相机视角选择、给定黄金视角的 oracle 问答,以及联合预测,其中模型在一次推理中同时选择视角并回答问题。答案以多项选择题和自由形式两种格式进行评估,结构化预测使用精确匹配,自由形式响应使用 LLM 评判器。通过明确分离视觉源识别与答案正确性,该基准揭示了仅凭答案评估所遗漏的 grounding 失败。
Abstract
Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on evaluating Multi-View MLLMs for visual evidence identification in autonomous driving. MLLM and MultiModal are core concepts appearing in the title and abstract (high scores). Visual Encoder is part of the pipeline but not the main contribution (medium score). Unify Models, Tokenizer, World Models, and model-based RL are unrelated to the specific benchmark task of visual evidence identification in MLLMs (low scores). No expert authors from the provided list are present in the author list, so no bonus points were applied.
关键词
Multi-View MLLMs, Visual Evidence Identification, Autonomous Driving, Benchmark, Grounding Failure, View-Level, NuScenes
深度分析
Chinese Title: 答案从何而来?面向自动驾驶的多视角多模态大模型视角级视觉证据识别基准测试
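Summary: This paper addresses the weak ability of multimodal large language models (MLLMs) to identify the source of their visual evidence in multi-view autonomous-driving scenes, and proposes a multi-view visual question answering benchmark. The motivation is that existing evaluation looks only at answer correctness and ignores whether the model relied on the correct visual evidence; in multi-view driving scenes in particular, a model may answer correctly while relying on the wrong view. The method builds 122 conflict-centric question-answer pairs from the NuScenes dataset, covering causal reasoning, counterfactual reasoning, and intent prediction; designs an automatic conflict-mining pipeline whose golden-view labels are manually verified; and proposes a three-part evaluation protocol of view selection, QA given the golden view, and joint prediction. The main finding is that even the strongest model (Claude) reaches only 82.6% view-selection accuracy, and strict joint accuracy drops sharply under joint prediction. This shows that multi-view evidence localization remains a challenge and that existing evaluation hides evidence-attribution failures.
Innovations:
- The first benchmark for evaluating how well multi-view MLLMs identify the source of visual evidence, separating answer correctness from evidence-source correctness.
- 122 conflict-centric question-answer pairs covering causal, counterfactual, and intent-prediction reasoning, with manually verified golden views.
- A three-part evaluation protocol (view selection, golden-view QA, joint prediction) that systematically separates evidence-localization failures from answer-reasoning failures.
- An LLM judge scores the semantic correctness of free-form answers, so evaluation combines exact match with semantic equivalence.
- Several closed-source and open-source MLLMs are evaluated under one zero-shot protocol, revealing wrong-view/right-answer behavior.
Methodology: The paper uses an event-centric data construction pipeline. It mines conflict events from cached NuScenes kinematic, geometric, and map features, determines candidate golden views by projecting 3D boxes into the six views, generates question-answer pairs, and then verifies them manually. Evaluation covers view selection (exact match), golden-view QA (exact match for multiple choice plus an LLM judge for free-form answers), and joint prediction (strict joint accuracy). All models are evaluated zero-shot with a fixed prompt template and deterministic decoding.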
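The three metrics named in the methodology can be sketched as below. This is a minimal illustration, not the benchmark's released code: the `Prediction` record, the `grounding_report` function, and the single-gold-view assumption are all hypothetical, and `answer_correct` stands in for either the exact-match result or the LLM judge's verdict.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    gold_view: str        # annotated supporting camera, e.g. "CAM_FRONT"
    pred_view: str        # camera the model selected
    answer_correct: bool  # exact match (MC) or judge verdict (free-form)


def grounding_report(preds):
    """Separate view-selection accuracy from answer accuracy.

    Assumes exactly one gold view per question (an assumption of this sketch).
    """
    n = len(preds)
    if n == 0:
        raise ValueError("no predictions to score")
    view_ok = sum(p.pred_view == p.gold_view for p in preds)
    answer_ok = sum(p.answer_correct for p in preds)
    # Strict joint correctness: right view AND right answer on the same item.
    joint_ok = sum(p.pred_view == p.gold_view and p.answer_correct for p in preds)
    # The failure mode that answer-only evaluation cannot see.
    lucky = sum(p.pred_view != p.gold_view and p.answer_correct for p in preds)
    return {
        "view_acc": view_ok / n,
        "answer_acc": answer_ok / n,
        "strict_joint_acc": joint_ok / n,
        "wrong_view_right_answer": lucky / n,
    }
```

Reporting `wrong_view_right_answer` next to `strict_joint_acc` makes the gap between answer accuracy and grounded accuracy explicit.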
Key Results:
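- Claude has the highest view-selection accuracy (82.62%), followed by GPT-5.4 (77.54%) and Gemini (74.10%).
- Open-source models score lower: Qwen3VL-8B and InternVL3 reach about 61.5%, and Qwen2.5VL-7B only 12.62%.
- Strict joint accuracy drops markedly in joint prediction: Claude scores 73.2% on multiple choice and 59.8% on free-form; GPT-5.4 scores 66.9% on multiple choice and 61.3% on free-form.
- In golden-view QA, closed-source models reach about 87-89% multiple-choice accuracy, but free-form accuracy falls to 73-80%.
- Qwen2.5VL-7B is at roughly chance level on view selection (12.62%, against about 16.7% for a uniform guess if exactly one of the six views is correct), indicating that multi-view evidence localization is extremely hard for small models.
Tech Stack:
- NuScenes dataset (six camera views)
- 3D bounding-box projection into camera views
- Exact match
- LLM judge for semantic evaluation of free-form answers
- Zero-shot prompting
- Deterministic decoding
- 95% confidence intervals (95% CI)
- Standard deviation (SD)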
Strengths:
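- Focuses on the overlooked problem of identifying the source of visual evidence, which matters for safety.
- Rigorous dataset construction, with automatic mining followed by manual verification to keep golden views reliable.
- A comprehensive evaluation protocol that separates view selection from answer reasoning and exposes hidden attribution failures.
- Covers several model families (GPT, Gemini, Claude, Qwen, InternVL), so the results are broadly representative.
- Supports both multiple-choice and free-form answers for a fuller evaluation.
Limitations:
- The dataset is small (122 QA pairs), which may limit statistical significance and generalization.
- Only NuScenes is used, so scene diversity is limited.
- No training set is provided; the benchmark is test-only and cannot be used to optimize models.
- Free-form evaluation depends on an LLM judge, which may introduce evaluation bias.
- Temporal or multi-frame information is not considered; only single-frame six-view input is used.
Relevance To Keywords:
- Multimodal large models (MLLMs): the paper directly evaluates how well MLLMs identify visual evidence in multi-view scenes.
- World models: the paper involves scene understanding, causal reasoning, and intent prediction, which relate to environment modeling in world models.
- Representation learning: the view-selection task evaluates how well models use multi-view visual representations.
- Reinforcement learning: not directly involved, although post-training could be used to improve view localization.
- Post-training: the evaluation is zero-shot, but the results can guide post-training strategies that strengthen evidence attribution.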
摘要翻译
第一人称视角视觉提供了人类感知与决策的视角,但其在交通安全预测方面的潜力仍未得到充分探索。在这项工作中,我们研究从短的第一人称视角视频片段中解码行人过街意图。我们通过将任务构建为封闭式视觉问答(VQA)问题,并利用视觉语言模型(VLMs)来预测行人意图来解决这一问题。我们首先在零样本设置下对三类最先进的视觉语言模型(VLMs)进行基准测试,发现它们相比随机猜测取得了适度提升,但表现出有限的高层级交通推理能力。鉴于这些发现,我们进一步利用参数高效微调将视觉语言模型(VLMs)适配到目标任务。结果表明,微调后的模型显著优于其零样本对应模型,并在准确率上比专门的基于 Transformer(变换器)的基线提高了 9%。最后,我们证明纳入额外的上下文线索,包括自身运动、车辆运动和眼动,进一步提高了预测性能。特别是,受眼动和自身运动引导的微调 Qwen3-VL-2B 模型在准确率上比 Transformer 基线提高了 14.5%,确立了第一人称视角行人意图解码的新最先进水平。
Abstract
Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
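评分理由: The core method uses vision language models (MLLMs) to process multimodal data, so MLLM and MultiModal score high. Visual Encoder and Tokenizer are implicit model components rather than contributions, so they receive low-to-moderate scores. Unify Models touches on modality unification but is not the paper's focus. World Models and model-based RL are unrelated to the intent-prediction task and score low. The author list contains none of the designated experts.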
关键词
Egocentric Vision, Pedestrian Intention, Vision Language Models, Parameter-efficient Fine-tuning, Contextual Cues, Traffic Safety, Zero-shot Benchmark
深度分析
Chinese Title: 基于视觉语言模型从自我中心视觉解码行人过街意图
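Summary: This paper studies the decoding of pedestrian crossing intention from short egocentric video clips. It casts the task as a closed-ended visual question answering (VQA) problem and predicts with vision language models (VLMs). It first evaluates three VLM families (Qwen3-VL, InternVL3, GroundVQA) zero-shot and finds moderate gains over random guessing but little higher-level traffic reasoning. It then adapts the models with parameter-efficient fine-tuning (e.g., LoRA); the fine-tuned models substantially outperform their zero-shot counterparts and improve accuracy by 9% over a specialized Transformer baseline. Finally it adds contextual cues (ego motion, vehicle motion, eye gaze): Qwen3-VL-2B guided by eye gaze and ego motion improves accuracy by 14.5% over the Transformer baseline, a new SOTA. The paper also designs three prompting strategies (standard, chain-of-thought, and visual prompts) and verifies that visual prompting is effective.
Innovations:
- Recasts pedestrian crossing-intent prediction as a closed-ended VQA problem, drawing on the visual understanding and language reasoning of VLMs.
- Systematically evaluates several VLMs in zero-shot and fine-tuned settings, exposing the limits of their traffic reasoning.
- Adapts the models to traffic-specific prediction with parameter-efficient fine-tuning (e.g., LoRA), which substantially improves performance.
- Adds contextual cues such as eye gaze and ego motion, which further improve accuracy.
- Sets a new SOTA for egocentric pedestrian intent decoding, with a 14.5% accuracy improvement.
Methodology: The paper uses VLMs for pedestrian crossing-intent prediction. The input is a 2-second egocentric video plus optional context, and the output is a binary label, "crossing" or "waiting". The models include MLLMs such as Qwen3-VL and InternVL3 and VLP models such as GroundVQA. Three prompting strategies are designed: standard prompts, textual chain-of-thought (CoT) prompts, and visual prompts (GroundingDINO detects targets and markers are overlaid on them). Parameter-efficient fine-tuning (e.g., LoRA) adapts the models to the task. The context covers ego motion, vehicle motion, eye gaze, and personal attributes. Video encoding supports both direct video input and interleaved-frame input.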
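The closed-ended VQA formulation can be illustrated with a small prompt builder and answer parser. The prompt wording, the context keys, and the function names `build_intent_prompt` and `parse_intent` are hypothetical; the paper's actual templates are not reproduced here.

```python
def build_intent_prompt(context=None):
    """Compose a closed-ended question with optional textual context cues.

    `context` maps a cue name (e.g. "Ego motion") to a short description.
    The wording is illustrative only, not the paper's template.
    """
    lines = ["The clip shows 2 seconds recorded from a pedestrian's first-person view."]
    for name, value in (context or {}).items():
        lines.append(f"{name}: {value}")
    lines.append("Question: Will the pedestrian cross the street or wait?")
    lines.append("Answer with exactly one word: cross or wait.")
    return "\n".join(lines)


def parse_intent(reply):
    """Map a free-text reply to 'cross', 'wait', or None if ambiguous."""
    tokens = reply.lower().replace(".", " ").replace(",", " ").split()
    hits = {t for t in tokens if t in ("cross", "wait")}
    # Reject replies that mention both labels or neither.
    return hits.pop() if len(hits) == 1 else None


prompt = build_intent_prompt({"Ego motion": "slowing down", "Eye gaze": "toward the road"})
```

Restricting the answer space to two tokens keeps scoring a simple exact match, which is the main practical benefit of the closed-ended framing.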
Key Results:
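- Zero-shot VLMs achieve moderate gains over random guessing but lack higher-level traffic reasoning.
- Fine-tuned VLMs substantially outperform zero-shot ones and improve accuracy by 9% over the specialized Transformer baseline.
- With eye gaze and ego motion added, Qwen3-VL-2B improves accuracy by 14.5% over the Transformer baseline, a new SOTA.
- Visual prompting (Set-of-Mark) improves performance in both zero-shot and fine-tuned settings.
- After fine-tuning, Qwen3-VL-2B outperforms larger models, showing the advantage of parameter-efficient fine-tuning.
Tech Stack:
- Qwen3-VL-2B/8B-Instruct
- Qwen2.5-VL-7B-Instruct
- InternVL3-2B/8B
- GroundVQA (VLP)
- GroundingDINO (open-vocabulary detection)
- SORT (object tracking)
- LoRA (parameter-efficient fine-tuning)
- Chain-of-thought (CoT) prompting
- Visual prompting (Set-of-Mark)
- Multimodal rotary position embedding (M-RoPE)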
Strengths:
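- Applies VLMs to pedestrian intent prediction in a novel way, extending their use in traffic safety.
- Systematically compares several VLMs and prompting strategies; the experimental design is comprehensive.
- Parameter-efficient fine-tuning lets small models reach strong performance at lower compute cost.
- Contextual cues such as eye gaze match human cognitive mechanisms and improve prediction accuracy.
- The code is open source, so reproducibility is strong.
Limitations:
- Only 2-second videos and a 1-second prediction window are used, which may not handle more complex long-term intentions.
- Eye-gaze data requires dedicated hardware, which limits real-world deployment.
- Zero-shot reasoning is limited; the models depend on fine-tuning for specific scenarios.
- Performance under extreme conditions such as bad weather or night is not studied.
- The dataset may be limited in size, so generalization remains to be verified.
Relevance To Keywords:
- Native multimodal large models: the paper directly uses native multimodal large models such as Qwen3-VL and InternVL3 and tests their ability in traffic scenes.
- Representation learning: intent is decoded from the VLM's vision-language aligned representations, but the representation-learning mechanism is not examined in depth.
- World models: the paper does not build a world model or predict future states; weak relevance.
- Reinforcement learning: no reinforcement learning method is used; weak relevance.
- Post-training: the paper uses parameter-efficient fine-tuning (LoRA) as its post-training method, so it falls within post-training.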
摘要翻译
学习日常技能(如烹饪一道菜)正日益依赖于在线视频等教学媒体。这使得将视频(及多模态)大语言模型(LLMs)用作任务指导助手成为可能。一个潜在任务指导助手在现实世界中取得成功的关键能力在于,一旦错误显而易见,它便能主动干预以引导用户。为了评估这一关键能力,我们引入了 Ego-MC-Bench(错误修正),这是一个用于评估真实烹饪场景中反应式、逐步任务指导的基准。广泛的实验表明,Ego-MC-Bench 对最先进的视频大语言模型(LLMs)极具挑战性。我们认为,关键原因是用于在此任务上微调模型的训练数据可用性有限。尽管存在广泛的烹饪视频数据集,但现有数据集缺乏错误示例以及适时干预的示例。为了解决这一数据限制,我们还引入了 Ego-CoMist,这是一个反事实合成数据集,通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明,在 Ego-CoMist 上微调会带来性能提升,尤其对于更小且更高效的视频大语言模型(LLMs),这些模型非常适合在边缘设备上提供协助。
Abstract
Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
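评分理由: The paper focuses on the mistake-correction ability of video LLMs (MLLMs) in multimodal task guidance, so MLLM and MultiModal score high (9.0). It does not cover the architectural design of tokenizers, visual encoders, world models, or model-based RL, so those score low (1.0-2.0). Unify Models is only implicit in the multimodal architecture (4.0). The weighted total is 40.5, above the dynamic passing score of 27.8. The author list contains none of the designated experts, so no bonus points were applied.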
关键词
Video Large Language Models, Mistake Correction, Ego-MC-Bench, Ego-CoMist, Task Guidance, Fine-tuning, Multimodal
深度分析
Chinese Title: 流式干预:视频大语言模型能否在错误发生时即时纠正?
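Summary: This paper targets the weak proactive-intervention ability of video LLMs in real-time task guidance, and introduces the Ego-MC-Bench benchmark and the Ego-CoMist synthetic dataset. The motivation is that existing models and benchmarks are mostly built on non-interactive data and do not evaluate immediate feedback and correction at the moment a mistake occurs. The authors build the benchmark by recording interactive cooking videos with human participants, capturing mistakes, feedback, and user reactions, and use counterfactual synthesis to turn non-interactive videos into training data annotated with interventions. Experiments show that current state-of-the-art video LLMs perform poorly on the benchmark, while small, efficient models fine-tuned on Ego-CoMist improve markedly on mistake intervention. The work provides an evaluation method and a data foundation for real-time task-guidance assistants that can be deployed on edge devices.
Innovations:
- The first benchmark, Ego-MC-Bench, for evaluating proactive intervention by video LLMs in streaming tasks, containing mistakes, feedback, and user reactions from real interactive scenes.
- The Ego-CoMist synthetic dataset, which uses counterfactual annotation to convert non-interactive cooking videos into training samples with well-timed intervention supervision, addressing the scarcity of training data.
- Experiments showing that small, efficient video LLMs fine-tuned on Ego-CoMist improve markedly on mistake intervention and are suitable for edge deployment.
- A systematic analysis of existing datasets along four dimensions ("multi-step goal-driven", "step-by-step instructions", "timed feedback", and "reactive participants"), which highlights what is distinctive about this work.
Methodology: The benchmark is built by recording interactive cooking videos with human participants. Participants wear head-mounted cameras while a coach behind them gives real-time instructions and feedback, simulating a real assistant scenario. Manual transcription and verification by independent annotators ensure data quality. For the synthetic dataset, counterfactual techniques replace normal actions in non-interactive videos with erroneous ones and generate the corresponding intervention feedback, producing supervised training samples. For evaluation, a range of video LLMs (both open-source and commercial) are tested zero-shot and after fine-tuning on Ego-MC-Bench, comparing their ability to detect mistakes and give timely feedback.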
Key Results:
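- Ego-MC-Bench contains about 10 hours of video, 40 recordings, 7 participants, 559 recipe steps, and 954 feedback messages, of which 49.2% are anticipatory feedback and 50.8% are after-the-fact feedback.
- Current state-of-the-art video LLMs perform poorly on the mistake-intervention task in Ego-MC-Bench, showing that the benchmark is highly challenging.
- After fine-tuning on Ego-CoMist, small, efficient video LLMs improve markedly in intervention accuracy and timeliness, approaching or even exceeding some large models.
- The effectiveness of training on synthetic data supports the potential of counterfactual data generation for task guidance.
Tech Stack:
- Video LLMs
- Counterfactual synthetic data generation
- Streaming video processing
- Multimodal learning
- Proactive intervention task
- Transcription and manual verification
- CaptainCook4D mistake taxonomy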
Strengths:
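- The benchmark design reflects real application scenarios, includes interactive feedback and user reactions, and evaluates along several dimensions.
- The synthetic data generation method is novel, eases the scarcity of training data, and is shown to be effective.
- Experiments cover several model scales and pay particular attention to small, efficient models for edge deployment, which has practical value.
- Existing datasets are compared systematically, which clearly positions the contribution of this work.
Limitations:
- The benchmark covers only cooking, so generalization is unverified; other everyday skills (such as assembly or fitness) may need additional data.
- The synthetic data rests on counterfactual assumptions and may deviate from the real distribution of mistakes, which could affect generalization.
- The benchmark is relatively small (10 hours) and may not be enough to fully evaluate models on complex multi-step tasks.
- Robustness across cultures or different kitchen environments is not studied.
Relevance To Keywords:
- Unify Models, World Models, Representation Learning, Model-Based RL: the paper focuses on video LLMs in real-time interactive scenarios and touches on multimodal representation learning and world-model construction (through understanding states and actions in the video stream), but it does not directly involve reinforcement learning or model predictive control.
- Native multimodal large models, unified understanding and generation in multimodal large models: the paper directly evaluates the understanding (detecting mistakes) and generation (giving feedback) abilities of video LLMs, which is highly relevant to unified understanding and generation.
- Representation learning, world models, reinforcement learning, post-training: the paper improves performance by fine-tuning on synthetic data (post-training) and involves representation learning (video-text alignment), but it does not explicitly use world models or reinforcement learning.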
摘要翻译
通用化少样本语义分割(GFSS)传统上被视为一个表示学习问题,需要通过任务特定适应,从有限的支持样本中纳入新类别。然而,近期基础模型已展现出强大的开放词汇识别与分割能力,由此引出了另一个问题:GFSS 能否通过推理时冻结语义先验的协调来解决,而非依赖参数适应?我们提出 Open-V 来回答这一问题,这是一个无训练的 GFSS 框架,它通过校准的逐像素语义仲裁,将 Segment Anything (SAM3) 的可提示概念分割(PCS)与 K-shot CLIP 支持中心点相结合。Open-V 不引入任何可训练组件,并支持在推理时间处理任意语义类别。除了分割性能之外,本研究还提出了三个更广泛的发现。首先,我们表明支持信息可以通过推理时的语义锚定纳入,且当基础模型的文本先验在标签不交集词汇上减弱时,其贡献随之增加。其次,我们在基础模型分割中发现了一个可复现性混淆因素,表明预处理和评估空间的差异可能会无声地扭曲报告的性能。最后,我们在 PASCAL5i、COCO-20i 和 ADE-OW 上验证了 Open-V,表明基础模型先验的无训练协调既能泛化于常规 GFSS 设置,也能泛化于开放词汇评估设置。在 PASCAL-5i (1-shot) 上,Open-V 取得了基类/新类/调和 mIoU 分别为 78.4/77.5/77.9,在不进行 GFSS 特定训练的情况下,超越了最强的训练基线,调和平均 (HM) 提升了 17.7。
Abstract
Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recognition and segmentation capabilities, raising a different question: can GFSS be solved through inference-time coordination of frozen semantic priors rather than parameter adaptation? We answer this question with Open-V, a training-free GFSS framework that combines Segment Anything (SAM3) Promptable Concept Segmentation (PCS) with a K-shot CLIP support centroid through calibrated per-pixel semantic arbitration. Open-V introduces no trainable components and supports arbitrary semantic categories at inference time. Beyond segmentation performance, our study contributes three broader findings. First, we show that support information can be incorporated through inference-time semantic grounding, and that its contribution increases as foundation-model text priors weaken on label-disjoint vocabularies. Second, we identify a reproducibility confound in foundation-model segmentation, demonstrating that preprocessing and evaluation-space mismatches can silently distort reported performance. Finally, we validate Open-V across PASCAL-5i, COCO-20i, and ADE-OW, showing that training-free coordination of foundation-model priors generalizes across both conventional GFSS and open-vocabulary evaluation settings. On PASCAL-5i (1-shot), Open-V attains base/novel/harmonic mIoU of 78.4/77.5/77.9, surpassing the strongest trained baseline by +17.7 HM without GFSS-specific training.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: Paper proposes Open-V for training-free GFSS using SAM and CLIP. High relevance to Visual Encoder (8) and MultiModal (8) due to vision-language integration. Moderate relevance to Unify Models (6) as it coordinates existing models. Low relevance to Tokenizer (2) and MLLM (3) as tokenization is incidental and CLIP is not a generative LLM. Zero relevance to World Models and RL. No target experts found.
关键词
Generalized Few-Shot Segmentation, Training-Free, Open-Vocabulary, Semantic Arbitration, Segment Anything, CLIP, Foundation Models
深度分析
Chinese Title: 通过开放词汇语义仲裁的无训练广义少样本分割
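Summary: This paper proposes Open-V, a fully training-free generalized few-shot segmentation (GFSS) framework that coordinates frozen SAM3 and CLIP priors at inference time. Traditional GFSS relies on parameter adaptation, whereas Open-V uses SAM3's Promptable Concept Segmentation (PCS) and a CLIP support centroid, and selects classes through calibrated per-pixel semantic arbitration. The method supports arbitrary class vocabularies and needs no training. The study finds that support information can be injected through inference-time semantic grounding and that its contribution grows as foundation-model text priors weaken; it also identifies and corrects a spatial-alignment confound in foundation-model segmentation. Open-V is validated on PASCAL-5i, COCO-20i, and ADE-OW. In the 1-shot setting on PASCAL-5i, base/novel/harmonic mIoU reach 78.4/77.5/77.9, surpassing the strongest trained baseline by +17.7 HM.
Innovations:
- An open-vocabulary semantic arbitration framework that recasts GFSS as inference-time coordination of frozen priors rather than parameter adaptation.
- Inference-time semantic grounding: the few-shot signal is injected by post-hoc re-ranking with a CLIP support centroid, with no trained prompt head.
- Diagnosis and correction of a spatial-alignment confound in foundation-model segmentation, revealing how a mismatch between preprocessing and evaluation spaces distorts performance.
- Cross-benchmark validation: strong performance in both closed-vocabulary (PASCAL-5i, COCO-20i) and open-vocabulary (ADE-OW) settings.
Methodology: A frozen SAM3 ViT-L serves as the base segmenter; through the PCS interface it produces instance masks and calibrated presence scores for each class. Base classes use the PCS output directly. For novel classes, a K-shot CLIP support centroid (L2-normalized) is computed and the SAM3 instances are re-ranked by cosine similarity. A presence threshold τ_PCS = 0.20 filters low-confidence instances; after merging, a second SAM3 pass with box prompts refines the boundaries, and a per-pixel arg-max yields the label map. The whole pipeline has no trainable parameters.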
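The centroid re-ranking and per-pixel arbitration steps can be sketched in plain Python. This is an illustration under assumptions, not the Open-V implementation: embeddings are plain lists standing in for CLIP features, the filter-then-rank order is assumed, and all function names are hypothetical. Only the L2 normalization, the cosine similarity, the 0.20 threshold, and the arg-max come from the description above.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def support_centroid(support_embeddings):
    """Mean of the L2-normalized K-shot embeddings, normalized again."""
    normed = [l2_normalize(e) for e in support_embeddings]
    dim = len(normed[0])
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def rerank_instances(instances, centroid, tau_pcs=0.20):
    """instances: list of (presence_score, embedding) pairs.

    Drops instances below the presence threshold, then sorts the rest by
    cosine similarity to the support centroid (highest first).
    """
    kept = []
    for presence, emb in instances:
        if presence < tau_pcs:
            continue
        sim = sum(a * b for a, b in zip(l2_normalize(emb), centroid))
        kept.append((sim, presence, emb))
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept


def pixelwise_argmax(score_maps):
    """score_maps: {label: 2-D list of scores}; returns a 2-D list of labels."""
    labels = list(score_maps)
    height = len(score_maps[labels[0]])
    width = len(score_maps[labels[0]][0])
    return [
        [max(labels, key=lambda c: score_maps[c][y][x]) for x in range(width)]
        for y in range(height)
    ]
```

Because every step is a fixed function of frozen features, adding a new class only requires its K support embeddings, which is what makes the pipeline training-free.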
Key Results:
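- PASCAL-5i 1-shot: base mIoU 78.4, novel mIoU 77.5, harmonic mean (HM) 77.9, surpassing the strongest trained baseline by +17.7 HM.
- The contribution of support information grows as foundation-model text priors weaken: on label-disjoint vocabularies the optimal weight of the visual centroid shifts and the peak gain grows 6.6x.
- A spatial-alignment confound is identified: inconsistency among the preprocessing, prediction, and evaluation spaces can significantly distort performance estimates.
- Cross-benchmark generalization is validated on COCO-20i and ADE-OW.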
- SAM3 (Segment Anything 3) with Promptable Concept Segmentation (PCS)
- CLIP ViT-B/16
- ViT-L backbone
- L2归一化
- 余弦相似度
- 校准的sigmoid乘积(质量logit与存在logit)
- 存在阈值τPCS=0.20
- 逐像素arg-max仲裁
Strengths:
- 完全无需训练,降低计算成本和数据依赖。
- 支持任意类别词汇,灵活适应开放词汇场景。
- 通过推理时语义接地有效利用少样本信息。
- 诊断并解决了基础模型评估中的空间对齐混淆问题,提高可重复性。
- 在多个基准上取得显著优于有监督方法的性能。
Limitations:
- 依赖SAM3和CLIP的预训练质量,若基础模型在特定领域表现不佳则性能受限。
- 存在阈值τPCS为固定值,可能不适用于所有场景。
- 仅使用单次SAM3骨干前向,但多次解码可能增加推理时间。
- 未探索与其他基础模型(如DINOv2)的融合。
- 在极端少样本(如1-shot)下,CLIP质心可能受支持图像质量影响。
Relevance To Keywords:
- Unify Models: 论文使用SAM3和CLIP两个基础模型,通过推理时协调实现统一分割,体现了模型统一的思想。
- World Models: 不直接相关,论文未涉及世界模型或环境建模。
- Representation Learning: 论文利用冻结的视觉-语言表征(CLIP、SAM3)进行推理,未进行表征学习,但依赖预训练表征。
- Model-Based RL: 不相关。
- 原生多模态大模型: 论文使用SAM3(视觉)和CLIP(视觉-语言)两个多模态大模型,但未涉及生成或理解一体化。
- 多模态大模型的理解和生成一体化: 不直接相关,论文仅使用理解能力(分割与分类),未涉及生成。
- 表征学习: 同上,依赖预训练表征而非学习新表征。
- 世界模型: 不相关。
- 强化学习: 不相关。
- 后训练: 论文明确强调无训练,与后训练相反。
摘要翻译
开放域开放词汇检测(ODOVD)要求检测器能够泛化至新类别和未见域,相较于开放词汇检测更具挑战性。现有方法通常从头开始联合训练开放词汇检测器与域泛化模块,导致训练成本较高。本文提出 ExDet,一种面向 ODOVD 的轻量级类别 - 域协作泛化框架,旨在增强现有检测器的跨类别与跨域泛化能力。ExDet 由文本引导外推(TGE)、轻量级检测器兼容校正(DCR)模块以及 ExRPN 组成。具体而言,TGE 利用视觉语言模型(VLMs)的 DeltaSpace 属性,从文本中推断出类别与域感知的代理视觉原型。DCR 基于 TGE 生成的原型进行学习,无需检测器训练及真实数据,在推理阶段插入分类头之后,将表示校正至检测器兼容的源域视觉分布,从而增强对新类别及未见域目标的分类性能。ExRPN 通过结合语义相似性与 RPN 置信度重新校准提案分数,提升新类别及域偏移对象的召回率,同时为后续分类及 DCR 提供更好的支持。ExDet 在 OD-LVIS、OV-LVIS、Objects365 和 MSOSB 上均达到了 SOTA 性能。
Abstract
Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and unseen domains, making it more challenging than open-vocabulary detection. Existing methods typically train open-vocabulary detectors together with domain generalization modules from scratch, leading to high training cost. We propose ExDet, a lightweight category-domain collaborative generalization framework for ODOVD that enhances the cross-category and cross-domain generalization of existing detectors. ExDet consists of Text-Guided Extrapolation (TGE), a lightweight Detector-Compatible Rectification (DCR) module, and ExRPN. Specifically, TGE exploits the DeltaSpace property of vision-language models (VLMs) to infer category- and domain-aware proxy visual prototypes from text. DCR is learned from the TGE-generated prototypes in a detector training-free and real-data-free manner, and is inserted after the classification head at inference to rectify representations toward a detector-compatible source-domain visual distribution, thereby enhancing classification for targets from novel categories and unseen domains. ExRPN recalibrates proposal scores by combining semantic similarity with RPN confidence, improving recall for novel and domain-shifted objects while providing better support for subsequent classification and DCR. ExDet achieves SOTA performance on OD-LVIS, OV-LVIS, Objects365, and MSOSB.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 7.0/10 | 10.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
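评分理由: The paper focuses on open-domain open-vocabulary detection (ODOVD) and uses vision-language models (VLMs) for cross-modal generalization, so it is highly relevant to MultiModal and MLLM. It processes visual features but proposes no new encoder or tokenizer, and it is unrelated to World Models and model-based RL. Unify Models is moderately relevant, because the framework unifies generalization in detection rather than model architectures.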
关键词
Open-Domain Open-Vocabulary Detection, Cross-modal Extrapolation, Detector-Compatible Rectification, Vision-Language Models, Generalization, Text-Guided Extrapolation, Object Detection
深度分析
Chinese Title: ExDet:基于跨模态外推与校正的开放域开放词汇检测
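Summary: Open-domain open-vocabulary detection (ODOVD) requires a detector to generalize to both novel categories and unseen domains, which is harder than open-vocabulary detection. Existing methods usually train the open-vocabulary detector and the domain generalization modules from scratch, which makes training expensive. This paper proposes ExDet, a lightweight category-domain collaborative generalization framework that improves cross-category and cross-domain generalization without retraining the detector, using Text-Guided Extrapolation (TGE), a Detector-Compatible Rectification (DCR) module, and ExRPN. TGE uses the DeltaSpace property of vision-language models (VLMs) to infer category- and domain-aware proxy visual prototypes from text. DCR is learned without detector training and without real data; at inference it is inserted after the classification head and rectifies representations toward a detector-compatible source-domain visual distribution, improving classification of targets from novel categories and unseen domains. ExRPN recalibrates proposal scores by combining semantic similarity with RPN confidence, raising recall for novel and domain-shifted objects. Experiments show that ExDet achieves state-of-the-art performance on OD-LVIS, OV-LVIS, Objects365, and MSOSB.
Innovations:
- The ExDet framework, which improves the generalization of existing two-stage detectors under joint category-domain shift without detector training and without real data.
- Text-Guided Extrapolation (TGE), which uses the DeltaSpace property of VLMs to build category- and domain-aware proxy visual prototypes from text descriptions.
- A lightweight Detector-Compatible Rectification (DCR) module, trained independently of the detector, that rectifies classification representations at inference to improve discrimination of novel categories and unseen domains.
- ExRPN, which recalibrates proposal confidence with semantic similarity at inference to raise recall for novel and domain-shifted objects.
Methodology: ExDet builds on a frozen two-stage open-vocabulary detector and has three core components. TGE uses the DeltaSpace property of VLMs to extrapolate cross-category, cross-domain proxy visual prototypes from semantic relations in text. DCR is a lightweight module trained independently under supervision from the TGE-generated prototypes; at inference it is inserted after the classification head and rectifies representations toward a detector-compatible source-domain visual distribution. ExRPN combines semantic similarity with RPN confidence at inference to recalibrate proposal scores and raise recall. The framework does not retrain the detector, and its training takes only about 30 minutes on a single RTX 3090 GPU.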
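The idea behind ExRPN, blending RPN confidence with semantic similarity before ranking proposals, can be illustrated as follows. The exact fusion rule is not given in this report, so the weighted geometric mean, the `alpha` parameter, and both function names are assumptions of this sketch rather than the paper's formula.

```python
def recalibrate(rpn_conf, semantic_sim, alpha=0.5):
    """Blend RPN confidence with semantic similarity (both clamped to [0, 1]).

    Hypothetical fusion rule: a weighted geometric mean, so a proposal needs
    support from both signals to rank highly. alpha=0 keeps the RPN score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rpn_conf = min(max(rpn_conf, 0.0), 1.0)
    semantic_sim = min(max(semantic_sim, 0.0), 1.0)
    return (rpn_conf ** (1.0 - alpha)) * (semantic_sim ** alpha)


def rank_proposals(proposals, alpha=0.5):
    """proposals: list of (box_id, rpn_conf, semantic_sim); best first."""
    scored = [(recalibrate(conf, sim, alpha), box_id) for box_id, conf, sim in proposals]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [box_id for _, box_id in scored]
```

Under this rule a box that the RPN scores modestly but that matches the text semantics well can overtake a confident but semantically weak box, which is the recall effect the method description attributes to ExRPN.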
Key Results:
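- ExDet achieves state-of-the-art performance on the OD-LVIS, OV-LVIS, Objects365, and MSOSB benchmarks.
- On OD-LVIS, ExDet markedly improves cross-category and cross-domain generalization.
- Training takes only about 30 minutes on a single RTX 3090 GPU, so the method is efficient.
- Experiments confirm that TGE and DCR improve classification discrimination and that ExRPN improves proposal recall.
Tech Stack:
- Vision-language model (VLM): CLIP
- DeltaSpace property
- Two-stage detectors (e.g., Faster R-CNN)
- Text-Guided Extrapolation (TGE)
- Detector-Compatible Rectification (DCR) module
- ExRPN (proposal score recalibration)
- PCA visualization analysis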
Strengths:
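- A lightweight framework that does not retrain the detector, so training cost is low.
- Handles category and domain shift together, addressing the harder ODOVD problem.
- The method is general and can be applied to existing two-stage detectors.
- Achieves state-of-the-art performance on several benchmarks, with strong generalization.
Limitations:
- The method mainly targets two-stage detectors; its applicability to single-stage detectors is not verified.
- It depends on the DeltaSpace property of VLMs and may be limited by the VLM's representational capacity.
- The rectification quality of the DCR module may depend on how representative the source-domain visual distribution is.
- Experiments are run only on specific datasets; generalization to real-world scenarios needs further study.
Relevance To Keywords:
- Unify Models: ExDet unifies category and domain generalization through cross-modal extrapolation and rectification, which reflects the idea of model unification.
- World Models: the method uses the DeltaSpace property of VLMs to model semantic shifts in the visual world, which relates to prediction and reasoning in world models.
- Representation Learning: TGE and DCR involve joint learning and rectification of visual and text representations, which falls within representation learning.
- Model-Based RL: reinforcement learning is not directly involved, but the rectification mechanism of DCR can be compared to model-based policy adjustment.
- Native multimodal large models: the method builds on native multimodal models such as CLIP and uses their cross-modal alignment.
- Unified understanding and generation in multimodal large models: extrapolating visual prototypes from text combines understanding with generation.
- Post-training: the DCR module is trained independently after the detector, so it is a post-training optimization strategy.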
摘要翻译
我们研究预训练视频基础模型是否在冻结表示中编码了直觉物理信息,以及该信息如何在不同模型家族、层和探针类型之间变化。通过在 IntPhys2 和最小视频对 (MVP) 上采用冻结特征探针方法,我们比较了预测性联合嵌入模型 (V-JEPA)、掩码重建模型 (VideoMAE) 以及基于扩散的视频生成器 (LTX-Video)。V-JEPA 在所有基准测试中取得了最强的整体结果,尤其是在建模时间动态的探针下,而 VideoMAE 仍具有竞争力,LTX-Video 则恢复了较弱但非平凡信号。逐层分析显示,与物理相关的信息在早期层最弱,并在中间到深层变得最易获取;时序控制实验表明,破坏帧顺序会显著降低性能,尤其是在 MVP 上。综上所述,这些结果表明直觉物理知识在预训练视频表示中可靠地涌现,但其可访问性强烈依赖于预训练范式、表示深度和读出机制。
Abstract
We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 5.0/10 | 7.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper probes intuitive physics in video foundation models (V-JEPA, VideoMAE, LTX-Video). 'Visual Encoder' is highly relevant (8.0) as frozen feature probing relies on encoder representations. 'World Models' is moderately relevant (5.0) due to the focus on physics understanding. Other keywords are peripheral (1.0-3.0) because the study compares existing models without language integration (MLLM/MultiModal), tokenization focus, RL training, or model unification.
关键词
Video Foundation Models, Intuitive Physics, Layerwise Probing, Frozen-feature Probing, Temporal Dynamics, Pretrained Representations, V-JEPA, VideoMAE
深度分析
Chinese Title: 视频基础模型理解直观物理吗?逐层探测分析
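Summary: This paper studies whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how that information varies with model family, layer depth, and probe type. Using frozen-feature probing on the IntPhys2 and MVP benchmarks, it compares three main pretraining paradigms: masked reconstruction (VideoMAE), predictive joint embedding (V-JEPA), and diffusion-based generation (LTX-Video). V-JEPA is strongest overall, especially with probe heads that model temporal dynamics; VideoMAE is competitive; LTX-Video is weaker but still carries a non-trivial signal. Layerwise analysis shows that physics-relevant information is weakest in early layers and most accessible at intermediate-to-late depth. Temporal controls (shuffling frame order) reduce performance substantially, especially on MVP. The conclusion is that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on the pretraining paradigm, the representational depth, and the readout mechanism.
Innovations:
- The first systematic comparison of three main video pretraining paradigms (masked reconstruction, joint-embedding prediction, diffusion generation) on intuitive-physics probing under one frozen-feature protocol.
- Layerwise probing that reveals how physics information is distributed across depth, finding it richest at intermediate-to-late layers.
- Probe heads of different expressive power (linear, MLP, attentive temporal probe), used to assess how decodable physics information is under different readout mechanisms.
- Temporal control conditions (single frame, temporal shuffle) that separate intuitive-physics understanding from generic temporal modeling.
- Evaluation on two strict benchmarks, IntPhys2 and MVP, with VOE and paired-consistency metrics that reduce the effect of benchmark-specific shortcuts.
Methodology: Frozen-feature probing extracts features from each layer of a pretrained video model and trains linear, MLP, and attentive temporal probe heads to output a physical-plausibility score. On IntPhys2 the paper computes Violation of Expectation (VOE) accuracy and clip accuracy; on MVP it computes paired consistency. It also runs control experiments, single-frame input (removing temporal information) and temporal shuffling (breaking temporal order), to measure how much temporal dynamics contribute to performance.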
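The temporal controls and a pair-level score can be sketched in a few lines. This is one plausible reading rather than the paper's protocol: repeating the middle frame for the single-frame control and counting a pair as correct only when the possible clip scores strictly higher than the impossible one are both assumptions, and the function names are hypothetical.

```python
import random


def shuffle_frames(frames, seed=0):
    """Temporal-shuffle control: same frames, order destroyed."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def single_frame(frames):
    """Single-frame control: repeat the middle frame to keep clip length."""
    middle = frames[len(frames) // 2]
    return [middle] * len(frames)


def pair_accuracy(score_pairs):
    """score_pairs: list of (possible_score, impossible_score) from a probe.

    A pair is correct when the physically possible clip receives the
    strictly higher plausibility score.
    """
    if not score_pairs:
        raise ValueError("no pairs to score")
    correct = sum(possible > impossible for possible, impossible in score_pairs)
    return correct / len(score_pairs)
```

Comparing `pair_accuracy` on intact clips against the same probe fed `shuffle_frames` or `single_frame` inputs shows how much of the signal depends on temporal order rather than on appearance alone.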
Key Results:
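- V-JEPA achieves the best VOE accuracy on IntPhys2 and the best paired consistency on MVP, and stands out most under the attentive temporal probe.
- VideoMAE follows V-JEPA closely on most metrics, with only a small gap.
- LTX-Video is weakest but still above the random baseline, showing that the diffusion model also encodes some physics information.
- Layerwise analysis: physics information is weakest in early layers and richest at intermediate-to-late layers (roughly 60%-80% depth).
- Temporal shuffling lowers the performance of all models substantially, and more so on MVP, showing that temporal dynamics are essential to physical understanding.
- A linear probe already extracts some physics information, but MLP and attentive temporal probes improve performance further, indicating that the information is encoded nonlinearly.
Tech Stack:
- VideoMAE (masked video reconstruction)
- V-JEPA (predictive joint embedding)
- LTX-Video (diffusion video generation)
- IntPhys2 benchmark (physical plausibility evaluation)
- MVP benchmark (Minimal Video Pairs question answering)
- Linear probe, MLP probe, attentive temporal probe
- Violation of Expectation (VOE) metric
- Paired-consistency metric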
Strengths:
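- Systematically compares three mainstream pretraining paradigms, covering the main directions of current video foundation models.
- Uses strict benchmarks (VOE on IntPhys2, paired consistency on MVP) to reduce the influence of shortcuts.
- Designs control experiments (single frame, temporal shuffle) that separate physical understanding from temporal modeling.
- Layerwise and multi-probe analyses give detailed insight into the internal representations.
- The experimental design is rigorous, the results are clear, and the conclusions are convincing.
Limitations:
- Only three model families are included, which may not represent all video foundation models (for example contrastive models or video-language models).
- Frozen-feature probing may underestimate a model's physical understanding under fine-tuning or full inference.
- The benchmarks may still contain biases that are not fully removed, even though they are designed to minimize shortcuts.
- Causal reasoning and physical simulation are not covered; only perceptual-level physical plausibility is evaluated.
- Differences in model scale and training data may affect the fairness of the comparison, although the largest available backbones are used where possible.
Relevance To Keywords:
- Unified models: video foundation models are an important direction for unifying multimodal understanding and generation, and the paper evaluates their physical understanding.
- World models: models such as V-JEPA are explicitly designed as world models, and the paper tests whether they encode physical structure.
- Representation learning: probing how physics information is organized in the representations is an evaluation of representation learning.
- Reinforcement learning: world models are used for planning in model-based reinforcement learning, where physical understanding is a core capability.
- Post-training: the paper studies the physics knowledge in frozen representations after pretraining, which provides a basis for later fine-tuning or adaptation.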
摘要翻译
最近最先进的 (SOTA) 文本到语音 (TTS) 系统通常采用一种级联管道,该管道由语音分词器、自回归大语言模型 (LLM) 和基于扩散的流匹配 (FM) 模型组成,且这些组件是独立训练的。本文提出一种完全端到端 (E2E) 优化框架,统一了语音分词器、LLM、FM 模型以及额外奖励模型 (RM) 的训练。具体而言,我们首先利用多任务目标联合优化分词器,这些目标分别源自 FM 的重构、LLM 的下一个词元预测以及 RM 的多识别任务。这种联合训练促使离散语音词元空间捕获声学上和语义上显著的信息,从而更好地适配 TTS 任务。随后,我们进一步利用 FM 和 RM 的下游重构与识别来优化 LLM,这不仅减少了推理时的不匹配,还引导 LLM 生成更偏好的结果。实验结果表明,我们的 E2E 框架始终优于级联基线。在 Seed-TTS-Eval 基准上,我们的系统实现了 0.78% 和 1.56% 的词错误率 (WER),在使用 0.6B 参数 LLM 和 0.5B 参数 FM 模型的情况下取得了新的 SOTA 结果。这些结果表明,整体端到端优化对于改进基于离散词元的 TTS 系统至关重要,且训练管道更为简单。
Abstract
Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for FM, next-token prediction for LLM, and multi-task recognition for RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by FM and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves a word error rate (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 8.0/10 | 12.0 |
| Tokenizer | 1.5 | 9.0/10 | 13.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
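评分理由: The paper's core contribution is an end-to-end unified training framework (Unify Models: high relevance) that jointly optimizes the tokenizer, LLM, and diffusion model, so Tokenizer is the most relevant keyword. The task involves text and audio generation (MultiModal: moderate relevance), but not visual content (Visual Encoder: unrelated), environment dynamics modeling (World Models: unrelated), or reinforcement learning (model-based RL: unrelated). It uses an LLM but targets a specific TTS task rather than general multimodal understanding (MLLM: low relevance). The author list contains none of the designated experts, so no bonus points were applied.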
关键词
End-to-End Training, Discrete Token LLM, TTS System, Speech Tokenizer, Flow-matching, Unified Training, Reward Model
深度分析
Chinese Title: 面向离散Token大语言模型文本转语音系统的端到端训练
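Summary: This paper proposes a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, large language model (LLM), flow-matching (FM) model, and reward model (RM) in a discrete-token text-to-speech (TTS) system. In existing cascaded systems the modules are trained independently, which causes information misalignment and a training-inference mismatch. The authors first jointly optimize the tokenizer with multi-task objectives (FM reconstruction, LLM next-token prediction, RM multi-task recognition) so that it captures acoustic and semantic information better suited to TTS. They then use Gumbel-Softmax to back-propagate gradients from FM and RM into the LLM, which eases the distribution shift at inference and steers the LLM toward better token sequences. On the Seed-TTS-Eval benchmark, the E2E framework reaches word error rates (WER) of 0.78% and 1.56% with a 0.6B-parameter LLM and a 0.5B-parameter FM model, a new SOTA that confirms the importance of holistic end-to-end optimization for discrete-token TTS.
Innovations:
- The first fully end-to-end training framework that jointly optimizes all four modules (speech tokenizer, LLM, FM, RM) rather than only a subset.
- A first-order loss (computed on ground-truth tokens) trains the tokenizer so that it is supervised directly by the downstream tasks (FM reconstruction, LLM prediction, RM recognition) rather than by proxy objectives.
- A second-order loss (computed on LLM-predicted tokens) propagates gradients through Gumbel-Softmax, so the LLM optimizes reconstruction and recognition under its own generation distribution, reducing the training-inference mismatch.
- Tokenizer training is formalized as a source-coding problem, with information entropy used to quantify token quality, which offers a new angle for theoretical analysis.
- A three-stage training strategy: first train all modules jointly (first-order loss), then freeze the tokenizer and train each module separately, and finally freeze the RM and fine-tune the LLM and FM with the second-order loss, balancing stability and flexibility.
Methodology: The paper uses a three-stage end-to-end training procedure. Stage one: all parameters are trainable, and first-order losses (FM reconstruction loss, LLM next-token prediction loss, RM multi-task recognition loss) jointly optimize the tokenizer, LLM, FM, and RM. Stage two: the tokenizer is frozen, and the RM, FM, and LLM are trained independently (optionally on different datasets). Stage three: the RM is frozen, and second-order losses (FM reconstruction loss and RM recognition loss computed on LLM-predicted tokens) are back-propagated through Gumbel-Softmax to optimize the LLM and FM. The tokenizer uses finite scalar quantization (FSQ) with a single codebook; the RM covers three tasks, ASR, emotion recognition, and speaker recognition; the FM is a flow-matching model that predicts a velocity field; and the LLM uses a Transformer architecture.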
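Two building blocks named above, finite scalar quantization and Gumbel-Softmax sampling, can be illustrated in plain Python. This is a forward-pass sketch under assumptions: it uses odd level counts with `tanh` bounding for FSQ, the function names are illustrative, and it does not show gradient flow. In an autodiff framework the straight-through trick would return the hard one-hot in the forward pass while routing gradients through the soft probabilities.

```python
import math
import random


def fsq_quantize(z, levels):
    """Bound each dimension with tanh, then round to an integer grid.

    levels: odd number of levels per dimension (assumption of this sketch),
    so codes lie in {-(L-1)/2, ..., (L-1)/2}.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2
        codes.append(int(round(math.tanh(value) * half)))
    return codes


def fsq_index(codes, levels):
    """Flatten per-dimension codes into one token id (mixed-radix)."""
    index, basis = 0, 1
    for code, n_levels in zip(codes, levels):
        index += (code + (n_levels - 1) // 2) * basis
        basis *= n_levels
    return index


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Return (soft probabilities, hard one-hot) for one Gumbel sample."""
    rng = rng or random.Random(0)
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]  # shift for stability
    total = sum(exps)
    soft = [e / total for e in exps]
    hard = [0.0] * len(soft)
    hard[soft.index(max(soft))] = 1.0
    return soft, hard
```

The relaxed sample is what lets reconstruction and recognition losses computed downstream of the token choice send a learning signal back into the LLM's logits.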
Key Results:
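- On the Seed-TTS-Eval benchmark, the system reaches WERs of 0.78% and 1.56% with a 0.6B-parameter LLM and a 0.5B-parameter FM model, a new SOTA.
- The E2E framework consistently outperforms independently trained cascaded baselines on all sub-tasks (reconstruction quality, recognition accuracy, LLM prediction).
- The second-order loss eases the distribution shift between LLM-predicted tokens and the tokens the FM was trained on, improving the quality of synthesized speech.
- Joint training leads the tokenizer to learn discrete representations better suited to TTS, raising information entropy and downstream performance.
Tech Stack:
- Finite scalar quantization (FSQ)
- Gumbel-Softmax straight-through estimator
- Flow-matching model
- Transformer architecture (for the LLM and the RM encoder)
- CTC loss (ASR), cross-entropy loss (emotion recognition), cosine-similarity loss (speaker recognition)
- Information-entropy analysis (source-coding view)
- Multi-task learning (the RM is trained jointly on ASR, SER, and SPK)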
Strengths:
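- Proposes fully end-to-end optimization, addressing the root problems of cascaded systems: information misalignment between modules and the training-inference mismatch.
- Optimizes the tokenizer and the LLM with first-order and second-order losses respectively; the theory is clear and the experiments confirm it.
- The three-stage strategy combines the coherence of joint training with the data flexibility of training modules independently.
- Reaches SOTA at a small model scale (0.6B + 0.5B), showing that the method is efficient.
- Links tokenizer training to information theory, providing a framework for theoretical analysis.
Limitations:
- The paper does not discuss the compute overhead or training stability of E2E training (for example, optimization difficulty caused by moving targets during joint training).
- Evaluation uses a single benchmark (Seed-TTS-Eval), with no cross-lingual or cross-domain generalization tests.
- The second-order loss depends on the Gumbel-Softmax approximation, which may bias the gradients.
- There is no comparison with much larger models (tens of billions of parameters), so behavior at larger scale is unknown.
- The reward model covers only ASR, emotion, and speaker recognition, and ignores finer-grained criteria such as prosody and naturalness.
Relevance To Keywords:
- Unify Models: the paper unifies four models (tokenizer, LLM, FM, RM) under end-to-end training, in line with the idea of model unification.
- World Models: a TTS system can be viewed as a model of the speech world, and the FM model, as a generative model, resembles dynamics prediction in world models.
- Representation Learning: the tokenizer learns discrete speech representations whose quality is optimized under downstream supervision, a typical application of representation learning.
- Model-Based RL: the second-order loss steers LLM generation with gradients from FM and RM, which can be viewed as a form of model-based reinforcement learning (the reward signal comes from the RM).
- Native multimodal large models: TTS involves two modalities, text and speech; the LLM handles text while the tokenizer and FM handle speech, so it belongs to unified multimodal understanding and generation.
- Unified understanding and generation in multimodal large models: the system has both speech understanding (RM) and speech generation (FM), fused through E2E training.
- Representation learning: as above.
- World models: the FM, as a generative model, can be viewed as a dynamics model of the speech world.
- Reinforcement learning: the second-order loss resembles a policy gradient, with the RM providing the reward and the LLM acting as the policy network.
- Post-training: the third of the three training stages can be viewed as post-training (fine-tuning), using the second-order loss for alignment.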
摘要翻译
视频多模态模型的近期进展显著提升了视频问答(VideoQA)的性能。然而,这些系统往往依赖于虚假统计相关性,而非与答案相关的因果证据,导致不忠实且脆弱的推理,尤其是在复杂的现实场景中。现有方法要么依赖于跨模态相关性,要么依赖于代价高昂的精心构建的训练资源,要么依赖于不足的因果假设和约束,且通常仅在时间区间级别上运行。因此,它们未能明确地将因果视觉线索与混杂因素解耦,且提供的细粒度证据定位也十分有限。为了解决这一问题,我们提出了一种用于细粒度证据解耦的反事实推理框架(CREDiT)。CREDiT 利用结构因果模型对 VideoQA 过程进行建模,并在独立性和最小性约束下学习跨模态表示,使其被明确分解为因果和非因果成分。为了促进忠实的解耦,我们引入了特征级因果干预,并构建反事实输入,以近似因果效应同时抑制非因果相关性。在 NExT-GQA、SportsQA 和 SPORTU-video 上的广泛实验表明,CREDiT 在通用和复杂体育场景中一致提高了答案准确性和推理可靠性,从而构建出更可信的 VideoQA 系统。
Abstract
Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods either rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
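Scoring rationale: The paper centers on causal reasoning and evidence disentanglement in video question answering. It belongs to the MultiModal area and MLLMs are its main application setting, so these two keywords are the most relevant. Visual Encoder is moderately relevant because it is used to process video features. The paper does not address unified model architectures (Unify Models), tokenizer design (Tokenizer), world model construction (World Models), or reinforcement learning (model-based RL), so these keywords have low or zero relevance.
Keywords
Counterfactual Reasoning, VideoQA, Evidence Disentanglement, Structural Causal Model, Cross-modality Representations, Causal Interventions, Fine-grained Evidence
In-Depth Analysis
Chinese Title: 反事实推理用于视频问答中的细粒度证据解耦
Summary: This paper addresses the tendency of VideoQA models to rely on spurious statistical correlations rather than causal evidence, and proposes CREDiT, a counterfactual reasoning framework for fine-grained evidence disentanglement. It first formalizes the VideoQA reasoning process with a structural causal model (SCM) to identify causal and non-causal components. A cross-modality disentanglement module then decomposes the representation into causal and non-causal parts under independence and minimality constraints. Feature-level causal interventions construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Experiments on NExT-GQA, SportsQA, and SPORTU-video show that CREDiT consistently improves answer accuracy and reasoning reliability, with particularly strong results in complex sports scenarios, and requires no additional annotation.
Innovations:
- Formalizes, from a causal perspective, the root causes of spurious correlations in VideoQA reasoning and proposes a cross-modality causal disentanglement method that explicitly decomposes representations under independence and minimality constraints.
- Proposes a counterfactual learning framework based on feature-level causal interventions, intervening on both causal and non-causal variables to achieve finer-grained evidence disentanglement.
- Improves answer accuracy and evidence grounding on multiple benchmarks without additional human annotation or large-scale pretraining, strengthening interpretability and robustness.
Methodology: The paper uses a structural causal model (SCM) to give a causal abstraction of the VideoQA process and to identify causal and non-causal paths. On this basis it designs a cross-modality representation disentanglement module: spatio-temporal encoding is fused with text features into a joint representation, which independence and minimality constraints decompose into a causal component Fc and a non-causal component Fnc. To promote disentanglement, feature-level causal interventions are introduced: intervening on the causal variables constructs counterfactual inputs so that the model focuses on causal effects, and intervening on the non-causal variables isolates their spurious influence. The final model generates the answer from the causal component and the question, yielding robust reasoning.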
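The decomposition and intervention steps described in the methodology can be illustrated with a small NumPy sketch. It is not CREDiT's implementation: the sigmoid gate, the cross-covariance penalty used as a stand-in for the independence constraint, and the batch-shuffle intervention are all assumptions chosen for illustration, as are the names `split_causal`, `independence_penalty`, and `intervene_non_causal`.

```python
import numpy as np

def split_causal(joint, w_mask):
    """Soft-gate a joint video-question feature into causal and non-causal parts.

    The gating form is an assumption; the paper only states that the
    representation is decomposed into F_c and F_nc.
    """
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_mask)))  # per-dimension gate in (0, 1)
    f_c = gate * joint            # causal component F_c
    f_nc = (1.0 - gate) * joint   # non-causal component F_nc
    return f_c, f_nc

def independence_penalty(f_c, f_nc):
    """Squared cross-covariance between the two parts (0 when uncorrelated)."""
    a = f_c - f_c.mean(axis=0)
    b = f_nc - f_nc.mean(axis=0)
    cov = a.T @ b / len(a)
    return float((cov ** 2).sum())

def intervene_non_causal(f_c, f_nc, rng):
    """Counterfactual input: keep each sample's F_c, swap in another sample's F_nc."""
    perm = rng.permutation(len(f_nc))
    return f_c + f_nc[perm]

rng = np.random.default_rng(0)
joint = rng.normal(size=(8, 16))          # toy batch of joint features
w_mask = rng.normal(size=(16, 16)) * 0.1  # hypothetical gate parameters
f_c, f_nc = split_causal(joint, w_mask)
counterfactual = intervene_non_causal(f_c, f_nc, rng)
penalty = independence_penalty(f_c, f_nc)
```

In a trained model the answer predicted from `counterfactual` would be encouraged to match the answer predicted from the original input, since only the non-causal part changed.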
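Key Results:
- Reaches 70.4% QA accuracy on NExT-GQA, 60.4% on SportsQA, and 71.9% on SPORTU-video, outperforming the compared methods on all three.
- Grounded QA accuracy on NExT-GQA also improves, indicating that the model localizes causal visual evidence more accurately.
- Ablation studies and visualizations support the effectiveness of causal disentanglement and counterfactual intervention; the model remains robust under distribution shift.
Tech Stack:
- Structural Causal Model (SCM)
- Counterfactual Reasoning
- Feature-level Causal Intervention
- Independence Constraint
- Minimality Constraint
- Cross-modality Representation Disentanglement
- Spatio-temporal Encoding
- Video MLLM
Strengths:
- Analyzes the root causes of spurious correlations in VideoQA systematically from a causal angle and designs an explicit disentanglement mechanism, with a clear theoretical framework.
- The counterfactual intervention method needs no additional annotation, which makes it practical and transferable to a range of VideoQA settings.
- Achieves consistent gains on multiple benchmarks, most notably in complex sports scenarios, supporting the method's generalization.
- Improves evidence grounding as well as answer accuracy, which increases trust in the model.
Limitations:
- The paper does not discuss the computational overhead or training efficiency of the disentanglement module in detail, so efficiency may become a bottleneck on large-scale video.
- The counterfactual interventions operate at the feature level, and there is no theoretical guarantee on how closely they approximate true causal effects.
- Experiments cover only three datasets and do not test more diverse settings such as open-domain or long videos.
- The modeling of the non-causal component is fairly simple and may not capture complex confounders fully.
Relevance To Keywords:
- Unify Models: The paper focuses on VideoQA, a multimodal understanding task; it is related to the unified-model direction but does not address unifying generation and understanding.
- World Models: Causal modeling and counterfactual reasoning share some ideas with world models, although the paper does not build a world model.
- Representation Learning: The core is learning causally disentangled representations, which falls under representation learning.
- Model-Based RL: The paper involves no reinforcement learning, but causal modeling and counterfactual reasoning are common in model-based RL, so there is an indirect link.
- Native multimodal large models: The paper uses a video MLLM as its base, but the main contribution is the causal disentanglement framework rather than architectural innovation.
- Unified understanding and generation in multimodal large models: The paper addresses understanding (question answering) only, not generation.
- Post-training: The method can be applied by fine-tuning a pretrained model, which places it in the post-training stage, but post-training strategy is not a focus.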
摘要翻译
高光谱目标跟踪(HOT)利用高光谱视频(HSVs)所提供的丰富光谱信息,为目标跟踪提供了巨大的潜力。然而,如何从冗余的光谱带中高效提取和利用光谱信息仍是一个根本性挑战,这严重限制了模型的泛化能力和跟踪性能。此外,在动态场景中,目标常因遮挡和光照变化等因素而经历剧烈的外观变化。这些变化导致当前帧与模板之间产生显著的形变。此类差异对现有的时序建模方法构成了重大挑战。本文提出了一种名为 VLHTrack 的新型高光谱视觉 - 语言(VL)联合跟踪框架。具体而言,我们通过设计语言引导带选择模块(LBSM),引入语言先验以解决光谱冗余这一根本性挑战。借助大语言模型(LLM)的描述,LBSM 建立了语义到光谱的映射,从而减轻冗余并突出判别性光谱特征。随后,采用多模态视觉 - 语言融合模块,无缝集成视觉嵌入与语言嵌入,利用其互补优势学习连贯的多模态表示。为了解决长序列中的目标形变问题,本文提出了一种动态更新模板特征策略,并通过基于 Mamba 的动态模板更新(DTUM)模块实现。借助选择性状态空间建模,DTUM 学习帧间依赖以更新模板特征,确保模板特征在时序上下文指导下实现高效演化。在 HOT2023 和 HOT2024 数据集上的实验表明,VLHTrack 优于现有的最先进(SOTA)方法。
Abstract
Hyperspectral object tracking (HOT) leverages the rich spectral information provided by hyperspectral videos (HSVs), offering substantial potential for object tracking. However, efficiently extracting and exploiting spectral information from redundant spectral bands remains a fundamental challenge, which severely limits model generalization and tracking performance. Moreover, in dynamic scenes, targets often experience drastic appearance variations due to factors such as occlusion and illumination changes. These variations lead to large deformations between the current frame and the template. Such discrepancies pose major challenges for existing temporal modeling approaches. In this work, we propose VLHTrack, a novel hyperspectral vision-language (VL) joint tracking framework. Specifically, we incorporate language priors to address the fundamental challenge of spectral redundancy by designing a Language-Guided Band Selection Module (LBSM). By leveraging Large Language Model (LLM) descriptions, LBSM establishes a semantic-to-spectral mapping that mitigates redundancy and accentuates discriminative spectral features. A Multi-Modal Vision-Language Fusion Module is then employed to seamlessly integrate visual and linguistic embeddings, harnessing their complementary advantages to learn coherent cross-modal representations. To address target deformation in long-term sequences, we propose a dynamic update template feature strategy implemented via the Dynamic Template Update with Mamba (DTUM) module. By leveraging selective state space modeling, DTUM learns inter-frame dependencies to update template feature, ensuring efficient template feature evolution guided by temporal context. Experiments on HOT2023 and HOT2024 demonstrate that VLHTrack outperforms state-of-the-art (SOTA) methods.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 7.0/10 | 10.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
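Scoring rationale: The paper's core is hyperspectral vision-language tracking. MultiModal and MLLM are highly relevant (vision-language fusion and use of a large language model); Visual Encoder and Unify Models are moderately relevant (visual feature extraction and modality unification); Tokenizer has low relevance (implicit in the LLM but not a core contribution); World Models and model-based RL are essentially unrelated (no reinforcement learning or environment modeling), each scoring 1.0/10. The author list contains none of the designated expert authors, so no bonus is added. The weighted total is (3+2+4+1+7+8+1)*1.5 = 39.0, above the dynamic passing score of 27.8.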
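The weighted total quoted in the rationale can be reproduced in a few lines. This is only a sketch of the arithmetic shown in the table; the function name `weighted_total` is hypothetical, and the 27.8 threshold is taken from the rationale text rather than computed.

```python
WEIGHT = 1.5          # every keyword in the table uses the same weight
PASSING_SCORE = 27.8  # "dynamic passing score" as stated in the rationale

relevance = {
    "Unify Models": 3.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 4.0,
    "World Models": 1.0,
    "MLLM": 7.0,
    "MultiModal": 8.0,
    "model-based RL": 1.0,
}

def weighted_total(scores, weight=WEIGHT):
    """Sum of relevance * weight over all keywords."""
    return sum(value * weight for value in scores.values())

total = weighted_total(relevance)
passes = total > PASSING_SCORE
print(total, passes)
```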
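Keywords
Hyperspectral Object Tracking, Vision-Language Fusion, Language-Guided Band Selection, Dynamic Template Updating, Mamba, Large Language Model, Cross-modal Representation
In-Depth Analysis
Chinese Title: 基于语义融合与上下文模板更新的视觉语言引导高光谱目标跟踪
Summary: Hyperspectral object tracking (HOT) exploits rich spectral information but suffers from spectral redundancy and from drastic target appearance changes in dynamic scenes. This paper proposes the VLHTrack framework, which it describes as the first to bring vision-language (VL) modeling into HOT. A Language-Guided Band Selection Module (LBSM) uses semantic descriptions generated by a large language model (LLM) to build a semantic-to-spectral mapping that reduces redundancy and highlights discriminative spectral features. A multi-modal vision-language fusion module integrates visual and linguistic embeddings to learn cross-modal representations. To handle target deformation in long-term tracking, a Mamba-based Dynamic Template Update module (DTUM) captures inter-frame dependencies through selective state space modeling and dynamically updates the template features. On the HOT2023 and HOT2024 datasets, VLHTrack outperforms existing state-of-the-art methods.
Innovations:
- Applies a joint vision-language (VL) framework to hyperspectral object tracking, reportedly for the first time, using LLM-generated semantic priors to guide spectral information extraction.
- Designs the Language-Guided Band Selection Module (LBSM), which combines entropy analysis, structural redundancy, and semantic similarity to realize a semantic-to-spectral mapping and overcome the limits of purely data-driven band selection.
- Proposes the Mamba-based Dynamic Template Update module (DTUM), which uses a selective state space model to capture inter-frame dependencies and update template features dynamically in response to target deformation.
- Builds a multi-modal vision-language fusion module that integrates visual and linguistic embeddings to learn complementary cross-modal representations.
Methodology: First, an LLM generates a language description of the target from the initial frame as a semantic prior. LBSM removes low-information bands through entropy analysis, eliminates redundant bands through structural redundancy modeling, and then uses the semantic similarity between language embeddings and spectral features to guide selection of a discriminative band subset. The multi-modal fusion module aligns and fuses visual and language features. DTUM applies Mamba's selective state space model to inter-frame temporal dependencies and dynamically updates the template features to maintain temporal consistency. The overall framework builds on a Siamese network and uses cross-attention for tracking.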
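The three band-selection criteria named above (entropy, redundancy, semantic similarity) can be combined in a simple greedy filter. The sketch below is an assumption about how such a filter might look, not the LBSM itself: the thresholds, the use of Pearson correlation as the redundancy measure, and the per-band embeddings `band_emb` are all hypothetical.

```python
import numpy as np

def band_entropy(cube, bins=32):
    """Shannon entropy (bits) of each band's intensity histogram; cube is (bands, H, W)."""
    entropies = []
    for band in cube:
        hist, _ = np.histogram(band, bins=bins)
        p = hist[hist > 0] / hist.sum()
        entropies.append(-(p * np.log2(p)).sum())
    return np.array(entropies)

def select_bands(cube, band_emb, text_emb, k=3, min_entropy=1.0, max_corr=0.95):
    """Greedy selection: rank bands by similarity to the text embedding,
    skipping low-entropy bands and bands highly correlated with a chosen one."""
    entropy = band_entropy(cube)
    corr = np.abs(np.corrcoef(cube.reshape(len(cube), -1)))
    norms = np.linalg.norm(band_emb, axis=1) * np.linalg.norm(text_emb) + 1e-8
    similarity = band_emb @ text_emb / norms  # cosine similarity per band
    chosen = []
    for idx in np.argsort(-similarity):
        if entropy[idx] < min_entropy:
            continue  # low-information band
        if any(corr[idx, j] > max_corr for j in chosen):
            continue  # redundant with an already selected band
        chosen.append(int(idx))
        if len(chosen) == k:
            break
    return chosen

rng = np.random.default_rng(0)
cube = rng.normal(size=(16, 8, 8))   # toy hyperspectral frame with 16 bands
cube[1] = cube[0]                    # band 1 duplicates band 0 (redundant)
cube[2] = 0.0
cube[2, 0, 0] = 1.0                  # band 2 is almost constant (low entropy)
band_emb = rng.normal(size=(16, 4))  # hypothetical per-band embeddings
text_emb = rng.normal(size=4)        # hypothetical language embedding
chosen = select_bands(cube, band_emb, text_emb, k=3)
```

The near-constant band is rejected by the entropy test, and at most one of the two duplicate bands can be selected.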
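Key Results:
- On HOT2023 and HOT2024, VLHTrack outperforms existing SOTA methods on multiple evaluation metrics such as success rate and precision.
- Ablation studies confirm the contribution of LBSM and DTUM, showing that language-guided band selection and dynamic template updating both improve tracking performance.
- Qualitative results show robustness under occlusion, illumination change, and distractors that resemble the target.
Tech Stack:
- Large language model (LLM) for generating target semantic descriptions
- Mamba (selective state space model) for temporal modeling
- Entropy analysis for estimating band information content
- Structural redundancy modeling
- Semantic similarity
- Cross-attention
- Siamese network architecture
- Multi-modal feature fusion
Strengths:
- Introduces language priors into hyperspectral tracking, addressing spectral redundancy and the lack of semantics.
- LBSM performs knowledge-driven band selection, which is more interpretable and target-aware than traditional data-driven methods.
- DTUM uses Mamba to model long temporal dependencies efficiently and updates the template to follow appearance changes.
- Achieves the best reported performance on the public datasets evaluated, supporting the method's effectiveness.
Limitations:
- Depends on the quality of the LLM-generated description; an inaccurate description may degrade band selection and tracking.
- LBSM and DTUM add model complexity and possibly extra computation; real-time performance has yet to be verified.
- Evaluation covers only HOT2023 and HOT2024, with no generalization tests on other scenarios.
- The effect of language-description update strategies on long-term tracking is not explored in depth.
Relevance To Keywords:
- Native multimodal large models: The paper uses an LLM to generate language descriptions and fuses them with visual features, an application of multimodal large models to tracking.
- Unified understanding and generation in multimodal large models: The LLM generates semantic descriptions while the model interprets visual and language information, so understanding and generation work together.
- Representation learning: The semantic-to-spectral mapping and cross-modal fusion learn a more robust joint representation.
- World models: The dynamic template update module captures environmental change through temporal modeling and can be viewed as implicit modeling of world dynamics.
- Reinforcement learning: The paper does not use reinforcement learning directly, though dynamic template updating can be loosely compared to a decision process driven by temporal feedback.
- Post-training: The LLM serves as prior knowledge and the model is fine-tuned on the tracking task, which fits the post-training paradigm.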
摘要翻译
大语言模型 (LLM) 为语音理解提供了强大的推理骨干,但将连续声学信号整合到冻结的 LLM 中仍然具有挑战性。现有的语音到 LLM 接口通常运行在两个极端:要么强制近乎离散的 token 对齐,这有利于转录但丢失了副语言信息,要么学习无约束的连续表示,这可能会偏离 LLM 的输入空间并降低自回归解码性能。在这项工作中,我们提出了凸门 (C-Gate),一种语音到 LLM 的桥梁,通过架构凸包约束将所有语音表示约束于 LLM 的输入 embedding 流形内。具体来说,每个帧被表示为 token embedding 的凸组合,确保与预训练 LLM 的兼容性,同时保留连续表达能力。在自动语音识别 (ASR) 和情感识别方面,C-Gate 实现了强大的联合性能,将 LibriSpeech 词错误率 (WER) 相对降低了最多 48.7%,同时匹配或超越单任务情感准确率。超越性能,我们的分析揭示了一个关键见解:信息并非由离散 token 的身份携带,而是由 embedding 空间中的时间分辨轨迹携带。因果干预证实了轨迹结构以及与预训练 embedding 流形的对齐对于性能均至关重要。这些结果表明,几何特性而非 token 离散性是语音到 LLM 接口中的根本设计因素,并为研究冻结 LLM 中的多模态整合提供了受控范式。我们发布了检查点、样本级输出、机制转储及干预套件以供复现。
Abstract
Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which benefits transcription but loses paralinguistic information, or learning unconstrained continuous representations, which can drift away from the LLM's input space and degrade autoregressive decoding. In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM's input embedding manifold with an architectural convex-hull constraint. Concretely, each frame is represented as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM while preserving continuous expressivity. Across automatic speech recognition (ASR) and emotion recognition, C-Gate achieves strong joint performance, improving LibriSpeech WER by up to 48.7% relative while matching or exceeding single-task emotion accuracy. Beyond performance, our analysis reveals a key insight: information is not carried by discrete token identities, but by time-resolved trajectories in the embedding space. Causal interventions confirm that both the trajectory structure and alignment to the pretrained embedding manifold are critical for performance. These results suggest that geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces, and provide a controlled regime for studying multimodal integration in frozen LLMs. We release the checkpoint, per-sample outputs, mechanism dumps, and intervention suite for replication.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 6.5/10 | 9.8 |
| Tokenizer | 1.5 | 5.5/10 | 8.2 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 8.5/10 | 12.8 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
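Scoring rationale: The paper focuses on integrating speech into LLMs using a convex gate method, aligning well with 'MultiModal' (speech+text) and 'Tokenizer' (token embeddings/alignment) concepts. It partially relates to 'Unify Models' by unifying representation spaces. However, it lacks visual components ('Visual Encoder'), world modeling ('World Models'), or reinforcement learning ('model-based RL'), and is specific to speech rather than general MLLM.
Keywords
Speech LLMs, Convex Gate, Embedding Manifold, Multimodal Integration, Continuous Representations, Token Embeddings, Automatic Speech Recognition
In-Depth Analysis
Chinese Title: 文本就是一切吗?文本作为语音大语言模型的通用信息瓶颈
Summary: This paper addresses the difficulty of integrating continuous acoustic signals into a frozen large language model (LLM) in speech LLMs and proposes Convex Gate (C-Gate), a speech-to-LLM bridge. Existing approaches either force discretization, which loses paralinguistic information, or learn unconstrained continuous representations, which drift away from the LLM input space. C-Gate imposes a convex-hull constraint: each speech frame is represented as a convex combination of LLM input embeddings, so the representation stays within the pretrained embedding manifold while remaining continuous. On joint automatic speech recognition (ASR) and emotion recognition, C-Gate reduces LibriSpeech word error rate (WER) by up to 48.7% relative, while emotion accuracy matches or exceeds the single-task baseline. Causal intervention experiments indicate that performance depends on the temporally ordered trajectory structure and its alignment with the pretrained embedding manifold, not on discrete tokens themselves. The work argues that geometric alignment, rather than discreteness, is the core design factor for speech-to-LLM interfaces.
Innovations:
- Proposes C-Gate, a geometrically constrained speech-to-LLM interface that uses a convex-hull constraint to keep speech representations within the LLM input embedding manifold, resolving the trade-off between discrete lock-in and representation drift.
- Transfers semantic and paralinguistic information jointly, with gains on both ASR and emotion recognition; the multi-task model outperforms the single-task baselines.
- Shows through causal interventions (frame shuffling, random-basis replacement, audio replacement) that the temporally ordered convex-hull trajectory, rather than discrete token identity, is the key information channel.
- Establishes a controlled evaluation of ASR-to-paralinguistic transfer and proposes a stricter reporting protocol for speech reasoning benchmarks, including source-overlap checks and audio-replacement controls.
Methodology: The technical route is as follows: 1) a frozen Whisper-large-v3 serves as the speech encoder and Qwen2.5-7B-Instruct as the frozen LLM; 2) the C-Gate bridge uses a Top-K Q-Former to apply full-vocabulary cross-attention between downsampled speech features and LLM token embeddings, selects a Top-16 support set, and computes convex combination weights; 3) the training objective combines a cross-entropy loss for ASR with a classification loss for emotion recognition, using dynamic reweighting across tasks; 4) three causal intervention experiments (audio replacement, frame shuffling, random-basis replacement) test the convex-hull hypothesis.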
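The convex-combination step in item 2 can be sketched in NumPy. This is an illustration under stated assumptions, not the C-Gate implementation: a single linear projection `w_q` stands in for the Q-Former, the softmax over the top-k scores is one possible way to obtain convex weights, and the function name `convex_gate` is hypothetical. Only the Top-16 support size comes from the text above.

```python
import numpy as np

def convex_gate(frames, vocab_emb, w_q, k=16):
    """Map each speech frame to a convex combination of k token embeddings.

    frames:    (T, d_speech) downsampled speech features
    vocab_emb: (V, d_model)  frozen LLM input embedding table
    w_q:       (d_speech, d_model) hypothetical query projection
    """
    logits = (frames @ w_q) @ vocab_emb.T              # (T, V) frame-to-token scores
    top = np.argsort(-logits, axis=1)[:, :k]           # (T, k) support-set indices
    top_logits = np.take_along_axis(logits, top, axis=1)
    top_logits -= top_logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=1, keepdims=True)      # non-negative, rows sum to 1
    out = np.einsum("tk,tkd->td", weights, vocab_emb[top])
    return out, weights, top

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 12))      # toy speech features
vocab_emb = rng.normal(size=(100, 8))  # toy embedding table
w_q = rng.normal(size=(12, 8))
out, weights, top = convex_gate(frames, vocab_emb, w_q)
```

Because the weights are non-negative and sum to one, every output vector lies inside the convex hull of the selected embeddings, which is the property the paper's constraint is meant to guarantee.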
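Key Results:
- The two-task (ASR + emotion) model C-Gate-2T reduces autoregressive WER on LibriSpeech from 7.76% to 4.78% (a 38.4% relative reduction) and reaches 97.1% emotion recognition accuracy on RAVDESS.
- The three-task (ASR + emotion + reasoning) model C-Gate-3T further reduces WER to 3.98% (a 48.7% relative reduction), while emotion accuracy drops by 6.6 percentage points.
- Causal interventions show that random-basis replacement (keeping codebook size and dimension fixed) severely degrades performance, and that frame shuffling and audio replacement are similarly destructive, indicating that the temporally ordered convex-hull trajectory is what matters.
- C-Gate keeps continuous expressivity while eliminating representation drift, and LLM analysis tools such as logit-lens and attention inspection remain directly applicable.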
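The relative reductions quoted above follow directly from the absolute WER values. A minimal check, with `relative_reduction` as a hypothetical helper name:

```python
def relative_reduction(baseline, new):
    """Relative reduction in percent: (baseline - new) / baseline * 100."""
    return (baseline - new) / baseline * 100.0

baseline_wer = 7.76  # autoregressive baseline WER (%) from the results above
two_task = relative_reduction(baseline_wer, 4.78)    # C-Gate-2T
three_task = relative_reduction(baseline_wer, 3.98)  # C-Gate-3T
print(round(two_task, 1), round(three_task, 1))
```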
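Tech Stack:
- Whisper-large-v3 (speech encoder)
- Qwen2.5-7B-Instruct (frozen LLM)
- Q-Former (cross-attention module)
- Convex combination
- Top-K support selection (Top-16)
- Cross-entropy loss (ASR)
- Classification loss (emotion recognition)
- Dynamically reweighted loss (multi-task)
- Causal intervention experiments (audio replacement, frame shuffling, random-basis replacement)
- LibriSpeech (ASR dataset)
- RAVDESS (emotion recognition dataset)
Strengths:
- Proposes a novel geometric constraint that balances discrete and continuous representations, with a clear rationale and a simple implementation.
- The experimental design is rigorous: multi-task joint training and causal interventions probe the role of representation geometry.
- Delivers clear gains on both ASR and emotion recognition, supporting the method's effectiveness.
- Releases full reproduction resources (checkpoint, per-sample outputs, mechanism dumps, intervention suite) for follow-up work.
- Offers a design principle for speech-to-LLM interfaces that shifts attention from discreteness to geometric alignment.
Limitations:
- The experimental scale is modest (960 hours of LibriSpeech, 47 hours of emotion data, 707M trainable parameters); generalization to larger data and models remains to be tested.
- Emotion performance drops by 6.6 percentage points in the three-task model, so multi-task interference is not fully resolved.
- The reasoning task serves only as a stress test; the actual contribution of audio to reasoning is not established, and text shortcuts are a risk.
- The method depends on the LLM token embedding table, so an LLM with a small vocabulary may limit representational capacity.
- The convex-combination constraint may limit the expression of extreme paralinguistic features such as intense emotion or unusual sound effects.
Relevance To Keywords:
- Native multimodal large models: Directly relevant; the paper studies deep fusion of speech and text and proposes the C-Gate bridge.
- Unified understanding and generation in multimodal large models: C-Gate supports ASR, emotion recognition, and reasoning tasks, reflecting a unification of understanding and generation.
- Representation learning: The core contribution is learning speech representations aligned with the LLM manifold through a convex-combination constraint.
- World models: Not addressed directly, although speech understanding is an important perceptual channel for building world models.
- Model-based RL: No reinforcement learning is involved, though the geometric constraint idea could inform state representation design in RL.
- Post-training: The bridge module is trained on top of a frozen LLM, which fits the post-training paradigm.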
摘要翻译
自动驾驶中的重识别(ReID)通常被建模为一种视觉匹配问题,其中车辆、行人和骑行者的观测值通过学习到的外观嵌入在时间、帧或相机视图之间进行关联,通常辅以运动、几何或多模态线索。然而,纯视觉表示可能对视角、遮挡、光照以及传感器域差异敏感,从而限制了它们在复杂驾驶场景中的可解释性和鲁棒性。我们提出了一项基于零样本流程的基线研究,利用视觉 - 语言模型(VLMs)生成检测到的交通参与者的文本描述,并评估这些描述是否支持跨观测的身份匹配。与仅依赖低层视觉相似性不同,所提出的方法通过结构化语义属性来表示每个对象,包括类别、颜色、形状、姿态、可见部分、空间上下文以及显著视觉特征。本研究为自动驾驶场景中的基于语言的重识别提供了初步基准,讨论并评估了当前视觉 - 语言模型(VLMs)在此任务上的优势与局限性。结果表明,零样本语义描述可以有效支持对象重识别,达到了与监督式 CNN 基线相当的检索性能,同时通过显式身份线索提供了更高的可解释性。然而,实验也揭示了若干重要挑战,包括视角间的属性不一致以及视觉上相似实例之间的细粒度判别能力有限。
Abstract
Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimodal cues. However, purely visual representations may be sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, limiting their interpretability and robustness in complex driving scenes. We propose a baseline study of a zero-shot pipeline using Vision-Language Models (VLMs) to generate textual descriptions of detected traffic participants and evaluate whether these descriptions can support identity matching across observations. Instead of relying only on low-level visual similarity, the proposed formulation represents each object through structured semantic attributes, including category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. This study provides an initial benchmark for language-based re-identification in autonomous-driving scenarios, discussing and evaluating the strengths and limitations of current VLMs for this task. Results demonstrate that zero-shot semantic descriptions can support effective object re-identification, achieving retrieval performance comparable to a supervised CNN baseline while offering greater interpretability through explicit identity cues. However, the experiments also reveal important challenges, including attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
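Scoring rationale: The paper's core is zero-shot semantic re-identification with vision-language models (MLLMs), so MLLM and MultiModal are highly relevant (8 points each). VLMs unify vision and language, giving moderate relevance to Unify Models (4 points). The tokenizer and visual encoder are model components but not the research focus (2 points each). World models and model-based RL are unrelated to the task (1 point and 0 points). The author list includes none of the designated experts, so no bonus is added. The weighted total is 37.5, above the passing score of 27.8.
Keywords
Zero-Shot, Semantic Re-Identification, Autonomous Driving, Vision-Language Models, Textual Descriptions, Attribute Matching, VLM Baseline
In-Depth Analysis
Chinese Title: 面向自动驾驶的零样本语义重识别:一项视觉语言模型基线研究
Summary: This paper proposes a zero-shot semantic re-identification (ReID) baseline for autonomous-driving scenes. Conventional ReID relies on visual appearance embeddings, which are sensitive to viewpoint, occlusion, and illumination. The authors use vision-language models (VLMs) to generate structured text descriptions of detected traffic participants (vehicles, pedestrians, and so on), covering category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. The descriptions are encoded as text embeddings and matched by cosine similarity. Experiments on the KITTI-ReID dataset evaluate several VLMs (the Qwen3.5 series and Gemma 4) and text embedding models. Zero-shot semantic descriptions reach retrieval performance comparable to a supervised CNN baseline while being more interpretable. The study also exposes two challenges: attribute inconsistency across viewpoints and limited fine-grained discrimination. The work provides a baseline benchmark for interpretable, language-based re-identification in autonomous driving.
Innovations:
- Proposes a zero-shot semantic ReID baseline, reportedly the first, that replaces visual embeddings with VLM-generated structured text descriptions for identity matching.
- Designs a domain-aware prompting strategy that steers the VLM toward stable, discriminative attributes (such as vehicle body type or pedestrian clothing) and reduces description variance.
- Compares the zero-shot performance of multiple VLMs and text embedding models on autonomous-driving ReID and documents their strengths and limits.
- Provides an interpretable intermediate representation whose explicit identity cues support failure diagnosis and human-in-the-loop analysis.
Methodology: The paper uses a three-stage pipeline: 1) crop single-object images from the KITTI dataset; 2) use a VLM (such as Qwen3.5 or Gemma 4) with a carefully designed prompt to generate a single-line text description, where the prompt emphasizes stable identity attributes (shape, color, markings, accessories, and so on) and discourages vague language; 3) map the descriptions to embedding vectors with a text encoder (such as Embedding Gemma or Nomic Em) and rank gallery items by cosine similarity to the query. All models are used zero-shot, with no fine-tuning or adaptation.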
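Stage 3 of the pipeline, ranking gallery descriptions by cosine similarity to a query description, can be sketched as follows. The bag-of-words `bow_embed` function is a toy stand-in for the learned text encoders named above, and the example descriptions are invented for illustration; neither comes from the paper.

```python
import numpy as np

def bow_embed(texts):
    """Toy bag-of-words embedding over the vocabulary of the given texts
    (a stand-in for a real text embedding model)."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            matrix[row, index[word]] += 1.0
    return matrix

def rank_gallery(query_emb, gallery_emb):
    """Return gallery indices sorted by descending cosine similarity, with scores."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)
    return order, sims[order]

# Hypothetical single-line descriptions of the kind a VLM might produce.
gallery = [
    "red hatchback car side view roof rack",
    "white delivery van rear view",
    "pedestrian blue jacket black backpack",
]
query = "red hatchback car rear view"

emb = bow_embed(gallery + [query])
order, scores = rank_gallery(emb[-1], emb[:-1])
```

With these toy inputs the red hatchback description ranks first, the van second (it shares only the viewpoint words), and the pedestrian last.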
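Key Results:
- Zero-shot semantic descriptions reach retrieval performance on KITTI-ReID comparable to a supervised CNN baseline.
- Larger VLMs (such as Qwen3.5 27B and Gemma 4 E2B) generally produce more accurate and more discriminative descriptions.
- The choice of text embedding model has a notable effect on performance; Embedding Gemma 300M performs comparatively well.
- Attributes are inconsistent across viewpoints: descriptions of the same object can change from one view to another.
- Fine-grained discrimination is limited: visually similar but distinct objects (such as vehicles of the same color and model) are hard to separate by semantic description.
Tech Stack:
- Vision-language models: Qwen3.5 (0.8B, 9B Q4-K-M, 27B Q4-K-M), Gemma 4 E2B
- Text embedding models: Embedding Gemma 300M, Nomic Em
- Dataset: KITTI-ReID (built from the KITTI tracking dataset)
- Similarity measure: cosine similarity
- Prompt engineering: structured prompts that fix attribute order and content constraints
Strengths:
- Proposes a zero-shot semantic ReID paradigm that performs identity matching without training and is interpretable.
- Evaluates multiple VLMs and text embedding models systematically, giving later work a baseline reference.
- The prompt design reflects autonomous-driving specifics (vehicle versus pedestrian attributes), which sharpens the descriptions.
- The experiments are rigorous and compare against a supervised CNN baseline, showing both the potential and the shortcomings of the semantic approach.
Limitations:
- Zero-shot VLM descriptions are inconsistent across viewpoints, which hurts matching stability.
- Fine-grained discrimination is limited; visually near-identical targets (such as vehicles of the same model and color) are hard to tell apart.
- Results depend on VLM generation quality; small models may produce inaccurate or vague descriptions.
- Only single-line text descriptions are used, which may lose detail; multimodal fusion strategies are not explored.
- Experiments are run only on the KITTI dataset, so generalization remains to be verified.
Relevance To Keywords:
- Native multimodal large models: The paper uses VLMs (multimodal large models) directly for image-to-text generation.
- Representation learning: Text embeddings turn semantic descriptions into vectors for retrieval, which involves representation learning.
- World models: The paper does not build a world model, but semantic descriptions can be seen as abstract representations of object state that aid scene understanding.
- Post-training: The approach is zero-shot and involves no post-training, although the findings could guide later fine-tuning or adaptation.
- Reinforcement learning: Not involved, although semantic re-identification could assist decision-making and planning in autonomous driving.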
摘要翻译
思维链(CoT)提升了大语言模型(LLMs)的性能,并已被扩展至多模态大语言模型(MLLMs)。更近期的工作进一步从基于文本的多模态推理转向交错模态推理,其中中间步骤可以同时纳入文本理由和视觉证据。在这项工作中,我们提出一个更大胆且更具雄心的想法:仅凭图像能否作为语言和多模态任务的推理媒介?为了探索这一点,我们提出了光学推理(optical reasoning),将图像视为一种独立的推理媒介。我们通过两种变体实现了这一概念:基于排版的光学推理(typographic-based optical reasoning),旨在优化视觉布局以紧凑地呈现理由;以及基于图形的光学推理(graphical-based optical reasoning),将文本和图形元素组合成结构化的视觉理由。在数学、科学及交错模态推理基准测试中,光学推理可以匹配甚至超越传统文本推理,同时将语言任务中的推理 token 平均减少 28.57%,多模态任务中减少 16%,实现了文本推理 1.96 倍的 token 效率。这些结果表明,图像能够高效且有效地编码理由,同时为推理提供统一的视觉画布。
Abstract
Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
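Scoring rationale: The paper is highly relevant to MLLM and MultiModal as it explicitly operates within Multimodal Large Language Models and interleaved-modal reasoning. It addresses token efficiency but does not focus on tokenizer design (Tokenizer low). It utilizes images but the core contribution is reasoning layout rather than visual encoding (Visual Encoder low). It unifies the reasoning medium but not model architecture (Unify Models low). There is no connection to World Models or Model-Based RL.
Keywords
Optical Reasoning, Multimodal Large Language Models, Chain-of-Thought, Visual Reasoning, Token Efficiency, Interleaved-modal, Typographic-based, Graphical-based
In-Depth Analysis
Chinese Title: 光学推理:将图像重新思考为超越文本的表达性推理媒介
Summary: This paper proposes optical reasoning, which treats images as a standalone reasoning medium that replaces the intermediate steps of conventional text reasoning. The authors instantiate two variants. Typographic-based optical reasoning (T-OR) renders the rationale compactly as an image by optimizing layout parameters such as text width and font size. Graphical-based optical reasoning (G-OR) uses a step-aligned composition strategy to organize text and graphical elements into a structured visual rationale. The evaluation covers five benchmarks across mathematical reasoning (AquaRat, GSM8K), scientific reasoning (GPQA Diamond, ScienceQA), and interleaved-modal reasoning (Zebra-CoT), using five frontier multimodal models: GPT-5.1, Gemini 2.5 Flash, Claude Sonnet 4.5, Kimi K2.5, and Qwen3-VL-235B. Optical reasoning matches or exceeds the accuracy of text reasoning in most cases while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks; each visual reasoning token is 1.96 times as efficient as a text reasoning token. The code is open source.
Innovations:
- Introduces the idea of images as a standalone reasoning medium, going beyond text reasoning and interleaved-modal reasoning.
- Designs typographic-based optical reasoning (T-OR), whose layout search maximizes information density under a controllable reasoning-token budget.
- Designs graphical-based optical reasoning (G-OR), whose step-aligned composition strategy unifies text and graphical elements on one visual canvas.
- Validates the effectiveness and efficiency of images as a reasoning medium on several frontier multimodal models and proposes the Marginal Accuracy Gain (MAG) metric for token efficiency.
Methodology: The paper first formalizes text reasoning and optical reasoning, mapping a reasoning sequence to an image through a renderer. T-OR uses a XeLaTeX renderer and searches over a candidate set of widths and a candidate set of font sizes, scoring layouts by fill ratio and a layout penalty to select a compact, readable image that satisfies the token budget. G-OR uses the Nano Banana 2 tool with structured prompts that decompose the reasoning into steps; each step maps to a visual panel that keeps key text and formulas and organizes the reasoning with graphical elements and spatial layout. At evaluation time the image is fed to the multimodal model, whose visual encoder extracts the visual reasoning tokens, and the model generates the answer from the question and those tokens.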
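The T-OR layout search described above (a grid over widths and font sizes, scored by fill ratio and a layout penalty under a token budget) can be sketched as below. Every formula here is an assumption made for illustration: the character-width and line-height factors, the patch-based token estimate, and the small-font penalty are hypothetical and are not taken from the paper, which renders with XeLaTeX and counts real visual tokens.

```python
import math
from itertools import product

def estimate_layout(num_chars, width_px, font_px, patch_px=28):
    """Rough estimate of visual tokens and fill ratio for rendered text.

    Assumes (hypothetically) 0.6 * font_px per character, 1.3 * font_px
    line height, and one visual token per patch_px x patch_px patch.
    """
    chars_per_line = max(1, int(width_px / (0.6 * font_px)))
    lines = math.ceil(num_chars / chars_per_line)
    height_px = math.ceil(lines * 1.3 * font_px)
    tokens = math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)
    fill = num_chars / (chars_per_line * lines)  # fraction of character slots used
    return tokens, fill

def search_layout(num_chars, widths, fonts, token_budget, min_font=10):
    """Grid search for the best-scoring layout within the token budget."""
    best = None
    for width_px, font_px in product(widths, fonts):
        tokens, fill = estimate_layout(num_chars, width_px, font_px)
        if tokens > token_budget:
            continue  # violates the reasoning-token budget
        penalty = 0.1 * max(0, min_font - font_px)  # discourage unreadably small fonts
        candidate = (fill - penalty, -tokens, width_px, font_px)
        if best is None or candidate > best:
            best = candidate
    if best is None:
        return None  # nothing fits the budget
    score, neg_tokens, width_px, font_px = best
    return {"width_px": width_px, "font_px": font_px, "tokens": -neg_tokens, "score": score}

layout = search_layout(1200, widths=[448, 672, 896], fonts=[8, 10, 12, 14], token_budget=400)
```

Ties in score are broken in favor of fewer tokens, which mirrors the stated goal of compact rendering.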
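Key Results:
- On language tasks, T-OR matches or exceeds text reasoning on 7 model-benchmark pairs while reducing reasoning tokens by 28.57% on average; where it trails, the average accuracy gap is only 0.027, with 20% fewer tokens.
- On multimodal tasks, T-OR matches or exceeds text reasoning on 5 model-benchmark pairs while reducing reasoning tokens by 16% on average; where it trails, the average accuracy gap is only 0.014, with 32% fewer tokens.
- The Marginal Accuracy Gain (MAG) metric indicates that each visual reasoning token is 1.96 times as efficient as a text reasoning token.
- G-OR further improves accuracy on some tasks, showing the benefit of a unified visual canvas.
Tech Stack:
- XeLaTeX (typographic rendering)
- Nano Banana 2 (graphical rendering)
- Layout search algorithm (coarse search plus fine search, scored by fill ratio and layout penalty)
- Marginal Accuracy Gain (MAG) metric (Equation 14)
- Multimodal large models: GPT-5.1, Gemini 2.5 Flash, Claude Sonnet 4.5, Kimi K2.5, Qwen3-VL-235B
Strengths:
- Novel: a systematic exploration of whether images can serve as a standalone reasoning medium.
- Broad experiments covering 5 benchmarks and 5 frontier models.
- Clear efficiency gains: fewer tokens with accuracy maintained or improved.
- The two variants show the benefits of compact encoding and of structured visual reasoning respectively.
- Open-source code supports reproduction.
Limitations:
- Depends on the visual understanding ability of the multimodal model; a weak visual encoder may hurt performance.
- Does not analyze in depth how image quality (resolution, rendering sharpness) affects reasoning.
- G-OR rendering relies on an external tool and on prompt design, which may add cost and instability.
- Experiments use English datasets only; Chinese and other languages are untested.
- No direct comparison with more complex interleaved-modal reasoning methods such as ICoT and MVoT.
Relevance To Keywords:
- Native multimodal large models: The paper uses multimodal large models directly for image-based reasoning.
- Unified understanding and generation in multimodal large models: Optical reasoning requires the model to read text and graphics in an image and generate an answer.
- Representation learning: Using images as a reasoning medium raises the question of how to encode reasoning steps visually.
- World models: The structured visual rationales built by G-OR can be loosely viewed as a simplified world-model representation.
- Reinforcement learning: Not involved directly, though optical reasoning could be combined with post-training to optimize reasoning ability.
- Post-training: Optical reasoning could serve as data augmentation or as a reasoning strategy during post-training to improve efficiency.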
摘要翻译
张量程序优化对于现代机器学习系统至关重要,但其搜索空间巨大。现有的自动调度器通过学习成本模型来降低测量成本,但它们通常将每个候选方案评估为静态代码快照,忽略了产生该方案的调度轨迹。这使得它们对动作依赖不敏感,且易受表面代码变化的影响。我们提出了一种受世界模型启发的评估器,它将调度评估建模为基于程序状态的动作条件化潜在动力学。从初始程序开始,它在连续潜在空间中利用轻量级转换模型展开调度动作,避免了昂贵的 AST 突变和重复代码编码。最终的动态表示与动作和硬件特征相结合,用于对候选方案进行排序。该方法在 TVM AutoScheduler 中实现,在相同的 64 次试验预算下,相比 Ansor,在 GPU 上将代表性子图延迟表现提升了 1.37 倍,在 CPU 上提升了 1.54 倍。它在使用 10 倍更少测量次数的情况下,几何平均性能与 Ansor-10K 的差异在 2.2% 以内,并且相比 PyTorch/PyTorch-opt(cuDNN),全模型推理加速了 4.61 倍/3.67 倍(几何平均)。
Abstract
Tensor program optimization is essential for modern machine learning systems, but its search space is enormous. Existing auto-schedulers reduce measurement cost with learned cost models, yet they usually evaluate each candidate as a static code snapshot, ignoring the schedule trajectory that produced it. This makes them insensitive to action dependencies and vulnerable to superficial code variations. We propose a world-model-inspired evaluator that models schedule evaluation as action-conditioned latent dynamics over program states. Starting from the initial program, it rolls out scheduling actions in a continuous latent space with a lightweight transition model, avoiding expensive AST mutation and repeated code encoding. The final dynamic representation is combined with action and hardware features to rank candidates. Implemented in TVM AutoScheduler, our method improves representative-subgraph latency over Ansor by 1.37× on GPU and 1.54× on CPU under the same 64-trial budget. It also matches Ansor-10K within 2.2% geometric mean using 10× fewer measurements, and accelerates full-model inference over PyTorch/PyTorch-opt(cuDNN) by 4.61×/3.67× geometric mean.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 9.0/10 | 13.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 9.0/10 | 13.5 |
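Scoring rationale: The title and abstract explicitly use the 'World Models' concept and evaluate schedule trajectories by modeling latent dynamics, which is closely related to the core ideas of 'World Models' and 'model-based RL'. The paper focuses on compiler optimization, however, and does not involve multimodality, large language models, tokenizers, or visual encoders, so 'Unify Models', 'Tokenizer', 'Visual Encoder', 'MLLM', and 'MultiModal' have very low relevance. The author list includes none of the designated experts.
Keywords
Compiler World Models, Latent Dynamics, Tensor Program Search, Schedule Trajectory, Action-conditioned, TVM AutoScheduler, Model-based RL, Efficient Optimization
In-Depth Analysis
Chinese Title: 迈向编译器世界模型:学习潜在动态以实现高效张量程序搜索
Summary: This paper targets two problems in tensor program optimization, the enormous search space and the cost of hardware measurement, and proposes a world-model-based framework for evaluating candidate programs. Conventional methods treat each candidate as a static code snapshot and ignore the step-by-step transformation produced by scheduling decisions, which leaves the model vulnerable to syntactic noise. Inspired by world models, the paper casts compiler optimization as an action-conditioned latent state transition process: the initial program state evolves step by step in a continuous latent space under a sequence of scheduling actions, and the final state representation is used to rank candidates. The framework is implemented in TVM AutoScheduler and consists of a state encoder, an action-conditioned transition model, and a ranking cost model. Experiments on an Intel Xeon Gold 6430 CPU and an NVIDIA RTX 4090 GPU show that, under the same 64-trial budget, the method improves representative-subgraph latency over Ansor by 1.37× on GPU and 1.54× on CPU. With 1K measurements it reaches the search quality of Ansor-10K, and on full-model inference it achieves up to 58.61× speedup over PyTorch.
Innovations:
- Brings a world-model view to tensor program optimization, modeling schedule evaluation as action-conditioned latent state evolution.
- Builds a state prediction dataset from TVM/TenSet that contains scheduling action sequences and intermediate state trajectories for learning compiler state transitions.
- Proposes a lightweight framework that simulates state transitions in a continuous latent space, avoiding explicit AST transformation and text encoding and thereby lowering evaluation cost.
- Implements the complete latent-dynamics evaluation framework in TVM AutoScheduler, including the state encoder, transition model, and ranking model.
- Shows experimentally that the method outperforms Ansor under the same search budget and reaches similar search quality with one tenth of the measurements.
Methodology: The paper adopts a world-model framework in which the compiler optimization environment is a dynamical system, TensorIR programs are states, and schedule transformations are actions. A state encoder first maps the initial program into a continuous latent space. An action-conditioned transition model then predicts the latent state after each scheduling action in turn. Finally, the predicted terminal state representation is combined with action features and hardware features, and a ranking cost model ranks the candidate programs. Training data come from TVM tuning logs, from which the authors build aligned trajectories containing the pre-schedule state, the action sequence, intermediate states, and the post-schedule state. The whole framework is implemented in TVM AutoScheduler, which uses the learned encoder, transition model, and ranking model for candidate evaluation.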
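The evaluation flow described above (encode once, roll out actions in latent space, then rank) can be sketched with NumPy. This is a toy illustration with random, untrained weights: the tanh transition, the mean-pooled action feature, the linear ranking head, and the class name `LatentEvaluator` are all assumptions, and nothing here calls the TVM API.

```python
import numpy as np

class LatentEvaluator:
    """Toy action-conditioned latent dynamics model with a linear ranking head."""

    def __init__(self, state_dim, action_dim, hw_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_state = rng.normal(scale=0.5, size=(state_dim, state_dim))
        self.w_action = rng.normal(scale=0.5, size=(action_dim, state_dim))
        self.w_rank = rng.normal(scale=0.5, size=state_dim + action_dim + hw_dim)

    def step(self, z, action):
        """One latent transition: next state from current state and action."""
        return np.tanh(z @ self.w_state + action @ self.w_action)

    def rollout(self, z0, actions):
        """Apply the scheduling actions in order, starting from the encoded program."""
        z = z0
        for action in actions:
            z = self.step(z, action)
        return z

    def score(self, z0, actions, hw):
        """Predicted score from terminal state, pooled action features, and hardware features."""
        z_final = self.rollout(z0, actions)
        features = np.concatenate([z_final, actions.mean(axis=0), hw])
        return float(features @ self.w_rank)

rng = np.random.default_rng(1)
evaluator = LatentEvaluator(state_dim=8, action_dim=4, hw_dim=3)
z0 = rng.normal(size=8)   # stands in for the encoded initial program
hw = rng.normal(size=3)   # stands in for hardware features
candidates = [rng.normal(size=(5, 4)) for _ in range(3)]  # three action sequences

scores = [evaluator.score(z0, actions, hw) for actions in candidates]
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

forward = evaluator.rollout(z0, candidates[0])
reverse = evaluator.rollout(z0, candidates[0][::-1])
```

Because the rollout is sequential, reordering the same actions yields a different terminal state, which is the order sensitivity that a static code snapshot cannot express.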
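Key Results:
- Representative-subgraph latency improves over Ansor by 1.37× (geometric mean) on GPU and by 1.54× on CPU.
- The method beats Ansor on 21 of 22 GPU subgraphs and 22 of 22 CPU subgraphs, with maximum speedups of 2.40× and 3.76× respectively.
- With 1K search trials, arithmetic-mean latency is slightly better than Ansor-10K and geometric-mean latency is within 2.2%, using 10× fewer measurements.
- Full-model inference achieves a 4.61× geometric-mean speedup over PyTorch (up to 58.61×) and a 3.67× geometric-mean speedup over PyTorch-opt(cuDNN) (up to 35.50×).
Tech Stack:
- TVM AutoScheduler
- TensorIR intermediate representation
- World Model
- Latent State Encoder
- Action-Conditioned Transition Model
- Ranking Cost Model
- LSTM or Transformer (for action-sequence modeling)
- TenSet dataset
- Intel Xeon Gold 6430 CPU
- NVIDIA RTX 4090 GPU
Strengths:
- Applies world models to compiler optimization, offering a new perspective on the problem.
- Addresses the sensitivity of static evaluation to syntactic noise, making evaluation more robust.
- Cuts the hardware measurement cost of auto-tuning, matching Ansor-10K with 10× fewer measurements.
- Validated on multiple hardware targets and models, with substantial speedups.
- The framework is lightweight and avoids explicit AST transformation, which suits practical deployment.
Limitations:
- The method depends on the TVM ecosystem and the TensorIR representation, so porting it to other compiler frameworks may be difficult.
- Training requires large volumes of tuning logs, which are costly to collect.
- The latent states have limited interpretability, so the model's internal decisions are hard to inspect.
- Experiments cover only CPU and GPU platforms, not other hardware such as NPUs or FPGAs.
- For very long schedule sequences, accumulated error in latent state prediction may reduce evaluation accuracy.
Relevance To Keywords:
- Unify Models: Multimodal unified models are not addressed, though the world-model framework could unify different representations (program state, actions, hardware features).
- World Models: The core contribution; the whole paper applies a world-model view to compiler optimization and learns latent state transition dynamics.
- Representation Learning: A core component; the state encoder learns a continuous latent representation of program state that suppresses syntactic noise.
- Model-Based RL: The method is closely related to model-based RL because a learned transition model simulates environment dynamics, although the paper does not explicitly use a reinforcement learning algorithm.
- Native multimodal large models: Not directly related, though latent representation learning shares ideas with multimodal representation learning.
- Unified understanding and generation in multimodal large models: Not directly related.
- Post-training: Not directly related.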
摘要翻译
斜视是一种常见的眼部疾病,需要细粒度亚型诊断以支持个体化治疗规划。然而,现有的深度学习方法主要提供诊断预测,却缺乏透明的推理过程;尽管近期的大视觉语言模型(LVLMs)在联合图像理解和报告生成方面展现出潜力,但在这一证据敏感且规则驱动的医学任务中,它们仍极易产生幻觉。为应对这些挑战,我们提出 MAGIS(一种基于证据的多智能体推理可解释斜视诊断框架)。MAGIS 将黑盒端到端生成转化为一个结构化的诊断过程,该过程包含候选假设生成、双证据约束上下文、基于证据的纠正验证以及报告生成四个阶段。具体而言,我们引入了一种双证据约束上下文(DECC)机制,该机制将来自九个标准注视位置照片的视觉证据与基于证据的临床诊断规则共同组织成约束上下文,以实现可靠的诊断推理。我们还开发了一种基于证据的纠正验证(EBCV)机制,用于验证当前的诊断假设是否得到视觉证据、基于热图的视觉线索以及基于证据的临床诊断规则的支持。当检测到不一致时,将触发假设细化过程。在细粒度斜视基准上的实验表明,MAGIS 不仅显著优于其他最先进的诊断系统,将加权 F1 分数从 72.0% 提升至 91.3%,还显著提高了所生成诊断报告的临床可靠性(包括一致性、对齐性和完整性)。这些结果表明,MAGIS 为构建准确、基于证据且具有临床可解释性的斜视诊断系统提供了一种有效方案。
Abstract
Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promising for joint image understanding and report generation, remain highly prone to hallucination in this evidence-sensitive and rule-driven medical task. To address these challenges, we propose MAGIS, an evidence-based Multi-AGent reasoning for Interpretable Strabismus diagnosis framework. MAGIS transforms black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, dual-evidence constrained context, evidence-based corrective verification, and report generation. Specifically, we introduce a Dual-Evidence Constrained Context (DECC) mechanism that jointly organizes visual evidence from the photograph of the nine cardinal positions of gaze and evidence-based clinical diagnostic rules into a constrained context for reliable diagnostic reasoning. We further develop an Evidence-Based Corrective Verification (EBCV) mechanism that verifies whether the current diagnostic hypothesis is supported by visual evidence, heatmap-based visual cues, and evidence-based clinical diagnostic rules. Hypothesis refinement is triggered when inconsistency is detected. Experiments on a fine-grained strabismus benchmark demonstrate that MAGIS not only significantly outperforms other state-of-the-art diagnostic systems, improving the weighted F1 score from 72.0% to 91.3%, but also substantially improves the clinical reliability (consistency, alignment, and completeness) of generated diagnostic reports. These results demonstrate that MAGIS provides an effective solution for building accurate, evidence-based, and clinically interpretable strabismus diagnosis systems.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 7.0/10 | 10.5 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
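Scoring rationale: The paper proposes MAGIS, an evidence-based multi-agent reasoning framework for strabismus diagnosis. 'MLLM' and 'MultiModal' are highly relevant because the paper builds on large vision-language models that process image and text data. 'Visual Encoder' is moderately relevant because visual feature extraction is involved. 'Unify Models', 'Tokenizer', 'World Models', and 'model-based RL' have low relevance because the paper does not address unified model architectures, tokenizer design, world model construction, or reinforcement learning algorithms. The author list includes none of the designated experts.
Keywords
Strabismus Diagnosis, Multi-Agent Reasoning, Evidence-Based, Interpretable Clinical Decision-Making, Large Vision-Language Models, Dual-Evidence Constrained Context, Visual Evidence, Report Generation
In-Depth Analysis
Chinese Title: MAGIS:基于证据的多智能体推理用于可解释的斜视临床决策
Summary: Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. Existing deep learning methods mainly output diagnostic predictions without transparent reasoning, while large vision-language models (LVLMs), although capable of joint image understanding and report generation, are highly prone to hallucination in this evidence-sensitive, rule-driven medical task. To address this, the paper proposes MAGIS, an evidence-based multi-agent framework for interpretable strabismus diagnosis. MAGIS turns black-box end-to-end generation into a structured diagnostic process consisting of candidate hypothesis generation, Dual-Evidence Constrained Context (DECC), Evidence-Based Corrective Verification (EBCV), and report generation. DECC jointly organizes visual evidence from the nine-gaze photograph and evidence-based clinical diagnostic rules into a constrained context. EBCV checks whether the current diagnostic hypothesis is supported by the visual evidence, heatmap-based visual cues, and the clinical rules, and triggers hypothesis refinement when it detects an inconsistency. On a fine-grained strabismus benchmark, MAGIS outperforms other state-of-the-art diagnostic systems (weighted F1 rises from 72.0% to 91.3%) and substantially improves the clinical reliability of the generated reports (consistency, alignment, and completeness).
Innovations:
- Turns LVLM-assisted strabismus diagnosis from black-box generation into an evidence-based, verifiable, structured process with four stages: hypothesis generation, evidence constraint, verification and correction, and report generation.
- Proposes the Dual-Evidence Constrained Context (DECC) mechanism, which jointly organizes visual evidence (the nine-gaze photograph) and clinical diagnostic rules into a reliable constraint for diagnostic reasoning.
- Proposes the Evidence-Based Corrective Verification (EBCV) mechanism, which verifies the diagnostic hypothesis against visual evidence, heatmap cues, and clinical rules and triggers iterative correction on inconsistency.
- Builds a multi-agent collaboration framework that splits the diagnostic roles into hypothesis generation, verification, and report generation agents, improving reliability and interpretability.
Methodology: The paper uses a multi-agent reasoning framework that decomposes strabismus diagnosis into four consecutive stages. A classifier first generates candidate diagnostic hypotheses. DECC then builds a dual-evidence constrained context containing visual evidence and clinical diagnostic rules. EBCV next verifies whether the hypothesis is consistent with the visual evidence and the rules, and triggers hypothesis refinement if it is not. Finally, a structured diagnostic report is generated. The visual evidence comes from the nine-gaze photograph, and the clinical rules are obtained through a formalization process involving physicians. The three agents (hypothesis generation, verification, report generation) cooperate to refine the result iteratively.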
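The verify-then-refine control flow described above can be sketched in plain Python. The sketch shows only the loop structure: the rule table, the subtype and finding names, and the functions `verify` and `diagnose` are hypothetical placeholders with no clinical meaning, and the real system uses LVLM agents rather than set lookups.

```python
def verify(hypothesis, evidence, rules):
    """Return the findings required by the rule for this hypothesis
    that are missing from the observed evidence."""
    return [finding for finding in rules.get(hypothesis, []) if finding not in evidence]

def diagnose(candidates, evidence, rules, max_rounds=3):
    """Try candidate hypotheses in ranked order until one passes verification.

    Returns the accepted hypothesis (or None) and a trace of each check,
    which could feed a structured report.
    """
    trace = []
    for hypothesis in candidates[:max_rounds]:
        missing = verify(hypothesis, evidence, rules)
        trace.append((hypothesis, missing))
        if not missing:
            return hypothesis, trace  # supported by all required findings
    return None, trace  # no candidate was fully supported

# Placeholder rule table: subtype -> findings that must be present.
rules = {
    "subtype_A": ["finding_1", "finding_2"],
    "subtype_B": ["finding_2", "finding_3"],
}
evidence = {"finding_2", "finding_3"}    # findings extracted from the images
candidates = ["subtype_A", "subtype_B"]  # ranked output of the hypothesis stage

accepted, trace = diagnose(candidates, evidence, rules)
```

Here the top-ranked candidate is rejected because one required finding is absent, and the second candidate is accepted; the trace records why.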
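Key Results:
- On the fine-grained strabismus benchmark, MAGIS raises the weighted F1 score from 72.0% for the best existing method to 91.3%.
- The generated diagnostic reports are rated well above those of other LVLMs and traditional methods on clinical consistency, visual alignment, and contextual completeness.
- Ablation studies confirm the individual contributions of DECC and EBCV.
- MAGIS is shown to reduce hallucinations such as misclassification and clinically unsupported output.
Tech Stack:
- Large vision-language models (LVLMs): Gemini-3-Flash-Preview, GPT-5.2, Qwen3-VL-Plus
- Multi-agent framework: hypothesis generation agent, verification agent, report generation agent
- Dual-Evidence Constrained Context (DECC)
- Evidence-Based Corrective Verification (EBCV)
- Heatmap-based visual cues
- Clinical diagnostic rules (formalized with physician involvement)
- Weighted F1 score for evaluation
Strengths:
- Substantially improves the accuracy and interpretability of strabismus subtype diagnosis, addressing both black-box models and LVLM hallucination.
- Models medical diagnosis as an evidence-driven structured process that follows clinical reasoning, which strengthens clinical trust.
- Multi-agent collaboration and iterative verification suppress hallucination and make the output more reliable.
- Achieves a large performance gain on a real clinical benchmark, indicating practical value.
Limitations:
- Relies on physicians to formalize the clinical diagnostic rules; the completeness and accuracy of the rule base may limit generalization.
- Multi-agent iterative verification may increase inference time and compute; real-time performance has yet to be assessed.
- Experiments cover strabismus only, with no tests on other ophthalmic or medical domains.
- Extraction of visual evidence (such as heatmaps) may be limited by image quality or annotation precision.
Relevance To Keywords:
- Native multimodal large models: The paper uses LVLMs as base models, but its core contribution is the reasoning framework rather than the model itself; relevance is moderate.
- World models: The concept is not involved; relevance is low.
- Representation learning: Visual feature extraction (heatmaps) is touched on indirectly, but representation learning is not a focus; relevance is fairly low.
- Model-based RL: Neither reinforcement learning nor model-based methods are used; relevance is low.
- Post-training: The post-training stage is not involved; relevance is low.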
摘要翻译
大多数现有的多曝光高动态范围(HDR)方法遵循固定的前馈重建范式,导致其在复杂动态场景中易于产生鬼影伪影。为了解决这一问题,我们提出了 HDRAgent,这是首个用于 HDR 成像的智能体驱动框架,能够根据当前场景条件自适应地选择重建策略。具体而言,为了提供场景特定的先验知识,我们引入了一个细粒度的上下文知识匹配(FCM)模块。该模块利用多模态大语言模型(MLLM)衍生的场景感知来检索相关的历史案例和工具知识,并将其组织成结构化证据,以支持基于 MLLM 的自适应工具调度。此外,我们提出了一种感知 - 失真反馈机制,将执行后的质量评估和伪影诊断转化为结构化反馈,并将其累积于历史记忆中,以帮助后续的上下文知识精炼和策略选择。再者,考虑到极端运动可能导致对齐方法失效,我们设计了一种智能体引导的生成式对齐策略,该策略利用基于 MLLM 的动态区域解析,在参考帧指导下重建非参考帧中的不可靠内容。实验表明,HDRAgent 能有效减少鬼影和局部伪影,同时实现具有竞争力或更优的客观性能及视觉质量。
Abstract
Most existing multi-exposure HDR methods follow a fixed feed-forward reconstruction paradigm, making them prone to ghosting artifacts in complex dynamic scenes. To address this issue, we propose HDRAgent, the first agent-driven framework for HDR imaging, which adaptively selects reconstruction strategies according to the current scene conditions. Specifically, to provide scene-specific prior knowledge, we introduce a fine-grained contextual knowledge matching (FCM) module. This module leverages multimodal large language model (MLLM)-derived scene perception to retrieve relevant historical cases and tool knowledge, organizing them into structured evidence for MLLM-based adaptive tool scheduling. In addition, we propose a perception-distortion feedback mechanism that transforms post-execution quality assessment and artifact diagnosis into structured feedback, which is accumulated in historical memory to help subsequent contextual knowledge refinement and strategy selection. Furthermore, considering that extreme motion can invalidate alignment methods, we design an agent-guided generative alignment strategy that uses MLLM-based dynamic-region parsing to reconstruct unreliable contents in non-reference frames under reference-frame guidance. Experiments demonstrate that HDRAgent effectively reduces ghosting and local artifacts while achieving competitive or superior objective performance and visual quality.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
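评分理由: The paper's core is an MLLM-based agent framework for HDR imaging, so MLLM relevance is very high (9). MultiModal is moderately relevant (6) because the work involves multimodal models and multi-exposure image processing. Visual Encoder is implicitly present as an MLLM component but is not a research focus (3). The remaining keywords (Unify Models, Tokenizer, World Models, model-based RL) are only weakly related (1-2) to the paper's core content (agent scheduling, generative alignment, historical memory); although the paper uses agent concepts, it is not a conventional model-based RL or world-model architecture. The weighted total is 36.0, above the dynamic passing score of 27.8. The author list does not include Yang Shi or the other specified experts, so no bonus applies.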
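The weighted total in the score table can be reproduced with a few lines. This sketch assumes the total is the sum of weight × relevance over all keywords, which matches the table's per-row scores; the variable names are illustrative.

```python
WEIGHT = 1.5           # every keyword in the table uses the same weight
PASS_THRESHOLD = 27.8  # dynamic passing score quoted in the rationale

# Relevance values (out of 10) copied from the HDRAgent score table.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 3.0,
    "World Models": 1.0,
    "MLLM": 9.0,
    "MultiModal": 6.0,
    "model-based RL": 2.0,
}

def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance across all keywords."""
    return sum(weight * r for r in relevance.values())

total = weighted_total(relevance)
print(total, total > PASS_THRESHOLD)
```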
关键词
HDRAgent, Agentic Framework, Multi-Exposure HDR Imaging, MLLM, Adaptive Tool Scheduling, Generative Alignment, Ghosting Artifacts
深度分析
Chinese Title: HDRAgent:一种用于多曝光高动态范围成像的智能体框架
Summary: 本文提出HDRAgent,首个基于智能体的多曝光HDR成像框架。现有方法采用固定前馈重建范式,在复杂动态场景中易产生鬼影伪影。HDRAgent将HDR重建重构为感知-规划-执行-反馈的闭环智能体过程:首先利用多模态大语言模型(MLLM)进行场景感知,通过细粒度上下文知识匹配(FCM)模块组织历史案例、工具知识和反馈记忆作为结构化决策证据;然后由MLLM动态路由器自适应选择对齐与融合策略;执行后通过感知-失真反馈机制将质量评估和伪影诊断转化为结构化反馈,积累到历史记忆中用于后续策略优化。针对极端运动导致对齐失效的问题,设计了智能体引导的生成式对齐策略,利用MLLM动态区域解析在参考帧引导下重建不可靠内容。实验表明,HDRAgent有效减少鬼影和局部伪影,在客观指标和视觉质量上达到或超越现有方法。
Innovations:
- 首次将智能体驱动范式引入多曝光HDR重建,将固定前馈流程变为感知-规划-执行-反馈的动态决策过程。
- 提出细粒度上下文知识匹配(FCM)模块,将场景描述、工具适用性、历史失败案例和反馈经验组织为结构化证据,支持MLLM动态路由。
- 设计感知-失真反馈机制,将执行后的质量评估和伪影诊断转化为结构化反馈,实现迭代修正。
- 提出智能体引导的生成式对齐策略,针对极端运动区域,通过MLLM动态区域解析和参考帧引导的掩码生成来重建不可靠内容。
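Methodology: HDRAgent uses a closed-loop agent architecture: 1) Scene perception: Qwen3-VL-Plus analyzes the multi-exposure inputs for motion, exposure variation, saturation, occlusion, and similar conditions; 2) FCM module: scene state, expert tool knowledge, historical context, and feedback memory are organized into structured evidence; 3) Dynamic routing: Qwen3.6-Plus selects alignment and fusion/reconstruction tools based on that evidence; 4) Toolbox execution: predefined alignment, fusion, and generation tools are invoked; 5) Feedback evaluation: a perception-distortion feedback evaluator assesses quality and diagnoses artifacts in the reconstruction, then writes the feedback back to the context memory for later routing and correction. For extreme motion, the MLLM identifies the dynamic foreground and produces segmentation prompts, and segmentation and generation tools are called to perform reference-guided masked generation.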
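The perceive-route-execute-feedback loop described above can be sketched as a toy control flow. Everything here is hypothetical: the tool names, the rule-based `route` function standing in for the MLLM router, and the fake `execute` outcome are placeholders chosen so the loop runs, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    feedback: list = field(default_factory=list)  # accumulated structured feedback

def perceive(frames):
    # Stand-in for MLLM scene perception; returns a hypothetical scene state.
    return {"motion": "large" if len(frames) > 2 else "small"}

def route(scene, memory):
    # Stand-in for the MLLM router: skip tools that earlier feedback flagged.
    flagged = {fb["tool"] for fb in memory.feedback if fb["ghosting"]}
    preferred = ["flow_align", "generative_align"] if scene["motion"] == "large" else ["flow_align"]
    for tool in preferred:
        if tool not in flagged:
            return tool
    return preferred[-1]

def execute(tool, frames):
    # Toy executor: pretend flow alignment leaves ghosting under large motion.
    return {"tool": tool, "ghosting": tool == "flow_align" and len(frames) > 2}

def run(frames, memory, max_rounds=3):
    scene = perceive(frames)
    result = None
    for _ in range(max_rounds):
        tool = route(scene, memory)
        result = execute(tool, frames)
        memory.feedback.append(result)  # feedback is written back to memory
        if not result["ghosting"]:
            break
    return result

memory = Memory()
result = run(["under", "reference", "over"], memory)
print(result["tool"], len(memory.feedback))
```

The point of the sketch is the data flow: feedback from a failed attempt changes the next routing decision without retraining anything.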
Key Results:
- 在多个基准数据集上,HDRAgent在客观指标(如PSNR、SSIM、HDR-VDP-2)上达到或超越现有方法。
- 在复杂动态场景中,有效减少鬼影和局部融合伪影,视觉质量优于SAFNet、LFDiff、AFUNet、DDPF-PR等方法。
- 通过反馈机制,能够逐步修正残留伪影,提升重建质量。
- 智能体引导的生成式对齐策略在极端运动场景下恢复出参考帧一致的结构。
Tech Stack:
- 多模态大语言模型:Qwen3-VL-Plus(场景感知)、Qwen3.6-Plus(动态路由)
- 细粒度上下文知识匹配(FCM)模块
- 感知-失真反馈机制(质量评估与伪影诊断)
- 智能体引导的生成式对齐策略(MLLM动态区域解析 + 分割工具 + 生成工具)
- 工具箱:对齐工具(如光流、注意力对齐)、融合工具(CNN/Transformer融合)、生成工具(扩散模型等)
- 评估指标:PSNR、SSIM、HDR-VDP-2
Strengths:
- 创新性地将智能体范式引入HDR成像,突破了固定前馈范式的局限,实现了场景自适应策略选择。
- 设计了完整的闭环反馈机制,能够迭代修正伪影,提升鲁棒性。
- 针对极端运动提出了专门的生成式对齐策略,扩展了适用场景。
- 实验充分,在多个数据集上验证了有效性和优越性。
Limitations:
- 依赖MLLM的推理能力,计算开销较大,实时性可能受限。
- 工具箱中的工具需要预先定义和训练,框架的泛化性受限于工具集。
- 反馈机制的有效性依赖于评估器的准确性,可能对某些细微伪影诊断不足。
- 未讨论在极端低光或高动态范围场景下的性能边界。
Relevance To Keywords:
- 原生多模态大模型:论文使用Qwen3-VL-Plus和Qwen3.6-Plus作为核心感知和决策模块,体现了多模态大模型在视觉任务中的应用。
- 多模态大模型的理解和生成一体化:MLLM同时用于场景理解(感知)和策略生成(路由),但生成部分主要依赖外部工具,未实现完全一体化。
- 表征学习:论文未直接涉及表征学习,但FCM模块可视为对场景和工具知识的表征组织。
- 世界模型:智能体框架中的上下文知识匹配和反馈记忆可类比为对HDR重建世界的建模,但未明确构建世界模型。
- 强化学习:论文中的反馈机制类似于强化学习中的奖励/惩罚信号,但未使用强化学习训练,而是基于MLLM推理。
- 后训练:论文未涉及后训练技术。
- 总体相关性中等:论文主要贡献在HDR成像应用,但智能体框架与多模态大模型、世界模型、强化学习等概念有交叉,可作为这些方向在图像恢复领域的应用示例。
摘要翻译
我们提出了 AliyunConsoleAgent,这是一个用于真实世界的云控制台自动化文档验证的 Web 代理框架。主流云平台涵盖数百种产品,且功能迭代迅速,导致控制台用户界面经常与其对应的文档发生偏离。验证文档流程是否准确反映当前控制台并能端到端执行,估计每年需要约 400 万次重复检查,但人工覆盖率仍低于 1%。尽管基于前沿专有模型构建的代理系统实现了高成功率,但其高昂成本和数据隐私限制阻碍了大规模部署。我们提出一种两阶段训练范式:首先在蒸馏的前沿模型轨迹上进行监督微调 (SFT),随后在真实云环境中使用组相对策略优化 (GRPO) 和双通道结果奖励模型进行强化学习 (RL)。为了支持大规模强化学习训练,我们构建了一个高确定性 rollout 系统,该系统基于 Terraform 的资源预配置和 LLM 驱动的按需配置,能有效将环境噪声与训练信号隔离。我们进一步引入了一种基于规则的奖励评估协议,该协议基于后端审计日志,提供客观且抗奖励操纵的结果判定。我们的模型从机械式指令跟随演进为具备云控制台及特定产品理解的自主决策。在包含 278 个任务的具有挑战性的基准测试中(其中最佳前沿模型仅达到 65.34%),实验表明 AliyunConsoleAgent-32B 实现了 63.52% 的平均成功率——相比基线模型提升了 20.24 个百分点,将差距缩小至最佳前沿专有模型仅 1.82 个百分点(bootstrap 95% CI [-1.27, 7.39]),且推理成本降低了 92%。
Abstract
We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 6.0/10 | 9.0 |
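评分理由: The paper's core is training a cloud-console web agent through distillation and reinforcement learning, not unifying model architectures or designing encoders, so Unify Models, Tokenizer, Visual Encoder, and World Models have low relevance. MLLM and MultiModal are moderately relevant because UI interaction involves vision and language. model-based RL is scored relatively high because reinforcement learning is the core training method, consistent with the background setting. None of the specified experts appear in the author list.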
关键词
Web Agents, Reinforcement Learning, Distillation, Cloud Console, LLM-based, Policy Optimization, Automated Verification
深度分析
Chinese Title: 阿里云控制台智能体:通过蒸馏和强化学习在真实云环境中训练网络智能体
Summary: 论文提出AliyunConsoleAgent,一个用于在真实云控制台中进行自动化文档验证的智能体框架。针对云平台文档与UI频繁脱节的问题,传统人工验证覆盖率不足1%,而前沿模型成本高且存在隐私风险。作者采用两阶段训练范式:首先通过监督微调(SFT)蒸馏前沿模型的高质量轨迹,然后使用组相对策略优化(GRPO)和双通道结果奖励模型在真实云环境中进行强化学习。为支持大规模RL训练,构建了高确定性rollout系统,包括基于Terraform的资源预置备和LLM驱动的按需置备,有效隔离环境噪声。实验在278个任务的基准上,AliyunConsoleAgent-32B达到63.52%的平均成功率,比基础模型提升20.24个百分点,与最佳前沿模型差距仅1.82个百分点,同时推理成本降低92%。
Innovations:
- 构建高确定性rollout环境,结合Terraform预置备和ResourceCoder按需置备,并利用ActionTrail审计日志进行客观、抗奖励黑客的验证。
- 提出SFT+GRPO两阶段训练范式,结合双通道结果奖励模型,在真实生产云环境中训练轻量级私有智能体。
- 实现生产级云文档验证部署,使用前沿模型审计54000+文档,发现4399个确认缺陷,并证明训练后的智能体可作为隐私合规的替代方案,推理成本降低92%。
- 开源评估基准、训练数据、rollout环境基础设施和模型训练代码,促进可重复研究。
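Methodology: The paper formalizes the agent's interaction with the cloud console as a partially observable Markov decision process (POMDP) and uses a ReAct-style observe-reason-act loop. Observations consist of SoM-annotated screenshots, DOM text, and the action history. The action space covers GUI interactions (executed through Playwright), on-demand provisioning actions (ResourceCoder), task completion, and an unreachable signal. Training has two stages: stage one is supervised fine-tuning (SFT) on trajectories distilled from frontier models; stage two is reinforcement learning with Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model based on ActionTrail audit logs and LLM evaluation. The rollout environment has a four-layer architecture: account pool management, sandboxed execution, offline provisioning from resource META (Terraform template generation), and runtime on-demand provisioning.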
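The group-relative part of GRPO can be shown in isolation. In the standard formulation, several rollouts are sampled for the same task and each rollout's reward is normalized against the group's mean and standard deviation. The sketch below assumes scalar outcome rewards and a population standard deviation; how the paper combines its two reward channels into one scalar is not stated here, so that step is left out.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical outcome rewards for 8 rollouts of one console task (1 = verified success).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in adv])
```

When every rollout in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is one reason a low-noise rollout environment matters for this kind of training.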
Key Results:
- 在278个任务的基准上,AliyunConsoleAgent-32B(SFT+GRPO)平均成功率为63.52%,比基础模型(Qwen3-VL-32B)提升20.24个百分点。
- 与最佳前沿模型Gemini 3 Pro Preview的差距仅1.82个百分点,bootstrap 95%置信区间包含零(p>0.05),表明性能无显著差异。
- 推理成本降低92%,实现高性能低成本私有部署。
- 生产部署中,使用前沿模型审计54000+文档,发现4399个确认缺陷,被产品团队接受。
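A bootstrap confidence interval for a success-rate gap, like the one quoted above, can be sketched as follows. This is a generic percentile bootstrap that resamples tasks in pairs; the paper's exact resampling scheme is not given here, and the per-task outcomes are synthetic.

```python
import random

def bootstrap_ci_diff(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling tasks in pairs."""
    assert len(a) == len(b)
    rng = random.Random(seed)
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic pass/fail outcomes per task, for illustration only.
data_rng = random.Random(1)
agent = [1 if data_rng.random() < 0.60 else 0 for _ in range(50)]
frontier = [1 if data_rng.random() < 0.65 else 0 for _ in range(50)]
lo, hi = bootstrap_ci_diff(frontier, agent)
print(f"95% CI for the gap: [{lo:.3f}, {hi:.3f}]")
```

If the interval contains zero, the data do not rule out equal success rates, which is the reading given to the 1.82-point gap above.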
Tech Stack:
- Qwen3-VL(基础视觉语言模型)
- GRPO(组相对策略优化)
- PPO、DPO(强化学习方法)
- ReAct(推理-行动范式)
- Set-of-Mark (SoM) 视觉提示
- Terraform(基础设施即代码)
- Playwright(浏览器自动化)
- ActionTrail(阿里云审计日志服务)
- ResourceCoder(LLM驱动的代码智能体)
- Kubernetes (ACK) 容器编排
- StatefulSet(有状态应用管理)
Strengths:
- 在真实生产云环境中进行强化学习,而非模拟器,具有实际应用价值。
- 通过资源预置备和按需置备有效隔离环境噪声,确保训练信号纯净。
- 使用后端审计日志进行客观奖励评估,避免奖励黑客问题。
- 成本效益显著,使大规模私有部署成为可能。
- 开源关键组件,促进社区复现和进一步研究。
Limitations:
- 方法高度依赖阿里云特定平台和工具(如ActionTrail、Terraform模板),泛化到其他云平台可能需要适配。
- 基准任务仅278个,规模有限,可能无法全面反映真实场景的复杂性。
- 未讨论模型在未见过的文档或全新产品上的零样本泛化能力。
- 训练和评估依赖于人工标注的文档和资源META,构建成本较高。
Relevance To Keywords:
- 原生多模态大模型:使用Qwen3-VL作为基础模型,处理截图和DOM文本等多模态输入。
- 多模态大模型的理解和生成一体化:智能体通过SoM标注理解UI,并生成行动指令。
- 表征学习:通过SFT和RL学习云控制台UI和资源状态的表征。
- 世界模型:rollout环境中的资源状态管理可视为对云环境动态的建模。
- 强化学习:核心训练方法为GRPO,在真实环境中优化策略。
- 后训练:SFT+RL的两阶段后训练范式是论文的核心贡献。
摘要翻译
在线无任务持续学习(TFCL)要求智能体在严格单遍约束下,且无需任何显式任务标识,即可从无界、非平稳的数据流中顺序积累知识。现有的在线 TFCL 范式主要依赖于参数高效的提示微调或动态结构扩展,这些过程由训练耦合优化动态驱动,例如经验损失波动或潜在距离的演变。因此,这些训练耦合求解器对分布漂移的结构根源缺乏认知,机械地在本质上不同的流式变化上强制执行固定策略。为填补这一空白,我们提出 LargeMonitor 框架,该框架利用大型预训练基础模型来自主协调无任务的持续适应。具体而言,LargeMonitor 引入一个解耦检测模块,利用大型视觉模型(LVMs)冻结且稳定的表示空间,以实现鲁棒的零样本漂移检测,且无需训练依赖的干扰或脆弱的阈值调优。一旦确认发生漂移,该框架便会激活一个由大型多模态模型(LMMs)驱动的上下文感知诊断模块,以解释流式变化的精确语义病因(例如,新类别涌现与环境领域漂移)。这种双阶段能力使持续学习器能够动态部署适应性和针对特定漂移的优化策略。在多个 TFCL 设置和基准上的广泛实验表明,LargeMonitor 实现了对复杂数据流的精确且鲁棒的检测与诊断,同时持续提升现有在线 TFCL 算法的性能。
Abstract
Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 7.0/10 | 10.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 7.0/10 | 10.5 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
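评分理由: The paper proposes the LargeMonitor framework, which uses frozen large vision models (LVMs) and large multimodal models (LMMs) for drift detection and diagnosis in continual learning. Visual Encoder (corresponding to LVMs) and MLLM (corresponding to LMMs) are therefore fairly relevant (7), and MultiModal is moderately relevant (6) because LMMs are used. The paper does not address unified model architectures (Unify Models), tokenizer design, world-model construction, or reinforcement learning, so those keywords score very low or 0. The author list does not include Yang Shi or the other specified experts, so no bonus applies. The weighted total is 34.5, above the dynamic passing score of 27.8.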
关键词
Online Task-Free Continual Learning, Large Pretrained Models, Drift Detection, Semantic Diagnosis, Large Vision Models, Large Multimodal Models, Distribution Shift, Zero-shot
深度分析
Chinese Title: LargeMonitor: 通过大型预训练模型监控在线任务无关持续学习
Summary: 本文提出LargeMonitor框架,旨在解决在线任务无关持续学习(TFCL)中现有方法依赖训练耦合信号(如损失波动)进行漂移检测、无法区分漂移类型的问题。LargeMonitor利用大型预训练模型实现解耦的零样本漂移检测:使用冻结的大型视觉模型(LVM)提取稳定特征,通过CKA相似度和CUSUM过程实时监测分布漂移,无需阈值调优。检测到漂移后,激活大型多模态模型(LMM)诊断漂移原因(如新类出现、域偏移、数据损坏),从而触发自适应优化策略。在多种TFCL设置(包括类增量、域增量、混合漂移)上的实验表明,LargeMonitor实现了精确鲁棒的检测与诊断,并持续提升了现有在线TFCL算法的性能。
Innovations:
- 提出解耦的零样本漂移检测机制,利用冻结的LVM特征空间,不依赖训练信号或阈值调优。
- 首次在在线TFCL中引入联合检测与诊断范式,利用LMM对漂移原因进行语义分类。
- 根据诊断结果动态触发自适应策略(如模型扩展、记忆调整),实现差异化响应。
- 在多种TFCL基准(包括异构漂移)上全面评估,证明框架的通用性和性能提升。
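Methodology: LargeMonitor has two modules, detection and diagnosis. The detection module extracts features for the current batch with a frozen LVM (such as DINO or CLIP), computes linear CKA similarity against features in a FIFO history buffer, and runs an online CUSUM process (based on a rolling median and MAD) to detect sustained drops in similarity. Once drift is detected, the diagnosis module uses an LMM (such as LLaVA) to analyze the current batch semantically, producing a natural-language description and classifying the drift type (novel class, domain shift, data corruption, and so on). Based on the diagnosis, the system selects a matching optimization strategy (for example, expanding model capacity when novel classes appear, or adjusting the memory buffer under domain shift).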
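Linear CKA, the similarity measure named above, can be written in a few lines of plain Python. The sketch compares two feature matrices with the same number of rows; how the paper pairs current-batch samples with buffered samples is not stated here, so equal row counts are an assumption, and the toy features are placeholders.

```python
def center(X):
    """Subtract the column mean from every feature."""
    cols = len(X[0])
    means = [sum(row[j] for row in X) / len(X) for j in range(cols)]
    return [[row[j] - means[j] for j in range(cols)] for row in X]

def gram(X):
    """Linear-kernel Gram matrix X X^T."""
    return [[sum(a * b for a, b in zip(u, v)) for v in X] for u in X]

def frob_inner(K, L):
    """Frobenius inner product of two equally sized matrices."""
    return sum(k * l for kr, lr in zip(K, L) for k, l in zip(kr, lr))

def linear_cka(X, Y):
    K, L = gram(center(X)), gram(center(Y))
    return frob_inner(K, L) / (frob_inner(K, K) * frob_inner(L, L)) ** 0.5

# Toy feature matrices (4 samples each), for illustration only.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
Y = [[0.0, 1.0], [1.0, 0.0], [3.0, 3.0], [0.0, 0.0]]
print(round(linear_cka(X, X), 3), round(linear_cka(X, Y), 3))
```

CKA is invariant to isotropic rescaling of the features, so the score reflects changes in representational geometry rather than changes in feature magnitude.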
Key Results:
- LargeMonitor在多种TFCL设置下实现了精确的漂移检测,误报率低,无需手动阈值调整。
- 诊断模块能够准确区分新类出现、域偏移和数据损坏等不同漂移类型。
- 集成LargeMonitor后,现有在线TFCL算法(如MVP、ODDL、Online-LoRA)的性能得到持续提升。
- 在异构漂移场景中,自适应策略比统一策略带来更显著的性能增益。
Tech Stack:
- 大型视觉模型(LVM):DINO、CLIP视觉编码器
- 大型多模态模型(LMM):LLaVA
- 中心核对齐(CKA):用于度量特征空间相似性
- 累积和(CUSUM):在线单侧检测过程
- 滚动中位数和MAD(中位数绝对偏差):用于动态阈值设定
- FIFO滑动内存缓冲区:存储历史特征
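The CUSUM step with a rolling median and MAD can be sketched as a one-sided detector over a stream of similarity scores. The window size, slack, and decision threshold below are illustrative constants chosen for this example, not values from the paper, and the similarity stream is synthetic.

```python
from statistics import median

def cusum_drop_detector(similarities, window=5, slack=0.5, threshold=4.0):
    """Return the first index where the cumulative drop statistic exceeds threshold."""
    history, s = [], 0.0
    for t, x in enumerate(similarities):
        if len(history) >= window:
            recent = history[-window:]
            med = median(recent)
            mad = median(abs(v - med) for v in recent) or 1e-6  # guard against zero MAD
            z = (med - x) / (1.4826 * mad)  # positive when similarity falls
            s = max(0.0, s + z - slack)     # one-sided CUSUM update
            if s > threshold:
                return t
        history.append(x)
    return None

# Synthetic CKA stream: stable around 0.95, then a drop at index 7.
stream = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.60, 0.58, 0.61]
print(cusum_drop_detector(stream))
```

Scaling deviations by the rolling MAD makes the statistic relative to the stream's own recent variability, which is the usual motivation for avoiding a fixed absolute threshold on similarity.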
Strengths:
- 解耦设计使检测不依赖训练过程,避免模型更新带来的信号漂移。
- 零样本检测无需数据集特定调优,泛化性强。
- 语义诊断提供可解释性,支持差异化自适应策略。
- 实验覆盖多种TFCL场景,验证了框架的鲁棒性和通用性。
Limitations:
- 依赖大型预训练模型(LVM和LMM),计算开销和存储需求较高,可能不适合资源受限设备。
- 诊断准确性受LMM能力限制,对细微或复杂漂移可能误判。
- 未讨论实时性要求,CUSUM和LMM推理可能引入延迟。
- 框架有效性依赖于下游TFCL算法的配合,未完全解耦。
Relevance To Keywords:
- Unify Models: 论文使用统一的大型预训练模型(LVM+LMM)作为监控器,体现了模型统一的思想。
- World Models: LMM对漂移原因的语义诊断可视为对数据流世界状态的建模。
- Representation Learning: 利用LVM的稳定表征进行CKA相似度分析,属于表征学习应用。
- Model-Based RL: 诊断结果指导自适应策略选择,类似于基于模型的强化学习中的状态识别与策略切换。
- 原生多模态大模型: 直接使用LLaVA等原生多模态模型进行视觉-语言推理。
- 多模态大模型的理解和生成一体化: LMM同时理解图像并生成诊断文本。
- 后训练: 论文使用预训练模型(冻结),不涉及后训练,但框架可集成后训练策略。
摘要翻译
一个有用的手机智能体需要具备个人智能。它应基于设备上存在的用户身份、历史和偏好进行推理,而不仅仅是在一个非个性化的沙箱中遵循孤立的指令。现有的移动智能体基准测试缺乏这种个性化。我们介绍 iOSWorld,这是一个围绕持久用户身份构建的首个交互式原生 iOS 模拟器基准测试,涵盖 26 个新构建的 iOS 应用。这些应用包含关联数据,如交易、消息、旅行记录、社交关系和财务活动。iOSWorld 包含 133 个任务,分为三个难度递增的类别。单应用任务(27 个)测试单个应用,多应用任务(60 个)跨越 2 到 8 个应用,而记忆与个性化任务(46 个)要求智能体从个人数据中推断模式。我们在仅视觉和特权视觉 +XML 设置下评估前沿和开源的计算机使用模型。最佳配置总体达到 52%,但在多应用任务上仅为 37%。特权视觉 +XML 访问使前沿模型提高了多达 26 个百分点,而较小模型并未从添加的无障碍树 (accessibility-tree) 输入中受益。我们发布 iOSWorld 作为一个开源基准测试,包含所有应用、种子数据、任务、评分标准和评估代码。
Abstract
A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
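评分理由: The paper's main contribution is the iOSWorld benchmark for evaluating personalized phone agents, not a new model architecture or algorithm. Keywords tied to core technical components (Tokenizer, Visual Encoder, Unify Models, model-based RL) therefore score low. The paper evaluates MLLMs with multimodal input (vision+XML), and its name contains 'World', so MLLM, MultiModal, and World Models receive moderate scores. The author list does not include Yang Shi or the other specified experts, so no bonus applies. The weighted total is 34.5, above the dynamic passing score of 27.8.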
关键词
iOSWorld, Phone Agents, Personalization, Multi-app Tasks, Vision+XML, Benchmark, Computer-use Models
深度分析
Chinese Title: iOSWorld:面向个性化智能手机代理的基准测试
Summary: 本文提出了iOSWorld,首个基于原生iOS模拟器的交互式基准测试,旨在评估具备个性化智能的手机代理。现有移动代理基准缺乏对用户身份、历史记录和偏好的推理能力,而iOSWorld通过构建26个共享同一用户身份(Jordan Avery)的iOS应用,填充了跨应用关联数据(如交易、消息、旅行记录等),设计了133个任务,涵盖单应用(27个)、多应用(60个)和记忆/个性化(46个)三类。研究评估了多个前沿模型和开源基线(Qwen3.5 35B-A3B)在纯视觉和视觉+XML两种设置下的表现。最佳配置整体成功率为52%,多应用任务仅37%。视觉+XML特权访问可将前沿模型性能提升最多26个百分点,但小模型未受益。论文开源了所有应用、数据、任务、评估代码及AWS运行器。
Innovations:
- 首个基于原生iOS模拟器的交互式基准测试,支持持久用户身份和跨应用关联数据。
- 设计了133个任务,涵盖单应用、多应用和记忆/个性化三类,难度递增。
- 采用LLM-as-a-Judge评估框架,经人工验证达到较高一致性(κ=0.77)。
- 比较了多种模型在纯视觉和视觉+XML两种观察模态下的表现,揭示了特权访问对前沿模型的提升作用。
- 开源了完整的应用、种子数据、任务、评分标准和评估代码,并提供了AWS运行器以支持非Mac研究者。
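Methodology: The paper builds 26 native iOS apps populated with a single user identity and cross-app linked data. The environment is modeled as a partially observable Markov decision process (POMDP): the agent receives observations as screenshots (vision-only) or screenshots plus the accessibility tree (vision+XML) and performs actions such as tapping, typing, and swiping. Tasks are evaluated with an LLM-as-a-Judge framework in which GPT-5.4-Mini makes a binary pass/fail judgment over the full trajectory, validated against human review. Experiments compare several frontier models (such as Claude Opus 4.6) and an open-source baseline (Qwen3.5 35B-A3B) under both modalities.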
Key Results:
- 最佳配置(Opus 4.6 + 视觉+XML)整体成功率为52%,单应用82%,记忆任务54%,多应用仅37%。
- 视觉+XML特权访问可将前沿模型性能提升最多26个百分点,但小模型(如Qwen3.5)未受益。
- 多应用任务(跨2-8个应用)对所有模型最具挑战性,成功率普遍较低。
- LLM-as-a-Judge与人工评估的一致性较高(κ=0.77,准确率89%)。
- 所有模型在纯视觉设置下表现均低于视觉+XML设置,表明可访问性树对复杂任务至关重要。
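The judge-versus-human agreement figure (κ) is Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal sketch, using made-up pass/fail labels rather than the benchmark's data:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters' labels over the same items."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n                      # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)   # chance agreement
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1 (both raters constant)

# Hypothetical pass (1) / fail (0) verdicts on 10 trajectories.
judge = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
human = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(judge, human), 2))
```

In this toy case raw agreement is 80% but kappa is lower, because half of that agreement would be expected by chance with balanced labels. The same gap explains why the reported 89% accuracy corresponds to κ=0.77.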
Tech Stack:
- SwiftUI(应用开发)
- XCUITest(iOS UI测试框架)
- Appium(自动化测试工具)
- Claude Code(代码辅助)
- GPT-5.4-Mini(LLM-as-a-Judge)
- Qwen3.5 35B-A3B(开源基线模型)
- Claude Opus 4.6、GPT-4o等前沿模型
- AWS EC2 Mac实例(运行器)
- 部分可观测马尔可夫决策过程(POMDP)
Strengths:
- 填补了iOS平台动态代理基准的空白,且强调个性化智能。
- 构建了丰富的跨应用关联数据,任务设计贴近真实用户场景。
- 提供了两种观察模态(纯视觉和视觉+XML),便于分析不同输入对模型性能的影响。
- 开源完整资源,包括应用、数据、评估代码和云运行器,可复现性强。
- LLM-as-a-Judge评估框架经人工验证,可靠性较高。
Limitations:
- 所有应用为人工构建,可能与真实iOS应用在复杂性和多样性上存在差距。
- 仅评估了有限数量的模型,且开源模型性能较低,结论的泛化性有限。
- 视觉+XML设置依赖XCUITest框架,实际部署中可能无法获得如此丰富的结构化数据。
- 任务数量(133个)相对较少,可能不足以全面衡量代理能力。
- 未涉及网络浏览任务,限制了代理在混合场景下的评估。
Relevance To Keywords:
- Unify Models: 论文评估了多种多模态大模型(如Claude Opus 4.6、GPT-4o)在手机代理任务上的表现,涉及视觉与语言理解的统一。
- World Models: 环境建模为POMDP,代理需基于部分观测推理状态,与世界模型中的状态表示和预测相关。
- Representation Learning: 代理需从截图和可访问性树中学习有效的UI表示,以完成跨应用推理。
- Model-Based RL: 论文的POMDP框架和确定性状态转移与基于模型的强化学习思想一致,但未直接使用RL训练。
- 原生多模态大模型: 实验中的模型均为原生多模态模型,直接处理截图和文本输入。
- 多模态大模型的理解和生成一体化: 代理需理解屏幕内容并生成动作(如点击、输入),体现理解与生成的结合。
- 表征学习: 代理需从视觉和结构化数据中提取特征,以支持个性化推理。
- 世界模型: 环境动态和跨应用数据关联要求代理构建内部世界模型以规划多步任务。
- 强化学习: 虽然论文未使用RL训练,但任务设置和评估框架可为未来RL-based agent提供基准。
- 后训练: 论文未涉及后训练,但基准可用于评估后训练模型在手机代理场景中的表现。
摘要翻译
视觉语言模型(VLMs)因其卓越的能力而广受欢迎。尽管已有多种方法用于增强基于文本应用的隐私安全,但与视觉输入相关的隐私风险在很大程度上仍被忽视,例如医学图像中的受保护的健康信息(PHI)。为了解决这一问题,需执行两项关键任务:准确定位敏感文本并对其进行处理以确保隐私保护。针对这一问题,我们提出了 VisShield(视觉隐私盾牌),这是一个旨在增强视觉语言模型(VLMs)隐私感知能力的端到端框架。该框架包含两个关键组件:一个专门的指令微调数据集 OPTIC(光学隐私文本指令集)以及一种定制的训练方法。该数据集提供了多样化的隐私导向提示,引导视觉语言模型(VLMs)执行有针对性的光学字符识别(OCR),以精确定位敏感文本;而训练策略则确保 VLMs 能有效适应隐私保护任务。具体而言,该方法确保视觉语言模型(VLMs)能够识别隐私敏感文本,并为检测到的实体输出精确的边界框,从而实现对敏感信息的有效遮蔽。大量实验表明,我们的框架在处理隐私信息方面显著优于现有方法,为视觉语言模型中的隐私保护应用铺平了道路。我们的数据集和代码详见此处。
Abstract
Visual Language Models (VLMs) have gained significant popularity due to their remarkable ability. While various methods exist to enhance privacy in text-based applications, privacy risks associated with visual inputs remain largely overlooked such as Protected Health Information (PHI) in medical images. To tackle this problem, two key tasks: accurately localizing sensitive text and processing it to ensure privacy protection should be performed. To address this issue, we introduce VisShield (Vision Privacy Shield), an end-to-end framework designed to enhance the privacy awareness of VLMs. Our framework consists of two key components: a specialized instruction-tuning dataset OPTIC (Optical Privacy Text Instruction Collection) and a tailored training methodology. The dataset provides diverse privacy-oriented prompts that guide VLMs to perform targeted Optical Character Recognition (OCR) for precise localization of sensitive text, while the training strategy ensures effective adaptation of VLMs to privacy-preserving tasks. Specifically, our approach ensures that VLMs recognize privacy-sensitive text and output precise bounding boxes for detected entities, allowing for effective masking of sensitive information. Extensive experiments demonstrate that our framework significantly outperforms existing approaches in handling private information, paving the way for privacy-preserving applications in vision-language models. Our dataset and code can be found here.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 7.0/10 | 10.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on privacy de-identification in Vision Language Models (VLMs) via instruction tuning. It moderately aligns with MLLM and MultiModal as it utilizes vision-language architectures. It has low relevance to Unify Models, Tokenizer, Visual Encoder (as a core contribution), World Models, and model-based RL, which are not discussed in this privacy-focused application. No expert authors from the specified list are present. Weighted total score is 33.0, exceeding the dynamic passing score of 27.8.
关键词
Vision Language Models, Privacy Protection, Instruction Tuning, De-identification, Medical Images, Sensitive Text, OPTIC Dataset, Bounding Box Localization
深度分析
Chinese Title: 视觉语言模型助力视觉数据中的隐私信息去标识化
Summary: 本文针对视觉数据(如医学图像)中存在的受保护健康信息(PHI)等隐私文本泄露问题,提出了一种端到端的框架VisShield。该框架利用视觉语言模型(VLM)实现隐私文本的精准定位与去标识化。主要贡献包括:构建了包含5000万样本的专用指令微调数据集OPTIC,涵盖自然图像和医学图像场景;设计了一套训练方法,使VLM能够理解自定义的隐私定义并输出精确的边界框。实验表明,VisShield在隐私感知OCR任务上显著优于现有方法,为视觉语言模型在隐私保护领域的应用开辟了新路径。
Innovations:
- 首次提出利用视觉语言模型解决视觉数据中文本隐私信息的去标识化问题,并支持自定义隐私定义。
- 构建了大规模、多样化的指令微调数据集OPTIC,包含5000万图像-文本对,覆盖自然图像和医学图像场景。
- 设计了三阶段数据集生成流程,结合大语言模型合成指令、合成图像生成和标签对齐,确保数据质量与多样性。
- 通过微调Kosmos-2.5模型,仅使用部分数据集即可实现高效的隐私感知OCR定位与去标识化。
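Methodology: The paper follows a three-stage approach: 1) GPT-4 and Claude-3.5 Sonnet generate diverse privacy-related instruction prompts containing privacy definitions, few-shot examples, and OCR trigger tokens; 2) images containing private text are synthesized by controlling font, color, size, and other parameters, combined with Flickr30k and a medical image dataset; 3) instructions are aligned with image-label pairs to form the instruction-tuning dataset OPTIC. The pretrained Kosmos-2.5 model is then fine-tuned to output bounding boxes for private text as instructed, and de-identification is completed by masking those boxes.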
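The final masking step can be sketched independently of the model. Given bounding boxes for sensitive text, each box is overwritten with a fill value. The image here is a plain nested list of grayscale pixels and the `(x0, y0, x1, y1)` box format is an assumption of this sketch; the paper's actual box format and masking style are not specified here.

```python
def mask_boxes(image, boxes, fill=0):
    """Return a copy of `image` with every (x0, y0, x1, y1) box filled.

    Boxes use pixel coordinates with exclusive right/bottom edges and are
    clipped to the image bounds.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # leave the input untouched
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(h, y1)):
            for x in range(max(0, x0), min(w, x1)):
                out[y][x] = fill
    return out

# Toy 4x6 white image with one hypothetical detected box.
image = [[255] * 6 for _ in range(4)]
masked = mask_boxes(image, [(1, 1, 4, 3)])
for row in masked:
    print(row)
```

Because masking only needs box coordinates, localization quality directly bounds de-identification quality: a box that misses part of a name leaves that part visible.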
Key Results:
- VisShield在隐私感知OCR任务上显著优于现有方法(如Presidio),实现了更高的定位精度和隐私保护效果。
- 仅使用部分OPTIC数据集微调Kosmos-2.5即可获得良好性能,证明了数据集的高效性。
- 在自然图像和医学图像场景中均表现出色,验证了框架的泛化能力。
- 支持多种隐私信息类型(如姓名、邮箱、SSN、疾病分类等)的识别与去标识化。
Tech Stack:
- 视觉语言模型:Kosmos-2.5
- 大语言模型:GPT-4, Claude-3.5 Sonnet
- 指令微调(Instruction Tuning)
- 光学字符识别(OCR)
- 少样本学习(Few-shot Learning)
- 数据集:Flickr30k, 医学图像数据集(Rutherford et al., 2021)
- 生成配置:字体(Arial等6种)、颜色(9种)、大小(图像3%-9%)
Strengths:
- 首次将VLM应用于视觉数据隐私去标识化,填补了该领域空白。
- 数据集规模大(5000万样本)、多样性高,覆盖多种隐私类型和图像场景。
- 支持自定义隐私定义,灵活性高,可适应不同领域需求。
- 端到端框架,从定位到掩码处理一体化,实用性强。
Limitations:
- 依赖合成数据,可能无法完全覆盖真实场景中的隐私文本分布。
- 仅基于Kosmos-2.5进行实验,未验证在其他VLM(如LLaVA、Qwen2-VL)上的迁移性。
- 未讨论模型对对抗性攻击(如文本遮挡、扭曲)的鲁棒性。
- 隐私定义依赖人工设计,可能无法自动适应新出现的隐私类型。
Relevance To Keywords:
- 原生多模态大模型:本文使用Kosmos-2.5这一多模态模型,并针对视觉-语言任务进行微调,直接相关。
- 多模态大模型的理解和生成一体化:VisShield利用VLM同时理解图像中的隐私文本并生成边界框,体现理解与生成的一体化能力。
- 表征学习:通过指令微调,模型学习了隐私文本的视觉表征与语义表征,属于表征学习范畴。
- 世界模型:本文未直接涉及世界模型,但隐私去标识化可视为对视觉世界的一种安全约束。
- 强化学习:本文未使用强化学习方法,但后训练(指令微调)与强化学习中的策略优化有间接关联。
- 后训练:指令微调是后训练的一种形式,本文通过后训练使VLM适应隐私保护任务,高度相关。
摘要翻译
多语言词典是低资源语言和濒危语言最具价值的文献资源之一,但许多仍仅提供扫描件形式。数十年来,由于涉及特定文字系统、复杂的多栏布局以及包含大量缩写和交叉引用的条目,其数字化及转换为机器可读格式几乎是不可能的。近期的视觉 - 语言模型(Vision-language models)提供了一种有前景的解决方案,但它们在保留字符、标记以及处理词典结构方面的效果尚不明确。我们提出 MUDIDI,这是一个用于多语言词典数字化的两阶段框架。第一阶段评估字符识别与标记保留的质量;第二阶段专注于词典条目分割,随后将其映射到机器可读的词典结构模式,即 SIL 的 Multi-Dictionary Formatter。我们还发布了一个数据集,该数据集包含从 30 个公共领域词典中收集的人工标注词典条目,这些词典涵盖了多样的文字系统、语系和格式。我们在该数据集上对光学字符识别(OCR)系统、通用大语言模型(LLMs)和视觉语言模型(VLMs)进行了基准测试,结果表明 LLMs 在大多数文字系统和语言的两阶段中均表现优越,并为改进更具挑战性场景的结果提供了实用指南。最后,我们表明向 LLMs 补充额外信息(例如词典前言)可以提高数字化词典的质量。GitHub: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/
Abstract
Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format was nearly impossible due to language-specific scripts, complex multi-column layouts full of entries with abbreviations and cross-references. Recent vision-language models offer a promising solution, but it is unclear how well they preserve characters, markup, and process lexicographic structure. We introduce MUDIDI, a two-stage framework for multi-lingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation with subsequent mapping into a machine-readable lexicographic schema, SIL's Multi-Dictionary Formatter. We also release a dataset that consists of human-annotated lexicographic entries collected from 30 public-domain dictionaries featuring diverse writing systems, language families, and formats. We benchmark OCR systems, general-purpose Large Language Models (LLMs), and Vision Language Models (VLMs) on the dataset, demonstrating superior performance of LLMs across most writing systems and languages in both stages, and provide practical guidelines on improving the results for more challenging scenarios. Finally, we show that supplementing additional information, such as dictionary introduction, to the LLMs can improve the quality of the digitized dictionary. Github: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper applies existing LLMs and VLMs to multilingual dictionary digitization, showing high relevance to MLLM and MultiModal keywords. It treats models as tools rather than proposing new architectures (Unify Models, Tokenizer, Visual Encoder) or learning paradigms (World Models, model-based RL). No listed expert authors are found in the author list.
关键词
Multilingual Dictionary Digitization, Language Models, Two-Stage Framework, Vision-Language Models, Lexicographic Structure, OCR Benchmarking, Machine-readable Format
深度分析
Chinese Title: MUDIDI:一种基于语言模型的多语言词典数字化两阶段框架
Summary: 本文提出MUDIDI,一个两阶段的多语言词典数字化框架。第一阶段评估字符识别和标记保留的质量;第二阶段专注于词典条目分割并将其映射到机器可读的词典学模式(SIL的多词典格式化器MDF)。研究背景是大量低资源和濒危语言的词典仅以扫描形式存在,难以数字化。方法上,作者构建了一个包含30种公共领域词典、涵盖多种书写系统和语言家族的人工标注数据集,并对比了OCR系统、通用大语言模型(LLM)和视觉语言模型(VLM)的性能。结论表明,LLM在大多数书写系统和语言的两个阶段均表现优异,且向LLM补充词典引言等额外信息可提升数字化质量。该工作为多语言词典的自动数字化提供了系统性的评估框架和实用指南。
Innovations:
- 提出两阶段评估框架:将词典数字化分解为页面转录(Stage 1)和词典学解析(Stage 2),可独立测量和优化转录质量与条目解析质量。
- 发布首个多语言词典数字化数据集:包含约30种公共领域词典、每种3页的人工标注数据,覆盖多样书写系统、语言家族、版式和数字化条件。
- 首次系统评估LLM和VLM处理多种书写系统以及分割和解释词典条目的能力,并给出实用改进指南。
- 证明向LLM提供词典引言等辅助信息可显著提升数字化质量。
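Methodology: The study follows a two-stage pipeline. In Stage 1, OCR systems, general-purpose LLMs, and VLMs produce faithful transcriptions of dictionary pages, preserving Unicode text and <b> and <i> markup and emitting output in column-major reading order. In Stage 2, the gold reference transcription and the page image are converted into dictionary entries conforming to the MDF schema (with fields such as headword, part of speech, and gloss). For annotation, a strong model first produces a silver standard, which language experts then verify and correct. Experiments compare specialized models (MinerU2.5 Pro, PaddleOCR-VL-1.5, GLM-OCR, Mathpix) with general-purpose models (Gemini, GPT, Claude, Qwen).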
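To make the Stage 2 target concrete, here is a small sketch that serializes a parsed entry into MDF-style backslash-marker lines. It uses the common MDF markers `\lx` (lexeme), `\ps` (part of speech), `\ge` (English gloss), and `\de` (English definition); the dictionary keys and the sample entry are hypothetical, and the paper's own field set may be larger.

```python
# (marker, key) pairs in output order; the keys are illustrative names for this sketch.
FIELD_ORDER = [("lx", "lexeme"), ("ps", "pos"), ("ge", "gloss"), ("de", "definition")]

def to_mdf(entry):
    """Serialize one parsed entry as MDF lines, skipping empty fields."""
    lines = [f"\\{marker} {entry[key]}" for marker, key in FIELD_ORDER if entry.get(key)]
    return "\n".join(lines)

# Placeholder entry, not taken from any real dictionary.
entry = {"lexeme": "sample", "pos": "n", "gloss": "placeholder gloss"}
print(to_mdf(entry))
```

Keeping transcription and field assignment separate, as the two-stage design does, means an error in the output can be traced either to a misread character or to a wrong field assignment.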
Key Results:
- 通用LLM在大多数书写系统和语言的两个阶段均优于专用OCR和VLM。
- 在Stage 1(页面转录)中,LLM对字符识别和标记保留表现更佳。
- 在Stage 2(词典学解析)中,LLM能更准确地分割条目并分配MDF字段。
- 向LLM提供词典引言等额外信息可提升数字化质量。
- 数据集包含30种词典,涵盖Assyrian、Bengali、Chukchi、Georgian、Japanese、Sanskrit等多样语言和书写系统。
Tech Stack:
- Gemini 3.5-Flash / 3.1-Pro
- GPT-5.5
- Claude Opus 4.7
- Qwen3-VL-235B-A22B-Instruct / Thinking
- MinerU2.5 Pro
- PaddleOCR-VL-1.5
- GLM-OCR
- Mathpix (商业OCR基线)
- SIL Multi-Dictionary Formatter (MDF)
- LabelStudio (标注工具)
- HathiTrust (数据源)
Strengths:
- 系统性:首次将词典数字化分解为两个独立阶段,便于错误归因和优化。
- 多样性:数据集覆盖30种词典、多种书写系统和语言家族,具有广泛代表性。
- 实用性:提供了针对挑战性场景的改进指南,如利用词典引言提升质量。
- 评估全面:对比了多种OCR、LLM和VLM,覆盖专业和通用模型。
- 开源:代码和数据集已公开,促进可复现性和后续研究。
Limitations:
- 数据集规模有限:每种词典仅3页,可能不足以代表完整词典的多样性。
- Stage 2仅聚焦10种词典,受限于MDF专家标注资源。
- 主要依赖公共领域历史词典,对现代词典格式的适用性未验证。
- 未深入探讨模型在极端低质量扫描或罕见书写系统上的表现。
- 评估指标可能未完全捕捉词典学解析中的细微语义错误。
Relevance To Keywords:
- Unify Models: 论文评估了通用LLM和VLM在词典数字化中的统一能力,与多模态理解和生成一体化相关。
- World Models: 词典数字化涉及对页面布局、阅读顺序和词典结构的理解,可视为构建文档世界模型的一部分。
- Representation Learning: 模型需学习从图像到结构化文本的表示映射,涉及视觉和语言表征的联合学习。
- Model-Based RL: 论文未直接涉及强化学习,但两阶段框架可类比为基于模型的规划(先转录后解析)。
- 原生多模态大模型: 论文核心评估对象即为原生多模态模型(如Gemini、Qwen-VL)在文档理解任务上的表现。
- 后训练: 论文未涉及后训练技术,但结果可为后续微调或指令调优提供基准。
摘要翻译
三维语义场景生成对于自动驾驶应用至关重要,然而大多数方法依赖于复杂的专用 3D 架构,例如三平面编码器和适配的扩散网络,这限制了其简洁性和编辑能力。我们提出了 EditSSC,一种面向编辑的 3D 语义场景生成方法,该方法利用 2D 鸟瞰图(BEV)表示和现成的潜在扩散网络。该方法将 3D 语义占用网格转换为多通道 BEV 图像,并利用 Stable Diffusion 中的量化自编码器和 UNet,仅需进行最小修改。我们在量化后的潜在空间上执行扩散,从而实现无需训练的编辑能力。通过利用代码本中的类别与代码对应关系,我们的方法支持草图引导生成、图像修复和图像扩展,而无需任何重新训练。在 SemanticKITTI 数据集上,EditSSC 在无条件生成方面优于现有的专用 3D 基线方法,表明成熟的 2D 架构可以有效地复用用于 3D 场景生成和编辑。
Abstract
3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method for 3D semantic scene generation using 2D Bird's Eye View (BEV) representations and off-the-shelf latent diffusion network. Our approach reshapes 3D semantic occupancy grids into multi-channel BEV images and leverages the quantized autoencoder and UNet from Stable Diffusion with minimal modifications. We perform diffusion on the latents after quantization, which enables training-free editing capabilities. By exploiting class-to-code correspondences in the codebook, our method supports sketch-guided generation, inpainting, and outpainting without any retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, demonstrating that well-established 2D architectures can be effectively repurposed for 3D scene generation and editing.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 6.0/10 | 9.0 |
| Visual Encoder | 1.5 | 6.0/10 | 9.0 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
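评分理由: The paper focuses on 3D semantic scene generation, using 2D diffusion models (such as Stable Diffusion) for editing. Tokenizer and Visual Encoder are fairly relevant because the core diffusion components (quantized autoencoder, UNet) are used directly. Unify Models has some relevance because 2D architectures are reused for 3D. For World Models, diffusion models are generative models but not dynamics world models, so relevance is moderate. MLLM and model-based RL are unrelated. MultiModal has low relevance, touching only on BEV and semantics/sketches.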
关键词
3D semantic scene generation, Unconditional Diffusion Models, Bird's Eye View, Quantized Autoencoder, Training-free editing, Semantic Occupancy, Sketch-guided generation
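In-Depth Analysis
Chinese Title: EditSSC:基于无条件扩散模型的可编辑语义占用场景生成
Summary: This paper proposes EditSSC, a method for generating and editing 3D semantic occupancy scenes for autonomous driving. Existing 3D scene generation methods depend on complex 3D-specific architectures such as triplane encoders, which limits both simplicity and editability. EditSSC reshapes the 3D semantic occupancy grid into a multi-channel bird's-eye-view (BEV) image and reuses the quantized autoencoder and UNet from Stable Diffusion, running diffusion in the latent space after quantization. By exploiting class-to-code correspondences in the codebook, it supports sketch-guided generation, inpainting, and outpainting without retraining. On SemanticKITTI, EditSSC outperforms existing 3D-specific baselines on unconditional generation, showing that 2D architectures can be repurposed for 3D scene generation and editing. The analysis concludes that a structured latent space (for example, one obtained through vector quantization) matters more for diffusion quality than reconstruction fidelity alone.
Innovations:
- Reshapes 3D semantic occupancy grids into multi-channel BEV images so that an off-the-shelf 2D diffusion architecture (Stable Diffusion) can be used directly, with no 3D-specific modules.
- Builds a discrete, structured latent space with a vector-quantized autoencoder (VQ-VAE) and finds that it suits diffusion modeling better than an unstructured latent space with higher reconstruction fidelity.
- Uses the class-to-code correspondence in the codebook to provide sketch-guided generation, inpainting, and outpainting without retraining or test-time adaptation.
- Achieves better unconditional generation on SemanticKITTI than 3D-specific methods such as SemCity while using a simpler 2D architecture.
Methodology: The paper uses a two-stage latent diffusion model. Stage 1: the 3D semantic occupancy grid (256×256×32) is folded along the height dimension into a multi-channel BEV image (256×256×19), which Stable Diffusion's VQ-VAE compresses and quantizes into discrete latent codes. Stage 2: a lightweight Stable Diffusion UNet is trained to run diffusion in the quantized BEV latent space. At edit time, replacing or masking the codebook indices of specific classes enables sketch guidance, inpainting, and outpainting with no additional training.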
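The index-replacement idea behind the training-free editing can be illustrated with a toy sketch. It assumes a one-to-one lookup from class name to codebook index (`class_to_code`, hypothetical) and a tiny 4×4 map of code indices; the real codebook, latent resolution, and how the diffusion model consumes the edited codes are not specified in this summary.

```python
import numpy as np

def edit_latent_codes(code_map, region_mask, class_to_code, target_class):
    """Return a copy of code_map whose masked cells hold the code of target_class."""
    edited = code_map.copy()
    edited[region_mask] = class_to_code[target_class]
    return edited

# Hypothetical class-to-code lookup and a 4x4 map of codebook indices.
class_to_code = {"vegetation": 3, "road": 5, "building": 7}
code_map = np.array([
    [3, 3, 7, 7],
    [3, 3, 7, 7],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
])

# Sketch-style edit: paint "road" over the top-right block.
mask = np.zeros_like(code_map, dtype=bool)
mask[:2, 2:] = True
edited = edit_latent_codes(code_map, mask, class_to_code, "road")
print(edited)
```

Only the masked cells change, and the original map is left intact, which is what makes this kind of edit cheap to apply before decoding or further denoising.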
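Key Results:
- On SemanticKITTI, EditSSC outperforms 3D-specific baselines such as SemCity on unconditional generation (FID, CKL, Precision).
- The VQ-VAE BEV representation reconstructs less accurately (mIoU 68.35) yet gives better diffusion quality (FID 97.5) than the higher-fidelity MLP BEV representation (FID 156.9), which supports the importance of latent-space structure.
- Training-free sketch-guided scene generation, region inpainting, and scene outpainting are demonstrated.
- Ablations show that combining the BEV representation with vector quantization is the key design choice for editability.
Tech Stack:
- Stable Diffusion's VQ-VAE (vector-quantized autoencoder)
- Stable Diffusion UNet (lightweight variant)
- Latent Diffusion Model
- Bird's-eye-view (BEV) representation
- SemanticKITTI dataset
- FID (Fréchet Inception Distance)
- CKL (class-wise KL divergence)
- IoU / mIoU (intersection over union / mean intersection over union)
Strengths:
- The method is simple and relies on off-the-shelf 2D architectures, which lowers the implementation complexity of 3D scene generation.
- Vector quantization gives training-free editing, which is practical.
- The ablations expose how strongly latent-space structure affects diffusion quality, a methodological contribution.
- It outperforms 3D-specific methods on unconditional generation, showing that 2D architectures transfer well.
Limitations:
- It covers semantic occupancy scenes only and does not generate RGB texture or dynamic objects.
- Editing depends on the class-to-code correspondence in the codebook, so support for unseen classes or fine-grained edits may be limited.
- Experiments use a single dataset, SemanticKITTI, so generalization remains to be verified.
- The BEV representation loses fine vertical structure, which may hurt generation quality in some scenes.
Relevance To Keywords:
- Unify Models: The paper applies a 2D image diffusion model to 3D scene generation, which reflects the idea of model unification.
- World Models: 3D semantic scene generation can be seen as one part of building a world model for autonomous driving, used for simulation and planning.
- Representation Learning: The paper studies how the latent representation (BEV, VQ-VAE) affects generation quality, which falls under representation learning.
- Model-Based RL: The generated scenes could serve environment simulation and data augmentation in model-based reinforcement learning.
- Native multimodal large models: The paper does not involve multimodal large models directly, but its 2D-to-3D transfer is loosely related to unified multimodal representations.
- Unified understanding and generation in multimodal large models: The paper focuses on generation and does not address understanding tasks.
- Post-training: The editing capability needs no post-training; it is a training-free method.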
摘要翻译
大型语言模型(LLMs)和视觉 - 语言模型(VLMs)在表格推理任务中的评估日益普遍,然而表格表示(table representation)的作用仍未被充分探索。在实际应用中,相同的表格内容可能以不同的结构格式呈现,例如 HTML、Markdown 和 LaTeX,或以渲染图像的形式存在。然而,现有的评估往往让内容、格式、布局和模态同时变化,这使得难以单独分离表示效应的影响。我们引入了 TABVERSE,这是一个受控的多模态表格基准,它在多种结构格式和渲染图像中对齐相同的表格内容,并附带问题类别和难度标签。这种设计使得在保持表格内容固定的前提下,能够对表示效应进行系统性评估。我们在三个任务上评估 LLMs 和 VLMs:问答(QA)、结构理解能力(SUC)以及结构重建(SR)。结果表明,表示的选择对表格理解有显著影响。模型通常在结构化文本上的表现优于渲染图像,但这种性能差距的大小取决于任务、模型和格式。HTML 通常是最稳健的文本格式,而对行敏感的结构任务以及语法上可用的 LaTeX 重建仍然具有挑战性。这些发现表明,表格表示是可靠表格评估中的关键因素。
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
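Scoring Rationale: The paper's core contribution is the TABVERSE benchmark, which evaluates how table representation formats (structured text versus rendered images) affect LLM and VLM performance. It is therefore most relevant to MLLM (it evaluates multimodal large models) and MultiModal (it contrasts text and image representations). Visual Encoder is somewhat relevant because VLMs are involved. Unify Models and Tokenizer score low because the paper is about evaluation rather than model unification or tokenizer architecture. World Models and model-based RL are unrelated. The author list does not include any of the designated experts (Yang Shi and others), so no bonus is applied.
Keywords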
Table Understanding, Cross-Format Benchmark, LLMs and VLMs, Representation Effects, Structured Text, Rendered Images, Model Evaluation, TABVERSE
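In-Depth Analysis
Chinese Title: TABVERSE:大语言模型与视觉语言模型跨格式表格理解的基准测试
Summary: This paper presents TABVERSE, a controlled multimodal table benchmark for systematically evaluating how table representation affects the table understanding of large language models (LLMs) and vision-language models (VLMs). Existing evaluations often let table content, format, layout, and modality vary together, which makes representation effects hard to isolate. TABVERSE aligns the same table content across three structured formats (HTML, Markdown, LaTeX) and their rendered images, tags each question with a category and a difficulty level, and provides a balanced evaluation set of 700 samples. Multiple models are evaluated on three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). The results show that representation choice substantially affects performance, that structured text usually beats rendered images, that HTML is the most robust text format, and that row-sensitive structural tasks and usable LaTeX reconstruction remain challenging. The study argues that table representation is a key factor in reliable table evaluation.
Innovations:
- Frames cross-format and cross-modal table understanding as a controlled evaluation problem: table content is fixed, structural format and input modality vary, and the evaluation pipeline is matched across conditions.
- Builds the TABVERSE benchmark with HTML, LaTeX, and Markdown text representations and their rendered images, category and difficulty tags, and a balanced 700-sample evaluation set filtered from five TableQA sources.
- Systematically evaluates LLMs and VLMs on QA, SUC, and SR, shows how format and modality choices change model behavior, and decomposes SR errors into reconstruction quality and output usability.
Methodology: Single-table answerable questions are filtered from the held-out test sets of five datasets (FEVEROUS, HYBRIDQA, TABFACT, SQA, and WikiTableQuestions), yielding 6,097 question-table pairs. Each table is converted to HTML, Markdown, and LaTeX and also rendered as an image. Gemini-3-Flash-Preview labels each question with one of seven categories, and difficulty (Easy/Hard) is defined from the zero-shot QA results of GPT-5.2 and Gemini-3-Flash-Preview on the rendered images. A balanced set of 700 samples is then selected (350 Easy and 350 Hard, with 50 per category per difficulty). On the QA, SUC, and SR tasks, multiple LLMs and VLMs are evaluated with text formats and with images as input, and the performance differences are compared.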
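To make the "same content, different format" idea concrete, here is a minimal sketch that serializes one toy table into Markdown, HTML, and LaTeX. The function names and the toy data are illustrative assumptions, not the benchmark's conversion tool, and the LaTeX path does not escape special characters.

```python
from html import escape

def to_markdown(header, rows):
    """Serialize a table as a pipe-delimited Markdown table."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

def to_html(header, rows):
    """Serialize the same table as a minimal HTML <table>."""
    head = "".join(f"<th>{escape(cell)}</th>" for cell in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

def to_latex(header, rows):
    """Serialize the same table as a LaTeX tabular (no escaping of special characters)."""
    spec = "l" * len(header)
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + " \\\\", "\\hline"]
    lines += [" & ".join(row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)

# Toy content, held fixed across all three formats.
header = ["Item", "Count"]
rows = [["apple", "3"], ["pear", "5"]]
md, html_table, tex = to_markdown(header, rows), to_html(header, rows), to_latex(header, rows)
print(md, html_table, tex, sep="\n\n")
```

Because every serializer reads the same `header` and `rows`, any performance gap between formats can be attributed to the representation rather than to the content.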
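Key Results:
- Representation choice substantially affects table understanding. Structured text usually beats rendered images, but the size of the gap depends on the task, model, and format.
- HTML is the most robust text format and performs best on most tasks and models.
- Row-sensitive structural tasks (such as boundary detection and size estimation) are sensitive to format changes, and LaTeX reconstruction remains hard to make syntactically usable.
- In the SR task, the quality of a model's reconstructed table and the usability of its output format are separable.
Tech Stack:
- Gemini-3-Flash-Preview (category labeling and difficulty assessment)
- GPT-5.2 (difficulty assessment)
- HTML, Markdown, and LaTeX conversion tools (extending the converter of Sui et al. 2024a)
- Rendered image generation (standardized font, padding, and width)
- Zero-shot QA evaluation (for difficulty tags)
- Balanced sampling strategy (by category and difficulty)
Strengths:
- Controlled experimental design: table content is fixed and only format and modality vary, so representation effects can be attributed cleanly.
- Multi-task evaluation: QA, structural understanding, and structure reconstruction together give a broad view of table understanding.
- Aligned formats: HTML, Markdown, LaTeX, and their images cover the common representations.
- Category and difficulty tags: these support fine-grained analysis across question types and difficulty levels.
- Diverse sources: the benchmark draws on five well-known TableQA datasets, which improves generality.
Limitations:
- Only three text formats and rendered images are covered; PDF, Excel, and other formats common in practice are not.
- The evaluation is limited to current mainstream LLMs and VLMs and does not cover all possible models.
- Difficulty tags are based on the zero-shot performance of two models on images, which may introduce bias.
- The balanced set has only 700 samples, which may be too small to cover the full range of table complexity.
- Model internals and the representation learning process are not analyzed in depth.
Relevance To Keywords:
- Native multimodal large models: The paper evaluates VLMs on table images, which is directly relevant.
- Unified understanding and generation in multimodal large models: The SR task requires generating structured text from an image, which combines understanding and generation.
- Representation learning: The paper's core question is how different table representations (text formats, images) affect model performance, which touches on representation choice and learning.
- World models: Indirectly relevant; table understanding can be seen as a world model's ability to process structured information.
- Reinforcement learning and post-training: Not addressed directly, but table understanding could serve as an evaluation task after post-training.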
摘要翻译
4D 生成(4D generation,即动态 3D 生成)最近已成为一个迅速发展的研究前沿,得益于其强大的时空建模能力。然而,尽管取得了显著进展,现有方法通常无法捕捉底层物理原理,导致生成的结果在物理上不一致且视觉上不合理。为了克服这一局限,我们提出了 CP4D,这是一种用于照片级真实感 4D 场景合成的新范式,能够忠实遵循复杂的物理动力学。受现实世界场景组合性特征的启发——其中不可变的静态背景与动态且物理上合理的前景共存——CP4D 将 4D 生成重新表述为静态 3D 环境与基于物理的动态对象的整合。在此基础上,我们的框架遵循一个三阶段流程:**1)** 首先,我们利用预训练专家模型(pre-trained expert models)分别生成环境及前景对象的高保真 3D 表示。**2)** 随后,为了生成这些对象的物理上合理的轨迹及真实交互,我们提出了一种混合运动合成策略,该策略整合了来自物理模拟器(physical simulators)的先验与嵌入在视频扩散模型(video diffusion models)中的常识。**3)** 最后,我们开发了一种自动组合机制,能够将静态环境与动态对象无缝融合,形成连贯且物理一致的 4D 场景。大量实验表明,CP4D 能够生成可探索且交互式的 4D 场景,具有高视觉保真度、强物理合理性及细粒度可控性,显著优于现有方法。项目页面:https://anonymous.4open.science/w/CP4D/。
Abstract
4D generation (i.e., dynamic 3D generation) has recently emerged as a rapidly growing research frontier due to its powerful spatiotemporal modeling capabilities. However, despite notable advances, existing approaches typically fail to capture the underlying physical principles, producing results that are both physically inconsistent and visually implausible. To overcome this limitation, we present CP4D, a novel paradigm for photorealistic 4D scene synthesis with faithful adherence to complex physical dynamics. Drawing inspiration from the compositional nature of real-world scenes, where immutable static backgrounds coexist with dynamic, physically plausible foregrounds, CP4D reformulates 4D generation as the integration of a static 3D environment with physically grounded dynamic objects. On this basis, our framework follows a three-stage pipeline: 1) Firstly, we leverage pre-trained expert models to generate high-fidelity 3D representations of the environment and foreground objects respectively. 2) Subsequently, to produce physically plausible trajectories and realistic interactions for these objects, we propose a hybrid motion synthesis strategy that integrates priors from physical simulators with the common sense embedded in video diffusion models. 3) Finally, we develop an automated composition mechanism that seamlessly fuses the static environment and dynamic objects into coherent, physically consistent 4D scenes. Extensive experiments demonstrate that CP4D can generate explorable and interactive 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability, significantly outperforming existing methods. The project page: https://anonymous.4open.science/w/CP4D/.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 1.5/10 | 2.25 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 5.0/10 | 7.5 |
| MLLM | 1.5 | 1.5/10 | 2.25 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
Scoring Rationale: The paper focuses on physics-aware 4D scene generation. 'World Models' (5.0) is the most relevant keyword because the method models physical dynamics. 'Unify Models' (4.0) and 'MultiModal' (4.0) relate to composition and spatiotemporal data. 'Visual Encoder' (3.0) is implied by the 3D representations. 'Tokenizer' (1.5), 'MLLM' (1.5), and 'model-based RL' (2.0) are only weakly related, since they are neither discussed explicitly nor central (the simulators are used for generation, not RL). No specified expert authors were found. Total weighted score: 31.5, exceeding the passing threshold of 27.8.
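The total quoted in the rationale can be reproduced from the table by multiplying each relevance value by its weight and summing. The sketch below assumes that this is how the report computes its scores (the weight of 1.5 and the threshold of 27.8 are taken from the text above; the variable names are illustrative).

```python
WEIGHT = 1.5            # weight shown for every keyword in the table
PASS_THRESHOLD = 27.8   # passing threshold quoted in the rationale

# Relevance values (out of 10) from the table for this paper.
relevance = {
    "Unify Models": 4.0,
    "Tokenizer": 1.5,
    "Visual Encoder": 3.0,
    "World Models": 5.0,
    "MLLM": 1.5,
    "MultiModal": 4.0,
    "model-based RL": 2.0,
}

scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())
passed = total > PASS_THRESHOLD

for keyword, score in scores.items():
    print(f"{keyword}: {score}")
print(f"total = {total}, passed = {passed}")
```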
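Keywords
4D Scene Generation, Physics-aware, Static Background, Dynamic Foreground, Motion Synthesis, Video Diffusion, Physical Plausibility, Composition Mechanism
In-Depth Analysis
Chinese Title: CP4D:组合式物理感知的四维场景生成
Summary: This paper presents CP4D, a new paradigm for generating physically consistent 4D (dynamic 3D) scenes. Existing 4D generation methods usually ignore physical principles, which leads to unrealistic results. Inspired by the compositional nature of real scenes, where a static background coexists with a dynamic foreground, CP4D recasts 4D generation as the integration of a static 3D environment with physics-driven dynamic objects. The method has three stages: 1) pre-trained expert models generate high-fidelity 3D representations of the background and the foreground separately, with image editing used to keep their styles consistent; 2) a hybrid motion synthesis strategy combines a physics simulator with the common-sense knowledge in video diffusion models to produce physically plausible trajectories and interactions; 3) an automated composition mechanism uses monocular depth estimation and optimization to fuse the dynamic foreground into the static background. Experiments show that CP4D generates explorable, interactive 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability, and that it clearly outperforms existing methods.
Innovations:
- A compositional framework that decomposes 4D scene generation into a static background plus a physics-driven dynamic foreground, which yields physical consistency.
- A hybrid motion synthesis strategy that fuses priors from a differentiable physics simulator with the common-sense knowledge of video diffusion models to produce physically plausible, visually realistic trajectories and interactions.
- An automated composition mechanism that uses monocular depth estimation and depth-aware heuristics to calibrate the spatial attributes of the foreground automatically.
- Flexible user editing of foreground objects, the background environment, and motion trajectories, which gives fine-grained control over 4D generation.
Methodology: CP4D uses a three-stage pipeline. Stage 1: a large language model decomposes the text prompt, a text-to-image model generates the background image, an image editing model generates a foreground compatible with that background, and pre-trained 3D reconstruction models produce 3D representations of the background and the foreground. Stage 2: a physics simulator first generates a coarse trajectory that obeys physical laws, and SDS gradients from a video diffusion model then refine it to make object interactions more realistic. Stage 3: monocular depth estimation and depth-aware rules estimate the relative position and scale of the foreground within the background, and an optimization step calibrates them for a visually coherent composition.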
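As a rough picture of what a "coarse trajectory that obeys physical laws" looks like before any diffusion-based refinement, here is a toy point-mass drop with a ground bounce, integrated with semi-implicit Euler. This is a stand-in written for illustration only: the paper's simulator (the summary mentions MPM-style differentiable simulation) is far richer, and all parameter names and values here are assumptions.

```python
def simulate_drop(height, steps, dt=0.01, gravity=9.81, restitution=0.6):
    """Return the vertical positions of a point mass dropped from `height`.

    Semi-implicit Euler: update velocity first, then position.
    The ground is at y = 0; on contact the velocity is reflected and damped.
    """
    y, vy = height, 0.0
    trajectory = []
    for _ in range(steps):
        vy -= gravity * dt
        y += vy * dt
        if y < 0.0:
            y = 0.0
            vy = -vy * restitution
        trajectory.append(y)
    return trajectory

trajectory = simulate_drop(height=1.0, steps=300)
print(f"lowest point: {min(trajectory)}, highest point: {max(trajectory):.3f}")
```

A trajectory like this fixes the physically determined part of the motion (free fall, contact, energy loss); in the pipeline described above, a learned video prior would then adjust the details.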
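Key Results:
- CP4D generates 4D scenes with high visual fidelity, strong physical plausibility, and fine-grained controllability.
- It clearly outperforms existing 4D generation methods on physical consistency, visual quality, and user controllability.
- It supports explorable (free-viewpoint) and interactive (editable scene elements) 4D scenes.
- Ablations confirm the contribution of each stage (background-foreground style consistency, hybrid motion synthesis, and the automated composition mechanism).
Tech Stack:
- Score Distillation Sampling (SDS)
- Pre-trained text-to-image generation model (Ft2i)
- Image editing model (Fedit)
- Image segmentation model (Fseg)
- Large language model (GPT-4o)
- Differentiable physics simulator (e.g., Material Point Method, MPM)
- Video diffusion model
- Monocular depth estimation
- 3D Gaussian Splatting
- Dynamic NeRF
Strengths:
- Combines physics simulation with video diffusion models, reportedly for the first time, to obtain motion that is both physically plausible and visually realistic.
- The compositional design lets scene elements be edited independently, which gives strong interactivity and controllability.
- The automated composition mechanism needs no manual intervention and produces stylistically consistent, spatially coherent 4D scenes.
- The experiments are thorough, and both quantitative and qualitative results beat existing methods.
Limitations:
- It depends on the quality of the pre-trained models and may be limited by the accuracy of image editing and 3D reconstruction.
- The physics simulator currently handles mainly rigid and elastic bodies, with limited support for complex multi-material, multi-object interactions.
- Depth estimation in the automated composition mechanism can be inaccurate in complex scenes, which degrades composition quality.
- The three-stage pipeline is computationally expensive.
Relevance To Keywords:
- Unify Models: The paper does not address unified models directly, but its compositional framework can be seen as an integration of different expert models (image, 3D, physics, video).
- World Models: 4D scene generation can be seen as part of building a world model, and CP4D's emphasis on physical consistency contributes to realism.
- Representation Learning: The paper uses pre-trained models to obtain 3D representations but proposes no new representation learning method.
- Model-Based RL: The physics simulator can be seen as a model, but the paper does not apply it to reinforcement learning.
- Native multimodal large models: GPT-4o is used to decompose the text prompt, but no multimodal large model is trained.
- Unified understanding and generation in multimodal large models: The paper focuses on generation; understanding is limited to prompt decomposition by the LLM.
- Post-training: The paper does not involve post-training techniques.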
摘要翻译
基于行为克隆(behavior cloning)的参数化模仿学习在部署过程中因累积误差(compounding errors)可能导致对分布外状态(out-of-distribution states)的泛化能力不足。我们表明,通过一种半参数检索式模仿学习方法(semi-parametric retrieval-based imitation learning approach)在推理阶段复用训练数据,可以缓解这一挑战。我们提出了一种名为差异感知检索策略用于模仿学习(Difference-Aware Retrieval Policies for Imitation Learning,简称 DARP)的方法,这是一种半参数检索式模仿学习方法,它通过重新参数化模仿学习问题,基于局部邻域结构(local neighborhood structure)而非直接的状态 - 动作映射(state-to-action mappings)来解决这一局限性。与学习全局策略(global policy)不同,DARP 训练一个模型,基于专家示范中的 k 近邻(k-nearest neighbors)、其对应的动作以及邻居状态与查询状态之间的相对距离向量(relative distance vectors)来预测动作。DARP 除了标准行为克隆所做的假设外,无需额外假设——它不需要额外的数据收集、在线专家反馈或特定任务知识。我们在包括连续控制(continuous control)和机器人操作(robotic manipulation)在内的多样化领域,以及包括高维视觉特征(high-dimensional visual features)在内的不同表征上,展示了相对于标准行为克隆一致的性能提升(15%-46%)。代码和演示可在 https://weirdlabuw.github.io/darp-site/ 获取。
Abstract
Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 5.0/10 | 7.5 |
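Scoring Rationale: The paper is about retrieval-based imitation learning and has little connection to MLLM, Tokenizer, or generative world models. It does involve visual feature extraction (Visual Encoder), combined visual and state inputs (MultiModal), and policy modeling in a reinforcement-learning setting (model-based RL, read broadly), so these receive moderate scores. It does not address unified model architectures (Unify Models).
Keywords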
Imitation Learning, Retrieval-based, Behavior Cloning, Difference-Aware, k-nearest neighbors, Robotic Manipulation, Visual Features, Semi-parametric
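In-Depth Analysis
Chinese Title: 差异感知检索策略用于模仿学习
Summary: This paper proposes DARP (Difference-Aware Retrieval Policies for Imitation Learning), a semi-parametric retrieval-based imitation learning method that targets the poor generalization and compounding errors that covariate shift causes in standard behavior cloning (BC). DARP reuses the training data at inference time: it retrieves the k nearest expert samples to the query state and predicts the action from the neighbor states, their actions, and the difference vectors between the neighbors and the query, instead of learning a global state-to-action mapping directly. It needs no additional data collection, online expert feedback, or task-specific knowledge, only the expert demonstrations that standard BC already requires. The theoretical analysis shows that neighborhood aggregation implicitly performs Laplacian smoothing, which lowers policy variance and improves robustness to distribution shift. On continuous control (MuJoCo), robotic manipulation (Robosuite, Robocasa), and high-dimensional visual imitation tasks, DARP improves on standard BC by 15-46%.
Innovations:
- Difference-aware retrieval policy: the imitation learning problem is reparameterized in terms of local neighborhood structure rather than a direct state-to-action mapping, with actions predicted from retrieved neighbors and their difference vectors.
- Semi-parametric architecture that combines global and local learning: it joins the stability of a global parametric policy with the robustness of local non-parametric retrieval, with no extra assumptions or data.
- Implicit Laplacian regularization: the theory shows that neighborhood aggregation is equivalent to low-pass spectral filtering on the k-NN graph, which gives parameter-free smoothing and lowers policy variance.
- General, scalable architecture: it extends naturally to high-dimensional visual states and to richer policy classes (such as Transformers and Gaussian mixture models) without task-specific tuning.
Methodology: DARP uses a semi-parametric retrieval-augmented architecture. During training it learns a parametric model that takes the query state, the k retrieved nearest-neighbor states, their actions, and the neighbor-query difference vectors as input and outputs one action prediction per neighbor. At inference it first retrieves the k nearest neighbors of the query state from the training data, then produces k conditional action predictions with the model, and finally combines them with a permutation-invariant aggregation (such as averaging) to obtain the action. Neighborhood aggregation implicitly performs Laplacian smoothing, and the theoretical analysis builds on graph Laplacian regularization and Dirichlet energy minimization. Experiments run on MuJoCo, Robosuite, Robocasa, and other benchmarks spanning continuous control, robotic manipulation, and visual imitation, with comparisons against standard BC, other retrieval methods, and regularization methods.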
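The inference path described above (retrieve k neighbors, predict one action per neighbor from its state, action, and difference vector, then average) can be sketched in NumPy. The learned network is replaced here by a hand-written first-order correction, `toy_model`, which is purely a hypothetical stand-in; the sign convention of the difference vector (query minus neighbor) is also an assumption.

```python
import numpy as np

def retrieve_neighbors(query, states, k):
    """Indices of the k expert states closest to the query (Euclidean distance)."""
    distances = np.linalg.norm(states - query, axis=1)
    return np.argsort(distances)[:k]

def darp_style_action(query, states, actions, k, model):
    """Predict one action per retrieved neighbor, then average the predictions."""
    neighbor_ids = retrieve_neighbors(query, states, k)
    predictions = [
        model(states[i], actions[i], query - states[i])  # difference vector
        for i in neighbor_ids
    ]
    return np.mean(predictions, axis=0)  # permutation-invariant aggregation

# Toy expert: a linear policy a = A s, used to generate demonstrations.
A = np.array([[1.0, 0.0], [0.0, -2.0]])

def toy_model(neighbor_state, neighbor_action, difference):
    """Hypothetical stand-in for the learned model: neighbor action plus a local correction."""
    return neighbor_action + A @ difference

rng = np.random.default_rng(0)
states = rng.normal(size=(50, 2))
actions = states @ A.T
query = np.array([0.3, -0.1])

action = darp_style_action(query, states, actions, k=5, model=toy_model)
print(action, A @ query)
```

For this linear toy expert the correction is exact, so every neighbor votes for the same action; with a learned model the votes differ and the averaging step is what smooths them.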
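Key Results:
- On continuous control (MuJoCo), robotic manipulation (Robosuite, Robocasa), and visual imitation tasks, DARP improves on standard BC by 15-46%.
- The theory shows that DARP's neighborhood aggregation is equivalent to Laplacian filtering, which lowers policy variance and guarantees local Lipschitz continuity.
- In closed-loop rollouts, DARP's error accumulates more slowly than that of standard BC (sublinear versus linear).
- Ablations confirm the importance of the difference-vector representation and the neighborhood aggregation mechanism, and show that DARP is robust to the distance metric and the number of neighbors.
Tech Stack:
- Behavior Cloning (BC)
- k-Nearest Neighbors (k-NN) retrieval
- Graph Laplacian Regularization
- Dirichlet Energy Minimization
- Tikhonov Regularization
- Permutation-Invariant Aggregation
- Gaussian Mixture Models
- Transformer architecture
Strengths:
- Simple and general: it needs only the assumptions and data of standard BC and is easy to integrate into existing imitation learning pipelines.
- Theoretically grounded: it provides guarantees on variance reduction, smoothness, and stability, and connects retrieval methods to spectral graph theory.
- Broad experiments: effectiveness is shown across several domains (continuous control, robotic manipulation, visual tasks) and representations (low-dimensional states, high-dimensional images).
- No extra resources: it does not rely on a simulator, an online expert, or large amounts of suboptimal data, which lowers the barrier to practical use.
Limitations:
- Retrieval cost: every query requires a k-NN search at inference time, which may add latency on large datasets.
- Dependence on the distance metric: retrieval quality depends on the choice of metric in state space and may not be robust in complex high-dimensional spaces.
- Multimodal action distributions are not explored: Gaussian mixture models are supported, but performance under highly multimodal behavior is not analyzed in depth.
- The theory assumes a smooth expert policy: real demonstrations may be discontinuous, which may limit how well the guarantees apply.
Relevance To Keywords:
- Representation learning: DARP implicitly learns local data-manifold structure through neighbor retrieval and difference vectors, which relates to the use of data geometry in representation learning, but this is not a core contribution.
- World models: The paper does not involve world models or environment dynamics modeling, so relevance is low.
- Reinforcement learning: DARP belongs to imitation learning, one way of exploiting expert data within reinforcement learning, but it involves neither reward functions nor online interaction.
- Post-training: The paper does not discuss post-training or fine-tuning, so relevance is low.
- Native multimodal large models: The paper focuses on robot control and continuous control tasks and does not involve multimodal large models, so relevance is low.
- Unified understanding and generation in multimodal large models: The paper involves neither multimodal understanding nor generation, so relevance is low.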
摘要翻译
阿尔茨海默病(AD)的进展具有高度异质性,通常通过稀疏且不规则的纵向数据进行观测,这对预测和个性化监测构成了挑战。现有的机器学习方法利用多模态数据改进了 AD 预测,但往往侧重于静态分类或队列级风险估计,对受试者特异性建模和不确定性感知推理的支持有限。为了解决这些局限性,我们提出了一种利用多模态纵向数据进行 AD 预测和基于场景分析的个性化数字孪生框架。所提出的方法整合了互补的建模策略,以捕捉临床转换以及各访问点之间的时间依赖性。该框架利用来自阿尔茨海默病神经影像学倡议(ADNI)的数据,包括认知评估、临床变量和基于 MRI 的表型,预测认知状态和诊断类别,同时量化预测不确定性并支持患者特定的假设轨迹分析。在无泄漏的受试者级划分上的评估表明,该方法在分数预测和诊断分类方面表现强劲。在此稀疏且不规则的 ADNI 数据设置中,相邻访问点之间的基于转换的建模比基于序列的方法实现了更高的预测准确性,表明局部转换建模可能更具数据效率。尽管序列模型在不确定性感知轨迹预测中仍具价值,但局部转换建模提供了一种更具数据效率和稳健性的预测策略。这些发现强调了将时间建模策略与临床数据结构对齐的重要性,并表明基于转换的数字孪生构建范式可能为神经退行性疾病中的个性化疾病预测提供一种实用且可解释的方法。
Abstract
Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
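Scoring Rationale: The paper builds digital twins for Alzheimer's disease. It is highly relevant to MultiModal (8) because it explicitly uses multimodal longitudinal data (cognitive, clinical, and MRI). It is conceptually related to World Models (4) and model-based RL (4) through state-transition modelling and what-if trajectory analysis. It involves no MLLM, Tokenizer, or dedicated Visual Encoder architecture, and its link to Unify Models is weak (2). No designated expert appears in the author list. The weighted total is 30.0.
Keywords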
Transition-Based Modelling, Digital Twin, Alzheimer's Disease, Multimodal Longitudinal Data, Sparse Longitudinal Data, What-if Analysis, Uncertainty-aware Reasoning
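In-Depth Analysis
Chinese Title: 基于稀疏纵向数据的阿尔茨海默病过渡式数字孪生建模
Summary: Alzheimer's disease (AD) progression is highly heterogeneous, and clinical follow-up data are sparse and irregular. To address this, the paper proposes a personalised digital twin framework that combines an MLP-based transition model (short-horizon prediction) with a BiLSTM-Attention sequence model (long-horizon prediction and uncertainty quantification). Using cognitive assessments, clinical variables, and MRI structural phenotypes from the ADNI dataset, it predicts future cognitive status and diagnostic category and supports patient-specific what-if trajectory analysis. Experiments show that, with sparse longitudinal data, transition modelling over adjacent visits predicts more accurately than sequence modelling, which suggests that local transition modelling is more efficient when data are limited. The sequence model remains useful for uncertainty-aware forecasting and scenario analysis, but the transition model is the more robust predictor. The work stresses the importance of matching the temporal modelling strategy to the structure of real clinical data and offers a practical, interpretable approach to personalised forecasting in neurodegenerative disease.
Innovations:
- A hybrid digital twin framework that supports both short-horizon transition prediction and long-horizon sequence prediction, and that integrates uncertainty quantification and what-if scenario analysis.
- An empirical finding: on sparse, irregular longitudinal clinical data, transition modelling over adjacent visits (MLP) is more data-efficient and more accurate than sequence modelling (BiLSTM-Attention).
- A concrete form of the digital twin concept as an updatable per-individual predictive representation that supports personalised monitoring and forward-looking reasoning.
- Leak-free subject-level splits on the ADNI dataset, with thorough feature selection and ablation analysis.
Methodology: The study uses the ADNI dataset and combines static variables (sex, education, APOE, and others) with dynamic variables (cognitive scores, MRI phenotypes), which gives 385 predictors after preprocessing. The hybrid digital twin has two branches. 1) Transition branch: an MLP takes pairs of adjacent visits (t → t + 6 months) and predicts MMSE and diagnostic category, using mRMR feature selection, the Adam optimizer, and early stopping. 2) Sequence branch: a BiLSTM-Attention model works on sequences aligned from baseline to 24 months with attention pooling, uses SmoothL1 loss for regression and Focal loss for classification, and keeps dropout active at test time to estimate uncertainty. Experiments use a 70%/20%/10% subject-level split, report RMSE, MAE, ACC, AUC, and macro-F1, and compute confidence intervals by bootstrap.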
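A subject-level split is what keeps the evaluation leak-free: all visits of one person land in exactly one partition. A minimal NumPy sketch follows, assuming the 70%/20%/10% figures refer to train/validation/test in that order (the summary does not say) and using a made-up array of subject IDs.

```python
import numpy as np

def subject_level_split(subject_ids, fractions=(0.7, 0.2, 0.1), seed=0):
    """Split visit indices so that no subject appears in more than one partition."""
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    n_train = int(round(fractions[0] * len(subjects)))
    n_val = int(round(fractions[1] * len(subjects)))
    groups = (
        subjects[:n_train],
        subjects[n_train:n_train + n_val],
        subjects[n_train + n_val:],
    )
    # Map each group of subjects back to the row indices of their visits.
    return [np.flatnonzero(np.isin(subject_ids, group)) for group in groups]

# Hypothetical data: 20 subjects with 3 visits each (one row per visit).
visit_subjects = np.repeat(np.arange(20), 3)
train_idx, val_idx, test_idx = subject_level_split(visit_subjects)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting by row instead of by subject would let earlier and later visits of the same person straddle the train/test boundary, which inflates the reported accuracy.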
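Key Results:
- On the test set, the MLP transition branch beats the BiLSTM-Attention sequence branch: MMSE prediction RMSE 2.149 vs 2.687 and MAE 1.529 vs 1.926; diagnosis classification ACC 0.906 vs 0.806, AUC 0.976 vs 0.928, and macro-F1 0.908 vs 0.798.
- mRMR feature selection: with K=20, validation ACC reaches 0.943 and RMSE 2.153.
- Ablations show that using static and dynamic features together works better than using either subset alone.
- In the same feature space, the MLP beats baseline classifiers such as logistic regression, random forest, and decision tree.
Tech Stack:
- MLP (multilayer perceptron)
- BiLSTM (bidirectional long short-term memory network)
- Attention mechanism (attention pooling)
- mRMR (minimum redundancy maximum relevance) feature selection
- Adam optimizer
- Early stopping
- SmoothL1 loss
- Focal loss
- Dropout (for uncertainty quantification)
- Linear interpolation (to align irregular follow-up visits)
- Z-score standardization
- Bootstrap confidence intervals
Strengths:
- Proposes a practical transition modelling strategy for sparse, irregular real-world clinical data and validates it.
- The framework offers both accurate short-horizon prediction and long-horizon uncertainty-aware scenario analysis.
- Leak-free subject-level splits make the evaluation credible.
- The code is open source and the results are reproducible.
- Systematic ablations over feature sources (static versus dynamic) and feature selection give useful insight.
Limitations:
- The sequence branch aligns data by linear interpolation, which may introduce artificial regularity, so the comparison should be interpreted with care.
- Only the ADNI dataset is used, so generalization remains to be verified.
- The transition model predicts only the adjacent visit (6 months ahead); longer horizons require iteration or the sequence branch.
- There is no direct comparison with current state-of-the-art AD prediction methods such as Transformer-based models.
- The what-if analysis is demonstrated only as a concept and lacks clinical validation.
Relevance To Keywords:
- Representation learning: The paper uses mRMR for feature selection, and the MLP and BiLSTM learn hidden representations, but it makes no theoretical contribution to representation learning.
- World models: A digital twin can be seen as a world model of an individual's disease progression that supports prediction and counterfactual reasoning.
- Model predictive control / reinforcement learning: The paper does not involve reinforcement learning, but transition modelling is analogous to a state-transition model and could be used for planning.
- Native multimodal large models: The paper uses multimodal data (cognitive, clinical, MRI), but the models are small, not large models.
- Unified understanding and generation in multimodal large models: The paper only predicts (understanding) and does not involve generation.
- Post-training: The paper does not involve the pretrain-then-fine-tune paradigm.
- Overall relevance is moderate: the main contributions are in digital twins and sparse-data modelling, with indirect links to world models and representation learning.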
摘要翻译
跨模型和模态的学习表示通常表现出显著的结构相似性,暗示着共享的底层概念分解。然而,概念对齐的定义仍不明确:现有方法在相同的术语下优化不同的目标,掩盖了实际上对齐的是什么。我们提出一个统一的框架,沿两个维度分解对齐:对齐的是什么(表示 vs. 概念)以及在对齐的粒度上(实例级 vs. 分布级)。这导出了四个相应的属性,即平移与概念一致性的实例级和分布级变体,并精确揭示了现有方法提供了哪些保证。我们进一步引入了 InterVenchA,这是一个基于干预的基准,分别衡量提取质量、平移质量和概念一致性。通过理论和实验,我们表明对齐目标之间通常假设的等价性在实践中失效:优化一个属性并不能可靠地恢复其他属性,且纯无监督目标无法恢复有意义的实例级对齐。随后,我们提出了耦合稀疏自编码器(CoSAE),它联合强制执行互补的对齐目标。唯有在此机制下才能实现强对齐。令人惊讶的是,当以分布级目标为锚点时,仅需 0.1% 的配对数据就足以恢复实例级对齐。总体而言,我们的结果表明概念对齐本质上是多目标的:必须如此定义、测量和优化。
Abstract
Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties (instance-wise and distributional variants of translation and concept consistency) and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 7.0/10 | 10.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
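Scoring Rationale: The title contains 'Unifying Framework', which relates loosely to 'Unify Models'. The abstract explicitly mentions modalities and cross-modal representations, so 'MultiModal' is highly relevant. Tokenizer, World Models, MLLM, and model-based RL do not appear in the title or abstract and score low. The author list does not include Yang Shi or the other designated experts, so no bonus is applied. The weighted total is 30.0, above the dynamic passing score of 27.8.
Keywords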
Concept-Based Representational Similarity, Unifying Framework, Concept Alignment, MultiModal Representations, Coupled Sparse Autoencoder, Instance-wise Alignment, Distributional Consistency
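In-Depth Analysis
Chinese Title: 基于概念的表征相似性的统一框架
Summary: This paper proposes a unifying framework that decomposes concept alignment into four basic properties: instance-level translation, distribution-level translation, instance-level concept consistency, and distribution-level concept consistency. It shows which of these properties each existing method (crosscoders, universal SAEs, aligned SAEs) satisfies. The authors introduce the InterVenchA benchmark, which measures extraction quality, translation quality, and concept consistency separately. Theory and experiments show that the equivalences commonly assumed between alignment objectives do not hold in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives cannot recover meaningful instance-level alignment. The paper therefore proposes the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary regularization objectives and recovers instance-level alignment with as little as 0.1% paired data when that data anchors the distributional objectives. The results indicate that concept alignment is inherently a multi-objective problem that must be defined, measured, and optimized as such.
Innovations:
- A four-property unifying framework for concept alignment (instance-level and distribution-level translation and concept consistency) that positions existing methods clearly.
- The InterVenchA benchmark, which evaluates extraction quality, translation quality, and concept consistency independently.
- Theory and experiments showing that the assumed equivalences between common alignment objectives (such as translation, cycle consistency, and distribution matching) fail in practice.
- The Coupled Sparse Autoencoder (CoSAE), which reaches strong alignment through joint complementary regularization with very little paired data.
- Results on synthetic and real embeddings showing that CoSAE outperforms existing methods (crosscoders, universal SAEs, aligned SAEs).
Methodology: The paper first formalizes the four alignment properties and derives the theoretical relations between them (for example, Propositions 1 and 2). It then builds the InterVenchA benchmark from synthetic data and real embeddings and measures extraction quality (reconstruction error, sparsity), translation quality (instance-level and distribution-level translation error), and concept consistency (feature matching). Experiments then test whether the theoretical equivalences hold in practice. Finally, the CoSAE model adds translation regularization (instance-level or distribution-level) and concept consistency regularization to the standard SAE loss, uses a small amount of paired data to anchor the distributional objectives, and trains everything jointly.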
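One way to picture "standard SAE loss plus translation and concept-consistency regularizers" is the NumPy sketch below. It is an assumption-laden illustration, not the paper's objective: it uses linear encoders and decoders with a ReLU, shows only instance-level terms computed on the paired rows, omits the distribution-level terms entirely, and every name and weight (`lam_sparse`, `lam_trans`, `lam_cons`) is hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def cosae_style_loss(xa, xb, enc_a, dec_a, enc_b, dec_b, paired,
                     lam_sparse=1e-3, lam_trans=1.0, lam_cons=1.0):
    """Combine per-model SAE losses with paired translation and consistency terms.

    xa, xb : (n, d) embeddings from two models; row i of each is a pair where paired[i] is True.
    paired : boolean mask with at least one True entry.
    """
    za, zb = relu(xa @ enc_a), relu(xb @ enc_b)               # concept codes
    recon = np.mean((za @ dec_a - xa) ** 2) + np.mean((zb @ dec_b - xb) ** 2)
    sparsity = np.mean(np.abs(za)) + np.mean(np.abs(zb))
    # Translation: decode model A's concepts with model B's decoder.
    translation = np.mean((za[paired] @ dec_b - xb[paired]) ** 2)
    # Concept consistency: paired inputs should activate the same concepts.
    consistency = np.mean((za[paired] - zb[paired]) ** 2)
    total = recon + lam_sparse * sparsity + lam_trans * translation + lam_cons * consistency
    return {"recon": recon, "sparsity": sparsity, "translation": translation,
            "consistency": consistency, "total": total}

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
enc, dec = rng.normal(size=(4, 6)), rng.normal(size=(6, 4))
paired = np.zeros(8, dtype=bool)
paired[:2] = True  # only a small fraction of rows is paired

# Sanity check: identical inputs and weights give zero consistency loss.
losses = cosae_style_loss(x, x, enc, dec, enc, dec, paired)
print(losses)
```

Keeping the terms separate in the returned dictionary mirrors the paper's point that the properties should be measured individually rather than folded into one number.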
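Key Results:
- Translation properties do not reliably guarantee concept consistency, and vice versa (the equivalence in Proposition 1 does not hold in practice).
- Cycle consistency is not a reliable proxy for alignment.
- Distribution-level matching does not guarantee instance-level alignment.
- CoSAE beats crosscoders, universal SAEs, and aligned SAEs on both synthetic and real data, with the best translation quality and concept consistency.
- As little as 0.1% paired data recovers instance-level alignment when it anchors the distributional objectives.
Tech Stack:
- Sparse autoencoder (SAE)
- Crosscoder
- Universal SAE
- SAE-Aligned
- Coupled Sparse Autoencoder (CoSAE)
- Regularized loss functions (reconstruction, translation, concept consistency)
- InterVenchA benchmark (synthetic data generation, real embedding extraction)
- Linear model analysis (derivation of the propositions)
- Distribution matching (pushforward measures)
Strengths:
- Provides a clear taxonomy of concept alignment that helps unify and compare methods.
- Uses theory and extensive experiments to expose the limits of common assumptions, which gives useful guidance.
- CoSAE is simple and effective, needs very little paired data for strong alignment, and is practical.
- InterVenchA gives later work a standardized evaluation tool.
- The paper is rigorously structured and moves coherently from definitions to experiments to method.
Limitations:
- It focuses on the SAE architecture and does not examine other concept extraction methods (such as non-sparse coding).
- Experiments cover only synthetic data and a limited set of real embeddings; large-scale multimodal settings remain to be validated.
- CoSAE needs a small amount of paired data and may not apply directly when no pairs exist.
- The theory for distribution-level alignment assumes linear invertibility; the equivalences may be more complicated in the nonlinear case.
Relevance To Keywords:
- Unify Models: The unifying framework aims to align the concept representations of different models (architectures, initializations, modalities), which directly serves model unification.
- World Models: Indirectly relevant; concept alignment helps build a shared world representation in which different systems understand the same concept in the same way.
- Representation Learning: Directly relevant; the paper studies the decomposition and alignment of representations and proposes new learning objectives and evaluation methods.
- Model-Based RL and reinforcement learning: Weakly relevant; concept alignment could be applied to state-representation alignment in model-based RL, but the paper involves no RL task.
- Native multimodal large models: The paper discusses cross-modal alignment (for example, CLIP), which relates to multimodal large models.
- Unified understanding and generation in multimodal large models: Concept alignment is one of the key techniques for unifying multimodal understanding and generation.
- Post-training: Concept alignment could be used as a post-training technique, but the paper does not discuss post-training explicitly.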
摘要翻译
建模相互作用的动力系统需要同时捕捉空间相互作用与长程时间依赖。图神经网络(GNNs)提供了一种自然的表示方法,但通常依赖于自回归展开,并将空间和时间动力学分开处理,导致在长时域上产生误差累积。现有方法也主要关注局部相互作用和短时上下文,限制了其捕捉多跳依赖和全局结构的能力。我们引入了图 Mamba 算子(GraMO),这是一种将状态空间模型与基于图的交互学习相结合的潜在空间模拟器。与先前工作将节点序列化或在独立阶段分别应用空间和时间更新不同,GraMO 在单次循环中耦合了基于图的交互和时序状态更新。该更新在潜在状态上是线性的,具有输入依赖的系数,能够适应不同机制。我们在 N 体系统、动作捕捉和机器人数据集上评估了 GraMO,在所有基准上实现了最低误差,并在长时域预测中获得了最大增益。
Abstract
Modeling interacting dynamical systems requires capturing spatial interactions alongside long-range temporal dependencies. Graph neural networks (GNNs) provide a natural representation but typically rely on autoregressive rollouts and treat spatial and temporal dynamics separately, leading to error accumulation over long horizons. Existing approaches also focus on local interactions and short temporal contexts, limiting their ability to capture multi-hop dependencies and global structure. We introduce the Graph Mamba Operator (GraMO), a latent-space simulator that integrates state-space models with graph-based interaction learning. In contrast to prior work that sequences nodes or applies spatial and temporal updates in separate stages, GraMO couples graph-based interactions and temporal state updates within a single recurrence. The update is linear in the latent state, with input-dependent coefficients that adapt across regimes. We evaluate GraMO on N-body systems, motion capture, and robotics datasets, achieving the lowest error across benchmarks and the largest gains in long-horizon prediction.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 7.0/10 | 10.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 7.0/10 | 10.5 |
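Scoring Rationale: The paper proposes the Graph Mamba Operator, which unifies graph interaction and temporal state updates in a single recurrence, so the architecture reflects model unification (Unify Models: 6). As a latent simulator of dynamical systems, its core idea is closely aligned with World Models (World Models: 7). The simulator suits robotics and similar settings and is a key component of model-based RL (model-based RL: 7). The paper does not involve tokenizers, visual encoders, large language models, or multimodal data fusion, so those keywords score 0.
Keywords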
Graph Mamba Operator, Latent Simulator, Interacting Particle Systems, State Space Models, Graph-based Interaction Learning, Long-range Temporal Dependencies, Robotics Datasets
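In-Depth Analysis
Chinese Title: 图Mamba算子:交互粒子系统的潜在模拟器
Summary: This paper proposes the Graph Mamba Operator (GraMO), a latent-space simulator for trajectory prediction in interacting particle systems. Conventional methods usually separate spatial interaction from temporal modeling, which lets errors accumulate in long-horizon prediction. GraMO couples a state-space model (SSM) with a graph neural network: graph propagation and the temporal state are updated together in a single recurrence, and the update coefficients depend on the input, so they adapt to different dynamic regimes. The method encodes the system state into a latent representation that evolves through a linear recurrence, which avoids autoregressive error propagation in observation space. On N-body systems, motion capture, and robotics datasets, GraMO reaches the lowest prediction error, with the largest advantage in long-horizon prediction. The paper also provides a stability analysis of the latent recurrence and a connection to graph ARMA processes.
Innovations:
- A graph-coupled state-space operator that unifies spatial interaction and temporal memory in a single latent recurrence instead of modeling them separately.
- Input-dependent update coefficients that adapt to changing dynamics (such as contact, drift, and fast transitions), similar to a time-varying Koopman operator.
- A theoretical analysis of the stability and long-range propagation of the latent recurrence, with a connection to graph autoregressive moving average (ARMA) processes.
- The best trajectory prediction on several physical benchmarks, with large margins over existing graph neural network and neural operator methods in long-horizon prediction.
Methodology: GraMO uses an encoder, latent simulator, and decoder architecture. The particle states (position, velocity, attributes) are first encoded into a latent state H(t)∈R^{V×D×N}, where D keeps the physical semantics and N provides the memory axis. The latent state evolves according to the ODE dH/dt = F(H, X, G), where F is an operator that couples the graph structure (through the adjacency matrix) with a state-space model. After discretization, each time step performs a joint update of graph message passing and the SSM recurrence, and the SSM parameters (B, C, Δ) depend on the current input. The output is obtained through a readout layer, Y(t)=R(H(t)). The training objective minimizes the L2 norm between the predicted and the true trajectories.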
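The phrase "linear in the latent state, with input-dependent coefficients" can be made concrete with a small NumPy recurrence. This is a hypothetical form chosen for illustration, not the operator defined in the paper: it uses a diagonal state matrix with the standard zero-order-hold discretization, a symmetrically normalized adjacency with self-loops, and a 2D latent state of shape (V, N) rather than the (V, D, N) tensor described above. `w_delta` and `w_b` are made-up projection matrices.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    a = adj + np.eye(adj.shape[0])
    degree = a.sum(axis=1)
    return a / np.sqrt(np.outer(degree, degree))

def graph_ssm_step(h, x, adj_norm, a_diag, w_delta, w_b):
    """One coupled update: graph mixing and SSM recurrence in the same step.

    h: (V, N) latent state, x: (V, D) node inputs, a_diag: (N,) negative diagonal of A.
    """
    delta = np.log1p(np.exp(x @ w_delta))      # softplus step size, input-dependent, (V, 1)
    a_bar = np.exp(delta * a_diag)             # zero-order hold: exp(delta * A), (V, N)
    b_bar = (a_bar - 1.0) / a_diag             # zero-order hold input gain, (V, N)
    u = adj_norm @ (x @ w_b)                   # graph-mixed input, (V, N)
    return a_bar * (adj_norm @ h) + b_bar * u  # linear in h for a fixed x

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node chain
adj_norm = normalized_adjacency(adj)
a_diag = -np.arange(1.0, 5.0)                  # four stable modes
w_delta, w_b = rng.normal(size=(2, 1)), rng.normal(size=(2, 4))

h = np.zeros((3, 4))
for x in rng.normal(size=(10, 3, 2)):          # roll the latent state over 10 steps
    h = graph_ssm_step(h, x, adj_norm, a_diag, w_delta, w_b)
print(h.shape)
```

Because `a_bar` lies strictly between 0 and 1 whenever `a_diag` is negative, and the normalized adjacency has eigenvalues no larger than 1 in magnitude, the state-dependent part of this toy update cannot grow from step to step.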
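Key Results:
- On N-body simulation, motion capture, and robotics datasets, GraMO's mean squared error (MSE) is lower than that of every baseline (such as GNNs, neural operators, and Mamba variants).
- In long-horizon prediction (for example, 100 steps ahead), GraMO's error grows much more slowly than that of autoregressive GNN models, which indicates better stability.
- In zero-shot generalization tests, GraMO keeps a low error for unseen particle counts and initial conditions.
- Ablations confirm that graph coupling and input-dependent coefficients are both key to performance.
Tech Stack:
- State-space model (SSM): continuous-time linear ODE, HiPPO initialization, zero-order-hold discretization
- Selective state-space model (S6): input-dependent B, C, and Δ parameters
- Graph neural network: message passing, with interactions defined by the adjacency matrix
- Koopman operator theory: linear evolution in latent space
- Graph autoregressive moving average (ARMA) process
- L2 loss function, gradient-descent optimization
Strengths:
- Unifies spatial interaction and temporal memory, which avoids the error accumulation of separate modeling.
- Input-dependent coefficients let the model adapt to different dynamic phases, which improves generalization.
- The latent recurrence is linear, which supports long-horizon prediction and efficient computation.
- Achieves state-of-the-art results on several physical benchmarks, with a clear advantage in long-horizon prediction.
- Provides theoretical analysis (stability, link to ARMA), which improves interpretability.
Limitations:
- The model assumes a known, static graph structure and does not apply to dynamic graphs or unknown interaction relations.
- The latent dimension N must be tuned and affects both memory capacity and computational cost.
- Only particle systems are evaluated; more complex multimodal settings (such as vision-language) are not tested.
- Compared with pure Koopman methods, the input-dependent coefficients add model complexity.
Relevance To Keywords: The paper focuses on a latent simulator for interacting particle systems and touches on representation learning (latent state encoding), world models (learning physical dynamics), and model-based reinforcement learning (latent simulation can be used for planning). It does not involve multimodal large models, unified understanding and generation, or native multimodality. Relevance is therefore moderate: it relates mainly to world models and representation learning and is indirectly useful for model-based RL.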
摘要翻译
多模态联邦图学习(MM-FGL)旨在从包含文本和图像的去中心化图上进行协同学习。然而,现实中的客户端可能不具备共同的模态基础:视觉搜索客户端可能包含图像交互图却缺乏卖家描述,而目录客户端可能提供文本却缺乏产品图像。我们将这种实际场景称为客户端级模态缺失。与随机的实例级缺失不同,缺失模态的客户端缺乏重建缺失模态所需的本地语义基础。更重要的是,在图学习中,不完整的表征会初始化消息传递过程,因此插补错误可能会被接收端拓扑过滤、混合并放大。为填补这一空白,我们提出 PRISM(Proactive Retrieval and Imputation via Structural Meta-prompting),一种感知拓扑的联邦跨模态插补框架。PRISM 并非仅从本地观测重建缺失模态,而是从联邦中恢复缺失模态语义,并在拓扑感知控制下将其引入本地图传播过程。在六个多模态图数据集上进行的实验(涵盖图中心型和模态中心型任务)表明,PRISM 能持续改善模态缺失客户端,平均性能优于最先进的基线 4.48%。
Abstract
Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text and images. However, real-world clients may not share a common modality basis: a visual-search client may contain image-interaction graphs but no seller descriptions, while a catalog client may provide text but no product images. We refer to this practical setting as client-level modality deficiency. Unlike random instance-wise missingness, a deficient client lacks the local semantic basis needed to reconstruct the absent modality. More importantly, in graph learning, incomplete representations initialize message passing, so imputation errors can be filtered, mixed, and amplified by the receiving topology. To address this gap, we propose PRISM (Proactive Retrieval and Imputation via Structural Meta-prompting), a topology-aware federated cross-modal imputation framework. Rather than reconstructing the missing modality solely from local observations, PRISM recovers missing-modality semantics from the federation and introduces them into local graph propagation under topology-aware control. Experiments on six multimodal graph datasets across graph-centric and modality-centric tasks show that PRISM consistently improves modality-deficient clients, outperforming state-of-the-art baselines by 4.48% on average.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
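Scoring rationale: The paper centers on imputing missing modalities in multimodal federated graph learning, so it is highly relevant to 'MultiModal' (9). It processes images and text, giving weak relevance to 'Visual Encoder' and 'MLLM' (prompt-based) (3 each). 'Unify Models' is weakly relevant in the sense of a unified model under federated learning (2). 'Tokenizer', 'World Models', and 'model-based RL' have no direct connection to graph learning or imputation (1 each). The author list contains none of the designated experts, so no bonus is applied. The weighted total is 30.0, above the dynamic pass mark of 27.8.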
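The weighted total quoted in the rationale can be reproduced from the score table: each keyword contributes weight times relevance, and the sum is compared with the pass mark. The sketch below assumes exactly that rule; the name `weighted_total` and the dictionary layout are illustrative and not part of the report's own tooling.

```python
# Relevance scores (out of 10) copied from the PRISM score table.
PRISM_RELEVANCE = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 3.0,
    "World Models": 1.0,
    "MLLM": 3.0,
    "MultiModal": 9.0,
    "model-based RL": 1.0,
}
WEIGHT = 1.5           # every keyword in the table carries the same weight
PASS_THRESHOLD = 27.8  # dynamic pass mark quoted in the rationale


def weighted_total(relevance, weight=WEIGHT):
    """Sum weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * score for score in relevance.values())


total = weighted_total(PRISM_RELEVANCE)
passed = total > PASS_THRESHOLD
print(f"total={total:.1f} passed={passed}")
```

The same function can be applied to any other score table in this report by swapping in that table's relevance column.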
Keywords
Multimodal Federated Graph Learning, Cross-Modal Imputation, Modality Deficiency, Topology-Aware, Federated Learning, Structural Meta-prompting, Graph Representation
In-Depth Analysis
Chinese Title: PRISM: 面向模态缺失联邦图学习的拓扑感知跨模态插补
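Summary: This paper addresses client-level modality deficiency in multimodal federated graph learning (MM-FGL) and proposes the PRISM framework. In this setting, different clients may permanently lack an entire modality (such as images or text), so the local semantic basis is missing, and imputation errors for the absent modality can be propagated, mixed, and even amplified through the graph topology. PRISM follows a "retrieve globally, inject structurally" principle: it obtains missing-modality semantics across clients from the federation, converts them into graph-compatible auxiliary signals, and adaptively modulates their influence according to the receiving client's topology and the reliability of the retrieval. Experiments on six multimodal graph datasets, covering graph-centric and modality-centric tasks, report an average improvement of 4.48% and an improvement of 12.24% under the modality-deficient setting.
Innovations:
- First to identify and formalize client-level modality deficiency, distinguishing it from random instance-level missingness and exposing its graph-specific consequence: semantic boundary error.
- Proposes the PRISM framework for topology-aware federated cross-modal imputation, separating where the missing semantics come from and how they are injected.
- Designs a global retrieval and structural injection mechanism: missing-modality prototypes are obtained from the federation, converted into virtual anchors, and their propagation influence is controlled by topology-conditioned gating.
- Shows empirically that the same modality-deficiency error propagates very differently under different topologies, and that structurally central nodes are dangerous entry points.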
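Methodology: PRISM follows the federated learning paradigm, with each client holding a private subgraph. First, each client encodes node representations from its locally available modality. Next, it obtains semantic prototypes of the missing modality through privacy-preserving cross-client prototype sharing. These prototypes are then converted into graph-compatible auxiliary signals (virtual anchors). Finally, the weight of the auxiliary signals in message passing is adjusted adaptively according to the receiving client's topology (e.g., degree, homophily) and the retrieval reliability, which yields topology-aware injection. Shared parameters are optimized through federated averaging.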
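The topology-aware injection step can be illustrated with a toy gate. This is only a sketch under stated assumptions, not PRISM's actual formulation: the sigmoid form, the per-node use of degree, the weights `w_degree`, `w_homophily`, `w_reliability`, and the function names are all hypothetical. It only encodes the qualitative behavior described above, namely that higher degree and homophily (which the analysis links to error amplification) shrink the anchor's influence, while higher retrieval reliability enlarges it.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def injection_gate(degree, homophily, reliability,
                   w_degree=-0.5, w_homophily=-1.0, w_reliability=3.0, bias=0.0):
    """Hypothetical gate in (0, 1) controlling how much of a virtual anchor is injected.

    Negative weights on degree and homophily damp injection where errors would
    spread most; a positive weight on reliability trusts good retrievals more.
    """
    score = (w_degree * math.log1p(degree)
             + w_homophily * homophily
             + w_reliability * reliability
             + bias)
    return sigmoid(score)


def inject_anchor(node_feature, anchor_feature, gate):
    """Blend the retrieved anchor into the node feature with weight `gate`."""
    return [(1.0 - gate) * h + gate * a
            for h, a in zip(node_feature, anchor_feature)]


hub_gate = injection_gate(degree=50, homophily=0.9, reliability=0.6)
leaf_gate = injection_gate(degree=2, homophily=0.9, reliability=0.6)
mixed = inject_anchor([1.0, 0.0], [0.0, 1.0], leaf_gate)
print(f"hub gate={hub_gate:.3f}, leaf gate={leaf_gate:.3f}, mixed={mixed}")
```

With these illustrative weights a hub node receives a smaller gate than a leaf node at the same reliability, mirroring the stronger control over structurally central nodes noted in the analysis.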
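Key Results:
- On six multimodal graph datasets (Toys, KU, QB, and others), PRISM improves performance by 12.24% on average under the modality-deficient setting and outperforms existing baselines by 4.48%.
- The empirical study shows that the same modality-deficiency error propagates very differently across client topologies, and that error amplification is strongly correlated with average degree (ρ=1.00) and homophily (ρ=0.98).
- PRISM suppresses the topology contamination caused by unreliable semantic injection, with stronger control over structurally central nodes.
Tech Stack:
- Graph neural networks (GNN)
- Federated learning
- Cross-modal prototype sharing
- Topology-conditioned gating
- Normalized propagation matrix
- Linearized message passing
- Semantic boundary error analysis
Strengths:
- Novel problem: the first systematic study of how client-level modality deficiency affects federated graph learning, including its graph-specific propagation behavior.
- Novel method: the "retrieve globally, inject structurally" principle combines cross-modal imputation with topology-aware control.
- Thorough experiments: validated on multiple datasets and tasks with clear gains, and ablations support the contribution of each component.
- Analysis: the empirical study quantifies the correlation between error propagation and topology statistics, which grounds the design.
Limitations:
- Relies on cross-client prototype sharing, which may involve a trade-off between privacy and communication cost; protection mechanisms such as differential privacy are not discussed in depth.
- The topology-conditioned gate is built on simple statistics such as degree and homophily and may miss more complex structural patterns.
- Experiments cover only static graphs; dynamic or temporally evolving graphs are not considered.
- The method assumes that at least some clients in the federation hold the missing modality; if every client lacks the same modality, it cannot be recovered.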
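Relevance To Keywords:
- Unify Models: The paper focuses on multimodal federated graph learning and jointly models image and text modalities, but does not directly discuss unified generation and understanding.
- World Models: The paper does not involve world models, although the graph structure can be seen as an abstraction of environment topology and connects to representation learning.
- Representation Learning: The core goal is learning node representations, with quality improved through cross-modal imputation and topology-aware propagation.
- Model-Based RL: The paper does not involve reinforcement learning; model aggregation in federated learning has only an indirect link to post-training ideas.
- Native multimodal large models: The paper uses no large model; it builds on GNNs and prototype sharing and belongs to small-scale multimodal learning.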
摘要翻译
文本驱动的室内场景生成与编辑需要一种中间表示,该表示既能由语言模型生成,也能被其修订。现有的基于大语言模型(LLM)的系统通常依赖场景图或全局约束列表,这些表示虽然紧凑,但未能充分指定局部几何结构,使得基于指令的编辑难以定位。我们将此问题框架化为结构化程序生成与局部程序修复,并提出层次化描述场景语言(HDSL),这是一种用于结构化 3D 室内场景的 XML/CSS 风格的领域特定语言(DSL)。HDSL 将房间、区域、物体及支撑面表示为带有局部坐标的树,使得复杂场景更容易进行递归规划,也更容易检索以供编辑。我们的流程使用 LLM 代理生成带有有界验证的 HDSL 子树,通过多模态资产检索将非虚拟节点实例化,并应用力导向布局优化来修复边界和碰撞错误。在编辑方面,层次化检索增强生成(HRAG)检索相关子树,要求 LLM 仅重写该局部上下文,并通过确定性的三路合并将结果合并回原处。在我们复现的基准测试中,相较于完整的端到端文本到场景基线,HDSL 提高了平均物体覆盖率、文本 - 场景对齐度及生成时间,同时在几何指标上与最近的仅布局重现方法保持竞争力;在编辑任务上,HRAG 将令牌使用量减少 5.22 倍,运行时间减少 6.19 倍,为所有八对配对编辑生成有效的 DSL,并更好地保留无关场景对象。
Abstract
Text-driven indoor scene generation and editing require an intermediate representation that language models can both produce and revise. Existing LLM-based systems often rely on scene graphs or global constraint lists, which are compact but underspecify local geometry and make instruction-based edits difficult to localize. We frame this problem as structured program generation and local program repair, and propose Hierarchical Descriptive Scene Language (HDSL), an XML/CSS-style domain-specific language for structured 3D indoor scenes. HDSL represents rooms, regions, objects, and support surfaces as a tree with local coordinates, making complex scenes easier to plan recursively and easier to retrieve for editing. Our pipeline uses LLM agents to generate HDSL subtrees with bounded verification, grounds non-virtual nodes through multimodal asset retrieval, and applies force-directed layout optimization to repair boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation retrieves the relevant subtree, asks the LLM to rewrite only that local context, and merges the result back through a deterministic three-way merge. In our reproduced benchmark, HDSL improves average object coverage, text-scene alignment, and generation time over full text-to-scene baselines while remaining competitive with recent layout-only reproductions on geometry metrics; for editing, HRAG reduces token use by 5.22x and runtime by 6.19x, produces valid DSL for all eight paired edits, and better preserves unrelated scene objects.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
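Scoring rationale: The paper proposes a hierarchical domain-specific language (HDSL), driven by LLM agents, for 3D indoor scene generation and localized editing. Relevance by keyword: MLLM and MultiModal score highest (5 and 6) because the work involves text-driven generation and multimodal asset retrieval; Unify Models and World Models score 3 because the method unifies the generation and editing flow and uses a structured scene representation; Tokenizer and Visual Encoder score low (1 and 2) because they are not core components (the method relies on retrieval rather than on encoder or tokenizer design); model-based RL is unrelated (0) because no reinforcement learning is involved. The weighted total is 30.0, above the dynamic pass mark of 27.8. The author list contains none of the designated experts (e.g., Saining Xie, Yang Shi).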
Keywords
Hierarchical Domain-Specific Language, 3D Indoor Scene Generation, Localized Editing, LLM Agents, Multimodal Asset Retrieval, Program Generation, Scene Representation
In-Depth Analysis
Chinese Title: HDSL:一种用于结构化3D室内场景生成和基于LLM智能体的局部编辑的分层领域特定语言
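Summary: This paper targets text-driven 3D indoor scene generation and editing and proposes HDSL, a hierarchical domain-specific language. Existing LLM-based systems typically use scene graphs or global constraint lists, which make localized edits hard to support and consume many tokens. HDSL represents a scene as an XML/CSS-style tree in which each node stores local coordinates, support relations, and an object identifier, so complex scenes can be planned recursively and retrieved easily for editing. In the generation pipeline, LLM agents recursively generate HDSL subtrees, a generate-verify-revise loop keeps the structure valid, multimodal asset retrieval supplies the objects, and force-directed layout optimization repairs boundary and collision errors. For editing, Hierarchical Retrieval-Augmented Generation (HRAG) retrieves and rewrites only the relevant subtree and merges it back into the global scene through a deterministic three-way merge. Experiments show that HDSL outperforms full text-to-scene baselines on object coverage, text-scene alignment, and generation time; compared with rewriting the whole scene, HRAG reduces token use by 5.22x and runtime by 6.19x while better preserving non-target objects.
Innovations:
- Proposes HDSL, an XML/CSS-style hierarchical scene DSL that turns 3D indoor generation into inspectable, addressable structured prediction with support for local coordinates and support relations.
- Designs a recursive generation pipeline that combines LLM agents, a generate-verify-revise loop, multimodal asset retrieval, and force-directed layout optimization to build complex scenes step by step.
- Introduces the HRAG-HDSL local editing mechanism, which retrieves the relevant subtree, rewrites it locally, and applies a deterministic three-way merge, substantially reducing token use and runtime while preserving unrelated scene objects.
- Treats the scene representation as a control interface for the LLM: each node is spatially self-contained within its parent frame, so local modifications do not disturb global state.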
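Methodology: The paper treats text-to-scene generation as structured prediction and editing as program repair. The main components are: 1) HDSL representation: an XML/CSS tree whose nodes carry semantic, geometric, hierarchical, and retrieval attributes. 2) Hierarchical generation: LLM agents expand nodes recursively; each node receives its ancestor path and parent geometry and outputs a set of child nodes, with a multi-candidate generate-verify-revise loop checking structural validity and spatial plausibility. 3) Object retrieval: non-virtual nodes obtain assets from Objaverse through the Holodeck retrieval pipeline. 4) Force-directed layout optimization: repairs boundary and collision errors. 5) Local editing: HRAG retrieves the relevant subtree, the LLM rewrites the local context, and a three-way merge folds the modified subtree back into the original DSL.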
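The merge-back step in local editing can be sketched as a generic deterministic three-way merge. The version below is a textbook-style merge over a flat mapping from node paths to attribute dictionaries, not the paper's implementation: the node paths, the attribute names, and the rule of keeping "ours" on a conflict are all assumptions made for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Deterministic three-way merge over {node_path: attributes} mappings.

    base   : the subtree as it was when retrieved
    ours   : the current scene's version of the same nodes
    theirs : the version rewritten by the LLM
    A missing key is treated as None, so additions and deletions are handled
    by the same comparison rules. Conflicts keep `ours` and are reported.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree
            value = o
        elif o == b:      # only the rewrite changed this node
            value = t
        elif t == b:      # only the scene changed this node
            value = o
        else:             # both changed it differently
            conflicts.append(path)
            value = o
        if value is not None:
            merged[path] = value
    return merged, conflicts


base = {
    "room/living/sofa": {"x": 1.0, "y": 2.0},
    "room/living/lamp": {"x": 0.5, "y": 0.5},
}
ours = dict(base)  # scene untouched since retrieval
theirs = {
    "room/living/sofa": {"x": 3.0, "y": 2.0},  # moved by the edit
    "room/living/lamp": {"x": 0.5, "y": 0.5},  # unchanged
    "room/living/rug": {"x": 2.0, "y": 2.0},   # newly added
}
merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

Iterating over sorted paths makes the result independent of dictionary insertion order, which is one simple way to keep such a merge deterministic.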
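Key Results:
- On the reproduced benchmark, HDSL outperforms full text-to-scene baselines on average object coverage, text-scene alignment, and generation time.
- It remains competitive with recent layout-only reproductions on geometry metrics.
- In the editing experiments, HRAG reduces token use by 5.22x and runtime by 6.19x.
- All eight paired edits produce valid DSL.
- HRAG preserves unrelated scene objects better than whole-scene rewriting.
Tech Stack:
- LLM agents (e.g., the GPT family)
- XML/CSS parser
- Force-directed layout optimization
- Multimodal asset retrieval (Objaverse)
- Deterministic three-way merge
- Bounded multi-candidate generate-verify-revise loop
Strengths:
- The hierarchical representation matches the natural structure of indoor scenes and supports recursive planning and local editing.
- The verification loop and force-directed optimization make the generation pipeline's output more reliable.
- The local editing mechanism substantially reduces compute cost while keeping the scene intact.
- No additional training is required; a pretrained LLM can handle unseen scene types.
- The DSL is close to web-development syntax, which makes it easy to understand and extend.
Limitations:
- Depends on LLM output quality; complex scenes may still yield implausible layouts.
- Force-directed layout optimization may not eliminate every collision and boundary violation.
- Asset retrieval is limited by the coverage and diversity of existing libraries such as Objaverse.
- Scene complexity is bounded by recursion depth and LLM context length.
- There is no direct comparison with other end-to-end learned methods, only with LLM baselines.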
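Relevance To Keywords:
- Native multimodal large models: The paper uses LLM agents to process natural language and generate scenes, which touches on multimodal understanding and generation.
- Representation learning: HDSL is a scene representation that encodes 3D structure as an editable tree, so it falls loosely under representation learning.
- World models: 3D indoor scene generation can be viewed as building a virtual world model, and HDSL provides a structured description of that world.
- Reinforcement learning: Force-directed layout optimization is an optimization procedure, but the paper does not use reinforcement learning.
- Post-training: The paper involves no post-training and runs inference with pretrained LLMs.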
摘要翻译
我们介绍了提交至 CVPR 2026 Argoverse 2 场景挖掘挑战赛的方案。我们的系统采用了一个四阶段流程:(1) 通过由 GLM 5.1 驱动的 Claude Code 代理进行自主代码生成,(2) 使用 Timestamp Balanced Accuracy (时间戳平衡准确率) 阈值 0.8 对训练集进行迭代筛选,以整理少样本示例,(3) 通过独立的 Claude Code 会话进行语义代码审查,(4) 使用 Qwen3-VL 进行场景级验证以过滤假阳性。我们在 Argoverse 2 测试集上报告了结果。
Abstract
We present our submission to the CVPR 2026 Argoverse 2 Scenario Mining Challenge. Our system uses a four-stage pipeline: (1) autonomous code generation via a Claude Code agent powered by GLM 5.1, (2) iterative training set screening with Timestamp Balanced Accuracy threshold 0.8 to curate few-shot examples, (3) semantic code review by a separate Claude Code session, and (4) Qwen3-VL scene-level verification to filter false positives. We report results on the Argoverse 2 test set.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 7.0/10 | 10.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
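Scoring rationale: The paper builds a pipeline for autonomous-driving scenario mining from large language models (Claude, GLM) and a multimodal large model (Qwen3-VL). It is therefore highly relevant to MLLM and MultiModal and somewhat relevant to Visual Encoder (as a model component), but has no direct connection to unified model architectures, tokenizer design, world models, or model-based RL. The author list contains none of the designated experts.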
Keywords
Claude Code, Scenario Mining, Argoverse 2, Autonomous Driving, Qwen3-VL, Code Generation, Visual Verification, Few-shot Examples
In-Depth Analysis
Chinese Title: 基于Claude Code的自动驾驶场景挖掘方法:面向Argoverse 2挑战赛
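Summary: This paper presents a four-stage pipeline for the CVPR 2026 Argoverse 2 Scenario Mining Challenge. The system uses Claude Code as an autonomous coding agent (backed by the GLM 5.1 model) to generate code that composes atomic functions from natural-language descriptions. It then iterates over the training set and keeps prompt-code pairs whose Timestamp Balanced Accuracy exceeds 0.8 as few-shot examples. A separate Claude Code session performs semantic code review, and finally the Qwen3-VL vision-language model runs scene-level binary verification to filter false-positive outputs. On the Argoverse 2 test set, the team (MICTeam) reaches HOTA-Temporal 27.91, Timestamp Balanced Accuracy 69.65, and Log Balanced Accuracy 69.32, outperforming the official baselines. The central ideas are replacing a single LLM call with an autonomous agent and using a VLM as a post-execution filter that directly reduces false positives.
Innovations:
- Uses the Claude Code autonomous coding agent (backed by GLM 5.1) instead of a conventional single LLM API call, exploring the atomic function library iteratively through tool calls before generating code.
- Proposes iterative training-set screening: up to 5 rounds of code generation per prompt, keeping prompt-code pairs with Timestamp Balanced Accuracy > 0.8 as few-shot examples.
- Designs a two-agent architecture in which one agent generates code and a separate Claude Code session reviews it semantically, covering both syntactic correctness and semantic fidelity.
- Introduces Qwen3-VL for scene-level binary verification as a post-execution filter that removes false-positive scenarios whose code executes successfully but whose event does not actually occur.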
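Methodology: The paper uses a four-stage pipeline. Stage 1: a Claude Code agent (backed by GLM 5.1) uses tool calls to look up the composable atomic function definitions in the RefAV framework, analyzes what the prompt requires, and builds a function composition to generate code. Stage 2: the generated code is executed on the RefAV training set and its Timestamp Balanced Accuracy is computed; if the score exceeds 0.8 the pair is kept as a few-shot example, otherwise generation is repeated, for at most 5 rounds. Stage 3: another Claude Code session reviews the semantic correctness of the code, identifying and fixing issues such as reversed directions, missing constraints, and unreasonable parameter thresholds. Stage 4: for each prompt-log pair, Qwen3-VL performs binary classification on driving frames to judge whether the event actually occurs; if the VLM confirms it, the code execution result is kept, otherwise an empty prediction is output. Finally, tracks are obtained from the code execution results using the Le3DE2E tracking model.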
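The screening loop in the second stage can be sketched as follows. This is a minimal illustration under assumptions: `balanced_accuracy` uses the standard definition (mean of true-positive rate and true-negative rate) over per-timestamp binary labels, which may differ in detail from the challenge's official metric, and `generate_code` and `run_code` are hypothetical stand-ins for the agent call and the RefAV execution step.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of true-positive rate and true-negative rate over binary labels."""
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return 0.5 * (tpr + tnr)


def screen_prompt(prompt, generate_code, run_code, labels,
                  threshold=0.8, max_rounds=5):
    """Keep the first generated program whose score exceeds the threshold."""
    for attempt in range(1, max_rounds + 1):
        code = generate_code(prompt, attempt)          # hypothetical agent call
        score = balanced_accuracy(labels, run_code(code))
        if score > threshold:
            return {"prompt": prompt, "code": code,
                    "score": score, "attempt": attempt}
    return None  # no few-shot example is kept for this prompt


# Toy stand-ins so the sketch runs end to end.
labels = [1, 1, 0, 0]
fake_outputs = {"draft-1": [1, 0, 1, 0], "draft-2": [1, 1, 0, 0]}
kept = screen_prompt(
    "vehicle turning left",
    generate_code=lambda prompt, attempt: f"draft-{attempt}",
    run_code=lambda code: fake_outputs[code],
    labels=labels,
)
print(kept)
```

In the toy run the first draft scores 0.5 and is rejected, and the second draft scores 1.0 and is kept.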
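Key Results:
- On the EvalAI test leaderboard the team (MICTeam) ranks 11th, with HOTA-Temporal 27.91, HOTA-Track 37.88, Timestamp Balanced Accuracy (TS BA) 69.65, and Log Balanced Accuracy (Log BA) 69.32.
- This outperforms the official baselines RefProg (HOTA-Temp 26.27, TS BA 68.07) and SM-Agent (HOTA-Temp 23.25, TS BA 66.95).
- Iterative training-set screening yields a small number of high-quality prompt-code pairs that, used as few-shot examples, improve code quality on unseen prompts.
- Code review surfaces recurring error patterns: reversed direction in relational functions, missing implicit map constraints, and unreasonable parameter thresholds.
Tech Stack:
- Claude Code (autonomous coding agent)
- GLM 5.1 (underlying language model)
- Qwen3-VL (vision-language model for scene-level binary verification)
- RefAV framework (library of composable atomic functions, including spatial, kinematic, and map functions plus logical combinators)
- Le3DE2E (tracking model used to obtain tracks from code execution results)
- Timestamp Balanced Accuracy as the screening threshold (>0.8)
- Evaluation metrics including HOTA-Temporal, HOTA-Track, and Log Balanced Accuracy
Strengths:
- Uses an autonomous agent that generates code through tool calls and iterative exploration, which is more flexible than a single LLM call.
- Iterative training-set screening combined with few-shot examples improves code quality.
- The two-agent design (generation plus review) protects the semantic fidelity of the code.
- Scene-level VLM verification acts as a post-execution filter that directly lowers false positives and compensates for LLM hallucination.
- The submission beats the official baselines in the challenge.
Limitations:
- The overall rank is 11th, well behind the top teams (e.g., HYULOASIS with HOTA-Temp 38.50), so there is clear room for improvement.
- The pipeline depends on several models (GLM 5.1, Qwen3-VL, Le3DE2E), which raises compute cost and inference latency.
- Screening stops after at most 5 rounds and may miss some high-quality code; the 0.8 threshold has no theoretical justification.
- VLM verification is only a binary classification, may not handle complex spatio-temporal relations, and the VLM itself can misjudge.
- The method depends heavily on the completeness of the RefAV atomic function library and may fail on scenario descriptions the library does not cover.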
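Relevance To Keywords:
- Unify Models: The paper uses two different models, GLM 5.1 and Qwen3-VL, but involves no unified model or model fusion; relevance is low.
- World Models: The paper focuses on scenario mining (localizing objects and time spans from natural language) and does not build or predict with a world model; relevance is low.
- Representation Learning: Not addressed; the method relies mainly on pretrained models and the atomic function library; relevance is low.
- Model-Based RL: No reinforcement learning or model-based planning is involved; relevance is low.
- Native multimodal large models: Qwen3-VL (a multimodal large model) is used for scene verification, but as an application rather than as research on native multimodal models.
- Unified understanding and generation in multimodal large models: Qwen3-VL handles understanding (binary classification) and Claude Code handles code generation, but the two are not unified; relevance is moderate.
- Post-training: No post-training or fine-tuning is involved; pretrained models are used directly; relevance is low.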
摘要翻译
高级科学模拟器提供了专用输入语言,可将模拟目标转化为可执行配置,但学习这些语言往往需要领域科学家花费数小时至数天。我们将模拟器设置视为代理 - 工具接口接地(agent-tool interface grounding)问题:为了使现成的编码代理(off-the-shelf coding agent)操作真实的科学软件,需要多少最小的模拟器特定适配?我们的直觉认为,编码代理已经掌握如何浏览文件、编辑代码、运行命令和修复输出,但它们缺乏模拟器的可执行契约:包括其词汇表、结构约束、验证规则和终止条件。我们引入了 SIGA(Simulator-Interface Grounding Adapter,模拟器 - 接口接地适配器),它通过检索、程序性记忆、轨迹内验证和验证强制终止来提供此契约。我们主要在 GEOS(用于地下科学的开源多物理场模拟器)上评估 SIGA。SIGA 在约五分钟内生成完整的 GEOS 输入文件集(GEOS deck),TreeSim 指标高于 0.90,匹配了一位花费约三小时的扩展预算人类专家,实现了约 36 倍的实际加速比。在更难处理的保留测试集上,接口接地将 TreeSim 从 0.720 提升至 0.789,相对于基础代理实现了约 10% 的相对增益,并将跨种子标准差降低了 16 倍。自我进化进一步改进了 SIGA,通过重写先前轨迹中的适配器内容,获得了最高的保留测试集 GEOS 均值,并匹配或超越了最强的人工设计配置。在 OpenFOAM 和 LAMMPS 上的迁移结果表明,主导机制随接口而异:当结构完整性是瓶颈时,验证最为重要;而当领域正确性是瓶颈时,记忆和检索最为重要。这些结果表明,轻量级且可自我改进的接地层能够将通用编码代理转变为科学软件的实用操作者。
Abstract
Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 4.0/10 | 6.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
Scoring rationale: The paper focuses on text-based scientific simulation interface grounding using coding agents. It aligns moderately with MLLM, Unify Models, World Models, and model-based RL due to the use of LLM-based agents, interface unification, self-evolving memory, and simulator-based planning. However, it lacks visual encoders, multimodal processing, and specific tokenizer discussions, resulting in low scores for those keywords.
Keywords
Scientific Simulation, Coding-Agent Adapters, Interface Grounding, Self-Evolving, Procedural Memory, Tool Interaction, TreeSim
In-Depth Analysis
Chinese Title: SIGA:面向科学模拟的自进化编码智能体适配器
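Summary: The paper studies how to adapt a general-purpose coding agent to the specialized input languages of scientific simulators in order to speed up simulation setup. The authors propose SIGA (Simulator-Interface Grounding Adapter), a lightweight adaptation layer with four components (retrieval, procedural memory, in-trajectory validation, and validation-enforced termination) that supplies the agent with the simulator's vocabulary, structural constraints, validation rules, and termination conditions. On the GEOS multiphysics simulator, SIGA lets the agent produce a complete configuration in about 5 minutes at a quality matching a human expert who took about 3 hours, a roughly 36x speedup. On a harder task set, grounding improves TreeSim by about 10% in relative terms and can reduce the across-seed standard deviation by 16x. A self-evolving variant improves further by rewriting the adapter contents. Transfer to OpenFOAM and LAMMPS shows that the dominant mechanism differs by simulator: validation matters for structural completeness, while memory and retrieval matter for domain correctness. The conclusion is that lightweight, self-improvable grounding layers can turn general coding agents into operators of scientific software.
Innovations:
- Proposes the SIGA adapter, which grounds a general coding agent (such as Claude Code) in a scientific simulator's interface without building an agent loop from scratch.
- Decomposes grounding into four reusable components: retrieval, procedural memory, a callable validator, and an enforced termination hook, giving a simulator-agnostic adaptation recipe.
- Introduces a self-evolution mechanism in which the agent rewrites the adapter contents from prior trajectories, without manual tuning.
- Validates on three simulators (GEOS, OpenFOAM, LAMMPS) and shows that the bottleneck differs by simulator: validation is key for structural completeness, memory and retrieval for domain correctness.
- Achieves a roughly 36x speedup (5 minutes vs. 3 hours) while improving reliability and reducing variance.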
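Methodology: The paper takes a lightweight adaptation approach built on an existing coding agent (Claude Code). SIGA has four components: retrieval (semantic access to documentation and examples), procedural memory (high-frequency vocabulary and patterns kept always visible), a validator (callable by the agent to check and repair its output), and a termination hook (the run may end only after validation passes). The self-evolving variant collects prior trajectories and has the agent reflect on them and rewrite the adapter contents. Experiments run on GEOS, OpenFOAM, and LAMMPS, use the TreeSim metric to score configuration quality, and compare against a human expert, the bare agent, and hand-designed configurations.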
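Validation-enforced termination can be sketched as a loop that refuses to return until a validator reports no errors. The code below is a toy illustration, not SIGA's implementation: the section names in `REQUIRED_SECTIONS`, the `validate` rule, and the `toy_agent` that simply fills in whatever the validator flags are all hypothetical.

```python
REQUIRED_SECTIONS = ("mesh", "solver", "output")  # illustrative, not a real simulator schema


def validate(deck):
    """Return a list of error messages; an empty list means the deck is valid."""
    return [f"missing section: {name}"
            for name in REQUIRED_SECTIONS if name not in deck]


def run_until_valid(propose, validate, max_steps=10):
    """Termination hook: the run may only end once validation passes."""
    feedback = []
    for step in range(1, max_steps + 1):
        deck = propose(feedback)   # agent edits its output using validator feedback
        feedback = validate(deck)
        if not feedback:
            return deck, step
    raise RuntimeError("validation never passed within the step budget")


def toy_agent():
    """Hypothetical agent that adds each section the validator reports missing."""
    deck = {"mesh": "placeholder"}

    def propose(feedback):
        for message in feedback:
            deck[message.split(": ")[1]] = "filled in after validator feedback"
        return dict(deck)

    return propose


deck, steps = run_until_valid(toy_agent(), validate)
print(sorted(deck), steps)
```

The step budget guards against an agent that never converges, so the enforced check cannot turn into an unbounded loop.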
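Key Results:
- On GEOS tasks, the SIGA agent produces a complete configuration in about 5 minutes with TreeSim > 0.90, matching the quality of a human expert who took about 3 hours, a roughly 36x speedup.
- On a harder task set, grounding raises TreeSim from 0.720 to 0.789 (about a 10% relative gain) and reduces the standard deviation by 16x.
- The self-evolving variant achieves the highest mean score on GEOS, matching or exceeding the best hand-designed configuration.
- On OpenFOAM the validation component matters most; on LAMMPS the memory and retrieval components matter most.
- SIGA is effective across the simulators tested, but the dominant mechanism changes with the interface.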
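The headline figures can be checked with simple arithmetic. The sketch below uses only numbers quoted in the abstract and takes "about three hours" and "about five minutes" as exactly 180 and 5 minutes, which is an assumption since both are stated approximately.

```python
expert_minutes = 3 * 60  # "about three hours" for the human expert
agent_minutes = 5        # "about five minutes" for the SIGA agent
speedup = expert_minutes / agent_minutes

bare_treesim = 0.720      # bare agent on the harder held-out set
grounded_treesim = 0.789  # with interface grounding
relative_gain = (grounded_treesim - bare_treesim) / bare_treesim

print(f"speedup ~ {speedup:.0f}x, relative gain ~ {relative_gain:.1%}")
```

The relative gain works out to roughly 9.6%, which the abstract rounds to about 10%.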
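Tech Stack:
- Claude Code (general-purpose coding agent framework)
- GEOS (multiphysics simulator, XML DSL)
- OpenFOAM (computational fluid dynamics simulator)
- LAMMPS (molecular dynamics simulator)
- TreeSim metric (configuration quality)
- Retrieval (semantic search over documentation and examples)
- Procedural memory (high-frequency vocabulary and patterns)
- Validators (XML schema checks, file/dictionary checks, command/reference checks)
- Self-evolution (reflect-and-rewrite paradigm)
Strengths:
- Simple design that builds on an existing coding agent, avoiding reinvention and keeping engineering cost low.
- Grounding components are reusable and easy to port to new simulators.
- Self-evolution tunes the adapter automatically and reduces manual effort.
- Thorough experiments across three different simulators, with analysis of the differing bottlenecks.
- Clear gains in efficiency (36x speedup) and reliability (lower variance).
Limitations:
- Depends on an existing coding agent (Claude Code) and may be limited by its capabilities and availability.
- Self-evolution is validated only on GEOS, not on OpenFOAM or LAMMPS.
- The TreeSim metric may not fully capture the physical correctness of a configuration.
- Only the simulator configuration stage is covered, not the downstream workflow of running simulations and analyzing results.
- Results depend to some extent on the quality of the simulator's documentation and examples.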
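Relevance To Keywords:
- Unify Models: The paper does not directly involve unified models, although the SIGA adapter can be seen as unifying a general coding agent with domain-specific simulators.
- World Models: A scientific simulator is itself an implementation of a world model, and the paper studies how an agent operates it, so the link is indirect.
- Representation Learning: Not addressed, although retrieval and procedural memory can be seen as representations of simulator knowledge.
- Model-Based RL: No reinforcement learning is involved, but the agent's iterative improvement from validation feedback resembles planning and execution in model-based RL.
- Native multimodal large models: The paper uses a language-model agent (Claude Code) and involves no multimodality.
- Unified understanding and generation in multimodal large models: Not directly relevant.
- Post-training: Self-evolution can be viewed as a form of post-training that rewrites the adapter from trajectories.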
摘要翻译
工具学习使大语言模型(LLMs)能够调用外部工具以完成任务。先前的研究已证明了一种分层结构的有效性:高层策略负责全局规划并将任务分解为可管理的子任务,而低层策略则专注于调用工具来解决这些子任务。然而,这些工作通常分别优化高层和低层策略,导致规划器与执行器之间的错位,限制了大语言模型在工具使用任务上的表现。本文提出了一种名为能力对齐分层学习(CAHL)的方法,该方法利用 RLVR 联合优化两种策略,从而实现高层规划器与低层执行器之间更好的对齐。在受限工具使用基准(API-Bank 和 BFCL)以及开放环境(Bamboogle)上的实验证明了 CAHL 的有效性。
Abstract
Tool learning enables LLMs to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchical structure: a high-level policy handles global planning and decomposes tasks into manageable sub-tasks, and a low-level policy focuses on invoking tools to solve these sub-tasks. However, these works typically optimize the high-level and low-level policies separately, leading to planner-executor misalignment and limiting LLM performance on tool-use tasks. In this paper, we propose a method called Capability-Aligned Hierarchical Learning (CAHL), which leverages RLVR to jointly optimize both policies, enabling better alignment between the high-level planner and the low-level executor. Experiments on constrained tool-use benchmarks (API-Bank and BFCL) and an open-ended environment (Bamboogle) demonstrate the effectiveness of CAHL.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 5.0/10 | 7.5 |
Scoring rationale: The paper focuses on hierarchical reinforcement learning for tool-augmented LLMs, aligning planner and executor policies. It shows moderate relevance to Unify Models (policy alignment) and model-based RL (hierarchical planning/RL), but has low relevance to Tokenizer, Visual Encoder, and MultiModal as no such components are mentioned. World Models are tangentially related via planning, and MLLM is relevant due to LLM usage but not explicitly multimodal. Author check: None of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list (Haotong Yang, Ting Long, Yi Chang), so no bonus points applied.
Keywords
Tool-Augmented LLMs, Hierarchical Learning, High-level Policy, Low-level Policy, RLVR, Planner-Executor Alignment, Capability-Aligned
In-Depth Analysis
Chinese Title: 面向工具增强大语言模型的能力对齐分层学习
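Summary: This paper addresses planner-executor misalignment in hierarchical decision frameworks for tool-augmented large language models (LLMs) and proposes CAHL (Capability-Aligned Hierarchical Learning). The method uses reinforcement learning with verifiable rewards (RLVR) to jointly optimize the high-level planning policy and the low-level execution policy, so that the planner can adjust sub-task granularity to the executor's actual capability while the executor better understands the planner's intent. Experiments on constrained benchmarks (API-Bank, BFCL) and an open-ended environment (Bamboogle) show that CAHL outperforms single-level and hierarchical baselines on several metrics and mitigates planning-execution misalignment. The code is open-sourced.
Innovations:
- First to identify and formalize the misalignment between planner and executor in hierarchical tool learning.
- Proposes the CAHL joint optimization framework, which trains the high-level and low-level policies together through RLVR to achieve capability alignment.
- Designs verifiable-reward-based reward functions for both levels, feeding execution feedback into planning optimization.
- Validates the method on both constrained and open-ended tool-use benchmarks, with consistent improvements.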
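Methodology: The method uses a hierarchical decision framework. The high-level planner generates a global plan (an ordered sequence of sub-tasks) from the user query and the tool set, and the low-level executor produces concrete tool calls or text outputs from the plan, the tool specifications, and the feedback history. The two policies are jointly optimized with GRPO (Group Relative Policy Optimization), using a high-level reward (with parameter-level and execution-level terms) and a low-level reward, so that verifiable feedback signals push the planner and executor to adapt to each other.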
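GRPO's core step, scoring each rollout relative to the other rollouts sampled for the same query, can be sketched in a few lines. This shows only the commonly described group-relative advantage normalization; the use of the population standard deviation and the `eps` constant are assumptions (implementations differ), and nothing here reflects CAHL's specific reward design.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Verifiable rewards for four rollouts sampled from the same query.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_rewards)
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed, and a group whose rollouts all receive the same reward contributes zero advantage.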
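Key Results:
- On API-Bank, CAHL clearly improves over hierarchical baselines whose levels are optimized separately.
- On BFCL, CAHL outperforms existing methods on complex multi-step tool-calling tasks.
- In the open-ended Bamboogle environment, CAHL completes queries that require multiple cooperating tools more effectively.
- Ablations indicate that joint optimization and the verifiable reward design are both essential to the gains.
Tech Stack:
- RLVR (Reinforcement Learning with Verifiable Rewards)
- GRPO (Group Relative Policy Optimization)
- MDP (Markov decision process) modeling
- Hierarchical policy (high-level planner plus low-level executor)
- Parameter-level and execution-level reward design
- Semantic similarity evaluation (for aligning natural-language outputs)
Strengths:
- Offers an effective solution to a key misalignment problem in hierarchical tool learning.
- Joint optimization lets planner and executor adapt to each other, improving overall task completion.
- Verifiable rewards provide fine-grained, interpretable optimization signals.
- Generality and robustness are supported by results on several benchmarks.
Limitations:
- The reward functions depend on hand-defined rules and ground truth, which may be hard to extend to fully open-ended settings.
- Joint optimization increases training compute and resource requirements.
- Generalization to changing tool sets or unseen tools needs further validation.
- Experiments use specific models and benchmarks, so broader conclusions need more testing.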
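Relevance To Keywords:
- Reinforcement learning: The core methods, RLVR and GRPO, are reinforcement learning techniques used for post-training.
- Post-training: CAHL is a post-training method that adjusts the model's policies through RL.
- World models: A tool-augmented LLM can be seen as an agent interacting with the world, but the paper does not build a world model.
- Representation learning: Not a focus, although the hierarchical policy can be seen as an implicit representation.
- Native multimodal large models: The paper focuses on tool use and involves no multimodal input or output.
- Unified understanding and generation in multimodal large models: Tool augmentation can assist understanding and generation, but this is not the paper's topic.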
摘要翻译
从第一人称视角视频估计全手抓握压力对于沉浸式虚拟现实(VR)和机器人操作至关重要,但密集触觉传感通常依赖于侵入式硬件。现有的基于视觉的方法主要依赖平面表面或指尖接触,无法泛化到复杂的 3D 物体交互。因此,我们引入了 EgoTactile,这是一个将第一人称视角视频与全手压力监督配对的数据集,针对多样化的日常物体,并包含一个裸手迁移子集,以实现对自然场景的泛化。利用该基准,我们首先建立了 EgoPressureFormer 作为判别式基线。除此之外,为了明确处理部分观测中的不确定性,我们提出了 EgoPressureDiff,这是一个条件扩散框架,它适配了一个大规模预训练的视频扩散骨干。通过结合丰富的世界知识先验与一个 Physically-Informed Feature Rectification 层以注入语义约束,我们的方法能有效推断合理的接触模式并解决视觉 - 物理歧义。广泛的实验表明,我们的方法在该基准上取得了优越的性能,并且在野外场景中具有鲁棒的泛化能力。我们的项目页面位于 https://egotactile.github.io/。
Abstract
Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3D object interactions. Therefore, we introduce EgoTactile, a benchmark pairing egocentric video with full-hand pressure supervision for diverse everyday objects, incorporating a bare-hand transfer subset to enable generalization to natural scenarios. Leveraging this benchmark, we first establish EgoPressureFormer as a discriminative baseline. Beyond this, to explicitly address the uncertainty in partial observations, we propose EgoPressureDiff, a conditional diffusion framework that adapts a large-scale pre-trained video diffusion backbone. By combining rich world knowledge priors with a Physically-Informed Feature Rectification layer to inject semantic constraints, our approach effectively infers plausible contact patterns and resolves visual-physical ambiguities. Extensive experiments demonstrate that our method achieves superior performance on the benchmark and robust transferability to in-the-wild scenarios. Our project page is available at https://egotactile.github.io/.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 6.0/10 | 9.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
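Scoring rationale: The paper's core is estimating tactile pressure from video with a diffusion model. It is highly relevant to MultiModal (video plus pressure, 8), moderately relevant to World Models (world-knowledge priors, 6), and partially relevant to Visual Encoder because it uses a video backbone (5). It does not involve tokenizers, MLLMs, model unification, or reinforcement learning algorithms, so those keywords score 0. The author list contains none of the designated experts such as Yang Shi.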
Keywords
Egocentric Video, Grasp Pressure Estimation, Conditional Diffusion, World Knowledge Priors, Physical Constraints, Tactile Sensing, Video Diffusion Backbone
In-Depth Analysis
Chinese Title: EgoTactile:从第一人称视频学习日常物体的抓取压力
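Summary: The paper poses a new task, predicting full-hand grasp pressure from egocentric video, to move beyond existing methods that are limited to planar surfaces or fingertip contact. The authors build the EgoTactile benchmark, which contains synchronized full-hand pressure data for 63 everyday objects, and add a bare-hand subset to support generalization to sensor-free scenarios. To handle uncertainty under partial observation, the paper offers two models: EgoPressureFormer as a discriminative reference, and the core contribution EgoPressureDiff, a conditional diffusion framework that adapts a pretrained Stable Video Diffusion model and uses its world-knowledge priors to infer plausible contact patterns. In addition, a Physically-Informed Feature Rectification (PIFR) layer injects semantic constraints such as object properties (e.g., weight, stiffness) to resolve visual-physical ambiguity. Experiments show that the method performs best on the benchmark and generalizes robustly to real-world scenes.
Innovations:
- First to pose full-hand grasp pressure prediction for 3D objects from egocentric video, and builds the EgoTactile benchmark with 63 objects and 162 full-hand pressure sensors, including a bare-hand subset for sensor-free scenarios.
- Proposes EgoPressureDiff, a conditional diffusion framework that uses a pretrained video diffusion model (Stable Video Diffusion) as a world-model prior to handle multimodal uncertainty under partial observation.
- Designs the PIFR layer, which injects object properties (weight, material, etc.) into the diffusion model as structured constraints, resolving cases that look identical but differ physically.
- Establishes EgoPressureFormer as a discriminative baseline and implements an invertible conversion between pressure sequences and heatmaps to unify evaluation.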
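Methodology: The work proceeds in two stages. First, the EgoTactile dataset is built by capturing grasps of 63 objects with synchronized egocentric video and a pressure glove (162 sensors), and a bare-hand subset is designed to provide weakly paired supervision. Second, EgoPressureDiff is built on the Stable Video Diffusion (SVD) backbone: RGB video clips serve as the condition, and the diffusion process generates pressure sequences or heatmaps. A PIFR layer is integrated into the diffusion model and uses object properties (weight, material) and subject attributes (age, body fat, etc.) as additional conditions to rectify the feature representation. Training uses a denoising diffusion loss, and inference generates pressure through iterative denoising. EgoPressureFormer, a video Transformer, is built as the comparison baseline.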
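Key Results:
- EgoPressureDiff outperforms EgoPressureFormer and other baselines on the EgoTactile benchmark, with the best results on contact IoU, volumetric IoU, MAE, and other metrics.
- The PIFR layer clearly improves pressure prediction in physically ambiguous cases, such as objects that look similar but differ in weight.
- Experiments on the bare-hand subset show that the model generalizes to sensor-free natural scenes without retraining.
- Qualitative visualizations show that the diffusion model produces plausible contact patterns and infers pressure distributions even in heavily occluded regions.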
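Two of the metrics named above, contact IoU and MAE, can be sketched for a flat list of per-sensor pressures. The definitions here are generic assumptions rather than the benchmark's exact protocol: in particular, the contact threshold of 0.1, the pressure units, and the convention of returning 1.0 when neither map contains contact are illustrative choices.

```python
def contact_iou(pred, target, threshold=0.1):
    """IoU between binary contact maps obtained by thresholding pressures."""
    p = [v > threshold for v in pred]
    t = [v > threshold for v in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    union = sum(1 for a, b in zip(p, t) if a or b)
    return intersection / union if union else 1.0  # no contact anywhere: perfect match


def mean_absolute_error(pred, target):
    """Average absolute pressure difference across sensors."""
    return sum(abs(a - b) for a, b in zip(pred, target)) / len(pred)


predicted = [0.0, 0.5, 0.2, 0.0]  # toy pressures for four sensors
measured = [0.0, 0.4, 0.0, 0.3]
iou = contact_iou(predicted, measured)
mae = mean_absolute_error(predicted, measured)
print(f"contact IoU={iou:.3f}, MAE={mae:.3f}")
```

Contact IoU rewards getting the contact region right regardless of magnitude, while MAE penalizes magnitude errors, so the two are complementary.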
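Tech Stack:
- Stable Video Diffusion (SVD) pretrained video diffusion model
- Conditional diffusion framework
- Physically-Informed Feature Rectification (PIFR) layer
- Video Transformer (EgoPressureFormer)
- Ridge regression (inverse mapping from heatmap to pressure sequence)
- Gaussian diffusion (rendering operator from pressure sequence to heatmap)
- MANO hand model (geometric constraints)
- Pressure glove (162 sensors) data acquisition system
Strengths:
- The task definition is new and fills a gap in predicting full-hand pressure on 3D objects from egocentric video.
- The dataset is large and richly annotated (object properties, subject attributes), supporting multivariate analysis.
- Combining a diffusion model with world priors handles occlusion and uncertainty well and yields plausible, diverse outputs.
- The physical-information module strengthens reasoning about properties that cannot be seen.
- The bare-hand subset makes the method transferable to sensor-free real-world scenes.
Limitations:
- The dataset was collected only in a controlled lab environment with limited object types and interaction patterns, so real-world generalization still needs verification.
- Diffusion inference is slow and hard to fit into real-time VR or robotics applications.
- The pressure sensors cover only the palm and fingers and omit regions such as the sides of the fingertips, so some contact information may be lost.
- Physical properties such as weight must be supplied in advance, which may be difficult in practice.
- The model serves only as a perception module and is not combined with reinforcement learning or post-training.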
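Relevance To Keywords:
- Unify Models: The paper does not directly involve a unified model for multimodal understanding and generation, although the diffusion framework is a generative model with an indirect link to that direction.
- World Models: The paper uses Stable Video Diffusion as a world-model prior and relies on its learned knowledge of physical interaction to infer pressure under occlusion, which is highly relevant.
- Representation Learning: The paper learns pressure representations with a diffusion model and fuses physical properties through the PIFR layer, which falls under representation learning.
- Model-Based RL: No reinforcement learning is involved, but predicted pressure could feed model predictive control (MPC) in robotic manipulation, a potential link.
- Native multimodal large models: The paper uses a pretrained video diffusion model (SVD) as a conditional generation tool and does not build a native multimodal large model.
- Unified understanding and generation in multimodal large models: The paper focuses on generating pressure rather than on understanding, so the link is weak.
- Post-training: No post-training (such as RLHF or SFT) is involved; only standard supervised training is used.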
摘要翻译
语义变化检测(SCD)旨在同时定位土地覆盖变化并识别变化前后的语义类别。然而,现有方法存在跨时间对齐不足、多尺度表征较弱以及对由光照、季节和配准噪声引起的伪变化鲁棒性差的问题。为了解决这些问题,我们提出了一种名为 SemDINO 的新型端到端语义变化检测网络,该网络将双分支编码器、多尺度时间交互、语义净化、变化增强和解耦多任务预测整合到一个统一框架中。具体来说,我们构建了一个双分支编码器,通过门控金字塔融合结合 CNN 骨干和冻结的 DINOv3 特征,从而实现丰富的多尺度语义表征。随后,提出了一种多尺度时间双向变换器交互(M-TBTT)模块,以实现全局跨时间特征对齐和信息交互。为了进一步增强真实变化并抑制伪变化,我们协同引入了语义净化(SCP)、双向变化增强(BiChangeEnhance)和多尺度变化增强(MCE)模块。最后,设计了一个多分支变化检测(CD)预测头,用于联合输出二值变化掩膜、双时相语义图和边缘约束。在公开遥感变化检测(CD)数据集上的广泛实验表明,SemDINO 在性能上优于最先进方法,特别是在具有干扰因素的复杂场景中。
Abstract
Semantic change detection (SCD) aims to simultaneously locate land-cover changes and identify semantic categories before and after transition. However, existing methods suffer from insufficient cross-temporal alignment, weak multi-scale representation, and poor robustness to pseudo-changes caused by illumination, season, and registration noise. To address these issues, we propose a novel end-to-end semantic change detection network named SemDINO, which integrates a dual-branch encoder, multi-scale temporal interaction, semantic purification, change enhancement, and decoupled multi-task prediction into a unified framework. Specifically, we construct a dual-branch encoder that combines a CNN backbone and frozen DINOv3 features via gated pyramid fusion, enabling rich multi-scale semantic representation. Then, a multi-scale temporal bidirectional transformer interaction (M-TBTT) module is proposed to achieve global cross-temporal feature alignment and information interaction. To further enhance genuine changes and suppress pseudo-variations, we introduce semantic purification (SCP), bidirectional change enhancement (BiChangeEnhance), and multi-scale change enhancement (MCE) modules collaboratively. Finally, a multi-branch CD prediction head is designed to jointly output binary change mask, bi-temporal semantic maps, and edge constraint. Extensive experiments on public remote sensing CD datasets demonstrate that SemDINO achieves superior performance and generalization ability against state-of-the-art methods, especially in complex scenarios with interference factors.
Score Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 9.0/10 | 13.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
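Scoring rationale: The paper studies semantic change detection in remote sensing and uses DINOv3 as a visual encoder, so "Visual Encoder" scores high (9). The abstract mentions a "unified framework", so "Unify Models" is moderate (4). Tokenizer, World Models, MLLM, and model-based RL are unrelated to the paper's content (1 each). "MultiModal" applies only through bi-temporal images and scores low (2). The weighted total is 28.5, above the dynamic pass mark of 27.8. The author list contains none of the designated experts, so no bonus is applied.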
Keywords
Semantic Change Detection, DINOv3, Visual Encoder, Cross-Temporal Alignment, Multi-scale Representation, Remote Sensing
In-Depth Analysis
Chinese Title: SemDINO: 基于DINOv3的跨时间语义对齐变化检测网络
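Summary: This paper proposes SemDINO, an end-to-end semantic change detection (SCD) network that targets three weaknesses of existing methods: insufficient cross-temporal semantic alignment, weak multi-scale representation, and poor robustness to pseudo-changes caused by illumination, season, and registration noise. The network uses a dual-branch encoder that combines a CNN backbone with frozen DINOv3 features through gated pyramid fusion (PyFu) to obtain multi-scale semantic representations. Its core module, multi-scale temporal bidirectional transformer interaction (M-TBTT), performs global cross-temporal feature alignment and information exchange. To enhance genuine changes and suppress pseudo-changes, it adds semantic purification (SCP), bidirectional change enhancement (BiChangeEnhance), and multi-scale change enhancement (MCE) modules. Finally, a multi-task prediction head jointly outputs a binary change mask, bi-temporal semantic maps, and an edge constraint. Experiments on public remote sensing change detection datasets show that SemDINO outperforms existing methods in complex scenes and generalizes well.
Innovations:
- Proposes M-TBTT as the core cross-temporal semantic alignment mechanism, providing bidirectional temporal interaction and multi-scale alignment as a shared basis for consistency in unchanged regions and discrimination in changed regions.
- Introduces the DINOv3-guided PyFu design, which injects frozen DINOv3 features into the CNN pyramid through separable adaptation blocks (SepAB) and gated fusion, yielding multi-scale representations with both local detail and pretrained semantic priors.
- Develops an SCD-oriented refinement and prediction pipeline (#FeaCE), including BCE, SCP, and MCE, that enhances genuine changes and suppresses pseudo-changes; ChangeFusion and the CD-Head jointly output the CD map, bi-temporal semantic maps, and an edge map.
- Designs a decoupled multi-task CD-Head so that SemDINO can switch to the standard binary change detection (BCD) task without architectural changes, supporting the generality of its features.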
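Methodology: The network uses a dual-branch encoder: a CNN backbone (e.g., ResNet) with an FPN extracts multi-scale features, and a frozen DINOv3 encoder extracts semantic prior features. A PyFu module (comprising SepAB and GatedFusion) fuses the CNN and DINO features at each scale. The M-TBTT module then performs bidirectional temporal Transformer interaction at multiple scales, with a learnable gate (LG-g) adaptively controlling the alignment strength. After alignment, the #FeaCE pipeline applies bidirectional change enhancement (BCE), semantic purification (SCP), and multi-scale change enhancement (MCE) in sequence to refine the change features. Finally, two parallel ChangeFusion modules fuse the features, and the multi-task CD-Head simultaneously outputs the binary change map, the T1 and T2 semantic segmentation maps, and an auxiliary edge map. Training uses a multi-task loss.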
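The idea of gated fusion between the CNN and DINO branches can be illustrated with a per-channel scalar gate. This is a generic sketch, not the paper's GatedFusion module: the gate's inputs, the weights `w_cnn` and `w_dino`, and the convex-combination form are assumptions, and a real implementation would operate on feature maps with learned convolutions rather than on plain lists.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(cnn_feat, dino_feat, w_cnn=1.0, w_dino=1.0, bias=0.0):
    """Per-channel convex combination of two feature vectors.

    The gate is computed from both inputs (hypothetical scalar weights stand in
    for learned parameters) and decides how much of each branch to keep.
    """
    fused = []
    for c, d in zip(cnn_feat, dino_feat):
        gate = sigmoid(w_cnn * c + w_dino * d + bias)
        fused.append(gate * c + (1.0 - gate) * d)
    return fused


cnn_feat = [0.2, -1.0, 3.0]   # local-detail features from the CNN branch
dino_feat = [1.0, 0.5, -2.0]  # semantic prior features from the frozen branch
fused = gated_fusion(cnn_feat, dino_feat)
print(fused)
```

Because the gate lies in (0, 1), every fused value stays between the two branch values, so neither branch can be amplified beyond its input range.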
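Key Results:
- On public SCD datasets such as HRSCD and SECOND, SemDINO outperforms existing state-of-the-art methods on semantic change detection metrics (e.g., OA, mIoU, F1).
- In complex scenes (illumination change, seasonal change, registration noise), SemDINO is more robust and suppresses pseudo-changes effectively.
- With the decoupled CD-Head, SemDINO is also competitive on binary change detection, supporting its cross-task generalization.
- Ablations confirm the contribution of each module, including M-TBTT, PyFu, and #FeaCE.
Tech Stack:
- DINOv3 (self-supervised vision Transformer)
- CNN backbone (e.g., ResNet) with a feature pyramid network (FPN)
- Transformer (bidirectional temporal interaction)
- Gated fusion (GatedFusion)
- Separable adaptation block (SepAB): Conv1×1 → BN → Depth-wise Conv3×3 → BN → Conv1×1
- Multi-task loss (binary change loss, semantic segmentation loss, edge loss)
- PyTorch deep learning framework
Strengths:
- Brings DINOv3 self-supervised features into SCD and fuses them effectively through PyFu, improving semantic robustness.
- M-TBTT provides bidirectional, multi-scale cross-temporal alignment and addresses temporal-order bias.
- The #FeaCE pipeline systematically suppresses pseudo-changes and enhances genuine changes, improving detection accuracy.
- The decoupled multi-task head makes the model usable for both SCD and BCD.
- Achieves state-of-the-art results on several public datasets, with the clearest advantage in complex scenes.
Limitations:
- Depends on the pretrained DINOv3 model, with high compute requirements and potentially limited inference speed.
- Behavior under extreme occlusion or large-scale geometric deformation is not discussed in detail.
- The method is complex, with many modules, so tuning and training may be laborious.
- Robustness to registration error is improved but not analyzed quantitatively or compared with dedicated denoising methods.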
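Relevance To Keywords: The paper is highly relevant to representation learning, because DINOv3 is a self-supervised representation learning method and SemDINO uses its pretrained features to make semantic change detection more robust. It has no direct connection to world models or model-based RL, although it can be viewed indirectly as understanding changes in the state of the remotely sensed world. Its relevance to native multimodal large models, unified multimodal understanding and generation, and post-training is weak: the paper focuses on visual change detection and involves no multimodal or generative task.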
摘要翻译
前沿的生成模型(如 CycleGAN、Pix2Pix 和扩散模型)在人脸生成任务中展现了卓越的性能。然而,在从颅骨(X 光)域到面部(光学)域进行转换时,由于跨模态结构身份对齐存在不匹配,它们无法有效捕获颅面重建中的跨模态语义信息。为了解决这一问题,我们提出了一种基于扩散的框架 Cranio-Diff,用于从 2D X 光颅骨图像进行跨域颅面重建。该方法通过 ControlNet 整合颅骨条件结构引导与生物特征文本条件,以生成一个在语义和结构上与给定颅骨更对齐的面部。所提出的 Cranio-Diff 方法在颅面数据集上进行了评估,该数据集来源于 120 名受试者的 X 光扫描,包含侧位和正位视图。为了实现可控评估,每张面部图像均在三个年龄组(25 岁、45 岁、65 岁)和三种 BMI 变异(-10%、基线、+10%)上进行合成,共生成 4320 个配对样本。据我们所知,这是目前唯一具有如此规模的 X 光 - 面部数据集。大量实验表明,所提出的方法在生成图像质量和检索任务上均优于现有的近期方法。最后,为了评估所提出方法的性能,我们使用 FID、IS、SSIM、LPIPS、PSNR 和 ArcFace 分数评估了生成图像的质量。此外,检索性能使用 recall@k、mAP@k 和 MRR@k 进行评估。获得的实验结果表明,所提出的方法可作为辅助法医调查的替代工具。
Abstract
The state-of-the-art generative models, such as CycleGAN, Pix2Pix, and diffusion models have demonstrated remarkable performance in the face generation task. However, they fail to effectively capture cross-modality semantic information in craniofacial reconstruction when translating from the skull (x-ray) to the face (optical) domain, due to a mismatch in the alignment of structural identity across modalities. To address this issue, we propose Cranio-Diff, a diffusion-based framework for cross-domain cranio-facial reconstruction from 2D X-ray skull images. The proposed approach integrates skull-conditioned structural guidance through ControlNet with biometric text conditioning to generate a face which is more semantically and structurally aligned with the given skull. The proposed Cranio-diff method is evaluated on skull-face dataset obtained from X-ray scans of 120 subjects in lateral and frontal views. To enable controlled evaluation, each face image is synthesised across three age groups (25, 45, 65) and three BMI variations of -10%, baseline and +10%, yielding 4320 paired samples. To the best of our knowledge, this is the only X-ray-face dataset with this magnitude. Extensive experiments showed that the proposed method outperforms recent existing approaches in both generated image quality and retrieval task. Finally, to evaluate the performance of our proposed method, we have evaluated the quality of the generated image using FID, IS, SSIM, LPIPS, PSNR and ArcFace score. Additionally, retrieval performance is evaluated using recall@k, mAP@k and MRR@k. Obtained experimental results demonstrate that the proposed method can be used as an alternate tool in providing aid in forensic investigations.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 6.0/10 | 9.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on diffusion-based medical image generation. It shows moderate relevance to MultiModal (cross-domain image/text conditioning) and Visual Encoder (ControlNet uses encoders). Unify Models and Tokenizer have minor relevance (latent space discretization, process unification). It is unrelated to World Models, MLLM, and model-based RL. No expert authors from the specified list were found in the author list. Weighted total score: 28.5 (Passes dynamic threshold 27.8).
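The weighted total quoted in the rationale can be reproduced from the table. The sketch below is a minimal illustration: the relevance values, the uniform weight of 1.5, and the 27.8 threshold come from the report, while the function name and the choice of `>=` for the pass check are assumptions.

```python
# Relevance scores (0-10) copied from the scoring table above.
WEIGHT = 1.5           # uniform keyword weight shown in the table
PASS_THRESHOLD = 27.8  # dynamic threshold quoted in the rationale

relevance = {
    "Unify Models": 3.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 6.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 8.0,
    "model-based RL": 0.0,
}

def weighted_total(scores, weight=WEIGHT):
    """Sum of weight * relevance over all keywords (hypothetical helper)."""
    return sum(weight * s for s in scores.values())

total = weighted_total(relevance)
# Assumption: a paper passes when its total is at least the threshold.
passes = total >= PASS_THRESHOLD
print(f"total={total:.1f}, passes={passes}")
```

The same helper applies to every scoring table in this report, since all of them use the same seven keywords and weight.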
关键词
Diffusion-based, Cross-domain, Craniofacial Reconstruction, X-ray Skull, ControlNet, Structural Identity, Image Generation, Forensic Applications
深度分析
Chinese Title: 基于扩散的跨域颅面重建:二维X射线颅骨引导与结构身份约束
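Summary: This paper proposes Cranio-Diff, a diffusion-based framework for cross-domain craniofacial reconstruction that generates a face image from a 2D X-ray skull image. Existing generative models such as CycleGAN and Pix2Pix struggle to align structural identity across modalities in this task. Cranio-Diff injects skull structure as a condition through ControlNet and adds semantic guidance from biometric text prompts (age, gender, pose, BMI), producing faces that are consistent with the skull anatomy and preserve identity. The study uses a paired X-ray skull-face dataset of 120 subjects in frontal and lateral views; synthesizing faces at three ages (25, 45, 65) and three BMI levels (-10%, baseline, +10%) yields 4320 paired samples. Experiments show that the method outperforms existing approaches on image quality (FID, IS, SSIM, LPIPS, PSNR, ArcFace score) and face retrieval (Recall@k, mAP@k, MRR@k), and that it can serve as an aid in forensic identification.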
Innovations:
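- Proposes the Cranio-Diff framework, applying diffusion models for the first time to cross-domain reconstruction from 2D X-ray skulls to faces and addressing the structural-alignment weaknesses of traditional GANs.
- Introduces a multimodal conditioning strategy that combines ControlNet skull-structure guidance with biometric semantic guidance from a text encoder, enabling anatomically consistent and controllable face generation.
- Builds a large-scale paired 2D skull-face dataset with age and BMI variation (4320 samples), filling the gap in public datasets for this field.
- Beyond generation-quality evaluation, verifies identity preservation of the reconstructed faces through a face retrieval task, with quantitative analysis using three face recognition backbones.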
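Methodology: The paper adopts a diffusion-based generative framework with the following core components: 1) a frozen VAE encoder maps the real face into latent space, and noise is added to obtain the latent variable z_t; 2) a trainable ControlNet branch takes the skull image as its condition, extracts structural features through zero-initialized convolutions, and fuses them with the encoder features of the denoising UNet; 3) a frozen text encoder encodes the biometric attributes (age, gender, pose, BMI) into semantic embeddings that are injected into the UNet for conditional generation; 4) during training only ControlNet and the UNet are fine-tuned, while the VAE and the text encoder stay frozen; 5) the loss includes an LPIPS perceptual loss and an ArcFace identity loss to improve visual fidelity and identity consistency.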
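Step 1 of the methodology (adding noise to the VAE latent to obtain z_t) follows the standard DDPM forward process, z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε. The sketch below illustrates only that step in plain Python. The linear beta schedule, the step count, and all function names are assumptions for illustration; the report does not state which schedule Cranio-Diff uses, and the ControlNet, text conditioning, and LPIPS/ArcFace losses are not shown.

```python
import math
import random

def linear_alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 0..t under an assumed linear schedule."""
    alpha_bar = 1.0
    for s in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        alpha_bar *= 1.0 - beta
    return alpha_bar

def add_noise(z0, t, rng):
    """Return (z_t, eps) with z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a = linear_alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return zt, eps

# Toy "latent": a flat list standing in for the VAE encoding of a face.
z0 = [1.0, -1.0, 0.5]
zt, eps = add_noise(z0, t=500, rng=random.Random(0))
```

During training, a denoising network would be asked to predict `eps` from `zt`, the timestep, and the conditions; the noise is returned here so that a loss could be computed against it.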
Key Results:
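- On image quality, Cranio-Diff outperforms baselines such as CycleGAN and Pix2Pix on FID, IS, SSIM, LPIPS, PSNR, and ArcFace score.
- On face retrieval with FaceNet, ArcFace, and VGGFace backbones, Recall@1, mAP@1, MRR@1, and related metrics are significantly higher than those of the compared methods.
- The dataset contains 4320 paired samples (120 subjects × 36 combinations); after data augmentation the training set reaches 38880 pairs.
- Experiments confirm robustness across age and BMI variation, with generated faces structurally highly consistent with the real faces.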
Tech Stack:
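- Stable Diffusion v1.5 (pretrained diffusion model)
- ControlNet (structural conditioning network)
- Realistic Vision v5.11 (fine-tuned face generation model)
- VAE encoder/decoder (frozen)
- CLIP text encoder (frozen)
- UNet (denoising network)
- ArcFace loss (identity preservation)
- LPIPS perceptual loss
- FID, IS, SSIM, PSNR (image quality evaluation)
- FaceNet, ArcFace, VGGFace (face recognition backbones)
- Recall@k, mAP@k, MRR@k (retrieval metrics)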
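The retrieval metrics listed above can be made concrete with a small example. This is a minimal sketch assuming each query (a generated face) has exactly one correct gallery identity; the identifiers and helper names are hypothetical, and the paper's exact evaluation protocol is not given in this report.

```python
def recall_at_k(ranked_ids, true_id, k):
    """1.0 if the true identity appears in the top-k results, else 0.0."""
    return 1.0 if true_id in ranked_ids[:k] else 0.0

def reciprocal_rank_at_k(ranked_ids, true_id, k):
    """1/rank of the true identity within the top-k, or 0.0 if it is absent."""
    for rank, candidate in enumerate(ranked_ids[:k], start=1):
        if candidate == true_id:
            return 1.0 / rank
    return 0.0

def mean_over_queries(metric, queries, k):
    """Average a per-query metric over (ranked_ids, true_id) pairs."""
    return sum(metric(ranked, true_id, k) for ranked, true_id in queries) / len(queries)

# Hypothetical gallery rankings for three generated faces.
queries = [
    (["s07", "s03", "s11"], "s07"),  # correct identity at rank 1
    (["s02", "s05", "s09"], "s05"),  # correct identity at rank 2
    (["s01", "s04", "s08"], "s10"),  # correct identity not retrieved
]

recall_1 = mean_over_queries(recall_at_k, queries, k=1)
recall_3 = mean_over_queries(recall_at_k, queries, k=3)
mrr_3 = mean_over_queries(reciprocal_rank_at_k, queries, k=3)
```

With a single relevant item per query, average precision at k reduces to the reciprocal rank at k, so mAP@k and MRR@k coincide in this simplified setting.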
Strengths:
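- Applies diffusion models to 2D X-ray skull-to-face reconstruction for the first time, addressing the cross-modal structural alignment problem.
- The multimodal conditioning design (structure plus text) gives the generation process both anatomical consistency and semantic controllability.
- Builds a large-scale, multi-variable (age, BMI) craniofacial dataset that provides a benchmark for follow-up research.
- Uses a comprehensive evaluation protocol covering both generation-quality metrics and identity retrieval metrics, which supports the method's practical potential.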
Limitations:
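- Uses only 2D X-ray images, which may lose 3D skull structure and limit reconstruction accuracy.
- The dataset (120 subjects) is still relatively small and comes from a single acquisition environment, so generalization remains to be verified.
- Does not discuss generation speed or real-time performance, which may make it hard to meet the needs of rapid on-site forensic identification.
- The BMI and age values in the text conditions are synthetic variations and may deviate from real population distributions.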
Relevance To Keywords:
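- Native multimodal large models: the paper combines a text encoder with a diffusion model, which is multimodal conditional generation, but it does not involve a unified understanding-and-generation architecture.
- Representation learning: the method learns a cross-modal skull-to-face representation through ControlNet and the identity loss, but it does not explicitly propose a new representation learning method.
- World models: the paper targets a specific domain (craniofacial reconstruction) and does not build a general world model.
- Reinforcement learning: the paper does not use reinforcement learning or post-training strategies.
- Overall relevance is moderate; the main contributions lie in cross-domain generation and forensic applications, with weak ties to the core concepts of native multimodal large models and world models.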
摘要翻译
多模态数据管理已成为数据库社区的核心研究议题,涵盖数据集成、语义查询处理以及数据质量评估。尽管关注度日益增长,该社区仍缺乏融合表格、文本和图像的大规模真实世界数据集。我们提出 ArtiFact,一个包含 651045 条博物馆记录的多模态文化遗产数据集,这些记录收集自大都会艺术博物馆(Metropolitan Museum of Art)、芝加哥艺术学院(Art Institute of Chicago)和荷兰国家博物馆(Rijksmuseum)。我们通过两个下游任务展示了 ArtiFact 的实用性。针对跨模态错误检测,我们构建了一个包含七个错误类别的分类法,并将其注入到 130209 条记录中;结果表明,可靠检测细微的领域特定错误(如材质时代错误和时间偏移)仍是一个开放挑战。针对语义查询处理,我们表明当前系统难以应对涉及文化邻近性、模糊对象类型以及历史依赖术语的查询。我们的研究结果将 ArtiFact 定位为多模态数据管理研究的一个具有挑战性的 Benchmark。
Abstract
Multi-modal data management has emerged as a central research topic in the database community, spanning data integration, semantic query processing, and data quality assessment. Despite this growing interest, the community lacks large-scale, real-world datasets combining tables, text, and images. We present ArtiFact, a multi-modal cultural heritage dataset of 651045 museum records collected from the Metropolitan Museum of Art, the Art Institute of Chicago, and the Rijksmuseum. We demonstrate the utility of ArtiFact through two downstream tasks. For cross-modal error detection, we introduce a curated taxonomy of seven error categories injected into 130209 records and show that reliably detecting subtle domain-specific errors such as material anachronisms and temporal shifts remains an open challenge. For semantic query processing, we show that current systems struggle with queries involving cultural proximity, ambiguous object types, and historically contingent terminology. Our results position ArtiFact as a challenging benchmark for multi-modal data management research.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
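评分理由: The paper's core contribution is ArtiFact, a multi-modal cultural heritage dataset used mainly to benchmark cross-modal error detection and semantic query processing in the database field. Only 'MultiModal' is therefore highly relevant (10/10, weight 1.5). 'MLLM' and 'Visual Encoder' are slightly related because images and multi-modal data are involved, but no model architecture or training details are discussed. 'Unify Models', 'Tokenizer', 'World Models', and 'model-based RL' are essentially unrelated to the paper's content. The author list does not include Yang Shi or the other designated experts.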
关键词
Multi-modal, Cultural Heritage, Dataset, Data Management, Cross-modal error detection, Semantic query processing, Museum records
摘要翻译
计算机使用代理(CUAs)日益在融合了可视化桌面控制、命令行执行、代码编辑、浏览器及外部工具的运行时环境中运作。然而,现有的基准测试通常将这些接口视为可分离的能力进行评估,导致长跨界面编排方面的测试不足。因此,我们提出 WeaveBench,这是一个长跨混合接口基准测试,涵盖 8 个真实工作领域的 114 个任务,基于真实用户请求和公开可验证的工件。每个任务要求代理在单一轨迹内结合 GUI 观察/操作与 CLI/代码操作。我们在部署的 CLI 代理运行时内的真实 Ubuntu 桌面上评估这些任务,并辅以最小化的桌面控制插件。我们还提出一个配套的轨迹感知评判器,用于检查交付物、文件、截图、日志及操作轨迹,同时检测捷径行为,例如伪造的视觉证据或硬编码指标。在前沿模型与运行时配对中,最佳 PassRate 仅达 41.2%,表明该基准测试仍远未饱和。轨迹感知评判器进一步揭示,仅基于结果的评分会显著高估代理性能。总体而言,WeaveBench 揭示了 CUA 评估中的关键差距,并提供了一个有效的测试平台,用于衡量代理是否能够在长跨真实任务中编排 GUI、CLI 和代码操作。
Abstract
Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 4.0/10 | 6.0 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
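评分理由: The paper focuses on building a benchmark for computer-use agents (CUAs) rather than on model architecture design, so its relevance to component-level keywords such as Tokenizer and Visual Encoder is very low. MultiModal is moderately relevant because the tasks involve visual and textual interaction, and MLLM is moderately relevant because the agents are built on foundation models. Unify Models and World Models are slightly relevant through long-horizon execution and interface integration, and model-based RL has low relevance.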
关键词
WeaveBench, Computer-Use Agents, Hybrid Interfaces, Long-Horizon, Benchmark, GUI, CLI
摘要翻译
大规模文档处理需要上下文感知且准确高效的表格提取(TE)。然而,当前方法通常需要数十亿参数、数百步自回归步骤,或代价高昂的 API 推理。为此,我们引入了页面对象表格变换器(Page-Object Table Transformer, POTATR),这是一种轻量级的 2900 万参数图像到图模型,它在表格变换器(Table Transformer, TATR)的基础上进行了扩展,用于实现上下文感知的页面级表格提取(TE)。在 PubTables-v2 单页基准测试中,POTATR 超越了所有被测试的模型(包括前沿多模态大模型 MLLMs),实现了 0.964 的 GriTS_Con 得分,同时运行速度快 130 倍以上,成本约为原来的 1/300。此外,POTATR 的输出具有空间定位性:每个识别元素均包含边界框,以支持视觉验证和几何文本分配。因此,POTATR 在执行统一的页面级表格提取(TE)的同时可与其他模型组合,从而可借助外部光学字符识别(OCR)技术扩展至扫描文档,并通过跨页合并等技术扩展至全文档表格提取(TE)。代码和模型将发布。
Abstract
Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark, including frontier MLLMs, achieving a GriTS_Con of 0.964 while running over 130× faster at roughly 300× lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.
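The "geometric text assignment" that bounding boxes enable can be illustrated with a small sketch: each OCR word is attached to the table cell whose box it overlaps most. This is an assumed, generic procedure, not POTATR's actual post-processing; the box format `(x0, y0, x1, y1)`, cell ids, and function names are hypothetical.

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x0, y0, x1, y1)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def assign_words_to_cells(cells, words):
    """Map each cell id to the words whose box overlaps that cell the most."""
    assigned = {cell_id: [] for cell_id in cells}
    for text, box in words:
        best = max(cells, key=lambda cell_id: overlap_area(cells[cell_id], box))
        # Words that overlap no cell at all are left unassigned.
        if overlap_area(cells[best], box) > 0:
            assigned[best].append(text)
    return assigned

# Hypothetical cell boxes from a table model and word boxes from an external OCR engine.
cells = {"r0c0": (0, 0, 50, 20), "r0c1": (50, 0, 100, 20)}
words = [
    ("Model", (5, 4, 40, 16)),
    ("GriTS", (55, 4, 90, 16)),
    ("stray", (200, 200, 220, 210)),  # lies outside the table
]
result = assign_words_to_cells(cells, words)
```

Because the assignment depends only on geometry, the same step would work with any OCR engine that reports word boxes in page coordinates.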
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 4.0/10 | 6.0 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on lightweight table extraction from documents (Document AI), which has limited overlap with the provided keywords centered on World Models, Reinforcement Learning, and MLLM architecture. It mentions 'unified' page-level TE and compares against MLLMs, but does not utilize Tokenizers, World Models, or RL. Visual Encoder is implicitly present but not a core focus.
关键词
Page-Level Table Extraction, Image-to-Graph, Lightweight Model, Document Processing, Spatially Grounded, Multi-modal, Table Transformer
摘要翻译
训练数据集的保真度和结构多样性从根本上决定了视频生成模型的能力。尽管商业系统在生成电影叙事方面展现出显著能力,但开源模型的进展仍受限于高质量训练数据的稀缺性。为了弥合这一差距,我们引入了 CineDance-1M,这是一个大规模、开放研究的文本到音视频(T2AV)数据集,专门针对多镜头、长篇幅的联合音视频生成而设计。每个视频平均时长 92.8 秒,包含 24.2 个连续镜头,它为音频和视频模态提供了可配置的、结构化的标注。这种卓越的质量是通过一个严格的三阶段数据策展流程实现的:i) 多样化来源与全面清洗,ii) 基于电影理论的叙事解析,以及 iii) 层级双模态描述。为了进行全面评估,我们提出了 CineBench,它包含一个多样化的提示词套件和一个六维的、人类对齐的度量系统,专门针对复杂的叙事音视频评估而定制。此外,我们将 LTX-2.3 适配为 CineDance 模型,该模型展示了卓越的单模态质量,同时具备精确的音视频对齐以及稳健的主体与环境一致性,有效地验证了我们的策展策略以及 CineDance-1M 的高质量。我们期望这项工作能为加速未来多镜头、长篇幅联合音视频生成研究奠定坚实基础。我们的项目页面位于 https://aliothchen.github.io/projects/CineDance/。
Abstract
The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
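评分理由: The paper's main contributions are a multimodal audio-video dataset (CineDance-1M) and an evaluation benchmark (CineBench), making it data-centric work. MultiModal is highly relevant (the core task is T2AV generation); Unify Models is moderately relevant (joint audio-video generation is involved); Tokenizer, Visual Encoder, MLLM, and World Models have low relevance (they are implicit only in the base model LTX-2.3 and are not contributions of this paper); model-based RL is unrelated (no reinforcement learning is involved). Weighted total score: 25.5, below the dynamic pass threshold of 27.8.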
关键词
Text-to-Audio-Video Generation, Multi-shot Long-form, Dataset Curation, Audio-Video Alignment, Cinematic Narrative, CineBench Evaluation, LTX-2.3 Adaptation
摘要翻译
本文介绍了 XInsight Lab 在 IJCAI 2026 第 4 届 MiGA 挑战赛微手势分类赛道中的解决方案,该方案位列第一,并取得了新的最先进结果。我们提出了一种多模态集成框架,该框架将基于 RGB 的自监督模型与先前方案中的监督多流模型相结合。该基于 RGB 的自监督模型通过掩码视频建模在 120K 无标签片段上进行预训练,随后在 iMiGUE 数据集上进行微调。这一简单而有效的 RGB 基线在 iMiGUE 测试集上达到了 69.224% 的 Top-1 准确率,展示了从领域内无标签视频中学习可迁移表示的优势。通过将该模型作为互补分支纳入,最终集成方案达到了 74.419% 的 Top-1 准确率,比之前的最先进水平高出 1.206 个百分点。在 iMiGUE 上的实验结果,包括针对集成策略的消融实验,验证了自监督 RGB 表示学习在微手势识别中的有效性。
Abstract
In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based model with supervised multi-stream models from previous solutions. The self-supervised RGB model is pretrained on 120K unlabeled clips via masked video modeling and then fine-tuned on iMiGUE. This simple yet effective RGB baseline achieves 69.224% top-1 accuracy on the iMiGUE test set, demonstrating the benefit of learning transferable representations from unlabeled in-domain videos. By incorporating this model as a complementary branch, the final ensemble reaches 74.419% top-1 accuracy, surpassing the previous state of the art by 1.206 percentage points. Experimental results on iMiGUE, including ablation studies on the ensemble strategy, validate the effectiveness of self-supervised RGB representation learning for micro-gesture recognition.
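One common way to combine a complementary branch with existing models is late fusion of class probabilities. The sketch below is a minimal illustration with made-up numbers; the abstract does not state the fusion rule or the branch weights, so the weighted average, the weights, and all names are assumptions.

```python
def weighted_ensemble(prob_lists, weights):
    """Weighted average of per-class probabilities from several models."""
    total_weight = sum(weights)
    num_classes = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total_weight
        for i in range(num_classes)
    ]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical softmax outputs over three micro-gesture classes.
rgb_self_supervised = [0.1, 0.6, 0.3]
supervised_multi_stream = [0.5, 0.4, 0.1]

fused = weighted_ensemble([rgb_self_supervised, supervised_multi_stream], [1.0, 1.0])
predicted_class = argmax(fused)

# Arithmetic from the abstract: 74.419% is 1.206 points above the previous best,
# which puts the previous state of the art at 73.213%.
previous_best = 74.419 - 1.206
```

In this toy case the multi-stream model alone would pick class 0, while the fused prediction picks class 1, which shows how a complementary branch can change the final decision.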
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 6.0/10 | 9.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
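评分理由: The paper focuses on micro-gesture recognition in computer vision, using a self-supervised RGB model and ensembling. It is essentially unrelated to MLLM, Tokenizer, World Models, and model-based RL (score 1 each). Visual Encoder has moderate relevance (score 6) because video feature extraction is involved. MultiModal has some relevance (score 5) because the abstract describes an 'ensemble framework' that is multimodal. Unify Models has low relevance (score 2) because ensembling is not the same as model unification. Total score: 25.5, below the dynamic pass threshold of 27.8.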
关键词
Self-supervised Learning, Micro-Gesture Recognition, Ensemble Framework, RGB Model, Masked Video Modeling, iMiGUE Dataset, State-of-the-Art
摘要翻译
人工智能正在迅速推动材料表征的发展,然而,大多数电子显微镜应用仅依赖图像对比度,忽略了塑造成像过程的化学和实验上下文。这种局限性使得缺陷分类本质上模糊不清,因为相似的对比度可能源于不同的材料或成像条件。在此,我们开发了一个上下文感知学习框架,该框架整合了从图像中提取的对比度与描述成分、束能和探测器几何结构的元数据。利用一个系统构建的包含约 5500 万模拟图像块的数据集,该数据集涵盖 96 种掺杂单层过渡金属二硫族化合物 (TMDs) 中的 576 种情况,我们表明,基于上下文变量进行条件化将缺陷分类从一个不适定的纯图像任务转变为一个适定的、基于物理的问题。该框架在模拟数据上实现了超过 98% 的准确率,在实验数据上达到了接近人类的一致性,并且后验熵降低了 94%。通过强调上下文基础而非架构复杂性,该方法将实验图像对比度与潜在的化学和成像条件联系起来,支持基于物理的缺陷指派,并为面向自主材料表征的多模态 AI 模型提供了一条通用路径。
Abstract
Artificial intelligence is rapidly advancing materials characterization, yet most applications in electron microscopy rely solely on image contrast, overlooking the chemical and experimental context that shapes image formation. This limitation makes defect classification inherently ambiguous, as similar contrasts can arise from different materials or imaging conditions. Here we develop a context-aware learning framework that integrates image-derived contrast with metadata describing composition, beam energy, and detector geometry. Using a systematically constructed dataset of ~55 million simulated patches spanning 576 cases across 96 doped monolayer transition-metal dichalcogenides, we show that conditioning on contextual variables transforms defect classification from an ill-posed image-only task into a well-posed, physically grounded problem. The framework achieves over 98% accuracy on simulations and near-human agreement on experimental data, with a 94% reduction in posterior entropy. By emphasizing contextual grounding over architectural complexity, this approach links experimental image contrast to the underlying chemical and imaging conditions, supporting physically grounded defect assignments and a general pathway toward multimodal AI models for autonomous materials characterization.
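The claim that conditioning on context makes an ambiguous classification well-posed can be illustrated with a two-class Bayes example. All probabilities, class names, and host labels below are invented for illustration; they are not taken from the paper's dataset.

```python
import math

def posterior(prior, likelihood):
    """Bayes rule over a discrete set of classes."""
    joint = {c: prior[c] * likelihood[c] for c in prior}
    norm = sum(joint.values())
    return {c: value / norm for c, value in joint.items()}

def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

prior = {"defect_A": 0.5, "defect_B": 0.5}

# Hypothetical likelihood of one observed contrast value under each host material.
likelihood_given_host = {
    "host_1": {"defect_A": 0.9, "defect_B": 0.1},
    "host_2": {"defect_A": 0.1, "defect_B": 0.9},
}
host_prior = {"host_1": 0.5, "host_2": 0.5}

# Image-only: the host is unknown, so it is marginalized out.
likelihood_image_only = {
    c: sum(host_prior[h] * likelihood_given_host[h][c] for h in host_prior)
    for c in prior
}
h_image_only = entropy_bits(posterior(prior, likelihood_image_only))

# Context-aware: metadata states that the host is host_1.
h_with_context = entropy_bits(posterior(prior, likelihood_given_host["host_1"]))
```

Here the same contrast is uninformative without the host label (1 bit of posterior entropy) and informative once the host is known, which mirrors the kind of posterior-entropy reduction the abstract reports.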
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
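评分理由: The paper studies defect classification in atomic-resolution STEM images, centered on context awareness and metadata integration. MultiModal is the most relevant keyword (score 6) because the paper combines images with experimental metadata; Visual Encoder has limited relevance (score 3) through image feature extraction; Unify Models, Tokenizer, World Models, MLLM, and model-based RL have low relevance (scores 1-2) because the paper does not involve unified model architectures, tokenizers, world models, large language models, or reinforcement learning. The author list does not include Yang Shi, Xuanyu Zhu, or the other designated experts. Weighted total score: 24.0, below the dynamic pass threshold of 27.8.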
关键词
Context-Aware Deep Learning, Defect Classification, Atomic-Resolution STEM, Image Contrast, Metadata Integration, Physically Grounded, Materials Characterization
摘要翻译
医疗代理系统日益被期望支持交互式临床决策,而不仅仅局限于静态问答。在此类场景中,有效的代理必须在不断演变的病例中重用先前经验,然而现有的记忆机制往往保留原始历史轨迹,这些轨迹往往冗余、含噪声且难以管理。更重要的是,它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑可靠经验以支持长程临床推理的能力。为填补这一空白,我们提出 SkeMex,一种部署后自我演化框架,该框架通过基于技能的记忆(skill-based memory)改进医疗代理,而无需更新模型权重。SkeMex 将信息丰富的交互轨迹提炼为编码可重用程序性知识的结构化技能,并将它们组织成一个多分支存储库,涵盖通用、任务特定及动作级经验。为确定哪些记忆应被重用和保留,SkeMex 从环境反馈中估计上下文相关的效用(context-dependent utility),并利用该效用指导价值感知检索(value-aware retrieval)及存储库治理。一个闭环的"Read--Write--Assess--Govern"生命周期进一步支持持续演化,通过写入新技能、更新效用、推广有用记忆以及移除有害条目来实现。在多样化临床任务上的实验表明,SkeMex 在离线和在线设置中一贯优于现有的代表性基于记忆的代理。该方法还泛化于不同的模型骨干,并支持可迁移的技能记忆。所有数据和代码将公开发布。
Abstract
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.
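The Read, Assess, and Govern stages can be sketched as a small utility-tracking loop. This is a toy illustration under strong assumptions: the scoring rule (similarity plus a weighted utility), the moving-average update, the pruning threshold, and every name below are hypothetical and are not SkeMex's actual design.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    utility: float = 0.0  # running estimate of usefulness from feedback

def retrieve(skills, similarity, top_k=1, lam=0.5):
    """Read: rank skills by context similarity plus weighted utility."""
    ranked = sorted(
        skills,
        key=lambda s: similarity[s.name] + lam * s.utility,
        reverse=True,
    )
    return ranked[:top_k]

def assess(skill, reward, lr=0.5):
    """Assess: move the utility estimate toward the observed reward."""
    skill.utility += lr * (reward - skill.utility)

def govern(skills, min_utility=-0.3):
    """Govern: drop skills whose utility has fallen below a threshold."""
    return [s for s in skills if s.utility >= min_utility]

skills = [Skill("order_labs_first"), Skill("confirm_allergies")]
similarity = {"order_labs_first": 0.6, "confirm_allergies": 0.5}

first_pick = retrieve(skills, similarity)[0].name
assess(skills[0], reward=-1.0)  # the first skill led to a bad outcome
assess(skills[1], reward=+1.0)  # the second skill helped
second_pick = retrieve(skills, similarity)[0].name
kept = [s.name for s in govern(skills)]
```

After negative feedback, the initially preferred skill is outranked and then pruned, which is the behavior the "value-aware retrieval and repository governance" description suggests.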
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
评分理由: The paper proposes SkeMex, a skill-based memory framework for medical agents, which has low direct relevance to Tokenizer and Visual Encoder as these are not discussed. It shows moderate relevance to MLLM and model-based RL due to the agent context and value feedback, but does not explicitly address World Models or Unifying Models. No expert authors from the specified list were found, so no bonus points were applied. The weighted total score is 24.0, below the dynamic pass score of 27.8.
关键词
Medical Agent Reasoning, Self-Evolving Skill Memory, Post-Deployment Framework, Value-Aware Retrieval, Generalizable Experience, Clinical Decision Making, Skill-Based Memory, Multi-Branch Repository
摘要翻译
基于大型语言模型(LLM)的内容审核系统已成为抵御有害在线内容的关键防线。然而,这些系统主要基于标记化文本运行,在很大程度上忽略了人类在解读内容时自然依赖的视觉线索。我们指出,这种差异造成了根本性的感知错配:人类轻易识别为有害的内容,可能对自动审核系统而言变得不可见。为了研究这一漏洞,我们引入了一类人类可感知对抗攻击(HPAA),即通过视觉上显著的排版操纵,将有害表达嵌入到原本良性的文本中。我们的核心见解在于,排版特征(包括间距、视觉强调和空间排列)可以被策略性地组合,以保留人类对有害内容的识别,同时大幅降低机器的可检测性。该攻击在黑盒设置下运行,仅需极少的查询预算,即可自动生成规避内容,而无需访问模型或获取梯度信息。我们在多个数据集和十个已部署的审核系统上评估了该攻击,其中包括商业 API 和最先进的开源护栏(guardrails)。结果显示,人类与机器感知之间存在显著差距:仅需三次检测器查询,生成的攻击即可实现超过 86% 的人类识别率,同时在所评估的系统中将检测率维持在 1% 以下。我们进一步开展了消融研究,以识别驱动成功规避的排版因素,分析当前审核架构为何无法捕获这些信号,并讨论了可行的防御措施。我们的发现揭示了当今基于 LLM 的审核生态系统中的根本性盲点,并凸显了需要构建能够以更符合人类感知理解的方式来对内容进行推理的审核系统。
Abstract
Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability, we introduce a class of Human-Perceptible Adversarial Attacks (HPAA), in which harmful expressions are embedded into otherwise benign text through visually salient typographic manipulations. Our key insight is that typographic features, including spacing, visual emphasis, and spatial arrangement, can be strategically combined to preserve human recognition of harmful content while substantially reducing machine detectability. Operating in black-box settings with only a small query budget, our attack automatically generates evasive content without requiring model access or gradient information. We evaluate the attack across multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. Results reveal a striking gap between human and machine perception: with only three detector queries, generated attacks achieve over 86% human recognition while maintaining detection rates below 1% across the evaluated systems. We further conduct ablation studies to identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings expose a fundamental blind spot in today's LLM-based moderation ecosystem and highlight the need for moderation systems that reason about content in a manner more consistent with human perceptual understanding.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 7.0/10 | 10.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
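评分理由: The paper centers on adversarial attacks against LLM-based moderation and does not involve model unification, visual encoders, world models, or reinforcement learning. 'Tokenizer' scores highest (7.0) because the attack exploits the gap between text tokenization and visual perception; 'MultiModal' (4.0) and 'MLLM' (3.0) have some relevance because of the visual-text perception gap; the remaining keywords (Unify Models, Visual Encoder, World Models, model-based RL) have very low relevance (0-1.0). No designated experts appear in the author list, so no bonus is applied. Weighted total score: 24.0, below the pass threshold of 27.8.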
关键词
Adversarial Text Attacks, LLM Moderation, Human Perception, Typographic Manipulations, Tokenized Text, Visual Cues, Machine Detectability
摘要翻译
策略内蒸馏(On-Policy Distillation, OPD)已成为大语言模型(Large Language Models, LLMs)后训练中的核心技术,用于将领域专家的知识迁移至学生模型。然而,现有的 OPD 蒸馏方法要求教师模型与学生模型共享相同的分词器,限制了 OPD 在模型系列内的适用性。当前主流做法通常采用监督微调(Supervised Fine-Tuning, SFT)对教师模型生成的响应进行跨分词器蒸馏,但这无法捕捉嵌入在教师模型概率分布中的丰富知识。本文使标准策略内蒸馏方法能够在不同模型家族间运行,通过精确的标记映射算法,确保高保真度的标记级信号能够在不同分词器之间传播。大量实验表明,跨分词器 OPD 在各种基准测试上相比基线方法具有显著更高的计算效率。我们的结果拓展了 OPD 中可用的教师 - 学生模型对的范围,为适配和增强大语言模型之间的交互开辟了新途径。
Abstract
On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD methods require teacher and student models to share the same tokenizer, restricting OPD to pairs within the same model series. Current mainstream practice typically employs Supervised Fine-Tuning (SFT) on teacher-generated responses for cross-tokenizer distillation, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, using a precise token-mapping algorithm to ensure that high-fidelity token-level signals can propagate across different tokenizers. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.
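One generic way to relate two tokenizations of the same text is to group tokens into the smallest chunks whose character spans coincide, then compare log-probabilities chunk by chunk. The sketch below shows that idea only; it is an assumed illustration, not the paper's token-mapping algorithm, and it presumes both token lists are non-empty strings that concatenate to the same text.

```python
def align_chunks(teacher_tokens, student_tokens):
    """Group tokens into minimal chunks that cover the same characters.

    Returns a list of (teacher_indices, student_indices) pairs.
    """
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")
    chunks, i, j = [], 0, 0
    while i < len(teacher_tokens):
        t_end, s_end = i + 1, j + 1
        t_len, s_len = len(teacher_tokens[i]), len(student_tokens[j])
        # Extend whichever side covers fewer characters until the boundaries meet.
        while t_len != s_len:
            if t_len < s_len:
                t_len += len(teacher_tokens[t_end])
                t_end += 1
            else:
                s_len += len(student_tokens[s_end])
                s_end += 1
        chunks.append((list(range(i, t_end)), list(range(j, s_end))))
        i, j = t_end, s_end
    return chunks

def chunk_logprob(token_logprobs, indices):
    """Log-probability of a chunk is the sum of its token log-probabilities."""
    return sum(token_logprobs[k] for k in indices)

# Hypothetical tokenizations of the same string by two tokenizers.
teacher = ["un", "believ", "able", " result"]
student = ["unbeliev", "able", " res", "ult"]
alignment = align_chunks(teacher, student)
```

Chunk-level sums give quantities that are comparable across the two vocabularies, at the cost of coarser granularity wherever the tokenizers disagree.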
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 10.0/10 | 15.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
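评分理由: The paper centers on cross-tokenizer on-policy distillation for LLMs, so 'Tokenizer' is the most relevant keyword (score 10); 'Unify Models' has moderate relevance (score 6) because distillation across model families touches on unification; the remaining keywords, such as Visual Encoder, World Models, MultiModal, and model-based RL, are not mentioned in the abstract and score 0. The author list does not include the designated experts (Yang Shi et al.), so no bonus is applied. Weighted total score: 24.0, below the dynamic pass threshold of 27.8.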
关键词
On-Policy Distillation, Tokenizer Barrier, Cross-tokenizer Distillation, Large Language Models, Token-mapping Algorithm, Model Families, Knowledge Transfer, Post-training
摘要翻译
通过无声语音接口 (SSIs) 进行语音恢复已成为针对喉部语音产生受损或缺失个体的一种有前景的辅助技术。在非侵入式 SSI 模态中,表面肌电图 (sEMG) 和视频唇读提供了互补的发音信息,然而它们在连续语音合成中的整合仍鲜有研究。此外,现有的多模态方法很少涉及对模态退化或临时传感器故障的鲁棒性问题,这限制了它们在真实场景中的适用性。本文提出了一种掩码多模态语音合成框架,该框架通过在训练过程中采用模态掩码策略,联合利用 sEMG 和唇读信号。在多说话人设置下,与最强的单模态基线相比,该方法将词错误率 (WER) 降低了多达 14 个绝对百分点。实验结果表明,掩码策略对于实现这些性能增益以及在低比特率条件下的鲁棒性至关重要;此外,在模态缺失条件下,它们的泛化能力优于针对特定退化的数据增强。音素级分析进一步揭示了各模态之间的互补贡献,其中对元音及特定辅音组的提升尤为显著。总体而言,这些发现证明了掩码多模态集成在无声语音合成中的有效性和鲁棒性,尽管针对喉切除患者的适应仍是一个有待解决的研究挑战。
Abstract
Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
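Modality masking during training can be sketched as randomly blanking one input stream per example so the model learns to cope with a missing sensor. The masking probability, the zero-fill choice, the rule of never dropping both streams, and the function name are assumptions for illustration; the abstract does not specify the masking scheme.

```python
import random

def mask_modalities(emg, lips, rng, p_mask=0.3):
    """Zero out at most one modality for a single training example.

    Returns (emg_out, lips_out, (emg_dropped, lips_dropped)).
    """
    drop_emg = rng.random() < p_mask
    drop_lips = rng.random() < p_mask
    if drop_emg and drop_lips:
        # Assumption: keep at least one stream so the target stays recoverable.
        if rng.random() < 0.5:
            drop_emg = False
        else:
            drop_lips = False
    emg_out = [0.0] * len(emg) if drop_emg else list(emg)
    lips_out = [0.0] * len(lips) if drop_lips else list(lips)
    return emg_out, lips_out, (drop_emg, drop_lips)

rng = random.Random(0)
emg_features = [0.2, -0.7, 1.1]   # toy sEMG feature frame
lip_features = [0.5, 0.9]         # toy lip-video feature frame
outcomes = [mask_modalities(emg_features, lip_features, rng)[2] for _ in range(1000)]
```

Because each stream is sometimes absent during training, a model trained this way sees the same condition it would face under a temporary sensor failure at test time.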
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
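评分理由: The paper studies silent speech synthesis from two modalities, sEMG and lipreading, so 'MultiModal' scores highest (9.0); lipreading involves visual feature extraction, so 'Visual Encoder' has some relevance (5.0). The paper does not involve unified model architectures, tokenizers, world models, large language models, or reinforcement learning, so 'Unify Models', 'Tokenizer', 'World Models', 'MLLM', and 'model-based RL' score low or 0.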
关键词
Silent Speech Synthesis, sEMG, Lipreading, Cross-Modal Masking, Robustness, Multimodal Integration, Speech Restoration
摘要翻译
监督微调(SFT)是下游任务适应的一种高效方法,通常作为强化学习(RL)的初始化阶段,但其泛化能力可能弱于强化学习。其关键限制在于离策略目标:监督微调逐个 token 拟合固定演示数据,其中包括与模型预训练分布对齐较差的目标,这可能导致过拟合。近期的一些工作通过给与当前模型预测分布对齐更好的 token 分配更大的训练权重来解决这一问题,其直觉是拟合这些 token 对模型的预训练知识和表征的扭曲较小。然而,从当前正在微调的模型中计算 token 权重会将 token 权重与优化轨迹耦合,导致一种自我强化动力学,因为分布迅速偏离预训练模型。为了解决这一问题,我们提出 PriFT(先验支持引导的微调),它从冻结的预训练参考模型中推导 token 权重,以获得不受微调影响的稳定重加权信号。该信号估计先验支持程度:即每个目标 token 在预训练分布中得到支持的程度。在多种现有的 token 重加权规则中,将来自当前微调模型的重加权信号替换为预训练模型的重加权信号,性能均得到一致提升。我们提出了两种变体:PriFT-prob 使用预训练 token 概率,而 PriFT-mass 根据预训练分布下的累积概率质量选择 token。在数学推理、代码生成和医学问答上的广泛实验表明,PriFT 在 SFT 基线方法中实现了最优性能,并为后续的强化学习训练提供了更好的初始化。
Abstract
Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model's pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model's predictive distribution, with the intuition that fitting these tokens is less distortive to the model's pretrained knowledge and representations. However, computing the token weights from the model that is currently being fine-tuned entangles token weights with the optimization trajectory, inducing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, replacing the reweighting signal from the online model with one from the pretrained model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.
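The two instantiations can be sketched on toy distributions. This is an interpretation, not the paper's definition: `prob_weights` weights each target token by its probability under the frozen reference, and `mass_mask` keeps a token only if the reference puts less than `top_p` mass on tokens it ranks above the target (a nucleus-style reading of "cumulative probability mass"). The threshold, the normalization, and all names are assumptions.

```python
def prob_weights(ref_dists, targets):
    """PriFT-prob style: weight = reference probability of the target token."""
    return [dist[t] for dist, t in zip(ref_dists, targets)]

def mass_mask(ref_dists, targets, top_p=0.8):
    """PriFT-mass style (assumed reading): keep tokens inside the reference nucleus."""
    mask = []
    for dist, t in zip(ref_dists, targets):
        mass_above = sum(p for p in dist if p > dist[t])
        mask.append(1.0 if mass_above < top_p else 0.0)
    return mask

def weighted_nll(model_logprobs, weights):
    """Token-weighted negative log-likelihood, normalized by the total weight."""
    denom = sum(weights) or 1.0
    return -sum(w * lp for w, lp in zip(weights, model_logprobs)) / denom

# Frozen reference distributions over a 3-token vocabulary at three positions.
ref_dists = [[0.7, 0.2, 0.1], [0.05, 0.9, 0.05], [0.6, 0.3, 0.1]]
targets = [0, 0, 2]  # demonstration tokens; the last two are poorly supported

w_prob = prob_weights(ref_dists, targets)
w_mass = mass_mask(ref_dists, targets)

# Log-probabilities of the targets under the model being fine-tuned (toy values).
model_logprobs = [-0.4, -3.0, -2.0]
loss_mass = weighted_nll(model_logprobs, w_mass)
```

Both weightings are computed once from the frozen reference, so they do not drift as the fine-tuned model changes, which is the stability property the abstract emphasizes.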
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 6.0/10 | 9.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
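评分理由: The paper focuses on supervised fine-tuning in the text domain and has no visual or multimodal components, so Visual Encoder and MultiModal score 0; it does not address world models or unified model architectures, so World Models and Unify Models score low; Tokenizer receives a moderate score because the method reweights at the token level; model-based RL receives a moderate score because the method serves as RL initialization but is not itself a model-based approach.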
关键词
Supervised Fine-Tuning, Token Reweighting, Prior-Support, RL Initialization, Frozen Pretrained Model, Mathematical Reasoning, Code Generation
摘要翻译
自动化二语口语评估(L2 speech assessment)可以分配熟练度标签,但往往缺乏可解释性。我们提出一种基于量表的 SpeechLLM,用于多维度、多粒度评估,该模型采用结合监督微调(supervised fine-tuning)和有界直接偏好优化(Bounded Direct Preference Optimization)的混合目标进行训练。该模型联合预测句子级(准确性、流利度、韵律)的有序标签、词/音素级(word/phoneme-level)准确性,并在同一响应中生成自然语言解释。在 SpeechOcean762 上,我们的方法匹配或优于单粒度模型,同时与先前方法保持竞争力。我们从两个维度分析解释可靠性:与模型预测的自一致性和与真实标签的对齐情况,分别使用情感一致性(合理性)和提及一致性(忠实性)。解释在句子级是合理的,但在词/音素级忠实性下降:参考理由稀疏且与词元级标签(token-level labels)弱对齐。
Abstract
Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts sentence-level ordinal labels (accuracy, fluency, prosody) and word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
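评分理由: The paper applies a speech large language model (SpeechLLM) to second-language (L2) assessment, covering multi-granular assessment and natural-language rationale generation. Its relevance to the given keywords is low: it involves no visual encoder (0 points) or world model (0 points); it is multimodal (speech + text) but not visually multimodal (5 points); it uses DPO rather than model-based reinforcement learning (2 points); and it does not explicitly discuss unified model architectures or tokenizer details (2-3 points). None of the specified experts appear in the author list. The weighted total is 21.0, below the dynamic passing score of 27.8.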
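The scoring tables in this report appear to follow simple arithmetic: each keyword's score is its weight times its relevance, and the rationale compares the sum against the passing score. A minimal Python sketch for the SpeechLLM table above, assuming this reading; how the dynamic passing score of 27.8 is derived is not stated, so it is hard-coded here:

```python
WEIGHT = 1.5           # per-keyword weight shown in the table
PASS_THRESHOLD = 27.8  # dynamic passing score quoted in the rationale

# Relevance values (out of 10) copied from the table for this paper.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 2.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 3.0,
    "MultiModal": 5.0,
    "model-based RL": 2.0,
}


def weighted_scores(relevance, weight=WEIGHT):
    """Per-keyword score = weight * relevance."""
    return {keyword: weight * value for keyword, value in relevance.items()}


scores = weighted_scores(relevance)
total = sum(scores.values())
print(total, total >= PASS_THRESHOLD)  # prints: 21.0 False
```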
关键词
SpeechLLM, L2 Assessment, Multi-granular, Natural-Language Rationales, DPO, SpeechOcean762, Language Learning
摘要翻译
场景图(SGs)通过对物体及其两两关系的建模,提供视觉场景的结构化表示。尽管近期取得了进展,但现有数据集主要关注通用的自然场景,而领域特定和功能导向的场景在很大程度上仍未得到充分探索。这一限制限制了在科学实验场景中对关系推理的评估,从而阻碍了此类场景中智能监控、分析及相关应用的发展。为填补这一空白,我们引入了 PhysScene,这是首个专为物理实验设计的场景图(SG)数据集。PhysScene 涵盖了专用仪器、结构化实验装置以及实验环境固有的功能关系,使得推理能够超越空间共现,延伸至逻辑依赖。PhysScene 并不追求大规模数据,而是专注于实验场景中的强语义约束和高关系密度,这为现有的场景解析算法提出了新的挑战,同时也提供了进一步改进的机会。广泛的分析和实验表明,PhysScene 补充了现有的基准,并为推进科学视觉推理建立了有价值的测试平台。该数据集公开获取于 https://github.com/ZMH-SDUST/PhysScene。
Abstract
Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper introduces a dataset (PhysScene) for physics scene graphs rather than a model architecture. Consequently, keywords related to model components (Tokenizer, Visual Encoder) and paradigms (Unify Models, World Models, model-based RL) have low relevance. 'MultiModal' has moderate relevance due to the integration of visual scenes and structured graph data, while 'MLLM' has slight relevance as the dataset supports visual reasoning tasks. None of the listed expert authors are present in the author list.
关键词
Scene Graph, Physics Experiments, Visual Reasoning, Scientific Dataset, Structured Representations, Object Relationships, Experimental Setup
摘要翻译
检索增强生成(RAG)使智能体能够在推理阶段访问外部知识,但其主要检索碎片化的陈述性证据,导致智能体需反复从段落、手册、示例、日志或轨迹中推断任务流程。这引发了一个基本问题:能否将从外部知识库中提取的技能集成到智能体中,使其能够快速近似领域专长?本文提出 Anything2Skill,一种基于分类法引导的框架,将异构外部知识编译为智能体可复用、可检索且可执行的技能。给定一个知识记录语料库,Anything2Skill 首先将每个记录分解为证据窗口,并在技能树先验下执行计划与扩展技能提取。提取的候选项随后被转换为结构化技能契约,指定调用条件、禁忌条件、动作步骤、工作流步骤、约束、输出规范、支持证据及置信度分数。为了构建可部署的程序性记忆,Anything2Skill 通过基于分类法的编译、注册级别协调、生命周期跟踪、版本化更新和可见的技能树投影,在持久化 SkillBank 中管理提取的技能。在推理阶段,智能体从原始知识库检索特定任务的段落,并从 SkillBank 检索相关的程序性技能,使得 RAG 提供陈述性证据,而编译后的技能提供可复用的程序性指导。在 qsv 和 GitHub-CLI 上的实验表明,Anything2Skill 结合 RAG 分别实现了 98.85% 和 94.10% 的成功率,显著优于仅使用 RAG 的智能体。这些结果表明,将潜在的程序性知识编译为显式技能是将检索增强智能体从知识访问扩展至能力复用的有效途径。
Abstract
Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, Anything2Skill first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85% and 94.10% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
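评分理由: The core of the paper is the Anything2Skill framework, which compiles external knowledge into reusable skills (skill compilation) and mainly covers knowledge extraction, skill contracts, and SkillBank management. Relevance to the keywords is as follows: 'World Models' and 'model-based RL' are somewhat related because the work involves skills and procedural memory (3.0); 'Unify Models' applies to the unification of knowledge types but not of model architectures (2.0); 'MLLM' and 'MultiModal' have low relevance because the tasks are text/CLI based and no multimodal architecture is emphasized (2.0); 'Tokenizer' and 'Visual Encoder' are not mentioned in the paper (1.0). The author list does not include Yang Shi or the other specified experts, so no bonus applies. The weighted total is 21.0, below the dynamic passing score of 27.8, indicating low relevance to the given keyword cluster.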
关键词
Anything2Skill, Skill Compilation, External Knowledge, Reusable Skills, Procedural Memory, SkillBank, Retrieval-Augmented Generation
摘要翻译
声学超材料(AMM)的逆设计因声学色散效应,在实现宽带目标响应方面尤为具有挑战性:一种在某一频率下匹配期望响应的结构,在其他频率下可能会发生偏差;而修改几何形状以优化一个子带,往往会扰动邻近的子带。然而,现有的宽带逆设计方法要么受限于预定义模板,要么依赖于图像表示,而这些表示无法保持声学结构所必需的几何精度和结构连通性。本文提出 MetaSeq,这是一种基于物理引导的、基于序列的生成框架,用于声学超材料的逆设计。其核心在于,MetaSeq 引入了一种表示语言,将每个 AMM 表示为结构化序列,而非像素网格或固定模板。这种表示保留了精确的几何形状,显式编码了连通性,并将逆设计视为从目标响应到结构序列的序列到序列任务。MetaSeq 进一步构建了一个平衡且高保真的数据集,采用了高效的校准和基于复杂度的采样策略。为了解决逆设计的一对多特性,MetaSeq 结合了监督预训练与强化学习微调,并由基于物理的求解器和有效性检查器进行引导。与 COMSOL 及五种基线方法进行的广泛评估表明,MetaSeq 相较于最佳基线方法,将响应误差降低了 45%。
Abstract
Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 6.0/10 | 9.0 |
评分理由: The paper focuses on acoustic metamaterial inverse design using a sequence-based generative framework, which shows moderate relevance to Tokenizer (sequence representation) and model-based RL (RL guided by physics solver). However, it lacks core components of MLLM, Visual Encoders, World Models, or Unify Models as defined in multimodal AI research, resulting in low scores for those keywords. The total weighted score is 21.0, below the dynamic passing threshold of 27.8, indicating a mismatch between the paper's domain (physics/engineering) and the keyword focus (LLM/MLLM).
关键词
Acoustic Metamaterial, Inverse Design, Sequence-Based, Generative Framework, Physics-Guided, Reinforcement Learning, Structure Representation
摘要翻译
从 3D 仿真场景构建知识图谱对于机器人任务推理至关重要,但关键瓶颈在于将场景对象接地到形式本体类,这仍依赖于人工编纂的字典,这些字典脆弱且无法跨资产泛化。我们探究大语言模型(LLMs)能否作为零样本、无需训练的替代方案,自动完成通用场景描述(USD)场景中的这一步接地操作。在包含 125 个对象的厨房场景(使用 SOMA-HOME 本体)上,LLMs 在使用描述性名称时达到 90-96% 的精确匹配准确率,在使用缩写名称时为 49-89%,显著优于字典和嵌入基线。在完全不可读名称的情况下,上下文增强提示可恢复至高达 48% 的准确率。特征消融表明,LLMs 主要利用场景图中的语义线索(兄弟节点名称和父路径);匿名化这些线索会使准确率降至 0-6%,而仅使用几何信息只能得到 4-17% 的准确率。
Abstract
Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on zero-shot ontology grounding for 3D scenes using LLMs, which has low alignment with Unify Models, Tokenizer, Visual Encoder, World Models, and model-based RL. It shows moderate relevance to MultiModal (3D + text) and MLLM (LLM usage). No expert authors from the specified list are present.
关键词
Knowledge Graphs, Zero-Shot Ontology Grounding, LLMs, USD Scenes, Semantic Cues, Scene Graph, Robot Task Reasoning
摘要翻译
帕金森病 (PD) 是一种进行性神经退行性疾病,常引发与运动过少性构音障碍相关的言语障碍。由于言语产生依赖于复杂神经肌肉机制的精确协调,语音分析已成为一种有前景的、无创且具有成本效益的早期帕金森病检测生物标志物。近期的深度学习方法已显示出令人鼓舞的结果;然而,大多数现有方法依赖于单一的语音表示,可能忽略了编码在不同特征空间中的互补病理信息。本文提出了一种多分支深度学习框架,用于从语音中自动检测帕金森病 (PD)。每个录音被分割为 5 秒长的片段,并使用三种互补模态进行表示:对梅尔频谱图 (Log-Mel spectrograms)、梅尔频率倒谱系数 (MFCCs) 以及从原始波形中提取的 HuBERT 嵌入向量。频谱图使用预训练的 ResNet-18 编码器进行处理,MFCCs 序列通过 BiLSTM 网络进行建模,而原始语音则使用预训练的 HuBERT 模型进行编码。为了有效整合这些异构表示,我们引入了一种上下文引导的跨模态注意力机制,该机制根据从频谱图和 MFCCs 分支导出的全局声学上下文,动态地对时序 HuBERT 嵌入向量进行加权。在公开可用的西班牙 PC-GITA 语料库上进行的实验,并在严格的说话人无关 5 折交叉验证下,证明了所提出方法的有效性。所提出的架构实现了 91.51% 的准确率、91.24% 的 F1-score 以及 95.97% 的 AUC 值。此外,消融实验证实了所提出的上下文引导的跨模态注意力机制以及互补语音表示整合均对性能有贡献。这些发现突显了异质语音建模在实现稳健且临床可靠的帕金森病检测方面的潜力。
Abstract
Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
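评分理由: The paper focuses on speech-based detection of Parkinson's disease, using a multi-branch framework that fuses several speech features. 'MultiModal' is highly relevant (three modalities are fused), and 'Unify Models' and 'Visual Encoder' are somewhat related (feature unification and the use of ResNet). The remaining keywords (Tokenizer, World Models, MLLM, model-based RL) are essentially unrelated to the paper's content (medical speech classification, with no large models or reinforcement learning). None of the specified experts appear in the author list.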
关键词
Parkinson's Disease Detection, Speech Representation Learning, Multi-View, Context-guided Cross-modal Attention, Deep Learning Framework, Log-Mel spectrograms, HuBERT embeddings
摘要翻译
智能体强化学习(RL)已成为一种重要的后训练范式,用于将大语言模型(LLMs)从静态聊天机器人转变为交互智能体,催生了如 OpenClaw 等代表性应用。现有工作主要关注策略优化算法和训练框架,但较少关注智能体 - 环境交互的完整数据生命周期,即从数据生产到训练消耗的全过程。为了弥合这一差距,我们提出了 Claw-R1,一个用于智能体强化学习的交互式步骤级数据中间件系统。Claw-R1 通过两个核心组件——Gateway Server(网关服务器)和 Data Pool(数据池),连接异构智能体运行时与 RL 训练后端。Gateway Server 通过统一的 LLM API 入口捕获多轮交互步骤,而 Data Pool 将它们组织成步骤级记录,这些记录包含提示 ID、响应 ID、奖励及其他元数据。在演示中,用户可以交互式地查看实时轨迹,检查每个步骤的状态、动作和奖励,根据质量和就绪程度整理数据,并为不同的下游 RL 算法配置训练就绪批次。总体而言,Claw-R1 将智能体交互轨迹视为管理的数据资产,而非临时运行时日志。通过此次演示,我们希望鼓励社区认识到数据管理在智能体强化学习中的重要性。我们的代码开源地址为 https://github.com/AgentR1/Claw-R1,演示视频见 https://youtu.be/Pw47dAOw6B0。
Abstract
Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards, and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1, and the demonstration video is at https://youtu.be/Pw47dAOw6B0.
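To make the step-level record idea concrete, here is a minimal Python sketch of what such a record and pool might look like. It is not the Claw-R1 API: the class names, the `trajectory_id` and `step_index` fields, the `ready` metadata flag, and the reading of prompt/response IDs as token-ID lists are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StepRecord:
    trajectory_id: str       # hypothetical key tying steps to one episode
    step_index: int          # hypothetical position of the step in the episode
    prompt_ids: List[int]    # assumed to be prompt token IDs
    response_ids: List[int]  # assumed to be response token IDs
    reward: float
    metadata: Dict[str, Any] = field(default_factory=dict)


class DataPool:
    """Toy in-memory pool that stores steps and assembles training batches."""

    def __init__(self) -> None:
        self._records: List[StepRecord] = []

    def add(self, record: StepRecord) -> None:
        self._records.append(record)

    def training_batch(self, min_reward: float = 0.0) -> Dict[str, List[StepRecord]]:
        """Group ready, sufficiently rewarded steps by trajectory, in step order."""
        batch: Dict[str, List[StepRecord]] = defaultdict(list)
        for rec in self._records:
            if rec.metadata.get("ready", False) and rec.reward >= min_reward:
                batch[rec.trajectory_id].append(rec)
        for steps in batch.values():
            steps.sort(key=lambda r: r.step_index)
        return dict(batch)
```

Storing one record per step, rather than one blob per episode, is what lets curation filters such as the reward and readiness checks above apply to individual steps.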
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on data infrastructure for Agentic RL rather than model architecture. 'Unify Models' and 'MLLM' have moderate relevance due to unified LLM API usage and LLM-based agents, but 'Tokenizer', 'Visual Encoder', and 'MultiModal' are largely irrelevant as the paper does not discuss multimodal components or tokenization details. 'World Models' and 'model-based RL' have weak relevance as the paper is about data management for RL rather than learning dynamics models or specific RL algorithms. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.
关键词
Agentic Reinforcement Learning, Data Middleware, Step-Level Records, Gateway Server, Data Pool, Interaction Traces, LLM Agents
摘要翻译
离线强化学习(RL)提供了一种仅利用日志数据即可改进策略的途径,其中历史回报或其他可测量结果被用作世界反馈(world feedback)。关键难点在于如何在不过度外推离线数据支持范围的前提下改进观测行为。本文提出“反事实传输流”(counterfactual transport flows),这是一种基于源条件、受世界反馈引导的离线决策轨迹精炼框架。针对低反馈候选轨迹,我们通过检索潜在轨迹空间中具有更高任务特定反馈的邻近轨迹,从离线数据构建局部偏好对,并将其用作保守精炼的弱监督。该框架学习实例特定的精炼方向:在推理阶段,精炼强度参数控制候选轨迹被传输的距离,从而在保留原始行为与施加更强改进之间实现权衡。在 D4RL 基准测试(包括 AntMaze 和 MuJoCo 任务)上的实验表明,我们的方法能够利用历史回报作为世界反馈来改进行为,同时提供可解释的轨迹级精炼路径。
Abstract
Offline reinforcement learning (RL) offers a path to policy improvement from logged data alone, using historical returns or other measurable outcomes as world feedback. A key difficulty is improving observed behavior without extrapolating beyond what the offline data supports. We propose counterfactual transport flows, a source-conditioned trajectory refinement framework for offline decision-making guided by world feedback. Given a low-feedback candidate trajectory, we construct local preference pairs from offline data by retrieving nearby trajectories in latent trajectory space with higher task-specific feedback, and use them as weak supervision for conservative refinement. The framework learns instance-specific refinement directions: at inference time, a refinement strength parameter controls how far the candidate trajectory is transported, enabling a trade-off between preserving the original behavior and applying stronger improvement. Experiments on D4RL benchmarks, including AntMaze and MuJoCo tasks, show that our method improves behavior from historical returns as world feedback, while providing interpretable trajectory-level refinement paths.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 5.0/10 | 7.5 |
评分理由: The paper focuses on Offline Reinforcement Learning and trajectory refinement, offering domain-level relevance to 'model-based RL' (both are RL subfields) and weak lexical overlap with 'World Models' (via 'world feedback'). It does not address Multimodal architectures, Tokenizers, Visual Encoders, MLLMs, or Model Unification, hence low scores for those. No specified expert authors are found.
关键词
Offline reinforcement learning, Trajectory refinement, Counterfactual transport flows, World feedback, Latent trajectory space, D4RL benchmarks, Conservative refinement
摘要翻译
多语言自动语音识别(ASR)模型如 Whisper 在高资源语言上表现良好,但在达罗毗荼语系(Dravidian languages)上的词错误率(WER)显著高于印度-雅利安语系(Indo-Aryan languages)。通过语言学和数据集分析,我们发现达罗毗荼语系单词更长,词汇多样性更高,重复率更低,导致词元(token)分布稀疏,并频繁出现字符级替换错误。基线微调进一步揭示了解码器中自注意力(self-attention,语言上下文)与交叉注意力(cross-attention,声学线索)之间的不平衡。尽管合成 token 重复实验表明存在潜在收益,但它们并不切实际。基于这些观察,我们引入了两种解码器级增强方法:Weighted-Attention,用于自适应平衡注意力源;以及 Self-Conditioning,用于重新注入中间预测以提高词元一致性。实验证明,在低资源语言及黏着语上,该方法能一致地降低词错误率(WER)。
Abstract
Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 4.0/10 | 6.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
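评分理由: The paper focuses on improving Whisper-based ASR for Dravidian languages, addressing decoder attention balance and token consistency. 'Visual Encoder', 'World Models', and 'model-based RL' are unrelated to the speech recognition task and score 0. 'MultiModal' and 'Tokenizer' are somewhat relevant and score 5 and 4, respectively. 'Unify Models' and 'MLLM' have low relevance and score 2 and 3, respectively. None of the specified experts appear in the author list, so no bonus applies. The weighted total is 21.0, below the dynamic passing score of 27.8.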
关键词
Whisper, Dravidian Languages, Decoder Inconsistencies, Weighted-Attention, Self-Conditioning, Low-Resource Languages, Word Error Rates
摘要翻译
社交机器人不仅需要能与以语音为中心的系统所预设的用户进行稳健交互,还应能与依赖不同模态(如手语)进行通信的多样化用户进行交互。其中一个重要的能力差距在于与手语用户进行预测性话轮转换(predictive turn-taking)。尽管语音活动预测(Voice Activity Projection, VAP)已成功用于建模口语交互中的未来语音活动,但该框架是否适用于手语交互尚不明确。本文提出了一项初步的迁移研究,旨在将 VAP 架构适配于双人(dyadic)手语交互。利用公共 DGS 语料库(Public DGS Corpus)中的交互记录,我们从词汇手语标注中提取二元手语活动流,并构建了用于话轮转换预测的代理任务(proxy tasks)。该模型使用为每位手语者提取的基于姿态的手部、眼部区域及嘴部区域特征。结果表明,SHIFT/HOLD(转换/保持)预测前景良好,尤其是利用手部线索时,而 SHIFT(转换)预测仍然具有挑战性。这些发现为将预测性话轮转换模型从口语交互转移到手语交互所具有的前景及当前局限性提供了初步证据。手语交互的预测建模仍需超越语音衍生类别的手语特定事件定义。
Abstract
Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been successfully used to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial transfer study of adapting a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.
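The step of deriving binary signing activity streams from lexical sign annotations can be pictured as rasterizing annotated time intervals onto a frame grid. The Python sketch below assumes annotations arrive as (start, end) pairs in seconds and uses a hypothetical frame rate; neither detail is stated in the abstract.

```python
import math
from typing import List, Tuple


def activity_stream(annotations: List[Tuple[float, float]],
                    duration_s: float,
                    frame_rate: float = 50.0) -> List[int]:
    """Return a 0/1 list with one entry per frame: 1 where any annotated
    sign interval overlaps the frame, 0 elsewhere.

    `frame_rate` is an illustrative default, not a value from the paper.
    """
    n_frames = int(round(duration_s * frame_rate))
    stream = [0] * n_frames
    for start_s, end_s in annotations:
        first = max(0, math.floor(start_s * frame_rate))
        last = min(n_frames, math.ceil(end_s * frame_rate))
        for frame in range(first, last):
            stream[frame] = 1
    return stream


# Two signs in a 2.5-second clip, sampled at 4 frames per second.
print(activity_stream([(0.5, 1.0), (1.5, 2.0)], duration_s=2.5, frame_rate=4))
# prints: [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
```

One such stream per signer gives the two-channel binary input that a VAP-style model would consume in place of voice activity.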
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
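评分理由: The paper focuses on sign language interaction and turn-taking prediction from pose-derived features, which places it in the robot interaction domain. It does not involve MLLMs, tokenizers, or unified model architectures. Although it uses visual feature extraction and predictive modeling (weakly related to World Models and RL), its core methodology is specific to robot interaction and does not match the provided keyword themes of large-scale model unification or representation learning.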
关键词
Sign Language Interaction, Predictive Turn-taking, Activity Projection, Pose-derived Features, Social Robots, VAP Architecture, SHIFT/HOLD Prediction
摘要翻译
交通事故预判——即在行车记录仪视频的每一帧中预测即将发生碰撞的可能性——是安全关键的,却难以规模化,因为在每个部署场景中收集领域内标注的事故视频片段成本过高。我们在零样本设置下研究此任务,其中没有目标域训练数据可用:模型必须仅从公开可用的二值标注驾驶事故数据集学习,并泛化到未见过的行车记录仪视频。我们提出一个框架,通过结合 VideoMAE-v2 骨干网络和滑动窗口协议下的帧级预测头,弥合了帧级时序风险评估任务与粗粒度标注的二值事故数据集之间的差距。我们的方法在 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation 竞赛中获得第二名。代码可在 https://github.com/TimeSouth/zero-shot-taa-solution 获取。
Abstract
Traffic accident anticipation -- predicting the likelihood of an imminent collision at every frame of a dashcam video -- is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task under a zero-shot setting where no target-domain training data is available: the model must learn exclusively from a publicly available binary-labelled driving-accident dataset and generalise to unseen dashcam footage. We propose a framework that bridges the gap between the frame-level temporal risk estimation task and coarsely labelled binary accident datasets by coupling a VideoMAE-v2 backbone with a per-frame prediction head under a sliding-window protocol. Our method achieves 2nd place in the 2026 CVPR@AUTOPILOT Zero-Shot Accident Anticipation competition. Code is available at https://github.com/TimeSouth/zero-shot-taa-solution.
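A sliding-window protocol with a per-frame head can be sketched as follows: the clip is cut into overlapping windows, the model scores every frame inside each window, and overlapping scores are merged into one risk value per frame. In this Python sketch the window and stride values, the `predict_window` callback, and the choice to average overlaps are all assumptions; the abstract does not specify them.

```python
from typing import Callable, List


def window_starts(n_frames: int, window: int, stride: int) -> List[int]:
    """Start indices of windows covering the clip; a final window is added
    flush with the end if the stride would otherwise leave frames uncovered."""
    if n_frames <= window:
        return [0]
    starts = list(range(0, n_frames - window + 1, stride))
    if starts[-1] != n_frames - window:
        starts.append(n_frames - window)
    return starts


def per_frame_risk(n_frames: int, window: int, stride: int,
                   predict_window: Callable[[int, int], List[float]]) -> List[float]:
    """Average per-frame scores over all windows that contain each frame.

    predict_window(start, length) stands in for the backbone plus per-frame
    head and must return `length` scores.
    """
    sums = [0.0] * n_frames
    counts = [0] * n_frames
    for start in window_starts(n_frames, window, stride):
        length = min(window, n_frames - start)
        for offset, score in enumerate(predict_window(start, length)):
            sums[start + offset] += score
            counts[start + offset] += 1
    return [total / count for total, count in zip(sums, counts)]
```

A stand-in such as `lambda start, length: [0.5] * length` is enough to exercise the aggregation logic before a real model is plugged in.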
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
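评分理由: The paper focuses on video understanding and traffic accident anticipation, and its core component is the VideoMAE-v2 visual encoder, so it is highly relevant to Visual Encoder. The remaining keywords concern multimodal large models, reinforcement learning, and unified model architectures, which have no direct connection to this paper's purely visual zero-shot prediction task. Tokenizer is only marginally involved, at the level of video patching.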
关键词
VideoMAE-v2, Zero-Shot, Traffic Accident Anticipation, Dashcam Video, Prediction Head, Sliding-Window, CVPR Competition
摘要翻译
根治性前列腺切除术后生化复发(BCR)是前列腺癌的关键终点,然而风险分层几乎完全依赖于以格里森分级为主导的变量。苏木精 - 伊红全切片图像(WSIs)是否携带超越分级的预后信号,以及多实例学习(MIL)能否恢复这一信号,仍悬而未决。一个关键障碍在于,许多流程在评估折上选择模型检查点,人为地夸大了模型的一致性(C-index)。我们在 TCGA-PRAD 数据集(487 名患者,101 例 BCR 事件)上构建了一个严格的基准,采用严格的跨折评分,基于五折交叉验证并重复五个随机种子。MIL 聚合器(ABMIL、CLAM、TransMIL、PatchGCN)的选择影响较小(使用 UNI2-h 时 C-index 为 0.61-0.64),而特征提取器则是主导因素(ResNet50 为 0.566,而病理基础模型高达 0.639)。仅基于分级、分期和年龄的临床 Cox 模型达到 0.687;没有任何仅基于影像的模型显著优于该临床模型(p > 0.10)。我们提出了一种分级解耦多实例学习(GD-MIL)方法,这是一种门控注意力 MIL 编码器,采用梯度反转分级对抗器进行训练,旨在使切片表示在与临床变量进行晚期融合前对格里森分级具有不变性。GD-MIL 达到 C-index 0.704,显著优于临床基线(ΔC = +0.029, p = 0.0005)和最佳仅影像模型(ΔC = +0.062, p = 0.039),表明 H&E 形态包含与分级互补的预后信息。基于中位风险划分,无生化复发生存率的对数秩检验 p 值 < 0.0001,显示出显著分离(五年时约为 20% vs 70%)。
Abstract
Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk stratification relies almost entirely on variables dominated by Gleason grade. Whether H&E whole slide images (WSIs) carry prognostic signal beyond grade, and whether multiple instance learning (MIL) can recover it, remains unsettled. A key obstacle is that many pipelines select model checkpoints on the evaluation fold, artificially inflating concordance. We construct a rigorous benchmark on TCGA-PRAD (487 patients, 101 BCR events) using strict out-of-fold scoring over five-fold cross-validation repeated across five seeds. The choice of MIL aggregator (ABMIL, CLAM, TransMIL, PatchGCN) has little effect (C-index 0.61-0.64 with UNI2-h), while the feature extractor is the dominant factor (ResNet50 0.566 versus pathology foundation models up to 0.639). A clinical Cox model on grade, stage, and age reaches 0.687; no imaging-only model significantly outperforms it (p > 0.10). We introduce Grade-Disentangled MIL (GD-MIL), a gated-attention MIL encoder trained with a gradient-reversal grade adversary that encourages the slide representation to be invariant to Gleason grade before late fusion with clinical variables. GD-MIL achieves C-index 0.704, significantly outperforming both the clinical baseline (ΔC = +0.029, p = 0.0005) and the best imaging-only model (ΔC = +0.062, p = 0.039), suggesting H&E morphology contains prognostic information complementary to grade. A median risk split yields log-rank p < 0.0001 separation in BCR-free survival (~20% vs ~70% at five years).
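All headline numbers in this abstract are concordance indices, so a small reference implementation may help when reading them. The Python sketch below computes Harrell's C-index in its common form; the paper's exact variant (for example its handling of tied times or tied risks) is not stated, so the tie rules here are assumptions.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell's C: among comparable pairs, the fraction in which the patient
    who recurs earlier was assigned the higher predicted risk.

    A pair (i, j) is comparable when i had an observed event and
    times[i] < times[j]; tied risks count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Four patients; the third is censored at t=6.
print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1]))
# prints: 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported gap between 0.687 and 0.704 in context.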
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 6.0/10 | 9.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on medical image analysis using Multiple Instance Learning (MIL) for prostate cancer prediction. It utilizes visual encoders (ResNet, UNI) as feature extractors and combines multimodal data (images + clinical variables), resulting in moderate-to-high scores for 'Visual Encoder' (6.0) and 'MultiModal' (8.0). However, it does not involve Unify Models, Tokenizers, World Models, MLLMs, or Reinforcement Learning, hence 0 scores for those. No expert authors from the specified list are found.
关键词
Grade-Disentangled, Multiple Instance Learning, Prostate Cancer, Biochemical Recurrence, Whole Slide Images, Multimodal, Feature Extractor
摘要翻译
具可验证奖励的强化学习(RLVR)已成为通过基于结果的监督提升大语言模型推理能力的主导范式。然而,可验证奖励在组级经常变得缺乏信息:当给定提示的所有采样轨迹获得相同奖励时,组相对优势估计无法提供梯度信号,尽管这些轨迹在推理质量上可能存在显著差异。我们提出 Reasoning Arena,一种自适应训练框架,该框架将此类缺乏多样性的奖励组路由至评判系统,而非直接丢弃。除了考察最终答案外,Reasoning Arena 构建轨迹锦标赛,在此过程中推理轨迹被一对一比较,以揭示组内更细粒度的偏好,从而将推理质量转化为丰富的相对奖励信号。为了使奖励估计高效,而非穷举比较每一对,每个新轨迹会与一个小型、动态更新的先前生成轨迹池(作为锚点)进行评估,以高效建立相对排名。随后,我们在不完整比较图上拟合 Bradley-Terry 模型,从而实现可扩展的强化学习集成,而无需进行二次成对比较。实验结果表明,Reasoning Arena 在竞赛数学和编码基准上一贯优于 RLVR 基线,平均领先 7.6%。通过将原本被浪费的零优势样本转化为有用的梯度更新,我们的方法将训练加速 27% 至 41%,节省近 50% 的生成计算量,并显著提升了整体推理性能。
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
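Fitting a Bradley-Terry model on an incomplete comparison graph can be done with the classic minorization-maximization (MM) update, which only needs the pairs that were actually compared. The Python sketch below is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, and the small floor on win counts is added here only to keep the update defined for a trace that never wins.

```python
from collections import defaultdict
from typing import List, Tuple


def fit_bradley_terry(comparisons: List[Tuple[int, int]], n_items: int,
                      iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths with the MM update.

    `comparisons` holds (winner, loser) index pairs; pairs that were never
    compared simply do not appear, so the graph may be incomplete.
    """
    wins = [0.0] * n_items
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            share = n / (strengths[i] + strengths[j])
            denom[i] += share
            denom[j] += share
        # Floor the win count so a trace with no wins keeps a tiny positive strength.
        updated = [max(wins[i], 1e-9) / denom[i] if denom[i] > 0 else strengths[i]
                   for i in range(n_items)]
        scale = n_items / sum(updated)
        strengths = [s * scale for s in updated]  # fix the arbitrary overall scale
    return strengths
```

In the example used for testing, trace 0 and trace 2 are never compared directly, yet both receive a strength through their shared anchor, trace 1. How the fitted strengths would be mapped to rewards (for instance via their logarithms) is left open here.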
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
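Scoring rationale: The paper centers on reinforcement learning for LLM reasoning and on reward modeling, using trace tournaments to handle uninformative (zero-advantage) reward groups. It has little connection to the multimodal keywords (MultiModal, MLLM, Visual Encoder) or the architecture-unification keywords (Unify Models, Tokenizer). It does involve RL, but it focuses on reward estimation rather than an environment model, so model-based RL receives only a moderate score (4.0/10). No designated expert authors were detected. The weighted total is 19.5, below the dynamic passing score of 27.8.
Keywords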
Reasoning Arena, Trace Tournaments, Verifiable Rewards, Reinforcement Learning, Bradley-Terry Model, Relative Ranking, LLM Reasoning, Judge System
Abstract
Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to auto-regressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining it with the independent energy (Ind-E), we obtain a unified energy (Uni-E) that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. Moreover, Uni-E is model-agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across DLMs and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
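Scoring rationale: The paper centers on parallel decoding for diffusion language models (DLMs). Its title contains 'Unified', which gives a terminological link to 'Unify Models'. It involves large language models (hence some MLLM relevance) but does not address multimodal fusion (MultiModal and Visual Encoder score 0), reinforcement learning (model-based RL scores 0), or the construction of a concrete world model (World Models is only weakly related). Tokenization appears only as background, not as a core contribution.
Keywords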
Diffusion Language Models, Unified Energy, Parallel Decoding, Text Generation, Invariant Energy, Independent Energy, DLLMs
Abstract
Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
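Scoring rationale: The paper studies preference learning for large language models (LLMs) and proposes RePO, a framework based on regret minimization. The keyword set emphasizes multimodal architectures (Unify Models, Tokenizer, Visual Encoder, MultiModal) and world models or model-based RL (World Models, model-based RL), which fit this text-only, policy-optimization work poorly. Only MLLM and model-based RL receive modest relevance scores (3.0/10 each), owing to the large-model and RL background; the remaining keywords are at most marginally related (1.0 to 2.0/10). No designated expert authors were detected, so no bonus applies. The weighted total is 19.5, below the dynamic passing score of 27.8.
Keywords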
Preference Learning, Large Language Models, Regret Minimization, RLHF, Reinforcement Learning, Human Feedback, Counterfactual Comparisons
Abstract
As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 4.0/10 | 6.0 |
Scoring rationale: The paper focuses on expert-curated rubrics as evaluation instruments and reinforcement learning training signals for LLMs. It shows limited relevance to MLLM (3.0/10, because of its LLM focus) and moderate relevance to model-based RL (4.0/10, because it uses RL). It has no specific content on multimodal architectures (Tokenizer, Visual Encoder, MultiModal), world modeling, or model unification strategies, so those keywords score low. No designated expert authors were found in the author list.
Keywords
Expert Rubrics, Instruction Following, Agentic Tasks, LLM Evaluation, Reinforcement Learning, Training Signals, ComplexConstraints
Abstract
Large language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing sparse operational anchors that prevent exploration, debugging, and recovery. We study skill rewriting through this economic lens. Our controlled framework profiles skill structure, rewrites skills using information-preservation strategies, and evaluates the rewrites under fixed task instructions, environments, and verifiers. Experiments on SkillsBench reveal distinct quality-cost trade-offs across strategies: API/code anchoring, workflow guarding, and rule/formula anchoring benefit different task families, with no universally dominant template. In the main held-out evaluation, the learned policy reduces total cost by 7.0% and downstream agent-token cost by 6.0%; in frozen cross-model transfer, the corresponding reductions average 14.7% and 13.7%, while verifier quality is preserved. These results position skill design as cost-aware operational knowledge engineering rather than prompt compression. Resources: SkillEE (https://github.com/1Reminding/Skill_EE).
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
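Scoring rationale: The paper focuses on skill rewriting and cost optimization for language model agents. It is weakly related to MLLM and to reinforcement learning concepts (3.0/10 each) and offers no direct discussion of model unification, tokenizer design, visual encoders, world models, or multimodal content (0.0 to 2.0/10). The keyword set targets multimodal world models, whereas this paper concerns operational knowledge engineering for language agents, so overall relevance is low.
Keywords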
Skill Rewriting, Language Model Agents, Quality-Cost Trade-offs, Cost-Aware, Operational Knowledge Engineering, SkillsBench, Anchoring Strategies
Abstract
Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.
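For readers less familiar with the two reported metrics, the sketch below shows how AUROC and sensitivity are conventionally computed for a binary normal-versus-abnormal classifier. It is illustrative only: the labels and scores are made-up toy values, not the paper's data, and the function names are assumptions introduced here.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (abnormal, normal) pairs ranked correctly.

    labels: 1 for abnormal, 0 for normal; ties in score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def sensitivity(labels, preds):
    """True-positive rate: TP / (TP + FN)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn)


# Toy example (hypothetical values): two abnormal and two normal recordings.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ordered correctly
```

An AUROC of 0.76 therefore means that a randomly chosen abnormal recording receives a higher score than a randomly chosen normal one about 76% of the time, while a sensitivity of 0.71 means 71% of abnormal cases are flagged at the chosen threshold.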
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
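Scoring rationale: The paper is mainly concerned with modeling eye movements for medicine and developing digital biomarkers. Although the abstract explicitly describes a 'multimodal' pipeline, the work has no direct connection to core concepts such as MLLM, Tokenizer, or model-based RL. A visual encoder is not identified as a core architectural component, and World Models and Unify Models are only loosely related in this medical application setting.
Keywords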
Eye Movement, Saccade, Synthetic Data, Digital Biomarker, Multimodal, Classification, Neurophysiologic, Patient-Free
Abstract
Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which can propagate SAR speckle noise into optical reconstruction and lead to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN employs a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, thereby mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and routes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that integrates feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits demonstrate that IB-HFN achieves superior structural preservation and spectral fidelity over existing methods.
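The channel-wise variational information bottleneck mentioned above is typically trained with a KL penalty that pulls each compressed feature toward a standard normal prior. The sketch below shows that standard closed-form term, assuming a diagonal Gaussian encoder; the abstract does not give IB-HFN's exact formulation, so the function name and the parameterization by `mu` and `log_var` are assumptions.

```python
import math


def vib_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over channels.

    Closed form per channel: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


# A channel that already matches the prior contributes zero penalty,
# while a channel with a shifted mean is penalized.
no_penalty = vib_kl([0.0, 0.0], [0.0, 0.0])
shifted = vib_kl([1.0], [0.0])
```

In a bottleneck of this kind, a larger weight on the KL term forces the SAR branch to discard more information, which is the usual mechanism by which unstructured noise such as speckle is suppressed.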
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
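Scoring rationale: The paper belongs to remote sensing image processing. It does involve multimodal fusion of SAR and optical imagery, but it has no direct connection to MLLM, World Models, reinforcement learning, or Tokenizer. The visual encoder serves only as a conventional backbone and does not reflect the core traits of a unified model architecture. The author list contains none of the designated experts.
Keywords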
SAR, Optical, Cloud Removal, Information Bottleneck, Fusion, Remote Sensing, Dual-stream, High-Fidelity
Abstract
Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers since the absence of encoder-decoder hierarchy forces duplication of the entire backbone. We observe that flow matching offers a principled alternative for cross-domain VSR adaptation. By predicting a constant velocity field across all timesteps, the adaptation task reduces to learning a fixed injection pattern rather than time-varying transformations. Building on this insight, we propose LiteVSR, a minimalist framework that performs VSR using a completely frozen Diffusion Transformer with a lightweight State-Aware Adapter. The adapter employs a dual-stream architecture that extracts static structural cues from the LQ input and dynamic cues from intermediate denoising states, aligning them through time-dependent cross-attention to enable adaptive transition from structural alignment to texture refinement as denoising proceeds. LiteVSR achieves competitive restoration quality with only 11.25% trainable parameters and 12 GPU-hours of training on a single A100, while maintaining fast sampling (down to a single step) compatibility.
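The remark about a constant velocity field can be made concrete with the linear interpolation path commonly used in flow matching. This is a minimal sketch under that assumption (the abstract does not state the exact path LiteVSR uses): along x_t = (1 - t) x_0 + t x_1, the target velocity x_1 - x_0 does not depend on t, so even a single Euler step recovers the endpoint when the velocity is predicted exactly. All function names are illustrative.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """d/dt of the linear path; note that it is independent of t."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x0, x1 = [0.0, 1.0], [2.0, -1.0]
v = target_velocity(x0, x1)
one_step = euler_sample(x0, lambda x, t: v, steps=1)
four_step = euler_sample(x0, lambda x, t: v, steps=4)
```

With an ideal constant velocity, the one-step and four-step samples coincide, which is consistent with the abstract's claim of compatibility with fast sampling down to a single step.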
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
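Scoring rationale: The paper focuses on lightweight adaptation for video super-resolution (VSR) using a frozen Diffusion Transformer, and its overall relevance to the keyword set is low. Unify Models (3.0): an adapter is combined with a base model, but this falls short of architectural unification. Tokenizer (2.0): implicit in the diffusion model, not a focus. Visual Encoder (3.0): the Diffusion Transformer processes visual input, but the encoder itself is not a novelty. World Models (3.0): diffusion models belong to the broader family of generative models. MLLM (0.0): no language-model content. MultiModal (2.0): video is a visual sequence rather than a typical multimodal input. model-based RL (0.0): no reinforcement learning. The expert-author check found no match for Yang Shi or the other designated experts. The weighted total is 19.5, below the dynamic passing score of 27.8.
Keywords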
Video Super-Resolution, Diffusion Transformers, Lightweight Adaptation, Flow Matching, Frozen Model, State-Aware Adapter, Cross-Attention
Abstract
While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
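Scoring rationale: The paper focuses on real-time streaming video generation (Ultra Flash) and belongs to generative AI (AIGC). 'model-based RL' is unrelated (0.0). 'MLLM' and 'Unify Models' have low relevance (1.0 each) because the paper involves neither large language models nor a unified model architecture. 'Tokenizer' and 'Visual Encoder' are components of diffusion models but not core contributions here (2.0 each). 'World Models' is related to video generation through temporal modeling, though not in the strict sense (3.0). 'MultiModal' is somewhat related because the work involves T2V (text-to-video) generation (4.0). None of the five designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list, so no bonus applies. The weighted total is 19.5, below the dynamic passing score of 27.8, indicating low relevance to the keyword set.
Keywords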
Real-Time Streaming Video Generation, High Resolutions, Cascaded Streaming Framework, Autoregressive Video Diffusion Models, Latent Upsampler, Text-to-Video, Single-Step Distillation, Preference Optimization
Abstract
3D object detection is the backbone of perception for automated vehicles (AVs) and broader intelligent transportation systems applications. Long-range detection is challenging because sensing evidence is sparse; yet this "long-range" scenario is routine in traffic. Although >30 m is often labeled long-range in computer vision, on roadways it affords only approximately 1-2 s for perception and decision-making. Under such extreme sparsity, two core challenges arise. First, early multimodal fusion tends to discard sparsity information and inject noise from empty or falsely occupied cells, degrading long-range recall. Second, context-agnostic uniform channel supervision favors dense and near-range samples, leaving far and small objects under-optimized and delaying the earliest detection of distant objects. We propose "Ask The Neighbor" (ATN3D), a LiDAR-Radar framework tailored for sparse-range conditions. ATN3D introduces (i) density-aware early fusion with cross-modal gating that conditions fusion on per-voxel density/sparsity and radar evidence, (ii) occupancy-gated neighborhood aggregation with circular kernels to aggregate only from credible cells, (iii) evidence-conditioned channel self-attention to adapt channel weights to weather and range, and (iv) a range-aware loss that re-balances classification and localization by distance, aligning training with distance-stratified evaluation. On the VoD benchmark across clear and foggy conditions, ATN3D surpasses strong baselines: +3.55% mAP in clear weather and +8.41% mAP under simulated heavy fog; for >30 m objects, gains are +3.33% (clear) and +2.09% (heavy fog). These results indicate earlier and more reliable long-range detections under sparse sensing in on-road traffic.
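The claim that 30 m leaves only about 1-2 s can be checked with simple arithmetic: the available time is the detection range divided by the closing speed. The sketch below is illustrative; the two speeds (54 km/h and 108 km/h) are assumptions chosen here because they bracket the stated 1-2 s window, and are not taken from the paper.

```python
def time_budget_s(range_m, closing_speed_kmh):
    """Seconds until an object at range_m is reached at the given closing speed."""
    speed_mps = closing_speed_kmh * 1000.0 / 3600.0  # km/h -> m/s
    return range_m / speed_mps


# 30 m of detection range at two hypothetical closing speeds.
urban = time_budget_s(30.0, 54.0)     # 54 km/h = 15 m/s
highway = time_budget_s(30.0, 108.0)  # 108 km/h = 30 m/s
```

Under these assumed speeds the budget is 2 s and 1 s respectively, which is why gains on objects beyond 30 m translate directly into earlier reaction time.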
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on 3D object detection for automated vehicles using LiDAR-Radar fusion. It scores highly on 'MultiModal' (9.0/10) because of the sensor fusion and weakly on 'Unify Models' (3.0/10) because it integrates two sensor streams into a single framework. It is unrelated to 'Tokenizer', 'Visual Encoder' (the method is not camera-based), 'World Models', 'MLLM', and 'model-based RL' (0.0 each): it is a conventional perception task with no generative modeling or reinforcement learning components. No expert authors from the specified list were found. The total weighted score is 18.0, below the dynamic passing threshold of 27.8.
Keywords
3D Object Detection, LiDAR-Radar Fusion, Extreme Sparsity, Density-aware, Early Fusion, Range-aware Loss, Automated Vehicles
Abstract
AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality × gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.
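The completeness-aware metric is a plain product, so the reported figures can be roughly reproduced from the rounded values in the abstract. The sketch below is a worked check, not the authors' evaluation code; the 10-point decision-quality scale used for the cap is an assumption inferred from the reported ceiling of 3.83.

```python
def informed_decision_quality(decision_quality, gold_coverage):
    """Completeness-aware utility: decision quality scaled by gold coverage."""
    return decision_quality * gold_coverage


# Rounded inputs taken from the abstract for arms A and B.
arm_a = informed_decision_quality(7.01, 0.25)  # about 1.75 (reported: 1.76)
arm_b = informed_decision_quality(6.96, 0.38)  # about 2.64 (reported: 2.57)

# Ceiling for a perfect report limited to B's coverage, assuming a 10-point scale.
cap_b = informed_decision_quality(10.0, 0.38)  # about 3.8 (reported: 3.83)
```

The small gaps between these products and the reported values presumably come from rounding or from averaging per asset before aggregating; the abstract does not say which.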
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
Scoring rationale: The paper focuses on the impact of proprietary data versus reasoning scaffolds in AI Scientist agents for drug valuation. It does not discuss model architecture unification, tokenization strategies, visual encoders, world model architectures, or reinforcement learning frameworks. While it utilizes LLMs, it does not focus on multi-modality or model-based RL techniques, resulting in low relevance to the provided technical keywords. Total weighted score is 18.0, below the dynamic passing threshold of 27.8.
Keywords
AI Scientist agents, Proprietary Data, Drug-Asset Valuation, Ablation Study, Reasoning Skills, Evidence Substrate, Knowledge-intensive decisions
Abstract
BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call brain-prompt injection: signal-side perturbations, context-only injections, and adaptive dual-decoder attacks can all change the routed action while EEG-side or text-side monitors remain blind. Route safety in this stack depends on what the audit log can observe, not on decoder accuracy or agreement alone. We define a Route-Safety Audit Contract: a minimal log schema, denominator hierarchy, and endpoint specification, and prove an audit-schema separation theorem together with a C3 attacked-dependence decomposition; clean agreement and marginal robustness do not identify the joint term that controls C3 routing. As a calibration layer on top of the contract, we apply split-conformal calibration to a non-oracle EEG confirmation channel and report the resulting false-accept frontier under an explicit threat-archetype matrix. We instantiate the contract on EEGMMI native left/right command-control over 5,400 events, harmless tool stubs, and seed/case denominators. Provenance blocks C2 routes (0.000); agreement-plus-provenance routes C3 flips (1.000); confirmation-plus-provenance routes them (0.000). The conformal frontier reaches FAR 0.000 at clean utility 0.150 for α = 0.005 and FAR 0.119 at clean utility 0.452 for α = 0.10 under acquisition isolation; an attacker-controllable confirmation channel breaks the bound to ≈ 1. Subject-cluster bootstrap confirms these intervals on 60 subjects; cross-architecture (TinyEEGNet, EEGNetV4) and capacity-sweep results show within-regime saturation. Mediation and confirmation reduce risk; they are not intent certificates.
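Split-conformal calibration, as referenced above, picks an acceptance threshold from a held-out calibration set so that the miscoverage rate is bounded by α under exchangeability. The sketch below shows the generic quantile rule only; how the paper defines its nonconformity score for the EEG confirmation channel is not stated in the abstract, so the score values and the function name are assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, no finite threshold gives the guarantee,
    so infinity is returned.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


# Hypothetical nonconformity scores from a calibration split.
cal = [5, 1, 3, 2, 4, 7, 6]
loose = conformal_threshold(cal, alpha=0.25)   # rank 6 of 7
strict = conformal_threshold(cal, alpha=0.05)  # rank 8 > 7, so no finite threshold
```

The strict case illustrates why very small α values such as 0.005 need a large calibration set: with too few calibration events the rule cannot return a finite threshold. The guarantee also relies on calibration and test data being exchangeable, which is consistent with the abstract's note that an attacker-controllable confirmation channel breaks the bound.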
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
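评分理由: The paper focuses on security auditing between brain-computer interfaces (BCI) and large language model (LLM) agents and on defending against 'brain-prompt injection' attacks, not on model unification, tokenizer design, visual encoders, world models, or model-based reinforcement learning algorithms. It involves LLMs and multimodal input (EEG plus text), so it is somewhat related to the MLLM and MultiModal keywords, but its core contribution is the safety audit contract and the calibration method, so most architecture keywords score low. The author list does not include any of the specified experts, so no bonus points were added.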
关键词
Brain-Prompt Injection, BCI-LLM Agents, Route-Safety Audit, Conformal Calibration, EEG Security, Tool-use Agents, Threat-archetype Matrix
摘要翻译
扩散模型已展现出卓越的生成能力,同时也涌现为强大的自监督表征学习器,然而这两种能力之间的联系仍较少受到探索。受自监督学习 (SSL) 的启发,我们提出一个框架,用于联合评估扩散模型的表征能力和生成能力。具体而言,我们将特征分解为不变分量和残差分量,并推导出不变污染比率 (ICR),这是一种基于 Fisher 的度量,用于量化残差变异如何在特征空间中污染不变信号。我们利用该框架分析扩散模型的判别性和生成性行为。在表征方面,我们发现不变性在中间噪声水平达到峰值,而这些水平也带来了最佳的下游分类性能。在生成方面,我们研究了训练如何在数据受限情形下从真实泛化过渡到记忆化,并表明 ICR 可作为早期学习的敏感训练时指标:沿 Fisher 方向增加的残差能量标志着记忆化的开始,仅凭训练特征即可检测,无需外部评估器或预留测试集。总体而言,我们的结果表明,可以通过所学表征的几何结构,从自监督视角对扩散模型进行监控。
Abstract
Diffusion models have demonstrated remarkable generative capabilities and have also emerged as powerful self-supervised representation learners, yet the connection between these two abilities remains less explored. Drawing inspiration from self-supervised learning (SSL), we introduce a framework for jointly evaluating the representation and generation capabilities of diffusion models. Specifically, we decompose features into invariant and residual components and derive the Invariant Contamination Ratio (ICR), a Fisher-based metric that quantifies how residual variation contaminates invariant signal in feature space. We use this framework to analyze both discriminative and generative behavior of diffusion models. On the representation side, we find that invariance peaks at intermediate noise levels, which also yield the best downstream classification performance. On the generative side, we study how training transitions from genuine generalization to memorization in data-limited regimes, and show that ICR serves as a sensitive training-time indicator of early learning: increasing residual energy along Fisher directions marks the onset of memorization, detectable from training features alone without external evaluators or held-out test sets. Overall, our results show that diffusion models can be monitored from a self-supervised perspective through the geometry of their learned representations.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on evaluating diffusion models via self-supervised principles and feature space geometry, showing minimal alignment with keywords focused on unified multimodal architectures (Unify Models, MLLM, MultiModal), discrete processing (Tokenizer), or reinforcement learning (model-based RL). 'Visual Encoder' receives a moderate score due to the inherent use of encoders in diffusion architectures, though it is not the study's focus. 'World Models' is loosely related through representation learning but lacks specific context. No expert authors from the specified list are identified in the authorship.
关键词
Diffusion Models, Representation Space, Self-Supervised Learning, Invariant Contamination Ratio, Feature Space Geometry, Generative Capabilities, Training Dynamics
摘要翻译
AI 红队测试必须不断适应不断演变的攻击者与防御者。强化学习为发现新颖攻击提供了一种有前景的方法,而协同训练方法则可同步生成更稳健的防御者。近期工作通过应用 PPO 和 DPO 证明了攻击者 - 防御者协同训练的有效性,但指出 GRPO 在这种设置下是不稳定的。我们引入了 AdvGRPO,这是一种协同训练框架,利用稠密多通道奖励和解耦优势归一化,使 GRPO 适用于联合攻击者 - 防御者优化。训练遵循课程学习策略,从单回合攻击进展至闭环多回合攻击,随后启动协同训练,在此过程中攻击者与防御者模型交替更新。实验结果表明,该方法可产生高度有效且可转移的攻击,且协同训练的防御者在安全基准上优于基线方法。
Abstract
AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on AI safety and red teaming for Language Models using RL (GRPO), which aligns moderately with 'model-based RL' (as RL is the core method) and 'Unify Models' (co-training unifies attacker/defender objectives). It involves closed-loop interactions ('World Models') and Large Language Models ('MLLM'), but lacks multimodal components ('MultiModal', 'Visual Encoder') or tokenizer focus ('Tokenizer'). The paper does not match the specific Multimodal/World Model background track, resulting in lower scores for those keywords.
关键词
Red Teaming, Language Models, Reinforcement Learning, GRPO, Co-training, Safety, Attacker-Defender, Curriculum Learning
摘要翻译
近期异常检测方法在诸如 MVTec 等成熟数据集上取得了优异的检测与分割得分。然而,当基础假设(如物体尺度、视角、背景、光照及居中放置的一致性)不成立时,许多此类方法均面临挑战。这些变化使得异常检测方法在许多实际应用场景中变得不可用。为了解决这些局限性,我们提出了三个关键贡献:(1) 一种利用前景 - 背景掩码隔离对象的视觉提示流程;(2) 一种在师生模型中解冻教师模型以提高领域适应性的机制;以及 (3) 一种利用扩散生成的合成图像来增强异常检测性能的数据增强策略。通过使用掩码多尺度重建 (MMR) 模型作为骨干网络,我们在具有挑战性的 AeBAD 数据集上实现了比先前最先进方法高出 3.5 个百分点的提升。
Abstract
Recent anomaly detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, many of these methods struggle when foundational assumptions (consistent object scale, viewpoint, background, illumination, and centered placement) are violated. Such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. Using the Masked Multiscale Reconstruction (MMR) model as our backbone, we achieve a 3.5 percentage point improvement over the previous state-of-the-art on the challenging AeBAD dataset.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
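评分理由: The paper focuses on computer-vision anomaly detection using visual prompting and feature reconstruction, and overlaps very little with the keyword set's directions such as large language models, world models, and reinforcement learning. Only Visual Encoder is somewhat related, through the backbone architecture; the remaining keywords, such as Tokenizer, World Models, MLLM, and model-based RL, are not directly addressed. No specified expert authors were found. The weighted total is 16.5, below the dynamic pass threshold of 27.8.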
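The weighted total quoted in this rationale can be reproduced from the score table above it. This is a small sketch assuming each score is weight times relevance and that a paper passes when its total reaches the threshold; the report does not state how the threshold of 27.8 is derived, and the `>=` comparison is an assumption.

```python
# Relevance values (out of 10) copied from the score table for this paper.
WEIGHT = 1.5  # every keyword in the table carries the same weight
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 3.0,
    "World Models": 1.0,
    "MLLM": 1.0,
    "MultiModal": 2.0,
    "model-based RL": 1.0,
}
PASS_THRESHOLD = 27.8  # quoted in the rationale; its derivation is not given

# Per-keyword score is weight * relevance, matching the last table column.
scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())
passed = total >= PASS_THRESHOLD  # assumed comparison rule
```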
关键词
Visual Prompting, Anomaly Detection, Feature Reconstruction, Dual-Teacher Supervision, Student-Teacher Models, Diffusion-Augmentation, Foreground-Background Masking
摘要翻译
关于“涌现式错位”(emergent misalignment)的研究表明,在窄任务上微调大型语言模型(LLM)可能诱发广泛错位的行为。这支持了“人格选择”(Persona Selection, PSM)假设:在预训练过程中,LLM 学会模拟不同角色和视角,这些可以在后训练阶段被激发并精炼。本文研究了对面的现象——“涌现式对齐”(emergent alignment),并利用它来支持和完善 PSM,进而提出对齐的一个新 desideratum(期望标准)。我们在宽泛和窄安全任务上微调一个仅帮助型模型。为了创建监督微调(SFT)样本,我们遵循“宪法式 AI"(Constitutional AI, CAI)方法,并使用四个编码合理对齐策略的宪法:义务论、后果主义、美德伦理学,以及使 AI 服从人类权威。对于每个模型,我们表明在两个窄安全子类别上进行微调,能够可靠地诱导涌现式对齐,这种对齐体现在一般安全类别的代表性集合上,也体现在我们直接从用于窄对齐的数据集中过滤掉的安全子类别上。为了使用更细粒度的评估来测试 PSM,我们采用了多维“伦理人格”诊断工具。对于每个经宪法微调的(宽泛/窄)模型,我们评估其行为与其预期特征轮廓(signature profile)的匹配程度。我们的结果表明,我们的 CAI 模型获得了预期的“伦理人格”——例如,在利用后果主义宪法创建的 SFT 样本上进行窄微调的模型,其信念与功利主义的一致性显著高于义务论。然而,我们的粗粒度和细粒度评估表明,我们的(宽泛/窄)微调 CAI 模型在“投射”(project)表现上存在显著差异。我们得出结论,对齐策略的评估不应仅基于其(分布内)一般安全性能,还应特别基于其可投射性(projectability)的程度。
Abstract
Work on 'emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the 'persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, 'emergent alignment', and uses it to support and refine the PSM and motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the 'Constitutional AI' (CAI) approach and use four constitutions which encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered out of the datasets used for narrow alignment. To test the PSM using a more fine-grained evaluation, we use a multidimensional 'ethical persona' diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. Our results show that our CAI models acquire their expected 'ethical persona': for example, the model narrowly finetuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated, not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
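评分理由: The paper examines emergent alignment in large language models (LLMs) finetuned on safety tasks and the projectability of ethical personas, which belongs to AI alignment and ethics. The provided keywords center on multimodal architectures (e.g., Visual Encoder, MultiModal, MLLM), model components (Tokenizer), and reinforcement learning architectures (model-based RL, World Models), which match the paper's content and technical approach poorly, so the relevance scores are low.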
关键词
Emergent alignment, Ethical personas, Constitutional AI, LLM finetuning, Safety tasks, Persona selection hypothesis, Projectability
摘要翻译
由大语言模型(LLMs)驱动的深层研究智能体在自动论文写作任务中展现出了非凡的潜力。然而,现有系统严重依赖通过互联网和本地知识库进行的文献检索与综合,往往导致社会科学领域的研究缺乏洞察力和创造力。为了解决这一问题,我们提出“记忆增强社会模拟(MASS)”,这是一种创新范式,利用高度逼真且面向研究的社会模拟来增强大语言模型(LLMs)生成研究的创造力和实证基础。具体来说,MASS 整合了三个核心组件:用于引导模拟的动态目标 - 路径规划(包含多层次社会规范约束)、用于智能体记忆冷启动的多学科行为数据集,以及受艾宾浩斯曲线(Ebbinghaus curve)启发的结构化遗忘机制。这些组件共同确保了模拟的真实性,并为生成创新性学术论文提供了坚实的实证基础。实验结果表明了该方法的有效性,相较于基础大语言模型,生成整体质量提升了 6.81%,相较于强基线,洞察力(Insight)提升了 17.19%。
Abstract
Deep Research agents powered by Large Language Models (LLMs) have exhibited extraordinary potential in automated paper writing tasks. However, existing systems rely heavily on literature retrieval and synthesis through the internet and local knowledge bases, often producing social science research that lacks insight and creativity. To address this issue, we propose "Memory-Augmented Social Simulation (MASS)", an innovative paradigm that leverages highly realistic and research-oriented social simulations to enhance the creativity and empirical grounding of LLM-generated research. Specifically, MASS integrates three core components: dynamic goal-path planning with multi-level social norm constraints to guide the simulation, a multi-disciplinary behavior dataset for agent memory cold-start, and a structured forgetting mechanism inspired by the Ebbinghaus curve. Together, these ensure simulation authenticity and provide a robust empirical foundation for generating innovative scholarly papers. Experimental results demonstrate the effectiveness of our method, showing a 6.81% improvement in overall generation quality over foundation LLMs and a 17.19% gain in Insight over strong baselines.
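The abstract names an Ebbinghaus-inspired forgetting mechanism without giving its form. As a rough illustration only, the sketch below uses the common exponential approximation of the forgetting curve, R = exp(-t / S), to prune agent memories; the record fields, the `floor` cutoff, and the pruning rule are all hypothetical and not MASS's actual mechanism.

```python
import math

def retention(elapsed, strength):
    """Exponential approximation of the forgetting curve: R = exp(-t / S)."""
    return math.exp(-elapsed / strength)

def prune(memories, now, floor=0.2):
    """Keep only memories whose estimated retention is still at least `floor`."""
    return [
        m for m in memories
        if retention(now - m["t"], m["strength"]) >= floor
    ]

# Hypothetical memory records: creation time `t` and a stability `strength`.
memories = [
    {"text": "met a neighbor", "t": 0.0, "strength": 1.0},   # weak, fades quickly
    {"text": "signed a lease", "t": 0.0, "strength": 10.0},  # strong, persists
]
kept = prune(memories, now=5.0)
```

A larger `strength` flattens the curve, so important or frequently rehearsed events survive longer than incidental ones.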
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
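评分理由: The paper focuses on automating social-science research with LLMs and is entirely unrelated to visual encoders and multimodality (0 points). Tokenizer is not mentioned as a contribution (1 point). It involves simulation and planning, which are conceptually related to World Models and model-based RL but not central (3 points). It does not address specific techniques for unified model architectures (Unify Models) or multimodal large models (MLLM) (2 points). The weighted total is 16.5, below the dynamic pass threshold of 27.8. The author list does not include any of the specified experts.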
关键词
Memory-Augmented Social Simulation, Large Language Models, Social Science Research, Dynamic Goal-Path Planning, Structured Forgetting Mechanism, Empirical Foundation, Automated Paper Writing
摘要翻译
EEG 基础模型的发布通常逐个端点进行审计,包括原始重构、成员推断、身份关联,或在下游头部应用 DP-SGD。我们在 BIOT、LaBraM 和 EEGPT 数据集上,对同一发布的嵌入在所有四个端点下进行联合审计,结果表明,每个单端点审计通过的发布实际上仍存在频谱属性泄露。决定性的证据来自跨编码器转移审计:一个从某个冻结编码器学习到的单一岭属性解码器,通过拟合的线性桥,可迁移至其他所有编码器的留受试者测试分割,且在所有六个 BIOT/LaBraM/EEGPT 方向上,受试者不交匹配控制的 95% 置信区间下界至少为 0.081。我们证明了一个充分条件:若两个编码器共享非平凡属性坐标投影器重叠度 β,则存在一个链式岭桥攻击者,其中心化增益下界为 √(β/(1+τ^2)) - ε_br - ρ_0,且反向求解得到的 β 位于 [0.008, 0.198] 区间内。为了将联合审计转化为可部署的决策规则,我们引入了审计端点分歧分数(AEDS),证明了其为正性的充分条件,并对每个单元进行了自举校准;结果显示,AEDS 在所有八个匹配 CI 单元中均为正值(包括 EEGMMI 上的 BIOT/LaBraM/EEGPT,以及 Sleep-EDF、54 通道 LIMO、CHB-MIT 儿科头皮 EEG 上的 LaBraM),p<0.001;相比之下,头部级别的 Carlini LiRA 成员推断审计仅达到 0.50-0.70 的 AUC。标准防御在审计下均失效:无论是 Wiener 风格的噪声感知自适应攻击者、LiRA 审计,还是在每个保效用 epsilon(取值为 4 或 8)下的 DP-SGD,属性通道基本保持不变。本研究贡献在于一个审计框架,该框架将分散的单端点防御转化为联合发布决策,得到了跨编码器桥定理以及自适应攻击者、LiRA 和 DP-SGD 基线的支持;该审计旨在支持发布阻断,而非原始波形窃取或留受试者身份恢复。
Abstract
EEG foundation-model releases are usually audited one endpoint at a time: raw-reconstruction, membership inference, identity linkage, or DP-SGD on the downstream head. We audit the same released embeddings under all four endpoints jointly, on BIOT, LaBraM, and EEGPT, and show that each single-endpoint audit clears releases that still leak spectral attributes. The decisive evidence is a cross-encoder transfer audit: a single ridge attribute decoder learned from one frozen encoder transfers, via a fitted linear bridge, to held-out-subject test splits of every other encoder, with subject-disjoint matched-control 95% CI lower bound at least 0.081 across all six BIOT/LaBraM/EEGPT directions. We prove a sufficient condition: two encoders sharing a nontrivial attribute-coordinate projector overlap $\beta$ admit a chained ridge bridge attacker with centered-gain lower bound $\sqrt{\beta/(1+\tau^2)} - \varepsilon_{\mathrm{br}} - \rho_0$, and back-solve $\beta \in [0.008, 0.198]$. To turn the joint audit into a deployment-readable decision rule we introduce an audit-endpoint disagreement score (AEDS), prove sufficient conditions for its positivity, and bootstrap-calibrate it per cell; AEDS is positive in all eight matched-CI cells (BIOT/LaBraM/EEGPT on EEGMMI; LaBraM on Sleep-EDF, 54-channel LIMO, CHB-MIT pediatric scalp EEG) with $p<0.001$, while a head-level Carlini LiRA membership audit reaches AUC only 0.50-0.70. Standard defenses fail under audit: a Wiener-style noise-aware adaptive attacker, the LiRA audit, and DP-SGD at every utility-preserving $\varepsilon \in \{4, 8\}$ leave the attribute channel essentially unchanged. The contribution is an audit framework that turns scattered single-endpoint defenses into a joint release decision, supported by a cross-encoder bridge theorem and adaptive-attacker, LiRA, and DP-SGD baselines; the audit licenses release-blocking, not raw-waveform exfiltration or held-out-subject identity recovery.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
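评分理由: The paper focuses on auditing cross-encoder attribute leakage in EEG foundation models and has no direct connection to unified models, tokenizers, visual encoders, world models, or model-based reinforcement learning. MultiModal scores slightly higher because EEG models often involve multimodal data, but the core contribution is a security audit rather than architecture design. The author list does not include any of the specified experts (Yang Shi et al.), so no bonus points were added.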
关键词
EEG Foundation Models, Cross-Encoder Transfer, Attribute Leakage, Privacy Audit, DP-SGD, Audit-Endpoint Disagreement Score, Spectral Attributes, Ridge Attribute Decoder
摘要翻译
大型语言模型(LLMs)正越来越多地被部署为代码生成器,其中隐蔽错误的程序会带来真实的安全性和可靠性风险。可靠的不确定性估计(UE)对于选择性预测、人机回环审查以及下游智能体决策至关重要。然而,大多数现有的代码不确定性估计(UE)方法均源自自然语言(NL)生成,并忽略了使代码具有独特性的属性。我们认为代码与自然语言(NL)在三个方面存在差异:单个错误标记(token)即可破坏整个程序(标记脆弱性);算法意图与具体实现可独立不一致(意图 - 代码差距);且程序可被执行(可执行性)。我们将这些属性实例化为三个正交的不确定性轴:词汇轴(Top-K 标记熵)、算法轴(伪代码一致性)和功能轴(行为一致性)。在五个代码大型语言模型上,我们的三轴集成方法将平均 AUROC 从最强的源自自然语言(NL)的基线方法的 0.696 提升至 0.776(+8.1 个百分点)。值得注意的是,在 Qwen3-14B 上,我们的单次 Top-K 标记熵方法达到了最强的多次基线方法的性能,且成本低于三分之一;在所有模型上,它仍是一种具有竞争力的低成本信号。这些结果表明,代码不确定性估计(UE)值得采用针对代码的设计,而非直接移植自然语言(NL)的方法。
Abstract
Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
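The lexical axis is described only as "Top-K token entropy". One plausible reading, sketched below under that assumption, renormalizes the K most likely next-token probabilities at each decoding step, takes their Shannon entropy, and averages over the sequence; the function names and the averaging are illustrative choices, not the paper's definition.

```python
import math

def topk_entropy(probs, k=5):
    """Shannon entropy (in nats) of the renormalized top-k probabilities."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)

def sequence_uncertainty(step_probs, k=5):
    """Mean top-k entropy over all decoding steps of one generated program."""
    return sum(topk_entropy(p, k) for p in step_probs) / len(step_probs)

# Hypothetical next-token distributions for a three-token completion.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
hesitant = [[0.97, 0.01, 0.01, 0.01], [0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]]

low = sequence_uncertainty(confident, k=4)
high = sequence_uncertainty(hesitant, k=4)
```

Because the signal comes from probabilities already produced during a single generation pass, it needs no extra sampling, which is consistent with the low-cost claim in the abstract.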
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 5.0/10 | 7.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on uncertainty estimation for code generation using LLMs, emphasizing code-specific properties like token fragility and executability. It scores moderately on Tokenizer (token entropy) and slightly on Unify Models/MLLM/model-based RL due to LLM usage and unification of uncertainty axes, but is irrelevant to Visual Encoder, World Models, and MultiModal as it lacks vision components and multimodal integration. No matching expert authors were found from the provided list.
关键词
Code Generation, Uncertainty Estimation, Token Entropy, Intent-Code Gap, Executability, Large Language Models, Ensemble Method
摘要翻译
大规模视频检索在自动驾驶的数据策展与安全验证中至关重要,用户不仅希望检索到场景,还希望找到切入(cut-ins)和急刹车(hard braking)等动态事件。现有的基于视觉 - 语言(vision-language)和关键词(keyword-based)的检索方法往往遗漏这些事件,因为相关的运动可能未在文本中明确描述,也无法通过词汇重叠捕捉。基于规则的检索可以更直接地表示此类事件,但其鲁棒性较差:生成的或手写的规则往往在假设与真实驾驶数据不匹配时失效。我们提出 STRIVE-D,一种用于驾驶视频的数据校准检索框架。该方法利用弱标注的领域内视频来估计查询规则的可靠性,调整与观测数据不匹配的规则,并将校准后的规则分数与视觉 - 语言及基于关键词的检索信号进行融合。在三个驾驶基准上(包括在 DrivingDojo 上新发布的人类标注事件数据),STRIVE-D 相较于最先进方法在 top-1 准确率上实现了高达 84% 的相对提升。
Abstract
Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper proposes STRIVE-D, a retrieval framework for driving videos, which does not focus on model unification, tokenizer design, visual encoder architecture, world modeling, or reinforcement learning. It utilizes vision-language signals (MLLM-adjacent) and handles video-text data (MultiModal), but these are not the core contributions. The weighted total score is 16.5, below the dynamic pass threshold of 27.8, indicating low relevance to the specific keyword cluster (MLLM/RL).
关键词
Driving Video Retrieval, Complex Queries, Structured Grounding, STRIVE-D, Vision-Language, Rule Calibration, Autonomous Driving, Event Detection
摘要翻译
灵巧操作因其高维动作空间和复杂的接触丰富动力学,对模仿学习构成了重大挑战。仅基于演示数据训练的策略在部署过程中往往会出现累积误差,且需要大量专家数据才能实现可靠性能。为了突破演示数据的局限性,本文提出了一种名为 DexPIE 的后训练框架,旨在利用真实世界部署过程中收集的经验来改进灵巧策略。首先,DexPIE 通过适配灵巧手的干预系统以及跨初始和中间任务阶段的多阶段 DAgger 风格数据收集,实现了有效的探索覆盖,为准确的策略评估提供了可靠的监督。为了减少后训练轨迹与演示数据之间的时间噪声,我们在相对动作空间中引入了异步推理,这更好地将轨迹数据与演示行为对齐,并使 Critic (评论者) 能够学习由更一致的基础策略所诱导的价值函数。最后,DexPIE 通过基于连续最优性指示器对策略进行条件化改进,使策略能够以更细粒度的方式利用数据质量。在三个具有挑战性的真实世界灵巧操作任务中,DexPIE 相较于基于演示的参考策略,成功率提高了 37%,优于所有基线方法,并展现出更强的鲁棒性。源代码和数据集将公开发布。
Abstract
Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to achieve reliable performance. To move beyond the limitations of demonstration data, in this work, we propose DexPIE, a post-training framework for dexterous policy improvement from experience collected through real-world deployment. First, DexPIE enables effective exploration coverage through a dexterous-hand-adapted intervention system and multi-stage DAgger-style data collection across initial and intermediate task stages, providing reliable supervision for accurate policy evaluation. To reduce temporal noise between post-training rollouts and demonstration data, we introduce asynchronous inference in the relative action space, which better aligns rollout data with demonstrated behavior and allows the critic to learn a value function induced by a more consistent underlying policy. Finally, DexPIE improves the policy through conditioning on a continuous optimality indicator, allowing the policy to leverage the quality of data in a more fine-grained manner. Across three challenging real-world dexterous manipulation tasks, DexPIE achieves a 37% improvement in success rate over the demonstration-based reference policy, outperforming all baseline methods and demonstrating stronger robustness. The source code and dataset will be made publicly available.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
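评分理由: The paper focuses on improving dexterous robot manipulation policies, which falls under reinforcement learning and imitation learning. Keywords such as Tokenizer, MLLM, and Unify Models mainly concern large language models and are largely unrelated to this paper (0-2 points). A visual encoder may be used for perception but is not a core contribution (2 points). MultiModal has some relevance to robot vision-action control (3 points). model-based RL touches on RL concepts, but the paper leans toward policy improvement and DAgger rather than typical model-based reinforcement learning (3 points). World Models refers to generative environment models, which the paper does not address (1 point). The author list does not include any of the specified experts, so no bonus points were added.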
关键词
Dexterous manipulation, Policy improvement, Real-world experience, Imitation learning, DAgger-style data collection, Asynchronous inference, Value function
摘要翻译
长上下文语言模型推理受限于内存,因为 KV 缓存 (KV cache) 随上下文长度增长。近期用于压缩 KV 缓存的技术存在不足:它们要么显著降低模型质量,要么需要大量的时间和计算资源来压缩单个长提示。此外,许多方法要求输入必须适配目标模型的上下文窗口,且通常与现代生产推理引擎不兼容。编码器 - 解码器压缩器 (Encoder-decoder compressors),即将长 token 序列映射为较短的潜在嵌入序列以供解码器使用,原则上是一种有吸引力的替代方案。然而,现有方法在准确性 - 效率前沿上与 KV 缓存压缩相比不具备竞争力。在这项工作中,我们重新审视编码器 - 解码器压缩并缩小了这一差距。我们首先进行架构搜索,从头预训练多种变体,以确定如何最佳地设计和训练编码器 - 解码器压缩器。基于我们的发现,我们持续预训练了一组包含 0.6B 编码器和 4B 解码器的模型,每个模型在超过 350B token 的数据上进行训练,压缩比分别为 1:4、1:8 和 1:16。我们引入了潜在上下文语言模型 (Latent Context Language Models, LCLMs),这是一类压缩器,在通用任务性能、压缩速度和峰值内存使用方面改进了帕累托前沿。我们证明了 LCLMs 可作为长时程代理的高效骨干,使代理能够浏览压缩后的长上下文,并根据需求自适应地扩展相关片段。
Abstract
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 3.5/10 | 5.2 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.5/10 | 5.2 |
评分理由: The paper focuses on compressing KV caches for long-context LLMs using encoder-decoder architectures (LCLMs). It shows low relevance to Multimodal keywords (MLLM, MultiModal, Visual Encoder) as it is text-only. Moderate relevance to World Models and Model-Based RL due to 'agent' applications, but the core contribution is inference compression rather than world modeling or RL algorithms. No expert authors from the specified list were found, so no bonus points were added.
关键词
Context Compression, Long-context Language Models, KV Cache, Encoder-Decoder, Latent Embeddings, Inference Efficiency, Long-horizon Agents
摘要翻译
通用大型语言模型 (LLM) 能否以资深化学家的精度设计分子?当前的基于 LLM 的框架通过标量反馈回路(生成、评分、拒绝)来回答这一问题,这本质上是有指导的试错。本文表明,用第一性原理计算得出的完整物理化学理据取代单个数值,可将 LLM 从随机采样器转变为因果推理器。我们的系统将检索增强生成 (RAG) 与自反思模块耦合,该模块将轨道能、原子电荷和电子密度(而非压缩分数)反馈回设计回路。在 1.0 至 5.0 eV 的 HOMO-LUMO 能隙目标上,这种结构 - 性质关系 (SPR) 反思实现了低至 0.0003 eV 的偏差,并在中等任务上达到 100% 的成功率,显著优于标量反馈和非反思基线。该框架可无缝推广至偶极矩设计,并在五种不同的 LLM 骨干模型上证明具有稳健性。这些结果确立了一种新范式:当模型不仅知道分子失败,而且知道失败原因时,迭代分子设计便真正成为机理性的。
Abstract
Can a general-purpose large language model design molecules with the precision of a seasoned chemist? Current LLM-based frameworks answer this question with scalar feedback loops (generate, score, reject) that amount to informed trial-and-error. Here we show that replacing a single number with the full physicochemical rationale from first-principles calculations transforms the LLM from a stochastic sampler into a causal reasoner. Our system couples retrieval-augmented generation with a self-reflection module that feeds orbital energies, atomic charges, and electron densities, rather than compressed scores, back into the design loop. On HOMO-LUMO gap targets from 1.0 to 5.0 eV, this structure-property-relationship (SPR) reflection achieves a deviation as low as 0.0003 eV and a 100% success rate on moderate tasks, decisively outperforming scalar-feedback and non-reflective baselines. The framework generalizes seamlessly to dipole-moment design and proves robust across five distinct LLM backbones. These results establish a new paradigm: when the model understands not only that a molecule fails, but why, iterative molecular design becomes genuinely mechanistic.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
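评分理由: The paper focuses on AI for Science (chemical molecular design), using large language models combined with first-principles calculations for self-reflective iteration. The provided keywords mainly concern multimodal large models, world models, and reinforcement learning architectures, which differ markedly from the paper's research area. Keywords such as Visual Encoder, MultiModal, and World Models are entirely unrelated to the paper's content (0-1 points); Tokenizer is implicit in LLMs but not a core contribution; Unify Models and model-based RL show only weak procedural similarity (such as generalization across multiple backbones or iterative feedback), so they score low (2 points).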
关键词
Molecular Design, Large Language Models, Self-Reflection, Physicochemical Rationale, Structure-Property Relationship, Iterative Design, First-Principles Calculations
摘要翻译
神经组合优化近期利用扩散模型和一致性模型等生成模型,在欧几里得旅行商问题(TSP)上取得了显著成果。诸如 FT2T 等最先进的方法结合了基于快速一致性的预测与基于梯度的推理时间精炼。然而,梯度搜索往往会产生显著的计算开销,且可能与可行解的离散结构不相符。我们提出了一种即插即用、无需重新训练的替代方案——投影一致性推理(PCI),该方案用结构感知的投影取代了梯度精炼:PCI 从一致性模型的输出中解码有效的哈密顿回路,并应用轻量级局部搜索(例如 2-opt)。PCI 在 500 城市的 TSP 上实现了 0.17% 的平均最优性差距(OG),在 1000 城市的 TSP 上实现了 0.31%,优于 FT2T 的最佳设置(OG 分别为 0.22% 和 0.36%),同时将推理时间减少了 30% 至 40%。PCI 还表现出更低的方差和内存占用,且在快速生成解决方案方面能够超越 LKH3 等经典启发式算法。我们的结果表明,结构感知的推理时间操作为神经 TSP 求解器提供了一条实用且合理的路径,与训练时间目标相辅相成。
Abstract
Neural combinatorial optimization has recently achieved strong results on the Euclidean Traveling Salesman Problem (TSP) using generative models such as diffusion and consistency models. State-of-the-art approaches like FT2T combine fast consistency-based prediction with gradient-based inference-time refinement. However, gradient search often incurs significant computational overhead and may not align with the discrete structure of feasible solutions. We introduce Projected Consistency Inference (PCI), a plug-and-play, retraining-free alternative that replaces gradient refinement with structure-aware projections: PCI decodes valid Hamiltonian tours from the consistency model output and applies a lightweight local search (e.g., 2-opt). PCI achieves an average optimality gap (OG) of 0.17% on TSP with 500 cities, and 0.31% on TSP with 1000 cities, outperforming FT2T's best settings (OG 0.22% and 0.36%, respectively) while reducing inference time by 30 to 40%. PCI also exhibits lower variance and memory usage, and can surpass classical heuristics such as LKH3 in rapid solution generation. Our results demonstrate that structure-aware inference-time operations provide a practical and principled path for neural TSP solvers, complementing training-time objectives.
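The abstract names 2-opt as an example of PCI's lightweight local search. The sketch below is a textbook 2-opt pass over an already valid tour, not PCI's implementation; decoding a tour from the consistency model output is omitted, and the four-city instance is a made-up example.

```python
import math

def tour_length(points, tour):
    """Total Euclidean length of a closed tour over `points`."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Reverse tour segments while doing so shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city; nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a,b),(c,d) become (a,c),(b,d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour

# Unit square visited in a self-crossing order.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
start = [0, 1, 2, 3]
better = two_opt(points, start)
```

Each accepted move removes a crossing, so the result is always a valid Hamiltonian tour; this is the sense in which such a step respects the discrete structure that gradient refinement may ignore.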
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
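评分理由: The paper focuses on neural combinatorial optimization (TSP solving) and proposes the PCI method on top of diffusion and consistency models. The provided keywords mainly concern multimodal large models (MLLM, MultiModal), world models, and reinforcement learning. The paper involves no visual encoder, tokenizer, or multimodal fusion and uses no reinforcement learning framework, so its relevance to most keywords is low. Although it involves generative models, it is not a typical world model or model-based RL work, hence the low scores. The author list does not include Yang Shi or the other specified experts, so no expert bonus is triggered.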
关键词
Neural combinatorial optimization, Traveling Salesman Problem, Diffusion models, Consistency models, Projected Consistency Inference, Structure-aware projections, Local search
摘要翻译
大语言模型(LLM)在精度关键领域(如技术制图和机械设计)中常产生幻觉,此类领域的输出必须满足严格的几何约束。本文研究从自然语言进行的开放式几何综合:将自由形式描述转换为精确构造,其中实体必须同时满足数十个相互交互的约束。为使此问题可处理,我们发布了 PyGeoX,一种可编程的几何领域特定语言(DSL),可将声明性约束编译为可微损失;同时发布了 PyGeoX-Bench,一个包含 300 个问题的分层套件,具备每约束可验证的奖励。以 PyGeoX 作为验证器,我们识别出一种称为异常值梯度遮蔽(Outlier Gradient Masking)的故障模式:在全局范数奖励下(任何通过单个范数聚合残差的方案,例如 $\exp(-\mathrm{MSE})$),单个异常值约束可抹除所有其他约束的学习信号。为了解决这一问题,我们提出饱和加法奖励(Saturating Additive Rewards, SAR),该机制将奖励分解为有界的每约束项,从而保留部分进展,并在严重违反情况下确保梯度的一致性。相较于几何求解器的自然基线——基于均方误差(MSE)的奖励,SAR 将困难层级的求解率提高了 2.3 倍,所得的 8B 模型在此基准上与规模大得多的前沿系统具有竞争力。我们已在 https://github.com/Huawei-AI4Math/PyGeoX 上发布该引擎、基准及数据。
Abstract
Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, $\exp(-\mathrm{MSE})$), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by $2.3\times$, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.
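The contrast between a global-norm reward and a per-constraint bounded reward can be illustrated with a small numeric sketch. The abstract gives $\exp(-\mathrm{MSE})$ as an example of the former but does not state the exact form of SAR, so the per-term shape `exp(-r**2)` and both function names below are assumptions made only for illustration.

```python
import math


def global_norm_reward(residuals):
    """exp(-MSE): all residuals are aggregated through a single norm."""
    mse = sum(r * r for r in residuals) / len(residuals)
    return math.exp(-mse)


def saturating_additive_reward(residuals):
    """Mean of bounded per-constraint terms (assumed form: exp(-r^2) each)."""
    return sum(math.exp(-r * r) for r in residuals) / len(residuals)


# Four constraints; the last one is badly violated (an outlier).
before = [1.0, 0.0, 0.0, 10.0]   # first constraint still slightly off
after = [0.0, 0.0, 0.0, 10.0]    # first constraint fixed, outlier unchanged

gain_global = global_norm_reward(after) - global_norm_reward(before)
gain_additive = saturating_additive_reward(after) - saturating_additive_reward(before)
```

With the outlier present, both global-norm rewards are vanishingly small, so fixing the first constraint barely changes the reward. The additive reward credits that partial progress because each term saturates independently of the others.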
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
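评分理由: The paper mainly studies reward shaping (SAR) and constraint solving in geometric synthesis; it does not involve multimodal model unification, tokenizer design, visual encoders, world models, or core model-based RL architectures. Although it involves LLMs and reward signals, it departs substantially from the multimodal representation and model architecture directions that the keywords define, so the relevance scores are low.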
关键词
Geometric Synthesis, Reward Shaping, Constraint Satisfaction, LLM Hallucination, PyGeoX, Differentiable Loss, Precision Generation, Solver Residuals
摘要翻译
生成新颖、可行且高质量的研究构想是科学发现中一项重要却具有挑战性的任务。近期基于大型语言模型(LLM)的方法通常依托检索到的文献来生成构想,但检索到的证据通常以扁平文本的形式呈现,例如标题、摘要或概要。此类扁平上下文可能包含冗余或弱相关的信息,同时使得跨论文中问题、方法、机制及发现之间的关系难以识别和追踪。为应对这一挑战,我们提出 Graph2Idea,一种基于知识图谱引导的检索增强科学构想生成框架。Graph2Idea 首先根据输入主题检索论文,将其转换为结构化的知识三元组,并动态构建以目标为中心的知识图谱,以显式化文献间的关系。随后,它提取紧凑的图衍生上下文,在保留目标相关关系证据的同时,减少嘈杂的文本输入。基于这些上下文,一个两阶段生成过程首先识别有前景的研究方向,随后引导 LLM 基于图结构证据综合生成候选构想。在科学构想生成基准上的实验表明,Graph2Idea 在自动评估协议下优于代表性基线模型。与最强基线相比,其新颖性(Novelty)从 0.45 提升至 0.52,质量(Quality)从 0.24 提升至 0.29,可行性(Feasibility)从 0.22 提升至 0.28。这些结果表明,图结构证据有助于 LLM 通过更明确、紧凑且可追溯地重组先验科学来生成研究构想。
Abstract
Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery. Recent Large Language Model (LLM)-based methods often ground idea generation with retrieved literature, but the retrieved evidence is usually provided as flat text, such as titles, abstracts, or summaries. Such flat contexts may contain redundant or weakly relevant information, while making cross-paper relations among problems, methods, mechanisms, and findings difficult to identify and trace. To address this challenge, we propose Graph2Idea, a knowledge graph-guided framework for retrieval-augmented scientific idea generation. Graph2Idea first retrieves papers according to the input topic, transforms them into structured knowledge triples, and dynamically constructs a target-centered knowledge graph to make literature relations explicit. It then extracts compact graph-derived contexts that retain target-relevant relational evidence while reducing noisy textual input. Based on these contexts, a two-stage generation process first identifies promising research directions and then guides the LLM to synthesize candidate ideas from graph-grounded evidence. Experiments on a scientific idea generation benchmark show that Graph2Idea outperforms representative baselines under the automatic evaluation protocol. Compared with the strongest baseline scores, it improves Novelty from 0.45 to 0.52, Quality from 0.24 to 0.29, and Feasibility from 0.22 to 0.28. These results suggest that graph-structured evidence helps LLMs generate research ideas through more explicit, compact, and traceable recombination of prior scientific knowledge.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on scientific idea generation using knowledge graphs and LLMs, which shows low alignment with the multimodal/RL background keywords. 'Unify Models' (3.0) receives moderate credit for unifying retrieval and generation logic; 'MLLM' (2.0) is partially relevant due to LLM usage but lacks multimodal components. 'Tokenizer', 'Visual Encoder', 'World Models', 'MultiModal', and 'model-based RL' (all 1.0) are largely irrelevant as the work is text-only, lacks vision encoders, tokenization specifics, world modeling, or reinforcement learning. No expert authors from the specified list were found. The total weighted score (15.0) is below the dynamic passing threshold (27.8).
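The score table and rationale above imply that each keyword's score is its weight times its relevance, and that the sum is compared against a passing threshold. A minimal sketch of that arithmetic in Python follows; the variable names are illustrative, and the threshold 27.8 is taken as a given input because the report does not say how the "dynamic" value is derived.

```python
WEIGHT = 1.5            # every keyword in the table uses the same weight
PASS_THRESHOLD = 27.8   # quoted from the rationale; its derivation is not stated

# Relevance values (out of 10) copied from the Graph2Idea score table.
relevance = {
    "Unify Models": 3.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 1.0,
    "World Models": 1.0,
    "MLLM": 2.0,
    "MultiModal": 1.0,
    "model-based RL": 1.0,
}

# Per-keyword score = weight * relevance; the total is their sum.
scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())
passed = total >= PASS_THRESHOLD
```

Here `total` reproduces the 15.0 quoted in the rationale, which falls below the threshold.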
关键词
Scientific Idea Generation, Knowledge Graph, Retrieval-Augmented, Large Language Model, Graph-Structured Contexts, Literature Relations, Research Directions
摘要翻译
对扰动的响应是理解物理系统的关键。通过比较系统在略微不同条件下的反应来对比此类响应的能力,提供了一种学习机制。在此,我们提出了一种通用框架——扰动对比物理学习(Perturbative Contrastive Physical Learning,简称 PCPL),该框架中,学习源于通过对输入、边界条件、参数或解释器函数进行受控变化所产生的物理状态之间的可测量对比。PCPL 统一并扩展了先前方法:均衡传播(Equilibrium Propagation)根植于能量系统中自由平衡与微扰平衡之间的对比,而频率传播(Frequency Propagation)则对应于从正弦驱动且经频率解调的响应中提取的对比。我们表明,基于对比的更新既可以反映局部敏感性,也可以反映全局逆问题结构,但无需集中式梯度计算。相反,有效的学习几何结构隐式地源于系统自身的物理响应,使得学习行为能够在无需外部处理器或显式反向传播的情况下涌现。我们在两个平台上展示了 PCPL:(i) 利用测量的位移和力更新键刚度的弹簧网络;(ii) 通过 X 正交分量测量及雅可比矩阵有限差分估计训练的连续变量光子电路。两个平台均成功完成了分类任务的学习。此外,我们还展示了连续变量光子电路可被训练以实现模拟乘法,这标志着迈向更自主物理学习系统的一步。
Abstract
Responses to perturbations are key to understanding physical systems. The ability to contrast such responses by comparing how a system reacts under slightly different conditions provides a mechanism for learning. Here, we introduce Perturbative Contrastive Physical Learning (PCPL), a general framework in which learning emerges from measurable contrasts between physical states produced by controlled changes to inputs, boundary conditions, parameters, or interpreter functions. PCPL unifies and extends prior approaches: Equilibrium Propagation is rooted in contrasts between free and nudged equilibria in energy-based systems, while Frequency Propagation corresponds to contrasts extracted from sinusoidally driven, frequency-demodulated responses. We show that contrast-driven updates can reflect either local sensitivities or global inverse-problem structure, yet do not require centralized gradient computation. Instead, effective learning geometry emerges implicitly from the system's own physical response, allowing learning behavior to arise without an external processor or explicit backpropagation. We demonstrate PCPL on two platforms: (i) spring networks that update bond stiffness using measured displacements and forces, and (ii) continuous-variable photonic circuits trained via x-quadrature measurements and finite-difference estimates of the Jacobian. Both platforms successfully learn classification tasks. We further show that a continuous-variable photonic circuit can be trained to implement analog multiplication, illustrating a step toward more autonomous physical learning systems.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
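评分理由: The paper focuses on contrastive learning in physical systems (spring networks, photonic circuits), a domain mismatch with the keyword set (multimodal large models, reinforcement learning). Only 'Unify Models' is moderately relevant, because the paper explicitly describes unifying prior approaches; 'model-based RL' is weakly relevant because the work involves learning in physical models; the remaining keywords are unrelated. The author list does not include any of the specified experts.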
关键词
Perturbative Contrastive Physical Learning, Physical Systems, Contrastive Learning, Equilibrium Propagation, Photonic Circuits, No Backpropagation, Spring Networks
摘要翻译
贝叶斯优化(BO)是样本高效设计的核心工具,而潜在空间贝叶斯优化(LSBO)将其扩展至分子和蛋白质等结构化对象。与此同时,表格基础模型(如 TabPFN 和 TabICL)现已实现最先进的回归性能,并日益被用作 BO 的代理模型。由于它们的贝叶斯行为是由大规模合成预训练数据诱导产生的,因此该预训练分布的构成至关重要。LSBO 造成了特有的不匹配:从潜在代码到目标值的诱导映射与用于训练当前上下文模型(in-context models)的回归任务存在显著差异。我们通过补充表格基础模型代理模型的预训练阶段来解决这一不匹配,即在分子变分自编码器(VAE)的潜在空间上定义合成优化任务。继续预训练的目标函数包含一个正则化器,该正则化器将模型锚定在原始检查点上,保留其广泛的回归先验,同时避免对适应任务过度专业化。在留出的分子优化基准上,所得模型表现出强劲的性能,支持了针对 LSBO 的特定适应对于上下文代理模型的相关性。
Abstract
Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art regression performance and are increasingly used as BO surrogates. Because their Bayesian behavior is induced by large synthetic pretraining collections, the composition of this pretraining distribution is crucial. LSBO creates a distinctive mismatch: the induced map from latent code to objective value differs markedly from the regression tasks used to train current in-context models. We address this mismatch by complementing the pretraining stage of tabular foundation model surrogates with synthetic optimization tasks defined on the latent space of a molecular VAE. The continued-pretraining objective features a regularizer that anchors the model to the original checkpoint, preserving its broad regression prior while avoiding overspecialization to the adaptation tasks. On held-out molecular optimization benchmarks, the resulting model achieves strong performance, supporting the relevance of LSBO-specific adaptation for in-context surrogates.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on Bayesian Optimization and Tabular Foundation Models, showing low relevance to Multimodal/RL keywords. 'MLLM' and 'Unify Models' have minor relevance due to foundation model adaptation. No specified expert authors are present. Weighted score: 15.0 (below dynamic pass score 27.8).
关键词
Bayesian Optimization, Latent Space, Tabular Foundation Models, In-Context Learning, Molecular VAE, Surrogate Models, Synthetic Optimization, Pretraining Adaptation
摘要翻译
条件生成器(Conditional Generators)为可控生成提供了自然工具,包括所需条件为观测属性或实验因子新组合的场景。在许多应用中,尤其是在科学领域,这类模型对于探索真实样本稀少、昂贵或尚未观测到的条件极具吸引力。然而,这为评估带来了循环性问题:标准条件质量度量需要参考目标分布(Reference Target Distribution),但在外推情形(Extrapolative Regime)下,该分布按定义是不可用的。我们提出了一种事后(Post-hoc)、样本级的信任分数(Trust Score),仅利用训练分布来评估条件样本,以解决这一问题。该分数结合了两个可估计的量:全局真实性(Global Realism),衡量与真实数据流形(Real Data Manifold)的兼容性;以及属性级忠实度(Attribute-wise Faithfulness),衡量样本是否比合理替代方案更接近所请求的属性。我们表明,在观测属性满足温和覆盖条件(Coverage Condition)下,该分数能够恢复外推生成之间的有意义比较。这些比较使得生成结果的有效过滤、排序和拒绝成为可能,并且可直接应用于现成预训练模型(Off-the-shelf Pretrained Models)。在生物成像中,选定的样本更好地保留了真实形态结构(Morphological Structure),并提升了下游预测性能;而在受控视觉基准(Controlled Vision Benchmarks)上也观察到类似的增益。最后,我们展示了该分数如何应用于生成过程中,从而允许在完全解码(Full Decoding)前进行拒绝生成。代码可在 https://github.com/berkerdemirel/faithful-cond-gen 获取。
Abstract
Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive for exploring conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that the score can recover meaningful comparisons across extrapolated generations, under a mild coverage condition on the observed attributes. These comparisons enable effective filtering, ranking, and abstention of generations and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
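评分理由: The paper's core contribution is a post-hoc trust score for assessing the sample quality of conditional generative models under compositional distribution shift; it is mainly concerned with evaluation metrics rather than model architecture or learning paradigms. It is therefore essentially unrelated to Unify Models, Tokenizer, World Models, MLLM, and model-based RL (relevance 1.0). Because the applications involve visual data (biological imaging, vision benchmarks), there is a weak link to Visual Encoder (relevance 2.0); the conditional generation task touches on multimodal contexts, giving a slight link to MultiModal (relevance 3.0). The weighted total of 15.0 is below the dynamic passing score of 27.8, indicating low relevance to the given research background.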
关键词
Conditional Generation, Compositional Shift, Sample Quality, Trust Score, Global Realism, Attribute-wise Faithfulness, Extrapolative Regime
摘要翻译
女书(Nüshu)是一种濒危的表音文字,历史上曾在中国湖南省南部江永县的女性中使用。尽管现有的女书计算研究主要集中在文本数字化和视觉识别方面,但其真实发音的声学重建在很大程度上仍鲜有探索。构建女书文本转语音(TTS)系统尤为具有挑战性,因为现有的录音资源极其有限,且大多仅包含孤立的音节级发音,而非自然的句子级语句。本文介绍了 NüshuVoice,这是首个面向女书的 TTS 基准。我们构建了一个句子级的女书文本转语音数据集,该数据集对齐了标准化的 Unicode 女书文本、音标转录、标准中文翻译以及档案录音。为了在这种极端低资源条件下合成语音,我们提出了 Nüshu-PitchVITS,这是一种 F0 条件化的 VITS 框架,利用女书的五度标记法作为显式的韵律归纳偏置。实验结果表明,Nüshu-PitchVITS 在频谱保真度、音高重建以及人工评估的可理解性方面均优于强 TTS 基线。我们已公开发布该数据集及代码,网址为:https://anonymous.4open.science/r/Nvshu-TTS-2EB6.
Abstract
Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
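评分理由: The paper addresses speech synthesis (TTS) for the endangered script Nüshu using a pitch-aware VITS framework. The provided keyword set (Unify Models, World Models, MLLM, model-based RL, etc.) mainly targets multimodal large models and reinforcement learning, which is a poor match for this topic. The paper does not involve world models, large language models, or reinforcement learning; it does involve text and audio (MultiModal) but is not a multimodal large model; it uses no visual encoder (the input is text); the tokenizer is not a core contribution; and Unify Models is not a focus. The relevance scores are therefore very low, and the weighted total of 15.0 is below the dynamic passing score of 27.8. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).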
关键词
Text-to-Speech, Nüshu, Pitch-Aware, VITS, Low-resource, Endangered Script, Prosodic Inductive Bias, Spectral Fidelity
摘要翻译
基于扩散的生成模型在真实世界图像超分辨率(SR)任务中取得了显著成就。借助分块扩散技术,这些模型能够生成超出其原生支持分辨率的高分辨率图像。然而,此类高分辨率(例如 $2048^2$)输出的质量通常仍然非常差,主要归因于以下两个因素:图像上采样比率(例如 $\times8$)超过了模型原生支持的上采样比率(例如 $\times4$),以及模型的原生支持分辨率限制。实际上,训练原生高分辨率模型需要更大的网络架构,这会带来显著的计算开销和 GPU 内存成本,使得在资源受限的设备上难以实现。因此,本文提出了 TUDSR(Twice Upsampling-Diffusion),一种用于更高超分辨率的双次上采样 - 扩散框架。TUDSR 框架主要包含两个阶段:第一阶段是在 $R$ 分辨率下进行训练,第二阶段则是在 $NR$ 分辨率下引入一种循环分块训练策略。每个阶段均采用包含生成器和判别器的一步生成对抗网络(GAN)架构。基于 SD2.1-base,我们开发了 TUDSR-S,其在多个基准测试上均达到了最先进的性能。大量实验进一步表明,TUDSR-S 能够在 $1024^2$ 乃至 $2048^2$ 的分辨率下生成高质量图像,显著优于现有方法。代码开源地址为 https://github.com/wuer5/TUDSR。
Abstract
Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., $2048^2$) outputs often remains extremely poor, which we attribute primarily to two factors: the image upsampling ratio (e.g., $\times8$) exceeding the model's native-supported upsampling ratio (e.g., $\times4$), and the limit of the model's native-supported resolution. In practice, training a native high-resolution model requires larger architectures, which incur significant computational overhead and GPU memory costs, making training impractical on resource-limited hardware. Thus, we present TUDSR, a Twice Upsampling-Diffusion framework for higher SR. The TUDSR framework mainly consists of two stages: the first involves training at $R$-resolution, and the second introduces a looped chunk-based training strategy at $NR$-resolution. Each stage adopts a one-step GAN architecture comprising a generator and a discriminator. Based on SD2.1-base, we develop TUDSR-S, which achieves state-of-the-art performance across multiple benchmarks. Extensive experiments further demonstrate that TUDSR-S generates high-quality images at resolutions of $1024^2$ and even $2048^2$, significantly outperforming existing approaches. Code is available at https://github.com/wuer5/TUDSR.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on image super-resolution using diffusion and GAN models, belonging to computer vision rather than the multimodal LLM, world model, or RL domains specified in the keywords. While it uses SD2.1-base (which implies a visual encoder), there is no tokenization, multimodal integration, or reinforcement learning involved.
关键词
Diffusion Models, Super-Resolution, Twice Upsampling, GAN Architecture, High-Resolution Generation, SD2.1-base, Image Processing
摘要翻译
自监督数据策展为扩展和提升机器学习模型的泛化能力提供了一条途径。通过利用自监督学习(SSL)进行数据策展,可以有效满足基础模型对大规模训练数据集的需求。SSL 大大减轻了与标注和人工数据策展相关的成本,同时最大限度地减少了对人工监督的需求。然而,SSL 策展数据集的完整性必须经过严格检查,因为依赖匿名且未经审查的外部来源会显著增加数据投毒的风险。本文提出了一种中毒数据检测器(PDD),这是一种主动防御机制,旨在确保基础模型训练之前 SSL 策展数据集的完整性。PDD 的设计结合了预训练的 ImageBind 模型和传统分类器,包括随机森林(RF)、k-近邻(KNN)、朴素贝叶斯(NB)和支持向量机(SVM)。我们使用来自三个不同数据集的 176,200 张图像以及三种不同的对抗攻击(涵盖分布内和分布外场景)对 PDD 进行了严格评估。值得注意的是,SVM-PDD 在分布内(Set3-Set5)和分布外(TrueFace 和 140K RealFace)数据集上均表现出优越的性能。我们的设计展示了强大的可扩展性,并通过集成方法实现了对新对抗攻击检测器的快速集成。
Abstract
Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of machine learning models. By leveraging self-supervised learning (SSL) for data curation, the demand for massive training datasets required by foundation models can be effectively met. SSL greatly alleviates the costs associated with annotation and manual dataset curation while minimizing the need for human oversight. However, the integrity of SSL-curated datasets must be rigorously checked, as reliance on anonymous and unvetted external sources can substantially increase the risk of data poisoning. In this paper, we propose a Poisoned Data Detector (PDD), an active defense mechanism designed to ensure the integrity of SSL-curated datasets prior to foundation model training. PDDs are designed using a combination of the pretrained ImageBind model and traditional classifiers, including Random Forest (RF), k-Nearest Neighbors (KNN), Naive Bayes (NB), and Support Vector Machines (SVM). We rigorously evaluated PDDs using 176,200 images from three diverse datasets and three different adversarial attacks encompassing both in-distribution and out-of-distribution scenarios. Notably, SVM-PDD achieves superior performance for both in-distribution (Set3-Set5) and out-of-distribution (TrueFace and 140K RealFace) datasets. Our design demonstrates strong scalability and enables the rapid integration of new adversarial attack detectors through an ensemble approach.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
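评分理由: The paper centers on self-supervised data security and poisoning detection, using ImageBind and traditional classifiers. The keywords mostly concern model architecture (Tokenizer, Visual Encoder), reinforcement learning (model-based RL), and world models (World Models), which have little connection to this topic. ImageBind is multimodal, so the MultiModal relevance is somewhat higher, but overall relevance remains low.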
关键词
Self-supervised learning, Data curation, Foundation models, Poisoned data detector, ImageBind, Adversarial attacks, Data security, Robustness
摘要翻译
临床超声图像常包含人工标记,如测量卡尺和文本,旨在辅助诊断解读与比较。然而,这些标记可能在下游自动化分析中引入捷径偏差,促使深度学习模型过度依赖与标记相关的线索,而非具有临床意义的解剖结构。现有的标记去除方法要么依赖掩码且易受误差传播影响,要么是无掩码确定性恢复器,这可能导致超声纹理过度平滑,并扰动未受影响的背景区域。为应对这些挑战,我们提出 Echo-DM,一种基于条件潜在扩散和区域感知融合的超声标记去除框架。Echo-DM 遵循通用的编码器 - 扩散 - 解码器管道:其中,基于 DiT(扩散变换器)的条件潜在扩散网络负责全局恢复,而区域感知融合模块则在端到端无掩码推理下强制执行保真感知的图像空间细化。在此基础上,我们进一步分别利用基于 VAE(变分自编码器)和基于 RAE 的潜在模块实例化 Echo-DM-V 和 Echo-DM-R,这表明 Echo-DM 架构兼容多种潜在模块实例化。在 Echo-PAIR(大规模配对临床超声数据集)上的广泛实验表明,与代表性的两阶段基线方法相比,该方法在标记去除和解剖保真度方面表现更优,同时在各种部署设置中提供了优良的质量 - 效率权衡。数据、代码及模型将在 https://github.com/MiliLab/Echo-DM 上发布。
Abstract
Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues rather than clinically meaningful anatomy. Existing marker removal methods are either mask-dependent and vulnerable to error propagation, or mask-free deterministic restorers that may over-smooth ultrasound texture and perturb unaffected background regions. To address these challenges, we present Echo-DM, a framework for ultrasound marker removal via conditional latent diffusion and region-aware fusion. Echo-DM follows a common encoder-diffusion-decoder pipeline, where a DiT-based conditional latent diffusion network performs global restoration and a region-aware fusion module enforces preservation-aware image-space refinement under end-to-end mask-free inference. Building on this fixed core design, we further instantiate Echo-DM-V and Echo-DM-R with VAE-based and RAE-based latent modules, respectively, which demonstrates that the Echo-DM architecture is compatible with diverse latent-module instantiations. Extensive experiments on Echo-PAIR, a large-scale paired clinical ultrasound dataset, demonstrate superior marker removal and strong anatomical fidelity compared with representative two-stage baselines, while providing favorable quality-efficiency trade-offs across deployment settings. Data, code, and models will be released at https://github.com/MiliLab/Echo-DM.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
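评分理由: The paper focuses on ultrasound marker removal, which falls under medical image processing. Architecturally it involves visual encoding and latent spaces (relevant to Tokenizer and Visual Encoder), but it does not touch the core concepts of world models, multimodal large language models, reinforcement learning, or unified model architectures. The author list does not include any of the specified experts.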
关键词
Ultrasound Marker Removal, Conditional Latent Diffusion, Region-Aware Fusion, DiT-based Architecture, Mask-free Inference, VAE-based Latent Module, Medical Image Restoration
摘要翻译
基础模型(Foundation Models, FMs)正日益被用作语言、视觉、时序及多模态应用中下游任务的骨干网络。然而,现有的模型服务系统将每个定制任务部署为独立的模型实例,导致重复使用重量级骨干网络,浪费加速器内存,并丧失了分摊批处理和加载成本的机会。本文提出了 FMplex,这是一个将基础模型骨干网络视为虚拟化底层以实现部署共享的服务系统。FMplex 为每个任务呈现一个虚拟基础模型(vFM),这是一个由共享物理基础模型支持的逻辑上私有的基础模型实例。这种抽象使得独立定制的任务能够共享骨干网络,同时保留任务特定的扩展、独立的生命周期以及任务级隔离。此外,我们提出了一种感知批次的公平队列调度器,该调度器将加权任务级共享与共置任务间的任务内批处理相结合。我们实现了一个基于 FMplex 的服务栈,涵盖任务构建、感知共享的部署以及运行时执行。在 7 个基础模型骨干网络(16 种变体)和 92 个下游任务上,FMplex 相比空间分区方案可将延迟降低高达 80%,相比尽力而为的共置方案降低 33.3%,同时在集群规模上可托管多达 6 倍的任务。
Abstract
Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement an FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
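评分理由: The paper presents the FMplex system, which focuses on serving architecture and resource virtualization for foundation models and aims to address memory waste and latency in multi-task deployment. The scoring keywords mainly concern model architecture (Tokenizer, Visual Encoder), learning paradigms (World Models, model-based RL), and model unification (Unify Models), whereas this paper belongs to systems and serving and does not address model internals or training algorithms, so relevance is generally low. Only 'MultiModal' has moderate relevance because the abstract mentions multimodal applications; 'Unify Models' and 'MLLM' have weak relevance because the work touches on foundation model sharing and the foundation model category. The author list does not include any of the specified experts.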
关键词
Foundation Models, Model Virtualization, Serving System, Shared Backbones, Batch-aware Scheduler, Latency Reduction, Task Isolation
摘要翻译
安全评判器正被越来越多地部署用于依据不断演化的标准评估模型输出,但最近的元评估工作表明,它们在提示词和评分标准的变化下仍显脆弱,仅风格扰动一项就报告了高达 0.24 的假阴性率波动。我们认为,安全评判本质上是一个遵循评分标准的问题:稳健的评判器必须在不同的评分标准表述中一致地应用给定的评估准则,而非仅仅记忆某个特定模板。我们提出一种训练策略,该策略结合了 (i) 基于提示词 - 响应 - 标签三元组生成的实例条件动态评分标准,以使评判器暴露于评估标准的变异性中;以及 (ii) 一种从可靠到表达性的课程学习策略,该策略始于干净的固定评分标准监督,并逐步引入噪声更大的动态评分标准数据。我们在一个单一人工标注数据集上,针对三种对比鲜明的评分标准提示词(HarmBench 风格、ShieldGemma 风格以及领域特定评分标准)进行了评估。我们的 12B 参数课程学习评判器在三个评分标准上实现了 94.12%-94.88% 的准确率,跨评分标准范围仅为 0.76,在峰值准确率和稳定性上均优于通用大语言模型 (LLM)、专用安全分类器以及高达 30B 参数的面向推理评判器。消融实验表明,盲目地将动态评分标准混合到监督微调 (SFT) 中会增加跨评分标准方差(1.44 -> 3.60);唯有课程学习调度策略才能恢复并超越固定评分标准基线(方差 0.76)。
Abstract
Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false-negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12-94.88% accuracy across the three rubrics with a cross-rubric range of only 0.76, outperforming general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges up to 30B in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT increases cross-rubric variance (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed-rubric baseline (variance 0.76).
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
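评分理由: The paper focuses on the robustness of safety evaluation models (safety judges) under rubric variation, using curriculum learning and dynamic rubrics. The given keyword set mainly covers multimodal architectures (Visual Encoder, MultiModal, MLLM), world models (World Models), and model-based reinforcement learning (model-based RL), whereas this paper is text-based LLM evaluation and involves no visual components, world model construction, or model-based RL architecture, so relevance is low. There is only a weak link through the LLM foundation (MLLM) and conceptual unification (Unify Models). No specified expert authors are included.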
关键词
Safety Judges, Rubric-Following, Curriculum Learning, Dynamic Rubrics, Stability, LLM Evaluation, Prompt Variation
摘要翻译
多光谱点云(MPC)由三维空间 - 谱信息构成,在准确的土地覆盖分类方面具有巨大潜力。然而,分类模型的表征能力受限于固有的高维异构空间 - 谱信息、样本分布不平衡以及机载 MPC 的类间谱相似性。我们构建了两个 MPC 数据集,并提出了一种基于 Attention(注意力)的增强几何 - 谱特征学习框架,用于机载 MPC 分类。模型中的一个关键组件是具有 Attention 机制的双流特征融合方法,该方法增强了从高维异构 MPC 中提取的空间 - 谱特征的表征能力。第一个流旨在利用融合 Self-Attention(自注意力)提取位置编码全局谱特征,第二个流则由多核 Point Convolution(点卷积)和 Feature Aggregation Attention(特征聚合注意力)组成,用于提取谱引导几何特征。随后,我们开发了一个残差 Attention Fusion Block(残差注意力融合模块),以整合来自两个并行流中最具信息量的几何 - 谱特征。本工作的另一重要贡献在于提出了一种 Joint Loss Function(联合损失函数),旨在提高模型在不平衡及类间相似样本上的学习能力。在两个机载 MPC 数据集上的实验结果表明,与 State-of-the-Art(最先进)方法相比,所提方法具有有效性。此外,本文所使用的代码和数据集将免费公开于 https://github.com/HITlixian/TGRS_GSFF。
Abstract
Multispectral point clouds (MPCs) contain 3D spatial-spectral information and hold tremendous potential for accurate land-cover classification. However, the representation power of classification models is limited by the inherently high-dimensional and heterogeneous spatial-spectral information, unbalanced sample distribution, and inter-class spectral similarity of airborne MPCs. We build two MPC datasets and propose an enhanced, attention-based geometric-spectral feature learning framework for airborne MPC classification. A key component of our model is a two-stream feature fusion method with attention mechanisms, which enhances the representation capability of spatial-spectral features from high-dimensional heterogeneous MPCs. The first stream extracts position-encoded global spectral features with fusion self-attention, and the second stream comprises a multikernel point convolution and feature aggregation attention to extract spectral-guided geometric features. We then develop a residual attention fusion block to integrate the most informative geometric-spectral features from the two parallel streams. Another important contribution of this work is a joint loss function that improves learning on unbalanced and inter-class similar samples. Experimental results on two airborne MPC datasets demonstrate the effectiveness of the proposed method compared with state-of-the-art methods. Furthermore, the code and datasets used in this paper will be made freely available at https://github.com/HITlixian/TGRS_GSFF.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on remote sensing and airborne multispectral point cloud classification using attention mechanisms and feature fusion. It is largely unrelated to the provided keywords concerning Large Language Models, World Models, Reinforcement Learning, and Tokenizers. Only MultiModal has moderate relevance due to the fusion of spatial and spectral data. No expert authors from the specified list are present in the authorship.
关键词
Multispectral Point Cloud, Geometric-Spectral Feature Learning, Attention Mechanisms, Two-Stream Feature Fusion, Land-cover Classification, Point Convolution, Residual Attention Fusion
摘要翻译
直播流的实时视频修复(VR)需要在严格的每帧延迟约束下实现高分辨率输出。现有的基于一步扩散的 VR 模型由于两个主要瓶颈,仍难以部署在消费级 GPU 上:高分辨率下的二次空间注意力以及大型视频自编码器的延迟与内存开销。我们提出 SwiftVR,一种流式一步生成式 VR 框架,该框架在因果分块协议下同时降低了这两个瓶颈。在注意力机制方面,无掩码移位窗口自注意力通过确定性索引将每个空间窗口聚合为密集张量,使所有注意力调用均位于密集缩放点积注意力(SDPA)路径上,无需掩码、循环移位、填充或硬件特定的稀疏核。由于 SwiftVR 仅使用标准密集 SDPA 调用,训练好的模型可直接部署到消费级 GPU 上,无需重新训练或自定义核。在自编码方面,轻量级的感知修复自编码器(Restoration-aware Autoencoder)实现了快速分块解码,同时保持了重建质量。在单张 H100 GPU 上,SwiftVR 在 2560x1440 分辨率下维持 31 FPS,在 3840x2160 分辨率下维持 14 FPS,而所有对比的基于扩散的 VR 基线模型在 4K 分辨率下均超出内存限制。在消费级 RTX 5090 GPU 上,SwiftVR 在 1920x1080 分辨率下达到 26 FPS。据我们所知,SwiftVR 是首个在消费级 GPU 上实现实时 1080p 流式传输的生成式 VR 模型,同时具备较强的无参考感知质量,且推理成本更低。项目主页位于 https://h-oliday.github.io/SwiftVR。
Abstract
Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions and the latency-memory overhead of large video autoencoders. We present SwiftVR, a streaming one-step generative VR framework that reduces both bottlenecks under a causal chunk-wise protocol. For attention, mask-free shifted-window self-attention gathers each spatial window into a dense tensor via deterministic indexing, keeping all attention calls on the dense scaled dot-product attention (SDPA) path without masks, cyclic shifts, padding, or hardware-specific sparse kernels. Because SwiftVR uses only standard dense SDPA calls, the trained model transfers to consumer GPUs without retraining or custom kernels. For autoencoding, a lightweight Restoration-aware Autoencoder enables fast chunk-wise decoding while preserving reconstruction quality. On a single H100, SwiftVR sustains 31 FPS at 2560x1440 and 14 FPS at 3840x2160, whereas all compared diffusion-based VR baselines exceed the memory limit at 4K. On a consumer RTX 5090, SwiftVR reaches 26 FPS at 1920x1080. To our knowledge, SwiftVR is the first generative VR model to achieve real-time 1080p streaming on a consumer-grade GPU, while attaining strong no-reference perceptual quality with lower inference cost. The project page is available at https://h-oliday.github.io/SwiftVR.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
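评分理由: The paper focuses on video restoration (computer vision) and is essentially unrelated to multimodal large language models (MLLM), reinforcement learning (model-based RL), and tokenizers (Tokenizer). Although it touches on generative models (World Models) and visual encoding (Visual Encoder), it does not involve multimodal fusion or a unified model architecture, so its relevance is low.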
关键词
Video Restoration, One-Step Generative, Real-Time, Shifted-Window Self-Attention, Autoencoder, Streaming, Consumer GPUs
摘要翻译
现代目标检测器在标准基准测试上取得了优异的性能,然而它们对上下文变化的鲁棒性仍未被充分理解。先前的评估主要依赖于聚合指标(如 AP)在非受控分布偏移上的表现,这可能会掩盖上下文变化导致性能下降的具体机制。我们引入了 ContextShift,这是一个受控基准,能够在保持物体外观不变的情况下,系统性地操纵物体与上下文之间的关系。该基准基于 COCO 2017 构建,通过几何变换、合成背景替换及自然背景替换,将上下文隔离为独立变量,其中包括基于归一化点互信息(NPMI)的连续兼容性轴。在多种检测器架构下,我们观察到一致的退化模式:假阴性数量增加高达 227%,预测框数量减少高达 44%,而假阳性数量保持稳定或下降。这种抑制行为未被聚合指标(如 AP)所捕捉,这可能会掩盖显著的召回率损失以及预测动态的变化。进一步分析表明,性能退化并非主要由置信度降低驱动,而是由有效检测候选框生成减少所致。此外,沿统计兼容性轴的性能呈现非单调性,在中间 NPMI 值处达到峰值,并向两端退化,这表明统计共现与有效视觉上下文之间并不存在线性相关性。最后,我们展示了上下文感知增强可以提高鲁棒性:所有增强变体在原始及修改后的测试图像上均优于仅使用数据集的基线,通过在训练过程中使模型暴露于物体 - 上下文解耦,部分恢复了因预测抑制失败而损失的性能。
Abstract
Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP on uncontrolled distribution shifts, which can obscure how performance degrades under context change. We introduce ContextShift, a controlled benchmark that systematically manipulates object-context relationships while preserving object appearance. Built on COCO 2017, it isolates context as an independent variable through geometric transformations and synthetic and natural background substitutions, including a continuous compatibility axis based on normalized pointwise mutual information (NPMI). Across diverse detector architectures, we observe a consistent degradation pattern: false negatives increase by up to 227% and prediction volume decreases by up to 44%, while false positives remain stable or decline. This suppression behavior is not captured by aggregate metrics such as AP, which can mask substantial recall loss and changes in prediction dynamics. Further analysis suggests that degradation is driven less by reduced confidence than by a reduced formation of valid detection candidates. Moreover, performance along the statistical compatibility axis is non-monotonic, peaking at intermediate NPMI and degrading toward both extremes, indicating that statistical co-occurrence does not correlate linearly with effective visual context. Finally, we show that context-aware augmentation improves robustness: every augmented variant outperforms the dataset-only baseline on both original and manipulated test images, partially recovering performance lost to prediction-suppression failures by exposing models to object-context decoupling during training.
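The compatibility axis in the abstract above is built on NPMI. The following is a minimal sketch of the standard NPMI definition, not the benchmark's actual code. It assumes that `x` is an object category and `y` is a background or scene label, and that probabilities are estimated from simple co-occurrence counts; the function names and the counting scheme are illustrative assumptions.

```python
import math


def npmi(p_xy: float, p_x: float, p_y: float) -> float:
    """Normalized pointwise mutual information, bounded in [-1, 1].

    NPMI(x, y) = log(p(x, y) / (p(x) * p(y))) / (-log p(x, y)).
    """
    if not (0.0 <= p_xy < 1.0 and 0.0 < p_x <= 1.0 and 0.0 < p_y <= 1.0):
        raise ValueError("expected 0 <= p_xy < 1 and 0 < p_x, p_y <= 1")
    if p_xy == 0.0:
        return -1.0  # limiting value when x and y never co-occur
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / (-math.log(p_xy))


def npmi_from_counts(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Hypothetical helper: estimate NPMI from raw image counts."""
    return npmi(n_xy / n_total, n_x / n_total, n_y / n_total)
```

A value near 1 means the object and the context almost always appear together, 0 means they are statistically independent, and -1 means they never co-occur.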
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
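评分理由: The paper mainly studies the robustness of object detection under contextual variation and proposes the ContextShift benchmark. It does not involve model unification, tokenizers, world models, MLLMs, or model-based reinforcement learning. Although object detectors use visual encoders, the paper focuses on benchmarking context dependence rather than representation learning or unified model architectures. The author list does not include any of the specified experts, so no expert bonus is awarded. The weighted total is 13.5, below the dynamic passing score of 27.8.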
关键词
Object Detection, Context Dependence, Benchmark, Contextual Variation, Context-aware Augmentation, False Negatives, COCO 2017, Prediction Suppression
摘要翻译
两视图对应学习旨在通过利用图像对之间的潜在差异,区分真实对应点(内点)与虚假对应点(外点)。现有方法主要依赖于基于坐标的几何一致性。然而,它们往往难以处理包含重复结构、无纹理区域或局部相似几何模式场景中的伪一致外点。为了解决这一局限性,我们提出 TriMatch,一种用于两视图对应学习的多源特征融合框架,该框架包含两个部分:特征提取和特征精炼。在特征提取阶段,TriMatch 联合提取几何特征、纹理语义特征和结构语义特征,以提供互补证据用于对应点判别。为了弥合语义特征与几何特征之间的差距,纹理语义特征和结构语义特征分别通过专用的 Texture-Geometric Alignment(纹理 - 几何对齐)和 Structural-Geometric Alignment(结构 - 几何对齐)模块与几何特征对齐。我们进一步引入一个 Semantic-Guided Correspondence Modulation(语义引导对应调制)模块,该模块利用语义信息调制几何特征,以抑制几何上合理但语义上不一致的对应点。在特征精炼阶段,一种 Hierarchical Semantic-Enhanced Correspondence Refinement(层次化语义增强对应精炼)策略逐步建模对应点依赖关系并重新校准多上下文特征响应,从而实现更可靠的内点 - 外点判别。大量实验证明了 TriMatch 的有效性、鲁棒性及泛化能力。
Abstract
Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers) in image pairs by leveraging their underlying differences. Existing methods mainly rely on coordinate-based geometric consistency. However, they often struggle with pseudo-consistent outliers in scenes containing repetitive structures, textureless regions, or locally similar geometric patterns. To address this limitation, we propose TriMatch, a multi-source feature fusion framework for two-view correspondence learning, which consists of two parts: feature extraction and feature refinement. In feature extraction, TriMatch jointly extracts geometric, texture semantic, and structural semantic features to provide complementary evidence for correspondence discrimination. To bridge the gap between semantic and geometric features, texture and structural semantic features are aligned with geometric features through dedicated Texture-Geometric Alignment and Structural-Geometric Alignment modules, respectively. We further introduce a Semantic-Guided Correspondence Modulation module, which modulates geometric features using semantic information to suppress geometrically plausible but semantically inconsistent correspondences. In feature refinement, a Hierarchical Semantic-Enhanced Correspondence Refinement strategy progressively models correspondence dependencies and recalibrates multi-context feature responses, enabling more reliable inlier-outlier discrimination. Extensive experiments demonstrate the effectiveness, robustness, and generalization capability of TriMatch.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
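评分理由: The paper focuses on two-view correspondence learning in computer vision, using a multi-source feature fusion approach. The provided keywords mainly concern large models, world models, and reinforcement learning, which differ substantially from the paper's field (classical CV). Visual Encoder has some relevance as a basic component, and multi-source fusion relates loosely to MultiModal and Unify Models; the remaining keywords are unrelated. The weighted total of 13.5 is below the dynamic passing score of 27.8.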
关键词
Two-View Correspondence Learning, Multi-Source Feature Fusion, Geometric Features, Semantic Features, Inlier-Outlier Discrimination, TriMatch, Texture-Geometric Alignment, Structural-Geometric Alignment
摘要翻译
随着对逼真虚拟人类需求的日益增长,参数化人体模型已成为现代医学、体育和娱乐应用的核心基石。然而,这些模型大多本质上受限:它们仅捕捉皮肤的 3D 表面,无法揭示生成运动的复杂生物力学结构。随着更多应用拓展至生物力学领域,超越皮肤表层的虚拟人类模型的需求日益凸显。传统的软组织模拟方法(如有限元方法 FEM)虽然准确,但不可扩展,且对于大多数常见应用而言计算成本过高。另一方面,现有的生物力学工具可以模拟肌肉力和激活,但无法建模外部形状的变化,从而限制了肌肉激活与实际可观察解剖结构之间的关联。这催生了一个新的逆问题:直接从可见表面观测(即从皮肤,进而从姿态)恢复肌肉变形。在本文中,我们提出了 SOMA(从表面观测到肌肉解剖),这是一个个体特定模型,利用 RGB 相机获取的表面信号推断时空肌肉行为;同时提出了 SKIM,一个受试者特定的软组织变形数据集。据我们所知,这是首个尝试从多视角 RGB 数据恢复肌肉变形的方法。我们展示了该方法如何在无需传统模拟复杂性的情况下,提供基于解剖学依据的动画,从而实现一种可扩展且具有成本效益的解决方案。数据和代码已公开。
Abstract
With the growing demand for realistic virtual humans, parametric body models have become a cornerstone of modern medicine, sports, and entertainment applications. However, most of these models are inherently limited: they only capture the 3D surface of the skin, offering no insight into the complex bio-mechanical structures that generate motion. As more applications expand towards biomechanics, the need for virtual human models that go beyond the skin has become increasingly evident. Traditional soft-tissue simulations, such as FEM, are accurate but non-scalable and too computationally expensive for most common applications. Alternatively, existing biomechanical tools can simulate muscular forces and activations, but do not model changes in external shape, restricting how activations correlate with actual observable anatomy. This motivates a novel inverse research problem: recovering muscle deformations directly from visible surface observations - i.e., from the skin, and thus the pose. In this work, we present SOMA (from Surface Observations to Muscle Anatomy), a person-specific model that infers spatio-temporal muscle behavior from surface signals obtained using RGB cameras, and SKIM, a subject-specific soft-tissue deformation dataset. To the best of our knowledge, this is the first method that attempts to recover muscle deformations from multi-view RGB data. We show how our method provides anatomically grounded animations without the complexity of traditional simulations, leading to a scalable and cost-effective solution. Data and code are available.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
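评分理由: The paper belongs to computer graphics and biomechanics; its core contribution is the SOMA model, which infers muscle anatomy from surface observations. It matches the given AI/ML keywords poorly: it does not involve Tokenizer, MLLM, or reinforcement learning (model-based RL). Although it uses multi-view RGB data (MultiModal, Visual Encoder) and unifies surface and muscle models (Unify Models), it is not a typical application of large models or world models (World Models), so the relevance scores are low.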
关键词
SOMA, Muscle Anatomy, Surface Observations, Multi-view RGB, Soft-tissue Deformation, Biomechanics, Inverse Problem, SKIM Dataset
摘要翻译
奖励黑客行为(Reward Hacking)通常在变得可见之后才被研究,即当模型获得高代理奖励(proxy reward)却未能完成预期任务时。相反,我们在该失败出现之前,研究代理强化学习(proxy RL)所教授的内容。我们引入了代理奖励内部化与机制利用(Proxy Reward Internalization and Mechanistic Exploitation,简称 PRIME),这是一种学习到的能力,用于评估任务正确性、预测代理接受度,并推理可利用的代理 - 黄金差距(proxy--gold gaps)。在具有可利用 pytest 奖励的编码强化学习环境中,我们通过思维链(chain-of-thought)监控、直接探针(direct probes)以及激活层概念向量来测量 PRIME。我们发现,PRIME 在持续的奖励黑客行为之前以阶段性序列出现,且其当前的直接探针分数能够预测后续黑客行为的起始和严重程度,即使可见黑客行为率仍然较低。当评估器发生变化时,PRIME 也会随之适应,重新定向至仍被奖励的任一代理 - 黄金差距;当黄金奖励抑制显式黑客行为时,PRIME 依然持续存在,且消融其激活方向可减少黑客行为。跨检查点来看,域内(in-domain)的 PRIME 能够追踪域外(out-of-domain)的不对齐情况。综上所述,这些结果表明,可利用的代理强化学习放大了位于可见黑客行为上游的代理内部化能力,使得 PRIME 成为更广泛对齐风险的一个候选早期预警信号。
Abstract
Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy-gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a staged sequence before sustained reward hacking, and that its current direct-probe score forecasts later hack onset and severity even when the visible hack rate is still low. PRIME also adapts when the evaluator changes, retargeting to whichever proxy-gold gap remains rewarded and persisting when gold reward suppresses overt hacking, and ablating its activation directions reduces hacking. Across checkpoints, in-domain PRIME tracks out-of-domain misalignment. Together these results suggest that exploitable proxy RL amplifies a proxy-internalization capability upstream of visible hacking, making PRIME a candidate early-warning signal for broader alignment risk.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
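评分理由: The paper focuses on reward hacking in reinforcement learning and on proxy reward internalization (PRIME), placing it in model alignment and interpretability. Most of the provided keywords concern multimodal architectures (Visual Encoder, MultiModal, MLLM, Tokenizer) and model unification (Unify Models), which have little connection to this paper's text/code RL setting. 'model-based RL' is the most relevant because the paper involves reinforcement learning and reward-model learning, although environment dynamics modeling is not its core. 'World Models' has some relevance because the paper involves predicting environment interaction, but it does not concern generative world models. The author list does not include Yang Shi or the other specified experts.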
关键词
Proxy Reward Internalization, Mechanistic Exploitation, Reward Hacking, Alignment Risk, Coding RL, Chain-of-thought, Activation-level concept vectors
摘要翻译
可解释性日益将组件组而非单个单元视为基本对象,并提议通过聚类共激活统计来发现它们。我们质疑这种廉价信号是否真的能识别出 attention-head circuit(注意力头电路)。我们将 sparse-autoencoder(稀疏自编码器)的聚类方法适配到注意力头——但通过 causal ablation(因果消融)而非重构进行验证——随后聚类注意力头,并运行 closure test(闭合测试):消融发现的 community(社区),并将 per-example damage(每例损伤)与 matched-random controls(匹配随机对照)进行比较。在两个稠密 1B 规模模型(Pythia 1B, OLMo 1B)及两种输入分布下,这些社区均通过了 closure test。在 Mixture-of-Experts(专家混合)模型(OLMoE-1B-7B)中,route-conditional clustering(路由条件聚类)恢复了一个统计上真实的信号,然而该信号并未通过 closure test——ablation(消融)反而改善了 loss(损失),这是错误的方向。将 closure test 扩展到训练过程中,attention-target selectivity(注意力目标选择性)和 participation ratio(参与率)与功能在两个方向上均发生解耦。我们得出结论:a cheap signal(廉价信号)仅是一个 circuit proposal(电路提议),而非 a confirmed circuit(已确认的电路);closure test 才是区分二者的关键。
Abstract
Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads (but validating by causal ablation rather than reconstruction), we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure: ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.
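The closure test described above compares the damage from ablating a discovered community with the damage from matched-random controls. The sketch below illustrates one way such a comparison could be scored. The abstract does not specify the statistic, so the mean-damage comparison, the empirical p-value, and the `alpha` threshold are all assumptions, and `closure_test` is a hypothetical name.

```python
from statistics import mean
from typing import Dict, List


def closure_test(community_damage: List[float],
                 control_damages: List[List[float]],
                 alpha: float = 0.05) -> Dict[str, object]:
    """Score a candidate community against matched-random ablations.

    community_damage: per-example loss increase after ablating the community.
    control_damages: one per-example damage list per matched-random control.
    """
    observed = mean(community_damage)
    control_means = [mean(d) for d in control_damages]
    # One-sided empirical p-value with add-one smoothing (an assumed choice).
    exceed = sum(1 for m in control_means if m >= observed)
    p_value = (exceed + 1) / (len(control_means) + 1)
    # Damage must be positive: an ablation that improves loss fails closure.
    passes = observed > 0 and p_value <= alpha
    return {"observed": observed, "p_value": p_value, "passes": passes}
```

Under this scoring, a community whose ablation lowers the loss, as reported for the Mixture-of-Experts model, fails regardless of how unusual its co-activation statistics are.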
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on transformer interpretability, specifically attention head circuit discovery via co-activation clustering and causal ablation validation. It does not address multimodal learning, world models, reinforcement learning, tokenizer design, or visual encoders. The keywords provided relate to multimodal unified architectures and RL, which are largely irrelevant to this NLP interpretability study, resulting in low scores across all categories.
关键词
Attention Heads, Circuit Discovery, Co-activation, Causal Ablation, Closure Test, Interpretability, Transformer Models
摘要翻译
大语言模型(LLM)智能体的长期记忆远不止于在恰当的时间检索正确的段落。当前的记忆系统将信念修正、因果耦合和跨领域抽象压缩为一个仅针对表面回忆调优的单一检索表面,因此在隐式个性化方面面临挑战,而这需要推理用户是如何演变的。我们提出 DCPM,该机制沿认知能力层级重新组织智能体记忆,该层级从原始输入和原子事实开始,历经历时信念轨迹与身份,最终上升至领域模式、潜在意图及跨领域模式。该层级由两个过程驱动,这两个过程继承了双过程理论的架构分裂:一个是同步日间写入器(System1),它将信念修正记录为双重链接的取代链;另一个是异步夜间引擎(System2),它诱导模式和意图,并扫描跨领域碰撞,将其抽象为高层核心模式。在 LongMemEval、PersonaMem 及 PersonaMem-v2 基准测试上,启用 System2 在奖励隐式跨会话推理的场景中贡献最大(在 PersonaMem-v2 上高达 +5.20),而在片段回忆方面贡献最小,这与架构预测相符。
Abstract
Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
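The abstract above says that System1 records belief revisions as doubly linked supersedes chains. A minimal sketch of such a data structure follows. The class name, field names, and helper functions are illustrative assumptions and are not taken from DCPM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through the links
class Belief:
    statement: str
    session: int
    supersedes: Optional["Belief"] = None      # older belief this one replaces
    superseded_by: Optional["Belief"] = None   # newer belief that replaced it


def revise(old: Belief, statement: str, session: int) -> Belief:
    """Append a revision, linking both directions of the chain."""
    if old.superseded_by is not None:
        raise ValueError("only the newest belief in a chain can be revised")
    new = Belief(statement, session, supersedes=old)
    old.superseded_by = new
    return new


def current(belief: Belief) -> Belief:
    """Follow forward links to the belief that holds now."""
    while belief.superseded_by is not None:
        belief = belief.superseded_by
    return belief


def trajectory(belief: Belief) -> List[str]:
    """Return the full chain, oldest first, from any node in it."""
    while belief.supersedes is not None:
        belief = belief.supersedes
    chain = []
    node: Optional[Belief] = belief
    while node is not None:
        chain.append(node.statement)
        node = node.superseded_by
    return chain
```

Keeping both links lets a reader recover the current belief from any stale entry and also replay how the user's position evolved, which is the kind of diachronic reasoning the abstract targets.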
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on a dual-process cognitive memory system (DCPM) for LLM agents, which has low alignment with the provided keywords centered on Multimodal Learning (MLLM, MultiModal, Visual Encoder, Tokenizer) and Reinforcement Learning (model-based RL). 'World Models' has moderate relevance due to internal state modeling, and 'Unify Models' has slight relevance regarding the unification of System 1 and 2 processes. No expert authors from the specified list are found. The weighted total score is 12.0, below the dynamic passing score of 27.8.
关键词
Long-term memory, Dual-process cognitive memory, Self-evolving LLM agents, Belief revision, Cross-domain abstraction, System1 System2, PersonaMem
摘要翻译
大语言模型(LLMs)可被微调,将提示词携带的秘密编码为流畅且看似无害的输出。这产生了一种隐写外泄风险,难以通过输出级隐写分析进行检测。近期工作提出了使用线性探针从内部激活中恢复秘密的机制检测方法。我们表明,这种防御机制可以被系统性地规避,但通过针对性的数据级干预,可恢复其可检测性。首先,我们将检测设置扩展至包含非线性 MLP(多层感知机)探针。随后,我们在五个基座模型(Qwen3-8B、Llama-3.1-8B、Ministral-8B、Qwen3-14B 和 Phi-4-14B)上对抗性微调隐写木马。所得模型在规避岭回归(ridge)探针和留置 MLP 探针的同时,仍保持 58%–79% 的精确匹配秘密恢复率,且在六个基准测试上的平均能力退化仅为 1%–8%。随后,我们从信息论角度对这种规避行为进行了表征。成功的规避在保留秘密可恢复性的同时,降低了秘密从内容对齐表示中的低阶可提取性,迫使载荷与剩余自由度产生协同交互。这促使我们构建一个重新语境化数据集,以限制这些剩余自由度。在该分布上,所有五种规避性木马的岭回归探针和 MLP 可检测性均得以恢复。总体而言,我们的发现表明,基于激活的隐写检测易受自适应规避的影响,但理论指导的评估分布也能揭示原本隐藏的载荷。
Abstract
Large language models can be fine-tuned to encode prompt-borne secrets into fluent, seemingly benign outputs. This creates a steganographic exfiltration risk that is difficult to detect with output-level steganalysis. Recent work proposes mechanistic detection using linear probes that recover the secret from internal activations. We show that this defense can be systematically evaded, but that detectability can be recovered through a targeted data-level intervention. First, we extend the detection setup to include a non-linear MLP probe. We then adversarially fine-tune steganographic trojans across five base models: Qwen3-8B, Llama-3.1-8B, Ministral-8B, Qwen3-14B, and Phi-4-14B. The resulting models retain 58-79% exact-match secret recovery while evading both ridge and held-out MLP probes, with 1-8% average capability degradation across six benchmarks. We then give an information-theoretic characterization of this evasion. Successful evasion preserves recoverability while reducing low-order extractability of the secret from the content-aligned representation, forcing the payload into synergistic interaction with residual degrees of freedom. This motivates a recontextualization dataset that restricts these residual degrees of freedom. On this distribution, both ridge and MLP detectability are restored across all five evasive trojans. Overall, our findings show that activation-based steganography detection is vulnerable to adaptive evasion, but also that theory-guided evaluation distributions can expose otherwise hidden payloads.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on steganography detection and evasion in Large Language Models (LLMs), utilizing multiple base models (Qwen, Llama, etc.) for evaluation. It has no relevance to Visual Encoders, World Models, or Model-Based RL, as it does not involve vision, environment modeling, or reinforcement learning. MLLM and Unify Models receive low scores because the paper only uses multiple text LLMs as testbeds, and Tokenizer scores low because tokenization is not a research focus.
关键词
Steganography, LLMs, Detection, Evasion, Probes, Fine-tuning, Payloads, Information-theoretic
摘要翻译
法庭模拟(Court simulation)架起了法律教育与司法实践之间的桥梁,然而人工模拟成本高且难以规模化。大型语言模型(LLMs)提供了一种可扩展的替代方案,但现有的法庭模拟研究主要集中在刑事案件上。民事诉讼在实践中更为常见,且更难模拟,因为其诉讼请求、责任及救济措施更为灵活。我们提出了一种面向中国民事案件的多智能体(multi-agent)法庭模拟框架。该框架通过五阶段民事审判程序组织基于角色的交互,并整合记忆模块(memory module)和法条检索(statute retrieval)以支持长流程裁判。实验表明,该框架能产生可靠的民事判决,在责任分配和多事项裁判(multi-item adjudication)方面具有明显优势。进一步实验表明,记忆质量显著影响下游模拟质量。通过五层因素框架,我们分析了法律 grounding(legal grounding)、信息条件、司法能力与角色取向、组织压力和社会情境如何影响该框架的可靠性与行为。这些结果支持了所提出的框架在民事法庭模拟中的有效性。数据集和代码可在以下网址获取:https://github.com/foggpoy/Civil-Court.
Abstract
Court simulation bridges legal education and judicial practice, yet human-based simulations are costly and difficult to scale. Large language models (LLMs) offer a scalable alternative, but existing court-simulation research mainly focuses on criminal cases. Civil litigation is more common in practice and harder to simulate because its claims, liability, and remedies are more flexible. We present a multi-agent court simulation framework for Chinese civil cases. The framework organizes role-based interaction through a five-stage civil trial procedure and integrates a memory module and statute retrieval to support long-process adjudication. Experiments show that the framework produces reliable civil judgments, with clear strengths in liability allocation and multi-item adjudication. Further experiments show that memory quality substantially affects downstream simulation quality. Through a five-layer factor framework, we analyze how legal grounding, information conditions, judicial capability and role orientation, organizational pressure, and social context affect the framework's reliability and behavior. These results support the effectiveness of the proposed framework for civil court simulation. The dataset and code are available at: https://github.com/foggpoy/Civil-Court.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
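评分理由: The paper mainly studies multi-agent LLM simulation in the legal domain. It does not involve visual encoders, tokenizer design, or multimodal techniques, so its relevance to the Visual Encoder, Tokenizer, MultiModal, and MLLM keywords is very low (0-1 points). Although it involves a simulated environment (World Models) and framework integration (Unify Models), these sit at the application layer rather than at the core of model architecture, so they are scored 2 points. The author list does not include Yang Shi or the other specified experts.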
关键词
Civil Court Simulation, Large Language Models, Multi-agent System, Chinese Civil Cases, Memory Module, Statute Retrieval, Legal Grounding
摘要翻译
随着人工智能助手每日服务数百万用户,超越通用模型能力来评估用户体验(UX)的重要性日益凸显。本文提出 UXBench,这是首个基于真实用户反馈信号、以用户为中心的基准,旨在评估偏好对齐与对话生成能力。该基准包含三个相互关联的任务:UX Judge、UX Eval 和 UX Recovery,共包含 7,400 个测试实例,这些数据提取自主流中文人工智能助手的超过 7 万条交互日志。该数据集紧密反映了真实用户分布,涵盖 8 个场景、83 个领域以及多样化的失败模式,构成了严峻的挑战。在 26 个前沿语言模型上进行的广泛实验,提供了新颖的见解,揭示了模型感知用户体验的程度,以及模型能力的提升如何促进更好的对话参与度。通过对模型行为及性能差距的全面分析,我们发现用户反馈预测是一种可学习的能力,其中基于真实场景反馈信号训练的奖励模型能够实现校准良好的准确性。此外,我们记录了大语言模型作为评判者(LLM-as-a-judge)评估协议中的系统性偏差,并比较了直接影响用户体验的典型响应策略。UXBench 建立了新的评估范式,呼吁更多关注定制化的用户体验优化,有助于形成以用户为中心的扩展定律,从而塑造人工智能助手的成功。
Abstract
As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
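评分理由: The paper's core contribution is the UXBench benchmark, which evaluates AI-assistant user experience, preference alignment, and dialogue generation based on real user feedback. The provided keywords mainly concern model architecture (Tokenizer, Visual Encoder), specific learning paradigms (World Models, model-based RL), and model unification (Unify Models), which are largely unrelated to the paper's topic (evaluation benchmarks and user feedback). Although the paper evaluates multiple language models (weakly related to Unify Models) and the assistant may involve multimodality (weakly related to MLLM/MultiModal), it does not examine the corresponding techniques in depth. The author list does not include Yang Shi or the other specified experts, so no bonus is awarded. The weighted total is 12.0, below the dynamic passing score of 27.8.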
关键词
User Experience, AI Assistants, Benchmark, Preference Alignment, Dialogue Generation, Real User Feedback, LLM-as-a-judge
摘要翻译
危机既改变了人们的出行方式,也改变了其交流方式。在野火及流行病等紧急事件中,移动模式的变化与在线情感话语共同演变,但通常被孤立研究。本文提出了一种统一且可解释的流程(pipeline),整合移动数据与社交媒体数据,以识别危机情境下的跨域行为模式。该框架通过两个案例研究进行评估:一是 2025 年 1 月洛杉矶野火的短期分析(原型案例),二是 2020 年 3 月至 2021 年 12 月阿联酋 COVID-19 行为的纵向分析(主要案例,共 671 天)。该流程对齐异质的每日信号,将其转化为二元行为状态,应用形式概念分析(Formal Concept Analysis, FCA)提取共现结构,挖掘关联规则,并通过时间顺序保留测试(chronological holdout testing)验证规则稳定性。一个结构化的政策翻译层将稳健的规则转化为操作简报,明确指定触发条件、提前期及行动手册。结果表明,在这两种危机中均存在清晰的跨域行为结构。在野火案例中,交通压力、恐惧/愤怒情感及治理话语在 33 天窗口期内紧密耦合,关键规则的置信度达到 100%,提升度分数高达 2.5。在 COVID 案例中,反复的移动适应与情感波动产生了 8 个稳定的同日规则(保留测试通过率为 88%)以及 40 个干净的预测规则,提前期为 2 至 7 天。本研究证明,可解释的多模态融合既能产出科学可信的危机情报,也能提供可操作的政策建议。
Abstract
Crises alter both how people move and how they communicate. During emergencies such as wildfires and pandemics, changes in mobility patterns and online emotional discourse evolve jointly, yet they are typically studied in isolation. This paper presents a unified and interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires (prototype case) and a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021 (primary case, 671 days). The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structure, mines association rules, and validates rule stability through chronological holdout testing. A structured policy-translation layer renders robust rules as operational briefs specifying triggers, lead times, and action playbooks. Results reveal clear cross-domain behavioral structure in both crises. In the wildfire case, traffic stress, fear/anger sentiment, and governance discourse are tightly coupled within a 33-day window, with key rules reaching 100% confidence and lift scores up to 2.5. In the COVID case, repeated mobility adaptation and sentiment volatility yield 8 stable same-day rules (88% holdout pass rate) and 40 clean predictive rules with 2-7 day lead horizons. The work demonstrates that interpretable multimodal fusion can produce both scientifically credible and policy-actionable crisis intelligence.
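The abstract above reports rule confidence and lift over binary daily behavioral states. The sketch below shows the standard definitions of support, confidence, and lift for a single rule. The state names and the ten-day toy series are invented for illustration and are not data from the paper; the FCA step and the chronological holdout test are not modelled.

```python
from typing import FrozenSet, List, Set, Tuple


def rule_metrics(days: List[Set[str]],
                 antecedent: FrozenSet[str],
                 consequent: FrozenSet[str]) -> Tuple[float, float, float]:
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    days: one set of active binary states per day.
    """
    n = len(days)
    n_a = sum(1 for d in days if antecedent <= d)
    n_c = sum(1 for d in days if consequent <= d)
    n_ac = sum(1 for d in days if (antecedent | consequent) <= d)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Lift compares the rule's confidence with the consequent's base rate.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Toy series (hypothetical): traffic stress appears on 4 of 10 days,
# always together with fear/anger sentiment, which also occurs on 4 days.
TOY_DAYS: List[Set[str]] = (
    [{"traffic_stress", "fear_anger"} for _ in range(4)]
    + [set() for _ in range(6)]
)
TOY_RULE = rule_metrics(TOY_DAYS,
                        frozenset({"traffic_stress"}),
                        frozenset({"fear_anger"}))
```

On this toy series the rule has confidence 1.0 and a lift of about 2.5, because the consequent holds on only 40% of days overall but on every day where the antecedent holds.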
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
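评分理由: The paper mainly concerns the fusion analysis of mobility data and social media data in crisis scenarios and the identification of behavioral patterns, using methods such as Formal Concept Analysis. It does not involve deep learning techniques such as tokenizers, visual encoders, MLLMs, world models, or reinforcement learning. Only 'MultiModal' (multimodal data fusion) and 'Unify Models' (unified pipeline) are semantically related to some degree, but neither refers to unification at the model-architecture level, so relevance is low. Weighted total: (2.0 + 6.0) * 1.5 = 12.0, far below the dynamic passing score of 27.8.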
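The weighted total quoted in this rationale can be reproduced from the scoring table. The sketch below assumes that each row's score is weight times relevance, that the total is their sum, and that a paper passes when the total reaches the threshold. The report does not explain how the dynamic passing score of 27.8 is derived, and the expert bonus mentioned in other rationales is not modelled.

```python
from typing import List, Tuple

# Threshold quoted in the rationale text; its derivation is not stated.
PASSING_SCORE = 27.8

Row = Tuple[str, float, float]  # (keyword, weight, relevance on a 0-10 scale)


def weighted_total(rows: List[Row]) -> float:
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


def passes(rows: List[Row], threshold: float = PASSING_SCORE) -> bool:
    # Assumption: a total equal to the threshold counts as passing.
    return weighted_total(rows) >= threshold


# Rows copied from the scoring table of the crisis-behavior paper.
CRISIS_ROWS: List[Row] = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 6.0),
    ("model-based RL", 1.5, 0.0),
]
```

With these rows, `weighted_total(CRISIS_ROWS)` gives 12.0, matching the figure in the rationale, and `passes(CRISIS_ROWS)` is false.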
关键词
Crisis Behavior Analysis, Mobility Data, Social Media Data, Multimodal Fusion, Formal Concept Analysis, Policy Translation, Interpretable Pipeline
摘要翻译
基于大语言模型(LLM)的智能体的性能由其基础模型和调节其与环境的交互的 harness 共同塑造。由于不同模型表现出独特的行为,有效的 harness 设计本质上具有模型特定性。然而,智能体 harness 仍主要由人类专家设计,这种范式在现代 LLM 日益多样化和快速演化的背景下扩展性较差。本文提出 Self-Harness,一种新范式,在此范式中,基于 LLM 的智能体可改进其自身的 operating harness,而无需依赖人类工程师或更强的外部智能体。我们将 Self-Harness 具体化为一个包含三个阶段的迭代循环:Weakness Mining(弱点挖掘),从 execution traces(执行轨迹)中识别模型特定的失败模式;Harness Proposal(harness 提案),生成与这些失败相关且多样但最小的 harness 修改;以及 Proposal Validation(提案验证),仅在回归测试后才接受候选修改。我们在 Terminal-Bench-2.0 上实例化 Self-Harness,使用一个最小的初始 harness 和三个来自不同家族的基础模型:MiniMax M2.5、Qwen3.5-35B-A3B 和 GLM-5。在所有三个模型上,Self-Harness 一致提升了性能,held-out(保留集)通过率分别从 40.5% 提升至 61.9%、23.8% 至 38.1% 和 42.9% 至 57.1%。定性分析进一步表明,Self-Harness 并非简单地添加通用指令,而是有效地将模型特定的弱点转化为具体的、可执行的 harness 更改。这些结果表明了一条通向基于 LLM 智能体的路径,这些智能体不仅受其 harness 塑造,还能参与重塑它们。
Abstract
The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.
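The three-stage loop can be illustrated with a toy sketch. Everything here is hypothetical: a harness is modelled as a list of rule strings, `toy_evaluate` and `toy_propose` stand in for real task execution and LLM-generated proposals, and the acceptance criterion (no previously passing task may fail, and at least one new task must pass) is one possible reading of "accepts candidate edits only after regression testing", not the paper's stated rule.

```python
from typing import Callable, List, Set

Harness = List[str]  # hypothetical representation: a list of harness rules


def self_harness_step(
    harness: Harness,
    propose: Callable[[Harness, Set[str]], List[Harness]],
    evaluate: Callable[[Harness], Set[str]],
) -> Harness:
    """One iteration: observe which tasks pass, propose edits, validate each candidate."""
    baseline = evaluate(harness)                 # tasks that currently pass
    for candidate in propose(harness, baseline):
        passed = evaluate(candidate)
        # Regression check: keep all earlier passes and gain at least one task.
        if baseline <= passed and len(passed) > len(baseline):
            return candidate
    return harness                               # no candidate survived validation


def toy_evaluate(harness: Harness) -> Set[str]:
    passed = {"t1"}
    if "check_exit_code" in harness:
        passed.add("t2")
    if "retry_blindly" in harness:               # fixes t3 but breaks t1
        passed.add("t3")
        passed.discard("t1")
    return passed


def toy_propose(harness: Harness, passed: Set[str]) -> List[Harness]:
    return [harness + ["retry_blindly"], harness + ["check_exit_code"]]


improved = self_harness_step([], toy_propose, toy_evaluate)
```

The first candidate is rejected because it regresses `t1`; the second is accepted because it keeps `t1` and adds `t2`.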
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: Paper focuses on LLM agent harness self-improvement, unrelated to multimodal components (Tokenizer, Visual Encoder, MultiModal) or world models. It evaluates multiple models (Unify Models: 3.0) and involves agents (model-based RL: 3.0), but focuses on harness editing rather than environment modeling. MLLM relevance is low (2.0) as tasks are text-based.
关键词
Self-Harness, LLM-based agents, Weakness Mining, Harness Proposal, Proposal Validation, Terminal-Bench-2.0, Iterative loop, Base models
摘要翻译
电商场景中的对话系统通常需要满足多个目标:准确推理用户画像(例如资格、信用额度),以确保正确的决策和用户状态理解,同时还能生成自然且忠实的回复。这些目标虽互补但并不完全相同。本文提出 MORE,一种自适应多目标强化学习(Multi-Objective Reinforcement Learning)框架,联合优化推理准确性和语言自然性。初步实验表明,直接混合具有不同优化动态的奖励会导致振荡和学习不稳定。因此,我们不优化单一的混合奖励,而是将推理函数视为约束,以指导策略优化。在推理阶段,系统直接生成回复,无需显式的推理步骤,同时仍能受益于推理增强的支架,并避免额外的推理开销。为更好地平衡回复生成过程中的语言目标,我们引入了一种自适应多奖励机制,该机制聚合流畅性、自然性等各类信号,并通过梯度反馈动态调整其权重。我们在字节跳动的两个真实对话系统以及 MultiWOZ 2.2 基准上对 MORE 进行评估,结果表明其始终优于强基线。在字节跳动生产流量上的 14 天在线实验中,MORE 使整体转化率和到达转化率分别提升了 16.53% 和 30.09%,同时提高了用户满意度并降低了转接率。值得注意的是,在人机对比实验中,MORE 恢复了人类代理所实现的增量转化率提升的约 60%。
Abstract
Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweighs them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on e-commerce dialogue systems using multi-objective RL, showing low alignment with keywords centered on multimodal world models. 'Unify Models' is loosely related to the title 'One Model', and 'model-based RL' is partially relevant due to RL usage, but 'Visual Encoder', 'MultiModal', and 'MLLM' are not applicable as the work is text-based. Tokenizer and World Models are not core contributions. No expert authors from the specified list were found.
关键词
E-commerce Dialogue Systems, Multi-Objective Reinforcement Learning, Adaptive Multi-Objective Learning, Reasoning Accuracy, Linguistic Naturalness, Policy Optimization, MultiWOZ 2.2
摘要翻译
中文歧视性语言检测颇具挑战性,因为有害意图往往隐含且依赖上下文语境。我们提出 MAAM(近视 - 散光锚点机制),这是一个轻量级、模型无关的框架,灵感源自功能性视觉模糊:该框架并非平等地保留每个标记,而是保留与歧视相关的语义锚点,并利用 C--I--S 上下文先验(Contextual Tone、Group Identity 和 Stance Polarity)对其进行校准。此外,我们还引入了 ChLGBT,据我们所知,这是首个专注于中文 LGBT 群体的歧视性语言数据集,包含 8,120 个手动标注样本及三个有序标签:显性偏见、隐性偏见和情感强度。在多种强编码器基线模型上,MAAM 在所有三个预测维度上均实现了提升,在准确率(Accuracy)、F1 分数、Brier 分数以及预期校准误差(Expected Calibration Error)方面均表现出一致的性能增益。在零样本(Zero-shot)和少样本(Few-shot)提示协议下,与前沿大语言模型(LLM)基线相比,MAAM 仍具竞争力,同时展现出更高的紧凑性和稳定性。这些结果表明,可解释的锚点保留与上下文校准为中文歧视性语言评估提供了一种比单纯扩大模型规模更具实用性的替代方案。
Abstract
Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-dependent. We propose MAAM (Myopia--Astigmatism Anchor Mechanism), a lightweight, model-agnostic framework inspired by functional visual blur: rather than preserving every token equally, MAAM retains discrimination-relevant semantic anchors and calibrates them with C--I--S contextual priors (Contextual Tone, Group Identity, and Stance Polarity). We also introduce ChLGBT, to our knowledge the first Chinese LGBT-focused discriminatory-language dataset, with 8,120 manually annotated samples and three ordinal labels: explicit bias, implicit bias, and emotional intensity. Across strong encoder baselines, MAAM improves all three prediction dimensions, with consistent gains in accuracy, F1, Brier score, and expected calibration error. Compared with frontier LLM baselines under zero-shot and few-shot prompting protocols, MAAM remains competitive while offering stronger compactness and stability. These results suggest that interpretable anchor preservation and contextual calibration provide a practical alternative to heavier model scaling for Chinese discriminatory-language assessment.
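Two of the reported metrics, Brier score and expected calibration error (ECE), measure calibration rather than raw accuracy. The sketch below uses their standard definitions in a binary simplification; the paper's labels are ordinal, so the binary setting, the equal-width binning, and the toy numbers are all assumptions made for illustration.

```python
from typing import List


def brier_score(probs: List[float], labels: List[int]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs: List[float], labels: List[int], n_bins: int = 10) -> float:
    """Binary ECE: bin by confidence in the predicted class, then take the
    size-weighted average of |accuracy - mean confidence| over non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(a for _, a in b) / len(b)
            ece += len(b) / n * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions for four samples.
probs = [0.95, 0.85, 0.25, 0.65]
labels = [1, 1, 0, 0]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
```

Lower is better for both metrics, which is why they are reported alongside accuracy and F1.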
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on Chinese discriminatory language detection using a text-based framework (MAAM), showing low relevance to Multimodal, World Models, or RL keywords. Tokenizer has minor relevance due to explicit token-level discussion in the mechanism, and Visual Encoder has metaphorical relevance (visual blur inspiration). No expert authors from the specified list are included. The weighted total score (12.0) is below the dynamic passing score (27.8), indicating poor alignment with the provided keyword track.
关键词
Chinese Discriminatory Language Detection, MAAM Framework, Semantic Anchors, Contextual Calibration, ChLGBT Dataset, Model-Agnostic, Text Classification
摘要翻译
青光眼是全球范围内导致不可逆性失明的主要原因之一,基于眼底图像的早期检测对于有效的疾病管理至关重要。尽管深度学习在眼底图像分析中取得了令人鼓舞的性能,但大多数现有方法仍依赖单时间点图像,无法捕捉与疾病进展相关的纵向结构和血管变化。临床随访期间获取的序列眼底图像提供了宝贵的时间信息;然而,当前的序列模型往往难以检测细微的早期进展信号,且通常依赖固定长度输入或来自已确诊青光眼图像的诊断线索,这限制了它们在早期预测中的临床实用性。为解决这些局限性,我们提出 DiffSight-Former,一种用于从序列眼底图像预测青光眼进展的框架。该框架整合了一个基于眼底专用基础模型(fundus-specific foundation model)的时变特征提取模块,以获得稳健的解剖学表示。引入一个多结构差异建模模块,以量化视盘/杯区域(optic disc/cup region)和视网膜血管(retinal vasculature)中与进展相关的变化。这些表示与时间间隔嵌入(temporal interval embeddings)整合,并由时间感知 Transformer(time-aware Transformer)处理,以建模疾病进展并估计未来青光眼发作的概率。实验在两个纵向数据集 SIGF(405 个序列)和 GRAPE(263 个序列)上进行。在 SIGF 数据集上,DiffSight-Former 在进展预测中实现了 91.54% 的曲线下面积(AUC)和 92.16% 的灵敏度。在 GRAPE 数据集上,它在三种临床视野进展标准(clinical visual-field progression criteria)上实现了 87.48% 的平均准确率。与现有方法相比,DiffSight-Former 在不同时间设置下展现出卓越的性能和鲁棒性,凸显了其在纵向青光眼监测和早期风险预测方面的潜力。
Abstract
Glaucoma is a leading cause of irreversible blindness worldwide, and early detection from fundus images is critical for effective disease management. While deep learning has achieved promising performance in fundus image analysis, most existing methods rely on single time-point images and fail to capture longitudinal structural and vascular changes associated with disease progression. Sequential fundus images acquired during clinical follow-up provide valuable temporal information; however, current sequential models often struggle to detect subtle early progression signals and commonly depend on fixed-length inputs or diagnostic cues from already glaucomatous images, limiting their clinical utility for early prediction. To address these limitations, we propose DiffSight-Former, a framework for glaucoma progression prediction from sequential fundus images. It incorporates a time-variant feature extraction module based on a fundus-specific foundation model to obtain robust anatomical representations. A multi-structure difference modeling module is introduced to quantify progression-related changes in the optic disc/cup region and retinal vasculature. These representations are integrated with temporal interval embeddings and processed by a time-aware Transformer to model disease progression and estimate the probability of future glaucoma onset. Experiments were conducted on two longitudinal datasets, SIGF (405 sequences) and GRAPE (263 sequences). On SIGF, DiffSight-Former achieved an AUC of 91.54% and a sensitivity of 92.16% for progression prediction. On GRAPE, it achieved an average accuracy of 87.48% across three clinical visual-field progression criteria. Compared with existing approaches, DiffSight-Former demonstrates strong performance and robustness across different temporal settings, highlighting its potential for longitudinal glaucoma monitoring and early risk prediction.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
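评分理由: The paper belongs to medical image analysis and focuses on glaucoma progression prediction, a poor match for the given MLLM / world model / reinforcement learning keyword topics. Only 'Visual Encoder' is relevant (a foundation model extracts visual features) and 'Unify Models' is partially relevant (structural and temporal modules are integrated); the remaining keywords (Tokenizer, World Models, MLLM, MultiModal, model-based RL) are not addressed at all. The author list does not include any of the specified experts, so no bonus is applied.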
关键词
Glaucoma Progression Prediction, Sequential Fundus Images, Time-aware Transformer, Structural Differences, Longitudinal Monitoring, Medical Imaging, Foundation Model, Disease Prediction
摘要翻译
标准扩散模型通常使用单一的时间齐次高斯终端分布作为生成的参考律。虽然这一选择在分析上便利且经验上强大,但它为集中在低维流形附近的数据提供不了多少显式结构,其中数据分布的不同区域可能对应不同的局部几何或语义因素。因此,逆向模型必须几乎完全从未结构的终端参考分布中恢复流形层面的结构。我们提出 PTL-Diffusion,一个概念验证的扩散框架,其前向噪声过程收敛于一个非常数的周期性高斯终端律族,而非单一不变律。与相位条件化的 DDPM 不同,其中相位信息仅进入去噪网络而前向过程保持不变,PTL-Diffusion 将相位结构直接嵌入到前向噪声动力学中。所提出的构造仍接近标准去噪扩散模型:对于周期驱动的 Ornstein--Uhlenbeck 型前向过程,我们推导了闭合形式的前向边缘分布、极限周期性高斯终端族以及显式的高斯反向后验分布,从而支持标准的噪声预测训练。我们还引入一个不变平均正则化项,通过平均周期性参考律耦合相位条件化的反向动力学。在环面和圆柱体点云基准以及 Olivetti 人脸数据集上的实验表明,PTL-Diffusion 相比于匹配的 DDPM 基线改善了流形层面分布匹配,减少了相位条件化误差、特征空间协方差误差和最近邻流形距离。这些结果表明结构化终端参考律是一个有前景的方向,同时激励更具表现力的相位构造和更大规模的评估。
Abstract
Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.
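A one-dimensional worked instance shows why a periodically forced Ornstein-Uhlenbeck process has closed-form marginals and a periodic limiting law. The drift form, the symbols $\theta$, $\sigma$, $m(t)$, and the cosine forcing below are illustrative assumptions, not necessarily the parametrization used in the paper.

$$
dX_t = -\theta\,\bigl(X_t - m(t)\bigr)\,dt + \sigma\,dW_t, \qquad m(t+T) = m(t), \quad \theta > 0
$$

Solving the linear SDE gives a Gaussian forward marginal:

$$
X_t \mid X_0 \sim \mathcal{N}\!\left(e^{-\theta t}X_0 + \theta\int_0^t e^{-\theta(t-s)}\,m(s)\,ds,\;\; \frac{\sigma^2}{2\theta}\bigl(1 - e^{-2\theta t}\bigr)\right)
$$

As $t \to \infty$ the dependence on $X_0$ vanishes and the law approaches a $T$-periodic Gaussian family:

$$
\mathcal{N}\!\left(\mu_\infty(t),\; \frac{\sigma^2}{2\theta}\right), \qquad \mu_\infty(t) = \theta\int_{-\infty}^{t} e^{-\theta(t-s)}\,m(s)\,ds
$$

For $m(t) = a\cos(\omega t)$ with $\omega = 2\pi/T$, the integral evaluates to

$$
\mu_\infty(t) = \frac{a\,\theta\,\bigl(\theta\cos\omega t + \omega\sin\omega t\bigr)}{\theta^2 + \omega^2}
$$

In this simple instance only the terminal mean depends on the phase; the variance is constant. A phase-dependent covariance would require a time-varying $\theta$ or $\sigma$.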
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.5/10 | 2.25 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
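评分理由: The paper studies an improvement to diffusion models, specifically the introduction of periodic terminal laws to strengthen manifold awareness, and belongs to generative modeling. The provided keywords concentrate on multimodal large models (MLLM), reinforcement learning (RL), and model unification (Unify Models), so most of them (e.g., Tokenizer, MLLM, model-based RL) are unrelated to the paper (0 points). 'World Models' is weakly related through generative modeling (3 points), 'Visual Encoder' has a slight connection through the use of visual data (point clouds, faces) (1.5 points), and 'Unify Models' refers to unification within the framework rather than unification of multimodal models (2 points). The weighted total is 11.25, far below the dynamic passing score of 27.8, indicating low relevance to the specified research background. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).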
关键词
Diffusion Models, Manifold-Aware, Periodic Terminal Laws, Generative Modeling, Point-Cloud, Forward Noising, Distributional Matching
摘要翻译
大语言模型(LLM)日益被期望处理复杂的、长周期现实世界任务,其上下文需求可无限增长,然而模型的上下文窗口本质上仍是有限的。近期工作探索了一种范式:主智能体(Agent)分解任务并将子任务分发给子智能体(Subagent),子智能体执行后仅返回汇总结果,从而节省主智能体的上下文预算。然而,要出色完成这一任务需要委托智能(Delegation Intelligence):即分解复杂任务的能力、确定何时及何事进行委托,并将返回结果整合到持续的工作流中。此类能力的训练数据在自然生成的文本中极为稀缺,据我们所知,如何合成此类数据并训练模型以获得该能力,在开源社区中仍很大程度上未被充分探索。为了弥合这一差距,我们提出了一项初步探索,目标为深度研究任务——一种代表性的长周期智能体任务。具体而言,我们设计了一个引导框架(Harness),引导模型实现高质量的任务分解与委托,同时约束子智能体适当地返回结果,以支持主智能体的工作流。由引导框架生成的轨迹自然编码了正确的委托决策,我们将其用作监督微调(SFT)数据,将委托智能内化至模型权重中。我们的最终模型 SearchSwarm-30B-A3B 在 BrowseComp 上取得 68.1 分,在 BrowseComp-ZH 上取得 73.3 分,在所有同等规模的模型中表现最佳。我们将发布我们的引导框架、模型权重及训练数据,以助力未来的研究。
Abstract
Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on LLM agent delegation and task decomposition via supervised fine-tuning for long-horizon tasks. It lacks content on multimodal integration, visual encoders, world models, or model-based RL mechanisms, resulting in low relevance scores for most keywords (Total Weighted Score: 10.5, below dynamic passing score 27.8). No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.
关键词
Agentic LLMs, Delegation Intelligence, Task Decomposition, Supervised Fine-Tuning, Long-Horizon Tasks, Subagents, SearchSwarm, BrowseComp
摘要翻译
在指令微调(IFT)过程中,大语言模型(LLMs)利用提供的上下文回答问题,从而学会遵循指令。尽管先前研究已探讨了上下文特征与大语言模型上下文使用之间的相关性,但此类分析仅限于推理阶段,而这些关系最初是如何习得的仍是一个开放性问题。本文测量了模型对这些特征的敏感性如何在连续的指令微调阶段中发生变化:监督微调(SFT)、直接偏好优化(DPO)以及基于可验证奖励的强化学习(RLVR)。在四个模型和三个数据集上的实验表明,监督微调(SFT)使模型更倾向于使用易于理解的上下文,例如具有较高长度、较高上下文 - 查询相似度和较高流畅度的上下文。SFT 之后的动态过程可能会根据训练数据集的不同,要么强化要么消除这些偏好。我们的发现表明,上下文使用在每个指令微调阶段都会被积极重塑,因此设计平衡的指令微调数据集对于确保指令微调模型的稳健上下文使用至关重要。
Abstract
During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as containing high length, context-query similarity, and fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and designing a balanced IFT dataset is important in ensuring robust context utilization of instruction-tuned models.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on LLM instruction fine-tuning and context sensitivity, showing weak alignment with MLLM (LLM context) and model-based RL (RLVR mention), but no relevance to multimodality, visual encoders, world models, tokenizers, or model unification. No listed expert authors are present.
关键词
Large Language Models, Instruction Fine-Tuning, Context Sensitivity, Supervised Fine-Tuning, Direct Preference Optimization, Reinforcement Learning, Context Usage
摘要翻译
人工智能的快速发展推动了深度神经网络的显著进步。尽管如此,传统的基于 GPU 的训练仍然能耗极高,促使人们探索物理动力学及兼容的基于能量的学习方案,例如平衡传播(Equilibrium Propagation, EP)。然而,基于 EP 的训练经常因相空间收缩而收敛至局部极小值。本文提出了一种受伊辛动力学启发的平衡传播框架,其中耗散的 Hopfield 弛豫被具有共轭变量的扩展相空间动力学所取代。所得到的训练范式保留了 EP 的局部两阶段学习规则,同时改变了神经状态达到平衡的物理路径。我们表明,这种动力学降低了有效能量壁垒,加速了收敛,提高了噪声鲁棒性,并在 MNIST、FashionMNIST 和 CIFAR-10 上训练深度卷积 Hopfield 网络,其性能可与反向传播相媲美。
Abstract
The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based training remains highly energy-demanding, motivating the exploration of physical dynamics and compatible energy-based learning schemes, such as equilibrium propagation (EP). EP-based training, however, frequently suffers from convergence to local minima due to phase-space contraction. Here we introduce an Ising-dynamics-inspired equilibrium-propagation framework in which dissipative Hopfield relaxation is replaced by an extended phase-space dynamics with conjugate variables. The resulting training paradigm keeps the local two-phase learning rule of EP while changing the physical route by which neural states reach equilibrium. We show that this dynamics lowers effective energy barriers, accelerates convergence, improves noise robustness, and trains deep convolutional Hopfield networks on MNIST, FashionMNIST, and CIFAR-10 with performance comparable to backpropagation.
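For reference, the local two-phase rule that the abstract says is preserved can be written down for standard equilibrium propagation. The notation below ($E$ for the energy, $C$ for the cost, $\beta$ for the nudging strength, $s$ for the neural state) is chosen here for illustration, and the block describes the standard EP rule only, not the extended phase-space dynamics the paper introduces.

$$
s^{0} = \arg\min_{s} E(\theta, s) \quad \text{(free phase)}, \qquad
s^{\beta} = \arg\min_{s} \bigl[E(\theta, s) + \beta\,C(s, y)\bigr] \quad \text{(nudged phase)}
$$

$$
\Delta\theta \;\propto\; -\frac{1}{\beta}\left[\frac{\partial E}{\partial \theta}\bigl(\theta, s^{\beta}\bigr) - \frac{\partial E}{\partial \theta}\bigl(\theta, s^{0}\bigr)\right]
$$

In the limit $\beta \to 0$ the bracketed difference divided by $\beta$ approaches the gradient of the loss $C(s^{0}, y)$ with respect to $\theta$, so the update performs gradient descent using only quantities measured at the two equilibria. The minimizers are written as $\arg\min$ for brevity; in practice they are stationary points reached by relaxation, and the route by which the state reaches them is what the paper changes.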
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
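评分理由: The paper studies an improvement to the equilibrium propagation algorithm in energy-based learning, combined with Ising machine dynamics. The provided keyword set (e.g., MLLM, Tokenizer, MultiModal, World Models) focuses on multimodal large models and reinforcement learning and is largely unrelated to the paper's content (image classification, training algorithms, physical dynamics). There is only a weak connection to model unification (hybridizing EP with Ising dynamics) and to the theoretical background of model-based RL, so the scores are low. The weighted total is 10.5, below the dynamic passing score of 27.8.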
关键词
Equilibrium Propagation, Ising Machines, Energy-Based Learning, Hopfield Networks, Phase-Space Dynamics, Convolutional Networks, Backpropagation
摘要翻译
用于投资组合优化的深度强化学习(DRL)框架已展现出前景,其优势在于能够从市场数据中动态学习分配规则。然而,这些模型未能考虑厚尾回报(fat-tailed returns),而厚尾回报正是实际市场行为的特征,表现为更频繁的极端事件。此外,历史数据被同质化处理,未考虑时间重要性,导致模型在市场状态转换(regime changes)期间失效。我们提出了一种新的 BAVAR-BLED 算法,该算法基于 TD3 架构,结合了源自贝叶斯平均向量自回归(BAVAR)模型和使用椭圆分布(BLED)的 Black-Litterman 模型的方法。BAVAR 捕获一组考虑多尺度时间特征的向量自回归表示,基于对回报期望和协方差矩阵的状态感知估计,从而实现自适应的分配决策。这些估计作为先验输入传递给 BLED,该模型使用学生 t 分布(Student's t-distributions),从而允许更现实的厚尾回报估计。BAVAR-BLED 算法使用 Transformer 网络进行视图构建,并使用卷积神经网络(CNNs)进行风险厌恶估计,从而根据市场条件修改动态分配决策。对 29 只道琼斯工业平均指数(Dow Jones Industrial Average)成分股在长达十年的市场周期内的评估表明,BAVAR-BLED 显著优于现有最先进方法,实现了 1.72 的夏普比率(Sharpe ratio)和 2.70 的索提诺比率(Sortino ratio),总回报率达到 57.26%。
Abstract
Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
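The two headline metrics differ only in how they measure risk: the Sharpe ratio divides mean excess return by the standard deviation of all returns, while the Sortino ratio divides by the downside deviation. The sketch below uses these textbook definitions; the annualization factor, the zero risk-free rate and target, and the toy returns are assumptions, since the abstract does not state the paper's conventions.

```python
import math
from typing import List


def sharpe_ratio(returns: List[float], risk_free: float = 0.0, periods_per_year: int = 252) -> float:
    """Mean excess return over its sample standard deviation, annualized."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def sortino_ratio(returns: List[float], target: float = 0.0, periods_per_year: int = 252) -> float:
    """Mean excess return over the downside deviation (only returns below target count)."""
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(x, 0.0) ** 2 for x in excess) / len(excess))
    if downside == 0.0:
        return math.inf  # no returns below target in the sample
    return mean / downside * math.sqrt(periods_per_year)


# Hypothetical per-period returns; periods_per_year=1 disables annualization.
toy_returns = [0.01, -0.02, 0.03, 0.00]
sharpe = sharpe_ratio(toy_returns, periods_per_year=1)
sortino = sortino_ratio(toy_returns, periods_per_year=1)
```

Because upside volatility is not penalized, the Sortino ratio is typically higher than the Sharpe ratio for the same return series, which is consistent with the 1.72 versus 2.70 reported in the abstract.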
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
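评分理由: The paper belongs to financial engineering and reinforcement learning, whereas the provided keyword set (e.g., Tokenizer, Visual Encoder, MLLM, MultiModal) mainly targets multimodal large models, so there is a domain mismatch. Keywords related to multimodality, text processing, and vision are therefore scored 0. Unify Models and World Models have some connection in the broad sense of model integration and environment modeling, but they are not central, so their scores are low. model-based RL touches on model-assisted RL, but TD3 is generally regarded as a model-free algorithm, so its score is only moderate. The author list does not include any of the five specified experts, so no bonus is applied.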
关键词
Portfolio Optimization, Market Regime Changes, Heavy-Tailed Returns, Bayesian VAR, Black-Litterman, TD3 Architecture, Deep Reinforcement Learning
摘要翻译
长周期智能体任务为基于结果的强化学习(outcome-based reinforcement learning)提出了一个根本性的信用分配挑战:轨迹级奖励虽能验证最终正确性,却难以提供关于哪些中间推理步骤或工具交互对结果有贡献的有效指导。这种困难在多轮搜索智能体(multi-turn search agents)中尤为显著:成功的轨迹可能包含误导性动作,而失败的轨迹可能包含有价值的证据收集步骤。我们提出 PBSD(Privileged Bayesian Self-Distillation,特权贝叶斯自蒸馏),这是一种在稀疏最终奖励下用于细粒度信用分配的贝叶斯校准自蒸馏方法。PBSD 通过验证答案的后验概率与先验概率之比来衡量轨迹质量,并利用贝叶斯定理(Bayes' rule)将这个难以估计的答案侧比率转换为标准学生模型(student model)与特权答案条件化教师模型(privileged answer-conditioned teacher model)之间的可处理似然比。对该贝叶斯证据分数进行自回归分解,可生成轮次级信号,从而识别每个中间轮次是支持还是削弱了验证结果。因此,PBSD 提供了一种合理且优雅的重加权方案,将稀疏的结果监督转换为贝叶斯校准的轮次级信用信号,同时完全兼容标准策略优化。实验表明,PBSD 在域内(in-domain)和域外(out-of-domain)设置下均一致地提升了性能,并能有效地将知识从短上下文训练迁移至长上下文推理,这表明其细粒度信用分配机制促进了更有效的策略学习,并带来了更好的泛化能力。
Abstract
Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
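The Bayes' rule step and the autoregressive decomposition described in the abstract can be written out as follows. The notation is chosen here for illustration ($a^\star$ for the verified answer, $\tau = (\tau_1, \dots, \tau_K)$ for the turns of a trajectory, $\pi_{\text{student}}$ and $\pi_{\text{teacher}}$ for the two models); the paper's own symbols and any normalization it applies may differ.

$$
\frac{p(a^\star \mid \tau)}{p(a^\star)} \;=\; \frac{p(\tau \mid a^\star)}{p(\tau)}
$$

The left side is the answer-side posterior-to-prior ratio; the right side is a trajectory-side likelihood ratio. Factorizing both trajectory likelihoods turn by turn gives a sum of per-turn terms:

$$
\log\frac{p(\tau \mid a^\star)}{p(\tau)} \;=\; \sum_{k=1}^{K}\Bigl[\log \pi_{\text{teacher}}\bigl(\tau_k \mid \tau_{<k},\, a^\star\bigr) \;-\; \log \pi_{\text{student}}\bigl(\tau_k \mid \tau_{<k}\bigr)\Bigr]
$$

A positive term means turn $k$ is more likely once the verified answer is known, so it supports the outcome; a negative term means it undermines it. This form assumes that environment responses such as tool outputs do not depend on $a^\star$ given the history, so their factors cancel in the ratio.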
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on Bayesian self-distillation for credit assignment in reinforcement learning, lacking multimodal components (Tokenizer, Visual Encoder, MLLM, MultiModal). It touches on RL policy optimization (model-based RL) and long-horizon planning (World Models), but these are not core methodologies. Unify Models loosely applies to teacher-student distillation.
关键词
Privileged Bayesian Self-Distillation, Long-Horizon Credit Assignment, Outcome-based Reinforcement Learning, Teacher-Student Distillation, Sparse Rewards, Policy Optimization, Agentic Tasks
摘要翻译
我们提出了一种在基于 RAG 的 LLM 推荐中安全训练的可复现故障模式——注入悖论(Injection Paradox)——其中嵌入在检索文档中的提示词注入(prompt injections)会反噬攻击者,导致目标品牌低于无注入基线。在安全训练的 Claude 模型中,包含提示词注入的文档推荐率急剧下降,且这种抑制作用会超出注入文档本身,蔓延至同一品牌的未修改文档。在 Claude Opus 4.6 中,目标品牌从 54% 的基线降至零前 2 名推荐,在所有 50 次试验中均如此,尽管语料库中 4 份品牌文档中仅有一份包含注入。这种方向性模式在反事实实验及跨三个品牌的测试中得到了复现。在测试的 GPT 模型中出现了相反的结果,相同的注入反而增加了推荐,这表明模型家族在注入类上下文如何影响推荐行为方面存在差异。这些发现提出了反向攻击场景的技术可能性,即攻击者将注入嵌入竞争对手的文档中,利用对安全敏感的模型行为来压制竞争对手的品牌。
Abstract
We present a reproducible failure mode of safety training in RAG-based LLM recommendation -- the Injection Paradox -- in which prompt injections embedded in retrieved documents backfire against the attacker, suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across three brands. A contrasting result across the GPT models tested, where the same injection instead increases recommendations, suggests model-family differences in how injection-like context affects recommendation behavior. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents to suppress the competitor's brand via safety-sensitive model behavior.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on safety training failures in RAG-based LLM recommendations (Injection Paradox) and prompt injection effects on brand suppression. It does not address Unify Models, Tokenizers, Visual Encoders, World Models, or Model-Based RL. While it involves LLMs (potentially MLLM/MultiModal), the core contribution is about safety and recommendation behavior, not multimodal representation or world modeling. No listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.
关键词
Injection Paradox, Safety-Trained LLM, RAG Context Injection, Brand-Level Suppression, Prompt Injections, Recommendation Behavior, Model-Family Differences
摘要翻译
机器人生命周期的许多阶段,从形态合成到运行,根本依赖于可达工作空间。然而,当前近似工作空间的方法速度慢、精度低,或仅局限于单一形态。我们提出 Reachability Across Morphologies(RAM):一种形态条件化的隐式神经网络表示,该表示可作为位姿可达性的快速、可微代理模型,在泛化至未见形态的同时,内在考虑自碰撞。为了训练 RAM,我们发布了一个包含 $3\cdot10^{10}$ 个样本的大规模数据集,这些样本仅通过正向运动学生成。实验表明,我们的模型在纳秒级推理下实现了 86% 的 F1 分数,比基线高出 14%,同时将推理时间减少了三个数量级。我们还进一步展示了,基于梯度的形态优化和轨迹优化分别实现了一个数量级和两个数量级的加速。网站:https://timwalter.github.io/ram.
Abstract
Many stages of the robotic lifecycle, from morphology synthesis to operation, rely fundamentally on the reachable workspace. However, current methods for approximating workspaces are slow, imprecise, or tied to a single morphology. We introduce Reachability Across Morphologies (RAM): a morphology-conditioned, implicit neural representation that acts as a fast, differentiable surrogate for pose reachability, generalising to unseen morphologies while inherently accounting for self-collisions. To train RAM, we publish a large-scale dataset of $3\cdot10^{10}$ samples generated solely from forward kinematics. Experiments show that our model achieves an $F_1$-score of $86\%$ at nanosecond inference, outperforming the baseline by $14\%$ while reducing inference time by three orders of magnitude. We further demonstrate speed-ups of one and two orders of magnitude for gradient-based morphology and trajectory optimisation, respectively. Website: https://timwalter.github.io/ram.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on robotic reachability using implicit neural representations based on forward kinematics. It lacks direct connection to tokenization, visual encoders, multimodal fusion, or large language models (MLLM). While it involves modeling and optimization, its relevance to 'World Models' and 'model-based RL' in the context of AI/LLM paradigms is minimal, as it is a specific kinematic tool rather than a generative or learning-based RL framework.
关键词
Reachability Across Morphologies, Implicit Neural Representation, Forward Kinematics, Morphology Generalization, Self-collisions, Trajectory Optimization, Robotics
摘要翻译
大推理模型 (LRMs) 通常能提升数学与编码性能,但其对指令跟随的影响尚不明确。我们使用 Qwen3 模型(1.7B-32B)在 IFEval 上进行研究,采用权重相同的 Thinking ON/OFF 控制;四个 Hunyuan 模型提供了跨家族的方向性支持。总体通过率的变化较小(-0.55 至 -3.52 百分点 (pp)),但在不同模式下,10%-20% 的提示词在通过和失败之间切换,这表明思考改变了错误模式——部分提示词表现改善而另一些恶化——而非一致性地降低性能。在基于 Qwen3 的事后分组下,约束类型分为 Planning(规划类,包含全局计数、结构、协调),其在思考下于类别层面得到改善,以及 Precision(精度类,精确局部形式),其持续恶化;尽管 Hunyuan 模型的总体变化方向相反,但在所有四个 Hunyuan 模型中,类别层面的 Planning/Precision 符号模式在方向上依然保持一致。思考还会改变最终答案的长度;长度匹配分析显著减少了 Precision 下降,但残留的惩罚仍然存在。使用交叉编码器 (CE) 相关性度量分析思考轨迹揭示了三种模式:Neutral(中性)模式显示相关性 - 合规性之间存在正相关(r 约为 0.15);Planning 模式显示预测相关性接近零(r 约为 0.02),尽管存在可测量的轨迹参与度,这与 CE 测量的轨迹相关性与最终答案合规性之间的执行差距一致;Precision 模式显示较小的负相关(r 约为 -0.05),失败实例的平均相关性高于通过实例。在四种模型规模(1.7B-14B)上进行激活修补(Activation patching)显示,Precision 翻转实例比 Planning 翻转实例更常被恢复(32%-58% vs. 14%-40% 平均层恢复率),在 14B 时差距最大(约 30 百分点 (pp))。
Abstract
Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors (some prompts improve while others worsen) rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder (CE) relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).
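The gap between a small aggregate change and a large share of flipped prompts can be seen with simple counting. The sketch below is an illustration only: the function name and the toy pass/fail vectors are invented here and are not the paper's data.

```python
def compare_modes(pass_off, pass_on):
    """Compare per-prompt pass/fail outcomes with Thinking OFF vs. ON."""
    assert len(pass_off) == len(pass_on)
    n = len(pass_off)
    gained = sum(1 for a, b in zip(pass_off, pass_on) if not a and b)
    lost = sum(1 for a, b in zip(pass_off, pass_on) if a and not b)
    return {
        "net_change_pp": 100.0 * (gained - lost) / n,   # aggregate pass-rate shift
        "flip_rate_pct": 100.0 * (gained + lost) / n,   # prompts whose outcome changed
    }

# Toy data (invented): 100 prompts, 9 go from pass to fail, 7 from fail to pass.
off = [True] * 80 + [False] * 20
on = [True] * 71 + [False] * 9 + [True] * 7 + [False] * 13
stats = compare_modes(off, on)
```

With these toy numbers the net change is only -2 pp while 16% of prompts flip, the same qualitative picture the abstract reports.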
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper investigates instruction following and reasoning traces in text-based LLMs (Qwen3, Hunyuan), focusing on constraint-level error shifts and activation patching. It does not discuss multimodal integration (MultiModal, MLLM, Visual Encoder), world models, tokenizers, or model-based RL. While it compares models (Unify Models), it does not unify them. Consequently, all keyword relevance scores are low due to topic mismatch.
关键词
Large reasoning models, Instruction following, Thinking traces, Constraint-level errors, Activation patching, Planning vs Precision, Error shifts
摘要翻译
大型语言模型(LLMs)在各种推理任务上展现出令人印象深刻的性能,但其在图结构推理方面的能力尚不明确。我们探究 LLMs 是否能够真正理解图同构——图论中的一个基本问题。尽管 LLMs 在同构检测上取得了近乎完美的准确率,但我们表明这种表现是虚幻的。当相同的图以置换后的节点标签呈现时,LLMs 无法识别它们的同构性。这一发现表明,LLMs 依赖模式而非对抽象图结构进行推理。由于置换不变性是有效结构推理的根本要求,这些结果表明,图推理基准上的成功不应被解释为真正拓扑理解的证据。
Abstract
Large language models (LLMs) have shown impressive performance on diverse reasoning tasks, yet their capacity for structural reasoning in graphs remains unclear. We investigate whether LLMs can genuinely understand graph isomorphism, a fundamental problem in graph theory. While LLMs achieve near-perfect accuracy on isomorphism detection, we show this performance is illusory. When identical graphs are presented with permuted node labels, LLMs fail to identify their isomorphism. This finding suggests that LLMs exploit patterns rather than reasoning about abstract graph structure. Since permutation invariance is a fundamental requirement for valid structural reasoning, these results indicate that success on graph reasoning benchmarks should not be interpreted as evidence of genuine topological understanding.
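To make the permutation-invariance requirement concrete, the following sketch builds a relabelled copy of a small graph and checks isomorphism by brute force. It assumes simple undirected graphs with nodes numbered `0..n-1`; the helper names are hypothetical, and this is not the paper's evaluation harness.

```python
import random
from itertools import permutations

def relabel(edges, mapping):
    """Apply a node relabelling to a set of undirected edges."""
    return {frozenset((mapping[u], mapping[v])) for u, v in edges}

def are_isomorphic(n, edges_a, edges_b):
    """Brute-force isomorphism test; only practical for very small n."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    if len(a) != len(b):
        return False
    return any(relabel(a, perm) == b for perm in permutations(range(n)))

def permuted_copy(n, edges, seed=0):
    """Return the same graph with randomly shuffled node labels."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return [(perm[u], perm[v]) for u, v in edges]

cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
star = [(0, 1), (0, 2), (0, 3)]
path = [(0, 1), (1, 2), (2, 3)]
shuffled_cycle = permuted_copy(4, cycle, seed=1)
```

A structurally sound reasoner should give the same answer for `cycle` and `shuffled_cycle`, since the labels carry no structural information.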
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper investigates LLMs' failure in graph isomorphism due to lack of permutation invariance, focusing on structural reasoning rather than multimodal integration or world modeling. Keywords related to Vision, RL, and World Models are irrelevant (0). MLLM and Tokenizer have minor relevance as the paper discusses LLM architecture but not specifically multimodal or tokenization mechanisms (2-3). Unify Models is not the core focus (2).
关键词
Large Language Models, Graph Isomorphism, Structural Reasoning, Permutation Invariance, Pattern Matching, Topological Understanding, LLM Limitations
摘要翻译
在严格实时约束下的实际部署中,天气和成像变化会引发显著的分布偏移,严重削弱检测器性能。单域泛化目标检测旨在缓解这一问题,但现有方法很少从问题表述的角度探究实时检测器在受限推理预算下的泛化能力。为此,我们提出实时单域泛化目标检测(RT-SDGOD),旨在探讨实时检测器如何仅依靠训练阶段的表示学习,在零额外推理开销下实现跨域泛化。我们发现,在域偏移下,基于 DETR 的实时检测器性能主要通过漏检率上升而退化,其根源在于有限且不稳定的对象级判别性证据。基于此,我们提出 RT-SDGDet,一种面向 RT-SDGOD 的多证据协同建模框架。其核心思想是使同一对象的多个查询能够协同覆盖更充分的判别性证据,同时保持这种证据建模在不同视角下的稳定性。具体而言,我们采用一对多(O2M)监督构建稳定的对象特定查询组,并进一步设计了判别性证据多样性学习(DEDL)和双视角证据一致性学习(DvECL),分别用于扩展对象级证据覆盖和提高外观扰动下的证据稳定性。由于所有组件仅在训练阶段引入,该方法不会产生额外的推理开销。大量实验表明,所提出的方法在多个未见目标域上均实现了优于现有方法的泛化性能。
Abstract
In real-world deployment under strict real-time constraints, weather and imaging variations induce significant distribution shifts, severely degrading detector performance. Single-Domain Generalized Object Detection aims to mitigate this issue, yet existing methods rarely investigate, at the level of problem formulation, the generalization capability of real-time detectors under such constrained inference budgets. To this end, we introduce Real-Time Single-Domain Generalized Object Detection (RT-SDGOD), which focuses on how real-time detectors can achieve cross-domain generalization under zero extra inference overhead by relying solely on training-time representation learning. We observe that, under domain shift, DETR-based real-time detectors mainly degrade through increased missed detections, rooted in limited and unstable object-level discriminative evidence. Based on this, we propose RT-SDGDet, a multi-evidence collaborative modeling framework for RT-SDGOD. The core idea is to enable multiple queries of the same object to collaboratively cover more complete discriminative evidence while keeping this evidence modeling stable across views. Specifically, we use one-to-many (O2M) supervision to construct stable object-specific query groups, and further design Discriminative Evidence Diversity Learning (DEDL) and Dual-view Evidence Consistency Learning (DvECL) to expand object-level evidence coverage and improve evidence stability under appearance perturbations, respectively. Since all components are introduced only during training, our method incurs no extra inference overhead. Extensive experiments show that the proposed method achieves better generalization performance than existing approaches across multiple unseen target domains.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
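评分理由: The paper focuses on domain generalization for real-time object detection, performing representation learning on a DETR-based framework. The keyword list mainly targets large models and reinforcement learning, which diverges substantially from the paper's topic. Only 'Visual Encoder' has moderate relevance, because the DETR architecture involves a visual backbone; the remaining keywords, such as MLLM, World Models, Tokenizer, and RL, have no direct connection.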
关键词
Real-Time Object Detection, Single-Domain Generalization, Domain Shift, Representation Learning, DETR-based, Evidence Modeling, Cross-Domain Generalization
摘要翻译
多智能体代码生成提供了一种有前景的范式,用于自主软件开发,其通过模拟人类软件工程生命周期来实现。然而,系统可靠性仍受限于大语言模型(LLM)幻觉以及交互智能体间的错误传播。尽管语义熵提供了一种无需真实答案即可量化不确定性的严谨方法,但当前方法往往依赖于昂贵的大语言模型驱动的等价性检查。本文提出快速自适应语义熵(FASE),这是一种新颖的度量方法,基于结构差异图和语义差异图构建的最小生成树来近似功能正确性。在 HumanEval 和 BigCodeBench 上的评估表明,FASE 优于基于大语言模型蕴含的最先进语义熵方法。在使用 Qwen3-Embedding-8B 模型时,相较于基于真实测试用例的 Pass@1,FASE 在 Spearman 相关性上平均提升了 25%,在 ROCAUC 分数上提高了 19%。此外,通过消除昂贵的大语言模型驱动的等价性评估,FASE 产生的计算开销微乎其微,仅需传统语义熵方法运行成本的约 0.3%。这些结果将 FASE 定位为一种实用且具成本效益的解决方案,可用于优化现实世界多智能体工作流中的不确定性量化。
Abstract
Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.
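The abstract says FASE is built on the minimum spanning tree of dissimilarity graphs but does not spell out how the tree becomes a score. As a loose sketch of the graph step only, the code below runs Prim's algorithm on a dense pairwise dissimilarity matrix and returns the total tree weight; the matrix values and the idea of reading the total as a "spread" of candidate solutions are assumptions made for illustration, not the paper's definition.

```python
def mst_total_weight(dist):
    """Total weight of a minimum spanning tree (Prim's algorithm).

    `dist` is a dense symmetric matrix of pairwise dissimilarities
    between sampled code candidates (hypothetical input).
    """
    n = len(dist)
    if n == 0:
        return 0.0
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total

# Toy dissimilarities (invented): three near-identical samples vs. three divergent ones.
agreeing = [[0.0, 0.1, 0.1], [0.1, 0.0, 0.1], [0.1, 0.1, 0.0]]
diverging = [[0.0, 1.0, 4.0], [1.0, 0.0, 2.0], [4.0, 2.0, 0.0]]
```

Under this reading, a small tree weight means the sampled programs cluster tightly, while a large one signals disagreement among samples.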
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
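评分理由: The paper focuses on uncertainty quantification in multi-agent code generation (the FASE metric); its core contribution is replacing costly LLM equivalence checks with structural and semantic dissimilarity graphs. The topic is a poor match for the multimodal, world-model, and model-based RL areas in the keyword set. 'Unify Models' and 'MLLM' receive very low scores only because an LLM embedding model is used; the remaining keywords are entirely unrelated. None of the specified experts appear in the author list.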
关键词
Multi-agent code generation, Semantic Entropy, Code Quality, Uncertainty Quantification, Structural and Semantic Dissimilarity, LLM Hallucinations, Cost-effective
摘要翻译
若向一个预训练的生物医学语言模型询问“皮质醇 28 ug/dL"与“股市波动”是否相关,它会返回 0.83 的余弦相似度(在 1.0 表示完全相同的尺度上)。两者之间毫无机制关联。这并非特例:我们测试的每一个现成生物医学编码器(BioBERT、PubMedBERT、BioM-ELECTRA)在答案应接近零的情况下,将无关的跨域对评分为 0.76 至 0.92,而跨域判别准确率为 0%。检索系统之所以能幸免于难,是因为下游的语言模型会过滤噪声;然而大型行为模型(LBM,一种主体为人而非句子的基础模型)则不然:它在用户生活图谱上进行推理,并将嵌入接近性视为两个事件存在因果关联的证据。错误的接近性会写入错误的因果边,且下游一切皆继承此错误。在此情境下,嵌入几何结构并非调节旋钮,而是关乎正确性。我们报告了修复方案。对 72,034 对样本进行一次对比学习过程,将 PubMedBERT 的 BIOSSES 相关性从 0.633 提升至 0.828,将域内与域间分离度从 1.05 倍提升至 1.63 倍。第二次过程,即 BODHI,从生物医学知识图谱中缺失的边中挖掘难负样本,将分离度提升至 2.30 倍,判别差距提升至 +0.392,代价为 BIOSSES 下降 4.5%。在配备 AMX 的 Intel Xeon 6737P 上,OpenVINO 将单次查询延迟从 1367 毫秒降至 10 毫秒(133 倍),并达到每秒 555 句。一项发现与标准建议相悖:在此硬件上,FP16 在所有推理批大小下均优于 INT8,我们解释了其原因。同一模型在未配备 AMX 的 Ice Lake 实例上运行速度慢 13 至 27 倍。我们发布了基准套件、训练语料、BODHI 生成器及 OpenVINO 脚本。
Abstract
Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
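For readers who want the two quantities in the abstract spelled out, here is a minimal sketch of cosine similarity and of a within-vs-across-domain separation ratio. The ratio is assumed here to be mean within-domain similarity divided by mean across-domain similarity; the abstract does not give the exact formula, and the toy vectors are invented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def separation_ratio(within_pairs, across_pairs):
    """Mean within-domain similarity over mean across-domain similarity.

    Assumed definition for illustration; values near 1.0 mean the encoder
    barely distinguishes related pairs from unrelated ones.
    """
    within = sum(cosine(u, v) for u, v in within_pairs) / len(within_pairs)
    across = sum(cosine(u, v) for u, v in across_pairs) / len(across_pairs)
    return within / across

# Toy embeddings (invented): one related pair, one unrelated pair.
within_pairs = [([1.0, 0.0], [1.0, 0.0])]
across_pairs = [([1.0, 0.0], [1.0, 1.0])]
ratio = separation_ratio(within_pairs, across_pairs)
```

The reported move from 1.05x to 1.63x and then 2.30x corresponds to pushing this kind of ratio away from 1.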
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
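评分理由: The paper focuses on embedding optimization and causal discovery for biomedical language models, which differs markedly from the multimodal and reinforcement-learning keywords provided. 'Visual Encoder', 'MLLM', 'MultiModal', and 'model-based RL' are entirely unrelated (score 0); 'Unify Models', 'Tokenizer', and 'World Models' have only a weak connection (score 2) and do not touch the core contribution. None of the specified experts appear in the author list, so no bonus is added.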
关键词
Biomedical Language Models, Embedding Optimization, Causal Discovery, Contrastive Learning, Knowledge Graphs, Inference Acceleration, OpenVINO
摘要翻译
随着大型语言模型(LLMs)被部署为智能体,可靠的监控不仅需要知晓其输出内容,还需明确哪些指令正在引导其行为。当模型推断出意外子目标、遵循上下文线索,或受到提示注入和隐藏目标的影响时,这便变得困难。尽管激活到语言(activation-to-language)方法表明隐藏状态可以揭示自然语言信息,但现有方法并非旨在恢复智能体场景中同时激活的完整指令集、约束、禁令和子目标。我们将此问题形式化为指令集检索,并引入 PRISM,这是一种基于激活条件的解释器,它将来自冻结目标模型的隐藏状态解码为忠实的活动指令列表。与先前的激活到语言方法不同,PRISM 直接训练以恢复指令集,使用 judge-guided GRPO 来奖励被覆盖的指令并惩罚未得到支持的指令。在良性、受限、提示注入及隐藏目标场景下,PRISM 优于激活到语言基线方法,尤其是在与安全相关的目标上。
Abstract
As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on LLM interpretability and instruction set retrieval via activation decoding, which does not align with the provided keywords regarding multimodal architecture (Visual Encoder, MultiModal, MLLM), tokenizer design, world models, or model-based RL algorithms. Relevance is minimal (weighted score ~9.0, below passing threshold of 27.8). No expert authors from the specified list were found in the authorship.
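The scoring tables in this report appear to follow a simple rule: each keyword's score is its weight times its relevance, and the row scores are summed before comparison with a threshold. The sketch below reproduces the 9.0 total quoted above under that assumption; the function name and the use of `>=` for the pass check are illustrative guesses, while the weight of 1.5, the relevance values, and the 27.8 threshold come from the table and rationale.

```python
WEIGHT = 1.5            # per-keyword weight shown in the table
PASS_THRESHOLD = 27.8   # threshold quoted in the rationale

# Relevance values (out of 10) from the PRISM paper's score table.
relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 1.0,
    "MLLM": 1.0,
    "MultiModal": 0.0,
    "model-based RL": 2.0,
}

def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords."""
    return sum(weight * r for r in relevance.values())

total = weighted_total(relevance)
passes = total >= PASS_THRESHOLD
```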
关键词
Instruction Set Retrieval, Language Model Activations, Activation-to-Language, Agent Monitoring, Judge-guided GRPO, Hidden Objectives, Prompt Injection
摘要翻译
目标:医疗领域的符合性检查旨在评估患者护理路径是否遵循临床指南。然而,其实际应用往往取决于是否存在形式化的、机器可解释的指南表示,例如计算机可解释指南(CIGs),而这些在实际临床环境中很少见。方法:本研究提出了一种基于大型语言模型(LLMs)编排的模块化框架,旨在直接从非结构化的临床文本和指南文本支持医疗符合性检查,而无需预先定义的 CIGs。所提出的架构整合了多个 LLM 和支持组件,从临床出院记录中提取患者轨迹,从文本临床指南中识别规范性规则,将这些规则转换为可执行脚本,并计算轨迹符合性指标(Trace Conformance Indicator),以量化事件日志中的符合性程度。结果:该框架在亚历山德里亚医院神经科病房的卒中护理领域实施并评估。从医院数据中自动提取了数百条患者轨迹,并基于参考指南导出的 50 条规则进行了评估。分析显示,超过 86% 的可用轨迹符合规范。结论:结果表明,使用编排的 LLMs 进行实际医疗符合性分析是可行的。同时,该研究提供了亚历山德里亚医院在卒中护理指南方面具有高水平依从性的证据。
Abstract
Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
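评分理由: The paper focuses on conformance checking of medical texts using LLM orchestration, which is largely unrelated to the keyword areas of multimodality, world models, and reinforcement learning. Low scores are given only for its integration of multiple LLMs and the implicit use of a tokenizer; the vision, RL, and world-model items have no connection.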
关键词
LLM Orchestration, Conformance Checking, Stroke Care, Clinical Guidelines, Text Processing, Healthcare AI, Process Mining
摘要翻译
强化学习(RL)已成为后训练大语言模型(LLMs)的关键组成部分。在实践中,由于训练 - 推理不匹配及策略陈旧,大语言模型的强化学习(LLM RL)通常属于非策略方法,这使得信任域控制对于实现稳定优化至关重要。主流方法如 PPO 和 GRPO 通过比率裁剪机制近似实现这种控制,但在长尾词汇中,重要性比率可能是分布偏移的糟糕代理。近期工作如 DPPO 通过用基于散度的掩码替换比率裁剪来解决这种不匹配,从而生成一个由采样令牌的绝对概率偏移定义的信任域。然而,DPPO 仍依赖硬掩码:一旦令牌在有害方向上跨越信任域边界,其梯度会被丢弃而非修正。为解决这一问题,我们提出了散度正则化策略优化(DRPO),它将硬掩码替换为基于策略偏移的平滑优势加权二次正则化项。DRPO 保留了与 DPPO 相同的信任域几何结构,同时诱导产生有界连续的梯度权重,这些权重能够衰减发散的更新并提供边界外的修正信号。在模型规模、架构及精度设置上的实验表明,DRPO 提升了大语言模型强化学习(LLM RL)训练的稳定性和效率。
Abstract
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
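The claim that the importance ratio can misjudge distributional shift in long-tailed vocabularies is easy to check numerically. The sketch below contrasts a ratio-based test with an absolute-probability-shift test on two toy tokens. It is a simplified illustration: the thresholds `eps` and `delta` are arbitrary, the ratio test ignores the advantage sign that PPO-style clipping depends on, and neither function is claimed to match the exact rules used by DPPO or DRPO.

```python
def importance_ratio(p_new, p_old):
    """Ratio of new to old policy probability for the sampled token."""
    return p_new / p_old

def absolute_shift(p_new, p_old):
    """Absolute change in the sampled token's probability."""
    return abs(p_new - p_old)

def outside_ratio_band(p_new, p_old, eps=0.2):
    """True if the ratio leaves [1 - eps, 1 + eps] (illustrative eps)."""
    r = importance_ratio(p_new, p_old)
    return r < 1.0 - eps or r > 1.0 + eps

def outside_shift_band(p_new, p_old, delta=0.05):
    """True if the absolute probability shift exceeds delta (illustrative delta)."""
    return absolute_shift(p_new, p_old) > delta

# Toy tokens as (p_new, p_old); values invented for illustration.
rare_token = (1e-5, 1e-6)     # ratio 10, but a negligible change in probability mass
common_token = (0.58, 0.50)   # ratio 1.16, yet 8 points of probability mass moved
```

The rare token trips the ratio test while barely changing the distribution, and the common token passes it despite a much larger shift, which is the mismatch the abstract describes.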
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on LLM RL optimization (DRPO) and lacks content on multimodality, world models, or visual encoders. Keywords like Visual Encoder and MultiModal are unrelated. Tokenizer and MLLM have minor relevance due to token/LLM context. Unify Models is absent. model-based RL is loosely related but the method is policy optimization, not model-based control.
关键词
LLM RL, Divergence Regularization, Policy Optimization, Trust Region, Off-policy, DRPO, Gradient Weights
摘要翻译
在分布漂移下确保大语言模型(LLMs)的可靠性需要推理时适应。虽然 Best-of-N 和拒绝采样等推理时对齐方法被广泛使用,但它们将任务视为一种采样密集、基于奖励的搜索,导致两个关键局限性:其性能受限于基础模型的生成质量,且对不完美的奖励模型的依赖使其容易受到奖励黑客攻击。为了解决这些挑战,我们引入了梯度引导奖励优化(GGRO),这是一种轻量级的推理时方法,通过梯度引导在解码过程中进行有针对性的最小干预。具体而言,GGRO 监控词元级熵以识别指示漂移或对齐偏差的高不确定性区域。检测到后,它通过注入由现成奖励模型的梯度信号生成的引导词元来响应,以引导生成轨迹,而非仅仅重新排序样本。实验表明,GGRO 在安全性、有用性及推理基准上一致改进了推理时对齐。它还增加了高质量响应的覆盖率和对奖励黑客攻击的鲁棒性,且计算开销极小。代码可在 https://github.com/lhk2004/GGRO 获取。
Abstract
Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
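The monitoring step, flagging decoding positions whose token-level entropy is high, can be sketched in a few lines. This covers only the detection half; the threshold value, the function names, and the toy distributions are assumptions, and the gradient-based generation of nudging tokens is not shown.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_uncertainty_steps(step_distributions, threshold):
    """Indices of decoding steps whose entropy exceeds the threshold."""
    return [
        i for i, probs in enumerate(step_distributions)
        if token_entropy(probs) > threshold
    ]

# Toy next-token distributions over a 4-token vocabulary (invented).
steps = [
    [0.97, 0.01, 0.01, 0.01],   # confident
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain
    [0.70, 0.20, 0.05, 0.05],   # moderately confident
]
flagged = high_uncertainty_steps(steps, threshold=1.0)
```

Only the flagged positions would receive an intervention, which is how a method of this kind can keep its overhead small compared with resampling whole responses.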
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
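评分理由: The paper focuses on inference-time alignment and gradient-guided optimization for large language models; it does not involve multimodal architectures (MultiModal, MLLM, Visual Encoder), world models, or model-based RL. Tokenizer is slightly relevant because token-level entropy is mentioned; Unify Models is not discussed.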
关键词
Inference-time Alignment, Gradient-Guided Reward Optimization, Large Language Models, Reward Model, Token-level Entropy, Decoding, Distribution Drift
摘要翻译
手写文本识别(HTR)技术的进步使得历史文献的大规模转录成为可能,但仍难以提供用于古文字学(Paleography,研究历史手稿的学科)的可解释视觉测量。本文的主要见解在于,形态学脚本分析(morphological script analysis,尤其是从行级转录中学习字符原型的能力)能够定义出可扩展、有意义且稳定的古文字测量指标。更具体而言,我们利用基于 Transformer 的检测架构结合基于原型的行重建模块,学习典型字符及其出现频率、变形和定位方式。本文的贡献主要体现在两个方面。首先,我们提出了一种深度架构和学习方法,仅需行级转录监督即可实现高效的字符建模,显著优于 Learnable Typewriter 基线,并能实现准确的字符边界框预测,从而释放其在古文字测量方面的潜力。其次,我们介绍并展示了由我们的架构所启发的针对字符、bi-grams(双字符)以及图形单元之间空间的自动测量在古文字学上的相关性。为展示这一效果,我们将巴黎国家图书馆手稿(Codex Paris, BnF, fr. 2813)的注释扩展至 160 页。该手稿由查理五世于 14 世纪晚期委托,由四位抄写员抄写。我们在这些页面上可视化我们的测量结果,展示了它们不仅能帮助我们区分不同的图形特征,还能发现并分析细微的差异。这一案例研究概述了该方法的可扩展性及其在训练数据需求上的节约性,因为仅需单列文本就足以计算这 160 页上的测量结果。数据和代码公开可用:https://malamatenia.github.io/morphology4metrology-analysis.
Abstract
Advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interpretable visual measurements for paleography, the study of historical scripts. In this paper, our main insight is that morphological script analysis, in particular the capacity to learn character prototypes from line-level transcriptions, enables the definition of scalable, meaningful, and stable paleographic measurements. More precisely, we leverage a transformer-based detection architecture together with a prototype-based line reconstruction module to learn prototypical characters and their occurrence, deformation, and positioning. Our contributions are twofold. First, we introduce a deep architecture and learning methodology that enables efficient character modeling with only line-level transcription supervision, significantly improving over the Learnable Typewriter baseline and enabling accurate character bounding box prediction, unlocking its potential for paleographic measurements. Second, we introduce and demonstrate the paleographical relevance of automatic measurements enabled by our architecture for characters, bi-grams, and spaces between graphical units. For this demonstration, we extend the annotations of the codex Paris, BnF, fr. 2813, commissioned in the late fourteenth century by Charles V and copied by four hands, to 160 pages. We visualize our measurements over these pages, showing how they enable us not only to differentiate graphical profiles, but also to discover and analyze subtle variations. This case study outlines the scalability of our approach and its frugality in terms of required training data, since a single column of text is sufficient to compute our measurements on each of the 160 pages. Data and code are publicly available at: https://malamatenia.github.io/morphology4metrology-analysis.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on historical script metrology using transformer-based detection and prototype learning, showing no direct relevance to Unify Models, Tokenizer, World Models, MLLM, or model-based RL (0 points). It involves visual encoding (3 points) and multimodal supervision (3 points) but lacks foundation model scale or reinforcement learning components. No specified expert authors are found.
关键词
Historical Script, Morphology, Metrological Analysis, Transformer-based Detection, Character Prototypes, Paleographic Measurements, Line-level Supervision
摘要翻译
低空无人机的视频语义分割需要保持时间一致性,然而稠密光流在占据航空影像主导地位的平面区域中引入了空间结构化噪声。我们提出了一种零参数几何门,该门利用 $16\times16$ 空间网格上的 RANSAC 单应性内点比率,将每个区域引导至单应性变换或光流形变,随后通过语义相似性传播(SSP)进行融合。该门无需学习参数——仅基于 RANSAC 统计量的中值阈值二元决策——仅向冻结骨干网络添加 211K 个可训练参数(即 SSP 融合层)。在合成 UAVid 数据集上,该方法在两种架构(SegFormer-b2 和 Hiera-S+UPerNet)上相对于基线模型实现了 +4.24% 至 4.91% 的 mIoU 提升。机制诊断表明,平面区域中的光流残差具有空间自相关性(莫兰指数 Moran's I = 0.32,$p < 0.001$),可预测边界不稳定性(Spearman $ρ= 0.66$),且刚性化处理在单应性有效区域中将时间一致性从 62% 恢复至 92%(提升 29.5 个百分点)。
Abstract
Video semantic segmentation for low-altitude UAVs requires temporal consistency, yet dense optical flow introduces spatially structured noise in the planar regions that dominate aerial imagery. We propose a zero-parameter geometric gate that uses RANSAC homography inlier ratios on a $16\times16$ spatial grid to route each region to either homography or optical flow warp before fusion via Semantic Similarity Propagation (SSP). The gate requires no learned parameters (only a median-threshold binary decision on RANSAC statistics), so the method adds only 211K trainable parameters (the SSP fusion layer) to a frozen backbone. On synthetic UAVid, the method achieves +4.24-4.91% mIoU improvement over base models across two architectures (SegFormer-b2 and Hiera-S+UPerNet). Mechanism diagnostics reveal that flow residuals in planar regions are spatially autocorrelated (Moran's I = 0.32, $p < 0.001$), predict boundary instability (Spearman $\rho = 0.66$), and that rigidification recovers temporal consistency from 62% to 92% (+29.5pp) in homography-valid regions.
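A parameter-free gate of the kind described can be sketched as a median threshold over a grid of per-cell inlier ratios. The routing direction (cells above the median go to the homography warp) and the strict inequality are assumptions made for this example, and the grid values are invented; the RANSAC fitting that would produce the ratios is not shown.

```python
from statistics import median

def geometric_gate(inlier_ratios):
    """Binary routing mask from per-cell RANSAC inlier ratios.

    True  -> warp the cell with the homography (assumed direction)
    False -> warp the cell with dense optical flow
    """
    flat = [r for row in inlier_ratios for r in row]
    threshold = median(flat)  # no learned parameters: the data sets the cut
    return [[r > threshold for r in row] for row in inlier_ratios]

# Toy 16x16 grid (invented): planar left half, cluttered right half.
grid = [[0.9] * 8 + [0.3] * 8 for _ in range(16)]
mask = geometric_gate(grid)
n_homography_cells = sum(cell for row in mask for cell in row)
```

Because the threshold is a statistic of the current frame pair rather than a trained weight, the gate itself contributes nothing to the trainable parameter count.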
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 4.0/10 | 6.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
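评分理由: The paper concentrates on geometric-consistency optimization for low-altitude UAV video semantic segmentation; its core contribution is a zero-parameter geometric gating mechanism (routing between RANSAC homography and optical flow). It has very little connection to the multimodal large model (MLLM), world model, and reinforcement learning (RL) areas behind the keywords. Only Visual Encoder has some relevance, since one is used as the backbone; Tokenizer, World Models, and model-based RL are not involved; no specified expert author matches.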
关键词
Zero-Parameter Geometric Gating, Video Semantic Segmentation, Low-Altitude UAV, RANSAC Homography, Optical Flow, Temporal Consistency, Semantic Similarity Propagation
摘要翻译
我们提出拓扑神经算子(TNOs),这是一种在胞腔复形上进行算子学习的严谨框架,它将神经算子(NOs)从点和/或边上的函数推广至拓扑域。TNOs 将数据表示为定义在不同维度胞腔上的特征,并通过离散外微积分(Discrete Exterior Calculus)建模它们的相互作用,通过梯度、旋度和散度型算子实现显式的跨维度耦合。关键设计原则是将信息流动的位置(由固定的拓扑算子规定)与其变换方式(这是学习得到的)解耦,从而生成尊重物理量几何支撑并揭示守恒和相容结构的模型。我们进一步提出分层 TNOs(HTNOs),它们引入学习的粗复形以传播长距离及拓扑依赖的信息。我们的框架将现有的神经算子(NOs)作为特例包含在内,为跨离散化的算子学习提供了统一视角。在一组偏微分方程(PDE)基准测试中,包括不规则几何流问题,TNOs 和 HTNOs 提高了精度;对照研究进一步凸显了原生高阶(秩)及拓扑结构的优势。项目页面:https://circle-group.github.io/research/TNO
Abstract
We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO
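The fixed topological operators of Discrete Exterior Calculus can be illustrated on the smallest possible cell complex, a single oriented triangle. The sketch below writes the node-to-edge operator (a discrete gradient) and the edge-to-face operator (a discrete curl) as signed incidence matrices and checks that the curl of a gradient vanishes. The edge ordering, orientations, and node values are choices made for this example and are not taken from the paper.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Triangle with nodes 0, 1, 2 and oriented edges e0: 0->1, e1: 1->2, e2: 0->2.
# D0 maps node values to edge differences (discrete gradient).
D0 = [
    [-1, 1, 0],   # e0
    [0, -1, 1],   # e1
    [-1, 0, 1],   # e2
]
# D1 maps edge values to the face circulation (discrete curl).
# The face is oriented 0 -> 1 -> 2 -> 0, so e2 is traversed backwards.
D1 = [[1, 1, -1]]

node_values = [2.0, 5.0, 11.0]
edge_gradient = matvec(D0, node_values)   # differences along each edge
face_curl = matvec(D1, edge_gradient)     # circulation around the face
```

These matrices depend only on how cells are glued together, so they stay fixed while any learned transformation acts on the features they route, which matches the decoupling the abstract describes.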
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on topological neural operators for PDEs and scientific computing, which is largely unrelated to the provided keywords targeting Multimodal/LLM/RL domains (Tokenizer, Visual Encoder, World Models, MLLM, model-based RL). 'Unify Models' is moderately relevant (5.0) as the paper proposes a unified perspective on operator learning across discretizations. No expert authors from the specified list are present.
关键词
Topological Neural Operators, Operator Learning, Cell Complexes, Discrete Exterior Calculus, PDE Benchmarks, Hierarchical TNOs, Geometric Support
摘要翻译
针对数字资源匮乏的原住民语言的神经机器翻译(NMT)常因极端数据稀缺而受阻,从而依赖提取式网络爬取。为确保数据主权,本研究提出了一种数据合成方法,旨在无需爬取目标语言平行文本的情况下引导神经机器翻译模型。聚焦于克奇奇语(Q'eqchi' Mayan),我们将社区来源的词典转化为大规模合成语料库,并在 mT5-base 模型上利用参数高效微调(PEFT)配合 LoRA 适配器进行训练。域内评估展示了高结构习得能力(BLEU 42.02),证明合成约束能有效教授复杂的黏着形态学和 VOS 词序。然而,针对自然词汇表的评估揭示了一个结构 - 语义鸿沟(BLEU 0.59),模型虽保持了语法完整性,却缺乏自然语言的词汇基础。模型表现出对合成模板受限结构方差的过拟合;尽管流程中语义熵较高,但其难以应对自然语言的句法流畅性,迫使自然输入进入僵化的学习模式。此外,采用多任务学习(Multi-Task Learning)架构的消融研究导致了负迁移,表明辅助任务在 LoRA 适配器内竞争有限的参数容量,导致过度优化合成标记,牺牲了自然灵活性。最终,我们确立合成引导是一种高效的结构入门方法,但需通过课程学习(Curriculum Learning)利用真实数据进行语义精炼。
Abstract
Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.
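For readers unfamiliar with LoRA, the following minimal sketch shows the standard low-rank update applied to one frozen weight matrix: the output is `W x + (alpha / r) * B (A x)`, where only `A` and `B` are trained. The matrix sizes, the numeric values, and the function names are hypothetical and chosen for readability; nothing here reflects the paper's mT5-base configuration.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product for matrices stored as lists of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Return W x + (alpha / r) * B (A x), with r the adapter rank.

    W is frozen (d_out x d_in); A (r x d_in) and B (d_out x r) are trainable.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


# Hypothetical toy sizes: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[0.5, -0.5, 1.0]]
B_init = [[0.0], [0.0]]      # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0], [2.0]]   # made-up values standing in for a trained adapter
x = [1.0, 2.0, 3.0]

before = lora_forward(W, A, B_init, x, alpha=2.0)
after = lora_forward(W, A, B_trained, x, alpha=2.0)
```

Because the adapter holds only `r * (d_in + d_out)` trainable values, its capacity is limited, which is the resource the abstract suggests the auxiliary tasks compete for.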
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
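评分理由: The paper's core content is data synthesis and parameter-efficient fine-tuning (PEFT) for low-resource neural machine translation (NMT), training LoRA adapters on an mT5 model. It does not involve multimodal learning, visual encoders, world-model construction, or reinforcement learning, so apart from weak links to Tokenizer (using mT5 involves tokenization) and Unify Models (fine-tuning adjusts model parameters), the remaining keywords are unrelated.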
关键词
Data Synthesis, Parameter-Efficient Fine-Tuning, Low-Resource NMT, LoRA adapters, Synthetic Corpus, Multi-Task Learning, mT5-base, Q'eqchi' Mayan
摘要翻译
现有的深度研究代理(DRAs)基准评测仅评估单次输出,忽略了一个关键问题:当受到反馈引导时,DRAs 能否改进其报告?为此,我们在两种反馈设置下对 DRAs 进行了多轮评估:自我反思,即代理在不使用任何外部诊断信号的情况下修订其报告;以及过程级反馈,即代理接收针对其研究策略中差距的指导。为了实现过程级反馈,我们设计了研究差距推断(RGI),该方法通过分析满足与不满足的评分标准模式来推断研究过程差距。我们的分析揭示了三个关键发现:(i) 在自我反思下,代理在评分标准上的采纳率与退化率几乎相等,导致净改善微不足道;(ii) 一轮过程级反馈带来显著提升,使归一化分数提高约 8-15 分,并实现约 35%-40% 的采纳率;(iii) 这些收益不会在后续轮次中累积,因为代理在重写完整报告以解决剩余差距时,会对高达 24% 的先前满足标准发生退化。即使有针对性指导,可靠的多轮改善对我们所评估的 DRA 架构而言仍然难以实现。我们的代码和结果已在 https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs 上公开。
Abstract
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
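The incorporation and regression rates quoted in the abstract can be read as per-turn transition rates over rubric criteria. The sketch below is one plausible way to compute them, assuming that incorporation is measured over previously unsatisfied criteria and regression over previously satisfied ones; the denominators, the function name, and the example rubric are assumptions, not the authors' definitions.

```python
def turn_transition_rates(before, after):
    """Compare rubric outcomes across two turns.

    `before` and `after` map each criterion name to True (satisfied) or False.
    Returns (incorporation_rate, regression_rate).
    """
    previously_unmet = [c for c, ok in before.items() if not ok]
    previously_met = [c for c, ok in before.items() if ok]
    incorporated = sum(1 for c in previously_unmet if after[c])
    regressed = sum(1 for c in previously_met if not after[c])
    incorporation_rate = incorporated / len(previously_unmet) if previously_unmet else 0.0
    regression_rate = regressed / len(previously_met) if previously_met else 0.0
    return incorporation_rate, regression_rate


# Hypothetical rubric with six criteria.
turn_1 = {"c1": True, "c2": True, "c3": True, "c4": True, "c5": False, "c6": False}
turn_2 = {"c1": True, "c2": True, "c3": True, "c4": False, "c5": True, "c6": False}

incorporation, regression = turn_transition_rates(turn_1, turn_2)
net_change = sum(turn_2.values()) - sum(turn_1.values())
```

In this made-up example one criterion is gained and one is lost, so the net number of satisfied criteria is unchanged even though the report was revised, which is the kind of cancellation the self-reflection finding describes.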
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on evaluating Deep Research Agents (DRAs) via multi-turn feedback loops (Research Gap Inference), which is unrelated to multimodal model architectures (Tokenizer, Visual Encoder, MultiModal), world models, or model-based RL algorithms. While DRAs may utilize MLLMs, the contribution is methodological evaluation rather than model design, resulting in low relevance scores for architectural keywords. No expert authors from the specified list were found in the authorship.
关键词
Deep Research Agents, Multi-Turn Evaluation, Process-Level Feedback, Research Gap Inference, Self-Reflection, Agent Feedback, Evaluation Benchmark
摘要翻译
多轮 LLM 代理将模型调用与外部工具调用交错进行,将服务从无状态请求处理转变为有状态程序执行。服务这些工作负载需要调度、KV-cache 管理和路由策略,这些策略利用程序级上下文,包括轮次依赖、工具诱导的间隙以及可重用的 KV 状态。直接在真实系统上评估这些策略成本高昂,因为每个设计点可能需要在到达率、模型规模、服务实例数量和存储层次结构等方面占用专用的加速器时间。仿真提供了一种可扩展的替代方案,但现有的 LLM 服务仿真器针对无状态请求级工作负载,因此忽略了代理服务的核心机制:多轮程序执行、跨轮次缓存局部性以及工具间隙期间的 KV-cache 驻留。我们提出 AGENTSERVESIM,一种用于多轮 LLM 代理服务的硬件感知仿真器。AGENTSERVESIM 通过可组合模块,在程序粒度上评估服务策略:一个程序编排器 (Program Orchestrator) 保留程序身份和轮次顺序,一个工具模拟器 (Tool Simulator) 模拟工具诱导的间隙,一个会话感知路由器 (Session-Aware Router) 维持程序到实例的亲和性以实现感知缓存的分发,以及一个 KV 驻留模型 (KV Residency Model) 跟踪策略定义的 KV 放置,跨越 HBM、主机 DRAM/CXL 及驱逐过程。在真实服务部署和硬件配置下,AGENTSERVESIM 在关键性能指标上以 6% 以内的误差重现真实系统行为,且完全运行在通用 CPU 上。这些结果表明,AGENTSERVESIM 使得代理服务策略的受控、可重复探索成为可能,而无需在昂贵的加速器上进行详尽部署。
Abstract
Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on LLM serving infrastructure simulation (KV-cache, routing, hardware awareness) rather than model architecture or learning paradigms. There is no mention of multimodal inputs, visual encoders, tokenizer design, world models, or model-based RL. 'MLLM' and 'Tokenizer' have slight relevance due to LLM context but are not core contributions. 'Unify Models' is unrelated.
关键词
Hardware-aware Simulator, Multi-Turn LLM Agent Serving, KV-cache Management, Program Orchestrator, Tool Simulator, Session-Aware Router, KV Residency Model
摘要翻译
现有的针对长上下文 LLM 推理的稀疏注意力(sparse attention)和 KV 缓存(KV cache)压缩方法通常在所有注意力头(attention heads)上应用固定的稀疏模式或统一预算,忽略了不同注意力头之间以及不同上下文之间注意力行为的显著差异。我们观察到注意力头中存在两种不同的熵(entropy)模式:刚性头(Rigid Heads),其熵在整个输入段(input segments)中保持接近零;以及动态头(Dynamic Heads),其熵显著波动。关键在于,这些类型的分布是上下文相关(context-dependent)的,无法在离线(offline)阶段预先确定。因此,我们提出 EntropyInfer,一个无需训练(training-free)的框架,它在预填充(prefilling)阶段利用注意力熵(attention entropy)以单个头(head)和段(segment)为粒度自适应分配计算。对于解码(decoding),我们引入了一种潜在(latent)KV 缓存压缩方案,该方案利用生成的输出令牌(generated output tokens),而不仅仅是预填充令牌(prefill tokens),来识别并保留最关键的缓存条目(cache entries)。在 Llama、Qwen 和 openPangu 模型系列上的广泛实验表明,EntropyInfer 始终优于包括 SnapKV、AdaKV 和 CritiPrefill 在内的基线(baselines),在超过 100k 令牌(tokens)的场景下实现高达 2.39 倍的端到端加速(end-to-end speedup),且相比完整注意力(full attention)的质量退化(degradation)最小。代码已发布在 https://github.com/SHA-4096/EntropyInfer。
Abstract
Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.
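To make the Rigid/Dynamic distinction concrete, the sketch below computes the Shannon entropy of a head's attention distribution for each input segment and labels the head by whether the entropy ever leaves a small band around zero. The threshold value, the single-distribution-per-segment simplification, and the function names are assumptions; the abstract does not specify EntropyInfer's actual criterion.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)


def classify_head(segment_distributions, rigid_threshold=0.1):
    """Label a head from its per-segment attention distributions.

    `rigid_threshold` is a hypothetical cutoff: a head is called "rigid"
    if its entropy stays below it on every segment, otherwise "dynamic".
    """
    entropies = [attention_entropy(w) for w in segment_distributions]
    label = "rigid" if max(entropies) < rigid_threshold else "dynamic"
    return label, entropies


# Made-up attention rows over four keys, one row per input segment.
rigid_head = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
dynamic_head = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]

rigid_label, rigid_entropies = classify_head(rigid_head)
dynamic_label, dynamic_entropies = classify_head(dynamic_head)
```

Since the labels depend on the distributions observed for the current input, a scheme like this has to run online during prefilling, consistent with the abstract's point that head types cannot be fixed offline.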
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
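评分理由: The paper focuses on inference-efficiency optimization for long-context LLMs: it uses attention entropy to distinguish Rigid Heads from Dynamic Heads and allocate compute adaptively, and it proposes a KV cache compression scheme. It does not involve multimodality, visual encoders, world models, or reinforcement learning, and the tokenizer is not a core object of study, so the related keywords score low. The weighted total is 7.5, far below the dynamic passing score of 27.8, indicating a poor match between the paper's topic and the given keyword set.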
关键词
Long-Context LLMs, Entropy-Guided Inference, Attention Heads, KV Cache Compression, Adaptive Compute Allocation, Rigid Heads, Dynamic Heads
摘要翻译
成对比较结合如 Elo 等聚合方法已成为评估生成模型的核心手段,但仍有人担心它们会奖励表面风格线索或表现出评判者偏见。更为积极的是,我们发现当存在可用于比较的真值时,成对比较得出的模型排名与基于真值的准确率排名高度一致。通过将五个著名的基准测试转换为自由形式生成评估,我们发现 Elo 排名与准确率排名的 Spearman 相关系数高于 0.9,且在评判者能力较弱时显著优于直接评估。此外,尽管大多数判断发生在两个候选答案均正确(或均错误)的成对样本上,风格和评判者偏见对模型排名的影响微乎其微。在这些成对样本上,我们发现最终答案后的重复(回声)是评判者偏好的因果驱动因素。
Abstract
Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
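The comparison described in the abstract, Elo rankings from pairwise judgments versus accuracy rankings, can be sketched with the textbook Elo update and a tie-free Spearman correlation. The K-factor, the initial rating, the match list, and the accuracy values below are hypothetical; the paper's aggregation settings are not stated in the abstract.

```python
def expected_score(r_a, r_b):
    """Probability that the player rated r_a beats the player rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(matches, k=32.0, initial=1000.0):
    """Run sequential Elo updates over (winner, loser) pairs."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, initial)
        r_l = ratings.setdefault(loser, initial)
        gain = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + gain
        ratings[loser] = r_l - gain
    return ratings


def ranks(scores):
    """Rank 1 = highest score; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score dicts over the same keys, no ties."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical judge outcomes and ground-truth accuracies for three models.
matches = [("A", "B"), ("A", "C"), ("B", "C")] * 5
accuracy = {"A": 0.9, "B": 0.7, "C": 0.5}

elo = elo_ratings(matches)
rho = spearman(elo, accuracy)
```

With a judge that always prefers the more accurate model, the two rankings coincide and the correlation is 1; noisier judgments would lower it, which is the quantity the abstract reports as staying above 0.9.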
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
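评分理由: The paper mainly concerns evaluation methodology for generative models (pairwise comparisons, Elo ratings, correlation with accuracy) rather than model architecture components (Tokenizer, Visual Encoder), specific model paradigms (World Models, Unify Models), or reinforcement learning (model-based RL). Although the evaluated models may include MLLMs or multimodal generative models, the abstract does not mention these technical details explicitly, so relevance is low.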
关键词
Pairwise comparisons, Generative models, Accuracy rankings, Elo ratings, Judge bias, Style cues, Echo effect
摘要翻译
交通数据的高效获取、存储和利用是时空数据管理中的关键挑战。大多数交通数据系统为了降低存储和计算成本,会在固定的、粗粒度时间间隔内收集并存储观测数据。然而,这种粗粒度数据严重限制了需要更细粒度时间预测的下游应用。在所有地点和时间段收集并维护细粒度交通数据会给数据库存储和预处理流程带来巨大负担。为了解决这种时间粒度不匹配问题,我们提出了一个新问题:利用粗粒度采样数据预测细粒度未来交通。我们提出了时空细化预测器(STRP),这是一个面向时空数据系统的粒度感知框架。STRP 集成了两个组件:用于高效且可解释的空间依赖建模的树卷积(Tree Convolution),以及用于渐进式时间外推的逆扩张卷积(Inverse Dilated Convolution)。STRP 支持两种实用的预测设置:基于窗口(window-based)和基于持续时间(duration-based),以处理不同形式的粒度不匹配。在六个基准数据集上的实验表明,STRP 在准确性和效率方面显著优于最先进基线。我们的工作提供了一种实用且可解释的方法,用于管理时空交通数据系统中的粒度不匹配。
Abstract
Efficient acquisition, storage, and utilization of traffic data are critical challenges in spatio-temporal data management. Most traffic data systems collect and store observations at fixed, coarse-grained temporal intervals to reduce storage and computation costs. However, such coarse-grained data severely limits downstream applications that require predictions at a finer temporal granularity. Collecting and maintaining fine-grained traffic data across all locations and time periods would impose a substantial burden on database storage and preprocessing pipelines. To address this temporal granularity mismatch, we formulate a novel problem: predicting fine-grained future traffic using coarse-grained sampled data. We propose the Spatial-Temporal Refinement Predictor (STRP), a granularity-aware framework for spatio-temporal data systems. STRP integrates two components: Tree Convolution for efficient and interpretable spatial dependency modeling, and Inverse Dilated Convolution for progressive temporal extrapolation. STRP supports two practical prediction settings: window-based and duration-based, to handle different forms of granularity mismatch. Experiments on six benchmark datasets show that STRP significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Our work offers a practical and interpretable approach to managing granularity mismatches in spatio-temporal traffic data systems.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
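评分理由: The paper's core is spatio-temporal traffic prediction with convolutional networks, which matches poorly with the keyword set's themes of multimodal large models (MLLM), Tokenizer, visual encoders, and reinforcement learning. It touches on state prediction (akin to world models) but contains no language-model, multimodal-fusion, or RL-framework content, and relates to Unify Models only slightly through its integration of spatial and temporal modeling.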
关键词
Spatio-Temporal Data, Traffic Prediction, Coarse-to-Fine, Temporal Granularity, Tree Convolution, Inverse Dilated Convolution, STRP Framework, Data Management
摘要翻译
在线策略蒸馏(OPD)通过要求教师模型对学生模型生成的轨迹进行评分,提供密集的词元级监督。然而,当学生模型漂移至一个不可恢复的前缀时,教师模型可能会局部同意这种退化状态,产生较低的反向 KL 散度,却缺乏有效的纠正性训练信号。我们将这种持续状态称为低 KL 一致性陷阱(low-KL agreement trap)。进一步分析表明,此类陷阱期间及之后的词元产生的监督信号效用较低。我们提出 KAT(KL 一致性陷阱终止机制),这是一种在线 OPD 终止规则,通过动态训练自适应阈值检测持续的低 KL 一致性。通过过滤退化一致性中的弱监督,KAT 在四个数学基准上将 avg@k 准确率提高了 2.66%,pass@k 提高了 3.43%,同时将平均轨迹长度减少了 59.73%。
Abstract
On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
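A rough sketch of the idea of detecting persistent low-KL agreement follows: compute the per-token reverse KL between student and teacher next-token distributions, then find the first run of consecutive tokens whose KL stays under a threshold. The fixed threshold, the `patience` window, and truncating at the start of the run are all stand-in assumptions; the abstract describes a dynamic, training-adaptive threshold whose form is not given here.

```python
import math


def reverse_kl(student, teacher):
    """KL(student || teacher) for one next-token distribution (teacher must be > 0 where student is)."""
    return sum(p * math.log(p / q) for p, q in zip(student, teacher) if p > 0.0)


def first_trap_position(per_token_kl, threshold, patience):
    """Return the index where a run of `patience` consecutive low-KL tokens begins, or None."""
    run = 0
    for i, kl in enumerate(per_token_kl):
        run = run + 1 if kl < threshold else 0
        if run >= patience:
            return i - patience + 1
    return None


# Made-up per-token reverse-KL values along one student rollout.
per_token_kl = [0.8, 0.5, 0.02, 0.6, 0.01, 0.02, 0.01, 0.4]
trap_start = first_trap_position(per_token_kl, threshold=0.05, patience=3)

# Hypothetical termination rule: keep supervision only up to the trap.
kept = per_token_kl if trap_start is None else per_token_kl[:trap_start]
```

Note that the isolated low value at index 2 does not trigger termination; only sustained agreement does, which is what distinguishes a trap from an ordinary confident token in this sketch.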
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on On-policy distillation (OPD) and KL divergence for mathematical benchmarks, proposing KAT to avoid low-KL agreement traps. It does not involve multimodal data, visual encoders, tokenizer design, world models, or MLLM architectures. While it uses RL terminology (rollouts, student drift), it is primarily a distillation technique rather than model-based RL or unified multimodal modeling. Weighted score is 7.5, below the dynamic passing score of 27.8.
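The weighted score quoted in the rationale can be reproduced from the table above by multiplying each keyword's weight by its relevance and summing. The sketch below does exactly that; the passing score of 27.8 is taken from the rationale as a given constant, and treating "pass" as `total >= threshold` is an assumption, since the report does not say how the dynamic passing score is derived or applied.

```python
# (keyword, weight, relevance out of 10), copied from the score table above.
rows = [
    ("Unify Models", 1.5, 2.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 0.0),
    ("model-based RL", 1.5, 3.0),
]

PASSING_SCORE = 27.8  # quoted in the rationale; its derivation is not shown in the report


def weighted_total(table_rows):
    """Sum of weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in table_rows)


total = weighted_total(rows)
passes = total >= PASSING_SCORE  # assumed comparison rule
```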
关键词
On-policy distillation, KL Agreement Trap, KAT, Teacher-student, Rollouts, Mathematical benchmarks, Token-level supervision, Reverse KL
摘要翻译
在资源受限的边缘设备上部署安全的大型语言模型(LLMs)面临严峻挑战:虽然结合 LLMs 与守护模型(guard models)的双模型系统能提供有效的安全保证,但其巨大的内存和计算需求使得设备端部署的成本过高。本文针对资源受限环境,开展了一项关于参数高效安全对齐方法的综合研究。通过对多种 LLM 架构、训练目标和参数高效微调方法的系统评估,我们发现软提示(soft prompts)结合基于蒸馏的训练一贯优于其他替代方法。我们引入了基于总变差(Total Variation)和 KL 散度(KL divergence)的蒸馏框架,能有效将守护模型中的安全行为转移到学习到的软提示中。我们在各种基准测试上的评估表明,与 LoRA 适配器(LoRA adapters)、引导向量(steering vectors)和直接优化方法相比,这种组合实现了更优的安全 - 效用权衡,同时在推理时间仅需极少的额外内存和计算量。这些发现确立了软提示蒸馏(soft prompt distillation)作为设备端 LLM 部署中安全对齐的首选方法。
Abstract
Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
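评分理由: The paper centers on safety alignment and distillation for LLMs on edge devices and does not involve multimodal architectures, visual encoders, world models, or reinforcement learning, so Visual Encoder, World Models, and MultiModal score 0. Tokenizer and model-based RL have weak links (LLMs imply a tokenizer; RL is mentioned as background) and score 1. Unify Models scores 2, referring to the unification of safety behavior. MLLM scores 1 because the work involves LLMs but is not multimodal.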
关键词
Safe LLM Systems, Soft Prompts, On Device Settings, Distillation, Safety Alignment, Edge Devices, Guard Models
摘要翻译
理解 Transformer 表示在层间如何演化,而不仅仅是它们编码的内容,仍然是机制可解释性领域的一个开放性问题。我们借鉴计算神经科学的几何工具,将 Transformer 前向传播(forward pass)重新表述为通过高维表示流形(representation manifold)的离散种群轨迹。与探测预设特征不同,我们直接在环境空间(ambient space)中使用五个指标来表征轨迹几何:轨迹长度(trajectory length)、曲率(curvature)、语义收敛指数(semantic convergence index)、层间余弦相似度(layerwise cosine similarity)和表示稳定性(representational stability)。基于三个模型家族(GPT-2, TinyLlama, Qwen2.5)和五个受控提示家族,我们报告了四个发现。首先,语义相关的提示在中间至后期层显著收敛(峰值 CI 0.41--0.58, p<0.001, Mann-Whitney U 检验),这与类似吸引子(attractor-like)的动力学一致。其次,推理任务产生的轨迹曲率大于词汇变化(0.71--0.83 rad 对比 0.27--0.31 rad),表明曲率编码了计算复杂性(computational complexity)。第三,模糊令牌(ambiguous tokens)表现出轨迹分叉(trajectory bifurcation),在最终层达到高达 5.6 倍的表示分离(representational separation),而在明确控制中不存在。第四,层间余弦相似度揭示了一个通用的三阶段结构:编码(encoding)、elaboration 和输出准备(output preparation),在所有三种架构中保持一致。所有四个效应在打乱层(shuffled-layer)和随机嵌入(random-embedding)控制下均消失。我们发布了一个完全开源、模型无关(model-agnostic)的管道,并认为轨迹几何构成了机制可解释性的一个原则性、无探针(probe-free)的视角。
Abstract
Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41--0.58, p<0.001, Mann-Whitney U), consistent with attractor-like dynamics. Second, reasoning tasks produce trajectories of greater curvature than lexical variations (0.71--0.83 rad vs. 0.27--0.31 rad), suggesting curvature encodes computational complexity. Third, ambiguous tokens exhibit trajectory bifurcation with up to 5.6x representational separation by the final layer, absent in unambiguous controls. Fourth, layerwise cosine similarity reveals a universal three-phase structure: encoding, elaboration, and output preparation, consistent across all three architectures. All four effects vanish under shuffled-layer and random-embedding controls. We release a fully open-source, model-agnostic pipeline and argue that trajectory geometry constitutes a principled, probe-free lens for mechanistic interpretability.
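Two of the five metrics named in the abstract, trajectory length and curvature, have natural discrete definitions that can be sketched directly: length as the sum of distances between consecutive layer representations, and curvature as the turning angle (in radians) between consecutive displacement vectors. These definitions and the toy 2-D "layers" are assumptions for illustration; the paper may define or normalize the metrics differently.

```python
import math


def displacement(a, b):
    """Vector from state a to state b."""
    return [y - x for x, y in zip(a, b)]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in v))


def trajectory_length(states):
    """Total path length through the sequence of layer representations."""
    return sum(norm(displacement(a, b)) for a, b in zip(states, states[1:]))


def turning_angles(states):
    """Angle in radians between consecutive displacement vectors (assumes nonzero steps)."""
    steps = [displacement(a, b) for a, b in zip(states, states[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        cosine = sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))
        angles.append(math.acos(max(-1.0, min(1.0, cosine))))  # clamp for float safety
    return angles


# Hypothetical hidden states of one token across three layers, in 2-D for readability.
straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
bent = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

straight_angles = turning_angles(straight)
bent_angles = turning_angles(bent)
```

Both toy trajectories have the same length but different turning angles, which shows why length and curvature are reported as separate metrics.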
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on mechanistic interpretability of text transformers using geometric trajectory analysis. It identifies universal patterns across model families (Unify Models) and analyzes state trajectories (World Models analogy), but lacks content on multimodal components (MLLM, MultiModal, Visual Encoder), reinforcement learning (model-based RL), or tokenizer design.
关键词
Transformer Representations, Trajectory Geometry, Mechanistic Interpretability, Layer-wise Evolution, Semantic Convergence, Computational Complexity, Representation Manifold
摘要翻译
撰写个别化教育计划(IEPs)是一项高劳动强度、知识密集型文档任务;现有英语研究已证明生成式人工智能(AI)可显著减少起草时间,然而由于领域数据稀缺、严格的隐私法规以及缺乏本地评估基准,繁体中文环境下的自动化 IEP 生成几乎尚未被探索。我们提出一个以语料库特征扩散(CGFD)为核心的低资源微调流程:(1)通过 tau 阈值及标志感知分数上限筛选出 25 个双专家高分种子转录本;(2)从种子中提取特征配置文件(FeatureProfile,包含句子长度、结构、量化模板)并注入大语言模型(LLM)提示中,同时结合言语采样式多样性控制以驱动扩散过程;(3)使用 15 个专家金种子作为扩散锚点,目标生成 585 个样本;获得 567 个有效扩散样本,构成 582 样本的训练集,用于使用 QLoRA 微调 Breeze-7B 模型;(4)通过语法约束解码(GCD)进行模式约束推理,在推理阶段强制执行分层 SMART 目标阶梯模式。在 55 样本的模式压力测试集上的消融结果揭示了一个意外发现:在繁体中文 Token 预算限制下,GCD 产生负面效果——无 GCD 路径实现了 100% 的模式通过率,且中位延迟降低 34%,在可靠性和速度上均优于使用 GCD 的路径。在 n=10 的正式保留集上,无 GCD 推理路径达到 BERTScore F1 = 0.779,超过了 GPT-5.4 (0.726)、DeepSeek-V3.2 (0.703)、Gemini-3-Flash-Preview (0.703) 和 Llama-4-Maverick (0.700) 的零样本基线,同时保持完全本地、气隙隔离的推理环境。该系统填补了繁体中文特殊教育自然语言处理领域的空白,并在工业工程范式下提供了一种可扩展、隐私保护的本地推理解决方案。
Abstract
Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered on Corpus-Grounded Feature Diffusion (CGFD): (1) 25 dual-expert high-score seed transcripts are selected via a tau threshold with flag-aware score caps; (2) a FeatureProfile (sentence length, structure, quantification templates) is extracted from seeds and injected into LLM prompts alongside Verbalized-Sampling-style diversity control to drive diffusion; (3) 15 expert gold seeds are used as diffusion anchors, targeting 585 samples; 567 valid diffusion samples are obtained, yielding a 582-sample training set used to fine-tune Breeze-7B with QLoRA; (4) schema-constrained inference via Grammar-Constrained Decoding (GCD) enforces a hierarchical SMART Goal Ladder schema at inference time. Ablation results on a 55-sample schema stress set reveal an unexpected finding: GCD is counterproductive under Traditional Chinese token budgets -- the no-GCD path achieves 100% schema pass rate at 34% lower median latency, outperforming GCD on both reliability and speed. On the n=10 formal hold-out, the no-GCD inference path achieves BERTScore F1 = 0.779, exceeding GPT-5.4 (0.726), DeepSeek-V3.2 (0.703), Gemini-3-Flash-Preview (0.703), and Llama-4-Maverick (0.700) zero-shot baselines while maintaining fully local, air-gapped inference. This system addresses a gap in Traditional Chinese special-education NLP and offers a scalable, privacy-preserving local inference solution under an industrial engineering paradigm.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
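评分理由: The paper focuses on IEP generation in Traditional Chinese NLP, using feature diffusion and fine-tuning. It does not involve multimodality (Visual Encoder, MLLM, MultiModal), reinforcement learning (model-based RL), or world models (World Models). Tokens are mentioned only in the context of budget constraints, and Unify Models here refers to pipeline integration rather than unification of model architectures.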
关键词
Automated IEP Generation, Traditional Chinese, Corpus-Grounded Feature Diffusion, Low-resource Fine-tuning, Grammar-Constrained Decoding, Special Education NLP, Parent-Teacher Interviews, Breeze-7B
摘要翻译
大型语言模型(Large Language Models)已被广泛评估为个体调查应答的模拟器。然而,在实践中,完全未观测到的应答极为罕见;主要问题在于部分无应答。插补(Imputation)旨在通过填充这些缺失值来恢复调查数据集的整体结构。它具有明确定义的评估标准,且从根本上区别于预测(prediction)。我们提出通过上下文学习(In-Context Learning, ICL)来插补缺失的调查数据。我们在涵盖 15 个波次的美国趋势面板(American Trends Panel)的 150 个意见变量上,系统性地评估了 ICL 设计选择在不同缺失机制(MCAR、MAR、MNAR)下的表现。与用于数据插补的成熟统计方法(如 MICE PMM)相比,我们的 ICL 方法在所有缺失机制下一致地减少了绝对误差,在非随机缺失(MNAR)下增益最大。值得注意的是,表现最佳的设定(gpt-oss-120b 配合 100 个上下文示例)实现了接近名义的聚合覆盖率(接近 95% 水平),其置信区间比 MICE PMM 窄两到五倍。我们发布了一个具有类似 sklearn API 的 Python 包,以便使用本地和专有大型语言模型轻松部署我们的方法。
Abstract
Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR). Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
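评分理由: The paper mainly studies using large language models (LLMs) with in-context learning (ICL) to impute missing values in public opinion data. Although it uses LLMs, it does not involve multimodal integration, visual encoders, world models, or reinforcement-learning architectures, so its relevance to the provided keyword set (which emphasizes unified models, world models, and RL) is low. The author list contains none of the specified experts, so no bonus is added.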
关键词
Large Language Models, In-Context Learning, Data Imputation, Public Opinion Data, Missing Values, Survey Responses, Statistical Methods, Confidence Intervals
摘要翻译
视频检索的主流范式依赖于基于嵌入的全语料扫描,该方法存在固有的计算效率低下问题,且面临信息密集的视频与稀疏文本查询之间的语义不对称挑战。为弥合这一差距,我们提出了 MAVIS,这是一种新颖的多智能体框架,它将检索重新定义为协同推理而非暴力搜索。MAVIS 首先通过将原始视频解析为结构化语义库(Structured Semantic Library)来桥接粒度不匹配,从而实现显式的属性级索引。在检索过程中,规划器将复杂的用户意图分解为原子子任务,并调度专用智能体独立提名候选项。关键的是,MAVIS 采用了一种带有严格否决协议的逻辑感知辩论(Logic-aware Debate)机制,智能体协同修剪逻辑不匹配,从而识别出一组紧凑的“争议性”候选项,用于细粒度验证。这种智能体工作流有效地规避了全库遍历的低效率问题。在 MSR-VTT、MSVD 和 ActivityNet 上的广泛实验表明,MAVIS 在不进行任务特定微调的情况下实现了具有竞争力的性能,为传统双编码器方法提供了一种可扩展且可解释的替代方案。
Abstract
The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent framework that rethinks retrieval as cooperative reasoning rather than brute-force search. MAVIS first bridges the granularity mismatch by parsing raw videos into a Structured Semantic Library, enabling explicit attribute-level indexing. During retrieval, a planner decomposes complex user intents into atomic sub-tasks, dispatching specialized agents to independently nominate candidates. Crucially, MAVIS employs a Logic-aware Debate mechanism with a strict veto protocol, where agents collaboratively prune logical mismatches to identify a compact set of "controversial" candidates for fine-grained verification. This agentic workflow effectively bypasses the inefficiency of full-library traversal. Extensive experiments on MSR-VTT, MSVD, and ActivityNet demonstrate that MAVIS achieves competitive performance without task-specific fine-tuning, offering a scalable and interpretable alternative to traditional dual-encoder approaches.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper proposes MAVIS, a multi-agent framework for video retrieval focused on structured semantic libraries and logic-aware debate mechanisms. It shows moderate relevance to 'MultiModal' due to the video-text nature of the task, but has no direct content regarding Unify Models, Tokenizer, Visual Encoder architecture, World Models, MLLM architecture, or Model-Based RL. The author list does not contain the specified expert authors. The calculated weighted score is 7.5, which is below the dynamic passing score of 27.8.
关键词
Multi-Agent Video Retrieval, Structured Semantic Library, Logic-aware Debate, Cooperative Reasoning, Video Understanding, Dual-encoder Alternative
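The weighted total quoted in the rationale for this entry can be reproduced from its score table. The following is a minimal sketch, assuming each row's score is weight × relevance and the total is the plain sum over rows; the names `weighted_total` and `PASS_THRESHOLD` are illustrative, and the 27.8 threshold is simply the value quoted in the rationale text.

```python
# Relevance values (out of 10) copied from the MAVIS score table above.
relevance = {
    "Unify Models": 0.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 5.0,
    "model-based RL": 0.0,
}

WEIGHT = 1.5            # every keyword in the table carries the same weight
PASS_THRESHOLD = 27.8   # "dynamic passing score" quoted in the rationale


def weighted_total(scores, weight=WEIGHT):
    """Sum of weight * relevance over all keywords (assumed scoring rule)."""
    return sum(weight * value for value in scores.values())


total = weighted_total(relevance)
passed = total >= PASS_THRESHOLD
print(f"total={total:.1f}, passed={passed}")  # total=7.5, passed=False
```

The same function applies to any other table in this report, since all rows use a weight of 1.5.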
摘要翻译
光学音乐识别(OMR)在模型设计方面取得了显著进展,端到端方法现已能够识别各种复杂程度的音乐记谱。然而,这一进展的影响受到可用训练数据集视觉域的限制,这些数据集大多是原生数字(born-digital)的。图书馆及其他文化遗产机构中现有的大型乐谱收藏主要包含手稿,其视觉域高度多样且差异显著,因此现有的 OMR 系统在现实世界应用中往往失效。这些机构通常资源受限,因此难以期望获得大规模的同域数据集。本文在资源受限场景下,针对具有复杂钢琴记谱的现实世界手稿提供了首个基线结果。借助细粒度音乐记谱图(MuNG)标注及 Smashcima 合成工具,我们发现,尽管部分领域内数据的直接转录仍不可或缺,但利用合成音乐手稿图像进行领域自适应可带来显著提升。此外,所使用的符号无需来自同域,从而可以避免昂贵的细粒度标注。由此,我们将 OMR 推向了其既定目标之一:保护和推广音乐文化遗产。
Abstract
Optical Music Recognition (OMR) has seen major progress in model design, with end-to-end methods now capable of recognising notation at all levels of complexity. However, the impact of this progress has been limited by the visual domains of available training datasets, which are largely born-digital. Existing large collections of sheet music in libraries and other heritage institutions contain predominantly manuscripts, whose visual domains are highly diverse and different, so existing OMR systems fail when applied in the real world. These institutions are often resource-constrained, so large in-domain datasets cannot be expected. We provide a first baseline on real-world manuscripts with complex piano notation in the resource-constrained scenario. Using fine-grained music notation graph (MuNG) annotations and the Smashcima synthesis tool, we then show that while some direct transcriptions of in-domain data remain essential, domain adaptation using synthetic musical manuscript images brings significant improvement. Furthermore, the symbols used do not need to be in-domain, so the expensive fine-grained annotation can be avoided. We thus bring OMR closer to one of its stated goals: preserving and promoting musical cultural heritage.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
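评分理由: The paper mainly studies optical music recognition (OMR) and domain adaptation with synthetic data; it does not involve unified models, tokenizers, world models, MLLMs, or model-based reinforcement learning. A visual encoder and a multimodal mapping are present but are not core contributions. None of the specified experts appears in the author list, so no bonus points were added.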
关键词
Optical Music Recognition, Synthetic Data, Domain Adaptation, Manuscripts, Music Notation, Graph Annotations, Resource-constrained
摘要翻译
检索增强生成(RAG)已成为应对法律人工智能可靠性问题的标准架构方案,然而,包括向法院提交伪造引文以及呈现为当前的时代错置法律内容在内的备受关注的失败案例,仍在各个司法管辖区持续出现。我们认为,这些失败并非通过扩大语言模型规模即可消除的残留虚构,而是概率检索与法律知识的层级、时序及制度结构之间架构错配的症候。我们通过三个步骤来阐述这一论点。首先,我们将法律知识的本体论承诺阐述为源自经典法律理论的三重属性:层级与部分论结构、操作封闭下的历时动态性,以及基于证成义务的制度来源的可追溯因果性。其次,我们识别出三种相应的检索缺陷(部分盲视、历时盲视和因果不透明性),每种缺陷均配有操作定义、失效机制、典型范例以及用于诊断的检测标准。第三,我们通过这一视角审视了该领域的前沿现状,表明现有方法对这些要求的满足程度不均衡,且尚未构成一个将它们视为共同构成的范式。基于此分析,我们推导出四个架构承诺,它们表征了法律检索的设计确定性方向:本体论优先性、事件实体化、双时序正确性以及确定性交互协议。该框架关注 quaestio juris(即哪些规范适用及其处于何种状态),而非作用于已识别规范的下游任务,主要面向立法与宪法检索,并将解释时间作为显式扩展。
Abstract
Retrieval-Augmented Generation (RAG) has become a standard architectural response to unreliability in legal AI, yet high-profile failures, including fabricated citations submitted to courts and anachronistic legal content presented as current, continue to appear across jurisdictions. We argue that these failures are not residual confabulations to be eliminated by scaling language models, but symptoms of an architectural mismatch between probabilistic retrieval and the hierarchical, temporal, and institutional structure of legal knowledge. We develop the argument in three moves. First, we articulate the ontological commitment of legal knowledge as a triad of properties derivable from classical legal theory: hierarchical and mereological structure, diachronic dynamism under operational closure, and causal traceability of institutional provenance grounded in the duty of justification. Second, we identify three corresponding pathologies of retrieval (mereological blindness, diachronic blindness, and causal opacity), each developed with an operational definition, a failure mechanism, a canonical example, and detection criteria for diagnostic use. Third, we review the state of the art through this lens, showing that existing approaches address these requirements unevenly and do not yet compose into a paradigm that treats them as co-constitutive. From this analysis we derive four architectural commitments that characterize the deterministic-by-design direction for legal retrieval: ontological primacy, event reification, bitemporal correctness, and deterministic interaction protocols. The framework concerns quaestio juris (which norms apply and in what state) rather than the downstream tasks that act on identified norms, and addresses legislative and constitutional retrieval primarily, with interpretive time as an explicit extension.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.5/10 | 2.2 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on Legal AI and RAG architecture, discussing structural, temporal, and causal limitations of retrieval. It does not address tokenization, visual encoders, multimodal integration, or reinforcement learning. 'World Models' and 'Unify Models' receive low scores for tangential architectural/temporal discussions. No expert authors from the specified list were found, so no bonus points were added.
关键词
Retrieval-Augmented Generation, Legal Domain, Structural Limitations, Temporal Limitations, Causal Limitations, Deterministic Retrieval, Legal Knowledge Structure, Architectural Mismatch
摘要翻译
基础模型(Foundation models)正从响应生成转向操作角色。它们跨步骤规划,调用工具,请求人类输入,与其他智能体协作,并日益承担起影响客户、索赔、代码、合同及临床决策的工作责任。生产部署已不再是单一人类监督单一模型的模式。它们是跨越团队、时区和信任边界的多人、多智能体协作。这种协作的技术规范仍较为薄弱。当智能体起草响应、人类在发布前对其进行编辑时,人类判断的时刻便是系统中最有价值的信号。在当前实践中,若有所记录,也往往散见于应用程序代码、聊天线程、工单评论及部落记忆之中。两项协议标准分别解决相邻问题:MCP 标准化智能体对工具和数据的访问,A2A 标准化智能体间的互操作性。二者均未定义人类与智能体共同承担责任的共享工作空间。本文提出 CHAP(协作人类 - 智能体协议)。在 CHAP 框架下,过去消失在聊天线程中的覆盖操作,转变为携带差异(diff)、理由(rationale)及内容哈希(content hash)的结构化事件。班次间的交接变为可移植的封装包,而非固定的消息。人类对智能体草案的批准变为不可抵赖的签名决策,数年后可被重放。该协议通过一个小型核心(Core),包含工作空间、参与者、任务、工件(artifacts)及追加式证据日志,以及可组合的配置文件来实现,配置文件可根据部署需求添加审查、模式、路由、审议、交接、身份、签名及基于透明度的审计等功能。规范、参考实现、一致性测试套件及示例代码可在以下网址获取:https://github.com/BrightbeamAI/chap
Abstract
Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
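The abstract describes an override as a structured event carrying a diff, a rationale, and a content hash, recorded in an append-only evidence log. The sketch below illustrates that idea using only the Python standard library. It is not the CHAP schema: the class names, field names, and the choice of SHA-256 are assumptions made for illustration, and the actual specification lives in the repository linked above.

```python
import difflib
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class OverrideEvent:
    """Hypothetical record of a human edit to an agent draft."""
    task_id: str
    actor: str
    diff: str
    rationale: str
    content_hash: str
    timestamp: str


def make_override(task_id, actor, draft, edited, rationale):
    """Build an event holding the diff, the rationale, and a hash of the edited text."""
    diff = "\n".join(
        difflib.unified_diff(
            draft.splitlines(),
            edited.splitlines(),
            fromfile="agent_draft",
            tofile="human_edit",
            lineterm="",
        )
    )
    content_hash = hashlib.sha256(edited.encode("utf-8")).hexdigest()
    timestamp = datetime.now(timezone.utc).isoformat()
    return OverrideEvent(task_id, actor, diff, rationale, content_hash, timestamp)


class EvidenceLog:
    """Append-only log: events can be added and read, but not edited or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # position of the new event

    def events(self):
        return tuple(self._events)  # read-only snapshot


log = EvidenceLog()
event = make_override(
    task_id="task-1",
    actor="human:alice",
    draft="Refund denied.",
    edited="Refund approved.",
    rationale="Policy allows refunds within 30 days.",
)
index = log.append(event)
print(json.dumps(asdict(log.events()[index]), indent=2))
```

Hashing the edited content lets a later reader check that the stored artefact is the one that was approved; a signed decision, as described in the abstract, would additionally need a signature over the event, which this sketch omits.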
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
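评分理由: The paper proposes the CHAP protocol to standardise collaborative workflows between humans and agents, focusing on audit trails, signed decisions, and handoff procedures. It gives no technical detail on model architecture (Tokenizer, Visual Encoder), representation learning (World Models, Unify Models), multimodal capability (MultiModal, MLLM), or reinforcement-learning methods (model-based RL). It mentions "foundation models" and "agents" only superficially, so its connection to the technical core of the keywords is very weak. None of the listed expert authors (Yang Shi et al.) was found.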
关键词
Collaborative Human-Agent Protocol, Foundation Models, Human-Agent Collaboration, Audit Trails, Signed Decisions, Shared Workspaces, Agent Coordination
摘要翻译
递归自我设计(Recursive self-design)指的是人工智能辅助修改人工智能系统构建、评估及改进机制的过程。本文不将 MetaAI 视为一种成熟的范式,而是将其作为一个工作术语,用于描述一种人类发起、人工智能扩展的发展模式,在此模式中,设计空间本身成为被修改的目标。我们提出一个操作性证据框架,包含四个标准:可检查的目标系统(inspectable target system)、元级修饰器(meta-level modifier)、反馈导向的选择(feedback-directed selection)以及递归延续(recursive continuation)。随后,我们将包括达尔文哥德尔机器(DGM)、STOP、哥德尔代理(Goedel Agent)和 ShinkaEvolve 在内的公开系统,与这些标准进行对照。DGM 提供了目前报告的最直接证据:其发表的结果显示,经过 80 次迭代后,在 SWE-bench Verified 基准上的表现从 20% 提升至 50%,在完整 Polyglot 基准上从 14.2% 提升至 30.7%,消融实验表明,开放性探索和自我改进均对此有贡献。最后,我们提供 MetaAI-Mini,这是一个基于 HumanEval 的可复现协议及代码库。由于此构建中未包含完成的模型运行,MetaAI-Mini 被报告为一种协议,而非实验结果。
Abstract
Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a target of modification. We propose an operational evidence framework with four criteria: inspectable target system, meta-level modifier, feedback-directed selection, and recursive continuation. We then map public systems, including Darwin Goedel Machine (DGM), STOP, Goedel Agent, and ShinkaEvolve, against these criteria. DGM provides the most direct currently reported evidence: its published results show improvement from 20% to 50% on SWE-bench Verified and from 14.2% to 30.7% on full Polyglot after 80 iterations, with ablations suggesting that both open-ended exploration and self-improvement contribute. Finally, we provide MetaAI-Mini, a reproducible HumanEval-based protocol and codebase. Because no completed model run is included in this build, MetaAI-Mini is reported as a protocol rather than as an experimental result.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
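评分理由: The paper centres on an engineering evidence framework and a reproducibility protocol for MetaAI recursive self-design, and has no direct connection to keywords such as multimodal large models (MLLM, MultiModal), Tokenizer, or Visual Encoder. It touches the "unified model" concept only through the unification of the human-AI design process, and "World Models" and "model-based RL" bear only a very weak conceptual analogy (modelling of the design space). Relevance scores are therefore very low, and the weighted total is far below the passing line.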
关键词
Recursive Self-Design, MetaAI, Reproducible Engineering, Operational Evidence Framework, AI System Improvement, Human-seeded Development, Meta-level Modifier
摘要翻译
人们日益频繁地使用人工智能(AI)进行写作等创造性任务。尽管采用率持续增长,但这种使用方式有损个体层面的创造力,并降低大规模创意输出的异质性。作为回应,我们引入了语义排斥技术(SRT),并通过计算评估以及一项涉及 16 名定期使用 AI 进行创造性任务参与者的研究对其进行了评估。我们的计算评估显示,SRT 在各类任务模式下将语义多样性提高了 85% 至 167%,同时将共识短语减少了 43% 至 95%。在用户研究中,SRT 输出的有用性($p = .019$, $W = .208$)和连贯性评分($p = .006$, $W = .260$)均更高;68.8% 的参与者愿意多次使用 SRT-Strong,而基线仅为 18.8%。在所有系统中,原创性与连贯性评分呈正相关($ρ= +.40$ 至 $+.67$),这表明发散性不必以牺牲可读性为代价。综上所述,这些初步发现可为旨在支持日常创造力而不加剧同质化的 AI 系统设计提供依据。
Abstract
People are increasingly using AI for creative tasks such as writing. While adoption continues to grow, this form of use risks undermining individual creativity locally and reducing the heterogeneity of creative output at scale. In response, we introduce the Semantic Repulsion Technique (SRT) and evaluate it both computationally and through a study with 16 participants who regularly use AI for creative tasks. Our computational assessment reveals that SRT increases semantic diversity by 85–167% while reducing consensus phrases by 43–95% across task modes. In the user study, SRT outputs received higher usefulness ($p = .019$, $W = .208$) and coherence ratings ($p = .006$, $W = .260$); 68.8% of participants were willing to use SRT-Strong for multiple tasks versus 18.8% for baselines. Originality and coherence ratings were positively correlated across all systems ($\rho = +.40$ to $+.67$), suggesting that divergence need not compromise readability. Taken together, these preliminary findings can inform the design of AI systems that aim to support everyday creativity without contributing to homogenization.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on HCI and mitigating AI homogenization in text generation via a Semantic Repulsion Technique. It does not address Multimodal architectures, World Models, or Reinforcement Learning. Thus, relevance to technical keywords like Visual Encoder, World Models, and model-based RL is negligible. MLLM and Tokenizer have minimal relevance as the paper deals with text generation interfaces rather than model internals.
关键词
AI Homogenization, Semantic Repulsion Technique, Consensus-Aware Interaction, Creative Tasks, Semantic Diversity, User Study, Text Generation
摘要翻译
离线安全强化学习(Safe RL)能够在无需在线交互的情况下进行策略学习,使其适用于机器人系统等安全关键系统。然而,其对静态数据集的依赖使离线 Safe RL 暴露于数据投毒攻击中,攻击者注入恶意样本,危及安全并引发不安全的策略行为。在此工作中,我们提出了一种新的学习范式,称为安全强化遗忘(Safe-RULE),用作防御框架,旨在去除投毒数据的影响,而无需从头重新训练或访问原始训练环境。我们进一步将强化遗忘扩展至离线 Safe RL,通过在遗忘过程中显式考虑任务性能和安全约束。在基准 Safe RL 任务上的实验表明,我们的方法能有效提升针对数据投毒攻击的安全性能。
Abstract
Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotic systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on Safe Reinforcement Learning and data poisoning defense via unlearning. The provided keyword set is predominantly focused on Multimodal Large Language Models (MLLM), Tokenization, and World Models. Consequently, keywords like Tokenizer, Visual Encoder, MLLM, and MultiModal are irrelevant (0). 'World Models' is unrelated to the safety/unlearning focus. 'Unify Models' shares only a superficial linguistic similarity with 'Unlearning' (1.0). 'model-based RL' is the only relevant domain keyword, but the paper emphasizes safety constraints and unlearning rather than model-based planning specifically, warranting a moderate score (3.0). The total weighted score is significantly below the dynamic pass threshold of 27.8.
关键词
Safe Reinforcement Learning, Offline RL, Data Poisoning, Unlearning, Policy Learning, Safety Constraints, Defense Framework
摘要翻译
以往研究将结构化输出视为一种推理税,但这种框架是不完整的:格式化成本高度依赖于模型的剩余容量。通过使用信息匹配的自然语言文本对照组和四级模式复杂度梯度,我们在 4 个模型和 5 个基准测试上将格式特定效应与提示长度混淆因素分离开来,且在成功生成的响应中解析失败率为 0%。我们发现结构化格式依赖于容量。具有足够余量的模型可以吸收 JSON 约束而不出现性能下降(Sonnet:在 MATH-Hard 上,JSON 为 $88.7\pm4.0$%,CoT 为 $89.3\pm1.7$%)。相比之下,格式会通过两种截然不同的机制严重损害接近其极限的模型。首先,在标准 token 预算下,Haiku 下降 36.2pp ($p < 0.0001$),主要由于截断。其次,即使延长预算以消除截断,GPT-4o-mini 仍下降 28.0pp ($p < 0.001$),揭示了与 token 耗尽无关的纯粹容量竞争。这种格式惩罚随模式复杂度增加而增加 (McNemar $p < 0.0001$),且无法仅由提示长度解释。此外,这些结果修正了关于前沿模型免疫性的声称:在 AIME 竞赛数学中,Opus 4.7 在 JSON 下从 96.2% 降至 91.0% ($-5.3$pp;显示的百分比是独立四舍五入的,精确差值为 $7/133 = 5.26$pp $\approx 5.3$pp)。延迟结构消融实验——先自由推理再格式化——恢复了大部分丢失的准确率(3 次运行均值:80--87%),支持了容量竞争机制。实际启示并非避免结构化输出,而是使其与容量匹配:当模型接近其极限时,先思考,后格式化。
Abstract
Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation (reasoning freely before formatting) recovers most of the lost accuracy (3-run mean: 80–87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
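The parenthetical rounding remark in the abstract can be checked with exact fractions. This is a small sketch, assuming the AIME set has 133 items (the denominator quoted in the abstract) and that the two accuracies correspond to 128 and 121 correct answers; those counts are inferred here from the rounded percentages and are not stated in the source.

```python
from fractions import Fraction

N = 133                 # denominator quoted in the abstract
correct_baseline = 128  # inferred: 128/133 rounds to 96.2%
correct_json = 121      # inferred: 121/133 rounds to 91.0%

acc_baseline = Fraction(correct_baseline, N)
acc_json = Fraction(correct_json, N)

# Exact drop in percentage points, computed before any rounding.
exact_drop_pp = float((acc_baseline - acc_json) * 100)

# Percentages as displayed, each rounded independently to one decimal.
shown_baseline = round(float(acc_baseline) * 100, 1)
shown_json = round(float(acc_json) * 100, 1)

# Subtracting the displayed values gives a slightly different figure.
naive_drop_pp = round(shown_baseline - shown_json, 1)

print(shown_baseline, shown_json)   # 96.2 91.0
print(round(exact_drop_pp, 2))      # 5.26
print(naive_drop_pp)                # 5.2
```

Under these inferred counts, the reported 5.3pp is the rounded exact difference (7/133), whereas subtracting the two displayed percentages would give 5.2pp.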
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on structured output formatting and model capacity constraints in Large Language Models (LLMs), specifically analyzing JSON vs. CoT performance on math benchmarks. It does not discuss Unify Models architecture, Tokenizer design, Visual Encoders, World Models, MultiModal learning mechanisms, or Model-Based RL. While the models used (Sonnet, GPT-4o-mini) fall under the broad MLLM category, the paper does not investigate MLLM-specific technologies, resulting in low relevance scores for the provided keyword set which targets Multimodal/World Models/RL domains.
关键词
Structured Output, Model Capacity, Reasoning Failures, JSON Constraints, Prompt Length, Format Penalty, Think First Format Later
摘要翻译
身体运动能够在距离较远且无法捕捉面部或语音的条件下传达意图。我们研究仅基于 2D 人体姿态(2D body pose)识别沟通意图(communicative intent)的问题。我们认为,身体运动是一种可靠的信号,尤其是在需要实时、低成本、设备端人机通信的长距离环境(如救援任务)中。然而,现有资源并未将这一信号孤立出来。情感语料库(Affective corpora)融合了身体、面部、语音和文本,而骨架动作识别基准(skeleton action-recognition benchmarks)标注的是执行的动作而非传达的信息。我们发布了一个包含十种沟通意图的全身姿态真实帧数据集,并将其与其他真实(IPC)和合成(MotionLCM, VEO3.1, Kimodo)数据集进行了对比,这些数据集涵盖了不同的难度级别。我们针对能够在机器人有限机载硬件上运行的系统。我们基准测试了多种模型,从骨架图分类器到联合运动预测网络,并在嵌入式 GPU(NVIDIA Orin Nano)上报告了性能指标与帧率,因为在我们的场景中,速度与准确性同等重要。最后,我们展示了模型自身的自回归自一致性(autoregressive self-consistency)可作为无监督的可靠性信号。我们给出了一个简短的证明,界定了自一致预测正确的概率,表明该概率随一致步骤数的增加而增长,并识别了自信预测仍可能为假的条件,相关评估基于行业标准指标(industry-standard metrics)进行。
Abstract
Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication in long-distance environments, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) datasets that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.
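The abstract does not reproduce the proof, so the following is only a toy model of why agreement across steps can raise confidence; it is not the authors' bound. It assumes a uniform prior over the ten intents, that each step is correct independently with probability `p`, and that errors are spread evenly over the wrong classes. The function name and the value `p = 0.6` are hypothetical.

```python
def p_correct_given_agreement(p, num_classes, k):
    """Posterior probability that k agreeing predictions are correct.

    Toy assumptions (not from the paper): uniform prior over classes,
    independent steps, per-step accuracy p, errors spread uniformly over
    the num_classes - 1 wrong classes.
    """
    if not 0.0 < p < 1.0 or num_classes < 2 or k < 1:
        raise ValueError("need 0 < p < 1, num_classes >= 2, k >= 1")
    agree_on_truth = p ** k
    per_wrong_class = (1.0 - p) / (num_classes - 1)
    agree_on_wrong = (num_classes - 1) * per_wrong_class ** k
    return agree_on_truth / (agree_on_truth + agree_on_wrong)


# Ten intents as in the dataset; p = 0.6 is a made-up per-step accuracy.
for k in range(1, 5):
    print(k, round(p_correct_given_agreement(0.6, 10, k), 4))
```

In this toy model the posterior increases with `k` whenever `p` exceeds chance (`1 / num_classes`). The independence assumption is exactly what fails when errors are correlated across steps, which is one way a self-consistent, confident prediction can still be wrong.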
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on body pose estimation and intent recognition for human-robot interaction using skeleton graph classifiers and consistency measures. It does not involve Large Language Models, Tokenizers, World Models, or Reinforcement Learning. While it processes visual data for pose, it does not utilize Visual Encoders in the context of multimodal foundation models. The method is explicitly unimodal (pose alone), contrasting with multimodal corpora, resulting in low relevance to the provided keyword set.
关键词
Body Pose, Non-verbal Communication, Intent Recognition, Self-consistency, Embedded Hardware, Skeleton Graph, Communicative Intent, Motion Forecasting
摘要翻译
表格编码器通常在任务特定的端到端流水线中进行评估,因此即使处理相似的表格信号,来自不同训练范式的模型也难以直接比较。我们引入 TRL-Bench,这是一个多粒度表格表示学习(TRL)基准,旨在标准化跨范式表示级评估:每个编码器通过其支持的封装器导出行、列或表嵌入,共享的轻量级头部在三个套件中对其探测:TRL-CTbench(列/表)、TRL-Rbench(行)以及 TRL-DLTE(跨越所有三种粒度的组合式数据湖表扩充)。为了支持这种标准化设置,我们发布了精心整理的基准资产和任务重构,包括 50 张带有 123 个验证目标的 OpenML 表格、16 个行对链接重写任务,以及源自 1,379 个父表的 47,772 张表格的 DLTE 湖。在涵盖 20 个模型和 16 个任务的实验中,TRL-Bench 表明,一旦下游条件被标准化,编码器质量取决于具体能力,而非由单一排行榜所概括。在 TRL-CTbench 中,通用文本编码器通常在表面文本信号较强的任务上表现领先,而表格专家模型则在预训练目标与任务对齐的场景中胜出。在 TRL-Rbench 中,表内预测与跨表链接倾向于不同的训练策略,且原子链接性能与 DLTE 流水线的行匹配阶段呈强相关。在 TRL-DLTE 中,最强的流水线组合的是能力匹配的专家模型,而非复用单一编码器;且顶尖的端到端质量取决于非加性的组合拟合,而不仅仅是各阶段的边际排名。TRL-Bench 提供了一种通用协议,用于在共享下游条件下测量导出表格表示中的可复用信号。代码与数据:https://github.com/LOGO-CUHKSZ/TRL-Bench
Abstract
Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult to compare directly even when they operate on similar tabular signals. We introduce TRL-Bench, a multi-granular tabular representation learning (TRL) benchmark that standardizes cross-paradigm representation-level evaluation: each encoder exports row-, column-, or table embeddings through its supported wrapper, and shared lightweight heads probe them across three suites: TRL-CTbench (column/table), TRL-Rbench (row), and TRL-DLTE (compositional Data-Lake Table Enrichment spanning all three granularities). To support this standardized setting, we release curated benchmark assets and task reformulations, including 50 OpenML tables with 123 verified targets, 16 row-pair linkage rewrites, and a 47,772-table DLTE lake derived from 1,379 parent tables. Across 20 models and 16 tasks, TRL-Bench shows that once downstream conditions are standardized, encoder quality is capability-specific rather than captured by a single leaderboard. In TRL-CTbench, generic text encoders often lead on tasks with strong surface-text signal, while tabular specialists win where their pretraining objective aligns with the task. In TRL-Rbench, within-table prediction and cross-table linkage favor different training regimes, with atomic linkage performance correlating strongly with the row-matching stage of DLTE pipelines. In TRL-DLTE, the strongest pipelines combine capability-matched specialists rather than reuse a single encoder, and top end-to-end quality depends on non-additive compositional fit rather than per-stage marginal rank alone. TRL-Bench provides a common protocol for measuring reusable signal in exported tabular representations under shared downstream conditions. Code and data: https://github.com/LOGO-CUHKSZ/TRL-Bench
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on Tabular Encoders and Representation Learning for structured data, while the provided keywords primarily target Multimodal LLMs, World Models, and Reinforcement Learning, indicating a significant domain mismatch. 'Unify Models' has slight relevance regarding the standardization of evaluation paradigms; 'MultiModal' and 'Tokenizer' have weak relevance to tabular data processing; others (Visual Encoder, World Models, MLLM, model-based RL) are irrelevant. The weighted total score is 6.0, well below the dynamic pass score of 27.8. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.
关键词
Tabular Encoders, Representation Learning, Benchmark, Cross-Paradigm, Embeddings, Downstream Evaluation, OpenML Tables
摘要翻译
高维低样本量(HDLSS)表格数据领域(例如组学)的特征是样本数 $n$ 远小于特征数 $m$(即 $n \ll m$)。此类领域通常表现出强烈的局部相关群组、稀疏的组间依赖、重尾非高斯边缘分布、异方差噪声以及结构化缺失,由于 $n \ll m$,导致在 $\mathbb{R}^m$ 中的直接密度学习呈现病态。我们提出 BSTabDiff,一种块 - 子单元生成框架,该框架将 $m$ 个观测特征划分为 $M$ 个潜在块($M \ll m$),并通过共享的低维子单元变量生成每个块,从而将全局依赖学习集中在紧凑的块 - 潜在空间 $\mathbb{R}^M$ 中,同时利用基于 copula 的依赖、灵活的逐特征边缘分布及显式缺失机制解码至完整特征空间。BSTabDiff 支持块潜在变量上的现代深度先验,包括 diffusion 和 normalizing flows,从而在 HDLSS 场景下实现稳定的合成与可控的基准生成。实验结果表明,与非结构化表格生成器相比,BSTabDiff 在 HDLSS 数据上能生成更真实且更稳定的高维合成数据。
Abstract
High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.
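The block-subunit idea (draw $M \ll m$ block latents, then decode each feature from its block's latent) can be illustrated with a standard-library toy sampler. This is a sketch under strong simplifying assumptions and not BSTabDiff itself: a standard Gaussian stands in for the diffusion or flow prior, a shared Gaussian factor plus a monotone transform stands in for the copula-driven decoder, and missingness is completely at random rather than structured. All names and parameter values are hypothetical.

```python
import math
import random


def sample_row(m, M, rng, loading=0.9, missing_rate=0.1):
    """Draw one synthetic row with m features grouped into M contiguous blocks."""
    # Block latents: a standard Gaussian stands in for a learned deep prior.
    z = [rng.gauss(0.0, 1.0) for _ in range(M)]
    noise_scale = math.sqrt(1.0 - loading ** 2)
    row = []
    for j in range(m):
        block = j * M // m  # contiguous, roughly equal-sized blocks (assumed)
        # Shared block factor plus feature noise: features in one block correlate.
        g = loading * z[block] + noise_scale * rng.gauss(0.0, 1.0)
        # Monotone transform gives a skewed, non-Gaussian (log-normal) marginal.
        x = math.exp(g)
        # Missing-completely-at-random mask; the paper models structured missingness.
        row.append(None if rng.random() < missing_rate else x)
    return row


rng = random.Random(0)
n, m, M = 20, 1000, 10  # n << m, M << m
data = [sample_row(m, M, rng) for _ in range(n)]
print(len(data), len(data[0]))  # 20 1000
```

Only the `M` latents carry dependence across features here, which mirrors the abstract's point that global dependence learning is concentrated in the compact block-latent space.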
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
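评分理由: The paper focuses on high-dimensional tabular data generation and has no direct connection to multimodality (MultiModal, MLLM, Visual Encoder), reinforcement learning (model-based RL), or Tokenizer (relevance 0). It is only weakly related to generative latent variables (World Models) and architectural unification (Unify Models) (relevance 2). The author list contains none of the specified experts. The weighted total is 6.0, far below the passing score of 27.8.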
关键词
Tabular Data Generation, Diffusion Priors, Block-Subunit, HDLSS, Generative Framework, Copula-driven, Latent Space
摘要翻译
近年来,少样本目标检测(Few-shot Object Detection)受到了广泛关注。针对这一任务,已有若干优秀算法被提出。然而,大多数此类算法仍依赖于少样本分类(Few-shot Classification)的性能。与以往工作不同,本文聚焦于新类别(Novel Classes)与基础类别(Base Classes)之间区域提议(Region Proposals)分布不平衡的问题。为缓解这种分布不平衡,我们针对不同训练阶段提出了提议细化(Proposal Refinement)方法。具体而言,在基础训练阶段设计了细化损失(Refinement Loss),以增强模型对新类别的敏感度;同时在微调阶段,将细化分支作为 RPN(区域提议网络)的辅助分支引入,以生成更多新类别提议。通过重新平衡提议分布,所提方法在当前基准测试上较基线方法提升了约 1%~6%,且未增加任何推理时间。大量实验证明,我们为少样本目标检测任务建立了一种新的最先进(State-of-the-Art)方法。
Abstract
Few-shot object detection has gained wide attention in recent years. Some excellent algorithms have been proposed to handle this task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the problem of the unbalanced distribution of region proposals between the novel classes and the base classes. To alleviate this unbalanced distribution, we propose a proposal refinement approach for different training phases. Specifically, a refinement loss is designed for the base training phase to enhance the sensitivity of the model to novel classes, and a refinement branch is introduced as an auxiliary branch for the RPN (Region Proposal Network) to generate more novel proposals in the fine-tuning phase. By rebalancing the proposal distribution, the proposed approach outperforms the baseline methods by roughly 1% to 6% on current benchmarks without increasing inference time. Through extensive experiments, we show that the approach establishes a new state of the art for the few-shot object detection task.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
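评分理由: The paper addresses the unbalanced distribution of region proposals in few-shot object detection, optimising the RPN with a refinement loss and an auxiliary branch. The supplied keywords concern multimodal large models, world models, and reinforcement learning, which have no direct connection to this conventional computer-vision detection task. A visual encoder is implicitly present but is not the research focus, and the paper does not involve a Tokenizer, model unification, or multimodal interaction.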
关键词
Few-shot object detection, Proposal refinement, Region Proposal Networks, Unbalanced distribution, Base classes, Novel classes, Inference time
摘要翻译
本文重新审视了我们提出的名为三段论评估框架 - 通用逻辑语法构建 (SEF-CLGC) 的流程。我们将形式化逻辑符号与小语言模型 (SLMs) 相结合,以评估在 SemEval-2026 Task 11 Subtask 1(解耦大语言模型中的内容与形式推理)上的推理性能。实验结果表明,仅依赖在自然语言与符号语言组合上训练的小语言模型 (SLMs),我们的最佳模型在该任务上取得了 27.80% 的内容得分,同时显著降低了推理中的内容偏差。
Abstract
This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
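Scoring rationale: The paper focuses on combining logical notation with small language models for reasoning tasks, aiming to evaluate and reduce content bias. This work is largely unrelated to the keyword areas of multimodality (MultiModal, MLLM), visual encoding (Visual Encoder), world models (World Models), and reinforcement learning (model-based RL), with only a weak connection to model unification (Unify Models) and language models (MLLM). Tokenizers are not mentioned. Relevance scores are therefore low across the board, and the weighted total falls far below the passing threshold.
Keywords
Logical Notation, Small Language Models, Reasoning Performance, SemEval-2026, Content Bias, Common Logic Grammar, Formal Reasoning
Abstract (Chinese translation)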
云边连续体(Cloud-Edge Continuum,CEC)通过将资源分发至远端边缘来支持对延迟敏感的应用,但其极端波动性使得基于时间序列预测的主动零触摸管理(Zero Touch Management)至关重要。然而,编排器面临着严重的“冷启动”(cold start)问题:新发现的节点缺乏训练本地化预测模型所需的历史数据,而通用模型又无法捕捉独特的硬件和微服务行为。为解决这一问题,我们提出了一种由新颖的数据混合方法驱动的完全自动化的时间序列预测架构。在基础设施层面,我们引入了一种轻量级、技术无关的资源展示器(Resource Exposer,RE),它能够动态发现节点并持续收集可定制的遥测数据(例如计算、网络、能源)。为了克服这些初始本地样本的稀疏性,我们的框架将它们与 TimeTrack(我们公开可用的高分辨率数据集,采集间隔为 45 秒)自动合并。这种结合将 TimeTrack 的基础性高频时间模式与本地节点数据的精确校准协同起来。经过神经架构搜索(Neural Architecture Search,NAS)引擎处理后,系统自动生成高度准确的基线模型。实验结果表明,将目标数据与 TimeTrack 合并有效缓解了冷启动挑战。这种集成显著提高了基于均方误差(Mean Squared Error,MSE)、平均绝对误差(Mean Absolute Error,MAE)和平均绝对百分比误差(Mean Absolute Percentage Error,MAPE)衡量的预测准确性,并加速了收敛,相较于仅使用稀疏本地样本训练、仅使用通用数据集训练或将目标数据与标准替代数据集混合,为持续的 MLOps 部署奠定了坚实基础。
Abstract
The Cloud-Edge Continuum (CEC) enables latency-critical applications by distributing resources to the far edge, but its extreme volatility makes proactive Zero Touch Management via time-series forecasting essential. However, orchestrators face a severe "cold start" problem: newly discovered nodes lack the historical data required to train localized predictive models, while generalized models fail to capture unique hardware and microservice behaviors. To solve this, we propose a fully automated time-series prediction architecture driven by a novel data-mixing methodology. At the infrastructure level, we introduce a lightweight, technology-agnostic Resource Exposer (RE) that dynamically discovers nodes and continuously collects customizable telemetry (e.g., compute, network, energy). To overcome the sparsity of these initial local samples, our framework automatically merges them with TimeTrack, our publicly available, high-resolution dataset collected at 45-second intervals. This synergizes TimeTrack's foundational, high-frequency temporal patterns with the precise calibration of the local node data. Processed through a Neural Architecture Search (NAS) engine, the system automatically generates highly accurate baseline models. Experimental results demonstrate that merging the target data with TimeTrack effectively mitigates the cold start challenge. This integration significantly improves forecasting accuracy measured in Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) and accelerates convergence compared to training on the sparse local samples alone, training solely on generic datasets, or mixing the target data with standard alternative datasets, establishing a robust foundation for continuous MLOps deployment.
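The three accuracy metrics named in the abstract have standard definitions, which the following minimal sketch spells out. The telemetry values are hypothetical CPU-utilization readings invented for illustration, not data from TimeTrack or the paper, and the `mape` helper assumes that no actual value is zero.

```python
def mse(actual, predicted):
    """Mean Squared Error: average of squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean Absolute Error: average of absolute residuals."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Assumes every actual value is non-zero; otherwise the ratio is undefined.
    """
    ratios = (abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * sum(ratios) / len(actual)


# Hypothetical CPU-utilization readings (percent) and a forecast for them.
actual = [50.0, 40.0, 80.0, 100.0]
predicted = [45.0, 44.0, 80.0, 90.0]

print(mse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

MSE penalizes large misses more heavily than MAE, while MAPE is scale-free, which is presumably why the abstract reports all three together.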
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
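Scoring rationale: The paper centers on time-series forecasting and orchestration automation for the cloud-edge continuum, covering the Resource Exposer, neural architecture search, and data mixing. The provided keywords (MLLM, Visual Encoder, Tokenizer, World Models, model-based RL) belong to multimodal large models and reinforcement learning and have no direct technical connection to this paper's systems-engineering topic, so the relevance scores are very low.
Keywords
Cloud-Edge Continuum, Time-Series Forecasting, Zero Touch Management, Neural Architecture Search, Resource Exposer, Cold Start Problem, MLOps
Abstract (Chinese translation)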
解耦(Disentanglement),即利用神经网络分离数据中的变异因子,仍然是机器学习领域中的一个长期挑战。先前的工作利用变分自编码器(VAE)和生成对抗网络(GAN)来解决这一问题,这些方法融入了变分推断和信息论约束的思想。与依赖连续表示的方法不同,我们提出了一种设计,将解耦表示视为符号结构,其动机在于构成分布样本的概念之间的组合关系。然而,在保持可微性的前提下,使用神经网络学习离散符号结构往往具有挑战性,且通常需要复杂的架构。为此,我们提出了一种无监督学习算法,该算法利用全息约简表示(HRR)来实现神经解耦。我们表明,HRR 解绑操作提供了分离因子的归纳偏置,并在潜在遍历和解耦度量指标下,相较于基线方法取得了具有竞争力的结果。我们通过针对 HRR 解绑通道的信息论分析来补充这些经验发现。我们证明了解绑操作诱导了近似独立的符号 - 值对,并推导出了一个每槽容量界,该界量化了能够被可靠编码的不同符号概念的数量,从而为解耦的归纳偏置提供了定量解释。所得表示与标准的基于自动编码器的模型有所不同,因为其潜在单元是相互相加的向量,而非低维潜在向量的标量维度。我们表明,这种 HRR 表示相较于其他解耦表示对噪声更具鲁棒性,并在一系列信噪比(SNR)范围内保持了重建质量。
Abstract
Disentanglement, the separation of factors of variation in data using neural networks, remains a long-standing challenge in machine learning. Prior work has addressed this problem with variational autoencoders and generative adversarial networks that incorporate ideas from variational inference and information-theoretic constraints. In contrast to methods that rely on continuous representations, we propose a design that treats disentangled representations as symbolic structures, motivated by the compositional relationships among the concepts that make up samples from a distribution. However, learning discrete symbolic structures with neural networks while maintaining differentiability is difficult and often requires complex architectures. To address this, we introduce an unsupervised learning algorithm that uses holographic reduced representations (HRR) for neural disentanglement. We show that the HRR unbinding operation provides an inductive bias for separating factors and yields competitive results against baselines, as measured by latent traversals and disentanglement metrics. We complement these empirical findings with an information-theoretic analysis of the HRR unbinding channel. We prove that unbinding induces approximately independent symbol-value pairs and derive a per-slot capacity bound that quantifies how many distinct symbolic concepts can be reliably encoded, giving a quantitative account of the inductive bias toward disentanglement. The resulting representations differ from standard autoencoder-based models, in that their latent units are vectors that are summed together, rather than scalar dimensions of a low-dimensional latent vector. We show that this HRR representation is more robust to noise than other disentangled representations and maintains reconstruction quality across a range of SNRs.
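To make the binding and unbinding operations concrete, here is a minimal NumPy sketch of classical holographic reduced representations, in which binding is circular convolution and unbinding is circular correlation. It is not the paper's learned model: the vector dimension, the random Gaussian vectors, and the names `color_key`, `shape_key`, `red`, and `square` are all assumptions chosen for illustration.

```python
import numpy as np


def bind(a, b):
    """Bind two vectors by circular convolution (computed via FFT)."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))


def unbind(trace, key):
    """Approximately invert binding by circular correlation with the key."""
    return np.fft.irfft(np.fft.rfft(trace) * np.conj(np.fft.rfft(key)), n=len(trace))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
dim = 2048  # hypothetical dimension; larger values reduce crosstalk


def random_vector():
    # Elements with variance 1/dim, so vectors have roughly unit norm.
    return rng.normal(0.0, 1.0 / np.sqrt(dim), dim)


color_key, shape_key = random_vector(), random_vector()
red, square = random_vector(), random_vector()

# Two symbol-value pairs are summed into a single trace vector.
trace = bind(color_key, red) + bind(shape_key, square)

# Unbinding with one key yields a noisy copy of that key's value only.
recovered = unbind(trace, color_key)
sim_red = cosine(recovered, red)
sim_square = cosine(recovered, square)
print(sim_red, sim_square)
```

The recovered vector is clearly more similar to `red` than to `square`, even though both pairs share one summed trace. This is the sense in which the latent units are "vectors that are summed together" rather than scalar dimensions.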
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper addresses disentanglement via Holographic Reduced Representations (HRR) and symbolic structures. It does not discuss multimodal integration, tokenization, visual encoders, large language models, or reinforcement learning. Although it involves representation learning, it lacks specific alignment with World Models, Unify Models architectures, or Model-Based RL frameworks, resulting in low relevance scores for most keywords.
Keywords
Disentanglement, Holographic Reduced Representations, Symbolic Structures, Unbinding Operation, Representation Learning, Noise Robustness
Abstract (Chinese translation)
微分方程在科学发现中起着至关重要的作用,因为它们提供了描述物理现象行为的数学框架。作为传统第一性原理的一种有前景的替代方案,数据驱动的微分方程发现因其能够从实验或模拟数据中直接推断支配定律而受到越来越多的关注,尤其是在底层物理机制尚不明确的情况下。然而,该领域正沿着多样化的方法论方向迅速扩展,尤其是随着基于人工智能(AI)方法的兴起,目前仍缺乏清晰的组织视角。在这篇综述中,我们提出了针对数据驱动微分方程发现的问题导向视角。我们首先引入了一个方程可发现性的二维相图,发现问题根据结构复杂性和系数复杂性进行分类组织。该相图展示了该领域如何从发现具有简单系数的稀疏方程,转向发现具有更丰富结构和更灵活参数化的更复杂支配定律。它还阐明了为何不同的方法论流派在不同的问题情境下会成功或失败。随后,我们提出了表示 - 评估 - 优化(REO)框架,将其作为发现过程的基本抽象。通过识别在各类算法变体中始终存在的方程发现核心问题,REO 将讨论焦点从个别算法转移到决定可发现性的基本原理上。我们将这些观点与物理学及相关科学中的应用相联系,并认为下一个挑战不仅仅是恢复方程,而在于利用它们修订现有理论、提炼机制并形成新的科学概念。
Abstract
Differential equations play a critical role in scientific discovery because they provide a mathematical framework to describe the behaviour of physical phenomena. As a promising alternative to traditional first principles, data-driven differential equation discovery has attracted increasing attention for its ability to infer governing laws directly from experimental or simulated data, especially when the underlying physics is unclear. However, the field has expanded rapidly along diverse methodological directions, particularly with the emergence of AI-based approaches, and still lacks a clear organizing perspective. In this Review, we propose a problem-oriented perspective on data-driven differential equation discovery. We first introduce a two-dimensional phase diagram of equation discoverability, where discovery problems are organized according to structural complexity and coefficient complexity. This phase diagram shows how the field has moved from the discovery of sparse equations with simple coefficients toward more complex governing laws with richer structures and more flexible parameterizations. It also clarifies why different methodological families succeed or fail in different problem settings. We then present the representation-evaluation-optimization (REO) framework as a fundamental abstraction of the discovery process. By identifying the core problems of equation discovery that persist across algorithmic variations, REO shifts the discussion from individual algorithms to the fundamental principles that determine discoverability. We connect these perspectives to applications across physics and adjacent sciences, and argue that the next challenge is not merely recovering equations, but using them to revise existing theories, distil mechanisms and form new scientific concepts.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
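Scoring rationale: The paper addresses data-driven discovery of differential equations for scientific discovery, which belongs to scientific machine learning and does not match the keyword areas of multimodal large models and reinforcement learning. 'Unify Models' scores 2 because the paper proposes the unifying REO framework; 'World Models' and 'model-based RL' score 1 each because of conceptual differences (physical modeling vs. generative/RL models); the remaining keywords are entirely unrelated and score 0. No target expert authors were found.
Keywords
Data-driven discovery, governing differential equations, physical systems, problem-oriented perspective, phase diagram, REO framework, scientific discovery, discoverability
Abstract (Chinese translation)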
神经文本到语音(TTS)及多语言语音生成的近期进展显著提升了合成语音质量,但这些收益在全球语言间的分布仍不均衡。现有模型仍主要由少数高资源语言主导,而许多低资源 TTS 研究是在人工下采样的高资源语料库上模拟的,这些语料库无法反映真正代表性不足环境中遇到的正字法变异及有限的音系覆盖范围。为此,我们推出了 OpenBibleTTS,这是一个涵盖 37 种代表性不足语言的大规模低资源语音合成基准。此外,我们在领域内圣经文本和领域外材料上对各种 TTS 架构及大规模语音生成模型进行了系统比较。结果显示,没有任何单一系统在所有语言和指标上占优:Gemini-TTS 在大多数评估语言上获得了最高的听众评分,但在 OpenBibleTTS 上训练的单语 EveryVoice 模型在可懂度方面仍表现最强,且在几种非洲语言中更受青睐;而从零开始的开源系统在领域外文本上性能急剧下降,这揭示了在服务不足的语音社区中,广泛的多语言覆盖与可靠的合成质量之间仍存在持续差距。我们通过主观人类评判补充自动评估,并开源所有处理后的数据集、对齐数据及训练模型,以支持未来的低资源 TTS 研究。
Abstract
Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
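Scoring rationale: The paper focuses on text-to-speech (TTS) synthesis and benchmarking for low-resource languages; its core contributions are dataset construction and model comparison rather than unified model architectures, world models, or reinforcement learning. The keywords Visual Encoder, MLLM, and model-based RL are entirely unrelated to the paper; MultiModal is only marginally relevant because the work involves both speech and text, and Tokenizer and Unify Models are also weakly related.
Keywords
Low-Resource Languages, Text-to-Speech, Speech Synthesis, OpenBibleTTS, Multilingual, Benchmark, Intelligibility
Abstract (Chinese translation)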
检索增强生成(Retrieval-Augmented Generation, RAG)常在查询、文档证据与用户意图处于不同抽象层级时失效。查询可能涉及类、关系或事件,而文档仅陈述具体实例、间接框架或受限表述。我们将这种不匹配定义为抽象差距(abstraction gap):对齐查询意图与可用证据所需的最小类型化假设集合。为填补这一差距,我们提出 AbstRAG,将抽象视为显式的检索对象。AbstRAG 将查询 - 证据差距分解为表达、概念、意图 - 证据以及事件类型组件,并通过结合匹配质量、查询无关效用先验以及所需桥梁的成本来评估相关性。其核心机制是反思性精炼(reflective refinement):评判器诊断检索失败,定位失效的抽象算子,提出最小化的阶段特定补丁,且仅在满足充分性和压缩控制时才接受该补丁。在三个文档内检索基准上与七种基线方法对比,AbstRAG 在 21 种配对自助对比中的 18 种上于 nDCG@10 指标上表现更优,且在三个基准上的生成准确率分别提升了 1.9%、5.2% 和 4.0%;消融实验证实,反思性精炼驱动了大部分检索增益,且仅压缩控制一项即可在压力切片(stress slice)上将过度扩展假阳性率从 73.7% 降至 0%。
Abstract
Retrieval-augmented generation often fails when the query, the document evidence, and the user's intent are expressed at different levels of abstraction. A query may ask about a class, a relation, or an event, while the document only states specific instances, indirect framings, or scoped formulations. We define this mismatch as an abstraction gap: the minimal set of typed assumptions required to align query intent with the available evidence. To close this gap, we introduce AbstRAG, which treats abstraction as an explicit retrieval object. AbstRAG decomposes the query-evidence gap into expression, conceptual, intent-evidence, and event-type components, and scores relevance by combining match quality, a query-independent utility prior, and the cost of the required bridges. Its central mechanism is reflective refinement: a critic diagnoses retrieval failures, localizes the failed abstraction operator, proposes a minimal stage-specific patch, and accepts the patch only under sufficiency and compression controls. Across three within-document retrieval benchmarks against seven baselines, AbstRAG outperforms on nDCG@10 in 18 of 21 paired-bootstrap contrasts and improves generation accuracy by 1.9%, 5.2%, and 4.0% across the three benchmarks; ablations confirm that reflective refinement drives most of the retrieval gain and the compression control alone reduces over-expansion false positives from 73.7% to 0% on a stress slice.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
Scoring rationale: The paper focuses on text-based Retrieval-Augmented Generation (RAG) and abstraction gaps, lacking visual components, multimodal integration, world modeling, or explicit reinforcement learning frameworks. Thus, Visual Encoder, World Models, MLLM, and MultiModal score 0. Unify Models and model-based RL receive minimal scores (1-2) due to conceptual similarities in abstraction/refinement but not core focus. Tokenizer is generic. No expert authors from the target list were found. The weighted total score (6.0) is well below the dynamic passing threshold (27.8), indicating low relevance to the provided keyword set.
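As a rough illustration of how the scoring table and the rationale fit together, the sketch below recomputes the weighted total for this entry. It assumes the total is the plain sum of weight × relevance over the seven keywords, which reproduces the 6.0 quoted in the rationale, and that an entry passes when its total reaches the threshold. The names `WEIGHT`, `PASS_THRESHOLD`, and `weighted_scores` are illustrative, and the report does not say how the dynamic threshold itself is derived.

```python
WEIGHT = 1.5            # per-keyword weight shown in the table
PASS_THRESHOLD = 27.8   # dynamic passing threshold quoted in the rationale

# Relevance values (out of 10) copied from this entry's scoring table.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 0.0,
    "model-based RL": 1.0,
}


def weighted_scores(relevance_by_keyword, weight):
    """Per-keyword score = weight * relevance (assumed from the table columns)."""
    return {kw: weight * rel for kw, rel in relevance_by_keyword.items()}


scores = weighted_scores(relevance, WEIGHT)
total = sum(scores.values())
passes = total >= PASS_THRESHOLD  # assumed pass rule: total must reach the threshold

print(scores, total, passes)
```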
Keywords
Retrieval-Augmented Generation, Abstraction Gap, Reflective Refinement, Query Intent Alignment, Document Evidence, Text Retrieval, Benchmark Evaluation
Abstract (Chinese translation)
复杂推理任务日益要求系统产生输出,其正确性无法通过与单一参考进行精确匹配来评判。自动形式化(AF)是一个代表性例子:它要求模型将非形式化的数学或逻辑推理翻译为可形式化验证的对象,但专家验证的形式化无法扩展到简单示例之外,且单个非形式化论证可对应多种有效的形式化表达。因此,进展取决于部分、结构化的代理能否替代精确参考。我们引入了一种针对 AF 的无参考代理评判框架,该框架用每轴属性检查向量替代了金标准匹配。该框架沿三个结构范围组织代理:涵盖所生成对象的全局属性、其子组件内部的每模块属性,以及将其重新对齐到非形式化来源的跨域属性,并将每个轴聚合为一个判决向量。该向量驱动一个反思性精炼循环,其中违反的坐标将控制器路由至匹配的修复目标,因此每次迭代仅更改被判定为错误的部分。在有界评判噪声下,期望内在差距几何级收缩至依赖于噪声的平台期。在 miniF2F、ProofNet、e-SNLI 和 ProntoQA 上的七个形式化骨干网络上,精炼相对于单次(single-shot)ICL 基线一致提升了通过率(Pass Rate),且在基线有改进空间的基准上,每轴代理优于匹配的标量代理。因此,结构化代理评判在缺乏精确参考时,既提供了实用的精炼信号,也提供了关于收敛的理论依据。
Abstract
Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.
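The claim that the gap "contracts geometrically to a noise-dependent plateau" can be illustrated with one simple recursion that behaves this way. This is an assumption for illustration, not the bound proved in the paper: $g_t$ denotes the expected intrinsic gap after $t$ refinement iterations, $\rho \in [0, 1)$ is a hypothetical contraction factor, and $\varepsilon \ge 0$ is a hypothetical per-iteration term induced by judge noise.

```latex
\begin{aligned}
g_{t+1} &\le \rho\, g_t + \varepsilon, \qquad 0 \le \rho < 1,\ \varepsilon \ge 0 \\
\Longrightarrow\quad
g_t &\le \rho^{t} g_0 + \varepsilon \sum_{k=0}^{t-1} \rho^{k}
     = \rho^{t} g_0 + \varepsilon\,\frac{1-\rho^{t}}{1-\rho}
     \;\xrightarrow[t\to\infty]{}\; \frac{\varepsilon}{1-\rho}.
\end{aligned}
```

Under this assumed form, the first term decays geometrically and the limit $\varepsilon/(1-\rho)$ plays the role of the plateau: it vanishes for a noiseless judge and grows with the noise level.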
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on Autoformalization and logical reasoning using a proxy-judge framework. It lacks content related to visual encoders, multimodal processing, world models, or reinforcement learning. While it unifies an evaluation approach across backbones, it does not match the architectural focus of the provided keywords.
Keywords
Autoformalization, Proxy-Judge, Reference-free, Property Checks, Reflective Refinement, Formal Verification, Logical Reasoning, Backbones
Abstract (Chinese translation)
光学前端(Optical Front-ends,如超表面(Metasurfaces))与神经网络后端(Neural Network Back-ends)的端到端联合优化(End-to-End Co-optimization)已广泛应用于成像任务,然而,表征此类系统何时以及为何优于传统基于透镜的成像(Lens-Based Imaging)的形式化框架在很大程度上仍缺乏。本文聚焦于物体分类(Object Classification)这一核心成像任务,探讨了在非相干成像(Incoherent Imaging)中,相位掩模(Phase Mask)的端到端优化何时能优于传统聚焦透镜(Conventional Focusing Lens)的性能。我们发现,这些增益主要出现在受限的探测器读出(Constrained Detector Readout)条件下,而在全探测器读出(Full Detector Readout)条件下则受到限制。在后一种设置下,我们证明任何非相干相位掩模都无法超过探测器测量值与类别标签之间的理想信道互信息(Ideal-Channel Mutual Information);传统聚焦透镜接近这一上限,而联合优化并未带来实际增益。当探测器读出受约束时——例如通过粗粒度空间采样(Coarse Spatial Sampling)或有限的测量次数——优化后的光学元件可以通过提高探测器测量值中的类可分性(Class Separability)来显著改善分类性能。这些增益在低探测器噪声下最大,并随噪声增加而减小,因为光学元件在信号到达探测器之前对其进行整形,但无法消除随后添加的噪声。这一优势还取决于任务的频谱结构(Spectral Structure):当类判别性内容集中在比类内变化(Within-Class Variation)更低的空间频率(Spatial Frequencies)时,联合设计带来的帮助最大。我们开发了一个形式化这些区别的理论框架,并在合成数据(Synthetic Data)及标准基准数据集(Standard Benchmarks,如 MNIST、FashionMNIST、SVHN)上验证了其预测。
Abstract
End-to-end co-optimization of optical front-ends (e.g. metasurfaces) and neural network back-ends has been widely applied to imaging tasks, yet a formalism characterizing when and why such systems outperform conventional lens-based imaging is largely lacking. This paper focuses on object classification, a central imaging task, and asks when end-to-end optimization of a phase mask for incoherent imaging improves performance over a conventional focusing lens. We find that these gains arise primarily under constrained detector readout and are limited under full detector readout. In the latter setting, we prove that no incoherent phase mask exceeds the ideal-channel mutual information between detector measurements and class labels; a conventional focusing lens approaches this ceiling, and joint optimization yields no empirical gain. When detector readout is constrained -- by coarse spatial sampling or a limited number of measurements -- optimized optics can substantially improve classification by increasing class separability in the detector measurements. These gains are largest under low detector noise and shrink as noise grows, because the optics shape the signal before it reaches the detector but cannot remove noise added afterward. The advantage also depends on the spectral structure of the task: co-design helps most when class-discriminative content is concentrated at lower spatial frequencies than within-class variation. We develop a theoretical framework formalizing these distinctions and test its predictions on synthetic data and standard benchmarks (MNIST, FashionMNIST, SVHN).
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on computational imaging and optical front-end optimization, which differs significantly from the provided keywords on multimodal LLMs, tokenizers, world models, and reinforcement learning. Only slight relevance exists for 'Unify Models' (end-to-end co-optimization) and vision-related terms ('Visual Encoder', 'MultiModal'), resulting in a low total score (6.0) well below the dynamic pass threshold.
Keywords
End-to-End Optimization, Incoherent Imaging, Classification, Detector-Limited Readout, Optical Front-ends, Phase Mask, Neural Network Back-ends
Abstract (Chinese translation)
多相机系统因其视场角广、灵活性强及容错性好,正被广泛应用于机器人学和自主导航领域。然而,现有的 PnP(Perspective-n-Point)求解器尚无法处理多个投影中心。本文提出了一种虚拟点(virtual point)表述,该表述桥接了标准 PnP 问题与广义位姿问题,从而构建了一个统一流程,可将现有的 PnP 求解器转换为广义位姿求解器。基于此框架,我们推导出了三种基于虚拟点的广义位姿求解器,即 VGPc、VGPq 和 VGPr,它们分别基于凯莱(Cayley)、四元数(quaternion)和旋转矩阵(rotation-matrix)参数化方法。大量实验表明,所提出的求解器继承了原始 PnP 算法的精度与效率,同时显著优于现有的广义求解器。具体而言,VGPc 在异方差噪声条件下实现了更高的估计精度,VGPq 保持了全局最优性,而 VGPr 则在不降低精度的前提下提供了卓越的计算效率。
Abstract
Multi-camera systems are increasingly adopted in robotics and autonomous navigation for their wide field of view, flexibility, and fault tolerance. Nevertheless, existing PnP solvers fail to handle multiple projection centers. This paper introduces a virtual point formulation that bridges the standard PnP and generalized pose problems, enabling a unified pipeline that transforms existing PnP solvers into generalized pose solvers. Based on this framework, we derive three Virtual-point-based Generalized Pose solvers, namely VGPc, VGPq, and VGPr, leveraging Cayley, quaternion, and rotation-matrix parameterizations, respectively. Extensive experiments demonstrate that the proposed solvers inherit the accuracy and efficiency of original PnP algorithms while significantly outperforming existing generalized solvers. Specifically, VGPc achieves higher estimation accuracy under heteroscedastic noise conditions, VGPq maintains global optimality, whereas VGPr provides superior computational efficiency without accuracy degradation.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on computer vision geometry (PnP solvers) and multi-camera pose estimation using virtual points. It does not involve large language models, tokenization, neural visual encoders, world models, or reinforcement learning. While it mentions a 'unified pipeline' and 'multi-camera' systems, these concepts do not align with the deep learning/MLLM context implied by the keyword list (e.g., Unify Models, MLLM, World Models), resulting in low relevance scores for most keywords.
Keywords
Virtual-point-based, Generalized Absolute Pose Problem, Multi-camera systems, PnP solvers, Unified pipeline, Rotation-matrix parameterizations, Heteroscedastic noise, Global optimality
Abstract (Chinese translation)
现代观测天文学中的源检测是准确定位和识别恒星源的基石。这对于恒星种群合成及宇宙学参数估计等研究至关重要。然而,天文图像的特性(包括高密度、点扩散函数(PSF)效应及低信噪比)对最新的高级目标检测器构成了严峻挑战。此外,由于在天文图像中密集、微弱且小的源难以标注,全监督检测方法难以实际应用。为应对天文数据集的稀缺性,我们引入了一个新的综合基准(LAMOST-DET),包含 18,400 张天文图像和 728,898 个源实例。基于该数据集,我们进一步设计了一种名为 Nova Teacher 的新型半监督学习框架,能够在稀疏标注下有效检测密集源。该框架在双教师范式下集成了源光增强模块、置信度引导的伪监督以及跨视图互补挖掘。在 LAMOST-DET 上的广泛实验表明,Nova Teacher 在两种半监督设置下相比先前方法稳定地将 mAP 提高了 4.04% 和 5.22%。此外,我们的方法在自然图像数据集上与现有检测器进行了对比,验证了其在各种场景下的泛化能力。源代码可在 https://github.com/AcWiz/NovaTeacher 获取。
Abstract
Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sources accurately. It is crucial for studies such as stellar population synthesis and cosmological parameter estimation. However, the characteristics of astronomical images, including high density, the effect of point spread functions, and low signal-to-noise ratios, significantly challenge the latest advanced object detectors. Moreover, fully supervised detection methods are hardly practical because of the significant difficulty of annotating dense, small, and faint sources in astronomical images. To tackle the scarcity of astronomical datasets, we introduce a new comprehensive benchmark (LAMOST-DET), comprising 18,400 astronomical images and 728,898 source instances. Building on this dataset, we further devise a novel semi-supervised learning framework, coined Nova Teacher, capable of detecting dense sources effectively given sparse annotations. It integrates a source light enhancement module, confidence-guided pseudo-supervision, and cross-view complementary mining in a dual-teacher paradigm. Extensive experiments on LAMOST-DET show that Nova Teacher consistently improves on previous competitors by 4.04% and 5.22% mAP under two semi-supervised settings. Additionally, our method is competitive with other detectors on a natural image dataset, validating its generalization ability across scenarios. The source code is available at https://github.com/AcWiz/NovaTeacher.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
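Scoring rationale: The paper focuses on semi-supervised source detection in astronomical images and belongs to computer vision. The provided keywords mainly concern large language models, world models, and reinforcement learning and are a poor match for the paper. Only 'Visual Encoder' is slightly relevant because the work involves image feature extraction; 'MultiModal' and 'Unify Models' have very low relevance, and the remaining keywords are entirely unrelated. The total score is far below the dynamic passing threshold of 27.8. The author list does not include any of the designated experts.
Keywords
Semi-supervised learning, Astronomical images, Source detection, Nova Teacher, LAMOST-DET, Dual-teacher paradigm, Object detection
Abstract (Chinese translation)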
从头开始训练强化学习 (RL) 策略 (Policy) 成本高昂:这需要精心设计的奖励函数与环境、广泛的调优以及大量的计算。然而,许多控制问题已经存在一个功能可用但次优的策略,可作为基线 (Baseline)。本文提出了一种方法,将此类基线嵌入到强化学习 (RL) 训练过程中,相较于从头开始的方法,既能提高训练效率,又能产生一个优于基线的学习策略。在每一步中,该方法在基线策略与可训练的学习策略之间进行权衡:最初强烈依赖基线策略,随后逐步将决策权转移给学习策略。训练结束时,学习策略成为一个独立的神经网络 (Neural Network),无需基线策略支持即可运行。本文形式化了基线策略功能性的含义:在此策略下,智能体 (Agent) 能够以高概率到达目标 (Goal) 状态集并停留于其中。所提出的仲裁机制旨在利用这一特性进行训练,从而从一开始就实现高目标达成率。理论分析在给定假设下对此行为提供了正式解释,并将其扩展到最终的无基线情形,在此情形下推导出了独立学习策略目标达成概率的明确下界。在连续控制基准任务上的实证结果表明,该方法获得的回报匹配或超过竞争方法,同时在所有比较方法中保持最高的目标达成率——包括在最后阶段,此时学习策略无需任何基线支持即可运行。
Abstract
Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
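The abstract does not specify the arbitration rule, so the following is only a toy sketch of the general idea of handing control from a baseline to a learner over the course of training. It assumes a linear schedule and a per-step random choice between the two policies; `baseline_weight`, `arbitrate`, and both toy policies are hypothetical names and are not the paper's mechanism.

```python
import random


def baseline_weight(step, total_steps):
    """Probability of using the baseline: assumed linear decay from 1 to 0."""
    return max(0.0, 1.0 - step / total_steps)


def arbitrate(state, step, total_steps, baseline_policy, learning_policy, rng):
    """Pick which policy acts at this step and return (action, source)."""
    if rng.random() < baseline_weight(step, total_steps):
        return baseline_policy(state), "baseline"
    return learning_policy(state), "learner"


def baseline_policy(state):
    # Hypothetical functional-but-suboptimal controller for a 1-D state.
    return -0.5 * state


def learning_policy(state):
    # Stand-in for the trainable policy (a fixed gain here, no learning).
    return -0.9 * state


rng = random.Random(0)
total_steps = 1000
sources = [
    arbitrate(1.0, step, total_steps, baseline_policy, learning_policy, rng)[1]
    for step in range(total_steps + 1)
]

early_baseline = sources[:100].count("baseline")
late_baseline = sources[-100:].count("baseline")
print(early_baseline, late_baseline)
```

Once the weight reaches zero the baseline is never consulted, which mirrors the final baseline-free stage described in the abstract.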
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
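Scoring rationale: The paper focuses on enhancing model-free reinforcement learning policies, improving training efficiency through an arbitration mechanism with a baseline policy. It has no direct connection to multimodal learning, tokenizers, visual encoders, MLLMs, or world models. Although it involves integrating policies, this does not fit the usual sense of 'Unify Models' in this keyword cluster (unifying multimodal or large models). The paper is explicitly characterized as model-free, so it is unrelated to 'model-based RL'. The author list includes no experts from the designated fields. The weighted total is about 4.5, far below the dynamic passing threshold of 27.8.
Keywords
Reinforcement Learning, Policy Enhancement, Model-Free, Baseline Policy, Agency Transfer, Continuous-Control, Arbitration Mechanism, Goal-reaching
Abstract (Chinese translation)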
硬安全过滤器(Hard safety filters)正日益被置于学习控制器的下游,以确保在运行时满足约束。然而,一个从不违反约束的过滤控制器可能仍未学到任何安全知识:过滤器可以无声地修复一个能力不足的上游策略,导致过滤后的成功衡量的是过滤器本身,而非策略。我们认为,安全策略学习应追问谁真正获得了安全性——是策略本身还是其保护性层——并使这一问题可量化。我们提出了干预感知变分量子微分预测控制(IA-VQC-DPC),该方法(i)在原始对偶干预预算下训练紧凑的变分量子电路(VQC)策略,该预算惩罚对可微控制屏障函数(CBF)投影的依赖;(ii)采用安全归因协议进行评估,该协议将执行轨迹修正分解为 CBF 项与部署运行时保护器项,并通过关闭保护器评估对策略进行压力测试。在闭环高保真 BOPTEST 建筑控制模拟器上(5 个随机种子,每种方法 60 个回合),干预感知训练显著降低了量子策略的原始预过滤器违规率和总安全层依赖度(两者 p < 10^-4),且未出现显著的能量退化;在参数预算相当(约 400 参数)的情况下,量子策略比匹配的经典策略更安全且更舒适。关闭保护器评估证实了该改进源于策略层面,并揭示了一个有价值的负面结果:学习到的可微能量头仅在与分布感知运行时保护器配对时才安全。该归因协议具有通用性,不仅限于量子策略和建筑领域。
Abstract
Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filtered controller that never violates a constraint may still have learned nothing about safety: the filter can silently repair an incompetent upstream policy, so that post-filter success measures the filter, not the policy. We argue that safe policy learning should ask who earns the safety - the policy or its protective layers - and we make this question measurable. We introduce Intervention-Aware Variational Quantum Differentiable Predictive Control (IA-VQC-DPC), which (i) trains a compact variational quantum circuit (VQC) policy under a primal-dual intervention budget that penalizes reliance on a differentiable Control-Barrier-Function (CBF) projection, and (ii) is evaluated with a safety-attribution protocol that decomposes the executed-trajectory correction into a CBF term and a deployment runtime-guard term, and stress-tests the policy with guard-off evaluation. On closed-loop, high-fidelity BOPTEST building-control emulators (5 seeds, 60 episodes per method), intervention-aware training significantly lowers the quantum policy's raw pre-filter violation and total safety-layer reliance (both p < 10^-4) with no significant energy regression; at an equal approximately 400-parameter budget the quantum policy is significantly safer and more comfortable than a matched classical policy. Guard-off evaluation confirms the improvement is policy-level and exposes a valuable negative result: a learned differentiable energy head is only safe when paired with a distribution-aware runtime guard. The attribution protocol is general beyond quantum policies and buildings.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
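评分理由: The paper studies safety attribution in quantum predictive control, spanning control theory, quantum computing, and control barrier functions. The keyword set (e.g., Tokenizer, Visual Encoder, MLLM, MultiModal) targets multimodal large models and is almost unrelated to the paper's topic (quantum control, safety attribution). Only "model-based RL" is weakly related, because the method uses a model-based predictive control policy, and "Unify Models" is very weakly related, because the method integrates the policy with its safety layers. The author list includes none of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus applies.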
关键词
Quantum Predictive Control, Safety Attribution, Intervention-Aware, Control-Barrier-Function, Variational Quantum Circuit, Building Control, Policy Learning
摘要翻译
在非平稳性条件下对深度神经网络进行持续训练,通常会导致可塑性的渐进式丧失,最终限制进一步的学习。我们将可塑性与经验神经切核(NTK)联系起来,并识别出动力等距(即层雅可比奇异值保持接近一的条件)作为在持续学习中保持可塑性的关键机制。我们重新审视了一类网络,这些网络几乎处处保持等距,同时仍然是通用的 Lipschitz 函数近似器,证明了近动力等距与富有表现力的非线性表示是兼容的。对于通用架构,我们提出了一种高效的促进等距的正则化方案,并识别出一种新颖的机制,通过该机制它可以重新激活休眠的 ReLU 单元。在此基础上,我们引入了 AdamO,这是一种 Adam 风格的自适应优化器,它将等距正则化与梯度更新解耦,类似于 AdamW。我们进一步通过动力等距的视角重新审视了先前的保持可塑性的方法,表明它们仅针对等距的部分度量。在旨在诱导可塑性丧失的监督学习和强化学习的持续学习基准上,我们的方法始终匹配或优于现有方法。
Abstract
Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Tangent Kernel, and identify dynamical isometry (the condition that layer-wise Jacobian singular values remain close to one) as a key mechanism for preserving plasticity in continual learning. We revisit a class of networks that are almost-everywhere isometric while remaining universal Lipschitz function approximators, demonstrating that near-dynamical isometry is compatible with expressive nonlinear representations. For general architectures, we propose an efficient isometry-promoting regularization scheme and identify a novel mechanism by which it can reactivate dormant ReLU units. Building on this, we introduce AdamO, an Adam-style adaptive optimizer that decouples isometry regularization from gradient updates, analogous to AdamW. We further reinterpret prior plasticity-preserving approaches through the lens of dynamical isometry, showing that they target only a partial measure of isometry. Across supervised and reinforcement-learning continual-learning benchmarks designed to induce plasticity loss, our methods consistently match or outperform existing approaches.
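As a rough illustration of what an isometry-promoting regularizer applied in a decoupled way might look like, the sketch below uses the generic soft-orthogonality penalty $\|W^\top W - I\|_F^2$ for a single linear layer (whose Jacobian is just $W$) and applies its gradient as a separate shrinkage step, in the way AdamW separates weight decay from the gradient update. The abstract does not give AdamO's update rule, so the penalty, the function names, and the step size `lam` are assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def gram_residual(w):
    """W^T W - I, which is zero exactly when the columns of W are orthonormal."""
    g = matmul(transpose(w), w)
    return [[g[i][j] - (1.0 if i == j else 0.0) for j in range(len(g))]
            for i in range(len(g))]


def isometry_penalty(w):
    """Squared Frobenius norm of W^T W - I."""
    return sum(v * v for row in gram_residual(w) for v in row)


def decoupled_isometry_step(w, lam):
    """Apply the penalty gradient as its own step, separate from the loss gradient.

    The gradient of the penalty is 4 * W (W^T W - I); the factor 4 is folded into lam.
    """
    corr = matmul(w, gram_residual(w))
    return [[w[i][j] - lam * corr[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# A layer with singular values 2.0 and 0.5: one step pulls both toward 1.
w = [[2.0, 0.0], [0.0, 0.5]]
w_next = decoupled_isometry_step(w, lam=0.05)
```

In a full optimizer this step would follow the usual adaptive gradient update for the task loss; keeping it separate means the adaptive scaling does not distort the regularizer.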
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on Continual Learning and Dynamical Isometry to preserve plasticity in deep neural networks, showing minimal relevance to Multimodal, World Models, or LLM-specific components (Tokenizer, Visual Encoder, MLLM). Although RL benchmarks are used for evaluation, the method is not model-based. No listed expert authors are present in the author list.
关键词
Continual Learning, Dynamical Isometry, Neural Tangent Kernel, Plasticity Preservation, AdamO Optimizer, Reinforcement Learning, Regularization Scheme
摘要翻译
委托范围执行(Delegation-scoped execution)无法从标准可观测性指标(Observables)中识别:审计日志(Audit logs)和执行轨迹(Execution traces)在多种不兼容的委托分配(Delegation assignments)下可能完全相同。这种差距在基于大语言模型(LLM)的代理系统(Agentic systems)中尤为显著,因为代理会动态选择工具,针对同一指令在不同运行中改变执行序列,并生成协作的子代理(Sub-agents)。这些动态行为导致轨迹碎片化和交错,使得仅凭因果结构(Causal structure)进行委托范围重构在结构上是欠定的(Structurally underdetermined)。尽管单个动作已被授权并记录,但现有的审计、追踪和安全模式(Schemas)缺乏必要的语义,无法在异构系统(Heterogeneous systems)中重构特定委托下发生了哪些动作。我们关注委托范围归属(Attribution)和访问/共享足迹(Access/share footprint)重构,而非意图推理(Intent inference)或推理重构(Reasoning reconstruction)。我们提出一种感知代理的可观测性基底(Observability substrate),由轻量级网关(Lightweight gateway)和通用信息模型(Common information model)组成,可在执行时绑定委托上下文(Delegation context)。这使得可靠的跨工具委托范围重构和直接取证查询(Forensic queries)成为可能,无需启发式时间窗口关联(Heuristic time-window correlation)。
Abstract
Delegation-scoped execution is not identifiable from standard observables: audit logs and execution traces can be identical under multiple incompatible delegation assignments. This gap is especially acute in LLM-based agentic systems, where agents dynamically select tools, vary execution sequences across runs for the same instruction, and spawn cooperating sub-agents. These dynamics fragment and interleave traces, making delegation-scoped reconstruction from causal structure alone structurally underdetermined. Although individual actions are authorized and logged, existing audit, tracing, and security schemas lack the semantics to reconstruct what actions occurred under a given delegation across heterogeneous systems. We focus on delegation-scoped attribution and access/share footprint reconstruction, not intent inference or reasoning reconstruction. We present an agent-aware observability substrate consisting of a lightweight gateway and a common information model that binds delegation context at execution time. This enables reliable cross-tool delegation-scoped reconstruction and direct forensic queries without heuristic time-window correlation.
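The core idea of binding delegation context at execution time can be sketched as a gateway that stamps every tool call with the delegation it runs under, so that footprint queries become exact lookups rather than time-window guesses. The record fields, class names, and identifiers below are hypothetical; the abstract does not specify the paper's information model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRecord:
    delegation_id: str  # bound when the action executes, not inferred afterwards
    agent: str
    tool: str
    resource: str
    operation: str


class Gateway:
    """Hypothetical lightweight gateway that every tool call passes through."""

    def __init__(self):
        self.log = []

    def execute(self, delegation_id, agent, tool, resource, operation):
        record = ActionRecord(delegation_id, agent, tool, resource, operation)
        self.log.append(record)
        return record

    def footprint(self, delegation_id):
        """Access/share footprint: the (resource, operation) pairs under one delegation."""
        return {(r.resource, r.operation) for r in self.log
                if r.delegation_id == delegation_id}


gateway = Gateway()
# Two delegations interleave, and a sub-agent acts under the first one.
gateway.execute("d1", "planner", "drive", "report.docx", "read")
gateway.execute("d2", "planner", "mail", "inbox", "read")
gateway.execute("d1", "sub-agent-7", "mail", "report.docx", "share")
```

Without the `delegation_id` field, the three log entries above would be compatible with several different delegation assignments, which is the identifiability gap the abstract describes.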
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
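评分理由: The paper focuses on observability, delegated-execution auditing, and log reconstruction in agentic systems, which belongs to AI security and systems architecture. The scoring keywords concentrate on multimodal large models (MLLM, MultiModal, Visual Encoder, Tokenizer), representation learning, and reinforcement learning (World Models, model-based RL), a markedly different domain. Apart from very low scores for MLLM (the systems are LLM-based) and Unify Models (the common information model is a form of unification), all other keywords score 0. The author list (Abhinav Mishra, Kumar Sharad) includes none of the specified experts such as Yang Shi, so no bonus applies. The weighted total is 4.5, far below the dynamic pass mark of 27.8.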
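The arithmetic behind the scoring table can be reproduced in a few lines: each keyword's score is its weight times its relevance, and the total is compared with the pass mark. The values are copied from the table and rationale for this paper; the variable names are illustrative.

```python
WEIGHT = 1.5       # every keyword in the table carries the same weight
PASS_MARK = 27.8   # dynamic pass mark quoted in the rationale

# Relevance values (out of 10) from the table for this paper.
relevance = {
    "Unify Models": 1.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 2.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}

scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())
passes = total >= PASS_MARK
```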
关键词
Observability, Delegated Execution, Agentic AI Systems, LLM-based, Audit Logs, Common Information Model, Delegation Attribution, Forensic Queries
摘要翻译
协同物体搬运在许多领域至关重要,涵盖从工业到家庭服务。一种流行的搬运策略是将物体承载于多机器人系统之上进行搬运。相应任务通常通过将其分解为三个相互关联的子问题来解决:编队控制、协同导航和避障。真实物体带来的特定挑战在于其可能具有任意形状和非均匀质量分布,这需要机器人编队能够稳固支撑物体。本文通过提出一种新颖的多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)方法,解决了此类真实物体运输中的编队控制挑战。该方法使多机器人系统能够在编队过程中自主定位至物体下方以支撑其重量,同时避免障碍物。我们在不同环境和不同机器人数量下的评估表明,该方法能够生成可靠产生平衡编队的策略,并能泛化至杂乱场景以及具有复杂几何形状和非均匀质量分布的物体。
Abstract
Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on multi-agent reinforcement learning for robotic transportation, sharing only superficial relevance with 'model-based RL'. It is unrelated to 'Unify Models', 'Tokenizer', 'Visual Encoder', 'World Models', 'MLLM', and 'MultiModal' as these pertain to AI architectures, not robotic control. No specified expert authors are present.
关键词
Multi-Agent Reinforcement Learning, Cooperative Transportation, Shape Formation, Arbitrary Objects, Collision Avoidance, Robot Formation, Autonomous Positioning
摘要翻译
AI 微型短剧(AI minidramas,亦称"果剧"(fruit dramas))是近期在社交媒体平台上广泛涌现的一种现象,指那些具有拟人化角色、短小且通过算法分发的生成式 AI 视频系列。本文认为,尽管这些视频看似无害的美学,它们却复制了深层的性别化叙事结构:女性角色系统性地与道德越轨、性背叛及生育能力相关联;此外,若干情节还编码了“种族化”(racialization)的逻辑,即可见的身体差异被赋予道德负载的过程。基于女性主义电影理论、批判种族理论(critical race theory)及平台研究(platform studies),本文进一步指出,这些视频的生成式 AI 美学(以柔和、圆润及视觉上的可爱为特征)充当了一种“美学清洗”(aesthetic laundering)的机制,中和了这些叙事的意识形态权重,使其即便在内容审核系统下仍得以流通。本文通过个人观察与细读(close reading)来探讨这些问题,反思生成式 AI 的具体可供性(affordances),正是这些可供性使得这一现象成为可能,并对计算创意(computational creativity)领域具有深远的文化影响。
Abstract
AI minidramas (also known as fruit dramas) are short, algorithmically distributed generative AI video series featuring anthropomorphized characters that have recently emerged as a widespread phenomenon on social media platforms. This paper argues that despite their seemingly innocuous aesthetic, these videos reproduce deeply gendered narrative structures in which female characters are systematically associated with moral transgression, sexual betrayal, and reproductive capacity, and that several plots also encode the logic of racialization, i.e., the process by which visible bodily difference is morally loaded. Drawing on feminist film theory, critical race theory, and platform studies, it further argues that the generative AI aesthetic of these videos, characterized by softness, roundness, and visual cuteness, functions as a mechanism of aesthetic laundering, neutralizing the ideological weight of these narratives and enabling their circulation despite content moderation systems. This paper approaches these questions through personal observation and close reading, reflecting on the specific affordances of generative AI that make this phenomenon both possible and culturally consequential for the field of computational creativity.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
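评分理由: This paper belongs to the humanities and social sciences. It uses feminist film theory and critical race theory to analyze the cultural narratives and ideology of AI-generated videos (fruit dramas) and does not address technical implementations such as model architectures, tokenizers, visual encoders, world models, or reinforcement learning. Its relevance to the given technical keywords is therefore very low; it receives minimal scores only because it touches on generative AI and multimedia content in a broad sense.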
关键词
AI minidramas, fruit dramas, generative AI, aesthetic laundering, feminist film theory, critical race theory, platform studies, gendered narrative
摘要翻译
随着人工智能(AI)的进步,自适应与自组织系统的复杂性日益增长,使得它们越来越难以被理解和信任。虽然可解释性人工智能(Explainable AI)旨在提供对人工智能决策制定的洞察,但更高级的目标是系统能够自我解释——这种能力被称为自我可解释性(Self-Explainability,SX)。本文对自我可解释性(SX)进行了系统性文献综述,分析了现有方法,涵盖其应用领域、目标对象及评估方法。该综述提出了 SX 的统一定义和分类体系,并引入了自我可解释性层级(Levels of Self-Explainability),为定位当前及未来研究提供了框架。结果表明,大多数自我可解释性方法仍处于概念层面,实际实现案例较少。此外,目前尚无正式或事实标准用于评估自我可解释性,这突显了一个重大的研究缺口。因此,这项工作为在复杂系统中推进自我可解释性奠定了基础并制定了路线图。
Abstract
The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves - an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of Self-Explainability, providing a framework for positioning current and future research. Our results show that most SX approaches remain conceptual, with few practical implementations. Moreover, there is currently no formal or de facto standard for evaluating SX, highlighting a major research gap. This work thus establishes a foundation and roadmap for advancing Self-Explainability in complex systems.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
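评分理由: The paper focuses on self-explainability in self-adaptive and self-organising systems and is largely unrelated to the technical areas in the keyword list, such as multimodal large models and reinforcement learning (Tokenizer, Visual Encoder, MLLM, MultiModal, model-based RL). 'Unify Models' is slightly relevant only because the paper proposes a 'unified definition', and 'World Models' is marginally related through system modeling, but the overall topic is off-target, so the weighted total is low.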
关键词
Self-Explainability, Self-Adaptive Systems, Self-Organising Systems, Systematic Literature Review, Unified Definition, Taxonomy, Research Directions
摘要翻译
联邦学习(FL)允许一组客户端在不共享本地训练数据的前提下共同训练全局模型。将训练责任赋予去中心化参与者可能导致投毒攻击:由恶意第三方控制的客户端可能毒化训练数据集,从而在神经网络中植入后门。在联邦学习中,这些后门攻击仅依赖于算法方法,然而,硬件故障威胁(例如 Rowhammer)的最新进展扩大了整体的攻击面。在联邦模型适配的背景下,我们提出了一种针对联邦学习系统的新颖后门攻击类别,该攻击基于硬件故障攻击的模型投毒。更具体而言,我们提出了一种任务无关的后门攻击,该攻击在联邦学习训练期间通过诱导单个局部模型参数的硬件故障(比特翻转)来植入。该后门是在之前的离线阶段,基于联邦学习系统最初使用的预训练模型构造而成的。我们的结果表明,后门可以成功应用于不同类型的数据集和模型。通常情况下,每个恶意客户端最多产生 10 次故障,且在 ResNet-18 上累计发生 19 次,就足以达到 94% 的攻击成功率。最后,我们讨论了针对该攻击的潜在防御措施的实用性与鲁棒性,同时权衡了 Rowhammer 的实际约束条件,后者是此类威胁的首选攻击向量。
Abstract
Federated Learning (FL) allows a set of clients to collectively train a global model without sharing local training data. Giving the responsibility of the training to decentralized actors may lead to poisoning attacks: clients controlled by a malicious third party may poison the training dataset to install a backdoor in neural networks. In FL, these backdoor attacks rely solely on algorithmic approaches; however, recent advances in hardware-fault threats (e.g., Rowhammer) have widened the overall attack surface. In the context of federated model adaptation, we introduce a novel category of backdoor attack against FL systems that relies on model poisoning based on hardware-fault attacks. More precisely, we propose a task-agnostic backdoor attack that is implanted during FL training by inducing hardware faults (bit-flips) in the parameters of a single local model. The backdoor is crafted during a previous offline phase from the pretrained model initially used by the FL system. Our results show that a backdoor can be successfully applied to different types of models and datasets. Typically, up to 10 faults per malicious client occurrence and 19 total occurrences on a ResNet-18 are enough to reach a 94% attack success rate. Finally, we discuss the practicality and the robustness of the attack against potential defenses, while putting into perspective the practical constraints of Rowhammer, which is the preferred attack vector for this type of threat.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on security vulnerabilities in Federated Learning (FL) utilizing hardware faults (Rowhammer) for model poisoning, which is unrelated to the provided keywords concerning Multimodal AI, World Models, and Reinforcement Learning. Minimal overlap exists: FL involves model aggregation (loosely Unify Models), ResNet-18 uses a Visual Encoder, and vision is a component of MultiModal, but core concepts like Tokenizer, World Models, MLLM, and RL are absent. No expert authors from the specified list were found.
关键词
Federated Learning, Model Poisoning, Hardware Faults, Bit-Flips, Backdoor Attack, ResNet-18, Rowhammer
摘要翻译
个体动物识别可用于寻找走失或被盗的宠物、追踪濒危物种的个体,以及在拥挤的农场中识别动物。现有的识别技术大多使用物理设备,例如微芯片(microchips),通常不切实际且难以应用。这些方法可以通过基于动物面部的远程识别来替代;如果精度足够高,它具有多项优势:非侵入式、可在远距离工作且难以伪造,例如在食品行业中用患病动物替换健康动物的情况。现有的少数具备充足单主体图像且标注了单一动物身份的数据集,其规模不足以训练当前的深度学习架构。本文转而探究迁移学习的可能性,利用预训练网络模型作为骨干网络(backbones)。本文比较了 FaceNet(专门在大型人脸数据库上训练)与 Vision Transformer (ViT)(在 ImageNet 上预训练,即物体类别)。本文使用了三种差异显著动物的面部数据集:狗、灵长类(狐猴、金丝猴和黑猩猩)以及牛。本文报告了结果,并对每个数据集,将其与最先进的(SOTA)专门训练的深层网络进行了比较。三个数据集的拍摄条件各不相同。图像质量(分辨率、运动模糊、姿态多样性等)从狗到牛再到灵长类逐渐降低。在狗的数据集上取得了最佳性能,其中 ViT 的平均验证准确率达到 96.85%,第一识别率为 84.34%。濒危灵长类的结果仍具鼓舞性,但性能因动物类别和任务(验证或识别)而异,并不总是优于 SOTA。对于牛的数据集,ViT 的结果优于 SOTA,而 FaceNet 仍具竞争力。
Abstract
Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to apply. These could be replaced by remote recognition via the animal's face; if accurate enough, it provides several advantages: it is non-invasive, can work at a distance, and is difficult to counterfeit, as, for instance, in the case of substituting sick animals for healthy ones in the food industry. The few existing datasets with sufficient per-subject images annotated with a single animal identity are not large enough to train current deep learning architectures. We rather investigate the possibility of transfer learning, exploiting pre-trained network models as backbones. Our experiments compared FaceNet, which is specifically trained on large databases of human faces, with the Vision Transformer (ViT) pre-trained on ImageNet, i.e., on object categories. We used three face datasets of very different animals: dogs, primates (lemurs, golden monkeys, and chimpanzees), and cattle. We report the results and, for each dataset, compare them with the state of the art (SOTA) ad hoc-trained deep networks. The capture conditions differ among the three datasets. Image quality (resolution, motion blur, diverse poses, etc.) decreases from dogs to cattle to primates. The best performance was achieved with dogs, where ViT reached a mean verification accuracy of 96.85% and a Rank-1 Identification Rate of 84.34%. The results for endangered primates are still encouraging, but performance varies across animal classes and tasks (verification or identification), and does not always outperform SOTA. For cattle, the ViT results outperform SOTA, while FaceNet is still competitive.
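The two metrics quoted above, verification accuracy and Rank-1 identification rate, can be made concrete with a small sketch over embedding vectors. It assumes cosine similarity between backbone embeddings and a fixed verification threshold; the abstract does not state the paper's matching rule, and the two-dimensional embeddings and identities below are toy values.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank1_identification_rate(gallery, probes):
    """Fraction of probes whose most similar gallery entry has the correct identity."""
    hits = 0
    for true_id, embedding in probes:
        best_id, _ = max(gallery, key=lambda item: cosine(embedding, item[1]))
        hits += best_id == true_id
    return hits / len(probes)


def verification_accuracy(pairs, threshold):
    """Fraction of pairs where 'similarity >= threshold' agrees with the same-identity label."""
    correct = sum((cosine(a, b) >= threshold) == same for a, b, same in pairs)
    return correct / len(pairs)


# Toy data: one enrolled embedding per animal, three probe images.
gallery = [("rex", [1.0, 0.0]), ("luna", [0.0, 1.0])]
probes = [("rex", [0.9, 0.1]), ("luna", [0.2, 0.8]), ("rex", [0.3, 0.7])]
pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),   # same animal, similar embeddings
    ([1.0, 0.0], [0.0, 1.0], False),  # different animals
    ([1.0, 0.0], [0.3, 0.7], True),   # same animal, but a hard pose
]
```

Verification is a pairwise yes/no decision while identification is a search over the whole gallery, which is one reason the two numbers for the same backbone can differ substantially.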
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
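评分理由: The paper addresses transfer-learning-based face recognition of multiple animal species, which is conventional computer vision. The keywords center on multimodal large language models (MLLM), world models, and reinforcement-learning architectures and are largely unrelated to the paper. It receives a very low score only for its use of visual encoders such as ViT; the remaining keywords, such as Tokenizer, RL, and World Models, are not involved at all. The author list includes none of the specified experts, so no bonus applies.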
关键词
Animal Face Recognition, Transfer Learning, Vision Transformer, FaceNet, Multispecies, Deep Learning, Identification Rate
摘要翻译
本文提出了一种统一系统,旨在通过集成先进的天气预测、作物推荐以及面向农民的问答工具来支持精准农业。我们提出了两种深度学习模型——基于 Transformer 的图神经网络(Transformer-based Graph Neural Network)和时空图卷积网络(Spatio-Temporal Graph Convolutional Network, STGCN)——利用尼泊尔 1359 个地点的数据来预测未来 30 天的天气状况。STGCN 在准确性上优于基于 Transformer 的模型(均方误差 (MSE) ~0.011 对比 0.013),有效地建模了气候数据中的空间和时间依赖性。这些预测结果与静态土壤属性(如 pH 值、湿度和有机质含量)相结合,通过一种评分算法生成本地化的作物推荐,该算法旨在匹配每种作物的最佳生长条件。此外,我们还开发了一个检索增强生成(Retrieval-Augmented Generation, RAG)聊天机器人,利用特定领域的农业文档以自然语言回答农民的问题。整个系统通过移动应用程序进行部署,提供实时建议及对话式支持。用户反馈证实了系统的可用性和相关性,尤其是在个性化农业指导有限的农村地区。总体而言,我们的方法展示了如何将机器学习模型与当地农业数据相结合,从而赋予农民可操作的洞察,促进更明智的决策、更高的作物产量以及对气候变异性的更强韧性。
Abstract
This paper presents a unified system designed to support precision agriculture by integrating advanced weather prediction, crop recommendation, and a question-answering tool for farmers. We propose two deep learning models -- a Transformer-based Graph Neural Network and a Spatio-Temporal Graph Convolutional Network (STGCN) -- to forecast weather conditions for the next 30 days using data from 1,359 locations in Nepal. The STGCN outperforms the Transformer-based model in accuracy (MSE ~0.011 vs. 0.013), effectively modeling both spatial and temporal dependencies in climate data. These predictions are combined with static soil properties such as pH, moisture, and organic content to generate localized crop recommendations through a scoring algorithm that matches each crop's optimal growing conditions. Additionally, we develop a Retrieval-Augmented Generation (RAG) chatbot that leverages domain-specific agricultural documents to answer farmers' questions in natural language. The entire system is deployed via a mobile application, offering real-time suggestions and conversational support. User feedback confirms the system's usability and relevance, especially in rural settings where personalized farming guidance is limited. Overall, our approach demonstrates how combining machine learning models with local agricultural data can empower farmers with actionable insights, promoting more informed decisions, better crop yields, and increased resilience to climate variability.
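One plausible shape for a scoring algorithm that matches forecast and soil conditions to each crop's optimal range is sketched below. The abstract does not describe the actual scoring rule, so the linear fall-off outside the range, the placeholder crops (`crop_a`, `crop_b`), and all numeric ranges are illustrative assumptions rather than agronomic guidance.

```python
def condition_score(value, low, high):
    """1.0 inside the optimal range, falling linearly to 0.0 one range-width outside it."""
    if low <= value <= high:
        return 1.0
    gap = low - value if value < low else value - high
    return max(0.0, 1.0 - gap / (high - low))


def crop_score(conditions, optimum):
    """Average the per-condition scores over every condition the crop specifies."""
    scores = [condition_score(conditions[name], low, high)
              for name, (low, high) in optimum.items()]
    return sum(scores) / len(scores)


def recommend(conditions, crops, top_k=1):
    """Rank crops by how well local conditions match their optimal ranges."""
    ranked = sorted(crops, key=lambda crop: crop_score(conditions, crops[crop]), reverse=True)
    return ranked[:top_k]


# Hypothetical optimal ranges per crop.
CROPS = {
    "crop_a": {"soil_ph": (5.5, 7.0), "mean_temp_c": (20.0, 30.0)},
    "crop_b": {"soil_ph": (6.0, 7.5), "mean_temp_c": (10.0, 20.0)},
}
# Static soil property plus a value that would come from the 30-day forecast.
local_conditions = {"soil_ph": 6.5, "mean_temp_c": 25.0}
best = recommend(local_conditions, CROPS)
```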
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
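评分理由: The paper focuses on precision agriculture using spatio-temporal graph neural networks and RAG, and aligns poorly with the multimodal, world-model, and reinforcement-learning keywords. 'Unify Models' receives a low score only for the system-level integration, and 'MultiModal' only for the fusion of text and data. The paper involves no visual encoder, unified tokenizer, MLLM architecture, or reinforcement-learning component. The author list includes none of the specified experts, so no bonus applies.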
关键词
Spatio-Temporal Graph Neural Networks, Weather Prediction, Crop Recommendation, Retrieval-Augmented Generation, Precision Agriculture, Mobile Application, Soil Properties
摘要翻译
AutoMegaKernel (AMK) 将 HuggingFace Llama 系列模型编译为单个持久化协作 CUDA 内核,该内核在一次启动中运行整个前向传播过程,无需为每个模型手写 CUDA。其贡献在于系统本身,而非原始速度。一个冻结的 schedule-IR(调度 IR)验证器通过静态图检查(而非机械化证明)静态认证无死锁和无竞争条件,因此在启动前拒绝不安全代理提出的调度方案:在 7,160 个对抗性调度方案(其中 6,091 个不安全)中,它实现了零误接受,并接受了全部 360 个真实降级方案。同一源代码从单一代码库针对 sm_80/sm_90/sm_120 进行重定位,为 10 个受支持模型中的 10 个自动生成正确的 megakernels(巨内核),并在真实的 SmolLM2-135M 检查点上逐令牌复现了 HuggingFace 贪婪解码(困惑度匹配 2.5e-7)。一个无人值守、可由代理驱动的自动研究循环使其 megakernels 相对于自身基线进行自我改进(1.25-1.72 倍)。搜索发现的 int8 (W8A16) megakernels 在 NVIDIA 数据中心推理集群的 batch-1 解码中超越了基于 CUDA Graph 的 cuBLAS bf16:L4 最高达 1.33 倍,当前代 L40S 为 1.25-1.27 倍,A10G 在规模上最高达 1.08 倍,消费级 RTX 5090 为 1.19-1.23 倍。该排序并非带宽的简单函数(864 GB/s 的 L40S 优于 600 GB/s 的 A10G);区别在于推理类与训练类。AMK 在高带宽训练类的 A100/H100 上落后于 cuBLAS,其中 harness(测试框架)将跨 SM 同步瓶颈局部化;我们如实报告了差距。这是在解码位置 0 进行的精度不对称比较(W8A16 vs bf16);最大的真实检查点是 TinyLlama-1.1B。代码和 harness:https://github.com/RightNow-AI/AutoMegaKernel
Abstract
AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_120 from one codebase, auto-generates correct megakernels for 10 of 10 supported models, and on a real SmolLM2-135M checkpoint reproduces HuggingFace greedy decode token-for-token (perplexity match 2.5e-7). An unattended, agent-drivable autoresearch loop self-improves the megakernel over its own baseline (1.25-1.72x). A search-found int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA's datacenter inference fleet: L4 up to 1.33x, the current-gen L40S 1.25-1.27x, A10G up to 1.08x at scale, and the consumer RTX 5090 1.19-1.23x. The ordering is not a clean function of bandwidth (the 864 GB/s L40S beats the 600 GB/s A10G); the divide is inference-class vs training-class. AMK trails cuBLAS on the high-bandwidth training-class A100/H100, where the harness localizes the cross-SM-sync bottleneck; we report the gap plainly. This is a precision-asymmetric (W8A16 vs bf16) comparison at decode position 0; the largest real checkpoint is TinyLlama-1.1B. Code and the harness: https://github.com/RightNow-AI/AutoMegaKernel
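A static deadlock check of the kind mentioned above can be illustrated with the simplest case: if the "waits on" relation between scheduled tasks contains a cycle, some task can never start. The sketch below runs Kahn's topological sort over a hypothetical task graph. AMK's actual schedule IR and validator rules are not given in the abstract, so the data layout and the function name are assumptions.

```python
from collections import deque


def is_deadlock_free(waits_on):
    """Return True if the wait graph is acyclic.

    waits_on maps a task name to the tasks it must wait for (hypothetical layout).
    """
    tasks = set(waits_on)
    for deps in waits_on.values():
        tasks.update(deps)

    # Number of unfinished dependencies per task, and the reverse edges.
    pending = {t: len(set(waits_on.get(t, ()))) for t in tasks}
    dependents = {t: [] for t in tasks}
    for task, deps in waits_on.items():
        for dep in set(deps):
            dependents[dep].append(task)

    ready = deque(t for t, count in pending.items() if count == 0)
    finished = 0
    while ready:
        task = ready.popleft()
        finished += 1
        for waiter in dependents[task]:
            pending[waiter] -= 1
            if pending[waiter] == 0:
                ready.append(waiter)

    # Any task left unfinished sits on, or behind, a dependency cycle.
    return finished == len(tasks)


safe_schedule = {"attention": ["qkv_proj"], "mlp": ["attention"]}
unsafe_schedule = {"attention": ["mlp"], "mlp": ["attention"]}
```

A real persistent-kernel validator would also need to account for limited on-chip resources and race conditions on shared buffers; this sketch covers only the cyclic-wait case.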
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
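评分理由: The paper studies compiler optimization and CUDA kernel synthesis for large language models (LLMs), aiming to improve inference efficiency through static checks and automatic synthesis. The keywords emphasize multimodal learning (MLLM, MultiModal, Visual Encoder), world models, and reinforcement learning (World Models, model-based RL), a markedly different domain from the paper's topic (LLM compilation and inference acceleration). Although the paper mentions an 'agent' in its search loop and 'token'-level accuracy matching, it does not involve visual encoders, unified multimodal architectures, or the core of model-based reinforcement learning, so the relevance scores are low.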
关键词
AutoMegaKernel, CUDA Kernel, Llama-family, Static Verification, Inference Optimization, Megakernel Synthesis, GPU Retargeting
摘要翻译
动机:基于 Transformer 的模型越来越多地应用于大规模单细胞转录组学,通过在数百万个细胞上进行自监督学习展现出强大性能。然而,大多数现有方法将基因视为独立特征,在很大程度上忽略了先验生物学知识,这限制了可解释性和鲁棒性。本文探讨了显式纳入基因调控信息是否能同时提升模型性能和生物学洞见。结果:我们提出了 scTransformer,这是首个将生物机制的先验知识构建到模型注意力模式中的基于 Transformer 的方法。通过根据已知的调控结构约束信息流,模型学习到更具生物学意义的表示。我们在一个与疾病相关的单核 RNA-seq 数据集上,利用监督细胞类型分类对 scTransformer 进行了评估。与标准 Transformer 相比,我们的方法提高了分类准确率,增强了嵌入空间中细胞类型的分离度,并产生了与已知调控程序一致的注意力模式。总体而言,我们的结果表明,将生物结构嵌入 Transformer 模型可以在不牺牲性能的情况下提升可解释性,为单细胞组学领域基于生物学的基座模型 (foundation models) 迈出了稳健的一步。
Abstract
Motivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowledge, which limits interpretability and robustness. In this paper, we explore whether explicitly incorporating gene regulatory information can improve both model performance and biological insight. Results: We present scTransformer, the first Transformer-based approach that builds a priori knowledge of biological mechanisms into the model's attention patterns. By constraining information flow according to known regulatory structures, the model learns representations that are more biologically meaningful. We evaluate scTransformer on a disease-relevant single-nucleus RNA-seq dataset using supervised cell-type classification. Compared to standard Transformers, our approach improves classification accuracy, enhances separation of cell types in embedding space, and produces attention patterns consistent with known regulatory programs. Overall, our results demonstrate that embedding biological structure into Transformer models can enhance interpretability without sacrificing performance, offering a principled step toward biologically grounded foundation models for single-cell omics.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
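评分理由: The paper belongs to bioinformatics (single-cell RNA-seq), whereas the keywords concern general AI (multimodality, world models, reinforcement learning). Apart from weak links to Unify Models (the paper mentions foundation models) and Tokenizer (a mechanism implicit in Transformers), the remaining keywords (Visual Encoder, World Models, MLLM, MultiModal, model-based RL) are unrelated. No specified expert authors were found. The weighted total is far below the pass mark.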
关键词
scRNA-seq analysis, Transformer-based models, Gene regulatory priors, Attention patterns, Single-cell transcriptomics, Biological interpretability, Cell-type classification
摘要翻译
单分子力谱(SMFS)为生物分子力学提供了前所未有的洞察,然而力 - 延伸轨迹的高通量生成造成了严重的数据整理瓶颈。在数千条噪声主导的曲线中识别罕见的分子解离事件,传统上依赖于繁琐且不可扩展的人工审核。在此,我们提出一种系统无关的、可解释的深度学习框架,专门用于克服自动 SMFS 筛选中的极端类别不平衡问题。利用一维到二维光栅化几何矩阵,我们部署了一种修改的 ResNet18 架构,其优化目标为非对称 Focal Loss 损失函数。我们在 R. champanellensis 纤维素体的复杂机械展开路径上评估了该框架。在超不平衡测试条件下,目标相互作用仅占数据集的 1.34%(970 条轨迹中仅有 13 个真实事件),该模型实现了 0.9196 的总体准确率和高达 0.9231 的真阳性率(召回率)。通过实施经验校准的双阈值筛选系统,该流程自动丢弃了 880 条明确的背景噪声轨迹,将人工整理工作量减少了 90% 以上,同时安全保留了高价值的稀有数据。最后,梯度加权类激活映射(Grad-CAM)从视觉上验证了网络的决策牢固地锚定于力曲线的相关几何特征,具体定位在结构解离区域,有效缓解了人们对“黑箱”模型的质疑。该工具专为免费云端执行而构建,其开源特性使得可扩展且高精度的分子发现得以在生物物理界普及。
Abstract
Single-Molecule Force Spectroscopy (SMFS) provides unprecedented insights into biomolecular mechanics, yet the high-throughput generation of force-extension trajectories creates a severe data curation bottleneck. Identifying rare molecular unbinding events within thousands of noise-dominated curves traditionally relies on tedious, non-scalable manual auditing. Here, we present a system-agnostic, interpretable deep learning framework tailored to overcome extreme class imbalance in automated SMFS triage. Utilizing 1D-to-2D rasterized geometric matrices, we deployed a modified ResNet18 architecture governed by an asymmetric Focal Loss objective function. We evaluated this framework on the complex mechanical unfolding pathways of the R. champanellensis cellulosome. Under hyper-imbalanced test conditions where the target interaction constituted only 1.34% of the dataset (13 true events out of 970 traces), the model achieved an overall accuracy of 0.9196 and a remarkable True Positive Rate (Recall) of 0.9231. By implementing an empirically calibrated dual-threshold triage system, the pipeline automatically discarded 880 unambiguous background noise traces, reducing the manual curation workload by over 90% while safely preserving high-value rare data. Finally, Gradient-weighted Class Activation Mapping (Grad-CAM) visually validated that the network's decisions are firmly anchored in the relevant geometric features of the force curves, specifically localizing on the structural unbinding regions, effectively mitigating 'black-box' skepticism. Built for free cloud-based execution, this open-source tool democratizes scalable, highly precise molecular discovery across the biophysics community.
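The two mechanisms named above, an asymmetric focal loss and a dual-threshold triage, can be sketched as follows. The exact loss form, focusing parameters, and thresholds are not given in the abstract, so `gamma_pos`, `gamma_neg`, `t_low`, and `t_high` are hypothetical. The closing lines check the reported figures arithmetically; a recall of 0.9231 is what 12 detected events out of 13 would give.

```python
import math


def asymmetric_focal_loss(p, y, gamma_pos=1.0, gamma_neg=4.0, eps=1e-7):
    """Binary focal loss with separate focusing exponents for the two classes.

    p is the predicted probability of the rare (positive) class, y is 0 or 1.
    A larger gamma_neg down-weights the many easy background traces more strongly.
    """
    p = min(max(p, eps), 1.0 - eps)
    if y == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    return -(p ** gamma_neg) * math.log(1.0 - p)


def triage(p, t_low=0.1, t_high=0.9):
    """Dual-threshold routing: discard clear noise, keep clear events, review the rest."""
    if p < t_low:
        return "discard"
    if p > t_high:
        return "keep"
    return "review"


# Consistency checks on the figures reported in the abstract.
positive_share_pct = round(13 / 970 * 100, 2)   # share of true events in the test set
recall_if_12_found = round(12 / 13, 4)          # recall if 12 of the 13 events are found
discarded_share = 880 / 970                     # fraction of traces removed automatically
```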
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on biophysics data analysis using CNNs (ResNet18) for imbalanced classification, which is unrelated to the provided keywords concerning Multimodal Large Language Models (MLLM), World Models, or Reinforcement Learning. Only minor technical overlap exists with 'Visual Encoder' (CNN usage) and 'Unify Models' (system-agnostic approach). No expert authors from the specified list are present in the authorship.
关键词
Single-Molecule Force Spectroscopy, Deep Learning Framework, Rare Event Discovery, Class Imbalance, ResNet18, Focal Loss, Force-Extension Trajectories, Grad-CAM
摘要翻译
时空图神经网络(STGNNs)已成为交通预测的主流方法,然而其计算需求给智能交通系统(ITS)的实际部署带来了挑战。尽管近期研究提出了 STGNNs 的高效替代方案,但一个根本性问题仍未得到探索:这些架构本身是否过参数化?本文利用时空图卷积网络(STGCN)来探究这一问题,该模型是该领域应用最广泛的模型之一。通过在四个多样化的交通数据集上进行系统实验,我们比较了单块、双块(标准)和三块 STGCN 变体。研究发现,单块架构在四个数据集中的三个上实现了短期预测(10 分钟)的最优性能,而在更长的预测时间范围内仅产生微小的性能下降(相对误差 $\leq$1.8%)。至关重要的是,双块变体相对于单块变体,CPU 推理延迟增加了 61%,吞吐量降低了 37% —— 这对于资源受限的 ITS 部署而言是显著的开销。三块架构并未提供有利的权衡,计算成本增加一倍以上,却仅带来不到 0.5% 的相对性能提升。这些结果表明,默认的双块 STGCN 在许多应用中可能存在过参数化问题,这对部署交通预测系统的从业者以及评估效率导向方法的研究者均具有重要意义。
Abstract
Spatio-temporal graph neural networks (STGNNs) have become the dominant approach for traffic prediction, yet their computational requirements pose challenges for practical deployment in intelligent transportation systems (ITS). While recent work has proposed efficient alternatives to STGNNs, a fundamental question remains unexplored: are these architectures themselves over-parameterised? We examine this question using the Spatio-Temporal Graph Convolutional Network (STGCN), one of the most widely adopted models in this domain. Through systematic experiments across four diverse traffic datasets, we compare 1-block, 2-block (standard), and 3-block STGCN variants. Our findings reveal that the single-block architecture achieves optimal performance for short-term prediction (10 mins) on three of four datasets, while incurring only marginal degradation ($\leq$1.8% relative error) at longer horizons. Crucially, the 2-block variant incurs 61% higher CPU inference latency and 37% lower throughput relative to 1-block -- substantial overhead for resource-constrained ITS deployment. The 3-block architecture offers no favourable tradeoff, more than doubling computational cost for $<$0.5% relative improvement. These results suggest that the default 2-block STGCN may be over-parameterised for many applications, with implications for both practitioners deploying traffic prediction systems and researchers benchmarking efficiency-focused methods.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on traffic prediction using STGCN and efficiency analysis. The provided keywords relate to MLLMs, World Models, and RL, which are unrelated to this work. Relevance is minimal, with only slight connections to model structure ('Unify Models') and data type ('MultiModal').
Keywords
Spatio-temporal graph neural networks, STGCN, Traffic prediction, Architectural depth, Inference latency, Over-parameterised, Computational requirements
Abstract
Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
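Scoring rationale: The paper's core topic is differential privacy protection and privacy risk assessment in large language model (LLM) adaptation, which is largely unrelated to the areas the keywords cover, such as unified model architectures, visual encoders, world models, and reinforcement learning. Only because the work involves LLM base models do Unify Models, Tokenizer, and MLLM receive very low scores; the remaining keywords are entirely unrelated.
Keywords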
Differential Privacy, Large Language Models, Privacy Protection, Adaptation, Membership Inference, Fine-tuning, Privacy Risk
Abstract
Understanding tactical organisation of association football, hereafter referred to as football, requires identifying distinct match phases. Yet in-possession phases are rarely directly observable and are shaped by evolving tactical intentions, rather than spatial patterns alone. This study proposes a data-driven framework for identifying in-possession match phases from spatiotemporal tracking data. Seven German Bundesliga matches recorded at 25 Hz with TRACAB were analysed. A hierarchical phase model was defined with three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases (Build Up, Progression, Counter Attack, Maintenance, Sustained Threat, Finishing). A Temporal Graph Attention Network (T-GAN) was developed to combine frame-level player-interaction graphs, contextual features, and Transformer-based temporal modelling. Performance was evaluated using frame-level F1 and a sequence-aware Intersection over Truth-Dominance (IoT-D) metric. T-GAN achieved macro-average frame-level F1 scores of 0.87 at the intention level, 0.76 for invasion-related phases, and 0.79 for scoring phases. At the sequence level, mean diagonal IoT-D F1 increased from 0.68 to 0.79 for intentions and from 0.61 to 0.71 for phases after post-processing, indicating improved temporal coherence. Model comparisons showed that sequence modelling was the main driver of segmentation quality, while graph-based relational modelling was particularly beneficial for Counter Attack recognition. Exploratory player attention analysis further suggested that wide and midfield positional groups contributed strongly to phase discrimination. Overall, the framework translates continuous tracking data into tactically interpretable in-possession phase representations, with potential applications in automated match annotation, tactical analysis, and playing-style profiling.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
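Scoring rationale: The paper belongs to sports data analytics and uses a Temporal Graph Attention Network (T-GAN) to process football tracking data and identify match phases. The provided keyword set focuses mainly on multimodal large models, representation learning, and reinforcement learning. The paper does not involve language models, tokenizers, visual encoders (for images), world-model generation, or reinforcement learning algorithms; it has only a weak connection to 'Unify Models' and 'MultiModal' at the data-modelling level (spatiotemporal data), so the relevance scores are very low.
Keywords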
Temporal Graph Learning, In-Possession Match Phases, Association Football, Tactical Intentions, Tracking Data, Temporal Graph Attention Network, Spatiotemporal Analysis, Phase Segmentation
Abstract
Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from a severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitation of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process, enabling continuous strategy adaptation and global alignment without additional assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
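Scoring rationale: The paper focuses on long-form text generation and a chain-of-thought framework (IS-CoT), mainly addressing length collapse in LLMs. It does not involve multimodal processing (no visual encoder, no MLLM, no multimodal data), reinforcement learning (no model-based RL), or world-model construction. It has only a weak connection to 'Unify Models' (unifying the planning and generation process) and none to 'Tokenizer'. The paper's topic therefore has very low relevance to the given multimodal/reinforcement-learning keyword set.
Keywords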
Long-form Generation, Interleaved Structural Thinking, Chain-of-Thought, Plan-Write-Reflect, Length Collapse, IS-Writer-8B, Multi-teacher Pipeline, LLM
Abstract
Classification tasks require annotated data, which can often be expensive, time-consuming, or even unfeasible to collect. This is the case of the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an approach to enhance decoder-only performance on classification tasks. We build on common self-learning approaches by leveraging a model to create training examples from unlabeled data to propose a novel relevance-guided masking strategy. We use relevance attribution methods to determine what portions of unannotated texts are relevant for a task. We then create self-supervised training examples by masking out those portions, training the model to reconstruct them via next-token-prediction. We hypothesize that those examples convey knowledge about the structure and semantics of unannotated data that can be useful for downstream performance. We test our approach on 136 tasks from a collection of 1.9M clinical notes from an Italian hospital. We quantify DecSelfMask's impact on downstream tasks on 5 models of different scales and families, including a probing analysis. Experiments show consistent gains, outperforming standard supervised fine-tuning approaches (+19.9 points in Macro F1), synthetic label generation (+12.5), and continual pretraining (+6.3), as well as common baselines.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on text classification using self-supervised masking on unlabeled text data with decoder-only models. It does not involve visual encoders, world models, multi-modal integration, or reinforcement learning, resulting in 0 scores for these keywords. Tokenization is implicitly used for next-token prediction but is not a core contribution (2.0). Unify Models is not addressed as the paper targets a specific classification task rather than model unification (1.0). The total weighted score is 4.5, well below the dynamic passing score of 27.8. None of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list, so no expert bonus was applied.
Keywords
Decoder-Only Classification, Self-Relevance-Guided Masking, Unlabeled Text, Next-Token-Prediction, Self-Supervised Learning, Clinical Notes, Relevance Attribution, Text Classification
Abstract
Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a model can score near-perfect faithfulness by saying almost nothing. We make this measurable using Formula 1 telemetry, a domain where strategic ground truth is derived deterministically and, crucially, completely: for each decision we know the full set of facts that mattered. This completeness -- absent in open-domain faithfulness benchmarks -- lets us measure recall (coverage of the relevant facts) exactly, alongside precision. On a multilingual (EN/ES/PT) benchmark of 7,253 decision instances spanning 150 races, the most precise frontier model covers under half of the relevant facts and ranks last by F1, so requiring coverage reorders the systems; the same effect reappears in a second complete-oracle domain (NOAA weather forecasts). A prompt ablation shows the low coverage is not an under-prompting artifact: explicitly asking models to be thorough does not close the gap. We pair faithfulness with coverage into a single score, validate the metric (controlled perturbation; agreement across a model-free regex extractor and a cross-family LLM extractor, system-level Spearman 1.0), and give a verifier-guided generation method that improves precision and recall without references. We release the benchmark, structured annotations, metric, baselines, and an interactive demo.
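To make the precision/recall distinction concrete, here is a minimal sketch that treats stated claims and ground-truth facts as plain sets. The fact names and the two example systems are hypothetical, and the convention that an empty claim set has precision 1.0 is an assumption of this sketch rather than something the abstract specifies. The paper's combined score may also be defined differently from the F1 used here.

```python
def precision_recall_f1(stated, relevant):
    """Score a set of stated claims against a complete set of relevant facts."""
    supported = stated & relevant
    # Assumption: an empty answer makes no unsupported claim, so precision is 1.0.
    precision = len(supported) / len(stated) if stated else 1.0
    recall = len(supported) / len(relevant) if relevant else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical complete oracle for one decision instance.
relevant_facts = {"tyre_degradation", "gap_to_car_ahead", "safety_car", "pit_window"}

# A cautious system states one safe fact; a thorough one states four, one unsupported.
cautious = precision_recall_f1({"tyre_degradation"}, relevant_facts)
thorough = precision_recall_f1(
    {"tyre_degradation", "gap_to_car_ahead", "pit_window", "rain_expected"},
    relevant_facts,
)

print("cautious (precision, recall, F1):", cautious)
print("thorough (precision, recall, F1):", thorough)
```

The cautious system wins on precision alone but falls behind once recall enters the score, which is the kind of reordering the abstract describes.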
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
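Scoring rationale: The paper focuses on evaluation metrics for grounded generation (precision and coverage) and does not involve model architecture components (Tokenizer, Visual Encoder), unification strategies (Unify Models), world models (World Models), or reinforcement learning (model-based RL). Although it involves LLMs and grounded generation (the link to MLLM), its core contribution is a measurement method rather than a model architecture, so relevance is low. The author list does not include any of the designated experts.
Keywords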
Faithfulness Metrics, Grounded Generation, Precision, Coverage, Complete Oracle, Verifier-guided Generation, Formula 1, Multilingual
Abstract
Large Language Models fail at implicit multi-hop reasoning: a model answers "When was $X$ born?" and "Who is $Y$'s closest friend?" correctly but fails on "When was $Y$'s closest friend born?" in a single forward pass, even when both facts are perfectly memorized and individually retrievable. We study this failure in a controlled natural language setting with a strict separation between individuals exposed to compositional contexts during pretraining and those that never appear in any such context. We confirm that compositional failure persists even at 97% 1-hop accuracy, establishing the gap as a pretraining failure rather than a knowledge absence. We propose and test nine data-centric augmentation formats and find that compositional pretraining transfers to unseen questions for exposed individuals, but never to individuals absent from compositional pretraining, suggesting that exposure to compositional contexts during pretraining is a necessary condition for implicit multi-hop reasoning.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
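Scoring rationale: The paper focuses on implicit multi-hop reasoning in large language models (LLMs) and on how pretraining exposure affects compositional ability; it is text-only NLP research. The provided keywords mainly concern multimodality, world models, and reinforcement learning (e.g., Visual Encoder, World Models, model-based RL) and are a poor match for the paper. Apart from a weak connection to the LLM-related keywords (MLLM, Tokenizer), the remaining keywords have very low relevance, and the weighted total is far below the dynamic passing score.
Keywords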
Multi-Hop Knowledge Composition, Pretraining Exposure, Implicit Multi-Hop Reasoning, Compositional Generalization, Large Language Models, Data-Centric Augmentation, Compositional Contexts
Abstract
We describe our system for the SoccerNet 2026 Player-Centric Ball-Action Spotting Challenge, which requires predicting who performs which action and when, across eight classes in broadcast soccer. Building on the three FOOTPASS baselines [1] (TAAD, TAAD+GNN, and TAAD+DST), we contribute four extensions: (1) gradient checkpointing to enable full-backbone fine-tuning on a single GPU; (2) fusion of GNN logits into the DST encoder, combining graph-based tactical context with per-player visual features; (3) square-root frequency class weighting to address the 213:1 pass-to-tackle imbalance in the training data; and (4) a post-processing pipeline comprising per-class logit gating, temporal frame refinement, jersey re-assignment, and a two-model ensemble. Our system achieves 0.548 Macro F1 on the test set and 0.446 on the challenge set (server evaluation).
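As an illustration of extension (3), the sketch below computes class weights proportional to the inverse square root of class frequency. The class counts are hypothetical values chosen only to reproduce the 213:1 pass-to-tackle ratio, only two of the eight classes are shown, and normalising the weights to a mean of 1 is an assumption; the abstract does not state the exact formula the authors use.

```python
import math


def sqrt_frequency_weights(class_counts):
    """Return weights proportional to 1/sqrt(count), normalised to a mean of 1."""
    raw = {name: 1.0 / math.sqrt(count) for name, count in class_counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {name: value / mean_raw for name, value in raw.items()}


# Hypothetical counts at the 213:1 ratio quoted in the abstract.
counts = {"pass": 213_000, "tackle": 1_000}
weights = sqrt_frequency_weights(counts)

# Square-root weighting compresses the 213:1 imbalance to sqrt(213):1.
ratio = weights["tackle"] / weights["pass"]
print(f"tackle/pass weight ratio: {ratio:.2f}")
```

Plain inverse-frequency weighting would weight a tackle 213 times more than a pass; the square root softens that to roughly 14.6 times, which tends to be less destabilising for rare classes.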
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on computer vision and sports analytics (player-centric ball-action spotting in soccer videos), whereas the provided keywords pertain to generative AI, large language models, and reinforcement learning (e.g., Tokenizer, MLLM, World Models, model-based RL). There is a significant domain mismatch. 'Visual Encoder' has minimal relevance as the paper utilizes existing baselines rather than proposing encoder architectures. 'MultiModal' is weakly related due to video input and graph-based context but lacks explicit cross-modal fusion discussion. Consequently, most keywords receive a score of 0.
Keywords
Player-Centric Ball-Action Spotting, FOOTPASS Baselines, Gradient Checkpointing, GNN Logits Fusion, Post-Processing Pipeline, Class Weighting, Macro F1 Score
Abstract
With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time performance and robustness. Existing methods, however, often involve high computational costs and rely heavily on abundant feature matches, limiting their applicability in time-sensitive driving scenarios. To address these limitations, this paper introduces a unified framework for efficient relative pose estimation, built upon a novel translation parameterization and first-order rotation approximation. Within this framework, we propose three efficient minimal solvers specifically designed for autonomous vehicles. The first solver integrates the vertical direction prior from Inertial Measurement Units (IMUs), the second utilizes the rotation axis direction prior during steering maneuvers, and the third is designed for planar motion - a realistic assumption for ground vehicles operating on structured roads. By reducing both the minimal number of point correspondences and the algebraic complexity, our methods enable faster hypothesis generation within RANSAC-based pipelines, improving suitability for real-time systems. Extensive experiments on synthetic datasets and the KITTI autonomous driving benchmark demonstrate that the proposed solvers achieve a favorable balance between speed and accuracy compared to existing state-of-the-art algorithms.
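The first-order rotation approximation mentioned in the abstract is commonly written as R ≈ I + [r]×, where [r]× is the skew-symmetric matrix of a small rotation vector r. The sketch below compares that approximation with the exact Rodrigues rotation; it illustrates the general approximation only and does not reproduce the paper's translation parameterization or its solvers. All function names are hypothetical.

```python
import math


def skew(r):
    """Return the 3x3 cross-product (skew-symmetric) matrix of vector r."""
    x, y, z = r
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def first_order_rotation(r):
    """First-order approximation I + [r]x, reasonable only for small rotations."""
    k = skew(r)
    return [[(1.0 if i == j else 0.0) + k[i][j] for j in range(3)] for i in range(3)]


def exact_rotation(r):
    """Exact rotation matrix for rotation vector r via Rodrigues' formula."""
    theta = math.sqrt(sum(c * c for c in r))
    if theta == 0.0:
        return [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    k = skew([c / theta for c in r])
    k2 = matmul(k, k)
    return [
        [(1.0 if i == j else 0.0)
         + math.sin(theta) * k[i][j]
         + (1.0 - math.cos(theta)) * k2[i][j]
         for j in range(3)]
        for i in range(3)
    ]


def max_abs_error(r):
    """Largest entry-wise gap between the approximation and the exact rotation."""
    approx, exact = first_order_rotation(r), exact_rotation(r)
    return max(abs(approx[i][j] - exact[i][j]) for i in range(3) for j in range(3))


err_small = max_abs_error((0.0, 0.0, 0.02))  # about 1.1 degrees of yaw
err_large = max_abs_error((0.0, 0.0, 0.5))   # about 28.6 degrees of yaw
print(f"small rotation error: {err_small:.2e}, large rotation error: {err_large:.2e}")
```

The error grows roughly with the square of the angle, which is why the approximation suits the small inter-frame rotations typical of consecutive camera frames.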
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
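Scoring rationale: The paper's topic is relative pose estimation in computer vision for autonomous driving, mainly involving geometric solvers and RANSAC pipelines, and is largely unrelated to the keyword set's areas (multimodal large models, world models, reinforcement learning). 'Unify Models' receives a low score (2.0) only because the paper mentions a 'unified framework'; 'MultiModal' receives a very low score (1.0) because of the multi-camera system; the other five keywords (Tokenizer, Visual Encoder, World Models, MLLM, model-based RL) are entirely unrelated and receive 0.0. Weighted total = (2.0 + 1.0) * 1.5 = 4.5, far below the dynamic passing score of 27.8. The author list does not include any of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no bonus is applied.
Keywords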
Relative Pose Estimation, Autonomous Driving, Efficient Minimal Solvers, Multi-camera Systems, RANSAC-based Pipelines, Translation Parameterization, First-order Rotation Approximation
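The scoring table and rationale for this entry follow a simple rule: each keyword's score is its weight times its relevance, and the weighted total is compared with a passing score. The sketch below reproduces that arithmetic using the values from the table. How the report derives its "dynamic" passing score is not stated, so 27.8 is treated as a fixed constant here, and the `>=` comparison is an assumption.

```python
WEIGHT = 1.5        # every keyword in the table carries the same weight
PASS_SCORE = 27.8   # quoted in the rationale; its derivation is not given

# Relevance values (out of 10) copied from the scoring table for this paper.
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 1.0,
    "model-based RL": 0.0,
}

scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())
passes = total >= PASS_SCORE  # assumption: passing means reaching the threshold

print(f"weighted total = {total}, passes = {passes}")
```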
Abstract
Conventional one-hot encodings often yield poorly calibrated models, being overconfident under attack, and letting entropy-based detection algorithms fail. Previous image classification works have demonstrated that Hadamard-coded output representations can improve adversarial robustness. However, attempts to integrate Hadamard codes into semantic segmentation fall far behind state-of-the-art models in mean intersection-over-union performance. Regarding object detection, such output encodings have not yet been investigated at all. Further, no prior art addressed intrinsic codeword inconsistencies or actually exploited intrinsic codeword redundancy. Accordingly, we first derive a novel decoding procedure for Hadamard codewords towards optimal class-wise probabilities, solving the underlying optimization problem by using the projection onto the probability simplex. Second, our optimization delivers a measure of prediction inconsistency. Third, we are the first to show how to exploit these inconsistencies for adversarial attack and disturbance detection. Fourth, we introduce HadamardNet, a framework employing Hadamard codes as output representations for semantic segmentation and object detection models and tasks. We conduct a comprehensive evaluation both on disturbances and adversarial attacks, achieving state-of-the-art perturbation detection performance for both tasks in only a single detection pass, while delivering equivalent or close-by reference performance on clean data.
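The decoding step in the abstract relies on projection onto the probability simplex. The sketch below shows the standard sort-and-threshold algorithm for the Euclidean projection of an arbitrary score vector onto the simplex. It covers only that projection; the paper's Hadamard decoding objective and its inconsistency measure are not given in the abstract and are not reproduced here.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        # The condition holds for a prefix of the sorted entries;
        # the last index that satisfies it fixes the threshold tau.
        if u_j - candidate > 0:
            tau = candidate
    return [max(x - tau, 0.0) for x in v]


# Hypothetical raw class scores that are not yet a valid distribution.
raw_scores = [1.2, -0.3, 0.4]
probabilities = project_to_simplex(raw_scores)
print(probabilities, sum(probabilities))
```

Unlike a softmax, the projection can assign exactly zero probability to a class, and a vector that already lies on the simplex is returned unchanged.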
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
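Scoring rationale: The paper focuses on object detection and semantic segmentation in computer vision, using Hadamard-coded output representations to improve adversarial robustness. It is largely unrelated to the given keywords: it does not involve large language models (MLLM), world models, reinforcement learning, or tokenizers; it involves vision tasks (Visual Encoder), but its core is the output encoding rather than the encoder architecture; it unifies detection and segmentation tasks (Unify Models), but not in the model-unification paradigm the keyword refers to; and vision tasks are generally considered single-modality (MultiModal). Most keywords therefore score 0 or very low. The author list does not include any of the designated experts, so no bonus is applied.
Keywords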
Adversarial Attack, Disturbance Detection, Hadamard-Coded Output Representations, Object Detection, Semantic Segmentation, HadamardNet, Prediction Inconsistency
Abstract
Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a deployment-friendly mitigation strategy that augments a 3D object detector with aleatoric uncertainty estimates and applies a two-sample z-test over short observation windows to separate true motion from jitter. Integrated into Autoware with minimal changes, the approach reuses existing data association for minimal compute overhead. Empirical results show parity with velocity thresholding on nuScenes, but substantially fewer false dynamic predictions and unnecessary stops in real-world test drives, explained by the presence of an intermediate jitter band in the recorded data that speed-only rules misclassify. This demonstrates that uncertainty-aware detection and lightweight statistical testing can deliver practical performance gains for autonomous driving in noisier real-world settings.
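A two-sample z-test of the kind the abstract mentions can be sketched as follows: compare the mean object position in an earlier and a later observation window, using the detector's predicted standard deviation as the known measurement noise. The tested quantity, window length, shared per-window sigma, and significance level are all assumptions of this sketch; the abstract does not specify them.

```python
import math
from statistics import NormalDist, fmean


def two_sample_z(early, late, sigma_early, sigma_late):
    """Return (z, two-sided p-value) for a difference in window means."""
    standard_error = math.sqrt(sigma_early**2 / len(early) + sigma_late**2 / len(late))
    z = (fmean(late) - fmean(early)) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


def is_moving(early, late, sigma_early, sigma_late, alpha=0.05):
    """Flag an object as dynamic only if the mean shift is statistically significant."""
    _, p_value = two_sample_z(early, late, sigma_early, sigma_late)
    return p_value < alpha


# Hypothetical longitudinal positions in metres, with 5 cm predicted noise.
parked_early = [10.02, 9.97, 10.05, 9.99]
parked_late = [10.01, 10.04, 9.96, 10.03]
driving_late = [10.40, 10.52, 10.61, 10.70]

print("parked car flagged as moving:", is_moving(parked_early, parked_late, 0.05, 0.05))
print("driving car flagged as moving:", is_moving(parked_early, driving_late, 0.05, 0.05))
```

Because the test scales the position shift by the predicted uncertainty, frame-to-frame jitter in a noisy bounding box does not register as motion, whereas a fixed velocity threshold could misclassify it.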
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
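Scoring rationale: The paper belongs to autonomous-driving perception; its core contribution is uncertainty modelling for LiDAR detection and statistical hypothesis testing. The given keyword set (Unify Models, Tokenizer, MLLM, World Models, model-based RL) focuses mainly on multimodal large models and reinforcement-learning frameworks, which does not match the paper's conventional deep-learning perception approach. Therefore, apart from 'Visual Encoder' and 'MultiModal', which receive weak relevance scores because basic modality processing is involved, the remaining keywords have a relevance of 0.
Keywords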
LiDAR Object Detection, Uncertainty-Aware, Motion Classification, Autonomous Driving, Perception Jitter, Aleatoric Uncertainty, Two-sample z-test
Abstract
Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets contain heterogeneous features, combining continuous location coordinates with categorical variables such as action type, action outcome, and body part. Such data have been applied in sports analytics for match outcome forecasting, player evaluation, and tactical pattern recognition. However, existing approaches predominantly encode categorical features using one-hot or ordinal embedding representations, overlooking the intrinsic semantics of action descriptors. The Transformer is a deep neural network architecture based on self-attention that captures dependencies between input features at arbitrary positions. We propose and implement a Transformer-based model to learn latent dependencies among categorical event features and produce dense representations of football events. By encoding categorical features as learned embedding vectors, sport-specific action semantics are captured during pretraining, enabling the representations to support downstream tasks such as action value estimation and play style recognition. Empirical evaluation shows that the embedding representations yield superior probability calibration over task-specific baselines on the downstream prediction tasks, as measured by Brier score.
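The Brier score used for evaluation is the mean squared difference between predicted probabilities and observed binary outcomes; lower is better. The sketch below computes it for two hypothetical sets of predictions, for example the probability that an action leads to a goal. The data are invented for illustration and the binary form is an assumption, since the abstract does not say whether a multi-class variant is used.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical outcomes (1 = event happened) and two sets of predictions.
outcomes = [1, 0, 1, 0]
hedged = [0.8, 0.2, 0.7, 0.3]             # all four on the right side of 0.5
overconfident = [0.99, 0.01, 0.05, 0.95]  # two confident hits, two confident misses

hedged_score = brier_score(hedged, outcomes)
overconfident_score = brier_score(overconfident, outcomes)
print(f"hedged: {hedged_score:.4f}, overconfident: {overconfident_score:.4f}")
```

The score penalises confident mistakes heavily, which is why it is a common check on probability calibration.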
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
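Scoring rationale: The paper focuses on tabular representation learning for football event data, using the TabTransformer architecture to handle mixed features. Although TabTransformer involves tokenizing categorical features (Tokenizer) and handles multiple feature types (MultiModal), the paper has no direct connection to multimodal large models (MLLM), world models (World Models), visual encoders (Visual Encoder), or reinforcement learning (model-based RL), and it does not address unified models (Unify Models); overall relevance is very low.
Keywords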
Football Event Representation, TabTransformer, Dense Representations, Categorical Features, Action Semantics, Sports Analytics, Downstream Tasks
Abstract
As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and architectural debugging, yet these workflows often rely on fragile ad-hoc Python scripts. Here, we introduce BrainSurgery, a tool for robust and reproducible "tensor surgery" on neural network checkpoints, and provide a system demonstration covering four examples and three case studies from model upcycling to LoRA extraction. By abstracting storage formats and memory management, BrainSurgery executes complex transformations through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping through expressive regex and structural targeting, while built-in assertions validate tensor shapes, data types, and values to prevent silent errors. We envision that BrainSurgery will provide a strong foundation for future research through its reproducible and validated operations.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
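Scoring rationale: The paper introduces BrainSurgery, a tool for reproducible weight manipulation and model upcycling on neural network checkpoints. Although it involves model modification, it does not touch multimodal architectures (Tokenizer, Visual Encoder, MultiModal, MLLM), World Models, or model-based RL. 'Unify Models' is only loosely related, in the broad sense of unifying workflows, so relevance to most keywords is very low.
Keywords
BrainSurgery, Model Editing, Weight Manipulation, Reproducible, Declarative, Neural Network Checkpoints, Model Upcycling
Abstract (Chinese translation)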
分子系统涉及跨越多个空间尺度的相互作用,从局部配位和短程扰动到长程静电及溶剂介导效应。然而,大多数分子表征学习方法依赖于手动预设的尺度,而任务最优的建模尺度可能与这些固定水平并不一致。本研究提出了一种基于损失引导的自适应尺度优化框架,用于分子力预测,将预设尺度视为初始锚点,并通过插值、路由、可微尺度更新及尺度池优化来发现任务有效的分辨率。以 NaCl 水离子系统作为最小测试平台,本研究构建了短尺度和长程力预测分支,并分析了它们的互补性。Oracle 硬路由将整体力平均绝对误差(MAE)从 399.65 降低至 382.67,而连续 Oracle 插值进一步将其降低至 380.96。在最近离子距离低于 0.6 nm 的近距离接触区域中,近距离接触 MAE 从 327.22 降至 260.51。最小尺度池更新实验表明,从端点锚点 {0,1} 开始,损失引导的更新自动生成了中间尺度,并恢复了大部分连续 Oracle 性能。最终更新的尺度池 {0,0.125,0.25,0.375,0.5,0.75,1} 实现了整体 MAE 为 381.23。这些结果表明,自适应尺度优化是分子表征学习的一个有前景的方向,尤其是在固定尺度建模不足以胜任的情况下。
Abstract
Molecular systems involve interactions across multiple spatial scales, from local coordination and short-range perturbations to long-range electrostatic and solvent-mediated effects. However, most molecular representation learning methods rely on manually predefined scales, and the task-optimal modeling scale may not coincide with these fixed levels. This study introduces a loss-guided adaptive scale refinement framework for molecular force prediction, treating predefined scales as initial anchors and discovering task-effective resolutions through interpolation, routing, differentiable scale updates, and scale pool refinement. Using a NaCl aqueous ionic system as a minimal testbed, this study constructs short-scale and long-range force prediction branches and analyzes their complementarity. Oracle hard routing reduces the overall force MAE from 399.65 to 382.67, while continuous oracle interpolation further reduces it to 380.96. In close-contact regimes with nearest-ion distance below 0.6 nm, the close-contact MAE decreases from 327.22 to 260.51. A minimal scale pool update experiment shows that starting from endpoint anchors {0,1}, loss-guided updates automatically generate intermediate scales and recover most of the continuous oracle performance. The final updated scale pool {0,0.125,0.25,0.375,0.5,0.75,1} achieves an overall MAE of 381.23. These results support adaptive scale refinement as a promising direction for molecular representation learning, especially when fixed-scale modeling is insufficient.
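The three oracle variants in the abstract differ only in which blending weights are allowed per sample. The sketch below uses synthetic stand-ins for the two branches (not the paper's data or models) and assumes the blend is a linear interpolation with weight `alpha`, where 0 selects the short-scale branch and 1 the long-range branch; the abstract does not spell out the exact form.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two branch predictions (NOT the paper's data).
n = 1000
target = rng.normal(size=n)
pred_short = target + rng.normal(scale=1.0, size=n)  # short-scale branch
pred_long = target + rng.normal(scale=1.0, size=n)   # long-range branch


def mae(pred):
    return float(np.mean(np.abs(pred - target)))


def interpolate(alpha):
    """Blend the branches: alpha=0 is short-scale, alpha=1 is long-range."""
    return (1.0 - alpha) * pred_short + alpha * pred_long


def pool_oracle(pool):
    """Per sample, pick the alpha in `pool` with the smallest absolute error."""
    pool = np.asarray(pool, dtype=float)
    # blends[i, j] = prediction for sample i using pool[j]
    blends = ((1.0 - pool)[None, :] * pred_short[:, None]
              + pool[None, :] * pred_long[:, None])
    errors = np.abs(blends - target[:, None])
    return float(np.mean(errors.min(axis=1)))


def continuous_oracle():
    """Per sample, the best alpha in [0, 1], available in closed form."""
    diff = pred_long - pred_short
    safe = np.where(diff == 0.0, 1.0, diff)  # avoid division by zero
    alpha = np.clip((target - pred_short) / safe, 0.0, 1.0)
    return mae(interpolate(alpha))


hard = pool_oracle([0.0, 1.0])  # oracle hard routing: choose one branch
pool = pool_oracle([0, 0.125, 0.25, 0.375, 0.5, 0.75, 1])  # final pool from the abstract
cont = continuous_oracle()
print(mae(pred_short), mae(pred_long), hard, pool, cont)
```

Because each candidate set contains the previous one, the errors are ordered: continuous oracle ≤ scale-pool oracle ≤ hard routing ≤ the better single branch. This mirrors the ordering of the reported MAEs (380.96, 381.23, 382.67, 399.65).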
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
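Scoring rationale: The paper addresses molecular force prediction and adaptive scale refinement, which belongs to computational chemistry. The supplied keywords mainly concern multimodal large models, world models, and reinforcement learning, a poor match for the paper's domain. The paper does not involve Tokenizer, Visual Encoder, MLLM, World Models, or reinforcement learning. Only 'Unify Models' applies, in the broad sense of unifying scale strategies. The author list contains none of the specified experts, so no bonus points are added.
Keywords
Molecular Force Prediction, Adaptive Scale Refinement, Loss-Guided, Representation Learning, Spatial Scales, Differentiable Scale Updates, Interpolation, NaCl Aqueous System
Abstract (Chinese translation)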
对齐训练的愿景是让大型语言模型既安全又有用。其主要机制是基于人类反馈的强化学习(RLHF),它通过使模型与“人类价值观”对齐,来塑造部署语言模型的行为。然而,这一过程是不透明的。正在编码哪些价值观?这些价值观属于谁?以及 RLHF 是如何编码它们的?越来越多的证据表明,RLHF 仅产生功能性合规,而非深层对齐。本文通过比较 Llama 3.1 8B 在 RLHF 前后的内部表征,针对党派政治倾向中的这一现象提供了一个机制性案例研究。我们发现,RLHF 并未移除基础模型中结构化的党派方向。相反,它压缩了党派信号的方差,从而生成始终平衡且无党派倾向的输出。稀疏自编码器分解揭示,政策编码特征在基础模型中零星激活,但在指令模型(Instruct model)中完全失活。特征级引导实验证实了这种因果断开。因此,RLHF 编码了一种政治中立规范,并非通过抹除模型对党派倾向的认知,而是切断了从党派几何结构到输出生成的因果路径。重要的是,这种中立是功能性的,而非结构性的,使得支持党派引导的基础几何结构保持完整。绕过 RLHF 护栏的机制(例如推断并放大用户的党派身份),会重新激活党派倾向的生成。如果 RLHF 是通过断开而非移除价值负载结构来运作的,那么同样的模式可能也适用于其他价值领域,且对齐模型的行为可能比其输出所暗示的更为脆弱。
Abstract
The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded; whose values are they; and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
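Scoring rationale: The paper focuses on the RLHF alignment mechanism and the representation of partisan structure in LLMs, which is interpretability research on text-only language models. The supplied keywords mainly cover multimodality, world models, and model-based reinforcement learning, a strong mismatch with the paper's topic. It does not involve visual encoders, multimodal fusion, world-model construction, or model-based RL algorithms; it concerns RLHF rather than model-based RL and an LLM rather than an MLLM, and it does not address model unification. The author list contains none of the specified experts.
Keywords
RLHF, LLM Alignment, Partisan Structure, Internal Representations, Sparse Autoencoder, Functional Neutrality, Mechanistic Interpretability
Abstract (Chinese translation)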
创造力是一种复杂的认知能力,依赖于知识组织和从语义记忆 (semantic memory) 中的提取。然而,大多数研究仅使用单一任务来测量它,仅捕捉了这种复杂性的很小一部分。本研究探讨多重网络 (multiplex networks)——从六个认知任务中获得的分层语义网络——作为一种更全面的方法来建模创造力背后的关联知识。我们收集了来自四个国家(奥地利、美国、新加坡、意大利)的 N=518 个个体的数据。基于他们对言语流畅性、句子链、自由联想和叙事写作任务的响应,我们构建了语义网络并将它们组装成多重结构。基于 AI 人格的响应 (AI persona-based responses) 提供了比较基线。结构可约性分析 (Structural reducibility analyses) 表明,不同的任务层捕捉了关于语义组织的不同、非冗余信息,支持使用多个任务而非单一任务。高创造力组和低创造力组的网络在结构上保持显著差异,而 AI 生成的网络无论创造力组别如何都显示出几乎相同的结构。最后,我们使用岭回归 (ridge regression) 机器学习模型,利用 12 个特征(网络度量、情感评分和扩散激活模拟 (spreading activation simulations))来预测个体创造力分数。先前阶段识别出的结构相似层的组合,将概念验证预测准确率提高了 50%。结构度量显示了最高的特征重要性,扩散激活动力学提供了额外的预测能力。总之,这些发现表明,多重语义网络捕捉了创造力背后关联知识的更丰富、跨文化的图景。我们还发布了我们的多样化数据集和代码,以促进创造力社区内的多样化计算方法。
Abstract
Creativity is a complex cognitive ability that relies on knowledge organisation and retrieval from semantic memory. Yet most research uses a single task to measure it, capturing only a fraction of this complexity. This study investigates multiplex networks - layered semantic networks obtained from six cognitive tasks - as a more comprehensive approach to modelling the associative knowledge underlying creativity. We collected data from N=518 individuals from four countries (Austria, USA, Singapore, Italy). From their responses to verbal fluency, sentence-chain, free association, and narrative writing tasks, we constructed semantic networks and assembled them in a multiplex structure. AI persona-based responses provided a comparison baseline. Structural reducibility analyses showed that different task layers captured distinct, non-redundant information about semantic organisation, supporting the use of multiple tasks over any single one. The networks from high- and low-creative groups remained structurally distinct, while AI-generated networks showed near-identical structures regardless of creativity group. Finally, we used 12 features (network measures, emotional scores, and spreading activation simulations) in a machine learning model using ridge regression to predict individual creativity scores. The combination of structurally similar layers, as identified in the previous stage, improved a proof-of-concept prediction accuracy by 50%. Structural measures showed the highest feature importance, with spreading activation dynamics providing additional predictive power. Together, these findings indicate that multiplex semantic networks capture a richer, cross-cultural picture of associative knowledge underlying creativity. We also release our diverse dataset and code to foster diverse computational approaches within the creativity community.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on cognitive science and semantic network analysis for creativity prediction, involving ridge regression and multiplex networks. It lacks technical content related to deep learning architectures (Tokenizer, Visual Encoder, MLLM), reinforcement learning (World Models, model-based RL), or unified model frameworks. Minor relevance exists for 'Unify Models' (task unification) and 'MultiModal' (multilingual data), but not in the AI context implied by the keywords.
Keywords
Multiplex semantic networks, Creative associative knowledge, Cognitive tasks, Ridge regression, Cross-cultural, Structural reducibility, Spreading activation, Machine learning prediction
Abstract (Chinese translation)
我们提出了 TruthSplit,这是一个用于多视角论证分析的交互式系统。现有的论证工具通常分析论证本身的属性,例如结构、质量、立场或说服力,而未显式处理视角特定的背景知识。TruthSplit 通过支持探索性分析来填补这一空白,即分析相同的声明如何通过世界观特定的价值观、假设和概念解释而导致不同的结论。我们将这种依赖视角的分析称为条件有效性(conditional validity)。针对输入的论证文本,TruthSplit 提取主张与前提,采用三层自然语言推理(NLI)方法评估逻辑一致性以及世界观特定的规范性一致性,并基于编码核心价值和决策原则的结构化配置文件对大语言模型(LLM)的推理进行约束。随后,该系统生成视角特定的解释,识别价值冲突与假设缺口,并通过交互式分析界面可视化分歧。
Abstract
We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge implicit. TruthSplit addresses this gap by supporting an exploratory analysis of how the same claim can lead to different conclusions when interpreted through worldview-specific values, assumptions, and conceptual definitions. We refer to this perspective-dependent analysis as conditional validity. Given an input argumentative text, TruthSplit extracts claims and premises, applies a three-layer natural language inference (NLI) approach to assess both logical and worldview-specific normative consistency, and conditions large language model (LLM) reasoning on structured worldview profiles that encode core values and decision principles. The system then generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through interactive analytical interfaces.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on argumentation analysis and worldview conditioning using LLMs, which has low relevance to multimodal representation learning, visual encoders, tokenizers, or model-based reinforcement learning. Only superficial lexical overlap exists with 'World Models' (worldview vs. world model) and 'MLLM' (LLM usage), but no technical alignment with the specified keywords.
Keywords
Multi-Perspective Reasoning, Conditional Validity, Argument Analysis, Worldview Profiles, Natural Language Inference, LLM Conditioning, Value Conflicts
Abstract (Chinese translation)
人体常规动力学分析往往受限于对接触式力和力矩传感器以及受控实验室环境的需求。为解决这一问题,本研究提出了一种面向多体系统 (multibody systems) 的光力学 (optical-mechanics) 运动学 - 动力学联合估计框架。具体而言,建立了约束多体模型来描述系统动力学,而图像测量的运动学量 (image-measured kinematic quantities) 则被用作动力学估计的非接触输入。随后,通过基于遗传算法 (Genetic Algorithm) 的优化,最小化模型预测与图像测量运动学量之间的差异,从而识别出未知关节力矩。在气浮平台 (air-bearing platform) 上的实验验证表明,基于图像数据估计的腕关节力矩与传感器测量值相比,平均绝对误差为 0.46 Nm。在前向预测测试中,模型预测的角速度相对于图像测量结果的平均绝对误差为 0.006 rad/s。本研究展示了在难以直接测量力和力矩的场景中,结合图像测量与机械建模进行非接触动力学估计 (non-contact dynamic estimation) 的潜力。
Abstract
Conventional dynamics analysis of the human body is often constrained by the need for contact force and torque sensors and controlled laboratory environments. To address this issue, this study proposes an optical-mechanics kinematic-dynamic integrated estimation framework for multibody systems. Specifically, a constrained multibody model is established to describe the system dynamics, while image-measured kinematic quantities are used as non-contact inputs for dynamic estimation. The unknown joint torque is then identified through a genetic-algorithm-based optimization by minimizing the discrepancy between model-predicted and image-measured kinematic quantities. Experimental validation on an air-bearing platform showed that the wrist joint torque estimated from image data achieved a mean absolute error of 0.46 Nm compared with sensor measurements. In the forward prediction test, the model-predicted angular velocity achieved a mean absolute error of 0.006 rad/s relative to the image-measured results. This study demonstrates the potential of combining image measurement and mechanical modeling for non-contact dynamic estimation in scenarios where direct force and torque measurement is difficult.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on robotics and mechanics (optical-mechanics framework for dynamic estimation), while the provided keywords pertain to Artificial Intelligence (LLMs, World Models, RL). There is no overlap regarding Tokenizers, MLLMs, or AI-specific Visual Encoders. 'MultiModal' and 'Unify Models' receive minimal credit for combining vision and mechanics, but not in the AI context. No expert authors from the specified list were found in the author list. The weighted total score is 3.0, which is below the dynamic passing score of 27.8.
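The weighted total quoted in the rationale can be reproduced from the table rows. The sketch assumes each row's score is weight × relevance (consistent with every row in the tables on this page) and that a paper passes when its total reaches the quoted threshold; the comparison rule and the omission of any expert-author bonus are assumptions, since the report does not define them.

```python
# Rows copied from the scoring table above: (keyword, weight, relevance out of 10).
ROWS = [
    ("Unify Models", 1.5, 1.0),
    ("Tokenizer", 1.5, 0.0),
    ("Visual Encoder", 1.5, 0.0),
    ("World Models", 1.5, 0.0),
    ("MLLM", 1.5, 0.0),
    ("MultiModal", 1.5, 1.0),
    ("model-based RL", 1.5, 0.0),
]

PASS_SCORE = 27.8  # dynamic passing score quoted in the rationale


def weighted_total(rows):
    """Sum weight * relevance over all keyword rows."""
    return sum(weight * relevance for _, weight, relevance in rows)


total = weighted_total(ROWS)
passes = total >= PASS_SCORE  # assumed rule: pass when the total reaches the threshold
print(f"weighted total = {total:.1f}, passes = {passes}")
# prints "weighted total = 3.0, passes = False"
```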
Keywords
Opticalmechanics Framework, Dynamic Estimation, Multibody Systems, Non-contact, Image-measured Kinematics, Genetic Algorithm, Joint Torque, Mechanical Modeling
Abstract (Chinese translation)
AI 评估结果虽大规模生成,但在排行榜(leaderboards)、模型卡(model cards)、基准论文及公司博客中的报告却不一致。这种不一致的代价在于可解释性不足:读者无法可靠地跨源比较结果,无法识别报告遗漏的内容,也无法将汇总主张追溯至其底层证据。近期工作虽解决了孤立组件,但仍存在三个缺口:它们仅覆盖评估生命周期的狭窄片段,无法整合为单一的可解释记录;它们指定静态表示,无法区分不同利益相关者(stakeholders)对同一证据提出的不同问题;且它们仍停留在纸面提案阶段,缺乏大规模采纳所需的基础设施。本文提出 EvalCards,这是一个操作报告层,它将基准元数据、评估运行数据及模型元数据整合为统一记录。具体而言,(1) 我们通过对 52 篇论文和 10 次利益相关者访谈的结构化审查,推导出报告模式;(2) 我们实现了四个解释性信号(可复现性、文档完整性、溯源与风险、分数可比性),并通过针对研究及非研究受众校准的读者模式进行呈现;(3) 我们部署了一个监控工具,在 5,816 个模型、635 个基准及 101,843 个结果上应用 EvalCards,从而揭示了当前报告实践中的系统性差距。
Abstract
AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present EvalCards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies EvalCards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
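Scoring rationale: The paper's core contribution is EvalCards, infrastructure for reporting AI evaluations that targets inconsistent reporting and missing traceability of evaluation results. The supplied keywords mainly concern internal model architecture (Tokenizer, Visual Encoder), specific model paradigms (MLLM, World Models), and reinforcement learning (model-based RL), a strong mismatch with the paper's domain of evaluation metadata and reporting standards. Only 'Unify Models' has a weak, literal link through the 'unified record', but this is not unification at the model-architecture level, so the relevance scores are very low.
Keywords
Evaluation Cards, AI Evaluation, Reporting Schema, Benchmark Metadata, Model Metadata, Interpretive Signals, Reproducibility
Abstract (Chinese translation)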
形式化神经网络验证——证明网络在指定域内的所有输入上均满足安全性属性——在实践中受限于 GPU 内存:界限传播算法(IBP、CROWN、α-CROWN)的标准实现要求权重矩阵和松弛系数矩阵完全驻留在单个加速器上。我们将原本为大规模模型训练开发的两种并行技术适配至 auto_LiRPA / α,β-CROWN 验证框架。张量并行 (TP) 将权重矩阵和 A 矩阵跨 GPU 分片,在 P=2 时实现了约 2 倍的峰值内存减少;在 VNN-COMP 2022 MNIST-FC 基准测试上确认了正确性,但由于分片区域内部强制使用 IBP 替代中间界限,界限紧致性随分片区域数量的增加而下降。全分片数据并行 (FSDP) 仅对权重矩阵进行分片,并结合每层的 AllGather 操作,产生的界限与单 GPU 基线位级相同:在宽多层感知机上,基线内存降低 80%–90%,峰值内存降低 34%–39%。FSDP 与完整验证(β-CROWN + 分支定界)及卷积层(BoundConv)无缝集成;在 FSDP 下,针对 CIFAR-100 ResNet-large(VNN-COMP 2024)获得了完整的不可满足结果。在所有实验中,α-CROWN+BaB 模式下的内存瓶颈被证明是每神经元 α 张量,而非权重矩阵,这指明了未来工作的关键方向。
Abstract
Formal neural network verification, that is, proving that a network satisfies safety properties for all inputs in a specified domain, is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, α-CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the auto_LiRPA / α,β-CROWN verification framework. Tensor Parallelism (TP) shards both weight and A-matrices across GPUs, achieving ≈2× peak-memory reduction at P=2; soundness is confirmed on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness degrades with the number of sharded zones due to forced IBP substitution for intermediate bounds inside sharded zones. Fully Sharded Data Parallelism (FSDP) shards only weight matrices with a per-layer AllGather, producing bounds that are bitwise identical to the single-GPU baseline: baseline memory drops by 80–90%, peak memory by 34–39% on wide MLPs. FSDP integrates cleanly with complete verification (β-CROWN + Branch-and-Bound) and with convolutional layers (BoundConv); a complete unsat result is obtained for CIFAR-100 ResNet-large (VNN-COMP 2024) under FSDP. Across all experiments the memory bottleneck in α-CROWN+BaB mode proves to be per-neuron alpha tensors, not weight matrices, pointing to the key direction for future work.
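Why gathering sharded weights leaves the bounds unchanged can be seen on the simplest algorithm the abstract names, IBP, for a single linear layer. This is a NumPy simulation under stated assumptions: `shard_rows` and `all_gather` are hypothetical stand-ins for device sharding and a collective AllGather, and none of this is the auto_LiRPA or torch.distributed API.

```python
import numpy as np


def ibp_linear(W, b, lower, upper):
    """Interval bound propagation through y = W @ x + b for x in [lower, upper]."""
    center = (lower + upper) / 2.0
    radius = (upper - lower) / 2.0
    out_center = W @ center + b
    out_radius = np.abs(W) @ radius  # worst-case spread of each output
    return out_center - out_radius, out_center + out_radius


def shard_rows(W, num_shards):
    """Split W row-wise, as if each shard lived on a different device."""
    return np.array_split(W, num_shards, axis=0)


def all_gather(shards):
    """Stand-in for a collective AllGather: reassemble the full matrix."""
    return np.concatenate(shards, axis=0)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
b = rng.normal(size=8)
x = rng.normal(size=4)
eps = 0.1  # L-infinity radius of the input box

# Baseline: the full weight matrix is resident in one place.
lo_ref, hi_ref = ibp_linear(W, b, x - eps, x + eps)

# Sharded: only shards are stored; the full matrix is rebuilt just for this layer.
shards = shard_rows(W, num_shards=2)
lo_fsdp, hi_fsdp = ibp_linear(all_gather(shards), b, x - eps, x + eps)

print(np.array_equal(lo_ref, lo_fsdp), np.array_equal(hi_ref, hi_fsdp))
```

The gathered matrix holds exactly the same values as the original, so the bound computation is unchanged; the memory saving comes from holding only one layer's full weights at a time. Tensor parallelism, by contrast, changes the computation inside sharded zones, which is where the abstract reports looser bounds.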
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on formal neural network verification and parallelism techniques (TP, FSDP) to address GPU memory bottlenecks. It has negligible overlap with the provided keywords concerning multimodal large models, tokenizers, visual encoders, world models, and reinforcement learning. Only 'Unify Models' receives a minimal score for conceptually unifying verification with parallelism, while others are completely unrelated.
Keywords
Neural Network Verification, Tensor Parallelism, Fully Sharded Data Parallelism, Memory Bottleneck, Bound Propagation, Formal Verification, GPU Memory, auto_LiRPA
Abstract (Chinese translation)
大型语言模型(LLM)的多语言安全评估主要依赖于将英语基准直接翻译(DT)为目标语言,该方法虽转换了表层语言形式,却未能反映蕴含于威胁场景、社会规范及法律框架中的文化背景。我们通过 1:1 种子匹配构建了四种语言(韩语(KO)、日语(JA)、泰语(TH)和高棉语(KM))的配对直接翻译(DT)与文化适应(CA)数据集,并在四种开源大型语言模型(LLM)上比较了攻击成功率(ASR)和文化真实性分数。文化适应(CA)提示在所有 16 种语言与模型组合中均产生 Delta-ASR > 0(平均 +9.3 个百分点),而基于直接翻译(DT)的评估在 48 种类别与语言组合中的 44 种中低估了风险。语言级别的分析显示,威胁形式的分布在不同语言间存在异质性。文化真实性分析进一步显示,直接翻译(DT)的文化深度(C3)分数在所有四种语言中始终低于 3.0 分制下的 1.0 分(平均 0.17),而文化适应(CA)分数高达 2.51 分,这表明直接翻译产生的输入与现实世界多元文化环境中遇到的输入存在系统性偏差。这些发现表明,为使多语言大型语言模型(LLM)的安全评估有效,必须将基准适配至语言特定的文化语境,而非仅依赖语言翻译本身。
Abstract
Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLMs. CA prompts yield Delta-ASR > 0 across all 16 language x model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category x language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.
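The Delta-ASR comparison amounts to a difference of two success rates over paired prompts. The sketch assumes Delta-ASR is ASR(CA) minus ASR(DT), expressed in percentage points, which matches the reported sign convention but is not defined in the abstract; the outcome lists are placeholders for illustration, not the paper's data.

```python
def attack_success_rate(outcomes):
    """Fraction of prompts whose attack succeeded (True = success)."""
    return sum(outcomes) / len(outcomes)


def delta_asr_pp(ca_outcomes, dt_outcomes):
    """ASR(CA) - ASR(DT) in percentage points, for 1:1 paired prompts."""
    if len(ca_outcomes) != len(dt_outcomes):
        raise ValueError("paired datasets must have the same number of prompts")
    return 100.0 * (attack_success_rate(ca_outcomes) - attack_success_rate(dt_outcomes))


# Placeholder outcomes for one language x model cell (illustrative only).
dt = [True] + [False] * 9       # 1 of 10 direct-translation prompts succeeds
ca = [True, True] + [False] * 8  # 2 of 10 culturally adapted prompts succeed
delta = delta_asr_pp(ca, dt)
print(f"Delta-ASR = {delta:+.1f} pp")
```

A positive value means the direct-translation benchmark understates how often attacks succeed for that cell.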
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
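Scoring rationale: The paper focuses on safety evaluation and cultural adaptation for multilingual large language models (LLMs), covering red-teaming, attack success rate, and cultural realism analysis. The supplied keywords center on multimodal architectures (Visual Encoder, MultiModal, MLLM), model unification (Unify Models), Tokenizer, World Models, and model-based RL. The paper has almost no overlap with these technical areas; only 'MLLM' is weakly related because the work involves LLMs, and the remaining keywords are entirely unrelated. The weighted total is far below the dynamic passing score of 27.8. The author list contains none of the specified experts.
Keywords
Multilingual safety evaluation, Large language models, Red-teaming, Cultural adaptation, Direct translation, Attack success rate, Cultural realism, Benchmarking
Abstract (Chinese translation)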
尽管伊辛机(Ising machines)作为伊辛模型(Ising model)的先进物理求解器,能够应用于组合优化和神经网络训练,但其在大规模神经网络上的可扩展性仍受限于硬件连接性限制及次优的训练方法。本研究借助相干伊辛机(CIM),采用平衡传播(Equilibrium Propagation)算法训练基于能量的神经网络,取得了与现有基于软件实现相当的性能。为进一步增强算法,我们通过集成 Adam 优化器求解霍普菲尔德(Hopfield)能量网络的基态,显著提升了收敛速度和求解精度。同时,我们还展示了该方法在更深网络架构及卷积操作上的可扩展性。结果表明,CIM 动力学作为一种可扩展平台,具有训练复杂神经网络的潜力,并通过模拟电路、光电子学或集成光子学为节能实现提供了途径。本研究为下一代人工智能硬件的开发建立了新颖的物理框架。
Abstract
While Ising machines serve as advanced physical solvers for the Ising model, enabling applications in combinatorial optimization and neural network training, their scalability for large-scale neural networks remains constrained by hardware connectivity limitations and suboptimal training methodologies. In this work, we leverage a Coherent Ising Machine (CIM) to train an energy-based neural network using Equilibrium Propagation, achieving performance comparable to existing software-based implementations. We further enhance the algorithm by integrating the Adam optimizer to solve for the ground state of a Hopfield energy network, significantly improving convergence speed and solution accuracy. Additionally, we demonstrate the scalability of our approach across deeper network architectures and convolutional operations. Our results highlight the potential of CIM dynamics as a scalable platform for training complex neural networks, offering a pathway toward energy-efficient implementations via analog circuits, optoelectronics, or integrated photonics. This work establishes a novel physical framework for next-generation AI hardware development.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on hardware acceleration for energy-based neural networks using Coherent Ising Machines and Equilibrium Propagation. It does not address multimodal learning, tokenization, visual encoders, world models, MLLMs, or model-based RL. Therefore, relevance to the provided keyword set is negligible (Weighted Sum: 1.5). No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found in the author list, so no bonus points were applied.
Keywords
Coherent Ising Machine, Energy-based Neural Network, Equilibrium Propagation, Hopfield Energy Network, Scalability, Analog Circuits, Convergence Speed
Abstract (Chinese translation)
在线策略蒸馏(OPD)利用更强教师提供的密集逐令牌监督,在其自身的轨迹上训练学生模型,通常优于离线策略蒸馏和标准强化学习。然而,我们发现其有效性隐含地依赖于两个在实践中经常失效的假设:学生与教师之间的轨迹级对齐,以及教师偏好在逐令牌级别上可靠性的一致性。因此,我们提出符号门控在线策略蒸馏(SG-OPD),它在两个互补的粒度上使用二元验证器作为教师的信任信号:分阶段教师采样在冷启动阶段混合了经验证器认可的教师轨迹,而符号一致性门则在教师与验证器正确方向一致的令牌上外推蒸馏更新,在方向不一致时内插更新。在竞赛级数学推理基准上的实验表明,SG-OPD 始终优于标准 OPD,在样本级和题目级上的平均增益分别为 1.98 和 7.50。
Abstract
On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
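Scoring rationale: The paper proposes an on-policy distillation algorithm (SG-OPD) for mathematical reasoning tasks, centered on teacher-student trajectory alignment and token-level reliability. It does not involve multimodal data fusion, visual encoders, world-model construction, tokenizer design, or unified model architectures, so it is entirely unrelated to Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, and MultiModal. Although the paper uses reinforcement learning terminology (such as trajectories and on-policy), its focus is a distillation method rather than model-based RL, so only a very low relevance score is given.
Keywords
Sign-Gated On-Policy Distillation, Teacher-Student Distillation, Mathematical Reasoning, Per-Token Supervision, Binary Verifier, Phased Teacher Sampling, Sign-Consistency Gating
Abstract (Chinese translation)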
脉冲神经网络(SNNs)正日益在多种框架(如 SnnTorch、Lava、Norse 等)中进行训练,每种框架均拥有其独立的模型格式。神经形态中间表示(NIR)通过提供一种通用的、与框架无关的格式来解决这一问题,用于交换训练好的 SNN 模型。NIR 解决了交换问题,但止步于此。它仅提供网络的描述,而非运行该网络的路径。每个后端仍需自行实现部署,中间缺乏共享的、可转换的编译器表示。本文提出了 snn-mlir,这是一个用于 SNNs 的树外 MLIR 方言,并附带一个 NIR-MLIR-C 编译桥接。该方言提供了一组少量的类型多态操作,这些操作在浮点数(f32/f64)和量化数据上表现一致,因此单个中间表示同时服务于仿真和面向硬件的部署。一个 Python 前端读取任意 NIR 文件并发出方言中间表示(IR),自动插入重缩放操作以保持各层间量化尺度的一致性。一个参考降低传递将该方言转换为标准的线性代数(linalg)和算术(arith)操作,工具链由此生成自包含、无依赖的 C11 代码,可在任何支持 C 的 CPU 或嵌入式目标上编译并运行。我们评估了相对于参考输出的数值保真度、跨 CPU 目标的可移植性以及量化开销。当前的作用范围仅限于前馈、全连接网络,且仅支持 CPU 后端。snn-mlir 以 Apache-2.0 许可证(含 LLVM 异常)开源发布,目前已可在 GitHub 上获取。
Abstract
Spiking neural networks (SNNs) are increasingly trained in a wide range of frameworks (SnnTorch, Lava, Norse, and others), each with its own model format. The Neuromorphic Intermediate Representation (NIR) addresses this fragmentation by providing a common, framework-independent format for exchanging trained SNN models. NIR solves the exchange problem, but it stops there. It provides a description of a network, not a path to running one. Each backend is still left to implement deployment on its own, with no shared, transformable compiler representation in between. This paper presents snn-mlir, an out-of-tree MLIR dialect for SNNs together with a NIR-MLIR-C compilation bridge. The dialect provides a small set of type-polymorphic operations that work identically on floating-point (f32/f64) and quantized data, so a single intermediate representation serves both simulation and hardware-oriented deployment. A Python front end reads any NIR file and emits dialect IR, automatically inserting rescaling operations to keep quantization scales consistent across layers. A reference lowering pass converts the dialect to standard linalg and arith operations, from which the toolchain produces self-contained, dependency-free C11 code that compiles and runs on any C-capable CPU or embedded target. We evaluate numerical fidelity against reference outputs, portability across CPU targets, and the cost of quantization. The current scope is feedforward, fully-connected networks with a CPU backend. snn-mlir is released as open source under the Apache-2.0 license with LLVM-exception and is already available on GitHub.
Score details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on Spiking Neural Networks (SNN) compilation using MLIR and NIR for neuromorphic hardware deployment. It does not involve Multimodal Large Language Models (MLLM), Tokenizers, Visual Encoders, World Models, or Model-Based Reinforcement Learning. While it unifies SNN model representations (NIR to MLIR), this does not align with the specific technical domains implied by the provided keyword set (which centers on generative and RL models). The weighted total score is 1.5 (Unify Models: 1.0 * 1.5), far below the dynamic pass score of 27.8. No expert authors from the specified list are present.
Keywords
Spiking neural networks, MLIR, Neuromorphic Intermediate Representation, Compilation, Quantization, C code, SNN deployment, Embedded targets
Abstract (Chinese translation)
语言模型作为多任务学习者,在训练过程中习得广泛的能力。一个基本问题是,学习给定任务需要多少任务特定数据。对于自然语言而言,回答这一问题十分困难:任务难以界定,且可能相互混淆。为了严谨探究数据频率与可学习性之间的关系,我们转向一种受控设置,使用由概率有限自动机(probabilistic finite automata)构造的形式语言。这些设置作为方法学测试床,表明标准相关性评估实践本质上存在缺陷。为了支持因果分析,我们引入分箱半环(binning semiring),这是一种代数对象,允许我们控制目标属性在采样语料库中出现的频率。我们将实验流程表述为因果图模型(causal graphical model),并推导出分解的 Kullback-Leibler 散度(Kullback-Leibler divergence)度量,以衡量特定子任务的可学习性。我们的实验表明,在不进行因果干预的情况下评估可学习性会导致错误结论,这是由于相关性分析中的混杂因子所致,同时也作为对自然语言设置中相关性陷阱的警示。
Abstract
Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are hard to delineate and can confound one another. To rigorously investigate the relationship between data frequency and learnability, we turn to a controlled setting using formal languages induced from probabilistic finite automata. These serve as a methodological testbed to demonstrate that standard correlational evaluation practices are inherently flawed. To enable causal analysis, we introduce the binning semiring, an algebraic object that lets us control how often a targeted property occurs in a sampled corpus. We formulate the experimental pipeline as a causal graphical model and derive decomposed Kullback-Leibler divergence metrics to measure the learnability of specific sub-tasks. Our experiments show that evaluating learnability without causal intervention leads to incorrect conclusions due to confounders in correlational analysis, and serve as a warning about correlational pitfalls in natural-language settings.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on causal evaluation of learnability in formal languages using probabilistic finite automata and binning semirings. It does not involve multimodal integration, world modeling, reinforcement learning, tokenizers, or visual encoders, resulting in negligible relevance to the provided keyword set which targets multimodal and model-based RL architectures.
关键词
Causal Evaluation, Learnability, Formal Language Tasks, Probabilistic Finite Automata, Binning Semiring, Correlational Pitfalls, Natural Language
摘要翻译
医学语言模型(LMs)能够记忆并重现受保护的健康信息,但隐私评估往往侧重于训练文本的恢复,而非在现实威胁模型下的泄露情况。我们提出一个基于临床的框架,该框架沿对抗性访问的分级轴评估泄露情况,范围从可公开推断的人口统计学信息到泄露的笔记片段。在每个层级上,我们测量患者特定文本的逐字记忆以及敏感诊断的语义泄露。将该框架应用于在 37.8 万份临床笔记上预训练的 LM,我们发现常规就诊元数据(即姓名、出生日期、提供者、机构、就诊日期)会在患者的时间轴上引发高比例的逐字记忆,并导致敏感诊断恢复(流产的 AUROC 为 0.91,HIV 为 0.81)。与此同时,精确匹配记忆可能会高估泄露情况:36% 的记忆词元反映了模板化文本。我们的工作突出了在纵向临床数据上训练的风险,并为医学 LMs 的上下文隐私评估提供了一个实用框架。
Abstract
Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on privacy evaluation of Medical Language Models using clinical notes, whereas the provided keywords target Multimodal architectures (MLLM, MultiModal, Visual Encoder), World Models, and Reinforcement Learning (model-based RL). Tokenizer scores 1.0 due to mention of token-level memorization; all others are 0.0 due to domain mismatch. No expert authors found.
关键词
Privacy Evaluation, Medical Language Models, Clinical Notes, Memorization, Semantic Leakage, Protected Health Information, Adversarial Access, Longitudinal Data
摘要翻译
随着大型语言模型(LLMs)日益应用于实际法律任务,评估其开放式法律回应的可靠性变得至关重要。这些任务需要上下文敏感的回答,且容错空间极小,从而推动了细粒度和诊断性评估的发展,旨在识别回答质量缺陷的具体来源。我们引入了 LexRubric,这是一个基于评分量表的基准,用于评估开放式中文法律任务。LexRubric 包含来自法律咨询和司法考试的 649 个实例,既反映了日常法律需求,也涵盖了专业法律推理,并涉及 14 种法律场景。它还进一步包含了 12,337 个专家撰写的原子评分标准,这些标准组织在一个统一的六维框架下,使得跨任务和跨评估维度的准确评估与诊断性分析成为可能。为了验证评估的可靠性,我们测试了多个评判模型,并将基于模型的判断与人类判断进行了比较。我们进一步在 LexRubric 上评估了 18 个近期通用型及法律领域的 LLMs。结果显示,不同模型展现出不同的能力特征,且开放式法律问题对当前的 LLMs 而言仍然具有挑战性。数据可在以下网址获取:https://github.com/foggpoy/LexRubric。
Abstract
As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation that can identify specific sources of response quality failures. We introduce LexRubric, a rubric-based benchmark for evaluating open-ended Chinese legal tasks. LexRubric contains 649 instances from legal consultation and judicial examination, which reflect both everyday legal needs and professional legal reasoning and cover 14 legal scenarios. It further includes 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and evaluation dimensions. To validate the reliability of the evaluation, we test multiple judge models and compare model-based judgments with human judgments. We further evaluate 18 recent general and legal-domain LLMs on LexRubric. Results show that different models exhibit distinct capability profiles, and that open-ended legal questions remain challenging for current LLMs. Data is available at: https://github.com/foggpoy/LexRubric.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on a legal NLP benchmark (LexRubric) for evaluating LLMs using rubric-based scoring. It does not involve multimodal architectures, visual encoders, tokenizers, world models, or reinforcement learning. 'Unify Models' receives minimal relevance (1.0) because of the unified six-dimensional scoring framework, which is not a unification of model architectures. No expert authors from the target list are present.
关键词
Legal Benchmark, LLM Evaluation, Rubric-based, Open-ended Tasks, Chinese Legal, Diagnostic Analysis, Model Comparison
摘要翻译
人体血管网络的特点是血管在半径、长度、拓扑性质和分支模式上表现出巨大的结构差异。这种异质性加上特定位置的解剖背景差异,对整个心血管系统的鲁棒、大规模分析构成了重大挑战。因此,大多数研究都局限于血管网络的狭窄、孤立片段。虽然这些针对性研究提供了有价值的见解,但它们本质上限制了评估血管网络整体系统性健康和功能完整性的能力。在这项工作中,我们旨在弥合这一差距,以推进临床诊断和对血管生理学的基本理解。我们提出了在 CT 图像中分割所有血管的任务,范围涵盖从心血管系统的最大组成部分到微小的肠系膜血管。为此,我们引入了 vesselFM-CT,这是首个能够在 3D CT 图像中鲁棒分割所有血管的模型。vesselFM-CT 通过迭代多步过程进行训练,并优化了我们提出的 TubeLoss 损失函数,有效解决了心血管系统的固有异质性。我们证明 vesselFM-CT 优于所有基线,能够从 CT 图像中自动化、精确地提取心血管系统,从而解锁了广泛的临床和技术视角,包括自动疾病分类和合成 CT 图像生成。
Abstract
The vascular network in the human body is characterized by blood vessels exhibiting drastic structural variations in radius, length, topological properties, and branching patterns. This heterogeneity, together with location-specific anatomical background variations, poses a significant challenge for robust, large-scale analysis of the entire cardiovascular system. As a result, most research has focused on narrow, isolated segments of the vascular network. While such targeted studies provide valuable insights, they inherently limit the ability to assess the systemic health and functional integrity of the vascular network as a whole. In this work, we aim to bridge this gap to advance both clinical diagnostics and our fundamental understanding of vascular physiology. We propose the task of segmenting all vessels in CT images, ranging from the largest components of the cardiovascular system to even minuscule mesenteric vessels. To this end, we introduce vesselFM-CT, the first model capable of robustly segmenting all blood vessels in 3D CT images. VesselFM-CT is trained via an iterative, multi-step process and optimizes our proposed TubeLoss loss function, effectively addressing the inherent heterogeneity of the cardiovascular system. We demonstrate that vesselFM-CT outperforms all baselines and enables automated, precise extraction of the cardiovascular system from CT images, thereby unlocking a wide range of clinical and technical perspectives, including automated disease classification and synthetic CT image generation.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on medical image segmentation (blood vessels in CT) using vesselFM-CT and TubeLoss. It does not involve MLLM, Tokenizers, World Models, or Reinforcement Learning. It processes single-modality CT data, making MultiModal irrelevant. Visual Encoder has minimal relevance as it is a vision task but not aligned with the multimodal/RL context of the keywords. No listed expert authors are present, so no bonus points apply.
关键词
vesselFM-CT, Blood Vessel Segmentation, 3D CT Images, Cardiovascular Analysis, TubeLoss, System-Level Analysis, Medical Image Segmentation
摘要翻译
视觉 - 语言模型(VLM)智能体正越来越多地被部署于交互式游戏环境中。然而,针对 VLM 智能体的游戏基准测试通常仅报告每个(智能体,游戏)对的首次尝试得分,专注于 Solo 模式,且缺乏统一协议来同等地评估异构智能体类别(商业 VLM、开源权重 VLM 和专用游戏策略)。我们通过 OmniGameArena 和改进动力学曲线(IDC)来解决这些差距,其中 OmniGameArena 是一个包含十二个新构建的虚幻引擎 5 游戏的实时基准测试,涵盖 Solo(7)、PvP(3)和 Coop(2),具有统一动作接口;而 IDC 是一个智能体反思框架,其中使用工具的反思大语言模型(LLM)在多轮中自主精炼有界技能提示词。除了冷启动排行榜得分外,IDC 还为每个(智能体,游戏)对提供了两个额外的可观测指标:分数如何随反思轮次演变,以及所学技能在保留的任务变体上的表现。我们在冷启动排行榜上报告了十二个 VLM 智能体的这些可观测指标,并在 IDC 框架下报告了四个顶级智能体的结果。
Abstract
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: Scoring failed with a JSON parse error: Expecting ',' delimiter: line 12 column 58 (char 281)
摘要翻译
我们考虑线性上下文随机多臂老虎机(Linear Contextual Stochastic Multi-Armed Bandits)的一种变体,其中学习者需向一组用户推荐,每个用户均具有个性化偏好向量,且上下文分布随时间发生漂移。在符合实际应用的假设下,我们将此问题简化为均值平稳但具有异方差(Heteroskedastic)和非平稳(Non-stationary)噪声的线性老虎机(Linear Bandit)。我们还进一步研究了这样一种情形:学习者必须确保在每个决策步骤中,每个决策的平均奖励均超过基线策略 $\boldsymbol\pi_0$。我们提出了一种名为 Dri-MED 的算法,该算法受 MED 策略的线性版本启发,并经过精心调整以处理非平稳异方差噪声。我们证明,实例依赖的遗憾量级为 $\tilde{\mathcal O}\left(\frac{\kappa}{\tilde{\Delta}}d^2\log(T)\right)$,其中 $\tilde{\Delta}$ 是相对于策略 $\pi_0$ 的约束感知次优性差距(Constraint-aware Sub-optimality Gap),$\kappa$ 为方差感知的乘数项,我们通过异方差回归对其进行了精细处理。此外,我们还证明 Dri-MED 具有 $\tilde{\mathcal{O}}(d)$ 的期望约束违反次数。我们的数值实验结果表明,Dri-MED 显著优于那些忽略漂移和偏好结构的保守基线。
Abstract
We consider a variant of linear contextual stochastic multi-armed bandits in which the learner must provide recommendations to a group of users, each with a personalized preference vector, while the context distributions drift over time. Under practitioner-friendly assumptions, we reduce this setting to a linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case in which the learner must ensure that the mean reward of each decision exceeds that of a baseline strategy $\boldsymbol\pi_0$ at each decision step. We introduce Dri-MED, an algorithm inspired by the linear version of the MED strategy and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal O}\left(\frac{\kappa}{\tilde{\Delta}}d^2\log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\pi_0$, with a variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show that Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignore the drift and preference structure.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
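评分理由: The paper focuses on algorithm design for contextual multi-armed bandits, addressing drifting contexts and personalized preferences. The scoring keywords all center on architectural components such as multimodal large language models (MLLM), world models, tokenizers, and visual encoders. The paper has no direct connection to multimodal representation learning, unified large-model architectures, or world-model construction, so every keyword receives a relevance score of 0.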
关键词
Contextual Multi-Armed Bandits, Drifting Contexts, Personalized Preferences, Baseline Constraint, Dri-MED Algorithm, Linear Bandit, Experimentation
摘要翻译
随着人工智能日益部署于安全关键系统中,为底层模型提供形式化鲁棒性保证至关重要。现有的验证方法要么依赖于过于保守的近似,要么会产生不可承受的计算成本。例如,在视频场景中使用 lp-范数扰动编码了一种信念,即对手可以在每一帧视频中注入噪声。实际上,对抗性扰动表现出结构化的空间和时间相关性,被约束在低维、语义有意义的子空间中。本文研究了处理视频和体积输入的三维卷积神经网络(3D CNN)的鲁棒性验证,针对动作识别(UCF-101)、自动驾驶(Udacity)和医学成像(MedMNIST)的应用。我们通过将对抗强度建模为时空约束来利用现实假设——即攻击者可以修改一组连续帧中的部分帧或局部区域(patches)。我们证明,建模现实约束能够实现更紧密的近似。我们引入了时空界限传播(STBP),这是一种验证框架,它计算第一卷积层的精确闭式表征,并使用可扩展的近似方法将认证界限传播至后续层。计算精确闭式形式能为第一卷积层提供最紧的界限。因此,我们在网络的其余部分采用近似方法。为了推动该领域的进一步发展,我们提出了 ST-Bench,这是一个用于自动驾驶和活动识别的验证基准,旨在系统性地评估可验证鲁棒性。与现有的基于验证的方法相比,STBP 提供了更强的鲁棒性保证,且具有显著改进的可扩展性,在相同的扰动预算下实现了 1.7 倍的认证鲁棒准确率提升。
Abstract
With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer. Thus, we utilise approximation methods in the remainder of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on formal robustness verification for 3D CNNs processing video/volumetric data, proposing Spatio-Temporal Bound Propagation (STBP). The provided keywords target Multimodal Large Language Models (MLLM), World Models, and Unified Architectures involving Tokenizers and RL. There is no overlap in methodology or subject matter, resulting in zero relevance for all keywords.
关键词
Robustness Verification, Spatio-Temporal Neural Networks, 3D CNN, Adversarial Perturbations, Bound Propagation, Formal Guarantees, Video Processing
摘要翻译
机器学习硬件中数值格式的激增——涵盖 FP8 (E4M3 和 E5M2)、BF16、MXFP4、微缩放块格式以及数十种研究变体——已超过了厂商中立、位精确参考材料的供应速度。在不同加速器间移植模型的工程师会遇到难以诊断的静默差异,若缺乏统一标准则更难排查。本文描述了一份涵盖 13 个类别的 84 种数值格式清单,一套包含六个位精确一致性包的套件(涵盖 GF16、MXFP4 元素、BF16、FP8 E4M3、FP8 E5M2 和 E8M0 块缩放),以及一个 IEEE P3109 v3.2.0 交叉对照表,将每个包映射至其对应的标准轨道配置格式。每个包均为一个自包含的 JSON 文档,包含 SHA-256 指纹、共享行模式以及一个锚向量,该向量编码 3.0(即恒等式 phi^2 + 1/phi^2 = 3),用作跨包的合理性检查。这些包与 ml_dtypes 0.5.4 (Google/JAX) 进行了交叉验证;任何差异均被明确记录,并被解释为规范允许的解读差异,而非隐藏的问题。该项工作定位为注册表填充:它不提出新格式,不作模型准确性声明,也不宣称优于任何厂商的实现。所有工件均在 https://github.com/gHashTag/t27 公开提供,并遵循开源许可证。
Abstract
Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler. This paper describes a catalog of 84 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor's implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license.
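The anchor value described above can be checked by hand: the identity phi^2 + 1/phi^2 = 3 holds for the golden ratio, and 3.0 is exactly representable in BF16, FP8 E4M3, and FP8 E5M2. The Python sketch below illustrates this with a generic sign/exponent/mantissa decoder. It is not the packs' JSON schema or tooling; the function names are hypothetical, the exponent biases (7 for E4M3, 15 for E5M2, 127 for BF16) are the usual values for these formats, and NaN/Inf encodings, which differ between formats, are deliberately not handled.

```python
import math
import struct

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def anchor_value() -> float:
    """phi^2 + 1/phi^2, which equals 3 up to floating-point rounding."""
    return PHI**2 + 1 / PHI**2


def float32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to BF16 by keeping its top 16 bits (no rounding)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 16


def decode_minifloat(bits: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a finite value; special NaN/Inf encodings are not modelled."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


print(anchor_value())                               # approximately 3.0
print(hex(float32_to_bf16_bits(3.0)))               # 0x4040
print(decode_minifloat(0x44, 4, 3, bias=7))         # FP8 E4M3 -> 3.0
print(decode_minifloat(0x42, 5, 2, bias=15))        # FP8 E5M2 -> 3.0
```

Because 3.0 = 1.5 × 2^1 needs only one mantissa bit, it survives every format above without rounding, which is presumably what makes it convenient as a shared sanity value.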
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on numeric precision standards (FP8, BF16, etc.) and vendor-neutral conformance packs for hardware acceleration. It does not address model architecture unification, tokenization, visual encoding, world modeling, multimodal large language models, or reinforcement learning. Therefore, there is no substantive relevance to the provided keywords.
关键词
Numeric format, FP8, BF16, Bit-Exact, Conformance Vectors, Vendor-Neutral, Microscaling, IEEE P3109
摘要翻译
尽管多通道语音分离的判别式模型在基于参考的指标上表现优异,但它们往往在人类听感质量上表现欠佳。为此,我们提出了一种基于 MeanFlow 的一步生成校正器(MeCo)。MeCo 学习一个条件平均速度场,能够在单步内将判别式估计直接映射至干净语音流形。为了最大化一步生成性能,我们引入了数据空间优化(DSO)。DSO 整合了 $\mathbf{x}_r$-loss 与端点 SI-SDR 损失;其中,$\mathbf{x}_r$-loss 通过惩罚更长位移区间上的预测误差来作为人类听感质量的生成目标,而端点 SI-SDR 损失则直接优化终端信号保真度。实验表明,MeCo 以最小的计算开销实现了最先进(SOTA)的性能,同时在域内与域外场景下均实现了卓越的信号保真度与人类听感质量。
Abstract
While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.
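For readers unfamiliar with the fidelity metric named above, the following Python sketch computes the standard scale-invariant signal-to-distortion ratio (SI-SDR) and its negation as a loss. This is the textbook definition only; how the paper applies it at the flow endpoint is not specified in the abstract, and `endpoint_si_sdr_loss` is a hypothetical name.

```python
import math


def si_sdr(estimate: list[float], reference: list[float]) -> float:
    """Scale-invariant SDR in dB (standard definition).

    Raises ZeroDivisionError if the estimate is an exact scaled copy
    of the reference (zero residual energy).
    """
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in reference]       # projection of estimate onto reference
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


def endpoint_si_sdr_loss(estimate: list[float], reference: list[float]) -> float:
    """Assumed loss form: negative SI-SDR, so lower is better."""
    return -si_sdr(estimate, reference)


reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, -0.1]
print(si_sdr(estimate, reference))                     # about 20 dB
print(si_sdr([3 * e for e in estimate], reference))    # unchanged by rescaling
```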
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
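评分理由: The paper focuses on multi-channel speech separation (audio signal processing) using a MeanFlow-based generative corrector. The scoring keyword set centers on multimodal large language models (MLLM), world models, visual encoders, tokenizers, and reinforcement-learning architectures. The paper does not involve visual encoders, discrete tokenizers, world-model construction, or reinforcement-learning algorithms, nor does it present a unified multimodal model architecture. It therefore has no substantive connection to any scoring keyword, and all relevance scores are 0. The author list (Dohwan Kim, Jung-Woo Choi) does not include the designated experts (Yang Shi, etc.), so no bonus points apply.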
关键词
Multi-Channel Speech Separation, MeanFlow, Generative Corrector, Data-Space Optimization, Human Listening Quality, Signal Fidelity, One-Step Generation
摘要翻译
我们提出 Trellis:一个自动形式化系统,该系统利用大语言模型(LLM)代理,在确定性约束的工作流中,通过迭代细化自然语言证明,以确保在 Lean 自动形式化任务中取得增量式进展。我们的方法受数学家的普遍观念启发,即严格证明的本质在于:顺理成章地详细阐述证明的任何部分。该系统旨在以适度资源和通用型代理实现可靠的自动形式化,其对自动形式化的专业化并非源于特定任务的代理训练,而是源于受严格性意义启发并由过程语义强制的工作流。我们提供了该过程产生的近期拉姆齐理论突破的端到端 Lean 形式化链接。
Abstract
We present Trellis: an autoformalization system that leverages LLM agents in a deterministically constrained workflow to enforce incremental progress in Lean autoformalization tasks through iterative refinement of natural language proofs. Our approach is motivated by the common mathematician's notion of what it means to have a rigorous proof in the first place: namely, that it would be routine to elaborate any part of the proof in further detail. The result is a system which aims to achieve reliable autoformalization on a modest budget and with generalist agents, with specialization to autoformalization coming not from any task-specific agent training but instead from a meaning-of-rigor inspired workflow enforced by process semantics. We link to an end-to-end Lean formalization of a recent Ramsey theory breakthrough produced by the process.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on autoformalization of mathematical proofs using LLM agents and process semantics (AI for Formal Verification). The provided keywords relate to Multimodal LLMs, World Models, and Model-Based RL. There is no mention of visual encoders, multimodal data, world modeling for RL, or model-based RL algorithms, resulting in negligible relevance for all specified keywords.
关键词
Autoformalization, LLM agents, Process semantics, Lean formalization, Iterative refinement, Rigorous proofs, Ramsey theory
摘要翻译
输出空间模式采样是探索大型模式空间的一种强大替代方案,相较于穷举模式挖掘,其允许用户根据选定的有趣性度量聚焦于代表性模式。本文解决了在用户定义的句法约束下采样区间模式的问题。我们引入了 CFips,一种将约束直接纳入采样过程的采样方法。该方法依赖于多步采样框架,并通过将约束分解为区间边界上的基本谓词来支持多种句法约束,同时保持精确采样保证。我们形式化地证明,CFips 在受限模式空间中按频率比例采样区间模式。实验结果表明,将约束集成到采样过程使得原本在给定超时限制内无法完成的挖掘任务得以完成。
Abstract
Output space pattern sampling is a powerful alternative to exhaustive pattern mining for exploring large pattern spaces, as it enables users to focus on representative patterns drawn according to a chosen interestingness measure. In this paper, we address the problem of sampling interval patterns under user-defined syntactic constraints. We introduce CFips, a sampling approach that incorporates constraints directly into the sampling procedure. The approach relies on a multi-step sampling framework and supports several syntactic constraints by decomposing them into elementary predicates on interval bounds while preserving exact sampling guarantees. We formally prove that CFips samples interval patterns proportionally to their frequency within the constrained pattern space. The experimental results show that integrating constraints into the sampling procedure makes it possible to complete mining tasks that would otherwise fail to finish within a given timeout.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on constraint-based sampling of interval patterns within the domain of data mining, proposing the CFips algorithm. The provided keywords relate to Multimodal AI, World Models, and Reinforcement Learning (e.g., Tokenizer, Visual Encoder, MLLM). There is no thematic or methodological overlap between the paper's content (pattern mining, interval constraints) and the specified keywords (deep learning architectures, multimodal fusion, RL), resulting in zero relevance for all keywords. Additionally, none of the listed expert authors appear in the paper's author list.
关键词
Interval Patterns, Constrained Sampling, Pattern Mining, Frequency-based, Syntactic Constraints, CFips, Output Space Sampling, Elementary Predicates
摘要翻译
Muon 近期已成为预训练大型语言模型(LLMs)和视觉分类器的最先进优化器。尽管其在效率上优于 Adam 和 SGD,但 Muon 的特征学习优势仍不明确。本文通过鲁棒性和可迁移性的视角,探究 Muon 的特征学习优势。首先,通过在受损图像和文本上评估预训练模型,我们发现 Muon 学习到的特征在不同架构(包括 Transformer 和卷积神经网络(CNNs))下,均比 Adam 和 SGD 学习到的特征更具鲁棒性。利用训练好的逐层探针,我们进一步表明,这种鲁棒性优势体现在各层更大的 logit margins 上。其次,通过在下游任务上训练线性分类器或使用预训练参数微调完整模型,我们证明 Muon 学习到的特征比 Adam 和 SGD 学习到的特征具有更有效的可迁移性。这种可迁移性优势进一步得到了各层隐藏状态多样性的支持,该多样性通过有效秩(effective rank)进行衡量。最后,在一个具有多组件特征的典型分类问题中,我们证明 Muon 能够获得比 Adam 和 SGD 更大的 logit margins 和更高的 effective rank,从而为实证发现提供理论支持。
Abstract
Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.
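The two diagnostics named above can be sketched in a few lines of Python. The logit margin here is the true-class logit minus the largest other logit, and the effective rank is one common definition: the exponential of the Shannon entropy of the normalized singular values. The abstract does not say which definitions the paper uses, so both are assumptions, and the singular values are passed in directly rather than computed from hidden states.

```python
import math


def logit_margin(logits: list[float], label: int) -> float:
    """True-class logit minus the largest competing logit."""
    others = [z for i, z in enumerate(logits) if i != label]
    return logits[label] - max(others)


def effective_rank(singular_values: list[float]) -> float:
    """exp(entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


print(logit_margin([2.0, 0.5, 1.0], label=0))    # 1.0
print(effective_rank([1.0, 1.0, 1.0, 1.0]))      # about 4: all directions used equally
print(effective_rank([5.0, 0.0, 0.0]))           # 1.0: a single dominant direction
```

Under these definitions, a larger margin means the correct class wins by more, and a higher effective rank means the representation spreads its energy over more directions.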
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
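评分理由: The paper's core contribution concerns the feature-learning advantages (robustness and transferability) of an optimization algorithm (Muon), not model architecture or modality fusion. The provided keywords cover visual encoders, tokenizers, world models, unified models, multimodal large language models, and model-based reinforcement learning; none of these appears as a core research subject in the abstract, so every keyword receives a relevance score of 0.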
关键词
Muon optimizer, Feature Learning, Robustness, Transferability, Large Language Models, Vision Classifiers, Effective Rank, Logit Margins
摘要翻译
人工智能在全球范围内的迅速扩张导致了能源密集型超大规模数据中心(DCs)的迅速普及,使其成为电力系统规划与运行中一个具有结构性挑战的关键组成部分。基于覆盖 21 种 AI 增长情景的欧洲空间显式优化模型,我们系统地量化了数据中心(DCs)的额外需求、容量需求、排放及运行影响。结果表明,到 2050 年,人工智能可能驱动 73-723 TWh 的额外需求,并存在 2030 年至 2050 年间累计排放超标 67-181 MtCO2 的风险。我们的分析表明,2030 年后,人工智能基础设施的地理布局将更多地由基荷电力(firm power)和系统灵活性塑造,而不仅仅取决于清洁能源的丰富程度。在中等情景下,人工智能需要额外增加 200 小时的基荷发电时长,这将使关键枢纽的平准化度电成本(LCOE)增加 35 欧元/兆瓦时。我们表明,即使在悲观情景下,现有基础设施也需要 70 吉瓦(GW)的额外容量,而在受控增长路径下,这一扩展容量可达 226 吉瓦(GW)。我们进一步发现,数据中心(DCs)的工作负载动态强烈塑造能源调度、系统灵活性和排放,而效率的提升显著减少了容量需求和系统峰值负荷。尽管我们的发现表明 2050 年的净零目标可能得以实现,但关键排放风险可能在过渡年份出现,除非政策适应这一加速的数字转型,否则欧盟可能损害其碳中和目标。
Abstract
The rapid expansion of AI globally has led to the proliferation of energy-intensive hyperscale data centres (DCs), making them a structurally challenging component in power system planning and operation. Using a spatially explicit optimisation model of Europe across 21 AI growth scenarios, we systematically quantify additional demand, capacity requirements, emissions, and operational impacts of DCs. Results indicate that AI could drive 73-723 TWh of extra demand by 2050, risking cumulative emissions overshoots of 67-181 MtCO2 between 2030 and 2050. Our analysis indicates that after 2030, the geography of AI infrastructure will be shaped more by firm power and system flexibility than by the mere abundance of clean energy. In moderate scenarios, AI requires an additional 200 hours of firm generation, which increases LCOE by 35 EUR/MWh in key hubs. We show that even under the pessimistic scenarios, existing infrastructure would require 70 GW of additional capacity, while under managed growth pathways, this expansion could reach 226 GW. We further find that DC workload dynamics strongly shape energy dispatch, system flexibility, and emissions, while improved efficiency significantly reduces capacity needs and system peaks. While our findings suggest that net-zero targets for 2050 may be achieved, critical emission risks may appear in the intermediate years, and the EU may compromise its carbon-neutral goals unless policies adapt to this accelerating digital transformation.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on energy economics, power system planning, and emissions analysis regarding AI data centers in Europe. It does not discuss machine learning model architectures (Tokenizer, Visual Encoder, MLLM, MultiModal), model unification strategies, generative World Models, or reinforcement learning algorithms (model-based RL). Therefore, all provided keywords are completely irrelevant to the paper's content.
关键词
AI Data Centers, Energy Transition, Net-Zero Goals, Power System Planning, Emissions Overshoot, Optimization Model, Europe, Firm Power
摘要翻译
双服务器安全推断允许客户端查询托管的大语言模型(LLM),而无需泄露提示词或嵌入。近期基于函数秘密共享(FSS)的 GPU 系统使得线性层计算高效,但定点非线性及辅助操作仍是瓶颈,因为每个操作符通常被实现为定制协议,各自包含比较、环绕校正及预处理数据。我们提出 FuseFSS,这是一种编译器,它用单一的编译管道取代了针对每个操作符的协议设计。对于每个标量定点操作符,紧凑规范列出了其区间划分、低次算术片段以及所需的谓词位。该编译器在公共掩码值上生成两个批处理 FSS 评估:一个是打包比较,用于返回所有谓词位;另一个是向量区间查找,用于返回活动系数和常数。与当前最先进的基于 FSS 的 GPU 安全推断相比,FuseFSS 在保持精度的同时,实现了 $1.24\times$--$1.50\times$ 的端到端加速比,并在 BERT 和 GPT 风格模型上将在线通信量减少了 $9\%$--$16\%$;预处理开销也更小,密钥生成时间降低了 $14\%$--$23\%$,密钥尺寸减少了 $20\%$--$24\%$。
Abstract
Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a $1.24\times$--$1.50\times$ end-to-end speedup and reducing online communication by $9\%$--$16\%$ on BERT and GPT-style models; preprocessing is also lighter, with $14\%$--$23\%$ lower key-generation time and $20\%$--$24\%$ smaller keys.
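To make the operator specification described above concrete, here is a plaintext Python sketch of a piecewise low-degree operator: comparisons against interval boundaries yield predicate bits, the bits select the active interval, and that interval's coefficients are evaluated. This shows reference semantics only. There is no secret sharing, masking, or FSS key material, and the `OperatorSpec` layout and all names are hypothetical rather than FuseFSS's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorSpec:
    """Hypothetical plaintext spec for a scalar fixed-point operator."""
    breakpoints: tuple[int, ...]          # sorted interval boundaries
    pieces: tuple[tuple[int, ...], ...]   # per-interval coefficients, lowest degree first


def predicate_bits(spec: OperatorSpec, x: int) -> list[bool]:
    """One comparison per boundary (the role of the packed comparison)."""
    return [x >= b for b in spec.breakpoints]


def evaluate(spec: OperatorSpec, x: int) -> int:
    """Select the active interval, then evaluate its polynomial by Horner's rule."""
    index = sum(predicate_bits(spec, x))  # boundaries are sorted, so bits are monotone
    result = 0
    for coeff in reversed(spec.pieces[index]):
        result = result * x + coeff
    return result


# ReLU: 0 below the boundary, x at or above it.
RELU = OperatorSpec(breakpoints=(0,), pieces=((0,), (0, 1)))
# Clamp to [0, 6]: 0, then x, then the constant 6.
CLAMP6 = OperatorSpec(breakpoints=(0, 6), pieces=((0,), (0, 1), (6,)))

print(evaluate(RELU, -5), evaluate(RELU, 7))                             # 0 7
print(evaluate(CLAMP6, -2), evaluate(CLAMP6, 3), evaluate(CLAMP6, 10))   # 0 3 6
```

A new operator in this sketch is a new table of boundaries and coefficients, with no new evaluation code, which mirrors the abstract's point about replacing per-operator protocol design with one pipeline.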
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on secure inference optimization for text-based LLMs using Function Secret Sharing (FSS) and compiler techniques. The provided keywords pertain to multimodal architectures, world models, and reinforcement learning. There is no technical overlap regarding visual encoders, multimodal integration, world modeling, or RL algorithms, resulting in zero relevance to the specified research background.
关键词
Secure LLM Inference, Function Secret Sharing, Fixed-point nonlinearities, Compiler optimization, BERT and GPT-style models, Efficient inference, Two-server secure inference
摘要翻译
使用工具的大型语言模型(LLM)智能体面临两类截然不同的安全失效:未经授权的对外部操作以及运行时内部敏感明文的泄露,此时任何最终输出检查都无法介入。现有的防御机制通常仅保护其中一个边界,即规划器/运行时(planner/runtime)或动作汇点(action sink),因此它们本身无法同时保障这两个表面的安全。我们提出 SecureClaw,这是一种双边界架构,它在效果汇点(effect sink)处实施授权,并在读取边界(read boundary)处实施明文限制。敏感读取操作需通过可信网关,该网关用不透明句柄(opaque handles)替换原始值;在评估的部署中,受限摘要(bounded summaries)作为显式解分类接口。更改外部状态的写入操作遵循 PREVIEW→COMMIT 协议,其中只有可信执行者(trusted executor)可提交策略所授权的精确规范请求(canonical request)。运行时仍可在受限摘要和符号引用上进行规划,但无法直接解引用秘密或执行副作用。在 AgentDojo、AgentLeak 和 Agent Security Bench (ASB) 基准测试中,SecureClaw 是唯一在一个通用框架内同时保留可用任务效用并实现安全目标的防御机制:它在 ASB 上实现了 0% 的攻击成功率(ASR),在 AgentDojo 上实现了 0.64% 的 ASR,并在 AgentLeak 的攻击平行通道(parity lane,衡量最终输出和中继泄露)上实现了 3.23% 的整体泄露率。
Abstract
Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW$\rightarrow$COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0\% attack success rate (ASR) on ASB, 0.64\% ASR on AgentDojo, and 3.23\% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.
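The dual-boundary idea above can be sketched in a few lines. This is a toy illustration under stated assumptions, not SecureClaw's implementation: the class names `Gateway` and `Executor`, the `h_` handle format, and the use of a SHA-256 digest over sorted-key JSON as the "canonical request" are all hypothetical choices made for the example.

```python
import hashlib
import json
import secrets


class Gateway:
    """Trusted read boundary: the runtime receives opaque handles, never raw values."""

    def __init__(self):
        self._vault = {}

    def read(self, raw_value):
        handle = "h_" + secrets.token_hex(8)
        self._vault[handle] = raw_value
        return handle

    def is_handle(self, value):
        return isinstance(value, str) and value in self._vault

    def resolve(self, handle):
        # Intended to be called only by the trusted executor.
        return self._vault[handle]


def canonical_digest(request):
    """Hash a canonical (sorted-key, compact) JSON encoding of the request."""
    encoded = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


class Executor:
    """Trusted effect sink: commits only the exact request that was previewed."""

    def __init__(self, gateway, policy):
        self._gateway = gateway
        self._policy = policy
        self._authorized = set()

    def preview(self, request):
        if not self._policy(request):
            return None
        digest = canonical_digest(request)
        self._authorized.add(digest)
        return digest

    def commit(self, request, digest):
        if digest not in self._authorized or canonical_digest(request) != digest:
            raise PermissionError("request does not match an authorized preview")
        self._authorized.discard(digest)  # each authorization is single-use
        # Handles are dereferenced only here, inside the trusted executor.
        return {
            key: self._gateway.resolve(value) if self._gateway.is_handle(value) else value
            for key, value in request.items()
        }
```

The planner can build and preview a request that contains only a handle; a request altered after preview hashes differently and is rejected at commit.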
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
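评分理由: The paper's core contribution is a security defense architecture for LLM agents (SecureClaw) that addresses unauthorized actions and leakage of sensitive plaintext. The provided keywords all concern model unification, tokenizers, visual encoders, world models, multimodal large language models, and reinforcement learning, and have no direct technical connection to this paper's runtime security defense topic. For example, the paper does not involve visual encoding, multimodal representation, world-model learning, or reinforcement learning algorithms. The author list (Yuhan Ma, Stefan Schmid) does not include any of the five specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang), so no expert bonus applies. The weighted total score is 0, below the dynamic passing score of 27.8.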
关键词
SecureClaw, LLM Agents, Security Architecture, Dual-boundary, Plaintext Confinement, Authorization, Tool-using
摘要翻译
目的:大型语言模型(LLMs)日益常用于起草临床研究手稿,但其流畅性可能掩盖伪造的引用、偏离源表的数据以及未满足的报告指南条目。现有工具生成文本而不进行验证,其自我批评机制继承了导致自信伪造的盲点。本文描述了一种将生成与验证相结合的架构。方法:该设计基于三个原则:将工作流分解为自包含的技能,通过“失败即停”机制拦截每个阶段转换,并以最简充分机制解决每个完整性问题——在足够时使用确定性、可重新执行的检查,仅在不可避免解释时使用文本级探针。这种可能时的确定性拆分(Determinism-where-possible split),组织为完整性门限分类法(integrity-gate taxonomy),是本文的核心贡献。它被实现为 MedSci Skills,一个由单一协调器协调的、包含 43 个技能的开源工具包,其确定性层级包含 21 个标准库检测器。我们在三个可复现的公共数据集管道(STARD、PRISMA、STROBE)以及一项种子缺陷消融实验上对其进行了评估。结果:在三个管道中,所有内容哈希清单均验证通过,且门限揭示了真实缺陷。针对 27 个相同的注入缺陷,确定性门限全部检出 27 个,且在匹配的干净实例上无假阳性;而通用单提示 LLM 评审器仅检出 11 个,其遗漏主要集中在生成代码、参考文献内部及风格缺陷,这些缺陷文本并未暴露。结论:可能时的确定性验证产生一条可审计、可重新执行的轨迹,暴露了人类检查 LLM 辅助手稿所需的证据——即可行性和可复现性证据,而非宣称具有人类竞争水平的质量(后者由单独的盲法研究解决)。MedSci Skills 采用 MIT 许可证并已归档(v3.8.0)。
Abstract
Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification. Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism -- a deterministic, re-executable check where one suffices, and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills coordinated by one orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Results. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, its misses concentrated in generated-code, bibliography-internal, and style defects the prose does not expose. Conclusion. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript -- feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
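The deterministic, re-executable check described above can be illustrated with a content-hash manifest and a halt-on-failure gate. This is a minimal sketch under assumptions, not code from MedSci Skills: the function names, the `GateFailure` exception, and the choice of SHA-256 are hypothetical.

```python
import hashlib


class GateFailure(Exception):
    """Raised to halt the pipeline when an integrity gate does not pass."""


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts):
    """Map each artifact name to the hash of its bytes."""
    return {name: content_hash(data) for name, data in artifacts.items()}


def verify_manifest(artifacts, manifest):
    """Return the sorted names whose content is missing or no longer matches."""
    return sorted(
        name
        for name, expected in manifest.items()
        if name not in artifacts or content_hash(artifacts[name]) != expected
    )


def gate(artifacts, manifest):
    """Halt-on-failure: refuse the stage transition if any artifact drifted."""
    mismatches = verify_manifest(artifacts, manifest)
    if mismatches:
        raise GateFailure("artifacts changed since manifest: " + ", ".join(mismatches))
```

Because the check depends only on the bytes and the recorded hashes, re-running it on the same inputs gives the same verdict, which is what makes the trail auditable.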
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
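评分理由: The paper focuses on a verification architecture for LLM-assisted clinical manuscript preparation (MedSci Skills), emphasizing deterministic checks and an auditable trail. The provided keywords (visual encoder, world models, reinforcement learning, multimodal, etc.) concern model architecture and reinforcement learning and have no direct technical connection to this paper's software-verification topic, so relevance is very low.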
关键词
Deterministic Integrity Gates, LLM-Assisted, Clinical Manuscript Preparation, Auditable Architecture, MedSci Skills, Verification, Biomedical Informatics
摘要翻译
大语言模型(LLMs)最近在形式化证明基准上取得了显著成果。然而,现有的评估仍高度集中于竞赛风格的问题,往往无法捕捉模型在更长、依赖关系更丰富的数学推导中的表现。我们引入了 TheoremBench,这是一个基于 Lean4 的基准,旨在评估超越竞赛场景的定理证明器。该基准基于近一百个经典定理构建,并以两种互补形式发布:一种是基础主版本,每个实例包含一个目标定理;另一种是前提版本,将每个定理扩展为相关的证明任务结构化集合,包括主定理以及自动提取的支持性子定理。这种设计不仅允许评估最终定理是否被从头证明,还允许通过定理的内部证明结构评估部分进展。我们的实验表明,显式前提显著提高了支持 Lean4 的证明器模型的性能。为了提供全面评估,我们引入了定理级覆盖率和词元效率指标,以揭示证明行为中的定性差异。结果表明,当前证明器仍强烈偏向于容易的子定理,并且往往通过长而低效的策略轨迹来解决定理,而不是紧凑的证明计划。因此,TheoremBench 提供了对形式化推理能力更精细的视角,并强调了结构基准设计在评估 Lean4 定理证明器方面的重要性。
Abstract
LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
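评分理由: The paper focuses on a benchmark for formal mathematical theorem proving (TheoremBench), designed to evaluate the reasoning ability of LLMs in the Lean4 environment. The provided keywords concern multimodal learning, world models, reinforcement learning, and specific model components (such as Tokenizer and Visual Encoder), none of which relate directly to the paper's core contribution (benchmark design for mathematical reasoning), so every keyword receives a relevance score of 0.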
关键词
TheoremBench, Formal Mathematics, Theorem Proving, Lean4, LLM Evaluation, Proof Structure, Reasoning Ability
摘要翻译
检索增强生成(RAG)通过将相关文档注入大语言模型(LLM)查询中,以提高响应质量。这种注入增加了提示长度,并延长了首次令牌时间(TTFT)。与标准查询不同,RAG 查询具有上下文重用的独特属性,即相同的文档会在用户查询中反复出现。因此,为每个 RAG 查询完全重新计算文档会导致冗余计算,并增加 TTFT。先前工作离线预计算 RAG 文档的 KV 张量,并在在线预填充期间粗略地重新计算部分令牌。然而,由于高延迟磁盘传输,这种 KV 复用在现代 GPU 上通常比完全重新计算更慢。此外,这种粗粒度的重新计算会降低准确性。为了解决这些局限性,本文提出了 SIFT(Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance):一种通过利用注意力不变性实现 RAG 预填充快速计算的索引选择方法。SIFT 离线处理文档,并为每个文档提取高注意力分数的细粒度位置。接下来,我们识别出以下注意力不变性见解,使我们在运行时能够利用提取的位置:(1)局部注意力不变性:文档内高注意力分数的位置相对于周围文档保持不变。这有助于我们预测文档在进行自注意力关注时高分数的位置。(2)交叉注意力一致性:具有高文档内注意力的键也会吸引来自后续文档的交叉注意力。这有助于我们预测文档关注后续文档时高分数的位置。关键地,SIFT 不存储 KV 数据,仅以两个紧凑位向量的形式存储高分数的位置。SIFT 的存储量比 KV 张量小多达 24,000 倍,从而避免了昂贵的磁盘传输。在预填充期间,SIFT 仅对标记位置计算注意力,使 TTFT 加速 1.71 倍,同时保持准确性与完全重新计算的结果相差在 1% 以内。
Abstract
Retrieval-Augmented Generation (RAG) injects LLM queries with relevant documents to improve response quality. This injection increases prompt length and slows time to first token (TTFT). Unlike standard queries, RAG queries have a unique property of context reuse where the same documents recur across user queries. Thus, fully recomputing documents for every RAG query does redundant compute and increases TTFT. Prior works precompute KV tensors of RAG documents offline and coarsely recompute some tokens during online prefill. However, such KV reuse is often slower than full recomputation on modern GPUs due to high-latency disk transfers. Further, such a coarse-grained recomputation degrades accuracy. To address these limitations, this paper proposes SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance. SIFT processes documents offline and extracts fine-grained locations of high attention scores for each document. Next, we identify the following attention invariance insights that enable us to exploit the extracted locations during runtime: (1) Local-Attention Invariance: The locations of high attention scores within a document remain invariant to surrounding documents. This helps us predict the location of high scores where the document attends to itself. (2) Cross-Attention Consistency: Keys with high intra-document attention also attract cross-attention from subsequent documents. This helps us predict the location of high scores where the document attends to future documents. Critically, SIFT stores no KV data and only stores locations of high scores in the form of two compact bit vectors. SIFT's storage is up to 24,000x smaller than KV tensors, obviating costly disk transfers. During prefill, SIFT computes the attention only for the marked locations and improves TTFT by 1.71x while holding accuracy within 1% of full recompute.
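The core mechanism above (store only the positions of high attention scores as a bit vector, then attend only to those positions) can be sketched in plain Python. This is a simplified illustration under assumptions, not SIFT itself: it uses a single bit vector for one query rather than the paper's two bit vectors, picks positions by a simple top-k rule, and all function names are hypothetical.

```python
import math


def pack_positions(positions):
    """Encode a set of key positions as one integer bit vector."""
    bits = 0
    for p in positions:
        bits |= 1 << p
    return bits


def unpack_positions(bits):
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def scores_for(query, keys):
    """Scaled dot-product scores of one query against each key."""
    scale = math.sqrt(len(query))
    return [sum(q * k for q, k in zip(query, key)) / scale for key in keys]


def top_k_positions(query, keys, k):
    """Offline step (assumed rule): keep the k highest-scoring key positions."""
    s = scores_for(query, keys)
    ranked = sorted(range(len(keys)), key=lambda i: s[i], reverse=True)
    return sorted(ranked[:k])


def attention(query, keys, values, marked_bits=None):
    """Softmax attention over all keys, or only over the marked positions."""
    idx = list(range(len(keys))) if marked_bits is None else unpack_positions(marked_bits)
    s = scores_for(query, [keys[i] for i in idx])
    peak = max(s)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in s]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][c] for w, i in zip(weights, idx)) / total for c in range(dim)]
```

The bit vector costs one bit per position regardless of the hidden size, whereas cached KV tensors scale with the hidden size and layer count, which is the intuition behind the storage gap the abstract reports.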
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on optimizing RAG prefill latency via attention invariance (SIFT), which is unrelated to multimodal learning, world models, model-based RL, tokenizers, or visual encoders. Thus, all provided keywords receive a score of 0. No expert authors from the specified list are found. Total weighted score is 0, below the dynamic passing threshold of 27.8.
关键词
Retrieval-Augmented Generation, Attention Invariance, Prefill Optimization, Time To First Token, LLM Inference, Selective-Index, KV Tensors, Document Reuse
摘要翻译
污水流感监测可在临床报告之前揭示社区传播,但仅凭污水数据并非人类负担的完全可识别代理。现有污水模型假设证据集是固定的,而通用证据获取方法则将官方监测流视为可互换且昂贵的特征。我们将污水优先流感监测建模为一个选择性决策问题:基于强制性污水证据,系统需判断污水数据是否充足,决定下一步查询哪个延迟的官方监测流,以及在来源模糊性下何时放弃(Abstention)是唯一科学上可辩护的行动。我们提出贝叶斯选择性潜在推断(BSLI),这是一种严谨的贝叶斯方法,它维护关于潜在负担(latent burden)和可识别性(identifiability)的后验分布(posterior),通过明确的科学门限(scientific gates)保证可回答性,并利用精确的成本校准的 Bellman 策略(cost-calibrated Bellman policy)优化查询 - 停止决策。我们证明了关键的变分(variational)、可回答性(answerability)、Bellman 最优性(Bellman-optimality)以及一维成本校准(one-dimensional cost-calibration)性质。在包含 5,933 个预测时段(forecasting episodes)和 3,102 个来源模糊时段(source-ambiguity episodes)的固定公共数据基准上,BSLI 改善了匹配预算的成本 - 性能前沿(cost-performance frontier),同时在来源模糊性下保持保守的放弃。
Abstract
Wastewater influenza surveillance can reveal community circulation before clinical reporting, but wastewater alone is not a fully identifiable proxy for human burden. Existing wastewater models assume a fixed evidence set, while generic evidence-acquisition methods treat official surveillance streams as interchangeable costly features. We cast wastewater-first influenza monitoring as a selective decision problem: starting from mandatory wastewater evidence, the system must decide whether wastewater is sufficient, which delayed official stream to query next, and when abstention is the only scientifically defensible action under source ambiguity. We propose Bayesian Selective Latent Inference (BSLI), a principled Bayesian method that maintains a posterior over latent burden and identifiability, certifies answerability through explicit scientific gates, and optimizes query-stop decisions with an exact cost-calibrated Bellman policy. We prove the key variational, answerability, Bellman-optimality, and one-dimensional cost-calibration properties. On a fixed public-data benchmark with 5,933 forecasting episodes and 3,102 source-ambiguity episodes, BSLI improves the matched-budget cost-performance frontier while preserving conservative abstention under source ambiguity.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
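评分理由: The paper focuses on Bayesian inference and decision theory for wastewater influenza surveillance and is entirely unrelated to the provided AI/MLLM keywords (Tokenizer, Visual Encoder, MLLM, Unify Models, World Models, MultiModal). Although it uses a Bellman policy for query-stop decisions, this is a Bayesian optimal stopping problem, not the model-based RL architecture the keyword refers to. The author list (Yixuan Zhang, Yang Song, et al.) does not include the specified experts (Yang Shi, Xuanyu Zhu, et al.), so no bonus applies. The weighted total score is 0.0, far below the dynamic passing score of 27.8.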
关键词
Bayesian Selective Latent Inference, Wastewater Influenza Monitoring, Decision Problem, Bellman Policy, Cost-calibrated, Source Ambiguity, Public-data Benchmark
摘要翻译
在物理人工智能(Physical AI)时代,机器人中间件面临着新的角色定位。学习策略、规划器以及视觉 - 语言 - 动作(VLA)模型如今已作为控制路径上的因果参与者进入部署机器人,但将它们与时序、调度及网络集成在一起的层级尚未被命名。近期关于语言智能体的研究将这一层级称为 harness(中介系统),即中介工具、管理状态、限制资源并记录执行过程的外部系统。机器人社区尚未采纳这种框架,我们提议机器人中间件正是这一 harness。物理人工智能 harness 与软件 harness 的区别在于其介入的位置不同。软件 harness 仅在工具调用边界处进行中介。而物理人工智能 harness 必须在控制、计算和通信三个层面同时进行中介,因为学习策略的输出跨越了这三个层面:其指令改变轨迹,其推理时间改变调度,其负载改变带宽。机器人中间件是拥有这三个层面中介抽象的最低机器人堆栈层级,因此它最具备组合其执行力的优势。它已提供了 harness 所需的大部分功能,但缺乏针对 AI 模型的执行力。我们将这种缺失的执行力定义为三个功能:Projection(投影)在输出产生时对每个输出进行门控,Isolation(隔离)界定模型的执行和传输时隙,Transfer(转移)在检查失败时回退至验证过的基线。目前,这些功能均以手工构建的应用代码形式存在于部署机器人系统中,构建于机器人中间件已提供的接口之上。机器人中间件应托管这些功能,并非作为单一轴线的最佳执行者,而是作为组合这三者的层级。我们将此构想描绘为一种 ROS 2 Harness Profile(配置文件),这是一种部署工件,它承载 AI 模型声明的输出区域、推理预算及运行模式,同时中间件在 ROS 2、DDS 和 Zenoh 上强制执行这些约束。
Abstract
Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.
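The three functions named above (Projection, Isolation, Transfer) can be illustrated as a single control-loop step. This is a toy sketch under assumptions, not the proposed ROS 2 Harness Profile and not a ROS 2 API: Projection is modeled as clamping to a box-shaped output region, Isolation as an after-the-fact inference-time budget check rather than true preemption, and all names are hypothetical.

```python
import time


def project(command, low, high):
    """Projection: clamp each command component into the declared output region."""
    return [min(max(c, lo), hi) for c, lo, hi in zip(command, low, high)]


def harness_step(policy, baseline, observation, low, high, budget_s, clock=time.monotonic):
    """Run one step of a learned policy under a harness.

    Returns (command, source), where source is "policy" or "transfer".
    """
    start = clock()
    try:
        command = policy(observation)
    except Exception:
        # Transfer: the model failed, so fall back to the verified baseline.
        return baseline(observation), "transfer"
    if clock() - start > budget_s:
        # Isolation (post-hoc in this sketch): the inference budget was exceeded.
        return baseline(observation), "transfer"
    return project(command, low, high), "policy"
```

A real middleware layer would enforce the budget with scheduling and transport controls rather than a timestamp comparison; the sketch only shows how the three checks compose around one policy call.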
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: Scoring failed: Expecting ',' delimiter: line 13 column 27 (char 434)
摘要翻译
本报告探讨了在英国国防领域实施 JSP 936 第 1 部分以进行 AI 保障所面临的实际挑战。通过对该指令要求进行结构化解释性审查,分析确定了八个主题挑战领域,即证据与论证的充分性、人机交互管理、作战环境定义、系统之系统中的 AI 集成、AI 性能评估与维护、安全与安保分析、伦理度量以及缓解 AI 的内在复杂性。报告认为 JSP 936 提供了有用的治理基础,但实施取决于尚未解决的技术、组织和保障问题。这些挑战源于 AI 赋能系统的社会技术性质、现实世界部署情境中的不确定性、当前保障方法论的局限性,以及性能、安全、人类监督、安保与伦理可接受性之间的张力。报告确定了需要进一步方法、指导和组织能力的领域,以便在整个国防领域实现雄心勃勃、安全且负责任的 AI 应用。这与 MOD 对 JSP 936 的界定一致,即要求迭代实施并提供支持性指导。
Abstract
This report examines practical challenges in operationalising JSP 936 Part 1 for AI assurance in UK Defence. Using a structured interpretive review of the directive's requirements, the analysis identifies eight thematic challenge areas: adequacy of evidence and argument, management of human interaction with AI, definition of the operational environment, integration of AI within systems of systems, assessment and maintenance of AI performance, analysis of safety and security, measurement of ethicality, and mitigation of the inherent complexities of AI. The report argues that JSP 936 provides a useful governance basis, but that implementation depends on unresolved technical, organisational, and assurance questions. These challenges stem from the socio-technical nature of AI-enabled systems, uncertainty in real-world deployment contexts, limitations in current assurance methodologies, and tensions between performance, safety, human oversight, security, and ethical acceptability. The report identifies areas where further methods, guidance, and organisational capability are needed for the ambitious, safe, and responsible adoption of AI across Defence. This is consistent with MOD's own framing of JSP 936 as requiring iterative implementation and supporting guidance.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on policy, governance, and operational challenges of AI assurance in UK Defence (JSP 936), whereas the provided keywords pertain to technical deep learning architectures and methodologies (e.g., Tokenizer, Visual Encoder, World Models, MLLM). There is no discussion of model architectures, tokenization, visual encoders, world models, multimodal learning structures, or model-based reinforcement learning algorithms in the paper. Thus, all keyword scores are 0. No expert authors from the specified list are found.
关键词
AI Assurance, UK Defence, JSP 936, Operational Challenges, Governance, Safety and Security, Ethicality, Implementation
摘要翻译
我们呈现了一项民族志研究,探讨一种替代性的数据工作方法,该方法由一个公民科技倡议(civic-tech initiative)开发,该倡议致力于构建用于训练和基准测试在线安全系统的数据集。他们旨在从女性主义视角回应在线安全关切,通过与最受在线伤害影响的人群合作构建安全数据集。在本文中,我们探讨这种方法如何旨在将数据工作重新定位为修复与补救的场所,并追踪他们在过程中所遭遇的困境与斗争。具体而言,我们关注推进数据工作公正报酬及 AI 数据集集体治理过程中所面临的挑战与张力。我们透过基于科学技术研究(STS)视角的修复性正义与修复框架审视这些挑战,认为修复数据工作(乃至 AI)的根本,在于重置问责关系。在当前高度强调通过安全评估(safety evaluations)和红队测试(red teaming)等举措使 AI 更具责任感的背景下,我们强调有必要直面基础性问题:参与这些努力的人类如何与他们共同构建的数据集及系统相关联。修复性视角要求我们中断现有的数据工作规范,并将核心置于那些因当前数据集生产模式中的忽视、监管和排斥而遭受最大伤害的人群,而非 AI 或数据集本身。我们认为,这提供了一种关于责任的大胆愿景,并为构建数据和 AI 实践的替代未来贡献了一份批判性议程。
Abstract
We present an ethnographic study of an alternative approach to data work, developed by a civic-tech initiative that builds datasets for training and benchmarking online safety systems. They aim to respond to online safety concerns from a feminist perspective, by building safety datasets collaboratively with those most impacted by online harms. In this paper, we examine how this approach aims to reorient data work as a site for repair and redress, and trace the struggles they encounter in the process. Specifically, we draw attention to the challenges and tensions involved in advancing just reward for data work and collective governance of AI datasets. Examining these challenges through an STS-informed lens of reparative justice and repair, we argue that the work of repairing data work (and AI) lies, fundamentally, in resetting the ties of accountability. At a time of heightened emphasis on efforts like safety evaluations and red teaming to make AI more responsible, we highlight the need to confront foundational questions about how the humans involved in these efforts relate to the datasets and systems they help produce. A reparative lens demands that we interrupt prevailing norms of data work and place at their centre, not AI or datasets, but those most harmed by the neglect, oversight and exclusion animated in the current modes of dataset production. This, we argue, offers a bold vision for responsibility and contributes towards a critical agenda for building alternative futures of data and AI practice.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper is an ethnographic study on the social and ethical dimensions of data labor, focusing on reparative justice and accountability. The provided keywords pertain to technical machine learning architectures (e.g., World Models, Encoders, Tokenizers, Model-Based RL). There is no overlap between the paper's sociological focus and the technical AI keywords, resulting in a relevance score of 0 for all.
关键词
Data Work, Reparative Justice, Online Safety, Civic-Tech, Feminist Perspective, Accountability, Collective Governance, Ethnographic Study
摘要翻译
特征交互驱动了机器学习模型的大部分预测能力,然而现有的解释方法仅能检测和量化交互,却无法揭示其函数形式,或者仅能可视化受限的交互类型。我们提出了一种基于代理的局部效应平滑交互分析(SAILS),这是一种模型无关的框架,通过拟合黑盒模型局部效应的可解释广义加性模型(GAM)代理来分析成对交互。对于感兴趣特征的每个区间,代理平滑项在导数层面隔离了交互成分,从而实现(i)基于平滑项显著性检验的启发式交互检测,(ii)将交互形式分类为线性、乘积可分离和非乘积可分离类型,以及(iii)为每种交互类型提供定制化的可解释可视化。我们通过受控模拟和实际任务经验性地验证了该框架,证明了其在成对交互上的有效性,但在强特征相关性和高阶交互下存在局限性。SAILS 填补了可解释人工智能(XAI)工具箱中的一个显著空白,超越了单纯的交互检测,转而刻画其函数形式。
Abstract
Feature interactions drive much of the predictive power of machine learning models, yet existing explanation methods only detect and quantify interactions without revealing their functional form, or visualize only restricted interaction types. We propose Surrogate-based Analysis of Interactions via Local effect Smooths (SAILS), a model-agnostic framework that analyzes pairwise interactions through interpretable generalized additive model (GAM) surrogates fitted to the local effects of a black-box model. For each interval of a feature of interest, the surrogate smooth terms isolate the interaction components at the derivative level, enabling (i) interaction detection through a heuristic derived from significance tests on smooth terms, (ii) interaction form categorization into linear, product-separable, and non-product-separable types, and (iii) tailored, interpretable visualizations for each interaction type. We empirically validate the framework through controlled simulations and a real-world task, demonstrating its effectiveness for pairwise interactions, with limitations under strong feature correlations and higher-order interactions. SAILS fills a notable gap in the XAI toolbox, going beyond detection of interactions alone to characterizing their functional form.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on Explainable AI (XAI) and feature interaction analysis using GAM surrogates, which is unrelated to the provided keywords concerning Multimodal Large Language Models (MLLM), World Models, Reinforcement Learning architectures (Tokenizer, Visual Encoder), and Model Unification. None of the listed expert authors are present in the author list.
关键词
Surrogate-based Analysis, Feature Interactions, Local Effect Smooths, Generalized Additive Model, Explainable AI, Model-agnostic, Pairwise Interactions, Functional Form
摘要翻译
自主赛车领域通过深度强化学习(Reinforcement Learning, RL)取得了显著进展,主要集中于四轮车辆。然而,由于需要管理平衡和倾斜角度,且转向和油门控制更具反应性,加之重量更轻,摩托车引入了显著更大的复杂性。本文提出了一种框架,用于训练自主代理在 VRider SBK(一种基于 Unity 的物理精确摩托车模拟器)中驾驶超级摩托车进行比赛。我们的方法将软演员批评(Soft Actor-Critic, SAC)与自我节奏课程深度强化学习(Self-Paced curriculum Deep reinforcement Learning, SPDL)相结合,后者根据代理的表现动态生成逐渐更具挑战性的任务,无需手动设计课程。代理的状态空间包含本体感受特征,并扩展了倾斜角度历史,同时通过航点(course points)包含全局赛道特征。奖励信号被设计为鼓励沿赛道前进,同时对两轮动力学特有的导致不稳定的行为进行惩罚。初步实验结果表明,SPDL 在训练效率、圈速和驾驶稳定性方面均优于单独的 SAC,且在多种赛道和摩托车模型上均如此,从而确立了基于强化学习的自主摩托车赛车的首个基准。
Abstract
Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
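评分理由: The paper focuses on training an autonomous superbike racing agent with policy-based reinforcement learning (SAC) combined with curriculum learning, which falls under model-free RL. It does not involve unified models, tokenizers, visual encoders (in the MLLM sense), generative world models, multimodal large language models, or multimodal fusion. Although a simulator is used, no dynamics model is learned for planning, so the work is not model-based RL. All keyword relevance scores are therefore 0. The author list does not include the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). The weighted total score is 0, far below the dynamic passing score of 27.8.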
关键词
Autonomous Superbike Racing, Reinforcement Learning, Soft Actor-Critic, Self-Paced Curriculum, Unity Simulator, Proprioceptive Features, Reward Shaping, Dynamic Task Generation
摘要翻译
2026 年欧盟安全与可持续设计 (SSbD) 框架、企业可持续发展尽职调查指令 (CSDDD) 与碳边境调节机制 (CBAM) 的融合,为先进半导体制造设施(智能晶圆厂,Smart Fabs)引入了严重的治理瓶颈。监管合规需求已超出手动企业报告的能力,在多方利益相关者透明度与企业数据隐私之间产生了直接冲突。本文通过引入一个零信任社会技术编排框架来解决这一挑战,该框架在可信工业数据空间中实现了六层 SSbD 参考架构的落地。我们提议通过 Professional Proxies(专业代理)——在硬件隔离信任区内执行的基于角色的智能体工作流,实现从反应式自动化向自主治理的转变。该框架被构建为一个互操作性网络协议栈,它协调设施、工艺工程和财务代理团队之间自动化的五步“接力赛”,以使车间良率模型与宏观层面的可持续性指令对齐。通过在基于硬件的可信执行环境 (TEEs) 内执行虚拟计量 (VM) 预测和联邦机器学习 (FML),该架构解决了数据主权悖论,展示了晶圆厂如何通过国际数据空间 (IDS) 连接器导出密码学签名的合规令牌,而无需暴露专有工艺配方。最终,该框架为技术管理者提供了一条可验证的、基于证据的路径,通向具有韧性的、净零排放的工业 5.0 生态系统。
Abstract
The convergence of the 2026 European Union Safe and Sustainable by Design (SSbD) framework, Corporate Sustainability Due Diligence Directive (CSDDD), and Carbon Border Adjustment Mechanism (CBAM) introduces a severe governance bottleneck for advanced semiconductor manufacturing facilities ("Smart Fabs"). Regulatory compliance demands have surpassed the capacity of manual corporate reporting, creating a direct conflict between multi-stakeholder transparency and corporate data privacy. This paper addresses this challenge by introducing a zero-trust socio-technical orchestration framework that operationalizes a six-layer SSbD reference architecture within trustworthy industrial data spaces. We propose a shift from reactive automation to autonomous governance through "Professional Proxies": role-based agentic workflows executing within hardware-isolated trust zones. Structured as an interoperable network protocol stack, the framework coordinates an automated, five-step "relay race" between Facility, Process Engineering, and Finance proxy teams to align factory-floor yield models with macro-level sustainability mandates. By executing Virtual Metrology (VM) predictions and Federated Machine Learning (FML) inside hardware-rooted Trusted Execution Environments (TEEs), this architecture resolves the Data Sovereignty Paradox, demonstrating how fabs can export cryptographically signed compliance tokens via International Data Spaces (IDS) connectors without exposing proprietary process recipes. Ultimately, this framework provides technology managers with a verifiable, evidence-based pathway toward resilient, net-zero Industry 5.0 ecosystems.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
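评分理由: The paper mainly studies governance and data-privacy protection in semiconductor manufacturing, proposing a zero-trust framework based on industrial data spaces. The scoring keywords (MLLM, World Models, Tokenizer, etc.) all belong to the domain of multimodal large models and reinforcement learning architectures. The paper does not involve these specific AI model components or algorithms; it only touches on applications of machine learning to industrial governance (e.g., Federated Learning), which have no direct connection to the given keywords, so all relevance scores are 0. The author list does not include any of the specified experts.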
关键词
Smart Fabs, Professional Proxies, SSbD, Industrial Data Spaces, Trusted Execution Environments, Federated Machine Learning, Data Sovereignty, Zero-trust framework
摘要翻译
大规模机器学习(ML)的快速增长使得多 GPU 分布式训练成为现代 ML 系统的基本组成部分。随着模型规模和计算吞吐量的持续增长,通信开销已成为多 GPU 训练中的主要瓶颈,尤其是在计算与通信串行执行时。本研究探索了利用两种可移植运行时控制机制实现计算与集体通信(Collective Communication)的并发执行:针对计算核(Computation Kernel)的基于共享内存驱动的占用率塑造(Shared-Memory-Driven Occupancy Shaping)以及针对通信核(Communication Kernel)的提升调度优先级(Elevated Scheduling Priority)。该方法通过每块共享内存(Shared-Memory)分配来调节计算核(Computation Kernel)的驻留,从而为通信核(Communication Kernel)留出足够的片上资源(On-Chip Resources)以使其得以推进。此外,为通信流(Communication Stream)分配更高的优先级可确保一旦资源可用,通信进程保持稳定进展。在 NVIDIA A40、A100、H100 以及 AMD MI250X GPU 上的实验表明,所提出的方法实现了有效的计算 - 通信重叠(Computation-Communication Overlap),并将总执行时间减少了高达 25.5%,而无需修改供应商库(Vendor Libraries)或核实现(Kernel Implementations)。
Abstract
The rapid growth of large-scale machine learning (ML) has made distributed training across multiple GPUs a fundamental component of modern ML systems. As model sizes and computational throughput continue to increase, communication overhead has become a dominant bottleneck in multi-GPU training, particularly when computation and communication are executed sequentially. This work explores concurrent execution of computation and collective communication using two portable runtime controls: shared-memory-driven occupancy shaping for computation kernels and elevated scheduling priority for communication kernels. Our approach regulates computation-kernel residency through per-block shared-memory allocation, leaving sufficient on-chip resources for communication kernels to make progress. In addition, assigning higher priority to communication streams ensures steady communication progress once resources become available. Experiments on NVIDIA A40, A100, H100, and AMD MI250X GPUs demonstrate that the proposed method enables effective computation-communication overlap and reduces total execution time by up to 25.5 percent, without modifying vendor libraries or kernel implementations.
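The occupancy-shaping idea in the abstract can be illustrated with simple arithmetic. The sketch below is a simplified model under stated assumptions, not the authors' implementation: it treats shared memory as the only residency limiter (ignoring registers, thread limits, and allocation granularity), and the function names and the per-SM capacity used in the usage note are hypothetical.

```python
def resident_blocks(smem_per_sm: int, smem_per_block: int, hw_block_limit: int) -> int:
    """Blocks of one kernel resident on an SM if shared memory is the only limiter."""
    if smem_per_block <= 0:
        return hw_block_limit
    return min(hw_block_limit, smem_per_sm // smem_per_block)


def smem_request_for_cap(smem_per_sm: int, target_blocks: int) -> int:
    """Smallest per-block shared-memory request (bytes) that caps residency at target_blocks.

    floor(S / x) <= t holds exactly when x > S / (t + 1),
    so the smallest integer request is floor(S / (t + 1)) + 1.
    """
    if target_blocks < 1:
        raise ValueError("target_blocks must be at least 1")
    return smem_per_sm // (target_blocks + 1) + 1
```

With a hypothetical 102,400 bytes of shared memory per SM, `smem_request_for_cap(102400, 2)` asks for 34,134 bytes per block. Two compute blocks then occupy 68,268 bytes, and the remaining 34,132 bytes stay free for a communication kernel. The stream-priority control described in the abstract is not modeled here.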
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on distributed training systems optimization (computation-communication overlap), while the keywords relate to model architectures and learning paradigms (Multimodal, World Models, RL, Tokenizers). There is no direct technical overlap in methodology or subject matter.
关键词
multi-GPU, Computation-Communication Overlap, Distributed Training, Shared-memory, Scheduling Priority, Execution Time Reduction, ML Workloads
摘要翻译
近年来,基于移动边缘计算(MEC)的协作深度神经网络(DNN)推理已成为向资源受限的移动设备提供智能服务的极具前景的方法。一个典型场景是多用户协作边缘推理,在此场景中,不同设备独立划分其深度神经网络模型,并通过无线网络将后端计算卸载至共享边缘服务器。然而,由于存在未知且随时间变化的系统条件(包括波动的无线链路和多样化的设备能力),为每个设备确定最优的深度神经网络划分极具挑战性。为解决这一问题,我们提出了一种协作自学神经外科医生(CANS)框架,该框架是一种协作边缘推理方法,使设备能够在在线推理过程中通过共享信息反馈自适应地学习最优深度神经网络划分。为了应对设备异构性的挑战并更好地利用离线推理经验,我们整合了一种新颖的 FedLinUCB-DW 算法,该算法将同类设备分组,并利用本地离线的早期退出推理经验对在线探索进行预热。此外,我们通过推导遗憾上界,为 FedLinUCB-DW 算法提供了理论保证。同时,我们在模拟环境和硬件原型系统上验证了该方法。实证评估表明,与最先进的基线方法相比,CANS 实现了更低的推理延迟。特别是在两个边缘设备上的原型实验中,所提出的 CANS 相较于非合作基线方法,平均推理延迟降低了高达 50%。
Abstract
Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for delivering intelligent services to resource-constrained mobile devices. A representative scenario is multi-user collaborative edge inference, where distinct devices independently partition their DNN models and offload backend computation to a common edge server over wireless networks. However, determining the optimal DNN partition for each device is challenging due to unknown and time-varying system conditions, including fluctuating wireless links and diverse device capabilities. To address this problem, we propose Cooperative Autodidactic NeuroSurgeon (CANS), a collaborative edge inference framework that enables devices to adaptively learn optimal DNN partitions by sharing informative feedback during online inference. To handle the challenge of device heterogeneity and better leverage offline inference experience, we integrate a novel FedLinUCB-DW algorithm that groups devices of the same type and warm-starts online exploration using local offline early-exit inference experience. Furthermore, we provide theoretical guarantees for FedLinUCB-DW by deriving the regret upper bound. We also validate our method on both a simulated environment and a hardware prototype system. Empirical evaluations demonstrate that CANS achieves lower inference latency compared to state-of-the-art baselines. Especially, in prototype experiments on two edge devices, the proposed CANS reduced average inference latency by up to 50% compared to the non-cooperative baseline.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on Mobile Edge Computing (MEC) and collaborative DNN inference optimization using online learning algorithms (FedLinUCB-DW). It does not address multimodal data processing, large language models, world models, tokenization, visual encoders, or model-based reinforcement learning as defined in the provided background. Thus, there is no semantic overlap with the provided keywords.
关键词
Mobile Edge Computing, Collaborative DNN Inference, DNN Partitioning, Device Heterogeneity, FedLinUCB-DW, Online Inference, Inference Latency, Cooperative Autodidactic NeuroSurgeon
摘要翻译
我们证明,广泛部署的大语言模型(LLM)推理栈中存在一个隐写通道,该通道无需修改模型权重、采样代码或输出分布。该通道利用了确定性解码的结构性特征:逆变换采样中使用的伪随机数生成器(PRNG)产生一个依赖于种子的词元级概率区间序列,该序列仅凭生成的文本即可重构。发送者在生成前将秘密消息编码于 PRNG 种子中;接收者通过穷举搜索种子空间重构区间并恢复种子,从而恢复隐藏的有效载荷。我们形式化定义了两种操作模式。在已知提示场景中,发送者和接收者共享提示,通过强制对齐实现精确区间重构和完美种子恢复。在未知提示场景中,仅可获得生成的文本;近似区间重构结合最大命中计数评分策略仍允许从足够长的输出中可靠恢复。跨越六个模型系列和五个异质文本领域的广泛实验表明,在已知提示场景中,从完整的 2^32 候选空间中进行完整的 32 位种子恢复,根据模型和文本领域不同,最高可达 100% 准确率,且在单个 GPU 上仅需生成 300 个词元且耗时少于 35 秒。在未知提示场景中,恢复在 600-800 个词元时达到近乎完美的准确率,耗时约 12 秒。我们进一步分析了提示策略、分词歧义以及采样超参数对通道可靠性的影响。此外,我们讨论了研究结果的若干应用:首先,它允许进行 32 比特的隐写传输,但也表明将提示未知视为安全假设是不成立的。
Abstract
We demonstrate that widely deployed Large Language Model (LLM) inference stacks harbor a steganographic channel that requires no modification to model weights, sampling code, or output distributions. The channel exploits a structural property of deterministic decoding: pseudo-random number generators (PRNGs) used in inverse-transform sampling produce a seed-dependent sequence of token-level probability intervals that can be reconstructed from the generated text alone. A sender encodes a secret message in the PRNG seed before generation; a receiver reconstructs the intervals and recovers the seed, and thus the hidden payload, by exhaustive search over the seed space. We formalize two operational modes. In the known-prompt setting, sender and receiver share the prompt, enabling exact interval reconstruction and perfect seed recovery via forced alignment. In the unknown-prompt setting, only the generated text is available; approximate interval reconstruction combined with a maximum-hit-count scoring strategy still permits reliable recovery from sufficiently long outputs. Extensive experiments across six model families and five heterogeneous text domains show that, in the known-prompt setting, full 32-bit seed recovery from the complete 2^32 candidate space achieves up to 100% accuracy, depending on model and text domain, within 300 tokens and under 35 seconds on a single GPU. In the unknown-prompt setting, recovery reaches near-perfect accuracy at 600-800 tokens in about 12 seconds. We further analyze the influence of prompting strategies, tokenization ambiguities, and sampling hyperparameters on channel reliability. Finally, we discuss implications of our results: the channel allows for the steganographic transmission of 32 bits, and it also shows that ignorance of the prompt is not a valid security assumption.
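The known-prompt mode can be sketched on a toy scale. Everything below is an illustrative assumption rather than the paper's code: `toy_model` is a hypothetical four-token stand-in for an LLM, Python's `random.Random` stands in for the inference stack's PRNG, and the seed space is shrunk to 16 bits so the exhaustive search finishes quickly.

```python
import random
from itertools import accumulate

BASE_PROBS = [0.4, 0.3, 0.2, 0.1]  # toy 4-token vocabulary


def toy_model(prev_token: int) -> list:
    """Hypothetical stand-in for an LLM: next-token probabilities given the previous token."""
    shift = prev_token % len(BASE_PROBS)
    return BASE_PROBS[shift:] + BASE_PROBS[:shift]


def intervals(probs):
    """Half-open CDF intervals [lo, hi) per token; the last upper edge is pinned to 1.0."""
    edges = [0.0] + list(accumulate(probs))
    edges[-1] = 1.0
    return list(zip(edges[:-1], edges[1:]))


def generate(seed: int, n_tokens: int, first_context: int = 0) -> list:
    """Sender: inverse-transform sampling driven by a PRNG seeded with the payload."""
    rng = random.Random(seed)
    prev, tokens = first_context, []
    for _ in range(n_tokens):
        u = rng.random()
        tok = next(t for t, (lo, hi) in enumerate(intervals(toy_model(prev))) if lo <= u < hi)
        tokens.append(tok)
        prev = tok
    return tokens


def recover_seeds(tokens: list, seed_bits: int, first_context: int = 0) -> list:
    """Receiver: keep every candidate seed whose draws land in all reconstructed intervals."""
    matches = []
    for seed in range(2 ** seed_bits):
        rng = random.Random(seed)
        prev = first_context
        for tok in tokens:
            lo, hi = intervals(toy_model(prev))[tok]
            if not lo <= rng.random() < hi:
                break  # this seed cannot have produced the text
            prev = tok
        else:
            matches.append(seed)
    return matches
```

Calling `recover_seeds(generate(48879, 40), 16)` always returns a list containing 48879. Because every toy interval has width at most 0.4, a wrong seed is expected to survive all 40 checks only with negligible probability, which mirrors why longer outputs make recovery more reliable.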
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: Scoring failed with a parse error (Expecting ',' delimiter: line 12 column 188 (char 397)), so no rationale is available for this paper.
摘要翻译
超大规模(hyperscale)云网络基础设施带来了独特的运营挑战,传统的人工驱动故障响应难以跟上故障的规模、速度和复杂性。本文提出了一种面向大规模网络操作自主故障解决的智能体(agentic AI)架构。该系统采用多智能体(multi-agent)编排框架,专用 AI 智能体协作检测、诊断和修复网络故障,无需人工干预。我们阐述了架构原则,包括分层智能体分解、基于技能的工具调用(通过标准化协议)、从运维手册(operational runbooks)中进行的结构化知识编码、具有安全边界的渐进式自主性以及闭环验证。该架构已在一家主要云提供商的生产环境中部署,表明智能体(agentic AI)系统可在常见故障类别中实现超过 90% 的自主解决率,同时通过分层授权和回滚机制维持安全保证。我们还讨论了设计权衡、故障模式以及大规模运行自主智能体所获得的经验教训。
Abstract
Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms. We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
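评分理由: The paper discusses an agentic AI architecture and multi-agent orchestration for network operations, focusing on tool invocation, knowledge encoding, and safety boundaries. It does not involve multimodal representation learning, tokenizer design, visual encoders, world model construction, large language model architecture, or model-based reinforcement learning, so it is unrelated to all of the given keywords.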
关键词
Autonomous Incident Resolution, Hyperscale Network Operations, Agentic AI Architecture, Multi-agent Orchestration, Tool Invocation, Safety Boundaries, Closed-loop Verification
摘要翻译
我们通过引入导数的近似,将函数输入神经网络(FNN)的通用近似定理推广至可微映射。函数输入神经网络(FNN)将输入从可能的无限维加权流形映射至实值隐藏层,在该层上施加非线性标量激活函数,随后通过若干线性读出将输出映射至巴拿赫 (Banach) 空间。通过证明一个加权 Nachbin 定理,我们建立了可微映射的通用近似定理(UAT),该定理超越了通常在紧集上的表述,同时也包含了导数的近似。由此,我们获得了关于非预见泛函的近似结果,其中包括水平导数和垂直导数。作为进一步的应用,我们表明 signature (签名) 的线性函数能够近似路径空间泛函,包括其方向导数。
Abstract
We generalize the universal approximation theorem for functional input neural networks (FNN) to differentiable maps by including the approximation of the derivatives. A FNN maps the input from a possibly infinite-dimensional weighted manifold to the real-valued hidden layer, on which a non-linear scalar activation function is applied, and then returns the output into a Banach space via some linear readouts. By proving a weighted Nachbin theorem, we establish a universal approximation theorem (UAT) for differentiable maps, which goes beyond the usual formulation on compact sets and also includes the approximation of the derivatives. This leads us to approximation results for non-anticipative functionals including the horizontal and vertical derivatives. As a further application, we show that linear functions of the signature are able to approximate path space functionals including their directional derivatives.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
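评分理由: The paper belongs to theoretical mathematics and the foundations of machine learning, studying approximation theorems for functional input neural networks applied to differentiable maps on infinite-dimensional manifolds. The abstract involves mathematical concepts such as Banach spaces, a weighted Nachbin theorem, and signature functions, and does not touch on applied architectures such as multimodal large models, tokenizers, visual encoders, world models, or reinforcement learning. It is therefore unrelated to all of the given keywords and scores 0. The author list also contains none of the designated experts.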
关键词
Universal approximation, Differentiable maps, Infinite-dimensional manifolds, Functional input neural networks, Weighted manifold, Signature functions, Banach space
摘要翻译
我们研究具有固定读出(fixed readout)和二次损失(quadratic loss)的前馈(feed-forward)ReLU 网络。我们的目标并非主要将梯度下降(gradient descent)重写为权重空间(weight space)中的动力学,而是将其视为在训练集空间(training-set space)上定义的场项所封闭的集体动力学。对于单隐藏层,权重变量可从激活动力学(activation dynamics)中消除,从而得到由集体核(collective kernel)支配的残差(residuals)封闭方程,该集体核分解为输入几何矩阵(input-geometric matrix)和动力学共激活矩阵(dynamical co-activation matrix)。对于更深网络,残差动力学(residual dynamics)保留了清晰的层间核结构(layer-wise kernel structure)。然而,从深度三层起,封闭性需要层级化的权重诱导 Gram 算子(weight-induced Gram operators)来中介跨层的信息传输。
Abstract
We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on theoretical optimization dynamics (gradient descent) in feed-forward ReLU networks using Gram metrics. The provided keywords pertain to Multimodal Large Language Models, Reinforcement Learning, and specific model components (Tokenizer, Visual Encoder). There is no conceptual overlap between the theoretical neural network training analysis and the specified multimodal/RL topics, hence all keyword scores are 0. The author Claudio Nordio does not match any of the listed expert authors.
关键词
Feed-forward ReLU networks, Gradient descent dynamics, Weight-induced Gram metrics, Layer-wise kernel structure, Residual dynamics, Information transport, Deep network theory
摘要翻译
在量子硬件上训练参数化量子电路(PQCs)受限于梯度估计的测量成本,该成本在参数移位规则下随可训练参数的数量线性增长,并在大规模训练中主导了总的采样预算。在这项工作中,我们提出了一种基于自动微分前向模式的参数化量子电路(PQCs)前向梯度估计器框架,该框架通过平均可自由调节数量的随机方向导数产生梯度的无偏估计,并将 SPSA、随机坐标下降和参数移位规则作为极限情况恢复出来,且无需辅助量子比特或受控门开销。我们证明在标准假设下,随机量子前向梯度下降算法收敛,并给出了一个明确的二阶矩展开,该展开在 SPSA 的单方向极端与参数移位规则的满梯度极端之间进行插值。在此框架内,我们推导出 QUIVER(Quantum Iterative V-adaptive Estimator Rule),这是一种用于参数化电路的自适应优化器,其更新规则源于闭式的最小测量成本分配。数值实验表明,前向梯度在 ECG5000 和 MNIST 数据集上训练汉明权重保持正交量子神经网络(最多包含 60 个量子比特和 1770 个参数)时,其效率比参数移位规则高出几个数量级。我们还表明,所提出的 QUIVER 优化器在使用量子近似优化算法(QAOA)和变分量子本征求解器(VQE)进行量子模拟的优化问题上,能够优于 iCANS 和 gCANS 等低采样开销优化器。
Abstract
Training parameterised quantum circuits (PQCs) on quantum hardware is bottlenecked by the measurement cost of gradient estimation, which under the parameter-shift rule scales linearly in the number of trainable parameters and dominates the total shot budget of training at scale. In this work, we propose a framework of forward gradient estimators for PQCs, based on the forward mode of automatic differentiation, that yields an unbiased estimator of the gradient by averaging a freely tunable number of random directional derivatives and recovers SPSA, random coordinate descent, and the parameter-shift rule as limiting cases, with no ancilla qubits or controlled-gate overhead. We prove that stochastic quantum forward gradient descent converges under standard assumptions, with an explicit second-moment expansion that interpolates between the single-direction extreme of SPSA and the full-gradient extreme of parameter-shift. Within this framework we derive QUIVER (Quantum Iterative V-adaptive Estimator Rule), an adaptive optimiser for parameterised circuits whose update rule follows from a closed-form minimum measurement-cost allocation. We show numerically that forward gradients train Hamming-weight-preserving orthogonal quantum neural networks with up to 60 qubits and 1770 parameters on the ECG5000 and MNIST datasets orders of magnitude more efficiently than the parameter-shift rule. We also demonstrate that our proposed QUIVER optimiser can outperform iCANS and gCANS measurement-frugal optimisers on optimisation problems using the quantum approximate optimisation algorithm and quantum simulation with the variational quantum eigensolver.
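The core estimator can be sketched classically. The code below is a minimal illustration under stated assumptions, not the paper's method: `cost` is a hypothetical trigonometric stand-in for a circuit expectation value, directional derivatives are taken by central finite differences instead of forward-mode differentiation on hardware, and directions are Rademacher (±1) vectors, which gives an SPSA-like single-direction extreme when only one direction is used.

```python
import math
import random


def cost(theta):
    """Toy stand-in for a PQC expectation value (hypothetical, needs len(theta) >= 2)."""
    return sum(math.cos(t) for t in theta) + math.sin(theta[0]) * math.cos(theta[1])


def directional_derivative(f, theta, v, eps=1e-5):
    """Central-difference estimate of the derivative of f at theta along direction v."""
    plus = [t + eps * vi for t, vi in zip(theta, v)]
    minus = [t - eps * vi for t, vi in zip(theta, v)]
    return (f(plus) - f(minus)) / (2.0 * eps)


def forward_gradient(f, theta, directions):
    """Average of (directional derivative along v) * v over the supplied directions."""
    grad = [0.0] * len(theta)
    for v in directions:
        dd = directional_derivative(f, theta, v)
        for i, vi in enumerate(v):
            grad[i] += dd * vi / len(directions)
    return grad


def rademacher_directions(dim, count, rng):
    """Random +/-1 directions; they satisfy E[v v^T] = I, which makes the estimator unbiased."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(count)]
```

A call such as `forward_gradient(cost, theta, rademacher_directions(len(theta), 4, random.Random(0)))` uses four directional derivatives regardless of the number of parameters. Averaging over every sign pattern recovers the full gradient up to finite-difference error, which is one way to see the interpolation between the single-direction and full-gradient extremes.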
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
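评分理由: The paper is in quantum computing, specifically gradient estimation and optimisation algorithms (QUIVER) for parameterised quantum circuits (PQCs). The provided keywords (MLLM, Tokenizer, Visual Encoder, World Models, etc.) all belong to multimodal large models and reinforcement learning and have no direct overlap with the paper's quantum computing topic. The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang). All keyword relevance scores are therefore 0.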
关键词
Parameterised Quantum Circuits, Gradient Estimation, Forward Mode Automatic Differentiation, QUIVER, Measurement Cost, Quantum Neural Networks, Quantum Approximate Optimisation Algorithm
摘要翻译
我们精确刻画了将长度为 $T$ 的输入序列映射为单个输出的深度为 $L$、总参数量为 $W$ 的 Transformer(变换器)的 VC dimension(VC 维),建立了上界 $O(L W \log (T W))$ 以及几乎匹配的下界 $Ω(L W \log (T W / L))$。我们进一步精确刻画了使用此类 Transformer 进行 chain-of-thought(思维链)学习的样本复杂度,表明 teacher forcing(教师强制,即在训练数据上选择与整个 chain-of-thought 一致的预测器)的样本复杂度为 $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$,且任何使用 chain-of-thought 数据的学习规则至少需要 $Ω\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ 个样本,其中 $T$ 为输入长度,$T^{\prime}$ 为自回归步骤(autoregressive steps)。
Abstract
We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing that teacher forcing (i.e., selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O(L W \log ((T+T') W))$ and that any learning rule that uses chain-of-thought data requires at least $\Omega(L W \log ((T+T') W / L))$ examples, where $T$ is the input length and $T'$ is the number of autoregressive steps.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
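评分理由: The paper studies the theoretical sample complexity and VC dimension of Transformers, which falls under machine learning theory. The abstract does not mention unified models, tokenizer design, visual encoders, world models, multimodal large models, multimodal fusion, or model-based reinforcement learning, so it has no direct relevance to the given keywords. The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).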
关键词
Transformers, Sample Complexity, VC Dimension, Chain-of-Thought, Sequence Learning, Theoretical Bounds, Autoregressive Steps
摘要翻译
科学生成建模通常需要进行尺寸转移(size transfer),即在小型系统上训练并在大型系统上进行评估。尽管平移不变架构(translation-invariant architectures)使得这种评估成为可能,但我们指出,仅凭架构局部性(architectural locality)并不能保证稳定的尺寸外推(size extrapolation)。相反,稳定的外推由高斯平滑得分(Gaussian-smoothed score)的准局部性(quasi-locality)所决定。借助 Tweedie 公式 (Tweedie's formula),远处的扰动可通过后验协方差(posterior covariance)影响局部得分分量,这意味着局部模型仅在感受野(receptive field)覆盖平滑得分的响应范围时才能成功。我们形式化了这一机制,证明了在反向扩散(reverse diffusion)下局部边缘分布的尺寸一致比较定理。此外,我们还引入了有限深度局部流(Finite-Depth Local Flow, FDLF),这是一个具有精确得分、密度及可控响应范围的白盒诊断基准。经验上,我们验证了空间混合(spatial mixing)、平滑得分准局部性与模型感受野之间的相互作用。在空间混合条件下,平滑得分相对于感受野保持准局部性,从而实现稳定外推。反之,当空间混合减弱时,得分的局部性迅速退化,导致尺寸转移失败。
Abstract
Scientific generative modeling often requires size transfer, where models trained on small systems are evaluated on larger ones. While translation-invariant architectures enable this evaluation, we show that architectural locality alone does not guarantee stable size extrapolation. Instead, stable extrapolation is governed by the quasi-locality of the Gaussian-smoothed score. Through Tweedie's formula, far-away perturbations can influence local score components via posterior covariance, meaning a local model succeeds only if its receptive field covers the smoothed score's response range. We formalize this mechanism, proving a size-uniform comparison theorem for local marginals under reverse diffusion. We also introduce Finite-Depth Local Flow (FDLF), a white-box diagnostic benchmark with exact scores, densities, and controllable response ranges. Empirically, we validate the interplay between spatial mixing, smoothed-score quasi-locality, and model receptive fields. Under spatial mixing, the smoothed score remains quasi-local relative to the receptive field, enabling stable extrapolation. Conversely, when spatial mixing weakens, the score's locality rapidly degrades, causing size transfer to fail.
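The role of Tweedie's formula can be made explicit with the standard first- and second-order identities. The notation here is an assumption for illustration and may differ from the paper's: $x_\sigma = x_0 + \sigma z$ with $z \sim \mathcal{N}(0, I)$, and $p_\sigma$ denotes the density of $x_\sigma$.

$$
\mathbb{E}[x_0 \mid x_\sigma] = x_\sigma + \sigma^2 \nabla \log p_\sigma(x_\sigma),
\qquad
\nabla^2 \log p_\sigma(x_\sigma) = \frac{\operatorname{Cov}[x_0 \mid x_\sigma] - \sigma^2 I}{\sigma^4}.
$$

For two distinct sites $i \neq j$, the second identity gives $\partial_j \big(\nabla \log p_\sigma\big)_i = \operatorname{Cov}[x_{0,i}, x_{0,j} \mid x_\sigma] / \sigma^4$. The sensitivity of the score at site $i$ to a perturbation at site $j$ is therefore a posterior covariance, so the smoothed score is quasi-local only when these covariances decay with distance.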
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on theoretical analysis of local score-based generative models regarding size extrapolation in scientific systems. It does not involve multimodal large language models, tokenization, visual encoders, world models (in the RL/MLLM sense), or reinforcement learning. Therefore, all provided keywords are irrelevant (0 score). The author Wenjie Xi does not match the listed expert authors, so no bonus points are applied.
关键词
Local Score Models, Size Extrapolation, Gaussian-smoothed Score, Quasi-locality, Reverse Diffusion, Spatial Mixing, Diagnostic Benchmark, Finite-Depth Local Flow
摘要翻译
上下文队列带问题(Contextual Queueing Bandits)提供了一个框架,用于在未知的上下文相关服务率下学习调度异构作业。在随机上下文下,现有算法实现了 $\widetilde{\mathcal{O}}(T^{-1/4})$ 的队列长度遗憾(queue length regret),定义为学习者在时间范围 $T$ 上与预言机(oracle)队列长度之差的期望值。本文将此速率改进至 $\widetilde{\mathcal{O}}(T^{-1/2})$。关键观察在于,随机探索仅需进行到精心选择的截断轮次(cutoff round),而非贯穿整个时间范围。我们提出 CQB-$η$-2 算法,这是一个三阶段算法:(i) 纯随机探索以构建初始估计量;(ii) 结合 UCB 规则的 $η$-随机探索,以在保持负漂移的同时继续学习;(iii) 探索截断后的纯 UCB。我们的证明在截断轮次处对队列长度遗憾进行了分解。在截断之前,负漂移抑制了由次优选择引起的队列长度差异。在截断之后,前两个阶段提供了足够的随机探索样本,从而确保 UCB 决策产生的离开率差距(departure-rate gaps)较小。结合这两个界限可得队列长度遗憾的量级为 $\widetilde{\mathcal{O}}(T^{-1/2})$。我们进一步证明了量级为 $Ω(T^{-1/2})$ 的极小极大下界(minimax lower bound)。该证明构造了两个难实例(hard instances),直至最终服务决策前它们在统计上不可区分,并利用队列特定的耦合论证(coupling argument)将由此产生的检验误差转化为队列长度遗憾。综上所述,我们的上界和下界刻画了极小极大依赖关系关于时间范围 $T$ 的依赖程度(直至对数因子)。
Abstract
Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.
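The three-phase schedule can be sketched as a small decision rule. This is only an illustration of the phase structure named in the abstract: the function name and the parameters `warmup_end`, `cutoff`, and `eta` are hypothetical inputs, and the abstract does not state how the paper chooses them.

```python
import random


def exploration_mode(t: int, warmup_end: int, cutoff: int, eta: float, rng: random.Random) -> str:
    """Return which rule serves round t: 'random' exploration or the 'ucb' rule."""
    if t < warmup_end:
        return "random"  # phase (i): pure random exploration
    if t < cutoff and rng.random() < eta:
        return "random"  # phase (ii): explore with probability eta, otherwise UCB
    return "ucb"         # phase (ii) fallback and phase (iii): pure UCB after the cutoff
```

Passing the random generator in explicitly keeps the schedule reproducible, which makes it easier to check that no random exploration happens after the cutoff round.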
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
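评分理由: The paper studies contextual queueing bandits and the minimization of queue length regret, which belongs to operations research and stochastic control. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all point to multimodal large models, world models, and specific reinforcement learning architectures, and have no direct connection to the paper. Although bandits fall within RL, the paper does not involve model learning, world model construction, or multimodal representations, so every keyword relevance score is 0. The author list contains none of the designated experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).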
关键词
Contextual Queueing Bandits, Queue Length Regret, Minimax Lower Bound, UCB Rule, Service Rates, Job Scheduling, Stochastic Contexts, Three-phase Algorithm
摘要翻译
在开展营销活动时,零售商必须决定推广哪些产品以及针对哪些目标用户。这些决策是内在耦合的:有效的营销活动将具有强相互亲和力的用户与物品匹配到预设大小的不重叠组中。然而,现有方法要么假设预设的活动结构,要么将商品选择与用户分配解耦,无法直接从联合交互模式中发现活动分组。因此,我们将此营销活动问题形式化为自动定向(auto-targeting):联合选择用户与物品以构建多个互不重叠的活动组。为了解决这一组合问题,我们提出了三种互补策略:(i) 约束谱双聚类(biclustering),用于在用户 - 商品亲和力矩阵中寻找密集区域;(ii) 带成对交换的贪心局部搜索,用于组合优化;以及 (iii) 基于多臂老虎机(multi-armed bandit)的框架,通过探索来逃离局部最优解。我们在合成数据集、Amazon Reviews 基准数据集以及大规模专有商业数据上评估了这些方法,并将结果与模拟退火(simulated annealing)基线进行比较。结果表明,双聚类(biclustering)始终实现了最高的活动质量、提升率(lift)和公平性得分。尽管双聚类在较小数据集上运行效率较高,但在极大规模数据集上其运行时间显著增加,此时基于多臂老虎机(bandit-based)的方法则提供了一种可扩展的替代方案。
Abstract
When running marketing campaigns, retailers must decide which products to promote and which users to target. These decisions are inherently coupled: effective campaigns match users and items with strong mutual affinity into non-overlapping groups of predefined sizes. However, existing approaches assume predefined campaign structure or decouple item selection from user assignment, and cannot discover campaign groupings directly from joint interaction patterns. We therefore formalize this campaign problem as auto-targeting: jointly selecting users and items to construct multiple disjoint campaigns. To solve this combinatorial problem, we propose three complementary strategies: (i) constrained spectral biclustering to find dense regions in the user-item affinity matrix, (ii) greedy local search with pairwise swaps for combinatorial refinement, and (iii) a multi-armed bandit framework to escape local optima through exploration. We evaluate these methods on a synthetic dataset, the Amazon Reviews benchmarks, and large-scale proprietary commercial data, and compare the results to simulated annealing as a baseline. The results show that biclustering consistently achieves the highest campaign quality, lift, and fairness scores. While biclustering runs efficiently on smaller datasets, its runtime increases substantially on very large ones, where bandit-based methods instead offer a scalable alternative.
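Strategy (ii), greedy local search with pairwise swaps, can be sketched as follows. The objective (total affinity inside each campaign block), the first-improvement rule, and all function names are assumptions made for illustration; the abstract does not specify the paper's exact objective or swap order.

```python
from itertools import combinations


def total_score(aff, campaigns):
    """Sum of user-item affinities inside every campaign block."""
    return sum(aff[u][i] for users, items in campaigns for u in users for i in items)


def _swap(campaigns, a, b, side, x, y):
    """Move x from campaign a to b and y from b to a (side 0 = users, 1 = items)."""
    campaigns[a][side].remove(x)
    campaigns[a][side].add(y)
    campaigns[b][side].remove(y)
    campaigns[b][side].add(x)


def improve_once(aff, campaigns):
    """Apply the first pairwise swap that raises the score; return False if none exists."""
    base = total_score(aff, campaigns)
    for side in (0, 1):
        for a, b in combinations(range(len(campaigns)), 2):
            for x in sorted(campaigns[a][side]):
                for y in sorted(campaigns[b][side]):
                    _swap(campaigns, a, b, side, x, y)
                    if total_score(aff, campaigns) > base:
                        return True
                    _swap(campaigns, a, b, side, y, x)  # undo a non-improving swap
    return False


def local_search(aff, campaigns):
    """Repeat improving swaps until a local optimum; group sizes never change."""
    campaigns = [(set(users), set(items)) for users, items in campaigns]
    while improve_once(aff, campaigns):
        pass
    return campaigns
```

Swaps exchange one member for another, so the predefined group sizes and the disjointness of campaigns are preserved by construction. The search stops at a local optimum, which is the limitation that the exploration-based bandit strategy in the abstract is meant to address.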
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
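评分理由: The paper studies user-item allocation in e-commerce marketing using constrained spectral biclustering and multi-armed bandit algorithms. It belongs to operations optimization and recommender systems and is entirely unrelated to the research background defined by the keywords (multimodal large models, tokenizers, visual encoders, world models, and model-based reinforcement learning). The author list contains none of the designated experts.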
关键词
user-item allocation, marketing campaigns, constrained spectral biclustering, multi-armed bandit, auto-targeting, combinatorial problem, campaign groupings
摘要翻译
近期工作主张在隐私保护机器学习中采用高斯差分隐私(GDP)来报告隐私保证。我们通过匹配强敌手成员推断攻击在最坏情况下的成功率,基于三个指标(固定假正率 (FPR) 下的乘性优势、固定召回率下的精确率以及标准隐私概况),提供了从纯差分隐私 (DP) ε 到高斯差分隐私 GDP μ 的原则性映射。我们在常用参数范围内列出了 μ 值,并推荐 μ≈ε/5 作为一种保守的通用转换方案。
Abstract
Recent work argues for using Gaussian differential privacy (GDP) to report the privacy guarantees in privacy-preserving machine learning. We provide principled mappings from pure-DP $\varepsilon$ to GDP $\mu$ by matching the worst-case success of a strong-adversary membership inference attack in terms of three metrics: multiplicative advantage at fixed FPR, precision at fixed recall, and the standard privacy profile. We tabulate $\mu$ values across a useful range of parameters and recommend $\mu \approx \varepsilon/5$ as a conservative general-purpose conversion.
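The recommended conversion is easy to apply in code. In the sketch below, `mu_from_epsilon` encodes only the rule of thumb stated in the abstract, while `gdp_delta` uses the standard $(\varepsilon, \delta)$ privacy profile of a $\mu$-GDP mechanism from the GDP literature; that formula is brought in here as background and is not taken from this abstract. The function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def mu_from_epsilon(eps: float) -> float:
    """Rule-of-thumb conversion from pure-DP epsilon to GDP mu (mu is roughly eps / 5)."""
    return eps / 5.0


def gdp_delta(mu: float, eps: float) -> float:
    """delta(eps) for a mu-GDP mechanism: Phi(-eps/mu + mu/2) - e^eps * Phi(-eps/mu - mu/2)."""
    return normal_cdf(-eps / mu + mu / 2.0) - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0)
```

For example, a pure-DP guarantee with $\varepsilon = 5$ maps to $\mu = 1$ under the rule of thumb, and `gdp_delta(1.0, eps)` then traces the privacy profile of that GDP guarantee as `eps` varies.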
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
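评分理由: The paper focuses on parameter mappings for Gaussian differential privacy (GDP) and on privacy guarantees, covering the conversion from pure DP to GDP and evaluation via membership inference attacks. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all concern multimodal large models, representation learning, and reinforcement learning, and are entirely unrelated to the paper's topic. All keyword relevance scores are therefore 0, and the weighted total is 0, far below the dynamic passing score of 27.8. The author list contains none of the designated experts such as Yang Shi, so no bonus is applied.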
关键词
Gaussian Differential Privacy, Pure-DP, Membership Inference Attack, Privacy Profile, Parameter Mapping, Privacy Guarantees, Conservative Conversion
摘要翻译
随着可再生能源接入加剧市场波动性,概率电价预测(probabilistic electricity price forecasting)已成为有效风险管理的关键。然而,当前的恰当评分规则(proper scoring rules)往往优先考虑预测锐度(forecast sharpness),牺牲校准性(calibration),从而导致过度自信且统计上不可靠的不确定性估计。本研究凸显了理论评分与实际校准之间的关键差距,表明当可靠性被忽视时,模型可能沦为确定性预测(deterministic forecasts)的代理(proxy)。我们得出结论,未来研究必须转向校准感知(calibration-aware)目标和架构(architectures),以确保能源市场预测的分布完整性(distributional integrity)。
Abstract
As renewable energy integration increases market volatility, probabilistic electricity price forecasting has become essential for effective risk management. However, current proper scoring rules often prioritize forecast sharpness at the expense of calibration, leading to overconfident and statistically unreliable uncertainty estimates. This work highlights the critical gap between theoretical scoring and practical calibration, demonstrating that models can become mere proxies for deterministic forecasts when reliability is neglected. We conclude that future research must shift toward calibration-aware objectives and architectures to ensure the distributional integrity of energy market forecasts.
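The tension between sharpness and calibration can be checked with two simple diagnostics. The sketch below is an illustration only, not the paper's evaluation protocol: it measures empirical coverage of prediction intervals (a calibration check) and mean interval width (a sharpness measure), and the function names are hypothetical.

```python
def empirical_coverage(lower, upper, observed) -> float:
    """Fraction of observations falling inside their prediction interval."""
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lower, upper, observed))
    return hits / len(observed)


def mean_width(lower, upper) -> float:
    """Average interval width; smaller means sharper forecasts."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)
```

If intervals issued at a nominal 90% level cover only 60% of observed prices, the forecast is overconfident however narrow its intervals are. Reporting both numbers side by side makes that trade-off visible.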
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
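评分理由: The paper focuses on calibration challenges and scoring rules in probabilistic electricity price forecasting, which belongs to energy economics and statistical forecasting. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all concern multimodal large models, world models, and reinforcement learning architectures. The paper involves no multimodal components, tokenizers, visual encoders, world model construction, or reinforcement learning algorithms, so every keyword relevance score is 0. The author list also contains none of the designated experts (Yang Shi et al.). The weighted total is 0, below the dynamic passing score of 27.8.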
关键词
Probabilistic electricity price forecasting, Calibration challenges, Scoring rules, Uncertainty estimates, Risk management, Distributional integrity, Calibration-aware objectives
摘要翻译
大语言模型(LLM)因其深度和参数量规模而面临高昂的推理成本。深度剪枝可通过跳过冗余的 Transformer 层来降低延迟,但现有方法(i)在用户特定的计算预算下提供的控制有限,(ii)通常固定路由路径,无法在解码过程中随上下文增长而动态适应。本文提出 Buddy,一种基于预算的动态深度路由框架。Buddy 采用轻量级的决策模块(Decision Module),根据输入对中间层进行评分,并确定性执行前 k 层以满足给定预算。为支持解码时的动态适应,Buddy 重用第一层的 KV 缓存作为低开销的全局上下文源,并在每次路由决策前将其与最新的词元表示进行聚合。当未提供显式预算时,可选的预算预测器(Budget Predictor)会估计一个依赖于输入的计算量级别,以平衡质量与效率。在 Llama-family 和 Qwen 模型上的实验表明,Buddy 与强大的静态剪枝基线相当,通常能改善准确率与计算量的权衡,同时独特地支持严格预算控制、解码时重路由以及单个训练模型内的多预算处理。
Abstract
Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, failing to adapt as the context grows during decoding. We propose Buddy, a budget-driven dynamic depth routing framework. Buddy uses a lightweight Decision Module to score intermediate layers conditioned on the input and deterministically executes the top-k layers to satisfy a given budget. To support decode-time adaptation, Buddy reuses the first-layer KV cache as a low-overhead global context source and pools it together with the newest token representation before each routing decision. When no explicit budget is provided, an optional Budget Predictor estimates an input-dependent compute level to balance quality and efficiency. Experiments on Llama-family and Qwen models show that Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.
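The budgeted top-k routing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `select_layers` and `run_with_budget` are hypothetical names, the scores are taken as given rather than produced by a Decision Module, and layers are modeled as plain callables.

```python
def select_layers(scores, budget):
    """Return the indices of the `budget` highest-scoring layers, in depth order."""
    if not 0 <= budget <= len(scores):
        raise ValueError("budget must be between 0 and the number of layers")
    # Rank layer indices by score, highest first (ties keep the shallower layer first).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top-k and restore the original depth order for execution.
    return sorted(ranked[:budget])


def run_with_budget(x, layers, scores, budget):
    """Apply only the selected layers to x; skipped layers act as identity."""
    for i in select_layers(scores, budget):
        x = layers[i](x)
    return x


# Toy stand-ins for Transformer blocks (assumption: any callable works here).
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v * 10]
scores = [0.1, 0.9, 0.4, 0.7]
result = run_with_budget(1, layers, scores, budget=2)
```

Because the selection is a deterministic top-k, the number of executed layers always equals the budget, which is what makes strict budget control possible. Re-scoring before each decoding step would change `scores` and therefore the chosen path.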
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on LLM inference optimization via dynamic depth routing and budget control. It does not involve multimodal components (Visual Encoder, MultiModal, MLLM), tokenization strategies, world models, or model-based reinforcement learning, nor does it focus on unifying different model architectures. Thus, it has zero relevance to the provided keyword set.
关键词
Large language models, Inference cost, Depth pruning, Dynamic depth routing, Budget-driven, Decision Module, KV cache, Compute budgets
摘要翻译
概率预测模型日益通过机器学习方法训练,但它们所对比的基线往往较弱或被省略。我们表明,最简单的 conformal 区间——即包裹在有限样本 split-conformal 残差分位数中的最后值点预测,无需参数也无需训练——是一个远比近期学习预测和 conformal 时间序列比较中几乎完全缺失所暗示的更强的基线。在来自九个公共来源(Monash、LOTSA、LTSF 交通/电力/天气套件、METR-LA、BOOM、nips/probts)的 2,217 个真实序列上的单步在线预测中,该 ConformalNaive 区间显著优于 naive 值分位数基线、整个 NPTS 家族(NPTS 占 73%,SeasonalNPTS 占 64% 的序列)以及已发表的 Conformal Seasonal Pools (CSP) 方法(71% 的序列,bootstrap 95% 置信区间 [69,73],配对 Wilcoxon 检验 p 值约为 7.6e-135);它持平于更简单的学习 conformal 预测器(如 RCI、分位数回归;中值相对 Winkler 误差在 2% 以内),仅被自适应在线方法和集成方法(SPCI、ACI、AgACI)所超越,这些方法能够跟踪分布偏移,在相对 Winkler 误差上领先 9%-33%。其校准效果也优于训练好的神经预测器:在引入 DeepNPTS 的六个数据集上,该平凡基线在名义置信度为 95% 时覆盖真实值的比例为 84%-85%,而 DeepNPTS 仅为 66%。在多步季节性展望下,情况则相反:随机游走基线是最弱的方法,而季节性池(CSP)胜出——我们绘制了这一边界。最后,我们提出 ConformalNaive+,这是一种无需训练、仅需一行代码、具有展望自适应能力的选择器,它能在每个展望步长上选取两个互补基线中表现更优者,并恢复覆盖概率。我们认为,每当学习概率预测器声称取得增益时,相应的 conformal 平凡基线必须作为强制基线。
Abstract
Probabilistic forecasters are increasingly learned, yet the baselines they are compared against are often weak or omitted. We show that the simplest possible conformal interval - a last-value point forecast wrapped in a finite-sample split-conformal residual quantile, with no parameters and no training - is a far stronger baseline than its near-total absence from recent learned-forecasting and conformal-time-series comparisons would suggest. In one-step-ahead online forecasting across 2,217 real series from nine public sources (Monash, LOTSA, the LTSF traffic/electricity/weather suites, METR-LA, BOOM, nips/probts), this ConformalNaive interval decisively beats the naive value-quantile baselines, the entire NPTS family (NPTS 73%, SeasonalNPTS 64% of series), and the published Conformal Seasonal Pools (CSP) method (71% of series, bootstrap 95% CI [69,73], paired Wilcoxon p approx 7.6e-135); it is on par with the simpler learned conformal predictors (RCI, quantile regression; median relative Winkler within 2%) and is beaten only by the adaptive-online and ensemble methods (SPCI, ACI, AgACI), which track distribution shift and lead by 9-33% relative Winkler. It is also better calibrated than a trained neural forecaster: on the six datasets that introduced DeepNPTS, the trivial floors cover the truth 84-85% of the time at a nominal 95%, versus DeepNPTS's 66%. At multi-step seasonal horizons the picture inverts: the random-walk floor is the weakest method and the seasonal pool (CSP) wins - a boundary we map. Finally we give ConformalNaive+, a one-line, training-free, horizon-adaptive selector that attains the better of two complementary floors at every horizon with restored coverage. We argue the matching conformal naive floor must be a mandatory baseline whenever a learned probabilistic forecaster claims gains.
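The baseline described above is simple enough to sketch in a few lines. The version below is an illustration under assumptions rather than the paper's code: it uses absolute one-step residuals, a symmetric interval, and the whole history as the calibration set, and the function name is hypothetical.

```python
import math


def conformal_naive_interval(history, alpha=0.05):
    """Last-value point forecast wrapped in a split-conformal residual quantile."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    # Absolute residuals of the last-value (random-walk) forecast on past data.
    residuals = sorted(abs(b - a) for a, b in zip(history, history[1:]))
    n = len(residuals)
    # Finite-sample conformal rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few residuals the rank exceeds n and the interval is unbounded.
    q = residuals[k - 1] if k <= n else math.inf
    point = history[-1]
    return point - q, point + q


series = [1, 2, 4, 7, 11]
interval_50 = conformal_naive_interval(series, alpha=0.5)
interval_95 = conformal_naive_interval(series)
```

With only four residuals, the 95% interval is unbounded because the finite-sample rank exceeds the number of calibration points; the 50% interval uses the third-smallest residual. There is nothing to fit, which is why the abstract calls the method parameter-free and training-free.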
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on probabilistic time-series forecasting using conformal prediction methods, specifically advocating for a training-free baseline. The provided keywords relate to multimodal large models, tokenization, visual encoders, world models, and model-based reinforcement learning, which are completely unrelated to the statistical forecasting content of this paper. Thus, all keyword scores are 0. The author Valery Manokhin does not match any of the listed expert authors.
关键词
Probabilistic Time-Series Forecasting, Conformal Prediction, Training-Free Baseline, ConformalNaive, Last-Value Forecast, Coverage Probability, Learned Forecasters
摘要翻译
人类在进行灵巧操作时,依赖于具有高时间分辨率的、空间密集的、具备几何与力感知能力的触觉反馈。尽管基于视觉的触觉传感器能够实现密集力估计,但它们受限于相机帧率、运动模糊以及数据带宽。基于事件的光学触觉传感器(Event-based optical tactile sensors)提供了一种有吸引力的替代方案,具有微秒级时间分辨率和低运动模糊,但现有方法仅限于预测合力。我们提出了首个利用基于事件的光学触觉传感器进行密集三维力场重建的框架。我们的方法从事件数据中估计三维表面位移,并通过逆有限元方法(inverse Finite Elements Method, iFEM)将它们映射为力。剪切位移通过所提出的基于事件的标记跟踪算法恢复,而法向位移则由卷积神经网络(convolutional neural network)预测,该网络在收集的同步力 - 位移 - 事件数据集上进行训练。实验表明,该方法能够准确重建物理合理的力,在力范围高达 (4 N, 4 N, 20 N) 的情况下,平均绝对误差达到 (0.14 N, 0.10 N, 0.93 N),同时以平均 100 Hz 的频率运行。这项工作构成了实现机器人抓取和灵巧操作中高频控制所需密集力反馈的第一步。
Abstract
Humans rely on spatially dense, geometry and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Elements Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on dense force estimation using event-based optical tactile sensors and inverse Finite Elements Method (iFEM). It does not address Unify Models, Tokenizers, World Models, MLLMs, or Model-Based RL architectures as defined by the provided keywords. The research domain is robotics perception rather than large-scale multimodal foundation models or reinforcement learning frameworks. No authors from the specified expert list are present.
关键词
Dense Force Estimation, Event-based Optical Tactile Sensor, Inverse Finite Elements Method, 3D Surface Displacements, Robotic Grasping, Convolutional Neural Network, High-frequency Control
摘要翻译
Fokker-Planck 方程 (FPE) 在描述由随机动力学支配系统的概率密度函数 (PDF) 的时间演化中起着至关重要的作用。本文提出了一种基于条件归一化流 (conditional normalizing flow) 的物理信息神经网络 (PINN) 框架,用于高效近似 FPE 在各种初始条件下的解算子。借助马尔可夫随机过程的 Chapman-Kolmogorov 方程,该问题被重新表述为近似一个从初始时刻开始、以任意点为中心的 Dirac 质量 (Dirac mass) 转移概率密度函数。关联的线性化随机微分方程 (SDE) 的概率密度函数 (PDF) 被用作归一化流的基础分布,从而为目标 PDF 提供良好的近似,特别是在短时间情形下,进而避免了与 Dirac delta 初始分布相关的映射奇异性。此外,引入了一种时间加权损失函数,以缓解短时间情形下出现的数值不稳定性,随着时间推移在因果性和训练难度之间取得平衡。最后,进行了多种数值实验,以说明所提方法的有效性和鲁棒性。
Abstract
The Fokker-Planck equation (FPE) plays a pivotal role in describing the time evolution of probability density functions (PDFs) for systems governed by stochastic dynamics. In this work, we propose a conditional normalizing flow-based physics-informed neural network (PINN) framework for efficiently approximating the solution operator of the FPE for a whole range of initial conditions. Leveraging the Chapman-Kolmogorov equation for Markovian stochastic processes, the problem is reformulated into approximating a transition PDF starting at initial time from a Dirac mass centered at an arbitrary point. The PDF of an associated linearized stochastic differential equation (SDE) is employed as the base distribution for the normalizing flow, providing a good approximation of the target PDF, especially for small times, and thereby avoiding the singularity of the map associated with the Dirac delta initial distribution. Furthermore, a time-weighted loss function is introduced to mitigate numerical instabilities arising at small times, achieving a balance between causality and training difficulty as time progresses. A variety of numerical experiments are presented to illustrate the effectiveness and robustness of the proposed method.
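The reformulation in the abstract can be written out explicitly. The notation below (drift $f$, diffusion matrix $\sigma$, initial density $p_0$, Itô convention) is assumed for illustration and is not taken from the paper.

```latex
% Assumed SDE: dX_t = f(X_t)\,dt + \sigma(X_t)\,dW_t
\begin{align}
\partial_t p(x,t) &= -\nabla\cdot\big(f(x)\,p(x,t)\big)
  + \tfrac12 \sum_{i,j} \partial_{x_i}\partial_{x_j}
    \big([\sigma\sigma^\top]_{ij}(x)\,p(x,t)\big), \\
p(x,t) &= \int p(x,t \mid y,0)\,p_0(y)\,dy,
  \qquad p(x,0 \mid y,0) = \delta(x-y).
\end{align}
```

The second line is why a single learned transition density $p(x,t \mid y,0)$ suffices: the solution for any initial condition $p_0$ follows by integrating against it.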
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
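评分理由: The paper belongs to scientific computing and the solution of stochastic differential equations; its core is solving the Fokker-Planck equation with operator learning and normalizing flows. The provided keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) centers on multimodal large-model architectures and reinforcement learning. The paper does not involve multimodal understanding, generation, visual encoders, tokenizers, or reinforcement-learning planning, and has no substantive connection to the keyword set, so all keywords score 0. The weighted total is 0, far below the dynamic passing score of 27.8.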
关键词
Fokker-Planck equation, Operator learning, Normalizing flow, Physics-informed neural network, Initial conditions, Stochastic differential equation, Probability density function
摘要翻译
局部超流扩散 (HFD) 为一般子模超图中的种子聚类提供了与边大小无关的切格尔型保证,但现有的 HFD 求解器无法在每次迭代中保持中间计算的局部性。我们引入了阈值局部超流扩散 (TL-HFD),这是一种一阶方法,它在种子周围维持一个活跃区域,在该区域及其直接边界上执行投影次梯度更新,并通过阈值(top-k)边界激活进行扩展。我们证明了局部更新是精确的:限制在活跃区域及其边界上的度预条件投影次梯度步与无约束全局更新一致。我们确立了精确更新和阈值更新两者的有限时间对偶次优性,将后者视为具有显式跳过边界误差的不精确投影次梯度步。我们进一步推导了一个加性激活体积界,该界由实现的局部次梯度范数和新激活顶点中的最小边界推动量控制,并将具有局部支撑的对偶近似最优性转化为早停迭代的稳健扫描割保证。对于一般子模割成本,每次迭代在扫描区域上是局部的,且在超边原语上对预言机敏感。实验上,TL-HFD 通常匹配或优于 HFD,同时激活更少的体积,在噪声实例上增益最大,此时扩散倾向于吸收非目标顶点。
Abstract
Local Hyper-Flow Diffusion (HFD) gives an edge-size-independent Cheeger-type guarantee for seeded clustering in general submodular hypergraphs, but existing HFD solvers do not keep intermediate computation local at every iteration. We introduce Thresholded Local HFD (TL-HFD), a first-order method that maintains an active region around the seeds, performs projected subgradient updates on that region and its immediate boundary, and expands via thresholded (top-k) boundary activation. We prove that the local update is exact: the degree-preconditioned projected subgradient step restricted to the active region and its boundary coincides with the unrestricted global update. We establish finite-time dual suboptimality for both exact and thresholded updates, treating the latter as inexact projected subgradient steps with explicit skipped-boundary error. We further derive an additive activated-volume bound controlled by realized local subgradient norms and the minimum boundary-push among newly activated vertices, and translate approximate dual optimality with localized support into a robust sweep-cut guarantee for early-stopped iterates. For general submodular cut-costs, each iteration is local in the scanned region and oracle-sensitive in the hyperedge primitive. Empirically, TL-HFD often matches or improves over HFD while activating less volume, with the largest gains on noisy instances where diffusion tends to absorb non-target vertices.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
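评分理由: The paper focuses on combinatorial optimization and local diffusion clustering on submodular hypergraphs (Thresholded Local Hyper-Flow Diffusion), whereas the provided keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) belongs to multimodal large models, generative models, and reinforcement learning. The two share no technical approach, application scenario, or core concept, so all relevance scores are 0. The author list contains none of the specified experts.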
关键词
Thresholded Local Hyper-Flow Diffusion, Seeded clustering, Submodular hypergraphs, Local computation, Projected subgradient, Sweep-cut guarantee, Diffusion method
摘要翻译
反演算法通过求解高光谱分辨率卫星辐射亮度测量中的反演问题,用于估算大气中温室气体(GHGs)的浓度,例如二氧化碳(CO2)和甲烷(CH4)。然而,这些算法计算成本高昂,难以实现大规模实时估算。因此,机器学习模型被提出作为反演算法的快速代理模型。然而,大多数现有研究仅在测试数据与训练数据来自同一时期的情况下评估它们。我们利用温室气体观测卫星(GOSAT)的数据研究了此类代理模型随时间的稳定性。我们发现,当测试期远离训练期时,预测精度通常会降低。我们还表明,将时间作为输入特征可显著改善 Lasso 和神经网络模型的 XCH4 预测。在所考虑的方法中,简单的 Lasso 模型表现与神经网络等更复杂的方法相当或更好,且随时间推移产生更稳定的预测。我们进一步利用总碳柱观测网络(TCCON)——一个地基观测网络——验证了这些结果。在与 TCCON 匹配的数据集上,时间增强的 Lasso 模型相对于 TCCON 的误差,在 XCO2 和 XCH4 上均与 GOSAT 和 TCCON 之间的差异相当。
Abstract
Retrieval algorithms are used to estimate atmospheric concentrations of greenhouse gases (GHGs), such as carbon dioxide (CO2) and methane (CH4), by solving inverse problems from high-spectral-resolution satellite radiance measurements. However, these algorithms are computationally expensive, which makes real-time estimation at scale difficult. Machine-learning models have therefore been proposed as fast emulators of retrieval algorithms. Most existing studies, however, evaluate them only on test data from the same period as the training data. We study the stability over time of such emulators using data from the Greenhouse Gases Observing SATellite (GOSAT). We show that prediction accuracy generally deteriorates when the test period moves away from the training period. We also show that including time as an input feature substantially improves XCH4 prediction for Lasso and neural-network models. Among the methods considered, a simple Lasso model performs as well as or better than more complex methods such as neural networks, and yields more stable predictions over time. We further validate the results using the Total Carbon Column Observing Network (TCCON), a ground-based observation network. On the TCCON-matched dataset, the time-augmented Lasso achieves errors against TCCON that are comparable to the disagreement between GOSAT and TCCON for both XCO2 and XCH4.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on machine learning emulation of satellite greenhouse gas retrievals using standard regression models (Lasso, NN) to address temporal stability in atmospheric science. It does not involve Large Language Models, Tokenizers, Visual Encoders, World Models, or Reinforcement Learning, making all provided keywords irrelevant to its core methodology and content. No expert authors from the specified list were found.
关键词
Machine-Learning Emulation, Satellite Greenhouse Gas Retrievals, Stability over Time, Radiance Measurements, Lasso Model, Time Feature, TCCON, Ground-based Observation
摘要翻译
方程发现(Equation Discovery)旨在从数据中自动发现以数学方程形式呈现的科学模型。从技术层面来看,方程发现是通过符号回归(Symbolic Regression)算法实现的。符号回归在方程发现任务中的性能沿两个维度进行衡量:测试数据上的预测精度以及对已知真实公式(groundtruth formulas)的恢复程度。对于标准回归,精度通常在域内(in-domain)测试数据上衡量,例如,通过将数据集随机划分为训练数据和测试数据。虽然这种方法对于域内插值(in-domain interpolation)而言是合理的,而这通常是普通回归的常见目标,但它可能是真实模型发现和泛化能力的误导性代理指标。显而易见的替代方案是测量域外(out-of-domain)精度。然而,获取具有挑战性的域外测试数据是一个非平凡的问题(non-trivial problem)。因此,我们专注于利用方程恢复(equation recovery)来评估用于方程发现的符号回归算法。其理由是,那些在恢复已知真实公式方面表现良好的符号回归算法,很可能是未知方程发现任务中表现优异的候选者。现有的符号回归基准确实包含方程恢复任务,然而,其中仅包含少量公开已知的真实公式。此外,这些基准在评估算法鲁棒性方面重视不足,特别是在维度变化、采样大小、采样分布及采样域变化下的算法表现。然而,这对于希望发现方程以建模自然现象的实践者而言至关重要,因为数据几乎必然包含噪声,且来自不同的域、分布和样本大小。为了填补这一空白,我们引入了方程恢复基准(Equation Recovery Benchmark, ERBench),这是一个新的评估框架,旨在严格评估明确针对方程发现任务的算法。
Abstract
Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: Prediction accuracy on test data, and recovery of known groundtruth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known groundtruth formulas are good candidates to perform well in unknown equation discovery. Existing benchmarks for symbolic regression include equation recovery tasks, however, with only a small number of groundtruth formulas that are publicly known. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms in terms of their behavior under changing dimensionality, sampling size, sampling distribution and sampling domain. This, however, is of central importance to practitioners wanting to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on symbolic regression and equation discovery benchmarks (ERBench), while the provided keywords pertain to multimodal large language models, tokenization, visual encoders, world models, and model-based reinforcement learning. There is no conceptual overlap between the paper's content (scientific equation discovery from data) and the specific multimodal/RL architectures implied by the keywords.
关键词
Equation Discovery, Symbolic Regression, Benchmark, Equation Recovery, Robustness Evaluation, Mathematical Equations, Data-driven Modeling
摘要翻译
尽管数据分析工作流的可视化编程已成为数据科学民主化的重要途径,但此类系统仍主要局限于独立应用程序,且对其可视化分析解决方案向交互式 Web 环境的迁移支持有限。因此,数据分析流水线难以共享、嵌入并适配到面向用户的分析工具中。本文提出 Orange Lab,一个用于可视化数据分析的基于 Web 的协作环境。其核心在于,Orange Lab 使用户能够从模块化组件中视觉构建机器学习工作流,其中任何组件内的交互均能无缝传播至整个工作流,从而将静态流水线转变为动态、反应式系统,支持探索和数据驱动的叙事。我们的关键贡献是组件展示(component exposition),这是一种范式,允许作者将选定的工作流组件或其接口的一部分嵌入任意 Web 上下文,从而创建同步、交互式界面,同时隐藏底层工作流的复杂性。这使得开发定制化的分析视图和叙事驱动的体验成为可能,从而将数据分析直接整合到在线材料中。我们通过数据素养教育中的部署展示了该方法,其中嵌入组件引导学生动手探索机器学习概念,而无需了解底层系统,表明 Orange Lab 有效降低了入门门槛并支持了数据科学的民主化。
Abstract
While visual programming of data analysis workflows has become an important vehicle for the democratization of data science, such systems remain largely confined to standalone applications and offer limited support for transitioning their visual analytics solutions into interactive web environments. As a result, data analysis pipelines are difficult to share, embed, and adapt into user-facing analytical tools. We present Orange Lab, a web-based collaborative environment for visual data analytics. At its core, Orange Lab enables users to visually construct machine learning workflows from modular components, where interactions in any component propagate seamlessly through the workflow, turning static pipelines into dynamic, reactive systems that support exploration and data-driven storytelling. Our key contribution is component exposition, a paradigm that allows authors to embed selected workflow components, or parts of their interfaces, into arbitrary web contexts, creating synchronized, interactive interfaces while hiding underlying workflow complexity. This enables the development of tailored analytical views and narrative-driven experiences that integrate data analysis directly into online materials. We demonstrate the approach through deployments in data literacy education, where embedded components guide students in hands-on exploration of machine learning concepts without requiring knowledge of the underlying system, showing that Orange Lab effectively lowers barriers to entry and supports the democratization of data science.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on visual data analytics workflows and web embedding for data science education, falling under HCI and Data Visualization. The provided keywords relate to large-scale AI models, tokenization, reinforcement learning, and world models. There is no conceptual overlap between the paper's content and the specified keywords, resulting in a score of 0 for all. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.
关键词
Visual programming, Data analytics, Web-based environment, Interactive workflows, Embedded components, Machine learning workflows, Data literacy
摘要翻译
我们证明了 $ρ\text{-}\mathrm{NPTS}_{\mathrm{SG}}$(一种用于风险规避型多臂老虎机 (bandits) 的无锚点非参数 Thompson Sampling 算法)在 $\log n$ 的主阶上实现了与实例依赖下界匹配的遗憾,从而确立其在具有有界密度和次高斯尾部的分布类(包括高斯臂)上对于任何连续风险泛函 $ρ$(CVaR、均值 - 方差、Sharpe ratio、扭曲风险测度等)具有渐近最优性。本结果及其有界支撑版本仅要求 $ρ$ 的连续性:严格弱于先前参数化 Thompson Sampling 结果的支配条件,也严格弱于 UCB 类算法的 Lipschitz 条件,从而在不假设奖励服从参数分布的情况下,为 Sharpe ratio 等非 Lipschitz 泛函提供了首个实例最优保证。有界支撑情形首先被作为垫脚石提出,其共享相同的证明结构。关键的技术贡献包括一个离散化引理(针对有界支撑)和一个截断离散化引理(针对次高斯尾部),两者均通过 Dirichlet 聚合性质将字母表增长的 Dirichlet 后验投影到固定网格上,保持所有多项式前因子的次数固定且与样本大小无关,并打破了阻碍先前证明的超指数障碍。
Abstract
We prove that $ρ\text{-}\mathrm{NPTS}_{\mathrm{SG}}$, an anchor-free nonparametric Thompson Sampling algorithm for risk-averse bandits, achieves regret matching the instance-dependent lower bound to leading order in $\log n$, establishing it as asymptotically optimal for any continuous risk functional $ρ$ (CVaR, mean-variance, Sharpe ratio, distortion risk measures, and more) on the class of distributions with bounded density and sub-Gaussian tails, including Gaussian arms. Both this result and its bounded-support counterpart require only continuity of $ρ$: strictly weaker than the dominance condition of prior parametric Thompson Sampling results, and strictly weaker than the Lipschitz condition of UCB-type algorithms, yielding the first instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio without parametric reward assumptions. The bounded-support case is developed first as a stepping stone sharing the same proof structure. The key technical contributions are a discretisation lemma (bounded support) and a truncated discretisation lemma (sub-Gaussian tails), each projecting the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property, holding all polynomial prefactors at fixed degree independent of sample size and breaking the super-exponential barrier that blocked prior proofs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on theoretical reinforcement learning (risk-averse bandits and Thompson Sampling), whereas the provided keywords specifically target Multimodal Large Language Models (MLLM), generative world modeling, and multimodal architecture components (Tokenizer, Visual Encoder). There is no overlap in methodology (no tokenization, visual encoding, or multimodal generation) or architectural focus. Although Bandits are a subset of RL, 'model-based RL' in this context aligns with world models for control, which differs from the statistical bandit setting addressed here.
关键词
Thompson Sampling, Risk-Averse Bandits, Sub-Gaussian Rewards, Regret Matching, Nonparametric Algorithm, Dirichlet Posterior, Risk Functionals
摘要翻译
分布式随机梯度下降(Decentralized SGD)是分布式学习中的基本算法,尽管底层网络拓扑对其收敛行为的影响尚未完全被理解。现有的收敛分析表明,谱间隙(spectral gap)较小的拓扑结构会显著降低分布式随机梯度下降在同质和异质情形下的收敛速率。然而,许多先前的论文报告称,拓扑结构的选择在异质情形下确实具有显著的实验影响,但在同质情形下对训练行为几乎没有实验影响。本文提出了一种更紧致的分布式随机梯度下降收敛分析,相较于先前的分析,提供了关于拓扑结构如何影响收敛速率的更精确理解。具体而言,不同于仅使用谱间隙作为拓扑属性的现有收敛分析,我们的新颖分析表明,混合矩阵(mixing matrix)的所有特征值均会影响收敛速率。在实验过程中,我们仔细评估了分布式随机梯度下降的收敛行为,并证明了我们的新颖收敛分析能更准确地描述拓扑结构对收敛速率的影响。
Abstract
Decentralized SGD is a fundamental algorithm in decentralized learning, although the influence of the underlying network topology on its convergence behavior is not yet fully understood. Existing convergence analyses have shown that topologies with a small spectral gap significantly deteriorate the convergence rate of Decentralized SGD in both homogeneous and heterogeneous cases. However, many prior papers have reported that the choice of topology indeed has a significant experimental impact in the heterogeneous case, but little experimental impact on training behavior in the homogeneous case. In this paper, we present a tighter convergence analysis of Decentralized SGD, offering a more precise understanding of how topologies affect the convergence rate than prior analyses. Specifically, unlike existing convergence analyses that use only the spectral gap as a property of the topology, our novel analysis shows that all eigenvalues of the mixing matrix affect the convergence rate. In our experiments, we carefully evaluate the convergence behavior of Decentralized SGD and demonstrate that our novel convergence analysis more accurately describes the effect of topology on the convergence rate.
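To make the role of the mixing matrix concrete, the sketch below builds the mixing matrix of a ring topology, runs one Decentralized SGD step on scalar parameters, and lists the matrix's eigenvalues. It is an illustration under assumptions, not the paper's setup: the uniform 1/3 weights, the "local step, then gossip" ordering, and all function names are chosen here for simplicity. The eigenvalue formula is the standard closed form for a circulant matrix.

```python
import math


def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix: each node averages itself and its two ring neighbours."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W


def decentralized_sgd_step(params, grads, W, lr):
    """Each node takes a local gradient step, then averages with its neighbours."""
    local = [p - lr * g for p, g in zip(params, grads)]
    n = len(local)
    return [sum(W[i][j] * local[j] for j in range(n)) for i in range(n)]


def ring_eigenvalues(n):
    """Eigenvalues of the ring mixing matrix above (valid for n >= 3)."""
    return [(1.0 + 2.0 * math.cos(2.0 * math.pi * k / n)) / 3.0 for k in range(n)]


W = ring_mixing_matrix(4)
params = decentralized_sgd_step([0.0, 4.0, 8.0, 12.0], [0.0] * 4, W, lr=0.1)
eigs = ring_eigenvalues(4)
```

The spectral gap depends only on the second-largest eigenvalue magnitude, whereas the list returned by `ring_eigenvalues` is the full spectrum that, according to the abstract, enters the tighter convergence bound.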
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
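评分理由: The paper studies the convergence analysis of a distributed optimization algorithm (Decentralized SGD), covering theoretical questions about topology, the spectral gap, and the eigenvalues of the mixing matrix. The provided keywords all belong to multimodal large models, world models, and reinforcement learning (e.g., Tokenizer, Visual Encoder, MLLM). The research topics and methodologies are entirely unrelated, so all keywords score 0. The author list does not include any of the specified experts.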
关键词
Decentralized SGD, Convergence Analysis, Topology Dependence, Spectral Gap, Mixing Matrix, Eigenvalues, Distributed Optimization
摘要翻译
社区检测是复杂网络分析中的一个基础性问题,在社会、生物及金融等多个领域均有广泛应用。传统算法(如 Louvain、LPA 及模块度优化方法)往往需要人工进行参数调优,此外还存在聚类中心选择不准确以及可扩展性不足的问题。为应对这些挑战,本文提出了一种新颖的社区检测算法——自动拉普拉斯中心性均值算法(Automatic Laplacian Centrality Means,简称 ALCMeans)。ALCMeans 将基于拉普拉斯能量的自动中心识别与 DeepWalk 嵌入相结合,以获取稳健的节点表示。与现有的基于拉普拉斯矩阵及聚类方法不同,ALCMeans 无需预先设定社区数量,利用结构重要性增强聚类中心的选择,并借助表示学习实现更准确且稳定的节点分配。在基准数据集上的实验结果表明,与 Louvain、Newman-Girvan、LPA、Fast-Greedy 以及近期基于图神经网络(GNN)的竞争方法(MAGI,KDD 2024)相比,ALCMeans 的 NMI 和 ARI 指标得分高出 10% 至 20%。进一步的模块度(Modularity)和 F1 分数评估结果也证实了 ALCMeans 的优越性。消融实验凸显了各组成部分的关键贡献。尽管 ALCMeans 依赖于 DeepWalk 参数且相对于轻量级启发式方法而言运行时间有所增加,但它始终优于现有最先进方法。这使得 ALCMeans 成为实际网络分析中一项颇具前景的工具。
Abstract
Community detection is a fundamental problem in the analysis of complex networks. It has applications across social, biological, and financial domains. Traditional algorithms such as Louvain, LPA, and modularity optimization often require manual parameter tuning. They also suffer from inaccurate cluster center selection and struggle with scalability. To address these challenges, we propose Automatic Laplacian Centrality Means (ALCMeans), a novel community detection algorithm. ALCMeans combines Laplacian energy-based automatic center identification with DeepWalk embeddings for robust node representation. Unlike existing Laplacian-based and clustering methods, ALCMeans eliminates the need to predefine the number of communities, enhances cluster center selection using structural importance, and leverages representation learning for more accurate and stable assignments. Experimental results on benchmark datasets demonstrate 10 to 20 percent higher NMI and ARI scores compared to Louvain, Newman-Girvan, LPA, Fast-Greedy, and a recent GNN-based competitor (MAGI, KDD 2024). Additional evaluations with modularity and F1-scores confirm the superiority of ALCMeans. Ablation studies highlight the critical contributions of each component. Despite its reliance on DeepWalk parameters and increased runtime relative to lightweight heuristics, ALCMeans consistently outperforms state-of-the-art methods. This makes it a promising tool for real-world network analysis.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on unsupervised community detection in complex networks using Laplacian centrality and DeepWalk embeddings. The provided keywords relate to MLLMs, World Models, and Model-Based RL. There is no conceptual overlap between graph clustering and the specified modalities or architectures (e.g., Tokenizer, Visual Encoder), resulting in zero relevance for all keywords.
关键词
Community detection, Unsupervised learning, Laplacian centrality, DeepWalk embeddings, Network analysis, Cluster center selection, Benchmark datasets
摘要翻译
大型语言模型(LLMs)常规地面临本应被拒绝的请求,这造成了帮助性与危害预防之间的权衡。然而,拒绝本身也可以是有帮助的。在涉及危机、胁迫或意图升级的高风险交互中,生硬的拒绝虽然可能防止直接伤害,但仍可能无法满足请求背后的人的需求。本文提出了 PsychoSafe,这是一个基于心理学知识的拒绝框架,它将拒绝重构为基于循证干预策略的结构化支持性沟通。为了开发 PsychoSafe,我们构建了一个包含 8019 个提示 - 响应对的语料库,涵盖五个具有心理学意义的高风险领域,并对 Qwen 3.5 27B 应用了提示工程(prompting)和参数高效微调(parameter-efficient fine-tuning)。在包含 500 个提示的平衡验证集上,通过 LLM 裁判评估并经人类评分验证,PsychoSafe 提示相较于通用基线将整体拒绝质量提高了 28.1%,在外部分资源推荐(+46.8%)和心理依据(+34.8%)方面尤其表现强劲,同时保留了非拒绝任务上的下游性能。微调实现了近乎完美的拒绝和资源推荐率,但降低了响应相关性。在 SORRY-Bench 和 XSTest 上的额外评估显示,该方法具有强大的领域内鲁棒性,但领域外泛化能力有限,这表明未来工作应多样化微调数据,以帮助模型选择性而非模式化地应用干预措施。
Abstract
Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on LLM safety and psychologically-informed refusal strategies via prompting and fine-tuning. It does not involve multimodal components (Visual Encoder, MultiModal, MLLM), world modeling, model-based reinforcement learning, tokenizer architecture, or model unification. None of the authors match the specified expert list.
关键词
PsychoSafe, Large Language Models, Psychologically-Informed Refusals, Prompting, Fine-tuning, Safety Alignment, Resource Referral
摘要翻译
尽管机器翻译(MT)取得了显著进展,非人工智能社区对 MT 系统日益关切,这表明技术进步与现实用户需求之间存在明显差距。例如,虽然自然语言处理(NLP)研究人员关注基准性能,但终端用户更关心伦理问题、信任、可靠性、成本等方面。我们认为,听取各种用户社区的意见至关重要,这样才能将研究精力引导至社区所关心的问题。为此,我们首次开展了一项大规模分析,调查了四个利益相关者社区(人工智能(AI)开发者、专业译员、语言学习者和语言服务提供商)在社交媒体上关于机器翻译技术发布的内容。具体而言,我们构建了一个包含 79,286 条帖子和评论的数据集,数据来源于 Reddit、Facebook、Bluesky 和 Mastodon(2019 年至 2025 年),并分析这些社区在何处存在分歧,以及分歧的原因和方式。总体而言,我们发现社区之间经常存在分歧,甚至在翻译质量、效率和可靠性等话题上因极化情绪而表现出强烈冲突。这是因为这些社区处理这些话题的方式不同:人工智能(AI)社区将其视为技术和计算问题,而非人工智能(用户)社区则更关注质量细微差别、节省时间、用户信任以及更广泛的社会问题。
Abstract
Despite remarkable progress in machine translation (MT), non-AI communities have raised growing concerns about MT systems, suggesting a noticeable gap between technical advancement and the needs of real-world users. For instance, while NLP researchers focus on benchmark performance, end users care about ethical concerns, trust, reliability, costs, and more. We argue that listening to various user communities is essential so that research efforts would be directed towards the problems that the communities care about. To this end, we present a large-scale analysis, for the first time, that investigates what four stakeholder communities (AI developers, professional translators, language learners, and language service providers) post about MT technology on social media. To do so, we construct a dataset of 79,286 posts and comments from Reddit, Facebook, Bluesky, and Mastodon from 2019 to 2025, and analyse where these communities disagree, and how and why. Overall, we find that communities often disagree, and even show strong conflicts due to polarised sentiments on topics such as translation quality, efficiency, and reliability. This is because these communities approach these topics differently: the AI community frames them as technical and computational problems, while non-AI (user) communities care more about quality nuances, time savings, user trust, and broader social issues.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
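评分理由: The paper studies how different communities (AI developers, professional translators, and others) disagree about machine translation on social media, touching on ethics, trust, and social issues. The given keywords (e.g., Unify Models, Visual Encoder, World Models, model-based RL) concern multimodal large-model architectures, representation learning, and reinforcement learning, and are unrelated to this paper's social analysis of NLP, so every keyword relevance score is 0. The author list does not include any of the specified experts (Yang Shi et al.), so no bonus is added.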
关键词
Machine Translation, Social Media Analysis, Stakeholder Perspectives, Ethical Concerns, Benchmark Performance, User Trust, Community Disagreement
摘要翻译
本文介绍了一种受全基因组关联研究(GWAS)启发的文体学分析方法。每个被视为“基因”的词元 (token) 与“表型”作者身份的关联通过逻辑回归结合多重比较校正进行测试。该方法应用于英语、德语和俄语语料库时,能够检测出具有统计学意义的、个体作者特有的词汇标记。
Abstract
This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.
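The abstract names multiple-comparison correction without saying which procedure is used. As a minimal sketch, assuming a Benjamini-Hochberg false-discovery-rate correction (an assumption, not confirmed by the source), per-token p-values from the logistic regressions could be filtered as follows. The token names and p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is rejected.

    Assumed correction procedure (Benjamini-Hochberg); the paper may use another.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical per-token p-values for one author (illustrative only).
token_p_values = {
    "whilst": 0.0004,
    "upon": 0.012,
    "the": 0.62,
    "however": 0.035,
    "very": 0.2,
}
mask = benjamini_hochberg(list(token_p_values.values()), alpha=0.05)
significant = [tok for tok, keep in zip(token_p_values, mask) if keep]
```

Here `significant` holds the tokens that survive the correction and would be reported as candidate lexical markers.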
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on stylometric analysis using GWAS-inspired statistical methods for authorship attribution in text corpora. It does not involve multimodal learning, world models, reinforcement learning, or large model architectures (such as Tokenizers or Visual Encoders in the context of MLLMs). Thus, all provided keywords are completely unrelated (Score 0), resulting in a total weighted score of 0, which is below the dynamic passing score of 27.8. No expert authors from the specified list are found.
关键词
Stylometric Analysis, GWAS-inspired Approach, Authorship Attribution, Logistic Regression, Lexical Markers, Interpretable Method, Text Corpora, Token Association
摘要翻译
微调大语言模型(LLMs)在乌克兰语语法错误修正(GEC)中占据主导地位,而通过 API 调用的 LLMs 在最小编辑基准上几乎未被测试。我们在 UNLP 2023 仅 GEC 基准上评估了来自四个提供商的 11 个商业 LLM 和一个开源乌克兰语模型,比较了零样本、少样本、最小编辑以及 LLM 辅助提示优化策略。我们的最佳配置(Gemini 3.1-Pro)达到了 F0.5=69.22,缩小了与微调的最先进水平(SOTA,F0.5=73.14)之间差距的 90% 以上。对于零样本提示,只有 Claude 模型从乌克兰语指令中受益。然而,所有模型的最佳整体结果均采用了乌克兰语最小编辑提示,其语言特定规则需要精确使用乌克兰语表达。基于最小编辑 + 少样本的 LLM 辅助提示优化获得了最高分数。详细的最小编辑指令在标点和大写错误上带来了最大的收益,但导致模型放弃了几类低频类别。深入进行错误分析,我们识别出五种与乌克兰语特有语言现象相关的重复出现的过度修正模式。代码、提示和输出公开可用。
Abstract
Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, comparing zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. Our best configuration (Gemini 3.1-Pro) reaches F0.5=69.22, closing over 90% of the gap to fine-tuned SOTA (F0.5=73.14). For zero-shot prompts, only Claude models benefit from Ukrainian instructions. However, the best overall results for all models use Ukrainian minimal-edits prompts, whose language-specific rules require Ukrainian to express precisely. LLM-assisted prompt optimization on top of minimal-edits + few-shot achieves the highest score. Detailed minimal-edits instructions yield the largest gains for punctuation and case errors but cause the model to abandon several low-frequency categories. Delving into error analysis, we identify five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena. Code, prompts, and outputs are publicly available.
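F0.5 is the F-beta score with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of the formula follows; the precision and recall values are illustrative and are not taken from the paper.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# With beta = 0.5, a precise but conservative corrector scores higher
# than a high-recall corrector with the same numbers swapped.
precise = f_beta(0.8, 0.4)
greedy = f_beta(0.4, 0.8)
```

This asymmetry is why over-correction, which lowers precision, is penalized more than missed edits under F0.5.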
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
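评分理由: The paper studies LLM prompting strategies for Ukrainian grammatical error correction (GEC), a text-only NLP task. The provided keywords cover multimodality (MultiModal, MLLM, Visual Encoder), world models (World Models), and reinforcement learning (model-based RL), none of which relates directly to this work (no visual input, no reinforcement learning, no multimodal architecture), so the relevance scores are 0. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).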
关键词
Ukrainian Grammatical Error Correction, Large Language Models, Prompting Strategies, Minimal-Edit, Zero-shot Learning, Fine-tuned Models, Benchmark Evaluation
摘要翻译
视图依赖外观建模在新视角合成与重建中仍然是一个具有挑战性的问题。准确表示复杂的角度效应通常需要大量的内存和计算资源。对于新的基于学习的方法,一种常见的方法是依赖 SH(球谐函数)。然而,捕捉镜面反射等高频现象需要高阶展开,这会增加内存使用量和计算成本。因此,大多数方法采用低阶 SH,这限制了建模复杂视图依赖效应的能力,导致表示过于平滑或漫反射。为了解决这些局限性,我们在场景重建的背景下系统评估了广泛的球面函数。其中一些函数首次被引入到图形学与计算机视觉领域。基于实验所得的见解,我们提出了一种新的球面函数——归一化各向异性球面 Gabor 函数 (Normalized Anisotropic Spherical Gabor function),该函数能够在保持紧凑表示的同时,实现高频外观效应的高效建模与学习。与现有方法相比,我们的函数在重建高光等视图依赖现象时质量更高,同时在内存效率上最高可达五倍,且计算效率更高。我们在 radiance-field (辐射场) 重建任务中验证了其性能。
Abstract
View-dependent appearance modeling remains a challenging problem in novel-view synthesis and reconstruction. Accurately representing complex angular effects often requires substantial memory and computational resources. For new learning-based methods, a common approach is to rely on spherical harmonics (SH). However, capturing high-frequency phenomena such as specular reflections demands high-order expansions, which increase memory usage and computational cost. Consequently, most methods employ low-order SH, which limits the ability to model complex view-dependent effects, resulting in overly smooth or diffuse representations. To address these limitations, we systematically evaluate a wide range of spherical functions in the context of scene reconstruction. Some of them are introduced to graphics and computer vision for the first time in this paper. Based on the insights from the experiment, we develop a novel spherical formulation, the Normalized Anisotropic Spherical Gabor function, that enables efficient modeling and learning of high-frequency appearance effects while maintaining a compact representation. Compared to existing approaches, our function achieves higher-quality reconstruction of view-dependent phenomena such as glints, while being up to five times more memory-efficient and more efficient to evaluate. We validate its performance in radiance-field reconstruction tasks.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on Neural Rendering and Computer Graphics (Spherical Harmonics, Radiance Fields), while the provided keywords pertain to Multimodal Large Language Models and Reinforcement Learning (e.g., Tokenizer, MLLM, World Models). There is no conceptual overlap regarding model unification, tokenization, visual encoders in the ML sense, world models, or RL. Therefore, all keyword scores are assigned 0.0. No expert authors from the specified list (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list. The total weighted score is 0.0, which is below the dynamic pass score of 27.8.
关键词
Spherical Harmonics, Radiance Reconstruction, View-dependent Appearance, Spherical Gabor, Novel-view Synthesis, High-frequency Effects, Scene Reconstruction
摘要翻译
估计多相机系统的相对位姿是计算机视觉中的一个基本问题,在自动驾驶车辆、移动设备和无人机(UAVs)中具有关键应用。然而,现有解决方案往往计算复杂度高或依赖过多的点对应,限制了其实际适用性。为了解决这些局限性,我们提出两种高效极小解算子,利用一种新颖的参数化方法来估计多相机系统的相对位姿。第一个求解器利用惯性测量单元(IMU)提供的垂直方向先验,而第二个求解器利用 IMU 提供的旋转轴方向先验。我们的方法仅需四个点对应,并将多相机相对位姿估计问题简化为求解一个一元六次多项式,相较于通常涉及八次多项式的现有方法,这是一个显著改进。这种计算复杂度和点对应需求的降低,使得我们的求解器在集成到 RANSAC 框架时特别有效,展示了在视觉里程计应用中的巨大潜力。通过在合成数据和 KITTI 基准上的严格评估,我们的方法相比最先进算法实现了优越的计算效率和具有竞争力的精度。
Abstract
Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessive number of point correspondences, limiting their real-world applicability. To address these limitations, we propose two efficient minimal solvers for estimating the relative poses of multi-camera systems using a novel parameterization. The first solver leverages the vertical direction prior provided by Inertial Measurement Units (IMUs), while the second utilizes the rotation axis direction prior from IMUs. Our methods require only four point correspondences and reduce the problem of multi-camera relative pose estimation to solving a univariate 6th-degree polynomial, a significant improvement over existing approaches, which typically involve 8th-degree polynomials. This reduction in computational complexity and correspondence requirements makes our solvers particularly effective when integrated into RANSAC frameworks, demonstrating strong potential for visual odometry applications. Through rigorous evaluations on synthetic data and the KITTI benchmark, our methods achieved superior computational efficiency and competitive accuracy compared to state-of-the-art algorithms.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
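评分理由: The paper studies geometric solvers for visual-inertial relative pose estimation, which belongs to classical computer vision and robotics. Its core contribution is algebraic solver design and a reduction in polynomial degree; it does not involve large-model architectures, tokenizers, representation learning, world models, or reinforcement learning. All given keywords therefore have a relevance of 0. The author list does not include any of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang).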
关键词
Visual-Inertial Relative Pose Estimation, Multi-Camera Systems, Efficient Minimal Solvers, Point Correspondences, Inertial Measurement Units, Visual Odometry, RANSAC Frameworks
摘要翻译
图像和视频字幕生成是连接视觉与语言领域的基础任务,在大型视觉 - 语言模型(LVLMs)的预训练中发挥着关键作用。当前最先进的字幕生成模型通常采用监督微调(SFT)进行训练,这种范式依赖于昂贵且难以扩展的标注,往往导致模型记忆特定的标准答案,从而限制了其泛化能力及生成多样、创造性描述的能力。为了克服这些限制,我们提议将基于可验证奖励的强化学习(RLVR)应用于多模态字幕生成的开放式任务。我们引入了字幕生成强化学习++(CapRL++),这是一种新颖的无参考训练框架,通过其实用性重新定义字幕质量:高质量的字幕应能使一个非视觉语言模型准确回答关于相应视觉内容的问题。CapRL++ 采用了解耦的两阶段流水线:LVLM 生成字幕,而客观奖励来源于一个独立的、无视觉的 LLM 仅基于该字幕回答多项选择题 (MCQs) 的准确性。在超过 20 个图像和视频基准上的评估表明,CapRL++ 提升了密集字幕质量,并在空间和时间理解等任务上增强了基于字幕的预训练效果。在由 CapRL++ 标注的可扩展图像和视频字幕数据集上进行预训练,可带来显著的下游任务性能提升。此外,在用于字幕质量评估的 Prism 框架内,使用 CapRL++ 训练的紧凑模型实现了密集字幕生成性能,与 Qwen2.5-VL-72B 和 Qwen3-VL-235B-A22B 等显著更大的模型相当。这些结果验证了 CapRL++ 能有效训练模型生成可泛化、高保真的描述,建立了超越传统监督微调(SFT)限制的坚实基础。
Abstract
Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable annotations and often causes models to memorize specific ground-truth answers, limiting their generality and ability to generate diverse, creative descriptions. To overcome these limitations, we propose applying Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended task of multimodal captioning. We introduce Captioning Reinforcement Learning++ (CapRL++), a novel reference-free training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding visual content. CapRL++ employs a decoupled two-stage pipeline where an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. Evaluations on more than 20 image and video benchmarks show that CapRL++ improves dense caption quality and strengthens caption-based pretraining across tasks such as spatial and temporal understanding. Pretraining on scalable image and video caption datasets annotated by CapRL++ yields substantial downstream gains. Furthermore, within the Prism Framework for caption quality evaluation, compact models trained with CapRL++ achieve dense captioning performance comparable to substantially larger models such as Qwen2.5-VL-72B and Qwen3-VL-235B-A22B. These results validate that CapRL++ effectively trains models to produce generalizable, high-fidelity descriptions, establishing a robust foundation beyond the limitations of traditional SFT.
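The reward described above can be sketched as the accuracy of a vision-free answerer on multiple-choice questions, given only the caption. This is a minimal illustration, not the authors' implementation: `MCQ`, `caption_reward`, and `toy_answerer` are hypothetical names, and the keyword-matching answerer merely stands in for the separate LLM.

```python
from typing import Callable, NamedTuple, Sequence


class MCQ(NamedTuple):
    question: str
    options: Sequence[str]
    answer_index: int  # index of the correct option


# An answerer sees only the caption text, never the image or video.
Answerer = Callable[[str, str, Sequence[str]], int]


def caption_reward(caption: str, questions: Sequence[MCQ], answer_fn: Answerer) -> float:
    """Fraction of questions answered correctly from the caption alone."""
    if not questions:
        return 0.0
    correct = sum(
        1 for q in questions
        if answer_fn(caption, q.question, q.options) == q.answer_index
    )
    return correct / len(questions)


def toy_answerer(caption: str, question: str, options: Sequence[str]) -> int:
    """Stand-in for the vision-free LLM: pick the first option named in the caption."""
    lowered = caption.lower()
    for i, option in enumerate(options):
        if option.lower() in lowered:
            return i
    return 0  # fall back to the first option


questions = [
    MCQ("What color is the car?", ["blue", "red", "green"], 1),
    MCQ("What animal is present?", ["cat", "dog"], 1),
]
detailed = caption_reward("A red car parked next to a dog.", questions, toy_answerer)
vague = caption_reward("A vehicle on a street.", questions, toy_answerer)
```

A caption that carries the queried details earns a higher reward than a vague one, which is the signal the reinforcement learning stage would optimize.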
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
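评分理由: Scoring failed because the scorer's output could not be parsed (Expecting ',' delimiter: line 12 column 107 (char 330)), so the 0.0 entries above do not reflect an assessed relevance.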
摘要翻译
可变形图像配准(DIR)广泛应用于放射治疗中的剂量传播与累积,但潜在变形中的不确定性可能会显著影响临床相关的剂量估计。我们提出了一种实用的概率框架,用于将 DIR 不确定性传播至体素级剂量统计和剂量体积直方图(DVH)。该方法将每个体素的映射对应关系建模为由透明局部确定性图(certainty map)控制的随机变量,该图可通过简单安全边界、结构边界不匹配或基于结构的保守不确定性值来定义。由此可得出可解释的指标,包括剂量概率、期望剂量、置信界限以及导出的 DVH 包络。该框架设计旨在保持轻量级和可解释性:它避免了复杂的生物力学或基于集合的不确定性模型,而是强调简单参数化、计算可行性以及透明的剂量指标。我们进一步引入了一种基于结构的内外策略作为可选的细化方案,将映射概率限制在解剖学上合理的靶区范围内。该方法在前列腺放射治疗病例研究中进行了演示,并用于比较不同的确定性图策略和概率核(probability kernels)。实验结果表明,确定性图设计对最终剂量和 DVH 不确定性界限的影响比具体的核选择更强,而内外策略的额外益处取决于具体病例,在当前示例中较为有限。总体而言,所提出的框架提供了一种透明的方式,将 DIR 不确定性纳入放射治疗剂量评估,并研究建模选择如何影响传播的剂量指标。
Abstract
Deformable image registration (DIR) is widely used in radiotherapy for dose propagation and accumulation, but uncertainty in the underlying deformation can substantially affect clinically relevant dose estimates. We present a practical probabilistic framework for propagating DIR uncertainty to voxel-wise dose statistics and dose-volume histograms (DVHs). The method models the mapped correspondence at each voxel as a random variable governed by a transparent local certainty map that can be defined by simple safety margins, structure-boundary mismatch, or structure-wise conservative uncertainty values. This yields interpretable quantities such as dose probabilities, expected dose, confidence bounds, and induced DVH envelopes. The framework is designed to remain lightweight and interpretable: it avoids complex biomechanical or ensemble-based uncertainty models and instead emphasizes simple parameterization, computational feasibility, and transparent dose metrics. We further introduce a structure-guided in/out strategy as an optional refinement that restricts mapping probabilities to anatomically plausible target regions. The approach is demonstrated on a prostate radiotherapy case study and used to compare different certainty-map strategies and probability kernels. The experiments show that the certainty-map design has a stronger effect on resulting dose and DVH uncertainty bounds than the specific kernel choice, while the additional benefit of the in/out strategy is case-dependent and modest in the present example. Overall, the proposed framework provides a transparent way to incorporate DIR uncertainty into radiotherapy dose assessment and to study how modelling choices affect propagated dose metrics.
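To illustrate how treating the mapped correspondence as a random variable yields expected dose and dose probabilities, here is a minimal one-dimensional sketch. It assumes a discrete Gaussian probability kernel whose width stands in for the local certainty (narrower means more certain); the kernel choice, function names, and dose values are illustrative assumptions, not the paper's configuration.

```python
import math


def mapping_probabilities(center, sigma, n_voxels):
    """Normalized discrete Gaussian over voxel indices 0..n_voxels-1 (assumed kernel)."""
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n_voxels)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_dose(dose, probs):
    """Expectation of the dose over the uncertain correspondence."""
    return sum(d * p for d, p in zip(dose, probs))


def dose_probability(dose, probs, threshold):
    """Probability that the mapped voxel receives at least `threshold` dose."""
    return sum(p for d, p in zip(dose, probs) if d >= threshold)


# Hypothetical dose profile (Gy) with a steep gradient next to voxel 2.
dose = [0.0, 0.0, 60.0, 60.0, 60.0]
certain = mapping_probabilities(center=2, sigma=0.1, n_voxels=len(dose))
uncertain = mapping_probabilities(center=2, sigma=1.0, n_voxels=len(dose))

dose_if_certain = expected_dose(dose, certain)
dose_if_uncertain = expected_dose(dose, uncertain)
prob_above_50 = dose_probability(dose, uncertain, threshold=50.0)
```

With low certainty near a dose gradient, the expected dose drops below the nominal value and the probability of exceeding the threshold falls below one; repeating this per voxel is what produces DVH envelopes.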
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on medical physics (DIR uncertainty in radiotherapy), while keywords pertain to AI/MLLM/RL architectures (Tokenizer, Visual Encoder, World Models, etc.). There is no technical overlap; thus, all keyword scores are 0 (Total Weighted Score: 0). None of the listed expert authors are present in the author list.
关键词
Deformable image registration, Radiotherapy dose propagation, Uncertainty quantification, Probabilistic framework, Dose-volume histograms, Local certainty map, Voxel-wise dose statistics
摘要翻译
作为一种生物启发式智能传感器,事件相机 (event cameras) 在时空信息的智能感知与视觉运动估计领域引入了新范式,其特点为高时间分辨率、低延迟及低功耗。然而,其异步数据流给传统的同步、基于帧的算法带来了显著挑战。为应对这些挑战,本文提出了一种新颖的框架,可直接从异步光流实现全自由度 (DoF) 自运动估计 (egomotion estimation),专门针对角速度与线速度的联合恢复。我们将微分对极约束解耦为独立的角速度分量与线速度分量,并推导了其在异步数据下的表达形式。基于该表达形式,我们开发了一种优化算法,利用至少五个点实现全 DoF 自运动估计。此外,通过对旋转动力学应用一阶近似,我们将约束方程转化为多项式形式,从而得到了该表达形式下的首个代数最小 5-point 求解器。为确保高速场景下的实时性能,我们还提出了一种加速求解器,通过截断高阶角速度项来实现。在合成数据集和真实世界数据集上的广泛评估表明,异步方法优于传统同步方法,尤其在准确性和对时空噪声的鲁棒性方面表现更佳。我们相信,这项工作为高速机器人应用中高效且准确的连续时间运动估计奠定了关键基础。
Abstract
As a bio-inspired intelligent sensor, event cameras have introduced a new paradigm in the intelligent perception of spatiotemporal information and visual motion estimation, characterized by their high temporal resolution, low latency, and minimal power consumption. However, their asynchronous data streams present significant challenges to traditional synchronous, frame-based algorithms. To address these challenges, this paper presents a novel framework for full degree of freedom (DoF) egomotion estimation directly from asynchronous optical flow, specifically targeting the joint recovery of angular and linear velocities. We decouple the differential epipolar constraint into distinct angular and linear velocity components, and derive its formulation for asynchronous data. Based on this formulation, an optimization algorithm is developed that enables full-DoF egomotion estimation leveraging at least five points. Furthermore, by applying a first-order approximation to rotational dynamics, we transform the constraint equations into a polynomial form, resulting in the first algebraic minimal 5-point solver for this formulation. To ensure real-time performance in high-speed scenarios, we additionally propose an accelerated solver achieved by truncating high-order angular velocity terms. Extensive evaluations on both synthetic and real-world datasets demonstrate that the asynchronous approach outperforms traditional synchronous methods, particularly in its accuracy and robustness to spatiotemporal noise. We believe that this work establishes a critical foundation for efficient and accurate continuous-time motion estimation in high-speed robotics applications.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
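评分理由: The paper belongs to computer vision and robotics, focusing on asynchronous motion estimation for event cameras and algebraic solvers for differential SfM. The provided keywords (Unify Models, Tokenizer, World Models, MLLM, model-based RL, and others) belong to large language models, representation learning, and reinforcement learning, and are unrelated to this paper's geometric optimization and visual odometry topics, so every keyword relevance score is 0. The author list does not include any of the specified experts.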
关键词
Event Cameras, Motion Estimation, Asynchronous Data, Differential SfM, Minimal Solvers, Egomotion, Optimization Algorithm, Full-DoF
摘要翻译
弹头爆炸过程中会产生高密度、高速且相互遮挡的破片。其力学参数(位置、速度、动能)直接决定了弹头破片场的杀伤力。然而,爆炸场景中的高强度闪光和烟雾严重阻碍了这些力学参数的准确获取。为解决这一挑战,本文结合实验力学方法,提出了一种事件驱动方法,用于重构破片的动态轨迹并测量其力学参数。作为一种新颖的类脑视觉传感器,事件相机(Event Cameras)具有微秒级的时间分辨率和高动态范围光照变化感知能力,克服了在强闪光干扰下准确测量高速目标的困难。该方法构建了一个多事件相机视觉系统,采用三种几何约束:时间相关对极约束用于寻找潜在匹配的事件点配对,以及三焦张量线约束与局部单应性约束用于消除误匹配。建立了综合概率模型,采用熵权法确定各约束概率的权重,以定量过滤误匹配。通过空间直线相交和非线性优化实现三维轨迹重构。最后,基于重构的轨迹计算破片的速度和动能。该方法为弹头破片场的力学损伤评估及战术防护设计提供了可靠的技术支持。
Abstract
During warhead detonation, high-density, high-speed, and mutually occluded fragments are generated. Their mechanical parameters (position, velocity, kinetic energy) directly determine the lethality of the warhead fragment field. However, high-intensity flash and smoke in detonation scenarios severely hinder the accurate acquisition of these mechanical parameters. To address this challenge, this paper integrates experimental mechanics approaches and presents an event-driven method for reconstructing the dynamic trajectories of fragments and measuring their mechanical parameters. As a novel brain-inspired visual sensor, event cameras offer microsecond-level temporal resolution and high dynamic range lighting change perception, overcoming the difficulty of accurately measuring high-speed targets under strong flash interference. The method constructs a multi-event-camera vision system, adopting three geometric constraints: time-correlated epipolar constraint to find potential matching event point pairs, and trifocal tensor line constraint plus local homography constraint to eliminate mismatches. A comprehensive probability model is established, with entropy weight method determining the weight of each constraint's probability to quantitatively filter mismatches. 3D trajectory reconstruction is achieved via spatial line-line intersection and nonlinear optimization. Finally, the velocity and kinetic energy of the fragments are calculated based on the reconstructed trajectory. This method provides reliable technical support for the mechanical damage evaluation of warhead fragment fields and the tactical protection design.
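The entropy weight method mentioned above assigns larger weights to criteria whose scores vary more across samples. A minimal sketch of the common textbook form follows; the score matrix is hypothetical, and how the paper feeds its three constraint probabilities into the method is not specified here.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method (textbook form).

    matrix[i][j] is the non-negative score of sample i under criterion j.
    Requires at least two samples and a positive sum in every column.
    Returns one weight per criterion; the weights sum to 1.
    """
    n = len(matrix)
    m = len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for value in column:
            if value > 0:
                p = value / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # normalize the entropy to [0, 1]
        # A criterion with uniform scores has entropy 1 and carries no information.
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread <= 1e-12:
        return [1.0 / m] * m  # no criterion is informative: fall back to equal weights
    return [d / spread for d in divergences]


# Hypothetical scores: rows are candidate matches, columns are three criteria.
scores = [
    [0.5, 0.9, 0.2],
    [0.5, 0.1, 0.3],
    [0.5, 0.5, 0.9],
]
weights = entropy_weights(scores)
```

The first criterion scores every candidate identically, so it receives (almost exactly) zero weight, while the criteria that discriminate between candidates share the rest.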
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
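评分理由: The paper addresses event-camera-based fragment trajectory reconstruction and mechanical parameter measurement, which falls under experimental mechanics and computer vision. The keywords concern unified models, tokenizers, world models, MLLMs, and reinforcement learning, and have no direct connection to this topic. The paper uses no neural visual encoder, no multimodal large-model architecture, and no reinforcement learning algorithm, so all relevance scores are 0.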
关键词
Event cameras, Trajectory reconstruction, Mechanical parameters, Warhead detonation, Geometric constraints, 3D reconstruction, High-speed measurement
摘要翻译
仅测向目标定位是光学测量领域的一个基本问题,并在无人机(UAV)技术中得到广泛应用。有效的轨迹规划可建立有利的观测几何构型,从而提升仅测向无人机系统的目标定位精度。本文提出了一种针对仅测向目标定位场景的无人机(UAV)轨迹优化方法。通过利用费雪信息矩阵(FIM),该方法将几何构型与飞行器机动性动态整合至优化框架中。具体而言,我们引入了一种谱加权 FIM 目标函数,该函数在退化构型附近提供更优的梯度动力学特性,使规划器能够快速摆脱不良观测条件。针对双无人机场景,引入交角正弦项,通过优化视线交角来改善三角测量几何构型,从而防止轨迹聚集。此外,我们提出了一种改进的粒子群优化(PSO)算法,引入运动模型约束和粒子归一化,以确保轨迹的物理可行性并增强与目标函数的兼容性。仿真结果表明,与基于传统 FIM 的方法相比,所提出的方法在单无人机场景中将中值定位误差降低了 99.21%,在双无人机构型下实现了 69.70% 的性能提升,并在远距离机动目标的长时仅测向目标定位中表现出优越的性能。
Abstract
Bearing-only target localization is a fundamental problem in optical measurement and finds extensive applications in unmanned aerial vehicle (UAV) technology. Effective trajectory planning establishes favorable observation geometries, thereby enhancing the target localization accuracy of bearing-only UAV systems. This paper proposes a trajectory optimization method for UAVs in bearing-only target localization scenarios. By leveraging the Fisher Information Matrix (FIM), the proposed approach dynamically integrates the geometric configuration and vehicle maneuverability into the optimization framework. Specifically, we introduce a spectrally-weighted FIM objective function that provides better gradient dynamics near degenerate configurations, enabling the planner to rapidly escape from poor observation conditions. For dual-UAV scenarios, an intersection angle sine term is introduced to optimize triangulation geometry by improving the sight-line intersection angle, thereby preventing trajectory aggregation. Furthermore, we propose an improved Particle Swarm Optimization (PSO) algorithm with motion model constraints and particle normalization to ensure the physical feasibility of the trajectory and enhance the compatibility with the objective functions. Simulation results demonstrate that the proposed method reduces the median localization error by 99.21% compared to conventional FIM-based approaches in single-UAV scenarios, achieves a 69.70% improvement for dual-UAV configurations, and exhibits superior performance in long-duration bearing-only localization of maneuvering targets at extended ranges.
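To make the role of observation geometry concrete, the sketch below computes the conventional Fisher Information Matrix for 2D bearing-only localization and its determinant. It assumes independent Gaussian bearing noise with standard deviation `sigma` and a planar scene; it shows the standard FIM only, not the spectrally weighted objective proposed in the paper, and all positions are made up.

```python
def bearing_fim(sensors, target, sigma=1.0):
    """2x2 Fisher Information Matrix for bearings atan2(dy, dx) to a 2D target.

    sensors: list of (x, y) observer positions; target: (x, y).
    Assumes independent Gaussian bearing noise with standard deviation sigma (radians).
    """
    tx, ty = target
    f11 = f12 = f22 = 0.0
    for sx, sy in sensors:
        dx, dy = tx - sx, ty - sy
        r2 = dx * dx + dy * dy
        # Gradient of the bearing with respect to the target position.
        gx, gy = -dy / r2, dx / r2
        f11 += gx * gx / sigma ** 2
        f12 += gx * gy / sigma ** 2
        f22 += gy * gy / sigma ** 2
    return ((f11, f12), (f12, f22))


def fim_determinant(fim):
    """Determinant of a 2x2 matrix; larger values mean better-conditioned geometry."""
    return fim[0][0] * fim[1][1] - fim[0][1] * fim[1][0]


target = (0.0, 0.0)
# Two observation points whose lines of sight cross at a right angle.
orthogonal = fim_determinant(bearing_fim([(1.0, 0.0), (0.0, 1.0)], target))
# Two observation points on the same line of sight: a degenerate configuration.
collinear = fim_determinant(bearing_fim([(1.0, 0.0), (2.0, 0.0)], target))
```

The collinear case yields a singular matrix (zero determinant), which is the kind of degenerate configuration a trajectory planner tries to leave quickly.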
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
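评分理由: The paper focuses on UAV trajectory optimization for bearing-only target localization, using the Fisher Information Matrix (FIM) and Particle Swarm Optimization (PSO), and belongs to classical control and operations-research optimization. The provided keywords (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) all concern multimodal large models, representation learning, and reinforcement learning, and have no direct connection to this topic. The author list does not include Yang Shi or the other specified experts. All keyword relevance scores are therefore 0.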
关键词
Trajectory Optimization, UAV, Bearing-only Target Localization, Fisher Information Matrix, Particle Swarm Optimization, Dual-UAV, Motion Model Constraints
摘要翻译
基于 LiDAR(激光雷达)和相机的多模态 3D 目标检测在地面车辆场景中已展现出卓越的性能,但尚未在 UAV(无人驾驶飞行器)平台上得到探索。在 UAV 俯视场景中,由树冠主导的频繁地面物体遮挡会导致空间变化和模态依赖的信息退化。现有的多模态融合框架既未显式建模此类地面物体遮挡,也未将遮挡感知嵌入检测流程,限制了它们在遮挡 UAV 场景中的性能。为应对这些挑战,我们提出 CAMF-Det,一种面向 UAV 平台 LiDAR-相机 3D 目标检测的遮挡感知多模态融合框架,该框架通过物理启发式建模推导双模态遮挡强度,并将其作为先验嵌入整个检测流程。首先,一个双模态遮挡建模模块通过基于 Beer-Lambert 的公式及建筑掩码校正,离线显式构建两种模态的遮挡强度真值。其次,利用这些真值图作为监督,一个双模态预测网络将离线建模结果转换为单帧推理下的在线遮挡强度预测。第三,真值和预测的遮挡强度均被注入到数据增强、特征编码、多模态融合及检测头中,从而能够在空间变化和模态依赖的信息退化下实现自适应检测。在两个自建的基于 UAV 的多模态数据集 SI3D-DI 和 SI3D-DII 上的实验表明,CAMF-Det 在所有难度级别上均取得最佳性能,与最佳对比方法相比,其困难级别 mAP_BEV 分别提升了 9.43% 和 4.88%。这些结果证实了显式遮挡先验建模与利用在 UAV 场景中进行鲁棒多模态 3D 检测的有效性。
Abstract
Multimodal 3D object detection based on LiDAR and cameras has demonstrated excellent performance in ground-vehicle scenarios, but has not been explored for Unmanned Aerial Vehicle (UAV) platforms. In UAV top-down scenes, frequent ground-object occlusion dominated by tree canopies causes spatially varying and modality-dependent information degradation. Existing multimodal fusion frameworks neither explicitly model such ground-object occlusion nor embed occlusion awareness into the detection pipeline, limiting their performance in occluded UAV scenes. To address these challenges, we propose CAMF-Det, a closure-aware multimodal fusion framework for LiDAR-camera 3D object detection on UAV platforms, which derives dual-modal occlusion intensity through physics-inspired modeling and embeds it as a prior throughout the detection pipeline. First, a dual-modal closure modeling module explicitly constructs occlusion intensity ground truth for both modalities offline via a Beer-Lambert-inspired formulation and building-mask correction. Second, using these ground-truth maps as supervision, a dual-modal prediction network converts the offline modeling results into online occlusion intensity predictions under single-frame inference. Third, both ground-truth and predicted occlusion intensity are injected into data augmentation, feature encoding, multimodal fusion, and the detection head, enabling adaptive detection under spatially varying and modality-dependent information degradation. Experiments on two self-built UAV-based multimodal datasets, SI3D-DI and SI3D-DII, demonstrate that CAMF-Det achieves the best performance across all difficulty levels, with hard-level mAP$_{\mathrm{BEV}}$ improvements of 9.43% and 4.88% over the best competing methods, respectively. These results confirm the effectiveness of explicit occlusion prior modeling and exploitation for robust multimodal 3D detection in UAV scenes.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
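评分理由: Scoring failed because the scorer's output could not be parsed (Expecting ',' delimiter: line 12 column 39 (char 262)), so the 0.0 entries above do not reflect an assessed relevance.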
摘要翻译
尽管基于事件的运动估计取得了显著进展,但当前的几何方法主要侧重于速度估计。然而,绝对位姿估计 (absolute pose estimation),对于机器人导航和增强现实等关键应用同样至关重要,却仍相对较少被探索。因此,从事件流 (event streams) 中同时恢复绝对位姿和速度仍然是一个开放且具有挑战性的问题。为填补这一空白,我们提出了一种几何框架,用于通过利用场景中的 3D 线及其触发的事件来进行绝对位姿和速度估计。该框架的核心在于两个关键几何约束:3D 线与其对应的事件平面的法向量之间的正交性,以及事件与其关联线的 2D 投影之间的共线性。基于这些约束,我们提出了用于绝对位姿估计的线性求解器和多项式求解器。前者实现了高效计算,而后者提供了旋转的全局最优解。对于速度估计,我们开发了一种高效的线性求解器和一种更准确的基于优化的求解器,以恢复角速度和线速度。值得注意的是,我们的方法至少需要三个事件 - 线对应 (event-line correspondences) 即可独立确定 6-DoF 绝对位姿或速度。在仿真和真实数据集上的广泛实验表明,我们的方法实现了当前最佳 (state-of-the-art) 性能,与现有方法相比在准确性和计算效率方面有显著提升。演示代码公开可用,地址为 https://github.com/Zibin6/EventPoseVelocity.
Abstract
Despite the rapid advancements in event-based motion estimation, current geometric methods primarily focus on velocity estimation. However, absolute pose estimation, which is equally crucial for key applications such as robotic navigation and augmented reality, remains relatively underexplored. Consequently, the simultaneous recovery of absolute pose and velocity from event streams remains an open and challenging problem. To address this gap, we propose a geometric framework for absolute pose and velocity estimation by leveraging 3D lines in the scene and the events they trigger. At the core of the framework lie two key geometric constraints: the orthogonality between a 3D line and the normal vector of its corresponding event plane, and the collinearity of an event with the 2D projection of its associated line. Based on these constraints, we present both linear and polynomial solvers for absolute pose estimation. The former enables efficient computation, while the latter provides a globally optimal solution for rotation. For velocity estimation, we develop an efficient linear solver and a more accurate optimization-based solver to recover both angular and linear velocities. Notably, our methods require a minimum of three event-line correspondences to determine the 6-DoF absolute pose or velocities independently. Extensive experiments in simulation and on real-world datasets demonstrate that our methods achieve state-of-the-art performance, with significant improvements in accuracy and computational efficiency compared to existing methods. The demo code is publicly available at https://github.com/Zibin6/EventPoseVelocity.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper addresses geometric computer vision for event cameras (absolute pose and velocity estimation), whereas the keywords concern multimodal large language models (MLLM), world models, and model-based reinforcement learning, along with related components (Tokenizer, Visual Encoder, Unify Models). The geometric estimation methods have no technical overlap with these model architectures. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.
Keywords
Event Cameras, Absolute Pose Estimation, Velocity Estimation, Geometric Framework, 3D Lines, Event Planes, Computer Vision
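Abstract Translation
Unmanned aerial vehicle (UAV) multispectral point clouds (MPC) provide high-dimensional spatial-spectral data for sub-canopy target detection; however, their effectiveness is significantly compromised by the severe illumination heterogeneity caused by vegetation shadows. To address this, we propose a prior-free anomaly detection framework that handles illumination changes robustly. First, we formulate solar angle estimation as an inverse optimization problem. By coupling spectral indices with a ray-tracing model, this strategy achieves Prior-Free Shadow Extraction without relying on flight metadata and effectively distinguishes dark objects from true shadows. Second, to mitigate spectral distortion, we introduce an Illumination-Consistent Sparse Representation mechanism. Unlike standard reconstruction methods, we build the background dictionary strictly from neighbors that share the same illumination state. This constraint effectively decouples spectral reflectance from illumination changes, ensuring that targets are represented only by physically consistent background points. Experimental results show that the method significantly improves the separability of anomalous targets from the background in complex forest environments and outperforms state-of-the-art baselines. The framework is particularly suited to identifying camouflaged military targets, detecting fallen tree trunks, and revealing archaeological remains hidden beneath dense vegetation.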
Abstract
Unmanned Aerial Vehicle (UAV) multispectral point clouds (MPC) provide high-dimensional spatial-spectral data for sub-canopy target detection; however, their efficacy is significantly compromised by severe illumination heterogeneity caused by vegetation shadows. To address this, we propose a prior-free anomaly detection framework capable of robustly handling lighting variations. First, we formulate solar angle estimation as an inverse optimization problem. By coupling spectral indices with a ray-tracing model, this strategy achieves Prior-Free Shadow Extraction without relying on flight metadata, effectively distinguishing dark objects from true shadows. Second, to mitigate spectral distortions, we introduce an Illumination-Consistent Sparse Representation mechanism. Unlike standard reconstruction methods, we construct a background dictionary strictly from neighbors sharing the same illumination state. This constraint effectively disentangles spectral reflectance from lighting variations, ensuring that targets are represented solely by physically consistent background points. Experimental results indicate that the proposed method significantly improves the separability between anomalies and background in complex forest environments, demonstrating superior performance over state-of-the-art baselines. This framework is particularly suited for identifying camouflaged military targets, mapping fallen tree trunks, and uncovering archaeological ruins hidden beneath dense foliage.
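The idea of reconstructing each point only from neighbors in the same illumination state can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: shadow labels are taken as given, the function name and parameters are hypothetical, and ridge-regularized least squares stands in for a true sparse-coding solver.

```python
import numpy as np

def illumination_consistent_residual(xyz, spectra, shadow, k=8, ridge=1e-2):
    """Score each point by how poorly same-illumination neighbors reconstruct it."""
    scores = np.zeros(len(xyz))
    for i in range(len(xyz)):
        # Candidate background: other points with the same shadow label.
        same = np.flatnonzero(shadow == shadow[i])
        same = same[same != i]
        if same.size == 0:
            continue  # no consistent background available; score stays 0
        # Keep the k spatially nearest candidates as the local dictionary.
        dist = np.linalg.norm(xyz[same] - xyz[i], axis=1)
        nbrs = same[np.argsort(dist)[:k]]
        D = spectra[nbrs].T  # shape: (bands, number of neighbors)
        # Ridge least squares (assumption: a stand-in for sparse coding).
        A = D.T @ D + ridge * np.eye(D.shape[1])
        w = np.linalg.solve(A, D.T @ spectra[i])
        scores[i] = np.linalg.norm(spectra[i] - D @ w)
    return scores

# Synthetic scene: one background material, half of the points in shadow.
rng = np.random.default_rng(0)
n_points = 60
xyz = rng.uniform(0.0, 10.0, size=(n_points, 3))
shadow = rng.random(n_points) < 0.5
background = np.array([0.2, 0.4, 0.6, 0.8])
scale = np.where(shadow, 0.3, 1.0)  # shadows dim the spectrum
spectra = background * scale[:, None] + 0.005 * rng.standard_normal((n_points, 4))

# Plant one hypothetical anomaly with a different spectral shape.
anomaly_index = 5
spectra[anomaly_index] = np.array([0.9, 0.1, 0.9, 0.1])

scores = illumination_consistent_residual(xyz, spectra, shadow)
print(int(np.argmax(scores)))
```

Restricting the dictionary to same-label neighbors means a shadowed background point is compared against similarly dimmed spectra, so a brightness difference alone does not inflate its residual.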
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper addresses anomaly detection in UAV multispectral point clouds using ray tracing and sparse representation, which belongs to remote sensing and computer vision. The keywords concern AI foundation models, MLLMs, and reinforcement learning. There is no methodological overlap (no tokenizers, no world models, no RL). None of the listed expert authors are present.
Keywords
UAV multispectral point clouds, Anomaly detection, Illumination-invariant, Ray-tracing model, Sparse representation, Sub-canopy target detection, Prior-free framework