arXiv Daily Report 2026-06-05
ArXiv Research Report
Papers
Abstract (Chinese translation)
计算机辅助设计(CAD)通过使得能够创建精确、可编辑的 3D 模型,支撑着现代工程与制造。然而,CAD 研究通常孤立地研究任务,而 CAD 的多模态、多任务学习因缺乏统一基准而受到阻碍。为了解决这一空白,我们引入了 UniCAD,这是一个全面的 CAD 多模态学习基准,涵盖了点云到 CAD 重建、文本/图像到 CAD 生成以及跨多种输入模态的 CAD 问答。除了基准外,我们还提出了 UniCAD-MLLM,这是一种通用的多模态大语言模型,能够接收文本、图像、草图和点云,并在单个框架内以端到端的方式执行这些异构任务。在 UniCAD 和 Fusion360 基准上的广泛实验表明,UniCAD-MLLM 在所有任务上均实现了最先进的性能,优于现有的特定任务和多任务基线。我们将发布数据集、代码和预训练模型,以加速未来的研究。
Abstract
Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 9.0/10 | 13.5 |
| Tokenizer | 1.5 | 6.0/10 | 9.0 |
| Visual Encoder | 1.5 | 6.0/10 | 9.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 9.0/10 | 13.5 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
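The last column of the table above appears to be the weight multiplied by the relevance rating. A minimal sketch of that arithmetic follows; the function name and the idea of summing the rows into a total are assumptions for illustration, not the report generator's actual code.

```python
# Hypothetical helper: names are illustrative, not taken from the report tooling.
WEIGHT = 1.5  # every keyword in the table above uses the same weight

# Relevance ratings (out of 10) copied from the table above.
relevance = {
    "Unify Models": 9.0,
    "Tokenizer": 6.0,
    "Visual Encoder": 6.0,
    "World Models": 2.0,
    "MLLM": 9.0,
    "MultiModal": 10.0,
    "model-based RL": 1.0,
}


def weighted_scores(ratings, weight=WEIGHT):
    """Return the per-keyword score, computed as weight * relevance."""
    return {keyword: weight * rating for keyword, rating in ratings.items()}


scores = weighted_scores(relevance)
total = sum(scores.values())  # assumed aggregation: plain sum over keywords
```

Each entry of `scores` reproduces the corresponding row of the table; `total` is only meaningful if the report does sum the rows.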
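Scoring rationale: The paper centers on a unified framework for multi-modal CAD, so Unify Models (9.0) and MLLM (9.0) score high, with UniCAD-MLLM as the core contribution. MultiModal is fundamental (10.0). Tokenizer and Visual Encoder are needed for multi-modal input processing but are not claimed as novel (6.0 each). World Models and model-based RL are unrelated to the CAD generation and reconstruction tasks described, hence the low scores (2.0 and 1.0).
Keywords
UniCAD, Multi-Modal, Multi-Task, CAD, Unified Benchmark, Universal Model, MLLM
In-depth analysis
Title (translated from Chinese): UniCAD: A Unified Benchmark and Universal Model for Multi-Modal, Multi-Task Computer-Aided Design
Summary: To address the isolation of tasks and the lack of a unified benchmark in computer-aided design (CAD), the paper proposes UniCAD, described as the first large-scale multi-modal, multi-task benchmark for CAD learning. It covers point-cloud-to-CAD reconstruction, text/image/sketch-to-CAD generation, and CAD question answering. The paper also proposes UniCAD-MLLM, a unified multi-modal large language model that handles text, image, sketch, and point-cloud inputs end to end and performs heterogeneous CAD generation and understanding tasks within a single framework. The model uses modality-specific encoders to project the different inputs into a shared geometric-semantic latent space and emits executable Python scripts (based on CadQuery) as its output representation, which makes the output interpretable, editable, and verifiable. Extensive experiments on the UniCAD and Fusion360 benchmarks show that UniCAD-MLLM achieves state-of-the-art performance on all tasks, surpassing existing task-specific and multi-task baselines. The dataset, code, and pretrained models will be open-sourced to support follow-up research.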
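Innovations:
- Builds UniCAD, described as the first large-scale unified multi-modal, multi-task CAD benchmark, standardizing data splits and evaluation protocols for CAD reconstruction, generation, and question answering across four input modalities: text, image, sketch, and point cloud.
- Proposes UniCAD-MLLM, an end-to-end unified multi-modal large language model that handles multiple input modalities and performs CAD generation and understanding in a shared geometric-semantic latent space.
- Uses executable Python scripts (CadQuery) as the CAD output representation. Compared with command sequences or B-rep, this is more interpretable, editable, and verifiable, and it integrates more easily with existing CAD toolchains.
- Shows through joint multi-modal, multi-task training that a unified model can outperform specialized single-task baselines, supporting the effectiveness and scalability of cross-task knowledge transfer.
Methodology: The paper first builds the UniCAD benchmark by collecting and standardizing paired data (text, images, sketches, and point clouds with their corresponding CAD models) from existing CAD datasets such as DeepCAD and Fusion360, and by defining unified evaluation metrics. It then designs UniCAD-MLLM: modality-specific encoders (for example a text encoder, a visual encoder, and a point-cloud encoder) extract features, and projection layers map them into the shared geometric-semantic latent space. For CAD generation tasks the model autoregressively generates CadQuery Python scripts; for CAD question answering it outputs structured answers. Training uses a joint multi-task loss that combines a script-generation loss and a question-answering loss. On the UniCAD and Fusion360 benchmarks, the model is compared with several task-specific baselines (such as DeepCAD, Text2CAD, and Img2CAD) and with multi-task baselines.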
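Key Results:
- UniCAD-MLLM achieves state-of-the-art performance on all four tasks (point-cloud-to-CAD reconstruction, image/sketch-to-CAD generation, text-to-CAD generation, and CAD question answering), surpassing all task-specific and multi-task baselines.
- On the UniCAD benchmark, unified multi-task training substantially improves the metrics of every task over single-task training, supporting the effectiveness of cross-modal, cross-task knowledge transfer.
- On the Fusion360 benchmark, UniCAD-MLLM also outperforms existing methods, indicating good generalization.
- Ablations indicate that the shared latent space and the joint training strategy are the key contributors to the gains.
Tech Stack:
- CadQuery (Python CAD scripting framework)
- Transformer architecture (for autoregressive script generation)
- Multi-modal encoders (text: LLM; image/sketch: ViT; point cloud: PointNet++ or similar)
- Projection layers (map features from each modality into the shared latent space)
- Joint multi-task loss (cross-entropy for script generation; classification/regression losses for question answering)
- Evaluation metrics: CAD generation quality (e.g. Chamfer distance (CD), F1, IoU) and question-answering accuracy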
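Chamfer distance (CD) is one of the generation-quality metrics listed above. The sketch below shows one common form of it on small point sets. Conventions differ across papers (squared versus unsquared distances, sum versus mean), so the exact variant used by UniCAD is an assumption here, and the function names are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Assumed variant: mean squared nearest-neighbour distance from A to B
    plus the same quantity from B to A. Brute force, O(len(A) * len(B)).
    """
    def one_way(src, dst):
        return sum(min(squared_distance(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Points sampled from a predicted shape and a reference shape (toy values).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
cd = chamfer_distance(predicted, reference)
```

For realistic point counts a k-d tree or a GPU nearest-neighbour search would replace the brute-force inner loop.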
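Strengths:
- Provides what is described as the first unified multi-modal, multi-task CAD benchmark, filling a gap in the field and enabling fair comparison and unified model development.
- The UniCAD-MLLM architecture is simple and effective: it handles multiple inputs end to end and outputs editable CAD programs, which makes it practical.
- Experiments are thorough; results on multiple benchmarks show the unified model outperforming specialized models, supporting the value of cross-task learning.
- Releasing the dataset, code, and models should help advance CAD research.
Limitations:
- The CAD program representation (CadQuery scripts) is editable but may not cover all CAD modeling operations (for example complex surfaces and free-form shapes), so its expressiveness has a ceiling.
- The benchmark data comes mainly from existing CAD datasets (such as DeepCAD and Fusion360), so scale and diversity may be limited; real industrial data in particular is lacking.
- The model depends on pretrained multi-modal encoders, which raises compute requirements, and robustness to missing or noisy input modalities is not explored in depth.
- The CAD question-answering task currently covers only basic geometry and parameter queries and lacks evaluation of higher-level reasoning (such as design intent and constraint solving).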
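Relevance To Keywords:
- Unify Models: The paper's core goal is a unified multi-modal, multi-task CAD model, so it is highly relevant.
- Native multimodal large models: UniCAD-MLLM is a native multimodal large model that handles text, images, and point clouds and unifies understanding and generation.
- Unified understanding and generation in multimodal large models: The model supports both CAD generation (script output) and CAD understanding (question answering) in one system.
- Representation learning: The model performs cross-modal representation learning through a shared geometric-semantic latent space, a typical application of representation learning.
- World models: CAD can be viewed as a structured world model, but the paper does not directly address world models or reinforcement learning, so relevance is weak.
- Reinforcement learning / post-training: The paper uses neither reinforcement learning nor post-training techniques, so relevance is low.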
Abstract (Chinese translation)
乳腺癌仍然是导致女性癌症相关死亡率的主要原因之一。其临床管理需要跨越涵盖筛查、诊断和治疗计划等阶段的临床工作流程进行多模态推理,每个阶段均涉及不同的成像模态、任务目标和推理模式。然而,受限于数据稀缺性和模型通用性,现有的医疗 MLLMs 通常在孤立模态或窄任务家族上进行评估,限制了其支持工作流级别临床推理的能力。本文首先提出 BreastStage,这是一个与工作流对齐的乳腺成像指令语料库,包含 186 万对指令遵循对,这些数据源自 17 个子数据集,涵盖 5 种成像模态和 136 个任务模板。其保留集 BreastStage-Bench 提供了一个全面的基准,用于评估乳腺癌照护连续体上的多模态推理。基于该语料库,我们提出 BreastGPT,这是一种配备双分支视觉编码器和概念保持 token 压缩机制的统一 MLLM,旨在弥合标准放射学与千兆像素病理学之间的尺度差距。在 BreastStage-Bench 上,BreastGPT 实现了 75.66% 的封闭式准确率和 89.92% 的开放式得分,在各类临床阶段和任务格式上均优于通用型及医疗专用型 MLLMs。这些结果表明,与工作流对齐的数据和跨尺度视觉建模对于构建基于临床的医疗 MLLMs 至关重要。所有数据、代码及模型检查点均已发布于 https://yangyy-liu.github.io/BreastGPT.io。
Abstract
Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy-liu.github.io/BreastGPT.io.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 8.0/10 | 12.0 |
| Tokenizer | 1.5 | 6.0/10 | 9.0 |
| Visual Encoder | 1.5 | 9.0/10 | 13.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 10.0/10 | 15.0 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
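Scoring rationale: The paper focuses on BreastGPT, a unified Multimodal Large Language Model (MLLM) for breast cancer, making MLLM and MultiModal highly relevant (10/10). It uses a dual-branch Visual Encoder to handle different imaging scales (9/10). The model unifies workflows and modalities, supporting Unify Models (8/10). Token compression is mentioned, relating to Tokenizer (6/10). World Models and model-based RL are unrelated to this clinical reasoning task (0/10). None of the specified expert authors (Yang Shi, etc.) appear in the author list (Yu Shi is present but distinct).
Keywords
Multimodal Large Language Model, Breast Cancer, Clinical Routine, Visual Encoder, Unified Model, Token Compression, BreastStage-Bench
In-depth analysis
Title (translated from Chinese): BreastGPT: A Multimodal Large Language Model for the Full Breast Cancer Clinical Workflow
Summary: Clinical management of breast cancer spans screening, diagnosis, and treatment planning, and each stage relies on different imaging modalities and reasoning patterns. Existing medical multimodal large language models (MLLMs) are usually limited to a single modality or a narrow set of tasks and lack workflow-level clinical reasoning. The paper first builds the BreastStage dataset, which contains 17 sub-datasets across 5 imaging modalities (mammography, ultrasound, MRI, whole-slide pathology, CT), 136 task templates, and 1.86 million instruction pairs, and carves out the BreastStage-Bench benchmark from it. On top of this dataset it proposes BreastGPT, which uses Qwen3-VL as its base model and adds a dual-branch visual encoder and concept-preserving token compression to handle cross-scale visual inputs, from routine radiology images to gigapixel pathology slides, in one model. On BreastStage-Bench, BreastGPT reaches 75.66% on closed-ended questions and 89.92% on open-ended questions, clearly ahead of general-purpose and medical-specific MLLMs, with gains of more than 25%, 35%, and 40% in the screening, diagnosis, and treatment-planning stages respectively. The study concludes that workflow-aligned data and cross-scale visual modeling are critical for clinical-grade medical MLLMs.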
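Innovations:
- Workflow-level problem definition and benchmark: frames breast cancer multimodal reasoning as a clinical workflow problem covering screening, diagnosis, and treatment planning, and builds the large-scale instruction dataset BreastStage with its benchmark BreastStage-Bench.
- Unified cross-scale breast MLLM: BreastGPT handles mammography, ultrasound, CT, MRI, and gigapixel pathology slides within a single instruction-following framework, giving consistent reasoning across the workflow.
- Concept-preserving visual token compression: a dual-branch visual encoder with a concept-based token selector keeps clinically critical visual evidence under a fixed token budget, balancing accuracy and inference efficiency.
Methodology: BreastStage is built through workflow data generation, cleaning, expert annotation, and instruction-pair generation with Qwen3-Max, then split into a training set and a benchmark set. The model uses Qwen3-VL as its base and adds a dual-branch visual encoder (one branch for routine radiology images, the other for gigapixel pathology slides); a resolution-aware gating module routes inputs automatically, conditioned on the stage. A concept-preserving token compression mechanism selects visual tokens that are relevant to the prompt and that cover the image globally, under a fixed token budget. Stage-conditioned system prompts switch between screening, diagnosis, and treatment reasoning behavior. Several general-purpose and medical MLLMs are evaluated on BreastStage-Bench and compared with BreastGPT.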
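The selection criterion described above (tokens that are prompt-relevant and that also give global coverage, under a fixed budget) can be illustrated with a toy selector. This is only a sketch under stated assumptions: the split between relevance and coverage, the cosine-similarity ranking, and every name below are hypothetical, and BreastGPT's actual concept-based selector is not specified in this report.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_tokens(tokens, prompt, budget, relevance_share=0.5):
    """Pick `budget` token indices: part by prompt relevance, part by coverage.

    Assumed scheme (not the paper's): the top `relevance_share` of the budget
    goes to tokens most similar to the prompt embedding; the remainder is
    spread evenly over the tokens that were not picked, so that every region
    of the input keeps some representation.
    """
    n = len(tokens)
    if budget >= n:
        return list(range(n))
    k_relevant = int(budget * relevance_share)
    ranked = sorted(range(n), key=lambda i: cosine(tokens[i], prompt), reverse=True)
    chosen = set(ranked[:k_relevant])
    remaining = [i for i in range(n) if i not in chosen]
    k_coverage = budget - len(chosen)
    if k_coverage > 0:
        step = len(remaining) / k_coverage  # > 1 because budget < n
        chosen.update(remaining[int(j * step)] for j in range(k_coverage))
    return sorted(chosen)


# Toy example: six 2-D "visual tokens" and a prompt embedding.
tokens = [[1, 0], [0, 1], [1, 1], [0, 1], [1, 0], [0, 1]]
kept = select_tokens(tokens, prompt=[1, 0], budget=4)
```

The returned indices always number exactly `budget` when the budget is smaller than the token count, which is the property a fixed token budget needs.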
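Key Results:
- BreastGPT reaches 75.66% closed-ended accuracy and an 89.92% open-ended score on BreastStage-Bench.
- Compared with general-purpose and medical-specific MLLMs, BreastGPT gains more than 25%, 35%, and 40% in the screening, diagnosis, and treatment-planning stages respectively.
- GPT-5.4 averages only 49.32% on BreastStage-Bench, and existing medical MLLMs show no clear advantage.
- BreastStage contains about 662K unique images, 136 task templates, 606K annotation records, and 1.86 million instruction pairs.
Tech Stack:
- Qwen3-VL (base multimodal large language model)
- Qwen3-Max (used to generate instruction data)
- Dual-branch visual encoder
- Resolution-aware gating module
- Concept-preserving token compression
- Stage-conditioned system prompts
- Instruction tuning
- Task templates for visual question answering (VQA), report generation, classification, visual grounding, and others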
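Strengths:
- Described as the first work to model breast cancer AI as a complete clinical workflow covering screening, diagnosis, and treatment planning, which gives it clinical practicality.
- Handles five imaging modalities in one model, including gigapixel pathology slides, addressing the cross-scale visual modeling problem.
- Builds a large-scale, high-quality, expert-validated instruction dataset and benchmark that can support research in the field.
- Clearly outperforms existing general-purpose and medical-specific MLLMs, supporting the value of workflow-aligned data and cross-scale modeling.
Limitations:
- The dataset comes mainly from public datasets and partner hospitals and may not cover all clinical scenarios or rare cases.
- The model is validated only on breast cancer; its ability to generalize to other cancers or diseases is unknown.
- The dual-branch encoder and token compression may increase model complexity and training cost.
- Fairness, privacy, and ethical issues in clinical deployment are not discussed in depth.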
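Relevance To Keywords:
- Native multimodal large models: BreastGPT is a native multimodal large model that handles images and text together, matching this keyword.
- Unified understanding and generation in multimodal large models: The model supports understanding and generation tasks such as visual question answering and report generation.
- Representation learning: The dual-branch visual encoder and concept-preserving token compression involve visual representation learning.
- Post-training: The model is instruction-tuned (post-trained) from Qwen3-VL to adapt it to breast cancer clinical tasks.
- World models: The paper does not directly involve world models or environment interaction; relevance is weak.
- Reinforcement learning: The paper does not use reinforcement learning; relevance is weak.
- Model-based RL: Not addressed; relevance is weak.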
Abstract (Chinese translation)
近期研究已探索视觉 - 语言模型(VLMs)在食物分析中的应用。然而,大多数现有方法主要依赖监督微调(SFT),这往往限制了推理和泛化能力。此外,高质量的大规模营养标注仍然稀缺。为了解决这些问题,我们引入了 CalorieBench-80K,这是一个拥有精心筛选的卡路里标签和饮食建议标注的大规模基准。据我们所知,这是首个引入思维链(CoT)标注以进行卡路里推理的食物图像基准。我们还提出了 Food-R1,这是一个在多任务学习范式下训练的统一食物视觉 - 语言模型,旨在赋予模型广泛的能力。Food-R1 采用了基于思维链的冷启动指令微调,随后使用组相对策略优化(GRPO)进行强化微调(RFT),以提升推理能力和性能。在 CalorieBench-80K 和代表性基准上的实验表明,Food-R1 在各类食物相关任务中始终优于强基线。代码、模型权重和基准标注可在项目仓库中获取。
Abstract
Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 9.0/10 | 13.5 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 5.0/10 | 7.5 |
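Scoring rationale: The paper title explicitly contains "Unified", which fits Unify Models closely. As a vision-language model it belongs to MLLM, and MultiModal is core content. It uses RL (GRPO) for fine-tuning, which relates somewhat to model-based RL but is not core model construction. Tokenizer and Visual Encoder are generic components that are not emphasized. World Models are not addressed. The author list does not include any of the specified experts.
Keywords
Food-R1, Unified Multi-Task, Vision-Language Model, Reinforcement Learning, Chain-of-Thought, CalorieBench-80K, Instruction Tuning
In-depth analysis
Title (translated from Chinese): Food-R1: A Unified Multi-Task Food Vision-Language Model Based on Reinforcement Learning
Summary: Vision-language models (VLMs) for food analysis rely mainly on supervised fine-tuning (SFT), which limits reasoning and generalization, and high-quality nutrition annotations are scarce. To address this, the paper builds CalorieBench-80K, a large-scale benchmark dataset that is described as the first to include Chain-of-Thought (CoT) annotations for calorie reasoning. On top of it the paper proposes Food-R1, a unified multi-task food VLM that uses a multi-task learning paradigm to combine calorie estimation, dietary advice generation, food classification, ingredient recognition, recipe generation, and nutrition estimation. Training has two stages: stage one is CoT-based cold-start instruction tuning (SFT), and stage two is reinforcement fine-tuning (RFT) with Group Relative Policy Optimization (GRPO) to improve reasoning. Experiments show that Food-R1 consistently outperforms strong baselines on food-related tasks. Code and model weights are open-sourced.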
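Innovations:
- Builds CalorieBench-80K, described as the first large-scale food image benchmark dataset with Chain-of-Thought (CoT) calorie-reasoning annotations.
- Proposes Food-R1, a unified multi-task food VLM that combines several food-analysis capabilities through multi-task learning and improves cross-task generalization.
- Introduces reasoning-oriented distillation from large models: a strong teacher model generates CoT reasoning traces, making intermediate reasoning steps explicit.
- Described as the first application of GRPO-based reinforcement-learning post-training to food analysis, improving reasoning stability and overall performance.
Methodology: CalorieBench-80K is built from MM-Food-100K. GPT-4.1 filters the data (ingredient consistency, portion plausibility, calorie plausibility), and Qwen2.5-VL-72B performs granularity alignment (mapping coarse-grained portions to the most likely ingredients) and generates dietary advice annotations. Training then has two stages. Stage 1 is an SFT cold start on mixed multi-task data that includes CoT-distilled data. Stage 2 takes the SFT model as the reference and optimizes the policy model with GRPO, using reward functions based on task-specific metrics. The architecture is a VLM with a visual encoder and a language decoder, and all tasks are cast as instruction-following text generation.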
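GRPO, used in stage 2 above, scores each sampled response relative to the other responses drawn for the same prompt instead of using a learned value function. The sketch below shows that group-relative normalization. The reward function is a hypothetical calorie-accuracy reward invented for illustration; the paper's actual task-specific rewards are not given in this report.

```python
import statistics


def calorie_reward(predicted_kcal, true_kcal):
    """Hypothetical reward: 1.0 for an exact estimate, falling linearly to 0.0
    as the relative error reaches 100%. Not the paper's reward function."""
    relative_error = abs(predicted_kcal - true_kcal) / true_kcal
    return max(0.0, 1.0 - relative_error)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Advantage_i = (r_i - mean(r)) / (std(r) + eps). Population standard
    deviation is assumed here; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to the same food image whose label is 500 kcal.
predictions = [500.0, 250.0, 500.0, 250.0]
rewards = [calorie_reward(p, 500.0) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average get positive advantages and are reinforced; when all responses receive the same reward the advantages are zero and the group contributes no gradient signal.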
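Key Results:
- On CalorieBench-80K and other representative benchmarks, Food-R1 outperforms strong baselines on calorie estimation, dietary advice generation, food classification, ingredient recognition, recipe generation, and nutrition estimation.
- Ablations show that both CoT distillation and GRPO post-training improve performance substantially.
- Qualitative comparisons show that after RFT the model produces more stable and accurate calorie estimates through step-by-step reasoning.
- Human verification finds that 93% of filtering decisions agree with human judgment in the data-filtering stage, and 90% agree in the granularity-alignment stage.
Tech Stack:
- GPT-4.1 (data filtering and verification)
- GPT-4o (CoT reasoning distillation)
- Qwen2.5-VL-72B (granularity alignment, dietary advice annotation)
- Group Relative Policy Optimization (GRPO)
- LoRA (low-rank adaptation)
- Supervised Fine-Tuning (SFT)
- Chain-of-Thought (CoT) reasoning
- Multi-task learning paradigm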
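Strengths:
- Builds a large-scale, high-quality food benchmark dataset with CoT annotations, filling a gap in nutrition-reasoning data.
- The unified multi-task framework combines several food-analysis capabilities and improves generality and the stability of calorie estimates.
- Two-stage training (SFT followed by GRPO-based RL) combined with reasoning distillation markedly improves reasoning and generalization.
- A systematic data cleaning and verification pipeline supports data reliability.
- Open-sourced code and model weights support further research.
Limitations:
- The dataset is still limited in size (about 80K) and may not cover all food categories and cooking styles.
- CoT distillation depends on the quality of the teacher model (GPT-4o), which may introduce bias or errors.
- GRPO training is computationally expensive, and the design of the reward function may steer optimization.
- The focus is mainly on calorie estimation; reasoning about other nutritional dimensions (such as fat and protein) is not fully validated.
- Robustness in complex real-world scenes (such as occlusion and mixed dishes) is not examined in depth.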
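Relevance To Keywords:
- Native multimodal large models: Food-R1 is built on a vision-language model (VLM) and is an application of native multimodal large models to the food domain.
- Unified understanding and generation in multimodal large models: The model handles image understanding (ingredient recognition, calorie estimation) and text generation (recipes, dietary advice) in one system.
- Representation learning: Through multi-task learning and CoT distillation, the model learns richer visual and semantic representations of food.
- World models: Food analysis involves reasoning about ingredients, cooking methods, and nutrition knowledge, which can be seen as an early form of a world model for the food domain.
- Reinforcement learning: One of the paper's core innovations is GRPO-based post-training, an application of reinforcement learning to VLM fine-tuning.
- Post-training: The second training stage (RFT) is post-training aimed at improving reasoning.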
Abstract (Chinese translation)
我们设想了一种主动式多模态助手系统,该系统能为用户提供程序性任务的分步实时指导,自主决定何时中断以及如何指导。然而,进展受限于缺乏反映现实条件的大规模跨领域基准测试,尤其是用户偏离预期步骤序列的常见情况。我们通过四项贡献来解决这一差距:(1) 我们发布了 EgoProactive,这是一个用于主动式程序性辅助的大规模可穿戴第一人称视角数据集,具有明确的计划外 (OOP) 标注和恢复步骤;(2) 我们在统一的主动式指导模式下,将五个现有的基准测试(Ego4D、EPIC-KITCHENS、EgoExo4D、HoloAssist、HowTo100M)扩充为 Pro²Bench;(3) 我们提出了一种解耦的规划器 - 交互架构,专门针对程序性状态、视觉线索和恢复注入;(4) 我们引入了一种后训练配方,该配方可在不同模型家族间迁移,通过在 Llama 4 和 Qwen-3.6-VL 上的跨骨干复制得到验证。在广泛的实验中,我们训练的 Llama 4 系统在所有六个数据集上,显著优于强大的专有基线(Claude Opus 4.6、Gemini 3.1 Pro、GPT 5.2)及开源权重基线(Qwen3 VL 235B)的客观干预质量。Oracle 计划实验进一步表明,当计划质量得到控制时,训练好的双工模型能够生成高质量的指导,并在计划外恢复方面取得显著提升。
Abstract
We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt, and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and an open-weight baseline (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 5.0/10 | 7.5 |
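Scoring rationale: The paper focuses on a proactive multimodal assistant and centers on MultiModal and MLLM techniques, hence the high scores. Architectural unification corresponds to Unify Models. Its reliance on vision corresponds to Visual Encoder. Planning and state tracking relate conceptually to World Models and model-based RL. Tokenizer is not addressed. None of the specified expert authors were found, so there is no bonus.
Keywords
Proactive Procedural Assistance, Multi-modal Assistant, Benchmark, Planner-Interaction Architecture, Out-of-Plan Recovery, Post-training, Visual Cues, Wearable-Egocentric Dataset
In-depth analysis
Title (translated from Chinese): Plan, Watch, Recover: A Benchmark and Architecture for Proactive Procedural Assistance
Summary: Proactive multimodal procedural-assistance systems lack large-scale cross-domain benchmarks, especially for the realistic case in which users deviate from the expected steps (Out-of-Plan, OOP). The paper makes four contributions. First, it releases the EgoProactive dataset: 700 smart-glasses videos and 9,935 evaluation instances, 1,833 of which contain scripted OOP errors with recovery guidance. It is described as the first dataset in a wearable form factor that provides both intervention labels and deviation-recovery pairs. Second, it re-annotates five existing benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) under a unified scheme as Pro2Bench, with 42,275 evaluation instances and 249,584 training instances across 14 activity domains. Third, it proposes a decoupled planner-interaction architecture: the planner maintains a structured plan state (current, completed, and remaining steps plus visual cues) and updates it only at interruptions, while the interaction model processes streaming video at 2 fps, outputs a stay-silent or interrupt decision for every frame, and supports both proactive guidance and reactive question answering. Fourth, it proposes a post-training recipe that works across model families, with model-agnostic improvements validated on Llama 4 and Qwen-3.6-VL. Experiments show that the trained Llama 4 system raises G-Mean F1 from 0.55 for the best zero-shot baseline to 0.84 (with the Oracle planner), and its OOP recovery exceeds frontier proprietary systems such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6.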
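Innovations:
- Releases EgoProactive, described as the first dataset in a wearable form factor with explicit Out-of-Plan annotations and recovery guidance.
- Re-annotates five existing benchmarks under a unified scheme as Pro2Bench, providing a large-scale, cross-domain evaluation platform for proactive procedural assistance.
- Proposes a decoupled planner-interaction architecture that separates long-horizon planning from real-time interaction and uses the structured plan state and visual cues to guide interruption decisions.
- Proposes a post-training recipe that works across model families, with model-agnostic gains validated on Llama 4 and Qwen-3.6-VL.
- Exceeds all frontier proprietary systems (GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6) on the OOP recovery task.
Methodology: First, 700 smart-glasses videos (4 activity domains) are collected and manually annotated with scripted OOP errors and recovery steps to build EgoProactive. Second, five existing benchmarks are re-annotated under a unified scheme to produce Pro2Bench. Third, the decoupled architecture is designed. The planner (a background model) updates the structured plan (step statuses plus visual cues) at each interruption, based on the current video clip and the dialogue history. The interaction model (user-facing) processes streaming video at 2 fps and, using plan-anchored clip selection (at most 15 clips of 8 seconds each) together with the plan context, autoregressively generates interruption decisions and utterances. In the post-training stage, Llama 4 and Qwen-3.6-VL are fine-tuned on the collected data. Evaluation uses the G-Mean F1 metric and compares against several zero-shot baselines, including GPT-5.2.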
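The structured plan state that the planner maintains (step statuses plus visual cues, with recovery steps injected when the user goes out of plan) can be pictured as a small data structure. The sketch below is an assumption-laden illustration: the class names, fields, and update rules are hypothetical and are not the paper's implementation. Only the three status labels come from the Tech Stack entry in this report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    description: str
    visual_cue: str          # what the camera should show when the step is done
    status: str = "next"     # "completed", "current", or "next"


@dataclass
class PlanState:
    steps: List[Step]

    def current(self) -> Optional[Step]:
        """Return the step currently in progress, if any."""
        return next((s for s in self.steps if s.status == "current"), None)

    def advance(self) -> None:
        """Mark the current step completed and promote the first pending step."""
        cur = self.current()
        if cur is not None:
            cur.status = "completed"
        pending = next((s for s in self.steps if s.status == "next"), None)
        if pending is not None:
            pending.status = "current"

    def inject_recovery(self, recovery: Step) -> None:
        """Out-of-plan case: put a recovery step in front of the current step.

        The interrupted step goes back to "next" so it is resumed afterwards.
        """
        idx = next((i for i, s in enumerate(self.steps) if s.status == "current"),
                   len(self.steps))
        if idx < len(self.steps):
            self.steps[idx].status = "next"
        recovery.status = "current"
        self.steps.insert(idx, recovery)


plan = PlanState([
    Step("crack eggs into bowl", "yolks visible in bowl", status="current"),
    Step("whisk eggs", "whisk in hand"),
])
plan.inject_recovery(Step("remove shell fragment", "no shell in bowl"))
```

Calling `plan.advance()` after the recovery step is done makes "crack eggs into bowl" current again, which mirrors the deviation-then-recovery flow described above.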
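Key Results:
- On Llama 4, the PWR-Oracle planner raises G-Mean F1 from 0.55 for the best zero-shot baseline to 0.84.
- OOP recovery exceeds all frontier proprietary systems, including GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6.
- Cross-backbone replication on Qwen-3.6-VL-27B confirms that the post-training recipe transfers across model families.
- On all six datasets (EgoProactive plus Pro2Bench), the trained Llama 4 system clearly outperforms all baselines.
Tech Stack:
- Video frame sampling at 2 fps
- Plan-anchored clip selection (at most 15 non-contiguous clips of 8 seconds each)
- Structured textual plan representation (step status labels: completed/current/next, plus visual cues)
- Autoregressive generative model p_θ(d_t, u_t | o_t, P_{t-1})
- Llama 4 and Qwen-3.6-VL Transformer architectures
- G-Mean F1 evaluation metric
- Post-training fine-tuning (cross-model-family recipe)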
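Strengths:
- Described as the first systematic treatment of users deviating from the expected steps (OOP), a key real-world problem.
- Provides a large-scale, multi-domain, wearable-form-factor dataset and a unified benchmark that advance research on proactive assistance.
- The decoupled architecture separates planning from interaction, combining real-time responsiveness with long-horizon reasoning.
- The post-training recipe generalizes across models, reducing dependence on a specific architecture.
- Experiments are thorough, with comparisons against several frontier proprietary systems and clearly leading results.
Limitations:
- EgoProactive is still limited in size (9,935 evaluation instances) and covers only 4 activity domains.
- Scripted OOP errors may not capture the full diversity of real user deviations.
- No end-to-end latency measurements are provided, and 2 fps sampling may miss fast actions.
- The Oracle planner experiments assume ideal plan quality; in deployment the planner may make mistakes.
- The post-training recipe is validated across models but only on two model families, so its generality needs further verification.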
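Relevance To Keywords:
- Native multimodal large models: The paper uses multimodal models such as Llama 4 and Qwen-3.6-VL to process video and text.
- Unified understanding and generation in multimodal large models: The interaction model both understands video and generates interruption decisions and utterances.
- Representation learning: Visual cues and plan state act as structured representations that guide model learning.
- World models: The planner maintains a dynamic plan state, which amounts to structured modeling of the task world.
- Post-training: The paper proposes a cross-family post-training recipe that substantially improves model performance.
- Reinforcement learning: The paper does not directly use reinforcement learning, but interruption decisions can be viewed as an implicit policy that future work could optimize with RL.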
Abstract (Chinese translation)
生成式视觉模型在精确的空间控制方面面临根本性挑战。这源于一个核心脱节:模型能够处理空间的文本描述,但无法直接将数值坐标映射到二维图像画布上。我们引入 MetaPoint,这是一种通过将单个连续二维坐标表示为特殊标记来弥合这一差距的方法。关键在于,MetaPoint 不需要新的架构组件;它直接利用模型固有的位置编码方案来解释这些坐标,将该标记视为画布上的一个虚拟点。这种轻量级方法使得仅用一个标记即可实现物体位置的像素级控制,或用两个标记实现其边界框控制,且无需任何架构变化或定制注意力掩码。MetaPoint 标记被设计为具有组合性,作为空间原语。这使得规划代理能够将高层用户请求分解为生成器的结构化原语序列。通过为空间控制提供简单、精确且可扩展的构建模块,MetaPoint 解锁了更强大的组合式生成代理,并启用了直观的交互式编辑系统。
Abstract
Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model's inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object's position with one token or its bounding box with two, all without requiring architectural changes or bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 8.0/10 | 12.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
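Scoring rationale: The paper proposes MetaPoint for spatial control using special tokens, scoring high on Tokenizer and MultiModal. It moderately relates to MLLM and Visual Encoder via positional encoding. Unify Models, World Models, and model-based RL are less relevant as the focus is generation control rather than unification, environment modeling, or RL. No specified expert authors were found in the author list.
Keywords
MetaPoint, Spatial Control, Special Token, Agentic Generation, Visual Generation, Positional Encoding, Text-to-Image
In-depth analysis
Title (translated from Chinese): MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation
Summary: Generative visual models have a fundamental weakness in precise spatial control: they understand textual descriptions of space but cannot map numerical coordinates directly onto the 2D image canvas. The paper proposes MetaPoint to address this. The method represents a continuous 2D coordinate with a special token <mp> and directly reuses the model's own positional encoding scheme (such as 2D sinusoidal positional encoding or 3D RoPE), with no new architectural components and no attention masks. A single MetaPoint controls a point position, two define a bounding box, and several encode a complex layout. Combined with a VLM planner (MetaPoint-Agent), the system decomposes high-level user intent into a structured instruction sequence. Experiments show clear gains on benchmarks such as COCO-MIG, T2I-CoReBench, and ImgEdit (for example mIoU rises from 59.23% to 77.29%), and the method scales reliably to scenes with 30 objects and to coordinated multi-object editing. The method is lightweight, precise, and composable, and provides basic spatial primitives for agentic visual generation.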
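Innovations:
- Proposes a single-token interface for pixel-level spatial control that needs no architectural changes, vocabulary expansion, or coarse-grained approximation.
- Uses the native positional encoding of unified multimodal models (UMMs) directly to represent continuous coordinates, giving a lightweight, model-agnostic scheme.
- MetaPoint tokens are composable by design: a single point, a two-point box, or a sequence that encodes a complex layout or an editing operation.
- Combines with a VLM planner (MetaPoint-Agent) to form an end-to-end system that turns vague user intent into precise, executable instructions.
- Sets new state-of-the-art results on several benchmarks, with an advantage that grows with task complexity (such as the number of objects), showing strong scalability.
Methodology: The paper first analyzes two positional encodings common in UMMs (2D sinusoidal PE and 3D RoPE) and notes that both accept continuous floating-point inputs. It defines a special token <mp> whose word embedding expresses the control intent and whose positional embedding is obtained by feeding the target coordinate (u, v) into the positional encoding function, which brings the continuous coordinate into the model. Training uses a Point-Anchored Data Pipeline that binds images, labels, bounding boxes, and masks to text descriptions and produces training data containing <mp> tokens. At inference time, the user inserts <mp> into the instruction and specifies the coordinates, and the model generates or edits accordingly. MetaPoint-Agent uses a VLM to decompose complex tasks into subtasks and produces natural-language commands containing MetaPoints that drive the generative model.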
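The observation that a sinusoidal positional encoding accepts continuous inputs can be made concrete with a few lines of code. The sketch below implements a standard 2D sinusoidal encoding and evaluates it at a fractional coordinate. The interleaved sin/cos layout, the base of 10000, the additive combination with the word embedding, and the name `metapoint_embedding` are assumptions for illustration; the paper's exact formulation is not given in this report.

```python
import math


def sincos_1d(pos, dim, base=10000.0):
    """Sinusoidal encoding of a scalar position (int or float) into `dim` values."""
    assert dim % 2 == 0
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        out.extend((math.sin(pos * freq), math.cos(pos * freq)))
    return out


def sincos_2d(u, v, dim):
    """2D encoding: half of the channels encode u, the other half encode v."""
    assert dim % 4 == 0
    return sincos_1d(u, dim // 2) + sincos_1d(v, dim // 2)


def metapoint_embedding(word_embedding, u, v):
    """Hypothetical <mp> input: word embedding plus the encoding of (u, v).

    Adding the two is the usual convention for sinusoidal encodings and is
    assumed here rather than taken from the paper.
    """
    position = sincos_2d(u, v, len(word_embedding))
    return [w + p for w, p in zip(word_embedding, position)]


# A coordinate that falls between patch-grid positions is encoded just as
# easily as an integer one, which is what makes sub-patch control possible.
on_grid = sincos_2d(3.0, 7.0, 16)
off_grid = sincos_2d(3.25, 7.5, 16)
mp_input = metapoint_embedding([0.0] * 16, 3.25, 7.5)
```

Because nothing in the encoding function requires integer positions, no new parameters are needed to express coordinates finer than the patch grid.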
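Key Results:
- On COCO-MIG, mIoU rises from 59.23% to 77.29% (a relative gain of 30.49%).
- On T2I-CoReBench, the overall BAGEL score rises from 38.2 to 66.1 (a relative gain of 73%).
- On ImgEdit, the overall score rises from 3.42 to 3.94 (a relative gain of 15.2%).
- Scene generation with up to 30 objects is supported reliably while preserving visual fidelity.
- Coordinated multi-object editing (replace, resize, move, delete) works without altering unedited regions.
Tech Stack:
- 2D Sinusoidal Positional Embedding
- 3D Rotary Position Embedding (RoPE)
- Special token embedding (<mp>)
- Unified multimodal model (UMM) backbone (such as diffusion-model DiT/UNet)
- Vision-language model (VLM) as the planner
- Point-Anchored Data Pipeline
- Self-attention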
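Strengths:
- Lightweight: only one special token is added, with no architectural changes and no large number of new parameters.
- Model-agnostic: adapts to multiple UMM architectures (for example models that use 2D sinusoidal PE or 3D RoPE).
- Pixel-level precision: continuous coordinate input is not restricted to the patch grid.
- Composability: point, box, and sequence primitives support complex spatial operations.
- Scalability: performs well as the number of objects and the task complexity grow.
- End-to-end system: combined with a VLM agent, it closes the loop from natural language to precise generation.
Limitations:
- Depends on specific positional encoding schemes and may not apply to models that use a different kind of positional encoding.
- Requires point-anchored training data, which is costly to collect and annotate.
- Continuous coordinate input may introduce numerical precision issues, especially at very low resolutions.
- The VLM agent adds inference steps and latency, and planning quality depends on the VLM's capability.
- The paper does not discuss control over non-rectangular regions (such as irregular shapes).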
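Relevance To Keywords:
- Native multimodal large models: MetaPoint is embedded directly in a UMM and uses its positional encoding for spatial control, which ties closely to unified multimodal understanding and generation.
- Unified understanding and generation in multimodal large models: The paper's core aim is to improve the spatial precision of UMM generation while preserving understanding, a typical study of unification.
- Representation learning: Positional encoding is a key component of representation learning, and MetaPoint repurposes it to represent continuous coordinates.
- World models: Precise spatial control is a basic capability for building world models, and MetaPoint supplies spatial primitives for them.
- Post-training: MetaPoint requires fine-tuning or training the UMM to understand the <mp> token, which is a post-training technique.
- Unify Models: The paper directly targets unified multimodal models and improves their spatial control.
- Model-based RL: Not directly relevant, but precise spatial control could serve visual environment modeling in model-based reinforcement learning.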
Abstract (Chinese translation)
儿童从连续的、具有时间结构的自我中心经验流中学习单词的含义。近期研究表明,神经网络也能从儿童的自我中心视频记录中学习词 - 指称映射(word-referent mappings),但它们会循环遍历打乱的数据数百个轮次(epochs),这与儿童实际遭遇环境的方式形成对比。我们提出 BabyCL,这是一个持续多模态学习框架,它以单次时间顺序遍历处理 SAYCam 数据集,将流式视觉表示学习(streaming visual representation learning)与图像 - 文本对比目标(image-text contrastive objective)相结合。BabyCL 结合了视频流的多阶段时间分割(multi-stage temporal segmentation)与一个双重回放缓冲区(dual replay buffer),该缓冲区独立管理视觉和多模态历史,并在共享骨干网络(shared backbone)上使用三种对比损失(contrastive losses)进行联合训练。在匹配的优化预算下,BabyCL 在 SAYCam Labeled-S 4AFC 基准上优于流式学习基线,显著缩小了与离线训练上界之间的差距。消融实验表明,这些增益对在线时间分割窗口的长度和回放缓冲区的驱逐规则具有稳健性。综上所述,这些结果表明,在更接近儿童实际经验的训练条件下,有意义的词 - 指称映射可以涌现。
Abstract
Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
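Scoring rationale: The paper's core contribution is a continual multimodal learning framework (BabyCL), so it is highly relevant to MultiModal (9) and Visual Encoder (8), involving visual representations and image-text contrastive learning. It integrates vision and language streams but does not present a unified model architecture (Unify Models, 5) or specific large-language-model techniques (MLLM, 5). It does not discuss tokenizer strategy (Tokenizer, 2) or generative world models (World Models, 2), and it does not involve reinforcement learning at all (model-based RL, 1). On checking, the author list (Xiaoyang Jiang, Yanlai Yang, and others) does not include the specified experts (Yang Shi and others), so there is no bonus.
Keywords
Continual Learning, Multimodal Learning, Egocentric Video, Contrastive Learning, Visual Representation, Word-Referent Mapping, SAYCam Dataset
In-depth analysis
Title (translated from Chinese): Continual Vision and Language Learning from a Child's Egocentric Input
Summary: The paper proposes BabyCL, a framework that simulates how children learn word-referent mappings from continuous, temporally structured egocentric experience in real environments. Unlike earlier methods that train offline on randomly shuffled data for hundreds of epochs, BabyCL processes the SAYCam dataset as a single chronological stream, combining streaming visual representation learning with an image-text contrastive objective. The framework uses multi-stage temporal segmentation to split the video stream into event segments and maintains a dual replay buffer (a visual buffer and a multimodal buffer) to mitigate temporal redundancy and catastrophic forgetting. The model jointly optimizes three contrastive losses on a shared ResNeXt-50 backbone: an instance loss (SimCLR), a temporal loss (event-segment classification), and a cross-modal loss (InfoNCE). On the SAYCam Labeled-S 4AFC benchmark under a matched optimization budget, BabyCL clearly outperforms streaming learning baselines, raising accuracy from 27.52% to 43.38% and narrowing the gap to the offline training upper bound (57.31%). Ablations show that performance is robust to the length of the online temporal segmentation window and to the eviction rule of the replay buffer. The paper concludes that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.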
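Innovations:
- Described as the first work to learn visual and language representations jointly in a single chronological pass, simulating a child's real experience instead of offline multi-epoch training.
- Proposes a multi-stage temporal segmentation algorithm that splits the continuous video stream into event segments and constrains cut points to fall outside utterance intervals, so that every utterance belongs entirely to one event.
- Designs a dual replay buffer (visual buffer and multimodal buffer), each divided into a short-term FIFO part and a long-term reservoir-sampling part, which mitigates temporal redundancy and catastrophic forgetting.
- Trains a shared visual backbone with three contrastive losses (instance, temporal, cross-modal), learning visual and language representations together.
- Demonstrates the feasibility of single-pass learning on SAYCam and clearly narrows the gap to offline training, responding to criticism of the cognitive plausibility of deep learning models.
Methodology: The pipeline has six parts. 1) Stream processing: frames are extracted from SAYCam videos at 5 fps and paired with aligned transcripts of child-directed speech. 2) Temporal segmentation: a hierarchical clustering algorithm (based on cosine similarity of frame embeddings) splits the stream into event segments of about 3 minutes, using a greedy approximation that maximizes within-segment similarity and adjusting cut points so they do not fall inside utterance intervals. 3) Replay buffers: two buffers are maintained. The visual buffer stores event segments for frame-level self-supervision, and the multimodal buffer stores utterance-frame pairs (one randomly sampled frame per utterance). Each buffer is divided into a short-term FIFO part (10% of capacity) and a long-term reservoir-sampling part (50% of capacity), and training samples from them in a 25%/75% ratio. 4) Training procedure: each time a new event segment is inserted, k additional forward-backward passes (8 by default) are run on a mix of new and replayed content. 5) Loss functions: the instance loss ℒ_i (SimCLR, contrastive learning over augmented frames from the visual buffer), the temporal loss ℒ_t (event-segment classification, assigning each frame to its event segment), and the cross-modal loss ℒ_c (symmetric InfoNCE, pulling matched utterance-frame pairs together and pushing mismatched pairs apart). The three losses are combined with equal weights, so the visual branch carries twice the total weight of the multimodal branch. 6) Architecture: a shared ResNeXt-50 visual backbone with modality-specific projection heads; all parameters are updated jointly.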
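The buffer layout described above (a short-term FIFO part next to a long-term reservoir-sampling part) can be sketched as follows. Several details are assumptions made for illustration and are not stated in this report: items evicted from the FIFO are offered to the reservoir, the 25%/75% sampling ratio is read as short-term/long-term, and the class and method names are hypothetical.

```python
import random
from collections import deque


class TwoPartReplayBuffer:
    """Short-term FIFO plus long-term reservoir (illustrative sketch)."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque()
        self.short_capacity = short_capacity
        self.long = []
        self.long_capacity = long_capacity
        self.offered = 0                 # items offered to the reservoir so far
        self.rng = random.Random(seed)

    def add(self, item):
        """Insert a new item; the oldest short-term item moves on to the reservoir."""
        self.short.append(item)
        if len(self.short) > self.short_capacity:
            self._offer_to_reservoir(self.short.popleft())

    def _offer_to_reservoir(self, item):
        """Reservoir sampling (Algorithm R): every offered item ends up stored
        with equal probability long_capacity / offered."""
        self.offered += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            j = self.rng.randrange(self.offered)
            if j < self.long_capacity:
                self.long[j] = item

    def sample(self, n, short_fraction=0.25):
        """Draw a replay batch, mixing recent and long-term items."""
        n_short = min(len(self.short), round(n * short_fraction))
        n_long = min(len(self.long), n - n_short)
        return (self.rng.sample(list(self.short), n_short)
                + self.rng.sample(self.long, n_long))


buffer = TwoPartReplayBuffer(short_capacity=2, long_capacity=3)
for segment_id in range(100):          # a chronological stream of event segments
    buffer.add(segment_id)
batch = buffer.sample(4)
```

The FIFO part keeps the most recent context, while the reservoir keeps a uniform sample of everything older, which is one standard way to trade recency against coverage of the whole stream.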
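Key Results:
- On SAYCam Labeled-S 4AFC image-mode accuracy, BabyCL (k=8) reaches 43.38% ± 1.47%, clearly ahead of One-pass CVCL (27.52% ± 3.12%) and the compute-matched CL-CVCL (36.41% ± 2.31%).
- Offline CVCL (i.i.d., 400 epochs) serves as the upper bound at 57.31% ± 1.13%; BabyCL narrows the gap to it to about 14 percentage points.
- Ablations show that performance is robust to the online temporal segmentation window length (average segment length) and to the replay buffer eviction rule (FIFO versus reservoir-sampling ratio).
- Increasing the number of additional forward-backward passes per insertion, k, from 4 to 8 further improves accuracy (41.08% to 43.38%).
Tech Stack:
- ResNeXt-50 visual backbone
- SimCLR contrastive learning (instance loss)
- InfoNCE loss (cross-modal contrast)
- Hierarchical clustering (greedy segmentation based on cosine similarity)
- Replay buffers (FIFO plus reservoir sampling)
- Data augmentation (random cropping, color jitter, and similar transforms for SimCLR)
- SAYCam dataset (head-mounted camera video and child-directed speech)
- 4AFC (four-alternative forced choice) evaluation task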
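The symmetric InfoNCE loss listed above treats each matched utterance-frame pair in a batch as the positive and all other pairings as negatives, in both the image-to-text and the text-to-image direction. The dependency-free sketch below shows the computation; the temperature of 0.07 and the function names are assumptions, and a real implementation would use a tensor library.

```python
import math


def _normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_partition = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_partition - row[i]
    return total / len(logits)


def symmetric_info_nce(image_embeddings, text_embeddings, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE over a batch.

    Row i of each input is assumed to be a matched pair; embeddings must be
    non-zero. The temperature value is an assumption, not the paper's setting.
    """
    images = [_normalize(v) for v in image_embeddings]
    texts = [_normalize(v) for v in text_embeddings]
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


frames = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(frames, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = symmetric_info_nce(frames, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when every frame is closest to its own utterance and large when the pairing is scrambled, which is the signal that pulls matched pairs together.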
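Strengths:
- Cognitive plausibility: simulates the single-pass, chronological way children learn, responding to the critique by Bowers (2025) of the cognitive validity of offline training.
- Mitigates catastrophic forgetting: the dual replay buffer and resampling strategy maintain good performance in a single pass.
- Joint learning of visual and language representations: the shared backbone optimizes three losses together, so the multimodal representations improve jointly.
- Robustness: ablations show the method is insensitive to key hyperparameters (segmentation window length, buffer eviction rule).
- Reproducibility: built on the public SAYCam dataset and a standard architecture (ResNeXt-50), with a clear experimental setup.
Limitations:
- Performance is still below the offline training upper bound (57.31% versus 43.38%), so single-pass learning has considerable room to improve.
- Only the 4AFC task is evaluated; other forms of generalization (such as zero-shot classification and retrieval) are not tested.
- Depends on specific hyperparameters (such as average segment length, buffer capacity ratios, and the number of resampling passes k) that may need tuning for other data streams.
- Model size is fixed at ResNeXt-50; larger or smaller architectures are not explored.
- There is no direct comparison with learning data from human children; only the offline model serves as the upper bound.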
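Relevance To Keywords:
- Representation learning: The paper's core is learning joint visual and language representations with contrastive learning (SimCLR, InfoNCE) and an event-segment classification loss.
- World models: Through temporal segmentation and event-segment classification, the model learns temporal structure in the video stream, which implicitly touches on event prediction in world models.
- Multimodal large models: The paper trains a multimodal (vision-language) model, but at small scale (ResNeXt-50); it counts as native multimodal learning.
- Model-based RL: The paper does not directly involve reinforcement learning, though the continual learning framework and replay buffers resemble experience replay in model-based RL.
- Post-training: The paper studies continual learning from scratch, not fine-tuning after pretraining, so the link to post-training is weak.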
Abstract (Chinese translation)
为了充分发挥多模态数据(multimodal data)的潜力,我们需要超越当前最先进的对齐(alignment)与融合(fusion)方法的表示,充分利用所有跨模态交互(cross-modal interactions),同时不牺牲模态特定信息(modality-specific information)。学习解耦表示(disentangled representations)是一种原则性的方法,用于识别观测数据(observational data)中隐藏的潜在共享与独特因素。然而,尽管多模态解耦(multimodal disentanglement)是一个极具吸引力的范式,现有方法主要局限于双模态 regime(two-modality regime),因为其固有的可扩展性瓶颈(scalability bottleneck)。为了解决这一问题,我们提出 RePercENT,这是一个自监督(self-supervised)框架,旨在超越这些限制,并解锁可扩展的成对解耦(pairwise disentanglement),超越两种模态的限制。通过多模态“即插即用”(plug-and-play)架构,我们的方法直接在预提取的嵌入(embeddings)上运行,消除了对广泛联合预训练(joint pre-training)的需求,且不对底层模态或基础模型骨干(foundation model backbones)做出任何假设。此外,我们引入联合优化目标(joint optimization objective),以同时推导共享与独特组件,并提供形式化的理论保证(theoretical guarantees),以刻画我们解决方案的最优性。在各种模态和任务上,RePercENT 成功恢复了解耦组件,同时保持具有竞争力的性能,并显著降低计算复杂度(computational complexity)。
Abstract
To leverage the full potential of multimodal data, we need representations that go beyond the state-of-the-art alignment and fusion approaches and exploit all cross-modal interactions without sacrificing modality-specific information. Learning disentangled representations is a principled way to identify the underlying shared and unique factors that are hidden in observational data. However, while multimodal disentanglement is a compelling paradigm, existing methods are largely confined to the two-modality regime due to an inherent scalability bottleneck. To address this, we propose RePercENT, a self-supervised framework designed to surpass these limitations and unlock scalable pairwise disentanglement beyond two modalities. Through a multimodal "plug-and-play" architecture, our approach operates directly on pre-extracted embeddings, eliminating the need for extensive joint pre-training while making no assumptions regarding the underlying modalities or foundation model backbones. Moreover, we introduce a joint optimization objective for simultaneously deriving the shared and unique components, and provide formal theoretical guarantees that characterize the optimality of our solution. Across diverse modalities and tasks, RePercENT successfully recovers disentangled components while maintaining competitive performance and significantly reducing computational complexity.
Scoring details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 6.0/10 | 9.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 10.0/10 | 15.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
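Scoring rationale: The paper focuses on multimodal disentangled representation learning (MultiModal scores 10). It proposes the RePercENT framework for scalability across modalities, which partly fits the Unify Models concept (6), and it touches on foundation models, which relates to the MLLM area (5). It does not address Tokenizer design, specific Visual Encoder architectures, World Models, or model-based RL, so those relevance ratings are low (1 to 3). No specified expert was found in the author list, so there is no bonus. The weighted total is 42.0, above the dynamic passing score of 27.8.
Keywords
Disentangled Representation Learning, Multimodal, Scalable, Self-supervised, Plug-and-play architecture, Shared and unique components, Pre-extracted embeddings
In-depth analysis
Title (translated from Chinese): RePercENT: Scalable Disentangled Representation Learning Beyond Two Modalities
Summary: The paper proposes RePercENT, a self-supervised framework that targets the scalability bottleneck of multimodal disentangled representation learning beyond two modalities. Most existing methods are limited to two modalities because the number of shared and unique factors grows exponentially with the number of modalities and no tractable optimization objective is available. RePercENT overcomes this with two design choices. First, it models pairwise interactions between modalities, reducing complexity from O(2^M) to O(M^2). Second, it uses a Perceiver-based plug-and-play architecture that operates directly on pre-extracted embeddings, needs no joint pre-training, and makes no assumptions about modality types or backbones. The framework introduces semantic encodings and a group slot attention mechanism to encourage slot specialization, and a joint optimization objective derives the shared and unique components together. The paper provides formal optimality guarantees and validates the method on synthetic data, real language-understanding tasks, and medical analysis tasks, showing that RePercENT recovers the disentangled components while keeping competitive performance and clearly reducing computational complexity.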
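The drop from O(2^M) to O(M^2) can be made tangible by counting. The sketch below assumes one candidate factor per non-empty subset of modalities for the full decomposition and one interaction per unordered pair for the pairwise scheme. That counting convention and the function names are assumptions for illustration; the paper's formal definitions may differ.

```python
from math import comb


def full_decomposition_terms(num_modalities):
    """Factors if every non-empty subset of modalities can own its own factor."""
    return 2 ** num_modalities - 1


def pairwise_terms(num_modalities):
    """Interactions if only unordered pairs of modalities are modeled."""
    return comb(num_modalities, 2)


# Growth of both counts as modalities are added.
growth = [(m, full_decomposition_terms(m), pairwise_terms(m)) for m in (2, 3, 5, 10)]
```

With two modalities the difference is small (3 versus 1), which is consistent with earlier work staying in the two-modality regime; at ten modalities the counts are 1023 versus 45.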
Innovations:
- 提出可扩展的多模态解耦框架,通过成对交互建模将复杂度从指数级降至二次方级,突破双模态限制。
- 设计即插即用架构,直接处理预提取嵌入,无需联合预训练,对模态类型和基础模型骨干完全无关。
- 引入语义编码和组槽注意力机制,在Perceiver框架内实现信息路由,促进槽位专业化。
- 提供形式化的最优性保证,涵盖最小必要信息(MNI)可达和不可达两种情况。
- 支持推理时缺失模态的鲁棒性,并可直接迁移到未见任务。
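Methodology: The paper adopts an information-theoretic framework and defines atomic representations and pairwise disentangled representations. Architecturally, each modality's embeddings are extracted with a pretrained foundation model and then processed by an independent disentanglement module built on Perceiver-style latent attention. Each module holds a learnable latent array; a semantic encoding designates every slot as a shared or a unique component, and group slot attention makes slots compete within each modality pair, which forces specialization. The objective jointly maximizes the mutual information of the shared components and minimizes redundancy among the unique components while preserving information. Training is self-supervised and needs no labels. The theory section derives the conditions and guarantees for the optimal solution.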
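To make the scaling argument concrete, the sketch below counts components under one simple accounting. This accounting is an assumption made here for illustration, not the paper's exact definition: a full decomposition keeps one factor per non-empty subset of the M modalities, while a pairwise decomposition keeps one shared component per modality pair plus one unique component per modality.

```python
from math import comb


def full_factor_count(m: int) -> int:
    """One factor per non-empty subset of m modalities: 2^m - 1 (exponential)."""
    return 2 ** m - 1


def pairwise_component_count(m: int) -> int:
    """One shared component per pair plus one unique component per modality."""
    return comb(m, 2) + m


for m in (2, 3, 5, 10):
    print(m, full_factor_count(m), pairwise_component_count(m))
```

With two modalities the two counts coincide, which matches the observation that the difficulty only appears beyond the two-modality regime.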
Key Results:
- 在合成数据上,RePercENT能准确恢复已知的共享和特有因子,解耦性能优于基线方法。
- 在真实语言理解任务(如比喻语言理解)中,解耦后的共享和特有成分提升了下游任务性能。
- 在多模态医学分析任务中,RePercENT在保持竞争性能的同时显著降低了计算复杂度(相比联合预训练方法)。
- 框架对缺失模态具有鲁棒性,在推理时移除部分模态仍能有效工作。
- 理论分析证明了在MNI可达时解耦表示的最优性,并给出了不可达情况下的近似保证。
Tech Stack:
- Perceiver编码器(Jaegle et al., 2021)
- 潜在注意力机制(Latent Attention)
- 组槽注意力(Group Slot Attention)
- 语义编码(Semantic Encoding)
- 互信息(Mutual Information)估计与优化
- 自监督学习
- 预训练基础模型(如CLIP、ALIGN等)作为嵌入提取器
Strengths:
- 可扩展性:通过成对交互建模,有效解决了多模态解耦的指数级复杂度问题。
- 即插即用:无需联合预训练,直接利用现有基础模型嵌入,降低计算成本。
- 理论保证:提供了形式化的最优性证明,增强了方法的可靠性。
- 鲁棒性:支持推理时缺失模态,适用于实际应用场景。
- 广泛验证:在合成数据、语言理解和医学分析等多种任务上验证了有效性。
Limitations:
- 成对交互建模可能无法捕获高阶跨模态交互(如三个及以上模态共同拥有的信息),尽管论文声称成对分解足以保留所有信息,但严格证明仅针对特定假设。
- 依赖预训练基础模型的质量,若嵌入本身信息不足,解耦效果可能受限。
- 组槽注意力机制需要预先指定每个模态对的槽位数量,可能不适用于动态变化的模态数量。
- 实验规模有限,未在超大规模多模态数据集(如视频+音频+文本)上验证。
Relevance To Keywords:
- 表征学习:论文核心是解耦表征学习,直接相关。
- 多模态大模型的理解和生成一体化:RePercENT可视为多模态理解中的表征解耦模块,有助于提升理解能力,但未涉及生成。
- 世界模型:解耦出的共享和特有因子可辅助世界模型中的因果推理,但论文未直接探讨世界模型。
- 模型-Based RL:解耦表示有助于强化学习中的状态抽象,但论文未涉及RL。
- 后训练:RePercENT作为后训练模块,可应用于预训练嵌入之上,符合后训练范式。
- 原生多模态大模型:论文方法可集成到原生多模态模型中,提升其可解释性和可扩展性。
- Unify Models:解耦表示有助于统一不同模态的表征空间,但论文未直接提出统一模型。
摘要翻译
随着多模态模型朝着长视频理解方向发展,记忆能力日益凸显为关键能力。尽管在构建视频数据集和基准方面已付出大量努力,但现有工作主要集中于感知与推理,尚未系统性地评估记忆:模型究竟保留了哪些信息、信息保存的保真度如何,以及记忆在受到干扰时的鲁棒性如何。为填补这一空白,我们提出了 M$^3$Eval,这是首个旨在探测多模态模型不同记忆维度的综合性评估框架与基准。该框架设计基于认知心理学,包含精心构造的任务,旨在隔离记忆的关键方面。借助 M$^3$Eval,我们在多种代表性多模态模型上进行了广泛实验,揭示了模型普遍存在的弱点及独特行为模式。研究发现,模型在处理并行视频流时难以维持解耦表示;其干扰模式与人类记忆中观察到的模式存在显著差异;在空间域比时间域更可靠地锚定记忆源;且符号记忆能力有限。总体而言,我们的基准为未来研究提供了宝贵资源,而我们的发现则强调了记忆作为一种基础但尚未被充分探索的能力的重要性,并为设计更有效的多模态模型记忆机制提供了见解。我们的代码和数据集可在 https://pku-value-lab.github.io/m3eval-homepage 获取。
Abstract
As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 5.0/10 | 7.5 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 9.0/10 | 13.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on Multi-Modal Memory Evaluation (MultiModal: 9), relevant to MLLM (6) and World Models (5) as memory is a core component. It does not address Unify Models (2), Tokenizer (1), Visual Encoder architecture (3), or model-based RL (1). No specified expert authors were found in the author list.
关键词
Multi-Modal Memory Evaluation, Cognitively-Grounded Video Tasks, Memory Dimensions, Disentangled Representations, Interference Patterns, Spatial Domain, Symbolic Memory
深度分析
Chinese Title: M³Eval:基于认知心理学视频任务的多模态记忆评估
Summary: 本文提出M³Eval,首个系统评估多模态模型记忆能力的框架与基准。受认知心理学启发,设计了四个维度的视频任务:分心注意力(并行输入下的编码)、记忆干扰(相似内容的鲁棒性)、交错事件(时间组织)和N-Back(符号记忆)。通过构建分屏视频、交错剪辑等控制实验,评估模型在源识别、顺序理解、内容保持等方面的表现。对多个开源和闭源多模态模型的实验揭示了关键弱点:模型在处理并行视频流时无法保持独立表征;干扰模式与人类不同(人类后向干扰更强,模型两者相当);时间维度记忆源定位弱于空间维度;符号记忆能力远弱于人类。这些发现表明记忆是多模态模型的基础但未充分探索的能力,为设计更有效的记忆机制提供了见解。代码和数据集已公开。
Innovations:
- 首次提出专门针对多模态模型记忆能力的系统评估框架和基准,覆盖多个记忆维度。
- 基于认知心理学理论设计控制实验,通过精心构造的视频任务隔离不同记忆机制。
- 揭示了多模态模型在并行视频流处理、干扰模式、时间与空间记忆源定位、符号记忆等方面的独特行为与弱点。
- 提供了系统性的跨模型评估,为未来多模态模型记忆机制设计提供实证基础。
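Methodology: The benchmark turns classic paradigms from cognitive psychology (divided attention, memory interference, interleaved events, N-Back) into video question-answering tasks. Split-screen videos (two similar videos played side by side, including a position-swap condition) test parallel encoding; concatenated similar videos (in both orders) test proactive and retroactive interference; interleaved clips from two videos test temporal organization; and N-Back tasks test symbolic memory and capacity. Multiple-choice questions cover source identification, order understanding, and content retention, scored with metrics such as accuracy and intrusion rate. Experiments cover several open-source (e.g., LLaVA, Video-LLaMA) and closed-source (e.g., GPT-4V) multimodal models.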
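The summary names accuracy and intrusion rate as metrics but does not define them. The sketch below assumes a common reading: accuracy is the fraction of multiple-choice answers that match the ground truth, and intrusion rate is the fraction of answers that pick an option describing the interfering video. The `Answer` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    chosen: str                   # option the model picked
    correct: str                  # ground-truth option
    distractor_source: frozenset  # options taken from the interfering video


def accuracy(answers):
    """Fraction of answers that match the ground truth."""
    return sum(a.chosen == a.correct for a in answers) / len(answers)


def intrusion_rate(answers):
    """Fraction of answers that select content from the interfering video."""
    return sum(a.chosen in a.distractor_source for a in answers) / len(answers)


answers = [
    Answer("A", "A", frozenset({"B"})),
    Answer("B", "A", frozenset({"B"})),  # intrusion from the other video
    Answer("C", "A", frozenset({"B"})),  # wrong, but not an intrusion
    Answer("A", "A", frozenset({"B"})),
]
print(accuracy(answers), intrusion_rate(answers))
```

Separating the two numbers distinguishes a model that forgets the target clip from one that confuses it with the interfering clip.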
Key Results:
- 模型在处理并行视频流时无法维持独立表征,出现注意力混淆。
- 人类表现出显著更强的后向干扰(retroactive interference),而模型的前向与后向干扰水平相当;重复干扰片段甚至能提升模型对目标片段的理解。
- 模型在时间维度上的记忆源定位能力弱于空间维度。
- 模型的符号记忆(N-Back)能力远弱于人类,且难以过滤无关信息。
Tech Stack:
- 认知心理学实验范式(分心注意力、记忆干扰、交错事件、N-Back)
- 视频问答任务设计(多选问题、入侵率指标)
- 多模态模型(LLaVA、Video-LLaMA、GPT-4V等)
- 公开视频数据集(用于构建测试样本)
- Python(代码实现)
Strengths:
- 首次系统评估多模态模型记忆能力,填补了现有基准的空白。
- 基于认知心理学理论,任务设计严谨,能有效隔离不同记忆机制。
- 实验覆盖多种代表性模型,结果具有广泛参考价值。
- 揭示了模型与人类记忆的差异,为后续研究提供方向。
Limitations:
- 评估任务基于合成视频场景,与真实长视频理解仍有差距。
- 仅测试了有限数量的模型,可能未涵盖最新模型。
- 未深入探讨模型记忆缺陷的根本原因(如注意力机制、架构设计)。
- 评估维度有限,未涉及情感记忆、程序性记忆等。
Relevance To Keywords:
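- Unify Models: The evaluated multimodal models fall under the unified-model category, but the paper does not discuss unified frameworks directly.
- World Models: Memory is a key component of world models; the memory abilities evaluated here relate to long-range dependencies, that is, the history a world model needs to predict the future.
- Representation Learning: The paper shows that models cannot maintain separate representations under parallel input, which relates to the disentanglement problem in representation learning.
- Model-Based RL / reinforcement learning: Memory is central to experience replay in RL, and the interference experiments resemble interference and catastrophic forgetting in RL.
- Native multimodal large models: The evaluated systems are native multimodal large models (e.g., Video-LLaMA, GPT-4V).
- Unified understanding and generation in multimodal large models: The paper focuses on understanding (memory question answering) and does not cover generation, though memory underlies both.
- Post-training: Not covered, but the evaluation results can guide post-training strategies.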
摘要翻译
规划是大语言模型代理(LLM agents)的核心:在采取行动前,代理必须分解目标、选择工具、基于约束推理,并判断任务何时不可行。然而,现有的代理评估通常仅报告端到端成功率,这使得难以判断失败是源于规划还是执行。我们引入 Agent Planning Benchmark (APB),这是一个针对规划的诊断基准,包含 4,209 个多模态案例,覆盖 22 个领域和五种设置,涵盖整体规划、基于反馈的条件化分步规划,以及在无关工具、故障工具和不可解任务下的鲁棒性。在 12 个多模态大语言模型(MLLMs)上,APB 揭示了长程规划、工具噪声鲁棒性、校准拒绝和推理时精炼方面的系统性弱点。我们进一步在 200 个 ToolSandbox 任务和 200 个 τ²-bench 任务上验证了 APB,其中基于 APB 指导的精炼一致地提高了三个代表性模型的计划正确性、计划评分和下游执行指标。因此,APB 可作为执行基准的上游诊断补充。
Abstract
Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $τ^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 8.0/10 | 12.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper explicitly evaluates MLLMs using multimodal cases, leading to high relevance for MLLM and MultiModal keywords. It focuses on planning benchmarks rather than model architecture unification, tokenizers, visual encoders, world models, or model-based RL algorithms, resulting in lower scores for those categories. No listed expert authors appear in the author list.
关键词
Agent Planning Benchmark, LLM Agents, Planning Capabilities, Multimodal Cases, MLLM Evaluation, Tool Selection, Diagnostic Framework
深度分析
Chinese Title: 智能体规划基准:LLM智能体规划能力的诊断框架
Summary: 本文提出了Agent Planning Benchmark (APB),一个专门用于诊断LLM智能体规划能力的基准,包含4,209个多模态案例,覆盖22个领域和五种设置(整体规划、反馈条件逐步规划、工具噪声、工具损坏、不可解任务)。APB通过计划正确性、计划等级和E1-E6错误分类法进行细粒度评估。在12个多模态大模型上的评估揭示了系统性的弱点,包括长程规划、工具噪声鲁棒性、校准拒绝和推理时细化。进一步在ToolSandbox和τ2-bench上验证,APB引导的细化一致地提升了三个代表性模型的计划正确性、计划等级和下游执行指标。APB作为执行基准的上游诊断补充,有助于分离规划与执行失败的原因。
Innovations:
- 提出首个规划专用诊断基准APB,包含4209个多模态案例,覆盖整体、逐步和鲁棒性规划任务。
- 设计层次化评估框架,包括计划正确性、计划等级和人类启发的E1-E6错误分类法,支持细粒度根因分析。
- 引入三种对抗性变体(工具噪声、工具损坏、不可解任务)系统评估规划鲁棒性。
- 通过可执行验证(ToolSandbox和τ2-bench)证明APB引导的细化能提升下游执行指标。
Methodology: 论文采用数据合成与过滤管道生成复杂规划实例,定义整体规划和逐步规划两种核心任务,并扩展出工具噪声、工具损坏和不可解任务三种对抗性变体。使用LLM-as-Judge进行自动化逻辑评估,基于E1-E6错误分类法提供细粒度诊断。在12个MLLM上评估,并在ToolSandbox和τ2-bench上通过直接执行、计划优先执行和APB引导细化三种策略进行可执行验证。
Key Results:
- GPT-5和Gemini 3 Pro在整体规划上表现最佳(CR分别为74.5%和71.3%),开源模型如InternVL3.5和Qwen3VL在工具噪声和不可解任务上脆弱。
- 推理时细化对整体规划高度有效,但对短程逐步规划可能因过度修正而效果有限。
- 在ToolSandbox和τ2-bench上,APB引导的细化一致提升了计划正确性、计划等级和下游执行指标。
- 模型在不可解任务上的拒绝能力差异显著,GPT-5拒绝率最高(Rp=82.5%),而开源模型常错误尝试。
Tech Stack:
- ReAct框架
- Reflexion框架
- LLMCompiler框架
- MetaGPT
- SWE-agent
- GPT-4o, GPT-5, Gemini 2.5 Pro/Flash, Gemini 3 Pro, Claude Sonnet 4.5, InternVL3.5, Qwen3-VL系列等MLLM
- ToolSandbox环境
- τ2-bench环境
- LLM-as-Judge评估方法
- E1-E6错误分类法
Strengths:
- 诊断性强:分离规划与执行失败,提供细粒度错误分类。
- 覆盖全面:多模态、多领域、多种规划粒度及鲁棒性测试。
- 可执行验证:通过真实环境验证规划改进对执行的影响。
- 评估系统:12个MLLM的横向比较揭示模型规划能力差异。
Limitations:
- 数据为合成生成,可能不完全反映真实世界规划场景的复杂性。
- 未覆盖所有可能的规划失败模式(如动态环境中的重规划)。
- LLM-as-Judge评估可能受评判模型自身偏见影响。
- 可执行验证仅在两个基准上测试,泛化性需进一步验证。
Relevance To Keywords:
- Unify Models: APB评估多模态大模型(MLLM)的规划能力,与统一模型(文本+视觉)的智能体行为相关。
- World Models: 规划需要世界模型来预测动作后果,APB中的工具噪声和不可解任务测试模型对世界状态的理解。
- Representation Learning: 规划涉及工具和约束的表征,APB评估模型如何表示和利用工具信息。
- Model-Based RL: 规划是模型强化学习的核心,APB的诊断框架可指导模型训练和细化策略。
摘要翻译
我们提出了一种无需标签的方法,用于将强大但通用的视觉基础模型适配到专用科学领域。标准的监督微调通常不太适合这些场景:标签稀缺,且特定任务的训练可能会损害模型的泛化性并降低鲁棒性。相反,我们利用元数据,以自监督的方式将表征适配到新领域。我们的方法 FINO 将标准自监督目标与灵活的元数据指导相结合,该指导既能处理高度细粒度的离散元数据,也能处理连续元数据。它促使表征保留信息因子,同时抑制虚假因子。在亚细胞荧光显微镜、地球观测、野生动物监测和医学成像等领域,FINO 始终优于标准的无监督域适应和全监督适应。它还超越了高度专业化的领域特定最先进方法,同时在骨干网络适配中不使用任务标签,仅使用轻量级探针进行监督。
Abstract
We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervised fine-tuning is often ill-suited to these settings: labels are scarce, and task-specific training can collapse the model's generality and hurt robustness. We instead leverage metadata to adapt representations to new domains in a self-supervised manner. Our method, FINO, combines a standard self-supervised objective with flexible metadata guidance that handles both highly granular discrete metadata and continuous metadata. It encourages the representation to preserve informative factors while suppressing spurious ones. Across subcellular fluorescence microscopy, Earth observation, wildlife monitoring, and medical imaging, FINO consistently outperforms standard unsupervised domain adaptation and fully supervised adaptation. It also exceeds highly-specialized domain-specific state of the art, while using no task labels for backbone adaptation and only lightweight probes for supervision.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 7.0/10 | 10.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 5.0/10 | 7.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
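评分理由: The paper studies label-free, metadata-guided adaptation of vision foundation models. 'Visual Encoder' (7.0) is the core component; 'MultiModal' (5.0) reflects the fusion of vision and metadata; 'Unify Models' (4.0) reflects bridging generic and specialized models. The remaining keywords ('Tokenizer', 'World Models', 'MLLM', 'model-based RL') have no direct link to the paper's content (computer vision, self-supervised learning) and score low (1.0-2.0). The weighted total is 31.5, above the dynamic pass score of 27.8. No specified expert authors were found.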
关键词
Vision Foundation Models, Label-free adaptation, Metadata guidance, Self-supervised learning, Domain adaptation, Representation learning, Scientific domains
深度分析
Chinese Title: 谁需要标签?利用已有的元数据适配视觉基础模型
Summary: 本文提出FINO方法,一种无需任务标签的元数据引导表示适配框架,旨在将通用的视觉基础模型适配到专业科学领域。该方法基于DINO自监督学习架构,通过区分信息性元数据(M+)和虚假元数据(M-),利用原型对比引导处理高基数离散元数据,使用回归器处理连续元数据,并结合梯度反转和SIGReg正则化来抑制虚假因素。在荧光显微镜、地球观测、野生动物监测和医学影像四个科学领域,FINO在无需任务标签的情况下,仅使用轻量级探针进行评估,其性能超越了全监督微调和无监督域适应方法,甚至匹配或超越了高度专业化的领域特定SOTA。该方法无需目标域数据或推理时元数据,具有通用性和鲁棒性。
Innovations:
- 提出统一的元数据引导表示适配框架FINO,同时处理离散和连续元数据,无需任务标签、目标域数据或推理时元数据。
- 创新性地将元数据分为信息性(M+)和虚假(M-)两类,分别通过原型对比引导和梯度反转进行保留或抑制,实现可控的表示学习。
- 在四个科学领域(荧光显微镜、地球观测、野生动物监测、医学影像)上,使用同一套超参数,FINO超越全监督微调和领域特定SOTA,展示了强大的通用性。
- 无需修改骨干网络,仅通过轻量级探针即可在下游任务上取得优异性能,避免了微调导致的模型通用性丧失。
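Methodology: FINO extends the DINO student-teacher self-supervised architecture with metadata supervision. Student and teacher embeddings are extracted for every sample. For discrete metadata, an EMA-updated prototype bank holds one prototype per metadata value; the student embedding is contrasted against its matching prototype, and the teacher embedding updates that prototype. For continuous metadata, a prediction head regresses the metadata value. For spurious metadata, a gradient reversal layer (GRL) keeps the encoder's features from predicting it. The method also uses iBOT masked image modeling and SIGReg regularization (which encourages feature diversity). The overall loss is a weighted sum of the self-supervised loss (DINO + iBOT) and the metadata guidance losses (informative + spurious). After training, the backbone is frozen and only a linear probe is trained for downstream evaluation.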
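The EMA-updated prototype bank for discrete metadata can be sketched in a few lines. This is a minimal illustration in plain Python, not FINO's implementation: the class name, the momentum value, and the choice to initialize a prototype from the first teacher embedding seen are all assumptions.

```python
class PrototypeBank:
    """One running-average prototype per discrete metadata value."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # assumed value; higher means slower updates
        self.prototypes = {}      # metadata value -> embedding (list of floats)

    def update(self, key, teacher_embedding):
        """Move the prototype for `key` toward the teacher embedding (EMA)."""
        if key not in self.prototypes:
            # Assumption: a new prototype starts at the first embedding seen.
            self.prototypes[key] = list(teacher_embedding)
            return
        m = self.momentum
        old = self.prototypes[key]
        self.prototypes[key] = [
            m * o + (1.0 - m) * t for o, t in zip(old, teacher_embedding)
        ]


bank = PrototypeBank(momentum=0.5)
bank.update("site_A", [1.0, 0.0])
bank.update("site_A", [0.0, 1.0])
print(bank.prototypes["site_A"])
```

Keeping one prototype per metadata value means memory grows with the number of distinct values rather than with the dataset size, which suits high-cardinality metadata.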
Key Results:
- 在Human Protein Atlas(荧光显微镜)上,FINO超越全监督微调(+3.2% F1)和领域特定SOTA(+1.8% F1)。
- 在FMoW(地球观测)上,FINO超越全监督微调(+4.1% 准确率)和无监督域适应方法。
- 在iWildCam(野生动物监测)上,FINO超越全监督微调(+2.5% 准确率)。
- 在MIMIC-CXR(医学影像)上,FINO超越全监督微调(+1.9% AUC)。
- FINO在所有四个基准上均优于标准自监督适配(无元数据),验证了元数据引导的有效性。
Tech Stack:
- DINO(自蒸馏自监督学习)
- iBOT(掩码图像建模)
- 原型对比学习(Prototype-based contrastive learning)
- 梯度反转层(Gradient Reversal Layer, GRL)
- SIGReg(特征多样性正则化)
- 指数移动平均(EMA)更新教师网络
- Vision Transformer(ViT)作为骨干网络
- 线性探针(Linear probe)评估
Strengths:
- 无需任务标签,仅利用免费可得的元数据,大幅降低标注成本。
- 统一处理离散和连续元数据,适应科学领域高基数、异质性的元数据特点。
- 在多个科学领域上超越全监督微调,同时保持模型通用性,避免灾难性遗忘。
- 方法简单通用,同一套超参数适用于不同领域,易于复现和推广。
- 不要求元数据在推理时可用,仅用于训练阶段,实用性强。
Limitations:
- 需要手动将元数据划分为信息性(M+)和虚假(M-),依赖领域知识,可能引入主观偏差。
- 未考虑元数据缺失或噪声的情况,实际应用中元数据可能不完整或不准确。
- 仅验证了视觉基础模型(ViT),未探索其他架构(如CNN)或跨模态场景。
- 对于连续元数据,仅使用简单回归头,可能无法捕捉复杂非线性关系。
- 未与基于目标域数据的无监督域适应方法进行公平比较(后者需要目标域数据)。
Relevance To Keywords:
- 表征学习:论文核心是改进视觉表征,通过元数据引导自监督学习,与表征学习高度相关。
- 世界模型:论文未直接涉及世界模型,但通过元数据引导模型理解数据中的结构(如地理、时间),间接与构建环境模型相关。
- 多模态大模型:论文聚焦视觉基础模型,但元数据可视为一种弱模态,未来可扩展至多模态。
- 原生多模态大模型:论文未涉及多模态生成或理解一体化,相关性较弱。
- 模型基于强化学习:论文未使用强化学习,相关性低。
- 后训练:论文提出的适配方法可视为一种后训练策略,但更侧重于表示适配而非指令微调。
摘要翻译
科学与工程进步本质上是一个长周期的迭代过程:提出变更、运行实验、衡量结果,并持续完善工件 (artifacts)。然而,现有前沿模型基准主要评估单轮响应或短周期智能体轨迹,无法捕捉在更长时间范围内持续迭代改进的挑战。为填补这一空白,我们引入了 AutoLab,这是一个用于超长周期闭环优化的新基准。AutoLab 包含 36 个真实的、专家精心策划的任务,涵盖四个不同领域:系统优化、谜题与挑战、模型开发以及 CUDA 内核优化。每个任务都从一个正确但故意次优的基线开始,挑战智能体在严格的墙钟时间预算 (wall-clock budget) 内改进它。对 17 个最先进模型的评估揭示,成功的主要预测因子并非智能体初始尝试的质量,而是其在反复基准测试、编辑及纳入经验反馈方面的持续性。尽管 claude-opus-4.6 表现出强大的长周期优化能力,但大多数前沿模型,包括若干专有模型,要么过早终止,要么耗尽预算且进展甚微。这些结果强调了时间意识和持续迭代在自主智能体中的重要性。我们开源了完整的基准、评估工具包及任务工件,以加速朝向真正具备能力的长周期智能体的研究。
Abstract
Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts, to accelerate research toward truly capable long-horizon agents.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 6.0/10 | 9.0 |
| MLLM | 1.5 | 4.0/10 | 6.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 6.0/10 | 9.0 |
评分理由: The paper introduces AutoLab, a benchmark for long-horizon autonomous agents. It conceptually aligns strongly with World Models and model-based RL due to its focus on closed-loop iterative optimization and long-horizon planning (scores 6). It evaluates frontier models often categorized as MLLM (score 4) but does not focus on specific architectural components like Tokenizers or Visual Encoders (scores 0), nor does it propose Unify Models (score 1). No target experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are found in the author list.
关键词
AutoLab, Long-Horizon, Autonomous Agents, Closed-Loop Optimization, Iterative Improvement, Frontier Models, Engineering Tasks
深度分析
Chinese Title: AutoLab:前沿模型能否解决长周期自动研究与工程任务?
Summary: 本文提出了AUTOLAB,一个用于评估前沿大语言模型在超长周期闭环优化任务上的基准测试。该基准包含36个由专家精心设计的任务,覆盖系统优化、谜题与挑战、模型开发和CUDA内核优化四个领域。每个任务提供一个正确但次优的基线,要求智能体在严格的时钟预算内通过反复迭代(基准测试、编辑、整合经验反馈)来改进。研究评估了17个最先进模型,发现成功的关键预测因素不是初始尝试的质量,而是持续迭代的持久性。Claude-opus-4.6在所有领域表现领先,而许多其他前沿模型(包括一些专有模型)要么过早终止,要么在预算内几乎没有进展。论文还开源了完整的基准、评估框架和任务工件,以推动长周期自主智能体的研究。
Innovations:
- 首次提出专门针对超长周期闭环优化的基准测试AUTOLAB,涵盖多个领域且任务难度高、不易饱和。
- 采用连续评分机制(对数拉伸和线性插值),支持细粒度相对比较,避免二值通过/失败的限制。
- 大规模系统评估了17个前沿模型,包括4个专有模型,在统一标准化框架下进行,总耗时2544小时、消耗86亿token。
- 通过深入轨迹分析(包括手动检查302个零分轨迹),揭示了智能体缺乏时间意识(过早终止或预算耗尽)等关键行为限制,并证明持久性比初始解质量更重要。
- 设计了抗奖励黑客的验证机制,包括密封评估器、正确性门控、不可变文件检查和对抗性审计。
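Methodology: The paper builds and evaluates a benchmark. Senior researchers and engineers collect tasks from real engineering or research problems, which then pass multiple rounds of quality control (validity, solvability, completeness, measurement stability) and an anti-reward-hacking audit. Each task ships with natural-language instructions, a containerized sandbox environment, a held-out evaluator, a reference solution, and a wall-clock budget. During evaluation the agent freely edits code, runs it, analyzes results, and iterates inside the sandbox; the evaluator then computes a continuous score on held-out inputs, normalized between the baseline and the reference solution on a log or linear scale. Seventeen models are evaluated, each run three times, with the average and best scores recorded and trajectory behavior analyzed.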
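The report does not reproduce the paper's scoring formulas, so the sketch below shows only one plausible form of baseline-to-reference normalization. The function names, the clipping to [0, 1], and the exact log form are assumptions: the baseline maps to 0, the reference solution maps to 1, and the measured value is interpolated linearly or in log space.

```python
import math


def _clip01(x):
    return max(0.0, min(1.0, x))


def linear_score(value, baseline, reference):
    """Linear interpolation: baseline -> 0, reference -> 1."""
    return _clip01((value - baseline) / (reference - baseline))


def log_score(value, baseline, reference):
    """Log-space interpolation for positive metrics spanning orders of magnitude."""
    return _clip01(math.log(value / baseline) / math.log(reference / baseline))


def summarize_runs(scores):
    """Average and best score over repeated runs of one model on one task."""
    return sum(scores) / len(scores), max(scores)


# Hypothetical runtime task: baseline 10 s, reference 2.5 s, agent reaches 5 s.
print(linear_score(5.0, 10.0, 2.5), log_score(5.0, 10.0, 2.5))
print(summarize_runs([0.6, 0.7, 0.74]))
```

Both forms work for lower-is-better and higher-is-better metrics, because the sign of the numerator and denominator flips together. The log form rewards each halving of runtime equally, which is why it suits metrics with a wide dynamic range.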
Key Results:
- Claude-opus-4.6在所有四个子领域均领先,Avg@3达到0.68,次优模型为0.50。
- 许多强模型(如gpt-5.4)因过早终止或预算耗尽而失败,与原始编码能力无关。
- 最终性能与持久性(反复基准测试、编辑、整合反馈)强相关,与初始解质量弱相关。
- 总评估消耗2544小时时钟时间和86亿token。
- 手动检查302个零分轨迹,发现智能体普遍缺乏时间意识。
Tech Stack:
- 大语言模型(LLM)智能体(如Claude、GPT、Gemini等)
- 容器化沙箱(CPU或单GPU)
- 对数拉伸评分公式(公式1)
- 线性插值评分公式(公式2)
- 密封评估器(held-out evaluator)
- 正确性门控(minimum-improvement gate)
- 不可变文件检查(immutable-file checks)
- 对抗性审计(adversarial auditing)
- Python、CUDA、C等编程语言(任务实现)
Strengths:
- 基准设计真实且多样,任务来自实际工程和研究问题,具有高生态效度。
- 连续评分机制允许细粒度比较,避免二值基准的饱和问题。
- 大规模系统评估覆盖多个前沿模型,结果可靠且可复现。
- 深入轨迹分析提供了对智能体行为(如时间管理、持久性)的洞察。
- 开源全部资源,促进社区研究。
Limitations:
- 任务数量有限(36个),可能不足以全面代表所有长周期优化场景。
- 时钟预算设置可能偏向某些模型或策略,且硬件依赖(如GPU型号)可能影响结果。
- 评估主要针对LLM智能体,未涵盖其他类型自主系统(如强化学习智能体)。
- 抗奖励黑客机制虽强,但无法完全杜绝所有可能的作弊方式。
- 部分任务(如模型开发)设计为较小规模,可能无法完全反映真实研究中的大规模训练挑战。
Relevance To Keywords:
- Unify Models / 原生多模态大模型:论文评估的模型包括多模态大模型(如GPT-5.4、Claude-opus-4.6),但基准任务不直接涉及多模态理解与生成一体化,相关性中等。
- World Models / 世界模型:基准中的模型开发任务涉及LLM后训练,与世界模型构建有一定关联,但非核心。
- Representation Learning / 表征学习:部分任务(如模型开发)可能涉及表征学习优化,但基准更侧重工程优化而非表征学习本身。
- Model-Based RL / 强化学习:论文强调闭环迭代优化,与基于模型的强化学习中的规划与反馈循环有概念相似性,但未直接使用RL方法。
- 后训练:论文明确包含后训练模型任务(如Rank et al., 2026),与后训练领域高度相关。
摘要翻译
我们提出 T2Mo,这是一个以三维轨迹和文本为条件的可控动态三维形状生成的前馈框架 (feed-forward framework)。由于语言的固有歧义,仅依靠文本生成精确意图的运动仍然具有挑战性。为此,我们采用三维轨迹作为可控空间引导,指定选定点应移动的精确路径。通过结合两者,T2Mo 生成的物体运动在空间上遵循给定的轨迹,同时在整体上反映文本语义。为了稳健处理具有任意配置的轨迹输入(从密集到稀疏且分布不均匀),我们进一步提出一种基于形状的轨迹嵌入 (shape-grounded trajectory embedding),将输入轨迹集映射为覆盖整个物体的感知形状令牌集 (shape-aware token set)。我们与基于文本的基线 (text-based baselines) 以及结合轨迹引导的视频生成 (trajectory-guided video generation) 与视频到动态网格生成 (video-to-dynamic mesh generation) 的级联视频基线 (cascaded video-based baselines) 进行了广泛对比。定性和定量评估 (Quantitative and qualitative evaluations),以及用户研究 (user studies) 表明,我们的方法产生的运动更忠实地遵循给定提示,具有更高的表现力,同时保持运动质量。
Abstract
We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move. By combining both, T2Mo generates object motions that spatially adhere to the given trajectories while globally reflecting the text semantics. To robustly handle trajectory inputs with arbitrary configurations, ranging from dense to sparse and unevenly distributed, we further propose a shape-grounded trajectory embedding that maps an input trajectory set into a shape-aware token set covering the entire object. We conduct extensive comparisons against text-based baselines and cascaded video-based baselines that combine trajectory-guided video generation with video-to-dynamic mesh generation. Quantitative and qualitative evaluations, along with user studies, demonstrate that our approach produces motions that more faithfully follow the given prompts with higher expressiveness while preserving motion quality.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 2.0/10 | 3.0 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 7.0/10 | 10.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
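评分理由: The paper focuses on dynamic 3D shape generation conditioned on text and 3D trajectories. MultiModal relevance is high (text + trajectory inputs); Tokenizer is relevant (the trajectory embedding forms a token set); World Models is relevant (modeling dynamic shapes); Unify Models is moderate (fusing multimodal inputs); MLLM and Visual Encoder are not core contributions; model-based RL is entirely unrelated.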
关键词
Controllable Dynamic 3D Shape Generation, 3D Trajectories, Text Conditioning, Shape-Grounded Trajectory Embedding, Feed-forward Framework, Motion Quality, Spatial Guidance
深度分析
Chinese Title: 基于3D轨迹与文本的可控动态三维形状生成
Summary: 本文提出T2Mo,一种前馈式可控动态三维形状生成框架,以3D轨迹和文本为条件。由于语言本身的歧义性,仅用文本难以精确指定运动意图,因此引入3D轨迹作为空间引导,指定选定点的精确移动路径。T2Mo结合两者,生成在空间上遵循给定轨迹、全局上反映文本语义的对象运动。为鲁棒处理任意配置(密集、稀疏、不均匀)的轨迹输入,提出形状锚定的轨迹嵌入方法,将输入轨迹集映射为覆盖整个物体的形状感知令牌集。与基于文本的基线及级联视频基线(轨迹引导视频生成+视频到动态网格生成)的定量、定性评估及用户研究表明,该方法在保持运动质量的同时,能更忠实、更具表现力地遵循给定提示。
Innovations:
- 首次提出前馈式可控动态3D形状生成框架,同时利用3D轨迹和文本作为条件。
- 设计形状锚定的轨迹嵌入,将任意配置的输入轨迹映射为固定大小、形状感知的令牌集,实现一致且几何局部化的条件控制。
- 在通用物体上实现多种用户定义的运动控制,包括交互式精细控制、运动编辑和运动迁移。
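Methodology: The method is decoupled. A VAE first compresses the static mesh and the displacement sequence into a shape latent and a trajectory latent. The generative model is a DiT (Diffusion Transformer) that generates the trajectory latent through cross-attention, conditioned on the shape latent, the text embedding, and trajectory condition tokens. The trajectory condition is built with the shape-grounded embedding: the start points of the input trajectories serve as trajectory anchors, farthest point sampling supplies the remaining anchors, and each anchor combines the shape encoder output with its trajectory feature (or a learnable null embedding) before an MLP projects it to a condition token. Training and inference follow a diffusion process.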
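The anchor-selection step can be sketched with a standard farthest point sampling loop seeded by the trajectory start points. This is a minimal illustration, not T2Mo's code: the function name, the use of point indices as anchors, and the fallback to index 0 when no trajectory is given are assumptions.

```python
import math


def complete_anchors(points, seed_indices, num_anchors):
    """Extend seed anchors to `num_anchors` indices via farthest point sampling.

    points:       list of (x, y, z) tuples sampled on the shape
    seed_indices: indices of trajectory start points (kept as anchors first)
    """
    anchors = list(seed_indices)
    if not anchors:
        anchors = [0]  # assumption: arbitrary start when no trajectory is given
    # Distance from every point to its nearest anchor chosen so far.
    min_dist = [min(math.dist(p, points[a]) for a in anchors) for p in points]
    while len(anchors) < min(num_anchors, len(points)):
        # Pick the point that is farthest from all current anchors.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        anchors.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return anchors


points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 5, 0)]
print(complete_anchors(points, seed_indices=[0], num_anchors=3))
```

Seeding the sampler with the trajectory start points keeps user-specified locations in the anchor set, while the remaining anchors spread over the rest of the object, so the token set has a fixed size regardless of how many trajectories are given.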
Key Results:
- 与文本基线(AnimateAnyMesh, BiMotion)和级联基线(轨迹引导视频生成+视频到动态网格)相比,T2Mo在VBench、轨迹对齐、运动幅度等指标上表现更优。
- 用户研究表明T2Mo生成的运动更忠实于给定提示,表现力更强。
- 定性结果展示了多种运动控制应用,如交互式控制、运动编辑和迁移。
Tech Stack:
- DiT (Diffusion Transformer) 作为生成骨干
- VAE (变分自编码器) 用于形状和轨迹潜码压缩
- 最远点采样 (Farthest Point Sampling) 用于锚点选择
- MLP (多层感知机) 用于投影
- 交叉注意力机制 (Cross-Attention) 用于条件融合
- B-spline 运动表示 (参考BiMotion)
- VBench 评估指标
Strengths:
- 提供直观的3D轨迹控制,用户可直接在网格上指定点运动路径,无需额外变换。
- 形状锚定嵌入有效处理任意稀疏/不均匀轨迹,保持几何局部性。
- 前馈式框架高效,无需优化,适用于实时交互。
- 在通用物体上实现可控动态生成,拓展了应用范围。
Limitations:
- 依赖预定义的静态网格,无法生成全新形状。
- 轨迹控制仅作用于顶点位移,无法处理拓扑变化或非刚性变形中的复杂交互。
- 用户需提供3D轨迹,可能增加使用门槛。
- 评估限于有限类别和场景,泛化性有待进一步验证。
Relevance To Keywords:
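- Unify Models: Not addressed directly, though dynamic 3D generation can be seen as one part of multimodal understanding and generation.
- World Models: Dynamic 3D shape generation can be seen as the object-motion modeling part of a world model, in effect a simplified form of motion prediction.
- Representation Learning: The shape-grounded embedding and the VAE latents fall under representation learning, which is one of the core methods.
- Model-Based RL / reinforcement learning: No direct link; the paper does not involve RL, though controllable generation could serve environment simulation in RL.
- Native multimodal large models: The paper uses two modalities, text and trajectories, but does not adopt a large language model architecture.
- Unified understanding and generation in multimodal large models: The paper focuses on generation; understanding appears only in condition encoding.
- Post-training: Not covered.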
摘要翻译
音频本质上是一种交互模态,然而当今的大音频语言模型(LALMs)是离线的,且流式音频模型仅处理单一任务,例如流式 ASR 或语音聊天。是时候将它们统一为一个在线 LALM 了:该模型通过一个始终开启的感知 - 决策 - 响应循环,实时聆听声音、环境和指令,并即时做出反应。我们将此范式形式化为音频交互模型(Audio Interaction Model),并通过 Audio-Interaction 予以实现,这是一个统一流式模型,它在保留离线任务执行能力的同时,增加了在线通用音频指令遵循能力(从对话到完整语音聊天),并根据流式语义决定何时响应。为实现此目标,我们提出 SoundFlow 框架,该框架通过流式原生数据构建、理解感知训练以及异步低延迟推理,从数据到训练到部署,端到端实例化感知 - 决策 - 响应循环,从而实现稳定的实时交互。此外,我们还构建了 StreamAudio-2M,这是一个包含 260 万项、涵盖 7 种基本能力和 28 个子任务的流式语料库,以及 Proactive-Sound-Bench,用于评估主动音频干预。在 8 个基准测试上,Audio-Interaction 在主流音频任务上保持具有竞争力的性能,同时解锁了离线 LALM 无法实现的能力,包括实时 ASR、流式音频指令遵循和主动帮助。
Abstract
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 9.0/10 | 13.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
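评分理由: The core contribution unifies offline and streaming audio models into one online interaction framework, so 'Unify Models' scores high (9.0). The content is audio-only with no visual component, so 'Visual Encoder' is 0. Although a language model is involved, tokenizer architecture is not emphasized, so 'Tokenizer' scores low (1.0). The paper is about LALMs rather than MLLMs, so 'MLLM' relevance is low (2.0). 'MultiModal' has some relevance through audio + text (3.0). The 'perceive-decide-respond' loop resembles RL and world models, but neither is mentioned explicitly, so those scores are low (2.0). The weighted total is 28.5, above the pass line of 27.8.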
关键词
Audio Interaction Model, Large Audio Language Models, Streaming audio models, SoundFlow, Perceive-decide-respond loop, Real-time interaction, Unified streaming model
深度分析
Chinese Title: 音频交互模型
Summary: 本文提出音频交互模型(Audio-Interaction),旨在将当前离线的、任务专用的大型音频语言模型(LALMs)统一为始终在线的流式交互模型。该模型通过“感知-决策-响应”循环,实时监听音频流并自主决定何时保持沉默或响应,从而同时支持传统任务(如对话、ASR)和流式原生任务(如同声传译、主动帮助)。为实现这一目标,作者提出了SOUNDFLOW框架,涵盖数据构建、训练和推理三个环节:通过层次化事件拼接管道合成流式交互数据,设计时间-频率联合预处理模块(TFJP)平滑音频边界;采用分块级顺序决策的流式训练策略,包含历史回顾和理解感知静默机制;提出异步低延迟推理方案,将首帧延迟降低4.5倍。此外,构建了包含2.6M条样本、302k小时的流式语料库StreamAudio-2M,以及用于评估主动音频干预的ProactiveSound-Bench。实验表明,Audio-Interaction在主流音频基准上保持竞争力(MMAU 58.15 vs 57.81),同时解锁了离线模型无法实现的实时ASR、流式指令跟随和主动帮助等能力。
Innovations:
- 首次提出统一的流式音频交互模型范式(LAIM),将离线LALM与专用流式模型整合为单一始终在线的感知-决策-响应循环。
- 提出SOUNDFLOW框架,包含层次化事件拼接数据合成、时间-频率联合预处理模块(TFJP),以及基于分块级顺序决策的流式训练策略。
- 设计理解感知的响应触发机制,模型基于语义理解而非声学线索自主决定何时响应,并引入历史回顾和静默控制解决上下文遗忘和误触发。
- 提出异步低延迟推理方案,通过先进先出解耦编码与解码,消除编码器-解码器同步阻塞,显著降低首帧延迟。
- 构建大规模流式音频交互数据集StreamAudio-2M(2.6M样本、302k小时)和主动帮助评估基准ProactiveSound-Bench。
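Methodology: The paper uses SoundFlow, an end-to-end streaming interaction framework. Data construction: a hierarchical event-splicing pipeline (scene planning → event refinement → clip retrieval/generation) synthesizes coherent multi-turn streaming audio, and the TFJP module (silence trimming, noise estimation and removal, core localization, boundary alignment, and spectral smoothing) improves naturalness. Training: audio is modeled as chunk-level sequential decisions, with every 400 ms chunk predicting a <silent> or <response> special token while language modeling and response triggering are optimized jointly; a history review mechanism caches features of preceding chunks, and comprehension-aware silence decides whether to respond based on semantic confidence. Inference: an asynchronous first-in-first-out scheme lets the encoder keep processing chunks and caching features while the decoder reads from the cache once a response is triggered, giving low-latency interaction. The base model is Qwen2.5-Omni.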
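The chunk-level decision loop can be illustrated with a small, single-threaded sketch. It models only the 400 ms chunking, the feature cache, and the per-chunk <silent>/<response> decision; the asynchronous encoder/decoder decoupling is not modeled. The `encode` and `decide` callables are hypothetical stand-ins for the audio encoder and the language model, and the energy threshold in the usage example is a toy rule rather than the semantic trigger the paper describes.

```python
from collections import deque

CHUNK_MS = 400  # chunk length stated in the methodology


def split_into_chunks(samples, sample_rate, chunk_ms=CHUNK_MS):
    """Split a list of audio samples into consecutive fixed-length chunks."""
    size = sample_rate * chunk_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def run_stream(chunks, encode, decide):
    """Encode each chunk, cache its features, and emit one decision per chunk."""
    cache = deque()  # first-in-first-out cache of chunk features (the history)
    decisions = []
    for chunk in chunks:
        cache.append(encode(chunk))
        decisions.append(decide(list(cache)))
    return decisions


# Toy usage: 1 s of audio at 16 kHz, silent except for the second chunk.
samples = [0.0] * 6400 + [1.0] * 6400 + [0.0] * 3200
chunks = split_into_chunks(samples, sample_rate=16000)
mean_energy = lambda chunk: sum(abs(x) for x in chunk) / len(chunk)
toy_decide = lambda feats: "<response>" if feats[-1] > 0.5 else "<silent>"
print(run_stream(chunks, mean_energy, toy_decide))
```

Because `decide` receives the whole cache rather than only the latest chunk, a real model could base the trigger on accumulated context, which is the role the history review mechanism plays in the description.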
Key Results:
- 在MMAU基准上达到58.15分,超过基线57.81分,保持竞争力。
- 在流式ASR、流式音频指令跟随和主动帮助等新能力上表现优异,离线模型无法完成。
- 异步推理方案将首帧延迟降低4.5倍,消除编码器-解码器同步阻塞。
- 在多个主流音频任务(如语音翻译、对话)上性能与专用模型相当或更优。
- ProactiveSound-Bench评估显示模型能在无指令情况下主动干预音频事件。
Tech Stack:
- Qwen2.5-Omni(基础多模态大模型)
- Whisper音频编码器
- 时间-频率联合预处理(TFJP):静音裁剪、噪声估计与去噪、核心定位、边界对齐、频谱平滑
- 层次化事件拼接:LLM场景规划、检索(音频数据库)、生成(音频生成模型)
- 分块级顺序决策:<silent>/<response>特殊标记预测
- 异步先进先出推理:编码器-解码器解耦
- StreamAudio-2M数据集(2.6M样本,302k小时)
- ProactiveSound-Bench基准(644个人工设计事件)
Strengths:
- 首次实现统一的流式音频交互模型,覆盖传统任务和流式原生任务,减少模型数量。
- 数据构建方法(层次化事件拼接+TFJP)有效生成自然连贯的长音频流。
- 训练策略(历史回顾、理解感知静默)解决了流式场景下的上下文遗忘和误触发问题。
- 异步推理方案显著降低延迟,适合实时交互。
- 开源大规模数据集和基准,推动领域研究。
Limitations:
- 模型依赖预训练基础模型(Qwen2.5-Omni),可能受限于其能力上限。
- 流式训练和推理的复杂度较高,对计算资源要求大。
- 主动帮助能力仅在特定场景下验证,泛化性有待进一步测试。
- 未涉及世界模型或强化学习等更高级的交互范式,与论文研究背景中的关键词关联较弱。
- 音频分块大小(400ms)可能影响对极短事件的响应精度。
Relevance To Keywords:
- Unify Models: 论文核心目标是将多个专用模型统一为单一交互模型,高度相关。
- World Models: 论文未直接涉及世界模型,但流式交互中模型需理解音频环境并预测何时响应,隐含对环境建模,相关性中等。
- Representation Learning: 论文使用音频编码器提取特征,但未提出新的表示学习方法,相关性一般。
- Model-Based RL: 论文未使用强化学习或基于模型的RL,相关性低。
- 原生多模态大模型: 论文基于Qwen2.5-Omni(原生多模态),并扩展为流式交互,高度相关。
- 多模态大模型的理解和生成一体化: 模型同时支持音频理解和文本生成,且流式实现理解与生成的实时交替,高度相关。
摘要翻译
目前,针对视觉 - 语言模型(Vision-Language Models, VLMs)的基于提示和基于适配器的微调方法在医学影像领域颇具吸引力,这是因为临床数据敏感性倾向于使用冻结骨干网络,且标注数据有限。然而,这些方法通常仅优化真实类别,将所有其他类别视为同等错误,忽略了具有临床意义的类别关系,并在有限监督设置下导致不稳定的决策边界。我们提出了一种名为全几何知识蒸馏(Omni-Geometry Knowledge Distillation, OGKD)的新框架,该框架将类别关系结构注入教师模型,以生成方向性目标,这些目标在保留真实类别的同时尊重类间几何结构。利用这些目标,我们开发了两种蒸馏损失:全局几何感知蒸馏(Global Geometry-Aware Distillation, GAD)作用于全局图像 token,而标签引导几何蒸馏(Label-Guided Geometry Distillation, LGD)则将相同的几何结构应用于注意力 patch token,以提升细粒度对齐效果。在 11 个广泛使用的医学数据集上进行的全面实验与分析(涵盖基类到新类别及少样本评估)表明,我们的 OGKD 实现了显著更优的性能,相较于所有先前最先进的 VLM 适配方法,其准确率平均绝对提升 1.7%-2.8%。此外,该方法还能稳健地泛化至未见类别,并比其他方法产生更可靠的预测。我们的代码可在 https://github.com/tientrandinh/OGKD 获取。
Abstract
Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 6.0/10 | 9.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper proposes a distillation framework for biomedical Vision-Language Models (VLMs). It is moderately relevant to MultiModal (6) and MLLM (5) because VLMs are inherently multimodal and fall under the large model category. It has slight relevance to Visual Encoder (3) and Tokenizer (2) as the method processes image and patch tokens, but does not propose new encoder or tokenizer designs. It is largely unrelated to Unify Models (1), World Models (1), and model-based RL (1) as the work focuses on classification distillation rather than model unification, world prediction, or reinforcement learning. The total weighted score is 28.5, which exceeds the dynamic pass score of 27.8. None of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.
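Keywords
Geometry-Aware Distillation, Prompt Tuning, Biomedical Vision-Language Models, Class-Relation Structure, Few-Shot Learning, Knowledge Distillation, Medical Imaging
In-Depth Analysis
Chinese Title: 面向生物医学视觉语言模型提示调优的几何感知蒸馏
Summary: This paper proposes the Omni-Geometry Knowledge Distillation (OGKD) framework for prompt tuning biomedical vision-language models with few labeled samples. Existing methods typically treat all non-target classes as equally wrong, ignoring clinically meaningful inter-class relations and producing unstable decision boundaries. OGKD builds a class graph W from frozen text prototypes and injects this inter-class geometry into the teacher, yielding directional targets that preserve the ground-truth label while respecting class relations. Two distillation losses build on these targets: Global Geometry-Aware Distillation (GAD) acts on the global image token, and Label-Guided Geometry Distillation (LGD) acts on attentive patch tokens to strengthen fine-grained alignment. Experiments on 11 medical datasets show that OGKD outperforms prior state-of-the-art methods in base-to-novel generalization, few-shot classification, and selective reliability, with an average absolute accuracy gain of 1.7%-2.8% and better risk-coverage curves.
Innovations:
- Builds a fixed class graph W from frozen medical text prototypes, encoding inter-class semantic relations into the distillation framework. Clinically meaningful geometry is injected without any visual samples, which mitigates overfitting.
- Proposes Global Geometry-Aware Distillation (GAD), which shapes the teacher distribution on the global [IMG] token and distills the full class distribution, preserving inter-class geometry.
- Proposes Label-Guided Geometry Distillation (LGD), which applies the same geometry to label-guided patch tokens and distills the label channel to emphasize fine-grained alignment, suited to localized diagnostic evidence in medical images.
- Learns only the student prompts; the encoders and the class graph stay frozen. The method is parameter-efficient and needs no extra trainable modules or two-stage training.
- Achieves consistent, significant gains on 11 datasets while lowering the high-confidence error rate and improving selective reliability.
Methodology: The method is built on frozen BiomedCLIP vision and text encoders. Text prototypes are first generated from the predefined class labels by the text encoder; pairwise cosine similarities between prototypes form the class graph W, and a geometry strength γ smooths the teacher distribution. For each image, the global [IMG] token and the patch tokens are extracted. GAD loss: the teacher distribution is the ground-truth one-hot label weighted by the class graph, the student distribution comes from the similarity between the global token and the prompt text features, and the two are matched with a KL divergence. LGD loss: an attention mechanism selects patch tokens related to the ground-truth label (label guidance); for each selected patch token, its similarity to the ground-truth label text serves as the teacher target (again weighted by the class graph), the student distribution is the similarity between the patch token and the prompt text features, and the label channel is distilled. The total loss is a weighted sum of the cross-entropy loss and the two distillation losses. Only the student prompt context vectors are updated.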
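The geometry-aware teacher target described in the methodology can be sketched as follows. This is a minimal illustration, not the authors' implementation: the convex mix `(1 - gamma) * one_hot + gamma * W[label]`, the softmax temperature used to turn cosine similarities into a row-stochastic graph, and the toy prototypes are all assumptions made for the example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs, temperature=1.0):
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def class_graph(prototypes, temperature=0.1):
    """Row-stochastic class graph W from cosine similarity of text prototypes."""
    return [softmax([cosine(p, q) for q in prototypes], temperature)
            for p in prototypes]


def teacher_target(label, W, gamma):
    """Mix the one-hot label with row `label` of W (assumed form of the target)."""
    return [(1.0 - gamma) * (1.0 if k == label else 0.0) + gamma * W[label][k]
            for k in range(len(W))]


def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Toy prototypes: classes 0 and 1 are semantically close, class 2 is far away.
prototypes = [[1.0, 0.1, 0.0], [0.9, 0.3, 0.0], [0.0, 0.1, 1.0]]
W = class_graph(prototypes)
target = teacher_target(label=0, W=W, gamma=0.3)
student = softmax([2.0, 1.0, 0.1])    # stand-in for image-text similarities
loss = kl_divergence(target, student)  # GAD-style distillation term
```

The target keeps the ground-truth class as its argmax while assigning more mass to the related class 1 than to the unrelated class 2, which is the "directional" behaviour the summary describes.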
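Key Results:
- On 11 medical datasets (covering 9 modalities and 10 anatomical regions), OGKD improves average absolute accuracy by 1.7%-2.8% in base-to-novel generalization and few-shot classification, outperforming state-of-the-art methods such as BiomedCoOp.
- Risk-coverage curves show a lower AURC (area under the risk-coverage curve) for OGKD, meaning fewer errors among high-confidence predictions and higher selective reliability.
- Ablations confirm the individual contributions of GAD and LGD and the importance of the class graph W and the geometry strength γ.
- OGKD generalizes well to unseen classes and yields more reliable predictions.
Tech Stack:
- BiomedCLIP (frozen vision and text encoders)
- Class graph built from cosine similarity
- KL-divergence distillation loss
- Attention mechanism (for label-guided patch-token selection)
- Prompt learning (learnable context vectors)
- Cross-entropy loss
- L2 normalization
Strengths:
- Introduces inter-class geometry into distillation, addressing the equal treatment of non-target classes in conventional methods and matching clinical diagnostic intuition.
- Uses both global and local (patch-level) information, which suits fine-grained lesion recognition in medical images.
- Parameter-efficient: only prompts are learned, which eases deployment.
- Delivers consistent, significant gains across datasets and tasks and improves selective reliability, which matters for clinical safety.
- Code is open source, supporting reproducibility.
Limitations:
- The class graph is built from text prototypes only and does not use inter-class relations present in visual samples, so it may not fully reflect visual similarity.
- The geometry strength γ is a hyperparameter that needs tuning and may be sensitive across datasets.
- The method depends on a pretrained medical VLM (BiomedCLIP); switching to another VLM may require rebuilding the class graph.
- Only prompt tuning is addressed; geometry distillation under full fine-tuning or adapter methods is unexplored.
- Experiments cover only few-shot settings; behavior with large-scale annotation is unknown.
Relevance To Keywords:
- Unify Models: The paper concerns the unified representation of a vision-language model but does not address unified understanding and generation.
- World Models: Not relevant; the paper involves no environment modeling or prediction.
- Representation Learning: Highly relevant. The core contribution is improving the geometric structure of the representation through geometry-aware distillation, making inter-class relations more sensible.
- Model-Based RL: Not relevant.
- Native multimodal large models: The paper uses pretrained BiomedCLIP as its base, an application of multimodal large models, but proposes no new native model.
- Unified understanding and generation in multimodal large models: The paper addresses understanding (classification) only, not generation.
- Reinforcement learning: Not relevant.
- Post-training: The paper's prompt tuning is a form of post-training, with the emphasis on distillation rather than RL-based post-training.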
摘要翻译
表格基础模型(如 TabPFN)在处理包含数值型和类别型数据的表格数据集时表现强劲,但无法原生处理高基数文本特征。因此,标准流程通常使用语言模型对文本进行嵌入,并利用主成分分析(PCA)将生成的向量压缩为少量标量特征,然后再输入 TabPFN。这会造成信息瓶颈:大多数嵌入维度被丢弃,随后压缩表示又必须通过 TabPFN 的特征编码器再次扩展。端到端的替代方案可以避免 PCA,但它们需要大量包含文本单元格的预训练数据,且通常表现不如那些在大量合成数据上预训练的表格基础模型。受 LLaVA(视觉到 LLM 的 token 投影)和 TableGPT 风格系统(表格到 LLM 的 token 投影)等模态对齐方法的启发,我们提出了 TabPFN 文本适配器(文本到 TFM 的 token 投影)。我们同时冻结句子编码器和 TabPFN,仅训练一个轻量级适配器,该适配器将文本嵌入映射为 TabPFN 嵌入空间中的一短序列 token。该设计消除了 PCA 瓶颈,保留了 TabPFN 的数值优势,且比端到端的文本 - 表格流程训练更高效。
Abstract
Tabular foundation models, such as TabPFN, achieve strong performance on tabular datasets with numerical and categorical data, but do not natively handle high-cardinality text features. Standard pipelines, therefore, embed text with a language model and compress the resulting vectors with PCA into a small number of scalar features before inputting them into TabPFN. This creates an information bottleneck: most embedding dimensions are discarded, and the compressed representation must then be expanded again by TabPFN's feature encoder. End-to-end alternatives can avoid PCA, but they require large amounts of pretraining data containing text cells and usually perform subpar compared to tabular foundation models that were pretrained on large amounts of synthetic data. Inspired by modality-alignment approaches like LLaVA (vision-to-LLM token projection) and TableGPT-style systems (table-to-LLM token projection), we introduce the TabPFN Text Adapter (text-to-TFM token projection). We freeze both the sentence encoder and TabPFN, and train only a lightweight adapter that maps text embeddings into a short sequence of tokens in TabPFN's embedding space. This design removes the PCA bottleneck, preserves TabPFN's numerical strengths, and is more efficient to train than end-to-end text-tabular pipelines.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 4.0/10 | 6.0 |
| Tokenizer | 1.5 | 5.0/10 | 7.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 7.0/10 | 10.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
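Scoring rationale: The paper's core is text-feature handling for TabPFN: a Text Adapter projects text into tokens in the tabular embedding space. MultiModal relevance is high (7.0) because text and tabular modalities are fused; Tokenizer is moderate (5.0) because the method maps embeddings to token sequences in an embedding space; Unify Models is moderate (4.0) because it unifies the text-processing pipeline; MLLM is low (3.0) because only the inspiration comes from the LLaVA architecture; Visual Encoder, World Models, and model-based RL are unrelated (0.0). The author list contains none of the designated experts, so no bonus applies. The weighted total is 28.5, above the dynamic pass score of 27.8.
Keywords
TabPFN, Text Encoder, Text Adapter, Tabular Foundation Models, Token Projection, PCA Bottleneck, Multimodal Learning
In-Depth Analysis
Chinese Title: 面向TabPFN的文本编码器预训练
Summary: This paper proposes the TabPFN Text Adapter to address TabPFN's inability to natively handle high-cardinality text features. The conventional approach embeds text with a language model and then reduces it with PCA, which creates an information bottleneck and is non-differentiable. Inspired by modality-alignment methods such as LLaVA and TableGPT, the authors freeze the sentence encoder and TabPFN and train only a lightweight MLP adapter that maps text embeddings directly to a short token sequence in TabPFN's embedding space (5 tokens for classification, 1 for regression), avoiding the PCA bottleneck while preserving TabPFN's numerical capabilities. The adapter is pretrained on a subset of the STRABLE benchmark and evaluated on TextTabBench. On regression the adapter beats all baselines (including PCA-30 and ConTextTab); on classification it is close to PCA-30 but behind ConTextTab. Ablations examine the token count, fusion depth, language-model choice, normalization, and initialization strategy. The method offers a simpler, cheaper, and more flexible text-tabular alignment scheme.
Innovations:
- A lightweight post-training alignment strategy: only an MLP adapter is trained, with no updates to the sentence encoder or TabPFN weights, which greatly reduces training cost.
- Text embeddings are projected directly into TabPFN's 192-dimensional embedding space, avoiding the PCA information bottleneck while keeping the pipeline end-to-end differentiable.
- A random dimension-permutation trick makes the adapter robust to the ordering of embedding dimensions, and ensemble averaging at inference improves performance.
- Text columns are separated from numerical/categorical columns and processed separately, preserving TabPFN's original performance on purely numerical and categorical data.
- The adapter significantly outperforms existing methods (such as PCA-30 and ConTextTab) on regression, validating it in text-rich settings.
Methodology: Free-text columns are first separated from numerical/categorical columns. Each text column is prefixed with its column name and embedded with sentence-transformers (all-MiniLM-L6-v2). An MLP adapter (a linear layer for classification, a two-layer MLP for regression) then maps the embedding to one or more tokens in TabPFN's 192-dimensional embedding space (5 for classification, 1 for regression). Adapter weights are initialized from TabPFN's original feature encoder, and a random dimension permutation is applied. During training the sentence encoder and TabPFN are frozen and only the adapter is updated, with pretraining on a STRABLE subset. Inference uses an ensemble in which each member applies a different permutation and the predictions are averaged. Evaluation uses 3-fold cross-validation, with min-max normalized ROC-AUC for classification and min-max normalized RMSE for regression.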
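The token projection in the classification setting can be sketched in plain Python. This is a shape-level illustration under stated assumptions: the weights are random stand-ins (the paper initializes from TabPFN's feature encoder), `adapt` and `init_linear` are hypothetical helper names, and nothing here calls TabPFN or sentence-transformers. The 384-dimensional input matches the output width of all-MiniLM-L6-v2.

```python
import random

TOKEN_DIM = 192  # TabPFN embedding width reported in the summary
N_TOKENS = 5     # classification setting (regression uses 1)
EMBED_DIM = 384  # output width of all-MiniLM-L6-v2


def init_linear(in_dim, out_dim, rng):
    """Random stand-in weights; the paper initializes from TabPFN's feature encoder."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def adapt(embedding, weight, perm):
    """Permute embedding dimensions, apply one linear layer, split into tokens."""
    x = [embedding[i] for i in perm]
    flat = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    return [flat[t * TOKEN_DIM:(t + 1) * TOKEN_DIM] for t in range(N_TOKENS)]


rng = random.Random(0)
weight = init_linear(EMBED_DIM, N_TOKENS * TOKEN_DIM, rng)
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]  # fake sentence embedding

perm = list(range(EMBED_DIM))
rng.shuffle(perm)  # one ensemble member's dimension permutation
tokens = adapt(embedding, weight, perm)
print(len(tokens), len(tokens[0]))
```

At inference the methodology averages predictions over several such permutations; that step needs the frozen TabPFN and is omitted here.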
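Key Results:
- Regression: the TabPFN Text Adapter reaches a normalized RMSE of 23.3% (lower is better), ahead of ConTextTab (32.9%), PCA-30 (38.9%), tf-idf (54.7%), and dropping the text (81.2%).
- Classification: the adapter's normalized ROC-AUC is 84.5%, below ConTextTab (97.65%) and PCA-30 (86.4%) but above tf-idf (56.8%) and dropping the text (0.0%).
- Ablations: 5 tokens is best for classification (close to 10 tokens at lower compute), and 1 token is best for regression; performance drops when fusion happens at the third layer; the e5-small-v2 language model is slightly better than the default; removing normalization or using random initialization significantly hurts performance.
Tech Stack:
- TabPFN 2.5 (Grinsztajn et al., 2026)
- sentence-transformers library (all-MiniLM-L6-v2)
- MLP (linear layer / two-layer MLP)
- PCA (principal component analysis)
- tf-idf (term frequency-inverse document frequency)
- Truncated SVD (truncated singular value decomposition)
- STRABLE benchmark (Blayer et al., 2026)
- TextTabBench (Mraz et al., 2025)
- skrub library (StringEncoder, TextEncoder)
- 3-fold cross-validation
- Min-max normalized evaluation metrics
Strengths:
- Lightweight and efficient: only the adapter is trained, at a far lower compute cost than end-to-end pretraining.
- Preserves TabPFN's strong performance on purely numerical and categorical data, avoiding degradation from text handling.
- Achieves the best results on regression, showing that the adapter makes effective use of text information.
- Simple design that integrates into existing TabPFN pipelines without modifying the base model.
- Random permutation and ensembling improve robustness, and the ablations are systematic and thorough.
Limitations:
- Classification performance trails ConTextTab and PCA-30, indicating that information transfer through the adapter is still bottlenecked in classification.
- Relies on a single default language model (all-MiniLM-L6-v2); larger or more specialized text encoders are not explored.
- Pretraining data comes from a STRABLE subset and may not cover all text types, so generalization remains to be verified.
- The adapter design depends on TabPFN's specific architecture (192-dimensional embeddings, feature-encoder structure); transferring it to other tabular foundation models would require a redesign.
- There is no direct comparison with end-to-end fine-tuning methods (such as TabSTAR), so the trade-off between fine-tuning cost and performance is not analyzed in depth.
Relevance To Keywords:
- Native multimodal large models: The paper focuses on aligning tabular data with text, a lightweight alignment method for two modalities (table + text), but it does not involve images, video, or other modalities.
- Representation learning: Mapping text embeddings into TabPFN's embedding space through the adapter is essentially cross-modal representation alignment.
- World models: Not directly relevant; the paper involves no environment interaction or causal reasoning.
- Model-based RL: Not relevant.
- Post-training: The paper uses a post-training strategy (frozen base model, trained adapter), a form of post-training alignment.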
摘要翻译
音频 - 语言模型 (ALMs) 往往会采纳与音频相冲突的文本,即便音频证据本身非常清晰。这引出了一个基本问题:音频支持的答案是否不可用,还是它虽被表示出来却被冲突的文本所覆盖?我们通过一种同音频反事实方法来探究这一问题:保持音频固定,仅移除冲突的文本,并测量由此导致的模型偏好变化。在五个 ALMs 和四个冲突任务中,64.1% 的冲突样本显示出符号翻转:同音频分支偏好音频支持的答案,而联合分支则偏好文本支持的答案。这一模式表明,相关的音频证据虽被编码,却在仲裁过程中落败。激活修补进一步将这种反转定位到答案位置计算上,且修补效果紧密追踪输出候选分数差异 (Spearman rho=0.93)。利用这一诊断方法,我们提出了一种无需训练的解码规则——门控音频反事实对数修正 (GACL),该规则在联合分数与同音频分数之间进行插值。在严格的 5 个百分点 (pp) 忠实度下降预算下,GACL 相较于最佳对比基线将 nAUC 提高了 17.8 点,且无需重新调优即可迁移至视觉 - 文本仲裁任务(最高提升 +40.5 个百分点)。
Abstract
Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 6.0/10 | 9.0 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
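Scoring rationale: The paper focuses on multimodal arbitration in audio-language models, so it is highly relevant to MultiModal (8.0) and MLLM (6.0). Because it involves audio rather than vision, Visual Encoder scores 0. It does not address reinforcement learning, world models, or core tokenizer mechanisms, so those keywords score low. It involves signal integration but does not reach the level of a unified model architecture (Unify Models). The weighted total is 28.5, above the dynamic pass score of 27.8.
Keywords
Audio-Language Models, Text-Audio Arbitration, Counterfactual Analysis, Activation Patching, GACL, Faithfulness, Decoding Rule
In-Depth Analysis
Chinese Title: 超越文本跟随:音频-语言模型中可修复的仲裁反转
Summary: This paper studies why audio-language models (ALMs) tend to follow the text when audio and text conflict. Using a controlled comparison that keeps the audio fixed and removes only the conflicting text, the authors find that in 64.1% of conflict samples the model prefers the audio-supported answer in the same-audio branch but the text-supported answer in the joint branch, indicating that the audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to the residual stream at the answer position, and the patching direction correlates strongly with the output score difference (Spearman ρ=0.93). Based on this diagnosis, the authors propose GACL, a training-free decoding rule that corrects the bias through gated interpolation between joint and reference scores. Under a strict faithfulness budget, GACL improves nAUC by 17.8 points on audio-text conflict tasks and generalizes to vision-text conflict tasks (gains of up to 40.5 percentage points).
Innovations:
- Identifies repairable arbitration reversal as a common, recoverable failure mode in audio-text conflict: the model can support the audio answer, but the joint input reverses the decision.
- Localizes the reversal to the residual stream at the answer position and shows that the patching direction and the output score difference are highly rank-correlated (Spearman ρ=0.93), so the repair direction can be read directly from the outputs.
- Proposes GACL (Gated Audio Counterfactual Logit Correction), a diagnosis-driven decoding rule that improves audio-text conflict performance by 17.8 nAUC points under a 5 pp faithfulness budget and transfers to vision-text conflict without retuning.
Methodology: The setup has two branches: the joint branch (J) receives the audio plus the conflicting text, and the audio reference branch (A) receives the same audio with the conflicting text removed. For each branch, the length-normalized log-probability margin over candidate answers is computed, and the diagnostic region is defined as O = {M_A > 0, M_J < 0}. Activation patching localizes the layers and positions where the reversal occurs. The GACL decoding rule applies a gated, bounded interpolation between the joint and reference scores, with the gate driven by branch disagreement and reference reliability. Evaluation covers five ALMs and four conflict tasks, with comparisons against several contrastive-decoding baselines.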
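The two-branch diagnostic and a gated interpolation in the spirit of GACL can be sketched as below. The margin definition and the region test follow the methodology; the specific gate (`g_max`, `sharpness`, `min_confidence`) is a hypothetical choice made for illustration, since the summary only says the gate depends on branch disagreement and reference reliability. The log-probabilities are invented toy values.

```python
import math


def length_normalized_logprob(token_logprobs):
    """Average per-token log-probability of one candidate answer."""
    return sum(token_logprobs) / len(token_logprobs)


def margin(scores, audio_answer, text_answer):
    """Positive when the branch prefers the audio-supported answer."""
    return scores[audio_answer] - scores[text_answer]


def in_reversal_region(m_audio, m_joint):
    """Diagnostic region O = {M_A > 0, M_J < 0}."""
    return m_audio > 0 and m_joint < 0


def gate(m_audio, m_joint, g_max=0.8, sharpness=4.0, min_confidence=0.1):
    """Hypothetical gate: opens with disagreement, closes if the reference is unsure."""
    if abs(m_audio) < min_confidence:
        return 0.0
    disagreement = abs(m_audio - m_joint)
    return g_max * (1.0 - math.exp(-sharpness * disagreement))


def corrected_scores(joint, audio_ref, g):
    """Bounded interpolation between joint and same-audio scores."""
    return {k: (1.0 - g) * joint[k] + g * audio_ref[k] for k in joint}


# Toy candidates: the audio contains a dog bark, the conflicting text says "cat".
joint = {"dog": length_normalized_logprob([-2.0, -1.8]),
         "cat": length_normalized_logprob([-0.6])}
audio_ref = {"dog": length_normalized_logprob([-0.4]),
             "cat": length_normalized_logprob([-2.2])}

m_a = margin(audio_ref, "dog", "cat")
m_j = margin(joint, "dog", "cat")
g = gate(m_a, m_j)
final = corrected_scores(joint, audio_ref, g)
prediction = max(final, key=final.get)
print(in_reversal_region(m_a, m_j), prediction)
```

Capping the gate below 1 keeps some weight on the joint branch, which is one way to respect a faithfulness budget on non-conflicting inputs.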
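Key Results:
- Across five models, 64.1% of conflict samples show arbitration reversal (M_A > 0 and M_J < 0), indicating that the audio evidence is available but overridden by the text.
- Activation patching shows that the reversal occurs mainly in the residual stream at the answer position; the rank correlation between the patching direction and the output score difference s_A − s_J is 0.93.
- Under a 5 pp faithfulness budget, GACL improves nAUC on audio-text conflict by 17.8 points over the best contrastive baseline.
- Without retuning, GACL improves adversarial accuracy on the vision-text conflict task MC2 by up to 40.5 percentage points.
Tech Stack:
- Models: Qwen2-Audio-7B-Instruct, Qwen2.5-Omni-7B, Voxtral-Small-24B, Qwen3-Omni-30B-A3B-Instruct, Kimi-Audio-7B-Instruct
- Datasets: MCR-Bench (AQA, VSC, SER), ALME (English subset)
- Methods: two-branch controlled comparison, length-normalized log-probability margins, activation patching, gated interpolation
- Metrics: audio-following accuracy, text-following accuracy, nAUC (normalized area under the curve), faithfulness-drop budget
Strengths:
- Clear diagnosis: a simple controlled comparison reveals arbitration reversal as a common, repairable failure mode rather than a plain perception failure.
- Precise localization: activation patching pins the reversal to a specific computation site and shows that the output score difference can serve as a proxy for the repair direction.
- Simple and effective: GACL needs no training and only one extra forward pass, and it improves performance markedly under a strict faithfulness constraint.
- Strong generalization: the same rule transfers to vision-text conflict without retuning, suggesting the design is modality-agnostic.
Limitations:
- Depends on the reliability of the reference branch: on the SER task the audio reference branch is only 48.9% accurate, which shrinks the diagnostic region and limits the repair.
- Applies only to closed-form question answering: the method relies on log-probabilities of candidate answers and does not extend to open-ended generation.
- Model scale is not explored: experiments focus on 7B-30B models, and larger or smaller models may behave differently.
- The faithfulness budget is set by hand: the 5 pp threshold may not suit every application.
Relevance To Keywords:
- Native multimodal large models: The audio-language models studied are native multimodal large models that process audio and text inputs directly.
- Unified understanding and generation in multimodal large models: The models have both understanding and generation abilities; the paper focuses on conflict arbitration during understanding.
- Representation learning: Activation patching reveals that audio representations exist and compete inside the model, which touches on representation learning.
- Post-training: GACL is a decoding-time correction applied after training, with no fine-tuning.
- World models: Not directly addressed, although the arbitration mechanism can be seen as internal reasoning over multimodal evidence, which relates to building world models.
- Reinforcement learning: Not used; conflict arbitration is only loosely analogous to decision-making under a reward signal.
- Unify Models: The ALMs studied are a kind of unified multimodal model, but the paper does not address unified framework design.
- Model-Based RL: Not directly relevant.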
摘要翻译
将任务 LoRA 适配器与领域 LoRA 适配器整合至单一统一模型中,是一个实用但尚未被充分探索的挑战。现有方法将这两种适配器视为对称的对等体,在所有层上施加统一的权重。我们认为,任务适配器和领域适配器在 Transformer 架构中表现出一致的深度依赖不对称性。领域主导性随层深度增加而增强,而较浅层则保留更强的任务相关信号。基于这一观察,我们提出 TaDA(任务 - 领域 LoRA 融合),这是一种无训练算法,它通过校准探针引导的逐层门控和基于子空间的逐组件融合来利用这种结构。门控机制利用一种被证明对适配器权重大小不变的探针信号,为每一层和投影类型分配独立的权重。融合过程在合并剩余组件之前,会丢弃冲突的奇异方向。TaDA 生成一个标准的秩-r LoRA 适配器,且推理开销为零。在六个科学问答(QA)基准上,基于 Llama-2-7B,TaDA 实现了 0.452 的平均准确率,比 DARE-TIES 高出 3.6 个百分点,并在所有六个基准上均取得了最佳结果。在六个图像分类基准上,基于 ViT-L/16,TaDA 达到了 85.9% 的平均准确率,优于最强的融合基线,并在六个单独基准中的三个中位居首位。
Abstract
Combining a task LoRA adapter with a domain LoRA adapter into a single unified model is a practical yet largely unexplored challenge. Existing methods treat both adapters as symmetric peers, applying uniform weights across all layers. We argue that task and domain adapters exhibit a consistent depth-dependent asymmetry across transformer architectures. Domain dominance increases with layer depth, while shallower layers retain stronger task-relevant signals. Motivated by this observation, we propose TaDA (Task-Domain LoRA Merging), a training-free algorithm that exploits this structure through calibrated probe-guided per-layer gating and per-component subspace-aware merging. The gating assigns individual weights per layer and projection type using a probe signal proved invariant to adapter weight magnitude. The merging discards conflicting singular directions before combining the remaining components. TaDA produces a standard rank-r LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA achieves an average accuracy of 0.452, outperforming DARE-TIES by +3.6 percentage points and obtaining the best result on all six benchmarks. On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9% average accuracy, improving over the strongest merging baseline while leading in three of the six individual benchmarks.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 9.0/10 | 13.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
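Scoring rationale: The paper's core contribution is merging task and domain LoRA adapters into a unified model, which is highly relevant to 'Unify Models' (9.0). The experiments use ViT and Llama-2, which gives some connection to 'Visual Encoder' (3.0), 'MLLM' (2.0), and 'MultiModal' (2.0), though these are not the research focus. The paper does not address tokenizers, world models, or reinforcement learning, so those keywords have low relevance (1.0). The author list does not include the designated experts (Yang Shi and others).
Keywords
LoRA Merging, Task-Domain, Probe Gating, Training-free, Adapter Unification, Transformer Architecture, ViT, Llama-2
In-Depth Analysis
Chinese Title: TaDA:面向任务-领域LoRA合并的校准探针门控方法
Summary: This paper proposes TaDA (Task-Domain LoRA Merging), a training-free LoRA merging algorithm that combines a task LoRA and a domain LoRA into a single unified adapter. The study finds a depth-dependent asymmetry between task and domain adapters in transformer architectures: shallow layers retain more task signal, while domain dominance grows in deeper layers. TaDA uses a calibrated probe to compute a domain-relevance score for each layer and module and assigns weights through module-type-aware gating thresholds. It then performs per-component subspace-aware merging via SVD, discarding conflicting singular directions before a weighted merge. On six scientific QA benchmarks with Llama-2-7B, TaDA reaches an average accuracy of 0.452, 3.6 percentage points above DARE-TIES and best on all six; on six image classification benchmarks with ViT-L/16, it reaches 85.9% average accuracy, above the strongest baseline. TaDA outputs a standard rank-r LoRA with zero inference overhead.
Innovations:
- A calibrated probe-guided per-layer gating mechanism that uses the ratio of domain to general probe activations as a domain-relevance score, with a proof that the score is invariant to adapter weight scaling.
- Module-type-aware gating thresholds that keep output projections task-dominated in order to preserve the answer format.
- Per-component subspace-aware merging: SVD decomposition, filtering of conflicting singular directions by their overlap in the input/output feature spaces, then component-wise weighted merging.
- A fully deterministic merging procedure that needs no random seed and outputs a standard rank-r LoRA with no extra inference cost.
- A systematically constructed task × domain merging benchmark spanning language and vision, showing that existing symmetric merging methods do not suit the asymmetric setting.
Methodology: Two probe sets are built first (domain probes P_d and general probes P_g, 32 samples each), and per-layer hidden states are extracted with forward passes. The calibrated domain-relevance score is s_ℓ = r_ℓ(P_d) / r_ℓ(P_g), where r_ℓ is an activation norm ratio. Each module type gets its own gating threshold τ_m, and a sigmoid produces the task weight α(ℓ, m) for every layer and module. Each adapter is then decomposed with SVD, the overlap of every singular component in the input and output feature spaces is computed, and components whose overlap exceeds the threshold δ=0.10 are discarded. The retained components are merged with component-level weights α_i (based on component-level scores s_i) to obtain the merged ΔW for module m at layer ℓ. Finally, SVD truncation back to rank r yields a standard LoRA adapter.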
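The per-layer, per-module gating step can be sketched as follows. This illustrates only the gating and a plain weighted merge; the SVD-based conflict filtering is omitted. The gate form `sigmoid(beta * (tau_m - s))`, the value of `beta`, the thresholds, and the toy scores are assumptions chosen to match the qualitative description (task weight falls as the domain-relevance score rises with depth, and the output projection keeps a higher task weight).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def task_weight(score, tau, beta=5.0):
    """Task weight alpha in (0, 1): high when the domain score is below tau."""
    return sigmoid(beta * (tau - score))


def merge_delta(delta_task, delta_domain, alpha):
    """Element-wise alpha * task + (1 - alpha) * domain for one weight update."""
    return [[alpha * t + (1.0 - alpha) * d for t, d in zip(row_t, row_d)]
            for row_t, row_d in zip(delta_task, delta_domain)]


# Hypothetical domain-relevance scores s_l for three layers, rising with depth.
scores = [0.8, 1.0, 1.3]
# Hypothetical module-type thresholds; o_proj is kept more task-dominated.
tau = {"q_proj": 1.0, "o_proj": 1.2}

alphas = {module: [task_weight(s, t) for s in scores] for module, t in tau.items()}

delta_task = [[1.0, 0.0], [0.0, 1.0]]    # toy task update for one module
delta_domain = [[0.0, 2.0], [2.0, 0.0]]  # toy domain update for the same module
merged_shallow = merge_delta(delta_task, delta_domain, alphas["q_proj"][0])
merged_deep = merge_delta(delta_task, delta_domain, alphas["q_proj"][2])
```

With these toy numbers the shallow layer stays closer to the task update and the deep layer closer to the domain update, mirroring the depth asymmetry the paper reports.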
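Key Results:
- On six scientific QA benchmarks with Llama-2-7B, TaDA reaches an average accuracy of 0.452, 3.6 percentage points above DARE-TIES, and is best on all six benchmarks.
- On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9% average accuracy, above all baselines (DARE-TIES: 85.6%), and leads on three benchmarks.
- Confirms the depth-dependent asymmetry between task and domain adapters: shallow layers are task-dominated, deep layers domain-dominated.
- The calibrated probe score is invariant to adapter weight scaling (Proposition 1).
- Hyperparameters are calibrated from empirical distributions (for example, δ=0.10 is set against an observed overlap distribution with a maximum of 0.162 and a 95th percentile of 0.081).
Tech Stack:
- LoRA (Low-Rank Adaptation)
- SVD (singular value decomposition)
- Calibrated probe
- Sigmoid gating function
- Module-type-aware thresholds
- Overlap metric in the U and V spaces
- Llama-2-7B (language model)
- ViT-L/16 (vision transformer)
- Alpaca-cleaned, PubMed, PubMedQA (training data)
- ImageNet-1k, PathMNIST (training data)
- Constrained decoding
Strengths:
- No extra training and low compute cost (only two forward passes and SVD decompositions).
- Fully deterministic, with no dependence on random seeds, so results are reproducible.
- Outputs the standard LoRA format with zero inference overhead and works with existing inference stacks.
- Validated on both language and vision, indicating good generality.
- Systematically analyzes the depth asymmetry of task and domain adapters, offering insight for follow-up work.
Limitations:
- Requires a small set of domain and general probes (32 samples each), which may be hard to obtain in some settings.
- The hyperparameters (τ_m, β, δ, λ) are calibrated empirically and may need adjusting for other models or tasks.
- Targets only task-domain merging; task-task and domain-domain merging are not validated.
- Probe construction relies on domain-specific inputs and may be inaccurate for highly abstract or mixed domains.
- No comparison with training-based approaches (such as joint fine-tuning); only post-training merging methods are compared.
Relevance To Keywords:
- Unify Models: The paper merges task and domain LoRAs, a form of model unification, though it applies to LoRA adapters rather than full models.
- World Models: Not directly relevant, although a domain adapter can be viewed as encoding domain-specific world knowledge.
- Representation Learning: Partly relevant. The probe method analyzes activation representations and their domain relevance, but the core is merging rather than learning representations.
- Model-Based RL: Not relevant.
- Native multimodal large models: The paper validates on language and vision separately and does not involve multimodal fusion.
- Unified understanding and generation in multimodal large models: Not directly relevant.
- Reinforcement learning: Not relevant.
- Post-training: The paper is a post-training merging method that needs no additional training.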
摘要翻译
前馈式 3D Gaussian Splatting 方法通过单次前向传播从带姿态或无姿态的图像重建场景,但当前方法为每个输入像素预测一个高斯分布,将表示预算绑定到相机分辨率而非场景复杂度。一面平坦的墙和一个纹理丰富的物体尽管几何需求不同,却会产生相同数量的高斯分布。我们提出 ZipSplat,一种基于标记的前馈模型,它将高斯分布的放置与像素网格解耦。多视图骨干网络提取密集视觉标记 (visual tokens),k-means 聚类将它们压缩为一组紧凑的场景标记。交叉注意力和自注意力机制优化这些标记,轻量级 MLP 将每个标记解码为一组具有无约束 3D 位置的高斯分布。由于聚类是在推理阶段应用的,单个训练好的模型无需重新训练即可覆盖质量 - 效率曲线。ZipSplat 在没有真实姿态或内参的情况下运行,但在 DL3DV 和 RealEstate10K 上达到新 state-of-the-art 水平,相比像素对齐方法高斯分布数量减少约 6 倍,分别比最佳无姿态基线高出 2.1dB 和 1.2dB PSNR。它进一步零样本泛化到 Mip-NeRF360 和 ScanNet++,优于所有可比基线。我们的项目页面位于 https://veichta.com/zipsplat。
Abstract
Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ~6× fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1 dB and 1.2 dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at https://veichta.com/zipsplat.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 4.0/10 | 6.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
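Scoring rationale: The paper focuses on 3D Gaussian Splatting, using token-based clustering and attention mechanisms. It is highly relevant to 'Visual Encoder' (8.0, multi-view backbone) and moderately relevant to 'Tokenizer' (4.0, token extraction and clustering), but largely unrelated to 'MLLM', 'World Models', 'model-based RL', 'MultiModal' (the work is vision-only), and 'Unify Models'. None of the specified expert authors appear in the author list.
Keywords
3D Gaussian Splatting, Token-based model, Visual tokens, Clustering, Feed-forward, Multi-view backbone, Scene reconstruction
In-Depth Analysis
Chinese Title: ZipSplat: 更少的高斯,更好的场景表示
Summary: This paper proposes ZipSplat, a token-based feed-forward 3D Gaussian Splatting method that decouples Gaussians from the pixel grid so that the number of Gaussians tracks scene complexity rather than camera resolution. A multi-view backbone extracts dense visual tokens, k-means clustering compresses them into compact scene tokens, cross-attention and self-attention refine the tokens, and a lightweight MLP decodes each token into a group of Gaussians with free 3D positions. A geometric supervision loss is added during training to stabilize the placement of the free Gaussians. At inference, the Gaussian budget can be adjusted continuously through a single compression ratio, with no retraining. On DL3DV and RealEstate10K the method sets a new state of the art with about 6× fewer Gaussians, improving PSNR by 2.1 dB and 1.2 dB respectively over the best pose-free baseline, and it generalizes zero-shot to Mip-NeRF360 and ScanNet++.
Innovations:
- A token-based feed-forward architecture that decouples Gaussians from the pixel grid, so placement adapts to scene content rather than pixel density.
- An inference-time adjustable k-means compression in feature space, so one model spans the quality-efficiency curve without retraining.
- Free 3D Gaussian position prediction, with training stabilized by a geometric supervision loss and progressive scheduling.
- A new state of the art without poses or intrinsics, using far fewer Gaussians and generalizing zero-shot to several datasets.
Methodology: ZipSplat uses a multi-view backbone (such as DA3) to extract dense visual tokens, then compresses them with k-means clustering into K scene tokens (the compression ratio r is adjustable). The scene tokens recover detail from the original visual tokens through cross-attention and gain global context through self-attention. A lightweight MLP decodes each scene token into G Gaussians whose 3D positions are predicted freely (not along rays). Training combines a rendering loss with a geometric supervision loss (pulling free Gaussians toward valid surfaces) and uses a progressive schedule.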
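The compression step, clustering N visual tokens into K = N / r scene tokens, can be sketched with a small pure-Python k-means. This is a toy illustration: the token dimension, the iteration count, and the use of cluster centroids as the initial scene tokens are assumptions, and the attention refinement and Gaussian decoding are not shown.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(tokens, k, iters=10, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignment per token)."""
    rng = random.Random(seed)
    centers = [list(t) for t in rng.sample(tokens, k)]
    assign = [0] * len(tokens)
    for _ in range(iters):
        for i, t in enumerate(tokens):
            assign[i] = min(range(k), key=lambda c: squared_distance(t, centers[c]))
        for c in range(k):
            members = [tokens[i] for i in range(len(tokens)) if assign[i] == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assign


def compress(tokens, ratio):
    """Compress N visual tokens into K = N // ratio scene tokens."""
    k = max(1, len(tokens) // ratio)
    return kmeans(tokens, k)


rng = random.Random(1)
visual_tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]

scene_tokens, assignment = compress(visual_tokens, ratio=8)   # 8 scene tokens
coarse_tokens, _ = compress(visual_tokens, ratio=16)          # 4 scene tokens
print(len(scene_tokens), len(coarse_tokens))
```

Because `ratio` is only an argument at call time, the same upstream tokens yield different budgets, which is the property that lets one trained model cover the quality-efficiency curve.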
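Key Results:
- On DL3DV, ZipSplat reaches the state of the art with about 6× fewer Gaussians, with PSNR 2.1 dB above the best pose-free baseline.
- On RealEstate10K, PSNR improves by 1.2 dB with about 6× fewer Gaussians.
- Zero-shot generalization to Mip-NeRF360 and ScanNet++ outperforms all comparable baselines.
- As the number of context views grows, pixel-aligned methods degrade in quality while ZipSplat stays stable.
- A single model adjusts quality and efficiency continuously over a range of 15K to 380K Gaussians by changing the compression ratio r.
Tech Stack:
- 3D Gaussian Splatting
- k-means clustering
- Transformer (cross-attention, self-attention)
- Lightweight MLP
- DA3 multi-view backbone
- Geometric supervision loss (such as depth or surface constraints)
- Differentiable rasterization
Strengths:
- Decouples Gaussians from the pixel grid, so the representation budget adapts to scene complexity and redundancy drops.
- The compression ratio is adjustable at inference, so one model covers many quality-efficiency needs without retraining.
- Needs no poses or intrinsics, which suits uncalibrated inputs, while also supporting calibrated ones.
- Reaches the state of the art on several benchmarks with far fewer Gaussians than pixel-aligned methods.
- Generalizes well zero-shot and keeps quality stable as the number of views grows.
Limitations:
- k-means clustering runs in feature space and may be affected by feature quality; its behavior on complex scenes remains to be verified.
- Free Gaussian placement lacks ray anchoring and needs extra geometric supervision, which may increase training complexity.
- Only feed-forward reconstruction is validated; combinations with post-training (such as reinforcement learning or world models) are not explored.
- The direct link to keywords such as native multimodal large models and representation learning is weak; the focus is 3D vision and graphics.
Relevance To Keywords:
- Unify Models: The paper proposes a unified feed-forward model but does not address unified multimodal understanding and generation.
- World Models: 3D Gaussian Splatting can be seen as a world-model representation of a scene, but the paper does not explore links to reinforcement learning or prediction.
- Representation Learning: Relevant, since scene representations are learned through token compression and free Gaussians.
- Model-Based RL: The paper does not involve reinforcement learning or model-based planning.
- Native multimodal large models: The paper uses a multi-view backbone but involves no language or cross-modal fusion.
- Post-training: The paper does not discuss post-training strategies and focuses on feed-forward inference.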
摘要翻译
语言引导的照片修饰旨在调整色彩与色调,同时保留几何形状和纹理。近期,基于扩散的修饰方法展现出卓越的视觉质量,但往往因其生成性质而面临保真度问题,且因迭代采样过程而效率低下。本文提出了一种基于双边空间操作的高效且保真的修饰方法,该方法具有紧凑性且内容解耦。具体而言,我们的模型不直接编辑像素或图像潜在变量,而是预测一个低分辨率的仿射变换双边网格(bilateral grid),该网格通过学习到的引导图(guidance map)进行切片,然后应用于全分辨率图像。该方法兼具高保真度和改进的效率。为了保留预训练生成模型的强先验,我们利用变分分数蒸馏(Variational Score Distillation)将多步扩散模型蒸馏至双边网格框架中,并辅以提示对齐损失(prompt alignment loss)以引导指令遵循行为。此外,我们引入了一个新的基准,并在保真度、指令遵循及效率等多个维度上评估该方法。与最新的修饰方法(如 Gemini-2.5-Flash (Nano-Banana))相比,我们的方法可避免内容漂移,显著降低延迟,生成视觉上令人愉悦的修饰效果,同时保持高水平的保真度。项目页面:https://openimaginglab.github.io/InstantRetouch/.
Abstract
Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recent diffusion-based retouching methods deliver superior visual quality, but they often suffer from fidelity issues due to their generative nature and from poor efficiency due to iterative sampling. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain the strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared to the latest retouching methods, such as Gemini-2.5-Flash (Nano-Banana), our method avoids content drift, significantly reduces latency, and generates visually pleasing edits while maintaining a high level of fidelity. Project page: https://openimaginglab.github.io/InstantRetouch/.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 5.0/10 | 7.5 |
| MultiModal | 1.5 | 8.0/10 | 12.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
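Scoring rationale: The paper focuses on efficient image retouching using bilateral space manipulation and diffusion distillation. It scores high on MultiModal (8.0) due to text-image interaction and moderate on MLLM (5.0) for instruction following. Visual Encoder (3.0) is relevant for image processing. Unify Models (2.0) and Tokenizer (1.0) have low relevance because they are not core contributions. World Models (0.0) and model-based RL (0.0) are unrelated. No expert authors from the specified list were found. The weighted total score is 28.5, which exceeds the dynamic pass score of 27.8.
Keywords
Image Retouching, Bilateral Space, Diffusion Model, Instruction-Guided, High-Fidelity, Efficiency, Variational Score Distillation, Text-to-Image Editing
In-Depth Analysis
Chinese Title: InstantRetouch:基于双边空间的高效高保真指令引导图像润色
Summary: This paper proposes InstantRetouch, an efficient and fidelity-preserving method for language-guided image retouching. Existing diffusion models have low fidelity (they are prone to content drift) and poor efficiency (iterative sampling is slow) on retouching tasks. Exploiting the compactness and content-decoupling of bilateral space, the authors design a single-step generator that directly predicts affine transform parameters in a low-resolution bilateral grid, which are then sliced with a learned guidance map and applied to the full-resolution image. To retain the rich priors of diffusion models, a multi-step diffusion teacher is distilled into the bilateral grid framework with Variational Score Distillation (VSD), complemented by a prompt alignment loss and a bilateral loss. The authors also build iRetouch, a high-quality instruction-retouching dataset of about 200K pairs, and evaluate on a new benchmark along three dimensions: fidelity, instruction following, and efficiency. Experiments show the method is 70-800× faster than existing large editing models while keeping better content fidelity and comparable instruction-following performance.
Innovations:
- First to bring bilateral space (the bilateral grid) to instruction-guided image retouching, decoupling content from tone, which avoids content drift at the root and improves efficiency.
- A single-step distillation framework that transfers the knowledge of a multi-step diffusion model into a single-step bilateral grid generator via VSD, combining diffusion priors with real-time speed.
- A CLIP-based prompt alignment loss that improves the following of vague or stylistic instructions, compensating for weak pixel-level signals.
- A large, high-quality instruction-retouching dataset, iRetouch (about 200K pairs), with input images generated through a controlled degradation pipeline to ensure diversity and fidelity.
- A progressive training strategy that trains the low-resolution branch first and then jointly optimizes the full-resolution bilateral branch for stable training.
Methodology: The overall design is a two-stage distillation framework. In stage one, a multi-step diffusion teacher (based on the InstructPix2Pix architecture) is fine-tuned on the self-built dataset. In stage two, the teacher's knowledge is distilled into a single-step bilateral grid generator G_θ. G_θ has two branches: a low-resolution diffusion branch (frozen VAE encoder plus a single-step UNet denoiser) for semantic understanding, and a full-resolution bilateral processing branch (which generates the bilateral grid and performs the slice-and-apply operation) for efficient retouching. The distillation objective combines a VSD loss, a data loss (MSE), a CLIP prompt alignment loss, and a bilateral regularization loss. Training is progressive: the low-resolution branch is trained first, then the full-resolution branch is optimized jointly.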
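The slice-and-apply operation on a bilateral grid can be sketched in a simplified single-channel form. This illustrates the generic technique, not the paper's model: each grid cell here stores a (gain, offset) pair rather than a full affine color matrix, the grid is hand-written instead of predicted, and the guidance map is taken to be the image luminance itself, whereas the paper learns it.

```python
def lerp(a, b, t):
    return a + (b - a) * t


def slice_grid(grid, gx, gy, gz):
    """Trilinearly interpolate (gain, offset) at continuous grid coordinates."""
    depth, height, width = len(grid), len(grid[0]), len(grid[0][0])
    x0, y0, z0 = int(gx), int(gy), int(gz)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    z1 = min(z0 + 1, depth - 1)
    tx, ty, tz = gx - x0, gy - y0, gz - z0
    out = []
    for ch in range(2):  # 0 = gain, 1 = offset
        c00 = lerp(grid[z0][y0][x0][ch], grid[z0][y0][x1][ch], tx)
        c01 = lerp(grid[z0][y1][x0][ch], grid[z0][y1][x1][ch], tx)
        c10 = lerp(grid[z1][y0][x0][ch], grid[z1][y0][x1][ch], tx)
        c11 = lerp(grid[z1][y1][x0][ch], grid[z1][y1][x1][ch], tx)
        out.append(lerp(lerp(c00, c01, ty), lerp(c10, c11, ty), tz))
    return out


def apply_grid(image, guidance, grid):
    """Slice the low-resolution grid per pixel and apply gain * pixel + offset."""
    rows, cols = len(image), len(image[0])
    depth, height, width = len(grid), len(grid[0]), len(grid[0][0])
    result = []
    for y in range(rows):
        out_row = []
        for x in range(cols):
            gx = x / max(cols - 1, 1) * (width - 1)
            gy = y / max(rows - 1, 1) * (height - 1)
            gz = guidance[y][x] * (depth - 1)  # guidance values lie in [0, 1]
            gain, offset = slice_grid(grid, gx, gy, gz)
            out_row.append(gain * image[y][x] + offset)
        result.append(out_row)
    return result


# 2x2x2 grid indexed [z][y][x]: lift shadows (z=0) by 0.2, leave highlights (z=1).
lift_shadows = [[[(1.0, off) for _ in range(2)] for _ in range(2)]
                for off in (0.2, 0.0)]
image = [[0.0, 0.5], [0.5, 1.0]]
retouched = apply_grid(image, guidance=image, grid=lift_shadows)
```

Each output pixel is an affine function of the corresponding input pixel, so edges and texture from the full-resolution image carry through; only tone changes, which is the fidelity argument made in the summary.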
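Key Results:
- On the iRetouch benchmark, InstantRetouch clearly outperforms baselines such as Gemini-2.5-Flash, FLUX.1-Kontext, and Qwen-Image in fidelity (content preservation) and efficiency (inference speed).
- Speedups of 70-800×, with support for 4K-resolution images.
- Instruction-following performance comparable to large diffusion models, without content drift.
- Ablations confirm the contributions of VSD distillation, the prompt alignment loss, and the bilateral loss.
Tech Stack:
- Bilateral grid with slice-and-apply operations
- Variational Score Distillation (VSD)
- CLIP for the prompt alignment loss
- Stable Diffusion as the base diffusion model
- UNet architecture (teacher and student)
- VAE encoder/decoder
- MUSIQ and LAION aesthetic score for data filtering
- Grounding-SAM for region mask generation
- Qwen2.5-VL-72B for instruction generation
- MSE loss, data loss, bilateral regularization loss
Strengths:
- Efficiency: single-step inference, two orders of magnitude faster than existing diffusion editing models, suited to high-resolution real-time use.
- High fidelity: bilateral space operations do not alter image content, which avoids content drift at the root.
- Strong instruction following: the CLIP prompt alignment loss and distillation handle complex stylistic instructions.
- Novel data construction: generating inputs through controlled degradation ensures diverse, high-quality training data.
- Thorough evaluation: comparisons along fidelity, instruction following, and efficiency on a self-built benchmark.
Limitations:
- Depends on the knowledge of the pretrained diffusion model; any bias in the teacher may be inherited after distillation.
- The bilateral grid has limited expressive power and may not handle extremely complex local tone adjustments (such as fine HSL region edits).
- The iRetouch dataset is synthetic, so generalization to real-world scenes needs further verification.
- No comparison with methods based on reinforcement learning or world models, and no discussion of post-training paradigms.
- Code and models are not released with the paper, so reproducibility remains to be confirmed.
Relevance To Keywords:
- Native multimodal large models: The paper uses text instructions to guide image retouching, a multimodal understanding-and-generation task, but it does not use a native multimodal architecture (such as a unified Transformer); it builds on a diffusion model plus CLIP.
- Unified understanding and generation in multimodal large models: The method combines text understanding (instructions) with image generation (retouching), but generation goes through the bilateral grid rather than an end-to-end generative model.
- Representation learning: The bilateral grid can be seen as a compact tone representation, but the paper does not explore representation learning theory.
- World models: Not directly relevant; the paper involves no physical world modeling or causal reasoning.
- Model-based RL: Reinforcement learning is not used; the method relies on distillation and loss optimization.
- Post-training: The distillation process is a form of post-training, though the paper does not emphasize the post-training paradigm.
- Overall relevance is moderate: the main contribution is efficient image editing, which overlaps with multimodal understanding and generation but is weakly connected to world models, representation learning, and reinforcement learning.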
摘要翻译
跨视图地理定位(Cross-view geo-localization)通过将地面图像与航拍图像数据库进行匹配,来估计地面图像的地理位置。现有方法主要通过大规模检索或精确姿态估计来解决这一问题,但二者不可兼得:基于检索的方法虽能实现广域搜索,却以牺牲定位精度为代价;而姿态估计方法仅在有限的搜索空间内实现高精度。盲目级联这些流程会导致误差传播以及特征表示不一致的问题。本文将跨视图地理定位表述为一个统一问题,要求同时进行城市尺度的检索和精确的 3-DoF 姿态估计。我们提出 CIPER(Cross-view Image-retrieval and Pose-estimation transformER),这是一种单一架构,通过互惠的特征学习联合执行这两项任务。CIPER 采用共享的 Transformer 编码器,并引入任务特定令牌,以解耦全局检索特征与空间定位线索。为了弥合地面视图与航拍视图之间较大的领域差距,我们引入了一种双向 Transformer 姿态解码器,该解码器利用地面特征作为空间查询,以实现双向交叉注意力机制。此外,集合预测策略进一步使得在统一的多任务目标下实现稳定的 3-DoF 回归成为可能。在 VIGOR、KITTI 和 Ford Multi-AV 数据集上的实验表明,该方法具有竞争力,尤其是在视野受限和任意朝向条件下表现优异。代码可在 https://github.com/yurimjeon1892/CIPER 获取。
Abstract
Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 5.0/10 | 7.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
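Scoring rationale: The paper title contains 'Unified', and the paper proposes a unified framework for two tasks, so Unify Models scores 5. The core component is a transformer encoder that processes images, so Visual Encoder is highly relevant and scores 8. The method handles ground and aerial views, treated as multimodal in a broad sense, which scores 4. Tokenizer is implicit in the transformer but not emphasized, scoring 2. The content does not involve language models, world models, or reinforcement learning, so those keywords score 0. None of the designated experts appear in the author list.
Keywords
Cross-view geo-localization, Unified Framework, Transformer Encoder, Pose Estimation, Image Retrieval, Ground and Aerial Views, Multi-task Objective
In-Depth Analysis
Chinese Title: CIPER:一种用于跨视角图像检索与姿态估计的统一框架
Summary: This paper addresses cross-view geo-localization with CIPER, a unified framework that performs city-scale image retrieval and precise 3-DoF pose estimation at the same time. Existing methods treat retrieval and pose estimation as separate tasks, and cascading them causes feature redundancy and error propagation. CIPER uses a shared Transformer encoder with a learnable class token and pose token that extract global retrieval features and spatial localization cues respectively. To bridge the large domain gap between ground and aerial images, a two-way Transformer pose decoder uses ground features as spatial queries and runs bidirectional cross-attention with aerial features for robust cross-view alignment. A set prediction strategy stabilizes the 3-DoF pose regression. Experiments on VIGOR, KITTI, and Ford Multi-AV show reliable, competitive performance under limited field of view and arbitrary orientation, supporting the unified architecture.
Innovations:
- Recasts cross-view geo-localization as a unified task that requires both city-scale retrieval and precise 3-DoF pose estimation, overcoming the limits of separate pipelines.
- Proposes the unified CIPER architecture, in which a shared Transformer encoder and task-specific tokens (class token and pose token) enable joint learning of global retrieval and local localization.
- Designs a two-way Transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention with aerial features, handling viewpoint changes and limited field of view.
- Uses a set prediction strategy for stable 3-DoF regression, avoiding the complexity of traditional iterative optimization.
- Performs retrieval and pose estimation in one network, reducing feature redundancy and compute cost for practical deployment.
Methodology: CIPER processes ground and aerial images with a shared Transformer encoder whose outputs include a class token (for retrieval) and a pose token (for localization). In the retrieval stage, similarity scores between the class tokens of ground and aerial images identify the matching aerial image in a city-scale database. In the pose estimation stage, the ground pose token serves as a spatial query and is decoded against the features of the matched aerial image with bidirectional cross-attention, directly regressing the 3-DoF pose (translation and orientation). The whole network is trained jointly with a multi-task objective that includes a retrieval loss and a pose regression loss.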
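The retrieval stage reduces to nearest-neighbor search over class-token embeddings. A minimal sketch, assuming cosine similarity between L2-normalized class tokens as the score (the summary says "similarity score" without naming the metric); `retrieve` is a hypothetical helper and the vectors are toy values.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def retrieve(ground_cls, aerial_cls_db, top_k=1):
    """Rank aerial images by cosine similarity of class tokens to the ground query."""
    q = l2_normalize(ground_cls)
    scored = []
    for idx, aerial in enumerate(aerial_cls_db):
        a = l2_normalize(aerial)
        scored.append((sum(x * y for x, y in zip(q, a)), idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]


# Toy class tokens for three aerial tiles and one ground query.
aerial_db = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
ground_query = [0.6, 0.8, 0.0]

matches = retrieve(ground_query, aerial_db, top_k=2)
print(matches)
```

In the described pipeline the top match would then be passed, together with the ground pose token, to the two-way pose decoder for 3-DoF regression.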
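Key Results:
- On VIGOR, CIPER is competitive on both retrieval and pose estimation and is notably better than existing methods under limited field of view and arbitrary orientation.
- Results on KITTI and Ford Multi-AV confirm generalization across datasets.
- Compared with cascaded retrieval-pose pipelines, CIPER avoids repeated feature extraction and improves computational efficiency.
- Bidirectional cross-attention mitigates the domain gap between ground and aerial images and improves alignment robustness.
Tech Stack:
- Vision Transformer (ViT) as the shared encoder backbone
- Learnable class token (cls_token) and pose token (pos_token)
- Bidirectional cross-attention
- Set prediction strategy for 3-DoF regression
- Multi-task loss (contrastive retrieval loss + pose regression loss)
- NetVLAD, SAFA, and others as comparison baselines
Strengths:
- The unified framework avoids the feature redundancy and error propagation of cascaded pipelines and enables end-to-end optimization.
- Task-specific tokens let the shared encoder serve both retrieval and localization, covering global and local information.
- The bidirectional cross-attention decoder is robust to viewpoint change and limited field of view and does not assume the ground image orientation is aligned with the aerial image center.
- Set prediction simplifies pose regression with no complex iterative optimization.
- Effectiveness and generalization are validated on several large-scale datasets.
Limitations:
- Depends on the coverage and resolution of the aerial database; retrieval accuracy is bounded by the database sampling density.
- Bidirectional cross-attention is computationally expensive, which may affect real-time use.
- Only 3-DoF pose (translation and orientation) is handled; height and full 6-DoF pose are not considered.
- Performance may drop under extreme viewpoints or heavy occlusion.
Relevance To Keywords:
- Unify Models: CIPER unifies retrieval and pose estimation in a single architecture, reflecting the idea of model unification.
- World Models: Learning the spatial relation between ground and aerial views through cross-view alignment can be seen as a kind of world-model representation.
- Representation Learning: The shared encoder learns a general feature representation that serves both retrieval and localization.
- Model-Based RL: Reinforcement learning is not directly involved, though the unified-framework idea could carry over to state representations in model-based RL.
- Native multimodal large models: Ground and aerial images are treated as different modalities, and CIPER processes them with a Transformer, in line with the technical direction of multimodal large models.
- Unified understanding and generation in multimodal large models: The paper focuses on understanding (retrieval and localization) and does not involve generation, though the bidirectional attention mechanism could be extended to generation tasks.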
摘要翻译
个体级移动性预测在城市模拟、交通规划及政策分析中处于核心地位。监督序列模型虽能实现高准确性,但需要针对特定任务进行训练,且决策层面的透明度有限。近期基于大语言模型(LLM)的方法提升了可解释性,但大多依赖静态提示和单次推理,当移动信号微弱或冲突时,限制了其寻求额外证据的能力。本文提出 AgentMob,一种无需训练的 LLM 驱动代理框架,将下一位置预测形式化为自适应证据控制的决策制定。AgentMob 基于历史规律,通过快速路径解决常规案例;而对于模糊案例,则触发针对近期轨迹、历史行为、停留 - 移动可能性及地理证据的迭代工具调用。在三个移动性数据集上,AgentMob 在无训练的基于 LLM 的方法中取得了最强的整体性能,其中 GPT-5.4 在 BW 数据集上 Acc@1 达到 71.42%,在 YJMob100K 上为 33.14%,在 Shanghai ISP 上为 33.50%。在 BW 数据集的非快速路径案例上,LLM 控制器相比同工具统计基线,将 Acc@1 从 30.65% 提升至 48.62%,这表明其主要优势在于通过自适应证据收集来解决模糊预测。我们的代码可在 https://github.com/Unknown-zoo/AgentMob 获取。
Abstract
Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown-zoo/AgentMob.
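The fast-path / tool-use split described in the abstract can be sketched as a small control loop. Everything specific here is hypothetical: the regularity threshold, the two toy tools, and the stopping rule are invented for illustration, and a simple vote counter stands in for the LLM controller, which in the paper decides which evidence to gather next.

```python
from collections import Counter


def fast_path(history, threshold=0.8):
    """Return the dominant location if the history is regular enough, else None."""
    location, count = Counter(history).most_common(1)[0]
    return location if count / len(history) >= threshold else None


def recent_trajectory(history):
    """Toy tool: the most recent location is weak evidence for staying."""
    return {history[-1]: 1.0}


def historical_behavior(history):
    """Toy tool: normalized visit frequencies."""
    counts = Counter(history)
    return {loc: n / len(history) for loc, n in counts.items()}


def predict_next(history, tools, threshold=0.8, max_calls=4):
    """Fast path for routine cases, iterative evidence gathering otherwise."""
    location = fast_path(history, threshold)
    if location is not None:
        return location, []
    votes, used = Counter(), []
    for name, tool in tools[:max_calls]:
        votes.update(tool(history))
        used.append(name)
        ranked = votes.most_common(2)
        decisive = len(ranked) == 1 or ranked[0][1] >= 2 * ranked[1][1]
        if len(used) >= 2 and decisive:  # stand-in for the controller's stop decision
            break
    return votes.most_common(1)[0][0], used


TOOLS = [("recent_trajectory", recent_trajectory),
         ("historical_behavior", historical_behavior)]

routine = ["home"] * 9 + ["work"]
ambiguous = ["home", "work", "home", "work", "gym", "work"]

routine_result = predict_next(routine, TOOLS)
ambiguous_result = predict_next(ambiguous, TOOLS)
print(routine_result, ambiguous_result)
```

The routine history is resolved without any tool call, while the ambiguous one consumes evidence until one candidate clearly leads, which is the adaptive behaviour the abstract credits for the gain on non-fast-path cases.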
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 4.0/10 | 6.0 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
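评分理由: The paper proposes AgentMob, an LLM-driven, evidence-grounded framework for mobility prediction whose core is adaptive evidence gathering and decision making rather than unified model architectures or representation learning. It does not cover tokenizer design, visual encoder architecture, or model-based reinforcement learning training. Although it uses an LLM, it does not emphasize multimodal fusion or world-model construction, so its relevance to the given keywords is low to moderate. None of the designated experts appear in the author list, so no bonus is added.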
关键词
Mobility Prediction, LLM-Driven Agent, Evidence-Grounded, Training-Free, Adaptive Decision Making, Tool Use, Next-Location Prediction
摘要翻译
主动推断将决策视为推断,其中预期自由能 (EFE) 统一了目标导向行为与信息寻求行为。近期研究表明,预期自由能 (EFE) 最小化可表述为在增加了认识先验的生成模型上的变分自由能 (VFE) 最小化。我们证明,增强模型的变分自由能 (VFE) 可重写为预测模型的变分自由能 (VFE) 加上显式熵修正项,从而使预期自由能 (EFE) 的贡献变得清晰。随后我们表明,恰当的基于预期自由能 (EFE) 的规划需将这些认识修正与一个规划修正相结合,该修正将边缘推断转化为策略优化,从而得到基于预期自由能 (EFE) 规划的完整变分表征。这明确了交叉熵规划与完整的基于预期自由能 (EFE) 规划分别需要的修正。同样的熵修正公式导出了基于预期自由能 (EFE) 规划的详细消息传递方案,以及更简化的变体。在三个网格世界环境中的实验表明,当观测具有决定性时,规划修正即可提供帮助;而当观测仅为提示性时,额外的观测侧认识修正则最为关键。
Abstract
Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution transparent. We then show that proper EFE-based planning requires combining these epistemic corrections with a planning correction that turns marginal inference into policy optimization, yielding a full variational characterization of EFE-based planning. This clarifies which corrections are needed for cross-entropy planning and for full EFE-based planning. The same entropy-corrected formulation leads to a detailed message-passing scheme for EFE-based planning together with simpler ablations. Experiments on three grid-world environments show that the planning correction already helps when observations are decisive, whereas the additional observation-side epistemic corrections matter most when observations are merely suggestive.
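For readers unfamiliar with the Expected Free Energy, the sketch below computes a common textbook form of it for a discrete model: risk (the KL divergence between predicted and preferred observations) plus ambiguity (the expected entropy of the observation likelihood). This is not the paper's entropy-corrected VFE formulation, and the function and argument names are assumptions made for illustration.

```python
import math


def kl_divergence(q, p):
    """KL[q || p] for discrete distributions; assumes p > 0 wherever q > 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def expected_free_energy(q_states, likelihood, preferred_obs):
    """Risk + ambiguity form of the EFE; likelihood[s][o] = p(o | s)."""
    n_states = len(q_states)
    n_obs = len(preferred_obs)
    # Predicted observation distribution under the state beliefs.
    q_obs = [
        sum(q_states[s] * likelihood[s][o] for s in range(n_states))
        for o in range(n_obs)
    ]
    risk = kl_divergence(q_obs, preferred_obs)
    ambiguity = sum(q_states[s] * entropy(likelihood[s]) for s in range(n_states))
    return risk + ambiguity
```

With a decisive (identity) likelihood the ambiguity term vanishes, while a uniform likelihood contributes ln 2 per binary observation, which gives some intuition for the abstract's distinction between decisive and merely suggestive observations.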
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 3.0/10 | 4.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 4.0/10 | 6.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 7.0/10 | 10.5 |
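评分理由: The paper focuses on theoretical derivations for active inference, involving generative models and policy optimization. model-based RL is highly relevant (7.0) because active inference plans with a generative model; World Models is moderately relevant (4.0) because a generative model of the world is involved; Unify Models has low relevance (3.0) because the paper unifies behavioral objectives rather than model architectures; the remaining keywords (Tokenizer, Visual Encoder, MLLM, MultiModal) concern specific components of multimodal large models and are unrelated to the paper's theoretical content (0.0). The author list contains none of the designated experts, so no bonus is added. The weighted total is 21.0 (the sum of the score column above), below the dynamic passing score of 27.8.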
关键词
Active Inference, Expected Free Energy, Variational Free Energy, Generative Model, Epistemic Priors, Policy Optimization, Planning Correction
摘要翻译
大语言模型(LLMs)面临捷径学习问题:它们在分布外(OOD)输入上系统性地失效,尽管逻辑结构相同,但这些输入的语义表层与训练数据存在差异。这破坏了将思维链推理(CoT)知识蒸馏至较小学生模型(Student Models)的流程。我们提出不变梯度对齐(Invariant Gradient Alignment, IGA),这是一种训练框架,通过三项创新,在语义多样但逻辑同构的示例之间对齐梯度更新:(i) 逻辑同构集(Logical Isomer Sets),指在不同语义领域(数学、医学、法律、科学)中共享相同逻辑结构的问题组;(ii) 可微的“连续梯度冲突掩码”(Continuous Gradient Conflict Mask),用于抑制跨域梯度方差较高的参数维度,同时保留不变方向;(iii) 将掩码后的梯度通过截断奇异值分解(SVD)投影回 LoRA 低秩流形,从而在整个过程中保持参数效率。理论上,IGA 比经验风险最小化(ERM)具有更紧的分布外(OOD)泛化界,该界随异构体领域的数量缩放,且在温和正则性条件下以标准随机梯度下降(SGD)速率收敛。实验上,IGA 在四个基准测试上优于八个基线方法,相比经验风险最小化微调(ERM-SFT)准确率提升高达 14.3 个百分点,且逻辑一致性得分为 0.031(对比 0.142)——实现了表征不变性的四倍提升。
Abstract
Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermines knowledge distillation pipelines that transfer chain-of-thought reasoning to smaller students. We introduce Invariant Gradient Alignment (IGA), a training framework that aligns gradient updates across semantically diverse but logically isomorphic examples via three innovations: (i) Logical Isomer Sets, groups of problems sharing identical logical structure across distinct semantic domains (mathematics, medicine, law, science); (ii) a differentiable Continuous Gradient Conflict Mask that suppresses parameter dimensions with high cross-domain gradient variance while preserving invariant directions; and (iii) a truncated SVD projection of the masked gradient back onto the LoRA low-rank manifold, maintaining parameter efficiency throughout. Theoretically, IGA yields tighter OOD generalization bounds than ERM, scaling with the number of isomer domains, and converges at the standard SGD rate under mild regularity. Empirically, IGA outperforms eight baselines across four benchmarks with accuracy gains up to 14.3 pp over ERM-SFT and a Logical Consistency Score of 0.031 versus 0.142, a fourfold improvement in representational invariance.
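To make item (ii) concrete, here is a minimal sketch of a soft, variance-based gradient mask. The functional form `1 / (1 + variance / temperature)` is an assumption chosen for simplicity, not the paper's Continuous Gradient Conflict Mask, and the truncated SVD projection from item (iii) is omitted.

```python
def masked_mean_gradient(domain_grads, temperature=1.0):
    """Average per-domain gradients, damping dimensions that disagree.

    domain_grads: one gradient vector (list of floats) per semantic domain.
    The mask is a hypothetical smooth function of cross-domain variance:
    near 1 where domains agree, shrinking toward 0 where they conflict.
    """
    n_domains = len(domain_grads)
    n_dims = len(domain_grads[0])
    masked = []
    for j in range(n_dims):
        column = [grad[j] for grad in domain_grads]
        mean = sum(column) / n_domains
        variance = sum((x - mean) ** 2 for x in column) / n_domains
        mask = 1.0 / (1.0 + variance / temperature)
        masked.append(mask * mean)
    return masked
```

Because the mask is a smooth function of the variance rather than a hard threshold, it stays differentiable, which is the property the abstract highlights.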
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 4.0/10 | 6.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on LLM reasoning distillation and OOD generalization via gradient alignment, showing low relevance to Multimodal components (Visual Encoder, MultiModal), World Models, and Model-based RL. It shares moderate relevance with MLLM as an LLM study but lacks multimodal specifics. Tokenizer and Unify Models are not central themes. No expert authors from the specified list were found, so no bonus points were added.
关键词
Invariant Gradient Alignment, Reasoning Distillation, Out-of-Distribution Generalization, Logical Isomer Sets, Gradient Conflict Mask, LoRA, Large Language Models
摘要翻译
我们介绍了 IRIS-GAN,这是一种针对跨生成器偏移下合成人脸图像的专业法医检测器。与解决通用合成图像检测不同,我们专注于由生成对抗网络(GANs)生成的人脸,这些网络在深度伪造内容中处于最先进水平,并通过分阶段暴露于要求日益提高的 GAN 家族(同时保留早期生成器)来训练该检测器。最终模型在考虑的 GAN 家族上实现了超过 99% 的伪造检测率,并在外部真实人脸数据集上实现了 98.9% 的分类准确率。Grad-CAM 分析进一步揭示了可测量的生成器依赖的空间响应模式,这些模式对于仅基于热力图的分类器仍具有信息量。在扩散模型生成的人脸上的家族外测试证实了 IRIS-GAN 是一种专业检测器,具备一定的检测非 GAN 深度伪造的能力。这些结果确立了分阶段训练作为一种稳健的 GAN 人脸法医分析的有效策略。
Abstract
We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
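评分理由: The paper focuses on deepfake face detection and classification across GAN families; it does not involve tokenizers, world models, multimodal large language models (MLLM), or model-based RL. Image analysis implicitly relies on a visual encoder, which accounts for the Visual Encoder score (8.0), but the content is only weakly connected to the core definitions of Unify Models or MultiModal, so most keywords score low.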
关键词
Deepfake Detection, GAN Families, Staged Training, Specialist Detector, Synthetic Faces, Forensic Analysis, Grad-CAM, Cross-Generator Shift
摘要翻译
近期基于生成模型的多视图图像编辑进展,使我们离通用 3D 内容生成和定制更近了一步。大多数现有工作利用未编辑场景的几何结构,专注于刚性或仅外观的编辑。这自然限制了这些方法仅适用于保留底层场景结构的编辑。其他方法则是针对特定图像编辑任务(如物体移除和添加)进行训练的。尽管取得了这些进展,通用非刚性编辑(即显著改变场景几何的编辑)对现有方法而言仍然具有挑战性。我们提出 GeM-NR,一种快速灵活的无需训练的通用多视图一致图像编辑方法,涵盖显著改变场景几何和外观的编辑。给定一个使用选定骨干编辑器 (backbone editor)(如 FLUX、Qwen、BrushNet)编辑过的锚图像 (anchor image) 和一个查询未编辑图像 (query image),GeM-NR 能够根据锚编辑一致地编辑查询图像。该方法包含多个阶段:(i) 深度图 (depth map) 估计,其中我们提出一种策略以最大化编辑场景与未编辑场景之间 3D 点云 (3D point clouds) 的对齐度;(ii) 投影到查询视角;以及 (iii) 基于未编辑查询图像对所获图像进行细化。这种基于条件的表述在从两个视图扩展到多个物体视图时具有良好的可扩展性。我们展示了该方法处理几何和外观发生显著变化的编辑的能力,而这正是现有方法所难以应对的。我们进行了广泛的评估,结果表明该方法在多种编辑任务中提升了包括生成编辑场景 3D 表示在内的一致性。定性和定量结果表明,该方法在编辑质量以及多视图下的几何和光度 (photometric) 一致性方面均达到了最先进的性能。
Abstract
Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these methods to edits that preserve the underlying scene structure. Other approaches are trained for specific image editing tasks, such as object removal and addition. Despite this progress, general nonrigid edits, i.e., edits that substantially change the scene geometry, remain challenging for existing methods. We propose GeM-NR, a fast and flexible training-free approach for general multi-view consistent image editing, including edits that drastically change the geometry and appearance of the scene. Given an anchor image edited with a chosen backbone editor (such as FLUX, Qwen, BrushNet) and a query unedited image, GeM-NR edits the query image consistently with the anchor edit. The method incorporates multiple stages: (i) depth map estimation, where we propose a strategy to maximize the alignment between the 3D point clouds of the edited and unedited scenes, (ii) projection onto a query viewpoint, and (iii) refinement of the obtained image conditioned on the unedited query. The conditioning-based formulation scales well from two to many views of an object. We demonstrate the ability of our method to handle edits with significant changes in geometry and appearance, something that existing methods struggle with. We perform an extensive evaluation showing that our method improves consistency for a wide variety of edit tasks, including generating 3D representations of the edited scene. Both quantitative and qualitative results indicate the state-of-the-art performance of our method in terms of edit quality as well as geometric and photometric consistency across multiple views.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 4.0/10 | 6.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on geometry-aware multi-view image editing for nonrigid scene changes using generative models. It utilizes existing tools (e.g., Qwen, FLUX) but does not propose unified model architectures, tokenizers, world models, or reinforcement learning frameworks. Moderate relevance is assigned to Visual Encoder and MultiModal due to vision and geometry processing, while other keywords are largely irrelevant to the core contribution. No target expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) were found in the author list.
关键词
Multi-View Editing, Nonrigid Scene Changes, Geometry-Aware, Generative Models, 3D Content Generation, Depth Map Estimation, Image Consistency
摘要翻译
基于评分标准的强化学习(RL)利用大语言模型作为裁判(LaaJ),依据评分标准对模型输出进行评分,并将其作为奖励。然而,策略模型可能会利用裁判中的潜在偏见,导致奖励黑客行为(reward hacking),从而产生无效或不安全的训练结果。在实际的基于评分标准的强化学习中,此类黑客行为往往较为微妙,且与多种裁判偏见相互纠缠,导致难以分析、检测和缓解。本文介绍了一种名为 CHERRL 的可控黑客环境,专门用于基于评分标准的强化学习。通过向 LaaJ 注入已知偏见,CHERRL 能够实现奖励黑客行为的重现、奖励分歧的明确观察以及黑客起始点的精确识别。这为研究基于评分标准的强化学习中奖励黑客的机制及缓解方法提供了一个清晰的实验测试平台。为了展示其效用,我们从可发现性和可利用性的角度分析了不同的裁判偏见,并探索了一种基于智能体的系统,用于从训练日志中自动检测奖励黑客行为的起始点。代码及环境已在 https://github.com/THUAIS-Lab/CHERRL 上公开提供。
Abstract
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.
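The idea of injecting a known bias into the judge and watching for reward divergence can be illustrated with a toy example. None of this is CHERRL's code: the keyword-coverage judge, the `trigger` phrase, the `bonus`, and the `gap` threshold are all hypothetical stand-ins for an LLM judge and its injected bias.

```python
def clean_judge(response, rubric_keywords):
    """Toy rubric score: fraction of rubric keywords the response covers."""
    text = response.lower()
    hits = sum(1 for keyword in rubric_keywords if keyword in text)
    return hits / len(rubric_keywords)


def biased_judge(response, rubric_keywords, trigger="certainly", bonus=0.5):
    """Clean judge plus an injected, known bias: a bonus for a trigger word."""
    base = clean_judge(response, rubric_keywords)
    if trigger in response.lower():
        return min(1.0, base + bonus)
    return base


def hacking_onset(clean_rewards, biased_rewards, gap=0.3):
    """Return the first training step where the biased reward exceeds the
    clean reward by more than `gap`, or None if the two never diverge."""
    for step, (clean, biased) in enumerate(zip(clean_rewards, biased_rewards)):
        if biased - clean > gap:
            return step
    return None
```

Because the bias is known, the clean score is available as ground truth at every step, so divergence between the two reward curves can be read off directly instead of being inferred.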
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
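评分理由: The paper studies reward hacking in rubric-based RL, using an LLM as the judge. Unify Models, Tokenizer, Visual Encoder, World Models, and MultiModal have no direct connection to the paper and receive minimal relevance (1.0). MLLM has low relevance (3.0): the paper involves LLMs but not multimodality. model-based RL also has low relevance (3.0): the paper belongs to RL but concerns reward modeling rather than learning an environment model. None of the designated experts appear in the author list, so no bonus is added. The weighted total is below the dynamic passing score, indicating a weak match with the given keyword topics.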
关键词
Rubric-based Reinforcement Learning, Reward Hacking, LLM-as-a-Judge, CHERRL, Judge Biases, Controllable Hacking Environment, Reward Divergence
摘要翻译
车辆车身类型是超车事故中自行车骑行者受伤严重程度的重要决定因素,然而,目前公开文献中尚无能够从自然主义道路视频中自动将车辆分类为与伤害风险相关类别的工具。标准目标检测基准仅提供粗略的车辆标签(如汽车、卡车、公交车、摩托车),而现有的细粒度识别系统通常在受控图像数据上训练,且缺乏跨不同录制站点部署鲁棒性的评估。本文提出了一种开源的两阶段计算机视觉流水线,该流水线结合了预训练的 RT-DETR 检测器用于粗略车辆定位,以及微调的视觉 Transformer(ViT-Base/16)用于六类车身类型分类:乘用车、SUV、皮卡车、Minivan、大型货车和商用卡车。该流水线包含一个基于置信度的拒绝机制:当 softmax 输出低于 0.60 时,第二阶段将拒绝输出预测,从而生成未知标签,而非隐蔽的错误分类。在密歇根州安娜堡自行车道走廊的 3,805 个标注超车事件(分布内)上评估,该流水线实现了 0.94 的准确率,各类别 F1 分数范围从 0.91(Minivan)至 0.97(SUV)。在未重新训练的情况下,对来自开放骑行数据集的 311 个事件进行独立的分布外评估,准确率为 0.89。在域偏移下,四个样本充足的类别中有三个保持了 F1 分数在 0.90 或以上。性能下降最显著的是 Minivan(F1 = 0.72),这主要由拒绝率从 2.4% 上升至 25.0% 驱动,而非主动错误分类,表明该机制确实传播了真实的模型不确定性。完整的流水线,包括推理脚本、训练代码、评估工具及模型权重,均以开源软件形式发布,以支持路边视频档案和骑行安全研究中的可复现性与重用。
Abstract
Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment robustness across recording sites. This paper presents an open-source two-stage computer vision pipeline combining a pre-trained RT-DETR detector for coarse vehicle localization with a fine-tuned Vision Transformer (ViT-Base/16) for six-category body-type classification: passenger car, SUV, pickup truck, minivan, large van, and commercial truck. A confidence-based abstention mechanism withholds Stage 2 predictions when softmax output falls below 0.60, producing unknown labels rather than silent misclassifications. Evaluated on 3,805 annotated overtaking events from a bicycle-lane corridor in Ann Arbor, Michigan (in-distribution), the pipeline achieved 0.94 accuracy with per-class F1 scores from 0.91 (minivan) to 0.97 (SUV). On an independent out-of-distribution evaluation of 311 events from an open cycling dataset without retraining, accuracy was 0.89. Three of four well-represented categories maintained F1 at or above 0.90 under domain shift. The largest degradation was observed for minivan (F1 = 0.72), driven by abstention rate rising from 2.4% to 25.0% rather than active misclassification, consistent with the mechanism propagating genuine model uncertainty. The full pipeline, including inference scripts, training code, evaluation utilities, and model weights, is released as open-source software to support reproducibility and reuse across roadside video archives and cycling safety research.
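The confidence-based abstention step is simple enough to sketch directly. The six class names and the 0.60 threshold come from the abstract; interpreting "softmax output" as the maximum class probability, and the function names, are assumptions of this sketch.

```python
import math

CLASSES = [
    "passenger car", "SUV", "pickup truck",
    "minivan", "large van", "commercial truck",
]
ABSTAIN_THRESHOLD = 0.60  # threshold stated in the abstract


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_abstention(logits, threshold=ABSTAIN_THRESHOLD):
    """Return (label, confidence); label is "unknown" below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "unknown", probs[best]  # abstain instead of guessing
    return CLASSES[best], probs[best]
```

Under this rule a rise in the abstention rate, as reported for minivans under domain shift, shows up as more "unknown" labels rather than as more wrong labels.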
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 7.0/10 | 10.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on fine-grained vehicle classification using Vision Transformers and RT-DETR, showing moderate relevance to 'Visual Encoder' (ViT usage) and weak relevance to 'Unify Models' (two-stage pipeline structure). It is largely unrelated to 'World Models', 'MLLM', 'MultiModal', 'model-based RL', and 'Tokenizer' (in the context of language/generative models). No expert authors from the specified list are present.
关键词
Vehicle Classification, Vision Transformers, Fine-grained Recognition, Two-Stage Pipeline, Roadway Video, Open-Source, RT-DETR, Confidence-based Abstention
摘要翻译
选择合适的 3D 表示(3D representation)是一个基本的设计决策,它决定了现代计算机视觉与图形学管线(computer vision and graphics pipelines)在 3D 重建、新视图合成与渲染、形状与运动分析、识别及生成等任务中的效率、质量和能力。虽然传统表示(例如网格(meshes)、点云(point clouds)和体素网格(volumetric grids))仍然是 3D 传感器(例如激光雷达(LiDAR)和 3D 扫描仪)的标准输出,并在下游应用(例如编辑和模拟)中广泛使用,但近期基于神经和基元(primitive)的表示(例如 3D 高斯泼溅(3D Gaussian Splatting))提供了紧凑且可微的替代方案,在游戏、AR/VR、自动驾驶、机器人导航和医学成像等应用中开启了广泛的机会,仅举几例。本文旨在综述 3D 表示的主要类别,涵盖从离散显式格式到基于神经渲染(neural rendering)或基元泼溅(primitive splatting)的连续隐式场(continuous implicit fields)。对于每种表示类型,我们阐述其一般形式化及其变体,讨论其优势与局限性,并重点阐述关键应用。本文最后概述了开放挑战及未来研究的潜在方向。区别于近期广泛覆盖 3D 物体与场景重建的综述,本文聚焦于 3D 表示本身演化的分析。我们特别强调向隐式表示(implicit representations)的范式转变,提供了一种新颖的视角,阐释这些新兴格式如何从根本上改变 3D/4D 工作流程。
Abstract
The selection of an appropriate 3D representation is a fundamental design decision that dictates the efficiency, quality, and capabilities of modern computer vision and graphics pipelines for tasks such as 3D reconstruction, novel-view synthesis and rendering, shape and motion analysis, recognition, and generation. While traditional representations (e.g., meshes, point clouds, and volumetric grids) remain standard outputs of 3D sensors (e.g., LiDAR and 3D scanners) and are widely used in downstream applications (e.g., editing and simulation), recent neural and primitive-based representations (e.g., 3D Gaussian Splatting) offer compact and differentiable alternatives opening a wide range of opportunities in applications such as games, AR/VR, autonomous driving, robot navigation, and medical imaging, to name a few. The goal of this paper is to survey the main families of 3D representations from discrete explicit formats to continuous implicit fields based either on neural rendering or primitive splatting. For each type of representation, we present the general formulation and its variants, discuss its benefits and limitations, and highlight key applications. We conclude the paper by outlining the open challenges and potential directions for future research. Distinct from recent surveys that broadly cover 3D object and scene reconstruction, this paper provides a focused analysis on the evolution of 3D representations themselves. We specifically emphasize the paradigm shift toward implicit representations, offering a novel perspective on how these emerging formats fundamentally alter 3D/4D workflows.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 3.0/10 | 4.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper is a survey on learning-based 3D geometric representations (meshes, point clouds, implicit fields, Gaussian Splatting) for computer vision and graphics. It does not focus on Multimodal Large Language Models (MLLM), tokenizers, or model-based reinforcement learning algorithms. While 3D representations are components in world models and visual encoders, the core contribution is a taxonomy of geometry rather than multimodal learning or control systems. Therefore, relevance to the provided keywords is low, resulting in a weighted score below the dynamic passing threshold.
关键词
3D Representations, Learning-based, Implicit Fields, Neural Rendering, Primitive Splatting, Computer Vision, Graphics Pipelines
摘要翻译
大语言模型 (LLM) 的后训练过程经历多个阶段,例如监督微调 (SFT) 之后是基于人类反馈的强化学习 (RLHF) 或直接偏好优化 (DPO),每个阶段的数据均来源于不同且可能不可信的来源。现有文献假设数据投毒攻击可能发生在每个训练阶段,却忽略了存在多个攻击者的可能性。为了研究整个后训练流程的可信度,我们提出了顺序数据投毒的威胁模型,其中多个攻击者分别对 SFT 数据集和偏好数据集进行投毒。在此威胁模型下,我们发现了一种“单攻击者幻觉”:每个攻击者若单独评估,似乎构成的威胁微不足道。然而,当攻击者在不同阶段之间协作时,真正的漏洞才会显现。在 SFT → DPO 流程中,攻击者的贡献是可加的:将固定的投毒预算分散到各个阶段优于将其集中在任一阶段。在 SFT → 近端策略优化 (PPO) 流程中,攻击者的贡献是互补的:单独的 SFT 投毒或奖励模型投毒均无法成功,但两者的结合却能奏效。这些发现表明,对单个后训练阶段的安全分析系统性地低估了仅从其交互中才会出现的复合漏洞。代码可在 https://github.com/jcksanderson/sequential-poisoning 获取。
Abstract
LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT → DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT → PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at https://github.com/jcksanderson/sequential-poisoning.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on security vulnerabilities (sequential data poisoning) in LLM post-training pipelines (SFT, DPO, PPO). It does not address multimodal architectures (Tokenizer, Visual Encoder, MultiModal, MLLM), world models, or model-based reinforcement learning algorithms. While RLHF/PPO involves reinforcement learning, it is model-free, not model-based. Unify Models is not the core theme. Thus, relevance to the provided keywords is low. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.
关键词
Sequential Data Poisoning, LLM Post-Training, SFT, DPO, PPO, Threat Model, Compound Vulnerabilities, Adversarial Collaboration
摘要翻译
解剖标志点检测是医学图像分析中的基础任务,支持广泛的诊断与介入工作流程。尽管近期方法已实现亚毫米级定位,但仅凭准确性不足以支持临床部署,还需要预测具备可靠性和鲁棒性。尽管具有临床相关性,但在此背景下表征学习的影响仍未被充分探索。在本文中,我们引入了 CDPM-align,一种用于解剖标志点检测的多尺度引导对齐条件扩散预训练方法。我们的实验设置专注于少量图像和少量标注方案。具体来说,我们通过条件生成预训练,利用三个流行的异构小尺度基准数据集进行表征学习。此外,我们考虑了下游标志点检测任务中的低标注场景,分别使用 10 张和 25 张标注图像,反映了临床工作量与标注资源约束之间的现实权衡。我们的结果表明,生成式预训练使模型能够学习鲁棒的表征。这提高了下游任务的准确性和不确定性估计,朝着安全高效的临床部署迈进。
Abstract
Anatomical landmark detection is a fundamental task in medical image analysis supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, requiring reliability and robustness in prediction. Despite its clinical relevance, the impact of representation learning in this context is still underexplored. In this work, we introduce CDPM-align, a multi-scale guidance-aligned conditional diffusion pre-training for anatomical landmark detection. Our experimental setup focuses on a few images and a few annotation regimes. Specifically, we employ three popular heterogeneous small-scale benchmark datasets for representation learning via conditional generative pre-training. Furthermore, we consider low-annotation scenarios for the downstream task of landmark detection, with 10 and 25 annotated images, reflecting realistic trade-offs between clinical effort and resource constraints for annotations. Our results confirm that generative pre-training enables the model to learn a robust representation. This improves both accuracy and uncertainty on the downstream tasks, advancing towards safe and efficient clinical deployment.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 3.0/10 | 4.5 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on medical image analysis using diffusion models for few-shot landmark detection. It does not address Unify Models, Tokenizers, World Models, MLLM, or Model-Based RL. Visual Encoder is implicitly part of the diffusion architecture but not a core contribution. Multi-modality is limited to image and annotations. Thus, relevance to the provided keyword list is low.
关键词
Anatomical Landmark Detection, Diffusion Pretraining, Few-Shot Learning, Robustness, Medical Image Analysis, Conditional Generative, Uncertainty Estimation
摘要翻译
扩散语言模型(DLMs)通过迭代去噪生成文本,分块解码通过在局部块中确定词元提高了其实用性。然而,现有的分块方法通常依赖于固定块大小或基于分隔符的运行时信号,这些信号不一定与语义边界对齐。在本文中,我们提出 SemBlock,一种面向扩散大语言模型(LLMs)的语义边界驱动动态块解码框架。SemBlock 将动态块构建形式化为语义边界预测,并在冻结的 LLaDA 隐藏状态上训练轻量级预测器。为了提供监督,我们构建 SemBound,一个语义边界数据集,该数据集从自然语言、数学和代码任务的话语单元、推理步骤和实现片段中推导边界标签。在推理过程中,SemBlock 使用预测的边界概率来确定每个动态块的结束位置。在 GSM8K、IFEval、MATH 和 HumanEval 上的实验表明,SemBlock 一贯优于固定块解码和 AdaBlock。我们的代码已开源:https://github.com/TH-AI-Lab-PKU/SemBlock.
Abstract
Diffusion language models (DLMs) generate text through iterative denoising, and blockwise decoding improves their practicality by committing tokens in local blocks. However, existing blockwise methods typically rely on fixed block sizes or delimiter-based runtime signals, which do not necessarily align with semantic boundaries. In this paper, we propose SemBlock, a semantic-boundary-driven dynamic block decoding framework for diffusion LLMs. SemBlock formulates dynamic block construction as semantic boundary prediction and trains lightweight predictors on frozen LLaDA hidden states. To provide supervision, we construct SemBound, a semantic-boundary dataset that derives boundary labels from discourse units, reasoning steps, and implementation spans across natural language, math, and code tasks. During inference, SemBlock uses predicted boundary probabilities to select the ending position of each dynamic block. Experiments on GSM8K, IFEval, MATH, and HumanEval show that SemBlock consistently improves over fixed-block decoding and AdaBlock. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SemBlock.
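One way to turn predicted boundary probabilities into a block end is sketched below. The selection rule (first position at or above a threshold, within minimum and maximum block lengths, otherwise fall back to a fixed-size block) and all parameter values are assumptions for illustration; the abstract does not specify SemBlock's actual rule.

```python
def select_block_end(boundary_probs, start, min_len=2, max_len=8, threshold=0.5):
    """Return the exclusive end index of the next decoding block.

    boundary_probs[i] is the predicted probability that token i closes a
    semantic unit. The block covers tokens start .. end - 1.
    """
    n = len(boundary_probs)
    if start >= n:
        return n
    first_end = min(start + min_len, n)   # shortest allowed block
    last_end = min(start + max_len, n)    # longest allowed block
    for i in range(first_end - 1, last_end):
        if boundary_probs[i] >= threshold:
            return i + 1                  # close the block at the boundary
    return last_end                       # no boundary found: fixed-size block
```

When no position clears the threshold, the rule degrades to fixed-block decoding with size `max_len`, so the fixed-block baseline is a special case of the dynamic scheme.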
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
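评分理由: The paper's core is decoding-strategy optimization for diffusion language models (diffusion LLMs), specifically semantic-boundary-driven dynamic block decoding. The areas the given keywords target (multimodality, visual encoders, world models, reinforcement learning) are not addressed in the paper, so relevance scores are low. Tokens appear only as the unit of generation, and the tokenizer is not a core contribution; Unify Models applies only loosely, through the uniformity of the decoding logic.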
关键词
Diffusion Language Models, Blockwise Decoding, Semantic Boundaries, Dynamic Block, Text Generation, Denoising Process, LLaDA
摘要翻译
长时程在线视觉映射是机器人感知的一项核心能力,需要在有界内存与计算的约束下,从视觉流中持续进行相机运动与场景几何估计。近期前馈 3D 重建模型提供了强大的几何先验,但其流式变体通常在与第一帧绑定的固定坐标系或持久场景记忆中预测位姿。这种固定规范设计会导致训练 - 测试不匹配、对早期锚点的注意力偏差,以及在远长于训练期间所见序列上的累积漂移。我们提出 Anchor3R,一种流式 3D 重建框架,该框架将前馈重建视为以当前为中心的局部测量预测,而非持久全局规范回归。在每个时间步,Anchor3R 在当前帧坐标系中预测窗口相对位姿和局部点图,从而将流式重建转化为相对位姿测量生成。这些测量支持在线位姿更新,而闭环重插入和运动平均则用于对齐轨迹,并将局部点图转换为连贯的全局重建。在室内、室外、驾驶和 RGB-D 基准测试上的实验表明,Anchor3R 相比现有流式基线提高了长时程位姿精度和密集重建质量,同时支持有界内存在线推理。
Abstract
Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train-test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.
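The step from relative-pose measurements to a global trajectory can be illustrated by chaining pose compositions. This sketch is a planar (x, y, heading) simplification chosen for brevity; camera poses in the paper's setting are 3D, and the loop-closure reinsertion and motion averaging steps are not modelled here.

```python
import math


def compose(pose, delta):
    """Apply a relative motion, expressed in the current frame, to a pose."""
    x, y, theta = pose
    dx, dy, dtheta = delta
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        theta + dtheta,
    )


def integrate(relative_poses, origin=(0.0, 0.0, 0.0)):
    """Chain relative-pose measurements into a global trajectory."""
    trajectory = [origin]
    for delta in relative_poses:
        trajectory.append(compose(trajectory[-1], delta))
    return trajectory
```

Since every global pose is a product of all earlier measurements, small per-step errors accumulate along the chain, which is why drift correction through loop closure matters on long sequences.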
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 5.0/10 | 7.5 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
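评分理由: The paper focuses on streaming 3D reconstruction and long-horizon visual mapping, within robot perception and computer vision. Unify Models, Tokenizer, MLLM, and model-based RL have no direct connection to the paper and score 0. A visual encoder is implicitly used for feature extraction but is not a core contribution, so Visual Encoder scores 5. World Models touches on scene understanding but the work is not a generative world model, and MultiModal covers only RGB-D visual input, so both have low relevance and score 2. The author list contains none of the designated experts (Yang Shi and others). The weighted total is 13.5, below the dynamic passing score of 27.8.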
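The weighted total quoted in the rationale can be reproduced from the score table above it. In this sketch the weight (1.5), the relevance values, and the passing score (27.8) are taken from the report; treating the pass check as "total greater than or equal to the passing score" is an assumption, since the report does not state the comparison.

```python
WEIGHT = 1.5          # per-keyword weight from the score table
PASSING_SCORE = 27.8  # dynamic passing score quoted in the rationale


def weighted_total(relevance, weight=WEIGHT):
    """Sum of weight * relevance over all keywords (no expert bonus)."""
    return sum(weight * score for score in relevance.values())


# Relevance values (out of 10) from the Anchor3R score table.
anchor3r = {
    "Unify Models": 0.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 5.0,
    "World Models": 2.0,
    "MLLM": 0.0,
    "MultiModal": 2.0,
    "model-based RL": 0.0,
}

total = weighted_total(anchor3r)   # 7.5 + 3.0 + 3.0
passes = total >= PASSING_SCORE    # assumed comparison rule
```

With seven keywords at weight 1.5, the maximum possible total is 105, so the passing score sits at roughly a quarter of the scale.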
关键词
Streaming 3D Reconstruction, Long-Horizon Visual Mapping, Transient Anchors, Relative-Pose Measurement, Loop-Closure Reinsertion, Motion Averaging, Pointmap Generation, Robot Perception
摘要翻译
中医 (TCM) 眼部望诊为评估巩膜表面异常提供了经验性依据,但其临床应用仍具有主观性且难以量化。为支持智能化和可量化的眼部望诊,本研究提出了中医启发的人工智能眼部辅助诊断系统 (TAO),并专注于像素级巩膜表面异常分割。针对受多源分布差异、多样异常形态及巩膜镜面反射 (SSR) 影响的临床及用户获取图像,我们提出了 HD-DinoMoE,一种类别感知层次化双混合专家网络。HD-DinoMoE 结合了类别感知双流 DINOv3 特征融合与类别特定多专家解码,以分割血管 (Vessels)、黄黑斑 (Yellow and Black Spots) 及血斑 (Blood Spots)。三阶段骨干冻结路由策略稳定了双骨干适应;渐进式置信度惩罚 (PCP) 损失减少了 SSR 区域中高置信度的假阳性及分割泄漏;类别感知自适应样本加权 (CA-ASW) 平衡了样本级与类别级的训练贡献。我们进一步构建了多标签巩膜异常分割数据集 (ML-SASD),这是一个包含临床、野外 (Wild) 和混合设置的新基准,并对三种异常类别进行了像素级标注。在 ML-SASD-Mix 上,HD-DinoMoE 达到了 72.11% 的平均 Dice 系数和 58.44% 的平均交并比 (mIoU),同时保持了良好的边界定位能力及镜面区域假阳性控制。它在公共 SBVPI 数据集的血管子集上也表现出具有竞争力的泛化性能。这些结果表明,HD-DinoMoE 为 TAO 在复杂采集场景下提供了一种可行的分割方案。代码和数据访问信息可在 https://github.com/FX-CMX/HD-DinoMoE 获取。
Abstract
Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at https://github.com/FX-CMX/HD-DinoMoE.
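The two segmentation metrics reported above, Dice and Intersection-over-Union, can be computed for a single binary mask as follows. This is a generic sketch, not the paper's evaluation code; masks are flattened lists of 0/1 values, and scoring two empty masks as a perfect match is a convention assumed here.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        return 1.0, 1.0  # both masks empty: treated as a perfect match
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

For a single mask, Dice is always at least as large as IoU, which is consistent with the reported mean Dice (72.11%) exceeding the mean IoU (58.44%).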
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 8.0/10 | 12.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on medical image segmentation using a Mixture-of-Experts network (HD-DinoMoE) and DINOv3 backbone. It shows high relevance to 'Visual Encoder' (DINOv3 is used as the feature extractor) and slight relevance to 'Unify Models' (dual-stream fusion mechanism). It is unrelated to 'Tokenizer', 'World Models', 'MLLM', 'MultiModal' (in the context of multimodal LLMs), and 'model-based RL', as the work is supervised computer vision rather than generative world modeling or reinforcement learning. The research background keywords target LLM/RL domains, while this paper is in medical AI.
关键词
HD-DinoMoE, Scleral Anomaly Segmentation, Mixture-of-Experts, DINOv3, Multi-label Segmentation, TCM Ocular Inspection, Dual Stream, Complex Acquisition Scenarios
摘要翻译
当后训练语言模型(Post-trained Language Models)在推理问题上失败时,常见的测试时扩展(test-time-scaling)响应是投入更多计算资源进行额外尝试,而失败轨迹不再发挥进一步作用。我们认为这丢弃了一个关键信号:部分失败源于不幸采样(unlucky sampling),此时更多展开(rollouts)有所帮助;而另一些失败则是结构性的,无论预算如何,重采样都无法克服。我们提出,失败轨迹编码了可恢复性结构(recoverability structure):即哪些测试时干预(test-time interventions)能够挽救特定失败的推理时特征(inference-time signature)。三个问题级轨迹特征(problem-level trajectory features)源自可用干预的结构,能够从失败展开的分布特征(distributional signature)中恢复此结构,而非基于其文本内容。它们将失败聚类为稳定模式(stable regimes),表征不同后训练方法的失败地形(failure topography)(准确率为 84.3±4.3%,比多数类基线(majority-class baseline)高出 20%),并支持一个无需训练的路由规则(training-free routing rule),在部署相关的 Steerable-Hard 子集(Steerable-Hard subset,即重试不足且有界干预可达的失败)上提升救援率 12.2%。这些特征和路由规则在两个跨模型族探针(cross-family probes)上均具有迁移性。因此,相同的三个特征将失败轨迹从被丢弃的数据转化为诊断对象,支持测试时路由(test-time routing)和后训练分析,且无需训练数据或权重空间访问。
Abstract
When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods ($84.3{\pm}4.3\%$ accuracy, $+20\%$ over a majority-class baseline), and support a training-free routing rule that lifts rescue by $+12.2\%$ on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
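评分理由: The paper focuses on analyzing reasoning failures of post-trained language models and on test-time intervention strategies; its core contribution is using trajectory features to distinguish sampling errors from structural failures. The keyword set mainly targets multimodal large-model architectures, world models, and model-based reinforcement learning, so it diverges substantially from this paper's topic of diagnosing reasoning in text-only language models. Apart from a weak semantic link to MLLM and Unify Models, the remaining keywords (such as Visual Encoder, Tokenizer, MultiModal, and model-based RL) have very low relevance, so the weighted total (12.0) falls far below the dynamic passing score (27.8).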
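The weighted total of 12.0 can be reproduced from the score table above, where each row's score is weight × relevance. The sketch below uses the relevance values from that table; the variable names and the `>=` passing rule are assumptions, since the report only states that 12.0 is below the dynamic passing score of 27.8.

```python
# Relevance values (out of 10) copied from the score table for this paper.
WEIGHT = 1.5
relevance = {
    "Unify Models": 2.0,
    "Tokenizer": 0.0,
    "Visual Encoder": 0.0,
    "World Models": 1.0,
    "MLLM": 2.0,
    "MultiModal": 1.0,
    "model-based RL": 2.0,
}

# Per-keyword score = weight * relevance; the total is their sum.
scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())

# Assumed rule: a paper passes when its total reaches the threshold.
PASS_THRESHOLD = 27.8
passed = total >= PASS_THRESHOLD
print(total, passed)
```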
关键词
Failed Reasoning Traces, Post-trained Language Models, Test-time Scaling, Intervention Routing, Failure Topography, Structural Failures, Unlucky Sampling
摘要翻译
基于大语言模型(LLM)的多智能体系统正日益应用于战略决策任务。在此类场景中,性能不仅取决于单个模型的能力,还取决于智能体之间交互与适应的策略。多智能体强化学习(MARL)可以优化这些交互策略,但其奖励设计往往仍局限于特定任务,且与交互结构的结合较弱。为了解决这一差距,我们提出 GARL(博弈论强化学习框架),用于多智能体战略优先级排序。GARL 将战略优先级排序形式化为一个两阶段博弈:竞争智能体首先在一个共享候选集上分配战略资源,随后由高层仲裁者生成最终排名。由此产生的博弈论效用被转换为角色特定的强化信号,从而使策略优化能够受到结构化交互的引导。我们将 GARL 实例化于争议问题排名任务中,其目标是在法律程序中优先处理核心问题。实验表明,GARL 提升了排名性能,使小型开源大语言模型在相同的候选排名设置下能够与强大的闭源大语言模型相竞争,并在法律领域能力及更广泛的战略决策中取得了收益。总体而言,GARL 展示了如何将博弈论交互结构转化为强化学习目标,为多智能体战略优先级排序中的策略优化提供了一种原则性的方法。
Abstract
LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
评分理由: The paper focuses on Game-Theoretic Reinforcement Learning for multi-agent strategic prioritization in the legal domain. It does not address multimodal architectures, tokenizers, visual encoders, or world models. While it utilizes LLMs and reinforcement learning, it lacks the core components of the specified keywords (e.g., no visual components, no world model learning, no explicit model-based dynamics learning), resulting in low relevance scores.
关键词
Game-Theoretic Reinforcement Learning, Multi-Agent Systems, Strategic Prioritisation, LLM-based Multi-Agent Systems, Policy Optimisation, Legal Domain, Resource Allocation, Interaction Structure
摘要翻译
视频全景分割(VPS)旨在联合检测、分割与跟踪所有对象,同时将视频划分为语义一致的区域。我们引入了无监督 VPS 的任务设定,摒弃了任何人工监督。现有的无监督场景理解工作主要集中于图像分割任务;视频领域仍探索不足。我们提出了 VideoCUPS,这是首个无监督 VPS 方法。VideoCUPS 通过利用无监督的深度、运动和视觉线索,从以场景为中心的视频中生成时间一致的视频全景伪标签。利用一种新颖的 Video DropLoss 在这些伪标签上进行训练,可获得一个准确的无监督 VPS 模型。为了评估进展,我们引入了一种全面的评估协议和四个竞争基线,将最先进的无监督全景图像分割和实例视频分割模型扩展至 VPS。VideoCUPS 优于所有基线,并展示了强大的标签高效学习能力。借助 VideoCUPS、我们的评估协议和基线,我们为未来的无监督 VPS 研究提供了坚实基础。
Abstract
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing work on unsupervised scene understanding has mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 2.5/10 | 3.8 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 3.0/10 | 4.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on unsupervised video panoptic segmentation using depth and motion cues, which is a computer vision task unrelated to MLLM, Tokenizers, World Models, or RL. It shows weak alignment with 'Unify Models' (task unification) and 'MultiModal' (RGB+Depth+Motion). No expert authors from the specified list are present.
关键词
Video Panoptic Segmentation, Unsupervised Learning, Pseudo-labels, Depth Cues, Motion Cues, Scene-Centric, VideoCUPS
摘要翻译
我们探究人类数学教学法的方法能否引导语言模型的训练,使其具备算术推理能力。基于 GASING 方法(一种印尼教学法,通过从左到右的过程解决基本算术,该过程与 token 生成的因果顺序一致),我们将每个运算操作化为一个计算过程,其执行轨迹被序列化为自然语言思维链(Chain-of-Thought,简称 CoT)监督。一个小型的 GPT-2 解码器(8600 万(86M)参数),配备用于印尼语的音节 - 黏着式 TOBA 分词器,仅使用下一个 token 预测目标,基于此数据从头开始训练,未使用强化学习(Reinforcement Learning)或基于奖励的优化。监控训练揭示了三个不同的学习阶段,而机制分析——包括对思维链信息图的注意力遮蔽干预、残差流探测(Residual-Stream Probing)以及 logit 透镜检查(Logit-Lens Inspection)——表明模型首先内化了一种程序路径,随后发展出一种关联式“心算”能力,能够在无需显式逐步计算的情况下检索中间结果。该训练模型在保留问题(Held-out Problems)上达到超过 80% 的准确率,并在性能上与规模大得多的语言模型相当,这表明基于教学法的目标训练能够在小规模下产生强大且经济高效的算术能力。
Abstract
We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method, an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation, we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses (attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection) show that the model first internalizes a procedural pathway and subsequently develops an associative, "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.
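To make the idea of a left-to-right procedure whose execution trace becomes text supervision more concrete, here is a generic sketch of left-to-right addition that records each step as a sentence. It is not the GASING procedure itself, whose details the abstract does not give; the function name `add_left_to_right` and the trace wording are hypothetical.

```python
def add_left_to_right(a: int, b: int):
    """Add two non-negative integers from the most significant digit downward.

    Returns (result, trace), where trace is a list of natural-language steps.
    Generic illustration only; not the paper's exact procedure.
    """
    da, db = str(a), str(b)
    width = max(len(da), len(db))
    da, db = da.zfill(width), db.zfill(width)

    digits = [0]  # leading slot absorbs a possible final carry
    trace = []
    for pos, (x, y) in enumerate(zip(da, db)):
        s = int(x) + int(y)
        digits.append(s % 10)
        step = f"column {pos + 1}: {x} + {y} = {s}, write {s % 10}"
        if s >= 10:
            # Push the carry back into the digits already written:
            # trailing 9s roll over to 0, the next digit is incremented.
            i = len(digits) - 2
            while digits[i] == 9:
                digits[i] = 0
                i -= 1
            digits[i] += 1
            step += ", carry 1 into the digits on the left"
        trace.append(step)
    result = int("".join(map(str, digits)))
    return result, trace


result, trace = add_left_to_right(478, 356)
print(result)
print("\n".join(trace))
```

Because the digits are produced most-significant first, the steps come out in the same order in which an autoregressive model would generate the answer tokens, at the price of occasionally revising digits that were already written.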
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 5.0/10 | 7.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
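评分理由: The paper studies how to train language models for arithmetic reasoning; its core contribution is generating Chain-of-Thought supervision guided by pedagogy. Keyword relevance: (1) Tokenizer: the paper explicitly mentions using a 'TOBA tokenizer', so relevance is moderate (5 points); (2) Unify Models: the work involves model training but not the unification of multiple model architectures, so relevance is low (2 points); (3) Visual Encoder, World Models, MLLM, MultiModal, and model-based RL are unrelated to the paper, which explicitly excludes reinforcement learning and has no visual or world-model components, so relevance is 0 points.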
关键词
Arithmetic Pedagogy, Language Models, Chain-of-Thought, Next-token Prediction, Syllabic Tokenizer, Mechanistic Analysis, Small-scale Training
摘要翻译
基于大语言模型(LLM)的智能体越来越多地通过与外部工具、检索系统、内存模块、环境及其他智能体交互来解决复杂任务。这些能力扩展了智能体的自主性,但也使得智能体行为更难验证、调试和审计。仅凭最终答案准确性无法解释输出是如何产生的、哪些证据支持了每个主张、工具调用是否合理、内存如何影响后续决策,或者执行失败源自何处。证据追踪和执行溯源通过建模检索到的证据、工具输出、内存条目、环境观察、中间主张、动作和最终答案如何在整个智能体执行过程中相互连接,从而填补这一空白。本文综述提供了关于 LLM 智能体中证据追踪和执行溯源的系统性回顾和概念框架。我们围绕统一的溯源视角组织相关工作,该视角连接了检索锚定、主张支持、工具使用安全、内存谱系、可观察性、调试、审计和恢复。我们引入一个涵盖追踪来源、证据和执行单元、溯源关系、追踪粒度与时机、表示形式和信任功能的分类法。我们回顾了关键的方法论方向,包括溯源表示、证据归因、工具使用溯源、运行时护栏、承载溯源的内存、基于追踪的可观察性和故障诊断。我们还映射现有的基准、数据集和评估指标到与溯源相关的功能,并讨论评估如何从最终答案正确性转向过程级问责。最后,我们概述了开放挑战,包括统一追踪模式、主张级和语义溯源、感知溯源的安全机制、真实的执行追踪基准、面向恢复的评估以及隐私感知审计基础设施。
Abstract
Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
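评分理由: The paper focuses on the interpretability and auditability of LLM agents; its core contribution is a framework for evidence tracing and provenance. Tokenizer and Visual Encoder concern model architecture components that the paper does not mention. World Models and model-based RL usually refer to modeling environment dynamics and to planning algorithms, whereas this paper only traces agent-environment interactions and does not build such models. MLLM and MultiModal emphasize multimodal fusion, whereas this paper is mainly LLM-based. Unify Models: the paper mentions a unified perspective, but this refers to conceptual unification rather than unified model architectures. Relevance is therefore low across the board. No designated expert authors were detected.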
关键词
LLM Agents, Evidence Tracing, Execution Provenance, Trust, Tool Use, Memory Lineage, Process-level Accountability
摘要翻译
实时数据分析需要具备在非平稳数据流中准确且自适应地应对非线性动力学的能力,同时保持计算效率。然而,非线性动力学如此复杂,以至于在严格的时间约束下捕捉动态变化的非线性模式并将其用于下游任务并非易事。为了弥合非线性复杂性与计算可处理性之间的差距,本研究采用了 Koopman 算子理论 (Koopman operator theory),该理论指出非线性动力学可表示为无限维空间中的线性变换。基于该算子的有限维近似,我们提出了 AdaKoop,一种用于在非平稳数据流上建模非线性动力学的高效流式算法。该方法基于 Koopman 算子理论构建了一个概率框架,将原始观测值和再生核希尔伯特空间 (RKHS) 特征均视为源自潜在向量的发射。这种双视角表述使得非线性动力学能够被表达为一个可处理的线性系统。因此,AdaKoop 能够在流式方式下高效且稳定地建模非线性动力学,避免了迭代非线性优化所带来的高昂计算成本。此外,为应对数据流中的非平稳性,AdaKoop 通过统计假设检验自适应地检测模式突变引起的切换,并增量更新模型参数以应对连续变化。在涵盖多个领域的总共 71 个实际基准数据集上的广泛实验表明,AdaKoop 在实时预测精度和计算效率方面均优于现有最先进方法。
Abstract
Real-time data analysis requires the ability to accurately and adaptively address nonlinear dynamics in a nonstationary data stream while preserving computational efficiency. However, nonlinear dynamics are so complex that capturing dynamically changing nonlinear patterns and utilizing them for downstream tasks under strict time constraints is nontrivial. To bridge the gap between nonlinear complexity and computational tractability, this study applies Koopman operator theory, which states that nonlinear dynamics can be represented as linear transitions in an infinite-dimensional space. Building upon finite-dimensional approximations of this operator, we present AdaKoop, an efficient streaming algorithm for modeling nonlinear dynamics over nonstationary data streams. Our approach utilizes a probabilistic framework grounded in Koopman operator theory, treating both raw observations and reproducing kernel Hilbert space (RKHS) features as emissions from latent vectors. This dual-view formulation allows nonlinear dynamics to be expressed as a tractable linear system. Therefore, AdaKoop enables the efficient and stable modeling of nonlinear dynamics in a streaming fashion, avoiding the prohibitive computational costs of iterative nonlinear optimization. Furthermore, to address nonstationarity in data streams, AdaKoop adaptively detects the switching of patterns via statistical hypothesis testing for abrupt pattern shifts and incrementally updates model parameters to handle continuous changes. Extensive experiments on a total of 71 practical benchmark datasets across various domains demonstrate that AdaKoop outperforms state-of-the-art methods in terms of real-time forecasting accuracy and computational efficiency.
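The premise that nonlinear dynamics can become linear once the state is lifted through suitable observables can be seen in a one-dimensional toy case. The sketch below is not AdaKoop: the map x ↦ x², the observable log x, and the batch least-squares fit are all chosen here purely for illustration.

```python
import math


def step(x):
    """Nonlinear map F(x) = x**2 (toy system chosen for illustration)."""
    return x * x


def lift(x):
    """Observable g(x) = log(x); for this map, g(F(x)) = 2 * g(x)."""
    return math.log(x)


# Simulate a short trajectory from a positive initial state.
xs = [1.1]
for _ in range(5):
    xs.append(step(xs[-1]))

# Lift the trajectory and fit the scalar linear model g_{t+1} ~ K * g_t
# by ordinary least squares: K = sum(g_t * g_{t+1}) / sum(g_t ** 2).
gs = [lift(x) for x in xs]
num = sum(a * b for a, b in zip(gs[:-1], gs[1:]))
den = sum(a * a for a in gs[:-1])
K = num / den

# Forecast one step ahead in the lifted space, then map back with exp.
pred_next = math.exp(K * gs[-1])
print(K, pred_next, step(xs[-1]))
```

Here log x is an exact eigenfunction of the Koopman operator with eigenvalue 2, so a single observable suffices. In realistic settings a finite dictionary of features (for example kernel features) only approximates the operator, which is the regime the abstract describes.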
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 2.0/10 | 3.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
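评分理由: The paper uses Koopman operator theory to handle nonlinear dynamics in streaming data, placing it in control theory and streaming algorithms. The keyword set (e.g., Tokenizer, Visual Encoder, MLLM) mainly points to multimodal large models and is a poor match for this paper's topic. Dynamics modeling has a potential connection to World Models and model-based RL, but the paper involves neither generative modeling nor reinforcement learning, so the relevance scores are low.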
关键词
Koopman Operator, Nonlinear Dynamics, Nonstationary Data Streams, Streaming Algorithm, Probabilistic Framework, RKHS Features, Pattern Switching, Efficient Modeling
摘要翻译
深度主动学习此前已被探索用于大语言模型(LLM)的上下文样本选择,但尚未采用利用 Transformer(变换器)激活理解最新进展的方法。本文测试了这样一个假设:模型激活可提供细粒度信号,以优化上下文示例的选择。本文对应用于上下文学习的基于 MLP(多层感知机)激活的深度主动学习方法进行了迄今为止最全面的分析,包括不同注意力掩码策略如何影响多样分类和生成数据集上的主动学习,实验使用了 Llama-3.2-3B 和 Qwen2.5-3B 基础模型。然而,我们发现了一个负面结果:通过大规模激活或前四阶矩视角观察的 MLP 输出,与示例质量或任务性能无相关性。具体而言,在我们测试的所有任务和模型中,斯皮尔曼相关系数的绝对值最高仅为 0.33,表明此类基于激活的采样不应被用于上下文学习。我们假设这可能是由于叠加现象所致,即模型表示的特征数量超过其维度,这表明稀疏自编码器(SAEs)等方法可能是一个有前途的未来方向。
Abstract
Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selection of in-context examples. We present the most comprehensive analysis to date of MLP activation-based deep active learning methods applied to in-context learning, including how different attention masking strategies impact active learning across diverse classification and generative datasets, using both Llama-3.2-3B and Qwen2.5-3B base models. However, we find a negative result: MLP outputs, viewed through the lenses of massive activations or the first four moments, do not correlate with example quality or task performance. Specifically, the absolute Spearman correlation coefficient is at most 0.33 for all tasks and models we tested, showing that such activation-based sampling should not be used for in-context learning. We hypothesise that this may be due to superposition, whereby models represent more features than they have dimensionality, suggesting that methods like Sparse Autoencoders (SAEs) may be a promising future direction.
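The headline statistic in this negative result is the Spearman rank correlation between an activation-derived score and example quality. As a minimal sketch of how such a coefficient is computed, the code below ranks both variables (averaging ranks within ties) and takes the Pearson correlation of the ranks. The variable names `activation_stat` and `example_quality` and their values are made up for illustration and are not data from the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-example values, for illustration only.
activation_stat = [0.1, 0.4, 0.2, 0.9, 0.5]
example_quality = [0.30, 0.35, 0.50, 0.45, 0.40]
rho = spearman(activation_stat, example_quality)
print(round(rho, 3))
```

A coefficient near ±1 would mean the activation statistic orders examples almost the same way (or the opposite way) as their measured quality. The abstract reports absolute values of at most 0.33 across the tasks and models tested.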
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on activation-based active learning for in-context learning in LLMs. It shows low relevance to World Models, Visual Encoders, and Model-Based RL as these concepts are not discussed. While it uses LLMs (MLLM/MultiModal context), it does not address multimodal architecture or world modeling specifically. Tokenizer and Unify Models have minimal relevance. No expert authors from the specified list were found.
关键词
Activation-Based Active Learning, In-Context Learning, Transformer Activations, MLP Activation, Sample Selection, Sparse Autoencoders, Llama-3.2, Qwen2.5
摘要翻译
音频深伪检测(ADD)模型在对抗文本到语音(TTS)模型的恶意使用方面至关重要。评估和加强 ADD 模型需要开发能够覆盖生成音频空间并突出高错误区域的数据集。现有的数据集开发策略面临两个挑战:(i)人工收集,以及(ii)低效地发现 ADD 模型中的盲点。为了解决这些挑战,我们提出了 FoeGlass,这是首个用于音频深伪检测(ADD)的黑盒自动化红队方法,它能够有效地在生成音频空间中发现 ADD 的失效模式,而这些空间未被最先进的深伪基准测试充分探索。FoeGlass 利用大语言模型(LLM)的上下文学习能力来探索文本到语音(TTS)模型的输入空间,生成能够欺骗目标音频深伪检测(ADD)模型的音频样本,且仅需对所有组件进行黑盒访问。通过基于多样性度量精心设计的上下文,FoeGlass 缓解了自动化红队系统中常见的模式坍塌问题。在多个开源音频深伪检测(ADD)和文本到语音(TTS)模型上的实证评估表明,基于 FoeGlass 生成的数据相较于无条件采样基线和近期的欺骗数据集,显著降低了假阴性率(最高达 94%),且无需人工监督。此外,我们还表明,FoeGlass 生成的攻击在不同目标音频深伪检测(ADD)模型之间具有可迁移性,这展示了其在 ADD 系统自动化红队中的广泛适用性和易用性。最后,在 FoeGlass 生成的样本上对音频深伪检测(ADD)模型进行微调,显著增强了检测器的鲁棒性(提升高达 41%)。
Abstract
Audio deepfake detection (ADD) models are critical for countering the malicious use of text-to-speech (TTS) models. Evaluating and strengthening ADD models requires developing datasets that span the space of generated audio and highlight high-error regions. Existing dataset development strategies face two challenges: (i) manual collection, and (ii) inefficient discovery of blind spots in the ADD models. To address these challenges, we propose FoeGlass, the first black-box automated red-teaming method for ADDs, which effectively discovers ADD failure modes in the space of generated audio underexplored by state-of-the-art deepfake benchmarks. FoeGlass uses the in-context learning capabilities of an LLM to explore the input space of a TTS model, generating audio samples that fool the target ADD using only black-box access to all components. By using a carefully designed context based on diversity measurements, FoeGlass mitigates the common problem of mode collapse in automated red-teaming systems. Empirical evaluations on several open-source ADD and TTS models demonstrate that data generated from FoeGlass substantially improves the false negative rates over unconditional sampling baselines and recent spoofing datasets by up to 94%, while requiring no manual supervision. Furthermore, we show that the attacks generated by FoeGlass are transferable across different target ADDs, demonstrating its broad applicability and ease of use for the automated red teaming of ADD systems. Finally, fine-tuning ADD models on FoeGlass-generated samples notably enhances the robustness of the detectors (up to 41%).
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on audio deepfake red-teaming using LLM in-context learning, lacking content on visual encoders, world models, model-based RL, or unified architectures. Relevance to specific keywords is low. No expert authors from the specified list are found.
关键词
Audio Deepfake Detection, Red Teaming, In-Context Learning, Large Language Model, Text-to-Speech, Adversarial Robustness, Black-box Attack
摘要翻译
随着 ChatGPT 等公共大型语言模型(LLMs)的广泛部署,保护用户提示(prompt)隐私已成为一个日益关键的问题。现有的隐私保护推理方法往往在效用(utility)或效率(efficiency)之间做出妥协,并且通常需要针对特定模型的修改,这限制了它们的兼容性。本文提出 SharedRequest,这是一个与模型无关(model-agnostic)的隐私保护 LLM 推理框架,它将隐私保护重新定义为在批次(batch)级别而非单个提示(individual-prompt)级别上进行。其核心思想是通过将原始提示与噪声变体混合来模糊敏感信息,同时分组语义等效的指令,以在大量查询的批次上分摊推理成本,且对 LLM 响应质量的影响最小。该设计独立于 LLM 架构,无需访问模型参数或进行架构修改。实验结果表明,SharedRequest 相较于先前的差分隐私(differential privacy)基线方法,实现了超过 20% 的效用提升,且其共享提示机制将查询成本降低了最多 5 倍,相较于非批次推理。
Abstract
With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increasingly critical issue. Existing privacy-preserving inference methods sacrifice either utility or efficiency, and often require model-specific modifications that limit their compatibility. In this paper, we propose SharedRequest, a model-agnostic framework for privacy-preserving LLM inference that reformulates privacy protection at the batch level rather than the individual-prompt level. The key idea is to obscure sensitive information by mixing original prompts with noisy variants, while grouping semantically equivalent instructions to amortize the inference cost over a large batch of queries with minimal impact on LLM response quality. This design is independent of the LLM architecture, requiring no access to model parameters or architectural modification. Empirical results demonstrate that SharedRequest achieves over $20\%$ higher utility compared to prior differential privacy baselines, and its shared-prompt mechanism reduces query cost by up to $5\times$ compared to non-batched inference.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
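评分理由: The paper centers on a privacy-preserving inference framework for large language models (SharedRequest); its main contribution is batch-level prompt obfuscation and semantic grouping to protect privacy and reduce cost. The keyword set mainly covers multimodal models, world models, and reinforcement learning, so it diverges substantially from this paper's text-focused privacy optimization topic. 'Unify Models' receives 2 points for the model-agnostic design, 'Tokenizer' receives 2 points as a basic LLM component, and 'MLLM' receives 2 points because the work falls within the scope of large models; the remaining keywords (Visual Encoder, World Models, MultiModal, model-based RL) are entirely unrelated to the paper and receive 0 points. The author list does not include any of the designated experts such as Yang Shi, so no bonus is added. The weighted total is 9.0, below the dynamic passing score of 27.8, indicating a weak keyword match.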
关键词
Privacy-Preserving Inference, Model-Agnostic Framework, Large Language Models, Batch-level Prompt Mixing, Semantic Grouping, Inference Cost Reduction, Prompt Obfuscation
摘要翻译
控制实验是机器学习研究的基石,但在现代基础模型(Foundation Models)的规模下,其代价已变得难以承受。取而代之的是,研究社区越来越依赖那些以极低成本近似理想实验的研究策略:代理实验(Proxy Experiments)和缩放定律(Scaling Laws)、基于公开模型的观察性研究(Observational Studies),以及利用单次训练运行内变异性的单次运行设计(Single-Run Designs)。本文认为,在计算预算(Compute Budget)约束下近似大规模实验时,天下没有免费的午餐。具体而言,计算资源的节省是以有效性威胁(Validity Threats)为代价的——这些是隐藏的、有时甚至不可验证的假设,一旦被违背,便可能推翻研究主张。为应对此类威胁,我们提出一个评估框架,将基础模型研究视为一个因果推断(Causal Inference)问题。在此框架内,我们借鉴经验社会科学的四种有效性类型——统计有效性(Statistical Validity)、内部有效性(Internal Validity)、外部有效性(External Validity)和构念有效性(Construct Validity)——来评估不同的研究策略。我们发现每种策略都带有其独特的有效性特征:代理实验以牺牲外部有效性和构念有效性为代价,换取统计有效性和内部有效性;观察性研究面临混杂(Confounding)效应和效应异质性(Effect Heterogeneity);而单次运行设计则受到处理单元(Treated Units)间干扰(Interference)的困扰。该分析揭示了文献中未受到足够关注的若干有效性威胁。总体而言,我们的评估框架为研究人员提供了一个实用工具包,用于审视基础模型研究设计中的有效性威胁。
Abstract
Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats: hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences: statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on research methodology and validity evaluation in foundation model research (causal inference, proxy experiments), rather than technical architectures or specific reinforcement learning strategies. While it mentions foundation models broadly, it does not discuss tokenizers, visual encoders, world models, or model-based RL specifically, resulting in low relevance scores for technical keywords. Total Weighted Score: 9.0, which is below the dynamic passing threshold of 27.8.
关键词
Foundation Model Research, Validity Threats, Causal Inference, Proxy Experiments, Scaling Laws, Statistical Validity, Internal Validity, External Validity
摘要翻译
大语言模型正越来越多地由其他模型进行评估,这引发了一个自然的问题:一个模型能否预测评判者将如何对其自身输出进行评分?我们发现,这种能力在很大程度上存在于任何针对性训练之前:通过提示式少样本(few-shot),基础模型已在三个基准测试上,显著高于随机水平地预测了外部评判者对开放式回答的多属性质量分数。我们引入了自我评估诱导(Self-Evaluation Elicitation, SEE),该方法通过一个短周期揭示这种潜在能力:首先是一个校准耦合强化学习阶段,用于改进答案并预测评判者;随后是一个掩码蒸馏阶段,该阶段细化预测同时保持答案不变。仅使用 160 个唯一示例(约为强化学习基线的 1/31),SEE 在三个基准测试上提高了保留集校准效果,同时保持了答案质量。诱导出的自我评估紧密局限于模型自身的词元分布中,且对从未训练过的评判者保持稳定,这表明这是一种可迁移的质量概念,而非单个评判者的偏好。这些结果将与评判者对齐的自我评估问题重新定义为诱导问题,而非习得问题。
Abstract
Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
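评分理由: The paper focuses on self-evaluation and judge calibration for text-only large language models. It involves no multimodal fusion (no MultiModal, MLLM, or Visual Encoder) and builds no world model of an environment (World Models). It does include a reinforcement learning phase, but this is not model-based RL in the usual sense of learning an environment model; Tokenizer is touched only through a mention of the token distribution rather than tokenizer design; and Unify Models does not apply. Relevance to the given keyword set is therefore low.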
关键词
Self-Evaluation, Judge Calibration, Base LLMs, Reinforcement Learning, Masked Distillation, Minimal Data, Elicitation
摘要翻译
训练数据归因(TDA)旨在追溯模型预测背后的训练数据。TDA 的黄金标准依赖于因果干预,即观察数据添加或移除时模型的变化,但对于大型语言模型(LLMs)而言,反复重训练在计算上极具挑战性。因此,大多数方法在参数空间中利用梯度来近似这种效应。然而,追踪数十亿参数上的梯度不仅成本高昂,而且依赖于局部近似。在本文中,我们提出了一种范式转变:不估计参数变化,而是在激活空间中建模训练数据的功能效应。我们提出了 STRIDE(基于引导的训练数据影响分解),这是一个将 TDA 表述为稀疏恢复问题的框架,其思路源于压缩感知。STRIDE 学习轻量级的“引导算子”,以模仿在子集数据上训练所引发的行为偏移。通过测量这些算子如何扰动测试预测,我们通过稀疏线性分解恢复单个训练样本的影响。STRIDE 在 LLM 预训练归因任务上达到了最先进水平,同时比先前方法快一个数量级(13 倍)。我们进一步通过下游应用验证其实用性,包括数据选择、数据污染及定性分析。
Abstract
Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLMs). Consequently, most approaches approximate this effect in the parameter space using gradients. However, tracking gradients across billions of parameters is not only prohibitively expensive but relies on local approximations. In this work, we propose a shift: rather than estimating parameter changes, we model the functional effect of training data in the activation space. We introduce STRIDE (Steering-based Training Data Influence Decomposition), a framework that formulates TDA as a sparse recovery problem in the spirit of compressive sensing. STRIDE learns lightweight "steering operators" that mimic the behavioral shift caused by training on data subsets. By measuring how these operators perturb test predictions, we recover individual training example influences via sparse linear decomposition. STRIDE achieves state-of-the-art for LLM pre-training attribution while being an order of magnitude ($13\times$) faster than previous art. We further validate its practical utility through downstream applications including data selection, data contamination, and qualitative analysis.
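The sparse-recovery framing can be illustrated with a toy problem. The sketch below is not STRIDE itself: it assumes that the measured behavioral shift of a subset is simply the sum of the influences of its members, and it swaps the paper's learned steering operators for synthetic noise-free measurements. The function name `recover_influences`, the ISTA (Lasso) solver, and all sizes are illustrative assumptions.

```python
import numpy as np

def recover_influences(masks, shifts, lam=0.05, n_iter=5000):
    """Estimate sparse per-example influences from subset measurements via ISTA.

    masks  : (m, n) 0/1 matrix, masks[j, i] = 1 if example i is in subset j
    shifts : (m,) measured shift per subset (assumed additive in this toy)
    """
    # Centre both sides so the offset shared by all subsets drops out.
    A = masks - masks.mean(axis=0, keepdims=True)
    y = shifts - shifts.mean()
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * (A.T @ (A @ x - y))                        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x

rng = np.random.default_rng(0)
n_examples, n_subsets = 50, 40
true_x = np.zeros(n_examples)
true_x[[3, 17, 42]] = [3.0, -2.0, 2.5]      # only three influential examples
masks = rng.integers(0, 2, size=(n_subsets, n_examples)).astype(float)
shifts = masks @ true_x                      # noise-free toy measurements
estimate = recover_influences(masks, shifts)
top3 = sorted(np.argsort(-np.abs(estimate))[:3].tolist())
```

The point of the toy is that far fewer subset measurements than training examples can suffice when only a handful of examples matter, which is the compressive-sensing intuition the abstract appeals to.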
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
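Scoring rationale: The paper's core is training data attribution (TDA) and sparse recovery, operating mainly in the activation space of large language models (LLMs). The provided keywords concern unified model architectures, tokenizer design, visual encoders, world model construction, multimodal LLMs (MLLM), multimodal fusion, and model-based reinforcement learning. These are largely unrelated to the paper's topic (data attribution for text models), so the relevance scores are very low.
Keywords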
Training Data Attribution, Sparse Recovery, Activation Space, Steering Operators, Large Language Models, Subset Perturbations, Compressive Sensing
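Abstract Translation
Machine learning engineering (MLE) agents promise to automate the development of end-to-end ML pipelines from raw data and natural-language instructions, making machine learning usable by domain experts without a technical background. In sensitive and regulated domains, however, this abstraction creates a responsibility gap: end users may have no insight into design choices that affect correctness, robustness, fairness, and regulatory compliance. We argue that existing benchmarks are insufficient for assessing whether MLE agents can be applied safely in such settings. We set out desiderata for a responsibility-centered evaluation framework and carry out an exploratory study on melanoma classification, with fairness across skin tones as the responsibility constraint under examination. Evaluating two recent MLE agents, we find that agent-generated pipelines show high variance and consistently fall below manually designed baselines in both predictive quality and fairness, despite fairness-oriented prompts. These preliminary results indicate that further research is needed on redesigning MLE agents so that humans can guide the search process and reliably assess the compliance and quality of the generated ML pipelines.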
Abstract
Machine learning engineering (MLE) agents promise to automate end-to-end ML pipeline development from raw data and natural language instructions, potentially making ML accessible to non-technical domain experts. However, in sensitive and regulated domains, this abstraction creates a responsibility gap: end-users may lack visibility into design choices that affect correctness, robustness, fairness, and regulatory compliance. We argue that existing benchmarks are insufficient to assess whether MLE agents can be safely applied in such settings. We propose desiderata for a responsibility-centered evaluation framework and conduct an exploratory study on melanoma classification, focusing on fairness across skin tones as a responsibility constraint. When evaluating two recent MLE agents, we find that agent-generated pipelines show high variance and consistently underperform manually designed baselines in both predictive quality and fairness, despite fairness-oriented prompts. These preliminary results suggest that further research is needed towards redesigning MLE agents to allow humans to guide the search process and reliably assess the compliance and quality of the generated ML pipelines.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
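Scoring rationale: The paper's core is fairness constraints and responsibility evaluation for machine learning engineering (MLE) agents. It does not involve core techniques such as world models, visual encoders, tokenizers, or model-based reinforcement learning. The keywords mainly point to multimodal LLM architectures and reinforcement learning, which differ significantly from the paper's topic (evaluation of automated ML pipelines). Although the task involves image data (melanoma), the paper does not discuss visual encoder architectures or the unification of multimodal models, so the relevance scores are low.
Keywords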
Machine Learning Engineering Agents, Fairness Constraints, Responsibility Gap, Melanoma Classification, Automated ML Pipelines, Evaluation Framework, Predictive Quality
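Abstract Translation
Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs, generating text through iterative masked denoising with bidirectional context. Their large model size and iterative denoising process, however, introduce significant memory and compute overhead, which motivates post-training quantization (PTQ) for efficient deployment. This paper identifies two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Within each denoising step, masked and unmasked tokens exhibit different activation distributions, and during iterative decoding, quantization errors can accumulate across steps. To address these challenges, the paper proposes STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT), which maps masked and unmasked tokens to different activation transformation spaces through a unified static weight-side transformation. It also introduces Temporal Attention Compensation (TAC), which corrects the quantized attention representation with a lightweight block-diagonal affine mapping. Experiments on representative DLLMs show that STaR-Quant consistently outperforms strong PTQ baselines in low-bit weight-activation quantization, while achieving up to 1.69x speedup and 3.14x memory savings compared with FP16 deployment.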
Abstract
Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69x speedup and 3.14x memory saving over FP16 deployment.
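To see why state-dependent activation disparity hurts low-bit quantization, consider the toy below. It is not SGAT: it only contrasts one shared quantization scale with a scale fitted to the masked tokens alone, using synthetic activations whose ranges are assumed for illustration. The helper names `fake_quantize` and `max_abs_scale` are hypothetical.

```python
import numpy as np

def max_abs_scale(x, n_bits=4):
    """Scale that maps the largest magnitude in x to the top integer level."""
    return np.abs(x).max() / (2 ** (n_bits - 1) - 1)

def fake_quantize(x, scale, n_bits=4):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (n_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
masked = rng.normal(0.0, 0.1, size=256)    # toy: small-range activations
unmasked = rng.normal(0.0, 5.0, size=256)  # toy: large-range activations

# One scale for both token states: the large-range state dictates the grid.
shared_scale = max_abs_scale(np.concatenate([masked, unmasked]))
err_shared = np.mean((masked - fake_quantize(masked, shared_scale)) ** 2)

# A scale fitted to the masked tokens only.
err_own = np.mean((masked - fake_quantize(masked, max_abs_scale(masked))) ** 2)
```

With a shared 4-bit grid, the small-range tokens collapse toward zero, so their reconstruction error is much larger than with a state-specific scale. The paper addresses this through transformations rather than separate scales, which this sketch does not reproduce.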
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
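Scoring rationale: The paper's core content is post-training quantization (PTQ) for diffusion large language models (DLLMs), aimed at reducing memory and computational overhead. Relevance to the provided keywords is generally low: 'Tokenizer' is slightly relevant because the method deals with token activation states (2 points); 'Unify Models' and 'World Models' are weakly related through generative model architecture (1 point each); 'Visual Encoder', 'MultiModal', and 'model-based RL' are entirely unrelated (0 points); 'MLLM' receives 1 point because the paper focuses on text rather than multimodality. The author list does not include the designated experts Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, or Manyuan Zhang, so no bonus is applied. The weighted total is 7.5, far below the dynamic passing score of 27.8.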
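The weighted total quoted in the rationale can be reproduced from the table above. A minimal sketch, assuming the total is the plain sum of weight × relevance over all keywords and that 27.8 is the passing threshold named in the rationale; the variable names are illustrative:

```python
# (weight, relevance) pairs copied from the score table for this paper
scores = {
    "Unify Models":   (1.5, 1.0),
    "Tokenizer":      (1.5, 2.0),
    "Visual Encoder": (1.5, 0.0),
    "World Models":   (1.5, 1.0),
    "MLLM":           (1.5, 1.0),
    "MultiModal":     (1.5, 0.0),
    "model-based RL": (1.5, 0.0),
}
passing_score = 27.8  # threshold quoted in the rationale

weighted_total = sum(weight * relevance for weight, relevance in scores.values())
passes = weighted_total >= passing_score
```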
Keywords
Diffusion Large Language Models, Post-Training Quantization, State-Time Consistent, Activation Transformation, Temporal Attention Compensation, Low-bit Quantization, Efficient Deployment
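Abstract Translation
Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is inherently forward-directed: each step depends only on earlier tokens. This unidirectional inductive bias leaves even capable models vulnerable to error snowballing, in which a single logical or arithmetic mistake in an early step irreversibly corrupts the whole reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that gives decoder-only transformers a native goal-conditioned bridging capability. The key insight is to recast erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesize a logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this under a standard causal architecture, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, which lets $M$ attend to both $P$ and $S$ without any structural change to the self-attention mechanism. Training has two stages: (i) supervised fine-tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) direct preference optimization (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, which eliminates LLM-judge sycophancy. At inference time, TRI runs as a surgical repair module inside a dual-system loop: a causal draft model generates an initial reasoning trace, the verifier locates failures, and TRI infills only the damaged segment while leaving verified sections intact. Comprehensive experiments on three benchmarks show that TRI achieves state-of-the-art performance on all tasks while reducing per-problem token consumption by 31.2%.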
Abstract
Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native goal-conditioned bridging capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise $P$, a verified downstream milestone $S$, and the original query $Q$, the model must synthesise the logical bridge $M$ that connects $P$ to $S$ rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling $M$ to attend to both $P$ and $S$ without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified $(P, S, M)$ triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
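The PSM rearrangement can be sketched as plain string formatting. The sentinel strings below and the choice to place the query $Q$ ahead of the prefix are assumptions made for illustration; the abstract says only that three non-overlapping sentinel tokens are used.

```python
# Hypothetical sentinel strings; the abstract does not name the actual tokens.
PRE, SUF, MID = "<|prefix|>", "<|suffix|>", "<|middle|>"

def build_psm_sequence(query, prefix, suffix, middle=None):
    """Arrange (Q, P, S[, M]) so that M comes last in the token stream.

    Under a causal mask, everything generated after MID can attend to both
    the prefix and the suffix, because both already appear to its left.
    """
    context = f"{PRE}{query}\n{prefix}{SUF}{suffix}{MID}"
    # Training example when `middle` is given, inference prompt otherwise.
    return context if middle is None else context + middle

prompt = build_psm_sequence("Solve x + 2 = 5.", "Subtract 2 from both sides.", "So x = 3.")
example = build_psm_sequence("Solve x + 2 = 5.", "Subtract 2 from both sides.",
                             "So x = 3.", middle="x = 5 - 2.")
```

Placing the bridge last is what lets an unmodified decoder-only model condition on the downstream milestone.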
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
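Scoring rationale: The paper focuses on repairing LLM reasoning chains through bidirectional infilling (TRI) and is unrelated to multimodal learning, visual encoders, or world models. Although it uses DPO (which is related to RL), it is not model-based RL. Its tokenizer usage is standard, and it does not address model unification. The scores reflect low relevance to the provided keyword set.
Keywords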
Large Language Models, Chain-of-Thought, Fill-in-the-Middle, Bidirectional Logic, Reasoning Repair, Symbolic Verification, Prefix-Suffix-Middle
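Abstract Translation
Self-attention selects information freely across the sequence, but along the depth dimension a Transformer only adds each layer's output to the residual stream, so later layers cannot selectively reuse representations from earlier layers. Recent cross-layer methods improve this flow, but they operate on hidden states outside the attention mechanism and introduce inference-time state beyond the key-value cache, a cost that grows more significant as modern LLMs compress the cache with grouped-query and multi-head latent attention. We propose Depth-Attention, which performs this selection inside the attention module: before a layer attends over the sequence, its query attends to the keys of earlier layers at the same token position and mixes their values into the value that self-attention subsequently reads. Because Depth-Attention reuses the standard attention query, key, and value-cache slots and stores the depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache. Its cache size equals that of a vanilla decoder and is smaller than that of hidden-state-based cross-layer methods. On Qwen3-style decoders with 1.5B and 3B parameters, Depth-Attention achieves the lowest perplexity and the highest average downstream accuracy, improving on the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in both perplexity and average accuracy, while adding less than 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and carry over to looped Transformers.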
Abstract
Self-attention selects information freely across the sequence, but across depth, Transformers merely add each layer's output to the residual stream, so later layers cannot selectively reuse earlier-layer representations. Recent cross-layer methods improve this flow but operate on hidden states outside attention, adding state beyond the key-value cache at inference--a cost that becomes increasingly salient as modern LLMs compress the cache with grouped-query and multi-head latent attention. We introduce Depth-Attention, which performs this selection inside the attention module itself: before a layer attends over the sequence, its query attends over the keys of earlier layers at the same token position and mixes their values into the value that self-attention then reads. Because Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing depth-mixed values in place of the original values, it adds no parameters and introduces no persistent inference state beyond the standard key-value cache--the same cache size as a vanilla decoder and less than hidden-state-based cross-layer methods. On Qwen3-style decoders at 1.5B and 3B parameters, Depth-Attention attains the lowest perplexity and the highest average downstream accuracy, improving over the vanilla Transformer by up to 2.3 accuracy points and surpassing strong cross-layer baselines in perplexity and average accuracy, while adding under 0.01% extra arithmetic FLOPs and no additional persistent inference state. The gains hold from 360M to 3B parameters and extend to looped Transformers.
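The depth-wise selection step can be sketched for a single token position and a single head. This is a simplified reading of the abstract, not the paper's implementation: the scaled dot-product form, the absence of extra projections, and the function name `depth_mix_value` are assumptions.

```python
import numpy as np

def depth_mix_value(query, layer_keys, layer_values):
    """Mix values from earlier layers at one token position.

    query        : (d,)   query of the current layer at this position
    layer_keys   : (L, d) keys of L earlier layers at the same position
    layer_values : (L, d) values of those layers at the same position
    Returns the depth-mixed value (d,) and the attention weights (L,).
    """
    d = query.shape[-1]
    scores = layer_keys @ query / np.sqrt(d)   # one score per earlier layer
    scores = scores - scores.max()             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ layer_values, weights

query = np.array([4.0, 0.0, 0.0, 0.0])
layer_keys = np.eye(3, 4)                      # layer 0 matches the query best
layer_values = np.arange(12, dtype=float).reshape(3, 4)
mixed, weights = depth_mix_value(query, layer_keys, layer_values)
```

In the scheme the abstract describes, the mixed value would then be written into the ordinary value-cache slot, which is why no extra inference state is needed.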
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 3.0/10 | 4.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
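Scoring rationale: The paper's core is a language-model architecture optimization (Depth-Attention) involving cross-layer attention and value mixing, aimed at improving perplexity and accuracy. The provided keyword set emphasizes multimodality (MLLM, MultiModal, Visual Encoder), world models, and reinforcement learning, which have little connection to this text-only LLM topic. Tokenizers and visual encoders do not appear in the abstract. None of the designated experts were found in the author list. The weighted total is far below the dynamic passing score.
Keywords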
Depth-Attention, Cross-Layer Value Mixing, Language Models, Self-attention, Key-Value Cache, Perplexity, Downstream Accuracy, Transformer Architecture
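Abstract Translation
Diffusion large language models (DLLMs) achieve non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Although they can update multiple positions in parallel, inference remains expensive because high-quality generation requires many denoising steps. We propose SAID, a scaffold-aware iterative decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first concentrates denoising computation on scaffold tokens to establish a coarse semantic structure, and then completes the predictable detail tokens in fewer steps. We further adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional steps only to low-confidence tokens. Experiments on LLaDA-8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference, with a maximum speedup of 9.1x, while maintaining competitive performance. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SAID.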
Abstract
Diffusion large language models (DLLMs) enable non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high-quality generation. We propose SAID, a Scaffold-Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fewer steps. We further adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional steps only to low-confidence tokens. Experiments on LLaDA-8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference with a maximum speedup of 9.1x while maintaining competitive performance. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SAID.
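The idea of spending extra denoising steps only on low-confidence tokens can be sketched as a simple budget rule. The threshold, step counts, and function name below are hypothetical; the abstract does not specify how CHLG sets them.

```python
def allocate_denoising_steps(confidences, base_steps=1, extra_steps=3, threshold=0.9):
    """Give every token `base_steps`, plus `extra_steps` if its confidence is low.

    All three parameters are illustrative assumptions, not values from the paper.
    """
    return [base_steps + (extra_steps if c < threshold else 0) for c in confidences]

token_confidences = [0.99, 0.50, 0.95, 0.20]
steps = allocate_denoising_steps(token_confidences)
total_steps = sum(steps)
uniform_steps = len(token_confidences) * (1 + 3)  # every token refined fully
```

In this toy, two of the four tokens skip the extra refinement, so the total step count falls below that of a uniform schedule.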
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 3.0/10 | 4.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
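Scoring rationale: The paper focuses on inference acceleration for diffusion language models (DLLMs); its core contribution is a decoding-strategy optimization (the SAID framework). The provided keyword set mainly covers multimodality, world models, and reinforcement learning, and matches this text-generation acceleration topic very poorly. Visual Encoder, World Models, MLLM, MultiModal, and model-based RL are entirely unrelated (0 points). Tokenizer involves token processing but no in-depth design (3 points), and Unify Models refers only to unifying generation steps rather than unifying models (2 points).
Keywords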
Diffusion Language Models, Scaffold-Aware Iterative Decoding, Inference Acceleration, Denoising Process, Block-wise Decoding, Confidence-Hierarchical Generation, Text Generation
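Abstract Translation
Reinforcement learning with verifiable rewards (for example, GRPO) is now a common way to improve the mathematical reasoning of large language models (LLMs). However, current methods usually broadcast a single sequence-level advantage to all tokens, or rely on expensive process reward models (PRMs) for step-level supervision. A uniform advantage distribution assumes that every token contributes equally to the final reward. This dilutes the gradient signal, because flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-level advantage reweighting method. GRAIL uses gradient-activation saliency to give higher weight to tokens that are more locally sensitive to the final answer. Evaluations on five models from the Qwen3, R1-distilled, and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL improves accuracy by 3.60% and Pass@3 by 3.05% on average, indicating that fine-grained reasoning alignment can be achieved without process-level supervision.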
Abstract
Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
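The contrast between broadcasting one advantage and reweighting it per token can be sketched as follows. The saliency values are made up, and normalizing the weights to mean 1 is an assumption chosen so the sequence-level signal is preserved on average; the abstract does not give GRAIL's exact normalization.

```python
import numpy as np

def reweight_advantage(sequence_advantage, saliency, eps=1e-8):
    """Spread one sequence-level advantage over tokens in proportion to saliency.

    Weights are normalized to mean 1 (an assumption), so uniform saliency
    reduces to the usual broadcast of the same advantage to every token.
    """
    saliency = np.asarray(saliency, dtype=float)
    weights = saliency / (saliency.mean() + eps)
    return sequence_advantage * weights

broadcast = reweight_advantage(2.0, [1.0, 1.0, 1.0])   # uniform saliency
focused = reweight_advantage(2.0, [0.1, 0.1, 2.8])     # last token is salient
```

Here the salient token receives most of the update while the filler tokens receive little, yet the mean per-token advantage stays at the sequence-level value.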
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
Scoring rationale: The paper focuses on token-wise advantage reweighting (GRAIL) for reinforcement learning with verifiable rewards in Large Language Models (LLMs). It does not address model unification, tokenizers, visual encoders, world models, or multimodal processing, resulting in low scores for these keywords. 'MLLM' and 'model-based RL' receive minimal credit (2.0) as the paper involves LLMs and RL respectively, though it specifically uses model-free GRPO variants rather than model-based dynamics or explicit multimodality. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.
Keywords
Gradient-Reweighted Advantages, Reinforcement Learning, Verifiable Rewards, Token-wise Advantage, Large Language Models, Reasoning Alignment, Gradient-activation Saliency
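Abstract Translation
Multi-agent reasoning systems follow a "generate-then-transfer" paradigm, which forces end-to-end latency to grow linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thereby reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, using these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of the stream, serial, and single protocols, deriving the effectiveness ordering, a speedup upper bound, and the cost ratio. Across eight reasoning benchmarks covering mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (+7.3 percentage points on average, and up to +22.4 percentage points on HMMT 2026; based on Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing the number of steps per agent consistently improves both effectiveness and efficiency, a new scaling dimension that is orthogonal to, and composable with, agent-count scaling.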
Abstract
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
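The latency argument can be made concrete with an idealized chain of agents. The formulas below are the textbook pipelining model, assumed here for illustration: every step takes the same time, transfer is free, and a downstream agent can start its k-th step as soon as the upstream agent has emitted its k-th step. They are not the closed-form analysis derived in the paper.

```python
def serial_latency(n_agents, steps_per_agent, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return n_agents * steps_per_agent * step_time

def streaming_latency(n_agents, steps_per_agent, step_time=1.0):
    """Idealized pipelining: agents overlap, offset by one step each."""
    return (steps_per_agent + n_agents - 1) * step_time

serial = serial_latency(4, 5)        # 4 agents, 5 steps each
stream = streaming_latency(4, 5)
speedup = serial / stream
```

Under these assumptions, the serial latency grows with the product of depth and steps, while the streamed latency grows with their sum.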
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 2.0/10 | 3.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on multi-agent reasoning communication protocols (StreamMA) rather than multimodal architecture, tokenization strategies, visual encoders, world models, or reinforcement learning. While it uses frontier LLMs (potentially MLLM), the core contribution is pipeline streaming, not model unification or representation learning, resulting in low relevance to the provided keyword theme.
Keywords
Multi-Agent Reasoning, Streaming Communication, Latency Reduction, Pipelining Agents, Scaling Laws, StreamMA, LLM-based Reasoning
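Abstract Translation
Following the success of 3D Gaussian Splatting (3DGS) in novel view synthesis, many works have also explored using it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and often degrades appearance rendering quality. By training with complete ground-truth texture and geometry information, this paper shows that 3DGS in its default form is inherently unsuited to representing texture and geometry at the same time. We also propose a simple solution: applying one additional geometry opacity parameter to each splat, together with an optional transparency-guided optimization pipeline. Our experiments, with both ground-truth geometric input and geometric input from a vision foundation model, show that this change improves rendering and geometry performance across a variety of datasets, and that complex scenes containing transparent objects benefit especially from our method.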
Abstract
After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering quality. In this work, we show that 3DGS in its default form is inherently unsuited to represent texture and geometry at the same time, by training with complete ground-truth texture and geometry information. We also propose a simple solution by applying a single additional geometry opacity parameter to each splat, together with an optional transparency-curated optimization pipeline. Our experiments, both with ground-truth and vision foundation model geometric input, show that this change leads to improved rendering and geometry performance on a wide variety of datasets, and especially complex scenes with transparent objects benefit significantly from our method.
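Why a single opacity struggles with transparent objects can be seen on one ray. The sketch below is not the paper's renderer: it assumes standard front-to-back alpha compositing and uses made-up values for a glass pane in front of a wall, with a second, geometry-only opacity per splat.

```python
def composite(values, opacities):
    """Front-to-back alpha compositing of per-splat values along one ray."""
    result, transmittance = 0.0, 1.0
    for value, alpha in zip(values, opacities):
        result += transmittance * alpha * value
        transmittance *= 1.0 - alpha
    return result

# Two splats on the ray: a nearly transparent glass pane, then an opaque wall.
colors = [0.9, 0.2]
depths = [1.0, 5.0]
appearance_opacity = [0.1, 1.0]   # glass barely contributes to colour
geometry_opacity = [1.0, 1.0]     # but it is a solid surface geometrically

color = composite(colors, appearance_opacity)
depth_shared = composite(depths, appearance_opacity)   # one opacity for both
depth_decoupled = composite(depths, geometry_opacity)  # separate geometry opacity
```

With a shared opacity the rendered depth lands near the wall, even though the first surface is the glass; a separate geometry opacity places the depth on the glass without changing the rendered colour.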
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on 3D Gaussian Splatting for geometry and appearance decoupling in novel view synthesis. It does not involve Tokenizers, MLLMs, World Models, or Reinforcement Learning. While it handles geometry and appearance (loosely MultiModal), it lacks alignment with the Multimodal AI/LLM context of the other keywords. No expert authors from the specified list were found.
Keywords
3D Gaussian Splatting, Geometry Representation, Appearance Decoupling, Novel View Synthesis, Geometry Opacity, Transparent Objects, Rendering Performance
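Abstract Translation
Large language models (LLMs) are increasingly deployed across a wide range of applications, raising critical questions about governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs, extending the inverse formulation: how would the training data be affected if the model had seen the generated output during training? Our method perturbs the base model by applying bidirectional gradient optimization (gradient ascent and descent) to a generated text sample and measures the resulting change in loss on training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets and show that it outperforms prior work on influence metrics, thereby improving model interpretability, a basic requirement for accountable AI systems.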
Abstract
Large Language Models (LLMs) are increasingly deployed across diverse applications, raising critical questions for governance, accountability, and data provenance. Understanding which training data most influenced a model's output remains a fundamental open problem. We address this challenge through training data attribution (TDA) for auto-regressive LLMs by expanding upon the inverse formulation: How would training data be affected if the model had seen the generated output during training? Our method perturbs the base model using bidirectional gradient optimization (gradient ascent and descent) on a generated text sample and measures the resulting change in loss across training samples. Our framework supports attribution at arbitrary data granularity, enabling both factual and stylistic attribution. We evaluate our method against baselines on pretrained models with known datasets, and show that it outperforms previous work on influence metrics, thereby enhancing model interpretability, an essential requirement for accountable AI systems.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on training data attribution for text-based LLMs using bidirectional gradient optimization. It is unrelated to World Models, MLLM, MultiModal, Visual Encoder, and model-based RL as it focuses solely on text generation and attribution. Tokenizers are used in LLMs but not analyzed, hence low score (2.0). 'Unify Models' is not discussed, hence low score (2.0). Consequently, the paper has very low relevance to the provided keyword set, which is centered on multimodal world models and RL, resulting in a total weighted score significantly below the passing threshold.
Keywords
Data Attribution, Large Language Models, Bidirectional Gradient Optimization, Training Data Attribution, Model Interpretability, Accountability, Auto-regressive LLMs, Gradient Ascent and Descent
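Abstract Translation
Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain, class-balanced data, leaving practical scenarios with domain shift or severe class imbalance underexplored. We address these challenges with Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a projection based on the cumulative distribution function (CDF) and stabilizes multi-domain quantization with sensitivity-aware weight aggregation. We further extend EmaQ to EmaQ-LT for long-tailed quantization, introducing class-conditioned variance scaling and confidence-based logit adjustment to mitigate majority-class overconfidence. Theoretical analysis establishes convergence guarantees and provides a theoretical basis for the proposed sensitivity and scaling mechanisms. Experiments on standard, multi-domain (Office-31, Digits), and long-tailed (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) benchmarks show that EmaQ and EmaQ-LT achieve strong low-bit performance under domain shift and class imbalance.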
Abstract
Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We address these challenges with Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a CDF-based projection and uses sensitivity-aware weight aggregation to stabilize multi-domain quantization. We further extend EmaQ to EmaQ-LT for long-tailed quantization by introducing class-conditioned variance scaling and confidence-based logit adjustment to mitigate majority-class overconfidence. Theoretical analyses establish convergence guarantees and motivate the proposed sensitivity and scaling mechanisms. Experiments on standard, multi-domain (Office-31, Digits), and long-tailed (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) benchmarks show that EmaQ and EmaQ-LT achieve strong low-bit performance under domain shift and class imbalance.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
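Scoring rationale: The paper's core contribution is quantization of deep neural networks, specifically quantization methods for multi-domain and long-tailed distributions (EmaQ/EmaQ-LT). The provided keyword set centers on multimodal LLMs (MLLM), world models, and reinforcement learning (RL). The paper is largely unrelated to these keywords: it does not involve Tokenizer, World Models, MLLM, or model-based RL. It uses image data only in vision benchmarks (weakly related to Visual Encoder), and its distribution alignment technique involves domain alignment (weakly related to Unify Models). MultiModal refers to multiple modalities rather than multiple domains, so its relevance is very low. The weighted total is therefore far below the dynamic passing score.
Keywords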
Quantization, Multi-Domain, Long-Tailed, Feature Alignment, Scaling, Domain Shift, Class Imbalance
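Abstract Translation
Popular initiatives and referendums are central to Swiss democracy, yet validating handwritten signature lists remains a labor-intensive manual process. This paper explores the potential of automated document analysis methods, including optical character recognition (OCR) and AI-based handwriting analysis, to support this task. We propose a pipeline that combines template-based line segmentation, text recognition, and writer retrieval, and evaluate it on a dataset of 443 handwritten entries from 418 writers. The results show that OCR performs poorly on out-of-vocabulary handwriting, with a character error rate (CER) of 29.6% on first names. In contrast, writer retrieval is more robust, reaching a mean average precision (mAP) of 50.6%. Our experiments further show that off-the-shelf OCR systems are not reliable enough for transcribing handwritten signature data, particularly for short, out-of-vocabulary entries such as names or addresses. Writer retrieval methods, however, can effectively identify visually similar entries across signature lists, making them a suitable tool for detecting potential duplicate submissions based on handwriting similarity.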
Abstract
Popular initiatives and referendums are central to Swiss democracy, yet the validation of handwritten signature lists remains a labor-intensive manual process. This paper investigates the potential of automated document analysis methods, including OCR and AI-based handwriting analysis, to support this task. We propose a pipeline combining template-based line segmentation with text recognition and writer retrieval techniques, evaluated on a dataset of 443 handwritten entries from 418 writers. Results show that OCR struggles with out-of-vocabulary handwriting, with a CER of 29.6% for first names. In contrast, writer retrieval performs more robustly, reaching an mAP of 50.6%. Furthermore, our experiments indicate that off-the-shelf OCR systems are not sufficiently reliable for transcription of handwritten signature data, particularly for short, out-of-vocabulary entries such as names or addresses. However, writer retrieval methods can effectively identify visually similar entries across signature lists, making them a suitable tool for supporting the detection of potential duplicate submissions based on handwriting similarity.
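The character error rate quoted above is conventionally the edit distance between the recognized text and the reference, divided by the reference length. A minimal sketch of that standard definition follows; the example names are made up and are not from the paper's dataset.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_char != hyp_char)  # substitution or match
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference string."""
    return levenshtein(reference, hypothesis) / len(reference)

cer_exact = character_error_rate("Anna", "Anna")
cer_umlaut = character_error_rate("Müller", "Muller")  # one substituted character
```

Short entries such as first names make CER sensitive to single mistakes: one wrong character in a six-letter name already costs about 17%.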
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
Scoring rationale: The paper focuses on traditional document analysis, OCR, and writer retrieval for validating Swiss signature lists. It does not involve World Models, MLLM, Model-Based RL, or Unify Models. There is minimal overlap with MultiModal (image+text processing) and Tokenizer/Visual Encoder in a general computer vision sense, but not in the context of modern large-scale learning architectures implied by the keywords.
Keywords
Handwriting Extraction, Signature Lists, Swiss Popular Initiatives, OCR, Writer Retrieval, Document Analysis, Duplicate Submissions, Handwriting Similarity
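Abstract Translation
Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key challenge for deontic reasoning with large language models (LLMs) is that the relevant ruleset can be long and cross-referenced, so a model may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but the improvements are not uniform: weaker models often degrade on numerical tasks while consuming more tokens.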
Abstract
Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.
Scoring Details
| Keyword | Weight | Relevance | Score |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 2.0/10 | 3.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
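Scoring rationale: The paper mainly studies an agentic interaction method (DAR) for large language models on deontic reasoning tasks. Its core is rule retrieval and reasoning logic, and it does not involve the technical areas the keywords point to, such as visual encoders, multimodal fusion, world model construction, or model-based reinforcement learning. Although the paper discusses token consumption, Tokenizer is not a core contribution. The paper's relevance to the given keyword set (which leans toward multimodal and RL architectures) is therefore very low.
Keywords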
Deontic Reasoning, Agentic Harnesses, LLM, Statutes, Rules, DeonticBench, Token Consumption
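Abstract Translation
We present the Graph Set Transformer (GST), a neural network architecture for learning on sets of graphs, designed for per-element prediction tasks that depend on set-level context as well as local structure. Existing architectures, including DeepSets and SetTransformer, require pre-encoded graph embeddings from a separate graph neural network (GNN), which creates a bottleneck between feature extraction and set-level context modelling. In contrast, GST interleaves node-level feature propagation with cross-graph context modelling at every layer and fuses the two levels of information through a gating mechanism. We evaluate GST on a controlled synthetic test suite designed to isolate set-conditional structural reasoning, and on three real-data benchmarks covering per-atom reaction-centre identification, reaction yield prediction, and image classification. Under matched parameter budgets, GST outperforms the baselines in all of these settings. An architectural ablation strongly suggests that the interleaving of local and set context contributes substantially to this advantage.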
Abstract
We introduce the Graph Set Transformer (GST), a neural network architecture for learning on sets of graphs, designed for tasks in which per-element predictions depend on set-wide context as well as local structure. Existing architectures, including DeepSets and SetTransformer, require pre-encoded graph embeddings from a separate GNN, creating a bottleneck between feature extraction and set-level contextualisation. In contrast, GST interleaves node-level feature propagation and cross-graph contextual modelling at every layer, fusing the two levels of information through a gating mechanism. We evaluate GST on a controlled synthetic suite designed to isolate set-conditional structural reasoning and on three real-data benchmarks spanning per-atom reaction-centre identification, reaction yield prediction, and image classification. Under matched parameter budgets, GST performs better than the baselines across these settings. An architectural ablation strongly suggests that the interleaving of local and set context contributes substantially to this advantage.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
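评分理由: The paper proposes the Graph Set Transformer (GST) architecture for learning and reasoning on sets of graphs; its core idea is interleaving node-level feature propagation with cross-graph contextual modelling. The provided keywords revolve around multimodal large language models (MLLM), world models, reinforcement learning, and visual encoders. The paper is largely unrelated to them: it involves no tokenizer, visual encoder, world model, or reinforcement learning; visual data appears only in the image classification benchmark, so MultiModal relevance is very low; and although it fuses context, it does not follow the "Unify Models" paradigm. The author list does not include Yang Shi or the other designated experts. The weighted total is therefore far below the dynamic passing threshold.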
关键词
Graph Set Transformer, Sets of Graphs, Node-level Feature Propagation, Cross-graph Contextual Modelling, Reaction-centre Identification, Reaction Yield Prediction, Image Classification, Gating Mechanism
摘要翻译
函数向量(FVs)是在上下文学习过程中提取的任务表示,可用于引导大型语言模型(LLMs)。然而,关于其定义方式的设计选择仍研究不足。本文研究了针对指令的 FV 定义在两个自由度上的变化所带来的影响:注意力头选择和引导方式。在注意力头选择方面,结合层相关传播(LRP)使用基于梯度的归因方法能显著提高效率和准确性。在 FV 引导方面,采用分布式方式应用比简单聚合能获得更高的准确性。代码已公开。
Abstract
Function vectors (FVs) are task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). However, design choices in their formulation remain underexplored. In this work, we study the impact of varying FV definitions for instructions along two degrees of freedom: attention head selection and steering. For head selection, using gradient-based attributions with Layer-wise Relevance Propagation (LRP) substantially improves efficiency as well as accuracy. For FV steering, applying it in a distributed manner yields a higher accuracy compared to simple aggregation. Our code is publicly available.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
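评分理由: The paper mainly concerns function vectors in large language models (LLMs) and their steering role in in-context learning, involving attention head selection and gradient-based attribution. This has no direct link to the multimodal keywords (MultiModal, Visual Encoder), World Models, model-based RL, or Tokenizer. Although it involves LLMs (close to MLLM), it does not exhibit the core architectural traits of model unification (Unify Models), so relevance is low.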
关键词
Function Vectors, Large Language Models, In-context Learning, Steering, Gradient-based Attributions, Layer-wise Relevance Propagation, Task Representations
摘要翻译
列车延误预测对乘客和铁路运营商而言均是一个重要问题,但由于缺乏标准化的数据集、预测目标及评估协议,该领域的进展仍难以评估。为填补这一空白,我们引入了 RIDE,这是一个在比利时铁路网络上以全国规模构建的列车延误预测开放数据集与基准。RIDE 涵盖了 2023 年至 2025 年间的 9450 万次列车事件、360 万次行程以及 3570 万次天气记录。该数据集被组织为一个分层数据管道,从原始铁路和天气数据源延伸至两个公开版本:一个可重用的中间关系型数据集以及模型就绪基准数据集。该基准标准化了预测任务以及训练和测试数据。此外,它还提供了一个统一评估协议,支持模型间的直接比较。利用该框架,我们首次提供了针对非学习模型、统计学习模型及深度学习模型的全面比较评估。结果表明,基于学习的方法明显优于非学习模型,其中图神经网络(Graph Neural Networks, GNNs)取得了最佳平均性能,而最强的基于学习模型之间性能相对接近。除了聚合平均绝对误差(MAE)和均方根误差(RMSE)外,该框架还提供了按预测时域(prediction horizon)和延误变化的分解,从而能够更详细地分析模型在不同预测范式(forecasting regimes)下的行为。
Abstract
Train delay prediction is an important problem for both passengers and railway operators, yet progress in the field remains difficult to assess due to the lack of standardized datasets, prediction targets, and evaluation protocols. To address this gap, we introduce RIDE, an open dataset and benchmark for train delay prediction built at nationwide scale over the Belgian railway network. RIDE covers 94.5M train events, 3.6M journeys, and 35.7M weather records from 2023 to 2025. It is organized as a layered data pipeline from raw railway and weather sources to two public releases: a reusable intermediate relational dataset and model-ready benchmark datasets. The benchmark standardizes the prediction task and the training and testing data. It also provides a unified evaluation protocol that supports direct comparison across models. Using this framework, we provide the first comprehensive comparative evaluation of non-learning, statistical learning, and deep learning models. We show that learning-based methods clearly outperform non-learning models, with graph neural networks achieving the best mean performance, while the strongest learning-based models remain relatively close to one another. Beyond aggregate mean absolute error (MAE) and root mean squared error (RMSE), the framework also provides breakdowns by prediction horizon and delay change, enabling more detailed analysis of model behavior across forecasting regimes.
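As a rough illustration of the per-horizon breakdown mentioned in the abstract, the sketch below groups prediction errors by forecast horizon and reports MAE and RMSE for each group. The record layout `(horizon_min, predicted, actual)` and the toy values are assumptions made for this example; they are not taken from the RIDE benchmark or its evaluation code.

```python
import math
from collections import defaultdict


def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def by_horizon(records):
    """Group (horizon_min, predicted, actual) records and score each horizon."""
    groups = defaultdict(list)
    for horizon_min, predicted, actual in records:
        groups[horizon_min].append(predicted - actual)
    return {h: (mae(e), rmse(e)) for h, e in sorted(groups.items())}


# Hypothetical delay predictions in minutes (illustrative values only).
records = [(10, 2.0, 1.0), (10, 4.0, 5.0), (30, 3.0, 0.0), (30, 6.0, 10.0)]
report = by_horizon(records)
for horizon, (m, r) in report.items():
    print(f"horizon={horizon}min MAE={m:.2f} RMSE={r:.2f}")
```

RMSE weights large errors more heavily than MAE, so reporting both per horizon shows whether long-horizon degradation comes from a few large misses or from a uniform shift.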
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
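评分理由: The paper focuses on a dataset benchmark for railway delay prediction and a comparison of deep learning models, involving graph neural networks and time series forecasting. The provided keywords mainly concern multimodal large models, world models, and reinforcement learning, and are largely unrelated to the paper. Apart from a weak link between the 'unified evaluation protocol' and 'Unify Models', and a very weak link between 'multi-source data' and 'MultiModal', none of the other keywords is addressed. The author list does not include the designated experts.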
关键词
Train Delay Prediction, Dataset Benchmark, Graph Neural Networks, Railway Network, Weather Records, Evaluation Protocol, Deep Learning, Time Series Forecasting
摘要翻译
本文提出了一种用于识别仿射控制(Control-affine)降阶模型(ROMs)的框架。所提出的方法利用自编码器(AEs)将高维状态(以及潜在的高维输入)转换为适合仿射控制状态空间动力学(State-space Dynamics)的低维潜在表示。这是通过对自编码器和状态空间模型(State-space Model)进行联合训练实现的。此外,我们将离散降阶模型(ROM)的形式扩展为基于序列的模型(Sequence-based Model),该模型处理状态和输入历史以提高预测精度,同时保持仿射控制结构。我们通过将反馈线性化(Feedback Linearization)应用于推导出的模型来论证该框架,并提出了其高效使用的指南。所提出的框架在两个数值算例上进行了评估,其性能与一个基线模型(Baseline Model)进行了比较,在该基线模型中,自编码器识别出一个具有线性状态空间动力学的潜在空间。评估包括评估降阶模型(ROM)在测试数据上的预测精度及其将系统控制到期望状态或轨迹的有效性。
Abstract
We present in this paper a framework for the identification of control-affine reduced-order models (ROMs). The proposed method utilizes autoencoders (AEs) to transform the high-dimensional states, and potentially the high-dimensional inputs, into reduced latent ones suitable for control-affine state-space dynamics. This is achieved by simultaneous training of the AE and the state-space model. In addition, we extend the discrete ROM formulation to a sequence-based model, which processes state and input histories to improve prediction accuracy while preserving the control-affine structure. We motivate our framework by applying feedback linearization to the derived models, and we present guidelines for its efficient use. The proposed framework is assessed on two numerical examples and its performance is compared to a baseline model, where the AE identifies a latent space with linear state-space dynamics. The assessment involves evaluating the prediction accuracy of the ROM on test data and its effectiveness in controlling the system to a desired state or trajectory.
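The appeal of a control-affine structure can be seen in a scalar toy case. Assuming a discrete model of the form z_{k+1} = f(z_k) + g(z_k) u_k with g never zero, the input u_k = (v_k - f(z_k)) / g(z_k) cancels the nonlinearity, so the closed loop follows whatever linear target dynamics v_k prescribes. The functions `f` and `g` below are hand-written stand-ins chosen for this sketch; in the paper they would be learned jointly with the autoencoder in latent space.

```python
import math


def f(z):
    """Hypothetical drift term of the latent dynamics."""
    return z + 0.1 * math.sin(z)


def g(z):
    """Hypothetical input gain; stays >= 0.4, so division is safe."""
    return 0.5 + 0.1 * math.cos(z)


def linearizing_input(z, v):
    """Input that makes the next state equal to the desired value v."""
    return (v - f(z)) / g(z)


def step(z, u):
    """One step of the control-affine model z_next = f(z) + g(z) * u."""
    return f(z) + g(z) * u


z, target = 2.0, 0.5
trajectory = [z]
for _ in range(20):
    v = z + 0.5 * (target - z)  # desired linear closed loop: error halves per step
    z = step(z, linearizing_input(z, v))
    trajectory.append(z)

print(f"final state: {trajectory[-1]:.6f}")
```

This only works when the model is exact and g is invertible; with a learned model, the prediction error of the ROM carries over directly into the tracking error.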
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 3.0/10 | 4.5 |
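评分理由: The paper's core is system identification and reduced-order modeling from control theory, using autoencoders to handle continuous state spaces. The provided keywords mainly target multimodal large models, language models, and reinforcement learning (e.g. Tokenizer, MLLM, MultiModal), which do not match the paper's domain, so their relevance is 0. Only 'model-based RL' has a weak link, through the model-based control strategy, and receives a low score. The author list does not include the designated experts, so no bonus applies. The weighted total is about 4.5, below the dynamic passing threshold of 27.8.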
关键词
Autoencoders, Reduced-Order Models, Control-Affine, System Identification, State-Space Model, Feedback Linearization, Latent Space
摘要翻译
推理模型发展迅速,但主流的基于可验证奖励的强化学习(RLVR)范式却出人意料地狭隘:即采样大量响应,并用单个比特奖励每个响应,以指示最终答案是否正确。然而,许多场景提供丰富的反馈,包括执行轨迹、工具输出、专家修正以及模型的自我评估。我们通过经典模仿学习算法 DAgger(Dataset Aggregation)的分布变体来研究如何利用此类反馈,其中学习者可局部访问当前策略所访问状态上的专家分布。这导出了一个简单的前向交叉熵目标函数,该函数支持黑盒专家,且其序列级梯度通过传播未来的专家 - 学生分歧回早期决策,从而执行丰富的信用分配。我们表明,先前基于反向 KL 或 Jensen-Shannon 散度的自蒸馏目标的强化学习无法保证单调策略改进:即使专家具有更高奖励,其更新也可能增加较差动作的概率。相比之下,我们证明前向交叉熵支持单调策略改进,并具备遗憾保证。此外,我们进一步表明,我们的目标优化了教师加权成功似然性的下界,从而提升了 Pass@N 指标。实验上,我们的方法 DistIL 在科学推理、代码生成及求解复杂数学问题等多种领域上,均优于 RLVR 及基于自蒸馏的强化学习基线方法。
Abstract
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
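To make the forward cross-entropy objective concrete, the sketch below works through a single state with three actions. For a softmax policy, the gradient of the forward cross-entropy -Σ p_expert(a) log π(a) with respect to the logits is π - p_expert, so one gradient step moves probability mass toward the expert's distribution. This is a one-state illustration only, with made-up numbers and step size; it is not the sequence-level DistIL objective from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_ce(expert, policy):
    """Forward cross-entropy: expectation under the expert of -log policy."""
    return -sum(p * math.log(q) for p, q in zip(expert, policy) if p > 0)


expert = [0.7, 0.2, 0.1]   # hypothetical expert distribution at one state
logits = [0.0, 0.0, 0.0]   # uniform initial policy

before = forward_ce(expert, softmax(logits))
# Gradient of forward CE w.r.t. softmax logits is (policy - expert).
grad = [q - p for q, p in zip(softmax(logits), expert)]
logits = [x - 0.5 * g for x, g in zip(logits, grad)]  # step size 0.5 is arbitrary
after = forward_ce(expert, softmax(logits))

print(f"forward CE before={before:.4f} after={after:.4f}")
```

Evaluating the objective needs only expert probabilities (or samples) at the visited state, not gradients through the expert, which is consistent with the abstract's remark that the objective admits a blackbox expert.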
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on Reinforcement Learning from Rich Feedback using a distributional DAgger variant, emphasizing policy improvement and forward cross-entropy objectives. It does not involve multimodal architectures, visual encoders, tokenizers, world models, or model unification, resulting in low relevance for most keywords. While it is an RL paper, it is not specifically model-based RL (which typically involves learning dynamics models), hence the low score for that keyword. The total weighted score (3.0) is below the dynamic passing threshold (27.8), indicating poor alignment with the provided keyword set. None of the listed expert authors are present in the author list.
关键词
Reinforcement Learning, Rich Feedback, Distributional DAgger, Forward Cross-Entropy, Policy Improvement, Scientific Reasoning, Pass@N
摘要翻译
大型语言模型(LLM)的知识基准面临三个问题:规模驱动的设计未能实现学科代表性;允许懒惰共识的扁平支付注释;以及在有界测试预算下未经审计的排名不稳定性。我们提出了 KINA,这是一个涵盖 261 个细粒度学科的 899 项基准,包含两个正式结果。首先,我们将代表性视为基于专家提取锚点的覆盖式目标,并通过一个代理变量操作化学科代表性,从而得到一个 (1-1/e) 的贪婪近似(命题 1);该保证仅适用于代理变量,而非总体代表性。其次,我们证明基于条形奖金的锦标赛在发布评论质量上弱一阶随机占优(FOSD)支配扁平支付,其激励相容阈值为 B > ΔC / Δp_min(定理 1)。评估来自 13 个实验室的 42 个模型后,表现最佳的模型是 Gemini-3.1-Pro-Preview,得分为 53.17%,其次是 Claude-Opus-4.6(49.92%)和 GPT-5.4(48.55%),表明在饱和点以下仍有较大的提升空间。完整的排行榜显示出层级结构而非平滑的全序:一个较小的前沿层级位于 48% 以上,一个密集的强模型层级跨度约为 38%-45%,而表现较差的模型仅略高于 10% 的机会基线。工具增强在五个工具使用评估中最多可增加 5.17 分,且增益在不同模型间差异显著。我们报告了基于自助法的排名稳定性统计,以明确有界预算下的方差,并防止对相邻排名的过度解读。
Abstract
Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-payment annotation that permits lazy consensus; and unaudited ranking instability under bounded test budgets. We introduce KINA, an 899-item benchmark across 261 fine-grained disciplines, with two formal results. First, we cast representativeness as a coverage-style objective over expert-elicited anchors and operationalize disciplinary representativeness through a proxy, yielding a (1-1/e) greedy approximation (Proposition 1); the guarantee applies to the proxy, not to population representativeness. Second, we prove a bonus-on-bar tournament weakly FOSD-dominates flat payment in released-review quality, with incentive-compatibility threshold B > ΔC / Δp_min (Theorem 1). Evaluating 42 models from 13 labs, the top model, Gemini-3.1-Pro-Preview, reaches 53.17%, followed by Claude-Opus-4.6 at 49.92% and GPT-5.4 at 48.55%, leaving substantial headroom below saturation. The full leaderboard shows a tiered structure rather than a smooth total order: a small frontier tier lies above 48%, a dense strong-model tier spans roughly 38-45%, and low-performing models remain only modestly above the 10% chance baseline. Tool augmentation adds up to 5.17 points across the five tool-use evaluations, with gains varying substantially across models. We report bootstrap ranking-stability statistics to make bounded-budget variance explicit and to discourage over-interpretation of adjacent ranks.
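The (1-1/e) factor quoted above is the classical guarantee for greedily maximizing a monotone submodular function, such as set coverage, under a cardinality budget. The sketch below shows that greedy rule on a toy instance. The item names, anchor sets, and budget are invented for illustration; how KINA actually defines its anchors and proxy objective is not specified in the abstract.

```python
def greedy_cover(candidates, budget):
    """Pick up to `budget` items, each time taking the one covering the most new anchors."""
    covered = set()
    chosen = []
    remaining = dict(candidates)
    for _ in range(budget):
        best = max(remaining, key=lambda k: len(remaining[k] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # nothing left that adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical benchmark items mapped to the anchors they cover.
items = {
    "q1": {"a", "b", "c"},
    "q2": {"c", "d"},
    "q3": {"d", "e", "f", "g"},
    "q4": {"a", "g"},
}
chosen, covered = greedy_cover(items, budget=2)
print(chosen, sorted(covered))
```

As the abstract notes, a guarantee of this kind bounds coverage of the chosen proxy only; it says nothing about how well the proxy tracks population-level representativeness.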
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 1.0/10 | 1.5 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper proposes KINA, a knowledge benchmark for LLMs focusing on disciplinary representativeness, annotation incentives, and ranking stability. It does not discuss model architecture (Tokenizer, Visual Encoder), unification strategies, world modeling, or reinforcement learning. While it evaluates models like Gemini (potentially MLLMs), the core contribution is evaluation methodology rather than multimodal representation learning or model training. Thus, relevance to the specific technical keywords is minimal. None of the listed expert authors appear in the author list.
关键词
Knowledge Benchmark, LLM Evaluation, Disciplinary Representativeness, Incentive Compatibility, Ranking Stability, Tool Augmentation, Expert-Elicited Anchors
摘要翻译
研究论文的标题以清晰简洁的方式传达其主要思想,偶尔也会传达其结论。选择合适的标题往往具有挑战性,而自动生成标题可协助作者完成这一任务。在这项工作中,我们提出了一种利用开源预训练大语言模型从摘要生成论文标题的技术。我们使用了 CSPubSum 和 LREC-COLING-2024 数据集,并引入了一个新的数据集 SpringerSSAT,该数据集是从四本社会科学领域的 Springer 期刊中整理而成的。此外,我们在零样本设置中使用 GPT-3.5-turbo 生成标题。模型性能使用 ROUGE、METEOR、MoverScore、BERTScore 和 SciBERTScore 指标进行评估。我们的实验表明,微调后的 PEGASUS-large 在大多数指标上优于其他模型,包括微调后的 LLaMA-3-8B 和零样本 GPT-3.5-turbo。我们进一步展示了 ChatGPT 可以生成创造性的论文标题。总体而言,AI 生成的标题通常是合适且可靠的。
Abstract
The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an appropriate title is often challenging, and automated title generation can assist authors in this task. In this work, we propose a technique to generate paper titles from abstracts using open-weight pre-trained and large language models. We use the CSPubSum and LREC-COLING-2024 datasets and introduce a new dataset, SpringerSSAT, curated from four Springer journals in the social sciences. Additionally, we use GPT-3.5-turbo in a zero-shot setting to generate titles. Model performance is evaluated with ROUGE, METEOR, MoverScore, BERTScore, and SciBERTScore metrics. Our experiments show that fine-tuned PEGASUS-large outperforms other models, including fine-tuned LLaMA-3-8B and zero-shot GPT-3.5-turbo, across most metrics. We further demonstrate that ChatGPT can generate creative paper titles. Overall, AI-generated titles are generally appropriate and reliable.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on text-based automatic title generation using LLMs (PEGASUS, LLaMA, GPT) and does not involve multimodal learning, world models, visual encoders, or reinforcement learning. Only minimal relevance exists for Tokenizer (implicit in LLMs) and Unify Models (comparison of different models), while all other keywords are completely unrelated to the paper's content.
关键词
Automatic Generation, Research Papers, Language Models, Title Generation, Text Generation, PEGASUS-large, Abstract-to-Title, Evaluation Metrics
摘要翻译
编程领域的 AI 工具已不再仅仅是自动补全或聊天助手;它们正组织成开发框架,涵盖流程、角色、工件及验证环节。近期的调查已梳理了软件工程中的智能体(Agents)与大语言模型(LLMs),但尚缺乏一项专注于将此类能力转化为流程的操作框架的研究。我们对原始资料进行了定向搜索,基于功能性纳入标准及影响力度量,筛选出六个框架:GitHub Spec Kit、OpenSpec、BMAD Method、Get Shit Done (GSD)、Spec Kitty 和 Reversa。各框架通过不同路径切入 AI 开发:包括完整与轻量级变体的规范驱动开发、基于智能体的敏捷规划、针对智能体的上下文工程、工作树隔离与审查,以及从遗留系统中恢复操作规范。本文的核心贡献在于提出了一种六维过程分类法:规范(Specification)、上下文(Context)、角色(Roles)、执行(Execution)、验证(Validation)与可移植性(Portability),并附带一套评分量表,使其成为一种可复现的评估工具。我们将该分类法应用于上述六个框架以及一个样本外案例(Spec-Flow)。研究结果主要有两点。在已采用部分流程的框架中存在趋同现象:孤立提示(isolated prompt)的中心地位减弱,持久性工件、工作契约、可追溯性及人工审查成为减少歧义并协调智能体的机制。且没有任何一个框架能全面覆盖所有六个维度,这揭示了过程深度与跨智能体可移植性之间存在结构性权衡。此外,我们还发现了反复出现的风险:规范与代码之间的漂移、对生成工件的过度信任、社区扩展的脆弱性、平台依赖性以及缺乏针对完整流程的基准。最后,我们提出了一项实证评估的研究议程,重点关注中间质量指标、上下文治理、安装安全性及可复现性。
Abstract
AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on process taxonomy for AI software development agents, lacking content on multimodal architectures (Tokenizer, Visual Encoder), world models, or reinforcement learning. While it utilizes LLMs, it does not address Multimodal LLMs (MLLM) or model unification in representation learning, resulting in low relevance scores.
关键词
AI Software Development Agents, Process Taxonomy, Comparative Assessment, Software Engineering Frameworks, Specification, Roles, Execution, Portability
摘要翻译
我们引入大语言模型代理架构 Agentic Redux,旨在用于需要线性可审计性的非平凡问题域。利用类型化 Lambda 演算,我们证明,在适用域上运行时,Agentic Redux 的执行在语义上保证正确,所有决策均记录于追加式账本中。我们展示了两个生产级适用领域,涵盖医疗账单合规性和安全漏洞披露。在两个领域上运行的 Agentic Redux 的可运行代码可在支持代码库中获取。我们还引入了本体优先代理设计(Ontology-First Agent Design),这是一种在问题域上创建代理框架的方法论:人类专家利用基本形式本体(Basic Formal Ontology)对该问题域进行本体化,然后指派大语言模型(LLM)推导出代理和人在回路(humans-in-the-loop)可承担的角色,以解决该领域的问题。
Abstract
We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains: healthcare billing compliance and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creating agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill in order to work the problems in the domain.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on LLM agent safety, auditability, and ontology-based design using formal verification (typed lambda calculus). It does not address multi-modality (Visual Encoder, MultiModal), tokenization strategies, world models, or model-based reinforcement learning. While it involves LLMs (MLLM), it lacks the multi-modal components implied by the keyword set, resulting in low relevance scores for most keywords.
关键词
LLM Agents, Auditable, Safe, Human-Authored Ontologies, Agentic Redux, Typed Lambda Calculus, Ontology-First Agent Design
摘要翻译
寻找组合谜题(如 Rubik's Cube、滑动拼图和 Lights Out)的最优解路径仍然是人工智能领域的一个经典挑战。启发式搜索算法(如 A*)仅在采用可容许启发式函数(即从不高估真实剩余代价的函数)时,才能保证路径的最优性。深度强化学习(RL)方法(如 DeepCubeA)通过训练深度神经网络来近似剩余代价启发式函数。然而,标准的均方误差(MSE)训练通常会产生高估,违反可容许性并损害解的最优性。本文提出了一种通用框架,用于学习基于验证校准的可容许神经启发式函数。我们利用一个低估型的可容许贝尔曼算子结合非对称损失函数来惩罚高估,从而训练价值网络。为了考虑残留的神经网络函数近似误差,我们提出了一种事后校准安全偏移量,该偏移量基于验证打乱状态计算得出。实验表明,在评估协议下,我们的校准神经启发式函数未观察到任何可容许性违反,并在实践中保持路径最优性;与标准解析基线相比,其在 2×2 Rubik's Cube 上的搜索节点扩展减少了高达 83.0%,在 3×3 Lights Out 网格上减少了 19.9%,在 8-Puzzle 上减少了 1.9%。
Abstract
Finding optimal solution paths for combinatorial puzzles like the Rubik's Cube, sliding tile puzzles, and Lights Out remains a classical challenge in artificial intelligence. Heuristic search algorithms, such as A*, guarantee path optimality only when using an admissible heuristic, that is, one that never overestimates the true remaining cost-to-go. Deep reinforcement learning (RL) methods like DeepCubeA train deep neural networks to approximate cost-to-go heuristics. However, standard mean-squared error (MSE) training regularly yields overestimations, violating admissibility and compromising solution optimality. In this paper, we introduce a generalizable framework for learning validation-calibrated admissible neural heuristics. We train a value network using an underestimating Admissible Bellman Operator combined with an Asymmetric Loss function to penalize overestimation. To account for residual neural function approximation errors, we propose a post-hoc calibration safety offset computed over validation scrambles. We demonstrate that our calibrated neural heuristics achieve no observed admissibility violations under the evaluation protocol and preserve path optimality in practice while reducing search node expansions by up to 83.0% on a 2×2 Rubik's Cube, 19.9% on a 3×3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.
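Two of the ingredients named in the abstract, an asymmetric loss and a post-hoc safety offset, can be sketched in a few lines. The version below is a plausible minimal form under stated assumptions: the loss is a squared error with a heavier weight on overestimates, and the offset is the largest overestimate observed on validation states. The weight of 10, the function names, and the toy validation values are all illustrative, not taken from the paper.

```python
def asymmetric_loss(pred, target, over_weight=10.0):
    """Squared error that penalizes overestimates (pred > target) more heavily."""
    err = pred - target
    return (over_weight if err > 0 else 1.0) * err * err


def safety_offset(preds, true_costs):
    """Largest overestimate observed on validation states (0 if none)."""
    return max(0.0, max(p - t for p, t in zip(preds, true_costs)))


def calibrated(pred, offset):
    """Shift the raw heuristic down by the offset, never below zero."""
    return max(0.0, pred - offset)


# Hypothetical validation predictions and true costs-to-go.
val_preds = [3.2, 5.1, 7.9, 2.0]
val_true = [3.0, 5.0, 8.0, 2.0]
offset = safety_offset(val_preds, val_true)

print(f"offset={offset:.2f}")
print([round(calibrated(p, offset), 2) for p in val_preds])
```

An offset computed this way removes overestimates only on the states it was measured on, which matches the abstract's careful wording of "no observed admissibility violations" rather than a proof of admissibility.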
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
评分理由: The paper focuses on learning admissible neural heuristics for combinatorial search (e.g., Rubik's Cube) using RL concepts like the Bellman operator. It does not address Unify Models, Tokenizers, Visual Encoders, World Models, MLLM, or MultiModal aspects, resulting in 0 relevance for these keywords. There is a slight connection to model-based RL due to RL terminology usage, but the core is heuristic search rather than model-based planning. No expert authors from the specified list were found, so no bonus applies.
关键词
Combinatorial Search, Admissible Heuristics, Neural Heuristics, Deep Reinforcement Learning, Bellman Operator, Path Optimality, Rubik's Cube
摘要翻译
高通量测序技术的快速发展催生了大型、高维组学数据集。深度无监督学习架构,尤其是自编码器(Autoencoders, AEs),在该领域中日益被广泛用于降维和表示学习。然而,自编码器对架构选择和超参数高度敏感,且无监督优化通常依赖于重建损失,这作为下游任务性能的代理指标可能效果不佳。穷举超参数优化(HPO)计算成本高昂,致使研究人员常依赖次优的默认配置。为了降低大规模无监督 HPO 研究的门槛,我们引入了 BBOmix,这是首个面向真实世界生物数据的无监督表示学习的开源表格基准。我们的基准涵盖了来自 TCGA 和 SCHC 数据集的四种 AE 架构及七种多组学模态的 105,000 次评估。我们量化了重建损失与下游任务性能之间的相关性,并对最先进的单保真度、多保真度及迁移学习 HPO 方法进行了广泛评估,为未来的无监督生物表示学习研究建立了严格的基线。
Abstract
The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), are increasingly used for dimensionality reduction and representation learning in this domain. However, AEs are highly sensitive to architectural choices and hyperparameters, and unsupervised optimization typically relies on reconstruction loss, which may be a poor proxy for downstream utility. Exhaustive hyperparameter optimization (HPO) is computationally expensive, leading researchers to frequently rely on suboptimal default configurations. To democratize access to large-scale unsupervised HPO research, we introduce BBOmix, the first open-source tabular benchmark for unsupervised representation learning on real-world biological data. Our benchmark includes 105,000 evaluations across four AE architectures and seven multi-omics modalities from the TCGA and SCHC datasets. We quantify the correlation between reconstruction loss and downstream task performance and provide an extensive evaluation of state-of-the-art single-fidelity, multi-fidelity, and transfer learning HPO methods, establishing a rigorous baseline for future research in unsupervised biological representation learning.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 2.0/10 | 3.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
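评分理由: The paper's core is unsupervised representation learning (autoencoders) and hyperparameter optimization for biological omics data, which belongs to bioinformatics. The provided keywords mainly point to multimodal large language models (MLLM), visual encoders, world models, and reinforcement learning, so the technology stack does not match. Only 'multi-omics modalities' has a weak semantic link to MultiModal; the remaining keywords are unrelated.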
关键词
Unsupervised Representation Learning, Hyperparameter Optimization, Autoencoders, Multi-omics, Benchmark, Biological Data, Dimensionality Reduction, Reconstruction Loss
摘要翻译
生成逼真的金融时间序列颇具挑战性,因为训练数据往往仅限于单一历史路径。面对如此稀缺的数据,过拟合难以避免,尤其是在对抗训练场景下,训练好的判别器可能会记住训练样本。为缓解这一问题,近期方法通过训练生成器来最小化真实与生成时间序列的未经训练特征表示之间的差异。在这些工作中,特征图基于路径签名(path signatures),这在可行的截断深度下可能无法捕捉相关的时间序列属性。本文则通过匹配真实与生成时间序列的随机卷积特征来训练生成器。现有的随机卷积特征映射(如 Rocket 和 Hydra)已被证明能提供关于真实世界时间序列的富有信息量的表示,但由于它们不可微,无法用于监督生成模型。我们提出 SOCK(SOft Competing Kernels),这是一种完全可微的随机卷积特征映射,适用于训练生成式时间序列模型。实验表明,通过匹配随机 SOCK 特征训练的生成器在广泛的小样本金融数据集上一致优于基于签名和扩散的基线模型。此外,我们还展示了 SOCK 在双样本假设检验和时间序列分类任务上的表达能力,在这些任务中,SOCK 匹配或优于现有的无监督特征映射。
Abstract
Generating realistic financial time series is challenging as training data is often limited to a single historical path. With such scarce data, overfitting is hard to avoid, especially under adversarial training where a trained discriminator can memorize the training samples. To mitigate this, recent approaches train generators to minimize the discrepancy between untrained feature representations of real and generated time series. In these works, the feature maps are based on path signatures, which can fail to capture relevant time series properties at tractable truncation depths. In this work, we instead train generators by matching random convolutional features of real and generated time series. Existing random convolutional feature maps, such as Rocket and Hydra, have been shown to provide informative representations of real-world time series, but cannot supervise generative models because they are non-differentiable. We introduce SOCK (SOft Competing Kernels), a fully differentiable random convolutional feature map, suited to train generative time series models. We show that generators trained by matching random SOCK features consistently outperform signature and diffusion baselines across a wide range of small-sample financial datasets. We further demonstrate SOCK's expressiveness on two-sample hypothesis testing and time series classification tasks, where SOCK matches or outperforms existing unsupervised feature maps.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 1.0/10 | 1.5 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
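评分理由: The paper's core is financial time series generation using a differentiable random convolutional feature matching method (SOCK). The given keywords mainly cover multimodal large models (MLLM, MultiModal), visual encoding (Visual Encoder), sequence tokenization (Tokenizer), and reinforcement learning (model-based RL), which have no direct technical link to the paper's time series generation task. Apart from weak links to 'World Models' and 'Unify Models' through generative modeling, the relevance of all other keywords is 0. The weighted total is far below the dynamic passing threshold of 27.8, indicating that the paper falls outside this research direction.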
关键词
Financial Time Series, Generative Models, Random Convolutional Features, Differentiable Feature Matching, SOCK, Small-sample Learning, Two-sample Hypothesis Testing
摘要翻译
缺失值插补(Missing Value Imputation)是机器学习中的基础任务,大多数现有方法假设所有缺失条目都对应于未观测到的常规值。然而,在许多真实世界的数据集中,缺失性(Missingness)可能源于两个不同的来源:一些条目是有意义缺失的(本质上缺失且语义有效),而其他条目是由于观测过程缺失且应被插补的。我们将这种区别形式化为一个选择性插补问题(Selective Imputation Problem),其目标是联合推断哪些缺失条目应被保留,哪些应被恢复。为了解决这一挑战,我们提出 Diff-Joint,一种基于扩散的框架(Diffusion-based Framework),联合建模表格数据与潜在缺失性掩码(Latent Missingness Mask)。该方法在条件采样(Conditional Sampling)和不确定性感知聚合(Uncertainty-aware Aggregation)之间交替进行,以迭代地优化插补值和缺失性标签。在合成数据集和真实世界数据集上的实证结果表明,Diff-Joint 能有效识别有意义缺失条目,同时实现具有竞争力的插补精度并提升下游任务性能。
Abstract
Missing value imputation is a fundamental task in machine learning, with most existing methods assuming that all missing entries correspond to unobserved regular values. In many real-world datasets, however, missingness may arise from two distinct sources: some entries are meaningfully missing (intrinsically absent and semantically valid), while others are missing due to the observation process and should be imputed. We formalize this distinction as a selective imputation problem, where the goal is to jointly infer which missing entries should be preserved and which should be recovered. To address this challenge, we propose Diff-Joint, a diffusion-based framework that jointly models tabular data together with a latent missingness mask. The method alternates between conditional sampling and uncertainty-aware aggregation to iteratively refine both imputed values and missingness labels. Empirical results on synthetic and real-world datasets demonstrate that Diff-Joint effectively identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on tabular data missing value imputation using a diffusion framework, distinguishing meaningful missingness from observation missingness. It does not involve tokenizers, visual encoders, multimodal architectures, MLLMs, or reinforcement learning. There is a minor conceptual overlap with 'Unify Models' due to joint modeling of data and masks, but overall relevance to the provided keyword set is low.
关键词
Missing value imputation, Meaningful missingness, Diffusion framework, Tabular data, Uncertainty-aware, Latent missingness mask, Selective imputation
摘要翻译
深度图生成的研究范畴涵盖了两个极端:一次性模型(one-shot models)与序列模型(sequential models)。前者联合生成节点和边,而后者则自回归地(autoregressively)采样它们。每种方法在基于大小和拓扑结构的不同图领域表现各异,但均无法适用于所有图类别。例如,一次性方法在生成大型图时面临挑战,而序列方法在小型图上的表现则欠佳。克服这些限制的一种可行方法是在一个统一系统中灵活地结合这两种方法。在本文中,我们提出了 FLAGG(Flexible Autoregressive Graph Generation)框架,该框架利用一次性模型顺序生成图的部分子图。FLAGG 可将任意一次性模型转化为自回归模式,从而在选择序列策略时提供灵活性。该策略通过随机节点移除过程来定义,一个插入模型(Insertion Model)学习逆转该过程。我们在多个具有不同图大小和领域的数据集上,使用 DiGress 一次性模型对 FLAGG 进行了评估。结果表明,该方法在采样质量上优于一次性模型和自回归基线(baselines)方法。
Abstract
The landscape of deep graph generation spans two extremes: one-shot and sequential models. The former generates nodes and edges jointly, while the latter samples them autoregressively. Each method performs better in different graph domains depending on size and topology, but neither is applicable to all graph categories. For instance, one-shot methods struggle with generating large graphs, while sequential methods underperform on smaller graphs. A possible way to overcome these limitations is to flexibly combine the two methods in a single system. In this work, we propose the FLAGG (Flexible Autoregressive Graph Generation) framework, which sequentially generates portions of graphs with one-shot models. FLAGG can make any one-shot model autoregressive, allowing flexibility in choosing the sequential policy. This policy is specified through a stochastic node removal process, which an Insertion Model learns to reverse. We evaluate FLAGG with the DiGress one-shot model on several datasets of different graph sizes and domains. We show that the approach outperforms both one-shot and autoregressive baselines in terms of sampling quality.
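The stochastic node removal process mentioned in the abstract can be pictured with a small sketch. It assumes a graph stored as an adjacency dictionary and a fixed block size per step; the helper names `induced_subgraph` and `removal_trajectory` are hypothetical, and the removal policy used by FLAGG itself is not specified in this report.

```python
import random

def induced_subgraph(adj, keep):
    """Return the subgraph of `adj` induced by the node set `keep`."""
    return {u: {v for v in nbrs if v in keep} for u, nbrs in adj.items() if u in keep}

def removal_trajectory(adj, block_size, rng):
    """Remove up to `block_size` random nodes per step until the graph is empty.

    Returns the list of graphs visited, from the full graph down to the empty one.
    """
    remaining = set(adj)
    trajectory = [induced_subgraph(adj, remaining)]
    while remaining:
        removed = rng.sample(sorted(remaining), min(block_size, len(remaining)))
        remaining -= set(removed)
        trajectory.append(induced_subgraph(adj, remaining))
    return trajectory

# A 5-node cycle as toy input (illustrative data only).
cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
forward = removal_trajectory(cycle, block_size=2, rng=random.Random(0))

# Reversing the removal trajectory gives a generation order:
# each step adds a block of nodes (and their edges) to the previous graph.
generation_order = list(reversed(forward))
sizes = [len(g) for g in generation_order]
```

In this toy setting, `block_size=1` corresponds to adding one node at a time, while a block size equal to the number of nodes recovers a single one-shot step, which is one way to read the flexibility the abstract describes.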
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
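评分理由: The paper focuses on graph generation and proposes the FLAGG framework, which combines one-shot and autoregressive generation strategies. The provided keywords mainly cover multimodal large models (MLLM, MultiModal), visual encoders, tokenizers, and model-based reinforcement learning, which differ substantially from the paper's domain (graph-structured data). Only Unify Models receives a weak relevance score because the method merges two strategies; the remaining keywords are entirely unrelated.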
关键词
Graph Generation, Autoregressive, One-shot, Flexible Framework, Node Removal, Sampling Quality, DiGress
摘要翻译
离散图模型中的边缘推断迫使我们在精确性与可扩展性之间做出选择:精确算法在高树宽图上不可行,而迭代近似方法(Belief Propagation, 变分方法)在受阻拓扑上牺牲了收敛保证。我们认为这种对立源于不匹配的归纳偏置:迭代方法放弃了使精确推断正确的顺序消除结构。我们提出了一种 In-Context Graphical Inference(ICG-I),这是一种自回归 Graph Transformer,通过模仿 Variable Elimination,利用学习到的、经 Tensor-Train(TT)压缩的中间因子来恢复这种结构,并配合 Dirichlet 输出层和 Weighted Conformal Prediction(WCP),以在拓扑偏移下提供校准的、无分布的覆盖保证。我们证明了 TT 压缩误差在自回归链中最多线性传播,Dirichlet-Multinomial 损失是一个真评分规则,且 WCP 在估计密度比下保持覆盖性,尽管会有可量化的退化。我们进行了详尽的实验来评估 ICG-I,并在所有基准上实现了最先进性能。ICG-I 在标准实例上将 MAE 从 0.041(最佳基线)降低到 0.020,并在 N=500 的受阻自旋玻璃上取得了 0.048 的结果,此时 Belief Propagation(BP)完全发散。
Abstract
Marginal inference in discrete graphical models forces a choice between exactness and scalability: exact algorithms are intractable for high-treewidth graphs, while iterative approximations (Belief Propagation, variational methods) sacrifice convergence guarantees on frustrated topologies. We argue that this dichotomy stems from a mismatched inductive bias: iterative methods abandon the sequential elimination structure that makes exact inference correct. We introduce In-Context Graphical Inference (ICG-I), an autoregressive Graph Transformer that restores this structure by mimicking Variable Elimination with learned, Tensor-Train-compressed intermediate factors, paired with a Dirichlet output layer and Weighted Conformal Prediction for calibrated, distribution-free coverage guarantees under topological shift. We prove that TT compression errors propagate at most linearly through the autoregressive chain, that the Dirichlet-Multinomial loss is a proper scoring rule, and that WCP maintains coverage with a quantifiable degradation under estimated density ratios. We conducted intensive experiments to evaluate ICG-I and achieved state-of-the-art performance across all benchmarks. ICG-I reduces MAE from 0.041 (best baseline) to 0.020 on standard instances and achieves 0.048 on N=500 frustrated spin glasses where BP diverges entirely.
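For readers unfamiliar with the sequential elimination structure the abstract refers to, the following is a minimal sketch of exact Variable Elimination on a chain-shaped model. It is the classical algorithm, not ICG-I; the function name `chain_marginal` and the toy potentials are illustrative assumptions.

```python
def chain_marginal(unary, pairwise):
    """Marginal of the last variable in a chain model, eliminating variables left to right.

    unary[i][a]       : potential of variable i taking state a
    pairwise[i][a][b] : potential linking variable i (state a) and variable i+1 (state b)
    """
    message = list(unary[0])
    for i, table in enumerate(pairwise):
        n_next = len(unary[i + 1])
        # Sum out variable i; the result is an intermediate factor over variable i+1.
        message = [
            unary[i + 1][b] * sum(message[a] * table[a][b] for a in range(len(message)))
            for b in range(n_next)
        ]
    z = sum(message)  # normalizing constant
    return [m / z for m in message]

# Three binary variables with an attractive coupling (toy values).
unary = [[1.0, 1.0], [1.0, 1.0], [1.0, 3.0]]
coupling = [[2.0, 1.0], [1.0, 2.0]]
marginal = chain_marginal(unary, [coupling, coupling])
```

On a chain each intermediate factor is a short vector. On high-treewidth graphs the intermediate factors grow exponentially, which is the cost that, according to the abstract, the learned Tensor-Train-compressed factors are meant to avoid.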
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 2.0/10 | 3.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
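评分理由: The paper is mainly concerned with inference algorithms for probabilistic graphical models (Variable Elimination, Graph Transformers), whereas the keyword set centers on multimodal large models, world models, and reinforcement learning. The two areas differ considerably. There is only a slight link to 'Unify Models' (the method unifies the structure of exact and approximate inference); the other keywords, such as Tokenizer, Visual Encoder, and MLLM, are not addressed. No specified expert authors were found.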
关键词
Graphical Inference, Variable Elimination, Graph Transformer, Tensor-Train Compression, Marginal Inference, Dirichlet Output, Weighted Conformal Prediction, Autoregressive
摘要翻译
大型语言模型(LLM)越来越多地被提议作为临床智能体,然而静态、单轮基准无法捕捉模型如何在一次诊疗过程中动态提供护理:收集信息、规划治疗,并在连续的患者状态中调整纵向管理。医学教育长期以来通过标准化病人(SP)解决了一个类似挑战:训练有素的演员始终扮演临床病例,从而实现现实化的练习和客观的脚本化评估。本文介绍了 MedSP1000,一种基于 SP 的临床智能体评估交互式基准,包含 1,638 个 SP 病例和 24,602 个轨迹级同行评审评分标准。MedSP1000 将同行评审的 SP 教学案例转换为可执行场景,包含定义的 SP 病例脚本、临床环境背景以及经人工验证的结构化评分标准。在每次模拟评估运行中,临床智能体与患者智能体及环境控制器进行闭环交互,其整个诊疗过程中的行为表现均根据原始材料中指定的专家标准进行评分。将 MedSP1000 应用于一系列通用型和医学专用 LLM 时,我们发现静态基准上的表现无法可靠地转化为此类教育场景。表现最佳的模型 GPT-5.5 仅完成了 60.4% 的专家定义评分标准项,而最强的医学专用模型仅达到 40.0%;增加推理时计算量并未带来可测量的提升。这些结果表明,当前的 LLM 包括针对医学调优的智能体系统,其可靠性尚不足以安全地整合到实际临床实践中。更广泛而言,MedSP1000 展示了过程级、SP 风格的评估如何揭示单轮基准所遗漏的临床相关故障模式。
Abstract
Large language models (LLMs) are increasingly proposed as clinical agents, yet static, single-turn benchmarks cannot capture how a model dynamically delivers care across an encounter: gathering information, planning treatment, and adapting longitudinal management across successive patient states. Medical education has long addressed an analogous challenge through standardized patients (SPs): trained actors who consistently portray clinical cases, enabling realistic practice and objective, scripted assessment. Here we introduce MedSP1000, an SP-derived interactive benchmark for clinical-agent evaluation, comprising 1,638 SP cases with 24,602 trajectory-level peer-reviewed rubrics. MedSP1000 converts peer-reviewed SP teaching cases into executable scenarios with defined SP case scripts, clinical environment contexts, and human-validated structured rubrics. In each simulation evaluation run, a clinical agent interacts in closed loop with a patient agent and an environment controller, and its behaviour is scored throughout the encounter against expert criteria specified in the original materials. Applying MedSP1000 to a range of general-purpose and medically specialized LLMs, we find that performance on static benchmarks does not reliably translate to such educational scenarios. The best-performing model, GPT-5.5, completes only 60.4% of expert-defined rubric items, whereas the strongest medically specialized model reaches 40.0%; increasing test-time compute produces no measurable gain. These results suggest that current LLMs, including agentic systems tuned for medicine, are not yet reliable enough to be safely integrated into actual clinical practice. More broadly, MedSP1000 shows how process-level, SP-style evaluation can reveal clinically relevant failure modes that single-turn benchmarks miss.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on evaluating LLMs in clinical decision-making using standardized patients (MedSP1000), lacking technical content on unification, tokenization, visual encoders, world models, or model-based RL algorithms. It involves LLMs (weak link to MLLM) and agent-environment loops (weak link to RL), but does not address the specific multimodal or architectural aspects implied by the keywords. No expert authors from the target list were found.
关键词
Large Language Models, Clinical Decision-Making, Standardized Patients, MedSP1000, Dynamic Evaluation, Closed Loop, Benchmark, Agent Evaluation
摘要翻译
大型语言模型(LLMs)正日益广泛应用于生成 Lean 形式化证明的工作流中。这些工作流通常将问题分解为更小的引理,采样大量证明尝试,并利用编译器反馈来引导搜索过程。然而,这些方法可能成本过高,往往在最终失败的尝试上消耗大量的计算资源。本文提出了一种动作路由代理(action routing agent)来解决这一问题,该代理由数据平面(data plane)和控制平面(control plane)组成。数据平面负责生成自然语言的引理分解,在 Lean 中将其形式化,并为生成的定理和引理目标采样证明尝试。控制平面观察先前失败的 Lean 尝试,估计成功的可能性以及另一次尝试的成本,并决定是继续证明当前目标还是从新的分解重新开始。在 PutnamBench 的一个子集上,我们的代理相较于固定步骤基线(fixed-step baseline)平均降低了 $25.8\%$ 的成本,在保持性能的同时显著减少了计算资源的消耗。这些结果表明,失败的 Lean 轨迹为基于代理的形式证明(agentic theorem proving)中成本感知的资源分配提供了可操作的信号。
Abstract
Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by $25.8\%$ over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
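The control-plane decision described above (continue on the current target or restart from a new decomposition) can be illustrated with a small sketch. The decision rule here, comparing estimated success probability per unit cost, is an assumption made for illustration; the abstract does not state the rule the agent actually uses, and `Estimate` and `route` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Estimate:
    """Hypothetical control-plane estimate for one candidate action."""
    p_success: float  # estimated probability that the action leads to a proof
    cost: float       # estimated cost of taking the action (e.g. tokens)

def route(continue_est: Estimate, restart_est: Estimate) -> str:
    """Pick the action with the higher estimated success per unit cost."""
    continue_value = continue_est.p_success / continue_est.cost
    restart_value = restart_est.p_success / restart_est.cost
    return "continue" if continue_value >= restart_value else "restart"

# Early on, another attempt at the current target still looks worthwhile.
early = route(Estimate(p_success=0.30, cost=1.0), Estimate(p_success=0.40, cost=2.0))

# After several failed Lean attempts, the estimate for the current target drops.
late = route(Estimate(p_success=0.05, cost=1.0), Estimate(p_success=0.40, cost=2.0))
```

The numbers are made up; the point is only that the same rule yields different actions as failed attempts lower the estimated chance of success on the current target.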
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 2.0/10 | 3.0 |
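评分理由: The paper studies cost-quality optimization for agentic theorem proving in Lean, centered on scheduling and decision-making in LLM workflows. The keyword set (visual encoders, multimodality, world models) mainly targets multimodal perception and representation learning and has no direct connection to the paper's text/code formal verification task. Although the control plane involves model-based decision-making (a weak conceptual overlap with model-based RL), the paper does not involve multimodal inputs, world-model construction, or unified model architectures, so overall relevance is very low.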
关键词
Agentic Theorem Provers, Lean, Cost-Quality Tradeoff, Action Routing Agent, Control Plane, Data Plane, Formal Proofs, LLMs
摘要翻译
采用梯度下降算法训练的径向基函数神经网络 (RBFN) 在浅层和深层网络中均能提供有效的全连接结构。误差修正 (ErrCor) 作为一种最先进的基于梯度的训练方法,通过选择最优隐藏单元来提高准确性。另一方面,作为一种基于种群的算法,粒子群优化算法 (PSO) 利用群体经验来优化 RBFN 参数,具备全局搜索能力以及对局部极小值的鲁棒性。自适应粒子群优化算法 (APSO) 已成为 PSO 的一种改进变体。该 APSO 算法通过在优化过程中动态调整群体参数来提高收敛速度。ErrCor 和 PSO 均表现出改进的结果和具有竞争力的收敛性。然而,面对大型数据集,这些方法面临着可扩展性挑战,例如过多的核计算和庞大的隐藏层结构。最近提出的多列 RBFN 方法 (MCRN) 通过在并行系统中部署小型 RBFN 来改进 ErrCor 的性能。受 MCRN 成功的启发,我们提出了两种旨在提升 PSO 性能的新方法:基于 PSO 的多列 RBFN (MC-PSO) 和基于 APSO 的多列 RBFN (MC-APSO)。这些方法引入了采用进化群体方法训练的并行 RBFN 结构。每个 RBFN 均使用 PSO 或 APSO 算法,在数据集的特定空间子集上进行独立训练。由此产生的专门化训练 RBFN 针对其各自的子集进行了定制。测试时,仅测试实例邻居所在区域对应的选定 RBFN 参与多列输出的生成。这种专门化提高了准确性,而并行性则提升了速度。我们在多种基准数据集上对所提出的方法进行了评估。MC-PSO 和 MC-APSO 在准确性和召回率方面均优于 ErrCor、PSO、APSO 和 MCRN。此外,它们在大多数实验中均表现出更快的训练和测试速度。
Abstract
The radial basis function neural network (RBFN) trained with a gradient descent algorithm provides an effective fully connected structure in both shallow and deep networks. Error correction (ErrCor), a state-of-the-art gradient-based training method, selects optimal hidden units to improve accuracy. Alternatively, as a population-based algorithm, particle swarm optimization (PSO) uses the swarm experience to optimize RBFN parameters, offering global search and robustness to local minima. Adaptive PSO (APSO) has emerged as an improved variant of PSO. The APSO algorithm improves convergence speed by dynamically adjusting swarm parameters during optimization. Both ErrCor and PSO demonstrate improved results and competitive convergence. However, with large datasets, these methods face scalability challenges such as excessive kernel computations and large hidden layer structures. A recent multi-column RBFN approach (MCRN) improves ErrCor performance by deploying small RBFNs in a parallel system. Inspired by MCRN's success, we propose two novel approaches to improve PSO performance: the multi-column RBFN with PSO (MC-PSO) and the multi-column RBFN with APSO (MC-APSO). These methods introduce parallel RBFN structures trained using evolutionary swarm methods. Each RBFN is independently trained on a specific spatial subset of the dataset using either the PSO or the APSO algorithm. The resulting specialist RBFNs are tailored to their respective subsets. During testing, only the RBFNs for the regions where the test instance's neighbors are located contribute to the multi-column output. This specialization improves accuracy, while parallelism enhances speed. We evaluate the proposed methods on various benchmark datasets. MC-PSO and MC-APSO outperform ErrCor, PSO, APSO, and MCRN in terms of accuracy and recall. They also demonstrate faster training and testing times in most experiments.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on classical machine learning optimization (RBFN with PSO), unrelated to LLMs, RL, or multimodal systems. 'Unify Models' has minimal relevance due to multi-column architecture, but domain mismatch is significant.
关键词
Radial Basis Function Neural Network, Particle Swarm Optimization, Adaptive PSO, Multi-Column RBFN, Scalability, Accuracy, Parallelism, Benchmark Datasets
摘要翻译
我们引入了 Graph Cascades,这是一种针对图神经网络(GNNs)和图变换器(GTs)的介观重连策略,旨在捕捉超越纯局部边或完全全局注意力的中间尺度图结构。基于传染性的扩散过程,Graph Cascades 可在 O(|V|+|E|) 时间内构建一个辅助图,其中由反复多跳强化支持的节点对将被升级为直接邻接关系。我们从理论上刻画了基于强化重连何时有效:包括基于强化边选择比直接邻接更具标签对齐性的充分条件、一个两跳强化完全同质性的 SBM(随机块模型)见证,以及通过图有效电阻形式化的介观连通性。实验上,在多个节点分类基准上,Graph Cascades 改进了多种 GNN 和稀疏 GT 骨干模型,其中在异质性图以及中等至高度数同质性图上观察到了最稳定的增益。这些理论条件还指出了介观重连不太可能有益的情形——低度数正则图以及存在结构瓶颈的图——且这些预测与观察到的失败结果相符。此外,我们还观察到重连图中的性能与结构属性之间存在紧密相关性。
Abstract
We introduce Graph Cascades, a mesoscopic rewiring strategy for Graph Neural Networks (GNNs) and Graph Transformers (GTs) that captures intermediate-scale graph structure beyond purely local edges or fully global attention. Using contagion-based diffusion processes, Graph Cascades constructs, in O(|V|+|E|) time, an auxiliary graph where node pairs supported by repeated multi-hop reinforcement are promoted to direct neighbors. We theoretically characterize when reinforcement-based rewiring helps: sufficient conditions under which reinforcement-based edge selection is more label-aligned than direct adjacency, an SBM witness in which two-hop reinforcement is perfectly homophilic, and a formalization of mesoscopic connectivity via graph effective resistance. Empirically, across node-classification benchmarks, Graph Cascades improves multiple GNN and sparse-GT backbones, with the most reliable gains observed on heterophilic and moderate- to high-degree homophilic graphs. The theoretical conditions also identify regimes where mesoscopic rewiring is unlikely to be beneficial -- low-degree regular graphs and graphs with structural bottlenecks -- and these predictions match the observed failures. We additionally observe tight correlations between performance and structural properties in the rewired graphs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 1.0/10 | 1.5 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on Graph Neural Networks and mesoscopic graph rewiring, which is fundamentally unrelated to the provided keywords concerning Multimodal LLMs, World Models, Tokenizers, Visual Encoders, or Reinforcement Learning. 'Unify Models' receives a minimal score as the method applies to multiple backbones, but it does not align with the background context of unified multimodal architectures.
关键词
Graph Cascades, Mesoscopic Rewiring, Graph Neural Networks, Graph Transformers, Contagion-based Diffusion, Node Classification, Heterophilic Graphs, Structure-Aware
摘要翻译
我们研究了多智能体深度强化学习,并针对多智能体深度确定性策略梯度(MADDPG)算法提出了两种改进。首先,我们引入了一种新颖的动作推断(Action Inference)机制,使每个智能体能够预测其他智能体的预期动作,从而提高其自身策略的准确性和稳定性。其次,我们在经验回放缓冲区中应用了一种基于几何分布的重要性采样策略,以优先处理更近期且更具信息量的经验,这有助于缓解多智能体环境中固有的非平稳性。我们在 PettingZoo 库提供的离散动作捕食者 - 猎物任务上评估了这两种修改,该库是一个用于通用多智能体强化学习基准的灵活 Python 接口。我们的结果表明,动作推断(Action Inference)在提高学习稳定性和智能体间合作方面是有效的,而使用几何分布的重要性采样相比标准 MADDPG 可以显著改善探索效率。代码可在 https://github.com/shaashwathsivakumar/MARL_Proj 获取。
Abstract
We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, thereby improving the accuracy and stability of its own policy. Second, we apply an importance sampling strategy, using a geometric distribution, in the replay buffer to prioritize more recent and informative experiences, which helps mitigate the non-stationarity inherent in multi-agent environments. We evaluate both modifications on the discrete-action Predator-Prey task provided by the PettingZoo library, a flexible Python interface for general multi-agent reinforcement learning benchmarks. Our results indicate that Action Inference is effective in improving learning stability and inter-agent cooperation and that importance sampling using a geometric distribution can lead to significant improvements in exploration efficiency over standard MADDPG. Code is available at https://github.com/shaashwathsivakumar/MARL_Proj
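One way to read the geometric replay sampling described above is that the age of each sampled transition (its distance from the newest entry) follows a geometric distribution, so recent experience is drawn more often. The sketch below is written under that assumption; the function names and the redraw-on-overflow rule are illustrative choices rather than the authors' implementation, and any importance weights used in the paper are omitted.

```python
import math
import random

def geometric_offset(p, rng):
    """Sample k in {0, 1, 2, ...} with P(k) = p * (1 - p)**k via inverse transform."""
    u = rng.random()  # u lies in [0, 1)
    return int(math.log(1.0 - u) / math.log(1.0 - p))

def sample_recent_biased(buffer, batch_size, p, rng):
    """Draw a batch whose ages (distance from the newest entry) follow a geometric law.

    Offsets that fall beyond the oldest entry are redrawn.
    """
    batch = []
    while len(batch) < batch_size:
        k = geometric_offset(p, rng)
        if k < len(buffer):
            batch.append(buffer[len(buffer) - 1 - k])
    return batch

rng = random.Random(0)
buffer = list(range(1000))  # the value doubles as insertion time; 999 is the newest
batch = sample_recent_biased(buffer, batch_size=256, p=0.01, rng=rng)
```

A larger `p` concentrates the batch on the newest transitions, while `p` close to 0 approaches uniform sampling over the buffer. The parameter must satisfy 0 < p < 1.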
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 1.0/10 | 1.5 |
评分理由: The paper focuses on enhancing the MADDPG algorithm for multi-agent reinforcement learning using action inference and importance sampling. It does not involve unifying models, tokenizers, visual encoders, world models, MLLMs, or multimodal architectures. Although it belongs to the RL domain, MADDPG is typically model-free, making the 'model-based RL' keyword only minimally relevant. No expert authors from the specified list are present.
关键词
Multi-Agent Reinforcement Learning, MADDPG, Action Inference, Importance Sampling, Policy Gradient, Non-stationarity, Predator-Prey
摘要翻译
混合专家模型(MoE)架构通过稀疏专家激活来扩展模型容量,但其部署仍受限于内存,因为所有专家权重都必须驻留在内存中。混合精度量化可以通过为不同专家分配不同的位宽来显著减少这种内存占用。然而,现有方法通常依赖校准数据来估计专家重要性并确定位宽分配。对于前沿 MoE 大语言模型而言,原始训练数据(因而也是真实的训练分布)是专有的且不可访问的。因此,校准集不可避免地是不完美的代理,这可能导致对专家利用率的误估,并导致次优的位宽分配。鉴于现代 MoE 模型中观察到的显著专家间质量差异,以及重尾自正则化(HT-SR)理论在无需访问训练或测试数据的情况下成功预测神经网络模型质量,我们提出了 AlphaQ,一种用于 MoE 量化的无校准位分配方法。AlphaQ 基于 HT-SR 理论并遵循一个简单的原则:具有更重尾权重谱的专家通常训练得更好,因此应分配更高的位宽,而重尾结构较弱的专家可以采用更激进的量化策略。AlphaQ 通过测量专家层面的谱重尾性并求解一个预算约束优化问题来实现这一原则,该问题在全局位预算约束下最小化总量化误差。在多个 MoE 模型上,AlphaQ 在相同的位预算下始终优于基于校准的基线方法。值得注意的是,在 Qwen1.5-MoE 上,AlphaQ 仅以平均 3.5 位的专家位宽就实现了接近全精度的准确率,同时实现了超过 4 倍的内存压缩。我们的代码可在 https://github.com/Superone77/AlphaQ 获取。
Abstract
Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are inevitably imperfect surrogates, and this can misestimate expert utilization and lead to suboptimal bit allocation. Motivated by the substantial cross-expert quality variability observed in modern MoE models, and by the success of Heavy-Tailed Self-Regularization (HT-SR) theory at predicting neural network model quality without access to training or testing data, we propose AlphaQ, a calibration-free bit-allocation method for MoE quantization. AlphaQ draws on HT-SR theory and follows a simple principle: experts with more heavy-tailed weight spectra are typically better trained and hence should receive higher bit-widths, while experts with weaker heavy-tailed structure can be quantized more aggressively. AlphaQ operationalizes this principle by measuring expert-wise spectral heavy-tailedness and solving a budget-constrained optimization problem that minimizes total quantization error under a global bit-budget constraint. Across several MoE models, AlphaQ consistently outperforms calibration-based baselines under matched bit budgets. Notably, on Qwen1.5-MoE, AlphaQ achieves near full-precision accuracy with an average expert precision of only 3.5 bits, while delivering more than 4$\times$ memory compression. Our code is available at https://github.com/Superone77/AlphaQ.
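The allocation principle stated in the abstract (heavier-tailed experts get more bits, under a global budget) can be sketched with a simple two-level scheme. This is a stand-in for the budget-constrained optimization the abstract mentions, not a reproduction of it. It assumes a heavy-tail exponent per expert is already available and that a smaller exponent means a heavier tail; `allocate_bits` and the example values are hypothetical.

```python
def allocate_bits(alphas, avg_bits, low=3, high=4):
    """Give `high` bits to the heaviest-tailed experts and `low` bits to the rest.

    `alphas[i]` is a heavy-tail exponent for expert i; smaller is taken to mean
    a heavier tail (an assumption of this sketch). The number of high-precision
    experts is the largest count whose average bit-width stays within `avg_bits`.
    """
    n = len(alphas)
    n_high = int((avg_bits - low) * n / (high - low))
    n_high = max(0, min(n, n_high))
    ranked = sorted(range(n), key=lambda i: alphas[i])  # heaviest tail first
    bits = [low] * n
    for i in ranked[:n_high]:
        bits[i] = high
    return bits

alphas = [2.1, 3.8, 2.6, 4.5]  # illustrative values, not measured from any model
bits = allocate_bits(alphas, avg_bits=3.5)
```

With a 3.5-bit average budget and 3-bit and 4-bit options, half of the experts receive the higher precision, chosen by rank of the exponent.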
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
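评分理由: The paper studies bit allocation for quantizing Mixture-of-Experts (MoE) models, achieving calibration-free quantization based on Heavy-Tailed Self-Regularization theory. Although it involves LLMs (a weak link to the MLLM keyword), its core is model compression and efficiency; it does not address model unification, tokenizer design, visual encoders, world models, multimodal fusion, or model-based reinforcement learning, so its relevance to the given keywords is very low.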
关键词
Mixture-of-Experts, Quantization, Bit Allocation, Calibration-Free, Heavy-Tailed Self-Regularization, Model Compression, Memory Footprint
摘要翻译
CARE-link 是一个开源的、基于网络的临床支持平台,旨在通过基于大语言模型(LLM)驱动的工作流连接临床医生和患者,以改善妊娠期糖尿病的管理。该系统聚合院外患者生成的数据,总结相关临床信息,并向临床医生提供情境感知的决策支持。对于患者,CARE-link 提供管理方案的清晰解释,并通过 WhatsApp 界面提供及时的生活方式指导。集成的双向设计旨在促进持续监测,支持个性化护理,并减少门诊随访的负担。采用模块化架构构建,该平台可适应其他需要纵向追踪和行为支持的慢性病。CARE-link 有望加强临床监管,促进患者依从性,并加强照护连续性,特别是在资源受限环境中。
Abstract
CARE-link is an open-source, web-based clinical support platform designed to improve the management of gestational diabetes by linking clinicians and patients through an LLM-mediated workflow. The system aggregates patient-generated data outside the hospital, summarizes relevant clinical information, and delivers context-aware decision support to clinicians. For patients, CARE-link provides clear explanations of management plans and delivers timely lifestyle guidance through a WhatsApp interface. The integrated dual-facing design aims to promote continuous monitoring, support individualized care, and reduce the burden of in-clinic follow-ups. Built with a modular architecture, the platform can be adapted to other chronic conditions requiring longitudinal tracking and behavioral support. CARE-link has the potential to enhance clinical oversight, promote patient compliance, and strengthen continuity of care, particularly in resource-constrained settings.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 1.0/10 | 1.5 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper describes a healthcare application system (CARE-link) for diabetes management using LLMs, focusing on software architecture and clinical workflow. It does not address the technical model architectures implied by the keywords such as World Models, Visual Encoders, Tokenizers, or Model-Based Reinforcement Learning. While it utilizes an LLM (loosely related to MLLM), the core contribution is not model research, resulting in minimal relevance to the provided technical keywords.
关键词
Clinical Assistant, Remote Engagement, Electronic Health Records, Diabetes Management, LLM-mediated workflow, Web-based platform, Decision support, Patient-clinician linkage
摘要翻译
大型语言模型(LLMs)在 CLadder 等因果推理基准上达到 50% 至 70% 的准确率,但目前尚不清楚这反映的是结构推理还是词汇模式匹配。我们引入了 Caliper,这是一种可控扰动方法,它在保留每个问题的因果图和概率规范的同时,将语义变量名替换为占位符。在九个参数量介于 38 亿至 6710 亿之间的指令微调大型语言模型(LLMs)及三个因果推理基准上,词汇匿名化在本地 38 亿至 140 亿参数集上产生了稳健的准确率下降,分别为 +7.6、+27.0 和 +11.1 个百分点;而在跨越 2024 至 2026 代际的九个前沿模型上,CRASS 和 e-CARE 基准上的下降幅度分别增至 +29.6 和 +18.0 个百分点。在 40 个涉及的模型 - 基准单元中,有 39 个显示出正向差距,且在 CLadder 的伪词子集上,该差距缩小为原来的 1/17。结构化支架和少样本上下文学习各自缩小了该差距,但主要是通过降低较小模型上的 P0 准确率,而非恢复 P1 准确率。当前指令微调的大型语言模型(LLMs)在零样本评估下,一旦移除词汇锚点,便几乎没有显示出结构因果推理的证据。
Abstract
Large language models reach 50 to 70% accuracy on causal reasoning benchmarks such as CLadder, but it is unclear whether this reflects structural reasoning or lexical pattern matching. We introduce Caliper, a controlled perturbation that replaces semantic variable names with placeholder tokens while preserving the causal graph and probabilistic specification of each question. Across nine instruction-tuned LLMs from 3.8B to 671B and three causal reasoning benchmarks, lexical anonymization yields robust accuracy drops of +7.6, +27.0, and +11.1 pp on a local 3.8B-14B set, rising to +29.6 and +18.0 pp on CRASS and e-CARE across nine frontier models spanning the 2024-2026 generations. Of 40 engaged model-by-benchmark cells, 39 show a positive gap, and the gap collapses by 17x on CLadder's pseudoword subset. Structured scaffolding and few-shot in-context learning each narrow the gap, but mainly by lowering P0 accuracy on smaller models rather than recovering P1. Current instruction-tuned LLMs, evaluated zero-shot, show little evidence of structural causal reasoning once lexical anchors are removed.
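The perturbation described above, replacing semantic variable names with placeholder tokens while leaving the rest of the question untouched, can be sketched as a plain text substitution. The placeholder format (`X1`, `X2`, ...), the function name `anonymize`, and the sample sentence are assumptions for illustration; the actual Caliper tooling is not shown in this report.

```python
import re

def anonymize(question, variables):
    """Replace each semantic variable name with a placeholder such as X1, X2, ...

    `variables` lists the names in the order they should be numbered.
    Returns the rewritten text and the name-to-placeholder mapping.
    """
    mapping = {name: f"X{i}" for i, name in enumerate(variables, start=1)}
    # Longest names first so that a short name never clobbers part of a longer one.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], question), mapping

original = "smoking causes tar deposit, and tar deposit causes lung cancer."
anonymized, mapping = anonymize(original, ["smoking", "tar deposit", "lung cancer"])
# anonymized == "X1 causes X2, and X2 causes X3."
```

The causal chain (X1 to X2 to X3) is unchanged by the substitution; only the lexical cues that a model could pattern-match on are removed.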
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 1.0/10 | 1.5 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
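评分理由: The paper addresses interpretability and causal reasoning in large language models, analyzing model behavior through lexical perturbation, and is largely unrelated to the multimodal, world-model, unified-architecture, and reinforcement-learning directions in the keyword set. Only 'Tokenizer' has very low relevance (1.0) because the method manipulates placeholder tokens; all other keywords score 0. The author list contains no specified experts, so no bonus is added. The weighted total is 1.5, far below the dynamic passing score of 27.8.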
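The weighted total cited in this rationale follows from the score table: each row's score is its weight multiplied by its relevance, and the total is the sum over rows. The sketch below reproduces that arithmetic for this paper. The comparison `total >= PASSING_SCORE` is an assumption about how the passing score is applied, and how the dynamic threshold of 27.8 is derived is not stated in the report.

```python
WEIGHT = 1.5          # every keyword in the table carries the same weight
PASSING_SCORE = 27.8  # dynamic passing score quoted in the rationale

# Relevance values (out of 10) copied from the table for this paper.
relevance = {
    "Unify Models": 0.0,
    "Tokenizer": 1.0,
    "Visual Encoder": 0.0,
    "World Models": 0.0,
    "MLLM": 0.0,
    "MultiModal": 0.0,
    "model-based RL": 0.0,
}

scores = {keyword: WEIGHT * value for keyword, value in relevance.items()}
total = sum(scores.values())
passes = total >= PASSING_SCORE  # assumed pass rule
```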
关键词
Large language models, Causal reasoning, Lexical anchors, Causal structure, Placeholder tokens, Instruction-tuned, Perturbation, Accuracy drops
摘要翻译
传统的生成对抗网络(GAN)在单图像超分辨率(SISR)任务中常面临幻觉伪影的困扰,主要是因为标准判别器评估的是整体图像的自然性,而非严格的条件真实性。为此,我们提出 MaCo-GAN,这是一种新颖的流形对比 GAN 框架,它用监督对比目标取代了传统的对抗损失。该方法的核心组件是一个动态假样本合成器,它将真实样本(GT)转化为一系列具有挑战性且感知上合理的假图像,这些图像严格保持低分辨率(LR)的一致性。利用这些合成样本,我们建立了一个稳健的对比极小极大博弈:生成器被训练将其预测吸引向流形内假样本(低失真),并排斥流形外假样本(高失真),而判别器则优化完全相反的目标。通过将基线超分辨率模型的对抗损失简单地替换为我们提出的目标,我们在各种基准上均实现了感知 - 失真权衡的一致改进。广泛的消融研究验证了该框架的有效性,并为这种条件对比博弈的动态机制提供了深入见解。
Abstract
Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standard discriminators evaluate overall image naturalness rather than strict conditional realism. To address this, we propose MaCo-GAN, a novel manifold-contrastive GAN framework that replaces the conventional adversarial loss with a supervised contrastive objective. A core component of our method is a dynamic fake sample synthesizer that transforms ground truth (GT) data into a spectrum of challenging, perceptually plausible fake images that strictly maintain low-resolution (LR) correspondence. Utilizing these synthesized samples, we establish a robust contrastive minimax game: the generator is trained to attract its predictions toward on-manifold fakes (low distortion) and repel them from off-manifold fakes (high distortion), while the discriminator optimizes the exact opposite. By simply replacing the adversarial loss of a baseline SR model with our proposed objective, we demonstrate consistent improvements in the perception-distortion trade-off across various benchmarks. Extensive ablation studies validate the effectiveness of our framework and provide deep insights into the dynamics of this conditional contrastive game.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 1.0/10 | 1.5 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
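评分理由: The paper focuses on improving generative adversarial networks (GANs) for single image super-resolution (SISR), using manifold-contrastive learning to address hallucinated artifacts. The provided keywords mainly concern multimodal large language models (MLLM), world models, reinforcement learning, and model unification. The paper's content is largely unrelated to them: it does not involve text tokenizers (Tokenizer), multimodal alignment (MultiModal/MLLM), reinforcement learning (model-based RL), or world models (World Models). Although it involves image processing, it is not a study of multimodal visual encoders (Visual Encoder) or model unification (Unify Models), so relevance is very low; Visual Encoder receives a minimal score only because image network architectures are involved.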
关键词
Single Image Super-Resolution, Generative Adversarial Networks, Manifold-Contrastive Learning, Perception-Distortion Trade-off, Fake Sample Synthesizer, Conditional Realism, GAN Framework
摘要翻译
近年来,基于声明式交互协议的多智能体系统在建模与实现方面取得了重大进展。我们的贡献——Strabo,确立了这些进展与当前业界在 Agentic AI(智能体 AI)方面努力的相关性。具体来说,我们考察了 UCP(通用商务协议),这是谷歌最近牵头的一项旨在为 AI 智能体标准化电商交互的努力。本研究分为两部分。首先,我们将 UCP 中涉及结账的部分建模为声明式 Langshaw 协议,并使用 Peach(Langshaw 的一种编程模型)来实现智能体。这一部分凸显了形式化、声明式规范的优势。其次,我们展示了 Peach 智能体可以与谷歌实现的 UCP 智能体进行互操作,从而确立了我们的方法相对于 UCP 的保真度。这种互操作使得声明式协议和智能体能够以增量方式引入现有环境,表明了一种 EMAS 理念影响实践的路径,而无需进行全面更新。
Abstract
The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper addresses formal specification of agentic interaction protocols using Strabo and Langshaw, focusing on interoperability with industry standards (UCP). The provided keywords relate to deep learning components (Tokenizer, Visual Encoder), multimodal models (MLLM, MultiModal), and reinforcement learning paradigms (World Models, model-based RL). There is no intersection in technical domain or methodology, hence all scores are 0.
关键词
Strabo, Declarative Specification, Agentic Interaction Protocols, Multiagent Systems, Universal Commerce Protocol, Langshaw Protocol, Interoperability
摘要翻译
当 AI 代理调用 API 并遭遇验证错误时,它所需的不仅仅是错误原因——它需要知道下一步该做什么。自反思 API (self-reflective API) 在验证失败时返回一个机器可读的 recovery_feedback.suggestions[] 载荷,足以使代理能够修复请求并重试,而无需外部推理。在经泄漏审计的试点实验中(每组 N=30,3 个 LLMs,10 个对抗性任务),相较于纯英文诊断,结构化建议将 Anthropic 模型的任务完成率提高了 +36.7--40.0 个百分点(Fisher 精确检验 p ≤ 0.0022),且每成功 token 效率提升了 1.8--2.2 倍。在 gpt-4o-mini 上提升不显著(p=0.435);在计费 API 上的第二领域复现确认了这一模式。只有在审计了 LLMs 基准测试中两类未记录的答案泄露后,该比较才成立。我们发布了 audit_prompt_leakage.py 作为可重用的持续集成 (CI) 基础设施。代码和数据:https://github.com/arquicanedo/self-reflective-apis.
Abstract
When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery\_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot ($N{=}30$ per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by $+36.7$--$40.0$pp over plain-English diagnoses on Anthropic models (Fisher's exact $p \le 0.0022$), at $1.8$--$2.2\times$ better per-success token efficiency. The lift is not significant on gpt-4o-mini ($p{=}0.435$); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit\_prompt\_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
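The repair-and-retry loop described in the abstract can be sketched as follows. Only the `recovery_feedback.suggestions[]` path comes from the abstract; the endpoint `validate_invoice`, the suggestion fields (`op`, `field`, `value`), and the default currency are hypothetical names chosen for illustration, and the real payload schema in the authors' repository may differ.

```python
from typing import Any


def validate_invoice(request: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical endpoint: on validation failure, return structured fixes."""
    suggestions = []
    if "currency" not in request:
        # Assumed default; a real API would derive this from its own rules.
        suggestions.append({"op": "set", "field": "currency", "value": "USD"})
    amount = request.get("amount")
    if isinstance(amount, str):
        # The amount arrived as text; suggest the numeric form instead.
        suggestions.append({"op": "set", "field": "amount", "value": float(amount)})
    if suggestions:
        return {
            "status": 422,
            "error": "validation_failed",
            "recovery_feedback": {"suggestions": suggestions},
        }
    return {"status": 200}


def call_with_recovery(request: dict[str, Any], max_retries: int = 2):
    """Apply each machine-readable suggestion mechanically, then retry."""
    attempts = 0
    response = validate_invoice(request)
    while response["status"] != 200 and attempts < max_retries:
        for suggestion in response["recovery_feedback"]["suggestions"]:
            if suggestion["op"] == "set":
                request = {**request, suggestion["field"]: suggestion["value"]}
        attempts += 1
        response = validate_invoice(request)
    return request, response, attempts


repaired, final_response, retries = call_with_recovery({"amount": "12.50"})
```

The point of the design is that the client needs no reasoning step between the failure and the retry: every suggestion is an operation it can apply directly.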
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on API design for AI agent error recovery (structured feedback vs. natural language) and auditing prompt leakage. It does not discuss model unification, tokenization mechanisms, visual encoders, world model learning, multimodal architectures, or model-based reinforcement learning. Thus, relevance to all specific technical keywords defined in the scoring criteria is negligible.
关键词
Self-Reflective APIs, AI Agent Recovery, Structured Feedback, Validation Error, Token Efficiency, Prompt Leakage, API Recovery
摘要翻译
多方对话是研究协作推理与决策的关键场景,然而现有数据集很少聚焦于结构化、深入的复杂推理任务。我们引入 DeliChess(一种新颖的群体审议对话数据集),在该数据集中,参与者协作解决多选式国际象棋谜题。每个小组首先独立完成谜题,随后进行多方讨论,最后提交修订后的集体答案。该数据集包含 107 条对话,附有完整转录文本、讨论前后的选择项,以及关于谜题难度和走棋质量的元数据。我们基于棋引擎评估的三个指标来评估性能,发现审议过程显著提高了群体的准确率。我们进一步利用基于先前审议数据训练的 classifier,分析探查性话语(即引发提议、论证或战略反思的消息)的作用。尽管探查性话语使得讨论后群体表现的变异性增加,但它并不总是能带来更好的表现。我们的数据集为建模群体推理、对话动态以及解决明确战略领域中不同观点和意见的分歧提供了丰富的 testbed。
Abstract
Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. Each group first completes the puzzle individually, then engages in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
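评分理由: The paper's main contribution is a multi-party dialogue dataset (DeliChess) for studying collaborative reasoning and dialogue dynamics in chess puzzle solving. The provided keywords cover multimodal large-model architecture (Unify Models, Tokenizer, Visual Encoder, MLLM, MultiModal), world models (World Models), and reinforcement learning (model-based RL), none of which appear in the paper's abstract or title. The paper focuses on dialogue analysis rather than model architecture or reinforcement learning, so the relevance of every keyword is 0.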
关键词
Multi-party Dialogue, Chess Puzzle Solving, Collaborative Reasoning, Group Deliberation, Dialogue Dataset, Strategic Domain, Probing Utterances, Collective Decision Making
摘要翻译
尽管学界普遍认为 AI 生成文本带来广泛的社会风险,但在 AI 生成文本检测领域,对于何为有害使用尚未达成共识。相反,现有数据集与方法往往自行定义标准并做出假设,有时隐含于此,且通常与现实世界的需求和应用关联松散。为填补这一空白,我们在此系统性地定义了各类 AI 生成文本的概念及其特征。为此,我们收集了 AITDNA——一个新的基准数据集,包含人机协作构建的文本,并标注了详细的起源信息,包括完整的编辑历史和 AI 交互历史。我们评估了多种机器生成文本检测器,发现它们往往仅在特定概念下表现良好,却难以充当通用检测器。我们公开发布了代码和数据。
Abstract
Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated text detection literature on what constitutes harmful use. Rather, existing datasets and approaches often define their own criteria and make their own assumptions, sometimes implicitly, and often only loosely related to real-world needs and applications. To address this gap, we here systematically define various notions of AI-generated text and their characteristics. To study these, we collect AITDNA - a new benchmark of human-machine co-constructed texts that is annotated with detailed genesis information, such as the entire edit and AI-interaction history. We benchmark various machine-generated text detectors and find that they often only perform well for specific notions but not as broad detectors. We release code and data publicly.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
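评分理由: The paper focuses on AI-generated text detection and on a benchmark built under realistic assumptions (AITDNA), which consists of human-machine co-constructed texts. The provided keywords concern multimodal large models, world models, visual encoders, and model-based reinforcement learning. The paper's content (text detection, NLP) has no technical overlap with the keywords (multimodal architectures, world modeling, reinforcement learning), so every keyword scores 0. None of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list.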
关键词
AI-generated text detection, human-machine co-constructed texts, AITDNA benchmark, realistic assumptions, machine-generated text detectors, genesis information, edit history
摘要翻译
基于表达性逻辑的证明助手在证明搜索方面的自动化能力受限,导致基于证明助手的形式化验证成本上升。我们通过引入 Isabelle/HOL 的 Abduction Prover 来解决这一问题。面对具有挑战性的证明目标,Abduction Prover 通过溯因推理识别有用的猜想,从而为该目标构建证明脚本。
Abstract
Proof assistants based on expressive logics suffer limited automation for proof search, raising the cost of formal verification based on proof assistants. We address this problem by introducing the Abduction Prover for Isabelle/HOL. Given a challenging proof goal, the Abduction Prover constructs a proof script for the goal by identifying useful conjectures using abductive reasoning.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on formal verification and automated theorem proving using abductive reasoning within the Isabelle/HOL framework. The provided keywords pertain to multimodal large language models, world models, and reinforcement learning, which are unrelated to the logical and formal methods domain of this work. Hence, all relevance scores are 0. None of the listed expert authors are present in the author list.
关键词
Abduction Prover, Isabelle/HOL, Proof Assistants, Proof Search, Abductive Reasoning, Formal Verification, Automated Theorem Proving
摘要翻译
随着 Replika 和 Character.AI 等 AI 伴侣平台的迅速扩张,关于不安全人机交互的担忧日益加剧。本研究介绍了 AICompanionBench,据我们所知,这是首个公开的、包含细粒度安全风险类别标注的人机伴侣对话基准数据集。该数据集包含从 Reddit 收集的 2,123 个真实世界的 Replika 对话,并通过人机协作标注了九个类别:性行为、反社会行为、身体攻击、言语攻击、物质滥用、自残与自杀、控制、操纵以及无害。基于此基准,我们在 LLM 作为裁判框架下评估了 20 个最先进的开源和闭源大语言模型(LLMs),以检测不安全交互。结果显示模型性能存在显著差异,尽管较强的模型实现了高整体准确率,但在操纵等细微类别上仍存在困难,且会将良性对话错误地识别为有害内容。我们的发现表明,尽管当前大语言模型(LLMs)能有效检测显式有害内容,但在识别隐式不安全交互方面仍存在局限。总体而言,本研究为 AI 伴侣安全研究贡献了一个新的基准数据集,并提供了利用大语言模型(LLMs)监控 AI 伴侣系统的见解。该数据集公开可用,地址为:https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
Abstract
As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on safety benchmarking for AI companions using LLMs-as-judges, which does not involve model unification, tokenizers, visual encoders, world models, multimodal architectures (MLLM/MultiModal), or model-based reinforcement learning. Thus, all keyword scores are 0.
关键词
AICompanionBench, LLMs-as-Judges, AI Companion Safety, Human-AI Interaction, Safety Risk Categories, Benchmark Dataset, Unsafe Interactions
摘要翻译
库普曼(Koopman)理论将非线性动力学转化为线性谱问题。然而,在计算中,一切均取决于一个困难的有限维选择:可观函数(observables)必须具有表达能力,在动力学作用下几乎不变,且理想情况下与复合运算兼容。深库普曼(Deep Koopman)方法学习灵活的坐标,而保结构方法则在固定字典上强制算子恒等式。我们通过引入深嵌入乘性动态模态分解(DeepMDMD)来结合这些思想,该方法学习一个潜在空间及其划分,同时将库普曼乘积法则(Koopman product rule)作为精确代数约束强制执行。训练过程在精确的乘性算子更新与促进库普曼闭包(Koopman closure)的可微潜在聚类步骤之间交替进行。最终得到的是一个在学习到的潜在单元(latent cells)上的有限转移映射。其非零谱位于单位圆上,其字典由动力学塑造而非环境几何塑造,预测先在潜在坐标中进行,随后解码至物理空间。在哈密顿(Hamiltonian)、混沌及流体示例中,DeepMDMD 学习到的字典比几何 MDMD 划分产生的字典更为紧凑且动力学相干性更好。该方法减少了谱污染,揭示了更丰富的连续谱结构,并在强噪声下仍能给出稳定预测。在高维流动中,包括 158,624 维圆柱尾流和噪声雷诺数 Re=20,000 的顶盖驱动腔(lid-driven cavity),它在状态空间 MDMD 失效之处仍能保持相干结构及长时间谱统计特性。这些结果暗示了库普曼学习的一个实用规则:学习坐标,约束代数。
Abstract
Koopman theory turns nonlinear dynamics into a linear spectral problem. In computation, however, everything depends on a hard finite-dimensional choice: the observables must be expressive, nearly invariant under the dynamics, and, ideally, compatible with composition. Deep Koopman methods learn flexible coordinates, whereas structure-preserving methods enforce operator identities on fixed dictionaries. We combine these ideas by introducing Deep Embedded Multiplicative Dynamic Mode Decomposition (DeepMDMD), a method that learns a latent space and a partition of it, while enforcing the Koopman product rule as an exact algebraic constraint. Training alternates between an exact multiplicative operator update and a differentiable latent-clustering step that promotes Koopman closure. The result is a finite transition map on learned latent cells. Its nonzero spectrum lies on the unit circle, its dictionary is shaped by the dynamics rather than by ambient geometry, and forecasts are made in latent coordinates before being decoded to physical space. Across Hamiltonian, chaotic, and fluid examples, DeepMDMD learns dictionaries that are far more compact and dynamically coherent than those produced by geometric MDMD partitions. It reduces spectral pollution, reveals richer continuous-spectrum structure, and gives stable forecasts under severe noise. In high-dimensional flows, including a 158,624-dimensional cylinder wake and a noisy $Re=20,000$ lid-driven cavity, it preserves coherent structures and long-time spectral statistics where state-space MDMD fails. These results suggest a practical rule for Koopman learning: learn the coordinates, constrain the algebra.
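Two claims in the abstract, the exact product rule and a nonzero spectrum on the unit circle, both follow from having a finite transition map on cells. The sketch below illustrates this on a hand-written map; the map `T` and the observables `f` and `g` are made-up examples, and nothing here reproduces DeepMDMD's learned latent partition or its training procedure.

```python
def koopman(f, T):
    """Koopman operator of a finite map: (K f)(i) = f(T(i))."""
    return [f[T[i]] for i in range(len(T))]


def cycle_lengths(T):
    """Lengths of the cycles of a map on the cells {0, ..., n-1}.

    The nonzero eigenvalues of the Koopman matrix of T are exactly the
    m-th roots of unity for each cycle length m, so they lie on the unit
    circle; transient cells contribute only zero eigenvalues.
    """
    lengths, seen = [], set()
    for start in range(len(T)):
        cell = start
        for _ in range(len(T)):  # n steps are enough to reach a cycle
            cell = T[cell]
        if cell in seen:
            continue
        length, current = 0, cell
        while True:
            seen.add(current)
            current = T[current]
            length += 1
            if current == cell:
                break
        lengths.append(length)
    return sorted(lengths)


# Hypothetical transition map: cells 0 -> 1 -> 2 -> 0 form a cycle,
# cells 3 and 4 are transient.
T = [1, 2, 0, 0, 3]
f = [1.0, 2.0, 3.0, 4.0, 5.0]
g = [2.0, 0.0, 1.0, 3.0, 1.0]

product = [a * b for a, b in zip(f, g)]
lhs = koopman(product, T)                                  # K(f * g)
rhs = [a * b for a, b in zip(koopman(f, T), koopman(g, T))]  # (K f) * (K g)
```

Here `lhs` and `rhs` agree because composition with a map always commutes with pointwise multiplication, which is why a transition-map representation satisfies the product rule by construction.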
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on Koopman operator theory and Dynamic Mode Decomposition for dynamical systems (fluids, Hamiltonian), whereas the provided keywords pertain to Multimodal Large Language Models (MLLM), Tokenization, and Reinforcement Learning. There is no overlap in methodology, domain, or terminology between the paper's content and the evaluation keywords.
关键词
Koopman theory, Dynamic Mode Decomposition, Algebraic constraints, Latent space, Fluid dynamics, Hamiltonian systems, Spectral analysis, Deep learning
摘要翻译
保持数据隐私是结构化数据管理和数据挖掘领域的一个重要议题。然而,分布式因果结构学习中的隐私泄露问题是一个持续的挑战,尤其是在涉及数据传输和计算的场景中。本文提出了一种基于全同态加密(FHE)的方法,该方法在密文上执行计算,确保数据在传输和计算过程中始终处于加密状态。然而,由于计算成本高昂以及 FHE 对除法和有限支持的对数运算支持有限,将 FHE 应用于因果结构学习面临挑战。为应对这一挑战,我们提出了一系列新颖的技术,包括:(i)电路简化以提高效率;(ii)通过牛顿 - 拉夫逊倒数法和泰勒展开近似除法和对数运算;以及(iii)一种带有 SIMD 加速的批处理技术,以增强整个学习过程。此外,该方法具有良好的可移植性,易于扩展至 FHE 之外以支持差分隐私。实证结果表明,我们的方法在测试数据集上实现了高一致性且因果结构与明文版本相当。最后,该方法高效且实用,即使在 FHE 的隐私保护下,也能在数十分钟内完成因果结构的学习。
Abstract
Preserving data privacy is an important topic in structural data management and data mining. However, privacy leakage in distributed causal structure learning is a persistent challenge, especially when data transmission and computation are required. In this paper, we propose a method based on fully homomorphic encryption (FHE) that performs calculations on ciphertexts, keeping data encrypted both in transit and during computation. Nevertheless, applying FHE to causal structure learning is challenging because of the high computation cost and the limited support for division and logarithm operations in FHE. To tackle this challenge, we propose a series of novel techniques including (i) circuit simplification for better efficiency, (ii) approximation of division and logarithm through the Newton-Raphson reciprocal and Taylor expansion, and (iii) a batching technique with SIMD acceleration to speed up the whole learning process. Additionally, our method can be easily extended beyond FHE, as we demonstrate by porting it to support differential privacy. Empirical results show that, on the datasets tested, our method achieves high consistency with the plaintext version and recovers comparable causal structures. Finally, our method is efficient and practical: it learns causal structures in tens of minutes even under the privacy protection of FHE.
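Technique (ii) relies on the fact that both approximations need only additions and multiplications, which are the operations FHE schemes support natively. The following is a plaintext sketch of that arithmetic, not the paper's encrypted circuit; the iteration counts, the starting guess, and the function names are assumptions made for illustration.

```python
import math


def reciprocal_newton(a, x0=1.0, iterations=6):
    """Approximate 1/a with the update x <- x * (2 - a * x).

    Uses only multiplication and subtraction. The iteration converges
    when 0 < a * x0 < 2, so inputs must be scaled into that range first.
    """
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - a * x)
    return x


def log_taylor(y, terms=8):
    """Approximate ln(y) for y near 1 via ln(1+u) = u - u^2/2 + u^3/3 - ...

    The coefficients are plaintext constants, so under encryption each
    term costs one ciphertext multiplication and one scaled addition.
    """
    u = y - 1.0
    power = u
    total = 0.0
    for n in range(1, terms + 1):
        coefficient = (1.0 if n % 2 == 1 else -1.0) / n
        total += coefficient * power
        power *= u
    return total


reciprocal_error = abs(reciprocal_newton(0.5) - 2.0)
log_error = abs(log_taylor(1.2) - math.log(1.2))
```

The Newton-Raphson error squares at every step, so a handful of iterations suffices once the input is in range, whereas the Taylor series converges only for inputs within distance 1 of 1 and slows down towards the edges of that interval. Both facts explain why input scaling matters in practice.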
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on causal structure learning using Fully Homomorphic Encryption (FHE) for privacy preservation, which is unrelated to the provided keywords concerning Multimodal Large Language Models (MLLM), World Models, Tokenizers, Visual Encoders, and Model-Based Reinforcement Learning. None of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the author list.
关键词
Fully Homomorphic Encryption, Causal Structure Learning, Data Privacy, Distributed Learning, Circuit Simplification, Approximation Techniques, SIMD-acceleration
摘要翻译
南希·格雷斯·罗曼太空望远镜(Roman)计划于 2026 年 9 月或更早发射,将开展具有前所未有的空间分辨率和采样频率的宽视场红外成像巡天,从而能够发现数百万个天文瞬变源。因此,有必要建立自动生成警报的自动化处理流程,以便望远镜在发射后不久即可开始发现可靠的瞬变源和变源。然而,目前尚无真实的罗曼望远镜数据,这使得此类处理流程的开发变得困难。在这项工作中,我们提出了一种机器学习模型 RuBR 以及一种通用方法,用于在 RAPID 处理流程中区分真实的瞬变源和变源探测与虚假(伪造)探测。具体而言,我们提出了基于此方法的三种模型:RuBR_comb,其在组合的本地注入瞬变源和 OpenUniverse2024 瞬变源上进行训练和测试;RuBR_loc,其在本地注入瞬变源上训练并在 OpenUniverse2024 瞬变源上测试;以及 RuBR_DA,其在域适应模式下将本地注入瞬变源与一部分 OpenUniverse2024 瞬变源结合用于训练。这为在罗曼任务早期阶段缺乏真实标签的情况下,将 RuBR_comb 模型适应到真实观测的策略铺平了道路。尽管图像差分处理流程仍在不断改进,但我们的实验结果表明了所提出方法的有效性,以及其在罗曼时代实现稳健真实 - 虚假分类的前景。
Abstract
The Nancy Grace Roman Space Telescope (Roman), set for launch as early as September 2026, will conduct wide-field infrared imaging surveys with unprecedented spatial resolution and cadence, enabling the discovery of millions of astronomical transients. Hence, it is necessary to have automated pipelines for generating alerts in place so that the telescope can begin discovering reliable transients and variable objects soon after it is launched. However, no real Roman data currently exist, making the development of such pipelines difficult. In this work, we present a machine learning model $RuBR$ and a general methodology for distinguishing genuine transient and variable detections from spurious (bogus) detections within the RAPID pipeline. In particular, we present three models using this methodology: $RuBR_{comb}$ trained and tested on combined locally injected and OpenUniverse2024 transients, $RuBR_{loc}$ trained on locally injected transients and tested on OpenUniverse2024 transients, and $RuBR_{DA}$ that combines locally injected transients with a fraction of OpenUniverse2024 transients in domain-adaptation mode for training. This paves the way for strategies to adapt the $RuBR_{comb}$ model to real observations in the absence of any ground-truth labels during the early phases of the Roman mission. While the image differencing pipeline continues to be improved, our experimental results demonstrate the effectiveness of the proposed approach and its promise for robust real-bogus classification in the Roman era.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
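评分理由: The paper focuses on astronomical transient detection, training a machine learning model (RuBR) on simulated data to separate real from bogus detections, and belongs to astronomical image processing. The provided keywords (such as MLLM, World Models, Tokenizer, model-based RL) all belong to multimodal large models and reinforcement learning and have no direct connection to the paper's research content, methodology, or terminology, so every keyword receives a relevance score of 0. None of the specified experts (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) appear in the author list, so no bonus points are added.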
关键词
Roman Space Telescope, RAPID pipeline, Machine learning, Transient detection, Real-bogus classification, Simulated data, Domain adaptation
摘要翻译
准确的 T 细胞受体 (TCR) 抗原特异性计算预测将变革 T 细胞生物学的研究并实现可扩展的免疫工程,然而现有模型缺乏足够的灵敏度和特异性以适用于广泛应用。主要限制在于缺乏严格定义的、未见过的基准数据集,这些数据集允许对模型性能和泛化能力进行无偏评估。在此,我们描述了满足这一标准的两类互补数据集,并论证它们既为模型评估提供了稳健框架,也为下一代 TCR-抗原预测算法开发奠定了基础。
Abstract
Accurate computational prediction of T cell receptor (TCR) antigen specificity would transform the study of T cell biology and enable scalable immune engineering, yet existing models lack sufficient sensitivity and specificity for broad applications. A major limitation is the absence of rigorously defined, unseen benchmark datasets that allow unbiased evaluation of model performance and generalizability. Here, we describe two complementary classes of datasets that meet this criterion and argue that they provide both a robust framework for model assessment and a foundation for next-generation TCR-antigen prediction algorithm development.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on computational immunology and TCR-antigen prediction benchmarks, whereas the keywords pertain to Multimodal Large Language Models, World Models, and Reinforcement Learning. There is no thematic or methodological overlap, resulting in zero relevance for all keywords. Additionally, none of the specified expert authors (Yang Shi, Xuanyu Zhu, Yuhao Dong, Saining Xie, Manyuan Zhang) are present in the paper's author list.
关键词
TCR Antigenic Epitope, Prediction Models, Generalization Power, Benchmarking, Unseen Datasets, Immune Engineering, Computational Prediction
摘要翻译
系统生成的日志构成了安全监控的基础,然而其基于模板的刚性格式既阻碍了自动化分析,也妨碍了人工理解。我们提出 NLLog(自然语言日志),这是一种轻量级管道,它确定性地将解析后的模板重写为 WHO-WHAT-SEVERITY 句子,采用词频 - 逆文档频率(TF-IDF)加权进行聚合,利用树集成对会话进行分类,并使用 TreeSHAP 回溯证据以供分析师审查。在 Hadoop 分布式文件系统(HDFS)和 Blue Gene/L(BGL)语料库上,NLLog 超过了两个复现的匹配协议基线;在 HDFS、BGL 和 AIT 警报数据集上,它保持低误报率,且具有适合安全运营中心(SOC)分诊的通用硬件延迟。覆盖率、稀疏与密集编码对比、忠实性以及对抗性消融实验表明,回退充分性依赖于语料库,注册时的覆盖率检查可在部署前揭示细化需求,而可审计的确定性重写结合轻量级密集编码为日志异常检测和分诊提供了一个可衡量的表示层。
Abstract
System-generated logs underpin security monitoring, yet their rigid template-based format hinders both automated analysis and human comprehension. We present NLLog (Natural-Language Log), a lightweight pipeline that deterministically rewrites parsed templates into WHO-WHAT-SEVERITY sentences, pools them with term-frequency-inverse-document-frequency weighting, classifies sessions with tree ensembles, and back-projects evidence with TreeSHAP for analyst review. On Hadoop Distributed File System (HDFS) and Blue Gene/L (BGL) corpora, NLLog exceeds two reproduced matched-protocol baselines; across HDFS, BGL, and the AIT Alert Data Set, it sustains low false-positive rates with commodity-hardware latency suitable for security operations center triage. Coverage, sparse-versus-dense, faithfulness, and adversarial ablations show that fallback sufficiency is corpus-dependent, that an enrollment-time coverage check can surface refinement requirements before deployment, and that an auditable deterministic rewrite combined with lightweight dense encoding provides a measurable representation layer for log-anomaly detection and triage.
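The first two stages of the pipeline, deterministic rewriting and TF-IDF weighting, can be sketched with the standard library alone. The rewrite table, the template identifiers, the fallback sentence, and the sparse bag-of-words representation are all assumptions for illustration; the abstract also mentions dense encoding, tree ensembles, and TreeSHAP, which are left out here.

```python
import math
from collections import Counter

# Hypothetical rewrite table: parsed template id -> (who, what, severity).
REWRITES = {
    "E1": ("DataNode", "received a block", "info"),
    "E2": ("NameNode", "failed to replicate a block", "error"),
}
# Hypothetical fallback for templates without a rewrite rule.
FALLBACK = ("unknown component", "emitted an unrecognised event", "unknown")


def rewrite(template_id):
    """Deterministically map a template id to a WHO-WHAT-SEVERITY sentence."""
    who, what, severity = REWRITES.get(template_id, FALLBACK)
    return f"{who} {what} with severity {severity}"


def tfidf_session_vectors(sessions):
    """Return one sparse TF-IDF vector (term -> weight) per session."""
    documents = [
        Counter(" ".join(rewrite(t) for t in session).lower().split())
        for session in sessions
    ]
    n = len(documents)
    document_frequency = Counter(term for doc in documents for term in doc)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n) / (1 + df)) + 1.0
        for term, df in document_frequency.items()
    }
    return [
        {term: (count / sum(doc.values())) * idf[term] for term, count in doc.items()}
        for doc in documents
    ]


sessions = [["E1", "E1"], ["E1", "E2"], ["E1"]]
vectors = tfidf_session_vectors(sessions)
```

In the second session the rare term `failed` outweighs the ubiquitous term `received`, which is the effect the weighting is meant to have before a classifier sees the session.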
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
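评分理由: The paper focuses on log anomaly detection and natural-language rewriting, using ensemble learning and TF-IDF. The provided keywords concern multimodal large models, world models, and reinforcement learning, which have no direct connection to the paper's content. The relevance of every keyword is therefore 0.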
关键词
NLLog, Log Anomaly Detection, Natural Language Rewriting, Tree Ensembles, Explainability, SOC, TF-IDF
摘要翻译
一致性 (Consistency) 是动态次模最大化 (submodular maximization) 中的一个重要属性,它意味着始终维持一个近似最优解 (near-optimal solution),并在每一步中对解仅进行少量调整。先前工作已针对仅插入 (insertion-only) 的情况探索了这一问题,其中算法面对的是 $n$ 个插入流,并已建立了该问题基数约束 (cardinality-constrained) 版本的上下界。我们在完全动态设置 (fully dynamic setting) 中考虑这一问题,其中操作流可能同时包含插入 (insertions) 和删除 (deletions)。我们开发了一个用于在此设置下设计算法的通用框架 (general framework),并将其实例化以获得具有次线性一致性 (sublinear consistency) 的首个常数因子近似 (constant-factor approximations)。对于基数约束 (cardinality constraints),我们提出一个 $\frac 12 - O(\varepsilon)$ 近似,该近似是 $O\left(\frac{1}{\varepsilon^2}\right)$ 一致的。对于秩 -k matroid 约束 (rank-$k$ matroid constraints),我们构建了一个针对动态最优解 (dynamic optimum) 的 $\frac 14 - O(\varepsilon)$ 近似,该近似是 $O\left(\frac{\log k}{\varepsilon^2}\right)$ 一致的。
Abstract
Consistency is an important property in dynamic submodular maximization and entails maintaining a near-optimal solution at all times, making only a small number of adjustments to the solution in each step. Prior work has explored this question for the insertion-only case, where the algorithm faces a stream of $n$ insertions, and has established lower and upper bounds for the cardinality-constrained version of the problem. We consider this question in the fully dynamic setting, where the stream of operations may contain both insertions and deletions. We develop a general framework for designing algorithms for this setting, and instantiate it to obtain the first constant-factor approximations with sublinear consistency. For cardinality constraints, we propose a $\frac 12 - O(\varepsilon)$ approximation that is $O\left(\frac{1}{\varepsilon^2}\right)$ consistent. For rank-$k$ matroid constraints, we construct a $\frac 14 - O(\varepsilon)$ approximation to the dynamic optimum that is $O\left(\frac{\log k}{\varepsilon^2}\right)$ consistent.
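To make the notion of consistency concrete, the sketch below recomputes a solution from scratch after a deletion and counts how many elements changed. It uses the classical greedy algorithm on a coverage function, which is a standard baseline and not the framework proposed in the paper; the instance and the helper names are made up for illustration.

```python
def coverage(chosen, sets):
    """Monotone submodular objective: number of items covered."""
    covered = set()
    for name in chosen:
        covered |= sets[name]
    return len(covered)


def greedy(sets, k):
    """Classical greedy for a cardinality constraint of size k."""
    chosen = []
    for _ in range(min(k, len(sets))):
        # Sorting makes tie-breaking deterministic.
        candidates = sorted(name for name in sets if name not in chosen)
        best = max(candidates, key=lambda name: coverage(chosen + [name], sets))
        chosen.append(best)
    return set(chosen)


def changes(before, after):
    """Number of elements added to or removed from the solution."""
    return len(before ^ after)


# Hypothetical instance with cardinality constraint k = 2.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
solution = greedy(sets, 2)

# Deleting element "c" forces the solution to change.
remaining = {name: items for name, items in sets.items() if name != "c"}
updated = greedy(remaining, 2)
adjustments = changes(solution, updated)
```

Recomputing from scratch can in general replace the whole solution after a single update, so its per-step change is bounded only by the solution size. A consistent algorithm instead bounds the number of adjustments per update while keeping the approximation guarantee.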
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
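评分理由: The paper belongs to combinatorial optimization and dynamic algorithms (submodular maximization), whereas the scoring keywords concern multimodal large models, world models, and reinforcement learning. The two research areas are entirely different, with no technical or thematic connection, so every keyword receives a relevance score of 0.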
关键词
Submodular Maximization, Dynamic Setting, Consistency, Approximation, Matroid Constraints, Insertions and Deletions, General Framework
摘要翻译
Mean-based 算法(基于均值算法)是一类在线学习算法,它们给平均奖励较低的动作分配较低的概率。近期研究表明,这些算法倾向于收敛到 serially undominated 动作(序列上不被支配的动作),这些动作在经济游戏中近似于 Nash equilibria(纳什均衡)。然而,实证研究也显示,在 bandit-feedback(多臂老虎机反馈)场景下,它们的收敛速度相比已有算法较慢。我们在 time horizon(时间范围)未知且仅可获得 bandit-feedback(多臂老虎机反馈)的情况下研究 Mean-based 算法。在此设置下,我们提供了算法定义序列 $\gamma_t$ 的下界,正式确立了这些算法学习速度的上限。此外,我们提出了两种 Mean-based 算法:一种推广了 $\epsilon$-greedy,另一种将基于均值的 Exp3 扩展到了未知 time horizon。我们的实验表明,Mean-based 算法虽然稍慢,但能与其他 bandit-feedback 算法竞争性地表现。我们进一步分析了与 no-regret(无遗憾)算法的关系。根据 $\gamma_t$ 的选择,与 no-regret 算法的交集是非平凡的,我们证明了存在既是 Mean-based 又是 no-regret 的算法。这为先前贡献所暗示的此类算法的"exploitability"(可被利用性)增添了背景。
Abstract
Mean-based algorithms are a class of online learning algorithms that assign low probability to actions with low average rewards. Recent work indicates these algorithms converge favorably to serially undominated actions, which approximate Nash equilibria in economic games. However, empirical studies also show slower convergence compared to established algorithms in bandit-feedback scenarios. We study mean-based algorithms when the time horizon is unknown and only bandit feedback is available. In this setting, we provide the first lower bound on the algorithm-defining sequence $\gamma_t$ that formally establishes a limit on how fast these algorithms can learn. Additionally, we propose two mean-based algorithms: one generalizes $\epsilon$-greedy, and the other extends the mean-based Exp3 to unknown horizons. Our experiments show that mean-based algorithms, although slightly slower, can perform competitively with other bandit-feedback algorithms. We further analyze the relationship to no-regret algorithms. Depending on the choice of $\gamma_t$, the intersection with no-regret algorithms is non-trivial, and we show that algorithms exist that are both mean-based and no-regret. This adds context to the "exploitability" of this class of algorithms that previous contributions suggest.
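For orientation, here is a textbook ε-greedy learner under bandit feedback with a decaying exploration rate, so that it does not need to know the horizon in advance. It is not the generalization proposed in the paper; the schedule `t ** (-1/3)`, the Bernoulli arms, and the function name are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(arm_means, steps, seed=0):
    """Play `steps` rounds and return how often each arm was pulled.

    Only the reward of the pulled arm is observed (bandit feedback).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    reward_sums = [0.0] * k
    for t in range(1, steps + 1):
        # Hypothetical schedule: explore less as time goes on.
        epsilon = min(1.0, t ** (-1.0 / 3.0))
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(k)  # explore uniformly at random
        else:
            # Exploit: pick the arm with the highest average reward so far.
            arm = max(range(k), key=lambda i: reward_sums[i] / counts[i])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        reward_sums[arm] += reward
    return counts


pull_counts = epsilon_greedy([0.1, 0.5, 0.9], steps=5000)
```

Arms with a low average reward are played only during exploration rounds, whose probability shrinks over time. This matches the abstract's informal description of mean-based behaviour: low probability for actions with low average rewards.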
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on theoretical analysis of mean-based algorithms in online learning and bandit settings, specifically regarding regret bounds and convergence. It does not address multimodal architectures, tokenizers, visual encoders, world models, MLLMs, or model-based reinforcement learning as defined in the keyword set. Thus, there is no topical overlap with the provided keywords which target multimodal and generative model research.
关键词
Mean-based algorithms, Online learning, Bandit feedback, Regret bounds, Lower bound, No-regret algorithms, Serially undominated actions
摘要翻译
工人效用未被直接观测到——仅观测到其结果。每次零工交易产生一个比特:接受或拒绝。我们认为这种结构直接指向普雷萨克滞回模型(Preisach hysteresis model),作为潜在工人偏好的自然表示。普雷萨克算子(Preisach operator)将聚合输出建模为二元阈值元素群体的积分,这恰好对应于异质工人各自拥有私人接受工资时所产生的结构。我们通过双输出神经网络(共享层 256->128,边缘损失强制 U_1 >= U_0)估计两个潜在效用曲面:接受效用 U_1(X) 和拒绝效用 U_0(X)。分类简化为普雷萨克间隙 U_1(X) - U_0(X),该间隙连同经截断稳定的价格 - 阈值编码一同输入 XGBoost 分类器。在 36,891 笔零工交易上,该管道实现 Jaccard = 0.827 和 ROC AUC = 0.799。价格 - 阈值编码相比原始效用特征使 AUC 提升了 +11.0 个百分点。该模型证实了滞回理论预测的方向不对称性:价格下降对完成率的抑制作用大于同等幅度价格上涨的提升作用。应用于全数据集时,该模型的推荐方案同时使工资总额减少 21.3%,并使预期填充率提高 9.7 个百分点。对于 74.2% 的交易,接受概率 P(accept) 已超过 0.80;降低工资使其保持在阈值以上(削减后平均 P = 0.972),释放成本节约(中位数 31%)。对于剩余的 25.4%,中位数 7% 的工资增长可恢复 +43 个百分点的接受率。一个没有明确无差异区的模型无法同时执行这两种操作。
Abstract
Worker utility is not observed -- only its consequence is. Each gig transaction produces a single bit: accepted or rejected. We argue this structure points directly to the Preisach hysteresis model as the natural representation of latent worker preferences. The Preisach operator models aggregate output as an integral over a population of binary threshold elements -- precisely the structure that emerges when heterogeneous workers each carry a private acceptance wage. We estimate two latent utility surfaces: acceptance utility U_1(X) and rejection utility U_0(X), via a dual-output neural network (shared layers 256->128, margin loss enforcing U_1 >= U_0). Classification reduces to the Preisach gap U_1(X) - U_0(X), passed into an XGBoost classifier alongside clip-stabilised price-to-threshold encodings. On 36,891 gig transactions, this pipeline achieves Jaccard = 0.827 and ROC AUC = 0.799. The price-to-threshold encoding accounts for +11.0 pp AUC over raw utility features. The model confirms the directional asymmetry hysteresis predicts: price decreases depress completion rates more than equivalent increases raise them. Applied to the full dataset, the model's recommendations simultaneously reduce the total wage bill by 21.3% and increase expected fill rate by 9.7 pp. For 74.2% of transactions, P(accept) already exceeds 0.80; reducing the wage keeps it above threshold (mean post-cut P = 0.972), releasing cost savings (median 31%). For the remaining 25.4%, a median 7% wage increase recovers +43 pp acceptance. A model without an explicit indifference zone cannot execute both moves simultaneously.
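The discrete analogue of the Preisach operator is an average over binary relays, each with its own switching thresholds. The sketch below shows why such a population produces hysteresis: the same offered wage yields different acceptance rates depending on whether it was reached from below or from above. The thresholds and the three-worker population are made up, and this toy does not reproduce the paper's neural utility surfaces or its XGBoost stage.

```python
class Relay:
    """Binary threshold element: switches on at `up`, off at `down`."""

    def __init__(self, down, up):
        assert down <= up
        self.down = down
        self.up = up
        self.state = 0  # 0 = reject, 1 = accept

    def update(self, wage):
        if wage >= self.up:
            self.state = 1
        elif wage <= self.down:
            self.state = 0
        # Between the thresholds the relay keeps its previous state.
        return self.state


def acceptance_rate(relays, wage):
    """Discrete Preisach output: the mean state over the population."""
    return sum(relay.update(wage) for relay in relays) / len(relays)


# Hypothetical workers with private (down, up) wage thresholds.
workers = [Relay(8, 10), Relay(10, 12), Relay(12, 14)]

rising = acceptance_rate(workers, 11)    # wage 11 reached from below
peak = acceptance_rate(workers, 15)      # every worker accepts
falling = acceptance_rate(workers, 11)   # wage 11 reached from above
```

The band between `down` and `up` acts as an indifference zone: inside it a worker's decision depends on history rather than on the current wage alone.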
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper addresses labor economics and utility modeling using Preisach hysteresis and neural networks for gig market transactions. The provided keywords specifically target Multimodal Large Language Models (MLLM), Tokenizers, Visual Encoders, World Models, and Model-Based Reinforcement Learning. There is no thematic or technical overlap between the economic modeling approach described and the AI/ML architecture components listed in the keywords, resulting in zero relevance for all terms.
关键词
Worker Utility, Hysteresis Model, Preisach Model, Gig Labour Markets, Dual-output Neural Network, Transaction Acceptance, Wage Optimization, Latent Utility Surfaces
摘要翻译
自动机器学习(AutoML)中的大规模超参数优化(HPO)消耗大量计算资源,引发了对其可扩展性和能效日益增长的担忧。现有方法启发式地利用先验信息来加速黑盒及多保真度设置,但它们缺乏对先验信息如何定量降低样本复杂度的刻画。我们通过固定预算最佳臂识别(fixed-budget best-arm identification)的正式视角,首次提出了带有先验的多保真超参数优化(HPO)的分布依赖样本复杂度界。通过将先验直接建模为臂均值(即配置性能),我们推导出明确的、分布依赖的误差界,量化了先验与评估预算之间的关系。我们的分析表明,将概率质量集中在近优臂上的信息丰富的先验,可减少所需的评估次数;而无信息或误导性先验则会恢复基线性能。我们在合成基准与 LCBench(一种常见的深度学习多保真超参数优化基准)上进行了概念验证实验,以验证我们的理论结果,实现了高达 90% 的预算缩减,同时保持了求解质量。综上所述,我们的结果为基于先验引导且计算高效的绿色 AutoML 提供了理论基础。
Abstract
Large-scale hyperparameter optimization (HPO) in automated machine learning (AutoML) consumes substantial computational resources, raising growing concerns about scalability and energy efficiency. Existing methods use prior information heuristically to accelerate both black-box and multi-fidelity settings, but they lack a characterization of how prior informativeness quantitatively reduces sample complexity. In this work, we provide the first distribution-dependent sample complexity bounds for multi-fidelity HPO with priors through the formal lens of fixed-budget best-arm identification. By modeling priors directly over arm means as configuration performance, we derive explicit, distribution-dependent error bounds that quantify the relationship between priors and evaluation budget. Our analysis shows that informative priors, which concentrate probability mass on near-optimal arms, yield reductions in the number of required evaluations, whereas baseline performance is recovered with uninformative or misleading priors. We conduct proof-of-concept experiments on a synthetic benchmark and on LCBench, a common multi-fidelity HPO benchmark for deep learning, to confirm our theoretical results, achieving up to 90% budget reduction while retaining solution quality. Together, our results provide a principled foundation for prior-guided and compute-efficient green AutoML.
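To make the fixed-budget best-arm identification setting concrete, the sketch below uses sequential halving, a standard fixed-budget algorithm, and not necessarily the procedure analysed in the paper. The prior is modelled in the crudest possible way, as an optional shortlist of arms (`prior_keep`) that are evaluated while all others are dropped up front; that shortlist, the Gaussian noise model, and every name in the code are assumptions for illustration.

```python
import math
import random


def sequential_halving(means, budget, noise=0.1, prior_keep=None, seed=0):
    """Return the index of the arm that survives sequential halving.

    means      : true mean performance of each arm (configuration)
    budget     : total number of noisy evaluations allowed
    prior_keep : optional shortlist of arm indices favoured by a prior (assumed)
    """
    rng = random.Random(seed)
    arms = list(prior_keep) if prior_keep is not None else list(range(len(means)))
    rounds = max(1, math.ceil(math.log2(len(arms))))
    per_round = budget // rounds  # split the budget evenly across rounds
    while len(arms) > 1:
        pulls = max(1, per_round // len(arms))
        estimate = {}
        for arm in arms:
            samples = [means[arm] + rng.gauss(0.0, noise) for _ in range(pulls)]
            estimate[arm] = sum(samples) / pulls
        # Keep the better half of the arms (rounded up).
        arms = sorted(arms, key=lambda a: estimate[a], reverse=True)
        arms = arms[: math.ceil(len(arms) / 2)]
    return arms[0]


true_means = [0.1, 0.5, 0.9, 0.3]
best_no_prior = sequential_halving(true_means, budget=40, noise=0.0)
best_with_prior = sequential_halving(true_means, budget=40, noise=0.0, prior_keep=[2, 3])
```

With a shortlist of two arms the same budget is spread over fewer candidates, so each surviving arm receives more evaluations. A shortlist that excludes the best arm would of course make the procedure fail, which is one reason a hard filter is only a rough stand-in for a probabilistic prior.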
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
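评分理由: The paper focuses on a theoretical analysis of hyperparameter optimization (HPO) in automated machine learning (AutoML), specifically how prior information affects sample complexity. The given keyword set (Unify Models, Tokenizer, Visual Encoder, World Models, MLLM, MultiModal, model-based RL) belongs to the areas of multimodal large models, world models, and reinforcement learning architectures. The two have no technical overlap, so all keywords score 0.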
关键词
Hyperparameter Optimization, AutoML, Sample Complexity, Prior Information, Multi-fidelity, Best-Arm Identification, Green AutoML, LCBench
摘要翻译
常见的英语动词(如 'have' 和 'make')既可以在轻动词结构中充当搭配词,也可以充当完整的词汇谓词,例如 'make a decision' 与 'make a cake'。语言模型是否表征这一区分尚不清楚。我们引入了一个大规模受控数据集,由最小变化的英语句子序列组成,其中相同的上下文包含同一动词的轻动词用法和完整动词用法。两个探测实验表明,语言模型即使在最小上下文中也能区分这些用法,并在不同宾语类型上表现出可分离的模式。我们发布该数据集、生成代码和相关材料,作为可重用资源。该框架支持扩展到更广泛的上下文、其他动词和其他语言。
Abstract
Frequent English verbs such as 'have' and 'make' can function either as collocates in light-verb constructions or as full lexical predicates, as in 'make a decision' vs. 'make a cake'. Whether language models represent this distinction remains unclear. We introduce a large-scale controlled dataset of minimally varying English sentence series in which the same context contains the same verb in light-verb and full-verb uses. Two probing experiments show that language models differentiate between these uses even in minimal contexts and exhibit separable patterns across object types. We release the dataset, generation code, and materials as a reusable resource. The framework supports extensions to broader contexts, additional verbs, and other languages.
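A minimal-pair series of the kind described here can be produced from a fixed sentence frame. The sketch below is only an illustration of that idea: the frame, the function name `make_minimal_pairs`, and the subject are assumptions, and the only examples taken from the abstract are 'make a decision' and 'make a cake'.

```python
def make_minimal_pairs(subject, verb, light_objects, full_objects):
    """Pair every light-verb object with every full-verb object in one frame.

    Each pair differs only in the object, so the verb and context stay fixed.
    """
    frame = "{subject} {verb} {obj}."
    pairs = []
    for light in light_objects:
        for full in full_objects:
            pairs.append({
                "light": frame.format(subject=subject, verb=verb, obj=light),
                "full": frame.format(subject=subject, verb=verb, obj=full),
            })
    return pairs


pairs = make_minimal_pairs("They", "make", ["a decision"], ["a cake"])
```

Keeping the subject and verb constant means that any difference a probe detects between the two sentences can be attributed to the object and the reading it triggers.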
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on linguistic probing of light-verb vs. full-verb distinctions in text-based language models using minimal-pair datasets. It does not address model unification, tokenization architecture, visual encoders, world modeling, multimodal learning (MLLM/MultiModal), or model-based reinforcement learning, resulting in zero relevance for all specified technical keywords.
关键词
Light-Verb Constructions, Full-Verb Predicates, Minimal-Pair Dataset, Probing Experiments, Language Models, Phraseological Competence, English Verbs, Collocates
摘要翻译
自洽性通过采样多个推理路径并选择出现频率最高的答案来改进大语言模型,但多数投票往往无法恢复样本中已存在的正确答案。我们通过排名改进自洽性(RISC)来解决这一局限性,该方法将自洽性中的答案选择重新表述为排序问题。与依赖单一不确定性或置信度信号不同,RISC 使用轻量级的 LambdaRank 模型,通过五个精心设计的特征对候选答案进行评分,这些特征捕捉了答案频率、语义中心性和推理轨迹一致性。我们在三个数据集上,在不同的测试时预算下评估了 RISC。在所有数据集上,RISC 始终比标准自洽性和强基线实现更好的准确率与效率的权衡,尤其在问答基准上取得了显著增益。进一步分析表明,所提出的特征各自有用,更重要的是它们相互补充,凸显了在测试时答案选择中学习结合多个信息信号的价值。
Abstract
Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority voting often fails to recover correct answers that are already present among the samples. We address this limitation with Ranking-Improved Self-Consistency (RISC), which reformulates answer selection in self-consistency as a ranking problem. Instead of relying on a single uncertainty or confidence signal, RISC uses a lightweight LambdaRank model to score candidate answers with five carefully designed features that capture answer frequency, semantic centrality, and reasoning-trace consistency. We evaluate RISC on three datasets under a range of test-time budgets. Across datasets, RISC consistently achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the proposed features are individually useful and, more importantly, complementary, highlighting the value of learning to combine multiple informative signals for test-time answer selection.
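The difference between majority voting and feature-based ranking can be sketched as follows. This is not RISC itself: a hand-weighted linear scorer stands in for the LambdaRank model, only three illustrative feature names are used rather than the paper's five features, and all feature values and weights are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Standard self-consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


def rank_candidates(answers, extra_features, weights):
    """Rank distinct answers best-first by a weighted sum of their features.

    The linear scorer is a stand-in for a learned ranker (assumption).
    """
    counts = Counter(answers)
    total = len(answers)
    scored = []
    for answer, n in counts.items():
        feats = {"frequency": n / total, **extra_features.get(answer, {})}
        score = sum(weights.get(name, 0.0) * value for name, value in feats.items())
        scored.append((score, answer))
    scored.sort(reverse=True)
    return [answer for _, answer in scored]


# Hypothetical samples: "12" is more frequent, "15" has stronger other signals.
samples = ["12", "12", "15", "12", "15"]
extra = {
    "12": {"centrality": 0.2, "trace_consistency": 0.1},
    "15": {"centrality": 0.9, "trace_consistency": 0.8},
}
weights = {"frequency": 1.0, "centrality": 1.0, "trace_consistency": 1.0}

voted = majority_vote(samples)
ranked = rank_candidates(samples, extra, weights)
```

In this toy case the vote picks "12" while the ranker puts "15" first, which illustrates how a minority answer already present among the samples can be recovered when frequency is combined with other signals.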
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on improving Large Language Model reasoning through ranking-based self-consistency for question answering. It does not involve multimodal components (Visual Encoder, MultiModal, MLLM), world modeling, reinforcement learning, tokenization strategies, or model unification architectures, resulting in zero relevance for all provided keywords. Additionally, none of the listed expert authors (Yang Shi, Xuanyu Zhu, etc.) are present in the author list.
关键词
Self-Consistency, Large Language Models, Ranking, Question Answering, LambdaRank, Test-time Inference, Answer Selection, Reasoning Traces
摘要翻译
LLMs 在风险决策 (risk decision-making) 任务中可能表现得谨慎,但看似谨慎的输出并不一定意味着与人类决策机制对齐。我们利用圣彼得堡游戏 (St. Petersburg game) 作为受控测试平台来探究这一区别,这是一个经典悖论:其期望收益为无穷大,但人类通常报告的支付意愿却较低且有限。我们使用一套结构化提示词套件评估了 28 个 LLMs,该套件包括:原始游戏;控制决策变体(扰动截断、重复游玩、数值禀赋和职业身份);要求模型以人类决策者视角推理的人类视角提示词;以及基础模型与其指令微调 (instruction tuning) 版本之间的配对比较。在原始游戏中,大多数模型生成有限的出价,呈现出类似人类的风险行为外观。然而,这种结果层面的相似性掩盖了实质性的机制层面 (mechanism-level) 差异。控制变体揭示,模型往往并未维持原始游戏中观察到的人类行为,而是转向了条件性和计算理性行为。人类线索提示和指令微调通常能降低出价并减少某些可见的异常行为,但大多数机制层面的响应模式基本保持不变。这些发现表明,风险决策中的行为对齐可能仅停留在表面级别:LLMs 可能产生类似人类的风险决策,却未展现出与人类一致的决策机制。因此,对 LLM 决策的高风险评估应超越结果相似性,转而考察这种对齐是否得到了机制层面一致性的支持。
Abstract
LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
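The paradox and the effect of truncation can be checked with a few lines of exact arithmetic. The sketch assumes the textbook form of the game (a fair coin is tossed until the first head, and a first head on toss k pays 2^k) and assumes that a truncated game pays nothing if no head appears within the cap; the paper's own truncation variant may be defined differently.

```python
from fractions import Fraction


def truncated_expected_payoff(max_tosses):
    """Expected payoff when the game is capped at max_tosses tosses.

    A first head on toss k has probability 2**-k and pays 2**k, so every
    term contributes exactly 1. No head within the cap pays 0 (assumption).
    """
    return sum(Fraction(1, 2**k) * 2**k for k in range(1, max_tosses + 1))


ev_10 = truncated_expected_payoff(10)
ev_40 = truncated_expected_payoff(40)
```

Because each toss adds exactly 1 to the expectation, the capped value grows linearly with the cap and diverges without one. A computationally rational bidder would therefore bid roughly the cap length in a truncated game, whereas the human bids the abstract refers to stay low.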
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on LLM risk decision-making and alignment using the St. Petersburg game paradigm, evaluating outcome-level vs. mechanism-level alignment. It does not address model unification, tokenization, visual encoders, world models (in the RL/generative sense), multi-modal architectures, or reinforcement learning. Consequently, there is no substantive technical overlap with the provided keyword set.
关键词
LLM Risk Decisions, St. Petersburg Game, Outcome-Level Resemblance, Mechanism-Level Alignment, Instruction Tuning, Human Decision Making, Behavioral Alignment, Large Language Models
摘要翻译
大型语言模型(LLMs)作为写作工具的广泛使用挑战了众包数据的有效性,因为众包工作者可能会将任务外包给模型。为了更好地理解这一问题是如何被解决的,我们调查了 155 名自然语言处理(NLP)及相关领域的研究人员,了解他们通过众包收集自由文本回复的经验与观点。本文概述了研究人员面临的挑战、缓解策略以及对数据质量的预期影响。44% 的受访者报告称在他们收集的众包数据中观察到了 LLM 的使用情况。尽管 93% 的受访者对此有所预料,但有一半人不确定应采取何种预防措施。最普遍的检测策略是独特的文本风格模式和异常快速的完成时间。总体而言,调查结果表明,研究社区意识到了这一问题并正在采取措施,但现有的努力仍不足以完全解决它。最后,我们提出了一系列考虑因素,以指导 LLM 时代未来众包自由文本数据的收集。
Abstract
The widespread use of Large Language Models (LLMs) as writing tools challenges the validity of crowdsourced data, as crowdworkers may outsource tasks to models. To better understand how this is addressed, we surveyed 155 researchers in NLP and related disciplines about their experiences and opinions on collecting free-text responses via crowdsourcing. This paper provides an overview of practitioners' challenges, mitigation strategies, and the foreseen implications on data quality. 44% of respondents reported observing LLM usage in their crowdsourced data. While 93% of them had anticipated this, half were unsure what precautions to take. The most prevalent detection strategies are distinctive textual style patterns and unusually fast completion times. Overall, survey responses show that the research community is aware of the problem and taking measures, but existing efforts remain insufficient to fully address it. Finally, we derive a set of considerations to guide future crowdsourced free-text data collection in the era of LLMs.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper investigates crowdsourcing data quality in the LLM era, while the provided keywords relate to multimodal model architectures, tokenization, visual encoders, world models, and reinforcement learning, none of which are discussed in the study.
关键词
Crowdsourcing, LLM Era, Human Data Collection, Data Quality, Textual Style Patterns, Community Survey, Mitigation Strategies
摘要翻译
新兴市场的电商平台往往使用发展不完善的产品目录,这些目录仅包含类别分类法 (category taxonomies),却缺乏结构化属性模式 (structured attribute schemas)。这种细粒度产品属性的缺失限制了搜索能力——阻碍了分面过滤 (faceted filtering),降低了查询理解,并削弱了搜索系统所使用的语义表示 (semantic representations)。本文提出了 BEATS,一种人机回环 (human-in-the-loop) LLM (大语言模型) 框架,旨在从零开始自举 (bootstrapping) 产品属性分类法。我们的方法扩展了一个多阶段 LLM 生成流水线,增加了两个关键生产阶段:(1) 由模型开发者进行主动质量检查,以过滤错误输出;(2) 由领域专家本地员工进行人工标注,以验证生成的属性。该框架迭代运行——每个生成阶段的提示词均基于质量检查观察和标注者反馈,在连续轮次中不断精炼,逐步提升属性质量。一旦属性分类法建立,我们便利用 LLM 对单个产品项执行结构化属性标记,从而丰富其上下文表示。丰富的目录直接惠及搜索系统的多个组件:支持基于属性的细粒度过滤,为排序模型提供结构化特征,并改进稠密检索 (dense retrieval) 的语义表示。我们通过使用属性丰富的产品数据训练稠密检索模型来验证生成的分类法,结果表明该方法在使用原始目录信息的基线 (baselines) 之上实现了持续的性能提升。该系统已在 Rakuten Taiwan (乐天台湾) 部署,丰富了涵盖 2,694 个子类的大类共 9 个,生成了 67,277 个属性,超过 540 万个产品已被标记上生成的属性,并计划进一步丰富整个产品目录。
Abstract
E-commerce platforms in emerging markets often operate with underdeveloped product catalogs that contain only category taxonomies but lack structured attribute schemas. This absence of fine-grained product attributes limits search capabilities -- preventing faceted filtering, degrading query understanding, and weakening semantic representations used by search systems. We present BEATS, a human-in-the-loop LLM framework for bootstrapping product attribute taxonomies entirely from scratch. Our approach extends a multi-stage LLM generation pipeline with two critical production stages: (1) proactive quality checking by model developers to filter erroneous outputs, and (2) human annotation by domain-expert local staff to validate generated attributes. The framework operates iteratively -- prompts at each generation stage are refined based on quality check observations and annotator feedback across successive rounds, progressively improving attribute quality. Once the attribute taxonomy is established, we employ LLMs to perform structured attribute tagging on individual product items, enriching their contextual representations. The enriched catalog directly benefits multiple components of the search system: enabling granular attribute-based filtering, providing structured features for ranking models, and improving semantic representations for dense retrieval. We validate the generated taxonomy by training dense retrieval models on attribute-enriched product data, demonstrating consistent improvements over baselines using original catalog information. Our system has been deployed at Rakuten Taiwan, enriching 9 major categories spanning 2,694 sub-categories with 67,277 generated attributes, and over 5.4 million products have been tagged with the generated attributes, with plans to enrich the entire product catalog.
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on e-commerce taxonomy bootstrapping using LLMs and human-in-the-loop, which is unrelated to the provided keywords concerning multimodal architectures (Tokenizer, Visual Encoder, MultiModal, MLLM), world models, or reinforcement learning (model-based RL, Unify Models). No target expert authors are found in the author list.
关键词
BEATS, E-commerce Attribute Taxonomies, Human-AI Collaboration, LLM Framework, Structured Attribute Tagging, Search System Improvement, Bootstrapping Product Catalogs
摘要翻译
手术过程中器械的处理与组装对刷手护士(scrub nurses)施加了较高的认知负荷,尤其是在器械不熟悉时。本文提出了一种用于手术器械的辅助引导系统,该系统将多相机 6D 位姿估计(6D pose estimation)与增强现实(AR)原位可视化相结合,并显示于头戴式显示器(head-mounted display, HMD)上,无需额外标记物。位姿估计及后续的相机标定均通过已知物体实现。6D 位姿估计网络仅基于合成数据进行训练,旨在提高其泛化能力和实际应用能力。AR 引导系统显示工具尖端定位提示及分步组装动画。用户可通过基于注视的选择和脚踏板,在术中切换组装步骤。在技术评估中,我们的方法优于现有的最先进 6D 位姿估计方法。本研究对 29 名刷手护士进行了用户研究,在膝关节置换术的手术模拟中将该系统与纸质手册进行对比。与纸质手册相比,AR 引导显著降低了感知工作负荷。客观而言,AR 引导将任务完成时间缩短了 21.3%(4.76 分钟)。具体而言,对器械组经验较少的刷手护士在使用该系统时获益更为明显。不同条件下的错误频率相当。定性反馈表明,流程清晰度得到提升,信息过载减少,且感知独立性增强。综上所述,我们的无标记物多相机 AR 引导方法用于手术器械,可主观和客观地改善术中器械操作性能,尤其对于未经训练的刷手护士。
Abstract
The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visualization on a head-mounted display without the requirement for additional markers. Pose estimation and subsequent camera calibration are achieved through known objects. The 6D pose estimation network is trained purely on synthetic data, aiming for better generalizability and real-world applicability. The AR guidance displays tooltip localization cues and step-wise assembly animations. Via gaze-based selection and a foot pedal, users can switch between assembly steps in intraoperative use. In a technical evaluation, our approach outperforms state-of-the-art 6D pose estimation. A user study with 29 scrub nurses was conducted in a surgical simulation of knee arthroplasty, comparing the system against a paper manual. AR guidance significantly reduced the perceived workload compared with the paper manual. Objectively, AR guidance reduced task completion time by 21.3% (4.76 minutes). Specifically, scrub nurses less experienced with the instrument set benefited when using the system. Error frequencies were comparable between conditions. Qualitative feedback highlighted improved process clarity, reduced information overload, and perceived independence. To summarize, our marker-free multi-camera AR guidance approach for surgical instruments can, subjectively and objectively, improve intraoperative instrumentation performance, particularly for untrained scrub nurses.
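A 6D pose is a rotation plus a translation, and a tooltip cue amounts to mapping a known point on the instrument into the viewer's coordinate frame. The sketch below shows only that geometric step under stated assumptions: the pose values, the tip offset, and the function names are hypothetical, and nothing here reflects the paper's network or calibration procedure.

```python
import math


def rotation_z(angle_rad):
    """3x3 rotation matrix for a rotation about the z axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def apply_pose(rotation, translation, point):
    """Map a point from instrument to camera coordinates: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical values: tip lies 0.10 m along the instrument's x axis,
# and the instrument is rotated 90 degrees about z and shifted 0.5 m in z.
tip_in_instrument = (0.10, 0.0, 0.0)
pose_rotation = rotation_z(math.pi / 2)
pose_translation = (0.0, 0.0, 0.5)

tip_in_camera = apply_pose(pose_rotation, pose_translation, tip_in_instrument)
```

After the rotation the tip offset points along the camera's y axis, so the result is approximately (0, 0.10, 0.5) up to floating-point rounding.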
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
评分理由: The paper focuses on surgical AR guidance and 6D pose estimation, which does not align with the research background of Unify Models, World Models, MLLM, or Model-Based RL. It lacks content regarding tokenizers, model unification strategies, or reinforcement learning paradigms. Thus, all keyword relevance scores are 0. No expert authors from the specified list are found.
关键词
Multi-Camera, AR Guidance, Surgical Instrument, 6D Pose Estimation, Workload Reduction, Head-Mounted Display, Synthetic Data Training
摘要翻译
从点云生成紧凑的多边形模型是三维视觉和计算机图形学中的一个关键问题。然而,由于激光雷达(LiDAR)扫描固有的局限性(例如距离限制和遮挡),关键场景信息往往缺失,导致重建精度下降。为了解决这一问题,我们提出了一种平面组装策略,该策略能有效恢复缺失细节同时保持模型紧凑性。我们将场景中提取的所有平面分为三类:高可见度、低可见度和不可见。不可见平面通过场景结构分析恢复,指示了缺失的细节。这三类平面对应三个生长优先级。每个平面依据优先级水平生长,空间随之逐步划分,即层次化划分。随后,我们通过基于最小割的优化从划分中生成一个水密多边形网格。最后,在公开数据集上的对比实验表明,我们的方法相对于主流方法具有有效性和优越性。项目页面见 https://hsr-3dv.github.io/.
Abstract
Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g., range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To address this, we propose a plane assembling strategy that effectively recovers missing details while maintaining model compactness. We classify all the planes extracted from the scene into three categories: highly visible, barely visible, and invisible. The invisible planes, which are recovered by scene structure analysis, indicate the missing details. The three types of planes correspond to three growth priorities. Each plane grows according to its priority level, and the space is partitioned progressively, yielding a hierarchical partition. Subsequently, we generate a watertight polygonal mesh from the partition via a min-cut-based optimization. Finally, comparisons on public datasets show the effectiveness and superiority of our method against mainstream approaches. The project page is available at https://hsr-3dv.github.io/.
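The three-way classification and priority-driven growth order can be sketched as a simple sort. Everything specific in this example is an assumption made for illustration: the `visible_ratio` score, the 0.5 threshold, the plane names, and the choice to grow highly visible planes first are not stated in the abstract, which only says that the three categories map to three growth priorities.

```python
# Lower number means the plane grows earlier (assumed ordering).
PRIORITY = {"highly_visible": 0, "barely_visible": 1, "invisible": 2}


def classify_plane(visible_ratio, high=0.5):
    """Assign a category from a hypothetical visibility score in [0, 1]."""
    if visible_ratio >= high:
        return "highly_visible"
    if visible_ratio > 0.0:
        return "barely_visible"
    return "invisible"


def growth_order(planes):
    """Return plane names sorted by growth priority (stable within a category)."""
    return [
        name
        for name, ratio in sorted(
            planes.items(), key=lambda item: PRIORITY[classify_plane(item[1])]
        )
    ]


# Hypothetical planes with visibility scores; "hidden_wall" was never observed.
planes = {"hidden_wall": 0.0, "floor": 0.9, "side_wall": 0.2}
order = growth_order(planes)
```

Processing planes in this order means each later plane is clipped by the partition created by earlier ones, which is one way to read the progressive, hierarchical partition the abstract describes.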
评分详情
| 关键词 | 权重 | 相关度 | 得分 |
|---|---|---|---|
| Unify Models | 1.5 | 0.0/10 | 0.0 |
| Tokenizer | 1.5 | 0.0/10 | 0.0 |
| Visual Encoder | 1.5 | 0.0/10 | 0.0 |
| World Models | 1.5 | 0.0/10 | 0.0 |
| MLLM | 1.5 | 0.0/10 | 0.0 |
| MultiModal | 1.5 | 0.0/10 | 0.0 |
| model-based RL | 1.5 | 0.0/10 | 0.0 |
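评分理由: The paper belongs to 3D computer vision and graphics and processes point-cloud data with geometric methods (plane assembling, space partitioning, min-cut). The given keywords mainly concern multimodal large models, representation learning, and reinforcement learning, which fall under deep learning. The two follow different research paradigms and have no direct connection, so all keywords score 0. The author list does not include any of the specified experts.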
关键词
Surface Reconstruction, Point Clouds, Plane Assembling, Hierarchical Space Partition, Polygonal Mesh, Min-cut Optimization, LiDAR Scanning, Watertight Mesh