
2026-02-23 AI Research Briefing

An at-a-glance summary of the latest papers and research blog posts on VLMs, sLLMs, and on-device AI. Items are tracked by URL to prevent duplicate articles.
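A minimal sketch of what URL-based duplicate tracking could look like, assuming duplicates are detected by comparing normalized URLs. The normalization rules (lowercasing the host, dropping the query string and fragment, stripping a trailing slash) and the names `normalize_url` and `dedupe` are illustrative assumptions, not this briefing's actual pipeline. The example URLs are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical key.

    Assumed rules: lowercase scheme and host, drop the query string and
    fragment, strip any trailing slash from the path.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe(items: list[dict]) -> list[dict]:
    """Keep the first item seen for each normalized URL, preserving order."""
    seen: set[str] = set()
    unique = []
    for item in items:
        key = normalize_url(item["url"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Placeholder feed entries; the second is the same article behind a tracking link.
items = [
    {"title": "Paper A", "url": "https://example.org/abs/1234"},
    {"title": "Paper A (tracked link)", "url": "https://EXAMPLE.org/abs/1234/?utm_source=feed"},
    {"title": "Paper B", "url": "https://example.org/abs/5678"},
]
unique_items = dedupe(items)
print([item["title"] for item in unique_items])
```

Dropping the whole query string is a deliberately aggressive choice; a source that identifies articles by query parameter would need a gentler rule.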

18 items in total. Summaries are generated automatically.

VLM Updates

Latest papers and leaderboard changes for multimodal vision-language models

OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents

Paper arXiv cs.CV (recent)

Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
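For readers unfamiliar with the indices named in the abstract, the sketch below computes them for a single pixel using their standard normalized-difference definitions. The reflectance values are made up and the helper names are illustrative. This is not OpenEarthAgent's tooling.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else 0.0


def spectral_indices(red: float, nir: float, swir1: float, swir2: float) -> dict:
    """Standard definitions over surface-reflectance bands:

    NDVI = (NIR - Red)   / (NIR + Red)     vegetation vigour
    NBR  = (NIR - SWIR2) / (NIR + SWIR2)   burn severity
    NDBI = (SWIR1 - NIR) / (SWIR1 + NIR)   built-up areas
    """
    return {
        "NDVI": normalized_difference(nir, red),
        "NBR": normalized_difference(nir, swir2),
        "NDBI": normalized_difference(swir1, nir),
    }


# Hypothetical reflectance values for one vegetated pixel.
pixel = spectral_indices(red=0.1, nir=0.5, swir1=0.3, swir2=0.1)
print(pixel)
```

High NDVI together with negative NDBI, as in this made-up pixel, is the pattern usually associated with vegetation rather than built-up land.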

View original

When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs

Paper arXiv cs.CV (recent)

Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study this, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
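The abstract describes a dual-branch scheme but does not state how the two branches are combined. The sketch below shows one plausible form, a linear extrapolation borrowed from classifier-free guidance. That combination rule, the function name `guided_action`, the `guidance_scale` parameter, and the action values are all assumptions for illustration, not the paper's method.

```python
def guided_action(vla_action, va_action, guidance_scale=1.5):
    """Extrapolate from the language-unconditioned action toward the
    language-conditioned one (assumed, classifier-free-guidance-style rule).

    guidance_scale = 1.0 reproduces the VLA action; larger values amplify
    the part of the action that depends on the instruction.
    """
    return [
        va + guidance_scale * (vla - va)
        for vla, va in zip(vla_action, va_action)
    ]


# Hypothetical 3-D end-effector deltas. The VA branch follows the visual
# shortcut (x = 0.6); the VLA branch leans toward the instructed target (x = 0.2).
vla_action = [0.2, 0.0, 0.4]
va_action = [0.6, 0.0, 0.4]
guided = guided_action(vla_action, va_action, guidance_scale=2.0)
print(guided)
```

Dimensions where the two branches agree are left unchanged, so only the language-dependent component of the action is pushed away from the shortcut.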

View original

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting

Paper arXiv cs.CV (recent)

Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
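The variance argument behind averaging several independently sampled views can be illustrated with a toy numerical experiment. The sketch below replaces real crop gradients with a scalar signal plus Gaussian noise, which is an assumption made purely for illustration. It does not implement M-Attack-V2 or any attack, and every name in it is hypothetical.

```python
import random
import statistics


def noisy_gradient(true_grad: float, noise_std: float, rng: random.Random) -> float:
    """Stand-in for a gradient from one random view: signal plus Gaussian noise."""
    return true_grad + rng.gauss(0.0, noise_std)


def averaged_gradient(true_grad: float, noise_std: float, num_views: int,
                      rng: random.Random) -> float:
    """Average the estimates from several independently sampled views."""
    return statistics.fmean(
        noisy_gradient(true_grad, noise_std, rng) for _ in range(num_views)
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
single = [averaged_gradient(1.0, 2.0, 1, rng) for _ in range(2000)]
multi = [averaged_gradient(1.0, 2.0, 8, rng) for _ in range(2000)]
var_single = statistics.variance(single)
var_multi = statistics.variance(multi)
print(f"variance with 1 view: {var_single:.3f}, with 8 views: {var_multi:.3f}")
```

With independent noise, averaging K estimates divides the variance by roughly K, which is the effect the abstract attributes to the multi-view averaging step.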

View original

sLLM Trends

Small-LLM research on lightweight, efficient models

SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning

Paper Hugging Face Papers

A recent update on SpargeAttention2: Trainable Sparse Attention via Hybrid Top-k+Top-p Masking and Distillation Fine-Tuning. See the original link for details.

View original

On-Device AI

Trends in on-device inference and edge optimization

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

Paper Hugging Face Papers

A recent update on Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents. See the original link for details.

View original

Human-level 3D shape perception emerges from multi-view learning

Paper arXiv cs.CV (recent)

Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.

View original

AI News & Research

Major announcements and blog updates from companies and research institutions

Unified Latents (UL): How to train your latents

Paper Hugging Face Papers

A recent update on Unified Latents (UL): How to train your latents. See the original link for details.

View original

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

Paper Hugging Face Papers

A recent update on Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5. See the original link for details.

View original

Arcee Trinity Large Technical Report

Paper Hugging Face Papers

A recent update on Arcee Trinity Large Technical Report. See the original link for details.

View original

IntRec: Intent-based Retrieval with Contrastive Refinement

Paper arXiv cs.CV (recent)

Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
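The ranking idea in the abstract, rewarding similarity to confirmed cues and penalizing similarity to rejected hypotheses, can be sketched as follows. The exact scoring rule is not given in the abstract, so the form used here (best positive cosine similarity minus a weighted best negative similarity), the `penalty` weight, and the 2-D embeddings are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intent_score(candidate, positives, negatives, penalty=1.0):
    """Assumed rule: best match to positive anchors minus a weighted
    best match to negative constraints."""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - penalty * neg


def rank(candidates, positives, negatives, penalty=1.0):
    """Return candidate names sorted from best to worst score."""
    return sorted(
        candidates,
        key=lambda name: intent_score(candidates[name], positives, negatives, penalty),
        reverse=True,
    )


# Hypothetical 2-D embeddings for two similar objects and one query cue.
candidates = {"mug_left": [1.0, 0.0], "mug_right": [0.8, 0.6]}
positives = [[1.0, 0.2]]
negatives = []

before = rank(candidates, positives, negatives)
negatives.append(candidates[before[0]])  # the user rejects the top result
after = rank(candidates, positives, negatives)
print(before, after)
```

In this toy case a single rejection is enough to flip the ranking, because the rejected object's own embedding now counts against it while the other candidate is only partially penalized.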

View original

Teaching AI to read a map

News Google Research Blog · February 17, 2026 · Machine Perception · Open Source Models & Datasets

A recent update on Teaching AI to read a map. See the original link for details.

View original

Scheduling in a changing world: Maximizing throughput with time-varying capacity

News Google Research Blog · February 11, 2026 · Algorithms & Theory

A recent update on Scheduling in a changing world: Maximizing throughput with time-varying capacity. See the original link for details.

View original

Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations

News Google Research Blog · February 10, 2026 · Human-Computer Interaction and Visualization · Machine Intelligence

A recent update on Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations. See the original link for details.

View original

How AI trained on birds is surfacing underwater mysteries

News Google Research Blog · February 9, 2026 · Climate & Sustainability · Open Source Models & Datasets · Sound & Acoustics

A recent update on How AI trained on birds is surfacing underwater mysteries. See the original link for details.

View original

How AI tools can redefine universal design to increase accessibility

News Google Research Blog · February 5, 2026 · Education Innovation · Machine Intelligence · Natural Language Processing · Responsible AI

A recent update on How AI tools can redefine universal design to increase accessibility. See the original link for details.

View original

Microsoft Research blog

News Microsoft Research Blog

A recent update on the Microsoft Research blog. See the original link for details.

View original

Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions

News Microsoft Research Blog

A recent update on Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions. See the original link for details.

View original

Project Silica’s advances in glass storage technology

News Microsoft Research Blog

A recent update on Project Silica's advances in glass storage technology. See the original link for more details.

View original

Sources consulted