
2026-02-24 AI Research Briefing

A one-stop digest of the latest VLM, sLLM, and on-device AI papers and research blogs. Articles are tracked by URL to avoid duplicate coverage.

10 items, summaries auto-generated

VLM Updates

Latest papers and leaderboard changes for multimodal vision-language models

CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation

Paper arXiv cs.CV (recent)

Vision-Language Models (VLMs) have made notable progress on vision-language navigation (VLN), but real-world navigation depends on an agent's mobility constraints. This paper introduces CapNav, a benchmark that evaluates how well VLMs navigate complex indoor spaces given an agent's specific physical and operational capabilities. Evaluations of 13 modern VLMs show that navigation performance degrades sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that demand capability-aware spatial reasoning.

View original

sLLM Trends

Small-LLM research on lightweight, efficient models

Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Paper Hugging Face Papers

A recent update on Spanning the Visual Analogy Space with a Weight Basis of LoRAs. See the original link for details.

View original

AI News & Research

Key announcements and blog updates from companies and research institutions

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Paper Hugging Face Papers

A recent update on VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training. See the original link for details.

View original

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Paper Hugging Face Papers

A recent update on Does Your Reasoning Model Implicitly Know When to Stop Thinking? See the original link for details.

View original

Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Paper Hugging Face Papers

A recent update on Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control. See the original link for details.

View original

Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Paper Hugging Face Papers

A recent update on Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers. See the original link for details.

View original
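The top-k and top-p (nucleus) samplers named in the title are standard decoding methods. As background, a minimal nucleus-sampling sketch (independent of the paper's optimisation framing): keep the smallest set of tokens whose cumulative probability exceeds p, renormalize, then sample.

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus (top-p) sampling over a vector of logits."""
    rng = rng or np.random.default_rng(0)
    # Softmax with max-subtraction for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # smallest nucleus covering mass p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

With a sharply peaked distribution and small p, the nucleus collapses to the single most likely token; top-k is the same recipe with a fixed-size cutoff instead of a probability-mass cutoff.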

Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Paper arXiv cs.CV (recent)

Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.

View original
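The adaptive selection idea above (keep more tokens per frame, but drop redundant ones) can be sketched with a greedy similarity filter. This is an illustrative sketch under our own assumptions, not MemStream's actual algorithm: a token is kept only if its cosine similarity to every already-kept token stays below a threshold, up to a per-frame budget.

```python
import numpy as np

def select_tokens(frame_tokens, max_tokens, sim_threshold=0.95):
    """Greedy redundancy-aware token selection (illustrative sketch).

    frame_tokens: (num_tokens, dim) array of frame-level token embeddings.
    Returns the subset of (unit-normalized) tokens kept for the KV cache.
    """
    kept = []
    for tok in frame_tokens:
        unit = tok / (np.linalg.norm(tok) + 1e-8)
        # Keep only tokens not already represented by a similar kept token.
        if all(float(unit @ k) < sim_threshold for k in kept):
            kept.append(unit)
        if len(kept) == max_tokens:   # respect the per-frame token budget
            break
    return np.stack(kept)
```

The point of the sketch: duplicated content collapses to one representative token, so a larger raw token budget does not translate into a linearly larger cache.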

SARAH: Spatially Aware Real-time Agentic Humans

Paper arXiv cs.CV (recent)

As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.

View original
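The gaze control described above relies on classifier-free guidance, whose core operation is a simple blend of conditioned and unconditioned predictions. A minimal sketch of that blend (the function name and the application to gaze intensity are our illustration, not the paper's code):

```python
import numpy as np

def cfg_blend(uncond_pred, cond_pred, guidance_scale):
    """Classifier-free-guidance blend of two model predictions.

    guidance_scale = 0 ignores the condition entirely; 1 uses it as trained;
    values > 1 extrapolate, amplifying the conditioned behavior (here, the
    intensity of eye contact) at inference time without retraining.
    """
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)
```

Decoupling learning from control in this way means the model learns natural spatial alignment from data once, while the scale becomes a user-facing knob.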

The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning

Paper arXiv cs.CV (recent)

Autonomous (noise-agnostic) generative models, such as Equilibrium Matching and blind diffusion, challenge the standard paradigm by learning a single, time-invariant vector field that operates without explicit noise-level conditioning. While recent work suggests that high-dimensional concentration allows these models to implicitly estimate noise levels from corrupted observations, a fundamental paradox remains: what is the underlying landscape being optimized when the noise level is treated as a random variable, and how can a bounded, noise-agnostic network remain stable near the data manifold where gradients typically diverge? We resolve this paradox by formalizing Marginal Energy, $E_{\text{marg}}(\mathbf{u}) = -\log p(\mathbf{u})$, where $p(\mathbf{u}) = \int p(\mathbf{u}|t)p(t)dt$ is the marginal density of the noisy data integrated over a prior distribution of unknown noise levels. We prove that generation using autonomous models is not merely blind denoising, but a specific form of Riemannian gradient flow on this Marginal Energy. Through a novel relative energy decomposition, we demonstrate that while the raw Marginal Energy landscape possesses a $1/t^p$ singularity normal to the data manifold, the learned time-invariant field implicitly incorporates a local conformal metric that perfectly counteracts the geometric singularity, converting an infinitely deep potential well into a stable attractor. We also establish the structural stability conditions for sampling with autonomous models. We identify a ``Jensen Gap'' in noise-prediction parameterizations that acts as a high-gain amplifier for estimation errors, explaining the catastrophic failure observed in deterministic blind models. Conversely, we prove that velocity-based parameterizations are inherently stable because they satisfy a bounded-gain condition that absorbs posterior uncertainty into a smooth geometric drift.

View original
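Written out, the abstract's central objects are the following (the metric $G$ and the flow equation are our paraphrase of the "Riemannian gradient flow" claim, not the paper's exact statement):

```latex
E_{\text{marg}}(\mathbf{u}) = -\log p(\mathbf{u}),
\qquad
p(\mathbf{u}) = \int p(\mathbf{u} \mid t)\, p(t)\, dt,
```

with generation cast as a Riemannian gradient flow

```latex
\dot{\mathbf{u}} = -\, G(\mathbf{u})^{-1} \, \nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u}),
```

where the learned time-invariant field implicitly supplies the conformal metric $G$ that cancels the $1/t^p$ singularity normal to the data manifold, turning the infinitely deep potential well into a stable attractor.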

Spatio-Spectroscopic Representation Learning using Unsupervised Convolutional Long-Short Term Memory Networks

Paper arXiv cs.CV (recent)

Integral Field Spectroscopy (IFS) surveys offer a unique opportunity to learn across spatial and spectroscopic dimensions and uncover new insights into galaxy evolution. This work presents a new unsupervised deep learning framework that uses Convolutional Long-Short Term Memory Network Autoencoders to encode generalized feature representations across the spatial and spectroscopic dimensions of roughly 9000 galaxies in the MaNGA IFS survey. As a demonstration, the model is applied to a sample of 290 Active Galactic Nuclei (AGN), highlighting scientifically interesting and anomalous characteristics.

View original

Sources referenced