
2026-03-22 AI Research Briefing

A one-glance digest of the latest VLM, sLLM, and on-device AI papers and research blogs. Entries are tracked by URL to avoid duplicate articles.

18 items in total, summaries auto-generated

VLM Updates

The latest papers and leaderboard changes for multimodal vision-language models

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Paper arXiv cs.CV (recent)

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
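The token-level adaptive gated fusion the abstract mentions can be pictured with a minimal sketch. Everything below — the shapes, the scalar-per-token gate, and the name `token_gated_fusion` — is an illustrative assumption, not the paper's actual implementation:

```python
import numpy as np

def token_gated_fusion(semantic, spatial, W_gate, b_gate):
    """Fuse per-token semantic and spatial features with a learned scalar gate.

    semantic, spatial: (num_tokens, dim) feature matrices.
    W_gate: (2*dim,) gate weights; b_gate: scalar bias.
    Returns a (num_tokens, dim) fused feature matrix.
    """
    # One gate value per token, computed from both feature streams.
    logits = np.concatenate([semantic, spatial], axis=-1) @ W_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-logits))          # sigmoid, shape (num_tokens,)
    return gate[:, None] * semantic + (1.0 - gate)[:, None] * spatial

rng = np.random.default_rng(0)
tokens, dim = 4, 8
semantic = rng.standard_normal((tokens, dim))     # stand-in for MLLM features
spatial = rng.standard_normal((tokens, dim))      # stand-in for video-diffusion features
fused = token_gated_fusion(semantic, spatial, 0.1 * rng.standard_normal(2 * dim), 0.0)
```

With a strongly positive gate bias the fused output collapses to the semantic stream; with a strongly negative one it collapses to the spatial stream, so the model can interpolate per token.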

View original

Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Paper arXiv cs.CV (recent)

Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation -- any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
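As a rough illustration of "any dimension at any position can be masked", here is a sketch of dimension-level masking and a fixed-T reveal schedule. The function names and the uniform random schedule are assumptions for illustration, not CubiD's actual procedure:

```python
import numpy as np

def dimension_level_mask(h, w, d, mask_ratio, rng):
    """Mask individual (position, dimension) entries independently.

    Token-level masking hides whole h*w positions; here a token can be
    partially observed, exposing correlations across its dimensions.
    """
    return rng.random((h, w, d)) < mask_ratio

def reveal_schedule(h, w, d, T, rng):
    """Partition all h*w*d entries into T disjoint reveal groups, so the
    number of generation steps stays T regardless of feature dimensionality."""
    order = rng.permutation(h * w * d)
    return np.array_split(order, T)

rng = np.random.default_rng(0)
mask = dimension_level_mask(4, 4, 16, mask_ratio=0.5, rng=rng)
groups = reveal_schedule(4, 4, 16, T=8, rng=rng)
```

The point of the schedule is that T is decoupled from the total entry count h·w·d, which is what makes high-dimensional (768-1024 dim) tokens tractable.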

View original

NavTrust: Benchmarking Trustworthiness for Embodied Navigation

Paper arXiv cs.CV (recent)

There are two major categories of embodied navigation: Vision-Language Navigation (VLN), where agents navigate by following natural language instructions; and Object-Goal Navigation (OGN), where agents navigate to a specified target object. However, existing work primarily evaluates model performance under nominal conditions, overlooking the potential corruptions that arise in real-world settings. To address this gap, we present NavTrust, a unified benchmark that systematically corrupts input modalities, including RGB, depth, and instructions, in realistic scenarios and evaluates their impact on navigation performance. To the best of our knowledge, NavTrust is the first benchmark that exposes embodied navigation agents to diverse RGB-Depth corruptions and instruction variations in a unified framework. Our extensive evaluation of seven state-of-the-art approaches reveals substantial performance degradation under realistic corruptions, which highlights critical robustness gaps and provides a roadmap toward more trustworthy embodied navigation systems. Furthermore, we systematically evaluate four distinct mitigation strategies to enhance robustness against RGB-Depth and instruction corruptions. Our base models include Uni-NaVid and ETPNav. We deployed them on a real mobile robot and observed improved robustness to corruptions. The project website is: https://navtrust.github.io.
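The RGB-Depth corruptions the benchmark applies can be pictured with two minimal stand-ins: additive Gaussian noise on RGB and random sensor dropout on depth. The noise scale and function names are assumptions, not NavTrust's actual corruption suite:

```python
import numpy as np

def corrupt_rgb(img, severity, rng):
    """Additive Gaussian noise on an RGB image in [0, 1]; higher severity, more noise."""
    noisy = img + rng.normal(0.0, 0.1 * severity, size=img.shape)
    return np.clip(noisy, 0.0, 1.0)

def corrupt_depth(depth, drop_prob, rng):
    """Zero out random depth readings, mimicking sensor dropout."""
    keep = rng.random(depth.shape) >= drop_prob
    return depth * keep

rng = np.random.default_rng(0)
rgb = rng.random((32, 32, 3))          # toy RGB observation
depth = rng.random((32, 32))           # toy depth map
noisy_rgb = corrupt_rgb(rgb, severity=3, rng=rng)
sparse_depth = corrupt_depth(depth, drop_prob=0.3, rng=rng)
```

An evaluation harness in this style sweeps severity levels and reports the degradation curve per agent, which is the kind of comparison the seven-baseline study performs.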

View original

sLLM Trends

Research on small LLMs for lightweight, efficient deployment

Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Paper Hugging Face Papers

A recent update on Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation. See the original link for details.

View original

On-Device AI

Trends in on-device inference and edge optimization

PlugMem: Transforming raw agent interactions into reusable knowledge

News Microsoft Research Blog

PlugMem is a technique that transforms an AI agent's raw interactions into reusable knowledge. The approach rethinks the agent's memory system with the goal of efficient knowledge reuse.

View original

AI News & Research

Major announcements and blog updates from companies and research institutions

Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Paper Hugging Face Papers

A recent update on Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding. See the original link for details.

View original

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Paper Hugging Face Papers

A recent update on SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing. See the original link for details.

View original

3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Paper Hugging Face Papers

A recent update on 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model. See the original link for details.

View original

FASTER: Rethinking Real-Time Flow VLAs

Paper Hugging Face Papers

A recent update on FASTER: Rethinking Real-Time Flow VLAs. See the original link for details.

View original

Matryoshka Gaussian Splatting

Paper arXiv cs.CV (recent)

The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
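The stochastic budget training described above can be sketched in a few lines. The toy `render` (a mean over splat features) and the squared-error loss are placeholders I'm assuming for illustration; the real pipeline alpha-composites ordered Gaussians and backpropagates through the renderer:

```python
import numpy as np

def render(splats):
    """Toy stand-in renderer; the real 3DGS pipeline alpha-composites Gaussians."""
    return splats.mean(axis=0)

def stochastic_budget_step(splats, target, rng):
    """One MGS-style iteration: sample a random budget k, then evaluate both
    the k-splat prefix and the full set (two forward passes, no extra machinery)."""
    n = len(splats)
    k = int(rng.integers(1, n + 1))                           # random splat budget in [1, n]
    prefix_loss = float(np.mean((render(splats[:k]) - target) ** 2))
    full_loss = float(np.mean((render(splats) - target) ** 2))
    return prefix_loss + full_loss

rng = np.random.default_rng(0)
splats = rng.standard_normal((256, 3))    # stand-in for an ordered set of splat parameters
target = np.zeros(3)                      # stand-in for the ground-truth rendering
loss = stochastic_budget_step(splats, target, rng)
```

Because every prefix is optimised against the same target, the learned ordering degrades gracefully: truncating to the first k splats at deploy time gives the continuous speed-quality trade-off from one model.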

View original

MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Paper arXiv cs.CV (recent)

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
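The progressive reasoning chain (observation → canonical geometry → part structure → motion-aware embedding) can be sketched as plain stage composition. The stand-in stages below are hypothetical; in the actual model the stages are learned jointly inside one architecture rather than hand-written functions:

```python
from typing import Any, Callable, List, Tuple

def progressive_pipeline(stages: List[Callable[[Any], Any]]) -> Callable[[Any], Tuple[Any, list]]:
    """Compose stages so each consumes the previous stage's output,
    keeping the intermediate states for interpretability."""
    def run(observation):
        x, trace = observation, [observation]
        for stage in stages:
            x = stage(x)
            trace.append(x)
        return x, trace
    return run

# Hypothetical stand-in stages for illustration only:
to_canonical = lambda img: {"geometry": f"canonical({img})"}
to_parts = lambda g: {**g, "parts": ["base", "lid"]}
to_motion = lambda p: {**p, "motion": "revolute"}

run = progressive_pipeline([to_canonical, to_parts, to_motion])
result, trace = run("rgb_image")
```

The design point is that articulation is read off the last, motion-aware state rather than regressed directly from image features, which is what the abstract credits for stability.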

View original

Google Research at The Check Up: from healthcare innovation to real-world care settings (March 17, 2026 · Health & Bioscience · Machine Intelligence)

News Google Research Blog

A recent update on "Google Research at The Check Up: from healthcare innovation to real-world care settings" (Google Research Blog, March 17, 2026). See the original link for details.

View original

Improving breast cancer screening workflows with machine learning (March 17, 2026 · Health & Bioscience · Human-Computer Interaction and Visualization)

News Google Research Blog

A recent update on "Improving breast cancer screening workflows with machine learning" (Google Research Blog, March 17, 2026). See the original link for details.

View original

Testing LLMs on superconductivity research questions (March 16, 2026 · Education Innovation · General Science · Machine Intelligence · Natural Language Processing)

News Google Research Blog

A recent update on "Testing LLMs on superconductivity research questions" (Google Research Blog, March 16, 2026). See the original link for details.

View original

Protecting cities with AI-driven flash flood forecasting (March 12, 2026 · Climate & Sustainability · Earth AI · Generative AI · Open Source Models & Datasets)

News Google Research Blog

A recent update on "Protecting cities with AI-driven flash flood forecasting" (Google Research Blog, March 12, 2026). See the original link for details.

View original

Introducing Groundsource: Turning news reports into data with Gemini (March 12, 2026 · Climate & Sustainability · Generative AI · Natural Language Processing · Open Source Models & Datasets)

News Google Research Blog

A recent update on "Introducing Groundsource: Turning news reports into data with Gemini" (Google Research Blog, March 12, 2026). See the original link for details.

View original

Microsoft Research blog

News Microsoft Research Blog

A recent update from the Microsoft Research blog. See the original link for details.

View original

Systematic debugging for AI agents: Introducing the AgentRx framework

News Microsoft Research Blog

A recent update on Systematic debugging for AI agents: Introducing the AgentRx framework. See the original link for details.

View original

Sources consulted