2026-04-21 AI Research Briefing

A quick roundup of the latest papers and research blog posts on VLMs, sLLMs, and on-device AI. Items are tracked by URL to avoid duplicate entries.
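URL-based duplicate tracking of this kind can be sketched as follows. This is a minimal illustration, not the briefing's actual pipeline: the helper names `canonical_url` and `dedupe_by_url`, the item dictionary shape, and the normalization rules (lowercased host, no fragment, no trailing slash) are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal (assumed rules)."""
    parts = urlsplit(url.strip())
    # Lowercase scheme and host, drop the fragment and any trailing slash on the path.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def dedupe_by_url(items: list[dict]) -> list[dict]:
    """Keep the first item seen for each canonical URL, preserving input order."""
    seen: set[str] = set()
    unique = []
    for item in items:
        key = canonical_url(item["url"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical feed entries; the second is a repost of the first under a variant URL.
items = [
    {"title": "LaviGen", "url": "https://github.com/fenghora/LaviGen"},
    {"title": "LaviGen (repost)", "url": "https://GitHub.com/fenghora/LaviGen/#readme"},
    {"title": "FineCog-Nav", "url": "https://smartdianlab.github.io/projects-FineCogNav"},
]
unique_items = dedupe_by_url(items)
print([item["title"] for item in unique_items])
```

Keeping the first occurrence preserves the order in which items were collected; a real tracker would likely persist the `seen` set between runs.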

10 items in total, summaries auto-generated

VLM Updates

Latest papers and leaderboard changes for multimodal vision-language models

PersonaVLM: Long-Term Personalized Multimodal LLMs

Paper Hugging Face Papers

A recent update on PersonaVLM: Long-Term Personalized Multimodal LLMs. See the original link for details.

View original

Repurposing 3D Generative Model for Autoregressive Layout Generation

Paper arXiv cs.CV (recent)

We introduce LaviGen, a framework that repurposes 3D generative models for 3D layout generation. Unlike previous methods that infer object layouts from textual descriptions, LaviGen operates directly in the native 3D space, formulating layout generation as an autoregressive process that explicitly models geometric relations and physical constraints among objects, producing coherent and physically plausible 3D scenes. To further enhance this process, we propose an adapted 3D diffusion model that integrates scene, object, and instruction information and employs a dual-guidance self-rollout distillation mechanism to improve efficiency and spatial accuracy. Extensive experiments on the LayoutVLM benchmark show LaviGen achieves superior 3D layout generation performance, with 19% higher physical plausibility than the state of the art and 65% faster computation. Our code is publicly available at https://github.com/fenghora/LaviGen.

View original

FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

Paper arXiv cs.CV (recent)

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.

View original

VeRVE: Versatile Retrieval for Videos via Unified Embeddings

Paper arXiv cs.CV (recent)

This paper introduces VeRVE, an MLLM-based video retrieval framework that handles diverse tasks within a single architecture, from corpus-level retrieval to fine-grained moment localization and complex multimodal queries. It relies on contrastive alignment of visual and textual embeddings. The authors report state-of-the-art results on zero-shot composed video retrieval and competitive performance on other zero-shot tasks, outperforming existing MLLM-based systems and matching specialized models.
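The summary does not say which objective VeRVE uses for contrastive alignment. As a rough illustration of the general idea only, here is a generic symmetric InfoNCE loss of the kind used in CLIP-style training, written in plain Python. The function names, the toy embeddings, and the temperature of 0.07 are assumptions and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video InfoNCE over matched pairs."""
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    # logits[i][j] = cosine similarity of video i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for the text-to-video direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy 2-D embeddings: pair i of videos and texts should match.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # matched pairs give the lower loss
```

The loss is small when each video embedding is closest to its own text embedding and large when the pairs are swapped, which is the property a shared retrieval embedding space needs.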

View original

AI News & Research

Major announcements and blog updates from companies and research labs

Elucidating the SNR-t Bias of Diffusion Probabilistic Models

Paper Hugging Face Papers

A recent update on Elucidating the SNR-t Bias of Diffusion Probabilistic Models. See the original link for details.

View original

Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips

Paper Hugging Face Papers

A recent update on Maximal Brain Damage Without Data or Optimization: Disrupting Neural Networks via Sign-Bit Flips. See the original link for details.

View original

Qwen3.5-Omni Technical Report

Paper Hugging Face Papers

A recent update on Qwen3.5-Omni Technical Report. See the original link for details.

View original

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

Paper Hugging Face Papers

A recent update on Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems. See the original link for details.

View original

Differential privacy representation geometry for medical image analysis

Paper arXiv cs.CV (recent)

The effect of differential privacy (DP) in medical imaging is typically evaluated only through end-to-end performance, leaving the mechanism of privacy-induced utility loss unclear. We introduce Differential Privacy Representation Geometry for Medical Imaging (DP-RGMI), a framework that interprets DP as a structured transformation of representation space and decomposes performance degradation into encoder geometry and task-head utilization. Geometry is quantified by representation displacement from initialization and spectral effective dimension, while utilization is measured as the gap between linear-probe and end-to-end utility. Across over 594,000 images from four chest X-ray datasets and multiple pretrained initializations, we show that DP is consistently associated with a utilization gap even when linear separability is largely preserved. At the same time, displacement and spectral dimension exhibit non-monotonic, initialization- and dataset-dependent reshaping, indicating that DP alters representation anisotropy rather than uniformly collapsing features. Correlation analysis reveals that the association between end-to-end performance and utilization is robust across datasets but can vary by initialization, while geometric quantities capture additional prior- and dataset-conditioned variation. These findings position DP-RGMI as a reproducible framework for diagnosing privacy-induced failure modes and informing privacy model selection.
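The abstract names a spectral effective dimension and a utilization gap without giving formulas. The sketch below uses one common choice for the former, the participation ratio of the feature covariance spectrum, and a simple difference for the latter. Both definitions, the sign convention of the gap, and the function names are assumptions rather than the paper's stated method.

```python
def covariance(features):
    """Sample covariance matrix of row-vector features (n samples x d dims)."""
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in features]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)] for a in range(d)]


def effective_dimension(features):
    """Participation ratio (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    Uses trace identities, so no eigendecomposition is needed: the eigenvalue sum
    is tr(C), and the sum of squared eigenvalues is tr(C^2), which for a symmetric
    matrix equals the sum of its squared entries.
    """
    c = covariance(features)
    trace = sum(c[i][i] for i in range(len(c)))
    trace_sq = sum(x * x for row in c for x in row)
    return trace * trace / trace_sq


def utilization_gap(linear_probe_score, end_to_end_score):
    """Assumed convention: positive when the linear probe beats end-to-end training."""
    return linear_probe_score - end_to_end_score


# Toy features: variance spread evenly over two axes vs. collapsed onto one axis.
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
collapsed = [[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
print(effective_dimension(isotropic))  # close to 2.0
print(effective_dimension(collapsed))  # close to 1.0
```

Under this definition a representation that spreads variance evenly across d directions scores d, while one that collapses onto a single direction scores 1, which is one way to make "anisotropy" measurable.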

View original

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Paper arXiv cs.CV (recent)

This paper proposes StreamCacheVGGT, a training-free framework for reconstructing dense 3D geometry from continuous video streams under a constant memory budget. It addresses the limitations of existing cache management with two complementary modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). The authors report new state-of-the-art results, with higher reconstruction accuracy and long-term stability while staying within the constant-cost constraint.
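The summary does not describe how CLCES scores cache entries or how HCC compresses them. To illustrate only the general idea of score-based cache management under a fixed budget, here is a minimal sketch that evicts the lowest-scoring entry once the budget is reached. The `BudgetedCache` class, the scores, and the frame labels are hypothetical and do not reproduce either module.

```python
import heapq


class BudgetedCache:
    """Keeps at most `budget` entries, evicting the lowest-scoring one first."""

    def __init__(self, budget: int):
        self.budget = budget
        self._heap = []  # min-heap of (score, insertion_index, token)
        self._counter = 0

    def add(self, token, score: float) -> None:
        entry = (score, self._counter, token)
        self._counter += 1
        if len(self._heap) < self.budget:
            heapq.heappush(self._heap, entry)
        else:
            # Push the new entry and pop the smallest, so the size stays at budget.
            heapq.heappushpop(self._heap, entry)

    def tokens(self):
        """Return cached tokens in arrival order."""
        return [tok for _, _, tok in sorted(self._heap, key=lambda e: e[1])]


# Hypothetical importance scores for six incoming frames, with room for three.
cache = BudgetedCache(budget=3)
for frame_id, score in enumerate([0.9, 0.1, 0.5, 0.7, 0.2, 0.8]):
    cache.add(f"frame-{frame_id}", score)
print(cache.tokens())  # ['frame-0', 'frame-3', 'frame-5']
```

Memory use stays constant however long the stream runs, at the cost of discarding low-scoring entries outright. A compression step would presumably aim to retain some of that discarded information instead.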

View original

Sources consulted