2026-04-05 AI Research Briefing

A one-glance roundup of the latest papers and research blogs on VLMs, sLLMs, and on-device AI. Items are tracked by URL to prevent duplicate articles.
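A minimal sketch of URL-keyed de-duplication as described above. The normalization rules (lowercasing the host, dropping fragments, trailing slashes, and `utm_*` query parameters) and the names `normalize_url` and `SeenTracker` are assumptions for illustration; the briefing does not specify how its tracker works.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` for duplicate detection (assumed rules)."""
    parts = urlsplit(url.strip())
    # Drop tracking parameters and sort the rest so parameter order does not matter.
    query = sorted(
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_")
    )
    # Treat "/a/" and "/a" as the same page; keep "/" for an empty path.
    path = parts.path.rstrip("/") or "/"
    # Scheme and host are case-insensitive; the fragment is dropped entirely.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


class SeenTracker:
    """Remembers normalized URLs that have already been included (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, url: str) -> bool:
        """Record `url` and report whether it was unseen until now."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Because the key is the URL rather than the title, the same paper listed by two sources under different URLs would still count as two items under this scheme.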

18 summaries in total, automatically generated

VLM Updates

Latest papers and leaderboard changes for multimodal vision-language models

Generative World Renderer

Paper arXiv cs.CV (recent)

Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.

View original

Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection

Paper arXiv cs.CV (recent)

We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
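To make "feature-wise modulation" concrete, here is a minimal FiLM-style sketch: a conditioning vector (here a hypothetical view embedding) is projected to a per-channel scale and shift, which are then applied to a feature vector. The function names, the linear projections, and all shapes are assumptions for illustration; the abstract does not describe ModMap's actual layers.

```python
from typing import List, Sequence


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Compute weights @ x + bias for a small dense layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_modulate(features: Sequence[float], view_embedding: Sequence[float],
                  w_gamma, b_gamma, w_beta, b_beta) -> List[float]:
    """Scale and shift each feature channel with parameters predicted from the view.

    gamma and beta each hold one value per feature channel.
    """
    gamma = linear(w_gamma, b_gamma, view_embedding)  # per-channel scale
    beta = linear(w_beta, b_beta, view_embedding)     # per-channel shift
    if not (len(gamma) == len(beta) == len(features)):
        raise ValueError("gamma, beta, and features must have the same length")
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Toy example: three feature channels conditioned on a 2-D view embedding.
features = [1.0, 2.0, 3.0]
view = [1.0, 0.0]
w_gamma = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_gamma = [0.0, 1.0, 0.0]
w_beta = [[0.5, 0.0], [0.0, 0.0], [-1.0, 0.0]]
b_beta = [0.0, 0.0, 0.0]
modulated = film_modulate(features, view, w_gamma, b_gamma, w_beta, b_beta)
```

The same features yield different outputs for different view embeddings, which is the sense in which the modulation is view-dependent in this sketch.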

View original

sLLM Trends

Small-LLM research on compression and efficiency

EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors

Paper arXiv cs.CV (recent)

We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or simply proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.

View original

On-Device AI

Trends in on-device inference and edge optimization

LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models

Paper arXiv cs.CV (recent)

Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory. Existing methods face two distinct challenges. Geometrically unconditioned models lack spatial awareness, leading to drift and deformation under viewpoint changes. On the other hand, geometrically-conditioned models depend on estimated depth and explicit reconstruction, making them susceptible to depth inaccuracies and calibration errors. We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model to condition the video generation process. These latents capture scene structure in a continuous space without explicit reconstruction. Therefore, they provide a flexible representation that allows the pretrained diffusion prior to regularize errors more effectively. By jointly conditioning on these latents and source camera poses, we demonstrate that our model achieves state-of-the-art results on the video re-rendering task. Project webpage is https://lavr-4d-scene-rerender.github.io/.

View original

AI News & Research

Major announcements and blog updates from companies and research institutions

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Paper Hugging Face Papers

A recent update on DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models. See the original link for details.

View original

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

Paper Hugging Face Papers

A recent update on The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook. See the original link for details.

View original

Generative World Renderer

Paper Hugging Face Papers

A recent update on Generative World Renderer. See the original link for details.

View original

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Paper Hugging Face Papers

A recent update on SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization. See the original link for details.

View original

Steerable Visual Representations

Paper Hugging Face Papers

A recent update on Steerable Visual Representations. See the original link for details.

View original

ActionParty: Multi-Subject Action Binding in Generative Video Games

Paper arXiv cs.CV (recent)

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

View original

April 3, 2026: Evaluating alignment of behavioral dispositions in LLMs (Generative AI · Human-Computer Interaction and Visualization · Machine Intelligence)

News Google Research Blog

A recent update on Evaluating alignment of behavioral dispositions in LLMs. See the original link for details.

View original

March 31, 2026: Building better AI benchmarks: How many raters are enough? (Algorithms & Theory · Machine Intelligence)

News Google Research Blog

Google Research explores the trade-off between number of items and human raters per item to improve AI benchmark reproducibility and capture the nuance of human disagreement.
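One way to see the trade-off is a simple two-level variance model, sketched below. It is a textbook assumption, not the analysis from the blog post: each item has a latent score with variance `item_var` across items, each rating adds independent noise with variance `rater_var`, and a fixed budget of ratings is split into `budget // k` items with `k` raters each. All names and numbers are hypothetical.

```python
from typing import Dict, List


def mean_score_variance(item_var: float, rater_var: float,
                        n_items: int, k_raters: int) -> float:
    """Variance of the benchmark mean under the assumed two-level model."""
    if n_items < 1 or k_raters < 1:
        raise ValueError("need at least one item and one rater per item")
    # Item-level spread averages out over items; rating noise over all ratings.
    return item_var / n_items + rater_var / (n_items * k_raters)


def sweep_budget(item_var: float, rater_var: float, budget: int,
                 rater_options: List[int]) -> Dict[int, float]:
    """Map each raters-per-item choice to the variance it yields for a fixed budget."""
    results = {}
    for k in rater_options:
        n_items = budget // k  # ratings left over after integer division go unused
        if n_items >= 1:
            results[k] = mean_score_variance(item_var, rater_var, n_items, k)
    return results


# Illustrative numbers only: 1000 ratings, rating noise larger than item spread.
variances = sweep_budget(item_var=1.0, rater_var=4.0, budget=1000,
                         rater_options=[1, 5, 10])
```

Under this model the variance of the overall mean shrinks as the budget moves toward more items, while estimating how much raters disagree on any single item requires at least two ratings for that item. That tension is one way to read the trade-off.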

View original

March 31, 2026: Safeguarding cryptocurrency by disclosing quantum vulnerabilities responsibly (Algorithms & Theory · Quantum · Security, Privacy and Abuse Prevention)

News Google Research Blog

A recent update on Safeguarding cryptocurrency by disclosing quantum vulnerabilities responsibly. See the original link for details.

View original

March 25, 2026: Vibe Coding XR: Accelerating AI + XR prototyping with XR Blocks and Gemini (Human-Computer Interaction and Visualization · Machine Intelligence)

News Google Research Blog

A recent update on Vibe Coding XR: Accelerating AI + XR prototyping with XR Blocks and Gemini. See the original link for details.

View original

March 24, 2026: TurboQuant: Redefining AI efficiency with extreme compression (Algorithms & Theory · Generative AI · Machine Intelligence)

News Google Research Blog

A recent update on TurboQuant: Redefining AI efficiency with extreme compression. See the original link for details.

View original

Microsoft Research blog

News Microsoft Research Blog

A recent update on the Microsoft Research blog. See the original link for details.

View original

ADeLe: Predicting and explaining AI performance across tasks

News Microsoft Research Blog

A recent update on ADeLe: Predicting and explaining AI performance across tasks. See the original link for details.

View original

AsgardBench: A benchmark for visually grounded interactive planning

News Microsoft Research Blog

A recent update on AsgardBench: A benchmark for visually grounded interactive planning. See the original link for details.

View original
