Mar 09, 2026

Dynin-Omni

We introduce Dynin-Omni, the first masked diffusion-based omnimodal foundation model that unifies text, image, video, and speech understanding and generation, achieving strong cross-modal performance within a single architecture.

Overview

Dynin-Omni is an omnimodal discrete diffusion foundation model released by the AIDAS Laboratory. It is the first masked diffusion architecture to unify text, image, video, and speech understanding and generation within a single framework. By leveraging iterative confidence-based refinement and bidirectional token modeling, Dynin-Omni enables scalable any-to-any generation across modalities. As our experiments demonstrate, it achieves strong and consistent performance across diverse multimodal benchmarks, validating discrete diffusion as a practical paradigm for unified omnimodal intelligence.

Main result figure

Capabilities

Textual Reasoning

Solves multi-step problems and follows structured instructions with robust reasoning.

Image Understanding

Answers detailed visual questions and understands fine-grained objects, text, and layouts.

Video Understanding

Tracks temporal dynamics and answers spatiotemporal questions over videos.

Image Generation

Generates high-quality images from text prompts with strong compositional control.

Image Editing

Edits images with natural language instructions while preserving identity and structure.

ASR & TTS

Transcribes speech accurately and synthesizes natural speech from text prompts.

Omnimodal Discrete Diffusion

Dynin-Omni models omnimodal generation as masked token denoising over discrete sequences. Instead of autoregression, it iteratively refines masked tokens in parallel with modality-aware decoding schedules, enabling scalable and controllable generation across text, image, video, and speech.
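To make this concrete, here is a minimal sketch of confidence-based parallel refinement in the style of MaskGIT-like masked decoders. The `model` call signature, the mask token id, and the single cosine unmasking schedule are illustrative assumptions standing in for Dynin-Omni's modality-aware schedules, not the released implementation.

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id of the [MASK] token

@torch.no_grad()
def confidence_decode(model, seq_len, num_steps=16, temperature=1.0):
    """Iterative confidence-based refinement over a masked token sequence.

    Each step predicts every masked position in parallel, commits the most
    confident predictions, and leaves the rest masked for later steps.
    """
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    for step in range(1, num_steps + 1):
        logits = model(tokens)                  # (1, seq_len, vocab); assumed interface
        probs = F.softmax(logits / temperature, dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence and argmax token

        masked = tokens.eq(MASK_ID)
        if not masked.any():
            break
        # Cosine schedule: fraction of positions still masked after this step.
        keep_masked = int(seq_len * math.cos(0.5 * math.pi * step / num_steps))
        num_commit = int(masked.sum()) - keep_masked
        if num_commit <= 0:
            continue
        # Commit only the most confident of the still-masked positions.
        conf = conf.masked_fill(~masked, float("-inf"))
        top = conf.topk(num_commit, dim=-1).indices
        tokens[0, top[0]] = pred[0, top[0]]
    return tokens
```

Because the number of model calls is fixed by `num_steps` rather than by sequence length, decoding cost grows with refinement steps instead of tokens, which is what makes parallel refinement attractive for long multimodal sequences.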

Dynin-Omni unified architecture

Unified Token Space & Architecture

All modalities are mapped into a shared discrete token space and processed by a single Transformer backbone, enabling unified understanding and native cross-modal generation.
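As a rough illustration of the shared token space, the snippet below offsets each modality's tokenizer ids into a disjoint range of one vocabulary, so that a single embedding table and Transformer backbone can process any interleaved mixture. The per-modality vocabulary sizes and helper names are hypothetical, not Dynin-Omni's actual configuration.

```python
# Hypothetical per-modality codebook sizes; the real values are model-specific.
VOCAB_SIZES = {"text": 32_000, "image": 8_192, "video": 8_192, "speech": 4_096}

# Give each modality a disjoint id range inside one shared vocabulary.
OFFSETS, total = {}, 0
for name, size in VOCAB_SIZES.items():
    OFFSETS[name] = total
    total += size
SHARED_VOCAB_SIZE = total  # one embedding table covers every modality

def to_shared(ids, modality):
    """Map modality-local token ids into the shared discrete token space."""
    return [i + OFFSETS[modality] for i in ids]

def from_shared(ids, modality):
    """Recover modality-local ids, e.g. before detokenizing to pixels or audio."""
    return [i - OFFSETS[modality] for i in ids]

# Example: interleave a text prompt with image tokens for one backbone pass.
mixed = to_shared([17, 943, 5], "text") + to_shared([101, 2048], "image")
```

With every id living in one space, cross-modal generation reduces to denoising masked positions whose target range belongs to a different modality than the conditioning tokens.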

Architecture comparison

Examples

Textual Reasoning

Textual reasoning samples 1–2

Image Understanding

Image understanding samples 1–2

Video Understanding

Video understanding samples 1–2

Image Generation

Image generation samples 1–20

Image Editing

ASR & TTS

Speech → Text (ASR): ASR example 1
Text → Speech (TTS): TTS example 1

Performance

Dynin-Omni consistently achieves state-of-the-art or highly competitive results across a broad range of text, vision, video, and speech benchmarks, outperforming existing open-source unified models and narrowing the gap with modality-specific expert systems.

benchmark figure

Contributors

Jaeik Kim
Project Leader
Woojin Kim
Core Contributor
Jihwan Hong
Core Contributor
Yejoon Lee
Core Contributor
Sieun Hyeon
Speech Team
Mintaek Lim
Video Team
Yunseok Han
Speech Team
Dogeun Kim
Model Serving
Hoeun Lee
Model Serving
Hyeonggeun Kim
Training Team
Jaeyoung Do
Supervisor