Feb 17, 2026

Dynin-Omni Preview

We introduce Dynin-Omni, the first masked-diffusion omnimodal foundation model to unify text, image, video, and speech understanding and generation, achieving strong cross-modal performance within a single architecture.

Overview

Dynin-Omni is an omnimodal discrete diffusion foundation model released by the AIDAS Laboratory. It is the first masked diffusion architecture to unify text, image, video, and speech understanding and generation within a single framework. By leveraging iterative confidence-based refinement and bidirectional token modeling, Dynin-Omni enables scalable any-to-any generation across modalities. As demonstrated in our experiments, it achieves strong and consistent performance across diverse multimodal benchmarks, validating discrete diffusion as a practical paradigm for unified omnimodal intelligence.

Main result figure

Capabilities

Text Reasoning

Solves multi-step problems and follows structured instructions with robust reasoning.

Image Understanding

Answers detailed visual questions and understands fine-grained objects, text, and layouts.

Video Understanding

Tracks temporal dynamics and answers spatiotemporal questions over videos.

Image Generation

Generates high-quality images from text prompts with strong compositional control.

Image Editing

Edits images with natural language instructions while preserving identity and structure.

ASR & TTS

Transcribes speech accurately and synthesizes natural speech from text prompts.

Omnimodal Discrete Diffusion Framework

We introduce Dynin-Omni, an omnimodal foundation model that unifies understanding and generation across text, image/video, and speech using a single discrete masked-diffusion Transformer. By operating directly in a shared token space and refining predictions in parallel through iterative denoising, Dynin-Omni enables efficient decoding, flexible output lengths, and native cross-modal generation without switching to separate modality-specific backbones.

Omnimodal Discrete Diffusion

Our framework formulates omnimodal generation as a masked token denoising process over discrete sequences. Instead of left-to-right autoregression, the model starts from heavily masked outputs and progressively reconstructs tokens via repeated refinement steps. This discrete diffusion view provides two practical advantages: (1) parallelism, since many tokens can be updated simultaneously for fast decoding; and (2) editability, since uncertain regions can be re-masked and corrected, making the process robust and controllable across modalities while preserving high-confidence predictions.
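The remask-and-refine loop described above can be sketched in a few lines of pure Python. Everything here is illustrative: the `predict` callable stands in for the trained network, `mask_id` is a hypothetical mask token, and the linear reveal schedule is an assumption, not the released recipe.

```python
import math

def denoise(predict, seq, mask_id, steps=4):
    """Confidence-based iterative unmasking over a discrete token list.

    `predict` is a hypothetical stand-in for the trained network: it maps the
    partially masked sequence to one (token, confidence) pair per position.
    """
    seq = list(seq)
    for step in range(steps):
        masked = [i for i, t in enumerate(seq) if t == mask_id]
        if not masked:
            break
        preds = predict(seq)  # [(token, confidence), ...] per position
        # Revealed tokens stay fixed; masked slots compete by confidence.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        # Reveal a growing fraction each step (simple linear schedule).
        k = max(1, math.ceil(len(masked) * (step + 1) / steps))
        for i in ranked[:k]:
            seq[i] = preds[i][0]
    return seq
```

A real model would also re-mask low-confidence positions between steps; this sketch keeps only the reveal half of that loop for brevity.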

Unified Token Space & Architecture

Dynin-Omni maps each modality into a single discrete token space using modality tokenizers (text, vision, speech), and trains a unified Transformer to predict masked tokens with a standard cross-entropy objective. The same backbone therefore supports both understanding (conditioning on observed tokens) and generation (iteratively filling masked outputs) for all modalities, enabling native cross-modal tasks (e.g., image→text, text→image, speech→text) without hand-designed bridging modules.
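Assuming disjoint per-modality vocabulary ranges, the shared token space and the masked cross-entropy objective might look like the following sketch; the range offsets, helper names, and list-based tensors are hypothetical simplifications, not the released tokenizer layout.

```python
import math

# Hypothetical shared vocabulary: each modality's tokens get a disjoint id range.
RANGES = {"text": 0, "vision": 50_000, "speech": 120_000}

def to_shared(tokens, modality):
    """Map modality-local token ids into the unified discrete token space."""
    return [t + RANGES[modality] for t in tokens]

def masked_ce(logits, targets, mask):
    """Cross-entropy averaged over masked positions only (the training objective).

    `logits` is a list of per-position score vectors over the shared vocabulary;
    unmasked (observed) positions contribute nothing to the loss.
    """
    loss, n = 0.0, 0
    for pos, is_masked in enumerate(mask):
        if not is_masked:
            continue
        z = logits[pos]
        log_partition = math.log(sum(math.exp(x) for x in z))
        loss += log_partition - z[targets[pos]]
        n += 1
    return loss / max(n, 1)
```

Because understanding and generation differ only in which positions are observed versus masked, the same objective covers both directions of every modality pair.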

Three-stage training pipeline highlighting model merging between modality adaptation and omnimodal SFT.

Training Pipeline with Modality-Disentangled Model Merging

We build Dynin-Omni through a staged pipeline in which model merging is the key step for native omnimodal unification. Stage 1 adapts modality-specific components while maintaining a strong text-centric backbone. We then apply modality-disentangled merging to combine complementary weights without overwriting the backbone’s knowledge, producing an initialization that is both stable and omnimodally aligned. Stage 2 performs omnimodal SFT for joint training across modalities, and Stage 3 continues capability growth (e.g., reasoning, higher-resolution generation, longer speech) via continual SFT.
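One plausible reading of modality-disentangled merging is delta averaging: combine each adapted expert's deviation from the shared text-centric base so that no single modality's adaptation overwrites backbone knowledge. The scheme and parameter layout below are illustrative assumptions, not the released recipe.

```python
def merge_modality_experts(base, experts, weights=None):
    """Merge modality experts by averaging their deltas from a shared base.

    `base` and each expert are {param_name: list_of_floats} with identical
    shapes; `weights` optionally assigns a mixing coefficient per expert.
    """
    weights = weights or {name: 1 / len(experts) for name in experts}
    merged = {}
    for name, base_w in base.items():
        delta = [0.0] * len(base_w)
        for ename, expert in experts.items():
            for i, (e, b) in enumerate(zip(expert[name], base_w)):
                delta[i] += weights[ename] * (e - b)
        merged[name] = [b + d for b, d in zip(base_w, delta)]
    return merged
```

Keeping the base weights as the reference point is what makes the merge "disentangled": each expert contributes only the directions it actually changed during Stage 1 adaptation.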

Dynin-Omni main architecture: unified training with tokenizers and discrete diffusion inference for text, image, and speech.

Inference: Modality-Aware Parallel Decoding

At inference, Dynin-Omni performs iterative denoising with confidence-based remasking, updating low-confidence positions while keeping confident tokens fixed. We use modality-aware schedules: block-wise parallel decoding for temporally ordered sequences (text and speech) and fully parallel decoding for spatial grids (image/video tokens). After diffusion decoding, modality detokenizers convert predicted tokens back to the final outputs, yielding a single, consistent generation procedure across omnimodal tasks.
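The modality-aware schedules can be sketched as a small generator that yields which positions get denoised together in each round; the block size and modality names here are assumptions for illustration.

```python
def decode_schedule(length, modality, block=4):
    """Yield groups of positions to denoise together, chosen per modality.

    Temporally ordered modalities (text, speech) are decoded block by block in
    order, so earlier content conditions later blocks; spatial grids
    (image/video tokens) are denoised fully in parallel. The block size is an
    assumed hyperparameter.
    """
    if modality in ("text", "speech"):
        for start in range(0, length, block):
            yield list(range(start, min(start + block, length)))
    else:  # image / video token grids
        yield list(range(length))
```

Within each yielded group, an iterative loop like the one in the earlier denoising sketch would refine tokens until confidence saturates, before the next group is opened.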

Experiments

We evaluate VIRST through comprehensive benchmark comparisons and ablation studies under the experimental settings described below.

Experimental Settings

In our experiments, the vision-language backbone is initialized with VideoChat-Flash-7B, whose vision encoder is a ViT-based model pretrained with UMT, while the mask prediction branch adopts SAM2. Low-rank adaptation (LoRA) is applied for efficient fine-tuning, and the STF module is trained from scratch.
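For context, the low-rank update that LoRA applies to a frozen dense layer can be sketched generically; the shapes, scaling, and pure-Python matmul below are illustrative and not the exact VIRST configuration.

```python
def matmul(X, Y):
    """Plain nested-list matrix product (stand-in for a tensor library)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen dense layer plus trainable low-rank correction: x(W + alpha*AB).

    A is (d_in, r) and B is (r, d_out) with small rank r; during fine-tuning
    only A and B receive gradients, so the number of trained parameters is a
    tiny fraction of the full weight matrix W.
    """
    base = matmul(x, W)                      # frozen pretrained path
    low_rank = matmul(matmul(x, A), B)       # rank-r learned correction
    return [[b + alpha * l for b, l in zip(br, lr)] for br, lr in zip(base, low_rank)]
```

This is why LoRA suits the setup above: the VideoChat-Flash-7B backbone stays frozen while the low-rank adapters (and the from-scratch STF module) absorb the task-specific updates.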

The training corpus includes Ref-DAVIS17, Ref-YouTube-VOS, MeViS, ReVOS, and LV-VIS, together with RefCOCO/RefCOCO+/RefCOCOg, ADE20K, COCO-Stuff, PACO, PASCAL-Part, ReasonSeg, and VideoLLaVA-Instruct. Training is conducted on eight NVIDIA H100 GPUs for three days. Additional implementation details are provided in the appendix.

Figure. ST attention visualization placeholder (`fig_st_attention`).

Referring Video Object Segmentation

We evaluate on ReVOS, MeViS, Ref-YT-VOS, and Ref-DAVIS17. Across all settings, VIRST consistently performs at the state-of-the-art level, with strong margins over previous methods.

On ReVOS, which contains reasoning-heavy queries, VIRST performs strongly on both referring and reasoning settings, with larger gains in reasoning-oriented cases. On MeViS, Ref-YT-VOS, and Ref-DAVIS17, it also maintains leading performance and stable cross-dataset generalization.

Placeholder tables included in this section: `tab_ablation_fusion`, `tab_ablation_selection`, and `tab_ablation_keyframe_num`.

Image Segmentation

We further evaluate on RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg to test whether the design generalizes beyond videos. VIRST shows strong performance on major splits while remaining competitive across all reported settings.

Ablation Studies

Effect of STF. We ablate Initial Fusion and ST-Fusion separately. Removing either component degrades \(\mathcal{J}\&\mathcal{F}\), while enabling both yields clear gains in spatio-temporal reasoning and alignment quality.

Effect of TDAU. We compare first-frame, CLIP-based, random-3, uniform-3, and dynamic anchor strategies. Dynamic anchor selection is most robust, especially on motion-heavy videos.

Effect of anchor count. Increasing \(\alpha\) consistently improves performance by expanding temporal coverage. This also acts as an inference-time scaling knob: larger \(\alpha\) trades additional compute for improved reliability.

Figure 2. Main experiment result placeholder (`main_result.pdf` preview image, detailed view).