Dynin-Omni Preview
We introduce Dynin-Omni, the first masked-diffusion omnimodal foundation model, unifying text, image, video, and speech understanding and generation and achieving strong cross-modal performance within a single architecture.
Overview
Dynin-Omni is an omnimodal discrete diffusion foundation model released by the AIDAS Laboratory. It is the first masked diffusion architecture to unify text, image, video, and speech understanding and generation within a single framework. By leveraging iterative confidence-based refinement and bidirectional token modeling, Dynin-Omni enables scalable any-to-any generation across modalities. As demonstrated in our experiments, Dynin-Omni achieves strong and consistent performance across diverse multimodal benchmarks, validating discrete diffusion as a practical paradigm for unified omnimodal intelligence.
Capabilities
Omnimodal Discrete Diffusion Framework
We introduce Dynin-Omni, an omnimodal foundation model that unifies understanding and generation across text, image/video, and speech using a single discrete masked-diffusion Transformer. By operating directly in a shared token space and refining predictions in parallel through iterative denoising, Dynin-Omni enables efficient decoding, flexible output lengths, and native cross-modal generation without switching to separate modality-specific backbones.
Omnimodal Discrete Diffusion
Our framework formulates omnimodal generation as a masked token denoising process over discrete sequences. Instead of left-to-right autoregression, the model starts from heavily masked outputs and progressively reconstructs tokens via repeated refinement steps. This discrete diffusion view provides two practical advantages: (1) parallelism—many tokens can be updated simultaneously for fast decoding; and (2) editability—uncertain regions can be re-masked and corrected, making the process robust and controllable across modalities while preserving high-confidence predictions.
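The refinement loop above can be sketched as follows. This is a minimal illustrative sketch, not the released implementation: the `toy_denoiser`, the `MASK` id, the vocabulary size, and the commit schedule are all assumptions made for the example; a real model would replace `toy_denoiser` with the trained Transformer.

```python
import numpy as np

MASK = -1   # hypothetical id for the [MASK] token
VOCAB = 16  # toy vocabulary size

def toy_denoiser(tokens, rng):
    # Stand-in for the trained denoiser: random per-position logits.
    return rng.standard_normal((len(tokens), VOCAB))

def iterative_denoise(tokens, steps=4, seed=0):
    """Fill every MASK position over `steps` refinement rounds,
    committing only the most confident predictions each round and
    leaving the low-confidence positions masked for later rounds."""
    tokens = np.array(tokens)
    rng = np.random.default_rng(seed)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_denoiser(tokens, rng)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        preds, conf = probs.argmax(-1), probs.max(-1)
        # Commit the k most confident masked positions this round.
        k = int(np.ceil(masked.size / (steps - step)))
        commit = masked[np.argsort(conf[masked])[::-1][:k]]
        tokens[commit] = preds[commit]
    return tokens
```

Because only masked positions are ever rewritten, high-confidence predictions from earlier rounds are preserved, matching the editability property described above.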
Unified Token Space & Architecture
Dynin-Omni maps each modality into a single discrete token space using modality tokenizers (text, vision, speech), and trains a unified Transformer to predict masked tokens with a standard cross-entropy objective. The same backbone therefore supports both understanding (conditioning on observed tokens) and generation (iteratively filling masked outputs) for all modalities, enabling native cross-modal tasks (e.g., image→text, text→image, speech→text) without hand-designed bridging modules.
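The training objective described above is a standard cross-entropy restricted to masked positions; because all modalities share one token space, the same loss applies regardless of which modality a token came from. A minimal numpy sketch (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over masked positions only:
    the loss is computed where tokens were hidden, and is
    modality-agnostic since every modality shares the token space.
    logits: (seq, vocab); targets: (seq,); mask: (seq,) bool."""
    shifted = logits - logits.max(-1, keepdims=True)  # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return nll[mask].mean()
```

Conditioning tokens (the "understanding" direction) simply have `mask=False`, so they contribute context but no loss.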
Training Pipeline with Modality-Disentangled Model Merging
We build Dynin-Omni through a staged pipeline that emphasizes model merging as the key step for native omnimodal unification. Stage 1 adapts modality-specific components while maintaining a strong text-centric backbone. We then apply modality-disentangled merging to combine complementary weights without overwriting the backbone’s knowledge, producing an initialization that is both stable and omnimodally aligned. Stage 2 performs omnimodal SFT for joint training across modalities, and Stage 3 continues capability growth (e.g., reasoning, higher-resolution generation, longer speech) via continual SFT.
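One common way to realize a merge that "combines complementary weights without overwriting the backbone's knowledge" is to add scaled deltas (task vectors) from each modality-adapted checkpoint onto the shared backbone. The sketch below is a hypothetical illustration of that idea, not the actual merging procedure; the function name, the per-expert `alpha` weights, and the flat state-dict layout are all assumptions.

```python
import numpy as np

def disentangled_merge(backbone, experts, alphas):
    """Hypothetical merge sketch: keep the text-centric backbone
    weights and add a scaled delta (expert minus backbone) from
    each modality-adapted checkpoint, one alpha per expert."""
    merged = {name: w.copy() for name, w in backbone.items()}
    for expert, alpha in zip(experts, alphas):
        for name, w in expert.items():
            merged[name] += alpha * (w - backbone[name])
    return merged
```

Because each expert contributes only its deviation from the shared initialization, parameters the expert left untouched keep their backbone values, which is what makes the merged weights a stable starting point for the omnimodal SFT stage.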
Inference: Modality-Aware Parallel Decoding
At inference, Dynin-Omni performs iterative denoising with confidence-based remasking, updating low-confidence positions while keeping confident tokens fixed. We use modality-aware schedules: block-wise parallel decoding for temporally ordered sequences (text and speech) and fully parallel decoding for spatial grids (image/video tokens). After diffusion decoding, modality detokenizers convert predicted tokens back to the final outputs, yielding a single, consistent generation procedure across omnimodal tasks.
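The modality-aware schedule can be made concrete with a small helper that decides which positions are decoded together. This is an illustrative sketch under assumed names (`decode_schedule`, `block_size`): temporally ordered modalities get left-to-right blocks, while spatial grids get one fully parallel group.

```python
def decode_schedule(length, modality, block_size=4):
    """Hypothetical modality-aware schedule: text and speech are
    decoded block by block, left to right; image/video token grids
    are decoded in a single fully parallel group. Each returned
    sublist is one group of positions denoised together."""
    positions = list(range(length))
    if modality in ("text", "speech"):
        return [positions[i:i + block_size]
                for i in range(0, length, block_size)]
    return [positions]  # spatial grid: fully parallel
```

Within each group, the confidence-based remasking loop described above runs until the group's positions are filled; the detokenizers then map the finished token sequence back to text, audio, or pixels.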
Experiments
We evaluate VIRST through comprehensive benchmark comparisons and ablation studies, described below together with the experimental settings.
Experimental Settings
In our experiments, the vision-language backbone is initialized with VideoChat-Flash-7B, whose vision encoder is a ViT-based model pretrained with UMT, while the mask prediction branch adopts SAM2. Low-rank adaptation (LoRA) is applied for efficient fine-tuning, and the STF module is trained from scratch.
The training corpus includes Ref-DAVIS17, Ref-YouTube-VOS, MeViS, ReVOS, and LV-VIS, together with RefCOCO/RefCOCO+/RefCOCOg, ADE20K, COCO-Stuff, PACO, PASCAL-Part, ReasonSeg, and VideoLLaVA-Instruct. Training is conducted on eight NVIDIA H100 GPUs for three days. Additional implementation details are provided in the appendix.
Referring Video Object Segmentation
We evaluate on ReVOS, MeViS, Ref-YT-VOS, and Ref-DAVIS17. Across all settings, VIRST achieves consistent state-of-the-art performance with clear margins over previous methods.
On ReVOS, which contains reasoning-heavy queries, VIRST performs strongly on both referring and reasoning settings, with larger gains in reasoning-oriented cases. On MeViS, Ref-YT-VOS, and Ref-DAVIS17, it also maintains leading performance and stable cross-dataset generalization.
Placeholder tables included in this section: `tab_ablation_fusion`, `tab_ablation_selection`, and `tab_ablation_keyframe_num`.
Image Segmentation
We further evaluate on RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg to test whether the design generalizes beyond videos. VIRST shows strong performance on major splits while remaining competitive across all reported settings.
Ablation Studies
Effect of STF. We ablate Initial Fusion and ST-Fusion separately. Removing either component degrades \(\mathcal{J}\&\mathcal{F}\), while enabling both yields clear gains in spatio-temporal reasoning and alignment quality.
Effect of TDAU. We compare first-frame, CLIP-based, random-3, uniform-3, and dynamic anchor strategies. Dynamic anchor selection is most robust, especially on motion-heavy videos.
Effect of anchor count. Increasing \(\alpha\) consistently improves performance by expanding temporal coverage. This also acts as an inference-time scaling knob: larger \(\alpha\) trades additional compute for improved reliability.