Model Description
Google’s extended Gemini video model (omni-flash-ext) supports three generation modes — text-to-video, single-image animation, and 3-reference-image fusion — with per-second billing so you only pay for the duration you choose.
Maximum Resolution: 4K
Durations: 4, 6, 8, or 10 seconds
Key Capabilities
- Per-Second Billing: Pay only for the duration you generate — a 4s clip costs significantly less than a 10s clip.
- Three Generation Modes: Text-to-video (0 images), image-to-video (1 image), or reference fusion (exactly 3 images).
- Up to 4K Resolution: Choose 720p (base), 1080p (1.5× multiplier), or 4K (3× multiplier).
- Extended Durations: Supports 4, 6, 8, and 10 second clips — more flexibility than most Veo tiers.
- Reference Fusion: Combine scene, character, and object reference images into a single generated video.
- Aspect Ratio Control: 16:9 landscape or 9:16 portrait.
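The billing rules above can be sketched as a small calculation. This page does not state the per-second base rate, so the hypothetical `relative_cost` helper below reports cost in units of "one second of 720p video" and assumes billing is strictly linear in duration with no fixed fee.

```python
# Resolution multipliers as listed in Key Capabilities.
RESOLUTION_MULTIPLIER = {"720p": 1.0, "1080p": 1.5, "4k": 3.0}


def relative_cost(duration_s: int, resolution: str = "720p") -> float:
    """Cost in units of one 720p second (assumption: linear billing, no fixed fee)."""
    return duration_s * RESOLUTION_MULTIPLIER[resolution]


if __name__ == "__main__":
    # Print the relative cost of every duration at every resolution.
    for resolution in RESOLUTION_MULTIPLIER:
        costs = [relative_cost(d, resolution) for d in (4, 6, 8, 10)]
        print(resolution, costs)
```

Multiply the result by the actual per-second rate from your pricing page to get a real price.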
Image Input Rules
| Images | Mode |
|---|---|
| 0 | Text-to-video |
| 1 | Image-to-video (single reference) |
| 3 | Reference fusion (scene + character + object) |
| 2 | ❌ Not supported — will error |
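A client can apply the table above before sending a request. The sketch below is illustrative only: `generation_mode` and the mode labels are hypothetical names, not part of the API.

```python
# Image count -> mode, taken from the Image Input Rules table.
MODES = {0: "text-to-video", 1: "image-to-video", 3: "reference-fusion"}


def generation_mode(image_urls: list[str]) -> str:
    """Return the mode implied by the number of image URLs, or raise ValueError."""
    count = len(image_urls)
    if count not in MODES:
        raise ValueError(f"{count} images is not supported; use 0, 1, or exactly 3")
    return MODES[count]


if __name__ == "__main__":
    print(generation_mode([]))
    print(generation_mode(["https://example.com/a.jpg"]))
```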
Quick Start
1. Text-to-Video (No Images)
2. Image-to-Video (Single Reference Image)
3. Reference Fusion (3 Images)
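The three modes differ only in how many entries `image_urls` contains. The sketch below builds one request body per mode using the field names from the Parameters table. The endpoint URL and authentication are not given on this page, so nothing is sent; `build_payload`, the prompts, and the `example.com` URLs are placeholders.

```python
import json


def build_payload(prompt: str, image_urls: tuple[str, ...] = (), **options) -> dict:
    """Assemble a request body; optional fields are passed through as keyword arguments."""
    payload = {"model": "omni-flash-ext", "prompt": prompt, **options}
    if image_urls:
        payload["image_urls"] = list(image_urls)
    return payload


# 1. Text-to-video: no images.
text_to_video = build_payload("A paper boat drifting down a rainy street", duration=4)

# 2. Image-to-video: one reference image.
image_to_video = build_payload(
    "The boat slowly spins in the current",
    ("https://example.com/boat.jpg",),
    duration=6,
    resolution="1080p",
)

# 3. Reference fusion: exactly three images, ordered scene, character, object.
reference_fusion = build_payload(
    "The explorer lifts the lantern inside the cave",
    (
        "https://example.com/scene.jpg",
        "https://example.com/character.jpg",
        "https://example.com/object.jpg",
    ),
    duration=10,
    aspect_ratio="9:16",
)

if __name__ == "__main__":
    for payload in (text_to_video, image_to_video, reference_fusion):
        print(json.dumps(payload, indent=2))
```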
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | ✅ | Must be "omni-flash-ext" |
| prompt | string | ✅ | Text description of the video to generate |
| duration | integer | — | 4, 6, 8, or 10 (seconds). Only these exact values are valid. |
| resolution | string | — | "720p" (default), "1080p", or "4k" |
| aspect_ratio | string | — | "16:9" (default) or "9:16" |
| image_urls | array | — | 0, 1, or exactly 3 publicly accessible image URLs. 2 images is not supported. |
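The constraints in this table can be checked client-side before a request is sent. The following is a minimal sketch, not the service's own validation: `validate_request` is a hypothetical helper, and treating an empty prompt as invalid is an assumption.

```python
VALID_DURATIONS = (4, 6, 8, 10)
VALID_RESOLUTIONS = ("720p", "1080p", "4k")
VALID_ASPECT_RATIOS = ("16:9", "9:16")
VALID_IMAGE_COUNTS = (0, 1, 3)


def validate_request(params: dict) -> list[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    errors = []
    if params.get("model") != "omni-flash-ext":
        errors.append('model must be "omni-flash-ext"')
    # Assumption: a blank prompt counts as missing.
    if not isinstance(params.get("prompt"), str) or not params["prompt"].strip():
        errors.append("prompt is required")
    if "duration" in params and params["duration"] not in VALID_DURATIONS:
        errors.append("duration must be 4, 6, 8, or 10")
    if params.get("resolution", "720p") not in VALID_RESOLUTIONS:
        errors.append('resolution must be "720p", "1080p", or "4k"')
    if params.get("aspect_ratio", "16:9") not in VALID_ASPECT_RATIOS:
        errors.append('aspect_ratio must be "16:9" or "9:16"')
    if len(params.get("image_urls", [])) not in VALID_IMAGE_COUNTS:
        errors.append("image_urls must contain 0, 1, or exactly 3 URLs")
    return errors


if __name__ == "__main__":
    print(validate_request({"model": "omni-flash-ext", "prompt": "A calm lake", "duration": 5}))
```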
FAQ
Why can't I use exactly 2 images?
The underlying Omni-Flash-Ext API explicitly does not support 2-image
input. Use 0 images (text-to-video), 1 image (animation), or exactly 3
images (reference fusion).
What durations are supported?
Only fixed values: 4, 6, 8, or 10 seconds. Values like 5 or 7 will
return a validation error.
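If your application works with arbitrary clip lengths, one way to avoid that validation error is to round up to the next supported value before calling the API. The `snap_duration` helper below is hypothetical, and rounding up (rather than to the nearest value) is a design choice made for this sketch.

```python
SUPPORTED_DURATIONS = (4, 6, 8, 10)


def snap_duration(requested_seconds: float) -> int:
    """Return the shortest supported duration that covers the requested length."""
    for duration in SUPPORTED_DURATIONS:
        if requested_seconds <= duration:
            return duration
    raise ValueError("requested length exceeds the 10 second maximum")


if __name__ == "__main__":
    for requested in (3, 5, 7, 10):
        print(requested, snap_duration(requested))
```

Because billing is per second, rounding up also raises the cost, so you may prefer to reject unsupported values instead.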
How does reference fusion work with 3 images?
When you provide exactly 3 images, the model treats them as scene,
character, and object references respectively and fuses them into a
single coherent video. The order matters — first image is scene, second
is character, third is object.
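Since the roles are positional, naming them in client code helps prevent accidental reordering. The `FusionReferences` type below is a hypothetical convenience wrapper, not part of the API. It only fixes the order in which the three URLs are emitted.

```python
from typing import NamedTuple


class FusionReferences(NamedTuple):
    """Three reference images, kept in the order the model expects."""

    scene: str
    character: str
    obj: str

    def image_urls(self) -> list[str]:
        # Order matters: scene first, character second, object third.
        return [self.scene, self.character, self.obj]


if __name__ == "__main__":
    refs = FusionReferences(
        obj="https://example.com/object.jpg",
        scene="https://example.com/scene.jpg",
        character="https://example.com/character.jpg",
    )
    print(refs.image_urls())
```

Keyword arguments can be given in any order, and `image_urls()` still returns scene, character, object.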
Does Omni Flash Ext generate audio?
No. Omni Flash Ext produces silent video output. For AI-generated audio,
use Veo 3.1 Fast.