
After exploring image-to-video (i2v) generation with WAN2.1, many creators are now looking for ways to take their AI video generation even further. Text-to-video (T2V) offers a powerful alternative, allowing users to generate motion directly from text descriptions without needing an initial image.
In this guide, we’ll walk through a WAN2.1 T2V workflow in ComfyUI, explaining how to set it up, optimize settings, and get smooth, high-quality results. Whether you’re aiming for simple motion clips or more dynamic animations, this workflow will help you leverage the full potential of WAN2.1 for creative video generation.
Models Download
- GGUF Models: The 14B WAN2.1 t2v models are available here. I have an RTX 3090 with 24GB VRAM and used wan2.1-t2v-14b-Q8_0.gguf. If you have less VRAM, use a smaller quantization such as Q5 or Q6. The GGUF model goes into the ComfyUI\models\unet directory. If you want to use the 1.3B model instead, download wan2.1_t2v_1.3B_fp16.safetensors, place it under ComfyUI\models\diffusion_models, and load it with the Load Diffusion Model node rather than the GGUF loader (see the path-check sketch after this list).
- Text Encoder: Download umt5_xxl_fp8_e4m3fn_scaled.safetensors and place it in ComfyUI\models\text_encoders
- VAE: Download wan_2.1_vae.safetensors and place it in ComfyUI\models\vae
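If you want to double-check that everything landed in the right place before launching ComfyUI, a minimal sketch like the following works. It assumes a default ComfyUI folder layout and the file names above; adjust COMFYUI_ROOT and the GGUF file name to match your setup.

```python
from pathlib import Path

# Adjust this to wherever your ComfyUI install lives (assumption: default layout).
COMFYUI_ROOT = Path(r"C:\ComfyUI")

# Expected locations for the files downloaded above.
expected = {
    "GGUF model":   COMFYUI_ROOT / "models" / "unet" / "wan2.1-t2v-14b-Q8_0.gguf",
    "Text encoder": COMFYUI_ROOT / "models" / "text_encoders" / "umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "VAE":          COMFYUI_ROOT / "models" / "vae" / "wan_2.1_vae.safetensors",
}

for name, path in expected.items():
    status = "OK" if path.exists() else "MISSING"
    print(f"{status:7s} {name}: {path}")
```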
Installation
- Update your ComfyUI to the latest version.
- Drag the full-size workflow image onto your ComfyUI canvas.
- Use ComfyUI Manager to install any missing nodes.
Nodes
This is the node to load the GGUF model.
Use this one instead if you are loading the 1.3B safetensors diffusion model. Connect its output to the ModelSamplingSD3 node.
This is where you specify the shift value. The default is 5.
Positive prompt and negative prompt. You can use ChatGPT to help generate the positive prompt; for example, ask it to “write a video generation prompt about a panda cooking”. I usually prepend “A realistic video showing ” for realistic video. The negative prompt is from WAN2.1’s default settings.
This controls the size of the video. Use a smaller resolution if you run out of VRAM. The length value is the number of frames; at a frame rate of 16, 33 frames is roughly 2 seconds of video (see the quick calculation below).
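The mapping from length to clip duration is just frames divided by frame rate. A tiny helper (the function name is mine, not part of the workflow) prints a few common values:

```python
def clip_duration_seconds(num_frames: int, fps: int = 16) -> float:
    # Duration is simply frame count divided by frame rate.
    return num_frames / fps

for frames in (17, 33, 49, 65):
    print(f"{frames} frames at 16 fps ≈ {clip_duration_seconds(frames):.2f} s")
# 33 frames at 16 fps ≈ 2.06 s, which matches the ~2 second clip mentioned above.
```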
This is the sampler. 30 steps is probably too many; you can try 20 steps. The default cfg is 5, and the sampler name is uni_pc.
This loads the VAE.
This combines the generated frames into a video. The default frame rate is 16. The crf value controls the quality of the video: lower numbers mean higher quality (and larger files), but don’t go lower than 17.
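Under the hood this kind of frame-combine node typically shells out to ffmpeg. A rough stand-alone equivalent, assuming your frames were saved as numbered PNGs in a frames/ folder (the paths and output name are placeholders), looks like this:

```python
import subprocess

# Encode numbered PNG frames into an H.264 MP4, mirroring the combine node's
# frame_rate and crf settings (lower crf = higher quality, larger file).
subprocess.run([
    "ffmpeg",
    "-framerate", "16",        # matches the default frame rate of 16
    "-i", "frames/%05d.png",   # assumed frame naming pattern
    "-c:v", "libx264",
    "-pix_fmt", "yuv420p",     # broad player compatibility
    "-crf", "19",              # quality knob; don't go below 17
    "output.mp4",
], check=True)
```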
Default Settings Summary
Looking through the code in the WAN2.1 repository, here are some of its default settings (collected into a small reference snippet after this list).
- Frame Rate: 16
- Shift: 5
- cfg: 5
- Sampler: uni_pc
- Negative prompt (in Chinese, copied verbatim from the repository): 色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走 (roughly: vivid colors, overexposed, static, blurry details, subtitles, style, artwork, painting, still frame, overall gray, worst quality, low quality, JPEG compression artifacts, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused fingers, motionless frame, cluttered background, three legs, crowded background, walking backwards)
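If you like keeping these defaults handy for notes or scripting, here they are as a plain Python dictionary (just a reference copy of the list above, not an API that ComfyUI exposes):

```python
# WAN2.1 t2v defaults as listed above (reference only).
WAN21_T2V_DEFAULTS = {
    "frame_rate": 16,
    "shift": 5,
    "cfg": 5,
    "sampler": "uni_pc",
    "negative_prompt": (
        "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,"
        "最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,"
        "画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,"
        "杂乱的背景,三条腿,背景人很多,倒着走"
    ),
}
```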
Examples
Prompt: A realistic video showing a cute, chubby panda, wearing a tiny chef’s apron, expertly flips food in a sizzling wok, sending a burst of flames and steam into the air. The camera captures a dynamic close-up, focusing on the panda’s fluffy paws skillfully tossing vibrant vegetables and spices in the pan. The rich aroma fills the warm, traditional Chinese kitchen, as the wok sizzles with energy, creating a brief but mesmerizing moment of culinary mastery.
Prompt:
A realistic video showing a breathtaking busty Greek goddess, draped in an elegant, flowing revealing white gown with golden accents, stands in a mythical mountaintop sanctuary bathed in soft sunlight. Her long, wavy hair cascades down her back, crowned with a delicate golden laurel. With a serene and commanding presence, she gently pets the massive head of a majestic dragon, its shimmering scales reflecting the light in hues of gold and emerald.
The camera moves in a slow cinematic pan, capturing the goddess’s graceful touch as the dragon’s eyes soften, exhaling a warm breath that swirls in the cool air. A gentle breeze lifts the fabric of her gown, and the scene shifts between close-ups—her delicate fingers tracing the dragon’s rugged scales—and wide shots showcasing the grandeur of the mythical landscape. The soft glow of the sun, the faint shimmer of magic in the air, and the dragon’s deep, rumbling purr create an awe-inspiring, ethereal atmosphere.
Prompt: A realistic video showing a stunning, busty Asian woman in a short, stylish revealing dress walks confidently down a sidewalk of a vibrant city street at dusk. Her long hair flows as neon lights reflect off the wet pavement. The camera smoothly tracks her movement, alternating between slow-motion close-ups—a soft smile, a flick of her hair—and wide shots of the bustling city around her. A gentle breeze lifts the hem of her dress as she strides gracefully, the urban soundscape blending distant chatter and passing cars, creating a cinematic, captivating atmosphere.
Running Time
I only tested 480 x 848 at 33 frames; it took about 11 minutes to generate the video on my Nvidia RTX 3090.
Conclusion
With WAN2.1’s T2V capabilities, generating AI-powered videos from text prompts becomes both accessible and highly customizable. By integrating this workflow into ComfyUI, users can experiment with different prompts, refine their outputs, and push the boundaries of AI-generated motion.
As T2V technology continues to evolve, we can expect even smoother, more detailed animations with better control over motion and style. Whether you’re a hobbyist or content creator, this guide provides a solid foundation to start generating AI-driven videos with just a few words.
Further Reading
Simple ComfyUI Workflow for WAN2.1 Image-to-Video (i2v) Using GGUF Models
This post may contain affiliate links. When you click on a link and purchase a product, we receive a small commission to keep us running. Thanks.