You've probably seen the clips circulating this week. Someone types a sentence - "a fox running through a snowy forest at sunset" - and out comes a cinematic, fluid, 5-second video. That's Seedance 2.0, ByteDance's latest text-to-video model, and it's genuinely impressive.

My first reaction was: how? Is this just a giant language model that somehow learned to draw frames? The answer is both yes and no - and the technical reality is far more interesting than it might seem.

The official 2.0 technical report isn't out yet, but I spent time reading the Seedance 1.0 paper, which lays out the full architecture. Here's what I learned, translated into plain English.


First: How Is This Different From an LLM?

When ChatGPT or Claude generates text, the mechanism is relatively elegant: it reads your prompt, breaks it into small chunks called tokens, and predicts which token should come next - one at a time. The whole thing is one-dimensional: a sequence of tokens, read from left to right.
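
To make "one token at a time" concrete, here is a toy sketch of that loop. The BIGRAM table is made up and stands in for a trained model; a real LLM scores every token in a large vocabulary using the whole context, not just the last token.

```python
# Toy greedy next-token loop. BIGRAM is a hypothetical lookup table that
# stands in for a trained language model.
BIGRAM = {
    "a": "fox",
    "fox": "runs",
    "runs": "through",
    "through": "snow",
    "snow": "<eos>",
}

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # "Predict" the next token from the last one; unknown tokens end the text.
        next_token = BIGRAM.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # the sequence grows one token at a time
    return tokens

print(generate(["a"]))  # ['a', 'fox', 'runs', 'through', 'snow']
```

The point is the shape of the loop: each step appends exactly one discrete token to a 1D sequence.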

Video is a completely different beast. A 5-second clip at 1080p contains over a hundred frames (120 at a typical 24 fps). Each frame is a grid of roughly two million pixels. And crucially, every frame has to make sense relative to the ones before and after it - a running fox's legs need to move continuously, the lighting needs to stay consistent, the camera movement needs to feel smooth.

An LLM models language as a 1D sequence of discrete tokens. A video model must model reality as a 3D continuous flow through space and time.

Same Transformer DNA. Completely different physics.


Step 1 - Compression: Making Video Digestible

The first challenge is sheer size. You can't feed raw pixels into a Transformer - there are simply too many. So Seedance uses a Variational Autoencoder (VAE) to compress the video into a compact "latent" representation before any generation happens.

Think of it like zipping a file, except that the compression is lossy: the VAE's encoder learns to squeeze the visual information into a much smaller form, and its decoder learns to unzip it back into sharp video at the end.

Seedance's VAE uses a trick called temporal-causal compression: the compressed representation of any given frame only looks at that frame and the ones before it, never future ones. This ensures the model can't "cheat" by peeking ahead when building its understanding of motion.
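
Here is a minimal sketch of what "causal in time" means, using a tiny 1D filter over frames. This is not Seedance's VAE; the function name, the weights, and the one-number-per-frame simplification are all assumptions for illustration.

```python
def causal_temporal_filter(frames, weights):
    """Mix each frame with its predecessors only.

    frames:  one float per frame (a stand-in for a full image)
    weights: weights[0] applies to the current frame, weights[1] to the
             previous one, and so on. Frames before the start count as 0.
    """
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(weights):
            if t - k >= 0:  # only reach backwards, never past frame t
                acc += w * frames[t - k]
        out.append(acc)
    return out

clip = [1.0, 2.0, 3.0, 4.0]
edited = [1.0, 2.0, 3.0, 99.0]  # only the last frame differs
w = [0.5, 0.3, 0.2]             # hypothetical filter weights

a = causal_temporal_filter(clip, w)
b = causal_temporal_filter(edited, w)
print(a[:3] == b[:3])  # True: earlier outputs never saw the changed frame
```

Changing a later frame leaves every earlier output untouched, which is exactly the "no peeking ahead" property.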

The compression ratios are aggressive: 16× in both height and width, and 4× across time. Together that cuts the number of positions the model has to handle by roughly 1,024×, which is a big part of why a Transformer can attend to a whole video clip without melting your GPU.
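
A quick back-of-the-envelope calculation shows what those ratios buy you. The 24 fps frame rate and the round-up behavior at the edges are my assumptions; a real causal VAE may treat the first frame specially, and the Transformer may group latents into patches, so treat the numbers as orders of magnitude.

```python
import math

def latent_grid(frames, height, width, t_ratio=4, s_ratio=16):
    """Latent (frames, height, width) after compression, rounding up."""
    return (math.ceil(frames / t_ratio),
            math.ceil(height / s_ratio),
            math.ceil(width / s_ratio))

# Assumed clip: 5 seconds at 24 fps, 1920x1080.
frames, height, width = 5 * 24, 1080, 1920

raw_positions = frames * height * width
lt, lh, lw = latent_grid(frames, height, width)
latent_positions = lt * lh * lw

print(raw_positions)                             # 248832000
print((lt, lh, lw))                              # (30, 68, 120)
print(latent_positions)                          # 244800
print(round(raw_positions / latent_positions))   # 1016
# Full self-attention compares every position with every other one:
print(latent_positions ** 2)
```

The reduction lands slightly under 1,024× here only because 1080 is not a multiple of 16. Since attention cost grows with the square of the sequence length, shrinking the sequence by about a thousand shrinks the pairwise work by about a million.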


Step 2 - Generation: Denoising, Not Predicting

Here's where things diverge sharply from LLMs. Instead of predicting the next token, video models use a process called diffusion: they start from pure random noise and gradually refine it - step by step - into a coherent video.
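
The loop below is a toy version of that idea. Two loud assumptions: the straight-line path between noise and clean signal (a flow-matching-style setup) is my choice for simplicity, and `oracle_velocity` cheats by knowing the clean answer. In a real system a trained network predicts that direction from the noisy input and the text prompt.

```python
import random

def sample(velocity_fn, noise, steps=20):
    """Walk from pure noise (t=1) to a clean signal (t=0) in small Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt          # current noise level, going from 1 down to dt
        v = velocity_fn(x, t)     # a trained model would predict this
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

# Toy "clean video": four numbers standing in for a latent.
clean = [0.2, -0.5, 0.9, 0.0]

def oracle_velocity(x, t):
    # Stand-in for the network. On the path x_t = clean + t * (noise - clean),
    # the direction pointing away from the clean signal is (x_t - clean) / t.
    return [(xi - ci) / t for xi, ci in zip(x, clean)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in clean]
result = sample(oracle_velocity, noise)

error = max(abs(r - c) for r, c in zip(result, clean))
print(error < 1e-9)  # True: the noise has been refined into the clean signal
```

Notice that nothing is appended along the way. The whole sample exists from the first step and is sharpened in place, which is the key contrast with next-token prediction.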

The backbone doing this refinement is a Diffusion Transformer (DiT) - yes, the same Transformer architecture used in LLMs, but heavily adapted for 3D spatiotemporal data.

A few clever design choices make this work at scale:

  • Decoupled spatial and temporal layers. Some Transformer layers focus on what each individual frame looks like (spatial structure). Others focus on how things move across frames (temporal motion). Separating these concerns makes the model much more efficient.
  • MM-RoPE (Multi-Modal Rotary Positional Encoding). LLMs like Llama use RoPE to tell the model where each token sits in a sequence. Seedance extends this to 3D - giving every patch of every frame a position in both space and time. Knowing where and when each patch sits is part of what lets the model keep a character consistent across a camera cut.
  • Unified task handling. The same model weights handle text-to-video, image-to-video, and text-to-image by framing them all as a conditional masking problem. One architecture, three capabilities.
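
To give a feel for the 3D positional idea in the list above, here is a generic 3D rotary encoding sketch. It is not Seedance's MM-RoPE: the equal three-way split of the vector, the function names, and the base of 10000 are assumptions, and the multi-modal part (how text tokens get positions) is left out entirely.

```python
import math

def rotate_pairs(vec, position, base=10000.0):
    """Standard 1D RoPE: rotate consecutive pairs by position-dependent angles."""
    out = []
    half = len(vec) // 2
    for i in range(half):
        theta = position * base ** (-i / half)
        a, b = vec[2 * i], vec[2 * i + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out

def rope_3d(vec, t, y, x):
    """Split vec into three equal chunks; rotate each by one coordinate."""
    n = len(vec) // 3  # assumes len(vec) is a multiple of 6
    return (rotate_pairs(vec[:n], t)
            + rotate_pairs(vec[n:2 * n], y)
            + rotate_pairs(vec[2 * n:], x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, 0.2, -0.6, 0.8, 0.1, -0.3]
k = [1.0, 0.4, -0.2, 0.6, 0.3, -0.7, 0.5, 0.9, -1.1, 0.2, 0.7, 0.4]

# Same relative offset (1 frame later, 2 rows down, 3 columns right)
# at two different absolute locations in the clip.
s1 = dot(rope_3d(q, 0, 0, 0), rope_3d(k, 1, 2, 3))
s2 = dot(rope_3d(q, 5, 7, 9), rope_3d(k, 6, 9, 12))
print(abs(s1 - s2) < 1e-9)  # True: the score depends only on the offset
```

The useful property is the last line: the attention score between two patches depends on how far apart they are in time, height, and width, not on where they sit in absolute terms.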

Step 3 - Training: A Multi-Stage Pipeline

You can't train a video model in one shot. Seedance follows a staged approach that mirrors how LLM training pipelines have evolved:

Pre-training starts on images at low resolution, then progressively adds video clips and increases resolution. Higher-resolution frames get more noise added during training - a technique called resolution-aware noise scheduling. The usual motivation is that a large frame stays recognizable under noise that would wipe out a small one, so it needs more noise to pose an equally hard denoising problem.
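
One common way to implement "more noise at higher resolution" is a timestep shift of the kind described for Stable Diffusion 3. Whether Seedance uses this exact formula is an assumption on my part, and the 256x256 base resolution and the `shift_for_resolution` rule are hypothetical.

```python
import math

def shift_timestep(t, shift):
    """Map a noise level t in [0, 1] to a noisier one when shift > 1.

    The endpoints stay fixed (0 -> 0, 1 -> 1); values in between move toward 1.
    """
    return shift * t / (1.0 + (shift - 1.0) * t)

def shift_for_resolution(height, width, base_height=256, base_width=256):
    # Hypothetical rule: scale with the square root of the pixel-count ratio.
    return math.sqrt((height * width) / (base_height * base_width))

for h, w in [(256, 256), (512, 512), (1024, 1024)]:
    s = shift_for_resolution(h, w)
    print(h, w, s, round(shift_timestep(0.5, s), 4))
```

With these assumptions, a noise level of 0.5 at the base resolution becomes about 0.67 at 512x512 and 0.8 at 1024x1024, so bigger frames spend more of training at high noise.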

Continued training shifts the balance toward image-to-video tasks and filters the data for clips with real, interesting motion - using optical flow analysis to weed out static or boring footage. Captions are split into "full scene descriptions" and "motion-only descriptions" to force the model to separately learn what things look like versus how they move.
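
The motion filter can be pictured as a simple threshold on average optical flow. This sketch assumes the flow vectors have already been computed by some estimator; the `min_motion` threshold and both helper names are hypothetical.

```python
import math

def mean_flow_magnitude(flow):
    """flow: list of (dx, dy) displacement vectors, in pixels per frame."""
    return sum(math.hypot(dx, dy) for dx, dy in flow) / len(flow)

def keep_clip(flow, min_motion=1.0):
    # min_motion is a made-up threshold, chosen only for this example.
    return mean_flow_magnitude(flow) >= min_motion

static_clip = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0)]
moving_clip = [(3.0, 4.0), (0.0, 2.0), (1.0, 0.0)]

print(keep_clip(static_clip))  # False: almost nothing moves
print(keep_clip(moving_clip))  # True
```

A real pipeline would likely also guard against the opposite failure, such as shaky footage where everything moves, but the source does not go into that.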

Supervised fine-tuning uses human-curated datasets across hundreds of specific motion categories - animal locomotion, human gestures, fluid dynamics. Specialist models are trained separately for different motion types, then merged back into a single model.
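
I don't know the exact merging recipe, so the sketch below uses plain weight averaging, which is one common way to merge models that share an architecture. The parameter names and the uniform coefficients are assumptions.

```python
def merge_weights(models, coefficients=None):
    """Weighted average of models that share parameter names and shapes."""
    if coefficients is None:
        coefficients = [1.0 / len(models)] * len(models)  # uniform by default
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(c * m[name][i] for c, m in zip(coefficients, models))
            for i in range(size)
        ]
    return merged

# Two tiny hypothetical specialists with identical parameter layouts.
animal_specialist = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
gesture_specialist = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

print(merge_weights([animal_specialist, gesture_specialist]))
# {'layer.weight': [2.0, 3.0], 'layer.bias': [0.5]}
```

This only works because the specialists start from the same pre-trained model, so their weights stay close enough for an average to be meaningful.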


Step 4 - Alignment: Video RLHF

LLMs famously use RLHF (Reinforcement Learning from Human Feedback) to align outputs with human preferences. Video models do the same - but the failure modes are different. Instead of hallucinated facts, you get physical hallucinations: flickering limbs, hands with seven fingers, objects that teleport between frames.

Seedance trains three separate reward models to catch these:

  • Foundational reward model - scores prompt alignment and structural stability using a Vision-Language Model
  • Motion reward model - penalizes artifacts and rewards vivid, dynamic movement
  • Aesthetic reward model - evaluates keyframes for professional composition, lighting quality, and visual appeal
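
Here is a rough sketch of how three scores like these could be combined into one signal. How Seedance actually combines them is not something I can confirm; the equal weights, the `Scores` class, and the example numbers are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    foundational: float  # prompt alignment and structural stability
    motion: float        # fewer artifacts, livelier movement
    aesthetic: float     # composition, lighting, visual appeal

def combined_reward(s, weights=(1.0, 1.0, 1.0)):
    # Equal weights are an assumption, not a documented formula.
    wf, wm, wa = weights
    return wf * s.foundational + wm * s.motion + wa * s.aesthetic

candidates = {
    "clip_a": Scores(foundational=0.9, motion=0.2, aesthetic=0.8),  # pretty, but limbs flicker
    "clip_b": Scores(foundational=0.8, motion=0.7, aesthetic=0.7),
}
best = max(candidates, key=lambda name: combined_reward(candidates[name]))
print(best)  # clip_b
```

Splitting the reward three ways means a gorgeous clip with flickering limbs can still lose to a slightly plainer one that moves correctly.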

Crucially, RLHF is also applied to the super-resolution refiner - the model that upscales the base 480p output to 1080p. This ensures the final step doesn't introduce new artifacts while sharpening the image.


The TL;DR Comparison

Feature             | LLMs               | Video Models
Input/Output        | Discrete tokens    | Continuous latent vectors
Dimensionality      | 1D - sequential    | 3D - space + time
Core task           | Predict next token | Denoise toward clean video
Positional encoding | RoPE (1D)          | MM-RoPE (3D: x, y, time)
Alignment           | RLHF (semantic)    | RLHF (motion + aesthetics)
Compression         | Tokenizer lookup   | VAE encoder/decoder

What This Means

Seedance 2.0 going viral isn't just a fun demo moment - it's a signal that the entire LLM playbook (scaling laws, RLHF, instruction tuning, staged training) is being successfully ported to video, with new layers of complexity added for physical plausibility and temporal coherence.

The gap between "AI-looking" video and genuinely cinematic output is closing fast. The architecture described in the 1.0 report - and presumably extended further in 2.0 - treats video generation not as a parlor trick but as a serious spatiotemporal modeling problem.

Worth watching closely.