Seedance 2.0 natively generates audio and video together, with strong semantic understanding and multimodal interaction. You don't just describe a scene in words — you hand it images, video clips, and audio, and it weaves them into one coherent shot. This guide captures the official BytePlus prompting playbook for getting high-quality, on-brief results.
The key mental model: Seedance 2.0 is a multimodal AI director. It reads your text, images, video, and audio at the same time and internally splits everything into two layers — a spatial layer (what is in the frame) and a temporal layer (how things change over time).
So a good prompt isn't flowery copywriting — it's an engineering instruction: who, in what scene, doing what action, how the camera moves, and in what order events occur — each delivered to the right layer.
The basic formula (reference-based)
Seedance 2.0 can reference videos, images, and audio simultaneously — locking character appearance, action, visual style, and voice timbre — which dramatically lowers the bar for writing prompts. Reference-based generation splits into three task types (which can be combined).
Multimodal reference
Extract elements (subject, style, scene, sound effects) from your assets to generate a brand-new video. Good for action transfer, subject reuse, atmosphere reference.
Image reference
Reference <Subject_N> in <Image_N> to generate…
Video reference
Reference <Action / Camera_movement / Style / Sound_effect> in <Video_N> to generate…
Audio reference
Reference the timbre in <Audio_N> to generate…
Video editing
Make partial or global changes to an original video. Anything you don't mention stays unchanged. Good for local replacement, subject removal, attribute changes.
Add elements
Clearly describe <Element_Features> + <Timing> + <Location>
Modify elements
Strictly edit <Video_N>, and modify <Original_Characteristic> in it to <New_Characteristic>
Delete elements
Specify the elements to delete; for elements that must remain, emphasise them in the prompt.
Video extension
Continue an original video along the time axis, keeping audio-video style, subject, and narrative consistent. Good for continuing a plot, extending actions, completing clips.
Extend
Extend <Video_N> forward / backward to generate…
Track completion
<Video_1> + <Transition_Description> + followed by <Video_2> + <Transition_Description> + followed by <Video_3>
Tip: in extension and track-completion prompts, name the clip directly as <Video_N>. Do not write “reference <Video_N>”, or it may be mistaken for a reference task.
Combined tasks
You can reference one asset while editing another:
Reference [Reference_Dimension] of <Image/Video_N>, strictly edit <Video_X>, [Specific_Edits]
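The templates in this section are plain string patterns, so they can be filled programmatically. What follows is a minimal Python sketch, not an official BytePlus API: every function name is hypothetical, and the code only assembles prompt text from the formulas above.

```python
def multimodal_reference(element: str, asset: str, scene: str) -> str:
    """Fill 'Reference <element> in <asset> to generate ...'."""
    return f"Reference {element} in {asset} to generate {scene}"


def modify_element(video: str, old: str, new: str) -> str:
    """Fill the 'Modify elements' editing template."""
    return f"Strictly edit {video}, and modify {old} in it to {new}"


def extend_video(video: str, direction: str, content: str) -> str:
    """Fill the 'Extend' template.

    The clip is named directly; writing 'reference <Video_N>' here
    could be mistaken for a reference task.
    """
    if direction not in {"forward", "backward"}:
        raise ValueError("direction must be 'forward' or 'backward'")
    return f"Extend {video} {direction} to generate {content}"


def combined_task(dimension: str, ref_asset: str, edit_video: str, edits: str) -> str:
    """Reference one asset while editing another."""
    return f"Reference {dimension} of {ref_asset}, strictly edit {edit_video}, {edits}"


print(combined_task("the visual style", "Image 1", "Video 1",
                    "replace the car with a bicycle"))
```

Keeping each task type in its own function makes it harder to mix the wording of a reference task into an extension prompt by accident.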
The advanced formula
For a richer scene built from a text brief, deliver the elements in this order. The model gives earlier elements more weight, so lock who is doing what, then where, then how to shoot, then tighten with style, quality, and constraints.
The structure
precise subject + action details + scene / environment + lighting & colour tone + camera movement + visual style + image quality + constraints
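One way to keep that order fixed is to assemble the prompt from named parts. This is a sketch under assumptions: the key names and `build_prompt` are illustrative, and joining with commas is a choice made here, since the guide does not prescribe a separator.

```python
# Element order taken from the structure above; key names are illustrative.
ELEMENT_ORDER = (
    "subject", "action", "scene", "lighting",
    "camera", "style", "quality", "constraints",
)


def build_prompt(elements: dict[str, str]) -> str:
    """Join the supplied elements in the advanced-formula order."""
    unknown = sorted(set(elements) - set(ELEMENT_ORDER))
    if unknown:
        raise ValueError(f"unknown elements: {unknown}")
    ordered = [
        elements[key].strip()
        for key in ELEMENT_ORDER
        if elements.get(key, "").strip()  # skip missing or empty parts
    ]
    return ", ".join(ordered)


# Parts can be supplied in any order; the output order is always the same.
print(build_prompt({
    "constraints": "no deformation, stable face",
    "camera": "slow push-in",
    "subject": "a woman in a red dress",
    "action": "slowly raises a hand",
}))
```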
The next six sections break down each element — subject, shot sequencing, action, camera, quality/style/constraints, and full worked examples.
Define the subject
One image often contains several subjects, so to reference a specific one you must define it. A subject can be a person, a prop, or a scene. Pin it down with 2–3 clear, stable, static features (clothing, hairstyle, appearance, category) so it can be uniquely identified.
Basic definition
Define [Core_Subject_Features] in <Image/Video_N> as <Subject_N>
e.g. “Define the woman wearing a red dress and a straw hat in Image 1 as Subject 1.”
One subject across multiple assets
Define [features] in Image 1 and [features] in Image 2 as <Subject_N>
Multiple subjects
Define [features of Subject_1] in <Image/Video_N> as <Subject_1>, and define [features of Subject_2] as <Subject_2>…
e.g. “Define the tall man in Video 1 as police officer, and the short man as thief…”
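The definition patterns above can be generated from structured data, which helps keep labels consistent across a long prompt. A minimal sketch, assuming each subject is described by `(features, asset)` pairs; the function names are hypothetical.

```python
def define_subject(label: str, sightings: list[tuple[str, str]]) -> str:
    """Define one subject from one or more (features, asset) pairs."""
    if not sightings:
        raise ValueError("at least one (features, asset) pair is required")
    where = " and ".join(f"{features} in {asset}" for features, asset in sightings)
    return f"Define {where} as {label}"


def define_subjects(definitions: dict[str, list[tuple[str, str]]]) -> str:
    """Chain several definitions into one sentence."""
    if not definitions:
        raise ValueError("at least one subject is required")
    clauses = [define_subject(label, s) for label, s in definitions.items()]
    # Lower-case the leading 'Define' of every clause after the first.
    tail = [clause[0].lower() + clause[1:] for clause in clauses[1:]]
    return ", and ".join([clauses[0]] + tail) + "."


print(define_subjects({
    "police officer": [("the tall man", "Video 1")],
    "thief": [("the short man", "Video 1")],
}))
```

Because the labels are dictionary keys, each subject gets exactly one name, which matches the advice to reuse the same label every time.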
- <Subject_N>@<Image_N>, e.g. Zhang San@Image 1.
- For pre-defined subjects, reuse the same label consistently (always “police officer”, always “thief”).
- Using the asset library? You still refer with <Image/Video_N>; the model can't map an Asset ID to reference content directly.
- Keep it concise, avoid redundancy and contradictory features, and prefer expressing spatial relationships through reference images over long text.
Shot sequencing
Because the model decouples space and time internally, the ideal form for a complex video is a timeline storyboard: break it into shots and describe each in event order — who + where + doing what + how the camera moves. Label them Shot 1, Shot 2, Shot 3.
Weak: “A man runs nervously down the street, and the scene feels very cinematic.”
One vague blob: no shots, no order, no camera, nothing the spatial/temporal layers can act on.
Better: “Shot 1: Side shot of a street alley; the man slowly starts running, with a sense of rapid breathing. Shot 2: The man knocks over a fruit stand; the camera shakes quickly and gives a close-up of his frightened face. Shot 3: The man climbs over a low wall and disappears; the camera slowly pulls back and freezes on the empty street.”
Ordered shots, each with camera + action + space. The model paces it naturally.
Organise each shot in this order:
Camera / transition
How the shot moves or cuts.
e.g. “slowly push in from a wide shot; cut to…”
Action & expression
Key actions and changes of expression.
e.g. “lowers head, then can't hold back a smile”
Position / space
Where the subject is; spatial change.
e.g. “walks to the dormitory entrance”
Audio
Sound effects, voices, music for the shot.
e.g. “warm ambient room tone + light music”
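The per-shot order above (camera, action, position, audio) maps naturally onto a small record type. A sketch, assuming semicolons between parts and "Shot N:" prefixes as in the storyboard example; the `Shot` class and `storyboard` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    camera: str          # how the shot moves or cuts
    action: str          # key actions and changes of expression
    position: str = ""   # where the subject is; spatial change
    audio: str = ""      # sound effects, voices, music

    def render(self, number: int) -> str:
        """Render the shot with its parts in the recommended order."""
        parts = [self.camera, self.action, self.position, self.audio]
        body = "; ".join(part for part in parts if part)
        return f"Shot {number}: {body}."


def storyboard(shots: list[Shot]) -> str:
    """Number the shots in event order and join them into one prompt."""
    return " ".join(shot.render(i) for i, shot in enumerate(shots, start=1))


print(storyboard([
    Shot("slowly push in from a wide shot",
         "lowers head, then can't hold back a smile",
         "walks to the dormitory entrance",
         "warm ambient room tone + light music"),
    Shot("cut to a close-up", "looks up"),
]))
```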
Action description
Four rules make on-screen motion read as natural rather than rubbery:
Refine the body + quantify the degree
Be specific to hands, legs, head, shoulders, back — and add range, speed, force.
slowly raise a hand · quickly turn the head · push hard off the ground · slightly lower the head
Prefer slow, gentle, continuous motion
Favour small coherent movements; avoid high-burst, large-dynamic actions like sprinting, big jumps, violent rolls.
walk slowly · gently raise a hand · sit down naturally with the motion
Bridge actions with transitions
State the inertia/continuity between one action and the next so motion stays coherent.
use the inertia of turning around to naturally raise a hand
Externalise emotion as physical detail
Replace abstract words like “very sad” with concrete body language (see table).
not “nervous” → frequently checks watch, fingers tap the table, rapid breathing
Externalising emotion
| Abstract emotion | Externalise as actions & details |
|---|---|
| Sadness | Lowered head, shoulders trembling slightly, eyes reddening, fingers clutching the corner of clothing, tears welling but not falling |
| Joy | Corners of the mouth rising uncontrollably, brows relaxed, steps light, unconsciously humming, spinning in place |
| Nervousness | Frequently checking the watch, fingers tapping the tabletop, rapid breathing, eyes darting, biting fingernails |
| Anger | Both fists clenched, jaw tense, chest heaving, eyes sharp as knives, words squeezed through gritted teeth |
| Relief | Letting out a long breath, tense shoulders relaxing, a faint long-lost smile, looking up toward the distance |
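
The table can double as a lookup when prompts are built in code. A small sketch with cues copied from a few rows above; `externalise` is a hypothetical helper, and picking the first few cues is an arbitrary choice.

```python
# Cues copied from the table above (three of the five rows, for brevity).
EMOTION_CUES = {
    "sadness": ["lowered head", "shoulders trembling slightly", "eyes reddening",
                "fingers clutching the corner of clothing"],
    "nervousness": ["frequently checking the watch", "fingers tapping the tabletop",
                    "rapid breathing", "eyes darting", "biting fingernails"],
    "relief": ["letting out a long breath", "tense shoulders relaxing",
               "a faint long-lost smile", "looking up toward the distance"],
}


def externalise(emotion: str, count: int = 3) -> str:
    """Replace an abstract emotion word with concrete physical cues."""
    cues = EMOTION_CUES[emotion.lower()]  # KeyError for emotions not listed
    return ", ".join(cues[:count])


print(externalise("Nervousness"))
```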
Camera movement
The model understands camera terminology well, so use standard terms directly: medium shot, close-up, wide shot, slow push-in, smooth lateral tracking, fixed shot.
Quality, style & constraints
These three define the model's creative boundaries and keep output stable.
Image quality
Clarity, texture, lighting.
HD · rich details · cinematic texture · natural colours · soft lighting
Style
Overall art style + tone.
cyberpunk blue-purple · retro film · fresh Japanese style
Constraints
Forbid flaws & deviations.
no deformation · no flicker · stable face
Symbols & dialogue
Symbols tell the model which kind of information it's reading. Keep dialogue in one language — avoid mixing Chinese and English (proper nouns excepted).
| Information | Symbol | Example |
|---|---|---|
| Music | ( ) | (fast-paced rock music is playing in the background) |
| Sound effect | < > | < dog barking can be heard in the distance > |
| Dialogue | { } | {Hello, world}. For other languages, mark it: says in Japanese {こんにちは} |
| Subtitles | 【 】 | 【Chapter One: Departure】 |
Asset strategy
Assets play four functional roles. A focused set beats a crowded one.
Character anchoring
Lock the character's appearance.
Scene tone-setting
Lock the environment and style.
Camera-movement reference
Lock the shot language and action rhythm.
Rhythmic atmosphere
Use audio to control emotion and timbre.
Long take vs. stitching. Use a continuous take (video extension) for single-scene dialogue, emotional progression, and movement along one path — immersive and coherent. Use segmented stitching for plot turns and fast action (chases, fights, montages) — independent clips edited together for rhythm and impact. In practice, combine both: extend a coherent conversation, then stitch in transitions.
Worked examples
Two official end-to-end examples showing the advanced formula in action — assets first, then a shot-by-shot prompt.
1 · Dormitory short drama (dialogue-focused)
Assets — @Image 1: female lead half-body · @Image 2: dormitory scene · @Video 1: indoor dialogue camera movement · @Audio 1: room ambience / light music.
Use the girl in @Image 1 as the main character, use @Image 2 as the dormitory scene style reference, and refer to the camera movement in @Video 1. Shot 1: At dusk, girl @Image 1 walks briskly to the dormitory entrance @Image 2. The camera follows steadily in a medium shot; warm sunlight spills into the hallway; she pauses at the doorway, takes a deep breath, looks slightly nervous. Shot 2: Girl @Image 1 pushes the door open; cut to an indoor medium shot; her roommates look up while organising books; one smiles and asks {How did the exam go? Did you pass?}; the camera slowly cuts between half-body close-ups. Shot 3: Girl @Image 1 first lowers her head, dejected (close-up), then raises it, laughs, and says {I was kidding}; roommates play-fight with her; the camera slowly pulls back and freezes on a wide shot. High-definition cinematic documentary style, warm tones, soft lighting; the face stays stable without deformation; motion natural, no stutter; ambient sound blends with @Audio 1.
2 · Cliff confrontation (action / atmosphere)
Assets — @Image 1: woman in red · @Image 2: woman in black (opponent) · @Image 3: cliff & bamboo forest · @Video 1: martial-arts camera movement · @Audio 1: tight drums / fight SFX.
Use the woman in red from @Image 1 as the lead, the woman in black from @Image 2 as the opponent, the cliff and bamboo forest in @Image 3 as the scene, refer to the camera movement and action rhythm in @Video 1, and sync background SFX with @Audio 1. Shot 1: At dusk, the camera slowly pushes in from a side medium shot of woman in red @Image 1 at the cliff edge lifting a wine flask; sleeves sway in the wind; the camera circles from front to back; a figure in black is faint in the bamboo. Shot 2: Zoom and fade to a long drone shot over the cliff; the two stand at opposite ends; wind lifts their robes; rhythm accelerates with the drums. Shot 3: Cut back to a ground-level close shot; both draw swords; woman in red @Image 1 shifts to a cold gaze; woman in black @Image 2 looks determined, sword tip trembling; the camera follows them circling, freezing on the instant before the swords meet. Cinematic wuxia in misty rain, cool tones, low saturation, film-grain texture; faces and proportions stable; motion continuous, no clipping or stutter.
Troubleshooting
The official FAQ: the failure modes you'll actually hit, and how to fix each. (Some can only be reduced, not fully eliminated.)
Character ID drift
Symptom — The character looks different from the reference, or “face-swaps” mid-video (sometimes resembling a celebrity and getting blocked).
- Add a dedicated close-up headshot (face only, neutral expression, minimal background) alongside the full-body photo.
- Define it explicitly: facial features → headshot; makeup & styling → full-body photo.
- Place the most reference-critical assets earliest in the prompt.
- Use headshot + full-body only — avoid multi-view images, which the model may read as different people.
Unexpected subtitles
Symptom — Subtitles appear even though you never asked for them.
- Add explicit constraints: “keep it subtitle-free”, “avoid generating any text or subtitles”.
- Remove text from reference images/videos first (e.g. with Seedream/Seedance editing).
- Prefer landscape output — it produces subtitles far less often than portrait; crop later.
Logo or watermark
Symptom — A logo/watermark from another platform shows up unprompted.
- Add explicit constraints: “do not generate watermarks” and “do not generate logos”.
Style drift
Symptom — You want 2D/3D anime but a realistic reference pulls the result toward live-action.
- Add explicit style constraints, e.g. “2D Japanese anime style” or “3D Chinese-style comic”.
- For precise control, convert the reference image into the target style before generating.
Jump cuts at extension joins
Symptom — Frame jumps or rollback where an extended clip meets the original.
- Align keyframes in post (CapCut): trim 6 frames from the end of the previous segment and 1 frame from the start of the next.
- Repeat at every join; end a clip on a transition cut and start the next from the new scene.
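The frame counts above translate into time offsets once the frame rate is known. A sketch of the arithmetic: the guide does not state an output frame rate, so `fps` is an input you supply, and both function names are hypothetical.

```python
def join_trims(fps: float, tail_frames: int = 6, head_frames: int = 1) -> tuple[float, float]:
    """Seconds to trim from the end of one clip and the start of the next."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return tail_frames / fps, head_frames / fps


def stitched_frame_count(segment_frames: list[int],
                         tail_frames: int = 6, head_frames: int = 1) -> int:
    """Total frames left after trimming at every join between segments."""
    joins = max(len(segment_frames) - 1, 0)
    return sum(segment_frames) - joins * (tail_frames + head_frames)


tail_s, head_s = join_trims(fps=24)
print(tail_s, head_s)                          # trim lengths in seconds
print(stitched_frame_count([240, 240, 240]))   # three clips, two joins
```

At 24 fps the tail trim is a quarter of a second per join, so a long chain of extensions loses a noticeable amount of footage.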
Duplicated characters (“twins”)
Symptom — Two identical characters appear in one frame, especially with crowded scenes or multi-view references.
- Mark each character's reference image after their name (e.g. “Zhang San (image 1)”), kept consistent.
- Add a global constraint forbidding identical duplicate characters / twin effects.
- Prefer independent single-person photos over three-view or multi-view assets; simplify the prompt.
Quality degradation on extension
Symptom — Re-using model output as an extension input degrades quality; mottled blocks appear on faces, compounding over multiple continuations.
- Convert the source to a pure-white 3D-model video first, then continue from that.
- Prefer high-definition reference assets, and limit how many times you chain continuations.
Special effects off-brief
Symptom — A text-described effect (e.g. a countdown) renders with wrong/random logic.
- Define the effect with a reference video so the model learns its exact form and motion.
Too many reference characters (>4)
Symptom — Beyond 4 reference people, you get the wrong count or duplicates.
- Generate images in groups of ≤4 people first.
- Then use those grouped images as references for the final image-to-video.
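Splitting a cast into groups of at most four is a simple chunking step. A sketch with a hypothetical helper; it keeps the original order and does not try to balance group sizes.

```python
def group_characters(names: list[str], max_per_group: int = 4) -> list[list[str]]:
    """Split a cast list into consecutive groups of at most max_per_group."""
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    return [names[i:i + max_per_group] for i in range(0, len(names), max_per_group)]


cast = ["Ana", "Ben", "Chloe", "Dev", "Elif", "Farid"]
for number, group in enumerate(group_characters(cast), start=1):
    # Each group would become one intermediate reference image.
    print(f"Group image {number}: {', '.join(group)}")
```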
Noise at the end (narration)
Symptom — Clicking / cut-off noise at the end of videos with voiceover.
- Regenerate, or apply an audio fade-out via the volume envelope (CapCut) — drag the final keyframe to 0.
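For readers working outside CapCut, the same fade-out idea can be expressed as a gain ramp over the last audio samples. This is a pure-Python sketch on a list of floats, not a CapCut or Seedance feature; a linear ramp is assumed, and `fade_out` is a hypothetical name.

```python
def fade_out(samples: list[float], fade_len: int) -> list[float]:
    """Return a copy whose last fade_len samples ramp linearly down to 0."""
    if fade_len < 0:
        raise ValueError("fade_len must not be negative")
    fade_len = min(fade_len, len(samples))
    start = len(samples) - fade_len
    faded = list(samples)  # leave the input untouched
    for i in range(fade_len):
        gain = 1.0 - (i + 1) / fade_len  # reaches 0.0 on the final sample
        faded[start + i] *= gain
    return faded


print(fade_out([1.0, 1.0, 1.0, 1.0], fade_len=2))  # → [1.0, 1.0, 0.5, 0.0]
```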
Inaccurate Chinese pronunciation
Symptom — Polyphonic, rare, or look-alike characters get mispronounced.
- Swap hard words for common same-sound homophones (e.g. 螭龙山 → 吃龙山). An optimisation, not a guarantee.
Inaccurate voice reference
Symptom — The generated voice differs noticeably from the reference audio.
- Add detailed voice-characteristic descriptions (e.g. “low, thick, warm, finely grainy middle-aged male voice of @Audio 1”).
- Keep the line's tone and delivery close to the reference audio.
Template library
Reusable patterns straight from the official appendix — fill in the brackets. Upload assets in the order you want them referenced, then map them with Image 1…N / Video 1…N.
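The upload-order mapping can be sketched as a small labelling step. The assumptions here are that numbering is per asset type, as in the worked examples (Image 1, Image 2, Video 1, Audio 1), and that audio follows the same scheme; `label_assets` is a hypothetical helper, not part of any official tooling.

```python
from collections import Counter

ASSET_KINDS = {"Image", "Video", "Audio"}


def label_assets(uploads: list[tuple[str, str]]) -> dict[str, str]:
    """Map (kind, description) pairs in upload order to 'Image 1', 'Video 1', ..."""
    counts: Counter[str] = Counter()
    labels: dict[str, str] = {}
    for kind, description in uploads:
        kind = kind.capitalize()
        if kind not in ASSET_KINDS:
            raise ValueError(f"unsupported asset kind: {kind}")
        counts[kind] += 1  # numbering restarts for each asset kind
        labels[f"{kind} {counts[kind]}"] = description
    return labels


for label, description in label_assets([
    ("image", "female lead half-body"),
    ("image", "dormitory scene"),
    ("video", "indoor dialogue camera movement"),
    ("audio", "room ambience / light music"),
]).items():
    print(f"{label}: {description}")
```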
Text generation
Slogan
[Text Content] + [Timing] + [Positioning] + [Entrance / Appearance Style], [Visual Attributes (colour, font style)]
Subtitles
Display subtitles at the bottom-center with the text. The subtitles must be perfectly synchronized with the audio rhythm and pacing.
Speech bubbles
[Character] says, “[Dialogue].” Speech bubbles appear around the character containing the spoken text.
Image reference
Multi-perspective subject
Refer to / Extract / Combine / Use the [Subject] from [Image N] to generate [Scene Description], maintaining consistent [Subject] features.
Multi-image
Refer to / Extract / Combine / Follow the [Referenced elements] from [Image N] to generate [Scene Description], maintaining the consistency of [Referenced Elements].
Video reference
Motion
Refer to the [Motion Description] from [Video N] to generate [Scene Description], keeping the motion details consistent.
Camera motion
Refer to the [Camera Movement Description] from [Video N] to generate [Scene Description], keeping the scene consistent.
Special effects
Refer to the [Special-effects description] from [Video N] to generate [Scene Description], keeping the special effects consistent.
Video editing
Add
At [Timestamp/Timing] and [Spatial Location] of [Video N], add [Description of intended element].
Remove
Remove [Element] from [Video N], keeping the rest of the video content unchanged.
Modify
Replace [Element to change] in [Video N] with [Intended element].
Extend
Extend [Video N] forward / backward + [content] · Generate content before / after [Video N] + [content]
Complete tracks
[Video 1] + [Transition Description] + followed by [Video 2] + [Transition Description] + followed by [Video 3]
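Filling the brackets can be automated with a small substitution helper that refuses to leave a placeholder empty. A sketch using Python's standard `re` module; `fill_template` is a hypothetical name, and it treats every `[...]` span as a placeholder.

```python
import re

PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace each [Placeholder] with its value; fail on any missing one."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder [{name}]")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


print(fill_template(
    "Remove [Element] from [Video N], keeping the rest of the video content unchanged.",
    {"Element": "the red umbrella", "Video N": "Video 1"},
))
```

Failing loudly on a missing value avoids sending a prompt that still contains a literal bracketed placeholder.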
Your turn
Bring your assets. Direct the shot.
Seedance 2.0 lives inside Ekly. Drop in your images, clips, and audio, write a shot-by-shot prompt, and let the multimodal director do the rest.