AI Video · 18 min read · June 2026

The Complete Guide to Prompting Seedance 2.0

The official playbook for directing Seedance 2.0 — the multimodal model that reads your images, video, and audio together. Formulas, subject binding, shot sequencing, the symbol and emotion tables, the full FAQ, and a template library.

Seedance 2.0 natively generates audio and video together, with strong semantic understanding and multimodal interaction. You don't just describe a scene in words — you hand it images, video clips, and audio, and it weaves them into one coherent shot. This guide captures the official BytePlus prompting playbook for getting high-quality, on-brief results.

The key mental model: Seedance 2.0 is a multimodal AI director. It reads your text, images, video, and audio at the same time and internally splits everything into two layers — a spatial layer (what is in the frame) and a temporal layer (how things change over time).

So a good prompt isn't flowery copywriting — it's an engineering instruction: who, in what scene, doing what action, how the camera moves, and in what order events occur — each delivered to the right layer.

Two ways to prompt
Use the basic formula when you're driving generation from reference assets (images / videos / audio). Use the advanced formula when you're directing a richer, multi-shot scene from a detailed text brief. They combine freely.

Section 01

The basic formula (reference-based)

Seedance 2.0 can reference videos, images, and audio simultaneously — locking character appearance, action, visual style, and voice timbre — which dramatically lowers the bar for writing prompts. Reference-based generation splits into three task types (which can be combined).

Multimodal reference

Extract elements (subject, style, scene, sound effects) from your assets to generate a brand-new video. Good for action transfer, subject reuse, atmosphere reference.

Image reference

Reference <Subject_N> in <Image_N> to generate…

Video reference

Reference <Action / Camera_movement / Style / Sound_effect> in <Video_N> to generate…

Audio reference

Reference the timbre in <Audio_N> to generate…

Video editing

Make partial or global changes to an original video. Anything you don't mention stays unchanged. Good for local replacement, subject removal, attribute changes.

Add elements

Clearly describe <Element_Features> + <Timing> + <Location>

Modify elements

Strictly edit <Video_N>, and modify <Original_Characteristic> in it to <New_Characteristic>

Delete elements

Specify the elements to delete; for elements that must remain, emphasise them in the prompt.

Video extension

Continue an original video along the time axis, keeping audio-video style, subject, and narrative consistent. Good for continuing a plot, extending actions, completing clips.

Extend

Extend <Video_N> forward / backward to generate…

Track completion

<Video_1> + <Transition_Description> + followed by <Video_2> + <Transition_Description> + followed by <Video_3>

Edit & extend: don't say “reference”
For edit / extend tasks, refer to the clip directly as <Video_N> — do not write “reference <Video_N>”, or it may be mistaken for a reference task.

Combined tasks

You can reference one asset while editing another:

Reference [Reference_Dimension] of <Image/Video_N>, strictly edit <Video_X>, [Specific_Edits]
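
The reference, edit, extend, and combined patterns above are plain sentence templates, so they can be assembled in code. The following is a minimal Python sketch. The function names are hypothetical helpers that only build prompt text; they are not part of any Seedance or BytePlus API.

```python
def reference_prompt(element: str, asset: str, goal: str) -> str:
    """Reference task: 'Reference <element> in <asset> to generate ...'."""
    return f"Reference {element} in {asset} to generate {goal}"


def edit_prompt(video: str, original: str, new: str) -> str:
    """Edit task: the clip is named directly, without the word 'reference'."""
    return f"Strictly edit {video}, and modify {original} in it to {new}"


def extend_prompt(video: str, direction: str, content: str) -> str:
    """Extension task: direction is 'forward' or 'backward'."""
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")
    return f"Extend {video} {direction} to generate {content}"


def combined_prompt(dimension: str, ref_asset: str, video: str, edits: str) -> str:
    """Combined task: reference one asset while editing another."""
    return f"Reference {dimension} of {ref_asset}, strictly edit {video}, {edits}"
```

For example, `combined_prompt("the visual style", "Image 1", "Video 2", "replace the sky with a sunset")` keeps the word "reference" on the image and names the edited clip directly.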

Section 02

The advanced formula

For a richer scene built from a text brief, deliver the elements in this order. The model gives earlier elements more weight, so lock who is doing what, then where, then how to shoot, then tighten with style, quality, and constraints.

The structure

precise subject + action details + scene / environment + lighting & colour tone + camera movement + visual style + image quality + constraints

The following sections break down each element (subject, shot sequencing, action, camera, and quality, style and constraints), then cover symbols and asset strategy before two full worked examples.
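
One way to keep the element order fixed is to store each element in a named field and always render the fields in the same sequence. A minimal Python sketch, assuming a hypothetical `AdvancedPrompt` class and a semicolon separator (both are illustrative choices, not something the playbook prescribes):

```python
from dataclasses import dataclass, fields


@dataclass
class AdvancedPrompt:
    # Field order mirrors the advanced formula, earliest element first.
    subject: str = ""
    action: str = ""
    scene: str = ""
    lighting: str = ""
    camera: str = ""
    style: str = ""
    quality: str = ""
    constraints: str = ""

    def render(self) -> str:
        """Join the non-empty elements in formula order."""
        parts = [getattr(self, f.name).strip() for f in fields(self)]
        return "; ".join(p for p in parts if p)
```

Because `render` walks the fields in declaration order, the subject and action always come first no matter which keyword arguments are filled in or in what order.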

Section 03

Define the subject

One image often contains several subjects, so to reference a specific one you must define it. A subject can be a person, a prop, or a scene. Pin it down with 2–3 clear, stable, static features (clothing, hairstyle, appearance, category) so it can be uniquely identified.

Basic definition

Define [Core_Subject_Features] in <Image/Video_N> as <Subject_N>

e.g. “Define the woman wearing a red dress and a straw hat in Image 1 as Subject 1.”

One subject across multiple assets

Define [features] in Image 1 and [features] in Image 2 as <Subject_N>

Multiple subjects

Define [features of Subject_1] in <Image/Video_N> as <Subject_1>, and define [features of Subject_2] as <Subject_2>…

e.g. “Define the tall man in Video 1 as police officer, and the short man as thief…”

Refer to subjects explicitly — every time
  • For undefined subjects, re-bind on every mention with <Subject_N>@<Image_N>, e.g. Zhang San@Image 1.
  • For pre-defined subjects, reuse the same label consistently (always “police officer”, always “thief”).
  • Using the asset library? You still refer with <Image/Video_N>, because the model can't map an Asset ID to reference content directly.
  • Keep it concise, avoid redundancy and contradictory features, and prefer expressing spatial relationships through reference images over long text.
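
The definition and re-binding patterns can be sketched as two small string helpers. This is a minimal Python sketch; `define_subject` and `bind` are hypothetical names that only format text according to the templates above.

```python
def define_subject(features: str, asset: str, label: str) -> str:
    """Pre-define a subject once: 'Define <features> in <asset> as <label>'."""
    return f"Define {features} in {asset} as {label}"


def bind(name: str, asset: str) -> str:
    """Re-bind an undefined subject on every mention, e.g. 'Zhang San@Image 1'."""
    return f"{name}@{asset}"


definition = define_subject(
    "the woman wearing a red dress and a straw hat", "Image 1", "Subject 1"
)
mention = bind("Zhang San", "Image 1")
```

Generating every mention through `bind` makes it harder to forget the asset suffix on a later mention of an undefined subject.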

Section 04

Shot sequencing

Because the model decouples space and time internally, the ideal form for a complex video is a timeline storyboard: break it into shots and describe each in event order — who + where + doing what + how the camera moves. Label them Shot 1, Shot 2, Shot 3.

Not this

A man runs nervously down the street, and the scene feels very cinematic.

One vague blob — no shots, no order, no camera, nothing the spatial/temporal layers can act on.

Do this

Shot 1: Side shot of a street alley; the man slowly starts running, with a sense of rapid breathing. Shot 2: The man knocks over a fruit stand; the camera shakes quickly and gives a close-up of his frightened face. Shot 3: The man climbs over a low wall and disappears; the camera slowly pulls back and freezes on the empty street.

Ordered shots, each with camera + action + space. The model paces it naturally.

Don't force exact timing
Prioritise shot order over timing; don't impose strict per-shot durations. The model's support for precise timing (e.g. 0–3 seconds) is unstable, and forcing it can break the result. Let the plot drive the pacing.
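
A prompt can be scanned for explicit durations before it is submitted. The sketch below is a rough Python heuristic, assuming timings are written with a unit such as "s", "sec" or "seconds". The pattern and the function name `find_forced_timing` are illustrative assumptions and will not catch every phrasing.

```python
import re

# Matches "3s", "2.5 sec", "0-3 seconds", "0–3 seconds" (hyphen or en dash).
TIMING = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:[-–]\s*\d+(?:\.\d+)?\s*)?(?:s|sec|secs|seconds?)\b",
    re.IGNORECASE,
)


def find_forced_timing(prompt: str) -> list[str]:
    """Return every explicit duration or time range found in the prompt."""
    return [m.group(0) for m in TIMING.finditer(prompt)]
```

An empty result means no explicit durations were found; anything returned is a candidate to rewrite as plain shot order.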

Organise each shot in this order:

01

Camera / transition

How the shot moves or cuts.

e.g. “slowly push in from a wide shot; cut to…”

02

Action & expression

Key actions and changes of expression.

e.g. “lowers head, then can't hold back a smile”

03

Position / space

Where the subject is; spatial change.

e.g. “walks to the dormitory entrance”

04

Audio

Sound effects, voices, music for the shot.

e.g. “warm ambient room tone + light music”
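
The four-part order within a shot, plus the Shot 1, Shot 2, Shot 3 labelling, maps naturally onto a small data structure. A minimal Python sketch, assuming a hypothetical `Shot` class and a semicolon separator between parts:

```python
from dataclasses import dataclass


@dataclass
class Shot:
    camera: str          # 01 camera / transition
    action: str          # 02 action & expression
    position: str = ""   # 03 position / space
    audio: str = ""      # 04 audio

    def render(self, index: int) -> str:
        """Render one labelled shot with its parts in the recommended order."""
        parts = [self.camera, self.action, self.position, self.audio]
        return f"Shot {index}: " + "; ".join(p for p in parts if p) + "."


def storyboard(shots: list[Shot]) -> str:
    """Number the shots in event order and join them into one prompt."""
    return " ".join(shot.render(i) for i, shot in enumerate(shots, start=1))
```

The list order is the event order, so reordering the story means reordering the list rather than editing numbers by hand.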

Section 05

Action description

Four rules make on-screen motion read as natural rather than rubbery:

Refine the body + quantify the degree

Be specific to hands, legs, head, shoulders, back — and add range, speed, force.

slowly raise a hand · quickly turn the head · push hard off the ground · slightly lower the head

Prefer slow, gentle, continuous motion

Favour small coherent movements; avoid high-burst, large-dynamic actions like sprinting, big jumps, violent rolls.

walk slowly · gently raise a hand · sit down naturally with the motion

Bridge actions with transitions

State the inertia/continuity between one action and the next so motion stays coherent.

use the inertia of turning around to naturally raise a hand

Externalise emotion as physical detail

Replace abstract words like “very sad” with concrete body language (see table).

not “nervous” → frequently checks watch, fingers tap the table, rapid breathing

Externalising emotion

Abstract emotion | Externalise as actions & details
Sadness | Lowered head, shoulders trembling slightly, eyes reddening, fingers clutching the corner of clothing, tears welling but not falling
Joy | Corners of the mouth rising uncontrollably, brows relaxed, steps light, unconsciously humming, spinning in place
Nervousness | Frequently checking the watch, fingers tapping the tabletop, rapid breathing, eyes darting, biting fingernails
Anger | Both fists clenched, jaw tense, chest heaving, eyes sharp as knives, words squeezed through gritted teeth
Relief | Letting out a long breath, tense shoulders relaxing, a faint long-lost smile, looking up toward the distance

Section 06

Camera movement

The model understands camera terminology well, so use standard terms directly: medium shot, close-up, wide shot, slow push-in, smooth lateral tracking, fixed shot.

One camera move per shot
Specify only one type of camera movement in a single shot. Don't ask for push, pull, pan and move all at once — stacking them increases image instability.
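
The one-move-per-shot rule can be checked with a rough keyword scan. The Python sketch below is a heuristic only: the list of move keywords (push, pull, pan, track, zoom) is an assumption made for illustration, not an official vocabulary, and it will miss moves phrased in other ways.

```python
import re

# Assumed keyword stems for camera moves, with common verb endings.
MOVE_PATTERN = re.compile(
    r"\b(push|pull|pan|track|zoom)(?:e?s|ing|ning)?\b", re.IGNORECASE
)


def camera_moves(shot_text: str) -> set[str]:
    """Return the distinct camera-move stems mentioned in one shot."""
    return {m.group(1).lower() for m in MOVE_PATTERN.finditer(shot_text)}


def has_single_move(shot_text: str) -> bool:
    """True when the shot names at most one kind of camera move."""
    return len(camera_moves(shot_text)) <= 1
```

A shot that fails `has_single_move` is a candidate for splitting into two shots, each with its own move.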

Section 07

Quality, style & constraints

These three define the model's creative boundaries and keep output stable.

Image quality

Clarity, texture, lighting.

HD · rich details · cinematic texture · natural colours · soft lighting

Style

Overall art style + tone.

cyberpunk blue-purple · retro film · fresh Japanese style

Constraints

Forbid flaws & deviations.

no deformation · no flicker · stable face

Constraint-word templates
  • No subtitles: “keep it subtitle-free” / “avoid generating any text or subtitles”
  • No logo: “do not generate a logo”
  • No watermark: “do not generate a watermark”
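
The constraint templates can be kept in one lookup table and appended to any prompt. A minimal Python sketch; the `CONSTRAINTS` keys and the `with_constraints` helper are hypothetical names, and the wording of each clause is taken from the templates above.

```python
CONSTRAINTS = {
    "subtitles": "avoid generating any text or subtitles",
    "logo": "do not generate a logo",
    "watermark": "do not generate a watermark",
}


def with_constraints(prompt: str, *names: str) -> str:
    """Append the named constraint clauses to the end of the prompt."""
    clauses = [CONSTRAINTS[name] for name in names]  # KeyError on unknown names
    if not clauses:
        return prompt
    return prompt.rstrip(" .;") + "; " + "; ".join(clauses) + "."
```

Keeping the clauses in one place means every prompt uses the same wording for the same constraint.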

Section 08

Symbols & dialogue

Symbols tell the model which kind of information it's reading. Keep dialogue in one language — avoid mixing Chinese and English (proper nouns excepted).

Information | Symbol | Example
Music | ( ) | (fast-paced rock music is playing in the background)
Sound effect | < > | < dog barking can be heard in the distance >
Dialogue | { } | {Hello, world}. For other languages, mark it: says in Japanese {こんにちは}
Subtitles | 【 】 | 【Chapter One: Departure】
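
The symbol conventions in the table can be wrapped in small helpers so the right bracket is always used for the right kind of information. A minimal Python sketch; the function names are hypothetical and the helpers only format text.

```python
from typing import Optional


def music(text: str) -> str:
    """Music goes in round brackets."""
    return f"({text})"


def sound_effect(text: str) -> str:
    """Sound effects go in angle brackets."""
    return f"<{text}>"


def dialogue(text: str, language: Optional[str] = None) -> str:
    """Dialogue goes in curly braces; name the language when it differs."""
    line = "{" + text + "}"
    return f"says in {language} {line}" if language else line


def subtitle(text: str) -> str:
    """Subtitles go in lenticular brackets."""
    return f"【{text}】"
```

For example, `dialogue("こんにちは", "Japanese")` produces the marked form shown in the table, while `dialogue("Hello, world")` produces the plain braces.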

Section 09

Asset strategy

Assets play four functional roles. A focused set beats a crowded one.

01

Character anchoring

Lock the character's appearance.

02

Scene tone-setting

Lock the environment and style.

03

Camera-movement reference

Lock the shot language and action rhythm.

04

Rhythmic atmosphere

Use audio to control emotion and timbre.

Recommended: 4–5 assets total
1–2 character images (facial close-up / full body) + 1 scene image + 1 camera-movement video + 1 audio clip. Don't max out the asset limit — too many assets make feature priorities hard to judge, causing style conflicts, blurry subject identification, and off-brief results.
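
The recommended mix can be checked before upload. The Python sketch below counts assets by role and reports any role outside the recommended range; the role names and the `check_assets` helper are hypothetical labels chosen for this example.

```python
from collections import Counter

# Recommended (min, max) per role, taken from the guidance above.
RECOMMENDED = {
    "character_image": (1, 2),
    "scene_image": (1, 1),
    "camera_video": (1, 1),
    "audio": (1, 1),
}


def check_assets(roles: list[str]) -> list[str]:
    """Return one warning per role whose count is outside the recommended range."""
    counts = Counter(roles)
    warnings = []
    for role, (low, high) in RECOMMENDED.items():
        n = counts.get(role, 0)
        if not low <= n <= high:
            warnings.append(f"{role}: have {n}, recommended {low} to {high}")
    return warnings
```

An empty list means the set matches the recommended 4–5 asset mix. The warnings are advisory, since the recommendation is a guideline rather than a hard limit.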

Long take vs. stitching. Use a continuous take (video extension) for single-scene dialogue, emotional progression, and movement along one path — immersive and coherent. Use segmented stitching for plot turns and fast action (chases, fights, montages) — independent clips edited together for rhythm and impact. In practice, combine both: extend a coherent conversation, then stitch in transitions.

Section 10

Worked examples

Two official end-to-end examples showing the advanced formula in action — assets first, then a shot-by-shot prompt.

1 · Dormitory short drama (dialogue-focused)

Assets — @Image 1: female lead half-body · @Image 2: dormitory scene · @Video 1: indoor dialogue camera movement · @Audio 1: room ambience / light music.

Do this

Use the girl in @Image 1 as the main character, use @Image 2 as the dormitory scene style reference, and refer to the camera movement in @Video 1. Shot 1: At dusk, girl @Image 1 walks briskly to the dormitory entrance @Image 2. The camera follows steadily in a medium shot; warm sunlight spills into the hallway; she pauses at the doorway, takes a deep breath, looks slightly nervous. Shot 2: Girl @Image 1 pushes the door open; cut to an indoor medium shot; her roommates look up while organising books; one smiles and asks {How did the exam go? Did you pass?}; the camera slowly cuts between half-body close-ups. Shot 3: Girl @Image 1 first lowers her head, dejected (close-up), then raises it, laughs, and says {I was kidding}; roommates play-fight with her; the camera slowly pulls back and freezes on a wide shot. High-definition cinematic documentary style, warm tones, soft lighting; the face stays stable without deformation; motion natural, no stutter; ambient sound blends with @Audio 1.

2 · Cliff confrontation (action / atmosphere)

Assets — @Image 1: woman in red · @Image 2: woman in black (opponent) · @Image 3: cliff & bamboo forest · @Video 1: martial-arts camera movement · @Audio 1: tight drums / fight SFX.

Do this

Use the woman in red from @Image 1 as the lead, the woman in black from @Image 2 as the opponent, the cliff and bamboo forest in @Image 3 as the scene, refer to the camera movement and action rhythm in @Video 1, and sync background SFX with @Audio 1. Shot 1: At dusk, the camera slowly pushes in from a side medium shot of woman in red @Image 1 at the cliff edge lifting a wine flask; sleeves sway in the wind; the camera circles from front to back; a figure in black is faint in the bamboo. Shot 2: Zoom and fade to a long drone shot over the cliff; the two stand at opposite ends; wind lifts their robes; rhythm accelerates with the drums. Shot 3: Cut back to a ground-level close shot; both draw swords; woman in red @Image 1 shifts to a cold gaze; woman in black @Image 2 looks determined, sword tip trembling; the camera follows them circling, freezing on the instant before the swords meet. Cinematic wuxia in misty rain, cool tones, low saturation, film-grain texture; faces and proportions stable; motion continuous, no clipping or stutter.

Section 11

Troubleshooting

The official FAQ: the failure modes you'll actually hit, and how to fix each. (Some can only be reduced, not eliminated entirely.)

Character ID drift

Symptom — The character looks different from the reference, or “face-swaps” mid-video (sometimes resembling a celebrity and getting blocked).

  • Add a dedicated close-up headshot (face only, neutral expression, minimal background) alongside the full-body photo.
  • Define it explicitly: facial features → headshot; makeup & styling → full-body photo.
  • Place the most reference-critical assets earliest in the prompt.
  • Use headshot + full-body only — avoid multi-view images, which the model may read as different people.

Unexpected subtitles

Symptom — Subtitles appear even though you never asked for them.

  • Add explicit constraints: “keep it subtitle-free”, “avoid generating any text or subtitles”.
  • Remove text from reference images/videos first (e.g. with Seedream/Seedance editing).
  • Prefer landscape output — it produces subtitles far less often than portrait; crop later.

Logo or watermark

Symptom — A logo/watermark from another platform shows up unprompted.

  • Add explicit constraints: “do not generate watermarks” and “do not generate logos”.

Style drift

Symptom — You want 2D/3D anime but a realistic reference pulls the result toward live-action.

  • Add explicit style constraints, e.g. “2D Japanese anime style” or “3D Chinese-style comic”.
  • For precise control, convert the reference image into the target style before generating.

Jump cuts at extension joins

Symptom — Frame jumps or rollback where an extended clip meets the original.

  • Align keyframes in post (CapCut): trim 6 frames from the end of the previous segment and 1 frame from the start of the next.
  • Repeat at every join; end a clip on a transition cut and start the next from the new scene.
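
The trim rule above (6 frames off the end of the previous segment, 1 frame off the start of the next, repeated at every join) can be turned into keep-ranges per segment. A minimal Python sketch; `trim_ranges` is a hypothetical helper and the frame counts in the usage line are made-up example values.

```python
def trim_ranges(
    frame_counts: list[int], tail_trim: int = 6, head_trim: int = 1
) -> list[tuple[int, int]]:
    """Return (start, end) frame indices to keep per segment, end exclusive.

    The first segment keeps its head and the last segment keeps its tail,
    because trimming only applies where two segments meet.
    """
    last = len(frame_counts) - 1
    ranges = []
    for i, n in enumerate(frame_counts):
        start = head_trim if i > 0 else 0
        end = n - tail_trim if i < last else n
        if end <= start:
            raise ValueError(f"segment {i} is too short to trim")
        ranges.append((start, end))
    return ranges


keep = trim_ranges([120, 120, 96])
```

With three segments of 120, 120 and 96 frames, the result is `[(0, 114), (1, 114), (1, 96)]`, which can then be applied in whatever editor is used for the join.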

Duplicated characters (“twins”)

Symptom — Two identical characters appear in one frame, especially with crowded scenes or multi-view references.

  • Mark each character's reference image after their name (e.g. “Zhang San (image 1)”), kept consistent.
  • Add a global constraint forbidding identical duplicate characters / twin effects.
  • Prefer independent single-person photos over three/multi-view assets; simplify the prompt.

Quality degradation on extension

Symptom — Re-using model output as an extension input degrades quality; mottled blocks appear on faces, compounding over multiple continuations.

  • Convert the source to a pure-white 3D-model video first, then continue from that.
  • Prefer high-definition reference assets, and limit how many times you chain continuations.

Special effects off-brief

Symptom — A text-described effect (e.g. a countdown) renders with wrong/random logic.

  • Define the effect with a reference video so the model learns its exact form and motion.

Too many reference characters (>4)

Symptom — Beyond 4 reference people, you get the wrong count or duplicates.

  • Generate images in groups of ≤4 people first.
  • Then use those grouped images as references for the final image-to-video.
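
Splitting a cast into groups of at most four is a simple chunking step. A minimal Python sketch; `group_characters` is a hypothetical helper and the names are placeholders.

```python
def group_characters(names: list[str], max_per_group: int = 4) -> list[list[str]]:
    """Split the cast into consecutive groups of at most max_per_group people."""
    if max_per_group < 1:
        raise ValueError("max_per_group must be at least 1")
    return [names[i:i + max_per_group] for i in range(0, len(names), max_per_group)]


groups = group_characters(["A", "B", "C", "D", "E", "F"])
```

Each inner list corresponds to one group image to generate first; the group images then serve as the references for the final image-to-video step.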

Noise at the end (narration)

Symptom — Clicking / cut-off noise at the end of videos with voiceover.

  • Regenerate, or apply an audio fade-out via the volume envelope (CapCut) — drag the final keyframe to 0.

Inaccurate Chinese pronunciation

Symptom — Polyphonic, rare, or look-alike characters get mispronounced.

  • Swap hard words for common same-sound homophones (e.g. 螭龙山 → 吃龙山). An optimisation, not a guarantee.

Inaccurate voice reference

Symptom — The generated voice differs noticeably from the reference audio.

  • Add detailed voice-characteristic descriptions (e.g. “low, thick, warm, finely grainy middle-aged male voice of @Audio 1”).
  • Keep the line's tone and delivery close to the reference audio.

Section 12

Template library

Reusable patterns straight from the official appendix — fill in the brackets. Upload assets in the order you want them referenced, then map them with Image 1…N / Video 1…N.

Text generation

Slogan

[Text Content] + [Timing] + [Positioning] + [Entrance / Appearance Style], [Visual Attributes (colour, font style)]

Subtitles

Display subtitles at the bottom-center with the text. The subtitles must be perfectly synchronized with the audio rhythm and pacing.

Speech bubbles

[Character] says, “[Dialogue].” Speech bubbles appear around the character containing the spoken text.

Image reference

Multi-perspective subject

Refer to / Extract / Combine / Use the [Subject] from [Image N] to generate [Scene Description], maintaining consistent [Subject] features.

Multi-image

Refer to / Extract / Combine / Follow the [Referenced elements] from [Image N] to generate [Scene Description], maintaining the consistency of [Referenced Elements].

Video reference

Motion

Refer to the [Motion Description] from [Video N] to generate [Scene Description], keeping the motion details consistent.

Camera motion

Refer to the [Camera Movement Description] from [Video N] to generate [Scene Description], keeping the scene consistent.

Special effects

Refer to the [Special-effects description] from [Video N] to generate [Scene Description], keeping the special effects consistent.

Video editing

Add

At [Timestamp/Timing] and [Spatial Location] of [Video N], add [Description of intended element].

Remove

Remove [Element] from [Video N], keeping the rest of the video content unchanged.

Modify

Replace [Element to change] in [Video N] with [Intended element].

Extend

Extend [Video N] forward / backward + [content] · Generate content before / after [Video N] + [content]

Complete tracks

[Video 1] + [Transition Description] + followed by [Video 2] + [Transition Description] + followed by [Video 3]

Stitching limits
Track completion takes a maximum of 3 video clips, total duration ≤ 15 seconds. The model auto-trims the connecting segments for a seamless join, and original input segments are never re-generated.
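
The two stitching limits can be validated before a track-completion request is written. A minimal Python sketch; `check_track_completion` is a hypothetical helper, and the durations are supplied by the caller in seconds.

```python
MAX_CLIPS = 3             # maximum number of clips per track completion
MAX_TOTAL_SECONDS = 15.0  # maximum combined duration


def check_track_completion(durations: list[float]) -> list[str]:
    """Return a list of limit violations; an empty list means the set is valid."""
    problems = []
    if len(durations) > MAX_CLIPS:
        problems.append(f"{len(durations)} clips exceeds the limit of {MAX_CLIPS}")
    total = sum(durations)
    if total > MAX_TOTAL_SECONDS:
        problems.append(f"total {total:g}s exceeds the limit of {MAX_TOTAL_SECONDS:g}s")
    return problems
```

For instance, three clips of 5, 4 and 5 seconds pass, while four 5-second clips break both limits.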
