With ElevenLabs, prompting is only half the story — delivery is shaped by which voice you pick, how you set the controls, and (on Eleven v3) audio tags written inline in your script. Get those three right and the text almost reads itself.
This guide follows ElevenLabs' official documentation and blog. The single most important rule, straight from the v3 guidance: the voice you choose must be similar enough to the delivery you want — tags amplify a voice, they don't transform it.
Voice settings
The core controls and exactly what the docs say each one does:
| Setting | What it does |
|---|---|
| Stability | How stable/consistent the voice is between generations. Lower = broader emotional range; higher = more monotone with limited emotion. |
| Similarity | How closely the AI adheres to the original voice it's replicating (the “Clarity + Similarity” control). |
| Style | Amplifies the original speaker's style. Costs extra latency and is slightly less stable; default 0. |
| Speaker Boost | Boosts similarity to the original speaker (small latency cost). Not available on Eleven v3. |
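As a minimal sketch, the four controls in the table can be modelled as a small validated container. The field names (`stability`, `similarity_boost`, `style`, `use_speaker_boost`), the 0.0 to 1.0 range, the non-zero defaults, and the `eleven_v3` model ID are assumptions made for illustration, not confirmed by this guide; only the Style default of 0 and the v3 restriction on Speaker Boost come from the table.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VoiceSettings:
    """Illustrative container for the controls in the table (names are assumed)."""

    stability: float = 0.5          # assumed default; lower = broader emotional range
    similarity_boost: float = 0.75  # assumed default; adherence to the original voice
    style: float = 0.0              # default 0 per the table; raising it adds latency
    use_speaker_boost: bool = True  # assumed default; not available on Eleven v3

    def __post_init__(self) -> None:
        # Assumption: every slider is a float in the closed range 0.0 to 1.0.
        for name in ("stability", "similarity_boost", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")

    def to_payload(self, model_id: str) -> dict:
        """Return a plain dict, dropping Speaker Boost for the assumed v3 model ID."""
        payload = asdict(self)
        if model_id == "eleven_v3":
            payload.pop("use_speaker_boost")
        return payload
```

Validating in one place keeps an out-of-range slider from reaching a request, and the v3 branch encodes the one model-specific rule the table states.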
v3 stability & speed
On Eleven v3, stability becomes the primary control and is exposed as three modes rather than a slider:
| Mode | Character |
|---|---|
| Creative | More emotional and expressive — but more prone to hallucinations. |
| Natural | Closest to the original recording: balanced and neutral. |
| Robust | Highly stable, but less responsive to directional prompts (closest to v2). |
Speed is adjustable from 0.7 (slowest) to 1.2 (fastest), with a default of 1.0.
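The three modes and the speed range can be captured in a couple of helpers. This is a sketch only: the numeric values attached to each mode (0.0, 0.5, 1.0) are an assumption about how the modes might map onto a stability number, while the speed bounds and default come from the sentence above.

```python
# Assumed numeric mapping for the three v3 modes (not stated in this guide).
V3_STABILITY_MODES = {"creative": 0.0, "natural": 0.5, "robust": 1.0}

# Speed bounds and default as given in the text.
SPEED_MIN, SPEED_MAX, SPEED_DEFAULT = 0.7, 1.2, 1.0


def v3_stability(mode: str) -> float:
    """Translate a mode name (case-insensitive) into the assumed stability value."""
    key = mode.strip().lower()
    if key not in V3_STABILITY_MODES:
        options = ", ".join(sorted(V3_STABILITY_MODES))
        raise ValueError(f"unknown mode {mode!r}; expected one of: {options}")
    return V3_STABILITY_MODES[key]


def clamp_speed(speed: float = SPEED_DEFAULT) -> float:
    """Clamp a requested speed into the documented 0.7 to 1.2 range."""
    return max(SPEED_MIN, min(SPEED_MAX, speed))
```

For example, `clamp_speed(2.0)` falls back to the fastest allowed value rather than failing, which may or may not be the behaviour you want; raising an error is an equally reasonable choice.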
Formatting & pronunciation
Beyond tags, the text itself shapes delivery:
| Technique | Effect |
|---|---|
| Ellipses … | Add pauses and weight. |
| CAPITALISATION | Increases emphasis / intensity. |
| Punctuation & structure | Drive rhythm; descriptive narration (“she said excitedly”) also colours tone. |
On v2 models, an SSML break tag such as <break time="1.5s" /> inserts a pause of up to 3 seconds. Eleven v3 does not support SSML break tags; use ellipses, dashes, punctuation, structure, and audio tags to pace instead.
For pronunciation control, the documented options are:
IPA (v3)
Wrap phonemes in slashes — e.g. /ˌbaɪoʊˈkemɪstri/
Native IPA works across 70+ languages; consistency is roughly 80–90%, so test per voice.
Phoneme tag (v2)
<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">Madison</phoneme>
CMU Arpabet is recommended for predictable results.
Alias (no phoneme support)
<alias>Cloffton</alias> — spell out unusual names/terms phonetically
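The two inline forms listed above are easy to generate programmatically. The helpers below are a hypothetical sketch: the function names are invented for illustration, and they only reproduce the markup shown in this guide, with escaping added so that unusual characters cannot break the tag.

```python
from xml.sax.saxutils import escape, quoteattr


def arpabet_phoneme(word: str, arpabet: str) -> str:
    """Wrap a word in the v2-style phoneme tag shown above (CMU Arpabet)."""
    # quoteattr adds the surrounding quotes and escapes special characters.
    return f"<phoneme alphabet=\"cmu-arpabet\" ph={quoteattr(arpabet)}>{escape(word)}</phoneme>"


def ipa_inline(ipa: str) -> str:
    """Wrap an IPA transcription in slashes, the v3 form shown above."""
    return f"/{ipa.strip().strip('/')}/"
```

Calling `arpabet_phoneme("Madison", "M AE1 D IH0 S AH0 N")` reproduces the phoneme example from the list.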
Models can stumble on numbers, dates, currency, and URLs — larger models normalise better. When in doubt, expand tricky text first (e.g. write “twenty-five dollars” rather than “$25”), or use a pronunciation dictionary for recurring terms.
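Two of the text-preparation rules in this section lend themselves to a small preprocessing pass: handling SSML break tags per model, and expanding simple currency amounts before synthesis. The sketch below is illustrative only. The function names are hypothetical, the break handling assumes tags written exactly in the `<break time="1.5s" />` form, and the currency expansion covers only whole dollar amounts below 1,000.

```python
import re

BREAK_TAG = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)s"\s*/>')
MAX_BREAK_SECONDS = 3.0  # upper limit stated in this guide

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# "$25" matches; "$25.50", "$2,500" and "$2500" are deliberately left alone.
DOLLARS = re.compile(r"\$(\d{1,3})(?![\d.,]?\d)")


def pace_for_model(text: str, supports_break: bool) -> str:
    """Cap break tags at 3 s where supported; otherwise swap them for ellipses."""
    def replace(match: re.Match) -> str:
        if not supports_break:
            return "..."
        seconds = min(float(match.group(1)), MAX_BREAK_SECONDS)
        return f'<break time="{seconds:g}s" />'
    return BREAK_TAG.sub(replace, text)


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n < 1000:
        raise ValueError("only 0 to 999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    hundreds, rest = divmod(n, 100)
    return f"{ONES[hundreds]} hundred" + (f" {number_to_words(rest)}" if rest else "")


def expand_dollars(text: str) -> str:
    """Rewrite whole-dollar amounts such as "$25" as spoken words."""
    def replace(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    return DOLLARS.sub(replace, text)
```

Amounts the sketch cannot spell out are left untouched rather than guessed at, so anything with cents or thousands separators still needs a pronunciation dictionary or a fuller normaliser.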
Choosing a model
| Model | Best for | Notes |
|---|---|---|
| Eleven v3 (alpha) | Expressive dialogue, character work, audiobooks | Most expressive; audio tags + Text-to-Dialogue are exclusive to v3. 70+ languages. |
| Multilingual v2 | Stable, emotionally rich long-form | Most stable on long content; 29 languages. |
| Flash v2.5 | Real-time, conversational, large-scale | Ultra-low latency (~75 ms); ~50% lower cost per character; may trade some pronunciation accuracy. |
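The table boils down to a short decision rule, sketched here as a helper. The model ID strings are assumptions about how these models are identified in requests, and `pick_model` is a hypothetical name; the rule itself only restates the table.

```python
# Assumed identifiers for the three models in the table.
MODEL_IDS = {
    "expressive": "eleven_v3",
    "long_form": "eleven_multilingual_v2",
    "realtime": "eleven_flash_v2_5",
}


def pick_model(*, needs_audio_tags: bool = False, needs_low_latency: bool = False) -> str:
    """Choose a model following the table: tags force v3, latency favours Flash."""
    if needs_audio_tags:
        # Audio tags and Text to Dialogue are exclusive to v3, so this wins
        # even when low latency was also requested.
        return MODEL_IDS["expressive"]
    if needs_low_latency:
        return MODEL_IDS["realtime"]
    # Default to the model the table describes as most stable on long content.
    return MODEL_IDS["long_form"]
```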
Multi-speaker dialogue
Text to Dialogue (v3 only) gives each turn its own text and voice — assign every speaker a distinct voice, with no cap on participants. Audio tags work inside dialogue, including audio events like [applause] or [gentle footsteps].
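One way to prepare such a request is to keep the script as speaker/text turns and resolve voices in a single step. This is a sketch under assumptions: the `text` and `voice_id` keys are a guess at the request shape, the voice IDs are placeholders, and `build_dialogue_inputs` is a hypothetical helper.

```python
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: str
    text: str  # may contain audio tags such as [laughs]


def build_dialogue_inputs(turns: list[Turn], voices: dict[str, str]) -> list[dict[str, str]]:
    """Pair every turn with its speaker's voice, enforcing one distinct voice each."""
    speakers = {turn.speaker for turn in turns}
    missing = speakers - set(voices)
    if missing:
        raise KeyError(f"no voice assigned for: {sorted(missing)}")
    used = [voices[speaker] for speaker in speakers]
    if len(set(used)) != len(used):
        raise ValueError("each speaker should have a distinct voice")
    # Key names below are assumed, not taken from this guide.
    return [{"text": turn.text, "voice_id": voices[turn.speaker]} for turn in turns]


script = [
    Turn("Jessica", "[laughs] That was... beautiful."),
    Turn("Dr. Von Fusion", "[dramatic] To be or not to be, that is the question!"),
]
inputs = build_dialogue_inputs(
    script,
    {"Jessica": "voice_id_jessica", "Dr. Von Fusion": "voice_id_von_fusion"},  # placeholder IDs
)
```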
[cautiously] Hello, is this seat-
• Use the optional seed for more repeatable results (subtle differences can still occur).
• Keep to ~2,000 characters per request for reliable generation; split longer text into chunks and concatenate.
Sound effects
Write sound-effect prompts in a narrative, script-like style. Documented controls: duration 0.1–30s (or auto), prompt influence (high = literal, low = more creative), and looping for seamless repeats.
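The documented controls can be validated before a request is sent. The sketch below assumes parameter names (`text`, `duration_seconds`, `prompt_influence`, `loop`), a 0.0 to 1.0 scale for prompt influence, and a default of 0.3; none of those are stated in this guide. Only the 0.1 to 30 second range and the "auto" option are.

```python
from typing import Optional


def sound_effect_request(
    prompt: str,
    duration_seconds: Optional[float] = None,  # None stands for "auto"
    prompt_influence: float = 0.3,              # assumed default and 0 to 1 scale
    loop: bool = False,
) -> dict:
    """Build a request body for a sound effect, checking the documented limits."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if duration_seconds is not None and not 0.1 <= duration_seconds <= 30:
        raise ValueError("duration must be between 0.1 and 30 seconds, or None for auto")
    if not 0.0 <= prompt_influence <= 1.0:
        raise ValueError("prompt_influence must be between 0.0 and 1.0")
    payload = {"text": prompt.strip(), "prompt_influence": prompt_influence, "loop": loop}
    if duration_seconds is not None:
        # Omit the key entirely so the duration is chosen automatically.
        payload["duration_seconds"] = duration_seconds
    return payload
```

A higher `prompt_influence` would suit a literal effect such as "Glass shattering on concrete", while a lower one leaves more room for variation.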
Simple effect
Glass shattering on concrete · Heavy wooden door creaking open · Thunder rumbling in the distance
Sequential effect
Footsteps on gravel, then a metallic door opens · Wind whistling through trees, followed by leaves rustling
Musical one-shots
90s hip-hop drum loop, 90 BPM · Vintage brass stabs in F minor · Atmospheric synth pad with subtle modulation
Music
For Eleven Music, more words isn't better — length and detail don't always correlate with quality. Concise, evocative prompts often win. You can go abstract (“eerie, foreboding”) or precise (“dissonant violin over a pulsing sub-bass”).
| To get… | Prompt with… |
|---|---|
| A single instrument | Prefix “solo” — “solo electric guitar”, “solo piano in C minor” |
| Vocals | Prefix “a cappella” — “a cappella female vocals”; add “raw / breathy / aggressive” |
| Tempo & key | State them — “130 BPM”, “in A minor” |
| Structure / timing | “60 seconds”, “instrumental only”, “lyrics begin at 15 seconds” |
Lyrics can be multilingual, and Composition Plans give precise control over section structure, lyric placement, and multi-vocalist arrangements.
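The prompt ingredients from the table above can be assembled with a small helper that keeps prompts concise and consistent. `music_prompt` is a hypothetical function written for illustration; it only joins the phrasings shown in the table and does not call any service.

```python
from typing import Optional


def music_prompt(
    core: str,
    *,
    bpm: Optional[int] = None,
    key: Optional[str] = None,
    seconds: Optional[int] = None,
    instrumental_only: bool = False,
) -> str:
    """Join a core description with optional tempo, key, and timing phrases."""
    parts = [core.strip()]
    if bpm is not None:
        parts.append(f"{bpm} BPM")
    if key is not None:
        parts.append(f"in {key}")
    if seconds is not None:
        parts.append(f"{seconds} seconds")
    if instrumental_only:
        parts.append("instrumental only")
    return ", ".join(parts)


prompt = music_prompt("solo piano", bpm=130, key="A minor", seconds=60, instrumental_only=True)
```

Here `prompt` comes out as "solo piano, 130 BPM, in A minor, 60 seconds, instrumental only", which stays short enough to respect the advice that more words are not automatically better.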
Best practices
Voice choice dominates (v3)
- Pick a voice already close to the target delivery — tags amplify, they don't transform.
Match stability to the goal
- Creative for expressive/tagged performance; Robust for consistency, but it resists tags.
Don't use <break> on v3
- SSML breaks are v2-only — on v3 use ellipses, punctuation, and tags for pacing.
Normalise tricky text
- Expand numbers, dates, currency, and URLs — especially on faster/smaller models.
Chunk long content
- Respect per-model character limits; ~2,000 chars per dialogue request, then concatenate.
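The chunking advice can be sketched as a sentence-aware splitter. This is one possible approach, not the documented method: it assumes sentence-ending punctuation is a safe place to cut, uses the ~2,000-character figure as the default limit, and falls back to a hard cut only when a single sentence exceeds the limit.

```python
import re


def chunk_text(text: str, limit: int = 2000) -> list[str]:
    """Split text into chunks of at most `limit` characters, at sentence ends."""
    if limit < 1:
        raise ValueError("limit must be positive")
    # Split after ., !, ? or an ellipsis character when followed by whitespace.
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single over-long sentence is cut hard at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        # Sentences are rejoined with one space, so original spacing is normalised.
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be generated separately and the resulting audio concatenated in order. Cutting at sentence boundaries is a design choice here: it avoids splitting a word or an audio tag in the common case, at the cost of chunks that are sometimes well under the limit.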
Examples
Quoted from ElevenLabs' official blog posts.
Emotional progression
[tired] I've been working for 14 hours straight. [sigh] I can't even feel my hands anymore. [nervously] You sure this is going to work? [gulps] Okay… let's go.
Layered tags
[dramatic][French accent] You do not understand... zis was never about revenge. It was about destiny.
Multi-character dialogue
Jessica: [laughs] That was... beautiful. Dr. Von Fusion: [dramatic] To be or not to be — that is the question! Jessica: [French accent] This is spectacular, isn't it?
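Scripts like these are easy to inspect programmatically, for instance to list which tags a script relies on or to count the characters that will actually be spoken. The sketch below assumes that any text in square brackets is an audio tag, which holds for the examples above but may not hold for every script; both function names are hypothetical.

```python
import re

# Assumption: anything inside square brackets is an audio tag.
AUDIO_TAG = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(script: str) -> list[str]:
    """Return the audio tags in order of appearance, without brackets."""
    return [tag.strip() for tag in AUDIO_TAG.findall(script)]


def strip_tags(script: str) -> str:
    """Remove audio tags and collapse the whitespace they leave behind."""
    return re.sub(r"\s{2,}", " ", AUDIO_TAG.sub("", script)).strip()


line = "[dramatic][French accent] You do not understand... zis was never about revenge."
tags = extract_tags(line)
spoken = strip_tags(line)
```

With the layered-tags example, `tags` holds the two stacked tags and `spoken` holds only the words to be read aloud.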
Built from official sources
Your turn
Pick the voice. Direct the performance.
ElevenLabs voices power Ekly's voiceovers. Choose a fitting voice, tune stability, and add a tag or two.