The Complete Guide to ElevenLabs Voice & Audio

With ElevenLabs, prompting is only half the story — delivery is shaped by which voice you pick, how you set the controls, and (on Eleven v3) audio tags written inline in your script. Get those three right and the text almost reads itself.

This guide follows ElevenLabs' official documentation and blog. The single most important rule, straight from the v3 guidance: the voice you choose must be similar enough to the delivery you want — tags amplify a voice, they don't transform it.

Three dials, in order of impact

1) the voice itself · 2) stability / style settings · 3) audio tags and text formatting. Reach for them in that order.

Section 01

Voice settings

The core controls and exactly what the docs say each one does:

Setting	What it does
Stability	How stable/consistent the voice is between generations. Lower = broader emotional range; higher = more monotone with limited emotion.
Similarity	How closely the AI adheres to the original voice it's replicating (the “Clarity + Similarity” control).
Style	Amplifies the original speaker's style. Costs extra latency and is slightly less stable; default 0.
Speaker Boost	Boosts similarity to the original speaker (small latency cost). Not available on Eleven v3.

A sensible starting point

The docs cite a common starting setting of stability ≈ 50, similarity ≈ 75, and style = 0 — then adjust minimally from there. The right values depend on the voice and the performance you're after.
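As a rough illustration of "start from the baseline, then adjust minimally", here is a small sketch. The `VoiceSettings` class, its field names, and the `nudge` helper are hypothetical, and the 0-1 scale (so that 50 becomes 0.5) is an assumption made for this example rather than something stated above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container; field names are illustrative, not an official schema."""

    stability: float = 0.50   # the commonly cited starting point (~50), assumed 0-1 scale
    similarity: float = 0.75  # the commonly cited starting point (~75)
    style: float = 0.0        # default 0; raising it costs latency and some stability
    speaker_boost: bool = False  # no default is given above; off here, and not available on Eleven v3

    def __post_init__(self) -> None:
        # Reject values outside the assumed 0-1 range early.
        for name in ("stability", "similarity", "style"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0-1, got {value}")


def nudge(settings: VoiceSettings, **changes: float) -> VoiceSettings:
    """Return a copy with a few fields changed, leaving the baseline untouched."""
    return replace(settings, **changes)


baseline = VoiceSettings()
slightly_looser = nudge(baseline, stability=0.4)  # one small change at a time
```

Keeping the baseline immutable makes it easy to compare a tweaked take against the starting point for the same voice.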


v3 stability & speed

On Eleven v3, stability becomes the primary control and is exposed as three modes rather than a slider:

Mode	Character
Creative	More emotional and expressive — but more prone to hallucinations.
Natural	Closest to the original recording: balanced and neutral.
Robust	Highly stable, but less responsive to directional prompts (closest to v2).

Tags need a responsive mode

Audio tags work best with Natural or Creative. Robust resists directional prompts, so don't expect strong tag response from it — a common mistake.

Speed is adjustable from 0.7 (slowest) to 1.2 (fastest), with a default of 1.0.
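The mode and speed rules above can be captured in a few lines. This is a sketch only: the `V3Stability` enum, its string values, and both helper functions are hypothetical names chosen for illustration, not identifiers from the ElevenLabs API.

```python
from enum import Enum

# Speed range and default as stated above.
SPEED_MIN, SPEED_MAX, SPEED_DEFAULT = 0.7, 1.2, 1.0


class V3Stability(Enum):
    """The three v3 stability modes (string values are illustrative)."""

    CREATIVE = "creative"
    NATURAL = "natural"
    ROBUST = "robust"


def responds_well_to_tags(mode: V3Stability) -> bool:
    """Audio tags work best with Creative or Natural; Robust resists direction."""
    return mode in (V3Stability.CREATIVE, V3Stability.NATURAL)


def clamp_speed(speed: float = SPEED_DEFAULT) -> float:
    """Pull a requested speed back into the 0.7-1.2 range."""
    return max(SPEED_MIN, min(SPEED_MAX, speed))
```

A check like `responds_well_to_tags(V3Stability.ROBUST)` returning `False` is a cheap guard before sending a heavily tagged script.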

Section 03

Audio tags (Eleven v3)

Audio tags are words in square brackets that v3 interprets to direct the performance. Crucially, they're natural-language instructions, not a fixed list — the examples below are illustrations, not a closed registry. Place them anywhere in the script, and combine them freely.

Emotion

[excited] [nervous] [frustrated] [sorrowful] [calm] [curious] [crying]

Non-verbal

[laughs] [laughs harder] [sighs] [clears throat] [gulps] [gasps] [whispers]

Delivery / tone

[shouts] [dramatic] [sarcastically] [deadpan] [cheerfully] [quietly]

Accents & character

[British accent] [French accent] [pirate voice] [fantasy narrator] [sci-fi AI voice]

Tags are voice-dependent — test them

The voice and its training samples affect how well a tag lands. An over-the-top tag on a calm, neutral voice may do nothing — so test tags per voice, and lean on a voice whose character already matches the direction.
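Because tags are plain text in square brackets, a script can be assembled with ordinary string handling. The helpers below are hypothetical conveniences, not part of any ElevenLabs library; `audition_variants` simply produces one take per candidate tag so each can be tested against a given voice.

```python
def tag(direction: str) -> str:
    """Wrap a natural-language direction in square brackets."""
    cleaned = direction.strip().strip("[]").strip()
    if not cleaned:
        raise ValueError("a tag needs some text")
    return f"[{cleaned}]"


def direct(line: str, *directions: str) -> str:
    """Prefix a line with zero or more tags; several tags are layered back to back."""
    prefix = "".join(tag(d) for d in directions)
    return f"{prefix} {line}".strip()


def audition_variants(line: str, candidates: list[str]) -> list[str]:
    """Build one tagged take per candidate, for per-voice testing."""
    return [direct(line, candidate) for candidate in candidates]


takes = audition_variants("You sure this is going to work?", ["nervous", "deadpan", "whispers"])
# takes[0] == "[nervous] You sure this is going to work?"
```

Generating the same line under a handful of tags makes it easier to hear which directions a particular voice actually responds to.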

Section 04

Formatting & pronunciation

Beyond tags, the text itself shapes delivery:

Technique	Effect
Ellipses …	Add pauses and weight.
CAPITALISATION	Increases emphasis / intensity.
Punctuation & structure	Drive rhythm; descriptive narration (“she said excitedly”) also colours tone.

Pauses: v2 vs v3

v2 models support SSML breaks — <break time="1.5s" /> — up to 3 seconds. Eleven v3 does not support SSML break tags; use ellipses, dashes, punctuation, structure, and audio tags to pace instead.
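A small helper can keep the v2/v3 difference out of the script itself. This is a sketch under assumptions: the function name and the `"v2"` / `"v3"` family labels are hypothetical, and an ellipsis on v3 only suggests a pause; it does not encode an exact duration.

```python
MAX_BREAK_SECONDS = 3.0  # the v2 SSML break limit stated above


def pause(seconds: float, model_family: str) -> str:
    """Return pacing markup for a 'v2' or 'v3' model (illustrative labels)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if model_family == "v2":
        # Cap at the documented maximum rather than sending an over-long break.
        capped = min(seconds, MAX_BREAK_SECONDS)
        return f'<break time="{capped:g}s" />'
    if model_family == "v3":
        # v3 does not support SSML break tags; fall back to an ellipsis.
        return "..."
    raise ValueError(f"unknown model family: {model_family!r}")


line_v2 = f"Wait for it {pause(1.5, 'v2')} there."
line_v3 = f"Wait for it{pause(1.5, 'v3')} there."
```

Here `pause(1.5, "v2")` yields `<break time="1.5s" />`, while a request for 5 seconds on v2 is capped to `<break time="3s" />`.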

For pronunciation control, the documented options are:

IPA (v3)

Wrap phonemes in slashes — e.g. /ˌbaɪoʊˈkemɪstri/

Native IPA works across 70+ languages; consistency is roughly 80–90%, so test per voice.

Phoneme tag (v2)

<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">Madison</phoneme>

CMU Arpabet is recommended for predictable results.

Alias (no phoneme support)

<alias>Cloffton</alias> — spell out unusual names/terms phonetically
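The phoneme and IPA forms above are easy to mistype by hand, so a pair of tiny builders can help. Both function names are hypothetical; the only library calls are `escape` and `quoteattr` from Python's standard `xml.sax.saxutils`, used so that stray quotes or angle brackets do not break the markup.

```python
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(word: str, ph: str, alphabet: str = "cmu-arpabet") -> str:
    """Build a v2-style phoneme tag with safely quoted attributes."""
    return (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )


def ipa_inline(ipa: str) -> str:
    """Wrap an IPA transcription in slashes, as described for v3."""
    return f"/{ipa.strip('/')}/"


madison = phoneme_tag("Madison", "M AE1 D IH0 S AH0 N")
```

With the inputs shown, `madison` reproduces the documented example string exactly, which makes the helper simple to verify before using it on recurring names.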

Models can stumble on numbers, dates, currency, and URLs — larger models normalise better. When in doubt, expand tricky text first (e.g. write “twenty-five dollars” rather than “$25”), or use a pronunciation dictionary for recurring terms.
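To make "expand tricky text first" concrete, here is a deliberately narrow sketch that rewrites whole-dollar amounts under 100 and leaves everything else alone. The function names are hypothetical, and a production pipeline would need a far more complete normaliser covering dates, decimals, and URLs.

```python
import re

_ONES = (
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen"
).split()
_TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# "$" plus one or two digits, not followed by more digits or a decimal/thousands part.
_DOLLARS = re.compile(r"\$(\d{1,2})(?!\d|[.,]\d)")


def spell_under_100(n: int) -> str:
    """Spell out an integer from 0 to 99 in English."""
    if not 0 <= n < 100:
        raise ValueError("only 0-99 is supported in this sketch")
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    word = _TENS[tens - 2]
    return word if ones == 0 else f"{word}-{_ONES[ones]}"


def expand_dollars(text: str) -> str:
    """Replace amounts like "$25" with "twenty-five dollars"."""

    def _swap(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{spell_under_100(amount)} {unit}"

    return _DOLLARS.sub(_swap, text)


print(expand_dollars("It costs $25."))  # It costs twenty-five dollars.
```

Amounts the sketch cannot spell safely, such as "$25.50" or "$1,000", are left untouched rather than half-converted.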

Section 05

Choosing a model

Model	Best for	Notes
Eleven v3 (alpha)	Expressive dialogue, character work, audiobooks	Most expressive; audio tags + Text-to-Dialogue are exclusive to v3. 70+ languages.
Multilingual v2	Stable, emotionally rich long-form	Most stable on long content; 29 languages.
Flash v2.5	Real-time, conversational, large-scale	Ultra-low latency (~75 ms); ~50% lower cost per character; may trade some pronunciation accuracy.

Rule of thumb

Pick v3 for expressiveness and dialogue, Multilingual v2 for stable long-form quality, and Flash for low-latency/real-time. (Turbo v2.5/v2 are deprecated — use Flash instead.)
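The rule of thumb reduces to a short decision function. This is one possible reading, not an official selection algorithm: the function and its flags are hypothetical, it returns the display names used in the table rather than API model IDs, and it assumes latency is the harder constraint when both flags are set.

```python
def pick_model(*, realtime: bool = False, expressive: bool = False) -> str:
    """Map two coarse requirements onto the models in the table above."""
    if realtime:
        # Low latency wins first: an expressive model is no help if it is too slow.
        return "Flash v2.5"
    if expressive:
        return "Eleven v3"
    # Default to the option described as most stable on long content.
    return "Multilingual v2"


audiobook_model = pick_model(expressive=True)
agent_model = pick_model(realtime=True)
```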

Section 06

Multi-speaker dialogue

Text to Dialogue (v3 only) gives each turn its own text and voice — assign every speaker a distinct voice, with no cap on participants. Audio tags work inside dialogue, including audio events like [applause] or [gentle footsteps].

Dialogue tips from the docs

• Show interruptions with a dangling dash: [cautiously] Hello, is this seat-
• Use the optional seed for more repeatable results (subtle differences can still occur).
• Keep to ~2,000 characters per request for reliable generation; split longer text into chunks and concatenate.
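The chunking tip can be sketched as a greedy grouping of turns. Everything here is illustrative: the `(voice, text)` tuple shape and the function name are assumptions, and only the text length is counted against the ~2,000-character guideline.

```python
from typing import Iterable

Turn = tuple[str, str]  # (voice label, text); an illustrative shape, not an API type


def chunk_turns(turns: Iterable[Turn], limit: int = 2000) -> list[list[Turn]]:
    """Group consecutive turns so each request stays near the character limit."""
    chunks: list[list[Turn]] = []
    current: list[Turn] = []
    used = 0
    for voice, text in turns:
        # Start a new request when adding this turn would exceed the limit.
        if current and used + len(text) > limit:
            chunks.append(current)
            current, used = [], 0
        current.append((voice, text))
        used += len(text)
    if current:
        chunks.append(current)
    return chunks


script = [
    ("Jessica", "[laughs] That was... beautiful."),
    ("Dr. Von Fusion", "[dramatic] To be or not to be, that is the question!"),
]
requests = chunk_turns(script)  # short script: a single request with both turns
```

Turns are never split, so a single turn longer than the limit still goes out alone; break such a turn up by hand at a sentence boundary. Generate each chunk separately, then concatenate the audio in order.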

Section 07

Sound effects

Write sound-effect prompts in a narrative, script-like style. Documented controls: duration 0.1–30s (or auto), prompt influence (high = literal, low = more creative), and looping for seamless repeats.
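Those three controls can be validated before a request is sent. In this sketch the function name and dictionary keys are hypothetical rather than the real request schema, `None` stands in for "auto" duration or "leave the default", and the 0-1 range for prompt influence is an assumption.

```python
from typing import Optional


def sound_effect_request(
    prompt: str,
    duration_seconds: Optional[float] = None,  # None means "auto"
    prompt_influence: Optional[float] = None,  # None leaves the service default
    loop: bool = False,
) -> dict:
    """Check the documented limits and return an illustrative request dict."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if duration_seconds is not None and not 0.1 <= duration_seconds <= 30:
        raise ValueError("duration must be between 0.1 and 30 seconds, or None for auto")
    if prompt_influence is not None and not 0.0 <= prompt_influence <= 1.0:
        # Assumed 0-1 scale: higher = more literal, lower = more creative.
        raise ValueError("prompt_influence is assumed to lie within 0-1")
    return {
        "prompt": prompt.strip(),
        "duration_seconds": duration_seconds,
        "prompt_influence": prompt_influence,
        "loop": loop,
    }


ambience = sound_effect_request("Wind whistling through trees", duration_seconds=12, loop=True)
```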

Simple effect

Glass shattering on concrete · Heavy wooden door creaking open · Thunder rumbling in the distance

Sequential effect

Footsteps on gravel, then a metallic door opens · Wind whistling through trees, followed by leaves rustling

Musical one-shots

90s hip-hop drum loop, 90 BPM · Vintage brass stabs in F minor · Atmospheric synth pad with subtle modulation

Layer the complex ones

For dense, multi-sound moments, generate the individual effects separately and combine them in an audio editor — it beats cramming every event into one prompt. Useful vocabulary: impact, whoosh, ambience, one-shot, braam, glitch, drone.

Section 08

Music

For Eleven Music, a longer prompt isn't automatically a better one: length and detail don't always correlate with quality. Concise, evocative prompts often win. You can go abstract (“eerie, foreboding”) or precise (“dissonant violin over a pulsing sub-bass”).

To get…	Prompt with…
A single instrument	Prefix “solo” — “solo electric guitar”, “solo piano in C minor”
Vocals	Prefix “a cappella” — “a cappella female vocals”; add “raw / breathy / aggressive”
Tempo & key	State them — “130 BPM”, “in A minor”
Structure / timing	“60 seconds”, “instrumental only”, “lyrics begin at 15 seconds”

Lyrics can be multilingual, and Composition Plans give precise control over section structure, lyric placement, and multi-vocalist arrangements.
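The prompt patterns in the table compose naturally, so a small builder can keep prompts concise and consistent. The function and its parameters are hypothetical; it only assembles text and does not call Eleven Music.

```python
from typing import Optional


def music_prompt(
    subject: str,
    *,
    solo: bool = False,
    a_cappella: bool = False,
    key: Optional[str] = None,
    bpm: Optional[int] = None,
    seconds: Optional[int] = None,
    instrumental_only: bool = False,
) -> str:
    """Assemble a short prompt from the patterns in the table above."""
    if solo and a_cappella:
        raise ValueError("choose either solo or a cappella, not both")
    prefix = "solo " if solo else "a cappella " if a_cappella else ""
    head = f"{prefix}{subject}"
    if key:
        head += f" in {key}"
    parts = [head]
    if bpm:
        parts.append(f"{bpm} BPM")
    if seconds:
        parts.append(f"{seconds} seconds")
    if instrumental_only:
        parts.append("instrumental only")
    return ", ".join(parts)


print(music_prompt("piano", solo=True, key="C minor"))  # solo piano in C minor
```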

Section 09

Best practices

Voice choice dominates (v3)

Pick a voice already close to the target delivery — tags amplify, they don't transform.

Match stability to the goal

Creative or Natural for expressive, tagged performance; Robust for consistency, though it resists tags.

Don't use <break> on v3

SSML breaks are v2-only — on v3 use ellipses, punctuation, and tags for pacing.

Normalise tricky text

Expand numbers, dates, currency, and URLs — especially on faster/smaller models.

Chunk long content

Respect per-model character limits; ~2,000 chars per dialogue request, then concatenate.

Section 10

Examples

Quoted from ElevenLabs' official blog posts.

Emotional progression

✓ Do this

[tired] I've been working for 14 hours straight. [sigh] I can't even feel my hands anymore. [nervously] You sure this is going to work? [gulps] Okay… let's go.

Layered tags

✓ Do this

[dramatic][French accent] You do not understand... zis was never about revenge. It was about destiny.

Multi-character dialogue

✓ Do this

Jessica: [laughs] That was... beautiful.
Dr. Von Fusion: [dramatic] To be or not to be — that is the question!
Jessica: [French accent] This is spectacular, isn't it?


Your turn

Pick the voice. Direct the performance.

ElevenLabs voices power Ekly's voiceovers. Choose a fitting voice, tune stability, and add a tag or two.
