AI Audio · 13 min read · June 2026

The Complete Guide to ElevenLabs Voice & Audio

Get natural, expressive speech from ElevenLabs — voice settings, the Eleven v3 stability modes and audio-tag system, model selection, multi-speaker dialogue, sound effects, and music.

With ElevenLabs, prompting is only half the story — delivery is shaped by which voice you pick, how you set the controls, and (on Eleven v3) audio tags written inline in your script. Get those three right and the text almost reads itself.

This guide follows ElevenLabs' official documentation and blog. The single most important rule, straight from the v3 guidance: the voice you choose must be similar enough to the delivery you want — tags amplify a voice, they don't transform it.

Three dials, in order of impact
1) the voice itself · 2) stability / style settings · 3) audio tags and text formatting. Reach for them in that order.

Section 01

Voice settings

The core controls and exactly what the docs say each one does:

Setting | What it does
Stability | How stable/consistent the voice is between generations. Lower = broader emotional range; higher = more monotone with limited emotion.
Similarity | How closely the AI adheres to the original voice it's replicating (the “Clarity + Similarity” control).
Style | Amplifies the original speaker's style. Costs extra latency and is slightly less stable; default 0.
Speaker Boost | Boosts similarity to the original speaker (small latency cost). Not available on Eleven v3.

A sensible starting point
The docs cite a common starting setting of stability ≈ 50, similarity ≈ 75, and style = 0 — then adjust minimally from there. The right values depend on the voice and the performance you're after.
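
As a rough sketch, the starting point above can be turned into a settings object. This assumes the slider values (50, 75, 0) map onto a 0.0 to 1.0 scale and that the fields are named `stability`, `similarity_boost`, and `style`; both are assumptions to check against the current API reference. The helper name `starting_voice_settings` is hypothetical.

```python
def starting_voice_settings(stability_pct=50, similarity_pct=75, style_pct=0):
    """Convert percent-style slider values into 0.0-1.0 floats.

    Field names and the 0-1 scale are assumptions, not confirmed by this guide.
    """
    for name, value in (("stability", stability_pct),
                        ("similarity", similarity_pct),
                        ("style", style_pct)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return {
        "stability": stability_pct / 100,
        "similarity_boost": similarity_pct / 100,
        "style": style_pct / 100,
    }


# Start from the common defaults, then change one value at a time.
baseline = starting_voice_settings()
more_expressive = starting_voice_settings(stability_pct=40)
```

Changing one value per test run makes it easier to hear which setting caused a difference.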

Section 02

v3 stability & speed

On Eleven v3, stability becomes the primary control and is exposed as three modes rather than a slider:

Mode | Character
Creative | More emotional and expressive, but more prone to hallucinations.
Natural | Closest to the original recording: balanced and neutral.
Robust | Highly stable, but less responsive to directional prompts (closest to v2).

Tags need a responsive mode
Audio tags work best with Natural or Creative. Robust resists directional prompts, so don't expect strong tag response from it — a common mistake.

Speed is adjustable from 0.7 (slowest) to 1.2 (fastest), with a default of 1.0.
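
A minimal sketch of how the three modes and the speed range might be handled in code. The numeric values for each mode (0.0, 0.5, 1.0) are an assumption about how the modes are represented in a request, and `v3_settings` is a hypothetical helper; only the mode names and the 0.7 to 1.2 speed range come from the text above.

```python
# Assumed numeric stability value per v3 mode; verify before relying on it.
V3_STABILITY = {"creative": 0.0, "natural": 0.5, "robust": 1.0}
SPEED_MIN, SPEED_MAX = 0.7, 1.2


def v3_settings(mode="natural", speed=1.0):
    """Return a settings dict for a named v3 mode, clamping speed to 0.7-1.2."""
    key = mode.strip().lower()
    if key not in V3_STABILITY:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(V3_STABILITY)}")
    clamped_speed = min(max(speed, SPEED_MIN), SPEED_MAX)
    return {"stability": V3_STABILITY[key], "speed": clamped_speed}


# A tag-heavy script would typically use Creative or Natural, not Robust.
tagged_take = v3_settings("Creative", speed=1.1)
```

Clamping rather than raising on out-of-range speed is a design choice for this sketch; a stricter pipeline might prefer an error.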

Section 03

Audio tags (Eleven v3)

Audio tags are words in square brackets that v3 interprets to direct the performance. Crucially, they're natural-language instructions, not a fixed list — the examples below are illustrations, not a closed registry. Place them anywhere in the script, and combine them freely.

Emotion

[excited] [nervous] [frustrated] [sorrowful] [calm] [curious] [crying]

Non-verbal

[laughs] [laughs harder] [sighs] [clears throat] [gulps] [gasps] [whispers]

Delivery / tone

[shouts] [dramatic] [sarcastically] [deadpan] [cheerfully] [quietly]

Accents & character

[British accent] [French accent] [pirate voice] [fantasy narrator] [sci-fi AI voice]

Tags are voice-dependent — test them
The voice and its training samples affect how well a tag lands. An over-the-top tag on a calm, neutral voice may do nothing — so test tags per voice, and lean on a voice whose character already matches the direction.
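
One way to test tags per voice is to generate the same line with and without its tags and compare the takes. The sketch below only handles the text side: it lists the bracketed tags in a script and produces a tag-free baseline. The helper names are hypothetical, and nested brackets are not handled.

```python
import re

# Matches one square-bracket tag such as [excited] or [French accent].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(script):
    """Return the tag texts in the order they appear in the script."""
    return [tag.strip() for tag in TAG_PATTERN.findall(script)]


def strip_tags(script):
    """Return the script with all tags removed, for an untagged baseline take."""
    without_tags = TAG_PATTERN.sub("", script)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


script = "[excited] We did it! [laughs] I can't believe it."
tags = find_tags(script)       # ['excited', 'laughs']
baseline = strip_tags(script)  # "We did it! I can't believe it."
```

Generating both versions with the same voice and settings shows whether a given tag changes the delivery on that voice.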

Section 04

Formatting & pronunciation

Beyond tags, the text itself shapes delivery:

Technique | Effect
Ellipses … | Add pauses and weight.
CAPITALISATION | Increases emphasis / intensity.
Punctuation & structure | Drive rhythm; descriptive narration (“she said excitedly”) also colours tone.

Pauses: v2 vs v3
v2 models support SSML breaks — <break time="1.5s" /> — up to 3 seconds. Eleven v3 does not support SSML break tags; use ellipses, dashes, punctuation, structure, and audio tags to pace instead.
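
A small sketch of preparing one script for either model family: it caps break tags at 3 seconds for v2, and swaps them for ellipses on v3. Replacing a break with an ellipsis is a heuristic assumed here, not a documented equivalence, so pause length will not match exactly. The function names are hypothetical.

```python
import re

# Matches break tags written like <break time="1.5s" />.
BREAK_PATTERN = re.compile(r'<break\s+time="(\d+(?:\.\d+)?)s"\s*/>')
MAX_BREAK_SECONDS = 3.0  # upper limit stated for v2 breaks


def clamp_breaks_for_v2(text):
    """Cap every break at the 3-second maximum."""
    def clamp(match):
        seconds = min(float(match.group(1)), MAX_BREAK_SECONDS)
        return f'<break time="{seconds:g}s" />'
    return BREAK_PATTERN.sub(clamp, text)


def breaks_to_ellipses_for_v3(text):
    """Replace break tags (and whitespace before them) with an ellipsis."""
    replaced = re.sub(r"\s*" + BREAK_PATTERN.pattern, "...", text)
    # A break after a full stop would leave four dots; collapse to three.
    return re.sub(r"\.{4,}", "...", replaced)


line = 'Hold on <break time="5s" /> now listen.'
v2_line = clamp_breaks_for_v2(line)        # break capped at 3s
v3_line = breaks_to_ellipses_for_v3(line)  # 'Hold on... now listen.'
```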

For pronunciation control, the documented options are:

IPA (v3)

Wrap phonemes in slashes — e.g. /ˌbaɪoʊˈkemɪstri/

Native IPA works across 70+ languages; consistency is roughly 80–90%, so test per voice.

Phoneme tag (v2)

<phoneme alphabet="cmu-arpabet" ph="M AE1 D IH0 S AH0 N">Madison</phoneme>

CMU Arpabet is recommended for predictable results.

Alias (no phoneme support)

<alias>Cloffton</alias> — spell out unusual names/terms phonetically

Models can stumble on numbers, dates, currency, and URLs — larger models normalise better. When in doubt, expand tricky text first (e.g. write “twenty-five dollars” rather than “$25”), or use a pronunciation dictionary for recurring terms.
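
Expanding tricky text can be done before the script is sent. The sketch below is deliberately narrow: it only rewrites whole-dollar amounts from $0 to $999 and leaves anything else (cents, thousands separators, larger numbers) untouched. The helper names are hypothetical, and a production pipeline would more likely use a dedicated text-normalisation library.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()


def number_to_words(n):
    """Spell out an integer from 0 to 999 in English words."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_dollars(text):
    """Rewrite amounts like $25 as words; skip $25.50, $2,500, $2500."""
    def spell(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    return re.sub(r"\$(\d{1,3})(?!\d)(?![.,]\d)", spell, text)


spoken = expand_dollars("It costs $25.")  # 'It costs twenty-five dollars.'
```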

Section 05

Choosing a model

Model | Best for | Notes
Eleven v3 (alpha) | Expressive dialogue, character work, audiobooks | Most expressive; audio tags + Text-to-Dialogue are exclusive to v3. 70+ languages.
Multilingual v2 | Stable, emotionally rich long-form | Most stable on long content; 29 languages.
Flash v2.5 | Real-time, conversational, large-scale | Ultra-low latency (~75 ms); ~50% lower cost per character; may trade some pronunciation accuracy.

Rule of thumb
Pick v3 for expressiveness and dialogue, Multilingual v2 for stable long-form quality, and Flash for low-latency/real-time. (Turbo v2.5/v2 are deprecated — use Flash instead.)
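
The rule of thumb can be written as a small decision function. The model ID strings below are assumed identifiers for the three models and should be checked against the current model list; `pick_model` is a hypothetical helper.

```python
def pick_model(realtime=False, needs_tags_or_dialogue=False):
    """Apply the rule of thumb: v3 for tags/dialogue, Flash for real-time,
    Multilingual v2 otherwise. Model IDs are assumptions."""
    if needs_tags_or_dialogue:
        # Tags and Text to Dialogue are v3-only, so this wins even if
        # low latency was also requested.
        return "eleven_v3"
    if realtime:
        return "eleven_flash_v2_5"
    return "eleven_multilingual_v2"


audiobook_model = pick_model()
voice_agent_model = pick_model(realtime=True)
character_scene_model = pick_model(needs_tags_or_dialogue=True)
```

Checking the tag requirement first reflects that nothing else in the table supports audio tags; a caller that needs both tags and low latency has to accept that trade-off.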

Section 06

Multi-speaker dialogue

Text to Dialogue (v3 only) gives each turn its own text and voice — assign every speaker a distinct voice, with no cap on participants. Audio tags work inside dialogue, including audio events like [applause] or [gentle footsteps].

Dialogue tips from the docs
• Show interruptions with a dangling dash: [cautiously] Hello, is this seat-
• Use the optional seed for more repeatable results (subtle differences can still occur).
• Keep to ~2,000 characters per request for reliable generation; split longer text into chunks and concatenate.
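
The chunking tip can be sketched as a function that groups turns into requests of at most about 2,000 characters of text. The `{"text": ..., "voice_id": ...}` shape of each turn is an assumption about the request format, the voice IDs are placeholders, and `chunk_turns` is a hypothetical helper.

```python
MAX_CHARS = 2000  # approximate per-request budget from the tips above


def chunk_turns(turns, max_chars=MAX_CHARS):
    """Group (voice_id, text) turns into lists whose combined text length
    stays within max_chars, without splitting any single turn."""
    chunks, current, used = [], [], 0
    for voice_id, text in turns:
        if len(text) > max_chars:
            raise ValueError("a single turn exceeds max_chars; split it first")
        if current and used + len(text) > max_chars:
            chunks.append(current)
            current, used = [], 0
        current.append({"text": text, "voice_id": voice_id})
        used += len(text)
    if current:
        chunks.append(current)
    return chunks


turns = [
    ("voice_a", "[cautiously] Hello, is this seat-"),
    ("voice_b", "[laughs] Taken? No, go ahead."),
]
requests = chunk_turns(turns)  # one request holding both turns
```

Each chunk would be generated separately and the resulting audio concatenated in order.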

Section 07

Sound effects

Write sound-effect prompts in a narrative, script-like style. Documented controls: duration 0.1–30s (or auto), prompt influence (high = literal, low = more creative), and looping for seamless repeats.
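
A hedged sketch of a request builder for these controls. The field names (`text`, `duration_seconds`, `prompt_influence`, `loop`), the 0.0 to 1.0 scale for prompt influence, and its 0.3 default are assumptions; only the 0.1 to 30 second range and the existence of the three controls come from the text above.

```python
def sound_effect_request(prompt, duration_seconds=None, prompt_influence=0.3, loop=False):
    """Build a payload dict for a sound-effect generation request.

    duration_seconds=None means "let the model decide" (auto).
    """
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if duration_seconds is not None and not 0.1 <= duration_seconds <= 30:
        raise ValueError("duration_seconds must be between 0.1 and 30")
    if not 0.0 <= prompt_influence <= 1.0:
        raise ValueError("prompt_influence must be between 0.0 and 1.0")
    payload = {"text": prompt, "prompt_influence": prompt_influence, "loop": loop}
    if duration_seconds is not None:
        payload["duration_seconds"] = duration_seconds
    return payload


# A literal one-shot versus a looping ambience bed.
one_shot = sound_effect_request("Glass shattering on concrete",
                                duration_seconds=2, prompt_influence=0.8)
ambience = sound_effect_request("Wind whistling through trees", loop=True)
```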

Simple effect

Glass shattering on concrete · Heavy wooden door creaking open · Thunder rumbling in the distance

Sequential effect

Footsteps on gravel, then a metallic door opens · Wind whistling through trees, followed by leaves rustling

Musical one-shots

90s hip-hop drum loop, 90 BPM · Vintage brass stabs in F minor · Atmospheric synth pad with subtle modulation

Layer the complex ones
For dense, multi-sound moments, generate the individual effects separately and combine them in an audio editor — it beats cramming every event into one prompt. Useful vocabulary: impact, whoosh, ambience, one-shot, braam, glitch, drone.

Section 08

Music

For Eleven Music, more words aren't necessarily better: length and detail don't always correlate with quality. Concise, evocative prompts often win. You can go abstract (“eerie, foreboding”) or precise (“dissonant violin over a pulsing sub-bass”).

To get… | Prompt with…
A single instrument | Prefix “solo”: “solo electric guitar”, “solo piano in C minor”
Vocals | Prefix “a cappella”: “a cappella female vocals”; add “raw / breathy / aggressive”
Tempo & key | State them: “130 BPM”, “in A minor”
Structure / timing | “60 seconds”, “instrumental only”, “lyrics begin at 15 seconds”

Lyrics can be multilingual, and Composition Plans give precise control over section structure, lyric placement, and multi-vocalist arrangements.
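
To keep music prompts concise and consistent, the pieces from the table can be assembled programmatically. This is only a string-building sketch; `music_prompt` and its parameters are hypothetical, and the comma-separated layout is one reasonable choice rather than a required format.

```python
def music_prompt(core, bpm=None, key=None, seconds=None, instrumental=False):
    """Join a core description with optional tempo, key, length, and
    instrumental hints into one short prompt string."""
    parts = [core.strip()]
    if bpm is not None:
        parts.append(f"{bpm} BPM")
    if key:
        parts.append(f"in {key}")
    if seconds is not None:
        parts.append(f"{seconds} seconds")
    if instrumental:
        parts.append("instrumental only")
    return ", ".join(parts)


prompt = music_prompt("solo piano", bpm=130, key="A minor",
                      seconds=60, instrumental=True)
# 'solo piano, 130 BPM, in A minor, 60 seconds, instrumental only'
```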

Section 09

Best practices

Voice choice dominates (v3)

  • Pick a voice already close to the target delivery — tags amplify, they don't transform.

Match stability to the goal

  • Creative or Natural for expressive/tagged performance; Robust for consistency, though it resists tags.

Don't use <break> on v3

  • SSML breaks are v2-only — on v3 use ellipses, punctuation, and tags for pacing.

Normalise tricky text

  • Expand numbers, dates, currency, and URLs — especially on faster/smaller models.

Chunk long content

  • Respect per-model character limits; ~2,000 chars per dialogue request, then concatenate.

Section 10

Examples

Quoted from ElevenLabs' official blog posts.

Emotional progression

Do this

[tired] I've been working for 14 hours straight. [sigh] I can't even feel my hands anymore. [nervously] You sure this is going to work? [gulps] Okay… let's go.

Layered tags

Do this

[dramatic][French accent] You do not understand... zis was never about revenge. It was about destiny.

Multi-character dialogue

Do this

Jessica: [laughs] That was... beautiful.
Dr. Von Fusion: [dramatic] To be or not to be — that is the question!
Jessica: [French accent] This is spectacular, isn't it?
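
A script written in this "Speaker: line" style can be split into per-voice turns before it is sent as dialogue. The sketch below assumes the speaker names are known in advance; the voice IDs are placeholders, the output shape is an assumption about the request format, and `parse_dialogue` is a hypothetical helper. It would mis-split a line that contains a speaker's name followed by a colon inside the spoken text.

```python
import re


def parse_dialogue(script, voices):
    """Split a 'Name: text' script into turns, mapping each known speaker
    name to its voice ID. Tags inside the text are left untouched."""
    if not voices:
        raise ValueError("voices must map at least one speaker name to a voice ID")
    # Longest names first so a longer name is never shadowed by a shorter one.
    names = sorted(voices, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(n) for n in names) + r"):\s*")
    pieces = pattern.split(script)
    # pieces = [text before first speaker, name, text, name, text, ...]
    return [
        {"voice_id": voices[name], "text": text.strip()}
        for name, text in zip(pieces[1::2], pieces[2::2])
    ]


voices = {"Jessica": "voice_a", "Dr. Von Fusion": "voice_b"}  # placeholder IDs
sample = "Jessica: [laughs] That was... beautiful. Dr. Von Fusion: [dramatic] Indeed!"
turns = parse_dialogue(sample, voices)
```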

Your turn

Pick the voice. Direct the performance.

ElevenLabs voices power Ekly's voiceovers. Choose a fitting voice, tune stability, and add a tag or two.