TTS Synthesis Markup Language

With Speech Synthesis Markup Language (SSML), you can make your TTS responses sound more like natural speech. This article provides examples of how to use it (applicable to both dynamic and static TTS).

The full list of SSML elements may be helpful for additional context and examples.

Created: May 2020

Permalink: https://wildix.atlassian.net/wiki/x/YwLOAQ

Do not use the <speak> element, as it is already hardcoded.

<break>

An optional element that you can use to insert pauses between words.


Attributes

Attribute | Description
strength | Optional. Specify the relative duration of a pause using one of the following values:
  • none
  • x-weak
  • weak
  • medium (default)
  • strong
  • x-strong
time | Optional. Specify the absolute duration of a pause in seconds or milliseconds. Examples: 2s and 500ms.


Syntax

<break />
<break strength="string" />
<break time="string" />


Usage

  • Play sound -> Welcome to Wildix <break time="2s"/> Please wait for the next available operator
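
As a further sketch, the two attributes can be compared side by side. The prompt wording below is invented for illustration, and the real length of a relative strength value such as strong depends on the TTS engine, so it is worth checking by ear. The comment lines are annotations only and are not meant to be pasted into the Play sound text.

```xml
<!-- Relative pause: the engine decides the exact duration -->
Thank you for calling <break strength="strong"/> Your call is important to us

<!-- Absolute pause: half a second -->
Press 1 for sales <break time="500ms"/> Press 2 for support
```

Neither line is wrapped in <speak>, since that element is already added for you.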



<prosody>

An optional element that specifies the pitch, contour, range, rate, duration, and volume for speaking the element's text.


Attributes

Attribute | Description
pitch | Optional. Indicates the baseline pitch for the text. You may express the pitch as:
  • An absolute value, expressed as a number followed by "Hz" (Hertz). For example, 600Hz.
  • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st", that specifies an amount to change the pitch. For example: +80Hz or -2st. The "st" indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.
  • A constant value:
    • x-low
    • low
    • medium
    • high
    • x-high
    • default
contour | Optional. Represents changes in pitch for speech content as an array of targets at specified time positions in the speech output. Each target is defined by sets of parameter pairs. For example:

<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">

The first value in each set of parameters specifies the location of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see pitch).
range | Optional. A value that represents the range of pitch for the text. You may express range using the same absolute values, relative values, or enumeration values used to describe pitch.
rate | Optional. Indicates the speaking rate of the text. You may express rate as:
  • A relative value, expressed as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the rate. A value of 0.5 results in a halving of the rate. A value of 3 results in a tripling of the rate.
  • A constant value:
    • x-slow
    • slow
    • medium
    • fast
    • x-fast
    • default
duration | Optional. The period of time that should elapse while the TTS engine reads the text, in seconds or milliseconds. For example, 2s or 1800ms.
volume | Optional. Indicates the volume level of the speaking voice. You may express the volume as:
  • An absolute value, expressed as a number in the range of 0.0 to 100.0, from quietest to loudest. For example, 75. The default is 100.0.
  • A relative value, expressed as a number preceded by "+" or "-" that specifies an amount to change the volume. For example +10 or -5.5.
  • A constant value:
    • silent
    • x-soft
    • soft
    • medium
    • loud
    • x-loud
    • default


Syntax

<prosody pitch="value" contour="value" range="value" rate="value" duration="value" volume="value"></prosody>
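
Usage

A minimal sketch of how the attributes above might be used in a Play sound text. The prompt wording is invented for illustration. Not every TTS engine honours every attribute (contour, range and duration in particular are often ignored), so treat these as starting points and listen to the result. The comment lines are annotations only.

```xml
<!-- Slower and two semitones lower for an important notice -->
<prosody rate="slow" pitch="-2st">Our offices are closed today.</prosody>

<!-- Quieter aside, using a constant volume value -->
<prosody volume="soft">This call may be recorded.</prosody>

<!-- Pitch contour: targets at 0%, 10% and 40% of the text's duration -->
<prosody contour="(0%,+20Hz) (10%,-2st) (40%,+10Hz)">Good morning and welcome.</prosody>
```

Combining several attributes on one element, as in the first line, keeps the markup shorter than nesting one <prosody> inside another.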

<say-as>

An optional element that indicates the content type (such as number or date) of the element's text. 


Attributes

Attribute | Description
interpret-as | Required. Indicates the content type of the element's text. For a list of types, see the table below.
format | Optional. Provides additional information about the precise formatting of the element's text for content types that may have ambiguous formats. SSML defines formats for content types that use them (see table below).
detail | Optional. Indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for detail.


The following are the supported content types for the interpret-as and format attributes. Include the format attribute only if interpret-as is set to date or time, or optionally to telephone, where it carries a country code.

interpret-as | format | Interpretation
address | (none) | The text is spoken as an address:

I'm at <say-as interpret-as="address">West Midlands, CV1 4LY, Coventry</say-as>

cardinal, number | (none) | The text is spoken as a cardinal number:

There are <say-as interpret-as="cardinal">4</say-as> levels

characters, spell-out | (none) | The text is spoken as individual letters (spelled out):

<say-as interpret-as="characters">test</say-as>

date | dmy, mdy, ymd, ydm, ym, my, md, dm, d, m, y | The text is spoken as a date. The format attribute specifies the date's format (d=day, m=month, and y=year):

Today is <say-as interpret-as="date" format="mdy">12-05-2020</say-as>

digits, number_digit | (none) | The text is spoken as a sequence of individual digits:

<say-as interpret-as="number_digit">123456789</say-as>

fraction | (none) | The text is spoken as a fractional number:

<say-as interpret-as="fraction">3/8</say-as> of an inch

ordinal | (none) | The text is spoken as an ordinal number:

Select the <say-as interpret-as="ordinal">3rd</say-as> option

telephone | country code (optional) | The text is spoken as a telephone number. The format attribute may contain digits that represent a country code, for example "1" for the United States or "39" for Italy. If the phone number itself includes a country code, that code takes precedence over the one in format:

The number is <say-as interpret-as="telephone" format="44">3300 563 634</say-as>

time | hms12, hms24 | The text is spoken as a time. The format attribute specifies whether the time is specified using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds:

The office opens at <say-as interpret-as="time" format="hms12">4:00am</say-as>


Syntax

<say-as interpret-as="string" format="digit string" detail="string"></say-as>


Usage

The following example combines the <break> element with <say-as> to pause before announcing a date and time:

  • Dialplan application Play sound -> The person you're trying to reach isn't available <break time="2s"/> Please call back on <say-as interpret-as="date" format="dmy">12-05-2020</say-as> at <say-as interpret-as="time" format="hms12">4:00</say-as>
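
Elements can also be nested. The sketch below, with an invented ticket number and prompt wording, places <say-as> inside <prosody> so that the digits are read out slowly, then pauses before giving a callback number. How an engine combines nested elements may differ, so this is an assumption to verify rather than guaranteed behaviour. The comment line is an annotation only.

```xml
<!-- Digits read one by one at a slow rate, then a one-second pause -->
<prosody rate="slow">Your ticket number is <say-as interpret-as="digits">48213</say-as></prosody> <break time="1s"/> You can also reach us on <say-as interpret-as="telephone" format="44">3300 563 634</say-as>
```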