Speaking
To have Furhat say something, use:
furhat.say("Some text")
Behind the scenes, say calls a say-state that sends an event to the synthesizer to synthesize the text, and returns when the synthesizer is done. This means that your application will continue receiving events while speaking.
Speech synthesis actions are added to a speech synthesis queue. If the queue is empty (i.e., the system is silent), the utterance is synthesized directly; otherwise, it is played when the currently queued utterances have completed. You can control this behavior as described below.
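For example (a minimal sketch, using the async flag described below), two consecutive calls are played back-to-back:
furhat.say("First utterance", async = true) // Returns immediately; the utterance is queued
furhat.say("Second utterance") // Appended to the queue; plays when the first utterance has completed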
Controlling speech
You can control the speech synthesis through the following parameters:
- abort: Abort the current speech queue (empty it), and play the new utterance directly. (Default is false)
- ifsilent: Will only synthesize the utterance if the queue is currently empty. (Default is false)
- withVoice: Override the currently selected Voice when speaking this specific utterance
furhat.say("Hello there", abort = true) // Aborts the current speech and immediately starts speaking this utterance
furhat.say("My name is Furhat", ifsilent = true) // Speaks only if system is currently not speaking
furhat.say("Hello world!", withVoice=PollyVoice(language=Language.ENGLISH_US) // Example using a PollyVoice, but there are other options for the Voice base class
Async speech
The synthesis action is blocking (synchronous) by default, which means that the synthesis completes before the next action takes place. If you want to continue with the next action immediately, you can pass the async parameter:
furhat.say("Some other text", async = true) // whatever comes after here will be executed immediately
You can also check if Furhat is currently speaking:
furhat.isSpeaking()
Stopping speech
You can make Furhat stop speaking with the following action:
furhat.stopSpeaking()
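Combining these, a sketch of interrupting an ongoing asynchronous utterance:
furhat.say("This is a rather long announcement", async = true) // Speak without blocking
// ... do other work here ...
if (furhat.isSpeaking()) { // Check whether the utterance is still playing
    furhat.stopSpeaking() // Interrupt it
}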
Utterances
The furhat.say and furhat.ask (see further down) commands also support more complex speech actions, called Utterances. Utterances allow you to combine speech with behaviors (such as Gestures) or events that should be mixed with the spoken utterance, as well as randomly selected parts and audio files.
You can define utterances as objects, and then use them in a furhat.say:
val greeting = utterance {
    +"Hi there,"
    +Gestures.Smile
    +"nice to meet you"
}
furhat.say(greeting)
Or you can use them inline:
furhat.say {
    +"You can pick up your order"
    +glance(Location.LEFT)
    +"over there"
}
The advantage of adding actions in one single utterance like this, compared to splitting it up into several furhat.say commands, is that the prosody of the utterance will sound better, since it is synthesized as one complete utterance. It also makes it easier to package the utterance into one object.
Note that the behaviors will be executed asynchronously, so the system will not wait for the behavior to be complete before continuing.
Kotlin also allows you to easily make parametrized utterances:
fun tellPrice(number: Int) = utterance {
    +"It costs $number dollars"
}
furhat.say(tellPrice(500))
Mid-utterance behaviors
Apart from gestures and glances (as shown above), you can add any closure of your choice to the utterance. If you enclose it in the "behavior" method call, you will have access to flow-related methods such as "send" (which allows you to send events in the middle of the utterance).
furhat.say {
    +"I will now print something"
    +behavior {
        println("something")
    }
    +"in the middle of the utterance"
}
However, note that these behaviors have to be instant and asynchronous, which means that they cannot, for example, call another state. If you want to perform synchronous behaviors, such as waiting for a gesture to be completed before continuing, you have to use the "blocking" method instead:
furhat.say {
    +"Here are some of my gestures"
    +blocking {
        furhat.gesture(Gestures.BigSmile, async = false)
        furhat.gesture(Gestures.ExpressDisgust, async = false)
        furhat.gesture(Gestures.Wink, async = false)
    }
    +"how about that?"
}
If we had performed the gestures asynchronously, they would just have overridden each other, and Furhat would not have waited for them to complete before proceeding with the rest of the utterance.
There is also a shortcut for performing delays (a blocking behavior) in the middle of an utterance (note that this can also be accomplished using SSML tags, as described further down):
furhat.say {
    +"Now I will pause"
    +delay(2000) // Pausing for 2000 ms
    +"before continuing"
}
Randomizing parts
You can also add randomized parts to your utterance:
furhat.say {
    random {
        +"Hi"
        +"Hello"
    }
    +"there, nice to meet you"
}
If you need to group randomized options, you can use "block":
furhat.say {
    +"This time I am"
    random {
        block {
            +Gestures.Smile
            +"happy"
        }
        +"neutral"
    }
}
Using audio and pre-recorded speech
Utterances also allow you to play audio files, for example with pre-recorded speech, as part of an utterance:
furhat.say {
    +"An elephant sounds like this"
    +Audio("https://furhat-files.s3.eu-west-1.amazonaws.com/sounds/elephant.wav", "ELEPHANT SOUND")
}
Note: It's important that the audio files are wave files (.wav), 16-bit, 16 kHz, mono, for compatibility with the Furhat robot.
You can also include the audio file in the skill's resource folder and point to it using a classpath URL:
furhat.say {
    +"An elephant sounds like this"
    +Audio("classpath:my/audio/elephant.wav", "ELEPHANT SOUND")
}
An example of the richness you can achieve with natural speech is available on YouTube.
Audio files have to be fetched from a URL and need to be in .wav format. If a phonetics file (".pho") is provided on the same URL (see the lip sync tool below), the system will use it for lip sync. Specifically, if you provide the audio URL https://.../elephant.wav, the system looks for https://.../elephant.pho. If no such file is provided, then lip sync is automatically generated by default.
If the audio is not speech, lip sync can be disabled entirely, as in the following example. For the second argument, provide a human-readable string that can be displayed in logs and in the web interface.
Audio("https://furhat-files.s3.eu-west-1.amazonaws.com/sounds/elephant.wav", "ELEPHANT SOUND", speech = false)
Lip Sync Tool (BETA)
Alternatively, Furhat Robotics provides a tool for natural speech forced alignment (using Montreal Forced Aligner) that automatically aligns lip movements to speech of an audio file, and generates a .pho file. This tool is available on our Developer Zone on Furhat.io and in addition automatically hosts your audio files for you.
Since the lip sync tool works from a supplied transcription, the .pho file it outputs will generally be of higher quality than the automatic lip sync, which relies on a phoneme recognizer.
One example where the lip sync tool can be beneficial is adding realistic lip sync for song lyrics. To do this, you might record yourself singing a cappella, generate the .pho file with the tool, and use it together with the original music clip.
Note: A .pho file is a Furhat-specific JSON-formatted text file that, under the hood, is parsed into a furhatos.records.Transcription object. These files can be modified at your own discretion, although there are currently no purpose-built tools to help you do so.
Playing background sounds
If you want to play a background sound, for example to provide background music on external speakers, see the utility class AudioPlayer.
Voices
The Furhat robot supports the following voice providers:
- Amazon Polly
- Microsoft Azure
- Acapela
- Elevenlabs
Elevenlabs has a few pre-defined voices, but you can also define your own and clone voices based on a few minutes of audio. Note that in order for Elevenlabs to work, you also need to have Amazon Polly configured, as Elevenlabs relies on Polly for doing lip sync.
To be able to use each provider, their API tokens need to be properly configured in the settings in the Furhat web interface.
Setting the voice and language
If you do not want to specify a provider, but just pick a voice that supports a certain language and gender, you can do this:
furhat.setVoice(language=Language.ENGLISH_US, gender=Gender.MALE) // Set the voice to English male
// OR we can set it using the Voice class as a specification, in one of these ways:
furhat.setVoice(Voice(language=Language.ENGLISH_US, gender=Gender.MALE))
furhat.voice = Voice(language=Language.ENGLISH_US, gender=Gender.MALE)
If a specific voice (from a specific provider) has a unique name, you can select the voice by name. The provided name only needs to partially match the name of the voice. If several voices match, the first one is selected.
furhat.voice = Voice(name = "Matthieu")
Each provider-specific voice interface inherits from the Voice class. For some of the providers and voices, the specific voices have their own classes extending these provider classes.
- PollyVoice
- Matthew
- Astrid
- Mathieu
- ...
- PollyNeuralVoice
- Camila
- Lupe
- Justin
- ...
- AzureNeuralVoice
- Sonia
- Ethan
- Jane
- ...
- AcapelaVoice
- ElevenlabsVoice
These can be used to specify the provider for the voice:
// Specify the provider of the voice using the PollyVoice class:
furhat.voice = PollyVoice(language=Language.SWEDISH) // Sets a Swedish Amazon Polly voice
// Use a specific class for a specific voice:
furhat.voice = PollyVoice.Astrid() // Sets the Amazon Polly voice "Astrid"
// Use an Azure voice
furhat.voice = AzureVoice(name = "TonyNeural")
// Use an Elevenlabs voice
furhat.voice = ElevenlabsVoice(name = "Martin")
Before setting a voice, you can check whether it is available on the robot or SDK you are running on:
if (PollyVoice.Matthew().isAvailable) {
    furhat.voice = PollyVoice.Matthew()
} else if (PollyVoice(language=Language.ENGLISH_US).isAvailable) {
    furhat.voice = PollyVoice(language=Language.ENGLISH_US)
} else if (Voice(language=Language.ENGLISH_US).isAvailable) {
    furhat.voice = Voice(language=Language.ENGLISH_US)
}
After successfully setting a voice, furhat.voice will be set to the specific voice class that was actually selected:
furhat.voice = Voice(language=Language.ENGLISH_US)
println("The gender of the voice that was chosen is " + furhat.voice.gender)
Setting the input language
When you change Furhat's voice, the default is to also set the input language (speech recognition and NLU) to the language of the voice. If you want to set the input language to be different from the output language, you can do it like this:
furhat.setVoice(Language.ENGLISH_US, Gender.MALE, false) // The last parameter tells the system to not also set the input language
furhat.setInputLanguage(Language.SWEDISH) // Set the input language to Swedish
You can also set multiple input languages if you wish to listen in more than one language.
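A sketch of this, assuming setInputLanguage accepts multiple languages:
furhat.setInputLanguage(Language.ENGLISH_US, Language.SWEDISH) // Listen for both languages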
Note: If you are on Windows and want to use a language with special characters (like ä, å), you have to set the encoding of your JVM to UTF-8. In IntelliJ, this can be done by going to Run -> Edit Configurations -> VM options and adding "-Dfile.encoding=UTF8" there.
Amazon Polly Neural Voices
Amazon Polly has a Neural TTS (NTTS) system that can produce even higher quality voices than its standard voices.
Neural voices can have different styles; currently we support Neutral, Conversational, and News. For more information on these styles, please visit the website on speaking styles here. Note that not all voices support all voice styles; please review Amazon's documentation.
To change the style of a voice inside the flow, the following code can be used:
furhat.voice = PollyNeuralVoice.Matthew().also { it.style = PollyNeuralVoice.Style.News}
//OR
val matthewNewscaster = PollyNeuralVoice.Matthew()
matthewNewscaster.style = PollyNeuralVoice.Style.News
furhat.voice = matthewNewscaster
Microsoft Azure Neural Voices
Similarly to the Polly Neural voices, Microsoft Azure voices are neural, have special style tags, and offer a lot of options.
furhat.say {
    +voice.style("I don't want to speak to you", AzureVoice.Style.ANGRY)
}
furhat.say(AZURE_VOICE.style("But now I've had enough, and my voice is unfriendly.", AzureVoice.Style.UNFRIENDLY)) // AZURE_VOICE is assumed to be a previously defined AzureVoice instance
Note: Not all voices support the same tags; you can check the Microsoft website to determine the supported styles for each neural voice.
Voice cloning with Elevenlabs
With Elevenlabs, you can clone voices using a few minutes of audio recording. All voices that are listed under My Voices (including the ones you have cloned) should be available on Furhat.
Since Elevenlabs supports multilingual TTS, these voices all have their language set to Multilingual. If a voice is tagged with a gender (male or female), it will also have that gender set; otherwise it is tagged as Neutral.
Voice alterations
Sometimes you might want to configure how a voice sounds, for example its speed or prosody. Most TTS providers use SSML (Speech Synthesis Markup Language). However, these alterations are often voice-provider specific, i.e., Cereproc has its own tags, Acapela has its own, and Amazon Polly has its own.
To make this easier for developers, the Furhat system comes with voice interfaces for each supported TTS provider. These allow typed support for altering speech output, and ensure that the right tags are used, regardless of provider:
// Used in Say
furhat.say("I am ${furhat.voice.emphasis("really")} happy to be here")
furhat.say("Now I will make a slight pause ${furhat.voice.pause("1000ms")} before continuing")
// Used in Ask with an Utterance object:
furhat.ask {
    +"My phone number is"
    // Within the utterance, the voice can be accessed directly
    +voice.sayAs("0701234567", SayAsType.TELEPHONE)
}
Note that not all of these alterations are available for every voice provider (for instance, Voice.sayAs is not supported by Acapela).
The Azure SSML system is very rich and also quite voice-specific. We encourage you to explore the possibilities yourself on their website. You can also choose to set the text to a raw SSML structure, which gives the same end result:
furhat.say("Welcome <break strength=\"medium\" /> to text-to-speech.", withVoice=AzureVoice(language=Language.ENGLISH_GB))
Be careful to escape the quotes with backslashes so the text remains a valid string. Also keep in mind that the same tags might not always work for different voices.
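As a side note, Kotlin raw strings avoid the need for escaping; a sketch of the same call:
furhat.say("""Welcome <break strength="medium" /> to text-to-speech.""", withVoice=AzureVoice(language=Language.ENGLISH_GB))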
Voice transformations
Voices are classes that can be inherited from. This means that you can take a voice and add your own custom transformations, like this:
// A version of Matthew that speaks with variable speaking rate
class VariableRateMatthew : PollyVoice.Matthew() {
    override fun transform(text: String): String {
        val myTransformedText = prosody(text, rate = (80..120).random() / 100.0)
        // Don't forget to call super.transform() !!
        return super.transform(myTransformedText)
    }
}
// A version of Matthew that replaces 50% of words with potato
class PotatoMatthew : PollyVoice.Matthew() {
    override fun transform(text: String): String {
        val words = text.split(" ") // Split text on spaces, creating a list of words
        val potatoText = words.joinToString(" ") { word -> // Replace words by potato in 50% of the cases
            if (Math.random() > 0.5) {
                "potato"
            } else {
                word
            }
        }
        // Don't forget to call super.transform() !!
        return super.transform(potatoText)
    }
}
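Using such a custom voice is then just a matter of assigning it, building on the classes defined above:
furhat.voice = VariableRateMatthew() // Every utterance now gets a randomized speaking rate
furhat.say("This sentence may come out faster or slower than usual")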
Voice variants
When you set the voice, you can also specify some global settings (pitch, rate, volume) that change the characteristics of the voice for each utterance being said.
// Selects a female English voice, with a high pitch and 10% increase in speech rate.
furhat.voice = Voice(gender = Gender.FEMALE, language = Language.ENGLISH_US, pitch = "high", rate = 1.1)
The volume setting only works for Amazon Polly voices. The accepted strings are: "silent", "x-soft", "soft", "medium", "loud", "x-loud".
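A sketch, assuming volume is passed as a parameter in the same way as pitch and rate above:
furhat.voice = PollyVoice(language = Language.ENGLISH_US, volume = "loud") // Volume only applies to Polly voices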
Non-verbal speech sounds
Non-verbal sounds (a.k.a. voice gestures) are a key aspect of making your interaction feel human. Examples are "ehh", "ahh", "uhm", coughs, laughs, yawns, etc. These are typically specific to each voice, and only some providers and voices support them.
Please refer to these pages for vocal gestures per supported TTS voice provider:
- Amazon Polly
- Microsoft Azure
- Acapela, with a list of vocal smileys and exclamations