Speaking

Speech is fundamental to most interactions. This involves both making Furhat speak and listening for speech from the user, which Furhat can then respond to.

To have Furhat say something, use:

furhat.say("Some text")

Behind the scenes, say calls a say-state that sends an event to the synthesizer to synthesize the text and returns when the synthesizer is done. This means that your application will continue receiving events while Furhat is speaking.

Speech synthesis actions are added to a speech synthesis queue. If the queue is empty (i.e., the system is silent), the utterance is synthesized directly; otherwise it is played once the currently queued utterances have completed. You can control this behavior as described below.

Controlling speech

The start of the speech synthesis can be controlled through the following two flags:

  • abort: Abort the current speech queue (empty it), and play the new utterance directly. (Default is false)
  • ifsilent: Will only synthesize the utterance if the queue is currently empty. (Default is false)

furhat.say("Hello there", abort = true) // Aborts the current speech and immediately starts speaking this utterance
furhat.say("My name is Furhat", ifsilent = true) // Speaks only if system is currently not speaking
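
The queue behavior described above can be illustrated with a small toy model. This is only a sketch and not Furhat's implementation: SpeechQueueModel and its members are hypothetical names, and the choice to check ifsilent before abort when both are set is an assumption, since the text above does not say how the flags combine.

```kotlin
// Toy model of the speech queue semantics described above.
// Hypothetical class: this is NOT part of the Furhat API.
class SpeechQueueModel {
    private val queue = ArrayDeque<String>()

    /** Utterances in the queue; the first one is the one currently playing. */
    val pending: List<String> get() = queue.toList()

    /** Returns true if the utterance was accepted into the queue. */
    fun say(text: String, abort: Boolean = false, ifsilent: Boolean = false): Boolean {
        // ifsilent: drop the utterance unless the queue is empty (assumed to be checked first)
        if (ifsilent && queue.isNotEmpty()) return false
        // abort: empty the queue so that the new utterance plays directly
        if (abort) queue.clear()
        queue.addLast(text)
        return true
    }

    /** Simulates the synthesizer finishing the current utterance. */
    fun finishCurrent(): String? = queue.removeFirstOrNull()
}
```

In this model, a third utterance sent with ifsilent = true while two are queued is dropped, whereas one sent with abort = true replaces everything in the queue.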

Async speech

The synthesis action is blocking by default, which means that the synthesis completes before the next action takes place. If you want to continue with the next action immediately, you can pass the async parameter:

furhat.say("Some other text", async = true) // whatever comes after here will be executed immediately

You can also check if Furhat is currently speaking:

furhat.isSpeaking()

Stopping speech

You can make Furhat stop speaking with the following action:

furhat.stopSpeaking()

Utterances

The furhat.say and furhat.ask (see further down) commands also support more complex speech actions, called Utterances. Utterances allow you to combine speech with behaviors (such as Gestures) or events that should be mixed with the spoken utterance, as well as randomly selected parts and audio files.

You can define utterances as objects, and then use them in a furhat.say:

val greeting = utterance {
    +"Hi there,"
    +Gestures.Smile
    +"nice to meet you"
}
furhat.say(greeting)

Or you can use them inline:

furhat.say {
    +"You can pick up your order"
    +glance(Location.LEFT)
    +"over there"
}

The advantage of adding actions in one single utterance like this, compared to splitting it up into several furhat.say commands, is that the prosody of the utterance will sound better, since it will be synthesized as one complete utterance. It also makes it easier to package the utterance into one object.

Note that the behaviors will be executed asynchronously, so the system will not wait for the behavior to be complete before continuing.

Kotlin also allows you to easily make parametrized utterances:

fun tellPrice(number: Int) = utterance {
    +"It costs $number dollars"
}
furhat.say(tellPrice(500))

Mid-utterance behaviors

Apart from gestures and glances (as shown above), you can add any closure of your choice in the utterance. If you enclose it in the "behavior" method call, you will have access to flow-related methods such as "send" (which allows you to send events in the middle of the utterance).

furhat.say({
    +"I will now print something"
    +behavior {
        println("something")
    }
    +"in the middle of the utterance"
})

However, note that these behaviors have to be instant and asynchronous, which means that they cannot for example call another state. If you want to perform synchronous behaviors, such as waiting for a gesture to be completed before continuing, you have to use the "blocking" method instead:

furhat.say({
    +"Here are some of my gestures"
    +blocking {
        furhat.gesture(Gestures.BigSmile, async = false)
        furhat.gesture(Gestures.ExpressDisgust, async = false)
        furhat.gesture(Gestures.Wink, async = false)
    }
    +"how about that?"
})

If we had performed the gestures asynchronously, they would just have overridden each other, and Furhat would not have waited for them to complete before proceeding with the rest of the utterance.

There is also a shortcut for performing delays (a blocking behavior) in the middle of an utterance (note that this can also be accomplished using SSML-tags, as described further down):

furhat.say({
    +"Now I will pause"
    +delay(2000) // Pausing for 2000 ms
    +"before continuing"
})

Randomizing parts

You can also add randomized parts to your utterance:

furhat.say({
    random {
        +"Hi"
        +"Hello"
    }
    +"there, nice to meet you"
})

If you need to group randomized options, you can use "block":

furhat.say({
    +"This time I am"
    random {
        block {
            +Gestures.Smile
            +"happy"
        }
        +"neutral"
    }
})

Using audio and pre-recorded speech

Utterances also allow you to play audio files, for example with pre-recorded speech, as part of an utterance:

furhat.say({
    +"An elephant sounds like this"
    +Audio("http://www.mysounds.com/elephant.wav", "ELEPHANT SOUND")
})

Note: It's important that the audio files are wave files (.wav), 16 bit, 16kHz, mono for compatibility with the Furhat robot.
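
If you want to verify a file against these requirements before using it, a check along the following lines is one option. It is a minimal sketch that uses the standard Java Sound API (javax.sound.sampled) rather than anything from the Furhat SDK, and the helper name isRobotCompatible is hypothetical.

```kotlin
import java.io.File
import javax.sound.sampled.AudioFileFormat
import javax.sound.sampled.AudioFormat
import javax.sound.sampled.AudioSystem

// Hypothetical helper (not part of the Furhat API): true if the format is
// 16 bit, 16 kHz, mono, as required in the note above.
fun isRobotCompatible(format: AudioFormat): Boolean =
    format.sampleSizeInBits == 16 &&
        format.sampleRate == 16000f &&
        format.channels == 1

// Inspects the format information of a file on disk. Throws
// UnsupportedAudioFileException if the file is not a recognized audio file.
fun isRobotCompatible(file: File): Boolean {
    val fileFormat = AudioSystem.getAudioFileFormat(file)
    return fileFormat.type == AudioFileFormat.Type.WAVE &&
        isRobotCompatible(fileFormat.format)
}
```

For example, isRobotCompatible(File("elephant.wav")) could be run as part of a build step, where elephant.wav stands for whichever local file you intend to host.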

You can also include the audio file in the skill source resource folder and point to it using a URL:

furhat.say({
    +"An elephant sounds like this"
    +Audio("classpath:my/audio/elephant.wav", "ELEPHANT SOUND")
})

Note: this only works when running the skill on the robot, not on the SDK.

An example of the richness you can achieve with natural speech is available on YouTube.

Audio files have to be fetched from a URL and need to be in wav format. If a phonetics file (".pho") is provided at the same URL, the system will use it for lip sync. For example, if you provide the audio URL http://www.mysounds.com/elephant.wav, the system looks for http://www.mysounds.com/elephant.pho. If no such file is provided, lip sync is generated automatically.
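
The naming convention for the phonetics file amounts to swapping the file extension. The sketch below only illustrates that convention, for instance to check that you have hosted the .pho file at the expected address; the actual lookup is done inside the Furhat system, and phoUrlFor is a hypothetical helper.

```kotlin
// Hypothetical helper illustrating the naming convention only:
// the phonetics file is expected next to the audio file, with a .pho extension.
fun phoUrlFor(audioUrl: String): String {
    require(audioUrl.endsWith(".wav", ignoreCase = true)) {
        "Expected a .wav URL: $audioUrl"
    }
    // Drop the 4-character ".wav" suffix and append ".pho"
    return audioUrl.dropLast(4) + ".pho"
}
```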

If the audio is not speech, lip sync can be disabled entirely:

Audio("http://www.mysounds.com/elephant.wav", "ELEPHANT SOUND", speech = false)

Note: Alternatively, we have a tool for natural speech forced alignment that automatically aligns lip movements to the speech in an audio file. This tool is available on our developer zone, furhat.io, and also hosts your audio files for you.

As the second argument to Audio, you need to provide a human-readable string that can be displayed in logs and in the web interface.

Setting the voice and language

You can change the voice like this:

furhat.setVoice(language=Language.ENGLISH_US, gender=Gender.MALE) // Set the voice to English male

// OR we can set it using the Voice class as a specification, in one of these ways: 
furhat.setVoice(Voice(language=Language.ENGLISH_US, gender=Gender.MALE))
furhat.voice = Voice(language=Language.ENGLISH_US, gender=Gender.MALE)

Each provider-specific voice interface inherits the Voice class, and in turn, the specific voices have their own classes extending these provider-classes:

  • PollyVoice
    • Matthew
    • Astrid
    • Mathieu
    • ...
  • PollyNeuralVoice
    • Camila
    • Lupe
    • Justin
    • ...
  • CereprocVoice
    • William
  • AcapelaVoice

These can be used to specify the provider for the voice:

// Specify the provider of the voice using the PollyVoice class:
furhat.voice = PollyVoice(language=Language.SWEDISH) // Sets a Swedish Amazon Polly voice 

// Use a specific class for a specific voice:
furhat.voice = PollyVoice.Astrid() // Sets the Amazon Polly voice "Astrid"

Before setting a voice, you can check whether it is available on the robot or SDK you are running on:

if (CereprocVoice.William().isAvailable) {
    furhat.voice = CereprocVoice.William()
} else if (PollyVoice(language=Language.ENGLISH_US).isAvailable) {
    furhat.voice = PollyVoice(language=Language.ENGLISH_US)
} else if (Voice(language=Language.ENGLISH_US).isAvailable) {
    furhat.voice = Voice(language=Language.ENGLISH_US)
}

After successfully setting a voice, furhat.voice will be set to the specific voice class that was actually selected:

furhat.voice = Voice(language=Language.ENGLISH_US)
println("The gender of the voice that was chosen is " + furhat.voice.gender)

Amazon Polly Neural Voices

Amazon Polly has a Neural TTS (NTTS) system that can produce even higher quality voices than its standard voices. The NTTS system produces the most natural and human-like text-to-speech voices possible.

Neural voices can have different styles; currently we support Neutral, Conversational and News. For more information on these styles, see Amazon's documentation on speaking styles. Note that not all voices support all voice styles.

To change the style of a voice inside the flow, you can use the following code:

furhat.voice = PollyNeuralVoice.Matthew().also { it.style = PollyNeuralVoice.Style.News }
// OR
val matthewNewscaster = PollyNeuralVoice.Matthew()
matthewNewscaster.style = PollyNeuralVoice.Style.News
furhat.voice = matthewNewscaster

Setting the input language

When you change Furhat's voice, the default is to also set the input language (speech recognition and NLU) to the language of the voice. If you want to set the input language to be different from the output language, you can do it like this:

furhat.setVoice(Language.ENGLISH_US, Gender.MALE, false) // The last parameter tells the system to not also set the input language
furhat.setInputLanguage(Language.SWEDISH) // Set the input language to Swedish

Voice alterations

Sometimes you might want to configure how a voice sounds, for example its speed or prosody. Most TTS providers use SSML (Speech Synthesis Markup Language) for this. However, these alterations are often provider-specific: Cereproc, Acapela and Amazon Polly each have their own tags.

To make this easier for developers, the Furhat system comes with voice interfaces for each supported TTS provider. These give typed support for altering speech output and ensure that the right tags are used, regardless of provider:

// Used in Say
furhat.say("I am ${furhat.voice.emphasis("really")} happy to be here")
furhat.say("Now I will make a slight pause ${furhat.voice.pause("1000ms")} before continuing")

// Used in Ask with an Utterance object:
furhat.ask {
    +"My phone number is"
    // Within the utterance, the voice can be accessed directly
    +voice.sayAs("0701234567", SayAsType.TELEPHONE)
}
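
To give a feel for what such typed helpers do, the sketch below shows one way a helper can turn a call into markup that is interpolated into the text. It is not the Furhat voice interface: SpeechTags and StandardSsmlTags are hypothetical names, and the output uses the standard SSML break and emphasis elements, which may differ from the tags a given provider's voice class actually emits.

```kotlin
// Hypothetical sketch, not the Furhat voice interface: one typed helper per
// alteration, so callers never write provider-specific markup by hand.
interface SpeechTags {
    fun pause(duration: String): String
    fun emphasis(text: String): String
}

// An implementation that emits standard SSML elements.
// A provider with its own tag syntax would get its own implementation.
object StandardSsmlTags : SpeechTags {
    override fun pause(duration: String) = "<break time=\"$duration\"/>"
    override fun emphasis(text: String) = "<emphasis>$text</emphasis>"
}
```

With this sketch, "I am ${StandardSsmlTags.emphasis("really")} happy" evaluates to the string I am <emphasis>really</emphasis> happy, which is the same string-template pattern used with furhat.voice above.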

Voice transformations

Voices are classes that can be inherited from. This means that you can take a voice and add your own custom transformations, like this:

// A version of Matthew that speaks with variable speaking rate
class VariableRateMatthew : PollyVoice.Matthew() {
    override fun transform(text: String): String {
        val myTransformedText = prosody(text, rate = (80..120).random() / 100.0)
        // Don't forget to call super.transform()!
        return super.transform(myTransformedText)
    }
}

// A version of Matthew that replaces 50% of words with potato
class PotatoMatthew : PollyVoice.Matthew() {
    override fun transform(text: String): String {
        val words = text.split(" ") // Split text on spaces, creating a list of words
        val potatoText = words.joinToString(" ") { word -> // Replace words by potato in 50% of the cases
            if (Math.random() > 0.5) {
                "potato"
            } else {
                word
            }
        }
        // Don't forget to call super.transform()!
        return super.transform(potatoText)
    }
}
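
Since transform maps a String to a String, the text manipulation itself can be factored into a plain function and tested without a robot or a TTS provider. The following is a sketch under that assumption; potatoify is a hypothetical helper, and the Random parameter is added only so that tests can pass a seeded generator.

```kotlin
import kotlin.random.Random

// Hypothetical helper: replaces each word with "potato" with probability ~50%.
// Passing a seeded Random makes the result reproducible in tests.
fun potatoify(text: String, random: Random = Random.Default): String =
    text.split(" ").joinToString(" ") { word ->
        if (random.nextDouble() > 0.5) "potato" else word
    }
```

Inside a voice class like the ones above, transform could then return super.transform(potatoify(text)).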

Voice variants

When you set the voice, you can also specify some global settings (pitch, rate, volume) that change the characteristics of the voice for each utterance being said.

// Selects a female English voice, with a high pitch and a 10% increase in speech rate.
furhat.voice = Voice(gender = Gender.FEMALE, language = Language.ENGLISH_US, pitch = "high", rate = 1.1)

// Do the same thing specifically for the Cereproc William voice:
furhat.voice = William(pitch = "high", rate = 1.1)

Non-verbal speech sounds

Non-verbal sounds (a.k.a. voice gestures) are a key aspect of making your interaction feel human. Examples are "ehh", "ahh", "uhm", coughs, laughs and yawns. These are typically specific to each voice, and only some providers and voices support them. If they are available, you can find them through the Vocal object in the voice class:

furhat.say("Sometimes I just laugh out loud ${William.Vocal.LAUGH_1}")