Speech

Speech is fundamental to most interactions. This involves both making Furhat speak and listening for speech from the user, which Furhat can then respond to.

To have Furhat say something, use:

furhat.say("Some text")

Behind the scenes, say calls a say-state that sends an event to the synthesizer to synthesize the text, and then returns when the synthesizer is done. This means that your application will continue receiving events while speaking.
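Since your handlers stay active during a blocking say, you can react to events mid-utterance. As a minimal sketch (the state and the "LightsOn" event name are hypothetical, assuming the flow DSL's state/onEntry/onEvent builders):

```kotlin
val Talking = state {
    onEntry {
        // Blocks this handler until the utterance is done...
        furhat.say("This is a fairly long utterance that takes a while to finish")
    }
    onEvent("LightsOn") {
        // ...but other events are still delivered and handled while Furhat speaks
        println("Received LightsOn during speech")
    }
}
```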

Speech synthesis actions are added to a speech synthesis queue. If the queue is empty (i.e., the system is silent), the utterance is synthesized directly; otherwise it will be played when the currently queued utterances have completed. You can control this behavior as described below.
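As a sketch of the queueing behavior (the texts are made up for illustration), a second say issued while the first is still playing is simply appended to the queue:

```kotlin
furhat.say("First I will say this", async = true) // starts speaking, returns immediately
furhat.say("and then I will say this")            // queued, plays when the first utterance is done
```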

Controlling speech

There are two ways of controlling when speech synthesis starts, using the following flags:

  • abort: Abort the current speech queue (empty it), and play the new utterance directly. (Default is false)
  • ifsilent: Will only synthesize the utterance if the queue is currently empty. (Default is false)

furhat.say("Hello there", abort = true) // Aborts the current speech and immediately starts speaking this utterance
furhat.say("My name is Furhat", ifsilent = true) // Speaks only if system is currently not speaking

Async speech

The synthesis action is blocking by default, which means that the synthesis completes before the next action takes place. If you want to continue with the next action immediately, you can pass the async parameter:

furhat.say("Some other text", async = true) // whatever comes after here will be executed immediately

Stopping speech

You can make Furhat stop speaking with the following action:

furhat.stopSpeaking()
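For example, you can cut off a long asynchronous utterance after a while. A minimal sketch, assuming the flow's delay(ms) helper is available in this scope:

```kotlin
furhat.say("This is a very long announcement that the user may want to skip", async = true)
delay(2000)           // let it play for two seconds
furhat.stopSpeaking() // then silence Furhat immediately
```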

Utterances

The furhat.say and furhat.ask (see further down) commands also support more complex speech actions, called Utterances. Utterances allow you to combine speech with behaviors or events that should be mixed with the spoken text, as well as randomly selected parts and audio files.

You can define utterances as objects, and then use them in a furhat.say:

val greeting = utterance {
    +"Hi there,"
    +Gestures.Smile
    +"nice to meet you"
}
furhat.say(greeting)

Or you can use them inline:

furhat.say({
    +"You can pick up your order"
    +glance(Location.LEFT)
    +"over there"})

The advantage of adding actions in one single utterance like this, compared to splitting it up into several furhat.say commands, is that the prosody of the utterance will sound better, since it is synthesized as one complete utterance. It also makes it easier to package the utterance into one object.

Note that the behaviors will be executed asynchronously, so the system will not wait for the behavior to be complete before continuing.
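If you do want the speech to wait for a behavior, one possible workaround is to pause the speech itself. A sketch, assuming the utterance DSL's delay element (duration in milliseconds):

```kotlin
furhat.say(utterance {
    +"Watch this"
    +Gestures.BigSmile
    +delay(1000) // pause the speech so the gesture is not cut short
    +"and now I will continue"
})
```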

Kotlin also allows you to easily make parametrized utterances:

fun tellPrice(number: Int) = utterance {
    +"It costs $number dollars"
}
furhat.say(tellPrice(500))
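Parameters combine naturally with the other utterance features. A sketch (the function and phrasings are made up for illustration):

```kotlin
fun tellWeather(degrees: Int) = utterance {
    random {
        +"Right now it is"
        +"The temperature is currently"
    }
    +"$degrees degrees outside"
}
furhat.say(tellWeather(21))
```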

Apart from gestures and glances (as shown above), you can add any closure of your choice in the utterance. If you enclose it in the "behavior" method call, you will have access to flow-related methods such as "send" (which allows you to send events in the middle of the utterance).

furhat.say({
    +"I will now print something"
    +behavior{println("something")}
    +"in the middle of the utterance"})

Randomizing parts

You can also add randomized parts to your utterance:

furhat.say({
    random {
        +"Hi"
        +"Hello"
    }
    +"there, nice to meet you"})

If you need to group randomized options, you can use "block":

furhat.say({
    +"This time I am"
    random {
        block {
            +Gestures.Smile
            +"happy"
        }
        +"neutral"
    }})

Using natural speech

Utterances also allow you to play audio files of natural speech as part of an utterance:

furhat.say({
    +"An elephant sounds like this"
    +Audio("http://www.mysounds.com/elephant.wav", "ELEPHANT SOUND")})

An example of the richness you can achieve with natural speech is available on YouTube.

Audio files have to be fetched from a URL and need to be in wav format. For lip sync to work, a phonetics file (".pho") needs to be provided at the same URL. That is, if you provide the audio URL http://www.mysounds.com/elephant.wav, the system will expect http://www.mysounds.com/elephant.pho to exist.

Note: We have a new tool for natural speech forced alignment that automatically aligns lip movements to the speech in an audio file. This tool is available on our developer zone furhat.io and also hosts your audio files for you.

As a second argument, you provide a human-readable string that can be displayed in logs and in the web interface.

Setting the voice and language

You can change the voice like this:

furhat.setVoice(Language.ENGLISH_US, Gender.MALE) // Set the voice to English male
// OR
furhat.setVoice(Language.ENGLISH_US, "William") // Set the voice "William" (it is enough if the name matches partially)
// OR
furhat.setVoice(Language.ENGLISH_US, "William", Gender.MALE) // Set the voice to "William". If that doesn't exist, set the voice to English male.

The default is to also set the input language (speech recognition and NLU) to the language of the voice. If you want to set the input language to be different from the output language, you can do it like this:

furhat.setVoice(Language.ENGLISH_US, Gender.MALE, false) // The last parameter tells the system to not also set the input language
furhat.setInputLanguage(Language.SWEDISH) // Set the input language to Swedish

Voice alterations

Note: the voice alteration docs are currently under construction. Contact us if you want support in the meantime.

Sometimes you might want to configure how a voice sounds, for example its speed or prosody. These alterations are voice-provider specific: Cereproc, Acapela, and Amazon Polly each have their own tags. Most TTS providers build on SSML (Speech Synthesis Markup Language).

To make this easier for developers, the Furhat system comes with voice interfaces for each supported TTS provider. These provide typed support for altering speech output. Each provider-specific voice interface inherits from the Voice class, which assumes an SSML-capable voice. We strongly recommend using the TTS-provider-specific voice interfaces (see below), since they provide extended functionality.

// Create the voice interface. Note, we don't recommend using the Voice class but rather one of the below explained provider specific interfaces.
val myVoice = Voice(gender = Gender.FEMALE, language = Language.ENGLISH_US)

// Set the voice
furhat.setVoice(myVoice)

Each voice interface can have global settings that apply to each utterance being said and methods that can be explicitly used.

For global options, you can set the global prosody (pitch, rate, volume) settings for any voice by:

// Set global options, in this case a high pitch and 10% increase in speech rate.
val myVoice = Voice(gender = Gender.FEMALE, language = Language.ENGLISH_US, pitch = "high", rate = 1.1)

// Set voice
furhat.setVoice(myVoice)

// Any following say will use the global settings
furhat.say("This will be said with high pitch and faster rate")

You can also use methods controlling specific utterances:

// Used in Say
furhat.say("I am ${myVoice.emphasis("really")} happy to be here")

// Used in Ask with an Utterance object
furhat.ask {
    +"My phone number is"
    +glance(Location.LEFT)
    +myVoice.sayAs("0701234567", SayAsType.TELEPHONE)
}

Cereproc voice tags

To define a voice interface for a Cereproc voice, use the CereprocVoice class. It supports a subset of SSML; for all available methods, check the class methods in your IDE.

// Create the voice interface. If william is not found, the system will resort to another voice of the same language.
val myVoice = CereprocVoice(name = "william", language = Language.ENGLISH_GB)

// Set the voice
furhat.setVoice(myVoice)

// Use the voice
furhat.say("I am ${myVoice.emphasis("really")} happy to be here") // This produces the SSML "I am <emphasis>really</emphasis> happy to be here"

For more information, please see the Cereproc Tagset document.

Acapela voice tags

To define a voice interface for an Acapela voice, use the AcapelaVoice class. It supports a subset of SSML functionality, but with Acapela's own tag system, so some of the functionality differs. For all available methods, check the class methods in your IDE.

// Create the voice interface. If Elin is not found, it will resort to another female gender Swedish voice.
val myVoice = AcapelaVoice(name = "Elin", gender = Gender.FEMALE, language = Language.SWEDISH)

// Set the voice
furhat.setVoice(myVoice)

// Use the voice
furhat.say("I'm saying the following with ${myVoice.pitch("lower pitch", 170)}") // This produces "I'm saying the following with \pit=170\lower pitch\pit=185" where 185 is the standard pitch for a female voice.

For more information, please see section 4 ("Text tags") in the Acapela user manual.

Amazon Polly SSML

To define a voice interface for an Amazon Polly voice, use the PollyVoice class. It supports a subset of SSML; for all available methods, check the class methods in your IDE.

// Create the voice interface. The first available female American English voice will be used.
val myVoice = PollyVoice(gender = Gender.FEMALE, language = Language.ENGLISH_US)

// Set the voice
furhat.setVoice(myVoice)

// Use the voice
furhat.say("I see ${myVoice.whisper("dead people")} everywhere") // This produces the SSML "I see <amazon:effect name="whispered">dead people</amazon:effect> everywhere"

Please see Amazon's documentation on supported SSML tags.

Non-verbal speech

Non-verbal sounds (a.k.a. voice gestures) are a key aspect of making your interaction feel human and authentic. Voice gestures are voice-specific; examples are "ehh", "ahh", "uhm", coughs, laughs, and yawns. They are a great help when making an interaction more lifelike.

Typically, voice gestures are written as normal strings and transformed by the speech synthesizer. Each voice comes with its own set of voice gestures, so make sure the voice you are using has the specific gesture you need. The syntax also varies between providers, as described below.

William voice gestures

An example of a voice gesture with Cereproc's William voice (the default Furhat voice) is:

furhat.say("GESTURE_TUT_TUT")

A list of shortcut tags for William is available here.

Acapela voice gestures

For Acapela voices, each voice comes with a unique set of voice gestures. These are defined in .lst files in each of the voice folders. You use the tags as they are, for example furhat.say("#YAWN01#").