Listening

A fundamental part of a speech interaction, besides speaking, is listening to and recognizing user speech. This page describes how this is done in the Furhat system. For documentation of the role listening plays in flows, see the flow docs. For documentation of how user speech is interpreted into actionable meaning, see the natural language understanding (NLU) docs.

Listening to and asking the user

Listening

To have Furhat listen, you use the listen command (assuming you have a microphone configured):

furhat.listen() // Listen with a default timeout of 5 seconds

furhat.listen(timeout = 3000) // Listen with a timeout of 3 seconds (see below)

Behind the scenes, this calls a listen-state where Furhat will start listening through the selected microphones and return once a spoken utterance OR a timeout is detected. You catch these results with the various response handlers that are described further down.

You can specify three different timeouts that are involved in the listening, as arguments to the listen method:

  • timeout (noSpeechTimeout): The number of milliseconds of silence that should pass before an onNoResponse handler is triggered (i.e., it was determined that the user was silent and Furhat should do something). Default is 5000. If you want to continue listening after this timeout, you have to call furhat.listen() again, as sketched after this list.
  • endSil (endSilTimeout): The number of milliseconds of silence that should pass after the user has spoken before an onResponse handler is triggered. Default is 800. If this value is decreased, Furhat will be more rapid in responding to the user, but with the risk of interrupting the user in pauses. If the value is increased, Furhat will be less likely to interrupt the user, but will have longer response times.
  • maxSpeech (maxSpeechTimeout): The maximum length of the user's utterance (in ms) before Furhat will interrupt the user (i.e., an onResponse is triggered with what has been said so far). The default is 15000. Note that utterances longer than 60s will always be interrupted due to constraints in the recognizer backend.
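
For example, a minimal sketch of a state that silently resumes listening when the noSpeechTimeout expires (the state name is just illustrative):

val PatientState = state {
    onEntry {
        furhat.listen()
    }

    onNoResponse {
        // The noSpeechTimeout expired; call listen() again to keep listening
        furhat.listen()
    }
}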

You can either set these parameters directly in furhat.listen (in which case they are only valid for that listening), or change the global defaults:

// Set it for this listening:
furhat.listen(endSil = 1000, timeout = 8000, maxSpeech = 30000)

// Change the default thresholds:
furhat.param.endSilTimeout = 1000
furhat.param.noSpeechTimeout = 8000
furhat.param.maxSpeechTimeout = 30000

Stop listening

You can cancel a listen request.

furhat.stopListening()
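
A minimal sketch of when this can be useful, assuming a hypothetical custom event (for example, sent from a GUI button) that should abort an ongoing listen:

val MyState = state {
    onEntry {
        furhat.listen(timeout = 10000)
    }

    onEvent("StopListening") { // "StopListening" is a hypothetical event name
        furhat.stopListening()
    }
}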

Asking

Asking (a.k.a. prompting) is a shorthand for combining a say and a listen. It triggers the same events as the listen method.

furhat.ask("What is your name?") // Ask with a default timeout of 8 seconds

furhat.ask("How old are you? Now you have to answer fast", timeout = 4000) // Ask with a timeout of 4 seconds

Response handlers

onResponse - handling verbal responses from users

The onResponse handler is executed whenever a furhat.listen() or furhat.ask() is called and speech is picked up from a user.

It is also possible to add an Intent or Entity to the onResponse handler. If speech is caught, the Furhat system will use natural language understanding to parse the meaning of the utterance (for more info, see the NLU documentation). Each Intent or Entity found in onResponse handlers throughout the state hierarchy is used to classify the user's intent. Similar to events, if a response is not caught, it propagates first to subsequent handlers in the same state, then to parent states, and finally to calling states.

A response object has several parameters. The most important ones are:

Name       Type           Description
intent     Intent object  The (optional) intent that the utterance was classified as
text       String         The text spoken
userId     String         The id of the user who spoke
contains   method         A method to search the utterance for an entity
findFirst  method         A method to find the first occurrence of an entity

val MyState = state {
    onEntry {
        furhat.ask("What happens?")
    }

    onResponse<MyIntent> {
        furhat.say("I understood that you said ${it.text} and meant ${it.intent.toText()}")
    }

    onResponse { // Catches everything else
        furhat.say("I didn't understand that")
    }
}
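
The contains and findFirst methods let you search the utterance for entities. A rough sketch, assuming a Fruit entity like the one defined further down this page (the entity-instance argument style is an assumption; see the NLU documentation for the exact signatures):

onResponse {
    // Assumption: findFirst takes an entity instance and returns the first match, or null
    val fruit = it.findFirst(Fruit())
    if (fruit != null) {
        furhat.say("You mentioned a fruit")
    }
}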

Handling multi-intent responses

Sometimes you want to catch multiple intents in one response. The response handlers make it easy to support two intents in a single utterance, such as "Hi there, I would like to order a burger" (which could contain both a Greeting and an Order intent).

There are two ways of doing this. First, you can use the onPartialResponse handler:

val MyState = state {
    onEntry {
        furhat.ask("How can I help you?")
    }

    onResponse<Greeting> { // Catches an isolated Greeting
        // Greet the user and reenter the state
        furhat.say("Hi there")
        reentry()
    }

    onPartialResponse<Greeting> { // Catches a Greeting together with another intent, such as Order
        // Greet the user and proceed with the order in the same turn
        furhat.say("Hi there")
        raise(it, it.secondaryIntent)
    }

    onResponse<Order> {
        /*
            Handle the order.
            This will be caught either if the user makes a direct Order
            or if it is triggered by the onPartialResponse above
        */
    }
}

The second way of doing this is by providing partial intents as an argument to the main response handler:

val MyState = state {
    onEntry {
        furhat.ask("How can I help you?")
    }

    onResponse<Greeting> { // Catches an isolated Greeting
        // Greet the user and reenter the state
        furhat.say("Hi there")
        reentry()
    }

    // Here we also provide a list of optional partial intents that this intent can be combined with:
    onResponse<Order>(partial=listOf(Greeting())) { 
        /*
            Handle the order.
            The Greeting intent (if existing) can be reached through it.secondaryIntent. 
        */
    }
}

onNoResponse handler

The onNoResponse handler is triggered when no audio was picked up from the user.

val MyState = state {
    onEntry {
        furhat.ask("What happens?")
    }

    onResponse<MyIntent> {
        furhat.say("I understood that you said ${it.text} and ment ${MyIntent}")
    }

    onResponse { // Catches everything else
        furhat.say("I didn't understand that")
    }

    onNoResponse { // Catches silence
      furhat.say("I didn't hear anything")
    }
}

onResponseFailed handler

This handler catches the event raised when something goes wrong with the speech recognition. It is caught by the default Dialog state, but you can implement the handler yourself if you want to override that behavior.
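
A minimal sketch of overriding it yourself (the fallback utterance is just an example):

val MyState = state {
    onEntry {
        furhat.ask("What happens?")
    }

    onResponseFailed {
        // Speech recognition failed, for example due to connection problems
        furhat.say("Sorry, I'm having trouble hearing right now")
        terminate()
    }
}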

onInterimResponse handler

It is also possible to handle incremental (interim) speech recognition results while the user is still speaking. You do this by adding an onInterimResponse handler with a timeout.

val MyState = state {
    onEntry {
        // We set longer endSil and maxSpeech thresholds to allow for longer utterances and pauses in the user's speech
        furhat.ask("What happens?", endSil = 2000, maxSpeech = 30000)
    }

    // onInterimResponse is always instant, so that the listening is not aborted when it is triggered
    // We set the handler to be triggered as soon as we detect a 500ms pause
    onInterimResponse(endSil = 500) {
        // We give some feedback to the user, "okay" or a nod gesture.
        random(
            // Note that we need to set async = true, since we are in an instant trigger
            { furhat.say("okay", async = true) },
            // Gestures are async per default, so no need to set the flag
            { furhat.gesture(Gestures.Nod) }
        )
    }

    onResponse { 
        furhat.say("Thanks for your answer")
    }
}

If the endSil parameter for the onInterimResponse handler is not provided (i.e., default = 0), it will be triggered for every new interim response. No intent classification will be performed for interim responses, but you can access the recognized text through it.speech.text.
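
For example, a minimal sketch that simply logs every interim result:

onInterimResponse {
    // No endSil argument (default = 0): triggered for every new interim result
    println("User so far: ${it.speech.text}")
}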

Ask with return values

askFor

If you simply want to ask for a specific intent or entity, there is an efficient way of implementing it using askFor.

val MyState = state {
    onEntry {
        var date = furhat.askFor<Date>("Which date were you born?")
        furhat.say("You were born on $date")
    }
}

You can also add inline response handlers to askFor:

val MyState = state {
    onEntry {
        var date = furhat.askFor<Date>("Which date were you born?") {
            onResponse<DontKnow> {
                furhat.say("You should really know that!")
                reentry()
            }
        }
        furhat.say("You were born on $date")
    }
}

askYN

If you simply want to ask a yes/no question, there is an efficient way of implementing it using askYN, which returns a boolean.

val MyState = state {
    onEntry {
        var happy = furhat.askYN("Are you happy?")
        if (happy) {
            furhat.say("You are happy")
        } else {
            furhat.say("You are not happy")
        }
    }
}

You can also add inline response handlers to askYN:

val MyState = state {
    onEntry {
        var happy = furhat.askYN("Are you happy?") {
            onResponse<Maybe> {
                furhat.say("Make up your mind!")
                reentry()
            }
        }
        if (happy) {
            furhat.say("You are happy")
        } else {
            furhat.say("You are not happy")
        }
    }
}

Changing default responses

All Furhat skills come with default response handlers that handle uncaught speech, silences and errors. The reason is that you always want to catch a response somewhere in the system to avoid unexpected behavior. There may be cases where you don't want this to happen, in which case you are free to override the handlers. Another reason for doing this is if you want to change the default utterances (as listed below) or perhaps add support for other languages.

The recommendation is to do so in a superstate that all your interactive states inherit from (if you used our default template when creating your skill, this would be the Interaction state defined in general.kt). It is as simple as implementing the same handlers and not propagating the response events further.
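
A minimal sketch of such an override, assuming the Interaction superstate from the default template (the utterances are just examples):

val Interaction: State = state {
    onResponse {
        // Replace the default "didn't understand" utterance and do not propagate further
        furhat.say("Let's talk about something else")
        reentry()
    }

    onNoResponse {
        // Silently re-enter the state (and re-ask) instead of commenting on the silence
        reentry()
    }
}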

For your information, the implicit default state looks like this:

val dialogState = state(parent = null) {
    var noinput = 0
    var nomatch = 0

    onResponse {
        nomatch++
        if (nomatch > 1)
            furhat.say("Sorry, I still didn't understand that")
        else
            furhat.say("Sorry, I didn't understand that")
        reentry()
    }

    onNoResponse {
        noinput++
        if (noinput > 1)
            furhat.say("Sorry, I still didn't hear you")
        else
            furhat.say("Sorry, I didn't hear you")
        reentry()
    }

    onResponseFailed {
        furhat.say("Sorry, my speech recognizer is not working")
        terminate()
    }
}

Changing the turn-taking policy

By default, the user's turn is considered complete once the end of the user's speech is detected, and an onResponse handler is triggered. It is possible to control this behaviour by setting another TurnTakingPolicy. The TurnTakingPolicy has a method called turnYieldTimeout, which takes the response and determines whether Furhat should take the turn (i.e., whether the onResponse handler should be triggered). If turnYieldTimeout returns 0, Furhat immediately takes the turn; if the value is higher, it is treated as a timeout (in milliseconds) after which Furhat takes the turn if the user has not continued speaking.

If the user continues speaking, the new speech will be appended to the previous speech and a new intent classification will be performed, and the TurnTakingPolicy will be checked again (and so on).

The following example shows how you can implement a policy that gives the user more time if the response was very short:

Furhat.turnTakingPolicy = object : TurnTakingPolicy {
    override fun turnYieldTimeout(response: Response<*>): Int {
        if (response.speech.length < 1000) {
            // The response was less than 1000 ms long; give the user 2000 ms more to continue speaking
            return 2000
        } else {
            return 0
        }
    }
}

The following example shows how you can implement a policy that gives the user more time if no intent was found:

Furhat.turnTakingPolicy = object : TurnTakingPolicy {
    override fun turnYieldTimeout(response: Response<*>): Int {
        if (response.intent == NullIntent) {
            return 4000
        } else {
            return 0
        }
    }
}

The following example shows how you can implement a policy that gives the user more time if the user is not looking at Furhat (which can be a sign that the user is not done yet):

Furhat.turnTakingPolicy = object : TurnTakingPolicy {
    override fun turnYieldTimeout(response: Response<*>): Int {
        if (!UserManager.current.isAttendingFurhat) {
            return 2000
        } else {
            return 0
        }
    }
}

Improving recognition with phrases

While today's speech recognition services usually perform very well in most languages, you will sometimes notice that they struggle to pick up certain phrases. These may be brand names, odd spellings or unusual words that sound similar to other words. The recognition services particularly struggle with one-word utterances; for example, in a poker game, saying just "fold" is hard to pick up, while "I choose to fold" is recognized much more easily.

To get around this, you can send a list of phrases (strings) to the recognizer with the method furhat.setSpeechRecPhrases(List<String>).
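
For example, to prime the recognizer for the poker case above (the exact phrases are just illustrative):

furhat.setSpeechRecPhrases(listOf("fold", "raise", "call"))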

It is also possible to add these phrases directly to the NLU. EnumEntity allows you to automatically add certain phrases, by setting a flag in the constructor. In this example, all fruit words would be primed whenever this entity is being used in an active intent:

class Fruit : EnumEntity(speechRecPhrases=true) {

    override fun getEnum(lang: Language): List<String> {
        return listOf("banana", "orange", "apple", "pineapple", "pear")
    }

}

You can also let an Intent return specific words which could be hard to recognize:

class Greeting : Intent() {

    override fun getExamples(lang: Language): List<String> {
        return listOf("how are you", "how do you do", "howdy")
    }

    override fun getSpeechRecPhrases(lang: Language): List<String> {
        return listOf("howdy")
    }

}

Improving recognition with multiple hypotheses

By default, the speech recognizer only returns the top hypothesis. You can set it to return more hypotheses, in which case intent classification is done for all hypotheses, and the best combination of speech recognition and intent classification confidence scores is automatically selected as the response. This way, a response that the speech recognizer did not deem the most likely, but which makes more sense in the current context, can be selected.

Note that there are potential downsides to processing multiple hypotheses: (1) it is more processing intensive, and (2) it might lead to more false positives, especially if many intents are active. It is therefore typically most useful in cases where you have more constrained expectations, such as a quiz with four alternative responses.

furhat.param.recognitionAlternatives = 5
furhat.listen()

These hypotheses (when the parameter is set to more than 1) can be reviewed in any onResponse handler like so:

onResponse { response ->
    response.alternatives // The type is List<Triple<MultiIntentHyp, String, Double>>
    val singleAlternative = response.alternatives.first()
    singleAlternative.first // The multi-intent hypothesis
    singleAlternative.first.conf // The confidence
    singleAlternative.first.intents // The intents
    singleAlternative.second // The text returned from the ASR, on which the intent is based
    singleAlternative.third // A combined score of the intent confidence and the text confidence
}

Recognizers

Recognizing speech is one of the key aspects of having a dialog, which is why Furhat currently provides a choice of speech-to-text recognizers, also known as automatic speech recognizers (ASR).

Which one is best for me and my robot?

An automatic speech recognizer converts audio to text. When the robot is asked to listen, it uses a recognizer to understand human speech. Currently all supported recognizers are cloud based, and therefore require an internet connection to function.

Your robot will come with credentials for our recommended recognizer, Google Cloud Speech-to-Text.

You may alternatively use Microsoft Azure Speech-to-Text, if you find that Google Cloud doesn't fit your needs. If you intend to use Microsoft Azure, bear in mind that you must create your own Azure account and provide the robot with your credentials.

You can always change which recognizer your robot is using via the Settings menu in the web interface. Alternatively, your skill can attempt to set the recognizer with the following commands:

furhat.setRecognizer(ASRProvider.GOOGLE)
//Or
furhat.setRecognizer(ASRProvider.AZURE)  

Important note: When your skill runs on a robot without Microsoft Azure credentials, the above operation will not be able to change the recognizer to Microsoft Azure. In this case it will return false.

val setToAzure = furhat.setRecognizer(ASRProvider.AZURE)
if(!setToAzure){
    furhat.say("I was unable to change my recognizer to Azure.")
}  

Differences between Google and Azure

Feature                            Google                 Microsoft
Credentials provided by default    Yes                    No
Logging                            Yes                    No
Available in China                 No                     Yes
Multiple language recognition      Yes                    Yes, but not in real time and a maximum of 2
Longest phrase                     60 seconds             20 seconds
Multiple recognition hypotheses    Yes                    Yes
Pricing                            $0.006 / 15 seconds*   $1 per audio hour ($0.0028 / second)
Languages                          ~120                   ~37

* When listening, Google charges you for a minimum of 15 seconds of audio per listening call. For example, if your robot listens for 7 seconds, then for 10 seconds, and then for 17 seconds, you would be charged for 15 + 15 + 17 = 47 seconds.

Microsoft Azure (Beta)

Microsoft Azure is available from Azure Portal. Your robot does not come with credentials for Microsoft Azure, and you will need to create your own trial or paid account to use the service.

Your credentials will work wherever your robot is physically located, but you may notice a change in latency as your robot travels the world using the same credentials.

There are a few differences with using Microsoft Azure as your robot's speech recognizer:

  • Text will come with punctuation. This will not affect NLU, as we standardize all text before it goes through our NLU engine.

  • Listening to multiple languages will cause significant delays in the dialog, as it must wait 4-10 seconds for Microsoft's reply when this feature is on. We strongly suggest using this feature sparingly, and only in interactions where your robot can afford to wait this long before getting a result back.

  • Azure limits calls made with the same credentials to 25 calls per 5 seconds. If your robot exceeds this limit, your attempts to listen will be cancelled with the message "Too many Requests". We recommend having separate credentials for each robot you use.

  • We do not support audio logging with Microsoft Azure. The dialog logging features of Furhat are disabled when using Microsoft Azure.

  • With Microsoft Azure, the robot will listen to at most 20 seconds of speech. This does not include any silence while the robot is waiting for a person to speak, i.e. the initial silence timeout parameter.

How do I get Microsoft Azure Credentials?

If you don't already have a Microsoft Azure account, you must create one.

Once you have an account, log in to Azure Portal and create a new Speech resource. Note: there is also a resource called Speech to text; this is an unrelated third-party product that your robot is unable to use. Once created, find the key for the service you created. Give this key, and the region your service is located in, to your robot.

We recommend using this latency tool to help you determine which region you should get a speech service for. Note that the Speech service is not available in all regions listed in the tool.