Listening

A fundamental part of a speech interaction, aside from speech synthesis, is listening to and recognizing user speech. This page describes how this is done in the Furhat system. For documentation of what roles listening plays in flows, see the flow docs. For documentation of how user speech is interpreted into some actionable meaning, see the natural language understanding (NLU) docs.

Listening to and asking the user

Listening

To have Furhat listen, you use the listen command (assuming you have a microphone configured):

furhat.listen() // Listen with a default timeout of 5 seconds

furhat.listen(timeout = 3000) // Listen with a timeout of 3 seconds (see below)

Behind the scenes, this enters a listen state where Furhat starts listening through the selected microphones and returns once a spoken utterance OR a timeout is detected. You catch these results with the various response handlers described further down.

You can specify three different timeouts that are involved in the listening, as arguments to the listen method:

  • timeout (noSpeechTimeout): The number of milliseconds of silence that should pass before an onNoResponse handler is triggered (i.e., it was determined that the user was silent and Furhat should do something). Default is 5000. If you want to continue listening after this timeout, you have to call furhat.listen() again.
  • endSil (endSilTimeout): The number of milliseconds of silence that should pass after the user has spoken before an onResponse handler is triggered. Default is 800. If this value is decreased, Furhat will be more rapid in responding to the user, but with the risk of interrupting the user in pauses. If the value is increased, Furhat will be less likely to interrupt the user, but will have longer response times.
  • maxSpeech (maxSpeechTimeout): The maximum length of the user's utterance (in ms) before Furhat will interrupt the user (i.e., an onResponse is triggered with what has been said so far). The default is 15000. Note that utterances longer than 60s will always be interrupted due to constraints in the recognizer backend.

You can either set these parameters directly in furhat.listen (in which case they are only valid for that listen call), or as global defaults:

// Set it for this listening:
furhat.listen(endSil = 1000, timeout = 8000, maxSpeech = 30000)

// Change the default thresholds:
furhat.param.endSilTimeout = 1000
furhat.param.noSpeechTimeout = 8000
furhat.param.maxSpeechTimeout = 30000

Asking

Asking (a.k.a. prompting) is a shorthand for combining a say and a listen. It triggers the same events as the listen method.

furhat.ask("What is your name?") // Ask with a default timeout of 8 seconds

furhat.ask("How old are you? Now you have to answer fast", timeout = 4000) // Ask with a timeout of 4 seconds

Response handlers

onResponse - handling verbal responses from users

The onResponse handler is executed whenever a furhat.listen() or furhat.ask() is called and speech is picked up from a user.

It is also possible to add an Intent or Entity to the onResponse handler. If speech is caught, the Furhat system will use natural language understanding to parse the meaning of the utterance - for more info see the NLU documentation. Each Intent or Entity found in onResponse handlers throughout the state hierarchy is used to classify the user's intent. Similar to events, if a response is not caught it will propagate first to subsequent handlers in the same state, then to parent states, and finally to calling states.

A response object has several parameters. The most important ones are:

  • intent (Intent object): The (optional) intent that the utterance was classified as.
  • text (String): The text spoken.
  • userId (String): The id of the user who spoke.
  • contains (method): Searches the utterance for an entity.
  • findFirst (method): Finds the first occurrence of an entity in the utterance.

val MyState = state {
    onEntry {
        furhat.ask("What happens?")
    }

    onResponse<MyIntent> {
        furhat.say("I understood that you said ${it.text} and meant ${it.intent.toText()}")
    }

    onResponse { // Catches everything else
        furhat.say("I didn't understand that")
    }
}
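
The contains and findFirst methods listed above can be used to check an utterance for entities. A minimal sketch, assuming a Fruit EnumEntity like the one defined under "Improving recognition with phrases" below (the exact method signatures are an assumption and may differ between SDK versions):

    onResponse {
        // contains checks whether the utterance mentions a Fruit entity (assumed signature)
        if (it.contains(Fruit())) {
            // findFirst returns the first Fruit entity found in the utterance (assumed signature)
            val fruit = it.findFirst(Fruit())
            furhat.say("You mentioned ${fruit?.text}")
        } else {
            furhat.say("You didn't mention any fruit")
        }
    }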

onPartialResponse - handling multi-intent responses

Sometimes you want to catch multiple intents in one response. The response handlers make it easy to do this when you want to support two intents, such as "Hi there, I would like to order a burger" (which could contain both a Greeting and an Order intent):

val MyState = state {
    onEntry {
        furhat.ask("How can I help you?")
    }

    onResponse<Greeting> { // Catches an isolated Greeting
        // Greet the user and reenter the state
        furhat.say("Hi there")
        reentry()
    }

    onPartialResponse<Greeting> { // Catches a Greeting together with another intent, such as Order
        // Greet the user and proceed with the order in the same turn
        furhat.say("Hi there")
        raise(it, it.secondaryIntent)
    }

    onResponse<Order> {
    /*
        Handle the order.
        This will be caught either if the user makes a direct Order
        or if it is triggered by the onPartialResponse above
    */
    }
}

By default, the partial response has to precede the other intent. If this should not be the case, you can pass a prefix = false parameter to onPartialResponse, like onPartialResponse<Greeting>(prefix = false) { ... }. In this case, the intents can come in any order, i.e., "I want to order a burger, hello" would also match.

onNoResponse handler

The onNoResponse handler is triggered when no speech was picked up from the user before the timeout passed.

val MyState = state {
    onEntry {
        furhat.ask("What happens?")
    }

    onResponse<MyIntent> {
        furhat.say("I understood that you said ${it.text} and meant ${MyIntent}")
    }

    onResponse { // Catches everything else
        furhat.say("I didn't understand that")
    }

    onNoResponse { // Catches silence
        furhat.say("I didn't hear anything")
    }
}

onResponseFailed handler

The onResponseFailed handler catches the event when an error occurs in the speech recognition. This is caught by the default Dialog state, but you can use this trigger if you want to override it.
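
A sketch of overriding it in a parent state of your own, mirroring the implicit default state shown under "Changing default responses" below (the state name Interaction is just illustrative):

    val Interaction = state {
        onResponseFailed { // Triggered on a speech recognition error
            furhat.say("Sorry, my speech recognizer is not working")
            terminate()
        }
    }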

onInterimResponse handler

It is also possible to handle incremental (interim) speech recognition results while the user is still speaking. You do this by adding an onInterimResponse handler with a timeout.

val MyState = state {
    onEntry {
        // We set longer endSil and maxSpeech thresholds to allow for longer utterances and pauses in the user's speech
        furhat.ask("What happens?", endSil = 2000, maxSpeech = 30000)
    }

    // onInterimResponse is always instant, so that the listening is not aborted when it is triggered
    // We set the handler to be triggered as soon as we detect a 500ms pause
    onInterimResponse(endSil = 500) {
        // We give some feedback to the user, "okay" or a nod gesture.
        random (
            // Note that we need to set async = true, since we are in an instant trigger
            { furhat.say("okay", async = true) },
            // Gestures are async per default, so no need to set the flag
            { furhat.gesture(Gestures.Nod) }
        )
    }

    onResponse { 
        furhat.say("Thanks for your answer")
    }
}

If the endSil parameter for the onInterimResponse handler is not provided (i.e., default = 0), it will be triggered for every new interim response. No intent classification will be performed for interim responses, but you can access the recognized text through it.speech.text.
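
For instance, a minimal sketch that logs every interim result as it arrives (no endSil given, so the handler fires for each new partial result):

    onInterimResponse { // endSil defaults to 0: triggered for every new interim result
        // No intent classification here; only the raw recognized text is available
        println("Heard so far: ${it.speech.text}")
    }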

Ask with return values

Inline response handlers

Instead of adding the response handlers to the state, you can add them directly to ask:

val MyState = state {
    onEntry {
        var happy =
            furhat.ask("Are you happy?") {
                onResponse<Yes> {
                    terminate(true)
                }
                onResponse<No> {
                    terminate(false)
                }
            }
        if (happy) {
            furhat.say("You are happy")
        } else {
            furhat.say("You are not happy")
        }
    }
}

This will call a (hidden) state which has the furhat.ask() in the onEntry handler. Thus, you have to call terminate() to return from this state.

askFor

If you simply want to ask for a specific intent or entity, there is an efficient way of implementing it using askFor.

val MyState = state {
    onEntry {
        var date = furhat.askFor<Date>("Which date were you born?")
        furhat.say("You were born on $date")
    }
}

You can also add inline response handlers to askFor:

val MyState = state {
    onEntry {
        var date = furhat.askFor<Date>("Which date were you born?") {
            onResponse<DontKnow> {
                furhat.say("You should really know that!")
                reentry()
            }
        }
        furhat.say("You were born on $date")
    }
}

askYN

If you simply want to ask a yes/no question, there is an efficient way of implementing it using askYN, which returns a boolean.

val MyState = state {
    onEntry {
        var happy = furhat.askYN("Are you happy?")
        if (happy) {
            furhat.say("You are happy")
        } else {
            furhat.say("You are not happy")
        }
    }
}

You can also add inline response handlers to askYN:

val MyState = state {
    onEntry {
        var happy = furhat.askYN("Are you happy?") {
            onResponse<Maybe> {
                furhat.say("Make up your mind!")
                reentry()
            }
        }
        if (happy) {
            furhat.say("You are happy")
        } else {
            furhat.say("You are not happy")
        }
    }
}

Changing default responses

All Furhat skills come with default response handlers that handle uncaught speech, silences and errors. The reason is that you always want to catch a response in the system to avoid unexpected behavior. There may be cases where you don't want this to happen, in which case you are free to override the handlers. Another reason for doing this is if you want to change the default utterances (as listed below) or perhaps add support for other languages.

The recommendation is to do so in a superstate that all your interactive states inherit from (if you used our default template when creating your skill, this would be the Interaction state defined in general.kt). It is as simple as implementing the same handlers and not propagating the response events further.

For your information, the implicit default state looks like this:

val dialogState = state(parent = null) {
   var noinput = 0
   var nomatch = 0

   onResponse {
       nomatch++
       if (nomatch > 1)
           furhat.say("Sorry, I still didn't understand that")
       else
           furhat.say("Sorry, I didn't understand that")
       reentry()
   }

   onNoResponse {
       noinput++
       if (noinput > 1)
           furhat.say("Sorry, I still didn't hear you")
       else
           furhat.say("Sorry, I didn't hear you")
       reentry()
   }

   onResponseFailed {
       furhat.say("Sorry, my speech recognizer is not working")
       terminate()
   }
}

Improving recognition with phrases

While today's speech recognition services usually perform very well in most languages, you will sometimes notice that they struggle to pick up certain phrases. These may be brand names, odd spellings or unusual words that sound quite similar to other words. The recognition services particularly struggle with one-word utterances, for example saying "fold" in a poker game. You'll notice that "I choose to fold" is picked up much more easily.

To get around this, you can send a list of phrases (strings) to the recognizer with the method furhat.setSpeechRecPhrases(List<*>).
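
For example, to prime the recognizer with the poker terms mentioned above (the phrase list is just illustrative):

    // Prime the recognizer with short domain-specific phrases
    furhat.setSpeechRecPhrases(listOf("fold", "call", "raise"))
    furhat.listen()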

It is also possible to add these phrases directly to the NLU. EnumEntity allows you to automatically add certain phrases, by setting a flag in the constructor. In this example, all fruit words would be primed whenever this entity is being used in an active intent:

class Fruit : EnumEntity(speechRecPhrases=true) {

    override fun getEnum(lang: Language): List<String> {
        return listOf("banana", "orange", "apple", "pineapple", "pear")
    }

}

You can also let an Intent return specific words which could be hard to recognize:

class Greeting : Intent() {

    override fun getExamples(lang: Language): List<String> {
        return listOf("how are you", "how do you do", "howdy")
    }

    override fun getSpeechRecPhrases(lang: Language): List<String> {
        return listOf("howdy")
    }

}

Improving recognition with multiple hypotheses

By default, the speech recognizer only returns the top hypothesis. You can set it to return more hypotheses, in which case the intent classification will be done for all hypotheses, and the best combination of speech recognition and intent classification confidence scores will automatically be selected as the response. This way, a response that the speech recognizer did not deem the most likely, but which makes more sense in the current context, can be selected.

Note that there are potential downsides to processing multiple hypotheses: (1) it can be more processing-intensive, and (2) it might lead to more false positives, especially if many intents are active. Thus, it is typically more useful in cases where you have more constrained expectations, such as a quiz with four alternative responses.

furhat.param.recognitionAlternatives = 5
furhat.listen()