In this article I talk about my personal experience putting together my first Google Assistant app. There was something strangely satisfying in the process, I suspect because in a few hours (with no previous experience) I got a simple working application going, which I then iteratively improved upon.
The project itself is a personal hobby: Bible verse recordings featuring characters and voices from an animated series I am working on. However, I believe the principles are directly relevant to merchants wishing to provide a novel, branded experience. The app allows both interactive inquiry and subscription to notifications, where all interactions are with a branded character.
In this first part I describe some of the design choices I made while designing the app. The next part walks through how those objectives were reached in code. (The code is not yet finished as of publishing this post – notifications are not completed yet – but can be found on GitHub.)
What is Google Assistant?
Google Assistant is the software that responds to users via voice or keyboard input, with output being sound, a display, or a combination depending on the device. For example, Google Assistant is built into Google Home devices, but is also available as a phone and tablet application. Further, Google Assistant is available as an SDK and so can be built into other devices, such as a range of new smart screens launching soon.
I like to think of Google Assistant sort of like Jarvis in the recent Avengers movies – Tony Stark’s AI assistant that can communicate through any locally available communication means.
Third-party developers can create Google Assistant apps that include training phrases that map to “intents” (for example “tell me a bible verse”, “another verse please”, “inspire me”) and a web service that is given a parsed intent (as JSON) and returns the response for the user (also encoded as JSON). Such applications can be launched using “Hey Google, talk to extra ordinary bible verses” (the name of my app – not yet published publicly).
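To make the intent-to-webhook flow concrete, here is a minimal sketch of a fulfillment handler using plain objects (no framework). The field names follow the shape of the Dialogflow v2 webhook format but are simplified, and the intent name and verse text are illustrative, not from my actual app:

```javascript
// Minimal sketch of a webhook handler: Assistant parses the user's
// phrase into an intent name, sends it as JSON, and our service
// returns the response text as JSON.
function handleWebhookRequest(request) {
  const intent = request.queryResult.intent.displayName;
  if (intent === 'tell-verse') {
    // In the real app this would select a recorded verse clip.
    return { fulfillmentText: 'Philippians 4:13.' };
  }
  return { fulfillmentText: "Sorry, I didn't catch that." };
}

// Simplified example of the JSON the Assistant sends after parsing
// "tell me a bible verse" into the 'tell-verse' intent.
const exampleRequest = { queryResult: { intent: { displayName: 'tell-verse' } } };
console.log(handleWebhookRequest(exampleRequest).fulfillmentText);
```

The real payloads carry much more (parameters, contexts, surface capabilities), but this request/response pair is the core of the contract.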
A first consideration is whether a dedicated app is needed at all. For example, if you only wish to sell products via a voice interface, it is worth checking out Google Merchant Center and Google Shopping Actions. As a merchant, these tools allow you to upload information about the products you have for sale, where the Assistant-based browse and purchase flows are all provided by Google. Currently the experience is reserved for selected merchants, but may be expanded over time. This post focuses on building a new app, not integrating with existing experiences.
The following are a few potential commerce related experiences for customers:
- Customer service – providing an app where a user can track an order or get notified when the product is delivered.
- Shopping experience apps – browse, purchase, and checkout flows embedded in the app. For example, after registering your refrigerator you may be given an app to order new water filters for the model of fridge the system knows you purchased. “Hey Google, talk to Whirlpool and order a new water filter”.
- Product instruction guides – may allow a user to request different types of information about products or how to use them, including links off to YouTube instructional videos.
- Notification subscriptions – provide a tips and tricks newsletter subscription service, where new events (such as new products being available) can be sent to customers as a push notification (making sure you conform to GDPR requirements of course!).
There are many types of applications that could be built. For example, I am building an app to reinforce my animated series of episodes (each episode revolves around principles from a bible verse). The app provides another way to engage with end users that reinforces the brand message behind the series by using the same characters and voices in both the series and the app.
For an app to be used more than once it has to deliver some useful functionality. But to increase engagement, the app should also be consistent with your brand message. In my case:
- I used the same voices as in the series, since voice and speech are the main interaction vehicle.
- I did try to create distinctive synthesized voices, but ended up falling back to voice recordings to be more engaging. It is still hard to convey deep emotion through synthesized voices, although I expect that to improve over time.
- It should be noted that one disadvantage of recorded voice clips is that a recording needs to be made per supported language if your app is international. A synthesized voice makes it much easier to support multiple languages.
- Another advantage of synthesized voices is that they can include dynamic content, which recordings cannot. A synthesized voice can say how many items are in a cart or read back the contents of knowledge base articles or web content. I designed my app to work with fixed prerecorded audio clips.
- Images can be added to experiences with a physical screen (e.g. a phone or tablet screen, or one of the upcoming “smart screen” devices due to hit the market soon). Code in the app adjusts the experience based on what “surfaces” are available. Note that even if a speaker is available on a device, it may be muted so a good app will output responses in multiple formats so the device can choose between them.
Using recorded voice clips and screenshots of the different characters from my animated series turned what could have been a fairly dry interaction with a list of bible verses into a much more engaging experience.
Once you have decided how you want to brand your app and experience, the next step is to design the conversation flow. You might start this design by writing down anticipated conversations between yourself and the app you are planning to build. Work out the sorts of requests you want to support and how someone might ask them.
One aha moment for me was realizing that a conversation is not a long flow chart with a single path through it. In real life a user can change their mind about what they want to talk about at any time. So while you think about the conversation flows you want to support, writing an app is more like working out all the possible requests that can occur and responding to them whenever they come up. The app can remember context about previous requests to help disambiguate future requests (for example, use state variables to remember the last bible verse played so you can support requests like “repeat last bible verse”), but your code is written to assume all commands (intents) can occur at any time.
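The “repeat last bible verse” idea above can be sketched with a small amount of session state. This is a simplified illustration, not the real app's code – the intent names and the in-memory session store are placeholders (a production webhook would keep state per conversation, e.g. in Dialogflow contexts or user storage):

```javascript
// Sketch: remember the last verse played in session state so a
// "repeat" intent works no matter when it occurs in the conversation.
const session = {}; // placeholder for per-conversation state

function handleIntent(intentName, verseText) {
  if (intentName === 'play-verse') {
    session.lastVerse = verseText; // remember for later requests
    return verseText;
  }
  if (intentName === 'repeat-last-verse') {
    return session.lastVerse || 'I have not played a verse yet.';
  }
  return 'Sorry, I did not understand.';
}
```

The key point is that `repeat-last-verse` is handled as an independent intent that can arrive at any time, rather than as a branch in a fixed flow chart.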
When designing responses, remember that most apps are not used very often. Make suggestions of things users can request as part of your responses. For example, “The current time is 3pm. You can also ask me about the weather.” Do not list all of the available commands in every response (which would quickly become tedious). Instead, consider including one suggestion per response, with state variables keeping track of which commands the user has already been told, or skipping commands the user has already used (having used a command, they clearly know it exists). This is where most of the complexity of an app can end up – trying to create a more natural experience for the user. (My sample app does some of this, but could do more.)
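The one-suggestion-per-response idea can be sketched as follows. The hint phrases and the `state` object are illustrative; in practice the told-hints list would live in conversation or user storage:

```javascript
// Sketch: append one not-yet-mentioned hint to each response,
// tracking which hints the user has already heard.
const allHints = [
  'You can ask me for another verse.',
  'You can ask me to play an episode.',
  'You can subscribe to verse notifications.',
];

function addHint(responseText, state) {
  state.toldHints = state.toldHints || [];
  const next = allHints.find(h => !state.toldHints.includes(h));
  if (!next) return responseText; // every hint already given
  state.toldHints.push(next);
  return responseText + ' ' + next;
}

const state = {};
console.log(addHint('Here is your verse.', state));
```

A further refinement, as described above, is to drop hints for commands the user has already issued.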
It is also nice to include a bit of randomness in responses. Rather than having a single response for a command, include a few variations and play one at random to make it feel a bit more natural. For example, a request for the current time might return “The time is now 4:45pm” or “It’s 4:45pm”, and so on. Such responses should be phrased consistently with the “feel” you want your app to have. It should have a personality and express that consistently.
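Varying responses takes only a few lines. The phrasings here are illustrative:

```javascript
// Sketch: pick one of several phrasings at random so repeated
// commands feel less robotic.
function randomResponse(variants) {
  return variants[Math.floor(Math.random() * variants.length)];
}

// Example variants for a "current time" command.
const timeVariants = t => [
  `The time is now ${t}.`,
  `It's ${t}.`,
  `Right now it's ${t}.`,
];
console.log(randomResponse(timeVariants('4:45pm')));
```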
In my case, the main character in the app is the main character in the series, Sam. Sam has certain phrases he uses in the series (he uses the word “super” a lot), so in responses he does the same thing, adding to the brand feel of the application.
But don’t undervalue the usefulness of putting a demo app together quickly and then asking other people to try it. Getting my two boys to use an early version of my app quickly showed that I needed to provide more guidance on what commands they could try. Without guidance they got stuck quickly, not knowing what to ask, as they had no idea what the app could and could not do.
Dialogflow is one of the tools you can use to map user input to “intents”: for each intent you provide multiple example phrases expressing the same objective, which train the natural language parser. You can also provide vocabularies of entities, like color names. In my app, I introduced episode titles as entities so you can say “play episode one” or “play the Friendship episode” (where “Friendship” is the episode title).
In addition to intents, various system events can occur, such as “start the app” and “start a new session carrying over a session from another device”. Moving sessions between devices can be useful if the interaction requires a particular capability not on the current device. For example, a Google Home does not have a screen so cannot play video clips. If the user requests to play a video, an app can request to be moved to another device (such as the user’s phone) where the video can be displayed.
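Deciding whether to request a transfer comes down to comparing the capabilities of the current device against those available elsewhere. The sketch below uses the Actions on Google capability naming convention, but the request shape is simplified for illustration:

```javascript
// Sketch: does the current device lack a screen while another of the
// user's devices has one? If so, the app can ask Assistant to move
// the session there (e.g. to play a video clip).
function needsScreenTransfer(request) {
  const here = request.surface.capabilities.map(c => c.name);
  const anywhere = request.availableSurfaces
    .flatMap(s => s.capabilities.map(c => c.name));
  const SCREEN = 'actions.capability.SCREEN_OUTPUT';
  return !here.includes(SCREEN) && anywhere.includes(SCREEN);
}
```

When this returns true, the app would respond with a "new surface" request rather than trying to play the video on the audio-only device.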
Another aspect is dealing with errors. Rather than responding “I do not understand” every time, try responding in different ways with different suggestions for the user to try, but ultimately exit the app if you cannot recover.
The final concept I wanted to cover in this post was surfaces and their design implications. Google Assistant works on a range of devices with different capabilities. Google Home is voice and audio only. The assistant app on your phone or tablet has audio, screen, and keyboard capabilities (but could be muted). As more devices are released, there will be additional combinations of capabilities. A good application is able to work across a range of such devices by examining the capabilities of the device and adapting appropriately.
For example, when there is no screen available (only voice/audio), any suggestions to the user must be made audibly. However, if a screen is available, the app can return up to 8 “suggestions” which are displayed at the bottom of the screen in ovals. Users can tap these as shortcuts.
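Attaching suggestion chips only when a screen exists looks roughly like this. The field names approximate the Actions on Google rich response format and are simplified for illustration:

```javascript
// Sketch: build a response, adding suggestion chips (max 8) only
// when the device reports a screen.
function withSuggestions(speech, suggestions, hasScreen) {
  const response = {
    richResponse: {
      items: [{ simpleResponse: { textToSpeech: speech } }],
    },
  };
  if (hasScreen) {
    // The platform caps suggestion chips at 8, so trim defensively.
    response.richResponse.suggestions =
      suggestions.slice(0, 8).map(title => ({ title }));
  }
  return response;
}
```

On a voice-only device, the same hints would instead be folded into the spoken response text.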
One interesting issue was that all responses must include at least one simple response containing text (which may have text-to-speech synthesis applied). Using special <audio> SSML markup you can tell Assistant to play a supplied MP3 file instead of a synthesized voice. In this case the text may still be displayed (it is an error if there is no text to display). However, a card can also be displayed containing text and an image.
What I found was that on my iPhone both the text and the card were displayed, resulting in duplicated visible content. The Smart Screen simulator, on the other hand, only displayed the card. I could not see how to get the card to play an audio clip. So my final solution when a screen was available was to generate a short text phrase like “I’m on it!” paired with an audio clip of the full response text, and put the full response text on the card for viewing. If only the card was displayed, the user saw all the text; if both the text and the card were displayed, they did not see the text twice. I need to do some more research here, but it seems to work.
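The “short phrase plus recorded audio” approach can be sketched as follows. The SSML `<audio>` element with fallback text is real markup, but the response field names here are simplified and the MP3 URL is a placeholder:

```javascript
// Sketch: wrap a recorded MP3 clip in SSML so Assistant plays the
// recording; the inner text is the fallback, and displayText is the
// short phrase shown on screen alongside the card.
function audioResponse(mp3Url, shortPhrase, fullText) {
  return {
    simpleResponse: {
      ssml: `<speak><audio src="${mp3Url}">${fullText}</audio></speak>`,
      displayText: shortPhrase, // e.g. "I'm on it!"
    },
  };
}

const r = audioResponse(
  'https://example.com/verse.mp3', // placeholder URL
  "I'm on it!",
  'Philippians 4:13.');
console.log(r.simpleResponse.ssml);
```

The full response text would then also go on the card, so screens that show only the card still display everything.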
In addition, if the device supports a web browser, my app displays buttons with links to YouTube. If the device does not support a web browser (e.g. a Google Home device with no screen), the app requests Assistant to look for a device with a web browser to transfer the session to.
In this blog post I introduced some of the design concepts I went through thinking about the experience I wanted to deliver. In the next part I will dig into the code used to turn these concepts into a working app.