My First Google Assistant App (Part 2)

Branded experiences create interesting marketing opportunities with Google Assistant applications. Part 1 discussed some of the design considerations. This post dives into the code to create a simple (but working) experience without database access, developed completely from the built-in console (no local laptop development environment required). The source code can be found in my GitHub repo, in the blopost-1 branch.

Please note, this is not an official Google code sample. I am just sharing my personal experiences as I learn, using a personal hobby project as a sample.

I created this application to demonstrate branded experiences, built to support a series of videos on YouTube I am slowly producing (cartoons for kids backed by principles from bible verses – see https://extra-ordinary.tv/ for more details). To add a bit of richness, I decided to create an app that could both navigate users to episodes and play random bible verses. To increase engagement, it will also notify users when new content is available. The verses are presented by characters from the series as a form of brand reinforcement. For commerce applications, the equivalent might be notifications of new blog posts or new product offerings – e.g. the cake of the week from your local cake store.

Git Repo Structure

The code in the git repository consists of three groups of files:

  • Media files: All the images (.png) and audio clips (.mp3) my application needed. (Video files I uploaded to YouTube.)
  • Dialogflow configuration: The “extra-ordinary-phase-1.zip” file contains a ZIP export of all the Dialogflow configuration files. Dialogflow manages the heavy lifting of parsing user input and converting it into simple API calls, then taking the API responses and presenting them back to users. The configuration files are all text files, so you can read them with a normal text editor. The ZIP file can be imported directly back into the Dialogflow console. So far I modify the Dialogflow configuration inside the console, using the exported ZIP file only as a backup.
  • Application code: There is a “package.json” file listing packages needed (I just used all the defaults provided) and “index.js” which is where the Node.js application logic goes, the focus of this blog post.

Consoles

I am not going to describe every step required to get the sample application up and going. If you get stuck, I recommend going through some of the Actions on Google tutorials (there are links from the console). My purpose here is to give a feel for how much can be built with relatively little code when creating a branded Google Assistant app experience. I introduce the main points you need to know, but skip some details.

To build a similar application yourself you will need to create a Google Cloud account. When I created my account I used a free trial period and got the application completely working without paying a cent.

Media Hosting

I uploaded all the media files my app needed to Google Storage and enabled them for public access. (I have some other content on YouTube that I also link to.) I did this via the Google Cloud console. Click on “Storage” in the side menu tab.

I created a bucket called “extra-ordinary-assistant-assets” to hold my media. I then created a directory tree of “v1/media” and uploaded all the media files to this folder. I enabled the “Share publicly” option for all the files. There are APIs to manage this storage, but I just shared the files by hand. (I had set myself the personal challenge of building a complete working application using only consoles.)

I then copied all the public links (I right clicked on each “Public link” and selected “Copy link address”) into my application code. Because I don’t have much media (this is only a simple application), I dropped the links directly into the source code instead of using a database.

Dialogflow

Navigating to the Actions on Google console you can create your new Google Assistant application project. The screenshot below shows my existing project.

Clicking on Add/import takes you through the flow to create your own application. You can nominate the area your application falls in – I picked “Kids & Family” for my application.

There is a “Skip” link as well if you want to do this later.

Next you will arrive at the main console.

Clicking on “Decide how your Action is invoked” will prompt you for information like your application title (I used “Extra ordinary bible verses”) and some other related information. Click “Save” to save your information, then “Overview” in the side menu to go back to the top level menu. It should now show the quick setup is done, and offer you “Build your Action” choices.

Click “Add Action(s)” and select “Custom intent”.

Clicking “Build” will take you through a few account selections and permissions flows and drop you into the Dialogflow console. This is where I spent most of my time building this app.

For my application, most of my effort was spent creating intents, defining entities, and writing fulfillment code (the Node.js code implementing the application logic).

  • Intents map user input (different things they say or type) to unique IDs (the intent name).
  • Entities are things like “color”, or in my case “episode name”, which are enumerated values that can be in user input (“Play the Friendship episode”, “Show me episode 2”.)
  • Fulfillment is the backend application logic to form responses to intent requests.

Note: You can respond to simple intents directly in the Dialogflow configuration. I decided not to do this and instead respond to all intents in my application logic, so I had more control over the “branding” of responses.

Here is the list of intents I created for my application.

Here is the “Play Random Bible Verse” intent definition.

As you can see, I added a range of different inputs (“training phrases”) a person may say which would all trigger the intent. The more variations you add, the more likely users will successfully use your application. Note that these phrases are training phrases – Dialogflow will translate other similar requests to your intents for you. So you should focus on different ways to say something but do not need to enumerate every possible way to express the request.

Note that some inputs may feel unusual on first invocation – a human would not say “another verse” on first entry. But there is no harm in putting them all in the same intent, and it is natural for users to ask for a first verse then say “another verse” afterwards. Watching what real users do can help you pick what phrases you want to accept.

The above intent did not have any parameters. The “Play Episode” intent does (the episode number).

The “Action and parameters” section defines what parameters the intent accepts and requires – e.g. the episode name is required for this intent. In the training phrases, Dialogflow automatically spotted my use of an entity value (“one”): once I had the “Episode” parameter defined, I just typed in phrases like “play one” and it recognized “one” as an episode name.

To define the episode names, define an entity. Here is the entity for the episode titles.

Each row represents an episode. I enabled synonym support so users can say the episode name or number, which is mapped back to the first column in the table above (the episode number).

As well as intents, there are also “events” – system generated things your application can respond to. For example, if a request to play an episode video comes in but the user is on a device that cannot play the video, the user can be asked if they wish to jump over to a device (e.g. their phone) that can play it (if one exists). In this case, the app will be invoked with an “actions_intent_NEW_SURFACE” event, with the current session data (e.g. the requested episode to be played) copied over from the old device. The following example shows an event name specified but no training phrases, as voice input should never trigger this intent.

Scrolling down to the bottom of the intent page shows how to link up the intent to backend application logic (called “fulfillment” in Dialogflow).

Selecting “Enable webhook call for this intent” tells Dialogflow that the request should be sent to a webhook in the cloud.

For the purposes of this demo, I used a Google Cloud Function to implement the fulfillment logic, writing simple Node.js code directly in Dialogflow to process requests. You can of course use desktop command line tools instead.

To define the backend code, click on the “Fulfillment” side menu and enable the “Inline Editor”. (The first choice, “Webhook”, is for when you want to host the code yourself.)

The rest of this blog post talks about the code inside “index.js”. (You won’t need to change “package.json”.)

Note: You cannot add other files using the inline editor (at this stage), so my “index.js” file got rather long. The inline editor is a simple editor to get up and going. More advanced applications will require you to build and host your own webhook. You can still use Google Cloud Functions to implement the webhook, just not the inline editor.

Fulfillment Logic Structure

The complete “index.js” fulfillment source code can be found here. At the top it loads up some required external libraries and data types. It then defines data structures to hold references to all media assets (yeah, this could have been structured a bit more nicely I know!). Finally the application logic is defined starting from the line “const app = dialogflow({debug: true});”.
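For orientation, the overall shape of the file is roughly the following sketch, based on the standard inline editor template (the real file contains much more, and some details may differ):

'use strict';

// Libraries and data types provided by the Actions on Google client library and Firebase.
const {
  dialogflow,
  BasicCard,
  Button,
  Image,
  Suggestions,
  NewSurface,
} = require('actions-on-google');
const functions = require('firebase-functions');

// ... constants holding the public URLs of all the media assets go here ...

// ... helper functions such as pickRandomMessage() and respond() go here ...

const app = dialogflow({debug: true});

// ... app.middleware() and app.intent() registrations go here ...

// Export the app so the inline editor deploys it as a Cloud Function.
exports.dialogflowFirebaseFulfillment = functions.https.onRequest(app);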

The function “pickRandomMessage()” is a helper function that, given an array, picks a random entry from it. This is useful to create a bit of variability in responses, making them feel more natural. In the media section I included textual and audio versions of messages in arrays of alternatives to pass into this function.
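A minimal sketch of such a helper (the version in “index.js” may differ slightly) looks like this:

// Given an array of alternatives, return one entry chosen uniformly at random.
function pickRandomMessage(messages) {
  const index = Math.floor(Math.random() * messages.length);
  return messages[index];
}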

The function “respond()” is a helper function to form a response – we will come back to this in a moment. Many of the intents followed a common pattern for forming a response, so I pulled this code into a separate function.

The “app.middleware()” call allows you to initialize configuration per request based on the conversation data – in my example I check what surfaces the app has available (screen, web browser, etc.) and save the results away for easy access later. That makes it a bit easier for all the different intents to check.
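As a sketch, the capability check looks something like the following (conv.hasScreen matches the property used later in the code; conv.hasWebBrowser is illustrative):

// Runs before every intent handler: cache which capabilities the current surface has.
app.middleware((conv) => {
  conv.hasScreen =
    conv.surface.capabilities.has('actions.capability.SCREEN_OUTPUT');
  // Illustrative extra check; useful when deciding whether a video button can work.
  conv.hasWebBrowser =
    conv.surface.capabilities.has('actions.capability.WEB_BROWSER');
});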

The “app.intent()” calls are important. They link intent names (like “Play episode”) to code. Each intent defined in the Dialogflow console with “webhook” enabled as the fulfillment approach must register the code to be invoked for that intent name.

Let’s now dig into a few intent implementations.

Welcome Intent

Here is the code for the welcome intent, which is triggered by a “Welcome” event (not user voice input). The event is defined as follows.

The fulfillment code is then as follows.

app.intent('Default Welcome Intent', (conv) => {
  console.log("Default Welcome Intent");
  conv.data.fallbackCount = 0;
  conv.ask(new Suggestions(['Say a verse', 'Play episode 1', 'List episodes']));
  const msg = pickRandomMessage(samWelcomeMessages);
  respond(conv, msg.text, msg.mp3Url, "Welcome!", samWelcomeImageUrl, "Welcoming Sam");
});

The “conv” argument is a handle to the conversation data. Any properties set in “conv.data” (“fallbackCount” in the above code snippet) will be preserved between calls. The above code is initializing a counter variable so after repeated failures the session will exit.

The “conv.ask()” function is how you build up the response to return to the user (and ask them to continue their side of the conversation). It accepts a range of different data types. The “Suggestions” class takes an array of strings that are displayed as buttons at the bottom of the page to guide the user of suggested ways to continue the conversation.

Suggestions do not have to list all possible commands – the goal is to help a user who may be stuck on what to try next. On a touch screen, the suggestions can also be tapped, speeding up the user’s session. The application currently does not implement subscriptions fully – when supported, it would make sense to show “Subscribe” as a suggestion only when the user is unsubscribed and “Unsubscribe” only when the user is subscribed.
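If subscription state were tracked (for example in conv.user.storage, which the library persists across conversations), a sketch of that conditional suggestion inside an intent handler might look like this (the “subscribed” flag is purely hypothetical):

// Hypothetical: tailor the suggestion chips to the user's subscription state.
const suggestions = ['Say a verse', 'Play episode 1', 'List episodes'];
suggestions.push(conv.user.storage.subscribed ? 'Unsubscribe' : 'Subscribe');
conv.ask(new Suggestions(suggestions));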

When the user starts up the application, as well as suggestions the user is shown one of several messages and an image. Each message has the plain text of the message and a URL to an MP3 file holding an audio clip of the same message. There is also a URL to an image to display (this could have been included with the text and audio clip URL, but I decided it was better to keep the image the same no matter which text was expressed).

const samWelcomeMessages = [
  {
    text: "Hi, I'm Sam! It's super cool to have you here! How can I help?",
    mp3Url: "https://storage.googleapis.com/extra-ordinary-assistant-assets/v1/media/Sam-Welcome-1.mp3",
  },
  {
    text: "I'ts Super Sam time!",
    mp3Url: "https://storage.googleapis.com/extra-ordinary-assistant-assets/v1/media/Sam-Welcome-2.mp3",
  },
];

const samWelcomeImageUrl = "https://storage.googleapis.com/extra-ordinary-assistant-assets/v1/media/Sam-Welcome.png";

In hindsight, the above welcome messages are not ideal. I will probably extend them so that the welcome message includes some hints of commands to try to get users going.

Let’s now look into the “respond()” method. The arguments are:

  • The conversation API handle.
  • The text of the message to tell the user.
  • The MP3 recording of the text.
  • The title to display at the top of a card (if supported).
  • The image to display on the card.
  • The “alt text” for the image.

The code is as follows:

function respond(conv, longText, mp3Url, title, imageUrl, imageAlt) {
  if (conv.hasScreen) {
    const shortText = pickRandomMessage([
       "Here you go!", "Coming right up!", "I'm on it!",
       "Super fast is super cool!"]);
    conv.ask(`<speak><audio src="${mp3Url}">${shortText}</audio></speak>`);
    conv.ask(new BasicCard({
      title: title,
      image: new Image({
        url: imageUrl,
        alt: imageAlt,
      }),
      text: longText,
    }));
  } else {
    conv.ask(`<speak><audio src="${mp3Url}">${longText}</audio></speak>`);
  }
}

The code changes its behavior based on whether a screen is available. If there is no screen, it drops down to the bottom “else” clause that generates SSML (Speech Synthesis Markup Language) to speak the text. The <audio> tag tells the speech library that synthesis is not required because there is already an MP3 file for the text (recorded in the character’s voice). The text must always be supplied – omitting the text returns an error, and the text is used as a fallback for text-to-speech synthesis. (It is also displayed in various logs.)

If a screen is available, the first “conv.ask()” does a bit of a trick. It plays the full audio file of the longer text, but it displays some short text (“I’m on it!” etc). This satisfies the need for some text to be included. Some devices with a screen will display this text as well as the card (e.g. your phone). Because the card is displayed I decided to not repeat the same text as what is displayed on the card. (It probably is more correct to display the same text twice so if the MP3 file cannot be loaded, it can fall back to text-to-speech synthesis.) Finally a “BasicCard” is displayed with the given title, image, and the full text of the message.

The “Default Fallback Intent” is only slightly more complicated. It is called when the user input is not recognized. If there are repeated problems the app will finally close the conversation, exiting the application.
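A sketch of that pattern (the real handler also picks a branded message, and the exact wording and threshold may differ) is:

app.intent('Default Fallback Intent', (conv) => {
  conv.data.fallbackCount++;
  if (conv.data.fallbackCount > 2) {
    // Too many misunderstandings in a row: end the conversation politely.
    conv.close("Sorry, I'm having trouble understanding. Let's try again another time!");
  } else {
    conv.ask("Sorry, I didn't catch that. You can say things like 'say a verse' or 'play episode 1'.");
  }
});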

Play Random Bible Verse Intent

The “Play Random Bible Verse” intent does not use the “respond()” method because the spoken dialog consists of Sam introducing who says the real bible verse (“Here is a bible verse from Deb”), followed by the recording of the verse itself. Instead of “I’m on it”, the text of the introduction is displayed; the verse text is not displayed as part of the spoken response – it is only shown on the card. The card also leaves off the introductory text on smart screens as it is not important – cards should display a summary of the most important information.
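An illustrative sketch of that intent (the data structures such as bibleVerses and its fields are assumptions about the real code) is:

app.intent('Play Random Bible Verse', (conv) => {
  const verse = pickRandomMessage(bibleVerses);  // Assumed array of verse objects.
  const intro = `Here is a bible verse from ${verse.character}`;
  // Sam's spoken introduction (with the intro as the displayed fallback text),
  // followed by the recorded verse itself.
  conv.ask(`<speak>` +
    `<audio src="${verse.introMp3Url}">${intro}</audio>` +
    `<audio src="${verse.mp3Url}"/>` +
    `</speak>`);
  if (conv.hasScreen) {
    // The verse text itself is only shown on the card.
    conv.ask(new BasicCard({
      title: verse.reference,
      image: new Image({url: verse.imageUrl, alt: verse.reference}),
      text: verse.text,
    }));
  }
});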

This is where testing your application on a range of devices is important. The above makes some assumptions that may not actually be appropriate – in particular, that no device will be unable to play MP3 audio files.

Play Episode Intent

The “Play Episode” intent is worth digging into for two reasons: it shows an example of getting a parameter from the input and it shows how to redirect the user to another device if the current device does not have a screen.

To get arguments, an additional parameter must be defined (“params” in this example code snippet).

app.intent('Play Episode', (conv, params) => {
  let episode = params.episode;
  console.log(`Play Episode ${episode}`);

Even though the episode was marked as “required” in the intent definition in Dialogflow, you should check for null values and guide the user on how to submit a correct request.
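A minimal guard along these lines (illustrative only; “episodes” is an assumed lookup table of episode data) is:

  // Guard against a missing or unknown episode value before doing anything else.
  if (!episode || !episodes[episode]) {
    conv.ask("Sorry, I don't know that episode. Try saying something like 'play episode one'.");
    return;
  }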

If the device has a screen, the code displays a card as per normal, but adds a button to click on to display the video. This code may be slightly incorrect as it is currently checking whether there is a screen (rather than a web browser); it is possible for a device to have a screen but no web browser. I suspect the “button” of a card is only displayed if a web browser is available – I have not yet found a device with a screen but no web browser to test this on.

if (conv.hasScreen) {
  conv.ask(`<speak><audio src="${samHereIsYourEpisodeMessage.mp3Url}">${samHereIsYourEpisodeMessage.text}</audio></speak>`);
  conv.ask(new BasicCard({
    title: title,
    image: new Image({
      url: imageUrl,
      alt: `Episode ${episode}`,
    }),
    text: summary,
    buttons: new Button({
      title: 'Play Video',
      url: videoUrl,
    })
  }));
} else {

If there is no screen available, the code asks the Assistant if the logged-in user has another registered device that does have a web browser available. If so, a “NewSurface” request is created which will ask the user if they wish to transfer over to the other device. The context is displayed on the current device as the reason for the transfer. The notification is displayed on the new device as an alert. The capabilities supplied are the required capabilities of the new device. Before the function returns, the requested episode number is saved away in session storage, which will be transferred over to the new device.
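As a hedged sketch of what that branch looks like (the message strings and the conv.data field name are illustrative):

  // Does the user have another registered device with a web browser?
  const browserAvailable =
    conv.available.surfaces.capabilities.has('actions.capability.WEB_BROWSER');
  if (browserAvailable) {
    conv.data.requestedEpisode = episode;  // Session data travels to the new device.
    conv.ask(new NewSurface({
      context: `To watch episode ${episode} you need a device with a web browser.`,
      notification: 'Watch the episode',
      capabilities: ['actions.capability.WEB_BROWSER'],
    }));
  } else {
    conv.ask("Sorry, none of your devices can play the episode video. Is there anything else I can help with?");
  }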

The second half of the equation for transferring across devices (if the user agrees to the transfer) is the “Get New Surface” intent code which is triggered by an event (not user input). The code plays it safe checking that the episode number was specified and that the transfer was not rejected by the user, then displays a card with the “Play Video” button on it. There is less introductory text as the user already requested the video to be displayed – they just want to get on and now view the episode video. It is an error to have zero text however, hence it always displays “Here is your episode”.
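An illustrative sketch of that handler (the intent name, field names, and the episodeVideoUrls lookup are assumptions; see the repo for the real code):

app.intent('Get New Surface', (conv) => {
  const newSurface = conv.arguments.get('NEW_SURFACE');
  const episode = conv.data.requestedEpisode;
  if (!episode || !newSurface || newSurface.status !== 'OK') {
    // Either no episode was requested or the user declined the transfer.
    conv.ask('No problem. What would you like to do next?');
    return;
  }
  conv.ask('Here is your episode.');
  conv.ask(new BasicCard({
    title: `Episode ${episode}`,
    text: 'Tap the button to watch the video.',
    buttons: new Button({title: 'Play Video', url: episodeVideoUrls[episode]}),
  }));
});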

Testing your App

Within the Actions on Google console and in the Dialogflow Console, there are links to the Assistant simulator to try out your application as you develop it. The simulator can act as a phone, Google Home, or a smart screen allowing you to try your application on different devices (see the three icons under “Surface”).

The simulator is a quick way to test the basics of your application, but I would still recommend testing on real devices as well. If you have any devices logged in with your user id you can access the application immediately. When you connect to the application on the real device it will inform you it is connecting to a test version of the application.

You can add other accounts to your project as well via the Actions on Google console. Click on the “settings” cog and select “Permissions”.

Clicking “Add” on the next page allows you to add email addresses of other users to grant access to your code. Grant these users the “Project” / “Viewer” role to try out your application.

There is also support for alpha and beta testing the application after you submit it for formal review for release in the Google Assistant application directory.

Conclusions

The supplied code is a complete working application for Google Assistant and delivers useful functionality. It shows how special purpose applications can be built relatively quickly. To create a branded experience, each time text is to be displayed, an MP3 file of the character’s voice is provided. Whenever a screen is available, an image of the character is also displayed.

As can be seen, the application logic is not particularly complex – it is more a matter of working your way through all the various error conditions that can occur and responding accordingly in each instance.

In the future I am hoping the assistant API will also allow short video clips to be displayed inside the experience so I can lip sync the voice track with a video, but that is for another day.

So what is next? I want to add Firestore support to save user subscriptions in a database. When new episodes or bible verses become available, subscribed users can then be sent notifications of the new content. This would be throttled to at most one notification a week, so as the collection of verses builds up, new users would get a weekly notification.

2 comments

  1. How did you create your photo and audio store? I have my store in Firebase with MP3 audio and JPG images. I copied the download links and tried them in your code; the image is displayed, but the audio does not play. Do I have to do something more with the storage so that your code can read it from my store? Thank you in advance for your answer.

    1. They just need to be online somewhere you can get a URL for. If you jump to the “master” branch in the GitHub repo (the article links to an old version so the code would not change) and look in firebase.json you will see what I am doing now – there is a “hosting” section where the “media” directory is copied up as static files to firebase hosting. But they can be anywhere as long as you can get a URL. (First version I just uploaded them by hand via the console!). I still hope to have a part 3 post one day – just working through some issues still.
