Experimenting with the Web Speech API


A few days ago, I spoke at WebTech Conference 2014 giving a presentation titled Talking and listening to web pages where I discussed the Web Speech API and what a developer can do with it to improve the user experience. This talk was inspired by two articles I wrote for SitePoint titled Introducing the Web Speech API and Talking Web Pages and the Speech Synthesis API. In this tutorial we’ll build upon the knowledge acquired and develop a demo that uses both of the interfaces defined by this API. If you need an introduction to the Web Speech API, I recommend reading the two previously mentioned articles, because this one assumes you have a good knowledge of it. Have fun!

Developing an Interactive Form

The goal of this article is to build an interactive form that our users can fill in with their voice. For the sake of this example we’ll develop a registration form, but you can apply the same concepts to any form you want. An important concept to keep in mind is that the use of voice should never be the only source of input, because no matter how accurate a speech recognizer is, it’ll never be perfect. So, the user should always be able to modify any field to fix any error the recognizer has made. In this demo we’ll provide a button that, once clicked, starts asking the user a question; the interaction then continues with the user speaking the answer. The recognizer transforms the speech into text that is placed in the text field. Once the interaction is completed, which means all the fields of our form have been filled, our application will be polite and thank the user. As a final point, remember that at the time of this writing the Web Speech API is very experimental and fully supported by Chrome only. Therefore our experiment will work in this browser only. Without further ado, let’s start building the markup of the registration form.

The HTML of the Registration Form

To keep things as easy as possible, our form will contain only three fields, but you can add as many as you need. In particular, we’ll require our user to fill in their name, surname, and nationality. If you have a basic knowledge of HTML, performing this task should be pretty easy. I suggest you try to implement it yourself before taking a look at the code below (my implementation):
<form>
   <label for="form-demo-name">Name:</label>
   <input id="form-demo-name" />
   <label for="form-demo-surname">Surname:</label>
   <input id="form-demo-surname" />
   <label for="form-demo-nationality">Nationality:</label>
   <input id="form-demo-nationality" />
   <input id="form-demo-voice" type="submit" value="Start" />
</form>
The previous code shows nothing but a classic form that can only be filled using a keyboard or a similar input device. So, we need to find a way to specify the question we want to ask for each of the fields defined in the form. A good and simple solution is to employ the data-* attributes of HTML5. In particular we’ll specify a data-question attribute for every label–input pair. I’ve decided to set the attribute on the label associated with the input, but you can easily change the demo to define the attribute on the input element. The resulting code is shown below:
<form>
   <label for="form-demo-name" data-question="What's your name?">Name:</label>
   <input id="form-demo-name" />
   <label for="form-demo-surname" data-question="What's your surname?">Surname:</label>
   <input id="form-demo-surname" />
   <label for="form-demo-nationality" data-question="What's your nationality?">Nationality:</label>
   <input id="form-demo-nationality" />
   <input id="form-demo-voice" type="submit" value="Start" />
</form>
Whether you’re surprised or not, this is all the markup we need to create our interactive form. Let’s now delve into the core of our demo by discussing the JavaScript code.

Adding the Business Logic

To develop the business logic of our form we need three ingredients: a speech synthesizer, a speech recognizer, and promises. We need the speech synthesizer to emit the audio that asks the user the question we’ve defined using the data-question attribute. The speech recognizer is used to transform the user’s response into text that will be set as the value of each field. Finally, we need promises to avoid callback hell! The Web Speech API is driven by asynchronous operations, so we need a way to synchronize them all. We need to start recognizing the user’s speech after the question has been asked, and we have to ask a new question after the user has spoken their answer and the recognizer has completed its work. Thus, we need to synchronize a variable set of consecutive (serial) asynchronous operations. We can easily solve this issue by adopting promises in our code. If you need a primer on what promises are, SitePoint has you covered with the article An Overview of JavaScript Promises. Another very good article, written by Jake Archibald, is titled JavaScript Promises: There and back again.
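To make the pattern concrete, here is a minimal sketch, with hypothetical askQuestion and listenForAnswer functions standing in for the real synthesis and recognition code, of how chained promises let us run serial asynchronous operations without nesting callbacks:
function askQuestion(question) {
   // Hypothetical asynchronous operation: pretend we're speaking the question
   return new Promise(function(resolve) {
      console.log('Asking: ' + question);
      setTimeout(resolve, 500);
   });
}

function listenForAnswer() {
   // Hypothetical asynchronous operation: pretend we're recognizing speech
   return new Promise(function(resolve) {
      setTimeout(function() {
         resolve('a simulated answer');
      }, 500);
   });
}

// Each then() waits for the previous operation to complete,
// so the question is fully asked before we start listening.
askQuestion("What's your name?")
   .then(function() {
      return listenForAnswer();
   })
   .then(function(answer) {
      console.log('The user said: ' + answer);
   })
   .catch(function(error) {
      alert(error);
   });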
Our code will be logically divided into two parts: a support library that deals with the Web Speech API and acts as the producer of the promises, and the code that consumes those promises. We’ll discuss both in the next two sections of this article.

Developing the Support Library

If you have a working knowledge of how the Web Speech API works, understanding the support library won’t be very hard. We’ll define an object literal that we’ll assign to a variable named Speech. This object has two methods: speak and recognize. The former accepts the text to speak and is responsible for emitting the audio and for creating the promise associated with this operation. The promise is resolved once the speech has been emitted (end event) or rejected if the error event is triggered. The promise is also rejected if the browser doesn’t support the API. The recognize method is used to recognize the user’s speech. It doesn’t accept any arguments, and returns the recognized text by passing it to the resolve method of the promise created. As you’ll see, recognize is slightly more complex than speak because it has to deal with more situations. The promise created by recognize is resolved when the final results are available or rejected if any error occurs. Please note that the code also takes care of an issue I discovered a few days ago on Windows 8.1 (#428873). The complete code of our support library is shown below:
var Speech = {
   speak: function(text) {
      return new Promise(function(resolve, reject) {
         if (!window.SpeechSynthesisUtterance) {
            reject('API not supported');
            return;
         }
      
         var utterance = new SpeechSynthesisUtterance(text);

         utterance.addEventListener('end', function() {
            console.log('Synthesizing completed');
            resolve();
         });

         utterance.addEventListener('error', function (event) {
            console.log('Synthesizing error');
            reject('An error has occurred while speaking: ' + event.error);
         });

         console.log('Synthesizing the text: ' + text);
         speechSynthesis.speak(utterance);
      });
   },
   recognize: function() {
      return new Promise(function(resolve, reject) {
         var SpeechRecognition = window.SpeechRecognition       ||
                                 window.webkitSpeechRecognition ||
                                 null;

         if (SpeechRecognition === null) {
            reject('API not supported');
            return;
         }

         var recognizer = new SpeechRecognition();

         recognizer.addEventListener('result', function (event) {
            console.log('Recognition completed');
            for (var i = event.resultIndex; i < event.results.length; i++) {
               if (event.results[i].isFinal) {
                  resolve(event.results[i][0].transcript);
               }
            }
         });

         recognizer.addEventListener('error', function (event) {
            console.log('Recognition error');
            reject('An error has occurred while recognizing: ' + event.error);
         });

         recognizer.addEventListener('nomatch', function (event) {
            console.log('Recognition ended because of nomatch');
            reject('Error: sorry but I could not find a match');
         });

         recognizer.addEventListener('end', function (event) {
            console.log('Recognition ended');
            // If the Promise isn't resolved or rejected at this point
            // the demo is running on Chrome and Windows 8.1 (issue #428873).
            reject('Error: sorry but I could not recognize your speech');
         });

         console.log('Recognition started');
         recognizer.start();
      });
   }
};

Putting All the Pieces Together

With our support library in place, we need to write the code that will retrieve the questions we’ve specified and interact with the support library to create the interactive form. The first thing we need to do is to retrieve all the labels of our form because we’ll use their for attribute to retrieve the inputs and the data-question attribute to ask the questions. This operation is performed by the statement below:
var fieldLabels = [].slice.call(document.querySelectorAll('label'));
Recalling how we wrote the markup, we can shorten the code necessary by keeping the label–input pairs, which means the question–answer pairs, coupled. We can do that by using a support function that we’ll call formData. Its goal is to return the new promise generated for every label–input pair. Treating every label and input in our form as a single component, instead of separate entities, allows us to reduce the code needed because we can extract more abstract code and loop over the pairs. The code of the formData function and how it’s called is shown below:
function formData(i) {
   return promise.then(function() {
              return Speech.speak(fieldLabels[i].dataset.question);
           })
           .then(function() {
              return Speech.recognize().then(function(text) {
                  document.getElementById(fieldLabels[i].getAttribute('for')).value = text;
              });
           });
}

for(var i = 0; i < fieldLabels.length; i++) {
   promise = formData(i);
}
Because we have chained the promises as shown in the formData function, we need an initial resolved promise to allow the others to start. This task is achieved by creating an immediately resolved promise before the loop of the previous snippet:
var promise = new Promise(function(resolve) {
   resolve();
});
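As a side note, the same initial promise can be created more concisely with Promise.resolve(), which returns an already resolved promise:
var promise = Promise.resolve();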
As a final touch we want to thank our users, but also catch any possible error generated by the process:
promise.then(function() {
   return Speech.speak('Thank you for filling the form!');
})
.catch(function(error) {
  alert(error);
});
At this point our code is almost complete. The final step is to place all the code of this section inside a function executed when the user clicks the button.
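A minimal sketch of that wiring, assuming the code of this section is wrapped in a hypothetical startInteraction function, could look like this:
document.getElementById('form-demo-voice').addEventListener('click', function(event) {
   // Prevent the submit button from actually submitting the form
   event.preventDefault();

   // startInteraction() is a hypothetical wrapper around the code
   // discussed in this section (building and consuming the promise chain)
   startInteraction();
});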

The Result

As you may have noted, I haven’t discussed the styles for this demo because they’re not relevant here and you’re free to write your own. As an additional note, in the demo you’ll see below I’ve also created a simple spinner to give visual feedback when the recognizer is ready to do its job. The result of the code developed is shown below, but it’s also available as a JSBin: Form demo

Conclusion

In this tutorial we’ve developed a simple yet fully functional interactive form that a user can fill in using their voice. To do that we’ve used some cutting-edge technologies such as the Web Speech API and promises. The demo should have given you an idea of what’s possible using the new JavaScript APIs and how they can improve the experience of your users. As a final note, remember that you can play with this demo in Chrome only. I hope you enjoyed this tutorial and have learned something new and interesting.

Frequently Asked Questions about Web Speech API

What is the Web Speech API and how does it work?

The Web Speech API is a web-based interface that allows websites and web applications to incorporate speech recognition and speech synthesis functionalities. It works by converting spoken language into written text (speech recognition) and vice versa (speech synthesis). This API is particularly useful in creating more accessible and interactive web experiences, such as voice-activated commands, dictation, and read-aloud features.

How can I start using the Web Speech API in my web application?

To start using the Web Speech API, you create an instance of the SpeechRecognition interface for recognition, or a SpeechSynthesisUtterance that you pass to speechSynthesis.speak() for synthesis. These objects provide methods and properties to control the recognition or synthesis process, and you can use event handlers to manage the results and errors.
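As a rough sketch, a minimal use of both interfaces in Chrome (where the recognition constructor is prefixed) might look like this:
// Speech synthesis: make the page talk
var utterance = new SpeechSynthesisUtterance('Hello!');
speechSynthesis.speak(utterance);

// Speech recognition: listen to the user (prefixed in Chrome)
var recognizer = new webkitSpeechRecognition();
recognizer.onresult = function(event) {
   console.log(event.results[0][0].transcript);
};
recognizer.start();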

What are the main differences between the Web Speech API and other speech recognition APIs?

The Web Speech API is a browser-based API, which means it doesn’t require any additional software or libraries to be installed. It’s also free to use, unlike some other APIs that may charge for usage. However, it’s worth noting that the Web Speech API’s capabilities and accuracy may vary depending on the browser and its version.

Can I use the Web Speech API for languages other than English?

Yes, the Web Speech API supports a wide range of languages. You can specify the language by setting the ‘lang’ property of the SpeechRecognition or SpeechSynthesisUtterance instance. However, the availability of certain languages may depend on the browser.
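For example, assuming the browser provides an Italian recognition service and voice, you might configure both objects like this:
var recognizer = new webkitSpeechRecognition();
recognizer.lang = 'it-IT'; // recognize Italian speech

var utterance = new SpeechSynthesisUtterance('Buongiorno!');
utterance.lang = 'it-IT';  // use an Italian voice, if available
speechSynthesis.speak(utterance);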

Why is my Web Speech API not working in certain browsers?

The Web Speech API is not supported by all browsers. As of now, it’s fully supported in Google Chrome and partially supported in other browsers like Firefox and Safari. If you’re having trouble with the API, make sure to check the browser compatibility.
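A simple way to guard against unsupported browsers is to feature-detect before using the API, for example:
if ('speechSynthesis' in window) {
   // Speech synthesis is available
}

if ('SpeechRecognition' in window || 'webkitSpeechRecognition' in window) {
   // Speech recognition is available (possibly prefixed)
}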

How can I handle errors in the Web Speech API?

The Web Speech API provides an ‘onerror’ event handler that you can use to catch and handle errors. The event object passed to this handler will contain information about the error, which can help you diagnose and fix the issue.
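For instance, a minimal error handler might look like this:
var recognizer = new webkitSpeechRecognition();

recognizer.onerror = function(event) {
   // event.error contains a code such as 'no-speech',
   // 'audio-capture', or 'not-allowed'
   console.error('Recognition error: ' + event.error);
};

recognizer.start();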

Can I use the Web Speech API in mobile browsers?

Yes, the Web Speech API is supported in many mobile browsers, including Chrome for Android and Safari on iOS. However, the functionality and performance may vary depending on the device and its operating system.

How can I improve the accuracy of speech recognition with the Web Speech API?

The accuracy of speech recognition can be influenced by several factors, including the quality of the audio input, the clarity of the speech, and the ambient noise level. You can improve the accuracy by using a high-quality microphone, speaking clearly and slowly, and minimizing background noise.

Can I use the Web Speech API for real-time speech recognition?

Yes, the Web Speech API supports real-time speech recognition. You can use the ‘continuous’ property of the SpeechRecognition instance to control whether the recognition should continue after the user stops speaking.
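As a sketch, enabling continuous mode (and interim results, which deliver partial transcripts while the user is still speaking) looks like this:
var recognizer = new webkitSpeechRecognition();
recognizer.continuous = true;     // keep listening after each result
recognizer.interimResults = true; // also deliver partial, not-yet-final results

recognizer.onresult = function(event) {
   for (var i = event.resultIndex; i < event.results.length; i++) {
      console.log(event.results[i][0].transcript, event.results[i].isFinal);
   }
};

recognizer.start();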

Is the Web Speech API secure? Can it be used to record or store user’s speech data?

The Web Speech API itself does not record or store any speech data. It only processes the audio input and provides the recognition results. However, it’s important to note that any data transmitted over the internet can potentially be intercepted, so it’s recommended to use secure connections (HTTPS) when using the API.
