An API for Siri

Nevan King
Published in New Notes
13 min read · Mar 10, 2016

Since the introduction of Siri as part of iOS, developers have wished that people could use Siri to control their apps. Rather than pecking at buttons and keyboards on a screen, we want our users to be able to use apps quickly and fluidly. Compare the ease of entering an appointment with a voice command to the multi-step, fidgety way it works with touch-screen input.

It seems now that many of the pieces required for Siri to control apps are falling into place. I’m going to give a possible roadmap that Apple might take to allow Siri to interact with any app.

What is Siri

The easy answer is that Siri is a virtual assistant which performs actions for you. But a better way to look at Siri is as an evolution of the user interface. Each of the big jumps in how we view and use computers has been driven by user interface evolutions, from punch cards to keyboard input and from point and click to direct touch-based input. Siri is the next evolution in UI and the dream is that our interactions with a computer will feel more like interactions with another person.

Each of the evolutions in UI has been accompanied by an increase in the “fuzziness” of input. Compared to punch-card-based computers, command line-based systems allowed people to do tasks in more varied ways, from using different programming languages to not having to worry about the order of arguments in a command. Point-and-click-based interfaces gave us a more physical experience and were more forgiving of mistakes (easier undo, trash instead of delete, large click and drag targets, the choice to use menus, clicks or keyboard shortcuts). So with each UI evolution, the way we make a computer do things has become less exact and esoteric and more tolerant. A large part of Siri’s “magic” is that it takes the extreme fuzziness of spoken language and refines it down to an exact command. Siri allows you to express the desire to be woken up at 7 in the morning in a number of ways, so you don’t need to memorise exact commands or steps.

How Siri Works

There are four broad steps in how Siri transforms a spoken command into a computer command. The first is to turn the sounds we make into words. This overcomes factors like languages, accents and background noise to transform somebody saying the words “Open Facebook” into the text string “Open Facebook”. To do this your speech is digitised and sent to the Apple servers which “listen” and transform audio to phonemes and then to words. This is the same as speech-to-text dictation and while it was impressive 20 years ago, today no-one thinks twice about it.

The second part is to extract meaning from the text. This is called language parsing. Siri reads the words “Open Facebook” and realises that there is an app named “Facebook” which needs to have the action “Launch” performed on it. This is what we think of as the magic of Siri. There are multiple ways to give this command (“Start Facebook”, “Launch Facebook”) but all have the same meaning.

The third part is to transform those actions into a standard format that the system can understand. You can’t tell iOS to “Launch Facebook”; that has to be transformed into a language the system understands. In the same way, you can’t simply ask Yahoo Weather for the current temperature; that request also has to be translated. Each of these systems has a different language for communication, and Siri has to be able to speak them all.

The fourth part is a response. For “Open Facebook”, the action of launching that app is enough, but for a request like “How’s the weather in New York?”, Siri needs to get data back, interpret the computer response, and then create a response that sounds human. Something like “It looks like rain in the afternoon”. This happens through understanding how each system responds, creating a natural language string and performing speech synthesis.
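The four steps above can be sketched as a toy pipeline. This Python sketch is purely illustrative; every function name here is a hypothetical stand-in, not how Siri is actually implemented:

```python
# Toy model of Siri's four-step pipeline. Every function here is a
# hypothetical stand-in, not Siri's real architecture.

def speech_to_text(audio: bytes) -> str:
    """Step 1: audio -> text. Stubbed; Siri does this on Apple's servers."""
    return "Open Facebook"

def parse_intent(text: str) -> dict:
    """Step 2: text -> structured intent, tolerating varied phrasing."""
    synonyms = {"open": "launch", "start": "launch", "launch": "launch"}
    verb, _, target = text.partition(" ")
    return {"action": synonyms.get(verb.lower(), verb.lower()), "app": target}

def to_system_command(intent: dict) -> str:
    """Step 3: intent -> the exact command the target system expects."""
    if intent["action"] == "launch":
        return f'launchApplication:"{intent["app"]}"'
    raise ValueError(f"unsupported action: {intent['action']}")

def respond(succeeded: bool) -> str:
    """Step 4: turn the system's result back into natural language."""
    return "Done." if succeeded else "Sorry, I couldn't do that."

print(to_system_command(parse_intent(speech_to_text(b"..."))))
# launchApplication:"Facebook"
```

Note how the fuzziness lives entirely in step 2: “Open Facebook” and “Start Facebook” both collapse to the same exact command.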

Siri API

An API is a way for things (apps, parts of apps, operating systems, web apps like Twitter) to communicate with each other. An API does two things: it advertises what actions something can do and offers a way to make it perform those actions. APIs are the invisible plumbing that allows different pieces of software to exchange information and commands. For apps, everything works through APIs, from asking the system to take a photo to posting a message on Facebook to getting the current location. Every Twitter client uses Twitter’s standard API. Apart from advertising and performing actions, APIs are expected to be stable. If an API changes, everything that uses it has to be updated or it will stop working. This makes APIs rigid and resistant to change. Even a small spelling mistake or a change in how a computer calls an API will make it fail.

APIs are created to be used by computers but ultimately they are written, understood and (their interfaces) designed by humans. People read the API and, with the help of documentation, understand how to use them. They then program software to use those APIs.

Developers have been asking for a Siri API for many years, but this doesn’t mean that Siri doesn’t already have an API. Siri has an API, but it’s not known or available to developers outside of Apple. It’s a private API. More importantly, Siri knows about the private APIs offered by system apps like Reminders and Mail.

For example, when I tell Siri “Add ‘Milk’ to my shopping list”, Siri parses the text and comes to some conclusions:

  1. I want to use the Reminders app
  2. I want to add to the “Shopping” list
  3. I want to add the item “Milk”

Siri can’t just send Reminders a message to add Milk; it has to translate that intent into API calls, a language that the Reminders app understands. The people who programmed Reminders gave it an API which advertises available actions and, when those actions are called, performs them inside the app. In programming language terms, one of the API calls might look like this:

addReminder:"Milk" toList:"Shopping"

This request will ask Reminders to perform the action “add reminder to a list” and give it a couple of specific parameters, “Milk” and “Shopping”.

(I’m using an Objective-C or Smalltalk-style notation for the APIs since that’s what I’m used to. All APIs have different ways of writing and calling their actions which must be learned before they can be used.)

The people who write Siri already know the actions that the Reminders app can perform and are able to translate “Add ‘Milk’ to my shopping list” into the API call that makes it happen. Siri’s job is to translate the fuzziness of natural language, where you can express the same intent in multiple ways, to API calls where the intent is very specific. Then Siri goes the other way and turns very specific API calls into natural language that we can understand. In this case the Reminders app would send back a success API call and the response would be the stock phrase “I added it”.
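A crude way to picture this translation is pattern matching: several natural phrasings collapse down to one exact call. This Python sketch is purely illustrative; real language parsing is far richer than regular expressions:

```python
import re

# Hypothetical sketch: several natural phrasings collapse to one exact
# API call. Real language parsing is far richer than regular expressions.
PATTERNS = [
    r"add '(?P<item>.+)' to my (?P<list>\w+) list",
    r"put '(?P<item>.+)' on the (?P<list>\w+) list",
    r"remind me to buy (?P<item>.+)",  # no list given: use the default
]

def to_api_call(utterance: str) -> str:
    for pattern in PATTERNS:
        match = re.fullmatch(pattern, utterance, re.IGNORECASE)
        if match:
            item = match.group("item").title()
            list_name = (match.groupdict().get("list") or "Reminders").title()
            return f'addReminder:"{item}" toList:"{list_name}"'
    raise ValueError("no matching intent")  # Siri: "I didn't get that"

print(to_api_call("Add 'milk' to my shopping list"))
# addReminder:"Milk" toList:"Shopping"
```

The list of patterns is the fuzzy end; the single string that comes out the other side is the exact, rigid API call.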

There’s one important thing to clarify. While I said that developers have been hoping for a Siri API, what’s really meant is that developers want a way for their apps to offer an API that Siri can use. Siri is the one in control and apps passively offer a list of things they can do. If I create a task-list app, I want to be able to offer Siri an API with actions like “Add todo” or “Add todo to DIY list” or “List my scheduled actions for tomorrow”. In addition, developers want Siri to be able to read those actions and understand them well enough to match fuzzy, natural language to the API’s intent.

Currently, all the APIs that Siri uses are read and understood by people. People understand that Yahoo Weather can give weather information and can transform a question like “How’s the weather in Dallas?” into a request to Yahoo Weather with a location parameter of “Dallas”. The problem with offering third-party apps an API for Siri is that there are too many apps for people to parse and classify by hand. The “understanding” has to happen automatically: Siri has to be able to read an app’s API and translate those commands into intents.

A Quick Aside: URL Schemes

If you are an advanced iOS user, you may have wondered about URL schemes. This is a quasi-API that allows apps to send messages to other apps requesting that they perform actions. You send an app a message like this:

net.nevan.todoapp:///addtodo?title=buy%20milk&list=shopping

The first part is a reverse-DNS string which uniquely identifies a particular app. The path names the action, and the query string gives labelled parameters to tell the app what you want done. Here it’s telling a (fictional) todo app to add “buy milk” to the “shopping” list.

The reason this isn’t a true API, and the reason that it can’t be used for Siri, is that it doesn’t advertise what it can do. In programming terms, it doesn’t expose its API. If you want to know how to tell this app to add a todo, you have to search for the developer’s documentation. A true API must communicate to Siri what actions it can do and allow Siri to invoke the actions in a standard way.
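For illustration, here is how a URL-scheme call like the one above can be picked apart (a Python sketch, with the space in “buy milk” percent-encoded as a real URL requires):

```python
from urllib.parse import parse_qs, urlsplit

# Picking apart a URL-scheme call for a hypothetical todo app.
# The space in "buy milk" is percent-encoded, as a real URL requires.
url = "net.nevan.todoapp:///addtodo?title=buy%20milk&list=shopping"

parts = urlsplit(url)
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print(parts.scheme)  # net.nevan.todoapp -- identifies the receiving app
print(parts.path)    # /addtodo -- the action to perform
print(params)        # {'title': 'buy milk', 'list': 'shopping'}
```

Everything is there for the receiving app to act on, but nothing tells the caller in advance that “addtodo” exists or what parameters it takes. That discovery step is exactly what a true API provides.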

Steps to a Siri API

Back to the challenge Apple faces in allowing Siri to interface with third-party apps: translating the intent in voice commands into the intent of APIs without human intervention.

Since Apple likes to do things in small, incremental steps it’s unlikely that they will offer a full Siri API any time soon, but the beginnings are already underway. The low-hanging fruit is an API that always has the same intent and interface and is useful for a broad range of apps. That means search. Tim Cook has already announced that Siri on tvOS will be able to search third-party apps for content.

Searching for content looks almost exactly the same in all apps. You provide a string of text and get back a list of results. You can search Twitter for “cat pictures” or search Google Maps for “Dublin” or search Simplenote for “Meeting notes”. There are ways to filter or narrow results in all cases, but in general you only need to provide an app name and a text string to search for.

Search has been partly implemented in iOS 9, where apps can provide an index of their contents to the system, which gathers all the results and uses them in Spotlight. A recipe app can give the system a list of recipe titles and when the user searches for “Curry”, the system will give back a link to the app and to the specific place in the app which has a curry recipe.

Searching inside a specific app using Siri is slightly different. The user would say something like “Search Google Maps for Barcelona”. This time Siri would interpret which app the user wants to use (“Google Maps”) and the search term (“Barcelona”). Every app would have the same API automatically provided by the system (e.g. “searchWithTerm”), and Siri would call that API on Google Maps:

searchWithTerm:"Barcelona"

Google Maps would open and show a map centred on Barcelona. This is different from the tvOS Siri search, where an app receives a search term and gives Siri back a list of results to display (or a simple “Found” or “Not found”).
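A uniform search entry point might be modelled like this. This is a Python sketch under my own assumptions; the class, registry and function names are invented, not a real iOS mechanism:

```python
# Sketch of a uniform, system-provided search entry point. The class and
# registry are my own invention, standing in for a real iOS mechanism.

class SearchableApp:
    def __init__(self, name: str):
        self.name = name

    def search_with_term(self, term: str):
        raise NotImplementedError  # each app supplies its own behaviour

class MapsApp(SearchableApp):
    def search_with_term(self, term: str):
        return [f"Map centred on {term}"]

REGISTRY = {"Google Maps": MapsApp("Google Maps")}

def siri_search(app_name: str, term: str):
    app = REGISTRY.get(app_name)
    if app is None:
        return None  # Siri: "I couldn't find that app"
    return app.search_with_term(term)

print(siri_search("Google Maps", "Barcelona"))
```

The key property is that Siri only needs to know one call shape; each app decides for itself what a search means.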

What would a search API for apps look like? iOS 9 also added a feature called “Quick Actions” which provides a good hint at how an API could evolve.

Quick Actions: An API for Apps

Quick actions are a new feature in iOS 9 on iPhone 6S and 6S Plus. If an app has quick actions, pressing hard on the icon gives you a menu with up to four choices. App makers are using quick actions to let you quickly access commonly used features. If you want to paste a URL into Safari and need a new tab, you can deep press the Safari icon and choose that action. Quick actions save taps and time.

There are two varieties of quick actions: Static and Dynamic. Static quick actions are hard-coded into an area of the app that can be read by the iOS system. The system reads the actions to show the quick action menu. Dynamic actions are also read by the system (through a different mechanism) and can be changed by the app any time it’s running.

A quick action has a text title and an optional icon. Although it’s not as formal as an API, it is standard enough that Siri could match this title to a request. If a title reads “New Message”, that’s enough for Siri to match to a voice command.

The quick action can have one of a number of standard icons to display with the title. These icons are specified by the system and each carries a standard semantic meaning; there’s an icon specifically for search. Matching the title “Search” together with the search icon would make a great entry point for an API to Siri, and because the icons’ meanings are standardised, letting Siri match them to commands would save a lot of the work of matching meaning. What’s missing is a way to send a parameter (in this case a search string) through quick actions.

I started thinking about quick actions as an API for Siri after looking at Apple’s quick action sample code. In that code, actions are given an identifier in reverse-DNS format. An action to “Take selfie” would be written like this:

"com.apple.quickactionsample.take-selfie"

There’s no good reason to use this format since the call receiver is always the app. The app already knows which app it is. The only reason to use this format would be for something outside the app to uniquely distinguish both the app and action.
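Splitting such an identifier into its app and action parts is trivial, which is what makes the format useful to something outside the app. The identifier format comes from Apple’s sample code; the split shown here is my own sketch:

```python
# Splitting the sample identifier into app and action parts. The format
# comes from Apple's sample code; the split logic is my own sketch.
identifier = "com.apple.quickactionsample.take-selfie"

app_id, _, action = identifier.rpartition(".")
print(app_id)  # com.apple.quickactionsample
print(action)  # take-selfie
```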

API parameters

Most of the things that you do with Siri require both an action and a parameter. Instead of “Send message”, the more useful action is “Send message to Emily”. Quick actions have no method to include a parameter with the command, but Spotlight search with app indexes gives a clue to how this might work.

There are two choices for including parameters: Siri can send the parameter as-is and let the app handle it, or Siri could have a list of possible parameters and match one, telling the app exactly which to use. If a messaging app sends a list of user names along with internal IDs for each name, Siri could read those names and send the one that best matches. Having a list of names means that Siri is less likely to mistake “Lara” for “Laura” and is also able to ask for clarification: “Did you mean ‘Jony’ or ‘Johnny’?”
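As a rough sketch of that second choice, here the contact list is made up and `difflib` stands in for whatever matcher Siri would actually use:

```python
import difflib

# Sketch of parameter matching: the app supplies names with internal IDs
# and the assistant matches what it heard against them. The contact list
# is made up, and difflib stands in for whatever matcher Siri would use.
CONTACTS = {"Laura": 101, "Lara": 102, "Jony": 103, "Johnny": 104}

def match_name(heard: str):
    if heard in CONTACTS:
        return CONTACTS[heard]  # exact match: send the internal ID
    candidates = difflib.get_close_matches(heard, CONTACTS, n=2, cutoff=0.6)
    if len(candidates) == 1:
        return CONTACTS[candidates[0]]  # one close match: act on it
    if candidates:
        return candidates  # ambiguous: ask "Did you mean ...?"
    return None  # nothing close: fall back to sending the raw string

print(match_name("Lara"))   # exact hit, so no confusion with "Laura"
print(match_name("Jonny"))  # two close names: ask the user to choose
```

Because the app supplied IDs alongside names, an unambiguous match can go straight back to the app as an exact identifier rather than a string.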

Apps could also specify that a particular action can only be performed with a certain type. A messaging app might have a list of users and chatrooms, but “New message” can only work with a user ID. This helps Siri to understand better what an app can do and match a voice request to an API.

So quick actions might offer a universal API for apps to interact with Siri. Siri translates “Use Foursquare to check in” into the “Check In” action in Foursquare and sends the command to the Foursquare app in a way it can understand. Or if Siri hears “Play ‘Beat It’ in Spotify”, it uses Spotify’s play action, which accepts either a string that it uses to find and play the best match, or a song ID that was encoded in the Spotlight search index.

Siri Responds

There are several ways Siri can respond to queries. The first is a simple success or failure. This could be an API that Siri provides, or a callback to Siri’s original request. Siri could either answer with a stock “I completed that” or “I couldn’t complete that”, or with a localised phrase provided by the app: “I added your todo”.

Siri can ask the user to choose from a list of possibilities. This could be given to Siri as a simple list of text strings, or as a custom view controller with formatting. For example, a search on the Amazon app for “iOS programming” could return a simple list of titles, or a fully formatted list with titles, prices and cover images. Sending a full view controller to the system is already supported by iOS in widgets.

The final way is a fully customised view controller that allows Siri to display some information without ever opening the queried app. In the same way as “What’s the weather like tomorrow” displays a fully custom view, a different weather app could provide its own view. The app Dark Sky could provide a rain prediction chart for the hour from a request like “Ask Dark Sky for the weather for the next hour”.

Increments of API

Since Apple likes to release new technologies iteratively, the first versions of an API for Siri would be simple and limited, evolving yearly with OS releases. Here’s how it might look:

  1. Search: Commands like “Search Spotify for ‘Christmas songs’”
  2. Standard Actions: Apps can provide an API to Siri from a fixed list of common actions (Play, Take Photo, Create Task, Mark Location). “Play ‘Bob James’ in Spotify”
  3. Unlimited Actions: Apps provide an API with any action for Siri to interpret. “Find user ‘Emily’ in Spotify”, “Play the track ‘Madame George’ from Van Morrison’s album ‘Astral Weeks’ in Spotify”.
  4. In-App Siri: Apps provide an API to Siri within the app. “Add this track to my Running playlist”, “Use a broad paintbrush with aquamarine paint”, “Apply the ‘Gotham’ filter”. This is already in place to some degree on Apple TV, where you can perform actions on the currently playing video. The most wonderful command: “What did she say?” which skips back 15 seconds and temporarily turns on subtitles.
  5. Siri for the Web: Before Apple bought it, Siri was conceived as the ultimate mashup. Apple has started embracing the open web lately and allowing Siri to interface with any web service that offers an API seems a natural fit. “Find the opening hours for the restaurant that Mike mentioned yesterday on Twitter”

Timeline for an API for Siri

No-one makes money betting on Apple’s release schedule, but it’s looking like Siri could open up soon. Tim Cook has announced a limited API for Siri on tvOS. There are rumours that Siri will be coming to the Mac in the next OS release. What’s more, the only really good way to control Apple Watch is through voice. When Siri isn’t working well on Apple Watch, the pain of having to navigate the small interface is far greater than on a phone. Despite the watch being underpowered for Siri, voice is still the most natural way to use it. Some of my most used Siri commands on the watch are setting timers and reminders, and starting workouts. When Siri works, these take a tenth of the time they do with the touch interface. All of this is also true for Apple TV, which has gone as far as to give Siri a dedicated button on the remote. The newest Apple TV is driving many of the recent innovations in Siri. Siri is the best way to control both of Apple’s most recent products, and it would be far more powerful if it were available to all apps.

Apple likes to give each of its WWDCs a key theme. It could be that 2016 will have Siri as the theme and we’ll see Siri transformed into a tool that can control any app.
