Dirk Schnelle-Walka Deborah Dahl, Conversational Technologies
Copyright © 2019 the Contributors to the Voice Interaction Community Group, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This document describes a general architecture of Intelligent Personal Assistants and explores the potential for standardization.
This specification was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
Intelligent Personal Assistants (IPAs) are already available in our daily lives through our smartphones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more help us with various tasks, like shopping, playing music, setting schedules, sending messages, and answering simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home so that an assistant is available for these sorts of tasks, or even for controlling household appliances, without the need to pick up an explicit device. As of today, there is no interoperability between the available IPA providers. For the exchange of learned user behaviors in particular, interoperability is unlikely to happen at all.
This document describes a general architecture of IPAs and explores the potential areas for standardization.
A user would like to plan a trip to an international conference and she needs visa information and airline reservations. She will give the intelligent personal assistant her visa information (her citizenship, where she is going, purpose of travel, etc.) and it will respond by telling her the documentation she needs, how long the process will take and what the cost will be. This may require the personal assistant to consult with an auxiliary web service or another personal assistant that knows about visas.
Once the user has found out about the visa, she tells the IPA that she wants to make airline reservations. She specifies her dates of travel and airline preferences and the IPA then interacts with her to find appropriate flights.
A similar process will be repeated if the user wants to book a hotel, find a rental car, or find out about local attractions in the destination city. Booking a hotel as part of attending a conference could also involve finding out about a designated conference hotel or special conference rates.
For the remainder of the document, we consider an IPA that is extendable via a standardized marketplace. Such an IPA features the architectural building blocks shown in the following figure.
This architecture comprises three layers that are detailed in the following sections.
Client that enables the user to access the IPA via voice. Usually, IPA Clients make use of a microphone to capture the spoken input and a loudspeaker to provide responses. As an extension, an IPA Client may also receive commands to be executed locally.
General IPA Service API that mediates between the user and the overall IPA system. The service layer may be omitted in case the IPA Client communicates directly with the Dialog Management.
Component that receives user input as spoken input, updates its internal state, decides upon subsequent steps to continue a dialog and provides output as synthesized or recorded utterances.
For this, it employs several Dialogs that are responsible for handling isolated tasks or intents. The overall set of available Dialogs defines the behavior and capabilities of the interaction with the IPA. The Dialog Manager is also responsible for a good user experience across the available Dialogs.
The Dialog Manager determines the Dialog that is best suited to serve the current user input and re-establishes the interaction state for that Dialog. To do so, it may use the Dialog Registry.
The Dialog Manager also manages the session with a user. Conceptually, multiple sessions can be active in parallel. Dialogs are governed by Sessions, e.g., to free resources of ASR and NLU engines when a session expires. Linguistic phenomena, like anaphoric references and ellipsis, are expected to work within a Session. The selected IPA Provider or the Dialog Manager may have leading roles for this task.
The Dialog Manager also features an ASR to convert spoken utterances into text strings and a TTS to convert text strings into audio.
As an extension the Dialog Manager may also return commands to be executed by the IPA Client.
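The session handling described above can be illustrated with a minimal sketch. The names `Session` and `SessionManager`, the inactivity timeout, and the injectable clock are assumptions made for illustration; the architecture does not prescribe how sessions are stored or when they expire.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Session:
    """Hypothetical per-user session holding dialog context."""
    session_id: str
    last_active: float
    # Prior utterances, e.g. as context for resolving anaphora and ellipsis.
    history: List[str] = field(default_factory=list)


class SessionManager:
    """Tracks parallel sessions and drops them after a period of inactivity."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock
        self._sessions: Dict[str, Session] = {}

    def touch(self, session_id: str, utterance: str) -> Session:
        """Create or refresh a session and record the utterance."""
        now = self._clock()
        session = self._sessions.get(session_id)
        if session is None:
            session = Session(session_id, now)
            self._sessions[session_id] = session
        session.last_active = now
        session.history.append(utterance)
        return session

    def expire(self) -> List[str]:
        """Remove sessions idle longer than the timeout and return their ids."""
        now = self._clock()
        expired = [sid for sid, s in self._sessions.items()
                   if now - s.last_active > self._timeout]
        for sid in expired:
            # A real system would also release ASR and NLU resources here.
            del self._sessions[sid]
        return expired

    def active_ids(self) -> List[str]:
        """Return the ids of all sessions that are still active."""
        return sorted(self._sessions)
```

Passing the clock as a parameter is a design choice that keeps expiry testable without waiting for real time to pass.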
The Automated Speech Recognizer (ASR) receives audio files of recorded utterances and generates a recognition hypothesis as a text string. Optionally, the ASR can also generate multiple recognition hypotheses along with confidence scores.
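A minimal sketch of how a consumer might pick from multiple recognition hypotheses follows. The `Hypothesis` type, the 0.0 to 1.0 confidence range, and the rejection threshold are assumptions for illustration and are not defined by this document.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical ASR result: recognized text plus a confidence score."""
    text: str
    confidence: float  # assumed to lie in the range 0.0..1.0


def best_hypothesis(n_best: List[Hypothesis],
                    threshold: float = 0.5) -> Optional[Hypothesis]:
    """Return the most confident hypothesis, or None if none reaches the threshold."""
    if not n_best:
        return None
    top = max(n_best, key=lambda h: h.confidence)
    return top if top.confidence >= threshold else None
```

Returning `None` for low-confidence input would let the Dialog Manager ask the user to repeat the utterance instead of acting on a likely misrecognition.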
The Text-to-Speech (TTS) component receives text strings, which it converts into audio data.
The Core Dialog is able to handle basic functionality via Core Intents to enable interaction with the user at all. This includes among others
Conceptually, the Core Dialog is a special Dialog that is always available.
A Core Intent Set usually identifies tasks to be executed and defines the capabilities of the Core Dialog. Conceptually, the Core Intents are always available.
A Dialog X is able to handle functionality that can be added to the capabilities of the Dialog Manager through its associated Intent Set. Dialogs must serve different purposes in the sense that each is unique for a certain task. E.g., only a single flight reservation dialog may exist at a time.
Dialogs receive inputs as intents out of their supported Intent Sets along with associated entities and return responses as text strings to be spoken.
Dialogs access the Provider Selection Service to fulfill their task. They maintain state and know which IPA Provider evaluated their request with the help of an identifier.
As an extension Dialogs may also return commands to be executed by the IPA Client.
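The contract described above (intents with entities in, text to be spoken and an optional client command out, with state kept across turns) might look like the following sketch. The class and field names, the `book_flight` intent, and the `show_flight_results` command are hypothetical; access to the Provider Selection Service is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Intent:
    """An intent name plus its associated entities."""
    name: str
    entities: Dict[str, str] = field(default_factory=dict)


@dataclass
class DialogResponse:
    text: str                      # text string to be spoken
    command: Optional[str] = None  # optional command for the IPA Client


class FlightReservationDialog:
    """Hypothetical Dialog serving a single, made-up Intent Set."""

    intent_set = {"book_flight"}

    def __init__(self) -> None:
        # Dialog state that is kept across turns.
        self.slots: Dict[str, str] = {}

    def handle(self, intent: Intent) -> DialogResponse:
        if intent.name not in self.intent_set:
            raise ValueError(f"unsupported intent: {intent.name}")
        self.slots.update(intent.entities)
        # Ask for whatever information is still missing.
        if "destination" not in self.slots:
            return DialogResponse("Where would you like to fly to?")
        if "date" not in self.slots:
            return DialogResponse(
                f"When do you want to fly to {self.slots['destination']}?")
        return DialogResponse(
            f"Searching flights to {self.slots['destination']} "
            f"on {self.slots['date']}.",
            command="show_flight_results",
        )
```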
An Intent Set usually identifies tasks that can be executed within the associated Dialog.
The Dialog Registry manages all available Dialogs with their associated Intent Sets. Dialogs and their Intent Sets can be added or removed as needed.
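As an illustration, a registry of this kind could be sketched as below. The class and method names are hypothetical. Rejecting overlapping Intent Sets is one possible reading of the requirement that each Dialog is unique for a certain task; it is an assumption, not something this document mandates.

```python
from typing import Dict, Optional, Set


class DialogRegistry:
    """Maps dialog names to their Intent Sets, one dialog per intent."""

    def __init__(self) -> None:
        self._intent_sets: Dict[str, Set[str]] = {}

    def add(self, dialog_name: str, intent_set: Set[str]) -> None:
        """Register a dialog; refuse intents already served by another dialog."""
        for other, intents in self._intent_sets.items():
            overlap = intents & intent_set
            if other != dialog_name and overlap:
                raise ValueError(
                    f"intents {sorted(overlap)} already served by {other}")
        self._intent_sets[dialog_name] = set(intent_set)

    def remove(self, dialog_name: str) -> None:
        """Unregister a dialog; unknown names are ignored."""
        self._intent_sets.pop(dialog_name, None)

    def find_dialog(self, intent_name: str) -> Optional[str]:
        """Return the name of the dialog that serves the intent, if any."""
        for name, intents in self._intent_sets.items():
            if intent_name in intents:
                return name
        return None
```

A Dialog Manager could call `find_dialog` with the intent of the current user input to determine which Dialog should continue the interaction.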
A service that provides access to all known IPA Providers. This service also maps the IPA Intent Sets to the Intent Sets in the Dialog layer.
The Provider Selection Service receives input as text strings and returns results as intents from all IPA Providers that are able to reply to the user input along with associated entities.
The Provider Selection Service is stateless and always returns the responses from the IPA Providers along with an identification of the issuing IPA Provider.
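The fan-out behavior described above can be sketched as a stateless function. Modeling each IPA Provider as a plain callable, and the names `ProviderResult` and `select_providers`, are assumptions for illustration; a real service would call remote providers and would also map provider intents to the Intent Sets of the Dialog layer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical provider model: text in, (intent, entities) out,
# or None if the provider cannot reply to the input.
Provider = Callable[[str], Optional[Tuple[str, Dict[str, str]]]]


@dataclass
class ProviderResult:
    provider_id: str  # identifies the issuing IPA Provider
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)


def select_providers(text: str,
                     providers: Dict[str, Provider]) -> List[ProviderResult]:
    """Query every provider and collect the replies of those that can answer.

    The function keeps no state between calls; everything it needs is
    passed in as arguments.
    """
    results: List[ProviderResult] = []
    for provider_id, provider in providers.items():
        reply = provider(text)
        if reply is not None:
            intent, entities = reply
            results.append(ProviderResult(provider_id, intent, entities))
    return results
```

Because every result carries its `provider_id`, a Dialog can remember which IPA Provider evaluated its request.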
A registry that knows how to access the known IPA Providers, i.e. which are available and credentials to access them. Storing of credentials must meet security and trust considerations that are expected from such a personalized service. IPA Providers can be added as needed.
A component that is able to extract meaning as intents and associated entities from an utterance as text strings.
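To make the input and output of such a component concrete, here is a deliberately simplistic keyword-based sketch. It is not how production NLU engines work; the intent names, the keyword table, and the destination pattern (which assumes capitalized place names in the text) are all hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical keyword table mapping intents to trigger words.
INTENT_KEYWORDS = {
    "book_flight": ("flight", "fly"),
    "visa_info": ("visa",),
}

# Matches a capitalized place name after "to", e.g. "to New York".
DESTINATION_PATTERN = re.compile(r"\bto ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")


def extract_meaning(utterance: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (intent, entities) for the utterance, or None if nothing matches."""
    lowered = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            entities: Dict[str, str] = {}
            match = DESTINATION_PATTERN.search(utterance)
            if match:
                entities["destination"] = match.group(1)
            return intent, entities
    return None
```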
The Core NLU is able to handle basic functionality via Core Intents to enable interaction with the user at all.
A generic data provider to aid the Core NLU in determining the intent.
A provider of an IPA service, like
A component that is able to extract meaning as intents and associated entities from an utterance as text strings for IPA Provider X.
An intent set that might be returned by the Provider NLU to handle the capabilities of Provider X.
A data provider to aid the Provider NLU in determining the intent. This component may also be used to track the IPA Provider’s state.
A knowledge graph to reason about the detected input from the Provider NLU and Data Provider in order to derive more meaningful results.
This section expands on the use case above, filling in details according to the sample architecture.
A user would like to plan a trip to an international conference and she needs visa information and airline reservations.
The user starts by asking a general purpose assistant ("IPA Service", on the left of the diagram) about what the visa requirements are for her situation. For a common situation, such as citizens of the EU traveling to the United States, the IPA is able to answer the question directly by getting the information from a web service that it knows about. This could be one of its dialogs 1-n. However, for less common situations (for example, a citizen of South Africa traveling to Japan), the generic IPA will try to identify a visa expert assistant application from the dialog registry. If it finds one, it will connect the user with the visa expert, one of the IPA providers on the right side. The visa expert will then engage in a dialog with the user to find out the dates and purposes of travel and will inform the user of the visa process.
Once the user has found out about the visa, she tells the IPA that she wants to make airline reservations. If she wants to use a particular service, or use a particular airline, she would say something like "I want to book a flight on American". The IPA will then either connect the user with American's IPA or, if American doesn't have an IPA, will inform the user of that fact. On the other hand, if the user doesn't specify an airline, the IPA will find a general flight search IPA from its registry and connect the user with the IPA for that flight search service. The flight search IPA will then interact with the user to find appropriate flights.
A similar process would be repeated if the user wants to book a hotel, find a rental car, find out about local attractions in the destination city, etc. Booking a hotel could also involve interacting with the conference's IPA to find out about a designated conference hotel or special rates.
t.b.d.