- From: Deborah Dahl <dahl@conversational-technologies.com>
- Date: Sat, 26 Feb 2005 12:07:16 -0500
- To: <www-voice@w3.org>
Dear Voice Browser Working Group,

The Multimodal Interaction Working Group has reviewed the Semantic Interpretation for Speech Recognition Last Call Working Draft [1] in response to the WG's request and has prepared the following feedback. Thanks to the EMMA [2] subgroup and to Michael Johnston in particular for preparing the feedback.

best regards,

Debbie Dahl, MMI WG Chair

Summary: The EMMA subgroup have reviewed the SI specification in some detail. Overall we find SI to be very well suited for the production of application semantics in XML for inclusion in EMMA documents. We have a number of comments, broken down into three sections. The first contains suggestions for simple editorial changes. The second contains some suggestions for an informative appendix exemplifying how SI can be used to construct EMMA annotations on elements within the application semantics. The third contains a proposal for extending the SI specification in order to provide access to word/phrase timing information within SI rules in order to support multimodal integration.

[1] http://www.w3.org/TR/semantic-interpretation/
[2] http://www.w3.org/TR/emma/

==========================================================
Feedback from MMI EMMA subgroup on the VB SI specification
==========================================================

Michael Johnston, AT&T
Wu Chou, Avaya
Debbie Dahl, Conversational Technologies
Gerry McCobb, IBM
Paulo Baggia, Loquendo
Dave Raggett, Canon, W3C

The EMMA subgroup have reviewed the SI specification in some detail. Overall we find SI to be very well suited for the production of application semantics in XML for inclusion in EMMA documents. We have a number of comments, broken down below into three sections. The first contains suggestions for simple editorial changes. The second contains some suggestions for an informative appendix exemplifying how SI can be used to construct EMMA annotations on elements within the application semantics.
The third contains a proposal for extending the SI specification in order to provide access to word/phrase timing information within SI rules in order to support multimodal integration.

1. Editorial changes
====================

1.1 Descriptions of EMMA
========================

It would be good if we could clarify the role of EMMA in standardizing the containers and annotation for semantic representation, rather than the semantic representation of user utterances itself. How about the following changes:

Third para of abstract:
-----------------------

change the last sentence:

  "The W3C Multimodal Interaction Activity is defining a data format (EMMA) for representing the information contained in user utterances"

-->

  "The W3C Multimodal Interaction Activity is defining an XML data format (EMMA) for containing and annotating the information in user utterances"

Section 1.1
-----------

likewise, in the second to last para, we suggest changing the first sentence to:

  "The W3C Multimodal Interaction Activity is defining an XML data format (EMMA) for containing and annotating the information in user utterances"

In the examples in section 3.3.2
--------------------------------

Typo: 4th last sentence, "of the for the" --> "of the"

Section 5, fifth paragraph
--------------------------

second sentence, "can not" --> "cannot"

2. Suggestion for Informative Appendix
======================================

The critical cases for compatibility of SI and EMMA are situations where EMMA annotations appear on elements within the application semantics generated by SI rules. The creation of emma elements such as emma:one-of will need to be carried out by the processor which applies the SI rules, but these elements cannot be built within the SI scripts themselves, since they contain the results of multiple different parses, possibly of different recognition results.
  <emma:emma ..>
    <emma:one-of>
      <emma:interpretation>
        APPLICATION NAMESPACE ELEMENTS BUILT USING
        SI SCRIPTS FOR FIRST RESULT STRING
      </emma:interpretation>
      <emma:interpretation>
        APPLICATION NAMESPACE ELEMENTS BUILT USING
        SI SCRIPTS FOR SECOND RESULT STRING
      </emma:interpretation>
      ....
    </emma:one-of>
  </emma:emma>

In the review of the SISR document by the EMMA subgroup we identified three main situations in which EMMA annotations would appear within the application semantics and be generated using SI scripts. The first two of these can be handled with the existing mechanisms of SI, and we would like to propose the inclusion of an informative appendix in the SI specification showing how SI scripts can be used to build these EMMA annotations. The EMMA specification itself contains an appendix showing how emma:hook annotations can be built using SRGS/SI.

2.1 emma:hook
=============

emma:hook is used to indicate that a piece of semantic content needs to be combined with content from another mode. The mode required is indicated as the value of emma:hook. This can be readily handled with the existing SI specification using the _nsprefix property.
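As a concrete illustration of the point in section 2 above that emma:one-of containers must be assembled by the processor applying the SI rules, not by the SI scripts themselves, here is a minimal, non-normative ECMAScript sketch. The function name wrapInOneOf and the input strings are illustrative only; a real processor would obtain the per-result XML by running the SI scripts against each recognition result.

```javascript
// Non-normative sketch: a processor wraps the application-namespace XML
// produced by the SI scripts for each recognition result in
// emma:interpretation elements inside a single emma:one-of container.
function wrapInOneOf(interpretations) {
  var items = interpretations.map(function (xml) {
    return "<emma:interpretation>" + xml + "</emma:interpretation>";
  });
  return (
    '<emma:emma xmlns:emma="http://www.w3.org/2003/04/emma">' +
    "<emma:one-of>" + items.join("") + "</emma:one-of>" +
    "</emma:emma>"
  );
}

// Hypothetical per-result semantics, one per result string:
var doc = wrapInOneOf([
  "<command><action>zoom</action></command>",
  "<command><action>pan</action></command>"
]);
```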
As an example, to create the emma:hook="ink" annotation in the semantics of "zoom in here" the following rule could be used:

  <rule id="zoom">
    zoom in here
    <tag>
      $.command = new Object();
      $.command.action = "zoom";
      $.command.location = new Object();
      $.command.location._attributes = new Object();
      $.command.location._attributes.hook = new Object();
      $.command.location._attributes.hook._nsprefix = "emma";
      $.command.location._attributes.hook._value = "ink";
      $.command.location.type = "area";
    </tag>
  </rule>

The resulting ECMAScript object would be as follows:

  {
    command: {
      action: "zoom"
      location: {
        _attributes: {
          hook: {
            _nsprefix: "emma"
            _value: "ink"
          }
        }
        type: "area"
      }
    }
  }

SI processing in an XML environment would generate the following document:

  <command>
    <action>zoom</action>
    <location emma:hook="ink">
      <type>area</type>
    </location>
  </command>

We will submit a separate CR against the EMMA working draft so that appendix 9.1 uses _nsprefix as above.

2.2 emma:tokens
===============

A second common use of emma annotations within the application semantics is to annotate the specific words/tokens associated with some part of the semantics. These can be used by later stages of dialog processing and generation in order to construct confirmation questions. For example, if the user says "show flights from kennedy airport" the system might want to respond using the actual words that the speaker used in making a confirmation: "did you say you want to leave from 'kennedy airport'?"

This can also be achieved under the existing specification, using the .text property associated with rules. We discussed the possibility of having a general mechanism which assigned an emma:tokens value on the basis of every rule application, but this faces two problems: there will generally not be a one-to-one relationship between the derivation tree of the parse and the XML elements in the resulting semantics, and adding emma:tokens to every part of the semantics would be verbose.
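The step described above as "SI processing in an XML environment" relies on a conventional mapping from the ECMAScript result object to XML: entries under _attributes become XML attributes, _nsprefix supplies the attribute's namespace prefix, and _value supplies text content. The following is a minimal, non-normative sketch of such a serializer under those assumptions; toXml is a hypothetical name, and real SI processors may implement this mapping differently.

```javascript
// Non-normative sketch of the ECMAScript-object-to-XML mapping assumed
// in the examples: _attributes -> XML attributes (with _nsprefix as the
// namespace prefix), _value -> text content, other keys -> child elements.
function toXml(name, obj) {
  if (obj === null || typeof obj !== "object") {
    return "<" + name + ">" + obj + "</" + name + ">";
  }
  var attrs = "";
  if (obj._attributes) {
    for (var a in obj._attributes) {
      var spec = obj._attributes[a];
      var aname = spec._nsprefix ? spec._nsprefix + ":" + a : a;
      attrs += " " + aname + '="' + spec._value + '"';
    }
  }
  var body = "";
  if ("_value" in obj) {
    body = String(obj._value);
  } else {
    for (var k in obj) {
      if (k !== "_attributes") body += toXml(k, obj[k]);
    }
  }
  return "<" + name + attrs + ">" + body + "</" + name + ">";
}

// The "zoom in here" result object from section 2.1:
var zoomResult = {
  command: {
    action: "zoom",
    location: {
      _attributes: { hook: { _nsprefix: "emma", _value: "ink" } },
      type: "area"
    }
  }
};
var xml = toXml("command", zoomResult.command);
```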
We agreed that while some more general mechanism could be explored in future, for now the method would be to explicitly create the emma:tokens attribute. Here is a simple example for "flights to kennedy airport", using the following SRGS/SI rules:

  <rule id="flight">
    flights to <ruleref uri="#city"/>
    <tag>
      $.command = new Object();
      $.command.action = "flight";
      $.command.destination = new Object();
      $.command.destination._attributes = new Object();
      $.command.destination._attributes.tokens = new Object();
      $.command.destination._attributes.tokens._nsprefix = "emma";
      $.command.destination._attributes.tokens._value = meta.city.text;
      $.command.destination._value = $city;
    </tag>
  </rule>

  <rule id="city">
    <one-of>
      <item>kennedy airport<tag>$="JFK"</tag></item>
      <item>san francisco<tag>$="SFO"</tag></item>
      <item>john f kennedy <tag>$="JFK"</tag></item>
    </one-of>
  </rule>

The resulting ECMAScript object would be as follows:

  {
    command: {
      action: "flight"
      destination: {
        _attributes: {
          tokens: {
            _nsprefix: "emma"
            _value: "kennedy airport"
          }
        }
        _value: "JFK"
      }
    }
  }

SI processing in an XML environment would generate the following document:

  <command>
    <action>flight</action>
    <destination emma:tokens="kennedy airport">JFK</destination>
  </command>

Note that while it is not possible to determine what the user actually said from the semantics JFK, it is possible to determine what they said from the emma:tokens, "kennedy airport".

3. Addition of temporal metadata to SI/SRGS to support the creation of EMMA timestamps and multimodal integration
==========================================================

Another kind of EMMA annotation which is needed within the application semantics, and which is particularly important for multimodal applications involving multimodal integration, is the annotation of the timespan associated with a particular piece of the semantic representation.
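Returning to the confirmation-question use case for emma:tokens in section 2.2 above, a later dialog stage could read the annotation back out of the result object to echo the user's own words. A minimal sketch under the object shape shown in that section; confirmDestination and the prompt wording are hypothetical, not part of either specification.

```javascript
// Non-normative sketch: build a confirmation prompt from the emma:tokens
// annotation rather than from the normalized semantics ("JFK").
function confirmDestination(command) {
  var spoken = command.destination._attributes.tokens._value;
  return "did you say you want to go to '" + spoken + "'?";
}

// The "flights to kennedy airport" result object from section 2.2:
var result = {
  command: {
    action: "flight",
    destination: {
      _attributes: { tokens: { _nsprefix: "emma", _value: "kennedy airport" } },
      _value: "JFK"
    }
  }
};
var prompt = confirmDestination(result.command);
```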
For example, if the other modality involves computer vision or gaze, the multimodal integration component will have to determine where the user was pointing or looking when certain words were said; e.g. when they say "zoom in here" you want to know where they were pointing so you can use that location as the place to zoom in on.

In order to enable the use of SI for applications with temporal constraints on multimodal integration, two additional associated variables would be needed in the SI specification: one to indicate the start time of the time interval associated with the words parsed by a rule, and one to indicate the end time. For consistency with the EMMA annotations emma:start and emma:end these could be called .start and .end. If timing information were required on a particular element in the semantic representation, it could be determined by accessing these. The values of the .start and .end associated variables should be absolute timestamps in milliseconds.

Extending the "zoom in here" example from 2.1
above, this would work as follows, assuming that the word "here" starts at 1087995961542 and ends at 1087995961642:

  <rule id="zoom">
    zoom in <ruleref uri="#here"/>
    <tag>
      $.command = new Object();
      $.command.action = "zoom";
      $.command.location = new Object();
      $.command.location._attributes = new Object();
      $.command.location._attributes.hook = new Object();
      $.command.location._attributes.hook._nsprefix = "emma";
      $.command.location._attributes.hook._value = "ink";
      $.command.location._attributes.start = new Object();
      $.command.location._attributes.start._nsprefix = "emma";
      $.command.location._attributes.start._value = meta.here.start;
      $.command.location._attributes.end = new Object();
      $.command.location._attributes.end._nsprefix = "emma";
      $.command.location._attributes.end._value = meta.here.end;
      $.command.location._value = $here;
    </tag>
  </rule>

  <rule id="here">
    here
    <tag>$.type = "area"</tag>
  </rule>

The resulting ECMAScript object would be as follows:

  {
    command: {
      action: "zoom"
      location: {
        _attributes: {
          hook: {
            _nsprefix: "emma"
            _value: "ink"
          }
          start: {
            _nsprefix: "emma"
            _value: "1087995961542"
          }
          end: {
            _nsprefix: "emma"
            _value: "1087995961642"
          }
        }
        type: "area"
      }
    }
  }

SI processing in an XML environment would generate the following document:

  <command>
    <action>zoom</action>
    <location emma:hook="ink" emma:start="1087995961542" emma:end="1087995961642">
      <type>area</type>
    </location>
  </command>

Other more specific names could be used if it is undesirable to reserve 'start' and 'end' for this metadata. This example could be reworked so that the contents of the location object are defined in the 'here' rule; in that case the start and end of the word would be accessed through $.start and $.end.

EMMA also supports relative timestamps through the attributes emma:time-ref-uri, emma:offset-to-start, and emma:duration. If absolute timing information is not available, an alternative would be to make the information needed to build a relative timestamp available to the SI scripts.
The three pieces required would be a URI pointing to the start of the speech input, an offset in milliseconds from the start of the speech to the start of the phrase covered by the current rule, and the duration of the phrase. For example, the following properties could be used: .timerefuri, .offset_ms, and .duration_ms.
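Under that proposal, the EMMA relative-timestamp attributes could be computed directly from those properties, or derived by subtraction when absolute .start/.end values and a reference start time are known. A minimal sketch; the function names, the "#audio1" URI, and the reference time are illustrative only, not part of the proposal.

```javascript
// Non-normative sketch: assemble EMMA relative-timestamp attributes from
// the proposed .timerefuri / .offset_ms / .duration_ms rule properties.
function relativeTimestamp(timerefuri, offset_ms, duration_ms) {
  return {
    "emma:time-ref-uri": timerefuri,
    "emma:offset-to-start": offset_ms,
    "emma:duration": duration_ms
  };
}

// Equivalently, if absolute .start/.end values (in milliseconds) are
// available, offset and duration follow by subtraction from the
// reference start time:
function fromAbsolute(timerefuri, refStart, start, end) {
  return relativeTimestamp(timerefuri, start - refStart, end - start);
}

// Using the timestamps from the "zoom in here" example, with a
// hypothetical reference start time for the speech input:
var ts = fromAbsolute("#audio1", 1087995961000, 1087995961542, 1087995961642);
```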
Received on Saturday, 26 February 2005 17:07:58 UTC