
Re: Overview paragraph

From: Dan Burnett <dburnett@voxeo.com>
Date: Thu, 21 Apr 2011 11:20:30 -0400
Cc: public-xg-htmlspeech@w3.org
Message-Id: <1F544A9B-31D2-4CBC-8251-7E0A37BB226C@voxeo.com>
To: Marc Schroeder <marc.schroeder@dfki.de>
I find this a particularly interesting example, because SSML 1.1 added  
capabilities to address precisely this scenario.  TTS engine providers  
were concerned with being limited in their ability to innovate, while  
app authors were concerned that they were not able to control the user  
experience upon encountering a language the current TTS voice didn't  
know how to render, or when a particular requested voice was not  
available.

To address this, we first considered what actions a TTS engine could  
take upon these two conditions. Then, we added attributes that allow  
an author to indicate the desired behavior.  In the case where a  
synthesis processor decided that it did not know how to speak a  
particular language with the current voice, the author-requestable  
options were:  change the voice, ignore the text, ignore the language  
change (pretending the language had not changed), or processor choice.
In the case where a requested voice was not available, the options
were: select a voice according to prioritized features such as
gender/age/language, keep the existing voice, or processor choice.

This gave control to the author where needed, but only over actions we
could guarantee all processors could take.  It also allowed for the
"processorchoice" case, the default, which gave processors room to
demonstrate whatever intelligence they had, while simplifying
authoring for authors who didn't need to specify these details.
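
For reference, the relevant SSML 1.1 markup looks roughly like this (a
sketch from memory -- the voice name is invented, and the attribute
values should be checked against the spec before relying on them):

```xml
<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="de">
  <!-- Ask for a named voice; if it is unavailable, fall back by
       matching languages first, then gender ("ordering"), with
       "languages" a hard requirement ("required").  If the current
       voice cannot speak a language it encounters, switch voices
       ("onlangfailure") rather than skip the text. -->
  <voice name="anna" languages="de" gender="female"
         required="languages" ordering="languages gender"
         onlangfailure="changevoice">
    Guten Tag! <lang xml:lang="en-US">Good morning!</lang>
  </voice>
</speak>
```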

I am not saying we can absolutely do such an analysis for ASR engines,  
but I suspect that is the direction in which we are headed.
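
To make the "silently ignore" behavior Bjorn describes below concrete,
a UA-side sketch might look like the following.  This is purely
illustrative -- the function name, the scheme list, and the fallback
choice are mine, not from any draft:

```javascript
// Sketch: treat a grammar URI with an unrecognized scheme as if the
// grammar parameter had never been set, mirroring how HTML and CSS
// silently ignore unknown elements and properties.
const STANDARD_SCHEMES = new Set(["http", "https", "builtin"]);

function effectiveGrammar(requested) {
  if (requested == null) return null;  // nothing requested
  const scheme = String(requested).split(":")[0].toLowerCase();
  // Unknown scheme (e.g. "x-acme:foo"): fall back to no grammar at
  // all, so new standard schemes can be added later without breaking
  // older user agents.
  return STANDARD_SCHEMES.has(scheme) ? requested : null;
}
```

Under this sketch, effectiveGrammar("x-acme:foo") yields null (the
parameter is ignored), while an ordinary https grammar URI passes
through unchanged.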

-- dan


On Apr 21, 2011, at 2:40 AM, Marc Schroeder wrote:

> I think what Bjorn is proposing here is a reasonable mindset, and
> seems consistent with what has been done elsewhere: allow for
> non-standard extensions, vendor-specific resources etc., but provide
> a certain level of interoperability on the standard functionality
> itself.
>
> As is often the case, I find the TTS case easier to grasp. Take the
> example of a request to synthesise a certain text in language "de"
> with a certain vendor-specific voice. Now if that vendor's TTS or
> that specific voice is not available for whatever reason, would we
> rather have no TTS output, or another voice in that language speaking
> the text? The same goes for any non-standard additional parameters to
> the vendor-specific TTS engine -- if a different engine is used, I
> think the expected behaviour should be that the non-standard bits are
> simply ignored, and the UA + fallback engine realise the request as
> best they can.
>
> Cheers,
> Marc
>
> On 20.04.11 23:18, Bjorn Bringert wrote:
>> We could either prevent applications from trying to use non-standard
>> resources with the default speech services, or specify how the
>> fallback will work if those resources are not available.
>>
>> To take a fictional example, if the app specifies something like
>> grammar="x-acme:foo", we could either specify that this is an error,
>> or that the recognizer should treat this as if the grammar parameter
>> was not set at all. I'd prefer the latter, since it makes it easier
>> to add new standard resources in the future. This is how many other
>> web standards work: unknown elements and attributes in HTML are
>> silently ignored, and unknown properties, fonts etc. are silently
>> ignored in CSS.
>>
>> /Bjorn
>>
>> On Wed, Apr 20, 2011 at 9:47 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>>> I am in favor of what Patrick is proposing below, but I'm still
>>> uneasy about the language around the default engines.
>>>
>>> The problem is that we have no way of limiting how the app might
>>> use the default recognizer or synthesizer. It might, for example,
>>> make use of proprietary resources such as grammars, models, or
>>> pronunciations.
>>>
>>> Requiring that such an application behave even "consistently"
>>> across all engines would require an enumeration of all such
>>> resources. Engines would be prevented from extending this set
>>> unless they used "outside" channels such as what Patrick outlined
>>> below.
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Patrick Ehlen [mailto:pehlen@attinteractive.com]
>>> Sent: Wednesday, April 20, 2011 1:44 PM
>>> To: Bjorn Bringert
>>> Cc: Young, Milan; Raj(Openstream); Satish S; Deborah Dahl; DRUTA, DAN (ATTSI); public-xg-htmlspeech@w3.org
>>> Subject: Re: Overview paragraph
>>>
>>> Agreed. In my view, the point here is to provide a consistent set
>>> of methods for content developers to access speech services,
>>> whatever their particular capabilities may be.
>>>
>>> For example, a developer may want to use a recognizer with a
>>> proprietary type of model and an instance of that model on a server
>>> somewhere. We should provide a method for someone to specify a URI
>>> for the recognizer, a URI for the model, and a place to pass
>>> parameters that may be particular to that type of model. It would
>>> be up to the recognizer to know how to handle the model and its
>>> parameters; that is not part of our job here.
>>>
>>>
>>> On Apr 20, 2011, at 13:22, "Bjorn Bringert" <bringert@google.com> wrote:
>>>
>>>> A consistent user experience is not the same as an identical user
>>>> experience. For example, user agents render web pages using varying
>>>> window sizes and pixel densities.
>>>>
>>>> /Bjorn
>>>>
>>>> On Wed, Apr 20, 2011 at 9:10 PM, Young, Milan <Milan.Young@nuance.com> wrote:
>>>>> All default recognizers must return the same results/timings with
>>>>> the same input waveform? All default synthesizers should return
>>>>> the same samples on the same input SSML?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>>
>>>>> From: Raj(Openstream) [mailto:raj@openstream.com]
>>>>> Sent: Wednesday, April 20, 2011 12:57 PM
>>>>> To: Satish S; Patrick Ehlen
>>>>>
>>>>> Cc: Deborah Dahl; Young, Milan; DRUTA, DAN (ATTSI);
>>>>> public-xg-htmlspeech@w3.org
>>>>> Subject: Re: Overview paragraph
>>>>>
>>>>>
>>>>>
>>>>> Yes, I agree with Satish's point: any application that desires to
>>>>> leverage advanced/specific features of an ASR cannot be
>>>>> guaranteed to be portable within the scope of our spec, and
>>>>> applications that use the default (LCD?) recognizer (not sure if
>>>>> this is what Dan D had in mind by saying "simple" applications)
>>>>> should be portable and have a consistent user experience with
>>>>> conforming browsers/clients.
>>>>>
>>>>> --Raj
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Satish S
>>>>> To: Patrick Ehlen
>>>>> Cc: Deborah Dahl; Young, Milan; DRUTA, DAN (ATTSI); public-xg-htmlspeech@w3.org
>>>>> Sent: Wednesday, April 20, 2011 3:38 PM
>>>>> Subject: Re: Overview paragraph
>>>>>
>>>>> As an express goal, perhaps we should clearly state that
>>>>> applications that use the default/built-in recognizer should be
>>>>> portable across all browsers and speech engines. Beyond that, if
>>>>> the web app chooses to use a particular engine by specifying a
>>>>> URL, it seems OK to rely on extended/additional capabilities
>>>>> provided by that engine.
>>>>>
>>>>> Cheers
>>>>> Satish
>>>>>
>>>>> On Wed, Apr 20, 2011 at 5:00 PM, Patrick Ehlen <pehlen@attinteractive.com> wrote:
>>>>>
>>>>> Deborah is right that not all speech engines will have the same
>>>>> capabilities, but we should strive to provide general
>>>>> parameterizations of the potential capabilities wherever
>>>>> possible. Otherwise engine providers will need to add their own
>>>>> extensions to the standard, and development will get fractured
>>>>> along browser/engine lines, as we saw happen with earlier
>>>>> JavaScript XML handlers, etc.
>>>>>
>>>>> On Apr 20, 2011, at 8:27, "Deborah Dahl"
>>>>> <dahl@conversational-technologies.com>  wrote:
>>>>>
>>>>>> I don't think we can reach the goal of applications being
>>>>>> completely portable across speech engines, because speech
>>>>>> engines will always have different capabilities, and some of
>>>>>> these are unlikely to be in the scope of our API. For example,
>>>>>> engines will handle different languages, some engines will be
>>>>>> able to handle larger grammars, some applications will make use
>>>>>> of proprietary SLMs, and some applications won't be usable
>>>>>> without an engine that has a certain level of accuracy. So I
>>>>>> agree with Milan that the goal is not to standardize
>>>>>> functionality across speech engines. I think we should just say
>>>>>> "provide the user with a consistent experience across different
>>>>>> platforms and devices" and leave it at that.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: public-xg-htmlspeech-request@w3.org [mailto:public-xg-htmlspeech-request@w3.org] On Behalf Of Satish S
>>>>>>> Sent: Wednesday, April 20, 2011 5:18 AM
>>>>>>> To: Young, Milan
>>>>>>> Cc: DRUTA, DAN (ATTSI); public-xg-htmlspeech@w3.org
>>>>>>> Subject: Re: Overview paragraph
>>>>>>>
>>>>>>>    >>  provide the user with a consistent experience across
>>>>>>>    different platforms and devices irrespective of the speech
>>>>>>>    engine used.
>>>>>>>
>>>>>>>    This effort is not about standardizing functionality across
>>>>>>>    speech engines.  The goal is speech application portability
>>>>>>>    across browsers.  Simple applications MAY be portable across
>>>>>>>    speech engine boundaries, but that's not a requirement.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I'd say the API proposal should aim for all applications to be
>>>>>>> portable across speech engines. Starting with "may be portable"
>>>>>>> doesn't seem to fit the spirit of the web. Any extensions for
>>>>>>> speech-engine-specific parameters and results should be
>>>>>>> optional.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Bjorn Bringert
>>>> Google UK Limited, Registered Office: Belgrave House, 76 Buckingham
>>>> Palace Road, London, SW1W 9TQ
>>>> Registered in England Number: 3977902
>>>>
>>>
>>
>>
>>
>
> -- 
> Dr. Marc Schröder, Senior Researcher at DFKI GmbH
> Project leader for DFKI in SSPNet http://sspnet.eu
> Team Leader DFKI TTS Group http://mary.dfki.de
> Editor W3C EmotionML Working Draft http://www.w3.org/TR/emotionml/
> Portal Editor http://emotion-research.net
>
> Homepage: http://www.dfki.de/~schroed
> Email: marc.schroeder@dfki.de
> Phone: +49-681-85775-5303
> Postal address: DFKI GmbH, Campus D3_2, Stuhlsatzenhausweg 3,  
> D-66123 Saarbrücken, Germany
> --
> Official DFKI coordinates:
> Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH
> Trippstadter Strasse 122, D-67663 Kaiserslautern, Germany
> Geschaeftsfuehrung:
> Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
> Dr. Walter Olthoff
> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
> Amtsgericht Kaiserslautern, HRB 2313
>
Received on Thursday, 21 April 2011 15:20:59 GMT
