- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Sun, 4 Dec 2011 11:18:09 +1100
- To: public-xg-htmlspeech <public-xg-htmlspeech@w3.org>
As explained in the previous email, here is the list of specific issues that we would like to feed back into the group for the TTS part of the specification.

Regards,
Silvia.

Specific Issues:

1. (A and S): Section 4.9 talks about visually highlighting the word or phrase that the application is synthesising. Example code for this use case would be really useful; we can't see how it would be done with plain text input with no SSML <mark> elements. A sketch of the kind of example we are looking for is included right below this issue.
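To make the request concrete, here is a rough sketch of the kind of example we are looking for. It assumes SSML input with one <mark> per word, and it assumes that the timeupdate-per-mark behaviour (issue 3) and the lastMark attribute (issue 6) can be combined in this way - the <tts> element, file name, mark names and styling are purely illustrative:

  <!-- one span per word in the page; hello.ssml would contain
       <mark name="w1"/>Hello <mark name="w2"/>world -->
  <span id="w1">Hello</span> <span id="w2">world</span>
  <tts id="t" src="hello.ssml"></tts>

  <script>
    var tts = document.getElementById('t');
    var current = null;
    // The draft says timeupdate fires for each SSML mark element;
    // we assume lastMark then names the mark that was just passed.
    tts.addEventListener('timeupdate', function () {
      var mark = tts.lastMark;   // e.g. "w1", "w2", or null before the first mark
      if (!mark) return;
      if (current) current.style.backgroundColor = '';
      current = document.getElementById(mark);
      if (current) current.style.backgroundColor = 'yellow';
    }, false);
    tts.play();
  </script>

With plain text in @text and no marks, we cannot see what the equivalent would look like, which is why an example in the spec would help.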
2. In http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#tts-section: "User Agents must pass the xml:lang values to synthesizers and synthesizers must use the passed in xml:lang to determine the language of text/plain."

(S): It's only xml:lang when you are dealing with XHTML documents - otherwise it's @lang for HTML documents. Also, the @lang attribute on an HTML element does not by default specify the language of the document referred to in @src, but only the language of the element's attributes and content.

3. In http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#tts-section: "The existing timeupdate event is dispatched to report progress through the synthesized speech. If the synthesis is of type application/ssml+xml, timeupdate events should be fired for each mark element that is encountered."

(S): Are there no timeupdate events for text provided through the @text attribute? So, in that case, would the transport bar only jump from 0% to 100% once the text has been played back?

4. http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#interim-events:

(S): The speech protocol talks about several interim events that should be raised by the browser when receiving from the network. However, the events cannot be interpreted by the application unless it knows what to expect. Since changing the synthesising service would likely result in different interim events being raised (and, according to http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#fpr8, the browser may not always use the requested speech service), web developers would effectively be unable to use these interim events reliably. It would be more useful to identify the events that a typical synthesising service may raise and standardise them for the <tts> element.

5. http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#dfn-text: "The text (attribute) is an optional attribute to specify plain text to be synthesized. The text attribute, on setting must construct a data: URI that encodes the text to be played and assign that to the src property of the TTS element. If there are no encoding issues the URI for any given string would be data:text/plain,string, but if there are problematic characters the UserAgent should use base64 encoding."

(S): Since the synthesising service expects an SSML file format from the @src attribute, would we not have to encode the text to SSML instead of plain text? Why encode text to text at all? Why not encode this directly to audio and hand it to an audio element in a data URI? (A short sketch of our reading of this construction is appended at the end of this mail.)

6. http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#dfn-lastmark: "The new lastMark attribute must, on getting, return the name of the last SSML mark element that was encountered during playback. If no mark has been encountered yet, the attribute must return null. Note: There has been some interest in allowing lastMark to be set to a value as a way to control where playback would start/resume. This feature is simple in principle but may have subtle implementation consequences. It is not a feature that has been seriously studied in other similar languages such as SSML. Thus, the group decided not to consider this feature at this time."

(S): So, lastMark is some kind of virtual cursor on the SSML document or the SSML-encoded text from the @text attribute, correct? Given the note, should it be specified as readonly?

(A): Also, how does that fit with the events that are raised on the marks - surely anything that is interested in the last mark played could just listen to the events? In particular, there are no examples that make use of the lastMark attribute.

7. (S): The play() function is inherited from HTMLMediaElement. In comparison, the speak() function of the Chrome TTS extension API is much more powerful and more specific to TTS functionality, see http://code.google.com/chrome/extensions/tts.html . (A sketch comparing the two is appended at the end of this mail.)

8. (S): Why only allow a WebSocket-based protocol of interaction between the Synthesizer and the Browser? Wouldn't it be simpler to just rely on a Web server that converts a text document into an audio file and delivers it in byte ranges? Then the TTS element could be an object that manages the text and hands it off to a Web server, and what comes back could be handed to an audio element for playback. (A sketch of this simpler model is appended at the end of this mail.)

9. In http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#definitions, "Synthesizer" section: "Each synthesis request results in a separate output stream that is terminated once rendering is complete, or if it has been canceled by the client."

(S): How does the client cancel rendering? Using the HTMLMediaElement pause() function doesn't really interrupt it.

10. (S): The protocol for returning synthesised speech from the synthesizing server includes the return of interim events as described in http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#interim-events. However, the interim events are sent after the synthesised audio data. This may cause a problem: when the tts element is set to autoplay, for example, the browser may already have played past the time to which the events apply. Therefore, there needs to be a requirement to send these events either before the media data itself, or as headers on the packets with which the audio data is returned.

11. "FPR8. User agent (browser) can refuse to use requested speech service."

(S): How is JS made aware of this? What is the error code?

12. "FPR31. User agents and speech services may agree to use alternate protocols for communication."

(S): The spec seems to restrict this to WebSockets - why?

13. "FPR9. If browser refuses to use the web application requested speech service, it must inform the web app."

(S): What error is raised in JS?
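Appendix to issue 5 - a minimal sketch of how we read the quoted @text-to-data:-URI construction. The <tts> element id and the strings are illustrative assumptions, and our question remains whether the MIME type would have to be application/ssml+xml rather than text/plain:

  var tts = document.getElementById('t');   // hypothetical <tts id="t"> element

  // Per the quoted wording, setting @text would be roughly equivalent to:
  tts.text = 'Hello world';
  tts.src  = 'data:text/plain,Hello world';

  // and with "problematic characters" the UA is told to use base64:
  tts.src  = 'data:text/plain;base64,' + btoa('Hello, "world" & more!');

  // Our question: if the synthesiser expects SSML from @src, would this
  // not have to be wrapped in SSML instead, e.g.
  tts.src  = 'data:application/ssml+xml,' +
             encodeURIComponent('<speak version="1.0" xml:lang="en">Hello world</speak>');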
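Appendix to issue 7 - for comparison, a minimal sketch of a call to the Chrome extension API referenced above (the option values are purely illustrative). speak() takes the text together with voice, rate, pitch, volume and per-event callbacks in a single call, whereas the <tts> element only inherits play() from HTMLMediaElement:

  chrome.tts.speak('Hello, world.', {
    lang: 'en-US',
    rate: 1.0,
    pitch: 1.0,
    volume: 1.0,
    onEvent: function (event) {
      // event.type can be 'start', 'word', 'sentence', 'marker', 'end', ...
      if (event.type === 'word') {
        console.log('word boundary at character ' + event.charIndex);
      }
    }
  });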
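Appendix to issue 8 - a sketch of the simpler model we have in mind: an ordinary HTTP synthesis service (the URL and query parameters are made up) returns an audio resource, which the browser plays through a standard audio element with byte-range requests handled like any other media resource:

  var audio = new Audio();
  audio.src = 'http://synth.example.com/tts?lang=en&voice=female&text=' +
              encodeURIComponent('Hello world');
  audio.autoplay = true;
  document.body.appendChild(audio);
  // Pausing, seeking and cancelling then work exactly as for any other
  // HTMLMediaElement, which would also answer our question in issue 9.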
Received on Sunday, 4 December 2011 00:18:58 UTC