Feedback on the TTS specification (specific issues)

As explained in the previous email, here is the list of specific
issues that we would like to feed back into the group for the TTS part
of the specification.

Regards,
Silvia.


Specific Issues:

1. (A and S): Section 4.9 talks about visually highlighting the word
or phrase that the application is synthesising. Example code for this
use case would be really useful; we can’t see how it would be done
with plain text input with no SSML <mark> elements.
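
To illustrate: with SSML input we can imagine something like the
following sketch, which leans on the per-mark timeupdate behaviour
quoted in issue 3 below and the lastMark attribute quoted in issue 6
(the <tts> element and lastMark are of course only proposed, not
implemented anywhere):

  // Page contains <span id="w0">, <span id="w1">, ... and the SSML in
  // @src contains <mark name="w0"/>, <mark name="w1"/>, ... before
  // each word.
  var tts = document.getElementById('reader');
  var current = null;
  tts.addEventListener('timeupdate', function () {
    var mark = tts.lastMark;  // name of the last SSML mark encountered
    if (!mark || mark === current) return;
    if (current) document.getElementById(current).className = '';
    document.getElementById(mark).className = 'highlighted';
    current = mark;
  });

With plain text in @text there are no mark elements, so lastMark would
stay null and no comparable hook exists.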

2. In http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#tts-section:
"User Agents must pass the xml:lang values to synthesizers and
synthesizers must use the passed in xml:lang to determine the language
of text/plain."

(S): It's only xml:lang when you are dealing with XHTML documents -
otherwise it's @lang for HTML documents. Also, an @lang attribute on
an HTML element does not specify the language of the document referred
to in @src, but only the language of the element's attributes and
content.
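
To illustrate the distinction with hypothetical markup (hypothetical
because <tts> is only proposed):

  <tts lang="en" src="bonjour.ssml">Text-to-speech not supported.</tts>

Here @lang says that the fallback content and the attribute values are
in English; the language of the speech to be synthesised would have to
come from the xml:lang on the <speak> root inside bonjour.ssml, not
from this attribute.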

3. In http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#tts-section:
“The existing timeupdate event is dispatched to report progress
through the synthesized speech. If the synthesis is of type
application/ssml+xml, timeupdate events should be fired for each mark
element that is encountered.”

(S): Are there no timeupdate events for text provided through the
@text attribute? If so, would the transport bar only jump from 0% to
100% once the text has been played back?
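
For illustration, a transport bar wired up in the usual
HTMLMediaElement way (a sketch against the proposed element; assume a
<progress id="progress" max="100"> element in the page):

  var tts = document.getElementById('reader');
  var bar = document.getElementById('progress');
  tts.addEventListener('timeupdate', function () {
    // For application/ssml+xml input this fires at each <mark/>; for
    // plain @text input the spec gives no firing schedule, so the bar
    // may sit at 0% until playback has finished.
    bar.value = (tts.currentTime / tts.duration) * 100;
  });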

4. http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#interim-events:

(S): The speech protocol talks about several interim events that
should be raised by the browser when receiving from the network.
However, the events cannot be interpreted by the application unless it
knows what to expect. Since changing the synthesising service would
likely result in different interim events being delivered (and,
according to http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#fpr8,
the browser may not always use the requested speech service), web
developers would effectively be unable to use these interim events
reliably. It would be more useful to identify the events that a
typical synthesising service may raise and standardise them for the
<tts> element.
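
A sketch of the problem (all event names below are invented
placeholders, since the report standardises none):

  function handleViseme(e) { /* drive an avatar's mouth, say */ }
  var tts = document.getElementById('reader');
  // only works if the browser actually connects to service A:
  tts.addEventListener('x-serviceA-viseme', handleViseme);
  // service B - or the browser's default service, per FPR8 - might
  // emit 'x-serviceB-mouthshape' instead, and this handler would
  // never fire; the page has no way of knowing which names to expect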

5. http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#dfn-text:
"The text (attribute) is an optional attribute to specify plain text
to be synthesized. The text attribute, on setting must construct a
data: URI that encodes the text to be played and assign that to the
src property of the TTS element. If there are no encoding issues the
URI for any given string would be data:text/plain,string, but if there
are problematic characters the UserAgent should use base64 encoding."

(S): Since the synthesising service expects an SSML file format from
the @src attribute, would we not have to encode the text to SSML
instead of plain text? Why encode text to text at all? Why not encode
this directly to audio and hand it to an audio element in a data URI?
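
Our reading of the quoted setter behaviour, as a sketch:

  function textToSrc(text) {
    // 'data:text/plain,string' when the characters are unproblematic:
    if (/^[\x20-\x7E]*$/.test(text) && !/[%#?]/.test(text))
      return 'data:text/plain,' + text;
    // otherwise base64, as the spec suggests (UTF-8-safe btoa):
    return 'data:text/plain;base64,' +
           btoa(unescape(encodeURIComponent(text)));
  }
  // tts.text = 'Hello world' would thus end up as
  // tts.src = 'data:text/plain,Hello world' - text encoded as text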

6. http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#dfn-lastmark:
"The new lastMark attribute must, on getting, return the name of the
last SSML mark element that was encountered during playback. If no
mark has been encountered yet, the attribute must return null.

Note: There has been some interest in allowing lastMark to be set to a
value as a way to control where playback would start/resume. This
feature is simple in principle but may have subtle implementation
consequences. It is not a feature that has been seriously studied in
other similar languages such as SSML. Thus, the group decided not to
consider this feature at this time."

(S): So, lastMark is some kind of virtual cursor on the SSML document
or on the SSML-encoded text from the @text attribute, correct? Given
the note, should it be specified as readonly?

(A): Also, how does that fit with the events that are raised on the
marks - surely anything that is interested in the last mark played
could just listen to the events? In particular, there are no examples
that make use of the lastMark attribute.
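
For comparison, this is the kind of event-based pattern we mean (the
'mark' event type and its name property are invented here - the report
defines neither):

  var tts = document.getElementById('reader');
  var lastMarkSeen = null;
  tts.addEventListener('mark', function (e) {
    lastMarkSeen = e.name;  // if the event carried the mark name,
  });                       // lastMark itself would be redundant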

7. (S): The play() function is inherited from HTMLMediaElement. In
comparison, the speak() function of the TTS extension API in Chrome is
much more powerful and more specific to TTS functionality; see
http://code.google.com/chrome/extensions/tts.html .
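
For comparison, speak() there takes per-utterance options and an event
callback, none of which play() offers:

  chrome.tts.speak('Hello, world.', {
    lang: 'en-US',
    rate: 1.0,
    pitch: 1.0,
    onEvent: function (event) {
      if (event.type === 'word')
        console.log('word boundary at character ' + event.charIndex);
    }
  });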

8. (S): Why only allow a WebSocket-based protocol for interaction
between the Synthesizer and the Browser? Wouldn't it be simpler to
just rely on a Web server that converts a text document into an audio
file and delivers it in byte ranges? Then the TTS element could be an
object that manages the text and hands it off to a Web server, while
what comes back is handed to an audio element for playback.
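
A sketch of what we mean (the service URL and its parameters are
invented):

  var audio = new Audio();
  audio.src = 'http://tts.example.com/synthesize?lang=en&text=' +
              encodeURIComponent('Hello, world.');
  audio.play();
  // byte ranges, buffering and seeking then come for free from the
  // existing media resource fetch algorithm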

9. In http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#definitions,
“Synthesizer” section:
"Each synthesis request results in a separate output stream that is
terminated once rendering is complete, or if it has been canceled by
the client."

(S): How does the client cancel rendering? Using the HTMLMediaElement
pause() function doesn't really interrupt it.
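
With plain HTMLMediaElement semantics, the closest thing we can see to
a cancel is resetting the element so that the load algorithm aborts
the fetch in progress:

  var tts = document.getElementById('reader');
  tts.pause();
  tts.removeAttribute('src');
  tts.load();  // aborts fetching of the current resource

Whether that is meant to translate into a protocol-level cancel to the
synthesizer appears underspecified.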

10. (S): The protocol for returning synthesised speech from the
synthesizing server includes the return of interim events as described
in http://www.w3.org/2005/Incubator/htmlspeech/finalreport/XGR-htmlspeech.html#interim-events.
However, the interim events are sent after the synthesised audio data.
This may cause a problem: when the tts element is set to autoplay, for
example, the browser may already have played past the point in time
that the events refer to. Therefore, there needs to be a requirement
to send these events either before the media data itself, or as
headers on the packets in which the audio data is returned.

11. "FPR8. User agent (browser) can refuse to use requested speech service."

(S): How is JS made aware of this? What is the error code?

12. "FPR31. User agents and speech services may agree to use alternate
protocols for communication."

(S): The spec seems to restrict this to WebSockets - why?

13. "FPR9. If browser refuses to use the web application requested
speech service, it must inform the web app."

(S): What error is raised in JS?
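
For issues 11 and 13, something along the following lines would be
needed (entirely hypothetical - neither the event wiring nor the error
code below appears in the report):

  var tts = document.getElementById('reader');
  tts.addEventListener('error', function () {
    // MediaError has no code for a refused speech service today;
    // SERVICE_NOT_ALLOWED is invented for this sketch.
    if (tts.error && tts.error.code === tts.error.SERVICE_NOT_ALLOWED) {
      // fall back to the default service or to pre-recorded audio
    }
  });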
