Feedback on the TTS part of the spec

Hi Speech XG,

Apologies for joining this group so late and for only now getting some
feedback to you. We hope, though, that our input will help with the
further development of the specifications and with communicating
design decisions to other potential implementers in whichever group
continues to develop the specification.

I work on HTML media element specifications within the Google Chrome
team. Some of my colleagues and I have looked at the TTS part of your
specification document with a view towards implementing it in Chrome.
We found we had three types of concerns: major concerns with the
direction of the spec, specific concerns with the details of the spec,
and minor nitpicks with examples, language errors etc.

In this feedback, (S) represents feedback from me, (A) represents
feedback from Alice, and (OJ) represents feedback from Ojan. I'll
follow the thread so if there are specific questions for Alice and
Ojan, I can bring them into the discussion.

Rather than dump all of that into a single email, we thought we'd send
our feedback through to the list as three separate emails, so that we
can break it up a bit and possibly have some discussion among this
group (and potentially correct any misconceptions we may have).

So, below is the first of the three: the list of major issues.

Regards,
Silvia.


=====

Major Issues:

1. (A, S, OJ): It’s not clear from this document why a <tts> element
is required by the given use cases and requirements. Notably, there
are no code samples which demonstrate having a <tts> element attached
to the DOM: all of the examples which use TTS only manipulate the
element object in JavaScript.

If a JavaScript object were used instead, its API could be based on
the TTS API of Chrome's extension system:
http://code.google.com/chrome/extensions/tts.html. That API is
specific to TTS, and thus more powerful for TTS interaction than the
API provided by HTMLMediaElement.
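
As a rough illustration, a page-level JavaScript TTS object modelled
on chrome.tts might look like the sketch below. The name window.tts
and its exact shape are hypothetical; the method names and options
mirror chrome.tts (speak(), stop(), the rate/pitch options and the
onEvent callback):

<script>
// Hypothetical page-level TTS object modelled on chrome.tts;
// window.tts is not an existing API.
window.tts.speak('Take the window seat.', {
  rate: 1.0,   // speaking rate, as in the chrome.tts options
  pitch: 1.0,  // speaking pitch
  onEvent: function(event) {
    // Fine-grained progress events are something HTMLMediaElement
    // cannot provide for synthesised text.
    if (event.type === 'end')
      console.log('Finished speaking.');
  }
});

// Interrupt synthesis mid-utterance:
window.tts.stop();
</script>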


2. (S, OJ): If a <tts> element is required, having <tts> inherit from
HTMLMediaElement doesn’t seem to be the best option: SSML or text
content is not media, it’s markup. Furthermore, HTMLMediaElement has
features that are not applicable to TTS (see the sketch after this
list). For example:

-  mediagroup requires a timeline to synchronise different media
resources with each other. What does that mean for a TTS audio
resource whose timeline varies depending on the synthesiser of
choice?

-  <track> elements (i.e. captions, chapters, descriptions) make no
sense for content that originates as text.

-  <source> elements and the source selection algorithm make no sense
when we only allow one type of resource to be referenced by @src,
namely SSML. No alternative sources need to be selected.
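
To make the mismatch concrete: a <tts> element that inherited the
full HTMLMediaElement surface would permit markup like the following,
none of which has sensible semantics for synthesised text
(hypothetical markup, for illustration only):

<tts mediagroup="lecture" controls>
  <!-- What would alternative sources or source selection mean when
       only SSML can be referenced? -->
  <source src="speak.ssml" type="application/ssml+xml">
  <!-- What would captions mean for content that starts as text? -->
  <track kind="captions" src="captions.vtt" srclang="en">
</tts>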


3. (S) Alternatively, has the use of the HTML5 MediaStream API
(http://www.whatwg.org/specs/web-apps/current-work/multipage/video-conferencing-and-peer-to-peer-communication.html#stream-api
) been considered? Since TTS is about creating audio data from an
input text string, it is more similar to capturing audio data from a
microphone than it is to rendering given audio data.

-   The getUserMedia() function could be extended to get access to
either a local or a remote TTS service. The callback attached to this
service receives a stream with a stop() function to stop the TTS
creation (functionality that HTMLMediaElement doesn’t have).
-   The created MediaStream (which is the returned audio stream) can
then be handed to an audio element for playback - or even to a
MediaStreamRecorder for recording.

-   Something like the following would be possible with this extra
functionality added through a new TTS interface:

<script>
var text = 'render this text';
var audio = new Audio();

// Opens the default browser TTS service.
// The first argument is 'tts' for the local synthesiser, or a
// 'tts URL' for a remote one; the last argument contains a URL to
// either an SSML file or a data URL carrying the text to synthesise.
navigator.getUserMedia('tts', gotAudio, noStream,
                       'data:text/plain,' + encodeURIComponent(text));

function gotAudio(stream) {
  // Hands the synthesised audio data to an audio renderer.
  audio.src = URL.createObjectURL(stream);
  audio.play();
}

function noStream() {
  // Some kind of error handling.
  alert('Synthesis service unavailable.');
}
</script>


4. (OJ) The proposed <tts> element syntax is unnecessarily different
from the rest of the platform. SSML, if we want to support it, should
be handled analogously to how browser vendors support MathML and SVG
content. Notably:

-   There should be no @src attribute. Instead, the markup should just
go inside the <tts> element, as sketched after this list. That way,
all the existing DOM APIs (e.g. innerHTML) for modifying the DOM just
work, without developers needing to learn new APIs. More importantly,
as we add new APIs for doing this (e.g. the html quasi-literal), they
will also just work. And if you want async loading, you do it the same
way you do any other async HTML (e.g. using XMLHttpRequest).

-   If we decide not to support SSML and just support plain text, then
I don’t see the need for a TTS element at all. (See 1.)

-   We shouldn’t require the XML bits (e.g. xmlns).
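
To illustrate, inline SSML content inside a <tts> element, analogous
to inline SVG and without the XML bits, might look like this
(hypothetical markup, not part of any current spec):

<tts>
  <speak>
    Would you like the aisle or the <emphasis>window</emphasis> seat?
  </speak>
</tts>
<script>
  // The existing DOM APIs then just work:
  var tts = document.querySelector('tts');
  tts.innerHTML =
      '<speak>Take the <emphasis>aisle</emphasis> seat.</speak>';
</script>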


5. (OJ) If SSML is required, its use needs to be carefully specified.
SSML itself is woefully underspecified: browser vendors simply cannot
take the current spec and get interoperable implementations. They
would necessarily need to reverse-engineer what the other browsers do.

We should start by finding the minimal subset of this API that has
good use cases and/or widespread support in existing speech engines.
As an example, the age property gives no useful indication of how to
implement it: an implementation that parses the value but otherwise
ignores it would be conforming. A more useful set of values would be
“young” and “old”, which could then be clearly specified to select a
young voice and an old voice respectively (see the sketch below).
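
For illustration, compare the voice element’s age attribute as
currently specified in SSML (an arbitrary non-negative integer) with
a hypothetical restricted form (the restricted values are our
suggestion, not part of SSML):

<!-- SSML as specified: no guidance on how a synthesiser should
     render age 35 differently from, say, age 40. -->
<voice gender="female" age="35">Boarding has started.</voice>

<!-- Hypothetical restricted form with clearly implementable values: -->
<voice gender="female" age="young">Boarding has started.</voice>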

(S): One benefit of SSML: the way SSML markers can be used to jump
directly to named sections of the synthesised audio relates really
nicely to Media Fragment URIs, see
http://www.w3.org/TR/media-frags/#naming-name. With media fragment
URIs you can jump directly to a time offset in a media resource based
on a name that has been given to that section. These named markers
are like chapter markers. I could, for example, see a use case in a
media fragment URI that points at a synthesising server, hands it
SSML data and then applies the offset, e.g.
http://example.org/synthesize?ref=http://example.org/speak.ssml#id=window_seat
.
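
For reference, the named marker in speak.ssml that the
#id=window_seat fragment would address might look like this
(illustrative SSML, using the standard <mark> element):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
  The aisle seat gives you more legroom.
  <mark name="window_seat"/>
  The window seat gives you the view.
</speak>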


6. (A and S): The specification should be split up into separate
sections or even separate documents for the <tts> element, the <reco>
element and the protocol. It’s difficult to separate out the
different concerns while reading the current document.
