- From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
- Date: Sun, 4 Dec 2011 11:15:10 +1100
- To: public-xg-htmlspeech@w3.org
Hi Speech XG,

Apologies for joining this group so late and for only now getting some feedback to you. We hope, though, that our input will help with the further development of the specifications and with communicating design decisions to other potential implementers in whichever group the specification continues to be developed.

I work on HTML media element specifications and within the Google Chrome team. Some of my colleagues and I have looked at the TTS part of your specification document with a view towards implementing it in Chrome. We found we had three types of concerns: major concerns with the direction of the spec, specific concerns with the details of the spec, and minor nitpicks with examples, minor language errors, etc.

In this feedback, (S) represents feedback from me, (A) represents feedback from Alice, and (OJ) represents feedback from Ojan. I'll follow the thread, so if there are specific questions for Alice or Ojan, I can bring them into the discussion.

Rather than dump all of that into a single document, we thought we'd send it through to the list as three separate emails, so that we can break it up a bit and possibly have some discussion among this group (and potentially correct any misconceptions we may have had). So, below is the list of major issues first.

Regards,
Silvia.

=====

Major Issues:

1. (A, S, OJ): It's not clear from this document why a <tts> element is required by the given use cases and requirements. Notably, there are no code samples which demonstrate having a <tts> element attached to the DOM: all of the examples which use TTS only manipulate the element object in JavaScript. If a JavaScript object were used instead, its API could be based on the TTS object in Chrome's TTS extension (http://code.google.com/chrome/extensions/tts.html); a rough sketch follows below. This API is more specific to TTS, and thus more powerful for TTS interaction, than the API provided by HTMLMediaElement.
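To make the comparison concrete, here is a rough sketch of what a page-level JavaScript TTS object modelled on that extension API might look like. The navigator.tts name and the exact option and event names are purely illustrative; nothing here is in any spec:

<script>
  // Hypothetical page-level TTS object, shaped after chrome.tts:
  // speak(), stop() and getVoices() are borrowed from the extension API,
  // but navigator.tts itself is made up for this sketch.
  navigator.tts.getVoices(function(voices) {
    navigator.tts.speak('You are in the window seat.', {
      voiceName: voices.length ? voices[0].voiceName : undefined,
      rate: 1.0,
      pitch: 1.0,
      onEvent: function(event) {
        if (event.type === 'end') {
          // Synthesis finished; no media element was needed.
        }
      }
    });
  });
  // navigator.tts.stop();  // interrupt synthesis at any point
</script>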
2. (S, OJ): If a <tts> element is required, having <tts> inherit from HTMLMediaElement doesn't seem to be the best option: SSML or text content is not media, it's markup. Furthermore, HTMLMediaElement has features that are not applicable to TTS. For example:

- @mediagroup requires a timeline to synchronise different media resources with each other. What does that mean for a TTS audio resource that can have varying timelines depending on the synthesiser of choice?

- <track> elements (i.e. captions, chapters, descriptions) make no sense for content that originates as text.

- <source> elements and the source selection algorithm make no sense when we only allow one type of resource to be referenced by @src, namely SSML. No alternative sources need to be selected.

3. (S): Alternatively, has the use of the HTML5 MediaStream API (http://www.whatwg.org/specs/web-apps/current-work/multipage/video-conferencing-and-peer-to-peer-communication.html#stream-api) been considered? Since TTS is about creating audio data from an input text string, it is more similar to creating audio data from a microphone than it is to rendering given audio data.

- The getUserMedia() function could be extended to get access to either a local or remote TTS service, which then attaches a callback to this service, which has a stop() function to stop the TTS creation (a functionality that the MediaElement doesn't have).

- The created MediaStream (which is the returned audio stream) can then be handed to an audio element for playback - or even to a MediaStreamRecorder for recording.

- Something like this would be possible with the extra functionality added to a new TTS interface:

<script>
  var text = "render this text";

  // Opens the default browser TTS service.
  // The first argument is 'tts' for a local synthesiser, or 'tts URL' for a remote one;
  // the last argument contains a URL to either an SSML file or a data URL.
  navigator.getUserMedia('tts', gotAudio, noStream, 'data:text/plain,' + text);

  function gotAudio(stream) {
    // Sends the synthesised audio data to an audio renderer.
    audio.src = URL.createObjectURL(stream);
    audio.play();
  }

  function noStream() {
    // Some kind of error handling.
    alert('TTS service unavailable.');
  }
</script>

4. (OJ): The proposed <tts> element syntax is unnecessarily different from the rest of the platform. SSML, if we want to support it, should be handled analogously to how browser vendors support MathML and SVG content. Notably:

- There should be no @src attribute. Instead, the markup should just go inside the <tts> element (a rough markup sketch appears at the end of this email). That way, all the existing DOM APIs (e.g. innerHTML) for modifying the DOM just work without developers needing to learn new APIs. More importantly, as we add new APIs for doing this (e.g. the html quasi-literal), they will also just work. And if you want async loading, you do it the same way you do any other async HTML (e.g. using XMLHttpRequest).

- If we decide not to support SSML and just support plain text, then I don't see the need for a <tts> element at all. (See 1.)

- We shouldn't require the XML bits (e.g. xmlns).

5. (OJ): If SSML is required, it should be specified carefully. SSML itself is woefully underspecified: browser vendors simply cannot take the current spec and get interoperable implementations; they would necessarily need to reverse-engineer what the other browsers do. We should start by finding the minimal subset of this API that has good use cases and/or widespread support in existing speech engines. As an example, the age property gives no useful indication of how to implement it. A correct implementation as specified would be to parse it, but otherwise ignore it. A more useful set of values would be "young" and "old"; then it can be clearly specified as a young voice and an old voice.

(S): One benefit of SSML: the way in which SSML markers are employed to jump directly to named sections of the synthesised audio actually relates really nicely to Media Fragment URIs, see http://www.w3.org/TR/media-frags/#naming-name. With media fragment URIs you can jump directly to a time offset in a media resource based on a name having been given to that section. These named markers are like chapter markers. I could, for example, see a use case in a media fragment URI that points at a synthesising server, hands it SSML data and then applies the offset, e.g. http://example.org/synthesize?ref=http://example.org/speak.ssml#id=window_seat .

6. (A and S): The specification should be split up into different sections or even documents for the <tts>, <reco> and the protocol specifications. It's difficult trying to separate out the different concerns while reading the document.
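To illustrate issue 4 above, here is a rough sketch of what inline SSML content could look like if the <tts> element followed the inline SVG/MathML pattern. The element, the id and the SSML subset used here are purely illustrative:

<tts id="greeting">
  <speak>
    Welcome aboard. You are in the <emphasis>window</emphasis> seat.
  </speak>
</tts>
<script>
  // Existing DOM APIs work unchanged on inline content; no @src, no xmlns,
  // and no new loading API is needed.
  document.getElementById('greeting').innerHTML =
      '<speak>Your flight is now <emphasis>boarding</emphasis>.</speak>';
</script>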
Received on Sunday, 4 December 2011 00:15:59 UTC