[SSML-IR] - France Telecom SSML Implementation Report

As a global telecommunications carrier, the France Telecom group
believes that the SSML 1.0 Candidate Recommendation establishes a
comprehensive solution that facilitates contributions to projects
involving technologies that will be part of many people's daily lives
in the near future.
Committed to customer care, responsibility and innovation, France
Telecom is therefore happy to contribute to this Recommendation by
submitting the following SSML 1.0 Implementation Report and to support
the activities of the W3C Voice Browser Working Group.

The intrinsic behaviour of the France Telecom Research & Development
SSML synthesizer is to process SSML input completely, even when an
error occurs; the embedded XML parser does not validate XML input
against either the synthesis schema or the SSML DTD. This is why a
special testers' version (consisting of a front-end validator performing
DTD validation and a standard SSML synthesizer) has been used. This
special version also inserts extra samples in the speech signal when a
marker is encountered, in order to trigger an audio event, audible by
testers, at the exact position where the marker occurs. 

The FTR&D implementation was run on the official test set
ssml-ir-20040119.zip with the following modifications applied to
manifest.xml: 

*	The addition of a dep element for TA#51: <dep uri="51/turca.wav"
type="audio/x-wav"/>
*	The renaming of the 15 URIs of the form
"test-prosody-XXX-%25-NN.txml", which do not correspond to any .txml
filename in the ZIP archive, to "test-prosody-XXX-%-NN.txml" (without
the substring "25") 

and with the following modifications applied to .txml files: 

*	79/79.txml
The removal of the xml:lang attribute from the speak element 
*	91/91.txml and 92/92.txml
The replacement of sentence by s 
*	27/27.txml
The replacement of paragraph by p 
*	301/test-prosody-rate-comp-115.txml
The addition of xml:lang="en-US" to the speak element 
*	298/ta_298.txml and 299/ta_299.txml
The modification of the generated .ssml in order to obtain UTF-8
characters instead of UTF-16 characters for the phoneme string
"&#952;&#618;&#331;". 
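
As an illustration only, the two encoding-related changes above can be
sketched in Python. This is not part of the test harness: the variable
names are hypothetical, the URI is the placeholder pattern quoted above,
and the little-endian byte order for UTF-16 is an assumption made so
that the output is deterministic.

```python
from html import unescape
from urllib.parse import unquote

# URI pattern quoted in the report; XXX and NN are the report's placeholders.
manifest_uri = "test-prosody-XXX-%25-NN.txml"

# Percent-decoding turns "%25" into a literal "%", which gives the
# file name form that the report says is used in the ZIP archive.
archive_name = unquote(manifest_uri)

# Phoneme string from 298/ta_298.txml and 299/ta_299.txml,
# written as numeric character references.
phonemes = unescape("&#952;&#618;&#331;")

# The same three characters serialised in the two encodings.
utf8_bytes = phonemes.encode("utf-8")
utf16_bytes = phonemes.encode("utf-16-le")  # byte order assumed for illustration

print(archive_name)
print(utf8_bytes.hex(), utf16_bytes.hex())
```

The characters are identical in both cases; only the byte serialisation
handed to the synthesizer differs.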


Some instructions do not conform to the SSML recommendation. We suggest
the following modifications to the assertions: 

*	#4 unknown format attribute
The assertion specifies: "When the value for the format attribute is
unknown or unsupported by a processor, it must render the contained text
as if no format value were specified". But the recommendation adds: ",
and should render it using the interpret-as value that is specified."
Thus test and reference should not sound identical.
Suggestion: use a Multiple_Pair_Comp test. The SSML test remains the
same. The SSML reference equals the test without the format attribute. 
*	#297 vendor-defined alphabet attribute
As is, this test is identical to test #20.
Instruction suggestion: for this test to pass, the alphabet attribute
should be a valid vendor-defined alphabet of the form "x-organization"
or "x-organization-alphabet" 
*	#139 <voice xml:lang="language-not-available">The cat jumped
over the moon.</voice>
The instruction is not in line with the SSML recommendation, which says
about errors: "Results are undefined. A conforming synthesis processor
may detect and report an error and may recover from it." 
*	#225 The cat jumped over the moon.<prosody
contour="(0%,+20Hz)">The cat jumped over the moon.</prosody>
The first sentence is pronounced in a normal way, the second with a
constant pitch from beginning (0%) to end (the end value is copied from
the nearest pitch target, which is the one at 0%). The pitch target
(+20Hz) is a relative target ("relative to the pitch value just before
the contained text"). In our case, the last pitch value is the pitch at
the end of the last vowel of the word "moon", that is, 64Hz. The second
sentence is therefore pronounced with a constant pitch of 84Hz, which a
human listener hears as lower than the first sentence, where the pitch
varies between 64Hz and 132Hz.
Instruction suggestion: "The second repetition of the sentence should
have a constant pitch 20Hz above the last voiced pitch frame of the
first sentence" 
*	269, 283, 284, 285, 287 and 288 Units are case sensitive
Same remark as for TA#139. If it is an error, results are undefined. 
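
As an illustration only, the arithmetic behind the remark on #225 can be
sketched in Python. The function name resolve_pitch_target is
hypothetical, and the sketch is deliberately simplified: it handles only
signed Hz or % targets (relative) and unsigned Hz targets (absolute),
and it takes the reference pitch as a given number rather than measuring
it from a speech signal.

```python
import re

# Optional sign, a number, then a unit of Hz or %.
_TARGET = re.compile(r"([+-]?)(\d+(?:\.\d*)?)(Hz|%)")

def resolve_pitch_target(target, reference_hz):
    """Return the pitch in Hz for a contour target.

    reference_hz is the pitch value just before the contained text.
    """
    match = _TARGET.fullmatch(target.strip())
    if match is None:
        raise ValueError("unsupported pitch target: " + repr(target))
    sign, number, unit = match.groups()
    value = float(number)
    if not sign:
        if unit == "%":
            raise ValueError("unsigned percentages are not handled here")
        return value  # absolute target in Hz
    # Relative target: an offset in Hz or a percentage of the reference.
    delta = value if unit == "Hz" else reference_hz * value / 100.0
    return reference_hz + delta if sign == "+" else reference_hz - delta

# Case described for #225: reference pitch 64Hz, target +20Hz.
print(resolve_pitch_target("+20Hz", 64.0))  # 84.0
```

With a reference of 64Hz the +20Hz target resolves to 84Hz, which lies
in the lower part of the 64Hz to 132Hz range of the first sentence.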

Furthermore we propose to improve two assertions. 

*	#223 time positions less than 0% are ignored
The following contour: "(-10%,-20Hz) (0%,+20.Hz) (10%,+30%) (40%,+10Hz)"
is equivalent to: "(0%,+20.Hz) (10%,+30%) (40%,+10Hz)" even for a
non-conformant SSML processor which interpolates between -10% and 0%.
At the least, the 0% target should be removed from this contour in
order to observe whether SSML processors try to interpolate between
-10% and +10%, which they shouldn't. Instead, they should copy the
nearest pitch target, that is, the one at 10%. To make the test more
obvious for testers, the contour could be like this:
"(-10%,50Hz)(100%,200Hz)"
If an SSML processor interpolates between 50Hz and 200Hz (which it
shouldn't) instead of using a constant pitch of 200Hz, it will clearly
be audible. 
*	#224 time positions greater than 100% are ignored
The tested contour is: "(0%,+20.Hz) (10%,+30%) (40%,10Hz)
(120%,-50.0Hz)"
If a non-conformant SSML processor interpolates between 40% and 120%,
it is quite possible that the difference is not audible to testers.
Indeed, if the pitch value just before the contained text is 60Hz, then
the -50Hz target equals 10Hz (60-50=10) and the test sounds identical
to the reference.
The suggestion is to use the following contour:
"(0%,200Hz)(110%,50Hz)"
If an SSML processor interpolates between 200Hz and 50Hz (which it
shouldn't) instead of using a constant pitch of 200Hz, it will clearly
be audible. 
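
As an illustration only, the behaviour expected for the two suggested
contours can be sketched in Python. The function names are hypothetical,
only absolute Hz targets are parsed, and linear interpolation between
the remaining targets is an assumption of this sketch, not a statement
about any particular synthesizer.

```python
import re

# Matches "(position%,valueHz)" pairs with absolute Hz targets only.
_PAIR = re.compile(r"\(\s*([+-]?\d+(?:\.\d*)?)%\s*,\s*(\d+(?:\.\d*)?)Hz\s*\)")

def parse_contour(contour):
    """Return a list of (position in %, pitch in Hz) pairs."""
    return [(float(pos), float(hz)) for pos, hz in _PAIR.findall(contour)]

def pitch_at(contour, position):
    """Pitch in Hz at a time position given in % of the contained text."""
    # Time positions outside 0%..100% are ignored.
    kept = sorted(t for t in parse_contour(contour) if 0.0 <= t[0] <= 100.0)
    if not kept:
        raise ValueError("no usable pitch target in contour")
    # Before the first or after the last target, copy the nearest target.
    if position <= kept[0][0]:
        return kept[0][1]
    if position >= kept[-1][0]:
        return kept[-1][1]
    # Between two kept targets, interpolate linearly (assumption).
    for (p0, v0), (p1, v1) in zip(kept, kept[1:]):
        if p0 <= position <= p1:
            if p1 == p0:
                return v1
            return v0 + (v1 - v0) * (position - p0) / (p1 - p0)
    return kept[-1][1]

for contour in ("(-10%,50Hz)(100%,200Hz)", "(0%,200Hz)(110%,50Hz)"):
    print(contour, [pitch_at(contour, p) for p in (0, 50, 100)])
```

In this sketch both suggested contours give a constant 200Hz at every
position, whereas a processor that keeps the out-of-range target would
produce a slope between 50Hz and 200Hz.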

Yours faithfully,

Edouard Hinard

Received on Wednesday, 18 February 2004 09:13:49 UTC