Comments from PFWG on CSS3 Speech Module

The following are comments from the Protocols and Formats Working Group on the
CSS Speech Module specification at:
http://dev.w3.org/csswg/css3-speech/

PFWG's resolution to send these comments is recorded at:
http://lists.w3.org/Archives/Member/w3c-wai-pf/2011OctDec/0017.html

Comments are mainly with regard to screen reader usage, as we believe it to be
the primary use case for CSS 3 Speech. If some of these properties or values
are specific to other use cases, the document should mention that scope. For
example, we understand that this could be used to "Save this page as an audio
file," but we do not believe that to be a common usage, and we believe our
potential concerns for screen reader usage outweigh any benefit these
properties may provide for the less-common scenarios.

PFWG thanks IBM and Apple for their work in reviewing this specification on behalf of PFWG.

1.)	voice-volume:

There would seem to be a need for values relative to the current setting,
such as louder and softer.
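
As a purely illustrative sketch of what such relative values might look like,
by analogy with the existing 'larger' and 'smaller' keywords of font-size. The
'louder' and 'softer' keywords and the class names below are hypothetical and
are not defined in the draft.

/* Hypothetical keywords, not part of the CSS Speech draft */
.aside { voice-volume: softer; }  /* one step below the inherited volume */
.alert { voice-volume: louder; }  /* one step above the inherited volume */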

There exists the possibility of damaging content being created. Imagine a web
page that is very soft, and then in the middle the maximum decibels are
shouted. Deliberate suffering could be created, not unlike deliberately
creating a photosensitive epileptic situation. Should this be prevented?
 
This property is of concern because it appears to allow page
authors to hijack the user interface. Screen reader users tend to set their
speech volume at an audible, but comfortable level, and allowing an author to
set the volume to x-loud or a high decibel could be a very disruptive
experience. Furthermore, some screen reader users also have hearing
impairments, so allowing an author to set the volume to x-soft or a low decibel
could result in content being inaccessible to those users. We would like to
suggest the group reconsider or further explain the necessity for this
property, or at least consider removing the x-* values and decibel support. The
spec's note ("listening environment and personal user preferences") at the end
of this section appears to confirm our concern that this property is perhaps
immature, and it would be unwise to implement this without additional
consideration of other vague, unspecified details such as user preference
overrides, and the ability for user agents to be more aware of their usage
environment or context.
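
To make the concern concrete, an author style sheet along the following lines
illustrates the abrupt change in volume described above. The class names are
hypothetical. The last rule sketches one possible user style sheet override; it
assumes the user agent honors user !important declarations for speech
properties, which is one of the details noted above as not yet clearly
specified.

/* Author style sheet: a quiet passage followed by a sudden loud one */
.whisper { voice-volume: x-soft; }
.shout   { voice-volume: x-loud; }

/* User style sheet (assumed to be honored): pin the volume regardless of author rules */
* { voice-volume: medium !important; }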

2.)	voice-volume: silent;

The spec should give an example of expected appropriate usage of this value.
Because this generates a period of silence equal to the length of the
would-be-spoken content, most listeners will just assume speech output has
prematurely stopped. In radio terms, this is "dead air." How do you expect this
value to be useful?
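
For comparison, a minimal sketch of the two ways an author might suppress
speech for an element. The class names are hypothetical. 'speak: none' is the
CSS 2.1 value that suppresses aural rendering so that the element requires no
time to render.

/* Leaves a gap as long as the text would have taken to speak */
.stage-direction { voice-volume: silent; }

/* Skips the element entirely, with no gap */
.page-number { speak: none; }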

Despite the at-risk status of this property, we believe it would be extremely
useful for conveying context, particularly in situations such as two-party
dialogue. 

3.)	speak:/speak-as:

WebKit and VoiceOver in the iOS 5 betas implement partial support for the original values of the 'speak' property in CSS 2.1, as well as some additional values defined by the previous working draft of CSS 3, which seemed a logical progression from CSS 2.1. Since the Working Group had not published an updated draft in over five years, we would not have expected this property to change so drastically. Please reconsider this property split, since (1) it is not apparent why the split was made, and (2) there is an existing implementation that is unlikely to change in the pending release.

Previous values, from the most recent draft, published in December 2004:
http://www.w3.org/TR/2004/WD-css3-speech-20041216/#speak

speak:/speak-as: values.
Whether or not the 'speak' and 'speak-as' properties are recombined, the values for the 'speak-as' property are listed as single token values, but are not mutually exclusive. We would expect to be able to use a token list to specify multiple values that apply. Perhaps:

.telephone {
     speak-as: digits no-punctuation; /* e.g., (415) 555-1212 */
}
.internetProtocol {
     speak-as: digits literal-punctuation; /* e.g., 127.0.0.1 */
}

4.)	pause-before:/pause-after:

These properties are of concern because they represent another way for the page author to hijack a screen reader user's experience. We are also concerned that end users will interpret correct implementation of these properties as a severe performance lag. For example, if a user were forced to wait 2 seconds between each heading, the experience would be tedious for TTS users comfortable with machine speech at rates pushing 400 words per minute.

If you plan to keep this property, we suggest the following:

1. Consider defining a few variants of the @media values defining the particular speech context. A long pause may provide slightly more value for the "save to audio file" or "read all" context than it would to a general screen reader user in the process of navigating a document quickly. We think it's unlikely that many screen reader users would want this feature affecting their TTS speed and responsiveness.

2. Define a maximum range for pause-before <time>, preferably less than 2s for screen readers, and issue validation warnings for times over the maximum.

3. Define millisecond values or WPM-relative time values for tokens, preferably all less than 1s. The document states that this is implementation-dependent. W3C history has shown this will result in drastically different values, and inconsistent implementation will be frustrating for authors and users alike.

4. In a separate document (perhaps HTML5), define default mappings of elements to their expected pause values, e.g., a table with pause-before and pause-after columns and one row per HTML element.

5. Unequivocally declare that implementors should ignore pause-before values when navigating to an element in the screen reader context, so as to not create the perception of performance lag. e.g., If a screen reader user presses the command to "jump to next heading," speak it immediately. Ignore pause-before immediately after a focus change.
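
The first two suggestions above might be sketched as follows. The 'speech' media type exists in CSS 2.1; the media type name 'speech-interactive' is hypothetical and is used only to illustrate the idea of a separate screen reader navigation context. The specific durations are placeholders, not recommended values.

/* Linear listening, e.g., "read all" or saving to an audio file */
@media speech {
     h2 { pause-before: 1s; }
}

/* Hypothetical media type for interactive screen reader navigation */
@media speech-interactive {
     h2 { pause-before: 200ms; } /* well under the suggested 2s ceiling */
}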

5.)	cue-before:/cue-after:

Consider token-based named sound icons, such as "warning", "error", or "progress-complete." Leave this flexible for platform- and implementation-specific values, such as "-osx-tink" or "-ios-tweet" and provide a comma-delimited fallback in the same way a user can specify a generic family fallback in addition to a named font: 

font-family: "MyFont", sans-serif;
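
Applied to cues, that might look like the sketch below. The comma-delimited list and the bare token names are a proposal only; the draft's cue properties take a <uri> (optionally followed by a <decibel>) or 'none'. The class names and the file name are hypothetical.

/* Current draft syntax: a sound file referenced by URI */
.error { cue-before: url(error.wav); }

/* Proposed (hypothetical) syntax: platform-specific tokens with a generic fallback */
.warning { cue-before: -osx-tink, -ios-tweet, warning; }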

cue-before:/cue-after: <decibel> properties.
We have the same concern with this decibel value as mentioned above with voice-volume.

6.)	voice-family: preserve;

Quoting from the editor's draft: 

> regardless of any potential language change within the content markup

This property value appears short-sighted, as most TTS voices are not only intended for a particular language, but are also mostly incapable of producing speech when confronted with characters outside their intended range of Unicode characters. For example, it is highly unlikely that a Chinese TTS voice will be able to pronounce English in an understandable way (for anything other than very common words such as "okay"), and it's even less likely that a French TTS voice would be able to speak any words in Japanese. It seems this property value is only beneficial to force Western language TTS voices to mispronounce other Western languages, which is a feature of very little utility.

PFWG wishes to emphasize its strong preference that lang declarations,
explicitly including inline lang attributes, be processed using
language-appropriate phonemes and pronunciation rules. Several screen readers
do so today. All speech generators should be expressly encouraged to do so.
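
A minimal sketch of the situation at issue, assuming markup such as
<p lang="en">She said <q lang="fr">bonjour</q>.</p> (the content is
illustrative only):

/* Requests a generic female voice for the paragraph */
p { voice-family: female; }

/* Keeps the inherited voice for the quotation even though its language changes to French */
q { voice-family: preserve; }

Without the second rule, a user agent that processes lang declarations would be free to speak the quotation using French pronunciation rules.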

PFWG further suggests that there be a way to specify an accent or locale setting.

7.)	voice-duration: <time>;

This is another property that seems to provide very little value. For example, what would be the expected behavior given the following CSS:

p { voice-duration: 1s; }

and the following markup:

<p>Short paragraph.</p>

<p>Longer paragraph. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer elementum interdum ullamcorper. Nunc et ante dui. Sed odio erat, dictum vitae adipiscing nec, aliquam sed nibh. Fusce pharetra ante dolor. </p>

The first paragraph would be understandable, but should the second paragraph really be pronounced over a duration of 1 second? Probably not. Implementation of this would be tricky, too. Have any other vendors expressed an interest in its implementation?

voice-duration is marked as at-risk, and we support dropping it from the final specification. 

8.)	voice-stress:

This property seems to provide limited utility, while raising the same hijacking concern noted for other properties above and posing implementation difficulty.
voice-stress is marked as at-risk, and we support dropping it from the final specification. 

If retained, consider the more descriptive name voice-emphasis, rather than
voice-stress. Stress sounds only angry. Emphasis has less emotion to it.
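
For illustration, the same rule under the current and suggested names. Only the
first is defined in the draft; 'voice-emphasis' is the hypothetical rename
suggested above.

em { voice-stress: strong; }   /* current draft */
em { voice-emphasis: strong; } /* suggested name, hypothetical */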
 
-------------------------------------------------------------------------------

Janina Sajka,	Phone:	+1.443.300.2200
		sip:janina@asterisk.rednote.net

Chair, Open Accessibility	janina@a11y.org	
Linux Foundation		http://a11y.org

Chair, Protocols & Formats
Web Accessibility Initiative	http://www.w3.org/wai/pf
World Wide Web Consortium (W3C)

Received on Tuesday, 11 October 2011 19:25:42 UTC