Re: [css-speech] Splitting into Level 1 and Level 2

On 27 June 2015 at 13:34, fantasai <fantasai.lists@inkedblade.net> wrote:
> While I think the CSS Speech module defines a really cool processing
> model for speech rendering of a document, we don't have much in the
> way of implementations. Also I suspect that a good speech stylesheet--
> one that enhanced, rather than interfered with, the speech user
> experience--would be hard to create without a better understanding of
> the "default UA stylesheet" and a fair amount of specialized training,
> so would be beyond the capabilities of most authors.
>
> However, I think the 'speak' and 'speak-as' properties would be very
> useful to have in the general authoring toolkit. The 'speak' property
> in particular allows speech rendering to have different hiding/showing
> of content than visual layout, without any weird hacks. So I'm thinking
> maybe we should split CSS Speech into two levels:
>
>   Level 1: 'speak' and 'speak-as'
>   Level 2: Everything currently in the spec.
>
> This might encourage implementation of 'speak' and 'speak-as' in
> browsers.
>
> Thoughts?

Hi,

I am an implementer of a Text-to-Speech program
(https://github.com/rhdunn/cainteoir-engine) that makes limited use of
CSS, currently to apply a basic content rendering model. As such, I am
interested in this proposal.

Here are my thoughts:

# thoughts on implementability

If the intention is for a Web Browser (or narration software) to
control a Text-to-Speech engine (or to allow that as a valid
implementation of the specification), the control will be limited to
what the engine exposes. Typically, this is:

  1.  Voice selection, which can be used to implement the
`voice-family` property.

  2.  Voice parameters, which can be used to implement the
`voice-rate`, `voice-pitch`, `voice-volume` and `voice-range`
properties.

  3.  SSML markup, which can be used to implement the other features.

In addition, control of the audio output can be used to implement the
`voice-balance` property and the aural box model (pause, rest, cue).
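
For example, a stylesheet rule like the following sketch (selector
and values chosen purely for illustration) exercises all three kinds
of engine control:

    blockquote {
      voice-family:  female;  /* 1: engine voice selection */
      voice-rate:    slow;    /* 2: engine rate parameter */
      voice-pitch:   low;     /* 2: engine pitch parameter */
      voice-range:   low;     /* 2: engine range parameter */
      voice-volume:  soft;    /* 2: engine volume parameter */
      voice-balance: left;    /* via control of the audio output */
    }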

The engine's capabilities (e.g. its level of SSML support) constrain
which features can be implemented, especially things like explicit
voice pitches.

For text-to-speech engines adding CSS Speech support directly, I can
see complexities in reconciling the CSS Speech model with the SSML
model.

# speak

I don't see why `speak=none` is broken out from `speak-as`
(especially if `speak` is seen as analogous to `display`).

I am not sure about `speak=none` being overridden in descendants --
what are the use cases for this behaviour (especially compared with
`display=none`, which descendants cannot override)?

I am also not sure about the interaction with `display=none` -- I
wonder whether it would be best to have `display=none` take
precedence aurally (otherwise, an author could set `speak=normal` on
the head, script and style HTML elements).
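
A sketch of the concern, assuming the usual HTML UA stylesheet rules:

    /* UA stylesheet: these elements are hidden visually */
    head, script, style { display: none; }

    /* under the current wording, an author stylesheet could still
       force script contents to be spoken, which seems undesirable */
    script { speak: normal; }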

Are there any use cases where `speak=none` is useful for content
that is displayed on the screen? Is this intended for things like
navigation/menus? If so, how would a blind person know the menu is
there?

# speak-as

`spell-out` and `digits` are effectively the same: the only
difference is that one is applied to words and the other to numbers.
Note that an author can only specify one of these at a time, so when
applied to "hello 123" you cannot get both behaviours at once.
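
A sketch of the problem (class names hypothetical); on my reading,
an author has to pick one behaviour for mixed content:

    /* given the text "hello 123": */
    .a { speak-as: spell-out; }  /* spells out "hello" as H E L L O,
                                    but 123 is not read digit by
                                    digit */
    .b { speak-as: digits; }     /* reads 123 as "one two three",
                                    but "hello" is not spelled out */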

Is spelling out the "rôle" example as "R O circumflex L E" also conforming?

`speak-as=literal-punctuation` implies that the punctuation is not
used for pauses, but shouldn't it still be used for pauses as well
(to avoid very long run-on utterances when applied over large
passages of text)?

# SSML say-as compatibility

Although the note on `speak-as` says that the CSS model is limited
to a basic set of pronunciation rules compared to the SSML `say-as`
element, it also adds more complexity. Specifically:

`spell-out`, `digits` and `literal-punctuation` are all aspects of
`<say-as interpret-as="characters" format="characters">...`.

`literal-punctuation` and `no-punctuation` specify the removal of
pauses from punctuation. This is orthogonal to how the characters are
pronounced.
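
To make the overlap concrete (selectors hypothetical; the SSML in
the comments is my reading of the mapping, not normative):

    abbr    { speak-as: spell-out; } /* interpret-as="characters",
                                        format="characters" */
    .serial { speak-as: digits; }    /* the same mapping, restricted
                                        to numbers */
    .quoted { speak-as: literal-punctuation; }
                                     /* characters-style punctuation
                                        PLUS removal of punctuation
                                        pauses: two effects in one */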

Thus, an SSML-compatible model could be:

## speak = [default | none | characters]

speak=default -- Use the default aural rendering of the text by the
Text-to-Speech engine.

speak=none -- Don't speak the text.

speak=characters -- Speak out letters, digits and punctuation
individually (same as interpret-as=characters with format=characters).

NOTE: This can be extended in the future to support more modes
(=glyphs, =date, =time, =telephone, =cardinal, =ordinal, etc.).

Bikeshedding: This is intended to be analogous to `display` (which is
not `display-as`), but could be changed to `speak-as` if required.
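
Usage under this model would look like (selectors hypothetical):

    .password { speak: characters; }  /* "ab1!" is spoken as its
                                         individual characters */
    .visual   { speak: none; }        /* skipped by the speech
                                         renderer */
    p         { speak: default; }     /* the engine's normal
                                         rendering */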

## punctuation-pause = [default | none]

punctuation-pause=default -- Let the Text-to-Speech engine determine
how long to pause after punctuation.

punctuation-pause=none -- Don't pause when encountering punctuation characters.

Bikeshedding: This could also be something like `punctuation-break`.
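
Combined with the property above, this keeps pronunciation and
pausing separate, e.g. (selector hypothetical):

    /* speak each character, but keep the engine's punctuation
       pauses so long passages do not become run-on utterances */
    .transcript {
      speak: characters;
      punctuation-pause: default;
    }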

Thanks,
- Reece H. Dunn
