Re: [css-speech][css-content][mediaqueries] Making Generated Content Accessible from Tab Atkins Jr. on 2014-12-03 (www-style@w3.org from December 2014)

From: Tab Atkins Jr. <jackalmage@gmail.com>
Date: Wed, 3 Dec 2014 07:55:17 -0800
To: Reece Dunn <msclrhd@googlemail.com>
Cc: Florian Rivoal <florian@rivoal.net>, Daniel Weck <daniel.weck@gmail.com>, James Craig <jcraig@apple.com>, fantasai <fantasai.lists@inkedblade.net>, Alan Stearns <stearns@adobe.com>, www-style list <www-style@w3.org>, fantasai <fantasai@inkedblade.net>
Message-ID: <CAAWBYDADWYD37sd1xJrzppB82T_FFxbr2PTy3LvTPKGb0JnWyQ@mail.gmail.com>

On Wed, Dec 3, 2014 at 7:18 AM, Reece Dunn <msclrhd@googlemail.com> wrote:
> On 3 December 2014 at 14:20, Florian Rivoal <florian@rivoal.net> wrote:
>> On 03 Dec 2014, at 14:50, Daniel Weck <daniel.weck@gmail.com> wrote:
>>>
>>> On Wed, Dec 3, 2014 at 3:43 AM, James Craig <jcraig@apple.com> wrote:
>>>>
>>>>> This raises 2 (related) questions. Is the introduction of this media feature sufficient to deprecate the "speech" media type into never matching? If not, can and should the same privacy model be applied to it?
>>>>
>>>> My understanding is that the speech media type is *only* useful for linearized audio-only media not intended for the screen, since it is mutually exclusive with the screen media type. Most assistive technologies operate on some concept of a "screen" (including screen readers for the blind) so the speech media type should never apply to screen readers or ScreenMagnifier+Speech utilities, but it's possible there is some use case. For example, if you were to turn an EPUB into a generated TTS audiobook, the speech media type could apply. I don't know if any implementations support that, but you'd probably want to check with someone from DAISY before making it a No-Op.
>>>
>>>
>>> Hello,
>>
>> Hi, Thanks for the feedback, I was hoping you'd pop in.
>>
>>> Yes, from a content design perspective, the 'speech' Media Type can be
>>> used to define a "complete aural alternative to a visual presentation"
>>> (full quote below), and as per the specification: such representation
>>> would be mutually exclusive to other media types, when "rendered"
>>> within a *given* canvas. The same applies to 'braille' (for example),
>>> although the "tactile" Media Group also includes the 'embossed' Media
>>> Type (conversely, 'speech' stands on its own).
>>
>> The exclusive nature of media types has turned out to be an issue in almost all cases, which is why we're generally trying to deprecate them, and replace them by media features which capture the key aspect that made the media types different.
>>
>> Unlike a type like handheld for example, which was so similar to screen that browsers never matched it due to compat concerns, speech may be sufficiently different from screen that an exclusive media type could work. At the same time, given the existence both of speech UAs which only read the content out loud in a linear fashion (E-pub reader) *and* of speech UAs which do speech as an assistive complement to a visual 2d rendering, I am not so sure that this is really exclusive.
>
> I like the idea of using features, as that would allow CSS writers
> more control over the intent of what they want.
>
>> What would you think (naming aside) about a media feature like this:
>> speech: none | linear | screen-based
>
> Aren't these independent concepts?
>
> In an ebook reader, you can have 3 modes of speech:
> 1.  any audio/narration from the book itself (in ePub this is done
> using a SMIL document which is linked to the HTML document by id
> names);
> 2.  using text-to-speech (TTS) for reading the text in the HTML document;
> 3.  using (1) if present or (2) if not, for the current section.
>
> In the ebook reader case, the TTS reading is where these CSS rules are
> most likely to be applied. These would include the following use
> cases:
>
> 1.  Providing hints to the TTS engine on how the text should be
> spoken, including things like controlling numbered lists. This can be
> done with existing CSS or more powerfully with upcoming modules (e.g.
> the Counter Styles module). These can share styles with other media.
> This also includes the speak property (the speech equivalent of
> display) and say-as (a simplified version of the SSML
> say-as/interpret-as to e.g. say that a number should be spoken as
> digits).
>
> 2.  Controlling the audio produced. These are the styles affected by the
> linear model -- the pause, rest and cue styles from CSS speech, as
> well as the voice-* properties for controlling the TTS engine.
>
> The rules in (1) relate to how the text should be spoken and are
> applicable to both ebook readers and assistive technologies and may
> have a different rendering to screen or other media. The rules in (2)
> are only really applicable to ebook readers.
>
> As such, I propose two media features [1]:
>
> speech = none | tts
> presentation = screen | narration
>
> |speech=tts| is used for rules relating to (1), controlling how a
> text-to-speech engine (either via an ebook reader or screen reader)
> should interpret the content. |presentation=narration| is used for
> rules relating to (2), controlling how the audio should be spoken when
> read in a linear, narrative style.
>
> Thus, you have:
> 1.  display media (screen, print, etc.) -- speech=none, presentation=screen
> 2.  screen reader -- speech=tts, presentation=screen
> 3.  ebook reader -- speech=tts, presentation=narration
>
> [1] I would also be happy with something like |hinting = none | tts|,
> but we can save the bikeshedding issues until the overall concept and
> design is agreed on.

It looks like the 'speech' feature doesn't do anything but indicate
whether or not the 'presentation' feature is valid; the fact that
there are only three possible combinations, rather than four, suggests
they are tightly coupled.  It's usually more author-friendly, then, to
just expose the valid combinations in a single feature, as Florian
suggested.

If you want to apply some properties to any speech-based context,
regardless of whether it's a screen-reader or a narrator, you can just
use (speech) without a value, as that'll be true for both of the
speech-ish values.
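
As a rough sketch of that, assuming a single 'speech' feature with the
values Florian floated (none | linear | screen-based). The feature and its
value names are only a proposal in this thread, not something specced or
implemented, and the selectors are made up for illustration:

```css
/* Hypothetical: 'speech' as a media feature is only proposed here. */

/* Boolean context: intended to match both speech-ish values
   (linear and screen-based), but not speech: none. */
@media (speech) {
  .phone-number { speak-as: digits; }
}

/* Only linear narration, e.g. an ebook reader reading the text aloud.
   Pause and voice-* styling is the part Reece ties to narration. */
@media (speech: linear) {
  h1 { pause-after: strong; voice-rate: slow; }
}
```

The properties used (speak-as, pause-after, voice-rate) are from CSS
Speech; only the media feature itself is assumed.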

~TJ

Received on Wednesday, 3 December 2014 15:56:04 UTC