Re: Roll-up captions in WebVTT from Silvia Pfeiffer on 2012-04-11 (public-texttracks@w3.org from April 2012)

From: Silvia Pfeiffer <silviapfeiffer1@gmail.com>
Date: Wed, 11 Apr 2012 14:32:45 +1000
To: Glenn Maynard <glenn@zewt.org>
Cc: David Singer <singer@apple.com>, Gal Klein <gal@plymedia.com>, public-texttracks@w3.org
Message-ID: <CAHp8n2n9xxUb5TCpCC_K9fMNvGA5bDfGyKaWhsp5X1QrwstGwg@mail.gmail.com>
On Wed, Apr 11, 2012 at 1:45 PM, Glenn Maynard <glenn@zewt.org> wrote:
> On Tue, Apr 10, 2012 at 8:22 PM, Silvia Pfeiffer <silviapfeiffer1@gmail.com>
> wrote:
>>
>> I am here and being paid by Google's accessibility team to make sure
>> this use case is supported for our YouTube captions, because we need
>> it. I've listed our use cases. What more do you want me to say?
>
>
> The use cases are disputed, but I'm trying to avoid replying to them just
> because it's mostly all been discussed already.

I would like to hear back on the other thread on this - in particular
since Shane seems to support your position. I know that we had several
customers on YouTube requiring the feature.


> The only argument I think
> might have merit is the claim that lots of people want it, but the data for
> that--the claim that around 50% of people want roll-ups--needs examination.
> (Subtitles and captions on every DVD and Blu-ray are pop-on; SRT, SSA and
> ASS subtitles are all pop-on; and I've spent a good deal of time talking
> about subtitled media--and yet I've never once heard anybody going "I wish
> these subtitles were roll-up, it's so much easier to read".  That's why I,
> at least, will take some convincing to believe there's really significant
> demand for it from users.)

We need to be able to have roll-up functionality just for
time-overlapping cues. Surely it's obvious that the current display
mechanism for time-overlapping cues in WebVTT is very unusual (i.e.
using whichever space is available next to the currently rendered cue
text, whether above or below). Surely we can also agree that many
people will prefer a roll-up display over the current rendering
mechanism for time-overlapping cues. I am willing to concede that it
may be 50% or less, but substantially more than 20%, so it meets the
80% use case. I am also willing to agree that sometimes we may prefer
to display roll-down rather than roll-up.
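
As a rough illustration of what "time-overlapping cues" means here, the
following sketch finds every pair of cues that are on screen at the same
time. The Cue record and the overlapping_pairs function are hypothetical
names made up for this example; they are not part of WebVTT or of any
browser API, and times are assumed to be plain seconds.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Cue:
    """Hypothetical cue record (assumption): times are in seconds."""
    start: float
    end: float
    text: str


def overlapping_pairs(cues):
    """Return every pair of cues whose display intervals intersect."""
    return [
        (a, b)
        for a, b in combinations(cues, 2)
        # Intervals are treated as half-open [start, end), so a cue that
        # starts exactly when another ends does not count as overlapping.
        if a.start < b.end and b.start < a.end
    ]


cues = [
    Cue(0.0, 4.0, "first line"),
    Cue(2.0, 6.0, "second line"),
    Cue(6.0, 8.0, "third line"),
]
pairs = overlapping_pairs(cues)
```

Only the first two cues overlap in this data; those are the cases where
a roll-up display would differ from the current placement behaviour.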


>> What about explicitly positioned cues, e.g. underneath a certain
>> person and the desire to have captions scrolling there?
>
>
> This sounds like a third mode, separate from both roll-up and pop-on
> captions.

I don't see it that way. Rollup and rolldown are a means of modifying
already rendered cues no matter where they are rendered. The initial
rendering position and the transition mode are orthogonal concepts. We
don't need to mingle them into a new mode.
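
One way to picture this orthogonality is a settings record with two
independent fields, sketched below. All names (Placement, Transition,
RenderSettings) and the percentage values are assumptions invented for
illustration, not proposed WebVTT syntax.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Transition(Enum):
    """Hypothetical transition modes applied to already rendered cues."""
    NONE = "none"
    ROLL_UP = "roll-up"
    ROLL_DOWN = "roll-down"


@dataclass(frozen=True)
class Placement:
    """Hypothetical initial rendering position, in percent of the video."""
    line_percent: float
    position_percent: float


@dataclass(frozen=True)
class RenderSettings:
    """Position and transition are separate fields, chosen independently."""
    placement: Placement
    transition: Transition


placements = [Placement(90.0, 50.0), Placement(20.0, 25.0)]
# Every placement can be combined with every transition: no extra
# "mode" is needed for, say, an explicitly positioned roll-up cue.
combos = [RenderSettings(p, t) for p, t in product(placements, Transition)]
```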

>  I'm not sure how (or if) they'd fit together.  It raises a huge
> new set of questions.  (What happens if the subject is moving around the
> frame?

The caption rendering box moves with them.

>  What if two subjects with active captions cross each other in the
> frame?

The caption rendering boxes move with them.

>  Does this mean three user options--pop-on, roll-up and
> follows-the-speaker?

No. We already have "follow-the-speaker" on TV - the captions are just
repeated in a different location to do that. This is not practical or
desirable on the Web and we can find a better way of rendering this.

The existing types such as "pop-on", "roll-up" and "paint-on" aren't
really well-separated types of captions, and I wouldn't want to continue
using them as a technical specification of rendering means.

"Pop-on" is more than just rendering on screen: it is usually implied
that there is no time-overlap with other cues. So, what are pop-on
captions with time-overlap (and more so: region-overlap)? That's a
type of caption that does not fit in any of these classes.

"Roll-up" similarly usually implies live captions with their delays
and errors, which are not a type of rendering at all, but authoring
deficiencies. Also, roll-up usually implies revealing the text two
characters at a time (or at least SCC is defined that way). Roll-up
captions also typically have the ">>" characters at the start. So, what
are roll-up captions that are accurately timed, shown word by word in
sync with the audio when these words are spoken, but do not use the
">>" symbols? That's again a type of caption that doesn't fit in any
of the three classes.

"Paint-on" similarly usually implies revealing a word at a time for a
single piece of text about the length of a pop-on caption, but then
the text disappears and a new block is rendered by "revealing", i.e.
there are no time-overlapping cues and therefore no text needs moving,
since the position of each word is clear for the cue. So, what are
captions that are successively revealed but time-overlap? Again, this
is a type of caption that doesn't fit in any of the three classes.

Making up new classes doesn't help, in particular when the classes
overlap in functionality. We need to identify the different individual
features that the classes are made up of and turn those into WebVTT
features.

The features of pop-on are the following:
* render a piece of text on screen
* in a given location
* within a specified rendering area
All of these features are available with current WebVTT.

The features of roll-up are the following:
* render a limited number of lines of text on screen (usually 2-4)
* in a given location
* within a specified rendering area
* in a time-overlapping manner
* where, within the rendering area, existing lines of text always make
space for a new text line underneath them
* and text that scrolls out of the top of the rendering area disappears.
The last two features are not currently supported in WebVTT.
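
The two unsupported features in the roll-up list can be sketched as a
small model of a rendering area: each new line is added underneath the
existing ones, and the line that scrolls out of the top disappears. The
RollUpRegion class and its method names are hypothetical and only model
the behaviour described; they say nothing about how WebVTT would
express it.

```python
from collections import deque


class RollUpRegion:
    """Hypothetical model of a roll-up rendering area (assumption)."""

    def __init__(self, max_lines=3):
        if max_lines < 1:
            raise ValueError("max_lines must be at least 1")
        # A deque with maxlen drops the oldest entry (the top line)
        # automatically when a new entry is appended to a full deque.
        self._lines = deque(maxlen=max_lines)

    def add_line(self, text):
        """Add a new line underneath the existing text."""
        self._lines.append(text)

    def visible(self):
        """Return the visible lines, ordered top to bottom."""
        return list(self._lines)


region = RollUpRegion(max_lines=2)
for line in ["line one", "line two", "line three"]:
    region.add_line(line)
```

After the third call, region.visible() holds only "line two" and
"line three": the first line has scrolled out of the top of the area.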


>  Can this be done in a way that will consistently work
> for all user preferences, eg. if a user wants roll-up and it's authored for
> follow?)  I'm afraid of the discussion spiraling out of control if we start
> considering this...

I'm sure we can and that's what all this discussion is working
towards. We need to be open to the opportunities of the Web and allow
all feature combinations, not just a limited set like the one that was
defined for TV. We don't have to mash them all up, however - instead
we need to regard them as separate dimensions that all have an impact
on how things are rendered on screen.

Cheers,
Silvia.
Received on Wednesday, 11 April 2012 04:33:34 UTC