Re: Roll-up captions in WebVTT

On Dec 19, 2011, at 16:50 , Glenn Maynard wrote:

> On Mon, Dec 19, 2011 at 7:10 PM, David Singer <singer@apple.com> wrote:
> I agree it doesn't work well for long credits.  I am also a bit taken aback at having an idea dismissed as 'evil and ugly' before we've really either worked it out or seen the alternatives. Can we debate the ideas along with (or preferably without) the value adjectives?
> 
> I used strong language to express my strong distaste for the idea.  Of course, I'd never tell anyone not to debate an idea I don't like.
>  
> I understand it doesn't *look* clean to repeat a text line that occurs in two different places in two consecutive cues, but it has a number of advantages.
> 
> The disadvantages: 
> * it doesn't 'feel right' to repeat things (but the bit-rate gain is minimal, in my opinion)
> * tagging is needed so that systems that need to know when it has happened can tell (e.g. screen readers)
> 
> Repeating each cue dozens of times essentially turns it into a non-human-readable, non-human-editable format.  This would be a great loss.

Yes, I agree. Add that as a disadvantage: it doesn't work well for long scrolling texts.

But note that it's not the length of the 'paragraph' that matters, but the height of the scrolling area.  A 3-line scrolling area needs at most 2 repeats of any given line, however long the text is.
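
To make that bound concrete, here is a minimal sketch in Python. It assumes one new line per cue and that all lines are distinct; the function names are made up for illustration and are not part of any WebVTT tooling.

```python
def rollup_cues(lines, height):
    """Return one list of visible lines per cue for a roll-up window.

    Each new line starts a new cue; the cue repeats up to height - 1
    earlier lines so that it carries all of its own text.
    """
    if height < 1:
        raise ValueError("height must be at least 1")
    cues = []
    for i in range(len(lines)):
        start = max(0, i - height + 1)
        cues.append(lines[start:i + 1])
    return cues


def max_repeats(lines, height):
    """Largest number of times any line recurs after its first cue.

    Assumes the lines are distinct, since they are counted by their text.
    """
    counts = {}
    for cue in rollup_cues(lines, height):
        for line in cue:
            counts[line] = counts.get(line, 0) + 1
    return max(counts.values()) - 1 if counts else 0
```

With a height of 3, max_repeats stays at 2 whether the text has five lines or five hundred.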

> 
> The advantages:
> * no cue-to-cue dependency -- no I frames and P frames (this is pretty big, IMHO); each cue contains all its own text
> * allows the expression of any transition, not just scrolling: moving to stay with the speaker or out of the way, changes of color, background, etc.
> 
> Fundamentally, it's presentational rather than semantic.  If this is a markup feature at all (and I don't believe roll-ups should be), semantic markup ("render this cue as a roll-up caption")

No, I am not saying that.  I am saying that the semantic mark-up is "this span is the same as a span in adjacent cues with the same ID".

If it's ALSO associated with a CSS class and that class ALSO has transitions on Y-position, then it'll scroll up on systems that ALSO support CSS.

So you'd see something like this (very rough example):

WEBVTT FILE

1
00:00:03.500 --> 00:00:05.000
<cue-id id=1>Everyone wants the most from life</cue-id>

2
00:00:06.000 --> 00:00:09.000
<cue-id id=1>Everyone wants the most from life</cue-id>
<cue-id id=2>but they seem unwilling to work for it</cue-id>

3
00:00:11.000 --> 00:00:14.000 A:end
<cue-id id=2>but they seem unwilling to work for it</cue-id>
even though opportunities abound

and then the style sheet has a general CSS transition on Y-position for those spans.
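
As a rough illustration of what a renderer (or a screen reader that wants to skip repeats) could do with those IDs, here is a minimal Python sketch. It assumes the made-up <cue-id id=N> tag from the example above, one span per line; neither the tag nor the function names exist in WebVTT.

```python
import re

# Hypothetical tag syntax taken from the rough example above; not real WebVTT.
SPAN_RE = re.compile(r"<cue-id id=(\w+)>(.*?)</cue-id>")


def span_rows(cue_text):
    """Map each span ID to the row (0 = top line) it occupies in this cue."""
    rows = {}
    for row, line in enumerate(cue_text.splitlines()):
        match = SPAN_RE.fullmatch(line.strip())
        if match:
            rows[match.group(1)] = row
    return rows


def row_shifts(prev_cue, next_cue):
    """For spans present in both cues, return how many rows each moved.

    A negative value means the span moved up, which is what a roll-up does.
    Spans that appear only in next_cue are new text and are not listed.
    """
    before = span_rows(prev_cue)
    after = span_rows(next_cue)
    return {sid: after[sid] - before[sid] for sid in before if sid in after}
```

Between cues 2 and 3 above, span 2 moves from row 1 to row 0, so a system with CSS transitions could animate that move, and one without would simply show the new cue.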

> allows UAs to adjust the presentation.  With presentational markup ("put the caption at position 0.9; now scroll it to position 0.8 ..."), the exact details are baked into the captions; the renderer doesn't really know why the motion is happening or the dependencies between various animations, so it can't really change anything.

Right.  We need to keep this.  I think my approach does...

> 
> For example, a UA might want to allow users to say "enlarge the roll-up area, so stuff stays on screen longer".
> 
> * allows the use of CSS transitions to express the optionality and effect of the transition
> 
> This doesn't require repeating cues.  For example, 
> 
> 00:11.000 --> 00:13.000 Position:0.8 Delay:1.5 Linear:0.5 Position:0.7
> <v Roger Bingham>We are in New York City
> 
> which would show the cue at L:0.8, wait 1.5 seconds, then scroll to L:0.7 over half a second.  (This is just off the top of my head; something that translates more directly to CSS transitions--which I'm not terribly familiar with--would be better.)

Cool, but I think it needs to work on sub-parts of cues, not just whole ones.
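
For reference, the timing in the quoted example (show at 0.8, wait 1.5 seconds, then scroll to 0.7 over half a second) can be sketched as a position function. This is only an illustration in Python under my reading of that example; the function and parameter names are invented, and nothing here is proposed syntax.

```python
def position_at(t, start=0.8, end=0.7, delay=1.5, duration=0.5):
    """Line position t seconds after the cue appears.

    Holds at `start` for `delay` seconds, then moves linearly to `end`
    over `duration` seconds, and stays at `end` afterwards.
    """
    if t <= delay:
        return start
    if t >= delay + duration:
        return end
    fraction = (t - delay) / duration
    return start + (end - start) * fraction
```

To work on sub-parts of cues, the same function would have to be evaluated per span rather than once per cue.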

> 
> I think this has reasonable use cases (e.g. sign translations that follow a sign as it moves across the screen).  I don't think this is appropriate for roll-up captions, but it's far less objectionable than multiple cues, where the suggestion seemed to look like:
> 
> 00:11.000 --> 00:11.500 L:0.8
> <v Roger Bingham>We are in New York City
> 
> 00:11.500 --> 00:11.525 L:0.79
> <v Roger Bingham>We are in New York City
> 
> 00:11.525 --> 00:11.550 L:0.78
> <v Roger Bingham>We are in New York City
> 
> 00:11.550 --> 00:11.575 L:0.77
> <v Roger Bingham>We are in New York City
> 
> and so on.  That's something you can do today, if you really want to, but it's messy, probably won't lead to very smooth scrolling, requires huge amounts of repetition and makes the file format essentially impossible to edit by hand.
> 
> -- 
> Glenn Maynard
> 

David Singer
Multimedia and Software Standards, Apple Inc.

Received on Tuesday, 20 December 2011 01:04:39 UTC