WebVTT

Extracted from the WHATWG HTML specification on 14th November 2011.

Contribution to the W3C Text Tracks Community Group under the W3C Community Contributor License Agreement by Google Inc., Apple Inc., Mozilla Foundation, and Opera Software ASA.


Table of contents

  1. 1 WebVTT
    1. 1.1 Introduction
    2. 1.2 Syntax
    3. 1.3 Parsing
    4. 1.4 WebVTT cue text parsing rules
    5. 1.5 WebVTT cue text DOM construction rules
  2. 2 WebVTT cue text rendering rules
  3. 3 Applying CSS properties to WebVTT Node Objects
  4. 4 CSS extensions
    1. 4.1 The '::cue' pseudo-element
    2. 4.2 The ':past' and ':future' pseudo-classes
  5. 5 Diverse snippets
    1. 5.1 WebVTT kinds in the track element
    2. 5.2 WebVTT srclang in the track element
    3. 5.3 WebVTT label in the track element
    4. 5.4 WebVTT mime type registration
  6. References

1 WebVTT

Extract from Section 4.8.10.13 of the WHATWG HTML specification.

The WebVTT format (Web Video Text Tracks) is a format intended for marking up external text track resources.

1.1 Introduction

The main use for WebVTT files is captioning video content. Here is a sample file that captions an interview:

WEBVTT

00:11.000 --> 00:13.000
<v Roger Bingham>We are in New York City

00:13.000 --> 00:16.000
<v Roger Bingham>We're actually at the Lucern Hotel, just down the street

00:16.000 --> 00:18.000
<v Roger Bingham>from the American Museum of Natural History

00:18.000 --> 00:20.000
<v Roger Bingham>And with me is Neil deGrasse Tyson

00:20.000 --> 00:22.000
<v Roger Bingham>Astrophysicist, Director of the Hayden Planetarium

00:22.000 --> 00:24.000
<v Roger Bingham>at the AMNH.

00:24.000 --> 00:26.000
<v Roger Bingham>Thank you for walking down here.

00:27.000 --> 00:30.000
<v Roger Bingham>And I want to do a follow-up on the last conversation we did.

00:30.000 --> 00:31.500 A:end S:50%
<v Roger Bingham>When we e-mailed—

00:30.500 --> 00:32.500 A:start S:50%
<v Neil deGrasse Tyson>Didn't we talk about enough in that conversation?

00:32.000 --> 00:35.500 A:end S:50%
<v Roger Bingham>No! No no no no; 'cos 'cos obviously 'cos

00:32.500 --> 00:33.500 A:start S:50%
<v Neil deGrasse Tyson><i>Laughs</i>

00:35.500 --> 00:38.000
<v Roger Bingham>You know I'm so excited my glasses are falling off here.

1.2 Syntax

A WebVTT file must consist of a WebVTT file body encoded as UTF-8 and labeled with the MIME type text/vtt. [RFC3629]

A WebVTT file body consists of the following components, in the following order:

  1. An optional U+FEFF BYTE ORDER MARK (BOM) character.
  2. The string "WEBVTT".
  3. Optionally, either a U+0020 SPACE character or a U+0009 CHARACTER TABULATION (tab) character followed by any number of characters that are not U+000A LINE FEED (LF) or U+000D CARRIAGE RETURN (CR) characters.
  4. Two or more WebVTT line terminators.
  5. Zero or more WebVTT cues separated from each other by two or more WebVTT line terminators.
  6. Zero or more WebVTT line terminators.
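
As a rough illustration of components 1 to 3 above, the following Python sketch checks only the first line of an already decoded file. The function name has_webvtt_signature is an assumption made for this example, and the check deliberately ignores the line terminators and cues that follow the first line.

```python
import re


def has_webvtt_signature(text: str) -> bool:
    """Return True if the first line looks like a WebVTT signature line."""
    # Component 1: an optional U+FEFF BYTE ORDER MARK.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Take everything up to the first line terminator (CRLF, LF, or CR).
    first_line = re.split(r"\r\n|\n|\r", text, maxsplit=1)[0]
    # Component 2: the literal string "WEBVTT".
    if not first_line.startswith("WEBVTT"):
        return False
    # Component 3: nothing more, or a space or tab before any trailing text.
    return len(first_line) == 6 or first_line[6] in " \t"
```

A line such as "WEBVTT - interview" passes, while "WEBVTTX" does not, because the seventh character must be a space or a tab.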

A WebVTT cue consists of the following components, in the given order:

  1. Optionally, a WebVTT cue identifier followed by a WebVTT line terminator.
  2. WebVTT cue timings.
  3. Optionally, one or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters followed by WebVTT cue settings.
  4. A WebVTT line terminator.
  5. The cue payload: either WebVTT cue text, WebVTT chapter title text, or WebVTT metadata text.

A WebVTT cue corresponds to one piece of time-aligned text or data in the WebVTT file, for example one subtitle. The cue payload is the text or data associated with the cue.

WebVTT chapter title text is syntactically a subset of WebVTT cue text, and WebVTT cue text is syntactically a subset of WebVTT metadata text. Conformance checkers, when validating WebVTT files, may offer to restrict all cues to only having WebVTT chapter title text or WebVTT cue text as their cue payload; WebVTT metadata text cues are only useful for scripted applications (using the metadata text track kind).

A WebVTT file whose cues all have a cue payload that is WebVTT chapter title text is said to be a WebVTT file using chapter title text.

A WebVTT file whose cues all have a cue payload that is WebVTT cue text is said to be a WebVTT file using cue text. By definition, any file that is a WebVTT file using chapter title text is also a WebVTT file using cue text.

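A WebVTT line terminator consists of one of the following:

  • A U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pair.
  • A single U+000A LINE FEED (LF) character.
  • A single U+000D CARRIAGE RETURN (CR) character.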

A WebVTT cue identifier is any sequence of one or more characters not containing the substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), nor containing any U+000A LINE FEED (LF) characters or U+000D CARRIAGE RETURN (CR) characters.

A WebVTT cue identifier can be used to reference a specific cue, for example from script or CSS.

The WebVTT cue timings part of a WebVTT cue consists of the following components, in the given order:

  1. A WebVTT timestamp representing the start time offset of the cue. The time represented by this WebVTT timestamp must be greater than or equal to the start time offsets of all previous cues in the file.
  2. One or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters.
  3. The string "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN).
  4. One or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters.
  5. A WebVTT timestamp representing the end time offset of the cue. The time represented by this WebVTT timestamp must be greater than the start time offset of the cue.

The WebVTT cue timings give the start and end offsets of the WebVTT cue. Different cues can overlap. Cues are always listed ordered by their start time.
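
The two ordering constraints on cue timings can be checked mechanically. The sketch below assumes the timings have already been converted to seconds and are supplied as (start, end) pairs in file order; the function name timings_are_ordered is hypothetical.

```python
from typing import List, Tuple


def timings_are_ordered(timings: List[Tuple[float, float]]) -> bool:
    """Check the authoring constraints on cue start and end offsets."""
    previous_start = float("-inf")
    for start, end in timings:
        # The end time offset must be greater than the start time offset.
        if end <= start:
            return False
        # Start offsets must not decrease from one cue to the next.
        if start < previous_start:
            return False
        previous_start = start
    return True
```

Overlapping cues, such as the two-speaker exchange in the introduction, satisfy both constraints as long as their start times do not go backwards.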

A WebVTT file whose cues all have an end time offset x greater than or equal to the end time offsets of all the cues whose start time offsets are less than x is said to be a WebVTT file using only nested cues.

A WebVTT timestamp representing a time in seconds and fractions of a second is a WebVTT timestamp representing hours hours, minutes minutes, seconds seconds, and thousandths of a second seconds-frac, calculated as follows:

  1. Let seconds be the integer part of the time.

  2. Let seconds-frac be the fractional component of the time, expressed as the digits of the decimal fraction given to three decimal digits.

  3. If seconds is greater than 59, then let minutes be the integer component of seconds divided by sixty, and then let seconds be the remainder of dividing seconds by sixty. Otherwise, let minutes be zero.

  4. If minutes is greater than 59, then let hours be the integer component of minutes divided by sixty, and then let minutes be the remainder of dividing minutes by sixty. Otherwise, let hours be zero.

A WebVTT timestamp representing hours hours, minutes minutes, seconds seconds, and thousandths of a second seconds-frac, consists of the following components, in the given order:

  1. Optionally (required if hours is non-zero):
    1. Two or more characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), representing the hours as a base ten integer.
    2. A U+003A COLON character (:)
  2. Two characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), representing the minutes as a base ten integer in the range 0 ≤ minutes ≤ 59.
  3. A U+003A COLON character (:)
  4. Two characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), representing the seconds as a base ten integer in the range 0 ≤ seconds ≤ 59.
  5. A U+002E FULL STOP character (.).
  6. Three characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), representing the thousandths of a second seconds-frac as a base ten integer.
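
The conversion and serialization steps above can be sketched in Python as follows. To avoid floating-point rounding, this example takes the time as a whole number of milliseconds rather than as fractional seconds; that input convention and the name format_timestamp are assumptions of the sketch.

```python
def format_timestamp(total_ms: int) -> str:
    """Serialize a non-negative time in milliseconds as a WebVTT timestamp."""
    if total_ms < 0:
        raise ValueError("time must not be negative")
    # Split off the thousandths, then carry seconds into minutes and hours.
    seconds, frac = divmod(total_ms, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    body = f"{minutes:02d}:{seconds:02d}.{frac:03d}"
    # The hours component is optional, and required only when non-zero.
    return f"{hours:02d}:{body}" if hours else body
```

For instance, 11000 milliseconds serializes as "00:11.000", matching the first cue of the sample file in the introduction.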

The WebVTT cue settings part of a WebVTT cue consists of zero or more of the following components, in any order, separated from each other by one or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters. Each component must not be included more than once per WebVTT cue settings string.

  • A WebVTT vertical text cue setting.
  • A WebVTT line position cue setting.
  • A WebVTT text position cue setting.
  • A WebVTT size cue setting.
  • A WebVTT alignment cue setting.

WebVTT cue settings give configuration options regarding the position and alignment of the cue. For example, they allow a cue to be aligned to the left or positioned at the top right.

A WebVTT vertical text cue setting consists of the following components, in the order given:

  1. A U+0044 LATIN CAPITAL LETTER D character.
  2. A U+003A COLON character (:).
  3. One of the following strings: "vertical", "vertical-lr".

A WebVTT vertical text cue setting configures the cue to use vertical text layout rather than horizontal text layout. Vertical text layout is sometimes used in Japanese, for example. The default is horizontal layout.

A WebVTT line position cue setting consists of the following components, in the order given:

  1. A U+004C LATIN CAPITAL LETTER L character.

  2. A U+003A COLON character (:).

  3. Either:
    To represent a specific position relative to the video frame
    1. One or more characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9).
    2. A U+0025 PERCENT SIGN character (%).
    To represent a line number
    1. Optionally a U+002D HYPHEN-MINUS character (-).
    2. One or more characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9).

A WebVTT line position cue setting configures the position of the cue. For horizontal cues, this is the vertical position. The position can be given either as a percentage, which gives the distance from the top of the frame, or as a line number. Line numbers are based on the size of the first line of the cue. Positive line numbers count from the top of the frame (the top line is numbered 0), negative line numbers from the bottom of the frame (the bottom line is numbered −1).

A WebVTT text position cue setting consists of the following components, in the order given:

  1. A U+0054 LATIN CAPITAL LETTER T character.
  2. A U+003A COLON character (:).
  3. One or more characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9).
  4. A U+0025 PERCENT SIGN character (%).

A WebVTT text position cue setting configures the position of the text in the direction orthogonal to the WebVTT line position cue setting. For horizontal cues, this is the horizontal position. The WebVTT text position cue setting is given as a percentage, calculated from the edge of the frame at which the text begins (so for left-to-right English text, the left edge).

A WebVTT size cue setting consists of the following components, in the order given:

  1. A U+0053 LATIN CAPITAL LETTER S character.
  2. A U+003A COLON character (:).
  3. One or more characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9).
  4. A U+0025 PERCENT SIGN character (%).

A WebVTT size cue setting configures the size of the cue in the same direction as the WebVTT text position cue setting. For horizontal cues, this is the width of the cue. It is given as a percentage of the width of the frame.

A WebVTT alignment cue setting consists of the following components, in the order given:

  1. A U+0041 LATIN CAPITAL LETTER A character.
  2. A U+003A COLON character (:).
  3. One of the following strings: "start", "middle", "end"

A WebVTT alignment cue setting configures the alignment of the text within the cue. The keywords are relative to the text direction; for left-to-right English text, "start" means left-aligned.

WebVTT metadata text consists of any sequence of zero or more characters other than U+000A LINE FEED (LF) characters and U+000D CARRIAGE RETURN (CR) characters, each optionally separated from the next by a WebVTT line terminator. (In other words, any text that does not have two consecutive WebVTT line terminators and does not start or end with a WebVTT line terminator.)

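WebVTT chapter title text consists of zero or more of the following, each optionally separated from the next by a WebVTT line terminator:

  • A WebVTT cue text span.
  • A WebVTT cue amp escape.
  • A WebVTT cue lt escape.
  • A WebVTT cue gt escape.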

WebVTT cue text consists of zero or more WebVTT cue components, in any order, each optionally separated from the next by a WebVTT line terminator.

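The WebVTT cue components are:

  • A WebVTT cue class span.
  • A WebVTT cue italics span.
  • A WebVTT cue bold span.
  • A WebVTT cue underline span.
  • A WebVTT cue ruby span.
  • A WebVTT cue voice span.
  • A WebVTT cue timestamp.
  • A WebVTT cue text span, representing the text of the cue.
  • A WebVTT cue amp escape, representing a "&" character in the text of the cue.
  • A WebVTT cue lt escape, representing a "<" character in the text of the cue.
  • A WebVTT cue gt escape, representing a ">" character in the text of the cue.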

WebVTT cue internal text consists of an optional WebVTT line terminator, followed by zero or more WebVTT cue components, in any order, each optionally followed by a WebVTT line terminator.

A WebVTT cue class span consists of a WebVTT cue span start tag "c" that disallows an annotation, WebVTT cue internal text representing cue text, and a WebVTT cue span end tag "c".

A WebVTT cue italics span consists of a WebVTT cue span start tag "i" that disallows an annotation, WebVTT cue internal text representing the italicized text, and a WebVTT cue span end tag "i".

A WebVTT cue bold span consists of a WebVTT cue span start tag "b" that disallows an annotation, WebVTT cue internal text representing the boldened text, and a WebVTT cue span end tag "b".

A WebVTT cue underline span consists of a WebVTT cue span start tag "u" that disallows an annotation, WebVTT cue internal text representing the underlined text, and a WebVTT cue span end tag "u".

A WebVTT cue ruby span consists of the following components, in the order given:

  1. A WebVTT cue span start tag "ruby" that disallows an annotation.
  2. One or more occurrences of the following group of components, in the order given:
    1. WebVTT cue internal text, representing the ruby base.
    2. A WebVTT cue span start tag "rt" that disallows an annotation.
    3. A WebVTT cue ruby text span: WebVTT cue internal text, representing the ruby text component of the ruby annotation.
    4. A WebVTT cue span end tag "rt". If this is the last occurrence of this group of components in the WebVTT cue ruby span, then this last end tag string may be omitted.
  3. If the last end tag string was not omitted: Optionally, a WebVTT line terminator.
  4. If the last end tag string was not omitted: Zero or more U+0020 SPACE characters or U+0009 CHARACTER TABULATION (tab) characters, each optionally followed by a WebVTT line terminator.
  5. A WebVTT cue span end tag "ruby".

A WebVTT cue voice span consists of the following components, in the order given:

  1. A WebVTT cue span start tag "v" that requires an annotation; the annotation represents the name of the voice.
  2. WebVTT cue internal text.
  3. A WebVTT cue span end tag "v". If this WebVTT cue voice span is the only component of its WebVTT cue text sequence, then the end tag may be omitted for brevity.

A WebVTT cue span start tag has a tag name and either requires or disallows an annotation, and consists of the following components, in the order given:

  1. A U+003C LESS-THAN SIGN character (<).
  2. The tag name.
  3. Zero or more occurrences of the following sequence:
    1. U+002E FULL STOP character (.)
    2. One or more characters other than U+0009 CHARACTER TABULATION (tab) characters, U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0020 SPACE characters, U+0026 AMPERSAND characters (&), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), and U+002E FULL STOP characters (.), representing a class that describes the cue span's significance.
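  4. If the start tag requires an annotation: a U+0020 SPACE character or a U+0009 CHARACTER TABULATION (tab) character, followed by one or more of the following components, the concatenation of their representations having a value that contains at least one character other than U+0020 SPACE and U+0009 CHARACTER TABULATION (tab) characters:
    • WebVTT cue span start tag annotation text, representing the text of the annotation.
    • A WebVTT cue amp escape, representing a "&" character in the text of the annotation.
    • A WebVTT cue lt escape, representing a "<" character in the text of the annotation.
    • A WebVTT cue gt escape, representing a ">" character in the text of the annotation.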
  5. A U+003E GREATER-THAN SIGN character (>).

A WebVTT cue span end tag has a tag name and consists of the following components, in the order given:

  1. A U+003C LESS-THAN SIGN character (<).
  2. U+002F SOLIDUS character (/).
  3. The tag name.
  4. A U+003E GREATER-THAN SIGN character (>).
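
To make the start tag structure concrete, here is a small Python sketch that splits a start tag into its tag name, classes, and annotation. It is a syntax check for authored files, not the parsing algorithm; the names StartTag and parse_start_tag are hypothetical, and for brevity the sketch accepts only plain annotation text without escapes.

```python
import re
from typing import List, NamedTuple, Optional

START_TAG = re.compile(
    r"<(c|i|b|u|ruby|rt|v)"           # "<" followed by the tag name
    r"((?:\.[^\t\n\r &<>.]+)*)"       # zero or more ".class" sequences
    r"(?:[ \t]([^\n\r&>]+))?"         # optional annotation after a space or tab
    r">"
)


class StartTag(NamedTuple):
    name: str
    classes: List[str]
    annotation: Optional[str]


def parse_start_tag(tag: str) -> Optional[StartTag]:
    """Return the parts of a start tag, or None if it breaks the syntax."""
    match = START_TAG.fullmatch(tag)
    if match is None:
        return None
    name, class_part, annotation = match.groups()
    if name == "v":
        # The voice tag requires an annotation with a non-blank character.
        if annotation is None or not annotation.strip(" \t"):
            return None
    elif annotation is not None:
        # Every other tag disallows an annotation.
        return None
    classes = class_part.split(".")[1:]
    return StartTag(name, classes, annotation)
```

With this sketch, "<v Roger Bingham>" yields the name "v" and the annotation "Roger Bingham", whereas "<i Laughs>" is rejected because the italics tag disallows an annotation.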

A WebVTT cue timestamp consists of a U+003C LESS-THAN SIGN character (<), followed by a WebVTT timestamp representing the time that the given point in the cue becomes active, followed by a U+003E GREATER-THAN SIGN character (>). The time represented by the WebVTT timestamp must be greater than the times represented by any previous WebVTT cue timestamps in the cue, as well as greater than the cue's start time offset, and less than the cue's end time offset.

A WebVTT cue text span consists of one or more characters other than U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0026 AMPERSAND characters (&), and U+003C LESS-THAN SIGN characters (<).

WebVTT cue span start tag annotation text consists of one or more characters other than U+000A LINE FEED (LF) characters, U+000D CARRIAGE RETURN (CR) characters, U+0026 AMPERSAND characters (&), and U+003E GREATER-THAN SIGN characters (>).

A WebVTT cue amp escape is the five character string "&amp;".

A WebVTT cue lt escape is the four character string "&lt;".

A WebVTT cue gt escape is the four character string "&gt;".
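
The three escapes can be applied and reversed with a few lines of Python. This is a minimal sketch; the function names escape_cue_text and unescape_cue_text are assumptions of the example. Unescaping is done in a single pass so that a sequence such as "&amp;lt;" decodes to the literal text "&lt;" rather than to "<".

```python
import re

_ESCAPES = {"amp": "&", "lt": "<", "gt": ">"}


def escape_cue_text(text: str) -> str:
    """Replace "&", "<" and ">" with the corresponding WebVTT escapes."""
    # The ampersand must be handled first so that the ampersands introduced
    # by the other two escapes are not escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def unescape_cue_text(text: str) -> str:
    """Decode the three WebVTT escapes in a single left-to-right pass."""
    return re.sub(r"&(amp|lt|gt);", lambda m: _ESCAPES[m.group(1)], text)
```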

1.3 Parsing

A WebVTT parser, given an input byte stream and a text track list of cues output, must decode the byte stream as UTF-8, with error handling, and then must parse the resulting string according to the WebVTT parser algorithm below. This results in text track cues being added to output. [RFC3629]

A WebVTT parser, specifically its conversion and parsing steps, is typically run asynchronously, with the input byte stream being updated incrementally as the resource is downloaded; this is called an incremental WebVTT parser.

A WebVTT parser verifies a file signature before parsing the provided byte stream. If the stream lacks this WebVTT file signature, then the parser aborts.

The WebVTT parser algorithm is as follows:

  1. Let input be the string being parsed, after conversion to Unicode, and with the following transformations applied:

    • Replace all U+0000 NULL characters by U+FFFD REPLACEMENT CHARACTERs.

    • Replace each U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pair by a single U+000A LINE FEED (LF) character.

    • Replace all remaining U+000D CARRIAGE RETURN characters by U+000A LINE FEED (LF) characters.

  2. Let position be a pointer into input, initially pointing at the start of the string. In an incremental WebVTT parser, when this algorithm (or further algorithms that it uses) moves the position pointer, the user agent must wait until appropriate further characters from the byte stream have been added to input before moving the pointer, so that the algorithm never reads past the end of the input string. Once the byte stream has ended, and all characters have been added to input, then the position pointer may, when so instructed by the algorithms, be moved past the end of input.

  3. If the character indicated by position is a U+FEFF BYTE ORDER MARK (BOM) character, advance position to the next character in input.

  4. Let line be a string variable. Unset the already collected line flag.
  5. Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.

  6. If line is less than six characters long, then abort these steps. The file is not a WebVTT file.

  7. If line is exactly six characters long but does not exactly equal "WEBVTT", then abort these steps. The file does not start with the correct WebVTT file signature and was therefore not successfully processed.

  8. If line is more than six characters long but the first six characters do not exactly equal "WEBVTT", or the seventh character is neither a U+0020 SPACE character nor a U+0009 CHARACTER TABULATION (tab) character, then abort these steps. The file is not a WebVTT file.

  9. If position is past the end of input, then jump to the step labeled end.

  10. If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.

  11. Header: Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.

  12. If position is past the end of input, then jump to the step labeled end.

  13. If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.

  14. If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then set the already collected line flag and jump to the step labeled cue loop.

  15. If line is not the empty string, then jump back to the step labeled header.

  16. Cue loop: If the already collected line flag is set, then jump to the step labeled cue creation.

  17. Collect a sequence of characters that are U+000A LINE FEED (LF) characters.

  18. Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.

  19. If line is the empty string, then jump to the step labeled end. (In such a case, position is also forcibly past the end of input.)

  20. Cue creation: Let cue be a new text track cue associated with output's text track.

  21. Let cue's text track cue identifier be the empty string.

  22. Let cue's text track cue pause-on-exit flag be false.

  23. Let cue's text track cue writing direction be horizontal.

  24. Let cue's text track cue snap-to-lines flag be true.

  25. Let cue's text track cue line position be auto.

  26. Let cue's text track cue text position be 50.

  27. Let cue's text track cue size be 100.

  28. Let cue's text track cue alignment be middle alignment.

  29. Let cue's text track cue text be the empty string.

  30. If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then jump to the step labeled timings below.

  31. Let cue's text track cue identifier be line.

  32. If position is past the end of input, then discard cue and jump to the step labeled end.

  33. If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.

  34. Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.

  35. If line is the empty string, then discard cue and jump to the step labeled cue loop.

  36. Timings: Unset the already collected line flag.

  37. Collect WebVTT cue timings and settings from line, using cue for the results. If that fails, jump to the step labeled bad cue.

  38. Let cue text be the empty string.

  39. Cue text loop: If position is past the end of input, then jump to the step labeled cue text processing.

  40. If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.

  41. Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.

  42. If line is the empty string, then jump to the step labeled cue text processing.

  43. If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then set the already collected line flag and jump to the step labeled cue text processing.

  44. If cue text is not empty, append a U+000A LINE FEED (LF) character to cue text.

  45. Let cue text be the concatenation of cue text and line.

  46. Return to the step labeled cue text loop.

  47. Cue text processing: Let the text track cue text of cue be cue text, and let the rules for its interpretation be the WebVTT cue text parsing rules, the WebVTT cue text rendering rules, and the WebVTT cue text DOM construction rules.

  48. Add cue to the text track list of cues output.

  49. Jump to the step labeled cue loop.

  50. Bad cue: Discard cue.

  51. Bad cue loop: If position is past the end of input, then jump to the step labeled end.

  52. If the character indicated by position is a U+000A LINE FEED (LF) character, advance position to the next character in input.

  53. Collect a sequence of characters that are not U+000A LINE FEED (LF) characters. Let line be those characters, if any.

  54. If line contains the three-character substring "-->" (U+002D HYPHEN-MINUS, U+002D HYPHEN-MINUS, U+003E GREATER-THAN SIGN), then set the already collected line flag and jump to the step labeled cue loop.

  55. If line is the empty string, then jump to the step labeled cue loop.

  56. Otherwise, jump to the step labeled bad cue loop.

  57. End: The file has ended. Abort these steps. The WebVTT parser has finished.
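
The overall shape of the algorithm above may be easier to follow as code. The following Python sketch is a simplified, line-based approximation, not a faithful transcription: it works on a complete string rather than an incremental byte stream, it leaves the timings line unparsed, and it does not reproduce every end-of-input edge case. The names RawCue and split_into_cues are hypothetical.

```python
from typing import List, NamedTuple, Optional


class RawCue(NamedTuple):
    identifier: str
    timing_line: str  # unparsed cue timings and settings
    text: str


def split_into_cues(source: str) -> Optional[List[RawCue]]:
    """Return the raw cues of a WebVTT string, or None if the signature is bad."""
    # Step 1: replace NULL characters and normalise line terminators to LF.
    source = source.replace("\x00", "\ufffd")
    source = source.replace("\r\n", "\n").replace("\r", "\n")
    # Step 3: skip a leading byte order mark.
    if source.startswith("\ufeff"):
        source = source[1:]
    lines = source.split("\n")
    # Steps 5 to 8: check the signature line.
    signature = lines[0]
    if signature != "WEBVTT" and signature[:7] not in ("WEBVTT ", "WEBVTT\t"):
        return None
    i = 1
    # Header: skip lines up to the first blank line or line containing "-->".
    while i < len(lines) and lines[i] != "" and "-->" not in lines[i]:
        i += 1
    cues: List[RawCue] = []
    while i < len(lines):
        if lines[i] == "":  # cue loop: skip blank lines between cues
            i += 1
            continue
        identifier = ""
        if "-->" not in lines[i]:  # the line is a cue identifier
            identifier = lines[i]
            i += 1
            if i >= len(lines) or lines[i] == "":
                continue  # identifier without timings: discard the cue
        timing_line = lines[i]
        i += 1
        payload = []
        # Cue text loop: stop at a blank line or at the next timings line.
        while i < len(lines) and lines[i] != "" and "-->" not in lines[i]:
            payload.append(lines[i])
            i += 1
        if "-->" in timing_line:  # otherwise this was a bad cue
            cues.append(RawCue(identifier, timing_line, "\n".join(payload)))
    return cues
```

A caller would still need to run the timings and settings algorithm on each timing_line and drop any cue for which that step fails, which corresponds to the bad cue branch above.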

When the algorithm above requires that the user agent collect WebVTT cue timings and settings from a string input for a text track cue cue, the user agent must run the following algorithm.

  1. Let input be the string being parsed.

  2. Let position be a pointer into input, initially pointing at the start of the string.

  3. Skip whitespace.

  4. Collect a WebVTT timestamp. If that algorithm fails, then abort these steps and return failure. Otherwise, let cue's text track cue start time be the collected time.

  5. Skip whitespace.

  6. If the character at position is not a U+002D HYPHEN-MINUS character (-) then abort these steps and return failure. Otherwise, move position forwards one character.

  7. If the character at position is not a U+002D HYPHEN-MINUS character (-) then abort these steps and return failure. Otherwise, move position forwards one character.

  8. If the character at position is not a U+003E GREATER-THAN SIGN character (>) then abort these steps and return failure. Otherwise, move position forwards one character.

  9. Skip whitespace.

  10. Collect a WebVTT timestamp. If that algorithm fails, then abort these steps and return failure. Otherwise, let cue's text track cue end time be the collected time.

  11. Parse the WebVTT settings for cue.
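
The steps above can be sketched in Python as shown below. Timestamp collection is done here with a regular expression instead of the character-by-character steps given later in this section; the range checks on minutes and seconds follow the 0 to 59 limits of the syntax section. The helper names, the tuple return value, and the use of seconds as a float are all assumptions of this sketch.

```python
import re
from typing import Optional, Tuple

# Optional hours, then two-digit minutes and seconds, then three fraction digits.
TIMESTAMP = re.compile(r"(?:([0-9]+):)?([0-9]{2}):([0-9]{2})\.([0-9]{3})(?![0-9])")
SPACE_CHARACTERS = " \t\n\f\r"


def skip_whitespace(line: str, pos: int) -> int:
    while pos < len(line) and line[pos] in SPACE_CHARACTERS:
        pos += 1
    return pos


def collect_timestamp(line: str, pos: int) -> Optional[Tuple[float, int]]:
    """Return (time in seconds, new position), or None on failure."""
    match = TIMESTAMP.match(line, pos)
    if match is None:
        return None
    hours = int(match.group(1) or 0)
    minutes, seconds, millis = (int(match.group(n)) for n in (2, 3, 4))
    if minutes > 59 or seconds > 59:
        return None
    return hours * 3600 + minutes * 60 + seconds + millis / 1000, match.end()


def collect_timings(line: str) -> Optional[Tuple[float, float, str]]:
    """Return (start, end, unparsed settings), or None if the line is malformed."""
    pos = skip_whitespace(line, 0)
    collected = collect_timestamp(line, pos)
    if collected is None:
        return None
    start, pos = collected
    pos = skip_whitespace(line, pos)
    # Steps 6 to 8: the three characters "-->" must come next.
    if not line.startswith("-->", pos):
        return None
    pos = skip_whitespace(line, pos + 3)
    collected = collect_timestamp(line, pos)
    if collected is None:
        return None
    end, pos = collected
    # Whatever remains on the line is handed to the settings parser.
    return start, end, line[pos:]
```

Note that, like the steps above, this sketch does not reject a cue whose end time is not greater than its start time; that is a constraint on authored files rather than a check in this part of the algorithm.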

When the user agent is to parse the WebVTT settings for a text track cue cue, the user agent must run the following steps:

  1. Let input and position be the same variables as those of the same name in the algorithm that invoked these steps.

  2. Settings: Skip whitespace.

  3. If position is beyond the end of input then abort these steps.

  4. Let setting be the character at position, and move position forwards one character.

  5. If position is beyond the end of input then abort these steps.

  6. If the character at position is a space character, then jump back to the step labeled settings.

  7. If the character at position is not a U+003A COLON character (:), then set setting to the empty string.

  8. Move position forwards one character.

  9. If position is beyond the end of input then abort these steps.

  10. Run the appropriate substeps that apply for the value of setting, as follows:

    If setting is a U+0044 LATIN CAPITAL LETTER D character
    1. Collect a sequence of characters that are not space characters. Let value be those characters, if any.

    2. If value is a case-sensitive match for the string "vertical", then let cue's text track cue writing direction be vertical growing left.

    3. Otherwise, if value is a case-sensitive match for the string "vertical-lr", then let cue's text track cue writing direction be vertical growing right.

    If setting is a U+004C LATIN CAPITAL LETTER L character
    1. Collect a sequence of characters that are either U+002D HYPHEN-MINUS characters (-), U+0025 PERCENT SIGN characters (%), or characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9). Let value be those characters, if any.

    2. If position is not beyond the end of input but the character at position is not a space character, then jump to the "otherwise" case below.

    3. If value does not contain at least one character in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), then jump back to the step labeled settings.

    4. If any character in value other than the first character is a U+002D HYPHEN-MINUS character (-), then jump back to the step labeled settings.

    5. If any character in value other than the last character is a U+0025 PERCENT SIGN character (%), then jump back to the step labeled settings.

    6. If the first character in value is a U+002D HYPHEN-MINUS character (-) and the last character in value is a U+0025 PERCENT SIGN character (%), then jump back to the step labeled settings.

    7. Ignoring the trailing percent sign, if any, interpret value as a (potentially signed) integer, and let number be that number.

    8. If the last character in value is a U+0025 PERCENT SIGN character (%), but number is not in the range 0 ≤ number ≤ 100, then jump back to the step labeled settings.

    9. Let cue's text track cue line position be number.

    10. If the last character in value is a U+0025 PERCENT SIGN character (%), then let cue's text track cue snap-to-lines flag be false.

    If setting is a U+0054 LATIN CAPITAL LETTER T character
    1. Collect a sequence of characters that are in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9). Let value be those characters, if any.

    2. If position is beyond the end of input then jump back to the step labeled settings.

    3. If the character at position is not a U+0025 PERCENT SIGN character (%), then jump to the "otherwise" case below.

    4. Move position forwards one character.

    5. If position is not beyond the end of input but the character at position is not a space character, then jump to the "otherwise" case below.

    6. If value is the empty string, then jump back to the step labeled settings.

    7. Interpret value as an integer, and let number be that number.

    8. If number is not in the range 0 ≤ number ≤ 100, then jump back to the step labeled settings.

    9. Let cue's text track cue text position be number.

    If setting is a U+0053 LATIN CAPITAL LETTER S character
    1. Collect a sequence of characters that are in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9). Let value be those characters, if any.

    2. If position is beyond the end of input then jump back to the step labeled settings.

    3. If the character at position is not a U+0025 PERCENT SIGN character (%), then jump to the "otherwise" case below.

    4. Move position forwards one character.

    5. If position is not beyond the end of input but the character at position is not a space character, then jump to the "otherwise" case below.

    6. If value is the empty string, then jump back to the step labeled settings.

    7. Interpret value as an integer, and let number be that number.

    8. If number is not in the range 0 ≤ number ≤ 100, then jump back to the step labeled settings.

    9. Let cue's text track cue size be number.

    If setting is a U+0041 LATIN CAPITAL LETTER A character
    1. Collect a sequence of characters that are not space characters. Let value be those characters, if any.

    2. If value is a case-sensitive match for the string "start", then let cue's text track cue alignment be start alignment.

    3. If value is a case-sensitive match for the string "middle", then let cue's text track cue alignment be middle alignment.

    4. If value is a case-sensitive match for the string "end", then let cue's text track cue alignment be end alignment.

    Otherwise

    Collect a sequence of characters that are not space characters and discard them.

  11. Jump back to the step labeled settings.
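
A compact way to approximate the settings steps is to split the settings string on space characters and validate each token as a whole. The Python sketch below takes that token-based shortcut instead of moving a position pointer, so it is an approximation of the algorithm rather than a transcription of it. The dictionary keys and the function name parse_cue_settings are assumptions; the default values are those assigned during cue creation in the parser algorithm.

```python
import re

SPACE_RUN = re.compile(r"[ \t\n\f\r]+")
LINE_VALUE = re.compile(r"(-?[0-9]+)|([0-9]+)%")
PERCENTAGE = re.compile(r"([0-9]+)%")
DIRECTIONS = {"vertical": "vertical growing left",
              "vertical-lr": "vertical growing right"}


def parse_cue_settings(settings: str) -> dict:
    """Return the cue settings, ignoring any token that is not well formed."""
    cue = {"direction": "horizontal", "snap_to_lines": True, "line": "auto",
           "position": 50, "size": 100, "align": "middle"}
    for token in SPACE_RUN.split(settings):
        # A usable token is one letter, a colon, and a non-empty value.
        if len(token) < 3 or token[1] != ":":
            continue
        key, value = token[0], token[2:]
        if key == "D" and value in DIRECTIONS:
            cue["direction"] = DIRECTIONS[value]
        elif key == "L":
            match = LINE_VALUE.fullmatch(value)
            if match is None:
                continue
            if match.group(2) is not None:  # percentage form
                number = int(match.group(2))
                if number <= 100:
                    cue["line"] = number
                    cue["snap_to_lines"] = False
            else:  # line number form, possibly negative
                cue["line"] = int(match.group(1))
        elif key in ("T", "S"):
            match = PERCENTAGE.fullmatch(value)
            if match is not None and int(match.group(1)) <= 100:
                cue["position" if key == "T" else "size"] = int(match.group(1))
        elif key == "A" and value in ("start", "middle", "end"):
            cue["align"] = value
    return cue
```

As in the steps above, unknown or malformed settings are skipped without failing the cue, and a later occurrence of a setting overrides an earlier one.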

When this specification says that a user agent is to collect a WebVTT timestamp, the user agent must run the following steps:

  1. Let input and position be the same variables as those of the same name in the algorithm that invoked these steps.

  2. Let most significant units be minutes.

  3. If position is past the end of input, return an error and abort these steps.

  4. If the character indicated by position is not one of U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), then return an error and abort these steps.

  5. Collect a sequence of characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), and let string be the collected substring.

  6. Interpret string as a base-ten integer. Let value1 be that integer.

  7. If string is not exactly two characters in length, or if value1 is greater than 59, let most significant units be hours.

  8. If position is beyond the end of input or if the character at position is not a U+003A COLON character (:), then return an error and abort these steps. Otherwise, move position forwards one character.

  9. Collect a sequence of characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), and let string be the collected substring.

  10. If string is not exactly two characters in length, return an error and abort these steps.

  11. Interpret string as a base-ten integer. Let value2 be that integer.

  12. If most significant units is hours, or if position is not beyond the end of input and the character at position is a U+003A COLON character (:), run these substeps:

    1. If position is beyond the end of input or if the character at position is not a U+003A COLON character (:), then return an error and abort these steps. Otherwise, move position forwards one character.

    2. Collect a sequence of characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), and let string be the collected substring.

    3. If string is not exactly two characters in length, return an error and abort these steps.

    4. Interpret string as a base-ten integer. Let value3 be that integer.

    Otherwise (if most significant units is not hours, and either position is beyond the end of input, or the character at position is not a U+003A COLON character (:)), let value3 have the value of value2, then value2 have the value of value1, then let value1 equal zero.

  13. If position is beyond the end of input or if the character at position is not a U+002E FULL STOP character (.), then return an error and abort these steps. Otherwise, move position forwards one character.

  14. Collect a sequence of characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), and let string be the collected substring.

  15. If string is not exactly three characters in length, return an error and abort these steps.

  16. Interpret string as a base-ten integer. Let value4 be that integer.

  17. If value2 is greater than 59 or if value3 is greater than 59, return an error and abort these steps.

  18. Let result be value1×60×60 + value2×60 + value3 + value4∕1000.

  19. Return result.
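
The steps above can be sketched in Python as follows. This is an illustrative reading of the algorithm, not a reference implementation: the names collect_digits and collect_timestamp are hypothetical, the input and position variables are passed explicitly instead of being shared with the caller, and "return an error" is modelled as returning None.

```python
from typing import Optional, Tuple

DIGITS = "0123456789"


def collect_digits(text: str, pos: int) -> Tuple[str, int]:
    """Collect a run of ASCII digits starting at pos; return (digits, new_pos)."""
    start = pos
    while pos < len(text) and text[pos] in DIGITS:
        pos += 1
    return text[start:pos], pos


def collect_timestamp(text: str, pos: int = 0) -> Optional[Tuple[float, int]]:
    """Return (seconds, new_pos), or None where the steps return an error."""
    # Steps 3-4: a digit must be present at the starting position.
    if pos >= len(text) or text[pos] not in DIGITS:
        return None
    # Steps 5-7: the first field is minutes unless it is not exactly
    # two digits long or exceeds 59, in which case it must be hours.
    field, pos = collect_digits(text, pos)
    value1 = int(field)
    hours_first = len(field) != 2 or value1 > 59
    # Step 8: a colon must follow.
    if pos >= len(text) or text[pos] != ":":
        return None
    pos += 1
    # Steps 9-11: second field, exactly two digits.
    field, pos = collect_digits(text, pos)
    if len(field) != 2:
        return None
    value2 = int(field)
    # Step 12: either read a third field, or shift the fields along.
    if hours_first or (pos < len(text) and text[pos] == ":"):
        if pos >= len(text) or text[pos] != ":":
            return None
        pos += 1
        field, pos = collect_digits(text, pos)
        if len(field) != 2:
            return None
        value3 = int(field)
    else:
        value1, value2, value3 = 0, value1, value2
    # Steps 13-16: a full stop followed by exactly three digits.
    if pos >= len(text) or text[pos] != ".":
        return None
    pos += 1
    field, pos = collect_digits(text, pos)
    if len(field) != 3:
        return None
    value4 = int(field)
    # Step 17: minutes and seconds must not exceed 59.
    if value2 > 59 or value3 > 59:
        return None
    # Steps 18-19.
    return value1 * 3600 + value2 * 60 + value3 + value4 / 1000, pos
```

In this sketch, a single-digit first field such as "1:02.000" is treated as hours and therefore fails for lack of a third field, which matches step 7 and the first substep of step 12.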

1.4 WebVTT cue text parsing rules

A WebVTT Node Object is a conceptual construct used to represent components of WebVTT cue text so that its processing can be described without reference to the underlying syntax.

There are two broad classes of WebVTT Node Objects: WebVTT Internal Node Objects and WebVTT Leaf Node Objects.

WebVTT Internal Node Objects are those that can contain further WebVTT Node Objects. They are conceptually similar to elements in HTML or the DOM. WebVTT Internal Node Objects have an ordered list of child WebVTT Node Objects. The WebVTT Internal Node Object is said to be the parent of the children. Cycles do not occur; the parent-child relationships so constructed form a tree structure. WebVTT Internal Node Objects also have an ordered list of class names, known as their applicable classes.

There are several concrete classes of WebVTT Internal Node Objects:

Lists of WebVTT Node Objects

These are used as root nodes for trees of WebVTT Node Objects.

WebVTT Class Objects

These represent spans of text (a WebVTT cue class span) in WebVTT cue text, and are used to annotate parts of the cue with applicable classes without implying further meaning (such as italics or bold).

WebVTT Italic Objects

These represent spans of italic text (a WebVTT cue italics span) in WebVTT cue text.

WebVTT Bold Objects

These represent spans of bold text (a WebVTT cue bold span) in WebVTT cue text.

WebVTT Underline Objects

These represent spans of underlined text (a WebVTT cue underline span) in WebVTT cue text.

WebVTT Ruby Objects

These represent spans of ruby (a WebVTT cue ruby span) in WebVTT cue text.

WebVTT Ruby Text Objects

These represent spans of ruby text (a WebVTT cue ruby text span) in WebVTT cue text.

WebVTT Voice Objects

These represent spans of text associated with a specific voice (a WebVTT cue voice span) in WebVTT cue text. A WebVTT Voice Object has a value, which is the name of the voice.

WebVTT Leaf Node Objects are those that contain data, such as text, and cannot contain child WebVTT Node Objects.

There are two concrete classes of WebVTT Leaf Node Objects:

WebVTT Text Objects

A fragment of text. A WebVTT Text Object has a value, which is the text it represents.

WebVTT Timestamp Objects

A timestamp. A WebVTT Timestamp Object has a value, in seconds and fractions of a second, which is the time represented by the timestamp.
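
One possible way to model this object hierarchy is sketched below in Python. The class names, the kind labels, and the append helper are all hypothetical; the specification only describes these objects conceptually and does not prescribe any particular representation.

```python
from typing import List, Optional


class Node:
    """Base for every WebVTT Node Object in this sketch."""

    def __init__(self) -> None:
        self.parent: Optional["InternalNode"] = None


class InternalNode(Node):
    """A node that can have children and a list of applicable classes."""

    # Hypothetical labels, one per concrete class named in the text.
    KINDS = ("list", "class", "italic", "bold",
             "underline", "ruby", "rt", "voice")

    def __init__(self, kind: str, classes: Optional[List[str]] = None,
                 value: str = "") -> None:
        super().__init__()
        if kind not in self.KINDS:
            raise ValueError(f"unknown internal node kind: {kind!r}")
        self.kind = kind
        self.classes: List[str] = list(classes or [])
        self.children: List[Node] = []
        self.value = value  # only meaningful for voice objects

    def append(self, child: Node) -> Node:
        """Add child as the last child and record its parent."""
        child.parent = self
        self.children.append(child)
        return child


class TextNode(Node):
    """Leaf node holding a fragment of text."""

    def __init__(self, value: str) -> None:
        super().__init__()
        self.value = value


class TimestampNode(Node):
    """Leaf node holding a time in seconds."""

    def __init__(self, seconds: float) -> None:
        super().__init__()
        self.value = seconds
```

A list of WebVTT Node Objects would then be InternalNode("list"), with voice, italic and other spans appended beneath it.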

To parse a string input supposedly containing WebVTT cue text, user agents must use the following algorithm. This algorithm returns a list of WebVTT Node Objects.

  1. Let input be the string being parsed.

  2. Let position be a pointer into input, initially pointing at the start of the string.

  3. Let result be a List of WebVTT Node Objects, initially empty.

  4. Let current be the WebVTT Internal Node Object result.

  5. Loop: If position is past the end of input, return result and abort these steps.

  6. Let token be the result of invoking the WebVTT cue text tokenizer.

  7. Run the appropriate steps given the type of token:

    If token is a string
    1. Create a WebVTT Text Object whose value is the value of the string token token.

    2. Append the newly created WebVTT Text Object to current.

    If token is a start tag

    How the start tag token token is processed depends on its tag name, as follows:

    If the tag name is "c"

    Attach a WebVTT Class Object.

    If the tag name is "i"

    Attach a WebVTT Italic Object.

    If the tag name is "b"

    Attach a WebVTT Bold Object.

    If the tag name is "u"

    Attach a WebVTT Underline Object.

    If the tag name is "ruby"

    Attach a WebVTT Ruby Object.

    If the tag name is "rt"

    If current is a WebVTT Ruby Object, then attach a WebVTT Ruby Text Object.

    If the tag name is "v"

    Attach a WebVTT Voice Object, and set its value to the token's annotation string, or the empty string if there is no annotation string.

    Otherwise

    Ignore the token.

    When the steps above say to attach a WebVTT Internal Node Object of a particular concrete class, the user agent must run the following steps:

    1. Create a new WebVTT Internal Node Object of the specified concrete class.

    2. Set the new object's list of applicable classes to the list of classes in the token, excluding any classes that are the empty string.

    3. Append the newly created node object to current.

    4. Let current be the newly created node object.

    If token is an end tag

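    If any of the following conditions is true, then let current be the parent node of current.

      • The tag name of the end tag token token is "c" and current is a WebVTT Class Object.
      • The tag name of the end tag token token is "i" and current is a WebVTT Italic Object.
      • The tag name of the end tag token token is "b" and current is a WebVTT Bold Object.
      • The tag name of the end tag token token is "u" and current is a WebVTT Underline Object.
      • The tag name of the end tag token token is "ruby" and current is a WebVTT Ruby Object.
      • The tag name of the end tag token token is "rt" and current is a WebVTT Ruby Text Object.
      • The tag name of the end tag token token is "v" and current is a WebVTT Voice Object.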

    Otherwise, if the tag name of the end tag token token is "ruby" and current is a WebVTT Ruby Text Object, then let current be the parent node of the parent node of current.

    Otherwise, ignore the token.

    If token is a timestamp tag
    1. Let input be the tag value.

    2. Let position be a pointer into input, initially pointing at the start of the string.

    3. Collect a WebVTT timestamp.

    4. If that algorithm does not fail, and if position now points at the end of input (i.e. there are no trailing characters after the timestamp), then create a WebVTT Timestamp Object whose value is the collected time, then append it to current.

      Otherwise, ignore the token.

  8. Jump to the step labeled loop.
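
The tree-building loop above can be sketched in Python as follows. Several things here are assumptions made for illustration: tokens arrive as already-tokenized tuples of the hypothetical shapes ("string", value), ("start", name, classes, annotation), ("end", name) and ("timestamp", value); the Node class and its kind labels are invented; an end tag is taken to close the current node only when its name matches the current node's type; and simple_timestamp is a simplified regular-expression stand-in for the "collect a WebVTT timestamp" steps.

```python
import re
from typing import Callable, Iterable, Optional

# Tag name -> hypothetical node kind. "rt" is handled separately on start.
START_KINDS = {"c": "class", "i": "italic", "b": "bold",
               "u": "underline", "ruby": "ruby", "v": "voice"}
END_KINDS = dict(START_KINDS, rt="rt")


class Node:
    def __init__(self, kind, parent=None, classes=(), value=None):
        self.kind, self.parent, self.value = kind, parent, value
        # Empty class names are excluded, as in the attach steps.
        self.classes = [c for c in classes if c != ""]
        self.children = []


def simple_timestamp(text: str) -> Optional[float]:
    """Stand-in parser: [h+:]mm:ss.ttt, with no trailing characters."""
    m = re.fullmatch(r"(?:(\d+):)?([0-5]\d):([0-5]\d)\.(\d{3})", text)
    if m is None:
        return None
    h, mi, s, ms = m.groups()
    return int(h or 0) * 3600 + int(mi) * 60 + int(s) + int(ms) / 1000


def build_tree(tokens: Iterable[tuple],
               parse_timestamp: Callable[[str], Optional[float]] = simple_timestamp) -> Node:
    root = Node("list")
    current = root
    for token in tokens:
        if token[0] == "string":
            current.children.append(Node("text", current, value=token[1]))
        elif token[0] == "start":
            _, name, classes, annotation = token
            if name == "rt":
                if current.kind != "ruby":
                    continue  # "rt" is only attached inside a ruby object
                kind = "rt"
            elif name in START_KINDS:
                kind = START_KINDS[name]
            else:
                continue  # unknown tag names are ignored
            node = Node(kind, current, classes)
            if kind == "voice":
                node.value = annotation or ""
            current.children.append(node)
            current = node
        elif token[0] == "end":
            name = token[1]
            if END_KINDS.get(name) == current.kind:
                current = current.parent
            elif name == "ruby" and current.kind == "rt":
                current = current.parent.parent  # closes both rt and ruby
        elif token[0] == "timestamp":
            seconds = parse_timestamp(token[1])
            if seconds is not None:
                current.children.append(Node("timestamp", current, value=seconds))
    return root
```

Unclosed spans need no special handling in this sketch: the loop simply ends, and whatever was attached stays in the tree.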

The WebVTT cue text tokenizer is as follows. It emits a token, which is either a string (whose value is a sequence of characters), a start tag (with a tag name, a list of classes, and optionally an annotation), an end tag (with a tag name), or a timestamp tag (with a tag value).

  1. Let input and position be the same variables as those of the same name in the algorithm that invoked these steps.

  2. Let tokenizer state be WebVTT data state.

  3. Let result be the empty string.

  4. Let buffer be the empty string.

  5. Let classes be an empty list.

  6. Loop: If position is past the end of input, let c be an end-of-file marker. Otherwise, let c be the character in input pointed to by position.

    An end-of-file marker is not a Unicode character; it is used to end the tokenizer.

  7. Jump to the state given by tokenizer state:

    WebVTT data state

    Jump to the entry that matches the value of c:

    U+0026 AMPERSAND (&)

    Set buffer to c, set tokenizer state to the WebVTT escape state, and jump to the step labeled next.

    U+003C LESS-THAN SIGN (<)

    If result is the empty string, then set tokenizer state to the WebVTT tag state and jump to the step labeled next.

    Otherwise, return a string token whose value is result and abort these steps.

    End-of-file marker

    Return a string token whose value is result and abort these steps.

    Anything else

    Append c to result and jump to the step labeled next.

    WebVTT escape state

    Jump to the entry that matches the value of c:

    U+0026 AMPERSAND (&)

    Append buffer to result, set buffer to c, and jump to the step labeled next.

    Characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9)
    Characters in the range U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z
    Characters in the range U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z

    Append c to buffer and jump to the step labeled next.

    U+003B SEMICOLON character (;)

    First, examine the value of buffer:

    If buffer is the string "&amp", then append a U+0026 AMPERSAND character (&) to result.

    If buffer is the string "&lt", then append a U+003C LESS-THAN SIGN character (<) to result.

    If buffer is the string "&gt", then append a U+003E GREATER-THAN SIGN character (>) to result.

    Otherwise, append buffer followed by a U+003B SEMICOLON character (;) to result.

    Then, in any case, set tokenizer state to the WebVTT data state, and jump to the step labeled next.

    U+003C LESS-THAN SIGN (<)
    End-of-file marker

    Append buffer to result, return a string token whose value is result, and abort these steps.

    Anything else

    Append buffer to result, append c to result, set tokenizer state to the WebVTT data state, and jump to the step labeled next.

    WebVTT tag state

    Jump to the entry that matches the value of c:

    U+0009 CHARACTER TABULATION (tab) character
    U+000A LINE FEED (LF) character
    U+000C FORM FEED (FF) character
    U+0020 SPACE character

    Set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.

    U+002E FULL STOP character (.)

    Set tokenizer state to the WebVTT start tag class state, and jump to the step labeled next.

    U+002F SOLIDUS character (/)

    Set tokenizer state to the WebVTT end tag state, and jump to the step labeled next.

    Characters in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9)

    Set result to c, set tokenizer state to the WebVTT timestamp tag state, and jump to the step labeled next.

    U+003E GREATER-THAN SIGN character (>)

    Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.

    End-of-file marker

    Return a start tag whose tag name is the empty string, with no classes and no annotation, and abort these steps.

    Anything else

    Set result to c, set tokenizer state to the WebVTT start tag state, and jump to the step labeled next.

    WebVTT start tag state

    Jump to the entry that matches the value of c:

    U+0009 CHARACTER TABULATION (tab) character
    U+000C FORM FEED (FF) character
    U+0020 SPACE character

    Set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.

    U+000A LINE FEED (LF) character

    Set buffer to c, set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.

    U+002E FULL STOP character (.)

    Set tokenizer state to the WebVTT start tag class state, and jump to the step labeled next.

    U+003E GREATER-THAN SIGN character (>)

    Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.

    End-of-file marker

    Return a start tag whose tag name is result, with no classes and no annotation, and abort these steps.

    Anything else

    Append c to result and jump to the step labeled next.

    WebVTT start tag class state

    Jump to the entry that matches the value of c:

    U+0009 CHARACTER TABULATION (tab) character
    U+000C FORM FEED (FF) character
    U+0020 SPACE character

    Append to classes an entry whose value is buffer, set buffer to the empty string, set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.

    U+000A LINE FEED (LF) character

    Append to classes an entry whose value is buffer, set buffer to c, set tokenizer state to the WebVTT start tag annotation state, and jump to the step labeled next.

    U+002E FULL STOP character (.)

    Append to classes an entry whose value is buffer, set buffer to the empty string, and jump to the step labeled next.

    U+003E GREATER-THAN SIGN character (>)

    Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.

    End-of-file marker

    Append to classes an entry whose value is buffer, then return a start tag whose tag name is result, with the classes given in classes but no annotation, and abort these steps.

    Anything else

    Append c to buffer and jump to the step labeled next.

    WebVTT start tag annotation state

    Jump to the entry that matches the value of c:

    U+003E GREATER-THAN SIGN character (>)

    Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.

    End-of-file marker

    Remove any leading or trailing space characters from buffer, and replace any sequence of one or more consecutive space characters in buffer with a single U+0020 SPACE character; then, return a start tag whose tag name is result, with the classes given in classes, and with buffer as the annotation, and abort these steps.

    Anything else

    Append c to buffer and jump to the step labeled next.

    WebVTT end tag state

    Jump to the entry that matches the value of c:

    U+003E GREATER-THAN SIGN character (>)

    Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.

    End-of-file marker

    Return an end tag whose tag name is result and abort these steps.

    Anything else

    Append c to result and jump to the step labeled next.

    WebVTT timestamp tag state

    Jump to the entry that matches the value of c:

    U+003E GREATER-THAN SIGN character (>)

    Advance position to the next character in input, then jump to the next "end-of-file marker" entry below.

    End-of-file marker

    Return a timestamp tag whose tag value is result and abort these steps.

    Anything else

    Append c to result and jump to the step labeled next.

  8. Next: Advance position to the next character in input.

  9. Jump to the step labeled loop.
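
The state machine above can be sketched compactly in Python. This is an illustrative reading, not a reference implementation. The function names, the short state labels, and the tuple shapes used for tokens are hypothetical, and position is returned to the caller instead of being a shared variable. "Space characters" in the annotation clean-up are assumed to be space, tab, LF, FF and CR.

```python
import re
import string
from typing import List, Tuple

SPACES = frozenset("\t\n\f ")
DIGITS = frozenset(string.digits)
ALNUM = frozenset(string.ascii_letters + string.digits)
ESCAPES = {"&amp": "&", "&lt": "<", "&gt": ">"}
EOF = None  # end-of-file marker; deliberately not a character


def next_token(text: str, pos: int) -> Tuple[tuple, int]:
    """Emit one token starting at pos; return (token, new_pos)."""
    state, result, buffer = "data", "", ""
    classes: List[str] = []
    while True:
        c = text[pos] if pos < len(text) else EOF
        # '>' advances one character, then behaves like the EOF entry.
        closing = c == ">" or c is EOF
        end = pos + 1 if c == ">" else pos
        if state == "data":
            if c == "&":
                buffer, state = c, "escape"
            elif c == "<" and result == "":
                state = "tag"
            elif c == "<" or c is EOF:
                return ("string", result), pos
            else:
                result += c
        elif state == "escape":
            if c == "&":
                result, buffer = result + buffer, c
            elif c in ALNUM:
                buffer += c
            elif c == ";":
                result += ESCAPES.get(buffer, buffer + ";")
                state = "data"
            elif c == "<" or c is EOF:
                return ("string", result + buffer), pos
            else:
                result, state = result + buffer + c, "data"
        elif state == "tag":
            if c in SPACES:
                state = "annotation"
            elif c == ".":
                state = "class"
            elif c == "/":
                state = "end"
            elif c in DIGITS:
                result, state = c, "timestamp"
            elif closing:
                return ("start", "", [], None), end
            else:
                result, state = c, "start"
        elif state == "start":
            if c in SPACES:
                buffer = c if c == "\n" else ""
                state = "annotation"
            elif c == ".":
                state = "class"
            elif closing:
                return ("start", result, [], None), end
            else:
                result += c
        elif state == "class":
            if c in SPACES or c == "." or closing:
                classes.append(buffer)
                if closing:
                    return ("start", result, classes, None), end
                buffer = c if c == "\n" else ""
                state = "class" if c == "." else "annotation"
            else:
                buffer += c
        elif state == "annotation":
            if closing:
                note = re.sub(r"[ \t\n\f\r]+", " ", buffer).strip(" ")
                return ("start", result, classes, note), end
            buffer += c
        else:  # the "end" and "timestamp" states have the same shape
            if closing:
                return (state, result), end
            result += c
        pos += 1


def tokenize(text: str) -> List[tuple]:
    """Call next_token repeatedly until the input is exhausted."""
    tokens, pos = [], 0
    while pos < len(text):
        token, pos = next_token(text, pos)
        tokens.append(token)
    return tokens
```

Note that a string token is returned without consuming the "<" that ended it, so the next call starts in the data state with an empty result and moves straight to the tag state.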

1.5 WebVTT cue text DOM construction rules

To convert a List of WebVTT Node Objects to a DOM tree for Document owner, user agents must create a tree of DOM nodes that is isomorphous to the tree of WebVTT Node Objects, with the following mapping of WebVTT Node Objects to DOM nodes:

WebVTT Node Object: DOM node
List of WebVTT Node Objects: DocumentFragment node
WebVTT Class Object: HTMLElement element node with localName "span".
WebVTT Italic Object: HTMLElement element node with localName "i".
WebVTT Bold Object: HTMLElement element node with localName "b".
WebVTT Underline Object: HTMLElement element node with localName "u".
WebVTT Ruby Object: HTMLElement element node with localName "ruby".
WebVTT Ruby Text Object: HTMLElement element node with localName "rt".
WebVTT Voice Object: HTMLElement element node with localName "span", a title attribute set to the WebVTT Voice Object's value.
WebVTT Text Object: Text node whose character data is the value of the WebVTT Text Object.
WebVTT Timestamp Object: ProcessingInstruction node whose target is "timestamp" and whose data is a WebVTT timestamp representing the value of the WebVTT Timestamp Object, with all optional components included.

HTMLElement nodes created as part of the mapping described above must have their namespaceURI set to the HTML namespace, and must have a class attribute set to the string obtained by concatenating all the classes that apply to the corresponding WebVTT Internal Node Object, each separated from the next by a single U+0020 SPACE character.

The ownerDocument attribute of all nodes in the DOM tree must be set to the given document owner.

All characteristics of the DOM nodes that are not described above or dependent on characteristics defined above must be left at their initial values.
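
As a rough illustration of this mapping, the sketch below builds the DOM tree with Python's xml.dom.minidom. The input representation (nested dictionaries with hypothetical "kind", "classes", "value" and "children" keys) and the helper names are assumptions made for the example. minidom is a generic XML DOM, so the nodes are plain elements in the HTML namespace and not true HTMLElement objects.

```python
from xml.dom.minidom import getDOMImplementation

HTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical node kind -> element name, following the mapping above.
LOCAL_NAMES = {"class": "span", "italic": "i", "bold": "b",
               "underline": "u", "ruby": "ruby", "rt": "rt",
               "voice": "span"}


def format_timestamp(seconds: float) -> str:
    """Serialise seconds as hh:mm:ss.ttt, with the optional hours included."""
    millis = round(seconds * 1000)
    hours, rest = divmod(millis, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_dom(node: dict, document):
    """Recursively convert one node dictionary into a DOM node."""
    kind = node["kind"]
    if kind == "text":
        return document.createTextNode(node["value"])
    if kind == "timestamp":
        return document.createProcessingInstruction(
            "timestamp", format_timestamp(node["value"]))
    if kind == "list":
        dom_node = document.createDocumentFragment()
    else:
        dom_node = document.createElementNS(HTML_NS, LOCAL_NAMES[kind])
        # Applicable classes joined by single spaces.
        dom_node.setAttribute("class", " ".join(node.get("classes", [])))
        if kind == "voice":
            dom_node.setAttribute("title", node.get("value", ""))
    for child in node.get("children", []):
        dom_node.appendChild(to_dom(child, document))
    return dom_node
```

Because every node is created through the owner document's factory methods, each one reports that document as its ownerDocument.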

2 WebVTT cue text rendering rules

EXTRACT from Section 15.4.2.1 WHATWG HTML Specification.

The rules for updating the display of WebVTT text tracks render the text tracks of a media element (specifically, a video element), or of another playback mechanism, by applying the steps below. All the text tracks that use these rules for a given media element, or other playback mechanism, are rendered together, to avoid overlapping subtitles from multiple tracks.

The output of the steps below is a set of CSS boxes that covers the rendering area of the media element or other playback mechanism, which user agents are expected to render in a manner suiting the user.

The rules are as follows:

  1. If the media element is an audio element, or is another playback mechanism with no rendering area, abort these steps. There is nothing to render.

  2. Let video be the media element or other playback mechanism.

  3. Let output be an empty list of absolutely positioned CSS block boxes.

  4. If the user agent is exposing a user interface for video, add to output one or more completely transparent positioned CSS block boxes that cover the same region as the user interface.

  5. If the last time these rules were run, the user agent was not exposing a user interface for video, but now it is, let reset be true. Otherwise, let reset be false.

  6. Let tracks be the subset of video's list of text tracks that have as their rules for updating the text track rendering these rules for updating the display of WebVTT text tracks, and whose text track mode is showing or showing by default.

  7. Let cues be an empty list of text track cues.

  8. For each track track in tracks, append to cues all the cues from track's list of cues that have their text track cue active flag set.

  9. If reset is false, then, for each text track cue cue in cues: if cue's text track cue display state has a set of CSS boxes, then add those boxes to output, and remove cue from cues.

  10. For each text track cue cue in cues that has not yet had corresponding CSS boxes added to output, in text track cue order, run the following substeps:

    1. Let nodes be the list of WebVTT Node Objects obtained by applying the WebVTT cue text parsing rules to the cue's text track cue text.

    2. Apply the Unicode Bidirectional Algorithm's Paragraph Level steps to nodes using the following constraints, to determine the paragraph embedding level of the cue: [BIDI]

      • nodes represents a single paragraph.
      • The paragraph's text consists of the concatenation of the values of each WebVTT Text Object in nodes, in a pre-order, depth-first traversal, excluding WebVTT Ruby Text Objects and their descendants.
    3. If the paragraph embedding level determined in the previous step is even (the paragraph direction is left-to-right), let direction be 'ltr', otherwise, let it be 'rtl'.

    4. If the text track cue writing direction is horizontal, then let block-flow be 'tb'. Otherwise, if the text track cue writing direction is vertical growing left, then let block-flow be 'lr'. Otherwise, the text track cue writing direction is vertical growing right; let block-flow be 'rl'.

    5. Determine the value of maximum size for cue as per the appropriate rules from the following list:

      If the text track cue writing direction is horizontal, the text track cue alignment is start, and direction is 'ltr'
      If the text track cue writing direction is horizontal, the text track cue alignment is end, and direction is 'rtl'
      If the text track cue writing direction is vertical growing left, and the text track cue alignment is start
      If the text track cue writing direction is vertical growing right, and the text track cue alignment is start

      Let maximum size be the text track cue text position subtracted from 100.

      If the text track cue writing direction is horizontal, the text track cue alignment is end, and direction is 'ltr'
      If the text track cue writing direction is horizontal, the text track cue alignment is start, and direction is 'rtl'
      If the text track cue writing direction is vertical growing left, and the text track cue alignment is end
      If the text track cue writing direction is vertical growing right, and the text track cue alignment is end

      Let maximum size be the text track cue text position.

      If the text track cue alignment is middle, and the text track cue text position is less than or equal to 50

      Let maximum size be the text track cue text position multiplied by two.

      If the text track cue alignment is middle, and the text track cue text position is greater than 50

      Let maximum size be the result of subtracting text track cue text position from 100 and then multiplying the result by two.

    6. If the text track cue size is less than maximum size, then let size be text track cue size. Otherwise, let size be maximum size.

    7. If the text track cue writing direction is horizontal, then let width be 'size vw' and height be 'auto'. Otherwise, let width be 'auto' and height be 'size vh'. (These are CSS values used by the next section to set CSS properties for the rendering; 'vw' and 'vh' are CSS units.) [CSSVALUES]

    8. Determine the value of x-position or y-position for cue as per the appropriate rules from the following list:

      If the text track cue writing direction is horizontal, the text track cue alignment is start, and direction is 'ltr'
      If the text track cue writing direction is horizontal, the text track cue alignment is end, and direction is 'rtl'

      Let x-position be the text track cue text position.

      If the text track cue writing direction is horizontal, the text track cue alignment is end, and direction is 'ltr'
      If the text track cue writing direction is horizontal, the text track cue alignment is start, and direction is 'rtl'

      Let x-position be the text track cue text position subtracted from 100.

      If the text track cue writing direction is vertical growing left, and the text track cue alignment is start
      If the text track cue writing direction is vertical growing right, and the text track cue alignment is start

      Let y-position be the text track cue text position.

      If the text track cue writing direction is vertical growing left, and the text track cue alignment is end
      If the text track cue writing direction is vertical growing right, and the text track cue alignment is end

      Let y-position be the text track cue text position subtracted from 100.

      If the text track cue writing direction is horizontal, the text track cue alignment is middle, and direction is 'ltr'

      Let x-position be the text track cue text position minus half of size.

      If the text track cue writing direction is horizontal, the text track cue alignment is middle, and direction is 'rtl'

      Let x-position-reverse be the text track cue text position minus half of size.

      Let x-position be x-position-reverse subtracted from 100.

      If the text track cue writing direction is vertical growing left, and the text track cue alignment is middle
      If the text track cue writing direction is vertical growing right, and the text track cue alignment is middle

      Let y-position be the text track cue text position minus half of size.

    9. Determine the value of whichever of x-position or y-position is not yet calculated for cue as per the appropriate rules from the following list:

      If the text track cue writing direction is horizontal, and the text track cue snap-to-lines flag is set

      Let y-position be zero.

      If the text track cue writing direction is horizontal, and the text track cue snap-to-lines flag is not set

      Let y-position be the text track cue computed line position.

      If the text track cue writing direction is vertical growing left, and the text track cue snap-to-lines flag is set
      If the text track cue writing direction is vertical growing right, and the text track cue snap-to-lines flag is set

      Let x-position be zero.

      If the text track cue writing direction is vertical growing left, and the text track cue snap-to-lines flag is not set
      If the text track cue writing direction is vertical growing right, and the text track cue snap-to-lines flag is not set

      Let x-position be the text track cue computed line position.

    10. Let left be 'x-position vw' and top be 'y-position vh'. (These again are CSS values used by the next section to set CSS properties for the rendering; 'vw' and 'vh' are CSS units.) [CSSVALUES]

    11. Apply the terms of the CSS specifications to nodes within the following constraints, thus obtaining a set of CSS boxes positioned relative to an initial containing block: [CSS]

      • The document tree is the tree of WebVTT Node Objects rooted at nodes.

      • For the purposes of processing by the CSS specification, WebVTT Internal Node Objects are equivalent to elements with the same contents.

      • For the purposes of processing by the CSS specification, WebVTT Text Objects are equivalent to text nodes.
      • No style sheets are associated with nodes. (The nodes are subsequently restyled using style sheets after their boxes are generated, as described below.)
      • The children of the nodes must be wrapped in an anonymous box whose 'display' property has the value 'inline'. This is the WebVTT cue background box.
      • Runs of children of WebVTT Ruby Objects that are not WebVTT Ruby Text Objects must be wrapped in anonymous boxes whose 'display' property has the value 'ruby-base'. [CSSRUBY]
      • Properties on WebVTT Node Objects have their values set as defined in the next section. (That section uses some of the variables whose values were calculated earlier in this algorithm.)
      • Text runs must be wrapped according to the CSS line-wrapping rules, except that additionally, regardless of the value of the 'white-space' property, lines must be wrapped at the edge of their containing blocks, even if doing so requires splitting a word where there is no line breaking opportunity. (Thus, normally text wraps as needed, but if there is a particularly long word, it does not overflow as it normally would in CSS; it is instead forcibly wrapped at the box's edge.)
      • The viewport (and initial containing block) is video's rendering area.

      Let boxes be the boxes generated as descendants of the initial containing block, along with their positions.

    12. If there are no line boxes in boxes, skip the remainder of these substeps for cue. The cue is ignored.

    13. Adjust the positions of boxes according to the appropriate steps from the following list:

      If cue's text track cue snap-to-lines flag is set

      Many of the steps in this algorithm vary according to the text track cue writing direction. Steps labeled "Horizontal" must be followed only when the text track cue writing direction is horizontal, steps labeled "Vertical" must be followed when the text track cue writing direction is either vertical growing left or vertical growing right, steps labeled "Vertical Growing Left" must be followed only when the text track cue writing direction is vertical growing left, and steps labeled "Vertical Growing Right" must be followed only when the text track cue writing direction is vertical growing right.

      1. Horizontal: Let step be the height of the first line box in boxes.

        Vertical: Let step be the width of the first line box in boxes.

      2. If step is zero, then jump to the step labeled done positioning below.

      3. Let line position be the text track cue computed line position.

      4. Vertical Growing Left: Add one to line position then negate it.

      5. Let position be the result of multiplying step and line position.

      6. Vertical Growing Left: Decrease position by the width of the bounding box of the boxes in boxes, then increase position by step.

      7. Horizontal: If line position is less than zero then increase position by the height of the video's rendering area, and negate step (so its value is now minus the height of the first line box in boxes).

        Vertical: If line position is less than zero then increase position by the width of the video's rendering area, and negate step.

      8. Horizontal: Move all the boxes in boxes down by the distance given by position.

        Vertical: Move all the boxes in boxes right by the distance given by position.

      9. Default: Remember the position of all the boxes in boxes as their default position.

      10. Let switched be false.

      11. Step loop: If none of the boxes in boxes would overlap any of the boxes in output, and all the boxes in boxes are within the video's rendering area, then jump to the step labeled done positioning below.

      12. Horizontal: If step is negative and the top of the first line box in boxes is now above the top of the video's rendering area, or if step is positive and the bottom of the first line box in boxes is now below the bottom of the video's rendering area, jump to the step labeled switch direction.

        Vertical: If step is negative and the left edge of the first line box in boxes is now to the left of the left edge of the video's rendering area, or if step is positive and the right edge of the first line box in boxes is now to the right of the right edge of the video's rendering area, jump to the step labeled switch direction.

      13. Horizontal: Move all the boxes in boxes down by the distance given by step. (If step is negative, then this will actually result in an upwards movement of the boxes in absolute terms.)

        Vertical: Move all the boxes in boxes right by the distance given by step. (If step is negative, then this will actually result in a leftwards movement of the boxes in absolute terms.)

      14. Jump back to the step labeled step loop.

      15. Switch direction: Move all the boxes in boxes back to their default position as determined in the step above labeled default.

      16. If switched is true, jump to the step labeled done positioning below.

      17. Negate step.

      18. Set switched to true.

      19. Jump back to the step labeled step loop.

      If cue's text track cue snap-to-lines flag is not set
      1. Set up x and y as follows:

        If the text track cue writing direction is horizontal, and direction is 'ltr'

        Let x be a percentage given by the text track cue text position, and let y be a percentage given by the text track cue computed line position.

        If the text track cue writing direction is horizontal, and direction is 'rtl'

        Let x be a percentage given by the text track cue text position subtracted from 100, and let y be a percentage given by the text track cue computed line position.

        If the text track cue writing direction is vertical growing left

        Let x be a percentage given by the text track cue computed line position subtracted from 100, and let y be a percentage given by the text track cue text position.

        If the text track cue writing direction is vertical growing right

        Let x be a percentage given by the text track cue computed line position, and let y be a percentage given by the text track cue text position.

      2. Position the boxes in boxes such that the point x% along the width of the bounding box of the boxes in boxes is x% of the way across the width of the video's rendering area, and the point y% along the height of the bounding box of the boxes in boxes is y% of the way across the height of the video's rendering area, while maintaining the relative positions of the boxes in boxes to each other.

      3. If none of the boxes in boxes would overlap any of the boxes in output, and all the boxes in boxes are within the video's rendering area, then jump to the step labeled done positioning below.

      4. If there is a position to which the boxes in boxes can be moved while maintaining the relative positions of the boxes in boxes to each other such that none of the boxes in boxes would overlap any of the boxes in output, and all the boxes in boxes would be within the video's rendering area, then move the boxes in boxes to the closest such position to their current position, and then jump to the step labeled done positioning below. If there are multiple such positions that are equidistant from their current position, use the highest one amongst them; if there are several at that height, then use the leftmost one amongst them.

      5. Otherwise, jump to the step labeled done positioning below. (The boxes will unfortunately overlap.)

    14. Done positioning: If there are any line boxes in the (possibly now repositioned) boxes that do not completely fit inside the video's rendering area, remove those offending line boxes from boxes.

    15. Let cue's text track cue display state have the CSS boxes in boxes.

    16. Add the CSS boxes in boxes to output.

  11. Return output.
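
As an illustration only, steps 12 to 19 of the snap-to-lines branch above can be sketched in Python for the horizontal case. The function name, its parameters, and the is_clear callback are assumptions made for this sketch and are not defined by this specification. In particular, the test at the top of the step loop is modelled by is_clear, and a single line box of fixed height stands in for the boxes in boxes.

```python
from typing import Callable


def place_first_line_box(default_top: float, box_height: float, step: float,
                         area_height: float,
                         is_clear: Callable[[float], bool]) -> float:
    """Return the final top offset of the first line box (horizontal case).

    is_clear(top) is a hypothetical callback that stands in for the test at
    the head of the step loop (no overlap with boxes already in output).
    """
    if step == 0:
        raise ValueError("step must be non-zero, otherwise the loop cannot progress")
    top = default_top
    switched = False
    while True:
        # Head of the step loop (assumed): stop once the box no longer collides.
        if is_clear(top):
            return top
        # Step 12: has the box left the rendering area in the direction of travel?
        out_of_area = (step < 0 and top < 0) or \
                      (step > 0 and top + box_height > area_height)
        if not out_of_area:
            top += step          # Step 13: move by step (negative means upwards).
            continue             # Step 14: back to the step loop.
        top = default_top        # Step 15: switch direction, restore default position.
        if switched:             # Step 16: both directions tried, give up.
            return top
        step = -step             # Step 17: negate step.
        switched = True          # Step 18: remember that the direction was switched.
        # Step 19: back to the step loop.
```

When neither direction yields a free position, the sketch returns the default position, which mirrors the jump to done positioning with the boxes left where the default step placed them.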

User agents may allow the user to override the above algorithm's positioning of cues, e.g. by dragging them to another location on the video, or even off the video entirely.
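
For cues whose snap-to-lines flag is not set, steps 1 and 2 of the corresponding branch of the algorithm above reduce to simple arithmetic. The following Python sketch is illustrative only; the function names and the string values used for the writing direction are assumptions of the sketch. It relies on the observation that placing the point x% along a box of width w at x% across an area of width W puts the left edge of the box at x/100 * (W - w), and likewise for the vertical axis.

```python
from typing import Tuple


def anchor_percentages(writing_direction: str, direction: str,
                       text_position: float,
                       line_position: float) -> Tuple[float, float]:
    """Step 1: derive (x, y) as percentages from the cue settings."""
    if writing_direction == "horizontal":
        if direction == "ltr":
            return text_position, line_position
        if direction == "rtl":
            return 100 - text_position, line_position
    elif writing_direction == "vertical growing left":
        return 100 - line_position, text_position
    elif writing_direction == "vertical growing right":
        return line_position, text_position
    raise ValueError("unsupported writing direction or direction")


def bounding_box_origin(x: float, y: float,
                        box_width: float, box_height: float,
                        area_width: float, area_height: float) -> Tuple[float, float]:
    """Step 2: top-left corner of the bounding box inside the rendering area."""
    left = x / 100 * (area_width - box_width)
    top = y / 100 * (area_height - box_height)
    return left, top
```

With x and y both at 50, a 200 by 100 bounding box in a 640 by 360 rendering area is centred, with its top-left corner at (220, 130).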

3 Applying CSS properties to WebVTT Node Objects

EXTRACT from Section 15.4.2.2 WHATWG HTML Specification.

When following the rules for updating the display of WebVTT text tracks, user agents must set properties of WebVTT Node Objects as defined in this section. [CSS]

On the (root) List of WebVTT Node Objects, the 'position' property must be set to 'absolute', the 'direction' property must be set to direction, the 'block-flow' property must be set to block-flow, the 'top' property must be set to top, the 'left' property must be set to left, the 'width' property must be set to width, and the 'height' property must be set to height, where direction, block-flow, top, left, width, and height are the values with those names determined by the rules for updating the display of WebVTT text tracks for the text track cue from whose text the List of WebVTT Node Objects was constructed.

The 'text-align' property on the (root) List of WebVTT Node Objects must be set to the value in the second cell of the row of the table below whose first cell is the value of the corresponding cue's text track cue alignment:

Text track cue alignment    'text-align' value
Start alignment             'start'
Middle alignment            'center'
End alignment               'end'
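
The table above amounts to a three-entry lookup. A minimal Python sketch follows; the dictionary name, the function name, and the lowercase alignment keys are assumptions of the sketch.

```python
# Maps a text track cue alignment to the 'text-align' value from the table.
TEXT_ALIGN_BY_CUE_ALIGNMENT = {
    "start": "start",
    "middle": "center",
    "end": "end",
}


def text_align_for(alignment: str) -> str:
    """Return the 'text-align' value for a cue alignment; KeyError if unknown."""
    return TEXT_ALIGN_BY_CUE_ALIGNMENT[alignment]
```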

The 'font' shorthand property on the (root) List of WebVTT Node Objects must be set to '5vh sans-serif'. [CSSRUBY] [CSSVALUES]

The 'color' property on the (root) List of WebVTT Node Objects must be set to 'rgba(255,255,255,1)'. [CSSCOLOR]

The 'background' shorthand property on the WebVTT cue background box must be set to 'rgba(0,0,0,0.8)'. [CSSCOLOR]

A text outline or stroke may also be set on the (root) List of WebVTT Node Objects, if supported.

The 'font-style' property on WebVTT Italic Objects must be set to 'italic'.

The 'font-weight' property on WebVTT Bold Objects must be set to 'bold'.

The 'text-decoration' property on WebVTT Underline Objects must be set to 'underline'.

The 'display' property on WebVTT Ruby Objects must be set to 'ruby'. [CSSRUBY]

The 'display' property on WebVTT Ruby Text Objects must be set to 'ruby-text'. [CSSRUBY]
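
The five per-object requirements above can be gathered into one lookup, for example when serialising a cue's node tree to styled markup. In this Python sketch the short keys (i, b, u, ruby, rt) and the function name are assumptions; only the property names and values come from the text above.

```python
from typing import Dict

# Property declarations required for each kind of WebVTT Node Object.
NODE_OBJECT_STYLES: Dict[str, Dict[str, str]] = {
    "i": {"font-style": "italic"},           # WebVTT Italic Object
    "b": {"font-weight": "bold"},            # WebVTT Bold Object
    "u": {"text-decoration": "underline"},   # WebVTT Underline Object
    "ruby": {"display": "ruby"},             # WebVTT Ruby Object
    "rt": {"display": "ruby-text"},          # WebVTT Ruby Text Object
}


def declarations_for(node_kind: str) -> str:
    """Return a CSS declaration string, or '' for kinds with no fixed style."""
    props = NODE_OBJECT_STYLES.get(node_kind, {})
    return "; ".join(f"{name}: {value}" for name, value in props.items())
```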

If there are style sheets that apply to the media element or other playback mechanism, then they must be interpreted as defined in the next section.

All other non-inherited properties must be set to their initial values; inherited properties on the root List of WebVTT Node Objects must inherit their values from the media element for which the text track cue is being rendered, if any. If there is no media element (i.e. if the text track is being rendered for another media playback mechanism), then inherited properties on the root List of WebVTT Node Objects must take their initial values.

4 CSS extensions

EXTRACT from Section 15.4.2.3 WHATWG HTML Specification.

When a user agent is rendering one or more text track cues according to the WebVTT cue text rendering rules, WebVTT Node Objects in the list of WebVTT Node Objects used in the rendering can be matched by certain pseudo-selectors as defined below. These selectors can begin or stop matching individual WebVTT Node Objects while a cue is being rendered, even in between applications of the WebVTT cue text rendering rules (which are only run when the set of active cues changes). User agents that support the pseudo-element described below must dynamically update renderings accordingly.

Pseudo-elements apply to elements that are matched by selectors. For the purpose of this section, that element is the matched element. The pseudo-elements defined in the following sections affect the styling of parts of text track cues that are being rendered for the matched element.

If the matched element is not a video element, the pseudo-elements defined below won't have any effect according to this specification.

A CSS user agent that implements the text tracks model must implement the '::cue' and '::cue(selector)' pseudo-elements, and the ':past' and ':future' pseudo-classes.

4.1 The '::cue' pseudo-element

The '::cue' pseudo-element (with no argument) matches any List of WebVTT Node Objects constructed for the matched element, with the exception that the properties corresponding to the 'background' shorthand must be applied to the WebVTT cue background box rather than the List of WebVTT Node Objects.

The following properties apply to the '::cue' pseudo-element with no argument; other properties set on the pseudo-element must be ignored:

The '::cue(selector)' pseudo-element with an argument must have an argument that consists of a group of selectors. It matches any WebVTT Internal Node Object constructed for the matched element that also matches the given group of selectors, with the nodes being treated as follows:

The following properties apply to the '::cue()' pseudo-element with an argument:

In addition, the following properties apply to the '::cue()' pseudo-element with an argument when the selector does not contain the ':past' and ':future' pseudo-classes:

Properties that do not apply must be ignored.

As a special exception, the properties corresponding to the 'background' shorthand, when they would have been applied to the List of WebVTT Node Objects, must instead be applied to the WebVTT cue background box.

4.2 The ':past' and ':future' pseudo-classes

The ':past' and ':future' pseudo-classes sometimes match WebVTT Node Objects. [SELECTORS]

The ':past' pseudo-class only matches WebVTT Node Objects that are in the past.

A WebVTT Node Object c is in the past if, in a pre-order, depth-first traversal of the text track cue's List of WebVTT Node Objects, there exists a WebVTT Timestamp Object whose value is less than the current playback position of the media element that is the matched element, entirely after the WebVTT Node Object c.

The ':future' pseudo-class only matches WebVTT Node Objects that are in the future.

A WebVTT Node Object c is in the future if, in a pre-order, depth-first traversal of the text track cue's List of WebVTT Node Objects, there exists a WebVTT Timestamp Object whose value is greater than the current playback position of the media element that is the matched element, entirely before the WebVTT Node Object c.
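
The two definitions above can be checked mechanically by flattening the node tree in pre-order. The Python sketch below is one possible reading, not a normative algorithm; the Node and Timestamp classes and the function name are hypothetical. "Entirely after" is interpreted as after the node and all of its descendants, and "entirely before" as before the node itself is entered.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(eq=False)
class Timestamp:
    value: float  # seconds


@dataclass(eq=False)
class Node:
    name: str
    children: List[Union["Node", "Timestamp"]] = field(default_factory=list)


def past_and_future(root_children, target: Node, now: float) -> Tuple[bool, bool]:
    """Return (in_past, in_future) for target at playback position now."""
    events = []  # pre-order sequence of (kind, object) pairs

    def walk(items):
        for item in items:
            if isinstance(item, Timestamp):
                events.append(("timestamp", item))
            else:
                events.append(("enter", item))
                walk(item.children)
                events.append(("leave", item))

    walk(root_children)
    first = next(i for i, (kind, obj) in enumerate(events)
                 if kind == "enter" and obj is target)
    last = next(i for i, (kind, obj) in enumerate(events)
                if kind == "leave" and obj is target)
    # In the past: some timestamp entirely after the node is already behind us.
    in_past = any(kind == "timestamp" and obj.value < now
                  for kind, obj in events[last + 1:])
    # In the future: some timestamp entirely before the node is still ahead.
    in_future = any(kind == "timestamp" and obj.value > now
                    for kind, obj in events[:first])
    return in_past, in_future


# Models the cue text "One <00:02.000>two <00:03.000>three".
one, two, three = Node("One"), Node("two"), Node("three")
cue = [one, Timestamp(2.0), two, Timestamp(3.0), three]
```

At a playback position of 2.5 seconds, this reading puts "One" in the past and "three" in the future, while "two" matches neither pseudo-class.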

5 Diverse snippets

5.1 WebVTT kinds in the track element

EXTRACT from Section 4.8.9 WHATWG HTML Specification.

The kind attribute is an enumerated attribute. The following table lists the keywords defined for this attribute. The keyword given in the first cell of each row maps to the state given in the second cell.

Keyword         State           Brief description
subtitles       Subtitles       Transcription or translation of the dialogue, suitable for when the sound is available but not understood (e.g. because the user does not understand the language of the media resource's audio track). Overlaid on the video.
captions        Captions        Transcription or translation of the dialogue, sound effects, relevant musical cues, and other relevant audio information, suitable for when sound is unavailable or not clearly audible (e.g. because it is muted, drowned-out by ambient noise, or because the user is deaf). Overlaid on the video; labeled as appropriate for the hard-of-hearing.
descriptions    Descriptions    Textual descriptions of the video component of the media resource, intended for audio synthesis when the visual component is obscured, unavailable, or not usable (e.g. because the user is interacting with the application without a screen while driving, or because the user is blind). Synthesized as audio.
chapters        Chapters        Chapter titles, intended to be used for navigating the media resource. Displayed as an interactive (potentially nested) list in the user agent's interface.
metadata        Metadata        Tracks intended for use from script. Not displayed by the user agent.

The attribute may be omitted. The missing value default is the subtitles state.
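
The keyword-to-state mapping and the missing value default can be sketched in Python as follows. The names are hypothetical. Two points are assumptions rather than statements of this extract: keywords are compared ASCII case-insensitively, following the general HTML rules for enumerated attributes, and an unrecognised value yields None because no invalid value default is given here.

```python
import string
from typing import Optional

KIND_STATES = {
    "subtitles": "Subtitles",
    "captions": "Captions",
    "descriptions": "Descriptions",
    "chapters": "Chapters",
    "metadata": "Metadata",
}
MISSING_VALUE_DEFAULT = "Subtitles"

# Lowercases A-Z only, leaving non-ASCII characters untouched.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def kind_state(attribute_value: Optional[str]) -> Optional[str]:
    """Return the state for a kind attribute value (None means omitted)."""
    if attribute_value is None:
        return MISSING_VALUE_DEFAULT
    # Assumption: unrecognised keywords map to no state in this sketch.
    return KIND_STATES.get(attribute_value.translate(_ASCII_LOWER))
```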

...

If the element's track URL identifies a WebVTT resource, and the element's kind attribute is not in the metadata state, then the WebVTT file must be a WebVTT file using cue text.

Furthermore, if the element's track URL identifies a WebVTT resource, and the element's kind attribute is in the chapters state, then the WebVTT file must be both a WebVTT file using chapter title text and a WebVTT file using only nested cues.

5.2 WebVTT srclang in the track element

EXTRACT from Section 4.8.9 WHATWG HTML Specification.

The srclang attribute gives the language of the text track data. The value must be a valid BCP 47 language tag. This attribute must be present if the element's kind attribute is in the subtitles state. [BCP47]

If the element has a srclang attribute whose value is not the empty string, then the element's track language is the value of the attribute. Otherwise, the element has no track language.

5.3 WebVTT label in the track element

SUBPART EXTRACT from Section 4.8.9 WHATWG HTML Specification.

The label attribute gives a user-readable title for the track. This title is used by user agents when listing subtitle, caption, and audio description tracks in their user interface.

The value of the label attribute, if the attribute is present, must not be the empty string. Furthermore, there must not be two track element children of the same media element whose kind attributes are in the same state, whose srclang attributes are both missing or have values that represent the same language, and whose label attributes are again both missing or both have the same value.

If the element has a label attribute whose value is not the empty string, then the element's track label is the value of the attribute. Otherwise, the element's track label is a user-agent defined string (e.g. the string "untitled" in the user's locale, or a value automatically generated from the other attributes).
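
The label rules above lend themselves to a small conformance check. The Python sketch below is illustrative; the Track class, the function names, and the "untitled" fallback are assumptions. Language tags are compared case-insensitively as a simplification of "represent the same language", which real BCP 47 matching may refine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Track:
    kind_state: str
    srclang: Optional[str] = None  # None models a missing attribute
    label: Optional[str] = None    # None models a missing attribute


def track_label(track: Track, fallback: str = "untitled") -> str:
    """Return the track label, or a user-agent defined fallback string."""
    return track.label if track.label else fallback


def _same_language(a: Optional[str], b: Optional[str]) -> bool:
    if a is None or b is None:
        return a is None and b is None
    return a.lower() == b.lower()  # simplification of BCP 47 comparison


def conflicting_pairs(tracks: List[Track]) -> List[Tuple[int, int]]:
    """Return index pairs of sibling tracks that violate the uniqueness rule."""
    pairs = []
    for i, a in enumerate(tracks):
        for j in range(i + 1, len(tracks)):
            b = tracks[j]
            if (a.kind_state == b.kind_state
                    and _same_language(a.srclang, b.srclang)
                    and a.label == b.label):
                pairs.append((i, j))
    return pairs
```

An authoring tool could run conflicting_pairs over the track children of one media element and report each returned pair as an error.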

5.4 WebVTT mime type registration

EXTRACT from Section 17.7 WHATWG HTML Specification.

This registration is for community review and will be submitted to the IESG for review, approval, and registration with IANA.

Type name:
text
Subtype name:
vtt
Required parameters:
No parameters
Optional parameters:
No parameters
Encoding considerations:
8bit (always UTF-8)
Security considerations:

Text track files themselves pose no immediate risk unless sensitive information is included within the data. Implementations, however, are required to follow specific rules when processing text tracks, to ensure that certain origin-based restrictions are honored. Failure to correctly implement these rules can result in information leakage, cross-site scripting attacks, and the like.

Interoperability considerations:

Rules for processing both conforming and non-conforming content are defined in this specification.

Some legacy files violate the requirement to use UTF-8.

Published specification:
This document is the relevant specification.
Applications that use this media type:
Web browsers and other video players.
Additional information:
Magic number(s):

WebVTT files all begin with one of the following byte sequences:

  • EF BB BF 57 45 42 56 54 54 0A
  • EF BB BF 57 45 42 56 54 54 0D
  • EF BB BF 57 45 42 56 54 54 20
  • EF BB BF 57 45 42 56 54 54 09
  • 57 45 42 56 54 54 0A
  • 57 45 42 56 54 54 0D
  • 57 45 42 56 54 54 20
  • 57 45 42 56 54 54 09

(An optional UTF-8 BOM, the ASCII string "WEBVTT", and finally a space, tab, or line break.)

File extension(s):
"vtt"
Macintosh file type code(s):
No specific Macintosh file type codes are recommended for this type.
Person & email address to contact for further information:
Ian Hickson <ian@hixie.ch>
Intended usage:
Common
Restrictions on usage:
No restrictions apply.
Author:
Ian Hickson <ian@hixie.ch>
Change controller:
W3C

Fragment identifiers have no meaning with text/vtt resources.
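
The magic number rule in the registration above (an optional UTF-8 BOM, the ASCII string "WEBVTT", then a space, tab, or line break) can be checked with a few byte comparisons. The following Python sketch is a minimal illustration; the function and constant names are assumptions, and it inspects only the leading bytes, not the rest of the file.

```python
UTF8_BOM = b"\xef\xbb\xbf"
SIGNATURE = b"WEBVTT"
# LF, CR, SPACE, TAB: the bytes allowed directly after the signature.
TERMINATORS = (0x0A, 0x0D, 0x20, 0x09)


def has_webvtt_magic(data: bytes) -> bool:
    """Return True if data begins with one of the eight listed byte sequences."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    if not data.startswith(SIGNATURE):
        return False
    # The listed sequences all include one byte after "WEBVTT".
    return len(data) > len(SIGNATURE) and data[len(SIGNATURE)] in TERMINATORS
```

Note that the bare six bytes "WEBVTT" with nothing after them do not match any of the listed sequences, so the sketch rejects them.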

References

[BCP47]
Tags for Identifying Languages; Matching of Language Tags, A. Phillips, M. Davis. IETF.
[BIDI]
UAX #9: Unicode Bidirectional Algorithm, M. Davis. Unicode Consortium.
[CSS]
Cascading Style Sheets Level 2 Revision 1, B. Bos, T. Çelik, I. Hickson, H. Lie. W3C.
[CSSCOLOR]
CSS Color Module Level 3, T. Çelik, C. Lilley, L. Baron. W3C.
[CSSRUBY]
CSS3 Ruby Module, R. Ishida. W3C.
[CSSVALUES]
CSS3 Values and Units, H. Lie, C. Lilley. W3C.
[RFC3629]
UTF-8, a transformation format of ISO 10646, F. Yergeau. IETF.
[SELECTORS]
Selectors, E. Etemad, T. Çelik, D. Glazman, I. Hickson, P. Linss, J. Williams. W3C.