[ttml1] Ambiguity regarding tab (U+0009) processing in significant whitespace.

skynavga has just created a new issue for https://github.com/w3c/ttml1:

== Ambiguity regarding tab (U+0009) processing in significant whitespace. ==
The processing of "significant" whitespace by an XML application requires [1] that all non-markup characters be passed to the application. Further, the ``xml:space`` attribute, if declared, may be used by an author to signal to the application whether (1) default application whitespace processing applies or (2) whitespace should be preserved (by the application, as defined by the application).

In TTML the attribute ``xml:space`` is "declared", and its semantics are mapped [2] to XSL-FO style properties [3], specifically: ``suppress-at-line-break``, ``linefeed-treatment``, ``white-space-collapse``, and ``white-space-treatment``.  These properties are intended to reflect the semantics of the CSS2 ``white-space`` property [4], but at a finer level of functional granularity.

Now, in the course of TTML implementation activity, it has been asked what the behavior should be regarding an element to which ``xml:space="default"`` applies and whose content is, for example:

<pre>&lt;span&gt;&amp;#x0009;X&lt;/span&gt;</pre>

namely, a single HORIZONTAL TAB (U+0009) character followed by a single 'X' character.
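
To make the starting point concrete, here is a minimal sketch showing that a generic XML parser hands the tab through to the application unchanged. It uses Python's standard `xml.etree.ElementTree` purely as a stand-in; it is not a TTML processor, and the helper name `span_text` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

def span_text(xml_source: str) -> str:
    """Return the character content of the root element as the parser reports it."""
    return ET.fromstring(xml_source).text or ""

# The character reference &#x0009; is resolved to a literal U+0009,
# which is passed through to the application unchanged.
text = span_text("<span>&#x0009;X</span>")
print([hex(ord(ch)) for ch in text])  # ['0x9', '0x58']
```

Whatever happens to the tab afterwards is therefore up to the application's own whitespace processing, which is what this issue is about.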

The particular question is whether the HORIZONTAL TAB (U+0009) character should:

1. be mapped to the SPACE (U+0020) character;
2. not be mapped to the SPACE (U+0020) character, but in all other ways be treated as if it were a SPACE (U+0020) character;
3. be neither mapped to nor treated as a SPACE (U+0020) character, e.g., be retained throughout presentation processing and eventually be mapped to a glyph in the applicable font, e.g., by using the font's CMAP (or equivalent) to map HORIZONTAL TAB (U+0009) to a glyph that specifies a definite width (advance)?
 
If the answer to this question is that it should be mapped to the SPACE (U+0020) character, then a secondary question arises as to when, i.e., during which processing step, this mapping should take place.

To untangle this subject, we will need to look at the original specification of CSS2 which defines the (default) initial ``normal`` value for the ``white-space`` property [5] as:

>This value directs user agents to collapse sequences of whitespace, and break lines as necessary to fill line boxes.

and which, further, defines *whitespace* [6] as:

>The token S in the grammar above stands for whitespace. Only the characters "space" (Unicode code 32), "tab" (9), "line feed" (10), "carriage return" (13), and "form feed" (12) can occur in whitespace. Other space-like characters, such as "em-space" (8195) and "ideographic space" (12288), are never part of whitespace.
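
As a small illustration of the CSS2 definition just quoted, the set of whitespace code points can be written down directly. The names `CSS2_WHITESPACE` and `is_css2_whitespace` are invented for this sketch and do not come from any specification.

```python
# Code points that CSS2 allows in the S (whitespace) token, per the quoted text:
# space (32), tab (9), line feed (10), carriage return (13), form feed (12).
CSS2_WHITESPACE = frozenset({0x20, 0x09, 0x0A, 0x0D, 0x0C})

def is_css2_whitespace(ch: str) -> bool:
    """True if the single character ch may occur in CSS2 whitespace."""
    return ord(ch) in CSS2_WHITESPACE

print(is_css2_whitespace("\t"))      # True: tab is whitespace
print(is_css2_whitespace("\u2003"))  # False: em-space (8195) is not
print(is_css2_whitespace("\u3000"))  # False: ideographic space (12288) is not
```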

Now, while *whitespace* is well defined here, and corresponds closely to the definition of S given in XML 1.1 [7] (which, however, does not include form feed), the phrase "collapse sequences of whitespace" is not well defined. In CSS2.1, this latter phrase is given more substance by a *whitespace processing model* [8], an operational model that provides greater detail, including:

> 4. If 'white-space' is set to 'normal', 'nowrap', or 'pre-line',
> (1) every tab (U+0009) is converted to a space (U+0020)
> (2) any space (U+0020) following another space (U+0020) — even a space before the inline, if that space also has 'white-space' set to 'normal', 'nowrap' or 'pre-line' — is removed.
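
The two sub-steps quoted above can be sketched as follows. This is a deliberately simplified illustration that operates on a single text run: it ignores the earlier steps of the CSS2.1 model (linefeed handling) and the clause about a space preceding the inline. The function name `css21_step4` is hypothetical.

```python
import re

def css21_step4(text: str) -> str:
    """Simplified sketch of step 4 for 'normal', 'nowrap' and 'pre-line'."""
    # (1) every tab (U+0009) is converted to a space (U+0020)
    text = text.replace("\t", " ")
    # (2) any space (U+0020) following another space is removed
    return re.sub(" {2,}", " ", text)

print(repr(css21_step4("\tX")))      # ' X'
print(repr(css21_step4("a \t b")))   # 'a b'
```

Note that the order matters: because the tab is converted first, it takes part in the subsequent collapsing of spaces.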

So, what is the problem with respect to TTML? TTML bases its definition of ``xml:space="default"`` semantics on XSL-FO 1.1, published in 2006, which is based on the original CSS2 that does not include the above clarifications found in CSS2.1. Furthermore, TTML bases the definition of ``xml:space="default"`` semantics on the XSL-FO definitions of the newly minted XSL-FO (but not CSS2) properties:

* ``suppress-at-line-break="auto"``
* ``linefeed-treatment="treat-as-space"``
* ``white-space-collapse="true"``
* ``white-space-treatment="ignore-if-surrounding-linefeed"``

where these values also happen to be the (default) initial values for these properties when they are not otherwise specified.

In contrast, XSL-FO defines ``white-space="normal"`` as

* ``linefeed-treatment="treat-as-space"``
* ``white-space-collapse="true"``
* ``white-space-treatment="ignore-if-surrounding-linefeed"``
* ``wrap-option="wrap"``

a definition which also happens to be implicitly dependent upon the ``suppress-at-line-break`` property, since the interpretation of ``white-space-treatment="ignore-if-surrounding-linefeed"`` depends upon the value of the ``suppress-at-line-break`` property.

Combining the default initial values of these properties with the definition of ``white-space="normal"``, we surmise that the default whitespace processing behavior for XSL-FO is intended to align with the default whitespace processing behavior of CSS2. However, a detailed reading of the semantics of this behavior raises a number of possible problems:

1. the ``suppress-at-line-break`` property defines ``auto`` to suppress only the SPACE (U+0020) but not HORIZONTAL TAB (U+0009), and, further, explicitly states that all other characters are to be treated as if the value ``retain`` applies;
2. nowhere in XSL-FO is there explicit language that would cause HORIZONTAL TAB (U+0009) to be mapped to SPACE (U+0020), although there is a vague reference to the possibility of such mapping taking place during refinement processing found under the definition of the ``white-space-collapse`` property where it is stated that:
>after refinement, where some white space characters may have been discarded or turned into space characters, all remaining runs of two or more consecutive spaces are replaced by a single space, then any remaining space immediately adjacent to a remaining linefeed is also discarded.

To return to the example fragment of TTML cited above, absent a mapping from HORIZONTAL TAB (U+0009) to SPACE (U+0020), the whitespace processing behavior that applies to this fragment would seem to retain the HORIZONTAL TAB (U+0009) in

<pre>&lt;span&gt;&amp;#x0009;X&lt;/span&gt;</pre>

since, according to ``white-space-collapse="true"``, we have

* ``&#x0009;`` is classified as white space in XML, **and**
* ``&#x0009;`` is not ``&#x000A;``, **but**
* the immediately preceding flow object (before ``<fo:character character="&#x0009;"/>``) is **not** a character flow object **and** the immediately following flow object is **not** a linefeed, i.e., ``<fo:character character="&#x000A;"/>``

so the ``&#x0009;`` is not collapsed, i.e., it does generate a glyph area.
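
The reading described above, in which nothing maps the tab to a space before collapsing, can be sketched as below. This is only an illustration of that reading, not a rendering of the normative XSL-FO text: ``linefeed-treatment`` and ``white-space-treatment`` are not modeled at all, and the function name `collapse_spaces_only` is made up.

```python
import re

def collapse_spaces_only(text: str) -> str:
    """Collapse runs of U+0020 only; U+0009 is deliberately left untouched
    (assumption: no tab-to-space mapping happens during refinement)."""
    return re.sub(" {2,}", " ", text)

print(repr(collapse_spaces_only("\tX")))        # '\tX'  (tab survives)
print(repr(collapse_spaces_only("a  \t  b")))   # 'a \t b'
```

Under this reading the tab in the example fragment reaches layout and would have to be mapped to some glyph area.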

But now, we have a problem since the (now elaborated) definition of *normal* whitespace processing behavior in CSS2.1 appears to call for every ``&#x0009;`` to be mapped to ``&#x0020;`` **prior** to performing white space collapsing behavior.

So, where does this leave us with respect to TTML? I believe we have two questions to resolve:

1. Is ``&#x0009;`` mapped to ``&#x0020;``? If so, then in what context and during which processing step?
2. If there are contexts where ``&#x0009;`` is not mapped to ``&#x0020;``, then what are the intended presentation semantics?

My answers to these questions are as follows:

* When ``xml:space="default"`` applies, then ``&#x0009;`` is mapped to ``&#x0020;`` prior to performing any other white space processing. This mapping would ideally occur during or immediately after constructing the reduced XML infoset of a TTML abstract document instance.
* When ``xml:space="preserve"`` applies, then ``&#x0009;`` is not mapped to ``&#x0020;``, in which case the CSS2.1 semantics would apply, namely:

>All tabs (U+0009) are rendered as a horizontal shift that lines up the start edge of the next glyph with the next tab stop. Tab stops occur at points that are multiples of 8 times the width of a space (U+0020) rendered in the block's font from the block's starting content edge. 
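
The tab stop rule quoted above amounts to simple arithmetic. The sketch below assumes positions are measured from the block's starting content edge in the same unit as the space width, and that a glyph already sitting on a tab stop advances to the following one; the name `next_tab_stop` is hypothetical.

```python
def next_tab_stop(position: float, space_width: float) -> float:
    """Return the position of the next tab stop strictly after `position`.

    Tab stops lie at multiples of 8 times the width of a space (U+0020).
    """
    interval = 8 * space_width
    return (position // interval + 1) * interval

# With a 5-unit-wide space, tab stops fall at 40, 80, 120, ...
print(next_tab_stop(0, 5))     # 40
print(next_tab_stop(12.5, 5))  # 40.0
print(next_tab_stop(40, 5))    # 80
```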

Specification text that implements the above could easily be added to both TTML1 and TTML2, ideally under the definition of ``xml:space`` [9] (and its TTML2 counterpart).
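
For implementers, the proposed answers could look roughly like the following pre-processing step, applied to character content before any other whitespace processing. This is a sketch of the proposal in this issue, not of existing specification text; the function name `normalize_tabs` and its signature are assumptions.

```python
def normalize_tabs(text: str, xml_space: str = "default") -> str:
    """Apply the proposed tab handling for the computed xml:space value."""
    if xml_space == "default":
        # Proposed: map U+0009 to U+0020 before any other whitespace processing.
        return text.replace("\t", " ")
    if xml_space == "preserve":
        # Proposed: leave U+0009 alone; presentation semantics decided elsewhere.
        return text
    raise ValueError(f"unexpected xml:space value: {xml_space!r}")

print(repr(normalize_tabs("\tX")))              # ' X'
print(repr(normalize_tabs("\tX", "preserve")))  # '\tX'
```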

I don't have a strong opinion about whether we should adopt the CSS2.1 presentation semantics for HORIZONTAL TAB in cases where ``xml:space="preserve"`` applies. Alternative semantics could be to ignore it entirely (i.e., treat it like ZERO WIDTH SPACE) or to treat it as SPACE.

[1] https://www.w3.org/TR/REC-xml/#sec-white-space
[2] https://www.w3.org/TR/ttml1/#content-attribute-space
[3] https://www.w3.org/TR/xsl/
[4] https://www.w3.org/TR/xsl/#d0e297
[5] https://www.w3.org/TR/1998/REC-CSS2-19980512/text.html#white-space-prop
[6] https://www.w3.org/TR/1998/REC-CSS2-19980512/syndata.html#whitespace
[7] https://www.w3.org/TR/REC-xml/#NT-S
[8] https://www.w3.org/TR/2011/REC-CSS2-20110607/text.html#white-space-model
[9] https://www.w3.org/TR/ttml1/#content-attribute-space


Please view or discuss this issue at https://github.com/w3c/ttml1/issues/235 using your GitHub account

Received on Saturday, 1 April 2017 07:03:07 UTC