Re: CSS2.1: \A and white-space

[The "\A" in question below appears in the value of the content property.]

[Mon, 26 Aug 2002 18:56:50 +0200] Bert Bos:
>No, the 'white-space' property has no effect on '\A', since the '\A'
>is not inserted into the *input* of the CSS renderer, but into the
>*output*. Whitespace in the input is a form of mark-up and is thus
>interpreted by the HTML (or XML) parser and further undergoes
>transformations by the CSS renderer. But the '\A' is simply part of
>the rendered output. You can regard it as a glyph or as a control
>code, but the term "whitespace" doesn't apply to it.

Are spaces (" " and/or "\20") in the content property "whitespace"?

Referring to:
I gather the "input" is the document tree (created by the document language
parser in step 1) and the "output" is the formatting structure (delivered
to the user agent's rendering engine in step 6).

The document language parser must recognize whitespace in order to read its
input and build the document tree; however, at this stage it must still
preserve the individual whitespace characters used in element content ---
since the "white-space" property hasn't yet been determined, it is unknown
if and how this whitespace will be transformed.  Does the document tree at
this point (in the conceptual model) contain some sort of "tokenized"
version of each element's content, so that the original parsing of
whitespace is available to later stages, or is that original parsing thrown
away within elements?  That is, does input like this:

<P>This  is
	just a short paragraph.

get passed along as something like this:

     * element: P
       content: text("This"), wsp("  "), text("is"), wsp("\00000A\000009"),
                text("just"), wsp(" "), text("a"), wsp(" "),
                text("short"), wsp(" "), text("paragraph.")

that's already parsed for whitespace, or just like this:

     * element: P
       content: "This  is\00000A\000009just a short paragraph."

with the content represented as a pure Unicode string (with "early"
processing such as removing line breaks immediately after opening and
before closing tags and resolving entity references already applied)?

Note that it must be one or the other: either we pass along something more
complicated than a Unicode string to represent the content of each element,
or we retain whitespace characters in content verbatim (losing the "parsing"
of this whitespace that occurred before creating the document tree).
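
To make the two representations concrete, here is a rough Python sketch.
The names tokenize and detokenize are my own invention, and I am assuming
that "whitespace" means space, tab, line feed, form feed and carriage
return; a given document language may well define it differently.

```python
import re

# Assumption: whitespace is space, tab, LF, FF and CR.
_RUN = re.compile(r"([ \t\n\f\r]+)|([^ \t\n\f\r]+)")

def tokenize(content):
    """Split element content into ("wsp", run) and ("text", run) tokens."""
    tokens = []
    for match in _RUN.finditer(content):
        if match.group(1) is not None:
            tokens.append(("wsp", match.group(1)))
        else:
            tokens.append(("text", match.group(2)))
    return tokens

def detokenize(tokens):
    """Rebuild the plain Unicode string from a token list."""
    return "".join(value for _kind, value in tokens)

# The paragraph from the example above, as a plain Unicode string.
content = "This  is\n\tjust a short paragraph."
tokens = tokenize(content)
```

As long as each wsp token keeps its original characters, the tokenized
form can always be flattened back to the Unicode string, so the real
question is only whether the "parsing" survives into later stages.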

For the formatting structure, CSS could supply "tokenized" content to the
renderer, or it could supply Unicode strings.  The renderer would have to
perform word-wrapping when appropriate; but the collapse of whitespace to a
single blank according to the white-space property could (in some models)
be done before the formatting structure is delivered to the renderer.
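
The collapsing step mentioned here might look something like the Python
sketch below. The function name apply_white_space is hypothetical, only the
three CSS 2 values (normal, pre, nowrap) are handled, and the treatment is
deliberately simplified: it ignores what happens to whitespace at the
start and end of lines and leaves word wrapping entirely to the renderer.

```python
import re

def apply_white_space(text, white_space="normal"):
    """Collapse whitespace runs according to a simplified white-space model."""
    if white_space == "pre":
        # Every character is preserved exactly as it appears in the content.
        return text
    if white_space in ("normal", "nowrap"):
        # Assumption: any run of space, tab, LF, FF or CR becomes one blank.
        return re.sub(r"[ \t\n\f\r]+", " ", text)
    raise ValueError("unsupported white-space value: " + white_space)

content = "This  is\n\tjust a short paragraph."
collapsed = apply_white_space(content, "normal")
preserved = apply_white_space(content, "pre")
```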

So, we have four possible cases:
     * document tree and formatting structure are both Unicode
     * document tree and formatting structure are both tokenized
     * document tree is Unicode, formatting structure is tokenized
     * document tree is tokenized, formatting structure is Unicode

* Document tree and formatting structure are both Unicode

In this case, it is difficult to see how generated content could be treated
any differently than document content, unless the CSS processor arbitrarily
tags all :before and :after pseudo-elements with "white-space: pre" ---
which raises the question, "Why not honor the white-space property?"

* Document tree and formatting structure are both tokenized

In this case, it makes sense that no generated content would contain
"whitespace" of any kind, since the CSS processor would just be passing
document language whitespace unchanged, and would need to do no whitespace
processing itself.

There are some oddities in this model: it would imply that word wrapping
cannot occur within generated content (since there can be no "whitespace"
in it); and *probably* --- depending on how the rendering engine works ---
spaces in generated content would not expand when "text-align: justify" is
in effect, since the rendering engine would not see them as whitespace.

I also have to wonder whether any practical implementation would actually
follow this model.

* Document tree is Unicode, formatting structure is tokenized

In this case, CSS (not the document language) would define "whitespace"
in document content: is this in fact how it works?  Since the CSS processor
would be managing whitespace in this model, it is unclear why generated
content should not be subject to the white-space property.

* Document tree is tokenized, formatting structure is Unicode

In this case, the CSS processor would presumably be condensing whitespace
(assuming the rendering engine doesn't re-parse whitespace; otherwise there
would be little sense in using this instead of the Unicode/Unicode model).
The rendering engine would recognize spaces as whitespace, and otherwise
need only know whether or not word wrapping is in effect.  Using this
model, we would expect that in generated content, "\A" would always be a
newline, multiple blanks would not be condensed, and blanks would be
recognized as whitespace for purposes of justification and line wrapping by
the rendering engine.  The document language would define "whitespace"
within the document itself, but only blanks would have the effect of
whitespace in generated content (and would not be condensed).
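
As I understand this model, it could be sketched in Python roughly as
follows. The function build_formatting_string and its parameters are
hypothetical, the before/after strings stand for already-decoded generated
content (so "\A" has become a line feed), and the collapsing rule is the
same simplified one assumed earlier in this message.

```python
import re

def build_formatting_string(content, before="", after="", white_space="normal"):
    """Collapse whitespace in document content only; pass generated content verbatim."""
    if white_space != "pre":
        # Assumption: runs of space, tab, LF, FF or CR collapse to one blank.
        content = re.sub(r"[ \t\n\f\r]+", " ", content)
    # Generated content is appended untouched: its line feed stays a newline
    # and its multiple blanks are not condensed.
    return before + content + after

result = build_formatting_string("This  is\n\tjust", before="Note:  \n")
```

Here the two blanks and the newline in the generated prefix survive, while
the same characters in the document content are condensed.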

I don't get the sense that this is how current browsers actually work; but
it sounds like what Bert and the CSS 2 specification may have intended.

The "bottom line" here is that I think a bit more clarification is needed.
Randall Joseph Fellmy aka

Received on Tuesday, 27 August 2002 05:21:53 UTC