Re: [CSS21][CSS3 Text] Re: Treating carriage return as white space in layout from Henri Sivonen on 2010-09-08 (www-style@w3.org from September 2010)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 8 Sep 2010 15:10:25 +0300
To: fantasai <fantasai.lists@inkedblade.net>
Cc: www-style@w3.org
Message-Id: <C0C145BB-81F2-42B8-8013-02106581F37E@iki.fi>

On Sep 8, 2010, at 10:22, fantasai wrote:

>    # Newlines in the source can be represented by a carriage
>    # return (U+000D), a linefeed (U+000A) or both (U+000D U+000A),
>    # or by some other mechanism that identifies the beginning
>    # and end of document segments, such as the SGML RECORD-START
>    # and RECORD-END tokens. The CSS 'white-space' processing
>    # model assumes all newlines have been normalized to line feeds.
> 
>  Drop the last sentence. Add
> 
>    | Any such newline representation is considered to be a <dfn>line
>    | break character</dfn> in the CSS white space processing rules.
>    |
>    | CSS does not define how newlines are represented in the source.
>    | In the absence of specific document language rules to the contrary,
>    | all linefeeds (U+000A), carriage returns (U+000D), and CRLF sequences
>    | (U+000D U+000A) in the source text are considered line break
>    | characters.

Is 'source text' a special term in CSS that means text in the document tree? I think 'source text' is confusing, because the whole point of my concern is that a carriage return doesn't appear literally in HTML (or XML) source but does appear in the resulting DOM.

Can this be changed to talk about text in the document tree?
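
To illustrate the distinction, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree as a stand-in for a browser's XML parser; the helper name dom_text and the LINE_BREAK pattern are illustrative and come from neither spec. A literal CRLF in the source is normalized to LF by the parser, whereas a numeric character reference leaves a real U+000D in the tree, which is the case the proposed default rule would need to cover.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative pattern for the proposed default: LF, CR, and CRLF are all
# line breaks, with CRLF counted as a single break (so it is listed first).
LINE_BREAK = re.compile(r"\r\n|\r|\n")


def dom_text(markup: str) -> str:
    """Parse a single XML element and return its text content."""
    return ET.fromstring(markup).text


# A literal CRLF in the source is normalized to LF by the XML parser,
# so no carriage return reaches the tree.
literal = dom_text("<p>a\r\nb</p>")

# A numeric character reference is not normalized: the tree contains U+000D.
escaped = dom_text("<p>a&#13;b</p>")

print(repr(literal))
print(repr(escaped))

# Under the proposed default, the CR in the tree acts as a line break.
print(LINE_BREAK.split(escaped))
```

The point of the sketch is only that the carriage return exists in the tree and not in the literal source, so wording based on 'source text' does not obviously cover it.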

>  Option A:
>    Add the Unicode line separator (LS, U+2028) and paragraph separator
>    (PS, U+2029) to the list of line break characters in generated content
>    and in the source text defaults.

>  Option C:
>    Add the full list of UAX14 class BK characters to the source text
>    defaults and the generated content lists.

These options look rather XML 1.1-ish to me. To address the site compat concern I presented, it is unnecessary to go beyond CR. Furthermore, I think adding non-ASCII characters to white space operations shouldn't be done lightly, since performance and implementation complexity issues may arise when the internal representation of text is UTF-8. (I didn't check whether there are astral BK characters that would cause problems even when the internal representation is UTF-16.)
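
As a rough illustration of the UTF-8 point, the following sketch compares the encoded length of the characters discussed in this thread. The names CANDIDATES, utf8_len, and is_ascii_line_break are hypothetical, and the sketch measures nothing about any real layout engine. CR and LF can each be matched with a single byte comparison, while LS and PS each occupy three bytes in UTF-8.

```python
# Line break candidates discussed in the thread, keyed by short name.
CANDIDATES = {
    "LF": "\u000A",
    "CR": "\u000D",
    "LS": "\u2028",
    "PS": "\u2029",
}


def utf8_len(ch: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def is_ascii_line_break(byte: int) -> bool:
    """Single-byte test that suffices when only LF and CR are line breaks."""
    return byte in (0x0A, 0x0D)


for name, ch in CANDIDATES.items():
    encoded = ch.encode("utf-8")
    # Print the code point, the UTF-8 bytes in hex, and the byte count.
    print(name, f"U+{ord(ch):04X}", encoded.hex(" "), utf8_len(ch))
```

With only CR and LF in the set, a scanner over UTF-8 bytes can use something like is_ascii_line_break on each byte; adding LS or PS means recognizing a three-byte sequence instead.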

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 8 September 2010 12:11:01 UTC