Re: Line separator and Paragraph separator in HTML 5

"WS"/"White_Space" is defined in Unicode for BiDi purposes (it's a bidi
category). For instance, LF, CR, NEL, and PS (paragraph separator) are not
WS, since they are considered paragraph separators and thus do not occur
inside a paragraph for bidi processing. TAB (character tabulation) is not
in White_Space either, since TAB is supposed to get special treatment in
bidi (not that I have seen that special treatment correctly implemented...).
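
To make the bidi classification above easy to check, here is a minimal
sketch using Python's standard unicodedata module. The CHARS table and the
bidi_class helper are names made up for this illustration, and the values
reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Characters discussed above, keyed by their usual abbreviations.
# (This table is only for illustration.)
CHARS = {
    "TAB": "\u0009",
    "LF": "\u000A",
    "VT": "\u000B",
    "FF": "\u000C",
    "CR": "\u000D",
    "SPACE": "\u0020",
    "NEL": "\u0085",
    "LS": "\u2028",
    "PS": "\u2029",
}


def bidi_class(ch: str) -> str:
    """Return the bidi class of one character, e.g. 'WS', 'B' or 'S'."""
    return unicodedata.bidirectional(ch)


if __name__ == "__main__":
    for name, ch in CHARS.items():
        print(f"U+{ord(ch):04X} {name:<5} {bidi_class(ch)}")
```

In the output, "B" is the paragraph separator class and "S" is the segment
separator class; only characters reported as "WS" are in the bidi category
that this thread is about.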

So I don't think one should blindly reuse this bidi category for other
purposes. For HTML5's purposes, I think TAB, VT, LF, CR, NEL, and PS
should also be considered to be "white space"; i.e. a slightly more
general sense than the bidi category White_Space/WS. Further, in addition
to LF and CR, the characters VT, FF, NEL, LS, and PS should also be
considered line break characters.

I don't see much logic in having both "[HTML5]space" and "White_Space"
in HTML5. A single set (as described above) would suffice, it seems to me
(with a subset of it also being line break characters, as above).
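
As a rough sketch of the single set suggested above (this is only my
reading of the proposal, not text from any spec), the following Python
builds the set from the bidi WS class plus the six extra characters, with
the line break characters as a subset. The names EXTRA_WHITE_SPACE,
LINE_BREAKS, is_white_space and is_line_break are invented for this
example, and membership in bidi WS depends on the Unicode version bundled
with the interpreter.

```python
import unicodedata

# Characters that are not in bidi class WS but would be added to the set:
# TAB, VT, LF, CR, NEL, PS.
EXTRA_WHITE_SPACE = frozenset("\u0009\u000B\u000A\u000D\u0085\u2029")

# The subset that would also count as line breaks:
# LF, CR, VT, FF, NEL, LS, PS.
LINE_BREAKS = frozenset("\u000A\u000D\u000B\u000C\u0085\u2028\u2029")


def is_white_space(ch: str) -> bool:
    """True if ch is in bidi class WS or is one of the extra characters."""
    return unicodedata.bidirectional(ch) == "WS" or ch in EXTRA_WHITE_SPACE


def is_line_break(ch: str) -> bool:
    """True if ch is one of the proposed line break characters."""
    return ch in LINE_BREAKS


# Every line break character is also white space, so one set suffices.
assert all(is_white_space(ch) for ch in LINE_BREAKS)
```

FF and LS need no entry in EXTRA_WHITE_SPACE because they are already in
the bidi WS class; the other five line break characters come in through
the extra set.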

    /kent k



On 2010-01-17 18.33, "Andrey V. Lukyanov" <land@long.yar.ru> wrote:

> 
> == Line separator and Paragraph separator in HTML 5 ==
> 
> Unicode includes such characters as "Line separator" (2028) and
> "Paragraph separator" (2029). What should happen if they are inserted in
> HTML source?
> 
> HTML 4.01 says that they "do not constitute line breaks in HTML", but
> does not specify their exact behavior beyond this (see Section 9.1).
> 
> HTML 5 does not specifically mention U+2028 and U+2029; however, it
> defines two notions: "space characters" and "White_Space characters"
> (see Section 2.4.1).
> 
> "Space characters" all belong to ASCII: U+0020 space, U+0009 character
> tabulation (tab), U+000A line feed (LF), U+000C form feed (FF), and
> U+000D carriage return (CR).
> 
> "White_Space characters" are defined as those that have the Unicode
> property "White_Space" (;WS; property in UnicodeData.txt). In Unicode
> 5.2, there are 18 of them:
> 
>   000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
>   0020;SPACE;Zs;0;WS;;;;;N;;;;;
>   1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
>   180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
>   2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
>   2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
>   2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
>   2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
>   205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
>   3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
> 
> One can deduce from this that "space characters" are used for HTML
> source formatting; in the final output, they all are reduced to a simple
> space (or, in some positions, reduced to nothing).
> 
> "White_Space characters", on the other hand, are supposed to be
> displayed as they are (except at line ends, where they are reduced to
> zero width).
> 
> Now, we see that "Line separator" (2028) belongs to the "White_Space
> characters" category. So it seems that HTML 5 proposes to display it as
> it is, making it equivalent to <BR>. By analogy, one may think that
> "Paragraph separator" (2029) is now equivalent to <P>.
> 
> However, such interpretation goes against the principle of strict
> distinction between the HTML source formatting and the final output
> formatting. Besides, it would be incompatible with HTML 4.01.
> 
> Proposed solution to this is very simple: "Line separator" (2028) and
> "Paragraph separator" (2029) should be included in the "space
> characters" category. So, if someone uses U+2028 and U+2029 to make HTML
> source prettier, it will not affect the final output in any unexpected
> way.

Received on Monday, 18 January 2010 10:55:05 UTC