Line separator and Paragraph separator in HTML 5

From: Andrey V. Lukyanov <land@long.yar.ru>
Date: Sun, 17 Jan 2010 20:33:44 +0300 (MSK)
To: www-html@w3.org
Message-ID: <alpine.LFD.2.00.1001172028590.10212@long.yar.ru>

== Line separator and Paragraph separator in HTML 5 ==

Unicode includes such characters as "Line separator" (2028) and
"Paragraph separator" (2029). What should happen if they are inserted in
HTML source?

HTML 4.01 says that they "do not constitute line breaks in HTML", but
does not specify their exact behavior beyond this (see Section 9.1).

HTML 5 does not specifically mention U+2028 and U+2029; however, it
defines two notions: "space characters" and "White_Space characters"
(see Section 2.4.1).

"Space characters" all belong to ASCII: U+0020 space, U+0009 character
tabulation (tab), U+000A line feed (LF), U+000C form feed (FF), and
U+000D carriage return (CR).
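
As a minimal sketch of the definition above (the names SPACE_CHARACTERS and is_space_character are illustrative only and come from no specification), the five "space characters" can be written down as a set and tested against U+2028 and U+2029:

```python
# The five HTML 5 "space characters" listed above.
# SPACE_CHARACTERS and is_space_character are hypothetical names
# chosen for this sketch.
SPACE_CHARACTERS = frozenset("\u0020\u0009\u000A\u000C\u000D")


def is_space_character(ch: str) -> bool:
    """Return True if ch is one of the five HTML 5 space characters."""
    return ch in SPACE_CHARACTERS


if __name__ == "__main__":
    for cp in (0x0020, 0x000A, 0x2028, 0x2029):
        print(f"U+{cp:04X}: {is_space_character(chr(cp))}")
```

Under this definition, U+2028 and U+2029 are not "space characters".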

"White_Space characters" are defined as those that have the Unicode
property "White_Space" (;WS; property in UnicodeData.txt). In Unicode
5.2, there are 18 of them:

  000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
  0020;SPACE;Zs;0;WS;;;;;N;;;;;
  1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
  180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
  2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
  2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
  2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
  2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
  205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
  3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
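
A listing like the one above could be produced by filtering UnicodeData.txt on its fifth semicolon-separated field, which is the bidirectional class column and is where the ;WS; value appears. The following is only a sketch: the function name codepoints_with_field and the SAMPLE data are assumptions made for illustration, and the U+2029 line is added here purely for contrast (its fifth field is B, not WS).

```python
from typing import List

# A few UnicodeData.txt lines; the first three are from the listing
# above, the U+2029 line is added for contrast in this sketch.
SAMPLE = """\
000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
2029;PARAGRAPH SEPARATOR;Zp;0;B;;;;;N;;;;;
"""


def codepoints_with_field(data: str, value: str = "WS", index: int = 4) -> List[int]:
    """Return code points whose field at `index` equals `value`.

    `index` is zero-based, so 4 selects the fifth field.
    """
    result = []
    for line in data.splitlines():
        fields = line.strip().split(";")
        if len(fields) > index and fields[index] == value:
            result.append(int(fields[0], 16))
    return result


if __name__ == "__main__":
    print([f"U+{cp:04X}" for cp in codepoints_with_field(SAMPLE)])
```

Run against a full copy of UnicodeData.txt, the same filter would give the complete list for that Unicode version.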

One can deduce from this that "space characters" are used for formatting
the HTML source; in the final output, they are all reduced to a single
space (or, in some positions, to nothing).
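
The reduction described above can be sketched roughly as follows. This is a simplification for illustration, not the actual rendering rules of any browser, and collapse_space_characters is a hypothetical name:

```python
import re

# Runs of the five HTML 5 "space characters".
_SPACE_RUN = re.compile("[ \t\n\f\r]+")


def collapse_space_characters(text: str) -> str:
    """Reduce each run of space characters to one space and drop
    runs at the very start and end (the "reduced to nothing" case)."""
    return _SPACE_RUN.sub(" ", text).strip(" ")


if __name__ == "__main__":
    print(repr(collapse_space_characters("  Hello,\n\t world \r\n")))
    # An EM SPACE (U+2003) is not a "space character" and is left alone.
    print(repr(collapse_space_characters("a\u2003b")))
```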

"White_Space characters", on the other hand, are supposed to be
displayed as they are (except at line ends, where they are reduced to
zero width).

Now, we see that "Line separator" (2028) belongs to the "White_Space
characters" category. So it seems that HTML 5 proposes to display it as
it is, making it equivalent to <BR>. By analogy, one may think that
"Paragraph separator" (2029) is now equivalent to <P>.

However, such an interpretation goes against the principle of a strict
distinction between HTML source formatting and final output formatting.
Besides, it would be incompatible with HTML 4.01.

The proposed solution is very simple: "Line separator" (2028) and
"Paragraph separator" (2029) should be included in the "space
characters" category. So, if someone uses U+2028 and U+2029 to make HTML
source prettier, it will not affect the final output in any unexpected
way.
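
To illustrate the effect of the proposal, here is a rough sketch comparing the current "space characters" set with the proposed one. The names HTML5_SPACE, PROPOSED_SPACE and collapse are assumptions made for this example, and the collapsing rule is a simplification of real rendering:

```python
# Current HTML 5 "space characters" and the proposed extended set.
HTML5_SPACE = " \t\n\f\r"
PROPOSED_SPACE = HTML5_SPACE + "\u2028\u2029"


def collapse(text: str, space_chars: str) -> str:
    """Replace each run of characters from space_chars with a single
    space, dropping runs at the start and end of the text."""
    out = []
    in_run = False
    for ch in text:
        if ch in space_chars:
            in_run = True
            continue
        if in_run and out:
            out.append(" ")
        out.append(ch)
        in_run = False
    return "".join(out)


if __name__ == "__main__":
    source = "one\u2028two\u2029three"
    print(repr(collapse(source, HTML5_SPACE)))     # separators survive
    print(repr(collapse(source, PROPOSED_SPACE)))  # separators collapse
```

With the proposed set, U+2028 and U+2029 in the source behave like any other source formatting and leave only a plain space in the output.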
