- From: Andrey V. Lukyanov <land@long.yar.ru>
- Date: Sun, 17 Jan 2010 20:33:44 +0300 (MSK)
- To: www-html@w3.org
- Message-ID: <alpine.LFD.2.00.1001172028590.10212@long.yar.ru>
== Line separator and Paragraph separator in HTML 5 ==

Unicode includes such characters as "Line separator" (2028) and "Paragraph separator" (2029). What should happen if they are inserted in HTML source?

HTML 4.01 says that they "do not constitute line breaks in HTML", but does not specify their exact behavior beyond this (see Section 9.1).

HTML 5 does not specifically mention U+2028 and U+2029; however, it defines two notions: "space characters" and "White_Space characters" (see Section 2.4.1).

"Space characters" all belong to ASCII: U+0020 space, U+0009 character tabulation (tab), U+000A line feed (LF), U+000C form feed (FF), and U+000D carriage return (CR).

"White_Space characters" are defined as those that have the Unicode property "White_Space" (;WS; property in UnicodeData.txt). In Unicode 5.2, there are 18 of them:

000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;

One can deduce from this that "space characters" are used for HTML source formatting; in the final output, they all are reduced to a simple space (or, in some positions, reduced to nothing).
"White_Space characters", on the other hand, are supposed to be displayed as they are (except at line ends, where they are reduced to zero width).

Now, we see that "Line separator" (2028) belongs to the "White_Space characters" category. So it seems that HTML 5 proposes to display it as it is, making it equivalent to <BR>. By analogy, one may think that "Paragraph separator" (2029) is now equivalent to <P>. However, such an interpretation goes against the principle of strict distinction between HTML source formatting and final output formatting. Besides, it would be incompatible with HTML 4.01.

The proposed solution is very simple: "Line separator" (2028) and "Paragraph separator" (2029) should be included in the "space characters" category. Then, if someone uses U+2028 and U+2029 to make the HTML source prettier, it will not affect the final output in any unexpected way.
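
To illustrate the proposal, here is a minimal Python sketch of how the two sets would behave under a deliberately simplified collapsing rule (every run of "space characters" becomes one U+0020). This is only a toy model, not the algorithm of any browser or of the HTML 5 draft; the names `HTML5_SPACE_CHARACTERS`, `PROPOSED_SPACE_CHARACTERS` and `collapse` are made up for this example.

```python
# "Space characters" as listed above: space, tab, LF, FF, CR.
HTML5_SPACE_CHARACTERS = "\u0020\u0009\u000A\u000C\u000D"

# Hypothetical extended set following the proposal in this message:
# the same five characters plus U+2028 and U+2029.
PROPOSED_SPACE_CHARACTERS = HTML5_SPACE_CHARACTERS + "\u2028\u2029"


def collapse(text, space_characters):
    """Replace each run of the given characters with a single U+0020.

    A simplified model of source-formatting whitespace being collapsed
    in the final output (assumption: no special handling at line ends).
    """
    out = []
    in_run = False
    for ch in text:
        if ch in space_characters:
            if not in_run:
                out.append(" ")
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


# Source text that uses U+2028 / U+2029 purely as source formatting.
source = "first line\u2028\n    second line\u2029\n    third line"

# With the draft's set, U+2028 and U+2029 survive into the output.
current = collapse(source, HTML5_SPACE_CHARACTERS)

# With the proposed set, they collapse like any other source whitespace.
proposed = collapse(source, PROPOSED_SPACE_CHARACTERS)

print(repr(current))
print(repr(proposed))
```

With the five-character set, the separators remain in the result and it is up to the renderer what to do with them; with the extended set, they disappear into ordinary spaces, which is the behavior the proposal asks for.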
Received on Sunday, 17 January 2010 17:34:48 UTC