- From: Kent Karlsson <kent.karlsson14@comhem.se>
- Date: Mon, 18 Jan 2010 11:53:41 +0100
- To: <www-html@w3.org>
"WS"/"White_Space" is defined in Unicode for BiDi purposes (it's a bidi
category). For instance, LF, CR, NEL, and PS (paragraph separator) are not WS,
since they are considered paragraph separators and thus do not occur inside a
paragraph for bidi processing. TAB (character tabulation) is not in
White_Space either, since TAB is supposed to be treated specially in bidi (not
that I have seen that special treatment correctly implemented...). So I don't
think one should blindly reuse this bidi category for other purposes.

For HTML5's purposes, I think TAB, VT, LF, CR, NEL, and PS should also be
considered "white space", i.e. in a slightly more general sense than the bidi
category White_Space/WS. Further, in addition to LF and CR, VT, FF, NEL, LS,
and PS should also be considered line break characters.

I don't see much logic in having both "[HTML5]space" and "White_Space" in
HTML5. A single set (as described above) would suffice, it seems to me (out of
which a subset are also line break characters, as above).

/kent k

On 2010-01-17 18.33, "Andrey V. Lukyanov" <land@long.yar.ru> wrote:

>
> == Line separator and Paragraph separator in HTML 5 ==
>
> Unicode includes such characters as "Line separator" (2028) and
> "Paragraph separator" (2029). What should happen if they are inserted in
> HTML source?
>
> HTML 4.01 says that they "do not constitute line breaks in HTML", but
> does not specify their exact behavior beyond this (see Section 9.1).
>
> HTML 5 does not specifically mention U+2028 and U+2029; however, it
> defines two notions: "space characters" and "White_Space characters"
> (see Section 2.4.1).
>
> "Space characters" all belong to ASCII: U+0020 space, U+0009 character
> tabulation (tab), U+000A line feed (LF), U+000C form feed (FF), and
> U+000D carriage return (CR).
>
> "White_Space characters" are defined as those that have the Unicode
> property "White_Space" (;WS; property in UnicodeData.txt).
> In Unicode 5.2, there are 18 of them:
>
> 000C;<control>;Cc;0;WS;;;;;N;FORM FEED (FF);;;;
> 0020;SPACE;Zs;0;WS;;;;;N;;;;;
> 1680;OGHAM SPACE MARK;Zs;0;WS;;;;;N;;;;;
> 180E;MONGOLIAN VOWEL SEPARATOR;Zs;0;WS;;;;;N;;;;;
> 2000;EN QUAD;Zs;0;WS;2002;;;;N;;;;;
> 2001;EM QUAD;Zs;0;WS;2003;;;;N;;;;;
> 2002;EN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 2003;EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 2004;THREE-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 2005;FOUR-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 2006;SIX-PER-EM SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 2007;FIGURE SPACE;Zs;0;WS;<noBreak> 0020;;;;N;;;;;
> 2008;PUNCTUATION SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 2009;THIN SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 200A;HAIR SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 2028;LINE SEPARATOR;Zl;0;WS;;;;;N;;;;;
> 205F;MEDIUM MATHEMATICAL SPACE;Zs;0;WS;<compat> 0020;;;;N;;;;;
> 3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
>
> One can deduce from this that "space characters" are used for HTML
> source formatting; in the final output, they all are reduced to a simple
> space (or, in some positions, reduced to nothing).
>
> "White_Space characters", on the other hand, are supposed to be
> displayed as they are (except at line ends, where they are reduced to
> zero width).
>
> Now, we see that "Line separator" (2028) belongs to the "White_Space
> characters" category. So it seems that HTML 5 proposes to display it as
> it is, making it equivalent to <BR>. By analogy, one may think that
> "Paragraph separator" (2029) is now equivalent to <P>.
>
> However, such interpretation goes against the principle of strict
> distinction between the HTML source formatting and the final output
> formatting. Besides, it would be incompatible with HTML 4.01.
>
> Proposed solution to this is very simple: "Line separator" (2028) and
> "Paragraph separator" (2029) should be included in the "space
> characters" category. So, if someone uses U+2028 and U+2029 to make HTML
> source prettier, it will not affect the final output in any unexpected
> way.
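
For anyone who wants to check the bidi classes discussed in this thread, here is a minimal Python sketch. It reads the bidi class (the ";WS;" field of UnicodeData.txt) through the standard unicodedata module and then models the single "white space" set with a line-break subset suggested in the reply. The names EXTRA_WHITE_SPACE, LINE_BREAKS, is_white_space and is_line_break are hypothetical, chosen only for this illustration; they are not taken from HTML5 or Unicode. The results reflect whatever Unicode version the Python interpreter ships with, not necessarily 5.2, so the full list of WS characters may differ from the one quoted above.

```python
import unicodedata

# Characters named in the thread, keyed by their usual abbreviations.
NAMED = {
    "TAB": "\u0009",
    "LF": "\u000A",
    "VT": "\u000B",
    "FF": "\u000C",
    "CR": "\u000D",
    "SPACE": "\u0020",
    "NEL": "\u0085",
    "LS": "\u2028",
    "PS": "\u2029",
}


def bidi_class(ch: str) -> str:
    """Return the bidi class of one character (e.g. 'WS', 'B', 'S')."""
    return unicodedata.bidirectional(ch)


# Assumption: these two sets follow the proposal in the reply; the names
# are made up for this sketch.
EXTRA_WHITE_SPACE = {NAMED[k] for k in ("TAB", "VT", "LF", "CR", "NEL", "PS")}
LINE_BREAKS = {NAMED[k] for k in ("LF", "CR", "VT", "FF", "NEL", "LS", "PS")}


def is_white_space(ch: str) -> bool:
    """Bidi class WS, plus the characters the reply wants added."""
    return bidi_class(ch) == "WS" or ch in EXTRA_WHITE_SPACE


def is_line_break(ch: str) -> bool:
    """Subset of the white space set that would also break lines."""
    return ch in LINE_BREAKS


if __name__ == "__main__":
    for name, ch in NAMED.items():
        print(
            f"U+{ord(ch):04X} {name:5} bidi={bidi_class(ch):2} "
            f"white_space={is_white_space(ch)} line_break={is_line_break(ch)}"
        )
```

Running the script prints one row per character, which makes it easy to see that TAB, LF, CR, NEL and PS do not carry the bidi class WS, while FF, SPACE and LS do.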
Received on Monday, 18 January 2010 10:55:05 UTC