Re: Line separator and Paragraph separator in HTML 5

On Sun, 17 Jan 2010, Andrey V. Lukyanov wrote:
> 
> == Line separator and Paragraph separator in HTML 5 ==
> 
> Unicode includes such characters as "Line separator" (2028) and 
> "Paragraph separator" (2029). What should happen if they are inserted in 
> HTML source?

Nothing in particular.


> HTML 5 does not specifically mention U+2028 and U+2029; however, it 
> defines two notions: "space characters" and "White_Space characters" 
> (see Section 2.4.1).
> 
> "Space characters" all belong to ASCII: U+0020 space, U+0009 character 
> tabulation (tab), U+000A line feed (LF), U+000C form feed (FF), and 
> U+000D carriage return (CR).
> 
> "White_Space characters" are defined as those that have the Unicode 
> property "White_Space" (;WS; property in UnicodeData.txt).

"WS" in UnicodeData.txt is the Bidi_Class _value_ "White_Space", not the 
property White_Space, which is listed in PropList.txt.

I've tried to clarify this in the spec.
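
To illustrate the difference, here is a rough Python sketch. It assumes the
White_Space set as I understand recent versions of PropList.txt to list it
(hard-coded below, since Python's unicodedata module does not expose that
property), and it uses unicodedata.bidirectional() to read the Bidi_Class
value. The names WHITE_SPACE and describe are made up for this example.

```python
import unicodedata

# Assumption: code points with the White_Space property, copied by hand from
# PropList.txt. Older Unicode versions may list a slightly different set.
WHITE_SPACE = (
    set(range(0x0009, 0x000E))      # TAB, LF, VT, FF, CR
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))    # EN QUAD .. HAIR SPACE
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)


def describe(cp):
    """Return (Bidi_Class value, has White_Space property) for a code point."""
    return (unicodedata.bidirectional(chr(cp)), cp in WHITE_SPACE)


if __name__ == "__main__":
    for cp in (0x2028, 0x2029, 0x00A0):
        bidi, is_ws = describe(cp)
        print(f"U+{cp:04X}: Bidi_Class={bidi}, White_Space={is_ws}")
```

U+2029 and U+00A0 both have the White_Space property, yet neither has the
Bidi_Class value "WS", which is why the two must not be conflated.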


> Now, we see that "Line separator" (2028) belongs to the "White_Space
> characters" category. So it seems that HTML 5 proposes to display it as
> it is, making it equivalent to <BR>. By analogy, one may think that
> "Paragraph separator" (2029) is now equivalent to <P>.

If you mean in the rendering sense, that would be up to the Unicode and 
CSS specifications. Nothing in HTML5 says that U+000A should be rendered 
as a line break, for instance -- in fact <br> is defined in terms of 
U+000A, not the other way around.


> Proposed solution to this is very simple: "Line separator" (2028) and 
> "Paragraph separator" (2029) should be included in the "space 
> characters" category. So, if someone uses U+2028 and U+2029 to make HTML 
> source prettier, it will not affect the final output in any unexpected 
> way.

If you mean at the parser level, e.g. between a tag name and an 
attribute name in a start tag, then that would contradict a design goal of 
HTML5, which is to ensure that parser-level effects are only based on 
ASCII characters.
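
As a minimal sketch of what "only based on ASCII characters" means in
practice, the following Python splits a string on the five space characters
listed in the quoted message and on nothing else. It is an illustration
only, not the spec's algorithm; split_on_space_characters is a hypothetical
helper name.

```python
# The five "space characters": space, tab, LF, FF, CR.
HTML_SPACE_CHARACTERS = "\u0020\u0009\u000A\u000C\u000D"


def split_on_space_characters(text):
    """Split text into tokens, treating only the five ASCII space
    characters as separators. Other Unicode whitespace stays in the token."""
    tokens = []
    current = []
    for ch in text:
        if ch in HTML_SPACE_CHARACTERS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    # U+2028 is not a separator here, so this stays a single token.
    print(split_on_space_characters("a\u2028b"))
    print(split_on_space_characters("a b\tc"))
```

Under this rule a U+2028 or U+2029 between two names would simply become
part of a name rather than separate them.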


On Mon, 18 Jan 2010, Kent Karlsson wrote:
> 
> I don't see much logic in having both "[HTML5]space" and "White_Space" 
> in HTML5. A single set (as described above) would suffice it seems to 
> me... (out of which a subset are also line break characters, as above).

The two terms are needed because a no-break space should not be treated 
like a space in attribute values, but should be treated as a space in 
element content when values (e.g. date values) are parsed for other 
purposes (e.g. microdata).
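
A rough Python sketch of the distinction, under the assumption that the
White_Space set is the one hard-coded below (copied by hand from
PropList.txt). The two strip helpers are hypothetical names for this
example and are not taken from the spec.

```python
# The five ASCII "space characters".
HTML_SPACE_CHARACTERS = " \t\n\f\r"

# Assumption: code points with the Unicode White_Space property.
WHITE_SPACE = "".join(
    chr(cp)
    for cp in (
        list(range(0x0009, 0x000E))
        + [0x0020, 0x0085, 0x00A0, 0x1680]
        + list(range(0x2000, 0x200B))
        + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
    )
)


def strip_space_characters(value):
    """Trim only the ASCII space characters, as for attribute values."""
    return value.strip(HTML_SPACE_CHARACTERS)


def strip_white_space(value):
    """Trim every White_Space character, as for element content."""
    return value.strip(WHITE_SPACE)


if __name__ == "__main__":
    padded = "\u00a02010-03-10\u00a0"
    print(repr(strip_space_characters(padded)))  # no-break spaces kept
    print(repr(strip_white_space(padded)))       # no-break spaces trimmed
```

With the first helper a no-break space remains part of the value; with the
second it is trimmed like any other space before the date is parsed.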

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 10 March 2010 02:55:41 UTC