Re: UTF-8 with Unicode line separator and BOM

Christian Ottosson <christian.ottosson@kurir.net> wrote:

> At least the Unicode line (and paragraph) separators should be 
> recognized as "white space", I think, shouldn't they?

No.  "2.3 Common Syntactic Constructs" of XML 1.0 says:

    S (white space) consists of one or more space (#x20) characters,
    carriage returns, line feeds, or tabs.

  cf. http://www.w3.org/TR/REC-xml#sec-common-syn

And Production 3 formally defines this as:

    [3]    S    ::=    (#x20 | #x9 | #xD | #xA)+

  cf. http://www.w3.org/TR/REC-xml#NT-S

So neither LINE SEPARATOR (U+2028) nor PARAGRAPH SEPARATOR (U+2029)
is white space; both are just treated as character data.  That's
why the validator correctly reports errors (apart from the BOM).
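
To illustrate, here is a minimal Python sketch of a check against
Production 3.  The names XML_WHITESPACE and is_xml_whitespace are
made up for this example and are not part of any XML library:

```python
# The four characters allowed by production [3] S of XML 1.0:
# space (#x20), tab (#x9), carriage return (#xD), line feed (#xA).
XML_WHITESPACE = frozenset("\x20\x09\x0D\x0A")


def is_xml_whitespace(text: str) -> bool:
    """Return True if text matches S: one or more of the four characters."""
    return len(text) > 0 and all(ch in XML_WHITESPACE for ch in text)


if __name__ == "__main__":
    for sample in (" \t\r\n", "\u2028", "\u2029", " \u2028 "):
        print(repr(sample), is_xml_whitespace(sample))
```

Note that a general-purpose test such as Python's str.isspace() is
broader than S, so it should not be used as a stand-in for the XML
definition.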

Moreover, the "Unicode in XML and other Markup Languages" specification
explicitly discourages the use of the line and paragraph separators
(U+2028 .. U+2029) as "not suitable for use with markup".

  cf. http://www.w3.org/TR/unicode-xml/#Charlist

So I'd recommend not using them, even as character data.
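
If you want to locate them in existing content before cleaning it up,
something like the following Python sketch could serve as a starting
point.  The names DISCOURAGED and find_separators are hypothetical, and
the function only looks at text that has already been decoded:

```python
# The two characters discussed above, keyed by code point.
DISCOURAGED = {
    "\u2028": "LINE SEPARATOR",
    "\u2029": "PARAGRAPH SEPARATOR",
}


def find_separators(text: str) -> list[tuple[int, str]]:
    """Return (offset, character name) for each U+2028 / U+2029 in text."""
    return [(i, DISCOURAGED[ch]) for i, ch in enumerate(text) if ch in DISCOURAGED]


if __name__ == "__main__":
    for offset, name in find_separators("first\u2028second\u2029"):
        print(offset, name)
```

What to replace them with (a line feed, or markup such as a paragraph
element) depends on the document type, so the sketch only reports
positions and leaves the replacement to you.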

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Tuesday, 17 October 2000 12:20:51 UTC