Re: UTF-8 with Unicode line separator and BOM from Masayasu Ishikawa on 2000-10-17 (www-validator@w3.org from October 2000)

From: Masayasu Ishikawa <mimasa@w3.org>
Date: Wed, 18 Oct 2000 01:20:14 +0900
To: christian.ottosson@kurir.net
Cc: plh@w3.org, www-validator@w3.org
Message-Id: <20001018012014T.mimasa@w3.mag.keio.ac.jp>

Christian Ottosson <christian.ottosson@kurir.net> wrote:

> At least the Unicode line (and paragraph) separators should be 
> recognized as "white space", I think, shouldn't they?

No.  "2.3 Common Syntactic Constructs" of XML 1.0 says:

    S (white space) consists of one or more space (#x20) characters,
    carriage returns, line feeds, or tabs.

  cf. http://www.w3.org/TR/REC-xml#sec-common-syn

And Production 3 formally defines this as:

    [3]    S    ::=    (#x20 | #x9 | #xD | #xA)+

  cf. http://www.w3.org/TR/REC-xml#NT-S

So neither LINE SEPARATOR (U+2028) nor PARAGRAPH SEPARATOR (U+2029)
is white space; both are treated as ordinary character data.  That's
why the validator correctly reports errors (apart from the BOM).
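
As a rough illustration of Production 3, here is a minimal Python sketch
that tests a string against S.  The names XML_S and is_xml_white_space
are made up for this example; this is not how the validator itself is
implemented.

```python
import re

# Production [3] of XML 1.0: S ::= (#x20 | #x9 | #xD | #xA)+
# XML_S and is_xml_white_space are hypothetical names for this sketch.
XML_S = re.compile(r"[\x20\x09\x0D\x0A]+")


def is_xml_white_space(text: str) -> bool:
    """Return True if text is non-empty and made up only of S characters."""
    return XML_S.fullmatch(text) is not None


if __name__ == "__main__":
    # Space, tab, CR and LF match S; U+2028 and U+2029 do not.
    for ch in (" ", "\t", "\r", "\n", "\u2028", "\u2029"):
        print(f"U+{ord(ch):04X}: {is_xml_white_space(ch)}")
```

The check spells out the four characters instead of relying on Python's
str.isspace(), because str.isspace() follows the Unicode character
properties and does count U+2028 and U+2029 as white space, which is
exactly the difference at issue here.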

Moreover, the "Unicode in XML and other Markup Languages" specification
explicitly discourages the use of the line and paragraph separators
(U+2028 .. U+2029) as "not suitable for use with markup".

  cf. http://www.w3.org/TR/unicode-xml/#Charlist

So I'd recommend not using them, even as character data.

Regards,
-- 
Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium

Received on Tuesday, 17 October 2000 12:20:51 UTC