Re: UTF-8 with Unicode line separator and BOM

From: Masayasu Ishikawa <mimasa@w3.org>
Date: Wed, 18 Oct 2000 01:20:14 +0900
To: christian.ottosson@kurir.net
Cc: plh@w3.org, www-validator@w3.org
Message-Id: <20001018012014T.mimasa@w3.mag.keio.ac.jp>

Christian Ottosson <christian.ottosson@kurir.net> wrote:

> At least the Unicode line (and paragraph) separators should be 
> recognized as "white space", I think, shouldn't they?

No.  "2.3 Common Syntactic Constructs" of XML 1.0 says:

    S (white space) consists of one or more space (#x20) characters,
    carriage returns, line feeds, or tabs.

  cf. http://www.w3.org/TR/REC-xml#sec-common-syn

And Production 3 formally defines this as:

    [3]    S    ::=    (#x20 | #x9 | #xD | #xA)+

  cf. http://www.w3.org/TR/REC-xml#NT-S

So, neither LINE SEPARATOR (U+2028) nor PARAGRAPH SEPARATOR (U+2029)
is white space - both are just treated as character data.  That's
why the validator correctly reports errors (apart from the BOM).
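
As a rough illustration of Production 3, here is a minimal Python
sketch that tests whether a string matches S.  The names XML_S and
is_xml_whitespace are made up for this example; they are not part of
the validator or of any XML library.

```python
import re

# Production [3] of XML 1.0: S ::= (#x20 | #x9 | #xD | #xA)+
# XML_S and is_xml_whitespace are illustrative names only.
XML_S = re.compile(r"[\x20\x09\x0D\x0A]+")

def is_xml_whitespace(text: str) -> bool:
    """Return True if text consists solely of characters allowed by S."""
    return XML_S.fullmatch(text) is not None

print(is_xml_whitespace(" \t\r\n"))  # True: space, tab, CR, LF
print(is_xml_whitespace("\u2028"))   # False: LINE SEPARATOR
print(is_xml_whitespace("\u2029"))   # False: PARAGRAPH SEPARATOR
```

Note that general-purpose helpers such as Python's str.isspace() follow
the broader Unicode notion of white space, so they should not be used
as a stand-in for S.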

Moreover, the "Unicode in XML and other Markup Languages" specification
explicitly discourages the use of line and paragraph separators
(U+2028 .. U+2029) as "not suitable for use with markup".

  cf. http://www.w3.org/TR/unicode-xml/#Charlist

So I'd recommend not using them, even as character data.
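
If it helps to locate such characters before validating, something
along these lines could be used.  This is only a sketch: SEPARATORS
and find_separators are hypothetical names, and the sample string is
invented for the example.

```python
# Hypothetical helper: report where U+2028 / U+2029 occur in a string.
SEPARATORS = {
    "\u2028": "LINE SEPARATOR",
    "\u2029": "PARAGRAPH SEPARATOR",
}

def find_separators(text: str) -> list[tuple[int, str]]:
    """Return (character offset, name) for each U+2028 / U+2029 found."""
    return [(i, SEPARATORS[ch]) for i, ch in enumerate(text) if ch in SEPARATORS]

# Invented sample containing one of each separator.
sample = "<p>one\u2028two</p>\u2029"
for offset, name in find_separators(sample):
    print(f"offset {offset}: {name}")
```

Offsets here count characters in the decoded string, not bytes in the
UTF-8 file.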

Masayasu Ishikawa / mimasa@w3.org
W3C - World Wide Web Consortium
Received on Tuesday, 17 October 2000 12:20:51 UTC
