SP,% before encoding declaration (was: RE: I18N issues with the XML Specification) from Martin J. Duerst on 2000-04-12 (xml-editor@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 12 Apr 2000 17:14:46 +0900
To: "François Yergeau" <yergeau@alis.com>, "'Misha Wolf'" <misha.wolf@reuters.com>, <w3c-i18n-ig@w3.org>
Cc: xml-editor@w3.org, w3c-xml-core-wg@w3.org
Message-Id: <4.2.0.58.J.20000412164407.03376f00@sh.w3.mag.keio.ac.jp>

I'm not sure I agree. I have read Makoto's mail, and his
analysis is very thorough, and I'm not questioning it here.

However, for UTF-16 and anything similar to it, and for
any kind of entity, one of the following is true:

- It has some external encoding info. There is no need
   for heuristics.
- It is UTF-16. In this case, it has a BOM.
- It has an encoding declaration.

Makoto clearly shows that it's possible to have white space
and some other stuff at the start of external subsets,...,
BUT that is only the case if there is no TextDecl or XMLDecl.
So whatever has an encoding declaration has it first, without
any kind of other stuff before it (except a BOM).

This is easy to see from the following rules:

[22]  prolog ::=  XMLDecl? Misc* (doctypedecl Misc*)?
[30] extSubset ::=  TextDecl? extSubsetDecl
[79]  extPE ::=  TextDecl? extSubsetDecl
[78]  extParsedEnt ::=  TextDecl? content

I therefore propose that the various white-space and %
cases, as well as the first sentence of the last paragraph in
E44, be removed. I have reflected that at
http://www.w3.org/International/Group/issues/xml/Overview.html#charset.autodetection
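
To illustrate the point, here is a rough sketch in Python of what
detection reduces to for an entity without external encoding info.
The function name, the table name and the labels are my own invention
for this mail; the byte patterns are the ones I understand the
autodetection appendix of the XML spec to use. Note that the table
contains only BOMs and the start of a declaration ("<?xm" in its
various widths), with no patterns for leading white space or %.

```python
# Hypothetical sketch: names and labels are illustrative, not taken from the spec.
# Each entry is (leading bytes, encoding family). Four-byte BOMs come first so
# that FF FE 00 00 is not mistaken for a two-byte UTF-16 BOM.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4 big-endian, BOM"),
    (b"\xff\xfe\x00\x00", "UCS-4 little-endian, BOM"),
    (b"\xfe\xff", "UTF-16 big-endian, BOM"),
    (b"\xff\xfe", "UTF-16 little-endian, BOM"),
    (b"\xef\xbb\xbf", "UTF-8, BOM"),
    # No BOM: the entity must start with '<?xml' of a declaration.
    (b"\x00\x00\x00\x3c", "UCS-4 big-endian"),
    (b"\x3c\x00\x00\x00", "UCS-4 little-endian"),
    (b"\x00\x3c\x00\x3f", "16-bit big-endian"),
    (b"\x3c\x00\x3f\x00", "16-bit little-endian"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]

DEFAULT = "UTF-8 (no declaration)"


def guess_encoding_family(head: bytes) -> str:
    """Guess the encoding family from the first bytes of an entity."""
    for prefix, family in SIGNATURES:
        if head.startswith(prefix):
            return family
    # Anything else (white space, '%', '<!', ...) means there is no
    # declaration at the start, so the default applies.
    return DEFAULT
```

An entity starting with white space or a parameter-entity reference
simply falls through to the default, which is all the heuristic has
to do in that case.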

Any comments?

Regards,    Martin.

At 00/04/03 20:03 -0400, François Yergeau wrote:
>Misha wrote:
> > The result of our discussions is recorded in:
> >
> >    I18N issues with the XML Specification
> >    http://www.w3.org/International/Group/issues/xml
>
>I have reviewed E44 [1], which is mentioned as the first issue in the "Deal
>with later" section of our issues list.
>
>I traced back the original mail from Murata Makoto [2] from which this
>erratum was written up.  I reviewed this mail again and it seems fine to me.
>The fact that we did not understand the erratum in Amsterdam was probably
>due to our rather hasty process, faced as we were with too much to do in too
>little time.
>
>I propose that we drop this erratum from our issues list.
>
>[1] http://www.w3.org/XML/xml-19980210-errata#E44
>[2] http://lists.w3.org/Archives/Member/w3c-xml-syntax-wg/1999Feb/0124.html
>
>--
>François

Received on Wednesday, 12 April 2000 04:12:46 UTC