UTF-16 (was: Re: Charset reviewer appointed) from Martin J. Duerst on 1998-07-29 (ietf-charsets@w3.org from July to September 1998)

From: Martin J. Duerst <duerst@w3.org>
Date: Wed, 29 Jul 1998 18:20:00 +0900
To: Harald Tveit Alvestrand <Harald.Alvestrand@maxware.no>
Cc: unicore@unicode.org, Multiple Recipients of Unicore <unicore@unicode.org>, kenw@sybase.com, ietf-charsets@iana.org
Message-id: <199807290913.SAA27782@sh.w3.mag.keio.ac.jp>

At 09:02 98/07/29 +0200, Harald Tveit Alvestrand wrote:

> >At 13:42 98/07/27 +0200, Harald Tveit Alvestrand wrote:
> >
> >> The BOM is part of the charset that UTF-16 represents.
> >> Any application can say anything it wants to *further restricting*
> >> what characters can apply where; the part we couldn't tolerate
> >> was if XML insisted upon strings that were *illegal* in the registered
> >> UTF-16, yet calling the charset "UTF-16".

> What I was saying is that if XML states that all valid XML documents
> must start with the BOM, that's no more problematic than if HTML
> states that all valid HTML documents must start with <!DOCTYPE;
> this is part of the application, not part of the charset.
> 
> I'm not saying it's a good idea; I strongly suspect that it's not.
> But it does not need to have the consent of the charset registration.

What XML is currently stating is that all UTF-16 documents must start
with a BOM, and that this BOM is not part of the real XML document.
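
As a rough illustration of that rule (not taken from the XML spec itself), a
reader of UTF-16 input might require the BOM, use it to pick the byte order,
and then drop it so that U+FEFF never reaches the document text. The function
name decode_utf16_xml is hypothetical, and the sketch assumes the whole entity
is already in memory as bytes:

```python
import codecs


def decode_utf16_xml(data: bytes) -> str:
    """Decode a UTF-16 entity, requiring and then discarding the leading BOM.

    Hypothetical helper for illustration only; not part of any XML library.
    """
    if data.startswith(codecs.BOM_UTF16_BE):      # bytes FE FF
        encoding = "utf-16-be"
    elif data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        encoding = "utf-16-le"
    else:
        raise ValueError("UTF-16 entity does not start with a BOM")
    # Skip the two BOM bytes: the BOM is a signature, not document content.
    return data[2:].decode(encoding)
```

Decoding with the explicit "utf-16-be" / "utf-16-le" codecs after slicing off
the first two bytes keeps the BOM handling visible, rather than relying on the
generic "utf-16" codec to consume it silently.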

XML does not say anything about a BOM for UTF-8, but the whole text
(in particular http://www.w3.org/TR/REC-xml#charencoding) and
the examples it gives (http://www.w3.org/TR/REC-xml#sec-guessing)
strongly suggest that such a thing was never even taken into any
kind of consideration (Makoto, please correct me if this is otherwise).
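
To make the point about the guessing examples concrete, here is a minimal
sketch of byte-prefix sniffing in the spirit of that section. The table below
is a partial reconstruction from memory, not a copy of the spec (the UCS-4 and
EBCDIC cases are left out), and the names SNIFF_TABLE and guess_encoding_family
are hypothetical:

```python
from typing import Optional

# Partial, illustrative prefix table; an assumption, not the spec's full list.
SNIFF_TABLE = [
    (b"\xfe\xff", "UTF-16, big-endian (BOM)"),
    (b"\xff\xfe", "UTF-16, little-endian (BOM)"),
    (b"\x00\x3c\x00\x3f", "UTF-16, big-endian, no BOM"),
    (b"\x3c\x00\x3f\x00", "UTF-16, little-endian, no BOM"),
    (b"\x3c\x3f\x78\x6d", "UTF-8 or another ASCII-compatible encoding"),
]


def guess_encoding_family(data: bytes) -> Optional[str]:
    """Return a coarse guess from the first bytes, or None if nothing matches."""
    for prefix, family in SNIFF_TABLE:
        if data.startswith(prefix):
            return family
    return None
```

With a table like this, input that begins with the bytes EF BB BF (U+FEFF
encoded in UTF-8) matches none of the entries, which is consistent with a
UTF-8 BOM not having been considered.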

For all the other (legacy) encodings, putting in a BOM at the beginning
of the document wouldn't be impossible in theory (using "&#xFEFF;"),
but makes even less sense, and is definitely not required, nor would
it be considered correct XML.

Regards,   Martin.


Received on Wednesday, 29 July 1998 02:19:54 UTC