RE: Rephrasing of Document Character Set article from Phillips, Addison on 2008-06-10 (public-i18n-core@w3.org from April to June 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 10 Jun 2008 07:33:04 -0700
To: Martin Duerst <duerst@it.aoyama.ac.jp>, Richard Ishida <ishida@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA013ADE293A@EX-SEA5-D.ant.amazon.com>

Martin wrote:
> >
> >--
> >This means that XML or HTML documents are always processed as a
> >sequence of characters from the Unicode character set.
> >--
>
> This may not always be true. It is perfectly fine to have an
> XML parser that works in US-ASCII for US-ASCII documents, and
> so on. It may not be a good idea in terms of implementation,
> but it wouldn't be against the XML Rec.
>

(personal response)

Yes, but the effect is the same: a US-ASCII document might still contain a numeric character reference (NCR) that must be treated as a Unicode code point. Note also that the paragraph directly following this sentence makes the point that the file might use any encoding, including a non-Unicode encoding.

While my suggestion might not be quite the right wording, it does, I think, convey the important point, which is that document authors may (and document processors must) treat files as if they were a sequence of Unicode code points. What encoding the processor uses internally is invisible.
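
To illustrate, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The sample documents and element names are made up for illustration and are not taken from the article. It shows only what one parser exposes, not what any particular processor does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: every byte is US-ASCII, but the content
# contains an NCR for U+20AC (EURO SIGN), which US-ASCII cannot encode.
ascii_doc = b'<?xml version="1.0" encoding="US-ASCII"?><price>&#x20AC;10</price>'
ascii_doc.decode("ascii")  # succeeds, so the file really is pure US-ASCII

# Hypothetical document in a non-Unicode encoding: byte 0xE9 is
# LATIN SMALL LETTER E WITH ACUTE in ISO-8859-1.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\xe9</word>'.encode("latin-1")


def code_points(doc: bytes) -> list[str]:
    """Parse the document and list its text content as Unicode code points."""
    text = ET.fromstring(doc).text
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(ascii_doc))
print(code_points(latin1_doc))
```

In both cases the caller sees a sequence of Unicode code points, whatever the file's encoding and whatever the parser does internally.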

Addison

Received on Tuesday, 10 June 2008 14:33:42 UTC