RE: Rephrasing of Document Character Set article

Martin wrote:
> >
> >--
> >This means that XML or HTML documents are always processed as a
> >sequence of characters from the Unicode character set.
> >--
>
> This may not always be true. It is perfectly fine to have an
> XML parser that works in US-ASCII for US-ASCII documents, and
> so on. It may not be a good idea in terms of implementation,
> but it wouldn't be against the XML Rec.
>

(personal response)

Yes, but the effect is the same: a US-ASCII document might still contain a numeric character reference (NCR) that must be treated as a Unicode code point. It is also worth noting that the paragraph directly following this sentence makes the point that the file might use any encoding, including a non-Unicode encoding.

While my suggestion might not be quite the right wording, it does, I think, convey the important point, which is that document authors may (and document processors must) treat files as if they were a sequence of Unicode code points. What encoding the processor uses internally is invisible.
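To make the point concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The two sample documents and the code_points helper are made up for illustration and come from neither the article nor the XML Rec. One document is pure US-ASCII and carries the euro sign as an NCR; the other stores the same character directly as UTF-8 bytes. Under those assumptions, a conforming parser should hand back the same sequence of Unicode code points for both.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample 1: every byte is US-ASCII; U+20AC appears only as an NCR.
ascii_doc = b'<?xml version="1.0" encoding="US-ASCII"?><p>Price: &#x20AC;5</p>'

# Hypothetical sample 2: same content, with the euro sign encoded as UTF-8 bytes.
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><p>Price: \u20ac5</p>'.encode("utf-8")


def code_points(doc):
    """Parse an XML byte string and return the root element's text as code points."""
    text = ET.fromstring(doc).text
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    # Both lists should be identical, whatever the file encoding was.
    print([hex(cp) for cp in code_points(ascii_doc)])
    print([hex(cp) for cp in code_points(utf8_doc)])
```

Whatever the parser does internally, what the caller sees is a sequence of Unicode code points, which is the behavior the proposed sentence is trying to describe.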

Addison

Received on Tuesday, 10 June 2008 14:33:42 UTC