Re: Who's on top -- charset param or BOM or encoding declaration

Henry S. Thompson scripsit:

>  G) It is a fatal error if an XML entity is determined (via default,
>      encoding declaration, or higher-level protocol) to be in a
>      certain encoding but contains byte sequences that are not legal
>      in that encoding
> 
>  H) [I]t is a fatal error if an entity encoded in UTF-8 contains any
>     ill-formed code unit sequences

Since the code units of UTF-8 entities are bytes, H is just a particular
case of G.  This was not always true, because some now-obsolete
definitions of UTF-8 allowed ill-formed byte sequences to be present
in existing documents, though they were not to be created in new ones.
Now, however, they are errors plain and simple.
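
A rough illustration of the point, using Python's strict UTF-8 decoder
as a stand-in for an XML processor's check.  The sample byte sequences
and the helper name is_well_formed_utf8 are assumptions chosen for this
sketch, not taken from the thread; they are the kinds of sequences that
older, looser definitions of UTF-8 did not clearly rule out.

```python
# Hypothetical samples: three ill-formed sequences and one well-formed one.
SAMPLES = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "value beyond U+10FFFF": b"\xf4\x90\x80\x80",
    "well-formed U+00E9": b"\xc3\xa9",
}


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data is accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        verdict = "ok" if is_well_formed_utf8(raw) else "fatal error"
        print(f"{label}: {verdict}")
```

Under rule G (and hence H), a processor that meets any of the first
three sequences in a UTF-8 entity has to stop; there is no separate
code-unit-level notion of error to consider.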

>  X) [I]t is a *fatal error* for an entity including an encoding
>     declaration to be presented to the XML processor in an encoding
>     other than that named in the declaration

Since this only applies in the absence of EI, it seems to me to be
a nullity.  Under what circumstances could this error be detected?

>  W') [E]ntities without an encoding declaration which are delivered in
>      an encoding other than UTF-8 or UTF-16 *must* provide a charset
>      parameter
> 
> This is a sensible constraint on transcoders, including non-XML-aware
> transcoders -- if you transcode out of UTF-..., you *must* tell me to
> what.

+1

>  X') [I]t is a *fatal error* for an entity including an encoding
>      declaration to be presented to the XML processor with a charset
>      parameter other than that named in the declaration
> 
> The existing 3023bis draft includes something like this.  I don't
> think it can be retained, because it conflicts with W' for
> non-XML-aware transcoders: doing the responsible thing would produce
> documents which conflict with X'.  

Alternatively, we could just say that XML-unaware transcoding proxies
are no longer state of the art: anything that undertakes transcoding is
responsible for making the necessary adjustments to the content.
I suspect that few transcoding proxies actually exist in the wild anyway.
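
What "making the necessary adjustments" might look like, as a minimal
sketch only: the function name transcode_xml and the regular expression
are hypothetical, the sketch assumes the caller already knows the source
encoding, and it leaves an entity with no encoding declaration
untouched.

```python
import re

# Matches an XML declaration's encoding pseudo-attribute at the very start
# of the entity.  A simplification, not a full parse of the XMLDecl grammar.
DECL = re.compile(
    r'^(<\?xml\s[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def transcode_xml(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from source to target and update the declaration."""
    text = data.decode(source)
    # Rewrite the declared encoding name, keeping the original quote style.
    text = DECL.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{target}{m.group(2)}",
        text,
        count=1,
    )
    # Raises UnicodeEncodeError if target cannot represent some character.
    return text.encode(target)


if __name__ == "__main__":
    original = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
    print(transcode_xml(original.encode("latin-1"), "latin-1", "utf-8"))
```

A transcoder that does this never produces the charset/declaration
mismatch that X' would penalize; one that only re-encodes the bytes and
sets the charset parameter does.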

>  XX') It is a *fatal error* for an XML entity beginning with a BOM to
>       declare an encoding other than that implied by the BOM.
> 
> That is, the BOM is authoritative regardless of whether EI is present
> or not.

If it's authoritative, then why worry about what the EI says?  I see
no reason to bother making this an error.
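
To make that concrete, here is a sketch of a decoder that treats the BOM
as authoritative and merely reports, without rejecting, whatever the
encoding declaration says.  All names here are hypothetical, and only
the UTF-8 and UTF-16 BOMs are considered.

```python
import codecs
import re

# BOMs recognized by this sketch, with the encoding each one implies.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified match for the encoding pseudo-attribute of an XML declaration.
DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def encoding_from_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_bom_authoritative(data: bytes):
    """Decode per the BOM; return (text, declared_encoding_or_None)."""
    encoding, skip = encoding_from_bom(data)
    if encoding is None:
        raise ValueError("entity does not begin with a recognized BOM")
    text = data[skip:].decode(encoding)
    match = DECL.match(text)
    # The declared name is returned for information only; it is never
    # used to choose the decoder and a mismatch is not treated as an error.
    return text, (match.group(1) if match else None)


if __name__ == "__main__":
    entity = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
    print(decode_bom_authoritative(entity))
```

The declaration is read only after the decoder has already been chosen,
so a conflicting value has no effect on how the entity is interpreted.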

>  Y') [leads to a generalization of W'] Non-UTF-8-encoded entities
>      without an encoding declaration *must* be delivered with a
>      charset parameter and/or (in the case of UTF-16) a BOM

Isn't this equivalent to W'?  If so, then I think W' is clearer and
should be used.

> So, my tentative conclusion is that I will add something like Y' to
> 3023bis, and also replace the existing language in 3023bis which
> amounts to X' with something that notes that it's not an error per the
> XML spec. if there's a conflict between the charset param and the
> encoding decl, and that the charset param takes precedence in such a

If we must, we must.

-- 
John Cowan <cowan@ccil.org>             http://www.ccil.org/~cowan
"Make a case, man; you're full of naked assertions, just like Nietzsche."
"Oh, i suffer from that, too.  But you know, naked assertions or GTFO."
                        --heard on #scheme, sorta

Received on Sunday, 10 November 2013 07:02:12 UTC