
Re: What makes illegal characters non-conformant

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 24 Sep 2009 04:34:42 +0200
To: ht@inf.ed.ac.uk (Henry S. Thompson)
Cc: public-html-comments@w3.org
Message-ID: <sqjlb59gg4pch3n3jk2fom6h9vgfkk2tr9@hive.bjoern.hoehrmann.de>
* Henry S. Thompson wrote:
>I don't think I have a problem with that, I can imagine an argument
>that it's broken (although http://www.ltg.ed.ac.uk/~ht/char_alias.xml
>is _not_ broken per the XML specification. . .), but I can't find
>anywhere in the HTML5 spec. which says so.  Does it/should it?

It is not broken per the XML specification by the same reasoning that a
PNG image is not broken per the XML specification. Procedurally, in both
cases the XML processor determines some character encoding and attempts
to decode the document, and then encounters byte sequences that have no
well-defined meaning according to the encoding's specification. It is
therefore not possible to recover the textual data that the binary data
represents, and the XML specification only defines conformance for
processors and textual data objects.
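As a minimal illustration in Python (a sketch, not how any particular
XML processor is implemented): once the detected encoding assigns no
meaning to a byte sequence, the decoding step itself fails, and there is
no textual data object left to judge for conformance at all.

```python
# Bytes sniffed (or declared) as UTF-8, but containing a byte that
# has no meaning under UTF-8: 0xFF never occurs in well-formed UTF-8.
data = b"<x>\xff</x>"

try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    # The textual data cannot be recovered; decoding fails before any
    # XML-level conformance question can even be asked.
    print("undecodable:", e.reason)
```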

Consider that the XML specification does not normatively define exactly
how to determine the character encoding (and I am ignoring that you've
used text/xml as the media type for the document, which has other
theoretical considerations rarely met in practice), so you can easily
define a new
character encoding very-bogus-encoding as "Any sequence of bytes stands
for the text <?xml version='1.0' encoding='very-bogus-encoding'?><x/>"
and your document would be perfectly conforming if the processor does
indeed support that encoding.
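The thought experiment can even be made concrete. A hypothetical Python
sketch (the codec and its behaviour are of course my invention,
mirroring the definition above): once very-bogus-encoding is registered
as a real codec, any byte sequence whatsoever decodes to that fixed,
well-formed document.

```python
import codecs

FIXED = "<?xml version='1.0' encoding='very-bogus-encoding'?><x/>"

def bogus_decode(data, errors="strict"):
    # Per the definition above: any sequence of bytes stands for FIXED.
    return FIXED, len(data)

def bogus_encode(text, errors="strict"):
    # Lossy by construction; every text maps to zero bytes.
    return b"", len(text)

def search(name):
    # Python normalizes codec names, so accept both spellings.
    if name in ("very-bogus-encoding", "very_bogus_encoding"):
        return codecs.CodecInfo(bogus_encode, bogus_decode, name=name)
    return None

codecs.register(search)

# Arbitrary junk bytes now decode to a well-formed XML document.
print(b"\x00\xff\x17junk".decode("very-bogus-encoding"))
```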

Cases like this do in fact exist in the real world: with UTF-32 encoded
documents, a processor that does not support UTF-32 may instead detect
UTF-16 or UTF-8 and encounter illegal byte sequences or disallowed
characters. The only difference is one of perception, as UTF-32 is
widely recognized while very-bogus-encoding is not.
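To see the UTF-32 case concretely (a Python sketch; real processors
differ in how they sniff encodings): the UTF-32LE bytes of an ASCII-only
document decode without error when misread as UTF-8, but the result is
riddled with U+0000, a character the XML 1.0 Char production disallows
entirely.

```python
# An ASCII-only document encoded as UTF-32LE: each character becomes
# one meaningful byte followed by three NUL bytes.
data = "<x/>".encode("utf-32-le")

# Misdetected as UTF-8, the bytes decode without error, because a NUL
# byte is perfectly valid UTF-8...
text = data.decode("utf-8")

# ...but U+0000 is excluded from XML 1.0's Char production, so the
# processor sees disallowed characters, not the intended markup.
print("contains U+0000:", "\x00" in text)  # contains U+0000: True
```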

It is ultimately entirely irrelevant whether your document is broken
per the XML specification when, as far as common sense goes, it is
broken per the US-ASCII specification. You might just as well have your
web server send out malformed TCP segments or a malformed HTTP response
and muse about how that is or is not broken per unrelated
specifications. Similarly, very-bogus-encoding is irrelevant because it
violates what is considered common sense. http://xkcd.com/468/ comes to
mind.

(The XML specification actually considers your case a fatal error, and
fatal errors are in turn violations of the constraints of the
specification. I have argued unsuccessfully against that in the past,
as having specification violations depend on processor capabilities is
itself a violation of common sense.)
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Thursday, 24 September 2009 02:35:23 GMT
