Re: XML file upload issues for encoding="UTF-8" from Frank Ellermann on 2007-09-16 (www-validator@w3.org from September 2007)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Sun, 16 Sep 2007 21:58:09 +0200
To: www-validator@w3.org
Message-ID: <fck1so$8tv$1@sea.gmane.org>

Olivier Thereaux wrote:

>> Apparently (= reported by the validator) my browser claims to send
>> Content-Type: text/xml without charset.  Therefore the validator
>> expects US-ASCII ignoring the first input line:

> It doesn't just expect us-ascii. It *has* to process as us-ascii,
> per http://tools.ietf.org/html/rfc3023#section-8.5

Yes, I'm aware of RFC 3023, and instead of "expects" I should have
written "MUST assume" US-ASCII.  But after that step we get to the
business at hand: I'm the author of a tool creating an XML document
and wish to validate it.  I'm not the author of the OS and the
browser used to upload this document to the validator (for the
upload interface), and I'm unfortunately not the admin of the Web
server where the validator finds this document (in the case of the
URL interface).

For obvious reasons browsers and Web servers of 3rd parties might
be sloppy and assume that anyfile.xml is text/xml without bothering
to figure out the correct charset.  That's sad or something, but
it's not what I'm really interested in; I want to see what's wrong
(if anything) _within_ my document.

So if the validator told me "BTW, your upload tool failed to
announce the correct charset" it would be okay.  But what it really
does is refuse to start working at all; it doesn't even show me
the source with the offending octet when I explicitly want this :-(
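
A minimal sketch of that kind of hint, in Python, purely as an
illustration.  The function names are made up here, and this is not
how the validator is implemented.  It keeps the RFC 3023 rule
(text/xml without a charset parameter means us-ascii) and only adds
a note when the XML declaration says something else:

```python
import re
from email.message import Message

# Hypothetical helper pattern: pulls the encoding pseudo-attribute
# out of an XML declaration at the very start of the document.
XML_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def transport_charset(content_type):
    """Charset implied by the Content-Type header value, or None.

    For text/xml without a charset parameter this returns us-ascii,
    following RFC 3023 section 8.5.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    if charset is not None:
        return charset.lower()
    if msg.get_content_type() == "text/xml":
        return "us-ascii"
    return None


def declared_encoding(body):
    """Encoding named in the XML declaration, or None if absent."""
    match = XML_DECL_ENCODING.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def charset_note(content_type, body):
    """Return a warning string when transport and document disagree."""
    mandated = transport_charset(content_type)
    declared = declared_encoding(body)
    if mandated and declared and mandated != declared:
        return ("BTW, the transport implies %s, but the document "
                "declares %s" % (mandated, declared))
    return None
```

With Content-Type "text/xml" and a document that starts with an
encoding="UTF-8" declaration, charset_note() returns the warning;
with "text/xml; charset=utf-8" it returns None.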

> see also: http://annevankesteren.nl/2005/03/text-xml

Nice... :-)  Wrt the validator you have two kinds of users: those
who implement tools like Firefox or administer Web servers, and
another group writing documents or implementing tools to create
documents.  The second group should be much larger, and IMO the
validator should help them as well as possible without giving up
on being strict.

The validator reported the _last_ offending octet in line 35591, so
obviously it didn't run into serious processing problems in this
case.  It could finish its processing in an orderly manner, e.g.
show the source when I want this, and any errors it finds, instead
of throwing a fatal error and giving up.  Just "giving up" is an
option when there are too many errors, say somebody uploading a
binary, or tons of NULs in UTF-16 interpreted as UTF-8 or ASCII.

But for the common cases "ASCII turns out to be UTF-8", "Latin-1
turns out to be windows-1252", or similar, it shouldn't take the
fast "fatal error" exit.
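
For the "ASCII turns out to be UTF-8" case, a rough Python sketch of
a softer diagnosis could look like this.  The names and the report
limit are assumptions for illustration only, not the validator's
actual code.  It lists the octets outside US-ASCII with their line
and column, and says whether the input would at least be valid
UTF-8:

```python
def non_ascii_octets(body):
    """Yield (line, column, octet) for each byte outside US-ASCII.

    Lines are split on LF; line and column numbers start at 1.
    """
    for line_no, line in enumerate(body.split(b"\n"), start=1):
        for col, octet in enumerate(line, start=1):
            if octet > 0x7F:
                yield line_no, col, octet


def diagnose(body, max_report=10):
    """Return (verdict, first offending octets) instead of aborting."""
    hits = list(non_ascii_octets(body))
    if not hits:
        return "clean US-ASCII", []
    try:
        body.decode("utf-8")
        verdict = "not US-ASCII, but valid UTF-8"
    except UnicodeDecodeError:
        verdict = "not US-ASCII and not valid UTF-8"
    # Only the first max_report hits are returned to keep reports short.
    return verdict, hits[:max_report]
```

A caller could still treat the input as an error in strict mode, but
it would have enough information to show the source and point at the
offending octets.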

Frank

Received on Sunday, 16 September 2007 19:59:44 UTC