Re: Fallback to UTF-8 from Frank Ellermann on 2008-04-28 (www-validator@w3.org from April 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Mon, 28 Apr 2008 11:13:12 +0200
To: www-validator@w3.org
Message-ID: <fv44a4$7mk$1@ger.gmane.org>

Olivier Thereaux wrote:

> the feedback requested was mostly on documents without
> declared character encoding

I can't help you there, as I have no Latin-1 files.  Andreas
posted a test URL.

> error is fixed. 5279 and 5280 are untouched.

FWIW I still think that these are bugs or oddities in the
XML 1.0 3rd and 4th editions, and not in the validator.

>> [warning] Unable to Determine Parse Mode!
>> [...]
>> | Type (-//IETF//DTD HTML i18n//EN) is not in the validator's catalog
>> <http://freenet-homepage.de/Xyzzy/home/test/res.html>
>> (SGML is correct - RFC 2070 DTD republished by IANA)

> It does validate right now. I guess you are pointing out that
> “Unable to Determine Parse Mode” could use a better wording
> in this case?

No, the wording is fine, but what happens is suboptimal:
* The DTD should be added to the catalog, then SGML is clear.
* At the moment the default parse mode is SGML, that happens
  to be okay for this document.  But when you say that UTF-8
  is the future you can as well say that SGML is the past:
* HTML5 will do its own thing, the rest of the world uses XML,
  and SGML is doomed, ignoring billions of HTML < 5 documents.
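
The first point can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions: the CATALOG mapping,
its two initial entries and the parse_mode function are hypothetical
and are not taken from the validator's code.

```python
# Hypothetical catalog mapping a Formal Public Identifier (FPI)
# to a parse mode.  The entries are illustrative only.
CATALOG = {
    "-//W3C//DTD HTML 4.01//EN": "SGML",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XML",
}


def parse_mode(fpi, default="SGML"):
    """Return (mode, known).

    known is False when the FPI is missing from the catalog and the
    default mode had to be used, which is the case that triggers the
    "Unable to Determine Parse Mode" style of warning.
    """
    mode = CATALOG.get(fpi)
    if mode is None:
        return default, False
    return mode, True


fpi = "-//IETF//DTD HTML i18n//EN"
print(parse_mode(fpi))   # falls back to the default, known is False

# Once the DTD is added to the catalog, the mode is no longer a guess.
CATALOG[fpi] = "SGML"
print(parse_mode(fpi))   # same mode, but known is now True
```

With the entry in place the result is the same SGML mode, but it is
looked up instead of assumed, so no warning would be needed.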

>> [warning] Missing "charset" attribute for "text/xml" document.
>> <http://freenet-homepage.de/Xyzzy/home/test/utf-4.xml>
>> (this text/xml document really uses encoding US-ASCII)

>| HEAD http://freenet-homepage.de/Xyzzy/home/test/utf-4.xml
>| Content-Type: text/xml
> So I guess the warning by the validator, that the spec
> specifies a strong default of "us-ascii" is OK here?

IMO it is odd: it warns about a "strong default" of ASCII,
but the document in fact starts with <?xml encoding="us-ascii" ?>
and turns out to be ASCII.  If it used a single octet above
0x7F, that would cause a real error message.  The warning is
apparently pointless; can the validator output an "info" instead?
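
A minimal Python sketch of the check described above, assuming the
document is available as raw bytes.  The function name and the sample
document are made up for this example; this is not the validator's
implementation.

```python
def first_non_ascii(data: bytes):
    """Return (offset, octet) of the first octet above 0x7F, or None."""
    for offset, octet in enumerate(data):
        if octet > 0x7F:
            return offset, octet
    return None


# Sample document, invented for this example.
doc = b'<?xml version="1.0" encoding="us-ascii"?>\n<root>plain</root>\n'

print(first_non_ascii(doc))             # None: really US-ASCII
print(first_non_ascii(doc + b"\xe9"))   # offset and value of the bad octet
```

If the function returns None, the "strong default" and the actual
content agree, so there is nothing to warn about.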

With "info: assuming ASCII" or "info: using SGML" (see above)
the case would be clearer.  I'm used to "a warning is always
bad news", as with the compiler option "pedantic".  About a
decade ago the *NIX style was "no news is good news".

>> [warning] Mismatch between Public and System identifiers
[...]
>> (the released validator has no problem with using System
>>  identifiers pointing to its own catalog, maybe it's an
>>  artefact of the qa-dev.w3.org != validator.w3.org setup)

> That's a new feature. Some recent feedback prompted the
> addition of a check for consistency between FPI and SI.
> It's a warning, so as usual, it can be ignored if you are
> sure of your doctype.

I'll never ever ignore warnings.  They can cost months when
porting code from compiler A on platform B to compiler C on
platform D.  A warning is a serious thing like a SHOULD in
an RFC (and an error is like a MUST meaning "you are dead").

When the WDG validator says "warning" it means "this will
break each and every browser, even if it could be in a very
formal and theoretical sense 'valid' (for Amaya, not ITW)".

>> [warning] Character Encoding mismatch!
>> | The character encoding specified in the HTTP header
>> | (iso-8859-1) is different from the value in the <meta>
>> | element (windows-1252).
[...]
>> (Nikita consistently hates u+0080 based on an iso-8859-1
>> assumption, and the document uses a windows-1252 0x80 €)

> Mm, sorry, not sure if you are reporting an issue or a 
> "work as it should, here". Can you give more details?

If the validator really assumes iso-8859-1, then 0x80 is
u+0080, not a valid SGML character, as reported by Nikita.

That error message is missing.  As soon as it shows up I
could again say that the assumption is already wrong: of
course a windows-1252 0x80 is okay...
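
The difference between the two readings of octet 0x80 can be shown
with Python's standard codecs.  This is only a sketch of the two
decodings and says nothing about how the validator handles them
internally.

```python
import unicodedata

octet = b"\x80"

as_latin1 = octet.decode("iso-8859-1")     # what the HTTP header claims
as_cp1252 = octet.decode("windows-1252")   # what the <meta> element claims

# iso-8859-1 maps 0x80 to the C1 control character U+0080.
print(f"U+{ord(as_latin1):04X}", unicodedata.category(as_latin1))

# windows-1252 maps 0x80 to the euro sign U+20AC.
print(f"U+{ord(as_cp1252):04X}", unicodedata.name(as_cp1252))
```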

 Frank
Received on Monday, 28 April 2008 09:11:15 UTC