Re: charset=us-ascii mandatory? from Frank Ellermann on 2007-05-13 (www-validator@w3.org from May 2007)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Sun, 13 May 2007 12:45:50 +0200
To: www-validator@w3.org
Message-ID: <4646EC5E.3503@xyzzy.claranet.de>

Jukka K. Korpela wrote:

> Reporting &#128; is a different issue, since its meaning does not
> depend on the document's character encoding.

Yes, sometimes when they move RFCs like RFC 2070 to "historic" it
can still be interesting to read what they said, back in 1997. <g>

The then (9-11 plus two days) new validator surprised me while I
was working on a document about another charset.  I've now finally
removed the last trace of this event, my "XHTML 1.0 invalid" icon.

> My modified proposal is:

> When a document is submitted to validation so that its character
> encoding cannot be deduced in any of the ways defined (message
> headers, meta tags, defined defaults), then

> 1) a data error message is issued, preferably before any other
>    message, explaining that validation cannot be carried out due
>    to lack of character encoding (charset information), with a
>    link to a document explaining this in detail

Okay.  Maybe different links depending on XML vs. (X)HTML, and for
the latter depending on the version: the details differ between,
say, HTML 2.x and XHTML 1.1.  Your proposal is probably for (X)HTML,
not XML.  I'm ignoring text/sgml, as I can't tell what the validator
does with it, if anything.

That leaves us with various "unknown charset for (X)HTML" scenarios,
all starting with an error (1).  What about Latin-1 and HTML 2.x or
HTML 3.2?
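
To make the "cannot be deduced" condition concrete, here is a minimal
sketch, not the validator's actual code.  The function name
deduce_charset and the meta pattern are my own assumptions, and the
"defined defaults" step is deliberately left out, because it depends on
the document type and version as discussed above.

```python
import re
from email.message import Message

# Hypothetical, simplified pattern: finds charset=... inside a <meta> tag.
META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.I)


def deduce_charset(content_type, body):
    """Return a lowercased charset name, or None if nothing declares one."""
    # 1. Message headers: the charset parameter of Content-Type.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. Meta tags: only the start of the document is scanned
    #    (the 1024-octet limit is an arbitrary assumption).
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Version-specific defaults are not modelled here.
    return None
```

A None result is the case where error (1) of the proposal would be
issued.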

> 2) that message is followed by a note explaining that validation
>    process is performed under some assumptions (to be listed next
>    or in a linked document)

> 3) a further explanation is given that emphasizes that character
>    data in the document cannot be tested and that the document
>    should be submitted to validation after selecting and specifying
>    an encoding

> 4) validation is then started with an assumed character encoding
>    where
>     a) octets 0 to 7F are interpreted as ASCII
>     b) octets 80 to FF are not interpreted at all but assumed to
>        constitute non-ASCII character data

That needs an explanation; my first thought was "of course %x80-FF
are not ASCII".  Finding examples where 4a) could fail miserably
is simple (UTF-1, UTF-7, SCSU, etc.), so maybe swap your assumptions:

|     a) octets 80 to FF are assumed to constitute any non-ASCII
|        character data
|     b) octets 0 to 7F are interpreted as US-ASCII, and this could
|        result in misleading errors for some rarely used charsets.

In fact, I have no clue whether the validator supports UTF-7 or SCSU.
(For UTF-1 it can IMHO give up when it's correctly declared, but
that's unrelated to your proposal.)

>     (This is different from UNKNOWN-8BIT, which is completely
>     agnostic about the interpretation.)

Unfortunately, I checked the definition in RFC 1428.  It's in the
context of SMTP and mail; IMO we could claim that it's precisely what
you want.  But it's probably cleaner if we register a new name like
UNKNOWN-ASCII, as discussed on the charset list some months ago.

Besides, the W3C validator doesn't need a registered name for your
proposal.

> Item 4 could be omitted. After all, the user _should_ do as
> explained in item 3.

The effect of omitting item 4 is similar to "assume utf-8", minus
tons of misleading error messages.

Frank
Received on Sunday, 13 May 2007 10:51:25 UTC