
Re: Better internationalization of validator

From: Martin Duerst <duerst@w3.org>
Date: Mon, 18 Jun 2001 10:50:54 +0900
Message-Id: <>
To: Terje Bless <link@tss.no>
Cc: Gerald Oskoboiny <gerald@w3.org>, W3C Validator <www-validator@w3.org>
At 05:30 01/06/12 +0200, Terje Bless wrote:

> >Also, we may have to do some pre-sniffing anyway in order to deal with
> >UTF-16 and EBCDIC.
>I'll give you UTF-16 (kinda!), but EBCDIC is not possible to sniff for in
>any meaningful way AFAIK; for all practical purposes, it needs to be
>properly labelled in the Content-Type (IOW, it's "SEP"[1]).

No, not exactly. Please see
for how it can work for XML. I guess the same thing applies to
HTML. For HTML, there are more ways to start a file, but not
that many more. I know about

<HTML> (in various case variants, that is)

Anything else (except, of course, <?xml for XHTML)?
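
To make the idea concrete, here is a rough sketch in Python of sniffing the encoding family from a known opening string. It only looks for the two openings named in this message ("<HTML" in any case, and "<?xml"); the function name and the choice of cp037 as the EBCDIC code page are assumptions for illustration, not what the validator does.

```python
# Openings mentioned in this thread, compared case-insensitively.
KNOWN_STARTS = ("<?xml", "<html")

# Candidate encoding families to try, in order. "cp037" is one common
# EBCDIC code page, picked here as an assumption.
CANDIDATES = ("ascii", "cp037", "utf-16-be", "utf-16-le")


def sniff_family(head: bytes):
    """Return the first candidate codec under which `head` begins with
    a known opening string, or None if nothing matches."""
    sample = head[:32]  # an even length, so UTF-16 decoding stays aligned
    for codec in CANDIDATES:
        text = sample.decode(codec, errors="replace").lower()
        if text.startswith(KNOWN_STARTS):
            return codec
    return None
```

A match only identifies the family (ASCII-compatible, EBCDIC, UTF-16); the exact charset would still have to come from the declaration or the Content-Type.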

>As for UTF-16, I think it's reasonable to assume that it will be properly
>labelled or contain a BOM.

Almost, but again see the XML rec.

>Checking the first 2/3 bytes for one of the
>three possible BOMs in UTF-8/UTF-16-MSB/UTF-16-LSB is a far cry from the
>current mess (that alters the DOCTYPE if it sees "<FRAME"!).

Yes indeed.
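
For illustration, the BOM check described in the quoted text could look roughly like this in Python. The function name and return shape are made up for the sketch; the BOM constants come from the standard codecs module.

```python
import codecs

# The three BOMs mentioned above, paired with the encoding they signal.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def detect_bom(data: bytes):
    """Return (encoding, bom_length) if `data` starts with a known BOM,
    otherwise (None, 0)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

The returned length lets the caller skip the BOM before handing the rest of the bytes to a decoder.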

>Is UTF-16 ASCII-compatible enough that we can assume ASCII up to the XML
>Declaration ("<?xml ... ?>")?

Well, yes, except that every second byte is a null byte :-).
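
A small Python sketch of what that means in practice: when there is no BOM, the position of the null bytes in the first few bytes hints at the byte order. The function name is hypothetical, and the heuristic assumes the document starts with ASCII characters such as "<?xml".

```python
def guess_utf16_order(head: bytes):
    """Guess UTF-16 byte order from where the null bytes fall.

    Assumes the document opens with ASCII-range characters, so each
    16-bit unit has exactly one zero byte. Returns None otherwise.
    """
    sample = head[:8]
    if len(sample) < 4:
        return None
    even, odd = sample[0::2], sample[1::2]
    if not any(even) and all(odd):
        return "utf-16-be"  # 00 3C 00 3F ...
    if not any(odd) and all(even):
        return "utf-16-le"  # 3C 00 3F 00 ...
    return None
```

For example, "<?xml" encoded as big-endian UTF-16 is the bytes 00 3C 00 3F 00 78 00 6D 00 6C, so the even positions are all null.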

>I could live with a little content sniffing
>-- to decide between HTML or XML semantics, or to determine source charset
>before we convert to UTF-8 internally, etc. -- as long as it stops guessing
>at doctypes based on tags present, and uses an actual SGML parser to figure
>out the (provided) DOCTYPE instead of a quick+dirty regex. Once we're there
>we should be able to use said SGML/XML parser to extract the necessary
>charset info; using two-pass parsing if necessary.

Okay. I'll work on it, as I have time.

Regards,   Martin.
Received on Sunday, 17 June 2001 21:51:24 UTC
