Re: Better internationalization of validator

From: Martin Duerst (duerst@w3.org)
Date: Sun, Jun 17 2001

    Message-Id: <4.2.0.58.J.20010616212347.02ca2100@sh.w3.mag.keio.ac.jp>
    Date: Mon, 18 Jun 2001 10:50:54 +0900
    To: Terje Bless <link@tss.no>
    From: Martin Duerst <duerst@w3.org>
    Cc: Gerald Oskoboiny <gerald@w3.org>, W3C Validator <www-validator@w3.org>
    Subject: Re: Better internationalization of validator
    
    At 05:30 01/06/12 +0200, Terje Bless wrote:
    
    > >Also, we may have to do some pre-sniffing anyway in order to deal with
    > >UTF-16 and EBCDIC.
    >
    >I'll give you UTF-16 (kinda!), but EBCDIC is not possible to sniff for in
    >any meaningful way AFAIK; for all practical purposes, it needs to be
    >properly labelled in the Content-Type (IOW, it's "SEP"[1]).
    
    No, not exactly. Please see
    http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
    for how it can work for XML. I guess the same thing applies to
    HTML. For HTML, there are more ways to start a file, but not
    that many more. I know about
    
    <HTML> (in various case variants, that is)
    <!DOCTYPE ...
    
    Anything else (except, of course, <?xml for XHTML)?
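
The sniffing idea above can be sketched in Python. This is only an illustration, not the validator's code: the function name `sniff_encoding_family`, the list of start strings, and the use of cp037 as a stand-in for EBCDIC are all assumptions. A real check would go on to read the encoding declaration to pin down the exact charset.

```python
# Document starts mentioned in this message, lowercased for comparison.
KNOWN_STARTS = ("<?xml", "<!doctype", "<html")

# Candidate encoding families, tried in order. cp037 is just one EBCDIC
# code page, picked here as an assumption; "ascii" stands for any
# ASCII-compatible charset (ISO-8859-x, UTF-8, ...).
CANDIDATES = ("cp037", "utf-16-be", "utf-16-le", "ascii")


def sniff_encoding_family(data: bytes):
    """Guess the encoding family from the first bytes, or return None."""
    head = data[:32]
    for encoding in CANDIDATES:
        # errors="replace" keeps undecodable bytes from raising.
        text = head.decode(encoding, errors="replace").lower()
        if text.startswith(KNOWN_STARTS):
            return encoding
    return None
```

Decoding the head and comparing lowercased text covers the case variants of <HTML> without listing a byte pattern for each one.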
    
    
    
    
    >As for UTF-16, I think it's reasonable to assume that it will be properly
    >labelled or contain a BOM.
    
    Almost, but again see the XML rec, which also covers documents that
    start without a BOM.
    
    >Checking the first 2/3 bytes for one of the
    >three possible BOMs in UTF-8/UTF-16-MSB/UTF-16-LSB is a far cry from the
    >current mess (that alters the DOCTYPE if it sees "<FRAME"!).
    
    Yes indeed.
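
A minimal sketch of the BOM check described above, using the constants from Python's standard `codecs` module. The function name `detect_bom` and its return shape are assumptions made for illustration, and UTF-32 is deliberately left out.

```python
import codecs

# The three byte order marks named above, with the encoding each implies.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

The returned length lets the caller skip the BOM before handing the rest of the bytes to a converter.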
    
    
    >Is UTF-16 ASCII-compatible enough that we can assume ASCII up to the XML
    >Declaration ("<?xml ... ?>")?
    
    Well, yes, except that every second byte is a null byte :-).
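
The null-byte pattern can itself serve as a hint when no BOM is present. The sketch below is hypothetical (the name `guess_utf16_byte_order` and the 8-byte sample size are assumptions) and only works while the leading characters are in the ASCII range, as they are in an XML declaration.

```python
def guess_utf16_byte_order(data: bytes, sample: int = 8):
    """Guess UTF-16 byte order from where the null bytes fall, or None."""
    head = data[:sample]
    if len(head) < 4:
        return None  # too little data to say anything
    even, odd = head[0::2], head[1::2]
    # Big-endian ASCII text: null high byte first, then the character.
    if not any(even) and all(odd):
        return "utf-16-be"
    # Little-endian ASCII text: the character first, then the null byte.
    if all(even) and not any(odd):
        return "utf-16-le"
    return None
```

For example, "<?xml" in big-endian UTF-16 is the byte sequence 00 3C 00 3F 00 78 00 6D 00 6C.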
    
    
    >I could live with a little content sniffing
    >-- to decide between HTML or XML semantics, or to determine source charset
    >before we convert to UTF-8 internally, etc. -- as long as it stops guessing
    >at doctypes based on tags present, and uses an actual SGML parser to figure
    >out the (provided) DOCTYPE instead of a quick+dirty regex. Once we're there
    >we should be able to use said SGML/XML parser to extract the necessary
    >charset info; using two-pass parsing if necessary.
    
    Okay. I'll work on it, as I have time.
    
    
    Regards,   Martin.