Re: Better internationalization of validator

From: Terje Bless (link@tss.no)
Date: Mon, Jun 11 2001


    Date: Tue, 12 Jun 2001 05:30:21 +0200
    From: Terje Bless <link@tss.no>
    To: Martin Duerst <duerst@w3.org>
    cc: Gerald Oskoboiny <gerald@w3.org>, W3C Validator <www-validator@w3.org>
    Message-ID: <20010612053033-b01010701-7320a12d@192.168.1.6>
    Subject: Re: Better internationalization of validator
    
    [ I've been out of town for a couple of days. ]
    [ I'm catching up on w-v as time allows.      ]
    [ Size of backlog is currently: "Huge" :-|    ]
    
    On 11.06.01 at 16:48, Martin Duerst <duerst@w3.org> wrote:
    
    >At 05:00 01/05/22 +0200, Terje Bless wrote:
    >>[...] use HTML::Parser for [...] DOCTYPE sniffing and such[.]
    >
    >That would deal with <meta, but not with <?xml, I guess.
    
    IIRC, HTML::Parser deals with XML Processing instructions in later
    versions. To quote gaas in the POD[0]:
    
    # $p->xml_mode([$bool])
    #
    # Enabling this attribute changes the parser to allow some XML constructs
    # such as empty element tags and XML processing instructions.
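
To make the idea concrete, here is a minimal sketch of letting a real tokenizer pick up the XML declaration, the DOCTYPE and the <meta> charset. It is not HTML::Parser and not the validator's code: it swaps in Python's standard-library html.parser as a stand-in, and the names PrologSniffer and sniff_prolog are hypothetical, invented for this illustration.

```python
from html.parser import HTMLParser


class PrologSniffer(HTMLParser):
    """Hypothetical helper: records the XML declaration, DOCTYPE and <meta> charset."""

    def __init__(self):
        super().__init__()
        self.xml_decl = None
        self.doctype = None
        self.meta_charset = None

    def handle_pi(self, data):
        # Called for "<?...>"; data excludes "<?" but keeps the trailing "?".
        if self.xml_decl is None and data.lower().startswith("xml "):
            self.xml_decl = data.rstrip("?").strip()

    def handle_decl(self, decl):
        # Called for "<!DOCTYPE ...>" with the text between "<!" and ">".
        if self.doctype is None and decl.upper().startswith("DOCTYPE"):
            self.doctype = decl

    def handle_starttag(self, tag, attrs):
        # Only the http-equiv form of <meta> is looked at in this sketch.
        if tag != "meta" or self.meta_charset is not None:
            return
        attrs = dict(attrs)
        http_equiv = (attrs.get("http-equiv") or "").lower()
        content = (attrs.get("content") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.meta_charset = content.split("charset=", 1)[1].strip()


def sniff_prolog(text):
    """Feed already-decoded text to the sniffer and return it."""
    sniffer = PrologSniffer()
    sniffer.feed(text)
    sniffer.close()
    return sniffer
```

Note that this only works on text that has already been decoded, so it does not remove the need to settle the source charset first.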
    
    
    >Also, we may have to do some pre-sniffing anyway in order to deal with
    >UTF-16 and EBCDIC.
    
    I'll give you UTF-16 (kinda!), but EBCDIC is not possible to sniff for in
    any meaningful way AFAIK; for all practical purposes, it needs to be
    properly labelled in the Content-Type (IOW, it's "SEP"[1]).
    
    As for UTF-16, I think it's reasonable to assume that it will be properly
    labelled or contain a BOM. Checking the first two or three bytes for one of the
    three possible BOMs in UTF-8/UTF-16-MSB/UTF-16-LSB is a far cry from the
    current mess (that alters the DOCTYPE if it sees "<FRAME"!).
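
The BOM check described above could look roughly like this. It is a sketch only, in Python rather than the validator's Perl; sniff_bom is a hypothetical name, and the charset labels returned are Python codec names chosen for the example.

```python
import codecs

# The three signatures mentioned above. None is a prefix of another, so
# the order does not matter. UTF-32 is deliberately not handled here
# (its little-endian BOM begins with the same two bytes as UTF-16-LE).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(data):
    """Return (charset, bom_length) for raw bytes, or (None, 0) if no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would strip bom_length bytes before decoding, and fall back to the Content-Type label when nothing is found.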
    
    
    Is UTF-16 ASCII-compatible enough that we can assume ASCII up to the XML
    Declaration ("<?xml ... ?>")? I could live with a little content sniffing
    -- to decide between HTML or XML semantics, or to determine source charset
    before we convert to UTF-8 internally, etc. -- as long as it stops guessing
    at doctypes based on tags present, and uses an actual SGML parser to figure
    out the (provided) DOCTYPE instead of a quick+dirty regex. Once we're there
    we should be able to use said SGML/XML parser to extract the necessary
    charset info; using two-pass parsing if necessary.
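
On the ASCII question, one relevant fact: UTF-16 stores each ASCII character as two bytes, one of which is 0x00, so a byte-level search for "<?xml" will not match UTF-16 data. The leading "<?" does still leave a recognisable pattern, along the lines of the autodetection appendix in the XML 1.0 spec. A small sketch, with guess_utf16_without_bom being a hypothetical name and the check covering only the BOM-less case:

```python
def guess_utf16_without_bom(data):
    """Guess UTF-16 byte order from a leading "<?" when there is no BOM."""
    head = data[:4]
    if head == b"\x00<\x00?":
        return "utf-16-be"  # 00 3C 00 3F
    if head == b"<\x00?\x00":
        return "utf-16-le"  # 3C 00 3F 00
    return None  # ASCII-compatible, or something else entirely


sample = '<?xml version="1.0" encoding="UTF-16"?>'
as_le = sample.encode("utf-16-le")  # encodes without a BOM
as_be = sample.encode("utf-16-be")  # encodes without a BOM
```

This only tells us the byte order; the declaration itself would still have to be decoded as UTF-16 before its encoding attribute can be read.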
    
    
    
    
    [0] - BTW, a useful little Bookmarklet for looking up a module on CPAN:
    <URL:javascript:void(Qr=prompt('Module...',''));if(Qr)void(location.href='http://search.cpan.org/search?mode=module&query='+escape(Qr))>
    
    
    [1] - "In Other Words" and "Somebody Else's Problem", respectively. :-)