Re: Better internationalization of validator

From: Terje Bless (link@tss.no)
Date: Mon, May 21 2001

    Date: Tue, 22 May 2001 05:00:18 +0200
    From: Terje Bless <link@tss.no>
    To: Martin Duerst <duerst@w3.org>
    Cc: Gerald Oskoboiny <gerald@w3.org>, W3C Validator <www-validator@w3.org>
    Message-ID: <20010522050317-b01010701-0ce40794@192.168.1.6>
    Subject: Re: Better internationalization of validator
    
    On 22.05.01 at 10:27, Martin Duerst <duerst@w3.org> wrote:
    
    >From: Martin Duerst <duerst@w3.org>
    >Reply-To: duerst@w3.org
    
    BTW: Your Reply-To is nonsensical. You may want to fix it.
    
    
    >>That Draft expired in 1995...
    >
    >It's still a W3C note (actually it's the earliest one),
    >linked from http://www.w3.org/TR. Anyway, what it says
    >didn't expire, it's still valid.
    
    But it's bad form to cite Internet-Drafts as anything but a "Work in
    Progress" and it's bad form to cite an expired draft at all. :-)
    
    I'd suggest you reissue it as a W3C Note only and remove the Expiration
    date (and fix the formatting to match the W3C style guide) if it's not
    intended to expire.
    
    
    >- Make sure that only the legal (according to IETF registry)
    >   charsets get through. Probably introducing another config file,
    >   which contains a list and a mapping to the corresponding iconv
    >   parameter values (also getting rid of the 'windows-xxxx' hack).
    
    This is in theory a PITA to manage, but may work fine in practice as the
    number of distinct charsets is now diminishing rather than increasing. Once
    validator.w3.org moves to glibc>2.2, and that config is updated, it may
    well be zero-maintenance in practice. I'm still a bit worried about that
    tho'!
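    
    A minimal sketch of the allow-list idea, written in Python purely for
    illustration (the validator itself is Perl). The table entries, the
    converter names on the right-hand side, and the function name are all
    assumptions, not the validator's actual config:

```python
# Hypothetical allow-list: registered charset label -> converter name.
# Entries and right-hand names are illustrative assumptions only.
CHARSET_MAP = {
    "utf-8": "UTF-8",
    "iso-8859-1": "ISO-8859-1",
    "us-ascii": "ASCII",
    "shift_jis": "SHIFT_JIS",
    "windows-1252": "CP1252",
}


def converter_name(label):
    """Return the converter name for a declared charset, or None if not listed."""
    # Charset labels are case-insensitive and sometimes arrive quoted.
    key = label.strip().strip('"').lower()
    return CHARSET_MAP.get(key)


if __name__ == "__main__":
    for declared in ("ISO-8859-1", " Windows-1252 ", "x-made-up"):
        print(declared.strip(), "->", converter_name(declared))
```

    Anything that does not appear in the table is rejected outright, so an
    explicit entry would replace any pattern-based special-casing of
    'windows-xxxx' names.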
    
    
    >- Make sure that only the byte sequences legal in an encoding
    >   are accepted. (including the top item on the todo list)
    
    I've been wanting to do this but 1) I haven't found any good ways to do it
    and 2) I have yet to see a good definition of "valid" and 3) I have yet to
    find a general way to determine what byte sequences are valid in a given
    charset outside the most basic tests for n>=0xFF. Maintaining tables for
    this would be such an utter pain I wouldn't even consider it. If you can
    pull this one off I'll be in awe of you for the next millennium or so. :-)
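    
    One way to avoid maintaining tables by hand might be to lean on an
    existing decoder and treat "valid" as "the strict decoder accepts it".
    That is a swapped-in definition, not an agreed one, and the sketch below
    uses Python's built-in codecs for illustration rather than iconv. The
    function name is made up:

```python
def first_invalid_byte(data, encoding):
    """Return the offset of the first byte sequence the decoder rejects, or None."""
    try:
        # Strict mode raises instead of silently replacing bad sequences.
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    print(first_invalid_byte(b"abc", "utf-8"))
    print(first_invalid_byte(b"ab\xff", "utf-8"))
    print(first_invalid_byte(b"caf\xe9", "ascii"))
```

    This only helps for encodings where some byte sequences are actually
    undecodable. A single-byte charset such as ISO-8859-1 maps every byte to
    something, so the check always passes there and the question of what
    "valid" means stays open.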
    
    
    >- <meta ... charset over multiple lines.
    
    I've been meaning to take *all* that code out back and shoot it for a while
    now. It's been postponed because it's rather drastic and needs some serious
    testing to avoid snafus and I'm desperately short on time ATM. The New Deal
    is to use HTML::Parser for all such tasks (i.e. DOCTYPE sniffing and such).
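    
    To illustrate why a real parser makes the multi-line case a non-issue,
    here is a rough sketch using Python's standard html.parser module as a
    stand-in for HTML::Parser. The class and function names are invented for
    the example, and it only looks at the http-equiv form of the declaration:

```python
from html.parser import HTMLParser


class MetaCharsetSniffer(HTMLParser):
    """Remember the charset from the first <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        content = attributes.get("content") or ""
        # e.g. "text/html; charset=iso-8859-1", possibly with embedded newlines
        for part in content.split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value:
                self.charset = value.strip().strip("\"'")


def sniff_meta_charset(markup):
    """Return the charset declared in a meta tag, or None if there is none."""
    sniffer = MetaCharsetSniffer()
    sniffer.feed(markup)
    return sniffer.charset
```

    Because the parser tokenizes the whole tag before the handler runs, line
    breaks between or inside attributes never reach the matching logic.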
    
    
    >- Allow to overwrite the charset from the validator form
    
    That's also on the list, mainly to enable
    <URL:http://validator.w3.org:8001/fragment-upload.html> to work (I also
    need a "Content-Type" override there I think).
    
    
    >- Picking up some frequent error patterns (in particular the
    >   error patterns from wrong charsets) and sending more specific
    >   error messages.
    
    Improving the error messages -- mostly just snarfed raw from Scott Bigham's
    originals -- is a good idea in general. Oh, and a good place to start for
    those who want to contribute but don't necessarily have the skills to write
    code!
    
    
    >Is your approach to remove all actual text from the 'check' script
    >(and e.g. giving each message a number)?
    
    A name. You can see the prototype code at
    <URL:http://www.tss.no/~link/dist/val.tar.gz>. Most text is static or needs
    to be looked up out of a database in any case. You then have a config file
    that maps a generic name (e.g. "validation_results") to a filename on disk.
    You then keep separate dirs for each language (ISO coded language names as
    the directory names) and substitute on the fly.
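    
    Roughly, the lookup could work like the sketch below (Python for
    illustration only; the prototype is Perl). The table contents, the
    filenames, and the fall-back-to-English behaviour are assumptions added
    for the example:

```python
import os

# Hypothetical config: generic template name -> filename on disk.
TEMPLATE_FILES = {
    "validation_results": "results.tmpl",
    "fatal_error": "fatal.tmpl",
}


def template_path(root, language, name, default_language="en",
                  exists=os.path.isfile):
    """Return <root>/<language>/<file>, falling back to the default language."""
    filename = TEMPLATE_FILES[name]
    for lang in (language, default_language):
        candidate = os.path.join(root, lang, filename)
        if exists(candidate):
            return candidate
    raise FileNotFoundError(f"no template {name!r} for {language!r}")
```

    The `exists` parameter is only there so the lookup can be exercised
    without touching the disk.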
    
    
    >print template_lookup (<< "EOF");
    >   English text goes here.
    >EOF
    
    Some of the point is to get rid of inline HTML because it's ugly and
    unmaintainable. HTML::Template gives you loops and variable substitutions
    so you just stage all your variable data (say, put all (looked up) error
    messages in a list) and then run it through the template and return the
    result. Your template then resides in a file on disk and looks something
    like:
    
        <TMPL_INCLUDE NAME="HTML_header.tmpl">
        Here are the results of... etc.
        <TMPL_LOOP NAME="errors">
          Error on line <TMPL_VAR NAME="line">, column <TMPL_VAR NAME="column">, blah blah.
        </TMPL_LOOP>
        <TMPL_INCLUDE NAME="HTML_footer.tmpl">
    
    And is easy for l10n people to localize. Instead of having a complicated
    system for looking up messages from a message catalog, you have l10n people
    make new templates -- that can take into account cultural differences as
    well if we have inappropriate symbology or something like that -- and can
    even enable "Braille" or "XML" or "Foo" languages. In particular, I was
    considering using this to give minimal XML output from the validator so you
    could use something XML-RPC/SOAP-ish to validate stuff and show results in
    a dedicated browser (Gnome frontend, or inline in a HTML editor).
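    
    A toy sketch of the "stage the data, then run it through a template"
    idea, with XML treated as just another "language". This uses Python's
    string.Template instead of HTML::Template, and the template text, the
    element names, and the error fields are all made up for the example:

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical template sets, one per "language".
TEMPLATES = {
    "en": {
        "header": "Here are the results:\n",
        "error": Template("Error on line $line, column $column: $message\n"),
        "footer": "",
    },
    "xml": {
        "header": "<results>\n",
        "error": Template(
            '  <error line="$line" column="$column">$message</error>\n'
        ),
        "footer": "</results>\n",
    },
}


def render_results(language, errors):
    """Run the staged list of errors through the chosen template set."""
    tmpl = TEMPLATES[language]
    # Only the XML output needs markup characters in messages escaped.
    quote = escape if language == "xml" else str
    rows = [
        tmpl["error"].substitute(
            line=err["line"], column=err["column"], message=quote(err["message"])
        )
        for err in errors
    ]
    return tmpl["header"] + "".join(rows) + tmpl["footer"]
```

    The code that gathers the errors never changes; only the template set
    does, which is the point of keeping the text out of the script.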
    
    
    >The (Accept-)Language value can be a global variable, can
    >be made part of the lookup if that is objectified, or can be an
    >additional parameter.
    
    We get the Accept-Language at "compile" time (when we get called) and
    select which template set (directory) to use based on it.
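    
    For illustration, picking the template directory from the header could
    look something like this (a simplified Python sketch; the function names
    and the "en" default are assumptions, and it ignores "*" and other
    corner cases of the header syntax):

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without a q-value counts as q=1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order of appearance.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def choose_language(header, available, default="en"):
    """Pick the best available template directory for this request."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # "en-gb" falls back to "en"
        if primary in available:
            return primary
    return default
```

    Here `available` would be the set of language directories that actually
    exist on disk.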
    
    
    >We can make the thing into a module; actually,
    >it would be nice if such a module existed;
    
    It /would/ be a nice module, regardless of the solution chosen for the
    validator. I've also considered factoring out generic facilities from
    "check" and making it into W3C::* modules to make reuse easier. In
    particular, I've considered bribing Hugo with large quantities of some
    beverage or other in order to get him to turn the brunt of checklink into a
    module (that "check" among others could use). :-)
    
    
    >(But maybe I'm thinking too quickly. I have used a similar approach
    >a few years ago in an object-oriented C++-based framework called ET++,
    >and I have again heard it suggested independently for web-based stuff
    >in a recent discussion, so I'm a bit excited :-).
    
    *grin*
    
    I think this approach is better suited for traditional applications than
    the Validator. We need a template system to get rid of the inline HTML and
    as it happens, the template system can also deal nicely with l10n. Adding
    an ET++ module would be overkill if not outright pointless. IMO,
    obviously...
    
    
    >>Further, I'd planned to investigate switching to OpenSP over jclark SP
    >>because it gives message numbers in addition to just a free text error
    >>message.
    >
    >Do you know whether OpenSP did something about the limitation of
    >characters to <66535 in SP?
    
    Yeah. OpenSP fixes most of the little niggling issues with SP AFAICT. It
    also supports more of Annex K, has a saner calling syntax, is more
    portable, and -- if adicarlo ever gets around to putting my *.rpms up on
    SF :-) -- comes as both rpm and deb to make it easier for folks to install
    it locally.
    
    (Note to self; build rpms of the validator)
    
    
    >Why not just store them as UTF-8 from the start? That would simplify
    >things, I think.
    
    For us, yes, but not for the l10n people. Good UTF-8 editors are few and
    far between ATM.
    
    
    >>convert to UTF-8 when read, and converted to Accept-Encoding preferred
    >>encoding on output to client.
    >
    >Converting to Accept-Encoding on output is an overall issue. I'm not sure
    >it's needed; if necessary, we could point to a converting proxy.
    
    We have the facility; why not use it?