Re: Better internationalization of validator

On 22.05.01 at 10:27, Martin Duerst <duerst@w3.org> wrote:

>From: Martin Duerst <duerst@w3.org>
>Reply-To: duerst@w3.org

BTW: Your Reply-To is nonsensical. You may want to fix it.


>>That Draft expired in 1995...
>
>It's still a W3C note (actually it's the earliest one),
>linked from http://www.w3.org/TR. Anyway, what it says
>didn't expire, it's still valid.

But it's bad form to cite Internet-Drafts as anything but a "Work in
Progress" and it's bad form to cite an expired draft at all. :-)

I'd suggest you reissue it as a W3C Note only and remove the Expiration
date (and fix the formatting to match the W3C style guide) if it's not
intended to expire.


>- Make sure that only the legal (according to IETF registry)
>   charsets get through. Probably introducing another config file,
>   which contains a list and a mapping to the corresponding iconv
>   parameter values (also getting rid of the 'windows-xxxx' hack).

This is in theory a PITA to manage, but may work fine in practice as the
number of distinct charsets is now diminishing rather than increasing. Once
validator.w3.org moves to glibc>2.2, and that config is updated, it may well
be zero-maintenance in practice. I'm still a bit worried about that, though!
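
Something like this, perhaps (an untested sketch; the file format and
the names are made up, not anything the validator actually ships with):

    # Hypothetical charset.cfg loader: one "iana-name iconv-name" pair
    # per line, e.g. "windows-1252 CP1252".
    use strict;

    sub load_charset_map {
        my ($file) = @_;
        my %map;
        open my $fh, '<', $file or die "Can't open $file: $!";
        while (<$fh>) {
            next if /^\s*(#|$)/;            # skip comments and blanks
            my ($iana, $iconv) = split ' ';
            $map{lc $iana} = $iconv;        # IANA names are case-insensitive
        }
        close $fh;
        return \%map;
    }

    my $map = load_charset_map('charset.cfg');
    my $iconv_charset = $map->{lc 'Windows-1252'}
        or die "Charset not in the IETF registry (or not in our config)\n";

Anything not in the file simply gets rejected, which covers the "only
legal charsets get through" requirement for free.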


>- Make sure that only the byte sequences legal in an encoding
>   are accepted. (including the top item on the todo list)

I've been wanting to do this but 1) I haven't found any good ways to do it
and 2) I have yet to see a good definition of "valid" and 3) I have yet to
find a general way to determine what byte sequences are valid in a given
charset outside the most basic tests for n>=0xFF. Maintaining tables for
this would be such an utter pain I wouldn't even consider it. If you can
pull this one off I'll be in awe of you for the next millennium or so. :-)
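
The closest thing to a general mechanism I can think of is to let iconv
itself be the judge: if it can't convert the bytes to UTF-8, they
weren't legal in the source charset. An untested sketch assuming
Text::Iconv (and trusting the underlying iconv to be strict about bad
input, which glibc's mostly is -- other implementations, who knows):

    # Sketch: conversion failure == illegal byte sequence (or an iconv
    # limitation; that's the catch).
    use Text::Iconv;

    sub bytes_legal_in {
        my ($charset, $bytes) = @_;
        my $conv = Text::Iconv->new($charset, 'UTF-8');
        return defined $conv->convert($bytes);   # undef on failure
    }

    # Overlong UTF-8 sequence; a strict iconv should reject it.
    print bytes_legal_in('UTF-8', "\xC0\xAF") ? "legal\n" : "illegal\n";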


>- <meta ... charset over multiple lines.

I've been meaning to take *all* that code out back and shoot it for a while
now. It's been postponed because it's rather drastic and needs some serious
testing to avoid snafus and I'm desperately short on time ATM. The New Deal
is to use HTML::Parser for all such tasks (i.e. DOCTYPE sniffing and such).
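
For the curious, the HTML::Parser version would go roughly like this (a
sketch only, using the 3.x handler API; the real code would need to be
more careful about http-equiv matching and such):

    # Sketch: pull charset= out of <meta http-equiv="Content-Type" ...>
    # with a real parser instead of multi-line regex hacks.
    use HTML::Parser;

    my $html = '<meta http-equiv="Content-Type"
                      content="text/html; charset=utf-8">';
    my $charset;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'meta';
            return unless lc($attr->{'http-equiv'} || '') eq 'content-type';
            $charset = $1
                if ($attr->{content} || '') =~ /charset\s*=\s*["']?([\w.:-]+)/i;
        }, 'tagname, attr'],
    );
    $p->parse($html);
    $p->eof;
    print "Declared charset: ", $charset || '(none)', "\n";

Note the meta element split across two lines gives the parser no
trouble at all, which is the whole point.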


>- Allow to overwrite the charset from the validator form

That's also on the list, mainly to enable
<URL:http://validator.w3.org:8001/fragment-upload.html> to work (I also
need a "Content-Type" override there I think).


>- Picking up some frequent error patterns (in particular the
>   error patterns from wrong charsets) and sending more specific
>   error messages.

Improving the error messages -- mostly just snarfed raw from Scott Bigham's
originals -- is a good idea in general. Oh, and it's a good place to start for
those who want to contribute but don't necessarily have the skills to write
code!


>Is your approach to remove all actual text from the 'check' script
>(and e.g. giving each message a number)?

A name. You can see the prototype code at
<URL:http://www.tss.no/~link/dist/val.tar.gz>. Most text is static or needs
to be looked up out of a database in any case. You then have a config file
that maps a generic name (e.g. "validation_results") to a filename on disk.
Then you keep separate dirs for each language (ISO language codes as
the directory names) and substitute on the fly.
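
In sketch form (the real prototype no doubt differs in the details):

    # Sketch: resolve a generic template name to a file for a language.
    use strict;

    my %template_map = (              # read from the config file
        validation_results => 'results.tmpl',
        fatal_error        => 'fatal.tmpl',
    );

    sub template_lookup {
        my ($name, $lang) = @_;
        my $file = $template_map{$name}
            or die "No template registered for '$name'\n";
        my $path = "templates/$lang/$file";            # per-language dirs
        $path = "templates/en/$file" unless -e $path;  # English fallback
        return $path;
    }

    print template_lookup('validation_results', 'de'), "\n";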


>print template_lookup (<< "EOF");
>   English text goes here.
>EOF

Some of the point is to get rid of inline HTML because it's ugly and
unmaintainable. HTML::Template gives you loops and variable substitutions
so you just stage all your variable data (say, put all (looked up) error
messages in a list) and then run it through the template and return the
result. Your template then resides in a file on disk and looks something
like:

    <include "HTML_header.tmpl">
    Here are the results of... etc.
    <TMPL_LOOP @errors>
      Error on line $_->[0], column $_->[1], blah blah.
    </TMPL_LOOP>
    <include "HTML_footer.tmpl">

And is easy for l10n people to localize. Instead of having a complicated
system for looking up messages from a message catalog, you have l10n people
make new templates -- that can take into account cultural differences as
well if we have inappropriate symbology or something like that -- and can
even enable "Braille" or "XML" or "Foo" languages. In particular, I was
considering using this to give minimal XML output from the validator so you
could use something XML-RPC/SOAP-ish to validate stuff and show results in
a dedicated browser (Gnome frontend, or inline in an HTML editor).
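
On the Perl side, feeding such a template is about this much work (a
sketch using HTML::Template's documented param()/output() interface;
the loop keys just have to match the template's TMPL_VARs):

    # Sketch: stage the variable data, run it through the template.
    use HTML::Template;

    my $tmpl = HTML::Template->new(
        filename => 'templates/en/results.tmpl',
    );
    $tmpl->param(
        errors => [   # one hashref per <TMPL_LOOP> iteration
            { line => 10, column => 4 },
            { line => 12, column => 1 },
        ],
    );
    print $tmpl->output;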


>The (Accept-)Language value can be a global variable, can
>be made part of the lookup if that is objectified, or can be an
>additional parameter.

We get the Accept-Language at "compile" time (when we get called) and
select which template set (directory) to use based on it.
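
Roughly like so (naive parsing that ignores q-values; real header
handling would want to be more rigorous):

    # Sketch: pick the first Accept-Language entry we have templates
    # for, falling back to English.
    use strict;

    my %supported = map { $_ => 1 } qw(en de fr ja);  # our template dirs

    sub pick_language {
        my ($header) = @_;          # e.g. "de-DE,de;q=0.9,en;q=0.5"
        for my $item (split /\s*,\s*/, $header || '') {
            my ($tag) = split /;/, $item;   # strip any ;q=... parameter
            $tag = lc $tag;
            return $tag if $supported{$tag};
            $tag =~ s/-.*//;                # primary subtag: de-DE -> de
            return $tag if $supported{$tag};
        }
        return 'en';
    }

    print pick_language($ENV{HTTP_ACCEPT_LANGUAGE}), "\n";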


>We can make the thing into a module; actually,
>it would be nice if such a module existed;

It /would/ be a nice module, regardless of the solution chosen for the
validator. I've also considered factoring out generic facilities from
"check" and making it into W3C::* modules to make reuse easier. In
particular, I've considered bribing Hugo with large quantities of some
beverage or other in order to get him to turn the brunt of checklink into a
module (that "check" among others could use). :-)


>(But maybe I'm thinking too quickly. I have used a similar approach
>a few years ago in an object-oriented C++-based framework called ET++,
>and I have again heard it suggested independently for web-based stuff
>in a recent discussion, so I'm a bit excited :-).

*grin*

I think this approach is better suited for traditional applications than
the Validator. We need a template system to get rid of the inline HTML and
as it happens, the template system can also deal nicely with l10n. Adding
an ET++ module would be overkill if not outright pointless. IMO,
obviously...


>>Further, I'd planned to investigate switching to OpenSP over jclark SP
>>because it gives message numbers in addition to just a free text error
>>message.
>
>Do you know whether OpenSP did something about the limitation of
>characters to <66535 in SP?

Yeah. OpenSP fixes most of the little niggling issues with SP AFAICT. They
also support more of Annex K, have saner calling syntax, are more portable,
and -- if adicarlo ever gets around to putting my *.rpms up on SF :-) --
come as both rpm and deb to make it easier for folks to install it
locally.

(Note to self; build rpms of the validator)


>Why not just store them as UTF-8 from the start? That would simplify
>things, I think.

For us, yes, but not for the l10n people. Good UTF-8 editors are few and
far between ATM.


>>convert to UTF-8 when read, and converted to Accept-Encoding preferred
>>encoding on output to client.
>
>Converting to Accept-Encoding on output is an overall issue. I'm not sure
>it's needed; if necessary, we could point to a converting proxy.

We have the facility; why not use it?
