Re: Better internationalization of validator from Martin Duerst on 2001-05-24 (www-validator@w3.org from May 2001)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 24 May 2001 09:35:27 +0900
To: Terje Bless <link@tss.no>
Cc: Gerald Oskoboiny <gerald@w3.org>, W3C Validator <www-validator@w3.org>
Message-Id: <4.2.0.58.J.20010524085829.03d58d10@sh.w3.mag.keio.ac.jp>
At 05:00 01/05/22 +0200, Terje Bless wrote:
>On 22.05.01 at 10:27, Martin Duerst <duerst@w3.org> wrote:

> >- Make sure that only the legal (according to IETF registry)
> >   charsets get through. Probably introducing another config file,
> >   which contains a list and a mapping to the corresponding iconv
> >   parameter values (also getting rid of the 'windows-xxxx' hack).
>
>This is in theory a PITA to manage, but may work fine in practice as the
>number of distinct charsets is now diminishing rather than increasing. Once
>validator.w3.org moves to glibc>2.2, and that config updated, it may well
>be zero-maintenance in practice. I'm still a bit worried about that tho'!

Well, I think it is important to distinguish between encodings and
labelings. As far as I understand, CP1252 in iconv is the same
as windows-1252 in the iana charset registry. So I was a bit
surprised about the comment that iconv doesn't support windows-1252.
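
The label/encoding distinction can be sketched as a small lookup table. This is only an illustration in Python rather than the validator's Perl; the table name, the function name, and the entries are hypothetical and not taken from the validator's configuration.

```python
# Hypothetical table: IANA charset label (lowercased) -> converter name.
# The entries are illustrative only, not the validator's real config file.
IANA_TO_CONVERTER = {
    "utf-8": "UTF-8",
    "iso-8859-1": "ISO-8859-1",
    "windows-1252": "CP1252",
    "euc-jp": "EUC-JP",
}


def converter_name(label):
    """Return the converter name for a charset label, or None if unlisted."""
    return IANA_TO_CONVERTER.get(label.strip().lower())


print(converter_name("Windows-1252"))  # CP1252
print(converter_name("x-made-up"))     # None
```

Labels missing from the table are rejected instead of being passed through to the converter, which is the point of the proposed config file.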

As for the maintenance overhead, I'm not worried, and I'm
ready to do it :-).


> >- Make sure that only the byte sequences legal in an encoding
> >   are accepted. (including the top item on the todo list)
>
>I've been wanting to do this but 1) I haven't found any good ways to do it

The conversion to UTF-8 should give you that, provided you catch the
right errors and use a converter that reports them. Text::Iconv should
be able to do that (http://www.perldoc.com/cpan/Text/Iconv.html#ERRORS),
though I haven't tested it yet.
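
The idea of using the conversion itself as the check can be sketched as follows. Python's built-in strict decoding stands in for Text::Iconv here, so this does not show that module's interface; the function name `check_byte_sequences` is hypothetical.

```python
def check_byte_sequences(data, encoding):
    """Attempt a strict decode of `data`; return (ok, detail).

    An unknown encoding name raises LookupError, which is deliberately
    not handled here.
    """
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        detail = "invalid byte sequence at offset %d: %s" % (err.start, err.reason)
        return False, detail
    return True, "ok"


# "café" encoded as UTF-8, then as ISO-8859-1 but labelled UTF-8.
print(check_byte_sequences(b"caf\xc3\xa9", "utf-8"))
print(check_byte_sequences(b"caf\xe9", "utf-8"))
```

The second call fails because the lone byte 0xE9 is not a complete UTF-8 sequence, which is the kind of error the validator would need to catch and report.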



>and 2) I have yet to see a good definition of "valid"

Good point.


> >- <meta ... charset over multiple lines.
>
>I've been meaning to take *all* that code out back and shoot it for a while
>now. It's been postponed because it's rather drastic and needs some serious
>testing to avoid snafus and I'm desperately short on time ATM. The New Deal
>is to use HTML::Parser for all such tasks (i.e. DOCTYPE sniffing and such).

Is there any good documentation available on HTML::Parser? If it does
a job similar to what the validator currently does, it may be okay. But
does it allow adding new doctypes, ...? At W3C, we pretty much need that :-).
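
What parser-based sniffing might look like can be sketched with Python's standard html.parser module. It is a stand-in only and says nothing about HTML::Parser's actual interface; the class name `Sniffer` and the sample document are hypothetical.

```python
import re
from html.parser import HTMLParser


class Sniffer(HTMLParser):
    """Record the first DOCTYPE declaration and the first <meta> charset."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.charset = None

    def handle_decl(self, decl):
        # `decl` is the text between "<!" and ">".
        if self.doctype is None and decl.upper().startswith("DOCTYPE"):
            self.doctype = decl

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        content = attributes.get("content") or ""
        match = re.search(r"charset\s*=\s*([^\s;\"']+)", content, re.I)
        if match:
            self.charset = match.group(1)


SAMPLE = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head>
<meta http-equiv="Content-Type"
      content="text/html;
               charset=Shift_JIS">
</head></html>"""

sniffer = Sniffer()
sniffer.feed(SAMPLE)
print(sniffer.doctype)
print(sniffer.charset)
```

In this sketch the parser only hands over the declaration text, so the list of recognized doctypes would stay in a separate table that can be extended without touching the parsing code. The line breaks inside the `<meta>` tag are absorbed by the tokenizer.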



> >Is your approach to remove all actual text from the 'check' script
> >(and e.g. giving each message a number)?
>
>A name. You can see the prototype code at
><URL:http://www.tss.no/~link/dist/val.tar.gz>. Most text is static or needs
>to be looked up out of a database in any case. You then have a config file
>that maps a generic name (e.g. "validation_results") to a filename on disk.
>You then keep separate dirs for each language (ISO coded language names as
>the directory names) and substitute on the fly.

Having a separate file for each message would make things quite tedious,
I guess. Having a name for the message sounds better than just a number,
but I think it's still quite tough for somebody to read the script
just with this. From my experience in the last two weeks, I can clearly
say that it was very helpful to have the texts inline. For somebody
who is very used to the software, that may not be that much of an
issue, but software should be written so that it can be read easily.


> >print template_lookup (<< "EOF");
> >   English text goes here.
> >EOF
>
>Some of the point is to get rid of inline HTML because it's ugly and
>unmaintainable.

A bit ugly maybe, yes. But having the text together with the code
also makes some maintenance tasks easier.
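
The inline-text approach quoted above, where the English text stays in the script and doubles as the lookup key, can be sketched like this. It is written in Python rather than Perl; the `CATALOGS` table and its single French entry are made up for illustration.

```python
# Hypothetical catalogs: language code -> {English text: translation}.
CATALOGS = {
    "fr": {"No errors found.": "Aucune erreur trouvée."},
}


def template_lookup(text, lang="en"):
    """Return the translation of `text`, or the English text itself."""
    key = text.strip()
    return CATALOGS.get(lang, {}).get(key, key)


# The English text remains readable where the message is emitted.
print(template_lookup("""
   No errors found.
""", "fr"))
```

A message without a translation falls back to the inline English, so the script keeps working while catalogs are incomplete.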


>HTML::Template gives you loops and variable substitutions
>so you just stage all your variable data (say, put all (looked up) error
>messages in a list) and then run it through the template and return the
>result. Your template then resides in a file on disk and looks something
>like:
>
>     <include "HTML_header.tmpl">
>     Here are the results of... etc.
>     <TMPL_LOOP @errors>
>       Error on line $_->[0], column $_->[1], blah blah.
>     </TMPL_LOOP>
>     <include "HTML_footer.tmpl">

Having had a look at e.g. http://himi.org/kbb/Template.html,
it looks like it's worth a try.



>And is easy for l10n people to localize. Instead of having a complicated
>system for looking up messages from a message catalog, you have l10n people
>make new templates -- that can take into account cultural differences as
>well if we have inappropriate symbology or something like that -- and can
>even enable "Braille" or "XML" or "Foo" languages. In particular, I was
>considering using this to give minimal XML output from the validator so you
>could use something XML-RPC/SOAP-ish to validate stuff and show results in
>a dedicated browser (Gnome frontend, or inline in a HTML editor).

Interesting. It will work as long as the error messages themselves
are independent of the page language, which they might not always be.
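
The "stage the data, then run it through a per-language template" idea can be sketched as below. Python's string.Template stands in for HTML::Template and has no loop construct, so the loop over errors is done in code. The template texts, the in-memory `TEMPLATES` table (the thread describes files in per-language directories), and the `render` function are all hypothetical.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical templates keyed by output "language".
TEMPLATES = {
    "en": {
        "page": Template("Results for $uri:\n$errors"),
        "error": Template("Error on line $line, column $column: $message"),
        "escape": str,  # plain text needs no escaping
    },
    "xml": {
        "page": Template("<results>\n<uri>$uri</uri>\n$errors\n</results>"),
        "error": Template('<error line="$line" column="$column">$message</error>'),
        "escape": escape,  # escape &, < and > in element content
    },
}


def render(uri, errors, lang="en"):
    """Render staged error data through the templates for `lang`."""
    chosen = TEMPLATES.get(lang, TEMPLATES["en"])  # fall back to English
    esc = chosen["escape"]
    rows = "\n".join(
        chosen["error"].substitute(
            line=e["line"], column=e["column"], message=esc(e["message"])
        )
        for e in errors
    )
    return chosen["page"].substitute(uri=esc(uri), errors=rows)


errors = [{"line": 3, "column": 7, "message": "end tag for <p> omitted"}]
print(render("http://example.org/", errors, "en"))
print(render("http://example.org/", errors, "xml"))
```

Note that the message text itself is passed through untranslated, which is exactly the case the caveat above is about: the templates localize the page, not the error messages.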


> >>Further, I'd planned to investigate switching to OpenSP over jclark SP
> >>because it gives message numbers in addition to just a free text error
> >>message.
> >
> >Do you know whether OpenSP did something about the limitation of
> >characters to <65536 in SP?
>
>Yeah. OpenSP fixes most of the little niggling issues with SP AFAICT. They
>also support more of Annex K, have saner calling syntax, are more portable,
>and -- if adicarlo ever gets around to putting my *.rpms up on SF :-) --
>comes as both rpm and deb to make it easier for folks to install it
>locally.

Annex K? Neither XML nor HTML has such an annex.


Regards,  Martin.
Received on Wednesday, 23 May 2001 21:15:04 UTC