Re: Better internationalization of validator from Martin Duerst on 2001-05-22 (www-validator@w3.org from May 2001)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 22 May 2001 10:27:57 +0900
To: Terje Bless <link@tss.no>
Cc: www-validator@w3.org, Gerald Oskoboiny <gerald@w3.org>
Message-Id: <4.2.0.58.J.20010522095919.03ead950@sh.w3.mag.keio.ac.jp>

Hello Terje,

Many thanks for your mail.

At 00:04 01/05/22 +0200, Terje Bless wrote:
>[ CC to Martin as I'm not sure he's on w-v. Martin? ]
>
>On 21.05.01 at 14:57, Martin Duerst <duerst@w3.org> wrote:
>
> >I have just committed a very small patch to the 'check' validating
> >script, just to change some terms ('character encoding' rather than
> >'character set'; see http://www.w3.org/MarkUp/html-spec/charset-harmful
> >for why we don't want to use the latter).
>
>That Draft expired in 1995...

It's still a W3C note (actually it's the earliest one),
linked from http://www.w3.org/TR. Anyway, what it says
didn't expire; it's still valid.


> >Over the weekend, I have had an extensive look at the validating
> >script, and I have various ideas for improvement in the area of
> >internationalization that I will work on in the next few days/
> >weeks. I'm looking forward to your comments/suggestions.
>
>Uhm, actually, I'd kinda like to see /your/ comments and suggestions. :-)

I have mostly looked at issues for the validator functionality itself,
like:

- Make sure that only the legal (according to IETF registry)
   charsets get through. Probably introducing another config file,
   which contains a list and a mapping to the corresponding iconv
   parameter values (also getting rid of the 'windows-xxxx' hack).
- Make sure that only the byte sequences legal in an encoding
   are accepted. (including the top item on the todo list)
- Dealing with cases such as UTF-16,...
- Handle <meta ... charset declarations spread over multiple lines.
- Allow overriding the charset from the validator form.
- Picking up some frequent error patterns (in particular the
   error patterns from wrong charsets) and sending more specific
   error messages.
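
The first two items, plus the multi-line <meta case, could be sketched
roughly as below. This is only an illustration under assumptions: the
table contents and the names %CHARSET_MAP, converter_name, decode_strictly
and sniff_meta_charset are made up for the example, and Perl's core Encode
module stands in for iconv, so the right-hand side of the table holds
Encode names rather than iconv parameter values.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical table (would live in a config file): registered charset
# label, lower-cased => name the converter expects. Anything not listed
# is rejected. An explicit entry replaces the 'windows-xxxx' special case.
my %CHARSET_MAP = (
    'utf-8'        => 'UTF-8',
    'us-ascii'     => 'ascii',
    'iso-8859-1'   => 'iso-8859-1',
    'windows-1252' => 'cp1252',
);

# Return the converter name for a charset label, or undef if not allowed.
sub converter_name {
    my ($label) = @_;
    return unless defined $label;
    return $CHARSET_MAP{ lc $label };
}

# Decode $bytes as $label. Returns ($text, undef) on success and
# (undef, $error) if the label is unknown or the bytes are not legal.
sub decode_strictly {
    my ($label, $bytes) = @_;
    my $shown = defined $label ? $label : '(none)';
    my $name  = converter_name($label)
        or return (undef, "unknown charset '$shown'");
    # FB_CROAK makes decode() die on the first malformed sequence.
    my $text = eval { decode($name, $bytes, FB_CROAK | LEAVE_SRC) };
    return (undef, "byte sequence not legal in $shown") unless defined $text;
    return ($text, undef);
}

# Find a charset label in a <meta ...> tag, even if the tag spans lines
# ([^>] also matches newlines).
sub sniff_meta_charset {
    my ($html) = @_;
    return $1 if $html =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?\s*([\w.:-]+)/i;
    return;
}

for my $case ( [ 'UTF-8', "caf\xC3\xA9" ], [ 'UTF-8', "caf\xE9" ],
               [ 'x-bogus', "abc" ] ) {
    my ($text, $error) = decode_strictly(@$case);
    print "$case->[0]: ", ( defined $error ? "rejected ($error)" : "ok" ), "\n";
}
```

The more specific error messages mentioned in the last item could hang off
the two failure cases that decode_strictly distinguishes.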


>My thinking up to now on how to deal with these issues has been to split
>out UI elements into Templates and select a template set based on
>Accept-Language or similar means. I even have a working prototype of this
>-- that implements only en_us, but has hooks for more -- which, while badly
>out of sync with W3C code, can reasonably easily be merged back in.

I have thought about this a bit, but not too much.
That's nice because we don't have too much overlap.

But it is definitely something that would be very nice.
Is your approach to remove all actual text from the 'check' script
(e.g. by giving each message a number)?
I would prefer the following approach:

To replace things such as

print << "EOF";
   English text goes here.
EOF

with something like

print template_lookup (<< "EOF");
   English text goes here.
EOF

This has to be thought through with respect to Perl syntax, variable
substitution, and so on, but it keeps the actual script much more
readable. The (Accept-)Language value can be a global variable, can
be part of the lookup if that is turned into an object, or can be an
additional parameter. We can make the thing into a module; actually,
it would be nice if such a module already existed; if not, we should
make ours available to others.
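
A minimal sketch of what such a lookup might look like, using the global
variable variant. The catalogue contents, the %CATALOGUE and $LANGUAGE
names, and the fallback behaviour are assumptions for illustration; only
the name template_lookup comes from the text above.

```perl
use strict;
use warnings;

# Hypothetical catalogue: language tag => { English text => translation }.
# The English text itself is the key, so the script stays readable.
my %CATALOGUE = (
    fr => {
        "This page is valid.\n" => "Cette page est valide.\n",
    },
);

# Assumed to be derived from the Accept-Language header in a real script.
our $LANGUAGE = 'en';

# Return the translation of $english for $lang (default: $LANGUAGE),
# falling back to the English text when no translation is known.
sub template_lookup {
    my ($english, $lang) = @_;
    $lang = $LANGUAGE unless defined $lang;
    my $table = $CATALOGUE{$lang} or return $english;
    return exists $table->{$english} ? $table->{$english} : $english;
}

$LANGUAGE = 'fr';
print template_lookup(<< "EOF");
This page is valid.
EOF
```

Note that the key is the here-document after interpolation, so a message
containing variables would produce a different key on each call. That is
the variable substitution question; one way around it would be to look up
a fixed template and fill in placeholders afterwards.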

(But maybe I'm thinking too quickly. I used a similar approach
a few years ago in an object-oriented C++-based framework called ET++,
and I have again heard it suggested independently for web-based stuff
in a recent discussion, so I'm a bit excited :-).


>Further, I'd planned to investigate switching to OpenSP over jclark SP
>because it gives message numbers in addition to just a free text error
>message.

Do you know whether OpenSP did something about the limitation of characters
to <65536 in SP?

>This enables use of a Template lookup for error reporting as well
>using the same Template mechanism.
>
>The coup d'etat would be to store these templates in native encoding,

Why not just store them as UTF-8 from the start? That would simplify things,
I think.


>convert to UTF-8 when read, and converted to Accept-Encoding preferred
>encoding on output to client.

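Converting to the client's preferred encoding on output (that preference
is carried by Accept-Charset; Accept-Encoding is about content codings
such as gzip) is an overall issue.
I'm not sure it's needed; if necessary, we could point to a converting proxy.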
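
For what it's worth, if templates are kept in UTF-8 and output conversion
turned out to be wanted, it might look roughly like this. It is a sketch
only: render_for_client is a hypothetical name, the preferred charset is
assumed to have been picked from the client's Accept-Charset header
already, and Perl's core Encode module does the conversion.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Take template bytes stored as UTF-8 and return ($bytes, $charset) for
# output. Falls back to UTF-8 when the preferred charset is unknown or
# cannot represent every character in the text.
sub render_for_client {
    my ($template_bytes, $preferred) = @_;
    my $text = decode('UTF-8', $template_bytes, FB_CROAK | LEAVE_SRC);
    my $out  = defined $preferred
        ? eval { encode($preferred, $text, FB_CROAK | LEAVE_SRC) }
        : undef;
    return ($out, $preferred) if defined $out;
    return (encode('UTF-8', $text), 'UTF-8');
}

my ($bytes, $charset) = render_for_client("caf\xC3\xA9\n", 'iso-8859-1');
print "sending ", length($bytes), " bytes as $charset\n";
```

The fallback to UTF-8 keeps the page correct even when the requested
charset cannot hold the text, which is one reason a converting proxy may
be enough.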


Regards,    Martin.
Received on Monday, 21 May 2001 21:41:14 UTC