Re: validator.w3.org and utf-8 (fwd) from Gerald Oskoboiny on 1998-08-07 (www-validator@w3.org from August 1998)

From: Gerald Oskoboiny <gerald@w3.org>
Date: Fri, 7 Aug 1998 01:50:28 -0400 (EDT)
To: www-validator@w3.org
Message-ID: <Pine.SOL.3.96.980807014607.29377A-100000@anansi.w3.org>

Here are some further ideas on the i18n issues from Martin Dürst:

> As for I18N bugs, the problem is that the validator, by assuming the
> input is something like iso-8859-1, works on bytes instead of working
> on characters, and ignores any character encoding issues. For some
> encodings (those where bytes in the 0x00-0x7F range always represent
> ASCII characters), this works. For the other encodings, among them
> two out of three Japanese encodings and the one used in Taiwan, this
> fails, because these can contain bytes that the validator sees as
> syntactically relevant characters such as "<", although they are
> part of two-byte sequences that denote Japanese or Chinese characters.
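
To make the failure mode above concrete, here is a small sketch in
Python (an assumption on my part; it says nothing about how the
validator itself is written). The helper name and the sample
characters are illustrative only. It encodes one non-ASCII character
per encoding and reports which of the resulting bytes fall in the
0x00-0x7F range, then shows a byte-level count of "<" going wrong:

```python
# Sketch: which encodings put ASCII-range bytes inside multi-byte characters?
# The helper name and sample characters are illustrative assumptions.

def ascii_range_bytes(text, encoding):
    """Return the bytes below 0x80 that appear when `text` is encoded."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)

# (encoding, one non-ASCII sample character)
SAMPLES = [
    ("euc_jp", "ぜ"),
    ("shift_jis", "ソ"),
    ("iso2022_jp", "ぜ"),
    ("big5", "許"),
]

for encoding, char in SAMPLES:
    leaked = ascii_range_bytes(char, encoding)
    print(f"{encoding:11} {char!r} -> {char.encode(encoding)!r} "
          f"ASCII-range bytes: {leaked!r}")

# A byte-oriented scan for markup sees more "<" bytes than there are
# "<" characters, because one byte of the Japanese character equals 0x3C.
document = "<p>ぜ</p>"
raw = document.encode("iso2022_jp")
print("'<' characters:", document.count("<"))
print("'<' bytes:     ", raw.count(b"<"))
```

EUC-JP keeps every byte of a multi-byte character at 0x80 or above,
which is why a byte-oriented parser gets away with it there.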
> 
> As a consequence, some truly valid HTML does not pass the validator.
> 
> To make this work, the following steps should be taken:
> 
> 1) Detect the character encoding ("charset") in the HTTP header or
>    the META construct (see
>    http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2).
>    This may involve "reloading" of the document if the META is used (the
>    browsers have to do this, too).
> 
> 2) Issue a warning and fall back to a safe assumption in case the
>    "charset" information is not present. This warning should point
>    to information such as the HTML 4.0 chapter above.
> 
> 3) Once the "charset" is known (or assumed), convert from that "charset"
>    to Unicode. During that conversion, issue warnings for stuff that is
>    not allowed in that "charset" (the most frequent case being CP 1252
>    declared as ISO-8859-1). It may be possible at this stage to take some
>    shortcuts (i.e. not really convert to Unicode, just move everything
>    to the 0x80-0xFF range that is not ASCII), but ideally, we wouldn't
>    do that. Conversion means that we need a conversion library. We will
>    need that also for other projects (the CSS validator should use it,
>    but probably doesn't, Amaya will need it,...). It probably makes sense
>    to work on a common W3C conversion library (if we don't find something
>    that meets our needs). I would want to contribute significantly to
>    such a library (have written one before), but would need some advice
>    in particular in the area of portability (Unix/Windows), connected
>    with dynamic linking (try to avoid that every process that needs
>    a (sometimes huge) conversion table loads it into memory separately).
>    Henrik / Daniel V., any comments?
> 
> 4) Have the validator itself run in Unicode (if we use UTF-8, that
>    may not be too difficult).
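
Steps 1 and 2 of the list above could look roughly like the following
Python sketch. Everything named here (detect_charset, META_RE,
FALLBACK_CHARSET, the 1024-byte scan window) is hypothetical, and the
iso-8859-1 fallback is only one possible "safe assumption"; the message
above does not say which one to use. Note that scanning the raw bytes
for META only works if the start of the document is ASCII-compatible.

```python
import re
from email.message import Message

# Hypothetical fallback; the original message does not name one.
FALLBACK_CHARSET = "iso-8859-1"
CHARSET_HELP = "http://www.w3.org/TR/REC-html40/charset.html#h-5.2.2"

# Loose pattern for <meta ... charset=...>; it assumes the document
# prefix is ASCII-compatible, which is itself a simplification.
META_RE = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)

def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased name, or None

def detect_charset(content_type, body):
    """Return (charset, warnings): HTTP header first, then META, then fallback."""
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset, []
    match = META_RE.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), []
    warning = (f"no charset declared; assuming {FALLBACK_CHARSET} "
               f"(see {CHARSET_HELP})")
    return FALLBACK_CHARSET, [warning]

if __name__ == "__main__":
    page = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=EUC-JP"></head></html>')
    print(detect_charset("text/html; charset=Shift_JIS", page))
    print(detect_charset("text/html", page))
    print(detect_charset(None, b"<html></html>"))
```

The HTTP header is checked first because the HTML 4.0 section cited
above gives it priority over META.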
> 
> I hope this is enough information for the next few steps. To help more,
> I would need more information on how exactly the validator is working.
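
For steps 3 and 4 in the quoted list, here is a rough Python sketch of
"convert to Unicode and warn along the way". The function name
to_unicode and the warning texts are made up for illustration, and
Python's bundled codecs stand in for the conversion library discussed
above. The CP 1252 check relies on the fact that bytes 0x80-0x9F are
control codes in ISO-8859-1 but printable characters in CP 1252.

```python
import codecs

def to_unicode(data, charset):
    """Decode `data` from `charset`; return (text, warnings)."""
    warnings = []
    try:
        text = data.decode(charset)
    except LookupError:
        # Unknown charset label: fall back (an assumption of this sketch).
        warnings.append(f"unknown charset {charset!r}; decoding as iso-8859-1")
        charset = "iso-8859-1"
        text = data.decode(charset)
    except UnicodeDecodeError as exc:
        warnings.append(
            f"byte 0x{data[exc.start]:02X} at offset {exc.start} "
            f"is not valid in {charset}"
        )
        text = data.decode(charset, errors="replace")

    # Most frequent case named above: CP 1252 declared as ISO-8859-1.
    if codecs.lookup(charset).name == "iso8859-1":
        c1_count = sum(1 for ch in text if "\x80" <= ch <= "\x9f")
        if c1_count:
            warnings.append(
                f"{c1_count} byte(s) in 0x80-0x9F: control codes in "
                "ISO-8859-1, document is probably CP 1252"
            )
    return text, warnings

if __name__ == "__main__":
    curly = "\u201cquoted\u201d".encode("cp1252")
    print(to_unicode(curly, "ISO-8859-1"))

    # Step 4: once decoded, re-encode as UTF-8 for the parser. In UTF-8
    # every byte of a non-ASCII character is 0x80 or above.
    text, _ = to_unicode("<p>ぜ</p>".encode("iso2022_jp"), "iso-2022-jp")
    print(text.encode("utf-8").count(b"<"))
```

Running the parser on the UTF-8 form means a byte equal to "<" is
always a real "<", which is what makes step 4 comparatively easy.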

-- 
Gerald Oskoboiny              <gerald@w3.org>  +1 617 253 2920
System Administrator, W3C     http://www.w3.org/People/Gerald/
World Wide Web Consortium, MIT Laboratory for Computer Science
545 Technology Square,  Room NE43-353  Cambridge MA  02139 USA

Received on Friday, 7 August 1998 01:50:05 UTC